Application of Soft-Clustering Analysis Using Expectation Maximization Algorithms on Gaussian Mixture Model
Soft clustering has received much less research attention than hard clustering. Soft-clustering algorithms are important for solving complex clustering problems. One soft-clustering method is the Gaussian Mixture Model (GMM), which assigns data points to clusters based on Gaussian distributions. This study aims to determine the number of clusters formed by the GMM method. The data are synthetic water quality indicators obtained from the Kaggle website. The analysis proceeds in stages: imputing Not Available (NA) values, if any; checking the data distribution; conducting a normality test; and standardizing the data. The parameters are then estimated with the Expectation Maximization (EM) algorithm, and the best number of clusters is the one with the largest value of the Bayesian Information Criterion (BIC). The results show that the best number of clusters for the synthetic water quality data is three. Cluster 1 consists of 1110 observations in the low-quality category, cluster 2 consists of 499 observations in the medium-quality category, and cluster 3 consists of 1667 observations in the high-quality (acceptable) category. The results suggest that the GMM method groups observations correctly when the variables used are generally normally distributed. The method can be applied to real data, whether the variables are all normally distributed or are a mixture of Gaussian and non-Gaussian variables.
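The later stages described above (standardization, EM estimation, and BIC-based choice of the number of clusters) can be illustrated with a minimal sketch. It is not the study's implementation: it assumes one-dimensional data, a toy sample drawn from three Gaussians rather than the Kaggle water quality data, and a simple quantile-based initialization. All function names (`standardize`, `fit_gmm_1d`, `bic`) are hypothetical. The BIC is written in the "larger is better" form, 2 log L minus (number of parameters) times log n, which matches the abstract's rule of picking the largest value; this sign convention is an assumption of the sketch.

```python
import math
import random


def standardize(xs):
    """Rescale to zero mean and unit (population) standard deviation."""
    n = len(xs)
    mean = sum(xs) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    return [(x - mean) / sd for x in xs]


def normal_pdf(x, mu, var):
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)


def e_step(xs, weights, mus, variances):
    """Return soft memberships (responsibilities) and the log-likelihood."""
    resp, loglik = [], 0.0
    for x in xs:
        dens = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, mus, variances)]
        total = max(sum(dens), 1e-300)  # guard against numerical underflow
        loglik += math.log(total)
        resp.append([d / total for d in dens])
    return resp, loglik


def fit_gmm_1d(xs, k, n_iter=300, tol=1e-9, min_var=1e-3):
    """Fit a k-component 1-D Gaussian mixture with EM (illustrative only)."""
    n = len(xs)
    ordered = sorted(xs)
    # Assumed initialization: means at evenly spaced quantiles, shared variance.
    mus = [ordered[int((j + 0.5) * n / k)] for j in range(k)]
    grand_mean = sum(xs) / n
    variances = [max(sum((x - grand_mean) ** 2 for x in xs) / n, min_var)] * k
    weights = [1.0 / k] * k
    prev = -math.inf
    for _ in range(n_iter):
        resp, loglik = e_step(xs, weights, mus, variances)
        if loglik - prev < tol:  # stop once the log-likelihood stops improving
            break
        prev = loglik
        # M-step: re-estimate weights, means, and variances from memberships.
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)
            weights[j] = nj / n
            mus[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var_j = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var_j, min_var)  # floor avoids collapsed components
    resp, loglik = e_step(xs, weights, mus, variances)
    return weights, mus, variances, resp, loglik


def bic(loglik, k, n):
    """BIC in the 'larger is better' form assumed by this sketch."""
    n_params = 3 * k - 1  # (k - 1) free weights + k means + k variances
    return 2.0 * loglik - n_params * math.log(n)


# Toy data (an assumption, not the study's data): three Gaussian groups.
rng = random.Random(0)
raw = ([rng.gauss(-4.0, 0.5) for _ in range(150)]
       + [rng.gauss(0.0, 0.5) for _ in range(100)]
       + [rng.gauss(5.0, 0.5) for _ in range(200)])
data = standardize(raw)

fits = {k: fit_gmm_1d(data, k) for k in range(1, 6)}
bic_by_k = {k: bic(fit[4], k, len(data)) for k, fit in fits.items()}
best_k = max(bic_by_k, key=bic_by_k.get)

print("BIC by k:", {k: round(v, 1) for k, v in bic_by_k.items()})
print("selected k:", best_k)
print("soft memberships of first point:", [round(p, 3) for p in fits[best_k][3][0]])
```

The last line prints the membership probabilities of a single observation across the fitted components, which is what distinguishes soft clustering from a hard assignment. On well-separated toy data like this, the BIC for three components should clearly exceed the values for one or two components. Real multivariate data would also need a covariance structure for each component, which this sketch omits.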
This work is licensed under a Creative Commons Attribution 4.0 International License.