Comparative Evaluation of Data Clustering Accuracy through Integration of Dimensionality Reduction and Distance Metric
DOI:
https://doi.org/10.30812/matrik.v24i3.5057
Keywords:
Clustering, Cluster Evaluation, Distance Metric, K-Means, Principal Component Analysis
Abstract
The primary issue in clustering analysis of multivariate data is low accuracy resulting from a mismatch between the distance metric used and the characteristics of the data. This study comprehensively evaluates the effect of eight distance metrics in the K-Means algorithm integrated with the Principal Component Analysis (PCA) dimensionality reduction technique. The data were first transformed into two principal components using PCA, and K-Means was then applied with each distance metric. Performance was evaluated using five internal metrics: Silhouette Score, Davies-Bouldin Index, Sum of Squared Errors (SSE), Calinski-Harabasz Index, and Dunn Index. The results show that the Bray-Curtis distance provides the best performance, with a Silhouette Score of 0.4291 and an SSE of 30.3673. It is followed by Euclidean and Minkowski, which yield the highest Calinski-Harabasz Index (2239.85) and the highest Dunn Index (0.0108), respectively. In contrast, the Hamming distance yielded the lowest performance across all evaluation metrics, with a Silhouette Score of 0.0000 and an SSE of 1996.00. An ANOVA test revealed significant differences between the distance metrics, with a p-value of <0.000 for all evaluation metrics, a result further supported by the Tukey HSD follow-up test. These findings confirm the importance of selecting an appropriate distance metric in the clustering process to ensure the validity, efficiency, and interpretability of multivariate data analysis results.
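The pipeline described in the abstract (PCA to two components, then K-Means under different distance metrics, then internal evaluation) can be illustrated with a minimal Python sketch. The abstract does not say how the non-Euclidean metrics were integrated into K-Means, so everything below is an assumption rather than the authors' implementation: the helper names `kmeans_with_metric` and `compare_metrics` are hypothetical, the data are synthetic (`make_blobs`), only the assignment step uses the chosen metric while centroids remain arithmetic means, the PCA scores are rescaled to a positive range so that the Bray-Curtis denominator cannot vanish, and only four of the eight metrics and two of the five evaluation measures are shown.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import MinMaxScaler


def kmeans_with_metric(X, k, metric, n_iter=100, seed=0):
    """Lloyd-style K-Means whose assignment step uses any scipy cdist metric.

    Assumption: centroids are still updated as arithmetic means, which is
    only the exact minimiser for the squared Euclidean distance.
    """
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign every point to its nearest centroid under the chosen metric.
        labels = cdist(X, centroids, metric=metric).argmin(axis=1)
        # Recompute centroids; an empty cluster keeps its previous centroid.
        updated = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    labels = cdist(X, centroids, metric=metric).argmin(axis=1)
    return labels, centroids


def compare_metrics(X, k, metrics):
    """Reduce X to two principal components, then cluster once per metric."""
    X2 = PCA(n_components=2).fit_transform(X)
    # Assumption: shift the scores to a strictly positive range so that
    # Bray-Curtis (sum|u - v| / sum|u + v|) never divides by zero.
    X2 = MinMaxScaler(feature_range=(0.1, 1.0)).fit_transform(X2)
    results = {}
    for metric in metrics:
        labels, centroids = kmeans_with_metric(X2, k, metric)
        # SSE: squared Euclidean distance of each point to its own centroid.
        sse = float(((X2 - centroids[labels]) ** 2).sum())
        # Silhouette is computed with the default Euclidean distance so the
        # scores are comparable across metrics (an assumption of this sketch).
        sil = float(silhouette_score(X2, labels))
        results[metric] = {"silhouette": sil, "sse": sse}
    return results


# Synthetic stand-in for the multivariate data set used in the study.
X, _ = make_blobs(n_samples=300, centers=3, n_features=5, random_state=0)
results = compare_metrics(
    X, k=3, metrics=["euclidean", "cityblock", "chebyshev", "braycurtis"]
)
for name, scores in results.items():
    print(f"{name:>10}  silhouette={scores['silhouette']:.4f}  sse={scores['sse']:.4f}")
```

Because the data and several design choices here are placeholders, the printed values are not expected to match the figures reported in the abstract; the sketch only shows where the distance metric enters the procedure and how two of the internal measures could be computed.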
License
Copyright (c) 2025 Paska Marto Hasugian, Devy Mathelinea, Siska Simamora, Pandi Barita Nauli Simangunsong

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.