Cluster Validity for Optimizing Classification Model: Davies Bouldin Index – Random Forest Algorithm
Abstract
Several factors affect the health and mortality rates of pregnant women, and the symptoms of disease in pregnant women are often similar. This makes it difficult to determine which factors place a pregnant woman at low, medium, or high risk of mortality. The purpose of this research is to generate classification rules for maternal health risk using optimal clusters, where the optimal cluster is selected through cluster validity. The methods used are K-Means clustering, the Davies-Bouldin Index (DBI), and the Random Forest algorithm. These methods build the optimum cluster from a set of k tests to produce the best classification. Because the optimal clusters comprise members with strong similarities in high-dimensional data, the Principal Component Analysis (PCA) technique is required to evaluate attribute values. The result is that the best classification rule was obtained from k = 22 on the 20th cluster, with 97% accuracy across the low-, mid-, and high-risk classes. The novelty lies in applying the DBI to the data that the Random Forest will classify. According to the research findings, the classification rules created through optimal clusters are 9.7% more accurate than those built without the clustering process. This demonstrates that optimizing the data groups improves the classification algorithm's performance.
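The pipeline the abstract describes (PCA for dimensionality reduction, a scan over k with K-Means scored by the Davies-Bouldin Index, then a Random Forest trained with the optimal clustering) can be sketched as below. This is a minimal illustration using scikit-learn on synthetic data, not the paper's maternal-health dataset or exact procedure; the feature construction and parameter choices are assumptions.

```python
# Hedged sketch: pick k by the lowest Davies-Bouldin Index (lower = better),
# then classify PCA-reduced data augmented with the cluster labels.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import davies_bouldin_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic stand-in for the maternal-health data: three risk groups.
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 6)) for c in (0, 3, 6)])
y = np.repeat([0, 1, 2], 100)  # low / mid / high risk labels

# PCA reduces the high-dimensional attributes before clustering.
X_pca = PCA(n_components=2).fit_transform(X)

# Scan candidate k values; the Davies-Bouldin Index rewards compact,
# well-separated clusters, so the minimum marks the optimal k.
scores = {
    k: davies_bouldin_score(
        X_pca, KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_pca)
    )
    for k in range(2, 10)
}
best_k = min(scores, key=scores.get)

# Append the optimal cluster labels as a feature for the Random Forest.
clusters = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(X_pca)
X_aug = np.column_stack([X_pca, clusters])
X_tr, X_te, y_tr, y_te = train_test_split(X_aug, y, random_state=0, stratify=y)
clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
print(f"best k = {best_k}, test accuracy = {clf.score(X_te, y_te):.2f}")
```

On well-separated synthetic blobs the DBI minimum recovers the true group count; on real data the scanned range of k and the choice to feed cluster labels into the classifier are design decisions the paper tunes empirically.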
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.