Analysis of Preprocessing Technique Combinations and Hyperparameter Tuning for Building a Reliable Random Forest–Based Stroke Prediction Model

Authors

  • Aidina Ristyawan, Universitas Nusantara PGRI Kediri, Kediri, Indonesia
  • Arie Nugroho, Universitas Nusantara PGRI Kediri, Kediri, Indonesia

DOI:

https://doi.org/10.30812/ijecsa.v5i1.6080

Keywords:

Stroke prediction, preprocessing combinations, hyperparameter-tuning combinations, model fitting analysis, Random Forest

Abstract

Stroke is a major health threat that can cause permanent disability or death, yet its risk can be reduced through accurate early detection. Although the Random Forest algorithm is frequently used for stroke prediction, prior studies have often neglected model reliability, specifically the stability of performance between the training and testing phases. This research aims to develop a reliable stroke prediction model by applying the CRISP-DM methodology to a public dataset of 5,110 records. The proposed methodology comprehensively evaluates 48 combinations of preprocessing techniques (handling missing values in the BMI attribute, categorical transformation, feature scaling, and class balancing), followed by a two-stage hyperparameter optimization strategy: Randomized Search for broad exploration and Grid Search Refine for local refinement to ensure optimal stability. Model performance was evaluated using accuracy, precision, recall, and F1-score. The results demonstrate that hyperparameter tuning improved model performance by up to 38.80%. In addition, the hybrid balancing technique (SMOTETomek) did not consistently yield the most stable models in this specific case. The optimal model (Model No. 8) achieved a training accuracy of 0.925 and a testing accuracy of 0.877. With a minimal performance gap of 0.047 (below the 0.05 threshold), this model is classified as "good fitting," signifying strong generalization capability. Consequently, this model is recommended for implementation as a robust and trustworthy early-warning decision support system for medical professionals.
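The two-stage tuning strategy and the train/test gap check described in the abstract can be sketched in scikit-learn. This is a minimal illustration, not the authors' exact configuration: the parameter grids, search budget, and synthetic imbalanced data are assumptions, and the class-balancing step (e.g., SMOTETomek from the imbalanced-learn library) is omitted for brevity.

```python
# Hedged sketch of the two-stage search: Randomized Search for broad
# exploration, then Grid Search refinement around the best candidate.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import (GridSearchCV, RandomizedSearchCV,
                                     train_test_split)

# Stand-in for the imbalanced stroke dataset (the real one has 5,110 records).
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

# Stage 1: Randomized Search over a broad, assumed hyperparameter space.
broad_space = {
    "n_estimators": [100, 200, 300, 500],
    "max_depth": [None, 5, 10, 20],
    "min_samples_split": [2, 5, 10],
}
rand = RandomizedSearchCV(RandomForestClassifier(random_state=42),
                          broad_space, n_iter=10, cv=5, random_state=42)
rand.fit(X_train, y_train)

# Stage 2: Grid Search refinement in a narrow neighborhood of the best
# candidate (here only min_samples_split is varied, as an illustration).
best = rand.best_params_
refine_space = {
    "n_estimators": [best["n_estimators"]],
    "max_depth": [best["max_depth"]],
    "min_samples_split": sorted({max(2, best["min_samples_split"] - 1),
                                 best["min_samples_split"],
                                 best["min_samples_split"] + 1}),
}
grid = GridSearchCV(RandomForestClassifier(random_state=42),
                    refine_space, cv=5)
grid.fit(X_train, y_train)

# Fit diagnosis: a train-test accuracy gap below 0.05 is classed as
# "good fitting", mirroring the threshold used in the abstract.
train_acc = accuracy_score(y_train, grid.predict(X_train))
test_acc = accuracy_score(y_test, grid.predict(X_test))
gap = abs(train_acc - test_acc)
verdict = "good fitting" if gap < 0.05 else "over/underfitting"
print(f"train={train_acc:.3f} test={test_acc:.3f} gap={gap:.3f} -> {verdict}")
```

The two-stage design trades breadth for cost: the randomized stage samples a few points from a large space, and the grid stage exhaustively checks only a small neighborhood, which is far cheaper than a full grid search over all 48 combinations.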



Published

2026-03-02

How to Cite

[1]
A. Ristyawan and A. Nugroho, “Analysis of Preprocessing Technique Combinations and Hyperparameter Tuning for Building a Reliable Random Forest–Based Stroke Prediction Model”, IJECSA, vol. 5, no. 1, pp. 9–22, Mar. 2026, doi: 10.30812/ijecsa.v5i1.6080.