The Impact of SMOTE on Random Forest Classifier Performance on Imbalanced Data
Abstract
Datasets with varying degrees of class imbalance, from mild through moderate to extreme, are common in machine learning applications. Most machine learning models trained on imbalanced data become biased: they achieve high accuracy on the majority class and low accuracy on the minority class. This study evaluates the impact of SMOTE (Synthetic Minority Oversampling Technique) on a Random Forest classifier for predicting heart disease. A dataset of 299 records from the UCI Machine Learning Repository was used to build the prediction model from 12 independent variables and 1 dependent variable. The minority class in the training set was oversampled with SMOTE. The model was evaluated not only with Accuracy and Precision but also with Sensitivity, F1-score, Specificity, G-Mean, and Youden's Index, which are better suited to imbalanced data. The results show that SMOTE reduced overfitting while improving the Random Forest model's performance on every metric: Accuracy rose by 3.45%, Precision by 4.8%, Sensitivity by 7.1%, F1-score by 4.8%, Specificity by 2.1%, G-Mean by 4.4%, and Youden's Index by 6.3%. These findings show that, when building a classifier with a machine learning algorithm such as Random Forest, class skew in the data should be taken into account and balanced to obtain better performance.
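The abstract names seven evaluation metrics without defining them. As a minimal sketch, the block below computes them from the four cells of a binary confusion matrix using their standard textbook definitions; the abstract does not state the exact formulas the authors used, so treat these as an assumption. The function name `imbalance_metrics` and the example counts are hypothetical and are not results from the study.

```python
import math


def imbalance_metrics(tp, fp, tn, fn):
    """Return common binary-classification metrics from confusion-matrix counts.

    Assumes the positive class is the minority class and that every
    denominator below is non-zero (no guard against empty classes).
    """
    sensitivity = tp / (tp + fn)          # true positive rate (recall)
    specificity = tn / (tn + fp)          # true negative rate
    precision = tp / (tp + fp)            # positive predictive value
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1_score = 2 * precision * sensitivity / (precision + sensitivity)
    g_mean = math.sqrt(sensitivity * specificity)   # geometric mean of the two rates
    youden_index = sensitivity + specificity - 1    # Youden's J statistic
    return {
        "accuracy": accuracy,
        "precision": precision,
        "sensitivity": sensitivity,
        "f1_score": f1_score,
        "specificity": specificity,
        "g_mean": g_mean,
        "youden_index": youden_index,
    }


# Hypothetical counts for illustration only (not taken from the paper).
metrics = imbalance_metrics(tp=15, fp=5, tn=35, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Sensitivity, Specificity, G-Mean, and Youden's Index each depend on the per-class rates rather than on the overall share of correct predictions, which is why they are less easily inflated by a large majority class than Accuracy is.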
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.