The Impact of SMOTE on Random Forest Classifier Performance on Imbalanced Data

Erlin Erlin; Yenny Desnelita; Nurliana Nasution; Laili Suryati; Fransiskus Zoromi

doi:10.30812/matrik.v21i3.1726

Authors

Erlin Erlin, Institut Bisnis dan Teknologi Pelita Indonesia
Yenny Desnelita, Institut Bisnis dan Teknologi Pelita Indonesia
Nurliana Nasution, Universitas Lancang Kuning
Laili Suryati, Universitas Persada Indonesia
Fransiskus Zoromi, STMIK Amik Riau

DOI:

https://doi.org/10.30812/matrik.v21i3.1726

Keywords:

Imbalanced Data, Machine Learning, Overfitting, Random Forest, SMOTE

Abstract

Datasets encountered in machine learning applications commonly show some degree of class imbalance, ranging from mild and moderate to extreme. Most machine learning models trained on imbalanced data are biased: they achieve high accuracy on the majority class and low accuracy on the minority class. This study evaluates the impact of SMOTE (Synthetic Minority Oversampling Technique) on a Random Forest classifier for predicting heart disease. A dataset of 299 records from the UCI Machine Learning Repository was used to build a prediction model with 12 independent variables and 1 dependent variable. The minority class in the training dataset was oversampled with SMOTE. The model was evaluated not only with Accuracy and Precision but also with alternative performance measures, namely Sensitivity, F1-score, Specificity, G-Mean, and Youden's Index, which are better suited to imbalanced data. The results show that SMOTE reduced overfitting while improving the performance of the Random Forest model on every indicator: Accuracy improved by 3.45%, Precision by 4.8%, Sensitivity by 7.1%, F1-score by 4.8%, Specificity by 2.1%, G-Mean by 4.4%, and Youden's Index by 6.3%. The study demonstrates that, when building a classifier with a machine learning algorithm such as Random Forest, class skew in the data needs to be taken into account and balanced to obtain better performance.
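
The two technical ingredients named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: it shows only the interpolation step at the core of SMOTE (a synthetic sample placed on the segment between a minority-class sample and one of its minority-class neighbors) and the standard textbook definitions of the reported metrics. The function names, the toy points, and the confusion-matrix counts are hypothetical and chosen purely for illustration; neighbor search, the Random Forest itself, and the study's data are left out.

```python
import math
import random


def smote_point(x, neighbor, rng):
    """Return one synthetic sample on the segment between x and a minority-class neighbor."""
    gap = rng.random()  # interpolation factor in [0, 1)
    return [xi + gap * (ni - xi) for xi, ni in zip(x, neighbor)]


def imbalance_metrics(tp, fn, fp, tn):
    """Compute the abstract's metrics from binary confusion-matrix counts.

    Assumes every denominator is non-zero (both classes present, at least
    one positive prediction).
    """
    sensitivity = tp / (tp + fn)  # recall on the positive class
    specificity = tn / (tn + fp)  # recall on the negative class
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "sensitivity": sensitivity,
        "f1_score": f1,
        "specificity": specificity,
        "g_mean": math.sqrt(sensitivity * specificity),
        "youden_index": sensitivity + specificity - 1,
    }


# Hypothetical inputs, not taken from the study.
synthetic = smote_point([1.0, 2.0], [3.0, 6.0], random.Random(0))
metrics = imbalance_metrics(tp=20, fn=10, fp=5, tn=55)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

G-Mean and Youden's Index both combine sensitivity and specificity, so a model that ignores the minority class scores poorly on them even when its accuracy looks high, which is why the abstract treats them as better suited to imbalanced data.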


