Handling Imbalance Data using Hybrid Sampling SMOTE-ENN in Lung Cancer Classification

Muhammad Abdul Latief; Luthfi Rakan Nabila; Wildan Miftakhurrahman; Saihun Ma'rufatullah; Henri Tantyoko

doi:10.30812/ijecsa.v3i1.3758

Muhammad Abdul Latief, Institut Teknologi Telkom Purwokerto
Luthfi Rakan Nabila, Institut Teknologi Telkom Purwokerto
Wildan Miftakhurrahman, Institut Teknologi Telkom Purwokerto
Saihun Ma'rufatullah, Institut Teknologi Telkom Purwokerto
Henri Tantyoko, Institut Teknologi Telkom Purwokerto

Keywords: Hybrid Sampling, Lung Cancer, Imbalance Data, Resampling, Random Forest

Abstract

Classification is one of the problems most commonly addressed with machine learning. When the classes in a dataset are imbalanced, machine learning models tend to favor the majority class. As a result, the model achieves high accuracy on the majority class and low accuracy on the minority class. Most datasets have roughly similar numbers of samples per class, but when the difference becomes too large, model performance diverges across classes. Data imbalance is also evident in the lung cancer data, which contain 283 positive samples and 38 negative samples. Therefore, this research aims to use a hybrid sampling technique that combines the Synthetic Minority Over-sampling Technique (SMOTE) with Edited Nearest Neighbors (ENN), together with Random Forest, to handle the class imbalance in lung cancer patient data. The method applies SMOTE-ENN as a preprocessing step to balance the data and uses Random Forest as the classifier to predict lung cancer, with training and testing data split by 10-fold cross-validation. The results show that SMOTE-ENN with Random Forest performs best on all metrics used, compared with SMOTE alone and with no oversampling. In conclusion, the SMOTE-ENN hybrid sampling technique combined with the Random Forest model significantly improves the model's ability to identify and classify the data.
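
The resampling step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it uses only the Python standard library, assumes a binary problem, and runs on synthetic two-dimensional points. Only the class counts (283 and 38) come from the abstract. The function names, the neighbor counts (k=5 for SMOTE, k=3 for ENN) and the majority-vote removal rule are assumptions chosen for illustration, and library implementations may use different defaults.

```python
import math
import random
from collections import Counter


def k_nearest(point, points, k, skip_index=None):
    """Return indices of the k points closest to `point` (Euclidean distance)."""
    order = sorted(
        (i for i in range(len(points)) if i != skip_index),
        key=lambda i: math.dist(point, points[i]),
    )
    return order[:k]


def smote(X, y, minority, k=5, rng=None):
    """Add interpolated minority points until both classes have equal size.

    Assumes a binary problem: every label other than `minority` is majority.
    """
    rng = rng or random.Random(0)
    minority_pts = [x for x, label in zip(X, y) if label == minority]
    n_needed = (len(X) - len(minority_pts)) - len(minority_pts)
    X_new, y_new = list(X), list(y)
    for _ in range(max(n_needed, 0)):
        i = rng.randrange(len(minority_pts))
        j = rng.choice(k_nearest(minority_pts[i], minority_pts, k, skip_index=i))
        gap = rng.random()  # position along the segment between the two points
        a, b = minority_pts[i], minority_pts[j]
        X_new.append(tuple(ai + gap * (bi - ai) for ai, bi in zip(a, b)))
        y_new.append(minority)
    return X_new, y_new


def edited_nearest_neighbours(X, y, k=3):
    """Drop samples whose label disagrees with the majority vote of their k neighbours."""
    keep = []
    for i, point in enumerate(X):
        votes = Counter(y[j] for j in k_nearest(point, X, k, skip_index=i))
        if votes.most_common(1)[0][0] == y[i]:
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


# Synthetic stand-in data, NOT the lung cancer dataset.
rng = random.Random(42)
X = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(283)]
X += [(rng.gauss(2, 1), rng.gauss(2, 1)) for _ in range(38)]
y = [1] * 283 + [0] * 38

X_sm, y_sm = smote(X, y, minority=0, k=5, rng=rng)         # oversample the minority
X_res, y_res = edited_nearest_neighbours(X_sm, y_sm, k=3)  # clean noisy samples

print("original:", Counter(y))
print("after SMOTE:", Counter(y_sm))
print("after SMOTE-ENN:", Counter(y_res))
```

A classifier such as Random Forest would then be trained on the resampled data. In a cross-validated setup, resampling is normally applied only to each training fold so that synthetic points do not leak into the test fold.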
