The Effect of Class Imbalance Handling on Datasets Toward Classification Algorithm Performance
DOI: https://doi.org/10.30812/matrik.v22i2.2515
Keywords: Classification, Resampling, Imbalanced Class
Abstract
Class imbalance is a condition in which the minority class contains far fewer samples than the majority class. Its main impact is that minority-class samples are misclassified, which degrades classification performance. Approaches to the problem fall into three groups: data-level approaches, algorithm-level approaches, and cost-sensitive learning. At the data level, one common method is resampling. In this study, the ADASYN, SMOTE, and SMOTE-ENN resampling methods were combined with the AdaBoost, K-Nearest Neighbor, and Random Forest classification algorithms. The purpose of the study was to determine the effect of class imbalance handling on classification performance. Tests were carried out on five datasets, and the models were evaluated on accuracy, precision, true positive rate, true negative rate, and g-mean score. The combination of ADASYN and Random Forest gave the best results, 5% to 10% better than the other model schemes.
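The five evaluation criteria named in the abstract can all be derived from the counts of a binary confusion matrix. The sketch below is illustrative only and is not the authors' code: the function name `imbalance_metrics`, the example counts, and the choice of the minority class as the positive class are assumptions, and the g-mean is taken in its usual form as the square root of the true positive rate times the true negative rate, which the abstract itself does not spell out.

```python
import math


def imbalance_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, TPR, TNR and g-mean from confusion counts.

    Assumption: the positive class is the minority class.
    Ratios with a zero denominator are reported as 0.0.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    tpr = tp / (tp + fn) if (tp + fn) else 0.0  # true positive rate (recall)
    tnr = tn / (tn + fp) if (tn + fp) else 0.0  # true negative rate (specificity)
    g_mean = math.sqrt(tpr * tnr)               # geometric mean of TPR and TNR
    return {
        "accuracy": accuracy,
        "precision": precision,
        "tpr": tpr,
        "tnr": tnr,
        "g_mean": g_mean,
    }


# Hypothetical counts for 1000 samples, 50 of them in the minority class.
# A model that always predicts the majority class:
always_majority = imbalance_metrics(tp=0, fp=0, tn=950, fn=50)
# A model that finds most minority samples at the cost of some false alarms:
balanced = imbalance_metrics(tp=40, fp=95, tn=855, fn=10)

for name, result in (("always_majority", always_majority), ("balanced", balanced)):
    print(name, {key: round(value, 3) for key, value in result.items()})
```

With these made-up counts, the always-majority model has the higher accuracy but a g-mean of zero, because it never detects a minority sample. This is why imbalance studies usually report the true positive rate, true negative rate, and g-mean alongside accuracy.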