The Effect of Class Imbalance Handling on Datasets Toward Classification Algorithm Performance

  • Cherfly Kaope Universitas AMIKOM Yogyakarta
  • Yoga Pristyanto Universitas AMIKOM Yogyakarta
Keywords: Classification, Resampling, Imbalanced Class

Abstract

Class imbalance is a condition in which the minority class contains fewer samples than the majority class. Its main impact is the misclassification of minority-class instances, which degrades classification performance. Approaches to the problem include data-level methods, algorithm-level methods, and cost-sensitive learning. At the data level, one common method is resampling. In this study, the ADASYN, SMOTE, and SMOTE-ENN sampling methods were combined with the AdaBoost, K-Nearest Neighbor, and Random Forest classification algorithms to address class imbalance. The purpose of the study was to determine the effect of class imbalance handling on classification performance. Tests were carried out on five datasets, and the evaluation criteria were accuracy, precision, true positive rate, true negative rate, and g-mean score. Based on the classification results, the combination of ADASYN and Random Forest outperformed the other model schemes, giving results 5% to 10% better than the other models.
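
The five evaluation criteria named in the abstract can be illustrated with a minimal sketch for the binary case. It assumes the commonly used definition of the g-mean as the geometric mean of the true positive rate and the true negative rate, which the abstract does not spell out. The function name `binary_metrics` and the confusion-matrix counts are hypothetical and are not taken from the study.

```python
import math


def binary_metrics(tp, fn, fp, tn):
    """Compute evaluation metrics from binary confusion-matrix counts.

    The minority class is treated as the positive class (an assumption).
    Ratios with a zero denominator are reported as 0.0.
    """
    def ratio(num, den):
        return num / den if den else 0.0

    tpr = ratio(tp, tp + fn)  # true positive rate (sensitivity / recall)
    tnr = ratio(tn, tn + fp)  # true negative rate (specificity)
    return {
        "accuracy": ratio(tp + tn, tp + fn + fp + tn),
        "precision": ratio(tp, tp + fp),
        "tpr": tpr,
        "tnr": tnr,
        # Assumed definition: geometric mean of TPR and TNR.
        "g_mean": math.sqrt(tpr * tnr),
    }


# Hypothetical counts: 10 minority samples, 90 majority samples.
always_majority = binary_metrics(tp=0, fn=10, fp=0, tn=90)
balanced_model = binary_metrics(tp=8, fn=2, fp=5, tn=85)
print(always_majority)
print(balanced_model)
```

With these made-up counts, a classifier that always predicts the majority class still reaches an accuracy of 0.9 while its g-mean is 0, which is why accuracy alone can be misleading on imbalanced data and is reported alongside the other criteria.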

Published
2023-03-01
How to Cite
Kaope, C., & Pristyanto, Y. (2023). The Effect of Class Imbalance Handling on Datasets Toward Classification Algorithm Performance. MATRIK : Jurnal Manajemen, Teknik Informatika Dan Rekayasa Komputer, 22(2), 227-238. https://doi.org/10.30812/matrik.v22i2.2515