The Effect of Class Imbalance Handling on Datasets Toward Classification Algorithm Performance

Cherfly Kaope; Yoga Pristyanto

doi:10.30812/matrik.v22i2.2515

Authors

Cherfly Kaope Universitas AMIKOM Yogyakarta
Yoga Pristyanto Universitas AMIKOM Yogyakarta

Keywords:

Classification, Resampling, Imbalanced Class

Abstract

Class imbalance is a condition in which a dataset contains fewer instances of the minority class than of the majority class. Its main effect is that minority-class instances tend to be misclassified, which degrades classification performance. Several approaches have been proposed to address the problem, including data-level approaches, algorithm-level approaches, and cost-sensitive learning. At the data level, one common method is resampling. In this study, the ADASYN, SMOTE, and SMOTE-ENN sampling methods were each combined with the AdaBoost, K-Nearest Neighbor, and Random Forest classification algorithms. The purpose of the study was to determine how handling class imbalance in a dataset affects classification performance. Tests were carried out on five datasets, and the models were evaluated on accuracy, precision, true positive rate, true negative rate, and g-mean score. The combination of ADASYN and Random Forest gave better results than the other model schemes, performing 5% to 10% better.
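
The evaluation criteria named in the abstract can be illustrated with a minimal Python sketch. The abstract does not give formulas, so the code assumes the commonly used definitions for a binary problem, with g-mean taken as the square root of the product of the true positive rate and the true negative rate. The function name `binary_metrics` and the toy labels are hypothetical and are not taken from the study.

```python
import math


def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, TPR, TNR, and g-mean for binary labels.

    Assumes the usual confusion-matrix definitions; `positive` marks the
    minority (positive) class label.
    """
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive:
            if t == positive:
                tp += 1
            else:
                fp += 1
        else:
            if t == positive:
                fn += 1
            else:
                tn += 1

    def ratio(num, den):
        # Report 0.0 instead of dividing by zero when a denominator is empty.
        return num / den if den else 0.0

    tpr = ratio(tp, tp + fn)  # true positive rate (recall on the minority class)
    tnr = ratio(tn, tn + fp)  # true negative rate (recall on the majority class)
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": ratio(tp, tp + fp),
        "tpr": tpr,
        "tnr": tnr,
        "g_mean": math.sqrt(tpr * tnr),
    }


# Toy data (hypothetical): 8 majority samples (0) and 2 minority samples (1).
y_true = [0] * 8 + [1] * 2
# A classifier that always predicts the majority class.
y_pred = [0] * 10
scores = binary_metrics(y_true, y_pred)
```

On this toy data the always-majority classifier reaches an accuracy of 0.8 while its true positive rate and g-mean are both 0, which is why accuracy alone can hide minority-class misclassification on an imbalanced dataset.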
