The Application of Repeated SMOTE for Multi Class Classification on Imbalanced Data

  • Muhammad Ibnu Choldun Rachmatullah, Politeknik Pos Indonesia
Keywords: Imbalanced data, Oversampling, Classification, Multi Class, Repeated SMOTE

Abstract

Imbalanced data is a problem that classifier algorithms often face. One recommended data-level remedy is to balance the number of samples across classes by enlarging the minority class (oversampling), and one such method is the Synthetic Minority Oversampling Technique (SMOTE). SMOTE is commonly used to balance data consisting of two classes. In this research, SMOTE was used to balance multi-class data by applying it repeatedly. This iterative process is needed when more than one class is underrepresented, because a single SMOTE pass is only suitable for binary classification or for data in which only one class is underrepresented. To evaluate repeated SMOTE, the balanced datasets were classified using a neural network, k-NN, Naive Bayes, and Random Forest, and performance was measured in terms of accuracy, sensitivity, and specificity. The experiment used the Glass Identification dataset, which has six classes, and the SMOTE process was repeated five times. The best performance was achieved by the Random Forest classifier, with accuracy = 86.27%, sensitivity = 86.18%, and specificity = 95.82%. The experimental results show that repeated SMOTE can improve classification performance.
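
The repeated procedure described in the abstract can be sketched as follows. This is a minimal illustration and not the author's implementation: it assumes numeric features, Euclidean distance, and one SMOTE pass per minority class until each class matches the majority class size. The function names (smote_once, repeated_smote), the parameter k, and the toy class sizes are hypothetical and are not taken from the paper or from the Glass Identification dataset.

```python
import random
from collections import Counter


def smote_once(X, y, minority, n_new, k=5, rng=None):
    """Create n_new synthetic samples for class `minority` by interpolation."""
    rng = rng or random.Random(0)
    pts = [x for x, label in zip(X, y) if label == minority]
    if len(pts) < 2:
        raise ValueError("SMOTE needs at least two samples in the minority class")
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(pts)
        # Rank the other same-class samples by squared Euclidean distance.
        others = [p for p in pts if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position on the segment from base to neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


def repeated_smote(X, y, k=5, seed=0):
    """Run one SMOTE pass per minority class (assumed balancing target:
    the size of the largest class). Returns the data and the pass count."""
    rng = random.Random(seed)
    X, y = [list(x) for x in X], list(y)
    counts = Counter(y)
    target = max(counts.values())
    passes = 0
    for label, count in sorted(counts.items(), key=lambda item: item[1]):
        if count == target:
            continue  # the majority class needs no oversampling
        new = smote_once(X, y, label, target - count, k, rng)
        X.extend(new)
        y.extend([label] * len(new))
        passes += 1
    return X, y, passes


# Toy data with six classes of made-up sizes (not the Glass dataset).
sizes = {"A": 12, "B": 8, "C": 5, "D": 4, "E": 3, "F": 6}
demo_rng = random.Random(42)
X_demo, y_demo = [], []
for offset, (label, n) in enumerate(sizes.items()):
    for _ in range(n):
        X_demo.append([offset + demo_rng.random(), offset + demo_rng.random()])
        y_demo.append(label)

X_bal, y_bal, passes = repeated_smote(X_demo, y_demo, k=3)
print(passes, dict(Counter(y_bal)))
```

With six classes and a single majority class, the sketch performs five passes, which is consistent with the five repetitions reported for the six-class dataset. In practice, oversampling would normally be applied only to the training folds so that synthetic samples do not leak into the evaluation data.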

Published
2022-11-30
How to Cite
Rachmatullah, M. (2022). The Application of Repeated SMOTE for Multi Class Classification on Imbalanced Data. MATRIK : Jurnal Manajemen, Teknik Informatika Dan Rekayasa Komputer, 22(1), 13-24. https://doi.org/10.30812/matrik.v22i1.1803
Section
Articles