The Application of Repeated SMOTE for Multi Class Classification on Imbalanced Data

Muhammad Ibnu Choldun Rachmatullah

doi:10.30812/matrik.v22i1.1803

Authors

Muhammad Ibnu Choldun Rachmatullah, Politeknik Pos Indonesia

DOI:

https://doi.org/10.30812/matrik.v22i1.1803

Keywords:

Imbalanced data, Oversampling, Classification, Multi Class, Repeated SMOTE

Abstract

A problem that classification algorithms often face is imbalanced data. One recommended data-level remedy is to balance the number of samples across classes by enlarging the minority class (oversampling); one such method is the Synthetic Minority Oversampling Technique (SMOTE). SMOTE is commonly used to balance data consisting of two classes. In this research, SMOTE was used to balance multi-class data by applying it repeatedly. This iterative process is needed when more than two classes are imbalanced, because a single SMOTE pass is suitable only for binary classification or for data in which only one class is imbalanced. To assess the performance of repeated SMOTE, the resampled datasets were classified using a neural network, k-NN, Naive Bayes, and Random Forest, and performance was measured in terms of accuracy, sensitivity, and specificity. The experiment used the Glass Identification dataset, which has six classes, and the SMOTE process was repeated five times. The best performance was achieved by the Random Forest classifier, with accuracy = 86.27%, sensitivity = 86.18%, and specificity = 95.82%. The experimental results show that repeated SMOTE can improve classification performance.
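
The idea of repeating SMOTE once per under-represented class can be illustrated with a minimal sketch. It is not the author's implementation: the abstract does not give the exact procedure, so the sketch assumes that each pass oversamples one class up to the size of the largest class, and that synthetic points are drawn by linear interpolation between a sample and one of its k nearest same-class neighbours. The function names `smote_for_class` and `repeated_smote`, the toy data, and the parameter defaults are all hypothetical.

```python
import math
import random
from collections import Counter


def smote_for_class(samples, n_new, k, rng):
    """Create n_new synthetic points by interpolating between a sample
    and one of its k nearest neighbours from the same class."""
    if len(samples) < 2:
        raise ValueError("SMOTE needs at least two samples in the class")
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(samples)
        # Rank the other samples of this class by Euclidean distance to base.
        others = sorted((s for s in samples if s is not base),
                        key=lambda s: math.dist(base, s))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def repeated_smote(X, y, k=5, seed=0):
    """Assumed scheme: one SMOTE pass per class that is smaller than the
    largest class, each pass raising that class to the largest class size."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    passes = 0
    # Process the smallest class first; classes already at target are skipped.
    for label, count in sorted(counts.items(), key=lambda item: item[1]):
        if count == target:
            continue
        originals = [x for x, lab in zip(X, y) if lab == label]
        new_points = smote_for_class(originals, target - count, k, rng)
        X_out.extend(new_points)
        y_out.extend([label] * len(new_points))
        passes += 1
    return X_out, y_out, passes


# Toy data (hypothetical): three classes with 6, 3, and 2 samples.
X = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.1, 0.1], [0.2, 0.3],
     [5.0, 5.0], [5.2, 5.1], [5.1, 5.3],
     [9.0, 0.0], [9.5, 0.5]]
y = ["a"] * 6 + ["b"] * 3 + ["c"] * 2

X_bal, y_bal, passes = repeated_smote(X, y, k=2)
print(sorted(Counter(y_bal).items()), passes)
```

Under this assumed scheme, a dataset with six classes in which one class is the largest would need five passes, which is consistent with the five repetitions reported for the Glass Identification dataset. In practice, resampling should be applied only to the training portion of the data so that synthetic points do not leak into the evaluation set.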


