Multiclass Text Classification of Indonesian Short Message Service Spam using Deep Learning Method and Easy Data Augmentation

  • Nurun Latifah, Universitas Mataram, Mataram, Indonesia
  • Ramaditia Dwiyansaputra, Universitas Mataram, Mataram, Indonesia
  • Gibran Satya Nugraha, Universitas Mataram, Mataram, Indonesia
Keywords: Easy Data Augmentation, Multiclass Classification, Short Message Service Spam, Text Classification

Abstract

The ease of using Short Message Service (SMS) has brought the problem of SMS spam: unsolicited and unwanted messages. Many studies have applied machine learning to build models that classify SMS spam. However, most of these studies rely on traditional methods, with limited exploration of deep learning approaches. Traditional methods are limited by their reliance on manual feature extraction, which deep learning does not require. Moreover, many of these studies address only binary classification rather than multiclass SMS classification, which can provide more detailed results. This research aims to analyze a deep learning model for multiclass Indonesian SMS spam classification with six categories and to assess the effectiveness of text augmentation in addressing the data imbalance that arises from the increased number of SMS categories. The methods used were the Indonesian version of Bidirectional Encoder Representations from Transformers (IndoBERT) and the Easy Data Augmentation (EDA) technique to address the dataset imbalance. The evaluation compares the performance of the IndoBERT model on the original dataset with its performance after applying EDA to enhance the representation of minority classes. The results show that IndoBERT achieves 91% accuracy in classifying SMS spam. Furthermore, EDA yields a significant improvement in F1-score, with an average 12% increase in minority classes. Overall model accuracy also improves to 93% after EDA is applied. This research concludes that IndoBERT is effective for multiclass SMS spam classification and that EDA is beneficial in handling imbalanced data, contributing to improved model performance.
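
To make the augmentation step concrete, the following is a minimal Python sketch of how EDA-style operations could be used to oversample minority classes. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say which EDA operations or parameters were used, so the sketch covers only random swap and random deletion (the two standard EDA operations that need no synonym dictionary). The function names, the deletion probability of 0.1, the balancing target, and the example labels are all hypothetical.

```python
import random
from collections import Counter


def random_swap(words, n_swaps, rng):
    """Swap two randomly chosen word positions, n_swaps times."""
    words = list(words)
    if len(words) < 2:
        return words
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words


def random_deletion(words, p, rng):
    """Drop each word with probability p, always keeping at least one word."""
    if len(words) <= 1:
        return list(words)
    kept = [w for w in words if rng.random() > p]
    return kept if kept else [rng.choice(words)]


def augment_minority(texts, labels, seed=0):
    """Append augmented copies until every class matches the largest class.

    Assumption: balancing to the majority-class size is an illustrative
    choice; the source does not state the target class sizes.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())

    # Group the original messages by class label.
    by_class = {}
    for text, label in zip(texts, labels):
        by_class.setdefault(label, []).append(text)

    out_texts, out_labels = list(texts), list(labels)
    for label, items in by_class.items():
        for _ in range(target - counts[label]):
            words = rng.choice(items).split()
            if rng.choice(["swap", "delete"]) == "swap":
                words = random_swap(words, 1, rng)
            else:
                words = random_deletion(words, 0.1, rng)  # hypothetical p
            out_texts.append(" ".join(words))
            out_labels.append(label)
    return out_texts, out_labels
```

In a setup like the one described, such a function would typically be applied to the training split only, so that the evaluation data keeps its original class distribution.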

Published
2024-07-06
How to Cite
Latifah, N., Dwiyansaputra, R., & Nugraha, G. S. (2024). Multiclass Text Classification of Indonesian Short Message Service Spam using Deep Learning Method and Easy Data Augmentation. MATRIK : Jurnal Manajemen, Teknik Informatika Dan Rekayasa Komputer, 23(3). https://doi.org/10.30812/matrik.v23i3.3835