Multiclass Text Classification of Indonesian Short Message Service (SMS) Spam using Deep Learning Method and Easy Data Augmentation

  • Nurun Latifah, Universitas Mataram, Mataram, Indonesia
  • Ramaditia Dwiyansaputra, Universitas Mataram, Mataram, Indonesia
  • Gibran Satya Nugraha, Universitas Mataram, Mataram, Indonesia
Keywords: Easy Data Augmentation, Multiclass Classification, Short Message Service Spam, Text Classification

Abstract

The ease of use of Short Message Service (SMS) has given rise to SMS spam: unsolicited and unwanted messages. Many studies have applied machine learning methods to build models capable of classifying SMS spam. However, most of these studies still rely on traditional methods, with limited exploration of deep learning-based approaches. Traditional methods are limited in that they require manual feature extraction, whereas deep learning models learn features from the data. Moreover, many of these studies focus only on binary classification rather than multiclass SMS classification, which can provide more detailed results. The aim of this research is to analyze a deep learning model for multiclass Indonesian SMS spam classification with six categories and to assess the effectiveness of text augmentation in addressing the data imbalance that arises from the increased number of SMS categories. The research methods used were the Indonesian version of the Bidirectional Encoder Representations from Transformers (IndoBERT) model and the Easy Data Augmentation (EDA) technique, the latter to address the imbalanced dataset. The evaluation compares the performance of the IndoBERT model on the original dataset with its performance after applying EDA to enhance the representation of minority classes. The results show that IndoBERT achieves 91% accuracy in classifying SMS spam. Furthermore, EDA yields a significant improvement in F1-score, with an average increase of 12% in the minority classes. Overall model accuracy also improves to 93% after EDA is applied. This research concludes that IndoBERT is effective for multiclass SMS spam classification and that EDA is beneficial in handling imbalanced data, contributing to improved model performance.
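
As an illustration of how EDA can enlarge minority classes, the following is a minimal Python sketch of the four standard EDA operations (synonym replacement, random insertion, random swap, random deletion) used to oversample under-represented SMS categories. It is not the authors' implementation. The synonym dictionary, the label names, the `alpha` rate, and the choice to balance every class up to the majority count are all assumptions made for this example.

```python
import random
from collections import Counter


def synonym_replacement(words, n, synonyms, rng):
    """Replace up to n words that have an entry in the synonym dictionary."""
    words = list(words)
    candidates = [i for i, w in enumerate(words) if w in synonyms]
    rng.shuffle(candidates)
    for i in candidates[:n]:
        words[i] = rng.choice(synonyms[words[i]])
    return words


def random_insertion(words, n, synonyms, rng):
    """Insert synonyms of up to n randomly chosen words at random positions."""
    words = list(words)
    candidates = [w for w in words if w in synonyms]
    for _ in range(n):
        if not candidates:
            break
        new_word = rng.choice(synonyms[rng.choice(candidates)])
        words.insert(rng.randint(0, len(words)), new_word)
    return words


def random_swap(words, n, rng):
    """Swap two randomly chosen positions, n times."""
    words = list(words)
    if len(words) < 2:
        return words
    for _ in range(n):
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words


def random_deletion(words, p, rng):
    """Drop each word with probability p, always keeping at least one word."""
    if len(words) <= 1:
        return list(words)
    kept = [w for w in words if rng.random() > p]
    return kept if kept else [rng.choice(words)]


def augment(text, synonyms, rng, alpha=0.1):
    """Apply one randomly chosen EDA operation to a message (alpha is assumed)."""
    words = text.split()
    n = max(1, int(alpha * len(words)))
    op = rng.choice(["sr", "ri", "rs", "rd"])
    if op == "sr":
        words = synonym_replacement(words, n, synonyms, rng)
    elif op == "ri":
        words = random_insertion(words, n, synonyms, rng)
    elif op == "rs":
        words = random_swap(words, n, rng)
    else:
        words = random_deletion(words, alpha, rng)
    return " ".join(words)


def balance(texts, labels, synonyms, seed=0):
    """Oversample each class up to the majority count (an assumed strategy)."""
    rng = random.Random(seed)
    by_label = {}
    for text, label in zip(texts, labels):
        by_label.setdefault(label, []).append(text)
    target = max(len(items) for items in by_label.values())
    out_texts, out_labels = list(texts), list(labels)
    for label, items in by_label.items():
        for _ in range(target - len(items)):
            out_texts.append(augment(rng.choice(items), synonyms, rng))
            out_labels.append(label)
    return out_texts, out_labels


if __name__ == "__main__":
    # Hypothetical toy data; the labels are not the six categories of the study.
    texts = ["besok rapat jam sembilan"] * 4 + ["promo pulsa murah hari ini"]
    labels = ["normal"] * 4 + ["promo"]
    synonyms = {"murah": ["hemat"]}  # hypothetical one-entry thesaurus
    new_texts, new_labels = balance(texts, labels, synonyms)
    print(Counter(new_labels))
```

In a setup like this, augmentation would normally be applied to the training split only, so that the reported accuracy and F1-score are still measured on unmodified messages.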

Published
2024-06-29
How to Cite
Latifah, N., Dwiyansaputra, R., & Nugraha, G. S. (2024). Multiclass Text Classification of Indonesian Short Message Service (SMS) Spam using Deep Learning Method and Easy Data Augmentation. MATRIK : Jurnal Manajemen, Teknik Informatika Dan Rekayasa Komputer, 23(3), 663-676. https://doi.org/10.30812/matrik.v23i3.3835
Section
Articles