Evaluation Analysis of the Necessity of Stemming and Lemmatization in Text Classification
DOI:
https://doi.org/10.30812/matrik.v24i2.4833
Keywords:
Lemmatization, Performance, Stemming, Support Vector Machine, Text Classification
Abstract
Stemming and lemmatization are text preprocessing methods that reduce words to their root form and to their canonical (dictionary) form, respectively. Some previous studies report that stemming and lemmatization degrade the performance of text classification models, while others report a positive impact. This study analyzes the impact of stemming and lemmatization on text classification with the support vector machine (SVM) method, on English and Indonesian text datasets, and examines when these methods should be used. The research process consisted of several stages: text preprocessing using stemming and lemmatization, feature extraction with Term Frequency-Inverse Document Frequency (TF-IDF), classification using SVM, and model evaluation across four experimental scenarios. The experimental results show that stemming generally degrades the performance of the text classification model, especially on large and unbalanced datasets. Stemming had the shortest computation time, completing in 4 hours, 51 minutes, and 41.3 seconds on the largest dataset. Lemmatization improved classification performance on small datasets, achieving 91.075% accuracy, but had the longest computation time, especially on large datasets, where it took 5 hours, 10 minutes, and 25.2 seconds. The experimental results also show that stemming on the balanced Indonesian dataset yields better text classification performance, reaching 82.080% accuracy.
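The first two stages named in the abstract (normalizing words by stemming or lemmatization, then weighting terms with TF-IDF) can be illustrated with a minimal sketch. It is not the study's implementation. The suffix list, the lemma table, the sample reviews, and the function names `toy_stem`, `toy_lemmatize`, `tfidf`, and `vocabulary_size` are all hypothetical, and the TF-IDF variant (relative term frequency times ln(N/df) + 1) is only one common choice, assumed here for illustration.

```python
import math
from collections import Counter

# Hypothetical toy rules, for illustration only. A real experiment would use
# an established stemmer and lemmatizer for English or Indonesian.
SUFFIXES = ("ing", "ed", "es", "s")
LEMMA_TABLE = {
    "better": "good",
    "was": "be",
    "rooms": "room",
    "stayed": "stay",
    "staying": "stay",
}


def toy_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def toy_lemmatize(token):
    """Look the token up in a small dictionary; unknown tokens pass through."""
    return LEMMA_TABLE.get(token, token)


def preprocess(text, normalize=None):
    """Lowercase, split on whitespace, and optionally normalize each token."""
    tokens = text.lower().split()
    return [normalize(t) if normalize else t for t in tokens]


def tfidf(docs):
    """Weight each term by (count / doc length) * (ln(N / df) + 1).

    docs is a list of token lists; the result is one {term: weight} dict
    per document. This formula is an assumed variant, not the study's.
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * (math.log(n_docs / doc_freq[term]) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def vocabulary_size(texts, normalize=None):
    """Count distinct terms after preprocessing."""
    return len({t for text in texts for t in preprocess(text, normalize)})


# Made-up sample reviews, not taken from the study's datasets.
reviews = [
    "the rooms was clean and we stayed two nights",
    "staying here was better than expected",
    "the room smelled and the staff ignored us",
]

for name, normalize in [("raw", None), ("stem", toy_stem), ("lemma", toy_lemmatize)]:
    docs = [preprocess(r, normalize) for r in reviews]
    vectors = tfidf(docs)
    print(name, "vocabulary:", vocabulary_size(reviews, normalize),
          "terms in first vector:", len(vectors[0]))
```

Both normalizers can only shrink or preserve the vocabulary, which is why they change the TF-IDF feature space the classifier sees. The toy stemmer also turns "ignored" into "ignor", a small instance of the over-stemming that can hurt a classifier. The classification stage is omitted here. In practice the vectors would be passed to an SVM implementation from a library.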