Comparison of Feature Extraction Methods in Multilabel Text Classification Using Machine Learning Algorithms

  • Lusiana Efrizoni STMIK Amik Riau https://orcid.org/0000-0002-3153-6233
  • Sarjon Defit Universitas Putra Indonesia YPTK Padang
  • Muhammad Tajuddin Universitas Bumigora
  • Anthony Anggrawan Universitas Bumigora
Keywords: Feature Extraction, Multilabel Text Classification, Machine Learning, Model Performance Comparison

Abstract

Feature extraction and the classification algorithm are central parts of any text classification task, and both directly affect classification performance. Traditional machine learning algorithms such as Naïve Bayes, Support Vector Machines (SVM), Decision Tree, K-Nearest Neighbors, Random Forest, and Logistic Regression have been applied successfully to text classification with feature extraction methods such as Bag of Words (BoW), Term Frequency-Inverse Document Frequency (TF-IDF), Document to Vector (Doc2Vec), and Word to Vector (Word2Vec). However, how best to use word vectors to represent text for classification with machine learning algorithms remains a difficult problem in Natural Language Processing. This paper compares the performance of the BoW, TF-IDF, Doc2Vec, and Word2Vec feature extraction methods in text classification using machine learning algorithms. The dataset consists of 1000 samples taken from tribunnews.com, with data splits of 50:50, 70:30, 80:20, and 90:10. The experiments show that Naïve Bayes achieves the highest accuracy, 87% with TF-IDF features and 83% with BoW features. With Doc2Vec features, the highest accuracy is 81%, obtained with SVM. With Word2Vec features, all six machine learning algorithms (Naïve Bayes, Support Vector Machines, Decision Tree, K-Nearest Neighbors, Random Forest, and Logistic Regression) produce model accuracy below 50%. This indicates that Word2Vec is less suitable for use with these machine learning algorithms, particularly on the tribunnews.com dataset.
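
The comparison described in the abstract can be sketched in miniature as follows. This is only an illustrative sketch under several assumptions, not the authors' pipeline: the toy headlines are invented (they are not the tribunnews.com data), each document carries a single label for brevity, the first number of each split is treated as the training share, and the helper names (`split`, `idf`, `vectorize`, `train_nb`, `predict_nb`, `accuracy`) are hypothetical. It builds BoW and TF-IDF vectors by hand and scores a small multinomial Naïve Bayes classifier on each split.

```python
import math
from collections import Counter

# Hypothetical toy headlines (NOT the tribunnews.com dataset), interleaved by class.
DATA = [
    ("tim bola menang di final liga", "sport"),
    ("harga saham naik di bursa", "economy"),
    ("pemain bola cetak gol di liga", "sport"),
    ("bank sentral tahan suku bunga", "economy"),
    ("pelatih tim bola puji pemain", "sport"),
    ("investor saham pantau bursa", "economy"),
    ("liga bola mulai pekan depan", "sport"),
    ("bunga bank naik investor cemas", "economy"),
    ("tim liga rekrut pemain baru", "sport"),
    ("bursa saham tutup melemah", "economy"),
]

def split(data, train_ratio):
    """Deterministic split (no shuffling): the leading share is the training set."""
    cut = round(len(data) * train_ratio)
    return data[:cut], data[cut:]

def idf(train_tokens, vocab):
    """Smoothed inverse document frequency computed on the training documents."""
    n = len(train_tokens)
    df = Counter(t for toks in train_tokens for t in set(toks))
    return {t: math.log((1 + n) / (1 + df[t])) + 1.0 for t in vocab}

def vectorize(tokens, vocab, weights=None):
    """BoW counts when weights is None, otherwise count * idf (TF-IDF)."""
    counts = Counter(tokens)
    return [counts[t] * (weights[t] if weights else 1.0) for t in vocab]

def train_nb(X, y, alpha=1.0):
    """Multinomial Naive Bayes with Laplace smoothing.

    Returns {label: (log prior, per-feature log likelihoods)}.
    """
    model = {}
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        totals = [sum(col) + alpha for col in zip(*rows)]
        denom = sum(totals)
        model[c] = (math.log(len(rows) / len(y)),
                    [math.log(t / denom) for t in totals])
    return model

def predict_nb(model, x):
    """Pick the label with the highest log posterior score."""
    def score(c):
        prior, loglik = model[c]
        return prior + sum(v * w for v, w in zip(x, loglik))
    return max(model, key=score)

def accuracy(train, test, use_tfidf):
    """Fit on train, then return the fraction of test documents labeled correctly."""
    train_tokens = [text.split() for text, _ in train]
    vocab = sorted({t for toks in train_tokens for t in toks})
    weights = idf(train_tokens, vocab) if use_tfidf else None
    X = [vectorize(toks, vocab, weights) for toks in train_tokens]
    model = train_nb(X, [label for _, label in train])
    hits = sum(
        predict_nb(model, vectorize(text.split(), vocab, weights)) == label
        for text, label in test
    )
    return hits / len(test)

if __name__ == "__main__":
    for ratio in (0.5, 0.7, 0.8, 0.9):
        train, test = split(DATA, ratio)
        print(f"{ratio:.0%} train: "
              f"BoW={accuracy(train, test, False):.2f} "
              f"TF-IDF={accuracy(train, test, True):.2f}")
```

The vocabulary and IDF weights are derived from the training portion only, so words that appear only in the test documents are ignored. On a corpus this small the printed accuracies say nothing about the paper's results; the point is only to show how the feature extraction step can be swapped while the classifier and the splits stay fixed.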

Published
2022-07-31
How to Cite
Efrizoni, L., Defit, S., Tajuddin, M., & Anggrawan, A. (2022). Komparasi Ekstraksi Fitur dalam Klasifikasi Teks Multilabel Menggunakan Algoritma Machine Learning. MATRIK : Jurnal Manajemen, Teknik Informatika Dan Rekayasa Komputer, 21(3), 653-666. https://doi.org/10.30812/matrik.v21i3.1851
Section
Articles