Thesis Topic Modeling Study: Latent Dirichlet Allocation (LDA) and Machine Learning Approach
Abstract
The thesis reports housed in the campus repository have not yet been analyzed for knowledge patterns. Analyzing trends in thesis research topics can facilitate the selection of research topics, aid in mapping research areas, and identify underexplored topics. This research therefore aims to model thesis topics using Latent Dirichlet Allocation (LDA) and to classify them using the Naïve Bayes and Support Vector Machine (SVM) methods. The results show that LDA modeled the five most popular thesis topics: two related to computer networks, two to software engineering, and one to multimedia. For thesis topic classification, SVM achieved higher accuracy than Naïve Bayes, reaching 92.80% after the data were balanced using the Synthetic Minority Oversampling Technique (SMOTE). These findings imply that LDA-based topic modeling can identify dominant thesis topics and that SVM is better suited than Naïve Bayes to the thesis topic classification task.
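The balancing step named in the abstract can be illustrated with a minimal sketch of the core SMOTE idea: each synthetic minority sample is placed at a random point on the line segment between an existing minority sample and one of its k nearest minority neighbours. This is not the authors' code. The function name `smote`, its parameters, and the toy two-dimensional vectors are assumptions made for illustration, and a study of this kind would more plausibly rely on a library implementation.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority samples."""
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    if k < 1:
        raise ValueError("need at least two minority samples")
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (the base itself excluded)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Toy, made-up feature vectors (for example, two TF-IDF weights per document);
# these are NOT data from the study.
majority = [[0.9, 0.1], [0.8, 0.2], [0.85, 0.15],
            [0.7, 0.3], [0.95, 0.05], [0.75, 0.25]]
minority = [[0.1, 0.9], [0.2, 0.8], [0.15, 0.7]]

# Oversample the minority class until both classes have the same size.
new_points = smote(minority, n_new=len(majority) - len(minority), k=2)
balanced_minority = minority + new_points
print(len(majority), len(balanced_minority))
```

Because every synthetic point is a convex combination of two real minority samples, the new points stay inside the region the minority class already occupies, which is the main difference from oversampling by plain duplication.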
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.