Comparison of Text Representation for Clustering Student Concept Maps
DOI:
https://doi.org/10.30812/matrik.v24i2.4598
Keywords:
Bidirectional Encoder Representations from Transformers, Term Frequency-Inverse Document Frequency, Clustering, Concept Map
Abstract
This research addresses the challenge of selecting a text representation method that effectively captures students’ conceptual understanding for clustering purposes. Traditional methods, such as Term Frequency-Inverse Document Frequency (TF-IDF), often fail to capture semantic relationships, which limits their effectiveness in clustering complex datasets. This study compares TF-IDF with Bidirectional Encoder Representations from Transformers (BERT) to determine their suitability for clustering student concept maps on two learning topics: Databases and Cyber Security. The method applies two clustering algorithms: K-Means and its improved variant, K-Means++, which enhances centroid initialization for better stability and clustering quality. The datasets consist of concept maps from 27 students for each topic, comprising 1,206 concepts and 616 propositions for Databases and 2,564 concepts and 1,282 propositions for Cyber Security. Evaluation uses two metrics, the Davies-Bouldin Index (DBI) and the Silhouette Score, to assess the compactness and separability of the clusters. The results show that BERT consistently outperforms TF-IDF, producing lower DBI values and higher Silhouette Scores for every tested number of clusters (k = 2 to k = 10). Combining BERT with K-Means++ yields the most compact and well-separated clusters, while TF-IDF results in overlapping and less well-defined clusters. The research concludes that BERT is a superior text representation method for clustering, offering significant advantages in capturing semantic context and enabling educators to identify student misconceptions and improve learning strategies.
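The evaluation loop described in the abstract can be illustrated with a minimal sketch. It assumes scikit-learn and covers only the TF-IDF side; it is not the authors' implementation. The propositions are hypothetical examples rather than the study's data, the function name `evaluate` is invented for illustration, and the sweep stops at k = 4 because the toy corpus is far smaller than the real datasets. BERT embeddings, which need a pretrained model, would replace the matrix `X` with one embedding vector per proposition.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import davies_bouldin_score, silhouette_score

# Hypothetical propositions (concept, link, concept); NOT the study's data.
propositions = [
    "database stores data",
    "table contains rows",
    "primary key identifies row",
    "query retrieves rows from table",
    "firewall blocks network traffic",
    "malware infects system",
    "encryption protects data",
    "phishing steals password",
]

# TF-IDF representation, converted to a dense array so that both
# metric functions below accept it.
X = TfidfVectorizer().fit_transform(propositions).toarray()


def evaluate(X, init, k_values, seed=0):
    """Return {k: (dbi, silhouette)} for K-Means with the given init."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, init=init, n_init=10,
                        random_state=seed).fit_predict(X)
        scores[k] = (davies_bouldin_score(X, labels),
                     silhouette_score(X, labels))
    return scores


# The study sweeps k = 2..10; this toy corpus only supports a short sweep.
k_values = range(2, 5)
results = {
    "K-Means (random init)": evaluate(X, "random", k_values),
    "K-Means++": evaluate(X, "k-means++", k_values),
}

for name, scores in results.items():
    for k, (dbi, sil) in scores.items():
        print(f"{name:22s} k={k}  DBI={dbi:.3f}  Silhouette={sil:.3f}")
```

When reading the output, a lower DBI and a higher Silhouette Score both indicate more compact and better separated clusters, which is the criterion the abstract uses to compare representations.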