Comparison of Text Representation for Clustering Student Concept Maps

Reni Fatrisna Salsabila; Didik Dwi Prasetya; Triyanna Widyaningtyas; Tsukasa Hirashima

doi:10.30812/matrik.v24i2.4598

Authors

Reni Fatrisna Salsabila Universitas Negeri Malang, Malang, Indonesia https://orcid.org/0009-0001-9648-4972
Didik Dwi Prasetya Universitas Negeri Malang, Malang, Indonesia
Triyanna Widyaningtyas Universitas Negeri Malang, Malang, Indonesia
Tsukasa Hirashima Hiroshima University, Hiroshima, Japan

DOI:

https://doi.org/10.30812/matrik.v24i2.4598

Keywords:

Bidirectional Encoder Representations from Transformers, Term Frequency-Inverse Document Frequency, Clustering, Concept Map

Abstract

This research addresses the challenge of selecting a text representation method that effectively captures students' conceptual understanding for clustering purposes. Traditional methods, such as Term Frequency-Inverse Document Frequency (TF-IDF), often fail to capture semantic relationships, which limits their effectiveness in clustering complex datasets. This study compares TF-IDF with Bidirectional Encoder Representations from Transformers (BERT) to determine their suitability for clustering student concept maps on two learning topics: Databases and Cyber Security. The method applies two clustering algorithms: K-Means and its improved variant, K-Means++, which enhances centroid initialization for better stability and clustering quality. The datasets consist of concept maps from 27 students for each topic, comprising 1,206 concepts and 616 propositions for Databases and 2,564 concepts and 1,282 propositions for Cyber Security. Evaluation uses two metrics, the Davies-Bouldin Index (DBI) and the Silhouette Score, to assess the compactness and separability of the clusters. The results show that BERT consistently outperforms TF-IDF, producing lower DBI values and higher Silhouette Scores for every number of clusters tested (k = 2 to k = 10). Combining BERT with K-Means++ yields the most compact and well-separated clusters, while TF-IDF results in overlapping and less well-defined clusters. The research concludes that BERT is a superior text representation method for clustering, offering significant advantages in capturing semantic context and enabling educators to identify student misconceptions and improve learning strategies.
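The comparison loop described in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation: the whitespace tokenizer, the `ln(N / df) + 1` IDF variant, the Euclidean distance, and all function names (`tfidf`, `kmeans`, `silhouette`, `davies_bouldin`) are assumptions made for illustration, and the toy propositions are hypothetical rather than drawn from the study's data. BERT embeddings are not computed here; any list of embedding vectors could be passed in place of the TF-IDF vectors.

```python
import math
import random
from collections import Counter, defaultdict

def tfidf(docs):
    """One TF-IDF vector per whitespace-tokenized, non-empty document."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    df = Counter(t for toks in tokenized for t in set(toks))
    # idf = ln(N / df) + 1 is one common variant (an assumption here).
    idf = {t: math.log(len(docs) / df[t]) + 1.0 for t in vocab}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append([tf[t] / len(toks) * idf[t] for t in vocab])
    return vectors

def dist(a, b):
    """Euclidean distance (assumed; the abstract names no distance)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(points):
    return [sum(col) / len(points) for col in zip(*points)]

def groups(X, labels):
    out = defaultdict(list)
    for x, label in zip(X, labels):
        out[label].append(x)
    return out

def kmeans(X, k, seed=0, plus_plus=True, iters=100):
    """Lloyd's algorithm with random or k-means++ seeding; X needs >= k distinct points."""
    rng = random.Random(seed)
    if plus_plus:
        # k-means++: each further seed is drawn with probability
        # proportional to its squared distance to the nearest seed.
        centroids = [rng.choice(X)]
        while len(centroids) < k:
            d2 = [min(dist(x, c) ** 2 for c in centroids) for x in X]
            centroids.append(rng.choices(X, weights=d2, k=1)[0])
    else:
        centroids = rng.sample(X, k)
    labels = None
    for _ in range(iters):
        new = [min(range(k), key=lambda j: dist(x, centroids[j])) for x in X]
        if new == labels:
            break
        labels = new
        for j, members in groups(X, labels).items():
            centroids[j] = mean(members)  # empty clusters keep their old centroid
    return labels, centroids

def silhouette(X, labels):
    """Mean silhouette coefficient in [-1, 1]; higher is better."""
    g = groups(X, labels)
    total = 0.0
    for x, label in zip(X, labels):
        own = g[label]
        if len(own) == 1:
            continue  # singleton clusters score 0 by convention
        a = sum(dist(x, y) for y in own) / (len(own) - 1)
        b = min(sum(dist(x, y) for y in pts) / len(pts)
                for other, pts in g.items() if other != label)
        total += (b - a) / max(a, b)
    return total / len(X)

def davies_bouldin(X, labels):
    """Davies-Bouldin Index; lower is better."""
    g = groups(X, labels)
    cent = {j: mean(pts) for j, pts in g.items()}
    scat = {j: sum(dist(x, cent[j]) for x in pts) / len(pts) for j, pts in g.items()}
    worst = [max((scat[i] + scat[j]) / dist(cent[i], cent[j]) for j in cent if j != i)
             for i in cent]
    return sum(worst) / len(worst)

# Hypothetical toy propositions, not taken from the study's data.
docs = [
    "table has primary key",
    "table has foreign key",
    "primary key identifies row",
    "firewall blocks malicious traffic",
    "firewall filters network traffic",
    "malware infects network host",
]
X = tfidf(docs)  # vectors from a BERT encoder could be passed here instead
for k in range(2, 5):
    for plus_plus in (False, True):
        labels, _ = kmeans(X, k, seed=0, plus_plus=plus_plus)
        print(k, "k-means++" if plus_plus else "k-means",
              round(davies_bouldin(X, labels), 3), round(silhouette(X, labels), 3))
```

The sweep prints one line per value of k and seeding strategy, so the two metrics can be read side by side: a lower DBI and a higher Silhouette Score both indicate more compact, better-separated clusters. With only six toy documents the sweep stops at k = 4, whereas the study reports k = 2 to k = 10.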
