Evaluating Different K Values in K-Fold Cross Validation for Binary Logistic Regression to Classify Poverty

Julia Oriana Sinaga; M. Fathurahman; Sri Wahyuningsih; Memi Nor Hayati

doi:10.30812/varian.v8i2.4403

Authors

Julia Oriana Sinaga, Universitas Mulawarman, Samarinda, Indonesia
M. Fathurahman, Universitas Mulawarman, Samarinda, Indonesia
Sri Wahyuningsih, Universitas Mulawarman, Samarinda, Indonesia
Memi Nor Hayati, Universitas Mulawarman, Samarinda, Indonesia

DOI:

https://doi.org/10.30812/varian.v8i2.4403

Keywords:

Binary Logistic Regression, Classification, K-Fold Cross Validation, Poverty Depth Levels

Abstract

Data mining helps decision-makers analyze data and extract insights from it efficiently. Classification is a data mining technique that organizes data based on its features, helping to identify patterns and make predictions. This study evaluates Binary Logistic Regression (BLR), a type of generalized linear model that is suitable for binary outcomes, for classifying poverty depth across Indonesian regencies/cities in 2022, with a focus on the impact of different K values in K-Fold Cross Validation. The dataset covers 514 regencies/cities, with the Poverty Depth Index as the target variable, categorized into high (1) and low (0) levels, and 11 predictor variables. K-Fold Cross Validation was performed with K values of 3, 5, and 10, using accuracy and Area Under the Curve (AUC) as evaluation metrics. The mean accuracy of BLR is 75.7% for K=3, 74.3% for K=5, and 75.1% for K=10. The results show that K=3 gives the highest accuracy in classifying poverty depth in Indonesia, along with the lowest standard deviation, 0.03. However, K=10 demonstrates superior discriminative ability in BLR, reflected by a higher AUC value. This study highlights the significant influence of the K value in K-Fold Cross Validation on BLR performance.
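
The procedure summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the data are synthetic (two made-up predictors rather than the study's 11, and 200 rows rather than 514), the logistic regression is fitted with a simple hand-written gradient descent, and every function name and parameter value below is a hypothetical choice made for illustration. The sketch only shows how mean accuracy, its standard deviation, and mean AUC can be computed for K = 3, 5, and 10.

```python
import math
import random
from statistics import mean, stdev


def make_data(n=200, seed=0):
    """Synthetic binary data (assumption: stands in for the real dataset)."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        x1, x2 = rng.gauss(0, 1), rng.gauss(0, 1)
        p = 1 / (1 + math.exp(-(1.5 * x1 - 1.0 * x2)))
        X.append([x1, x2])
        y.append(1 if rng.random() < p else 0)
    return X, y


def predict_proba(w, x):
    """P(y = 1 | x) under the logistic model; w[0] is the intercept."""
    z = w[0] + sum(wj * v for wj, v in zip(w[1:], x))
    z = max(-30.0, min(30.0, z))  # clip to avoid overflow in exp
    return 1 / (1 + math.exp(-z))


def fit_logistic(X, y, lr=0.5, epochs=100):
    """Fit weights by batch gradient descent on the mean log-loss."""
    w = [0.0] * (len(X[0]) + 1)
    n = len(X)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = predict_proba(w, xi) - yi
            grad[0] += err
            for j, v in enumerate(xi):
                grad[j + 1] += err * v
        w = [wj - lr * g / n for wj, g in zip(w, grad)]
    return w


def auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def k_fold_scores(X, y, k, seed=0):
    """Return per-fold accuracy and AUC for k-fold cross validation."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint folds covering all rows
    accs, aucs = [], []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        w = fit_logistic([X[i] for i in train_idx], [y[i] for i in train_idx])
        probs = [predict_proba(w, X[i]) for i in test_idx]
        truth = [y[i] for i in test_idx]
        preds = [1 if p >= 0.5 else 0 for p in probs]
        accs.append(sum(p == t for p, t in zip(preds, truth)) / len(truth))
        aucs.append(auc(truth, probs))
    return accs, aucs


X, y = make_data()
results = {}
for k in (3, 5, 10):
    accs, aucs = k_fold_scores(X, y, k)
    results[k] = (mean(accs), stdev(accs), mean(aucs))
    print(f"K={k}: accuracy {results[k][0]:.3f} "
          f"(SD {results[k][1]:.3f}), AUC {results[k][2]:.3f}")
```

The numbers this prints depend entirely on the synthetic data and the random seed, so they are not comparable to the figures reported in the abstract. In practice, a library such as scikit-learn (for example `KFold`, `LogisticRegression`, and `cross_validate`) would typically replace the hand-written pieces.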
