Enhancing Multiple Linear Regression with Stacking Ensemble for Dissolved Oxygen Estimation
Abstract
Maintaining optimal dissolved oxygen levels is essential for aquatic ecosystems, yet industrial and domestic waste has led to a global decline in dissolved oxygen. Traditional measurement methods, such as oxygen meters and Winkler titration, are often costly or time-consuming. This study aims to improve the Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and R² values obtained when estimating dissolved oxygen levels. The method applies Multiple Linear Regression to several training and testing data splits, both before and after adding polynomial features. The model is then further optimized with a stacking technique that uses a Random Forest Regressor and a Gradient Boosting Regressor as base models.
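The pipeline described above can be sketched with scikit-learn. This is a minimal illustration and not the study's actual code. The synthetic data, the degree-2 polynomial expansion, the placement of the polynomial step before the stack, the linear-regression meta-learner, and all hyperparameters are assumptions. Only the 90:10 split and the two base models come from the abstract.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import (
    GradientBoostingRegressor,
    RandomForestRegressor,
    StackingRegressor,
)
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for water-quality predictors and dissolved oxygen (assumption).
X, y = make_regression(n_samples=300, n_features=4, noise=10.0, random_state=0)

# 90:10 train/test split, as in the best-performing configuration.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, random_state=0
)

# Base models named in the abstract; the linear meta-learner is an assumption.
stack = StackingRegressor(
    estimators=[
        ("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
        ("gbr", GradientBoostingRegressor(random_state=0)),
    ],
    final_estimator=LinearRegression(),
    cv=5,
)

# Polynomial features are generated first, then fed to the stacked model (assumption).
model = make_pipeline(PolynomialFeatures(degree=2, include_bias=False), stack)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

rmse = float(np.sqrt(mean_squared_error(y_test, y_pred)))
mae = float(mean_absolute_error(y_test, y_pred))
r2 = float(r2_score(y_test, y_pred))
print(f"RMSE={rmse:.3f}  MAE={mae:.3f}  R2={r2:.3f}")
```

The printed values depend on the synthetic data and are not comparable to the figures reported in the study.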
The results show that the best model was obtained with the stacking ensemble technique, a 90:10 data split, and polynomial features, yielding a Root Mean Square Error of 1.206, a Mean Absolute Error of 0.990, and an R² of 0.670. This model also satisfies the assumptions of linear regression, such as residual normality, homoscedasticity, and no autocorrelation of residuals. The study concludes that the stacking ensemble technique and the addition of polynomial features can improve the model's estimates of dissolved oxygen values. It also contributes an accessible user interface built with the Gradio framework, allowing users to estimate dissolved oxygen levels effectively.
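For readers unfamiliar with the three reported metrics, the following sketch computes them from their standard definitions using only the Python standard library. It also computes the Durbin-Watson statistic as one common check for residual autocorrelation. The abstract does not say which autocorrelation test the study used, so that choice is an assumption, as are the function names and the toy numbers.

```python
import math


def rmse(y_true, y_pred):
    """Root Mean Square Error: square root of the mean squared residual."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean Absolute Error: mean of the absolute residuals."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def durbin_watson(residuals):
    """Durbin-Watson statistic: values near 2 suggest no first-order autocorrelation."""
    num = sum((residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den


# Toy observed and predicted dissolved oxygen values (hypothetical numbers).
observed = [6.0, 7.0, 8.0, 9.0]
predicted = [6.5, 6.5, 8.5, 8.5]
residuals = [t - p for t, p in zip(observed, predicted)]

print(rmse(observed, predicted), mae(observed, predicted), r2(observed, predicted))
print(durbin_watson(residuals))
```

The Durbin-Watson statistic ranges from 0 to 4. Values well below 2 point to positive autocorrelation and values well above 2 to negative autocorrelation.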
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.