Data Mining Earthquake Prediction with Multivariate Adaptive Regression Splines and Peak Ground Acceleration

Earthquake research has not yet yielded promising results because earthquake data parameters are uncertain. One way to handle uncertain parameters is a nonparametric method, namely Multivariate Adaptive Regression Splines (MARS). Sumbawa Island is part of Indonesia and lies at the meeting point of three active tectonic plates, so it is prone to earthquake hazards, which makes this research important. This study aimed to analyze earthquake hazard prediction on Sumbawa Island using the nonparametric MARS method together with Peak Ground Acceleration (PGA) to determine the level of earthquake hazard. MARS is carried out in two stages: Forward Stepwise and Backward Stepwise. Testing and parameter analysis produced a mathematical model with 11 basis functions (BF), of which BF 1, 2, 3, 4, 5, 7, 9, and 11 contribute to the response variable and BF 6, 8, and 10 do not. The most influential predictor variables were epicenter distance (100%) and magnitude (73.8%). The conclusion of this study is based on the highest PGA values in


INTRODUCTION
Earthquakes are natural disasters that can cause minor to severe damage, and many lives and much property have been lost to them. Research to date has not produced significant results in determining the causative factors of earthquakes or when they will occur. Many studies have been carried out, but earthquake data are uncertain and voluminous, since they come from accelerograph recordings, so an appropriate method is needed to perform predictive analysis based on past data. Previous research has used various methods, such as classification and regression methods, Artificial Neural Networks (ANN), Support Vector Regression (SVR), Hybrid Neural Networks (HNN), and others. These methods use a parametric approach, whereas earthquake data are uncertain, so a nonparametric approach such as the Multivariate Adaptive Regression Spline (MARS) method is more appropriate. Earthquakes can occur anywhere, including on Sumbawa Island, West Nusa Tenggara. Sumbawa Island is part of the Indonesian archipelago and lies east of the Alas Strait. Sumbawa, with an area of 15,448 square km, has an active volcano that erupted violently in 1815 and affected the whole world through changes in weather and the spread of volcanic ash up to 1,300 km away. Geologically, Sumbawa lies between the western and eastern island arcs formed by the subduction of the Australian plate at the continental boundary of the Indo-Pacific plate, to the south of the island. Because Sumbawa lies where plates meet, it is prone to tectonic earthquakes. History shows that Sumbawa often experiences earthquakes with a magnitude of more than 5 at a depth of less than 70 km.
Matrik: Jurnal Managemen, Teknik Informatika, dan Rekayasa Komputer

Many methods have been developed, and research on ... different soil and rock structures. This research was conducted by developing the mathematical model function formed by MARS according to the condition of the regional bedrock on Sumbawa Island. It uses three predictor variables to find correlations in the predictive analysis. This study analyzes predictions of earthquake-prone areas on Sumbawa Island based on the highest Peak Ground Acceleration values.

Multivariate Adaptive Regression Spline (MARS)
The MARS method is a nonparametric regression method for high-dimensional data. It is used to determine the relationship pattern between the response variable and the predictor variables when the form of the regression curve is not known [19]. In data mining, prediction can be carried out in two ways: parametric regression and nonparametric regression. Both approaches are widely used statistical methods for investigating and modeling relationships between variables [20]. The MARS method overcomes the shortcomings of Recursive Partitioning Regression (RPR) by producing a model that is continuous at the knots and by identifying additive linear functions. The MARS algorithm has two stages, the Forward Stepwise stage and the Backward Stepwise stage [18,21,22]. The first stage, Forward Stepwise, searches over combinations of basis functions (BF), maximum interaction (MI), and minimum observation (MO) to find the relationship between the response variable and the predictor variables. In this research, the response variable is Peak Ground Acceleration (PGA), and the predictor variables are depth, magnitude (Mw), and epicenter distance (Repi). The second stage, Backward Stepwise, simplifies the set of basis functions obtained from the Forward Stepwise stage. Basis functions that make no contribution, or only a small contribution, to the response variable are eliminated at this stage, choosing the deletion that causes the smallest increase in the residual sum of squares. In general, the nonparametric regression model can be presented as in Equation (1) [23][24][25].
y_i = f(x_i) + ε_i (1)
where y_i is the response variable for observation i, f(x_i) is a function of the vector of predictor variables, and ε_i is the independent error term for observation i. The choice of predictor variables strongly affects the model built with the MARS method. The MARS model is flexible, and its basis functions are given in Equations (2) and (3), the truncated linear functions (x - τ)+ and (τ - x)+. Equations (2) and (3) mirror each other around the knot τ, so they are called a reflected pair. Reflected pairs are formed for each variable x_j, with knots placed at each observed value x_ij of that variable, so that the set of candidate truncated linear functions is formed as in Equation (4). The MARS model starts from Equation (5), where M is the number of basis functions that make up the model. B_m(x) is a basis function formed by a single element, or by the product of two or more elements, of the candidate set, and it is multiplied by the coefficient β_m. The m-th basis function can be written out as shown in Equation (6), where K_m is the number of truncated linear functions multiplied together in the m-th basis function, x_kmj is the input variable associated with the k-th truncated function in the m-th basis function, τ_kmj is the knot value for that variable, and s_kmj is the sign operator, which takes the value +1 or -1.
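The construction described above can be illustrated with a minimal Python sketch. It assumes the standard MARS form in which each basis function is a product of truncated linear (hinge) factors and the model is an intercept plus a weighted sum of basis functions. The function names, knot values, and coefficients are hypothetical and are not taken from the fitted model of this study.

```python
from typing import List, Sequence, Tuple

# One factor of a basis function: (variable index, knot, sign), sign = +1 or -1.
Factor = Tuple[int, float, int]


def hinge(x: float, knot: float, sign: int) -> float:
    """Truncated linear function [sign * (x - knot)]+ ."""
    return max(0.0, sign * (x - knot))


def basis_function(x: Sequence[float], factors: Sequence[Factor]) -> float:
    """Product of the K_m truncated linear factors of one basis function."""
    value = 1.0
    for var, knot, sign in factors:
        value *= hinge(x[var], knot, sign)
    return value


def mars_predict(x: Sequence[float], intercept: float,
                 terms: List[Tuple[float, List[Factor]]]) -> float:
    """Intercept plus the sum of coefficient * basis function."""
    return intercept + sum(coef * basis_function(x, f) for coef, f in terms)


# Hypothetical example: x = (depth, Mw, Repi); knots and coefficients are made up.
x = (10.0, 5.0, 50.0)
terms = [
    (0.5, [(2, 100.0, -1)]),                # 0.5 * (100 - Repi)+
    (2.0, [(1, 4.5, +1)]),                  # 2.0 * (Mw - 4.5)+
    (0.1, [(1, 4.5, +1), (2, 100.0, -1)]),  # interaction of the two factors
]
prediction = mars_predict(x, intercept=1.0, terms=terms)
```

The reflected pair for a knot is obtained by calling `hinge` with sign +1 and sign -1; at most one of the two is nonzero for any input.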

ISSN: 2476-9843
The MARS model is flexible and overcomes the weaknesses of recursive partitioning regression by increasing the accuracy of the model. It is fitted with a two-stage algorithm: Forward Stepwise and Backward Stepwise. The algorithm determines the knot values in the continuous model and minimizes the Generalized Cross Validation (GCV) value to obtain the best model. The GCV criterion is given in Equation (7), where y_i is the response variable, x_i is the predictor variable, N is the number of observations, f_M(x_i) is the estimated value of the response variable from the model with M basis functions, B is the matrix of the M basis functions, and d is the value at which each basis function reaches optimization (2 ≤ d ≤ 4).
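As a rough illustration of the GCV criterion, the sketch below assumes the standard form from the MARS literature: the mean squared residual divided by (1 - C~(M)/N)^2, with C~(M) = C(M) + d*M. It further assumes that the basis-function columns are linearly independent, so that the trace term C(M) reduces to M + 1. The function name and the default d = 3 are illustrative choices, not values reported by this study.

```python
from typing import Sequence


def gcv(y: Sequence[float], y_hat: Sequence[float],
        n_basis: int, d: float = 3.0) -> float:
    """Generalized Cross Validation score for a model with n_basis basis functions."""
    n = len(y)
    if n == 0 or n != len(y_hat):
        raise ValueError("y and y_hat must be non-empty and of equal length")
    mean_sq_residual = sum((a - b) ** 2 for a, b in zip(y, y_hat)) / n
    c_m = n_basis + 1              # assumed: linearly independent basis columns
    c_tilde = c_m + d * n_basis    # complexity penalty, with 2 <= d <= 4 in the text
    if c_tilde >= n:
        raise ValueError("model is too complex for the number of observations")
    return mean_sq_residual / (1.0 - c_tilde / n) ** 2
```

For equal residuals, a model with more basis functions receives a larger GCV, which is why the criterion favors simpler models.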

Peak Ground Acceleration (PGA)
Peak Ground Acceleration (PGA) is the maximum ground vibration acceleration that occurs in an area as a result of an earthquake. A large PGA value in an area usually means a large damage impact in that area. PGA is usually expressed in gal (1 gal = 1 cm/s^2). One way to obtain the PGA value is an empirical calculation with an attenuation function. The attenuation function relates ground vibration intensity to magnitude and to the distance from an area to the earthquake's epicenter. Several factors affect the attenuation function, namely the earthquake mechanism, the epicenter distance, and the ground conditions at the site. This research obtains the PGA value using the Joyner and Boore attenuation equations, as in Formula (8) [26,27].
Here M is the magnitude and r = sqrt(Repi^2 + 8^2). The PGA value is obtained by substituting the values of M and r into Equation (9). The prediction analysis was then carried out after selecting the appropriate response and predictor variables. This study uses PGA as the response variable, and the predictor variables are depth, magnitude (Mw), and epicenter distance (Repi).
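The calculation can be sketched in Python as follows. Only the distance term r = sqrt(Repi^2 + 8^2) comes from the text. The attenuation equation itself did not survive in this copy, so the functional form below, log10(PGA) = a + b(M - 6) - log10(r) - c*r, is an assumption based on the general shape of Joyner and Boore type relations. The coefficients a, b, and c are deliberately left as required arguments and must be taken from the cited references [26,27].

```python
import math


def distance_term(r_epi_km: float, h_km: float = 8.0) -> float:
    """r = sqrt(Repi^2 + 8^2), as defined in the text."""
    return math.sqrt(r_epi_km ** 2 + h_km ** 2)


def pga_attenuation(mw: float, r_epi_km: float,
                    a: float, b: float, c: float) -> float:
    """PGA from an ASSUMED log-linear attenuation form.

    a, b, c are hypothetical parameter names; their values are not given
    in this text and must come from the cited attenuation relation.
    """
    r = distance_term(r_epi_km)
    log_pga = a + b * (mw - 6.0) - math.log10(r) - c * r
    return 10.0 ** log_pga
```

With any positive b and c, this form yields a PGA that grows with magnitude and decays with epicenter distance, which matches the qualitative behavior described for the attenuation function.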

Data Collection
This study uses earthquake catalog data taken from the USGS website (https://earthquake.usgs.gov/earthquakes/search/). The data was accessed on May 31, 2022, at 10:47 AM and was filtered to magnitudes of more than 4 Mw, because earthquakes of less than 4 Mw do not have a significant impact or may not be felt at all. The reference coordinate is latitude -8.59491 (South) and longitude 117.26121 (East). The catalog covers 20 years and contains 105 records with magnitudes from 4 up to 5.5 Mw. The data were selected using three criteria: a magnitude of more than 4 Mw, a depth of less than 250 km, and an epicenter distance of less than 300 km. Records outside these limits were discarded because such events do not cause damage. The selected data were then used in the predictive analysis with three variables: magnitude, epicenter distance, and depth of the earthquake source.
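The selection step can be sketched in Python as below. The thresholds and the reference coordinate come from the text. The use of the haversine great-circle formula for epicenter distance is an assumption, since the text does not state how the distance was computed, and the record field names (`lat`, `lon`, `mag`, `depth_km`) are hypothetical.

```python
import math
from typing import Dict, List

SITE_LAT, SITE_LON = -8.59491, 117.26121  # reference coordinate from the text


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float,
                 radius_km: float = 6371.0) -> float:
    """Great-circle distance between two points (assumed distance measure)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2.0 * radius_km * math.asin(math.sqrt(a))


def select_events(events: List[Dict[str, float]]) -> List[Dict[str, float]]:
    """Keep events with Mw > 4, depth < 250 km, and epicenter distance < 300 km."""
    selected = []
    for ev in events:
        r_epi = haversine_km(SITE_LAT, SITE_LON, ev["lat"], ev["lon"])
        if ev["mag"] > 4.0 and ev["depth_km"] < 250.0 and r_epi < 300.0:
            selected.append({**ev, "r_epi_km": r_epi})
    return selected
```

Each retained record carries its computed epicenter distance, so the three predictor variables are available together for the later analysis.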
The earthquake prediction analysis used the Multivariate Adaptive Regression Spline (MARS) method with the function in Equation (6), and the minimum Generalized Cross Validation (GCV) value was obtained with Equation (7). The SPM 8 software was used to carry out the prediction by analyzing the relationship between the predictor variables and the response variable. MARS works with two algorithms, Forward Stepwise and Backward Stepwise. The Forward Stepwise algorithm determines the combination of the maximum number of basis functions (BF), the maximum interaction (MI), and the minimum observation (MO).
The maximum number of basis functions bounds the products formed between variables that are linked and correlated. The maximum interaction (MI) is the maximum number of variables that may be multiplied together in one basis function (BF). The minimum observation (MO) is the minimum number of observations between knots, which acts as a minimum smoothing parameter. The Backward Stepwise algorithm is then used to reduce the complexity of the resulting mathematical model. This algorithm uses a regularization technique, Tikhonov regularization, to minimize generalization error by penalizing a model that is too complex. Peak Ground Acceleration (PGA) is used to determine whether or not an area is categorized as prone to earthquake hazards. An area with a high PGA value will experience a high impact when an earthquake occurs. The PGA value can be obtained from accelerograph recordings or by empirical calculation, and this study uses empirical calculation with the Joyner and Boore attenuation functions, as in Equations (8) and (9).
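The pruning idea behind the Backward Stepwise stage can be sketched generically in Python. This is a simplified illustration, not the SPM 8 implementation: it assumes a caller-supplied `score` function (for example, the GCV of the model refitted on a subset of basis functions) and, at each step, deletes the basis function whose removal gives the lowest score, remembering the best subset seen. The toy score and the set of useful terms are hypothetical.

```python
from typing import Callable, FrozenSet, Iterable, Tuple


def backward_stepwise(terms: Iterable[int],
                      score: Callable[[FrozenSet[int]], float]
                      ) -> Tuple[FrozenSet[int], float]:
    """Greedy backward deletion; returns the lowest-scoring subset and its score."""
    current = frozenset(terms)
    best_subset, best_score = current, score(current)
    while current:
        # Try deleting each remaining term and keep the best single deletion.
        candidate, cand_score = min(
            ((current - {t}, score(current - {t})) for t in current),
            key=lambda pair: pair[1],
        )
        current = candidate
        if cand_score < best_score:
            best_subset, best_score = candidate, cand_score
    return best_subset, best_score


# Toy score (hypothetical): a large cost for every missing useful term,
# plus a small complexity penalty per retained term.
useful = {1, 2, 3, 4, 5, 7, 9, 11}


def toy_score(subset: FrozenSet[int]) -> float:
    return 10.0 * len(useful - subset) + 1.0 * len(subset)


best_subset, best_score = backward_stepwise(range(1, 12), toy_score)
```

In this toy setting the procedure discards the three terms that add only complexity and keeps the rest, since deleting any further term raises the score.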

Results
The study began by preprocessing the data to find the epicenter distance and the Peak Ground Acceleration (PGA). The Joyner and Boore attenuation functions in Formulas (8) and (9) were used to find the PGA value [26]. Once the PGA value is known, the calculation and prediction analysis can continue with the MARS method. At this stage, the best MARS model is found by testing the data and selecting the model with the minimum GCV value. The PGA values were obtained by processing earthquake data for Sumbawa from 2000 to 2021, as shown in Table 1. The response variable is PGA, and the predictor variables are depth, magnitude (Mw), and epicenter distance (Repi). The result of selecting the appropriate variable types for the prediction analysis is shown in Table 2. The prediction analysis with the MARS method, using the Forward Stepwise and Backward Stepwise algorithms on combinations of BF, MI, and MO, was run on the training data. The results of the MARS regression on the training data are shown in Table 3.

Testing and Analysis
In predictive analysis, a statistical test is needed to test the hypotheses and determine the significance of the parameters, that is, whether the estimated parameters are consistent with the mathematical model obtained. This research tests the model using a partial regression coefficient test with the following hypotheses:
H0: a_1 = a_2 = a_3 = a_4 = a_5 = a_7 = a_9 = a_11 = 0
H1: there is at least one a_m ≠ 0; m = 1, 2, 3, 4, 5, 7, 9, 11 (significant model)
H0 is rejected if the test statistic exceeds its critical value (..., 61) or if P-value < α. The P-value is the probability, computed under the assumption that H0 is true, of obtaining a test statistic at least as extreme as the one observed, so a small P-value is evidence against H0. As shown in Table 3 (results of training data), every P-value is less than α = 0.05, so H0 is rejected. This means that each coefficient a_1, a_2, a_3, a_4, a_5, a_7, a_9, a_11 has a significant effect in the mathematical model obtained. At the 5% significance level, the mathematical model in Formula (10) is significant and can be used in predictive analysis of the PGA value for earthquake data sets in Sumbawa. Having established the suitability of the parameters and the mathematical model through testing, it is concluded that the variables that affect the PGA value are epicenter distance (Repi), magnitude (Mw), and depth (Depth).

Discussion
Table 3 shows that, of the 11 basis functions formed, basis functions 1, 2, 3, 4, 5, 7, 9, and 11 contribute to the response variable. Basis functions 6, 8, and 10 do not contribute to the response variable, so they are deleted. Simplifying the function at the Backward Stepwise stage yields the mathematical model in Formula (10). Based on the best MARS model, that is, the model with the smallest GCV value, the predictor variables that affect PGA are, in order of percentage contribution, epicenter distance (Repi), magnitude (Mw), and depth (Depth), as shown in Table 4, which describes the contribution of each predictor variable to the response variable. The interaction of these contributions can be seen in Figure 1, a three-dimensional graph of the contribution of the predictor variables to the response variable. Figure 1 shows that the lower the epicenter distance (Repi), the higher the contribution to the response variable, which means that the closer the epicenter, the greater the damage caused by an earthquake. Likewise, the larger the magnitude (Mw), the higher the contribution to the response variable, meaning that the greater the magnitude, the greater the damage caused by the earthquake. After testing and validating the prediction analysis results, the regions in Sumbawa with the highest potential for earthquake hazards can be identified from the highest PGA values in Table 1, namely Mapin Kebak, Mapin Rea, Pulau Panjang, and Pulau Saringi.
The calculated PGA value is influenced by the magnitude, depth, and distance of the earthquake location. In theory, a high PGA value implies a high impact in terms of earthquake damage, although other factors also affect the damage, such as the condition of the bedrock at the location. Policymakers can use the grouping of the areas with the highest earthquake vulnerability in Sumbawa, obtained from the prediction analysis, to set rules for infrastructure development with special specifications in earthquake-prone areas.
A literature search found no earthquake prediction research that specifically maps the areas of Sumbawa that are prone to earthquakes. However, other studies discuss in general terms that Sumbawa Island is an earthquake-prone area. Haryadi explains that an earthquake on Sumbawa Island is very likely because, in the northern part of the island, there are micro tectonic plates that extend from Singaraja, Bali, to Dompu Regency, and there is a hemisphere fracture. A further threat originates from the south, at the bottom of the Indian Ocean, because of the Indo-Australian oceanic plate [28]. This is reinforced by the research of Sabtaji, who found that West Nusa Tenggara Province, including Sumbawa Island, recorded its highest monthly number of earthquakes in August 2018, with 1,658 events [29]. Another study, by Hidhajah, stated that major earthquakes occurred from July to August 2018 and led to food poisoning among refugees in the Alas area, Sumbawa district [30]. The authors' results agree with this previous research in that Sumbawa Island is an area prone to earthquake hazards. In addition, the authors were able to identify which areas of Sumbawa Island have the greatest risk of earthquake hazards.

CONCLUSION
Based on earthquake catalog data from 2000 to 2021, this study analyzed earthquake hazard prediction in Sumbawa using the MARS method with a model involving 11 basis functions. There is a close relationship between the predictor variables and the response variable, with contributions of 100% for epicenter distance and 73.8% for magnitude. Based on the PGA data, the areas with the greatest potential earthquake hazard in Sumbawa are Mapin Kebak, Mapin Rea, Pulau Panjang, and Pulau Saringi. The analysis can be used as a consideration in infrastructure development in Sumbawa to minimize the risk of earthquake hazards. This research can be developed further by increasing the number of predictor variables and the number of basis functions to provide more accurate prediction results.