K-Prototypes Algorithm For Clustering The Tectonic Earthquake In Sulawesi Island

Natural disasters related to tectonic earthquakes occur frequently on Sulawesi Island, mainly in Central Sulawesi and West Sulawesi. This study aims to cluster the tectonic earthquake occurrences from 2017 to 2020. The variables used were magnitude, depth, and distance category. Tectonic earthquake data therefore consist of mixed-type objects with both numeric and categorical attributes. The k-prototypes algorithm was proposed for clustering the data because it can handle data that mix numeric and categorical scales. The study produced four clusters in 2017, six clusters in 2018, five clusters in 2019, and six clusters in 2020. The number of clusters was chosen from the ratio of within-cluster distance to between-cluster distance. The clusters can be related to the active faults on Sulawesi Island. The clusters formed each year are characterized by greater magnitude. The results also showed that the k-prototypes algorithm can properly group the occurrences of tectonic earthquakes on Sulawesi Island.


A. INTRODUCTION
Cluster analysis is a topic in multivariate statistical analysis or statistical learning and is also known as unsupervised learning (Ansori Mattjik and Sumertajaya, 2011). It is the process of collecting n objects into k groups, with k less than n (Ji et al., 2012). Objects with similar characteristics are placed in the same group, while dissimilar objects are placed in different groups. Each group formed is called a cluster (Nooraeni et al., 2021). The similarity between objects is determined from the variables that characterize the observed objects and is measured using the concept of distance: the smaller the distance between two objects, the more similar they are, and vice versa. The most commonly used distance is the Euclidean distance (Dinh et al., 2021).
Pham et al. (2011) further mentioned two main problems that need to be considered in non-hierarchical clustering, namely the number of clusters and the selection of cluster centres, because the clustering results depend on the selected centroids. Another challenge is the type of variable that characterizes the objects (Li et al., 2019). For objects described by numerical variables, similarity is measured by the Euclidean distance, as in the k-means algorithm (Akramunnisa and Fajriani, 2020) (Annas et al., 2022). For objects described by categorical variables, similarity can be measured by counting the mismatches between category levels: the fewer the mismatches, the more similar the objects are, and vice versa. This concept is used in the k-modes algorithm, where the mode is the centroid of a cluster (Mau and Huynh, 2021).
When the object characteristics consist of both numeric and categorical variables, the distance can be defined as a combination of the k-means and k-modes distances (Kuo et al., 2021) (Nooraeni et al., 2021). In this case, the k-prototypes method was proposed because the objects often encountered in real-world databases are mixed-type objects with numeric and categorical attributes (Kuo and Wang, 2022). Furthermore, this method copes with large-scale data better than hierarchical methods (Pham et al., 2011).
JURNAL VARIAN | e-ISSN: 2581-2017
The data produced by tectonic earthquake events on Sulawesi Island are mixed-type objects with numeric and categorical scales. Therefore, this study proposes the k-prototypes algorithm for clustering the data. The study clusters the areas of Sulawesi Island affected by earthquake events based on data on the strength and potential of regional earthquakes in South Sulawesi, so that the results can be taken into consideration when preparing disaster mitigation policies.

B. LITERATURE REVIEW
The k-prototypes algorithm is a partitioning-based clustering method (Pham et al., 2011) (Iriawan et al., 2018). It was developed from the k-means algorithm (Mau and Huynh, 2021) (Ahmad and Dey, 2011) to handle clustering of data with mixed numeric and categorical attributes (Dinh et al., 2021). The development carried out by Huang maintains the efficiency of the k-means algorithm in dealing with large data and can be applied to both numerical and categorical data (Annas et al., 2022). The key development in the k-prototypes algorithm lies in how it measures the similarity between an object and its cluster prototype (Pham et al., 2011). In general, the k-prototypes algorithm is divided into three main stages (Sulastri et al., 2021), as follows:
First, initialization of the prototypes. In this stage, k prototypes are selected randomly from the dataset X, according to the specified number of clusters.
Second, allocation of objects in X to the cluster with the closest prototype. The distance from each object to all prototypes is measured, and the object is placed in the cluster whose prototype is closest. Each time an object has been allocated, the prototype of the related cluster is recalculated.
Third, reallocation of objects if the prototypes change. After all objects in X have been allocated, the distance between every object in X and every existing prototype is measured again. If an object is found to be closer to another prototype, its membership is transferred, and the prototypes of both the old cluster and the new cluster are updated. This process continues until the prototypes no longer change or until the stopping criteria are met.
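The three stages above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in this study: it assumes Huang's mixed dissimilarity (squared Euclidean distance on the numeric attributes plus a weight gamma times the number of categorical mismatches), and it is a simplified batch variant that recomputes the prototypes once per pass over the data rather than after every single allocation. The names `mixed_distance`, `update_prototype`, and `k_prototypes` are hypothetical.

```python
import numpy as np


def mixed_distance(x_num, x_cat, proto_num, proto_cat, gamma):
    """Squared Euclidean distance on numeric attributes plus gamma times
    the number of categorical mismatches (assumed dissimilarity)."""
    numeric_part = float(np.sum((x_num - proto_num) ** 2))
    categorical_part = int(np.sum(x_cat != proto_cat))
    return numeric_part + gamma * categorical_part


def update_prototype(num, cat, members):
    """Prototype = mean of the numeric attributes, mode of the categorical ones."""
    proto_num = num[members].mean(axis=0)
    proto_cat = []
    for j in range(cat.shape[1]):
        values, counts = np.unique(cat[members, j], return_counts=True)
        proto_cat.append(values[np.argmax(counts)])  # ties: first in sorted order
    return proto_num, np.array(proto_cat, dtype=cat.dtype)


def k_prototypes(num, cat, k, gamma, max_iter=100, seed=0):
    """num: (n, p) float array; cat: (n, m) array of category labels."""
    rng = np.random.default_rng(seed)
    n = num.shape[0]

    # Stage 1: pick k distinct objects at random as the initial prototypes.
    start = rng.choice(n, size=k, replace=False)
    protos_num = num[start].astype(float)
    protos_cat = cat[start].copy()

    labels = np.full(n, -1)
    for _ in range(max_iter):
        # Stages 2 and 3: (re)allocate every object to its closest prototype.
        new_labels = np.array([
            np.argmin([
                mixed_distance(num[i], cat[i], protos_num[c], protos_cat[c], gamma)
                for c in range(k)
            ])
            for i in range(n)
        ])
        if np.array_equal(new_labels, labels):
            break  # no object changed cluster, so the prototypes are stable
        labels = new_labels

        # Batch update of the prototypes (simplification, see the lead-in).
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size:  # an empty cluster keeps its previous prototype
                protos_num[c], protos_cat[c] = update_prototype(num, cat, members)
    return labels, protos_num, protos_cat
```

Because the initial prototypes are drawn at random, different seeds can lead to different partitions, which is why the selection of cluster centres is listed among the main problems of non-hierarchical clustering.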

C. RESEARCH METHOD
The data used in this study were records of tectonic earthquake occurrences on the island of Sulawesi from 2017 to 2020. The variables measured were the strength (magnitude) of the earthquake, the depth of the earthquake, and the range (distance category) of the earthquake. The data therefore combine numeric and categorical scales. The data were obtained from the Central Statistics Agency (BPS) of South Sulawesi, North Sulawesi, Central Sulawesi, West Sulawesi, Southeast Sulawesi, and Gorontalo.
The procedure for clustering the data using the k-prototypes algorithm is as follows:
1. Data exploration is carried out to identify the relationships among the variables, visualized using scatterplots and boxplots.
2. Earthquake magnitude and depth are transformed by the formula

$$x^{*} = \frac{x - \bar{x}}{s} \tag{1}$$

where $x^{*}$ is the transformed variable, $x$ is the original variable, $\bar{x}$ is the average, and $s$ is the standard deviation.
3. The k-prototypes clustering is implemented as follows:
(a) Determine k cluster centroids, where k < n and n is the number of samples, as the starting points $C_1, C_2, \ldots, C_k$ on every variable $(X_1, X_2, \ldots, X_p)$.
(b) Calculate the distance, or similarity, between each data point in the data set and the cluster centroids, and group each data point into the cluster whose centroid is closest, as follows:

$$d(X, Y) = \sum_{j=1}^{p} (x_{jn} - y_{jn})^{2} + \gamma \sum_{j=1}^{m} d(x_{jc}, y_{jc}), \qquad d(x_{jc}, y_{jc}) = \begin{cases} 0, & x_{jc} = y_{jc} \\ 1, & x_{jc} \neq y_{jc} \end{cases} \tag{2}$$

where $d(X, Y)$ is the distance or similarity between objects X and Y, p and m are the numbers of numerical and categorical variables respectively, j indexes the jth variable, and the subscripts n and c denote numeric and categorical. The first term is the (squared) Euclidean distance for the numerical characteristics, and the second term is the count of mismatched category levels for the categorical characteristics, where $\gamma$ is a parameter that balances the difference in variable scales.
(c) Calculate the new centroid of each cluster after all objects have been grouped into clusters, and then re-group all objects around the new centroids.
(d) Stop the process when the centroids no longer change, that is, when the algorithm has converged.
4. The optimum number of clusters is selected using diversity values. The value of k is selected using the ratio of the variety of within-cluster distances ($S_W$) to the variety of between-cluster distances ($S_B$).
The ratio is plotted against the number of clusters (k), and the selected k is the one with the greatest change in the ratio. $S_W$ and $S_B$ for the numerical variables are obtained using Equation 3. For the categorical variable, the within and between sums of squares are obtained using Equation 4.

D. RESULT AND DISCUSSION
The relationship between earthquake magnitude and earthquake depth tends to be weak and positive. This can be seen in Figure 1, where the points form a positive pattern but are widely spread in each year. The relationship between the earthquake distance category and the numerical variables, i.e., depth and magnitude, is depicted in Figure 2. It resembles the relationship in Figure 1 (Li et al., 2019).
There is no significant difference in either magnitude or depth across the earthquake distance categories in any year. Therefore, it can be concluded that there is no significant relationship among the variables used. Finally, the scales of earthquake magnitude and depth differ, as depicted in Figure 1 and Figure 2, so both variables are transformed using Equation 1.
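A small worked example may help show why the difference in scale matters. The numbers below are hypothetical and are not taken from the study's data. The example assumes the usual z-score transformation $x^{*} = (x - \bar{x})/s$ and a mixed distance that adds $\gamma$ for each categorical mismatch to the squared numeric differences. Because the mean cancels when two standardized values are subtracted, only the standard deviations are needed.

```latex
% Hypothetical values, for illustration only (not from the study's data).
% Event A: magnitude 4.0, depth 10 km, distance category "local".
% Event B: magnitude 5.0, depth 60 km, distance category "regional".
% Assumed sample standard deviations: s_mag = 0.5, s_depth = 25 km.
\begin{align*}
\text{Raw scale:} \quad
  & (5.0 - 4.0)^2 + (60 - 10)^2 = 1 + 2500 = 2501 \\
\text{Standardized:} \quad
  & \left(\frac{5.0 - 4.0}{0.5}\right)^2 + \left(\frac{60 - 10}{25}\right)^2 = 4 + 4 = 8 \\
\text{Mixed distance:} \quad
  & d(A, B) = 8 + \gamma \cdot 1 = 8 + \gamma
\end{align*}
```

On the raw scale, depth contributes almost all of the numeric distance. After standardization, magnitude and depth contribute equally in this example, and the single category mismatch adds $\gamma$.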

Clustering by K-Prototypes Algorithm
The data were clustered separately for each year. The values of the balancing parameter $\gamma$ for each year were 7.60, 11.43, 14.19, and 14.90. Using these balancing parameters, the optimum k for clustering tectonic earthquakes on Sulawesi Island is shown in Figure 3. In general, the ratio of $S_W$ to $S_B$ fluctuates, so the value of k is chosen based on the largest change in the ratio (Kuo and Wang, 2022).
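The selection rule described here can be sketched as follows. This is only an illustration: it assumes that "largest change" means the largest absolute difference between the ratios of consecutive candidate values of k, and that the k reached after that jump is the one selected. The function name `select_k` and the ratio values are hypothetical and are not the values plotted in Figure 3.

```python
def select_k(ratios):
    """Return the k whose S_W/S_B ratio changed most from the previous k.

    ratios: dict mapping each candidate k to its S_W/S_B ratio.
    Assumption: "change" is the absolute difference between consecutive k.
    """
    ks = sorted(ratios)
    if len(ks) < 2:
        raise ValueError("need at least two candidate values of k")
    # Absolute change in the ratio when moving from the previous k to k.
    changes = {k: abs(ratios[k] - ratios[prev]) for prev, k in zip(ks, ks[1:])}
    return max(changes, key=changes.get)


# Hypothetical ratios for one year (illustration only).
example_ratios = {2: 0.90, 3: 0.85, 4: 0.40, 5: 0.38, 6: 0.35}
best_k = select_k(example_ratios)
print(best_k)
```

In this hypothetical series the ratio drops most sharply between k = 3 and k = 4, so the function returns 4. A different reading of "largest change", such as relative change, would need a different definition of `changes`.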

Interpretation of Cluster
The number of elements in each cluster for each year is shown in Table 1. Based on the number of elements, Cluster 3 is the cluster with the most elements in each year. Conversely, Cluster 6 tends to have the fewest elements in each year. This result means that most tectonic earthquakes occur in Cluster 3, followed by Cluster 2, and the fewest occur in Cluster 6. The accompanying figure shows the magnitude and depth of the tectonic earthquakes in each cluster. In 2017, Cluster 1 and Cluster 2 contain earthquakes of greater magnitude than the others, and Cluster 4 contains the earthquakes of the lowest magnitude. Nevertheless, Cluster 3 and Cluster 4 contain outliers. In 2018, Cluster 2 and Cluster 4 contain earthquakes of greater magnitude than the others, and the lowest magnitudes are in Cluster 3. In 2019, Cluster 1 and Cluster 4 contain earthquakes of greater magnitude than the others, and the lowest magnitudes are in Cluster 2. Finally, in 2020, Cluster 1 and Cluster 6 contain earthquakes of greater magnitude than the others, and the lowest magnitudes are in Cluster 5. In each year, only one cluster contains the deepest earthquakes.
Furthermore, the deepest tectonic earthquakes occurred in Cluster 1 in 2017, Cluster 2 in 2018, Cluster 4 in 2019, and Cluster 6 in 2020. Regarding the distance category, most tectonic earthquakes from 2017 to 2020 were at the regional level. Only one cluster, in 2018, contains local-level earthquakes. This can be related to the tectonic earthquake in Palu and Donggala in September 2018. Figure 5 shows the number of earthquake events, distinguished by local and regional earthquakes. Earthquake events from 2017 to 2020 were generally regional earthquakes, and they generally follow the fault patterns in each region. In 2017, Cluster 2 had the highest earthquake incidence, while in 2018 the highest incidence occurred in Cluster 1. In 2019, the cluster with the highest earthquake incidence was Cluster 3, and Cluster 6 had the highest earthquake occurrence in 2020. Cluster 2 in 2017, Cluster 1 in 2018, Cluster 3 in 2019, and Cluster 4 in 2020 were the highest because each of these clusters consisted of regions in Central Sulawesi.

E. CONCLUSION AND SUGGESTION
The k-prototypes algorithm was used to cluster the data on tectonic earthquakes that occurred on Sulawesi Island from 2017 to 2020. We conclude that this method can cluster the tectonic earthquakes according to depth, strength, and distance category. Applying the k-prototypes algorithm to the Sulawesi Island tectonic earthquake data produced the optimum clusters and the best number of clusters for each year, which makes the strength of tectonic earthquakes easier to interpret. The method is also suitable for clustering mixed-type data with numeric and categorical scales, so it can be used to analyse data with the same characteristics in other research.