Comparison of K-Means and Fuzzy C-Means in Regencies/Cities Grouping Based on Educational Indicators

Cluster analysis aims to classify data based on the similarity of specific characteristics. The methods used in this research are K-Means and Fuzzy C-Means (FCM). K-Means is a partition-based, non-hierarchical data grouping method. FCM is a clustering technique in which the membership of each data point in a cluster is determined by a degree of membership. The purpose of this study is to classify regencies/cities in Kalimantan based on education indicators in 2021 using K-Means and FCM, and to find out which of the two methods is better based on the standard deviation ratio, so that the results can be used efficiently and effectively by the government in decision making to advance the level of education on the island of Kalimantan. Based on the results of the analysis, it is concluded that K-Means is the better method, with a ratio of the standard deviation within clusters to the standard deviation between clusters of 0.6052. It produces an optimal grouping of 2 clusters: the first cluster consists of 14 regencies/cities, while the second cluster consists of 42 regencies/cities in Kalimantan.


A. INTRODUCTION
Cluster analysis is a data mining method for identifying sets of objects that share certain characteristics and can be separated from other clusters. Cluster analysis is divided into two approaches, hierarchical and non-hierarchical (Suyanto, 2018). The hierarchical approach has a weakness: if one of the mergers or splits is carried out in the wrong place, the optimal cluster will not be obtained (Jollyta et al., 2020). The advantage of the non-hierarchical approach is that it can analyze a larger number of samples than the hierarchical approach. Methods included in the non-hierarchical approach are K-Means and Fuzzy C-Means (Triyanto, 2015).
K-Means is a data clustering technique that divides objects into C clusters by allocating each object to the nearest centroid (Siregar, 2016). With K-Means, each data object can be a member of only one cluster, and convergence can be hard to achieve. Therefore, a comparison is made with another clustering method that uses fuzzy logic, in which a data object can lie between two or more clusters (Jang and Sun, 1995). One of the fuzzy grouping algorithms is Fuzzy C-Means (FCM). FCM is an object or data [...]

The stages of data analysis in this study, carried out with the help of the software R, are as follows:
1. Standardize the data.
In cluster analysis, large differences in values between variables can make the distance calculation unstable, so it is necessary to standardize the data by reducing the range of the data (Hidayatullah et al., 2014). One of the algorithms that can be used to standardize data is the Min-Max algorithm, formulated as follows (Suyanto, 2018):
$$x'_{i,k} = \frac{x_{i,k} - x_{k\min}}{x_{k\max} - x_{k\min}}$$
description:
$x'_{i,k}$ : standardized value of the i-th data for the k-th variable
$x_{i,k}$ : i-th data of the k-th variable
$x_{k\min}$ : minimum data of the k-th variable
$x_{k\max}$ : maximum data of the k-th variable
2. Detect multicollinearity.
Cluster analysis has two assumptions: a representative sample and non-multicollinearity (Ghozali, 2016). Multicollinearity is a situation where there is a strong linear relationship between variables. One way to detect multicollinearity is to look at the Variance Inflation Factor (VIF) value.
3. Group the observed objects using the K-Means method.
K-Means is a partition-based method that separates the data into C distinct clusters by assigning each object to the nearest centroid. The centroid is the average value of the variables over all objects in a cluster. The results of K-Means clustering depend on the initial centroid values: different initial values can produce different groups. The steps of the K-Means method are as follows (Kakushadze and Yu, 2017):
(a) Determine the number of clusters (C).
(b) Determine the centroids $v_{c,k}$ randomly from the observed objects.
(c) Calculate the Euclidean distance from each observed object to each centroid:
$$d(x_i, v_c) = \sqrt{\sum_{k} (x_{i,k} - v_{c,k})^2}$$
description:
$d(x_i, v_c)$ : Euclidean distance between the i-th observation data and the center of cluster c
$v_{c,k}$ : centroid of cluster c on the k-th variable
(d) Assign each object to the cluster with the most similar objects, based on the closest distance from the object to each centroid.
(e) Update the centroid by calculating the average value of the objects in each cluster:
$$v_{c,k}^{(t)} = \frac{1}{n_c} \sum_{i=1}^{n_c} x_{i,k,c}$$
description:
$v_{c,k}^{(t)}$ : centroid of cluster c on the k-th variable in the t-th iteration
$x_{i,k,c}$ : standardized value of the i-th data for the k-th variable in cluster c
$n_c$ : number of data in cluster c
(f) Repeat steps (c), (d), and (e) until no object changes cluster.
4. Group the observed objects using the Fuzzy C-Means method.
Fuzzy C-Means (FCM) is a clustering method in which the membership of each data point in a cluster is determined by a degree of membership. FCM begins by determining the centroids, which mark the average location of each cluster. By repeatedly correcting the centroids and the degree of membership of each data point, the centroids move toward the right location. Because of the degree of membership, data points can belong to more than one cluster.
The steps of the FCM method are as follows (Kusuma et al., 2015):
(a) Determine the number of clusters (C), the weighting exponent (m), the maximum number of iterations (MaxIter), the smallest expected error ($\varepsilon$), and the initial objective function ($P_0 = 0$), which is the function to be optimized.
(b) Generate random numbers $\mu_{ic}$ as the elements of the initial membership matrix U.
(c) Calculate the center of cluster c with the following equation:
$$v_{c,k} = \frac{\sum_{i} (\mu_{ic})^m \, x_{i,k}}{\sum_{i} (\mu_{ic})^m}$$
(d) Calculate the objective function in the t-th iteration with the following equation:
$$P_t = \sum_{i} \sum_{c} \left( \left[ \sum_{k} (x_{i,k} - v_{c,k})^2 \right] (\mu_{ic})^m \right)$$
(e) Calculate the change in the membership matrix with the following equation:
$$\mu_{ic} = \frac{\left[ \sum_{k} (x_{i,k} - v_{c,k})^2 \right]^{-1/(m-1)}}{\sum_{j=1}^{C} \left[ \sum_{k} (x_{i,k} - v_{j,k})^2 \right]^{-1/(m-1)}}$$
(f) Check the stopping condition: if $|P_t - P_{t-1}| < \varepsilon$ or t > MaxIter, then stop. If not, set t := t + 1 and repeat steps (c) to (e).
5. Determine the best method based on the value of the standard deviation ratio.
A grouping method is said to perform well if it has a minimum ratio of the standard deviation within clusters to the standard deviation between clusters. The standard deviation within clusters ($S_w$) and the standard deviation between clusters ($S_b$) are calculated following Barakbah and Arai (2004), where C is the number of clusters.
6. Interpret the best grouping results based on the value of the smallest standard deviation ratio.

JURNAL VARIAN | e-ISSN: 2581-2017
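The Min-Max standardization in step 1 can be illustrated with a short sketch. The study itself uses R; Python is used here only for illustration, and the function name `min_max_standardize` and the toy data are hypothetical, not taken from the paper. Mapping a constant column to 0.0 is an assumption of this sketch, since the formula is undefined when the maximum equals the minimum.

```python
def min_max_standardize(rows):
    """Rescale every variable (column) of `rows` to the range [0, 1]."""
    columns = list(zip(*rows))            # transpose: one tuple per variable
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    scaled = []
    for row in rows:
        scaled.append([
            # Assumption: a constant column (zero range) is mapped to 0.0.
            (x - lo) / (hi - lo) if hi > lo else 0.0
            for x, lo, hi in zip(row, lows, highs)
        ])
    return scaled


# Hypothetical toy data: 3 observations, 2 variables on very different scales.
raw = [[10.0, 200.0], [15.0, 400.0], [20.0, 300.0]]
standardized = min_max_standardize(raw)
print(standardized)  # [[0.0, 0.0], [0.5, 1.0], [1.0, 0.5]]
```

After rescaling, both variables contribute on the same [0, 1] scale to any distance computed between observations.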

K-Means Clustering

1. Determining the Number of Clusters
In this study, the numbers of clusters used are C = 2, 3, 4, 5, and 6. The calculations are illustrated here using C = 2.

Calculating the Distance of All Observational Data with the Initial Centroid
The complete calculation results can be seen in Table 6.

Based on Table 7, the Euclidean distance from the 1st observation to the center of cluster 1 is smaller than its Euclidean distance to the center of cluster 2, so the 1st observation is assigned to cluster 1, and so on up to the 56th observation. Based on the placement results, cluster 1 consists of 24 regencies/cities, while cluster 2 consists of 32 regencies/cities.
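The distance calculation and placement described above can be sketched as follows. This is only an illustration in Python (the study uses R); the helper names and the toy values are hypothetical and are not the study's data. Cluster indices are 0-based in the code, so index 0 corresponds to cluster 1 in the text.

```python
import math


def euclidean(x, v):
    """Euclidean distance between an observation x and a centroid v."""
    return math.sqrt(sum((xk - vk) ** 2 for xk, vk in zip(x, v)))


def assign_to_nearest(data, centroids):
    """Return, for each observation, the index of its nearest centroid."""
    labels = []
    for x in data:
        distances = [euclidean(x, v) for v in centroids]
        labels.append(distances.index(min(distances)))  # first minimum wins ties
    return labels


# Hypothetical standardized observations and two initial centroids.
observations = [[0.1, 0.2], [0.9, 0.8], [0.2, 0.1]]
initial_centroids = [[0.0, 0.0], [1.0, 1.0]]
print(assign_to_nearest(observations, initial_centroids))  # [0, 1, 0]
```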

Updating the Centroid
The results of the calculation of the new centroids can be seen in Table 8:

Variable   Cluster 1   Cluster 2
X1         0.314       0.418
X2         0.278       0.483
X3         0.564       0.353
X5         0.305       0.234
X9         0.347       0.189
X11        0.265       0.155

Based on the calculation results in Table 8, there is a difference between the new centroids and the previous centroids, so the grouping continues to the next iteration.

6. Repeat steps c, d and e until there is no change in the centroid from the previous centroid
Based on the calculation results, the clustering stops at the 5th iteration, where there is no change in cluster membership, so the new centroids are the same as the old centroids. The results of grouping with the K-Means method with C = 2 can be seen in Table 9.
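The update-and-repeat loop described here can be sketched in Python as below. This is a minimal illustration under stated assumptions, not the R code used in the study: the function name `kmeans`, the choice to keep the old centroid when a cluster becomes empty, and the toy data are all hypothetical.

```python
import math


def kmeans(data, initial_centroids, max_iter=100):
    """Iterate assignment and centroid update until no object changes cluster."""
    centroids = [list(v) for v in initial_centroids]  # do not mutate the input
    labels = None
    for _ in range(max_iter):
        # Assignment: each object goes to the centroid with the smallest distance.
        new_labels = [
            min(range(len(centroids)), key=lambda c: math.dist(x, centroids[c]))
            for x in data
        ]
        if new_labels == labels:  # membership unchanged, so centroids are stable
            break
        labels = new_labels
        # Update: each centroid becomes the mean of its current members.
        for c in range(len(centroids)):
            members = [x for x, lab in zip(data, labels) if lab == c]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical toy data; the first two objects serve as initial centroids.
points = [[0.0, 0.0], [0.0, 1.0], [4.0, 4.0], [5.0, 5.0]]
final_labels, final_centroids = kmeans(points, points[:2])
print(final_labels)     # [0, 0, 1, 1]
print(final_centroids)  # [[0.0, 0.5], [4.5, 4.5]]
```

In this toy run the second object changes cluster once the centroids are updated, which is why the loop must repeat until the membership stops changing.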

Generate Random Numbers
Generate random numbers $\mu_{ic}$ as the elements of the initial membership matrix U. The initial membership values are presented in Table 14.
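For illustration, the FCM procedure, starting from a random membership matrix, can be sketched in Python as below. The sketch assumes the commonly used FCM update rules (membership-weighted cluster centers, an objective function built from squared Euclidean distances, and memberships proportional to the distances raised to the power -1/(m-1)). The function name `fcm`, the default parameter values, the small constants that guard against division by zero, and the toy data are assumptions of this sketch, not values from the study.

```python
import random


def fcm(data, n_clusters, m=2.0, max_iter=100, eps=1e-5, seed=0):
    """Fuzzy C-Means sketch returning (membership matrix, cluster centers)."""
    rng = random.Random(seed)
    n, n_vars = len(data), len(data[0])
    # Random initial membership matrix U; each row is normalized to sum to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-12 for _ in range(n_clusters)]
        total = sum(row)
        u.append([value / total for value in row])
    prev_obj = 0.0  # P_0 = 0
    centers = []
    for _ in range(max_iter):
        # Cluster centers: means weighted by membership to the power m.
        centers = []
        for c in range(n_clusters):
            weights = [u[i][c] ** m for i in range(n)]
            centers.append([
                sum(w * data[i][k] for i, w in enumerate(weights)) / sum(weights)
                for k in range(n_vars)
            ])
        # Squared Euclidean distance of every object to every center.
        d2 = [[sum((data[i][k] - centers[c][k]) ** 2 for k in range(n_vars))
               for c in range(n_clusters)] for i in range(n)]
        # Objective function for this iteration.
        obj = sum(d2[i][c] * u[i][c] ** m
                  for i in range(n) for c in range(n_clusters))
        # Membership update; distances are floored to avoid division by zero.
        power = -1.0 / (m - 1.0)
        for i in range(n):
            inv = [max(d2[i][c], 1e-12) ** power for c in range(n_clusters)]
            total = sum(inv)
            u[i] = [value / total for value in inv]
        # Stop when the objective function changes by less than eps.
        if abs(obj - prev_obj) < eps:
            break
        prev_obj = obj
    return u, centers


# Hypothetical toy data: two well-separated pairs of observations.
toy = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]]
membership, cluster_centers = fcm(toy, n_clusters=2, eps=1e-9)
for row in membership:
    print([round(value, 3) for value in row])
```

Each printed row holds the degrees of membership of one observation, and every row sums to 1, so an observation can belong partly to both clusters.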

Standard Deviation Ratio
The ratio of the standard deviation within groups ($S_w$) to the standard deviation between groups ($S_b$) for the FCM method with C = 2, 3, 4, 5, and 6 is calculated using equations (8) to (13). The complete calculation results can be seen in Table 23.

Based on Table 23, K-Means grouping with C = 2 has a smaller standard deviation ratio than the groupings with C = 3, 4, 5, and 6, which shows that the grouping with C = 2 is better than the groupings with C = 3, 4, 5, and 6. For FCM, the grouping with C = 6 has a smaller standard deviation ratio than the groupings with C = 2, 3, 4, and 5, which shows that the grouping with C = 6 is better than the groupings with C = 2, 3, 4, and 5.
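A rough sketch of how such a ratio could be computed is given below. Equations (8) to (13) are not reproduced in this text, so the sketch assumes one common form: $S_c$ is the average over variables of the sample standard deviation within cluster c, $S_w$ is the average of the $S_c$, $\bar{x}_c$ is the average over variables of the cluster means, and $S_b = \sqrt{\frac{1}{C-1}\sum_c (\bar{x}_c - \bar{x})^2}$. These formulas, the function name `std_ratio`, and the toy data are assumptions and may differ from the paper's exact definitions.

```python
import math
from statistics import mean, stdev


def std_ratio(data, labels):
    """Ratio S_w / S_b under the assumed definitions described above."""
    clusters = sorted(set(labels))
    s_c, xbar_c = [], []
    for c in clusters:
        members = [x for x, lab in zip(data, labels) if lab == c]
        columns = list(zip(*members))  # one tuple per variable
        # Assumption: a single-member cluster has a standard deviation of 0.0.
        s_c.append(mean([stdev(col) if len(col) > 1 else 0.0 for col in columns]))
        xbar_c.append(mean([mean(col) for col in columns]))
    s_w = mean(s_c)                    # standard deviation within clusters
    grand = mean(xbar_c)               # average over all clusters
    s_b = math.sqrt(sum((xb - grand) ** 2 for xb in xbar_c) / (len(clusters) - 1))
    return s_w / s_b                   # undefined if all cluster averages coincide


# Hypothetical toy data with a compact grouping and a poorer one.
sample = [[0.0, 0.0], [0.2, 0.2], [1.0, 1.0], [0.8, 0.8]]
good = std_ratio(sample, [0, 0, 1, 1])
poor = std_ratio(sample, [0, 1, 1, 1])
print(round(good, 4), round(poor, 4))
```

A smaller value indicates clusters that are tight internally and far apart from each other, which is the criterion used to compare the groupings.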

Best Method Interpretation
The standard deviation ratio of K-Means is smaller than that of FCM, which indicates that the K-Means method is more appropriate for grouping regencies/cities in Kalimantan based on education indicators. After the groups are formed, the next step is to calculate the average value of all variables for each cluster. The results can be seen in Table 24.

Based on the best grouping results, cluster 1 consists of 14 regencies/cities, 3 of which are provincial capitals, and several of its members are classified as big cities in Kalimantan, such as Balikpapan. Cluster 2 consists of 42 regencies/cities on the island of Kalimantan; most of its members have variable averages smaller than those of the members of cluster 1. This can be seen from Table 24, where the averages of the expected length of schooling (X1), the number of elementary schools (X3), the number of high schools (X5), the number of elementary students (X9), and the number of senior high school students (X11) in cluster 2 are smaller than those in cluster 1. However, the average length of schooling (X2) for cluster 2 has a better value than for cluster 1.
The findings of this study can serve as information for government agencies interested in making policies related to education indicators on the island of Kalimantan, especially for the regencies/cities in cluster 2, and as evaluation material for increasing the level of education in Kalimantan. Compared with the research conducted by Ls et al. (2021), which grouped regions based on educational indicators using the Ward method, this study compared two methods, K-Means and FCM, so that more varied results were obtained and the more effective grouping method could be identified. Similar to the research conducted by Putri and Dwidayati (2021), grouping with the K-Means method gave better results than the FCM method.

D. CONCLUSION AND SUGGESTION
Based on the results of the research and discussion, the following conclusions can be drawn. Grouping with K-Means using C = 2, 3, 4, 5, and 6 shows, based on the standard deviation ratio, that K-Means with C = 2 gives better grouping results than the other values of C. Grouping with FCM using C = 2, 3, 4, 5, and 6 shows, based on the standard deviation ratio, that FCM with C = 6 gives better grouping results than the other values of C. Since the standard deviation ratio of the K-Means method is 0.605 while that of the FCM method is 0.624, it can be concluded that the better method for grouping regencies/cities in Kalimantan based on the 2021 education indicators is K-Means with C = 2 clusters.

Figure 1. Flowchart of research analysis steps

description:
$S_w$ : standard deviation within cluster
$S_{c,k}$ : standard deviation of cluster c on variable k
$\bar{x}_{c,k}$ : average of cluster c on variable k
$S_c$ : standard deviation of cluster c
$S_b$ : standard deviation between clusters
$\bar{x}_c$ : average of cluster c
$\bar{x}$ : average of all clusters
C : number of clusters

Table 1. Variables Used

Table 6. Euclidean Distance to Each Initial Centroid

4. Placing Observation Data to the Nearest Centroid
The results of data allocation can be seen in Table 7.

Table 7. Results of Placement of Each Data to the Nearest Centroid

Table 14. Initial Membership Value

Table 16. Updated Membership Value

Table 23. Standard Deviation Ratio of K-Means and FCM

Table 24. Variable Average of Each Cluster