Application of KNN Machine Learning and Fuzzy C-Means to Diagnose Diabetes

Disease is common in humans and can affect anyone regardless of age. A disease may begin at a mild level and progress until it is severe enough to carry a risk of death. This study addresses the early diagnosis of diabetes, a condition in which the sufferer's blood sugar level is above normal. Symptoms experienced by sufferers include frequent thirst, frequent urination, frequent hunger


INTRODUCTION
Matrik: Jurnal Managemen, Teknik Informatika, dan Rekayasa Komputer

... the performance of the Bayesian regularization neural network method with diabetes data, in relation to the number of neurons used in the hidden layer, which can affect the accuracy of the classification results (the more neurons, the more accurate). This is because changing the number of neurons in the hidden layer also changes the network structure of the RBNN method, so the results may or may not be optimal [29]. The current research, in contrast, carries out the early diagnosis of diabetes with the KNN and Naïve Bayes methods and then implements it in a web-based system.

Compared with the previous studies discussed above, this research differs in that it updates the method used and aims to increase prediction accuracy using a more dominant approach, known as data mining. In short, this research differs not only in the prediction method used but also in building a web-based application, which previous researchers did not do. The web application allows any layperson to predict diabetes with an expert system, without the need to involve experts (medical doctors). This research has useful implications for the general public and for the medical world, especially doctors and nurses, in predicting diabetes with a web-based electronic system (Android web).
The remainder of this article is organized as follows. The next section briefly describes the research methodology used in this study. The third section presents the results and discussion: it explains the design of the website-based application interface, the testing of the application that was built, and the results achieved. Finally, the Conclusion section summarizes the findings of the research.

RESEARCH METHOD
This study was carried out in several stages, which are shown in Figure 1:

Identification of problems
Problem identification is the step of finding out the problems related to diabetes. In this stage, information on diabetes is collected, including the symptoms reported by patients and by doctors who specialize in diabetes. This information is used to build the expert system with machine learning.

Data collection
Data were collected through interviews and observations involving the relevant parties: experts for the symptoms and patients for the diagnoses. The interviews and observations were conducted to obtain the data used in the diagnosis. The interviews with the experts (staff at the Tanjung Health Center) yielded several symptoms, namely fatigue, difficult-to-heal wounds, blurred vision, frequent hunger and thirst, and a history of heredity [30].

Data Preprocessing
Preprocessing is one of the stages of the data mining process, in which raw data are converted into a form that is easier to work with. In this study, the preprocessing stage is divided into several processes, including data weighting and dataset formation.

Expert System Design and Testing
The design and testing phase covers the design of the system being built and the evaluation of its performance. A system built with machine learning must perform accurately, both in how the system runs and in the results that its methods produce. The accuracy test therefore aims to determine how accurately the system diagnoses diabetes. The data used to diagnose diabetes consist of 120 patient records with five symptoms.

RESULT AND ANALYSIS

Problem Identification
This stage aims to acquire the knowledge data that the system requires. The knowledge obtained is used to work out the programming logic for diagnosing diabetes.

Data collection
The data have two types of attributes, and they were determined from two sources: interviews and lab test results. The interviews provided information about how a person can get diabetes and how the examination is carried out in the lab. Lab tests were used to confirm whether a patient was positive for diabetes and to collect information on the factors that lead to diabetes. The interview process produced five questions related to diabetes symptoms, answered for 120 data records. Table 1 shows the symptom survey instrument given to the patients, while Table 2 contains the symptom data obtained from the survey.

The data weighting stage assigns a value to each attribute. The weighting results are used in the calculations performed by the system. The attributes used are: often feeling tired, difficult-to-heal wounds, blurred vision, often feeling hungry (polyphagia), and history of heredity. The weighting levels are given in Table 3, which lists the symptom weighting scores that the expert assigned for the certainty factor. A score of 0 indicates that the user does not experience the symptom. An answer of rare is given a score of 1, an answer of often a score of 2, and an answer of very often a score of 3.
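The weighting step described above can be sketched in Python as follows. This is only an illustration: the names `ANSWER_SCORES` and `weight_answers` are hypothetical, and the answer label "never" for a score of 0 is an assumption, since the text only states that a score of 0 means the symptom is not experienced.

```python
# Score for each answer level, following the weighting levels described for Table 3.
# The label "never" is an assumed wording for "does not experience the symptom".
ANSWER_SCORES = {
    "never": 0,
    "rare": 1,
    "often": 2,
    "very often": 3,
}


def weight_answers(answers):
    """Convert a list of answer labels into a list of numeric weights."""
    weights = []
    for answer in answers:
        key = answer.strip().lower()
        if key not in ANSWER_SCORES:
            raise ValueError(f"unknown answer level: {answer!r}")
        weights.append(ANSWER_SCORES[key])
    return weights


# Hypothetical answers to the five symptom questions.
example = ["often", "very often", "often", "very often", "rare"]
print(weight_answers(example))
```

Keeping the mapping in a single dictionary makes it easy to check the scores against the expert's weighting table and to reject answers that fall outside the four levels.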

Formation of datasets
Dataset formation in this study is the process of converting the data into sentences and then into weights, so that they can be processed with the KNN and Fuzzy C-means methods. The weighting results are shown in Table 4.

Table 4. Results of data weighting
No   Tired   Wounds   Vision   Hunger   Heredity   Class
1    2       3        2        3        1          Diabetes
2    3       0        1        1        0          Negative
3    0       1        1        0        0          Negative
4    3       3        2        3        1          Diabetes
5    3       2        3        3        1          Diabetes
6    2       1        3        3        1          Diabetes
7    1       1        1        0        0          Negative
8    2       3        2        3        1          Diabetes
9    0       0        2        0        0          Negative
10   1       0        1        0        0          Negative
11   3       2        2        3        1          Diabetes
12   2       3        3        3        1          Diabetes
13   0       1        1        1        0          Negative
14   2       3        3        ...

Table 4 contains the results of weighting the data. The weighted values are the weights of the answers to the questions given to the patients, so that calculations can be performed with KNN and fuzzy c-means. The weighted records are then divided into two classes, namely negative and diabetes.

Testing the Machine Learning Method with KNN
The performance of the machine learning method is tested with KNN on sample diabetes patient case data from Table 4. The steps are as follows:
1. Determine the number of neighbors (k) that will be considered when determining the class. For example, suppose k = 3.
2. Calculate the distance between the test data and the training data with the Euclidean distance method. The training data consist of 30 records, and the distance from the test data to each record is calculated individually.
3. Retrieve the data with the shortest distances. After the Euclidean distances are calculated, the records are sorted from the smallest distance and the k = 3 closest are taken. The resulting nearest neighbors are records 11, 19, and 28, as shown in Table 5.

Table 5. The three nearest neighbors
No   Tired   Wounds   Vision   Hunger   Heredity   Class
11   3       2        2        3        1          Diabetes
19   3       3        1        3        1          Diabetes
28   3       3        1        3        1          Diabetes

4. Define the class. From the data in Table 5, it can be concluded that the patient in the test data has diabetes, because all 3 records with the closest distances belong to the Diabetes class.
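The four KNN steps above can be sketched in plain Python as follows. This is a minimal sketch, not the authors' implementation (which was built in PHP): the function names `euclidean` and `knn_predict` are hypothetical, the training list holds only the 13 complete rows of Table 4 rather than the 30 training records, and the test sample is an invented example because the original test record is not given.

```python
import math
from collections import Counter

# Rows 1-13 of Table 4: five symptom weights followed by the class label.
TRAINING = [
    ([2, 3, 2, 3, 1], "Diabetes"),
    ([3, 0, 1, 1, 0], "Negative"),
    ([0, 1, 1, 0, 0], "Negative"),
    ([3, 3, 2, 3, 1], "Diabetes"),
    ([3, 2, 3, 3, 1], "Diabetes"),
    ([2, 1, 3, 3, 1], "Diabetes"),
    ([1, 1, 1, 0, 0], "Negative"),
    ([2, 3, 2, 3, 1], "Diabetes"),
    ([0, 0, 2, 0, 0], "Negative"),
    ([1, 0, 1, 0, 0], "Negative"),
    ([3, 2, 2, 3, 1], "Diabetes"),
    ([2, 3, 3, 3, 1], "Diabetes"),
    ([0, 1, 1, 1, 0], "Negative"),
]


def euclidean(a, b):
    """Euclidean distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(sample, training, k=3):
    """Return the majority class among the k training records nearest to sample."""
    # Steps 2 and 3: compute every distance, sort ascending, keep the k nearest.
    nearest = sorted(training, key=lambda rec: euclidean(sample, rec[0]))[:k]
    # Step 4: the class is decided by a majority vote among the neighbors.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical test sample (not taken from the paper).
print(knn_predict([3, 2, 2, 3, 1], TRAINING, k=3))
```

An odd value of k, such as the k = 3 used in the text, avoids tied votes when there are only two classes.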

Trial of the Fuzzy C-means Machine Learning Method
The fuzzy c-means calculation uses the data in Table 4. The initial steps in calculating fuzzy c-means are [31,32]:
1. Determine the number of clusters (c), the rank or weighting exponent (w), the maximum number of iterations (MaxIter), the smallest expected error (ε), the initial objective function (P0), and the initial iteration. The values used are given in Table 6.
2. Determine the initial cluster memberships randomly, such that the memberships of each data record sum to 1.
3. Carry out the 1st iteration.

The trial data are the diabetes patient data in Table 7, 120 records in total, which come from patient data at the health center and are processed with the predetermined parameters. In the calculation, the iteration stops after 5 iterations, when the objective function fulfills the stopping criterion of 0.001, so the iteration process can be stopped. In the final results, cluster 1 contains 54 records and cluster 2 contains 64, as shown in Table 8.
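The fuzzy c-means procedure outlined above can be sketched in plain Python as follows. The source does not spell out the update formulas, so this sketch assumes the standard fuzzy c-means updates (membership-weighted cluster centers, and memberships proportional to the inverse squared distance raised to 1/(w - 1)). The function name `fuzzy_c_means`, the `seed` parameter, and the small `SAMPLE` list (rows 1 to 4 of Table 4) are illustrative assumptions, not part of the original system.

```python
import random

# Rows 1-4 of Table 4, used here only as a small illustrative input.
SAMPLE = [
    [2, 3, 2, 3, 1],
    [3, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
    [3, 3, 2, 3, 1],
]


def fuzzy_c_means(data, c=2, w=2.0, max_iter=100, eps=0.01, seed=0):
    """Return (memberships, centers, iterations); w must be greater than 1."""
    rng = random.Random(seed)
    n, dim = len(data), len(data[0])
    # Random initial memberships, normalised so that each row sums to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(c)]
        total = sum(row)
        u.append([v / total for v in row])
    prev_obj = 0.0  # initial objective function P0
    for it in range(1, max_iter + 1):
        # Cluster centers: mean of the data weighted by membership ** w.
        centers = []
        for k in range(c):
            wts = [u[i][k] ** w for i in range(n)]
            denom = sum(wts)
            centers.append([sum(wts[i] * data[i][j] for i in range(n)) / denom
                            for j in range(dim)])
        # Squared distance from every record to every center.
        d2 = [[sum((data[i][j] - centers[k][j]) ** 2 for j in range(dim))
               for k in range(c)] for i in range(n)]
        # Objective function for this iteration.
        obj = sum((u[i][k] ** w) * d2[i][k] for i in range(n) for k in range(c))
        # Membership update; a record lying exactly on a center belongs to it fully.
        for i in range(n):
            zeros = [k for k in range(c) if d2[i][k] == 0.0]
            if zeros:
                u[i] = [1.0 / len(zeros) if k in zeros else 0.0 for k in range(c)]
            else:
                inv = [d2[i][k] ** (-1.0 / (w - 1.0)) for k in range(c)]
                s = sum(inv)
                u[i] = [v / s for v in inv]
        # Stop when the objective function changes by less than eps.
        if abs(obj - prev_obj) < eps:
            break
        prev_obj = obj
    return u, centers, it


memberships, centers, iterations = fuzzy_c_means(SAMPLE, c=2, w=2.0, eps=0.01)
# Assign each record to the cluster with its highest membership.
print([row.index(max(row)) + 1 for row in memberships], iterations)
```

Because the initial memberships are random, the cluster numbering (which group becomes cluster 1) can change with the seed, so cluster labels must be matched to the diabetes and negative classes after the run.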

System Development and Expert System Testing
The expert system design stage in this study produces an initial design for modeling, and the model is built based on the data that have been collected. System design covers the use case diagram, the data flow diagram (DFD), the database design, and the flowcharts on which the application program is built. After the design stage, programming is carried out, which implements the system design in a computer programming language. The application is built with PHP, and the database is created with MySQL. The expert system application is stored on a server computer so that it can be accessed from anywhere [33]. To get the expected results, the next step is to acquire knowledge to obtain the required knowledge data. The knowledge gained from data acquisition is useful in working out the programming logic for diagnosing diabetes.

Figure 4 illustrates the Data Flow Diagram (DFD), which shows where data originate and where they are processed in the expert system that was built. The context diagram in Figure 3 shows the data flow of the system as a whole. In contrast, the use case description in Fig. 2 shows in more detail the flows that the system performs and its interaction with external data. The flowchart in Figure 5 shows the overall sequence of processes in the expert system built in this study. It describes in more detail how each step of the procedure is carried out in building a machine learning expert system that can diagnose the user and the type of drug used by the user.

Figure 5 shows the page for making a diagnosis. This page consists of fields for the patient's name and for the symptoms: often feeling tired, wounds that are difficult to heal, blurred vision, frequent hunger, and family history. Before filling in the symptoms, the user first enters the value of K. The value of K determines how many of the closest records are used to decide the final value or class prediction. In the Fuzzy C-means process, by contrast, the value of K does not need to be determined, as shown in Figure 7.

Figure 7 shows the page for diagnosis with fuzzy c-means. On this page, the user fills in the number of clusters, the maximum number of iterations, the weighting exponent, and epsilon. All fields are mandatory; for example, a search for 3 clusters requires entering 3 as the number of clusters. The maximum number of iterations is 100, although it can be set lower or higher, epsilon (the smallest error) is 0.01, and the weighting exponent is 2. From this process, cluster 1 contains 54 records and cluster 2 contains 64.

Method Accuracy Testing
Accuracy was calculated with the Confusion Matrix. The comparison shows that the Fuzzy C-Means method, with an accuracy of 96%, performs better than the K-Nearest Neighbor method, with an accuracy of 86.33%. The accuracy values are compared in Table 9. Based on Table 9, Fuzzy C-Means obtains the largest value and therefore has the highest accuracy of the two methods. Table 10 compares the work in this article with several related previous works and shows the aspects of this study that have not been addressed by other researchers before.
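As an illustration of how accuracy is derived from a confusion matrix, the following Python sketch counts the four cells and computes accuracy as the share of correct predictions, (TP + TN) / (TP + TN + FP + FN). The function names and the ten labels are hypothetical and are not the paper's test data.

```python
def confusion_counts(actual, predicted, positive="Diabetes"):
    """Return (tp, tn, fp, fn), treating `positive` as the positive class."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1
        elif a != positive and p != positive:
            tn += 1
        elif a != positive and p == positive:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def accuracy_percent(tp, tn, fp, fn):
    """Accuracy in percent: correct predictions divided by all predictions."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    return 100.0 * (tp + tn) / total


# Hypothetical labels for ten test records (not taken from the paper).
actual = ["Diabetes"] * 5 + ["Negative"] * 5
predicted = ["Diabetes"] * 4 + ["Negative"] + ["Negative"] * 4 + ["Diabetes"]
print(accuracy_percent(*confusion_counts(actual, predicted)))
```

Accuracy alone treats a missed diabetes case and a false alarm as equally costly, so the individual cells of the matrix are worth reporting alongside it.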

CONCLUSION
The research on diagnosing diabetes with the K-Nearest Neighbor and Fuzzy C-Means methods leads to the following conclusions. In testing, the K-Nearest Neighbor method obtained a highest accuracy of 83.33%, while the Fuzzy C-Means method obtained an accuracy of 96%. Based on this comparison of accuracy and working methods, it can be concluded that the Fuzzy C-Means method is better than the K-Nearest Neighbor method. Because Fuzzy C-Means has the highest value in this study, the application is developed with Fuzzy C-Means.
The novelty of this research lies in the research method used, which differs from previous research. Another novelty is the web-based application, which facilitates the work of medical experts and can be used by ordinary people to predict accurately whether someone has diabetes. The results of this study improve on the accuracy of diabetes predictions reported by previous researchers.

This study is limited to predicting diabetes with KNN and Fuzzy C-means, although many other data mining methods exist. Therefore, further research is suggested with other methods, such as random forest, SVM, ANN, and C4.5, as well as with other types of diseases.

ISSN: 2476-9843

FUNDING STATEMENT
This research did not receive grants from any source, including the public, commercial, or non-profit sectors.

COMPETING INTEREST
In this study, there are no conflicts related to competing financial, public, or institutional interests.