K Value Effect on Accuracy Using the K-NN for Heart Failure Dataset

Heart failure falls under the category of cardiovascular disease. Heart disease is not easy to detect, and its detection must be performed by experienced and skilled medical professionals. Most patients with heart failure require hospitalization. Common symptoms of heart disease, such as chest pain and high or low blood pressure, vary from person to person. This study aims to find the most optimal k value based on the accuracy obtained by testing different k values, namely 1, 3, 5, 7, and 9.

A previous study proposed using machine learning to predict heart failure. That research aimed to apply various machine-learning methods to the resulting data. The data collected included heart rate, blood pressure, blood oxygen level, and other variables to assist in diagnosing heart disease. The collected data were analyzed using the K-NN method. The results of that study indicate that IoT and machine learning can significantly contribute to the diagnosis of heart failure: patients can be monitored, and changes in cardiac condition can be detected earlier [14].
Based on the previous discussion, the author examines the most optimal k value based on the accuracy obtained by testing different k values, namely 1, 3, 5, 7, and 9. The k value in the K-NN algorithm greatly influences the results obtained. Therefore, this study focuses on finding which of the five k values yields the highest accuracy for the K-NN model. The k values are tested using 80% training data and 20% testing data to build the model. Heart failure is the object of this study. The heart failure dataset uses two targets: normal or indicated heart failure.

RESEARCH METHOD
This research was carried out following the framework shown in Figure 1. The research began with acquiring the research data, then preprocessing the data, splitting the data, calculating the shortest distance using the Manhattan distance, testing the value of k, and finally analyzing the accuracy results. Figure 1 presents the stages for obtaining the best k value in this study, which proceed as follows. First, the data is acquired through Kaggle.com. Second, the dataset is preprocessed so that the data is easier to process and errors are avoided when the data is executed. Preprocessing is carried out in three stages: data cleaning, transformation, and normalization. Data cleaning is the initial step of the preprocessing stage; it checks the dataset, because a dataset that contains noisy or missing values will not achieve maximum accuracy when processed. Data transformation follows data cleaning; it converts the scale of measurement of the data into another form so that all values are aligned with the numeric data. Three normalization methods are available, but at this stage only one is used: simple feature scaling. Normalization helps the training process: if there are very large range differences between the numeric variables, the variable with the highest magnitude may dominate the model. Third, after preprocessing, the data is split under two scenarios: the first divides the data into 80% training data and 20% testing data; the second divides it into 90% training data and 10% testing data. Fourth, after splitting the data, the distance between the testing and training data is calculated using the Manhattan distance.
The Manhattan distance measures the closeness between attribute values, which will later be used as the basis for testing the value of k. The last process is testing the value of k. This test is carried out individually for k = 1, k = 3, k = 5, k = 7, and k = 9. After testing, the best value of k for use in this study is determined.

Data Acquisition
The dataset used in this study comes from Kaggle.com. The data obtained has been widely used in previous studies. The dataset, which totals 918 records, is divided into two parts: training data and testing data. The training data comprises 734 data points, and the testing data comprises 184 data points. The training data trains and builds the model, while the testing data tests the model after the training process is complete. In Table 1, the first attribute is a person's age (29–77 years). The second attribute describes a person's gender ("0" for women and "1" for men). The third attribute defines the level of chest pain experienced by patients in the hospital; the four types of chest pain are converted into numerical values, each describing the level of chest pain (TA: 0, ATA: 1, NAP: 2, ASY: 3). The fourth attribute describes a person's resting blood pressure. The fifth attribute indicates a person's cholesterol level. The sixth attribute describes a person's fasting blood sugar level (1 if blood sugar is > 120 mg/dl, 0 otherwise). The seventh attribute shows ECG results from 0 to 2, where each value indicates the severity of pain. The eighth attribute is the maximum heart rate value (minimum: 71, maximum: 202). The ninth attribute indicates whether exercise causes angina (yes: 1, no: 0). The tenth attribute defines a person's ST depression value. The eleventh attribute describes the slope of the peak exercise ST segment (Up: upsloping, Flat: flat, Down: downsloping). The final column is the class label describing the categories in the dataset. This dataset uses binary classes: 0 means there is no possibility of heart failure in a person, while 1 implies a strong possibility of someone having heart failure [15].

Preprocessing Data
Preprocessing data is the first step in creating machine learning and artificial intelligence models. This process transforms data into an easier and more efficient form, making it possible for machine learning models to produce more accurate results [6,16]. In this study, three data preprocessing stages were used. First, data cleaning is the initial process carried out in data preprocessing; it is used to select and delete data that can reduce the accuracy of the machine-learning model. In this study, there were no noisy data or missing values. Noisy data is data that contains wrong or abnormal values; this condition is called a data anomaly [16]. Second, data transformation serves to equalize all data, such as by aligning data structures, data formats, or values in the data to produce a consistent dataset. Finally, data normalization is a technique for converting data onto a regular scale, a process in which the variables share the same range of values, neither too large nor too small, to facilitate analysis [16]. Three normalization methods are available, but in this study we used one data normalization method, namely simple feature scaling. This simple method divides each value by the maximum value of the attribute. The formula used in simple feature scaling is given in Formula (1):

x_new = x_old / x_max (1)

where x_old is the value of each attribute in the dataset, x_max is the maximum value of that attribute in the dataset, and x_new is the normalized value.
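As an illustration, the simple feature scaling step of Formula (1) can be sketched in Python; the example Age values below are hypothetical, not rows from the dataset.

```python
def simple_feature_scaling(values):
    """Normalize attribute values to the 0-1 range by dividing each
    value by the attribute's maximum (Formula (1)): x_new = x_old / x_max."""
    x_max = max(values)
    return [x / x_max for x in values]

# Hypothetical Age values; the maximum (77) maps to 1.0
ages = [40, 49, 37, 54, 77]
print(simple_feature_scaling(ages))
```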

Split Data
The process of building a model and testing it to obtain the best k results is done by making a model from the training data and testing it with the testing data [17]. The distribution of training and testing data in this study was done manually: in the first scenario, 80% of the data is used for training and 20% for testing. The data is split manually using the percentage formulas given in Formulas (2) and (3).

Data Training = 918 × 80/100 = 734 data (2)

The above formula is the manual division of the training data, with 80% of the dataset used as training data for the model.

Data Testing = 918 × 20/100 = 184 data (3)

The above formula is the manual division of the testing data, with 20% of the dataset used as testing data.
In the second scenario, the data is divided manually into 90% training data and 10% testing data using Formulas (4) and (5).

Data Training = 918 × 90/100 = 826 data (4)

The above formula is the manual division of the training data, with 90% of the dataset used as training data for the model.

Data Testing = 918 × 10/100 = 92 data (5)

The above formula is the manual division of the testing data, with 10% of the dataset used as testing data.
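The manual split calculations amount to the arithmetic below; rounding the training share down and assigning the remaining records to testing reproduces the counts reported in the text.

```python
def split_counts(total, train_pct):
    """Return (training, testing) sizes: the training share is rounded
    down and the remaining records go to testing."""
    train = total * train_pct // 100
    test = total - train
    return train, test

print(split_counts(918, 80))  # (734, 184) -- first scenario
print(split_counts(918, 90))  # (826, 92)  -- second scenario
```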

Manhattan Distance
Manhattan distance is one of the distance measurement methods used in the K-NN algorithm. The Manhattan distance between two points is calculated by summing the absolute differences of their coordinates. The Manhattan distance formula can be seen in Formula (6).

d(x, y) = Σ_{i=1}^{n} |x_i − y_i| (6)

Based on Formula (6), the distance between the training data and the testing data is calculated, where x_i represents the testing data, y_i represents the training data, and i indexes the data variables.
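A minimal Python sketch of Formula (6), summing absolute differences over the attributes of a testing record x and a training record y:

```python
def manhattan_distance(x, y):
    """Formula (6): sum of the absolute differences between the
    attribute values of a testing record x and a training record y."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

print(manhattan_distance([1, 2, 3], [4, 0, 3]))  # |1-4| + |2-0| + |3-3| = 5
```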

ISSN: 2476-9843
The distance calculation used in this study computes the distance between the attribute values of the training and testing data. The attributes used in this distance measurement are Age, Sex, ChestPainType, RestingBP, Cholesterol, FastingBS, RestingECG, MaxHR, ExerciseAngina, Oldpeak, and ST_Slope.

K-Nearest Neighbors (K-NN)
The K-NN algorithm is a supervised learning classification algorithm that can be implemented on labeled data. K-NN classifies the dependent variable based on how similar the independent variable is to an example similar to known data [18].
The K-NN algorithm is one of the most widely used classification algorithms and is easy to implement and modify; the algorithm is fairly simple because it works with only a few parameters, a distance metric and the value of k. The main goal of the K-NN algorithm is to find the nearest neighbors (plotted points) of a query in the dataset [19]. K-NN classifies a sample point with the majority class of its neighbors. K-NN uses a dataset in which the data points are divided into several classes to predict the classification of a new sample point. The algorithm computes the distance between the query and all data points, selects the specified number of points closest to the query, and then chooses the most frequent label in the classification case or the average label in the regression case [10].
The K-NN (K-Nearest Neighbors) algorithm is a classification algorithm based on learning from previously classified data. It belongs to supervised learning, where a new instance is classified based on the majority class of its nearest neighbors. Zhang explains that K-NN is a nonparametric, instance-based method and is considered one of the simplest methods in data mining and machine learning [20]. The value of k is tested using Manhattan distance measurements after data preprocessing and data splitting, and the best and most optimal k value is selected from the predetermined k values [21]. Based on the highest accuracy among all k values, the most optimal k value is obtained for use in this study.
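The classification step described above, measuring Manhattan distances, taking the k nearest training points, and voting on the majority label, can be sketched as follows; the feature values below are hypothetical, not taken from the dataset.

```python
from collections import Counter

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def knn_predict(train_X, train_y, query, k):
    """Classify `query` by majority vote among its k nearest
    training points under the Manhattan distance."""
    order = sorted(range(len(train_X)), key=lambda i: manhattan(train_X[i], query))
    nearest_labels = [train_y[i] for i in order[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy normalized data: label 1 = indicated heart failure, 0 = normal
X = [[0.10, 0.20], [0.20, 0.10], [0.90, 0.80], [0.80, 0.90], [0.85, 0.85]]
y = [0, 0, 1, 1, 1]
print(knn_predict(X, y, [0.82, 0.80], k=3))  # 1 (all three nearest neighbors have class 1)
```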

RESULTS AND ANALYSIS
Following the research framework shown in Figure 1, the following will present the research results, starting from the data preprocessing stage to applying the K-NN algorithm to classify data on patients with potential for heart disease and patients with normal hearts. To get the best classification model, this study will test several different k values.

Implementation of Preprocessing Data
Data preprocessing is carried out in three stages: cleaning, transformation, and normalization. At the data cleaning stage, the dataset is free of noise and missing values. The dataset used is 918 data points with 12 attributes.  Table 2 is the original dataset that has not been preprocessed. The dataset needs to be preprocessed so that the results obtained produce better accuracy values.

Transformation Data
The data transformation carried out at this stage equalizes or aligns the values in the attributes. In the dataset used in this study, some attribute values are categorical, so these values are difficult to process. Therefore, categorical attribute values are converted into numeric form. There are five attributes whose values are converted into numeric form: Sex, ChestPainType, RestingECG, ExerciseAngina, and ST_Slope. The transformation results are presented in the tables below. Table 3 describes the transformation results where the categorical M (Male) and F (Female) data are converted into the numeric forms 0 and 1. Table 6 describes the transformation results where the categorical N (No) and Y (Yes) data are converted into the numeric forms 0 and 1.
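The category-to-number mappings described here and in the data-acquisition section (women 0 and men 1; TA 0, ATA 1, NAP 2, ASY 3; No 0 and Yes 1) can be sketched as a small conversion step. The numeric codes for RestingECG and ST_Slope are not spelled out in the text, so they are omitted from this sketch.

```python
# Mappings taken from the attribute descriptions in the text
SEX = {"F": 0, "M": 1}
CHEST_PAIN = {"TA": 0, "ATA": 1, "NAP": 2, "ASY": 3}
ANGINA = {"N": 0, "Y": 1}

def transform(record):
    """Convert the categorical attributes of one record to numeric form."""
    out = dict(record)
    out["Sex"] = SEX[record["Sex"]]
    out["ChestPainType"] = CHEST_PAIN[record["ChestPainType"]]
    out["ExerciseAngina"] = ANGINA[record["ExerciseAngina"]]
    return out

print(transform({"Sex": "M", "ChestPainType": "ASY", "ExerciseAngina": "N"}))
# {'Sex': 1, 'ChestPainType': 3, 'ExerciseAngina': 0}
```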

Normalization Data
In the normalization process, carried out using the simple feature scaling method, results are obtained on a scale of 0 to 1. Normalization can help the learning process when there is a very large range difference between the numeric variables, because the variable with the highest magnitude can dominate the model, regardless of whether its features are informative with respect to the target. The normalization results can be seen in more detail in Table 7, the data normalization results. Table 9 shows that the values in each attribute are the result of normalizing the data using the simple feature scaling method.

Best value of k
The classification process divides the data in two stages: the first stage uses an 80% training / 20% testing split; the second stage uses a 90% training / 10% testing split. In the classification process, the Manhattan distance measurement method is used to find the nearest neighbors of each data point to be classified by measuring the distance between the training data and the testing data. The training data with the closest distance to the data to be classified are selected as the nearest neighbors, and the number of selected neighbors is adjusted to the predetermined k value. The Manhattan distance is calculated from data that has been normalized using the simple feature scaling method. The following is an example of calculating Manhattan distances, with the results presented in Table 10 using the values k = 1, k = 3, k = 5, k = 7, and k = 9. Table 10 shows the results of distance measurements using the Manhattan distance calculated from the normalized data. The Manhattan distance values are sorted by nearest neighbor for each k value. For k = 1, the value is obtained from the smallest Manhattan distance between the training and testing data. As an example of determining the values k1 = 1, k3 = 1, k5 = 1, k7 = 1, and k9 = 1: k1 produces a value of 1 obtained from the class of the single nearest neighbor, namely 1; k3 equals 1, obtained from the nearest neighbors 1, 2, and 3; k5 equals 1, obtained from the nearest neighbors 1 through 5; k7 equals 1, obtained from the nearest neighbors 1 through 7; and k9 equals 1, obtained from the nearest neighbors 1 through 9.
After obtaining the values k1 = 1, k3 = 1, k5 = 1, k7 = 1, and k9 = 1, the accuracy is calculated. Clearer results are presented in Table 11. If a data point has class 1 (diagnosed with heart failure) and k1 predicts the same class, then the value is TRUE. Classes k3, k5, k7, and k9 have the same value, namely 1 (diagnosed with heart failure); therefore, they are TRUE. This testing process is carried out from the first test data point to the 184th. If the predicted k value matches the class value, the result is TRUE; otherwise, the result is FALSE. The accuracy calculation for the 80% training / 20% testing split is as follows: for k = 1, out of 184 testing data points, 155 data points with a true value (matching the label) were divided by the total amount of testing data; for k = 3, 158 of 184 data points matched the label; for k = 5, 164 of 184 matched; for k = 7, 164 of 184 matched; and for k = 9, 156 of 184 matched. This stage determines the most optimal k value from the 80% training / 20% testing split. The final test results are presented in tabular form in Table 12. In Table 12, the k values in the third and fourth columns give the most optimal results for each accuracy.
The value of k = 1 gets 84% accuracy, the value of k = 3 gets 85% accuracy, the value of k = 5 gets 86% accuracy, the value of k = 7 gets 86% accuracy, and the value of k = 9 gets 84% accuracy.
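The accuracy computation used above is simply the number of TRUE (matching) predictions divided by the number of testing points; for example, the k = 1 figure on the 80/20 split follows from 155 correct predictions out of 184.

```python
def accuracy(true_count, total):
    """Accuracy = correctly classified (TRUE) test points / total test points."""
    return true_count / total

# k = 1 on the 80/20 split: 155 of 184 predictions matched the label
print(round(accuracy(155, 184) * 100))  # 84
```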
Next are the results of the tests carried out to determine the most optimal k value from the 90% training / 10% testing split. The final test results are presented in tabular form in Table 13. In Table 13, the k values in the fourth and fifth columns give the most optimal results for each accuracy. The value of k = 1 gets 86% accuracy, the value of k = 3 gets 87% accuracy, the value of k = 5 gets 87% accuracy, the value of k = 7 gets 88% accuracy, and the value of k = 9 gets 88% accuracy.
This K-NN research method produces different accuracy values. Based on the accuracy obtained from testing with 80% training data and 20% testing data, the highest accuracy was 86%, and testing with 90% training data and 10% testing data obtained the highest accuracy of 88%. Other normalization methods, such as Min-Max, Z-Score, or Decimal Scaling, can be used to produce higher accuracy values; the normalization process has a considerable effect on the resulting accuracy. Accuracy might also be improved by dividing the data into 70% training and 30% testing, or 60% training and 40% testing. Table 14 compares this research with previous studies. Compared with that work, this research has relatively low accuracy values, namely 86% and 88%. Accuracy can be further improved by using other machine-learning classification methods, such as SVM, Decision Tree, or Random Forest. Distance measurement methods such as Canberra, Euclidean, and others can also be tried to improve accuracy. As previous research shows, the K-NN algorithm can be used not only in the health sector but also in other fields, such as stock trends.

CONCLUSION
Based on the research conducted on the heart failure dataset, it can be concluded that the K-Nearest Neighbor algorithm was applied to classify heart failure with a target attribute of 0 or 1, where 0 is normal (no potential for heart failure) and 1 indicates potential heart failure. The most recommended k values, with the best accuracy, are k = 7 and k = 9 when testing with 90% training data and 10% testing data, producing the highest accuracy of 88%. This accuracy is obtained by manually dividing the dataset into 90% training data (826 data points) and 10% testing data (92 data points). The testing process is done by normalizing the dataset using simple feature scaling and calculating distances with the Manhattan method. In this study, the accuracy results obtained did not change significantly between scenarios. For further research, other machine learning methods can be added or compared, and other normalization methods can be used to obtain even better accuracy results.