Comparison of Support Vector Machine Performance with Oversampling and Outlier Handling in Diabetic Disease Detection Classification

Diabetes mellitus is a chronic metabolic disease characterized by the body's inability to process carbohydrates and fats, so that blood glucose levels become high. Diabetes mellitus is the sixth leading cause of death in the world. Classifying diabetes mellitus data makes it easier to predict the disease. As technology develops, diabetes mellitus can be detected using machine learning methods, one of which is the support vector machine (SVM). The advantage of SVM is that it is very effective for classification, quickly separating positive and negative points. This study aimed to obtain the best SVM classification model, based on accuracy, sensitivity, and precision values, for detecting diabetes by adding the Synthetic Minority Over-Sampling Technique (SMOTE) and outlier handling. SMOTE was applied to handle class imbalance. The SVM method aims to produce a function serving as a dividing line, called a hyperplane, that fits all input data with the smallest possible error. The data studied were indications of diabetes, consisting of 8 factor variables and 1 class variable. The test results show that the SVM-SMOTE scenario produces the best accuracy: with the RBF kernel it reached an accuracy of 88% with an error of 12%, obtained from a training-to-testing data split of 90:10. This SVM-SMOTE scenario also produced a precision of 0.880 and a sensitivity of 0.880. The results showed that the factor classification is more accurate when carried out using the support vector machine (SVM) method with imbalanced-data handling (SMOTE), and it can be concluded that the split between training and testing data influences a test scenario.


RESEARCH METHOD
This study uses the data mining method, which results from observational analysis of a data set with the aim of determining correlations (relationships) and narrowing down the data using different methods depending on the characteristics of the data [19]. This study uses a Support Vector Machine, also called Support Vector Classification [20]. SVM aims to produce a function serving as a dividing line, called a hyperplane, which fits all input data with the smallest possible error [21]. This study uses diabetes classification data from Kaggle. The data consist of a response variable in the form of a diabetes status label (+ and -) and predictor variables consisting of pregnancies, glucose, blood pressure, skin thickness, insulin, BMI, diabetes pedigree function, and age. The data have 768 rows and eight predictor columns, each containing a factor causing indications of diabetes, which will be examined to find the classification results using the SVM method [22].

Flowchart
This research was conducted to identify the best classification model based on accuracy, sensitivity, and precision values. The results were processed using the support vector machine method, optimized with the Synthetic Minority Over-Sampling Technique. Figure 1 shows the research stages.

The SVM algorithm is as follows [23].
Input: input data (X), target data (Y), kernel parameter (σ), penalty (C)
Output: a, b, accuracy
Begin:
a. Divide the classes into v(v − 1)/2 binary groups: for k = 1 : v, l = k + 1 : v
b. Define the RBF kernel function parameter (K)
c. Determining …

The SMOTE algorithm is as follows [23].
Input: number of minority-class data (T); number of majority-class data (P); number of SMOTE replications (N); number of nearest neighbors (k)
Output: synthetic data x_syn
Begin:
a. Determine the amount of minority-class data (T); a class is called a minority class if it holds less than 50% of the total data
b. Determine the amount of majority-class data; there is only one majority class
c. Calculate the k-nearest neighbors using the Euclidean distance in formula 1: for x = 1 : n, for z = 1 : n
d. Determine the number of replications for the minority data: N = total majority-class data (P) / total minority-class data (T)
e. Select the minority-class data point to be replicated, x_i
f. Find the data point with the shortest distance to x_i within the same minority class, x_knn
g. Draw a random value γ (a random number in the interval [0, 1])
h. Calculate the synthetic point using formula 2

The four classification scenarios are (i) SVM on the original data, (ii) SMOTE-SVM, (iii) Outlier-SVM, and (iv) Outlier-SMOTE-SVM, followed by model evaluation with the criteria of accuracy, precision, and sensitivity, and finally the conclusion. Figure 1 shows the flow of the research stages from start to finish. The first stage is entering the data on diabetes indication factors. Data preprocessing is then performed to detect missing values, outliers, and multicollinearity. The data are divided into training and testing sets with ratios of 90:10, 80:20, and 70:30. SVM classification is carried out under the four scenarios: pure SVM on the initial data, SMOTE with SVM, SVM with outlier handling, and SMOTE with SVM and outlier handling. Finally, the models are evaluated with the accuracy, precision, and sensitivity criteria.
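The SMOTE steps listed above can be sketched as follows; a minimal NumPy implementation, assuming the Euclidean distance for the neighbor search and the standard synthesis rule x_syn = x_i + γ(x_knn − x_i) for formula 2:

```python
import numpy as np

def smote(X_min, n_synthetic, k=5, rng=None):
    """Generate synthetic minority samples (a sketch of the SMOTE steps above)."""
    rng = np.random.default_rng(rng)
    n = len(X_min)
    # c. Euclidean distance between every pair of minority points (formula 1)
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)          # a point is not its own neighbor
    neighbors = np.argsort(dist, axis=1)[:, :k]
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(n)                 # e. point to replicate, x_i
        j = rng.choice(neighbors[i])        # f. a nearest same-class neighbor, x_knn
        gamma = rng.random()                # g. random value in [0, 1]
        # h. synthesis rule (assumed form of formula 2)
        synthetic.append(X_min[i] + gamma * (X_min[j] - X_min[i]))
    return np.array(synthetic)
```

Upsampling the minority class from 268 to 500 records with this routine would mirror the 500/500 balance the paper reports after SMOTE.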

Data
The data used in this research are secondary data from Kaggle. They record factors that can indicate whether someone is confirmed to have diabetes. There are eight predictor variables: pregnancies, glucose, blood pressure, skin thickness, insulin, BMI, diabetes pedigree function, and age. The predictor variables are labeled 'x', while the response variable is denoted by the letter 'y'. The data are presented in tabular form in Table 1.
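The structure described above can be represented as a data frame; a minimal sketch with two made-up example rows (the real dataset has 768 rows from Kaggle), assuming the column names below:

```python
import pandas as pd

# Predictor columns x1..x8; the names here are an assumption based on the text.
columns = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
           "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]

# Two hypothetical rows for illustration only
df = pd.DataFrame(
    [[6, 148, 72, 35, 0, 33.6, 0.627, 50],
     [1,  85, 66, 29, 0, 26.6, 0.351, 31]],
    columns=columns,
)
df["Outcome"] = [1, 0]   # response y: 1 = positive, 0 = negative

X = df[columns]          # predictor matrix, labeled 'x'
y = df["Outcome"]        # response vector, labeled 'y'
```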

Data Preprocessing
Preprocessing is an important step in any processing system. Data analysis is hindered, and the work is not easy, when the data are very large, which can be addressed with data preprocessing. Data preprocessing prepares the data by cleaning it of noise or changing the data format; this is necessary because raw data are often incomplete and inconsistently formatted. Preprocessing is divided into data validation and imputation. Validation assesses the completeness and accuracy of the filtered data. Imputation, on the other hand, aims to minimize the error rate by entering missing values manually or automatically through a business process automation (BPA) program [24]. A missing value is the absence of some of the existing data. Listwise deletion is a suitable method for dealing with missing values: if empty data are found, the row is removed from the analysis [25]. An outlier is a data object with an abnormal value, either too low or too high, so that it differs greatly from other objects [26]. Multicollinearity describes a perfect or definite linear relationship between some or all of the independent variables [27]. The Pearson correlation value is determined using formula (3), r_XY = Σ(X_i − X̄)(Y_i − Ȳ) / √(Σ(X_i − X̄)² Σ(Y_i − Ȳ)²), where r_XY represents the correlation coefficient, X_i the i-th value of variable x, and Y_i the i-th value of variable y.
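The three preprocessing checks can be sketched as follows; a minimal pandas example on synthetic data, assuming the common 1.5 × IQR rule for outlier detection (the paper does not state its exact outlier threshold):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.normal(100, 15, 200), "y": rng.normal(30, 5, 200)})
df.loc[3, "x"] = np.nan      # inject a missing value for illustration
df.loc[7, "y"] = 500.0       # inject an outlier for illustration

# Listwise deletion: drop any row containing a missing value
df = df.dropna()

# Outlier handling: remove rows outside the 1.5*IQR fences (an assumption;
# the paper only says that rows containing outliers are deleted)
q1, q3 = df.quantile(0.25), df.quantile(0.75)
iqr = q3 - q1
inside = ((df >= q1 - 1.5 * iqr) & (df <= q3 + 1.5 * iqr)).all(axis=1)
df = df[inside]

# Pearson correlation (formula 3) to screen for multicollinearity
r_xy = df["x"].corr(df["y"], method="pearson")
```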

Support Vector Machine
SVM stands for Support Vector Machine. SVM searches for the optimal dividing line (hyperplane) that maximizes the margin between the two data classes, using a linear function in a high-dimensional feature space built with kernel rules [28]. SVM is a model derived from statistical learning theory, with better results than many other methods. In SVM, each training example is written as (x_i, y_i), where i = 1, 2, ..., N, with x_i = (x_i1, x_i2, ..., x_iq)^T the attribute vector of training example i and y_i ∈ {−1, +1} its class label [29]. Figure 2 shows that various alternative separators can split the dataset according to class, but the best separator is the one with the largest margin. The SVM model equation is as in formula 4, f(x) = w · x + b, where w represents the weight vector of the SVM and b is a scalar bias [30]. Several types of kernels can be used to form the hyperplane and produce the best accuracy, as represented in Figure 3. The formulas for solving the linear and nonlinear problems in SVM are shown in Table 2.
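Fitting such a maximum-margin separator can be sketched as follows, using scikit-learn's SVC (an assumption; the paper does not name its implementation) with the RBF kernel and the penalty C from the algorithm's inputs:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two well-separated synthetic classes standing in for the labels y in {-1, +1}
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

# RBF kernel with penalty C; gamma plays the role of the kernel parameter sigma
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X, y)

# The fitted model f(x) = w . phi(x) + b separates the classes; the support
# vectors are the training points lying on or inside the margin.
print(clf.n_support_)      # number of support vectors per class
print(clf.score(X, y))     # training accuracy
```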

Synthetic Minority Over-Sampling Technique
The Synthetic Minority Over-Sampling Technique (SMOTE) is generally implemented to overcome class imbalance. This technique balances the dataset by oversampling the minority class, adding new synthetic data generated from existing minority-class samples.

Evaluation of Classification Results
The confusion matrix is a matrix used to indicate the accuracy of classification results relative to the reference data. A classification model can be said to be good if it has a relatively small error rate. The confusion matrix table contains True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN) values, as in Table 3. The calculation of the confusion matrix produces a sensitivity value, a specificity value, and an accuracy value. The sensitivity value measures how well instances that truly belong to the diabetes class are classified into that class, while the specificity value measures how well instances outside the diabetes class are kept out of it. The accuracy value measures the overall success of the classification system [31]. True Positive (TP) means positive information correctly identified and placed in the positive class. True Negative (TN) is a negative statement correctly identified and placed in the negative class. False Positive (FP) is negative information wrongly identified by the system and placed in the positive class. False Negative (FN) is a positive statement wrongly identified by the system and placed in the negative class [32].
Accuracy describes the correctness of the classification results on positive or negative data; the higher the accuracy value obtained, the better the system has classified. For a binary (two-class) problem, accuracy is computed as in Equation (5), Accuracy = (TP + TN) / (TP + TN + FP + FN). The precision value represents the number of correctly identified positive data points divided by the total number of points predicted as positive, as in Equation (6), Precision = TP / (TP + FP). A higher sensitivity value means the classification system is better at detecting positive objects; it is found using Equation (7), Sensitivity = TP / (TP + FN). The F1-score is an evaluation metric used in classification to measure the balance between precision and recall, giving an overall picture of the model's performance in predicting the positive class; it is computed as in Equation (8), F1 = 2 × Precision × Sensitivity / (Precision + Sensitivity). The error rate identifies the proportion of mistakes in the data, showing the error level of the system in use; it can be found using Equation (9), Error = 1 − Accuracy.
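Equations (5)-(9) reduce to simple ratios of the four confusion-matrix cells; a minimal sketch (the cell counts below are hypothetical, chosen only for illustration):

```python
def metrics(tp, tn, fp, fn):
    """Accuracy, precision, sensitivity, F1, and error from confusion-matrix cells."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)                    # Eq. (5)
    precision = tp / (tp + fp)                                    # Eq. (6)
    sensitivity = tp / (tp + fn)                                  # Eq. (7), recall
    f1 = 2 * precision * sensitivity / (precision + sensitivity)  # Eq. (8)
    error = 1 - accuracy                                          # Eq. (9)
    return accuracy, precision, sensitivity, f1, error

# Hypothetical cell counts for a 100-sample test set
acc, prec, sens, f1, err = metrics(tp=30, tn=58, fp=5, fn=7)
print(round(acc, 3), round(err, 3))   # → 0.88 0.12
```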

RESULT AND ANALYSIS
All variables tested have a total of 768 rows. Missing-value detection on the nine variables shows that there are no missing values in Pregnancy, Glucose, Blood Pressure, Skin Thickness, Insulin, BMI, Pedigree Diabetes, Age, or the class variable. Figure 5 shows boxplots with dots at the top and bottom, indicating that the variables contain outlier data; in total there are 218 outliers. After handling the outliers by deleting the rows that contain them, 550 diabetes records remain without outlier data.

Descriptive Statistics
Based on Table 4, all the variables tested total 768 rows. The pregnancy variable describes the number of pregnancies in women; the maximum value is 17. According to medical science, a pregnant woman can develop a condition in which the body is unable to produce enough insulin during pregnancy, so a mother who has had multiple pregnancies is at higher risk of developing diabetes. The glucose variable in the table is the result of a 2-hour oral glucose tolerance test, with a maximum value of 199; medically this is classified as high, because normal blood sugar is 70-130 mg/dl. Blood pressure in the data has a maximum value of 122; from a medical point of view, normal human blood pressure ranges from 90/60 mmHg to 120/80 mmHg. Skin thickness is the triceps skinfold thickness and has an average value of 20.5. The insulin variable is the 2-hour serum insulin, with a maximum value of 846. The BMI variable has an average of 31.99; according to medical science, a BMI of 31.99 is classified as obese and carries a risk of diabetes. The pedigree variable summarizes the hereditary history of diabetes, with an average of 0.4. Finally, the age variable has an average value of 33. Figure 4 illustrates the histograms of the 8 predictor variables causing diabetes: pregnancy with positive skewness and leptokurtosis, glucose with negative skewness and leptokurtosis, blood pressure with zero skewness and leptokurtosis, insulin with positive skewness and leptokurtosis, BMI with roughly normal skewness and leptokurtosis, diabetes pedigree with positive skewness and leptokurtosis, and age with positive skewness and leptokurtosis.
Of the visualized histograms, few have a skewness of 0 and a kurtosis of 3, so it can be concluded that the eight variables causing diabetes above are not evenly (normally) distributed.
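The shape checks behind these histograms can be sketched as follows; note that pandas reports excess kurtosis (kurtosis minus 3), so a normal curve scores 0 rather than 3:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "symmetric": rng.normal(33, 10, 5000),        # roughly zero skew, like age's mean
    "right_skewed": rng.exponential(2.0, 5000),   # positive skew, like pregnancies
})

skew = df.skew()   # > 0 means a longer right tail (positive skewness)
kurt = df.kurt()   # excess kurtosis; > 0 means heavier tails (leptokurtic)
print(skew.round(2))
print(kurt.round(2))
```

A variable whose skewness is far from 0 or whose excess kurtosis is far from 0 departs from the normal shape, which is the conclusion drawn for the diabetes predictors above.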

Preprocessing Data
A missing value, or empty data, is the loss of some of the data that should have been obtained. After checking the nine variables, no empty data were found in Pregnancy, Glucose, Blood Pressure, Skin Thickness, Insulin, BMI, Pedigree Diabetes, Age, or the class variable, so the data indeed consist of 768 rows with no empty entries. Based on Figure 5, the boxplots show dots at the top and bottom, indicating that the variables contain outlier data. There are 218 outliers, a number consistent with previous research on the same data source. After handling the outliers by deleting the rows that contain them, the diabetes data total 550 rows without outlier data. Table 5 shows that the highest correlation between variables is 0.544, namely between the age and pregnancy variables, and no correlation reaches 0.650. It can therefore be concluded that the variables in the diabetes indication data are related to one another but show no multicollinearity.

SVM classification
Based on Figure 6, panel (i) shows that class 0 (negative) accounts for 65.1%, or 500 records, while class 1 (positive) accounts for 34.9%, or 268 records, of the diabetes indication dataset, so the data are dominated by negative cases of diabetes. In panel (ii), after balancing, class 0 and class 1 each account for 50%, or 500 records, so the data total 1000 records after SMOTE. In panel (iii), after handling the outlier data, class 0 (negative) accounts for 64.7%, or 356 records, while class 1 (positive) accounts for 35.3%, or 194 records. In panel (iv), after handling the outliers and balancing with SMOTE, class 0 and class 1 each account for 50%, or 356 records each. These results are consistent with several previous studies using the same data.

Accuracy Value of SVM
This research divides the dataset into training and testing data with ratios of 90:10, 80:20, and 70:30, because these three divisions are optimal data-splitting strategies for implementing the support vector machine method. Four scenarios are used: (A) SVM without outlier handling and without SMOTE, (B) SVM without outlier handling and with SMOTE, (C) SVM with outlier handling and without SMOTE, and (D) SVM with outlier handling and with SMOTE. Based on Table 6, the test results show that the best model for the diabetes data is scenario B, SVM without outlier handling and with SMOTE, on the 90:10 data split with the RBF kernel, which has an accuracy of 0.880 or 88% and an error of 0.120 or 12%, so the accuracy obtained can be considered good.
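The splitting strategy above can be sketched as follows; a scikit-learn sketch on synthetic stand-in data (the tooling is an assumption, since the paper does not name its implementation), looping over the three ratios with an RBF-kernel SVM:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Synthetic stand-in for the diabetes data: 768 rows, 8 predictors, binary outcome
X = rng.normal(size=(768, 8))
y = (X[:, 1] + 0.5 * X[:, 5] + rng.normal(scale=0.5, size=768) > 0).astype(int)

for test_size in (0.10, 0.20, 0.30):        # 90:10, 80:20, 70:30 splits
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=0)
    acc = SVC(kernel="rbf").fit(X_tr, y_tr).score(X_te, y_te)
    print(f"{int((1 - test_size) * 100)}:{int(test_size * 100)} accuracy = {acc:.3f}")
```

Running each scenario (with and without SMOTE and outlier handling) through this loop and tabulating the accuracies reproduces the structure of Table 6.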
Figure 7 shows the confusion matrix of the best model, the RBF kernel with a 90:10 data split, from the SVM scenario without outlier handling and with SMOTE; this study shows a significant advantage in the context of diabetes classification. From the confusion matrix, the model has a high sensitivity (recall) in classifying patients who are actually positive, with only 7 false negatives (positive patients predicted as negative). This shows that the model can accurately identify patients with potential diabetes, which is very important in preventing and controlling this disease. In addition, with only 5 false positives, the model also has a good level of precision, avoiding wrong diagnoses in patients who do not actually have diabetes. With this combination of high sensitivity and precision, the study demonstrates the value of the classification model for detecting diabetes, which can positively impact medical practice and clinical decision-making. Table 7 shows the evaluation of the best model, the RBF kernel with a 90:10 data split, from the SVM scenario without outliers and with SMOTE; the evaluated model shows good performance with high precision, recall, F1-score, and accuracy. The precision is 0.900 for class 0 and 0.850 for class 1, which shows that the model makes few errors in its positive predictions for each class. Recall is 0.870 for class 0 and 0.890 for class 1, which shows that the model identifies most of the data of each class. The F1-scores, the harmonic means of precision and recall, are 0.890 for class 0 and 0.870 for class 1, indicating a good balance between precision and recall. With an accuracy of 0.880, the overall model classifies the data accurately. The weighted average and macro average are 0.880 for precision, recall, and F1-score, indicating consistent and balanced performance across classes.

Visualization of Prediction Results
Figure 8 shows the prediction results of SVM without outlier handling and with SMOTE, using the RBF kernel and the 90:10 data split. The yellow points are labeled 1, positive diabetes, and the red points are labeled 0, negative diabetes, according to their color. The blue line is the hyperplane, the boundary between the two groups, and the dotted lines are the margins, the estimated class boundaries. Points on the margin lie close to the hyperplane. The accuracy is 0.880 with an error of 0.120. These results differ by 1% from the research in [33]; this difference is due to an update in this study, namely adding SMOTE oversampling. The best accuracy was obtained in the second scenario, SVM with SMOTE, so it can be concluded that SMOTE oversampling can improve accuracy, yielding 88%.

CONCLUSION
This study uses diabetes data with several influencing variables. Class 0 (negative) has 500 records, while class 1 (positive) has 268 records, so it can be concluded that negative cases of diabetes dominate the data. Four tests were carried out to compare the final results and find the model with the highest accuracy. Of the four scenarios tested, the best accuracy is produced by the SVM-SMOTE scenario with the RBF kernel, which yields an accuracy of 88% with an error of 12%, obtained from a training-to-testing data split of 90:10. The results showed that the classification of diabetes using the 8 factor variables is more accurate when carried out using the Support Vector Machine (SVM) method with imbalanced-data handling (SMOTE), and that the data split affects the test scenario. The main advantage of machine learning methods, especially the Support Vector Machine (SVM), in classifying diabetes compared to conventional medical diagnosis is SVM's ability to process and analyze complex medical data with high accuracy. SVM can find hidden patterns and nonlinear relationships that are difficult for humans to detect in medical data involving many features. SVM can also address class-imbalance problems in medical datasets and provide consistent and objective predictions. Although the Support Vector Machine (SVM) has advantages in classifying diabetes, some drawbacks need to be considered in comparison with medical diagnosis: SVM requires large amounts of data to train the model properly. In diabetes cases, where medical data are often limited, using SVM can be challenging, as it can lead to overfitting or underfitting, and it requires proper parameter tuning and selection of an appropriate kernel for optimal performance, which can be time-consuming and requires considerable technical expertise.
Suggestions for future research are to focus on developing more sophisticated feature selection strategies, such as genetic algorithms or more in-depth data mining, to select the most informative and relevant feature subsets in diabetes data. In addition, research can extend the application of SMOTE with various other oversampling techniques, such as ADASYN (Adaptive Synthetic Sampling), to address more complex class imbalances in diabetes datasets. Thus, this research is expected to make a significant contribution to increasing the accuracy and reliability of the SVM model in diagnosing diabetes, as well as providing new insights for the development of better classification techniques in the medical field.

ACKNOWLEDGEMENTS
Thank you to Ms. Hani Khaulasari, M.Si as the lecturer in the mathematical statistics course, who has patiently guided us in completing this journal, and also thanks to the team members who have worked well together to prepare this journal.