The Effect of Class Imbalance Handling on Datasets Toward Classification Algorithm Performance

Class imbalance is a condition in which the amount of data in the minority class is smaller than that in the majority class. Class imbalance in a dataset causes misclassification of the minority class, which can degrade classification performance. Various approaches have been taken to deal with the problem of class imbalance, such as the data-level approach, the algorithmic-level approach, and cost-sensitive learning. At the data level, one of the methods used is the sampling method. In this study, the ADASYN, SMOTE, and SMOTE-ENN sampling methods were used to deal with the class imbalance problem, combined with the AdaBoost, K-Nearest Neighbor, and Random Forest classification algorithms. This study aimed to determine the effect of handling class imbalance in the dataset on classification performance. The tests were carried out on five datasets, and based on the classification results, the integration of the ADASYN and Random Forest methods gave better results than the other model schemes. The evaluation criteria include accuracy, precision, true positive rate, true negative rate, and g-mean score. The classification results of the integration of the ADASYN and Random Forest methods were 5% to 10% better than those of the other models.


INTRODUCTION
A problem often found in classification is class imbalance. Class imbalance occurs when the data are not evenly distributed and the number of minority class instances is smaller than that of the majority class [1]. This condition can lead the classifier to misclassify the minority class, tending to choose the majority class and ignore the minority class, which affects classification performance. There are several ways to deal with the problem of class imbalance: the data-level approach, the algorithmic-level approach, and cost-sensitive learning [2]. One way to deal with class imbalance at the data level is to apply sampling methods [3]. A sampling method is an approach to balance the distribution of the minority and majority classes. Sampling methods are divided into three types: undersampling, oversampling, and a combination of oversampling and undersampling (hybrid sampling). Undersampling randomly removes objects from the majority class so that each class has the same number of objects. Oversampling randomly selects objects from the minority class and generates new objects from them. Hybrid sampling combines oversampling and undersampling: it adds new objects to the minority class and removes objects from the majority class to balance the data [4].
Research on handling class imbalance with sampling methods has been widely conducted and has produced good classification performance. For example, the study in [5] used the ADASYN oversampling method to balance classes in a hypertension dataset and showed that the method helps classification models recognize the hypertension class and significantly improves classification performance in each model compared to not applying oversampling. The study in [6] used the ADASYN and SMOTE methods to address class imbalance in diabetes mellitus data, classified with the Support Vector Machine (SVM) algorithm. That study showed increased classification performance after applying the oversampling methods, with accuracy of 87.3% for ADASYN + SVM and 85.4% for SMOTE + SVM, whereas the accuracy without oversampling was lower, at 83%. Another study combined the Synthetic Minority Over-sampling Technique (SMOTE) oversampling and Edited Nearest Neighbor (ENN) undersampling methods to balance data for Land Use and Land Cover (LULC) classification and showed that SMOTE-ENN improved the performance of Random Forest and CatBoost models [7].
In addition, Imran [8] compared two oversampling methods, namely SMOTE and ROS (Random Over Sampling); the results show that both can improve the performance of the classification algorithm. Rashu [9] and Thammasiri [10] used one of the undersampling methods, namely RUS (Random Under Sampling), and both found that RUS decreased the performance of the classification algorithm. On the other hand, Kubat [11] used the undersampling method OSS (One-Sided Selection) and showed that applying OSS can improve the performance of the classification algorithm. Handling class imbalance with a similar approach was also carried out by Noorhalim [12] and Zhihao [13] using the SMOTE method; both studies show that applying class imbalance handling to datasets can improve the performance of several classification algorithms. Sajid Ahmed [14] studied handling class imbalance in datasets using ensemble resampling, with SMOTE-Bagging, RUS-Bagging, ADASYN-Bagging, and RYSIN-Bagging as the tested methods. The results indicate that all four methods succeeded in improving the performance of the classification algorithms used.
Most of these studies deal with class imbalance using resampling techniques. However, resampling has weaknesses: it risks duplicating instances and can cause loss of information or patterns in the dataset, which in turn affects the performance of the single classifier used. In addition, the data-level approach can change the composition of the dataset, while the algorithmic-level approach is not well suited to datasets with a large class imbalance ratio. This study uses resampling with ADASYN, SMOTE, and SMOTE-ENN, combined with a single classifier, K-Nearest Neighbors (KNN), and with AdaBoost and Random Forest as meta-learning algorithms. This study aims to determine how much handling class imbalance affects the performance of machine learning models and to compare the performance of several model schemes for handling class imbalance in datasets. This work makes two contributions. First, the proposed method can be a solution for dealing with imbalanced dataset problems in machine learning. Second, it can serve as a reference for further research on handling imbalanced dataset problems in machine learning.

Dataset
In this study, public datasets from the KEEL-Dataset repository were used. There are five binary-class datasets with different imbalanced ratios (IR): Pima, Wisconsin, glass1, glass0, and segment0. Table 1 describes each dataset, including the number of instances, the number of attributes, and the imbalanced ratio (IR).

Research Stages
Imbalanced datasets are divided into training data for building the machine learning models and testing data for evaluating the classification models. After that, the resampling process is carried out on the training data using the ADASYN, SMOTE, and SMOTE-ENN methods to balance the data. The resulting data are then used for the classification process with the Random Forest, AdaBoost, and K-Nearest Neighbor algorithms. The final stage is to evaluate each method to measure the performance of the resulting classification. The stages of the research can be seen in Figure 1.

Data Splitting
In the initial stage, the imbalanced dataset is divided into two parts. In several previous studies on imbalanced classes, a widely used scheme for dividing training and testing data is the stratified splitting technique. In this study, the data are divided as follows: 80% of the data is used as training data for the machine learning models and as the data to be resampled, while 20% of the data is used for testing the machine learning models.
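As an illustration, the following is a minimal sketch of the stratified 80/20 split described above, using scikit-learn; the variables X and y stand in for the features and labels of one of the KEEL datasets and are placeholders, not part of the original study's code.

```python
from sklearn.model_selection import train_test_split

# X: feature matrix, y: binary class labels (placeholders for one KEEL dataset).
# stratify=y keeps the class proportions identical in the training and testing splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```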

Adaptive Synthetic Sampling (ADASYN)
Adaptive Synthetic Sampling (ADASYN) is an oversampling method that synthesizes data adaptively based on the distribution of the positive (minority) samples [15]. The advantage of ADASYN is that it can focus synthetic data generation on specific regions [16]: more samples are produced in areas where the density of minority samples is low than in areas where the density is high. This shift in distribution can reduce data imbalance and help improve classification [17].
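A minimal sketch of applying ADASYN to the training split with the imbalanced-learn library is shown below; X_train and y_train come from the split sketch above, and the parameter values are assumptions rather than the settings used in this study.

```python
from collections import Counter
from imblearn.over_sampling import ADASYN

# Resample only the training data; the test split is left untouched.
adasyn = ADASYN(n_neighbors=5, random_state=42)
X_res, y_res = adasyn.fit_resample(X_train, y_train)

print("before:", Counter(y_train))
print("after: ", Counter(y_res))
```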

Synthetic Minority Over-sampling Technique (SMOTE)
Synthetic Minority Over-sampling Technique (SMOTE) is widely used for data imbalance problems [18]. SMOTE balances the data by adding artificially generated data to the minority class so that the amounts of data in the minority and majority classes are balanced. The synthetic data are determined based on the nearest neighbors of the minority samples. The method generates new data using equation (1) [19].
x_syn = x_i + rand(0,1) × (x_knn − x_i)    (1)

where x_syn is the new synthetic sample produced by the SMOTE process, x_i is the minority-class sample from which the synthetic sample is generated, rand(0,1) is a random value between zero and one, and x_knn is one of the k nearest neighbor samples of x_i within the minority class.
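To make equation (1) concrete, the following small sketch generates one synthetic sample from a minority sample and one of its nearest minority neighbors; the feature vectors are made-up values used only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

x_i   = np.array([2.0, 5.0])   # minority sample (made-up values)
x_knn = np.array([3.0, 7.0])   # one of its k nearest minority neighbors (made-up values)

# Equation (1): the new sample lies on the line segment between x_i and x_knn.
x_syn = x_i + rng.random() * (x_knn - x_i)
print(x_syn)
```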

SMOTE-ENN
SMOTE-ENN is a combination of the Synthetic Minority Over-sampling Technique (SMOTE) and the Edited Nearest Neighbors (ENN) undersampling method [20]. SMOTE calculates the distance between randomly chosen minority data and their k-nearest neighbors [21]. ENN removes samples whose class differs from that of their nearest neighbors, which helps minimize noise in the data [22]. Based on [23], the SMOTE-ENN sampling process is as follows (a code sketch is given after the steps).
Step 1. Choose random data from the minority class.
Step 2. Find the distance between the random data and its k-nearest neighbors.
Step 3. Multiply the difference by a random value between 0 and 1, then add the result to the minority class as a synthetic sample.
Step 4. Repeat steps two and three until the required proportion is obtained.
Step 5. Determine k based on the nearest neighbors. If it cannot be determined, k is assumed to be 3.
Step 6. Calculate the k-nearest neighbors of each remaining observation and determine the majority class among those neighbors.
Step 7. When the class of an observation differs from the majority class of its k-nearest neighbors, the observation and its k-nearest neighbors are removed from the dataset.
Step 8. The iterative process continues until the required proportion for each class has been met (steps 2 and 3).
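In practice, these steps are available as a single resampler. Below is a minimal sketch using imbalanced-learn's SMOTEENN with default parameters (an assumption, not necessarily the configuration used in this study); X_train and y_train come from the split sketch above.

```python
from imblearn.combine import SMOTEENN

# SMOTE first oversamples the minority class, then ENN removes samples whose
# class disagrees with the majority class of their nearest neighbors.
smote_enn = SMOTEENN(random_state=42)
X_res, y_res = smote_enn.fit_resample(X_train, y_train)
```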

Adaptive Boosting (AdaBoost)
AdaBoost is a boosting method designed for classification and can be applied to various classification algorithms [24]. This algorithm pays more attention to samples misclassified by weak classifiers, thus strengthening the classifier [25].
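As an illustration, the following is a minimal sketch of training AdaBoost on the resampled training data with scikit-learn; X_res and y_res come from one of the resampling sketches above, and the number of estimators is an assumption.

```python
from sklearn.ensemble import AdaBoostClassifier

# Each boosting round reweights the training samples so that previously
# misclassified ones receive more attention from the next weak learner.
ada = AdaBoostClassifier(n_estimators=100, random_state=42)
ada.fit(X_res, y_res)
y_pred = ada.predict(X_test)
```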

K-Nearest Neighbor (KNN)
K-Nearest Neighbor is a popular algorithm used in classification. The algorithm is simple, easy to implement, and produces good results across multiple domains [26]. KNN classifies a data point based on the distance from that point to its neighbors [27]. The KNN algorithm uses the Euclidean distance to measure the distance between data points, given in equation (2) [28].

d(x_i, x_j) = sqrt( Σ_{r=1}^{n} (a_r(x_i) − a_r(x_j))^2 )    (2)

where d(x_i, x_j) is the Euclidean distance, x_i is the i-th record, x_j is the j-th record, and a_r is the r-th attribute value of a record.
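The following minimal sketch shows equation (2) in code together with a KNN classifier that uses Euclidean distance; k = 5 and the variable names are assumptions for illustration, with X_res, y_res, and X_test coming from the earlier sketches.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def euclidean_distance(x_i, x_j):
    # Equation (2): square root of the summed squared attribute differences.
    return np.sqrt(np.sum((np.asarray(x_i) - np.asarray(x_j)) ** 2))

# metric="euclidean" makes the classifier use the same distance as equation (2).
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X_res, y_res)
y_pred = knn.predict(X_test)
```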

Random Forest
Random forest is an extension of bagging that uses decision trees as the base learning model [29]. The random forest classifier trains each tree on a random subset of the training data [30] and is used to generate accurate predictions [31].
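A minimal sketch of the Random Forest classifier on the resampled training data is shown below; the hyperparameters are assumptions, and X_res, y_res, and X_test come from the earlier sketches.

```python
from sklearn.ensemble import RandomForestClassifier

# Each tree is trained on a bootstrap sample of the training data and considers
# a random subset of features at each split.
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_res, y_res)
y_pred = rf.predict(X_test)
```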

RESULT AND ANALYSIS
As described in the Data Splitting stage, each imbalanced dataset is divided using the stratified splitting technique: 80% of the data is used as training data for the machine learning models and as the data to be resampled, while 20% is used for testing. Table 3 shows the data used for the training process and the data used for validation or testing.

Resampling Process
After the training and testing data are determined, a resampling process is carried out on the training data using ADASYN, SMOTE, and SMOTE-ENN. The distribution of positive and negative classes in the training data before and after applying the sampling methods can be seen in Table 4 and Table 5. The resampling results in Table 5 show that, after resampling with SMOTE, the two classes in the training set have the same number of instances. This is because SMOTE, apart from synthesizing data, also duplicates data. With ADASYN and SMOTE-ENN, there tends to be a small difference in the number of instances between the two classes.
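The class distributions reported in Tables 4 and 5 can be obtained by looping over the three resamplers, as in the sketch below; default parameters are used here as an assumption, and X_train and y_train come from the split sketch above.

```python
from collections import Counter
from imblearn.over_sampling import ADASYN, SMOTE
from imblearn.combine import SMOTEENN

resamplers = {
    "ADASYN": ADASYN(random_state=42),
    "SMOTE": SMOTE(random_state=42),
    "SMOTE-ENN": SMOTEENN(random_state=42),
}

# Print the class counts before and after each resampling method.
print("original:", Counter(y_train))
for name, sampler in resamplers.items():
    _, y_res = sampler.fit_resample(X_train, y_train)
    print(name + ":", Counter(y_res))
```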

Classification Performance
The data are then classified using AdaBoost, K-Nearest Neighbor, and Random Forest. Classification performance is evaluated with a confusion matrix, where the metrics used are accuracy, precision, recall (true positive rate), true negative rate, and g-mean score. A comparison of classification performance between the original data and the resampled data can be seen in Table 6, Table 7, Table 8, Table 9, and Table 10.

The accuracy of the classification results on the original data appears quite good. However, these results cannot be trusted because the dataset is imbalanced. Table 6 shows that the ADASYN+RF combination produced the best accuracy compared to the other methods, with values of 0.786, 0.947, 0.837, 0.93, and 0.998.

Table 7 shows the precision values for the ADASYN, SMOTE, and SMOTE-ENN sampling methods. With the Random Forest and AdaBoost algorithms, the precision values decreased. However, for SMOTE-ENN + KNN, the precision value is quite good compared to the original data and to KNN combined with the other sampling methods.

The true positive rate indicates how well the classifiers predict the minority class. Table 8 shows the true positive rate for each method. In dataset 1, the best true positive rate is obtained by ADASYN+AB. In dataset 2, the best values are produced by ADASYN+AB and SMOTE-ENN+RF. In dataset 3, the best value is produced by ADASYN+KNN. In dataset 4, the best result is shown by the SMOTE-ENN+RF combination, and in dataset 5, the best results are shown by ADASYN+AB, ADASYN+KNN, SMOTE-ENN+RF, and SMOTE-ENN+AB. Overall, the true positive rate on the original data is lower than with the ADASYN, SMOTE, and SMOTE-ENN methods, and ADASYN and SMOTE-ENN give better true positive rates than SMOTE.

The true negative rate indicates the classifier's ability to predict the negative class. Table 9 shows that the true negative rate is high on the original data, whereas, as seen in Table 8, the true positive rate on the original data is lower than with ADASYN, SMOTE, and SMOTE-ENN. This can happen because, on original data with class imbalance, the classifier tends to classify the majority (negative) class and ignore the minority (positive) class, so the true negative rate on the original data can be higher than when the data are balanced by applying the ADASYN, SMOTE, or SMOTE-ENN sampling methods.

Table 10 shows the g-mean values. The g-mean is more realistic than overall accuracy, which can remain high despite minority class misclassifications [35]. Of all the datasets tested, ADASYN+RF excelled in three datasets, namely datasets 1, 4, and 5; ADASYN+KNN excelled in dataset 3 and SMOTE+RF in dataset 4. ADASYN+RF produced a better g-mean value than the original data and the other sampling methods.
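As an illustration of how the evaluation metrics above can be derived from a confusion matrix, the following is a minimal sketch with scikit-learn; y_test and y_pred are placeholders for the true labels and the predictions of one of the classifiers above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Binary confusion matrix: rows are true classes, columns are predicted classes.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
tpr       = tp / (tp + fn)          # true positive rate (recall)
tnr       = tn / (tn + fp)          # true negative rate
g_mean    = np.sqrt(tpr * tnr)      # geometric mean of TPR and TNR
```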
Based on the experimental results across the five indicators used, namely accuracy, precision, true positive rate, true negative rate, and geometric mean, handling class imbalance in a dataset greatly influences the performance of machine learning models. Integrating ADASYN and Random Forest predominantly gave better results than no sampling or than other combinations of classification algorithms and sampling methods, although on some datasets other methods showed better results. The ADASYN resampling and Random Forest model performs better than the others because there are no duplicated sample values from the oversampling process, whereas SMOTE still produces some duplicated samples, so its results are not as optimal as ADASYN's. Therefore, ADASYN resampling with Random Forest generally produces better performance than other models such as SMOTE [12,13] and SMOTE-ENN [7].

CONCLUSION
Based on the classification results, implementing sampling methods in each classification model improves classification performance. The classification performance of the algorithms without sampling looks quite good, but it is misleading because the classifier mostly predicts the majority class and misclassifies the minority class. The classification models therefore give different results depending on the sampling method applied. Overall, the method that produces the best performance is the combination of ADASYN and Random Forest, as shown by its accuracy, precision, true positive rate, true negative rate, and g-mean scores.
The results of the ADASYN resampling and Random Forest model are better than those of the other models because there are no duplicated sample values from the oversampling process, whereas SMOTE still produces some duplicated samples, so its results are not as optimal as ADASYN's. Therefore, ADASYN resampling with Random Forest generally produces better performance than other models such as SMOTE and SMOTE-ENN. The results of this experiment can also serve as a reference for further research on handling imbalanced dataset problems in machine learning. Further research can use datasets with a larger number of samples or multiclass datasets, and the sampling methods can be combined with other classification algorithms.