Handling Imbalance Data Using Hybrid Sampling SMOTE-ENN in Lung Cancer Classification



INTRODUCTION
Artificial intelligence is one product of the rapid development of technology [1]. It improves the ability of computers and software to obtain and process information by adopting and imitating human intelligence. One application of artificial intelligence is machine learning, which focuses on developing systems capable of learning on their own without continual reprogramming [2]. Machine learning is computer programming that aims to meet certain criteria by utilizing training data or experience [3]. One problem commonly solved with machine learning is classification. Classification is a machine learning task in which a model predicts the appropriate category or label [4]; it is used to estimate the class of data whose class is not known beforehand [5]. Many algorithms have been developed for classification, but some problems remain obstacles, notably unbalanced classes in the data [6]. When the classes are imbalanced, machine learning models tend to over-predict the class with the larger number of instances [7]. As a result, the model has high accuracy on the majority class and low accuracy on the minority class [8,9]. In practice, a small difference in class sizes matters little, but the situation changes when the difference becomes too large. The problem of data imbalance also appears in the lung cancer data, which contains 283 positive-class and 38 negative-class instances. Therefore, this research focuses on the imbalance problem in the data.
Various techniques have been proposed to address the issue of imbalanced data, with resampling being one of the prominent approaches [10]. Resampling methods aim to rebalance the class distribution by manipulating the dataset through different sampling algorithms, thus enabling better training of classification models. These techniques generally fall into three categories: undersampling, oversampling, and hybrid sampling. For instance, a previous study [11] utilized undersampling in combination with the Random Forest K-Fold algorithm to mitigate class imbalance, resulting in improved performance metrics such as AUC scores exceeding 0.5. Similarly, another investigation [12] compared the effectiveness of oversampling techniques, particularly SMOTE, with traditional methods. The study reported significant enhancements across various evaluation metrics, including accuracy, sensitivity, precision, G-Mean, F1-score, specificity, and Youden's Index. However, despite these advancements, neither undersampling nor oversampling alone can fully address the complexities of imbalanced datasets. In contrast, our proposed approach combines the strengths of both undersampling and oversampling through the SMOTE-ENN method, offering a more comprehensive solution for lung cancer classification. By synthetically generating minority class instances while simultaneously removing noisy samples, SMOTE-ENN enhances the discriminative power of the model, resulting in improved performance and robustness against class imbalance. Therefore, our study aims to investigate the efficacy of this combined approach in handling imbalanced data and its impact on lung cancer classification performance.
In this study, the researchers use a hybrid sampling technique to handle class imbalance in lung cancer patient data. The study tests the SMOTE-ENN algorithm in combination with Random Forest for balancing classes, following the suggestion in research [13]. Research [13] combined SMOTE and Random Forest for imbalanced data; the combination improved accuracy by 5% and sensitivity by 39% compared to Random Forest without SMOTE. The SMOTE-ENN hybrid sampling method with Random Forest has not been applied to the classification of lung diseases in previous research. The difference between this research and research [13] therefore lies in the resampling technique used to handle data imbalance: research [13] uses an oversampling technique, while this research uses a hybrid sampling technique that combines undersampling and oversampling. The data used in this study is lung cancer patient data obtained from the data provider website Kaggle. The data is imbalanced, with a class ratio of 283 to 38. To address this imbalance, the researchers resample the data with a hybrid sampling technique, in this case SMOTE-ENN, and then model it with the Random Forest classification algorithm. This research thus aims to use hybrid sampling to handle imbalanced lung cancer data and improve the accuracy of Random Forest classification. Its contribution is the use of a hybrid sampling technique (SMOTE-ENN) to handle class imbalance in lung cancer patient data. The combination of undersampling and oversampling is expected to provide a more comprehensive solution than either technique alone, resulting in increased performance and robustness in lung cancer classification with the Random Forest algorithm.

RESEARCH METHOD
An effective cancer prediction system can help people learn their cancer risk at a lower cost and make the right decisions based on their risk status. To achieve these goals, this research follows the process flow shown in Figure 1. The quality and accuracy of the data used often determine the success of a study, so the first crucial step is data collection. In this context, the researchers collected the lung cancer dataset, a step that requires care and caution. The data comes from Kaggle, a leading platform that provides a variety of datasets for analysis and research purposes. The dataset was not selected haphazardly but through careful consideration, to ensure that the data was in line with the research objectives and of reliable quality. Data retrieval is not merely a mechanical downloading process; it also involves a deep understanding of the dataset's characteristics. Researchers need to ensure that the retrieved data is highly relevant to the research focus and to identify potential biases or anomalies in the dataset. This step sets a solid foundation for the further analysis.

Preprocessing
In the preprocessing stage, lung cancer data is processed from its raw form into a format that is ready to be used for training classification models, to avoid potential problems that could interfere with classification results [10]. The steps are checking for and removing duplicate data, evaluating label imbalance, and applying a hybrid resampling technique, specifically SMOTE-ENN, to align the number of instances between the majority and minority classes [11]. The result of this process is a balanced dataset, so that the model to be trained can learn well from both classes and produce accurate classification results.
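The first two preprocessing steps (duplicate removal and the label-imbalance check) can be illustrated with a minimal standard-library sketch. The toy records, the column layout, and the helper names `drop_duplicates`, `class_counts`, and `imbalance_ratio` are assumptions made for illustration and are not taken from the study; with pandas, `DataFrame.drop_duplicates()` and `Series.value_counts()` would serve the same purpose.

```python
from collections import Counter


def drop_duplicates(rows):
    """Return rows with exact duplicates removed, keeping first occurrences."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def class_counts(rows, label_index=-1):
    """Count how many rows carry each class label."""
    return Counter(row[label_index] for row in rows)


def imbalance_ratio(counts):
    """Size of the largest class divided by the size of the smallest class."""
    return max(counts.values()) / min(counts.values())


# Hypothetical toy records: (age, smoking, label). Not the study's data.
records = [
    (61, 2, "YES"),
    (61, 2, "YES"),  # exact duplicate of the first record
    (55, 1, "YES"),
    (70, 2, "YES"),
    (48, 1, "NO"),
]

unique_records = drop_duplicates(records)
counts = class_counts(unique_records)
ratio = imbalance_ratio(counts)
print(len(unique_records), dict(counts), ratio)
```

A ratio well above 1 signals that resampling may be worth considering before training.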

Modeling
The modeling stage is the third stage, after data collection and preprocessing. In this stage, the cleaned data is modeled with the Random Forest algorithm using the scikit-learn library in Python. Model selection is based on data characteristics and analysis objectives, considering model performance and interpretability. Before modeling, the data is divided into training and test data using the 10-fold cross-validation technique [12].
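How a 10-fold split partitions the data can be sketched as follows. This is a standard-library illustration, not the study's code: the function name `stratified_k_fold` is hypothetical, and stratification (keeping the class mix similar in each fold) is an assumption, since the text only states that 10-fold cross-validation is used. The class sizes 238 and 38 are those reported after duplicate removal. In scikit-learn, `KFold` and `StratifiedKFold` in `sklearn.model_selection` provide this functionality.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal indices round-robin so every fold gets a similar class mix.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)

    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(
            idx for j, fold in enumerate(folds) if j != i for idx in fold
        )
        yield train_idx, test_idx


# Class sizes reported after duplicate removal: 238 and 38 instances.
labels = ["YES"] * 238 + ["NO"] * 38
splits = list(stratified_k_fold(labels, k=10))

for fold_number, (train_idx, test_idx) in enumerate(splits, start=1):
    n_minority = sum(labels[i] == "NO" for i in test_idx)
    print(fold_number, len(train_idx), len(test_idx), n_minority)
```

Each model is then trained on the nine training folds and scored on the held-out fold, and the scores are averaged over the ten rounds.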

Model Evaluation
The last stage of this research is to evaluate the performance of the Random Forest algorithm in classifying lung cancer. The evaluation uses three metrics: accuracy, recall (sensitivity), and specificity, defined in Equations (1), (2), and (3). The accuracy value describes how accurately the system classifies the data; in other words, it is the ratio of correctly classified data to the overall data, as in Equation (1). The sensitivity value shows how much of the positive-category data is correctly classified by the system, as in Equation (2). The specificity value shows how much of the negative-category data is correctly classified by the system, as in Equation (3).

\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1} \]

\[ \text{Sensitivity} = \frac{TP}{TP + FN} \tag{2} \]

\[ \text{Specificity} = \frac{TN}{TN + FP} \tag{3} \]

Where TP is True Positive (the number of positive data correctly detected as positive), TN is True Negative (the number of negative data correctly detected as negative), FP is False Positive (negative data detected as positive), and FN is False Negative (positive data detected as negative).

International Journal of Engineering and Computer Science Applications (IJECSA), ISSN: 2828-5611
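The three metrics follow directly from the four confusion-matrix counts. The sketch below uses the standard definitions of accuracy, sensitivity, and specificity; the function names and the example counts are hypothetical and are not results from this study.

```python
def accuracy(tp, tn, fp, fn):
    """Share of all data that is classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def sensitivity(tp, fn):
    """Share of actual positive data that is detected as positive (recall)."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Share of actual negative data that is detected as negative."""
    return tn / (tn + fp)


# Hypothetical counts chosen only to illustrate the arithmetic.
tp, tn, fp, fn = 90, 30, 10, 20

acc = accuracy(tp, tn, fp, fn)
sens = sensitivity(tp, fn)
spec = specificity(tn, fp)
print(f"accuracy={acc:.3f} sensitivity={sens:.3f} specificity={spec:.3f}")
```

On imbalanced data the three values can diverge sharply, which is why accuracy alone is not reported.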

RESULT AND ANALYSIS

Data Collecting
In the initial stage, the data collection process, secondary data in the form of lung cancer data was obtained from the Kaggle site. The lung cancer dataset obtained from Kaggle has 309 records and 15 attributes. The attributes of the dataset can be seen in Table 1.

Table 1. Attributes of the Lung Cancer Disease Dataset
No.  Attribute              Description
1    Gender                 The patient's gender
2    Age                    The patient's age
3    Smoking                Whether the patient is a smoker
4    Yellow Fingers         Whether the patient has yellow fingers
5    Anxiety                Excessive panic when breathing falls out of rhythm
6    Peer Pressure          Psychological stress or feeling pressured by the environment (shortness of breath in crowds)
7    Chronic Disease        Having a chronic disease
8    Fatigue                Rapid fatigue during daily activities
11   Alcohol Consuming      History of alcohol consumption
12   Coughing               Whether the patient has a cough
13   Shortness of Breath    Whether the patient has shortness of breath
14   Swallowing Difficulty  Whether the patient has difficulty swallowing
15   Lung Cancer            Prediction class

Preprocessing
In this stage, the data is first checked for duplicate records. After the duplicates are eliminated, the labels are checked for imbalance. The results show that the majority class in the lung cancer data has 238 instances, while the minority class has only 38 instances. To solve this imbalance problem, the data is resampled. Resampling in this study uses a hybrid technique, which combines oversampling and undersampling. The hybrid method implemented is SMOTE-ENN, which aims to align the number of examples in the majority and minority classes so that the two have a balanced, or not too different, distribution, as in Table 2.
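The two halves of the hybrid technique can be illustrated with a simplified standard-library sketch: SMOTE creates synthetic minority points by interpolating between a minority point and one of its nearest minority neighbours, and ENN then removes points whose label disagrees with their nearest neighbours. Everything here is an assumption made for illustration: the helper names, the toy 2-D points, the neighbour counts, and the majority-vote form of ENN. Library implementations (for example `SMOTEENN` in imbalanced-learn) differ in details, and this is not the study's code.

```python
import math
import random
from collections import Counter


def nearest(point, candidates, k):
    """Indices of the k candidates closest to point (Euclidean distance)."""
    order = sorted(range(len(candidates)),
                   key=lambda i: math.dist(point, candidates[i]))
    return order[:k]


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points between minority points and neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        others = minority[:i] + minority[i + 1:]
        neighbour = others[rng.choice(nearest(base, others, k))]
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b)
                               for b, n in zip(base, neighbour)))
    return synthetic


def enn(points, labels, k=3):
    """Keep only points whose label matches the majority of k neighbours."""
    kept_points, kept_labels = [], []
    for i, (point, label) in enumerate(zip(points, labels)):
        others = points[:i] + points[i + 1:]
        other_labels = labels[:i] + labels[i + 1:]
        votes = Counter(other_labels[j] for j in nearest(point, others, k))
        if votes.most_common(1)[0][0] == label:
            kept_points.append(point)
            kept_labels.append(label)
    return kept_points, kept_labels


def smote_enn(points, labels, minority_label, k_smote=5, k_enn=3, seed=0):
    """Oversample the minority class to parity, then clean with ENN."""
    minority = [p for p, l in zip(points, labels) if l == minority_label]
    n_majority = len(points) - len(minority)
    new_points = smote(minority, n_majority - len(minority), k_smote, seed)
    all_points = list(points) + new_points
    all_labels = list(labels) + [minority_label] * len(new_points)
    return enn(all_points, all_labels, k_enn)


# Hypothetical toy data: one majority cluster, one minority cluster, and a
# single noisy majority point that sits inside the minority cluster.
majority_points = [(0, 0), (0, 1), (1, 0), (1, 1),
                   (0.5, 0.5), (0, 2), (2, 0), (1, 2)]
noisy_point = (9.5, 9.5)
minority_points = [(9, 9), (10, 9), (9, 10)]

points = majority_points + [noisy_point] + minority_points
labels = ["majority"] * 9 + ["minority"] * 3

new_points, new_labels = smote_enn(points, labels, minority_label="minority")
resampled_counts = Counter(new_labels)
print(dict(resampled_counts))
```

In this toy case the minority class grows from 3 to 9 points through interpolation, and ENN discards the noisy majority point because all of its nearest neighbours belong to the minority class.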

Modeling
After the data is processed and cleaned, the next step is modeling using Random Forest. Before modeling, the data is first divided using K-Fold cross-validation with K = 10. There are two modeling schemes: the first is Random Forest without imbalance handling, and the second is Random Forest with SMOTE-ENN (SMOTE-ENN + Random Forest). The two schemes are compared in the evaluation process to determine the best model.

Model Evaluation
In this process, the two models, namely Random Forest without SMOTE-ENN and Random Forest with SMOTE-ENN, are evaluated using accuracy, sensitivity, and specificity. The scores of both models can be seen in Table 3. Table 3 shows that Random Forest has an accuracy of 90.2%, a recall (sensitivity) of 90.2%, and a specificity of 89.7%. This shows that the Random Forest model performs well in data classification. However, with SMOTE-ENN + Random Forest, all metrics show a noticeable improvement: accuracy, recall, and specificity each reached 99.7%. Table 3 also shows a considerable difference between SMOTE+RF and SMOTE-ENN+RF, with a 5.6% difference in accuracy. This study found that using the SMOTE-ENN hybrid sampling technique with the Random Forest model significantly improves the model's ability to identify and classify data.
In Figure 2, the Random Forest method without SMOTE-ENN correctly predicted the cancer class in 22 out of 38 cases, while the non-cancer class was correctly predicted in 227 out of 238 cases. In Figure 3, the Random Forest method with SMOTE-ENN correctly predicted the cancer class in 214 out of 238 cases, while the non-cancer class was correctly predicted in 172 out of 238 cases. The use of SMOTE-ENN can therefore help improve the classification accuracy of the Random Forest method, which is in line with the research results in [14][15][16].
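As a quick arithmetic check on the Figure 2 counts, and assuming those counts are the correctly classified cases of each class over the 38 + 238 = 276 records, the overall accuracy works out as:

```latex
\[
\text{Accuracy} = \frac{22 + 227}{38 + 238} = \frac{249}{276} \approx 0.902
\]
```

This agrees with the 90.2% accuracy reported in Table 3 for Random Forest without SMOTE-ENN.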

CONCLUSION
After conducting the testing process, it was found that combining the SMOTE-ENN method with Random Forest gives accuracy, sensitivity, and specificity of 99.7% each. Without SMOTE-ENN, the accuracy, sensitivity, and specificity only reached 90.2%, 90.2%, and 89.7%, respectively. The combination of SMOTE-ENN with Random Forest thus improved the model's performance, raising accuracy by 9.5% and sensitivity by 9.5% compared to Random Forest without SMOTE-ENN. In addition, SMOTE-ENN+RF has better accuracy than the previous study that used SMOTE+RF, with an accuracy difference of 5.6%. This confirms that applying SMOTE-ENN can significantly improve the ability of the Random Forest method to predict lung cancer.

Figure 1. Flow of Research Method

Table 2. Comparison of Number of Classes of Resampling Methods

Table 3. Metric Score Model