Stroke Prediction using Machine Learning Method with Extreme Gradient Boosting Algorithm

Based on data from the WHO, stroke ranks as the second most deadly disease. A stroke occurs when a blood vessel is blocked or ruptures, so that part of the brain does not receive the blood supply carrying the oxygen it needs, leading to death. By utilizing technology in the health sector, machine learning models can make it easier for users to predict certain diseases. Previous studies have had problems with low accuracy when used in healthcare. The purpose of this research is to increase accuracy by applying one of the ensemble learning algorithms, namely the Extreme Gradient Boosting (XGBoost) algorithm. This stroke prediction research applies the algorithm with a 70/30 train/test split: the training data (70%) comprises 3,582 records and the test data (30%) comprises 1,536 records, and the resulting accuracy is 96%, which is a good result. This study increases accuracy in predicting stroke cases and achieves better accuracy than previous studies. This is an open access article under the CC BY-SA license.

regression models using various variable selection methodologies. VAJ Hendrikus et al. conducted a study entitled Predicting Outcome of Endovascular Treatment for Acute Ischemic Stroke: Potential Value of Machine Learning Algorithms. That study included data on 1,383 EVT patients, with good reperfusion in 531 (38%) and functional independence in 525 (38%) patients. Machine learning and logistic regression models all performed poorly in predicting good reperfusion (average AUC range: 0.53-0.57) and moderately in predicting 3-month functional independence (mean AUC range: 0.77-0.79) using only the baseline variables. All models predicted 3-month functional independence well using baseline and treatment variables (mean AUC range: 0.88-0.91) [11].
The difference between the previous research and this research lies in the preprocessing: the dataset used contains missing values. This study does not delete that data; instead, it preserves the information contained in the dataset by imputing the missing values. This study aims to improve accuracy in predicting stroke cases and to obtain better accuracy than previous studies. We propose a machine learning method based on the Extreme Gradient Boosting (XGBoost) algorithm. XGBoost is chosen because it is an ensemble algorithm designed to increase predictor accuracy: the gradient boosting procedure builds a tree to fit the data, then constructs the next tree to reduce the remaining errors. The confusion matrix is used to assess the performance of the method in the stroke prediction process.
Previous studies have had problems with low accuracy when used in healthcare. The purpose of this research is to increase accuracy by applying one of the ensemble learning algorithms, namely the Extreme Gradient Boosting (XGBoost) algorithm. Several steps are carried out before classification, namely dataset retrieval, data preprocessing, and dataset distribution (train/test split). After these steps are complete, classification is performed with the Extreme Gradient Boosting method, followed by an evaluation process using a confusion matrix to measure model performance in terms of accuracy, recall, and precision.

RESEARCH METHOD
This study uses an experimental research method. The materials and the practical method are described as the flow of this research, from the materials used and the processes carried out, through data processing, to the evaluation of the classification method. Figure 1 describes the materials and methods of this research.

Dataset
A dataset is a collection of existing data from past experience, which is processed to become helpful information for an institution [12]. The Stroke Prediction Dataset is used in this research; the data is taken from Kaggle at https://www.kaggle.com/fedesoriano/stroke-prediction-dataset. This dataset is used to train and test the classification model for the prediction of stroke. It consists of 12 attributes: ten independent variables used as features (the id variable is not used) and one dependent variable used as the class label for predicting stroke. Table 1 describes the feature information in the dataset [13]. The ten independent variables are gender, age, hypertension, heart disease, ever married, work type, residence type, glucose level, BMI, and smoking status. The class label is the stroke attribute, which has two values: 0 indicates no stroke, while 1 indicates a stroke.
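As a sketch of how this dataset can be prepared, the snippet below builds a tiny stand-in frame with the column names the Kaggle file uses (the two illustrative rows are invented for this example) and drops the unused id attribute:

```python
import pandas as pd

# Tiny stand-in for the Stroke Prediction Dataset: same 12 attributes,
# but the two rows below are invented for illustration.
df = pd.DataFrame({
    "id": [1, 2],
    "gender": ["Male", "Female"],
    "age": [67.0, 61.0],
    "hypertension": [0, 0],
    "heart_disease": [1, 0],
    "ever_married": ["Yes", "Yes"],
    "work_type": ["Private", "Self-employed"],
    "Residence_type": ["Urban", "Rural"],
    "avg_glucose_level": [228.69, 202.21],
    "bmi": [36.6, None],
    "smoking_status": ["formerly smoked", "never smoked"],
    "stroke": [1, 1],
})

# id carries no predictive information, so it is dropped, leaving
# 10 feature columns plus the stroke class label.
df = df.drop(columns=["id"])
X = df.drop(columns=["stroke"])   # ten independent variables
y = df["stroke"]                  # 0 = no stroke, 1 = stroke
```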

Preprocessing
Data preprocessing is performed so that the classification method obtains good results; several techniques can be applied according to the patterns in the data, such as cleaning, transformation, etc. [14].
Data preprocessing is carried out before the data is processed with machine learning methods; the following preprocessing steps are applied.
1. Using LabelEncoder, which converts non-numeric features into numeric features. Only some attributes are changed in this labeling: gender, ever married, work type, residence type, and smoking status.
2. Replacing missing values: empty BMI entries are filled with the average BMI of stroke patients and of non-stroke patients, so no data is deleted.
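A minimal sketch of the LabelEncoder step, on a small illustrative frame (the rows are invented, but the five encoded attributes match those listed above):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Illustrative rows for the five non-numeric attributes.
df = pd.DataFrame({
    "gender": ["Male", "Female", "Female"],
    "ever_married": ["Yes", "Yes", "No"],
    "work_type": ["Private", "Govt_job", "Private"],
    "Residence_type": ["Urban", "Rural", "Urban"],
    "smoking_status": ["smokes", "never smoked", "never smoked"],
})

categorical = ["gender", "ever_married", "work_type",
               "Residence_type", "smoking_status"]
for col in categorical:
    # LabelEncoder assigns integers to the sorted category values,
    # e.g. Female -> 0, Male -> 1.
    df[col] = LabelEncoder().fit_transform(df[col])

print(df.dtypes.unique())  # every column is now numeric
```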

Train/Test Split
Train/Test Split distributes the dataset into training data and testing data. In this study, the data are divided 70/30, with 1,536 test records in total. In general, machine learning models obtain good accuracy with a small amount of testing data, so in this study we increase the test data and check whether the model still obtains good results. Table 2 describes the data split that is carried out.
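The 70/30 split can be sketched with scikit-learn's train_test_split; the synthetic arrays below stand in for the stroke features (with the real 5,118 preprocessed rows, this split yields the 3,582 training and 1,536 test records reported by the study):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 samples with 10 features and a binary label.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 10))
y = rng.integers(0, 2, size=100)

# 70% training data, 30% testing data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)

print(len(X_train), len(X_test))  # 70 30
```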

Classification using the Extreme Gradient Boosting Algorithm
XGBoost (Extreme Gradient Boosting) combines boosting and gradient boosting. The boosting component builds each new model to correct the errors of the previous model, and XGBoost uses gradient descent to reduce those errors as new models are formed. The XGBoost process requires several parameters, called hyperparameters, to obtain an optimal model; adjusting these parameters affects how the method performs on the dataset. These parameters are used to improve the classification with the XGBoost (Extreme Gradient Boosting) method. Some of the parameters that can be used in the classification can be seen in Table 3.

Table 3. Parameters in XGBoost Method
Parameter | Information
max depth | Maximum depth of the tree.
eta (learning rate) | Prevents overfitting by shrinking the contribution of each step.
min child weight | Minimum sum of instance weights needed in a child node.
n estimators | Number of trees.
subsample | Fraction of the training data randomly sampled before constructing each tree.
gamma | Minimum loss reduction required to create a further partition on a node of the tree.
random state | Initialization of the internal random number generator.

Evaluation Method
A confusion matrix is used to evaluate the method in this classification. It is an evaluation tool often used to assess the performance of a classification model based on objects that have true and false prediction values. In measuring the model's performance in this study, four counts identify a prediction: True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN).
Some of the evaluation metrics commonly used are as follows:
1. Accuracy (ACC): the overall effectiveness of the classification results.
2. Precision (PREC): the percentage of data labeled positive by the classifier that is truly positive.
3. Recall (REC), or sensitivity: the effectiveness of the classifier in identifying positive labels.
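These three metrics can be computed directly from the four counts. In the sketch below, FN = 61 is the value this study reports for its best experiment; the other three counts are assumptions, chosen only so that the 1,536 test records reproduce the reported 96% accuracy, 71% precision, and 20% recall:

```python
# FN = 61 is reported by the study; TP, TN, and FP below are assumed
# values consistent with the reported metrics on 1,536 test records.
TP, TN, FP, FN = 15, 1454, 6, 61

accuracy  = (TP + TN) / (TP + TN + FP + FN)   # overall effectiveness
precision = TP / (TP + FP)                    # correctness of positive labels
recall    = TP / (TP + FN)                    # coverage of true positives

print(round(accuracy, 2), round(precision, 2), round(recall, 2))
# 0.96 0.71 0.2
```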

RESULT AND ANALYSIS

Dataset
The dataset used in this study is from Kaggle and is generally used for education and research. It consists of 12 attributes: ten independent variables used as features (excluding the id variable) and one dependent variable used as the class label for predicting stroke. Figure 2 shows the dataset used. As Figure 2 explains, the dataset contains 12 attributes, from the id variable to the stroke variable. Because id has no relationship that can determine whether a patient may have a stroke, id is deleted, leaving 11 attributes. Of these 11 attributes, 10 are independent attributes, from the gender variable to smoking status, and the stroke variable serves as the class/label in this classification process. The class/label (stroke variable) takes the value 0, meaning the patient is not indicated for stroke, or 1, meaning the patient is indicated for stroke.

Preprocessing
Data preprocessing is carried out so that the method obtains good results in the classification process. The preprocessing in this classification has two steps: a Label Encoder, which converts non-numeric features into numeric features, and replacement of missing values [15]. The results of this preprocessing can be seen in Figure 3. The results show that the non-numeric features in the dataset have been converted into numeric features. The empty attributes are then replaced: BMI values that were empty (NaN values in the BMI column) are filled with the average BMI of stroke patients and the average BMI of non-stroke patients.
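The missing-value step can be sketched with a grouped fill in pandas: empty BMI entries are replaced by the mean BMI of their own stroke group, so no rows are deleted (the five rows below are invented for illustration):

```python
import pandas as pd

# Tiny illustrative frame: two missing BMI values, one per class.
df = pd.DataFrame({
    "bmi":    [30.0, None, 20.0, 24.0, None],
    "stroke": [1,    1,    0,    0,    0],
})

# Fill each missing BMI with the mean BMI of its stroke group:
# stroke patients' mean is 30.0, non-stroke patients' mean is 22.0.
df["bmi"] = df.groupby("stroke")["bmi"].transform(
    lambda s: s.fillna(s.mean()))

print(df["bmi"].tolist())  # [30.0, 30.0, 20.0, 24.0, 22.0]
```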

Train/Test Split
The dataset is divided into training data and testing data; the result of the Train/Test Split can be seen in Figure 4. This study distributes the data 70/30: the training data comprises 3,582 records and the test data 1,536 records.

Extreme Gradient Boosting Algorithm Classification
For classification using the Extreme Gradient Boosting (XGBoost) method, the primary step is parameter tuning. The tuning results are shown in Table 4, which lists the parameters used in the experiments: max depth, learning rate, min child weight, n estimators, and random state. In the first experiment, fifty trees were used, each with a maximum depth of five; the learning rate was 0.1, which sets the learning level that affects the XGBoost algorithm in tree-based classification; and the minimum child weight was 1, meaning that if a tree partition produces a leaf node whose total instance weight is less than min child weight, further partitioning stops. The accuracy of the first experiment was 91%. The second experiment used sixty trees, each with a maximum depth of ten, a learning rate of 0.2, a minimum child weight of 1, and a random state of five; the accuracy obtained was 92%.
The third experiment increased the number of trees to seventy, each with a maximum depth of ten, raised the learning rate to 0.3, kept the minimum child weight at 1, and used a random state of fifty; the accuracy obtained was 93%. The fourth experiment increased the number of trees to one hundred, each with a maximum depth of ten, raised the learning rate to 0.4, set the minimum child weight to 2, and reduced the random state to four; the accuracy obtained was 95%, so the chosen parameters and values were able to raise the accuracy by 2%. Of the five experiments carried out, the best result was obtained in the fifth experiment, with an accuracy of 96% using only three parameters: max depth, learning rate, and n estimators. The best classification used one hundred trees, each with a maximum depth of fifteen, and a learning rate of 0.09, which sets the learning level that affects the Extreme Gradient Boosting algorithm in tree-based classification. Figure 2 shows an example tree from Extreme Gradient Boosting (XGBoost) in this research. The results obtained from the tree are as follows.

Stroke
1. If age is less than 67.5 and age is less than 55.5, it returns -0.176223248.
2. If age is less than 67.5, age is more than 55.5, and heart disease is less than 1, it returns -0.157049194.
3. If age is less than 67.5, age is more than 55.5, heart disease is more than 1, BMI is less than 35.59, and smoking status is less than 1, it returns 0.0200000014 for the next tree, and the rest return -0.132631585.
4. If age is less than 67.5, age is more than 55.5, heart disease is more than 1, BMI is more than 35.59, and smoking status is more than 3, it returns -0.0490909107 in the next tree and 0.0900000036 for the others.
5. If age is more than 67.5, ever married is less than 1, hypertension is less than 1, and BMI is less than 23.1499, it returns -0.0200000014.
6. If age is more than 67.5, ever married is less than 1, hypertension is less than 1, BMI is more than 23.1499, and smoking status is more than 1, it returns -0.133548394 for the other trees.
7. If age is more than 67.5, ever married is less than 1, hypertension is less than 1, BMI is more than 23.1499, smoking status is more than 1, and age is less than 77.5, it returns -0.0360000022, and -0.114545457 for the other trees.
8. If age is more than 67.5, ever married is less than 1, hypertension is more than 1, and BMI is more than 30.23, it returns 0.0818181857 for the other trees.
9. If age is more than 67.5, ever married is less than 1, hypertension is more than 1, BMI is more than 30.23, and residence type is less than 1, it returns -0.0600000024 in the next tree and 0.0360000022 for the other trees.
The best result in this study came from increasing the values of several parameters, namely the number of trees (n estimators) and the maximum tree depth (max depth), while decreasing the learning rate and leaving unset some parameters used in the earlier experiments, such as random state and min child weight. The resulting accuracy is 96%, with a False Negative (FN) count of 61. From the confusion matrix evaluation, the binary classification metrics are then calculated to assess the performance of the model built with the tested XGBoost (Extreme Gradient Boosting) algorithm. Of the experiments carried out with the tested parameters, the fifth experiment produced the best accuracy, 96%. The overall result of the fifth test can be seen in Figure 5, the classification report. Figure 5 shows the accuracy, recall, precision, and other values of the classification. The accuracy value shows that the ratio of correctly predicted stroke and non-stroke patients among all patients in the dataset is 96%. The recall compares the patients correctly predicted to have a stroke with all patients who actually had a stroke, giving a value of 20%, and the precision shows the ratio of patients who truly had a stroke among all patients predicted to have one, 71%. The recall of only 20% is, of course, a low result. It arises because only around 76 patients in the split test data actually had a stroke, so the sum of True Positive and False Negative in the denominator of the recall formula is large relative to the True Positive count, and dividing True Positive by that sum yields a low recall. A comparison of previous research with this research is described in Table 5.
The topic raised in previous studies is the same as in this research, namely stroke; the differences lie in the machine learning classification methods applied, which include SVM, Decision Tree, etc., and in the data preprocessing technique. Previous research deleted data with missing values, which caused a lot of data loss; some studies applied no preprocessing at all and did not address the missing values, which, of course, biases the classification results.
In this study, the author refines the research based on the shortcomings of previous work and conducts testing by adjusting the method used, such as setting its parameters, to improve the accuracy of the model used in this stroke classification process, namely stroke prediction using the Extreme Gradient Boosting method.

CONCLUSION
The accuracy obtained is better than that of previous studies using the same dataset, the Stroke Prediction Dataset. This study consists of four stages on the Stroke Prediction Dataset: data preprocessing, data splitting, classification with XGBClassifier, and classifier evaluation. The changes made relative to previous research include the data preprocessing process. Previous research deleted data, losing 202 records in data processing, and other previous studies neither deleted data nor otherwise handled the empty values in the BMI attribute, so their classification results were biased. The preprocessing in this study fills the empty values in the BMI attribute with the average BMI of stroke patients and the average BMI of non-stroke patients, so there is no data loss and the classification is not biased, because the empty values in the BMI attribute have been resolved. The second change concerns the distribution of training and testing data: this study increases the test data by 10%, to 30%, compared to previous studies that used only 20% test data. The results show that classification using XGBoost (Extreme Gradient Boosting) achieved a best accuracy of 96%, a better result than previous studies.
A suggestion for further research is to overcome the class imbalance of the dataset. The stroke prediction dataset from Kaggle has a class imbalance: 249 people have had a stroke and 4,860 have not. Such an imbalance can affect the model during classification, so that it can only recognize the majority class; the minority class will most likely be predicted as the majority class. Given this imbalance, a method to overcome it needs to be applied.