Essay Auto-scoring using N-Gram and Jaro Winkler Based Indonesian Typos

Writing errors in e-essay exams reduce scores, so detecting and correcting writing errors in answers automatically is necessary. The implementation of Levenshtein Distance and N-Gram can detect writing errors, but the process takes a long time because of the distance method used. Therefore, this research aims to hybridize the Jaro Winkler and N-Gram methods to detect and correct writing errors automatically. The process requires preprocessing and finding the best word recommendations with the Jaro Winkler method, which refers to the Kamus Besar Bahasa Indonesia (KBBI); the N-Gram method refers to the corpus. The final scoring uses the Vector Space Model (VSM) method based on the similarity of words between the answer keys and the respondents' answers. The dataset consists of 115 answers from 23 respondents containing writing errors. The results show that the Jaro Winkler and N-Gram methods are good at detecting and correcting Indonesian words, with an average detection accuracy of 83.64% (minimum of 57.14% and maximum of 100.00%), while the error correction accuracy averages 78.44% (minimum of 40.00% and maximum of 100.00%). However, Natural Language Processing


INTRODUCTION
Exams are one way to assess the level of student understanding of the material provided [1]. One type of examination is the essay exam, which can be conducted online. An essay exam does not give answer choices but requires writing the answer as a descriptive sentence [2]. If the assessment is done manually, it takes a long time, allows non-objective reviews to occur, and the quality decreases [3]. This can be overcome by evaluating essays automatically using a computer [4,5].
Students must write their answers accurately [6] and without writing errors, because errors can reduce the exam score when an automatic essay scoring system is used [7,8]. Writing errors arise from several causes, including a lack of concentration while typing, adjacent keyboard positions, and a lack of understanding of the correct words [9][10][11][12]. In addition, there are four types of writing error: insertion, deletion, replacement, and transposition of letters in a sentence. Therefore, writing mistakes in answering essay exams must be overcome by detecting and correcting them [13].
Research to overcome writing errors using bigrams and trigrams [14,15] in English obtained good results, with 71%-79% precision and 81%-88% recall, and could also detect and correct more than one word in a sentence. However, many bigrams and trigrams were not found because the n-gram corpus was lacking [16]. Other research used bigrams and trigrams in Indonesian [17,18] and added the additive smoothing method to the n-gram probability calculation [19,20]. That study obtained a rather low accuracy of 11% because of the small n-gram corpus and because the additive smoothing method raises the probability of n-grams with 0 occurrences. Other research compared several techniques, namely the Hamming distance, Levenshtein distance, Damerau-Levenshtein distance, and Jaro Winkler distance [21], evaluated using relevance judgments. The results of that study indicate that the Jaro Winkler method obtains the highest Mean Average Precision (MAP) value of 0.87 [22] compared to the other methods [23,24].
Further research implemented the Jaro Winkler method in essay scoring. This method can detect three types of writing errors: excess letters, missing letters, and misplaced letters in a sentence. That research obtained an accuracy of 57%-73% [25]. Based on the methods from previous research, we propose a combination of the N-Gram and Jaro Winkler methods to detect and correct writing errors in an essay scoring system. N-grams are used in this research because previous research obtained reasonably good accuracy and could detect more than one word in a sentence. In addition, Jaro Winkler runs faster than the Levenshtein distance algorithm [21,26]. Therefore, this research aims to hybridize the Jaro Winkler and N-Gram methods to detect and correct writing errors automatically.

RESEARCH METHOD
This research method consists of data collection and data processing. Data processing consists of several stages: data preprocessing, building an n-gram corpus, writing error detection and correction, and scoring.

Data Collection
Data collection is one of the stages used to find the information needed as a basis or support for the research. This research uses three types of data from the following sources:

a. Indonesian Dictionary
This dictionary is sourced from the Kamus Besar Bahasa Indonesia (KBBI) versions 3 and 4, published by Balai Pustaka. The dictionary is used as a reference in detecting and correcting writing errors; it was first processed with case folding and filtering to remove hyphens. The total is 43,562 words.

b. N-Gram Corpus
The n-gram corpus is the result of processing Indonesian articles with the n-gram method and is used as a reference in the detection and correction process [27]. In addition, the n-gram word corpus is also obtained by applying n-gram processing to the Kamus Besar Bahasa Indonesia (KBBI). The total data used are 52,828 unigrams, 110,412 bigrams, and 69,787 trigrams.

c. Test Data
The test data are sourced from a questionnaire with questions about artificial intelligence, machine learning, and artificial neural networks, filled in by several respondents, whose answers were then given error scenarios during the testing process. The test data used were 115 answers from 23 respondents.

Data Processing
Data processing is a process that starts from raw data processing and finishes when final results are obtained [28]. The stages of data processing are shown in Figure 1.
a. Data Preprocessing
Data preprocessing is the initial stage of processing that prepares the data to be processed further. The preprocessing stages are as follows, and a sample result of this process can be seen in Table 1.

Case Folding
Case folding is a method for converting all uppercase characters to lowercase [29]. Only letters a-z are accepted.

Sentence Tokenization
Tokenization is the process of splitting text into its constituent parts [30]. Tokenization requires a delimiter or separator, such as the space between one word and another. Sentence tokenization, in turn, separates the sentences that make up a paragraph using a period as the delimiter.

Filtering
Filtering is the process of selecting important words and removing unimportant ones [31]. In this research, however, filtering is used to remove punctuation, i.e., characters other than letters and spaces.

Word Tokenization
Like sentence tokenization, word tokenization cuts a sentence into words using a space delimiter or separator. The result of this process is a sequence of word tokens in which an underscore ("_") is added at the beginning and end of the sentence.
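As an illustration of these steps, a minimal Python sketch of the preprocessing pipeline is shown below; the function name and the exact handling of the "_" padding are our assumptions, not taken from the paper's implementation.

```python
import re

def preprocess(text):
    """Minimal sketch of the preprocessing pipeline described above (assumed behavior)."""
    # Case folding: convert every character to lowercase
    text = text.lower()
    # Sentence tokenization: split the paragraph on the period delimiter
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    processed = []
    for sentence in sentences:
        # Filtering: keep only letters a-z and spaces (remove punctuation and other characters)
        sentence = re.sub(r"[^a-z ]", "", sentence)
        # Word tokenization: split on spaces and pad the sentence with "_" markers
        tokens = ["_"] + sentence.split() + ["_"]
        processed.append(tokens)
    return processed

print(preprocess("Data processing is the processing of raw data into information."))
# [['_', 'data', 'processing', 'is', 'the', 'processing', 'of', 'raw', 'data', 'into', 'information', '_']]
```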

Table 1. Sample result of data preprocessing
Before preprocessing: Data processing is the processing of raw data into information
After preprocessing: data processing is the processing of raw data into information

b. N-Gram Corpus
The N-Gram corpus is the result of processing Indonesian articles and the Kamus Besar Bahasa Indonesia (KBBI) using the N-Gram method, and it serves as a reference for the writing error detection and correction process. The stages are as follows:

Data Preprocessing
Data preprocessing is the initial process that needs to be done to prepare the data for the next process that has been described previously.

N-Gram Modelling
N-gram is an algorithm that considers adjacent sequences of n items by dividing a sentence or word into smaller parts and then calculating their probability [32]. Several types of n-grams can be used, namely the unigram with n=1, the bigram with n=2, the trigram with n=3, and the quadgram with n=4 [33]. However, this study used only three types of n-grams: unigram, bigram, and trigram. The n-gram modeling is presented in Figure 2. Based on Figure 2, the sentence "I love the food" is modeled into three types of n-grams. A unigram, with n=1, consists of only one word. A bigram, with n=2, consists of two words: the word at position n and the next word. A trigram, with n=3, consists of three words: the word at position n and the next two words.
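As a minimal sketch (not the paper's code), modeling the example sentence into the three n-gram types could be written as follows.

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) from a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I love the food".lower().split()
print(ngrams(tokens, 1))  # unigrams: [('i',), ('love',), ('the',), ('food',)]
print(ngrams(tokens, 2))  # bigrams:  [('i', 'love'), ('love', 'the'), ('the', 'food')]
print(ngrams(tokens, 3))  # trigrams: [('i', 'love', 'the'), ('love', 'the', 'food')]
```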

Calculate Word Occurrence
This process is carried out by counting the occurrences of words in each type of n-gram in the processed text and entering the processed results into the database. An example of the results of making an n-gram corpus is in Table 2.
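A small sketch of this counting step, reusing the ngrams() helper sketched above; the resulting counts would then be stored in the database as the n-gram corpus (the storage step is omitted here).

```python
from collections import Counter

def count_ngrams(sentences, n):
    """Count how often each n-gram occurs across all preprocessed sentences."""
    counts = Counter()
    for tokens in sentences:
        counts.update(ngrams(tokens, n))  # ngrams() as sketched above
    return counts

corpus = [["_", "data", "processing", "is", "the", "processing", "of", "raw", "data", "_"]]
print(count_ngrams(corpus, 2).most_common(3))
```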

c. Writing Error Detection and Correction
The main process in this research is writing error detection and correction using the N-Gram and Jaro Winkler methods. This process refers to the Kamus Besar Bahasa Indonesia (KBBI) and the N-Gram corpus. Several processes are applied as follows:

Data Preprocessing
Data preprocessing is the initial process needed to prepare the data for the next process, as described previously.

Jaro Winkler
Jaro Winkler is an algorithm used to measure the similarity between two strings; it is a variant of the Jaro distance [34,35]. Two strings are identical if the Jaro Winkler value is 1; conversely, if the Jaro Winkler value is 0, there is no similarity between the two strings [36]. According to Yulianingsih [37], there are three steps to calculate the similarity between two strings using Jaro Winkler. As an example, take the word procesing and compare it with "processing": procesing has a character length of 9, and processing has a character length of 10. These steps are applied to every pair of words being compared.
- Calculate the common characters of the 2 strings
This step calculates the theoretical distance, used as an index reference when searching for matching characters, with the formula in (1).
Based on the example, the words procesing and processing have a theoretical distance of 4, meaning that two characters are considered a match if they are at most four index positions apart (four forward or four backward). The process of matching the two words is shown in Table 3. Based on Table 3, the words procesing and processing have nine characters in common.

- Calculate transpositions between the 2 strings
This step uses the same character-matching process shown in Table 3.
The words procesing and processing have no transposed characters because no two matching characters appear in different orders.
After these basic steps are done, the Jaro distance and the Jaro Winkler value are calculated. The Jaro distance can be seen in formula (2) and Jaro Winkler in formula (3).
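Putting the three steps together, a standard Jaro-Winkler implementation behaves as follows for the example pair. This is a sketch based on the usual definition, with the common prefix capped at 4 characters and a scaling factor p = 0.1; these are the conventional values rather than ones stated in this paper.

```python
def jaro_winkler(s1, s2, p=0.1):
    """Compute the Jaro-Winkler similarity of two strings (standard definition)."""
    len1, len2 = len(s1), len(s2)
    if len1 == 0 and len2 == 0:
        return 1.0
    # (1) theoretical (matching) distance: how far apart two equal characters may be
    match_distance = max(len1, len2) // 2 - 1
    s1_matches, s2_matches = [False] * len1, [False] * len2
    matches = 0
    for i, c in enumerate(s1):
        start, end = max(0, i - match_distance), min(len2, i + match_distance + 1)
        for j in range(start, end):
            if not s2_matches[j] and s2[j] == c:
                s1_matches[i] = s2_matches[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # count transpositions: matched characters that appear in a different order
    k = transpositions = 0
    for i in range(len1):
        if s1_matches[i]:
            while not s2_matches[k]:
                k += 1
            if s1[i] != s2[k]:
                transpositions += 1
            k += 1
    transpositions //= 2
    # (2) Jaro distance
    jaro = (matches / len1 + matches / len2 + (matches - transpositions) / matches) / 3
    # (3) Jaro-Winkler: boost by the length of the common prefix (at most 4 characters)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == 4:
            break
        prefix += 1
    return jaro + prefix * p * (1 - jaro)

print(round(jaro_winkler("procesing", "processing"), 2))  # 0.98
```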
N-Gram Modeling
This step is the same as the N-Gram modeling used to build the N-Gram corpus. The types of n-grams used are the unigram, left bigram, right bigram, and trigram. The words assembled into n-grams are the recommendation words from Jaro Winkler combined with the text tokens being tested. The n-gram modeling is presented in Table 4.
$C_{ij}$ is the word recommendation obtained from the Jaro Winkler calculation, where $i$ is the word token index and $j$ is the rank of the recommendation from Jaro Winkler. $W_{i-1}$ is the word token at position $i-1$, while $W_{i+1}$ is the word token at position $i+1$. The left bigram is obtained by combining $W_{i-1}$ with $C_{ij}$, the right bigram by combining $C_{ij}$ with $W_{i+1}$, and the trigram by combining $W_{i-1}$ with $C_{ij}$ and $W_{i+1}$.
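A brief sketch of how these combinations could be assembled for one token position; the variable names mirror the notation above (C_ij, W_i-1, W_i+1) and are not taken from the paper's code.

```python
def candidate_ngrams(tokens, i, candidates):
    """Assemble unigram, left bigram, right bigram, and trigram for every
    Jaro-Winkler recommendation of the token at position i."""
    w_prev, w_next = tokens[i - 1], tokens[i + 1]  # "_" padding guarantees neighbors exist
    models = []
    for c in candidates:
        models.append({
            "unigram": (c,),
            "left_bigram": (w_prev, c),
            "right_bigram": (c, w_next),
            "trigram": (w_prev, c, w_next),
        })
    return models

tokens = ["_", "data", "procesing", "is", "important", "_"]
print(candidate_ngrams(tokens, 2, ["processing", "procession"]))
```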

Calculate a Probability
This step counts the occurrence of every n-gram from the n-gram modeling in the n-gram corpus that has been built. If the n-gram is found in the corpus, its occurrence count is taken from the corpus; if it is not found, it is assigned a value of 0. An example can be seen in Table 5.
After the number of occurrences has been obtained for each type of n-gram, the total occurrence is calculated. $\sum_{r=1}^{k_i} \text{count}(C_{ir})$ is the total occurrence of the unigrams, $\sum_{r=1}^{k_i} \text{count}(W_{i-1} C_{ir})$ is the total occurrence of the left bigrams, $\sum_{r=1}^{k_i} \text{count}(C_{ir} W_{i+1})$ is the total occurrence of the right bigrams, and $\sum_{r=1}^{k_i} \text{count}(W_{i-1} C_{ir} W_{i+1})$ is the total occurrence of the trigrams. $k_i$ is the number of recommendation words produced by Jaro Winkler and modeled with n-grams for word token $i$, and $r$ is the index of a recommendation word within that list. The total occurrences of the n-grams can be seen in Table 6.
Once all occurrences have been totaled, the process continues with the probability calculation. The probability calculation follows the Markov Chain assumption, in which the probability of the next word depends on the previous word [16]. Under this assumption, the probability is estimated with Maximum Likelihood Estimation (MLE) by dividing each occurrence by the total occurrences, so the results lie between 0 and 1 [38]. The MLE formulas for the unigram, left bigram, right bigram, and trigram are given in formulas (4)-(7). The MLE probabilities are then smoothed with the Jelinek Mercer method, which is used to overcome probability results of 0 [32]. The Jelinek Mercer method interpolates across the n-gram levels and is calculated with formulas (8)-(10) for the left bigram, right bigram, and trigram, respectively [25].

An example of the probability calculation results using Jelinek Mercer is presented in Table 7.
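Since formulas (4)-(10) are not reproduced in this text, the sketch below only illustrates the general idea: an MLE estimate per n-gram level and a Jelinek Mercer interpolation that keeps the probability above 0 when a higher-order n-gram is unseen. The interpolation weight and the use of a single bigram level (rather than separate left and right bigrams) are assumptions of this sketch.

```python
def mle(count, total):
    """Maximum Likelihood Estimation: relative frequency, 0 if nothing was observed."""
    return count / total if total > 0 else 0.0

def jelinek_mercer(p_trigram, p_bigram, p_unigram, lam=0.7):
    """Jelinek-Mercer smoothing: interpolate higher- and lower-order estimates
    so an unseen higher-order n-gram does not force the probability to 0.
    The weight lam and the recursive form are assumptions, not the paper's values."""
    p_bi = lam * p_bigram + (1 - lam) * p_unigram
    return lam * p_trigram + (1 - lam) * p_bi

# Toy counts: the trigram was never seen, but the bigram and unigram were.
p_uni = mle(12, 100)
p_bi = mle(3, 20)
p_tri = mle(0, 5)
print(jelinek_mercer(p_tri, p_bi, p_uni))  # non-zero despite the unseen trigram
```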

Calculate a Score
The calculation of the score is one of the final stages in the process of detecting and correcting writing errors. This score is computed by combining the results of bigrams and trigrams because high-order n-grams are more sensitive to a context, and low-order n-grams are less sensitive in recognizing a context [16]. This score is calculated using the weighted combination score formula (11).
The calculation of the score can be seen in Table 8.
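The exact form of formula (11) is not shown in this text; a generic weighted combination of the smoothed bigram and trigram probabilities, with an assumed weight $\gamma$ and the left and right bigram probabilities averaged first, could look like:

$$\text{score}(C_{ir}) = \gamma \cdot \frac{P_{JM}^{\text{left-bigram}}(C_{ir}) + P_{JM}^{\text{right-bigram}}(C_{ir})}{2} + (1 - \gamma) \cdot P_{JM}^{\text{trigram}}(C_{ir})$$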

The scores that have been obtained are then ranked from the largest to the smallest value, and the largest value indicates the recommended correct word.

d. Scoring
The scoring process is used to obtain the value of the respondents' answers after their writing errors have been corrected. This stage uses the Vector Space Model (VSM) method, which finds the similarity between a document and a query, both represented as vectors [39]. The method uses Term Frequency (TF) and Inverse Document Frequency (IDF) calculations. After the TF-IDF weights are obtained, document matching is done with cosine similarity.

Term Frequency (TF)
Term Frequency (TF) is a way to obtain a weight for the appearance of a term in a document. If a term appears frequently in the document, its TF value will be higher than that of other terms [40]. The TF is computed using formula (12).
Inverse Document Frequency (IDF)
Inverse Document Frequency (IDF) is used to account for the occurrence of a term across the other documents [41]. The IDF is computed using formula (13):

$$idf_i = \log \frac{N}{df_i}$$

Calculate the TF-IDF Weight
The weight is calculated by multiplying the TF value by the IDF, as shown in formula (14).
If $N = df_i$, the IDF becomes 0; so that the weight $W_{ij}$ can still be obtained from the $tf_{ij}$ calculation [17], a value of 1 is added to the Inverse Document Frequency (IDF), as shown in formula (15).
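A minimal sketch of the TF-IDF weighting described in formulas (12)-(15); applying the +1 unconditionally is an assumption of this sketch, and the function and variable names are ours.

```python
import math

def tf_idf_weights(documents):
    """Compute w_ij = tf_ij * idf_i for every term in every document, where
    idf_i = log(N / df_i) + 1; the +1 keeps the weight non-zero when df_i = N."""
    N = len(documents)
    vocab = sorted({t for doc in documents for t in doc})
    df = {t: sum(1 for doc in documents if t in doc) for t in vocab}
    return [{t: doc.count(t) * (math.log(N / df[t]) + 1) for t in vocab} for doc in documents]

answer_key = "data processing is the processing of raw data".split()
answer = "data processing is processing raw data into information".split()
w_key, w_ans = tf_idf_weights([answer_key, answer])
print(w_key)
```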

Matching Documents
Matching documents is a way to calculate the level of similarity between documents [40]. Documents for which the similarity is calculated in this research are the answers given by the respondents and the answer keys to the questions. Calculations on this matching document use the cosine similarity method that can be calculated by (16).
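A sketch of the cosine similarity calculation of formula (16), applied to toy TF-IDF weight vectors; the weights shown are illustrative values, not taken from the paper.

```python
import math

def cosine_similarity(w1, w2):
    """Cosine similarity between two TF-IDF weight vectors stored as dicts."""
    dot = sum(w1[t] * w2.get(t, 0.0) for t in w1)
    norm1 = math.sqrt(sum(v * v for v in w1.values()))
    norm2 = math.sqrt(sum(v * v for v in w2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

# Toy TF-IDF vectors for an answer key and a respondent's answer
w_key = {"data": 2.0, "processing": 2.0, "raw": 1.7, "information": 1.7}
w_ans = {"data": 2.0, "processing": 1.0, "information": 1.7}
print(round(cosine_similarity(w_key, w_ans), 3))
```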

Similarity Value Conversion
This step is the process of converting the similarity values from the document matching process into general test scores (Human Rate Value). The similarity conversion value range can be seen in

RESULT AND ANALYSIS
This research was conducted using test data from the questionnaire results, which amounted to 115 answers from 23 respondents.

Writing Errors Detection and Correction Test
In this test, each respondent's answer is given a writing error scenario consisting of a letter overload, letter shortage, letter transposition, and letter replacement scenario. Testing then continues using the system, and the results are recorded in a table by assigning a value of 1 (true) or 0 (false). These values are then used for the accuracy calculations. The accuracy calculation is applied to every answer given by the respondents; once it has been applied to all answers, the average, minimum, and maximum accuracy are obtained. The accuracy calculations are as follows:

a. Accuracy of Writing Error Detection
The writing error detection accuracy is calculated to determine how accurately the system can detect writing errors, by first comparing the results of manual detection with those of the model and then calculating the accuracy (see Table 10). Based on Table 10, there is a writing error scenario for the word processing (written as procesing). The error scenario for the word procesing has a value of 1 in the false column, which means the word falls into the category of writing errors. When tested using the system, the word procesing also has a value of 1 in the false column, marked by a word recommendation generated by the system. If the result of manual detection and the detection performed by the system have the same value, a value of 1 is given in the true check column; if they differ, a value of 1 is given in the false check column. Once every word in the sentence has been compared, the comparison results are summed to calculate the writing error detection accuracy; the accuracy formula [42] is given in equation (17).
$$\text{Detection Accuracy} = \frac{\text{true detection}}{\text{true detection} + \text{false detection}} \times 100\% \qquad (17)$$

b. Accuracy of Writing Error Correction
The calculation of writing error correction accuracy is almost the same as the calculation of writing error detection accuracy. This calculation determines how accurately the system can provide correct recommendations by comparing the results of manual recommendations with the system's recommendations. The accuracy is calculated based on the results of these comparisons (as shown in Table 11).
Based on Table 11, the word procesing is included among the writing errors, where the correct writing is processing. When the word is tested using the system, the resulting recommendation is "processing." A correct recommendation is given a value of 1 in the true column, and an incorrect recommendation is given a value of 1 in the false column. The calculation continues by summing each column of the system's true and false recommendations, and the accuracy is calculated using formula (18).
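Formula (18) is not reproduced in this text; by analogy with equation (17), the correction accuracy presumably has the form:

$$\text{Correction Accuracy} = \frac{\text{true recommendation}}{\text{true recommendation} + \text{false recommendation}} \times 100\%$$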

Essay Scoring Test with and without Writing Errors
Essay scoring in this research was carried out by comparing the similarity of words between the respondents' answers and the answer keys. The essay scoring test was carried out under two conditions (with and without writing errors). The result of this test is that the essay scoring is strongly influenced by the accuracy of the word recommendations generated by the system: if a recommendation is incorrect, the resulting score may decrease, increase, or remain the same.

Discussion of Results
Based on the research results, the following conclusions can be drawn:

Writing Errors Detection and Correction Test
Based on the test results, the average, minimum, and maximum accuracy were obtained for detecting and correcting writing errors. In our model testing, the average accuracy of writing error detection is 83.64%, the worst accuracy is 57.14%, and the best accuracy is 100.00% (Table 10), while the writing error correction results have an average accuracy of 78.44%, a worst accuracy of 40.00%, and a best accuracy of 100.00% (Table 11). The Jaro Winkler and N-Gram methods in this research succeeded in detecting and correcting the four types of writing errors, namely letter overload, letter transposition, letter replacement, and letter shortage. However, incorrect detections still occur in the writing error detection test: some words that should not be categorized as misspellings are flagged as typos and marked with an inappropriate word recommendation. This happens because some words are not in the Kamus Besar Bahasa Indonesia (KBBI), so when they are checked with Jaro Winkler they receive inappropriate word recommendations, which in turn affects the N-Gram processing stage. It is also caused by the quantity or frequency of occurrence of words in the N-Gram corpus: a word may receive an appropriate recommendation from Jaro Winkler, but when processed with the N-Gram the word is not found in the N-Gram corpus, or its frequency of occurrence is lower than that of other words. Thus, the process of detecting and correcting writing errors using Jaro Winkler and N-Grams depends on the availability or completeness of words in the Kamus Besar Bahasa Indonesia (KBBI) and in the N-Gram corpus to achieve good detection and correction accuracy.

Comparison of Essay Assessment Test Results with and without Writing Errors
The results of the comparison between essay scoring with writing errors and without writing errors are influenced by whether a correct recommendation word was given in the previous process. It is because the essay scoring in this