Categorization of Frequent Errors in Solution Codes Created by Novice Programmers

Abstract. In recent times, e-learning has become indispensable for both technical and general education. Among all subjects, programming education has drawn attention because of its importance for continuous development in the ICT sector. Finding errors in a solution code is a laborious task for novice programmers, teachers, and instructors, and novice programmers spend a lot of valuable time searching for errors in their solution codes. In this paper, a method for the categorization of frequent errors in solution codes is presented. In the proposed method, the differences between wrong solutions and accepted solutions are used to define feature vectors for a clustering algorithm. A longest common subsequence (LCS) algorithm is leveraged to find the differences between wrong and accepted codes, and all the differences are then converted into feature vectors. The k-means clustering algorithm is applied to these feature vectors to find the most common errors in solution codes. In our experiment, the method was applied to a set of program solution codes accumulated in an e-learning system. Experimental results show that the proposed method is efficient and capable of detecting the most common errors that occur in solution codes, which can help novice programmers resolve errors quickly and help teachers prepare lesson plans.


Introduction
Programming is one of the most vital skills in information and communication technology (ICT) as well as in computer education. No significant development in ICT is possible without good programming skills. Many countries and educational institutions have given priority to programming for computer science and engineering students. Learning to program is not easy for students, because it requires repeated practice. Dealing with errors in program code is also challenging for students, especially novice programmers. Errors occur in code for many reasons, including algorithmic errors, misconceptions, and misinterpretations [1]. Not all errors are easy for students and novice programmers to find and resolve, so helping them identify and classify code errors is an important research topic.
Empirical research on detecting and classifying errors in program codes has drawn attention in recent times. Program code is a complex text in which many errors occur. Most syntax errors are identified during compilation. Some recent works have focused on evaluating common syntax errors in code, improving error detection strategies, and improving the compiler messages shown after syntax errors are identified, in order to help students and novice programmers [2][3][4][5]. In addition to the compiler, many deep learning techniques have been employed to detect syntax errors in program codes, and they have achieved significant success [6][7][8].
* e-mail: mostafiz26@gmail.com  ** e-mail: m5231138@u-aizu.ac.jp  *** e-mail: yutaka@u-aizu.ac.jp
Errors other than syntax errors are called logic errors; a logic error is sometimes also called a semantic error. When a logic error occurs, the program code passes compilation successfully and produces results, but it fails to provide correct results for all possible test cases [1]. Ahmadzadeh et al. [9] explained that, when such logic errors occur, the meaning intended by the programmer is not reflected in what is written in the language. Searching for logic errors is frustrating for programmers, because there is no clue or feedback about the location or the nature of the error. Logic errors are particularly difficult to detect for first-year students (in computer science or in other departments that have basic programming courses) and novice programmers, due to their inadequate debugging experience, misconceptions, and poor programming knowledge [10][11][12].
With this motivation, we focus on identifying and classifying the most common errors in code, including logic errors. We collected all source codes for the experiment from the Aizu Online Judge (AOJ) system [13]. We first selected paired source codes (wrong → accepted). Because a solution may need several submissions before it is accepted, we took the last incorrect submission and the accepted submission as a pair of source codes. From each pair, we extracted the debugging pattern, that is, the changes that turn the wrong code into an accepted one. First, the paired source codes are converted into token sequences using a lexical analyzer, and each token is encoded with an ID. We then calculate the difference between the paired source codes (wrong → accepted) using the longest common subsequence (LCS) algorithm. All differences between the paired source codes are converted into feature vectors. Finally, the k-means clustering algorithm is used to classify the detected errors. We conducted our experiments using several hyperparameters, and the resulting error classification can be helpful for students, novice programmers, teachers, and instructors.
The rest of the paper is organized as follows. Section 2 presents related literature on error identification and classification. Section 3 presents the model architecture. Section 4 presents experimental results and discussion. Finally, Section 5 presents the conclusion and future work.

Related Literature
Ettles et al. [1] presented a logic error classification model. The logic errors that students and novice programmers made in program code were broadly classified into three types: algorithmic errors, misinterpretations, and misconceptions. About 15,000 code segments containing logic errors, written in the C programming language by students and novice programmers, were analyzed for classification. Most of the logic errors were caused by misconceptions that reflected a low level of programming knowledge. A similar work [3] examined students' ability to resolve syntax errors. That paper also examined how students spend their time solving very common and rare types of syntax errors. When teachers and instructors target time-consuming errors during class, students resolve rare types of syntax errors more quickly, which improves their productivity.
Intelligent source code evaluation models based on deep neural networks have been proposed [6][7][8]. These models are capable of detecting logic errors as well as syntax errors in codes. Experimental results show that the models detect logic errors at a high rate and also provide code repair suggestions. Fitzgerald et al. [10] described debugging as a challenging task for novice programmers and students. The results of their survey show that good debuggers are more likely to be good programmers, but good programmers may not be good debuggers. They reported that most student errors, such as arithmetic errors and incorrect loop conditions, occurred due to misconceptions and a lack of language knowledge.
In [14], numerous source codes with syntax errors were categorized by manual analysis instead of compiler message analysis. However, that study did not provide a detailed description of the classification of logic errors. Hristova et al. [15] introduced a tool for Java programming called "Expresso", which was developed by faculty members and advanced students. Their paper also presented the reasons why "Expresso" is better than other programming tools. The "Expresso" tool provides a list of typical errors made by novice programmers and students. Altadmri and Brown [16] presented a comprehensive analysis based on about 37 million source codes collected from 250,000 students and novice programmers. The source codes were analyzed by frequency, spread of errors, time-to-fix, and so on. They found that the average time to solve logic (semantic) errors decreases over a semester course. They further reported that students and novice programmers faced more difficulties resolving logic errors than syntax errors.
Our research differs from existing work. We focus on identifying frequent errors from paired source codes in order to investigate the debugging patterns of students and novice programmers. One of our main goals is to classify these frequent errors into different groups, which helps students in problem solving and teachers in teaching.

Overview
The architecture of the proposed model is presented in Figure 1. First, paired (wrong → accepted) source codes are collected from the AOJ system, and the codes are converted into token IDs using a lexical analyzer. The LCS algorithm is then leveraged to find the differences between the wrong and accepted codes, and all differences between the paired codes are listed as feature vectors. Finally, the k-means clustering algorithm is applied to find and classify frequent errors. Similar errors are placed in the same category, which helps teachers and instructors learn which types of errors occur most often and plan programming lectures accordingly.

Word Sequencing of Program Codes
In computer science, lexical analysis (also called scanning or tokenization) is the process of converting an input text, such as a web page or program code, into a sequence of tokens. In the proposed model, we use a lexical analyzer to convert codes into word sequences, and each word is then encoded with an ID. The process of word sequencing of codes is shown in Figure 2.
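The word sequencing step can be illustrated with a minimal sketch. The token pattern, the function names `tokenize` and `encode`, and the policy of assigning IDs in order of first appearance are assumptions made for illustration; the lexical analyzer and ID scheme actually used in the experiments are not specified here.

```python
import re

# Hypothetical token pattern for C-like code (an assumption, not the
# authors' lexer): string literals, identifiers/keywords, integers,
# a few two-character operators, then any other single symbol.
TOKEN_RE = re.compile(
    r'"(?:\\.|[^"\\])*"|[A-Za-z_]\w*|\d+|==|!=|<=|>=|&&|\|\||\+\+|--|\S'
)

def tokenize(code):
    """Split source text into a list of word tokens."""
    return TOKEN_RE.findall(code)

def encode(tokens, vocab):
    """Map each token to an integer ID, extending vocab for unseen tokens."""
    ids = []
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)  # next free ID
        ids.append(vocab[tok])
    return ids

vocab = {}
tokens = tokenize("int a[10]; a[0] = 1;")
ids = encode(tokens, vocab)
print(tokens)
print(ids)
```

Sharing one `vocab` dictionary across all codes keeps the IDs comparable between a wrong code and its accepted counterpart.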

Clustering the Frequent Errors
In the proposed model, the k-means clustering algorithm is leveraged to group the frequent errors in codes [17]. First, the Elbow method (a method for selecting the optimal value of k) is applied to determine the optimal number of clusters for the k-means algorithm. Similar errors fall into the same cluster, so similarity is high within a cluster and low between clusters.
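The clustering step can be sketched in plain Python as follows. This is a minimal illustration rather than the implementation used in the experiments: the farthest-first initialization is chosen only to make the example deterministic, and `pick_elbow`, which takes the k with the largest second difference of the error curve, is a hypothetical stand-in for reading the elbow off a plot.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=100):
    """Lloyd's algorithm; returns (labels, centroids, sum of squared errors)."""
    # Farthest-first initialization (an assumption made for reproducibility).
    centroids = [list(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(far))
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    sse = sum(sq_dist(p, centroids[l]) for p, l in zip(points, labels))
    return labels, centroids, sse

def elbow_curve(points, k_max):
    """Sum of squared errors for k = 1 .. k_max."""
    return {k: kmeans(points, k)[2] for k in range(1, k_max + 1)}

def pick_elbow(curve):
    """Hypothetical heuristic: k with the largest second difference."""
    ks = sorted(curve)[1:-1]
    return max(ks, key=lambda k: curve[k - 1] - 2 * curve[k] + curve[k + 1])

# Two well-separated groups of toy 2-D points.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
curve = elbow_curve(points, 4)
best_k = pick_elbow(curve)
labels, centroids, sse = kmeans(points, best_k)
print(best_k, labels)
```

On real feature vectors, the same curve would be computed over a wider range of k and the bend inspected before a cluster count is fixed.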

Verdicts of Online Judge and Paired Code Selection
We collected paired source codes from the AOJ system for our experiments. We focused on the actual reasons for and context of the errors that users made (Table 1).
In an online judge system, students and other users can submit solutions as many times as they like, or until their solution is accepted, so our database holds a large trial-and-error history. For our experiments, we selected the accepted code and the last incorrect code as a pair of source codes, as shown in Figure 3.
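The pair selection rule can be sketched as follows. The record layout (the `user`, `problem`, `time`, `verdict`, and `code` keys) and the verdict strings are hypothetical and are not the AOJ database schema; the sketch only illustrates choosing the last incorrect submission before the accepted one.

```python
def select_pairs(submissions):
    """Return {(user, problem): (last_incorrect_code, accepted_code)}.

    Users who were accepted on their first attempt have no incorrect
    submission to pair with and are skipped.
    """
    pairs = {}
    last_wrong = {}
    done = set()
    for s in sorted(submissions, key=lambda s: s["time"]):
        key = (s["user"], s["problem"])
        if key in done:
            continue  # only the first acceptance is considered
        if s["verdict"] == "Accepted":
            done.add(key)
            if key in last_wrong:
                pairs[key] = (last_wrong[key]["code"], s["code"])
        else:
            last_wrong[key] = s  # overwritten until the acceptance arrives
    return pairs

# Toy submission history (illustrative values only).
history = [
    {"user": "u1", "problem": "p1", "time": 1, "verdict": "Wrong Answer", "code": "A"},
    {"user": "u1", "problem": "p1", "time": 2, "verdict": "Wrong Answer", "code": "B"},
    {"user": "u1", "problem": "p1", "time": 3, "verdict": "Accepted", "code": "C"},
    {"user": "u2", "problem": "p1", "time": 4, "verdict": "Accepted", "code": "D"},
]
pairs = select_pairs(history)
print(pairs)
```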

Source Code Conversion into Feature Vectors
In the previous section, we described how codes are converted into word sequences and then encoded with IDs. We leverage the LCS algorithm to calculate the difference between the incorrect and accepted codes. All the differences between the incorrect and accepted codes are converted into feature vectors.
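A minimal sketch of the difference step follows. The LCS part is the standard dynamic-programming algorithm. The feature layout, which counts each token ID among the removed tokens and among the added tokens and concatenates the two halves, is an assumption for illustration and is not confirmed as the exact encoding used in the experiments; `vocab_size` is a hypothetical parameter.

```python
def lcs_diff(wrong, accepted):
    """Return (removed, added): tokens outside the LCS of the two sequences."""
    n, m = len(wrong), len(accepted)
    # dp[i][j] = length of the LCS of wrong[i:] and accepted[j:]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if wrong[i] == accepted[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    removed, added = [], []
    i = j = 0
    while i < n and j < m:
        if wrong[i] == accepted[j]:
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            removed.append(wrong[i])   # present only in the wrong code
            i += 1
        else:
            added.append(accepted[j])  # present only in the accepted code
            j += 1
    removed.extend(wrong[i:])
    added.extend(accepted[j:])
    return removed, added

def feature_vector(removed, added, vocab_size):
    """Token-ID counts: first half for removed tokens, second half for added."""
    vec = [0] * (2 * vocab_size)
    for t in removed:
        vec[t] += 1
    for t in added:
        vec[vocab_size + t] += 1
    return vec

# Token IDs for "int a [ 10 ] ;" and "int a [ 100 ] ;" (illustrative IDs).
wrong = [0, 1, 2, 3, 4, 5]
accepted = [0, 1, 2, 6, 4, 5]
removed, added = lcs_diff(wrong, accepted)
vec = feature_vector(removed, added, vocab_size=7)
print(removed, added, vec)
```

Under this layout the vector length is twice the vocabulary size, one half per side of the pair.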

Experimental Results
We performed several experiments with paired source codes collected from the AOJ system to evaluate the performance of the proposed method. We selected the solution codes (both incorrect and accepted) of two problems. The experimental cases are (i) clustering of incorrect source codes, (ii) clustering of accepted source codes, and (iii) clustering of both incorrect and accepted codes. Experimental results and discussions are presented below.
First, we performed a clustering operation independently on the incorrect solution codes of Problem 1. We selected the number of clusters using the Elbow method, which suggested that the optimal number of clusters for the incorrect solution codes is 25, as shown in Figure 5 (a). The mean of the error distances of each cluster is shown in Figure 5. Of the 25 clusters for incorrect codes, we present some cluster data in Table 2, where each cluster contains the same type of errors. Cluster 1 contains errors related to incorrect array sizes and array references. Cluster 4 contains errors related to the "printf" function or program output. Similarly, clusters 8 and 12 contain errors related to assignment and other operations.
We performed the clustering operation on the accepted codes of the same problem (Problem 1). We obtained 26 clusters, and the mean value of each cluster is shown in Figure 6. Clusters 0, 4, 7, 8, 20, 22, and 23 have the highest mean values. Table 3 presents some of the cluster data, showing the changes that users made to reach their accepted solution codes. Clusters 0 and 5 contain corrections related to the printf (output) function and to arrays, respectively. Clusters 14 and 17 contain corrections related to adding newline characters to the output.
Considering all the solution codes of Problem 1, we calculated the difference between the incorrect and accepted codes and then converted the differences into a single feature vector whose dimension is 192 (96 * 2). The clustering operation was performed on this feature vector, and the Elbow method gave 32 as the optimal number of clusters. The mean values of each cluster are shown in Figure 7. Table 4 presents data from different clusters where the clustering operation was performed using both incorrect and accepted solution codes. Cluster 3 shows that users made mistakes related to adding endl after the output. In Cluster 13, users made mistakes in entering the correct array size. Similarly, in Cluster 27, users entered incorrect numbers and then corrected them.

Table 2 (recoverable rows only).
Cluster | Code fragments | Description
(not recovered) | printf ( " % d \n " , l ) ; printf ( " % s \n " , str ) ; printf ( " % d \n " , num ) ; % d " , i ) ; printf ( " % c | Error around printf statement
(not recovered) | (not recovered) | Error related with assignment process
12 | = 21 ; = 18 ; = 20 ; j = 0 ; = -1 | Assignment operation

Table 3. Clustering in accepted solution codes of Problem 1 (recoverable rows only).
Cluster | Changes in accepted codes | Descriptions
0 | } printf ( " \n " ) ; return, } printf ( " \n " ) ; }, ; printf ( " \n " ) ; }, } printf ( " \n " ) ; return 0 ; }, } fprintf ( fout , " \n " ) ; return, } printf ( " % s " , " \n " ) ; return | (not recovered)

We used both the incorrect and accepted solution codes of Problem 2 for clustering; the Elbow method shows that the optimal number of clusters is 2, as shown in Figure 8. The mean value of Cluster 3 is much higher than that of the other clusters. Some cluster data is presented in Table 5. After analyzing the cluster data, we found that many users made common mistakes in their solution codes.

Conclusion
In this paper, we proposed a method for the categorization of frequent errors in solution codes. We used paired source codes (incorrect → accepted) for our experiments. We analyzed the reasons behind the errors and identified the most common errors that users made in their solution codes. We also analyzed the correction process that led to the accepted solution codes. Experimental results show that the model can detect common errors made by users and cluster them on the basis of similarity. The outcome of this research can be useful for students as well as for teachers and instructors. Teachers can prepare targeted lecture plans that help students resolve common mistakes in their solution codes. Automatic correction of incorrect codes is an interesting direction for future work.