This article proposes a fast and accurate code clone detection method that combines tree-based and token-based methods. Duplicated program code, called a code clone, is one of the main factors that reduce the quality and maintainability of software. If a code fragment contains faults (bugs) and is copied and modified elsewhere, every copy must be corrected. However, it is not easy to find all code clones in large and complex software. Much research effort has been devoted to code clone detection, and two main approaches exist: token-based and tree-based. The token-based method is fast and requires fewer resources, but it cannot detect all kinds of code clones. The tree-based method can detect all kinds of code clones, but it is slow and requires substantial computing resources. In this paper, we propose a combination of the two methods to improve the efficiency and accuracy of code clone detection. First, candidate code clones are extracted by the token-based method, which is fast and lightweight. Then the selected candidates are checked more precisely by the tree-based method, which can find all kinds of code clones. We developed a prototype system. The system accepts source code and first tokenizes it. The token-based method is then applied to the token sequence to find candidate code clones. After the candidates are extracted, the selected source code is converted into abstract syntax trees (ASTs), to which the tree-based method is applied. Sample source code was used to evaluate the proposed method. The evaluation showed improved efficiency and precision of code clone detection.
This article proposes a fast and accurate code clone detection method. A code clone is a fragment of source code that is identical or similar to another portion of the source code [
From a syntactic point of view, there are three types of code clones, named Type-1, Type-2, and Type-3 [
To detect these code clones, several methods have been proposed in previous work. These methods are based on two principles; one is the token-based method [
In this article, we propose a new method that can detect all types of code clones and runs relatively fast. Our method combines the token-based and tree-based methods. The fast token-based method first extracts candidate code clones. Each candidate fragment is then examined by the tree-based method to decide whether it is a clone. By combining the two methods, our method can detect all types of code clones faster than a tree-based method alone.
fragments of code clones. To remove all bugs from the source code, the programmer has to find the same bugs in the other code clones. If he or she forgets to fix them, the quality of the software declines.
As the size of a program increases, the number of code clones increases. Code clones are usually introduced into the original code by copy-and-paste operations, and finding all of them in a large, complex program becomes hard. Therefore, detecting code clones and improving the structure of the original code play an important role in software quality assurance.
Bellon et al. classified code clones into three types based on the features of clones [
・ Type-1 (Exact clone): Exact duplication of the original fragment except for white space, tabs, carriage returns, and other characters related to coding style.
・ Type-2 (Parameterized clone): Syntactically identical fragments in which some identifier names (variable names, function/method names, etc.) and constant values differ.
・ Type-3 (Gap clone): Duplication with some statements inserted and/or deleted.
Several methods for detecting code clones have been proposed in previous work. For example, Baker et al. tried to detect clones by line-wise comparison of two files [
・ Text-based method (or line-based method): This method detects code clones by comparing two code fragments line by line. It is lightweight and the fastest of the three, but it can detect only Type-1 clones, although some sophisticated variants can also detect Type-2 clones.
・ Token-based method: This method detects code clones by comparing two sequences of tokens (a token is the minimum lexically meaningful sequence of characters). Because tokens represent only the kind of element in the programming language, this method can detect Type-1 and Type-2 clones, but it requires sophisticated modification to detect Type-3 clones. It is relatively fast and lightweight.
・ Tree-based method: This method detects code clones by comparing two abstract syntax trees (ASTs) [
As shown in
Our preliminary investigation of some sample programs found the distribution of clone types shown in
Method | Type-1 | Type-2 | Type-3 | Speed |
---|---|---|---|---|
Text-based | o | Δ | × | Fast |
Token-based | o | o | × | Medium |
Tree-based | o | o | o | Slow |
As one can easily imagine, detecting Type-1 and Type-2 clones is not very difficult, whereas detecting Type-3 clones is hard.
This investigation also shows that the number of exact clones is relatively small. This makes sense because, in most cases of cloning, a programmer copies part of the code and changes it to fit the needs of a new function that is close to the original in terms of the service or computation it provides. During this modification, some identifiers may be renamed, some statements inserted or removed, and some conditional expressions changed. Therefore, the difficulty of finding Type-3 clones is a serious drawback from a practical point of view.
On the other hand, tree-based methods can detect almost all clones, including Type-3 clones that are hard to detect with text-based or token-based methods. However, tree-based methods require much computing time and many resources. Baxter [
Based on the above considerations, we propose a new code clone detection method that combines the token-based and tree-based methods. The token-based method runs faster but is not appropriate for detecting Type-3 clones. The tree-based method can find all types of clones but runs slower. The tree-based method is slow because of the time it spends comparing trees: the larger the software becomes, the more trees must be compared. Therefore, we use the token-based method to narrow down the set of trees to be compared. With fewer trees to compare, the tree-based method runs much faster.
1) Lexically analyzing the source code and generating a token sequence,
2) Applying the token-based method to extract code clone candidates,
3) Generating abstract syntax trees (ASTs) of the code clone candidates,
4) Comparing the ASTs to identify code clones of all types.
In step (1), the source code is converted into a sequence of tokens. The token sequence is then transformed so that Type-2 clones can be detected. This transformation replaces specific tokens, such as identifiers and function/method names, with special characters.
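The conversion described above can be illustrated with a minimal sketch. It is not the prototype's tokenizer: the regular-expression lexer, the small keyword set, and the use of "$" as the placeholder for identifiers are all assumptions made for illustration, while "#" for constants follows the text.

```python
import re

# Hypothetical, simplified lexer for a Java-like language (illustration only).
KEYWORDS = {"int", "double", "void", "for", "while", "if", "else", "return"}
TOKEN_RE = re.compile(r'''
    (?P<const>\d+(?:\.\d+)?|"[^"\n]*")              # numeric or string constant
  | (?P<ident>[A-Za-z_]\w*)                         # identifier or keyword
  | (?P<op>==|!=|<=|>=|\+\+|--|&&|\|\||[^\s\w])     # operator or punctuation
''', re.VERBOSE)

def normalize(source):
    """Tokenize source and replace identifiers by '$' and constants by '#'."""
    tokens = []
    for match in TOKEN_RE.finditer(source):
        if match.lastgroup == "const":
            tokens.append("#")          # placeholder for constants (from the text)
        elif match.lastgroup == "ident":
            text = match.group()
            # '$' for identifiers is an assumed placeholder; keywords are kept.
            tokens.append(text if text in KEYWORDS else "$")
        else:
            tokens.append(match.group())
    return tokens

a = "int sum = 0; for (int i = 0; i < n; i++) { sum = sum + a[i]; }"
b = "int total = 1;\nfor (int k = 5; k < len; k++) {\n    total = total + data[k];\n}"
print(normalize(a) == normalize(b))  # prints True
```

The two fragments differ in layout, identifier names, and constant values, so they form a Type-2 pair; after normalization their token sequences are identical, which is what lets a token-based comparison match them.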
constants are replaced by “#”. After the conversion, the detection process is executed. After arranging all tokens as shown in
Type-1 and Type-2 code clones form continuous diagonal lines when represented in the manner shown in
To detect Type-3 code clones, we have to detect candidates whose diagonal line contains a gap. To bridge such a gap, we need a method that looks for another part of the code that might follow the gap.
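The diagonal lines mentioned above can be found by scanning the token-by-token match matrix directly. The following is a minimal sketch under stated assumptions, not the prototype's implementation: the function name `diagonal_runs`, the inclusive end points, and the `min_len` parameter are hypothetical.

```python
def diagonal_runs(seq_a, seq_b, min_len):
    """Find maximal diagonal runs of matching tokens in the match matrix.

    Returns (start_x, start_y, end_x, end_y) tuples, where x indexes seq_a,
    y indexes seq_b, and both end points are inclusive.
    """
    runs = []
    n, m = len(seq_a), len(seq_b)
    for x in range(n):
        for y in range(m):
            if seq_a[x] != seq_b[y]:
                continue
            # Skip cells that continue a run started earlier on this diagonal.
            if x > 0 and y > 0 and seq_a[x - 1] == seq_b[y - 1]:
                continue
            length = 0
            while (x + length < n and y + length < m
                   and seq_a[x + length] == seq_b[y + length]):
                length += 1
            if length >= min_len:
                runs.append((x, y, x + length - 1, y + length - 1))
    return runs

# Single letters stand in for normalized tokens; "x" marks inserted tokens.
print(diagonal_runs(list("ABCDEFGH"), list("ABCDxxEFGH"), 3))
# prints [(0, 0, 3, 3), (4, 6, 7, 9)]
```

The two inserted tokens break one long diagonal into two shorter ones. That break is the gap a Type-3 candidate search has to bridge.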
Algorithm
Step 1 Extract the diagonal lines (code clone candidates) that are longer than a predefined length.
Step 2 Let $l_i$ be a diagonal line whose starting point is $l_i^s = (l_i^s(x), l_i^s(y))$ and whose ending point is $l_i^e = (l_i^e(x), l_i^e(y))$.
Step 3 For every diagonal line $l_i$, if there is another diagonal line $l_j$ with $|l_i^e(x) - l_j^s(x)| < d_x$ and $|l_i^e(y) - l_j^s(y)| < d_y$, merge $l_i$ and $l_j$ by drawing a diagonal line that starts at $l_i^s$ and ends at $l_j^e$. Here $d_x$ and $d_y$ are predefined tolerance limits. The new line is a merged virtual diagonal line, and the fragment corresponding to it is a candidate Type-3 code clone.
Step 4 When two or more lines $l_{j_1}, l_{j_2}, \cdots, l_{j_m}, \cdots, l_{j_k}$ have starting points $l_{j_m}^s$ that satisfy $|l_i^e(x) - l_{j_m}^s(x)| < d_x$ and $|l_i^e(y) - l_{j_m}^s(y)| < d_y$, draw a line from $l_i^s$ to $l_v^e$, where $l_v^e$ is the ending point of a virtual line $l_v$ with $l_v^e(x) = \max\{l_{j_1}^e(x), l_{j_2}^e(x), \cdots, l_{j_k}^e(x)\}$ and $l_v^e(y) = \max\{l_{j_1}^e(y), l_{j_2}^e(y), \cdots, l_{j_k}^e(y)\}$. The code fragment corresponding to the diagonal line from $l_i^s$ to $l_v^e$ is a candidate Type-3 code clone.
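Steps 3 and 4 of the algorithm can be sketched as a single merging pass. This is an illustrative reading of the steps, not the authors' code: the `Line` type and `merge_candidates` function are hypothetical names, a line is never merged with itself, and the sketch does not chain merges transitively.

```python
from typing import List, NamedTuple

class Line(NamedTuple):
    sx: int  # starting point, x coordinate
    sy: int  # starting point, y coordinate
    ex: int  # ending point, x coordinate
    ey: int  # ending point, y coordinate

def merge_candidates(lines: List[Line], dx: int, dy: int) -> List[Line]:
    """Return virtual diagonal lines that bridge gaps smaller than (dx, dy)."""
    merged = []
    for i, li in enumerate(lines):
        # Lines whose starting point lies within the tolerance of li's end.
        followers = [
            lj for j, lj in enumerate(lines)
            if j != i and abs(li.ex - lj.sx) < dx and abs(li.ey - lj.sy) < dy
        ]
        if followers:
            # With one follower this is Step 3; with several, the virtual
            # ending point is the coordinate-wise maximum, as in Step 4.
            merged.append(Line(li.sx, li.sy,
                               max(f.ex for f in followers),
                               max(f.ey for f in followers)))
    return merged

lines = [Line(0, 0, 9, 9), Line(11, 12, 20, 21), Line(12, 11, 25, 18)]
print(merge_candidates(lines, dx=5, dy=5))
# prints [Line(sx=0, sy=0, ex=25, ey=21)]
```

Both later lines start within the tolerance of the first line's end, so the merged virtual line runs from the first line's start to the maximum ending coordinates of the followers.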
Exact and parameterized code clones (Type-1 and Type-2) can be found by the token-based method, but gap clones (Type-3) cannot. Identifying Type-3 code clones requires further steps: 1) translating the source code into an abstract syntax tree (AST) or a similar tree-based representation, 2) computing the difference (distance) between each pair of code clone candidates, and 3) identifying Type-3 code clones by examining the distance. Transforming source code into an AST is a well-known task of language processors such as compilers, so we do not discuss it further.
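As a small illustration of step 1), the sketch below turns source code into a labeled tree. It uses Python's standard `ast` module on Python code purely as a stand-in, since the paper's prototype targets Java; the `to_tree` helper and the choice of node type names as labels are assumptions.

```python
import ast

def to_tree(node):
    """Convert an ast.AST node into a (label, children) tuple.

    The label is the node type name only, so identifier names and constant
    values do not affect the resulting tree.
    """
    children = tuple(to_tree(child) for child in ast.iter_child_nodes(node))
    return (type(node).__name__, children)

original = to_tree(ast.parse("total = total + x"))
renamed = to_tree(ast.parse("s = s + y"))
extended = to_tree(ast.parse("s = s + y + 1"))

print(original == renamed)   # prints True
print(original == extended)  # prints False
```

With type-only labels, a renamed fragment yields an identical tree, while a fragment with extra code yields a different but structurally close tree. Measuring that closeness is the purpose of the distance in steps 2) and 3).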
To compute the distance between two trees, we use the tree edit distance (TED) [
and that of $T_B$ is 6. Therefore, by the trivial method, $T_A$ can be converted into $T_B$ in $6 + 6 = 12$ operations. This is the maximum length of a transformation sequence. However, $T_A$ can also be converted by the following sequence: i) delete node “D” from tree $T_A$, ii) insert node “G” of tree $T_B$ between nodes “A” and “B”, iii) rename node “F” of tree $T_A$ to “H”. The total number of primitive operations is 3, and this is the minimum number of operations for transforming $T_A$ into $T_B$. Therefore, the TED of $T_A$ and $T_B$ is 3.¹

¹ Strictly speaking, we would have to prove that 3 is the minimum length of a transformation sequence.
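The example can be checked with a small, unoptimized tree edit distance for ordered trees with unit costs. This sketch uses the textbook rightmost-root forest recursion with memoization, not necessarily the algorithm used in the prototype. The shapes of the two trees are hypothetical: the figure is not reproduced here, so they were chosen only to have six nodes each and to match the three operations described in the text.

```python
from functools import lru_cache

def tree(label, *children):
    """Build an ordered tree as a hashable (label, children) tuple."""
    return (label, tuple(children))

def forest_size(forest):
    return sum(1 + forest_size(children) for _, children in forest)

@lru_cache(maxsize=None)
def forest_distance(f1, f2):
    """Edit distance between two ordered forests (insert/delete/rename cost 1)."""
    if not f1:
        return forest_size(f2)      # insert everything in f2
    if not f2:
        return forest_size(f1)      # delete everything in f1
    (label1, kids1), (label2, kids2) = f1[-1], f2[-1]
    return min(
        forest_distance(f1[:-1] + kids1, f2) + 1,   # delete rightmost root of f1
        forest_distance(f1, f2[:-1] + kids2) + 1,   # insert rightmost root of f2
        forest_distance(kids1, kids2)               # match the two rightmost roots
        + forest_distance(f1[:-1], f2[:-1])
        + (0 if label1 == label2 else 1),           # rename if the labels differ
    )

def tree_edit_distance(t1, t2):
    return forest_distance((t1,), (t2,))

# Hypothetical shapes consistent with the operations in the text.
t_a = tree("A", tree("B", tree("C"), tree("D")), tree("E", tree("F")))
t_b = tree("A", tree("G", tree("B", tree("C"))), tree("E", tree("H")))
print(tree_edit_distance(t_a, t_b))  # prints 3
```

For these assumed shapes the computed distance is 3, in line with the sequence delete “D”, insert “G”, rename “F” to “H”. The recursion is exponential without memoization and is meant only for small illustrative trees.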
By applying TED and computing the distance of any two trees based on the established algorithms [
We have implemented a code clone detection system that adopts the proposed method. The target program of the experiment is as follows.
・ Implementation Language: Java
・ Program name: JDK 1.5.0
・ The number of files: 108
・ The number of lines: 33,128 lines
The proposed system runs in the following environment.
・ OS: Windows 7 Professional 64-bit
・ CPU: Intel Core i7 3.2 GHz
・ RAM: 6.0 GB (2.0 GB is used for executing the proposed method)
The proposed system is implemented in Java. The prototype consists of 4333 lines of code.
After extracting Type-1 and Type-2 code clones, we further try to extract candidate Type-3 (gap) clones. The threshold values $d_x$ and $d_y$ mentioned in section 3.3 were set to 500. The result is shown in
that the total computing time is heavily influenced by the AST-based time. AST-based computing accounts for approximately 80% to 90% of the total computing time.
Let $n$ be the number of tokens. Then the complexity of the token-based phase is $O(n^2)$. Let $m$ be the number of nodes in a tree. Then the complexity of the tree-based phase is $O(m^4)$ [
System | CCFinder [ | Proposed Method | DECKARD [ | CloneDR [ |
---|---|---|---|---|
Type | Token-based | Tree-based | Tree-based | Tree-based |
CPU | - | 3.2 GHz | 3.2 GHz | 2.0 GHz |
Memory | - | 2.0 GB | 2.0 GB | 1.0 GB |
Time | 40 sec | 2174 sec (0.6 h) | 7200 sec (2.0 h) | 9000 sec (2.5 h) |

CCFinder is the fastest system. However, it uses the token-based method, so it cannot easily detect Type-3 code clones. The other two systems and the proposed system adopt the tree-based method and can therefore detect all types of code clones. The table shows that the proposed system is three to four times faster than the other two tree-based systems (DECKARD and CloneDR). Note that DECKARD was run with the same CPU clock and memory size as the proposed system, whereas CloneDR was run with a lower clock and less memory, so the comparison with CloneDR is not exact. Clone detection involves few disk accesses, so the computing time depends mainly on CPU time, which is roughly inversely proportional to the clock frequency. Therefore, if CloneDR ran at the same clock frequency as DECKARD and the proposed system, its computing time would be approximately 6000 sec.² The proposed system is faster than the other two tree-based systems because it performs fewer tree comparisons: it narrows the candidate code clones with the token-based method before comparing trees.

² $9000 \times \frac{2.0}{3.2} = 5625$ sec.
³ The terms “precision rate” and “recall rate” are commonly used in the fields of information retrieval and pattern recognition. The precision rate is the fraction of detected instances that are relevant, and the recall rate is the fraction of relevant instances that are detected.
Currently, there is no standard criterion for the quality of a code clone detection method. One of the simplest quality measures is the ratio of the number of detected code clones to the number of all code clones. However, this measure is virtually useless because we cannot count all code clones in large-scale software. Bellon [
in these two is 22 and the ratios of common tokens are 56% (left) and 48% (right). After converting the two code fragments into ASTs, we compare the ratios of common AST nodes. The left code has 77 nodes and the right code has 79, and the two ASTs have 63 nodes in common. Therefore, the ratios of common nodes are 78% and 82%, respectively. A higher common ratio (of tokens or nodes) means a higher similarity between the two code fragments. The ratio of common nodes is higher than that of common tokens, which means that the tree-based method can detect code clones more precisely.
The overall average ratio of common tokens in our experiment is 57% and that of common nodes is 78%. Bellon’s experiment reports that the average precision of previous work is approximately 64% [
In this article, we proposed a new code clone detection method that can detect all types of code clones. Our method combines the token-based and tree-based methods. The former runs fast but cannot detect all types of code clones; the latter can detect all types of code clones but requires large computing resources (memory and CPU time). Therefore, applying tree-based methods directly to large-scale software is virtually impossible. We combine the two methods: the token-based method extracts candidate code clones, the extracted candidates are transformed into abstract syntax trees (ASTs), and the tree-based method is applied to these trees. By narrowing the candidates and reducing the number of comparison operations, the computing time is reduced to a reasonable level.
An experimental evaluation was conducted using sample files with approximately 35,000 lines of source code. The proposed method is 3 to 4 times faster than conventional tree-based tools such as DECKARD and CloneDR. Detection accuracy was assessed using Bellon’s criterion, and the proposed method achieves almost the same accuracy as the conventional tools. Based on these results, we conclude that the proposed method maintains detection accuracy while running faster than conventional tools, and is therefore useful for improving code clone detection in large-scale software.
Ami, R. and Haga, H. (2017) Code Clone Detection Method Based on the Combination of Tree-Based and Token-Based Methods. Journal of Software Engineering and Applications, 10, 891-906. https://doi.org/10.4236/jsea.2017.1013051