Title: Extracting Code Clones for Refactoring Using Combinations of Clone Metrics
1Extracting Code Clones for Refactoring Using
Combinations of Clone Metrics
Eunjong Choi, Norihiro Yoshida, Takashi Ishio, Katsuro Inoue, and Tateki Sano
Osaka University, Japan; Nara Institute of Science and Technology, Japan; NEC Corporation, Japan
2Background Clone Set
- A set of code clones that are similar or identical to each other
Example clone sets:
- S1: Code Clone 1, Code Clone 3
- S2: Code Clone 2, Code Clone 4, Code Clone 5
3Background Refactoring Code Clone
- Merge code clones into a single program unit
4Background Language-dependent Code Clone
- Unavoidable in source code because of features of the programming language used
Example of a language-dependent code clone: consecutive setter invocations
5Background Clone Metrics [Higo2007]
- Quantitative information on clone sets
- E.g., LEN(S), RNR(S), POP(S)
- Purposes
- To check features of code clones in software
- To extract code clones for several purposes
- E.g., refactoring, defect-prone code clones
[Higo2007] Yoshiki Higo, Toshihiro Kamiya, Shinji Kusumoto, and Katsuro Inoue, "Method and Implementation for Investigating Code Clones in a Software System", Information and Software Technology, pp. 985-998 (2007-9)
6Clone Metrics LEN(S)
- The average length of the token sequences of the code clones in a clone set S
Example: the token sequence <c c> is detected as a code clone in the token sequence <c c c a b>, so LEN(S) = 2 for the clone set S.
In the figure, a superscript indicates that the token is in a repeated token sequence.
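The definition above can be sketched in Java as follows. This is a minimal illustration, not CCFinder's or the authors' implementation: the class name `LenMetric`, the method `len`, and the representation of a code clone as a list of token strings are assumptions made for this example.

```java
import java.util.List;

public class LenMetric {
    /**
     * LEN(S): average token-sequence length over the code clones of one clone set.
     * Each inner list is the token sequence of one code clone (hypothetical representation).
     */
    static double len(List<List<String>> cloneSet) {
        if (cloneSet.isEmpty()) {
            throw new IllegalArgumentException("a clone set must contain at least one code clone");
        }
        int totalTokens = 0;
        for (List<String> clone : cloneSet) {
            totalTokens += clone.size();
        }
        return (double) totalTokens / cloneSet.size();
    }

    public static void main(String[] args) {
        // Illustrative clone set: two code clones, each the token sequence <c c>.
        List<List<String>> s = List.of(List.of("c", "c"), List.of("c", "c"));
        System.out.println(len(s)); // every clone has 2 tokens, so the average is 2.0
    }
}
```

Because LEN(S) is an average, a clone set that mixes long and short clones can report the same value as one whose clones all have the same length.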
7Clone Metrics RNR(S)
- The ratio of non-repeated token sequences in the code clones of a clone set S
Example: the token sequence <c c> is detected as a code clone in the token sequence <c c c a b> (clone set S).
8Clone Metrics POP(S)
- The number of code clones in a clone set S
Example: POP(S) = 6 for the clone set S shown in the figure.
9Single Clone Metric (1/2)
- Clone sets whose RNR(S) is higher
  - They do not form a single semantic unit
  - Semantic unit: many instructions forming a single functionality
/* Code clone in a clone set whose RNR(S) is the second highest in Ant 1.7.0 */
} else {
    // is the zip file in the cache
    ZipFile zipFile = (ZipFile) zipFiles.get(file);
    if (zipFile == null) {
        zipFile = new ZipFile(file);
        zipFiles.put(file, zipFile);
    }
    ZipEntry entry = zipFile.getEntry(resourceName);
    if (entry != null) ...
10Single Clone Metric (2/2)
- Clone sets whose POP(S) is higher
  - They include many language-dependent code clones
/* Code clone in a clone set whose POP(S) is the highest in Ant 1.7.0 */
out.println("\">");
out.println("");
out.print("<!ELEMENT project (target ");
out.print(TASKS);
out.print(" ");
out.print(TYPES);
11Key Idea
- It is not appropriate to extract refactorable code clones using just a single clone metric
  - According to our experience
- We propose a method based on combined clone metrics
  - To overcome the weakness of single-metric-based extraction
12Combined Clone Metrics
- Clone sets whose RNR(S) and POP(S) are higher
  - Each code clone forms a single semantic unit
/* Code clone in a clone set whose RNR(S) and POP(S) are higher than others */
if (ifProperty != null && p.getProperty(ifProperty) == null) {
    return false;
} else if (unlessProperty != null && p.getProperty(unlessProperty) != null) {
    return false;
}
return true;
13Case Study (1/2)
- Goal: validating our key idea
  - Using combined clone metrics is a feasible method to extract code clones for refactoring
- Target system
  - Industrial Java software developed by NEC
  - 110 KLOC, 736 clone sets
14Case Study (2/2)
- Experimental steps
  - Selected 62 clone sets from CCFinder's output using clone metrics.
  - Conducted a survey about these clone sets and got feedback from a developer.
Workflow: Source files -> CCFinder -> Clone sets selected using clone metrics -> Survey -> Feedback
15Subject Code Clones (1/2)
- Clone sets for which any single clone metric value is high
  - Clone sets whose LEN(S) value is in the top 10
  - Clone sets whose RNR(S) value is in the top 10
  - Clone sets whose POP(S) value is in the top 10
16Subject Code Clones (2/2)
- Clone sets whose combined clone metric values are high
  - 15 clone sets whose LEN(S) and RNR(S) values both rank in the top 15
  - 7 clone sets whose LEN(S) and POP(S) values both rank in the top 15
  - 18 clone sets whose RNR(S) and POP(S) values both rank in the top 15
  - 1 clone set whose LEN(S), RNR(S), and POP(S) values all rank in the top 15
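The combined filtering described on this slide can be sketched as an intersection of per-metric rankings. The Java below is only an illustration under stated assumptions: the record `CloneSet`, the methods `topK` and `topInBoth`, and the sample data are hypothetical, and the cut-off is treated as a plain count `k` (whether the slide's "top 15" is a count or a percentage is not stated in the text). Ties at the cut-off are ignored for simplicity.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.ToDoubleFunction;

public class CombinedMetricFilter {
    /** Metric values of one clone set; the record and its field names are hypothetical. */
    record CloneSet(String id, double len, double rnr, int pop) {}

    /** Ids of the k clone sets with the highest value of the given metric. */
    static Set<String> topK(List<CloneSet> all, ToDoubleFunction<CloneSet> metric, int k) {
        List<CloneSet> sorted = new ArrayList<>(all);
        sorted.sort(Comparator.comparingDouble(metric).reversed());
        Set<String> ids = new LinkedHashSet<>();
        for (CloneSet s : sorted.subList(0, Math.min(k, sorted.size()))) {
            ids.add(s.id());
        }
        return ids;
    }

    /** Ids of the clone sets that rank in the top k for BOTH metrics (set intersection). */
    static Set<String> topInBoth(List<CloneSet> all, ToDoubleFunction<CloneSet> first,
                                 ToDoubleFunction<CloneSet> second, int k) {
        Set<String> result = topK(all, first, k);
        result.retainAll(topK(all, second, k));
        return result;
    }

    public static void main(String[] args) {
        // Made-up metric values, for illustration only.
        List<CloneSet> all = List.of(
                new CloneSet("A", 50, 90.0, 2),
                new CloneSet("B", 30, 95.0, 8),
                new CloneSet("C", 10, 20.0, 12),
                new CloneSet("D", 40, 80.0, 9));
        System.out.println(topInBoth(all, CloneSet::len, CloneSet::rnr, 2)); // [A]
        System.out.println(topInBoth(all, CloneSet::len, CloneSet::pop, 2)); // [D]
    }
}
```

A clone set that is extreme in only one metric (such as C, with the highest POP but a low RNR) drops out of every combined filter, which is the behaviour the key idea relies on.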
17Results of Case Study (1/2)
Filtering                 Selected Clone Sets  Refactoring  Precision
Each single clone metric  30                   14           0.47
Combined clone metrics    41                   34           0.87
- Selected Clone Sets: the number of selected clone sets
- Refactoring: the number of clone sets marked as "Perform refactoring" in the survey
18Results of Case Study (2/2)
- Precision: how many refactoring candidates were accepted by a developer?
Precision = Refactoring / Selected Clone Sets
Clone sets selected by combined clone metrics were accepted as refactoring candidates by the developer more often than those selected by a single clone metric.
19Summary and Future Work
- Summary
  - Our industrial case study shows that our key idea is appropriate.
- Future Work
  - Investigate recall
  - Conduct case studies on open source software
  - Suggest a new metric
20Thank You
21Clone sets whose RNR(S) is higher than others
- Each code clone in a clone set S consists of more non-repeated token sequences
/* Code clone in a clone set whose RNR(S) is the second highest in Ant 1.7.0 */
} else {
    // is the zip file in the cache
    ZipFile zipFile = (ZipFile) zipFiles.get(file);
    if (zipFile == null) {
        zipFile = new ZipFile(file);
        zipFiles.put(file, zipFile);
    }
    ZipEntry entry = zipFile.getEntry(resourceName);
    if (entry != null) /* ... */
22Clone sets whose RNR(S) is lower than others
- They consist of more repeated token sequences
- They involve language-dependent code clones
/* Code clone in a clone set whose RNR(S) is the lowest in Ant 1.7.0 */
private String sosCmdDir = null;
// ... skip code ...
private String filename = null;
private boolean noCompress = false;
private boolean noCache = false;
private boolean recursive = false;
private boolean verbose = false;
/* ... */
23Survey Format About Clone set XXX
- (1) Do you think that this clone set needs a practice?
  - Yes / No (-> Jump to next clone set)
- (2) If you marked Yes in your answer to (1), what practice is appropriate for this clone set?
  - Refactoring
  - Write comments about code clones, but don't perform refactoring.
  - Change nothing.
  - Others. (        )
- (3) Write the reason for your answer to (2)
  - Reason:
24Results, and Precision of each clone set in the
survey
Filtering (clone sets whose ...)           Selected Clone Sets  Refactoring  Precision
LEN(S) is in the top 10                    10                   7            0.70
RNR(S) is in the top 10                    10                   4            0.40
POP(S) is in the top 10                    10                   3            0.30
LEN(S) and RNR(S) are in the top 15        15                   13           0.87
LEN(S) and POP(S) are in the top 15        7                    6            0.86
RNR(S) and POP(S) are in the top 15        18                   14           0.78
LEN(S), RNR(S), and POP(S) in the top 15   1                    1            1.00
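The precision column can be recomputed from the other two columns as Refactoring divided by Selected Clone Sets. The following Java sketch does this for the per-filter rows; the class name `SurveyPrecision`, the method `precision`, and the short row labels are assumptions made for this example, while the counts are taken from the table.

```java
import java.util.Locale;

public class SurveyPrecision {
    /** Precision = clone sets marked "Perform refactoring" / selected clone sets. */
    static double precision(int refactoring, int selected) {
        if (selected <= 0) {
            throw new IllegalArgumentException("selected must be positive");
        }
        return (double) refactoring / selected;
    }

    public static void main(String[] args) {
        // Row labels are shorthand; counts are copied from the survey table.
        String[] filters = {"LEN", "RNR", "POP", "LEN+RNR", "LEN+POP", "RNR+POP", "LEN+RNR+POP"};
        int[] selected = {10, 10, 10, 15, 7, 18, 1};
        int[] refactoring = {7, 4, 3, 13, 6, 14, 1};
        for (int i = 0; i < filters.length; i++) {
            // Locale.ROOT keeps '.' as the decimal separator regardless of the system locale.
            System.out.println(String.format(Locale.ROOT, "%-12s %.2f",
                    filters[i], precision(refactoring[i], selected[i])));
        }
    }
}
```

Rounded to two decimals, each per-filter value agrees with the precision reported in the corresponding row.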
25Clone metric RNR(S) (1/2)
- File
- F1 a b c a b,
- F2 c c c a b,
- F3 d a b, e f
- F4 c c d e f
- A superscript indicates that the token is in a repeated token sequence
- RNR(S1) is computed over clone set S1 = {a b, a b, a b, a b}
26Clone metric RNR(S) (2/2)
- File
- F1 a b c a b,
- F2 c c c a b,
- F3 d a b, e f
- F4 c c d e f
- A superscript indicates that the token is in a repeated token sequence
- RNR(S2) of clone set S2 = {c c, c c, c c} is
$\mathrm{RNR}(S_2) = \frac{1 + 0 + 1}{2 + 2 + 2} \times 100 = 33.3$
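The worked example for S2 suggests that RNR(S) sums the non-repeated tokens of every code clone, divides by the total number of tokens in those clones, and scales to a percentage. The Java sketch below follows that reading; the record `CloneTokens` and the method `rnr` are hypothetical names, and the per-clone counts (1, 0, 1 non-repeated tokens out of 2 each) are the ones shown on the slide.

```java
import java.util.List;

public class RnrMetric {
    /** Token counts of one code clone (hypothetical representation). */
    record CloneTokens(int nonRepeated, int whole) {}

    /** RNR(S) in percent: non-repeated tokens over all tokens, summed over the clones of S. */
    static double rnr(List<CloneTokens> cloneSet) {
        int nonRepeated = 0;
        int whole = 0;
        for (CloneTokens c : cloneSet) {
            nonRepeated += c.nonRepeated();
            whole += c.whole();
        }
        if (whole == 0) {
            throw new IllegalArgumentException("clone set contains no tokens");
        }
        return 100.0 * nonRepeated / whole;
    }

    public static void main(String[] args) {
        // S2: three clones <c c>, with 1, 0 and 1 non-repeated tokens out of 2 tokens each.
        List<CloneTokens> s2 = List.of(
                new CloneTokens(1, 2), new CloneTokens(0, 2), new CloneTokens(1, 2));
        System.out.println(rnr(s2)); // (1 + 0 + 1) / (2 + 2 + 2) * 100, about 33.3
    }
}
```

Deciding which tokens count as repeated is the job of the clone detector and is outside this sketch.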
27Subject Code Clones
- 62 clone sets
- Clone sets whose individual clone metric value is high
  - S_LEN: clone sets whose LEN(S) value is in the top 10
  - S_RNR: clone sets whose RNR(S) value is in the top 10
  - S_POP: clone sets whose POP(S) value is in the top 10
- Clone sets whose combined clone metric values are high
  - S_LEN_RNR: 15 clone sets whose LEN(S) and RNR(S) values both rank in the top 15
  - S_LEN_POP: 7 clone sets whose LEN(S) and POP(S) values both rank in the top 15
  - S_RNR_POP: 18 clone sets whose RNR(S) and POP(S) values both rank in the top 15
  - S_LEN_RNR_POP: 1 clone set whose LEN(S), RNR(S), and POP(S) values all rank in the top 15
28The Number of Duplicate Clone Sets
- |S_RNR ∩ S_POP ∩ S_RNR_POP| = 1
- |S_RNR ∩ S_RNR_POP| = 2
- |S_POP ∩ S_RNR_POP| = 2
- |S_LEN_RNR ∩ S_LEN_POP ∩ S_RNR_POP ∩ S_LEN_RNR_POP| = 1
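These overlaps account for the gap between the 71 selections made by the seven filters (3 x 10 plus 15 + 7 + 18 + 1) and the 62 distinct clone sets in the survey. The Java sketch below checks that arithmetic under an assumption not stated on the slide: the four overlap groups are disjoint, so a clone set that appears in m filters contributes m - 1 duplicate entries. The class and method names are hypothetical.

```java
public class DistinctCloneSets {
    /**
     * @param totalEntries sum of the sizes of all filter results
     * @param overlaps     rows of {number of clone sets, number of filters each one appears in};
     *                     the rows are assumed to describe disjoint groups
     */
    static int distinct(int totalEntries, int[][] overlaps) {
        int duplicates = 0;
        for (int[] group : overlaps) {
            duplicates += group[0] * (group[1] - 1); // each extra appearance is one duplicate
        }
        return totalEntries - duplicates;
    }

    public static void main(String[] args) {
        int totalEntries = 3 * 10 + 15 + 7 + 18 + 1; // 71 selections across the seven filters
        int[][] overlaps = {
                {1, 3}, // 1 clone set shared by three filter results
                {2, 2}, // 2 clone sets shared by two filter results
                {2, 2}, // 2 clone sets shared by two filter results
                {1, 4}  // 1 clone set shared by four filter results
        };
        System.out.println(distinct(totalEntries, overlaps)); // 71 - (2 + 2 + 2 + 3) = 62
    }
}
```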
29Example of a clone set that was not selected
- It is too short to form a semantic unit.
- The RNR metric sometimes extracts unintentional code clones
  - E.g., language-dependent code clones
boolean isEqual(final DeweyDecimal other) {
    final int max = Math.max(other.components.length, components.length);
    for (int i = 0; i < max; i++) {
        final int component1 = (i < components.length) ? components[i] : 0;
        final int component2 = (i < other.components.length) ? other.components[i] : 0;
        if (