Title: DCCFinder: A Very-Large Scale Code Clone Analysis and Visualization Tool
1 DCCFinder: A Very-Large Scale Code Clone Analysis and Visualization Tool
- Simone Livieri
- Yoshiki Higo
- Makoto Matsushita
- Katsuro Inoue
2 Background
- Open-Source Software (OSS) is used in many software systems
- Relations between software systems can be exposed through code clone analysis
- Large collections of OSS exist
- Huge memory requirements, long running time
- Computing power is cheap
- Large numbers of computers are often easily accessible
- Code clone analysis can be distributed
3 In the beginning was CCFinder
- CCFinder is a code-clone analysis tool
- Widely used and cited
- Token based
- Many languages supported (e.g. C, C++, Java)
- Good scalability (but can't handle very large input)
4 DCCFinder
- D(istributed)CCFinder is a tool for distributed code clone analysis
- Master-slave distributed system
- Data sharing through a shared file system
- Uses CCFinder to perform the code clone analysis
- The prototype ran on 80 computers in the Student Laboratory of our department
5 Computational Model
The target is the set of source files undergoing code clone analysis
A category is a set of source files sharing a specific feature or use
A project is a single software system
A unit is a set of source files that may span multiple projects
Two units make a piece. A piece is the collection of files that is analyzed on a single slave node
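The unit and piece definitions above can be sketched in a few lines. This is only an illustration under stated assumptions: the greedy size-based packing in `build_units`, the function names, and the choice to pair each unit with itself as well as with every other unit are hypothetical and are not taken from the tool itself.

```python
from itertools import combinations_with_replacement

def build_units(files, unit_size):
    """Greedily pack (path, size) entries, in order, into units of at most unit_size."""
    units, current, used = [], [], 0
    for path, size in files:
        # Start a new unit when adding this file would exceed the size budget.
        if current and used + size > unit_size:
            units.append(current)
            current, used = [], 0
        current.append(path)
        used += size
    if current:
        units.append(current)
    return units

def build_pieces(units):
    """Return unordered pairs of unit indices, including each unit paired with itself."""
    return list(combinations_with_replacement(range(len(units)), 2))

# Toy target: sizes are in arbitrary units; the second unit spans projects b and c.
files = [("a/x.c", 6), ("a/y.c", 5), ("b/z.c", 7), ("c/w.c", 3)]
units = build_units(files, unit_size=12)
pieces = build_pieces(units)
print(units)   # [['a/x.c', 'a/y.c'], ['b/z.c', 'c/w.c']]
print(pieces)  # [(0, 0), (0, 1), (1, 1)]
```

Because units are cut by size rather than by project boundary, a unit can contain files from more than one project, as in the second unit of the toy example.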
6 System Implementation (1)
- Written in Java (about 20 kLoC)
- Master-Slave-Registry communication handled with Java RMI
- Basic fault tolerance
Master and slave node characteristics:
Processor       Pentium IV 3 GHz
Memory          1 GByte
Network link    Gigabit Ethernet connected to 100 Mbit/s network hubs
OS              FreeBSD 5.3-STABLE
Local storage   40-50 GBytes
7 Analysis Process
8 System Implementation (2)
- Indexer
  - Examines the target and collects file size, LoC, project and category name
  - Computes unit boundaries
- Master Node
  - Creates the input files for CCFinder and assigns jobs to the slaves
- Slave Node
  - Copies the files to the local storage
  - Executes CCFinder
  - Copies the output to the shared storage
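The three slave-node steps can be sketched as follows. This is a minimal illustration, not the tool's Java implementation: `run_piece` and its parameters are hypothetical names, and the detector command line is supplied by the caller through `make_command`, since the actual CCFinder invocation is not shown here.

```python
import shutil
import subprocess
from pathlib import Path

def run_piece(source_files, local_dir, shared_out_dir, make_command):
    """Copy inputs to local storage, run the detector, copy its output back."""
    local_dir = Path(local_dir)
    local_dir.mkdir(parents=True, exist_ok=True)

    # 1. Copy the piece's files from shared storage to local storage.
    local_files = []
    for i, src in enumerate(source_files):
        dst = local_dir / f"{i}_{Path(src).name}"  # index prefix avoids name clashes
        shutil.copy(src, dst)
        local_files.append(dst)

    # 2. Execute the detector on the local copies.
    #    make_command(files, output) must return an argv list (an assumption).
    output = local_dir / "result.out"
    subprocess.run(make_command(local_files, output), check=True)

    # 3. Copy the output to the shared storage and return its new location.
    Path(shared_out_dir).mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy(output, shared_out_dir))
```

Working on local copies keeps the detector's file reads off the shared file system, so only the initial copy and the final result cross the network.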
10 System Implementation (3)
- Clone Coverage Analyzer
  - Computes the number of shared lines of code between each pair of files, projects and categories
- Image Generator
  - Generates scatter plots, heat maps or bar charts from the clone coverage data
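One way to roll file-level clone pairs up to project level is sketched below. The record layout (file, inclusive line range, file, inclusive line range), the `project_of` mapping, and the function name are assumptions made for illustration; the actual analyzer may represent its data differently.

```python
from collections import defaultdict

def shared_lines_by_project(clone_pairs, project_of):
    """Return {(project_a, project_b): number of lines of project_a cloned in project_b}."""
    covered = defaultdict(set)
    for file_a, (a_start, a_end), file_b, (b_start, b_end) in clone_pairs:
        pa, pb = project_of[file_a], project_of[file_b]
        # Record (file, line) on each side; sets prevent double counting of
        # lines covered by several overlapping clone pairs.
        covered[(pa, pb)].update((file_a, n) for n in range(a_start, a_end + 1))
        covered[(pb, pa)].update((file_b, n) for n in range(b_start, b_end + 1))
    return {pair: len(lines) for pair, lines in covered.items()}

project_of = {"p/a.c": "p", "q/b.c": "q", "q/c.c": "q"}
clone_pairs = [
    ("p/a.c", (1, 10), "q/b.c", (5, 14)),
    ("p/a.c", (6, 15), "q/c.c", (1, 10)),
]
result = shared_lines_by_project(clone_pairs, project_of)
print(result)  # {('p', 'q'): 15, ('q', 'p'): 20}
```

The same aggregation applies one level up by mapping files to categories instead of projects.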
12 Case Study I: The FreeBSD Target
- Vast collection of Open-Source software used by the FreeBSD OS
- Unit size: 15 MBytes
- Minimum code clone length: 50 tokens
- Total number of tasks: 269,745
Number of categories        45
Number of projects          6,658
Number of .c files          754,552
Total lines of code         403,625,067
Total size                  10.8 GBytes
Time elapsed:
Indexer                     22 minutes
D-CCFinder                  51 hours
Scatter plot:
  Clone Coverage Analyzer   23 hours
  Image Generator           4 hours
Total                       78 hours 22 minutes
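The figures above can be cross-checked with a little arithmetic. The sketch below assumes that each task corresponds to one unordered pair of units, with every unit also paired with itself; under that assumption the task count implies a specific number of units. The variable names are illustrative only.

```python
import math

tasks = 269_745
# Solve n * (n + 1) / 2 == tasks for the number of units n.
n = (math.isqrt(8 * tasks + 1) - 1) // 2
assert n * (n + 1) // 2 == tasks
print(n)  # 734

# Sum of the reported phase times: 22 minutes plus 51 + 23 + 4 hours.
minutes = 22 + (51 + 23 + 4) * 60
print(divmod(minutes, 60))  # (78, 22)
```

With 10.8 GBytes split over 734 units, the average unit comes out at roughly 15 MBytes, which agrees with the stated unit size.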
13 Case Study I: Result
14 Case Study I: Result
php4 and php5: duplicated source tree
15 Case Study I: Result
gstreamer's main source tree is duplicated inside all the gstreamer plugin projects
16 Case Study I: Result
Multiple copies of the X Window System source tree
17 Case Study I: Result
18 Case Study I: Result
- Database Category
- CCC1: 41%
- Causes
  - Different versions of the same software
  - Database drivers for different languages
  - Multiple copies of the phpX source tree
19 Case Study I: Result
- Development Category
- CCC1: 38%
- Causes
  - Mainly the presence of different versions of the GNU binary utilities and compilers
20 Case Study I: Result
- Lang and Development Categories
- CCC1: 28%
- Causes
  - The presence in both categories of the suite of GNU compilers
21 Case Study I: Result
- X11 Fonts Category
- CCC1: 46%
- Causes
  - Small category size
  - Seven copies of the X Window System source tree
22 Case Study II: SPARS-J and the FreeBSD Target
- SPARS-J is a Java component analysis tool
- About 47,000 lines of code written in C
- Code clones between SPARS-J and the whole FreeBSD target were detected
23 Case Study II: Code Clone Coverage (before)
- Most of the code clones were from a single file, getopt.c
24 Case Study II: Code Clone Coverage (after)
- Code clones from CGI handling source code
- Specialized version of getopt.c
25 Summary
- Proposed a new approach to distributed large-scale code clone analysis
- Obtained a global overview of code clones in the FreeBSD target
- In SPARS-J, easily identified the use of code from the FreeBSD target
26 Summary (2)
- The speedup was a factor of 20. Limited by
  - data transfer, network congestion, master-slave coordination
- Generating a scatter plot of reasonable size required a trade-off between speed and accuracy. Effects
  - Source code organization easily visible, enhanced artifacts, finer details not distinguishable
- Currently can't efficiently filter unnecessary or not-so-interesting code clones
  - Being addressed by exploring fingerprint-based source code analysis
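To put the speedup in perspective, it can be compared with the number of machines. The sketch below assumes that all 80 laboratory computers took part in the run, which the slides do not state explicitly; `parallel_efficiency` is an illustrative helper, not part of the tool.

```python
def parallel_efficiency(speedup, nodes):
    """Fraction of the ideal linear speedup that was actually achieved."""
    return speedup / nodes

print(parallel_efficiency(20, 80))  # 0.25
```

Under that assumption the system reached about a quarter of linear scaling, which is consistent with the listed bottlenecks of data transfer, network congestion and coordination.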
27 Future Work
- Currently D-CCFinder is being rewritten
  - Better fault tolerance
  - GUI
  - Distributed post-processing and image generation
- Exploring the evolution of different software systems with code clone analysis
28 Metrics
CCC1 is the percentage of lines of code shared between M0 and M1, computed over the total lines of code of M0 and M1
CCC2 is the percentage of lines of code that M0 shares with M1, computed over the total lines of code of M0
The definitions use:
- (M0, M1): a pair of files, projects, or categories
- the segments of the code clones between M0 and M1
- the segments of the code clones between M0 and M1 that lie in M0
- the number of lines of code in x
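The two metrics can be sketched as follows for the simplest case where M0 and M1 are single files. The representation of clone segments as inclusive (start, end) line ranges and all function and parameter names are assumptions for illustration; the original symbols are not reproduced here.

```python
def covered_lines(segments):
    """Count distinct lines covered by inclusive (start, end) segments."""
    lines = set()
    for start, end in segments:
        lines.update(range(start, end + 1))
    return len(lines)

def ccc1(segs_m0, segs_m1, loc_m0, loc_m1):
    """Percentage of shared lines over the total lines of M0 and M1."""
    shared = covered_lines(segs_m0) + covered_lines(segs_m1)
    return 100.0 * shared / (loc_m0 + loc_m1)

def ccc2(segs_m0, loc_m0):
    """Percentage of M0's lines that are shared with M1."""
    return 100.0 * covered_lines(segs_m0) / loc_m0

# M0: 200 lines, clone segments cover lines 1-60 (two overlapping segments).
# M1: 100 lines, one clone segment covering lines 11-50.
print(ccc2([(1, 40), (30, 60)], 200))  # 30.0
print(round(ccc1([(1, 40), (30, 60)], [(11, 50)], 200, 100), 2))  # 33.33
```

CCC1 is symmetric in M0 and M1, while CCC2 is not: swapping the two modules in the example gives a CCC2 of 40.0 for M1.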