DCCFinder: A Very-Large Scale Code Clone Analysis and Visualization Tool

Description:

... of Computer Science, Graduate School of Information Science and Technology, Osaka University ...

1
DCCFinder: A Very-Large Scale Code Clone Analysis
and Visualization Tool

Simone Livieri
Yoshiki Higo
Makoto Matsushita
Katsuro Inoue

2
Background

Open-Source Software (OSS) is used in many
software systems
Relations between software systems can be exposed
through code clone analysis
Large collections of OSS exist
Huge memory requirements, long running time
Computing power is cheap
Large numbers of computers are often easily
accessible
Code clone analysis can be distributed

3
In the beginning was CCFinder

CCFinder is a code-clone analysis tool
Widely used and cited
Token based
Many languages supported (e.g. C, C++, Java)
Good scalability (but can't handle very large
input)

4
DCCFinder

D(istributed)CCFinder is a tool for distributed
code clone analysis
Master-slave distributed system
Data sharing through a shared file system
Uses CCFinder to perform the code clone analysis
The prototype ran on 80 computers in the student
laboratory of our department

5
Computational Model
The target is the set of source files undergoing
code clone analysis
A category is a set of source files sharing a
specific feature or use
A project is a single software system
A unit is a set of source files that may cross
multiple projects
Two units make a piece. A piece is the
collection of files that will be analyzed on each
slave node
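A minimal sketch of the unit and piece definitions above. It assumes that units are filled greedily up to a size limit and that a unit is also paired with itself, so that clones inside a single unit are found; neither detail is stated on the slide. The names split_into_units and make_pieces and the file paths are hypothetical.

```python
from itertools import combinations_with_replacement


def split_into_units(file_sizes, unit_limit):
    """Greedily group files, in the given order, into units of at most unit_limit bytes.

    A file larger than unit_limit ends up alone in its own unit.
    """
    units, current, current_size = [], [], 0
    for name, size in file_sizes:
        if current and current_size + size > unit_limit:
            units.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        units.append(current)
    return units


def make_pieces(units):
    """Pair every unit with itself and with every later unit (assumption)."""
    return list(combinations_with_replacement(range(len(units)), 2))


# Hypothetical files as (path, size); the second unit crosses projects a and b.
files = [("a/x.c", 6), ("a/y.c", 5), ("a/z.c", 4), ("b/w.c", 7), ("c/v.c", 9)]
units = split_into_units(files, unit_limit=12)
print(units)               # [['a/x.c', 'a/y.c'], ['a/z.c', 'b/w.c'], ['c/v.c']]
print(make_pieces(units))  # [(0, 0), (0, 1), (0, 2), (1, 1), (1, 2), (2, 2)]
```

Each tuple is one piece, that is, one job that a slave node can analyze independently of the others.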
6
System Implementation (1)

Written in Java (about 20 kLOC)
Master-Slave-Registry communication handled with
Java RMI
Basic fault tolerance

Master and slave node characteristics
Processor: Pentium IV 3GHz
Memory: 1 GByte
Network link: Gigabit Ethernet connected to 100 MBit/s network hubs
OS: FreeBSD 5.3-STABLE
Local storage: 40-50 GBytes
7
Analysis Process
8
System Implementation (2)

Indexer
Examines the target and collects file size, LoC,
project and category name
Computes unit boundaries
Master Node
Creates the input files for CCFinder and assigns
jobs to the slaves
Slave Node
Copies the files to the local storage
Executes CCFinder
Copies the output to the shared storage
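The three slave-node steps can be sketched as follows. This is an illustration only: the real system is written in Java, and the slides do not give the CCFinder command line, so the detector is passed in as a hypothetical callable run_detector. The function name run_slave_job and the output naming scheme are assumptions as well.

```python
import shutil
import tempfile
from pathlib import Path


def run_slave_job(job_id, piece_files, shared_out_dir, run_detector):
    """Process one piece: stage inputs locally, run the detector, publish the output.

    run_detector(local_files, output_path) is a hypothetical stand-in for the
    CCFinder invocation; it must write its result to output_path.
    """
    shared_out_dir = Path(shared_out_dir)
    shared_out_dir.mkdir(parents=True, exist_ok=True)
    with tempfile.TemporaryDirectory() as local_dir:
        local_dir = Path(local_dir)
        local_files = []
        for i, src in enumerate(piece_files):
            # Step 1: copy to local storage. The index prefix keeps equally
            # named files from different projects from colliding.
            dst = local_dir / f"{i}_{Path(src).name}"
            shutil.copy(src, dst)
            local_files.append(dst)
        # Step 2: run the clone detector on the local copies.
        local_out = local_dir / "clones.out"
        run_detector(local_files, local_out)
        # Step 3: copy the output back to the shared storage.
        published = shared_out_dir / f"piece_{job_id}.out"
        shutil.copy(local_out, published)
    return published
```

Working on local copies keeps the detector's file reads off the shared file system while it runs.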

10
System Implementation (3)

Clone Coverage Analyzer
Computes the number of shared lines of code between
each pair of files, projects and categories
Image Generator
Generates scatter plots, heat maps or bar charts
from the clone coverage data
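A rough sketch of how shared lines could be counted per pair of files, projects or categories. The record layout for clone pairs, the function name shared_lines and the example paths are all hypothetical; the slides do not describe the analyzer's data structures.

```python
from collections import defaultdict


def shared_lines(clone_pairs, group_of):
    """Count, per ordered pair of groups (g0, g1), the lines of g0 cloned in g1.

    clone_pairs: iterable of ((file_a, start_a, end_a), (file_b, start_b, end_b))
                 with inclusive line ranges (hypothetical layout).
    group_of:    maps a file name to its group: the file itself, its project,
                 or its category.
    """
    covered = defaultdict(set)
    for left, right in clone_pairs:
        for (f0, s0, e0), (f1, _, _) in ((left, right), (right, left)):
            key = (group_of(f0), group_of(f1))
            # A set keeps each line once even when several clones overlap on it.
            covered[key].update((f0, line) for line in range(s0, e0 + 1))
    return {key: len(lines) for key, lines in covered.items()}


# Hypothetical clone pairs between two projects.
clones = [
    (("php4/main.c", 10, 19), ("php5/main.c", 12, 21)),
    (("php4/main.c", 15, 24), ("php5/util.c", 1, 10)),
]


def project_of(path):
    return path.split("/")[0]


print(shared_lines(clones, project_of))
# {('php4', 'php5'): 15, ('php5', 'php4'): 20}
```

Passing a different group_of function (identity for files, a lookup table for categories) gives the counts at the other two levels.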

12
Case Study I: The FreeBSD Target

Vast collection of Open-Source software used by
the FreeBSD OS
Unit size: 15 MBytes
Minimum code clone length: 50 tokens
Total number of tasks: 269,745

Number of categories: 45
Number of projects: 6,658
Number of .c files: 754,552
Total lines of code: 403,625,067
Total size: 10.8 GBytes

Time elapsed
Indexer: 22 minutes
D-CCFinder: 51 hours
Scatter plot
Clone Coverage Analyzer: 23 hours
Image Generator: 4 hours
Total: 78 hours 22 minutes
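The figures above can be cross-checked with a few lines of arithmetic. The first part sums the phase times. The second part is an inference, not something the slides state: if every unit is paired with itself and with every other unit, n units give n(n+1)/2 tasks, and 269,745 tasks then correspond to exactly 734 units, which is roughly what 10.8 GBytes split into 15 MByte units would yield.

```python
from datetime import timedelta

# Phase times as reported in the table.
phases = {
    "Indexer": timedelta(minutes=22),
    "D-CCFinder": timedelta(hours=51),
    "Clone Coverage Analyzer": timedelta(hours=23),
    "Image Generator": timedelta(hours=4),
}
total = sum(phases.values(), timedelta())
hours, rem = divmod(int(total.total_seconds()), 3600)
print(f"{hours} hours {rem // 60} minutes")  # 78 hours 22 minutes


def pieces_for(units):
    """Number of unit pairs, counting a unit paired with itself (assumption)."""
    return units * (units + 1) // 2


# Smallest unit count whose pair count reaches the reported number of tasks.
units = next(n for n in range(1, 2000) if pieces_for(n) >= 269_745)
print(units, pieces_for(units))  # 734 269745
```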
13
Case Study I: Result
14
Case Study I: Result
Duplicated php4 and php5 source trees
15
Case Study I: Result
gstream's main source tree is duplicated inside
all the gstream plugin projects
16
Case Study I: Result
Multiple copies of the X Window System source
tree
17
Case Study I: Result
18
Case Study I: Result

Database Category
CCC1: 41%
Causes
Different versions of the same software
Database drivers for different languages
Multiple copies of the phpX source tree

19
Case Study I: Result

Development Category
CCC1: 38%
Causes
Mainly the presence of different versions of the
GNU binary utilities and compilers

20
Case Study I: Result

Lang and Development Categories
CCC1: 28%
Causes
The presence of the GNU compiler suite in both
categories

21
Case Study I: Result

X11 Fonts Category
CCC1: 46%
Causes
Small category size
Seven copies of the X Window System source tree

22
Case Study II: SPARS-J and the FreeBSD Target

SPARS-J is a Java component analysis tool
About 47,000 lines of code written in C
Code clones between SPARS-J and the whole
FreeBSD target were detected

23
Case Study II: Code Clone Coverage (before)

Most of the code clones were from a single file:
getopt.c

24
Case Study II: Code Clone Coverage (after)

Code clones from CGI handling source code
Specialized version of getopt.c

25
Summary

Proposed a new approach to distributed
large-scale code clone analysis
Obtained a global overview of code clones in the
FreeBSD target
In SPARS-J, effortlessly identified the use of
code from the FreeBSD target

26
Summary (2)

The acceleration gain was 20, limited by
data transfer, network congestion, and master-slave
coordination
Generating a scatter plot of reasonable size traded
speed for accuracy. Effects:
Source code organization easily visible, enhanced
artifacts, finer details not distinguishable
Currently can't efficiently filter unnecessary or
not-so-interesting code clones
Being addressed by exploring fingerprint-based
source code analysis

27
Future Work

Currently D-CCFinder is being rewritten
Better fault tolerance
GUI Interface
Distributed post processing and image generation
Exploring the evolution of different software
systems with code clone analysis

28
Metrics
CCC1 is the percentage of shared lines of code
between M0 and M1, computed over the total lines
of code of M0 and M1
CCC2 is the percentage of lines of code that M0
shares with M1, computed over the total lines of
code of M0
M0, M1: a pair of files, projects, or categories
Segments of the code clones between M0 and M1
Segments of the code clones between M0 and M1 in
M0
Number of lines of code in x
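The two metrics can be written out as a short sketch. The original formulas are not preserved in this transcript, so the reading below is an assumption based on the prose definitions: CCC1 counts the shared lines on both sides over the combined size, and CCC2 counts only the shared lines in M0 over the size of M0. The function and parameter names are hypothetical.

```python
def ccc1(shared_in_m0, shared_in_m1, loc_m0, loc_m1):
    """Percentage of shared lines over the combined size of M0 and M1."""
    return 100.0 * (shared_in_m0 + shared_in_m1) / (loc_m0 + loc_m1)


def ccc2(shared_in_m0, loc_m0):
    """Percentage of M0's lines that are shared with M1."""
    return 100.0 * shared_in_m0 / loc_m0


# Hypothetical pair: M0 has 300 lines (150 shared), M1 has 200 lines (50 shared).
print(ccc1(150, 50, 300, 200))  # 40.0
print(ccc2(150, 300))           # 50.0
```

Under this reading CCC1 is symmetric in M0 and M1, while CCC2 is not: swapping the arguments in the example gives 25.0 for the share of M1 that appears in M0.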
