DCCFinder: A Very-Large Scale Code Clone Analysis and Visualization Tool

Description:

... of Computer Science, Graduate School of Information Science and Technology, Osaka University ...

1
DCCFinder: A Very-Large Scale Code Clone Analysis
and Visualization Tool
  • Simone Livieri
  • Yoshiki Higo
  • Makoto Matsushita
  • Katsuro Inoue

2
Background
  • Open-Source Software (OSS) is used in many
    software systems
  • Relations between software systems can be exposed
    through code clone analysis
  • Large collections of OSS exist
  • Huge memory requirements, long running time
  • Computing power is cheap
  • Large numbers of computers are often easily
    accessible
  • Code clone analysis can be distributed

3
In the beginning was CCFinder
  • CCFinder is a code-clone analysis tool
  • Widely used and cited
  • Token based
  • Many languages supported (e.g. C, C++, Java)
  • Good scalability (but can't handle very large
    input)

4
DCCFinder
  • D(istributed)CCFinder is a tool for distributed
    code clone analysis
  • Master-slave distributed system
  • Data sharing through a shared file system
  • Uses CCFinder to perform the code clone analysis
  • The prototype ran on 80 computers of the Student
    Laboratory of our department

5
Computational Model
  • The target is the set of source files undergoing
    code clone analysis
  • A category is a set of source files sharing a
    specific feature or use
  • A project is a single software system
  • A unit is a set of source files that may span
    multiple projects
  • Two units make a piece. A piece is the collection
    of files that a slave node analyzes
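The unit and piece definitions above can be illustrated with a minimal Python sketch. It is not taken from the slides: the greedy size-based packing in `split_into_units`, the choice to also pair each unit with itself (so that clones inside a single unit are not missed), and all function names are assumptions made for illustration.

```python
from itertools import combinations_with_replacement


def split_into_units(files, unit_size):
    """Greedily pack (path, size) pairs into units of at most unit_size.

    Files are taken in order, so a unit may span several projects.
    A file larger than unit_size ends up in a unit of its own.
    """
    units, current, current_size = [], [], 0
    for path, size in files:
        if current and current_size + size > unit_size:
            units.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        units.append(current)
    return units


def make_pieces(units):
    """Return index pairs (i, j) with i <= j; each pair is one piece.

    Assumption: a unit is also paired with itself.
    """
    return list(combinations_with_replacement(range(len(units)), 2))


# Toy target: sizes are in arbitrary units, paths are hypothetical.
files = [("a/x.c", 6), ("a/y.c", 5), ("b/z.c", 7), ("c/w.c", 4)]
units = split_into_units(files, unit_size=12)
pieces = make_pieces(units)
print(units)
print(pieces)
```

In this toy run the second unit mixes files from projects `b` and `c`, which matches the statement that a unit may span multiple projects.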
6
System Implementation (1)
  • Written in Java (about 20kLoc)
  • Master-Slave-Registry communication handled with
    Java RMI
  • Basic fault tolerance

Master and slave node characteristics
Processor: Pentium IV 3GHz
Memory: 1 GByte
Network link: Gigabit Ethernet connected to 100 MBit/s network hubs
OS: FreeBSD 5.3-STABLE
Local storage: 4050 GBytes
7
Analysis Process
8
System Implementation (2)
  • Indexer
    • Examines the target and collects file size,
      LoC, project and category name
    • Computes unit boundaries
  • Master Node
    • Creates the input files for CCFinder and
      assigns jobs to the slaves
  • Slave Node
    • Copies the files to the local storage
    • Executes CCFinder
    • Copies the output to the shared storage
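The three slave-node steps (copy in, run, copy out) might look roughly like the following Python sketch. The slides do not show the real implementation, which is written in Java, nor the CCFinder command line, so `detect_clones` is a hypothetical callable standing in for the CCFinder run, and `run_slave_job`, `job_id`, and the output file naming are likewise illustrative assumptions.

```python
import shutil
import tempfile
from pathlib import Path


def run_slave_job(job_id, piece_files, shared_out_dir, detect_clones):
    """Copy a piece to local scratch, run the detector, copy the result back.

    detect_clones(local_files, output_path) is a hypothetical stand-in
    for the CCFinder invocation, which the slides do not describe.
    """
    shared_out_dir = Path(shared_out_dir)
    with tempfile.TemporaryDirectory() as scratch:
        scratch = Path(scratch)
        local_files = []
        for i, src in enumerate(piece_files):
            # Index prefix keeps equal file names from different projects apart.
            dst = scratch / f"{i}_{Path(src).name}"
            shutil.copy(src, dst)
            local_files.append(dst)

        local_out = scratch / "clones.out"
        detect_clones(local_files, local_out)

        # Publish the result on the shared storage under a per-job name.
        shared_out_dir.mkdir(parents=True, exist_ok=True)
        result = shared_out_dir / f"{job_id}.out"
        shutil.copy(local_out, result)
    return result
```

Working on a local copy keeps the detector's file reads off the shared file system while it runs, which is presumably why the slides list the copy steps explicitly.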

10
System Implementation (3)
  • Clone Coverage Analyzer
    • Computes the number of shared lines of code
      between each pair of files, projects, and
      categories
  • Image Generator
    • Generates scatter plots, heat maps, or bar
      charts from the clone coverage data
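As an illustration of what the Clone Coverage Analyzer computes, the sketch below derives the CCC1 and CCC2 percentages for one pair from clone segments. It is a hedged reading of the prose metric definitions in this deck, not the tool's code: the `(file, first_line, last_line)` segment representation, the function names, and the choice to sum the covered lines on both sides for CCC1 are assumptions.

```python
def covered_lines(segments):
    """Count distinct lines covered by (file, first_line, last_line) segments.

    Overlapping segments are counted once.
    """
    lines = set()
    for path, first, last in segments:
        lines.update((path, n) for n in range(first, last + 1))
    return len(lines)


def ccc1(segs_m0, segs_m1, loc_m0, loc_m1):
    """Shared lines on both sides as a percentage of LoC(M0) + LoC(M1)."""
    shared = covered_lines(segs_m0) + covered_lines(segs_m1)
    return 100.0 * shared / (loc_m0 + loc_m1)


def ccc2(segs_m0, loc_m0):
    """Lines of M0 shared with M1 as a percentage of LoC(M0)."""
    return 100.0 * covered_lines(segs_m0) / loc_m0


# Toy pair: two overlapping segments in M0 (15 distinct lines), one in M1.
segs_m0 = [("a.c", 1, 10), ("a.c", 6, 15)]
segs_m1 = [("b.c", 21, 30)]
print(ccc1(segs_m0, segs_m1, loc_m0=50, loc_m1=50))  # 25.0
print(ccc2(segs_m0, loc_m0=50))                      # 30.0
```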

12
Case Study I: The FreeBSD Target
  • Vast collection of Open-Source software used by
    the FreeBSD OS
  • Unit size: 15 MBytes
  • Minimum code clone length: 50 tokens
  • Total number of tasks: 269,745

Number of categories: 45
Number of projects: 6,658
Number of .c files: 754,552
Total lines of code: 403,625,067
Total size: 10.8 GBytes
Time elapsed
Indexer: 22 minutes
D-CCFinder: 51 hours
Scatter plot
  Clone Coverage Analyzer: 23 hours
  Image Generator: 4 hours
Total: 78 hours 22 minutes
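The figures on this slide can be cross-checked with a few lines of Python. The check assumes, as an inference rather than something stated on the slides, that one task corresponds to one piece and that every unit is paired with itself and with every other unit, giving n(n+1)/2 tasks for n units. Under that assumption the task count implies 734 units, which is in the same range as 10.8 GBytes divided by 15 MBytes per unit (roughly 720 to 740).

```python
def pieces_for(units):
    """Number of unordered unit pairs, including each unit paired with itself."""
    return units * (units + 1) // 2


# Find the unit count that reproduces the reported number of tasks.
units = next(n for n in range(1, 10_000) if pieces_for(n) == 269_745)
print(units)  # 734

# Sum the reported stage times, in minutes.
stage_minutes = {
    "Indexer": 22,
    "D-CCFinder": 51 * 60,
    "Clone Coverage Analyzer": 23 * 60,
    "Image Generator": 4 * 60,
}
hours, minutes = divmod(sum(stage_minutes.values()), 60)
print(hours, minutes)  # 78 22
```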
13
Case Study I: Result
14
Case Study I: Result
php4 and php5: duplicated source tree
15
Case Study I: Result
The gstreamer main source tree is duplicated inside
all the gstreamer plugin projects
16
Case Study I: Result
Multiple copies of the X Window System source
tree
17
Case Study I: Result
18
Case Study I: Result
  • Database Category
  • CCC1: 41%
  • Causes
    • Different versions of the same software
    • Database drivers for different languages
    • Multiple copies of the phpX source tree

19
Case Study I: Result
  • Development Category
  • CCC1: 38%
  • Causes
    • Mainly the presence of different versions of
      the GNU binary utilities and compilers

20
Case Study I: Result
  • Lang and Development Categories
  • CCC1: 28%
  • Causes
    • The presence in both categories of the suite
      of GNU compilers

21
Case Study I: Result
  • X11 Fonts Category
  • CCC1: 46%
  • Causes
    • Small category size
    • Seven copies of the X Window System source tree

22
Case Study II: SPARS-J and the FreeBSD Target
  • SPARS-J is a Java component analysis tool
  • About 47,000 lines of code written in C
  • Code clones between SPARS-J and the whole
    FreeBSD target were detected

23
Case Study II: Code Clone Coverage (before)
  • Most of the code clones were from a single file:
    getopt.c

24
Case Study II: Code Clone Coverage (after)
  • Code clones from CGI handling source code
  • Specialized version of getopt.c

25
Summary
  • Proposed a new approach to distributed large
    scale code clone analysis
  • Obtained a global overview of code clones in the
    FreeBSD target
  • In SPARS-J, easily identified the use of code
    from the FreeBSD target

26
Summary (2)
  • The acceleration gain was 20. Limited by:
    • Data transfer, network congestion,
      master-slave coordination
  • Generating a scatter plot of reasonable size
    traded speed for accuracy. Effects:
    • Source code organization easily visible,
      enhanced artifacts, finer details not
      distinguishable
  • Currently can't efficiently filter unnecessary or
    not-so-interesting code clones
    • Being addressed by exploring fingerprint-based
      source code analysis

27
Future Work
  • Currently D-CCFinder is being rewritten
    • Better fault tolerance
    • GUI
    • Distributed post-processing and image
      generation
  • Exploring the evolution of different software
    systems with code clone analysis

28
Metrics
  • CCC1 is the percentage of lines of code shared
    between M0 and M1, computed over the total lines
    of code of M0 and M1
  • CCC2 is the percentage of lines of code that M0
    shares with M1, computed over the total lines of
    code of M0
  • (M0, M1): a pair of files, projects, or categories
  • Segments of the code clones between M0 and M1
  • Segments of the code clones between M0 and M1
    that lie in M0
  • Number of lines of code in x
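The two metrics can be written out as formulas from the prose definitions above. The slide's own notation did not survive in this transcript, so the symbol names used here are assumptions: Seg(M0, M1) for the code clone segments between M0 and M1, Seg_{M0}(M0, M1) for the part of those segments that lies in M0, and LoC(x) for the number of lines of code in x.

```latex
\[
\mathrm{CCC}_1(M_0, M_1) =
  \frac{\mathrm{LoC}\bigl(\mathrm{Seg}(M_0, M_1)\bigr)}
       {\mathrm{LoC}(M_0) + \mathrm{LoC}(M_1)} \times 100
\]
\[
\mathrm{CCC}_2(M_0, M_1) =
  \frac{\mathrm{LoC}\bigl(\mathrm{Seg}_{M_0}(M_0, M_1)\bigr)}
       {\mathrm{LoC}(M_0)} \times 100
\]
```

Under this reading CCC1 is symmetric in M0 and M1, while CCC2 is not: CCC2(M0, M1) and CCC2(M1, M0) generally differ when the two sides have different sizes.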