GPLAG: Detection of Software Plagiarism by Program Dependence Graph Analysis

1
GPLAG: Detection of Software Plagiarism by Program Dependence Graph Analysis

Chao Liu, Chen Chen, Jiawei Han, Philip S. Yu
University of Illinois at Urbana-Champaign
IBM T.J. Watson Research Center
Presented by Chao Liu

2
Motivations

Blossoming of open-source projects
SourceForge.net: 125,090 projects as of July 2006
A convenience for software plagiarism?
You can always find something online
Core-part plagiarism
Stripping off GUIs and irrelevant parts
(Illegally) reusing the implementations of core algorithms
Our goal
Efficient detection of core-part plagiarism

3
Challenges

Effectiveness
Professional plagiarists
Automated plagiarism
Efficiency
Only a small part of the code is plagiarized; how can it be detected efficiently?

4
Outline

Plagiarism Disguises
Review of Plagiarism Detection
GPLAG: PDG-based Plagiarism Detection
Efficiency and Scalability
Experiments
Conclusions

5
Original Program
A procedure in a program, called join
01 static void
02 make_blank (struct line *blank, int count)
03 {
04   int i;
05   unsigned char *buffer;
06   struct field *fields;
07   blank->nfields = count;
08   blank->buf.size = blank->buf.length = count + 1;
09   blank->buf.buffer = (char *) xmalloc (blank->buf.size);
10   buffer = (unsigned char *) blank->buf.buffer;
11   blank->fields = fields =
       (struct field *) xmalloc (sizeof (struct field) * count);
12   for (i = 0; i < count; i++) {
13     ...
14   }
15 }
6
Disguise 1 Format Alteration
Insert comments and blanks
01 static void
02 make_blank (struct line *blank, int count)
03 {
04   int i;
05   unsigned char *buffer;
06   struct field *fields;
07   blank->nfields = count; // initialization
08   blank->buf.size = blank->buf.length = count + 1;
09   blank->buf.buffer = (char *) xmalloc (blank->buf.size);
10   buffer = (unsigned char *) blank->buf.buffer;
11   blank->fields = fields =
       (struct field *) xmalloc (sizeof (struct field) * count);
12   for (i = 0; i < count; i++) {
13     ...
14   }
15 }
7
Disguise 2 Identifier Renaming
Rename variables consistently
01 static void
02 fill_content (struct line *fill, int num)
03 {
04   int i;
05   unsigned char *buffer;
06   struct field *fields;
07   fill->nfields = num; // initialization
08   fill->buf.size = fill->buf.length = num + 1;
09   fill->buf.buffer = (char *) xmalloc (fill->buf.size);
10   buffer = (unsigned char *) fill->buf.buffer;
11   fill->fields = fields =
       (struct field *) xmalloc (sizeof (struct field) * num);
12   for (i = 0; i < num; i++) {
13     ...
14   }
15 }
8
Disguise 3 Statement Reordering
Reorder non-dependent statements
01 static void
02 fill_content (struct line *fill, int num)
03 {
04   int i;
05   unsigned char *buffer;
06   struct field *fields;
11   fill->fields = fields =
       (struct field *) xmalloc (sizeof (struct field) * num);
08   fill->buf.size = fill->buf.length = num + 1;
09   fill->buf.buffer = (char *) xmalloc (fill->buf.size);
10   buffer = (unsigned char *) fill->buf.buffer;
07   fill->nfields = num; // initialization
12   for (i = 0; i < num; i++) {
13     ...
14   }
15 }
9
Disguise 4 Control Replacement
Use equivalent control structure

01 static void
02 fill_content (struct line *fill, int num)
03 {
04   int i;
05   unsigned char *buffer;
06   struct field *fields;
11   fill->fields = fields =
       (struct field *) xmalloc (sizeof (struct field) * num);
08   fill->buf.size = fill->buf.length = num + 1;
09   fill->buf.buffer = (char *) xmalloc (fill->buf.size);
10   buffer = (unsigned char *) fill->buf.buffer;
07   fill->nfields = num; // initialization
12   i = 0;
13   while (i < num) {
14     ...
15     i++;
16   }
17 }

10
Disguise 5 Code Insertion
Insert immaterial code

01 static void
02 fill_content (struct line *fill, int num)
03 {
04   int i;
05   unsigned char *buffer;
06   struct field *fields;
11   fill->fields = fields =
       (struct field *) xmalloc (sizeof (struct field) * num);
08   fill->buf.size = fill->buf.length = num + 1;
09   fill->buf.buffer = (char *) xmalloc (fill->buf.size);
10   buffer = (unsigned char *) fill->buf.buffer;
07   fill->nfields = num; // initialization
12   i = 0;
13   while (i < num) {
14     ...
       for (int j = 0; j < i; j++);
15     i++;
16   }
17 }

11
Fully Disguised
13
Review of Plagiarism Detection

String-based [Baker et al. 1995]
A program is represented as a string
Blanks and comments are ignored
AST-based [Baxter et al. 1998, Kontogiannis et al. 1995]
A program is represented as an Abstract Syntax Tree (AST)
Fragile to statement reordering, control replacement and code insertion
Token-based [Kamiya et al. 2002, Prechelt et al. 2002]
Variables of the same type are mapped to the same token
A program is represented as a token string
Fingerprints of token strings are used for robustness [Schleimer et al. 2003]
Partially robust to statement reordering, control replacement and code insertion
Representatives: Moss and JPlag
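
As a rough illustration of the token-based idea, the sketch below maps every identifier to one placeholder token, so consistent renaming leaves the token string unchanged. The tokenizer, the keyword list and the `ID` / `NUM` placeholders are assumptions made for this example; Moss and JPlag use their own tokenizers, and the slide's scheme maps variables by type rather than to a single token.

```python
import re

# Hypothetical, deliberately tiny keyword list for C-like code.
KEYWORDS = {"int", "char", "unsigned", "struct", "for", "while", "return", "sizeof"}

def tokenize(code):
    """Turn C-like code into a token list, hiding identifier names and literals."""
    tokens = []
    for tok in re.findall(r"[A-Za-z_]\w*|\d+|->|\+\+|[^\s\w]", code):
        if tok in KEYWORDS:
            tokens.append(tok)       # keywords are kept as they are
        elif re.match(r"[A-Za-z_]", tok):
            tokens.append("ID")      # any identifier becomes the same token
        elif tok.isdigit():
            tokens.append("NUM")     # any integer literal becomes the same token
        else:
            tokens.append(tok)       # operators and punctuation are kept
    return tokens

original = "blank->buf.size = blank->buf.length = count + 1;"
renamed = "fill->buf.size = fill->buf.length = num + 1;"
same_tokens = tokenize(original) == tokenize(renamed)
```

Renaming alone does not change the token string here, which is why token-based tools resist Disguise 2. Reordering statements does change the order of tokens, which is the weakness the slide points out.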

15
Graphic representation of source code
int sum(int array[], int count)
{
  int i, sum;
  sum = 0;
  for (i = 0; i < count; i++) {
    sum = add(sum, array[i]);
  }
  return sum;
}

int add(int a, int b)
{
  return a + b;
}
17
Control Dependency
int sum(int array[], int count)
{
  int i, sum;
  sum = 0;
  for (i = 0; i < count; i++) {
    sum = add(sum, array[i]);
  }
  return sum;
}

int add(int a, int b)
{
  return a + b;
}
18
Data Dependency
int sum(int array[], int count)
{
  int i, sum;
  sum = 0;
  for (i = 0; i < count; i++) {
    sum = add(sum, array[i]);
  }
  return sum;
}

int add(int a, int b)
{
  return a + b;
}
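
One plausible way to hold the PDG of `sum` in memory is sketched below: each statement becomes a typed vertex, and control and data dependencies become typed edges. The vertex numbering, the kind names and the edge set are assumptions for illustration (the slide figure is not part of this transcript); the edges were derived by hand from the code and may differ from the figure.

```python
# Vertex id -> (kind, source text). Kind names are illustrative labels.
SUM_VERTICES = {
    1: ("declaration", "int i, sum"),
    2: ("assignment", "sum = 0"),
    3: ("assignment", "i = 0"),
    4: ("control", "i < count"),
    5: ("assignment", "sum = add(sum, array[i])"),
    6: ("increment", "i++"),
    7: ("return", "return sum"),
}

# (source vertex, target vertex, edge type), hand-derived from the code.
SUM_EDGES = {
    # Control dependency: the loop predicate decides whether these run.
    (4, 5, "control"), (4, 6, "control"),
    # Data dependency on sum.
    (2, 5, "data"), (5, 5, "data"), (2, 7, "data"), (5, 7, "data"),
    # Data dependency on i.
    (3, 4, "data"), (6, 4, "data"), (3, 5, "data"), (6, 5, "data"),
    (3, 6, "data"), (6, 6, "data"),
}

def edges_of_type(edges, edge_type):
    """Return the sorted (source, target) pairs of one edge type."""
    return sorted((u, v) for (u, v, t) in edges if t == edge_type)

control_edges = edges_of_type(SUM_EDGES, "control")
data_edges = edges_of_type(SUM_EDGES, "data")
```

Nothing in this structure records variable names, formatting or statement order; only vertex kinds and dependencies remain.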
19
Plagiarism Detectable?
20
Corresponding PDGs
PDG for the Original Code
PDG for the Plagiarized Code
21
PDG-based Plagiarism Detection

A program is represented as a set of PDGs
Let g be the PDG of procedure P in the original program
Let g' be the PDG of procedure P' in the plagiarism suspect
Subgraph isomorphism implies plagiarism
If g is subgraph isomorphic to g', P' is likely plagiarized from P
γ-isomorphism: graph g is γ-isomorphic to g' if there exists a subgraph s of g such that s is subgraph isomorphic to g' and |s| ≥ γ|g|, with γ in (0, 1]
If g is γ-isomorphic to g', the PDG pair (g, g') is regarded as a plagiarized PDG pair and is returned to a human for examination
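
The relaxed isomorphism definition above can be checked by brute force on very small graphs, as sketched below. This is only an illustration: it assumes graph size is counted in vertices, that the subgraph s is an induced subgraph of g, and that vertex kinds and edge types must match. The graph encoding and function names are hypothetical, and a real tool would use a dedicated matching library instead of enumeration.

```python
from itertools import combinations, permutations
from math import ceil

def embeds(nodes, g, h):
    """True if the subgraph of g induced by `nodes` maps injectively into h,
    preserving vertex kinds and every typed edge."""
    g_kinds, g_edges = g
    h_kinds, h_edges = h
    nodes = list(nodes)
    inner = [(u, v, t) for (u, v, t) in g_edges if u in nodes and v in nodes]
    for image in permutations(h_kinds, len(nodes)):
        f = dict(zip(nodes, image))
        if all(g_kinds[u] == h_kinds[f[u]] for u in nodes) and \
           all((f[u], f[v], t) in h_edges for (u, v, t) in inner):
            return True
    return False

def gamma_isomorphic(g, h, gamma):
    """Brute-force check that g is gamma-isomorphic to h (tiny graphs only)."""
    g_kinds, _ = g
    need = ceil(gamma * len(g_kinds))  # smallest admissible size of s
    # If a larger induced subgraph embeds, so does every smaller one,
    # so testing subsets of exactly `need` vertices is enough.
    return any(embeds(s, g, h) for s in combinations(g_kinds, need))

# A graph is (vertex -> kind, set of (source, target, edge type)).
g = ({1: "decl", 2: "assign", 3: "control", 4: "call"},
     {(1, 2, "data"), (2, 4, "data"), (3, 4, "control")})
# Suspect: same structure under new vertex names, plus one inserted statement.
h = ({"a": "decl", "b": "assign", "c": "control", "d": "call", "e": "assign"},
     {("a", "b", "data"), ("b", "d", "data"), ("c", "d", "control"),
      ("c", "e", "control")})
unrelated = ({"x": "assign", "y": "assign"}, {("x", "y", "data")})
```

With γ = 0.9 the inserted vertex in `h` does not hide the match, while `unrelated` is rejected because it has too few vertices to host 90% of `g`.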

22
Advantages

Robust because it is hard to overhaul PDGs
Dependencies encode program logic
Reusing that logic is the incentive for plagiarism
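
To make the robustness claim concrete, the sketch below computes data dependencies for straight-line code and shows that reordering independent statements leaves the edge set unchanged. The statement table is a hand-made approximation of lines 07 to 11 of the make_blank example (which names each statement defines and uses is an assumption), and the helper handles only straight-line code without branches or loops.

```python
def data_dependencies(statements):
    """statements: list of (stmt_id, defined_names, used_names) in program order.
    Returns the set of (defining_stmt, using_stmt, name) edges."""
    last_def = {}
    edges = set()
    for stmt_id, defined, used in statements:
        for name in used:
            if name in last_def:          # a use reached by an earlier definition
                edges.add((last_def[name], stmt_id, name))
        for name in defined:
            last_def[name] = stmt_id
    return edges

# stmt id -> (names defined, names used); count is a parameter, so it has no
# defining statement in this table.
STATEMENTS = {
    "07": ({"nfields"}, {"count"}),
    "08": ({"buf.size", "buf.length"}, {"count"}),
    "09": ({"buf.buffer"}, {"buf.size"}),
    "10": ({"buffer"}, {"buf.buffer"}),
    "11": ({"fields"}, {"count"}),
}

def program(order):
    """Build a statement list in the given order."""
    return [(sid, *STATEMENTS[sid]) for sid in order]

original = data_dependencies(program(["07", "08", "09", "10", "11"]))
reordered = data_dependencies(program(["11", "08", "09", "10", "07"]))
```

Both orders give the same two edges, 08 to 09 and 09 to 10, so a representation built from dependencies does not see the reordering at all.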

24
Efficiency and Scalability

Search space
If the original program has n procedures and the plagiarism suspect has m procedures, n × m subgraph isomorphism tests are needed
Pruning the search space
Lossless filter
Statistical lossy filter

25
Lossless filter

Interestingness
PDGs smaller than an interesting size K are excluded from both sides
γ-isomorphism definition
A PDG pair (g, g') is discarded if |g'| < γ|g|
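
The two lossless rules can be sketched as a filter over PDG sizes. The function name, the dictionary layout and the sizes in the example are hypothetical; size is assumed to mean the number of vertices.

```python
from itertools import product

def lossless_filter(original_sizes, suspect_sizes, K, gamma):
    """Yield (original, suspect) procedure-name pairs that survive both rules.
    Each *_sizes argument maps a procedure name to its PDG vertex count."""
    # Rule 1: drop PDGs smaller than the interesting size K on both sides.
    big_orig = {p: n for p, n in original_sizes.items() if n >= K}
    big_susp = {q: m for q, m in suspect_sizes.items() if m >= K}
    for (p, n), (q, m) in product(big_orig.items(), big_susp.items()):
        # Rule 2: g' cannot host a subgraph of size gamma * |g| if it is smaller.
        if m >= gamma * n:
            yield p, q

original_sizes = {"make_blank": 20, "tiny": 3}
suspect_sizes = {"fill_content": 22, "helper": 12, "small": 4}
survivors = list(lossless_filter(original_sizes, suspect_sizes, K=10, gamma=0.9))
```

Of the six possible pairs only one survives here. No true match can be lost, because a pair removed by either rule could never satisfy the definition.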

26
Lossy Filter

Observation
If procedure P' is plagiarized from procedure P, its PDG g' should look similar to g
So discard dissimilar PDG pairs
Requirement
This filter must be lightweight

27
Vertex Histogram

Represent PDG g by
h(g) = (n1, n2, ..., nk),
where ni is the frequency of the i-th kind of vertex.
Similarly, represent PDG g' by
h(g') = (m1, m2, ..., mk).
Direct similarity measurement?
How to define a proper similarity threshold?
Is a threshold defined this way program-independent?
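
A vertex histogram is cheap to compute, as the sketch below shows. The list of vertex kinds and the two example PDGs are invented for illustration; the actual set of kinds depends on the PDG generator.

```python
from collections import Counter

# Hypothetical fixed ordering of the k vertex kinds.
KINDS = ("declaration", "assignment", "control", "call", "return")

def vertex_histogram(vertex_kinds, kinds=KINDS):
    """Return (n1, ..., nk): how often each kind occurs among the vertices."""
    counts = Counter(vertex_kinds)
    return tuple(counts[kind] for kind in kinds)

g_vertices = ["declaration", "assignment", "assignment", "control", "call", "return"]
suspect_vertices = ["declaration", "assignment", "assignment", "assignment",
                    "control", "call", "return"]

h_g = vertex_histogram(g_vertices)
h_suspect = vertex_histogram(suspect_vertices)
```

The histograms here are (1, 2, 1, 1, 1) and (1, 3, 1, 1, 1). Comparing them takes time linear in k, which is what makes the histogram usable as a pre-filter before any isomorphism test.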

28
Hypothesis Testing-based Approach

Basic idea
Estimate a k-dimensional multinomial distribution from h(g)
Test whether h(g') is likely an observation from this distribution
If it is, g' looks similar to g, and an isomorphism test is needed
Otherwise, (g, g') is discarded
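
The transcript does not preserve the test statistic (the Technical Details slides that follow are empty here), so the sketch below uses a likelihood-ratio G-test as one standard way to test a histogram against a multinomial distribution. It is not necessarily the authors' exact formulation. The add-one smoothing, the function names and the example histograms are assumptions; the critical value is the standard chi-square value for 3 degrees of freedom at the 0.05 level.

```python
from math import log

def g_statistic(h_g, h_suspect):
    """G statistic of h_suspect against the multinomial estimated from h_g."""
    k = len(h_g)
    n = sum(h_g)
    m = sum(h_suspect)
    # Assumption: add-one smoothing, so no category has probability zero.
    probs = [(n_i + 1) / (n + k) for n_i in h_g]
    # G = 2 * sum of observed * ln(observed / expected); empty cells add nothing.
    return 2.0 * sum(m_i * log(m_i / (m * p_i))
                     for m_i, p_i in zip(h_suspect, probs) if m_i > 0)

def looks_similar(h_g, h_suspect, critical_value):
    """Keep the pair for isomorphism testing unless the test rejects it."""
    return g_statistic(h_g, h_suspect) <= critical_value

CHI2_95_DF3 = 7.815  # k = 4 kinds gives k - 1 = 3 degrees of freedom

h_g = (10, 20, 30, 40)
keep = looks_similar(h_g, (10, 20, 30, 40), CHI2_95_DF3)
drop = looks_similar(h_g, (40, 30, 20, 10), CHI2_95_DF3)
```

Framing the filter as a hypothesis test replaces a hand-tuned similarity threshold with a significance level, which addresses the question of program independence raised on the previous slide.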

29
Technical Details
30
Technical Details (contd)
31
Work-flow of GPLAG

PDGs are generated with CodeSurfer
Isomorphism testing is implemented with VFLib.
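
The work-flow figure is missing from the transcript, so the sketch below is only a guess at how the stages fit together, based on the order in which the slides present them: lossless filter, lossy filter, then isomorphism testing. The PDG generator and the matcher are passed in as plain callables; nothing here reflects the real CodeSurfer or VFLib interfaces, and the stand-in functions in the example are toys.

```python
def detect(original, suspect, K, gamma, similar, isomorphic):
    """original / suspect: procedure name -> PDG (any object supporting len()).
    similar(g, h) and isomorphic(g, h, gamma) are caller-supplied callables.
    Returns the PDG pairs to hand to a human reviewer."""
    flagged = []
    for p, g in original.items():
        if len(g) < K:
            continue                      # lossless: PDG too small to matter
        for q, h in suspect.items():
            if len(h) < K or len(h) < gamma * len(g):
                continue                  # lossless: size rules
            if not similar(g, h):
                continue                  # lossy: histograms look different
            if isomorphic(g, h, gamma):   # the expensive test runs last
                flagged.append((p, q))
    return flagged

# Toy stand-ins: a "PDG" is just a list of vertex kinds.
calls = []

def toy_similar(g, h):
    return set(g) == set(h)

def toy_isomorphic(g, h, gamma):
    calls.append((tuple(g), tuple(h)))    # record each expensive test
    return sorted(g) == sorted(h)

original = {"core": ["assign", "assign", "control", "call"], "tiny": ["return"]}
suspect = {"copy": ["assign", "control", "call", "assign"],
           "other": ["decl", "decl", "decl", "decl"],
           "short": ["call"]}
flagged = detect(original, suspect, 2, 0.9, toy_similar, toy_isomorphic)
```

Of the six candidate pairs, only one reaches the expensive test in this toy run, which is the point of putting the cheap filters first.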

33
Experiment Design

Subject programs
Effectiveness
Filter efficiency
Core-part plagiarism detection

34
Effectiveness

Manual plagiarism took 2 hours, but could it be automated?
GPLAG detects all plagiarized PDG pairs within 1 second
PDG isomorphism also reveals which plagiarism disguises were applied

35
Efficiency

Subject programs
bc, less and tar.
Exact copy as plagiarism.
Lossless and lossy filter
Pruning PDG-pairs.
Implication for overall time cost.

36
Pruning Uninteresting PDG-pairs

Lossless only
Lossless and lossy

37
Implication for Overall Time Cost

Subgraph isomorphism tests run with a time-out; tests that hit it are time hogs.
The lossless filter does not save much time.
The lossy filter significantly reduces the time cost.
The major time saving comes from avoiding time hogs.

38
Detection of Core-part Plagiarism

Lower time cost with the lossy filter.
Fewer false positives with the lossy filter.

40
Conclusions

We developed a new algorithm, GPLAG, for software plagiarism detection
It is more effective against professional plagiarists
We developed a statistical lossy filter, which improves the efficiency of GPLAG
We experimentally verified the effectiveness and efficiency of GPLAG

41
Q & A
Thank You!
42
References

[1] B. S. Baker. On finding duplication and near duplication in large software systems. In Proc. of 2nd Working Conf. on Reverse Engineering, 1995.
[2] I. D. Baxter, A. Yahin, L. Moura, M. Sant'Anna, and L. Bier. Clone detection using abstract syntax trees. In Proc. of Int. Conf. on Software Maintenance, 1998.
[3] K. Kontogiannis, M. Galler, and R. DeMori. Detecting code similarity using patterns. In Working Notes of 3rd Workshop on AI and Software Engineering, 1995.
[4] T. Kamiya, S. Kusumoto, and K. Inoue. CCFinder: A multilinguistic token-based code clone detection system for large scale source code. IEEE Trans. Softw. Eng., 28(7), 2002.
[5] L. Prechelt, G. Malpohl, and M. Philippsen. Finding plagiarisms among a set of programs with JPlag. J. of Universal Computer Science, 8(11), 2002.
[6] S. Schleimer, D. S. Wilkerson, and A. Aiken. Winnowing: Local algorithms for document fingerprinting. SIGMOD, 2003.
[7] V. B. Livshits and T. Zimmermann. DynaMine: Finding common error patterns by mining software revision histories. In Proc. of 13th Int. Symp. on the Foundations of Software Engineering, 2005.
[8] C. Liu, X. Yan, and J. Han. Mining control flow abnormality for logic error isolation. In Proc. 2006 SIAM Int. Conf. on Data Mining, 2006.
[9] C. Liu, X. Yan, H. Yu, J. Han, and P. S. Yu. Mining behavior graphs for backtrace of noncrashing bugs. In SDM, 2005.