Title: GPLAG: Detection of Software Plagiarism by Program Dependence Graph Analysis
1GPLAG: Detection of Software Plagiarism by Program Dependence Graph Analysis
- Chao Liu, Chen Chen,
- Jiawei Han, Philip S. Yu
- University of Illinois at Urbana-Champaign
- IBM T.J. Watson Research Center
- Presented by Chao Liu
2Motivations
- Blossom of open-source projects
- SourceForge.net: 125,090 projects as of July 2006
- Convenience for software plagiarism?
- You can always find something online
- Core-part plagiarism
- Ripping off GUIs and irrelevant parts
- (Illegally) reuse the implementations of core algorithms
- Our goal
- Efficient detection of core-part plagiarism
3Challenges
- Effectiveness
- Professional plagiarists
- Automated plagiarism
- Efficiency
- Only a small part of the code is plagiarized; how can it be detected efficiently?
4Outline
- Plagiarism Disguises
- Review of Plagiarism Detection
- GPLAG PDG-based Plagiarism Detection
- Efficiency and Scalability
- Experiments
- Conclusions
5Original Program
A procedure from a program called join
static void
make_blank (struct line *blank, int count)
{
  int i;
  unsigned char *buffer;
  struct field *fields;
  blank->nfields = count;
  blank->buf.size = blank->buf.length = count + 1;
  blank->buf.buffer = (char *) xmalloc (blank->buf.size);
  buffer = (unsigned char *) blank->buf.buffer;
  blank->fields = fields =
    (struct field *) xmalloc (sizeof (struct field) * count);
  for (i = 0; i < count; i++) {
    ...
  }
}
6Disguise 1 Format Alteration
Insert comments and blanks
static void
make_blank (struct line *blank, int count)
{
  int i;
  unsigned char *buffer;
  struct field *fields;
  blank->nfields = count;  // initialization
  blank->buf.size = blank->buf.length = count + 1;
  blank->buf.buffer = (char *) xmalloc (blank->buf.size);
  buffer = (unsigned char *) blank->buf.buffer;
  blank->fields = fields =
    (struct field *) xmalloc (sizeof (struct field) * count);
  for (i = 0; i < count; i++) {
    ...
  }
}
7Disguise 2 Identifier Renaming
Rename variables consistently
static void
fill_content (struct line *fill, int num)
{
  int i;
  unsigned char *buffer;
  struct field *fields;
  fill->nfields = num;  // initialization
  fill->buf.size = fill->buf.length = num + 1;
  fill->buf.buffer = (char *) xmalloc (fill->buf.size);
  buffer = (unsigned char *) fill->buf.buffer;
  fill->fields = fields =
    (struct field *) xmalloc (sizeof (struct field) * num);
  for (i = 0; i < num; i++) {
    ...
  }
}
8Disguise 3 Statement Reordering
Reorder non-dependent statements
static void
fill_content (struct line *fill, int num)
{
  int i;
  unsigned char *buffer;
  struct field *fields;
  fill->fields = fields =
    (struct field *) xmalloc (sizeof (struct field) * num);
  fill->buf.size = fill->buf.length = num + 1;
  fill->buf.buffer = (char *) xmalloc (fill->buf.size);
  buffer = (unsigned char *) fill->buf.buffer;
  fill->nfields = num;  // initialization
  for (i = 0; i < num; i++) {
    ...
  }
}
9Disguise 4 Control Replacement
Use equivalent control structure
static void
fill_content (struct line *fill, int num)
{
  int i;
  unsigned char *buffer;
  struct field *fields;
  fill->fields = fields =
    (struct field *) xmalloc (sizeof (struct field) * num);
  fill->buf.size = fill->buf.length = num + 1;
  fill->buf.buffer = (char *) xmalloc (fill->buf.size);
  buffer = (unsigned char *) fill->buf.buffer;
  fill->nfields = num;  // initialization

  i = 0;
  while (i < num) {
    ...
    i++;
  }
}
10Disguise 5 Code Insertion
Insert immaterial code
static void
fill_content (struct line *fill, int num)
{
  int i;
  unsigned char *buffer;
  struct field *fields;
  fill->fields = fields =
    (struct field *) xmalloc (sizeof (struct field) * num);
  fill->buf.size = fill->buf.length = num + 1;
  fill->buf.buffer = (char *) xmalloc (fill->buf.size);
  buffer = (unsigned char *) fill->buf.buffer;
  fill->nfields = num;  // initialization

  i = 0;
  while (i < num) {
    ...
    for (int j = 0; j < i; j++)
      ;
    i++;
  }
}
11Fully Disguised
12Outline
- Plagiarism Disguises
- Review of Plagiarism Detection
- GPLAG PDG-based Plagiarism Detection
- Efficiency and Scalability
- Experiments
- Conclusions
13Review of Plagiarism Detection
- String-based [Baker et al. 1995]
- A program is represented as a string
- Blanks and comments are ignored
- AST-based [Baxter et al. 1998, Kontogiannis et al. 1995]
- A program is represented as an Abstract Syntax Tree (AST)
- Fragile to statement reordering, control replacement and code insertion
- Token-based [Kamiya et al. 2002, Prechelt et al. 2002]
- Variables of the same type are mapped to the same token
- A program is represented as a token string
- Fingerprint of token strings is used for robustness [Schleimer et al. 2003]
- Partially robust to statement reordering, control replacement and code insertion
- Representatives: Moss and JPlag
14Outline
- Plagiarism Disguises
- Review of Plagiarism Detection
- GPLAG PDG-based Plagiarism Detection
- Efficiency and Scalability
- Experiments
- Conclusions
15Graphic representation of source code
int sum(int array[], int count)
{
  int i, sum;
  sum = 0;
  for (i = 0; i < count; i++) {
    sum = add(sum, array[i]);
  }
  return sum;
}

int add(int a, int b)
{
  return a + b;
}
17Control Dependency
int sum(int array[], int count)
{
  int i, sum;
  sum = 0;
  for (i = 0; i < count; i++) {
    sum = add(sum, array[i]);
  }
  return sum;
}

int add(int a, int b)
{
  return a + b;
}
18Data Dependency
int sum(int array[], int count)
{
  int i, sum;
  sum = 0;
  for (i = 0; i < count; i++) {
    sum = add(sum, array[i]);
  }
  return sum;
}

int add(int a, int b)
{
  return a + b;
}
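The two kinds of dependency on the preceding slides can be written down as a small edge list. The sketch below is a hand-built, simplified PDG for sum(), not CodeSurfer output: the names PdgEdge, SUM_VERTICES, SUM_EDGES and count_edges are hypothetical, and parameters, declarations, loop-carried self-dependencies and the call into add() are deliberately left out.

```c
#include <assert.h>
#include <stddef.h>

/* The two kinds of dependency edge in a PDG. */
typedef enum { CONTROL_DEP, DATA_DEP } DepKind;

typedef struct {
    int from;      /* vertex that is depended upon */
    int to;        /* vertex that depends on it */
    DepKind kind;
} PdgEdge;

/* Vertices of a simplified PDG for sum() (illustrative assumption). */
const char *const SUM_VERTICES[] = {
    "entry: sum",                /* 0 */
    "sum = 0",                   /* 1 */
    "i = 0",                     /* 2 */
    "i < count",                 /* 3 */
    "sum = add(sum, array[i])",  /* 4 */
    "i++",                       /* 5 */
    "return sum",                /* 6 */
};

const PdgEdge SUM_EDGES[] = {
    /* control dependencies: top-level statements hang off the entry,
       loop-body statements hang off the loop predicate */
    {0, 1, CONTROL_DEP}, {0, 2, CONTROL_DEP}, {0, 3, CONTROL_DEP},
    {0, 6, CONTROL_DEP}, {3, 4, CONTROL_DEP}, {3, 5, CONTROL_DEP},
    /* data dependencies carried by the variable sum */
    {1, 4, DATA_DEP}, {1, 6, DATA_DEP}, {4, 6, DATA_DEP},
    /* data dependencies carried by the variable i */
    {2, 3, DATA_DEP}, {2, 4, DATA_DEP}, {2, 5, DATA_DEP},
    {5, 3, DATA_DEP}, {5, 4, DATA_DEP},
};

#define SUM_EDGE_COUNT (sizeof SUM_EDGES / sizeof SUM_EDGES[0])

/* Count the edges of one dependency kind. */
size_t count_edges(const PdgEdge *edges, size_t n, DepKind kind)
{
    assert(edges != NULL || n == 0);
    size_t total = 0;
    for (size_t i = 0; i < n; i++)
        if (edges[i].kind == kind)
            total++;
    return total;
}
```

Calling count_edges(SUM_EDGES, SUM_EDGE_COUNT, CONTROL_DEP) walks the list once. Renaming i or sum in the source would leave this edge list unchanged, which is the property the PDG representation relies on.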
19Plagiarism Detectable?
20Corresponding PDGs
PDG for the Original Code
PDG for the Plagiarized Code
21PDG-based Plagiarism Detection
- A program is represented as a set of PDGs
- Let g be a PDG of procedure P in the original program
- Let g' be a PDG of procedure P' in the plagiarism suspect
- Subgraph isomorphism implies plagiarism
- If g is subgraph isomorphic to g', P' is likely plagiarized from P
- γ-isomorphism: graph g is γ-isomorphic to g' if there exists a subgraph s of g such that s is subgraph isomorphic to g', and |s| ≥ γ|g|
- If g is γ-isomorphic to g', the PDG pair (g, g') is regarded as a plagiarized PDG pair and is returned to a human reviewer for examination
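A subgraph isomorphism test of the kind described above can be sketched with plain backtracking. This is a minimal illustration under stated assumptions, not the VFLib algorithm the talk uses: the Graph type, MAX_V and both function names are hypothetical, control and data edges are not distinguished, extra edges in the larger graph are tolerated, and only the γ = 1 case (all of the first graph must embed in the second) is checked.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_V 16   /* hypothetical upper bound on vertices per graph */

typedef struct {
    int n;                    /* number of vertices */
    int kind[MAX_V];          /* vertex kind, e.g. assignment or predicate */
    bool edge[MAX_V][MAX_V];  /* edge[u][v]: dependency from u to v */
} Graph;

/* Try to extend an injective mapping of g's vertices 0..u-1 into h. */
static bool extend(const Graph *g, const Graph *h, int u,
                   int map[], bool used[])
{
    if (u == g->n)
        return true;                     /* every vertex of g is mapped */
    for (int v = 0; v < h->n; v++) {
        if (used[v] || g->kind[u] != h->kind[v])
            continue;                    /* taken, or kinds differ */
        map[u] = v;
        bool ok = true;
        /* Each edge of g among the mapped vertices must also exist in h. */
        for (int w = 0; w <= u && ok; w++) {
            if (g->edge[u][w] && !h->edge[v][map[w]]) ok = false;
            if (g->edge[w][u] && !h->edge[map[w]][v]) ok = false;
        }
        if (!ok)
            continue;
        used[v] = true;
        if (extend(g, h, u + 1, map, used))
            return true;
        used[v] = false;                 /* backtrack */
    }
    return false;
}

/* True if g can be embedded in h with vertex kinds and edges preserved. */
bool subgraph_isomorphic(const Graph *g, const Graph *h)
{
    assert(g->n >= 0 && g->n <= MAX_V && h->n >= 0 && h->n <= MAX_V);
    int map[MAX_V];
    bool used[MAX_V] = { false };
    return extend(g, h, 0, map, used);
}
```

With g built from the original procedure and h from the suspect, a true result corresponds to the "g is subgraph isomorphic to g'" case. Handling γ < 1 would additionally require searching over sufficiently large subgraphs of g, which this sketch does not attempt.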
22Advantages
- Robust because it is hard to overhaul PDGs
- Dependencies encode program logic
- Incentive of plagiarism
23Outline
- Plagiarism Disguises
- Review of Plagiarism Detection
- GPLAG PDG-based Plagiarism Detection
- Efficiency and Scalability
- Experiments
- Conclusions
24Efficiency and Scalability
- Search space
- If the original program has n procedures and the plagiarism suspect has m procedures
- n × m subgraph isomorphism tests
- Pruning search space
- Lossless filter
- Statistical lossy filter
25Lossless filter
- Interestingness
- PDGs smaller than an interesting size K are excluded from both sides
- γ-isomorphism definition
- A PDG pair (g, g') is discarded if |g'| < γ|g|.
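Both lossless rules reduce to integer and floating-point comparisons on graph sizes. The sketch below assumes that the size of a PDG is its vertex count and that γ lies in (0, 1]; the function and parameter names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Return true if the pair (g, g') must still be examined.
   size_g and size_g_prime are vertex counts (an assumption of this sketch). */
bool survives_lossless_filter(int size_g, int size_g_prime,
                              int min_size_k, double gamma)
{
    assert(gamma > 0.0 && gamma <= 1.0);
    /* Interestingness: drop PDGs smaller than K on either side. */
    if (size_g < min_size_k || size_g_prime < min_size_k)
        return false;
    /* Size rule: g' is too small to contain gamma * |g| matched vertices. */
    if ((double)size_g_prime < gamma * (double)size_g)
        return false;
    return true;
}
```

For example, with K = 10 and γ = 0.9, a 20-vertex original paired with a 17-vertex suspect is discarded, because 17 is below 0.9 × 20 = 18. No pair that could satisfy the definition is ever dropped, which is why the filter is called lossless.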
26Lossy Filter
- Observation
- If procedure P' is plagiarized from procedure P, its PDG g' should look similar to g.
- So discard those dissimilar PDG pairs
- Requirement
- This filter must be lightweight
27Vertex Histogram
- Represent PDG g by
- h(g) = (n1, n2, ..., nk),
- where ni is the frequency of the ith kind of vertices.
- Similarly, represent PDG g' by
- h(g') = (m1, m2, ..., mk).
- Direct similarity measurement?
- How to define a proper similarity threshold?
- Is a threshold defined this way program-independent?
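Computing a vertex histogram is a single pass over the vertices. In the sketch below the number of vertex kinds (NUM_KINDS) and the encoding of kinds as small integers are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_KINDS 4   /* hypothetical number k of vertex kinds */

/* Fill hist[i] with the number of vertices of kind i. */
void vertex_histogram(const int *kinds, size_t n, unsigned hist[NUM_KINDS])
{
    for (size_t i = 0; i < NUM_KINDS; i++)
        hist[i] = 0;
    for (size_t i = 0; i < n; i++) {
        assert(kinds[i] >= 0 && kinds[i] < NUM_KINDS);
        hist[kinds[i]]++;
    }
}
```

Because the histogram ignores edges entirely, it costs time linear in the number of vertices, which is what makes it usable as a cheap pre-check before any isomorphism test.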
28Hypothesis Testing-based Approach
- Basic idea
- Estimate a k-dimensional multinomial distribution from h(g)
- Test whether h(g') is likely an observation from that distribution
- If it is, g' looks similar to g, and an isomorphism test is needed
- Otherwise, (g, g') is discarded
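The statistic used in the talk is not recoverable from these slides, so the sketch below substitutes Pearson's chi-square goodness-of-fit statistic as one standard way to test a multinomial hypothesis. The add-one smoothing of the estimated probabilities, the function names, and the caller-supplied critical value are all assumptions of this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Pearson chi-square statistic of h(g') against the multinomial
   distribution estimated from h(g). */
double chi_square_statistic(const unsigned *h_g, const unsigned *h_gp,
                            size_t k)
{
    assert(k > 0);
    double n_total = 0.0, m_total = 0.0;
    for (size_t i = 0; i < k; i++) {
        n_total += h_g[i];
        m_total += h_gp[i];
    }
    double stat = 0.0;
    for (size_t i = 0; i < k; i++) {
        /* Add-one smoothed estimate of p_i, so no probability is zero. */
        double p = (h_g[i] + 1.0) / (n_total + (double)k);
        double expected = m_total * p;
        double diff = (double)h_gp[i] - expected;
        if (expected > 0.0)
            stat += diff * diff / expected;
    }
    return stat;
}

/* Keep the pair for isomorphism testing if the hypothesis is not rejected. */
bool looks_similar(const unsigned *h_g, const unsigned *h_gp, size_t k,
                   double critical_value)
{
    return chi_square_statistic(h_g, h_gp, k) <= critical_value;
}
```

The critical value comes from the chi-square distribution with k - 1 degrees of freedom; for k = 4 at the 5% level it is about 7.815. Choosing a significance level rather than a raw similarity threshold is what makes the cut-off independent of the particular programs being compared.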
29Technical Details
30Technical Details (contd)
31Work-flow of GPLAG
- PDGs are generated with CodeSurfer
- Isomorphism testing is implemented with VFLib.
32Outline
- Plagiarism Disguises
- Review of Plagiarism Detection
- GPLAG PDG-based Plagiarism Detection
- Efficiency and Scalability
- Experiments
- Conclusions
33Experiment Design
- Subject programs
- Effectiveness
- Filter efficiency
- Core-part plagiarism detection
34Effectiveness
- 2-hour manual plagiarism, but can it be automated?
- GPLAG detects all plagiarized PDG pairs within 1 second
- PDG isomorphism also reveals what plagiarism disguises are applied
35Efficiency
- Subject programs
- bc, less and tar.
- Exact copy as plagiarism.
- Lossless and lossy filter
- Pruning PDG-pairs.
- Implication to overall time cost.
36Pruning Uninteresting PDG-pairs
- Lossless only
- Lossless and lossy
37Implication to Overall Time Cost
- Time-out for subgraph isomorphism testing, time hogs.
- Lossless filter does not save much time.
- Lossy filter significantly reduces the time cost.
- Major time saving comes from the avoidance of time hogs.
38Detection of Core-part Plagiarism
- Lower time cost with lossy filter.
- Lower false positives with lossy filter.
39Outline
- Plagiarism Disguises
- Review of Plagiarism Detection
- GPLAG PDG-based Plagiarism Detection
- Efficiency and Scalability
- Experiments
- Conclusions
40Conclusions
- We developed a new algorithm, GPLAG, for software plagiarism detection
- It is more effective against professional plagiarists
- We developed a statistical lossy filter, which improves the efficiency of GPLAG
- We experimentally verified the effectiveness and efficiency of GPLAG
41Q & A
Thank You!
42References
- [1] B. S. Baker. On finding duplication and near duplication in large software systems. In Proc. of 2nd Working Conf. on Reverse Engineering, 1995.
- [2] I. D. Baxter, A. Yahin, L. Moura, M. Sant'Anna, and L. Bier. Clone detection using abstract syntax trees. In Proc. of Int. Conf. on Software Maintenance, 1998.
- [3] K. Kontogiannis, M. Galler, and R. DeMori. Detecting code similarity using patterns. In Working Notes of 3rd Workshop on AI and Software Engineering, 1995.
- [4] T. Kamiya, S. Kusumoto, and K. Inoue. CCFinder: a multilinguistic token-based code clone detection system for large scale source code. IEEE Trans. Softw. Eng., 28(7), 2002.
- [5] L. Prechelt, G. Malpohl, and M. Philippsen. Finding plagiarisms among a set of programs with JPlag. J. of Universal Computer Science, 8(11), 2002.
- [6] S. Schleimer, D. S. Wilkerson, and A. Aiken. Winnowing: local algorithms for document fingerprinting. SIGMOD, 2003.
- [7] V. B. Livshits and T. Zimmermann. DynaMine: finding common error patterns by mining software revision histories. In Proc. of 13th Int. Symp. on the Foundations of Software Engineering, 2005.
- [8] C. Liu, X. Yan, and J. Han. Mining control flow abnormality for logic error isolation. In Proc. 2006 SIAM Int. Conf. on Data Mining, 2006.
- [9] C. Liu, X. Yan, H. Yu, J. Han, and P. S. Yu. Mining behavior graphs for backtrace of noncrashing bugs. In SDM, 2005.