GPLAG: Detection of Software Plagiarism by Program Dependence Graph Analysis

1
GPLAG: Detection of Software Plagiarism by
Program Dependence Graph Analysis
  • Chao Liu, Chen Chen,
  • Jiawei Han, Philip S. Yu
  • University of Illinois at Urbana-Champaign
  • IBM T.J. Watson Research Center
  • Presented by Chao Liu

2
Motivations
  • Blossoming of open-source projects
    • SourceForge.net: 125,090 projects as of July 2006
  • Convenience for software plagiarism?
    • You can always find something online
  • Core-part plagiarism
    • Stripping away GUIs and irrelevant parts
    • (Illegally) reusing the implementations of core algorithms
  • Our goal
    • Efficient detection of core-part plagiarism

3
Challenges
  • Effectiveness
    • Professional plagiarists
    • Automated plagiarism
  • Efficiency
    • Only a small part of the code is plagiarized; how can it be
      detected efficiently?

4
Outline
  • Plagiarism Disguises
  • Review of Plagiarism Detection
  • GPLAG: PDG-based Plagiarism Detection
  • Efficiency and Scalability
  • Experiments
  • Conclusions

5
Original Program
A procedure from the program join
01 static void
02 make_blank (struct line *blank, int count)
03 {
04   int i;
05   unsigned char *buffer;
06   struct field *fields;
07   blank->nfields = count;
08   blank->buf.size = blank->buf.length = count + 1;
09   blank->buf.buffer = (char *) xmalloc (blank->buf.size);
10   buffer = (unsigned char *) blank->buf.buffer;
11   blank->fields = fields = (struct field *) xmalloc (sizeof (struct field) * count);
12   for (i = 0; i < count; i++) {
13     ...
14   }
15 }
6
Disguise 1 Format Alteration
Insert comments and blanks
01 static void
02 make_blank (struct line *blank, int count)
03 {
04   int i;
05   unsigned char *buffer;
06   struct field *fields;
07   blank->nfields = count; // initialization
08   blank->buf.size = blank->buf.length = count + 1;
09   blank->buf.buffer = (char *) xmalloc (blank->buf.size);
10   buffer = (unsigned char *) blank->buf.buffer;
11   blank->fields = fields = (struct field *) xmalloc (sizeof (struct field) * count);
12   for (i = 0; i < count; i++) {
13     ...
14   }
15 }
7
Disguise 2 Identifier Renaming
Rename variables consistently
01 static void
02 fill_content (struct line *fill, int num)
03 {
04   int i;
05   unsigned char *buffer;
06   struct field *fields;
07   fill->nfields = num; // initialization
08   fill->buf.size = fill->buf.length = num + 1;
09   fill->buf.buffer = (char *) xmalloc (fill->buf.size);
10   buffer = (unsigned char *) fill->buf.buffer;
11   fill->fields = fields = (struct field *) xmalloc (sizeof (struct field) * num);
12   for (i = 0; i < num; i++) {
13     ...
14   }
15 }
8
Disguise 3 Statement Reordering
Reorder non-dependent statements
01 static void
02 fill_content (struct line *fill, int num)
03 {
04   int i;
05   unsigned char *buffer;
06   struct field *fields;
11   fill->fields = fields = (struct field *) xmalloc (sizeof (struct field) * num);
08   fill->buf.size = fill->buf.length = num + 1;
09   fill->buf.buffer = (char *) xmalloc (fill->buf.size);
10   buffer = (unsigned char *) fill->buf.buffer;
07   fill->nfields = num; // initialization
12   for (i = 0; i < num; i++) {
13     ...
14   }
15 }
9
Disguise 4 Control Replacement
Use equivalent control structure
01 static void
02 fill_content (struct line *fill, int num)
03 {
04   int i;
05   unsigned char *buffer;
06   struct field *fields;
11   fill->fields = fields = (struct field *) xmalloc (sizeof (struct field) * num);
08   fill->buf.size = fill->buf.length = num + 1;
09   fill->buf.buffer = (char *) xmalloc (fill->buf.size);
10   buffer = (unsigned char *) fill->buf.buffer;
07   fill->nfields = num; // initialization
12   i = 0;
13   while (i < num) {
14     ...
15     i++;
16   }
17 }

10
Disguise 5 Code Insertion
Insert immaterial code
01 static void
02 fill_content (struct line *fill, int num)
03 {
04   int i;
05   unsigned char *buffer;
06   struct field *fields;
11   fill->fields = fields = (struct field *) xmalloc (sizeof (struct field) * num);
08   fill->buf.size = fill->buf.length = num + 1;
09   fill->buf.buffer = (char *) xmalloc (fill->buf.size);
10   buffer = (unsigned char *) fill->buf.buffer;
07   fill->nfields = num; // initialization
12   i = 0;
13   while (i < num) {
14     ... for (int j = 0; j < i; j++);
15     i++;
16   }
17 }

11
Fully Disguised
12
Outline
  • Plagiarism Disguises
  • Review of Plagiarism Detection
  • GPLAG: PDG-based Plagiarism Detection
  • Efficiency and Scalability
  • Experiments
  • Conclusions

13
Review of Plagiarism Detection
  • String-based [Baker 1995]
    • A program is represented as a string
    • Blanks and comments are ignored
  • AST-based [Baxter et al. 1998, Kontogiannis et al. 1995]
    • A program is represented as an Abstract Syntax Tree (AST)
    • Fragile to statement reordering, control replacement and code
      insertion
  • Token-based [Kamiya et al. 2002, Prechelt et al. 2002]
    • Variables of the same type are mapped to the same token
    • A program is represented as a token string
    • Fingerprints of token strings are used for robustness
      [Schleimer et al. 2003]
    • Partially robust to statement reordering, control replacement
      and code insertion
    • Representatives: Moss and JPlag
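
To make the token-based idea concrete, here is a minimal Python sketch. It is not how Moss or JPlag work internally: the tokenizer, the tiny keyword set and the ID/NUM token names are illustrative assumptions, and every identifier is mapped to a single token rather than one token per type. It shows why consistent renaming is invisible to a token string while reordering two independent statements is not.

```python
import re

# Assumed, deliberately tiny keyword set for the C fragments used below.
KEYWORDS = {"int", "char", "unsigned", "struct", "for", "while", "return"}
# Identifiers, integer literals, the operators "->" and "++", then any
# other single punctuation character.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|->|\+\+|[^\s\w]")

def token_string(code):
    """Map identifiers to ID and numbers to NUM; keep keywords and operators."""
    out = []
    for tok in TOKEN_RE.findall(code):
        if tok in KEYWORDS:
            out.append(tok)
        elif re.match(r"[A-Za-z_]", tok):
            out.append("ID")
        elif tok.isdigit():
            out.append("NUM")
        else:
            out.append(tok)
    return " ".join(out)

original = "blank->nfields = count; buffer = blank->buf.buffer;"
renamed = "fill->nfields = num; buffer = fill->buf.buffer;"
reordered = "buffer = fill->buf.buffer; fill->nfields = num;"

# Renaming does not change the token string; reordering does.
same_after_renaming = token_string(original) == token_string(renamed)
same_after_reordering = token_string(original) == token_string(reordered)
```

Fingerprinting substrings of the token string, as in the winnowing approach cited above, is what gives token-based tools their partial robustness to such local changes.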

14
Outline
  • Plagiarism Disguises
  • Review of Plagiarism Detection
  • GPLAG: PDG-based Plagiarism Detection
  • Efficiency and Scalability
  • Experiments
  • Conclusions

15
Graphic representation of source code
int sum(int array[], int count) {
  int i, sum;
  sum = 0;
  for (i = 0; i < count; i++) { sum = add(sum, array[i]); }
  return sum;
}

int add(int a, int b) { return a + b; }
16
Graphic representation of source code

int sum(int array[], int count) {
  int i, sum;
  sum = 0;
  for (i = 0; i < count; i++) { sum = add(sum, array[i]); }
  return sum;
}

int add(int a, int b) { return a + b; }
17
Control Dependency
int sum(int array[], int count) {
  int i, sum;
  sum = 0;
  for (i = 0; i < count; i++) { sum = add(sum, array[i]); }
  return sum;
}

int add(int a, int b) { return a + b; }
18
Data Dependency
int sum(int array[], int count) {
  int i, sum;
  sum = 0;
  for (i = 0; i < count; i++) { sum = add(sum, array[i]); }
  return sum;
}

int add(int a, int b) { return a + b; }
19
Plagiarism Detectable?
20
Corresponding PDGs
PDG for the Original Code
PDG for the Plagiarized Code
21
PDG-based Plagiarism Detection
  • A program is represented as a set of PDGs
    • Let g be a PDG of procedure P in the original program
    • Let g' be a PDG of procedure P' in the plagiarism suspect
  • Subgraph isomorphism implies plagiarism
    • If g is subgraph isomorphic to g', P' is likely plagiarized
      from P
  • γ-isomorphism: graph g is γ-isomorphic to g' if there exists a
    subgraph s of g such that s is subgraph isomorphic to g' and
    |s| ≥ γ|g|.
  • If g is γ-isomorphic to g', the PDG pair (g, g') is regarded as a
    plagiarized PDG pair and is then returned to a human for
    examination.
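
The γ-isomorphism test can be sketched by brute force on very small graphs. The Python below is an illustration under stated assumptions, not GPLAG's implementation (which relies on VFLib): a PDG is a dict of vertex kinds plus a set of (source, target, edge_type) triples, graph size is the vertex count, "subgraph of g" means the subgraph induced by a vertex subset, and the match only has to preserve vertex kinds and edges. The names max_mappable and gamma_isomorphic and the default gamma=0.9 are hypothetical.

```python
def max_mappable(g_nodes, g_edges, h_nodes, h_edges):
    """Size of the largest vertex subset S of g whose induced subgraph maps
    into h by an injective, kind-preserving, edge-preserving mapping.
    Exhaustive backtracking: suitable for tiny graphs only."""
    order = list(g_nodes)
    best = 0

    def consistent(v, w, mapping):
        # Every g edge touching v with both ends mapped must also exist in h.
        trial = dict(mapping)
        trial[v] = w
        for (a, b, t) in g_edges:
            if v in (a, b) and a in trial and b in trial:
                if (trial[a], trial[b], t) not in h_edges:
                    return False
        return True

    def search(i, mapping, used):
        nonlocal best
        if len(mapping) + (len(order) - i) <= best:
            return  # cannot beat the best subset found so far
        if i == len(order):
            best = len(mapping)
            return
        v = order[i]
        for w in h_nodes:
            if w not in used and h_nodes[w] == g_nodes[v] and consistent(v, w, mapping):
                mapping[v] = w
                used.add(w)
                search(i + 1, mapping, used)
                del mapping[v]
                used.remove(w)
        search(i + 1, mapping, used)  # leave v out of S

    search(0, {}, set())
    return best

def gamma_isomorphic(g_nodes, g_edges, h_nodes, h_edges, gamma=0.9):
    """True if g is gamma-isomorphic to h under the assumptions above."""
    return max_mappable(g_nodes, g_edges, h_nodes, h_edges) >= gamma * len(g_nodes)

# Toy original PDG g and a suspect PDG with one inserted vertex "d".
g_nodes = {1: "assign", 2: "assign", 3: "control"}
g_edges = {(1, 2, "data"), (3, 2, "control")}
s_nodes = {"a": "assign", "b": "assign", "c": "control", "d": "decl"}
s_edges = {("a", "b", "data"), ("c", "b", "control"), ("d", "b", "data")}

flagged = gamma_isomorphic(g_nodes, g_edges, s_nodes, s_edges)
```

In the toy example the inserted vertex does not stop g from embedding fully into the suspect graph, so the pair is flagged; a relaxed γ below 1 additionally tolerates a few vertices of g that no longer match.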

22
Advantages
  • Robust because it is hard to overhaul PDGs
  • Dependencies encode program logic
  • Incentive of plagiarism
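
Why dependencies survive disguises can be seen on the statement-reordering example. The following Python sketch assumes a simplified model: straight-line code only, hand-annotated def/use sets for statements 07 to 11 of make_blank, and flow (def-use) dependences only, ignoring aliasing, anti-dependences and output dependences. The function name and variable labels are hypothetical.

```python
def flow_dependences(statements):
    """statements: list of (label, defs, uses) in program order.
    Returns the set of (defining_label, using_label, variable) edges."""
    last_def = {}
    edges = set()
    for label, defs, uses in statements:
        for var in uses:
            if var in last_def:
                edges.add((last_def[var], label, var))
        for var in defs:
            last_def[var] = label
    return edges

# Hand-annotated (assumed) def/use sets for lines 07-11 of make_blank.
s07 = ("07", {"nfields"}, {"count"})
s08 = ("08", {"size", "length"}, {"count"})
s09 = ("09", {"buf"}, {"size"})
s10 = ("10", {"buffer"}, {"buf"})
s11 = ("11", {"fields"}, {"count"})

original_order = [s07, s08, s09, s10, s11]
reordered = [s11, s08, s09, s10, s07]  # the order used in Disguise 3

edges_original = flow_dependences(original_order)
edges_reordered = flow_dependences(reordered)
unchanged = edges_original == edges_reordered
```

Both orders yield the same two edges (08 to 09 through size, 09 to 10 through buf), because only statements with no dependence between them were moved.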

23
Outline
  • Plagiarism Disguises
  • Review of Plagiarism Detection
  • GPLAG: PDG-based Plagiarism Detection
  • Efficiency and Scalability
  • Experiments
  • Conclusions

24
Efficiency and Scalability
  • Search space
    • If the original program has n procedures and the plagiarism
      suspect has m procedures
    • n × m subgraph isomorphism tests
  • Pruning the search space
    • Lossless filter
    • Statistical lossy filter

25
Lossless filter
  • Interestingness
    • PDGs smaller than an interesting size K are excluded from both
      sides
  • γ-isomorphism definition
    • A PDG pair (g, g') is discarded if |g'| < γ|g|.
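
A minimal Python sketch of the two lossless rules, assuming PDG size is measured as a vertex count and that the sizes are already known; the function name and the example values of K and γ are hypothetical. The second rule loses nothing because a subgraph of the original PDG g with at least γ|g| vertices cannot be embedded in a suspect PDG g' that has fewer than γ|g| vertices.

```python
def lossless_filter(orig_sizes, suspect_sizes, K, gamma):
    """Return the (i, j) index pairs that survive both lossless rules."""
    pairs = []
    for i, g_size in enumerate(orig_sizes):
        if g_size < K:
            continue  # original PDG too small to be interesting
        for j, s_size in enumerate(suspect_sizes):
            if s_size < K:
                continue  # suspect PDG too small to be interesting
            if s_size < gamma * g_size:
                continue  # suspect too small to hold gamma * |g| vertices
            pairs.append((i, j))
    return pairs

# 3 x 3 = 9 candidate pairs shrink to 3 before any isomorphism test runs.
surviving = lossless_filter([5, 20, 40], [8, 19, 50], K=10, gamma=0.9)
```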

26
Lossy Filter
  • Observation
    • If procedure P' is plagiarized from procedure P, its PDG g'
      should look similar to g.
    • So discard dissimilar PDG pairs
  • Requirement
    • This filter must be lightweight

27
Vertex Histogram
  • Represent PDG g by
    • h(g) = (n1, n2, ..., nk),
    • where ni is the frequency of the i-th kind of vertex.
  • Similarly, represent PDG g' by
    • h(g') = (m1, m2, ..., mk).
  • Direct similarity measurement?
    • How to define a proper similarity threshold?
    • Is a threshold defined this way program-independent?
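
A vertex histogram is just a count per vertex kind in a fixed order. The Python sketch below uses a hypothetical set of five kinds and a hand-labeled vertex list for the sum() procedure; neither is CodeSurfer output.

```python
from collections import Counter

# Assumed vertex kinds, in a fixed order so histograms are comparable.
KINDS = ("declaration", "assignment", "control", "call", "return")

def vertex_histogram(vertex_kinds, kinds=KINDS):
    """Return h(g) = (n1, ..., nk): how often each kind occurs in the PDG."""
    counts = Counter(vertex_kinds)
    return tuple(counts[k] for k in kinds)

# Hand-labeled vertices for sum(): two declarations, sum = 0, i = 0,
# the loop test, i++, the call to add, and the return.
sum_vertices = [
    "declaration", "declaration", "assignment", "assignment",
    "control", "assignment", "call", "return",
]
h_sum = vertex_histogram(sum_vertices)
```

Computing the histogram needs one pass over the vertices, which is what makes it cheap enough to serve as a pre-filter.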

28
Hypothesis Testing-based Approach
  • Basic idea
    • Estimate a k-dimensional multinomial distribution from h(g)
    • Test whether h(g') is likely an observation from that
      distribution
    • If it is, g' looks similar to g, and an isomorphism test is
      needed.
    • Otherwise, (g, g') is discarded
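
The transcript does not spell out the test statistic, so the Python sketch below assumes one standard choice: a likelihood-ratio (G) test of the suspect histogram against a multinomial estimated from the original histogram. The add-one smoothing, the function names and the caller-supplied critical value are all assumptions; 5.991 is the 5% chi-square critical value for 2 degrees of freedom (k - 1 with k = 3 vertex kinds).

```python
import math

def g_statistic(h_orig, h_suspect, smoothing=1.0):
    """G = 2 * sum(m_i * ln(m_i / (M * theta_i))), with theta estimated from
    h_orig (add-one smoothed so no theta_i is zero) and M = sum(h_suspect)."""
    k = len(h_orig)
    total = sum(h_orig) + smoothing * k
    theta = [(n + smoothing) / total for n in h_orig]
    m_total = sum(h_suspect)
    stat = 0.0
    for m, t in zip(h_suspect, theta):
        if m > 0:  # a zero count contributes nothing to the sum
            stat += m * math.log(m / (m_total * t))
    return 2.0 * stat

def looks_similar(h_orig, h_suspect, critical_value):
    """Keep the pair for isomorphism testing unless the test rejects."""
    return g_statistic(h_orig, h_suspect) <= critical_value

CRITICAL_5PCT_DF2 = 5.991

kept = looks_similar((10, 5, 5), (20, 10, 10), CRITICAL_5PCT_DF2)
dropped = not looks_similar((10, 5, 5), (1, 1, 30), CRITICAL_5PCT_DF2)
```

A significance level plays the role of the similarity threshold here, which sidesteps the question of choosing a program-dependent distance cutoff.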

29
Technical Details
30
Technical Details (contd)
31
Work-flow of GPLAG
  • PDGs are generated with CodeSurfer
  • Isomorphism testing is implemented with VFLib.

32
Outline
  • Plagiarism Disguises
  • Review of Plagiarism Detection
  • GPLAG: PDG-based Plagiarism Detection
  • Efficiency and Scalability
  • Experiments
  • Conclusions

33
Experiment Design
  • Subject programs
  • Effectiveness
  • Filter efficiency
  • Core-part plagiarism detection

34
Effectiveness
  • The manual plagiarism took 2 hours; can it be automated?
  • GPLAG detects all plagiarized PDG pairs within 1
    second
  • PDG isomorphism also reveals what plagiarism
    disguises are applied

35
Efficiency
  • Subject programs
    • bc, less and tar.
    • Exact copy as plagiarism.
  • Lossless and lossy filter
    • Pruning PDG pairs.
    • Implications for overall time cost.

36
Pruning Uninteresting PDG-pairs
  • Lossless only
  • Lossless and lossy

37
Implications for Overall Time Cost
  • Subgraph isomorphism tests that hit the time-out are the time
    hogs.
  • Lossless filter does not save much time.
  • Lossy filter significantly reduces the time cost.
  • Major time saving comes from the avoidance of
    time hogs.

38
Detection of Core-part Plagiarism
  • Lower time cost with lossy filter.
  • Fewer false positives with lossy filter.

39
Outline
  • Plagiarism Disguises
  • Review of Plagiarism Detection
  • GPLAG: PDG-based Plagiarism Detection
  • Efficiency and Scalability
  • Experiments
  • Conclusions

40
Conclusions
  • We developed a new algorithm, GPLAG, for software
    plagiarism detection
  • It is more effective against professional plagiarists
  • We developed a statistical lossy filter, which
    improves the efficiency of GPLAG
  • We experimentally verified the effectiveness and
    efficiency of GPLAG

41
Q & A
Thank You!
42
References
  • [1] B. S. Baker. On finding duplication and near-duplication
    in large software systems. In Proc. of 2nd Working Conf. on
    Reverse Engineering, 1995.
  • [2] I. D. Baxter, A. Yahin, L. Moura, M. Sant'Anna, and
    L. Bier. Clone detection using abstract syntax trees. In Proc.
    of Int. Conf. on Software Maintenance, 1998.
  • [3] K. Kontogiannis, M. Galler, and R. DeMori. Detecting code
    similarity using patterns. In Working Notes of 3rd Workshop on
    AI and Software Engineering, 1995.
  • [4] T. Kamiya, S. Kusumoto, and K. Inoue. CCFinder: A
    multilinguistic token-based code clone detection system for
    large scale source code. IEEE Trans. Softw. Eng., 28(7), 2002.
  • [5] L. Prechelt, G. Malpohl, and M. Philippsen. Finding
    plagiarisms among a set of programs with JPlag. J. of Universal
    Computer Science, 8(11), 2002.
  • [6] S. Schleimer, D. S. Wilkerson, and A. Aiken. Winnowing:
    Local algorithms for document fingerprinting. In SIGMOD, 2003.
  • [7] V. B. Livshits and T. Zimmermann. DynaMine: Finding common
    error patterns by mining software revision histories. In Proc.
    of 13th Int. Symp. on the Foundations of Software Engineering,
    2005.
  • [8] C. Liu, X. Yan, and J. Han. Mining control flow abnormality
    for logic error isolation. In Proc. 2006 SIAM Int. Conf. on
    Data Mining, 2006.
  • [9] C. Liu, X. Yan, H. Yu, J. Han, and P. S. Yu. Mining
    behavior graphs for backtrace of noncrashing bugs. In SDM, 2005.