Pattern-Growth Methods for Sequential Pattern Mining

Slides: 39
Provided by: HKUC

1
Pattern-Growth Methods for Sequential Pattern
Mining
  • Iris Zhang
  • 2003-5-14

2
Outline
  • Sequential pattern mining
  • Apriori-like methods
  • GSP
  • Pattern-growth methods
  • FreeSpan
  • PrefixSpan
  • Performance analysis
  • Conclusions

3
Motivation
  • Sequential pattern mining: finding time-related
    frequent patterns
  • Most data and applications are time-related
  • Customer shopping patterns, telephone calling
    patterns
  • Natural disasters (e.g., earthquake, hurricane)
  • Disease and treatment
  • Stock market fluctuation
  • Weblog click stream analysis
  • DNA sequence analysis

4
Concepts
  • Let I = {i1, i2, ..., in} be the set of all items
  • An itemset is a subset of items
  • A sequence is an ordered list of itemsets. The
    itemsets are called elements. The number of items
    in the sequence is its length
  • e.g. <(ef)(ab)(df)cb>
  • A sequence α = <a1 a2 ... an> is called a
    subsequence of β = <b1 b2 ... bm>, denoted α ⊑ β,
    if there exist integers 1 ≤ j1 < j2 < ... < jn ≤ m
    such that a1 ⊆ bj1, a2 ⊆ bj2, ..., an ⊆ bjn
  • e.g. <a(bc)dc> is a subsequence of
    <a(abc)(ac)d(cf)>
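
The subsequence test above can be sketched in a few lines of Python. This is a minimal illustration, not code from the presentation: the function name `is_subsequence` and the list-of-sets representation are assumptions made for the example.

```python
def is_subsequence(alpha, beta):
    """Return True if alpha is a subsequence of beta.

    Both arguments are lists of item sets (the "elements").
    Each element of alpha is matched greedily against the earliest
    remaining element of beta that contains it.
    """
    j = 0
    for a in alpha:
        # Skip elements of beta that do not contain the current element of alpha.
        while j < len(beta) and not set(a) <= set(beta[j]):
            j += 1
        if j == len(beta):
            return False
        j += 1  # the next element of alpha must match strictly later
    return True


# <a(bc)dc> against <a(abc)(ac)d(cf)>, the example from the slide
alpha = [{"a"}, {"b", "c"}, {"d"}, {"c"}]
beta = [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}]
print(is_subsequence(alpha, beta))               # True
print(is_subsequence([{"d"}, {"a"}], beta))      # False: no a occurs after d
```

Greedy earliest matching is enough here because taking the earliest possible match for each element never rules out a later match for the remaining ones.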

5
Concepts (cont)
  • A sequence database is a set of tuples <sid, s>,
    where sid is a sequence_id and s is a sequence. A
    tuple is said to contain a sequence α if α is a
    subsequence of s
  • The support of α is the number of tuples in the
    database containing α
  • If the support of α is no less than a threshold
    min_sup, α is called a sequential pattern
  • e.g. <(ab)c> is a sequential pattern given support
    threshold min_sup = 2

SID  Sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>
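
As a rough sketch of how support is counted on this example database: the helper names `parse`, `contains` and `support` are hypothetical, and sequences are written without the angle brackets so that a tiny parser can read them.

```python
def parse(text):
    """Turn slide notation such as "a(abc)(ac)d(cf)" into a list of item sets."""
    elements, i = [], 0
    while i < len(text):
        if text[i] == "(":
            close = text.index(")", i)
            elements.append(frozenset(text[i + 1:close]))
            i = close + 1
        else:
            elements.append(frozenset(text[i]))
            i += 1
    return elements


def contains(seq, pattern):
    """True if pattern is a subsequence of seq (greedy left-to-right match)."""
    j = 0
    for element in pattern:
        while j < len(seq) and not element <= seq[j]:
            j += 1
        if j == len(seq):
            return False
        j += 1
    return True


def support(db, pattern):
    """Number of sequences in db that contain pattern."""
    target = parse(pattern)
    return sum(contains(parse(s), target) for s in db.values())


DB = {10: "a(abc)(ac)d(cf)", 20: "(ad)c(bc)(ae)",
      30: "(ef)(ab)(df)cb", 40: "eg(af)cbc"}

print(support(DB, "(ab)c"))  # 2, contained in sequences 10 and 30
```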
6
Problem definition
  • Given a sequence database and min_sup threshold,
    the problem of sequential pattern mining is to
    find the complete set of sequential patterns in
    the database

7
Apriori-like methods
  • Apriori property: if a sequence S is not
    frequent, then every super-sequence of S is not
    frequent
  • e.g. if <bh> is infrequent, so are <abh> and <b(dh)>
  • GSP (Generalized Sequential Pattern) algorithm
  • Level by level, do:
  • Generate candidate sequences
  • Use the Apriori property to prune candidates
  • Scan the database to collect support counts

8
GSP Mining Process
9
Bottlenecks of Apriori-Like Methods
  • Potentially huge set of candidate sequences
  • 1,000 frequent length-1 sequences generate
    1000 × 1000 + (1000 × 999)/2 = 1,499,500
    length-2 candidates
  • Multiple scans of the database
  • Difficulties in mining long sequential patterns
  • Exponential number of short candidates
  • A length-100 sequential pattern needs
    C(100,1) + C(100,2) + ... + C(100,100) = 2^100 − 1 ≈ 10^30
    candidate sequences
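
The candidate counts behind this slide follow from a standard counting argument, sketched below. With n frequent items there are n × n candidates of the form <xy> and n(n−1)/2 of the form <(xy)>; a length-100 pattern has one candidate per non-empty subset of its 100 items. The variable names are illustrative only.

```python
from math import comb

n = 1000
# n*n ordered pairs <xy> plus n*(n-1)/2 unordered itemsets <(xy)>
length2_candidates = n * n + n * (n - 1) // 2
print(length2_candidates)  # 1499500

# Every non-empty sub-pattern of a length-100 pattern must be generated.
long_pattern_candidates = sum(comb(100, i) for i in range(1, 101))
print(long_pattern_candidates == 2**100 - 1)  # True
```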

10
Pattern-growth methods
  • A divide-and-conquer approach
  • Recursively project a sequence database into a
    set of smaller databases
  • Mine each projected database to find the subset
    of patterns
  • Algorithms
  • FreeSpan: Frequent Pattern-Projected Sequential
    Pattern Mining
  • PrefixSpan: Prefix-Projected Sequential Pattern
    Mining

11
FreeSpan
  • Example: given a sequence database S and
    min_support = 2
  • Step 1: find length-1 sequential patterns and
    list them in support-descending order
  • f_list = a:4, b:4, c:4, d:3, e:3, f:3

SID  Sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>
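
Step 1 can be reproduced with a short sketch. The function name `f_list` and the string encoding of the database are assumptions for this example; ties in support are broken alphabetically, which matches the order shown on the slide.

```python
from collections import Counter

DB = {10: "a(abc)(ac)d(cf)", 20: "(ad)c(bc)(ae)",
      30: "(ef)(ab)(df)cb", 40: "eg(af)cbc"}


def f_list(db, min_sup):
    """Frequent items with their support, in support-descending order."""
    counts = Counter()
    for seq in db.values():
        # Each sequence contributes at most 1 to an item's support.
        counts.update(set(seq) - set("()"))
    frequent = [(item, n) for item, n in counts.items() if n >= min_sup]
    return sorted(frequent, key=lambda pair: (-pair[1], pair[0]))


print(f_list(DB, 2))
# [('a', 4), ('b', 4), ('c', 4), ('d', 3), ('e', 3), ('f', 3)]
```

Item g appears only in sequence 40, so it is dropped at min_support = 2.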
12
FreeSpan (cont)
  • Step 2: divide the search space. The complete set
    of sequential patterns can be partitioned into 6
    disjoint subsets:
  • those containing only item a
  • those containing item b but no items after b in
    f_list
  • those containing item c but no items after c in
    f_list
  • those containing item d but no items after d in
    f_list
  • those containing item e but no items after e in
    f_list
  • those containing item f
  • Find the subsets of sequential patterns. They can
    be mined by constructing projected databases and
    mining each recursively

13
FreeSpan (cont)
  • Finding sequential patterns containing item b but
    no items after b in f_list
  • <b>-projected database: <a(ab)a>, <aba>, <(ab)b>,
    <ab>
  • Find all the length-2 sequential patterns
    containing item b but no items after b in f_list:
    <ab>:4, <ba>:2, <(ab)>:2
  • Further partition and mining

SID  Sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>
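
The <b>-projected database on this slide can be reproduced with the sketch below. It is a simplified reading of FreeSpan's projection, not the authors' implementation: keep only sequences containing b, and drop every item that comes after b in f_list (which also drops the infrequent g). The names `parse`, `project` and `show` are hypothetical.

```python
DB = {10: "a(abc)(ac)d(cf)", 20: "(ad)c(bc)(ae)",
      30: "(ef)(ab)(df)cb", 40: "eg(af)cbc"}
F_LIST = ["a", "b", "c", "d", "e", "f"]


def parse(text):
    """Turn slide notation such as "a(abc)(ac)d(cf)" into a list of item sets."""
    elements, i = [], 0
    while i < len(text):
        if text[i] == "(":
            close = text.index(")", i)
            elements.append(frozenset(text[i + 1:close]))
            i = close + 1
        else:
            elements.append(frozenset(text[i]))
            i += 1
    return elements


def project(db, item, flist):
    """Sequences containing item, restricted to items up to item in flist."""
    keep = set(flist[: flist.index(item) + 1])
    projected = []
    for s in db.values():
        elements = [e & keep for e in parse(s)]
        elements = [e for e in elements if e]  # drop emptied elements
        if any(item in e for e in elements):
            projected.append(elements)
    return projected


def show(seq):
    """Render a list of item sets back into slide notation."""
    parts = []
    for e in seq:
        items = "".join(sorted(e))
        parts.append(items if len(items) == 1 else "(" + items + ")")
    return "<" + "".join(parts) + ">"


print([show(s) for s in project(DB, "b", F_LIST)])
# ['<a(ab)a>', '<aba>', '<(ab)b>', '<ab>']
```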
14
From FreeSpan to PrefixSpan
  • FreeSpan
  • Projection-based: no candidate sequence needs to
    be generated
  • But projection can be performed at any point in
    the sequence, and the projected sequences may not
    shrink much. For example, the size of the
    f-projected database is the same as the original
    sequence database
  • PrefixSpan
  • Projection-based
  • But only prefix-based projection: fewer
    projections and quickly shrinking sequences

15
PrefixSpan-concepts
  • Suppose all items in an element are listed
    alphabetically.
  • Given a sequence α = <e1 e2 ... en> and
    β = <e'1 e'2 ... e'm> (m ≤ n)
  • Prefix: β is a prefix of α iff (1) e'i = ei
    (i ≤ m−1); (2) e'm ⊆ em; (3) all items in
    (em − e'm) are alphabetically after those in e'm.
  • e.g. for α = <a(abc)(ac)d(cf)>, β = <a(ab)> is a
    prefix, but <a(bc)> is not (a does not come
    alphabetically after b and c)
  • Postfix: given prefix β = <e1 e2 ... em−1 e'm>,
    the sequence γ = <e''m em+1 ... en> is called the
    postfix of α w.r.t. prefix β, where
    e''m = (em − e'm), denoted as α = β · γ
  • e.g. γ = <(_c)(ac)d(cf)> is the postfix of α w.r.t.
    prefix <a(ab)>
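
The prefix and postfix definitions translate fairly directly into code. The sketch below follows the three prefix conditions literally; `parse`, `postfix` and `show` are hypothetical helper names, and elements are assumed to be non-empty sets of single-character items.

```python
def parse(text):
    """Turn slide notation such as "a(abc)(ac)d(cf)" into a list of item sets."""
    elements, i = [], 0
    while i < len(text):
        if text[i] == "(":
            close = text.index(")", i)
            elements.append(frozenset(text[i + 1:close]))
            i = close + 1
        else:
            elements.append(frozenset(text[i]))
            i += 1
    return elements


def postfix(alpha, beta):
    """Postfix of alpha w.r.t. prefix beta, or None if beta is not a prefix."""
    m = len(beta)
    if m == 0 or m > len(alpha):
        return None
    # (1) the first m-1 elements must be identical
    if any(a != b for a, b in zip(alpha[:m - 1], beta[:m - 1])):
        return None
    last_a, last_b = alpha[m - 1], beta[m - 1]
    rest = last_a - last_b
    # (2) last element of beta is contained in the matching element of alpha,
    # (3) and the leftover items all come alphabetically after it
    if not last_b <= last_a or any(x < max(last_b) for x in rest):
        return None
    head = [("_", sorted(rest))] if rest else []
    return head + [("", sorted(e)) for e in alpha[m:]]


def show(elements):
    """Render a postfix, marking a partial first element with "_"."""
    out = ""
    for mark, items in elements:
        body = mark + "".join(items)
        out += body if len(body) == 1 else "(" + body + ")"
    return "<" + out + ">"


alpha = parse("a(abc)(ac)d(cf)")
print(show(postfix(alpha, parse("a(ab)"))))  # <(_c)(ac)d(cf)>
print(postfix(alpha, parse("a(bc)")))        # None: not a prefix
```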

16
PrefixSpan-concepts (cont)
  • Projected database: let α be a sequential pattern
    in S. The α-projected database, denoted S|α, is
    the collection of postfixes of sequences in S
    w.r.t. prefix α
  • Support count in projected database: let α be a
    sequential pattern in S and β be a sequence having
    prefix α. The support count of β in the
    α-projected database is the number of sequences γ
    in S|α such that β ⊑ α · γ

17
PrefixSpan-process
  • Step 1: find length-1 sequential patterns
  • <a>:4, <b>:4, <c>:4, <d>:3, <e>:3, <f>:3
  • Step 2: divide the search space. The complete set
    of sequential patterns can be partitioned into 6
    subsets:
  • those having prefix <a>
  • those having prefix <b>
  • ...
  • those having prefix <f>
  • Find the subsets of sequential patterns. They can
    be mined by constructing projected databases and
    mining each recursively

SID  Sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>
18
PrefixSpan-Process (cont)
  • Finding sequential patterns with prefix <a>
  • <a>-projected database: <(abc)(ac)d(cf)>,
    <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
  • Find all the length-2 sequential patterns having
    prefix <a>: <aa>:2, <ab>:4, <(ab)>:2, <ac>:4,
    <ad>:2, <af>:2
  • Further partition into 6 subsets
  • Having prefix <aa>
  • ...
  • Having prefix <af>

SID  Sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>
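
The <a>-projected database and the length-2 counts above can be checked with a small sketch. It handles only a single-item prefix and physically copies the postfixes; the helper names (`project_on_item`, `length2_supports`, `show`) are assumptions for this illustration.

```python
from collections import Counter

DB = {10: "a(abc)(ac)d(cf)", 20: "(ad)c(bc)(ae)",
      30: "(ef)(ab)(df)cb", 40: "eg(af)cbc"}


def parse(text):
    """Turn slide notation such as "a(abc)(ac)d(cf)" into a list of item sets."""
    elements, i = [], 0
    while i < len(text):
        if text[i] == "(":
            close = text.index(")", i)
            elements.append(frozenset(text[i + 1:close]))
            i = close + 1
        else:
            elements.append(frozenset(text[i]))
            i += 1
    return elements


def project_on_item(db, x):
    """Postfix of each sequence w.r.t. the first occurrence of item x."""
    projected = []
    for s in db.values():
        seq = parse(s)
        for j, e in enumerate(seq):
            if x in e:
                partial = frozenset(i for i in e if i > x)  # the "(_...)" element
                projected.append((partial, seq[j + 1:]))
                break
    return projected


def length2_supports(projected, x):
    """Support of every length-2 sequence that has prefix <x>."""
    counts = Counter()
    for partial, rest in projected:
        found = {"(" + x + i + ")" for i in partial}   # same element: <(xi)>
        for e in rest:
            found |= {x + i for i in e}                # later element: <xi>
            if x in e:                                 # x repeats: <(xi)> again
                found |= {"(" + x + i + ")" for i in e if i > x}
        counts.update(found)                           # one vote per sequence
    return counts


def show(partial, rest):
    parts = ["(_" + "".join(sorted(partial)) + ")"] if partial else []
    for e in rest:
        items = "".join(sorted(e))
        parts.append(items if len(items) == 1 else "(" + items + ")")
    return "<" + "".join(parts) + ">"


proj_a = project_on_item(DB, "a")
print([show(p, r) for p, r in proj_a])
# ['<(abc)(ac)d(cf)>', '<(_d)c(bc)(ae)>', '<(_b)(df)cb>', '<(_f)cbc>']
frequent = {p: n for p, n in length2_supports(proj_a, "a").items() if n >= 2}
print(sorted(frequent.items()))
# [('(ab)', 2), ('aa', 2), ('ab', 4), ('ac', 4), ('ad', 2), ('af', 2)]
```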
19
Completeness of PrefixSpan
SID  Sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>
  • Length-1 sequential patterns: <a>, <b>, <c>, <d>,
    <e>, <f>
  • Having prefix <a>: <a>-projected database
    <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>,
    <(_f)cbc>
  • Length-2 sequential patterns with prefix <a>:
    <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
  • Having prefix <aa>: <aa>-projected database
  • ...
  • Having prefix <af>: <af>-projected database
  • Having prefix <b>: <b>-projected database
  • Having prefix <c>, ..., <f>
20
Efficiency of PrefixSpan
  • No candidate sequence needs to be generated
  • Projected databases keep shrinking
  • Major cost of PrefixSpan: constructing projected
    databases
  • Can be improved by bi-level projections and
    pseudo-projections

21
Optimization Techniques in PrefixSpan
  • Single-level vs. bi-level projection
  • Bi-level projection with 3-way checking may
    reduce the number and size of projected databases
  • Physical projection vs. pseudo-projection
  • Pseudo-projection may reduce the effort of
    projection when the projected database fits in
    main memory

22
S-matrix for sequence database
Length-1 sequential patterns: <a>, <b>, <c>, <d>,
<e>, <f>

S-matrix
a    2
b    (4, 2, 2)  1
c    (4, 2, 1)  (3, 3, 2)  3
d    (2, 1, 1)  (2, 2, 0)  (1, 3, 0)  0
e    (1, 2, 1)  (1, 2, 0)  (1, 2, 0)  (1, 1, 0)  0
f    (2, 1, 1)  (2, 2, 0)  (1, 2, 1)  (1, 1, 1)  (2, 0, 1)  1
     a          b          c          d          e          f

  • Diagonal entry for a is 2: <aa> happens twice
  • Entry (4, 2, 1) in row c, column a: <ac> happens
    4 times, <ca> happens twice, <(ac)> happens once
  • All length-2 sequential patterns are found in the
    S-matrix
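
A brute-force way to reproduce the S-matrix entries is sketched below. It simply counts the support of <xy>, <yx> and <(xy)> for every pair of frequent items, which is not how PrefixSpan fills the matrix (it does so in one scan), but gives the same numbers. The names `parse`, `contains` and `s_matrix` are hypothetical.

```python
from itertools import combinations

DB = {10: "a(abc)(ac)d(cf)", 20: "(ad)c(bc)(ae)",
      30: "(ef)(ab)(df)cb", 40: "eg(af)cbc"}


def parse(text):
    """Turn slide notation such as "a(abc)(ac)d(cf)" into a list of item sets."""
    elements, i = [], 0
    while i < len(text):
        if text[i] == "(":
            close = text.index(")", i)
            elements.append(frozenset(text[i + 1:close]))
            i = close + 1
        else:
            elements.append(frozenset(text[i]))
            i += 1
    return elements


def contains(seq, pattern):
    """True if pattern is a subsequence of seq (greedy left-to-right match)."""
    j = 0
    for element in pattern:
        while j < len(seq) and not element <= seq[j]:
            j += 1
        if j == len(seq):
            return False
        j += 1
    return True


def s_matrix(db, items):
    """Diagonal: support of <xx>. Off-diagonal: (<xy>, <yx>, <(xy)>) for x < y."""
    seqs = [parse(s) for s in db.values()]

    def sup(pattern):
        return sum(contains(s, pattern) for s in seqs)

    matrix = {}
    for x in items:
        matrix[x, x] = sup([{x}, {x}])
    for x, y in combinations(items, 2):
        matrix[x, y] = (sup([{x}, {y}]), sup([{y}, {x}]), sup([{x, y}]))
    return matrix


M = s_matrix(DB, "abcdef")
print(M["a", "a"], M["a", "c"], M["b", "c"])  # 2 (4, 2, 1) (3, 3, 2)
```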
23
S-matrix for ltabgt-projected database
  • <ab>-projected database:
  • <(_c)(ac)d(cf)>, <(_c)(ae)>, <c>
  • Frequent items: <a>, <c>, <(_c)>
  • S-matrix

SID  Sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>

a      0
c      (1, 0, 1)  1
(_c)   (∅, 2, ∅)  (∅, 1, ∅)  ∅
       a          c          (_c)

  • No a(_c), so no count (∅)
  • (∅, 2, ∅) leads to pattern <a(bc)a>
24
Scaling-up by Bi-level Projection
  • Partition search space based on length-2
    sequential patterns
  • Only form projected databases and pursue
    recursive mining over bi-level projected databases

25
Benefits of Bi-level Projection
  • More patterns are found in each shoot
  • Far fewer projections
  • In the example, there are 53 patterns.
  • 53 level-by-level projections
  • 22 bi-level projections

26
3-way Apriori Checking
  • Using Apriori heuristic to prune items in
    projected databases

<acd> cannot be a pattern w.r.t. min_support = 2
(<cd> has support 1), so exclude d from the
<ac>-projected database

a    2
b    (4, 2, 2)  1
c    (4, 2, 1)  (3, 3, 2)  3
d    (2, 1, 1)  (2, 2, 0)  (1, 3, 0)  0
e    (1, 2, 1)  (1, 2, 0)  (1, 2, 0)  (1, 1, 0)  0
f    (2, 1, 1)  (2, 2, 0)  (1, 2, 1)  (1, 1, 1)  (2, 0, 1)  1
     a          b          c          d          e          f
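
The pruning decision on this slide can be read off the S-matrix as sketched below. This is a deliberately simplified version that covers only sequence extensions: item z is kept for the <xy>-projected database only if both <xz> and <yz> are frequent. The matrix values are copied from the slide; `sup_seq` and `keep_for_sequence_extension` are hypothetical names.

```python
# S-matrix from the slide. Key (x, y) with x < y holds
# (support of <xy>, support of <yx>, support of <(xy)>); key (x, x) holds <xx>.
S = {
    ("a", "a"): 2, ("b", "b"): 1, ("c", "c"): 3,
    ("d", "d"): 0, ("e", "e"): 0, ("f", "f"): 1,
    ("a", "b"): (4, 2, 2), ("a", "c"): (4, 2, 1), ("b", "c"): (3, 3, 2),
    ("a", "d"): (2, 1, 1), ("b", "d"): (2, 2, 0), ("c", "d"): (1, 3, 0),
    ("a", "e"): (1, 2, 1), ("b", "e"): (1, 2, 0), ("c", "e"): (1, 2, 0),
    ("d", "e"): (1, 1, 0),
    ("a", "f"): (2, 1, 1), ("b", "f"): (2, 2, 0), ("c", "f"): (1, 2, 1),
    ("d", "f"): (1, 1, 1), ("e", "f"): (2, 0, 1),
}


def sup_seq(x, y):
    """Support of the length-2 sequence <xy>, looked up in the S-matrix."""
    if x == y:
        return S[x, x]
    if (x, y) in S:
        return S[x, y][0]
    return S[y, x][1]


def keep_for_sequence_extension(x, y, items, min_sup):
    """Items z for which <xyz> is not ruled out by the Apriori property."""
    return [z for z in items
            if sup_seq(x, z) >= min_sup and sup_seq(y, z) >= min_sup]


kept = keep_for_sequence_extension("a", "c", "abcdef", 2)
print(kept)  # ['a', 'b', 'c']; d is excluded because <cd> has support 1
```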
27
Pseudo-projection
  • Major cost of PrefixSpan: projection
  • Postfixes of sequences often appear repeatedly in
    recursive projected databases
  • When the projected database fits in memory, use
    pointers to form projections
  • Pointer to the sequence
  • Offset of the postfix
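
A compact sketch of PrefixSpan with pseudo-projection follows. It is one possible in-memory reading of the idea, not the authors' implementation: each projected database is a list of (sequence id, offset) pairs instead of copied postfixes, where the offset is the index of the element that matched the last element of the prefix. The function name `prefixspan` and the pattern encoding (tuples of frozensets) are assumptions.

```python
from collections import defaultdict

DB = {10: "a(abc)(ac)d(cf)", 20: "(ad)c(bc)(ae)",
      30: "(ef)(ab)(df)cb", 40: "eg(af)cbc"}


def parse(text):
    """Turn slide notation such as "a(abc)(ac)d(cf)" into a list of item sets."""
    elements, i = [], 0
    while i < len(text):
        if text[i] == "(":
            close = text.index(")", i)
            elements.append(frozenset(text[i + 1:close]))
            i = close + 1
        else:
            elements.append(frozenset(text[i]))
            i += 1
    return elements


def prefixspan(db, min_sup):
    """Return {pattern: support}; a pattern is a tuple of frozensets."""
    seqs = {sid: parse(s) for sid, s in db.items()}
    patterns = {}

    def grow(prefix, pointers):
        last = prefix[-1] if prefix else frozenset()
        s_ext = defaultdict(dict)  # item -> {sid: offset}, item starts a new element
        i_ext = defaultdict(dict)  # item -> {sid: offset}, item joins the last element
        for sid, pos in pointers:
            seq = seqs[sid]
            for j in range(pos + 1, len(seq)):
                for x in seq[j]:
                    s_ext[x].setdefault(sid, j)  # keep the earliest offset
            if prefix:
                for j in range(pos, len(seq)):
                    if last <= seq[j]:
                        for x in seq[j]:
                            if x > max(last):  # items kept in alphabetical order
                                i_ext[x].setdefault(sid, j)
        for ext, joins_last in ((i_ext, True), (s_ext, False)):
            for x in sorted(ext):
                hits = ext[x]
                if len(hits) < min_sup:
                    continue
                if joins_last:
                    new_prefix = prefix[:-1] + (last | {x},)
                else:
                    new_prefix = prefix + (frozenset(x),)
                patterns[new_prefix] = len(hits)
                # The new "projected database" is just the list of pointers.
                grow(new_prefix, sorted(hits.items()))

    grow((), [(sid, -1) for sid in seqs])
    return patterns


result = prefixspan(DB, 2)
print(len(result))                     # 53
print(result[tuple(parse("a(bc)a"))])  # 2
```

No postfix is ever copied: the recursion only passes along (sid, offset) pairs, which is the saving that pseudo-projection aims for when everything fits in main memory.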

28
Pseudo-Projection vs. Physical Projection
  • Pseudo-projection avoids physically copying
    postfixes
  • Efficient when database fits in main memory
  • Not efficient when database cannot fit in main
    memory
  • Disk-based random accessing is very costly
  • Suggested Approach
  • Integration of physical and pseudo-projection
  • Swapping to pseudo-projection when the data set
    fits in memory

29
Experiments
  • Synthetic datasets were generated using the
    procedure described in R. Agrawal and R. Srikant,
    "Mining sequential patterns", ICDE'95
  • Number of items: 1,000
  • Number of sequences in the data set: 10,000
  • Average number of items within elements: 8
  • Average number of elements in a sequence: 8

30
Experiments (cont)
  • Comparing PrefixSpan with GSP and FreeSpan in
    large databases
  • GSP (IBM Almaden, Srikant & Agrawal, EDBT'96)
  • FreeSpan (J. Han, J. Pei, B. Mortazavi-Asl, Q.
    Chen, U. Dayal, M.-C. Hsu, KDD'00)
  • PrefixSpan-1 (single-level projection)
  • PrefixSpan-2 (bi-level projection)
  • Comparing effects of pseudo-projection
  • Comparing I/O cost and scalability

31
PrefixSpan Is Faster Than GSP and FreeSpan
32
Effect of Pseudo-Projection When the Projected
Database Fits in Memory
33
I/O Cost When It Cannot Fit in Memory
34
Scalability (When DB Is Large)
min_sup = 0.2
35
Conclusions
  • Both PrefixSpan and FreeSpan are pattern-growth
    methods, which perform better than Apriori-like
    methods on the sequential pattern mining problem
  • PrefixSpan is more elegant than FreeSpan
  • Apriori heuristic is integrated into bi-level
    projection in PrefixSpan
  • Pseudo-projection substantially enhances the
    performance of the memory-based processing

36
References
  • J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U.
    Dayal, and M.-C. Hsu. FreeSpan: Frequent
    pattern-projected sequential pattern mining.
    KDD'00, pages 355-359.
  • J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q.
    Chen, U. Dayal, and M.-C. Hsu. PrefixSpan:
    Mining sequential patterns efficiently by
    prefix-projected pattern growth. ICDE'01, pages
    215-224.
  • R. Srikant and R. Agrawal. Mining sequential
    patterns: Generalizations and performance
    improvements. EDBT'96, pages 3-17.

37
Q&A
38
Thanks