Pattern-Growth Methods for Sequential Pattern Mining - PowerPoint PPT Presentation

Slides: 39

Provided by: HKUC

Transcript and Presenter's Notes

Title: Pattern-Growth Methods for Sequential Pattern Mining

1
Pattern-Growth Methods for Sequential Pattern
Mining

Iris Zhang
2003-5-14

2
Outline

Sequential pattern mining
Apriori-like methods
GSP
Pattern-growth methods
FreeSpan
PrefixSpan
Performance analysis
Conclusions

3
Motivation

Sequential pattern mining: finding time-related
frequent patterns
Most data and applications are time-related
Customer shopping patterns, telephone calling
patterns
Natural disasters (e.g., earthquake, hurricane)
Disease and treatment
Stock market fluctuation
Weblog click stream analysis
DNA sequence analysis

4
Concepts

Let I = {i1, i2, ..., in} be the set of all items
An itemset is a subset of I
A sequence is an ordered list of itemsets. The itemsets
are called elements. The number of items in the
sequence is its length
e.g. <(ef)(ab)(df)cb>
A sequence α = <a1 a2 ... an> is called a subsequence of
β = <b1 b2 ... bm>, denoted α ⊆ β, if there exist integers
1 ≤ j1 < j2 < ... < jn ≤ m such that a1 ⊆ bj1,
a2 ⊆ bj2, ..., an ⊆ bjn
e.g. <a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>
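
The subsequence test above can be sketched in a few lines of Python. This is only an illustration, not code from the presentation: the function name `is_subsequence` and the encoding (a sequence is a list of elements, each element a string of its items) are assumptions made for the example.

```python
def is_subsequence(alpha, beta):
    """Return True if alpha is a subsequence of beta.

    Each sequence is a list of elements; each element is a string of items.
    Every element of alpha is greedily matched to the earliest remaining
    element of beta that contains all of its items, preserving order.
    """
    j = 0
    for a in alpha:
        # Advance to the first remaining element of beta that contains a.
        while j < len(beta) and not set(a) <= set(beta[j]):
            j += 1
        if j == len(beta):
            return False
        j += 1  # the next element of alpha must match strictly later
    return True


alpha = ["a", "bc", "d", "c"]          # <a(bc)dc>
beta = ["a", "abc", "ac", "d", "cf"]   # <a(abc)(ac)d(cf)>
print(is_subsequence(alpha, beta))     # True
```

Greedy earliest matching is sufficient here: if any valid choice of j1 < j2 < ... < jn exists, the earliest one also works.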

5
Concepts (cont)

A sequence database is a set of tuples <sid, s>, where sid
is a sequence_id and s is a sequence. A tuple is
said to contain a sequence α if α is a
subsequence of s
The support of α is the number of tuples in the
database containing α
If the support of α is no less than a threshold min_sup, α
is called a sequential pattern
e.g. <(ab)c> is a sequential pattern given support
threshold min_sup = 2

SID  Sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>
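
As a rough sketch of the support definition, the snippet below counts how many tuples of the example database contain a pattern. The names `DB`, `contains` and `support`, and the list-of-strings encoding of sequences, are assumptions made for illustration.

```python
# Example database from the slide: sid -> sequence (list of elements).
DB = {
    10: ["a", "abc", "ac", "d", "cf"],
    20: ["ad", "c", "bc", "ae"],
    30: ["ef", "ab", "df", "c", "b"],
    40: ["e", "g", "af", "c", "b", "c"],
}


def contains(seq, pattern):
    """True if pattern is a subsequence of seq (greedy earliest matching)."""
    j = 0
    for p in pattern:
        while j < len(seq) and not set(p) <= set(seq[j]):
            j += 1
        if j == len(seq):
            return False
        j += 1
    return True


def support(pattern, db):
    """Number of tuples in db whose sequence contains pattern."""
    return sum(contains(seq, pattern) for seq in db.values())


print(support(["ab", "c"], DB))  # <(ab)c> is contained in sequences 10 and 30 -> 2
```

With min_sup = 2, a support of 2 is enough for <(ab)c> to qualify as a sequential pattern.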
6
Problem definition

Given a sequence database and min_sup threshold,
the problem of sequential pattern mining is to
find the complete set of sequential patterns in
the database

7
Apriori-like methods

Apriori property: if a sequence S is not
frequent, then every super-sequence of S is not
frequent
e.g. <bh> is infrequent, so <abh> and <b(dh)> are infrequent too
GSP (Generalized Sequential Pattern) algorithm
Level by level, do:
Generate candidate sequences
Use the Apriori property to prune candidates
Scan the database to collect support counts

8
GSP Mining Process
9
Bottlenecks of Apriori-Like Methods

Potentially huge set of candidate sequences
1,000 frequent length-1 sequences generate
1000 × 1000 + (1000 × 999)/2 = 1,499,500 length-2 candidates
Multiple scans of the database
Difficulties in mining long sequential patterns
Exponential number of short candidates
A length-100 sequential pattern needs about 2^100 ≈ 10^30 candidate
sequences
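
The size of the candidate sets can be checked with a little arithmetic. The sketch below assumes the usual GSP counting argument: n frequent items give n × n length-2 candidates of the form <xy> (including <xx>) plus n(n-1)/2 candidates of the form <(xy)>, and a length-100 pattern has one sub-pattern for every non-empty subset of its 100 items. The function name is hypothetical.

```python
from math import comb


def length2_candidates(n):
    """Length-2 candidates from n frequent items: <xy> pairs plus <(xy)> itemsets."""
    return n * n + n * (n - 1) // 2


print(length2_candidates(1000))  # 1499500

# Sub-patterns of a length-100 pattern: sum over i of C(100, i) = 2^100 - 1.
total = sum(comb(100, i) for i in range(1, 101))
print(total == 2**100 - 1)       # True
print(f"{total:.2e}")            # about 1.27e+30
```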

10
Pattern-growth methods

A divide-and-conquer approach
Recursively project a sequence database into a
set of smaller databases
Mine each projected database to find the subset
of patterns
Algorithms
FreeSpan: Frequent Pattern-Projected Sequential
Pattern Mining
PrefixSpan: Prefix-Projected Sequential Pattern
Mining

11
FreeSpan

Example: given a sequence database S and
min_support = 2
Step 1: find length-1 sequential patterns and
list them in support-descending order
f_list = a:4, b:4, c:4, d:3, e:3, f:3

SID  Sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>
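
Step 1 can be sketched as a single counting pass. The helper name `f_list` and the alphabetical tie-break among items with equal support are assumptions for this example; the slide only states that the list is in support-descending order.

```python
from collections import Counter

DB = [
    ["a", "abc", "ac", "d", "cf"],      # 10
    ["ad", "c", "bc", "ae"],            # 20
    ["ef", "ab", "df", "c", "b"],       # 30
    ["e", "g", "af", "c", "b", "c"],    # 40
]


def f_list(db, min_sup):
    """Frequent items with their supports, in support-descending order."""
    counts = Counter()
    for seq in db:
        # Count each item at most once per sequence.
        counts.update(set().union(*map(set, seq)))
    frequent = [(item, n) for item, n in counts.items() if n >= min_sup]
    # Assumed tie-break: alphabetical order among equal supports.
    return sorted(frequent, key=lambda pair: (-pair[1], pair[0]))


print(f_list(DB, 2))
# [('a', 4), ('b', 4), ('c', 4), ('d', 3), ('e', 3), ('f', 3)]
```

Item g occurs in only one sequence, so it is dropped at min_support = 2.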
12
FreeSpan (cont)

Step 2: divide the search space. The complete set of
sequential patterns can be partitioned into 6 disjoint
subsets:
those containing only item a
those containing item b but no items after b in
f_list
those containing item c but no items after c in
f_list
those containing item d but no items after d in
f_list
those containing item e but no items after e in
f_list
those containing item f
Find the subsets of sequential patterns: they can be
mined by constructing projected databases and
mining each recursively

13
FreeSpan (cont)

Finding sequential patterns containing item b but no
items after b in f_list
<b>-projected database: <a(ab)a>, <aba>, <(ab)b>,
<ab>
Find all the length-2 sequential patterns containing item b
but no items after b in f_list: <ab>:4, <ba>:2,
<(ab)>:2
Further partition and mining

SID  Sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>
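
The <b>-projected database in this example can be reproduced by keeping, in every sequence that contains b, only the items up to and including b in f_list. The sketch below is a simplified reading of that step, not the FreeSpan implementation; `freespan_project` is a hypothetical name.

```python
DB = [
    ["a", "abc", "ac", "d", "cf"],      # 10
    ["ad", "c", "bc", "ae"],            # 20
    ["ef", "ab", "df", "c", "b"],       # 30
    ["e", "g", "af", "c", "b", "c"],    # 40
]
F_LIST = ["a", "b", "c", "d", "e", "f"]


def freespan_project(db, item, f_list):
    """Keep only items up to `item` in f_list, for sequences containing `item`."""
    keep = set(f_list[: f_list.index(item) + 1])
    projected = []
    for seq in db:
        if not any(item in element for element in seq):
            continue  # sequences without the item do not contribute
        reduced = ["".join(i for i in element if i in keep) for element in seq]
        projected.append([element for element in reduced if element])
    return projected


print(freespan_project(DB, "b", F_LIST))
# [['a', 'ab', 'a'], ['a', 'b', 'a'], ['ab', 'b'], ['a', 'b']]
```

The four results correspond to <a(ab)a>, <aba>, <(ab)b> and <ab>.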
14
From FreeSpan to PrefixSpan

FreeSpan
Projection-based: no candidate sequence needs to
be generated
But projection can be performed at any point in
the sequence, and the projected sequences may not
shrink much. For example, the size of the f-projected
database is the same as that of the original sequence
database
PrefixSpan
Projection-based
But only prefix-based projection: fewer
projections and quickly shrinking sequences

15
PrefixSpan-concepts

Suppose all items in an element are listed
alphabetically.
Given sequences α = <e1 e2 ... en> and β = <e1' e2' ... em'> (m ≤ n)
Prefix: β is a prefix of α iff (1) ei' = ei (i
≤ m-1); (2) em' ⊆ em; (3) all items in (em - em') are
alphabetically after those in em'.
e.g. for α = <a(abc)(ac)d(cf)>, β = <a(ab)> is a prefix of α, but <a(bc)> is not
Postfix: given prefix β = <e1' e2' ... em'>, the sequence γ = <em'' em+1 ... en>
is called the postfix of α w.r.t. prefix β, where
em'' = (em - em'), denoted as α = β·γ
e.g. γ = <(_c)(ac)d(cf)> is the postfix of α w.r.t.
prefix <a(ab)>
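
The prefix and postfix definitions translate fairly directly into code. The following is a minimal sketch under some assumptions: elements are strings with items in alphabetical order, β is non-empty, a leading "_" marks the partial element em'' as in the slide's notation, and the function name `postfix` is made up for the example.

```python
def postfix(alpha, beta):
    """Postfix of alpha w.r.t. prefix beta, or None if beta is not a prefix.

    Sequences are lists of elements; each element is a string whose items
    are in alphabetical order. A leading "_" marks a partial first element.
    """
    m = len(beta)
    if m == 0 or m > len(alpha):
        return None
    if alpha[: m - 1] != beta[: m - 1]:          # condition (1)
        return None
    last_a, last_b = alpha[m - 1], beta[m - 1]
    if not set(last_b) <= set(last_a):           # condition (2)
        return None
    rest = "".join(i for i in last_a if i not in last_b)
    if rest and min(rest) < max(last_b):         # condition (3)
        return None
    head = ["_" + rest] if rest else []
    return head + alpha[m:]


alpha = ["a", "abc", "ac", "d", "cf"]            # <a(abc)(ac)d(cf)>
print(postfix(alpha, ["a", "ab"]))               # ['_c', 'ac', 'd', 'cf']
print(postfix(alpha, ["a", "bc"]))               # None: a is not after b, c
```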

16
PrefixSpan-concepts (cont)

Projected database: let α be a sequential pattern
in S. The α-projected database, denoted S|α, is the
collection of postfixes of sequences in S w.r.t.
prefix α
Support count in projected database: let α be a
sequential pattern in S and β be a sequence having
prefix α. The support count of β in the α-projected
database is the number of sequences γ in S|α such
that β ⊆ α·γ

17
PrefixSpan-process

Step 1: find length-1 sequential patterns
<a>:4, <b>:4, <c>:4, <d>:3, <e>:3, <f>:3
Step 2: divide the search space. The complete set of
sequential patterns can be partitioned into 6 subsets:
those having prefix <a>
those having prefix <b>
...
those having prefix <f>
Find the subsets of sequential patterns: they can be
mined by constructing projected databases and
mining each recursively

SID  Sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>
18
PrefixSpan-Process (cont)

Finding sequential patterns with prefix <a>
<a>-projected database: <(abc)(ac)d(cf)>,
<(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
Find all the length-2 sequential patterns having prefix
<a>: <aa>:2, <ab>:4, <(ab)>:2, <ac>:4, <ad>:2,
<af>:2
Further partition into 6 subsets:
those having prefix <aa>
...
those having prefix <af>

SID  Sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>
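
The <a>-projected database and the length-2 counts on this slide can be reproduced with a small sketch. It is deliberately restricted to a single-item prefix, so it is not a full PrefixSpan implementation; the names `project_item` and `length2_counts` are assumptions for the example, and a leading "_" marks a partial element as in the slide.

```python
from collections import Counter

DB = [
    ["a", "abc", "ac", "d", "cf"],      # 10
    ["ad", "c", "bc", "ae"],            # 20
    ["ef", "ab", "df", "c", "b"],       # 30
    ["e", "g", "af", "c", "b", "c"],    # 40
]


def project_item(db, x):
    """Postfix of each sequence w.r.t. the first occurrence of item x."""
    projected = []
    for seq in db:
        for k, element in enumerate(seq):
            if x in element:
                rest = "".join(i for i in element if i > x)
                head = ["_" + rest] if rest else []
                projected.append(head + seq[k + 1:])
                break
    return projected


def length2_counts(projected, x):
    """Per-sequence counts of items y forming <x y> and <(x y)>."""
    seq_ext, item_ext = Counter(), Counter()
    for post in projected:
        s_items, i_items = set(), set()
        for element in post:
            if element.startswith("_"):
                i_items.update(element[1:])       # shares an element with x
            else:
                s_items.update(element)           # later element: <x y>
                if x in element:                  # later element holding x too
                    i_items.update(i for i in element if i > x)
        seq_ext.update(s_items)
        item_ext.update(i_items)
    return seq_ext, item_ext


proj = project_item(DB, "a")
seq_ext, item_ext = length2_counts(proj, "a")
frequent = {f"<a{y}>": n for y, n in seq_ext.items() if n >= 2}
frequent.update({f"<(a{y})>": n for y, n in item_ext.items() if n >= 2})
print(proj)
print(frequent)
```

With min_support = 2 this yields <aa>:2, <ab>:4, <ac>:4, <ad>:2, <af>:2 and <(ab)>:2, matching the counts listed above (the printed dictionary order may vary).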
19
Completeness of PrefixSpan
SID  Sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>

Length-1 sequential patterns: <a>, <b>, <c>, <d>,
<e>, <f>
prefix <a>: <a>-projected database <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>,
<(_b)(df)cb>, <(_f)cbc>
Length-2 sequential patterns: <aa>, <ab>, <(ab)>, <ac>, <ad>,
<af>
prefix <aa>: <aa>-projected database
...
prefix <af>: <af>-projected database
prefix <b>: <b>-projected database
prefix <c>, ..., <f>
20
Efficiency of PrefixSpan

No candidate sequence needs to be generated
Projected databases keep shrinking
Major cost of PrefixSpan: constructing projected
databases
Can be improved by bi-level projections and
pseudo-projections

21
Optimization Techniques in PrefixSpan

Single-level vs. bi-level projection
Bi-level projection with 3-way checking may
reduce the number and size of projected databases
Physical projection vs. pseudo-projection
Pseudo-projection may reduce the effort of
projection when the projected database fits in
main memory

22
S-matrix for sequence database
Length-1 sequential patterns: <a>, <b>, <c>, <d>,
<e>, <f>
S-matrix

a    2
b    (4, 2, 2)  1
c    (4, 2, 1)  (3, 3, 2)  3
d    (2, 1, 1)  (2, 2, 0)  (1, 3, 0)  0
e    (1, 2, 1)  (1, 2, 0)  (1, 2, 0)  (1, 1, 0)  0
f    (2, 1, 1)  (2, 2, 0)  (1, 2, 1)  (1, 1, 1)  (2, 0, 1)  1
     a          b          c          d          e          f

Entry [a, a] = 2: <aa> happens twice
Entry [c, a] = (4, 2, 1): <ac> happens 4 times, <ca> happens twice,
<(ac)> happens once
All length-2 sequential patterns are found in the
S-matrix
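
To make the meaning of the S-matrix entries concrete, the sketch below recomputes a few cells by brute force. It tests containment pattern by pattern rather than filling the matrix in one scan of the database, so it illustrates the definition and not the efficient construction; all function names are assumptions.

```python
DB = [
    ["a", "abc", "ac", "d", "cf"],      # 10
    ["ad", "c", "bc", "ae"],            # 20
    ["ef", "ab", "df", "c", "b"],       # 30
    ["e", "g", "af", "c", "b", "c"],    # 40
]


def contains(seq, pattern):
    """True if pattern is a subsequence of seq (greedy earliest matching)."""
    j = 0
    for p in pattern:
        while j < len(seq) and not set(p) <= set(seq[j]):
            j += 1
        if j == len(seq):
            return False
        j += 1
    return True


def support(pattern, db):
    return sum(contains(seq, pattern) for seq in db)


def s_matrix(db, items):
    """Lower-triangular S-matrix keyed by (row item, column item).

    Diagonal [x, x] holds the support of <xx>. Off-diagonal [x, y], with
    y before x in `items`, holds (support <yx>, support <xy>, support <(xy)>).
    """
    matrix = {}
    for r, x in enumerate(items):
        matrix[x, x] = support([x, x], db)
        for y in items[:r]:
            both = "".join(sorted(x + y))
            matrix[x, y] = (support([y, x], db), support([x, y], db), support([both], db))
    return matrix


M = s_matrix(DB, "abcdef")
print(M["a", "a"])   # 2
print(M["c", "a"])   # (4, 2, 1)
print(M["d", "c"])   # (1, 3, 0)
```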
23
S-matrix for <ab>-projected database

<ab>-projected database:
<(_c)(ac)d(cf)>, <(_c)(ae)>, <c>
Frequent items: <a>, <c>, <(_c)>

SID  Sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>

S-matrix

a     0
c     (1, 0, 1)  1
(_c)  (∅, 2, ∅)  (∅, 1, ∅)  ∅
      a          c          (_c)

∅: no <a(_c)>, so no count
The count 2 in entry [(_c), a] leads to pattern <a(bc)a>
24
Scaling-up by Bi-level Projection

Partition search space based on length-2
sequential patterns
Only form projected databases and pursue
recursive mining over bi-level projected databases

25
Benefits of Bi-level Projection

More patterns are found in each shot
Far fewer projections
In the example, there are 53 patterns:
53 level-by-level projections
22 bi-level projections

26
3-way Apriori Checking

Use the Apriori heuristic to prune items in
projected databases

<acd> cannot be a pattern w.r.t.
min_support = 2 because <cd> has support 1: exclude d from the <ac>-projected
database

a    2
b    (4, 2, 2)  1
c    (4, 2, 1)  (3, 3, 2)  3
d    (2, 1, 1)  (2, 2, 0)  (1, 3, 0)  0
e    (1, 2, 1)  (1, 2, 0)  (1, 2, 0)  (1, 1, 0)  0
f    (2, 1, 1)  (2, 2, 0)  (1, 2, 1)  (1, 1, 1)  (2, 0, 1)  1
     a          b          c          d          e          f
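
The pruning argument for <acd> can be written as a tiny check: a length-3 candidate can only be frequent if each of its length-2 subsequences is frequent. The sketch below hard-codes the three relevant supports read off the S-matrix (<ac> = 4, <ad> = 2, <cd> = 1) and handles only candidates whose elements are single items; the names are hypothetical.

```python
from itertools import combinations

# Supports of <xy> patterns read off the S-matrix on this slide.
PAIR_SUPPORT = {("a", "c"): 4, ("a", "d"): 2, ("c", "d"): 1}


def may_be_frequent(candidate, pair_support, min_sup):
    """Apriori check: every length-2 subsequence must itself be frequent."""
    return all(
        pair_support.get(pair, 0) >= min_sup
        for pair in combinations(candidate, 2)
    )


print(may_be_frequent(("a", "c", "d"), PAIR_SUPPORT, 2))  # False, since <cd> = 1
```

Because the check fails, d need not be carried into the <ac>-projected database at all.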
27
Pseudo-projection

Major cost of PrefixSpan: projection
Postfixes of sequences often appear repeatedly in
recursive projected databases
When the projected database fits in memory, use
pointers to form projections:
Pointer to the sequence
Offset of the postfix
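
The pointer-and-offset idea can be illustrated with a toy example. To keep it short, this sketch assumes every element is a single item, so a sequence is just a string, and it uses a made-up three-sequence database; `pseudo_project` is a hypothetical name. A projected database is then a list of (sequence index, offset) pairs, and no postfix is ever copied.

```python
# Hypothetical in-memory database; each character is a one-item element.
DB = ["abcab", "bca", "cab"]


def pseudo_project(db, pointers, item):
    """Project on `item` using (sequence index, offset) pairs only.

    For each pointer, find the first occurrence of `item` at or after the
    offset; the new postfix starts right after it.
    """
    projected = []
    for sid, offset in pointers:
        pos = db[sid].find(item, offset)
        if pos != -1:
            projected.append((sid, pos + 1))
    return projected


whole_db = [(sid, 0) for sid in range(len(DB))]
a_proj = pseudo_project(DB, whole_db, "a")
ab_proj = pseudo_project(DB, a_proj, "b")
print(a_proj)   # [(0, 1), (1, 3), (2, 2)]
print(ab_proj)  # [(0, 2), (2, 3)]
```

Each recursive projection costs only one pair of integers per surviving sequence, which is why the technique pays off when the database stays in main memory.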

28
Pseudo-Projection vs. Physical Projection

Pseudo-projection avoids physically copying
postfixes
Efficient when the database fits in main memory
Not efficient when the database cannot fit in main
memory
Disk-based random access is very costly
Suggested approach
Integration of physical and pseudo-projection
Switching to pseudo-projection when the data set
fits in memory

29
Experiments

Synthetic datasets were generated using the procedure
described in R. Agrawal and R. Srikant, Mining
sequential patterns, ICDE'95
Number of items: 1,000
Number of sequences in the data set: 10,000
Average number of items within elements: 8
Average number of elements in a sequence: 8

30
Experiments (cont)

Comparing PrefixSpan with GSP and FreeSpan in
large databases
GSP (IBM Almaden, Srikant and Agrawal, EDBT'96)
FreeSpan (J. Han, J. Pei, B. Mortazavi-Asl, Q.
Chen, U. Dayal, M.-C. Hsu, KDD'00)
PrefixSpan-1 (single-level projection)
PrefixSpan-2 (bi-level projection)
Comparing effects of pseudo-projection
Comparing I/O cost and scalability

31
PrefixSpan Is Faster Than GSP and FreeSpan
32
Effect of Pseudo-Projection When the Projected
Database Fits in Memory
33
I/O Cost When It Cannot Fit in Memory
34
Scalability (When DB Is Large)
min_sup = 0.2
35
Conclusions

Both PrefixSpan and FreeSpan are pattern-growth
methods, and both perform better than Apriori-like
methods on the sequential pattern mining problem
PrefixSpan is more elegant than FreeSpan
The Apriori heuristic is integrated into bi-level
projection in PrefixSpan
Pseudo-projection substantially enhances the
performance of memory-based processing

36
References

J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U.
Dayal, and M.-C. Hsu. FreeSpan: Frequent
pattern-projected sequential pattern mining.
KDD'00, pages 355-359.
J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q.
Chen, U. Dayal, and M.-C. Hsu. PrefixSpan:
Mining sequential patterns efficiently by
prefix-projected pattern growth. ICDE'01, pages
215-224.
R. Srikant and R. Agrawal. Mining sequential
patterns: Generalizations and performance
improvements. EDBT'96, pages 3-17.