Multi-dimensional Sequential Pattern Mining

Slides: 25
Provided by: peij

1
Multi-dimensional Sequential Pattern Mining
  • Helen Pinto, Jiawei Han, Jian Pei, Ke Wang,
    Qiming Chen, Umeshwar Dayal

2
Outline
  • Why multidimensional sequential pattern mining?
  • Problem definition
  • Algorithms
  • Experimental results
  • Conclusions

3
Why Sequential Pattern Mining?
  • Sequential pattern mining: finding time-related
    frequent patterns (frequent subsequences)
  • Many data and applications are time-related
  • Customer shopping patterns, telephone calling
    patterns
  • E.g., first buy a computer, then CD-ROMs and software,
    within 3 months
  • Natural disasters (e.g., earthquake, hurricane)
  • Disease and treatment
  • Stock market fluctuation
  • Weblog click stream analysis
  • DNA sequence analysis

4
Motivating Example
  • Sequential patterns are useful
  • Free internet access → buy package 1 → upgrade
    to package 2
  • Marketing, product design & development
  • Problems: lack of focus
  • Various groups of customers may have different
    patterns
  • MD-sequential pattern mining: integrate
    multi-dimensional analysis and sequential pattern
    mining

5
Sequences and Patterns
  • Given a set of sequences, find the complete set
    of frequent subsequences

A sequence: <(ef)(ab)(df)cb>
Each parenthesized group is an element; items within an element are listed
alphabetically.

A sequence database:

SID  sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>

<a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>.
Given support threshold min_sup = 2, <(ab)c> is a sequential pattern.
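
The subsequence test and support count used on this slide can be sketched as follows. This is a minimal illustration, not the authors' code: the helper names `parse`, `is_subsequence`, and `support` are hypothetical, and items are assumed to be single lowercase letters as in the example database.

```python
from typing import Dict, FrozenSet, List

Sequence = List[FrozenSet[str]]


def parse(text: str) -> Sequence:
    """Parse 'a(abc)(ac)d(cf)' into a list of elements (item sets)."""
    seq, i = [], 0
    while i < len(text):
        if text[i] == "(":
            j = text.index(")", i)
            seq.append(frozenset(text[i + 1:j]))
            i = j + 1
        else:
            seq.append(frozenset(text[i]))
            i += 1
    return seq


def is_subsequence(pattern: Sequence, seq: Sequence) -> bool:
    """Greedy left-to-right match: every pattern element must be contained
    in a distinct element of seq, in the same order."""
    pos = 0
    for element in pattern:
        while pos < len(seq) and not element <= seq[pos]:
            pos += 1
        if pos == len(seq):
            return False
        pos += 1
    return True


def support(pattern: Sequence, database: Dict[int, Sequence]) -> int:
    """Number of sequences in the database that contain the pattern."""
    return sum(is_subsequence(pattern, s) for s in database.values())


SDB = {
    10: parse("a(abc)(ac)d(cf)"),
    20: parse("(ad)c(bc)(ae)"),
    30: parse("(ef)(ab)(df)cb"),
    40: parse("eg(af)cbc"),
}

print(is_subsequence(parse("a(bc)dc"), SDB[10]))  # True
print(support(parse("(ab)c"), SDB))               # 2 (sequences 10 and 30)
```

Matching each pattern element against the earliest possible element of the sequence is enough here, because an earlier match never rules out a later one.
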
6
Sequential Pattern Basics
A sequence database
<ad(ae)> is a subsequence of <a(bd)bcb(ade)>.
Given support threshold min_sup = 2, <(bd)cb> is a sequential pattern.

7
MD Sequence Database
  • P = (*, Chicago, *, <bf>) matches tuples 20 and 30
  • If the support threshold is 2, P is an MD sequential pattern

cid  Cust_grp      City      Age_grp  sequence
10   Business      Boston    Middle   <(bd)cba>
20   Professional  Chicago   Young    <(bf)(ce)(fg)>
30   Business      Chicago   Middle   <(ah)abf>
40   Education     New York  Retired  <(be)(ce)>
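
Matching an MD pattern against a tuple combines a wildcard test on the dimensions with a subsequence test on the sequence. The sketch below assumes "*" as the wildcard symbol and single-letter items; `parse`, `contains`, and `matches` are illustrative names, not part of the original work.

```python
import re


def parse(text):
    """'(bf)(ce)(fg)' -> [{'b','f'}, {'c','e'}, {'f','g'}]"""
    tokens = re.findall(r"\([a-z]+\)|[a-z]", text)
    return [frozenset(tok.strip("()")) for tok in tokens]


def contains(seq, pattern):
    """True if pattern is a subsequence of seq (greedy earliest match)."""
    pos = 0
    for element in pattern:
        while pos < len(seq) and not element <= seq[pos]:
            pos += 1
        if pos == len(seq):
            return False
        pos += 1
    return True


MD_DB = {
    10: (("Business", "Boston", "Middle"), parse("(bd)cba")),
    20: (("Professional", "Chicago", "Young"), parse("(bf)(ce)(fg)")),
    30: (("Business", "Chicago", "Middle"), parse("(ah)abf")),
    40: (("Education", "New York", "Retired"), parse("(be)(ce)")),
}


def matches(md_pattern, seq_pattern, row):
    """A row matches if every non-wildcard dimension agrees and the
    sequence part of the pattern is a subsequence of the row's sequence."""
    dims, seq = row
    dims_ok = all(p == "*" or p == v for p, v in zip(md_pattern, dims))
    return dims_ok and contains(seq, seq_pattern)


matching = [cid for cid, row in MD_DB.items()
            if matches(("*", "Chicago", "*"), parse("bf"), row)]
print(matching)  # [20, 30]
```
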
8
Mining of MD Seq. Pat.
  • Embedding MD information into sequences
  • Using a uniform seq. pat. mining method
  • Integration of seq. pat. mining and MD analysis
    method

9
UniSeq
  • Embed MD information into sequences

cid  Cust_grp      City      Age_grp  sequence
10   Business      Boston    Middle   <(bd)cba>
20   Professional  Chicago   Young    <(bf)(ce)(fg)>
30   Business      Chicago   Middle   <(ah)abf>
40   Education     New York  Retired  <(be)(ce)>

Mine the extended sequence database using
sequential pattern mining methods

cid  MD-extension of sequences
10   <(Business,Boston,Middle)(bd)cba>
20   <(Professional,Chicago,Young)(bf)(ce)(fg)>
30   <(Business,Chicago,Middle)(ah)abf>
40   <(Education,New York,Retired)(be)(ce)>
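
The embedding step can be sketched as below: the dimension values become one extra leading element, after which any ordinary subsequence test (or sequential pattern miner) applies unchanged. The function names are hypothetical, and the sketch assumes dimension values never collide with item names or with each other across dimensions.

```python
ROWS = {
    10: (("Business", "Boston", "Middle"), ["bd", "c", "b", "a"]),
    20: (("Professional", "Chicago", "Young"), ["bf", "ce", "fg"]),
    30: (("Business", "Chicago", "Middle"), ["ah", "a", "b", "f"]),
    40: (("Education", "New York", "Retired"), ["be", "ce"]),
}


def md_extension(dims, seq):
    """Prepend the dimension values as a single element of the sequence."""
    return [frozenset(dims)] + [frozenset(element) for element in seq]


def is_subsequence(pattern, seq):
    """Greedy earliest-match subsequence test on lists of item sets."""
    pos = 0
    for element in pattern:
        while pos < len(seq) and not element <= seq[pos]:
            pos += 1
        if pos == len(seq):
            return False
        pos += 1
    return True


extended = {cid: md_extension(dims, seq) for cid, (dims, seq) in ROWS.items()}

# The MD pattern (*, Chicago, *, <bf>) becomes the plain sequence <(Chicago) b f>.
pattern = [frozenset({"Chicago"}), frozenset("b"), frozenset("f")]
support = sum(is_subsequence(pattern, s) for s in extended.values())
print(support)  # 2
```
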
10
Mine Sequential Patterns by Prefix Projections
  • Step 1: find length-1 sequential patterns
  • <a>, <b>, <c>, <d>, <e>, <f>
  • Step 2: divide search space. The complete set of
    seq. pat. can be partitioned into 6 subsets
  • The ones having prefix <a>
  • The ones having prefix <b>
  • ...
  • The ones having prefix <f>

SID  sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>
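
Step 1 amounts to counting, for each item, how many sequences contain it at least once. A minimal sketch, assuming single-letter items and min_sup = 2 as on the earlier slides (`frequent_items` is an illustrative name):

```python
import re
from collections import Counter

SDB = {
    10: "a(abc)(ac)d(cf)",
    20: "(ad)c(bc)(ae)",
    30: "(ef)(ab)(df)cb",
    40: "eg(af)cbc",
}


def frequent_items(db, min_sup):
    """Items that occur in at least min_sup sequences.
    Each sequence contributes at most once per item."""
    counts = Counter()
    for text in db.values():
        counts.update(set(re.findall(r"[a-z]", text)))
    return sorted(item for item, n in counts.items() if n >= min_sup)


print(frequent_items(SDB, 2))  # ['a', 'b', 'c', 'd', 'e', 'f']
```

Item g appears only in sequence 40, so it is dropped and the search space splits into the six prefix subsets listed above.
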
11
Find Seq. Patterns with Prefix <a>
  • Only need to consider projections w.r.t. <a>
  • <a>-projected database: <(abc)(ac)d(cf)>,
    <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
  • Find all the length-2 seq. pat. having prefix
    <a>: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
  • Further partition into 6 subsets
  • Having prefix <aa>
  • ...
  • Having prefix <af>

SID  sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>
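
The projection and the length-2 patterns on this slide can be reproduced with a short sketch. It handles single-item prefixes only, stores each element as a string of sorted items, and marks the remainder of the matched element with "_" as the slide does; the function names are hypothetical.

```python
from collections import Counter

SDB = {
    10: ["a", "abc", "ac", "d", "cf"],
    20: ["ad", "c", "bc", "ae"],
    30: ["ef", "ab", "df", "c", "b"],
    40: ["e", "g", "af", "c", "b", "c"],
}


def project(seq, item):
    """Suffix of seq after the first occurrence of item, or None."""
    for i, element in enumerate(seq):
        if item in element:
            rest = "".join(c for c in element if c > item)
            return (["_" + rest] if rest else []) + seq[i + 1:]
    return None


def show(seq):
    return "<" + "".join(e if len(e) == 1 else f"({e})" for e in seq) + ">"


def length2_patterns(projected, item, min_sup):
    """Frequent <item x> (x in a later element) and <(item x)> (same element)."""
    seq_ext, set_ext = Counter(), Counter()
    for suffix in projected:
        later, same = set(), set()
        for element in suffix:
            if element.startswith("_"):
                same.update(element[1:])
            else:
                later.update(element)
                if item in element:  # a later element that also holds the prefix item
                    same.update(c for c in element if c > item)
        seq_ext.update(later)
        set_ext.update(same)
    result = [f"<{item}{x}>" for x, n in sorted(seq_ext.items()) if n >= min_sup]
    result += [f"<({item}{x})>" for x, n in sorted(set_ext.items()) if n >= min_sup]
    return result


projected = [p for p in (project(s, "a") for s in SDB.values()) if p is not None]
print([show(p) for p in projected])
# ['<(abc)(ac)d(cf)>', '<(_d)c(bc)(ae)>', '<(_b)(df)cb>', '<(_f)cbc>']
print(length2_patterns(projected, "a", 2))
# ['<aa>', '<ab>', '<ac>', '<ad>', '<af>', '<(ab)>']
```
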
12
Completeness of PrefixSpan
SDB:

SID  sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>

Length-1 sequential patterns: <a>, <b>, <c>, <d>, <e>, <f>
  Having prefix <a>
    <a>-projected database: <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>,
    <(_b)(df)cb>, <(_f)cbc>
    Length-2 sequential patterns: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
      Having prefix <aa>: <aa>-proj. db
      ...
      Having prefix <af>: <af>-proj. db
  Having prefix <b>
    <b>-projected database
    ...
  Having prefix <c>, ..., <f>
    ...

13
Efficiency of PrefixSpan
  • No candidate sequence needs to be generated
  • Projected databases keep shrinking
  • Major cost of PrefixSpan: constructing projected
    databases
  • Can be improved by bi-level projections
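
The prefix-growth recursion described on the preceding slides can be sketched compactly. This is a simplified illustration rather than the published implementation: instead of copying projected databases it keeps, per sequence, the index where the current prefix first ends, and it does not implement bi-level projection. All names are hypothetical.

```python
from collections import defaultdict


def show(pattern):
    parts = ("".join(sorted(e)) for e in pattern)
    return "<" + "".join(p if len(p) == 1 else f"({p})" for p in parts) + ">"


def prefixspan(db, min_sup):
    """Return {pattern string: support} for all frequent sequential patterns."""
    found = {}

    def grow(pattern, entries):
        # entries: (sequence, i), where i is the earliest index at which
        # the last element of `pattern` can be matched (-1 for the empty prefix).
        last = pattern[-1] if pattern else frozenset()
        seq_ext, set_ext = defaultdict(list), defaultdict(list)
        for seq, i in entries:
            seen = set()
            for j in range(i + 1, len(seq)):       # x starts a new element
                for x in seq[j] - seen:
                    seen.add(x)
                    seq_ext[x].append((seq, j))
            seen = set()
            if pattern:
                for j in range(i, len(seq)):       # x joins the last element
                    if last <= seq[j]:
                        for x in seq[j] - seen:
                            if x > max(last):
                                seen.add(x)
                                set_ext[x].append((seq, j))
        for x in sorted(seq_ext):
            if len(seq_ext[x]) >= min_sup:
                new = pattern + [frozenset([x])]
                found[show(new)] = len(seq_ext[x])
                grow(new, seq_ext[x])
        for x in sorted(set_ext):
            if len(set_ext[x]) >= min_sup:
                new = pattern[:-1] + [last | {x}]
                found[show(new)] = len(set_ext[x])
                grow(new, set_ext[x])

    grow([], [(seq, -1) for seq in db])
    return found


RAW = [
    ["a", "abc", "ac", "d", "cf"],
    ["ad", "c", "bc", "ae"],
    ["ef", "ab", "df", "c", "b"],
    ["e", "g", "af", "c", "b", "c"],
]
SDB = [[frozenset(e) for e in seq] for seq in RAW]

patterns = prefixspan(SDB, 2)
print(patterns["<(ab)c>"])  # 2
print(sorted(p for p in patterns if len(p) == 3))
# ['<a>', '<b>', '<c>', '<d>', '<e>', '<f>']
```

No candidate sequences are generated: every extension that is counted actually occurs in some projected suffix, and each recursive call sees fewer entries than its parent.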

14
Mining MD-Patterns
MD pattern: (*, Chicago, *)

cid  Cust_grp      City      Age_grp  sequence
10   Business      Boston    Middle   <(bd)cba>
20   Professional  Chicago   Young    <(bf)(ce)(fg)>
30   Business      Chicago   Middle   <(ah)abf>
40   Education     New York  Retired  <(be)(ce)>

BUC processing over the lattice of dimension combinations:
(cust-grp, city, age-grp)
(cust-grp, city, *)
(cust-grp, *, age-grp)
(cust-grp, *, *)
(*, city, *)
(*, *, age-grp)
All
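
A BUC-style search for frequent MD-patterns can be sketched as a recursive partitioning that starts from All and fixes one more dimension at each level, pruning partitions that fall below the support threshold. This is a simplified illustration under the assumption min_sup = 2; `md_patterns` and `expand` are hypothetical names.

```python
from collections import defaultdict

ROWS = [
    ("Business", "Boston", "Middle"),
    ("Professional", "Chicago", "Young"),
    ("Business", "Chicago", "Middle"),
    ("Education", "New York", "Retired"),
]


def md_patterns(rows, min_sup):
    """Return {pattern: support} for every frequent MD-pattern except All."""
    n = len(rows[0])
    found = {}

    def expand(pattern, part, start):
        for d in range(start, n):
            groups = defaultdict(list)
            for row in part:
                groups[row[d]].append(row)
            for value, group in groups.items():
                if len(group) >= min_sup:  # infrequent partitions are never refined
                    new = pattern[:d] + (value,) + pattern[d + 1:]
                    found[new] = len(group)
                    expand(new, group, d + 1)

    expand(("*",) * n, rows, 0)
    return found


for pattern, sup in md_patterns(ROWS, 2).items():
    print(pattern, sup)
# ('Business', '*', '*') 2
# ('Business', '*', 'Middle') 2
# ('*', 'Chicago', '*') 2
# ('*', '*', 'Middle') 2
```
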
15
Dim-Seq
  • First find MD-patterns
  • E.g. (*, Chicago, *)
  • Form projected sequence database
  • <(bf)(ce)(fg)> and <(ah)abf> for (*, Chicago, *)
  • Find seq. pat. in projected database
  • E.g. (*, Chicago, *, <bf>)

cid  Cust_grp      City      Age_grp  sequence
10   Business      Boston    Middle   <(bd)cba>
20   Professional  Chicago   Young    <(bf)(ce)(fg)>
30   Business      Chicago   Middle   <(ah)abf>
40   Education     New York  Retired  <(be)(ce)>
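
The Dim-Seq order of operations for one MD-pattern can be sketched as follows. The MD-pattern (*, Chicago, *) is taken as given, the sequences of matching tuples are collected, and a candidate sequential pattern is then counted inside that projected database. Function names are illustrative, and elements are stored as strings of single-letter items.

```python
MD_DB = [
    (10, ("Business", "Boston", "Middle"), ["bd", "c", "b", "a"]),
    (20, ("Professional", "Chicago", "Young"), ["bf", "ce", "fg"]),
    (30, ("Business", "Chicago", "Middle"), ["ah", "a", "b", "f"]),
    (40, ("Education", "New York", "Retired"), ["be", "ce"]),
]


def dims_match(md_pattern, dims):
    return all(p == "*" or p == v for p, v in zip(md_pattern, dims))


def contains(seq, pattern):
    """Greedy earliest-match subsequence test."""
    pos = 0
    for element in pattern:
        while pos < len(seq) and not set(element) <= set(seq[pos]):
            pos += 1
        if pos == len(seq):
            return False
        pos += 1
    return True


def project_sequences(db, md_pattern):
    """Sequences of all tuples whose dimensions match the MD-pattern."""
    return [seq for _, dims, seq in db if dims_match(md_pattern, dims)]


projected = project_sequences(MD_DB, ("*", "Chicago", "*"))
print(projected)  # [['bf', 'ce', 'fg'], ['ah', 'a', 'b', 'f']]

support = sum(contains(seq, ["b", "f"]) for seq in projected)
print(support)    # 2, so (*, Chicago, *, <bf>) is frequent for min_sup = 2
```
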
16
Seq-Dim
  • Find sequential patterns
  • E.g. <bf>
  • Form projected MD-database
  • E.g. (Professional,Chicago,Young) and
    (Business,Chicago,Middle) for <bf>
  • Mine MD-patterns
  • E.g. (*, Chicago, *, <bf>)

cid  Cust_grp      City      Age_grp  sequence
10   Business      Boston    Middle   <(bd)cba>
20   Professional  Chicago   Young    <(bf)(ce)(fg)>
30   Business      Chicago   Middle   <(ah)abf>
40   Education     New York  Retired  <(be)(ce)>
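
Seq-Dim reverses the order: the sequential pattern comes first, and MD-patterns are mined among the tuples that support it. The sketch below takes <bf> as given and enumerates MD-patterns by brute force, which is only reasonable for a handful of dimensions; a real implementation would use a more efficient method such as BUC. Names are hypothetical.

```python
from itertools import product

MD_DB = [
    (10, ("Business", "Boston", "Middle"), ["bd", "c", "b", "a"]),
    (20, ("Professional", "Chicago", "Young"), ["bf", "ce", "fg"]),
    (30, ("Business", "Chicago", "Middle"), ["ah", "a", "b", "f"]),
    (40, ("Education", "New York", "Retired"), ["be", "ce"]),
]


def contains(seq, pattern):
    """Greedy earliest-match subsequence test."""
    pos = 0
    for element in pattern:
        while pos < len(seq) and not set(element) <= set(seq[pos]):
            pos += 1
        if pos == len(seq):
            return False
        pos += 1
    return True


def project_dims(db, seq_pattern):
    """Dimension tuples of all rows whose sequence contains seq_pattern."""
    return [dims for _, dims, seq in db if contains(seq, seq_pattern)]


def md_patterns(rows, min_sup):
    """Brute force: generalize each row in every possible way, then count."""
    candidates = set()
    for row in rows:
        for mask in product([True, False], repeat=len(row)):
            candidates.add(tuple(v if keep else "*" for v, keep in zip(row, mask)))
    found = {}
    for cand in candidates:
        if all(c == "*" for c in cand):
            continue  # skip the trivial pattern All
        sup = sum(all(c in ("*", v) for c, v in zip(cand, row)) for row in rows)
        if sup >= min_sup:
            found[cand] = sup
    return found


projected = project_dims(MD_DB, ["b", "f"])
print(projected)
# [('Professional', 'Chicago', 'Young'), ('Business', 'Chicago', 'Middle')]
print(md_patterns(projected, 2))
# {('*', 'Chicago', '*'): 2}
```
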
17
Scalability Over Dimensionality

18
Scalability Over Cardinality

19
Scalability Over Support Threshold

20
Scalability Over Database Size

21
Pros & Cons of Algorithms
  • Seq-Dim is efficient and scalable
  • Fastest in most cases
  • UniSeq is also efficient and scalable
  • Fastest with low dimensionality
  • Dim-Seq has poor scalability

22
Conclusions
  • MD seq. pat. mining is interesting and useful
  • Mining MD seq. pat. efficiently
  • UniSeq, Dim-Seq, and Seq-Dim
  • Future work
  • Applications of sequential pattern mining

23
References (1)
  • R. Agrawal and R. Srikant. Fast algorithms for
    mining association rules. VLDB'94, pages
    487-499.
  • R. Agrawal and R. Srikant. Mining sequential
    patterns. ICDE'95, pages 3-14.
  • C. Bettini, X. S. Wang, and S. Jajodia. Mining
    temporal relationships with multiple
    granularities in time sequences. Data Engineering
    Bulletin, 21:32-38, 1998.
  • M. Garofalakis, R. Rastogi, and K. Shim. SPIRIT:
    Sequential pattern mining with regular expression
    constraints. VLDB'99, pages 223-234.
  • J. Han, G. Dong, and Y. Yin. Efficient mining of
    partial periodic patterns in time series
    database. ICDE'99, pages 106-115.
  • J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U.
    Dayal, and M.-C. Hsu. FreeSpan: Frequent
    pattern-projected sequential pattern mining.
    KDD'00, pages 355-359.

24
References (2)
  • J. Han, J. Pei, and Y. Yin. Mining frequent
    patterns without candidate generation. SIGMOD'00,
    pages 1-12.
  • H. Lu, J. Han, and L. Feng. Stock movement and
    n-dimensional intertransaction association rules.
    DMKD'98, pages 121-127.
  • H. Mannila, H. Toivonen, and A. I. Verkamo.
    Discovery of frequent episodes in event
    sequences. Data Mining and Knowledge Discovery,
    1:259-289, 1997.
  • B. Özden, S. Ramaswamy, and A. Silberschatz.
    Cyclic association rules. ICDE'98, pages 412-421.
  • J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q.
    Chen, U. Dayal, and M.-C. Hsu. PrefixSpan:
    Mining sequential patterns efficiently by
    prefix-projected pattern growth. ICDE'01, pages
    215-224.
  • R. Srikant and R. Agrawal. Mining sequential
    patterns: Generalizations and performance
    improvements. EDBT'96, pages 3-17.