Data Mining Techniques Sequential Patterns - PowerPoint PPT Presentation

Slides: 18
Provided by: leeyu2

Transcript and Presenter's Notes

1
Data Mining Techniques Sequential Patterns
2
Sequential Pattern Mining
  • Progress in bar-code technology has made it
    possible for retail organizations to collect and
    store massive amounts of sales data, referred to
    as the basket data
  • A record in such data typically consists of the
    transaction date and the items bought in the
    transaction
  • Very often, data records also contain
    customer-id, particularly when the purchase has
    been made using a credit card or a frequent-buyer
    card
  • Catalog companies also collect such data using
    the orders they receive

3
Sequential Pattern Mining
  • An example of such a pattern is that customers
    typically rent Star Wars, then Empire Strikes
    Back, and then Return of the Jedi
  • These rentals need not be consecutive
  • Customers who rent some other videos in between
    also support this sequential pattern
  • Elements of a sequential pattern need not be
    simple items
  • "Computer Science and Programming Language",
    followed by "Data Structure", followed by "System
    Programs and Operating Systems" is an example of
    a sequential pattern in which the elements are
    sets of items

4
Sequential Pattern Mining
  • Given: Transaction Time, Customer Id, Items Bought

[Figure: Original Database and Answer Set]
5
Definition
  • The length of a sequence is the number of
    itemsets in the sequence
  • A sequence of length k is called a k-sequence
  • The support for an itemset i is defined as the
    fraction of customers who bought the items in i
    in a single transaction
  • The itemset i and the 1-sequence <i> have the
    same support
  • An itemset with minimum support is called a large
    (frequent) itemset or litemset
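
The support definition above can be sketched in a few lines of Python. This is only an illustration: the function name itemset_support and the toy database are hypothetical and do not come from the slides. Each customer is represented as a time-ordered list of transactions, and a customer counts toward the support only if one single transaction contains every item of the itemset.

```python
from typing import FrozenSet, Iterable, List

# A customer sequence is a time-ordered list of transactions (itemsets).
CustomerSequence = List[FrozenSet[int]]


def itemset_support(itemset: Iterable[int],
                    customer_sequences: List[CustomerSequence]) -> float:
    """Fraction of customers who bought all items of `itemset`
    together in at least one single transaction."""
    target = frozenset(itemset)
    supporters = sum(
        1
        for sequence in customer_sequences
        if any(target <= transaction for transaction in sequence)
    )
    return supporters / len(customer_sequences)


# Hypothetical toy database with four customers (not from the slides).
toy_db = [
    [frozenset({1}), frozenset({2, 3})],   # buys 2 and 3 together
    [frozenset({2, 3, 4})],                # buys 2 and 3 together
    [frozenset({2}), frozenset({3})],      # buys 2 and 3, but separately
    [frozenset({5})],
]

support_23 = itemset_support({2, 3}, toy_db)
support_2 = itemset_support({2}, toy_db)
```

The third customer buys items 2 and 3 in different transactions, so that customer supports the itemset (2) but not the itemset (2 3).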

6
AprioriAll Algorithm
  • Each itemset in a large sequence must have
    minimum support
  • Any large sequence must be a list of litemsets
  • Finding all sequential patterns takes five phases:
    • Sort Phase
    • Litemset Phase
    • Transformation Phase
    • Sequence Phase
    • Maximal Phase

7
AprioriAll Algorithm: Sort Phase
Customer-Sequence Version of the Database
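
A minimal sketch of the sort phase, assuming each input record is a (customer_id, transaction_time, items) tuple. The function name to_customer_sequences and the sample rows are hypothetical. The idea is to sort with customer id as the major key and transaction time as the minor key, then group each customer's transactions into one sequence.

```python
from itertools import groupby
from operator import itemgetter


def to_customer_sequences(transactions):
    """Convert (customer_id, transaction_time, items) records into
    {customer_id: [itemset, itemset, ...]} ordered by transaction time."""
    # Major key: customer id; minor key: transaction time.
    ordered = sorted(transactions, key=itemgetter(0, 1))
    return {
        customer_id: [frozenset(items) for _, _, items in group]
        for customer_id, group in groupby(ordered, key=itemgetter(0))
    }


# Hypothetical unsorted transaction records (not from the slides).
rows = [
    (2, 3, {40}),
    (1, 5, {90}),
    (2, 1, {10, 20}),
    (1, 2, {30}),
    (2, 2, {30}),
]

customer_sequences = to_customer_sequences(rows)
```

Here customer 1 ends up with the sequence <(30) (90)> and customer 2 with <(10 20) (30) (40)>.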
8
AprioriAll Algorithm: Litemset Phase
Apriori/DHP, FP-Growth
min_sup_count = 2
9
AprioriAll Algorithm: Transformation Phase
10
AprioriAll Algorithm: Sequence Phase
[Figure panels: Customer Sequences, Large 1-Sequences, Large 2-Sequences, Large 3-Sequences, Large 4-Sequences, Maximal Large Sequences]
11
Sequence Phase: Candidate Generation
12
AprioriAll Algorithm: Maximal Phase
  • The sequence <(3) (4 5) (8)> is contained in
    <(7) (3 8) (9) (4 5 6) (8)>, since (3) ⊆ (3 8),
    (4 5) ⊆ (4 5 6) and (8) ⊆ (8)
  • The sequence <(3) (5)> is not contained in
    <(3 5)> (and vice versa)
  • The former represents items 3 and 5 being bought
    one after the other
  • The latter represents items 3 and 5 being bought
    together.
  • In a set of sequences, a sequence s is maximal if
    s is not contained in any other sequence.
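
The containment and maximality definitions above can be checked with a short Python sketch. The helper names seq, is_contained and maximal_sequences are hypothetical. A sequence is modelled as a tuple of frozensets, and containment is tested greedily: each itemset of the shorter sequence is matched to the earliest remaining itemset of the longer one that is a superset of it.

```python
def seq(*itemsets):
    """Build a sequence as a tuple of frozensets, e.g. seq([3], [4, 5])."""
    return tuple(frozenset(i) for i in itemsets)


def is_contained(a, b):
    """True if sequence `a` is contained in sequence `b`: the itemsets of
    `a` map, in order, to distinct itemsets of `b` that are supersets."""
    pos = 0
    for itemset in a:
        # Advance to the next itemset of b that includes this itemset.
        while pos < len(b) and not itemset <= b[pos]:
            pos += 1
        if pos == len(b):
            return False
        pos += 1  # the next itemset of a must match strictly later in b
    return True


def maximal_sequences(sequences):
    """Keep only the sequences not contained in any other sequence."""
    return [
        s for s in sequences
        if not any(s != t and is_contained(s, t) for t in sequences)
    ]


# The examples from the slide.
inside = is_contained(seq([3], [4, 5], [8]),
                      seq([7], [3, 8], [9], [4, 5, 6], [8]))   # True
one_after_other = is_contained(seq([3], [5]), seq([3, 5]))      # False
together = is_contained(seq([3, 5]), seq([3], [5]))             # False
```

Greedy earliest matching is sufficient here because taking the earliest possible match for each itemset never rules out a later match.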

13
AprioriAll Algorithm
Answer Set
  • With minimum support set to 25%, i.e., a minimum
    support of 2 customers
  • <(30) (90)> and <(30) (40 70)> are maximal
  • <(10 20) (30)>, which is supported only by
    customer 2, does not have minimum support
  • <(30)>, <(40)>, <(70)>, <(90)>, <(30) (40)>,
    <(30) (70)> and <(40 70)>, though having minimum
    support, are not in the answer because they are
    not maximal.
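
The answer set can be reproduced with a small Python sketch. Note a loud assumption: the database table from the slide did not survive in this transcript, so the five customer sequences below are a reconstruction of the standard worked example usually shown with AprioriAll and should be checked against the original slides. The helper names s, contains and support_count are hypothetical.

```python
import math


def s(*itemsets):
    """Build a sequence as a tuple of frozensets."""
    return tuple(frozenset(i) for i in itemsets)


def contains(longer, pattern):
    """True if `pattern` is contained in the sequence `longer`."""
    pos = 0
    for itemset in pattern:
        while pos < len(longer) and not itemset <= longer[pos]:
            pos += 1
        if pos == len(longer):
            return False
        pos += 1
    return True


# ASSUMPTION: reconstructed example database, not present in the transcript.
db = {
    1: s([30], [90]),
    2: s([10, 20], [30], [40, 60, 70]),
    3: s([30, 50, 70]),
    4: s([30], [40, 70], [90]),
    5: s([90]),
}

# 25% of 5 customers, rounded up, gives a minimum of 2 customers.
min_count = math.ceil(0.25 * len(db))


def support_count(pattern):
    """Number of customers whose sequence contains `pattern`."""
    return sum(contains(customer_seq, pattern) for customer_seq in db.values())


# The large sequences listed on the slide.
large = [
    s([30]), s([40]), s([70]), s([90]),
    s([30], [40]), s([30], [70]), s([40, 70]),
    s([30], [90]), s([30], [40, 70]),
]

# A large sequence is maximal if no other large sequence contains it.
maximal = [p for p in large
           if not any(p != q and contains(q, p) for q in large)]
```

Under this assumed database, every sequence in large is supported by at least 2 customers, <(10 20) (30)> is supported by customer 2 only, and maximal holds exactly <(30) (90)> and <(30) (40 70)>.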

14
Summary
15
Discussions
  • AprioriAll algorithm will generate a huge set of
    candidate sequences
  • If there are 1000 frequent sequences of length-1,
    the algorithm will generate 1000 × 1000 + (1000 ×
    999) / 2 = 1,499,500 candidate sequences
  • Mining requires many scans of the database
  • Long sequential patterns are difficult to mine
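
The candidate count above can be verified with a short calculation. The reading of the two terms is an assumption not spelled out on the slide: the n × n term counts ordered pairs <x y> (including x = y), and the n × (n − 1) / 2 term counts unordered pairs bought together, <(x y)>. The function name is hypothetical.

```python
def count_length2_candidates(n: int) -> int:
    """Number of length-2 candidates built from n frequent items."""
    ordered_pairs = n * n                   # <x y>: x then y, x may equal y
    same_transaction = n * (n - 1) // 2     # <(x y)>: x and y bought together
    return ordered_pairs + same_transaction


candidates = count_length2_candidates(1000)  # 1,000,000 + 499,500
```

The count grows quadratically in the number of frequent items, which is why candidate generation becomes expensive on large item sets.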

16
Research Topics
  • Time-Interval Sequential Patterns
  • Time-Gap Sequential Patterns
  • Non-redundant Sequential Patterns
  • Constrained Sequential Pattern Mining
  • Multi-dimensional Sequential Patterns
  • Generalized Sequential Patterns
  • Incremental Mining of Sequential Patterns
  • Data Stream Sequential Pattern Mining
  • Interactive Mining of Sequential Patterns

17
Exercise 6
A Sequence Database (min-sup = 50%)

SID   Customer sequence
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>