Title: Applications of BFS and DFS: the Apriori and FPGrowth Algorithms
1. Applications of BFS and DFS: the Apriori and FPGrowth Algorithms
Modified from slides of Stanford CS345A and UIUC CS412
- Jianlin Feng
- School of Software
- SUN YAT-SEN UNIVERSITY
2. The Market-Basket Model
- A large set of items,
- e.g., things sold in a supermarket.
- A large set of baskets,
- each of which is a small set of the items,
- e.g., the things one customer buys on one day.
3. Applications (1)
- Items = products.
- Baskets = sets of products someone bought in one trip to the store.
- Example application: given that many people buy beer and diapers together,
- run a sale on diapers and raise the price of beer.
- Only useful if many buy diapers and beer.
4. Applications (2)
- Items = words.
- Baskets = Web pages.
- Unusual words appearing together in a large number of documents
- may indicate an interesting relationship.
- e.g., "Brad" and "Angelina"
5. Support
- Support for itemset I = the number of baskets containing all items in I.
- Sometimes given as a percentage.
- Given a support threshold s,
- sets of items that appear in at least s baskets are called frequent itemsets.
6. Example: Frequent Itemsets
- Items = {milk, coke, pepsi, beer, juice}.
- Support threshold = 3 baskets.
- B1 = {m, c, b}    B2 = {m, p, j}
- B3 = {m, b}       B4 = {c, j}
- B5 = {m, p, b}    B6 = {m, c, b, j}
- B7 = {c, b, j}    B8 = {b, c}
- Frequent itemsets: {m}, {c}, {b}, {j}, {m, b}, {c, b}, {c, j}.
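The support counts above can be checked mechanically. A minimal Python sketch (brute-force enumeration, fine at this tiny scale; the helper name `support` is ours, not from the slides):

```python
from itertools import combinations

# The eight baskets from the slide (support threshold s = 3).
baskets = [
    {"m", "c", "b"}, {"m", "p", "j"},
    {"m", "b"},      {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"},
    {"c", "b", "j"}, {"b", "c"},
]

def support(itemset, baskets):
    """Number of baskets containing every item in `itemset`."""
    return sum(1 for basket in baskets if itemset <= basket)

# Brute-force enumeration of all frequent itemsets.
items = sorted(set().union(*baskets))
frequent = [
    set(combo)
    for k in range(1, len(items) + 1)
    for combo in combinations(items, k)
    if support(set(combo), baskets) >= 3
]
# frequent == [{m}, {c}, {b}, {j}, {m, b}, {c, b}, {c, j}] (as sets)
```

Brute force examines every subset, which is exactly what A-Priori (below) avoids on realistic data.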
7. Association Rules
- If-then rules about the contents of baskets.
- {i1, i2, ..., ik} → j means: if a basket contains all of i1, ..., ik, then it is likely to contain j.
- Confidence of this association rule is the probability of j given i1, ..., ik.
8. Example: Confidence
- B1 = {m, c, b}    B2 = {m, p, j}
- B3 = {m, b}       B4 = {c, j}
- B5 = {m, p, b}    B6 = {m, c, b, j}
- B7 = {c, b, j}    B8 = {b, c}
- An association rule: {m, b} → c.
- Confidence = 2/4 = 50%.
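The confidence computation can be sketched in a few lines of Python over the same eight baskets (the helper name `confidence` is ours, not from the slides):

```python
# The eight baskets from the slide.
baskets = [
    {"m", "c", "b"}, {"m", "p", "j"},
    {"m", "b"},      {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"},
    {"c", "b", "j"}, {"b", "c"},
]

def confidence(lhs, rhs, baskets):
    """Fraction of baskets containing `lhs` that also contain `rhs`."""
    lhs_count = sum(1 for b in baskets if lhs <= b)
    both_count = sum(1 for b in baskets if (lhs | rhs) <= b)
    return both_count / lhs_count

# {m, b} -> c: {m, b} appears in B1, B3, B5, B6; c is also present in B1, B6.
conf = confidence({"m", "b"}, {"c"}, baskets)  # 2/4 = 0.5
```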
9. Finding Association Rules
- Question: find all association rules with support ≥ s and confidence ≥ c.
- Note: the support of an association rule is the support of the set of items on the left.
- Hard part: finding the frequent itemsets.
- Note: if {i1, i2, ..., ik} → j has high support and confidence, then both {i1, i2, ..., ik} and {i1, i2, ..., ik, j} will be frequent.
10. A-Priori Algorithm (1)
- Key idea: monotonicity.
- If a set of items appears at least s times, so does every subset.
- Contrapositive for pairs: if item i does not appear in s baskets, then no pair including i can appear in s baskets.
11. A-Priori Algorithm (2)
- Pass 1: read baskets and count in main memory the occurrences of each item.
- Requires only memory proportional to the number of items.
- Items that appear at least s times are the frequent items.
12. A-Priori Algorithm (3)
- Pass 2: read baskets again and count in main memory only those pairs both of whose items were found in Pass 1 to be frequent.
- Requires memory proportional to the square of the number of frequent items (for counts), plus a list of the frequent items (so you know what must be counted).
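The two passes can be sketched as follows; the function name `apriori_pairs` is ours, and a real system would stream the basket file from disk on each pass rather than hold it in a list:

```python
from collections import Counter
from itertools import combinations

def apriori_pairs(baskets, s):
    """Two passes over the baskets; only counters live in main memory."""
    # Pass 1: count occurrences of each single item.
    item_counts = Counter(item for basket in baskets for item in basket)
    frequent_items = {i for i, c in item_counts.items() if c >= s}

    # Pass 2: count only those pairs whose two members are both frequent.
    pair_counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(frequent_items & basket), 2):
            pair_counts[pair] += 1
    return {pair for pair, c in pair_counts.items() if c >= s}
```

On the example of slide 6 with s = 3, this returns the pairs {m, b}, {c, b}, {c, j} (as sorted tuples).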
13. Picture of A-Priori
[Diagram: Pass 1 holds the item counts in main memory; Pass 2 holds the list of frequent items plus the counts of pairs of frequent items.]
14. Frequent Triples, Etc.
- For each k, we construct two sets of k-sets (sets of size k):
- Ck = candidate k-sets: those that might be frequent sets (support ≥ s), based on information from the pass for k − 1.
- Lk = the set of truly frequent k-sets.
15. [Diagram: C1 --filter--> L1 --construct--> C2 --filter--> L2 --construct--> C3; the first pass filters C1 into L1, the second pass filters C2 into L2, and so on.]
16. The Apriori Algorithm (Pseudo-Code)
- Ck: candidate itemsets of size k
- Lk: frequent itemsets of size k
- L1 = {frequent items};
- for (k = 1; Lk != ∅; k++) do begin
-   Ck+1 = candidates generated from Lk;
-   for each transaction t in database do
-     increment the count of all candidates in Ck+1 that are contained in t;
-   Lk+1 = candidates in Ck+1 with min_support;
- end
- return ∪k Lk;
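The pseudo-code above translates directly into a runnable sketch, under the assumption that the transaction database fits in memory (all names are ours):

```python
from collections import Counter
from itertools import combinations

def apriori(transactions, min_support):
    """Return every frequent itemset (as a frozenset), level by level."""
    transactions = [frozenset(t) for t in transactions]
    # L1: frequent single items.
    item_counts = Counter(i for t in transactions for i in t)
    Lk = {frozenset([i]) for i, c in item_counts.items() if c >= min_support}
    frequent, k = set(Lk), 1
    while Lk:
        # Generate C(k+1) from Lk: union pairs of frequent k-sets into
        # (k+1)-sets, keeping only those all of whose k-subsets are
        # frequent (monotonicity).
        Ck1 = {
            a | b
            for a in Lk for b in Lk
            if len(a | b) == k + 1
            and all(frozenset(sub) in Lk for sub in combinations(a | b, k))
        }
        # One pass over the database counts every surviving candidate.
        counts = Counter(c for t in transactions for c in Ck1 if c <= t)
        Lk = {c for c, n in counts.items() if n >= min_support}
        frequent |= Lk
        k += 1
    return frequent
```

On the slide-6 baskets with min_support = 3 this yields the seven itemsets listed there.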
17. Implementation of Apriori
- How to generate candidates?
- Step 1: self-joining Lk
- Step 2: pruning
- Example of candidate generation:
- L3 = {abc, abd, acd, ace, bcd}
- Self-joining: L3 * L3
- abcd from abc and abd
- acde from acd and ace
- Pruning:
- acde is removed because ade is not in L3
- C4 = {abcd}
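The self-join and prune steps can be sketched as follows, representing each k-set as a sorted tuple (the function name is ours):

```python
from itertools import combinations

def generate_candidates(Lk):
    """Self-join Lk with itself, then prune by the Apriori property."""
    k = len(next(iter(Lk)))
    # Step 1: self-join. Two sorted k-tuples sharing their first k-1
    # items merge into one (k+1)-tuple.
    joined = set()
    for a in Lk:
        for b in Lk:
            if a[:k-1] == b[:k-1] and a[k-1] < b[k-1]:
                joined.add(a + (b[k-1],))
    # Step 2: prune any candidate that has an infrequent k-subset.
    return {c for c in joined if all(sub in Lk for sub in combinations(c, k))}

L3 = {("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"),
      ("a", "c", "e"), ("b", "c", "d")}
C4 = generate_candidates(L3)
# Self-join yields abcd and acde; pruning removes acde since ade is
# not in L3, so C4 == {("a", "b", "c", "d")}.
```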
18. Depth-First Search: the Frequent Pattern Growth Approach
- Bottlenecks of the Apriori approach:
- Breadth-first (i.e., level-wise) search
- Candidate generation and test
- Often generates a huge number of candidates
- The FPGrowth approach (J. Han, J. Pei, and Y. Yin, SIGMOD '00):
- Depth-first search
- Avoids explicit candidate generation
- Major philosophy: grow long patterns from short ones using local frequent items only.
- "abc" is a frequent pattern.
- Get all transactions containing abc, i.e., project the DB on abc: DB|abc.
- "d" is a local frequent item in DB|abc ⇒ abcd is a frequent pattern.
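The projection idea (not the FP-tree data structure itself) can be illustrated in plain Python; the helper names are ours:

```python
from collections import Counter

def project(db, pattern):
    """DB|pattern: the transactions that contain every item of `pattern`."""
    return [t for t in db if pattern <= t]

def local_frequent(db, pattern, s):
    """Items (outside `pattern`) frequent within the projected DB|pattern."""
    counts = Counter(
        i for t in project(db, pattern) for i in t if i not in pattern
    )
    return {i for i, c in counts.items() if c >= s}

# With the five-transaction DB of the next slide and s = 3, projecting
# on {f, c, a} leaves three transactions, in which only m is locally
# frequent, so {f, c, a, m} is a frequent pattern.
```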
19. Construct FP-tree from a Transaction Database

TID   Items bought                (Ordered) frequent items
100   f, a, c, d, g, i, m, p      f, c, a, m, p
200   a, b, c, f, l, m, o         f, c, a, b, m
300   b, f, h, j, o, w            f, b
400   b, c, k, s, p               c, b, p
500   a, f, c, e, l, p, m, n      f, c, a, m, p

min_support = 3

- Scan the DB once; find frequent 1-itemsets (single-item patterns).
- Sort frequent items in frequency-descending order: the f-list.
- Scan the DB again; construct the FP-tree.

F-list: f-c-a-b-m-p
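The two scans can be sketched as follows; ties in frequency may be broken arbitrarily, and we fix the slide's order f-c-a-b-m-p:

```python
from collections import Counter

db = [
    ["f", "a", "c", "d", "g", "i", "m", "p"],
    ["a", "b", "c", "f", "l", "m", "o"],
    ["b", "f", "h", "j", "o", "w"],
    ["b", "c", "k", "s", "p"],
    ["a", "f", "c", "e", "l", "p", "m", "n"],
]
min_support = 3

# First scan: count items; those with support >= 3 are frequent.
counts = Counter(i for t in db for i in t)
assert {i for i, c in counts.items() if c >= min_support} == set("fcabmp")

# The slide's f-list (frequency-descending, ties broken arbitrarily).
flist = ["f", "c", "a", "b", "m", "p"]

# Second scan: keep only frequent items, ordered by the f-list; these
# ordered transactions become the paths inserted into the FP-tree.
ordered = [[i for i in flist if i in t] for t in db]
```

The resulting `ordered` lists reproduce the table's third column, e.g. transaction 100 becomes [f, c, a, m, p] and transaction 400 becomes [c, b, p].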