Approach to Data Mining from Algorithm and Computation

Transcript and Presenter's Notes

Title: Approach to Data Mining from Algorithm and Computation


1
Approach to Data Mining from Algorithm and
Computation
  • Takeaki Uno, ETH Switzerland, NII Japan
  • Hiroki Arimura, Hokkaido University, Japan

2
Frequent Pattern Mining
  • Data mining is an important tool for the analysis
    of data in many scientific and industrial areas
  • The aim of data mining is to
  • find something interesting or valuable
  • But we don't know in advance what is interesting
    or valuable
  • So we give some criteria that interesting or
    valuable things should satisfy,
    and find all patterns satisfying them

3
Image of Pattern Mining
  • Pattern mining is the problem of finding all
    patterns in a given (possibly structured)
    database that satisfy the given constraints

[Figure: extracting interesting patterns from databases, e.g., an
XML database (persons with name, age, phone, family) and chemical
compounds (graphs of C, H, O, N atoms)]
Frequent pattern mining is the problem of enumerating
all patterns appearing frequently, i.e., at least a
given threshold number of times, in the database
4
Approach from
  • In the real world, the input database is usually
    huge, and the output patterns can also be huge,
    so efficient computation is very important
  • Much research has been done, but much of it is
    based on databases, data engineering, and modeling,
    not on algorithms
  • Ex.) how to compress the data, how to execute
    queries fast, which model is good, etc.
  • Here we want to separate the problems:
  • from the algorithmic view, what is important?
    what can we do?

5
Distinguish the Focus, Problems
  • "my algorithm is very fast for these datasets"
  •  - but the data may be very artificial, or
     include few items
  • "the algorithm might not work for huge datasets"
  •  - it is difficult to be fast for both small and
     huge data
  • We would like to distinguish the techniques and
    problems:
  •  - scalability
  •  - I/O
  •  - huge datasets
  •  - data compression
  • The techniques would then be orthogonal

6
Approach from
  • Much research has been done, but much of it is
    based on databases, data engineering, and modeling,
    not on algorithms
  • Ex.) how to compress the data, how to execute
    queries fast, which model is good, etc.
  • Here we see the problems as enumeration problems,
  • and try to clarify what kinds of techniques are
    important
  • for efficient computation, with examples on
    itemset mining

[Diagram: Good Models vs. Solvable Models]
7
From the Algorithm Theory
  • Here we focus only on algorithms, so the topics
    are:
  •  - output-sensitive computation time
  •    (bad, if it takes a long time for a small output)
  •  - memory use should depend only on the input size
  •  - computation time per iteration
  •  - reducing the input of each iteration

This is so important!!!
[Diagram: Good Models vs. Solvable Models]
8
From the Algorithm Theory
  • Here we focus only on the case that the input
    fits in memory
  •  - scalability: output-sensitive computation
     time
  •    (bad, if it takes a long time for a small output)
  •  - memory use should depend only on the input size
  •  - computation time per iteration
  •  - reducing the input of each iteration
  •    (from bottom wideness)

[Diagram: total TIME, decomposed into the number of iterations,
the time of an iteration, and I/O]
This is so important!!!
9
Bottom Wideness
  • Enumeration algorithms usually have recursive
    tree structures;
  • there are many iterations at the deeper levels

[Figure: a recursion tree; a procedure reduces the input before
each recursive call, so input size, and hence time per iteration,
shrinks toward the bottom]
10
Bottom Wideness
  • (same recursion tree as above)
  • The total computation time is already halved by
    just one reduction of the input
11
Bottom Wideness
  • (same recursion tree as above)
  • Recursively reducing the input → the computation
    time is reduced much further
12
Advantage of Bottom Wideness
  • Suppose that the recursion tree has exponentially
    many iterations at the lower levels (e.g.,
    2 × #(level i) ≤ #(level i+1))

[Recursion tree: iterations near the root cost O(n^3);
iterations at the bottom cost O(1)]
amortized computation time is O(1) per output !!
13
Advantage of Bottom Wideness
  • Suppose that the recursion tree has exponentially
    many iterations at the lower levels (e.g.,
    2 × #(level i) ≤ #(level i+1))

[Recursion tree: iterations near the root cost O(n^5);
iterations at the bottom cost O(n)]
amortized computation time is O(n) per output !!
Computation time per output depends only on the
bottom levels → reduce the computation time at the
lower levels by reducing the input
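
To make the amortization concrete, here is a short derivation (our
sketch, under the doubling assumption above; not on the original slides):

  % Let N_i be the number of iterations at level i, so 2 N_i <= N_{i+1},
  % and let N_L be the number of iterations at the bottom level L.
  \[
    \sum_{i=0}^{L} N_i \;\le\; \sum_{i=0}^{L} \frac{N_L}{2^{L-i}} \;\le\; 2 N_L .
  \]
  % The total work is thus dominated by the bottom level: if a bottom
  % iteration costs O(1) (resp. O(n)) and each iteration outputs a
  % pattern, the amortized time per output is O(1) (resp. O(n)).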
14
Frequent Itemset Mining
  • Transaction database D:
  • a database composed of transactions defined
    over an itemset E,
  • i.e., ∀t ∈ D, t ⊆ E
  •  - basket data
  •  - links of web pages
  •  - words in documents
  • A subset P of E is called an itemset
  • occurrence of P: a transaction in D including P
  • denotation Occ(P) of P: the set of occurrences of P
  • |Occ(P)| is called the frequency of P

D = { {1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9},
      {1,7,9}, {2,7,9}, {2} }
denotation of {1,2} = { {1,2,5,6,7,9}, {1,2,7,8,9} }
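
As a small illustration, here is a Python sketch (ours, not from the
slides; names like occ are illustrative) of the denotation and frequency
on this example database:

  # Occ(P) and frequency on the slide's example database.
  D = [frozenset(t) for t in
       ({1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2})]

  def occ(P, D):
      """Occ(P): the transactions of D that include itemset P."""
      return [t for t in D if P <= t]

  P = frozenset({1, 2})
  print(occ(P, D))        # the denotation of {1,2}: two transactions
  print(len(occ(P, D)))   # frequency of {1,2} = 2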
15
Frequent Itemset
  • Given a minimum support σ,
  • frequent itemset: an itemset s.t. (frequency) ≥ σ
  • (a subset of items that is included in at
    least σ transactions)
  • Ex.)

patterns included in at least 3 transactions:
{1}, {2}, {7}, {9}, {1,7}, {1,9}, {2,7}, {2,9},
{7,9}, {1,7,9}, {2,7,9}

D = { {1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9},
      {1,7,9}, {2,7,9}, {2} }
16
Techniques for Efficient Mining
  • There are many techniques for fast mining:
  •  - apriori, backtracking
     (search strategies)
  •  - down project, pruning by infrequent subsets,
     bitmap, occurrence deliver
     (speeding up iterations)
  •  - FP-tree (trie, prefix tree), filtering
     (unification), conditional (projected) database,
     trimming of database
     (database reduction; bottom wideness)
17
Search Strategies
  • Frequent itemsets form
  • a connected component in the itemset lattice
  •  - Apriori algorithms generate
  •    itemsets level-by-level
  •    + allows pruning by infrequent subsets
  •    − much memory use
  •  - Backtracking algorithms generate
  •    itemsets in a depth-first manner
  •    + small memory use
  •    + matches down project, etc.

[Figure: the lattice of subsets of {1,2,3,4}, from φ up to
{1,2,3,4}; the frequent itemsets form a connected lower region]
Apriori takes a long time and much memory when the
output is large
18
Backtracking
apriori:
  • Set k := 0, O_0 := {φ}
  • While (O_k ≠ φ):
  •   for each P ∪ {e}, P ∈ O_k:
  •     if (P ∪ {e}) − {f} ∈ O_k for all f ∈ P ∪ {e} then
  •       compute Occ(P ∪ {e})
  •       if |Occ(P ∪ {e})| ≥ σ then insert P ∪ {e} into O_{k+1}
  •   k := k+1
[Figure: the same itemset lattice from φ to {1,2,3,4}; apriori
searches it level-by-level, backtracking depth-first, with the
frequent itemsets shaded]
backtracking:
Backtrack (P, Occ(P)):
  for each e > tail(P):
    compute Occ(P ∪ {e})
    if |Occ(P ∪ {e})| ≥ σ then
      Backtrack (P ∪ {e}, Occ(P ∪ {e}))
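
A runnable Python sketch of the backtracking scheme above (our own
rendering, not the authors' code): only items larger than tail(P)
extend P, so every frequent itemset is generated exactly once.

  def backtrack(P, occ_P, items, sigma, out):
      out.append(sorted(P))
      tail = max(P) if P else -1             # tail(P); -1 for the empty set
      for e in items:
          if e <= tail:
              continue                       # only extend past tail(P)
          occ_Pe = [t for t in occ_P if e in t]   # Occ(P ∪ {e})
          if len(occ_Pe) >= sigma:
              backtrack(P | {e}, occ_Pe, items, sigma, out)

  def frequent_itemsets(D, sigma):
      items = sorted({e for t in D for e in t})
      out = []                               # includes the empty itemset
      backtrack(set(), list(D), items, sigma, out)
      return out

  D = [{1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2}]
  print(frequent_itemsets(D, 3))
  # yields the patterns of the earlier example: [1], [2], [7], [9],
  # [1,7], [1,9], [2,7], [2,9], [7,9], [1,7,9], [2,7,9] (plus [])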
19
Speeding up Iterations
  • The bottleneck in an iteration is computing Occ(P ∪ {e})
  •  - down project: Occ(P ∪ {e}) = Occ(P) ∩ Occ(e)
  •    → O(||D_{>tail(P)}||): the size of the part of the
     database with items larger than tail(P)
  •  - pruning by infrequent subsets
  •    → |P| search queries per candidate
  •  - bitmap: compute Occ(P) ∩ Occ(e) by AND
     operations
  •    → (n − tail(P)) × m/32 operations
  •  - occurrence deliver: compute Occ(P ∪ {e}) for all e by
     one scan of D(P)
  •    → O(||D(P)_{>tail(P)}||), where D(P) is the set of
     transactions including P

[Figure: the database D as an m × n binary matrix
(m transactions, n items)]
bitmap is slow if the database is sparse; pruning is
slow for huge outputs; occurrence deliver is fast
if the threshold (minimum support) is small
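
A Python sketch of the bitmap idea (ours; the real implementations use
machine words): represent each Occ(e) as an m-bit integer with one bit
per transaction, so intersections become a single AND per word.

  def bitmaps(D):
      """Map each item e to a bitmask of the transactions containing e."""
      bits = {}
      for i, t in enumerate(D):
          for e in t:
              bits[e] = bits.get(e, 0) | (1 << i)
      return bits

  D = [{1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2}]
  b = bitmaps(D)
  occ_17 = b[1] & b[7]                    # Occ({1,7}) as a bitmask
  print(bin(occ_17), bin(occ_17).count("1"))   # 3 occurrences: A, C, D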
20
Occurrence Deliver
  • Compute the denotations of P ∪ {i} for all items i
    at once, by one scan of the occurrences of P

[Table: transactions A = {1,2,5,6,7,9}, B = {2,3,4,5},
C = {1,2,7,8,9}, D = {1,7,9}, E = {2,7,9}, F = {2};
for P = {1,7} (occurrences A, C, D), each occurrence is
delivered to the bucket of every item it contains, e.g.
item 2 ← A, C; item 5 ← A; item 8 ← C; item 9 ← A, C, D]

Check the frequency of all items to be added in
time linear in the database size
By generating the recursive calls in reverse
direction, we can re-use the memory
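
A Python sketch of occurrence deliver (ours; here restricted to items
larger than tail(P), as used in the backtracking): one scan of Occ(P)
puts each occurrence into the bucket of every candidate extension,
giving all the sets Occ(P ∪ {e}) at once in time linear in ||D(P)||.

  from collections import defaultdict

  def occurrence_deliver(occ_P, tail):
      buckets = defaultdict(list)       # item e -> Occ(P ∪ {e})
      for t in occ_P:
          for e in t:
              if e > tail:              # only items past tail(P) extend P
                  buckets[e].append(t)
      return buckets

  # P = {1,7} on the slide's database: occurrences are A, C, D
  D = [frozenset(t) for t in
       ({1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2})]
  occ_P = [t for t in D if {1, 7} <= t]
  print({e: len(ts) for e, ts in occurrence_deliver(occ_P, 7).items()})
  # {9: 3, 8: 1} -> only {1,7,9} stays frequent at sigma = 3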
21
Database Reductions
  • A conditional database reduces the database, by
    removing unnecessary items and transactions, for
    the deeper levels

Ex.) σ = 3:
{1,3,5}, {1,3,5}, {1,3,5}, {1,2,5,6}, {1,4,6}, {1,2,6}
  → filtering: remove infrequent items and items
    included in all transactions (linear time)
{3,5}, {3,5}, {3,5}, {5,6}, {6}, {6}
  → filtering: unify identical transactions
    (O(||D|| log ||D||) time)
{3,5} ×3, {5,6} ×1, {6} ×2

FP-tree (trie, prefix tree):
compact if the database is dense and large;
infrequent items are removed, and common prefixes
are automatically unified
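
A Python sketch of the two filtering steps on this slide (ours; a
Counter of frozensets stands in for the unified transactions with
their multiplicities):

  from collections import Counter

  def reduce_database(D, sigma):
      freq = Counter(e for t in D for e in t)
      # remove infrequent items and items included in all transactions
      keep = {e for e, c in freq.items() if sigma <= c < len(D)}
      filtered = [frozenset(t & keep) for t in D]   # linear time
      return Counter(filtered)                      # unify same transactions

  D = [{1,3,5}, {1,3,5}, {1,3,5}, {1,2,5,6}, {1,4,6}, {1,2,6}]
  print(reduce_database(D, 3))
  # {3,5} has multiplicity 3, {5,6} multiplicity 1, {6} multiplicity 2,
  # matching the slide (items 2 and 4 are infrequent; item 1 is in all)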
22
Summary of Techniques
  • The database is dense and large even at the bottom
    levels of the computation ⇐ the support is large
  • The number of output solutions is huge ⇐ the
    support is small
  • Prediction:
  •  - apriori will be slow when the support is small
  •  - conditional databases are fast when the support
     is small
  •  - bitmap will be slow for sparse datasets
  •  - FP-tree will be a bit slow for sparse datasets,
  •    and fast for large supports
23
Results from FIMI 04 (sparse datasets)
[Plots: running time vs. minimum support on two sparse datasets for
bitmap, apriori, FP-tree, and conditional-database implementations;
the conditional-database and FP-tree curves differ roughly as
O(n) vs. O(n log n)]
  • Conditional database is good; bitmap is slow
  • FP-tree → large supports; occurrence deliver →
    small supports
24
Results on Dense Datasets
[Plots: running time vs. minimum support on two dense datasets for
bitmap, apriori, FP-tree, and conditional-database implementations;
the number of nodes in the FP-tree is about ||D (filtered)||/6]
  • Apriori is still slow for middle supports;
  • FP-tree is good
25
Summary on Computation
  • We can understand the reasons for efficiency from
    the algorithmic view:
  •  - reduce the input of each iteration, exploiting
     bottom wideness
  •  - reduce the computation time per iteration
  • (probably, a combination of conditional database,
    patricia tree, and occurrence deliver will be
    good)
  • We can make similar observations for other pattern
    mining problems:
  • sequence mining, string mining, tree mining,
    graph mining, etc.

Next we look at closed patterns, each of which
represents a group of similar patterns; we begin
with itemsets
26
Modeling: Closed Itemsets [Pasquier et al. 1999]
  • Usually, the set of frequent itemsets is huge when
    we mine in depth
  • → we want to decrease the number of itemsets in
    some way
  • There are many ways to do this, e.g., giving
    scores, grouping similar itemsets, looking at
    other parameters, etc.

But we would like to approach it from theory.
Here we introduce closed patterns: consider the
itemsets having the same denotation → we would
say they carry the same information; we focus
only on the maximal one among them, called the
closed pattern (= the intersection of the
occurrences in the denotation)
27
Example of Closed Itemset

patterns included in at least 3 transactions:
{1}, {2}, {7}, {9}, {1,7}, {1,9}, {2,7}, {2,9},
{7,9}, {1,7,9}, {2,7,9}
(the closed ones among them are {2}, {7,9},
{1,7,9}, and {2,7,9})

D = { {1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9},
      {1,7,9}, {2,7,9}, {2} }

  • In general, # frequent itemsets ≥
    # frequent closed itemsets
  • In particular, ≫ holds if the database has some
    structure
  • (databases with some structure tend to have huge
    numbers of frequent itemsets, so this is an
    advantage)
28
Difference in Numbers of Itemsets
  • # frequent itemsets ≫ # frequent closed itemsets
  • when the threshold σ is small
29
Closure Extension of Itemsets
  • Usual backtracking does not work for closed
    itemsets,
  • because there can be big gaps between
    closed itemsets in the lattice
  • On the other hand, any closed itemset
  • can be obtained from another by
  • adding an item and
  • taking the closure (maximal)
  •  - the closure of P is the closed itemset
  •    having the same denotation as P,
  •    computed by intersecting the transactions in Occ(P)

[Figure: the itemset lattice from φ to {1,2,3,4}, with the closed
itemsets marked and closure-extension edges between them]
This gives an adjacency structure on the closed
itemsets, so we can enumerate them by graph search,
but that uses memory for the discovered itemsets
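
A Python sketch of the closure operation (ours; illustrative names):
intersect all occurrences of P to obtain the unique maximal itemset
with the same denotation.

  def closure(P, D):
      """Closure of P: the intersection of all transactions in Occ(P)."""
      occ_P = [t for t in D if P <= t]
      return frozenset.intersection(*occ_P) if occ_P else None

  D = [frozenset(t) for t in
       ({1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2})]
  print(sorted(closure(frozenset({1, 2}), D)))   # [1, 2, 7, 9]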
30
PPC Extension
  • Closure extension gives us an acyclic adjacency
    structure,
  • but it's not enough to get a memory-efficient
    algorithm
  • (we would need to store the discovered itemsets in
    memory)
  • We introduce PPC (prefix-preserving closure)
    extension to obtain a tree structure

PPC extension:
a closure extension P' obtained from P ∪ {e} is a
PPC extension ⇔ the prefixes of P and P' are the
same (below e)
Any closed itemset is a PPC extension of
exactly one other closed itemset
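
A Python sketch of enumeration by PPC extension (our rendering: the
core-index bookkeeping follows the LCM algorithm of Uno et al. and is
not spelled out on this slide; function names are ours):

  # PPC extensions of a closed itemset P whose generating item was `core`.
  def ppc_children(P, core, D, sigma):
      items = sorted({x for t in D for x in t})
      for e in items:
          if e <= core or e in P:
              continue
          occ = [t for t in D if P | {e} <= t]      # Occ(P ∪ {e})
          if len(occ) < sigma:
              continue
          Q = frozenset.intersection(*occ)          # closure(P ∪ {e})
          # prefix-preserving test: Q and P must agree below e
          if {x for x in Q if x < e} == {x for x in P if x < e}:
              yield Q, e

  # DFS over the PPC-extension tree rooted at closure(φ);
  # each frequent closed itemset is visited exactly once.
  def enumerate_closed(D, sigma):
      out, stack = [], []
      if len(D) >= sigma:
          stack.append((frozenset.intersection(*D), 0))  # root: closure(φ)
      while stack:
          P, core = stack.pop()
          out.append(sorted(P))
          stack.extend(ppc_children(P, core, D, sigma))
      return out

  D = [frozenset(t) for t in
       ({1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2})]
  print(enumerate_closed(D, 3))
  # [[], [7, 9], [2], [2, 7, 9], [1, 7, 9]]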
31
Example of PPC Extension

D = { {1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9},
      {1,7,9}, {2,7,9}, {2} }

[Figure: the closed itemsets φ, {2}, {7,9}, {1,7,9}, {2,7,9},
{2,5}, {1,2,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,2,5,6,7,9};
the closure-extension edges form an acyclic graph, and the
ppc-extension edges among them form a tree]

  • closure extension
  • → acyclic
  • ppc extension
  • → tree
32
Example of PPC Extension

D = { {1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9},
      {1,7,9}, {2,7,9}, {2} }

  • {1,2,7,9}, {1,2,7}, and {1,2} all occur exactly in
  • {1,2,5,6,7,9} and {1,2,7,8,9},
    so their common closure is {1,2,7,9}
  • {1,7,9}, {1,7}, and {1} all occur exactly in
  • {1,7,9},
  • {1,2,5,6,7,9}, and {1,2,7,8,9},
    so their common closure is {1,7,9}

[Figure: the same tree of closed itemsets as on the previous slide]
33
For Efficient Computation
  • Computing the closure takes a long time
  • We use database reduction, based on the fact that
  • if P' is the PPC extension of P by item e, and P''
    is the PPC extension of P' by item f,
  • then e < f; thus the prefix (the items below e) is
    needed only for taking intersections!

[Example, e = 5: in the transactions {1,2,5,6,7,9}, {2,3,4,5},
{1,2,5,7,8,9}, {1,5,6,7}, {2,5,7,9}, {2,3,5,6,7,8} containing 5,
only the parts above 5 need to be kept explicitly; the prefixes
are used only through their intersection]
34
Experiments vs. Frequent Itemset Mining (sparse)
  • Computation time per itemset is very stable
  • There is no big difference in computation time
35
Experiments vs. Frequent Itemset Mining (dense)
  • Computation time per itemset is very stable
  • There is no big difference in computation time
36
Comparison to Other Methods
  • There are roughly two methods to enumerate
    closed patterns:
  • frequent pattern base: enumerate all frequent
    patterns, and output only the closed ones (+ some
    pruning);
  • closedness is checked by keeping all discovered
    itemsets in memory
  • closure base: compute closed patterns by taking
    closures, and avoid duplication by keeping all
    discovered itemsets in memory

If the solution set is small, the frequent pattern
base is fast, since the search for checking
closedness takes very little time
37
vs. Other Implementations (sparse)
  • Large minimum support → frequent pattern base
  • Small minimum support → PPC extension
38
vs. Other Implementations (dense)
  • Small minimum support → PPC extension and
    database reduction are good

39
Extending Closed Patterns
  • There are several mining problems for which we
    can introduce closed patterns (the union of
    occurrences must be unique!!)
  •  - ranked trees (labeled trees without
     siblings of the same label)
  •  - motifs (strings with wildcards)

[Examples: a ranked tree with labels A, B, C; the motif
AB??EF?H, which matches the strings ABCDEFGH and ABZZEFZH]

For these problems, PPC extension also works
similarly, with conditional databases and
occurrence deliver
40
Conclusion
  • We overviewed techniques for frequent pattern
    mining as enumeration algorithms, and showed that
  •  - the complexity of one iteration and bottom
     wideness are important
  • We showed that the closed pattern is probably a
    valuable model,
  • and can be enumerated efficiently

Future work:
  • Develop efficient algorithms and implementations
    for other basic mining problems
  • Extend the class of problems in which closed
    patterns work well