
Approach to Data Mining from Algorithm and Computation

- Takeaki Uno, ETH Switzerland, NII Japan
- Hiroki Arimura, Hokkaido University, Japan

Frequent Pattern Mining

- Data mining is an important tool for the analysis of data in many scientific and industrial areas
- The aim of data mining is to find something interesting or valuable
- But we don't know in advance what is interesting or valuable
- So, we give criteria that interesting or valuable things would satisfy, and find all patterns satisfying them

Image of Pattern Mining

- Pattern mining is the problem of finding all patterns in a given (possibly structured) database that satisfy given constraints

[Figure: databases (an XML database of persons with name, age, phone, family; chemical compounds built from C, H, O, N atoms) → extract interesting patterns]

Frequent pattern mining is the problem of enumerating all patterns that appear frequently, i.e., at least a given threshold number of times, in the database

Approach from Algorithms

- In the real world, the input database is usually huge, and the output patterns are also huge, so efficient computation is very important
- Much research has been done, but much of it is based on databases, data engineering, and modeling, not on algorithms
- Ex.) how to compress the data, how to execute queries fast, which model is good, etc.
- Here we want to separate the problems: from the algorithmic view, what is important? what can we do?

Distinguish the Focus and the Problems

- "My algorithm is very fast for these datasets"
  - but the data may be very artificial, or include few items
  - the algorithm might not work for huge datasets
  - it is difficult to be fast for both small and huge data
- We would like to distinguish the techniques and the problems:
  - scalability
  - I/O
  - huge datasets
  - data compression
- These techniques would be orthogonal

Approach from Algorithms

- Much research has been done, but much of it is based on databases, data engineering, and modeling, not on algorithms
- Ex.) how to compress the data, how to execute queries fast, which model is good, etc.
- Here we see the problems as enumeration problems, and try to clarify what kinds of techniques are important for efficient computation, with examples on itemset mining

[Diagram: Good Models vs. Solvable Models]

From the Algorithm Theory

- Here we focus only on algorithms, and on the case that the input fits in memory; the topics are:
  - scalability: output-sensitive computation time (bad if the time is long for a small output)
  - memory use should depend only on the input size
  - the computation time for an iteration
  - reducing the input of each iteration (from bottom wideness)

TIME = (#iterations) × (time of an iteration) + I/O

This is so important!!!

Bottom Wideness

- Enumeration algorithms usually have recursive tree structures; there are many iterations at the deeper levels
- Procedure: reduce the input of the recursive calls

[Figure: recursion tree; the input size, and hence the time per iteration, shrinks at the deeper levels]

- The total computation time is already halved by just one reduction of the input
- Recursively reducing the input ⇒ the computation time is reduced much more

Advantage of Bottom Wideness

- Suppose the recursion tree has exponentially many iterations at the lower levels (e.g., 2 × (#iterations at level i) ≤ #iterations at level i+1)

[Recursion tree: O(n³) at the root, O(1) at the leaves]
⇒ the amortized computation time is O(1) per output!!

[Recursion tree: O(n⁵) at the root, O(n) at the leaves]
⇒ the amortized computation time is O(n) per output!!

- The computation time per output depends only on the bottom levels ⇒ reduce the computation time at the lower levels by reducing the input

Frequent Itemset Mining

- Transaction database D: a database composed of transactions defined on an itemset E, i.e., ∀t ∈ D, t ⊆ E
  - basket data
  - links of web pages
  - words in documents
- A subset P of E is called an itemset
- occurrence of P: a transaction in D including P
- denotation Occ(P) of P: the set of occurrences of P
- |Occ(P)| is called the frequency of P

D:
{1,2,5,6,7,9}
{2,3,4,5}
{1,2,7,8,9}
{1,7,9}
{2,7,9}
{2}

denotation of {1,2} = { {1,2,5,6,7,9}, {1,2,7,8,9} }
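As a concrete illustration (ours, not from the slides), a minimal Python sketch of Occ and the frequency on this database:

```python
# Minimal sketch: the denotation Occ(P) and the frequency |Occ(P)|,
# following the definitions above.
D = [frozenset(t) for t in ([1,2,5,6,7,9], [2,3,4,5], [1,2,7,8,9],
                            [1,7,9], [2,7,9], [2])]

def occ(P, D):
    """Occ(P): the transactions of D that include P."""
    return [t for t in D if P <= t]

P = frozenset({1, 2})
print(occ(P, D))       # the two transactions {1,2,5,6,7,9} and {1,2,7,8,9}
print(len(occ(P, D)))  # frequency of {1,2} = 2
```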

Frequent Itemset

- Given a minimum support σ,
- frequent itemset: an itemset whose frequency is at least σ (a subset of items included in at least σ transactions)
- Ex.) the itemsets included in at least 3 transactions of D:
  {1}, {2}, {7}, {9}, {1,7}, {1,9}, {2,7}, {2,9}, {7,9}, {1,7,9}, {2,7,9}

Techniques for Efficient Mining

- There are many techniques for fast mining:
  - for the search strategy: apriori, backtracking
  - for speeding up iterations: down project, pruning by infrequent subsets, bitmap, occurrence deliver
  - for database reduction / bottom wideness: FP-tree (trie, prefix tree), filtering (unification), conditional (projected) databases, trimming of the database

Search Strategies

- The frequent itemsets form a connected component of the itemset lattice
- Apriori algorithms generate itemsets level by level
  ⇒ pruning by infrequent subsets
  ⇒ much memory use
- Backtracking algorithms generate itemsets in a depth-first manner
  ⇒ small memory use
  ⇒ matches down project, etc.

[Figure: the itemset lattice from ∅ up to {1,2,3,4}; the frequent itemsets occupy its lower part]

Apriori takes a long time and much memory when the output is large

Backtracking

apriori:

  Set k := 0, O0 := {∅}
  While (Ok ≠ ∅)
    for each P ∪ {e}, P ∈ Ok
      if (P ∪ {e}) − {f} ∈ Ok for all f ∈ P ∪ {e} then
        compute Occ(P ∪ {e})
        if |Occ(P ∪ {e})| ≥ σ then insert P ∪ {e} into Ok+1
    k := k + 1

backtracking:

  Backtrack(P, Occ(P))
    output P
    for each e > tail(P)
      compute Occ(P ∪ {e})
      if |Occ(P ∪ {e})| ≥ σ then Backtrack(P ∪ {e}, Occ(P ∪ {e}))
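A runnable Python sketch of this backtracking scheme (our reconstruction; the function names are ours, and Occ(P∪{e}) is computed by down project):

```python
# Backtracking frequent itemset mining. Items are added in increasing
# order (only e > tail(P)), so every frequent itemset is generated once;
# Occ(P∪{e}) is computed by down project: Occ(P) ∩ Occ({e}).
def mine(D, sigma):
    items = sorted({e for t in D for e in t})
    occ_item = {e: {i for i, t in enumerate(D) if e in t} for e in items}
    results = []

    def backtrack(P, occ_P):
        results.append((P, len(occ_P)))
        tail = P[-1] if P else 0
        for e in items:
            if e <= tail:
                continue
            occ_Pe = occ_P & occ_item[e]   # down project
            if len(occ_Pe) >= sigma:
                backtrack(P + [e], occ_Pe)

    backtrack([], set(range(len(D))))
    return results

D = [{1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2}]
for P, freq in mine(D, 3):
    if P:
        print(P, freq)   # prints the 11 frequent itemsets listed earlier
```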

Speeding Up Iterations

- The bottleneck of an iteration is computing Occ(P ∪ {e})
  - down project: Occ(P ∪ {e}) = Occ(P) ∩ Occ({e})
    ⇒ O(||D_{>tail(P)}||): the size of the part of the database larger than tail(P)
  - pruning by infrequent subsets
    ⇒ |P| search queries, O(c·||D_{>tail(P)}||)
  - bitmap: compute Occ(P) ∩ Occ({e}) by AND operations
    ⇒ (n − tail(P)) × m/32 operations, for an n-item, m-transaction database
  - occurrence deliver: compute Occ(P ∪ {e}) for all e by one scan of D(P)_{>tail(P)}
    ⇒ O(||D(P)_{>tail(P)}||), where D(P) is the set of transactions including P
- bitmap is slow if the database is sparse; pruning is slow for huge outputs; occurrence deliver is fast if the threshold (minimum support) is small
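To make the bitmap technique concrete, a small sketch (ours): each item's denotation is a bit vector over the transactions, and the intersection is a word-wise AND (a Python integer stands in for the array of 32-bit words):

```python
# Bitmap sketch: bit i of bitmap[e] is 1 iff transaction i contains e,
# so Occ(P) ∩ Occ({e}) is a single AND of machine words.
D = [{1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2}]
items = sorted({e for t in D for e in t})
bitmap = {e: sum(1 << i for i, t in enumerate(D) if e in t) for e in items}

occ_P = bitmap[1] & bitmap[7]    # Occ({1,7}) as a bit vector
occ_Pe = occ_P & bitmap[9]       # Occ({1,7,9}) by one AND
print(bin(occ_Pe).count("1"))    # frequency = popcount = 3
```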

Occurrence Deliver

- Compute the denotations of P ∪ {i} for all items i at once, by one scan of the occurrences of P

A: 1, 2, 5, 6, 7, 9
B: 2, 3, 4, 5
C: 1, 2, 7, 8, 9
D: 1, 7, 9
E: 2, 7, 9
F: 2

P = {1,7}, Occ(P) = {A, C, D}: scanning A, C, D once and appending each occurrence to the bucket of every item it contains gives, e.g., bucket(9) = {A, C, D} and bucket(8) = {C}

- Check the frequency of every item that can be added, in time linear in the database size
- Generating the recursive calls in the reverse direction, we can reuse the memory
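A minimal Python sketch of occurrence deliver (ours): one scan of the transactions in Occ(P) fills a bucket for every addible item, producing all the Occ(P∪{e}) at once:

```python
from collections import defaultdict

# Occurrence deliver: compute Occ(P∪{e}) for all items e > tail(P)
# by a single scan of the occurrences of P.
def occurrence_deliver(occ_P, D, tail):
    buckets = defaultdict(list)
    for i in occ_P:              # scan each occurrence of P once
        for e in D[i]:
            if e > tail:         # only items that may still be added
                buckets[e].append(i)
    return buckets               # buckets[e] == Occ(P ∪ {e})

D = [[1,2,5,6,7,9], [2,3,4,5], [1,2,7,8,9], [1,7,9], [2,7,9], [2]]
occ_P = [0, 2, 3]                # Occ({1,7}) = {A, C, D}
print(dict(occurrence_deliver(occ_P, D, tail=7)))
# {9: [0, 2, 3], 8: [2]}
```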

Database Reductions

- The conditional database reduces the database by removing the items and transactions that are unnecessary at the deeper levels

Ex.) σ = 3:
{1,3,5}, {1,3,5}, {1,3,5}, {1,2,5,6}, {1,4,6}, {1,2,6}

filtering (remove infrequent items and items included in all transactions) ⇒ linear time:
{3,5}, {3,5}, {3,5}, {5,6}, {6}, {6}

filtering (unify identical transactions) ⇒ O(||D|| log ||D||) time:
{3,5} ×3, {5,6}, {6} ×2

- FP-tree (prefix tree): compact if the database is dense and large; infrequent items are removed, and identical transactions are unified automatically
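A sketch of the two filtering steps in Python (ours; σ = 3 as in the example above, with multiplicities recording how many identical transactions were unified):

```python
from collections import Counter

# Database reduction: remove infrequent items and items included in all
# transactions (linear time), then unify identical transactions
# (sorting-based, O(||D|| log ||D||)). A real miner would remember the
# removed "included in all" items, since they belong to every pattern.
def reduce_db(D, sigma):
    freq = Counter(e for t in D for e in t)
    keep = {e for e, c in freq.items() if sigma <= c < len(D)}
    filtered = [tuple(sorted(set(t) & keep)) for t in D]
    return Counter(filtered)     # reduced transaction -> multiplicity

D = [[1,3,5], [1,3,5], [1,3,5], [1,2,5,6], [1,4,6], [1,2,6]]
print(reduce_db(D, 3))
# Counter({(3, 5): 3, (6,): 2, (5, 6): 1})
```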

Summary of Techniques

- The database stays dense and large even at the bottom levels of the computation ⇔ the support is large; the output is huge ⇔ the support is small
- Prediction:
  - apriori will be slow when the support is small
  - conditional databases are fast when the support is small
  - bitmaps will be slow for sparse datasets
  - FP-trees will be a bit slow for sparse datasets, and fast for large supports

Results from FIMI 04 (sparse datasets)

[Charts: running times of bitmap, apriori, FP-tree, and conditional-database implementations; the gaps behave like O(n) vs. O(n log n)]

- Conditional databases are good; bitmaps are slow
- FP-tree ⇒ large supports; occurrence deliver ⇒ small supports

Results on Dense Datasets

[Charts: running times of bitmap, apriori, FP-tree, and conditional-database implementations]

- Apriori is still slow for middle supports; FP-tree is good
- #nodes in the FP-tree ≈ ||D (filtered)|| / 6

Summary on Computation

- We can understand the reasons for efficiency from the algorithmic view:
  - reduce the input of each iteration, exploiting bottom wideness
  - reduce the computation time of an iteration (probably a combination of conditional databases, patricia trees, and occurrence deliver will be good)
- We can observe other pattern mining problems similarly: sequence mining, string mining, tree mining, graph mining, etc.

Next we look at closed patterns, each of which represents a group of similar patterns; we begin with itemsets

Modeling: Closed Itemsets [Pasquier et al. 1999]

- Usually the set of frequent itemsets is huge when we mine in depth ⇒ we want to decrease the number of itemsets in some way
- There are many ways to do this, e.g., giving scores, grouping similar itemsets, looking at other parameters, etc.
- But we would like to approach it from theory

Here we introduce closed patterns. Consider the itemsets having the same denotation ⇒ we would say they carry the same information. We focus only on the maximal one among them, called the closed pattern (= the intersection of the occurrences in the denotation)
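A small Python sketch of taking the closure (ours), directly following the definition: intersect the transactions in the denotation:

```python
from functools import reduce

# closure(P): the maximal itemset with the same denotation as P,
# computed as the intersection of all transactions including P.
def closure(P, D):
    occ = [t for t in D if P <= t]
    return frozenset(reduce(lambda a, b: a & b, occ)) if occ else None

D = [frozenset(t) for t in ([1,2,5,6,7,9], [2,3,4,5], [1,2,7,8,9],
                            [1,7,9], [2,7,9], [2])]
print(sorted(closure(frozenset({1}), D)))    # [1, 7, 9]
print(sorted(closure(frozenset({2, 7}), D))) # [2, 7, 9]
```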

Example of Closed Itemset

the itemsets included in at least 3 transactions of D:
{1}, {2}, {7}, {9}, {1,7}, {1,9}, {2,7}, {2,9}, {7,9}, {1,7,9}, {2,7,9}
the closed ones among them: {2}, {7,9}, {1,7,9}, {2,7,9}

- In general, #frequent itemsets ≥ #frequent closed itemsets
- In particular, >> holds if the database has some structure
- (databases with some structure tend to have huge numbers of frequent itemsets, so this is an advantage)

Difference in the Numbers of Itemsets

- #frequent itemsets >> #closed itemsets when the threshold σ is small

Closure Extension of Itemset

- The usual backtracking does not work for closed itemsets, because there may be big gaps between closed itemsets in the lattice
- On the other hand, any closed itemset is obtained from another one by adding an item and taking the closure (the maximal itemset with the same denotation)
  - the closure of P is the closed itemset having the same denotation as P, computed by taking the intersection of the transactions in Occ(P)

[Figure: the itemset lattice; closure extensions jump from closed itemset to closed itemset]

This is an adjacency structure defined on the closed itemsets, so we can perform a graph search on it, at the cost of using memory

PPC extension

- Closure extension gives us an acyclic adjacency structure, but it is not enough for a memory-efficient algorithm (we would need to store the discovered itemsets in memory)
- We introduce PPC extension to obtain a tree structure

A closure extension P′ obtained from P ∪ {e} is a PPC extension ⇔ the prefixes of P and P′ (the parts smaller than e) are the same

Any closed itemset is a PPC extension of exactly one other closed itemset
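An LCM-style Python sketch of enumeration by PPC extension (our reconstruction of the scheme; in LCM the added item e must also exceed the "core index" of P, which the recursion tracks here by passing e down):

```python
from functools import reduce

# Enumerate closed itemsets by PPC extension: from a closed itemset P,
# add an item e, take the closure Q, and recurse only if the prefix of
# Q below e equals that of P. Each closed itemset is reached exactly once.
def occ(P, D):
    return [t for t in D if P <= t]

def closure(P, D):
    ts = occ(P, D)
    return frozenset(reduce(lambda a, b: a & b, ts)) if ts else None

def enum_closed(D, sigma):
    items = sorted({e for t in D for e in t})
    out = []

    def rec(P, core):
        out.append(P)
        for e in items:
            if e <= core or e in P:
                continue
            Q = closure(P | {e}, D)
            if Q is None or len(occ(Q, D)) < sigma:
                continue
            # prefix-preserving check: the parts below e must agree
            if {i for i in Q if i < e} == {i for i in P if i < e}:
                rec(Q, e)

    root = closure(frozenset(), D)
    if root is not None and len(occ(root, D)) >= sigma:
        rec(root, 0)
    return out

D = [frozenset(t) for t in ([1,2,5,6,7,9], [2,3,4,5], [1,2,7,8,9],
                            [1,7,9], [2,7,9], [2])]
print([sorted(C) for C in enum_closed(D, 3)])
# [[], [1, 7, 9], [2], [2, 7, 9], [7, 9]]
```

On the running example with σ = 3 this yields exactly the frequent closed itemsets listed earlier, plus the root closure(∅).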

Example of PPC Extension

D:
{1,2,5,6,7,9}
{2,3,4,5}
{1,2,7,8,9}
{1,7,9}
{2,7,9}
{2}

[Figure: the closed itemsets of D — ∅, {2}, {7,9}, {1,7,9}, {2,7,9}, {2,5}, {1,2,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,2,5,6,7,9} — with closure extensions drawn as an acyclic graph and PPC extensions as a tree]

- closure extension ⇒ acyclic
- PPC extension ⇒ tree

Example of PPC Extension

- Occ({1,2,7,9}) = Occ({1,2,7}) = Occ({1,2}) = { {1,2,5,6,7,9}, {1,2,7,8,9} } ⇒ the closure of {1,2} is {1,2,7,9}
- Occ({1,7,9}) = Occ({1,7}) = Occ({1}) = { {1,7,9}, {1,2,5,6,7,9}, {1,2,7,8,9} } ⇒ the closure of {1} is {1,7,9}

[Figure: the PPC-extension tree over the closed itemsets ∅, {2}, {7,9}, {1,7,9}, {2,7,9}, {2,5}, {1,2,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,2,5,6,7,9}]

For Efficient Computation

- Computing the closure takes a long time
- We use database reduction, based on the fact that if P′ is a PPC extension of P by P ∪ {e}, and P′′ is a PPC extension of P′ by P′ ∪ {f}, then e < f; thus the prefix is used only for the intersection!

[Example: reducing the database with respect to e = 5; the items smaller than e need to be kept only as far as the prefix intersection requires]

Experiment vs. Frequent Itemset Mining (sparse)

- The computation time per itemset is very stable
- There is no big difference in computation time

Experiment vs. Frequent Itemset Mining (dense)

- The computation time per itemset is very stable
- There is no big difference in computation time

Comparison to Other Methods

- There are roughly two methods to enumerate closed patterns:
  - frequent pattern base: enumerate all frequent patterns and output only the closed ones (+ some pruning); closedness is checked by keeping all discovered itemsets in memory
  - closure base: compute closed patterns by closures, and avoid duplication by keeping all discovered itemsets in memory
- If the solution set is small, the frequent pattern base is fast, since the searches for checking closedness take very little time

vs. Other Implementations (sparse)

- Large minimum support ⇒ frequent pattern base wins
- Small minimum support ⇒ PPC extension wins

vs. Other Implementations (dense)

- Small minimum support ⇒ PPC extension and database reduction are good

Extending Closed Patterns

- There are several mining problems for which we can introduce closed patterns (the union of the occurrences is unique!!)
  - ranked trees (labeled trees without siblings of the same label)
  - motifs (strings with wildcards)

Ex.) the motif AB??EF?H matches both ABCDEFGH and ABZZEFZH

[Figure: a ranked tree with labels A, B, C]

For these problems, PPC extension also works similarly, with conditional databases and occurrence deliver

Conclusion

- We overviewed the techniques for frequent pattern mining as enumeration algorithms, and showed that the complexity of one iteration and bottom wideness are important
- We showed that closed patterns are probably a valuable model, and can be enumerated efficiently

Future work

- Develop efficient algorithms and implementations for other basic mining problems
- Extend the class of problems in which closed patterns work well