DCI, a hybrid algorithm for counting frequent sets

1
DCI, a hybrid algorithm for counting frequent sets

S. Orlando, P. Palmerini, R. Perego
CNUCE-CNR, Pisa
Università Ca' Foscari, Venezia

2
Goals

Designing efficient and scalable DM algorithms
for Clusters of Workstations
exploiting parallelism at various levels
considering the particular features of the target
architecture
Focus on adaptiveness
dynamic policies for load balancing,
partitioning, and scheduling

3
Recent Work on DM

Parallel k-means for SMP Clusters
Implementation issues in the design of I/O
intensive data mining applications on clusters of
workstations. Proc. of the 3rd Workshop on High
Performance Data Mining, Cancun, Mexico, May 5th,
2000.
Parallel C4.5 on COWs
Frequent Patterns
Enhancing the Apriori algorithm for frequent set
counting. Submitted to DaWaK 2001.
A Hybrid Algorithm for Counting Frequent Sets.
Submitted to ACM KDD 2001.

4
Algorithms for FSC

BFS (level-wise, strictly iterative)
Counting: Apriori (horizontal DB)
Intersecting: Partition (vertical DB)
DFS (divide & conquer, recursive)
Counting: FP-growth (horizontal DB)
Intersecting: MaxEclat (vertical DB)

5
DCI: Direct Counting & Intersecting

BFS, hybrid, level-wise algorithm
Counting-based during early iterations
Innovative method for storing candidates and
counting their support
Effective pruning of transactions
Intersection-based when database size fits into
the main memory
Fully optimized k-way intersections

6
Notation
7
DCI direct counting

Apriori hash tree is inefficient for small k
few levels → few candidate partitions
DCI is based on contiguous and directly
accessible data structures to enhance locality
and minimize the overhead of pointer
dereferencing

8
DCI direct counting, k = 2

C2 is exactly F1 × F1
Array COUNTS[|C2|]
OPMPH (order-preserving minimal perfect hash) function
$\Delta_2 : C_2 \to \{1, \dots, \binom{m_2}{2}\}$

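The direct-access idea for k = 2 can be illustrated with a minimal sketch that maps every pair of items to its own slot in a flat counter array. It assumes items have already been renumbered 0..m-1 (0-based, unlike the 1-based range on the slide). The names `pair_index` and `count_pairs` are hypothetical, and the closed-form index is one standard order-preserving perfect hash for pairs, not necessarily the exact function used by the authors.

```python
from itertools import combinations


def pair_index(c1: int, c2: int, m: int) -> int:
    """Map a pair 0 <= c1 < c2 < m to a unique slot in 0 .. m*(m-1)/2 - 1.

    The mapping preserves the lexicographic order of the pairs.
    """
    # Pairs whose first item is smaller than c1 fill the first
    # c1*m - c1*(c1+1)/2 slots; (c2 - c1 - 1) is the offset in c1's block.
    return c1 * m - c1 * (c1 + 1) // 2 + (c2 - c1 - 1)


def count_pairs(transactions, m):
    """Count the support of every 2-itemset using one flat array."""
    counts = [0] * (m * (m - 1) // 2)
    for t in transactions:
        for c1, c2 in combinations(sorted(set(t)), 2):
            counts[pair_index(c1, c2, m)] += 1
    return counts
```

No hash tree and no pointer chasing is involved: each pair found in a transaction costs one arithmetic expression and one array increment.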
9
DCI direct counting, k > 2

A limited and directly accessible PREFIX tree
makes it possible to select contiguous regions
of Ck sharing the same prefix
The lexicographic order of t and Ck makes it
possible to check itemset inclusion efficiently
High locality
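A rough sketch of prefix-based counting for k > 2 follows. It assumes the candidates are kept as a lexicographically sorted list of tuples and that the prefix structure is keyed on the first two items. A Python dict stands in for the directly accessible table, and a set lookup replaces the ordered scan of t suggested on the slide. The function names are hypothetical.

```python
from itertools import combinations


def build_prefix_index(candidates):
    """candidates: lexicographically sorted list of k-item tuples.

    Returns {2-item prefix: (start, end)} delimiting the contiguous
    region of `candidates` that shares that prefix.
    """
    index = {}
    for pos, cand in enumerate(candidates):
        start, _ = index.get(cand[:2], (pos, pos))
        index[cand[:2]] = (start, pos + 1)
    return index


def count_supports(transactions, candidates):
    """Count the support of each candidate with a flat counter array."""
    index = build_prefix_index(candidates)
    counts = [0] * len(candidates)
    for t in transactions:
        items = sorted(set(t))
        present = set(items)
        for prefix in combinations(items, 2):
            if prefix not in index:
                continue
            start, end = index[prefix]
            for pos in range(start, end):
                # The prefix already matches; check only the remaining items.
                if all(x in present for x in candidates[pos][2:]):
                    counts[pos] += 1
    return counts
```

Because every candidate has exactly one 2-item prefix, it is visited at most once per transaction, and all candidates examined for a given prefix sit next to each other in memory.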

10
DCI transaction pruning

Fewer and shorter transactions entail less
computation to perform
A pruned database Dk is written to the disk at
each iteration
I/O cost is reduced as execution progresses
Pruning may be expensive in time and space (e.g.
DHP hash filter)

11
DCI transaction pruning

Global pruning: an item is kept in t iff it
appears in at least (k-1) itemsets of Fk-1
Fk-1 → global_count[m]
if (global_count[i] >= k-1) then <keep i>
Local pruning: an item is kept in t iff it
appears in at least k itemsets of Ck
Ck → local_count[t]
if (local_count[i] >= k) then <keep i>
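The two pruning rules can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, itemsets are plain tuples, and the local rule is read here as counting only the candidates of Ck that are actually contained in t, which is an assumption about what the slide intends.

```python
from collections import Counter


def global_prune(t, frequent_prev, k):
    """Keep item i of t iff i occurs in at least k-1 itemsets of F_{k-1}."""
    global_count = Counter(i for itemset in frequent_prev for i in itemset)
    return [i for i in t if global_count[i] >= k - 1]


def local_prune(t, candidates, k):
    """Keep item i of t iff i occurs in at least k candidates of C_k
    that are contained in t (assumed reading of the local rule)."""
    present = set(t)
    local_count = Counter(
        i for c in candidates if present.issuperset(c) for i in c
    )
    return [i for i in t if local_count[i] >= k]
```

The thresholds follow from a counting argument: an item that belongs to a frequent k-itemset must appear in each of the k-1 frequent (k-1)-subsets that contain it, and an item that belongs to a (k+1)-itemset of t must appear in k of its k-item subsets.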

12
DCI transaction pruning
13
DCI tidlist Intersecting
When the pruned database fits in main memory,
DCI builds a bitvector vertical database on the fly
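As a small illustration of a bitvector vertical layout, the sketch below stores one bitvector per item, with bit j set when transaction j contains the item. Python integers stand in for fixed-width bit arrays, items are assumed to be numbered 0..m-1, and the function names are hypothetical.

```python
def build_vertical(transactions, m):
    """Return one bitvector per item; bit j is set iff transaction j has the item."""
    bitvectors = [0] * m
    for tid, t in enumerate(transactions):
        for item in set(t):
            bitvectors[item] |= 1 << tid
    return bitvectors


def support(itemset, bitvectors):
    """Support of an itemset = number of 1 bits in the AND of its bitvectors."""
    acc = bitvectors[itemset[0]]
    for item in itemset[1:]:
        acc &= bitvectors[item]
    return bin(acc).count("1")
```

With this layout, counting the support of a k-itemset becomes a k-way bitwise AND followed by a population count.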
14
DCI Tidlist Intersecting

Item tidlists → k-way intersections
low memory requirements, but too many operations!
(k-1)-itemset tidlists → 2-way intersections
huge memory requirements (Partition)
DCI caches (and reuses) the intersections
(bitvectors) corresponding to all the prefixes of
the current candidate itemset

Cache size (bits)
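The caching of prefix intersections can be sketched as below. This is a simplified illustration under stated assumptions: candidates arrive in lexicographic order as tuples, bitvectors are Python integers, and the class name `PrefixCache` and its `and_ops` counter are hypothetical names introduced only for the example.

```python
class PrefixCache:
    """Caches the intersections of all prefixes of the last candidate."""

    def __init__(self, bitvectors):
        self.bitvectors = bitvectors
        self.items = ()      # items of the previous candidate
        self.partial = []    # partial[d] = AND of bitvectors of items[0..d]
        self.and_ops = 0     # number of AND operations actually performed

    def support(self, candidate):
        # Length of the prefix shared with the previous candidate.
        shared = 0
        while (shared < min(len(self.items), len(candidate))
               and self.items[shared] == candidate[shared]):
            shared += 1
        # Keep the cached intersections for the shared prefix only.
        del self.partial[shared:]
        for item in candidate[shared:]:
            if self.partial:
                self.partial.append(self.partial[-1] & self.bitvectors[item])
                self.and_ops += 1
            else:
                self.partial.append(self.bitvectors[item])
        self.items = tuple(candidate)
        return bin(self.partial[-1]).count("1")
```

For two consecutive candidates that share their first k-1 items, the second one costs a single AND instead of k-1, while the cache holds only k-1 intermediate bitvectors rather than one tidlist per (k-1)-itemset.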
15
Number of AND operations
16
Memory requirements
17
Other Optimizations

Item IDs are reassigned (in increasing order of
item support) to enhance locality
Bitvectors are sparse, so only non-zero sections
are actually intersected
Transactions are pruned from the vertical
database by removing whole columns
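The first optimization, renumbering items by increasing support, might look like the sketch below. The function name and the tie-breaking rule (original ID) are assumptions made for the example.

```python
from collections import Counter


def reassign_item_ids(transactions):
    """Renumber items 0..m-1 in increasing order of support."""
    support = Counter(i for t in transactions for i in set(t))
    ordered = sorted(support, key=lambda i: (support[i], i))  # ties: original ID
    new_id = {old: new for new, old in enumerate(ordered)}
    recoded = [sorted(new_id[i] for i in set(t)) for t in transactions]
    return new_id, recoded
```

After recoding, the rarest items get the smallest IDs, so the leading items of a lexicographically ordered candidate are also the ones with the sparsest bitvectors.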

18
DCI Time Complexity
Candidate checks
Subset counting time
Intersecting time
Tidlist length
19
Experiments

Testbed: Pentium III 450 MHz, 512 MB RAM, 18 GB
U2-SCSI disk, Linux 2.2.12-20
Synthetically generated datasets

20
Per-iteration execution times
21
Total execution times
22
Memory requirements
23
Scalability
24
Conclusions

For low supports, DCI outperforms Apriori by at
least one order of magnitude
Further work
Comparison with other state-of-the-art algorithms
(FP-growth on short/medium patterns?)
Managing candidate explosion
Disk resident candidate set?
Parallel version of DCI for clusters of SMPs
Threads and shared memory within the SMP
Message-passing among SMPs