DCI, a hybrid algorithm for counting frequent sets - PowerPoint PPT Presentation

1
DCI, a hybrid algorithm for counting frequent sets
  • S. Orlando, P. Palmerini, R. Perego
  • CNUCE-CNR, Pisa
  • Università Cà Foscari, Venezia

2
Goals
  • Designing efficient and scalable DM algorithms
  • for clusters of workstations (COWs)
  • exploiting parallelism at various levels
  • considering the particular features of the target
    architecture
  • Focus on adaptiveness
  • dynamic policies for load balancing,
    partitioning, and scheduling

3
Recent Work on DM
  • Parallel k-means for SMP Clusters
  • Implementation issues in the design of I/O
    intensive data mining applications on clusters of
    workstations. Proc. of the 3rd Workshop on High
    Performance Data Mining, Cancun, Mexico, May 5th,
    2000.
  • Parallel C4.5 on COWs
  • Frequent Patterns
  • Enhancing the Apriori algorithm for frequent set
    counting. Submitted to DaWaK 2001.
  • A Hybrid Algorithm for Counting Frequent Sets.
    Submitted to ACM KDD 2001.

4
Algorithms for FSC
  • Apriori: BFS, counting, horizontal DB
  • Partition: BFS, intersecting, vertical DB
  • FP-growth: DFS, counting, horizontal DB
  • MaxEclat: DFS, intersecting, vertical DB
  • BFS algorithms are level-wise and strictly iterative
  • DFS algorithms are divide & conquer and recursive

5
DCI: Direct Counting and Intersecting
  • BFS, hybrid, level-wise algorithm
  • Counting-based during early iterations
  • Innovative method for storing candidates and
    counting their support
  • Effective pruning of transactions
  • Intersection-based once the pruned database fits
    into main memory
  • Fully optimized k-way intersections

6
Notation
7
DCI direct counting
  • Apriori hash tree is inefficient for small k
  • few levels ⇒ few candidate partitions
  • DCI is based on contiguous and directly
    accessible data structures to enhance locality
    and minimize overheads due to pointer
    dereferencing

8
DCI direct counting, k = 2
  • C2 is exactly F1 × F1
  • Array COUNTS[C2]
  • OPMPH (order-preserving minimal perfect hash) function
  • $\Delta_2 : C_2 \to \{1, \dots, \binom{m_2}{2}\}$

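
The direct-counting idea for k = 2 can be illustrated with a small Python sketch. It is not the authors' code: the function names `pair_index` and `count_pairs` are hypothetical, item ids are assumed to be already remapped to 0..m-1, and slots are numbered from 0 rather than from 1 as on the slide. Each pair gets its own slot in one flat array, in lexicographic order, with no collisions and no pointers to follow.

```python
from itertools import combinations


def pair_index(i: int, j: int, m: int) -> int:
    """Map a pair of item ids 0 <= i < j < m to a slot in 0..C(m,2)-1.

    Slots follow the lexicographic order of the pairs, so the mapping is
    order preserving, minimal (no unused slot) and perfect (no collision).
    """
    if not 0 <= i < j < m:
        raise ValueError("expected 0 <= i < j < m")
    # Pairs starting with an item smaller than i fill the first
    # i*m - i*(i+1)/2 slots; (j - i - 1) is the offset inside row i.
    return i * m - i * (i + 1) // 2 + (j - i - 1)


def count_pairs(transactions, m):
    """Count the support of every 2-itemset in one flat array of counters."""
    counts = [0] * (m * (m - 1) // 2)
    for t in transactions:
        for i, j in combinations(sorted(set(t)), 2):
            counts[pair_index(i, j, m)] += 1
    return counts
```

For example, with m = 4 the six pairs (0,1), (0,2), (0,3), (1,2), (1,3), (2,3) map to slots 0 through 5 in that order.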
9
DCI direct counting, k > 2
  • A limited and directly accessible PREFIX tree
    allows selecting contiguous regions of Ck that share
    the same prefix
  • The lexicographic order of t and Ck allows
    efficient checking of itemset inclusion
  • High locality
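
A rough sketch of prefix-based counting for k > 2 follows. It makes several assumptions that are not on the slide: the prefix is taken to be the first two items of a candidate, a Python dict stands in for the directly accessible PREFIX structure, and the names `build_prefix_table` and `count_support` are made up for illustration.

```python
from itertools import combinations


def build_prefix_table(candidates):
    """candidates: lexicographically sorted list of k-tuples (k > 2).

    Returns {2-item prefix: (start, end)}, the half-open range of
    positions in `candidates` that share that prefix.
    """
    table = {}
    for pos, cand in enumerate(candidates):
        prefix = cand[:2]
        start, _ = table.get(prefix, (pos, pos))
        table[prefix] = (start, pos + 1)
    return table


def count_support(candidates, transactions):
    """Count, for each candidate, the transactions that contain it."""
    table = build_prefix_table(candidates)
    counts = [0] * len(candidates)
    for t in transactions:
        items = sorted(set(t))
        item_set = set(items)
        # Every 2-item prefix present in t selects one contiguous region.
        for prefix in combinations(items, 2):
            region = table.get(prefix)
            if region is None:
                continue
            for pos in range(*region):
                # The prefix is already in t; check only the remaining items.
                if item_set.issuperset(candidates[pos][2:]):
                    counts[pos] += 1
    return counts
```

Because candidates with the same prefix sit next to each other, each transaction touches only a few short, contiguous runs of Ck, which is where the locality comes from.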

10
DCI transaction pruning
  • Fewer and shorter transactions entail less
    computation to perform
  • A pruned database Dk is written to the disk at
    each iteration
  • I/O cost is reduced as execution progresses
  • Pruning may be expensive in time and space (e.g.
    DHP hash filter)

11
DCI transaction pruning
  • Global pruning: an item is kept in t iff it
    appears in at least k-1 itemsets of Fk-1
  • Fk-1 → global_count[m]
  • if (global_count[i] >= k-1) then <keep i>
  • Local pruning: an item is kept in t iff it
    appears in at least k itemsets of Ck
  • Ck → local_count[t]
  • if (local_count[i] >= k) then <keep i>
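
The two pruning rules can be sketched in Python as below. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the local rule is read here as counting only the candidates of Ck that are contained in the transaction t, which is an interpretation of the slide rather than something it states.

```python
from collections import Counter


def global_prune(t, frequent_prev, k):
    """Keep item i of t iff i occurs in at least k-1 itemsets of F_{k-1}."""
    global_count = Counter(i for itemset in frequent_prev for i in itemset)
    return [i for i in t if global_count[i] >= k - 1]


def local_prune(t, candidates, k):
    """Keep item i of t iff i occurs in at least k itemsets of C_k.

    Assumption: only candidates that are subsets of t are counted.
    """
    items = set(t)
    local_count = Counter(
        i for c in candidates if items.issuperset(c) for i in c
    )
    return [i for i in t if local_count[i] >= k]
```

The thresholds follow from a counting argument: a k-itemset has k subsets of size k-1, and each of its items appears in exactly k-1 of them, so an item seen in fewer than k-1 itemsets of Fk-1 cannot belong to any frequent k-itemset.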

12
DCI transaction pruning
13
DCI tidlist intersecting
When the pruned database fits in main memory,
DCI builds a bitvector vertical database on the fly
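
As an illustration of a bitvector vertical layout, the sketch below uses Python integers as bitvectors: bit number tid of an item's vector is set when transaction tid contains that item. The names `build_vertical` and `support` are hypothetical, and the itemset passed to `support` is assumed to be non-empty.

```python
def build_vertical(transactions):
    """Return {item: bitvector}; bit `tid` is 1 iff transaction tid has item."""
    vertical = {}
    for tid, t in enumerate(transactions):
        for item in set(t):
            vertical[item] = vertical.get(item, 0) | (1 << tid)
    return vertical


def support(itemset, vertical):
    """Support = number of 1 bits in the AND of the items' bitvectors."""
    items = iter(itemset)
    bits = vertical.get(next(items), 0)
    for item in items:
        bits &= vertical.get(item, 0)
    return bin(bits).count("1")
```

With this layout, counting the support of a candidate reduces to bitwise ANDs followed by a population count, with no scan of the horizontal transactions.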
14
DCI tidlist intersecting
  • Item tidlists ⇒ k-way intersections
  • low memory requirements but too many operations!
  • (k-1)-itemset tidlists ⇒ 2-way intersections
  • huge memory requirements (Partition)
  • DCI caches (and reuses) the intersections
    (bitvectors) corresponding to all the prefixes of
    the current candidate itemset
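
The prefix cache can be sketched as follows, assuming candidates arrive in lexicographic order and bitvectors are stored as Python integers keyed by item. The function name and the returned AND counter are illustrative additions, not part of the presentation.

```python
def supports_with_prefix_cache(candidates, vertical):
    """candidates: lexicographically sorted, non-empty k-tuples.
    vertical: {item: int bitvector}.

    Returns (supports, number of AND operations performed).
    """
    cache = []   # cache[d] = AND of the bitvectors of the first d+1 items
    prev = ()
    supports, and_ops = [], 0
    for cand in candidates:
        # Length of the prefix shared with the previous candidate.
        common = 0
        while (common < len(prev) and common < len(cand)
               and prev[common] == cand[common]):
            common += 1
        del cache[common:]          # keep only the reusable intersections
        for d in range(common, len(cand)):
            if d == 0:
                cache.append(vertical[cand[0]])
            else:
                cache.append(cache[d - 1] & vertical[cand[d]])
                and_ops += 1
        supports.append(bin(cache[-1]).count("1"))
        prev = cand
    return supports, and_ops
```

The cache holds at most k bitvectors, so memory stays close to the item-tidlist approach, while consecutive candidates that share a long prefix need only one or two new ANDs each.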

Cache size (bits)
15
Number of AND operations
16
Memory requirements
17
Other Optimizations
  • Item IDs are reassigned (in increasing order of
    item support) to enhance locality
  • Bitvectors are sparse, so we intersect only
    their non-zero sections
  • Transactions are pruned from the vertical
    database by removing whole columns

18
DCI Time Complexity
Candidate checks
Subset counting time
Intersecting time
Tidlist length
19
Experiments
  • Testbed: Pentium III 450 MHz, 512 MB RAM, 18 GB
    U2-SCSI disk, Linux 2.2.12-20
  • Synthetically generated datasets

20
Per-iteration execution times
21
Total execution times
22
Memory requirements
23
Scalability
24
Conclusions
  • For low supports, DCI outperforms Apriori by at
    least one order of magnitude
  • Further work
  • Comparison with other state-of-the-art algorithms
    (FP-growth on short/medium patterns?)
  • Managing candidate explosion
  • Disk resident candidate set?
  • Parallel version of DCI for clusters of SMPs
  • Threads and shared memory within the SMP
  • Message-passing among SMPs