Title: Improving the Efficiency of Frequent Pattern Mining by Compact Data Structure Design


1
Improving the Efficiency of Frequent Pattern
Mining by Compact Data Structure Design
  • Raj P. Gopalan, Yudho Giri Sucahyo
  • School of Computing
  • Curtin University of Technology
  • Bentley, Western Australia 6102
  • {raj, sucahyoy}@computing.edu.au

2
Outline
  • Introduction
  • Data Structure
  • Algorithms
  • Performance Study
  • Further Work
  • Conclusion

3
Introduction
  • Association rule mining
  • Finds interesting patterns or relationships among
    items in a given data set.
  • Two steps
  • Find the frequent itemsets.
  • Use the result of Step 1 to generate association
    rules.
  • Step 1 is computationally very expensive, so it is
    the focus of significant research effort.

4
Finding Frequent Itemsets
  • Two general approaches
  • Candidate generation-and-test
  • Apriori (Agrawal et al., SIGMOD93) and its
    variants
  • Pattern Growth Approach
  • FP-Growth (Han et al., SIGMOD00)
  • H-Mine (Pei et al., ICDM01)
  • Opportunistic Projection (OP) (Liu et al., SIGKDD02)
  • Data Structures
  • Array-based: H-struct in H-Mine
  • Tree-based: FP-Tree in FP-Growth

5
Contributions
  • Present the impact of improving data structure
    design on the efficiency of frequent pattern
    mining.
  • Compare the performance with Apriori, Eclat (Zaki
    2000), FP-Growth and OP algorithms.

6
Association Rules
  • Given a database of transactions containing
    various items, statements of the form
  • A ⇒ B (10%, 80%)
  • 80% of transactions that purchase A also purchase
    B (confidence), and 10% of all transactions contain
    both of them (support).
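
The two percentages attached to a rule can be computed directly from the transactions. The following is a minimal sketch, not taken from the slides: the function name `support_and_confidence` and the toy data are illustrative assumptions, chosen so that the rule A ⇒ B comes out at 10% support and 80% confidence.

```python
def support_and_confidence(transactions, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent => consequent.

    support    = fraction of all transactions containing both sides
    confidence = fraction of transactions containing the antecedent
                 that also contain the consequent
    """
    antecedent, consequent = set(antecedent), set(consequent)
    both = antecedent | consequent
    n_total = len(transactions)
    n_antecedent = sum(1 for t in transactions if antecedent <= set(t))
    n_both = sum(1 for t in transactions if both <= set(t))
    support = n_both / n_total if n_total else 0.0
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Hypothetical toy database: 40 transactions, 5 contain A, 4 contain A and B.
toy = [["A", "B"]] * 4 + [["A"]] + [["C"]] * 35
support, confidence = support_and_confidence(toy, ["A"], ["B"])
print(f"support={support:.0%} confidence={confidence:.0%}")
```

A rule is normally reported only if both values reach user-chosen minimum thresholds.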

7
Binary Representations of Transactions
  • (Figure not reproduced; labeled "Support 2")
8
Data Structures
  • Based on these observations
  • Item identifiers can be mapped to a range of
    integers.
  • Transaction identifiers can be ignored provided
    the items of each transaction are linked together.
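
The first observation can be illustrated with a short sketch. It assumes items arrive as arbitrary hashable labels; the name `encode_items` and the sample data are hypothetical, and the choice to number items in sorted label order is an assumption made only for this example.

```python
def encode_items(transactions):
    """Map item labels to integers 1..n and re-encode every transaction.

    Transaction identifiers are not stored: each transaction is kept
    only as a sorted list of integer item ids.
    """
    labels = sorted({item for t in transactions for item in t})
    item_ids = {label: i for i, label in enumerate(labels, start=1)}
    encoded = [sorted(item_ids[item] for item in set(t)) for t in transactions]
    return item_ids, encoded


# Hypothetical sample data.
item_ids, encoded = encode_items([["bread", "milk"], ["milk", "eggs"]])
print(item_ids)
print(encoded)
```

With dense integer ids, a per-item table can be a plain array indexed by id instead of a hash map.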

9
Transaction Tree
10
ITL Data Structure
  • ITEMTABLE
  • Every item, with its support and a link to the
    first occurrence in TransLink.
  • TRANSLINK
  • Every transaction in the database, with items in
    sorted order.
  • Each item has a link to its next occurrence.
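
The description above can be sketched as follows. This is a simplified reading of the slide and not the authors' implementation: the dictionary layout, the `(row, col)` link format, and the names `build_itl` and `rows_containing` are all assumptions for illustration.

```python
def build_itl(transactions):
    """Build a simplified ItemTable and TransLink.

    item_table[item] = {"count": support, "first": (row, col)}
    trans_link[row]  = list of [item, next_link] cells in sorted item order,
                       where next_link is the (row, col) of the next
                       occurrence of the same item, or None.
    """
    item_table = {}
    trans_link = []
    last_seen = {}  # item -> (row, col) of its most recent occurrence
    for row, t in enumerate(transactions):
        cells = []
        for col, item in enumerate(sorted(set(t))):
            cells.append([item, None])
            entry = item_table.setdefault(item, {"count": 0, "first": (row, col)})
            entry["count"] += 1
            if item in last_seen:
                prev_row, prev_col = last_seen[item]
                # Link the previous occurrence forward to this one.
                trans_link[prev_row][prev_col][1] = (row, col)
            last_seen[item] = (row, col)
        trans_link.append(cells)
    return item_table, trans_link


def rows_containing(item, item_table, trans_link):
    """Follow one item's chain of links and return the rows where it occurs."""
    rows = []
    pos = item_table[item]["first"] if item in item_table else None
    while pos is not None:
        rows.append(pos[0])
        pos = trans_link[pos[0]][pos[1]][1]
    return rows


# Hypothetical sample data.
table, links = build_itl([[2, 3, 4, 5], [1, 2, 3, 5], [1, 2, 4, 5]])
print(table[4]["count"], rows_containing(4, table, links))
```

Following the links visits only the transactions that contain a given item, so no transaction identifiers need to be stored.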

11
ITL Data Structure
  • ItemTable (as shown on the slide)
      Index   1  2  3  4  5
      Item    1  2  3  4  5
      Count   2  3  4  4  5
  • TransLink (figure not recoverable from the transcript;
    legible rows: 2 3 4 5, 1 2 3 5, 1 2 4 5)
12
Tree Data Structure
  • ITEMTABLE
  • Every item, with its support and a pointer to
    the root of the subtree of the item.
  • COMPRESSED TRANSACTION TREE
  • All transactions of the database containing
    frequent items. Only frequent items will be
    stored in the tree.
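
One way to picture a tree that stores only frequent items is a generic prefix tree with a count on every node. The sketch below is that generic structure, not the compressed transaction tree layout of the paper (which is not given on the slides); `Node`, `build_prefix_tree`, and `count_nodes` are hypothetical names.

```python
from collections import Counter


class Node:
    """One tree node: an item, how many transactions pass through it, children."""

    def __init__(self, item):
        self.item = item
        self.count = 0
        self.children = {}


def build_prefix_tree(transactions, min_support):
    """Return (root, item_support) for a prefix tree of frequent items only."""
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {item for item, c in counts.items() if c >= min_support}
    root = Node(None)
    for t in transactions:
        node = root
        # Infrequent items are dropped; the rest are inserted in sorted order
        # so that transactions with a common prefix share nodes.
        for item in sorted(x for x in set(t) if x in frequent):
            if item not in node.children:
                node.children[item] = Node(item)
            node = node.children[item]
            node.count += 1
    return root, {item: counts[item] for item in sorted(frequent)}


def count_nodes(node):
    """Number of nodes below the given node."""
    return sum(1 + count_nodes(child) for child in node.children.values())


# Hypothetical sample data: items 4 and 5 fall below min_support = 2.
root, item_support = build_prefix_tree([[1, 2, 3], [1, 2, 4], [1, 2, 3], [5]], 2)
print(item_support, count_nodes(root))
```

In this example four transactions collapse into a single path of three nodes, which is the kind of saving a compact tree aims for.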

13
ITL-Mine Algorithm
  • Three steps
  • Construct ItemTable and TransLink.
  • Prune any item below minimum support.
  • Mine Frequent Itemsets of 2 or more items.
  • Algorithm details in another paper.
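
Since the slides defer the algorithm details, the sketch below is not ITL-Mine. It is a simple depth-first miner over sets of transaction rows (closer in spirit to Eclat) that follows the same outline of counting, pruning, and then mining itemsets of 2 or more items; it can serve as a reference for checking output on small inputs. The name `mine_frequent_itemsets` and the sample data are assumptions.

```python
def mine_frequent_itemsets(transactions, min_support):
    """Return {itemset tuple: support} for frequent itemsets of 2+ items."""
    # Step 1: one pass to record, for every item, the rows that contain it.
    rows_of = {}
    for row, t in enumerate(transactions):
        for item in set(t):
            rows_of.setdefault(item, set()).add(row)

    # Step 2: prune items below the minimum support count.
    frequent = sorted(i for i, rows in rows_of.items() if len(rows) >= min_support)

    # Step 3: grow itemsets depth-first; an itemset is extended only if it
    # is itself frequent, because no superset of an infrequent set can be.
    results = {}

    def extend(prefix, prefix_rows, candidates):
        for k, item in enumerate(candidates):
            rows = prefix_rows & rows_of[item]
            if len(rows) >= min_support:
                itemset = prefix + (item,)
                results[itemset] = len(rows)
                extend(itemset, rows, candidates[k + 1:])

    for k, item in enumerate(frequent):
        extend((item,), rows_of[item], frequent[k + 1:])
    return results


# Hypothetical sample data with a minimum support count of 2.
found = mine_frequent_itemsets([[2, 3, 4, 5], [1, 2, 3, 5], [1, 2, 4, 5]], 2)
for itemset in sorted(found):
    print(itemset, found[itemset])
```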

14
TreeITL-Mine Algorithm
  • Four steps
  • Identify Frequent Items and Initialize ItemTable
  • Construct Transaction Tree
  • Construct ITL
  • Mine Frequent Itemsets of 2 or more items.
  • Algorithm details in another paper.

15
CT-ITL Algorithm
  • Four steps
  • Identify Frequent Items and Initialize ItemTable
  • Construct the Compressed Transaction Tree
  • Construct Compact ITL
  • Mine Frequent Itemsets of 2 or more items.
  • Algorithm details in another paper.

16
CT-Mine Algorithm
  • Three steps
  • Identify Frequent Items and Initialize ItemTable
  • Construct the Compressed Transaction Tree
  • Mining
  • Algorithm details in another paper.

17
Performance Study on Connect-4
67,557 transactions, 129 items, average 43 items per transaction
18
Performance Study on Pumsb
49,046 transactions, 2,087 items, average 50 items per transaction
19
Performance Study on Connect-4
67,557 transactions, 129 items, average 43 items per transaction
20
Performance Study on Pumsb
49,046 transactions, 2,087 items, average 50 items per transaction
21
Further Work
  • Extend CT-ITL and CT-Mine for very large
    databases.
  • Investigate the relative performance of CT-ITL
    and CT-Mine on various practical data sets.
  • Build a data mining query optimizer.

22
Conclusion
  • The influence of compact data structure design on
    the performance of frequent pattern mining has
    been discussed.
  • The performance of the algorithms against
    Apriori, Eclat, FP-Growth and OP on various data
    sets is presented. The results show that our
    fastest algorithms perform better than all others
    at a number of commonly used support levels.