Title: Improving the Efficiency of Frequent Pattern Mining by Compact Data Structure Design
1Improving the Efficiency of Frequent Pattern
Mining by Compact Data Structure Design
- Raj P. Gopalan, Yudho Giri Sucahyo
- School of Computing
- Curtin University of Technology
- Bentley, Western Australia 6102
- raj, sucahyoy_at_computing.edu.au
2Outline
- Introduction
- Data Structure
- Algorithms
- Performance Study
- Further Work
- Conclusion
3Introduction
- Association rule mining
- Finds interesting patterns or relationships among
items in a given data set.
- Two steps
- Find the frequent itemsets.
- Use the result of Step 1 to generate association
rules.
- Step 1 is computationally very expensive.
- So it is the focus of significant research effort.
4Finding Frequent Itemsets
- Two general approaches
- Candidate generation-and-test
- Apriori (Agrawal et al., SIGMOD93) and its
variants
- Pattern Growth Approach
- FP-Growth (Han et al., SIGMOD00)
- H-Mine (Pei et al., ICDM01)
- Opportunistic Projection (OP) (Liu et al., SIGKDD02)
- Data Structures
- Array-based: H-struct in H-Mine
- Tree-based: FP-Tree in FP-Growth
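To make the candidate generation-and-test approach concrete, here is a minimal level-wise sketch in the spirit of Apriori. It is an illustration only, not the implementation benchmarked in this talk; the function name `apriori` and the toy transactions are assumptions made for the example.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support count} for all frequent itemsets (illustrative sketch)."""
    tsets = [frozenset(t) for t in transactions]

    # Level 1: count single items and keep the frequent ones.
    counts = {}
    for t in tsets:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        # Generate: join frequent (k-1)-itemsets into k-itemsets and keep only
        # candidates whose (k-1)-subsets are all frequent.
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in current for sub in combinations(union, k - 1)
            ):
                candidates.add(union)
        # Test: count each candidate against the database.
        counts = {c: sum(1 for t in tsets if c <= t) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(current)
        k += 1
    return frequent

# Toy data (hypothetical), minimum support count of 2.
toy = [[1, 2, 3], [1, 2, 4], [1, 2, 3], [2, 5]]
frequent = apriori(toy, min_support=2)
```

Each level rescans the whole database to test its candidates, which is the cost that pattern growth methods try to avoid.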
5Contributions
- Present the impact of improving data structure
design on the efficiency of frequent pattern
mining.
- Compare the performance with Apriori, Eclat (Zaki
2000), FP-Growth and OP algorithms.
6Association Rules
- Given a database of transactions containing
various items, statements of the form
- A ⇒ B (10%, 80%)
- 80% of transactions that purchase A also purchase
B, and 10% of all transactions contain both of
them.
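The two percentages attached to a rule can be computed directly from transaction counts. The sketch below assumes transactions are given as sets of item labels; the function name `support_and_confidence` and the toy data are hypothetical.

```python
def support_and_confidence(transactions, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent => consequent."""
    a = frozenset(antecedent)
    ab = a | frozenset(consequent)
    n_total = len(transactions)
    n_a = sum(1 for t in transactions if a <= t)    # transactions containing A
    n_ab = sum(1 for t in transactions if ab <= t)  # transactions containing A and B
    support = n_ab / n_total if n_total else 0.0
    confidence = n_ab / n_a if n_a else 0.0
    return support, confidence

# Toy data (hypothetical): five market baskets.
baskets = [frozenset(t) for t in (["A", "B"], ["A", "B", "C"], ["A", "C"], ["B"], ["C"])]
support, confidence = support_and_confidence(baskets, {"A"}, {"B"})
```

In this toy data, two of the five baskets contain both A and B, and three contain A, so the support is 2/5 and the confidence is 2/3.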
7Binary Representations of Transactions
Support 2
8Data Structures
- Based on these observations
- Item identifiers can be mapped to a range of
integers.
- Transaction identifiers can be ignored provided
the items of each transaction are linked together.
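A minimal sketch of these two observations, assuming the raw database is a mapping from transaction identifier to a list of item labels. The function name `encode_transactions` and the alphabetical choice of integer order are assumptions for illustration; the talk does not specify the ordering.

```python
def encode_transactions(raw_transactions):
    """Map item labels to integers 1..n and drop transaction identifiers."""
    # Assign integer identifiers in sorted label order (an arbitrary but deterministic choice).
    labels = sorted({item for items in raw_transactions.values() for item in items})
    item_to_id = {label: i for i, label in enumerate(labels, start=1)}
    # Keep only each transaction's items, as a sorted list of integer identifiers.
    encoded = [sorted({item_to_id[item] for item in items})
               for items in raw_transactions.values()]
    return item_to_id, encoded

# Toy data (hypothetical).
raw = {"t100": ["bread", "milk"], "t200": ["milk", "eggs", "bread"], "t300": ["eggs"]}
item_to_id, encoded = encode_transactions(raw)
```

After encoding, the transaction identifiers no longer appear anywhere: the items of one transaction stay together simply by sharing a row.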
9Transaction Tree
10ITL Data Structure
- ITEMTABLE
- Every item, with its support and a link to the
first occurrence in TransLink.
- TRANSLINK
- Every transaction in the database, with
items in sorted order.
- Each item has a link to its next
occurrence.
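The description above can be sketched as follows. This is one possible reading of ItemTable and TransLink, not the authors' implementation: links are represented here as (row, column) positions, and the names `Cell`, `ItemEntry`, `build_itl` and `rows_containing` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Pos = Tuple[int, int]  # (transaction row, position within that row)

@dataclass
class Cell:
    item: int
    next: Optional[Pos] = None   # next occurrence of the same item

@dataclass
class ItemEntry:
    count: int = 0               # support count of the item
    first: Optional[Pos] = None  # first occurrence in TransLink

def build_itl(transactions):
    """Build (item_table, trans_link) from transactions of integer items."""
    item_table: Dict[int, ItemEntry] = {}
    trans_link: List[List[Cell]] = []
    last_seen: Dict[int, Pos] = {}
    for row, t in enumerate(transactions):
        cells = [Cell(item) for item in sorted(set(t))]  # items in sorted order
        trans_link.append(cells)
        for col, cell in enumerate(cells):
            entry = item_table.setdefault(cell.item, ItemEntry())
            entry.count += 1
            pos = (row, col)
            if entry.first is None:
                entry.first = pos
            else:
                prev_row, prev_col = last_seen[cell.item]
                trans_link[prev_row][prev_col].next = pos  # chain occurrences
            last_seen[cell.item] = pos
    return item_table, trans_link

def rows_containing(item, item_table, trans_link):
    """Follow the occurrence chain of an item and return the rows that contain it."""
    rows = []
    pos = item_table[item].first if item in item_table else None
    while pos is not None:
        rows.append(pos[0])
        pos = trans_link[pos[0]][pos[1]].next
    return rows

# Toy data (hypothetical), already encoded as integers.
item_table, trans_link = build_itl([[2, 3, 4, 5], [1, 2, 3, 5], [1, 2, 4, 5]])
rows_with_item_1 = rows_containing(1, item_table, trans_link)
```

Following the chain from ItemTable visits only the transactions that contain the item, so no transaction identifier is needed to find them.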
11ITL Data Structure
Index 1 2 3 4 5
Item 1 2 3 4 5
Count 2 3 4 4 5
[Figure: TransLink rows with links from each item to its next occurrence. Rows legible in the figure include (2 3 4 5), (1 2 3 5) and (1 2 4 5).]
12Tree Data Structure
- ITEMTABLE
- Every item, with its support and a pointer to
the root of the subtree of the item.
- COMPRESSED TRANSACTION TREE
- All transactions of the database containing
frequent items. Only frequent items will be
stored in the tree.
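As a rough illustration of why a tree helps, the sketch below stores transactions in a plain prefix tree with counts, so that transactions sharing a prefix of frequent items share nodes. This is an assumption-laden simplification and not the compressed transaction tree of CT-ITL itself, whose layout is described in the authors' paper. The names `Node`, `build_prefix_tree` and `count_nodes` are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class Node:
    item: int
    count: int = 0  # number of transactions passing through this node
    children: Dict[int, "Node"] = field(default_factory=dict)

def build_prefix_tree(transactions, min_support):
    """Return (root, item_support) with only frequent items stored in the tree."""
    support = Counter(item for t in transactions for item in set(t))
    frequent = {item for item, c in support.items() if c >= min_support}
    root = Node(item=-1)  # sentinel root, holds no item
    for t in transactions:
        node = root
        # Insert the frequent items of the transaction in sorted order.
        for item in sorted(i for i in set(t) if i in frequent):
            node = node.children.setdefault(item, Node(item))
            node.count += 1
    return root, {item: support[item] for item in sorted(frequent)}

def count_nodes(node):
    """Count the nodes below the given node."""
    return sum(1 + count_nodes(child) for child in node.children.values())

# Toy data (hypothetical), minimum support count of 2.
root, item_support = build_prefix_tree([[1, 2, 3], [1, 2, 4], [1, 2, 3], [2, 5]], 2)
n_nodes = count_nodes(root)
```

In the toy data, four transactions containing eight occurrences of frequent items are stored in four tree nodes, because the prefix (1, 2) is shared.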
13ITL-Mine Algorithm
- Three steps
- Construct ItemTable and TransLink.
- Prune any item below minimum support.
- Mine Frequent Itemsets of 2 or more items.
- Algorithm details in another paper.
14TreeITL-Mine Algorithm
- Four steps
- Identify Frequent Items and Initialize ItemTable
- Construct Transaction Tree
- Construct ITL
- Mine Frequent Itemsets of 2 or more items.
- Algorithm details in another paper.
15CT-ITL Algorithm
- Four steps
- Identify Frequent Items and Initialize ItemTable
- Construct the Compressed Transaction Tree
- Construct Compact ITL
- Mine Frequent Itemsets of 2 or more items.
- Algorithm details in another paper.
16CT-Mine Algorithm
- Three steps
- Identify Frequent Items and Initialize ItemTable
- Construct the Compressed Transaction Tree
- Mining
- Algorithm details in another paper.
17Performance Study on Connect-4
67,557 transactions; number of items: 129; average 43 items per transaction
18Performance Study on Pumsb
49,046 transactions; number of items: 2,087; average 50 items per transaction
19Performance Study on Connect-4
67,557 transactions; number of items: 129; average 43 items per transaction
20Performance Study on Pumsb
49,046 transactions; number of items: 2,087; average 50 items per transaction
21Further Work
- Extend CT-ITL and CT-Mine for very large
databases.
- Investigate the relative performance of CT-ITL
and CT-Mine on various practical data sets.
- Build a data mining query optimizer.
22Conclusion
- The influence of compact data structure design on
the performance of frequent pattern mining has
been discussed.
- The performance of the algorithms against
Apriori, Eclat, FP-Growth and OP on various data
sets is presented. The results show that our
fastest algorithms perform better than all others
at a number of commonly used support levels.