Title: Improving the Efficiency of Frequent Pattern Mining by Compact Data Structure Design

1
Improving the Efficiency of Frequent Pattern
Mining by Compact Data Structure Design

Raj P. Gopalan Yudho Giri Sucahyo
School of Computing
Curtin University of Technology
Bentley, Western Australia 6102
{raj, sucahyoy}@computing.edu.au

2
Outline

Introduction
Data Structure
Algorithms
Performance Study
Further Work
Conclusion

3
Introduction

Association rule mining
Finds interesting patterns or relationships among
items in a given data set.
Two steps
Find the frequent itemsets.
Use the result of Step 1 to generate association
rules.
Step 1 is computationally very expensive,
so it has been the focus of significant research effort.

4
Finding Frequent Itemsets

Two general approaches
Candidate generation-and-test
Apriori (Agrawal et al., SIGMOD'93) and its
variants
Pattern Growth Approach
FP-Growth (Han et al., SIGMOD'00)
H-Mine (Pei et al., ICDM'01)
Opportune Project (OP) (Liu et al., SIGKDD'02)
Data Structures
Array-based: H-struct in H-Mine
Tree-based: FP-Tree in FP-Growth
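
To make the candidate generation-and-test approach concrete, here is a minimal Apriori-style sketch. It is a textbook-style illustration and not the implementation benchmarked in this talk; the function name apriori and the toy transactions are assumptions made for the example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        # Generate: join frequent (k-1)-itemsets into k-item candidates and
        # keep only those whose (k-1)-subsets are all frequent.
        candidates = set()
        for a, b in combinations(list(frequent), 2):
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in frequent for sub in combinations(union, k - 1)
            ):
                candidates.add(union)

        # Test: count every candidate with one pass over the transactions.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1

        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result


# Toy database (illustrative only), minimum support count of 2.
toy = [[2, 3, 4, 5], [1, 2, 3, 5], [1, 2, 4, 5]]
frequent_sets = apriori(toy, min_support=2)
```

Each level needs a full pass over the data to test its candidates, which is the cost that the pattern growth algorithms listed above try to avoid.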

5
Contributions

Present the impact of improving data structure
design on the efficiency of frequent pattern
mining.
Compare the performance with Apriori, Eclat (Zaki
2000), FP-Growth and OP algorithms.

6
Association Rules

Given a database of transactions containing
various items, statements of the form
A ⇒ B (10%, 80%)
80% of transactions that purchase A also purchase
B, and 10% of all transactions contain both of
them.
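
The two numbers attached to a rule are its support and its confidence. The sketch below computes both for a rule A ⇒ B over a small made-up database chosen so that the result matches the figures on this slide; the function name and the data are assumptions for illustration.

```python
def support_and_confidence(transactions, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent => consequent."""
    a = set(antecedent)
    both = a | set(consequent)
    n_a = sum(1 for t in transactions if a <= set(t))
    n_both = sum(1 for t in transactions if both <= set(t))
    support = n_both / len(transactions)        # fraction containing A and B
    confidence = n_both / n_a if n_a else 0.0   # fraction of A-buyers who buy B
    return support, confidence


# 40 transactions: 4 contain A and B, 1 contains only A, 35 contain neither.
db = [["A", "B"]] * 4 + [["A"]] + [["C"]] * 35
support, confidence = support_and_confidence(db, ["A"], ["B"])
```

Here 4 of 40 transactions contain both items (support 10%), and 4 of the 5 transactions containing A also contain B (confidence 80%).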

7
Binary Representations of Transactions
Support 2
8
Data Structures

Based on these observations
Item identifiers can be mapped to a range of
integers.
Transaction identifiers can be ignored provided
the items of each transaction are linked together.
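
A minimal sketch of these two observations, assuming the database arrives as a mapping from transaction identifier to item names. The function name encode_items and the sample data are hypothetical.

```python
def encode_items(database):
    """Map item names to integers 1..n and drop the transaction identifiers.

    database: dict of transaction id -> iterable of item names.
    Returns (item_to_index, encoded), where encoded is a list of sorted
    integer lists, one per transaction.
    """
    item_to_index = {}
    encoded = []
    for items in database.values():      # the transaction id itself is not kept
        row = set()
        for item in items:
            if item not in item_to_index:
                item_to_index[item] = len(item_to_index) + 1
            row.add(item_to_index[item])
        encoded.append(sorted(row))      # items of one transaction stay together
    return item_to_index, encoded


mapping, rows = encode_items({"T100": ["bread", "milk"], "T200": ["milk", "beer"]})
```

After encoding, each transaction is just a sorted list of small integers, which is the form the later structures build on.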

9
Transaction Tree
10
ITL Data Structure

ITEMTABLE
Every item, with its support and a link to the
first occurrence in TransLink.

TRANSLINK
Every transaction in database, with
items in sorted order.
Each item has a link to the next
occurrence.
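
One way to picture the two parts is the sketch below. It follows the description on this slide only; the actual layout used by the authors is not given here, so the names build_itl and occurrences, and the use of (row, column) pairs as links, are assumptions.

```python
def build_itl(transactions):
    """Build an ItemTable and a TransLink from a list of transactions.

    item_table: item -> {"count": support, "first": (row, col) of first occurrence}
    trans_link: one row per transaction; each cell is [item, next], where next is
                the (row, col) of the next occurrence of the same item, or None.
    """
    item_table = {}
    trans_link = []
    last_cell = {}  # item -> most recently created cell for that item
    for row, t in enumerate(transactions):
        cells = []
        for col, item in enumerate(sorted(set(t))):
            cell = [item, None]
            cells.append(cell)
            entry = item_table.setdefault(item, {"count": 0, "first": None})
            entry["count"] += 1
            if entry["first"] is None:
                entry["first"] = (row, col)
            else:
                last_cell[item][1] = (row, col)  # chain previous occurrence to this one
            last_cell[item] = cell
        trans_link.append(cells)
    return item_table, trans_link


def occurrences(item_table, trans_link, item):
    """Follow the links to list the rows (transactions) that contain item."""
    rows = []
    pos = item_table[item]["first"]
    while pos is not None:
        row, col = pos
        rows.append(row)
        pos = trans_link[row][col][1]
    return rows


table, links = build_itl([[2, 3, 4, 5], [1, 2, 3, 5], [1, 2, 4, 5]])
```

Following the chain for one item visits only the transactions that contain it, without scanning the rest of the database.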

11
ITL Data Structure

ITEMTABLE
Index  1 2 3 4 5
Item   1 2 3 4 5
Count  2 3 4 4 5

TRANSLINK (figure: rows linked from the ItemTable)
2 3 4 5
1 2 3 5
1 2 4 5

12
Tree Data Structure

ITEMTABLE
Every item, with its support and a pointer to
the root of the subtree of the item.

COMPRESSED TRANSACTION TREE
All transactions of the database containing
frequent items. Only frequent items will be
stored in the tree.
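
As a rough illustration, the sketch below builds a plain prefix tree over the frequent items of each transaction, with a count on every node. This is a simplification: the authors' compression scheme and the ItemTable pointers to subtree roots are not reproduced, and the names Node, build_prefix_tree and count_nodes are assumptions.

```python
class Node:
    def __init__(self, item):
        self.item = item
        self.count = 0       # number of transactions passing through this node
        self.children = {}   # item -> Node


def build_prefix_tree(transactions, min_support):
    """Return (root, item_table) for a prefix tree holding only frequent items."""
    support = {}
    for t in transactions:
        for item in set(t):
            support[item] = support.get(item, 0) + 1
    frequent = {i for i, c in support.items() if c >= min_support}

    root = Node(None)
    for t in transactions:
        node = root
        # Infrequent items are dropped; the remaining items are inserted in sorted order
        # so that transactions sharing a prefix share the same path.
        for item in sorted(i for i in set(t) if i in frequent):
            if item not in node.children:
                node.children[item] = Node(item)
            node = node.children[item]
            node.count += 1
    item_table = {i: support[i] for i in sorted(frequent)}
    return root, item_table


def count_nodes(node):
    """Number of item nodes below the given node."""
    return sum(1 + count_nodes(child) for child in node.children.values())


root, item_table = build_prefix_tree([[1, 2, 3], [1, 2, 4], [1, 2, 3], [5, 1]], 2)
```

In this toy case four transactions with nine item occurrences collapse into a single path of three nodes, which is the kind of saving a compact tree is meant to give.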

13
ITL-Mine Algorithm

Three steps
Construct ItemTable and TransLink.
Prune any item below minimum support.
Mine Frequent Itemsets of 2 or more items.
Algorithm details in another paper.
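
Since the mining details are in another paper, the sketch below only mirrors the three-step outline. Steps 1 and 2 are simplified to per-item lists of transaction indices, and step 3 is a generic depth-first intersection of those lists standing in for the authors' mining procedure, which is not shown in this talk. All names are assumptions.

```python
def mine_frequent_itemsets(transactions, min_support):
    """Return {tuple(itemset): support count} for itemsets of 2 or more items."""
    # Step 1 (simplified): record, for every item, the transactions containing it.
    tids = {}
    for tid, t in enumerate(transactions):
        for item in set(t):
            tids.setdefault(item, set()).add(tid)

    # Step 2: prune any item below minimum support.
    frequent_items = sorted(i for i, s in tids.items() if len(s) >= min_support)

    # Step 3 (stand-in): grow itemsets depth-first by intersecting transaction sets.
    result = {}

    def extend(prefix, prefix_tids, candidates):
        for pos, item in enumerate(candidates):
            new_tids = prefix_tids & tids[item]
            if len(new_tids) >= min_support:
                itemset = prefix + (item,)
                if len(itemset) >= 2:
                    result[itemset] = len(new_tids)
                extend(itemset, new_tids, candidates[pos + 1:])

    extend((), set(range(len(transactions))), frequent_items)
    return result


found = mine_frequent_itemsets([[2, 3, 4, 5], [1, 2, 3, 5], [1, 2, 4, 5]], 3)
```

With a minimum support count of 3, only items 2 and 5 survive pruning in this toy database, so the only itemset reported is (2, 5).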

14
TreeITL-Mine Algorithm

Four steps
Identify Frequent Items and Initialize ItemTable
Construct Transaction Tree
Construct ITL
Mine Frequent Itemsets of 2 or more items.
Algorithm details in another paper.

15
CT-ITL Algorithm

Four steps
Identify Frequent Items and Initialize ItemTable
Construct the Compressed Transaction Tree
Construct Compact ITL
Mine Frequent Item Sets of 2 or more items.
Algorithm details in another paper.

16
CT-Mine Algorithm

Three steps
Identify Frequent Items and Initialize ItemTable
Construct the Compressed Transaction Tree
Mining
Algorithm details in another paper.

17
Performance Study on Connect-4
67,557 transactions, 129 items, average 43 items per transaction
18
Performance Study on Pumsb
49,046 transactions, 2,087 items, average 50 items per transaction
19
Performance Study on Connect-4
67,557 transactions, 129 items, average 43 items per transaction
20
Performance Study on Pumsb
49,046 transactions, 2,087 items, average 50 items per transaction
21
Further Work

Extend CT-ITL and CT-Mine for very large
databases.
Investigate the relative performance of CT-ITL
and CT-Mine on various practical data sets.
Build a data mining query optimizer.

22
Conclusion

The influence of compact data structure design on
the performance of frequent pattern mining has
been discussed.
The performance of the algorithms against
Apriori, Eclat, FP-Growth and OP on various data
sets is presented. The results show that our
fastest algorithms perform better than all others
at a number of commonly used support levels.
