Title: Mining Frequent Item Sets by Opportunistic Projection


1
Mining Frequent Item Sets by Opportunistic
Projection
  • Junqiang Liu1,4, Yunhe Pan1, Ke Wang2, Jiawei
    Han3
  • 1 Institute of Artificial Intelligence, Zhejiang
    University, China
  • 2 School of Computing Science, Simon Fraser
    University, Canada
  • 3 Department of Computer Science, UIUC, USA
  • 4 Dept. of CS, Hangzhou University of Commerce,
    China

2
Outline
  • How to discover frequent item sets
  • Previous works
  • Our approach: Mining Frequent Item Sets by
    Opportunistic Projection
  • Performance evaluations
  • Conclusions

3
What Are Frequent Item Sets
  • What is a frequent item set?
  • a set of items, X, that occurs together frequently
    in a database, i.e., support(X) ≥ a given
    threshold
  • Example

Given support threshold 3, the frequent item sets
are as follows: a:3, b:3, c:4, f:4, m:3, p:3,
ac:3, af:3, am:3, cf:3, cm:3, cp:3, fm:3, acf:3,
acm:3, afm:3, cfm:3, acfm:3

tid  items
01   a c d f g i m p
02   a b c f l m o
03   b f h j o
04   b c k p s
05   a c e f l m n p
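
As a sanity check, here is a brute-force Python sketch that applies the definition above directly to this example database; it simply counts the support of every candidate item set, which illustrates the definition only, not the paper's algorithm:

    from itertools import combinations

    # The example database from this slide.
    db = [set("acdfgimp"), set("abcflmo"), set("bfhjo"),
          set("bckps"), set("aceflmnp")]
    min_sup = 3

    def support(itemset):
        # Number of transactions containing every item of the set.
        return sum(1 for t in db if itemset <= t)

    items = sorted(set().union(*db))
    frequent = {}
    for k in range(1, len(items) + 1):
        level = {c: support(frozenset(c))
                 for c in combinations(items, k)}
        level = {c: s for c, s in level.items() if s >= min_sup}
        if not level:
            break   # anti-monotonicity: no larger set can be frequent
        frequent.update(("".join(c), s) for c, s in level.items())

    print(frequent)  # a:3, b:3, c:4, f:4, m:3, p:3, ..., acfm:3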
4
How To Discover Frequent Item Sets
  • Frequent item sets can be represented by a tree,
    which is not necessarily materialized.
  • Mining process
  • a process of tree construction, accompanied by
  • a process of projecting transaction subsets
    (sketched below)
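
A compact recursive sketch of this process in Python, assuming a lexicographic item order for simplicity (the imposed order in the paper may differ, and the function name mine is my own): each call grows one tree node, counts local items in its projected transaction subset, and recurses on each child's projection.

    def mine(pts, prefix, min_sup, out):
        # Grow one tree node: count local items in the projected
        # transaction subset (PTS), emit frequent extensions, and
        # recurse on each child's projection.
        counts = {}
        for t in pts:
            for item in t:
                counts[item] = counts.get(item, 0) + 1
        for item in sorted(counts):
            if counts[item] < min_sup:
                continue
            out["".join(prefix + [item])] = counts[item]
            # Project: transactions containing `item`, keeping only
            # the items after it in the imposed order.
            child = [[i for i in t if i > item]
                     for t in pts if item in t]
            mine(child, prefix + [item], min_sup, out)

    db = [list("acdfgimp"), list("abcflmo"), list("bfhjo"),
          list("bckps"), list("aceflmnp")]
    result = {}
    mine(db, [], 3, result)
    print(result)  # the 18 frequent item sets from the previous slide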

5
Frequent Item Set Tree - FIST
  • FIST is an ordered tree
  • each node: (item, weight)
  • the following orderings are imposed
  • items are ordered on a path (top-down)
  • items are ordered among children (left to right)
  • Frequent item set
  • a path starting from the FIST root
  • its support is the ending node's weight
  • PTS - projected transaction subset
  • Each FIST node has its own PTS, filtered or
    unfiltered
  • all transactions that support the frequent item
    set represented by the node (see the node sketch
    below)
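
A minimal sketch of such a node (the class name FistNode and its field layout are my own illustration, not the paper's implementation):

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class FistNode:
        item: str
        weight: int                   # support of the path root..node
        children: List["FistNode"] = field(default_factory=list)
        pts: list = field(default_factory=list)  # this node's PTS

    def path_support(root, items):
        # The support of a frequent item set is the weight of the
        # node ending the path that spells the item set from the root.
        node = root
        for it in items:
            node = next(c for c in node.children if c.item == it)
        return node.weight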

6
Frequent Item Set Tree (example)
7
Factors Related to Mining Efficiency and
Scalability
  • The FIST construction strategy
  • breadth first vs. depth first
  • The PTS representation
  • Memory-based representations: array-based,
    tree-based, vertical bitmap, horizontal
    bitstring, etc.
  • Disk-based representation
  • PTS projecting method and item counting method

8
Previous Works
Each entry lists: FIST construction strategy; PTS
representation; projecting method; remarks.
  • Apriori: breadth first; original DB; projected
    on the fly. Remarks: repetitive DB scans; huge
    FIST for dense data; expensive pattern matching.
  • Tree Projection: breadth first; original DB;
    projected on the fly. Remarks: repetitive DB
    scans; huge FIST for dense data; expensive
    pattern matching.
  • FP-Growth: depth first; FP-tree; recursively
    materializes conditional DBs/FP-trees. Remarks:
    the number of conditional FP-trees is of the
    same order of magnitude as the number of
    frequent item sets.
  • H-Mine: depth first; H-struct; partially
    materializes sub H-structs. Remarks: not the
    most efficient for sparse data; calls FP-Growth
    for dense data; partitions large data.
  • DepthProject: depth first; horizontal bitstring;
    selective projection. Remarks: mines maximal
    frequent item sets; less efficient than
    array-based for sparse, large data; less
    efficient than tree-based for dense data.
  • MAFIA: depth first; vertical bitmap; recursively
    materializes compressions. Remarks: mines
    maximal frequent item sets; less efficient than
    array-based for sparse, large data; less
    efficient than tree-based for dense data.
9
Our Approach: Mining Frequent Item Sets by
Opportunistic Projection
  • Philosophy
  • The algorithm must adapt the FIST construction
    strategy, the PTS representation, and the methods
    of item counting in and projection of PTSs to the
    features of the PTSs.
  • Main points
  • Mining sparse data by projecting array-based
    PTSs
  • Intelligently projecting tree-based PTSs for
    dense data
  • Heuristics for opportunistic projection

10
Mining sparse data by projecting array-based PTSs
  • TVLA: threaded varied-length array, for sparse
    PTSs
  • FIL: local frequent item list
  • LQ: linked queues
  • arrays
  • Each local frequent item has a FIL entry that
    consists of an item, a count, and a pointer.
  • Each transaction is stored in an array that is
    threaded to the FIL by an LQ according to its
    heading item in the imposed order (a sketch
    follows).
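
A loose Python sketch of this layout, with a deque standing in for each linked queue (the names Tvla and fil are mine, and lexicographic order stands in for the imposed order):

    from collections import deque

    class Tvla:
        # FIL maps each local frequent item to a [count, queue]
        # entry; every transaction array is threaded onto the queue
        # of its heading item.
        def __init__(self, transactions, min_sup):
            counts = {}
            for t in transactions:
                for i in t:
                    counts[i] = counts.get(i, 0) + 1
            self.fil = {i: [c, deque()]
                        for i, c in sorted(counts.items())
                        if c >= min_sup}
            for t in transactions:
                arr = sorted(i for i in t if i in self.fil)
                if arr:
                    self.fil[arr[0]][1].append(arr)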

11
How to project TVLA for PTS
  • Arrays (transactions) that support a node's first
    child are threaded by the LQ attached to the
    first entry of the FIL (see the previous figure).
  • The TVLA for a child node's PTS has its own FIL
    and LQ.
  • A child TVLA is unfiltered if it shares arrays
    with its parent, and filtered otherwise.

12
How to project TVLA for PTS (cont.)
  • Get the next child's PTS by shifting the
    transactions threaded in the currently explored
    LQ (the current child's PTS), as sketched below.
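
Continuing the Tvla sketch above, a hedged rendering of this shift; slicing stands in for advancing a shared pointer, since the paper's unfiltered projection shares arrays rather than copying them:

    def shift_to_next(tvla, cur_item):
        # After the child for cur_item is explored, re-thread each
        # of its arrays onto the queue of the next item the array
        # contains, delimiting the next child's PTS in place.
        q = tvla.fil[cur_item][1]
        while q:
            arr = q.popleft()
            rest = arr[1:]       # arr[0] == cur_item by construction
            if rest:             # thread by the new heading item
                tvla.fil[rest[0]][1].append(rest)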

13
Intelligently projecting tree-based PTSs for dense
data
  • Tree-based representation of dense PTSs, inspired
    by FP-Growth
  • Novel projecting methods that differ entirely
    from FP-Growth's
  • bottom-up pseudo projection
  • top-down pseudo projection

14
Tree-based Representation of dense PTS
  • TTF - threaded transaction forest
  • IL - item list: each entry consists of an item, a
    count, and a pointer.
  • Forest: each node is labeled by an item and
    associated with a weight.
  • Each local item in the PTS has an entry in the
    IL.
  • Each transaction in the PTS is one path starting
    from a root in the forest.
  • A node's weight is the number of transactions
    represented by the path through it.
  • All nodes of the same item are threaded by an IL
    entry.
  • A TTF is filtered if only local frequent items
    appear in it, and unfiltered otherwise (a sketch
    follows).
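
A rough Python sketch of the forest and the IL as described above (the names TtfNode and insert, and the list-based IL entries, are illustrative assumptions, not the paper's structures):

    from dataclasses import dataclass, field
    from typing import Dict, Optional

    @dataclass
    class TtfNode:
        item: str
        weight: int = 0                    # transactions through node
        parent: Optional["TtfNode"] = None
        children: Dict[str, "TtfNode"] = field(default_factory=dict)
        thread: Optional["TtfNode"] = None  # next node, same item

    def insert(roots, il, t):
        # Insert one filtered, ordered transaction as a path; every
        # new node is threaded onto its item's IL entry.
        # Usage: roots, il = {}, {}; call insert(roots, il, t) for
        # each filtered, reordered transaction t of the PTS.
        level, parent = roots, None
        for item in t:
            node = level.get(item)
            if node is None:
                node = TtfNode(item, 0, parent)
                level[item] = node
                entry = il.setdefault(item, [0, None])  # [count, head]
                node.thread, entry[1] = entry[1], node
            node.weight += 1
            il[item][0] += 1
            parent, level = node, node.children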

15
Bottom up pseudo projection of TTF (example)
16
Top down pseudo projection of TTF (example)
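
As a rough stand-in for the two example figures, here is a sketch of bottom-up pseudo projection over the TtfNode forest above: the child PTS for an item is delimited by following the item's thread and walking parent pointers upward, without copying; top-down pseudo projection would walk from the roots downward instead. This is my reading of the methods named here, not the paper's exact procedure.

    def bottom_up_pts(il, item):
        # Follow the thread of `item`; for each threaded node, walk
        # parent pointers up to a root, yielding one weighted prefix
        # path per node. The paths form the child's (pseudo) PTS.
        node = il[item][1]            # head of the item's thread
        while node is not None:
            path, up = [], node.parent
            while up is not None:
                path.append(up.item)
                up = up.parent
            yield list(reversed(path)), node.weight
            node = node.thread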
17
Opportunistic Projection: Observations and
Heuristics
  • Observation 1
  • The upper portion of a FIST can fit in memory.
  • The number of transactions that support length-k
    item sets decreases sharply when k is greater
    than 2.
  • Heuristic 1
  • Grow the upper portion of a FIST breadth first.
  • Grow the lower portion, below level k, depth
    first whenever the reduced transaction set can be
    represented by a memory-based structure, either
    a TVLA or a TTF.

18
Opportunistic Projection: Observations and
Heuristics (2)
  • Observation 2
  • TTF compresses well at lower levels and on denser
    branches, where there are fewer local frequent
    items in the PTSs and the relative support is
    larger.
  • TTF is space-expensive relative to TVLA if its
    compression ratio is less than 6 - t/n (t: the
    number of transactions, n: the number of items in
    a PTS).
  • Heuristic 2
  • Represent PTSs by TVLA at high levels of the
    FIST, unless the estimated compression ratio of
    the TTF is sufficiently high.

19
Opportunistic Projection: Observations and
Heuristics (3)
  • Observation 3
  • PTSs shrink very quickly at high levels and on
    sparse branches of the FIST, where filtered PTSs
    are usually in the form of TVLA.
  • PTSs at lower levels and on dense branches shrink
    slowly; there, PTSs are represented by TTF, and
    creating a filtered TTF involves expensive
    pattern matching.
  • Heuristic 3
  • When projecting a parent TVLA, make a filtered
    copy for the child TVLA as long as there is free
    memory.
  • When projecting a parent TTF, delimit the pseudo
    child TTF first, and then make a filtered copy
    only if it shrinks sufficiently sharply.

20
Algorithm OpportuneProject
  • OpportuneProject(Database D)
  • begin
  •   create a null root for the frequent item set
      tree T
  •   D' = BreadthFirst(T, D)
  •   GuidedDepthFirst(root_of_T, D')
  • end

21
Performance Evaluation: Efficiency on BMS-POS
(sparse)
22
Performance Evaluation: Efficiency on
BMS-WebView1 (sparse)
23
Performance Evaluation: Efficiency on
BMS-WebView2 (sparse)
24
Performance Evaluation: Efficiency on Connect4
(dense)
25
Performance Evaluation: Efficiency on
T25I20D100kN20kL5k
26
Performance Evaluation: Scalability on
T25I20D1mN20kL5k
27
Performance Evaluation: Scalability on
T25I20D10mN20kL5k
28
Performance Evaluation: Scalability on
T25I20D100k15mN20kL5k
29
Conclusions
  • OpportuneProject
  • maximizes efficiency and scalability across data
    features by combining
  • depth-first with breadth-first search strategies
  • array-based and tree-based representations for
    projected transaction subsets
  • unfiltered and filtered projections

30
Acknowledgement
  • We would like to thank
  • Blue Martini Software, Inc.
    for providing us with the BMS datasets!

31
References
  • [1] R. Agarwal, C. Aggarwal, and V. V. V. Prasad.
    A tree projection algorithm for generation of
    frequent itemsets. Journal of Parallel and
    Distributed Computing (Special Issue on High
    Performance Data Mining), 2000.
  • [2] R. Agarwal, C. Aggarwal, and V. V. V. Prasad.
    Depth first generation of long patterns. In
    Proceedings of the SIGKDD Conference, 2000.
  • [3] R. Agrawal, T. Imielinski, and A. Swami.
    Mining association rules between sets of items in
    large databases. In SIGMOD'93, Washington, D.C.,
    May 1993.
  • [4] R. Agrawal and R. Srikant. Fast algorithms
    for mining association rules. In VLDB'94, pp.
    487-499, Santiago, Chile, Sept. 1994.
  • [5] R. J. Bayardo. Efficiently mining long
    patterns from databases. In SIGMOD'98, pp. 85-93,
    Seattle, Washington, June 1998.
  • [6] D. Burdick, M. Calimlim, and J. Gehrke.
    MAFIA: A maximal frequent itemset algorithm for
    transactional databases. In Proceedings of the
    17th International Conference on Data
    Engineering, Heidelberg, Germany, April 2001.
  • [7] S. Brin, R. Motwani, J. D. Ullman, and S.
    Tsur. Dynamic itemset counting and implication
    rules for market basket analysis. In SIGMOD'97,
    pp. 255-264, Tucson, AZ, May 1997.

32
References (2)
  • [8] J. Han and Y. Fu. Discovery of multiple-level
    association rules from large databases. In
    VLDB'95, Zurich, Switzerland, Sept. 1995.
  • [9] J. Han, J. Pei, and Y. Yin. Mining frequent
    patterns without candidate generation. In SIGMOD
    2000, Dallas, TX, May 2000.
  • [10] D.-I. Lin and Z. M. Kedem. Pincer-search: A
    new algorithm for discovering the maximum
    frequent set. In 6th Intl. Conf. on Extending
    Database Technology, March 1998.
  • [11] J. S. Park, M. S. Chen, and P. S. Yu. An
    effective hash based algorithm for mining
    association rules. In Proc. 1995 ACM-SIGMOD, pp.
    175-186, San Jose, CA, Feb. 1995.
  • [12] J. Pei, J. Han, H. Lu, S. Nishio, S. Tang,
    and D. Yang. H-Mine: Hyper-structure mining of
    frequent patterns in large databases. In Proc.
    2001 Int. Conf. on Data Mining (ICDM'01), San
    Jose, CA, Nov. 2001.
  • [13] A. Savasere, E. Omiecinski, and S. Navathe.
    An efficient algorithm for mining association
    rules in large databases. In 21st Int'l Conf. on
    Very Large Databases (VLDB), Zurich, Switzerland,
    Sept. 1995.

33
References (3)
  • [14] H. Toivonen. Sampling large databases for
    association rules. In Proc. 1996 Int. Conf. on
    Very Large Data Bases (VLDB'96), pp. 134-145,
    Bombay, India, Sept. 1996.
  • [15] Z. Zheng, R. Kohavi, and L. Mason. Real
    world performance of association rule algorithms.
    In Proc. 2001 Int. Conf. on Knowledge Discovery
    in Databases (KDD'01), San Francisco, California,
    Aug. 2001.
  • [16] http://fuzzy.cs.uni-magdeburg.de/~borgelt/src/apriori.exe
  • [17] http://www.almaden.ibm.com/cs/quest/syndata.html
  • [18] http://www.ics.uci.edu/~mlearn/MLRepository.html

34
  • Thank you !!!