Indexing and Data Access Methods for Database Mining

1
Indexing and Data Access Methods for Database
Mining
  • Ganesh Ramesh
  • William A. Maniatty
  • University at Albany, SUNY
  • Mohammed J. Zaki
  • RPI

ACM-SIGMOD workshop on Data Mining and Knowledge
Discovery (DMKD) - 2002
2
Introduction
  • Data Mining = Flat File Mining???
  • Much time is spent in data transformation before
    actual mining techniques are used.
  • What is the NEED?
    • Support for Direct Mining of Data stored in
      Various Data Layouts (other than Flat Files).
    • Support for incorporating New Layouts with
      little or no modification to the existing
      implementation.
    • Support for Easy Integration of New Techniques
      into the existing infrastructure.

3
Related Work - I
  • Language Support for Mining and Database
    Integration
    • DMQL (Han et al.): Mining Query Language
    • MSQL (Imielinski et al.): SQL Extension
    • MINE RULE SQL operator (Meo et al.) and Query
      Flocks (Tsur et al.): extend association rules
      to support more general mining queries

4
Related Work - II
  • Mining and Database Integration
    • Agrawal and Shim study tightly coupled data
      mining applications on an RDBMS by pushing
      parts of the mining computations into the
      database system.
    • Sarawagi et al. study architectural
      alternatives for integrating mining and
      databases in the context of ARM.

5
Our Contributions
  • An Alternative Approach: Systems Programming
    Techniques and Data Access Methods for studying
    the Efficiency of Mining techniques
  • Efficient Indexing and Data Access Support for
    query execution
  • Performance Analysis of existing mining methods
    with respect to Data Access Methods to understand
    them better
  • Explore the possibility of ad-hoc query support
    and unified approaches for mining

6
Association Rule Mining (ARM) and Frequent
Itemset Computation
  • ARM: One of the classic problems in Data Mining,
    with a variety of techniques
  • A RICH area which uses a variety of data access
    methods
  • Hence a Natural Starting point

7
ARM Terminology I
  • I: Set of Items.
  • Transaction: A nonempty subset of I, uniquely
    identified by a TID. Transaction t contains an
    itemset X if X is a subset of t.
  • Database (DB): A nonempty sequence of
    transactions.
  • Support of an itemset X: Fraction of
    transactions in DB that contain X.
  • Given 0 < minsup < 1, all itemsets X that have
    support at least minsup in DB are called
    FREQUENT ITEMSETS or LARGE ITEMSETS.
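
The definitions above can be made concrete with a small sketch. The toy database and the function names (support, frequent_itemsets) are assumptions made for illustration and do not come from the slides; the enumeration is brute force, which is only reasonable for a handful of items.

```python
from itertools import combinations

# Toy database keyed by TID (made-up data, for illustration only).
DB = {
    1: {"a", "b", "c"},
    2: {"a", "c"},
    3: {"a", "d"},
    4: {"b", "c"},
}

def support(itemset, db):
    """Fraction of transactions in db that contain every item of itemset."""
    itemset = set(itemset)
    return sum(1 for t in db.values() if itemset <= t) / len(db)

def frequent_itemsets(db, minsup):
    """Brute-force enumeration of all itemsets with support >= minsup."""
    items = sorted(set().union(*db.values()))
    result = {}
    for k in range(1, len(items) + 1):
        for candidate in combinations(items, k):
            s = support(candidate, db)
            if s >= minsup:
                result[candidate] = s
    return result

print(support({"a", "c"}, DB))    # 0.5 (transactions 1 and 2 out of 4)
print(frequent_itemsets(DB, 0.5))
```

With minsup = 0.5 the item d (support 0.25) is not frequent, so no itemset containing d can be frequent either.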

8
ARM Terminology II
  • Association Rule: X => Y, where X and Y are
    disjoint itemsets.
  • Confidence of the rule X => Y is the conditional
    probability that a transaction contains Y, given
    that it contains X.
  • Frequent Itemsets: The first step computed by
    almost ALL ARM techniques, and the
    computationally more challenging one.
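
As a small illustration of the confidence definition, the sketch below computes it as count(X union Y) / count(X). The data and the helper names (count, confidence) are assumptions for this example only.

```python
def count(itemset, transactions):
    """Number of transactions that contain every item of itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)

def confidence(x, y, transactions):
    """Confidence of the rule X => Y: P(t contains Y | t contains X)."""
    x, y = set(x), set(y)
    if x & y:
        raise ValueError("X and Y must be disjoint")
    n_x = count(x, transactions)
    if n_x == 0:
        raise ValueError("X never occurs, so the confidence is undefined")
    return count(x | y, transactions) / n_x

# Made-up transactions, for illustration only.
transactions = [{"a", "b", "c"}, {"a", "c"}, {"a", "d"}, {"b", "c"}]

print(confidence({"b"}, {"c"}, transactions))  # 1.0: every transaction with b has c
print(confidence({"a"}, {"c"}, transactions))  # 2 of the 3 transactions with a have c
```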

9
Computing Frequent Itemsets: A Comparison of
Various Approaches
  • HORIZONTAL Mining approaches (APRIORI-Based)
    • Use a HORIZONTAL data layout where
      transactions are stored as (tid, itemset)
    • Use levelwise traversal of the lattice space
      of frequent itemsets (there are some
      exceptions)
    • Make MULTIPLE PASSES over the database
    • Use a data structure for counting support of
      candidate itemsets
  • VERTICAL Mining approaches (ECLAT-Based)
    • Use a VERTICAL or INVERTED data layout where,
      for each item, the list of transactions (TIDs)
      containing the item is stored (also called a
      tidlist)
    • Use tidlist intersections to compute the
      support of an itemset
    • Use Equivalence class based approaches and
      Depth First Traversal of the pattern space
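
The vertical approach can be sketched as follows: invert the horizontal (tid, itemset) records into tidlists, then extend itemsets depth first by intersecting tidlists. This is a minimal Eclat-style sketch and not the implementation measured in the talk; the data and the names to_vertical and eclat are assumptions.

```python
from collections import defaultdict

def to_vertical(horizontal):
    """Invert (tid, itemset) records into item -> tidlist (a set of TIDs)."""
    tidlists = defaultdict(set)
    for tid, items in horizontal:
        for item in items:
            tidlists[item].add(tid)
    return dict(tidlists)

def eclat(prefix, candidates, min_count, out):
    """Depth-first search over itemsets.

    candidates is a list of (item, tidlist) pairs, each of which extends
    prefix to a frequent itemset; out maps itemset -> support count.
    """
    for i, (item, tids) in enumerate(candidates):
        itemset = prefix + (item,)
        out[itemset] = len(tids)
        # Build the next equivalence class (all itemsets sharing `itemset`
        # as prefix) by intersecting with the tidlists of later siblings.
        extensions = []
        for other, other_tids in candidates[i + 1:]:
            common = tids & other_tids
            if len(common) >= min_count:
                extensions.append((other, common))
        if extensions:
            eclat(itemset, extensions, min_count, out)

# Made-up horizontal data, for illustration only.
horizontal = [(1, {"a", "b", "c"}), (2, {"a", "c"}), (3, {"a", "d"}), (4, {"b", "c"})]
vertical = to_vertical(horizontal)

min_count = 2
roots = sorted(
    ((item, tids) for item, tids in vertical.items() if len(tids) >= min_count),
    key=lambda pair: pair[0],
)
frequent = {}
eclat((), roots, min_count, frequent)
print(frequent)
```

Note that the database is read only once, to build the tidlists; every later support count comes from a set intersection.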

10
Data Access Methods and Middleware Design - I
  • HORIZONTAL Mining approaches (APRIORI-Based)
    • Group Data by Transactions
    • Coarse Grained Index: tid is the key and the
      itemset is a variable length data field
    • Fine Grained Index: (tid, item) is the key and
      no data field is associated with the key
  • VERTICAL Mining approaches (ECLAT-Based)
    • Group Data by Items or Itemset IDs
    • Coarse Grained Index: Itemset ID is the key and
      the tidlist is a variable length data field
    • Fine Grained Index: (Itemset ID, tid) is the key
      and no data field is associated with the key
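
To make the two granularities concrete, the sketch below stores the same horizontal data both ways. A plain dict stands in for the coarse-grained index and a sorted list of keys stands in for an ordered fine-grained index such as a B-tree; both stand-ins, the data, and the function names are assumptions for illustration.

```python
from bisect import bisect_left

# Made-up transactions, for illustration only.
transactions = {1: ["a", "b", "c"], 2: ["a", "c"], 3: ["a", "d"]}

# Coarse grained: one entry per transaction.
# Key = tid, data field = the variable-length itemset.
coarse = {tid: tuple(items) for tid, items in transactions.items()}

# Fine grained: one entry per (tid, item) pair, key only, no data field.
# The keys are kept sorted, as an ordered index would keep them.
fine = sorted((tid, item) for tid, items in transactions.items() for item in items)

def items_coarse(tid):
    """A single lookup returns the whole itemset."""
    return list(coarse[tid])

def items_fine(tid):
    """A range scan over all keys whose first component is tid."""
    start = bisect_left(fine, (tid,))
    result = []
    for key_tid, item in fine[start:]:
        if key_tid != tid:
            break
        result.append(item)
    return result

print(len(coarse), len(fine))  # 3 entries versus 7 entries
print(items_fine(2))           # ['a', 'c']
```

The fine-grained layout needs one index entry per (tid, item) pair, so the number of entries grows with the total number of items rather than the number of transactions.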

11
Data Access Methods and Middleware Design - II
  • Hybrid Granularity
    • Fragment Coarse Grained variable length Data
      into Blocks of fixed size
    • Useful in instances where large variable length
      data is NOT supported by the indexing mechanism
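
One way to picture hybrid granularity is to split a long tidlist into fixed-size blocks keyed by (key, block number). The block size, the None padding, and the names fragment and reassemble are assumptions for this sketch; the slides do not specify the block format.

```python
BLOCK_SIZE = 4  # entries per block (hypothetical value)

def fragment(key, values, block_size=BLOCK_SIZE):
    """Split one variable-length record into fixed-size blocks.

    Each block is stored under the key (key, block_no); the last block is
    padded with None so that every block has the same length.
    """
    blocks = {}
    for block_no, start in enumerate(range(0, len(values), block_size)):
        chunk = list(values[start:start + block_size])
        chunk += [None] * (block_size - len(chunk))
        blocks[(key, block_no)] = chunk
    return blocks

def reassemble(key, blocks):
    """Read blocks 0, 1, 2, ... for key and strip the padding."""
    values = []
    block_no = 0
    while (key, block_no) in blocks:
        values.extend(v for v in blocks[(key, block_no)] if v is not None)
        block_no += 1
    return values

tidlist = [1, 2, 5, 8, 9, 12, 15, 20, 21]  # made-up tidlist for item "a"
blocks = fragment("a", tidlist)
print(sorted(blocks))  # [('a', 0), ('a', 1), ('a', 2)]
print(reassemble("a", blocks) == tidlist)  # True
```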

12
Data Access Requirements: Horizontal and Vertical
ARM
  • HORIZONTAL Mining approaches (APRIORI-Based)
    • Open a Database
    • Close a Database
    • Populate a Database
    • Get Next Transaction
    • Reset to the Start of the Database
    • Delete Item(s) from a transaction
    • Delete the entire transaction
    • (Deletion ONLY FOR DATA LAYOUTS WHERE DELETION
      IS POSSIBLE)
  • VERTICAL Mining approaches (ECLAT-Based)
    • Open a Database
    • Close a Database
    • Populate a Database
    • Get tidlists for an itemset
    • Insert itemset and associated tidlist into the
      database
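
The operations listed above suggest a small middleware interface that each data layout implements. The sketch below is one possible shape for it; all class and method names are hypothetical, and the in-memory class merely stands in for a real layout (flat file, B-tree, and so on).

```python
from abc import ABC, abstractmethod

class HorizontalStore(ABC):
    """Operations a horizontal (Apriori-style) miner needs from a layout."""
    @abstractmethod
    def open(self): ...
    @abstractmethod
    def close(self): ...
    @abstractmethod
    def populate(self, transactions): ...
    @abstractmethod
    def next_transaction(self):
        """Return the next (tid, itemset) pair, or None at the end."""
    @abstractmethod
    def reset(self): ...
    # Deletion is optional: only layouts that can delete override these.
    def delete_items(self, tid, items):
        raise NotImplementedError("layout does not support deletion")
    def delete_transaction(self, tid):
        raise NotImplementedError("layout does not support deletion")

class VerticalStore(ABC):
    """Operations a vertical (Eclat-style) miner needs from a layout."""
    @abstractmethod
    def open(self): ...
    @abstractmethod
    def close(self): ...
    @abstractmethod
    def populate(self, transactions): ...
    @abstractmethod
    def get_tidlist(self, itemset): ...
    @abstractmethod
    def insert(self, itemset, tidlist): ...

class InMemoryHorizontalStore(HorizontalStore):
    """Toy layout that keeps everything in a dict and supports deletion."""
    def __init__(self):
        self._rows = {}    # tid -> set of items
        self._order = []   # tids in scan order
        self._pos = 0      # cursor for next_transaction
    def open(self):
        self._pos = 0
    def close(self):
        pass
    def populate(self, transactions):
        for tid, items in transactions:
            self._rows[tid] = set(items)
        self._order = sorted(self._rows)
        self._pos = 0
    def next_transaction(self):
        while self._pos < len(self._order):
            tid = self._order[self._pos]
            self._pos += 1
            if tid in self._rows:  # skip deleted transactions
                return tid, frozenset(self._rows[tid])
        return None
    def reset(self):
        self._pos = 0
    def delete_items(self, tid, items):
        self._rows[tid] -= set(items)
    def delete_transaction(self, tid):
        self._rows.pop(tid, None)

store = InMemoryHorizontalStore()
store.open()
store.populate([(1, "abc"), (2, "ac"), (3, "d")])
store.delete_items(1, {"b"})
store.delete_transaction(3)
store.reset()
```

A mining algorithm written against HorizontalStore or VerticalStore does not need to change when a new layout is added, which is the point of placing middleware between the miner and the storage.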

13
Low Level Data Access Methods and Indexed
Structures
  • Various Indexed Data Layouts support the
    functionality of the described Middleware. The
    ones we study are:
    • Flat Files with access through Middleware
    • B-Tree with Different Levels of Granularity
      (Fine, Coarse and Hybrid). Two different
      versions were used:
      • Generalized Search Tree (GiST)
      • Sleepycat's Berkeley DB
14
A Linked Block Structure
  • Method of Data Access by Apriori:
    • Sequential traversal of transactions.
    • Deletion is done in the current transaction
      that is being processed.
  • To support deletion while retaining a structure
    similar to the Flat File Data Layout, we use a
    Linked Block Structure (referred to as the Block
    File Structure, or BFS, Layout).
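
The slides do not describe the internals of the Block File Structure, so the following is only a guess at the general idea: transactions live in a chain of fixed-capacity blocks that is scanned sequentially, records can shrink or disappear during the scan, and blocks that become empty are unlinked. Every name and detail here is an assumption.

```python
class Block:
    def __init__(self, capacity):
        self.capacity = capacity
        self.records = []   # (tid, set of items) pairs in scan order
        self.next = None    # link to the next block in the chain

class BlockFile:
    """Hypothetical linked-block layout with deletion during a scan."""

    def __init__(self, transactions, capacity=2):
        self.head = None
        tail = None
        for tid, items in transactions:
            if tail is None or len(tail.records) == tail.capacity:
                block = Block(capacity)
                if tail is None:
                    self.head = block
                else:
                    tail.next = block
                tail = block
            tail.records.append((tid, set(items)))

    def scan_and_prune(self, keep):
        """One sequential pass: drop items for which keep(item) is false,
        drop transactions that become empty, unlink blocks left empty."""
        prev, block = None, self.head
        while block is not None:
            survivors = []
            for tid, items in block.records:
                kept = {item for item in items if keep(item)}
                if kept:
                    survivors.append((tid, kept))
            block.records = survivors
            if survivors:
                prev = block
            elif prev is None:
                self.head = block.next
            else:
                prev.next = block.next
            block = block.next

    def __iter__(self):
        block = self.head
        while block is not None:
            yield from block.records
            block = block.next

    def block_count(self):
        n, block = 0, self.head
        while block is not None:
            n, block = n + 1, block.next
        return n

bfs = BlockFile([(1, "ab"), (2, "x"), (3, "y"), (4, "z"), (5, "ac")])
before = bfs.block_count()
bfs.scan_and_prune(lambda item: item in {"a", "b", "c"})
print(before, bfs.block_count(), list(bfs))
```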

15
Deletion and Hybrid Strategies
  • Pruning was done in the methods using the
    horizontal data layouts (Yu et al.).
  • Vertical data layout methods: Implicit pruning
    through tidlist intersection.
  • CONJECTURE: Apriori performs well in initial
    passes while Eclat does better in later passes.
    To study this empirically, we implemented a
    HYBRID algorithm which starts with Apriori and
    switches to Eclat.
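
A hybrid of this kind can be sketched as below: run levelwise Apriori passes up to a chosen level, build tidlists for the frequent itemsets of that level, then continue depth first with tidlist intersections. This is a minimal sketch under stated assumptions, not the authors' implementation; the function name, the switch_level parameter, and the data are hypothetical.

```python
from itertools import combinations

def hybrid(transactions, min_count, switch_level):
    """Return {sorted item tuple: support count} for all frequent itemsets."""
    frequent = {}

    # Phase 1: Apriori, one database pass per level, up to switch_level.
    items = sorted(set().union(*transactions))
    level = [(item,) for item in items]
    last = {}
    k = 1
    while level and k <= switch_level:
        counts = {c: 0 for c in level}
        for t in transactions:
            for c in level:
                if t.issuperset(c):
                    counts[c] += 1
        last = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(last)
        # Join k-itemsets sharing their first k-1 items, then keep only
        # candidates whose k-subsets are all frequent.
        level = []
        for a, b in combinations(sorted(last), 2):
            if a[:-1] == b[:-1]:
                cand = a + (b[-1],)
                if all(s in last for s in combinations(cand, k)):
                    level.append(cand)
        k += 1

    # Phase 2: Eclat, starting from the last level Apriori completed.
    def extend(eq_class):
        for i, (a, tids_a) in enumerate(eq_class):
            nxt = []
            for b, tids_b in eq_class[i + 1:]:
                common = tids_a & tids_b
                if len(common) >= min_count:
                    cand = a + (b[-1],)
                    frequent[cand] = len(common)
                    nxt.append((cand, common))
            if nxt:
                extend(nxt)

    classes = {}
    for c in sorted(last):
        tids = {tid for tid, t in enumerate(transactions) if t.issuperset(c)}
        classes.setdefault(c[:-1], []).append((c, tids))
    for eq_class in classes.values():
        extend(eq_class)
    return frequent

# Made-up transactions, for illustration only.
data = [{"a", "b", "c"}, {"a", "c"}, {"a", "d"}, {"b", "c"}, {"a", "b", "c"}]
print(hybrid(data, min_count=2, switch_level=2))
```

Whatever switch_level is chosen, the result should be the same set of frequent itemsets; only the amount of work done by each phase changes.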

16
Notation and Terminology
  • Apriori with the Block File Structure will be
    denoted by APR-BFS.
  • For methods involving deletion, the word Prune
    will follow the notation (e.g., APR-BFS Prune).
    We studied pruning with respect to Apriori in
    the GiST and BFS layouts.

17
Experimental Results
  • Effect of Data Layout on Execution Time
  • Effect of Data Layout on Dataset Size
  • Effect of Pruning on Data Access Methods
  • Effect of Data Access Methods on where the time
    is spent by the algorithm
  • Study of Hybrid strategies

18
  • High Normalization degrades performance.
  • Coarse Grain: Eclat is faster than Apriori.
  • Fine Grain: Apriori is faster than Eclat.

19
  • Fine Grain is 10x slower than Coarse Grain.
  • BFS and Flat Files are comparable in performance.
    Pruning accelerates BFS.
  • Coarse Grain is 2-3x slower than Flat Files.
  • GiST B-tree emulation is slower than the
    Sleepycat Berkeley DB native B-tree.

20
  • Insertion and deletion make Eclat sensitive to
    the Access method
  • Flat Files are 3x faster than Coarse Grain
  • Fine Granularity of storage is Very Expensive
    (> 10x)
  • Native B-tree in Sleepycat Berkeley DB
    outperforms Emulated B-tree in GiST (By as much
    as 2x)

21
  • Pruning in Hybrid degrades performance
  • Deletion from transactions in lower passes not
    exploited by Eclat

22
  • Reinforces findings on synthetic data.
  • High Normalization degrades performance.
  • Coarse Grain: Eclat is faster than Apriori.
  • Fine Grain: Apriori is faster than Eclat.

23
  • Reinforces findings on synthetic data.
  • Fine Grain is 10x slower than flat files.
  • BFS performance close to flat files. Pruning
    helps.
  • Native B-tree in Sleepycat Berkeley DB
    outperforms Emulated B-tree in GiST (By as much
    as 2x)

24
  • Fine Grain data layout is at least 3 times larger
    than flat files.
  • Meta Data Overhead
  • Flat Files most economical
  • BFS close to flat files
  • Vertical Coarse Grain has long tidlists (and less
    indexing)

25
  • Pruning helps SOMETIMES. A positive example is
    shown here.
  • When can pruning help?
    • Early Pruning
    • Sufficiently large number of passes
    • Substantial number of items/transactions pruned
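
The conditions above can be illustrated with a simple trimming step applied after pass k of a levelwise algorithm. This is a common form of transaction pruning and is an assumption here; the slides do not spell out the exact pruning rule used. An item that appears in no frequent k-itemset cannot appear in a frequent (k+1)-itemset, and a transaction with fewer than k+1 remaining items cannot support any (k+1)-candidate.

```python
def prune_pass(transactions, frequent_k, k):
    """Trim the database after pass k.

    transactions: dict tid -> set of items
    frequent_k:   iterable of frequent k-itemsets (tuples of items)
    """
    live_items = set().union(*frequent_k) if frequent_k else set()
    pruned = {}
    for tid, items in transactions.items():
        kept = items & live_items
        if len(kept) >= k + 1:
            pruned[tid] = kept
    return pruned

def total_items(transactions):
    """Total number of (tid, item) pairs a later pass has to read."""
    return sum(len(items) for items in transactions.values())

# Made-up data, for illustration only.
db = {1: {"a", "b", "c"}, 2: {"a", "c"}, 3: {"a", "d"}, 4: {"b", "c"}}
frequent_1 = [("a",), ("b",), ("c",)]

trimmed = prune_pass(db, frequent_1, k=1)
print(total_items(db), total_items(trimmed))  # 9 7
```

Here pruning removes item d and then transaction 3, which is left with a single item. The saving only pays off if later passes read the smaller database often enough to recover the cost of the deletions.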

26
  • GNU gprof: Measures Run Time.
  • Fine Grain: Data Access methods dominate.
  • Flat Files: Negligible Access Time.
  • Apriori: Near even split between counting and
    candidate generation.
  • Eclat: Passes one and two take 30% of the time.
  • Eclat on Coarse Grain: GiST uses a lot more time
    on data access than Berkeley DB.

27
(No Transcript)
28
  • Eclat outperforms both Apriori and Hybrid
  • Hybrid provides a superior alternative to Apriori

29
Conclusions
  • For Coarse Grained Methods to be scalable, good
    BLOB support is needed to store long tidlists.
  • Eclat: Very sensitive to the granularity. Goes
    from being the fastest (Coarse Grain) to the
    slowest (Fine Grain).
  • Excessive Normalization causes meta data
    inflation and performance degradation.

30
Future Directions
  • Extensions to OTHER mining techniques: Sequence
    Mining (study underway), Clustering and
    Classification
  • Out of Core data structures to store the
    intermediate results (candidates, for instance)
  • Enhancement of profiling technology to perform
    data acquisition, and Mining the profiled data
    to learn about the performance