Go Green: Recycle and Reuse in Human-Centered, Interactive Data Mining

Slides: 47
Provided by: iscp7
Learn more at: http://homepages.inf.ed.ac.uk
Transcript and Presenter's Notes

Title: Go Green: Recycle and Reuse in Human-Centered, Interactive Data Mining


1
Go Green: Recycle and Reuse in Human-Centered, Interactive Data Mining
  • Presented by
  • Cong Gao

2
Outline
  • Introduction
  • An Architecture for recycle and reuse
  • Recycling Frequent Patterns through Compression
  • Performance Studies
  • Related Works
  • Conclusions and Future Work

3
Introduction
  • Data mining: the nontrivial extraction of implicit,
    previously unknown, and potentially useful
    information from large databases

  • Mining task is usually an iterative process
  • Specify the initial dataset and constraints
  • Run the mining algorithm
  • If the results are not satisfactory, change the dataset or
    the constraints and run again
  • Otherwise, end
4
Introduction
  • The main problems of iterative mining
  • Each process (iteration) still requires a large
    amount of CPU and I/O
  • Each process is a black box. Users accept or
    abandon results at the end of each process
  • One process ends before the next begins

5
Introduction
  • Current solutions to the above problems
  • Impose constraints to reduce the resources spent on
    finding uninteresting knowledge
  • Mine interactively, with breakpoints

6
Introduction -- Our Observations
  • Much useful information is produced by previous mining
    processes
  • Sources (task, user)
  • Usage
  • Can be reused for a new round of mining
  • Can also be reused for query answering and
    planning

7
Introduction
  • Examples of useful information from a previous
    round of mining
  • Discovered knowledge, e.g.
  • closed patterns for query answering
  • clustering for identifying sparse or empty
    regions
  • Selectivity and distribution of data values, e.g.
  • a decision tree can provide information for query
    optimization
  • Mining cost, e.g. data mining plan

8
Our Objective and Contributions
  • Objective: recycle and reuse knowledge
    about KDD processes
  • Query answering using views
  • More difficult, complicated, heterogeneous
  • Contribution: propose a system architecture for
    recycling and reusing KDD processes

9
Our Contributions (cont.)
  • Illustrate the architecture with frequent
    pattern mining.
  • Use the frequent patterns discovered in a
    previous mining run to compress the database for
    subsequent mining under different constraints.
  • The compression aims to minimize the mining cost
    rather than the storage space; minimizing storage
    does not necessarily minimize the mining cost.
  • Extensive experiments were conducted.

10
Outline
  • Introduction
  • An Architecture for recycle and reuse
  • Recycling Frequent Patterns through Compression
  • Performance Studies
  • Related Works
  • Conclusions and Future Work

11
An Architecture for Recycle and Reuse
12
Knowledge Recycle Bin
  • Task: store information about previously executed
    data mining processes
  • Information includes
  • Data selection predicates
  • e.g. constraints, data selection predicates
  • Determine the relevance of mining processes
  • Algorithms and parameters
  • e.g. k in k-means
  • Evaluate the usefulness of recycling the KDD
    process in a different mining context

13
Knowledge Recycle Bin (cont.)
  • Knowledge/patterns discovered
  • Encapsulate a lot of information about the KDD
    process
  • Storage and efficient retrieval by the Bin Manager
  • Time of mining
  • Determines the validity of the result of a mining
    process
  • Helps the Bin Manager
  • Efficiency measurement
  • CPU and I/O
  • Facilitates better selection of mining algorithms
    and plans
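
The kinds of information listed for the Knowledge Recycle Bin could be grouped into one record per mining process. The following is only a minimal sketch: the class name `BinEntry`, its field names, and the age-based `is_stale` check are hypothetical and do not come from the presentation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class BinEntry:
    """One previously executed mining process (hypothetical layout)."""
    selection_predicate: str   # data selection predicate / constraints
    algorithm: str             # name of the mining algorithm
    parameters: dict           # e.g. {"k": 5} for k-means
    patterns: dict             # discovered knowledge, e.g. itemset -> support
    mined_at: datetime         # time of mining
    cpu_seconds: float = 0.0   # efficiency measurement: CPU
    io_reads: int = 0          # efficiency measurement: I/O

    def is_stale(self, now: datetime, max_age: timedelta) -> bool:
        """Assumed validity policy: results older than max_age are stale."""
        return now - self.mined_at > max_age


# Illustrative values only.
entry = BinEntry(
    selection_predicate="region = 'EU'",
    algorithm="frequent-pattern-mining",
    parameters={"min_support": 3},
    patterns={frozenset("fgc"): 3},
    mined_at=datetime(2004, 1, 1),
)
```

Keeping the predicate and parameters next to the patterns is what would let a later task check whether an old result is relevant before reusing it.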

14
Bin Manager
  • Tasks
  • Storage to Bin
  • Minimum redundancy
  • Explicit storage
  • Retrieval from Bin
  • Bridge between the Mining Optimizer and the Knowledge
    Recycle Bin
  • Removal from Bin
  • The least used
  • The oldest
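
The three Bin Manager tasks (storage, retrieval, removal) can be sketched as a small class. This is an illustration under assumptions, not the authors' design: the `BinManager` API, the string keys, and the use of a logical clock for "oldest" are all hypothetical.

```python
class BinManager:
    """Toy recycle-bin manager: store, retrieve, and evict (hypothetical API)."""

    def __init__(self, capacity, policy="least_used"):
        self.capacity = capacity
        self.policy = policy      # "least_used" or "oldest"
        self.entries = {}         # key -> stored patterns
        self.uses = {}            # key -> number of retrievals
        self.stored_at = {}       # key -> logical insertion time
        self.clock = 0

    def store(self, key, patterns):
        if key in self.entries:   # minimum redundancy: one copy per key
            return
        if len(self.entries) >= self.capacity:
            self.evict()
        self.clock += 1
        self.entries[key] = patterns
        self.uses[key] = 0
        self.stored_at[key] = self.clock

    def retrieve(self, key):
        if key not in self.entries:
            return None
        self.uses[key] += 1
        return self.entries[key]

    def evict(self):
        # Remove the least used entry, or the oldest one, depending on policy.
        rank = self.uses if self.policy == "least_used" else self.stored_at
        victim = min(self.entries, key=lambda k: rank[k])
        for table in (self.entries, self.uses, self.stored_at):
            del table[victim]
        return victim


manager = BinManager(capacity=2)
manager.store("task-a", {"fgc": 3})
manager.store("task-b", {"ae": 3})
manager.retrieve("task-a")
manager.store("task-c", {"c": 4})   # bin is full: "task-b" is the least used
```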

15
Mining Optimizer
  • Task
  • to select the best strategy for performing a
    mining task of the user, using existing knowledge
    or patterns in the recycle bin

16
Mining Optimizer
  • Main Components
  • Statistical synopsis
  • Similar to a histogram in a database query optimizer
  • E.g. frequent patterns as a multi-dimensional histogram
  • Mining plan
  • Similar in function to a database query optimizer
  • Hard, as mining tasks are complex processes
  • Maintaining a knowledge recycle bin helps
  • Rule-based, heterogeneous

17
Outline
  • Introduction
  • An Architecture for recycle and reuse
  • Recycling Frequent Patterns through Compression
  • Performance Studies
  • Related Works
  • Conclusions and Future Work

18
Frequent Pattern Mining
  • Given a minimum support threshold, a pattern
    (itemset) is frequent in a database when its
    support is no less than the threshold
  • Frequent pattern mining discovers all
    frequent patterns
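
The definition can be made concrete with a brute-force sketch. It assumes a pattern counts as frequent when its support is at least the threshold, which is how the later example (support-3 patterns listed under minimum support 3) appears to use the term. The toy database and function names are hypothetical, and the enumeration is exponential, so it is for illustration only.

```python
from itertools import combinations


def support(pattern, db):
    """Number of tuples in db that contain every item of pattern."""
    return sum(1 for t in db if pattern <= t)


def frequent_patterns(db, min_support):
    """Enumerate every itemset and keep the frequent ones."""
    items = sorted(set().union(*db))
    result = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            pattern = frozenset(combo)
            count = support(pattern, db)
            if count >= min_support:   # assumption: support >= threshold
                result[pattern] = count
    return result


# Hypothetical toy database, not taken from the slides.
db = [frozenset("fgc"), frozenset("fgca"), frozenset("ae"), frozenset("aec")]
found = frequent_patterns(db, min_support=2)
```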

19
Our Objective
  • Show that the architecture is practical and
    useful, instead of developing a new algorithm
  • Frequent patterns can be used to estimate the
    cost of visiting portions of the search
    space that have been visited before
  • Such estimates can be used to develop
    a mining plan that reduces the cost of a new
    round of mining

20
Our Approach
  • Select a set of frequent patterns from the
    recycle bin and use them to compress the data
    to be mined.
  • The selection criteria take into account the
    estimated saving that could occur when the
    dataset is compressed with a particular pattern.

21
Our Approach--Assumptions
  • It is easy to gather a set of relevant patterns
    from the recycle bin, using data selection predicates
    and mining parameters
  • Mining uses a depth-first search algorithm, e.g.
    Eclat, VIPER, H-Mine, FP-tree, etc.

22
Compression Strategies
  • Example: given the set of frequent patterns found
    under the old minimum support of 3, how do we mine
    when the minimum support is changed to 2?

23
Compression Strategies
  • How can compression speed up mining?
  • Saving in counting
  • Saving in constructing projected DB
  • How to choose a pattern to compress the DB? I.e. how
    to estimate the saving when using a specific
    pattern for compression.
  • E.g. given the set of frequent patterns {f:3, fg:3, fgc:3,
    g:3, gc:3, a:3, ae:3, ac:3, e:4, ec:3, c:4}
    (pattern:support) with minimum support 3.
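
The "saving in counting" can be illustrated with a small sketch: tuples that share a compressing pattern are grouped, so the pattern's items are counted once per group rather than once per tuple. This is not the RP-struct representation from the talk. The function names and toy database are hypothetical, and the pattern for each tuple is simply the first listed pattern it contains.

```python
from collections import Counter, defaultdict


def compress(db, patterns):
    """Group tuples by the first listed pattern they contain.

    Each group keeps only the leftover items (tails) of its tuples.
    """
    groups = defaultdict(list)
    for t in db:
        head = next((p for p in patterns if p <= t), frozenset())
        groups[head].append(t - head)
    return groups


def item_supports(groups):
    """Count item supports on the compressed database."""
    counts = Counter()
    for head, tails in groups.items():
        for item in head:
            counts[item] += len(tails)   # one addition per group, not per tuple
        for tail in tails:
            counts.update(tail)
    return counts


# Hypothetical toy database and recycled patterns.
db = [frozenset("fgca"), frozenset("fgc"), frozenset("fgce"),
      frozenset("ae"), frozenset("aec")]
patterns = [frozenset("fgc"), frozenset("ae")]
groups = compress(db, patterns)
counts = item_supports(groups)
```

The counts are the same as those from scanning every item of every tuple; only the number of additions for the shared items changes.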

24
Compression Strategies
  • Three strategies to compute the utility of each
    pattern
  • Strategy 1: Maximal Length Principle (MLP)
  • The utility function is U(X) = |X| * |DB| + X.C,
    where |X| is the number of items in pattern X, |DB|
    is the number of tuples in the database, and X.C is
    the number of tuples that contain pattern X
  • Strategy 2: Maximal Area Principle (MAP)
  • The utility function is U(X) = |X| * X.C
  • Strategy 3: Minimize Cost Principle (MCP)
  • The utility function is U(X) = (2^|X| - 1) * X.C
  • Each tuple is compressed with a pattern selected
    based on its utility value
  • Compression is not limited by the mining
    algorithm
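
The three strategies can be compared on the example pattern set. The sketch reads the utility functions as U(X) = |X| * |DB| + X.C (MLP), |X| * X.C (MAP) and (2^|X| - 1) * X.C (MCP). That reading is a reconstruction, since the operators are not legible in the transcript. The value of `DB_SIZE` and all function names are hypothetical.

```python
# Pattern -> support count X.C, from the example set of frequent patterns.
PATTERNS = {"f": 3, "fg": 3, "fgc": 3, "g": 3, "gc": 3, "a": 3,
            "ae": 3, "ac": 3, "e": 4, "ec": 3, "c": 4}
DB_SIZE = 5  # hypothetical |DB|; the slides do not state it


def mlp(pattern, count, db_size=DB_SIZE):
    """Maximal Length Principle: length dominates, support breaks ties."""
    return len(pattern) * db_size + count


def map_utility(pattern, count):
    """Maximal Area Principle: items covered across all matching tuples."""
    return len(pattern) * count


def mcp(pattern, count):
    """Minimize Cost Principle: 2^|X| - 1 non-empty sub-patterns per tuple."""
    return (2 ** len(pattern) - 1) * count


def compress_tuple(items, utility):
    """Pick the contained pattern with the highest utility.

    Returns (pattern, leftover items); the pattern is "" if none applies.
    """
    candidates = [p for p in PATTERNS if set(p) <= items]
    if not candidates:
        return "", items
    best = max(candidates, key=lambda p: utility(p, PATTERNS[p]))
    return best, items - set(best)


head, rest = compress_tuple({"f", "g", "c", "a"}, mcp)
```

For the tuple {f, g, c, a}, the pattern fgc has the highest MCP utility among the contained patterns, so the tuple is stored as fgc plus the leftover item a.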

25
The Mining Algorithm
  • Representation of the compressed database with
    RP-struct
  • [Figure: RP-struct with Group Head, Group Tail and RP-Header Table]
26
Mining Algorithm
  • With an example
  • Find those patterns containing item d
  • Find those containing f but not d
  • Find those containing g but not f and d
  • Find those containing a but not g, f and d
  • Find those containing e but not a, g, f and d
  • Find those containing only c
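
The way the search space is divided by item order can be sketched as a plain depth-first miner. It runs on an uncompressed toy database, so it shows only the partitioning idea and not the RP-struct algorithm from the talk. The database, the function name, and the "support at least the threshold" test are assumptions.

```python
def mine(db, min_support, order):
    """Depth-first frequent pattern mining over a fixed item order.

    The branch for order[i] finds the patterns that contain order[i]
    but none of the earlier items order[:i].
    """
    results = {}

    def recurse(prefix, projected, remaining):
        for i, item in enumerate(remaining):
            # Projected database: tuples that contain prefix plus this item.
            containing = [t for t in projected if item in t]
            if len(containing) >= min_support:
                pattern = prefix | {item}
                results[pattern] = len(containing)
                # Extend only with later items, so each pattern is found once.
                recurse(pattern, containing, remaining[i + 1:])

    recurse(frozenset(), db, order)
    return results


# Hypothetical toy database; item order d, f, g, a, e, c as in the example.
db = [{"d", "f", "c"}, {"f", "g", "c"}, {"f", "g", "c", "a"},
      {"a", "e", "c"}, {"a", "e"}]
result = mine(db, min_support=2, order=["d", "f", "g", "a", "e", "c"])
```

Because each recursive call only extends a pattern with items later in the order, the top-level branches correspond to the six steps listed on the slide.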

27
Mining Algorithm
28
Mining Algorithm
  • When the dataset cannot fit in memory
  • Project the compressed database onto its set of
    frequent items.
  • Partition-based
  • Parallel projection

29
Outline
  • Introduction
  • An Architecture for recycle and reuse
  • Recycling Frequent Patterns through Compression
  • Performance Studies
  • Related Works
  • Conclusions and Future Work

30
Performance Studies
  • Use the old support threshold to generate a set of
    patterns for the knowledge recycle bin
  • Lower the support threshold to the new value and
    recycle the patterns
  • Dataset

31
Analysis of Compression Strategies
Compression time and compression ratio
MLP MCP MAP
32
Mining in Main Memory
  • To evaluate the effectiveness of three
    compression strategies
  • MCP MLP MAP
  • Better compression does not necessarily mean
    better performance.
  • Minimizing mining cost (MCP) is more effective
    than minimizing storage space (MLP, MAP)

33
Mining in Main Memory
  • To evaluate the effectiveness of pattern
    recycling
  • Compare with H-Mine, since we adopt its data
    structure for uncompressed data
  • Compare with FP-tree, since it also compresses the
    database; but this comparison cannot prove the
    effectiveness of recycling
  • RP-MCP >> H-Mine
  • RP-MCP >> FP-tree, except on very dense datasets

34
Mining in Main Memory
Dataset: SD (FP-tree runs out of memory)
Dataset: Weather
35
Mining in Main Memory
Dataset: Forest
Dataset: Thrombin
36
Mining in Main Memory
Dataset: Connect-4
37
Mining in Memory
  • Two interesting observations
  • RP-MCP can be applied to incremental mining. It
    works well even when the constraint (here, minimum
    support) changes dramatically. Existing
    incremental mining techniques cannot.

38
Mining in Memory
  • Two interesting observations (cont)
  • When minimum support is low, RP-MCP >> H-Mine.
    The saving >> the time used to generate the set
    of frequent patterns in the knowledge recycle bin
    plus the compression time. We could divide a new
    mining task with low minimum support into two
    steps
  • (a) first run it with a high minimum support
  • (b) then compress the database with the MCP
    strategy and mine the compressed database with the
    actual low minimum support.

39
Mining with Memory Limitation
Dataset: SD
Dataset: Weather
40
Mining with Memory Limitation
Dataset: Forest
Dataset: Connect-4
41
Outline
  • Introduction
  • An Architecture for recycle and reuse
  • Recycling Frequent Patterns through Compression
  • Performance Studies
  • Related Works
  • Conclusions and Future Work

42
Related Works
  • A specific mining task
  • e.g. association rules, clustering, decision
    trees
  • Our work is different
  • Integrating different mining tasks together
  • e.g. Spartan, CBA, Netcube, DBMS
  • None of these intends to recycle and reuse the KDD
    process
  • Constraint-based mining
  • Prolongs the KDD process
  • Makes recycle and reuse more important

43
Related Works
  • Incremental mining
  • Our algorithm can be used for incremental mining,
    although that was not our initial intention
  • Compared with ours, existing techniques have
    disadvantages
  • They store extra information, such as the negative border
  • Not applicable when the change in constraints or
    dataset is significant
  • Existing techniques become awkward when the size
    of the dataset shrinks

44
Conclusions and Future Work
  • Proposed an architecture for recycling and
    reusing knowledge
  • Used frequent pattern mining as an example to
    illustrate the architecture
  • Our experiments proved the architecture useful
    and practical

45
Conclusions and Future Work
  • Investigate the possible recycling and reuse of
    other data mining results
  • Extend the architecture to multi-user platforms,
    e.g. P2P.

46
  • Questions and Comments