Prediction Cubes - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Prediction Cubes


1
Prediction Cubes
  • Bee-Chung Chen, Lei Chen, Yi Lin and Raghu
    Ramakrishnan
  • University of Wisconsin - Madison

2
Subset Mining
  • We want to find interesting subsets of the
    dataset
  • Interestingness: defined by the model built on
    a subset
  • Cube space: a combination of dimension-attribute
    values defines a candidate subset, just like
    regular OLAP (see the sketch below)
  • We want the measures to represent
    decision/prediction behavior
  • Summarize a subset using the model built on it
  • A big change from regular OLAP!

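To make the cube-space idea concrete, here is a minimal Python sketch; the fact table, its location and year dimensions, and the ALL roll-up value are hypothetical stand-ins. Each combination of dimension values names one candidate subset.

```python
from itertools import product

# Toy fact table with two hypothetical dimensions, location and year.
records = [
    {"location": "WI", "year": 1985, "approved": True},
    {"location": "WA", "year": 1985, "approved": False},
    {"location": "WI", "year": 1986, "approved": True},
]

locations = sorted({r["location"] for r in records}) + ["ALL"]
years = sorted({r["year"] for r in records}) + ["ALL"]

# Each (location, year) combination, including ALL roll-ups, names a subset.
for loc, yr in product(locations, years):
    subset = [r for r in records
              if loc in ("ALL", r["location"]) and yr in ("ALL", r["year"])]
    print((loc, yr), "->", len(subset), "records")
```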
3
The Idea
  • Build OLAP data cubes in which cell values
    represent decision/prediction behavior
  • In effect, build a tree for each cell/region in
    the cube; observe that this is not the same as
    the collection of trees used in an ensemble
    method!
  • The idea is simple, but it leads to promising
    data mining tools
  • Ultimate objective: exploratory analysis of the
    entire space of data mining choices (choice of
    algorithms, data-conditioning parameters, etc.)

4
Example (1/7): Regular OLAP
Z: dimensions; Y: measure
Goal: look for patterns of unusually
high numbers of applications
5
Example (2/7): Regular OLAP
Goal: look for patterns of unusually
high numbers of applications
Z: dimensions; Y: measure
[Figure: the same cube drilled down to finer regions]
6
Example (3/7): Decision Analysis
Goal: analyze a bank's loan decision process
w.r.t. two dimensions: Location and Time
Fact table D
Z: dimensions; X: predictors; Y: class
7
Example (3/7): Decision Analysis
  • Are there branches (and time windows) where
    approvals were closely tied to sensitive
    attributes (e.g., race)?
  • Suppose you partitioned the training data by
    location and time, chose the partition for a
    given branch and time window, and built a
    classifier. You could then ask, "Are the
    predictions of this classifier closely correlated
    with race?" (See the sketch below.)
  • Are there branches and times with decision making
    reminiscent of 1950s Alabama?
  • This requires comparing classifiers trained on
    different subsets of the data.

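A minimal sketch of that partition-and-train step, assuming a toy pandas fact table and scikit-learn decision trees; the column names and the crude correlation check are illustrative stand-ins, not the paper's exact procedure.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy fact table: in the (WA, 1985) cell approvals track race exactly;
# everywhere else they track income.
rows = []
for loc in ("WI", "WA"):
    for yr in (1985, 1986):
        for i in range(40):
            race, income = i % 2, i % 10
            approved = 1 - race if (loc, yr) == ("WA", 1985) else int(income >= 5)
            rows.append((loc, yr, race, income, approved))
D = pd.DataFrame(rows, columns=["location", "year", "race", "income", "approved"])

# One classifier per cube cell, then a prediction/race association check.
for (loc, yr), cell in D.groupby(["location", "year"]):
    h = DecisionTreeClassifier(random_state=0).fit(cell[["race", "income"]],
                                                   cell["approved"])
    preds = pd.Series(h.predict(cell[["race", "income"]]), index=cell.index)
    print(f"({loc}, {yr}): corr(prediction, race) = {preds.corr(cell['race']):.2f}")
```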
8
Example (4/7) Prediction Cubes
  • Build a model using data from the USA in Dec., 1985
  • Evaluate that model
  • Measure in a cell:
  • Accuracy of the model
  • Predictiveness of Race, measured based on that
    model
  • Similarity between that model and a given model

9
Example (5/7) Model-Similarity
Given: data table D, a target model h0(X), and a
test set Δ without labels (see the sketch below)
Example conclusion: "The loan decision process in the
USA during Dec 04 was similar to a discriminatory
decision model"
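A minimal sketch of one natural reading of the similarity measure: score how often the cell's model h and the target model h0 agree on the unlabeled test set. The toy models below are stand-ins.

```python
# h0: a given (here, discriminatory) target model; h: the model learned in
# some cube cell. Both are toy stand-ins.
def similarity(h, h0, test_set):
    """Fraction of unlabeled test examples on which h and h0 agree."""
    return sum(h(x) == h0(x) for x in test_set) / len(test_set)

h0 = lambda x: x["race"] == 0
h = lambda x: x["income"] > 50 and x["race"] == 0
test = [{"race": r, "income": i} for r in (0, 1) for i in (20, 80)]
print(similarity(h, h0, test))  # 0.75: the two models agree on 3 of 4 examples
```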
10
Example (6/7) Predictiveness
Given: data table D, attributes of interest V, and a
test set Δ without labels
Build two models on each subset, h(X) and h(X \ V),
at level (Country, Month); the predictiveness of V is
how much their predictions differ on the test set Δ
(see the sketch below)
[Figure: data table D with Yes/No class labels, the
two models, and the test set Δ]
Example conclusion: "Race was an important predictor
of the loan approval decision in the USA during
Dec 04"
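A minimal sketch of that measure under one natural reading: train h(X) and h(X \ V) on the cell's data and report how often their predictions differ on the test set. The data, column names, and learner are assumptions.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

def predictiveness(cell, predictors, V, test):
    """Disagreement between h(X) and h(X \\ V) on the test set."""
    rest = [c for c in predictors if c not in V]
    h_full = DecisionTreeClassifier(random_state=0).fit(cell[predictors],
                                                        cell["approved"])
    h_rest = DecisionTreeClassifier(random_state=0).fit(cell[rest],
                                                        cell["approved"])
    return (h_full.predict(test[predictors]) != h_rest.predict(test[rest])).mean()

# Toy cell where approval is decided entirely by race:
cell = pd.DataFrame({"race": [0, 1] * 20,
                     "income": [i // 4 for i in range(40)],
                     "approved": [1, 0] * 20})
test = pd.DataFrame({"race": [0, 1] * 5,
                     "income": [i // 4 for i in range(10)]})
print(predictiveness(cell, ["race", "income"], ["race"], test))  # ~0.5 here
```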
11
Example (7/7) Prediction Cube
Cell value: predictiveness of Race
12
Efficient Computation
  • Reduce prediction cube computation to data cube
    computation
  • Represent a data-mining model as a distributive
    or algebraic (bottom-up computable) aggregate
    function, so that data-cube techniques can be
    directly applied

13
Bottom-Up Data Cube Computation
Cell values: numbers of loan applications (sketch below)
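A minimal sketch of the bottom-up idea on a toy fact list: the finest cells are counted once, and every coarser cell is obtained by summing its children rather than rescanning the data.

```python
from collections import Counter

# Facts at the finest granularity: (state, year) pairs, one per application.
facts = [("WI", 1985), ("WI", 1985), ("WA", 1985), ("WI", 1986)]

base = Counter(facts)                # finest-level cells, computed once
by_state = Counter()
for (state, year), n in base.items():
    by_state[state] += n             # roll year up to ALL by summing children
total = sum(base.values())           # roll both dimensions up to ALL

print(base)      # Counter({('WI', 1985): 2, ('WA', 1985): 1, ('WI', 1986): 1})
print(by_state)  # Counter({'WI': 3, 'WA': 1})
print(total)     # 4
```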
14
Functions on Sets
  • Bottom-up computable functions: functions that
    can be computed using only summary information
  • Distributive function: α(X) = F(α(X1), ..., α(Xn))
  • where X = X1 ∪ ... ∪ Xn and Xi ∩ Xj = ∅
  • E.g., Count(X) = Sum(Count(X1), ..., Count(Xn))
  • Algebraic function: α(X) = F(G(X1), ..., G(Xn))
  • G(Xi) returns a fixed-length vector of values
  • E.g., Avg(X) = F(G(X1), ..., G(Xn)) with
  • G(Xi) = (Sum(Xi), Count(Xi)) and
  • F((s1, c1), ..., (sn, cn)) = Sum(si) / Sum(ci)
    (sketch below)

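A minimal sketch of the two notions, using Count (distributive) and Avg (algebraic) exactly as defined above.

```python
# Partitions of X into disjoint subsets X1, X2, X3:
partitions = [[1, 2, 3], [4, 5], [6]]

# Distributive: Count(X) = Sum(Count(X1), ..., Count(Xn))
count = sum(len(p) for p in partitions)              # 6

# Algebraic: Avg(X) = F(G(X1), ..., G(Xn)) with G(Xi) = (Sum(Xi), Count(Xi))
G = [(sum(p), len(p)) for p in partitions]           # [(6, 3), (9, 2), (6, 1)]
avg = sum(s for s, _ in G) / sum(c for _, c in G)    # 21 / 6 = 3.5

print(count, avg)
```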
15
Scoring Function
  • Represent a model as a function of sets
  • Conceptually, a machine-learning model h(X;
    σZ(D)) is a scoring function Score(y, x; σZ(D))
    that gives each class y a score on test example x
  • h(x; σZ(D)) = argmax_y Score(y, x; σZ(D))
    (sketch below)
  • Score(y, x; σZ(D)) ≈ p(y | x, σZ(D))
  • σZ(D): the set of training examples (a cube
    subset of D)

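A minimal sketch of a model as a scoring function; the score table below is a made-up stand-in for p(y | x, σZ(D)).

```python
# score(y, x) is a made-up stand-in for Score(y, x; sigma_Z(D)).
def h(x, score, classes):
    """The model's prediction: the class with the highest score."""
    return max(classes, key=lambda y: score(y, x))

score = lambda y, x: {("yes", 0): 0.9, ("no", 0): 0.1,
                      ("yes", 1): 0.3, ("no", 1): 0.7}[(y, x)]
print(h(0, score, ["yes", "no"]))  # yes
print(h(1, score, ["yes", "no"]))  # no
```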
16
Bottom-up Score Computation
  • Key observations
  • Observation 1: Score(y, x; σZ(D)) is a function
    of the cube subset σZ(D); if it is distributive
    or algebraic, the bottom-up data-cube technique
    can be directly applied
  • Observation 2: having the scores for all the test
    examples and all the cells is sufficient to
    compute a prediction cube
  • Scores → predictions → cell values (sketch below)
  • The details depend on what each cell means (i.e.,
    the type of prediction cube), but are
    straightforward

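A minimal sketch of Observation 2, assuming an accuracy-valued cell (one of the measure types above): once per-cell scores exist for every test example, predictions and the cell value follow without revisiting the training data.

```python
def cell_value(scores, test, classes):
    """Accuracy-valued cell from precomputed scores: scores -> predictions
    -> cell value, with no further access to the training data."""
    preds = [max(classes, key=lambda y: scores[(y, x)]) for x, _ in test]
    return sum(p == label for p, (_, label) in zip(preds, test)) / len(test)

# scores[(class, example id)] for one cell, plus labeled test examples.
scores = {("yes", 0): 0.8, ("no", 0): 0.2, ("yes", 1): 0.4, ("no", 1): 0.6}
test = [(0, "yes"), (1, "yes")]
print(cell_value(scores, test, ["yes", "no"]))  # 0.5
```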
17
Machine-Learning Models
  • Naïve Bayes (sketch below)
  • Scoring function: algebraic
  • Kernel-density-based classifier
  • Scoring function: distributive
  • Decision tree, random forest
  • Neither distributive nor algebraic
  • PBE: probability-based ensemble (new)
  • Makes any machine-learning model distributive
  • An approximation

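A minimal sketch of why a naïve Bayes scoring function is bottom-up computable: its sufficient statistics are per-class and per-(class, attribute, value) counts, and counts over disjoint partitions combine by addition. The helper names are hypothetical.

```python
from collections import Counter

def nb_stats(examples):
    """G(Xi): per-(class, attribute, value) and per-class counts."""
    joint = Counter((y, a, v) for x, y in examples for a, v in x.items())
    prior = Counter(y for _, y in examples)
    return joint, prior

def merge(stats_list):
    """F: combine partition summaries by addition."""
    joint, prior = Counter(), Counter()
    for j, p in stats_list:
        joint += j
        prior += p
    return joint, prior

part1 = [({"race": 0}, "yes"), ({"race": 1}, "no")]   # one base partition
part2 = [({"race": 0}, "yes")]                        # another base partition
joint, prior = merge([nb_stats(part1), nb_stats(part2)])
print(prior["yes"], joint[("yes", "race", 0)])        # 2 2
```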
18
Probability-Based Ensemble
[Figure: the PBE version of a decision tree on (WA,
85), the decision tree trained directly on (WA, 85),
and the decision trees built on the lowest-level
cells that the PBE combines]
19
Probability-Based Ensemble
  • Scoring function: combine the base models,
    weighting each by the probability that x belongs
    to its base subset (sketch below)
  • h(y | x; bi(D)): model h's estimate of p(y | x,
    bi(D))
  • g(bi | x): a model that predicts the probability
    that x belongs to base subset bi(D)

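A minimal sketch of the PBE combination implied by the two components above: weight each base model's score by the estimated probability that x belongs to that base subset. The toy models are stand-ins.

```python
def pbe_score(y, x, base_models, g):
    """Score(y, x) = sum_i h_i(y | x) * g(b_i | x)."""
    return sum(h(y, x) * g(b, x) for b, h in base_models.items())

base_models = {                       # one model per lowest-level cell
    "b1": lambda y, x: 0.9 if y == "yes" else 0.1,
    "b2": lambda y, x: 0.2 if y == "yes" else 0.8,
}
g = lambda b, x: 0.5                  # toy P(x belongs to base subset b)
print(pbe_score("yes", None, base_models, g))  # 0.9*0.5 + 0.2*0.5 = 0.55
```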
20
Outline
  • Motivating example
  • Definition of prediction cubes
  • Efficient prediction cube materialization
  • Experimental results
  • Conclusion

21
Experiments
  • Quality of PBE on 8 UCI datasets
  • The quality of the PBE version of a model is
    slightly worse (by 0–6%) than the quality of the
    model trained directly on the whole training
    data.
  • Efficiency of the bottom-up score computation
    technique
  • Case study on demographic data

[Chart: accuracy of each PBE model vs. the model
trained directly on the whole training data]
22
Efficiency of Bottom-up Score Computation
  • Machine-learning models
  • J48: J48 decision tree
  • RF: random forest
  • NB: naïve Bayes
  • KDC: kernel-density-based classifier
  • Bottom-up method vs. exhaustive method
  • Bottom-up: PBE-J48, PBE-RF, NB, KDC
  • Exhaustive: J48ex, RFex, NBex, KDCex

23
Synthetic Dataset
  • Dimensions: Z1, Z2 and Z3
  • Decision rule: [figure; the rule is defined in
    terms of Z1 and Z2 at one granularity and Z3 at
    another]
24
Efficiency Comparison
[Chart: execution time (sec) vs. number of records,
comparing the exhaustive method with bottom-up score
computation]
25
Related Work Building models on OLAP Results
  • Multi-dimensional regression [Chen, VLDB 02]
  • Goal: detect changes of trends
  • Build linear regression models for cube cells
  • Step-by-step regression in stream cubes [Liu,
    PAKDD 03]
  • Loglinear-based quasi cubes [Barbara, J. IIS 01]
  • Use a loglinear model to approximately compress
    dense regions of a data cube
  • NetCube [Margaritis, VLDB 01]
  • Build a Bayes net on the entire dataset to
    approximately answer count queries

26
Related Work (Contd.)
  • Cubegrades [Imielinski, J. DMKD 02]
  • Extend cubes with ideas from association rules
  • How does the measure change when we roll up or
    drill down?
  • Constrained gradients [Dong, VLDB 01]
  • Find pairs of similar cell characteristics
    associated with big changes in measure
  • User-cognizant multidimensional analysis
    [Sarawagi, VLDBJ 01]
  • Help users find the most informative unvisited
    regions in a data cube using the maximum-entropy
    principle
  • Multi-structural DBs [Fagin et al., PODS 05,
    VLDB 05]

27
Take-Home Messages
  • A promising exploratory data analysis paradigm
  • Can use models to identify interesting subsets
  • Concentrate only on subsets in cube space
  • Those subsets are meaningful, and the space is
    tractable
  • Precompute results and provide users with an
    interactive tool
  • A simple way to plug something into cube-style
    analysis:
  • Try to describe/approximate it by a
    distributive or algebraic function

28
Big Picture
  • Why stop with decision behavior? Can apply to
    other kinds of analyses too
  • Why stop at browsing? Can mine prediction cubes
    in their own right
  • Exploratory analysis of mining space
  • Dimension attributes can be parameters related to
    algorithm, data conditioning, etc.
  • Tractable evaluation is a challenge
  • Large number of dimensions, real-valued
    dimension attributes, difficulties in
    compositional evaluation
  • Active learning for experiment design, extending
    compositional methods

29
Community Information Management (CIM)
AnHai Doan, University of Illinois at
Urbana-Champaign; Raghu Ramakrishnan, University
of Wisconsin-Madison
30
Structured Web-Queries
  • Example Queries
  • How many alumni are top-10 faculty members?
  • Wisconsin does very well, by the way
  • Find trends in publications
  • By topic, by conference, by alumni of schools
  • Change tracking
  • Alert me if my co-authors publish new papers or
    move to new jobs
  • Information is extracted from text sources on the
    web, then queried

31
Key Ideas
  • Communities are ideally scoped chunks of the web
    for which to build enhanced portals
  • Relative uniformity in content and interests
  • Can exploit "people power" via mass
    collaboration to augment extraction
  • CIM platform: facilitate collaborative creation
    and maintenance of community portals
  • Extraction management
  • Uncertainty, provenance, maintenance, and
    compositional inference for refining extracted
    information
  • Mass collaboration for extraction and integration
Watch for new DBWorld!
32
Challenges
  • User Interaction
  • Declarative specification of background knowledge
    and user feedback
  • Intelligent prompting for user input
  • Explanation of results

33
Challenges
  • Extraction and Query Plans
  • Starting from user input (ER schema, hints) and
    background knowledge (e.g., standard types,
    look-up tables), compile a query into an
    execution plan
  • Must cover extraction, storage and indexing, and
    relational processing
  • And maintenance!
  • Algebra to represent such plans? Query optimizer?
  • Handling uncertainty, constraints, conflicts,
    multiple related sources, ranking, modular
    architecture

34
Challenges
  • Managing extracted data
  • Mapping between extracted metadata and source
    data
  • Uncertainty of mapping
  • Conflicts (in user input, background knowledge,
    or from multiple sources)
  • Evolution over time