Sunita Sarawagi, IIT Bombay, http://www.it.iitb.ernet.in/~sunita

Description:

Fast, interactive answers to large aggregate queries. ... Navigational operators: Pivot, ... Unravel aggregate data. Total sales dropped 30% in N. America. Why? ...

1
Sunita Sarawagi, IIT Bombay
http://www.it.iitb.ernet.in/~sunita
I3: Intelligent, Interactive Investigation of multidimensional data
2
Multidimensional OLAP databases
  • Fast, interactive answers to large aggregate
    queries.
  • Multidimensional model: dimensions with hierarchies
      • Dim 1: Bank location
          • branch --> city --> state
      • Dim 2: Customer
          • sub profession --> profession
      • Dim 3: Time
          • month --> quarter --> year
  • Measures: loan amount, transactions, balance

3
OLAP
  • Navigational operators: pivot, drill-down, roll-up, select.
  • Hypothesis-driven search, e.g. factors affecting defaulters
      • view defaulting rate on age, aggregated over other dimensions
      • for a particular age segment, detail along profession
  • Need interactive response to aggregate queries.
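
A minimal sketch of the navigational operators named on this slide, written over a tiny in-memory table rather than a real OLAP engine. The rows, the column layout, and the helper names `aggregate` and `select` are all hypothetical and only illustrate roll-up, drill-down, and pivot on a city --> state hierarchy.

```python
from collections import defaultdict

# Hypothetical detailed rows: (city, state, quarter, loan_amount)
ROWS = [
    ("Mumbai", "Maharashtra", "Q1", 120.0),
    ("Pune", "Maharashtra", "Q1", 80.0),
    ("Mumbai", "Maharashtra", "Q2", 100.0),
    ("Chennai", "Tamil Nadu", "Q1", 60.0),
]
CITY, STATE, QUARTER, AMOUNT = range(4)


def aggregate(rows, group_by):
    """Sum the measure for every combination of the group_by columns."""
    totals = defaultdict(float)
    for row in rows:
        totals[tuple(row[c] for c in group_by)] += row[AMOUNT]
    return dict(totals)


def select(rows, column, value):
    """Keep only the rows whose column equals value."""
    return [row for row in rows if row[column] == value]


# Roll-up: aggregate cities up to their states.
by_state = aggregate(ROWS, [STATE])

# Drill-down: pick one state and detail it along the city level.
maharashtra_by_city = aggregate(select(ROWS, STATE, "Maharashtra"), [CITY])

# Pivot: view the same measure by quarter and state.
by_quarter_and_state = aggregate(ROWS, [QUARTER, STATE])
```

Each call rescans the rows; a real OLAP server would answer the same queries from precomputed aggregates, which is what makes interactive response possible.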

4
Motivation
  • OLAP products provide a minimal set of tools for analysis
      • simple aggregates
      • selects/drill-downs/roll-ups on the multidimensional structure
  • Heavy reliance on manual operations for analysis
      • tedious on large data with multiple dimensions and levels of hierarchy
  • GOAL: automate through complex, mining-like operations integrated with OLAP.

5
State of the art in mining/OLAP integration
  • Decision trees: Information Discovery, Cognos
      • find factors influencing high profits
  • Clustering: Pilot Software
      • segment customers to define a hierarchy on that dimension
  • Time series analysis: Seagate's Holos
      • query for various shapes along time: spikes, outliers, etc.
  • Multi-level associations: Han et al.
      • find associations between members of dimensions

6
New approach
  • Identify complex operations with specific OLAP needs in mind (what does an analyst need?) rather than looking at mining operations and choosing what fits
  • Three examples
      • Diff: for specific "why" questions at the aggregate level
          • most compactly represent the answer so that the user can quickly assimilate it
      • Generalize: from detailed data to more general cases
          • expand the scope of the problem case as far out as possible
      • Inform: of interesting regions in data
          • point to the most informative regions in the data, so the user does not need to hunt for them blindly.

7
The Diff operator
8
Unravel aggregate data
What is the most compact answer that the user can quickly assimilate?
9
Solution
  • A new DIFF operator added to OLAP systems that provides the answer
      • in a single step
      • in a form that is easy to assimilate
      • and compact (configurable by the user).
  • Obviates the lengthy, manual search for reasons in large multidimensional data.

10
Example query
11
Compact answer
12
Example: explaining increases
13
Compact answer
14
Model for summarization
  • The two aggregated values correspond to two
    subcubes in detailed data.

[Figure: the two subcubes, Cube-A and Cube-B]
15
Detailed answers
Explain only 15% of the total difference, as against 90% with the compact answer
16
Summarizing similar changes
17
MDL model for summarization
  • Given N, find the best N rows of answer such that
      • if the user knows cube-A and the answer,
      • the number of bits needed to send cube-B is minimized.

[Figure: N-row answer, Cube-A, Cube-B]
18
Transmission cost: MDL-based
  • Each answer entry has a ratio, that is
      • the ratio of the sums of measure values in cube-B and cube-A over the cells not covered by a more detailed entry in the answer.
  • For each cell of cube-B not in the answer
      • r = ratio of the closest parent in the answer
      • a (b) = measure value in cube-A (cube-B)
      • Expected value of b = a r
      • bits = -log(prob(b, a r)), where prob(x, u) is the probability at value x for a distribution with mean u.
      • We use a Poisson distribution when the x are counts, a normal distribution otherwise
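
The cost on this slide can be sketched as follows, assuming each cell is a pair (a, b) of measure values from cube-A and cube-B and that all cells shown share one closest parent with ratio r. The function names are hypothetical, the logarithm is taken base 2 so the result is in bits, and the `sigma` of the normal case is an assumed parameter that the slides do not specify.

```python
import math


def poisson_bits(b, mean):
    """-log2 of the Poisson probability of count b under the given mean."""
    if mean <= 0:
        return 0.0 if b == 0 else math.inf
    log_p = b * math.log(mean) - mean - math.lgamma(b + 1)
    return -log_p / math.log(2)


def normal_bits(b, mean, sigma=1.0):
    """-log2 of the normal density at b (sigma is an assumed parameter)."""
    log_density = -0.5 * ((b - mean) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))
    return -log_density / math.log(2)


def answer_ratio(cells):
    """Ratio of summed cube-B values to summed cube-A values over the cells."""
    total_a = sum(a for a, _ in cells)
    total_b = sum(b for _, b in cells)
    return total_b / total_a if total_a else 0.0


def transmission_cost(cells, r, bits=poisson_bits):
    """Bits needed to send each b when the receiver expects a * r."""
    return sum(bits(b, a * r) for a, b in cells)


# Hypothetical cells (a, b) that all doubled from cube-A to cube-B.
cells = [(10, 20), (5, 10), (8, 16)]
r = answer_ratio(cells)
cost_with_ratio = transmission_cost(cells, r)
cost_without = transmission_cost(cells, 1.0)
```

An answer row whose ratio matches what actually happened to the cells beneath it lowers the cost, which is why a good N-row answer is one that minimizes the total.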

19
Algorithm
  • Challenges
      • Circular dependence on the parent's ratio
      • Bounded size of answer
      • Greedy methods do not work
  • Bottom-up dynamic programming algorithm

20
[Figure: tuples in the detailed data grouped by common parent; labels: N=2, i, "Tuples with same parent"]
21
Integration
  • Single pass on data (all indexing/sorting done in the DBMS): interactive.
  • Low memory usage, independent of the number of tuples: O(NL)
  • Easy to package as a stored procedure on the data server side.
  • When the detailed subcube is too large, work off aggregated data.

22
Performance
  • 80% of time spent in data access.
  • Quarter million records processed in 10 seconds

Setup: 333 MHz Pentium, 128 MB memory; data on DB2 UDB, NT 4.0; OLAP benchmark, 1.36 million tuples, 4 dimensions
23
The Relax operator
24
Example query: generalizing drops
25
(No Transcript)
26
Ratio generalization
27
Problem formulation
  • Inputs
      • A specific tuple Ts
      • An upper bound N on the answer size
      • Error functions
          • R(Ts,T) measures the error of including a tuple T in a generalization around Ts
          • S(Ts,T) measures the error of excluding T from the generalization
  • Goal
      • To find all possible consistent and maximal generalizations around Ts

28
Algorithm
  • Considerations
      • Need to exploit the capabilities of the OLAP data source
      • Need to reduce the number of data fetches to the application
  • 2-stage approach
      • Finding generalizations
      • Getting exceptions

29
Finding generalizations
  • n: number of dimensions
  • Li: number of levels in the hierarchy of dimension Di
  • Dij: jth level in the ith dimension hierarchy
  • candidate_set ← {D11, D21, ..., Dn1} // all single-dimension candidate generalizations
  • k ← 1
  • while (candidate_set ≠ ∅)
      • ∀ g ∈ candidate_set
          • if (Σ_{T∈g} S(Ts,T) > Σ_{T∈g} R(Ts,T)) Gk ← Gk ∪ {g}
      • // generate candidates for pass (k+1) from the generalizations of pass k
      • candidate_set ← generateCandidates(Gk) // Apriori style
      • // if generalization is possible at level j of dimension Di, add its parent level to the candidate set
      • candidate_set ← candidate_set ∪ {Di(j+1) | Dij ∈ Gk, j < Li}
      • k ← k + 1
  • Return ∪i Gi
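
The level-wise search above can be sketched in Python under several assumptions that are not in the slides: a generalization is represented as a frozenset of (dimension, level) pairs, the caller supplies a hypothetical `covered(g)` function returning the tuples that g covers, and `R` and `S` already have Ts bound, so they take only T. The Apriori-style join is one plausible reading of generateCandidates, and the sketch does not filter the result down to maximal generalizations.

```python
from itertools import combinations


def find_generalizations(levels, covered, R, S):
    """levels maps each dimension to its number of hierarchy levels (L_i)."""
    # Pass 1: every dimension generalized to its first level.
    candidates = {frozenset([(dim, 1)]) for dim in levels}
    seen = set(candidates)
    accepted = set()
    while candidates:
        # Keep g when excluding its tuples costs more than including them.
        g_k = set()
        for g in candidates:
            tuples = covered(g)
            if sum(S(t) for t in tuples) > sum(R(t) for t in tuples):
                g_k.add(g)
        accepted |= g_k

        next_candidates = set()
        # Apriori-style join: two accepted sets that differ in one dimension.
        for g1, g2 in combinations(g_k, 2):
            merged = g1 | g2
            dims = {dim for dim, _ in merged}
            if len(g1) == len(g2) and len(merged) == len(g1) + 1 and len(dims) == len(merged):
                next_candidates.add(merged)
        # If level j of a dimension was accepted, try its parent level too.
        for g in g_k:
            if len(g) == 1:
                (dim, j), = g
                if j < levels[dim]:
                    next_candidates.add(frozenset([(dim, j + 1)]))

        candidates = next_candidates - seen
        seen |= candidates
    return accepted
```

In a real deployment `covered(g)` would be an aggregate query against the OLAP source, so each pass costs one round of queries rather than a scan in the application.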

30
Finding Summarized Exceptions
  • Goal
      • Find exceptions to each maximal generalization, compacted to within N rows and yielding the minimum total error
  • Challenges
      • No absolute criterion for determining whether a tuple is an exception or not for all possible R functions
      • Worth of including a child tuple is circularly dependent on its parent tuple
      • Bounded size of answer
  • Solution
      • Bottom-up dynamic programming algorithm

31
Single dimension with multiple levels of
hierarchies
  • Optimal solution for finite-domain R functions
  • soln(l,n,v): the best solution for subtree l, for all n between 0 and N and all possible values v of the default rep.
  • soln(l,n,v,c): the intermediate value of soln(l,n,v) after the 1st to the cth child of l are scanned
  • $\mathrm{Err}(\mathrm{soln}(l,n,v,c+1)) = \min_{0 \le k \le n} \big( \mathrm{Err}(\mathrm{soln}(l,k,v,c)) + \mathrm{Err}(\mathrm{soln}(c+1,n-k,v)) \big)$
  • $\mathrm{Err}(\mathrm{soln}(l,n,v)) = \min\big( \mathrm{Err}(\mathrm{soln}(l,n,v,*)),\ \min_{v' \ne v} \mathrm{Err}(\mathrm{soln}(l,n-1,v',*)) + \mathrm{rep}(v') \big)$
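
A minimal sketch of a bottom-up dynamic program of this shape, under assumptions the slides do not state: the hierarchy is a tree whose leaves carry values from a finite domain, the error of a solution is the number of leaves whose value differs from the value inherited from the closest answer row above them, the cost rep(v') of reporting a row is taken as 0, and n is read as "at most n rows". The function and argument names are hypothetical.

```python
from functools import lru_cache


def min_errors(tree, leaf_value, root, N, domain, default):
    """Return a list err where err[n] is the least error using at most n rows.

    tree maps a node to a tuple of children; leaves are absent from tree.
    """

    @lru_cache(maxsize=None)
    def fold(node, v):
        # Children-only solution with inherited default v (no row on node).
        children = tree.get(node, ())
        if not children:
            mismatch = 0 if leaf_value[node] == v else 1
            return tuple([mismatch] * (N + 1))
        acc = (0,) * (N + 1)
        for child in children:
            sub = full(child, v)
            # Split the budget n between the children seen so far and this one.
            acc = tuple(
                min(acc[k] + sub[n - k] for k in range(n + 1))
                for n in range(N + 1)
            )
        return acc

    @lru_cache(maxsize=None)
    def full(node, v):
        # Either keep default v, or spend one row on node with a new value.
        best = list(fold(node, v))
        for other in domain:
            if other == v:
                continue
            alt = fold(node, other)
            for n in range(1, N + 1):
                best[n] = min(best[n], alt[n - 1])
        return tuple(best)

    return list(full(root, default))


# Hypothetical hierarchy: root "1" with two children and five leaves.
tree = {"1": ("1.1", "1.2"), "1.1": ("a", "b", "c"), "1.2": ("d", "e")}
leaf_value = {"a": "+", "b": "+", "c": "-", "d": "-", "e": "-"}
errors = min_errors(tree, leaf_value, "1", 2, ("+", "-"), "+")
```

Here `errors` holds one entry per row budget from 0 to N. Memoizing on (node, value) keeps the work proportional to the number of nodes times the domain size times N squared, which is the reason a finite domain matters.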

32
[Figure: dynamic programming example on a single-dimension hierarchy. Root node 1 has children 1.1, 1.2, 1.3 and 1.4; the per-child solutions for n = 3 are combined into the solution for node 1, shown for N = 0, 1, 2, 3.]
33
Generalize Operator
34
(No Transcript)
35
The Inform operator
36
User-cognizant data exploration: overview
  • Monitor to find regions of data the user has visited
  • Model the user's expectation of unseen values
  • Report the most informative unseen values
  • How to
      • Model expected values?
      • Define information content?

37
Modeling expected values
38
The Maximum Entropy Principle
  • Choose the most uniform distribution while adhering to all the constraints
  • E. T. Jaynes, 1990:
      • "it agrees with everything that is known but carefully avoids assuming anything that is not known. It is a transcription into mathematics of an ancient principle of wisdom"
  • Characterizing uniformity
      • maximum when all the p_i are equal
  • Solve the constrained optimization problem
      • maximize H(p) subject to k constraints

39
Modeling expected values
[Figure: visited views, database]
40
Change in entropy
41
Finding expected values
  • Solve the constrained optimization problem
      • maximize H(p) subject to k constraints
  • Each constraint has the form: sum of an arbitrary set of values
  • Expected values can be expressed as a product of k coefficients, one from each of the k constraints
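
Written out, the optimization problem on this slide takes roughly the following form. The notation is assumed for illustration: C_j is the set of cells in the jth constraint, s_j is the sum observed for it in a visited view, and \mu_j is the coefficient contributed by that constraint (it applies only to cells inside C_j).

```latex
\begin{aligned}
\text{maximize}\quad   & H(p) = -\sum_{i} p_i \log p_i \\
\text{subject to}\quad & \sum_{i \in C_j} p_i = s_j, \qquad j = 1, \dots, k \\
\text{solution form}\quad & p_i = \prod_{j=1}^{k} \mu_j^{[\, i \in C_j \,]}
\end{aligned}
```

The product form is what makes the solution cheap to maintain: adding a constraint only changes the coefficients of the cells it covers.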

42
Iterative scaling algorithm
  • Initially all p values are the same
  • While convergence not reached
      • For each constraint Ci in turn
          • Scale p values included in Ci by
  • Converges to the optimal solution when all constraints are consistent.
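
A minimal sketch of iterative scaling as outlined above. The scaling factor is not in the transcript; this sketch assumes the standard choice from iterative proportional fitting, namely the constraint's target sum divided by the current sum of the p values it covers. The function name, the (cells, target) representation, and the tolerance are hypothetical.

```python
def iterative_scaling(n_cells, constraints, tol=1e-9, max_iter=1000):
    """constraints is a list of (cell_indices, target_sum) pairs."""
    p = [1.0] * n_cells  # initially all p values are the same
    for _ in range(max_iter):
        for cells, target in constraints:
            current = sum(p[i] for i in cells)
            if current > 0:
                factor = target / current  # assumed scaling factor
                for i in cells:
                    p[i] *= factor
        # Converged once every constraint is met within the tolerance.
        worst = max(
            (abs(sum(p[i] for i in cells) - target) for cells, target in constraints),
            default=0.0,
        )
        if worst <= tol:
            break
    return p


# Hypothetical 2x2 view, cells numbered row by row: 0 1 / 2 3.
constraints = [
    ([0, 1], 30.0),  # row 1 total
    ([2, 3], 70.0),  # row 2 total
    ([0, 2], 40.0),  # column 1 total
    ([1, 3], 60.0),  # column 2 total
]
expected = iterative_scaling(4, constraints)
```

With only row and column totals known, the result is the independence table (each cell is row total times column total over the grand total), which is the maximum entropy answer for this case.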

43
(No Transcript)
44
Information content of an unvisited cell
  • Defined as how much adding it as a constraint will reduce the distance between actual and expected values
  • Distance between actual and expected
  • Information content of the (k+1)th constraint C(k+1)
  • Can be approximated as
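
The formulas on this slide are not in the transcript. As an illustration only, the sketch below assumes the distance between actual and expected values is a Kullback-Leibler style divergence and ranks cells by their individual contribution to it. This is an assumption, not the approximation the slide refers to, and the function names are hypothetical. Expected values are assumed to be strictly positive.

```python
import math


def cell_divergence(actual, expected):
    """Per-cell term of the generalized KL divergence; always >= 0."""
    if actual == 0:
        return expected
    return actual * math.log(actual / expected) - actual + expected


def distance(actuals, expecteds):
    """Total divergence between actual and expected cell values."""
    return sum(cell_divergence(a, e) for a, e in zip(actuals, expecteds))


def rank_cells(actuals, expecteds):
    """Cell indices ordered from most to least divergent."""
    scores = [cell_divergence(a, e) for a, e in zip(actuals, expecteds)]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Hypothetical unvisited cells: actual values versus the user's expectation.
actuals = [10.0, 10.0, 40.0]
expecteds = [10.0, 20.0, 30.0]
ranking = rank_cells(actuals, expecteds)
```

Ranking by per-cell contribution is only a proxy: the slide defines information content by the reduction in distance after the cell is added as a constraint and the expected values are recomputed.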


45
Information content of unseen data
46
Adapting for OLAP data. Optimization 1: Expand expected cube on demand
  • Single entry for all cells with the same expected value
  • Initially everything is aggregated, but touches a lot of data
  • Later constraints touch a limited amount of data.

[Figure: expected cube, views]
47
Optimization 2: Reduce overlap
  • Number of iterations depends on the overlap between constraints
  • Remove subsumed constraints from their parents to reduce overlap

48
Finding N most informative cells
  • In general, the most informative cells can be values from any level of aggregation.
  • Single-pass algorithm that finds the best difference between actual and expected values (VLDB-99)

49
Information gain with focussed exploration
50
Illustration from Student enrollment data
35% of the information in the data captured in 12 out of 4560 cells (0.25% of the data)
51
Top few surprising values
80% of the information in the data captured in 50 out of 4560 cells (1% of the data)
52
Summary
  • Our goal: enhance OLAP with a suite of operations that are
      • richer than simple OLAP and SQL queries
      • more interactive than conventional mining
      • ...and thus reduce the need for manual analysis
  • Proposed three new operators: Diff, Generalize, Surprise
      • Formulations with theoretical basis
      • Efficient algorithms for online answering
      • Integrates smoothly with existing systems.
  • Future work: more operators.