Sunita Sarawagi IIT Bombay http:www'it'iitb'ac'insunita - PowerPoint PPT Presentation

1 / 39
About This Presentation
Title:

Sunita Sarawagi IIT Bombay http:www'it'iitb'ac'insunita

Description:

Data analysis understanding the effect of various factors on a target variable ... Unravel aggregate data. Total sales dropped 30% in Europe. Why? ... – PowerPoint PPT presentation

Number of Views:198
Avg rating:3.0/5.0
Slides: 40
Provided by: sunitas4
Category:

less

Transcript and Presenter's Notes

Title: Sunita Sarawagi IIT Bombay http:www'it'iitb'ac'insunita


1
Sunita SarawagiIIT Bombayhttp//www.it.iitb
.ac.in/sunita
Analyzing large multidimensional databases
2
Data Analysis in decision support systems
  • Data analysis ? understanding the effect of
    various factors on a target variable
  • Factors region, time, sales channel
  • Target profit, sales volume
  • Tools for data analysis
  • SQL queries/reports slow, manual, painful
  • Multidimensional tools OLAP, very popular
  • Statistical packages sophisticated
  • Data mining automated, lot of interest and hype
    but several hurdles to meaningful adoption

3
OLAP (On line Analytical Processing)
  • Data viewed as a multidimensional cube where
  • factors are dimensions with hierarchies
  • targets are values within cells
  • Analysis happens through fast interactive
    browsing of aggregates
  • Market share in 2002 US 3.5 billion
  • Vendors Microsoft, Hyperion, Cognos, Business
    Objects ? about 30 such vendors

4
Multidimensional Data analysis
  • Sales volume as a function of product, month, and
    region

Dimensions Product, Location, Time Hierarchical
summarization paths
Region
Industry Region Year Category
Country Quarter Product City Month
Week Office Day
Product
Month
5
Typical OLAP Operations
  • Roll up (drill-up) summarize data
  • by climbing up hierarchy or by dimension
    reduction
  • Drill down (roll down) reverse of roll-up
  • from higher level summary to lower level summary
    or detailed data, or introducing new dimensions
  • Slice and dice
  • project and select
  • Pivot (rotate)
  • reorient the cube, visualization, 3D to series of
    2D planes.

6
Limitation of OLAP-based analysis
  • OLAP products provide a minimal set of tools for
    analysis
  • simple aggregates
  • selects/drill-downs/roll-ups on the
    multidimensional structure
  • Heavy reliance on manual operations for analysis
  • tedious on large data with multiple dimensions
    and levels of hierarchy

7
I3 Intelligent, Interactive Investigation of
multidimensional data
  • lightweight automation of tedious multi-step
    tasks
  • Three examples
  • Diff for specific why questions at aggregate
    level
  • most compactly represent the answer that user can
    quickly assimilate
  • Generalize from detailed data to more general
    cases
  • expand scope of problem case as far out as
    possible
  • Inform of interesting regions in data

8
The Diff operator
9
Unravel aggregate data
What is the most compact answer that user can
quickly assimilate?
10
Solution
  • A new DIFF-operator added to OLAP systems that
    provides the answer
  • in a single-step
  • is easy-to-assimilate
  • and compact --- configurable by user.
  • Obviates use of the lengthy and manual search for
    reasons in large multidimensional data.

11
Example query
12
Compact answer
13
Example explaining increases
14
Compact answer
15
Model for summarization
  • The two aggregated values correspond to two
    subcubes in detailed data.

Cube-A
Cube-B
16
Detailed answers
Explain only 15 of total difference as against
90 with compact
17
Summarizing similar changes
18
MDL model for summarization
  • Given N, find the best N rows of answer such
    that
  • if user knows cube-A and answer,
  • number of bits needed to send cube-B is
    minimized.

N row answer
Cube-A
Cube-B
19
Transmission cost MDL-based
  • Each answer entry has a ratio that is
  • sum of measure values in cube-B and cube-A not
    covered by a more detailed entry in answer.
  • For each cell of cube-B not in answer
  • r ratio of closest parent in answer
  • a (b) measure value of cube A (B).
  • Expected value of b a r
  • bits -log(prob(b, ar)) where prob(x,u) is
    probability at value x for a distribution with
    mean u.
  • We use a poisson distribution when x are counts,
    normal distribution otherwise

20
Algorithm
  • Challenges
  • Circular dependence on parents ratio
  • Bounded size of answer
  • Greedy methods do not work
  • Bottom up dynamic programming algorithm

21
N2
i
Tuples with same parent
Tuples in detailed data grouped by common parent..
22
Integration
  • Single pass on data --- all indexing/sorting in
    the DBMS interactive.
  • Low memory usage independent of number of
    tuples O(NL)
  • Easy to package as a stored procedure on the data
    server side.
  • When detailed subcube too large work off
    aggregated data.

23
Performance
  • 80 time spent in data access.
  • Quarter million records processed in 10 seconds

333 MHz Pentium 128 MB memory Data on DB2
UDB NT 4.0 Olap benchmark 1.36 million tuples 4
dimensions
24
The Relax operator
25
Example query generalizing drops
26
(No Transcript)
27
Ratio generalization
28
Problem formulation
  • Inputs
  • A specific tuple Ts
  • An upper bound N on the answer size
  • Error functions
  • R(Ts,T?) measures the error of including a tuple
    T? in a generalization around Ts
  • S(Ts,T?) measures the error of excluding T? from
    the generalization
  • Goal
  • To find all possible consistent and maximal
    generalizations around Ts

29
Algorithm
  • Considerations
  • Need to exploit the capabilities of the OLAP data
    source
  • Need to reduce the amount of data fetches to the
    application
  • 2-stage approach
  • Finding generalizations
  • Getting exceptions

30
Finding generalizations
  • n number of dimensions
  • Li levels of hierarchy of dimension Di
  • Dij jth level in the ith dimension hierarchy
  • candidate_set ? D11, D21Dn1 // all single
    dimension candidate gen.
  • k 1
  • while (candidate_set ? ?)
  • ?g ? candidate_set
  • if (ST?g S(Ts,T) gt ST?g R(Ts,T)) Gk ? Gk ? g
  • // generating candidates for pass (k1)
    from generalizations of pass k
  • candidate_set ? generateCandidates(Gk)
    //Apriori style
  • // if gen is possible at level j of dimension
    Di , add its parent level to the candidate set
  • candidate_set ? candidate_set ? Di(j1)Dij
    ? Gk jlt Li
  • k ? k 1
  • Return ?i Gi

31
Finding Summarized Exceptions
  • Goal
  • Find exceptions to each maximal
    generalization compacted to within N rows and
    yielding the minimum total error
  • Challenges
  • No absolute criteria for determining whether a
    tuple is an exception or not for all possible R
    functions
  • Worth of including a child tuple is circularly
    dependent on its parent tuple
  • Bounded size of answer
  • Solution
  • Bottom up dynamic programming algorithm

32
Single dimension with multiple levels of
hierarchies
  • Optimal solution for finite domain R functions
  • soln(l,n,v) the best solution for subtree l for
    all n between 0 and N and all possible values of
    the default rep.
  • soln(l,n,v,c) the intermediate value of
    soln(l,n,v) after the 1st to the cth child of l
    are scanned
  • Err(soln(l,n,v,c1))min0?k?n(Err(soln(l,n,v,c))E
    rr(soln(c1,n-k,v)))
  • Err(soln(l,n,v))min(Err(soln(l,n,v,)),
  • minv ? v Err(soln(1,n-1,v,)rep(v)))

33
soln(1,1,)
N3
N2
N1
N0
1
1.1 ()
1.2 (-)
1.3 ()
1.4 ()
- - - 1 2 3 4 5 6
7 8 9 10
- 1 2 3 4 5 6
  • - - -
  • 1 2 3 4 5 6 7 8 9

- - - - - 1 2 3 4 5 6 7
soln(1.1,3,)
soln(1.2,3,)
soln(1.3,3,)
soln(1.4,3,)
34
Generalize Operator
35
(No Transcript)
36
The Inform operator
37
User-cognizant data exploration overview
  • Monitor to find regions of data user has visited
  • Model users expectation of unseen values
  • Report most informative unseen values
  • How to
  • Model expected values?
  • Define information content?

38
Modeling expected values
39
The Maximum Entropy Principle
  • Choose the most uniform distribution while
    adhering to all the constraints
  • E.T.Jaynes..1990
  • it agrees with everything that is known but
    carefully avoids assuming anything that is not
    known. It is transcription into mathematics of an
    ancient principle of wisdom
  • Characterizing uniformity
  • maximum when all pi-s are equal
  • Solve the constrained optimization problem
  • maximize H(p) subject to k constraints

40
Modeling expected values
Visited views
Database
41
Change in entropy
42
Finding expected values
  • Solve the constrained optimization problem
  • maximize H(p) subject to k constraints
  • Each constraint is of the form sum of arbitrary
    sets of values
  • Expected values can be expressed as a product of
    k coefficients one from each of the k constraints

43
Iterative scaling algorithm
  • Initially all p values are the same
  • While convergence not reached
  • For each constraint Ci in turn
  • Scale p values included in Ci by
  • Converges to optimal solution when all
    constraints are consistent.

44
(No Transcript)
45
Information content of an unvisited cell
  • Defined as how much adding it as a constraint
    will reduce distance between actual and expected
    values
  • Distance between actual and expected
  • Information content of (k1)th constraint Ck1
  • Can be approximated as


46
Information content of unseen data
47
Adapting for OLAP data Optimization 1 Expand
expected cube on demand
  • Single entry for all cells with same expected
    value
  • Initially everything aggregated but touches lot
    of data
  • Later constraints touch limited amount of data.

Expected cube
Views
48
Optimization 2 Reduce overlap
  • Number of iterations depend on overlap between
    constraints
  • Remove subsumed constraints from their parents to
    reduce overlap

49
Finding N most informative cells
  • In general, most informative cells can be any of
    value from any level of aggregation.
  • Single-pass algorithm that finds the best
    difference between actual and expected values
    Diff algorithm

50
Information gain with focussed exploration
51
Illustration from Student enrollment data
35 of information in data captured in 12 out of
4560 cells 0.25 of data
52
Top few suprising values
80 of information in data captured in 50 out of
4560 cells 1 of data
53
Summary
  • Our goal enhance OLAP with a suite of operations
    that are
  • richer than simple OLAP and SQL queries
  • more interactive than conventional mining
  • ...and thus reduce the need for manual analysis
  • Proposed three new operators Diff, Relax,Inform
  • Formulations with theoretical basis
  • Efficient algorithms for online answering
  • Integrates smoothly with existing systems.
Write a Comment
User Comments (0)
About PowerShow.com