Title: Sunita Sarawagi IIT Bombay http://www.it.iitb.ernet.in/~sunita
1Sunita SarawagiIIT Bombayhttp//www.it.iitb
.ernet.in/sunita
I3 Intelligent, Interactive Investigation of
multidimensional data
2Multidimensional OLAP databases
- Fast, interactive answers to large aggregate
queries. - Multidimensional model dimensions with
hierarchies - Dim 1 Bank location
- branch--gtcity--gtstate
- Dim 2 Customer
- sub profession --gt profession
- Dim 3 Time
- month --gt quarter --gt year
- Measures loan amount, transactions, balance
3OLAP
- Navigational operators Pivot, drill-down,
roll-up, select. - Hypothesis driven search E.g. factors affecting
defaulters - view defaulting rate on age aggregated over other
dimensions - for particular age segment detail along
profession - Need interactive response to aggregate queries..
4Motivation
- OLAP products provide a minimal set of tools for
analysis - simple aggregates
- selects/drill-downs/roll-ups on the
multidimensional structure - Heavy reliance on manual operations for analysis
- tedious on large data with multiple dimensions
and levels of hierarchy - GOAL automate through complex, mining-like
operations integrated with Olap.
5State of art in mining OLAP integration
- Decision trees Information discovery, Cognos
- find factors influencing high profits
- Clustering Pilot software
- segment customers to define hierarchy on that
dimension - Time series analysis Seagates Holos
- Query for various shapes along time spikes,
outliers etc - Multi-level Associations Han et al.
- find association between members of dimensions
6New approach
- Identify complex operations with specific
OLAP needs in mind (what does an analyst need?)
rather than looking at mining operations and
choosing what fits - Three examples
- Diff for specific why questions at aggregate
level - most compactly represent the answer that user can
quickly assimilate - Generalize from detailed data to more general
cases - expand scope of problem case as far out as
possible - Inform of interesting regions in data
- Point to most informative regions in data, so
user does not need to hunt for them in the blind.
7The Diff operator
8Unravel aggregate data
What is the most compact answer that user can
quickly assimilate?
9Solution
- A new DIFF-operator added to OLAP systems that
provides the answer - in a single-step
- is easy-to-assimilate
- and compact --- configurable by user.
- Obviates use of the lengthy and manual search for
reasons in large multidimensional data.
10Example query
11Compact answer
12Example explaining increases
13Compact answer
14Model for summarization
- The two aggregated values correspond to two
subcubes in detailed data.
Cube-A
Cube-B
15Detailed answers
Explain only 15 of total difference as against
90 with compact
16Summarizing similar changes
17MDL model for summarization
- Given N, find the best N rows of answer such
that - if user knows cube-A and answer,
- number of bits needed to send cube-B is
minimized.
N row answer
Cube-A
Cube-B
18Transmission cost MDL-based
- Each answer entry has a ratio that is
- sum of measure values in cube-B and cube-A not
covered by a more detailed entry in answer. - For each cell of cube-B not in answer
- r ratio of closest parent in answer
- a (b) measure value of cube A (B).
- Expected value of b a r
- bits -log(prob(b, ar)) where prob(x,u) is
probability at value x for a distribution with
mean u. - We use a poisson distribution when x are counts,
normal distribution otherwise
19Algorithm
- Challenges
- Circular dependence on parents ratio
- Bounded size of answer
- Greedy methods do not work
- Bottom up dynamic programming algorithm
20N2
i
Tuples with same parent
Tuples in detailed data grouped by common parent..
21Integration
- Single pass on data --- all indexing/sorting in
the DBMS interactive. - Low memory usage independent of number of
tuples O(NL) - Easy to package as a stored procedure on the data
server side. - When detailed subcube too large work off
aggregated data.
22Performance
- 80 time spent in data access.
- Quarter million records processed in 10 seconds
333 MHz Pentium 128 MB memory Data on DB2
UDB NT 4.0 Olap benchmark 1.36 million tuples 4
dimensions
23The Relax operator
24Example query generalizing drops
25(No Transcript)
26Ratio generalization
27Problem formulation
- Inputs
- A specific tuple Ts
- An upper bound N on the answer size
- Error functions
- R(Ts,T?) measures the error of including a tuple
T? in a generalization around Ts - S(Ts,T?) measures the error of excluding T? from
the generalization - Goal
- To find all possible consistent and maximal
generalizations around Ts
28Algorithm
- Considerations
- Need to exploit the capabilities of the OLAP data
source - Need to reduce the amount of data fetches to the
application - 2-stage approach
- Finding generalizations
- Getting exceptions
29Finding generalizations
- n number of dimensions
- Li levels of hierarchy of dimension Di
- Dij jth level in the ith dimension hierarchy
- candidate_set ? D11, D21Dn1 // all single
dimension candidate gen. - k 1
- while (candidate_set ? ?)
- ?g ? candidate_set
- if (ST?g S(Ts,T) gt ST?g R(Ts,T)) Gk ? Gk ? g
- // generating candidates for pass (k1)
from generalizations of pass k - candidate_set ? generateCandidates(Gk)
//Apriori style - // if gen is possible at level j of dimension
Di , add its parent level to the candidate set - candidate_set ? candidate_set ? Di(j1)Dij
? Gk jlt Li - k ? k 1
- Return ?i Gi
30Finding Summarized Exceptions
- Goal
- Find exceptions to each maximal
generalization compacted to within N rows and
yielding the minimum total error - Challenges
- No absolute criteria for determining whether a
tuple is an exception or not for all possible R
functions - Worth of including a child tuple is circularly
dependent on its parent tuple - Bounded size of answer
- Solution
- Bottom up dynamic programming algorithm
31Single dimension with multiple levels of
hierarchies
- Optimal solution for finite domain R functions
- soln(l,n,v) the best solution for subtree l for
all n between 0 and N and all possible values of
the default rep. - soln(l,n,v,c) the intermediate value of
soln(l,n,v) after the 1st to the cth child of l
are scanned - Err(soln(l,n,v,c1))min0?k?n(Err(soln(l,n,v,c))E
rr(soln(c1,n-k,v))) - Err(soln(l,n,v))min(Err(soln(l,n,v,)),
- minv ? v Err(soln(1,n-1,v,)rep(v)))
32soln(1,1,)
N3
N2
N1
N0
1
1.1 ()
1.2 (-)
1.3 ()
1.4 ()
- - - 1 2 3 4 5 6
7 8 9 10
- 1 2 3 4 5 6
- - - - - 1 2 3 4 5 6 7
soln(1.1,3,)
soln(1.2,3,)
soln(1.3,3,)
soln(1.4,3,)
33Generalize Operator
34(No Transcript)
35The Inform operator
36User-cognizant data exploration overview
- Monitor to find regions of data user has visited
- Model users expectation of unseen values
- Report most informative unseen values
- How to
- Model expected values?
- Define information content?
37Modeling expected values
38The Maximum Entropy Principle
- Choose the most uniform distribution while
adhering to all the constraints - E.T.Jaynes..1990
- it agrees with everything that is known but
carefully avoids assuming anything that is not
known. It is transcription into mathematics of an
ancient principle of wisdom - Characterizing uniformity
- maximum when all pi-s are equal
- Solve the constrained optimization problem
- maximize H(p) subject to k constraints
39Modeling expected values
Visited views
Database
40Change in entropy
41Finding expected values
- Solve the constrained optimization problem
- maximize H(p) subject to k constraints
- Each constraint is of the form sum of arbitrary
sets of values - Expected values can be expressed as a product of
k coefficients one from each of the k constraints -
-
42Iterative scaling algorithm
- Initially all p values are the same
- While convergence not reached
- For each constraint Ci in turn
- Scale p values included in Ci by
- Converges to optimal solution when all
constraints are consistent. -
43(No Transcript)
44Information content of an unvisited cell
- Defined as how much adding it as a constraint
will reduce distance between actual and expected
values - Distance between actual and expected
- Information content of (k1)th constraint Ck1
- Can be approximated as
45Information content of unseen data
46Adapting for OLAP data Optimization 1 Expand
expected cube on demand
- Single entry for all cells with same expected
value - Initially everything aggregated but touches lot
of data - Later constraints touch limited amount of data.
Expected cube
Views
47Optimization 2 Reduce overlap
- Number of iterations depend on overlap between
constraints - Remove subsumed constraints from their parents to
reduce overlap
48Finding N most informative cells
- In general, most informative cells can be any of
value from any level of aggregation. - Single-pass algorithm that finds the best
difference between actual and expected values
VLDB-99
49Information gain with focussed exploration
50Illustration from Student enrollment data
35 of information in data captured in 12 out of
4560 cells 0.25 of data
51Top few suprising values
80 of information in data captured in 50 out of
4560 cells 1 of data
52Summary
- Our goal enhance OLAP with a suite of operations
that are - richer than simple OLAP and SQL queries
- more interactive than conventional mining
- ...and thus reduce the need for manual analysis
- Proposed three new operators Diff, Generalize,
Surprise - Formulations with theoretical basis
- Efficient algorithms for online answering
- Integrates smoothly with existing systems.
- Future work More operators.