Sunita Sarawagi IIT Bombay http://www.it.iitb.ernet.in/~sunita

About This Presentation

Title:

Sunita Sarawagi IIT Bombay http://www.it.iitb.ernet.in/~sunita

Description:

Fast, interactive answers to large aggregate queries. ... Navigational operators: Pivot, ... Unravel aggregate data. Total sales dropped 30% in N. America. Why? ... – PowerPoint PPT presentation

Number of Views:112

Avg rating:3.0/5.0

Slides: 50

Provided by: sunitas4

Category:

more less

Transcript and Presenter's Notes

Title: Sunita Sarawagi IIT Bombay http://www.it.iitb.ernet.in/~sunita

1
Sunita SarawagiIIT Bombayhttp//www.it.iitb
.ernet.in/sunita
I3 Intelligent, Interactive Investigation of
multidimensional data
2
Multidimensional OLAP databases

Fast, interactive answers to large aggregate
queries.
Multidimensional model dimensions with
hierarchies
Dim 1 Bank location
branch--gtcity--gtstate
Dim 2 Customer
sub profession --gt profession
Dim 3 Time
month --gt quarter --gt year
Measures loan amount, transactions, balance

3
OLAP

Navigational operators Pivot, drill-down,
roll-up, select.
Hypothesis driven search E.g. factors affecting
defaulters
view defaulting rate on age aggregated over other
dimensions
for particular age segment detail along
profession
Need interactive response to aggregate queries..

4
Motivation

OLAP products provide a minimal set of tools for
analysis
simple aggregates
selects/drill-downs/roll-ups on the
multidimensional structure
Heavy reliance on manual operations for analysis
tedious on large data with multiple dimensions
and levels of hierarchy
GOAL automate through complex, mining-like
operations integrated with Olap.

5
State of art in mining OLAP integration

Decision trees Information discovery, Cognos
find factors influencing high profits
Clustering Pilot software
segment customers to define hierarchy on that
dimension
Time series analysis Seagates Holos
Query for various shapes along time spikes,
outliers etc
Multi-level Associations Han et al.
find association between members of dimensions

6
New approach

Identify complex operations with specific
OLAP needs in mind (what does an analyst need?)
rather than looking at mining operations and
choosing what fits
Three examples
Diff for specific why questions at aggregate
level
most compactly represent the answer that user can
quickly assimilate
Generalize from detailed data to more general
cases
expand scope of problem case as far out as
possible
Inform of interesting regions in data
Point to most informative regions in data, so
user does not need to hunt for them in the blind.

7
The Diff operator
8
Unravel aggregate data
What is the most compact answer that user can
quickly assimilate?
9
Solution

A new DIFF-operator added to OLAP systems that
provides the answer
in a single-step
is easy-to-assimilate
and compact --- configurable by user.
Obviates use of the lengthy and manual search for
reasons in large multidimensional data.

10
Example query
11
Compact answer
12
Example explaining increases
13
Compact answer
14
Model for summarization

The two aggregated values correspond to two
subcubes in detailed data.

Cube-A
Cube-B
15
Detailed answers
Explain only 15 of total difference as against
90 with compact
16
Summarizing similar changes
17
MDL model for summarization

Given N, find the best N rows of answer such
that
if user knows cube-A and answer,
number of bits needed to send cube-B is
minimized.

N row answer
Cube-A
Cube-B
18
Transmission cost MDL-based

Each answer entry has a ratio that is
sum of measure values in cube-B and cube-A not
covered by a more detailed entry in answer.
For each cell of cube-B not in answer
r ratio of closest parent in answer
a (b) measure value of cube A (B).
Expected value of b a r
bits -log(prob(b, ar)) where prob(x,u) is
probability at value x for a distribution with
mean u.
We use a poisson distribution when x are counts,
normal distribution otherwise

19
Algorithm

Challenges
Circular dependence on parents ratio
Bounded size of answer
Greedy methods do not work
Bottom up dynamic programming algorithm

20
N2
i
Tuples with same parent
Tuples in detailed data grouped by common parent..
21
Integration

Single pass on data --- all indexing/sorting in
the DBMS interactive.
Low memory usage independent of number of
tuples O(NL)
Easy to package as a stored procedure on the data
server side.
When detailed subcube too large work off
aggregated data.

22
Performance

80 time spent in data access.
Quarter million records processed in 10 seconds

333 MHz Pentium 128 MB memory Data on DB2
UDB NT 4.0 Olap benchmark 1.36 million tuples 4
dimensions
23
The Relax operator
24
Example query generalizing drops
25
(No Transcript)
26
Ratio generalization
27
Problem formulation

Inputs
A specific tuple Ts
An upper bound N on the answer size
Error functions
R(Ts,T?) measures the error of including a tuple
T? in a generalization around Ts
S(Ts,T?) measures the error of excluding T? from
the generalization
Goal
To find all possible consistent and maximal
generalizations around Ts

28
Algorithm

Considerations
Need to exploit the capabilities of the OLAP data
source
Need to reduce the amount of data fetches to the
application
2-stage approach
Finding generalizations
Getting exceptions

29
Finding generalizations

n number of dimensions
Li levels of hierarchy of dimension Di
Dij jth level in the ith dimension hierarchy
candidate_set ? D11, D21Dn1 // all single
dimension candidate gen.
k 1
while (candidate_set ? ?)
?g ? candidate_set
if (ST?g S(Ts,T) gt ST?g R(Ts,T)) Gk ? Gk ? g
// generating candidates for pass (k1)
from generalizations of pass k
candidate_set ? generateCandidates(Gk)
//Apriori style
// if gen is possible at level j of dimension
Di , add its parent level to the candidate set
candidate_set ? candidate_set ? Di(j1)Dij
? Gk jlt Li
k ? k 1
Return ?i Gi

30
Finding Summarized Exceptions

Goal
Find exceptions to each maximal
generalization compacted to within N rows and
yielding the minimum total error
Challenges
No absolute criteria for determining whether a
tuple is an exception or not for all possible R
functions
Worth of including a child tuple is circularly
dependent on its parent tuple
Bounded size of answer
Solution
Bottom up dynamic programming algorithm

31
Single dimension with multiple levels of
hierarchies

Optimal solution for finite domain R functions
soln(l,n,v) the best solution for subtree l for
all n between 0 and N and all possible values of
the default rep.
soln(l,n,v,c) the intermediate value of
soln(l,n,v) after the 1st to the cth child of l
are scanned
Err(soln(l,n,v,c1))min0?k?n(Err(soln(l,n,v,c))E
rr(soln(c1,n-k,v)))
Err(soln(l,n,v))min(Err(soln(l,n,v,)),
minv ? v Err(soln(1,n-1,v,)rep(v)))

32
soln(1,1,)
N3
N2
N1
N0
1
1.1 ()
1.2 (-)
1.3 ()
1.4 ()
- - - 1 2 3 4 5 6
7 8 9 10
- 1 2 3 4 5 6

- - -
1 2 3 4 5 6 7 8 9

- - - - - 1 2 3 4 5 6 7
soln(1.1,3,)
soln(1.2,3,)
soln(1.3,3,)
soln(1.4,3,)
33
Generalize Operator
34
(No Transcript)
35
The Inform operator
36
User-cognizant data exploration overview

Monitor to find regions of data user has visited
Model users expectation of unseen values
Report most informative unseen values

How to
Model expected values?
Define information content?

37
Modeling expected values
38
The Maximum Entropy Principle

Choose the most uniform distribution while
adhering to all the constraints
E.T.Jaynes..1990
it agrees with everything that is known but
carefully avoids assuming anything that is not
known. It is transcription into mathematics of an
ancient principle of wisdom
Characterizing uniformity
maximum when all pi-s are equal
Solve the constrained optimization problem
maximize H(p) subject to k constraints

39
Modeling expected values
Visited views
Database
40
Change in entropy
41
Finding expected values

Solve the constrained optimization problem
maximize H(p) subject to k constraints
Each constraint is of the form sum of arbitrary
sets of values
Expected values can be expressed as a product of
k coefficients one from each of the k constraints

42
Iterative scaling algorithm

Initially all p values are the same
While convergence not reached
For each constraint Ci in turn
Scale p values included in Ci by

Converges to optimal solution when all
constraints are consistent.

43
(No Transcript)
44
Information content of an unvisited cell

Defined as how much adding it as a constraint
will reduce distance between actual and expected
values
Distance between actual and expected
Information content of (k1)th constraint Ck1
Can be approximated as

45
Information content of unseen data
46
Adapting for OLAP data Optimization 1 Expand
expected cube on demand

Single entry for all cells with same expected
value
Initially everything aggregated but touches lot
of data
Later constraints touch limited amount of data.

Expected cube
Views
47
Optimization 2 Reduce overlap

Number of iterations depend on overlap between
constraints
Remove subsumed constraints from their parents to
reduce overlap

48
Finding N most informative cells

In general, most informative cells can be any of
value from any level of aggregation.
Single-pass algorithm that finds the best
difference between actual and expected values
VLDB-99

49
Information gain with focussed exploration
50
Illustration from Student enrollment data
35 of information in data captured in 12 out of
4560 cells 0.25 of data
51
Top few suprising values
80 of information in data captured in 50 out of
4560 cells 1 of data
52
Summary

Our goal enhance OLAP with a suite of operations
that are
richer than simple OLAP and SQL queries
more interactive than conventional mining
...and thus reduce the need for manual analysis
Proposed three new operators Diff, Generalize,
Surprise
Formulations with theoretical basis
Efficient algorithms for online answering
Integrates smoothly with existing systems.
Future work More operators.

Write a Comment

User Comments (0)

About PowerShow.com

Sunita Sarawagi IIT Bombay http://www.it.iitb.ernet.in/~sunita - PowerPoint PPT Presentation

Sunita Sarawagi IIT Bombay http://www.it.iitb.ernet.in/~sunita

Fast, interactive answers to large aggregate queries. ... Navigational operators: Pivot, ... Unravel aggregate data. Total sales dropped 30% in N. America. Why? ... – PowerPoint PPT presentation