The DGX Distribution for Mining Massive, Skewed Data


Provided by: zhiqiangbi

Transcript and Presenter's Notes

Title: The DGX Distribution for Mining Massive, Skewed Data

1
The DGX Distribution for Mining Massive, Skewed Data

Zhiqiang Bi (CMU)
Christos Faloutsos (CMU)
Flip Korn (AT&T)

2
Outline

Problem definition / Motivation
Background
Proposed method
Experiments
Conclusions

3
Motivation

Many real distributions are skewed,
but they occasionally tilt more than Zipf's law expects (top concavity)
Thus, we need a distribution more general than Zipf's
A quick intro to the Zipf distribution first

4
Outline

Problem definition / Motivation
Background: mini intro to Zipf
Proposed method
Experiments
Conclusions

5
Example
[Figure: rank-frequency plot for the Bible, log-log scales. Axes: log(rank) vs. log(freq); the points for "the" and "and" are labeled.]
Zipf's (first) law
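
To make the rank-frequency plot concrete, here is a minimal Python sketch that builds the (rank, frequency) pairs from a token list and estimates the slope of the line in log-log scales. The function names `rank_frequency` and `loglog_slope` and the toy corpus are illustrative assumptions, not part of the talk; the slope is a plain least-squares fit.

```python
import math
from collections import Counter

def rank_frequency(tokens):
    """Return [(rank, freq), ...], with rank 1 for the most frequent token."""
    counts = Counter(tokens)
    freqs = sorted(counts.values(), reverse=True)
    return list(enumerate(freqs, start=1))

def loglog_slope(points):
    """Least-squares slope of log(y) against log(x)."""
    xs = [math.log(x) for x, _ in points]
    ys = [math.log(y) for _, y in points]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx

# Toy corpus (an assumption): the word of rank r occurs about 1000 / r times.
tokens = [f"w{r}" for r in range(1, 51) for _ in range(1000 // r)]
rf = rank_frequency(tokens)
slope = loglog_slope(rf)
print(rf[:3], round(slope, 2))
```

A slope near -1 in this plot is the classic Zipf case; real corpora would need tokenization first, which is left out here.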
6
Equivalently
Zipf's (second) law: frequency-count relation (PDF)
[Figure: two log-log plots side by side. RANK-FREQUENCY: log(rank) vs. log(freq). FREQ.-COUNT (PDF): log(freq) vs. log(count). The points for "the", "and", and "of" are labeled in both plots.]
7
Equivalently
Zipf's (second) law: frequency-count relation (PDF)
[Figure: FREQ.-COUNT (PDF) plot, log(freq) vs. log(count); the points for "of", "and", and "the" are labeled.]
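
The frequency-count relation can be computed with two counting passes: first the frequency of each token, then the number of tokens that share each frequency. The sketch below is illustrative only; `frequency_count` and the toy sentence are assumed names and data.

```python
from collections import Counter

def frequency_count(tokens):
    """Map each frequency f to the number of distinct tokens occurring f times."""
    per_token = Counter(tokens)           # token -> frequency
    return Counter(per_token.values())    # frequency -> count of tokens

tokens = "the and of the and the a b c".split()
fc = frequency_count(tokens)
for freq in sorted(fc):
    print(freq, fc[freq])
```

Plotting log(freq) against log(count) from this mapping gives the frequency-count plot used throughout the talk.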
8
Why is Zipf important?

because MANY distributions follow it (words, last/first names, income, etc.)
Are there skewed distributions that are NOT Zipf?

9
Motivating example
Clickstream data (web site traffic): records of the form <url, u-id, ....>
[Figure: frequency-count plot, log(freq) vs. log(count), with a line labeled "Zipf".]
10
Outline

Problem definition / Motivation
Background: Zipf successes and failures
Proposed method
Experiments
Conclusions

11
Background

Skewed distributions appear VERY OFTEN in practice
relational DB ("80-20 law"; "high-end" histograms; skew-aware join algorithms)
economics (Pareto's law)
text / IR (Zipf)

12
Background (cont'd)

library science (Lotka's law of publication count) and citation counts (citeseer.nj.nec.com, 6/2001)

[Figure: log(citations) vs. log(count); the point for J. Ullman is labeled.]
13
Background (cont'd)

areas (lakes, islands, habitat patches) [Korcak]

14
Korcak's law
Scandinavian lakes: area vs. complementary cumulative count (log-log axes)
[Figure: log(area) vs. log(count(> area)).]
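
The complementary cumulative count used on this plot is, for each area a, the number of items with area strictly greater than a. A small Python sketch, with a hypothetical helper name and made-up areas:

```python
from bisect import bisect_right

def complementary_cumulative_count(values):
    """Return [(v, number of values strictly greater than v)] per distinct v."""
    ordered = sorted(values)
    n = len(ordered)
    return [(v, n - bisect_right(ordered, v)) for v in sorted(set(ordered))]

areas = [1, 1, 2, 4, 8]   # made-up lake areas, for illustration only
ccc = complementary_cumulative_count(areas)
print(ccc)
```

The largest value always gets a count of 0, so it has to be dropped before taking logarithms for a log-log plot.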
15
Olympic medals
[Figure: log(rank) vs. log(medals); the points for Russia, China, and USA are labeled.]
16
Earthquakes

Energy of earthquakes (Gutenberg-Richter law) [simscience.org]

[Figure: plots with axes labeled log(count), magnitude, amplitude, and day.]
17
Background (cont'd)

web: in- and out-links [CLEVER] [Barabasi]

18
Background (cont'd)

UNIX file systems; web file transfers [Bestavros]
population of cities [Zipf]
distribution of first and last names [Mandelbrot] (-1 and 0.7, resp.)
...

19
But occasionally Zipf fails

We want a distribution that
can handle the top concavity

20
Applications of Zipf's law

Word frequency
City size
Surname distribution
Olympic medals
Web traffic

Q: Do all skewed distributions obey Zipf's law?
21
Problem definition

We want a distribution that
is discrete
models real datasets well
includes Zipf as a special case
can handle the top concavity
needs few parameters
is fast to compute

22
Problem definition

which one to choose? (and why?)
Gaussian, Erlang, Weibull,
Chi-square, Cauchy
geometric, exponential
Pareto, ... ?

23
Problem definition (cont'd)

Or should we fit curves in the frequency-count
plot? or the rank-frequency plot?
parabolas? hyperbolas? sinusoids? polynomials?

24
Outline

Problem definition / Motivation
Background
Proposed method
Experiments
Conclusions

25
Proposed method: DGX

Discretized Gaussian Exponentiated
Inspired by the lognormal distribution, which is continuous

26
Lognormal

Definition: if X is Gaussian (m, s), then exp(X) is lognormal
It has only two parameters to estimate (m and s)
It has a deep theoretical background (contrary to Zipf's) and it appears often:
size of growing crystals
capitals that grow exponentially
etc.; see [Kotz, Johnson]
But it is continuous
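
The definition can be checked numerically: exponentiating Gaussian samples gives lognormal samples, whose median should be close to exp(m) and whose mean should be close to exp(m + s^2/2). The sketch below uses illustrative values for m and s and a fixed seed; none of these come from the talk.

```python
import math
import random
import statistics

random.seed(0)
m, s = 1.0, 0.5                                   # illustrative parameters
x = [random.gauss(m, s) for _ in range(20000)]    # Gaussian(m, s) samples
y = [math.exp(v) for v in x]                      # lognormal samples

sample_median = statistics.median(y)
sample_mean = statistics.mean(y)
print(round(sample_median, 2), round(math.exp(m), 2))
print(round(sample_mean, 2), round(math.exp(m + s * s / 2), 2))
```

Every sample in `y` is a positive real number, which is the property the next slides change: DGX keeps the shape but lives on the positive integers.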

27
Lognormal
BUT NOT discrete
[Figure: lognormal PDF in two panels. Linear axes: x (e.g. income) vs. Prob(x) (count). Log-log axes: log(x) vs. log(Prob(x)).]
28
Hence DGX
[Figure: DGX PDF, x (e.g. income) vs. Prob(x) (count); axis ticks 0, 1, ... and log(1).]
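
The transcript does not show the DGX formula. The sketch below assumes the lognormal-shaped form p(k) = A(m, s) / k * exp(-(ln k - m)^2 / (2 s^2)) for k = 1, 2, ..., with A chosen so the probabilities sum to 1. Truncating the support at `k_max` is an additional assumption made to keep the normalization finite, and `dgx_pmf` is a hypothetical name.

```python
import math

def dgx_pmf(m, s, k_max=100_000):
    """Return [p(1), ..., p(k_max)] for the assumed DGX form (see lead-in)."""
    weights = [math.exp(-((math.log(k) - m) ** 2) / (2 * s * s)) / k
               for k in range(1, k_max + 1)]
    total = sum(weights)                  # plays the role of 1 / A(m, s)
    return [w / total for w in weights]

# m > 0: probability first rises, then falls (top concavity in log-log scales).
p = dgx_pmf(m=2.0, s=1.0, k_max=10_000)
mode = max(range(len(p)), key=p.__getitem__) + 1

# m << 0: probability decreases from k = 1 onward (Zipf-like).
q = dgx_pmf(m=-5.0, s=1.0, k_max=10_000)
print(mode, q[0] > q[1] > q[2])
```

Under this assumed form the same two parameters cover both shapes, which is what the goals list asks for.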
29
Recall our goals

We want a distribution that
is discrete
models real datasets well
includes Zipf as a special case
can handle the top concavity
needs few parameters
is fast to compute

✓
✓
✓
30
Zipf as a special case
Skip

When log(k) << m, then [equation missing from transcript] becomes [equation missing from transcript]

Details in the paper.
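
The equations on this slide did not survive in the transcript. As a hedged reconstruction of the argument, assume the DGX probability has the lognormal-shaped form written first below (an assumption, since the formula is not shown here). Taking logarithms and expanding the square gives a term linear in ln k plus a quadratic term; when ln k is small compared with |m|, the quadratic term is negligible and what remains is a straight line in log-log scales, i.e. a power law.

```latex
% Assumed DGX form (not shown in this transcript):
\[
  p(k) = \frac{A(m,s)}{k}\,
         \exp\!\left[-\frac{(\ln k - m)^2}{2s^2}\right],
  \qquad k = 1, 2, \dots
\]
% Take logs and expand the square:
\[
  \ln p(k) = \Bigl[\ln A(m,s) - \frac{m^2}{2s^2}\Bigr]
           - \Bigl(1 - \frac{m}{s^2}\Bigr)\ln k
           - \frac{(\ln k)^2}{2s^2}
\]
% For \ln k \ll |m| the last term is small next to (|m|/s^2)\ln k:
\[
  \ln p(k) \approx c - \Bigl(1 - \frac{m}{s^2}\Bigr)\ln k ,
  \qquad c = \ln A(m,s) - \frac{m^2}{2s^2}
\]
```

With m < 0 the slope -(1 - m/s^2) is negative and steeper than -1, which matches the "m << 0 gives Zipf-like" intuition.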
31
Zipf as a special case

Intuitively

m >> 0 -> top concavity
m << 0 -> Zipf-like
[Figure: two plots of log(Prob(x)) vs. log(x), one per case; the log(x) axis starts at log(1).]
32
Recall our goals

We want a distribution that
is discrete
models real datasets well
includes Zipf as a special case
can handle the top concavity
needs few parameters
is fast to compute

✓
✓
✓
✓
33
Estimation of m, s

single pass, to collect the histogram (PDF)
maximum likelihood for m, s (using an off-the-shelf maximization routine)

34
Estimation of m, s
Skip
Off-the-shelf maximization algorithm (Matlab), to find good m, s
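
The two estimation steps can be sketched in Python as follows. This is not the authors' code: it assumes the DGX form p(k) proportional to (1/k) * exp(-(ln k - m)^2 / (2 s^2)) truncated at a chosen `k_max`, and it swaps the off-the-shelf Matlab maximizer for a plain pattern search so that the example needs only the standard library. All function names are hypothetical, and the data are synthetic draws from the same assumed model.

```python
import math
import random
from collections import Counter

def log_weights(m, s, log_k):
    """Unnormalized log p(k) for the assumed DGX form, k = 1..len(log_k)."""
    return [-lk - (lk - m) ** 2 / (2 * s * s) for lk in log_k]

def log_likelihood(hist, m, s, log_k):
    """Log-likelihood of a {value: count} histogram, normalized over 1..k_max."""
    lw = log_weights(m, s, log_k)
    peak = max(lw)                                   # log-sum-exp for stability
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in lw))
    return sum(c * (lw[k - 1] - log_norm) for k, c in hist.items())

def fit_dgx(values, k_max, m0=0.0, s0=1.0, step=1.0, tol=1e-4):
    """Maximize the likelihood over (m, s) with a simple pattern search."""
    hist = Counter(values)                           # single pass over the data
    log_k = [math.log(k) for k in range(1, k_max + 1)]
    best, best_ll = (m0, s0), log_likelihood(hist, m0, s0, log_k)
    while step > tol:
        improved = False
        for dm, ds in ((step, 0), (-step, 0), (0, step), (0, -step)):
            m, s = best[0] + dm, best[1] + ds
            if s <= 0:
                continue
            ll = log_likelihood(hist, m, s, log_k)
            if ll > best_ll:
                best, best_ll, improved = (m, s), ll, True
        if not improved:
            step /= 2                                # refine once no move helps
    return best

# Synthetic check: draw from the assumed model, then try to recover (m, s).
random.seed(1)
K = 5000
true_m, true_s = 2.0, 1.0
grid = [math.log(k) for k in range(1, K + 1)]
w = [math.exp(v) for v in log_weights(true_m, true_s, grid)]
sample = random.choices(range(1, K + 1), weights=w, k=5000)
m_hat, s_hat = fit_dgx(sample, k_max=K)
print(round(m_hat, 2), round(s_hat, 2))
```

Any general-purpose optimizer could replace the pattern search; the only model-specific part is the log-likelihood of the histogram.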
35
Recall our goals

We want a distribution that
is discrete
models real datasets well
includes Zipf as a special case
can handle the top concavity
needs few parameters
is fast to compute

✓
✓
✓
✓
✓
36
Outline

Problem definition / Motivation
Background
Proposed method
Experiments
datasets
goodness of DGX
data mining: spotting outliers with DGX
Conclusions

37
Experiments

Data
TEXT (Bible): N = 800,000 words and V = 12,500 vocabulary words
SALES data from a retail chain (O(100) branches, <p-id, b-id>, 5Gb records per week)
TELCO data: monthly usage volume per customer, from three regions, <u-id, region-id>
CLICKSTREAM data [A. Montgomery, GSIA/CMU]

38
Experiments

Evaluation of goodness
visual, in the frequency-count plot
correlation coefficient in the q-q plot (quantile-quantile plot; ideally, a straight line with slope 1)

[Figure: q-q plot construction. Each point pairs a quantile of the actual distribution (e.g. its 90th percentile) with the same quantile of the synthetic distribution.]
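
The q-q correlation can be sketched as follows: compute matching quantiles of the real and the synthetic sample, treat them as points, and take the Pearson correlation of those points. The quantile grid (percentiles 1 to 99), the linear interpolation, and all names and data below are assumptions made for illustration.

```python
import math

def quantile(sorted_vals, q):
    """Linearly interpolated q-quantile (0 <= q <= 1) of pre-sorted data."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])

def qq_points(real, synthetic, n_quantiles=99):
    """Pair the same quantiles of two samples: [(real_q, synthetic_q), ...]."""
    r, s = sorted(real), sorted(synthetic)
    qs = [i / (n_quantiles + 1) for i in range(1, n_quantiles + 1)]
    return [(quantile(r, q), quantile(s, q)) for q in qs]

def pearson(points):
    """Pearson correlation coefficient of a list of (x, y) points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x, _ in points)
    syy = sum((y - my) ** 2 for _, y in points)
    return sxy / math.sqrt(sxx * syy)

real = [1, 1, 2, 2, 3, 5, 8, 13, 21, 34]      # made-up skewed sample
same_shape = list(reversed(real))             # same values, different order
uniform = list(range(1, 11))                  # differently shaped sample

r_same = pearson(qq_points(real, same_shape))
r_diff = pearson(qq_points(real, uniform))
print(round(r_same, 3), round(r_diff, 3))
```

A high correlation alone only says the q-q points are close to some straight line; checking that the slope is near 1 is a separate step.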
39
Results: TEXT
[Figure, left: frequency-count plot, word frequency (log scale) vs. count (log scale); blue = synthetic, green = real.]
[Figure, right: quantile-quantile plot, real vs. synthetic, annotated 0.96.]
40
SALES data: store 96
(plot legend: blue = synthetic, green = real)
41
SALES data: store 82
(plot legend: blue = synthetic, green = real)
42
SALES data: store 101
(plot legend: blue = synthetic, green = real)
43
TELCO data: region A
(plot legend: blue = synthetic, green = real)
44
TELCO data: region B
(plot legend: blue = synthetic, green = real)
45
TELCO data: region C
(plot legend: blue = synthetic, green = real)
46
CLICKSTREAM
[Figure: plot with axes labeled "web site access count" and "number of user accesses".]
47
Outline