The DGX Distribution for Mining Massive, Skewed Data

HP Labs, 2001

1
The DGX Distribution for Mining Massive, Skewed Data
  • Zhiqiang Bi (CMU)
  • Christos Faloutsos (CMU)
  • Flip Korn (AT&T)

2
Outline
  • Problem definition / Motivation
  • Background
  • Proposed method
  • Experiments
  • Conclusions

3
Motivation
  • Many real distributions are skewed
  • but they occasionally tilt more than Zipf's law
    expects (top concavity)
  • Thus, we need a distribution more general than
    Zipf's
  • A quick intro to the Zipf distribution first

4
Outline
  • Problem definition / Motivation
  • Background: mini intro to Zipf
  • Proposed method
  • Experiments
  • Conclusions

5
Example
[Figure: Bible RANK-FREQUENCY plot (in log-log scales); x-axis log(rank), y-axis log(freq); labeled points "the" and "and"]
Zipf's (first) law
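A minimal Python sketch of the rank-frequency plot on this slide, assuming a toy corpus in place of the Bible text. The token list and the helper names `rank_frequency` and `loglog_slope` are illustrative and do not come from the slides. Under Zipf's first law the points fall near a straight line in log-log scales, so the fitted slope is a quick check.

```python
import math
from collections import Counter

def rank_frequency(tokens):
    """Return [(rank, freq)], where rank 1 is the most frequent token."""
    counts = Counter(tokens)
    freqs = sorted(counts.values(), reverse=True)
    return list(enumerate(freqs, start=1))

def loglog_slope(points):
    """Least-squares slope of log(y) against log(x)."""
    xs = [math.log(x) for x, _ in points]
    ys = [math.log(y) for _, y in points]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx

# Toy corpus (an assumption): the word of rank r occurs about 1200 / r times.
toy_tokens = [f"w{r}" for r in range(1, 51) for _ in range(1200 // r)]
slope = loglog_slope(rank_frequency(toy_tokens))
```

For this toy corpus the slope comes out close to -1, the value usually associated with Zipf's law; real text will deviate from it.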
6
Equivalently
Zipf's (second) law: frequency-count relation (PDF)
[Figure: two log-log panels. RANK-FREQUENCY: log(rank) vs. log(freq), labeled points "the", "and", "of". FREQ.-COUNT (PDF): log(freq) vs. log(count), with the same words labeled]
7
Equivalently
Zipf's (second) law: frequency-count relation (PDF)
[Figure: FREQ.-COUNT (PDF) plot, log(freq) vs. log(count); labeled points "of", "and", "the"]
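The frequency-count relation can be built from the same word counts: for each frequency f, count how many distinct words occur exactly f times. A small Python sketch, with a made-up sentence as input (the function name `frequency_count` is illustrative):

```python
from collections import Counter

def frequency_count(tokens):
    """Map each frequency f to the number of distinct tokens seen exactly f times."""
    token_freq = Counter(tokens)
    return dict(sorted(Counter(token_freq.values()).items()))

# Made-up sample: "the" x3, "and" x2, four words seen once.
sample = "the and of the and the a b c".split()
fc = frequency_count(sample)
```

Plotting the keys of `fc` against its values in log-log scales gives the FREQ.-COUNT view; very frequent words such as "the" end up as isolated points at the far right.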
8
Why is Zipf important?
  • because MANY distributions follow it (words, last/first
    names, income, etc.)
  • are there skewed distributions that are NOT Zipf?

9
Motivating example
Clickstream Data: Web Site Traffic
[Figure: frequency-count plot, log(freq) vs. log(count), with the Zipf fit marked; records of the form <url, u-id, ....>]
10
Outline
  • Problem definition / Motivation
  • Background: Zipf successes and failures
  • Proposed method
  • Experiments
  • Conclusions

11
Background
  • Skewed distributions appear VERY OFTEN in
    practice
    • relational db ('80-20 law'; 'high-end'
      histograms; skew-aware join algorithms)
    • economics (Pareto's law)
    • text / IR: Zipf

12
Background (cont'd)
  • library science (Lotka's law of publication
    count) and citation counts (citeseer.nj.nec.com,
    6/2001)

[Figure: citation counts in log-log scales, log(citations) vs. log(count); labeled point: J. Ullman]
13
Background (cont'd)
  • areas (lakes, islands, habitat patches) [Korcak]

14
Korcak's law
[Figure: Scandinavian lakes, area vs. complementary cumulative count (log-log axes); x-axis log(area), y-axis log(count(> area))]
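The y-axis of this plot is a complementary cumulative count: for each area value, the number of items with strictly larger area. A short Python sketch of that computation follows; the function name `ccdf_counts` is illustrative, and the slides do not say how the original plot was produced.

```python
def ccdf_counts(values):
    """Return [(v, number of items strictly greater than v)] for each distinct v, ascending."""
    ordered = sorted(values)
    n = len(ordered)
    out = []
    i = 0
    while i < n:
        v = ordered[i]
        j = i
        # Skip over ties so that each distinct value appears once.
        while j < n and ordered[j] == v:
            j += 1
        out.append((v, n - j))
        i = j
    return out
```

Taking logs of both columns (and dropping the last row, whose count is 0) gives the points of a Korcak-style plot.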
15
Olympic medals
[Figure: log(medals) vs. log rank; labeled points: Russia, China, USA]
16
Earthquakes
  • Energy of earthquakes (Gutenberg-Richter law)
    [simscience.org]

[Figure: axis labels log(count), magnitude, amplitude, day]
17
Background (cont'd)
  • web: number of in- and out-links [CLEVER, Barabasi]

18
Background (cont'd)
  • UNIX file systems; web file transfers
    [Bestavros]
  • population of cities [Zipf]
  • distribution of first and last names [Mandelbrot]
    (-1 and 0.7, resp.)
  • ...

19
But occasionally Zipf fails
  • We want a distribution that
    • can handle the top concavity

20
Applications of Zipf's Law
  • Word frequency
  • City size
  • Surname distribution
  • Olympic medals
  • Web traffic

Q: Do all skewed distributions obey Zipf's law?
21
Problem definition
  • We want a distribution that
    • is discrete
    • models real datasets well
    • includes Zipf as a special case
    • can handle the top concavity
    • needs few parameters
    • is fast to compute

22
Problem definition
  • which one to choose? (and why?)
    • Gaussian, Erlang, Weibull,
    • Chi-square, Cauchy
    • geometric, exponential
    • Pareto, ... ?

23
Problem definition (cont'd)
  • Or should we fit curves in the frequency-count
    plot? or the rank-frequency plot?
    • parabolas? hyperbolas? sinusoids? polynomials?

24
Outline
  • Problem definition / Motivation
  • Background
  • Proposed method
  • Experiments
  • Conclusions

25
Proposed Method: DGX
  • Discretized Gaussian Exponentiated
  • Inspired by the lognormal distribution, which is
    continuous

26
Lognormal
  • Definition: if X is Gaussian(m, s), then exp(X) is
    lognormal
  • It has only two parameters to estimate (m and s)
  • It has a deep theoretical background (contrary to
    Zipf's) and it appears often
    • size of growing crystals
    • capital that grows exponentially
    • etc.; see [Kotz, Johnson]
  • But it is continuous
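The definition can be checked numerically: exponentiate Gaussian draws, and the logs of the result should again have mean m and standard deviation s. A small Python sketch; the parameter values, sample size, and the name `sample_lognormal` are arbitrary choices for illustration.

```python
import math
import random

def sample_lognormal(m, s, n, seed=0):
    """Draw n lognormal values as exp(X), with X ~ Gaussian(m, s)."""
    rng = random.Random(seed)
    return [math.exp(rng.gauss(m, s)) for _ in range(n)]

xs = sample_lognormal(m=1.0, s=0.5, n=20000)
logs = [math.log(x) for x in xs]
mean_log = sum(logs) / len(logs)
std_log = math.sqrt(sum((v - mean_log) ** 2 for v in logs) / len(logs))
```

The samples are real-valued and strictly positive, which illustrates the limitation noted on the slide: count data needs a distribution defined on the integers.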

27
Lognormal
[Figure: lognormal PDF, Prob(x) (count) vs. x (e.g., income), and the same curve in log-log scales, log(Prob(x)) vs. log(x)]
BUT NOT discrete
28
Hence DGX
[Figure: DGX PDF, Prob(x) (count) vs. x (e.g., income), shown at discrete x values; axis marks 0, 1, ..., log(1)]
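The transcript does not preserve the DGX formula. The Python sketch below assumes the discretized-lognormal form p(k) proportional to (1/k) * exp(-(ln k - m)^2 / (2 s^2)) for k = 1, 2, ..., which should be checked against the paper. Truncating the support at `k_max` and normalizing by direct summation are simplifications made here, and the name `dgx_pmf` is illustrative.

```python
import math

def dgx_pmf(m, s, k_max):
    """Probabilities for k = 1..k_max under the ASSUMED discretized-lognormal form.

    Unnormalized weight: (1/k) * exp(-(ln k - m)^2 / (2 s^2)).
    The support is truncated at k_max and normalized by summation.
    """
    weights = [math.exp(-((math.log(k) - m) ** 2) / (2.0 * s * s)) / k
               for k in range(1, k_max + 1)]
    total = sum(weights)
    return [w / total for w in weights]

# Illustrative parameters with m > 0.
pmf = dgx_pmf(m=2.0, s=1.0, k_max=10000)
mode = 1 + max(range(len(pmf)), key=pmf.__getitem__)
```

With m = 2 and s = 1 the probability of k = 1 is lower than that of k = 3, so the curve bends down at the top in log-log scales. A pure power law cannot reproduce that shape.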
29
Recall our goals
  • We want a distribution that
    • is discrete
    • models real datasets well
    • includes Zipf as a special case
    • can handle the top concavity
    • needs few parameters
    • is fast to compute

(3 of the 6 goals checked off so far)
30
Zipf as a special case
Skip
  • When log(k) << m, then [equation not preserved in transcript]
  • becomes [equation not preserved in transcript]

Details in the paper.
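Since the equations on this slide are lost, here is one way the limit can be worked out, assuming the DGX probability has the discretized-lognormal form shown in the first line (the normalization constant is written A(m, s); both the form and the notation should be checked against the paper). The condition is read here as ln k much smaller than |m|, which matches the remark on the next slide that m << 0 gives Zipf-like behavior.

```latex
% Assumed DGX form (not preserved in this transcript), for k = 1, 2, ...
\[
  P(x = k) = \frac{A(m, s)}{k}\,
             \exp\!\left[-\frac{(\ln k - m)^2}{2 s^2}\right]
\]
% Expand the square and drop the (\ln k)^2 term when \ln k \ll |m|:
\[
  -\frac{(\ln k - m)^2}{2 s^2}
  \approx -\frac{m^2}{2 s^2} + \frac{m}{s^2}\,\ln k
\]
% The result is a power law in k:
\[
  P(x = k) \approx A(m, s)\, e^{-m^2 / (2 s^2)}\; k^{-(1 - m / s^2)}
\]
```

For m < 0 the exponent is 1 + |m|/s^2, so the probability falls off as a straight line in log-log scales, as a Zipf-like distribution does.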
31
Zipf as a special case
  • Intuitively

[Figure: two log-log panels, log(x) vs. log(Prob(x)), x-axis starting at log(1). m >> 0 -> top-concavity; m << 0 -> Zipf-like]
32
Recall our goals
  • We want a distribution that
    • is discrete
    • models real datasets well
    • includes Zipf as a special case
    • can handle the top concavity
    • needs few parameters
    • is fast to compute

(4 of the 6 goals checked off so far)
33
Estimation of m, s
  • single pass, to collect the histogram (PDF)
  • maximum likelihood for m, s (using an off-the-shelf
    maximization routine)
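The two steps above can be sketched in Python as follows. This is not the authors' code. It assumes the discretized-lognormal form of DGX with support truncated at `k_max`, and it swaps the off-the-shelf maximizer for a plain coarse-to-fine grid search so that the example needs only the standard library. The search ranges, the synthetic histogram, and all names are illustrative.

```python
import math

def dgx_log_likelihood(hist, m, s, k_max):
    """Log-likelihood of a histogram {k: count} under the ASSUMED DGX form."""
    log_w = [-math.log(k) - (math.log(k) - m) ** 2 / (2.0 * s * s)
             for k in range(1, k_max + 1)]
    # Log of the normalizing sum, computed stably (log-sum-exp).
    peak = max(log_w)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in log_w))
    return sum(c * (log_w[k - 1] - log_norm) for k, c in hist.items())

def fit_dgx(hist, k_max, rounds=8):
    """Coarse-to-fine grid search for (m, s); a stand-in for a library optimizer."""
    m_lo, m_hi, s_lo, s_hi = -5.0, 10.0, 0.1, 5.0
    best = None
    for _ in range(rounds):
        ms = [m_lo + i * (m_hi - m_lo) / 10 for i in range(11)]
        ss = [s_lo + i * (s_hi - s_lo) / 10 for i in range(11)]
        best = max((dgx_log_likelihood(hist, m, s, k_max), m, s)
                   for m in ms for s in ss)
        _, m, s = best
        # Halve the search window and recenter it on the current best point.
        dm, ds = (m_hi - m_lo) / 4, (s_hi - s_lo) / 4
        m_lo, m_hi = m - dm, m + dm
        s_lo, s_hi = max(0.05, s - ds), s + ds
    return best[1], best[2]

# Synthetic histogram: expected counts for m = 2, s = 1 (illustrative values).
K_MAX = 300
weights = [math.exp(-(math.log(k) - 2.0) ** 2 / 2.0) / k for k in range(1, K_MAX + 1)]
total = sum(weights)
hist = {k: 10000 * w / total for k, w in enumerate(weights, start=1)}
m_hat, s_hat = fit_dgx(hist, K_MAX)
```

On real data, `hist` would come from the single pass over the records, and a proper numerical optimizer would normally replace the grid search.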

34
Estimation of m, s
Skip
Off-the-shelf maximization algorithm (MATLAB), to find
good m, s
35
Recall our goals
  • We want a distribution that
    • is discrete
    • models real datasets well
    • includes Zipf as a special case
    • can handle the top concavity
    • needs few parameters
    • is fast to compute

(5 of the 6 goals checked off so far)
36
Outline
  • Problem definition / Motivation
  • Background
  • Proposed method
  • Experiments
    • datasets
    • goodness of DGX
    • data mining: spotting outliers with DGX
  • Conclusions

37
Experiments
  • Data
    • TEXT (Bible): N = 800,000 words and V = 12,500
      vocabulary words
    • SALES data from a retail chain (O(100) branches,
      <p-id, b-id>, 5Gb records per week)
    • TELCO data: monthly usage volume per customer,
      from three regions, <u-id, region-id>
    • CLICKSTREAM data [A. Montgomery, GSIA/CMU]

38
Experiments
  • Evaluation of goodness
    • visual, in the frequency-count plot
    • correlation coefficient in the q-q plot
      (quantile-quantile plot; ideally, a straight line
      with slope 1)

[Figure: q-q plot; quantile of the synthetic distribution vs. quantile of the actual distribution, e.g. 90%-tile of synthetic against 90%-tile of actual]
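A minimal Python sketch of the q-q correlation measure, assuming nearest-rank empirical quantiles at 99 evenly spaced probability levels. The slides do not specify the quantile levels, or whether the values are log-transformed first, so both are choices made here; the function names are illustrative.

```python
import math

def quantiles(data, probs):
    """Empirical quantiles by nearest rank on the sorted sample."""
    ordered = sorted(data)
    n = len(ordered)
    return [ordered[min(n - 1, max(0, math.ceil(p * n) - 1))] for p in probs]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def qq_correlation(actual, synthetic, n_points=99):
    """Correlation between matching quantiles of the two samples."""
    probs = [i / (n_points + 1) for i in range(1, n_points + 1)]
    return pearson(quantiles(actual, probs), quantiles(synthetic, probs))
```

A correlation near 1 says the q-q points lie close to some straight line. It does not by itself confirm slope 1, so the slope still needs a separate look.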
39
Results: TEXT
[Figure: frequency-count plot, word frequency (log scale) vs. count (log scale); blue: synthetic, green: real. Quantile-quantile plot, real vs. synthetic, annotated 0.96]
40
SALES data: store 96
[Figure: blue: synthetic; green: real]
41
SALES data: store 82
[Figure: blue: synthetic; green: real]
42
SALES data: store 101
[Figure: blue: synthetic; green: real]
43
TELCO data: region A
[Figure: blue: synthetic; green: real]
44
TELCO data: region B
[Figure: blue: synthetic; green: real]
45
TELCO data: region C
[Figure: blue: synthetic; green: real]
46
CLICKSTREAM
[Figure: web site access count vs. number of user accesses]
47
Outline
  • Problem definition / Motivation
  • Background
  • Proposed method
  • Experiments
    • datasets
    • goodness of DGX
    • data mining: spotting outliers with DGX
  • Conclusions

48
How to spot outlier branches
[Figure: branches plotted by their fitted parameters; axes m and s]
49
How to spot outlier branches
[Figure: branches plotted by their fitted parameters; axes m and s]
50
How to spot outlier branches
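The slides show this step only as plots of the fitted (m, s) pairs. One possible way to automate it, sketched in Python under assumptions: fit (m, s) per branch, then flag branches whose m or s lies far from the median across branches, measured in units of the median absolute deviation. The threshold, the rule itself, and the branch data below are all hypothetical, not from the talk.

```python
import statistics

def flag_outliers(params, threshold=3.0):
    """params: {branch_id: (m, s)}. Return branch ids far from the median in m or s."""
    flagged = []
    for axis in (0, 1):
        values = [p[axis] for p in params.values()]
        med = statistics.median(values)
        # Median absolute deviation; the fallback avoids division by zero.
        mad = statistics.median(abs(v - med) for v in values) or 1e-12
        for branch, p in params.items():
            if abs(p[axis] - med) / mad > threshold and branch not in flagged:
                flagged.append(branch)
    return sorted(flagged)

# Hypothetical fitted parameters for five branches.
branches = {
    "b1": (3.0, 1.0),
    "b2": (3.1, 1.1),
    "b3": (2.95, 0.95),
    "b4": (3.05, 1.05),
    "b5": (6.0, 1.0),
}
outliers = flag_outliers(branches)
```

Here only "b5" stands apart, because of its much larger m. With just two parameters per branch, a scatter plot is often enough to spot such points by eye.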
51
Conclusions
  • DGX has all the desired properties
    • is discrete
    • models real datasets well
    • includes Zipf as a special case
    • can handle the top concavity
    • needs few parameters
    • is fast to compute

(all 6 goals checked off)
52
Philosophically, why is DGX so popular?
  • Gaussian: fixed point for addition of random
    variables; lognormal/DGX: for multiplication
  • FOR EXAMPLE
    • breaking a stick into pieces
    • 'rich get richer' phenomena

53
Philosophically, why is DGX so popular?
  • Stick, breaking in half, n times
  • length of leftmost piece: L0 * p1 * p2 * ... * pn

[Figure: after the first break, the two pieces have lengths L0 * p1 and L0 * (1-p1)]
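A small Python simulation of the stick-breaking argument, assuming each fraction p_i is drawn uniformly at random (the slides do not specify the distribution of the p_i). Because the length is a product of fractions, its logarithm is a sum of independent terms, which is why it tends toward a Gaussian and the length toward a lognormal.

```python
import math
import random

def leftmost_piece(l0, n, rng):
    """Length of the leftmost piece after n breaks at random fractions."""
    length = l0
    for _ in range(n):
        p = 1.0 - rng.random()  # fraction in (0, 1], avoids log(0) later
        length *= p
    return length

rng = random.Random(1)
logs = [math.log(leftmost_piece(1.0, 50, rng)) for _ in range(5000)]
mean_log = sum(logs) / len(logs)
std_log = math.sqrt(sum((v - mean_log) ** 2 for v in logs) / len(logs))
```

For a uniform fraction, ln(p) has mean -1 and variance 1, so after 50 breaks the log-length should have mean near -50 and standard deviation near sqrt(50).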
54
Philosophically, why is DGX so popular?
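  • 'rich get richer' leads to lognormals
  • $C(t) = C(t-1) \, (1 + a + \mathrm{noise}(t))$
  • $\ln C(t) = \ln C(0) + \sum_t \ln(1 + a + \mathrm{noise}(t))$
  • $\ln C(t) \approx \ln C(0) + a t + \sum_t \mathrm{noise}(t)$

using $\ln(1+x) \approx x$

[Figure: two plots against time; one y-axis labeled log()]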
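The approximation on this slide can be checked numerically. The Python sketch below iterates the multiplicative update and compares the exact log of the final capital with the linearized value. The growth rate, the uniform noise model, and the starting capital are assumptions chosen for illustration.

```python
import math
import random

def simulate_capital(c0, a, noise_width, steps, seed=0):
    """Iterate C(t) = C(t-1) * (1 + a + noise(t)), noise uniform in [-w, w]."""
    rng = random.Random(seed)
    c, noise_sum = c0, 0.0
    for _ in range(steps):
        noise = rng.uniform(-noise_width, noise_width)
        c *= 1.0 + a + noise
        noise_sum += noise
    return c, noise_sum

C0, A, STEPS = 100.0, 0.01, 200
c_final, noise_sum = simulate_capital(C0, A, noise_width=0.02, steps=STEPS)
exact_log = math.log(c_final)
# Linearized form: ln C(t) ~ ln C(0) + a*t + sum of the noise terms.
approx_log = math.log(C0) + A * STEPS + noise_sum
```

The two values stay close when a and the noise are small. The linearized value is always the larger one, since ln(1+x) <= x.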
55
Usefulness for HP projects?
  • disk traffic (bytes per unit time could be
    lognormal/DGX; '80-20')
  • ditto for web traffic (image-file sizes:
    lognormal)
  • feature extraction ((m,s) for printers of type
    A, (m,s) for type B; compare)

56
Code resources
  • zb26,christos_at_cs.cmu.edu
  • full paper: Bi, Faloutsos and Korn, KDD 2001
    (runner-up for best paper award)
  • Kotz, Johnson and Balakrishnan: Continuous
    Univariate Distributions