Transcript and Presenter's Notes

Title: Data Discretization Unification


1
Data Discretization Unification
  • Ruoming Jin
  • Yuri Breitbart
  • Chibuike Muoh
  • Kent State University, Kent, USA

2
Outline
  • Motivation
  • Problem Statement
  • Prior Art & Our Contributions
  • Goodness Function Definition
  • Unification
  • Parameterized Discretization
  • Discretization Algorithm & Its Performance

3
Motivation
Patients Table

Age   Success   Failure   Total
18        10         1      19
25         5         2       7
41       100         5     105
...      ...       ...     ...
51       250        10     260
52       360         5     365
53       249        10     259
...      ...       ...     ...
4
Motivation
  • A possible implication of the table:
  • If a person is between 18 and 25, the probability
    of procedure success is much higher than if the
    person is between 45 and 55.
  • Is that a good rule, or is this one better: if a
    person is between 18 and 30, the probability of
    procedure success is much higher than if the
    person is between 46 and 61?
  • What is the best interval?

5
Motivation
  • Without data discretization, some rules would be
    difficult to establish.
  • Several existing data mining systems cannot
    handle continuous variables without
    discretization.
  • Data discretization significantly improves the
    quality of the discovered knowledge.
  • New methods of discretization are needed for
    tables with rare events.
  • Data discretization significantly improves the
    performance of data mining algorithms. Some
    studies report a tenfold increase in
    performance.
  • However, any discretization process generally
    leads to a loss of information. Minimizing such a
    loss is the mark of a good discretization method.

6
Problem Statement
Given an input table
Intervals    Class 1   Class 2   ...   Class J   Row Sum
S1           r11       r12       ...   r1J       N1
S2           r21       r22       ...   r2J       N2
...          ...       ...       ...   ...       ...
SI           rI1       rI2       ...   rIJ       NI
Column Sum   M1        M2        ...   MJ        N (Total)
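In symbols, the table layout fixes the notation used throughout the deck (an editorial restatement of the table above, assuming the usual contingency-table conventions):

    N_i = \sum_{j=1}^{J} r_{ij}, \qquad
    M_j = \sum_{i=1}^{I} r_{ij}, \qquad
    N   = \sum_{i=1}^{I} N_i = \sum_{j=1}^{J} M_j,

where r_{ij} is the number of records in interval S_i with class label j.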
7
Problem Statement
Obtain an output table
Intervals    Class 1   Class 2   ...   Class J   Row Sum
S1           C11       C12       ...   C1J       N1
S2           C21       C22       ...   C2J       N2
...          ...       ...       ...   ...       ...
SI           CI1       CI2       ...   CIJ       NI
Column Sum   M1        M2        ...   MJ        N (Total)
where each Si is a union of k consecutive intervals of the
input table. The quality of a discretization is measured by

cost(model) = cost(data/model) + penalty(model)
8
Prior Art
  • Unsupervised Discretization: no class
    information is provided
  • Equal-width
  • Equal-frequency
  • Supervised Discretization: class information is
    provided with each attribute value
  • MDLP
  • Pearson's X² or Wilks' G² statistic-based
    methods
  • Dougherty and Kohavi (1995) compare unsupervised methods,
    the supervised methods of Holte (1993), and the
    entropy-based methods of Fayyad and Irani (1993), and
    conclude that supervised methods produce fewer
    classification errors than unsupervised ones,
    and that supervised methods based on entropy are
    better than other supervised methods.

9
Prior Art
  • Several recent (2003-2006) papers have
    introduced new discretization algorithms: Yang
    and Webb; Kurgan and Cios (CAIM); Boulle
    (Khiops).
  • CAIM attempts to minimize the number of
    discretization intervals and, at the same time, to
    minimize the information loss.
  • Khiops uses Pearson's X² statistic, merging the
    consecutive intervals that minimize the
    value of X².
  • Yang and Webb studied discretization for naïve
    Bayesian classifiers. They report that their
    method generates fewer
    classification errors than the alternative
    discretization methods that have appeared in the
    literature.

10
Our Results
  • There is a strong connection between
    discretization methods based on statistics and
    those based on entropy.
  • There is a parametric function such that any prior
    discretization method is derivable from this
    function by choosing at most two parameters.
  • There is an optimal dynamic programming method,
    derived from our discretization approach,
    that outperforms prior discretization
    methods in most of the experiments we conducted.

11
Goodness Function Definition (Preliminaries)
Intervals    Class 1   Class 2   ...   Class J   Row Sum
S1           C11       C12       ...   C1J       N1
S2           C21       C22       ...   C2J       N2
...          ...       ...       ...   ...       ...
SI           CI1       CI2       ...   CIJ       NI
Column Sum   M1        M2        ...   MJ        N (Total)
12
Goodness Function Definition (Preliminaries)
Entropy of the i-th row of a contingency table.
Total entropy of all intervals (entropy of a
contingency table).
  • Binary encoding of a row requires H(Si) binary
    characters.
  • Binary encoding of a set of rows requires H(S1,
    S2, ..., SI) binary characters.
  • Binary encoding of a table requires SL binary
    characters, where

    N·H(S1, S2, ..., SI) = SL
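The formula images did not survive the transcript. A reconstruction of the standard definitions these bullets refer to, assuming the usual weighted-row-entropy convention and the notation of the table above, is:

    H(S_i) = -\sum_{j=1}^{J} \frac{C_{ij}}{N_i} \log \frac{C_{ij}}{N_i}

    H(S_1, \dots, S_I) = \sum_{i=1}^{I} \frac{N_i}{N} \, H(S_i)

    SL = \sum_{i=1}^{I} N_i \, H(S_i) = N \cdot H(S_1, \dots, S_I)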
13
Goodness Function Definition
  • Cost(Model, T) = Cost(T/Model) + Penalty(Model)
    (Mannila et al.)
  • Cost(T/Model) is the complexity of the table
    encoding in the given model.
  • Penalty(Model) reflects the complexity of the
    resulting table.
  • GF_Model(T) = Cost(Model, T0) - Cost(Model, T)

14
Goodness Function Definition: Models To Be Considered
  • MDLP (Information Theory)
  • Statistical Model Selection
  • Confidence Level of Rows' Independence
  • Gini Index

15
Goodness Function Examples
  • Entropy
  • Statistical: Akaike (AIC)
  • Statistical: Bayesian (BIC)
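The formulas for these examples were images and are not in the transcript. For orientation only, the standard criteria that the AIC and BIC examples are named after are given below; the deck's exact normalization (visible later as the penalties df and df·(log N)/2 on slide 21) may differ:

    \mathrm{AIC} = 2K - 2\ln L, \qquad \mathrm{BIC} = K\ln N - 2\ln L,

where K is the number of free parameters and L the maximized likelihood.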


16
Goodness Function Examples (Cont)
  • Confidence-level based goodness functions
  • Pearson's X² statistic
  • where
  • Wilks' G² statistic
  • The table's degrees of freedom: df = (I-1)(J-1)
  • The distribution functions for these statistics are
  • It is known in statistics that asymptotically both
    Pearson's X² and Wilks' G² statistics have a chi-square
    distribution with df degrees of freedom.
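The statistic definitions themselves were images. In the notation of the table on slide 11, the standard forms (a reconstruction, not copied from the slides) are:

    E_{ij} = \frac{N_i M_j}{N}, \qquad
    X^2 = \sum_{i=1}^{I}\sum_{j=1}^{J} \frac{(C_{ij} - E_{ij})^2}{E_{ij}}, \qquad
    G^2 = 2 \sum_{i=1}^{I}\sum_{j=1}^{J} C_{ij} \log \frac{C_{ij}}{E_{ij}},

with df = (I-1)(J-1) degrees of freedom.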

17
Unification
  • The following relationship holds between G² and the
    goodness functions for MDLP, AIC, and BIC:

    G²/2 = N·H(S1 ∪ ... ∪ SI) - SL

  • Thus, the goodness functions for MDLP, AIC, and
    BIC can be rewritten as follows
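A short check of this identity (an editorial expansion, using the definitions from slide 12 and natural logarithms):

    N\,H(S_1\cup\dots\cup S_I) - SL
      = -\sum_{j} M_j \log\frac{M_j}{N} + \sum_{i,j} C_{ij}\log\frac{C_{ij}}{N_i}
      = \sum_{i,j} C_{ij}\log\frac{C_{ij}\,N}{N_i\,M_j}
      = \frac{G^2}{2}.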

18
Unification
  • A normal deviate is the difference between the
    variable and the mean, divided by the standard deviation.
  • Consider t, a chi-square distributed random
    variable, and let u(t) be the normal deviate such that
    the following equation holds.
  • Let u(t) be a normal deviate function. The
    following theorem holds (see Wallace 1959, 1960):
  • For all t > df, all df > 0.37, and with
    w(t) = [t - df - df·log(t/df)]^(1/2),

    0 < w(t) ≤ u(t) ≤ w(t) + 0.6·df^(-1/2)

19
Unification
  • From this theorem it follows that if df goes to
    infinity and w(t) >> 0, then
    u(t)/w(t) → 1.
  • Finally, u²(t) ≈ w²(t) = t - df - df·log(t/df), and

    GF_G²(T) = u²(G²) ≈ G² - df·(1 + log(G²/df)),

  • and the goodness function GF_X²(T) is
    asymptotically the same.

20
Unification
  • G² estimate:
  • If G² > df, then G² < 2N·log J. This follows from the
    upper bound log J on H(S1 ∪ ... ∪ SI) and the lower
    bound 0 on the entropy of a specific row of the
    contingency table.
  • Recall that u²(G²) ≈ G² - df·(1 + log(G²/df)).
  • Thus, the penalty term of u²(G²) is between O(df) and
    O(df·log N):
  • If G² ≈ c·df and c > 1, then the penalty is O(df).
  • If G² ≥ c·df but N/df ≈ N/(I·J) ≤ c, then the penalty is
    also O(df).
  • If G² ≥ c·df, N → ∞, and N/(I·J) → ∞, then the
    penalty is O(df·log N).
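The bound in the first bullet follows directly from the identity on slide 17 (an editorial expansion, using the same logarithm base throughout):

    \frac{G^2}{2} = N\,H(S_1\cup\dots\cup S_I) - SL \le N\log J - 0,
    \qquad\text{hence}\qquad G^2 \le 2N\log J.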

21
Unification
  • GF_MDLP ≈ G² - O(df) or G² - O(df·log N),
    depending on the ratio N/I
  • GF_AIC = G² - df
  • GF_BIC = G² - df·(log N)/2
  • GF_G² = G² - df·(either a constant or log N,
    depending on the ratio between N and I and J)
  • In general, GF = G² - df·f(N, I, J), as illustrated
    in the sketch below.
  • To unify the Gini function as one of the cost
    functions, we resort to the parametric approach to
    goodness of discretization.
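As a concrete illustration of the formulas above, here is a minimal Python sketch (an editorial addition, not the authors' code) that computes G² for a contingency table and evaluates the AIC-, BIC-, and G²-based goodness functions exactly as written on this slide and on slide 19; natural logarithms are assumed.

import math

def g2_statistic(table):
    """Wilks' G^2 statistic for a contingency table given as rows of class counts."""
    I, J = len(table), len(table[0])
    row_sums = [sum(row) for row in table]        # N_i
    col_sums = [sum(col) for col in zip(*table)]  # M_j
    N = sum(row_sums)
    g2 = 0.0
    for i in range(I):
        for j in range(J):
            c = table[i][j]
            if c > 0:
                expected = row_sums[i] * col_sums[j] / N   # E_ij = N_i * M_j / N
                g2 += 2.0 * c * math.log(c / expected)
    return g2, (I - 1) * (J - 1), N

def goodness_functions(table):
    """GF_AIC, GF_BIC and GF_G2 as written on slides 19 and 21."""
    g2, df, N = g2_statistic(table)
    return {
        "G2": g2,
        "df": df,
        "GF_AIC": g2 - df,
        "GF_BIC": g2 - df * math.log(N) / 2.0,
        "GF_G2": g2 - df * (1.0 + math.log(g2 / df)) if g2 > df else float("nan"),
    }

# Example: the success/failure rows of the Patients table from slide 3.
if __name__ == "__main__":
    patients = [[10, 1], [5, 2], [100, 5], [250, 10], [360, 5], [249, 10]]
    print(goodness_functions(patients))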

22
Gini-Based Goodness Function
  • Let Si be a row of a contingency table. Gini
    index on Si is defined as follows
  • Cost(Gini, T)
  • Gini Goodness Function
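The definitions on this slide were images. The standard Gini index of a row, in the table's notation, and a cost term built from it in the same way as the entropy cost SL, would be (a reconstruction and an assumption, respectively, not copied from the slides):

    Gini(S_i) = 1 - \sum_{j=1}^{J} \Big(\frac{C_{ij}}{N_i}\Big)^{2},
    \qquad
    Cost(Gini, T) = \sum_{i=1}^{I} N_i \, Gini(S_i).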

23
Parametrized Discretization
  • Parametrized Entropy
  • Entropy of the row
  • Gini of the row
  • Parametrized Data Cost
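The parametrized formulas here were images. One standard parametrization that contains both the row entropy and the row Gini index as special cases is the Havrda-Charvát (generalized) entropy; it is sketched below as an assumption about what the lost formula expressed, with natural logarithms:

    H_\beta(S_i) = \sum_{j=1}^{J} \frac{C_{ij}}{N_i} \cdot
                   \frac{1 - (C_{ij}/N_i)^{\beta}}{\beta},
    \qquad
    \lim_{\beta \to 0} H_\beta(S_i) = H(S_i), \qquad H_1(S_i) = Gini(S_i).

Under this reading, the parametrized data cost would presumably mirror SL, i.e. \sum_i N_i \, H_\beta(S_i).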

24
Parametrized Discretization
  • Parametrized Cost of T0
  • Parametrized Goodness Function

25
Parameters for Known Goodness Functions
26
Parametrized Dynamic Programming Algorithm
27
Dynamic Programming Algorithm
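The algorithm itself appears in the deck only as an image. The following is a minimal sketch (an editorial illustration, not the authors' code) of how a dynamic program over merge points might look, assuming that the cost of a discretization decomposes into a per-interval data cost N_i·H(S_i) plus a constant penalty per interval; the parametrized version would thread the β parameter through the row-entropy term.

import math

def row_entropy(counts):
    """Shannon entropy (natural log) of one merged interval's class counts."""
    n = sum(counts)
    return -sum((c / n) * math.log(c / n) for c in counts if c > 0)

def interval_cost(counts, penalty_per_interval):
    """Data cost N_i * H(S_i) plus a penalty for introducing one more interval."""
    return sum(counts) * row_entropy(counts) + penalty_per_interval

def optimal_discretization(rows, penalty_per_interval):
    """
    rows: per-value class-count vectors, in attribute order.
    Returns (minimal total cost, cut positions), where a cut at position k
    means rows[:k] and rows[k:] belong to different intervals.
    Classic O(I^2) interval-partition DP:
        best[k] = min over j < k of best[j] + cost(rows[j:k] merged).
    """
    I = len(rows)
    best = [0.0] + [math.inf] * I
    back = [0] * (I + 1)
    for k in range(1, I + 1):
        merged = [0] * len(rows[0])
        for j in range(k - 1, -1, -1):                 # grow the last interval leftwards
            merged = [m + c for m, c in zip(merged, rows[j])]
            cand = best[j] + interval_cost(merged, penalty_per_interval)
            if cand < best[k]:
                best[k], back[k] = cand, j
    cuts, k = [], I
    while k > 0:                                       # recover the chosen cut points
        k = back[k]
        if k > 0:
            cuts.append(k)
    return best[I], sorted(cuts)

# Example: merge the six age rows of the Patients table from slide 3.
if __name__ == "__main__":
    rows = [[10, 1], [5, 2], [100, 5], [250, 10], [360, 5], [249, 10]]
    print(optimal_discretization(rows, penalty_per_interval=5.0))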
28
Experiments
29
(No Transcript)
30
(No Transcript)
31
Classification errors for Glass dataset (Naïve
Bayesian)
32
Iris: C4.5
33
Experiments: C4.5 Validation
34
Experiments: Naïve Bayesian Validation
35
Conclusions
  • We considered several seemingly different
    approaches to discretization and demonstrated
    that they can be unified through a notion
    of generalized entropy.
  • Each of the methods discussed in the
    literature can be derived from generalized
    entropy by selecting at most two parameters.
  • For a given set of two parameters, the dynamic
    programming algorithm selects an optimal
    discretization (in terms of the corresponding
    goodness function).

36
What Remains To be Done
  • How can a relationship between the goodness
    function (in terms of the model) and the number of
    classification errors be found analytically?
  • What is the algorithm for selecting the best
    parameters for a given data set?