1
Minimum Description Length (MDL) principle and
its applications to pattern classification
  • Henry Tirri
  • Complex Systems Computation Group
  • Department of Computer Science
  • University of Helsinki
  • http://www.cs.Helsinki.fi/research/cosco/

Pluralitas non est ponenda sine necessitate
("Plurality should not be posited without necessity") - William of Ockham (1285-1349)
2
Outline
  • Past - The Legend of MDL
  • Present - An answer to a decade-old question
  • Future - Rovers on Mars

3
The Legend of MDL
4
On Modeling
Do you believe that the data-generating
mechanism really is in your model class M?
5
non-M-closed predictive inference
  • Explicitly include prediction (and intervention)
    in modeling

Models are a means (a language) to describe
interesting properties of the phenomenon to be
studied, but they are not intrinsic to the
phenomenon itself.
6
Descriptive complexity
  • 1000-bit strings
  • 000100010001000100010001 .. 00010001
  • 011101001101000010101010 .. 10101110
  • 111001111110100110111111 .. 01111011
  • Solomonoff-Kolmogorov-Chaitin complexity
  • shortest possible encoding with the help of L
  • code based on a universal computer language L
  • too strong a description language -
    uncomputability
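The contrast above can be made concrete with a general-purpose compressor, which gives only a crude upper bound on descriptive complexity. The following small sketch (not from the slides; Python assumed) compares a regular 1000-bit string with a pseudo-random one:

    import random
    import zlib

    # A highly regular 1000-bit string and a pseudo-random one of the same length.
    regular = "0001" * 250
    random.seed(0)
    noisy = "".join(random.choice("01") for _ in range(1000))

    # zlib is only an upper bound on descriptive complexity, but the gap is clear:
    # the repeating pattern compresses to a couple of dozen bytes, while the
    # pseudo-random string needs several times more (roughly one bit per symbol).
    print(len(zlib.compress(regular.encode())))
    print(len(zlib.compress(noisy.encode())))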

7
The idea
  • a good model M(q) captures regular features
    (constraints) of the observed data
  • any set of regularities we find reduces our uncertainty about the data D, and we can use it to encode the data in a shorter and less redundant way (the encoded data D′ is much shorter than D, |D′| << |D|)
  • thus the purpose of modeling is to find models
    that allow short encodings of the data D, i.e.
    compress the data

8
Stochastic complexity: a scaled-down Kolmogorov complexity
The stochastic complexity of the data set D with respect to the model class M is the shortest code length of D obtainable when the encoding is done with the help of the class M (e.g., Rissanen 1989)
9
Codes are probability distributions
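The formula behind this slide is not preserved in the transcript; the standard correspondence the title refers to (via the Kraft inequality, code lengths in bits) can be sketched as

\[
L_C(D) \;=\; -\log_2 P_C(D), \qquad P_C(D) \;=\; 2^{-L_C(D)},
\]

so every prefix code C defines a probability distribution P_C and, up to rounding, every distribution defines a code with these lengths.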
10
Maximum Likelihood = Minimal Code Length
  • For classes of models regular enough, there
    exists a maximum likelihood estimator for any
    data set of any length n
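The accompanying formula is missing from the transcript; a sketch of the point is

\[
\hat{\theta}(D) \;=\; \arg\max_{\theta} P(D \mid \theta)
\;=\; \arg\min_{\theta}\bigl[-\log P(D \mid \theta)\bigr],
\]

so the maximum likelihood estimate also gives the shortest code length, -log P(D | θ̂(D)), attainable with any single member of the class.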

11
Minimum Description Length principle
  • MDL says that we must code our data using some fixed code, which compresses all data sets that are well modeled by M
  • ML gives optimal compression only for some D
  • In general we cannot have a single code C whose code length matches the ML code length for every D
  • However, we can have a code C whose code length exceeds the ML code length by a constant K_n that is the same for all D of size n
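The two formulas referred to in the last bullets are not in the transcript; a standard formulation (a sketch) is

\[
\text{no single code } C \text{ satisfies } L_C(D) = -\log P(D \mid \hat{\theta}(D)) \text{ for all } D,
\]
\[
\text{but there is a code } C \text{ with } L_C(D) = -\log P(D \mid \hat{\theta}(D)) + K_n \text{ for all } D \text{ of size } n.
\]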

12
Formal stochastic complexity
  • Let C be the code for which K_n is minimal (this minimal value denoted by K). The stochastic complexity of D with respect to the model class M is then the code length of D under this code
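The defining formula is lost in the transcript; a sketch consistent with the preceding slide is

\[
\mathrm{SC}(D \mid M) \;=\; L_C(D) \;=\; -\log P(D \mid \hat{\theta}(D)) \;+\; K,
\]

where C is the code achieving the smallest possible constant K_n = K; in Rissanen's later normalized maximum likelihood form, K = \log \sum_{D' : |D'| = n} P(D' \mid \hat{\theta}(D')).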

13
How to compute SC(D | M)?
  • For most reasonably regular model classes the first term is easy to compute; computing K is very difficult
  • For sufficiently regular classes we can approximate K(k), where k is the number of parameters, well with the large-sample formula below
  • Rissanen (1989) calls this (confusingly) the MDL model selection criterion; it works well only with large data sets
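The approximation formula itself is missing from the transcript; the usual large-sample form, which is also the BIC-like criterion the last bullet refers to, is (as a sketch)

\[
K(k) \;\approx\; \frac{k}{2}\,\log n,
\qquad\text{so}\qquad
\mathrm{SC}(D \mid M_k) \;\approx\; -\log P(D \mid \hat{\theta}(D)) \;+\; \frac{k}{2}\,\log n .
\]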

14
What we usually think of as MDL is a two-part code approximation
  • Coding proceeds in two phases: first encode the model θ, then encode D using the code corresponding to θ; therefore we have a code length of L(θ) + L(D | θ)
  • Now we have to decide the encoding accuracy of each parameter θ, and search for the θ and precision d that minimize L(θ) + L(D | θ)
  • Thus SC(D | M_k) ≈ L(D | θ) + L(θ)
  • Can be computed for very general cases
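As a concrete illustration of the search over θ and the precision d, here is a minimal sketch for a Bernoulli model class; the function name, the parameter grid, and the candidate precisions are illustrative assumptions, not from the slides:

    import math

    def two_part_code_length(data, grid_size):
        """Two-part code length (in bits) for binary data under a Bernoulli model:
        L(theta) names one of grid_size quantized parameter values,
        L(D | theta) is the ideal code length of the data given that value."""
        n, ones = len(data), sum(data)
        l_theta = math.log2(grid_size)            # bits to identify the grid point
        best = float("inf")
        for i in range(1, grid_size + 1):
            theta = i / (grid_size + 1)           # grid avoids theta = 0 and 1
            l_data = -(ones * math.log2(theta) + (n - ones) * math.log2(1 - theta))
            best = min(best, l_theta + l_data)
        return best

    data = [1, 0, 0, 0] * 25                      # 100 binary observations
    # Coarser grids make L(theta) cheap but fit worse; finer grids fit better but
    # spend more bits on the parameter. Keep the precision with the shortest total.
    lengths = {g: two_part_code_length(data, g) for g in (2, 4, 8, 16, 32, 64, 128)}
    print(min(lengths.items(), key=lambda kv: kv[1]))

In practice the optimal precision per parameter behaves roughly like 1/√n, which is where the (k/2) log n term on the previous slide comes from.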

15
SC as an average
  • Now, remembering that SC(D | M) = L_C(D), we can map C to a probability distribution P such that for all D, -log P(D) = SC(D | M)
  • Assume M is finite. Then we can define a new probability distribution P_av as a mixture over the class
  • Here π(θ) (the prior) is a probability distribution introduced for normalization; if π(θ) is uniform, then P_av is a very good approximation of P
  • For the continuous case π(θ) is not uniform (but the Jeffreys prior)
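The definition of P_av is missing from the transcript; for a finite class the intended mixture is presumably (a sketch)

\[
P_{av}(D) \;=\; \sum_{\theta} \pi(\theta)\, P(D \mid \theta).
\]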

16
Two-part SC vs. Bayesian MAP model
  • Bayes' theorem
  • Minimum encoding (MML / two-part MDL)
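The formulas are not preserved in the transcript; the correspondence the slide title points to can be sketched as

\[
\arg\max_{\theta} P(\theta \mid D)
\;=\; \arg\max_{\theta} P(D \mid \theta)\,P(\theta)
\;=\; \arg\min_{\theta} \bigl[ -\log P(D \mid \theta) - \log P(\theta) \bigr]
\;=\; \arg\min_{\theta} \bigl[ L(D \mid \theta) + L(\theta) \bigr],
\]

so the MAP model under a prior P(θ) is exactly the minimizer of a two-part code length with L(θ) = -log P(θ).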

17
Average SC vs. Bayesian evidence
  • From the law of total probability, P(D | M) in Bayes' theorem (usually called the evidence or marginal likelihood) is a mixture over the parameters
  • This coincides with P_av: the M with the highest evidence is the M with the lowest SC
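The formula for the evidence is missing from the transcript; by the law of total probability it is (a sketch)

\[
P(D \mid M) \;=\; \sum_{\theta} P(D \mid \theta, M)\,\pi(\theta)
\qquad\text{or}\qquad
\int P(D \mid \theta, M)\,\pi(\theta)\, d\theta
\]

in the finite and continuous cases respectively, which is exactly P_av(D) from the earlier slide.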

18
Predicting with SC and two part codes
  • prediction with stochastic complexity requires calculation of the predictive density
  • prediction with two-part codes is done with the best parameter values (the model θ) or using a weighted combination of good models (selected model averaging)
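As a sketch of what the predictive density calculation involves, using the code/probability correspondence from the earlier slides (code lengths in bits assumed):

\[
P(x_{n+1} \mid D, M) \;=\; \frac{P(D, x_{n+1} \mid M)}{P(D \mid M)}
\;=\; 2^{\,\mathrm{SC}(D \mid M) \,-\, \mathrm{SC}(D \cup \{x_{n+1}\} \mid M)}.
\]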

19
An answer to a decade-old question
How do we calculate the error term in the two-part MDL for classification problems - do we calculate the number of bits needed to describe all the vectors classified incorrectly, or only the indices? (a poor grad student asking a professor, 1989)
20
Codes are probability distributions
21
What do we gain? (a lot!)
Barron & Cover, Yamanishi: MDL converges to the closest model in M
Grünwald: for certain loss functions MDL produces a safe predictor - it will give a good estimate of its prediction performance in M-complete cases
22
Nice, but does it really work?
  • classification as discrimination (supervised)
  • medical and chemical diagnosis from feature vectors
  • classification as clustering (unsupervised)
  • mineral clustering on Mars in the future Mars missions (2003-)

23
Pathfinder Rover 1997
24
Mars Surveyor Rover 2001 (2003)