Title: Minimum Description Length (MDL) principle and its applications to pattern classification
1. Minimum Description Length (MDL) principle and its applications to pattern classification
- Henry Tirri
- Complex Systems Computation Group
- Department of Computer Science
- University of Helsinki
- http://www.cs.helsinki.fi/research/cosco/
Pluralitas non est ponenda sine necessitate.
("Plurality must not be posited without necessity.")
William of Ockham (1285-1349)
2. Outline
- Past - The Legend of MDL
- Present - An answer to a decade old question
- Future - Rovers on Mars
3. The Legend of MDL
4. On Modeling
Do you believe that the data generating
mechanism really is in your model class M?
5. Non-M-closed predictive inference
- Explicitly include prediction (and intervention)
in modeling
Models are a means (a language) to describe
interesting properties of the phenomenon to be
studied, but they are not intrinsic to the
phenomenon itself.
6. Descriptive complexity
- 1000-bit strings
- 000100010001000100010001 .. 00010001
- 011101001101000010101010 .. 10101110
- 111001111110100110111111 .. 01111011
- Solomonoff-Kolmogorov-Chaitin complexity
- the shortest possible encoding with the help of L
- a code based on a universal computer language L
- L is too strong a description language: the resulting complexity is uncomputable
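The contrast between the regular and the irregular 1000-bit strings above can be made concrete with an off-the-shelf compressor. This is only an illustrative sketch: zlib stands in for the universal description language L, and its compressed size merely upper-bounds the true descriptive complexity.

```python
import random
import zlib

# A highly regular 1000-bit string: "0001" repeated 250 times.
regular = "0001" * 250

# A pseudo-random-looking 1000-bit string (fixed seed for reproducibility).
rng = random.Random(42)
irregular = "".join(rng.choice("01") for _ in range(1000))

# Compressed size upper-bounds descriptive complexity under zlib's "language".
size_regular = len(zlib.compress(regular.encode()))
size_irregular = len(zlib.compress(irregular.encode()))

print(size_regular, size_irregular)
```

The regular string compresses to a small fraction of its length, while the pseudo-random one resists compression, mirroring the slide's point that regularity is what makes short descriptions possible.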
7. The idea
- a good model M(θ) captures regular features (constraints) of the observed data
- any set of regularities we find reduces our uncertainty about the data D, and we can use it to encode the data in a shorter and less redundant way (|D′| << |D|)
- thus the purpose of modeling is to find models that allow short encodings of the data D, i.e., compress the data
8. Stochastic complexity: a scaled-down Kolmogorov complexity
The stochastic complexity of the data set D with
respect to the model class M is the shortest code
length of D obtainable when the encoding is done
with the help of class M (e.g., Rissanen 1989)
9. Codes are probability distributions
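The formula on this slide did not survive extraction; the standard correspondence the title refers to (via the Kraft inequality, a reconstruction rather than the deck's own notation) is:

```latex
% Every (complete) prefix code C induces a probability distribution P,
% and conversely every P induces a code, such that
L_C(D) = -\log_2 P(D), \qquad P(D) = 2^{-L_C(D)}.
% Hence minimizing code length is equivalent to maximizing probability.
```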
10. Maximum likelihood = minimal code length
- For classes of models regular enough, there
exists a maximum likelihood estimator for any
data set of any length n
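The slide's formula is not shown; the usual statement (a reconstruction) is that, for fixed D, the code length within M is minimized by the maximum likelihood estimate:

```latex
\hat{\theta}(D) = \arg\max_{\theta} P(D \mid \theta)
               = \arg\min_{\theta} \bigl[ -\log_2 P(D \mid \theta) \bigr],
% i.e. the ML estimator yields the shortest code length for D within M.
```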
11. Minimum Description Length principle
- MDL says that we must code our data using some fixed code, which compresses all data sets that are well modeled by M
- ML gives optimal compression only for some D
- In general we cannot have a single code C s.t.
- However, we can have a code C s.t. (here Kn is taken over all D of size n)
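The two formulas on this slide are missing from the extraction. A plausible reconstruction, with θ̂(D) the ML estimate for D: no single code can match the ML code length for every data set, but a code can minimize the worst-case excess length (regret):

```latex
% In general there is no single code C with
L_C(D) = -\log_2 P(D \mid \hat{\theta}(D)) \quad \text{for all } D;
% but there is a code C minimizing the worst-case regret K_n,
% the maximum taken over all data sets D of size n:
\min_C \; \max_{D : |D| = n}
  \Bigl[ L_C(D) - \bigl( -\log_2 P(D \mid \hat{\theta}(D)) \bigr) \Bigr].
```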
12. Formal stochastic complexity
- Let C be the code for which Kn is minimal (denoted by K). Now the stochastic complexity of D with respect to model class M is
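The defining formula is missing here. Under the worst-case-regret reading of the previous slide, the minimizing code is the normalized maximum likelihood code, so a plausible reconstruction is:

```latex
SC(D \mid M)
 = -\log_2 \frac{P(D \mid \hat{\theta}(D))}
                {\sum_{D' : |D'| = n} P(D' \mid \hat{\theta}(D'))}
 = \underbrace{-\log_2 P(D \mid \hat{\theta}(D))}_{\text{first term}}
   + \underbrace{\log_2 \!\! \sum_{D' : |D'| = n} \!\! P(D' \mid \hat{\theta}(D'))}_{K}.
```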
13. How to compute SC(D|M)?
- For most reasonably regular model classes the first term is easy to compute; computing K is very difficult
- For sufficiently regular classes we can approximate K(k), where k is the number of parameters, well with
- Rissanen (1989) calls the first term (confusingly) the MDL Model Selection Criterion; it works well only with large data sets
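The approximation formula itself is missing from the slide; the standard asymptotic expansion Rissanen (1989) refers to, up to O(1) terms, is:

```latex
SC(D \mid M_k) \approx -\log_2 P(D \mid \hat{\theta}(D)) + \frac{k}{2}\log_2 n.
% First term: the ML code length (easy); second term: the approximation of K(k),
% valid only asymptotically, hence the large-data-set caveat above.
```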
14. What we usually think of as MDL is a two-part-code approximation
- Coding proceeds in two phases: first encode the model θ, then encode D using the code corresponding to θ; therefore we have a code length of L(θ) + L(D|θ)
- Now we have to decide the encoding accuracy of each parameter θ, and search for the θ and precision d that minimize L(θ) + L(D|θ)
- Thus SC(D|Mk) ≈ L(D|θ) + L(θ)
- Can be computed for very general cases
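A minimal sketch of such a two-part code on a toy model class. The setup is an illustrative assumption, not from the slides: Bernoulli models with the success probability quantized to precision 1/√n, compared against a zero-parameter fair-coin model.

```python
import math

def two_part_code_length(data: str, k: int) -> float:
    """Two-part MDL code length L(theta) + L(D|theta) in bits.

    k = 0: fixed fair-coin model (nothing to encode for the model part).
    k = 1: Bernoulli(theta), theta quantized to precision 1/sqrt(n).
    """
    n = len(data)
    ones = data.count("1")
    if k == 0:
        return n * 1.0  # -log2 of (1/2)^n: one bit per symbol
    # L(theta): encoding theta to precision 1/sqrt(n) costs ~(1/2) log2 n bits.
    model_bits = 0.5 * math.log2(n)
    # L(D|theta): code length under the (clamped) ML estimate.
    theta = min(max(ones / n, 1 / n), 1 - 1 / n)
    data_bits = -(ones * math.log2(theta) + (n - ones) * math.log2(1 - theta))
    return model_bits + data_bits

# Deterministic toy data: 900 ones, 100 zeros.
data = "1" * 900 + "0" * 100
fair = two_part_code_length(data, 0)
biased = two_part_code_length(data, 1)
print(fair, biased)
```

On this skewed data the one-parameter model more than repays its extra (1/2) log2 n model bits, so the two-part criterion prefers it; on near-balanced data the fair-coin model would win.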
15. SC as an average
- Now, remembering that SC(D|M) = LC(D), we can map C to a probability distribution P s.t. for all D, -log P(D) = SC(D|M)
- Assume M is finite. Then we can define a new probability distribution
- Here π(θ) (the prior) is a probability distribution introduced for normalization; if π(θ) is uniform, then Pav is a very good approximation of P
- For the continuous case π(θ) is not uniform (but the Jeffreys prior)
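The definition of the new distribution is missing from the slide; for a finite M it presumably reads:

```latex
P_{av}(D) = \sum_{\theta \in M} \pi(\theta)\, P(D \mid \theta),
% with \pi(\theta) a prior over the finitely many models,
% introduced so that P_{av} sums to one.
```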
16. Two-part SC vs. the Bayesian MAP model
- Bayes' theorem
- Minimum encoding (MML / two-part MDL)
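Both formulas are missing; the standard correspondence they refer to (a reconstruction) is that taking -log2 of Bayes' theorem turns MAP estimation into two-part code-length minimization:

```latex
P(\theta \mid D) = \frac{P(D \mid \theta)\,\pi(\theta)}{P(D)}
\;\Longrightarrow\;
-\log_2 P(\theta \mid D)
  = \underbrace{-\log_2 P(D \mid \theta)}_{L(D \mid \theta)}
  + \underbrace{\bigl(-\log_2 \pi(\theta)\bigr)}_{L(\theta)}
  + \log_2 P(D).
% P(D) is constant in \theta, so maximizing the posterior
% is the same as minimizing L(\theta) + L(D \mid \theta).
```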
17. Average SC vs. Bayesian evidence
- By the law of total probability, P(D|M) in Bayes' theorem (usually called the evidence or marginal likelihood) is
- This coincides with Pav: the M with the highest evidence is the M with the lowest SC
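A small numeric check of this correspondence on a toy Bernoulli class (the grid, prior, and data are illustrative assumptions): the mixture (evidence) code length is never longer than the best two-part code length, since the mixture dominates each weighted term.

```python
import math

# Finite model class: Bernoulli(theta) on a uniform grid, with a uniform prior.
thetas = [i / 20 for i in range(1, 20)]   # 0.05, 0.10, ..., 0.95
prior = 1.0 / len(thetas)

data = "1" * 14 + "0" * 6                 # deterministic toy data: 14 ones, 6 zeros
ones, zeros = data.count("1"), data.count("0")

def log2_likelihood(theta):
    return ones * math.log2(theta) + zeros * math.log2(1 - theta)

# Evidence (marginal likelihood): P(D|M) = sum_theta prior * P(D|theta).
evidence = sum(prior * 2 ** log2_likelihood(t) for t in thetas)
mixture_bits = -math.log2(evidence)       # average-SC code length

# Best two-part code: L(theta) + L(D|theta), with L(theta) = log2 |grid|.
two_part_bits = min(math.log2(len(thetas)) - log2_likelihood(t) for t in thetas)

print(mixture_bits, two_part_bits)
```

The gap between the two lengths is exactly the price of committing to a single θ instead of averaging, which is why the evidence ranking and the SC ranking of model classes agree.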
18. Predicting with SC and two-part codes
- prediction with stochastic complexity requires calculation of the predictive density
- prediction with two-part codes is done with the best parameter values (model θ), or using a weighted combination of good models (selected model averaging)
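The predictive density formula is not shown; for a finite M it would read (a reconstruction):

```latex
P(x \mid D, M) = \sum_{\theta \in M} P(x \mid \theta)\, P(\theta \mid D),
% versus the two-part / plug-in prediction P(x \mid \hat{\theta}(D)).
```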
19. An answer to a decade-old question
How do we calculate the error term in the two-part MDL for classification problems: do we calculate the number of bits needed to describe all the vectors classified incorrectly, or only their indices? (a poor grad student asking a professor, 1989)
20. Codes are probability distributions
21. What do we gain? (A lot!)
- Barron, Cover, Yamanishi: MDL converges to the closest model in M
- Grünwald: for certain loss functions MDL produces a safe predictor, one that gives a good estimate of its prediction performance in M-complete cases
22. Nice, but does it really work?
- classification as discrimination (supervised)
- medical/chemical diagnosis from feature vectors
- classification as clustering (unsupervised)
- mineral clustering on Mars in the future Mars missions (2003-)
23. Pathfinder Rover, 1997
24. Mars Surveyor Rover, 2001 (2003)