Title: Minimum Description Length (MDL) principle and its applications to pattern classification
1. Minimum Description Length (MDL) principle and its applications to pattern classification
- Henry Tirri
- Complex Systems Computation Group
- Department of Computer Science
- University of Helsinki
- http://www.cs.helsinki.fi/research/cosco/
Pluralitas non est ponenda sine necessitate.
("Plurality must not be posited without necessity.")
William of Ockham (1285-1349)
2. Outline
- Past - The Legend of MDL
- Present - An answer to a decade old question
- Future - Rovers on Mars
3. The Legend of MDL
4. On Modeling
Do you believe that the data generating
mechanism really is in your model class M?
5. Non-M-closed predictive inference
- Explicitly include prediction (and intervention)
in modeling
Models are a means (a language) to describe
interesting properties of the phenomenon to be
studied, but they are not intrinsic to the
phenomenon itself.
6. Descriptive complexity
- 1000-bit strings
- 000100010001000100010001 .. 00010001
- 011101001101000010101010 .. 10101110
- 111001111110100110111111 .. 01111011
- Solomonoff-Kolmogorov-Chaitin complexity
- the shortest possible encoding with the help of L
- a code based on a universal computer language L
- L is too strong a description language: the resulting complexity is uncomputable
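The contrast between the regular and the irregular 1000-bit strings above can be made concrete with an off-the-shelf compressor. This is only an illustrative sketch: zlib stands in for the universal description language L, and its compressed size merely upper-bounds the true descriptive complexity.

```python
import random
import zlib

# A highly regular 1000-bit string: "0001" repeated 250 times.
regular = "0001" * 250

# A pseudo-random-looking 1000-bit string (fixed seed for reproducibility).
rng = random.Random(42)
irregular = "".join(rng.choice("01") for _ in range(1000))

# Compressed size upper-bounds descriptive complexity under zlib's "language".
size_regular = len(zlib.compress(regular.encode()))
size_irregular = len(zlib.compress(irregular.encode()))

print(size_regular, size_irregular)
```

The regular string compresses to a small fraction of its length, while the pseudo-random one resists compression, mirroring the slide's point that regularity is what makes short descriptions possible.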
7. The idea
- a good model M(θ) captures regular features (constraints) of the observed data
- any set of regularities we find reduces our uncertainty about the data D, and we can use it to encode the data in a shorter and less redundant way (|D′| << |D|)
- thus the purpose of modeling is to find models that allow short encodings of the data D, i.e., compress the data
8. Stochastic complexity: a scaled-down Kolmogorov complexity
The stochastic complexity of the data set D with
respect to the model class M is the shortest code
length of D obtainable when the encoding is done
with the help of class M (e.g., Rissanen 1989)
9. Codes are probability distributions
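The formula on this slide did not survive extraction; the standard correspondence the title refers to (via the Kraft inequality, a reconstruction rather than the deck's own notation) is:

```latex
% Every (complete) prefix code C induces a probability distribution P,
% and conversely every P induces a code, such that
L_C(D) = -\log_2 P(D), \qquad P(D) = 2^{-L_C(D)}.
% Hence minimizing code length is equivalent to maximizing probability.
```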
10. Maximum likelihood = minimal code length
- For classes of models regular enough, there
exists a maximum likelihood estimator for any
data set of any length n
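The slide's formula is not shown; the usual statement (a reconstruction) is that, for fixed D, the code length within M is minimized by the maximum likelihood estimate:

```latex
\hat{\theta}(D) = \arg\max_{\theta} P(D \mid \theta)
               = \arg\min_{\theta} \bigl[ -\log_2 P(D \mid \theta) \bigr],
% i.e. the ML estimator yields the shortest code length for D within M.
```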
11. Minimum Description Length principle
- MDL says that we must code our data using some fixed code, which compresses all data sets that are well modeled by M
- ML gives optimal compression only for some D
- In general we cannot have a single code C s.t.
- However, we can have a code C s.t. (here Kn is taken over all D of size n)
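The two formulas on this slide are missing from the extraction. A plausible reconstruction, with θ̂(D) the ML estimate for D: no single code can match the ML code length for every data set, but a code can minimize the worst-case excess length (regret):

```latex
% In general there is no single code C with
L_C(D) = -\log_2 P(D \mid \hat{\theta}(D)) \quad \text{for all } D;
% but there is a code C minimizing the worst-case regret K_n,
% the maximum taken over all data sets D of size n:
\min_C \; \max_{D : |D| = n}
  \Bigl[ L_C(D) - \bigl( -\log_2 P(D \mid \hat{\theta}(D)) \bigr) \Bigr].
```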
12. Formal stochastic complexity
- Let C be the code for which Kn is minimal (denoted by K). Now the stochastic complexity of D with respect to model class M is
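The defining formula is missing here. Under the worst-case-regret reading of the previous slide, the minimizing code is the normalized maximum likelihood code, so a plausible reconstruction is:

```latex
SC(D \mid M)
 = -\log_2 \frac{P(D \mid \hat{\theta}(D))}
                {\sum_{D' : |D'| = n} P(D' \mid \hat{\theta}(D'))}
 = \underbrace{-\log_2 P(D \mid \hat{\theta}(D))}_{\text{first term}}
   + \underbrace{\log_2 \!\! \sum_{D' : |D'| = n} \!\! P(D' \mid \hat{\theta}(D'))}_{K}.
```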
13. How to compute SC(D|M)?
- For most reasonably regular model classes the first term is easy to compute; computing K is very difficult
- For sufficiently regular classes we can approximate K(k), where k is the number of parameters, well with
- Rissanen (1989) calls the first term (confusingly) the MDL Model Selection Criterion; it works well only with large data sets
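The approximation formula itself is missing from the slide; the standard asymptotic expansion Rissanen (1989) refers to, up to O(1) terms, is:

```latex
SC(D \mid M_k) \approx -\log_2 P(D \mid \hat{\theta}(D)) + \frac{k}{2}\log_2 n.
% First term: the ML code length (easy); second term: the approximation of K(k),
% valid only asymptotically, hence the large-data-set caveat above.
```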
14. What we usually think of as MDL is a two-part-code approximation
- Coding proceeds in two phases: first encode the model θ, then encode D using the code corresponding to θ; therefore we have a code length of L(θ) + L(D|θ)
- Now we have to decide the encoding accuracy of each parameter θ, and search for the θ and precision d that minimize L(θ) + L(D|θ)
- Thus SC(D|Mk) ≈ L(D|θ) + L(θ)
- Can be computed for very general cases
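A minimal sketch of such a two-part code on a toy model class. The setup is an illustrative assumption, not from the slides: Bernoulli models with the success probability quantized to precision 1/√n, compared against a zero-parameter fair-coin model.

```python
import math

def two_part_code_length(data: str, k: int) -> float:
    """Two-part MDL code length L(theta) + L(D|theta) in bits.

    k = 0: fixed fair-coin model (nothing to encode for the model part).
    k = 1: Bernoulli(theta), theta quantized to precision 1/sqrt(n).
    """
    n = len(data)
    ones = data.count("1")
    if k == 0:
        return n * 1.0  # -log2 of (1/2)^n: one bit per symbol
    # L(theta): encoding theta to precision 1/sqrt(n) costs ~(1/2) log2 n bits.
    model_bits = 0.5 * math.log2(n)
    # L(D|theta): code length under the (clamped) ML estimate.
    theta = min(max(ones / n, 1 / n), 1 - 1 / n)
    data_bits = -(ones * math.log2(theta) + (n - ones) * math.log2(1 - theta))
    return model_bits + data_bits

# Deterministic toy data: 900 ones, 100 zeros.
data = "1" * 900 + "0" * 100
fair = two_part_code_length(data, 0)
biased = two_part_code_length(data, 1)
print(fair, biased)
```

On this skewed data the one-parameter model more than repays its extra (1/2) log2 n model bits, so the two-part criterion prefers it; on near-balanced data the fair-coin model would win.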
15. SC as an average
- Now, remembering that SC(D|M) = LC(D), we can map C to a probability distribution P s.t. for all D, -log P(D) = SC(D|M)
- Assume M is finite. Then we can define a new probability distribution
- Here π(θ) (the prior) is a probability distribution introduced for normalization; if π(θ) is uniform, then Pav is a very good approximation of P
- For the continuous case π(θ) is not uniform (but the Jeffreys prior)
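The definition of the new distribution is missing from the slide; for a finite M it presumably reads:

```latex
P_{av}(D) = \sum_{\theta \in M} \pi(\theta)\, P(D \mid \theta),
% with \pi(\theta) a prior over the finitely many models,
% introduced so that P_{av} sums to one.
```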
16. Two-part SC vs. the Bayesian MAP model
- Bayes' theorem
- Minimum encoding (MML / two-part MDL)
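Both formulas are missing; the standard correspondence they refer to (a reconstruction) is that taking -log2 of Bayes' theorem turns MAP estimation into two-part code-length minimization:

```latex
P(\theta \mid D) = \frac{P(D \mid \theta)\,\pi(\theta)}{P(D)}
\;\Longrightarrow\;
-\log_2 P(\theta \mid D)
  = \underbrace{-\log_2 P(D \mid \theta)}_{L(D \mid \theta)}
  + \underbrace{\bigl(-\log_2 \pi(\theta)\bigr)}_{L(\theta)}
  + \log_2 P(D).
% P(D) is constant in \theta, so maximizing the posterior
% is the same as minimizing L(\theta) + L(D \mid \theta).
```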
17. Average SC vs. Bayesian evidence
- By the law of total probability, P(D|M) in Bayes' theorem (usually called the evidence or marginal likelihood) is
- This coincides with Pav: the M with the highest evidence is the M with the lowest SC
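A small numeric check of this correspondence on a toy Bernoulli class (the grid, prior, and data are illustrative assumptions): the mixture (evidence) code length is never longer than the best two-part code length, since the mixture dominates each weighted term.

```python
import math

# Finite model class: Bernoulli(theta) on a uniform grid, with a uniform prior.
thetas = [i / 20 for i in range(1, 20)]   # 0.05, 0.10, ..., 0.95
prior = 1.0 / len(thetas)

data = "1" * 14 + "0" * 6                 # deterministic toy data: 14 ones, 6 zeros
ones, zeros = data.count("1"), data.count("0")

def log2_likelihood(theta):
    return ones * math.log2(theta) + zeros * math.log2(1 - theta)

# Evidence (marginal likelihood): P(D|M) = sum_theta prior * P(D|theta).
evidence = sum(prior * 2 ** log2_likelihood(t) for t in thetas)
mixture_bits = -math.log2(evidence)       # average-SC code length

# Best two-part code: L(theta) + L(D|theta), with L(theta) = log2 |grid|.
two_part_bits = min(math.log2(len(thetas)) - log2_likelihood(t) for t in thetas)

print(mixture_bits, two_part_bits)
```

The gap between the two lengths is exactly the price of committing to a single θ instead of averaging, which is why the evidence ranking and the SC ranking of model classes agree.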
18. Predicting with SC and two-part codes
- prediction with stochastic complexity requires calculation of the predictive density
- prediction with two-part codes is done with the best parameter values (model θ), or using a weighted combination of good models (selected model averaging)
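The predictive density formula is not shown; for a finite M it would read (a reconstruction):

```latex
P(x \mid D, M) = \sum_{\theta \in M} P(x \mid \theta)\, P(\theta \mid D),
% versus the two-part / plug-in prediction P(x \mid \hat{\theta}(D)).
```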
19. An answer to a decade-old question
How do we calculate the error term in the two-part MDL for classification problems: do we calculate the number of bits needed to describe all the vectors classified incorrectly, or only their indices? (a poor grad student asking a professor, 1989)
20. Codes are probability distributions
21. What do we gain? (A lot!)
- Barron, Cover, Yamanishi: MDL converges to the closest model in M
- Grünwald: for certain loss functions MDL produces a safe predictor, one that gives a good estimate of its prediction performance in M-complete cases
22. Nice, but does it really work?
- classification as discrimination (supervised)
- medical/chemical diagnosis from feature vectors
- classification as clustering (unsupervised)
- mineral clustering on Mars in the future Mars missions (2003-)
23. Pathfinder Rover, 1997
24. Mars Surveyor Rover, 2001 (2003)