1
Random Forests for Language Modeling
  • Peng Xu and Frederick Jelinek
  • IPAM January 24, 2006

2
What Is a Language Model?
  • A probability distribution over word sequences
  • Based on conditional probability distributions: the probability of a word given its history (past words); the chain-rule factorization is written out below
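For reference, this is the standard chain-rule factorization behind that statement (a textbook identity, not quoted from the slides):

    P(w_1, \ldots, w_N) = \prod_{i=1}^{N} P(w_i \mid w_1, \ldots, w_{i-1})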

3
What Is a Language Model for?
  • Speech recognition

4
n-gram Language Models
  • A simple yet powerful solution to LM
  • (n-1) items in history: an n-gram model
  • Maximum Likelihood (ML) estimate (standard form written out below)
  • Sparseness problem: training and test mismatch; most n-grams are never seen, hence the need for smoothing
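For reference (standard definitions, not quoted from the slides): the n-gram approximation truncates the history to the last (n-1) words, and the ML estimate comes from relative counts c(\cdot):

    P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-n+1}, \ldots, w_{i-1}), \qquad
    P_{ML}(w_i \mid w_{i-n+1}^{i-1}) = \frac{c(w_{i-n+1}^{i})}{c(w_{i-n+1}^{i-1})}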

5
Sparseness Problem
  • Example: UPenn Treebank portion of WSJ; 1 million words of training data, 82 thousand words of test data, 10-thousand-word open vocabulary

n-gram order      3       4       5       6
unseen (%)      54.5    75.4    83.1    86.0

  • Sparseness makes language modeling a difficult regression problem: an n-gram model needs at least V^n words to cover all n-grams (see the back-of-the-envelope figure below)
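To make that concrete (a back-of-the-envelope figure assuming the 10-thousand-word vocabulary above, not a number from the slides):

    |V|^3 = (10^4)^3 = 10^{12} \text{ possible trigrams} \gg 10^6 \text{ training words}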

6
More Data
  • More data → a solution to data sparseness?
  • The web has everything, but web data is noisy.
  • The web does NOT have everything: language models using web data still have a data sparseness problem.
  • [Zhu & Rosenfeld, 2001]: in 24 random web news sentences, 46 out of 453 trigrams were not covered by Altavista.
  • In-domain training data is not always easy to get.

7
Dealing With Sparseness in n-gram
  • Smoothing: take some probability mass from seen n-grams and distribute it among unseen n-grams
  • Interpolated Kneser-Ney: consistently the best performance [Chen & Goodman, 1998] (common bigram form below)
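For reference, the common bigram form of interpolated Kneser-Ney (a standard formulation following Chen & Goodman, not spelled out on the slide), with discount D and continuation counts N_{1+}:

    P_{KN}(w_i \mid w_{i-1}) = \frac{\max\{c(w_{i-1} w_i) - D,\, 0\}}{c(w_{i-1})} + \lambda(w_{i-1})\, \frac{N_{1+}(\bullet\, w_i)}{N_{1+}(\bullet\, \bullet)}, \qquad
    \lambda(w_{i-1}) = \frac{D}{c(w_{i-1})}\, N_{1+}(w_{i-1}\, \bullet)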

8
Our Approach
  • Extend the appealing idea of clustering histories via decision trees.
  • Overcome problems in decision tree construction by using Random Forests!

9
Decision Tree Language Models
  • Decision trees: equivalence classification of histories
  • Each leaf is specified by the answers to a series of questions (posed to the history) which lead from the root to that leaf.
  • Each leaf corresponds to a subset of the histories; thus the histories are partitioned (i.e., classified). A minimal lookup sketch follows below.
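A minimal Python sketch (not the authors' code) of the lookup just described: a history is routed from the root to a leaf by answering set-membership questions about its word positions, and the leaf is its equivalence class. The tree, word sets, and leaf ids below are made up for illustration.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Node:
        # Internal node: ask "is history[position] in word_set?"; leaves carry an id.
        position: Optional[int] = None        # which history position the question asks about
        word_set: Optional[frozenset] = None
        yes: Optional["Node"] = None
        no: Optional["Node"] = None
        leaf_id: Optional[int] = None         # set only at leaves

    def classify(root: Node, history: tuple) -> int:
        """Return the equivalence class (leaf id) for a history such as (w_-2, w_-1)."""
        node = root
        while node.leaf_id is None:
            node = node.yes if history[node.position] in node.word_set else node.no
        return node.leaf_id

    # Tiny hand-built tree over trigram histories (w_-2, w_-1):
    leaf0, leaf1, leaf2 = Node(leaf_id=0), Node(leaf_id=1), Node(leaf_id=2)
    root = Node(position=1, word_set=frozenset({"the", "a"}),
                yes=Node(position=0, word_set=frozenset({"in", "on"}), yes=leaf0, no=leaf1),
                no=leaf2)

    print(classify(root, ("in", "the")))   # -> 0
    print(classify(root, ("dog", "ran")))  # -> 2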

10
Decision Tree Language Models: An Example
[Figure: a small decision tree built from the training data aba, aca, bcb, bbb, ada, using the questions "Is the first word in {a}?" and "Is the first word in {b}?". New test events bdb and adb can be classified; the new test event cba gets stuck, since neither question applies.]
11
Decision Tree Language Models: An Example
  • Example: trigrams (w_-2, w_-1, w_0)
  • Questions about positions: "Is w_-i ∈ S?" and "Is w_-i ∈ S^c?" There are two history positions for a trigram.
  • Each pair (S, S^c) defines a possible split of a node, and therefore of the training data.
  • S and S^c are complements with respect to the training data.
  • A node gets less data than its ancestors.
  • (S, S^c) are obtained by an exchange algorithm (sketched below).
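A rough Python sketch (not the authors' implementation) of such an exchange algorithm: words seen at the chosen history position are greedily moved between S and S^c whenever the move increases the training log-likelihood of the resulting two-way split, starting from a random partition. All names and the toy data are illustrative.

    import math
    import random
    from collections import Counter, defaultdict

    def split_loglik(counts_yes, counts_no):
        """Training log-likelihood of a split, with ML estimates in each child."""
        ll = 0.0
        for counts in (counts_yes, counts_no):
            total = sum(counts.values())
            for c in counts.values():
                ll += c * math.log(c / total)
        return ll

    def exchange(events, position, seed=0):
        """events: list of (history, next_word) pairs at this node.
        position: which history slot the question asks about.
        Returns a set S defining the question 'is history[position] in S?'."""
        rng = random.Random(seed)
        # Counts of predicted words for each value of the chosen history position.
        by_value = defaultdict(Counter)
        for history, w in events:
            by_value[history[position]][w] += 1

        def child_counts(part):
            yes, no = Counter(), Counter()
            for val, cnts in by_value.items():
                (yes if val in part else no).update(cnts)
            return yes, no

        values = list(by_value)
        S = {v for v in values if rng.random() < 0.5}   # random initialization
        improved = True
        while improved:
            improved = False
            for v in values:
                S_new = S - {v} if v in S else S | {v}  # tentative move of one word
                if split_loglik(*child_counts(S_new)) > split_loglik(*child_counts(S)) + 1e-12:
                    S, improved = S_new, True
        return S

    # Toy trigram events ((w_-2, w_-1), w_0) at some node:
    events = [(("a", "b"), "a"), (("a", "c"), "a"), (("b", "c"), "b"),
              (("b", "b"), "b"), (("a", "d"), "a")]
    print(exchange(events, position=0))  # one side of the partition, e.g. {'a'}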

12
Construction of Decision Trees
  • Data-driven: decision trees are constructed on the basis of training data
  • The construction requires:
  • The set of possible questions
  • A criterion evaluating the desirability of questions
  • A construction stopping rule or post-pruning rule

13
Construction of Decision Trees: Our Approach
  • Grow a decision tree to maximum depth using training data
  • Use training-data likelihood to evaluate questions (criterion written out below)
  • Perform no smoothing during growing
  • Prune the fully grown decision tree to maximize heldout-data likelihood
  • Incorporate KN smoothing during pruning
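One standard way to write the training-likelihood criterion mentioned above (a sketch consistent with the bullet, not a formula quoted from the slides): for a node with counts c(w), split into children with counts c_L(w) and c_R(w) and scored with ML estimates within each child, the gain of a candidate question is

    \Delta L = \sum_w c_L(w) \log \frac{c_L(w)}{c_L} + \sum_w c_R(w) \log \frac{c_R(w)}{c_R} - \sum_w c(w) \log \frac{c(w)}{c},
    \qquad c_L = \sum_w c_L(w),\; c_R = \sum_w c_R(w),\; c = c_L + c_R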

14
Smoothing Decision Trees
  • Using ideas similar to interpolated Kneser-Ney smoothing (one possible form is written out below)
  • Note:
  • Not all histories in one node are smoothed in the same way.
  • Only leaves are used as equivalence classes.
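One plausible way to write the leaf-level interpolation (an assumed form in the spirit of interpolated Kneser-Ney; the slide does not give the exact equation): with Φ(h) the leaf reached by history h, D a discount, and a KN lower-order distribution,

    P(w \mid \Phi(h)) = \frac{\max\{c(\Phi(h), w) - D,\, 0\}}{c(\Phi(h))} + \lambda(\Phi(h))\, P_{KN}(w \mid w_{-1})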

15
Problems with Decision Trees
  • Training data fragmentation:
  • As the tree is developed, questions are selected on the basis of less and less data.
  • Lack of optimality:
  • The exchange algorithm is greedy.
  • So is the tree-growing algorithm.
  • Overtraining and undertraining:
  • Deep trees fit the training data well but do not generalize well to new test data.
  • Shallow trees are not sufficiently refined.

16
Amelioration: Random Forests
  • Breiman applied the idea of random forests to relatively small problems [Breiman, 2001].
  • Using different random samples of the data and randomly chosen subsets of questions, construct K decision trees.
  • Apply a test datum x to all K decision trees.
  • They produce classes y_1, y_2, ..., y_K.
  • Accept the plurality decision (a minimal voting sketch follows below).
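A minimal Python sketch of the plurality decision (illustrative only; each "tree" here is just a callable returning a class label):

    from collections import Counter

    def forest_classify(trees, x):
        """Classify x with every tree and accept the plurality (majority) vote."""
        votes = Counter(tree(x) for tree in trees)
        return votes.most_common(1)[0][0]

    # Example with three toy 'trees':
    trees = [lambda x: "a", lambda x: "a", lambda x: "b"]
    print(forest_classify(trees, x=None))  # -> 'a'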

17
Example of a Random Forest
[Figure: three decision trees T1, T2, T3 applied to the same example x; a plurality of the leaves reached output class a.]
An example x will be classified as a according to this random forest.
18
Random Forests for Language Modeling
  • Two kinds of randomness:
  • Selection of the history positions to ask about
  • Alternatives: position 1, position 2, or the better of the two
  • Random initialization of the exchange algorithm
  • 100 decision trees: the i-th tree estimates P_DT^(i)(w_0 | w_-2, w_-1)
  • The final estimate is the average over all trees (written out below)
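Written out, with M = 100 trees:

    P_{RF}(w_0 \mid w_{-2}, w_{-1}) = \frac{1}{M} \sum_{i=1}^{M} P_{DT}^{(i)}(w_0 \mid w_{-2}, w_{-1})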

19
Experiments
  • Perplexity (PPL) as the evaluation measure (standard definition below)
  • UPenn Treebank portion of WSJ: about 1 million words for training and heldout (90/10 split), 82 thousand words for test
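For reference, the standard definition of perplexity over a test set of N words:

    \mathrm{PPL} = \exp\!\left(-\frac{1}{N} \sum_{i=1}^{N} \log P(w_i \mid w_1, \ldots, w_{i-1})\right)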

20
Experiments: trigram
  • Baseline: KN-trigram
  • No randomization: DT-trigram
  • 100 random DTs: RF-trigram

Model         Heldout PPL   Heldout gain (%)   Test PPL   Test gain (%)
KN-trigram        160.1             -             145.0         -
DT-trigram        158.6            0.9            163.3       -12.6
RF-trigram        126.8           20.8            129.7        10.5
21
Experiments: Aggregating
  • Considerable improvement already with 10 trees!

22
Experiments: Analysis
  • When is a test event "seen"?
  • KN-trigram: the trigram occurs in the training data
  • DT-trigram: the event is seen in the training data (via its equivalence class)

Analyze test data events by the number of times they are seen across the 100 DTs.
23
Experiments: Stability
The PPL results of different realizations vary, but the differences are small.
24
Experiments: Aggregation vs. Interpolation
25
Experiments: Aggregation vs. Interpolation
26
Experiments: Higher-Order n-gram Models
  • Baseline: KN n-gram
  • 100 random DTs: RF n-gram

Test PPL by n-gram order:
n-gram order      3       4       5       6
KN              145.0   140.0   138.8   138.6
RF              129.7   126.4   126.0   126.3
27
Applying Random Forests to Other Models: SLM
  • Structured Language Model (SLM) [Chelba & Jelinek, 2000]
  • Approximation: use tree triples

Test PPL:
        SLM
KN     137.9
RF     122.8
28
Speech Recognition Experiments (I)
  • Word Error Rate (WER) by N-best rescoring
  • WSJ text: 20 or 40 million words of training data
  • WSJ DARPA'93 HUB1 test data: 213 utterances, 3,446 words
  • N-best rescoring baseline WER is 13.7%
  • N-best lists were generated by a trigram baseline using Katz back-off smoothing.
  • The baseline trigram used 40 million words for training.
  • The oracle error rate is around 6%.

29
Speech Recognition Experiments (I)
  • Baseline: KN smoothing
  • 100 random DTs for the RF 3-gram
  • 100 random DTs for the PREDICTOR in the SLM
  • Approximation in the SLM

WER (%):
            3-gram (20M)   3-gram (40M)   SLM (20M)
KN              14.0           13.0          12.8
RF              12.9           12.4          11.9
p-value        <0.001          <0.05        <0.001
30
Speech Recognition Experiments (II)
  • Word Error Rate by lattice rescoring
  • IBM 2004 Conversational Telephony System for Rich Transcription: 1st place in the RT-04 evaluation
  • Fisher data: 22 million words
  • WEB data: 525 million words, collected using frequent Fisher n-grams as queries
  • Other data: Switchboard, Broadcast News, etc.
  • Lattice language model: 4-gram with interpolated Kneser-Ney smoothing, pruned to 3.2 million unique n-grams; WER is 14.4%
  • Test set: DEV04, 37,834 words

31
Speech Recognition Experiments (II)
  • Baseline: KN 4-gram
  • 110 random DTs for the EB-RF 4-gram
  • Sampling data without replacement
  • The Fisher and WEB models are interpolated

WER (%):
            Fisher 4-gram   WEB 4-gram   Fisher+WEB 4-gram
KN               14.1           15.2            13.7
RF               13.5           15.0            13.1
p-value         <0.001           -             <0.001
32
Practical Limitations of the RF Approach
  • Memory:
  • Decision tree construction uses much more memory.
  • Little performance gain when training data is
    really large.
  • Because we have 100 trees, the final model
    becomes too large to fit into memory.
  • Effective language model compression or pruning
    remains an open question.

33
Conclusions Random Forests
  • A new RF language modeling approach
  • A more general LM: RF → DT → n-gram
  • Randomized history clustering
  • Good generalization: better n-gram coverage, less biased toward the training data
  • An extension of Breiman's random forests to the data sparseness problem

34
Conclusions Random Forests
  • Improvements in perplexity and/or word error rate over interpolated Kneser-Ney smoothing for different models:
  • n-gram (up to n = 6)
  • Class-based trigram
  • Structured Language Model
  • Significant improvements in the best performing large-vocabulary conversational telephony speech recognition system