1
Random Forests for Language Modeling
  • Peng Xu and Frederick Jelinek
  • IPAM January 24, 2006

2
What Is a Language Model?
  • A probability distribution over word sequences
  • Based on conditional probability distributions: the probability of a word given its history (past words); the chain-rule factorization is written out below
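For reference, this is the standard chain-rule factorization behind that statement (a textbook identity, not quoted from the slides):

    P(w_1, \ldots, w_N) = \prod_{i=1}^{N} P(w_i \mid w_1, \ldots, w_{i-1})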

3
What Is a Language Model for?
  • Speech recognition

4
n-gram Language Models
  • A simple yet powerful solution to LM
  • (n-1) items in history: an n-gram model
  • Maximum Likelihood (ML) estimate (standard form written out below)
  • Sparseness problem: training and test mismatch; most n-grams are never seen, hence the need for smoothing
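For reference (standard definitions, not quoted from the slides): the n-gram approximation truncates the history to the last (n-1) words, and the ML estimate comes from relative counts c(\cdot):

    P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-n+1}, \ldots, w_{i-1}), \qquad
    P_{ML}(w_i \mid w_{i-n+1}^{i-1}) = \frac{c(w_{i-n+1}^{i})}{c(w_{i-n+1}^{i-1})}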

5
Sparseness Problem
  • Example: UPenn Treebank portion of WSJ; 1 million words of training data, 82 thousand words of test data, 10-thousand-word open vocabulary

n-gram order      3       4       5       6
unseen (%)      54.5    75.4    83.1    86.0

  • Sparseness makes language modeling a difficult regression problem: an n-gram model needs at least V^n words to cover all n-grams (see the back-of-the-envelope figure below)
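To make that concrete (a back-of-the-envelope figure assuming the 10-thousand-word vocabulary above, not a number from the slides):

    |V|^3 = (10^4)^3 = 10^{12} \text{ possible trigrams} \gg 10^6 \text{ training words}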

6
More Data
  • More data → a solution to data sparseness?
  • The web has everything, but web data is noisy.
  • The web does NOT have everything: language models using web data still have a data sparseness problem.
  • [Zhu & Rosenfeld, 2001]: in 24 random web news sentences, 46 out of 453 trigrams were not covered by Altavista.
  • In-domain training data is not always easy to get.

7
Dealing With Sparseness in n-gram
  • Smoothing: take some probability mass from seen n-grams and distribute it among unseen n-grams
  • Interpolated Kneser-Ney: consistently the best performance [Chen & Goodman, 1998] (common bigram form below)
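For reference, the common bigram form of interpolated Kneser-Ney (a standard formulation following Chen & Goodman, not spelled out on the slide), with discount D and continuation counts N_{1+}:

    P_{KN}(w_i \mid w_{i-1}) = \frac{\max\{c(w_{i-1} w_i) - D,\, 0\}}{c(w_{i-1})} + \lambda(w_{i-1})\, \frac{N_{1+}(\bullet\, w_i)}{N_{1+}(\bullet\, \bullet)}, \qquad
    \lambda(w_{i-1}) = \frac{D}{c(w_{i-1})}\, N_{1+}(w_{i-1}\, \bullet)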

8
Our Approach
  • Extend the appealing idea of clustering histories via decision trees.
  • Overcome problems in decision tree construction by using Random Forests!

9
Decision Tree Language Models
  • Decision trees: equivalence classification of histories
  • Each leaf is specified by the answers to a series of questions (posed to the history) which lead from the root to that leaf.
  • Each leaf corresponds to a subset of the histories; thus the histories are partitioned (i.e., classified). A minimal lookup sketch follows below.
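A minimal Python sketch (not the authors' code) of the lookup just described: a history is routed from the root to a leaf by answering set-membership questions about its word positions, and the leaf is its equivalence class. The tree, word sets, and leaf ids below are made up for illustration.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Node:
        # Internal node: ask "is history[position] in word_set?"; leaves carry an id.
        position: Optional[int] = None        # which history position the question asks about
        word_set: Optional[frozenset] = None
        yes: Optional["Node"] = None
        no: Optional["Node"] = None
        leaf_id: Optional[int] = None         # set only at leaves

    def classify(root: Node, history: tuple) -> int:
        """Return the equivalence class (leaf id) for a history such as (w_-2, w_-1)."""
        node = root
        while node.leaf_id is None:
            node = node.yes if history[node.position] in node.word_set else node.no
        return node.leaf_id

    # Tiny hand-built tree over trigram histories (w_-2, w_-1):
    leaf0, leaf1, leaf2 = Node(leaf_id=0), Node(leaf_id=1), Node(leaf_id=2)
    root = Node(position=1, word_set=frozenset({"the", "a"}),
                yes=Node(position=0, word_set=frozenset({"in", "on"}), yes=leaf0, no=leaf1),
                no=leaf2)

    print(classify(root, ("in", "the")))   # -> 0
    print(classify(root, ("dog", "ran")))  # -> 2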

10
Decision Tree Language Models: An Example
[Figure: a small decision tree built from the training data aba, aca, bcb, bbb, ada, using the questions "Is the first word in {a}?" and "Is the first word in {b}?". New test events bdb and adb can be classified; the new test event cba gets stuck, since neither question applies.]
11
Decision Tree Language Models: An Example
  • Example: trigrams (w_-2, w_-1, w_0)
  • Questions about positions: "Is w_-i ∈ S?" and "Is w_-i ∈ S^c?" There are two history positions for a trigram.
  • Each pair (S, S^c) defines a possible split of a node, and therefore of the training data.
  • S and S^c are complements with respect to the training data.
  • A node gets less data than its ancestors.
  • (S, S^c) are obtained by an exchange algorithm (sketched below).
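A rough Python sketch (not the authors' implementation) of such an exchange algorithm: words seen at the chosen history position are greedily moved between S and S^c whenever the move increases the training log-likelihood of the resulting two-way split, starting from a random partition. All names and the toy data are illustrative.

    import math
    import random
    from collections import Counter, defaultdict

    def split_loglik(counts_yes, counts_no):
        """Training log-likelihood of a split, with ML estimates in each child."""
        ll = 0.0
        for counts in (counts_yes, counts_no):
            total = sum(counts.values())
            for c in counts.values():
                ll += c * math.log(c / total)
        return ll

    def exchange(events, position, seed=0):
        """events: list of (history, next_word) pairs at this node.
        position: which history slot the question asks about.
        Returns a set S defining the question 'is history[position] in S?'."""
        rng = random.Random(seed)
        # Counts of predicted words for each value of the chosen history position.
        by_value = defaultdict(Counter)
        for history, w in events:
            by_value[history[position]][w] += 1

        def child_counts(part):
            yes, no = Counter(), Counter()
            for val, cnts in by_value.items():
                (yes if val in part else no).update(cnts)
            return yes, no

        values = list(by_value)
        S = {v for v in values if rng.random() < 0.5}   # random initialization
        improved = True
        while improved:
            improved = False
            for v in values:
                S_new = S - {v} if v in S else S | {v}  # tentative move of one word
                if split_loglik(*child_counts(S_new)) > split_loglik(*child_counts(S)) + 1e-12:
                    S, improved = S_new, True
        return S

    # Toy trigram events ((w_-2, w_-1), w_0) at some node:
    events = [(("a", "b"), "a"), (("a", "c"), "a"), (("b", "c"), "b"),
              (("b", "b"), "b"), (("a", "d"), "a")]
    print(exchange(events, position=0))  # one side of the partition, e.g. {'a'}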

12
Construction of Decision Trees
  • Data-driven: decision trees are constructed on the basis of training data
  • The construction requires:
  • The set of possible questions
  • A criterion evaluating the desirability of questions
  • A construction stopping rule or post-pruning rule

13
Construction of Decision Trees: Our Approach
  • Grow a decision tree to maximum depth using training data
  • Use training-data likelihood to evaluate questions (criterion written out below)
  • Perform no smoothing during growing
  • Prune the fully grown decision tree to maximize heldout-data likelihood
  • Incorporate KN smoothing during pruning
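One standard way to write the training-likelihood criterion mentioned above (a sketch consistent with the bullet, not a formula quoted from the slides): for a node with counts c(w), split into children with counts c_L(w) and c_R(w) and scored with ML estimates within each child, the gain of a candidate question is

    \Delta L = \sum_w c_L(w) \log \frac{c_L(w)}{c_L} + \sum_w c_R(w) \log \frac{c_R(w)}{c_R} - \sum_w c(w) \log \frac{c(w)}{c},
    \qquad c_L = \sum_w c_L(w),\; c_R = \sum_w c_R(w),\; c = c_L + c_R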

14
Smoothing Decision Trees
  • Using ideas similar to interpolated Kneser-Ney smoothing (one possible form is written out below)
  • Note:
  • Not all histories in one node are smoothed in the same way.
  • Only leaves are used as equivalence classes.
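One plausible way to write the leaf-level interpolation (an assumed form in the spirit of interpolated Kneser-Ney; the slide does not give the exact equation): with Φ(h) the leaf reached by history h, D a discount, and a KN lower-order distribution,

    P(w \mid \Phi(h)) = \frac{\max\{c(\Phi(h), w) - D,\, 0\}}{c(\Phi(h))} + \lambda(\Phi(h))\, P_{KN}(w \mid w_{-1})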

15
Problems with Decision Trees
  • Training data fragmentation:
  • As the tree is developed, questions are selected on the basis of less and less data.
  • Lack of optimality:
  • The exchange algorithm is greedy.
  • So is the tree-growing algorithm.
  • Overtraining and undertraining:
  • Deep trees fit the training data well but do not generalize well to new test data.
  • Shallow trees are not sufficiently refined.

16
Amelioration: Random Forests
  • Breiman applied the idea of random forests to relatively small problems [Breiman, 2001].
  • Using different random samples of the data and randomly chosen subsets of questions, construct K decision trees.
  • Apply a test datum x to all K decision trees.
  • They produce classes y_1, y_2, ..., y_K.
  • Accept the plurality decision (a minimal voting sketch follows below).
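A minimal Python sketch of the plurality decision (illustrative only; each "tree" here is just a callable returning a class label):

    from collections import Counter

    def forest_classify(trees, x):
        """Classify x with every tree and accept the plurality (majority) vote."""
        votes = Counter(tree(x) for tree in trees)
        return votes.most_common(1)[0][0]

    # Example with three toy 'trees':
    trees = [lambda x: "a", lambda x: "a", lambda x: "b"]
    print(forest_classify(trees, x=None))  # -> 'a'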

17
Example of a Random Forest
[Figure: three decision trees T1, T2, T3 applied to the same example x; a plurality of the leaves reached output class a.]
An example x will be classified as a according to this random forest.
18
Random Forests for Language Modeling
  • Two kinds of randomness:
  • Selection of the history positions to ask about
  • Alternatives: position 1, position 2, or the better of the two
  • Random initialization of the exchange algorithm
  • 100 decision trees: the i-th tree estimates P_DT^(i)(w_0 | w_-2, w_-1)
  • The final estimate is the average over all trees (written out below)
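Written out, with M = 100 trees:

    P_{RF}(w_0 \mid w_{-2}, w_{-1}) = \frac{1}{M} \sum_{i=1}^{M} P_{DT}^{(i)}(w_0 \mid w_{-2}, w_{-1})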

19
Experiments
  • Perplexity (PPL) as the evaluation measure (standard definition below)
  • UPenn Treebank portion of WSJ: about 1 million words for training and heldout (90/10 split), 82 thousand words for test
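For reference, the standard definition of perplexity over a test set of N words:

    \mathrm{PPL} = \exp\!\left(-\frac{1}{N} \sum_{i=1}^{N} \log P(w_i \mid w_1, \ldots, w_{i-1})\right)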

20
Experiments: trigram
  • Baseline: KN-trigram
  • No randomization: DT-trigram
  • 100 random DTs: RF-trigram

Model         Heldout PPL   Heldout gain (%)   Test PPL   Test gain (%)
KN-trigram        160.1             -             145.0         -
DT-trigram        158.6            0.9            163.3       -12.6
RF-trigram        126.8           20.8            129.7        10.5
21
Experiments: Aggregating
  • Considerable improvement already with 10 trees!

22
Experiments: Analysis
  • When is a test event "seen"?
  • KN-trigram: the trigram occurs in the training data
  • DT-trigram: the event is seen in the training data (via its equivalence class)

Analyze test data events by the number of times they are seen across the 100 DTs.
23
Experiments: Stability
The PPL results of different realizations vary, but the differences are small.
24
Experiments: Aggregation vs. Interpolation
25
Experiments: Aggregation vs. Interpolation
26
Experiments: Higher-Order n-gram Models
  • Baseline: KN n-gram
  • 100 random DTs: RF n-gram

Test PPL by n-gram order:
n-gram order      3       4       5       6
KN              145.0   140.0   138.8   138.6
RF              129.7   126.4   126.0   126.3
27
Applying Random Forests to Other Models: SLM
  • Structured Language Model (SLM) [Chelba & Jelinek, 2000]
  • Approximation: use tree triples

Test PPL:
        SLM
KN     137.9
RF     122.8
28
Speech Recognition Experiments (I)
  • Word Error Rate (WER) by N-best rescoring
  • WSJ text: 20 or 40 million words of training data
  • WSJ DARPA'93 HUB1 test data: 213 utterances, 3,446 words
  • N-best rescoring baseline WER is 13.7%
  • N-best lists were generated by a trigram baseline using Katz back-off smoothing.
  • The baseline trigram used 40 million words for training.
  • The oracle error rate is around 6%.

29
Speech Recognition Experiments (I)
  • Baseline: KN smoothing
  • 100 random DTs for the RF 3-gram
  • 100 random DTs for the PREDICTOR in the SLM
  • Approximation in the SLM

WER (%):
            3-gram (20M)   3-gram (40M)   SLM (20M)
KN              14.0           13.0          12.8
RF              12.9           12.4          11.9
p-value        <0.001          <0.05        <0.001
30
Speech Recognition Experiments (II)
  • Word Error Rate by lattice rescoring
  • IBM 2004 Conversational Telephony System for Rich Transcription: 1st place in the RT-04 evaluation
  • Fisher data: 22 million words
  • WEB data: 525 million words, collected using frequent Fisher n-grams as queries
  • Other data: Switchboard, Broadcast News, etc.
  • Lattice language model: 4-gram with interpolated Kneser-Ney smoothing, pruned to 3.2 million unique n-grams; WER is 14.4%
  • Test set: DEV04, 37,834 words

31
Speech Recognition Experiments (II)
  • Baseline: KN 4-gram
  • 110 random DTs for the EB-RF 4-gram
  • Sampling data without replacement
  • The Fisher and WEB models are interpolated

WER (%):
            Fisher 4-gram   WEB 4-gram   Fisher+WEB 4-gram
KN               14.1           15.2            13.7
RF               13.5           15.0            13.1
p-value         <0.001           -             <0.001
32
Practical Limitations of the RF Approach
  • Memory:
  • Decision tree construction uses much more memory.
  • Little performance gain when training data is
    really large.
  • Because we have 100 trees, the final model
    becomes too large to fit into memory.
  • Effective language model compression or pruning
    remains an open question.

33
Conclusions Random Forests
  • A new RF language modeling approach
  • A more general LM: RF → DT → n-gram
  • Randomized history clustering
  • Good generalization: better n-gram coverage, less biased toward the training data
  • An extension of Breiman's random forests to the data sparseness problem

34
Conclusions Random Forests
  • Improvements in perplexity and/or word error rate over interpolated Kneser-Ney smoothing for different models:
  • n-gram (up to n = 6)
  • Class-based trigram
  • Structured Language Model
  • Significant improvements in the best performing large-vocabulary conversational telephony speech recognition system