Word clustering: Smaller models, Faster training

1
Word clusteringSmaller models, Faster training
  • Joshua Goodman
  • Microsoft Speech.Net /
  • Microsoft Research

2
Quick Overview
  • What are language models
  • What are word clusters
  • How word clusters make language models
  • Smaller
  • Faster

3
A bad language model
4
A bad language model
5
A bad language model
6
A bad language model
7
What's a Language Model?
  • For our purposes today, a language model gives
    the probability of a word given its context
  • P(truth | and nothing but the) ≈ 0.2
  • P(roof | and nuts sing on the) ≈ 0.00000001
  • Useful for speech recognition, handwriting, OCR,
    etc.

8
The Trigram Approximation
  • Assume each word depends only on the previous two
    words
  • P(the | whole truth and nothing but) ≈
    P(the | nothing but)

9
Trigrams, continued
  • Find probabilities by counting in real text:
    P(the | nothing but) ≈
    C(nothing but the) /
    C(nothing but)
  • Smoothing: need to combine the trigram P(the |
    nothing but) with the bigram P(the | but) and the
    unigram P(the); otherwise, too many things
    you've never seen (sketch below)
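A minimal sketch of the counting and interpolation idea in Python (the interpolation weights and helper names are illustrative placeholders, not the exact smoothing scheme from the talk):

    # Count-based trigram estimate with simple interpolation smoothing.
    from collections import Counter

    def train_counts(words):
        uni, bi, tri = Counter(), Counter(), Counter()
        for i, w in enumerate(words):
            uni[w] += 1
            if i >= 1:
                bi[(words[i-1], w)] += 1
            if i >= 2:
                tri[(words[i-2], words[i-1], w)] += 1
        return uni, bi, tri

    def p_interp(w, u, v, uni, bi, tri, l3=0.6, l2=0.3, l1=0.1):
        # P(w | u v): mix trigram, bigram and unigram so that trigrams
        # never seen in training still get some probability mass.
        p_uni = uni[w] / sum(uni.values())
        p_bi = bi[(v, w)] / uni[v] if uni[v] else 0.0
        p_tri = tri[(u, v, w)] / bi[(u, v)] if bi[(u, v)] else 0.0
        return l3 * p_tri + l2 * p_bi + l1 * p_uni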

10
Perplexity
  • Perplexity: standard measure of language model
    accuracy; lower is better
  • Corresponds to the average branching factor of the
    model
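As a concrete formula (the standard definition; it is not spelled out on the slide), for a test sequence w_1 ... w_N:

    \[ \mathrm{Perplexity} \;=\; \exp\!\Big(-\tfrac{1}{N}\sum_{i=1}^{N} \ln P(w_i \mid w_1 \ldots w_{i-1})\Big) \]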

11
Trigram Problems
  • Models are potentially huge -- similar in size to
    the training data
  • Largest part of commercial recognizers
  • Sophisticated variations can be slow to learn
  • Maximum entropy could take weeks, months, or
    years!

12
Overview: Word clusters solve problems --
smaller, faster
  • Background: what are word clusters
  • Word clusters for smaller models
  • Use a clustering technique that leads to larger
    models, then prune
  • Up to 3 times smaller at the same perplexity
  • Word clusters for faster training of maximum
    entropy models
  • Train two models, each of which predicts half as
    much. Up to 35 times faster training

13
What are word clusters?
  • CLUSTERING = CLASSES (same thing)
  • What is P(Tuesday | party on)?
  • Similar to P(Monday | party on)
  • Similar to P(Tuesday | celebration on)
  • Put words in clusters
  • WEEKDAY = Sunday, Monday, Tuesday, ...
  • EVENT = party, celebration, birthday, ...

14
Putting words into clusters
  • One cluster per word: hard clustering
  • WEEKDAY = Sunday, Monday, Tuesday, ...
  • MONTH = January, February, April, May, June, ...
  • Soft clustering (each word belongs to more than
    one cluster) is possible, but complicates things:
    you get fractional counts.

15
Clustering: how to get them
  • Build them by hand
  • Works OK when there is almost no data
  • Part-of-Speech (POS) tags
  • Tend not to work as well as automatic clusters
  • Automatic clustering
  • Swap words between clusters to minimize perplexity

16
Clustering: automatic
  • Minimize perplexity of P(z|Y)
  • Put words into clusters randomly
  • Swap words between clusters whenever the overall
    perplexity of P(z|Y) goes down
  • Doing this naively is very slow, but mathematical
    tricks speed it up (sketch below)
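A naive version of the swap loop, as a sketch (the helper names are illustrative; the full recomputation of the objective on every trial move is exactly the slow part that the mathematical tricks avoid):

    # Exchange clustering for P(z | Y), where Y is the cluster of the
    # previous word. Naive and slow: recomputes the log-likelihood for
    # every candidate move.
    import math, random
    from collections import Counter

    def log_likelihood(bigrams, cluster_of):
        # bigrams: Counter mapping (prev_word, word) -> count
        cz, c = Counter(), Counter()
        for (y, z), n in bigrams.items():
            Y = cluster_of[y]
            cz[(Y, z)] += n
            c[Y] += n
        return sum(n * math.log(cz[(cluster_of[y], z)] / c[cluster_of[y]])
                   for (y, z), n in bigrams.items())

    def exchange_cluster(bigrams, vocab, k, iters=5, seed=0):
        rng = random.Random(seed)
        cluster_of = {w: rng.randrange(k) for w in vocab}
        best = log_likelihood(bigrams, cluster_of)
        for _ in range(iters):
            for w in vocab:
                old = cluster_of[w]
                for cand in range(k):
                    if cand == old:
                        continue
                    cluster_of[w] = cand
                    ll = log_likelihood(bigrams, cluster_of)
                    if ll > best:
                        best, old = ll, cand    # keep the swap
                    else:
                        cluster_of[w] = old     # undo
        return cluster_of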

17
Clustering: fast
  • Use top-down splitting: at each level,
    consider swapping each word
    between two clusters
  • Not bottom-up merging!
    (that considers all pairs of
    clusters!)

18
Clustering example
  • Imagine the following counts:
  • C(Tuesday | party on) = 0
  • C(Wednesday | celebration before) = 100
  • C(Tuesday | WEEKDAY) = 1000
  • Then:
  • P(Tuesday | party on) ≈ 0
  • P(WEEKDAY | EVENT PREPOSITION) is large
  • P(Tuesday | WEEKDAY) is large
  • P(WEEKDAY | EVENT PREPOSITION) × P(Tuesday |
    WEEKDAY) is large

19
Two actual WSJ clusters
  • Cluster 1: MONDAYS, FRIDAYS, THURSDAY, MONDAY,
    EURODOLLARS, SATURDAY, WEDNESDAY, FRIDAY,
    TENTERHOOKS, TUESDAY, SUNDAY
  • Cluster 2: CONDITION, PARTY, FESCO, CULT, NILSON,
    PETA, CAMPAIGN, WESTPAC, FORCE, CONRAN,
    DEPARTMENT, PENH, GUILD

20
How to use clusters
  • Let x, y, z be words; X, Y, Z be the clusters of
    those words.
  • P(z|xy) ≈ P(Z|XY) × P(z|Z)
  • P(Tuesday | party on) ≈ P(WEEKDAY | EVENT
    PREPOSITION) × P(Tuesday | WEEKDAY)
  • Much smoother, smaller model than the normal P(z|xy),
    but higher perplexity. (Sketch below.)
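As a sketch, the clustered trigram from this slide looks like the following (the probability tables and helper names are placeholders, not from the talk):

    # P(z | x y) approximated by P(Z | X Y) * P(z | Z), with hard clusters.
    def p_clustered(z, x, y, cluster_of, p_cluster_trigram, p_word_given_cluster):
        X, Y, Z = cluster_of[x], cluster_of[y], cluster_of[z]
        return p_cluster_trigram(Z, X, Y) * p_word_given_cluster(z, Z)

    # e.g. p_clustered("Tuesday", "party", "on", ...) multiplies
    # P(WEEKDAY | EVENT PREPOSITION) by P(Tuesday | WEEKDAY).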

21
Predictive clustering
  • IMPORTANT FACT -- with no smoothing, etc.:

We are using hard clusters, so if we know z then
we know the cluster Z, so P(z, Z | history) =
P(z | history)
22
Predictive clustering
  • Equality with no smoothing, etc.:
  • P(z|history) = P(Z|history) × P(z|history, Z)
  • With smoothing, tends to be better
  • May have trouble figuring out the probability
    P(Tuesday|party on), but can guess
  • P(WEEKDAY|party on) × P(Tuesday|party on WEEKDAY)
    ≈
  • P(WEEKDAY|party on) × P(Tuesday|on WEEKDAY)
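Written out, the step these two slides rely on (using the hard-clustering fact that z determines Z) is:

    \[ P(z \mid \text{history}) \;=\; P(z, Z \mid \text{history}) \;=\; P(Z \mid \text{history}) \times P(z \mid \text{history}, Z) \]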

23
Compression - Introduction
  • We have billions of words of training data.
  • Most large-vocabulary models are limited by model
    size.
  • The most important question in language modeling
    is: "What is the best language model we can build
    that will fit in the available memory?"
  • Relatively little research.
  • New results: up to a factor of 3 or more smaller
    than the previous state of the art at the same
    perplexity.

24
Compression overview
  • Review previous techniques
  • Count cutoffs
  • Stolcke pruning
  • IBM clustering
  • Describe new techniques (Stolcke pruning +
    predictive clustering)
  • Show experimental results
  • Up to a factor of 3 or more size decrease (at the
    same perplexity) versus Stolcke pruning.

25
Count cutoffs
  • Simple, commonly used technique
  • Just remove n-grams with small counts

26
Stolcke pruning
  • Consider P(City | New York) vs. P(City | York)
  • The probabilities are almost the same
  • Pruning P(City | New York) has almost no cost,
    even though C(New York City) is big.
  • Consider pruning P(lightbulb | change a): much
    more likely than P(lightbulb | a)
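For reference, a rough form of the criterion from Stolcke (1998), ignoring the correction for the re-normalized backoff weights: an n-gram (h, w) is pruned when the weighted relative-entropy cost

    \[ \Delta(h, w) \;\approx\; P(h)\, P(w \mid h)\, \log \frac{P(w \mid h)}{P'(w \mid h)} \]

falls below a threshold, where P' is the model after removing the n-gram and backing off to the shorter context. This is why P(City | New York) can be dropped cheaply: the backed-off P(City | York) is nearly identical, so the log ratio is close to zero.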

27
IBM clustering
  • Use P(Z|XY) × P(z|Z)
  • Don't interpolate with P(z|xy), of course
  • The model is much smaller, but has higher perplexity.
  • How does it compare to count cutoffs, etc.? No one
    ever tried the comparison!

28
Predictive clustering
  • Predictive clustering: P(Z|xy) × P(z|xyZ)
  • Model is actually larger than the original P(z|xy)
  • For each original P(z|xy), we must store
    P(z|xyZ); in addition, we need P(Z|xy)
  • Normal model stores:
  • P(Sunday|party on), P(Monday|party on),
    P(Tuesday|party on), ...
  • Clustered, pruned model stores:
  • P(WEEKDAY|party on)
  • P(Sunday|WEEKDAY), P(Monday|WEEKDAY),
    P(Tuesday|WEEKDAY), ...

29
Experiments
30
Different Clusterings
  • Let x^j, x^k, x^l be alternate clusterings for x
  • Example:
  • Tuesday^l = WEEKDAY
  • Tuesday^j = DAYS-MONTHS-TIMES
  • Tuesday^k = NOUNS
  • You can think of l, j, and k as being the number
    of clusters.
  • Example: P(z^l | xy) ≈ P(z^l | x^j y^j)

31
Different Clusterings (continued)
  • Example: P(z^l | xy) ≈ P(z^l | x^j y^j)
  • P(WEEKDAY | party on) ≈
  • P(WEEKDAY | party^j on^j) =
  • P(WEEKDAY | NOUN PREP)
  • Or:
  • P(WEEKDAY | party on) ≈
  • P(WEEKDAY | party^k on^k) =
  • P(WEEKDAY | EVENT LOC-PREP)

32
Both Clustering
  • P(z^l | xy) ≈ P(z^l | x^j y^j)
  • P(z | xy z^l) ≈ P(z | x^k y^k z^l)
  • Substitute into predictive clustering:
  • P(z|xy) =
  • P(z^l | xy) × P(z | xy z^l) ≈
  • P(z^l | x^j y^j) × P(z | x^k y^k z^l)

33
Example
  • P(z|xy) =
  • P(z^l | xy) × P(z | xy z^l) ≈
  • P(z^l | x^j y^j) × P(z | x^k y^k z^l)
  • P(Tuesday | party on) =
  • P(WEEKDAY | party on) × P(Tuesday | party
    on WEEKDAY) ≈
  • P(WEEKDAY | NOUN PREP) × P(Tuesday | EVENT
    LOC-PREP WEEKDAY)

34
Size reduction
  • P(z|xy) = P(z^l | xy) × P(z | xy z^l) ≈
    P(z^l | x^j y^j) × P(z | x^k y^k z^l)
  • The optimal setting for k is often very large, e.g.
    the whole vocabulary.
  • The unpruned model is typically larger than the
    unclustered one, but smaller than predictive.
  • The pruned model is smaller than unclustered and
    smaller than predictive at the same perplexity

35
Experiments
36
WSJ (English) results -- relative
37
Chinese Newswire Results (with Jianfeng Gao, MSR
Beijing)
38
Compression conclusion
  • We can achieve up to a factor of 3 or more
    reduction at the same perplexity by using Both
    Clustering combined with Stolcke pruning.
  • The model is surprising: it actually increases the
    model size and then prunes it down smaller
  • Results are similar for Chinese and English.

39
Maximum Entropy Speedups
  • Many people think Maximum Entropy is the future
    of language modeling
  • (not me anymore)
  • Allows lots of different information to be
    combined
  • Very slow to train: weeks
  • Predictive cluster models are up to 35 times
    faster to train

40
Maximum entropy overview
  • Describe what maximum entropy is
  • Explain how to train maxent models, and why it is
    slow
  • Show how predictive clustering can speed it up
  • Give experimental results showing a factor of 35
    speedup.
  • Talk about application to other areas

41
Maximum Entropy Introduction
  • "I'm busy next weekend. We are having a big
    party on ..."
  • How likely is "Friday"?
  • Reasonably likely to start: 0.001
  • "weekend" occurs nearby: 2 times as likely
  • Previous word is "on": 3 times as likely
  • Previous words are "party on": 5 times as likely
  • 0.001 × 2 × 3 × 5 = 0.03
  • Need to normalize: 0.03 / Σ P(all words)
    (sketch below)
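As a sketch of that arithmetic (the factor values and helper names are illustrative placeholders, not from the talk):

    # Multiply a base probability by one factor per feature that fires,
    # then normalize by the sum over the whole vocabulary.
    def maxent_prob(word, vocab, base_prob, factors_for):
        def score(w):
            s = base_prob[w]                  # e.g. 0.001 for "Friday"
            for factor in factors_for(w):     # e.g. [2.0, 3.0, 5.0]
                s *= factor
            return s                          # 0.001 * 2 * 3 * 5 = 0.03
        z = sum(score(w) for w in vocab)      # normalization constant
        return score(word) / z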

42
Maximum Entropy what is it
  • Product of many factors, one per indicator function
  • f_j is an indicator: 1 if some condition holds,
    e.g. f_j(w, w_{i-2}, w_{i-1}) = 1 if w = Friday,
    w_{i-2} = party, w_{i-1} = on
  • Can create bigrams, trigrams, skipping, caches,
    triggers with the right indicator functions.
  • Z_λ is a normalization constant (model form below)
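The equation itself did not survive the transcript; in the multiplicative parameterization that matches the training loop on the last slide, the standard maximum entropy trigram form is:

    \[ P(w \mid w_{i-2}, w_{i-1}) \;=\; \frac{1}{Z_\lambda(w_{i-2}, w_{i-1})} \prod_j \lambda_j^{\,f_j(w,\, w_{i-2},\, w_{i-1})}, \qquad Z_\lambda(w_{i-2}, w_{i-1}) \;=\; \sum_{w'} \prod_j \lambda_j^{\,f_j(w',\, w_{i-2},\, w_{i-1})} \]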

43
Maximum entropy training
  • How to get the λ's? Iterative EM algorithm
  • Requires computing the probability distribution in
    all training contexts. For each training
    context:
  • Requires determining all indicators that might
    apply
  • Requires computing the normalization constant
  • Note that the number of indicators that can apply,
    and the time to normalize, are both bounded by a
    factor of the vocabulary size.

44
Example: "party on Friday"
  • Consider "party on Friday"
  • We need to compute P(Friday|party on),
    P(Tuesday|party on), P(fish|party on), etc.
  • Number of trigram indicators (f_j's) that we need
    to consider is bounded by the vocabulary size
  • Number of words to normalize over: the vocabulary
    size.

45
Solution Predictive Clustering
  • Create two separate maximum entropy models:
    P(Z|wxy) and P(z|wxyZ).
  • Imagine a 10,000 word vocabulary, 100 clusters, 100
    words per cluster.
  • Time to train the first model will be proportional
    to the number of clusters (100)
  • Time to train the second model is proportional to
    the number of words per cluster (100)
  • 10,000 / 200 = 50 times speedup (sketch below)
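A sketch of the two-model factorization and why the per-context work shrinks (the scoring functions and table names are placeholders, not from the talk):

    # P(z | context) = P(Z | context) * P(z | context, Z).
    # Model 1 normalizes over ~100 clusters; model 2 over the ~100 words
    # in one cluster, instead of all 10,000 words.
    def p_word(z, context, cluster_of, words_in, score_cluster, score_word):
        Z = cluster_of[z]
        z1 = sum(score_cluster(C, context) for C in words_in)       # keys = clusters
        p_cluster = score_cluster(Z, context) / z1
        z2 = sum(score_word(w, context, Z) for w in words_in[Z])    # words in Z only
        p_within = score_word(z, context, Z) / z2
        return p_cluster * p_within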

46
Predictive clustering example
  • Consider "party on Tuesday", P(Z|wxy)
  • We need to know P(WEEKDAY|party on),
    P(MONTH|party on), P(ANIMAL|party on), etc.
  • Number of trigram indicators (f_j's) that we need
    to consider is bounded by the number of clusters
  • Normalize only over the number of clusters

47
Predictive clustering example(continued)
  • Consider "party on Tuesday", P(z|wxyZ)
  • We need to know P(Monday|party on WEEKDAY),
    P(Tuesday|party on WEEKDAY), etc. Note that
    P(fish|party on WEEKDAY) = 0
  • Number of trigram indicators (f_j's) we need to
    consider is bounded by the number of words in the
    cluster.
  • Normalize only over the number of words in the cluster

48
Improvements: testing
  • May also speed up testing.
  • If running the decoder with all words, then we need
    to compute P(z|wxy) for all z, so there is no speedup.
  • If using maximum entropy as a postprocessing
    step, on a lattice or n-best list, it may still lead
    to speedups, since we only need to compute a few z's
    for each context wxy.

49
Maximum entropy results
50
Maximum entropy conclusions
  • At 10,000 words of training data, predictive
    clustering hurts a little
  • At any larger size it helps.
  • The amount it helps increases as the training data
    size increases.
  • Triple predictive gives a factor of 35 over fast
    unigram at 10,000,000 words of training
  • Perplexity actually decreases slightly, even with
    faster training!

51
Overall conclusion: Predictive clustering --
smaller, faster
  • Clustering is a well known technique
  • Smaller: new ways of using clustering to reduce
    language model size -- up to 50% reduction in size
    at the same perplexity.
  • Faster: new ways of speeding up training for
    maximum entropy models.

52
Speedup applied to other areas
  • Can apply to any problem with many outputs, not
    just words
  • Example: collaborative filtering tasks
  • This speedup can be used with most machine
    learning algorithms applied to problems with many
    outputs
  • Examples: neural networks, decision trees

53
Neural Networks
  • Imagine a neural network with a large number of
    outputs (10,000)
  • Requires backpropagating one 1 and 9,999 0s
    (sketch below)
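The talk does not show code for this; as a sketch of how the same cluster trick would factor a large output layer (a class-based two-stage softmax; shapes and names are illustrative):

    import numpy as np

    def softmax(v):
        e = np.exp(v - v.max())
        return e / e.sum()

    def two_stage_output(h, W_cluster, W_word, cluster_of, words_in, word):
        # h: hidden vector (d,); W_cluster: (num_clusters, d);
        # W_word[c]: (len(words_in[c]), d). Each step touches only
        # ~num_clusters + len(words_in[c]) output units, not all 10,000.
        c = cluster_of[word]
        p_cluster = softmax(W_cluster @ h)          # P(cluster | h)
        p_within = softmax(W_word[c] @ h)           # P(word | h, cluster)
        return p_cluster[c] * p_within[words_in[c].index(word)]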

54
Maximum Entropy training: Inner loop (one training context)
  • For each word w in vocabulary
  • P_w ← 1
  • next w
  • For each non-zero f_j (for predicted word w)
  • P_w ← P_w × λ_j
  • next j
  • z ← Σ_w P_w
  • For each word w in vocabulary
  • observed_w ← observed_w + P_w / z
  • next w