LING 406 Intro to Computational Linguistics: Word Frequency Distributions



1
LING 406 Intro to Computational Linguistics
Word Frequency Distributions
  • Richard Sproat
  • URL: http://catarina.ai.uiuc.edu/L406_08/

2
This lecture
  • Motivation: all statistical models of language
    start with estimates of the likelihood of
    particular linguistic events.
  • Those estimates are in turn based on data derived
    from text.
  • But these data are sensitive to properties of how
    words are distributed in text.
  • It is therefore important to understand something
    about word frequency distributions.

3
An example
4
Some definitions
5
Word-frequency list
6
Growth in vocabulary
7
Mean word frequency
8
Mean word frequency
9
Mean word frequency
  • But this is also unjustified; see Panel B of the
    previous figure.
  • Word frequency distributions are very different
    from distributions of more familiar things like
    die rolls or coin tosses.
  • If you toss a coin 1000 times you expect about
    500 heads.
  • Toss it 10000 times and you expect about 5000
    heads.
  • The count of heads won't deviate much from 0.5
    times the number of tosses.
  • Indeed, as you increase the number of tosses, your
    estimate of the underlying probability becomes
    better.
  • With word frequency distributions, measures such
    as mean frequency change with increasing sample
    size.

Word frequency distributions belong to the class
of Large Number of Rare Events (LNRE)
distributions.
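The contrast above can be simulated. The sketch below is not from the slides: it uses a synthetic Zipf-like vocabulary of 50,000 types as a stand-in for real text, and compares the stabilizing proportion of heads in coin tosses against the mean word frequency N/V(N) (tokens per observed type), which keeps growing with sample size.

```python
import random

random.seed(0)

# Coin tosses: the proportion of heads settles down as the sample grows.
for n in (1000, 10000, 100000):
    heads = sum(random.random() < 0.5 for _ in range(n))
    print(n, heads / n)

# Word tokens: draw from a Zipf-like distribution over a large vocabulary
# (a stand-in for real text) and watch the mean word frequency N / V(N)
# grow with N instead of converging.
vocab = list(range(1, 50001))
weights = [1 / r for r in vocab]
tokens = random.choices(vocab, weights=weights, k=100000)
for n in (1000, 10000, 100000):
    v = len(set(tokens[:n]))          # V(N): number of observed types
    print(n, round(n / v, 2))         # mean frequency keeps increasing
```

This is the LNRE property in miniature: new rare types keep arriving, so per-type statistics never stabilize.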
10
Mean word frequency
11
The randomness assumption
  • Many models of lexical frequencies assume that
    words occur randomly in text.
  • This is obviously wrong, but how bad an
    assumption is it?
  • First 57 words of randomized Alice
  • More find likely a somebody a you're lost
    again was you invent waited a on to time passion
    so partner about and with panting back-somersault
    queen as was were the open obliged ask the Alice
    much a do your as on if face come crab best not
    rapped gryphon I affair I to it see unlocking
    low.
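A minimal sketch of what "randomized Alice" means, using a short stand-in sentence rather than the actual novel: shuffling the token sequence leaves every unigram count untouched, so frequency-based statistics cannot tell the two versions apart, even though the word order is destroyed.

```python
import random
from collections import Counter

random.seed(406)

# A tiny stand-in text; the slides use Alice in Wonderland, which is
# not reproduced here.
text = ("alice was beginning to get very tired of sitting by her sister "
        "on the bank and of having nothing to do").split()

shuffled = text[:]
random.shuffle(shuffled)
print(" ".join(shuffled))

# Shuffling preserves every unigram count ...
assert Counter(text) == Counter(shuffled)

# ... but destroys sequential structure, e.g. which words follow "of".
print([b for a, b in zip(text, text[1:]) if a == "of"])
```

This is why the randomness assumption, while "obviously wrong" about word order, can still be workable for models built purely on frequencies.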

12
Sample relative frequency
13
Sample relative frequency of the, a
14
Frequency spectrum definitions
15
Frequency spectrum for Alice
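The slide's figure is not reproduced here, but the underlying quantity can be computed in a few lines. In the usual notation, the frequency spectrum V(m, N) is the number of word types that occur exactly m times in a sample of N tokens; the sketch below (a toy illustration, not the Alice data) builds it from raw token counts.

```python
from collections import Counter

def frequency_spectrum(tokens):
    """Map each frequency m to V(m): the number of types occurring m times."""
    type_freqs = Counter(tokens)           # type -> frequency
    return Counter(type_freqs.values())    # frequency m -> number of types

tokens = "the cat sat on the mat and the dog sat too".split()
print(sorted(frequency_spectrum(tokens).items()))
# -> [(1, 6), (2, 1), (3, 1)]: six hapaxes, one type twice, one thrice
```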
16
Plot of frequency spectrum
17
Further note
18
Zipf's law
19
Zipf's law: definitions
20
Zipf's law: formulation
21
Zipf's law: plot
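The slide's log-log plot can be approximated numerically. Zipf's law says the frequency f(r) of the rank-r word is roughly C / r^a with a close to 1, so log f(r) should be roughly linear in log r with slope about -a. The sketch below (my own illustration, fitted on a synthetic Zipfian sample rather than a real corpus) estimates that slope by least squares.

```python
import math
import random
from collections import Counter

def zipf_slope(tokens):
    """Least-squares slope of log f(r) against log r; Zipf predicts about -1."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

# A synthetic Zipfian sample stands in for real text here.
random.seed(1)
ranks = list(range(1, 2001))
tokens = random.choices(ranks, weights=[1 / r for r in ranks], k=20000)
print(round(zipf_slope(tokens), 2))   # a negative slope in the vicinity of -1
```

On real corpora the fit is only approximate, especially in the tail of rare words, which is one of the "further issues" the next slides raise.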
22
Further issues
23
Zipf's law: zeta distribution
24
Zipf's hope
25
Dependency of Zipf's law on sample size
26
More examples: 1995 AP newswire
27
More examples: ROCLING Chinese corpus (characters)
28
Miscellanea
  • For the 1995 Associated Press corpus, 40% of the
    word types occur just once.
  • (In contrast, in the 10 million character
    ROCLING corpus, only 11% of the characters occur
    once.)
  • For a smaller corpus, such as the Brown corpus (1
    million words), the proportion is closer to 50%.
  • But note that the Brown corpus is special . . .
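The quantity behind these percentages, the proportion of hapax legomena (types occurring exactly once), is straightforward to compute. A small sketch on a toy token list, not on the AP or Brown data:

```python
from collections import Counter

def hapax_proportion(tokens):
    """Fraction of word types that occur exactly once in the sample."""
    freqs = Counter(tokens)
    hapaxes = sum(1 for f in freqs.values() if f == 1)
    return hapaxes / len(freqs)

tokens = "the cat sat on the mat and the dog sat too".split()
print(hapax_proportion(tokens))   # 6 hapaxes out of 8 types -> 0.75
```

As the slides note, this proportion depends on corpus size: smaller corpora like Brown sit nearer 50%, larger ones like the 1995 AP corpus nearer 40%.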

29
Summary
  • Statistics of various kinds change with sample
    size.
  • This is due in part to the LNRE property of word
    frequency distributions.
  • Zipf's law is a fairly robust property of word
    frequency distributions, but even that is
    sensitive to sample size.
  • Question: How many words did Shakespeare know?

30
One final example
Malayalam
Dravidian
Tamil
Kannada
Telugu
Indo-European
Assamese
Punjabi
Hindi
Bengali
Oriya