Title: LING 406 Intro to Computational Linguistics: Word Frequency Distributions
1. LING 406 Intro to Computational Linguistics: Word Frequency Distributions
- Richard Sproat
- URL: http://catarina.ai.uiuc.edu/L406_08/
2. This lecture
- Motivation: all statistical models of language start with estimates of the likelihood of particular linguistic events.
- Those estimates are in turn based on data derived from text.
- But these data are sensitive to properties of how words are distributed in text.
- It is therefore important to understand something about word frequency distributions.
3. An example
4. Some definitions
5. Word-frequency list
6. Growth in vocabulary
7. Mean word frequency
8. Mean word frequency
9. Mean word frequency
- But this is also unjustified: see Panel B of the previous figure.
- Word frequency distributions are very different from distributions of more familiar things like die or coin tosses.
- If you toss a coin 1000 times you expect there to be about 500 heads.
- If you toss it 10000 times you expect about 5000 heads.
- The mean won't deviate much from 0.5 times the number of tosses.
- Indeed, as you increase the number of tosses, your estimate of the underlying probability becomes better.
- With word frequency distributions, measures such as mean frequencies change with increasing sample size.

Word frequency distributions belong to the class of Large Number of Rare Events (LNRE) distributions.
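The contrast above can be simulated directly. The sketch below (a toy illustration, not code from the lecture) compares a coin's sample mean, which stabilizes, against the mean word frequency N/V(N) of a synthetic Zipfian "text", which keeps growing as new rare types enter the sample:

```python
import random

random.seed(0)

# Coin tosses: the sample mean stabilizes as the sample grows.
tosses = [random.random() < 0.5 for _ in range(10000)]
for n in (1000, 10000):
    heads = sum(tosses[:n])
    print(n, heads / n)  # hovers near 0.5

# LNRE-style data: a toy Zipfian text over a large vocabulary
# (an assumption for illustration, not a real corpus).
vocab = list(range(1, 100000))
weights = [1.0 / r for r in vocab]
text = random.choices(vocab, weights=weights, k=50000)

# Mean word frequency N / V(N) keeps increasing with N, because
# new (rare) types keep appearing as the sample grows.
for n in (5000, 50000):
    types = len(set(text[:n]))
    print(n, n / types)
```

The coin's relative frequency converges; the mean word frequency does not, which is the defining symptom of an LNRE distribution.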
10. Mean word frequency
11. The randomness assumption
- Many models of lexical frequencies assume that words occur randomly in text.
- This is obviously wrong, but how bad an assumption is it?
- First 57 words of randomized Alice:
- More find likely a somebody a you're lost again was you invent waited a on to time passion so partner about and with panting back-somersault queen as was were the open obliged ask the Alice much a do your as on if face come crab best not rapped gryphon I affair I to it see unlocking low.
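A randomized text like the one above can be produced by shuffling the token sequence: word order is destroyed, but every word keeps its original frequency. A minimal sketch (the snippet of text is a hypothetical stand-in, not the lecture's tokenization of Alice):

```python
import random
from collections import Counter

random.seed(1)

# Hypothetical stand-in for the tokenized text of Alice.
text = ("alice was beginning to get very tired of sitting by her sister "
        "on the bank and of having nothing to do").split()

shuffled = text[:]
random.shuffle(shuffled)  # destroys word order, preserves frequencies

# The frequency distributions are identical; only the order differs.
assert Counter(text) == Counter(shuffled)
print(" ".join(shuffled[:10]))
```

This is why the randomness assumption is testable: any statistic that differs between the real text and its shuffled version measures a departure from randomness.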
12. Sample relative frequency
13. Sample relative frequency of "the", "a"
14. Frequency spectrum: definitions
15. Frequency spectrum for Alice
16. Plot of frequency spectrum
17. Further note
18. Zipf's law
19. Zipf's law: definitions
20. Zipf's law: formulation
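In its simplest formulation, Zipf's law says that the r-th most frequent word has frequency f(r) ≈ C/r, with C ≈ f(1) (the unit exponent is an idealization). A small sketch of checking this on a token list (the tokens here are an invented illustrative sample, not the lecture's corpus):

```python
from collections import Counter

def zipf_check(tokens, top=5):
    """Compare observed rank frequencies with the prediction C / r, C = f(1)."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    c = freqs[0]
    for r, f in enumerate(freqs[:top], start=1):
        print(r, f, round(c / r, 1))  # rank, observed, predicted
    return freqs

# Invented toy token list with roughly Zipfian counts.
tokens = (["the"] * 12 + ["of"] * 6 + ["and"] * 4 + ["a"] * 3 +
          ["in"] * 2 + ["cat", "dog", "hat", "sun", "run"])
zipf_check(tokens)
```

On real corpora the same check is done on a log-log rank-frequency plot, where Zipf's law predicts an approximately straight line of slope -1.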
21. Zipf's law: plot
22. Further issues
23. Zipf's law: the zeta distribution
24. Zipf's hope
25. Dependency of Zipf's law on sample size
26. More examples: 1995 AP newswire
27. More examples: ROCLING Chinese corpus (characters)
28. Miscellanea
- For the 1995 Associated Press corpus, 40% of the word types occur just once.
- (In contrast, in the 10-million-character ROCLING corpus, only 11% of the characters occur once.)
- For a smaller corpus, such as the Brown corpus (1 million words), the figure is closer to 50%.
- But note that the Brown corpus is special . . .
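The statistic in question, the fraction of types occurring exactly once (hapax legomena), is easy to compute. A minimal sketch on an invented toy sample (the AP and ROCLING percentages above come from the lecture, not from this code):

```python
from collections import Counter

def hapax_fraction(tokens):
    """Fraction of word types that occur exactly once (hapax legomena)."""
    counts = Counter(tokens)
    hapaxes = sum(1 for c in counts.values() if c == 1)
    return hapaxes / len(counts)

# Invented toy sample: 4 of the 6 types occur exactly once.
tokens = "a a b b c d e f".split()
print(hapax_fraction(tokens))
```

That this fraction stays large even in huge corpora is another face of the LNRE property: rare types never stop arriving.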
29. Summary
- Statistics of various kinds change with sample size.
- This is due in part to the LNRE property of word frequency distributions.
- Zipf's law is a fairly robust property of word frequency distributions, but even that is sensitive to sample size.
- Question: how many words did Shakespeare know?
30. One final example
- Dravidian: Malayalam, Tamil, Kannada, Telugu
- Indo-European: Assamese, Punjabi, Hindi, Bengali, Oriya