Title: Vocabulary size and term distribution: tokenization, text normalization and stemming
1Vocabulary size and term distribution
tokenization, text normalization and stemming
2Overview
- Getting started
- tokenization, stemming, compounds
- end of sentence
- Collection vocabulary
- Terms, tokens, types
- Vocabulary size
- Term distribution
- Stop words
- Vector representation of text and term weighting
3Tokenization
- Input: Friends, Romans, Countrymen, lend me your ears
- Output: Friends Romans Countrymen lend me your ears
- Token: an instance of a sequence of characters that are grouped together as a useful semantic unit for processing
- Type: the class of all tokens containing the same character sequence
- Term: a type that is included in the system dictionary (normalized)
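The difference between tokens, types, and terms can be made concrete by counting them. The following is a minimal sketch, not a reference tokenizer: the regular expression, the use of lowercasing as the only normalization step, and the function names are all assumptions made for illustration.

```python
import re

def tokenize(text):
    """Split text into tokens: maximal runs of letters, digits, or apostrophes."""
    return re.findall(r"[A-Za-z0-9']+", text)

def count_tokens_types_terms(text):
    """Return (number of tokens, number of types, number of terms)."""
    tokens = tokenize(text)
    types = set(tokens)                  # distinct character sequences
    terms = {t.lower() for t in types}   # assumed normalization: lowercasing only
    return len(tokens), len(types), len(terms)

print(tokenize("Friends, Romans, Countrymen, lend me your ears"))
print(count_tokens_types_terms("The cat saw the cat"))
```

In "The cat saw the cat" there are five tokens, four types ("The" and "the" differ as character sequences), and three terms once case is folded.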
4- The cat slept peacefully in the living room. It's a very old cat.
5- Mr. O'Neill thinks that the boys' stories about Chile's capital aren't amusing.
- How to handle special cases involving apostrophes, hyphens, etc.?
- C++, C#, URLs, emails, phone numbers, dates
- San Francisco, Los Angeles
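One way to handle some of these special cases is a rule-based tokenizer that tries the most specific patterns first. The sketch below is illustrative only: the patterns are simplified assumptions, they do not cover phone numbers, dates, or multiword names such as San Francisco, and the URL pattern will also swallow trailing punctuation.

```python
import re

# Alternatives are tried left to right, so specific patterns come first.
TOKEN_PATTERN = re.compile(
    r"""
      https?://\S+                    # URLs (simplified)
    | [\w.+-]+@[\w-]+(?:\.[\w-]+)+    # email addresses (simplified)
    | [A-Za-z]\+\+                    # names such as C++
    | [A-Za-z]\#                      # names such as C#
    | \w+(?:['-]\w+)*                 # words with internal apostrophes or hyphens
    """,
    re.VERBOSE,
)

def tokenize(text):
    """Return the tokens matched by TOKEN_PATTERN, in order of appearance."""
    return TOKEN_PATTERN.findall(text)

print(tokenize("Mr. O'Neill thinks that the boys' stories about Chile's capital aren't amusing."))
print(tokenize("I use C++ and C# daily"))
```

With these rules O'Neill, Chile's, and aren't each stay a single token, while the trailing apostrophe in boys' is dropped. Whether that is the right choice depends on how queries are tokenized, since documents and queries must be processed the same way.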
6- Issues of tokenization are language specific
- Requires the language to be known
- Language identification based on classifiers that use short character subsequences as features is highly effective
- Most languages have distinctive signature patterns
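To illustrate the idea of character subsequences as features, the sketch below compares character trigram counts of an input text against one profile per language and picks the closest by cosine similarity. It is a toy, not a production classifier: the two training sentences, the choice of trigrams, and all names are assumptions, and a real system would train on far more text.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """Count overlapping character n-grams in lowercased text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy training text (assumption): one short sentence per language.
TRAINING = {
    "en": "the quick brown fox jumps over the lazy dog and the cat",
    "de": "der schnelle braune fuchs springt auf den faulen hund und die katze",
}
PROFILES = {lang: char_ngrams(text) for lang, text in TRAINING.items()}

def identify_language(text):
    """Return the language whose trigram profile is most similar to the text."""
    query = char_ngrams(text)
    return max(PROFILES, key=lambda lang: cosine(query, PROFILES[lang]))

print(identify_language("the dog and the fox"))
print(identify_language("der hund und die katze"))
```

Trigrams such as "the" and "und" act as the signature patterns: they are frequent in one language and rare in the other.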
7Very important for information retrieval
- Splitting tokens on spaces can cause bad retrieval results
- A search for York University returns pages containing New York University
- German compound nouns
- Retrieval systems for German greatly benefit from the use of a compound-splitter module
- Checks if a word can be subdivided into words that appear in the vocabulary
- East Asian languages (Chinese, Japanese, Korean, Thai)
- Text is written without any spaces between words
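The vocabulary check performed by a compound splitter can be sketched as a recursive search for a split in which every part is a known word. This is a simplified illustration under stated assumptions: the function name, the minimum part length, and the tiny vocabulary are hypothetical, and the sketch ignores the linking elements (such as the "s" in many German compounds) that real splitters must handle.

```python
def split_compound(word, vocabulary, min_len=3):
    """Return a list of vocabulary words that concatenate to `word`, or None.

    A word that is itself in the vocabulary is returned unsplit.
    """
    word = word.lower()
    if word in vocabulary:
        return [word]
    # Try every split point that leaves at least min_len characters on each side.
    for i in range(min_len, len(word) - min_len + 1):
        head, tail = word[:i], word[i:]
        if head in vocabulary:
            rest = split_compound(tail, vocabulary, min_len)
            if rest is not None:
                return [head] + rest
    return None

vocabulary = {"computer", "linguistik"}  # toy vocabulary (assumption)
print(split_compound("Computerlinguistik", vocabulary))
```

Indexing the parts alongside the full compound lets a query for one part retrieve documents that only contain the compound.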
9Stop words
- Very common words that have no discriminatory
power
10Building a stop word list
- Sort terms by collection frequency and take the most frequent
- In a collection about insurance practices, "insurance" would be a stop word
- Why do we need stop lists?
- Smaller indices for information retrieval
- Better approximation of importance for summarization, etc.
- Their use is problematic in phrasal searches
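The frequency-based construction described above can be sketched in a few lines. The whitespace tokenization, the lowercasing, and the three toy documents are assumptions made to keep the example short.

```python
from collections import Counter

def build_stop_list(documents, size):
    """Return the `size` terms with the highest collection frequency."""
    collection_frequency = Counter()
    for doc in documents:
        # Assumed tokenization: lowercase, then split on whitespace.
        collection_frequency.update(doc.lower().split())
    return [term for term, _ in collection_frequency.most_common(size)]

documents = [
    "the insurance policy covers the car",
    "the insurance claim was denied",
    "insurance fraud is the problem",
]
print(build_stop_list(documents, 2))  # ['the', 'insurance']
```

In this toy insurance collection, "insurance" ends up on the stop list next to "the", which shows how the list depends on the collection.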
11- Trend in IR systems over time
- Large stop lists (200-300 terms)
- Very small stop lists (7-12 terms)
- No stop list whatsoever
- The 30 most common words account for 30% of the tokens in written text
- Good compression techniques for indices
- Term weighting leads to very common words having little impact on document representation
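The share of tokens covered by the most common words can be measured directly on any tokenized collection. A minimal sketch, assuming the tokens are already available as a list; the function name and the toy input are hypothetical.

```python
from collections import Counter

def top_k_coverage(tokens, k):
    """Fraction of all tokens accounted for by the k most frequent types."""
    counts = Counter(tokens)
    covered = sum(count for _, count in counts.most_common(k))
    return covered / len(tokens)

tokens = "a a a b b c".split()
print(top_k_coverage(tokens, 1))  # 0.5
```

Running this with k = 30 on a large tokenized corpus is a quick way to check the coverage figure for that corpus.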
12Normalization
- Token normalization
- Canonicalizing tokens so that matches occur despite superficial differences in the character sequences of the tokens
- U.S.A vs USA
- Anti-discriminatory vs antidiscriminatory
- Car vs automobile?
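A simple way to canonicalize tokens is to map each one to a representative of its equivalence class. The sketch below removes periods and hyphens and, as an extra assumption, lowercases the result. The optional synonym table for cases like car and automobile is hypothetical and would have to be built by hand or taken from a thesaurus.

```python
def normalize(token, synonyms=None):
    """Map a token to a canonical form.

    Periods and hyphens are deleted and the result is lowercased
    (lowercasing is an assumption of this sketch). If a synonym table
    is given, the canonical form is looked up in it.
    """
    canonical = token.replace(".", "").replace("-", "").lower()
    if synonyms:
        canonical = synonyms.get(canonical, canonical)
    return canonical

print(normalize("U.S.A") == normalize("USA"))
print(normalize("Anti-discriminatory"))
print(normalize("Car", synonyms={"car": "automobile"}))
```

Deleting characters is cheap and predictable, but it can also merge tokens that should stay distinct, so the rules need to be chosen with the collection in mind.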
13Normalization sensitive to query
- Query term → Terms that should match
- Windows → Windows
- windows → Windows, windows, window
- window → window, windows
14Capitalization/case folding
- Good for
- Allowing instances of Automobile at the beginning of a sentence to match a query for automobile
- Helping a search engine when most users type ferrari when they are interested in a Ferrari car
- Bad for
- Proper names vs common nouns
- General Motors, Associated Press, Black
- Heuristic solution: lowercase only words at the beginning of the sentence; true casing via machine learning
- In IR, lowercasing is most practical because of the way users issue their queries
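The sentence-initial heuristic can be sketched with a single regular expression. This is a rough illustration: the sentence boundary rule (start of text, or ".", "!", "?" followed by whitespace) is an assumption, and it will misfire on abbreviations such as "Mr." and on proper names that happen to start a sentence.

```python
import re

# A sentence start is the beginning of the text or whitespace after . ! ?
SENTENCE_START = re.compile(r"(^|[.!?]\s+)(\w+)")

def lowercase_sentence_starts(text):
    """Lowercase only the first word of each sentence; leave other words alone."""
    return SENTENCE_START.sub(lambda m: m.group(1) + m.group(2).lower(), text)

print(lowercase_sentence_starts(
    "Automobile sales rose at General Motors. The Ferrari sold well."
))
```

Mid-sentence capitalization, as in General Motors and Ferrari, is preserved, while Automobile and The are folded to lowercase.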
15Other languages
- 60% of web pages are in English
- Less than one third of Internet users speak English
- Less than 10% of the world's population primarily speak English
- Only about one third of blog posts are in English
16Stemming and lemmatization
- Organize, organizes, organizing
- Democracy, democratic, democratization
- Am, are, is → be
- Car, cars, car's, cars' → car
17- Stemming
- Crude heuristic process that chops off the ends of the words
- Democratic → democa
- Lemmatization
- Use of vocabulary and morphological analysis, returns the base form of a word (lemma)
- Democratic → democracy
- Sang → sing
18Porter stemmer
- Most common algorithm for stemming English
- 5 phases of word reduction
- SSES → SS
- caresses → caress
- IES → I
- ponies → poni
- SS → SS
- S → (removed)
- cats → cat
- EMENT → (removed, only when the remaining stem is long enough)
- replacement → replac
- cement → cement
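The rules listed above can be sketched in code. This covers only the plural rules and the EMENT rule, not the full Porter algorithm, and the function names are hypothetical. In Porter's algorithm the EMENT suffix is removed only when the measure m of the remaining stem is greater than 1, where m counts the vowel-consonant sequences in the stem. That condition is why replacement is stemmed and cement is not.

```python
def is_consonant(word, i):
    """Porter's definition: a, e, i, o, u are vowels; y is a vowel after a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Number of vowel-consonant sequences m in a stem of the form [C](VC)^m[V]."""
    m = 0
    previous_was_vowel = False
    for i in range(len(stem)):
        consonant = is_consonant(stem, i)
        if consonant and previous_was_vowel:
            m += 1  # a vowel run has just been closed by a consonant
        previous_was_vowel = not consonant
    return m

def strip_plural(word):
    """Apply the first matching plural rule: SSES->SS, IES->I, SS->SS, S->''."""
    if word.endswith("sses"):
        return word[:-2]
    if word.endswith("ies"):
        return word[:-2]
    if word.endswith("ss"):
        return word
    if word.endswith("s"):
        return word[:-1]
    return word

def strip_ement(word):
    """Remove EMENT only if the remaining stem has measure m > 1."""
    if word.endswith("ement"):
        stem = word[:-5]
        if measure(stem) > 1:
            return stem
    return word

for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", strip_plural(w))
for w in ["replacement", "cement"]:
    print(w, "->", strip_ement(w))
```

For replacement the stem replac has m = 2, so the suffix is removed. For cement the stem c has m = 0, so the word is left unchanged.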
21Vocabulary size
- Dictionaries
- 600,000 words
- But they do not include names of people, locations, products, etc.
22Heaps' law: estimating the number of terms
- $M = k T^b$
- M: vocabulary size (number of terms)
- T: number of tokens
- Typical values: 30 < k < 100, b ≈ 0.5
- Linear relation between vocabulary size and number of tokens in log-log space
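Heaps' law is usually written as M = k T^b. Taking logarithms gives log M = log k + b log T, which is the straight line in log-log space. The sketch below evaluates the law and recovers k and b from (T, M) measurements by least squares on the logarithms. The default parameter values, the function names, and the synthetic sample points are assumptions chosen for illustration, not values fitted to any real collection.

```python
import math

def heaps_vocabulary(num_tokens, k=50.0, b=0.5):
    """Estimated vocabulary size M = k * T**b (k and b are illustrative defaults)."""
    return k * num_tokens ** b

def fit_heaps(samples):
    """Fit k and b from (T, M) pairs by linear regression of log M on log T."""
    xs = [math.log(t) for t, _ in samples]
    ys = [math.log(m) for _, m in samples]
    n = len(samples)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                       # slope of the log-log line
    k = math.exp(mean_y - b * mean_x)   # intercept is log k
    return k, b

print(heaps_vocabulary(1_000_000))  # 50000.0

# Synthetic measurements generated from the law itself (assumption).
samples = [(t, heaps_vocabulary(t)) for t in (10**3, 10**4, 10**5, 10**6)]
print(fit_heaps(samples))
```

On real data the points scatter around the line, so the fitted k and b depend on the collection and on how the text was tokenized and normalized.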
24Zipf's law: modeling the distribution of terms
- The collection frequency of the i-th most common term is proportional to 1/i: $cf_i \propto 1/i$
- If the most frequent term occurs $cf_1$ times, then the second most frequent term has half as many occurrences, the third most frequent term has a third as many, etc.
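Under Zipf's law the expected frequency at rank i is cf_1 / i, so the product of rank and frequency should stay roughly constant. The sketch below computes the expected frequencies and the rank-frequency products for a list of tokens. The function names and the toy token list are assumptions for illustration.

```python
from collections import Counter

def zipf_expected_frequencies(cf1, num_terms):
    """Expected collection frequencies cf_i = cf1 / i for ranks 1..num_terms."""
    return [cf1 / rank for rank in range(1, num_terms + 1)]

def rank_frequency_products(tokens):
    """Return rank * frequency for each term, ordered from most to least frequent."""
    counts = Counter(tokens).most_common()
    return [rank * count for rank, (_, count) in enumerate(counts, start=1)]

print(zipf_expected_frequencies(1200, 4))  # [1200.0, 600.0, 400.0, 300.0]

# Toy collection whose frequencies (6, 3, 2) follow the law exactly.
tokens = ["a"] * 6 + ["b"] * 3 + ["c"] * 2
print(rank_frequency_products(tokens))  # [6, 6, 6]
```

On a real collection the products are only approximately constant, and the fit is typically worst at the very top and very bottom ranks.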
26Problems with the normalization
- A change in the stop word list can dramatically alter term weightings
- A document may contain an outlier term