Vocabulary size and term distribution: tokenization, text normalization and stemming - PowerPoint PPT Presentation

Slides: 27

Provided by: nenk

Learn more at: https://www.cis.upenn.edu

Transcript and Presenter's Notes

1
Vocabulary size and term distribution
tokenization, text normalization and stemming

Lecture 2

2
Overview

Getting started
tokenization, stemming, compounds
end of sentence
Collection vocabulary
Terms, tokens, types
Vocabulary size
Term distribution
Stop words
Vector representation of text and term weighting

3
Tokenization

Input: Friends, Romans, Countrymen, lend me your ears
Output: Friends | Romans | Countrymen | lend | me | your | ears
Token: an instance of a sequence of characters
that are grouped together as a useful semantic
unit for processing
Type: the class of all tokens containing the same
character sequence
Term: a type that is included in the system
dictionary (normalized)
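
A minimal Python sketch of the token / type / term distinction above. The regular expression and the lowercasing step are illustrative assumptions, not the tokenizer or normalizer used in the lecture.

```python
import re

def tokenize(text):
    """Split text into tokens: runs of letters, digits, or apostrophes (an assumed rule)."""
    return re.findall(r"[A-Za-z0-9']+", text)

tokens = tokenize("Friends, Romans, Countrymen, lend me your ears")

# Tokens count every occurrence; types are the distinct character sequences.
sample = tokenize("To be or not to be")
types = set(sample)

# Terms are types after normalization (here only lowercasing, as an assumption).
terms = {t.lower() for t in types}

print(tokens)
print(len(sample), len(types), len(terms))
```

In the second example "To" and "to" are two types but collapse into one term once lowercased, so the three counts differ.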

The cat slept peacefully in the living room. It's
a very old cat.

Mr. O'Neill thinks that the boys' stories about
Chile's capital aren't amusing.
How to handle special cases involving
apostrophes, hyphens etc?
C++, C#, URLs, emails, phone numbers, dates
San Francisco, Los Angeles

Issues of tokenization are language specific
Requires the language to be known
Language identification based on classifiers that
use short character subsequences as features is
highly effective
Most languages have distinctive signature patterns
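
A toy Python sketch of language identification from short character subsequences. It stands in for a trained classifier: each language gets a profile of character trigrams, and the text is assigned to the profile it overlaps most. The two sample sentences, the trigram length, and the function names are all assumptions made for illustration.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams in the lowercased text."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

# Hypothetical one-sentence "training data" per language.
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog and the cat",
    "german": "der schnelle braune fuchs springt auf den faulen hund und die katze",
}
PROFILES = {lang: char_ngrams(sample) for lang, sample in SAMPLES.items()}

def identify_language(text, profiles=PROFILES, n=3):
    """Pick the language whose profile shares the most n-grams with the text."""
    grams = char_ngrams(text, n)
    return max(profiles, key=lambda lang: len(grams & profiles[lang]))

print(identify_language("the dog and the fox"))
print(identify_language("der hund und die katze"))
```

A realistic system would train on far more text and use a probabilistic model over the n-gram features; the overlap count here only shows why distinctive character patterns are informative.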

7
Very important for information retrieval

Splitting tokens on spaces can cause bad
retrieval results
A search for York University returns pages
containing New York University
German compound nouns
Retrieval systems for German greatly benefit from
the use of a compound-splitter module
Checks if a word can be subdivided into words
that appear in the vocabulary
East Asian Languages (Chinese, Japanese, Korean,
Thai)
Text is written without any spaces between words
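
A possible sketch of the compound-splitter check described above: try every split point and accept a split only if all parts appear in the vocabulary. The function name, the minimum part length, and the two-word vocabulary are hypothetical. Linking elements such as the German "s" between parts are not handled.

```python
def split_compound(word, vocabulary, min_len=3):
    """Return vocabulary words that concatenate to `word`, or None if there is no such split."""
    word = word.lower()
    if word in vocabulary:
        return [word]
    # Try each split point, keeping both parts at least min_len characters long.
    for i in range(min_len, len(word) - min_len + 1):
        head = word[:i]
        if head in vocabulary:
            rest = split_compound(word[i:], vocabulary, min_len)
            if rest is not None:
                return [head] + rest
    return None

vocabulary = {"computer", "linguistik"}  # toy vocabulary for illustration
print(split_compound("Computerlinguistik", vocabulary))
print(split_compound("xyzabc", vocabulary))
```

The sketch returns the first split it finds; a production splitter would typically rank competing splits, for example by the frequency of the parts.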

9
Stop words

Very common words that have no discriminatory
power

10
Building a stop word list

Sort terms by collection frequency and take the
most frequent
In a collection about insurance practices,
insurance would be a stop word
Why do we need stop lists?
Smaller indices for information retrieval
Better approximation of importance for
summarization etc.
Their use is problematic in phrase searches
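
The recipe above (sort terms by collection frequency, take the most frequent) can be sketched in a few lines of Python. Whitespace tokenization, lowercasing, and the three-document collection are assumptions made to keep the example short.

```python
from collections import Counter

def build_stop_list(documents, size):
    """Return the `size` terms with the highest collection frequency."""
    counts = Counter()
    for doc in documents:
        counts.update(doc.lower().split())
    return [term for term, _ in counts.most_common(size)]

# Toy insurance-flavored collection (hypothetical data).
docs = [
    "the insurance policy",
    "the insurance claim",
    "the premium",
]
print(build_stop_list(docs, 2))
```

On this toy collection "insurance" lands on the stop list next to "the", matching the point that stop words depend on the collection.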

Trend in IR systems over time
Large stop lists (200-300 terms)
Very small stop lists (7-12 terms)
No stop list whatsoever
The 30 most common words account for 30% of the
tokens in written text
Good compression techniques for indices
Term weighting leads to very common words having
little impact on document representation

12
Normalization

Token normalization
Canonicalizing tokens so that matches occur
despite superficial differences in the character
sequences of the tokens
U.S.A vs USA
Anti-discriminatory vs antidiscriminatory
Car vs automobile?
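
A small sketch of character-level token normalization, assuming the rule is simply to drop periods and hyphens. The function name is hypothetical. Such a rule handles the first two examples above but cannot relate car to automobile, which would need a synonym list or similar resource.

```python
def normalize_token(token):
    """Drop periods and hyphens so superficially different spellings match."""
    return token.replace(".", "").replace("-", "")

print(normalize_token("U.S.A") == normalize_token("USA"))
print(normalize_token("Anti-discriminatory"))
print(normalize_token("car") == normalize_token("automobile"))
```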

13
Normalization sensitive to query

Query term    Terms that should match
Windows       Windows
windows       Windows, windows, window
window        window, windows

14
Capitalization/case folding

Good for
Allow instances of Automobile at the beginning of
a sentence to match with a query of automobile
Helps a search engine when most users type
ferrari when they are interested in a Ferrari car
Bad for
Proper names vs common nouns
General Motors, Associated Press, Black
Heuristic solution: lowercase only words at the
beginning of the sentence; truecasing via
machine learning
In IR, lowercasing is most practical because of
the way users issue their queries
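
A rough sketch of the sentence-initial heuristic mentioned above: lowercase only the first character of each sentence and leave all other capitalization alone. The sentence splitter is a naive assumption (it breaks after ".", "!" or "?" followed by whitespace), so abbreviations such as "Mr." would be split incorrectly.

```python
import re

def fold_sentence_initial(text):
    """Lowercase the first character of each sentence; keep other capitals."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    folded = [s[0].lower() + s[1:] for s in sentences if s]
    return " ".join(folded)

print(fold_sentence_initial("Automobile sales rose. General Motors reported gains."))
```

The example also shows the limit of the heuristic: "General" is lowercased because it starts a sentence, even though it is part of a proper name. Truecasing is meant to address cases like this.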

15
Other languages

60% of web pages are in English
Less than one third of Internet users speak
English
Less than 10% of the world's population primarily
speak English
Only about one third of blog posts are in English

16
Stemming and lemmatization

Organize, organizes, organizing
Democracy, democratic, democratization
am, are, is → be
car, cars, car's, cars' → car

Stemming
Crude heuristic process that chops off the ends
of the words
Democratic → democa
Lemmatization
Use of vocabulary and morphological analysis,
returns the base form of a word (lemma)
Democratic → democracy
Sang → sing

18
Porter stemmer

Most common algorithm for stemming English
5 phases of word reduction
SSES → SS
caresses → caress
IES → I
ponies → poni
SS → SS
S → (removed)
cats → cat
(m > 1) EMENT → (removed)
replacement → replac
cement → cement (the remaining stem "c" is too
short, so the rule does not apply)
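
A partial Python sketch of the rules listed above. It covers only these suffix rules, not the full five-phase Porter algorithm. The condition on EMENT (strip only when the measure m of the remaining stem exceeds 1, where m counts vowel-consonant sequences) follows Porter's published definition; the function names are hypothetical.

```python
def is_consonant(word, i):
    """Porter's definition: not a, e, i, o, u, and not y preceded by a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Count vowel-consonant (VC) sequences in the stem."""
    m = 0
    prev_vowel = False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1
        prev_vowel = not cons
    return m

def plural_rules(word):
    """Apply the first matching rule; longer suffixes are tested first."""
    if word.endswith("sses"):
        return word[:-2]   # SSES -> SS
    if word.endswith("ies"):
        return word[:-2]   # IES -> I
    if word.endswith("ss"):
        return word        # SS -> SS
    if word.endswith("s"):
        return word[:-1]   # S -> (removed)
    return word

def strip_ement(word):
    """Remove EMENT only if the remaining stem has measure m > 1."""
    if word.endswith("ement"):
        stem = word[:-5]
        if measure(stem) > 1:
            return stem
    return word

for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, plural_rules(w))
for w in ["replacement", "cement"]:
    print(w, strip_ement(w))
```

Testing the longest suffix first matters: "caresses" must match SSES before the bare S rule gets a chance to fire.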

21
Vocabulary size

Dictionaries
600,000 words
But they do not include names of people,
locations, products etc

22
Heaps' law: estimating the number of terms

M = k T^b
M: vocabulary size (number of terms)
T: number of tokens
30 < k < 100, b ≈ 0.5
Linear relation between vocabulary size and number
of tokens in log-log space
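
Heaps' law is usually written M = k T^b, which becomes the straight line log M = log k + b log T in log-log space. A minimal Python sketch of the estimate follows; the value k = 50 is an arbitrary choice inside the range quoted above, and the function name is hypothetical.

```python
import math

def heaps_vocabulary_size(num_tokens, k=50, b=0.5):
    """Estimate vocabulary size M = k * T**b for a collection of T tokens."""
    return k * num_tokens ** b

M = heaps_vocabulary_size(1_000_000)
print(M)

# The same relation in log-log space: log M = log k + b * log T.
log_M = math.log(50) + 0.5 * math.log(1_000_000)
print(math.isclose(math.log(M), log_M))
```

With these assumed parameters, a collection of one million tokens is estimated to contain about 50,000 distinct terms.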
24
Zipf's law: modeling the distribution of terms

The collection frequency of the i-th most common
term is proportional to 1/i
If the most frequent term occurs cf_1 times, then the
second most frequent term has half as many
occurrences, the third most frequent term has a
third as many, etc.
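
The 1/i relation can be tabulated directly. In this sketch the count of 1200 for the most frequent term is a made-up number, and the function name is hypothetical.

```python
def zipf_expected_frequencies(cf1, n):
    """Expected collection frequency of the 1st..n-th most common terms: cf1 / i."""
    return [cf1 / i for i in range(1, n + 1)]

print(zipf_expected_frequencies(1200, 4))
```

For cf_1 = 1200 the next three terms are expected to occur 600, 400 and 300 times.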

26
Problems with the normalization