Vocabulary size and term distribution: tokenization, text normalization and stemming

Slides: 27
Provided by: nenk

1
Vocabulary size and term distribution
tokenization, text normalization and stemming
  • Lecture 2

2
Overview
  • Getting started
  • tokenization, stemming, compounds
  • end of sentence
  • Collection vocabulary
  • Terms, tokens, types
  • Vocabulary size
  • Term distribution
  • Stop words
  • Vector representation of text and term weighting

3
Tokenization
  • Input: Friends, Romans, Countrymen, lend me your ears
  • Output: Friends | Romans | Countrymen | lend | me | your | ears
  • Token: an instance of a sequence of characters
    that are grouped together as a useful semantic
    unit for processing
  • Type: the class of all tokens containing the same
    character sequence
  • Term: a type that is included in the system
    dictionary (normalized)
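
The distinction between tokens, types and terms can be illustrated with a small sketch. It assumes a deliberately naive tokenizer (runs of ASCII letters) and treats lowercasing as the only normalization step; the function name is hypothetical.

```python
import re

def count_tokens_types_terms(text):
    """Return (number of tokens, number of types, number of terms).

    Assumptions: tokens are runs of ASCII letters, and the only
    normalization applied to obtain terms is lowercasing.
    """
    tokens = re.findall(r"[A-Za-z]+", text)   # every occurrence counts
    types = set(tokens)                       # distinct character sequences
    terms = {t.lower() for t in types}        # normalized types
    return len(tokens), len(types), len(terms)

print(count_tokens_types_terms("The cat saw the cat"))
```

Here "The" and "the" are two types but collapse into one term after lowercasing, while "cat" contributes two tokens but only one type.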

4
  • The cat slept peacefully in the living room. It's
    a very old cat.

5
  • Mr. O'Neill thinks that the boys' stories about
    Chile's capital aren't amusing.
  • How to handle special cases involving
    apostrophes, hyphens etc.?
  • C++, C#, URLs, emails, phone numbers, dates
  • San Francisco, Los Angeles
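
One possible way to handle some of these special cases is a regular expression that tries the most specific patterns first. This is only a sketch under assumptions: the patterns below are illustrative rather than complete, phone numbers, dates and multiword names such as San Francisco are not handled, and example.org is a placeholder domain.

```python
import re

# Alternatives are tried left to right, so specific patterns come first.
TOKEN_PATTERN = re.compile(
    r"""
    https?://\S+                      # URLs (may swallow trailing punctuation)
  | [\w.+-]+@[\w-]+(?:\.[\w-]+)+      # email addresses
  | [A-Za-z]\+\+ | [A-Za-z]\#         # names such as C++ and C#
  | \w+(?:['-]\w+)*                   # words with internal apostrophes or hyphens
    """,
    re.VERBOSE,
)

def tokenize(text):
    """Return the list of tokens matched by TOKEN_PATTERN, in order."""
    return TOKEN_PATTERN.findall(text)

print(tokenize("Mr. O'Neill thinks that the boys' stories about "
               "Chile's capital aren't amusing."))
print(tokenize("C++ and C# at http://example.org"))
```

Note the design choice: internal apostrophes are kept (O'Neill, aren't), but a trailing apostrophe as in boys' is dropped. Whether that is desirable depends on the normalization applied later.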

6
  • Issues of tokenization are language specific
  • Requires the language to be known
  • Language identification based on classifiers that
    use short character subsequences as features is
    highly effective
  • Most languages have distinctive signature patterns
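
A toy illustration of language identification with short character subsequences: each language gets a profile of character trigrams, and a text is assigned to the language whose profile it overlaps most. The two one-sentence training samples and all names are assumptions for illustration; a real classifier would be trained on far more data.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of the lowercased text."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

# Hypothetical, tiny training samples (one sentence per language).
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog and the cat",
    "german": "der schnelle braune fuchs springt über den faulen hund und die katze",
}
PROFILES = {lang: char_ngrams(sample) for lang, sample in SAMPLES.items()}

def identify_language(text, profiles=PROFILES):
    """Pick the language whose trigram profile shares the most trigrams with text."""
    grams = char_ngrams(text)
    return max(profiles, key=lambda lang: len(grams & profiles[lang]))

print(identify_language("the dog and the fox"))
print(identify_language("der hund und der fuchs"))
```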

7
Very important for information retrieval
  • Splitting tokens on spaces can cause bad
    retrieval results
  • A search for York University returns pages
    containing New York University
  • German compound nouns
  • Retrieval systems for German greatly benefit from
    the use of a compound-splitter module
  • Checks if a word can be subdivided into words
    that appear in the vocabulary
  • East Asian Languages (Chinese, Japanese, Korean,
    Thai)
  • Text is written without any spaces between words
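
The compound-splitting check described for German can be sketched as a recursive search for a segmentation into vocabulary words. This is a minimal sketch with a toy vocabulary; it ignores German linking elements (such as the "s" between parts of some compounds), and the function name and the `min_len` parameter are assumptions.

```python
def split_compound(word, vocabulary, min_len=3):
    """Return a list of vocabulary words that concatenate to `word`, or None.

    A word that is itself in the vocabulary is returned unsplit.
    `min_len` keeps very short fragments from being treated as parts.
    """
    word = word.lower()
    if word in vocabulary:
        return [word]
    for i in range(min_len, len(word) - min_len + 1):
        head, tail = word[:i], word[i:]
        if head in vocabulary:
            rest = split_compound(tail, vocabulary, min_len)
            if rest is not None:
                return [head] + rest
    return None

# Toy vocabulary, for illustration only.
vocabulary = {"computer", "linguistik", "haus", "tür"}
print(split_compound("Computerlinguistik", vocabulary))
print(split_compound("Haustür", vocabulary))
```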

9
Stop words
  • Very common words that have no discriminatory
    power

10
Building a stop word list
  • Sort terms by collection frequency and take the
    most frequent
  • In a collection about insurance practices,
    insurance would be a stop word
  • Why do we need stop lists?
  • Smaller indices for information retrieval
  • Better approximation of importance for
    summarization etc.
  • Stop lists are problematic for phrase searches
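
Building a stop list by collection frequency can be sketched in a few lines. The sketch assumes whitespace tokenization and lowercasing, and the three documents are invented for illustration.

```python
from collections import Counter

def build_stop_list(documents, size):
    """Return the `size` terms with the highest collection frequency."""
    counts = Counter()
    for doc in documents:
        counts.update(doc.lower().split())   # count every occurrence
    return [term for term, _ in counts.most_common(size)]

# Invented mini-collection about insurance.
docs = ["the insurance policy", "the insurance claim", "the premium"]
print(build_stop_list(docs, 2))
```

In this toy collection "insurance" lands on the stop list next to "the", which mirrors the point that stop words depend on the collection.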

11
  • Trend in IR systems over time
  • Large stop lists (200-300 terms)
  • Very small stop lists (7-12 terms)
  • No stop list whatsoever
  • The 30 most common words account for 30% of the
    tokens in written text
  • Good compression techniques for indices
  • Term weighting leads to very common words having
    little impact on document representation

12
Normalization
  • Token normalization
  • Canonicalizing tokens so that matches occur
    despite superficial differences in the character
    sequences of the tokens
  • U.S.A. vs USA
  • Anti-discriminatory vs antidiscriminatory
  • Car vs automobile?
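
A minimal sketch of token normalization that covers the first two examples: it lowercases and deletes periods and hyphens, so superficially different tokens map to the same term. The rule set is an assumption chosen for illustration; matching car with automobile would need a synonym list, which this sketch does not attempt.

```python
def normalize(token):
    """Map a token to a canonical form by lowercasing and
    removing periods and hyphens (illustrative rules only)."""
    return token.lower().replace(".", "").replace("-", "")

print(normalize("U.S.A.") == normalize("USA"))
print(normalize("Anti-discriminatory") == normalize("antidiscriminatory"))
print(normalize("Car") == normalize("automobile"))
```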

13
Normalization sensitive to query
  • Query term → terms that should match
  • Windows → Windows
  • windows → Windows, windows, window
  • window → window, windows

14
Capitalization/case folding
  • Good for
  • Allow instances of Automobile at the beginning of
    a sentence to match with a query of automobile
  • Helps a search engine when most users type
    ferrari when they are interested in a Ferrari car
  • Bad for
  • Proper names vs common nouns
  • General Motors, Associated Press, Black
  • Heuristic solutions: lowercase only words at the
    beginning of a sentence, or apply truecasing via
    machine learning
  • In IR, lowercasing is most practical because of
    the way users issue their queries
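
The sentence-initial heuristic can be sketched as follows. The sketch assumes the text is already tokenized with sentence-final punctuation as separate tokens; the function name is hypothetical.

```python
SENTENCE_END = {".", "!", "?"}

def fold_sentence_initial(tokens):
    """Lowercase only capitalized tokens that start a sentence."""
    folded = []
    at_start = True
    for tok in tokens:
        if at_start and tok[:1].isupper():
            folded.append(tok.lower())
        else:
            folded.append(tok)
        at_start = tok in SENTENCE_END   # next token starts a new sentence
    return folded

tokens = ["Automobile", "sales", "rose", ".", "General", "Motors", "gained", "."]
print(fold_sentence_initial(tokens))
```

The example also shows the weakness of the heuristic: "Automobile" is folded as intended, but so is "General" in "General Motors", because a proper name happens to begin the sentence.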

15
Other languages
  • 60% of web pages are in English
  • Less than one third of Internet users speak
    English
  • Less than 10% of the world's population primarily
    speaks English
  • Only about one third of blog posts are in English

16
Stemming and lemmatization
  • Organize, organizes, organizing
  • Democracy, democratic, democratization
  • Am, are, is → be
  • Car, cars, car's, cars' → car

17
  • Stemming
  • Crude heuristic process that chops off the ends
    of the words
  • Democratic → democa
  • Lemmatization
  • Use of vocabulary and morphological analysis,
    returns the base form of a word (lemma)
  • Democratic → democracy
  • Sang → sing

18
Porter stemmer
  • Most common algorithm for stemming English
  • 5 phases of word reduction
  • SSES → SS
  • caresses → caress
  • IES → I
  • ponies → poni
  • SS → SS
  • S → (nothing)
  • cats → cat
  • EMENT → (nothing)
  • replacement → replac
  • cement → cement
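
The rules listed on this slide can be sketched directly. This is not a full Porter stemmer: it covers only the four plural rules and the EMENT rule, assumes lowercase input, and applies the EMENT rule under Porter's condition that the remaining stem has measure m > 1 (which is why "cement" is left alone). Function names are hypothetical.

```python
import re

VOWELS = "aeiou"

def _cv_pattern(stem):
    """Map each letter to C (consonant) or V (vowel).
    'y' counts as a vowel only when preceded by a consonant."""
    pattern = []
    for i, ch in enumerate(stem):
        if ch in VOWELS:
            pattern.append("V")
        elif ch == "y" and i > 0 and pattern[i - 1] == "C":
            pattern.append("V")
        else:
            pattern.append("C")
    return "".join(pattern)

def measure(stem):
    """Porter's m: the number of VC sequences in [C](VC)^m[V]."""
    return len(re.findall(r"V+C+", _cv_pattern(stem)))

def strip_plural(word):
    """Apply the first matching rule: SSES->SS, IES->I, SS->SS, S->''."""
    if word.endswith("sses"):
        return word[:-2]
    if word.endswith("ies"):
        return word[:-2]
    if word.endswith("ss"):
        return word
    if word.endswith("s"):
        return word[:-1]
    return word

def strip_ement(word):
    """Remove EMENT only if the remaining stem has measure m > 1."""
    if word.endswith("ement"):
        stem = word[:-5]
        if measure(stem) > 1:
            return stem
    return word

for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", strip_plural(w))
for w in ["replacement", "cement"]:
    print(w, "->", strip_ement(w))
```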

21
Vocabulary size
  • Dictionaries
  • 600,000 words
  • But they do not include names of people,
    locations, products etc

22
Heaps' law: estimating the number of terms
  • M = kT^b
  • M: vocabulary size (number of terms)
  • T: number of tokens
  • Typical values: 30 < k < 100, b ≈ 0.5
  • Linear relation between vocabulary size and number
    of tokens in log-log space
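
A small numeric sketch of Heaps' law in its usual form, M = kT^b. The defaults k = 50 and b = 0.5 are assumptions picked from the ranges given on the slide, not values fitted to any collection.

```python
import math

def heaps_vocabulary_size(num_tokens, k=50.0, b=0.5):
    """Estimate vocabulary size M = k * T**b for T = num_tokens."""
    return k * num_tokens ** b

# With these assumed parameters, one million tokens give about 50,000 terms.
print(heaps_vocabulary_size(1_000_000))

# In log-log space the relation is a straight line with slope b:
# log M = log k + b * log T
T = 1_000_000
print(math.log(heaps_vocabulary_size(T)), math.log(50.0) + 0.5 * math.log(T))
```
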
24
Zipf's law: modeling the distribution of terms
  • The collection frequency of the ith most common
    term is proportional to 1/i
  • If the most frequent term occurs cf_1 times, then
    the second most frequent term has half as many
    occurrences, the third most frequent term has a
    third as many, etc.
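
The frequencies predicted by Zipf's law follow directly from the frequency of the top-ranked term. A minimal sketch, with an invented value for that top frequency:

```python
def zipf_expected_frequencies(cf1, num_ranks):
    """Return the collection frequencies predicted for ranks 1..num_ranks,
    given that the most frequent term occurs cf1 times."""
    return [cf1 / rank for rank in range(1, num_ranks + 1)]

# Invented example: the most frequent term occurs 60,000 times.
print(zipf_expected_frequencies(60_000, 3))
```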

26
Problems with the normalization
  • A change in the stop word list can dramatically
    alter term weightings
  • A document may contain an outlier term