Stemming - PowerPoint PPT Presentation

CS5286 Search Engine Technology and Algorithms / Xiaotie Deng

Title: Stemming


1
Lecture 6 Linguistic Methods for Searching
  • Stemming
  • Thesaurus
  • Online resources
  • Automatic construction of thesaurus

2
Outline of Stemming Methods
  • Goal of Stemming Process
  • Algorithm
  • Affix Removal (Porter's Algorithm)
  • Dictionary Look-up Stemmers
  • Successor Variety
  • n-Gram Stemming
  • Applications

3
The advantage
  • Originally designed to improve performance by
    reducing the requirement on system resources.
  • With the continued significant increase in
    storage and computing power, the use of stemming for
    performance reasons is no longer as important.

4
Other Potentials
  • It may improve recall.
  • There may be an associated decline in precision.
  • System designers make their own choice of
    whether to include stemming.
  • Google does not use stemming.
  • HotBot offers word stemming as a user option.

5
Porter Stemming Algorithm
  • The Porter algorithm is the most commonly
    accepted algorithm.
  • It is based on a set of conditions on the stem,
    suffix and prefix, and on the actions associated
    with each condition.
  • See, e.g.,
  • http://www.tartarus.org/martin/PorterStemmer/

6
Porter Stemming (Condition)
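  • m, the measure of a stem, counts the vowel-consonant
    sequences in it. Vowels are a, e, i, o, u, and y when
    it follows a consonant.
  • Every stem has the form [C](VC)^m[V], where C is a run
    of consonants, V is a run of vowels, the initial C and
    final V are optional, and m is the number of times VC
    repeats.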
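
The measure can be computed by collapsing a stem into alternating runs of consonants and vowels and counting the VC pairs. The sketch below is a minimal illustration, not the reference implementation; the function names are made up for this example, and it assumes Porter's convention that y counts as a vowel only when it follows a consonant.

```python
def is_consonant(word, i):
    """True if word[i] is a consonant under Porter's convention."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # y is a consonant at the start of a word or after a vowel
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Return m, the number of VC sequences in [C](VC)^m[V]."""
    pattern = []
    for i in range(len(stem)):
        kind = "C" if is_consonant(stem, i) else "V"
        # collapse runs of the same kind into a single symbol
        if not pattern or pattern[-1] != kind:
            pattern.append(kind)
    return "".join(pattern).count("VC")


for stem in ["tree", "trouble", "troubles"]:
    print(stem, measure(stem))  # tree 0, trouble 1, troubles 2
```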

7
Porter Stemming (Condition)
  • *<X> - stem ends with the letter X
  • *v* - stem contains a vowel
  • *d - stem ends in a double consonant
  • *o - stem ends with a consonant-vowel-consonant
    sequence where the final consonant is not w, x,
    or y
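
Each of these conditions is a simple predicate on the stem. A minimal sketch follows, with hypothetical function names and the same assumption about y (a vowel only after a consonant).

```python
def is_consonant(word, i):
    """True if word[i] is a consonant (y is a vowel only after a consonant)."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def ends_with(stem, letter):
    """Condition *<X>: the stem ends with the given letter."""
    return stem.endswith(letter)


def contains_vowel(stem):
    """Condition *v*: the stem contains a vowel."""
    return any(not is_consonant(stem, i) for i in range(len(stem)))


def ends_double_consonant(stem):
    """Condition *d: the stem ends in a double consonant."""
    n = len(stem)
    return n >= 2 and stem[-1] == stem[-2] and is_consonant(stem, n - 1)


def ends_cvc(stem):
    """Condition *o: ends consonant-vowel-consonant, final one not w, x or y."""
    n = len(stem)
    return (
        n >= 3
        and is_consonant(stem, n - 3)
        and not is_consonant(stem, n - 2)
        and is_consonant(stem, n - 1)
        and stem[-1] not in "wxy"
    )
```

For instance, `ends_cvc("hop")` holds, while `ends_cvc("snow")` does not because the final consonant is w.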

8
Rules

9
Rules (continued)

10
Example
  • duplicatable
  • duplicat (rule 4)
  • duplicate (rule 1b1)
  • duplic (rule 3)

11
Dictionary Look-Up Stemmer
  • A dictionary contains the pairing of a word and
    its stem for all the words.
  • The structure of the dictionary should be well
    designed for speeding up the search

TERM         STEM
computer     comput
compute      comput
computation  comput
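
A hash table is one structure that makes the look-up fast. The sketch below uses a Python dict holding the three pairs from the table; returning the word unchanged when it is missing from the dictionary is an assumption made for illustration.

```python
# Word-to-stem pairs taken from the table above.
STEM_TABLE = {
    "computer": "comput",
    "compute": "comput",
    "computation": "comput",
}


def lookup_stem(word, table=STEM_TABLE):
    """Return the stored stem, or the word itself if it is not listed."""
    return table.get(word.lower(), word)


print(lookup_stem("Computation"))  # comput
print(lookup_stem("engine"))       # engine (not in the dictionary)
```
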
12
Successor Variety Stemming
  • Hafer and Weiss (1974), "Word segmentation by
    letter successor varieties," Information Storage
    and Retrieval 10, 371-385.
  • Main idea: determine word and morpheme boundaries
    based on the distribution of phonemes in a large
    body of utterances.

13
Note
  • Morpheme: the smallest meaningful part into which a
    word can be divided
  • run-s contains two morphemes
  • un-like-ly contains three morphemes
  • Phoneme: a unit of the system of sounds in a
    language
  • English has 24 consonant phonemes

14
Overall approach
  • Hafer and Weiss use:
  • letters in place of phonemes
  • texts in place of phonemically transcribed
    utterances

15
Formal Definition
  • Let w be a word of length n.
  • w_i is the length-i prefix of w.
  • Let D be a collection of words.
  • D(w_i) is the subset of D containing the terms whose
    first i letters match w_i exactly.
  • S(w_i), the successor variety of w_i, is the number
    of distinct letters that occupy the (i+1)st
    position of words in D(w_i).
  • A test word of length n has n successor varieties
    S(w_1), S(w_2), ..., S(w_n).
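
The definition translates almost line by line into code. This is a minimal sketch with hypothetical function names, using the small corpus from the lecture's own example. It follows the formal definition literally, so a prefix with no continuation in D gets a successor variety of 0.

```python
def matching_terms(prefix, corpus):
    """D(w_i): the terms of the corpus whose first letters match the prefix."""
    return {term for term in corpus if term.startswith(prefix)}


def successor_variety(prefix, corpus):
    """S(w_i): distinct letters in position i+1 of the terms in D(w_i)."""
    i = len(prefix)
    return len({term[i] for term in matching_terms(prefix, corpus) if len(term) > i})


def successor_varieties(word, corpus):
    """The n values S(w_1), ..., S(w_n) for a test word of length n."""
    return [successor_variety(word[:i], corpus) for i in range(1, len(word) + 1)]


D = {"able", "axle", "accident", "ape", "about", "be"}
print(successor_varieties("able", D))  # [4, 2, 1, 0]
```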

16
Informal Definition
  • The successor variety of a string in a collection
    D of words is the number of different characters
    that follow it in D.
  • That is, it depends on:
  • the string
  • the collection D of words under consideration

17
An example
  • D = {able, axle, accident, ape, about, be}
  • The successor variety for
  • a: 4 (b, x, c, p)
  • ap: 1 (e)
  • app: 0
  • ab: 2 (l, o)
  • b: 1 (e)
  • Using a trie, the successor variety of a string is the
    number of children of the node the string
    reaches in the trie (a terminal node is treated as
    having one child).
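
The trie view can be sketched with nested dictionaries, where each node maps a letter to its child node. The function names are hypothetical. A node with no children is treated as having one child, as stated above, and a string that does not occur in the trie gets 0.

```python
def build_trie(words):
    """Build a trie as nested dicts: each node maps a letter to a child node."""
    root = {}
    for word in words:
        node = root
        for letter in word:
            node = node.setdefault(letter, {})
    return root


def successor_variety(prefix, trie):
    """Number of children of the node reached by the prefix."""
    node = trie
    for letter in prefix:
        if letter not in node:
            return 0  # the prefix does not occur in the corpus
        node = node[letter]
    # a terminal (childless) node is treated as having one child
    return len(node) if node else 1


trie = build_trie(["able", "axle", "accident", "ape", "about", "be"])
for prefix in ["a", "ap", "app", "ab", "b"]:
    print(prefix, successor_variety(prefix, trie))  # 4, 1, 0, 2, 1
```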

18
Trie for the corpus of data D
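  • Node 1 (root): a leads to node 2, b leads to be
  • Node 2 (prefix a): b leads to node 3, x leads to axle,
    c leads to accident, p leads to ape
  • Node 3 (prefix ab): l leads to able, o leads to about
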
19
Segment in Words
  • In a large body of text, the successor variety of a
    substring usually decreases as each character is
    added, until a segment boundary is reached, where it
    rises again.
  • Consider the following example:
  • D = {able, ape, beatable, fixable, read, readable,
    reading, reads, red, rope, ripe}
  • r: 3 (e, i, o)
  • re: 2 (a, d)
  • rea: 1 (d)
  • read: 3 (a, i, s)
  • read is a segment (or stem)

20
Selecting segments of words
  • Cutoff method:
    a boundary is identified whenever some cutoff value
    is reached.
  • Peak and plateau method:
    a segment break is made after a character whose
    successor variety is larger than that of both the
    character immediately before it and the character
    immediately after it.
  • Complete word method:
    a break is made after a segment if the segment is
    a complete word in the corpus.
  • Entropy method:
    the cutoff method applied to entropy values defined
    for words.
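
Two of these methods, cutoff and complete word, can be sketched in a few lines. The function names and the cutoff value of 3 are illustrative assumptions; the corpus is the one from the lecture's read/readable example, and a complete word with no continuation is given a successor variety of 1.

```python
def successor_variety(prefix, corpus):
    """Distinct letters following the prefix; a complete word with none gets 1."""
    i = len(prefix)
    followers = {w[i] for w in corpus if w.startswith(prefix) and len(w) > i}
    if followers:
        return len(followers)
    return 1 if prefix in corpus else 0


def split_at(word, break_positions):
    """Cut the word after each listed prefix length."""
    pieces, start = [], 0
    for pos in sorted(break_positions):
        if start < pos < len(word):
            pieces.append(word[start:pos])
            start = pos
    pieces.append(word[start:])
    return pieces


def cutoff_segments(word, corpus, cutoff):
    """Break after every prefix whose successor variety reaches the cutoff."""
    breaks = [i for i in range(1, len(word))
              if successor_variety(word[:i], corpus) >= cutoff]
    return split_at(word, breaks)


def complete_word_segments(word, corpus):
    """Break after every prefix that is itself a complete word in the corpus."""
    breaks = [i for i in range(1, len(word)) if word[:i] in corpus]
    return split_at(word, breaks)


D = {"able", "ape", "beatable", "fixable", "read", "readable",
     "reading", "reads", "red", "rope", "ripe"}
print(cutoff_segments("readable", D, cutoff=3))   # ['r', 'ead', 'able']
print(complete_word_segments("readable", D))      # ['read', 'able']
```

In this tiny corpus the first letter already reaches the cutoff, so the cutoff method over-segments the word. This shows how sensitive the method is to the chosen threshold.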

21
Peak and Plateau Method
  • D = {able, ape, beatable, fixable, read, readable,
    reading, reads, red, rope, ripe}
  • r: 3 (e, i, o)
  • re: 2 (a, d)
  • rea: 1 (d)
  • read: 3 (a, i, s)
  • reada: 1 (b)
  • readab: 1 (l)
  • readabl: 1 (e)
  • readable: 1 (blank)
  • The successor variety of read is 3, larger than
    that of both rea and reada.
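
The peak test can be written directly from the definition: break after a prefix whose successor variety is strictly larger than that of both neighbouring prefixes. This is a minimal sketch with hypothetical function names. It only checks prefixes that have a neighbour on both sides, and it counts a complete word with no continuation as 1 (the "blank" above).

```python
def successor_variety(prefix, corpus):
    """Distinct letters following the prefix; a complete word with none gets 1."""
    i = len(prefix)
    followers = {w[i] for w in corpus if w.startswith(prefix) and len(w) > i}
    if followers:
        return len(followers)
    return 1 if prefix in corpus else 0


def peak_plateau_segments(word, corpus):
    """Segment a word at peaks of its successor-variety sequence."""
    # sv[k] is the successor variety of the prefix of length k + 1
    sv = [successor_variety(word[:i], corpus) for i in range(1, len(word) + 1)]
    segments, start = [], 0
    for k in range(1, len(word) - 1):
        if sv[k] > sv[k - 1] and sv[k] > sv[k + 1]:
            segments.append(word[start:k + 1])
            start = k + 1
    segments.append(word[start:])
    return segments


D = {"able", "ape", "beatable", "fixable", "read", "readable",
     "reading", "reads", "red", "rope", "ripe"}
print(peak_plateau_segments("readable", D))  # ['read', 'able']
```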

22
Peak and Plateau Method
  • Input: a document of many terms.
  • Output: each term is segmented.
  • E.g., the output for readable is read-able.
23
Stem method of Hafer and Weiss
  • Determine the successor varieties of a word.
  • Use this information to segment the word using
    one of the previous methods (say, peak and plateau).
  • Choose one of the segments as the stem:
  • if (first segment occurs in <= 12 words in the corpus)
  • first segment is stem
  • else // the first segment is probably a prefix
  • second segment is stem
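
The selection rule can be sketched as follows. The function name is hypothetical, and two details are assumptions made for illustration: a segment "occurs in" a word when the word starts with it, and a word with a single segment is returned as its own stem.

```python
def choose_stem(segments, corpus, max_words=12):
    """Pick the stem from a segmented word using the frequency rule."""
    first = segments[0]
    if len(segments) == 1:
        return first  # nothing to choose from
    # count the corpus words that begin with the first segment
    occurrences = sum(1 for word in corpus if word.startswith(first))
    # a segment shared by many words is more likely a prefix than a stem
    return first if occurrences <= max_words else segments[1]


D = {"able", "ape", "beatable", "fixable", "read", "readable",
     "reading", "reads", "red", "rope", "ripe"}
print(choose_stem(["read", "able"], D))  # read
```

With a realistic corpus, a segment such as "un" or "re" would begin far more than 12 words, so the second segment would be chosen instead.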

24
Stem method of Hafer and Weiss
  • Input: a segmented word
  • Output: the stem of the word
  • For example:
  • read-able is the input
  • read is the output
  • // able may be the output instead, depending on
    which branch of the rule applies

25
Accessor Variety Method in Chinese
  • The notion was introduced by Feng, Chen, Zheng, and
    Deng for Chinese word extraction.
  • The idea is similar to successor variety.
  • It is used to segment Chinese text, because it is
    difficult to separate words in Chinese text. In
    comparison, English words are separated by a space
    symbol in text.

26
Definition Accessor Variety
  • We treat each Chinese character as a letter.
  • For each string (a potential word) consisting of
    several characters, we define the successor variety
    as in English.
  • Symmetrically, we also define a predecessor
    variety for each string.
  • A string is considered a word if it has a large
    successor variety and a large predecessor
    variety.
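
Both varieties can be counted in one pass over the text. This is a minimal sketch, not the method from the paper: the function name and the sample sentence are made up for illustration, and occurrences at the very start or end of the text simply contribute no neighbour.

```python
def accessor_varieties(text, candidate):
    """Return (predecessor variety, successor variety) of a string in a text."""
    predecessors, successors = set(), set()
    start = text.find(candidate)
    while start != -1:
        end = start + len(candidate)
        if start > 0:
            predecessors.add(text[start - 1])  # character just before
        if end < len(text):
            successors.add(text[end])          # character just after
        start = text.find(candidate, start + 1)
    return len(predecessors), len(successors)


# Hypothetical sample text with no word separators.
text = "我喜欢苹果他喜欢香蕉她不喜欢梨"
print(accessor_varieties(text, "喜欢"))  # (3, 3): a likely word
print(accessor_varieties(text, "欢苹"))  # (1, 1): an unlikely word
```

Requiring both counts to exceed a threshold, for example by comparing the smaller of the two against it, is one way to apply the rule above.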

27
Testing Results
  • The accessor variety method turns out to be a very
    simple yet efficient way to recognize Chinese
    words when combined with some simple grammar
    rules.
  • For details, look at our paper:
  • http://www.cs.cityu.edu.hk/deng/5286/feng.pdf

28
Word similarity
  • N-gram method
  • Break a word of length n into (n-1) digrams,
    each a substring of two adjacent characters of the
    word.
  • Count the number of distinct digrams.
  • Let A (B) be the number of distinct digrams
    in word 1 (2). Let C be the number of
    distinct digrams shared by word 1 and word
    2.
  • The similarity of the two words is
  • S = 2C/(A+B)

29
Example of Word similarity
  • statistics: st, ta, at, ti, is, st, ti, ic, cs
  • its distinct digrams:
  • at, cs, ic, is, st, ta, ti
  • statistical: st, ta, at, ti, is, st, ti, ic, ca,
    al
  • its distinct digrams:
  • al, at, ca, ic, is, st, ta, ti
  • A = 7, B = 8, C = 6
  • Similarity = 2x6/(7+8) = 12/15 = 4/5 = 0.80
  • One may build a similarity matrix of all words in
    a corpus, calculated as above, and complemented
    by the cutoff value method (set to zero if less than
    a certain value, and to 1 otherwise).
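
The calculation is short enough to check in code. This is a minimal sketch with hypothetical function names; the third word and the cutoff of 0.6 in the matrix example are illustrative choices, not values from the lecture.

```python
def digrams(word):
    """The set of distinct two-character substrings of a word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def dice_similarity(word1, word2):
    """S = 2C / (A + B) over the distinct digrams of the two words."""
    a, b = digrams(word1), digrams(word2)
    if not a and not b:
        return 0.0  # both words are too short to have digrams
    return 2 * len(a & b) / (len(a) + len(b))


def similarity_matrix(words, cutoff):
    """Binary matrix: 1 where the similarity reaches the cutoff, else 0."""
    return [[1 if dice_similarity(x, y) >= cutoff else 0 for y in words]
            for x in words]


print(dice_similarity("statistics", "statistical"))  # 0.8
print(similarity_matrix(["statistics", "statistical", "stemming"], 0.6))
# [[1, 1, 0], [1, 1, 0], [0, 0, 1]]
```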

30
Thesaurus
  • Vocabulary control in an information retrieval
    system
  • Thesaurus construction
  • Manual construction
  • Automatic construction

31
Vocabulary control
  • Standard vocabulary for both indexing and
    searching (for the constructors of the system and
    the users of the system)

32
Objectives of vocabulary control
  • To promote the consistent representation of
    subject matter by indexers and searchers, thereby
    avoiding the dispersion of related materials.
  • To facilitate the conduct of a comprehensive
    search on some topic by linking together terms
    whose meanings are related paradigmatically.

33
Thesaurus
  • Unlike a common dictionary, which lists
    words with their explanations
  • May contain all the words in a language
  • Or only the words in a specific domain
  • Carries a lot of other information, especially the
    relationships between words
  • Classification of the words in the language
  • Word relationships such as synonyms and antonyms

34
On-Line Thesaurus
  • http://www.thesaurus.com
  • http://www.dictionary.com/
  • http://www.cogsci.princeton.edu/wn/

35
Dictionary vs. Thesaurus
Look up "information" using http://www.thesaurus.com

Dictionary
  • information, n.
  • Knowledge derived from study, experience, or
    instruction.
  • Knowledge of specific events or situations that
    has been gathered or received by communication;
    intelligence or news. See Synonyms at knowledge.
  • ......

Thesaurus
  • Nouns: information, enlightenment, acquaintance
  • Verbs: tell; inform, inform of; acquaint,
    acquaint with; impart
  • Adjectives: informed; communique; reported; published

36
Use of Thesaurus
  • To control the terms used in indexing: for a
    specific domain, use only the terms in the
    thesaurus as indexing terms.
  • To help users form proper queries with the
    information contained in the thesaurus.

37
Construction of Thesaurus
  • Stemming can be used to reduce the size of the
    thesaurus.
  • It can be constructed either manually or
    automatically.

38
WordNet: manually constructed
  • WordNet is an online lexical reference system
    whose design is inspired by current
    psycholinguistic theories of human lexical
    memory. English nouns, verbs, adjectives and
    adverbs are organized into synonym sets, each
    representing one underlying lexical concept.
    Different relations link the synonym sets.

39
Relations in WordNet

40
Automatic Thesaurus Construction
  • A variety of methods can be used in constructing
    the thesaurus.
  • Term similarity can be used for constructing the
    thesaurus.

41
Complete Term Relation Method
  • The term-document relationship can be calculated
    using a variety of methods,
  • such as tf-idf.
  • Term similarity can be calculated based on the
    term-document relationship,
  • for example
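
One common way to derive term similarity from a term-document relationship is to sum, over all documents, the product of the two terms' weights. This is a sketch under that assumption, not necessarily the formula used in the lecture. The function name and the weights are hypothetical, and the threshold of 10 is the value the lecture sets for this method.

```python
def term_similarities(weights):
    """Similarity of each term pair: sum over documents of weight products."""
    terms = sorted(weights)
    sims = {}
    for i, s in enumerate(terms):
        for t in terms[i + 1:]:
            sims[(s, t)] = sum(a * b for a, b in zip(weights[s], weights[t]))
    return sims


# Hypothetical term-document weights (for example tf-idf values),
# one row per term and one column per document.
weights = {
    "T1": [0, 4, 0, 2],
    "T2": [3, 1, 4, 0],
    "T3": [3, 0, 3, 1],
}

sims = term_similarities(weights)
related = {pair for pair, value in sims.items() if value >= 10}
print(sims)     # {('T1', 'T2'): 4, ('T1', 'T3'): 2, ('T2', 'T3'): 21}
print(related)  # {('T2', 'T3')}
```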

42
Complete Term Relation Method
  • Set threshold to 10

43
Complete Term Relation Method
  • Groups
  • T1, T3, T4, T6
  • T1, T5
  • T2, T4, T6
  • T2, T6, T8
  • T7
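
One reading of "complete" is that every pair of terms inside a group must be related, so the groups are the maximal cliques of the thresholded relation. This is an interpretation, not something the transcript states. The sketch below finds maximal cliques with a basic Bron-Kerbosch search. The list of related pairs is hypothetical, chosen so that it reproduces the groups above, because the lecture's similarity table is not part of this transcript.

```python
def maximal_cliques(adjacency):
    """All maximal sets of terms in which every pair is related."""
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(sorted(current))  # cannot be extended further
            return
        for term in sorted(candidates):
            expand(current | {term},
                   candidates & adjacency[term],
                   excluded & adjacency[term])
            candidates = candidates - {term}
            excluded = excluded | {term}

    expand(set(), set(adjacency), set())
    return sorted(cliques)


# Hypothetical pairs whose similarity passed the threshold.
related_pairs = [
    ("T1", "T3"), ("T1", "T4"), ("T1", "T6"), ("T3", "T4"),
    ("T3", "T6"), ("T4", "T6"), ("T1", "T5"), ("T2", "T4"),
    ("T2", "T6"), ("T2", "T8"), ("T6", "T8"),
]
adjacency = {f"T{i}": set() for i in range(1, 9)}
for s, t in related_pairs:
    adjacency[s].add(t)
    adjacency[t].add(s)

for group in maximal_cliques(adjacency):
    print(group)
```

A term that is related to no other term, such as T7 here, forms a group of its own.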