Stemming Algorithms - PowerPoint PPT Presentation

Title: Stemming Algorithm Author: hth Last modified by: hth Created Date: 11/18/2002 11:38:07 AM Document presentation format: Company – PowerPoint PPT presentation

Slides: 31
Provided by: hth1

1
Stemming Algorithms
  • ?????????????
  • ??????? ??
  • ?? 9142608 ???
  • 9142609 ???

2
Outline
  • Introduction
  • Types of stemming algorithms
  • Experimental evaluations of stemming
  • Stemming to compress inverted files
  • Summary
  • Appendix

3
Introduction
  • Stemming is one technique for finding
    morphological variants of search terms.
  • It is used to improve retrieval effectiveness and
    to reduce the size of index files.
  • Taxonomy for stemming algorithms

4
Introduction (cont)
  • Criteria for judging stemmers
  • Correctness
  • Overstemming: too much of a term is removed.
  • Understemming: too little of a term is removed.
  • Retrieval effectiveness
  • usually measured with recall and precision; stemmers
    are also judged on their speed, size, and so on
  • Compression performance

5
Types of stemming algorithms
  • Table lookup approach
  • Successor Variety
  • n-gram stemmers
  • Affix Removal Stemmers

6
Table lookup approach
  • Store a table of all index terms and their stems,
    so terms from queries and indexes can be
    stemmed very quickly.
  • Problems
  • No such table exists for English, and even if one
    did, many domain-dependent terms would be missing
    from it.
  • The table adds storage overhead, though
    trading size for time is sometimes warranted.
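
A minimal sketch of the table lookup idea, assuming a tiny hand-built table. `STEM_TABLE` and `lookup_stem` are hypothetical names, and falling back to the unchanged term when it is missing from the table is an assumption, not something the slide specifies.

```python
# Hypothetical term-to-stem table; a real one would cover the whole index vocabulary.
STEM_TABLE = {
    "engineering": "engineer",
    "engineered": "engineer",
    "engineer": "engineer",
}


def lookup_stem(term, table=STEM_TABLE):
    """Return the stored stem, or the lowercased term itself when it is not in the table."""
    key = term.lower()
    return table.get(key, key)
```

Each lookup is a single dictionary access, which is the speed the slide refers to. The cost is that every term must be listed in advance.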

7
Successor Variety approach
  • Determine word and morpheme boundaries based on
    the distribution of phonemes in a large body of
    utterances.
  • The successor variety of a string is the number
    of different characters that follow it in words
    in some body of text.
  • The successor variety of substrings of a term
    will decrease as more characters are added until
    a segment boundary is reached, at which point it
    rises sharply.

8
Successor Variety approach (cont)
Test Word: READABLE
Corpus: ABLE, APE, BEATABLE, FIXABLE, READ, READABLE,
READING, READS, RED, ROPE, RIPE

Prefix     Successor Variety  Letters
R          3                  E, I, O
RE         2                  A, D
REA        1                  D
READ       3                  A, I, S
READA      1                  B
READAB     1                  L
READABL    1                  E
READABLE   1                  (Blank)
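
The counts in the READABLE example can be reproduced with a short sketch. The function names are hypothetical. One assumption differs from the table: only letters are counted as successors, so the complete word READABLE gets 0 here, where the table lists a blank as its single successor.

```python
CORPUS = ["ABLE", "APE", "BEATABLE", "FIXABLE", "READ", "READABLE",
          "READING", "READS", "RED", "ROPE", "RIPE"]


def successors(prefix, corpus):
    """Set of distinct letters that follow `prefix` in the corpus words."""
    n = len(prefix)
    return {w[n] for w in corpus if len(w) > n and w.startswith(prefix)}


def successor_varieties(word, corpus):
    """Successor variety of every prefix of `word`, shortest prefix first."""
    return [len(successors(word[:i], corpus)) for i in range(1, len(word) + 1)]


print(successor_varieties("READABLE", CORPUS))  # [3, 2, 1, 3, 1, 1, 1, 0]
```
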
9
Successor Variety approach (cont)
  • cutoff method
  • some cutoff value is selected and a boundary is
    identified whenever the cutoff value is reached
  • peak and plateau method
  • a segment break is made after a character whose
    successor variety exceeds that of the characters
    immediately preceding and following it
  • complete word method
  • a break is made after a segment if the segment is
    a complete word in the corpus
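
A sketch of the cutoff and peak and plateau methods, applied to the successor varieties from the READABLE example. The function names are hypothetical. Treating "reached" as "greater than or equal to" and never breaking at the first or last position are assumptions.

```python
def cutoff_breaks(varieties, cutoff):
    """Prefix lengths whose successor variety reaches the cutoff value."""
    return [i + 1 for i, v in enumerate(varieties) if v >= cutoff]


def peak_plateau_breaks(varieties):
    """Prefix lengths whose variety exceeds that of both neighbouring prefixes."""
    return [i + 1 for i in range(1, len(varieties) - 1)
            if varieties[i] > varieties[i - 1] and varieties[i] > varieties[i + 1]]


def split_at(word, breaks):
    """Cut `word` after each prefix length listed in `breaks`."""
    points = [0] + [b for b in breaks if 0 < b < len(word)] + [len(word)]
    return [word[a:b] for a, b in zip(points, points[1:])]


# Successor varieties of R, RE, REA, READ, READA, READAB, READABL, READABLE.
READABLE_VARIETIES = [3, 2, 1, 3, 1, 1, 1, 1]

print(split_at("READABLE", peak_plateau_breaks(READABLE_VARIETIES)))  # ['READ', 'ABLE']
```

With a cutoff of 3 the same word is cut after R and after READ, which shows how sensitive the cutoff method is to the chosen value.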

10
Successor Variety approach (cont)
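  • entropy method
  • $|D_{\alpha i}|$: the number of words in a text body
    beginning with the $i$-length sequence of letters $\alpha$
  • $|D_{\alpha ij}|$: the number of words in $D_{\alpha i}$ with the
    successor $j$
  • The probability that a member of $D_{\alpha i}$
    has the successor $j$ is given by $\frac{|D_{\alpha ij}|}{|D_{\alpha i}|}$
  • The entropy of $|D_{\alpha i}|$ is
    $H_{\alpha i} = \sum_{p=1}^{26} -\frac{|D_{\alpha ip}|}{|D_{\alpha i}|} \log_2 \frac{|D_{\alpha ip}|}{|D_{\alpha i}|}$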
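
A minimal sketch of the entropy computation for one prefix, using the READABLE corpus from the earlier slide. `successor_entropy` is a hypothetical name. Words that end exactly at the prefix have no successor letter and are left out of the count, which is an assumption.

```python
import math
from collections import Counter

CORPUS = ["ABLE", "APE", "BEATABLE", "FIXABLE", "READ", "READABLE",
          "READING", "READS", "RED", "ROPE", "RIPE"]


def successor_entropy(prefix, corpus):
    """Entropy (in bits) of the successor-letter distribution after `prefix`."""
    n = len(prefix)
    # Count how many corpus words continue the prefix with each letter.
    counts = Counter(w[n] for w in corpus if len(w) > n and w.startswith(prefix))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())
```

After RE the successors are A (four words) and D (one word), giving about 0.72 bits. After REA every word continues with D, so the entropy is 0.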

11
Successor Variety approach (cont)
  • Two criteria used to evaluate various
    segmentation methods
  • the number of correct segment cuts divided by the
    total number of cuts
  • the number of correct segment cuts divided by the
    total number of true boundaries
  • After segmenting, if the first segment occurs in
    more than 12 words in the corpus, it is probably
    a prefix.
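
The two evaluation criteria above behave like precision and recall over cut positions. A small sketch, assuming cuts and true boundaries are both given as prefix lengths; the function name is hypothetical.

```python
def cut_precision_recall(predicted_cuts, true_boundaries):
    """Return (correct cuts / all cuts made, correct cuts / all true boundaries)."""
    predicted, true = set(predicted_cuts), set(true_boundaries)
    correct = len(predicted & true)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(true) if true else 0.0
    return precision, recall


# READABLE cut after R and READ, when the only true boundary is after READ.
print(cut_precision_recall([1, 4], [4]))  # (0.5, 1.0)
```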

12
Successor Variety approach (cont)
  • The successor variety stemming process has three
    parts
  • determine the successor varieties for a word
  • segment the word using one of the methods
  • select one of the segments as the stem
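
The third part, stem selection, can be sketched with the 12-word rule from the previous slide. `select_stem` is a hypothetical name, and counting only words that begin with the first segment as "occurrences" is an assumption.

```python
CORPUS = ["ABLE", "APE", "BEATABLE", "FIXABLE", "READ", "READABLE",
          "READING", "READS", "RED", "ROPE", "RIPE"]


def select_stem(segments, corpus, prefix_threshold=12):
    """Pick the first segment unless it is common enough to be a likely prefix."""
    first = segments[0]
    occurrences = sum(1 for w in corpus if w.startswith(first))
    if occurrences > prefix_threshold and len(segments) > 1:
        return segments[1]  # first segment is probably a prefix
    return first
```

For the segments READ and ABLE, READ starts only four corpus words, so it is kept as the stem.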

13
n-gram stemmers
  • Association measures are calculated between pairs
    of terms based on shared unique digrams.
  • statistics => st ta at ti is st ti ic cs
  • unique digrams: at cs ic is st ta ti
  • statistical => st ta at ti is st ti ic ca al
  • unique digrams: al at ca ic is st ta ti
  • Dice's coefficient (similarity): $S = \frac{2C}{A + B}$
  • A and B are the numbers of unique digrams in
    the first and the second words. C is the number
    of unique digrams shared by the two words.
  • Here A = 7, B = 8, and C = 6, so
    S = (2 × 6) / (7 + 8) = 0.80.
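
The digram example can be checked with a short sketch. The function names are hypothetical, and returning 0.0 for two terms without any digrams is an assumption.

```python
def unique_digrams(term):
    """Set of distinct adjacent letter pairs in `term`."""
    return {term[i:i + 2] for i in range(len(term) - 1)}


def dice(term_a, term_b):
    """Dice's coefficient 2C / (A + B) over unique digrams."""
    a, b = unique_digrams(term_a), unique_digrams(term_b)
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


print(round(dice("statistics", "statistical"), 2))  # 0.8
```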

14
n-gram stemmers (cont)
  • Similarity measures are determined for all pairs
    of terms in the database, forming a similarity
    matrix
  • Once such a similarity matrix is available, terms
    are clustered using a single link clustering
    method (as described in Ch.16)
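
A sketch of the clustering step, assuming single link clustering at a fixed similarity threshold, which reduces to finding connected components among pairs whose similarity meets the threshold. The threshold value and all names are assumptions, and the sketch compares pairs directly instead of storing a full matrix.

```python
from itertools import combinations


def unique_digrams(term):
    return {term[i:i + 2] for i in range(len(term) - 1)}


def dice(term_a, term_b):
    a, b = unique_digrams(term_a), unique_digrams(term_b)
    return 2 * len(a & b) / (len(a) + len(b)) if (a or b) else 0.0


def single_link_clusters(terms, threshold):
    """Group terms linked by at least one chain of pairs with dice >= threshold."""
    parent = {t: t for t in terms}

    def find(t):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    for a, b in combinations(terms, 2):
        if dice(a, b) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for t in terms:
        clusters.setdefault(find(t), []).append(t)
    return sorted(sorted(c) for c in clusters.values())


TERMS = ["statistics", "statistical", "statistician", "reading", "reader"]
print(single_link_clusters(TERMS, 0.5))
# [['reader', 'reading'], ['statistical', 'statistician', 'statistics']]
```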

15
Affix Removal Stemmers
  • Affix removal algorithms remove suffixes and/or
    prefixes from terms, leaving a stem
  • A simple example that removes plurals (Harman 1991)
  • If a word ends in "ies" but not "eies" or "aies"
  • Then "ies" -> "y"
  • If a word ends in "es" but not "aes", "ees",
    or "oes"
  • Then "es" -> "e"
  • If a word ends in "s" but not "us" or "ss"
  • Then "s" -> NULL
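
The three plural rules above can be sketched as follows. `s_stem` is a hypothetical name. Letting a word that hits an exception fall through to the next rule is one reading of the rules, not something the slide states.

```python
def s_stem(word):
    """Apply the first plural-removal rule whose condition holds."""
    w = word.lower()
    if w.endswith("ies") and not w.endswith(("eies", "aies")):
        return w[:-3] + "y"
    if w.endswith("es") and not w.endswith(("aes", "ees", "oes")):
        return w[:-2] + "e"
    if w.endswith("s") and not w.endswith(("us", "ss")):
        return w[:-1]
    return w


for example in ["ponies", "plates", "trees", "class", "corpus"]:
    print(example, "->", s_stem(example))
```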

16
The Porter algorithm
  • The Porter algorithm consists of a set of
    condition/action rules.
  • The conditions fall into three classes
  • Conditions on the stem
  • Conditions on the suffix
  • Conditions on rules

17
Conditions on the stem
  • 1. The measure, denoted m, of a stem is based on
    its alternating vowel-consonant sequences.

Measure  Examples
m=0      TR, EE, TREE, Y, BY
m=1      TROUBLE, OATS, TREES, IVY
m=2      TROUBLES, PRIVATE, OATEN
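
In Porter's definition a stem has the form [C](VC)^m[V], where C is a run of consonants and V a run of vowels, so m is the number of vowel-to-consonant transitions. A minimal sketch that reproduces the examples in the table; the function names are hypothetical. The letter y counts as a vowel only when it follows a consonant.

```python
def is_consonant(word, i):
    """True when word[i] is a consonant under Porter's definition."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # y is a consonant at the start of a word or after a vowel.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Number of VC sequences in the consonant/vowel pattern of `stem`."""
    stem = stem.lower()
    pattern = "".join("C" if is_consonant(stem, i) else "V" for i in range(len(stem)))
    return pattern.count("VC")


print([measure(w) for w in ["TREE", "TROUBLE", "TROUBLES"]])  # [0, 1, 2]
```
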
18
Conditions on the stem (cont)
  • 2. *<X>: the stem ends with a given letter X
  • 3. *v*: the stem contains a vowel
  • 4. *d: the stem ends in a double consonant
  • 5. *o: the stem ends with a consonant-vowel-consonant
    sequence, where the final consonant is not
    w, x, or y
  • Suffix conditions take the form (current_suffix ==
    pattern)
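
The stem conditions 2 to 5 can be sketched as small predicates. All names are hypothetical, and lowercase input is assumed.

```python
def is_consonant(word, i):
    """True when word[i] is a consonant; y is a vowel only after a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def ends_with(stem, letter):
    """Condition 2: the stem ends with the given letter."""
    return stem.endswith(letter)


def contains_vowel(stem):
    """Condition 3: the stem contains a vowel."""
    return any(not is_consonant(stem, i) for i in range(len(stem)))


def ends_double_consonant(stem):
    """Condition 4: the stem ends in a double consonant."""
    return (len(stem) >= 2 and stem[-1] == stem[-2]
            and is_consonant(stem, len(stem) - 1))


def ends_cvc(stem):
    """Condition 5: consonant-vowel-consonant ending, final consonant not w, x or y."""
    n = len(stem)
    return (n >= 3
            and is_consonant(stem, n - 3)
            and not is_consonant(stem, n - 2)
            and is_consonant(stem, n - 1)
            and stem[-1] not in "wxy")
```

For example, "hop" satisfies condition 5 while "snow" does not, because its final consonant is w.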

19
Conditions on the rules
  • The rules are divided into steps. The rules in a
    step are examined in sequence, and only one rule
    from a step can apply.
  • step1a(word)
  • step1b(stem)
  • if (the second or third rule of step 1b was used)
    step1b1(stem)
  • step1c(stem)
  • step2(stem)
  • step3(stem)
  • step4(stem)
  • step5a(stem)
  • step5b(stem)
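
To illustrate "only one rule from a step can apply", here is a sketch of step 1a. The four rules come from Porter's published algorithm, not from this transcript, and the table layout and names are assumptions.

```python
# Step 1a rules in the order they are examined: (suffix, replacement).
STEP_1A_RULES = [
    ("sses", "ss"),
    ("ies", "i"),
    ("ss", "ss"),
    ("s", ""),
]


def step1a(word):
    """Apply the first step 1a rule whose suffix matches, then stop."""
    for suffix, replacement in STEP_1A_RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


for example in ["caresses", "ponies", "caress", "cats"]:
    print(example, "->", step1a(example))
```

Because the rules are tried in order and the loop returns on the first match, "caress" is caught by the "ss" rule and never reaches the rule that strips a final "s".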

26
Experimental Evaluations of stemming
28
Stemming Studies Conclusion
  • Most reported effects of stemming on retrieval
    performance have been positive
  • Stemming is as effective as manual conflation
  • The effect of stemming is dependent on the nature
    of vocabulary used
  • There appears to be little difference between the
    retrieval effectiveness of different full
    stemmers

29
Stemming to compress inverted files
Lennon et al. report the following compression
percentages for various stemmers and databases.
It is obvious that the savings in storage can be
substantial.
Compression rates also increase for affix removal
stemmers as the number of suffixes increases.
30
Summary
  • Stemmers are used to conflate terms to improve
    retrieval effectiveness and/or to reduce the
    size of index files.
  • Stemming will increase recall at the cost of
    decreased precision.
  • Stemming can have a marked effect on the size of
    index files, sometimes decreasing the size of
    a file by as much as 50 percent.