Stemming Algorithms presentation

1
Stemming Algorithms

Presenter names unreadable in the source (student IDs 9142608 and 9142609)

2
Outline

Introduction
Types of stemming algorithms
Experimental evaluations of stemming
Stemming to compress inverted files
Summary
Appendix

3
Introduction

Stemming is a technique for finding morphological variants of
search terms.
It is used to improve retrieval effectiveness and to reduce the
size of index files.
Taxonomy for stemming algorithms

4
Introduction (cont)

Criteria for judging stemmers
Correctness
Overstemming: too much of a term is removed.
Understemming: too little of a term is removed.
Retrieval effectiveness
Usually measured with recall and precision; stemmers can also
be judged on their speed, size, and so on.
Compression performance

5
Type of stemming algorithms

Table lookup approach
Successor Variety
n-gram stemmers
Affix Removal Stemmers

6
Table lookup approach

Store a table of all index terms and their stems, so that
terms from queries and indexes can be stemmed very quickly.
Problems
No such data exists for English, and many terms are domain
dependent, so they would be missing from a general table.
The table adds storage overhead, though trading size for time
is sometimes warranted.
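
A minimal sketch of the table lookup idea, assuming a small hand-built table. The table contents and the name `lookup_stem` are illustrative only, and returning the term unchanged when it is missing is one possible fallback, not something the slides specify.

```python
# Hypothetical table; a real one would have to cover the whole vocabulary.
STEM_TABLE = {
    "engineering": "engineer",
    "engineered": "engineer",
    "engineer": "engineer",
}


def lookup_stem(term, table=STEM_TABLE):
    """Return the stored stem, or the lowercased term if it is not in the table."""
    key = term.lower()
    return table.get(key, key)


print(lookup_stem("Engineering"))  # engineer
print(lookup_stem("zebra"))        # zebra (not in the table, returned as-is)
```

The lookup is a single dictionary access, which is where the speed comes from; the cost is the memory held by the table itself.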

7
Successor Variety approach

Based on work in structural linguistics that determines word
and morpheme boundaries from the distribution of phonemes in
a large body of utterances.
The stemming variant uses letters in place of phonemes and a
body of text in place of utterances.
The successor variety of a string is the number of different
characters that follow it in words in some body of text.
The successor variety of substrings of a term decreases as
more characters are added until a segment boundary is
reached, at which point it rises sharply.

8
Successor Variety approach (cont)

Test word: READABLE
Corpus: ABLE, APE, BEATABLE, FIXABLE, READ, READABLE,
READING, READS, RED, ROPE, RIPE

Prefix     Successor Variety  Letters
R          3                  E, I, O
RE         2                  A, D
REA        1                  D
READ       3                  A, I, S
READA      1                  B
READAB     1                  L
READABL    1                  E
READABLE   1                  (Blank)

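
The counts in the table above can be reproduced with a short sketch. The function name `successor_varieties` is hypothetical, and one convention is an assumption made to match the table: the end-of-word blank is counted only when no letter follows the prefix anywhere in the corpus.

```python
CORPUS = ["ABLE", "APE", "BEATABLE", "FIXABLE", "READ", "READABLE",
          "READING", "READS", "RED", "ROPE", "RIPE"]


def successor_varieties(word, corpus):
    """Return (prefix, variety, letters) for every prefix of word."""
    rows = []
    for i in range(1, len(word) + 1):
        prefix = word[:i]
        # Distinct letters that follow this prefix in any corpus word.
        letters = {w[i] for w in corpus if w.startswith(prefix) and len(w) > i}
        # Assumption: count the blank only when no letter follows the prefix.
        if not letters and prefix in corpus:
            letters = {"(blank)"}
        rows.append((prefix, len(letters), sorted(letters)))
    return rows


for prefix, variety, letters in successor_varieties("READABLE", CORPUS):
    print(prefix, variety, letters)
```
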
9
Successor Variety approach (cont)

Cutoff method
Some cutoff value is selected, and a boundary is identified
whenever the cutoff value is reached.
Peak and plateau method
A segment break is made after a character whose successor
variety exceeds that of the characters immediately preceding
and following it.
Complete word method
A break is made after a segment if the segment is a complete
word in the corpus.
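
As an illustration, the cutoff and peak and plateau methods can be sketched over a list of successor varieties, here the values 3 2 1 3 1 1 1 1 from the READABLE example. The function names are hypothetical, and treating "reached" as "greater than or equal to the cutoff" is an assumption.

```python
def cutoff_cuts(varieties, cutoff):
    """Cut after every prefix whose variety reaches the cutoff (assumed >=)."""
    last = len(varieties)
    return [i + 1 for i, v in enumerate(varieties) if v >= cutoff and i + 1 < last]


def peak_and_plateau_cuts(varieties):
    """Cut after a prefix whose variety exceeds both of its neighbours."""
    cuts = []
    for i in range(1, len(varieties) - 1):
        if varieties[i] > varieties[i - 1] and varieties[i] > varieties[i + 1]:
            cuts.append(i + 1)
    return cuts


def segment(word, cuts):
    """Split word at the given cut positions (prefix lengths)."""
    pieces, start = [], 0
    for cut in cuts:
        pieces.append(word[start:cut])
        start = cut
    pieces.append(word[start:])
    return pieces


varieties = [3, 2, 1, 3, 1, 1, 1, 1]  # READABLE, from the example corpus
print(segment("READABLE", peak_and_plateau_cuts(varieties)))  # ['READ', 'ABLE']
print(segment("READABLE", cutoff_cuts(varieties, 3)))         # ['R', 'EAD', 'ABLE']
```

With this corpus the peak and plateau method yields the intuitive split, while a cutoff of 3 also cuts after the first letter, which illustrates why the choice of cutoff value matters.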

10
Successor Variety approach (cont)

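Entropy method
Let $|D_{\alpha i}|$ be the number of words in a text body
beginning with the i-length sequence of letters $\alpha$.
Let $|D_{\alpha ij}|$ be the number of words in $D_{\alpha i}$
with the successor j.
The probability that a member of $D_{\alpha i}$ has the
successor j is given by
$$\frac{|D_{\alpha ij}|}{|D_{\alpha i}|}$$
The entropy of $|D_{\alpha i}|$ is
$$H_{\alpha i} = \sum_{j=1}^{26} -\frac{|D_{\alpha ij}|}{|D_{\alpha i}|} \log_2 \frac{|D_{\alpha ij}|}{|D_{\alpha i}|}$$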
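
A rough sketch of the entropy calculation for one prefix, using the corpus from the READABLE example. The name `successor_entropy` is hypothetical, and one choice is an assumption: words equal to the prefix have no successor letter and are left out of the count so that the probabilities sum to 1.

```python
from collections import Counter
from math import log2

CORPUS = ["ABLE", "APE", "BEATABLE", "FIXABLE", "READ", "READABLE",
          "READING", "READS", "RED", "ROPE", "RIPE"]


def successor_entropy(prefix, corpus):
    """Entropy (in bits) of the successor-letter distribution after prefix."""
    n = len(prefix)
    # Assumption: only words that continue past the prefix are counted.
    counts = Counter(w[n] for w in corpus if w.startswith(prefix) and len(w) > n)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(-(c / total) * log2(c / total) for c in counts.values())


for prefix in ("RE", "REA", "READ"):
    print(prefix, round(successor_entropy(prefix, CORPUS), 3))
```

Unlike the plain successor variety, the entropy takes into account how many words follow each successor letter, so a rare successor contributes less than a common one.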

11
Successor Variety approach (cont)

Two criteria are used to evaluate the segmentation methods:
the number of correct segment cuts divided by the total
number of cuts (the precision of the cuts)
the number of correct segment cuts divided by the total
number of true boundaries (the recall of the cuts)
After segmenting, if the first segment occurs in more than
12 words in the corpus, it is probably a prefix, so the
second segment is taken as the stem; otherwise the first
segment is the stem.
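
The two evaluation criteria above can be sketched as follows, assuming cuts are represented as character positions. The function name `cut_scores` and the example positions are illustrative, not taken from the slides.

```python
def cut_scores(predicted_cuts, true_boundaries):
    """Return (correct / total cuts, correct / total true boundaries)."""
    predicted = set(predicted_cuts)
    true = set(true_boundaries)
    correct = len(predicted & true)
    per_cut = correct / len(predicted) if predicted else 0.0
    per_boundary = correct / len(true) if true else 0.0
    return per_cut, per_boundary


# Hypothetical case: the method cut after positions 1 and 4, but only 4 is a
# true boundary (READ | ABLE).
print(cut_scores([1, 4], [4]))  # (0.5, 1.0)
```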

12
Successor Variety approach (cont)

The successor variety stemming process has three parts:
determine the successor varieties for a word,
segment the word using one of the methods above,
select one of the segments as the stem.
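
The last part, selecting the stem, might look like the sketch below. It applies the 12-word rule from the previous slide; the name `select_stem` is hypothetical, and interpreting "occurs in" as "is a prefix of" is an assumption.

```python
CORPUS = ["ABLE", "APE", "BEATABLE", "FIXABLE", "READ", "READABLE",
          "READING", "READS", "RED", "ROPE", "RIPE"]


def select_stem(segments, corpus, max_count=12):
    """Pick the first segment unless it is common enough to look like a prefix."""
    first = segments[0]
    # Assumption: "occurs in" means the word starts with the segment.
    occurrences = sum(1 for w in corpus if w.startswith(first))
    if occurrences <= max_count or len(segments) == 1:
        return first
    return segments[1]


print(select_stem(["READ", "ABLE"], CORPUS))               # READ
print(select_stem(["READ", "ABLE"], CORPUS, max_count=3))  # ABLE
```

The second call lowers the threshold only to show the other branch on this tiny corpus; READ begins four of its words.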

13
n-gram stemmers

Association measures are calculated between pairs of terms
based on shared unique digrams.
statistics => st ta at ti is st ti ic cs
unique digrams = at cs ic is st ta ti
statistical => st ta at ti is st ti ic ca al
unique digrams = al at ca ic is st ta ti
Dice's coefficient (similarity)
$$S = \frac{2C}{A + B}$$
A and B are the numbers of unique digrams in the first and
the second words. C is the number of unique digrams shared
by the two words.
For the pair above, A = 7, B = 8, and C = 6, so
S = 12/15 = 0.80.
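
A short sketch of the digram comparison for the statistics/statistical pair. The function names `digrams` and `dice` are illustrative, and returning 0.0 for words too short to have any digram is an assumption.

```python
def digrams(term):
    """Set of unique adjacent letter pairs in term."""
    return {term[i:i + 2] for i in range(len(term) - 1)}


def dice(first, second):
    """Dice's coefficient 2C / (A + B) over unique digrams."""
    a, b = digrams(first), digrams(second)
    if not a and not b:
        return 0.0  # assumption: no digrams means no measurable similarity
    return 2 * len(a & b) / (len(a) + len(b))


print(sorted(digrams("statistics")))    # 7 unique digrams
print(sorted(digrams("statistical")))   # 8 unique digrams
print(dice("statistics", "statistical"))  # 0.8
```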

14
n-gram stemmers (cont)

Similarity measures are determined for all pairs
of terms in the database, forming a similarity
matrix
Once such a similarity matrix is available, terms
are clustered using a single link clustering
method (as described in Ch.16)
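
One way to sketch the clustering step: link every pair of terms whose Dice similarity meets a threshold, then take the connected groups, which is what single link clustering produces when cut at that similarity level. The threshold value, the term list, and the function names are assumptions for illustration.

```python
from itertools import combinations


def digrams(term):
    return {term[i:i + 2] for i in range(len(term) - 1)}


def dice(first, second):
    a, b = digrams(first), digrams(second)
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def single_link_clusters(terms, threshold):
    """Group terms connected by a chain of pairs with similarity >= threshold."""
    parent = {t: t for t in terms}

    def find(t):
        # Follow parent links to the cluster representative.
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    for a, b in combinations(terms, 2):
        if dice(a, b) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for t in terms:
        clusters.setdefault(find(t), []).append(t)
    return sorted(sorted(c) for c in clusters.values())


terms = ["statistics", "statistical", "read", "reads"]
print(single_link_clusters(terms, threshold=0.6))
# [['read', 'reads'], ['statistical', 'statistics']]
```

Comparing all pairs is quadratic in the number of terms, which is the main practical cost of this approach on a large vocabulary.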

15
Affix Removal Stemmers

Affix removal algorithms remove suffixes and/or prefixes
from terms, leaving a stem.
Example rules for removing plurals (Harman 1991):
If a word ends in "ies" but not "eies" or "aies"
Then "ies" -> "y"
If a word ends in "es" but not "aes", "ees", or "oes"
Then "es" -> "e"
If a word ends in "s" but not "us" or "ss"
Then "s" -> NULL
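
The three plural rules can be sketched directly. The name `s_stem` is hypothetical, and one behaviour is an assumption: when a rule's exception blocks it, the next rule is still tried.

```python
def s_stem(word):
    """Apply the first plural-removal rule that matches; otherwise return word."""
    w = word.lower()
    if w.endswith("ies") and not w.endswith(("eies", "aies")):
        return w[:-3] + "y"
    if w.endswith("es") and not w.endswith(("aes", "ees", "oes")):
        return w[:-2] + "e"
    if w.endswith("s") and not w.endswith(("us", "ss")):
        return w[:-1]
    return w


for word in ("queries", "horses", "cats", "class", "corpus"):
    print(word, "->", s_stem(word))
# queries -> query, horses -> horse, cats -> cat, class -> class, corpus -> corpus
```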

16
The Porter algorithm

The Porter algorithm consists of a set of
condition/action rules.
The conditions fall into three classes:
Conditions on the stem
Conditions on the suffix
Conditions on rules

17
Conditions on the stem

1. The measure, denoted m, of a stem is based on its
alternating vowel-consonant sequences.

Measure  Examples
m=0      TR, EE, TREE, Y, BY
m=1      TROUBLE, OATS, TREES, IVY
m=2      TROUBLES, PRIVATE, OATEN

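
The examples above can be checked with a small sketch. It relies on Porter's published definitions, which this transcript does not spell out: a stem has the form [C](VC)^m[V], and the vowels are a, e, i, o, u, plus y when it follows a consonant. The function names are illustrative.

```python
def is_consonant(word, i):
    """True if word[i] is a consonant under Porter's definition."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # y is a vowel only when the preceding letter is a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Count the VC sequences in the [C](VC)^m[V] form of stem."""
    stem = stem.lower()
    pattern = "".join("c" if is_consonant(stem, i) else "v"
                      for i in range(len(stem)))
    # Each vowel-to-consonant transition closes one VC sequence.
    return sum(1 for i in range(1, len(pattern))
               if pattern[i - 1] == "v" and pattern[i] == "c")


for word in ("TREE", "BY", "TROUBLE", "IVY", "PRIVATE", "OATEN"):
    print(word, measure(word))
```
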
18
Conditions on the stem (cont)

2. *<X> --- the stem ends with a given letter X
3. *v* --- the stem contains a vowel
4. *d --- the stem ends in a double consonant
5. *o --- the stem ends with a consonant-vowel-consonant
sequence, where the final consonant is not w, x, or y
Suffix conditions take the form
(current_suffix == pattern)
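
The stem conditions 2 to 5 can be sketched as simple predicates. The function names are hypothetical, input is assumed to be lowercase, and the vowel definition (a, e, i, o, u, plus y after a consonant) is taken from Porter's published algorithm rather than from this transcript.

```python
def is_consonant(word, i):
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def ends_with(stem, letter):
    """*<X>: the stem ends with the given letter."""
    return stem.endswith(letter)


def contains_vowel(stem):
    """*v*: the stem contains a vowel."""
    return any(not is_consonant(stem, i) for i in range(len(stem)))


def ends_double_consonant(stem):
    """*d: the stem ends in two identical consonants."""
    return (len(stem) >= 2 and stem[-1] == stem[-2]
            and is_consonant(stem, len(stem) - 1))


def ends_cvc(stem):
    """*o: the stem ends consonant-vowel-consonant, last letter not w, x, or y."""
    n = len(stem)
    return (n >= 3
            and is_consonant(stem, n - 3)
            and not is_consonant(stem, n - 2)
            and is_consonant(stem, n - 1)
            and stem[-1] not in "wxy")


print(contains_vowel("tr"), ends_double_consonant("fall"), ends_cvc("hop"))
# False True True
```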

19
Conditions on the rules

The rules are divided into steps. The rules in a step are
examined in sequence, and only one rule from a step can
apply.
step1a(word)
step1b(stem)
if (the second or third rule of step 1b was used)
    step1b1(stem)
step1c(stem)
step2(stem)
step3(stem)
step4(stem)
step5a(stem)
step5b(stem)
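
To illustrate "only one rule from a step can apply", here is a sketch of a single step. The rule list is step 1a as given in Porter's published algorithm (SSES -> SS, IES -> I, SS -> SS, S -> null); the rule tables themselves are not in this transcript, so treat them as an outside assumption. The remaining steps are not shown.

```python
# Step 1a rules from Porter's published algorithm, in the order they are tried.
STEP_1A_RULES = [
    ("sses", "ss"),
    ("ies", "i"),
    ("ss", "ss"),
    ("s", ""),
]


def step1a(word):
    """Apply the first matching rule of the step, then stop."""
    for suffix, replacement in STEP_1A_RULES:
        if word.endswith(suffix):
            return word[:len(word) - len(suffix)] + replacement
    return word


for word in ("caresses", "ponies", "caress", "cats"):
    print(word, "->", step1a(word))
# caresses -> caress, ponies -> poni, caress -> caress, cats -> cat
```

The SS -> SS rule changes nothing by itself; it exists so that a word such as "caress" matches it and never reaches the S rule.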

26
Experimental Evaluations of stemming
28
Stemming Studies Conclusion

Most reported effects of stemming on retrieval performance
have been positive.
Stemming is as effective as manual conflation.
The effect of stemming depends on the nature of the
vocabulary used.
There appears to be little difference between the retrieval
effectiveness of different full stemmers.

29
Stemming to compress inverted files
Lennon et al. report the following compression
percentages for various stemmers and databases.
It is obvious that the savings in storage can be
substantial.
Compression rates also increase for affix removal
stemmers as the number of suffixes increases.
30
Summary

Stemmers are used to conflate terms to improve retrieval
effectiveness and/or to reduce the size of index files.
Stemming will increase recall at the cost of decreased
precision.
Stemming can have a marked effect on the size of index
files, sometimes decreasing the file size by as much as
50 percent.
