Learning to Extract Keyphrases from Text

Provided by: preraks

Learn more at: https://cedar.buffalo.edu

1
Learning to Extract Keyphrases from Text

Paper by Peter Turney
National Research Council of Canada Technical
Report (1999)
Presented by Prerak Sanghvi
Computer Science and Engineering Department
State University of New York at Buffalo

2
GenEx Algorithm

GenEx is a special-purpose algorithm devised for
extracting keyphrases from text.
Its performance was compared with that of C4.5
applied to keyword extraction, and GenEx was found
to be superior.
Performance is measured by comparing the list
of keyphrases extracted by the algorithm with the
list of keyphrases suggested by the authors of
the document.
GenEx consists of two parts: the Extractor and the
Genitor algorithms.
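The comparison between the extracted list and the authors' list can be sketched as below. This is only an illustration: the slides do not say how a match is defined or which statistic is reported, so the case-insensitive exact match and the "fraction of extracted phrases that match" measure are assumptions, and the function names are hypothetical.

```python
def count_matches(extracted, author_phrases):
    """Count distinct extracted phrases that also appear in the authors' list.

    Assumption: two phrases match when they are equal after trimming
    whitespace and lowercasing.
    """
    targets = {p.strip().lower() for p in author_phrases}
    unique = dict.fromkeys(p.strip().lower() for p in extracted)
    return sum(1 for p in unique if p in targets)


def match_fraction(extracted, author_phrases):
    """Fraction of distinct extracted phrases found in the authors' list."""
    unique = dict.fromkeys(p.strip().lower() for p in extracted)
    if not unique:
        return 0.0
    return count_matches(extracted, author_phrases) / len(unique)


extracted = ["Neural Networks", "learning", "genetic algorithm"]
authors = ["neural networks", "genetic algorithm", "keyphrases"]
print(count_matches(extracted, authors), match_fraction(extracted, authors))
```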

3
Extractor
1. Find Single Stems
2. Score Single Stems
3. Select Top Stems
4. Find Stem Phrases
5. Score Stem Phrases
6. Expand Single Stems
7. Drop Duplicates
8. Add Suffixes
9. Add Capitals
10. Final Output
4
Extractor Algorithm

Step 1: Find Single Stems. Make a list of all
unique words. Drop stop words (e.g., and, or, if,
he, she) and words with fewer than three
characters. Stem the words by truncating them at
STEM_LENGTH characters.
Step 2: Score Single Stems. For each unique stem,
count how often the stem appears in the text and
note where it first appears. The score of a stem
is the number of times it appears, multiplied by
a factor that depends on how early the stem first
appears in the document.
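Steps 1 and 2 might look roughly like the following. The slides do not give the stop-word list, the value of STEM_LENGTH, or the form of the position factor, so the five-word stop list, the length of 5, and the parameters early_cutoff and early_factor are all assumptions made for illustration.

```python
import re

STOP_WORDS = {"and", "or", "if", "he", "she"}  # tiny illustrative list (assumption)
STEM_LENGTH = 5                                 # assumed value


def find_single_stems(text):
    """Step 1: lowercase, drop stop words and short words, truncate to stems."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w[:STEM_LENGTH] for w in words
            if len(w) >= 3 and w not in STOP_WORDS]


def score_single_stems(stems, early_cutoff=0.25, early_factor=2.0):
    """Step 2: frequency times a factor that rewards early first appearance.

    Assumption: a stem first seen in the first `early_cutoff` fraction of
    the stem sequence has its count multiplied by `early_factor`.
    """
    counts, first_seen = {}, {}
    for position, stem in enumerate(stems):
        counts[stem] = counts.get(stem, 0) + 1
        first_seen.setdefault(stem, position)
    scores = {}
    for stem, count in counts.items():
        is_early = first_seen[stem] < early_cutoff * len(stems)
        scores[stem] = count * (early_factor if is_early else 1.0)
    return scores


text = "Keyphrase extraction and keyphrases: he tuned the extraction algorithm"
print(score_single_stems(find_single_stems(text)))
```

Truncation is a crude stemmer ("keyphrase" and "keyphrases" both become "keyph"), but it is cheap and its single parameter is easy to tune.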

5
Extractor Algorithm

Step 3: Select Top Single Stems. Rank the stems
in order of descending score and make a list of
the top NUM_WORKING single stems.
Step 4: Find Stem Phrases. Make a list of all
phrases in the input text (excluding stop words). A
phrase is a sequence of one, two, or three words
that appear consecutively in the text. Stem each
phrase by truncating each word in the phrase at
STEM_LENGTH characters.
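A minimal sketch of Steps 3 and 4 follows. It assumes that a phrase may not span a stop word or a punctuation mark, which is one reading of "excluding stop words"; the stop list and the values of STEM_LENGTH and NUM_WORKING are likewise illustrative assumptions.

```python
import re

STOP_WORDS = {"and", "or", "if", "he", "she"}  # illustrative list (assumption)
STEM_LENGTH = 5                                 # assumed value
NUM_WORKING = 3                                 # assumed value


def select_top_stems(scores, num_working=NUM_WORKING):
    """Step 3: rank stems by descending score (ties broken alphabetically)."""
    ranked = sorted(scores, key=lambda stem: (-scores[stem], stem))
    return ranked[:num_working]


def find_stem_phrases(text):
    """Step 4: list all 1-, 2- and 3-word stem phrases in document order.

    Assumption: stop words and punctuation end the current run of words,
    so no phrase crosses them.
    """
    runs, current = [], []
    for token in re.findall(r"[a-z]+|[^a-z\s]", text.lower()):
        if "a" <= token[0] <= "z" and token not in STOP_WORDS:
            current.append(token[:STEM_LENGTH])
        else:
            if current:
                runs.append(current)
            current = []
    if current:
        runs.append(current)

    phrases = []
    for run in runs:
        for n in (1, 2, 3):
            for i in range(len(run) - n + 1):
                phrases.append(" ".join(run[i:i + n]))
    return phrases


print(select_top_stems({"a": 1.0, "b": 3.0, "c": 2.0, "d": 3.0}, 2))
print(find_stem_phrases("neural network training and neural network"))
```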

6
Extractor Algorithm

Step 5: Score Stem Phrases. The score is based on how
often the phrase appears in the document. It is
also based on two other factors: how early the
phrase appears in the document, and how many
words it contains.
Step 6: Expand Single Stems. For each stem in the
list of the top NUM_WORKING single stems, find
the highest-scoring stem phrase that contains it.
The result is a list of NUM_WORKING stem phrases.
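Steps 5 and 6 could be sketched as follows. The slides name the three ingredients of the phrase score (frequency, earliness, length) but not how they are combined, so the multiplicative form and the parameters early_cutoff, early_factor, and length_factors are assumptions. The sketch also assumes that expanding a stem means choosing among the phrases that contain that stem.

```python
def score_stem_phrases(phrases, early_cutoff=0.25, early_factor=2.0,
                       length_factors=(1.0, 1.5, 2.0)):
    """Step 5: score = frequency * earliness factor * length factor.

    `phrases` is the list of stem phrases in document order. All three
    numeric parameters are illustrative assumptions.
    """
    counts, first_seen = {}, {}
    for position, phrase in enumerate(phrases):
        counts[phrase] = counts.get(phrase, 0) + 1
        first_seen.setdefault(phrase, position)
    scores = {}
    for phrase, count in counts.items():
        n_words = len(phrase.split())
        is_early = first_seen[phrase] < early_cutoff * len(phrases)
        early = early_factor if is_early else 1.0
        scores[phrase] = count * early * length_factors[n_words - 1]
    return scores


def expand_single_stems(top_stems, phrase_scores):
    """Step 6: replace each top stem by the best-scoring phrase containing it."""
    expanded = []
    for stem in top_stems:
        candidates = [p for p in phrase_scores if stem in p.split()]
        if candidates:
            expanded.append(max(candidates, key=lambda p: (phrase_scores[p], p)))
    return expanded


phrases = ["neura", "neura netwo", "netwo", "train",
           "neura", "neura netwo", "netwo"]
scores = score_stem_phrases(phrases)
print(expand_single_stems(["neura", "train"], scores))
```

Giving longer phrases a larger factor compensates for the fact that a multi-word phrase can never occur more often than its individual words.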

7
Extractor Algorithm

Step 7: Drop Duplicates. Remove repeated stem
phrases from the list.
Step 8: Add Suffixes. For each stem phrase, find
the most frequent corresponding whole phrase in
the input text.
Step 9: Add Capitalization. Not important for our
purposes.
Step 10: Final Output. Each output phrase must
not be in the list of supplied stop phrases.
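The remaining steps, apart from capitalization, can be sketched as below. This is a simplified illustration: the window scan ignores stop words and punctuation when looking for whole phrases, STEM_LENGTH is an assumed value, and the function names are hypothetical.

```python
import re
from collections import Counter

STEM_LENGTH = 5  # assumed value


def drop_duplicates(stem_phrases):
    """Step 7: remove repeats while keeping the original order."""
    return list(dict.fromkeys(stem_phrases))


def add_suffixes(stem_phrase, text):
    """Step 8: most frequent whole phrase whose stems equal `stem_phrase`."""
    words = re.findall(r"[a-z]+", text.lower())
    n = len(stem_phrase.split())
    counts = Counter()
    for i in range(len(words) - n + 1):
        window = words[i:i + n]
        if " ".join(w[:STEM_LENGTH] for w in window) == stem_phrase:
            counts[" ".join(window)] += 1
    return counts.most_common(1)[0][0] if counts else None


def final_output(stem_phrases, text, stop_phrases=frozenset()):
    """Steps 7, 8 and 10: deduplicate, restore suffixes, drop stop phrases."""
    output = []
    for stem_phrase in drop_duplicates(stem_phrases):
        whole = add_suffixes(stem_phrase, text)
        if whole is not None and whole not in stop_phrases:
            output.append(whole)
    return output


text = ("Neural networks learn. A neural network learns; "
        "neural networks generalize.")
print(final_output(["neura netwo", "neura netwo", "gener"], text,
                   stop_phrases={"generalize"}))
```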

8
Genitor

Genitor is a steady-state genetic algorithm, used
to tune the parameters of the Extractor.
The Extractor is tuned on a dataset consisting
of documents paired with target lists of
keyphrases.
The learning process involves adjusting the
parameters to maximize the match between the
output of the Extractor and the target keyphrase
lists.
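The idea of a steady-state genetic algorithm can be illustrated with the sketch below. It is not Genitor itself: the real-valued encoding, the rank-biased selection rule, the uniform crossover, and the Gaussian mutation are simplifying assumptions, and the toy fitness function stands in for "run the Extractor on the training documents and measure the match with the target keyphrase lists".

```python
import random


def steady_state_ga(fitness, bounds, pop_size=20, trials=500, seed=0):
    """Steady-state GA: each trial creates one child that replaces the
    worst individual (only if the child is fitter)."""
    rng = random.Random(seed)
    population = [[rng.uniform(lo, hi) for lo, hi in bounds]
                  for _ in range(pop_size)]
    population.sort(key=fitness, reverse=True)  # best first

    def pick_parent():
        # Rank-biased selection: the smaller of two random ranks
        # favours fitter individuals.
        return population[min(rng.randrange(pop_size),
                              rng.randrange(pop_size))]

    for _ in range(trials):
        a, b = pick_parent(), pick_parent()
        child = [rng.choice(pair) for pair in zip(a, b)]  # uniform crossover
        i = rng.randrange(len(child))                     # mutate one gene
        lo, hi = bounds[i]
        child[i] = min(hi, max(lo, child[i] + rng.gauss(0, 0.1 * (hi - lo))))
        if fitness(child) > fitness(population[-1]):
            population[-1] = child
            population.sort(key=fitness, reverse=True)
    return population[0]


def toy_fitness(params):
    """Placeholder objective with its maximum at (3, -1)."""
    x, y = params
    return -((x - 3) ** 2) - (y + 1) ** 2


best = steady_state_ga(toy_fitness, bounds=[(0, 10), (-5, 5)])
print(best, toy_fitness(best))
```

Because only the worst individual is ever replaced, the best parameter setting found so far is never lost, which is convenient when each fitness evaluation is expensive.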
