Learning to Extract Keyphrases from Text

1
Learning to Extract Keyphrases from Text
  • Paper by Peter Turney
  • National Research Council of Canada Technical
    Report (1999)
  • Presented by Prerak Sanghvi
  • Computer Science and Engineering Department
  • State University of New York at Buffalo

2
GenEx Algorithm
  • GenEx is a special-purpose algorithm devised for
    extracting keyphrases from text.
  • The performance of this algorithm was compared
    with C4.5 applied to keyword extraction and found
    to be superior.
  • Performance is measured by comparing the list
    of keyphrases extracted by the algorithm to the
    list of keyphrases suggested by the authors of
    the document.
  • GenEx consists of two parts: the Extractor and
    the Genitor algorithms.
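
The comparison of extracted and author-supplied keyphrases can be sketched as below. The slides do not say how a match is decided, so this sketch assumes two phrases match when they agree after lowercasing and truncating each word; the function names and the default stem length of 5 are hypothetical.

```python
def normalize(phrase, stem_length=5):
    """Lowercase a phrase and truncate each word (assumed matching rule)."""
    return tuple(word.lower()[:stem_length] for word in phrase.split())


def count_matches(extracted, author, stem_length=5):
    """Count extracted phrases that match some author-supplied phrase."""
    author_set = {normalize(p, stem_length) for p in author}
    return sum(1 for p in extracted if normalize(p, stem_length) in author_set)


def precision(extracted, author, stem_length=5):
    """Fraction of extracted phrases that match an author phrase."""
    if not extracted:
        return 0.0
    return count_matches(extracted, author, stem_length) / len(extracted)
```

For example, "Genetic Algorithms" and "genetic algorithm" count as a match under this rule, because both normalize to the same pair of truncated words.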

3
Extractor
1. Find Single Stems
2. Score Single Stems
3. Select Top Stems
4. Find Stem Phrases
5. Score Stem Phrases
6. Expand Single Stems
7. Drop Duplicates
8. Add Suffixes
9. Add Capitals
10. Final Output

4
Extractor Algorithm
  • Step 1: Find Single Stems. Make a list of all
    unique words. Drop stop words (and, or, if,
    he, she) and words with fewer than three
    characters. Stem the words by truncating them at
    STEM_LENGTH characters.
  • Step 2: Score Single Stems. For each unique stem,
    count how often the stem appears in the text and
    note where it first appears. The score of a stem
    is the number of times it appears, multiplied by
    a factor that depends on how early the stem first
    appears in the document.
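
Steps 1 and 2 might look roughly like the following. The slides do not give the exact position factor, so this sketch assumes a simple rule: a stem whose first occurrence falls before a cutoff position gets its count multiplied by a bonus. The names early_cutoff and early_factor, the stop-word subset, and all default values are illustrative assumptions.

```python
import re

# Illustrative subset only; the real stop-word list is much longer.
STOP_WORDS = {"and", "or", "if", "he", "she", "the", "of", "a"}


def score_single_stems(text, stem_length=5, early_cutoff=10, early_factor=2.0):
    """Steps 1-2: stem the words by truncation, then score each stem.

    Returns a dict mapping stem -> score, where score is the stem's
    frequency times a factor for early first appearance (assumed rule).
    """
    words = re.findall(r"[a-z]+", text.lower())
    count, first_seen = {}, {}
    for position, word in enumerate(words):
        # Step 1: drop stop words and words with fewer than three characters.
        if word in STOP_WORDS or len(word) < 3:
            continue
        stem = word[:stem_length]
        count[stem] = count.get(stem, 0) + 1
        first_seen.setdefault(stem, position)
    # Step 2: frequency multiplied by the position factor.
    return {
        stem: count[stem] * (early_factor if first_seen[stem] < early_cutoff else 1.0)
        for stem in count
    }
```

Truncation is a crude stemmer, but it maps "keyphrase" and "keyphrases" to the same stem without any linguistic rules.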

5
Extractor Algorithm
  • Step 3: Select Top Single Stems. Rank the stems
    in order of descending score and make a list of
    the top NUM_WORKING single stems.
  • Step 4: Find Stem Phrases. Make a list of all
    phrases in the input text (excluding stop words). A
    phrase is a sequence of one, two, or three words
    that appear consecutively in the text. Stem each
    phrase by truncating each word in the phrase at
    STEM_LENGTH characters.
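
A minimal sketch of Steps 3 and 4 follows. It assumes "excluding stop words" means that a phrase may not contain a stop word, and it ignores punctuation when deciding which words are consecutive. The function names, the stop-word subset, and the default stem length are hypothetical.

```python
import re

# Illustrative subset only; the real stop-word list is much longer.
STOP_WORDS = {"and", "or", "if", "he", "she", "the", "of", "a"}


def select_top_stems(stem_scores, num_working):
    """Step 3: return the num_working highest-scoring stems, best first."""
    ranked = sorted(stem_scores, key=stem_scores.get, reverse=True)
    return ranked[:num_working]


def find_stem_phrases(text, stem_length=5):
    """Step 4: list every stemmed phrase of one to three consecutive words.

    Phrases are returned in order of occurrence, duplicates included, so
    that a later step can count frequencies and first positions.
    """
    words = re.findall(r"[a-z]+", text.lower())
    phrases = []
    for start in range(len(words)):
        for length in (1, 2, 3):
            chunk = words[start:start + length]
            # Stop at the end of the text or at the first stop word.
            if len(chunk) < length or any(w in STOP_WORDS for w in chunk):
                break
            phrases.append(" ".join(w[:stem_length] for w in chunk))
    return phrases
```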

6
Extractor Algorithm
  • Step 5: Score Stem Phrases. The score is based on
    how often the phrase appears in the document and
    on two other factors: how early the phrase first
    appears in the document, and how many words it
    contains.
  • Step 6: Expand Single Stems. For each stem in the
    list of the top NUM_WORKING single stems, find
    the highest-scoring stem phrase that contains
    that stem. The result is a list of NUM_WORKING
    stem phrases.
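
Steps 5 and 6 can be sketched as follows. The slides name the three scoring factors but not how they combine, so this sketch simply multiplies frequency by an earliness bonus and a per-length weight. It also assumes each top stem is expanded to the best phrase containing that stem. The weights, cutoff, and function names are illustrative assumptions, and position is measured as an index into the phrase list rather than a word offset.

```python
from collections import Counter


def score_stem_phrases(phrases, early_cutoff=10, early_factor=2.0,
                       length_factors=(1.0, 1.5, 2.0)):
    """Step 5: score each distinct stem phrase.

    phrases is a list of stem phrases in order of occurrence. The score is
    frequency * earliness factor * length factor (assumed combination).
    """
    count = Counter(phrases)
    first_seen = {}
    for position, phrase in enumerate(phrases):
        first_seen.setdefault(phrase, position)
    scores = {}
    for phrase, frequency in count.items():
        n_words = len(phrase.split())
        early = early_factor if first_seen[phrase] < early_cutoff else 1.0
        scores[phrase] = frequency * early * length_factors[n_words - 1]
    return scores


def expand_single_stems(top_stems, phrase_scores):
    """Step 6: map each top stem to the best-scoring phrase containing it."""
    expanded = []
    for stem in top_stems:
        candidates = [p for p in phrase_scores if stem in p.split()]
        if candidates:
            expanded.append(max(candidates, key=phrase_scores.get))
    return expanded
```

Two different stems can expand to the same phrase, which is why a duplicate-removal step is needed afterwards.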

7
Extractor Algorithm
  • Step 7: Drop Duplicates.
  • Step 8: Add Suffixes. For each stem phrase, find
    the most frequent corresponding whole phrase in
    the input text.
  • Step 9: Add Capitalization. Not important for our
    purposes.
  • Step 10: Final Output. Each output phrase must
    not be in the list of supplied stop phrases.
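
A rough sketch of Steps 7, 8, and 10 is given below (Step 9 is skipped, as on the slide). It assumes stemming is truncation at a fixed length and that a whole phrase "corresponds" to a stem phrase when its words truncate to exactly those stems. The function names and the default stem length are hypothetical.

```python
import re
from collections import Counter


def drop_duplicates(stem_phrases):
    """Step 7: remove repeated stem phrases, keeping the first occurrence."""
    return list(dict.fromkeys(stem_phrases))


def add_suffixes(stem_phrases, text, stem_length=5):
    """Step 8: replace each stem phrase by its most frequent whole phrase."""
    words = re.findall(r"[a-z]+", text.lower())
    whole_phrases = []
    for stem_phrase in stem_phrases:
        stems = stem_phrase.split()
        n = len(stems)
        # Count every run of n words whose truncated forms equal the stems.
        seen = Counter(
            " ".join(words[i:i + n])
            for i in range(len(words) - n + 1)
            if [w[:stem_length] for w in words[i:i + n]] == stems
        )
        if seen:
            whole_phrases.append(seen.most_common(1)[0][0])
    return whole_phrases


def final_output(phrases, stop_phrases):
    """Step 10: discard any phrase found in the supplied stop-phrase list."""
    return [p for p in phrases if p not in stop_phrases]
```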

8
Genitor
  • Genitor is a steady-state genetic algorithm, used
    to tune the parameters of the Extractor.
  • The Extractor is tuned on a dataset consisting
    of documents paired with target lists of
    keyphrases.
  • The learning process adjusts the parameters to
    maximize the match between the output of the
    Extractor and the target keyphrase lists.
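
The steady-state idea can be illustrated with the generic sketch below: each trial breeds one child from two rank-biased parents and replaces the current worst individual, rather than rebuilding the whole population. This is not the Genitor implementation itself. The crossover and mutation operators, the population size, and all names are assumptions. In GenEx the fitness function would measure how well the Extractor's output matches the target keyphrase lists, and the parameter vector would hold values such as STEM_LENGTH and NUM_WORKING.

```python
import random


def steady_state_ga(fitness, bounds, pop_size=20, trials=200, seed=0):
    """Maximize fitness over real-valued parameters within bounds.

    bounds is a list of (low, high) pairs, one per parameter.
    Returns the best parameter vector found.
    """
    rng = random.Random(seed)
    population = sorted(
        ([rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)),
        key=fitness, reverse=True,
    )

    def pick_parent():
        # Rank-biased selection: the better of two random ranks.
        return population[min(rng.randrange(pop_size), rng.randrange(pop_size))]

    for _ in range(trials):
        mother, father = pick_parent(), pick_parent()
        # Uniform crossover: each parameter comes from either parent.
        child = [rng.choice(pair) for pair in zip(mother, father)]
        # Mutate one parameter with Gaussian noise, clamped to its bounds.
        i = rng.randrange(len(child))
        lo, hi = bounds[i]
        child[i] = min(hi, max(lo, child[i] + rng.gauss(0, 0.1 * (hi - lo))))
        # Steady state: the single new child replaces the worst individual.
        population[-1] = child
        population.sort(key=fitness, reverse=True)
    return population[0]
```

Because only the worst individual is ever replaced, the best parameter setting found so far is never lost between trials.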