Information Retrieval and Web Search

Provided by: gheorghe

1
Information Retrieval and Web Search
  • Keyphrase Extraction
  • Instructor: Rada Mihalcea
  • Invited Lecturer: Andras Csomai
  • Class web page: http://lit.csci.unt.edu/classes/CSCE5200/

2
Keyphrase Extraction
  • Set of keywords representing the topic of a
    document
  • Dense summary
  • Purpose
  • Topic/content identification
  • Indexing
  • Browsing
  • Back of book
  • Classification
  • Surfing
  • . . .
  • Stems from early Roman times

5
Extraction vs. assignment
  • Assignment: the keyphrases are not necessarily
    present in the document
  • Controlled vocabulary
  • Uncontrolled vocabulary
  • Extraction: on average, 75% of the keywords assigned
    by human experts are present in the document
  • Conclusion: extraction is a feasible approach

6
General framework of supervised keyword extraction
7
Candidate extraction
  • N-grams
  • Sequence of n consecutive words from text
  • 90% of the keywords present in the document
    contain up to 3 terms (3-grams)
  • Depends on domain, style, etc.
  • Boundaries?
  • NP-chunks
  • 50% extracted
  • Helps increase precision
  • Syntactic patterns
  • Using linguistic knowledge
  • Extract only patterns that are likely to be
    keyphrases (e.g. noun noun vs. verb adj is )
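
The n-gram option above can be sketched in a few lines. This is only an illustration: the function name `ngram_candidates`, the regex tokenizer, and the default `max_n=3` (taken from the "up to 3 terms" observation) are assumptions, not part of the lecture.

```python
import re

def ngram_candidates(text, max_n=3):
    """Return every word n-gram for n = 1..max_n, grouped by n."""
    # Assumption: a simple lowercase alphanumeric tokenizer is good enough here.
    words = re.findall(r"[a-z0-9]+", text.lower())
    candidates = []
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            candidates.append(" ".join(words[i:i + n]))
    return candidates

candidates = ngram_candidates("Keyphrase extraction finds topics")
```

Every n-gram is kept, so recall is high and precision is low; NP-chunking or syntactic patterns would filter this list further.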

8
Machine Learning component
  • The choice of a machine learning method is of
    secondary importance
  • Decision trees
  • Rule induction
  • Naïve Bayes
  • GenEx
  • The most important part is the good feature set
  • Training data

9
Features
  • TFIDF
  • The ubiquitous IR feature
  • Distance
  • The distance of the candidate from the beginning
    of the document
  • Better performance in well structured text (e.g.
    scientific articles)
  • POS pattern
  • Linguistic information
  • The POS pattern of the candidate phrase
  • Length
  • E.g. 13.7% unigram, 51.8% bigram, 25.4% trigram
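
A minimal sketch of how three of these features (TFIDF, distance, length) might be computed for one candidate. The slides do not give exact formulas, so the TF normalization, the smoothed IDF, and all names (`candidate_features`, `doc_freq`, `num_docs`) are assumptions.

```python
import math

def candidate_features(candidate, doc_tokens, doc_freq, num_docs):
    """candidate and doc_tokens are lists of tokens; doc_freq is the assumed
    number of collection documents that contain the candidate."""
    n = len(candidate)
    positions = [i for i in range(len(doc_tokens) - n + 1)
                 if doc_tokens[i:i + n] == candidate]
    tf = len(positions) / len(doc_tokens)
    idf = math.log(num_docs / (1 + doc_freq))  # one common smoothed variant
    # Distance: relative position of the first occurrence (0.0 = document start).
    distance = positions[0] / len(doc_tokens) if positions else 1.0
    return {"tfidf": tf * idf, "distance": distance, "length": n}

doc = "keyphrase extraction is useful because keyphrase extraction summarizes documents".split()
features = candidate_features(["keyphrase", "extraction"], doc, doc_freq=9, num_docs=100)
```

A POS-pattern feature would need a tagger and is left out of this sketch.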

10
Features contd.
  • Phraseness
  • Tells you if a sequence of words constitutes a
    phrase. (e.g. information retrieval vs. wood
    desk)
  • PMI (point-wise mutual information)
  • Probabilities via MLE
  • For ranking purposes ignore constants, use simple
    counts
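
The PMI formula itself did not survive in the transcript. The sketch below uses the standard definition, PMI(x, y) = log(P(x, y) / (P(x) P(y))), with probabilities estimated by MLE from raw counts. Dividing both unigram and bigram counts by the same token total `n` is a simplifying assumption, and `pmi_scores` is a hypothetical name.

```python
import math
from collections import Counter

def pmi_scores(tokens):
    """PMI for every adjacent word pair, with MLE probabilities from counts."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    scores = {}
    for (x, y), c_xy in bigrams.items():
        # log( (c_xy / n) / ((c_x / n) * (c_y / n)) ) = log( n * c_xy / (c_x * c_y) )
        scores[(x, y)] = math.log(n * c_xy / (unigrams[x] * unigrams[y]))
    return scores

tokens = "information retrieval is fun and information retrieval is hard".split()
scores = pmi_scores(tokens)
```

For ranking, the log and the factor `n` are monotone or constant, so comparing `c_xy / (c_x * c_y)` gives the same order. PMI tends to favor rare pairs, so it is usually combined with a frequency threshold.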

11
Features contd.
  • Syntactic elements
  • E.g. adj. at the end of sequence, common verb
  • Binary feature
  • Special locations or emphasized style
  • Binary feature showing if the candidate appears
    in a title, metadata, etc.
  • If it is italic, bold, etc.
  • Domain specific features
  • Taxonomies, ontologies
  • Use the relations of the hierarchical structure
    as binary features. E.g. the presence of
    neighboring items in text
  • Or simple domain specific keyphrase collection
    (Turney)

12
More features
  • Coherence
  • The PMI of the top K phrases and the rest of the
    candidates (Turney)
  • WordNet based features
  • Are you familiar with WordNet?
  • PMI (point-wise mutual information)
  • Probabilities via MLE

13
Performance of the state of the art
  • Precision of .23 (average) (Turney, Gutwin)
  • Large collection, longer articles
  • F-measure of .33 (best system) with precision of
    .25
  • Only abstracts: easier?
  • Acceptability: 80% (good + no answer), 62% good
    (Turney)

14
Unsupervised
  • When you have no training data
  • Language and domain independent
  • TextRank (Mihalcea)
  • Unsupervised, simple, powerful.
  • Efficient
  • Uses a variation of PageRank
  • Represent text as graph
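
A rough sketch of the graph idea: words become vertices, co-occurrence within a small distance becomes an undirected edge, and a PageRank-style iteration scores the vertices. It is a simplification rather than the published TextRank: a tiny stop-word list stands in for part-of-speech filtering, edges are unweighted, and `textrank_keywords`, `max_distance`, and the parameter defaults are assumptions.

```python
from collections import defaultdict

# Assumption: a tiny illustrative stop-word list replaces a POS filter.
STOP_WORDS = {"a", "an", "the", "of", "and", "in", "is", "for", "to", "on", "as"}

def textrank_keywords(tokens, max_distance=1, damping=0.85, iterations=50):
    """Rank words by a PageRank-style score on an undirected co-occurrence graph."""
    words = [t for t in tokens if t not in STOP_WORDS]
    neighbors = defaultdict(set)
    for i, w in enumerate(words):
        for other in words[i + 1:i + 1 + max_distance]:
            if other != w:
                neighbors[w].add(other)
                neighbors[other].add(w)
    scores = {w: 1.0 for w in neighbors}
    for _ in range(iterations):
        # Each vertex collects an equal share of the score of each neighbor.
        scores = {
            w: (1 - damping)
               + damping * sum(scores[v] / len(neighbors[v]) for v in neighbors[w])
            for w in neighbors
        }
    return sorted(scores, key=scores.get, reverse=True)

ranked = textrank_keywords("graph ranking of text graph model and graph keywords".split())
```

Adjacent top-ranked words can then be merged into multi-word keyphrases as a post-processing step.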

15
Representing text as graph
16
Traditional Unsupervised Methods
  • TFIDF plus some improvement
  • Coherence based on WordNet
  • PMI is language independent
  • Heuristics
  • Ranking based on the TFIDF or some composite
    score
  • Select the top portion of the ranked list (score
    or keyphrase number cutoff)

17
Back of the Book indexing
  • Information likely to be sought by a user
  • Present at the end of almost every book.

18
Style
  • Index terms and references (location, cross
    reference, etc )
  • Alphabetical ordering (to facilitate the search)
  • Phrase heads brought up front (inverted index)
  • illustrations, indexing of
  • Cascaded style (organized information shows
    topical relatedness or specific instances of a
    more general concept)
  • illustrations, indexing of,
  • in newspaper indexes,
  • in periodical indexes,

19
Automated indexing
  • Observation: indexes tend to contain a large
    number of Named Entities and keyphrases
  • Minimally supervised indexing system
  • Candidate extraction
  • Ranking
  • Postprocessing

20
Candidate Extraction
  • 1- to 4-grams
  • No span over sentence boundaries
  • Named Entities
  • Using Snow
  • Heuristics
  • If candidate starts or ends with a stop word,
    drop it
  • If candidate starts or ends with a common word,
    drop
  • Identify candidates that are paraphrases
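
The n-gram and stop-word heuristics above could look roughly like this. Named-entity recognition, the common-word filter, and paraphrase identification are omitted; the stop-word list, the punctuation-based sentence splitter, and the name `index_candidates` are illustrative assumptions.

```python
import re

# Assumption: a tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "the", "of", "in", "and", "is", "to"}

def index_candidates(text, max_n=4):
    """Collect 1- to max_n-grams that stay inside one sentence and neither
    start nor end with a stop word."""
    candidates = set()
    for sentence in re.split(r"[.!?]+", text):  # crude sentence boundaries
        words = re.findall(r"[a-z0-9]+", sentence.lower())
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                gram = words[i:i + n]
                if gram[0] in STOP_WORDS or gram[-1] in STOP_WORDS:
                    continue
                candidates.add(" ".join(gram))
    return candidates

found = index_candidates("The thrust of the sting. Nervous system.")
```

Stop words inside a candidate are allowed, which is why "thrust of the sting" survives while "the sting" does not.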

21
Ranking
  • TF from document and IDF from BNC
  • Named Entities receive an IDF of 1
  • Rank according to the TFIDF scores
  • Heuristics
  • Index length (0.44 or 0.35 of text depending
    on target)
  • Distribution of entries by length
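
A sketch of the ranking step under stated assumptions: TF is the raw count in the document, IDF comes from document frequencies in a reference corpus (the BNC in the lecture; here a plain dictionary), and named entities get a fixed IDF of 1. The smoothed IDF formula and the names `rank_candidates`, `reference_doc_freq`, and `reference_num_docs` are not from the slides.

```python
import math
from collections import Counter

def rank_candidates(occurrences, reference_doc_freq, reference_num_docs,
                    named_entities=frozenset()):
    """Return (candidate, score) pairs sorted by descending TF * IDF."""
    tf = Counter(occurrences)
    scores = {}
    for cand, count in tf.items():
        if cand in named_entities:
            idf = 1.0  # fixed IDF for named entities, as stated on the slide
        else:
            # Assumed smoothing: unseen candidates get the highest IDF.
            idf = math.log(reference_num_docs / (1 + reference_doc_freq.get(cand, 0)))
        scores[cand] = count * idf
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

ranked = rank_candidates(
    ["sting", "sting", "sting", "river", "Alabama", "river"],
    reference_doc_freq={"sting": 9, "river": 99},  # made-up counts
    reference_num_docs=1000,
    named_entities={"Alabama"},
)
```

The index-length and length-distribution heuristics would then decide how many entries to keep from the top of this list.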

22
Post processing the index
  • Eliminate paraphrases
  • Look for paraphrases in the candidate list, and
    keep only the instance with the highest TFIDF
    score
  • Targeted paraphrase types
  • Lexical synonymy: trace the river / follow the
    river
  • PP attachment: a plant in Alabama / the
    Alabama plant
  • Morpho-syntactic variants: inductive phenomena
    / phenomena of induction, plural forms, etc.

23
Detecting paraphrases
  • Create an extended set for each non-common word
    of the entry, including:
  • The word itself
  • The stem (using Porter's algorithm)
  • The synonyms of the first sense of the word, or
    its variations in other parts of speech
  • Two entries are paraphrases if:
  • The number of non-common words is equal
  • There is a one-to-one matching of the extended
    sets of the two entries
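
The test above can be sketched as follows. Everything lexical here is a stand-in: `crude_stem` is a rough suffix stripper, not Porter's algorithm, `SYNONYMS` is a hypothetical two-entry table in place of WordNet first-sense synonyms, and `COMMON_WORDS` is a tiny illustrative list. Two extended sets are taken to "match" when they share at least one element, which is an assumed reading of the slide.

```python
from itertools import permutations

COMMON_WORDS = {"a", "an", "the", "of", "in"}           # illustrative only
SYNONYMS = {"trace": {"follow"}, "follow": {"trace"}}   # hypothetical WordNet stand-in

def crude_stem(word):
    """Very rough suffix stripping; a placeholder, NOT Porter's algorithm."""
    for suffix in ("ion", "ive", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def extended_sets(entry):
    """One extended set (word, stem, synonyms) per non-common word."""
    return [{w, crude_stem(w)} | SYNONYMS.get(w, set())
            for w in entry.lower().split() if w not in COMMON_WORDS]

def are_paraphrases(entry_a, entry_b):
    sets_a, sets_b = extended_sets(entry_a), extended_sets(entry_b)
    if len(sets_a) != len(sets_b):
        return False
    # Brute-force search for a one-to-one matching; fine for short entries.
    return any(all(a & b for a, b in zip(sets_a, perm))
               for perm in permutations(sets_b))

same = are_paraphrases("inductive phenomena", "phenomena of induction")
```

Trying every permutation is factorial in the entry length; with candidates capped at a few words that cost is negligible.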

24
Postprocessing contd.
  • Generate inversions following human style
  • Find the word with the highest TFIDF component
  • Split the entry into three parts (A, B, C), where B
    is the most important word; A or C may be empty
  • Reorder into B, A, C
  • E.g.
  • centres of innervation / innervation, centres of
  • thrust of the sting / sting, thrust of the
  • concentrated nervous system / nervous,
    concentrated, system
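
The inversion rule can be written as a short function. The per-word TFIDF values below are made up for illustration, and `invert_entry` is a hypothetical name; the comma-separated output format follows the examples on the slide.

```python
def invert_entry(entry, tfidf):
    """Split entry into (A, B, C) around its highest-TFIDF word B,
    then reorder into 'B, A, C', skipping empty parts."""
    words = entry.split()
    head = max(range(len(words)), key=lambda i: tfidf.get(words[i], 0.0))
    a = " ".join(words[:head])
    b = words[head]
    c = " ".join(words[head + 1:])
    return ", ".join(part for part in (b, a, c) if part)

# Assumed scores, chosen so the head words match the slide's examples.
tfidf = {"innervation": 3.0, "centres": 1.0, "sting": 2.5, "thrust": 1.2,
         "nervous": 2.0, "system": 1.5, "concentrated": 0.8}

inverted = invert_entry("centres of innervation", tfidf)
```

With these scores the three slide examples are reproduced; an empty entry is not handled in this sketch.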