1
Information Retrieval and Web Search
  • Keyphrase Extraction
  • Instructor: Rada Mihalcea
  • (Note: most of the slides were adapted from a
    presentation by Andras Csomai)

2
Keyphrase extraction
  • A set of words and phrases that accurately and
    concisely describe a document
  • A dense summary of the document
  • Purposes:
  • Topic/content identification
  • Indexing
  • Browsing
  • Back-of-the-book indexes
  • Classification
  • Surfing
  • . . .
  • The practice stems from early Roman times

5
Variations of keyword extraction
  • Constraints on what can be selected
  • Controlled vocabulary: domain ontology, etc.
  • Uncontrolled vocabulary
  • Constraints on where it has to be selected from
  • Assignment: the keywords are not necessarily
    present in the document
  • Extraction: on average, 75% of human-expert
    assigned keywords are present in the document
    (depends on the collection)
  • Conclusion: extraction is a feasible approach

6
Uses of keyword extraction
  • Articles, journals: topic/content identification
  • Back of the book indexes
  • Amazon SIPs
  • Google Books keywords
  • NLP and IR applications
  • Document classification
  • Document clustering
  • Keyword-based information retrieval
  • Building domain ontologies

7
Keyword extraction for back of the book indexing
  • Information likely to be sought by a user
  • A guide to concepts, names, places in a document

8
Evaluation data set
  • Use a set of documents with expert-created
    indexes (gold standard)
  • Measure how well the automatically generated
    keyword set matches the index
  • Document collections
  • Gutenberg Project
  • 56 documents, reduced to 29
  • Balanced across multiple domains
  • University of California Press
  • 259 training and 30 test books
  • Only from the humanities
  • Extract index entries

9
Index granularities
  • Coarse-grained
  • A shorter index containing only the head
    expressions
  • A more concise version of the index
  • Fine-grained
  • A longer, more detailed index
  • More biased towards the indexer's style

10
Evaluation metrics
  • Compare the generated index to the Gold Standard
  • Only coarse-grained evaluation
  • Metrics (sketched below)
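The formulas on the original slide are not preserved; assuming the standard precision/recall/F-measure over the automatically generated entry set S and the gold-standard index G, the metrics would be:

```latex
P = \frac{|S \cap G|}{|S|}, \qquad
R = \frac{|S \cap G|}{|G|}, \qquad
F_1 = \frac{2PR}{P + R}
```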

11
Automated keyword extraction
  • Unsupervised
  • No training data required
  • Generally portable
  • Methods
  • Candidate extraction
  • Candidate ranking
  • TFIDF
  • Language model based
  • Pre- and post-processing

12
Automated keyword extraction
  • Supervised
  • Requires a large training corpus (documents and
    human expert extracted keywords)
  • Higher accuracy
  • Domain/language dependent: poor portability
  • Methods
  • Many features in common with unsupervised methods
  • Additional features (linguistic, semantic)

13
General workflow for keyword extraction
14
Candidate extraction
  • N-gram
  • Sequences of n consecutive words
  • Do not cross sentence boundaries
  • n ≤ 4
  • The most comprehensive method
  • Anything is a possible candidate
  • Also the noisiest (generates a candidate set more
    than twice the size of the document); see the
    sketch below
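A minimal sketch of this extractor, assuming sentence splitting and tokenisation are already done (each sentence is a list of tokens):

```python
# Extract all 1- to 4-gram candidates without crossing
# sentence boundaries.
def ngram_candidates(sentences, max_n=4):
    candidates = []
    for tokens in sentences:
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                candidates.append(" ".join(tokens[i:i + n]))
    return candidates

sents = [["keyphrase", "extraction", "for", "indexing"]]
print(ngram_candidates(sents))
```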

15
Noun phrase chunks
  • Observation (Hulth): most of the keywords are
    noun phrases in the document
  • Noun phrase chunks
  • Generates a lot less noise than n-grams
  • Increases precision, lowers recall
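A minimal sketch using NLTK's regular-expression chunker; the grammar (optional adjectives followed by nouns) is an assumption, since the slides do not specify the chunking rules:

```python
import nltk

# Hypothetical Hulth-style NP pattern: adjectives then nouns.
grammar = "NP: {<JJ.*>*<NN.*>+}"
chunker = nltk.RegexpParser(grammar)

sentence = "Automatic keyphrase extraction improves document indexing"
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
candidates = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in chunker.parse(tagged).subtrees()
    if subtree.label() == "NP"
]
print(candidates)  # noun phrase chunks, e.g. 'keyphrase extraction'
```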

16
Named entities
  • Observation many of the keywords/index entries
    are named entities
  • Treat named entities separately
  • To weight them differently
  • To complement other candidate extraction methods
  • LingPipe
  • Heuristic named entity recognition
  • Capitalised phrases that do not appear lowercased
    anywhere else in the text
  • Not at the beginning of a sentence
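A minimal sketch of this capitalisation heuristic, restricted to single tokens for brevity (the original applies to phrases); sentences are assumed to be tokenised:

```python
# Keep a token as a named-entity candidate if it is capitalised,
# does not start its sentence, and its lowercased form never
# occurs elsewhere in the document.
def heuristic_entities(sentences):
    lowercased = {w for s in sentences for w in s if w.islower()}
    entities = set()
    for tokens in sentences:
        for word in tokens[1:]:          # skip sentence-initial position
            if word[:1].isupper() and word.lower() not in lowercased:
                entities.add(word)
    return entities
```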

17
Candidate extraction performance
18
Filtering methods
  • Eliminate stopwords, common words, paraphrases
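A minimal sketch of one common filtering rule (the slide does not fix the exact rules): drop candidates that begin or end with a stopword.

```python
from nltk.corpus import stopwords

STOP = set(stopwords.words("english"))

def keep(candidate):
    words = candidate.lower().split()
    return bool(words) and words[0] not in STOP and words[-1] not in STOP

print([c for c in ["the roman empire", "roman law"] if keep(c)])
# ['roman law']
```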

19
Unsupervised features for candidate ranking
  • Informativeness features
  • How representative a phrase is of the document
  • TFIDF
  • Information retrieval metric
  • Term frequency adjusted by the frequency of words
    appearing in other documents
  • Document frequency values obtained from the
    British National Corpus
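A minimal sketch of the TFIDF score; the document-frequency table is assumed to be precomputed from an external corpus (the slides use the British National Corpus), so `df` and `n_docs` here are placeholders:

```python
import math

def tfidf(term_count, doc_length, df, n_docs):
    tf = term_count / doc_length        # within-document term frequency
    idf = math.log(n_docs / (1 + df))   # +1 guards against unseen terms
    return tf * idf

score = tfidf(term_count=12, doc_length=80000, df=40, n_docs=4000)
```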

20
Chi-squared independence
  • Measures the degree to which two events occur
    together more frequently than they would by chance
  • Based on a contingency table of co-occurrence
    counts
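For a 2 × 2 contingency table [[a, b], [c, d]] of observed counts, the statistic has a standard closed form; a minimal sketch:

```python
# a: both events occur; b: first only; c: second only; d: neither.
# Larger values indicate the events co-occur more than chance predicts.
def chi_squared(a, b, c, d):
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0
```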

21
Language model based features
  • Language-model-based
  • Pointwise Kullback-Leibler divergence of two
    language models, one built on the book and one on
    a general corpus
  • Background corpus: the BNC collection
  • Foreground corpus: the book
  • Good-Turing smoothing
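The slides do not preserve the formula; the usual pointwise KL contribution of a term t, with foreground model P_fg (the book) and background model P_bg (the BNC), would be:

```latex
\delta(t) = P_{fg}(t) \, \log \frac{P_{fg}(t)}{P_{bg}(t)}
```

Terms with a large δ(t) are over-represented in the book relative to general English and rank as more informative; Good-Turing smoothing keeps P_bg(t) non-zero.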

22
Features for candidate ranking
  • Phraseness
  • Degree to which a sequence of words can be
    considered as a phrase
  • Chi-squared independence of constituent words
  • Measures whether they appear together more often
    than by chance
  • Language-model-based
  • Information loss between a unigram and a bigram
    language model
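For the language-model variant, one standard way to express the information loss when a bigram w1 w2 is modelled by unigrams (an assumption; the slide leaves the formula implicit):

```latex
\delta(w_1 w_2) = P(w_1 w_2) \, \log \frac{P(w_1 w_2)}{P(w_1)\,P(w_2)}
```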

23
Ranking performance of features
  • (results shown for N-gram and NP-chunk candidate
    sets)

24
Length of candidates
  • In supervised methods, length is incorporated as
    a feature
  • Alternatively, enforce an observed length
    distribution on the candidate set
  • (length distributions shown for N-gram and
    NP-chunk candidates)

25
Combination methods
  • Combine different candidate extraction and
    ranking methods
  • Each extraction/ranking pair provides a score
  • Combine scores using a weighted sum; see the
    sketch below
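A minimal sketch of the weighted-sum combination; the method names and weights are placeholders, and in practice the weights would be tuned on held-out data:

```python
def combine(scores_per_method, weights):
    # scores_per_method: {method: {phrase: score}}; assumes scores
    # are already normalised to a comparable range.
    combined = {}
    for method, scores in scores_per_method.items():
        for phrase, score in scores.items():
            combined[phrase] = combined.get(phrase, 0.0) + weights[method] * score
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)

ranked = combine(
    {"tfidf": {"roman law": 0.8, "the": 0.1}, "chi2": {"roman law": 0.6}},
    {"tfidf": 0.7, "chi2": 0.3},
)
print(ranked)  # [('roman law', ~0.74), ('the', ~0.07)]
```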

26
Performance of combined models
27
Supervised keyphrase extraction
  • A machine learning component is trained to
    extract keyphrases
  • Multiple machine learning algorithms
  • Neural Networks
  • Capable of learning non-linear decision surfaces
  • Decision Trees
  • Insight into learning mechanism
  • Support Vector Machines
  • Capable of handling large amounts of data
  • Naïve Bayes
  • Used in many previous keyphrase extraction
    systems
  • Computationally efficient
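A minimal sketch of this supervised setup with one of the listed learners (Naïve Bayes, via scikit-learn); the feature vectors and labels are placeholders standing in for the real per-candidate features:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Placeholder features per candidate: [tfidf, chi-squared, length].
X_train = np.array([[0.42, 9.1, 2], [0.05, 0.4, 1], [0.31, 6.7, 3]])
y_train = np.array([1, 0, 1])              # 1 = keyphrase, 0 = not

clf = GaussianNB().fit(X_train, y_train)
scores = clf.predict_proba(X_train)[:, 1]  # P(keyphrase) used for ranking
```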

28
Features for the supervised system
  • All features from the unsupervised system
  • Construction-integration features
  • Number of times a phrase is retained in
    short-term memory
  • Phrase length
  • Within document frequency
  • Term and document frequency separately

29
Linguistic feature
  • Probability that a phrase with a given
    part-of-speech pattern is selected as a keyphrase
  • Estimated on training data
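A plausible maximum-likelihood estimate (the slide says only that the probability is estimated on training data):

```latex
P(\text{keyphrase} \mid \text{POS pattern}) =
  \frac{\#\,\text{training candidates with this pattern that are keyphrases}}
       {\#\,\text{training candidates with this pattern}}
```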

30
Encyclopedic feature
  • Previous work reports increased performance using
    a domain-specific thesaurus
  • Wikipedia
  • Can be used to extract a general thesaurus
  • Estimate the likelihood that a phrase is a good
    keyphrase independently of context
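One plausible way to make this concrete, following related Wikipedia-based keyword work (an assumption; the slide gives no formula): estimate a context-independent "keyphraseness" as the fraction of Wikipedia occurrences of a phrase that are marked up as links:

```latex
\mathrm{keyphraseness}(p) \approx
  \frac{\mathrm{count}(p \text{ as a link anchor})}
       {\mathrm{count}(p \text{ in Wikipedia})}
```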

31
Training Data
  • 259 books from the UCPress corpus
  • N-gram candidate extraction
  • Large, unbalanced training set
  • 48.5 million negative, 71,853 positive instances
  • Heuristic filtering
  • 11.5 million negative, 66,349 positive instances

32
Evaluation
  • 30 books from the UCPress corpus

33
Learning Curves