1
Information Retrieval and Web Search
  • Keyphrase Extraction
  • Instructor: Rada Mihalcea
  • (Note: most of the slides were adapted from a
    presentation by Andras Csomai)

2
Keyphrase extraction
  • A set of words and phrases that accurately and
    concisely describe a document
  • A dense summary of the document
  • Purposes:
  • Topic/content identification
  • Indexing
  • Browsing
  • Back-of-the-book indexes
  • Classification
  • Surfing
  • . . .
  • The practice stems from early Roman times

5
Variations of keyword extraction
  • Constraints on what can be selected
  • Controlled vocabulary: domain ontology, etc.
  • Uncontrolled vocabulary
  • Constraints on where it has to be selected from
  • Assignment: the keywords are not necessarily
    present in the document
  • Extraction: on average, 75% of human-expert
    assigned keywords are present in the document
    (depends on the collection)
  • Conclusion: extraction is a feasible approach

6
Uses of keyword extraction
  • Articles, journals: topic/content identification
  • Back of the book indexes
  • Amazon SIPs
  • Google Books keywords
  • NLP and IR applications
  • Document classification
  • Document clustering
  • Keyword-based information retrieval
  • Building domain ontologies

7
Keyword extraction for back of the book indexing
  • Information likely to be sought by a user
  • A guide to concepts, names, places in a document

8
Evaluation data set
  • Use a set of documents with expert-created
    indexes (gold standard)
  • Measure how well the automatically generated
    keyword set matches the index
  • Document collections
  • Gutenberg Project
  • 56 documents, reduced to 29
  • Balanced across multiple domains
  • University of California Press
  • 259 training and 30 test books
  • Only from the humanities
  • Extract index entries

9
Index granularities
  • Coarse-grained
  • A shorter index containing only the head
    expressions
  • A more concise version of the index
  • Fine-grained
  • A longer, more detailed index
  • More biased towards the indexer's style

10
Evaluation metrics
  • Compare the generated index to the Gold Standard
  • Only coarse-grained evaluation
  • Metrics (sketched below)
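The formulas on the original slide are not preserved; assuming the standard precision/recall/F-measure over the automatically generated entry set S and the gold-standard index G, the metrics would be:

```latex
P = \frac{|S \cap G|}{|S|}, \qquad
R = \frac{|S \cap G|}{|G|}, \qquad
F_1 = \frac{2PR}{P + R}
```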

11
Automated keyword extraction
  • Unsupervised
  • No training data required
  • Generally portable
  • Methods
  • Candidate extraction
  • Candidate ranking
  • TFIDF
  • Language model based
  • Pre- and post-processing

12
Automated keyword extraction
  • Supervised
  • Requires a large training corpus (documents and
    human expert extracted keywords)
  • Higher accuracy
  • Domain/language dependent: poor portability
  • Methods
  • Many features in common with unsupervised methods
  • Additional features (linguistic, semantic)

13
General workflow for keyword extraction
14
Candidate extraction
  • N-gram
  • Sequences of n consecutive words
  • Do not cross sentence boundaries
  • n ≤ 4
  • The most comprehensive method
  • Anything is a possible candidate
  • Also the noisiest (generates a candidate set more
    than twice the size of the document); see the
    sketch below
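A minimal sketch of this extractor, assuming sentence splitting and tokenisation are already done (each sentence is a list of tokens):

```python
# Extract all 1- to 4-gram candidates without crossing
# sentence boundaries.
def ngram_candidates(sentences, max_n=4):
    candidates = []
    for tokens in sentences:
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                candidates.append(" ".join(tokens[i:i + n]))
    return candidates

sents = [["keyphrase", "extraction", "for", "indexing"]]
print(ngram_candidates(sents))
```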

15
Noun phrase chunks
  • Observation (Hulth): most of the keywords are
    noun phrases in the document
  • Noun phrase chunks
  • Generates a lot less noise than n-grams
  • Increases precision, lowers recall
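A minimal sketch using NLTK's regular-expression chunker; the grammar (optional adjectives followed by nouns) is an assumption, since the slides do not specify the chunking rules:

```python
import nltk

# Hypothetical Hulth-style NP pattern: adjectives then nouns.
grammar = "NP: {<JJ.*>*<NN.*>+}"
chunker = nltk.RegexpParser(grammar)

sentence = "Automatic keyphrase extraction improves document indexing"
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
candidates = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in chunker.parse(tagged).subtrees()
    if subtree.label() == "NP"
]
print(candidates)  # noun phrase chunks, e.g. 'keyphrase extraction'
```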

16
Named entities
  • Observation many of the keywords/index entries
    are named entities
  • Treat named entities separately
  • To weight them differently
  • To complement other candidate extraction methods
  • LingPipe
  • Heuristic named entity recognition
  • Capitalised phrases that do not appear lowercased
    anywhere else in the text
  • Not at the beginning of a sentence
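A minimal sketch of this capitalisation heuristic, restricted to single tokens for brevity (the original applies to phrases); sentences are assumed to be tokenised:

```python
# Keep a token as a named-entity candidate if it is capitalised,
# does not start its sentence, and its lowercased form never
# occurs elsewhere in the document.
def heuristic_entities(sentences):
    lowercased = {w for s in sentences for w in s if w.islower()}
    entities = set()
    for tokens in sentences:
        for word in tokens[1:]:          # skip sentence-initial position
            if word[:1].isupper() and word.lower() not in lowercased:
                entities.add(word)
    return entities
```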

17
Candidate extraction performance
18
Filtering methods
  • Eliminate stopwords, common words, paraphrases
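A minimal sketch of one common filtering rule (the slide does not fix the exact rules): drop candidates that begin or end with a stopword.

```python
from nltk.corpus import stopwords

STOP = set(stopwords.words("english"))

def keep(candidate):
    words = candidate.lower().split()
    return bool(words) and words[0] not in STOP and words[-1] not in STOP

print([c for c in ["the roman empire", "roman law"] if keep(c)])
# ['roman law']
```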

19
Unsupervised features for candidate ranking
  • Informativeness features
  • How representative a phrase is of the document
  • TFIDF
  • Information retrieval metric
  • Term frequency adjusted by the frequency of words
    appearing in other documents
  • Document frequency values obtained from the
    British National Corpus
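A minimal sketch of the TFIDF score; the document-frequency table is assumed to be precomputed from an external corpus (the slides use the British National Corpus), so `df` and `n_docs` here are placeholders:

```python
import math

def tfidf(term_count, doc_length, df, n_docs):
    tf = term_count / doc_length        # within-document term frequency
    idf = math.log(n_docs / (1 + df))   # +1 guards against unseen terms
    return tf * idf

score = tfidf(term_count=12, doc_length=80000, df=40, n_docs=4000)
```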

20
Chi-squared independence
  • Measures the degree to which two events occur
    together more frequently than they would by chance
  • Based on a contingency table of co-occurrence
    counts
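For a 2 × 2 contingency table [[a, b], [c, d]] of observed counts, the statistic has a standard closed form; a minimal sketch:

```python
# a: both events occur; b: first only; c: second only; d: neither.
# Larger values indicate the events co-occur more than chance predicts.
def chi_squared(a, b, c, d):
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0
```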

21
Language model based features
  • Language-model-based
  • Pointwise Kullback-Leibler divergence of two
    language models, one built on the book and one on
    a general corpus
  • Background corpus: the BNC collection
  • Foreground corpus: the book
  • Good-Turing smoothing
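The slides do not preserve the formula; the usual pointwise KL contribution of a term t, with foreground model P_fg (the book) and background model P_bg (the BNC), would be:

```latex
\delta(t) = P_{fg}(t) \, \log \frac{P_{fg}(t)}{P_{bg}(t)}
```

Terms with a large δ(t) are over-represented in the book relative to general English and rank as more informative; Good-Turing smoothing keeps P_bg(t) non-zero.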

22
Features for candidate ranking
  • Phraseness
  • Degree to which a sequence of words can be
    considered as a phrase
  • Chi-squared independence of constituent words
  • Measures whether they appear together more often
    than by chance
  • Language-model-based
  • Information loss between a unigram and a bigram
    language model
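For the language-model variant, one standard way to express the information loss when a bigram w1 w2 is modelled by unigrams (an assumption; the slide leaves the formula implicit):

```latex
\delta(w_1 w_2) = P(w_1 w_2) \, \log \frac{P(w_1 w_2)}{P(w_1)\,P(w_2)}
```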

23
Ranking performance of features
  • (results shown for N-gram and NP-chunk candidate
    sets)

24
Length of candidates
  • In supervised methods, length is incorporated as
    a feature
  • Alternatively, enforce an observed length
    distribution on the candidate set
  • (length distributions shown for N-gram and
    NP-chunk candidates)

25
Combination methods
  • Combine different candidate extraction and
    ranking methods
  • Each extraction/ranking pair provides a score
  • Combine scores using a weighted sum; see the
    sketch below
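A minimal sketch of the weighted-sum combination; the method names and weights are placeholders, and in practice the weights would be tuned on held-out data:

```python
def combine(scores_per_method, weights):
    # scores_per_method: {method: {phrase: score}}; assumes scores
    # are already normalised to a comparable range.
    combined = {}
    for method, scores in scores_per_method.items():
        for phrase, score in scores.items():
            combined[phrase] = combined.get(phrase, 0.0) + weights[method] * score
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)

ranked = combine(
    {"tfidf": {"roman law": 0.8, "the": 0.1}, "chi2": {"roman law": 0.6}},
    {"tfidf": 0.7, "chi2": 0.3},
)
print(ranked)  # [('roman law', ~0.74), ('the', ~0.07)]
```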

26
Performance of combined models
27
Supervised keyphrase extraction
  • A machine learning component is trained to
    extract keyphrases
  • Multiple machine learning algorithms
  • Neural Networks
  • Capable of learning non-linear decision surfaces
  • Decision Trees
  • Insight into learning mechanism
  • Support Vector Machines
  • Capable of handling large amounts of data
  • Naïve Bayes
  • Used in many previous keyphrase extraction
    systems
  • Computationally efficient
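A minimal sketch of this supervised setup with one of the listed learners (Naïve Bayes, via scikit-learn); the feature vectors and labels are placeholders standing in for the real per-candidate features:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Placeholder features per candidate: [tfidf, chi-squared, length].
X_train = np.array([[0.42, 9.1, 2], [0.05, 0.4, 1], [0.31, 6.7, 3]])
y_train = np.array([1, 0, 1])              # 1 = keyphrase, 0 = not

clf = GaussianNB().fit(X_train, y_train)
scores = clf.predict_proba(X_train)[:, 1]  # P(keyphrase) used for ranking
```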

28
Features for the supervised system
  • All features from the unsupervised system
  • Construction-integration features
  • Number of times a phrase is retained in
    short-term memory
  • Phrase length
  • Within document frequency
  • Term and document frequency separately

29
Linguistic feature
  • Probability that a phrase with a given
    part-of-speech pattern is selected as a keyphrase
  • Estimated on training data
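A plausible maximum-likelihood estimate (the slide says only that the probability is estimated on training data):

```latex
P(\text{keyphrase} \mid \text{POS pattern}) =
  \frac{\#\,\text{training candidates with this pattern that are keyphrases}}
       {\#\,\text{training candidates with this pattern}}
```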

30
Encyclopedic feature
  • Previous work reports increased performance using
    a domain-specific thesaurus
  • Wikipedia
  • Can be used to extract a general thesaurus
  • Estimate the likelihood that a phrase is a good
    keyphrase independently of context
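One plausible way to make this concrete, following related Wikipedia-based keyword work (an assumption; the slide gives no formula): estimate a context-independent "keyphraseness" as the fraction of Wikipedia occurrences of a phrase that are marked up as links:

```latex
\mathrm{keyphraseness}(p) \approx
  \frac{\mathrm{count}(p \text{ as a link anchor})}
       {\mathrm{count}(p \text{ in Wikipedia})}
```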

31
Training Data
  • 259 books from the UCPress corpus
  • N-gram candidate extraction
  • Large, unbalanced training set
  • 48.5 million negative, 71,853 positive instances
  • Heuristic filtering
  • 11.5 million negative, 66,349 positive instances

32
Evaluation
  • 30 books from the UCPress corpus

33
Learning Curves