Information Retrieval and Web Search - PowerPoint PPT Presentation

About This Presentation

Slides: 25

Provided by: gheorghe

Transcript and Presenter's Notes

Title: Information Retrieval and Web Search

1
Information Retrieval and Web Search

Keyphrase Extraction
Instructor: Rada Mihalcea
Invited Lecturer: Andras Csomai
Class web page: http://lit.csci.unt.edu/classes/CSCE5200/

2
Keyphrase Extraction

Set of keywords representing the topic of a
document
Dense summary
Purpose
Topic/content identification
Indexing
Browsing
Back of book
Classification
Surfing
. . .
Stems from early Roman times

3
(No Transcript)
4
(No Transcript)
5
Extraction vs. assignment

Assignment: the keyphrases are not necessarily
present in the document
Controlled vocabulary
Uncontrolled vocabulary
Extraction: on average 75% of human-expert-assigned
keywords are present in the document
Conclusion: extraction is a feasible approach

6
General framework of supervised keyword extraction
7
Candidate extraction

N-grams
Sequence of n consecutive words from text
90% of the keywords present in the document
contain up to 3 terms (3-grams)
Depends on domain, style, etc.
Boundaries?
NP chunks
50% extracted
Helps increase precision
Syntactic patterns
Using linguistic knowledge
Extract only patterns that are likely to be
keyphrases (e.g. noun noun vs. verb adj is)
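
The n-gram candidate step can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's implementation: the function name `ngram_candidates`, the regex-based sentence split, and the lowercase tokenizer are all assumptions made for the example.

```python
import re

def ngram_candidates(text, max_n=3):
    """Return all word n-grams (1..max_n) that stay inside one sentence."""
    candidates = []
    # Naive sentence split on ., ! or ? (an assumption; real text needs
    # a proper sentence splitter).
    for sentence in re.split(r"[.!?]+", text):
        words = re.findall(r"[A-Za-z0-9'-]+", sentence.lower())
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                candidates.append(" ".join(words[i:i + n]))
    return candidates

cands = ngram_candidates("Keyphrase extraction works. It is simple.")
```

Because n-grams are built per sentence, no candidate spans a sentence boundary, which is one simple answer to the boundary question raised on the slide.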

8
Machine Learning component

The choice of a machine learning method is of
secondary importance
Decision trees
Rule induction
Naïve Bayes
GenEx
The most important part is a good feature set
Training data

9
Features

TFIDF
The ubiquitous IR feature
Distance
The distance of the candidate from the beginning
of the document
Better performance in well structured text (e.g.
scientific articles)
POS pattern
Linguistic information
The POS pattern of the candidate phrase
Length
E.g. 13.7% unigrams, 51.8% bigrams, 25.4% trigrams
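
As a rough illustration of how three of these features (TFIDF, distance, length) might be computed for one candidate, here is a hedged sketch. The function name `candidate_features`, the smoothed IDF variant, and the normalization of distance by document length are assumptions; the slides do not specify them.

```python
import math

def candidate_features(candidate, doc_tokens, doc_freq, n_docs):
    """Compute TFIDF, distance and length for one candidate phrase."""
    cand = candidate.split()
    n = len(cand)
    # Token positions where the candidate occurs in the document.
    positions = [i for i in range(len(doc_tokens) - n + 1)
                 if doc_tokens[i:i + n] == cand]
    tf = len(positions) / max(len(doc_tokens), 1)
    # Smoothed IDF: one common variant, chosen here as an assumption.
    idf = math.log((1 + n_docs) / (1 + doc_freq.get(candidate, 0)))
    # Distance: relative position of the first occurrence (1.0 if absent).
    first = positions[0] / len(doc_tokens) if positions else 1.0
    return {"tfidf": tf * idf, "distance": first, "length": n}

tokens = "information retrieval is fun and information retrieval is useful".split()
feats = candidate_features("information retrieval", tokens,
                           {"information retrieval": 9}, n_docs=100)
```

A POS-pattern feature would need a tagger and is left out of this sketch.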

10
Features contd.

Phraseness
Tells you if a sequence of words constitutes a
phrase. (e.g. information retrieval vs. wood
desk)
PMI (point-wise mutual information)
Probabilities via MLE
For ranking purposes ignore constants, use simple
counts
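
The formula itself did not survive in the transcript. The standard definition of PMI for a two-word candidate, with maximum-likelihood estimates from counts, is reconstructed below; the symbols c(.) for corpus counts and N for the corpus size are notation assumed here, as is the use of the same N for unigram and bigram estimates.

```latex
\[
\mathrm{PMI}(w_1, w_2) = \log \frac{P(w_1 w_2)}{P(w_1)\,P(w_2)},
\qquad
P(w) \approx \frac{c(w)}{N},
\quad
P(w_1 w_2) \approx \frac{c(w_1 w_2)}{N}
\]
\[
\mathrm{PMI}(w_1, w_2) \approx \log \frac{N \, c(w_1 w_2)}{c(w_1)\, c(w_2)}
\]
```

Since N is the same for every candidate and the logarithm is monotonic, ranking by c(w1 w2) / (c(w1) c(w2)) gives the same order as ranking by PMI, which is presumably what "ignore constants, use simple counts" refers to.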

11
Features contd.

Syntactic elements
E.g. adj. at the end of sequence, common verb
Binary feature
Special locations or emphasized style
Binary feature showing if the candidate appears
in a title, metadata, etc.
If it is italic, bold, etc.
Domain specific features
Taxonomies, ontologies
Use the relations of the hierarchical structure
as binary features. E.g. the presence of
neighboring items in text
Or simple domain specific keyphrase collection
(Turney)

12
More features

Coherence
the PMI of the top K phrases and the rest of the
candidates (Turney)
WordNet based features
Are you familiar with WordNet?
PMI: point-wise mutual information
Probabilities via MLE

13
Performance of the state of the art

Precision of .23 (average) (Turney, Gutwin)
Large collection, longer articles
F-measure of .33 (best system) with precision of
.25
Only abstracts: easier?
Acceptability: 80% (good or no answer), 62% good
(Turney)
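
The slide gives the best system's F-measure and precision but not its recall. If the F-measure is assumed to be the balanced F1 (an assumption; the slide does not say which variant was used), the implied recall can be worked out as follows.

```latex
\[
F = \frac{2PR}{P + R}
\quad\Longrightarrow\quad
R = \frac{F\,P}{2P - F}
  = \frac{0.33 \times 0.25}{2 \times 0.25 - 0.33}
  = \frac{0.0825}{0.17}
  \approx 0.49
\]
```

Under that assumption, the best system recovers roughly half of the reference keyphrases while only about a quarter of its output is correct.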

14
Unsupervised

When you have no training data
Language and domain independent
TextRank (Mihalcea)
Unsupervised, simple, powerful.
Efficient
Uses a variation of PageRank
Represent text as graph

15
Representing text as graph
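
The graph figure is missing from the transcript. A minimal sketch of the general idea, PageRank run over a word co-occurrence graph, is given below. It is a simplified illustration and not the published TextRank system: the function name `textrank`, the window size, the damping factor of 0.85, and the fixed iteration count are assumptions, and the part-of-speech filtering and merging of adjacent words into phrases are omitted.

```python
from collections import defaultdict

def textrank(words, window=2, damping=0.85, iterations=50):
    """Score words with PageRank on an undirected co-occurrence graph."""
    neighbors = defaultdict(set)
    # Link each word to the `window` words that follow it.
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window + 1, len(words))):
            if words[j] != w:
                neighbors[w].add(words[j])
                neighbors[words[j]].add(w)
    scores = {w: 1.0 for w in neighbors}
    for _ in range(iterations):
        new_scores = {}
        for w in neighbors:
            # Each neighbor shares its score equally among its own edges.
            incoming = sum(scores[v] / len(neighbors[v]) for v in neighbors[w])
            new_scores[w] = (1 - damping) + damping * incoming
        scores = new_scores
    return scores

words = "graph ranking uses graph structure and graph links".split()
scores = textrank(words)
```

Words that co-occur with many different words, such as "graph" in the toy input, end up with the highest scores; no training data is involved.
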
16
Traditional Unsupervised Methods

TFIDF plus some improvement
Coherence based on WordNet
PMI is language independent
Heuristics
Ranking based on the TFIDF or some composite
score
Select the top portion of the ranked list (score
or keyphrase number cutoff)
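
The final selection step, with either a score cutoff or a keyphrase-number cutoff, can be sketched as below. The function name `select_keyphrases` and its parameters are hypothetical, and the scores in the example are made up.

```python
def select_keyphrases(scores, top_k=None, min_score=None):
    """Rank candidates by score and keep the top portion of the list."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if min_score is not None:
        # Score cutoff: drop everything below the threshold.
        ranked = [(p, s) for p, s in ranked if s >= min_score]
    if top_k is not None:
        # Keyphrase-number cutoff: keep at most top_k entries.
        ranked = ranked[:top_k]
    return [p for p, _ in ranked]

example = {"information retrieval": 0.9, "web search": 0.7, "slide": 0.1}
by_count = select_keyphrases(example, top_k=2)
by_score = select_keyphrases(example, min_score=0.5)
```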

17
Back of the Book indexing

Information likely to be sought by a user
Present at the end of almost every book.

18
Style

Index terms and references (location, cross
reference, etc )
Alphabetical ordering (to facilitate the search)
Phrase heads brought up front (inverted index)
illustrations, indexing of
Cascaded style (organized information shows
topical relatedness or specific instances of a
more general concept)
illustrations, indexing of,
in newspaper indexes,
in periodical indexes,

19
Automated indexing

Observation: indexes tend to contain a large
number of Named Entities and keyphrases
Minimally supervised indexing system
Candidate extraction
Ranking
Postprocessing

20
Candidate Extraction

1- to 4-grams
No span over sentence boundaries
Named Entities
Using Snow
Heuristics
If candidate starts or ends with a stop word,
drop it
If candidate starts or ends with a common word,
drop it
Identify candidates that are paraphrases

21
Ranking

TF from document and IDF from BNC
Named Entities receive an IDF of 1
Rank according to the TF/IDF scores
Heuristics
Index length (0.44% or 0.35% of text, depending
on target)
Distribution of entries by length
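
A hedged sketch of the ranking rule follows: term frequency comes from the document, IDF comes from a background table (standing in for BNC-derived values), and named entities get an IDF of 1. The function name `rank_candidates`, the default IDF for unseen candidates, and all numbers in the example are assumptions for illustration.

```python
def rank_candidates(tf, background_idf, named_entities, default_idf=1.0):
    """Rank candidates by TF x IDF, giving named entities an IDF of 1."""
    scores = {}
    for cand, freq in tf.items():
        if cand in named_entities:
            idf = 1.0
        else:
            # default_idf for candidates missing from the table is an assumption.
            idf = background_idf.get(cand, default_idf)
        scores[cand] = freq * idf
    return sorted(scores, key=scores.get, reverse=True)

ranking = rank_candidates(
    tf={"nervous system": 4, "Darwin": 3, "river": 5},
    background_idf={"nervous system": 0.9, "river": 0.2},  # made-up values
    named_entities={"Darwin"},
)
```

The index-length and length-distribution heuristics would then decide how many entries of each length to keep from the top of this list.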

22
Post processing the index

Eliminate paraphrases
Look for paraphrases in the candidate list, and
keep only the instance with the highest TFIDF
score
Targeted paraphrase types
Lexical synonymy: trace the river / follow the
river
PP attachment: a plant in Alabama / the
Alabama plant
Morpho-syntactic variants: inductive phenomena
/ phenomena of induction, plural forms, etc.

23
Detecting paraphrases

Create an extended set for each non-common word
of the entry, including
The word itself
The stem (using Porter's algorithm)
The synonyms of the first sense of the word, or
its variations in other parts of speech
Two entries are paraphrases if
The number of non-common words is equal
There is a one-to-one matching of the extended
sets of the two entries
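
The matching test can be sketched as follows. To keep the example self-contained, a crude plural stripper stands in for Porter's algorithm and a two-entry table stands in for WordNet synonyms; both are placeholders, and the names `extended_set` and `are_paraphrases` are hypothetical.

```python
from itertools import permutations

COMMON = {"the", "a", "an", "of", "in"}                  # toy common-word list
SYNONYMS = {"trace": {"follow"}, "follow": {"trace"}}    # toy stand-in for WordNet

def crude_stem(word):
    # Placeholder for Porter's algorithm: strips a final plural "s" only.
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def extended_set(word):
    """The word itself, its stem, and its (toy) synonyms."""
    return {word, crude_stem(word)} | SYNONYMS.get(word, set())

def are_paraphrases(entry_a, entry_b):
    a = [extended_set(w) for w in entry_a.lower().split() if w not in COMMON]
    b = [extended_set(w) for w in entry_b.lower().split() if w not in COMMON]
    if not a or len(a) != len(b):
        return False
    # One-to-one matching: try every pairing (entries are short, so
    # brute force over permutations is acceptable here).
    return any(all(x & y for x, y in zip(a, perm)) for perm in permutations(b))
```

With these placeholders the lexical-synonymy and PP-attachment examples match, while a pair such as "inductive phenomena" / "phenomena of induction" would need a real stemmer or part-of-speech variants.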

24
Postprocessing contd.

Generate inversions following human style
Find the word with the highest TFIDF component
Split the entry into three parts (A, B, C), where B
is the most important word; A or C may be empty
Reorder into B, A, C
E.g.
centres of innervation / innervation, centres of
thrust of the sting / sting, thrust of the
concentrated nervous system / nervous,
concentrated, system
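
The inversion rule is mechanical enough to sketch directly. The function name `invert_entry` is hypothetical and the per-word scores are made-up stand-ins for TFIDF values; the entry is assumed to be non-empty.

```python
def invert_entry(entry, word_scores):
    """Split an entry into (A, B, C) around its top-scoring word; emit B, A, C."""
    words = entry.split()
    # Index of the word with the highest score (unknown words score 0).
    head = max(range(len(words)),
               key=lambda i: word_scores.get(words[i].lower(), 0.0))
    a, b, c = words[:head], words[head], words[head + 1:]
    parts = [b, " ".join(a), " ".join(c)]
    # Skip empty parts so that an empty A or C leaves no stray comma.
    return ", ".join(p for p in parts if p)

# Made-up scores standing in for TFIDF values.
toy_scores = {"innervation": 2.0, "centres": 1.0, "sting": 3.0, "thrust": 1.5,
              "nervous": 2.5, "system": 1.0, "concentrated": 0.5}
```

With these scores the function reproduces the three examples from the slide.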