1
Natural Language Processing Applications
  • Lecture 7
  • Fabienne Venant
  • Université Nancy2 / Loria

2
Information Retrieval
3
What is Information Retrieval?
  • Information retrieval (IR) is finding material
    (usually documents) of an unstructured nature
    (usually text) that satisfies an information need
    from within large collections (usually stored on
    computers)
  • Applications
  • Many universities and public libraries use IR
    systems to provide access to books, journals and
    other documents.
  • Web search
  • Large volumes of unstable, unstructured data
  • Speed is important
  • Cross-language IR
  • Finding documents written in another language
  • Touches on machine translation
  • ....

4
Concerns
  • The set of texts can be very large, hence
    efficiency is a concern
  • Textual data is noisy, incomplete and
    untrustworthy, hence robustness is a concern
  • Information may be hidden
  • Need to derive information from raw data
  • Need to derive information from vaguely expressed
    needs

5
IR Basic concepts
  • Information needs, queries and relevance
  • Indexing helps speed up retrieval
  • Retrieval models describe how to search for and
    recover relevant documents
  • Evaluation: IR systems are large and convincing
    evaluation is tricky

6
Information needs
7
Information needs
  • INFORMATION NEED: the topic about which the
    user desires to know more
  • QUERY: what the user conveys to the computer in
    an attempt to communicate the information need
  • RELEVANCE: a document is relevant if it is one
    that the user perceives as containing information
    of value wrt their personal information need
  • Example
  • topic: pipeline leaks
  • relevant documents: it doesn't matter if they use
    those words or express the concept with other
    words, such as "pipeline rupture".

8
Capturing information needs
  • Information needs can be hard to capture
  • One possibility: use natural language
  • Advantage: expressive enough to allow all needs
    to be described
  • Drawbacks
  • Semantic analysis of arbitrary NL is very hard
  • Users may not want to type full-blown sentences
    into a search engine

9
Queries
10
Queries
  • Information needs are typically expressed as a
    query
  • Where shall I go on holiday? → holiday
    destinations
  • Two main types of possible queries, e.g. for
    "How much blood does the human heart pump in one
    minute?"
  • Boolean queries
  • → heart AND blood AND minutes
  • Web-type queries
  • → human biology

11
Remarks
  • A query
  • is usually quite short and incomplete
  • may contain misspelled or poorly selected words
  • may contain too many or too few words
  • The information need
  • may be difficult to describe precisely, especially
    when the user isn't familiar with the topic
  • Precise understanding of the document content is
    difficult.

12
Persistent vs one-off Queries
  • Queries may or may not evolve over time
  • Persistent queries
  • predefined and routinely performed
  • "Top ten performing shares today"
  • Continuous queries: persistent queries that
    allow users to receive new results when they
    become available
  • typical of Information Extraction and News
    Routing systems
  • One-off (or ad-hoc) queries
  • created to obtain information as the need arises
  • typical of Web searching

13
Relevance
  • Relevance is subjective
  • "python" is ambiguous, but not for the user
  • Topicality vs. Utility: a document is relevant
    wrt a specific goal
  • → A document is relevant if it addresses the
    stated information need, not because it just
    happens to contain all the words in the query.
  • Relevance is a gradual concept (a document is not
    just relevant or not, it is more or less relevant
    to a query)
  • IR systems usually rank retrieved documents by
    relevance
  • But many algorithms use a binary decision of
    relevance.

14
The big picture
15
Terminology
  • An IR system looks for data matching some
    criteria defined by the users in their queries.
  • The language used to ask a question is called the
    query language.
  • These queries use keywords (atomic items
    characterizing some data).
  • The basic unit of data is a document (can be a
    file, an article, a paragraph, etc.).
  • A document corresponds to free text (may be
    unstructured).
  • All the documents are gathered into a collection
    (or corpus).

16
Searching for a given word in a document
  • One way to do that is to start at the beginning
    and to read through all the text
  • Pattern matching (regular expressions): given the
    speed of modern computers, grepping through text
    can be very effective
  • Enough for simple querying of modest collections
    (millions of words)
  • But for many purposes, you do need more
  • To process large document collections (billions
    or trillions of words) quickly.
  • To allow more flexible matching operations. For
    example, it is impractical to perform the query
    Romans NEAR countrymen with grep, where NEAR
    might be defined as "within 5 words" or "within
    the same sentence".
  • To allow ranked retrieval: in many cases you
    want the best answer to an information need among
    many documents that contain certain words
  • → You need an index
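
To make this concrete, here is a minimal Python sketch of grep-style
linear scanning over an invented two-document collection; every query
rescans all of the text, which is exactly what an index avoids.

```python
import re

# Grep-style retrieval: scan every document linearly on each query.
# Adequate for modest collections, but there is no ranking and all
# text is rescanned for every query.
docs = {
    1: "Friends, Romans, countrymen, lend me your ears",
    2: "I did enact Julius Caesar: I was killed in the Capitol",
}

def grep(pattern, collection):
    """Return ids of documents whose text matches the regex pattern."""
    return [doc_id for doc_id, text in collection.items()
            if re.search(pattern, text, re.IGNORECASE)]

print(grep(r"\bromans\b", docs))  # -> [1]
```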

17
Index
18
Motivation for Indexing
  • Extremely large dataset
  • Only a tiny fraction of the dataset is relevant
    to a given query
  • Speed is essential (0.25 seconds for web
    searching)
  • Indexing helps speed up retrieval

19
Indexing documents
  • How to relate the user's information need to
    some document's content?
  • Idea: use an index to refer to documents
  • Usually an index is a list of terms that appear
    in a document; it can be represented
    mathematically as
  • index: doc_i → ∪_j keyword_j
  • Here, the kind of index we use maps keywords to
    the list of documents they appear in
  • index': keyword_j → ∪_i doc_i
  • We call this an inverted index.
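
A minimal sketch of such an inverted index in Python, using the toy
documents from the exercise on slide 24 and plain whitespace
tokenization:

```python
from collections import defaultdict

# Build keyword -> postings (set of document ids) for a toy collection.
docs = {
    1: "breakthrough drug for schizophrenia",
    2: "new schizophrenia drug",
    3: "new approach for treatment of schizophrenia",
    4: "new hopes for schizophrenia patients",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for token in text.lower().split():  # naive tokenization
        index[token].add(doc_id)

# A boolean AND is then an intersection of postings:
print(sorted(index["schizophrenia"] & index["drug"]))  # [1, 2]
```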

20
Indexing documents
  • The set of keywords is usually called the
    dictionary (or vocabulary)
  • A document identifier appearing in the list
    associated with a keyword is called a posting
  • The list of document identifiers associated with
    a given keyword is called a postings list

21
Inverted files
  • The most common indexing technique
  • Source file: collection organised by documents
  • Inverted file: collection organised by terms

22
Inverted Index
  • Given a dictionary of terms (also called
    vocabulary or lexicon)
  • For each term, record in a list which documents
    the term occurs in
  • Each item in the list
  • records that a term appeared in a document
  • and, later, often, the positions in the document
  • is conventionally called a posting
  • The list is then called a postings list (or
    inverted list)

23
Inverted Index
From "Introduction to Information Retrieval", C.D.
Manning, P. Raghavan and H. Schütze
24
Exercise
  • Draw the inverted index that would be built for
    the following document collection
  • Doc 1: breakthrough drug for schizophrenia
  • Doc 2: new schizophrenia drug
  • Doc 3: new approach for treatment of schizophrenia
  • Doc 4: new hopes for schizophrenia patients
  • For this document collection, what are the
    returned results for these queries?
  • schizophrenia AND drug
  • schizophrenia AND NOT(drug OR approach)

25
Indexing documents
  • Arising questions: how to build an index
    automatically? What are the relevant keywords?
  • Some additional desiderata
  • fast processing of large collections of
    documents,
  • having flexible matching operations (robust
    retrieval),
  • having the possibility to rank the retrieved
    documents in terms of relevance
  • To ensure these requirements (especially fast
    processing) are fulfilled, the indexes are
    computed in advance
  • Note that the format of the index has a huge
    impact on the performance of the system

26
Indexing documents
  • NB: an index is built in 4 steps
  • Gathering of the collection (each document is
    given a unique identifier)
  • Segmentation of each document into a list of
    atomic tokens → tokenization
  • Linguistic processing of the tokens in order to
    normalize them → lemmatization
  • Indexing the documents by computing the
    dictionary and lists of postings

27
Manual indexing
  • Advantages
  • Human judgements are most reliable
  • Retrieval is better
  • Drawbacks
  • Time consuming
  • Not always consistent
  • different people build different indexes for the
    same document.

28
Automatic indexing
  • Using NLU?
  • Not fast enough in real world settings (e.g., web
    search)
  • Not robust enough (low coverage)
  • Difficulty: what to include and what to exclude.
  • Indexes should not contain headings for topics
    for which there is no information in the document
  • Can a machine parse full sentences of ideas and
    recognize the core ideas, the important terms,
    and the relationships between related concepts
    throughout the entire text?

29
Building the vocabulary
30
Stop list
  • A stop list contains words that are discarded
    during indexing
  • some extremely common words which would appear to
    be of little value in helping select documents
    matching a user need are excluded from the
    vocabulary entirely
  • These words are called STOP WORDS
  • Construction strategy
  • Sort the terms by collection frequency (the total
    number of times each term appears in the document
    collection)
  • Take the most frequent terms,
  • often hand-filtered for their semantic content
    relative to the domain of the documents being
    indexed
  • What counts as a stop word depends on the
    collection
  • in a collection of legal articles, "law" can be
    considered a stop word
  • Example
  • a an and are as at be by for from has he in is it
    its of on that the to was were will with
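
A sketch of this strategy in Python (the toy collection is invented;
real stop lists are then hand-filtered as noted above):

```python
from collections import Counter

# Rank terms by collection frequency; the most frequent terms are
# stop-word candidates.
collection = [
    "the cat sat on the mat",
    "the dog and the cat played in the garden",
]

freq = Counter(tok for doc in collection for tok in doc.split())
stop_candidates = [term for term, _ in freq.most_common(3)]
print(stop_candidates)  # ['the', 'cat', ...]
```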

31
Why eliminate stop words?
  • Efficiency
  • Eliminating stop words reduces the size of the
    index considerably
  • Eliminating stop words reduces retrieval time
    considerably
  • Quality of results
  • Most of the time, not indexing stop words does
    little harm
  • keyword searches with terms like "the" and "by"
    don't seem very useful
  • BUT, this is not true for phrase searches.
  • The phrase query "President of the United States"
    is more precise than President AND United States.
  • The meaning of "flights to London" is likely to
    be lost if the word "to" is stopped out.
  • .....

32
Building the vocabulary
  • Processing a stream of characters to extract
    keywords
  • 1st task: tokenization; main difficulties
  • token delimiters (ex: Chinese)
  • apostrophes (ex: O'Neill, Finland's capital)
  • hyphens (ex: Hewlett-Packard, state-of-the-art)
  • segmented compound nouns (ex: Los Angeles)
  • unsegmented compound nouns (ex: icecream,
    breadknife)
  • numerical data (dates, IP addresses)
  • word order (ex: Arabic wrt nouns and numbers)

33
Solutions for tokenization issues
  • Using a pre-defined dictionary with longest
    matches and heuristics for unknown words (see the
    sketch below)
  • Using learning algorithms trained over
    hand-segmented words
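
A sketch of the first idea: greedy longest-match segmentation against
a pre-defined dictionary (dictionary and input are invented; real
systems add heuristics for unknown words):

```python
def max_match(text, dictionary):
    """Greedily take the longest dictionary word at each position."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try longest candidate first
            if text[i:j] in dictionary or j == i + 1:
                tokens.append(text[i:j])   # unknown single char falls through
                i = j
                break
    return tokens

print(max_match("breadknife", {"bread", "knife", "read"}))
# ['bread', 'knife']
```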

34
Choosing keywords
  • Selecting the words that are most likely to
    appear in a query
  • These words characterize the documents they
    appear in
  • Which are they?

35
The bag of words approach
  • Extreme interpretation of the principle of
    compositional semantics
  • The meaning of documents resides solely in the
    words that are contained within them
  • The exact ordering of the terms in a document is
    ignored but the number of occurrences of each
    term is material

36
BoW
  • "Not the same thing a bit!" said the Hatter.
  • "You might just as well say that 'I see what I
    eat' is the same thing as 'I eat what I see'!"
  • "You might just as well say," added the March
    Hare, "that 'I like what I get' is the same thing
    as 'I get what I like'!"
  • "You might just as well say," added the Dormouse,
    who seemed to be talking in its sleep, "that 'I
    breathe when I sleep' is the same thing as 'I
    sleep when I breathe'!"

37
Bags of words
  • Nevertheless, it seems intuitive that two
    documents with similar bag of words
    representations are similar in content.

38
What's in a bag of words?
  • Are all words in a document equally important?
  • stop words do not contribute in any way to
    retrieval and scoring
  • BoW contain terms
  • What should count as a term?
  • Words
  • Phrases (e.g., president of the US)

39
Morphological normalization
  • Should index terms be word forms, lemmas or
    stems?
  • Matching morphological variants increases recall
  • Examples of morphological variants
  • anticipate, anticipating, anticipated,
    anticipation
  • Company/Companies, sell/sold
  • USA vs U.S.A.
  • 22/10/2007 vs 10/22/2007 vs 2007/10/22
  • university vs University
  • Idea: use equivalence classes of terms,
  • ex: Opel, OPEL, opel → opel
  • Two techniques
  • Stemming refers to a crude heuristic process
    that chops off the ends of words in the hope of
    achieving this goal correctly most of the time
  • Lemmatisation refers to doing things properly,
    with the use of a vocabulary and morphological
    analysis of words, normally aiming to remove
    inflectional endings only and to return the
    dictionary form of a word, which is known as the
    lemma
  • NB: documents and queries have to be processed
    using the same tokenization process!

40
Stemming and Lemmatization
  • Role: reducing inflectional forms to common base
    forms
  • Example
  • car, cars, car's, cars' → car
  • am, are, is → be
  • Stemming removes suffixes (surface markers) to
    produce root forms
  • Lemmatization reduces a word to a canonical form
    (using a dictionary and a morphological analyser)
  • Illustration of the difficulty
  • plurals (woman/women, crisis/crises)
  • derivational morphology (automatize/automate)
  • English → Porter stemming algorithm (University
    of Cambridge, UK, 1980)

41
Porter stemmer
  • Algorithm based on a set of context-sensitive
    rewriting rules
  • http://tartarus.org/martin/PorterStemmer/index.html
  • http://tartarus.org/martin/PorterStemmer/def.txt
  • Rules are composed of a pattern (left-hand side)
    and a string (right-hand side), example
  • (.*)sses → \1ss : caresses → caress
  • (.*[aeiou].*)ies → \1i : ponies → poni, ties → ti
  • (.*[aeiou].*)ss → \1ss : caress → caress
  • Rules may be constrained by conditions on the
    word's measure, example
  • (m > 1) (.*)ement → \1 : replacement → replac,
    but not cement → c
  • (m > 0) (.*)eed → \1ee : feed → feed, but
    agreed → agree
  • (*v*) (.*)ed → \1 : plastered → plaster, but
    bled → bled
  • (*v*) (.*)ing → \1 : motoring → motor, but
    sing → sing
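
A sketch of a few of these rules as Python regex substitutions
(illustration only; the real Porter algorithm applies ordered rule
steps with measure conditions):

```python
import re

# First matching rule wins, as a rough stand-in for rule ordering.
rules = [
    (r"sses$", "ss"),              # caresses -> caress
    (r"([aeiou].*)ies$", r"\1i"),  # ponies   -> poni
    (r"([aeiou].*)ing$", r"\1"),   # motoring -> motor, but sing stays
]

def stem(word):
    for pattern, repl in rules:
        new = re.sub(pattern, repl, word)
        if new != word:
            return new
    return word

for w in ["caresses", "ponies", "motoring", "sing"]:
    print(w, "->", stem(w))
```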

42
Porter Stemmer: Word measure
  • Assume that a list of consonants is denoted by
    C, and a list of vowels by V
  • Any word, or part of a word, has one of the four
    forms
  • CVCV ... C
  • CVCV ... V
  • VCVC ... C
  • VCVC ... V
  • These may all be represented by the single form
  • [C]VCVC ... [V], where the square brackets denote
    arbitrary presence of their contents
  • Using (VC)m to denote VC repeated m times, this
    may again be written as
  • [C](VC)m[V]
  • m will be called the measure of any word or word
    part when represented in this form
  • Here are some examples
  • m=0: TR, EE, TREE, Y, BY
  • m=1: TROUBLE, OATS, TREES, IVY
  • m=2: TROUBLES, PRIVATE, OATEN, ORRERY
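
A sketch that computes m by collapsing a word to its C/V form
(treating y as a consonant for simplicity; Porter actually handles y
contextually):

```python
import re

def measure(word):
    """Porter measure: m in the form [C](VC)m[V]."""
    form = "".join("V" if c in "aeiou" else "C" for c in word.lower())
    form = re.sub(r"C+", "C", form)  # collapse consonant runs
    form = re.sub(r"V+", "V", form)  # collapse vowel runs
    return form.count("VC")

for w in ["tree", "trouble", "oaten", "crepuscular"]:
    print(w, measure(w))  # tree 0, trouble 1, oaten 2, crepuscular 4
```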

43
Exercise
  • What is the Porter measure of the following words
    (give your computation)?
  • crepuscular
  • rigorous
  • placement
  • cr ep usc ul ar
  • C VC VC VC VC
  • m = 4
  • r ig or ous
  • C VC VC VC
  • m = 3
  • pl ac em ent
  • C VC VC VC
  • m = 3

44
Stemming
  • Most stemmers also remove suffixes such as ed,
    ing, ational, ation, able, ism...
  • Relational → relate
  • Most stemmers don't use lexical look-up
  • There are shortcomings
  • Stemming can result in non-words
  • Organization → organ
  • Doing → doe
  • Unrelated words can be reduced to the same stem
  • police, policy → polic

45
Stemming
  • Popular stemmers
  • Porter's
  • Lovins
  • Iterated Lovins
  • Kstem

46
Lemmatization
  • Exceptions need to be handled
  • sought → seek, sheep → sheep, feet → foot
  • Computationally more expensive than stemming as
    it looks up words in a dictionary
  • Lemmatizers for French
  • http://bach.arts.kuleuven.be/pmertens/morlex/
  • FLEMM (F. Namer)
  • POS taggers with lemmatization: TreeTagger, LT-POS

47
What is actually used?
  • Most retrieval systems use stemming/lemmatising
    and stop word lists
  • Stemming increases recall while harming precision
  • Most web search engines do use stop word lists
    but not stemming/lemmatising because
  • the text collection is extremely large so that
    the chance of matching morphological variants is
    higher
  • recall is not an issue
  • stemming is imperfect and the size and diversity
    of the web increase the chance of a mismatch
  • stemming/tokenising tools are available for few
    languages

48
Example Text Representations
  • "Scientists have found compelling new evidence of
    possible ancient microscopic life on Mars, derived
    from magnetic crystals in a meteorite that fell to
    Earth from the red planet, NASA announced on
    Monday."
  • Web search: scientists, found, compelling, new,
    evidence, possible, ancient, microscopic, life,
    mars, derived, magnetic, crystals, meteorite,
    fell, earth, red, planet, NASA, announced, Monday
  • Information service or library search: scientist,
    find, compelling, new, evidence, possible,
    ancient, microscopic, life, mars, derived,
    magnetic, crystal, meteorite, fall, earth, red,
    planet, NASA, announce, Monday

49
Granularity
  • Document unit
  • An index can map terms
  • ... to documents
  • ... to paragraphs in documents
  • ... to sentences in document
  • ... to positions in documents
  • An IR system should be designed to offer choices
    of granularity.
  • We will henceforth assume that a suitable size
    document unit has been chosen, together with an
    appropriate way of dividing or aggregating files,
    if needed.

50
Index Content
  • The index usually stores some or all of the
    following information
  • For each term
  • Document count. How many documents the term
    occurs in.
  • Total frequency count. How many times the term
    occurs across all documents → popularity
    measure
  • For each term and for each document
  • Frequency. How often the term occurs in that
    document.
  • Position. The offsets at which the term occurs in
    that document.

51
Retrieval model
52
What is a retrieval model?
  • A model is an abstraction of a process, here
    retrieval
  • Conclusions derived by the model are good if the
    model provides a good approximation of the
    retrieval process
  • IR model variables: queries, documents, terms,
    relevance, users, information needs
  • Existing types of retrieval models
  • Boolean models
  • Vector space models
  • Probabilistic models
  • Models based on Belief nets
  • Models based on language models

53
Retrieval Models the general intuition
  • Documents and user information needs are
    represented using index terms
  • Index terms serve as links to documents
  • Queries consist of index terms
  • Relevance can be measured in terms of a match
    between queries and document index terms

54
Exact vs. Best Match
  • Exact Match
  • A query specifies precise retrieval criteria
  • Each document either matches or fails to match
    the query
  • The result is a set of documents (no ranking)
  • Best match
  • A query describes good or best matching documents
  • The result is a ranked list of documents

55
Statistical retrieval
56
Statistical Models
  • A document is typically represented by a bag of
    words (unordered words with frequencies)
  • User specifies a set of desired terms with
    optional weights
  • Weighted query terms
  • Q = < database 0.5, text 0.8, information 0.2 >
  • Unweighted query terms
  • Q = < database, text, information >
  • No Boolean conditions specified in the query.

57
Statistical Retrieval
  • Retrieval based on similarity between query and
    documents.
  • Output documents are ranked according to
    similarity to query
  • Similarity based on occurrence frequencies of
    keywords in query and document
  • Automatic relevance feedback can be supported
  • The user issues a (short, simple) query.
  • The system returns an initial set of retrieval
    results.
  • The user marks some returned documents as
    relevant or nonrelevant.
  • The system computes a better representation of the
    information need based on the user feedback.
  • The system displays a revised set of retrieval
    results.

58
Boolean model
59
The boolean model
  • Most common exact-match model
  • Basic assumptions
  • An index term is either present or absent in a
    document
  • All index terms provide equal evidence wrt
    information needs
  • Queries are boolean combinations of index terms
  • x AND y: documents that contain both x and y
    (intersection of addresses)
  • x OR y: documents that contain x, y or both
    (union of addresses)
  • NOT x: documents that do not contain x
    (complement set of addresses)
  • Additionally,
  • proximity operators
  • simple regular expressions
  • spelling variants

60
Boolean queries: Example
  • User information need
  • → interested in learning about vitamins that are
    antioxidant
  • User boolean query
  • → antioxidant AND vitamin

61
The boolean model
  • Example of input collection (Shakespeare's
    plays)
  • Doc1
  • I did enact Julius Caesar
  • I was killed in the Capitol
  • Brutus killed me.
  • Doc2
  • So let it be with Caesar. The
  • noble Brutus hath told you Caesar
  • was ambitious

62
The boolean model index construction
  • First we build the list of pairs (keyword,
    docID)

63
The boolean model index construction
  • Then the lists are sorted by keyword, and
    frequency information is added

64
The boolean model index construction
  • Multiple occurrences of keywords are then merged
    to create a dictionary file and a postings file

65
Processing Boolean queries
  • User boolean query: Brutus AND Calpurnia
  • over the inverted index
  • Locate Brutus in the Dictionary
  • Retrieve its postings
  • Locate Calpurnia in the Dictionary
  • Retrieve its postings
  • Intersect the two postings lists
  • The intersection operation is the crucial one. It
    has to be efficient so as to be able to
    quickly find documents that contain both terms.
  • sometimes referred to as "merging" postings lists
    because it uses a merge algorithm
  • Merge algorithm: a general family of algorithms
    that combine multiple sorted lists by interleaved
    advancing of pointers through each list

66
Intersection
67
Extended boolean queries
  • Merging algorithm (from Manning et al., 07)

NB: the postings lists HAVE to be sorted.
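
A Python sketch of this two-pointer merge over sorted postings lists:

```python
# Intersect two sorted postings lists in time linear in the sum of
# their lengths, advancing one pointer per step.
def intersect(p1, p2):
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

# e.g. Brutus AND Calpurnia over invented postings:
print(intersect([1, 2, 4, 11, 31], [2, 31, 54]))  # [2, 31]
```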
68
Extended boolean queries
  • Generalisation of the merging process
  • Imagine more than 2 keywords appear in the
    query
  • (Brutus AND Caesar) AND NOT (Capitol)
  • Brutus AND Caesar AND Capitol
  • (Brutus OR Caesar) AND Capitol
  • ...
  • Ideas
  • consider keywords with shorter postings lists
    first (to reduce the number of operations)
  • use the frequency information stored in the
    dictionary
  • → See Manning et al., 07 for the algorithm

69
Extended boolean queries
retrieved docs: D7, D5, D2
70
Exercise
  • How would you process the following query (main
    steps)?
  • Brutus AND NOT Caesar
  • Try your algorithm on

71
Exercise
  • How would you process the following query (main
    steps)?
  • Brutus OR NOT Caesar

72
Remarks on the boolean model
  • The boolean model allows one to express precise
    queries (you know what you get, BUT you do not
    have flexibility → exact matches)
  • Boolean queries can be processed efficiently
    (time complexity of the merge algorithm is linear
    in the sum of the lengths of the lists to be
    merged)
  • Has been a reference model in IR for a long time

73
Advantages of exact-match retrieval
  • Predictable, easy to explain
  • Structured queries
  • Works well when information need is clear and
    precise

74
Drawbacks of exact-match retrieval
  • Unintuitive for non-experts: adequate query
    formulation is difficult for most users
  • no ranking of retrieved documents
  • exact matching may lead to too few or too many
    retrieved documents
  • too few if not using synonyms
  • difficulty increases with collection size
  • large result sets need to be compensated by
    interactive query refinement
  • No notion of partial relevance (useful if query
    is overrestrictive)
  • All terms have equal importance (no term
    weighting)
  • Ranking models are consistently better

75
Boolean model: The story so far
  • An inverted index associates keywords with
    postings lists
  • The postings lists contain document identifiers
    (and other useful information, such as total
    frequencies, number of documents, etc.)
  • Boolean queries are processed by merging postings
    lists in order to find the documents satisfying
    the query
  • The cost of this list merging is linear in
    the total number of document Ids: O(m + n)
  • Question: how to process phrase queries (i.e.
    taking the words' context into account)?

76
Dealing with phrase queries
  • Many complex or technical concepts and many
    organization and product names are multiword
    compounds or phrases.
  • Stanford University
  • Graph Theory
  • Natural Language Processing
  • ...
  • The user wants documents where the whole phrase
    appears, and not only some parts of it (i.e. "The
    inventor Stanford Ovshinsky never went to
    university" is not a match)
  • About 10% of web queries are phrase queries
    (song names, institutions...)
  • Such queries need either more complex dictionary
    terms, or a more complex index (critical
    parameter: size of the index)

77
Biword indexes
  • Use key-phrases of length 2, example
  • Text: Natural Language Processing
  • Dictionary
  • Natural Language
  • Language Processing
  • The dictionary is made of biwords (notion of
    context)
  • Query: Information retrieval in Natural Language
    Processing
  • (Information retrieval) AND (retrieval Natural)
    AND (Natural Language) AND (Language Processing)
  • It might seem a better query to omit the middle
    biword.
  • Better results can be obtained by using more
    precise part-of-speech patterns that define which
    extended biwords should be indexed

78
Positional indexes
  • Store positions in the inverted indexes, example
  • termID
  • doc1: position1, position2, ...
  • doc2: position1, position2, ...
  • ...
  • Processing then corresponds to an extension of
    the merging algorithm (additional checks while
    traversing the lists)
  • NB: such indexes can be used to process proximity
    queries (i.e. using constraints on proximity
    between words)
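
A sketch of phrase-query matching over a toy positional index (data
invented), checking for adjacent positions while traversing the
lists:

```python
# term -> {doc_id: sorted list of positions}
index = {
    "world": {1: [7, 18], 2: [3]},
    "wide":  {1: [8, 40], 2: [17]},
}

def phrase_docs(t1, t2, index):
    """Docs where t2 occurs exactly one position after t1."""
    hits = []
    for doc in index[t1].keys() & index[t2].keys():
        positions2 = set(index[t2][doc])
        if any(p + 1 in positions2 for p in index[t1][doc]):
            hits.append(doc)
    return sorted(hits)

print(phrase_docs("world", "wide", index))  # [1]
```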

79
  • Positional indexes need an entry per occurrence
    (NB: classic inverted indexes need an entry per
    document Id)
  • The size of such indexes grows rapidly with
    the size of the documents
  • The size of a positional index depends on the
    language being indexed and the type of document
    (books, articles, etc.)
  • On average, a positional index is 2-4 times
    bigger than an inverted index; it can reach 35 to
    50% of the size of the original text (for
    English)
  • Positional indexes can be used in combination
    with classic indexes to save time and space (see
    Williams et al., 2005).

80
Exercise
  • Which documents can contain the sentence "to be
    or not to be", considering the following
    (incomplete) indexes?
  • be
  • 1: 7, 18, 33, 72, 86, 231
  • 2: 3, 149
  • 4: 17, 191, 291, 430, 434
  • 5: 363, 367
  • to
  • 2: 1, 17, 74, 222, 551
  • 4: 8, 16, 190, 429, 433
  • 7: 13, 23, 191

81
Exercise
  • Given the following positional indexes, give the
    document Ids corresponding to the query "world
    wide web"
  • world
  • 1: 7, 18, 33, 70, 85, 131
  • 2: 3, 149
  • 4: 17, 190, 291, 430, 434
  • wide
  • 1: 12, 19, 40, 72, 86, 231
  • 2: 2, 17, 74, 150, 551
  • 3: 8, 16, 191, 429, 435
  • web
  • 1: 20, 22, 41, 75, 87, 200
  • 2: 18, 32, 45, 56, 77, 151
  • 4: 25, 192, 300, 332, 440

82
  • The postings lists to access are: to, be, or,
    not.
  • We will examine intersecting the postings lists
    for to and be.
  • We first look for documents that contain both
    terms.
  • Then, we look for places in the lists where there
    is an occurrence of be with a token index one
    higher than a position of to,
  • and then we look for another occurrence of each
    word with token index 4 higher than the first
    occurrence.
  • In the above lists, the pattern of occurrences
    that is a possible match is
  • to: <...; 4: <..., 429, 433>; ...>
  • be: <...; 4: <..., 430, 434>; ...>

83
Exercise
  • Consider the following index
  • Language: <d1,12> <d2,23-32-43> <d3,53> <d5,36-42-48>
  • Loria: <d1,25> <d2,34-40> <d5,38-51>
  • where dI refers to document I, the other
    numbers being positions.
  • The infix operator NEAR/x refers to a proximity
    of x between two terms
  • Give the solutions to the query "language NEAR/2
    Loria"
  • Give the pairs (x, docIds) for each x such that
    "language NEAR/x Loria" has at least one solution
  • Propose an algorithm for retrieving matching
    documents for this operator

84
Example WESTLAW
  • Large commercial system that has served the legal
    and professional market since 1974
  • legal materials (court opinions, statutes,
    regulations, ...)
  • news (newspapers, magazines, journals, ...)
  • financial (stock quotes, financial analyses,
    ...)
  • Total collection size: 5-7 terabytes
  • 700,000 users (they claim 56% of legal searchers
    as of 2002)
  • Best match added in 1992

85
WESTLAW query language features
  • Boolean and proximity operators
  • Phrases: "West Publishing"
  • Word proximity: West /5 Publishing
  • Same sentence: Massachusetts /s technology
  • Same paragraph: information retrieval /p
  • Restrictions: DATE(AFTER 1992 BEFORE 1995)
  • Term expansion
  • wildcard (THOMSON), truncation (THOM!),
    automatic expansion of plurals, possessives
  • Document structure (fields)

86
WESTLAW query example
  • Information need: Information on the legal
    theories involved in preventing the disclosure of
    trade secrets by employees formerly employed by a
    competing company.
  • → Query: "trade secret" /s disclos! /s prevent /s
    employe!
  • Information need: Requirements for disabled
    people to be able to access a workplace.
  • → Query: disab! /p access! /s work-site
    work-place (employment /3 place)
  • Information need: Cases about a host's
    responsibility for drunk guests.
  • → Query: host! /p (responsib! liab!) /p
    (intoxicat! drunk!) /p guest

87
Boolean query languages are not dead
  • Exact match still prevalent in the commercial
    market (but then includes some type of ranking)
  • Many users prefer Boolean
  • For some queries/collections, boolean may work
    better
  • Boolean and free text queries find different
    documents
  • → Need retrieval models that support both

88
The Vector Space Model
89
Best-Match retrieval
  • Boolean retrieval is the archetypal example of
    exact-match retrieval
  • Best-match or ranking models are now more common
  • Advantages
  • easier to use
  • similar efficiency
  • provides ranking
  • best match generally has better retrieval
    performance
  • most relevant documents appear at the top of the
    ranking
  • But comparing best- and exact-match is difficult

90
  • Boolean model: all documents matching the query
    are retrieved
  • The matching is binary: yes or no
  • Extreme cases: the list of retrieved documents
    can be empty, or huge
  • A ranking of the documents matching a query is
    needed
  • A score is computed for each pair (query,
    document)

91
Vector-space Retrieval
  • By far the most common type of retrieval system
  • Key idea: everything (documents, queries) is a
    vector in a high-dimensional space
  • Vector coefficients for an object (document,
    query, term) represent the degree to which this
    object embodies each of the basic dimensions
  • Relevance is measured using vector similarity: a
    document is relevant to a query if their
    representing vectors are similar

92
Vector-space Representation
  • Documents are vectors of terms
  • Terms are vectors of documents
  • A query is a vector of terms

93
Graphic Representation
  • Example
  • D1 = 2T1 + 3T2 + 5T3
  • D2 = 3T1 + 7T2 + T3
  • Q = 0T1 + 0T2 + 2T3

94
Similarity in the Vector-space
  • Vectors can contain binary terms or weighted terms
  • Binary term vector: 1 = term present, 0 = term
    absent
  • Weighted term vector: indicates relative
    importance of terms in a document
  • Vector similarity can be measured in several
    ways
  • Inner product (measure of overlap)
  • Cosine coefficient
  • Jaccard coefficient
  • Dice coefficient
  • Minkowski metric (dissimilarity)
  • Euclidean distance (dissimilarity)

95
Using the inner product similarity measure
  • Given a query vector q and a document vector d,
    both of length n,
  • similarity between q and d is defined by the
    inner product q · d = Σi qi di
  • where qi (di) is the value of the i-th position
    of q (d)
  • With binary values this amounts to counting the
    matching terms between q and d
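
A minimal Python version of this similarity (vectors are assumed to
share the same term dimensions):

```python
# Inner product of two term-weight vectors; with binary weights this
# counts the terms shared by query and document.
def inner_product(q, d):
    return sum(qi * di for qi, di in zip(q, d))

q = [1, 0, 1, 0, 1]
d = [1, 1, 1, 0, 0]
print(inner_product(q, d))  # 2 matching terms
```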

96
Similarity: an example in the vector space
97
The effect of varying document lengths
  • Problem
  • Longer documents will be represented with longer
    vectors, but that does not mean they are more
    important
  • If two documents have the same score, the shorter
    one should be preferred
  • Solution the length of a document must be taken
    into account when computing the similarity score

98
Document length normalization
  • The length of a document: Euclidean length
  • If d = (x1, x2, ..., xn) then |d| = sqrt(x1² +
    x2² + ... + xn²)
  • To normalize a document, we divide it by its own
    length: d / |d|
  • Similarity given by the cosine measure between
    normalized vectors
  • q · (d / |d|)
  • One problem is solved: shorter, more focused
    documents receive a higher score than longer
    documents with the same matching terms
  • But shorter documents are generally preferred
    over longer ones!
  • More sophisticated weighting schemes are
    generally used
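
A sketch of the resulting cosine score, reusing the example vectors
from slide 93 (Q = 0T1 + 0T2 + 2T3, D1 = 2T1 + 3T2 + 5T3):

```python
import math

# Cosine measure: inner product of length-normalized vectors.
def cosine(q, d):
    dot = sum(qi * di for qi, di in zip(q, d))
    norm = (math.sqrt(sum(x * x for x in q)) *
            math.sqrt(sum(x * x for x in d)))
    return dot / norm

print(round(cosine([0, 0, 2], [2, 3, 5]), 3))  # 0.811
```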

99
Term weights
  • qi is the weight of the term i in q
  • Up to now, we only considered binary term weights
  • 0: term absent
  • 1: term present
  • Two shortcomings
  • Does not reflect how often a term occurs
  • All terms are equally important (president vs.
    the)
  • Remedy: use non-binary term weights
  • tf-score: store the frequency of a term in the
    vector (e.g., 4 if the term occurs 4 times in the
    document)
  • idf-score: to distinguish meaningful terms, i.e.
    terms that occur only in a few documents

100
Term frequency
  • A document is treated as a set of words
  • Each word characterizes that document to some
    extent
  • When we have eliminated stop words, the most
    frequent words tend to be what the document is
    about
  • Therefore fkd (number of occurrences of word k in
    document d) will be an important measure
  • → Also called the term frequency (tf)

101
Document frequency
  • What makes this document distinct from others in
    the corpus?
  • The terms which discriminate best are not those
    which occur with high document frequency!
  • Therefore dk (number of documents in which word k
    occurs) will also be an important measure
  • → Also called the document frequency (df)

102
TF.IDF
  • This can all be summarized as
  • Words are best discriminators when
  • they occur often in this document (term
    frequency)
  • do not occur in a lot of documents (document
    frequency)
  • One very common measure of the importance of a
    word to a document is
  • TF.IDF = term frequency × inverse document
    frequency
  • There are multiple formulas for actually
    computing this. The underlying concept is the
    same in all of them.

103
Term weights
  • tf-score: tf(i,j) = frequency of term i in
    document j
  • idf-score: idf(i) = inverse document frequency
    of term i
  • idf(i) = log(N / n_i) with
  • N, the size of the document collection (number of
    documents)
  • n_i, the number of documents in which term i
    occurs
  • n_i / N: proportion of the document collection in
    which term i occurs
  • Term weight of term i in document j (TF-IDF):
  • tf(i,j) · idf(i)
  • idf(i) reflects the rarity of a term in the
    document collection
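
A sketch of this weighting over an invented toy collection:

```python
import math

docs = {
    1: "new home sales top forecasts",
    2: "home sales rise in july",
    3: "new motor sales up",
}
N = len(docs)

def tf_idf(term, doc_id):
    """tf(i,j) * log(N / n_i) as defined above."""
    tf = docs[doc_id].split().count(term)
    n_i = sum(1 for text in docs.values() if term in text.split())
    return tf * math.log(N / n_i) if n_i else 0.0

print(round(tf_idf("home", 1), 3))   # frequent across docs -> low weight
print(round(tf_idf("motor", 3), 3))  # rare term -> higher weight
```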

104
Boolean retrieval vs. Vector Space Retrieval
  • Boolean retrieval
  • Documents are not ranked
  • Boolean queries are not easy to manipulate
  • Vector space retrieval
  • Documents can be ranked
  • Issue 1: choice of comparison function. Usually
    cosine comparison.
  • Issue 2: choice of weighting scheme. Usually
    variations on tf(i,j) · idf(i)

105
Evaluation
106
Evaluation
  • Issues
  • User-based evaluation
  • System-based evaluation
  • TREC
  • Precision and recall

107
Evaluation methods
  • Two types of evaluation methods
  • User-based: measures user satisfaction
  • System-based: focuses on how well the system
    ranks the documents

108
User based evaluation
  • More direct
  • Expensive
  • Difficult to do correctly
  • Need sufficiently large, representative sample of
    users
  • The compared systems must be equally well
    developed (complete with fully functional user
    interface)
  • Each user must be trained to control learning
    effects
  • Information, information needs, relevance are
    intangible concepts

109
System based evaluation
  • Good system performance = good document rankings
  • Allows for fair comparative testing
  • Less expensive; can be reused
  • Test collection: Topics, Documents, Relevance
    judgments
  • System-based evaluation goes back to the
    Cranfield experiments (1960)
  • Rate relevance of retrieved bibliographic
    references on a scale from 1 to 4

110
Recall and Precision
  • Three important performance metrics
  • Precision: Proportion of retrieved documents
    that are relevant
  • → No penalty for selecting too few items
  • Recall: Proportion of relevant documents that
    have been retrieved
  • → No penalty for selecting too many items (e.g.,
    returning everything)

111
F-Measure
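  • The standard balanced F-measure is the harmonic
    mean of precision and recall
  • F1 = 2 · P · R / (P + R)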
112
Standard Text Collections
  • Relevant documents must be identified
  • Given a document collection D and a set of
    queries Q, RELq is the set of documents relevant
    to q
  • Whether a document d is relevant to a query q is
    decided by human judgement

113
Standard Text Collections
  • CACM (computer science): 3024 abstracts, 64
    queries
  • CF (medicine): 1239 abstracts, 100 queries
  • CISI (library science): 1460 abstracts, 112
    queries
  • CRANFIELD (aeronautics): 1400 abstracts, 225
    queries
  • LISA (library science): 6004 abstracts, 35
    queries
  • TIME (newspaper): 423 abstracts, 83 queries
  • Ohsumed (medicine): 348,566 abstracts, 106 queries

114
Building Test Collections
  • How to identify relevant documents?
  • How to assess relevance? (binary or
    finer-grained)
  • One vs several judges

115
TREC
  • Text REtrieval Conference
  • Proceedings at http://trec.nist.gov/
  • Established in 1991 to evaluate large-scale IR
  • Retrieving documents from a gigabyte collection
  • Organised by NIST and run continuously since 1991
  • Best-known IR evaluation setting
  • 25 participants in '92
  • 109 participants from 4 continents in 2004
  • European (CLEF) and Asian (NTCIR) counterparts

116
TREC Format
  • Several IR research tracks
  • ad-hoc retrieval
  • routing/filtering
  • cross-language
  • scanned documents
  • spoken documents
  • video
  • web
  • question answering
  • ...

117
TREC notion of relevance
  • If you were writing a report on the subject of
    the topic and would use the information contained
    in the document in the report, then the document
    is relevant
  • Pooling is used for identifying relevant
    documents
  • A set of possibly relevant documents is created
    automatically for each information need
  • The top 100 documents returned by each system are
    kept and inspected by judges who determine which
    documents are relevant
  • Inter-judge agreement is about 80%

118
Improving Recall and Precision
  • The two big problems with short queries are
  • Synonymy: Poor recall results from missing
    documents that contain synonyms of search terms,
    but not the terms themselves
  • Polysemy/Homonymy: Poor precision results from
    search terms that have multiple meanings, leading
    to the retrieval of non-relevant documents.

119
Query Expansion
  • Find a way to expand a user's query to
    automatically include relevant terms (that they
    should have included themselves), in an effort to
    improve recall
  • Use a dictionary/thesaurus
  • Use relevance feedback

120
Thesauri
  • A thesaurus contains information about words
    (e.g., violin) such as
  • Synonyms: similar words, e.g., fiddle
  • Hyperonyms: more general words, e.g., instrument
  • Hyponyms: more specific words, e.g., Stradivari
  • Meronyms: parts, e.g., strings
  • A very popular machine-readable thesaurus is
    WordNet

121
Problems of Thesauri
  • Language dependent
  • Available only for a couple of languages

122
Cooccurrence models
  • Semantically or syntactically related terms
  • Cooccurrence vs. Thesauri
  • Easy to adapt to other languages/domains
  • Also covers relations not expressed in thesauri
  • Not as reliable as manually edited thesauri
  • Can introduce considerable noise
  • Selection criteria: mutual information, expected
    mutual information

123
Relevance feedback
  • Ask the user to identify a few documents which
    appear to be related to their information need
  • Extract terms from those documents and add them
    to the original query.
  • Run the new query and present those results to
    the user.
  • Typically converges quickly

124
Blind feedback
  • Assume that the first few documents returned are
    most relevant, rather than having users identify them
  • Proceed as for relevance feedback
  • Tends to improve recall at the expense of
    precision

125
Post-Hoc Analysis
  • When a set of documents has been returned, they
    can be analyzed to improve usefulness in
    addressing the information need
  • Grouped by meaning for polysemic queries (using
    N-Gram-type approaches)
  • Grouped by extracted information (Named Entities,
    for instance)
  • Grouped into an existing hierarchy if structured
    fields are available
  • Filtering (e.g., eliminate spam)

126
References
  • Introduction to Information Retrieval, by C.D.
    Manning, P. Raghavan, and H. Schütze. To appear
    at Cambridge University Press (chapters available
    at the book website).
  • Information Retrieval, Second Edition, by C.J.
    van Rijsbergen, Butterworths, London, 1979.