The term vocabulary and postings lists


Lecture 2: The term vocabulary and postings lists. Related to Chapter 2: http://nlp.stanford.edu/IR-book/pdf/02voc.pdf


Slides: 61

Provided by: Christop582


1

Lecture 2
The term vocabulary and postings lists
Related to Chapter 2
http://nlp.stanford.edu/IR-book/pdf/02voc.pdf

2
Recap of the previous lecture
Ch. 1

Basic inverted indexes
Structure: Dictionary and Postings
Key step in construction: Sorting
Boolean query processing
Intersection by linear time merging
Simple optimizations

3
Recall the basic indexing pipeline
Documents to be indexed.
Friends, Romans, countrymen.
First project
4
Plan for this lecture

Elaborate basic indexing
Preprocessing to form the term vocabulary
Documents
Tokenization
What terms do we put in the index?
Postings
Faster merges: skip lists
Positional postings and phrase queries

5
Parsing a document
Sec. 2.1

What format is it in?
pdf/word/excel/html?
What language is it in?
What character set is in use?

Each of these is a classification problem, which
we will study later in the course.
But these tasks are often done heuristically
6
Complications: Format/language
Sec. 2.1

Documents being indexed can include docs from
many different languages
A single index may have to contain terms of
several languages.
Sometimes a document or its components can
contain multiple languages/formats
French email with a German pdf attachment.
Document unit

7
Tokens and Terms
8
Tokenization

Given a character sequence and a defined document
unit, tokenization is the task of chopping it up
into pieces, called tokens, perhaps at the same
time throwing away certain characters, such as
punctuation.

9
Tokenization
Sec. 2.2.1

Input: "university of Qom, computer department"
Output: Tokens
university
of
Qom
computer
department
A token is a sequence of characters in a document
Each such token is now a candidate for an index
entry, after further processing
Described below
But what are valid tokens to emit?
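
As a minimal sketch of the example above, assuming that a token is a maximal run of letters and digits and that punctuation is simply thrown away (the `tokenize` helper is hypothetical and deliberately naive):

```python
import re


def tokenize(text):
    """Chop a character sequence into tokens, discarding punctuation."""
    # \w+ matches maximal runs of letters, digits and underscores,
    # so commas, periods and spaces act as separators and are dropped.
    return re.findall(r"\w+", text)


print(tokenize("university of Qom, computer department"))
```

A rule this simple already makes choices: it would split Hewlett-Packard into two tokens, which is exactly the kind of decision the next slides discuss.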

10
Issues in tokenization
Sec. 2.2.1

Iran's capital → Iran? Irans? Iran's?
Hyphen
Hewlett-Packard → Hewlett and Packard as two
tokens?
the hold-him-back-and-drag-him-away maneuver:
break up hyphenated sequence.
co-education
lowercase, lower-case, lower case?
Space
San Francisco: how do you decide it is one token?

11
Issues in tokenization
Sec. 2.2.1

Numbers
Older IR systems may not index numbers
But often very useful: think about things like
looking up error codes/stack traces on the web
(One answer is using n-grams: Lecture 3)
Will often index meta-data separately
Creation date, format, etc.
3/12/91 Mar. 12, 1991 12/3/91
55 B.C.
B-52
My PGP key is 324a3df234cb23e
(800) 234-2333

12
Language issues in tokenization
Sec. 2.2.1

French
L'ensemble → one token or two?
L ? L' ? Le ?
Want l'ensemble to match with un ensemble
Until at least 2003, it didn't on Google
German noun compounds are not segmented
Lebensversicherungsgesellschaftsangestellter
life insurance company employee
German retrieval systems benefit greatly from a
compound splitter module
Can give a 15% performance boost for German

13
Language issues in tokenization
Sec. 2.2.1

Chinese and Japanese have no spaces between
words
莎拉波娃现在居住在美国东南部的佛罗里达。
Not always guaranteed a unique tokenization
Further complicated in Japanese, with multiple
alphabets intermingled

フォーチュン500社は情報不足のため時間あた$500K(約6,000万円)
14
Language issues in tokenization
Sec. 2.2.1

Arabic (or Hebrew) is basically written right to
left, but with certain items like numbers written
left to right
With modern Unicode representation concepts, the
order of characters in files matches the
conceptual order, and the reversal of displayed
characters is handled by the rendering system,
but this may not be true for documents in older
encodings.
Other complexities that you know!

15
Stop words
Sec. 2.2.2

With a stop list, you exclude from the
dictionary entirely the commonest words.
Intuition: They have little semantic content:
the, a, and, to, be
Using a stop list significantly reduces the
number of postings that a system has to store,
because these words occur so frequently.

16
Stop words

You need them for
Phrase queries: "President of Iran"
Various song titles, etc.: "Let it be", "To be or
not to be"
Relational queries: "flights to London"
The general trend in IR systems has been from large
stop lists (200-300 terms) to very small stop lists
(7-12 terms) to no stop list.
Good compression techniques (lecture 5) mean the
space for including stop words in a system is
very small.
Good query optimization techniques (lecture 7)
mean you pay little at query time for including
stop words.
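
A small sketch of why stop lists hurt the queries listed above, assuming the five-word stop list from the previous slide and a hypothetical `remove_stop_words` helper:

```python
# Stop list taken from the example words on the "Stop words" slide.
STOP_WORDS = {"the", "a", "and", "to", "be"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop every token whose lowercase form is on the stop list."""
    return [t for t in tokens if t.lower() not in stop_words]


print(remove_stop_words("flights to London".split()))
print(remove_stop_words("To be or not to be".split()))
```

After filtering, "flights to London" can no longer be told apart from "flights from London", and almost nothing of the Hamlet quotation survives.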

17
Normalization to terms
Sec. 2.2.3

We want to match I.R. and IR
Token normalization is the process of
canonicalizing tokens so that matches occur
despite superficial differences in the character
sequences of the tokens
Result is terms: a term is a (normalized) word
type, which is an entry in our IR system
dictionary

18
Normalization to terms

One way is using equivalence classes
Searches for one term will retrieve documents
that contain any member of its class.
Equivalence classes are most commonly defined
implicitly, by a mapping rule, rather than being
fully calculated in advance (hand constructed), e.g.,
deleting periods to form a term
U.S.A., USA → USA
deleting hyphens to form a term
anti-discriminatory, antidiscriminatory → antidiscriminatory
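
The two implicit rules above can be sketched as a single mapping function. This is only an illustration under the assumption that periods and hyphens are always deleted; the `normalize` name is hypothetical and case folding is left out on purpose:

```python
def normalize(token):
    """Map a token to its term by deleting periods and hyphens."""
    return token.replace(".", "").replace("-", "")


for token in ["U.S.A.", "USA", "anti-discriminatory", "antidiscriminatory"]:
    print(token, "->", normalize(token))
```

Tokens that map to the same term form one equivalence class, so the class is never stored explicitly; it is implied by the rule.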

19
Normalization other languages
Sec. 2.2.3

Accents: e.g., French résumé vs. resume.
Umlauts: e.g., German Tuebingen vs. Tübingen
Normalization of things like date forms
7月30日 vs. 7/30
Tokenization and normalization may depend on the
language and so are intertwined with language
detection
Crucial: Need to normalize indexed text as well
as query terms into the same form

20
Case folding
Sec. 2.2.3

Reduce all letters to lower case
Exception: upper case in mid-sentence?
Often best to lower case everything, since users
will use lowercase regardless of correct
capitalization

21
Normalization to terms
Sec. 2.2.3

What is the disadvantage of equivalence classing?
An alternative to equivalence classing is to do
asymmetric expansion (hand constructed)
An example of where this may be useful
Enter: window    Search: window, windows
Enter: windows   Search: Windows, windows, window
Enter: Windows   Search: Windows
Potentially more powerful, but less efficient
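
A minimal sketch of asymmetric expansion at query time, using the window/windows/Windows example above. The expansion table is hand constructed as on the slide; the `expand` and `search` helpers and the toy index are assumptions for illustration:

```python
# Hand-constructed query expansion table from the slide.
QUERY_EXPANSION = {
    "window": ["window", "windows"],
    "windows": ["Windows", "windows", "window"],
    "Windows": ["Windows"],
}


def expand(query_term):
    """Return the index terms to search for a given query term."""
    return QUERY_EXPANSION.get(query_term, [query_term])


def search(index, query_term):
    """Union the postings of every expansion of the query term."""
    docs = set()
    for term in expand(query_term):
        docs |= index.get(term, set())
    return docs


# Toy index in which the terms were NOT case-folded at indexing time.
toy_index = {"window": {1}, "windows": {2}, "Windows": {3}}
print(search(toy_index, "window"), search(toy_index, "Windows"))
```

The extra cost is visible in `search`: one dictionary lookup and one postings union per expansion, which is the "less efficient" part of the tradeoff.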

22
Thesauri and soundex

Do we handle synonyms and homonyms?
E.g., by hand-constructed equivalence classes
car = automobile, color = colour
What about spelling mistakes?
One approach is soundex, which forms equivalence
classes of words based on phonetic heuristics
More in lectures 3 and 9

23
Review

IR systems
Indexing
Searching
Indexing
Parsing document
Tokenization -> tokens
Normalization -> terms
Indexing -> index

24
Review

Normalization: consider tokens rather than query.
Examples
Case
Hyphen
Period
Synonyms
Spelling mistakes

25
Review

Two methods for normalization
Equivalence classing: often implicit.
Asymmetric expansion
Query time
a query expansion dictionary
more processing at query time
Indexing time
more space for storing postings.
Asymmetric expansion is considerably less
efficient than equivalence classing but more
flexible.

26
Stemming and lemmatization

Documents are going to use different forms of a
word, such as organize, organizes, and
organizing.
Additionally, there are families of
derivationally related words with similar
meanings, such as democracy, democratic, and
democratization.
Reduce terms to their roots before indexing.
E.g.,
am, are, is → be
car, cars, car's, cars' → car
the boy's cars are different colors → the boy car
be different color

27
Stemming
Sec. 2.2.4

Stemming suggests crude affix chopping
language dependent
Example
Porter's algorithm
http://www.tartarus.org/martin/PorterStemmer
Lovins stemmer
http://www.comp.lancs.ac.uk/computing/research/stemming/general/lovins.htm

28
Porter's algorithm
Sec. 2.2.4

Commonest algorithm for stemming English
Results suggest it's at least as good as other
stemming options
Conventions: 5 phases of reductions
phases applied sequentially
each phase consists of a set of commands
Sample convention: Of the rules in a compound
command, select the one that applies to the
longest suffix.

29
Typical rules in Porter
Sec. 2.2.4

Rule           Example
sses → ss      presses → press
ies → i        bodies → bodi
ss → ss        press → press
s →            cats → cat
Many other rules are sensitive to the measure of
words
(m>1) EMENT →
replacement → replac
cement → cement
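
The rules on this slide can be sketched as follows. This is not the full Porter stemmer, only the four suffix rules and the single EMENT rule shown above; the function names are hypothetical, and the measure m is computed as the number of vowel-to-consonant transitions in the stem, following Porter's [C](VC)^m[V] definition:

```python
# The four rules from the slide: (suffix, replacement).
SUFFIX_RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]


def apply_suffix_rules(word):
    """Apply the rule that matches the longest suffix (the slide's convention)."""
    for suffix, replacement in sorted(SUFFIX_RULES, key=lambda r: -len(r[0])):
        if word.endswith(suffix):
            return word[: -len(suffix)] + replacement
    return word


def is_consonant(word, i):
    """Porter's definition: 'y' is a vowel when preceded by a consonant."""
    if word[i] in "aeiou":
        return False
    if word[i] == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Count vowel-consonant sequences: m in the form [C](VC)^m[V]."""
    m, previous_was_vowel = 0, False
    for i in range(len(stem)):
        consonant = is_consonant(stem, i)
        if consonant and previous_was_vowel:
            m += 1
        previous_was_vowel = not consonant
    return m


def strip_ement(word):
    """(m>1) EMENT -> (empty): delete the suffix only if the stem has m > 1."""
    if word.endswith("ement") and measure(word[:-5]) > 1:
        return word[:-5]
    return word


print([apply_suffix_rules(w) for w in ["presses", "bodies", "press", "cats"]])
print(strip_ement("replacement"), strip_ement("cement"))
```

The measure condition is what keeps cement intact: its stem "c" has m = 0, while "replac" has m = 2.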

30
Lemmatization
Sec. 2.2.4

Reduce inflectional/variant forms to base form
(lemma) properly with the use of a vocabulary and
morphological analysis of words
Lemmatizer: a tool from Natural Language
Processing which does full morphological analysis
to accurately identify the lemma for each word.

31
Stemming vs. Lemmatization

Token: saw
Stemming might return just "s".
Lemmatization would attempt to return either
"see", if the token was used as a verb, or
"saw", if the token was used as a noun.

32
Helpfulness of normalization

Do stemming and other normalizations help?
Definitely useful for Spanish, German, Finnish,
30% performance gains for Finnish!
What about English?

33
Helpfulness of normalization

English
Not much help overall!
Helps a lot for some queries, hurts performance a
lot for others.
Stemming helps recall but harms precision:
operative (dentistry) → oper
operational (research) → oper
operating (systems) → oper
For a case like this, moving to using a
lemmatizer would not completely fix the problem

34
Project 2

Find rules for normalizing Farsi documents and
implement.

35
Exercise

Are the following statements true or false? Why?
a. In a Boolean retrieval system, stemming never
lowers precision.
b. In a Boolean retrieval system, stemming never
lowers recall.
c. Stemming increases the size of the vocabulary.
d. Stemming should be invoked at indexing time
but not while processing a query

36
Language-specificity
Sec. 2.2.4

Many of the above features embody transformations
that are
Language-specific and
Often, application-specific
These are plug-in addenda to the indexing
process
Both open source and commercial plug-ins are
available for handling these

37
Faster postings merges: Skip lists
38
Recall basic merge
Sec. 2.3

Walk through the two postings simultaneously, in
time linear in the total number of postings
entries

Brutus: 2 → 4 → 8 → 41 → 48 → 64 → 128
Caesar: 1 → 2 → 3 → 8 → 11 → 17 → 21 → 31
Intersection: 2 → 8
If the list lengths are m and n, the merge takes
O(m+n) operations.
Can we do better? Yes
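
For reference, the basic merge can be sketched like this, assuming both postings lists are sorted lists of docIDs (the `intersect` name is hypothetical). Each loop iteration advances at least one pointer, which is where the linear bound comes from:

```python
def intersect(p1, p2):
    """Intersect two sorted postings lists in time linear in their total length."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])  # docID occurs in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the pointer at the smaller docID
        else:
            j += 1
    return answer


brutus = [2, 4, 8, 41, 48, 64, 128]
caesar = [1, 2, 3, 8, 11, 17, 21, 31]
print(intersect(brutus, caesar))
```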
39
Augment postings with skip pointers (at indexing
time)
Sec. 2.3
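[Figure: the two postings lists from the basic merge, augmented with skip pointers: 41 and 128 on the upper list, 11 and 31 on the lower list.]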

Why?
To skip postings that will not figure in the
search results.
How?
Where do we place skip pointers?
The resulting list is a skip list.

40
Query processing with skip pointers
Sec. 2.3
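[Figure: the same two postings lists with skip pointers; the upper list skips to 41 and 128, the lower list to 11 and 31.]
Suppose we've stepped through the lists until we
process 8 on each list. We match it and advance.
We then have 41 on the upper list and 11 on the
lower. 11 is smaller.
The skip successor of 11 on the lower list is 31,
which is also smaller than 41, so we can skip
ahead past the intervening postings.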
41
Where do we place skips?
Sec. 2.3

Tradeoff:
More skips → shorter skip spans → more likely to
skip. But lots of comparisons to skip pointers.
Fewer skips → few pointer comparisons, but then
long skip spans → few successful skips.

42
Placing skips
Sec. 2.3

Simple heuristic: for postings of length L, use
√L evenly-spaced skip pointers.
This ignores the distribution of query terms.
Easy if the index is relatively static; harder if
L keeps changing because of updates.
This definitely used to help; with modern
hardware it may not (Bahle et al. 2002)
The I/O cost of loading a bigger postings list
can outweigh the gains from quicker in-memory
merging!
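
A sketch of the merge with skip pointers under the √L heuristic. It assumes postings are stored as sorted Python lists, so a skip pointer is just an index jump of a fixed stride from every stride-th position; the names `skip_stride` and `intersect_with_skips` are hypothetical, and the evenly spaced skips differ from the hand-placed ones in the slide figure:

```python
import math


def skip_stride(length):
    """Stride giving roughly sqrt(L) evenly spaced skip pointers."""
    return max(1, math.isqrt(length))


def intersect_with_skips(p1, p2):
    """Intersect two sorted postings lists, following skips when it is safe."""
    s1, s2 = skip_stride(len(p1)), skip_stride(len(p2))
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # A skip exists at every s1-th position; take it only if the
            # target does not overshoot the docID on the other list.
            if i % s1 == 0 and i + s1 < len(p1) and p1[i + s1] <= p2[j]:
                i += s1
            else:
                i += 1
        else:
            if j % s2 == 0 and j + s2 < len(p2) and p2[j + s2] <= p1[i]:
                j += s2
            else:
                j += 1
    return answer


print(intersect_with_skips([2, 4, 8, 41, 48, 64, 128],
                           [1, 2, 3, 8, 11, 17, 21, 31]))
```

Because a skip is taken only when its target is still less than or equal to the other list's current docID, every posting jumped over is strictly smaller than that docID and cannot be part of the answer.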

D. Bahle, H. Williams, and J. Zobel. Efficient
phrase querying with an auxiliary index. SIGIR
2002, pp. 215-221.
43
Exercise

Do exercises 2.5 and 2.6 of your book.

44
Phrase queries and positional indexes
45
Phrase queries
Sec. 2.4

Want to be able to answer queries such as
"stanford university" as a phrase
Thus the sentence "I went to university at
Stanford" is not a match.
Most recent search engines support a double
quotes syntax

46
Phrase queries

The concept of phrase queries has proven to be very
easily understood and successfully used by users.
As many as 10% of web queries are phrase queries.
Many more queries are implicit phrase queries.
For this, it no longer suffices to store only
<term : docs> entries
Solutions?

47
A first attempt: Biword indexes
Sec. 2.4.1

Index every consecutive pair of terms in the text
as a phrase
For example, the text "Qom computer department"
would generate the biwords
Qom computer
computer department
Each of these biwords is now a dictionary term
Two-word phrase query-processing is now immediate.

48
Longer phrase queries
Sec. 2.4.1

The query "modern information retrieval course"
can be broken into the Boolean query on biwords:
"modern information" AND "information retrieval" AND
"retrieval course"
Works fairly well in practice,
but there can and will be occasional false
positives.
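
A toy biword index makes both the idea and the false-positive problem concrete. This is a sketch under simple assumptions (whitespace tokenization, lowercasing, docID sets as postings); the helper names and the two sample documents are invented for illustration:

```python
from collections import defaultdict


def biwords(text):
    """Return every consecutive pair of terms in the text."""
    tokens = text.lower().split()
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]


def build_biword_index(docs):
    """Map each biword to the set of docIDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for biword in biwords(text):
            index[biword].add(doc_id)
    return index


def phrase_candidates(index, phrase):
    """AND together the postings of all biwords of the query phrase."""
    parts = biwords(phrase)
    if not parts:
        return set()
    result = set(index.get(parts[0], set()))
    for biword in parts[1:]:
        result &= index.get(biword, set())
    return result


DOCS = {
    1: "modern information retrieval course",
    2: "retrieval course on modern information and information retrieval",
}
INDEX = build_biword_index(DOCS)
print(biwords("Qom computer department"))
print(phrase_candidates(INDEX, "modern information retrieval course"))
```

Document 2 contains all three biwords but never the four-word phrase itself, so it comes back as a false positive.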

49
Extended biwords

Now consider phrases such as "student of the
computer"
Perform part-of-speech tagging (POST).
POST classifies words as nouns, verbs, etc.
Group the terms into (say) Nouns (N) and
articles/prepositions (X).

50
Extended biwords
Sec. 2.4.1

Call any string of terms of the form NX*N an
extended biword.
Each such extended biword is made a term in the
vocabulary
Segment query into extended biwords

51
Issues for biword indexes
Sec. 2.4.1

False positives, as noted before
Index blowup due to bigger dictionary
Infeasible for more than biwords, big even for
them
Biword indexes are not the standard solution (for
all biwords) but can be part of a compound
strategy

52
Solution 2: Positional indexes
Sec. 2.4.2

In the postings, store for each term the
position(s) in which tokens of it appear:
<term, number of docs containing term;
doc1: position1, position2 ... ;
doc2: position1, position2 ... ;
etc.>

53
Positional index example
Sec. 2.4.2
<be: 993427;
1: 7, 18, 33, 72, 86, 231;
2: 3, 149;
4: 17, 191, 291, 430, 434;
5: 363, 367, ...>
Which of docs 1, 2, 4, 5 could contain "to be or not
to be"?

For phrase queries, we need to deal with more
than just equality

54
Processing a phrase query
Sec. 2.4.2

Extract inverted index entries for each distinct
term: to, be, or, not.
Merge their doc:position lists
to:
2: 1, 17, 74, 222, 551; 4: 8, 16, 190, 429, 433;
7: 13, 23, 191; ...
be:
1: 17, 19; 4: 17, 191, 291, 430, 434; 5: 14, 19, 101; ...
Same general method for proximity searches
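
Using only the visible part of the postings above (the "..." is left out), a sketch of the positional check for the two-word phrase "to be". For brevity it uses a set lookup per document instead of a true linear merge of the position lists, and `phrase_positions` is a hypothetical helper:

```python
# Positional postings copied from the slide: docID -> sorted positions.
TO = {2: [1, 17, 74, 222, 551], 4: [8, 16, 190, 429, 433], 7: [13, 23, 191]}
BE = {1: [17, 19], 4: [17, 191, 291, 430, 434], 5: [14, 19, 101]}


def phrase_positions(first, second, gap=1):
    """Find docs where `second` occurs exactly `gap` positions after `first`."""
    hits = {}
    for doc_id in sorted(first.keys() & second.keys()):
        second_positions = set(second[doc_id])
        starts = [p for p in first[doc_id] if p + gap in second_positions]
        if starts:
            hits[doc_id] = starts
    return hits


print(phrase_positions(TO, BE))
```

Only document 4 appears in both lists, and the check is no longer plain equality of docIDs: a position in "be" must be exactly one greater than a position in "to".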

55
Proximity queries
Sec. 2.4.2

LIMIT /3 STATUTE /3 FEDERAL /2 TORT
/k means "within k words of" (on either side).
Clearly, positional indexes can be used for such
queries; biword indexes cannot.
Figure 2.12: The merge of postings to handle
proximity queries.
This is a little tricky to do correctly and
efficiently
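
One way to do the within-k check on the position lists of a single document is a sliding window over the second list. This is a sketch, not the book's Figure 2.12 pseudocode; it assumes both lists are sorted and the `positions_within_k` name is hypothetical:

```python
def positions_within_k(pos1, pos2, k):
    """Return all pairs (p1, p2) with |p1 - p2| <= k from two sorted lists."""
    pairs = []
    start = 0  # first index in pos2 that can still be within k of p1
    for p1 in pos1:
        # Positions below p1 - k are too far left for this and all later p1.
        while start < len(pos2) and pos2[start] < p1 - k:
            start += 1
        j = start
        while j < len(pos2) and pos2[j] <= p1 + k:
            pairs.append((p1, pos2[j]))
            j += 1
    return pairs


# Positions of "to" and "be" in document 4 of the earlier example.
to_positions = [8, 16, 190, 429, 433]
be_positions = [17, 191, 291, 430, 434]
print(positions_within_k(to_positions, be_positions, 3))
```

The window start only moves forward, so the work is proportional to the list lengths plus the number of pairs reported.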

56
Positional index size
Sec. 2.4.2

Need an entry for each occurrence, not just once
per document
Index size depends on average document size
Average web page has <1000 terms
Books, even some epic poems: easily 100,000
terms
Consider a term with frequency 0.1%

Why?
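
The arithmetic behind the question can be worked out directly. The sketch below assumes the term frequency is 0.1%, i.e. one occurrence per 1,000 terms (an assumption about the intended unit), and that a non-positional index stores one entry per document containing the term; `index_entries` is a hypothetical helper:

```python
def index_entries(doc_length, occurrences_per_1000=1):
    """Return (non-positional, positional) entry counts for one term in one doc."""
    occurrences = doc_length * occurrences_per_1000 // 1000
    nonpositional = 1 if occurrences > 0 else 0  # one posting per document
    positional = occurrences                     # one entry per occurrence
    return nonpositional, positional


for length in (1000, 100_000):
    print(length, index_entries(length))
```

A 1,000-term page costs one entry either way, while a 100,000-term book costs one non-positional posting but 100 positional entries, which is why index size depends on average document size.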
57
Rules of thumb
Sec. 2.4.2

A positional index is 2-4 times as large as a
non-positional index
Positional index size is 35-50% of volume of
original text
Caveat: all of this holds for "English-like"
languages

58
Positional index size
Sec. 2.4.2

You can compress position values/offsets: we'll
talk about that in lecture 5
Nevertheless, a positional index expands postings
storage substantially
Nevertheless, a positional index is now
standardly used because of the power and
usefulness of phrase and proximity queries
whether used explicitly or implicitly in a
ranking retrieval system.

59
Combination schemes
Sec. 2.4.3

These two approaches can be profitably combined
For particular phrases ("Hossein Rezazadeh") it
is inefficient to keep on merging positional
postings lists

60
Combination schemes

Williams et al. (2004) evaluate a more
sophisticated mixed indexing scheme
A typical web query mixture was executed in ¼ of
the time of using just a positional index
It required 26% more space than having a
positional index alone
H.E. Williams, J. Zobel, and D. Bahle. 2004.
Fast Phrase Querying with Combined Indexes, ACM
Transactions on Information Systems.
