Provided by: smu142. Learn more at: http://lyle.smu.edu
1
Information Retrieval
  • CSE 8337 (Part I)
  • Spring 2011
  • Some material for these slides obtained from:
  • Modern Information Retrieval by Ricardo
    Baeza-Yates and Berthier Ribeiro-Neto
    http://www.sims.berkeley.edu/~hearst/irbook/
  • Data Mining: Introductory and Advanced Topics by
    Margaret H. Dunham
  • http://www.engr.smu.edu/~mhd/book
  • Introduction to Information Retrieval by
    Christopher D. Manning, Prabhakar Raghavan, and
    Hinrich Schütze
  • http://informationretrieval.org

2
CSE 8337 Outline
  • Introduction
  • Text Processing
  • Indexes
  • Boolean Queries
  • Web Searching/Crawling
  • Vector Space Model
  • Matching
  • Evaluation
  • Feedback/Expansion

3
Information Retrieval
  • Information Retrieval (IR): retrieving desired
    information from textual data.
  • Library Science
  • Digital Libraries
  • Web Search Engines
  • Traditionally keyword based
  • Sample query
  • Find all documents about data mining.

4
Motivation
  • IR: representation, storage, organization of, and
    access to information items
  • Focus is on the user information need
  • User information need (example)
  • Find all docs containing information on college
    tennis teams which (1) are maintained by a USA
    university and (2) participate in the NCAA
    tournament.
  • Emphasis is on the retrieval of information (not
    data)

5
DB vs IR
  • Records (tuples) vs. documents
  • Well-defined results vs. fuzzy results
  • DB grew out of files and traditional business
    systems
  • IR grew out of library science and need to
    categorize/group/access books/articles

6
Unstructured data
  • Typically refers to free text
  • Allows
  • Keyword queries including operators
  • More sophisticated concept queries e.g.,
  • find all web pages dealing with drug abuse
  • Classic model for searching text documents

7
Semi-structured data
  • In fact almost no data is unstructured
  • E.g., this slide has distinctly identified zones
    such as the Title and Bullets
  • Facilitates semi-structured search such as
  • Title contains "data" AND Bullets contain "search"
  • to say nothing of linguistic structure

8
DB vs IR (contd)
  • Data retrieval
  • which docs contain a set of keywords?
  • Well defined semantics
  • a single erroneous object implies failure!
  • Information retrieval
  • information about a subject or topic
  • semantics is frequently loose
  • small errors are tolerated
  • IR system
  • interpret contents of information items
  • generate a ranking which reflects relevance
  • notion of relevance is most important

9
Motivation
  • IR software issues
  • classification and categorization
  • systems and languages
  • user interfaces and visualization
  • Still, area was seen as of narrow interest
  • Advent of the Web changed this perception once
    and for all
  • universal repository of knowledge
  • free (low cost) universal access
  • no central editorial board
  • many problems though IR seen as key to finding
    the solutions!

10
Basic Concepts
  • The User Task
  • Retrieval
  • information or data
  • purposeful
  • Browsing
  • glancing around
  • Feedback

(Diagram: the user issues retrieval and browsing requests against the
database and receives a response, with a feedback loop.)
11
Basic Concepts
Logical view of the documents
12
The Retrieval Process
13
Basic assumptions of Information Retrieval
  • Collection Fixed set of documents
  • Goal Retrieve documents with information that is
    relevant to users information need and helps him
    complete a task

14
Fuzzy Sets and Logic
  • Fuzzy Set Set membership function is a real
    valued function with output in the range 0,1.
  • f(x) Probability x is in F.
  • 1-f(x) Probability x is not in F.
  • EX
  • T x x is a person and x is tall
  • Let f(x) be the probability that x is tall
  • Here f is the membership function

15
Fuzzy Sets
16
IR is Fuzzy
(Diagram: a simple (crisp) set draws a sharp boundary between Relevant
and Not Relevant; a fuzzy set lets membership shade gradually between
them.)
17
Information Retrieval Metrics
  • Similarity measure of how close a query is to a
    document.
  • Documents which are close enough are retrieved.
  • Metrics
  • Precision Relevant and Retrieved
  • Retrieved
  • Recall Relevant and Retrieved
  • Relevant

18
IR Query Result Measures
19
CSE 8337 Outline
  • Introduction
  • Text Processing (Background)
  • Indexes
  • Boolean Queries
  • Web Searching/Crawling
  • Vector Space Model
  • Matching
  • Evaluation
  • Feedback/Expansion

20
Text Processing TOC
  • Simple Text Storage
  • String Matching
  • Approximate (Fuzzy) Matching (Spell Checker)
  • Parsing
  • Tokenization
  • Stemming/ngrams
  • Stop words
  • Synonyms

21
Text storage
  • EBCDIC/ASCII
  • Array of characters
  • Linked list of characters
  • Trees: B-tree, trie
  • Stuart E. Madnick, "String Processing
    Techniques," Communications of the ACM, Vol. 10,
    No. 7, July 1967, pp. 420-424.

22
Pattern Matching(Recognition)
  • Pattern Matching finds occurrences of a
    predefined pattern in the data.
  • Applications include speech recognition,
    information retrieval, time series analysis.

23
Similarity Measures
  • Determine similarity between two objects.
  • Similarity characteristics
  • Alternatively, distance measures capture how
    unlike or dissimilar objects are.

24
String Matching Problem
  • Input:
  • Pattern of length m
  • Text string of length n
  • Find one (next, all) occurrence(s) of the pattern
    in the string
  • Ex:
  • String: 00110011011110010100100111
  • Pattern: 011010

25
String Matching Algorithms
  • Brute Force
  • Knuth-Morris-Pratt
  • Boyer-Moore

26
Brute Force String Matching
  • Brute Force
  • Handbook of Algorithms and Data Structures
  • http://www.dcc.uchile.cl/~rbaeza/handbook/algs/7/711a.srch.c.html
  • Space: O(m + n)
  • Time: O(mn)

00110011011110010100100111
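A minimal brute-force matcher, applied to the slide's example string (a sketch; real implementations avoid building slices):

```python
def brute_force_search(text, pattern):
    # Try every alignment; compare character by character.
    # Time O(mn) worst case; space O(m + n) for the inputs.
    n, m = len(text), len(pattern)
    return [i for i in range(n - m + 1) if text[i:i + m] == pattern]

# The slide's example (the full pattern happens not to occur in it):
hits = brute_force_search("00110011011110010100100111", "011010")
```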
27
FSR
28
Creating FSR
  • Create FSM
  • Construct the correct spine.
  • Add a default failure bus to state 0.
  • Add a default initial bus to state 1.
  • For each state, decide its attachments to failure
    bus, initial bus, or other failure links.

29
Knuth-Morris-Pratt
  • Apply FSM to string by processing characters one
    at a time.
  • Accepting state is reached when pattern is found.
  • Space: O(m + n)
  • Time: O(m + n)
  • Handbook of Algorithms and Data Structures
  • http://www.dcc.uchile.cl/~rbaeza/handbook/algs/7/712.srch.c.html
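A sketch of KMP in Python; the failure function plays the role of the failure links in the FSM described above:

```python
def kmp_search(text, pattern):
    # Failure function: fail[i] = length of the longest proper prefix
    # of pattern[:i+1] that is also a suffix of it.
    m = len(pattern)
    fail = [0] * m
    k = 0
    for i in range(1, m):
        while k and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    # Scan the text once, following failure links on mismatch.
    hits, k = [], 0
    for i, c in enumerate(text):
        while k and c != pattern[k]:
            k = fail[k - 1]
        if c == pattern[k]:
            k += 1
        if k == m:                     # accepting state reached
            hits.append(i - m + 1)
            k = fail[k - 1]
    return hits
```

Each text character is examined once, and the failure links can only move the match pointer back as far as it has advanced, giving the O(m + n) bound.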

30
Boyer-Moore
  • Scan pattern from right to left
  • Skip many positions on a mismatched character.
  • Worst-case time O(mn)
  • Expected time better than KMP
  • Expected behavior better in practice
  • Handbook of Algorithms and Data Structures
  • http://www.dcc.uchile.cl/~rbaeza/handbook/algs/7/713.preproc.c.html

31
Approximate String Matching
  • Find patterns close to the string
  • Fuzzy matching
  • Applications
  • Spelling checkers
  • IR
  • Define similarity (distance) between string and
    pattern

32
String-to-String Correction
  • Levenshtein Distance
  • http://www.mendeley.com/research/binary-codes-capable-of-correcting-insertions-and-reversals/
  • Measure of similarity between strings
  • Can be used to determine how to convert from one
    string to another
  • Cost to convert one to the other
  • Transformations:
  • Match: current characters in both strings are
    the same
  • Delete: delete current character in input string
  • Insert: insert current character in target
    string into input string
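The edit distance can be computed with the classic dynamic program; note that this version also allows substitution at cost 1, a common extension of the match/delete/insert transformations listed above:

```python
def levenshtein(source, target):
    # Row-by-row DP; cost 0 for a match, 1 for delete/insert, and
    # (as a common extension of the slide's transformations) 1 for
    # substituting one character for another.
    prev = list(range(len(target) + 1))
    for i, sc in enumerate(source, 1):
        cur = [i]
        for j, tc in enumerate(target, 1):
            cur.append(min(prev[j] + 1,                # delete sc
                           cur[j - 1] + 1,             # insert tc
                           prev[j - 1] + (sc != tc)))  # match/substitute
        prev = cur
    return prev[-1]
```

For example, converting "misspell" to "mispell" costs 1 (one deletion).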

33
Distance Between Strings
34
Spell Checkers
  • Check or Replace or Expand or Suggest
  • Phonetic
  • Use phonetic spelling for word
  • Truespel: www.foreignword.com/cgi-bin//transpel.cgi
  • Phoneme: smallest unit of sound
  • Jaro-Winkler
  • distance measure
  • http://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance
  • Autocomplete
  • www.amazon.com

35
Tokenization
  • Find individual words (tokens) in a text string.
  • Look for spaces, commas, etc.
  • http://nlp.stanford.edu/IR-book/html/htmledition/tokenization-1.html

36
Stemming/ngrams
  • Convert token/word into smallest word with
    similar derivations
  • Remove suffixes (s, ed, ing, ...)
  • Remove prefixes (pre, re, un, ...)
  • n-gram: subsequences of length n
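A toy sketch of both ideas. The suffix list and length guard here are illustrative only; a production system would use a real stemmer such as Porter's algorithm rather than naive suffix stripping:

```python
def ngrams(token, n=3):
    # All contiguous character subsequences of length n.
    return [token[i:i + n] for i in range(len(token) - n + 1)]

def crude_stem(token, suffixes=("ing", "ed", "s")):
    # Naive suffix stripping; the suffix list and length guard are
    # illustrative only -- real systems use a Porter-style stemmer.
    for suf in suffixes:
        if token.endswith(suf) and len(token) > len(suf) + 2:
            return token[:-len(suf)]
    return token
```

For example, `ngrams("retrieval")` yields "ret", "etr", "tri", ..., and `crude_stem("searching")` yields "search".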

37
Stopwords
  • Common words
  • Bad words
  • Implementation
  • Text file

38
Synonyms
  • Exact/similar meaning
  • Hierarchy
  • One way
  • Bidirectional
  • Expand Query
  • Replace terms
  • Implementation
  • Synonym File or dictionary

39
CSE 8337 Outline
  • Introduction
  • Text Processing
  • Indexes
  • Boolean Queries
  • Web Searching/Crawling
  • Vector Space Model
  • Matching
  • Evaluation
  • Feedback/Expansion

40
Index
  • Common access is by keyword
  • Fast access by keyword
  • Index organizations?
  • Hash
  • B-tree
  • Linked List
  • Process document and query to identify keywords

41
Term-document incidence
1 if play contains word, 0 otherwise
Brutus AND Caesar but NOT Calpurnia
42
Incidence vectors
  • So we have a 0/1 vector for each term.
  • To answer the query: take the vectors for Brutus,
    Caesar and Calpurnia (complemented), then bitwise
    AND.
  • 110100 AND 110111 AND 101111 = 100100.
  • http://www.rhymezone.com/shakespeare/
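The bitwise AND on the slide can be reproduced directly, writing each vector as a 6-bit integer, one bit per play:

```python
# Incidence vectors matching the slide's bit strings, one bit per play:
brutus    = 0b110100
caesar    = 0b110111
calpurnia = 0b010000          # its complement is 101111, as on the slide
mask = (1 << 6) - 1           # keep only the 6 play bits

# Brutus AND Caesar AND NOT Calpurnia:
answer = brutus & caesar & (~calpurnia & mask)
# format(answer, "06b") == "100100"
```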

43
Inverted index
  • For each term T, we must store a list of all
    documents that contain T.
  • Do we use an array or a list for this?

(Diagram: dictionary terms Brutus, Calpurnia, and Caesar, each pointing
to its postings list of docIDs.)
What happens if the word Caesar is added to
document 14?
44
Inverted index
  • Linked lists generally preferred to arrays
  • Dynamic space allocation
  • Insertion of terms into documents easy
  • Space overhead of pointers

(Diagram: dictionary terms with postings lists of docIDs, e.g.
2 → 4 → 8 → 16 → 32 → 64 → 128 and 2 → 3 → 5 → 8 → 13 → 21 → 34.)
Sorted by docID (more later on why).
45
Inverted index construction
Documents to be indexed.
Friends, Romans, countrymen.
46
Indexer steps
  • Sequence of (Modified token, Document ID) pairs.

Doc 1
Doc 2
I did enact Julius Caesar I was killed i' the
Capitol Brutus killed me.
So let it be with Caesar. The noble Brutus hath
told you Caesar was ambitious
47
  • Sort by terms.

Core indexing step.
48
  • Multiple term entries in a single document are
    merged.
  • Frequency information is added.

Why frequency? Will discuss later.
49
  • The result is split into a Dictionary file and a
    Postings file.
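The indexer steps from the last few slides (emit pairs, sort by term, merge duplicates, record frequencies, split into dictionary and postings) can be sketched as follows; the two toy documents echo the slide example:

```python
from collections import Counter

def build_index(docs):
    # Step 1: emit (modified token, docID) pairs.
    pairs = [(tok, doc_id)
             for doc_id, text in docs.items()
             for tok in text.lower().split()]
    pairs.sort()                        # step 2: core indexing step
    freq = Counter(pairs)               # step 3: merge dups, keep frequency
    # Step 4: split into dictionary (term -> doc freq) and postings.
    postings = {}
    for (term, doc_id), tf in sorted(freq.items()):
        postings.setdefault(term, []).append((doc_id, tf))
    dictionary = {term: len(plist) for term, plist in postings.items()}
    return dictionary, postings

docs = {1: "I did enact Julius Caesar I was killed",
        2: "So let it be with Caesar the noble Brutus"}
dictionary, postings = build_index(docs)
# postings["caesar"] == [(1, 1), (2, 1)]; dictionary["caesar"] == 2
```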

50
  • Where do we pay in storage?

(Diagram: storage is paid both in the dictionary terms and in the
postings pointers.)
51
Example Data
  • As an example for applying scalable index
    construction algorithms, we will use the Reuters
    RCV1 collection.
  • This is one year of Reuters newswire (part of
    1995 and 1996)
  • http://about.reuters.com/researchandstandards/corpus/available.asp
  • Hardware assumptions: Table 4.1, p. 62 in textbook

52
A Reuters RCV1 document
53
Reuters RCV1 statistics
Symbol  Statistic                                      Value
N       documents                                      800,000
L       avg. tokens per doc                            200
M       terms (= word types)                           400,000
        avg. bytes per token (incl. spaces/punct.)     6
        avg. bytes per token (without spaces/punct.)   4.5
        avg. bytes per term                            7.5
T       non-positional postings                        100,000,000
4.5 bytes per word token vs. 7.5 bytes per word type: why?
54
Basic index construction
  • Documents are parsed to extract words and these
    are saved with the Document ID.

Doc 1
Doc 2
I did enact Julius Caesar I was killed i' the
Capitol Brutus killed me.
So let it be with Caesar. The noble Brutus hath
told you Caesar was ambitious
55
Key step
  • After all documents have been parsed, the
    inverted file is sorted by terms.

We focus on this sort step. We have 100M items to
sort.
56
Scaling index construction
  • In-memory index construction does not scale.
  • How can we construct an index for very large
    collections?
  • Taking into account the hardware constraints
  • Memory, disk, speed etc.

57
Sort-based Index construction
  • As we build the index, we parse docs one at a
    time.
  • While building the index, we cannot easily
    exploit compression tricks
  • The final postings for any term are incomplete
    until the end.
  • At 12 bytes per postings entry, this demands a
    lot of space for large collections.
  • T = 100,000,000 in the case of RCV1
  • So we can do this in memory in 2008, but
    typical collections are much larger. E.g., the
    New York Times provides an index of >150 years
    of newswire.
  • Thus: we need to store intermediate results on
    disk.

58
Use the same algorithm for disk?
  • Can we use the same index construction algorithm
    for larger collections, but by using disk instead
    of memory?
  • No: sorting T = 100,000,000 records on disk is
    too slow (too many disk seeks).
  • We need an external sorting algorithm.

59
Bottleneck
  • Parse and build postings entries one doc at a
    time
  • Now sort postings entries by term (then by doc
    within each term)
  • Doing this with random disk seeks would be too
    slow; must sort T = 100M records

If every comparison took 2 disk seeks, and N
items could be sorted with N log2N comparisons,
how long would this take?
60
BSBI Blocked sort-based Indexing
  • 12-byte (4+4+4) records (term, doc, freq).
  • These are generated as we parse docs.
  • Must now sort 100M such 12-byte records by term.
  • Define a Block: 10M such records
  • Can easily fit a couple into memory.
  • Will have 10 such blocks to start with.
  • Basic idea of algorithm
  • Accumulate postings for each block, sort, write
    to disk.
  • Then merge the blocks into one long sorted order.
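A sketch of the block-sort-and-merge idea using temporary files. Here `pickle` and `heapq.merge` stand in for the real on-disk record format and the streaming n-way merge; a real implementation would read each run incrementally rather than unpickling it whole:

```python
import heapq
import itertools
import os
import pickle
import tempfile

def bsbi(records, block_size):
    # Phase 1: sort fixed-size blocks in memory, spill each run to disk.
    run_files = []
    it = iter(records)
    while True:
        block = list(itertools.islice(it, block_size))
        if not block:
            break
        block.sort()
        f = tempfile.NamedTemporaryFile(delete=False)
        pickle.dump(block, f)
        f.close()
        run_files.append(f.name)

    def read_run(path):
        with open(path, "rb") as fh:
            yield from pickle.load(fh)

    # Phase 2: one n-way merge over all sorted runs.
    merged = list(heapq.merge(*(read_run(p) for p in run_files)))
    for p in run_files:
        os.remove(p)
    return merged
```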

61
62
Sorting 10 blocks of 10M records
  • First, read each block and sort within
  • Quicksort takes 2N ln N expected steps
  • In our case 2 x (10M ln 10M) steps
  • Exercise: estimate total time to read each block
    from disk and quicksort it.
  • 10 times this estimate gives us 10 sorted runs
    of 10M records each.
  • Done straightforwardly, need 2 copies of data on
    disk
  • But can optimize this

63
64
How to merge the sorted runs?
  • Can do binary merges, with a merge tree of
    log2(10) ≈ 4 layers.
  • During each layer, read into memory runs in
    blocks of 10M, merge, write back.

(Diagram: two runs on disk being merged into one merged run.)
65
How to merge the sorted runs?
  • But it is more efficient to do an n-way merge,
    where you are reading from all blocks
    simultaneously
  • Providing you read decent-sized chunks of each
    block into memory, you're not killed by disk seeks

66
Problem with sort-based algorithm
  • Our assumption was we can keep the dictionary in
    memory.
  • We need the dictionary (which grows dynamically)
    in order to implement a term to termID mapping.
  • Actually, we could work with (term, docID) postings
    instead of (termID, docID) postings . . .
  • . . . but then intermediate files become very
    large. (We would end up with a scalable, but very
    slow, index construction method.)

67
SPIMI Single-pass in-memory indexing
  • Key idea 1: generate separate dictionaries for
    each block; no need to maintain term-termID
    mapping across blocks.
  • Key idea 2: don't sort. Accumulate postings in
    postings lists as they occur.
  • With these two ideas we can generate a complete
    inverted index for each block.
  • These separate indexes can then be merged into
    one big index.

68
SPIMI-Invert
  • Merging of blocks is analogous to BSBI.

69
SPIMI Compression
  • Compression makes SPIMI even more efficient.
  • Compression of terms
  • Compression of postings

70
Distributed indexing
  • For web-scale indexing (don't try this at home!)
  • must use a distributed computing cluster
  • Individual machines are fault-prone
  • Can unpredictably slow down or fail
  • How do we exploit such a pool of machines?

71
Google data centers
  • Google data centers mainly contain commodity
    machines.
  • Data centers are distributed around the world.
  • Estimate: a total of 1 million servers, 3 million
    processors/cores (Gartner 2007)
  • Estimate: Google installs 100,000 servers each
    quarter.
  • Based on expenditures of $200-250 million
    per year
  • This would be 10% of the computing capacity of
    the world!?!

72
Google data centers
  • If in a non-fault-tolerant system with 1000
    nodes, each node has 99.9% uptime, what is the
    uptime of the system?
  • Answer: 0.999^1000 ≈ 37% uptime; i.e., with
    probability 63%, at least one node is down.
  • Calculate the number of servers failing per
    minute for an installation of 1 million servers.
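A quick check of the arithmetic; the once-every-3-years failure rate in the second exercise is an assumption for illustration, not a figure from the slides:

```python
# Probability that all 1000 nodes are up at a given instant:
p_all_up = 0.999 ** 1000           # ~0.37
p_some_down = 1 - p_all_up         # ~0.63: some node is down

# Second exercise: ASSUME each server fails about once every 3 years
# (a hypothetical rate chosen for illustration).
failures_per_minute = 1_000_000 / (3 * 365 * 24 * 60)   # ~0.63
```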

73
Distributed indexing
  • Maintain a master machine directing the indexing
    job, which is considered "safe."
  • Break up indexing into sets of (parallel) tasks.
  • Master machine assigns each task to an idle
    machine from a pool.

74
Parallel tasks
  • We will use two sets of parallel tasks
  • Parsers
  • Inverters
  • Break the input document corpus into splits
  • Each split is a subset of documents
    (corresponding to blocks in BSBI/SPIMI)

75
Parsers
  • Master assigns a split to an idle parser machine
  • Parser reads a document at a time and emits
    (term, doc) pairs
  • Parser writes pairs into j partitions
  • Each partition is for a range of terms' first
    letters
  • (e.g., a-f, g-p, q-z); here j = 3.
  • Now to complete the index inversion

76
Inverters
  • An inverter collects all (term, doc) pairs
    (= postings) for one term-partition.
  • Sorts and writes to postings lists

77
Data flow
(Diagram: the Master assigns splits to Parsers and partitions to
Inverters. Each Parser writes its (term, doc) pairs into a-f, g-p, and
q-z segment files; each Inverter reads one partition (a-f, g-p, or q-z)
from all segment files and produces its postings. Parsing is the map
phase; inverting is the reduce phase.)
78
MapReduce
  • The index construction algorithm we just
    described is an instance of MapReduce.
  • MapReduce (Dean and Ghemawat 2004) is a robust
    and conceptually simple framework for
  • distributed computing
  • without having to write code for the
    distribution part.
  • They describe the Google indexing system (ca.
    2002) as consisting of a number of phases, each
    implemented in MapReduce.

79
MapReduce
  • Index construction was just one phase.
  • Another phase: transforming a term-partitioned
    index into a document-partitioned index.
  • Term-partitioned: one machine handles a subrange
    of terms
  • Document-partitioned: one machine handles a
    subrange of documents
  • As we discuss in the web part of the course,
    most search engines use a document-partitioned
    index (better load balancing, etc.)

80
Dynamic indexing
  • Up to now, we have assumed that collections are
    static.
  • They rarely are
  • Documents come in over time and need to be
    inserted.
  • Documents are deleted and modified.
  • This means that the dictionary and postings lists
    have to be modified
  • Postings updates for terms already in dictionary
  • New terms added to dictionary

81
Simplest approach
  • Maintain big main index
  • New docs go into small auxiliary index
  • Search across both, merge results
  • Deletions
  • Invalidation bit-vector for deleted docs
  • Filter docs output on a search result by this
    invalidation bit-vector
  • Periodically, re-index into one main index

82
Issues with main and auxiliary indexes
  • Problem of frequent merges: you touch stuff a
    lot
  • Poor performance during merge
  • Actually:
  • Merging of the auxiliary index into the main
    index is efficient if we keep a separate file for
    each postings list.
  • Merge is the same as a simple append.
  • But then we would need a lot of files, which is
    inefficient for the O/S.
  • Assumption for the rest of the lecture: the index
    is one big file.
  • In reality: use a scheme somewhere in between
    (e.g., split very large postings lists, collect
    postings lists of length 1 in one file, etc.)

83
Further issues with multiple indexes
  • Corpus-wide statistics are hard to maintain
  • E.g., when we spoke of spell-correction: which of
    several corrected alternatives do we present to
    the user?
  • We said, "pick the one with the most hits."
  • How do we maintain the top ones with multiple
    indexes and invalidation bit vectors?
  • One possibility: ignore everything but the main
    index for such ordering
  • Will see more such statistics used in results
    ranking

84
Dynamic indexing at search engines
  • All the large search engines now do dynamic
    indexing
  • Their indices have frequent incremental changes
  • News items, new topical web pages
  • e.g., "Sarah Palin"
  • But (sometimes/typically) they also periodically
    reconstruct the index from scratch
  • Query processing is then switched to the new
    index, and the old index is then deleted

85
Other sorts of indexes
  • Positional indexes
  • Same sort of sorting problem: just larger
  • Building character n-gram indexes:
  • As text is parsed, enumerate n-grams.
  • For each n-gram, need pointers to all dictionary
    terms containing it: the postings.
  • Note that the same postings entry will arise
    repeatedly in parsing the docs: need efficient
    hashing to keep track of this.
  • E.g., the trigram "uou" occurs in the term
    "deciduous" and will be discovered on each text
    occurrence of "deciduous"
  • Only need to process each term once

86
CSE 8337 Outline
  • Introduction
  • Text Processing
  • Indexes
  • Boolean Queries
  • Web Searching/Crawling
  • Vector Space Model
  • Matching
  • Evaluation
  • Feedback/Expansion

87
The index we just built
Today's focus
  • How do we process a query?
  • Later - what kinds of queries can we process?

88
Keyword Based Queries
  • Basic Queries
  • Single word
  • Multiple words
  • Context Queries
  • Phrase
  • Proximity

89
Boolean Queries
  • Keywords combined with Boolean operators
  • OR (e1 OR e2)
  • AND (e1 AND e2)
  • BUT (e1 BUT e2): satisfy e1 but not e2
  • Negation only allowed using BUT to allow
    efficient use of inverted index by filtering
    another efficiently retrievable set.
  • Naïve users have trouble with Boolean logic.

90
Boolean Retrieval with Inverted Indices
  • Primitive keyword: retrieve containing documents
    using the inverted index.
  • OR: recursively retrieve e1 and e2 and take
    union of results.
  • AND: recursively retrieve e1 and e2 and take
    intersection of results.
  • BUT: recursively retrieve e1 and e2 and take set
    difference of results.

91
Query processing AND
  • Consider processing the query
  • Brutus AND Caesar
  • Locate Brutus in the Dictionary
  • Retrieve its postings.
  • Locate Caesar in the Dictionary
  • Retrieve its postings.
  • Merge the two postings

(Diagram: the postings lists for Brutus and Caesar.)
92
The merge
  • Walk through the two postings simultaneously, in
    time linear in the total number of postings
    entries

(Diagram: pointers advancing through the two postings lists.)
If the list lengths are x and y, the merge takes
O(x + y) operations. Crucial: postings sorted by
docID.
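The linear merge can be sketched directly; the docID lists below are illustrative:

```python
def intersect(p1, p2):
    # Walk both docID-sorted lists in lockstep: O(x + y) total.
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

# Illustrative docID lists for the two terms:
brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
# Brutus AND Caesar -> intersect(brutus, caesar) == [2, 8]
```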
93
Example: WestLaw http://www.westlaw.com/
  • Largest commercial (paying subscribers) legal
    search service (started 1975; ranking added 1992)
  • Tens of terabytes of data; 700,000 users
  • Majority of users still use boolean queries
  • Example query:
  • What is the statute of limitations in cases
    involving the federal tort claims act?
  • LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT /3
    CLAIM
  • /3 = within 3 words, /S = in same sentence

94
Boolean queries More general merges
  • Exercise Adapt the merge for the queries
  • Brutus AND NOT Caesar
  • Brutus OR NOT Caesar
  • Can we still run through the merge in time
    O(x + y)?

95
Merging
  • What about an arbitrary Boolean formula?
  • (Brutus OR Caesar) AND NOT
  • (Antony OR Cleopatra)
  • Can we always merge in linear time?
  • Linear in what?
  • Can we do better?

96
Query optimization
  • What is the best order for query processing?
  • Consider a query that is an AND of t terms.
  • For each of the t terms, get its postings, then
    AND them together.

(Diagram: the postings lists for Brutus, Calpurnia, and Caesar.)
Query Brutus AND Calpurnia AND Caesar
97
Query optimization example
  • Process in order of increasing freq
  • start with smallest set, then keep cutting
    further.

This is why we kept freq in the dictionary.
Execute the query as (Caesar AND Brutus) AND
Calpurnia.
98
More general optimization
  • e.g., (madding OR crowd) AND (ignoble OR strife)
  • Get freqs for all terms.
  • Estimate the size of each OR by the sum of its
    freqs (conservative).
  • Process in increasing order of OR sizes.
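A sketch of this ordering heuristic; the document frequencies below are invented for illustration:

```python
def plan_conjunction(or_groups, freq):
    # Estimate each OR-group's result size by the sum of its terms'
    # frequencies (a conservative upper bound); process smallest first.
    return sorted(or_groups, key=lambda g: sum(freq.get(t, 0) for t in g))

# Invented document frequencies, purely for illustration:
freq = {"tangerine": 10, "trees": 80,
        "marmalade": 30, "skies": 40,
        "kaleidoscope": 5, "eyes": 50}
plan = plan_conjunction([("tangerine", "trees"),
                         ("marmalade", "skies"),
                         ("kaleidoscope", "eyes")], freq)
# Process (kaleidoscope OR eyes) first: its estimate (55) is smallest.
```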

99
Exercise
  • Recommend a query processing order for

(tangerine OR trees) AND (marmalade OR skies)
AND (kaleidoscope OR eyes)
100
Phrasal Queries
  • Retrieve documents with a specific phrase
    (ordered list of contiguous words)
  • "information theory"
  • May allow intervening stop words and/or stemming.
  • "buy camera" matches:
    "buy a camera"
    "buying the cameras"
    etc.

101
Phrasal Retrieval with Inverted Indices
  • Must have an inverted index that also stores
    positions of each keyword in a document.
  • Retrieve documents and positions for each
    individual word, intersect documents, and then
    finally check for ordered contiguity of keyword
    positions.
  • Best to start contiguity check with the least
    common word in the phrase.

102
Phrasal Search Algorithm
  • Find set of documents D in which all keywords
    (k1...km) in the phrase occur (using AND query
    processing).
  • Initialize empty set, R, of retrieved documents.
  • For each document, d, in D do:
  • Get array, Pi, of positions of occurrences
    for each ki in d
  • Find the shortest array, Ps, of the Pi's
  • For each position p of keyword ks in Ps do:
  • For each keyword ki except ks do:
  • Use binary search to find a
    position (p - s + i) in the
    array Pi
  • If the correct position for every keyword is
    found, add d to R
  • Return R

103
Proximity Queries
  • List of words with specific maximal distance
    constraints between terms.
  • Example: "dogs" and "race" within 4 words
    matches "...dogs will begin the race..."
  • May also perform stemming and/or not count stop
    words.

104
Proximity Retrieval with Inverted Index
  • Use approach similar to phrasal search to find
    documents in which all keywords are found in a
    context that satisfies the proximity constraints.
  • During binary search for positions of remaining
    keywords, find closest position of ki to p and
    check that it is within maximum allowed distance.

105
Pattern Matching
  • Allow queries that match strings rather than word
    tokens.
  • Requires more sophisticated data structures and
    algorithms than inverted indices to retrieve
    efficiently.

106
Simple Patterns
  • Prefixes: pattern that matches start of word.
  • "anti" matches "antiquity," "antibody," etc.
  • Suffixes: pattern that matches end of word.
  • "ix" matches "fix," "matrix," etc.
  • Substrings: pattern that matches arbitrary
    subsequence of characters.
  • "rapt" matches "enrapture," "velociraptor," etc.
  • Ranges: pair of strings that matches any word
    lexicographically (alphabetically) between them.
  • "tin" to "tix" matches "tip," "tire," "title,"
    etc.

107
Allowing Errors
  • What if query or document contains typos or
    misspellings?
  • Judge similarity of words (or arbitrary strings)
    using
  • Edit distance (cost of insert/delete/match)
  • Longest Common Subsequence (LCS)
  • Allow proximity search with bound on string
    similarity.

108
Longest Common Subsequence (LCS)
  • Length of the longest subsequence of characters
    shared by two strings.
  • A subsequence of a string is obtained by deleting
    zero or more characters.
  • Examples
  • LCS of "misspell" and "mispell" is 7
  • LCS of "misspelled" and "misinterpretted" is 7
    ("mispeed")
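The LCS length can be computed with the standard dynamic program over prefixes:

```python
def lcs_len(a, b):
    # prev[j] holds the LCS length of the processed part of `a`
    # with the first j characters of `b`.
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb
                       else max(prev[j], cur[-1]))
        prev = cur
    return prev[-1]
```

This reproduces the slide's examples: both pairs above have LCS length 7.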

109
Structural Queries
  • Assumes documents have structure that can be
    exploited in search.
  • Structure could be
  • Fixed set of fields, e.g. title, author,
    abstract, etc.
  • Hierarchical (recursive) tree structure

(Diagram: a book contains chapters; each chapter has a title and
sections; each section has a title and subsections.)
110
Queries with Structure
  • Allow queries for text appearing in specific
    fields:
  • "nuclear fusion" appearing in a chapter title
  • SFQL: relational database query language SQL
    enhanced with full-text search.
  • Select abstract from journal.papers
  • where author contains "Teller" and
  • title contains "nuclear fusion" and
    date < 1/1/1950

111
Ranking search results
  • Boolean queries give inclusion or exclusion of
    docs.
  • Often we want to rank/group results
  • Need to measure proximity from query to each doc.
  • Need to decide whether docs presented to user are
    singletons, or a group of docs covering various
    aspects of the query.

112
The web and its challenges
  • Unusual and diverse documents
  • Unusual and diverse users, queries, information
    needs
  • Beyond terms, exploit ideas from social networks
  • link analysis, clickstreams ...
  • How do search engines work? And how can we make
    them better?

113
More sophisticated information retrieval
  • Cross-language information retrieval
  • Question answering
  • Summarization
  • Text mining