Provided by: smu142. Learn more at: http://lyle.smu.edu
1
Information Retrieval
  • CSE 8337 (Part I)
  • Spring 2011
  • Some material for these slides obtained from:
  • Modern Information Retrieval by Ricardo
    Baeza-Yates and Berthier Ribeiro-Neto
    http://www.sims.berkeley.edu/~hearst/irbook/
  • Data Mining: Introductory and Advanced Topics by
    Margaret H. Dunham
  • http://www.engr.smu.edu/~mhd/book
  • Introduction to Information Retrieval by
    Christopher D. Manning, Prabhakar Raghavan, and
    Hinrich Schütze
  • http://informationretrieval.org

2
CSE 8337 Outline
  • Introduction
  • Text Processing
  • Indexes
  • Boolean Queries
  • Web Searching/Crawling
  • Vector Space Model
  • Matching
  • Evaluation
  • Feedback/Expansion

3
Information Retrieval
  • Information Retrieval (IR): retrieving desired
    information from textual data.
  • Library Science
  • Digital Libraries
  • Web Search Engines
  • Traditionally keyword based
  • Sample query
  • Find all documents about data mining.

4
Motivation
  • IR: representation, storage, organization of, and
    access to information items
  • Focus is on the user information need
  • User information need (example)
  • Find all docs containing information on college
    tennis teams which (1) are maintained by a USA
    university and (2) participate in the NCAA
    tournament.
  • Emphasis is on the retrieval of information (not
    data)

5
DB vs IR
  • Records (tuples) vs. documents
  • Well-defined results vs. fuzzy results
  • DB grew out of files and traditional business
    systems
  • IR grew out of library science and need to
    categorize/group/access books/articles

6
Unstructured data
  • Typically refers to free text
  • Allows
  • Keyword queries including operators
  • More sophisticated concept queries e.g.,
  • find all web pages dealing with drug abuse
  • Classic model for searching text documents

7
Semi-structured data
  • In fact almost no data is unstructured
  • E.g., this slide has distinctly identified zones
    such as the Title and Bullets
  • Facilitates semi-structured search such as
  • Title contains "data" AND Bullets contain "search"
  • to say nothing of linguistic structure

8
DB vs IR (contd)
  • Data retrieval
  • which docs contain a set of keywords?
  • Well defined semantics
  • a single erroneous object implies failure!
  • Information retrieval
  • information about a subject or topic
  • semantics is frequently loose
  • small errors are tolerated
  • IR system
  • interpret contents of information items
  • generate a ranking which reflects relevance
  • notion of relevance is most important

9
Motivation
  • IR software issues
  • classification and categorization
  • systems and languages
  • user interfaces and visualization
  • Still, area was seen as of narrow interest
  • Advent of the Web changed this perception once
    and for all
  • universal repository of knowledge
  • free (low cost) universal access
  • no central editorial board
  • many problems though IR seen as key to finding
    the solutions!

10
Basic Concepts
  • The User Task
  • Retrieval
  • information or data
  • purposeful
  • Browsing
  • glancing around
  • Feedback

(Diagram: the user issues retrieval and browsing requests against the
database and receives a response, with a feedback loop.)
11
Basic Concepts
Logical view of the documents
12
The Retrieval Process
13
Basic assumptions of Information Retrieval
  • Collection Fixed set of documents
  • Goal Retrieve documents with information that is
    relevant to users information need and helps him
    complete a task

14
Fuzzy Sets and Logic
  • Fuzzy Set Set membership function is a real
    valued function with output in the range 0,1.
  • f(x) Probability x is in F.
  • 1-f(x) Probability x is not in F.
  • EX
  • T x x is a person and x is tall
  • Let f(x) be the probability that x is tall
  • Here f is the membership function

15
Fuzzy Sets
16
IR is Fuzzy
(Diagram: a simple (crisp) set draws a sharp boundary between Relevant
and Not Relevant; a fuzzy set lets membership shade gradually between
them.)
17
Information Retrieval Metrics
  • Similarity measure of how close a query is to a
    document.
  • Documents which are close enough are retrieved.
  • Metrics
  • Precision Relevant and Retrieved
  • Retrieved
  • Recall Relevant and Retrieved
  • Relevant

18
IR Query Result Measures
19
CSE 8337 Outline
  • Introduction
  • Text Processing (Background)
  • Indexes
  • Boolean Queries
  • Web Searching/Crawling
  • Vector Space Model
  • Matching
  • Evaluation
  • Feedback/Expansion

20
Text Processing TOC
  • Simple Text Storage
  • String Matching
  • Approximate (Fuzzy) Matching (Spell Checker)
  • Parsing
  • Tokenization
  • Stemming/ngrams
  • Stop words
  • Synonyms

21
Text storage
  • EBCDIC/ASCII
  • Array of characters
  • Linked list of characters
  • Trees: B-tree, trie
  • Stuart E. Madnick, "String Processing
    Techniques," Communications of the ACM, Vol. 10,
    No. 7, July 1967, pp. 420-424.

22
Pattern Matching(Recognition)
  • Pattern Matching finds occurrences of a
    predefined pattern in the data.
  • Applications include speech recognition,
    information retrieval, time series analysis.

23
Similarity Measures
  • Determine similarity between two objects.
  • Similarity characteristics
  • Alternatively, distance measures capture how
    unlike or dissimilar objects are.

24
String Matching Problem
  • Input:
  • Pattern of length m
  • Text string of length n
  • Find one (next, all) occurrence(s) of the pattern
    in the string
  • Ex:
  • String: 00110011011110010100100111
  • Pattern: 011010

25
String Matching Algorithms
  • Brute Force
  • Knuth-Morris-Pratt
  • Boyer-Moore

26
Brute Force String Matching
  • Brute Force
  • Handbook of Algorithms and Data Structures
  • http://www.dcc.uchile.cl/~rbaeza/handbook/algs/7/711a.srch.c.html
  • Space: O(m + n)
  • Time: O(mn)

00110011011110010100100111
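A minimal brute-force matcher, applied to the slide's example string (a sketch; real implementations avoid building slices):

```python
def brute_force_search(text, pattern):
    # Try every alignment; compare character by character.
    # Time O(mn) worst case; space O(m + n) for the inputs.
    n, m = len(text), len(pattern)
    return [i for i in range(n - m + 1) if text[i:i + m] == pattern]

# The slide's example (the full pattern happens not to occur in it):
hits = brute_force_search("00110011011110010100100111", "011010")
```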
27
FSR
28
Creating FSR
  • Create FSM
  • Construct the correct spine.
  • Add a default failure bus to state 0.
  • Add a default initial bus to state 1.
  • For each state, decide its attachments to failure
    bus, initial bus, or other failure links.

29
Knuth-Morris-Pratt
  • Apply FSM to string by processing characters one
    at a time.
  • Accepting state is reached when pattern is found.
  • Space: O(m + n)
  • Time: O(m + n)
  • Handbook of Algorithms and Data Structures
  • http://www.dcc.uchile.cl/~rbaeza/handbook/algs/7/712.srch.c.html
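A sketch of KMP in Python; the failure function plays the role of the failure links in the FSM described above:

```python
def kmp_search(text, pattern):
    # Failure function: fail[i] = length of the longest proper prefix
    # of pattern[:i+1] that is also a suffix of it.
    m = len(pattern)
    fail = [0] * m
    k = 0
    for i in range(1, m):
        while k and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    # Scan the text once, following failure links on mismatch.
    hits, k = [], 0
    for i, c in enumerate(text):
        while k and c != pattern[k]:
            k = fail[k - 1]
        if c == pattern[k]:
            k += 1
        if k == m:                     # accepting state reached
            hits.append(i - m + 1)
            k = fail[k - 1]
    return hits
```

Each text character is examined once, and the failure links can only move the match pointer back as far as it has advanced, giving the O(m + n) bound.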

30
Boyer-Moore
  • Scan pattern from right to left
  • Skip many positions on a mismatched character.
  • Worst-case time O(mn)
  • Expected time better than KMP
  • Expected behavior better in practice
  • Handbook of Algorithms and Data Structures
  • http://www.dcc.uchile.cl/~rbaeza/handbook/algs/7/713.preproc.c.html

31
Approximate String Matching
  • Find patterns close to the string
  • Fuzzy matching
  • Applications
  • Spelling checkers
  • IR
  • Define similarity (distance) between string and
    pattern

32
String-to-String Correction
  • Levenshtein Distance
  • http://www.mendeley.com/research/binary-codes-capable-of-correcting-insertions-and-reversals/
  • Measure of similarity between strings
  • Can be used to determine how to convert from one
    string to another
  • Cost to convert one to the other
  • Transformations:
  • Match: current characters in both strings are
    the same
  • Delete: delete current character in input string
  • Insert: insert current character in target
    string into input string
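The edit distance can be computed with the classic dynamic program; note that this version also allows substitution at cost 1, a common extension of the match/delete/insert transformations listed above:

```python
def levenshtein(source, target):
    # Row-by-row DP; cost 0 for a match, 1 for delete/insert, and
    # (as a common extension of the slide's transformations) 1 for
    # substituting one character for another.
    prev = list(range(len(target) + 1))
    for i, sc in enumerate(source, 1):
        cur = [i]
        for j, tc in enumerate(target, 1):
            cur.append(min(prev[j] + 1,                # delete sc
                           cur[j - 1] + 1,             # insert tc
                           prev[j - 1] + (sc != tc)))  # match/substitute
        prev = cur
    return prev[-1]
```

For example, converting "misspell" to "mispell" costs 1 (one deletion).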

33
Distance Between Strings
34
Spell Checkers
  • Check or Replace or Expand or Suggest
  • Phonetic
  • Use phonetic spelling for word
  • Truespel: www.foreignword.com/cgi-bin//transpel.cgi
  • Phoneme: smallest unit of sound
  • Jaro-Winkler
  • distance measure
  • http://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance
  • Autocomplete
  • www.amazon.com

35
Tokenization
  • Find individual words (tokens) in a text string.
  • Look for spaces, commas, etc.
  • http://nlp.stanford.edu/IR-book/html/htmledition/tokenization-1.html

36
Stemming/ngrams
  • Convert token/word into smallest word with
    similar derivations
  • Remove suffixes (s, ed, ing, ...)
  • Remove prefixes (pre, re, un, ...)
  • n-gram: subsequences of length n
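A toy sketch of both ideas. The suffix list and length guard here are illustrative only; a production system would use a real stemmer such as Porter's algorithm rather than naive suffix stripping:

```python
def ngrams(token, n=3):
    # All contiguous character subsequences of length n.
    return [token[i:i + n] for i in range(len(token) - n + 1)]

def crude_stem(token, suffixes=("ing", "ed", "s")):
    # Naive suffix stripping; the suffix list and length guard are
    # illustrative only -- real systems use a Porter-style stemmer.
    for suf in suffixes:
        if token.endswith(suf) and len(token) > len(suf) + 2:
            return token[:-len(suf)]
    return token
```

For example, `ngrams("retrieval")` yields "ret", "etr", "tri", ..., and `crude_stem("searching")` yields "search".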

37
Stopwords
  • Common words
  • Bad words
  • Implementation
  • Text file

38
Synonyms
  • Exact/similar meaning
  • Hierarchy
  • One way
  • Bidirectional
  • Expand Query
  • Replace terms
  • Implementation
  • Synonym File or dictionary

39
CSE 8337 Outline
  • Introduction
  • Text Processing
  • Indexes
  • Boolean Queries
  • Web Searching/Crawling
  • Vector Space Model
  • Matching
  • Evaluation
  • Feedback/Expansion

40
Index
  • Common access is by keyword
  • Fast access by keyword
  • Index organizations?
  • Hash
  • B-tree
  • Linked List
  • Process document and query to identify keywords

41
Term-document incidence
1 if play contains word, 0 otherwise
Brutus AND Caesar but NOT Calpurnia
42
Incidence vectors
  • So we have a 0/1 vector for each term.
  • To answer the query: take the vectors for Brutus,
    Caesar and Calpurnia (complemented), then bitwise
    AND.
  • 110100 AND 110111 AND 101111 = 100100.
  • http://www.rhymezone.com/shakespeare/
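The bitwise AND on the slide can be reproduced directly, writing each vector as a 6-bit integer, one bit per play:

```python
# Incidence vectors matching the slide's bit strings, one bit per play:
brutus    = 0b110100
caesar    = 0b110111
calpurnia = 0b010000          # its complement is 101111, as on the slide
mask = (1 << 6) - 1           # keep only the 6 play bits

# Brutus AND Caesar AND NOT Calpurnia:
answer = brutus & caesar & (~calpurnia & mask)
# format(answer, "06b") == "100100"
```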

43
Inverted index
  • For each term T, we must store a list of all
    documents that contain T.
  • Do we use an array or a list for this?

(Diagram: dictionary terms Brutus, Calpurnia, and Caesar, each pointing
to its postings list of docIDs.)
What happens if the word Caesar is added to
document 14?
44
Inverted index
  • Linked lists generally preferred to arrays
  • Dynamic space allocation
  • Insertion of terms into documents easy
  • Space overhead of pointers

(Diagram: dictionary terms with postings lists of docIDs, e.g.
2 → 4 → 8 → 16 → 32 → 64 → 128 and 2 → 3 → 5 → 8 → 13 → 21 → 34.)
Sorted by docID (more later on why).
45
Inverted index construction
Documents to be indexed.
Friends, Romans, countrymen.
46
Indexer steps
  • Sequence of (Modified token, Document ID) pairs.

Doc 1
Doc 2
I did enact Julius Caesar I was killed i' the
Capitol Brutus killed me.
So let it be with Caesar. The noble Brutus hath
told you Caesar was ambitious
47
  • Sort by terms.

Core indexing step.
48
  • Multiple term entries in a single document are
    merged.
  • Frequency information is added.

Why frequency? Will discuss later.
49
  • The result is split into a Dictionary file and a
    Postings file.
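The indexer steps from the last few slides (emit pairs, sort by term, merge duplicates, record frequencies, split into dictionary and postings) can be sketched as follows; the two toy documents echo the slide example:

```python
from collections import Counter

def build_index(docs):
    # Step 1: emit (modified token, docID) pairs.
    pairs = [(tok, doc_id)
             for doc_id, text in docs.items()
             for tok in text.lower().split()]
    pairs.sort()                        # step 2: core indexing step
    freq = Counter(pairs)               # step 3: merge dups, keep frequency
    # Step 4: split into dictionary (term -> doc freq) and postings.
    postings = {}
    for (term, doc_id), tf in sorted(freq.items()):
        postings.setdefault(term, []).append((doc_id, tf))
    dictionary = {term: len(plist) for term, plist in postings.items()}
    return dictionary, postings

docs = {1: "I did enact Julius Caesar I was killed",
        2: "So let it be with Caesar the noble Brutus"}
dictionary, postings = build_index(docs)
# postings["caesar"] == [(1, 1), (2, 1)]; dictionary["caesar"] == 2
```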

50
  • Where do we pay in storage?

(Diagram: storage is paid both in the dictionary terms and in the
postings pointers.)
51
Example Data
  • As an example for applying scalable index
    construction algorithms, we will use the Reuters
    RCV1 collection.
  • This is one year of Reuters newswire (part of
    1995 and 1996)
  • http://about.reuters.com/researchandstandards/corpus/available.asp
  • Hardware assumptions: Table 4.1, p. 62 in textbook

52
A Reuters RCV1 document
53
Reuters RCV1 statistics
Symbol  Statistic                                      Value
N       documents                                      800,000
L       avg. tokens per doc                            200
M       terms (= word types)                           400,000
        avg. bytes per token (incl. spaces/punct.)     6
        avg. bytes per token (without spaces/punct.)   4.5
        avg. bytes per term                            7.5
T       non-positional postings                        100,000,000
4.5 bytes per word token vs. 7.5 bytes per word type: why?
54
Basic index construction
  • Documents are parsed to extract words and these
    are saved with the Document ID.

Doc 1
Doc 2
I did enact Julius Caesar I was killed i' the
Capitol Brutus killed me.
So let it be with Caesar. The noble Brutus hath
told you Caesar was ambitious
55
Key step
  • After all documents have been parsed, the
    inverted file is sorted by terms.

We focus on this sort step. We have 100M items to
sort.
56
Scaling index construction
  • In-memory index construction does not scale.
  • How can we construct an index for very large
    collections?
  • Taking into account the hardware constraints
  • Memory, disk, speed etc.

57
Sort-based Index construction
  • As we build the index, we parse docs one at a
    time.
  • While building the index, we cannot easily
    exploit compression tricks
  • The final postings for any term are incomplete
    until the end.
  • At 12 bytes per postings entry, this demands a
    lot of space for large collections.
  • T = 100,000,000 in the case of RCV1
  • So we can do this in memory in 2008, but
    typical collections are much larger. E.g., the
    New York Times provides an index of >150 years
    of newswire.
  • Thus: we need to store intermediate results on
    disk.

58
Use the same algorithm for disk?
  • Can we use the same index construction algorithm
    for larger collections, but by using disk instead
    of memory?
  • No: sorting T = 100,000,000 records on disk is
    too slow (too many disk seeks).
  • We need an external sorting algorithm.

59
Bottleneck
  • Parse and build postings entries one doc at a
    time
  • Now sort postings entries by term (then by doc
    within each term)
  • Doing this with random disk seeks would be too
    slow; must sort T = 100M records

If every comparison took 2 disk seeks, and N
items could be sorted with N log2N comparisons,
how long would this take?
60
BSBI Blocked sort-based Indexing
  • 12-byte (4+4+4) records (term, doc, freq).
  • These are generated as we parse docs.
  • Must now sort 100M such 12-byte records by term.
  • Define a Block: 10M such records
  • Can easily fit a couple into memory.
  • Will have 10 such blocks to start with.
  • Basic idea of algorithm
  • Accumulate postings for each block, sort, write
    to disk.
  • Then merge the blocks into one long sorted order.
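A sketch of the block-sort-and-merge idea using temporary files. Here `pickle` and `heapq.merge` stand in for the real on-disk record format and the streaming n-way merge; a real implementation would read each run incrementally rather than unpickling it whole:

```python
import heapq
import itertools
import os
import pickle
import tempfile

def bsbi(records, block_size):
    # Phase 1: sort fixed-size blocks in memory, spill each run to disk.
    run_files = []
    it = iter(records)
    while True:
        block = list(itertools.islice(it, block_size))
        if not block:
            break
        block.sort()
        f = tempfile.NamedTemporaryFile(delete=False)
        pickle.dump(block, f)
        f.close()
        run_files.append(f.name)

    def read_run(path):
        with open(path, "rb") as fh:
            yield from pickle.load(fh)

    # Phase 2: one n-way merge over all sorted runs.
    merged = list(heapq.merge(*(read_run(p) for p in run_files)))
    for p in run_files:
        os.remove(p)
    return merged
```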

61
62
Sorting 10 blocks of 10M records
  • First, read each block and sort within
  • Quicksort takes 2N ln N expected steps
  • In our case 2 x (10M ln 10M) steps
  • Exercise: estimate total time to read each block
    from disk and quicksort it.
  • 10 times this estimate gives us 10 sorted runs
    of 10M records each.
  • Done straightforwardly, need 2 copies of data on
    disk
  • But can optimize this

63
64
How to merge the sorted runs?
  • Can do binary merges, with a merge tree of
    log2(10) ≈ 4 layers.
  • During each layer, read into memory runs in
    blocks of 10M, merge, write back.

(Diagram: two runs on disk being merged into one merged run.)
65
How to merge the sorted runs?
  • But it is more efficient to do an n-way merge,
    where you are reading from all blocks
    simultaneously
  • Providing you read decent-sized chunks of each
    block into memory, you're not killed by disk seeks

66
Problem with sort-based algorithm
  • Our assumption was we can keep the dictionary in
    memory.
  • We need the dictionary (which grows dynamically)
    in order to implement a term to termID mapping.
  • Actually, we could work with (term, docID) postings
    instead of (termID, docID) postings . . .
  • . . . but then intermediate files become very
    large. (We would end up with a scalable, but very
    slow, index construction method.)

67
SPIMI Single-pass in-memory indexing
  • Key idea 1: generate separate dictionaries for
    each block; no need to maintain term-termID
    mapping across blocks.
  • Key idea 2: don't sort. Accumulate postings in
    postings lists as they occur.
  • With these two ideas we can generate a complete
    inverted index for each block.
  • These separate indexes can then be merged into
    one big index.

68
SPIMI-Invert
  • Merging of blocks is analogous to BSBI.

69
SPIMI Compression
  • Compression makes SPIMI even more efficient.
  • Compression of terms
  • Compression of postings

70
Distributed indexing
  • For web-scale indexing (don't try this at home!)
  • must use a distributed computing cluster
  • Individual machines are fault-prone
  • Can unpredictably slow down or fail
  • How do we exploit such a pool of machines?

71
Google data centers
  • Google data centers mainly contain commodity
    machines.
  • Data centers are distributed around the world.
  • Estimate: a total of 1 million servers, 3 million
    processors/cores (Gartner 2007)
  • Estimate: Google installs 100,000 servers each
    quarter.
  • Based on expenditures of $200-250 million
    per year
  • This would be 10% of the computing capacity of
    the world!?!

72
Google data centers
  • If in a non-fault-tolerant system with 1000
    nodes, each node has 99.9% uptime, what is the
    uptime of the system?
  • Answer: 0.999^1000 ≈ 37% uptime; i.e., with
    probability 63%, at least one node is down.
  • Calculate the number of servers failing per
    minute for an installation of 1 million servers.
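A quick check of the arithmetic; the once-every-3-years failure rate in the second exercise is an assumption for illustration, not a figure from the slides:

```python
# Probability that all 1000 nodes are up at a given instant:
p_all_up = 0.999 ** 1000           # ~0.37
p_some_down = 1 - p_all_up         # ~0.63: some node is down

# Second exercise: ASSUME each server fails about once every 3 years
# (a hypothetical rate chosen for illustration).
failures_per_minute = 1_000_000 / (3 * 365 * 24 * 60)   # ~0.63
```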

73
Distributed indexing
  • Maintain a master machine directing the indexing
    job, which is considered "safe."
  • Break up indexing into sets of (parallel) tasks.
  • Master machine assigns each task to an idle
    machine from a pool.

74
Parallel tasks
  • We will use two sets of parallel tasks
  • Parsers
  • Inverters
  • Break the input document corpus into splits
  • Each split is a subset of documents
    (corresponding to blocks in BSBI/SPIMI)

75
Parsers
  • Master assigns a split to an idle parser machine
  • Parser reads a document at a time and emits
    (term, doc) pairs
  • Parser writes pairs into j partitions
  • Each partition is for a range of terms' first
    letters
  • (e.g., a-f, g-p, q-z); here j = 3.
  • Now to complete the index inversion

76
Inverters
  • An inverter collects all (term, doc) pairs
    (= postings) for one term-partition.
  • Sorts and writes to postings lists

77
Data flow
(Diagram: the Master assigns splits to Parsers and partitions to
Inverters. Each Parser writes its (term, doc) pairs into a-f, g-p, and
q-z segment files; each Inverter reads one partition (a-f, g-p, or q-z)
from all segment files and produces its postings. Parsing is the map
phase; inverting is the reduce phase.)
78
MapReduce
  • The index construction algorithm we just
    described is an instance of MapReduce.
  • MapReduce (Dean and Ghemawat 2004) is a robust
    and conceptually simple framework for
  • distributed computing
  • without having to write code for the
    distribution part.
  • They describe the Google indexing system (ca.
    2002) as consisting of a number of phases, each
    implemented in MapReduce.

79
MapReduce
  • Index construction was just one phase.
  • Another phase: transforming a term-partitioned
    index into a document-partitioned index.
  • Term-partitioned: one machine handles a subrange
    of terms
  • Document-partitioned: one machine handles a
    subrange of documents
  • As we discuss in the web part of the course,
    most search engines use a document-partitioned
    index (better load balancing, etc.)

80
Dynamic indexing
  • Up to now, we have assumed that collections are
    static.
  • They rarely are
  • Documents come in over time and need to be
    inserted.
  • Documents are deleted and modified.
  • This means that the dictionary and postings lists
    have to be modified
  • Postings updates for terms already in dictionary
  • New terms added to dictionary

81
Simplest approach
  • Maintain big main index
  • New docs go into small auxiliary index
  • Search across both, merge results
  • Deletions
  • Invalidation bit-vector for deleted docs
  • Filter docs output on a search result by this
    invalidation bit-vector
  • Periodically, re-index into one main index

82
Issues with main and auxiliary indexes
  • Problem of frequent merges: you touch stuff a
    lot
  • Poor performance during merge
  • Actually:
  • Merging of the auxiliary index into the main
    index is efficient if we keep a separate file for
    each postings list.
  • Merge is the same as a simple append.
  • But then we would need a lot of files, which is
    inefficient for the O/S.
  • Assumption for the rest of the lecture: the index
    is one big file.
  • In reality: use a scheme somewhere in between
    (e.g., split very large postings lists, collect
    postings lists of length 1 in one file, etc.)

83
Further issues with multiple indexes
  • Corpus-wide statistics are hard to maintain
  • E.g., when we spoke of spell-correction: which of
    several corrected alternatives do we present to
    the user?
  • We said, "pick the one with the most hits."
  • How do we maintain the top ones with multiple
    indexes and invalidation bit vectors?
  • One possibility: ignore everything but the main
    index for such ordering
  • Will see more such statistics used in results
    ranking

84
Dynamic indexing at search engines
  • All the large search engines now do dynamic
    indexing
  • Their indices have frequent incremental changes
  • News items, new topical web pages
  • e.g., "Sarah Palin"
  • But (sometimes/typically) they also periodically
    reconstruct the index from scratch
  • Query processing is then switched to the new
    index, and the old index is then deleted

85
Other sorts of indexes
  • Positional indexes
  • Same sort of sorting problem: just larger
  • Building character n-gram indexes:
  • As text is parsed, enumerate n-grams.
  • For each n-gram, need pointers to all dictionary
    terms containing it: the postings.
  • Note that the same postings entry will arise
    repeatedly in parsing the docs: need efficient
    hashing to keep track of this.
  • E.g., the trigram "uou" occurs in the term
    "deciduous" and will be discovered on each text
    occurrence of "deciduous"
  • Only need to process each term once

86
CSE 8337 Outline
  • Introduction
  • Text Processing
  • Indexes
  • Boolean Queries
  • Web Searching/Crawling
  • Vector Space Model
  • Matching
  • Evaluation
  • Feedback/Expansion

87
The index we just built
Today's focus
  • How do we process a query?
  • Later - what kinds of queries can we process?

88
Keyword Based Queries
  • Basic Queries
  • Single word
  • Multiple words
  • Context Queries
  • Phrase
  • Proximity

89
Boolean Queries
  • Keywords combined with Boolean operators
  • OR (e1 OR e2)
  • AND (e1 AND e2)
  • BUT (e1 BUT e2): satisfy e1 but not e2
  • Negation only allowed using BUT to allow
    efficient use of inverted index by filtering
    another efficiently retrievable set.
  • Naïve users have trouble with Boolean logic.

90
Boolean Retrieval with Inverted Indices
  • Primitive keyword: retrieve containing documents
    using the inverted index.
  • OR: recursively retrieve e1 and e2 and take
    union of results.
  • AND: recursively retrieve e1 and e2 and take
    intersection of results.
  • BUT: recursively retrieve e1 and e2 and take set
    difference of results.

91
Query processing AND
  • Consider processing the query
  • Brutus AND Caesar
  • Locate Brutus in the Dictionary
  • Retrieve its postings.
  • Locate Caesar in the Dictionary
  • Retrieve its postings.
  • Merge the two postings

(Diagram: the postings lists for Brutus and Caesar.)
92
The merge
  • Walk through the two postings simultaneously, in
    time linear in the total number of postings
    entries

(Diagram: pointers advancing through the two postings lists.)
If the list lengths are x and y, the merge takes
O(x + y) operations. Crucial: postings sorted by
docID.
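The linear merge can be sketched directly; the docID lists below are illustrative:

```python
def intersect(p1, p2):
    # Walk both docID-sorted lists in lockstep: O(x + y) total.
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

# Illustrative docID lists for the two terms:
brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
# Brutus AND Caesar -> intersect(brutus, caesar) == [2, 8]
```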
93
Example: WestLaw http://www.westlaw.com/
  • Largest commercial (paying subscribers) legal
    search service (started 1975; ranking added 1992)
  • Tens of terabytes of data; 700,000 users
  • Majority of users still use boolean queries
  • Example query:
  • What is the statute of limitations in cases
    involving the federal tort claims act?
  • LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT /3
    CLAIM
  • /3 = within 3 words, /S = in same sentence

94
Boolean queries More general merges
  • Exercise Adapt the merge for the queries
  • Brutus AND NOT Caesar
  • Brutus OR NOT Caesar
  • Can we still run through the merge in time
    O(x + y)?

95
Merging
  • What about an arbitrary Boolean formula?
  • (Brutus OR Caesar) AND NOT
  • (Antony OR Cleopatra)
  • Can we always merge in linear time?
  • Linear in what?
  • Can we do better?

96
Query optimization
  • What is the best order for query processing?
  • Consider a query that is an AND of t terms.
  • For each of the t terms, get its postings, then
    AND them together.

(Diagram: the postings lists for Brutus, Calpurnia, and Caesar.)
Query Brutus AND Calpurnia AND Caesar
97
Query optimization example
  • Process in order of increasing freq
  • start with smallest set, then keep cutting
    further.

This is why we kept freq in the dictionary.
Execute the query as (Caesar AND Brutus) AND
Calpurnia.
98
More general optimization
  • e.g., (madding OR crowd) AND (ignoble OR strife)
  • Get freqs for all terms.
  • Estimate the size of each OR by the sum of its
    freqs (conservative).
  • Process in increasing order of OR sizes.
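A sketch of this ordering heuristic; the document frequencies below are invented for illustration:

```python
def plan_conjunction(or_groups, freq):
    # Estimate each OR-group's result size by the sum of its terms'
    # frequencies (a conservative upper bound); process smallest first.
    return sorted(or_groups, key=lambda g: sum(freq.get(t, 0) for t in g))

# Invented document frequencies, purely for illustration:
freq = {"tangerine": 10, "trees": 80,
        "marmalade": 30, "skies": 40,
        "kaleidoscope": 5, "eyes": 50}
plan = plan_conjunction([("tangerine", "trees"),
                         ("marmalade", "skies"),
                         ("kaleidoscope", "eyes")], freq)
# Process (kaleidoscope OR eyes) first: its estimate (55) is smallest.
```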

99
Exercise
  • Recommend a query processing order for

(tangerine OR trees) AND (marmalade OR skies)
AND (kaleidoscope OR eyes)
100
Phrasal Queries
  • Retrieve documents with a specific phrase
    (ordered list of contiguous words)
  • "information theory"
  • May allow intervening stop words and/or stemming.
  • "buy camera" matches:
    "buy a camera"
    "buying the cameras"
    etc.

101
Phrasal Retrieval with Inverted Indices
  • Must have an inverted index that also stores
    positions of each keyword in a document.
  • Retrieve documents and positions for each
    individual word, intersect documents, and then
    finally check for ordered contiguity of keyword
    positions.
  • Best to start contiguity check with the least
    common word in the phrase.

102
Phrasal Search Algorithm
  • Find set of documents D in which all keywords
    (k1...km) in the phrase occur (using AND query
    processing).
  • Initialize empty set, R, of retrieved documents.
  • For each document, d, in D do:
  • Get array, Pi, of positions of occurrences
    for each ki in d
  • Find the shortest array, Ps, of the Pi's
  • For each position p of keyword ks in Ps do:
  • For each keyword ki except ks do:
  • Use binary search to find a
    position (p - s + i) in the
    array Pi
  • If the correct position for every keyword is
    found, add d to R
  • Return R

103
Proximity Queries
  • List of words with specific maximal distance
    constraints between terms.
  • Example: "dogs" and "race" within 4 words
    matches "...dogs will begin the race..."
  • May also perform stemming and/or not count stop
    words.

104
Proximity Retrieval with Inverted Index
  • Use approach similar to phrasal search to find
    documents in which all keywords are found in a
    context that satisfies the proximity constraints.
  • During binary search for positions of remaining
    keywords, find closest position of ki to p and
    check that it is within maximum allowed distance.

105
Pattern Matching
  • Allow queries that match strings rather than word
    tokens.
  • Requires more sophisticated data structures and
    algorithms than inverted indices to retrieve
    efficiently.

106
Simple Patterns
  • Prefixes: pattern that matches start of word.
  • "anti" matches "antiquity," "antibody," etc.
  • Suffixes: pattern that matches end of word.
  • "ix" matches "fix," "matrix," etc.
  • Substrings: pattern that matches arbitrary
    subsequence of characters.
  • "rapt" matches "enrapture," "velociraptor," etc.
  • Ranges: pair of strings that matches any word
    lexicographically (alphabetically) between them.
  • "tin" to "tix" matches "tip," "tire," "title,"
    etc.

107
Allowing Errors
  • What if query or document contains typos or
    misspellings?
  • Judge similarity of words (or arbitrary strings)
    using
  • Edit distance (cost of insert/delete/match)
  • Longest Common Subsequence (LCS)
  • Allow proximity search with bound on string
    similarity.

108
Longest Common Subsequence (LCS)
  • Length of the longest subsequence of characters
    shared by two strings.
  • A subsequence of a string is obtained by deleting
    zero or more characters.
  • Examples
  • LCS of "misspell" and "mispell" is 7
  • LCS of "misspelled" and "misinterpretted" is 7
    ("mispeed")
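The LCS length can be computed with the standard dynamic program over prefixes:

```python
def lcs_len(a, b):
    # prev[j] holds the LCS length of the processed part of `a`
    # with the first j characters of `b`.
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb
                       else max(prev[j], cur[-1]))
        prev = cur
    return prev[-1]
```

This reproduces the slide's examples: both pairs above have LCS length 7.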

109
Structural Queries
  • Assumes documents have structure that can be
    exploited in search.
  • Structure could be
  • Fixed set of fields, e.g. title, author,
    abstract, etc.
  • Hierarchical (recursive) tree structure

(Diagram: a book contains chapters; each chapter has a title and
sections; each section has a title and subsections.)
110
Queries with Structure
  • Allow queries for text appearing in specific
    fields:
  • "nuclear fusion" appearing in a chapter title
  • SFQL: relational database query language SQL
    enhanced with full-text search.
  • Select abstract from journal.papers
  • where author contains "Teller" and
  • title contains "nuclear fusion" and
    date < 1/1/1950

111
Ranking search results
  • Boolean queries give inclusion or exclusion of
    docs.
  • Often we want to rank/group results
  • Need to measure proximity from query to each doc.
  • Need to decide whether docs presented to user are
    singletons, or a group of docs covering various
    aspects of the query.

112
The web and its challenges
  • Unusual and diverse documents
  • Unusual and diverse users, queries, information
    needs
  • Beyond terms, exploit ideas from social networks
  • link analysis, clickstreams ...
  • How do search engines work? And how can we make
    them better?

113
More sophisticated information retrieval
  • Cross-language information retrieval
  • Question answering
  • Summarization
  • Text mining