CSCI 7000: Modern Information Retrieval (Jim Martin)

1
CSCI 7000: Modern Information Retrieval
Jim Martin
  • Lecture 3
  • 9/3/2008

2
Today: 9/5
  • Review
  • Dictionary contents
  • Advanced query handling
  • Phrases
  • Wildcards
  • Spelling correction
  • First programming assignment

3
  • Index: the Dictionary file and a Postings file

4
Review: Dictionary
  • What goes into creating the dictionary?
  • Tokenization
  • Case folding
  • Stemming
  • Stop-listing
  • Dealing with numbers (and number-like entities)
  • Complex morphology

5
Phrasal queries
  • Want to handle queries such as
  • "Colorado Buffaloes" as a phrase
  • This concept is popular with users: about 10% of
    web queries are phrasal queries
  • Postings that consist of document lists alone are
    not sufficient to handle phrasal queries
  • Two general approaches
  • Biword indexing
  • Positional indexing

6
Solution 1: Biword Indexing
  • Index every consecutive pair of terms in the text
    as a phrase
  • For example, the text "Friends, Romans,
    Countrymen" would generate the biwords
  • "friends romans"
  • "romans countrymen"
  • Each of these biwords is now a dictionary term
  • Two-word phrase query-processing is now free
  • Not really.
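A minimal sketch of biword indexing in Python (the sample documents, doc IDs, and the helper name build_biword_index are illustrative, not from the lecture):

    from collections import defaultdict

    def build_biword_index(docs):
        """docs: doc_id -> list of (already normalized) tokens."""
        index = defaultdict(set)                         # biword -> set of doc IDs
        for doc_id, tokens in docs.items():
            for first, second in zip(tokens, tokens[1:]):
                index[first + " " + second].add(doc_id)  # each consecutive pair becomes a dictionary term
        return index

    docs = {1: ["friends", "romans", "countrymen"]}
    index = build_biword_index(docs)
    print(index["friends romans"])   # {1} -- a two-word phrase query is a single dictionary lookup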

7
Longer Phrasal Queries
  • Longer phrases can be broken into Boolean AND
    queries on the component biwords
  • "Colorado Buffaloes at Arizona"
  • (Colorado Buffaloes) AND (Buffaloes at) AND (at
    Arizona)

Susceptible to Type 1 errors (false positives)
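Continuing the sketch above, a longer phrase reduces to intersecting the posting sets of its component biwords; any document that survives the AND still has to be verified against its text, since the intersection can return false positives:

    def biword_phrase_query(index, phrase_tokens):
        """AND together the postings of every consecutive biword in the phrase."""
        biwords = [a + " " + b for a, b in zip(phrase_tokens, phrase_tokens[1:])]
        postings = [index.get(bw, set()) for bw in biwords]
        return set.intersection(*postings) if postings else set()

    # "colorado buffaloes at arizona" ->
    #   (colorado buffaloes) AND (buffaloes at) AND (at arizona)
    candidates = biword_phrase_query(index, ["colorado", "buffaloes", "at", "arizona"])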
8
Solution 2: Positional Indexing
  • Change our posting content
  • Store, for each term, entries of the form
  • doc1: position1, position2, ...
  • doc2: position1, position2, ...
  • etc.
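One way to hold such postings in Python is a nested mapping from term to document to a sorted list of positions (a sketch only; the lecture does not prescribe a particular data structure):

    from collections import defaultdict

    def build_positional_index(docs):
        """docs: doc_id -> list of tokens.  Returns term -> {doc_id: [positions]}."""
        index = defaultdict(lambda: defaultdict(list))
        for doc_id, tokens in docs.items():
            for pos, term in enumerate(tokens):
                index[term][doc_id].append(pos)   # positions accumulate in increasing order
        return index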

9
Positional index example
be: ... 149; 4: 17, 191, 291, 430, 434; 5: 363, 367, ...
Which of docs 1, 2, 4, 5 could contain "to be or not
to be"?
10
Processing a phrase query
  • Extract inverted index entries for each distinct
    term: to, be, or, not
  • Merge their doc:position lists to enumerate all
    positions with "to be or not to be"
  • to
  • 2: 1, 17, 74, 222, 551; 4: 8, 16, 190, 429, 433;
    7: 13, 23, 191; ...
  • be
  • 1: 17, 19; 4: 17, 191, 291, 430, 434; 5: 14, 19, 101; ...
  • Same general method for proximity searches
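A simple (unoptimized) sketch of this merge over the positional index built above: intersect the document lists of all query terms, then check that the positions line up consecutively. A proximity query would only relax the position test.

    def positional_phrase_query(index, phrase_tokens):
        """Return doc IDs containing phrase_tokens as a consecutive phrase."""
        postings = [index.get(t, {}) for t in phrase_tokens]
        if not postings:
            return set()
        common_docs = set(postings[0]).intersection(*postings[1:])
        hits = set()
        for doc_id in common_docs:
            # The phrase starts at p if token i occurs at position p + i for every i.
            for p in postings[0][doc_id]:
                if all(p + i in postings[i][doc_id] for i in range(1, len(phrase_tokens))):
                    hits.add(doc_id)
                    break
        return hits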

11
Positional index size
  • As we'll see, you can compress position
    values/offsets
  • But a positional index still expands the postings
    storage substantially
  • Nevertheless, it is now the standard approach
    because of the power and usefulness of phrase and
    proximity queries, whether used explicitly or
    implicitly in a ranked retrieval system.

12
Rules of thumb
  • Positional index size: 35-50% of the volume of
    the original text
  • Caveat: all of this holds for English-like
    languages

13
Combination Techniques
  • Biwords are faster.
  • And they cover a large percentage of very
    frequent (implied) phrasal queries
  • Britney Spears
  • So it can be effective to combine positional
    indexes with biword indexes for frequent bigrams

14
Web
  • Cuil
  • Yahoo! BOSS

15
Programming Assignment Part 1
  • Download and install Lucene
  • How does Lucene handle (by default)
  • Case, stemming, and phrasal queries
  • Download and index a collection that I will point
    you at
  • How big is the resulting index?
  • Terms and size of index
  • Return the top N document IDs (hits) from a set
    of queries I'll provide.

16
Programming Assignment Part 2
  • Make it better

17
Wild Card Queries
  • Two flavors
  • Word-based
  • Caribb*
  • Phrasal
  • Pirates * Caribbean
  • General approach
  • Generate a set of new queries from the original
  • Operation on the dictionary
  • Run those queries in a not stupid way

18
Simple Single Wild-card queries
  • Single instance of a *
  • * means any string of length 0 or more
  • This is not Kleene *.
  • mon*: find all docs containing any word beginning
    with "mon"
  • Index your lexicon on prefixes
  • *mon: find words ending in "mon"
  • Maintain a backwards index (of reversed terms)

Exercise: from this, how can we enumerate all
terms meeting the wild-card query pro*cent?
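A sketch of both lookups using sorted term lists and binary search (the tiny lexicon and the helper name prefix_terms are illustrative): mon* is a range scan of the normal sorted lexicon, *mon is a range scan of a lexicon of reversed terms, and pro*cent could be answered by intersecting the results of pro* and *cent.

    import bisect

    lexicon = sorted(["moment", "mon", "monday", "money", "salmon", "sermon"])
    reversed_lexicon = sorted(t[::-1] for t in lexicon)   # the backwards index

    def prefix_terms(sorted_terms, prefix):
        """All terms starting with prefix, via binary search on the sorted list."""
        lo = bisect.bisect_left(sorted_terms, prefix)
        hi = bisect.bisect_right(sorted_terms, prefix + "\uffff")
        return sorted_terms[lo:hi]

    print(prefix_terms(lexicon, "mon"))                               # mon*
    print([t[::-1] for t in prefix_terms(reversed_lexicon, "nom")])   # *mon (reverse the hits back)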
19
Arbitrary Wildcards
  • How can we handle multiple *s in the middle of a
    query term?
  • The solution: transform every wild-card query so
    that the *s occur at the end
  • This gives rise to the Permuterm Index.

20
Permuterm Index
  • For term hello, index under:
  • hello$, ello$h, llo$he, lo$hel, o$hell
  • where $ is a special symbol.
  • Example

Query hel*o → Rotate → Lookup o$hel*
21
Permuterm query processing
  • Rotate query wild-card to the right
  • Now use indexed lookup as before.
  • Permuterm problem: quadruples lexicon size

Empirical observation for English.
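A sketch of the permuterm construction and query rotation (using $ as the end-of-word symbol, as above):

    def permuterm_rotations(term):
        """Every rotation of term + '$'; each rotation is a permuterm key pointing back to term."""
        augmented = term + "$"
        return [augmented[i:] + augmented[:i] for i in range(len(augmented))]

    def rotate_query(query):
        """Rotate a single-wildcard query X*Y into Y$X, so it becomes a prefix lookup on the keys."""
        head, tail = query.split("*")     # assumes exactly one '*'
        return tail + "$" + head

    print(permuterm_rotations("hello"))  # ['hello$', 'ello$h', 'llo$he', 'lo$hel', 'o$hell', '$hello']
    print(rotate_query("hel*o"))         # 'o$hel' -- prefix-matches the rotation 'o$hell' of hello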
22
Spelling Correction
  • Two primary uses
  • Correcting document(s) being indexed
  • Retrieve matching documents when query contains a
    spelling error
  • Two main flavors
  • Isolated word
  • Check each word on its own for misspelling
  • Will not catch typos resulting in correctly
    spelled words, e.g., from → form
  • Context-sensitive
  • Look at surrounding words, e.g., I flew form
    Heathrow to Narita.

23
Document correction
  • Primarily for OCRed documents
  • Correction algorithms tuned for this
  • Goal: the index (dictionary) contains fewer
    OCR-induced misspellings
  • Can use domain-specific knowledge
  • E.g., OCR can confuse O and D more often than it
    would confuse O and I (O and I are adjacent on the
    QWERTY keyboard, so they are more likely to be
    interchanged in typing than in OCR).

24
Query correction
  • Our principal focus here
  • E.g., the query "Alanis Morisett"
  • We can either
  • Retrieve using that spelling
  • Retrieve documents indexed by the correct
    spelling, OR
  • Return several suggested alternative queries with
    the correct spelling
  • Did you mean ?

25
Isolated word correction
  • Fundamental premise: there is a lexicon from
    which the correct spellings come
  • Two basic choices for this
  • A standard lexicon such as
  • Webster's English Dictionary
  • An industry-specific lexicon (hand-maintained)
  • The lexicon of the indexed corpus
  • E.g., all words on the web
  • All names, acronyms etc.
  • (Including the mis-spellings)

26
Isolated word correction
  • Given a lexicon and a character sequence Q,
    return the words in the lexicon closest to Q
  • What's "closest"?
  • We'll study several alternatives
  • Edit distance
  • Weighted edit distance
  • Character n-gram overlap

27
Edit distance
  • Given two strings S1 and S2, the minimum number
    of basic operations to convert one to the other
  • Basic operations are typically character-level
  • Insert
  • Delete
  • Replace
  • E.g., the edit distance from cat to dog is 3.
  • Generally found by dynamic programming.
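A standard dynamic-programming sketch of (unweighted) edit distance, with unit cost for each insert, delete, and replace:

    def edit_distance(s1, s2):
        """Minimum number of insertions, deletions, and replacements turning s1 into s2."""
        m, n = len(s1), len(s2)
        # dp[i][j] = edit distance between s1[:i] and s2[:j]
        dp = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            dp[i][0] = i
        for j in range(n + 1):
            dp[0][j] = j
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if s1[i - 1] == s2[j - 1] else 1
                dp[i][j] = min(dp[i - 1][j] + 1,         # delete from s1
                               dp[i][j - 1] + 1,         # insert into s1
                               dp[i - 1][j - 1] + cost)  # replace (or match)
        return dp[m][n]

    print(edit_distance("cat", "dog"))   # 3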

28
Weighted edit distance
  • As above, but the weight of an operation depends
    on the character(s) involved
  • Meant to capture keyboard errors, e.g., m is more
    likely to be mistyped as n than as q
  • Therefore, replacing m by n is a smaller edit
    distance than by q
  • (Same ideas usable for OCR, but with different
    weights)
  • Require weight matrix as input
  • Modify dynamic programming to handle weights
    (Viterbi)
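The same recurrence with non-uniform costs; the sub_cost function below stands in for whatever weight matrix (keyboard or OCR confusions) is supplied as input:

    def weighted_edit_distance(s1, s2, sub_cost, ins_cost=1.0, del_cost=1.0):
        """Edit distance where replacing character a by b costs sub_cost(a, b)."""
        m, n = len(s1), len(s2)
        dp = [[0.0] * (n + 1) for _ in range(m + 1)]
        for i in range(1, m + 1):
            dp[i][0] = dp[i - 1][0] + del_cost
        for j in range(1, n + 1):
            dp[0][j] = dp[0][j - 1] + ins_cost
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                rep = 0.0 if s1[i - 1] == s2[j - 1] else sub_cost(s1[i - 1], s2[j - 1])
                dp[i][j] = min(dp[i - 1][j] + del_cost,
                               dp[i][j - 1] + ins_cost,
                               dp[i - 1][j - 1] + rep)
        return dp[m][n]

    # e.g., make the common m/n keyboard confusion cheaper than other substitutions
    keyboard = lambda a, b: 0.5 if {a, b} == {"m", "n"} else 1.0
    print(weighted_edit_distance("norm", "morn", keyboard))   # 1.0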

29
Using edit distances
  • Given query, first enumerate all dictionary terms
    within a preset (weighted) edit distance
  • Then look up enumerated dictionary terms in the
    term-document inverted index

30
Edit distance to all dictionary terms?
  • Given a (misspelled) query, do we compute its
    edit distance to every dictionary term?
  • Expensive and slow
  • How do we cut the set of candidate dictionary
    terms?
  • We can use character n-gram overlap for this
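A sketch of that pruning step: compare the query's character n-grams against each dictionary term's and keep only terms whose overlap (Jaccard coefficient here) clears a threshold, before running the expensive edit-distance computation on the survivors. The lexicon and threshold below are illustrative.

    def char_ngrams(word, n=2):
        """Set of character n-grams, e.g. bigrams of 'bord' -> {'bo', 'or', 'rd'}."""
        return {word[i:i + n] for i in range(len(word) - n + 1)}

    def ngram_candidates(query, lexicon, n=2, threshold=0.4):
        """Dictionary terms whose n-gram Jaccard overlap with the query meets the threshold."""
        q = char_ngrams(query, n)
        out = []
        for term in lexicon:
            t = char_ngrams(term, n)
            jaccard = len(q & t) / len(q | t) if (q | t) else 0.0
            if jaccard >= threshold:
                out.append(term)
        return out

    print(ngram_candidates("bord", ["board", "border", "lord", "morbid", "sordid"]))
    # ['board', 'border', 'lord']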

31
Context-sensitive spell correction
  • Text: I flew from Heathrow to Narita.
  • Consider the phrase query "flew form Heathrow"
  • We'd like to respond
  • Did you mean "flew from Heathrow"?
  • because no docs matched the query phrase.

32
Context-sensitive correction
  • Need surrounding context to catch this.
  • NLP too heavyweight for this.
  • First idea: retrieve dictionary terms close (in
    weighted edit distance) to each query term
  • Now try all possible resulting phrases with one
    word fixed at a time
  • flew from heathrow
  • fled form heathrow
  • flea form heathrow
  • etc.
  • Suggest the alternative that has lots of hits?
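A sketch of that enumeration step: substitute near-miss alternatives one position at a time (the alternatives mapping below is made up; in practice it would come from weighted edit distance against the lexicon), then run each candidate as a phrase query and prefer the one with the most hits.

    def candidate_phrases(query_tokens, alternatives):
        """Phrases that differ from the query in exactly one position."""
        phrases = []
        for i, word in enumerate(query_tokens):
            for alt in alternatives.get(word, []):
                if alt != word:
                    phrases.append(query_tokens[:i] + [alt] + query_tokens[i + 1:])
        return phrases

    alternatives = {"flew": ["fled", "flea"], "form": ["from", "forms"]}
    for p in candidate_phrases(["flew", "form", "heathrow"], alternatives):
        print(" ".join(p))
    # fled form heathrow, flea form heathrow, flew from heathrow, flew forms heathrow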

33
Exercise
  • Suppose that for "flew form Heathrow" we have 7
    alternatives for "flew", 19 for "form", and 3 for
    "heathrow".
  • How many corrected phrases will we enumerate in
    this scheme?

34
General issue in spell correction
  • Will enumerate multiple alternatives for "Did you
    mean?"
  • Need to figure out which one (or small number) to
    present to the user
  • Use heuristics
  • The alternative hitting most docs
  • Query log analysis tweaking
  • For especially popular, topical queries
  • Language modeling

35
Computational cost
  • Spell-correction is computationally expensive
  • Avoid running routinely on every query?
  • Run only on queries that matched few docs

36
Next Time
  • On to Chapter 4
  • Real indexing