Chapter 3: Indexing and Query Processing mostly based on SC Chapter 3

Transcript and Presenter's Notes



1
Chapter 3: Indexing and Query Processing (mostly based on SC Chapter 3)
3.1 Inverted Lists
3.2 Fast Top-k Search
3.3 Ranking and Advanced Query Types
3.4 Similarity Search
2
3.3 Ranking and Advanced Query Types
3.3.1 Ranking and Quality Measures
3.3.2 Stems, Phrases, Proximity
3.3.3 Relevance Feedback and Query Expansion
3.3.4 Fuzzy Search
3
Vector-Space tf·idf Scores
tf(di, tj): term frequency of term tj in doc di
df(tj): document frequency of tj = number of docs containing tj
idf(tj) = N / df(tj), with corpus size (total number of docs) N
dl(di): doc length of di (avgdl: avg. doc length over all N docs)
tf·idf score for single-term query (index weight):
  wij = tf(di, tj) · idf(tj) (typically with log dampening) for tf(di, tj) > 0, 0 else
  (plus optional length normalization)
cosine similarity for ranking (cosine of the angle between the q and di vectors when both are L2-normalized):
  sim(q, di) = Σj qj · dij / (||q||2 · ||di||2), summing over the j with qj ≠ 0 and dij ≠ 0
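The index weight and cosine ranking above can be sketched as follows (a minimal illustration with sparse `{term: weight}` vectors; the function names `tfidf` and `cosine` are my own, and the log-dampened idf is one common choice):

```python
import math

def tfidf(tf, df, N):
    """tf.idf index weight for one term in one doc: tf times idf,
    with idf = N/df log-dampened here (a common choice); 0 if absent."""
    if tf == 0:
        return 0.0
    return tf * math.log(N / df)

def cosine(q, d):
    """Cosine similarity of two sparse vectors given as {term: weight} dicts;
    only terms present in both vectors contribute to the scalar product."""
    dot = sum(w * d[t] for t, w in q.items() if t in d)
    nq = math.sqrt(sum(w * w for w in q.values()))
    nd = math.sqrt(sum(w * w for w in d.values()))
    return dot / (nq * nd) if nq and nd else 0.0
```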
4
Pivoted tf·idf Scores
tf(di, tj): term frequency of term tj in doc di
df(tj): document frequency of tj = number of docs containing tj
idf(tj) = N / df(tj), with corpus size (total number of docs) N
dl(di): doc length of di (avgdl: avg. doc length over all N docs)
tf·idf score for single-term query (index weight):
  wij = tf(di, tj) · idf(tj) (typically with log dampening) for tf(di, tj) > 0, 0 else
pivoted tf·idf score: normalizes tf by a pivoted doc length instead of dl(di);
avoids undue favoring of long docs;
also uses scalar product for score aggregation
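The pivoted formula itself did not survive the transcript; a commonly used form of pivoted document-length normalization (after Singhal et al.; the slide's exact variant may differ) is:

```latex
% Pivoted tf.idf weight (common form; the slide's exact variant is unknown):
% the pivot slope s (e.g. 0.2) interpolates between no length normalization
% (s = 0) and full normalization by relative document length (s = 1).
w(d_i, t_j) \;=\; \frac{\mathit{tf}(d_i, t_j)}{(1-s) + s \cdot \mathit{dl}(d_i)/\mathit{avgdl}}
               \cdot \log \frac{N}{\mathit{df}(t_j)}
```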
5
Search Result Quality: Basic Measures
the ideal measure is user satisfaction; heuristically approximated by benchmarking measures (on test corpora with a query suite and relevance assessments by experts)
Capability to return only relevant documents:
  Precision (German: Präzision) = |relevant docs in top-r results| / r,
  typically for r = 10, 100, 1000
Capability to return all relevant documents:
  Recall (German: Ausbeute) = |relevant docs in top-r results| / |all relevant docs|,
  typically for r = corpus size
6
Search Result Quality: Aggregated Measures
Combining precision p and recall r into the F measure:
  F = 1 / (α · 1/p + (1−α) · 1/r), e.g. with α = 0.5: harmonic mean F1 = 2pr / (p + r)
Precision-recall breakeven point of query q:
  point on the precision-recall curve p = f(r) with p = r
For a set of n queries q1, ..., qn (e.g. TREC benchmark):
Macro evaluation (user-oriented) of precision:
  average the per-query precision values (analogous for recall and F1)
Micro evaluation (system-oriented) of precision:
  pool the results of all queries and compute precision over the pooled set
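The basic measures above can be sketched in a few lines (the function names are my own; `results` is a ranked list of doc ids and `relevant` the expert-assessed relevant set):

```python
def precision_at(results, relevant, r):
    """Fraction of the top-r results that are relevant."""
    return sum(1 for d in results[:r] if d in relevant) / r

def recall_at(results, relevant, r):
    """Fraction of all relevant docs that appear in the top-r results."""
    return sum(1 for d in results[:r] if d in relevant) / len(relevant)

def f_measure(p, r, alpha=0.5):
    """F measure; alpha = 0.5 gives the harmonic mean F1."""
    if p == 0 or r == 0:
        return 0.0
    return 1.0 / (alpha / p + (1 - alpha) / r)
```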
7
Search Result Quality: Integrated Measures
  • Interpolated average precision of query q,
    with precision p(x) at recall x and step width Δ (e.g. 0.1):
    average of p(x) over the recall levels x = Δ, 2Δ, ..., 1
    ≈ area under the precision-recall curve
  • Uninterpolated average precision of query q,
    with top-m search result rank list d1, ..., dm
    and relevant results di1, ..., dik (k ≤ m, ij < ij+1 ≤ m):
    average of the precision values j / ij at the ranks of the relevant results
  • Mean average precision (MAP) of a query benchmark suite:
    macro-average of per-query interpolated average precision
    for top-m results (usually with recall width 0.01)
8
Search Result Quality: Weighted Measures
Mean reciprocal rank (MRR) over query set Q:
  average of 1 / FirstRelevantRank(q) over all q ∈ Q
  Variation: summand 0 if FirstRelevantRank > k
Discounted Cumulative Gain (DCG) for query q,
with a finite set of result ratings 0 (irrelevant), 1 (ok), 2 (good), ...:
  sum of the ratings of the top-k results, discounted by the log of the result rank
Normalized Discounted Cumulative Gain (NDCG) for query q:
  DCG divided by the DCG of the ideal ranking
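A compact sketch of these measures (function names are my own; the slide's exact DCG formula was an image, so the log2-discounted variant used here is one common choice):

```python
import math

def reciprocal_rank(results, relevant, k=None):
    """1/rank of the first relevant result; 0 if none within the top k."""
    for i, d in enumerate(results[:k] if k else results, start=1):
        if d in relevant:
            return 1.0 / i
    return 0.0

def dcg(ratings):
    """Graded ratings (0, 1, 2, ...) in rank order, discounted by log2 of
    the rank (a common DCG variant)."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(ratings, start=1))

def ndcg(ratings):
    """DCG normalized by the DCG of the ideal (rating-sorted) ranking."""
    ideal = dcg(sorted(ratings, reverse=True))
    return dcg(ratings) / ideal if ideal else 0.0
```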
9
Search Result Quality: Ranking Measures
Consider the top-k of two rankings σ1 and σ2, or full permutations of 1..n
  • overlap similarity OSim(σ1, σ2) = |top(k, σ1) ∩ top(k, σ2)| / k
  • Kendall's τ measure KDist(σ1, σ2):
    fraction of item pairs from U on whose relative order σ1 and σ2 disagree,
    with U = top(k, σ1) ∪ top(k, σ2) (with missing items set to rank k+1)
  • with ties in one ranking and an order in the other,
    count p with 0 ≤ p ≤ 1
  • p = 0: weak KDist, ..., p = 1: strict KDist
  • footrule distance Fdist(σ1, σ2) = Σu |σ1(u) − σ2(u)| over u ∈ U
(normalized) Fdist is an upper bound for KDist, and Fdist/2 is a lower bound
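Two of these ranking comparisons can be sketched directly (function names are my own; the weak/strict KDist handling of ties is omitted here):

```python
def osim(r1, r2, k):
    """Overlap similarity of two top-k rankings (lists of doc ids)."""
    return len(set(r1[:k]) & set(r2[:k])) / k

def footrule(r1, r2, k):
    """Footrule distance over U = top(k,r1) | top(k,r2); items missing
    from one ranking are assigned rank k+1, as on the slide."""
    union = set(r1[:k]) | set(r2[:k])
    pos1 = {d: i + 1 for i, d in enumerate(r1[:k])}
    pos2 = {d: i + 1 for i, d in enumerate(r2[:k])}
    return sum(abs(pos1.get(d, k + 1) - pos2.get(d, k + 1)) for d in union)
```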
10
Search with Morphological Reduction (Lemmatization)
  • Reduction onto the grammatical ground form:
  • nouns onto nominative, verbs onto infinitive,
  • plural onto singular, passive onto active, etc.
  • Examples:
  • English: "keys" onto "key", "corpora" onto "corpus",
  •   "finding" and "found" onto "find"
  • German: "finden" and "gefundenes" onto "finden",
  •   "Gefundenes" onto "Fund"
  • Reduction of morphological variations onto the word stem:
  • flexions (e.g. declension), composition, verb-to-noun, etc.
  • Examples:
  • English: "beloved" onto "love", "betraying" onto "betrayal",
  •   "downloading" onto "load", "overloaded" onto "load"
  • German: "Flüssen", "einflößen" onto "Fluss",
  •   "finden" and "Gefundenes" onto "finden",
  •   "Du brachtest ... mit" onto "mitbringen"
11
Stemming and Normalization
  • Approaches to Word Stemming:
  • lookup in a comprehensive lexicon/dictionary (e.g. for German)
  • heuristic affix removal (e.g. Porter stemmer for English):
  • remove prefixes and/or suffixes based on (heuristic) rules
  • Example:
  • stresses → stress, stressing → stress, symbols → symbol
  • based on rules sses → ss, ing → ε, s → ε, etc.

The benefit of stemming for IR is debated. Example: "Bill is operating a company. On his computer he runs the Linux operating system."
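Heuristic suffix removal can be illustrated with just the three rules from the slide (this is NOT the full Porter stemmer, only a toy sketch; the function name is my own):

```python
def porter_like_stem(word):
    """Apply the first matching suffix rule from the slide's example
    (sses -> ss, ing -> empty, s -> empty); a tiny subset of Porter."""
    rules = [("sses", "ss"), ("ing", ""), ("s", "")]
    for suffix, replacement in rules:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word
```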
  • Other Normalizations:
  • lower-case vs. upper-case spellings (but watch out for the language)
  • spellings with or without whitespace chars ("U.S.A." vs. "USA")
  • umlauts ("ü" vs. "ue") and other special letters or accents
  • transcriptions from foreign languages ("Chebyshev" vs. "Tchebycheff")
12
Stopword Elimination
  • Lookup in a stopword list
  • (possibly considering domain-specific vocabulary,
  • e.g. "definition" or "theorem" in a math corpus)

Typical English stopwords (articles,
prepositions, conjunctions, pronouns, overloaded
verbs, etc.) a, also, an, and, as, at, be,
but, by, can, could, do, for, from, go, have,
he, her, here, his, how, I, if, in, into, it,
its, my, of, on, or, our, say, she, that, the,
their, there, therefore, they, this, these,
those, through, to, until, we, what, when,
where, which, while, who, with, would, you, your
13
HTML Pages
  • Consider anchor-text terms as terms of the
    link-target page
  • Boost scores (index weights) of terms with
    highlighting tags
  • (e.g. use term-tag statistics for idf)
  • Add tag-term pairs to the dictionary / index
  • and allow search for faceted/tagged terms
  • Example: author=Spire, editor=Spire, conference=Spire

14
Phrase Queries and Proximity Queries
phrase queries such as "George W. Bush", "President Bush", "The Who", "Evil Empire", "PhD admission", "FC Schalke 04", "native American music", "to be or not to be", "The Lord of the Rings", etc.
difficult to anticipate and index all (meaningful) phrases;
sources could be thesauri (e.g. WordNet) or query logs
  • standard approach
  • combine single-term index with separate
    position index

position index (B-tree on term, doc):
  term    doc  offset
  ...
  empire  39   191
  empire  77   375
  ...
  evil    12   45
  evil    39   190
  evil    39   194
  evil    49   190
  evil    77   190
  ...
single-term index (B-tree on term):
  term    doc  score
  ...
  empire  77   0.85
  empire  39   0.82
  ...
  evil    49   0.81
  evil    39   0.78
  evil    12   0.75
  evil    77   0.12
  ...
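Checking a phrase against such a position index amounts to intersecting doc lists and testing for consecutive offsets; a minimal sketch (the `phrase_match` name and the nested-dict index schema are my own simplification of the B-tree layout):

```python
def phrase_match(positions, phrase):
    """positions: dict term -> dict doc -> sorted list of offsets.
    Return the docs in which the phrase terms occur at consecutive offsets."""
    first, rest = phrase[0], phrase[1:]
    # candidate docs must contain every term of the phrase
    docs = set(positions.get(first, {}))
    for t in rest:
        docs &= set(positions.get(t, {}))
    result = []
    for doc in docs:
        for start in positions[first][doc]:
            # term i of the phrase must occur at offset start + i
            if all(start + i + 1 in positions[t][doc] for i, t in enumerate(rest)):
                result.append(doc)
                break
    return sorted(result)
```

With the slide's toy data, "evil empire" matches doc 39 (offsets 190, 191) but not doc 77, where "empire" occurs far away at offset 375.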
15
Biword and Phrase Indexing
build an index over all word pairs: index lists (term1, term2, doc, score),
or for each term1 a nested list of (term2, doc, score)
  • variations
  • treat nearest nouns as pairs,
  • or discount articles, prepositions,
    conjunctions
  • index phrases from query logs, compute
    correlation statistics
  • query processing:
  • decompose even-length phrases into biwords
  • decompose odd-length phrases into biwords with low selectivity
  • (as estimated by df(term1))
  • may additionally use the standard single-term index if necessary

Examples: "to be or not to be" → (to be) (or not) (to be);
"The Lord of the Rings" → (The Lord) (Lord of) (the Rings)
16
N-Gram Indexing and Wildcard Queries
Queries with wildcards (simple regular expressions), to capture mis-spellings, name variations, etc. Examples: Britney, Sm*th, Gozilla, Marko, reali*ation, *raklion
  • Approach
  • decompose words into N-grams of N successive
    letters
  • and index all N-grams as terms
  • query processing computes AND of N-gram matches
  • Example (N=3):
  • Britney → Bri AND rit AND ney

Generalization: decompose words into frequent fragments (e.g., syllables, or fragments derived from mis-spelling statistics)
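The N-gram AND step can be sketched as follows (function names are my own; note that the N-gram filter only produces candidates, which must still be verified against the wildcard pattern, since N-gram containment can yield false positives):

```python
def ngrams(word, n=3):
    """All n-grams of n successive letters of the word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def wildcard_candidates(pattern, vocabulary, n=3):
    """AND together the n-grams of the wildcard-free fragments of the
    pattern and return the vocabulary words containing all of them."""
    required = set()
    for fragment in pattern.split("*"):
        required |= ngrams(fragment, n)
    return {w for w in vocabulary if required <= ngrams(w, n)}
```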
17
Proximity Search
Example queries: "root polynom three", "high cholesterol measure", "doctor degree defense"
Idea: identify positions (pos) of all query-term occurrences and reward short distances
keyword proximity score [Büttcher/Clarke: SIGIR 06]:
aggregation of per-term scores; per-term-pair scores attributed to each term;
count only pairs of query terms with no other query term in between
cannot be precomputed → expensive at query-time
18
Example Proximity Score Computation
  • It1 took2 the3 sea4 a5 thousand6 years,7
  • A8 thousand9 years10 to11 trace12
  • The13 granite14 features15 of16 this17 cliff,18
  • In19 crag20 and21 scarp22 and23 base.24
  • Query sea, years, cliff

19
Efficient Proximity Search
Define the aggregation function to be distributive [Broschart et al. 2007]
rather than holistic [Büttcher/Clarke 2006]:
precompute term-pair distances and sum them up at query-time;
count all pairs of query terms;
result quality comparable to holistic scores
index all pairs within a max. window size (or a nested list of nearby terms
for each term), with precomputed pair-score mass
20
Ex. Efficiently Computable Proximity Score
  • It1 took2 the3 sea4 a5 thousand6 years,7
  • A8 thousand9 years10 to11 trace12
  • The13 granite14 features15 of16 this17 cliff,18
  • In19 crag20 and21 scarp22 and23 base.24
  • Query sea, years, cliff

21
Relevance Feedback
Given a query q, a result set (or ranked list) D, and a user's assessment u: D → {+, −},
yielding positive docs D+ ⊆ D and negative docs D− ⊆ D
Goal: derive a query q' that better captures the user's intention,
by adapting term weights in the query or by query expansion
Classical approach: Rocchio method (for term vectors):
  q' = α·q + (β/|D+|) · Σd∈D+ d − (γ/|D−|) · Σd∈D− d
with α, β, γ ∈ [0, 1] and typically α > β > γ
Modern approach: replace explicit feedback by implicit feedback
derived from query&click logs (pos. if clicked, neg. if skipped),
or rely on pseudo-relevance feedback:
assume that all top-k results are positive
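The Rocchio update on sparse term vectors can be sketched as follows (the function name is my own; the default weights follow the typical ordering α > β > γ mentioned on the slide, and terms whose weight drops to zero or below are pruned, a common practical choice):

```python
def rocchio(q, pos, neg, alpha=1.0, beta=0.75, gamma=0.25):
    """Rocchio relevance feedback: q, and each doc in pos/neg, is a sparse
    {term: weight} dict; returns the adapted query vector q'."""
    terms = set(q) | {t for d in pos for t in d} | {t for d in neg for t in d}
    q_new = {}
    for t in terms:
        w = alpha * q.get(t, 0.0)
        if pos:
            w += beta * sum(d.get(t, 0.0) for d in pos) / len(pos)
        if neg:
            w -= gamma * sum(d.get(t, 0.0) for d in neg) / len(neg)
        if w > 0:  # prune non-positive weights (common practical choice)
            q_new[t] = w
    return q_new
```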
22
Query Expansion
  • q: transportation tunnel disasters (from TREC 2004 Robust Track)

[Figure: expansion-term lists with similarity weights, matched in example docs d1, d2:
transportation (1.0) → transit, highway, train, truck, metro, "rail car", car (weights 0.9, 0.8, 0.7, 0.6, 0.6, 0.5, 0.1);
tunnel (1.0) → tube, underground, "Mont Blanc" (weights 0.9, 0.8, 0.7);
disasters (1.0) → catastrophe, accident, fire, flood, earthquake, landslide (weights 1.0, 0.9, 0.7, 0.6, 0.6, 0.5)]
  • Expansion terms from (pseudo-) relevance feedback,
  •   thesauri/gazetteers/ontologies, Google top-10 snippets,
  •   query & click logs, the user's desktop data, etc.
  • Term similarities precomputed from corpus-wide
  •   correlation measures, analysis of the co-occurrence matrix, etc.

23
Towards Robust Query Expansion
Threshold-based query expansion: substitute w by exp(w) = {c1, ..., ck},
the set of all ci with sim(w, ci) ≥ θ
Naive scoring: s(q, d) = Σw∈q Σc∈exp(w) sim(w, c) · sc(d)
disputed for its danger of topic dilution
  • Approach to careful expansion and scoring:
  • determine phrases from the query or the best initial query results
  • (e.g., forming 3-grams and looking up ontology/thesaurus entries)
  • if uniquely mapped to one concept,
  • then expand with synonyms and weighted hyponyms
  • avoid undue score-mass accumulation by expansion terms:

s(q, d) = Σw∈q maxc∈exp(w) sim(w, c) · sc(d)
24
Exploiting Query Logs for Query Expansion
Given user sessions of the form (q, D), let d ∈ D denote the event that d is clicked on.
We are interested in the correlation between words w' in a query
and words w in a clicked-on document.
Estimate from the query log:
  the relative frequency of w in d, and
  the relative frequency of d being clicked on when w' appears in the query.
Expand the query by adding the top-m words w in descending order of this estimated correlation.
25
Fuzzy Search with Edit Distance
Idea: tolerate mis-spellings and other variations of search terms,
and score matches based on edit distance
  • Examples:
  • 1) query: Microsoft
  •    fuzzy match: Migrosaft
  •    score: edit distance 3
  • 2) query: Microsoft
  •    fuzzy match: Microsiphon
  •    score: edit distance 5
  • 3) query: Microsoft Corporation, Redmond, WA
  •    fuzzy match at token level: MS Corp., Readmond, USA

26
Similarity Measures on Strings (1)
Hamming distance of strings s1, s2 ∈ Σ* with |s1| = |s2|:
number of different characters (cardinality of {i : s1[i] ≠ s2[i]})
Levenshtein distance (edit distance) of strings s1, s2 ∈ Σ*:
minimal number of editing operations on s1 (replacement, deletion,
insertion of a character) to change s1 into s2
For edit(i, j), the Levenshtein distance of s1[1..i] and s2[1..j], it holds:
  edit(0, 0) = 0, edit(i, 0) = i, edit(0, j) = j
  edit(i, j) = min { edit(i−1, j) + 1,
                     edit(i, j−1) + 1,
                     edit(i−1, j−1) + diff(i, j) }
  with diff(i, j) = 1 if s1[i] ≠ s2[j], 0 otherwise
→ efficient computation by dynamic programming
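The dynamic program follows the recurrence directly (a straightforward O(|s1|·|s2|) table; the function name is my own):

```python
def levenshtein(s1, s2):
    """Edit distance by dynamic programming over the slide's recurrence."""
    m, n = len(s1), len(s2)
    edit = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        edit[i][0] = i  # delete all of s1[1..i]
    for j in range(n + 1):
        edit[0][j] = j  # insert all of s2[1..j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diff = 0 if s1[i - 1] == s2[j - 1] else 1
            edit[i][j] = min(edit[i - 1][j] + 1,        # deletion
                             edit[i][j - 1] + 1,        # insertion
                             edit[i - 1][j - 1] + diff) # (mis)match
    return edit[m][n]
```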
27
Similarity Measures on Strings (2)
Damerau-Levenshtein distance of strings s1, s2 ∈ Σ*:
minimal number of replacement, insertion, deletion, or transposition
operations (exchanging two adjacent characters) for changing s1 into s2
For edit(i, j), the Damerau-Levenshtein distance of s1[1..i] and s2[1..j]:
  edit(0, 0) = 0, edit(i, 0) = i, edit(0, j) = j
  edit(i, j) = min { edit(i−1, j) + 1,
                     edit(i, j−1) + 1,
                     edit(i−1, j−1) + diff(i, j),
                     edit(i−2, j−2) + diff(i−1, j) + diff(i, j−1) + 1 }
  with diff(i, j) = 1 if s1[i] ≠ s2[j], 0 otherwise
28
Similarity based on N-Grams
Determine for string s the set of its N-grams:
  G(s) = {substrings of s with length N} (often trigrams are used, i.e. N = 3)
Distance of strings s1 and s2: |G(s1)| + |G(s2)| − 2 · |G(s1) ∩ G(s2)|
Example: G(rodney) = {rod, odn, dne, ney}, G(rhodnee) = {rho, hod, odn, dne, nee}
  distance(rodney, rhodnee) = 4 + 5 − 2·2 = 5
Alternative similarity measures:
  Jaccard coefficient: |G(s1) ∩ G(s2)| / |G(s1) ∪ G(s2)|
  Dice coefficient: 2 · |G(s1) ∩ G(s2)| / (|G(s1)| + |G(s2)|)
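These set-based string measures are a few lines each (function names are my own):

```python
def trigram_set(s, n=3):
    """Set of all n-grams of the string."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_distance(s1, s2, n=3):
    """|G(s1)| + |G(s2)| - 2|G(s1) & G(s2)|, as on the slide."""
    g1, g2 = trigram_set(s1, n), trigram_set(s2, n)
    return len(g1) + len(g2) - 2 * len(g1 & g2)

def jaccard(s1, s2, n=3):
    g1, g2 = trigram_set(s1, n), trigram_set(s2, n)
    return len(g1 & g2) / len(g1 | g2)

def dice(s1, s2, n=3):
    g1, g2 = trigram_set(s1, n), trigram_set(s2, n)
    return 2 * len(g1 & g2) / (len(g1) + len(g2))
```

Running the slide's example reproduces distance(rodney, rhodnee) = 4 + 5 − 2·2 = 5.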
29
N-Gram Indexing for Fuzzy Search
Theorem (Jokinen and Ukkonen 1991): for a query string s and a target string t,
the Levenshtein edit distance is bounded via the N-gram bag-overlap of s and t
  • for fuzzy-match queries with edit-distance tolerance d,
  • perform a top-k query over N-grams,
  • using count for score aggregation

30
Phonetic Similarity (1)
  • Soundex code:
  • mapping of words (especially last names) onto 4-letter codes
  • such that words that are similarly pronounced have the same code
  • first position of code = first letter of word
  • code positions 2, 3, 4 (a, e, i, o, u, y, h, w are generally ignored):
  •   b, p, f, v → 1    c, s, g, j, k, q, x, z → 2
  •   d, t → 3          l → 4
  •   m, n → 5          r → 6
  • successive identical code letters are combined into one letter
  • (h and w do not separate such runs; vowels do)

Examples: Powers → P620, Perez → P620, Penny → P500, Penee → P500, Tymczak → T522, Tanshik → T522
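A compact Soundex sketch reproducing the slide's examples (this follows the common American Soundex convention that h/w never break a run of equal codes while vowels do; the function name is my own):

```python
def soundex(name):
    """Classic 4-letter Soundex code: first letter plus three digits."""
    codes = {**dict.fromkeys("bpfv", "1"), **dict.fromkeys("csgjkqxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    name = name.lower()
    result = name[0].upper()
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        if ch in "hw":               # h and w never break a run of equal codes
            continue
        code = codes.get(ch, "")     # vowels map to "" and do break runs
        if code and code != prev:
            result += code
        prev = code
    return (result + "000")[:4]      # pad/truncate to letter + 3 digits
```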
31
Phonetic Similarity (2)
Editex similarity: edit distance with consideration of phonetic codes
For editex(i, j), the Editex distance of s1[1..i] and s2[1..j], it holds:
  editex(0, 0) = 0
  editex(i, 0) = editex(i−1, 0) + d(s1[i−1], s1[i])
  editex(0, j) = editex(0, j−1) + d(s2[j−1], s2[j])
  editex(i, j) = min { editex(i−1, j) + d(s1[i−1], s1[i]),
                       editex(i, j−1) + d(s2[j−1], s2[j]),
                       editex(i−1, j−1) + diffcode(i, j) }
with diffcode(i, j) = 0 if s1[i] = s2[j],
                      1 if group(s1[i]) = group(s2[j]), 2 otherwise
and d(X, Y) = 1 if X ≠ Y and X is h or w, diffcode(X, Y) otherwise
with groups {a, e, i, o, u, y}, {b, p}, {c, k, q}, {d, t}, {l, r}, {m, n},
{g, j}, {f, p, v}, {s, x, z}, {c, s, z}
32
3.4 Similarity Search
  • Given a full document d, find similar documents ("related pages")
  • Construct a representation of d:
  • set/bag of terms, set of links,
  • set of query terms that led to clicking d, etc.
  • Define a similarity measure:
  • overlap, Dice coeff., Jaccard coeff., cosine, etc.
  • Efficiently estimate similarity and design an index:
  • use approximations based on (overlapping) N-grams ("shingles")
  • and statistical estimators
  • → min-wise independent permutations / min-hash method:
  • compute min(π(D)), min(π(D')) for random permutations π
  • of the N-gram sets D and D' of docs d and d'
  • and test min(π(D)) = min(π(D'))
33
Min-Wise Independent Permutations (MIPs), aka. Min-Hash Method
set of ids: {17, 21, 3, 12, 24, 8}
  h1(x) = (7x + 3) mod 51 → 20 48 24 36 18 8, min = 8
  h2(x) = (5x + 6) mod 51 → 40 9 21 15 24 46, min = 9
  ...
  hN(x) = (3x + 9) mod 51 → 9 21 18 45 30 33, min = 9
compute N random permutations with P[π(x) = miny∈S π(y)] = 1/|S| for each x ∈ S
MIPs are an unbiased estimator of resemblance:
  P[min{h(x) | x ∈ A} = min{h(y) | y ∈ B}] = |A ∩ B| / |A ∪ B|
MIPs can be viewed as repeated sampling of x, y from A, B
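The min-hash signature and the resemblance estimate can be sketched with the slide's own example hash functions (the function names are my own):

```python
def minhash_signature(ids, hash_funcs):
    """One minimum per hash function over the id set."""
    return [min(h(x) for x in ids) for h in hash_funcs]

def estimated_resemblance(sig_a, sig_b):
    """Fraction of agreeing minima: unbiased estimate of |A&B| / |A|B|."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# the slide's example hash functions
h1 = lambda x: (7 * x + 3) % 51
h2 = lambda x: (5 * x + 6) % 51
h3 = lambda x: (3 * x + 9) % 51
```

For the id set {17, 21, 3, 12, 24, 8} this reproduces the minima 8, 9, 9 shown on the slide; in practice one would use many random hash functions for a stable estimate.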
34
Duplicate Elimination [Broder et al. 1997]
duplicates on the Web may be slightly perturbed;
crawler & indexing are interested in identifying near-duplicates
  • Approach:
  • represent each document d as a set (or sequence) of
  • shingles (N-grams over tokens)
  • encode shingles by hash fingerprints (e.g., using SHA-1),
  • yielding a set of numbers S(d) ⊆ [1..n] with, e.g., n = 2^64
  • compare two docs d, d' that are suspected to be duplicates by
  •   resemblance: |S(d) ∩ S(d')| / |S(d) ∪ S(d')| (Jaccard coefficient)
  •   containment: |S(d) ∩ S(d')| / |S(d)|
  • drop d' if resemblance or containment is above a threshold
35
Efficient Duplicate Detection in Large Corpora [Broder et al. 1997]
avoid comparing all pairs of docs
  • Solution:
  • 1) for each doc compute its shingle set and MIPs
  • 2) produce a (shingleID, docID) sorted list
  • 3) produce a (docID1, docID2, shingleCount) table
  •    with counters for common shingles
  • 4) identify (docID1, docID2) pairs with shingleCount above a threshold
  •    and add a (docID1, docID2) edge to a graph
  • 5) compute the connected components of the graph (union-find)
  • → these are the near-duplicate clusters
  • Trick for additional speedup of steps 2 and 3:
  • compute super-shingles (meta sketches) for the shingles of each doc
  • docs with many common shingles have a common super-shingle w.h.p.

36
Similarity Search by Random Hyperplanes
[Charikar 2002]
similarity measure: cosine
  • generate random hyperplanes with normal vector h
  • test if d and d' are on the same side of the hyperplane

P[sign(h^T d) = sign(h^T d')] = 1 − angle(d, d') / (π/2)
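The random-hyperplane sketch can be illustrated as follows (function names are my own; the fraction of agreeing signs grows with the cosine similarity of the two vectors, per Charikar's result):

```python
import random

def hyperplane_signature(v, planes):
    """Sign of v against each random normal vector (dense float lists)."""
    return [sum(h_i * v_i for h_i, v_i in zip(h, v)) >= 0 for h in planes]

def sign_agreement(sig_a, sig_b):
    """Fraction of hyperplanes putting both vectors on the same side;
    high agreement indicates a small angle (high cosine similarity)."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# e.g. 100 Gaussian random normal vectors for 2-d doc vectors
rng = random.Random(42)
planes = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(100)]
```

Identical vectors agree on every hyperplane, opposite vectors on (almost) none, and vectors at an intermediate angle agree on an intermediate fraction.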
37
Summary of Chapter 3
  • indexing by inverted lists:
  • doc-order with compression, or score-order
  • partitioned across a server farm for scalability
  • query processing by list merging
  • or threshold algorithms (TA, NRA, CA, etc.)
  • optionally with approximation, smart scheduling, etc.
  • phrase & proximity queries with additional indexes
  • relevance feedback & query expansion for better ranking & recall
  • fuzzy search based on edit distances and N-gram overlaps
  • efficient similarity search by min-hash signatures (MIPs)

38
Additional Literature for Chapters 3.3 and 3.4
  • Ranking and advanced query types:
  • G. Navarro: A Guided Tour to Approximate String Matching, ACM Computing Surveys 2001
  • H.E. Williams, J. Zobel, D. Bahle: Fast Phrase Querying with Combined Indexes, TOIS 2004
  • H. Bast, I. Weber: Type Less, Find More: Fast Autocompletion Search with a Succinct Index, SIGIR 2006
  • R. Schenkel, A. Broschart, S. Hwang, M. Theobald, G. Weikum: Efficient Text Proximity Search, SPIRE 2007
  • T. Joachims et al.: Evaluating the Accuracy of Implicit Feedback from Clicks and Query Reformulations in Web Search, TOIS 2007
  • S. Liu, F. Liu, C.T. Yu, W. Meng: An Effective Approach to Document Retrieval via Utilizing WordNet and Recognizing Phrases, SIGIR 2004
  • P. Chirita, C. Firan, W. Nejdl: Personalized Query Expansion for the Web, WWW 2007
  • Similarity search:
  • A. Broder, S. Glassman, M. Manasse, G. Zweig: Syntactic Clustering of the Web, WWW 1997
  • M. Henzinger: Finding Near-Duplicate Web Pages: a Large-Scale Evaluation of ...