Chapter 3: Indexing and Query Processing mostly based on SC Chapter 3

Transcript and Presenter's Notes



1
Chapter 3: Indexing and Query Processing (mostly based on SC Chapter 3)
3.1 Inverted Lists
3.2 Fast Top-k Search
3.3 Ranking and Advanced Query Types
3.4 Similarity Search
2
3.3 Ranking and Advanced Query Types
3.3.1 Ranking and Quality Measures
3.3.2 Stems, Phrases, Proximity
3.3.3 Relevance Feedback and Query Expansion
3.3.4 Fuzzy Search
3
Vector-Space tf·idf Scores
tf(di, tj): term frequency of term tj in doc di
df(tj): document frequency of tj = number of docs containing tj
idf(tj) = N / df(tj), with corpus size (total number of docs) N
dl(di): doc length of di (avgdl: avg. doc length over all N docs)
tf·idf score for single-term query (index weight):
  wij = tf(di, tj) · idf(tj) (typically with log dampening) for tf(di, tj) > 0, 0 else
  (plus optional length normalization)
cosine similarity for ranking (cosine of the angle between the q and di vectors when both are L2-normalized):
  sim(q, di) = Σj qj · dij / (||q||2 · ||di||2), summing over the j with qj ≠ 0 and dij ≠ 0
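The index weight and cosine ranking above can be sketched as follows (a minimal illustration with sparse `{term: weight}` vectors; the function names `tfidf` and `cosine` are my own, and the log-dampened idf is one common choice):

```python
import math

def tfidf(tf, df, N):
    """tf.idf index weight for one term in one doc: tf times idf,
    with idf = N/df log-dampened here (a common choice); 0 if absent."""
    if tf == 0:
        return 0.0
    return tf * math.log(N / df)

def cosine(q, d):
    """Cosine similarity of two sparse vectors given as {term: weight} dicts;
    only terms present in both vectors contribute to the scalar product."""
    dot = sum(w * d[t] for t, w in q.items() if t in d)
    nq = math.sqrt(sum(w * w for w in q.values()))
    nd = math.sqrt(sum(w * w for w in d.values()))
    return dot / (nq * nd) if nq and nd else 0.0
```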
4
Pivoted tf·idf Scores
tf(di, tj): term frequency of term tj in doc di
df(tj): document frequency of tj = number of docs containing tj
idf(tj) = N / df(tj), with corpus size (total number of docs) N
dl(di): doc length of di (avgdl: avg. doc length over all N docs)
tf·idf score for single-term query (index weight):
  wij = tf(di, tj) · idf(tj) (typically with log dampening) for tf(di, tj) > 0, 0 else
pivoted tf·idf score: normalizes tf by a pivoted doc length instead of dl(di);
avoids undue favoring of long docs;
also uses scalar product for score aggregation
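The pivoted formula itself did not survive the transcript; a commonly used form of pivoted document-length normalization (after Singhal et al.; the slide's exact variant may differ) is:

```latex
% Pivoted tf.idf weight (common form; the slide's exact variant is unknown):
% the pivot slope s (e.g. 0.2) interpolates between no length normalization
% (s = 0) and full normalization by relative document length (s = 1).
w(d_i, t_j) \;=\; \frac{\mathit{tf}(d_i, t_j)}{(1-s) + s \cdot \mathit{dl}(d_i)/\mathit{avgdl}}
               \cdot \log \frac{N}{\mathit{df}(t_j)}
```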
5
Search Result Quality: Basic Measures
the ideal measure is user satisfaction; heuristically approximated by benchmarking measures (on test corpora with a query suite and relevance assessments by experts)
Capability to return only relevant documents:
  Precision (German: Präzision) = |relevant docs in top-r results| / r,
  typically for r = 10, 100, 1000
Capability to return all relevant documents:
  Recall (German: Ausbeute) = |relevant docs in top-r results| / |all relevant docs|,
  typically for r = corpus size
6
Search Result Quality: Aggregated Measures
Combining precision p and recall r into the F measure:
  F = 1 / (α · 1/p + (1−α) · 1/r), e.g. with α = 0.5: harmonic mean F1 = 2pr / (p + r)
Precision-recall breakeven point of query q:
  point on the precision-recall curve p = f(r) with p = r
For a set of n queries q1, ..., qn (e.g. TREC benchmark):
Macro evaluation (user-oriented) of precision:
  average the per-query precision values (analogous for recall and F1)
Micro evaluation (system-oriented) of precision:
  pool the results of all queries and compute precision over the pooled set
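The basic measures above can be sketched in a few lines (the function names are my own; `results` is a ranked list of doc ids and `relevant` the expert-assessed relevant set):

```python
def precision_at(results, relevant, r):
    """Fraction of the top-r results that are relevant."""
    return sum(1 for d in results[:r] if d in relevant) / r

def recall_at(results, relevant, r):
    """Fraction of all relevant docs that appear in the top-r results."""
    return sum(1 for d in results[:r] if d in relevant) / len(relevant)

def f_measure(p, r, alpha=0.5):
    """F measure; alpha = 0.5 gives the harmonic mean F1."""
    if p == 0 or r == 0:
        return 0.0
    return 1.0 / (alpha / p + (1 - alpha) / r)
```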
7
Search Result Quality: Integrated Measures
  • Interpolated average precision of query q,
    with precision p(x) at recall x and step width Δ (e.g. 0.1):
    average of p(x) over the recall levels x = Δ, 2Δ, ..., 1
    ≈ area under the precision-recall curve
  • Uninterpolated average precision of query q,
    with top-m search result rank list d1, ..., dm
    and relevant results di1, ..., dik (k ≤ m, ij < ij+1 ≤ m):
    average of the precision values j / ij at the ranks of the relevant results
  • Mean average precision (MAP) of a query benchmark suite:
    macro-average of per-query interpolated average precision
    for top-m results (usually with recall width 0.01)
8
Search Result Quality: Weighted Measures
Mean reciprocal rank (MRR) over query set Q:
  average of 1 / FirstRelevantRank(q) over all q ∈ Q
  Variation: summand 0 if FirstRelevantRank > k
Discounted Cumulative Gain (DCG) for query q,
with a finite set of result ratings 0 (irrelevant), 1 (ok), 2 (good), ...:
  sum of the ratings of the top-k results, discounted by the log of the result rank
Normalized Discounted Cumulative Gain (NDCG) for query q:
  DCG divided by the DCG of the ideal ranking
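A compact sketch of these measures (function names are my own; the slide's exact DCG formula was an image, so the log2-discounted variant used here is one common choice):

```python
import math

def reciprocal_rank(results, relevant, k=None):
    """1/rank of the first relevant result; 0 if none within the top k."""
    for i, d in enumerate(results[:k] if k else results, start=1):
        if d in relevant:
            return 1.0 / i
    return 0.0

def dcg(ratings):
    """Graded ratings (0, 1, 2, ...) in rank order, discounted by log2 of
    the rank (a common DCG variant)."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(ratings, start=1))

def ndcg(ratings):
    """DCG normalized by the DCG of the ideal (rating-sorted) ranking."""
    ideal = dcg(sorted(ratings, reverse=True))
    return dcg(ratings) / ideal if ideal else 0.0
```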
9
Search Result Quality: Ranking Measures
Consider the top-k of two rankings σ1 and σ2, or full permutations of 1..n
  • overlap similarity OSim(σ1, σ2) = |top(k, σ1) ∩ top(k, σ2)| / k
  • Kendall's τ measure KDist(σ1, σ2):
    fraction of item pairs from U on whose relative order σ1 and σ2 disagree,
    with U = top(k, σ1) ∪ top(k, σ2) (with missing items set to rank k+1)
  • with ties in one ranking and an order in the other,
    count p with 0 ≤ p ≤ 1
  • p = 0: weak KDist, ..., p = 1: strict KDist
  • footrule distance Fdist(σ1, σ2) = Σu |σ1(u) − σ2(u)| over u ∈ U
(normalized) Fdist is an upper bound for KDist, and Fdist/2 is a lower bound
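Two of these ranking comparisons can be sketched directly (function names are my own; the weak/strict KDist handling of ties is omitted here):

```python
def osim(r1, r2, k):
    """Overlap similarity of two top-k rankings (lists of doc ids)."""
    return len(set(r1[:k]) & set(r2[:k])) / k

def footrule(r1, r2, k):
    """Footrule distance over U = top(k,r1) | top(k,r2); items missing
    from one ranking are assigned rank k+1, as on the slide."""
    union = set(r1[:k]) | set(r2[:k])
    pos1 = {d: i + 1 for i, d in enumerate(r1[:k])}
    pos2 = {d: i + 1 for i, d in enumerate(r2[:k])}
    return sum(abs(pos1.get(d, k + 1) - pos2.get(d, k + 1)) for d in union)
```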
10
Search with Morphological Reduction (Lemmatization)
  • Reduction onto the grammatical ground form:
  • nouns onto nominative, verbs onto infinitive,
  • plural onto singular, passive onto active, etc.
  • Examples:
  • English: "keys" onto "key", "corpora" onto "corpus",
  •   "finding" and "found" onto "find"
  • German: "finden" and "gefundenes" onto "finden",
  •   "Gefundenes" onto "Fund"
  • Reduction of morphological variations onto the word stem:
  • flexions (e.g. declension), composition, verb-to-noun, etc.
  • Examples:
  • English: "beloved" onto "love", "betraying" onto "betrayal",
  •   "downloading" onto "load", "overloaded" onto "load"
  • German: "Flüssen", "einflößen" onto "Fluss",
  •   "finden" and "Gefundenes" onto "finden",
  •   "Du brachtest ... mit" onto "mitbringen"
11
Stemming and Normalization
  • Approaches to Word Stemming:
  • lookup in a comprehensive lexicon/dictionary (e.g. for German)
  • heuristic affix removal (e.g. Porter stemmer for English):
  • remove prefixes and/or suffixes based on (heuristic) rules
  • Example:
  • stresses → stress, stressing → stress, symbols → symbol
  • based on rules sses → ss, ing → ε, s → ε, etc.

The benefit of stemming for IR is debated. Example: "Bill is operating a company. On his computer he runs the Linux operating system."
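Heuristic suffix removal can be illustrated with just the three rules from the slide (this is NOT the full Porter stemmer, only a toy sketch; the function name is my own):

```python
def porter_like_stem(word):
    """Apply the first matching suffix rule from the slide's example
    (sses -> ss, ing -> empty, s -> empty); a tiny subset of Porter."""
    rules = [("sses", "ss"), ("ing", ""), ("s", "")]
    for suffix, replacement in rules:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word
```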
  • Other Normalizations:
  • lower-case vs. upper-case spellings (but watch out for the language)
  • spellings with or without whitespace chars ("U.S.A." vs. "USA")
  • umlauts ("ü" vs. "ue") and other special letters or accents
  • transcriptions from foreign languages ("Chebyshev" vs. "Tchebycheff")
12
Stopword Elimination
  • Lookup in a stopword list
  • (possibly considering domain-specific vocabulary,
  • e.g. "definition" or "theorem" in a math corpus)

Typical English stopwords (articles,
prepositions, conjunctions, pronouns, overloaded
verbs, etc.) a, also, an, and, as, at, be,
but, by, can, could, do, for, from, go, have,
he, her, here, his, how, I, if, in, into, it,
its, my, of, on, or, our, say, she, that, the,
their, there, therefore, they, this, these,
those, through, to, until, we, what, when,
where, which, while, who, with, would, you, your
13
HTML Pages
  • Consider anchor-text terms as terms of the
    link-target page
  • Boost scores (index weights) of terms with
    highlighting tags
  • (e.g. use term-tag statistics for idf)
  • Add tag-term pairs to the dictionary / index
  • and allow search for faceted/tagged terms
  • Example: author=Spire, editor=Spire, conference=Spire

14
Phrase Queries and Proximity Queries
phrase queries such as "George W. Bush", "President Bush", "The Who", "Evil Empire", "PhD admission", "FC Schalke 04", "native American music", "to be or not to be", "The Lord of the Rings", etc.
difficult to anticipate and index all (meaningful) phrases;
sources could be thesauri (e.g. WordNet) or query logs
  • standard approach
  • combine single-term index with separate
    position index

position index (B-tree on term, doc):
  term    doc  offset
  ...
  empire  39   191
  empire  77   375
  ...
  evil    12   45
  evil    39   190
  evil    39   194
  evil    49   190
  evil    77   190
  ...
single-term index (B-tree on term):
  term    doc  score
  ...
  empire  77   0.85
  empire  39   0.82
  ...
  evil    49   0.81
  evil    39   0.78
  evil    12   0.75
  evil    77   0.12
  ...
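Checking a phrase against such a position index amounts to intersecting doc lists and testing for consecutive offsets; a minimal sketch (the `phrase_match` name and the nested-dict index schema are my own simplification of the B-tree layout):

```python
def phrase_match(positions, phrase):
    """positions: dict term -> dict doc -> sorted list of offsets.
    Return the docs in which the phrase terms occur at consecutive offsets."""
    first, rest = phrase[0], phrase[1:]
    # candidate docs must contain every term of the phrase
    docs = set(positions.get(first, {}))
    for t in rest:
        docs &= set(positions.get(t, {}))
    result = []
    for doc in docs:
        for start in positions[first][doc]:
            # term i of the phrase must occur at offset start + i
            if all(start + i + 1 in positions[t][doc] for i, t in enumerate(rest)):
                result.append(doc)
                break
    return sorted(result)
```

With the slide's toy data, "evil empire" matches doc 39 (offsets 190, 191) but not doc 77, where "empire" occurs far away at offset 375.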
15
Biword and Phrase Indexing
build an index over all word pairs: index lists (term1, term2, doc, score),
or for each term1 a nested list of (term2, doc, score)
  • variations
  • treat nearest nouns as pairs,
  • or discount articles, prepositions,
    conjunctions
  • index phrases from query logs, compute
    correlation statistics
  • query processing:
  • decompose even-length phrases into biwords
  • decompose odd-length phrases into biwords with low selectivity
  • (as estimated by df(term1))
  • may additionally use the standard single-term index if necessary

Examples: "to be or not to be" → (to be) (or not) (to be);
"The Lord of the Rings" → (The Lord) (Lord of) (the Rings)
16
N-Gram Indexing and Wildcard Queries
Queries with wildcards (simple regular expressions), to capture mis-spellings, name variations, etc. Examples: Britney, Sm*th, Gozilla, Marko, reali*ation, *raklion
  • Approach
  • decompose words into N-grams of N successive
    letters
  • and index all N-grams as terms
  • query processing computes AND of N-gram matches
  • Example (N=3):
  • Britney → Bri AND rit AND ney

Generalization: decompose words into frequent fragments (e.g., syllables, or fragments derived from mis-spelling statistics)
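The N-gram AND step can be sketched as follows (function names are my own; note that the N-gram filter only produces candidates, which must still be verified against the wildcard pattern, since N-gram containment can yield false positives):

```python
def ngrams(word, n=3):
    """All n-grams of n successive letters of the word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def wildcard_candidates(pattern, vocabulary, n=3):
    """AND together the n-grams of the wildcard-free fragments of the
    pattern and return the vocabulary words containing all of them."""
    required = set()
    for fragment in pattern.split("*"):
        required |= ngrams(fragment, n)
    return {w for w in vocabulary if required <= ngrams(w, n)}
```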
17
Proximity Search
Example queries: "root polynom three", "high cholesterol measure", "doctor degree defense"
Idea: identify positions (pos) of all query-term occurrences and reward short distances
keyword proximity score [Büttcher/Clarke: SIGIR 06]:
aggregation of per-term scores; per-term-pair scores attributed to each term;
count only pairs of query terms with no other query term in between
cannot be precomputed → expensive at query-time
18
Example Proximity Score Computation
  • It1 took2 the3 sea4 a5 thousand6 years,7
  • A8 thousand9 years10 to11 trace12
  • The13 granite14 features15 of16 this17 cliff,18
  • In19 crag20 and21 scarp22 and23 base.24
  • Query sea, years, cliff

19
Efficient Proximity Search
Define the aggregation function to be distributive [Broschart et al. 2007]
rather than holistic [Büttcher/Clarke 2006]:
precompute term-pair distances and sum them up at query-time;
count all pairs of query terms;
result quality comparable to holistic scores
index all pairs within a max. window size (or a nested list of nearby terms
for each term), with precomputed pair-score mass
20
Ex. Efficiently Computable Proximity Score
  • It1 took2 the3 sea4 a5 thousand6 years,7
  • A8 thousand9 years10 to11 trace12
  • The13 granite14 features15 of16 this17 cliff,18
  • In19 crag20 and21 scarp22 and23 base.24
  • Query sea, years, cliff

21
Relevance Feedback
Given a query q, a result set (or ranked list) D, and a user's assessment u: D → {+, −},
yielding positive docs D+ ⊆ D and negative docs D− ⊆ D
Goal: derive a query q' that better captures the user's intention,
by adapting term weights in the query or by query expansion
Classical approach: Rocchio method (for term vectors):
  q' = α·q + (β/|D+|) · Σd∈D+ d − (γ/|D−|) · Σd∈D− d
with α, β, γ ∈ [0, 1] and typically α > β > γ
Modern approach: replace explicit feedback by implicit feedback
derived from query&click logs (pos. if clicked, neg. if skipped),
or rely on pseudo-relevance feedback:
assume that all top-k results are positive
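The Rocchio update on sparse term vectors can be sketched as follows (the function name is my own; the default weights follow the typical ordering α > β > γ mentioned on the slide, and terms whose weight drops to zero or below are pruned, a common practical choice):

```python
def rocchio(q, pos, neg, alpha=1.0, beta=0.75, gamma=0.25):
    """Rocchio relevance feedback: q, and each doc in pos/neg, is a sparse
    {term: weight} dict; returns the adapted query vector q'."""
    terms = set(q) | {t for d in pos for t in d} | {t for d in neg for t in d}
    q_new = {}
    for t in terms:
        w = alpha * q.get(t, 0.0)
        if pos:
            w += beta * sum(d.get(t, 0.0) for d in pos) / len(pos)
        if neg:
            w -= gamma * sum(d.get(t, 0.0) for d in neg) / len(neg)
        if w > 0:  # prune non-positive weights (common practical choice)
            q_new[t] = w
    return q_new
```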
22
Query Expansion
  • q: transportation tunnel disasters (from TREC 2004 Robust Track)

[Figure: expansion-term lists with similarity weights, matched in example docs d1, d2:
transportation (1.0) → transit, highway, train, truck, metro, "rail car", car (weights 0.9, 0.8, 0.7, 0.6, 0.6, 0.5, 0.1);
tunnel (1.0) → tube, underground, "Mont Blanc" (weights 0.9, 0.8, 0.7);
disasters (1.0) → catastrophe, accident, fire, flood, earthquake, landslide (weights 1.0, 0.9, 0.7, 0.6, 0.6, 0.5)]
  • Expansion terms from (pseudo-) relevance feedback,
  •   thesauri/gazetteers/ontologies, Google top-10 snippets,
  •   query & click logs, the user's desktop data, etc.
  • Term similarities precomputed from corpus-wide
  •   correlation measures, analysis of the co-occurrence matrix, etc.

23
Towards Robust Query Expansion
Threshold-based query expansion: substitute w by exp(w) = {c1, ..., ck},
the set of all ci with sim(w, ci) ≥ θ
Naive scoring: s(q, d) = Σw∈q Σc∈exp(w) sim(w, c) · sc(d)
disputed for its danger of topic dilution
  • Approach to careful expansion and scoring:
  • determine phrases from the query or the best initial query results
  • (e.g., forming 3-grams and looking up ontology/thesaurus entries)
  • if uniquely mapped to one concept,
  • then expand with synonyms and weighted hyponyms
  • avoid undue score-mass accumulation by expansion terms:

s(q, d) = Σw∈q maxc∈exp(w) sim(w, c) · sc(d)
24
Exploiting Query Logs for Query Expansion
Given user sessions of the form (q, D), let d ∈ D denote the event that d is clicked on.
We are interested in the correlation between words w' in a query
and words w in a clicked-on document.
Estimate from the query log:
  the relative frequency of w in d, and
  the relative frequency of d being clicked on when w' appears in the query.
Expand the query by adding the top-m words w in descending order of this estimated correlation.
25
Fuzzy Search with Edit Distance
Idea: tolerate mis-spellings and other variations of search terms,
and score matches based on edit distance
  • Examples:
  • 1) query: Microsoft
  •    fuzzy match: Migrosaft
  •    score: edit distance 3
  • 2) query: Microsoft
  •    fuzzy match: Microsiphon
  •    score: edit distance 5
  • 3) query: Microsoft Corporation, Redmond, WA
  •    fuzzy match at token level: MS Corp., Readmond, USA

26
Similarity Measures on Strings (1)
Hamming distance of strings s1, s2 ∈ Σ* with |s1| = |s2|:
number of different characters (cardinality of {i : s1[i] ≠ s2[i]})
Levenshtein distance (edit distance) of strings s1, s2 ∈ Σ*:
minimal number of editing operations on s1 (replacement, deletion,
insertion of a character) to change s1 into s2
For edit(i, j), the Levenshtein distance of s1[1..i] and s2[1..j], it holds:
  edit(0, 0) = 0, edit(i, 0) = i, edit(0, j) = j
  edit(i, j) = min { edit(i−1, j) + 1,
                     edit(i, j−1) + 1,
                     edit(i−1, j−1) + diff(i, j) }
  with diff(i, j) = 1 if s1[i] ≠ s2[j], 0 otherwise
→ efficient computation by dynamic programming
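The dynamic program follows the recurrence directly (a straightforward O(|s1|·|s2|) table; the function name is my own):

```python
def levenshtein(s1, s2):
    """Edit distance by dynamic programming over the slide's recurrence."""
    m, n = len(s1), len(s2)
    edit = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        edit[i][0] = i  # delete all of s1[1..i]
    for j in range(n + 1):
        edit[0][j] = j  # insert all of s2[1..j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diff = 0 if s1[i - 1] == s2[j - 1] else 1
            edit[i][j] = min(edit[i - 1][j] + 1,        # deletion
                             edit[i][j - 1] + 1,        # insertion
                             edit[i - 1][j - 1] + diff) # (mis)match
    return edit[m][n]
```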
27
Similarity Measures on Strings (2)
Damerau-Levenshtein distance of strings s1, s2 ∈ Σ*:
minimal number of replacement, insertion, deletion, or transposition
operations (exchanging two adjacent characters) for changing s1 into s2
For edit(i, j), the Damerau-Levenshtein distance of s1[1..i] and s2[1..j]:
  edit(0, 0) = 0, edit(i, 0) = i, edit(0, j) = j
  edit(i, j) = min { edit(i−1, j) + 1,
                     edit(i, j−1) + 1,
                     edit(i−1, j−1) + diff(i, j),
                     edit(i−2, j−2) + diff(i−1, j) + diff(i, j−1) + 1 }
  with diff(i, j) = 1 if s1[i] ≠ s2[j], 0 otherwise
28
Similarity based on N-Grams
Determine for string s the set of its N-grams:
  G(s) = {substrings of s with length N} (often trigrams are used, i.e. N = 3)
Distance of strings s1 and s2: |G(s1)| + |G(s2)| − 2 · |G(s1) ∩ G(s2)|
Example: G(rodney) = {rod, odn, dne, ney}, G(rhodnee) = {rho, hod, odn, dne, nee}
  distance(rodney, rhodnee) = 4 + 5 − 2·2 = 5
Alternative similarity measures:
  Jaccard coefficient: |G(s1) ∩ G(s2)| / |G(s1) ∪ G(s2)|
  Dice coefficient: 2 · |G(s1) ∩ G(s2)| / (|G(s1)| + |G(s2)|)
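These set-based string measures are a few lines each (function names are my own):

```python
def trigram_set(s, n=3):
    """Set of all n-grams of the string."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_distance(s1, s2, n=3):
    """|G(s1)| + |G(s2)| - 2|G(s1) & G(s2)|, as on the slide."""
    g1, g2 = trigram_set(s1, n), trigram_set(s2, n)
    return len(g1) + len(g2) - 2 * len(g1 & g2)

def jaccard(s1, s2, n=3):
    g1, g2 = trigram_set(s1, n), trigram_set(s2, n)
    return len(g1 & g2) / len(g1 | g2)

def dice(s1, s2, n=3):
    g1, g2 = trigram_set(s1, n), trigram_set(s2, n)
    return 2 * len(g1 & g2) / (len(g1) + len(g2))
```

Running the slide's example reproduces distance(rodney, rhodnee) = 4 + 5 − 2·2 = 5.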
29
N-Gram Indexing for Fuzzy Search
Theorem (Jokinen and Ukkonen 1991): for a query string s and a target string t,
the Levenshtein edit distance is bounded via the N-gram bag-overlap of s and t
  • for fuzzy-match queries with edit-distance tolerance d,
  • perform a top-k query over N-grams,
  • using count for score aggregation

30
Phonetic Similarity (1)
  • Soundex code:
  • mapping of words (especially last names) onto 4-letter codes
  • such that words that are similarly pronounced have the same code
  • first position of code = first letter of word
  • code positions 2, 3, 4 (a, e, i, o, u, y, h, w are generally ignored):
  •   b, p, f, v → 1    c, s, g, j, k, q, x, z → 2
  •   d, t → 3          l → 4
  •   m, n → 5          r → 6
  • successive identical code letters are combined into one letter
  • (h and w do not separate such runs; vowels do)

Examples: Powers → P620, Perez → P620, Penny → P500, Penee → P500, Tymczak → T522, Tanshik → T522
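A compact Soundex sketch reproducing the slide's examples (this follows the common American Soundex convention that h/w never break a run of equal codes while vowels do; the function name is my own):

```python
def soundex(name):
    """Classic 4-letter Soundex code: first letter plus three digits."""
    codes = {**dict.fromkeys("bpfv", "1"), **dict.fromkeys("csgjkqxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    name = name.lower()
    result = name[0].upper()
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        if ch in "hw":               # h and w never break a run of equal codes
            continue
        code = codes.get(ch, "")     # vowels map to "" and do break runs
        if code and code != prev:
            result += code
        prev = code
    return (result + "000")[:4]      # pad/truncate to letter + 3 digits
```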
31
Phonetic Similarity (2)
Editex similarity: edit distance with consideration of phonetic codes
For editex(i, j), the Editex distance of s1[1..i] and s2[1..j], it holds:
  editex(0, 0) = 0
  editex(i, 0) = editex(i−1, 0) + d(s1[i−1], s1[i])
  editex(0, j) = editex(0, j−1) + d(s2[j−1], s2[j])
  editex(i, j) = min { editex(i−1, j) + d(s1[i−1], s1[i]),
                       editex(i, j−1) + d(s2[j−1], s2[j]),
                       editex(i−1, j−1) + diffcode(i, j) }
with diffcode(i, j) = 0 if s1[i] = s2[j],
                      1 if group(s1[i]) = group(s2[j]), 2 otherwise
and d(X, Y) = 1 if X ≠ Y and X is h or w, diffcode(X, Y) otherwise
with groups {a, e, i, o, u, y}, {b, p}, {c, k, q}, {d, t}, {l, r}, {m, n},
{g, j}, {f, p, v}, {s, x, z}, {c, s, z}
32
3.4 Similarity Search
  • Given a full document d, find similar documents ("related pages")
  • Construct a representation of d:
  • set/bag of terms, set of links,
  • set of query terms that led to clicking d, etc.
  • Define a similarity measure:
  • overlap, Dice coeff., Jaccard coeff., cosine, etc.
  • Efficiently estimate similarity and design an index:
  • use approximations based on (overlapping) N-grams ("shingles")
  • and statistical estimators
  • → min-wise independent permutations / min-hash method:
  • compute min(π(D)), min(π(D')) for random permutations π
  • of the N-gram sets D and D' of docs d and d'
  • and test min(π(D)) = min(π(D'))
33
Min-Wise Independent Permutations (MIPs), aka. Min-Hash Method
set of ids: {17, 21, 3, 12, 24, 8}
  h1(x) = (7x + 3) mod 51 → 20 48 24 36 18 8, min = 8
  h2(x) = (5x + 6) mod 51 → 40 9 21 15 24 46, min = 9
  ...
  hN(x) = (3x + 9) mod 51 → 9 21 18 45 30 33, min = 9
compute N random permutations with P[π(x) = miny∈S π(y)] = 1/|S| for each x ∈ S
MIPs are an unbiased estimator of resemblance:
  P[min{h(x) | x ∈ A} = min{h(y) | y ∈ B}] = |A ∩ B| / |A ∪ B|
MIPs can be viewed as repeated sampling of x, y from A, B
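The min-hash signature and the resemblance estimate can be sketched with the slide's own example hash functions (the function names are my own):

```python
def minhash_signature(ids, hash_funcs):
    """One minimum per hash function over the id set."""
    return [min(h(x) for x in ids) for h in hash_funcs]

def estimated_resemblance(sig_a, sig_b):
    """Fraction of agreeing minima: unbiased estimate of |A&B| / |A|B|."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# the slide's example hash functions
h1 = lambda x: (7 * x + 3) % 51
h2 = lambda x: (5 * x + 6) % 51
h3 = lambda x: (3 * x + 9) % 51
```

For the id set {17, 21, 3, 12, 24, 8} this reproduces the minima 8, 9, 9 shown on the slide; in practice one would use many random hash functions for a stable estimate.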
34
Duplicate Elimination [Broder et al. 1997]
duplicates on the Web may be slightly perturbed;
crawler & indexing are interested in identifying near-duplicates
  • Approach:
  • represent each document d as a set (or sequence) of
  • shingles (N-grams over tokens)
  • encode shingles by hash fingerprints (e.g., using SHA-1),
  • yielding a set of numbers S(d) ⊆ [1..n] with, e.g., n = 2^64
  • compare two docs d, d' that are suspected to be duplicates by
  •   resemblance: |S(d) ∩ S(d')| / |S(d) ∪ S(d')| (Jaccard coefficient)
  •   containment: |S(d) ∩ S(d')| / |S(d)|
  • drop d' if resemblance or containment is above a threshold
35
Efficient Duplicate Detection in Large Corpora [Broder et al. 1997]
avoid comparing all pairs of docs
  • Solution:
  • 1) for each doc compute its shingle set and MIPs
  • 2) produce a (shingleID, docID) sorted list
  • 3) produce a (docID1, docID2, shingleCount) table
  •    with counters for common shingles
  • 4) identify (docID1, docID2) pairs with shingleCount above a threshold
  •    and add a (docID1, docID2) edge to a graph
  • 5) compute the connected components of the graph (union-find)
  • → these are the near-duplicate clusters
  • Trick for additional speedup of steps 2 and 3:
  • compute super-shingles (meta sketches) for the shingles of each doc
  • docs with many common shingles have a common super-shingle w.h.p.

36
Similarity Search by Random Hyperplanes
[Charikar 2002]
similarity measure: cosine
  • generate random hyperplanes with normal vector h
  • test if d and d' are on the same side of the hyperplane

P[sign(h^T d) = sign(h^T d')] = 1 − angle(d, d') / (π/2)
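The random-hyperplane sketch can be illustrated as follows (function names are my own; the fraction of agreeing signs grows with the cosine similarity of the two vectors, per Charikar's result):

```python
import random

def hyperplane_signature(v, planes):
    """Sign of v against each random normal vector (dense float lists)."""
    return [sum(h_i * v_i for h_i, v_i in zip(h, v)) >= 0 for h in planes]

def sign_agreement(sig_a, sig_b):
    """Fraction of hyperplanes putting both vectors on the same side;
    high agreement indicates a small angle (high cosine similarity)."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# e.g. 100 Gaussian random normal vectors for 2-d doc vectors
rng = random.Random(42)
planes = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(100)]
```

Identical vectors agree on every hyperplane, opposite vectors on (almost) none, and vectors at an intermediate angle agree on an intermediate fraction.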
37
Summary of Chapter 3
  • indexing by inverted lists:
  • doc-order with compression, or score-order
  • partitioned across a server farm for scalability
  • query processing by list merging
  • or threshold algorithms (TA, NRA, CA, etc.)
  • optionally with approximation, smart scheduling, etc.
  • phrase & proximity queries with additional indexes
  • relevance feedback & query expansion for better ranking & recall
  • fuzzy search based on edit distances and N-gram overlaps
  • efficient similarity search by min-hash signatures (MIPs)

38
Additional Literature for Chapters 3.3 and 3.4
  • Ranking and advanced query types:
  • G. Navarro: A Guided Tour to Approximate String Matching, ACM Computing Surveys 2001
  • H.E. Williams, J. Zobel, D. Bahle: Fast Phrase Querying with Combined Indexes, TOIS 2004
  • H. Bast, I. Weber: Type Less, Find More: Fast Autocompletion Search with a Succinct Index, SIGIR 2006
  • R. Schenkel, A. Broschart, S. Hwang, M. Theobald, G. Weikum: Efficient Text Proximity Search, SPIRE 2007
  • T. Joachims et al.: Evaluating the Accuracy of Implicit Feedback from Clicks and Query Reformulations in Web Search, TOIS 2007
  • S. Liu, F. Liu, C.T. Yu, W. Meng: An Effective Approach to Document Retrieval via Utilizing WordNet and Recognizing Phrases, SIGIR 2004
  • P. Chirita, C. Firan, W. Nejdl: Personalized Query Expansion for the Web, WWW 2007
  • Similarity search:
  • A. Broder, S. Glassman, M. Manasse, G. Zweig: Syntactic Clustering of the Web, WWW 1997
  • M. Henzinger: Finding Near-Duplicate Web Pages: a Large-Scale Evaluation of ...