Placing Skips Optimally in Expectation

1
Placing Skips Optimally in Expectation
  • Flavio Chierichetti
  • Silvio Lattanzi
  • Federico Mari
  • Alessandro Panconesi
  • Supported by

2
Problem Statement
4
Answering conjunctive queries
Query: latte macchiato
Latte: 3, 7, 9, 14, 19, 23, 41, 47
Macchiato: 2, 4, 7, 11, 19, 41, 57, 62
Compute the intersection of the two sorted lists
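The intersection of two sorted posting lists can be sketched as a plain linear merge. This is a minimal illustration, not code from the presentation; the function name `intersect_sorted` is an assumption, and the lists are the ones shown on the slide.

```python
def intersect_sorted(a, b):
    """Intersect two ascending lists of document ids with a linear merge."""
    i, j, out = 0, 0, []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])   # document contains both terms
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1             # advance the list with the smaller id
        else:
            j += 1
    return out


latte = [3, 7, 9, 14, 19, 23, 41, 47]
macchiato = [2, 4, 7, 11, 19, 41, 57, 62]
print(intersect_sorted(latte, macchiato))  # [7, 19, 41]
```

Every element of both lists may be read in the worst case, which is the cost that skips try to reduce.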
5
Merging
Latte: 3, 7, 9, 14, 19, 23, 41, 47
Macchiato: 2, 4, 7, 9, 19, 41, 57, 62
10
Skips
Latte: 1, 2, 3, 4, 5, 6, 7, 8
Macchiato: 7, 8, 17, 18, 19, 41, 57, 62
13
Conventional WSDM
[Figure: posting list of a term t]
Skips are placed every √N positions
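The conventional placement can be sketched as follows. This is one common formulation of a merge with evenly spaced skips, written under assumptions: skips are implicit (every position that is a multiple of the skip length can jump one skip length ahead), the skip length is the integer square root of the list length, and the name `intersect_with_skips` is hypothetical.

```python
import math


def intersect_with_skips(a, b):
    """Intersect ascending lists a and b, each with skips every ~sqrt(len) positions."""
    sa = max(1, math.isqrt(len(a)))   # assumed skip length for a
    sb = max(1, math.isqrt(len(b)))   # assumed skip length for b
    i, j, out = 0, 0, []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # Take the skip only if its target does not overshoot b[j].
            if i % sa == 0 and i + sa < len(a) and a[i + sa] <= b[j]:
                i += sa
            else:
                i += 1
        else:
            if j % sb == 0 and j + sb < len(b) and b[j + sb] <= a[i]:
                j += sb
            else:
                j += 1
    return out


latte = [1, 2, 3, 4, 5, 6, 7, 8]
macchiato = [7, 8, 17, 18, 19, 41, 57, 62]
print(intersect_with_skips(latte, macchiato))  # [7, 8]
```

On the lists from the Skips slide, the merge jumps through Latte two positions at a time until it reaches 7, instead of reading every entry.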
14
Question
  • If we know the query distribution, can we place
    skips better?

16
Problem statement
  • If we know the query distribution, can we place
    skips in order to minimize the expected time of a
    merge?

Is the assumption realistic?
17
The Power of the Law
18
The query distribution contains a lot of
information. Can we provably take advantage of it?
20
The algorithms that follow work with any
query distribution whatsoever
... and can be extended to deal with soft
conjunctions
21
Outline
  • Skip placement policies
  • A matter of definitions
  • Algorithms
  • Experiments

22
Skip Placement Policies
24
Spaghetti Skips
[Figure: posting list of a term t with spaghetti skips]
26
Simple Skips
[Figure: posting list of a term t with simple skips]
This is the most interesting case
27
A Matter of Definitions
28
Useful Documents 1
q = world cup
world: 3, 7, 9, 14, 19, 23, 41, 47
cup: 2, 4, 7, 9, 19, 41, 57, 62
Relevant documents are useful
29
But usefulness does not coincide with relevance
30
Useful Documents 2
q = world cup
world: 14, 15, ?, 18, 47, ?, ?, ?
cup: 13, 19, 41, 43, 62, ?, ?, ?
Is the skip useful for q?
33
Useful Documents 2
q = world cup
world: 14, 15, ?, 18, 47, ?, ?, ?
cup: 13, 19, 41, 43, 62, ?, ?, ?
The skip is useful
34
Useful Documents 2
q = world cup
world: 14, 15, ?, ?, 47, ?, ?, ?
cup: 13, 19, 41, 43, 62, ?, ?, ?
36
Useful Documents 2
q = world cup
world: 14, 15, 16, ?, 47, ?, ?, ?
cup: 13, 19, 41, 43, 62, ?, ?, ?
37
Useful Documents 2
q = world cup
world: 14, 15, 16, 18, 47, ?, ?, ?
cup: 13, 19, 41, 43, 62, ?, ?, ?
40
Useful Documents 2
q = world cup
world: 14, 15, 16, 18, 47, ?, ?, ?
cup: 13, 19, 41, 43, 62, ?, ?, ?
The skip is useless
41
Useful Documents 2
18 cannot be skipped
q = world cup
world: 14, 15, 16, 18, 47, ?, ?, ?
cup: 13, 19, 41, 43, 62, ?, ?, ?
The skip is useless
42
Useful Documents 2
Useful documents are those that cannot be
avoided during a merge
44
Induced Distributions
  • The query distribution induces another
    distribution on the postings

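Posting list for platypus: documents 1, 2, 3, ..., i, ..., j, ..., k, ..., n
Probabilities: p_1, p_2, ..., p_i, ..., p_n
$p_i = \Pr(i \text{ is useful for } q \mid \text{platypus} \in q)$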
45
Induced Distributions
  • The query distribution induces another
    distribution on the postings

Posting list for platypus: documents 1, 2, 3, ..., i, ..., j, ..., k, ..., n
Probabilities: p_1, p_2, ..., p_i, ..., p_n
We will assume this distribution to be known
46
Induced Distributions
  • In practice these probabilities can be
    approximated using a small sample of the query
    universe

48
Making Life Simple
  • Events like "a is useful" and "b is useful" are
    not independent
  • ... but from now on we will assume that they are

This simplifying assumption will be vindicated by
our experiments
49
Algorithms
51
Algorithms
  • Input: a list with, for each doc, the probability
    that it is useful
  • Output: a skip placement that minimizes the
    expected time to merge

Cost of a merge = number of elements read in the posting list
53
Algorithms
  • O(nt) algorithm for spaghetti skips, where t is
    the average length of a skip
  • O(n log n) algorithm for simple skips

The O(n log n) algorithm for simple skips is by
far the most interesting
54
Simple Skips
  • Posting list of t: d_1, d_2, ..., d_i, ..., d_n, with probabilities p_1, p_2, ..., p_n
  • Build the solution from left to right
  • M(i) is the best placement for the prefix d_1, ..., d_i

55
Computing M(i)
In computing M(i) we have two choices. We either
place a skip landing at position i or we do not
56
Computing M(i)
If we place no skip landing at i, then $M(i) = M(i-1)$
58
Computing M(i)
[Figure: a skip from position j to position i, labelled with M(j) and G(j,i)]
Want this in O(log n): $\max_j \,[M(j) + G(j,i)]$
$M(i) = \max\{\, M(i-1),\ \max_j [M(j) + G(j,i)] \,\}$
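A naive evaluation of this recurrence can be sketched as below. It computes the inner maximum by brute force, so it runs in quadratic time rather than the O(n log n) the talk aims for. The function names, the 1-based positions, and in particular the gain function built by `make_gain` are assumptions made for illustration: the slides do not define G(j,i), so the sketch uses "number of entries skipped times the probability that none of them is useful (under the independence assumption), minus a fixed overhead" as a stand-in.

```python
import math


def best_placement(n, gain):
    """Evaluate M(i) = max(M(i-1), max_j M(j) + gain(j, i)) for i = 1..n.

    Returns (M(n), skips), where skips is a list of (j, i) pairs.
    """
    M = [0.0] * (n + 1)
    choice = [None] * (n + 1)      # start j of the skip landing at i, if any
    for i in range(1, n + 1):
        M[i] = M[i - 1]            # option 1: no skip lands at i
        for j in range(1, i):      # option 2: brute-force inner maximum
            cand = M[j] + gain(j, i)
            if cand > M[i]:
                M[i], choice[i] = cand, j
    skips, i = [], n
    while i > 0:                   # walk back through the recorded choices
        if choice[i] is None:
            i -= 1
        else:
            skips.append((choice[i], i))
            i = choice[i]
    return M[n], skips[::-1]


def make_gain(p, overhead):
    """Hypothetical gain: p[k] is the usefulness probability of position k (p[0] unused)."""
    def gain(j, i):
        skipped = i - j - 1
        p_none_useful = math.prod(1.0 - p[k] for k in range(j + 1, i))
        return skipped * p_none_useful - overhead
    return gain


p = [None, 1.0, 0.0, 0.0, 1.0]     # positions 2 and 3 are never useful
print(best_placement(4, make_gain(p, overhead=0.5)))  # (1.5, [(1, 4)])
```

Any other gain function with the same signature can be passed in; only the recurrence itself is taken from the slide.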
59
Computing M(i)
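[Figure: a skip from position T(i) to position i, labelled with M(T(i)) and G(T(i),i)]
T(i) denotes the position j that attains the maximum:
$\max_j [M(j) + G(j,i)] = M(T(i)) + G(T(i),i)$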
61
Monotonicity of T(i)
[Figure: the skips T(i) → i and T(i+1) → i+1]
$T(i) \le T(i+1)$
62
Monotonicity of T(i,k)
[Figure: positions k and i]
T(i,k) is the best jump to i under the
additional constraint that it must start no later
than k
63
Monotonicity of T(i,k)
[Figure: positions k and i]
Key lemma: $T(i,k) \le T(i+1,k)$
64
Monotonicity of T(i,k)
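  • Let $\hat{\imath}$ be the smallest index $i$ such that $T(i,k) = k$.
    Then,

$T(j,k) = \begin{cases} k & \text{if } j \ge \hat{\imath} \\ T(j,k-1) & \text{if } j < \hat{\imath} \end{cases}$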
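The update rule above says that moving from T(·,k-1) to T(·,k) overwrites a suffix of the array with k. A minimal sketch of how that could be exploited follows, under stated assumptions: the array is stored as runs of equal values, target positions are numbered 1..n, and `k_wins(i)` is a hypothetical caller-supplied predicate that is true exactly when T(i,k) = k. By the rule above the predicate is false up to some index and true from there on, so the smallest such index can be found by binary search.

```python
def update_runs(runs, n, k, k_wins):
    """Turn a run-length encoding of T(., k-1) into one of T(., k), in place.

    runs: list of (first_index, value) pairs in ascending order; each value
          holds from first_index up to the start of the next run.
    k_wins(i): hypothetical predicate, True iff T(i, k) = k.
    """
    lo, hi = 1, n + 1              # binary search for the smallest i with k_wins(i)
    while lo < hi:
        mid = (lo + hi) // 2
        if k_wins(mid):
            hi = mid
        else:
            lo = mid + 1
    i_hat = lo
    if i_hat > n:                  # k is never the best start: nothing changes
        return runs
    while runs and runs[-1][0] >= i_hat:
        runs.pop()                 # runs starting at or after i_hat are overwritten
    runs.append((i_hat, k))
    return runs


# Symbolic example: a 20-entry array whose suffix from position 12 becomes k.
prev = [(1, 1), (6, "i1"), (10, "i2"), (14, "i3"), (17, "i4")]
print(update_runs(prev, 20, "k", lambda i: i >= 12))
# [(1, 1), (6, 'i1'), (10, 'i2'), (12, 'k')]
```

Each run is appended once and removed at most once, so if each `k_wins` call takes constant time the n updates cost O(n log n) in total, which is consistent with the bound the talk states.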
66
Updating T(i,k)
T(i,k-1): 1, 1, 1, 1, 1, i_1, i_1, i_1, i_1, i_2, i_2, i_2, i_2, i_3, i_3, i_3, i_4, i_4, i_4, i_4
$1 < i_1 < i_2 < i_3 < i_4 < k-1$
67
Updating T(i,k)
T(i,k-1): 1, 1, 1, 1, 1, i_1, i_1, i_1, i_1, i_2, i_2, i_2, i_2, i_3, i_3, i_3, i_4, i_4, i_4, i_4
The best skip to reach j starts at position $i_1$, i.e. $T(j,k-1) = i_1$
69
Updating T(i,k)
T(i,k-1): 1, 1, 1, 1, 1, i_1, i_1, i_1, i_1, i_2, i_2, i_2, i_2, i_3, i_3, i_3, i_4, i_4, i_4, i_4
T(i,n) gives the optimal placement
$T(\cdot,1) \to T(\cdot,2) \to \dots \to T(\cdot,k) \to \dots \to T(\cdot,n)$
71
Updating T(i,k)
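T(i,k-1): 1, 1, 1, 1, 1, i_1, i_1, i_1, i_1, i_2, i_2, i_2, i_2, i_3, i_3, i_3, i_4, i_4, i_4, i_4
T(i,k): 1, 1, 1, 1, 1, i_1, i_1, i_1, i_1, i_2, i_2, k, k, k, k, k, k, k, k, k
The first entry equal to k is at $\min\{\, i : T(i,k) = k \,\}$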
72
The resulting algorithm takes O(N log N) time, where N
is the length of the list
73
Experiments
74
Space
75
Time to merge
76
Build up time
77
Size of query sample for spaghetti skips
78
Size of query sample for simple skips
1/256
79
The Bottom Line
  • Simple skips are the solution of choice (for
    power-law distributions)
  • They merge as fast as spaghetti skips (the
    general case)
  • They occupy less space
  • They are much faster to build
  • They need a smaller sample to collect statistics
    on document usefulness

80
Summing up
  • A first attempt to rigorously exploit
    knowledge of the query distribution
  • Much work remains to be done, but the results are
    encouraging

81
Extensions
  • Taking the cache into account
  • Taking dependencies into account
  • Compare against skip lists and other data
    structures

Thanks for your attention