Hashed Samples

About This Presentation

Description:

Hashed Samples Selectivity Estimators for Set Similarity Selection Queries – PowerPoint PPT presentation

Slides: 28

Provided by: Marios79

Learn more at: https://users.cs.utah.edu

Transcript and Presenter's Notes

1
Hashed Samples

Selectivity Estimators for
Set Similarity Selection Queries

2
Set Similarity: An Application

Find similar strings:
Decompose strings into 3-grams.
Represent strings as sets of 3-grams.
Compare strings by comparing their respective
3-gram sets.
Nick Koudas → {Nic, ick, ..., das}
Nick Arkoudas → {Nic, ..., das}
We can use TF/IDF similarity (or other metrics)
to evaluate set similarity.
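
The decomposition above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and Jaccard similarity is used as a simple stand-in for the TF/IDF measure mentioned on the slide.

```python
def trigrams(text: str) -> set[str]:
    """Return the set of 3-character substrings (3-grams) of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity; an assumed stand-in for TF/IDF similarity."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


g1 = trigrams("Nick Koudas")
g2 = trigrams("Nick Arkoudas")
shared = g1 & g2          # 3-grams the two strings have in common
similarity = jaccard(g1, g2)
```

The comparison is case sensitive, so "Kou" and "kou" count as different 3-grams; a real system would probably normalise case first.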

3
Indexes for Set Similarity Evaluation

Current approaches use inverted lists:
Compute the IDF of set elements (e.g., 3-grams).
Create one inverted list per set element,
consisting of one entry per database set
containing that element (e.g., one entry per
string containing the respective 3-gram).
Use various algorithms and sorting/compression
schemes for fast merging of inverted lists.
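
A minimal sketch of such an index, assuming strings are identified by their position in the input list. The helper names and the particular IDF formula (log(1 + N / df), one common variant) are assumptions, not taken from the slides.

```python
import math
from collections import defaultdict


def trigrams(text):
    """Return the set of 3-grams of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def build_index(strings):
    """Map each 3-gram to the list of ids of the strings that contain it."""
    lists = defaultdict(list)
    for sid, text in enumerate(strings):
        # Each string contributes at most one entry per list.
        for gram in trigrams(text):
            lists[gram].append(sid)
    return dict(lists)


def idf_weights(index, n_strings):
    """Assumed IDF variant: rare 3-grams get larger weights."""
    return {g: math.log(1 + n_strings / len(ids)) for g, ids in index.items()}


strings = ["Nick Koudas", "Nick Arkoudas", "Marios"]
index = build_index(strings)
weights = idf_weights(index, len(strings))
```

Because ids are appended in increasing order, every list comes out sorted by set id, which is what merge-based evaluation algorithms typically expect.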

4
Motivation

Set similarity queries are very important for:
String matching.
Data cleaning.
Set-valued attributes in ORDBMS.
A variety of set similarity operators have been
proposed (for join and selection queries).
Selectivity estimation is important for query
optimization.

5
The Problem

Let I be a predefined set similarity measure.
Let D be a collection of sets.
Given a query set q and a threshold t, a set
similarity selection query returns the answer set
A = {s ∈ D : I(q, s) > t}.
A set similarity selectivity estimation query
estimates the size of A.

6
Naïve Solutions
7
Random Sampling

Maintain a sample S of sets s ∈ D.
Size of answer:
|A| ≈ |A_S| · |D| / |S|,
where A_S = {s ∈ S : I(q, s) > t}.
Drawbacks
Query independent.
Large variance.
Needs to store complete sets in the sample.
Cannot handle updates.
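
The random sampling estimator can be sketched as follows. The function names are hypothetical and Jaccard similarity is assumed in place of the measure I; the scale-up step is the sample answer size multiplied by |D| / |S|.

```python
import random


def jaccard(a, b):
    """Assumed similarity measure I(q, s)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def estimate_by_sampling(database, query, threshold, sample_size, seed=0):
    """Estimate |A| from a uniform sample of complete sets."""
    rng = random.Random(seed)
    sample = rng.sample(database, sample_size)  # sample S, drawn without replacement
    hits = sum(1 for s in sample if jaccard(query, s) > threshold)
    return hits * len(database) / len(sample)   # scale up by |D| / |S|


database = [frozenset(range(i, i + 4)) for i in range(10)]
query = frozenset(range(0, 4))
exact = sum(1 for s in database if jaccard(query, s) > 0.5)
full = estimate_by_sampling(database, query, 0.5, len(database))
```

Note that the sample holds the complete sets and is drawn without looking at the query, which are two of the drawbacks listed above.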

8
One Sample Per List

Use the existing inverted index.
Compute one sample per inverted list.
Compute independent estimates per list, using
only the lists that correspond to the query set
elements (query specific).
Report the median, maximum, or average of the
per-list estimates.
Drawbacks
Ignores correlations between lists.
Needs to store complete sets in inverted lists.
Will not be better than simple random sampling.
Cannot handle updates.

9
Sample Union

Compute the sample union of the samples
corresponding to the query set elements.
Drawbacks
Results in a biased sample.
There are duplicate elements in those lists.
Even if we eliminate duplicates we still need to
compute the distinct set size of the sample union
(for scaling up).
This is more expensive than answering the set
similarity query exactly to begin with.
Needs to store complete sets in inverted lists.
Cannot handle updates.

10
Dynamically Computed Samples

Given the query, use reservoir sampling to compute
a sample union from the inverted lists on the fly.
Drawbacks
Produces a biased sample
Skips part of the input.
Duplicates.
Need to store complete sets in the inverted lists.
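
For reference, a sketch of basic reservoir sampling (the textbook Algorithm R) applied to the concatenation of the query's inverted lists. This simple variant reads every entry and does not skip input, so it is not necessarily the variant the slides evaluate; the list contents are made up for illustration.

```python
import random
from itertools import chain


def reservoir_sample(stream, k, rng):
    """Return k items chosen uniformly from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# Hypothetical inverted lists for three query elements.
lists = {"Nic": [0, 1, 2], "ick": [0, 1], "das": [1, 3]}
stream = list(chain.from_iterable(lists.values()))
sample = reservoir_sample(stream, 3, random.Random(0))
```

Set id 1 occurs in all three lists, so it appears three times in the stream and is more likely to be drawn than set id 3. This is the duplicate-induced bias the slide refers to.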

11
Hashed Samples
12
Hashed Sample

An a priori computed sample that:
Builds uniform samples from arbitrary
combinations of inverted lists.
Does not need to store complete sets in the
sample (only set ids).
Leverages partial weight information contained in
the lists.
Eliminates the need to store distinct value
estimation synopses.
Provides unbiased estimates.
Handles updates gracefully.

13
Construction

We cannot draw independent samples per list.
Draw samples deterministically, in order to:
Leverage partial weights contained in the lists
for computing I(q, s) efficiently.
Guarantee that if a set id is sampled from one
list, it is also sampled from every other list
that contains it.
Guarantee that the union of list samples is a
uniform random sample.
We impose a random permutation on the domain of
set ids:
Use hashing and sample a consistent subset.

14
Construction 2

Randomly choose a hash function h from a family
of universal hash functions.
Assume that h hashes into [1, 100].
Values h(s1), h(s2), ... appear to be i.i.d.
(empirically).
Choose a value x and sample from every list the
sets s with h(s) < x.
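
A minimal sketch of this construction, under two assumptions that differ from the slide: hash values are normalised to [0, 1) instead of [1, 100], and a seeded BLAKE2b digest stands in for a function drawn from a universal hash family. The function names are hypothetical.

```python
import hashlib


def h(set_id, seed=0):
    """Hash a set id to a pseudo-random value in [0, 1)."""
    data = f"{seed}:{set_id}".encode()
    digest = hashlib.blake2b(data, digest_size=8).digest()
    # Keep the top 53 bits so the result is an exact float below 1.0.
    return (int.from_bytes(digest, "big") >> 11) / 2**53


def sample_lists(index, x, seed=0):
    """Keep, in every inverted list, exactly the set ids with h(s) < x."""
    return {elem: [sid for sid in ids if h(sid, seed) < x]
            for elem, ids in index.items()}


# Two hypothetical inverted lists that overlap on ids 500..999.
index = {"a": list(range(1000)), "b": list(range(500, 1500))}
sampled = sample_lists(index, 0.1)
```

Because the decision depends only on the set id, an id that survives in one list survives in every other list that contains it, and no complete sets need to be stored.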

15
Hashed Sample Properties

We get an x% sample per list on average.
We get an overall x% sample.
The union of the samples of any set of inverted
lists is an x% sample of the union of the
respective lists.
Let q = {q1, q2, ..., qn}.
|A| ≈ |A_S| · |q1 ∪ ... ∪ qn|_d / |qs1 ∪ ... ∪ qsn|_d,
where qsi is the sample of list qi and |·|_d
denotes the number of distinct set ids.
Computing A_S is simple:
Run any exact evaluation algorithm on the sampled
lists!
The performance improvement with respect to exact
evaluation is directly proportional to the size
of the sample.
We still need the distinct number of set ids in
the lists of q to scale up the results.
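
The estimator can be sketched end to end as below. Several pieces are assumptions made only for illustration: hash values are normalised to [0, 1), BLAKE2b stands in for a universal hash function, the similarity is the fraction of query elements whose list contains the set id (a simple stand-in for TF/IDF partial weights), and the distinct size of the full union is computed exactly rather than estimated.

```python
import hashlib
from collections import Counter


def h(set_id, seed=0):
    """Hash a set id to a pseudo-random value in [0, 1)."""
    data = f"{seed}:{set_id}".encode()
    digest = hashlib.blake2b(data, digest_size=8).digest()
    return (int.from_bytes(digest, "big") >> 11) / 2**53


def sample_lists(index, x, seed=0):
    """Keep, in every inverted list, exactly the set ids with h(s) < x."""
    return {e: [sid for sid in ids if h(sid, seed) < x]
            for e, ids in index.items()}


def exact_answer(lists, query, threshold):
    """Set ids matched by more than `threshold` of the query elements."""
    counts = Counter(sid for e in query for sid in lists.get(e, []))
    return {sid for sid, c in counts.items() if c / len(query) > threshold}


def estimate(index, sampled, query, threshold):
    """Scale the sample answer by the ratio of distinct union sizes."""
    a_s = exact_answer(sampled, query, threshold)   # exact algorithm on samples
    full_union = {sid for e in query for sid in index.get(e, [])}
    sample_union = {sid for e in query for sid in sampled.get(e, [])}
    if not sample_union:
        return 0.0
    return len(a_s) * len(full_union) / len(sample_union)


index = {"a": [1, 2, 3, 4], "b": [2, 3, 5], "c": [3, 6]}
query = ["a", "b", "c"]
truth = exact_answer(index, query, 0.5)
```

Computing `full_union` exactly defeats the purpose in practice, since it touches the complete lists; it is shown here only to make the scale-up factor explicit.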

16
The K-Minimum Values Synopsis

Estimating the distinct size of arbitrary list
unions:
The sampled lists themselves can be used as a KMV
synopsis, by construction.
The r-th smallest hash value h_r of a set of
elements gives an unbiased estimator of the
distinct number of elements in the set:
|S|_d ≈ P · (r − 1) / h_r
Given that the sample lists contain all elements
s.t. h(s) ≤ x, we can deduce the rank m of the
largest sampled hash value h_m ≤ x.
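
A sketch of the K-Minimum Values estimate, assuming P denotes the size of the hash range and that hash values are normalised to (0, 1), so that P = 1 and the estimate reduces to (r − 1) / h_r. The function name is hypothetical.

```python
def kmv_estimate(hash_values, r):
    """Estimate the number of distinct elements from the r-th smallest hash.

    hash_values: normalised hash values in (0, 1), duplicates allowed.
    r: rank of the hash value used by the estimator.
    """
    distinct = sorted(set(hash_values))
    if len(distinct) < r:
        # Fewer than r distinct hashes were seen, so the count is exact.
        return float(len(distinct))
    h_r = distinct[r - 1]          # r-th smallest distinct hash value
    return (r - 1) / h_r


estimate = kmv_estimate([0.125, 0.25, 0.5], 3)
```

With a hashed sample, the distinct hash values of the sampled union of the query's lists can be passed in directly, with r set to the number of distinct sampled ids, so no separate synopsis has to be stored.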

17
Experimental Evaluation
18
Setup

Datasets: IMDB, DBLP, YellowPages.
Decompose strings into 3-grams and build an
inverted index for TF/IDF similarity.
Build list samples of 1%, 5%, and 10%.
Draw queries from the data:
100 queries per workload.
Each workload contains queries of preset
selectivity.
Evaluate estimation accuracy and runtime.

19
Storing Sets vs. Storing Set Ids
20
Reservoir Sampling Accuracy
21
Reservoir Sampling Cost
22
Hashed Sampling Accuracy
23
Hashed Sampling Cost
24
Hashed Sampling Threshold
25
Hashed Sampling Answer Size
26
Hashed Sampling KMV Accuracy