CS 361A (Advanced Data Structures and Algorithms) presentation

1
CS 361A (Advanced Data Structures and Algorithms)

Lectures 16 and 17 (Nov 16 and 28, 2005)
Synopses, Samples, and Sketches
Rajeev Motwani

2
Game Plan for Week

Last Class
Models for Streaming/Massive Data Sets
Negative results for Exact Distinct Values
Hashing for Approximate Distinct Values
Today
Synopsis Data Structures
Sampling Techniques
Frequency Moments Problem
Sketching Techniques
Finding High-Frequency Items

3
Synopsis Data Structures

Synopses
Webster: a condensed statement or outline (as of a narrative or treatise)
CS 361A: succinct data structure that lets us answer queries efficiently
Synopsis Data Structures
Lossy Summary (of a data stream)
Advantages: fits in memory, easy to communicate
Disadvantage: lossiness implies approximation error
Negative Results ⇒ best we can do
Key Techniques: randomization and hashing

4
Numerical Examples

Approximate Query Processing [AQUA/Bell Labs]
Database Size: 420 MB
Synopsis Size: 420 KB (0.1%)
Approximation Error: within 10%
Running Time: 0.3% of time for exact query
Histograms/Quantiles [Chaudhuri-Motwani-Narasayya, Manku-Rajagopalan-Lindsay, Khanna-Greenwald]
Data Size: $10^9$ items
Synopsis Size: 1249 items
Approximation Error: within 1%

5
Synopses

Desiderata
Small Memory Footprint
Quick Update and Query
Provable, low-error guarantees
Composable for distributed scenario
Applicability?
General-purpose e.g. random samples
Specific-purpose e.g. distinct values estimator
Granularity?
Per database e.g. sample of entire table
Per distinct value e.g. customer profiles
Structural e.g. GROUP-BY or JOIN result samples

6
Examples of Synopses

Synopses need not be fancy!
Simple Aggregates e.g. mean/median/max/min
Variance?
Random Samples
Aggregates on small samples represent entire data
Leverage extensive work on confidence intervals
Random Sketches
structured samples
Tracking High-Frequency Items

7
Random Samples
8
Types of Samples

Oblivious sampling at item level
Limitations [Bar-Yossef-Kumar-Sivakumar STOC 01]
Value-based sampling, e.g. distinct-value samples
Structured samples, e.g. join sampling
Naïve approach: keep samples of each relation
Problem: sample-of-join ≠ join-of-samples
Foreign-Key Join [Chaudhuri-Motwani-Narasayya SIGMOD 99]

[Figure: relations L and R with join-attribute values A, A, B, B and A, B. What if A is sampled from L and B from R?]
9
Basic Scenario

Goal: maintain uniform sample of item-stream
Sampling Semantics?
Coin flip
select each item with probability p
easy to maintain
undesirable: sample size is unbounded
Fixed-size sample without replacement
Our focus today
Fixed-size sample with replacement
Show: can generate from previous sample
Non-Uniform Samples [Chaudhuri-Motwani-Narasayya]

10
Reservoir Sampling [Vitter]

Input: stream of items $X_1, X_2, X_3, \ldots$
Goal: maintain uniform random sample S of size n (without replacement) of stream so far
Reservoir Sampling
Initialize: include first n elements in S
Upon seeing item $X_t$
Add $X_t$ to S with probability n/t
If added, evict random previous item
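The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's code: the function name `reservoir_sample` and the optional `rng` argument are assumptions made for the example.

```python
import random

def reservoir_sample(stream, n, rng=None):
    """Return a size-n sample (without replacement) of the stream seen so far."""
    rng = rng or random.Random()
    sample = []
    for t, item in enumerate(stream, start=1):
        if t <= n:
            sample.append(item)               # initialize: include first n elements
        elif rng.random() < n / t:            # add X_t with probability n/t
            sample[rng.randrange(n)] = item   # evict a random previous item
    return sample
```

For example, `reservoir_sample(range(1000), 10)` returns 10 distinct values from 0..999; a stream shorter than n is returned whole.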

11
Analysis

Correctness?
Fact: At each instant, $|S| = n$
Theorem: At time t, any $X_i \in S$ with probability n/t
Exercise: prove via induction on t
Efficiency?
Let N be stream size
Remark: Verify this is optimal.
Naïve implementation ⇒ N coin flips ⇒ time O(N)
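One possible route for the induction exercise, offered as a sketch rather than the lecture's own proof: write $S_t$ for the sample after time t (notation introduced here) and assume the claim holds at time t. An item already in the sample survives step t+1 unless $X_{t+1}$ is added and that particular slot is the one evicted.

```latex
\Pr[X_i \in S_{t+1}]
  = \Pr[X_i \in S_t]\cdot\Bigl(1 - \frac{n}{t+1}\cdot\frac{1}{n}\Bigr)
  = \frac{n}{t}\cdot\frac{t}{t+1}
  = \frac{n}{t+1}
  \qquad (i \le t),
\qquad
\Pr[X_{t+1} \in S_{t+1}] = \frac{n}{t+1}.
```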

12
Improving Efficiency
[Figure: stream $X_1, X_2, \ldots, X_{14}$ with the items inserted into sample S marked (where n = 3); skips $J_3 = 2$ and $J_9 = 4$ are indicated.]

Random variable $J_t$: number jumped over after time t
Idea: generate $J_t$ and skip that many items
Cumulative Distribution Function: $F(s) = P[J_t \le s]$, for $t > n$ and $s \ge 0$

13
Analysis

Number of calls to RANDOM()?
one per insertion into sample
this is optimal!
Generating $J_t$?
Pick random number $U \in [0,1]$
Find smallest j such that $U \le F(j)$
How?
Linear scan ⇒ O(N) time
Binary search with Newton's interpolation ⇒ $O(n^2 (1 + \mathrm{polylog}(N/n)))$ time
Remark: see paper for optimal algorithm
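A hedged sketch of the skip idea using the linear scan only (not the optimal algorithm from the paper). It assumes the tail form $P[J_t > s] = \prod_{i=1}^{s+1} (t+i-n)/(t+i)$, which follows from the n/t insertion rule; the names `skip_length` and `reservoir_sample_skipping` are illustrative. This version also spends a second random draw per insertion to pick the evicted slot.

```python
import itertools
import random

def skip_length(t, n, rng):
    """Draw J_t by inverse transform with a linear scan (t >= n).

    Uses P[J_t > s] = prod_{i=1..s+1} (t + i - n) / (t + i).
    """
    v = 1.0 - rng.random()              # uniform on (0, 1]; one draw per skip
    j = 0
    tail = (t + 1 - n) / (t + 1)        # P[J_t > 0]
    while tail > v:                     # smallest j with P[J_t > j] <= v
        j += 1
        tail *= (t + j + 1 - n) / (t + j + 1)
    return j

def reservoir_sample_skipping(stream, n, rng=None):
    """Reservoir sampling that jumps over J_t items instead of flipping a coin per item."""
    rng = rng or random.Random()
    it = iter(stream)
    sample = list(itertools.islice(it, n))
    t = len(sample)
    if t < n:
        return sample                   # stream shorter than the sample size
    while True:
        j = skip_length(t, n, rng)
        nxt = list(itertools.islice(it, j, j + 1))   # drop j items, take the next one
        if not nxt:
            return sample               # stream ended during the skip
        t += j + 1
        sample[rng.randrange(n)] = nxt[0]            # evict a random previous item
```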

14
Sampling over Sliding Windows [Babcock-Datar-Motwani]

Sliding Window W: last w items in stream
Model: item $X_t$ expires at time t + w
Why?
Applications may require ignoring stale data
Type of approximation
Only way to define JOIN over streams
Goal: Maintain uniform sample of size n of sliding window

15
Reservoir Sampling?

Observe
any item in sample S will expire eventually
must replace with random item of current window
Problem
no access to items in W − S
storing entire window requires O(w) memory
Oversampling
Backing sample B: select each item with a fixed probability
sample S: select n items from B at random
upon expiry in S ⇒ replenish from B
Claim: $n < |B| < n \log w$ with high probability

16
Index-Set Approach

Pick random index set $I = \{i_1, \ldots, i_n\} \subseteq \{0, 1, \ldots, w-1\}$
Sample S: items $X_i$ with $i \in \{i_1, \ldots, i_n\} \pmod w$ in current window
Example
Suppose w = 2, n = 1, and I = {1}
Then sample is always $X_i$ with odd i
Memory: only O(n)
Observe
S is uniform random sample of each window
But sample is periodic (union of arithmetic progressions)
Correlation across successive windows
Problems
Correlation may hurt in some applications
Some data (e.g. time-series) may be periodic

17
Chain-Sample Algorithm

Idea
Fix expiry problem in Reservoir Sampling
Advance planning for expiry of sampled items
Focus on sample size 1; keep n independent such samples
Chain-Sampling
Add $X_t$ to S with probability 1/min(t, w); evict earlier sample
Initially: standard Reservoir Sampling up to time w
Pre-select $X_t$'s replacement $X_r \in W_{t+w} = \{X_{t+1}, \ldots, X_{t+w}\}$
$X_t$ expires ⇒ must replace from $W_{t+w}$
At time r, save $X_r$ and pre-select its own replacement ⇒ building chain of potential replacements
Note: if evicting earlier sample, discard its chain as well
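The bookkeeping for a single sample (n = 1) might look like the sketch below. It transcribes the steps listed above literally and makes no claim beyond them; the class name `ChainSample`, its `add`/`sample` methods, and the deque layout are assumptions for illustration.

```python
import random
from collections import deque

class ChainSample:
    """Keep one sample from the last w items, with a chain of pre-selected replacements."""

    def __init__(self, w, rng=None):
        self.w = w
        self.rng = rng or random.Random()
        self.t = 0
        self.chain = deque()      # (index, item) pairs; chain[0] is the current sample
        self.next_index = None    # pre-selected index of the replacement for chain[-1]

    def add(self, item):
        self.t += 1
        t, w = self.t, self.w
        if self.rng.random() < 1.0 / min(t, w):
            self.chain.clear()                        # evict earlier sample and its chain
            self.chain.append((t, item))
            self.next_index = self.rng.randint(t + 1, t + w)
        elif t == self.next_index:
            self.chain.append((t, item))              # save the pre-selected replacement
            self.next_index = self.rng.randint(t + 1, t + w)
        if self.chain[0][0] <= t - w:
            self.chain.popleft()                      # sample expired; next in chain takes over

    def sample(self):
        return self.chain[0][1]
```

The replacement for an item at index i is chosen in i+1..i+w, so it has always arrived (and been stored) by the time the item expires at time i+w; that is why the chain is never empty after the expiry step.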

18
Example
[Figure: chain-sample example on the stream 3 5 1 4 6 2 8 5 2 3 5 4 2 2 5 0 9 8 4 6 7 3]
19
Expectation for Chain-Sample

$T(x) = E[\text{chain length for } X_t \text{ at time } t+x]$
$E[\text{chain length}] = T(w) \le e \approx 2.718$
$E[\text{memory required for sample size } n] = O(n)$
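The recurrence behind this bound is not preserved in the transcript; the following is one reconstruction, under the assumption that each replacement index is uniform over the next w positions and that the chain length counts $X_t$ itself. A replacement at distance i contributes its own chain of expected length $T(x-i)$.

```latex
T(0) = 1, \qquad
T(x) = 1 + \frac{1}{w}\sum_{j=0}^{x-1} T(j) \quad (1 \le x \le w)
\;\Longrightarrow\;
T(x+1) - T(x) = \frac{T(x)}{w}
\;\Longrightarrow\;
T(x) = \Bigl(1 + \frac{1}{w}\Bigr)^{x},
\qquad
T(w) = \Bigl(1 + \frac{1}{w}\Bigr)^{w} \le e .
```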

20
Tail Bound for Chain-Sample

Chain = hops of total length at most w
Chain of h hops ⇒ ordered (h+1)-partition of w
h hops of total length less than w
plus, remainder
Each partition has probability $w^{-h}$
Number of partitions
h = O(log w) ⇒ probability of a partition is $O(w^{-c})$
Thus memory O(n log w) with high probability

21
Comparison of Algorithms
Algorithm       Expected      High-Probability
Periodic        O(n)          O(n)
Oversample      O(n log w)    O(n log w)
Chain-Sample    O(n)          O(n log w)

Chain-Sample beats Oversample
Expected memory O(n) vs O(n log w)
High-probability memory bound both O(n log w)
Oversample may have sample size shrink below n!

22
Sketches and Frequency Moments
23
Generalized Stream Model

Input Element (i,a)
a copies of domain-value i
increment to ith dimension of m by a
a need not be an integer
Negative value captures deletions

24
Example
[Figure: bar chart of the frequency vector $(m_0, m_1, m_2, m_3, m_4)$, with bars labeled 4, 1, 1, 1, 1, on seeing element (i, a) = (1, −1).]
25
Frequency Moments

Input Stream
values from $U = \{0, 1, \ldots, N-1\}$
frequency vector $m = (m_0, m_1, \ldots, m_{N-1})$
kth Frequency Moment: $F_k(m) = \sum_i m_i^k$
$F_0$: number of distinct values (Lecture 15)
$F_1$: stream size
$F_2$: Gini index, self-join size, Euclidean norm
$F_k$ for k > 2: measures skew, sometimes useful
$F_\infty$: maximum frequency
Problem: estimation in small space
Sketches: randomized estimators

26
Naive Approaches

Space N: counter $m_i$ for each distinct value i
Space O(1)
if input sorted by i
single counter recycled when new i value appears
Goal
Allow arbitrary input
Use small (logarithmic) space
Settle for randomization/approximation

27
Sketching F2

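Random Hash $h : \{0, 1, \ldots, N-1\} \to \{-1, +1\}$
Define $Z_i = h(i)$
Maintain $X = \sum_i m_i Z_i$
Easy for update streams (i, a): just add $a Z_i$ to X
Claim: $X^2$ is unbiased estimator for $F_2$
Proof:
$E[X^2] = E[(\sum_i m_i Z_i)^2]$
$= E[\sum_i m_i^2 Z_i^2] + E[\sum_{i \ne j} m_i m_j Z_i Z_j]$
$= \sum_i m_i^2 E[Z_i^2] + \sum_{i \ne j} m_i m_j E[Z_i] E[Z_j]$ (from independence)
$= \sum_i m_i^2 + 0 = F_2$
Last Line? $Z_i^2 = 1$ and $E[Z_i] = 0$, as $Z_i$ is uniform on $\{-1, +1\}$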
28
Estimation Error?

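Chebyshev bound
Define $Y = X^2$ ⇒ $E[Y] = E[X^2] = \sum_i m_i^2 = F_2$
Observe (sums over distinct indices):
$E[X^4] = E[(\sum_i m_i Z_i)^4]$
$= E[\sum m_i^4 Z_i^4] + 4E[\sum m_i m_j^3 Z_i Z_j^3] + 6E[\sum m_i^2 m_j^2 Z_i^2 Z_j^2] + 12E[\sum m_i m_j m_k^2 Z_i Z_j Z_k^2] + 24E[\sum m_i m_j m_k m_l Z_i Z_j Z_k Z_l]$
$= \sum_i m_i^4 + 6\sum_{i<j} m_i^2 m_j^2$
Why? Every term containing an odd power of some $Z_i$ has expectation 0.
By definition:
$Var[Y] = E[Y^2] - E[Y]^2 = E[X^4] - E[X^2]^2$
$= [\sum_i m_i^4 + 6\sum_{i<j} m_i^2 m_j^2] - [\sum_i m_i^4 + 2\sum_{i<j} m_i^2 m_j^2]$
$= 4\sum_{i<j} m_i^2 m_j^2$
$\le 2E[X^2]^2 = 2F_2^2$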
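A small numeric check of the estimator and its variance, as a sketch: instead of a hash function it enumerates every sign vector for a three-value domain, so the averages are exact expectations under fully independent signs. The function name `tug_of_war` and the toy update stream are assumptions for illustration.

```python
import itertools

def tug_of_war(updates, signs):
    """X = sum_i m_i * Z_i, maintained one update (i, a) at a time."""
    x = 0
    for i, a in updates:
        x += a * signs[i]          # just add a * Z_i
    return x

# Update stream with one deletion; the final frequency vector is m = (3, 1, 1).
updates = [(0, 3), (1, 2), (2, 1), (1, -1)]
f2 = 3**2 + 1**2 + 1**2

# Averaging over all 2^3 sign vectors gives the exact expectations.
all_signs = list(itertools.product([-1, 1], repeat=3))
xs = [tug_of_war(updates, z) for z in all_signs]
mean_x2 = sum(x**2 for x in xs) / len(xs)             # E[X^2]
var_x2 = sum(x**4 for x in xs) / len(xs) - mean_x2**2  # Var[X^2]

print(mean_x2, f2)          # E[X^2] equals F_2
print(var_x2, 2 * f2**2)    # Var[X^2] is at most 2 * F_2^2
```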
29
Estimation Error?

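Chebyshev bound
$P[\text{relative estimation error} > \lambda] \le Var[Y] / (\lambda^2 E[Y]^2) \le 2/\lambda^2$
Problem: What if we want λ really small?
Solution
Compute $s = 8/\lambda^2$ independent copies of X
Estimator $Y = \mathrm{mean}(X_i^2)$
Variance reduces by factor s
$P[\text{relative estimation error} > \lambda] \le 2F_2^2 / (s \lambda^2 F_2^2) = 1/4$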

30
Boosting Technique

Algorithm A: Randomized λ-approximate estimator $\hat f$ of f
$P[(1-\lambda) f \le \hat f \le (1+\lambda) f] \ge 3/4$
Heavy Tail Problem: $P[f - z,\ f,\ f + z] = [1/16,\ 3/4,\ 3/16]$
Boosting Idea
O(log 1/ε) independent estimates from A(X)
Return median of estimates
Claim: P[median is λ-approximate] > 1 − ε
Proof
P[specific estimate is λ-approximate] ≥ 3/4
Bad event only if > 50% of estimates not λ-approximate
Binomial tail probability less than ε
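As a sketch of how averaging and the median combine, the helper below groups raw estimates into blocks of s, averages each block, and returns the median of the block means. The name `median_of_means` and the toy numbers are illustrative assumptions.

```python
import statistics

def median_of_means(estimates, s):
    """Average each block of s raw estimates, then take the median of the averages."""
    means = [statistics.mean(estimates[g:g + s])
             for g in range(0, len(estimates) - s + 1, s)]
    return statistics.median(means)

# One wild estimate (1000) among values near 10: the plain mean is dragged to 120,
# while the median of the three block means stays at 10.
raw = [10, 10, 10, 1000, 10, 10, 10, 10, 10]
print(statistics.mean(raw), median_of_means(raw, 3))
```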

31
Overall Space Requirement

Observe
Let $m = \sum_i m_i$
Each hash needs O(log m)-bit counter
$s = 8/\lambda^2$ hash functions for each estimator
O(log 1/ε) such estimators
Total: $O(\lambda^{-2} \log(1/\varepsilon) \log m)$ bits
Question: Space for storing hash function?

32
Sketching Paradigm

Random Sketch: inner product
frequency vector $m = (m_0, m_1, \ldots, m_{N-1})$
random vector Z (currently, uniform on $\{-1, +1\}$)
Observe
Linearity ⇒ Sketch(m1) ± Sketch(m2) = Sketch(m1 ± m2)
Ideal for distributed computing
Observe
Suppose: Given i, can efficiently generate $Z_i$
Then: can maintain sketch for update streams
Problem
Must generate $Z_i = h(i)$ on first appearance of i
Need O(N) memory to store h explicitly
Need O(N) random bits

33
Two birds, One stone

Pairwise Independent $Z_1, Z_2, \ldots, Z_n$
for all $Z_i$ and $Z_k$: $P[Z_i = x, Z_k = y] = P[Z_i = x] \cdot P[Z_k = y]$
property: $E[Z_i Z_k] = E[Z_i] \cdot E[Z_k]$
Example: linear hash function
Seed $S = \langle a, b \rangle$ from [0..p−1], where p is prime
$Z_i = h(i) = a i + b \pmod p$
Claim: $Z_1, Z_2, \ldots, Z_n$ are pairwise independent
$Z_i = x$ and $Z_k = y$ ⇔ $x = a i + b \pmod p$ and $y = a k + b \pmod p$
fixing i, k, x, y ⇒ unique solution for a, b
$P[Z_i = x, Z_k = y] = 1/p^2 = P[Z_i = x] \cdot P[Z_k = y]$
Memory/Randomness: n log p → 2 log p
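The uniqueness argument can be checked exhaustively for a small prime. The sketch below (with an assumed helper name `linear_hash` and p = 5 chosen only to keep the enumeration tiny) counts, over all $p^2$ seeds, how often each value pair occurs at two fixed distinct points.

```python
import itertools
from collections import Counter

def linear_hash(a, b, p):
    """h(i) = a*i + b (mod p) for seed <a, b>."""
    return lambda i: (a * i + b) % p

p = 5
i, k = 1, 3    # any two distinct points in 0..p-1

# Tabulate (h(i), h(k)) over every seed <a, b>.
joint = Counter()
for a, b in itertools.product(range(p), repeat=2):
    h = linear_hash(a, b, p)
    joint[(h(i), h(k))] += 1

# Every pair (x, y) arises from exactly one seed, so P[Z_i = x, Z_k = y] = 1/p^2.
print(len(joint), set(joint.values()))
```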

34
Wait a minute!

Doesn't pairwise independence screw up proofs?
No: $E[X^2]$ calculation only has degree-2 terms
But what about $Var[X^2]$?
Need 4-wise independence

35
Application Join-Size Estimation

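Given
Join attribute frequencies $f_1$ and $f_2$
Join size = $f_1 \cdot f_2$
Define $X_1 = f_1 \cdot Z$ and $X_2 = f_2 \cdot Z$
Choose Z as 4-wise independent and uniform on $\{-1, +1\}$
Exercise: Show, as before,
$E[X_1 X_2] = f_1 \cdot f_2$
$Var[X_1 X_2] \le 2 (\|f_1\| \cdot \|f_2\|)^2$
Hint: $|a \cdot b| \le \|a\| \cdot \|b\|$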
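The two claims in the exercise can be sanity-checked on a toy instance. This sketch enumerates all sign vectors (full independence, which in particular is 4-wise independent) rather than using a hash family; the frequency vectors and the helper `dot` are made-up illustrations.

```python
import itertools

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

# Hypothetical join-attribute frequency vectors over a 4-value domain.
f1 = [2, 0, 1, 3]
f2 = [1, 4, 0, 2]
join_size = dot(f1, f2)

# X1 * X2 for every sign vector Z; averaging gives exact expectations.
products = [dot(f1, z) * dot(f2, z)
            for z in itertools.product([-1, 1], repeat=4)]
mean = sum(products) / len(products)
var = sum(x * x for x in products) / len(products) - mean ** 2
bound = 2 * dot(f1, f1) * dot(f2, f2)     # 2 * ||f1||^2 * ||f2||^2

print(mean, join_size)   # E[X1 X2] equals the join size
print(var, bound)        # the variance stays below the bound
```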

36
Bounding Error Probability

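Using s copies of X's and taking their mean Y:
$\Pr[|Y - f_1 \cdot f_2| \ge \lambda f_1 \cdot f_2] \le Var(Y) / (\lambda^2 (f_1 \cdot f_2)^2)$
$\le 2 \|f_1\|^2 \|f_2\|^2 / (s \lambda^2 (f_1 \cdot f_2)^2)$
$= 2 / (s \lambda^2 \cos^2\theta)$
Bounding error probability?
Need $s > 2 / (\lambda^2 \cos^2\theta)$
Memory? $O(\log(1/\varepsilon) \cos^{-2}\theta \, \lambda^{-2} (\log N + \log m))$
Problem
To choose s, need a-priori lower bound on $\cos\theta = f_1 \cdot f_2 / (\|f_1\| \|f_2\|)$
What if cos θ really small?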

37
Sketch Partitioning
Idea for dealing with the $\|f_1\|^2 \|f_2\|^2 / (f_1 \cdot f_2)^2$ issue: partition domain into regions where self-join size is smaller, to compensate for small join-size (cos θ)
self-join(R1.A) × self-join(R2.B) = 205 × 205 ≈ 42K
self-join(R1.A) × self-join(R2.B) + self-join(R1.A) × self-join(R2.B) = 200 × 5 + 200 × 5 = 2K
38
Sketch Partitioning

Idea
intelligently partition join-attribute space
need coarse statistics on stream
build independent sketches for each partition
Estimate = Σ partition sketches
Variance = Σ partition variances

39
Sketch Partitioning

Partition Space Allocation?
Can solve optimally, given domain partition
Optimal Partition Find K-partition to minimize
Results
Dynamic Programming optimal solution for single
join
NP-hard for queries with multiple joins

40
$F_k$ for k > 2

Assume stream length m is known (Exercise: Show can fix with log m space overhead by repeated-doubling estimate of m.)
Choose random stream item $a_p$ ⇒ p uniform from {1, 2, ..., m}
Suppose $a_p = v \in \{0, 1, \ldots, N-1\}$
Count subsequent frequency of v:
$r = |\{q : q \ge p, a_q = v\}|$
Define $X = m(r^k - (r-1)^k)$

41
Example

Stream: 7, 8, 5, 1, 7, 5, 2, 1, 5, 4, 5, 10, 6, 5, 4, 1, 4, 7, 3, 8
m = 20
p = 9
$a_p$ = 5
r = 3
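The example can be replayed in code. This sketch takes the position p as an argument instead of drawing it at random, so the single estimate and its average over all m positions can both be inspected; the function name `fk_estimate` is an assumption.

```python
def fk_estimate(stream, p, k):
    """X = m * (r^k - (r-1)^k), where r counts occurrences of a_p from position p on.

    Positions are 1-indexed, matching the slides.
    """
    m = len(stream)
    v = stream[p - 1]
    r = sum(1 for q in range(p - 1, m) if stream[q] == v)
    return m * (r ** k - (r - 1) ** k)

stream = [7, 8, 5, 1, 7, 5, 2, 1, 5, 4, 5, 10, 6, 5, 4, 1, 4, 7, 3, 8]
m = len(stream)

# p = 9 picks the value 5, which occurs r = 3 times from position 9 onward.
x = fk_estimate(stream, 9, 2)     # 20 * (3^2 - 2^2) = 100

# Averaging X over all m positions telescopes to F_k exactly.
avg_f2 = sum(fk_estimate(stream, p, 2) for p in range(1, m + 1)) / m
exact_f2 = sum(stream.count(v) ** 2 for v in set(stream))
print(x, avg_f2, exact_f2)
```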

42
$F_k$ for k > 2

$E[X] = F_k$ (summing over m choices of stream elements)
$Var(X) \le k N^{1-1/k} F_k^2$
Bounded Error Probability ⇒ $s = O(k N^{1-1/k} / \lambda^2)$
Boosting ⇒ memory bound $O(k N^{1-1/k} \lambda^{-2} (\log 1/\varepsilon)(\log N + \log m))$
43
Frequency Moments

$F_0$: distinct values problem (Lecture 15)
$F_1$: sequence length
for case with deletions, use Cauchy distribution
$F_2$: self-join size/Gini index (Today)
$F_k$ for k > 2
omitting grungy details
can achieve space bound $O(k N^{1-1/k} \lambda^{-2} (\log 1/\varepsilon)(\log N + \log m))$
$F_\infty$: maximum frequency

44
Communication Complexity

Cooperatively compute function f(A, B)
Minimize bits communicated
Unbounded computational power
Communication Complexity C(f): bits exchanged by optimal protocol Π
Protocols?
1-way versus 2-way
deterministic versus randomized
$C_\delta(f)$: randomized complexity for error probability δ

[Figure: ALICE holds input A; BOB holds input B]
45
Streaming Communication Complexity

Stream Algorithm ⇒ 1-way communication protocol
Simulation Argument
Given algorithm S computing f over streams
Alice initiates S, providing A as input stream prefix
Communicates to Bob S's state after seeing A
Bob resumes S, providing B as input stream suffix
Theorem: Stream algorithm's space requirement is at least the communication complexity C(f)

46
Example Set Disjointness

Set Disjointness (DIS)
A, B subsets of {1, 2, ..., N}
Output: whether A ∩ B = ∅
Theorem: $C_\delta(\mathrm{DIS}) = \Omega(N)$, for any δ < 1/2

47
Lower Bound for $F_\infty$

Theorem: Fix ε < 1/3, δ < 1/2. Any stream algorithm S with
$P[(1-\varepsilon) F_\infty < S < (1+\varepsilon) F_\infty] > 1 - \delta$
needs Ω(N) space
Proof
Claim: S ⇒ 1-way protocol for DIS (on any sets A and B)
Alice streams set A to S
Communicates S's state to Bob
Bob streams set B to S
Observe
$F_\infty = 2$ if A and B intersect, and $F_\infty = 1$ if they are disjoint
Relative Error ε < 1/3 ⇒ DIS solved exactly!
P[error] < δ < 1/2 ⇒ Ω(N) space

48
Extensions

Observe
Used only 1-way communication in proof
$C_\delta(\mathrm{DIS})$ bound was for arbitrary communication
Exercise: extend lower bound to multi-pass algorithms
Lower Bound for $F_k$, k > 2
Need to increase gap beyond 2
Multiparty Set Disjointness: t players
Theorem: Fix ε, δ < 1/2 and k > 5. Any stream algorithm S with
$P[(1-\varepsilon) F_k < S < (1+\varepsilon) F_k] > 1 - \delta$
needs $\Omega(N^{1-(2+\delta)/k})$ space
Implies $\Omega(N^{1/2})$ even for multi-pass algorithms

49
Tracking High-Frequency Items
50
Problem 1: Top-K List [Charikar-Chen-Farach-Colton]

The Google Problem
Return list of k most frequent items in stream
Motivation
search engine queries, network traffic, ...
Remember
Saw lower bound recently!
Solution
Data structure Count-Sketch ⇒ maintaining count-estimates of high-frequency elements

51
Definitions

Notation
Assume {1, 2, ..., N} in order of frequency
$m_i$ is frequency of ith most frequent element
$m = \sum_i m_i$ is number of elements in stream
FindCandidateTop
Input: stream S, int k, int p
Output: list of p elements containing top k
Naive sampling gives solution with $p = \Theta(m \log k / m_k)$
FindApproxTop
Input: stream S, int k, real ε
Output: list of k elements, each of frequency $m_i > (1-\varepsilon) m_k$
Naive sampling gives no solution

52
Main Idea

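Consider
single counter X
hash function $h : \{1, 2, \ldots, N\} \to \{-1, +1\}$
Input element i ⇒ update counter X += $Z_i$, where $Z_i = h(i)$
For each r, use $X Z_r$ as estimator of $m_r$
Theorem: $E[X Z_r] = m_r$

Proof
$X = \sum_i m_i Z_i$
$E[X Z_r] = E[\sum_i m_i Z_i Z_r] = \sum_{i \ne r} m_i E[Z_i Z_r] + m_r E[Z_r^2] = m_r$
Cross-terms cancel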

53
Finding Max Frequency Element

Problem: $Var[X] = F_2 = \sum_i m_i^2$
Idea: t counters, independent 4-wise hashes $h_1, \ldots, h_t$
Use $t = O(\log m \cdot \sum_i m_i^2 / (\varepsilon m_1)^2)$
Claim: New Variance $< \sum_i m_i^2 / t = (\varepsilon m_1)^2 / \log m$
Overall Estimator
repeat + median of averages
with high probability, approximate $m_1$

$h_1 : i \to \{+1, -1\}$
...
$h_t : i \to \{+1, -1\}$
54
Problem with Array of Counters

Variance dominated by highest frequency
Estimates for less-frequent elements like k
corrupted by higher frequencies
variance >> $m_k$
Avoiding Collisions?
spread out high frequency elements
replace each counter with hashtable of b counters

55
Count Sketch

Hash Functions
4-wise independent hashes $h_1, \ldots, h_t$ and $s_1, \ldots, s_t$
hashes independent of each other
Data structure: hashtables of counters X(r, c)

[Figure: t hashtables of counters, with columns labeled 1, 2, ..., b]
56
Overall Algorithm

$s_r(i)$: one of b counters in rth hashtable
Input i ⇒ for each r, update X(r, $s_r(i)$) += $h_r(i)$
Estimator($m_i$) = $\mathrm{median}_r \{ X(r, s_r(i)) \cdot h_r(i) \}$
Maintain heap of k top elements seen so far
Observe
Not completely eliminated collision with high frequency items
Few of estimates $X(r, s_r(i)) \cdot h_r(i)$ could have high variance
Median not sensitive to these poor estimates
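A compact sketch of the update and estimate steps above, leaving out the top-k heap. Several choices here are assumptions, not details from the lecture: items are non-negative integers, the hashes are random degree-3 polynomials modulo the prime $2^{31}-1$ (a standard 4-wise independent family), the sign comes from the low bit of the hash value, and the bucket from its remainder mod b, both of which are only approximately uniform.

```python
import random
import statistics

class CountSketch:
    P = 2_147_483_647   # prime modulus 2^31 - 1 (an assumed choice)

    def __init__(self, t, b, rng=None):
        rng = rng or random.Random()
        self.t, self.b = t, b
        self.table = [[0] * b for _ in range(t)]       # counters X(r, c)
        # Four random coefficients per row define a degree-3 polynomial hash.
        self.h = [[rng.randrange(self.P) for _ in range(4)] for _ in range(t)]
        self.s = [[rng.randrange(self.P) for _ in range(4)] for _ in range(t)]

    def _poly(self, coeffs, i):
        acc = 0
        for c in coeffs:                               # Horner evaluation mod P
            acc = (acc * i + c) % self.P
        return acc

    def _sign(self, r, i):                             # h_r(i) in {-1, +1}
        return 1 if self._poly(self.h[r], i) & 1 else -1

    def _bucket(self, r, i):                           # s_r(i) in 0..b-1
        return self._poly(self.s[r], i) % self.b

    def update(self, i, a=1):
        for r in range(self.t):
            self.table[r][self._bucket(r, i)] += a * self._sign(r, i)

    def estimate(self, i):
        return statistics.median(
            self.table[r][self._bucket(r, i)] * self._sign(r, i)
            for r in range(self.t))
```

With no colliding items the estimate is exact; with collisions, each row's error is the signed sum of the frequencies that share the bucket, and the median discards the worst rows.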

57
Avoiding Large Items

b = Ω(k) ⇒ with probability Ω(1), no collision with top-k elements
t hashtables represent independent trials
Need log(m/δ) trials to estimate with probability 1 − δ
Also need small variance for colliding small elements
Claim:
$P[\text{variance due to small items in each estimate} < (\sum_{i>k} m_i^2)/b] = \Omega(1)$
Final bound: $b = O(k + \sum_{i>k} m_i^2 / (\varepsilon m_k)^2)$

58
Final Results

Zipfian Distribution: $m_i \propto 1/i^{\alpha}$ [Power Law]
FindApproxTop
$[k + (\sum_{i>k} m_i^2) / (\varepsilon m_k)^2] \log(m/\delta)$
Roughly: sampling bound with frequencies squared
Zipfian: gives improved results
FindCandidateTop
Zipf parameter 0.5
O(k log N log m)
Compare: sampling bound $O((kN)^{0.5} \log k)$

59
Problem 2: Elephants-and-Ants [Manku-Motwani]

[Figure: stream]

Identify items whose current frequency exceeds support threshold s = 0.1%.
[Jacobson 2000, Estan-Verghese 2001]

60
Algorithm 1: Lossy Counting
Step 1: Divide the stream into windows
Window-size w is function of support s; specify later
61
Lossy Counting in Action ...
[Figure: counters are initially empty]
62
Lossy Counting (continued)
63
Error Analysis
How much do we undercount?
If current size of stream = N and window-size w = 1/ε, then #windows = εN ⇒ frequency error ≤ εN
Rule of thumb: Set ε = 10% of support s
Example: Given support frequency s = 1%, set error frequency ε = 0.1%
64
Putting it all together
Output: Elements with counter values exceeding (s − ε)N
Approximation guarantees:
Frequencies underestimated by at most εN
No false negatives
False positives have true frequency at least (s − ε)N

How many counters do we need?
Worst case bound: (1/ε) log(εN) counters
Implementation details
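A minimal sketch of Lossy Counting as summarized on these slides, assuming the per-window step is "decrement every counter by one and drop counters that reach zero" (the window-by-window figures are not preserved in the transcript). The function name `lossy_counting` and its return shape are illustrative.

```python
import math

def lossy_counting(stream, s, eps):
    """Return (items reported as frequent, all surviving counters)."""
    w = math.ceil(1 / eps)                  # window size w = 1/eps
    counts = {}
    n = 0
    for x in stream:
        n += 1
        counts[x] = counts.get(x, 0) + 1
        if n % w == 0:                      # window boundary
            for key in list(counts):
                counts[key] -= 1            # each counter loses one per window
                if counts[key] == 0:
                    del counts[key]
    threshold = (s - eps) * n
    frequent = {x: c for x, c in counts.items() if c > threshold}
    return frequent, counts

# Hypothetical stream of N = 100 items: 'a' x30, 'b' x15, then 55 singletons.
stream = ['a'] * 30 + ['b'] * 15 + ['x%d' % j for j in range(55)]
frequent, counts = lossy_counting(stream, s=0.2, eps=0.1)
print(frequent)   # only 'a' clears the (s - eps) * N threshold
```

Each counter loses at most one per window, so after N items no count is below the true frequency by more than εN.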

65
Number of Counters?

Window size w = 1/ε
Number of windows m = εN
$n_i$ = # counters alive over last i windows
Fact
Claim
Counter must average 1 increment/window to
survive
active counters

66
Enhancements
Frequency Errors: For counter (X, c), true frequency in [c, c + εN]
Trick: Track number of windows t counter has been active
For counter (X, c, t), true frequency in [c, c + t − 1]
If (t = 1), no error!
Batch Processing: Decrements after k windows
67
Algorithm 2: Sticky Sampling
Create counters by sampling
Maintain exact counts thereafter
What is sampling rate?
68
Sticky Sampling (continued)
For finite stream of length N: Sampling rate = (2/(εN)) log(1/(δs))
δ = probability of failure
Output: Elements with counter values exceeding (s − ε)N
Same Rule of thumb: Set ε = 10% of support s
Example: Given support threshold s = 1%, set error threshold ε = 0.1%, set failure probability δ = 0.01%
69
Number of counters?
Finite stream of length N: Sampling rate = (2/(εN)) log(1/(δs))
Infinite stream with unknown N: Gradually adjust sampling rate
In either case, Expected number of counters = (2/ε) log(1/(δs))
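For the finite-stream case, the idea can be sketched as below. Assumptions made for the example: the logarithm is natural, the rate is capped at 1, and the names `sticky_sampling`, `delta`, and `rng` are illustrative; the gradually adjusted rate for unknown N is not covered.

```python
import math
import random

def sticky_sampling(stream, s, eps, delta, rng=None):
    """Sticky Sampling for a finite stream of known length N = len(stream)."""
    rng = rng or random.Random()
    n = len(stream)
    rate = min(1.0, (2 / (eps * n)) * math.log(1 / (delta * s)))
    counts = {}
    for x in stream:
        if x in counts:
            counts[x] += 1            # maintain exact counts thereafter
        elif rng.random() < rate:
            counts[x] = 1             # create counter by sampling
    threshold = (s - eps) * n
    return {x: c for x, c in counts.items() if c > threshold}
```

An item's counter can only miss the occurrences that arrived before it was first sampled, so reported counts never exceed the true frequencies.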
70
References Synopses

Synopsis data structures for massive data sets.
Gibbons and Matias, DIMACS 1999.
Tracking Join and Self-Join Sizes in Limited
Storage, Alon, Gibbons, Matias, and Szegedy. PODS
1999.
Join Synopses for Approximate Query Answering,
Acharya, Gibbons, Poosala, and Ramaswamy. SIGMOD
1999.
Random Sampling for Histogram Construction: How much is enough? Chaudhuri, Motwani, and Narasayya. SIGMOD 1998.
Random Sampling Techniques for Space Efficient
Online Computation of Order Statistics of Large
Datasets, Manku, Rajagopalan, and Lindsay. SIGMOD
1999.
Space-efficient online computation of quantile
summaries, Greenwald and Khanna. SIGMOD 2001.

71
References Sampling

Random Sampling with a Reservoir, Vitter. Transactions on Mathematical Software 11(1):37-57 (1985).
On Sampling and Relational Operators. Chaudhuri
and Motwani. Bulletin of the Technical Committee
on Data Engineering (1999).
On Random Sampling over Joins. Chaudhuri,
Motwani, and Narasayya. SIGMOD 1999.
Congressional Samples for Approximate Answering
of Group-By Queries, Acharya, Gibbons, and
Poosala. SIGMOD 2000.
Overcoming Limitations of Sampling for
Aggregation Queries, Chaudhuri, Das, Datar,
Motwani and Narasayya. ICDE 2001.
A Robust Optimization-Based Approach for Approximate Answering of Aggregate Queries, Chaudhuri, Das and Narasayya. SIGMOD 2001.
Sampling From a Moving Window Over Streaming Data. Babcock, Datar, and Motwani. SODA 2002.
Sampling algorithms: lower bounds and applications. Bar-Yossef, Kumar, and Sivakumar. STOC 2001.

72
References Sketches

Probabilistic counting algorithms for data base
applications. Flajolet and Martin. JCSS (1985).
The space complexity of approximating the
frequency moments. Alon, Matias, and Szegedy.
STOC 1996.
Approximate Frequency Counts over Streaming Data.
Manku and Motwani. VLDB 2002.
Finding Frequent Items in Data Streams. Charikar,
Chen, and Farach-Colton. ICALP 2002.
An Approximate L1-Difference Algorithm for
Massive Data Streams. Feigenbaum, Kannan,
Strauss, and Viswanathan. FOCS 1999.
Stable Distributions, Pseudorandom Generators,
Embeddings and Data Stream Computation. Indyk.
FOCS 2000.
