Title: CS 361A Advanced Data Structures and Algorithms

1. CS 361A (Advanced Data Structures and Algorithms)
- Lectures 16 & 17 (Nov 16 and 28, 2005): Synopses, Samples, and Sketches
- Rajeev Motwani
2. Game Plan for Week
- Last Class
  - Models for streaming/massive data sets
  - Negative results for exact Distinct Values
  - Hashing for approximate Distinct Values
- Today
  - Synopsis data structures
  - Sampling techniques
  - Frequency moments problem
  - Sketching techniques
  - Finding high-frequency items
3. Synopsis Data Structures
- Synopses
  - Webster: "a condensed statement or outline (as of a narrative or treatise)"
  - CS 361A: a succinct data structure that lets us answer queries efficiently
- Synopsis Data Structures
  - Lossy summary (of a data stream)
  - Advantages: fits in memory, easy to communicate
  - Disadvantage: lossiness implies approximation error
  - Negative results ⇒ best we can do
  - Key techniques: randomization and hashing
4. Numerical Examples
- Approximate Query Processing [AQUA, Bell Labs]
  - Database size: 420 MB
  - Synopsis size: 420 KB (0.1%)
  - Approximation error: within 10%
  - Running time: 0.3% of the time for the exact query
- Histograms/Quantiles [Chaudhuri-Motwani-Narasayya, Manku-Rajagopalan-Lindsay, Greenwald-Khanna]
  - Data size: 10^9 items
  - Synopsis size: 1249 items
  - Approximation error: within 1%
5. Synopses
- Desiderata
  - Small memory footprint
  - Quick update and query
  - Provable, low-error guarantees
  - Composable for distributed scenarios
- Applicability?
  - General-purpose, e.g. random samples
  - Specific-purpose, e.g. distinct-values estimator
- Granularity?
  - Per database, e.g. sample of an entire table
  - Per distinct value, e.g. customer profiles
  - Structural, e.g. GROUP-BY or JOIN result samples
6. Examples of Synopses
- Synopses need not be fancy!
- Simple aggregates, e.g. mean/median/max/min
  - Variance?
- Random samples
  - Aggregates on small samples represent the entire data
  - Leverage extensive work on confidence intervals
- Random sketches
  - "structured" samples
- Tracking high-frequency items
7. Random Samples
8. Types of Samples
- Oblivious: sampling at the item level
  - Limitations [Bar-Yossef-Kumar-Sivakumar STOC 01]
- Value-based sampling, e.g. distinct-value samples
- Structured samples, e.g. join sampling
  - Naive approach: keep samples of each relation
  - Problem: sample-of-join ≠ join-of-samples
  - Foreign-key join [Chaudhuri-Motwani-Narasayya SIGMOD 99]
  - [Figure: join of relations L and R on attribute values A and B; what if A is sampled from L and B from R?]
9. Basic Scenario
- Goal: maintain a uniform sample of an item-stream
- Sampling semantics?
  - Coin flip
    - select each item with probability p
    - easy to maintain
    - undesirable: sample size is unbounded
  - Fixed-size sample without replacement
    - our focus today
  - Fixed-size sample with replacement
    - show: can generate from the previous sample
  - Non-uniform samples [Chaudhuri-Motwani-Narasayya]
10. Reservoir Sampling [Vitter]
- Input: stream of items X1, X2, X3, ...
- Goal: maintain a uniform random sample S of size n (without replacement) of the stream so far
- Reservoir Sampling (a code sketch follows this slide)
  - Initialize: include the first n elements in S
  - Upon seeing item Xt
    - add Xt to S with probability n/t
    - if added, evict a random previous item
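A minimal Python sketch of the update rule above; the function name and the iterable-stream interface are illustrative assumptions, not from the slides:

    import random

    def reservoir_sample(stream, n):
        """Maintain a uniform random sample of size n, without replacement,
        of the stream seen so far (reservoir sampling)."""
        S = []
        for t, x in enumerate(stream, start=1):
            if t <= n:
                S.append(x)                 # initialize: first n items
            elif random.random() < n / t:   # include X_t with probability n/t
                S[random.randrange(n)] = x  # evict a random previous item
        return S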
11. Analysis
- Correctness?
  - Fact: at each instant, |S| = n
  - Theorem: at time t, any Xi is in S with probability n/t
  - Exercise: prove via induction on t
- Efficiency?
  - Let N be the stream size
  - Naive implementation ⇒ N coin flips ⇒ time O(N)
  - Remark: verify whether this is optimal
12. Improving Efficiency
- [Figure: stream timeline marking the items inserted into sample S (where n = 3), with skips such as J3 = 2 and J9 = 4 between insertions]
- Random variable Jt = number of items jumped over after time t
- Idea: generate Jt directly and skip that many items
- Cumulative distribution function, for t > n and s ≥ 0 (reconstructed from the n/t acceptance probabilities):
  F(s) = P[Jt ≤ s] = 1 - Π_{j=1}^{s+1} (t + j - n)/(t + j)
13. Analysis
- Number of calls to RANDOM()?
  - one per insertion into the sample
  - this is optimal!
- Generating Jt? (a code sketch follows this slide)
  - Pick a random number U ∈ [0,1]
  - Find the smallest j such that U ≤ F(j)
- How?
  - Linear scan ⇒ O(N) time
  - Binary search with Newton's interpolation ⇒ O(n^2 (1 + polylog N/n)) time
  - Remark: see the paper for the optimal algorithm
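Under the CDF reconstructed on slide 12, the linear-scan inversion can be sketched as follows (Vitter's paper gives the binary-search and optimal versions); the helper name is hypothetical:

    import random

    def next_skip(t, n):
        """Sample J_t, the number of items to skip after time t (t >= n), by
        linear-scan inversion of F(s) = 1 - prod_{j=1}^{s+1} (t+j-n)/(t+j)."""
        u = random.random()
        s = 0
        tail = (t + 1 - n) / (t + 1)           # tail = P[J_t >= s + 1]
        while 1 - tail < u:                    # find smallest s with F(s) >= u
            s += 1
            tail *= (t + s + 1 - n) / (t + s + 1)
        return s

After next_skip(t, n) returns s, the algorithm skips s items, inserts item t+s+1 in place of a random sample element, and repeats.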
14. Sampling over Sliding Windows [Babcock-Datar-Motwani]
- Sliding window W = last w items in the stream
- Model: item Xt expires at time t + w
- Why?
  - Applications may require ignoring stale data
  - Type of approximation
  - Only way to define JOIN over streams
- Goal: maintain a uniform sample of size n of the sliding window
15. Reservoir Sampling?
- Observe
  - any item in sample S will expire eventually
  - must replace it with a random item of the current window
- Problem
  - no access to the items in W \ S
  - storing the entire window requires O(w) memory
- Oversampling
  - backing sample B: select each item with probability Θ((n log w)/w)
  - sample S: select n items from B at random
  - upon expiry in S ⇒ replenish from B
  - Claim: n ≤ |B| = O(n log w) with high probability
16. Index-Set Approach
- Pick a random index set I = {i1, ..., in} ⊆ {0, 1, ..., w-1}
- Sample S = items Xj, in the current window, with j ≡ i (mod w) for some i ∈ I (a code sketch follows this slide)
- Example
  - Suppose w = 2, n = 1, and I = {1}
  - Then the sample is always Xj with odd j
- Memory: only O(n)
- Observe
  - S is a uniform random sample of each window
  - But the sample is periodic (a union of arithmetic progressions)
  - Correlation across successive windows
- Problems
  - Correlation may hurt in some applications
  - Some data (e.g. time-series) may be periodic
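A small Python sketch of the index-set idea; the function shape is an illustrative assumption, and only the n slot values are kept:

    import random

    def index_set_sample(stream, w, n):
        """Index-set sampling: fix n random positions I within the window
        length w; the sample of any window is the items at those positions
        (mod w), each slot overwritten once per window."""
        I = set(random.sample(range(w), n))   # random index set, chosen once
        S = {}                                # slot (mod w) -> latest item
        for t, x in enumerate(stream):
            if t % w in I:
                S[t % w] = x                  # overwrite the expired item
        return list(S.values())               # sample of the current window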
17. Chain-Sample Algorithm
- Idea
  - Fix the expiry problem in reservoir sampling
  - Advance planning for expiry of sampled items
  - Focus on sample size 1; keep n independent such samples
- Chain-Sampling (a code sketch follows this slide)
  - Add Xt to S with probability 1/min{t, w}, evicting the earlier sample
  - Initially: standard reservoir sampling up to time w
  - Pre-select Xt's replacement Xr ∈ W_{t+w} = {X_{t+1}, ..., X_{t+w}}
    - Xt expires ⇒ must replace from W_{t+w}
  - At time r, save Xr and pre-select its own replacement ⇒ building a chain of potential replacements
  - Note: if evicting the earlier sample, discard its chain as well
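A Python sketch of chain-sampling for sample size 1 (keep n independent copies for sample size n); the class shape is an illustrative assumption:

    import random

    class ChainSample:
        """Chain-sample of size 1 over a sliding window of the last w items."""
        def __init__(self, w):
            self.w = w
            self.t = 0             # number of items seen so far
            self.chain = []        # [(index, value), ...]; head is the sample
            self.next_repl = None  # pre-selected replacement index for the tail

        def process(self, x):
            self.t += 1
            t = self.t
            if self.next_repl == t:
                # the pre-selected replacement arrives: extend the chain and
                # pre-select the replacement's own replacement
                self.chain.append((t, x))
                self.next_repl = random.randint(t + 1, t + self.w)
            if random.random() < 1.0 / min(t, self.w):
                # new sample chosen: discard the old chain entirely
                self.chain = [(t, x)]
                self.next_repl = random.randint(t + 1, t + self.w)
            if self.chain and self.chain[0][0] <= t - self.w:
                self.chain.pop(0)  # head expired; next chain entry takes over

        def sample(self):
            return self.chain[0][1] if self.chain else None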
18. Example
- Stream: 3 5 1 4 6 2 8 5 2 3 5 4 2 2 5 0 9 8 4 6 7 3
- [Figure: chain of pre-selected replacements built over this stream]
19. Expectation for Chain-Sample
- T(x) = E[chain length for Xt at time t + x]
- E[chain length] = T(w) ≤ e ≈ 2.718
- E[memory required for sample size n] = O(n)
20. Tail Bound for Chain-Sample
- A chain consists of hops of total length at most w
- Chain of h hops ⇒ an ordered (h+1)-partition of w
  - h hops of total length less than w
  - plus the remainder
- Each such partition has probability w^(-h)
- Counting the partitions:
  - h = O(log w) ⇒ probability of any such partition is O(w^(-c))
- Thus memory is O(n log w) with high probability
21. Comparison of Algorithms
- Chain-sample beats oversample
  - Expected memory: O(n) vs O(n log w)
  - High-probability memory bound: both O(n log w)
  - Oversample's sample size may shrink below n!
22. Sketches and Frequency Moments
23. Generalized Stream Model
- Input: element (i, a)
  - a copies of domain value i
  - increment the i-th dimension of m by a
  - a need not be an integer
  - negative values capture deletions
24. Example
- [Figure: histogram of the frequency vector (m0, m1, m2, m3, m4); on seeing element (i, a) = (1, 1), the bar for m1 is incremented by 1]
25. Frequency Moments
- Input stream
  - values from U = {0, 1, ..., N-1}
  - frequency vector m = (m0, m1, ..., m_{N-1})
- k-th frequency moment: F_k(m) = Σ_i m_i^k
  - F0 = number of distinct values (Lecture 15)
  - F1 = stream size
  - F2 = Gini index, self-join size, squared Euclidean norm
  - F_k for k > 2 measures skew, sometimes useful
  - F∞ = maximum frequency
- Problem: estimation in small space
- Sketches: randomized estimators
26. Naive Approaches
- Space N: one counter m_i for each distinct value i
- Space O(1): if the input is sorted by i
  - a single counter, recycled when a new i value appears
- Goal
  - Allow arbitrary input
  - Use small (logarithmic) space
  - Settle for randomization/approximation
27. Sketching F2
- Random hash h: {0, 1, ..., N-1} → {-1, +1}
- Define Z_i = h(i)
- Maintain X = Σ_i m_i Z_i
  - easy for update streams: on (i, a), just add aZ_i to X
- Claim: X^2 is an unbiased estimator for F2 (a code sketch follows this slide)
- Proof: E[X^2] = E[(Σ_i m_i Z_i)^2]
    = E[Σ_i m_i^2 Z_i^2] + E[Σ_{i≠j} m_i m_j Z_i Z_j]
    = Σ_i m_i^2 E[Z_i^2] + Σ_{i≠j} m_i m_j E[Z_i] E[Z_j]
    = Σ_i m_i^2 + 0 = F2
- Last line? Z_i^2 = 1 and E[Z_i] = 0 since Z_i is uniform over {-1, +1}; the cross-terms factor by independence
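The slide's estimator in minimal Python form; memoizing a random sign per value i is for illustration only (it defeats the small-space goal) and stands in for the 4-wise independent hash introduced on slide 33:

    import random

    class F2Sketch:
        """Single AMS-style estimator for F2: X = sum_i m_i * Z_i,
        with Z_i uniform over {-1, +1}."""
        def __init__(self):
            self.Z = {}       # i -> memoized random sign (illustrative)
            self.X = 0.0

        def update(self, i, a=1):
            if i not in self.Z:
                self.Z[i] = random.choice((-1, 1))
            self.X += a * self.Z[i]   # handles deletions (a < 0) too

        def estimate(self):
            return self.X ** 2        # unbiased: E[X^2] = F2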
28. Estimation Error?
- Chebyshev bound needs the variance
- Define Y = X^2 ⇒ E[Y] = E[X^2] = Σ_i m_i^2 = F2
- Observe E[X^4] = E[(Σ m_i Z_i)^4]
    = E[Σ m_i^4 Z_i^4] + 4 E[Σ m_i m_j^3 Z_i Z_j^3] + 6 E[Σ m_i^2 m_j^2 Z_i^2 Z_j^2]
      + 12 E[Σ m_i m_j m_k^2 Z_i Z_j Z_k^2] + 24 E[Σ m_i m_j m_k m_l Z_i Z_j Z_k Z_l]
    = Σ m_i^4 + 6 Σ_{i<j} m_i^2 m_j^2
- By definition, Var[Y] = E[Y^2] - E[Y]^2 = E[X^4] - E[X^2]^2
    = Σ m_i^4 + 6 Σ m_i^2 m_j^2 - (Σ m_i^4 + 2 Σ m_i^2 m_j^2)
    = 4 Σ_{i<j} m_i^2 m_j^2
    ≤ 2 E[X^2]^2 = 2 F2^2
- Why? Because 4 Σ_{i<j} m_i^2 m_j^2 ≤ 2 (Σ_i m_i^2)^2
29. Estimation Error?
- Chebyshev bound: P[|X^2 - F2| > ε F2] ≤ Var[X^2] / (ε^2 F2^2) ≤ 2/ε^2
- Problem: what if we want ε really small?
- Solution
  - Compute s = 8/ε^2 independent copies of X
  - Estimator Y = mean(X_i^2)
  - Variance reduces by a factor of s
  - P[|Y - F2| > ε F2] ≤ 2/(s ε^2) = 1/4
30. Boosting Technique
- Algorithm A: randomized ε-approximate estimator f' of f with
  P[(1-ε)f ≤ f' ≤ (1+ε)f] ≥ 3/4
- Heavy-tail problem: the estimates may fall below, within, and above the ε-range with probabilities, say, 1/16, 3/4, 3/16, so a skewed tail can ruin the mean
- Boosting idea (a code sketch follows this slide)
  - Take O(log 1/δ) independent estimates from A(X)
  - Return the median of the estimates
- Claim: P[median is ε-approximate] > 1 - δ
- Proof
  - P[a specific estimate is ε-approximate] ≥ 3/4
  - Bad event only if > 50% of the estimates are not ε-approximate
  - Binomial tail probability is less than δ
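A sketch of the mean-then-median recipe of slides 29-30, reusing the illustrative F2Sketch class above; the parameter choices follow the slides (s = 8/ε^2 copies per average, O(log 1/δ) averages):

    import math
    import statistics

    def boosted_f2(stream, eps, delta):
        """Median-of-means F2 estimate over a stream of (i, a) pairs."""
        s = max(1, math.ceil(8 / eps ** 2))         # copies per average
        g = max(1, math.ceil(math.log(1 / delta)))  # number of averages
        table = [[F2Sketch() for _ in range(s)] for _ in range(g)]
        for i, a in stream:                # one pass; all copies in parallel
            for row in table:
                for sk in row:
                    sk.update(i, a)
        return statistics.median(
            statistics.mean(sk.estimate() for sk in row) for row in table)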
31. Overall Space Requirement
- Observe
  - Let m = Σ_i m_i
  - Each hash needs an O(log m)-bit counter
  - s = 8/ε^2 hash functions for each estimator
  - O(log 1/δ) such estimators
  - Total: O(ε^(-2) log(1/δ) log m) bits
- Question: space for storing the hash functions?
32. Sketching Paradigm
- Random sketch = inner product m · Z
  - frequency vector m = (m0, m1, ..., m_{N-1})
  - random vector Z (currently, uniform over {-1, +1})
- Observe
  - Linearity: Sketch(m1) + Sketch(m2) = Sketch(m1 + m2)
  - Ideal for distributed computing
- Observe
  - Suppose, given i, we can efficiently generate Z_i
  - Then we can maintain the sketch for update streams
- Problem
  - Must generate Z_i = h(i) on the first appearance of i
  - Need O(N) memory to store h explicitly
  - Need O(N) random bits
33. Two Birds, One Stone
- Pairwise independent Z1, Z2, ..., Zn
  - for all i ≠ k and all x, y: P[Z_i = x, Z_k = y] = P[Z_i = x] · P[Z_k = y]
  - property: E[Z_i Z_k] = E[Z_i] · E[Z_k]
- Example: linear hash function
  - Seed S = <a, b> from {0, ..., p-1}, where p is prime
  - Z_i = h(i) = ai + b (mod p)
- Claim: Z1, Z2, ..., Zn are pairwise independent
  - Z_i = x and Z_k = y ⟺ x = ai + b (mod p) and y = ak + b (mod p)
  - fixing i, k, x, y ⇒ a unique solution for a, b
  - P[Z_i = x, Z_k = y] = 1/p^2 = P[Z_i = x] · P[Z_k = y]
- Memory/randomness: n log p → 2 log p
34. Wait a Minute!
- Doesn't pairwise independence mess up the proofs?
  - No: the E[X^2] calculation only has degree-2 terms
- But what about Var[X^2]?
  - Need 4-wise independence (a code sketch follows this slide)
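A standard way to get a k-wise independent family is a random degree-(k-1) polynomial over a prime field: k = 2 recovers the linear hash of slide 33, and k = 4 suffices for the Var[X^2] calculation. A Python sketch; the parity-based fold from Z_p to {-1, +1} is a simplifying assumption, not the slides' construction:

    import random

    P = 2 ** 31 - 1   # a Mersenne prime, assumed larger than the domain size N

    def make_kwise_hash(k, p=P):
        """Random degree-(k-1) polynomial over Z_p: a k-wise independent
        family. The seed is just the k coefficients (k log p bits)."""
        coeffs = [random.randrange(p) for _ in range(k)]
        def h(i):
            v = 0
            for c in coeffs:                 # Horner evaluation mod p
                v = (v * i + c) % p
            return 1 if v % 2 == 1 else -1   # fold to a +/-1 value (heuristic)
        return h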
35. Application: Join-Size Estimation
- Given
  - join-attribute frequency vectors f1 and f2
  - join size = f1 · f2
- Define X1 = f1 · Z and X2 = f2 · Z
- Choose Z 4-wise independent, uniform over {-1, +1}
- Exercise: show, as before,
  - E[X1 X2] = f1 · f2
  - Var[X1 X2] ≤ 2 (‖f1‖ ‖f2‖)^2
  - Hint: |a · b| ≤ ‖a‖ ‖b‖ (Cauchy-Schwarz)
36. Bounding Error Probability
- Using s copies of X1 X2 and taking their mean Y:
  P[|Y - f1·f2| ≥ ε f1·f2] ≤ Var(Y) / (ε^2 (f1·f2)^2)
    ≤ 2 ‖f1‖^2 ‖f2‖^2 / (s ε^2 (f1·f2)^2)
    = 2 / (s ε^2 cos^2 θ)
  where cos θ = f1·f2 / (‖f1‖ ‖f2‖)
- Bounding the error probability?
  - Need s > 2 / (ε^2 cos^2 θ)
  - Memory: O(ε^(-2) cos^(-2)θ · log(1/δ) · (log N + log m))
- Problem
  - To choose s, we need an a priori lower bound on cos θ, i.e. on f1·f2
  - What if cos θ is really small?
37. Sketch Partitioning
- Idea for dealing with the ‖f1‖^2 ‖f2‖^2 / (f1·f2)^2 issue: partition the domain into regions where the self-join sizes are smaller, to compensate for a small join size (cos θ)
- [Figure: example frequency distributions for R1.A and R2.B]
  - without partitioning: self-join(R1.A) × self-join(R2.B) = 205 × 205 ≈ 42K
  - after partitioning into two regions: self-join(R1.A) × self-join(R2.B) = 200 × 5 + 200 × 5 = 2K
38. Sketch Partitioning
- Idea
  - intelligently partition the join-attribute space
  - needs coarse statistics on the stream
  - build independent sketches for each partition
  - Estimate = Σ partition sketches
  - Variance = Σ partition variances
39. Sketch Partitioning
- Partition space allocation?
  - can be solved optimally, given a domain partition
- Optimal partition: find the K-partition minimizing the total variance
- Results
  - Dynamic programming: optimal solution for a single join
  - NP-hard for queries with multiple joins
40. F_k for k > 2
- Assume the stream length m is known
  - Exercise: show this can be fixed, with log m space overhead, via a repeated-doubling estimate of m
- Choose a random stream item a_p, with p uniform over {1, 2, ..., m}
- Suppose a_p = v ∈ {0, 1, ..., N-1}
- Count the subsequent frequency of v:
  r = |{q : q ≥ p, a_q = v}|
- Define X = m (r^k - (r-1)^k)
- E[X] = F_k: summing over the m choices of stream position, the differences telescope to Σ_v m_v^k
41. Example
- Stream: 7, 8, 5, 1, 7, 5, 2, 1, 5, 4, 5, 10, 6, 5, 4, 1, 4, 7, 3, 8
- m = 20
- p = 9
- a_p = 5
- r = 3 (the 5's at positions 9, 11, and 14)
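A runnable check of this example (helper name hypothetical); with p = 9 and k = 2, X = 20 · (3^2 - 2^2) = 100:

    def fk_estimate(stream, p, k):
        """AMS estimator for F_k (slide 40): pick position p (1-indexed)
        uniformly at random, count r = occurrences of a_p from p onward,
        and return X = m * (r^k - (r-1)^k)."""
        m = len(stream)
        v = stream[p - 1]
        r = stream[p - 1:].count(v)
        return m * (r ** k - (r - 1) ** k)

    stream = [7, 8, 5, 1, 7, 5, 2, 1, 5, 4, 5, 10, 6, 5, 4, 1, 4, 7, 3, 8]
    print(fk_estimate(stream, p=9, k=2))   # r = 3, so X = 20 * (9 - 4) = 100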
42. F_k for k > 2
- Var(X) ≤ k N^(1-1/k) F_k^2
- Bounded error probability ⇒ s = O(k N^(1-1/k) / ε^2) copies
- Boosting ⇒ memory bound
  O(k N^(1-1/k) ε^(-2) (log 1/δ)(log N + log m))
43. Frequency Moments
- F0: distinct values problem (Lecture 15)
- F1: sequence length
  - for the case with deletions, use the Cauchy distribution [Indyk]
- F2: self-join size / Gini index (today)
- F_k for k > 2
  - omitting grungy details
  - can achieve space bound O(k N^(1-1/k) ε^(-2) (log 1/δ)(log N + log m))
- F∞: maximum frequency
44. Communication Complexity
- [Figure: ALICE (input A) and BOB (input B) exchanging messages]
- Cooperatively compute a function f(A, B)
  - minimize the bits communicated
  - unbounded computational power
- Communication complexity C(f) = bits exchanged by the optimal protocol
- Protocols?
  - 1-way versus 2-way
  - deterministic versus randomized
  - C_δ(f) = randomized complexity for error probability δ
45. Streaming and Communication Complexity
- Stream algorithm ⇒ 1-way communication protocol
- Simulation argument
  - Given an algorithm S computing f over streams
  - Alice initiates S, providing A as the input stream's prefix
  - She communicates S's state after seeing A to Bob
  - Bob resumes S, providing B as the input stream's suffix
- Theorem: a stream algorithm's space requirement is at least the communication complexity C(f)
46. Example: Set Disjointness
- Set Disjointness (DIS)
  - A, B subsets of {1, 2, ..., N}
  - Output: 1 if A ∩ B ≠ ∅, and 0 otherwise
- Theorem: C_δ(DIS) = Ω(N), for any δ < 1/2
47. Lower Bound for F∞
- Theorem: fix ε < 1/3 and δ < 1/2. Any stream algorithm S with
  P[(1-ε) F∞ < S < (1+ε) F∞] > 1 - δ
  needs Ω(N) space
- Proof
  - Claim: S ⇒ 1-way protocol for DIS (on any sets A and B)
    - Alice streams set A to S
    - She communicates S's state to Bob
    - Bob streams set B to S
  - Observe
    - F∞ = 1 if A ∩ B = ∅, and F∞ = 2 otherwise, so relative error ε < 1/3 ⇒ DIS solved exactly!
    - P[error] ≤ δ < 1/2 ⇒ Ω(N) space
48. Extensions
- Observe
  - We used only 1-way communication in the proof
  - The C_δ(DIS) bound holds for arbitrary communication
  - Exercise: extend the lower bound to multi-pass algorithms
- Lower bound for F_k, k > 2
  - Need to increase the gap beyond 2
  - Multiparty Set Disjointness with t players
  - Theorem: fix ε, δ < 1/2 and k > 5. Any stream algorithm S with
    P[(1-ε) F_k < S < (1+ε) F_k] > 1 - δ
    needs Ω(N^(1-(2+δ)/k)) space
  - Implies Ω(N^(1/2)) even for multi-pass algorithms
49. Tracking High-Frequency Items
50. Problem 1: Top-K List [Charikar-Chen-Farach-Colton]
- The "Google" problem
  - Return a list of the k most frequent items in the stream
- Motivation
  - search engine queries, network traffic, ...
- Remember
  - We saw a lower bound recently!
- Solution
  - Data structure CountSketch ⇒ maintains count-estimates of high-frequency elements
51. Definitions
- Notation
  - Assume elements 1, 2, ..., N are in order of frequency
  - m_i is the frequency of the i-th most frequent element
  - m = Σ m_i is the number of elements in the stream
- FindCandidateTop
  - Input: stream S, int k, int p
  - Output: a list of p elements containing the top k
  - Naive sampling gives a solution with p = Θ(m log k / m_k)
- FindApproxTop
  - Input: stream S, int k, real ε
  - Output: a list of k elements, each of frequency m_i > (1-ε) m_k
  - Naive sampling gives no solution
52. Main Idea
- Consider
  - a single counter X
  - hash function h: {1, 2, ..., N} → {-1, +1}
- Input element i ⇒ update counter X += Z_i = h(i)
- For each r, use X·Z_r as an estimator of m_r
- Theorem: E[X Z_r] = m_r
- Proof
  - X = Σ_i m_i Z_i
  - E[X Z_r] = E[Σ_i m_i Z_i Z_r] = Σ_{i≠r} m_i E[Z_i] E[Z_r] + m_r E[Z_r^2] = m_r
  - The cross-terms cancel
53. Finding the Max-Frequency Element
- Problem: Var[X] ≈ F2 = Σ_i m_i^2
- Idea: t counters, with independent 4-wise hashes h1, ..., ht
  - [Figure: t counters, each updated by its own hash h_r: i → {-1, +1}]
- Use t = O(log m · Σ m_i^2 / (ε m_1)^2)
- Claim: new variance ≤ Σ m_i^2 / t = (ε m_1)^2 / log m
- Overall estimator
  - repeat, take the median of averages
  - with high probability, approximates m_1
54. Problem with Array of Counters
- Variance dominated by the highest frequencies
- Estimates for less-frequent elements (like the k-th)
  - corrupted by the higher frequencies
  - variance >> m_k
- Avoiding collisions?
  - spread out the high-frequency elements
  - replace each counter with a hashtable of b counters
55. Count Sketch
- Hash functions
  - 4-wise independent hashes h1, ..., ht (to {-1, +1}) and s1, ..., st (to buckets {1, ..., b})
  - hashes independent of each other
- Data structure: t hashtables of b counters each, X(r, c) for r = 1..t, c = 1..b
56. Overall Algorithm
- s_r(i) = one of the b counters in the r-th hashtable
- Input i ⇒ for each r, update X(r, s_r(i)) += h_r(i)
- Estimator(m_i) = median_r X(r, s_r(i)) · h_r(i)
- Maintain a heap of the top k elements seen so far (a code sketch follows this slide)
- Observe
  - Collisions with high-frequency items are not completely eliminated
  - A few of the estimates X(r, s_r(i)) · h_r(i) could have high variance
  - The median is not sensitive to these poor estimates
57. Avoiding Large Items
- b = Ω(k) ⇒ with constant probability, no collision with the top-k elements
- The t hashtables represent independent trials
  - need O(log m/δ) trials to estimate with probability 1 - δ
- Also need small variance from the colliding small elements
  - Claim: P[variance due to small items in each estimate < (Σ_{i>k} m_i^2)/b] = Ω(1)
- Final bound: b = O(k + Σ_{i>k} m_i^2 / (ε m_k)^2)
58. Final Results
- Zipfian distribution: m_i ∝ 1/i^α (power law)
- FindApproxTop space:
  O((k + (Σ_{i>k} m_i^2) / (ε m_k)^2) · log m/δ)
  - roughly the sampling bound with frequencies squared
  - Zipfian gives improved results
- FindCandidateTop
  - Zipf parameter 0.5
  - O(k log N log m)
  - compare the sampling bound O((kN)^0.5 log k)
59. Problem 2: Elephants-and-Ants [Manku-Motwani]
- [Figure: stream of items]
- Identify items whose current frequency exceeds a support threshold s = 0.1%
- [Jacobson 2000, Estan-Verghese 2001]
60. Algorithm 1: Lossy Counting
- Step 1: divide the stream into windows
- Window size w is a function of the support s: specified later

61. Lossy Counting in Action...
- [Figure: the counter table starts out empty; items are counted within each window]

62. Lossy Counting (continued)
- [Figure: at each window boundary, every counter is decremented by 1 and zeroed counters are dropped]
63. Error Analysis
- How much do we undercount?
  - If the current stream size is N and the window size is w = 1/ε, then #windows = εN
  - frequency error ≤ #windows = εN
- Rule of thumb: set ε = 10% of the support s
  - Example: given support frequency s = 1%, set error frequency ε = 0.1%
64. Putting It All Together
- Output: elements with counter values exceeding (s - ε)N
- Approximation guarantees
  - frequencies underestimated by at most εN
  - no false negatives
  - false positives have true frequency at least (s - ε)N
- How many counters do we need?
  - Worst-case bound: (1/ε) log(εN) counters
- Implementation details (a code sketch follows this slide)
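A minimal Python sketch of lossy counting as described on slides 60-64; the function name and the dict-based counter table are illustrative:

    import math

    def lossy_count(stream, s, eps):
        """Lossy counting: window size w = 1/eps; at each window boundary
        every counter is decremented and zeroed counters are dropped.
        Returns items whose counter exceeds (s - eps) * N."""
        w = math.ceil(1 / eps)
        counts, N = {}, 0
        for x in stream:
            N += 1
            counts[x] = counts.get(x, 0) + 1
            if N % w == 0:                      # window boundary
                counts = {key: c - 1 for key, c in counts.items() if c > 1}
        return [x for x, c in counts.items() if c > (s - eps) * N]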
65. Number of Counters?
- Window size w = 1/ε
- Number of windows m = εN
- n_i = number of counters alive over the last i windows
- Fact: for each i, Σ_{j≤i} j · n_j ≤ i · w (the items in the last i windows)
- Claim: a counter must average 1 increment/window to survive, so n_j ≤ w/j
- Active counters ≤ Σ_{j=1}^{m} w/j = O((1/ε) log(εN))
66. Enhancements
- Frequency errors
  - For counter (X, c), the true frequency is in [c, c + εN]
  - Trick: track the number of windows t the counter has been active
  - For counter (X, c, t), the true frequency is in [c, c + t - 1]
  - If t = 1, no error!
- Batch processing
  - perform the decrements only after every k windows
67. Algorithm 2: Sticky Sampling
- Create counters by sampling
- Maintain exact counts thereafter
- What is the sampling rate?
68. Sticky Sampling (continued)
- For a finite stream of length N: sampling rate = (2/εN) log(1/(sδ))
  - δ = probability of failure
- Output: elements with counter values exceeding (s - ε)N
- Same rule of thumb: set ε = 10% of the support s
  - Example: given support threshold s = 1%, set error threshold ε = 0.1% and failure probability δ = 0.01%
69. Number of Counters?
- Finite stream of length N: sampling rate = (2/εN) log(1/(sδ))
- Infinite stream with unknown N: gradually adjust the sampling rate
- In either case, the expected number of counters is (2/ε) log(1/(sδ)), independent of N (a code sketch follows this slide)
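A minimal Python sketch of the finite-stream variant of sticky sampling (function name illustrative); the infinite-stream version would instead lower the sampling probability as N grows:

    import math
    import random

    def sticky_sample(stream, s, eps, delta):
        """Sticky sampling: create a counter for an item with probability
        p = (2 / (eps * N)) * log(1 / (s * delta)); once a counter exists,
        count that item exactly from then on."""
        N = len(stream)
        p = min(1.0, (2 / (eps * N)) * math.log(1 / (s * delta)))
        counts = {}
        for x in stream:
            if x in counts:
                counts[x] += 1          # exact counting after creation
            elif random.random() < p:
                counts[x] = 1           # counter created by sampling
        return [x for x, c in counts.items() if c > (s - eps) * N]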
70. References: Synopses
- Synopsis Data Structures for Massive Data Sets. Gibbons and Matias. DIMACS, 1999.
- Tracking Join and Self-Join Sizes in Limited Storage. Alon, Gibbons, Matias, and Szegedy. PODS 1999.
- Join Synopses for Approximate Query Answering. Acharya, Gibbons, Poosala, and Ramaswamy. SIGMOD 1999.
- Random Sampling for Histogram Construction: How Much Is Enough? Chaudhuri, Motwani, and Narasayya. SIGMOD 1998.
- Random Sampling Techniques for Space Efficient Online Computation of Order Statistics of Large Datasets. Manku, Rajagopalan, and Lindsay. SIGMOD 1999.
- Space-Efficient Online Computation of Quantile Summaries. Greenwald and Khanna. SIGMOD 2001.
71. References: Sampling
- Random Sampling with a Reservoir. Vitter. ACM Transactions on Mathematical Software 11(1):37-57, 1985.
- On Sampling and Relational Operators. Chaudhuri and Motwani. Bulletin of the Technical Committee on Data Engineering, 1999.
- On Random Sampling over Joins. Chaudhuri, Motwani, and Narasayya. SIGMOD 1999.
- Congressional Samples for Approximate Answering of Group-By Queries. Acharya, Gibbons, and Poosala. SIGMOD 2000.
- Overcoming Limitations of Sampling for Aggregation Queries. Chaudhuri, Das, Datar, Motwani, and Narasayya. ICDE 2001.
- A Robust Optimization-Based Approach for Approximate Answering of Aggregate Queries. Chaudhuri, Das, and Narasayya. SIGMOD 2001.
- Sampling from a Moving Window over Streaming Data. Babcock, Datar, and Motwani. SODA 2002.
- Sampling Algorithms: Lower Bounds and Applications. Bar-Yossef, Kumar, and Sivakumar. STOC 2001.
72. References: Sketches
- Probabilistic Counting Algorithms for Data Base Applications. Flajolet and Martin. JCSS, 1985.
- The Space Complexity of Approximating the Frequency Moments. Alon, Matias, and Szegedy. STOC 1996.
- Approximate Frequency Counts over Streaming Data. Manku and Motwani. VLDB 2002.
- Finding Frequent Items in Data Streams. Charikar, Chen, and Farach-Colton. ICALP 2002.
- An Approximate L1-Difference Algorithm for Massive Data Streams. Feigenbaum, Kannan, Strauss, and Viswanathan. FOCS 1999.
- Stable Distributions, Pseudorandom Generators, Embeddings and Data Stream Computation. Indyk. FOCS 2000.