Title: CS 361A Advanced Data Structures and Algorithms

CS 361A (Advanced Data Structures and Algorithms)
  • Lectures 16 and 17 (Nov
    16 and 28, 2005)
  • Synopses, Samples, and Sketches
  • Rajeev Motwani

Game Plan for Week
  • Last Class
  • Models for Streaming/Massive Data Sets
  • Negative results for Exact Distinct Values
  • Hashing for Approximate Distinct Values
  • Today
  • Synopsis Data Structures
  • Sampling Techniques
  • Frequency Moments Problem
  • Sketching Techniques
  • Finding High-Frequency Items

Synopsis Data Structures
  • Synopses
  • Webster: a condensed statement or outline (as of
    a narrative or treatise)
  • CS 361A: succinct data structure that lets us
    answer queries efficiently
  • Synopsis Data Structures
  • "Lossy" Summary (of a data stream)
  • Advantages: fits in memory, easy to communicate
  • Disadvantage: lossiness implies approximation
  • Negative Results → best we can do
  • Key Techniques: randomization and hashing

Numerical Examples
  • Approximate Query Processing: AQUA/Bell Labs
  • Database Size 420 MB
  • Synopsis Size 420 KB (0.1%)
  • Approximation Error within 10%
  • Running Time 0.3% of time for exact query
  • Histograms/Quantiles: Chaudhuri-Motwani-Narasayya,
    Manku-Rajagopalan-Lindsay, Khanna-Greenwald
  • Data Size 10^9 items
  • Synopsis Size 1249 items
  • Approximation Error within 1%

  • Desiderata
  • Small Memory Footprint
  • Quick Update and Query
  • Provable, low-error guarantees
  • Composable for distributed scenario
  • Applicability?
  • General-purpose, e.g. random samples
  • Specific-purpose, e.g. distinct values estimator
  • Granularity?
  • Per database, e.g. sample of entire table
  • Per distinct value, e.g. customer profiles
  • Structural, e.g. GROUP-BY or JOIN result samples

Examples of Synopses
  • Synopses need not be fancy!
  • Simple Aggregates, e.g. mean/median/max/min
  • Variance?
  • Random Samples
  • Aggregates on small samples represent entire data
  • Leverage extensive work on confidence intervals
  • Random Sketches
  • "structured" samples
  • Tracking High-Frequency Items

Random Samples
Types of Samples
  • Oblivious sampling at item level
  • Limitations: Bar-Yossef-Kumar-Sivakumar STOC 01
  • Value-based sampling, e.g. distinct-value
  • Structured samples, e.g. join sampling
  • Naïve approach: keep samples of each relation
  • Problem: sample-of-join ≠ join-of-samples
  • Foreign-Key Join: Chaudhuri-Motwani-Narasayya
    SIGMOD 99

what if A sampled from L and B from R?
Basic Scenario
  • Goal: maintain uniform sample of item-stream
  • Sampling Semantics?
  • Coin flip
  • select each item with probability p
  • easy to maintain
  • undesirable: sample size is unbounded
  • Fixed-size sample without replacement
  • Our focus today
  • Fixed-size sample with replacement
  • Show: can generate from previous sample
  • Non-Uniform Samples: Chaudhuri-Motwani-Narasayya

Reservoir Sampling (Vitter)
  • Input: stream of items X1, X2, X3, …
  • Goal: maintain uniform random sample S of size n
    (without replacement) of stream so far
  • Reservoir Sampling
  • Initialize: include first n elements in S
  • Upon seeing item Xt
  • Add Xt to S with probability n/t
  • If added, evict random previous item
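The two update rules above fit in a few lines; `reservoir_sample` is a name chosen here for illustration, not from the lecture.

```python
import random

def reservoir_sample(stream, n):
    """Maintain a uniform random sample of size n, without replacement."""
    sample = []
    for t, x in enumerate(stream, start=1):
        if t <= n:
            sample.append(x)                  # first n items fill the reservoir
        elif random.random() < n / t:         # admit X_t with probability n/t
            sample[random.randrange(n)] = x   # evict a uniformly random resident
    return sample
```

If the stream is shorter than n, the whole stream comes back as the sample.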

  • Correctness?
  • Fact: at each instant, |S| = n
  • Theorem: at time t, any Xi ∈ S with probability n/t
  • Exercise: prove via induction on t
  • Efficiency?
  • Let N be stream size
  • Remark: verify this is optimal.
  • Naïve implementation → N coin flips → time O(N)

Improving Efficiency
(Figure: items inserted into sample S, where n = 3)
  • Random variable Jt: number of items jumped over
    after time t
  • Idea: generate Jt and skip that many items
  • Cumulative Distribution Function: F(s) = P[Jt ≤ s],
    for t > n, s ≥ 0

  • Number of calls to RANDOM()?
  • one per insertion into sample
  • this is optimal!
  • Generating Jt?
  • Pick random number U ∈ [0,1]
  • Find smallest j such that U ≤ F(j)
  • How?
  • Linear scan → O(N) time
  • Binary search with Newton's interpolation →
    O(n^2 (1 + polylog N/n)) time
  • Remark: see paper for optimal algorithm
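A sketch of the inverse-transform step, under the assumption that F(s) = 1 − Π_{i=1}^{s+1} (1 − n/(t+i)) (the first s+1 items after time t are all rejected); `skip_cdf` and `generate_skip` are names invented here, and this is the slow linear-scan variant, not the paper's optimal algorithm.

```python
import random

def skip_cdf(t, n, s):
    """F(s) = P[J_t <= s]: some insertion happens within the next s+1 items."""
    prob_no_insert = 1.0
    for i in range(1, s + 2):            # items t+1 .. t+s+1 all rejected
        prob_no_insert *= (t + i - n) / (t + i)
    return 1.0 - prob_no_insert

def generate_skip(t, n, rng=random):
    """Draw J_t by inverse-transform sampling with a linear scan over F."""
    u = rng.random()
    j = 0
    while u > skip_cdf(t, n, j):         # smallest j with U <= F(j)
        j += 1
    return j
```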

Sampling over Sliding Windows (Babcock-Datar-Motwani)
  • Sliding Window W: last w items in stream
  • Model: item Xt expires at time t + w
  • Why?
  • Applications may require ignoring stale data
  • Type of approximation
  • Only way to define JOIN over streams
  • Goal: maintain uniform sample of size n of
    sliding window

Reservoir Sampling?
  • Observe
  • any item in sample S will expire eventually
  • must replace with random item of current window
  • Problem
  • no access to items in W − S
  • storing entire window requires O(w) memory
  • Oversampling
  • Backing sample B: select each item with
    suitable probability
  • sample S: select n items from B at random
  • upon expiry in S → replenish from B
  • Claim: n ≤ |B| ≤ n log w with high probability

Index-Set Approach
  • Pick random index set I = {i1, …, in} ⊆ {0, 1,
    …, w−1}
  • Sample S: items Xi with i ≡ some ij (mod w)
    in current window
  • Example
  • Suppose w = 2, n = 1, and I = {1}
  • Then sample is always Xi with odd i
  • Memory: only O(n)
  • Observe
  • S is uniform random sample of each window
  • But sample is periodic (union of arithmetic
    progressions)
  • Correlation across successive windows
  • Problems
  • Correlation may hurt in some applications
  • Some data (e.g. time-series) may be periodic
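The index-set scheme above takes only a few lines, since each residue class contributes exactly one live item per window; `index_set_sample` and `make_index_set` are illustrative names.

```python
import random

def make_index_set(w, n, rng=random):
    """Pick the random index set I once, up front."""
    return set(rng.sample(range(w), n))

def index_set_sample(stream, w, I):
    """For each residue in I, keep the latest item at a position ≡ that
    residue (mod w); these are the sampled items of the current window."""
    latest = {}
    for pos, x in enumerate(stream):
        r = pos % w
        if r in I:
            latest[r] = x   # previous holder of this residue just expired
    return list(latest.values())
```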

Chain-Sample Algorithm
  • Idea
  • Fix expiry problem in Reservoir Sampling
  • Advance planning for expiry of sampled items
  • Focus on sample size 1; keep n independent such
    samples
  • Chain-Sampling
  • Add Xt to S with probability 1/min(t,w); evict
    earlier sample
  • Initially: standard Reservoir Sampling up to
    time w
  • Pre-select Xt's replacement Xr ∈ Wt+w = {Xt+1, …,
    Xt+w}
  • Xt expires → must replace from Wt+w
  • At time r, save Xr and pre-select its own
    replacement → building chain of potential
    replacements
  • Note: if evicting earlier sample, discard its
    chain as well
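The steps above, for sample size 1, might be sketched as follows (`chain_sample` is an illustrative name; n independent copies give a size-n sample). A pre-selected replacement index always arrives no later than the current sample expires, so the chain never runs dry.

```python
import random

def chain_sample(stream, w, rng=random):
    """Size-1 chain sample over a sliding window of the last w items."""
    chain = []        # [(index, value), ...]; chain[0] is the current sample
    pending = None    # future index pre-selected as the next replacement
    for t, x in enumerate(stream, start=1):
        if pending == t:                      # capture a pre-selected replacement
            chain.append((t, x))
            pending = rng.randint(t + 1, t + w)
        if rng.random() < 1.0 / min(t, w):    # X_t becomes the new sample
            chain = [(t, x)]                  # evict old sample, discard its chain
            pending = rng.randint(t + 1, t + w)
        if chain and chain[0][0] <= t - w:    # current sample expired
            chain.pop(0)                      # pre-selected successor takes over
    return chain[0][1] if chain else None
```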

Example stream: 3 5 1 4 6 2 8 5 2 3 5 4 2 2 5 0 9 8 4 6 7 3
Expectation for Chain-Sample
  • T(x) = E[chain length for Xt at time t + x]
  • E[chain length] ≤ T(w) ≤ e ≈ 2.718
  • E[memory required for sample size n] = O(n)

Tail Bound for Chain-Sample
  • Chain: hops of total length at most w
  • Chain of h hops → ordered (h+1)-partition of w
  • h hops of total length less than w
  • plus, remainder
  • Each partition has probability w^(−h)
  • Number of partitions …
  • h = O(log w) → probability of a partition is …
  • Thus: memory O(n log w) with high probability

Comparison of Algorithms
  • Chain-Sample beats Oversample
  • Expected memory O(n) vs O(n log w)
  • High-probability memory bound both O(n log w)
  • Oversample may have sample size shrink below n!

Sketches and Frequency Moments
Generalized Stream Model
  • Input Element (i,a)
  • a copies of domain-value i
  • increment to ith dimension of m by a
  • a need not be an integer
  • Negative value captures deletions

Example: on seeing element (i,a) = (1,−1), decrement
m1 in the frequency vector m = (m0, m1, m2, m3, m4)
Frequency Moments
  • Input Stream
  • values from U = {0, 1, …, N−1}
  • frequency vector m = (m0, m1, …, mN−1)
  • kth Frequency Moment: Fk(m) = Σi mi^k
  • F0: number of distinct values (Lecture 15)
  • F1: stream size
  • F2: Gini index, self-join size, Euclidean norm
  • Fk for k > 2: measures skew, sometimes useful
  • F∞: maximum frequency
  • Problem: estimation in small space
  • Sketches: randomized estimators
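The moments just defined are trivial to compute offline from the frequency vector; a reference implementation (useful for checking the streaming estimators below) might look like this, with `frequency_moment` an illustrative name.

```python
from collections import Counter

def frequency_moment(stream, k):
    """F_k = sum over distinct values of m_i^k; F_inf = max frequency."""
    m = Counter(stream)
    if k == float("inf"):
        return max(m.values())            # F_inf: maximum frequency
    return sum(c ** k for c in m.values())  # k = 0 gives #distinct values
```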

Naive Approaches
  • Space N: one counter mi for each distinct value i
  • Space O(1)
  • if input sorted by i
  • single counter recycled when new i value appears
  • Goal
  • Allow arbitrary input
  • Use small (logarithmic) space
  • Settle for randomization/approximation

Sketching F2
  • Random Hash h: {0, 1, …, N−1} → {−1, +1}
  • Define Zi = h(i)
  • Maintain X = Σi mi Zi
  • Easy for update streams: on (i,a), just add aZi to X
  • Claim: X^2 is unbiased estimator for F2
  • Proof: E[X^2] = E[(Σi mi Zi)^2]
  •   = E[Σi mi^2 Zi^2 + Σ(i≠j) mi mj Zi Zj]
  •   = Σi mi^2 E[Zi^2] + Σ(i≠j) mi mj E[Zi] E[Zj]
  •   = Σi mi^2 + 0 = F2
  • Last line? Zi^2 = 1, and E[Zi Zj] = E[Zi] E[Zj] = 0
    from independence
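The unbiasedness claim can be checked exhaustively on a toy frequency vector: averaging X^2 over every ±1 sign assignment (standing in for the expectation) recovers F2 exactly. A real sketch would of course draw the signs from a hash family rather than enumerate them.

```python
from itertools import product

def f2_estimate(freq, Z):
    """X^2 for X = sum_i m_i * Z_i."""
    return sum(m * z for m, z in zip(freq, Z)) ** 2

freq = [3, 1, 2, 0]                   # toy frequency vector m
F2 = sum(m * m for m in freq)         # 9 + 1 + 4 = 14
# Enumerate all 2^4 sign vectors: the average of X^2 equals F2 exactly.
estimates = [f2_estimate(freq, Z) for Z in product([-1, 1], repeat=4)]
mean = sum(estimates) / len(estimates)
```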
Estimation Error?
  • Chebyshev bound
  • Define Y = X^2 → E[Y] = E[X^2] = Σi mi^2 = F2
  • Observe E[X^4] = E[(Σ mi Zi)^4]
  •   = E[Σ mi^4 Zi^4] + 4 E[Σ mi mj^3 Zi Zj^3]
      + 6 E[Σ mi^2 mj^2 Zi^2 Zj^2]
      + 12 E[Σ mi mj mk^2 Zi Zj Zk^2]
      + 24 E[Σ mi mj mk ml Zi Zj Zk Zl]
  •   = Σ mi^4 + 6 Σ(i<j) mi^2 mj^2
  • By definition Var[Y] = E[Y^2] − (E[Y])^2
      = E[X^4] − (E[X^2])^2
  •   = Σ mi^4 + 6 Σ mi^2 mj^2 − (Σ mi^4 + 2 Σ mi^2 mj^2)
  •   = 4 Σ(i<j) mi^2 mj^2
      ≤ 2 (E[X^2])^2 = 2 F2^2

Estimation Error?
  • Chebyshev bound
  • P[relative estimation error > ε] ≤ Var[Y]/(ε^2 F2^2)
    ≤ 2/ε^2
  • Problem: what if we want ε really small?
  • Solution
  • Compute s = 8/ε^2 independent copies of X
  • Estimator Y = mean(Xi^2)
  • Variance reduces by factor s
  • P[relative estimation error > ε] ≤ 2/(s ε^2) = 1/4

Boosting Technique
  • Algorithm A: randomized ε-approximate estimator of f
  • P[(1−ε)f ≤ A ≤ (1+ε)f] ≥ 3/4
  • Heavy Tail Problem: e.g. P[A = f−z, f, f+z] = 1/16,
    3/4, 3/16 (an asymmetric tail skews the mean)
  • Boosting Idea
  • O(log 1/δ) independent estimates from A
  • Return median of estimates
  • Claim: P[median is ε-approximate] > 1 − δ
  • P[specific estimate is ε-approximate] ≥ 3/4
  • Bad event only if > 50% of estimates not
    ε-approximate
  • Binomial tail: probability less than δ
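The median-of-means trick above can be sketched in two lines (`boosted_estimate` is an illustrative name): a single huge outlier ruins the mean of the basic estimates, but not the median of the group averages.

```python
import statistics

def boosted_estimate(samples, s):
    """Chunk basic estimates into groups of s, average each group,
    return the median: averaging shrinks variance by s, the median
    defeats heavy-tailed outliers."""
    groups = [sum(samples[i:i + s]) / s for i in range(0, len(samples), s)]
    return statistics.median(groups)
```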

Overall Space Requirement
  • Observe
  • Let m = Σ mi
  • Each hash needs O(log m)-bit counter
  • s = 8/ε^2 hash functions for each estimator
  • O(log 1/δ) such estimators
  • Total: O(ε^−2 log(1/δ) log m) bits
  • Question: space for storing hash function?

Sketching Paradigm
  • Random Sketch: inner product
  • frequency vector m = (m0, m1, …, mN−1)
  • random vector Z (currently, uniform ±1)
  • Observe
  • Linearity → Sketch(m1) + Sketch(m2) = Sketch
    (m1 + m2)
  • Ideal for distributed computing
  • Observe
  • Suppose: given i, can efficiently generate Zi
  • Then can maintain sketch for update streams
  • Problem
  • Must generate Zi = h(i) on first appearance of i
  • Need O(N) memory to store h explicitly
  • Need O(N) random bits

Two Birds, One Stone
  • Pairwise Independent Z1, Z2, …, Zn
  • for all Zi and Zk: P[Zi = x, Zk = y] =
    P[Zi = x] P[Zk = y]
  • property: E[Zi Zk] = E[Zi] E[Zk]
  • Example: linear hash function
  • Seed S = <a,b> from {0, …, p−1}, where p is prime
  • Zi = h(i) = ai + b (mod p)
  • Claim: Z1, Z2, …, Zn are pairwise independent
  • Zi = x and Zk = y ⇔ x = ai + b (mod p) and
    y = ak + b (mod p)
  • fixing i, k, x, y → unique solution for a, b
  • P[Zi = x, Zk = y] = 1/p^2 = P[Zi = x] P[Zk = y]
  • Memory/Randomness: n log p → 2 log p
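The counting argument on this slide can be verified exhaustively for a small prime: over all p^2 seeds (a, b), every output pair (x, y) for two fixed distinct inputs occurs exactly once.

```python
from collections import Counter

def linear_hash(a, b, p):
    """h(i) = a*i + b (mod p): a pairwise-independent family for prime p."""
    return lambda i: (a * i + b) % p

# Exhaustive check for p = 5, inputs i = 1 and k = 2: each pair (h(1), h(2))
# arises from exactly one seed, so P[Zi = x, Zk = y] = 1/p^2.
p = 5
pairs = Counter()
for a in range(p):
    for b in range(p):
        h = linear_hash(a, b, p)
        pairs[(h(1), h(2))] += 1
```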

Wait a minute!
  • Doesn't pairwise independence screw up proofs?
  • No: the E[X^2] calculation only has degree-2 terms
  • But what about Var[X^2]?
  • Need 4-wise independence

Application: Join-Size Estimation
  • Given
  • Join attribute frequencies f1 and f2
  • Join size = f1·f2
  • Define X1 = f1·Z and X2 = f2·Z
  • Choose Z as 4-wise independent uniform ±1
  • Exercise: show, as before,
  • E[X1 X2] = f1·f2
  • Var[X1 X2] ≤ 2 |f1|^2 |f2|^2
  • Hint: Cauchy-Schwarz, a·b ≤ |a||b|

Bounding Error Probability
  • Using s copies of the Xs, taking their mean Y
  • P[|Y − f1·f2| ≥ ε f1·f2] ≤ Var(Y)/(ε^2 (f1·f2)^2)
  •   ≤ 2 |f1|^2 |f2|^2 / (s ε^2 (f1·f2)^2)
  •   = 2 / (s ε^2 cos^2 θ),
    where cos θ = f1·f2 / (|f1| |f2|)
  • Bounding error probability?
  • Need s > 2/(ε^2 cos^2 θ)
  • Memory? O(log(1/δ) cos^−2 θ ε^−2 (log N + log m))
  • Problem
  • To choose s, need a-priori lower bound on cos θ
  • What if cos θ really small?

Sketch Partitioning
Idea for dealing with the |f1|^2 |f2|^2 / (f1·f2)^2
issue: partition the domain into regions where
self-join size is smaller, to compensate for small
join-size (cos θ)
Sketch Partitioning
  • Idea
  • intelligently partition join-attribute space
  • need coarse statistics on stream
  • build independent sketches for each partition
  • Estimate: Σ of partition estimates
  • Variance: Σ of partition variances

Sketch Partitioning
  • Partition Space Allocation?
  • Can solve optimally, given domain partition
  • Optimal Partition: find K-partition to minimize
    the total variance
  • Results
  • Dynamic Programming: optimal solution for single
    join
  • NP-hard for queries with multiple joins

Fk for k > 2
  • Assume stream length m is known (Exercise:
    show can fix with log m space overhead by
    repeated-doubling estimate of m.)
  • Choose random stream item ap, with p
    uniform from {1, 2, …, m}
  • Suppose ap = v ∈ {0, 1, …, N−1}
  • Count subsequent frequency of v:
  • r = |{q : q ≥ p, aq = v}|
  • Define X = m (r^k − (r−1)^k)

  • Stream
  • 7,8,5,1,7,5,2,1,5,4,5,10,6,5,4,1,4,7,3,8
  • m = 20
  • p = 9
  • ap = 5
  • r = 3
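The slide's example, plus an exhaustive check that X is unbiased: averaging X over all m choices of p telescopes, value by value, to Σ mi^k = Fk (`fk_basic_estimate` is an illustrative name).

```python
from collections import Counter

def fk_basic_estimate(stream, p, k):
    """X = m * (r^k - (r-1)^k), r = occurrences of a_p from position p on
    (p is 0-indexed here)."""
    m = len(stream)
    v = stream[p]
    r = sum(1 for q in range(p, m) if stream[q] == v)
    return m * (r ** k - (r - 1) ** k)

stream = [7, 8, 5, 1, 7, 5, 2, 1, 5, 4, 5, 10, 6, 5, 4, 1, 4, 7, 3, 8]
k = 2
# E[X]: average over every position p; each value with count c contributes
# (1^k - 0^k) + ... + (c^k - (c-1)^k) = c^k, so the average is exactly F_k.
avg = sum(fk_basic_estimate(stream, p, k) for p in range(len(stream))) / len(stream)
Fk = sum(c ** k for c in Counter(stream).values())
```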

Fk for k > 2
  • Var(X) ≤ k N^(1−1/k) Fk^2
  • Bounded error probability → s = O(k N^(1−1/k) / ε^2)
  • Boosting → memory bound
  • O(k N^(1−1/k) ε^−2 (log 1/δ)(log N +
    log m))

(Unbiasedness: summing over the m choices of stream position gives E[X] = Fk.)
Frequency Moments
  • F0: distinct values problem (Lecture 15)
  • F1: sequence length
  • for case with deletions, use Cauchy distribution
  • F2: self-join size/Gini index (Today)
  • Fk for k > 2
  • omitting grungy details
  • can achieve space bound
  • O(k N^(1−1/k) ε^−2 (log 1/δ)(log N + log m))
  • F∞: maximum frequency

Communication Complexity
  • Cooperatively compute function f(A,B)
  • Minimize bits communicated
  • Unbounded computational power
  • Communication Complexity C(f): bits exchanged by
    optimal protocol
  • Protocols?
  • 1-way versus 2-way
  • deterministic versus randomized
  • Cδ(f): randomized complexity for error
    probability δ

ALICE: input A
BOB: input B
Streaming Communication Complexity
  • Stream algorithm → 1-way communication protocol
  • Simulation Argument
  • Given algorithm S computing f over streams
  • Alice initiates S, providing A as input stream
  • Communicates to Bob S's state after seeing A
  • Bob resumes S, providing B as input stream
  • Theorem: a stream algorithm's space requirement is
    at least the communication complexity C(f)

Example: Set Disjointness
  • Set Disjointness (DIS)
  • A, B subsets of {1, 2, …, N}
  • Output: 1 if A ∩ B = ∅, else 0
  • Theorem: Cδ(DIS) = Ω(N), for any δ < 1/2

Lower Bound for F∞
  • Theorem: fix ε < 1/3, δ < 1/2. Any stream
    algorithm S with
  • P[(1−ε)F∞ < S < (1+ε)F∞] > 1 − δ
  • needs Ω(N) space
  • Proof
  • Claim: S → 1-way protocol for DIS (on any sets A
    and B)
  • Alice streams set A to S
  • Communicates S's state to Bob
  • Bob streams set B to S
  • Observe
  • F∞ = 2 if A ∩ B ≠ ∅, else 1
  • Relative error ε < 1/3 → DIS solved exactly!
  • P[error] ≤ δ < ½ → Ω(N) space

  • Observe
  • Used only 1-way communication in proof
  • Cδ(DIS) bound was for arbitrary communication
  • Exercise: extend lower bound to multi-pass
    algorithms
  • Lower Bound for Fk, k > 2
  • Need to increase gap beyond 2
  • Multiparty Set Disjointness: t players
  • Theorem: fix ε, δ < ½ and k > 5. Any stream
    algorithm S with
  • P[(1−ε)Fk < S < (1+ε)Fk] > 1 − δ
  • needs Ω(N^(1−(2+δ)/k)) space
  • Implies Ω(N^(1/2)) even for multi-pass algorithms

Tracking High-Frequency Items
Problem 1: Top-K List (Charikar-Chen-Farach-Colton)
  • The Google Problem
  • Return list of k most frequent items in stream
  • Motivation
  • search engine queries, network traffic, …
  • Remember
  • Saw lower bound recently!
  • Solution
  • Data structure Count-Sketch → maintaining
    count-estimates of high-frequency elements

  • Notation
  • Assume 1, 2, …, N in order of frequency
  • mi is frequency of ith most frequent element
  • m = Σ mi is number of elements in stream
  • FindCandidateTop
  • Input: stream S, int k, int p
  • Output: list of p elements containing top k
  • Naive sampling gives solution with p = Θ(m log k
    / mk)
  • FindApproxTop
  • Input: stream S, int k, real ε
  • Output: list of k elements, each of frequency mi
    > (1−ε) mk
  • Naive sampling gives no solution

Main Idea
  • Consider
  • single counter X
  • hash function h: {1, 2, …, N} → {−1, +1}
  • Input element i → update counter X += Zi = h(i)
  • For each r, use X·Zr as estimator of mr
  • Theorem: E[X·Zr] = mr

  • X = Σi mi Zi
  • E[X·Zr] = E[Σi mi Zi Zr]
    = Σ(i≠r) mi E[Zi] E[Zr] + mr E[Zr^2] = mr
  • Cross-terms cancel

Finding Max Frequency Element
  • Problem: Var[X] ≈ F2 = Σi mi^2
  • Idea: t counters, with independent 4-wise hashes
    h1, …, ht : i → {−1, +1}, one per counter
  • Use t = O(log m · Σ mi^2 / (ε m1)^2)
  • Claim: new variance ≈ Σ mi^2 / t ≈ (ε m1)^2 / log m
  • Overall Estimator
  • repeat: median of averages
  • with high probability, approximates m1
Problem with Array of Counters
  • Variance dominated by highest frequency
  • Estimates for less-frequent elements, like the kth,
  • corrupted by higher frequencies
  • variance >> mk
  • Avoiding Collisions?
  • spread out high frequency elements
  • replace each counter with hashtable of b counters

Count Sketch
  • Hash Functions
  • 4-wise independent hashes h1, …, ht and s1, …, st
  • hashes independent of each other
  • Data structure: t hashtables of counters X(r,c),
    each with buckets c = 1, 2, …, b
Overall Algorithm
  • sr(i): one of the b counters in the rth hashtable
  • Input i → for each r, update X(r,sr(i)) += hr(i)
  • Estimator(mi) = medianr X(r,sr(i)) · hr(i)
  • Maintain heap of k top elements seen so far
  • Observe
  • Collisions with high-frequency items not completely
    eliminated
  • A few of the estimates X(r,sr(i)) · hr(i) could have
    high variance
  • Median not sensitive to these poor estimates
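The update and estimate rules above might be sketched as follows. This is an illustration, not the paper's implementation: for brevity it derives s_r and h_r from Python's built-in `hash` with per-row seeds rather than from 4-wise independent families.

```python
import random
import statistics

class CountSketch:
    """t hashtables of b counters; estimate(i) = median over rows of
    X(r, s_r(i)) * h_r(i)."""
    def __init__(self, t, b, seed=0):
        rng = random.Random(seed)
        self.t, self.b = t, b
        self.tables = [[0] * b for _ in range(t)]
        # per-row seeds so s_r and h_r are fixed functions of the item
        self.seeds = [(rng.randrange(1 << 30), rng.randrange(1 << 30))
                      for _ in range(t)]

    def _s(self, r, i):   # bucket hash s_r(i) in {0, ..., b-1}
        return hash((self.seeds[r][0], i)) % self.b

    def _h(self, r, i):   # sign hash h_r(i) in {-1, +1}
        return 1 if hash((self.seeds[r][1], i)) % 2 else -1

    def update(self, i, a=1):
        for r in range(self.t):
            self.tables[r][self._s(r, i)] += a * self._h(r, i)

    def estimate(self, i):
        return statistics.median(self.tables[r][self._s(r, i)] * self._h(r, i)
                                 for r in range(self.t))
```

With a single item in the sketch, the sign cancels against itself and the estimate is exact; collisions only matter once several items share buckets.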

Avoiding Large Items
  • b > O(k) → with constant probability, no collision
    with top-k elements
  • t hashtables represent independent trials
  • Need O(log m/δ) trials to estimate with probability
    1 − δ
  • Also need small variance from colliding small items
  • Claim
  • with constant probability, variance due to small
    items in each estimate < (Σ(i>k) mi^2)/b
  • Final bound: b = O(k + Σ(i>k) mi^2 / (ε mk)^2)

Final Results
  • Zipfian Distribution: mi ∝ 1/i^z ("power law")
  • FindApproxTop
  • O((k + (Σ(i>k) mi^2)/(ε mk)^2) log m/δ)
  • Roughly: the sampling bound with frequencies squared
  • Zipfian gives improved results
  • FindCandidateTop
  • Zipf parameter 0.5
  • O(k log N log m)
  • Compare: sampling bound O((kN)^0.5 log k)

Problem 2: Elephants-and-Ants (Manku-Motwani)
  • Identify items whose current frequency exceeds
    support threshold s = 0.1%.
  • Jacobson 2000, Estan-Verghese 2001

Algorithm 1: Lossy Counting
Step 1: Divide the stream into windows.
Window-size w is a function of the support s (specified
below).
Lossy Counting in Action ...
Lossy Counting (continued)
Error Analysis
How much do we undercount?
If current size of stream = N and
window-size w = 1/ε, then
number of windows = εN
frequency error ≤ number of windows = εN
Rule of thumb: set ε = 10% of support s.
Example: given support frequency s = 1%,
set error frequency ε = 0.1%.
Putting it all together…
Output: elements with counter values exceeding (s−ε)N
Approximation guarantees: frequencies
underestimated by at most εN; no false
negatives; false positives have true
frequency at least (s−ε)N
  • How many counters do we need?
  • Worst case bound: (1/ε) log(εN) counters
  • Implementation details…
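One way the windowed counting above might be coded (a sketch; `lossy_count` is an illustrative name): each counter carries a deficit delta equal to the window id at creation minus one, and light counters are pruned at every window boundary.

```python
import math

def lossy_count(stream, epsilon):
    """Lossy Counting: never overcounts, undercounts by at most epsilon*N."""
    w = math.ceil(1 / epsilon)      # window (bucket) width
    counters = {}                   # item -> [count, delta]
    bucket = 1                      # id of the current window
    for n, x in enumerate(stream, start=1):
        if x in counters:
            counters[x][0] += 1
        else:
            counters[x] = [1, bucket - 1]   # max undercount so far
        if n % w == 0:              # window boundary: prune light counters
            counters = {i: c for i, c in counters.items()
                        if c[0] + c[1] > bucket}
            bucket += 1
    return {i: c[0] for i, c in counters.items()}
```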

Number of Counters?
  • Window size w = 1/ε
  • Number of windows m = εN
  • ni = number of counters alive over the last i windows
  • Fact: Σ(i≤j) i·ni ≤ jw for every j (the last j
    windows hold only jw items)
  • Claim
  • Counter must average 1 increment/window to stay
    alive
  • → number of active counters ≤ (1/ε) log(εN)

Frequency Errors
For counter (X, c),
true frequency is in [c, c + εN].
Trick: track the number of windows t the counter
has been active. For counter (X, c, t),
true frequency is in [c, c + t − 1].
If t = 1, no error!
Batch Processing: decrements after k windows
Algorithm 2: Sticky Sampling
  • Create counters by sampling
  • Maintain exact counts thereafter
What is the sampling rate?
Sticky Sampling (continued)
For finite stream of length N: sampling rate =
(2/εN) log(1/(sδ)),
where δ = probability of failure.
Output: elements with counter values exceeding (s−ε)N
Same rule of thumb: set ε = 10% of support s.
Example: given support threshold s = 1%,
set error threshold ε = 0.1% and
failure probability δ = 0.01%.
Number of counters?
Finite stream of length N: sampling rate (2/εN)
log(1/(sδ)).
Infinite stream with unknown N: gradually adjust
the sampling rate.
In either case, expected number of counters =
(2/ε) log(1/(sδ))
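The fixed-rate finite-stream core of the idea might be sketched as follows (`sticky_sampling` is an illustrative name; the gradual rate adjustment for unknown N is omitted). Counts are exact from the moment an item is sampled in, so estimates never exceed true frequencies.

```python
import random

def sticky_sampling(stream, rate, rng=random):
    """Start a counter for an unseen item with probability `rate`;
    count exactly thereafter."""
    counts = {}
    for x in stream:
        if x in counts:
            counts[x] += 1          # once sampled in, counting is exact
        elif rng.random() < rate:
            counts[x] = 1
    return counts
```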
References: Synopses
  • Synopsis data structures for massive data sets.
    Gibbons and Matias. DIMACS 1999.
  • Tracking Join and Self-Join Sizes in Limited
    Storage. Alon, Gibbons, Matias, and Szegedy. PODS
    1999.
  • Join Synopses for Approximate Query Answering.
    Acharya, Gibbons, Poosala, and Ramaswamy. SIGMOD
    1999.
  • Random Sampling for Histogram Construction: How
    much is enough? Chaudhuri, Motwani, and
    Narasayya. SIGMOD 1998.
  • Random Sampling Techniques for Space Efficient
    Online Computation of Order Statistics of Large
    Datasets. Manku, Rajagopalan, and Lindsay. SIGMOD
    1999.
  • Space-efficient online computation of quantile
    summaries. Greenwald and Khanna. SIGMOD 2001.

References: Sampling
  • Random Sampling with a Reservoir. Vitter.
    Transactions on Mathematical Software 11(1):37-57
    (1985).
  • On Sampling and Relational Operators. Chaudhuri
    and Motwani. Bulletin of the Technical Committee
    on Data Engineering (1999).
  • On Random Sampling over Joins. Chaudhuri,
    Motwani, and Narasayya. SIGMOD 1999.
  • Congressional Samples for Approximate Answering
    of Group-By Queries. Acharya, Gibbons, and
    Poosala. SIGMOD 2000.
  • Overcoming Limitations of Sampling for
    Aggregation Queries. Chaudhuri, Das, Datar,
    Motwani, and Narasayya. ICDE 2001.
  • A Robust Optimization-Based Approach for
    Approximate Answering of Aggregate Queries.
    Chaudhuri, Das, and Narasayya. SIGMOD 2001.
  • Sampling From a Moving Window Over Streaming
    Data. Babcock, Datar, and Motwani. SODA 2002.
  • Sampling algorithms: lower bounds and
    applications. Bar-Yossef, Kumar, and Sivakumar.
    STOC 2001.
References: Sketches
  • Probabilistic counting algorithms for data base
    applications. Flajolet and Martin. JCSS (1985).
  • The space complexity of approximating the
    frequency moments. Alon, Matias, and Szegedy.
    STOC 1996.
  • Approximate Frequency Counts over Streaming Data.
    Manku and Motwani. VLDB 2002.
  • Finding Frequent Items in Data Streams. Charikar,
    Chen, and Farach-Colton. ICALP 2002.
  • An Approximate L1-Difference Algorithm for
    Massive Data Streams. Feigenbaum, Kannan,
    Strauss, and Viswanathan. FOCS 1999.
  • Stable Distributions, Pseudorandom Generators,
    Embeddings and Data Stream Computation. Indyk.
    FOCS 2000.