Data Stream Algorithms: Intro, Sampling, Entropy

Graham Cormode graham_at_research.att.com

Outline

- Introduction to Data Streams
- Motivating examples and applications
- Data Streaming models
- Basic tail bounds
- Sampling from data streams
- Sampling to estimate entropy

Data is Massive

- Data is growing faster than our ability to store or index it
- There are 3 billion telephone calls in the US each day, 30 billion emails daily, 1 billion SMS and IMs.
- Scientific data: NASA's observation satellites generate billions of readings each per day.
- IP network traffic: up to 1 billion packets per hour per router. Each ISP has many (hundreds of) routers!
- Whole genome sequences for many species are now available, each megabytes to gigabytes in size

Massive Data Analysis

- Must analyze this massive data:
- Scientific research (monitor environment, species)
- System management (spot faults, drops, failures)
- Customer research (association rules, new offers)
- For revenue protection (phone fraud, service abuse)
- Else, why even measure this data?

Example Network Data

- Networks are sources of massive data: the metadata per hour per router is gigabytes
- Fundamental problem of data stream analysis: too much information to store or transmit
- So process data as it arrives: one pass, small space. This is the data stream approach.
- Approximate answers to many questions are OK, if there are guarantees of result quality

IP Network Monitoring Application

Example NetFlow IP Session Data

- 24x7 IP packet/flow data-streams at network elements
- Truly massive streams arriving at rapid rates
- AT&T/Sprint collect 1 Terabyte of NetFlow data each day
- Often shipped off-site to data warehouse for off-line analysis

Packet-Level Data Streams

- Single 2 Gb/sec link; say avg packet size is 50 bytes
- Number of packets/sec = 5 million
- Time per packet = 0.2 microsec
- If we only capture header information per packet (src/dest IP, time, no. of bytes, etc.): at least 10 bytes.
- Space per second is 50 MB
- Space per day is about 4.5 TB per link
- ISPs typically have hundreds of links!
- Analyzing packet content streams: order(s) of magnitude harder

Network Monitoring Queries

[Figure: a Network Operations Center (NOC) monitoring routers R1, R2 and R3, which connect peer, enterprise, PSTN and DSL/cable networks. Annotation: off-line analysis is slow and expensive.]

Streaming Data Questions

- Network managers ask questions requiring us to analyze the data:
- How many distinct addresses seen on the network?
- Which destinations or groups use most bandwidth?
- Find hosts with similar usage patterns?
- Extra complexity comes from limited space and time
- Will introduce solutions for these and other problems

Other Streaming Applications

- Sensor networks
- Monitor habitat and environmental parameters
- Track many objects, intrusions, trend analysis
- Utility Companies
- Monitor power grid, customer usage patterns etc.
- Alerts and rapid response in case of problems

Streams Defining Frequency Dbns.

- We will consider streams that define frequency distributions
- E.g. frequency of packets from source A to destination B
- This simple setting captures many of the core algorithmic problems in data streaming
- How many distinct (non-zero) values seen?
- What is the entropy of the frequency distribution?
- What (and where) are the highest frequencies?
- More generally, can consider streams that define multi-dimensional distributions, graphs, geometric data etc.
- But even for frequency distributions, several models are relevant

Data Stream Models

- We model data streams as sequences of simple tuples
- Complexity arises from massive length of streams
- Arrivals only streams:
- Example: (x, 3), (y, 2), (x, 2) encodes the arrival of 3 copies of item x, 2 copies of y, then 2 copies of x.
- Could represent e.g. packets on a network; power usage
- Arrivals and departures:
- Example: (x, 3), (y, 2), (x, -2) encodes final state of (x, 1), (y, 2).
- Can represent fluctuating quantities, or measure differences between two distributions
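The two stream models can be illustrated with a small sketch that folds (item, count) tuples into a frequency vector. The function name `apply_updates` is hypothetical, and the code stores the whole vector (linear space), so it only illustrates the semantics of the models, not a streaming algorithm.

```python
from collections import Counter

def apply_updates(updates):
    """Fold a stream of (item, count) tuples into a frequency vector."""
    freq = Counter()
    for item, count in updates:
        freq[item] += count
        if freq[item] == 0:
            del freq[item]  # drop items whose net count has returned to zero
    return dict(freq)

arrivals_only = [("x", 3), ("y", 2), ("x", 2)]
arrivals_and_departures = [("x", 3), ("y", 2), ("x", -2)]

print(apply_updates(arrivals_only))            # {'x': 5, 'y': 2}
print(apply_updates(arrivals_and_departures))  # {'x': 1, 'y': 2}
```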

Approximation and Randomization

- Many things are hard to compute exactly over a stream
- Is the count of all items the same in two different streams?
- Requires linear space to compute exactly
- Approximation: find an answer correct within some factor
- Find an answer that is within 10% of correct result
- More generally, a $(1 \pm \epsilon)$ factor approximation
- Randomization: allow a small probability of failure
- Answer is correct, except with probability 1 in 10,000
- More generally, success probability $(1 - \delta)$
- Approximation and Randomization: $(\epsilon, \delta)$-approximations

Basic Tools Tail Inequalities

- General bounds on tail probability of a random variable (probability that a random variable deviates far from its expectation)
- Basic Inequalities: Let X be a random variable with expectation $\mu$ and variance $\mathrm{Var}[X]$. Then, for any $\epsilon > 0$, the Markov and Chebyshev inequalities bound how far X can deviate from $\mu$.

Tail Bounds

- Markov Inequality:
- For a random variable Y which takes only non-negative values.
- $\Pr[Y \ge k] \le E[Y]/k$
- (This will be < 1 only for $k > E[Y]$)
- Chebyshev's Inequality:
- For a random variable Y
- $\Pr[|Y - E[Y]| \ge k] \le \mathrm{Var}(Y)/k^2$
- Proof: Set $X = (Y - E[Y])^2$
- $E[X] = E[Y^2 + E[Y]^2 - 2YE[Y]] = E[Y^2] + E[Y]^2 - 2E[Y]^2 = \mathrm{Var}(Y)$
- So $\Pr[|Y - E[Y]| \ge k] = \Pr[(Y - E[Y])^2 \ge k^2]$
- Using Markov: $\le E[(Y - E[Y])^2]/k^2 = \mathrm{Var}(Y)/k^2$
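As a sanity check, both inequalities can be compared against exact tail probabilities. The sketch below uses a binomial variable (number of heads in 100 fair coin flips); the choice of distribution and the thresholds 75 and 10 are assumptions made only for illustration.

```python
from math import comb

def binomial_pmf(n, p, k):
    """Exact probability of k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def tail_check(n=100, p=0.5, markov_k=75, cheb_k=10):
    """Return (exact tail probability, bound) for Markov and Chebyshev."""
    mean = n * p
    var = n * p * (1 - p)
    pmf = [binomial_pmf(n, p, k) for k in range(n + 1)]
    pr_markov = sum(pmf[k] for k in range(n + 1) if k >= markov_k)
    pr_cheb = sum(pmf[k] for k in range(n + 1) if abs(k - mean) >= cheb_k)
    return {
        "markov": (pr_markov, mean / markov_k),        # Pr[Y >= k] vs E[Y]/k
        "chebyshev": (pr_cheb, var / cheb_k**2),       # Pr[|Y-E[Y]| >= k] vs Var/k^2
    }

for name, (exact, bound) in tail_check().items():
    print(f"{name}: exact={exact:.6f} bound={bound:.6f}")
```

Both bounds hold but are loose here, which is typical: they use only the mean (Markov) or the mean and variance (Chebyshev).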

Sampling From a Data Stream

- Fundamental problem: sample m items uniformly from stream
- Useful: approximate costly computation on small sample
- Challenge: don't know how long stream is
- So when/how often to sample?
- Two solutions, apply to different situations:
- Reservoir sampling (dates from 1980s?)
- Min-wise sampling (dates from 1990s?)

Reservoir Sampling

- Sample first m items
- Choose to sample the i'th item (i > m) with probability m/i
- If sampled, randomly replace a previously sampled item
- Optimization: when i gets large, compute which item will be sampled next, skip over intervening items. [Vitter 85]
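A minimal sketch of the basic procedure above (without the skip optimization). The function name `reservoir_sample` and the optional `rng` parameter are assumptions for illustration.

```python
import random

def reservoir_sample(stream, m, rng=None):
    """Return a uniform sample of m items from a stream of unknown length."""
    rng = rng or random.Random()
    sample = []
    for i, item in enumerate(stream, start=1):
        if i <= m:
            sample.append(item)               # keep the first m items
        elif rng.random() < m / i:            # sample the i-th item w.p. m/i
            sample[rng.randrange(m)] = item   # replace a uniformly chosen resident
    return sample

print(reservoir_sample(range(1_000_000), 5, random.Random(42)))
```

The reservoir holds at most m items at any time, regardless of the stream length.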

Reservoir Sampling - Analysis

- Analyze simple case: sample size m = 1
- Probability i'th item is the sample from stream length n:
- Prob. i is sampled on arrival × prob. i survives to end
- $\frac{1}{i} \times \frac{i}{i+1} \times \frac{i+1}{i+2} \times \cdots \times \frac{n-1}{n} = \frac{1}{n}$
- Case for m > 1 is similar, easy to show uniform probability
- Drawbacks of reservoir sampling: hard to parallelize

Min-wise Sampling

- For each item, pick a random fraction between 0 and 1
- Store item(s) with the smallest random tag [Nath et al. 04]

[Figure: stream items with random tags 0.391, 0.908, 0.291, 0.555, 0.619, 0.273]

- Each item has same chance of least tag, so uniform
- Can run on multiple streams separately, then merge
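A small sketch of min-wise sampling for a single sample, including the merge step for separately processed streams. The names `minwise_sample` and `merge` are hypothetical; tags come from Python's `random` module rather than any particular scheme from the cited paper.

```python
import random

def minwise_sample(stream, rng):
    """Return (tag, item) for the item that drew the smallest random tag."""
    best = None
    for item in stream:
        tag = rng.random()                 # random fraction in [0, 1)
        if best is None or tag < best[0]:
            best = (tag, item)
    return best

def merge(a, b):
    """Combine the samples of two streams: the smaller tag wins."""
    if a is None:
        return b
    if b is None:
        return a
    return a if a[0] <= b[0] else b

rng = random.Random(7)
left = minwise_sample(["p", "q", "r"], rng)
right = minwise_sample(["s", "t"], rng)
print(merge(left, right))
```

Because merging only compares tags, the result is the same as if the two streams had been processed as one, which is what makes the method easy to parallelize.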

Sampling Exercises

- What happens when each item in the stream also has a weight attached, and we want to sample based on these weights?
- Generalize the reservoir sampling algorithm to draw a single sample in the weighted case.
- Generalize reservoir sampling to sample multiple weighted items, and show an example where it fails to give a meaningful answer.
- Research problem: design new streaming algorithms for sampling in the weighted case, and analyze their properties.

Application of Sampling: Entropy

- Given a long sequence of characters
- $S = \langle a_1, a_2, a_3, \ldots, a_m \rangle$, each $a_j \in \{1, \ldots, n\}$
- Let $f_i$ = frequency of i in the sequence
- Compute the empirical entropy:
- $H(S) = -\sum_i (f_i/m) \log (f_i/m) = -\sum_i p_i \log p_i$
- Example: $S = \langle a, b, a, b, c, a, d, a \rangle$
- $p_a = 1/2$, $p_b = 1/4$, $p_c = 1/8$, $p_d = 1/8$
- $H(S) = 1/2 + 1/4 \times 2 + 1/8 \times 3 + 1/8 \times 3 = 7/4$
- Entropy promoted for anomaly detection in networks
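The worked example can be reproduced with a direct (non-streaming) computation. The function name `empirical_entropy` is an assumption; logarithms are taken base 2, which is what the value 7/4 in the example implies.

```python
from collections import Counter
from math import log2

def empirical_entropy(seq):
    """H(S) = -sum_i (f_i/m) log2(f_i/m), computed from exact frequencies."""
    m = len(seq)
    return -sum((f / m) * log2(f / m) for f in Counter(seq).values())

print(empirical_entropy("ababcada"))  # 1.75
```

This uses O(n) space for the counts, which is the "easy" case discussed next.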

Challenge

- Goal: approximate H(S) in space sublinear (poly-log) in m (stream length), n (alphabet size)
- $(\epsilon, \delta)$ approx: answer is $(1 \pm \epsilon)H(S)$ w/prob $1 - \delta$
- Easy if we have O(n) space: compute each $f_i$ exactly
- More challenging if n is huge, m is huge, and we have only one pass over the input in order
- (The data stream model)

Sampling Based Algorithm

- Simple estimator:
- Randomly sample a position j in the stream
- Count how many times $a_j$ appears subsequently = r
- Output $X = -(r \log(r/m) - (r-1) \log((r-1)/m))$
- Claim: Estimator is unbiased: $E[X] = H(S)$
- Proof: prob of picking j = 1/m, sum telescopes correctly
- Variance of estimate is not too large: $\mathrm{Var}[X] = O(\log^2 m)$
- Observe that $|X| \le \log m$
- $\mathrm{Var}[X] = E[(X - E[X])^2] < (\max(X) - \min(X))^2 = O(\log^2 m)$
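The unbiasedness claim can be checked on a small stream by averaging X over every possible sampled position. This is an offline sketch: `estimate_at` (a hypothetical name) scans the suffix to count r, whereas a streaming implementation would count r on the fly after choosing j. Here r includes position j itself, and the term with r - 1 = 0 is taken as 0.

```python
from collections import Counter
from math import log2

def estimate_at(stream, j):
    """Value of the basic estimator X when position j (0-based) is sampled."""
    m = len(stream)
    r = sum(1 for a in stream[j:] if a == stream[j])  # occurrences from j onward
    x = r * log2(m / r)                 # equals -r log(r/m)
    if r > 1:
        x -= (r - 1) * log2(m / (r - 1))
    return x

def entropy(stream):
    m = len(stream)
    return -sum((f / m) * log2(f / m) for f in Counter(stream).values())

stream = "ababcada"
mean_x = sum(estimate_at(stream, j) for j in range(len(stream))) / len(stream)
print(mean_x, entropy(stream))  # the two values agree up to rounding
```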

Analysis of Basic Estimator

- A general technique in data streams:
- Repeat in parallel an unbiased estimator with bounded variance, take average of estimates to improve variance
- $\mathrm{Var}[\frac{1}{k}(Y_1 + Y_2 + \cdots + Y_k)] = \frac{1}{k}\mathrm{Var}[Y]$
- By Chebyshev, need k repetitions to be $\mathrm{Var}[X]/(\epsilon^2 E^2[X])$
- For entropy, this means space $k = O(\log^2 m / (\epsilon^2 H^2(S)))$
- Problem for entropy: when H(S) is very small?
- Space needed for an accurate approx goes as $1/H^2$!

Low Entropy

- But... what does a low entropy stream look like?
- aaaaaaaaaaaaaaaaaaaaaaaaaaaaaabaaaaa
- Very boring most of the time, we are only rarely surprised
- Can there be two frequent items?
- aabababababababaababababbababababababa
- No! That's high entropy (≈ 1 bit / character)
- Only way to get $H(S) = o(1)$ is to have only one character with $p_i$ close to 1

Removing the frequent character

- Write entropy as
- $-p_a \log p_a + (1 - p_a) H(S')$
- Where S' = stream S with all 'a's removed
- Can show:
- Doesn't matter if H(S') is small: as $p_a$ is large, additive error on H(S') ensures relative error on $(1 - p_a)H(S')$
- Relative error $(1 - p_a)$ on $p_a$ gives relative error on $p_a \log p_a$
- Summing both (positive) terms gives relative error overall

Finding the frequent character

- Ejecting a is easy if we know in advance what it is
- Can then compute $p_a$ exactly
- Can find online deterministically:
- Assume $p_a > 2/3$ (if not, $H(S) > 0.9$, and original alg works)
- Run a heavy hitters algorithm on the stream (see later)
- Modify analysis, find a and $p_a \pm \epsilon(1 - p_a)$
- But... how to also compute H(S') simultaneously if we don't know a from the start... do we need two passes?

Always have a back up plan...

- Idea: keep two samples to build our estimator
- If at the end one of our samples is 'a', use the other
- How to do this and ensure uniform sampling?
- Pick first sample with min-wise sampling
- At end of the stream, if the sampled character = 'a', we want to sample from the stream ignoring all 'a's
- This is just the character achieving the smallest label distinct from the one that achieves the smallest label
- Can track information to do this in a single pass, constant space

Sampling Two Tokens

[Figure: a stream of tokens A, B, C, D, each assigned a random tag, with repeat counts tracked for the sampled tokens]

- Assign tags, choose first token as before
- Delete all occurrences of first token
- Choose token with min remaining tag; count repeats
- Implementation: keep track of two triples
- (min tag, corresponding token, number of repeats)
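One way the two triples might be maintained in a single pass is sketched below. The function name `two_samples` and the choice to pass tags in explicitly (so the run is reproducible) are assumptions; this is an illustration of the bookkeeping, not the authors' exact implementation. Ties between tags are ignored, since random real-valued tags collide with negligible probability.

```python
def two_samples(stream, tags):
    """One pass over (token, tag) pairs, keeping two (tag, token, repeats) triples.

    `first` holds the smallest tag overall. `second` holds the smallest tag
    among positions whose token differs from first's token. `repeats` counts
    occurrences of the token from the sampled position onward.
    """
    first = second = None
    for token, tag in zip(stream, tags):
        if first is None or tag < first[0]:
            if first is not None and first[1] != token:
                second = first                 # old minimum becomes the back-up
            first = (tag, token, 1)
        elif token == first[1]:
            first = (first[0], token, first[2] + 1)
        elif second is None or tag < second[0]:
            second = (tag, token, 1)           # new minimum among other tokens
        elif token == second[1]:
            second = (second[0], token, second[2] + 1)
    return first, second

stream = list("BCDBBBAAAAAC")
tags = [((7 * j) % 12 + 0.5) / 12 for j in range(12)]  # distinct illustrative tags
print(two_samples(stream, tags))
```

If the first sample turns out to be the frequent character, the second triple is a uniform sample from the stream with that character removed.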

Putting it all together

- Can combine all these pieces
- Build an estimator based on tracking this information, deciding whether there is a frequent character or not
- A more involved Chernoff bounds argument improves number of repetitions of estimator from $O(\epsilon^{-2}\mathrm{Var}[X]/E^2[X])$ to $O(\epsilon^{-2}\mathrm{Range}[X]/E[X]) = O(\epsilon^{-2} \log m)$
- In $O(\epsilon^{-2} \log m \log 1/\delta)$ space (words) we can compute an $(\epsilon, \delta)$ approximation to H(S) in a single pass

Entropy Exercises

- As a subroutine, we need to find an element that occurs more than 2/3 of the time and estimate its weight
- How can we find a frequently occurring item?
- How can we estimate its weight p with $\epsilon(1-p)$ error?
- Our algorithm uses $O(\epsilon^{-2} \log m \log 1/\delta)$ space, could this be improved or is it optimal (lower bounds)?
- Our algorithm updates each sampled pair for every update, how quickly can we implement it?
- (Research problem) What if there are multiple distributed streams and we want to compute the entropy of their union?

Data Stream Algorithms: Frequency Moments

Graham Cormode graham_at_research.att.com

Frequency Moments

- Introduction to Frequency Moments and Sketches
- Count-Min sketch for $F_\infty$ and frequent items
- AMS Sketch for F2
- Estimating F0
- Extensions
- Higher frequency moments
- Combined frequency moments

Last Time

- Introduced data streams and data stream models
- Focus on a stream defining a frequency distribution
- Sampling to draw a uniform sample from the stream
- Entropy estimation: based on sampling

This Time: Frequency Moments

- Given a stream of updates, let $f_i$ be the number of times that item i is seen in the stream
- Define $F_k$ of the stream as $\sum_i (f_i)^k$, the k'th Frequency Moment
- "The Space Complexity of Approximating the Frequency Moments" by Alon, Matias, Szegedy in STOC 1996 studied this problem
- Awarded Gödel prize in 2005
- Set the pattern for many streaming algorithms to follow
- Frequency moments are at the core of many streaming problems

Frequency Moments

- $F_0$: count 1 if $f_i \ne 0$: number of distinct items
- $F_1$: length of stream, easy
- $F_2$: sum the squares of the frequencies: self-join size
- $F_k$: related to statistical moments of the distribution
- $F_\infty$ (really $\lim_{k \to \infty} F_k^{1/k}$): dominated by the largest $f_i$, finds the largest frequency
- Different techniques needed for each one.
- Mostly sketch techniques, which compute a certain kind of random linear projection of the stream
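For reference, the definitions above can be evaluated exactly when the full frequency vector fits in memory. The helper name `frequency_moment` is hypothetical; the streaming algorithms that follow approximate these same quantities in small space.

```python
from collections import Counter

def frequency_moment(stream, k):
    """Exact F_k = sum_i f_i^k; k = float('inf') returns the largest frequency."""
    freqs = Counter(stream).values()
    if k == float("inf"):
        return max(freqs)
    return sum(f ** k for f in freqs if f != 0)

stream = "ababcada"   # frequencies: a=4, b=2, c=1, d=1
print(frequency_moment(stream, 0))             # 4 distinct items
print(frequency_moment(stream, 1))             # 8 = stream length
print(frequency_moment(stream, 2))             # 22 = 16 + 4 + 1 + 1
print(frequency_moment(stream, float("inf")))  # 4 = largest frequency
```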

Sketches

- Not every problem can be solved with sampling
- Example: counting how many distinct items in the stream
- If a large fraction of items aren't sampled, don't know if they are all same or all different
- Other techniques take advantage that the algorithm can see all the data even if it can't remember it all
- (To me) a sketch is a linear transform of the input
- Model stream as defining a vector, sketch is result of multiplying stream vector by an (implicit) matrix: a linear projection

Trivial Example of a Sketch

[Figure: two binary streams, 1 0 1 1 1 0 1 0 1 and 1 0 1 1 0 0 1 0 1]

- Test if two (asynchronous) binary streams are equal: $d_=(x, y) = 0$ iff $x = y$, 1 otherwise
- To test in small space: pick a random hash function h
- Test $h(x) = h(y)$: small chance of false positive, no chance of false negative.
- Compute h(x), h(y) incrementally as new bits arrive (e.g. $h(x) = \sum_i x_i t^i \bmod p$ for random prime p, and $t < p$)
- Exercise: extend to real valued vectors in update model
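A sketch of the incremental fingerprint described above. As a simplifying assumption, the prime is fixed (the Mersenne prime 2^61 - 1) instead of being chosen at random as on the slide; only the evaluation point t is random. The class name `Fingerprint` is hypothetical.

```python
import random

P = (1 << 61) - 1  # fixed prime (assumption); the slide picks p at random

class Fingerprint:
    """Maintains h(x) = sum_i x_i * t^i mod P as bits arrive."""

    def __init__(self, t):
        self.t = t        # evaluation point, 0 < t < P
        self.power = 1    # t^(i-1) before the i-th bit arrives
        self.value = 0    # running fingerprint

    def append(self, bit):
        self.power = (self.power * self.t) % P
        self.value = (self.value + bit * self.power) % P

t = random.randrange(1, P)          # both streams must share the same t
hx, hy = Fingerprint(t), Fingerprint(t)
for bx, by in zip([1, 0, 1, 1, 1, 0, 1, 0, 1], [1, 0, 1, 1, 0, 0, 1, 0, 1]):
    hx.append(bx)
    hy.append(by)
print(hx.value == hy.value)  # False: the streams differ in one position
```

The two streams can be processed at different times or places, as long as both use the same t; only the two fingerprints need to be compared at the end.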

Count-Min Sketch

- Simple sketch idea, can be used as the basis of many different stream mining tasks.
- Model input stream as a vector x of dimension U
- Creates a small summary as an array of $w \times d$ in size
- Use d hash functions to map vector entries to [1..w]
- Works on arrivals only and arrivals & departures streams

[Figure: array CM[i,j] with d rows and w columns]

Count-Min Sketch Structure

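[Figure: an update (j, +c) is added to one bucket $h_k(j)$ in each of the $d = \log 1/\delta$ rows, each of width $w = 2/\epsilon$]

- Each entry in vector x is mapped to one bucket per row.
- Merge two sketches by entry-wise summation
- Estimate x[j] by taking $\min_k CM[k, h_k(j)]$
- Guarantees error less than $\epsilon F_1$ in size $O(1/\epsilon \log 1/\delta)$
- Probability of more error is less than $\delta$

[Cormode, Muthukrishnan 04]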
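A compact sketch of the structure for integer items, using the parameters from the slide ($w = 2/\epsilon$, $d = \log_2 1/\delta$). The class name `CountMin`, the `seed` parameter, and the hash family `((a*j + b) mod p) mod w` with a fixed Mersenne prime are assumptions chosen for illustration.

```python
import math
import random

class CountMin:
    PRIME = (1 << 61) - 1

    def __init__(self, eps, delta, seed=0):
        self.w = math.ceil(2 / eps)
        self.d = max(1, math.ceil(math.log2(1 / delta)))
        rng = random.Random(seed)
        # one pairwise-independent hash function (a, b) per row
        self.hashes = [(rng.randrange(1, self.PRIME), rng.randrange(self.PRIME))
                       for _ in range(self.d)]
        self.table = [[0] * self.w for _ in range(self.d)]

    def _bucket(self, k, j):
        a, b = self.hashes[k]
        return ((a * j + b) % self.PRIME) % self.w

    def update(self, j, c=1):
        for k in range(self.d):
            self.table[k][self._bucket(k, j)] += c

    def estimate(self, j):
        return min(self.table[k][self._bucket(k, j)] for k in range(self.d))

    def merge(self, other):
        # valid only if both sketches share w, d and hash functions (same seed)
        for k in range(self.d):
            for b in range(self.w):
                self.table[k][b] += other.table[k][b]

cm = CountMin(eps=0.01, delta=0.01, seed=1)
for item in [1, 2, 3, 1, 1, 2, 50, 1]:
    cm.update(item)
print(cm.estimate(1))  # at least 4; never an underestimate on arrivals-only streams
```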

Approximation of Point Queries

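- Approximate point query $\hat{x}[j] = \min_k CM[k, h_k(j)]$
- Analysis: In k'th row, $CM[k, h_k(j)] = x[j] + X_{k,j}$
- $X_{k,j} = \sum_{i \ne j} x[i] \cdot [h_k(i) = h_k(j)]$
- $E(X_{k,j}) = \sum_{i \ne j} x[i] \Pr[h_k(i) = h_k(j)] \le \Pr[h_k(i) = h_k(j)] \cdot \sum_i x[i] = \epsilon F_1/2$ by pairwise independence of h
- $\Pr[X_{k,j} \ge \epsilon F_1] \le \Pr[X_{k,j} \ge 2E(X_{k,j})] \le 1/2$ by Markov inequality
- So, $\Pr[\hat{x}[j] \ge x[j] + \epsilon F_1] = \Pr[\forall k.\ X_{k,j} > \epsilon F_1] \le (1/2)^{\log 1/\delta} = \delta$
- Final result: with certainty $x[j] \le \hat{x}[j]$ and with probability at least $1 - \delta$, $\hat{x}[j] < x[j] + \epsilon F_1$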

Applications of Count-Min to $F_\infty$

- Count-Min sketch lets us estimate $f_i$ for any i (up to $\epsilon F_1$)
- $F_\infty$ asks to find $\max_i f_i$
- Slow way: test every i after creating sketch
- Faster way: test every i after it is seen in the stream, and remember largest estimated value
- Alternate way:
- keep a binary tree over the domain of input items, where each node corresponds to a subset
- keep sketches of all nodes at same level
- descend tree to find large frequencies, discarding branches with low frequency

Count-Min Exercises

- The median of a distribution is the item so that the sum of the frequencies of lexicographically smaller items is ½ $F_1$. Use CM sketch to find the (approximate) median.
- Assume the input frequencies follow the Zipf distribution so that the i'th largest frequency is $\Theta(i^{-z})$ for $z > 1$. Show that CM sketch only needs to be size $\epsilon^{-1/z}$ to give same guarantee
- Suppose we have arrival and departure streams where the frequencies of items are allowed to be negative. Extend CM sketch analysis to estimate these frequencies (note, Markov argument no longer works)
- How to find the large absolute frequencies when some are negative? Or in the difference of two streams?

F2 estimation

- AMS sketch (for Alon-Matias-Szegedy) proposed in 1996
- Allows estimation of $F_2$ (second frequency moment)
- Used at the heart of many streaming and non-streaming mining applications: achieves dimensionality reduction
- Here, describe AMS sketch by generalizing CM sketch.
- Uses extra hash functions $g_1 \ldots g_{\log 1/\delta}: \{1 \ldots U\} \to \{+1, -1\}$
- Now, given update (j, +c), set $CM[k, h_k(j)] \mathrel{+}= c \cdot g_k(j)$

[Figure: the AMS sketch as a linear projection of the stream vector]

F2 analysis

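[Figure: an update (j, +c) is multiplied by $g_k(j)$ and added to bucket $h_k(j)$ in each of the $d = 8 \log 1/\delta$ rows, each of width $w = 4/\epsilon^2$]

- Estimate $F_2 = \mathrm{median}_k \sum_i CM[k,i]^2$
- Each row's result is $\sum_i g(i)^2 x[i]^2 + \sum_{h(i)=h(j)} 2 g(i) g(j) x[i] x[j]$
- But $g(i)^2 = (-1)^2 = (+1)^2 = 1$, and $\sum_i x[i]^2 = F_2$
- g(i)g(j) has 1/2 chance of +1 or -1: expectation is 0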
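A sketch of this Count-Min-style AMS structure for integer items, with $w = 4/\epsilon^2$ and $d = 8 \ln 1/\delta$ (the natural log is an assumption; the slide does not state the base). The hash construction is also an assumption: random polynomials modulo a fixed prime, degree 1 for the bucket hash h and degree 3 for the sign hash g, with the sign taken from the low bit, which is only approximately unbiased.

```python
import math
import random
from statistics import median

PRIME = (1 << 61) - 1

def _poly_hash(coeffs, x):
    """Evaluate a random polynomial mod PRIME (degree 3 gives 4-wise independence)."""
    acc = 0
    for a in coeffs:
        acc = (acc * x + a) % PRIME
    return acc

class AMSSketch:
    def __init__(self, eps, delta, seed=0):
        self.w = math.ceil(4 / eps ** 2)
        self.d = max(1, math.ceil(8 * math.log(1 / delta)))
        rng = random.Random(seed)
        self.h = [[rng.randrange(PRIME) for _ in range(2)] for _ in range(self.d)]
        self.g = [[rng.randrange(PRIME) for _ in range(4)] for _ in range(self.d)]
        self.table = [[0] * self.w for _ in range(self.d)]

    def update(self, j, c=1):
        for k in range(self.d):
            bucket = _poly_hash(self.h[k], j) % self.w
            sign = 1 if _poly_hash(self.g[k], j) & 1 else -1  # low bit -> +1 / -1
            self.table[k][bucket] += c * sign

    def estimate_f2(self):
        # each row estimates F2 as the sum of squared buckets; take the median
        return median(sum(v * v for v in row) for row in self.table)

sk = AMSSketch(eps=0.5, delta=0.1, seed=3)
sk.update(42, 3)
sk.update(42, 3)
print(sk.estimate_f2())  # 36: a single item with frequency 6
```

Because updates are signed additions, the sketch also handles departures: an update (j, -c) exactly cancels an earlier (j, +c).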

F2 Variance

- Expectation of row estimate $R_k = \sum_i CM[k,i]^2$ is exactly $F_2$
- Variance of row k, $\mathrm{Var}[R_k]$, is an expectation:
- $\mathrm{Var}[R_k] = E[(\sum_{\text{buckets } b} (CM[k,b])^2 - F_2)^2]$
- Good exercise in algebra: expand this sum and simplify
- Many terms are zero in expectation because of terms like g(a)g(b)g(c)g(d) (degree at most 4)
- Requires that hash function g is four-wise independent: it behaves uniformly over subsets of size four or smaller
- Such hash functions are easy to construct

F2 Variance

- Terms with odd powers of g(a) are zero in expectation
- $g(a)g(b)g^2(c)$, $g(a)g(b)g(c)g(d)$, $g(a)g^3(b)$
- Leaves $\mathrm{Var}[R_k] \le \sum_i g^4(i) x[i]^4 + 2 \sum_{j \ne i} g^2(i) g^2(j) x[i]^2 x[j]^2 + 4 \sum_{h(i)=h(j)} g^2(i) g^2(j) x[i]^2 x[j]^2 - (\sum_i x[i]^4 + \sum_{j \ne i} 2 x[i]^2 x[j]^2) \le F_2^2/w$
- Row variance can finally be bounded by $F_2^2/w$
- Chebyshev for $w = 4/\epsilon^2$ gives probability ¼ of failure
- How to amplify this to small $\delta$ probability of failure?

Tail Inequalities for Sums

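- We derive stronger bounds on tail probabilities for the sum of independent Bernoulli trials via the Chernoff Bound:
- Let $X_1, \ldots, X_m$ be independent Bernoulli trials s.t. $\Pr[X_i = 1] = p$ ($\Pr[X_i = 0] = 1 - p$).
- Let $X = \sum_{i=1}^m X_i$, and $\mu = mp$ be the expectation of X.
- Then, for any $\epsilon > 0$, the Chernoff bound shows that $\Pr[|X - \mu| \ge \epsilon\mu]$ is exponentially small in $\mu$.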

Applying Chernoff Bound

- Each row gives an estimate that is within $\epsilon$ relative error with probability p > ¾
- Take d repetitions and find the median. Why the median?
- Because bad estimates are either too small or too large
- Good estimates form a contiguous group "in the middle"
- At least d/2 estimates must be bad for median to be bad
- Apply Chernoff bound to d independent estimates, p = 3/4
- $\Pr[\text{More than } d/2 \text{ bad estimates}] < 2\exp(-d/8)$
- So we set $d = \Theta(\ln 1/\delta)$ to give $\delta$ probability of failure
- Same outline used many times in data streams
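The median argument can be checked numerically by computing the exact probability that at least half of d independent estimates are bad, assuming each is bad with probability exactly 1/4, and comparing it with the bound $2\exp(-d/8)$. The function name `prob_median_bad` is hypothetical.

```python
from math import comb, exp

def prob_median_bad(d, p_bad=0.25):
    """Exact probability that at least d/2 of d independent estimates are bad."""
    need = (d + 1) // 2   # smallest integer >= d/2
    return sum(comb(d, k) * p_bad**k * (1 - p_bad)**(d - k)
               for k in range(need, d + 1))

for d in (8, 16, 32, 64):
    print(d, prob_median_bad(d), 2 * exp(-d / 8))
```

The failure probability drops exponentially with d, which is why $d = \Theta(\ln 1/\delta)$ repetitions suffice.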

Aside on Independence

- Full independence is expensive in a streaming setting
- If hash functions are fully independent over n items, then we need $\Omega(n)$ space to store their description
- Pairwise and four-wise independent hash functions can be described in a constant number of words
- The $F_2$ algorithm uses a careful mix of limited and full independence
- Each hash function is four-wise independent over all n items
- Each repetition is fully independent of all others, but there are only $O(\log 1/\delta)$ repetitions.

AMS Sketch Exercises

- Let x and y be binary streams of length n. The Hamming distance $H(x,y) = |\{i : x[i] \ne y[i]\}|$. Show how to use AMS sketches to approximate H(x,y)
- Extend for strings drawn from an arbitrary alphabet
- The inner product of two strings x, y is $x \cdot y = \sum_{i=1}^n x[i] \cdot y[i]$. Use AMS sketches to estimate $x \cdot y$
- Hint: try computing the inner product of the sketches. Show the estimator is unbiased (correct in expectation)
- What form does the error in the approximation take?
- Use Count-Min Sketches for the same problem and compare the errors.
- Is it possible to build a $(1 \pm \epsilon)$ approximation of $x \cdot y$?

F0 Estimation

- $F_0$ is the number of distinct items in the stream
- a fundamental quantity with many applications
- Early algorithms by Flajolet and Martin [1983] gave nice hashing-based solution
- analysis assumed fully independent hash functions
- Will describe a generalized version of the FM algorithm due to Bar-Yossef et al. with only pairwise independence

F0 Algorithm

- Let [m] be the domain of stream elements
- Each item in stream is from [1…m]
- Pick a random hash function $h: [m] \to [m^3]$
- With probability at least 1 - 1/m, no collisions under h
- For each stream item i, compute h(i), and track the t distinct items achieving the smallest values of h(i)
- Note: if same i is seen many times, h(i) is same
- Let $v_t$ = t'th smallest value of h(i) seen.
- If $F_0 < t$, give exact answer, else estimate $\hat{F}_0 = t m^3 / v_t$
- $v_t / m^3 \approx$ fraction of hash domain occupied by t smallest

[Figure: the hash range from 0 to $m^3$, with $v_t$ marked]
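A sketch of the estimator for integer items. Two assumptions differ from the slide: the hash range is a fixed Mersenne prime rather than $m^3$, and the hash is `(a*i + b) mod PRIME + 1` (shifted by one so that no hash value is zero). The class name `DistinctCounter` is hypothetical, and the set scan via `max` costs O(t) per update; a heap or balanced tree would be used when update time matters.

```python
import random

PRIME = (1 << 61) - 1  # stands in for the hash range m^3 on the slide

class DistinctCounter:
    def __init__(self, t, seed=0):
        rng = random.Random(seed)
        self.a = rng.randrange(1, PRIME)
        self.b = rng.randrange(PRIME)
        self.t = t
        self.smallest = set()   # the t smallest distinct hash values seen so far

    def update(self, i):
        v = (self.a * i + self.b) % PRIME + 1
        if v in self.smallest:
            return              # a repeated item hashes to the same value
        if len(self.smallest) < self.t:
            self.smallest.add(v)
        elif v < max(self.smallest):
            self.smallest.remove(max(self.smallest))
            self.smallest.add(v)

    def estimate(self):
        if len(self.smallest) < self.t:
            return len(self.smallest)              # fewer than t distinct: exact
        return self.t * PRIME / max(self.smallest)  # t * (range) / v_t

dc = DistinctCounter(t=200, seed=1)
for i in range(100_000):
    dc.update(i % 5_000)        # 5000 distinct items, each seen 20 times
print(dc.estimate())            # an approximation of 5000
```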

Analysis of F0 algorithm

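- Suppose $\hat{F}_0 = t m^3 / v_t > (1 + \epsilon) F_0$ [estimate is too high]
- So for stream = set $S \in 2^{[m]}$, we have
- $|\{s \in S \mid h(s) < t m^3 / ((1 + \epsilon) F_0)\}| > t$
- Because $\epsilon < 1$, we have $t m^3 / ((1 + \epsilon) F_0) \le (1 - \epsilon/2)\, t m^3 / F_0$
- $\Pr[h(s) < (1 - \epsilon/2)\, t m^3 / F_0] \approx \frac{1}{m^3} \cdot (1 - \epsilon/2)\, t m^3 / F_0 = (1 - \epsilon/2)\, t / F_0$
- (this analysis outline hides some rounding issues)

[Figure: the hash range from 0 to $m^3$, with $v_t$ falling below the threshold $t m^3 / ((1 + \epsilon) F_0)$]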

Chebyshev Analysis

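- Let Y be number of items hashing to under $t m^3 / ((1 + \epsilon) F_0)$
- $E[Y] = F_0 \cdot \Pr[h(s) < t m^3 / ((1 + \epsilon) F_0)] \le (1 - \epsilon/2)\, t$
- For each item i, variance of the event = p(1 - p) < p
- $\mathrm{Var}[Y] = \sum_{s \in S} \mathrm{Var}[h(s) < t m^3 / ((1 + \epsilon) F_0)] < (1 - \epsilon/2)\, t$
- We sum variances because of pairwise independence
- Now apply Chebyshev:
- $\Pr[Y > t] \le \Pr[|Y - E[Y]| > \epsilon t / 2] \le 4\mathrm{Var}[Y] / (\epsilon^2 t^2) < 4t / (\epsilon^2 t^2)$
- Set $t = 20/\epsilon^2$ to make this Prob ≤ 1/5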

Completing the analysis

- We have shown $\Pr[\hat{F}_0 > (1 + \epsilon) F_0] < 1/5$
- Can show $\Pr[\hat{F}_0 < (1 - \epsilon) F_0] < 1/5$ similarly
- too few items hash below a certain value
- So $\Pr[(1 - \epsilon) F_0 \le \hat{F}_0 \le (1 + \epsilon) F_0] > 3/5$ [Good estimate]
- Amplify this probability: repeat $O(\log 1/\delta)$ times in parallel with different choices of hash function h
- Take the median of the estimates, analysis as before

F0 Issues

- Space cost:
- Store t hash values, so $O(1/\epsilon^2 \log m)$ bits
- Can improve to $O(1/\epsilon^2 + \log m)$ with additional tricks
- Time cost:
- Find if hash value $h(i) < v_t$
- Update $v_t$ and list of t smallest if h(i) not already present
- Total time $O(\log 1/\epsilon + \log m)$ worst case

Range Efficiency

- Sometimes input is specified as a stream of ranges [a,b]
- [a,b] means insert all items (a, a+1, a+2 … b)
- Trivial solution: just insert each item in the range
- Range efficient $F_0$ [Pavan, Tirthapura 05]:
- Start with an alg for $F_0$ based on pairwise hash functions
- Key problem: track which items hash into a certain range
- Dives into hash fns to divide and conquer for ranges
- Range efficient $F_2$ [Calderbank et al. 05, Rusu, Dobra 06]:
- Start with sketches for $F_2$ which sum hash values
- Design new hash functions so that range sums are fast

F0 Exercises

- Suppose the stream consists of a sequence of insertions and deletions. Design an algorithm to approximate $F_0$ of the current set.
- What happens when some frequencies are negative?
- Give an algorithm to find $F_0$ of the most recent W arrivals
- Use $F_0$ algorithms to approximate Max-dominance: given a stream of pairs (i, x(i)), approximate $\sum_i \max_{(i, x(i))} x(i)$

Higher Frequency Moments

- $F_k$ for k > 2. Use sampling trick as with Entropy [Alon et al 96]:
- Uniformly pick an item from the stream length 1…n
- Set r = how many times that item appears subsequently
- Set estimate $\hat{F}_k = n(r^k - (r-1)^k)$
- $E[\hat{F}_k] = \frac{1}{n} \cdot n \cdot [(f_1^k - (f_1 - 1)^k) + ((f_1 - 1)^k - (f_1 - 2)^k) + \cdots + (1^k - 0^k)] + \cdots = f_1^k + f_2^k + \cdots = F_k$
- $\mathrm{Var}[\hat{F}_k] \le \frac{1}{n} \cdot n^2 \cdot [(f_1^k - (f_1 - 1)^k)^2 + \cdots]$
- Use various bounds to bound the variance by $k\, m^{1-1/k} F_k^2$
- Repeat $k\, m^{1-1/k}$ times in parallel to reduce variance
- Total space needed is $O(k\, m^{1-1/k})$ machine words
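The telescoping argument can be verified on a small stream by averaging the estimate over every possible sampled position. As with the entropy estimator, this is an offline illustration: `fk_estimate_at` (a hypothetical name) counts r by scanning the suffix, and r includes the sampled position itself.

```python
from collections import Counter

def fk_estimate_at(stream, j, k):
    """Estimate n * (r^k - (r-1)^k) when position j (0-based) is sampled."""
    n = len(stream)
    r = sum(1 for a in stream[j:] if a == stream[j])  # occurrences from j onward
    return n * (r ** k - (r - 1) ** k)

stream = "ababcada"   # frequencies: a=4, b=2, c=1, d=1
k = 3
mean = sum(fk_estimate_at(stream, j, k) for j in range(len(stream))) / len(stream)
exact = sum(f ** k for f in Counter(stream).values())
print(mean, exact)    # 74.0 74
```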

Improvements

- [Coppersmith and Kumar 04]: Generalize the $F_2$ approach
- E.g. For $F_3$, set $p = 1/\sqrt{m}$, and hash items onto {1 - 1/p, -1/p} with probability {1/p, 1 - 1/p} respectively.
- Compute cube of sum of the hash values of the stream
- Correct in expectation, bound variance $\le O(\sqrt{m}\, F_3^2)$
- [Indyk, Woodruff 05, Bhuvanagiri et al. 06]: Optimal solutions by extracting different frequencies
- Use hashing to sample subsets of items and $f_i$'s
- Combine these to build the correct estimator
- Cost is $O(m^{1-2/k}\, \text{poly-log}(m, n, 1/\epsilon))$ space

Combined Frequency Moments

Consider network traffic data: it defines a communication graph, e.g. edge (source, destination) or edge (source:port, dest:port). This defines a (directed) multigraph. We are interested in the underlying (support) graph on n nodes.

- Want to focus on number of distinct communication pairs, not size of communication
- So want to compute moments of $F_0$ values...

Multigraph Problems

- Let G[i,j] = 1 if (i,j) appears in stream: edge from i to j. Total of m distinct edges
- Let $d_i = \sum_{j=1}^n G[i,j]$: degree of node i
- Find aggregates of $d_i$'s:
- Estimate heavy $d_i$'s (people who talk to many)
- Estimate frequency moments: number of distinct $d_i$ values, sum of squares
- Range sums of $d_i$'s (subnet traffic)

$F_\infty(F_0)$ using CM-FM

- Find i's such that $d_i > \phi \sum_i d_i$. Finds the people that talk to many others
- Count-Min sketch only uses additions, so can apply

Accuracy for $F_\infty(F_0)$

- Focus on point query accuracy: estimate $d_i$.
- Can prove estimate has only small bias in expectation
- Analysis is similar to original CM sketch analysis, but now have to take account of $F_0$ estimation of counts
- Gives a bound of $O(1/\epsilon^3\, \text{poly-log}(n))$ space
- The product of the size of the sketches
- Remains to fully understand other combinations of frequency moments, e.g. $F_2(F_0)$, $F_2(F_2)$ etc.

Exercises / Problems

- (Research problem) What can be computed for other combinations of frequency moments, e.g. $F_2$ of $F_2$ values, etc.?
- The $F_2$ algorithm uses the fact that +1/-1 values square to preserve $F_2$ but are 0 in expectation. Why won't it work to estimate $F_4$ with $h \to \{-1, +1, -i, +i\}$?
- (Research problem) Read, understand and simplify analysis for optimal $F_k$ estimation algorithms
- Take the sampling $F_k$ algorithm and combine it with $F_0$ estimators to approximate $F_k$ of node degrees
- Why can't we use the sketch approach for $F_2$ of node degrees? Show where the analysis breaks down

Data Stream Algorithms: Lower Bounds

Graham Cormode graham_at_research.att.com

Streaming Lower Bounds

- Lower bounds for data streams
- Communication complexity bounds
- Simple reductions
- Hardness of Gap-Hamming problem
- Reductions to Gap-Hamming

This Time: Lower Bounds

- So far, have seen many examples of things we can do with a streaming algorithm
- What about things we can't do?
- What's the best we could achieve for things we can do?
- Will show some simple lower bounds for data streams based on communication complexity

Streaming As Communication

[Figure: a binary stream, 1 0 1 1 1 0 1 0 1, split between Alice and Bob]

- Imagine Alice processing a stream
- Then take the whole working memory, and send to Bob
- Bob continues processing the remainder of the stream

Streaming As Communication

- Suppose Alice's part of the stream corresponds to string x, and Bob's part corresponds to string y...
- ...and that computing the function on the stream corresponds to computing f(x,y)...
- ...then if f(x,y) has communication complexity $\Omega(g(n))$, then the streaming computation has a space lower bound of $\Omega(g(n))$
- Proof by contradiction: If there was an algorithm with better space usage, we could run it on x, then send the memory contents as a message, and hence solve the communication problem

Deterministic Equality Testing

[Figure: Alice's string 1 0 1 1 1 0 1 0 1 and Bob's string 1 0 1 1 0 0 1 0 1]

- Alice has string x, Bob has string y, want to test if x = y
- Consider a deterministic (one-round, one-way) protocol that sends a message of length m < n
- There are $2^m$ possible messages, so some strings must generate the same message: this would cause error
- So a deterministic message (sketch) must be $\Omega(n)$ bits
- In contrast, we saw a randomized sketch of size $O(\log n)$

Hard Communication Problems

- INDEX: x is a binary string of length n; y is an index in [n]. Goal: output x[y]. Result: (one-way) (randomized) communication complexity of INDEX is $\Omega(n)$ bits
- DISJOINTNESS: x and y are both length n binary strings. Goal: output 1 if $\exists i: x[i] = y[i] = 1$, else 0. Result: (multi-round) (randomized) communication complexity of DISJOINTNESS is $\Omega(n)$ bits

Simple Reduction to Disjointness

x: 1 0 1 1 0 1 → stream 1, 3, 4, 6

y: 0 0 0 1 1 0 → stream 4, 5

- $F_\infty$: output the highest frequency in a stream
- Input: the two strings x and y from disjointness
- Stream: if x[i] = 1, then put i in stream; then same for y
- Analysis: if $F_\infty = 2$, then intersection; if $F_\infty \le 1$, then disjoint.
- Conclusion: Giving exact answer to $F_\infty$ requires $\Omega(N)$ bits
- Even approximating up to 50% error is hard
- Even with randomization: DISJ bound allows randomness
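The reduction can be traced on the example strings above. The helper names `disj_stream` and `f_inf` are hypothetical; the code computes the highest frequency exactly, which is precisely the quantity the lower bound shows cannot be computed in small space.

```python
from collections import Counter

def disj_stream(x, y):
    """Stream for the reduction: 1-based indices of the 1-bits of x, then of y."""
    return ([i for i, bit in enumerate(x, 1) if bit] +
            [i for i, bit in enumerate(y, 1) if bit])

def f_inf(stream):
    """Highest frequency of any item in the stream."""
    return max(Counter(stream).values(), default=0)

x = [1, 0, 1, 1, 0, 1]
y = [0, 0, 0, 1, 1, 0]
print(disj_stream(x, y))          # [1, 3, 4, 6, 4, 5]
print(f_inf(disj_stream(x, y)))   # 2, so x and y intersect (at index 4)
```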

Simple Reduction to Index

x: 1 0 1 1 0 1 → stream 1, 3, 4, 6

y: 5 → stream 5

- $F_0$: output the number of items in the stream
- Input: the string x and index y from INDEX
- Stream: if x[i] = 1, put i in stream; then put y in stream
- Analysis: if $(1 - \epsilon)\hat{F}_0(x \circ y) > (1 + \epsilon)\hat{F}_0(x)$ then y is a new item, so x[y] = 0; else x[y] = 1
- Conclusion: Approximating $F_0$ for $\epsilon < 1/N$ requires $\Omega(N)$ bits
- Implies that space to approximate must be $\Omega(1/\epsilon)$
- Bound allows randomization

Hardness Reduction Exercises

- Use reductions to DISJ or INDEX to show the hardness of:
- Frequent items: find all items in the stream whose frequency > $\phi N$, for some $\phi$.
- Sliding window: given a stream of binary (0/1) values, compute the sum of the last N values
- Can this be approximated instead?
- Min-dominance: given a stream of pairs (i, x(i)), approximate $\sum_i \min_{(i, x(i))} x(i)$
- Rank sum: Given a stream of (x,y) pairs and query (p,q) specified after stream, approximate $|\{(x,y) \mid x < p, y < q\}|$

Gap Hamming

- GAP-HAMM communication problem:
- Alice holds $x \in \{0,1\}^N$, Bob holds $y \in \{0,1\}^N$
- Promise: H(x,y) is either $\le N/2 - \sqrt{N}$ or $\ge N/2 + \sqrt{N}$
- Which is the case?
- Model: one message from Alice to Bob
- Requires $\Omega(N)$ bits of one-way randomized communication [Indyk, Woodruff 03, Woodruff 04, Jayram, Kumar, Sivakumar 07]

Hardness of Gap Hamming

- Reduction to an instance of INDEX
- Map string x to u by 1 → +1, 0 → -1 (i.e. $u[i] = 2x[i] - 1$)
- Assume both Alice and Bob have access to public random strings $r_j$, where each bit of $r_j$ is iid {-1, +1}
- Assume w.l.o.g. that length of string n is odd (important!)
- Alice computes $a_j = \mathrm{sign}(r_j \cdot u)$
- Bob computes $b_j = \mathrm{sign}(r_j[y])$
- Repeat N times with different random strings, and consider the Hamming distance of $a_1 \ldots a_N$ with $b_1 \ldots b_N$

Probability of a Hamming Error

- Consider the pair a_j = sign(r_j · u), b_j = sign(r_j[y])
- Let w = Σ_{i ≠ y} u_i r_j[i]
- w is a sum of (n−1) values distributed iid uniform on {−1, +1}
- Case 1: w ≠ 0. So |w| ≥ 2, since (n−1) is even
- so a_j = sign(w), independent of x_y
- Then Pr[a_j ≠ b_j] = Pr[sign(w) ≠ sign(r_j[y])] = ½
- Case 2: w = 0. So a_j = sign(r_j · u) = sign(w + u_y r_j[y]) = sign(u_y r_j[y])
- Then Pr[a_j = b_j] = Pr[sign(u_y r_j[y]) = sign(r_j[y])]
- This probability is 1 if u_y = +1, 0 if u_y = −1
- Completely biased by the answer to INDEX

Finishing the Reduction

- So what is Pr[w = 0]?
- w is sum of (n−1) iid uniform {−1, +1} values
- Textbook: Pr[w = 0] = c/√n, for some constant c
- Do some probability manipulation:
- Pr[a_j = b_j] = ½ + c/(2√n) if x_y = 1
- Pr[a_j = b_j] = ½ − c/(2√n) if x_y = 0
- Amplify this bias by making strings of length N = 4n/c²
- Apply Chernoff bound on N instances
- With prob > 2/3, either H(a,b) > N/2 + √N or H(a,b) < N/2 − √N
- If we could solve GAP-HAMMING, could solve INDEX
- Therefore, need Ω(N) = Ω(n) bits for GAP-HAMMING
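
For small odd n the bias can be checked exactly by enumerating every public string r in {−1, +1}^n instead of sampling. The sketch below is illustrative only; `agree_probability` is a hypothetical helper, and it compares the enumerated value of Pr[a_j = b_j] against ½ ± Pr[w = 0]/2, where Pr[w = 0] is the chance that n−1 random signs sum to zero.

```python
from fractions import Fraction
from itertools import product
from math import comb

def sign(v):
    return 1 if v > 0 else -1  # v is never 0 here because n is odd

def agree_probability(x, y):
    """Exact Pr[a_j = b_j] over all public strings r in {-1,+1}^n (y is 1-indexed)."""
    n = len(x)
    assert n % 2 == 1, "the argument needs odd n"
    u = [2 * bit - 1 for bit in x]                       # map 1 -> +1, 0 -> -1
    agree = 0
    for r in product((-1, 1), repeat=n):
        a = sign(sum(ri * ui for ri, ui in zip(r, u)))   # Alice: sign(r . u)
        b = r[y - 1]                                     # Bob: sign(r[y]) = r[y]
        agree += (a == b)
    return Fraction(agree, 2 ** n)

n = 5
p_zero = Fraction(comb(n - 1, (n - 1) // 2), 2 ** (n - 1))   # Pr[w = 0]
x = [1, 0, 1, 1, 0]
print(agree_probability(x, 1), Fraction(1, 2) + p_zero / 2)  # x_1 = 1
print(agree_probability(x, 2), Fraction(1, 2) - p_zero / 2)  # x_2 = 0
```

The two printed values on each line coincide, which is the "probability manipulation" step: the unbiased case contributes ½·Pr[w ≠ 0] and the fully biased case contributes Pr[w = 0] or nothing.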

Lower Bound for Entropy

- Alice: x ∈ {0,1}^N, Bob: y ∈ {0,1}^N
- Entropy estimation algorithm A
- Alice runs A on enc(x) = ⟨(1,x_1), (2,x_2), …, (N,x_N)⟩
- Alice sends over memory contents to Bob
- Bob continues A on enc(y) = ⟨(1,y_1), (2,y_2), …, (N,y_N)⟩

Figure: example inputs, Alice holding x = 1 1 0 0 1 0 and Bob holding y = 0 1 0 0 1 1

Lower Bound for Entropy

- Observe: there are
- 2H(x,y) tokens with frequency 1 each
- N − H(x,y) tokens with frequency 2 each
- So, H(S) = log N + H(x,y)/N
- Thus size of Alice's memory contents = Ω(N)
- Set ε = 1/(√N log N) to show bound of Ω(ε^−2 / log²(1/ε))
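
The entropy identity is easy to confirm numerically. The sketch below (hypothetical helper names, base-2 logarithms, illustrative bit strings) builds the stream enc(x) followed by enc(y) and compares its empirical entropy with log N + H(x,y)/N, where H(x,y) is the Hamming distance.

```python
from collections import Counter
from math import log2

def encode(bits):
    """enc(x): the tokens (1, x_1), (2, x_2), ..., (N, x_N)."""
    return [(i, b) for i, b in enumerate(bits, start=1)]

def empirical_entropy(stream):
    """Entropy (base 2) of the empirical token distribution of the stream."""
    m = len(stream)
    return sum(f / m * log2(m / f) for f in Counter(stream).values())

def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

x = [1, 1, 0, 0, 1, 0]
y = [0, 1, 0, 0, 1, 1]
N = len(x)
S = encode(x) + encode(y)   # Alice's tokens, then Bob's
print(empirical_entropy(S), log2(N) + hamming(x, y) / N)
```

Positions where x and y agree produce one token of frequency 2; positions where they differ produce two tokens of frequency 1, which is where the extra H(x,y)/N term comes from.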

Lower Bound for F0

- Same encoding works for F0 (Distinct Elements)
- 2H(x,y) tokens with frequency 1 each
- N − H(x,y) tokens with frequency 2 each
- F0(S) = N + H(x,y)
- Either H(x,y) > N/2 + √N or H(x,y) < N/2 − √N
- If we could approximate F0 with ε < 1/√N, could separate
- But space bound = Ω(N) = Ω(ε^−2) bits
- Dependence on ε for F0 is tight
- Similar arguments show Ω(ε^−2) bounds for Fk
- Proof assumes k (and hence 2^k) are constants

Lower Bounds Exercises

- Formally argue the space lower bound for F2 via Gap-Hamming
- Argue space lower bounds for Fk via Gap-Hamming
- (Research problem) Extend lower bounds for the case when the order of the stream is random or near-random
- (Research problem) Kumar conjectures the multi-round communication complexity of Gap-Hamming is Ω(n); this would give lower bounds for multi-pass streaming

Data Stream Algorithms: Extensions and Open Problems

Graham Cormode graham_at_research.att.com

This Time: Extensions

- Have given the basics of streaming: streams of items, frequency moments, upper and lower bounds
- Many variations with many open problems
- Streams representing different combinatorial objects
- Streams that are distributed, correlated, uncertain
- Systems for processing streams
- Different models of streams
- See also "Open Problems in Data Streams" [McGregor 07]
- Result of a workshop held at IIT Kanpur in Dec 2006

Deterministic Streaming Algorithms

- Focus so far has been on randomized algorithms
- Many important problems can be solved deterministically!
- Finding frequent items / heavy hitters
- Finding quantiles of a distribution
- For many problems, lower bounds show randomization is necessary for sublinear space
- Anything involving equality testing as a special case
- Frequency moments
- When they are possible, deterministic algorithms are often faster and use less space: more practical to implement
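
As one concrete instance of a deterministic frequent-items algorithm, here is a minimal sketch of the well-known Misra-Gries summary. The slide does not name a specific algorithm, so this choice is an assumption made for illustration.

```python
def misra_gries(stream, k):
    """Deterministic summary with at most k - 1 counters.

    Each stored count undercounts the true frequency by at most n / k
    (n = stream length), so every item with frequency > n / k survives.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all, dropping those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

stream = list("abacabadabacabae")   # 'a' occurs 8 times out of 16
summary = misra_gries(stream, k=3)
print(summary)
```

A second pass over the stream (when available) can replace the undercounts with exact frequencies for the surviving candidates.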

Clustering On Data Streams

- Goal: output k cluster centers at end; any point can be classified using these centers
- Use divide and conquer approach [Guha et al. 00]
- Buffer as many points as possible, then cluster them
- Cluster the clusters
- Cluster the cluster clusters, etc.
- Each level of clustering gives up extra factors in quality

Geometric Streaming

- Stream specifies a sequence of d-dimensional points
- Answer various geometric problems such as
- Convex hull
- Minimum spanning tree weight
- Facility location
- Minimum enclosing ball
- Gridding approach: reduces to Fk or related problems [Indyk 03]
- Core-set: keep a carefully chosen small subset of points and evaluate on them [Har-Peled 02, Chan 06]
- Simple example: for minimum enclosing ball, keep extremal points in evenly spaced directions
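
The "extremal points in evenly spaced directions" idea can be sketched for points in the plane as follows. The class name, the parameter `m` (number of directions) and the use of the diameter as the evaluated quantity are assumptions for illustration; the diameter is used because it is simpler to compute than a minimum enclosing ball.

```python
import math

class DirectionalExtrema:
    """For each of m evenly spaced directions in the plane, keep the point
    seen so far with the largest projection onto that direction."""

    def __init__(self, m):
        self.dirs = [(math.cos(2 * math.pi * j / m), math.sin(2 * math.pi * j / m))
                     for j in range(m)]
        self.best = [None] * m          # (projection, point) per direction

    def add(self, p):
        for j, (dx, dy) in enumerate(self.dirs):
            proj = p[0] * dx + p[1] * dy
            if self.best[j] is None or proj > self.best[j][0]:
                self.best[j] = (proj, p)

    def points(self):
        return {b[1] for b in self.best if b is not None}

def diameter(points):
    """Largest pairwise distance (quadratic time; fine for a small kept set)."""
    pts = list(points)
    return max((math.dist(p, q) for p in pts for q in pts), default=0.0)

grid = [(x, y) for x in range(-5, 6) for y in range(-5, 6)]
summary = DirectionalExtrema(m=8)
for p in grid:
    summary.add(p)
print(len(summary.points()), diameter(summary.points()), diameter(grid))
```

Space is O(m) points regardless of stream length; more directions give a closer approximation to the extent of the full point set.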

Sliding Window Computations

- In a sliding window, we only consider the last W items
- W still very large, so want poly-log(W) solutions
- Exponential Histograms [Datar et al. 02] and Waves [Gibbons, Tirthapura 02]
- Deterministic structure tracks counts in a window
- Based on doubling bucket sizes to give relative error
- Same structure + sketches solves for aggregates
- Asynchronous streams: items not in timestamp order
- Relative error counts possible [Busch, Tirthapura 07]
- Extend concept to other aggregates [C. et al. 08]
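
A minimal sketch of the doubling-bucket idea for counting 1s in the last W bits, in the spirit of an exponential histogram. Details such as the parameter `k` (maximum buckets kept per size) and the merge order are simplifying assumptions, not a faithful copy of the published structure.

```python
import random

class ExponentialHistogram:
    """Approximate count of 1s among the last `window` bits.

    Buckets are (timestamp of newest 1, size) pairs, newest first. Sizes are
    powers of two and at most k buckets of each size are kept.
    """

    def __init__(self, window, k):
        self.window, self.k = window, k
        self.time = 0
        self.buckets = []

    def add(self, bit):
        self.time += 1
        # Drop buckets whose newest 1 has left the window.
        while self.buckets and self.buckets[-1][0] <= self.time - self.window:
            self.buckets.pop()
        if bit != 1:
            return
        self.buckets.insert(0, (self.time, 1))
        i = 0
        while True:
            size = self.buckets[i][1]
            j = i
            while j < len(self.buckets) and self.buckets[j][1] == size:
                j += 1
            if j - i <= self.k:
                break
            # Too many buckets of this size: merge the two oldest into one.
            newest_ts = self.buckets[j - 2][0]
            self.buckets[j - 2:j] = [(newest_ts, 2 * size)]
            i = j - 2

    def estimate(self):
        if not self.buckets:
            return 0
        total = sum(size for _, size in self.buckets)
        # Only the oldest bucket can straddle the window edge: count half of it.
        return total - self.buckets[-1][1] // 2

rng = random.Random(0)
eh = ExponentialHistogram(window=100, k=5)
bits = [rng.randint(0, 1) for _ in range(1000)]
for b in bits:
    eh.add(b)
print(eh.estimate(), sum(bits[-100:]))
```

Only the oldest bucket is uncertain, and it is small relative to the sum of the newer buckets, which is what yields a relative error guarantee with O(k log W) buckets.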

Time Decay

- Assign a weight to each item as a function of its age
- E.g. exponential decay or polynomial decay
- Implies weighted versions of problems
- [Cohen and Strauss 2003]
- Can reduce sum and counts to multiple instances of sliding window queries
- [C., Korn and Tirthapura 2008]
- Same observation applies to other computations (quantiles, frequent items)
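
For exponential decay specifically, a decayed sum needs no window machinery: every weight shrinks by the same factor as time advances, so a single running total suffices. This is a standard observation, not the sliding-window reduction cited on the slide, and it does not carry over to polynomial decay. The class and parameter names below are hypothetical.

```python
import math

class DecayedSum:
    """Sum of values weighted by exp(-lam * age), kept in O(1) space."""

    def __init__(self, lam):
        self.lam = lam
        self.total = 0.0
        self.last_time = None

    def _advance(self, t):
        # Age everything seen so far by the elapsed time.
        if self.last_time is not None:
            self.total *= math.exp(-self.lam * (t - self.last_time))
        self.last_time = t

    def add(self, t, value):
        """Observe `value` at time t (timestamps must be non-decreasing)."""
        self._advance(t)
        self.total += value

    def query(self, t):
        """Decayed sum as of time t."""
        self._advance(t)
        return self.total

items = [(0, 2.0), (1, 1.0), (3, 4.0)]   # (timestamp, value)
ds = DecayedSum(lam=0.5)
for t, v in items:
    ds.add(t, v)
direct = sum(v * math.exp(-0.5 * (5 - t)) for t, v in items)
print(ds.query(5), direct)
```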

Multi-Pass Algorithms

- Some situations allow multiple passes of the stream
- E.g. scanning over slow storage (tape): random access not possible, but can scan multiple times
- Earliest work in streaming [Munro, Paterson 78] studied the pass/space tradeoff for finding medians
- Lower bounds can follow from multi-round communication complexity bounds

Other Massive Data Models

- Massive Unordered Data (MUD) model [Feldman et al. 08]
- Abstracts computations in MapReduce/Hadoop settings
- Can provably simulate deterministic streaming algs
- What about randomized computations, multiple passes?

Skewed Streams

- In practice, not all frequency distributions are worst case
- Few items are frequent, then a long tail of infrequent items
- Such skew is prevalent in network data, word frequency, paper citations, city sizes, etc.
- Zipfian distribution with skew z > 0 (z = 1..2 typical)
- Analyze algorithms under assumption of skewed data
- Improved F2 space cost: O(ε^(−2/(1+z)) log 1/δ), provided z > 1

Graph Streaming

Example edge stream: (4,5) (2,3) (1,3) (3,5) (1,2) (2,4) (1,5) (3,4)

- Stream specifies a massive graph edge by edge
- Most natural problems have Ω(V) space lower bounds
- Semi-streaming model: allow Õ(V) but o(E) space (therefore also o(V²) space)
- Allow one (or few) passes to approximate:
- Minimum Spanning Tree Weight
- Graph Distances (based on spanners)
- Maximum weight matching
- Counting Triangles
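
To make the semi-streaming budget concrete, here is a one-pass greedy maximal matching on the example edge stream. It is an unweighted stand-in chosen for simplicity, not one of the weighted matching algorithms the slide alludes to. It stores at most one flag per vertex, and a maximal matching always has at least half as many edges as a maximum one.

```python
def greedy_matching(edge_stream):
    """Keep an edge iff neither endpoint is already matched (one pass, O(V) space)."""
    matched = set()
    matching = []
    for u, v in edge_stream:
        if u != v and u not in matched and v not in matched:
            matched.update((u, v))
            matching.append((u, v))
    return matching

edges = [(4, 5), (2, 3), (1, 3), (3, 5), (1, 2), (2, 4), (1, 5), (3, 4)]
print(greedy_matching(edges))
```

On this stream the first two edges are kept and every later edge touches an already matched vertex, so the output has two edges, which is also the maximum possible on five vertices.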

Matrix Streaming

- Stream specifies a massive n × n matrix
- Either by giving entries in some order, or updates to entries
- In one (or few) passes, find
- CUR Decomposition
- Page Rank Vector
- Approximate Matrix product
- Singular Value Decomposition
- Current methods take small constant number of passes, sample constant number of rows and columns by weight
- Sketching methods don't seem so useful here

Figure: CUR decomposition built from O(1) columns, a carefully chosen U, and O(1) rows

Permutation Streaming

- Stream presents a permutation of items
- Abstraction of several settings, more of theoretic interest
- Approximate number of inversions in the stream
- Locations where i > j but i appears before j in stream
- Can be reduced to a variation of quantiles [Gupta, Zane 03]
- Find length of longest increasing subsequence
- Reduce (up to factor 2) to simpler function [Ergun, Jowhari 08]
- Approximate this using a different variation of quantiles
- Deterministic lower bound Ω(N^(1/2)), randomized bound open

Random Order Streaming

- Lower bounds are sometimes based on carefully creating adversarial orders of streams
- Random order streams: order is uniformly permuted
- Can sometimes give much better upper bounds: prefix of stream gives a good sample of the distribution to come
- Lower bounds in random order give stronger evidence of robust hardness, e.g. [Chakrabarti et al. 08]
- Hardness via communication complexity of random partitions
- GAP-HAMMING still has linear lower bound
- t2-party DISJOINTNESS has Ω(n/t) lower bound

Probabilistic Streams

Example: S = (⟨x, ½⟩, ⟨y, 1/3⟩, ⟨y, ¼⟩) encodes 6 possible worlds

G       ∅     {x}   {y}    {x,y}  {y,y}  {x,y,y}
Pr[G]   ¼     ¼     5/24   5/24   1/24   1/24

- Instead of exact values, stream of discrete distributions
- Specify exponentially many possible worlds
- Adds complexity to previously studied problems
- Sum and Count are easy (by linearity of expectation)
- Avg = Sum/Count is hard! because of ratio [McGregor et al. 07]
- Linearity of expectation, summation of variance
- Allows estimation of Fk over streams [C., Garofalakis 07]
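
The possible-worlds table for the example stream can be reproduced by brute-force enumeration. The sketch assumes each tuple appears independently with its stated probability (an assumption consistent with the probabilities in the example), and also checks that the expected Count equals the sum of the tuple probabilities, as linearity of expectation predicts. Function and variable names are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

S = [("x", Fraction(1, 2)), ("y", Fraction(1, 3)), ("y", Fraction(1, 4))]

def possible_worlds(stream):
    """Map each possible world (a sorted tuple of items) to its probability."""
    worlds = defaultdict(Fraction)
    for present in product((False, True), repeat=len(stream)):
        prob = Fraction(1)
        world = []
        for (item, p), keep in zip(stream, present):
            prob *= p if keep else 1 - p
            if keep:
                world.append(item)
        worlds[tuple(sorted(world))] += prob
    return dict(worlds)

worlds = possible_worlds(S)
for world, prob in sorted(worlds.items(), key=lambda kv: (len(kv[0]), kv[0])):
    print(world, prob)

expected_count = sum(len(world) * prob for world, prob in worlds.items())
print("E[Count] =", expected_count, "sum of probabilities =", sum(p for _, p in S))
```

Enumeration takes time exponential in the stream length, which is exactly why streaming algorithms for this model work with expectations and variances instead of worlds.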

Distributed Streams

- Motivated by Sensor Networks: large wireless nets
- Communication drains battery: compute more, send less
- Key problem: design stream summary data structures that can be combined to summarize the union of streams
- Most sketches (AMS, Count-Min, F0) naturally distribute
- Similar results needed for other problems
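
"Naturally distribute" can be illustrated with a small Count-Min sketch: two nodes that share hash functions can merge their sketches by adding counters, and the result is identical to sketching the union of their streams. This is a minimal sketch with simple modular hash functions and a hypothetical `seed` parameter, restricted to non-negative integer items, and not a tuned implementation.

```python
import random

class CountMin:
    """Count-Min sketch over non-negative integer items (depth x width counters)."""
    PRIME = 2_147_483_647  # 2^31 - 1

    def __init__(self, width, depth, seed=0):
        rng = random.Random(seed)          # same seed gives the same hash functions
        self.width, self.depth = width, depth
        self.hashes = [(rng.randrange(1, self.PRIME), rng.randrange(self.PRIME))
                       for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        for row, (a, b) in enumerate(self.hashes):
            yield row, ((a * item + b) % self.PRIME) % self.width

    def add(self, item, count=1):
        for row, col in self._cells(item):
            self.table[row][col] += count

    def estimate(self, item):
        """Never underestimates the true count."""
        return min(self.table[row][col] for row, col in self._cells(item))

    def merge(self, other):
        """Sketch of the union of both streams; requires identical hash functions."""
        assert self.hashes == other.hashes and self.width == other.width
        merged = CountMin(self.width, self.depth)
        merged.hashes = self.hashes
        merged.table = [[x + y for x, y in zip(r1, r2)]
                        for r1, r2 in zip(self.table, other.table)]
        return merged

node_a, node_b = CountMin(64, 4, seed=7), CountMin(64, 4, seed=7)
for item in [1, 2, 2, 3, 3, 3]:
    node_a.add(item)
for item in [3, 4, 4, 5]:
    node_b.add(item)
combined = node_a.merge(node_b)
print(combined.estimate(3))   # at least 4, the true count across both nodes
```

Because merging is just counter addition, it can be applied up an aggregation tree towards the base station, and each node sends only its fixed-size sketch.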

Figure: sensor network with a base station (root, coordinator). Image source: http://www.intel.com/research/exploratory/motes.htm

Continuous Distributed Model

- Goal: continuously track (global) query over streams at the coordinator while bounding the communication
- Large-scale network-event monitoring, real-time anomaly/DDoS attack detection, power grid monitoring, ...
- Results known for quantiles, Fk, clustering...
- Cost not much higher than one-time computation [C. et al. 08]

Extensions for P2P Networks

- Much work focused on specifics of sensor and wired nets
- P2P and Grid computing present alternate models
- Structure of multi-hop overlay networks
- Controlled failure model: nodes explicitly leave and join
- Allows us to think beyond model of highly resource constrained sensors
- Implementations such as OpenDHT over PlanetLab [Rhea et al. 05]

Authenticated Stream Aggregation

- Wide-area query processing
- Possible malicious aggregators
- Can suppress or add spurious information
- Authenticate query results at the querier?
- Perhaps, to within some approximation error
- Initial steps in [Garofalakis et al. 06]
- Sliding window: [Hadjieleftheriou et al. 07]

Data Stream Algorithms

- Slides are on my website
- Long list of references also on the web
- http://dimacs.rutgers.edu/graham