Data Stream Algorithms Intro, Sampling, Entropy - PowerPoint PPT Presentation

About This Presentation
Title:

Data Stream Algorithms Intro, Sampling, Entropy

Description:

Data Stream Algorithms Intro, Sampling, Entropy Graham Cormode graham_at_research.att.com – PowerPoint PPT presentation

Slides: 114
Provided by: Grah2151
Transcript and Presenter's Notes

Title: Data Stream Algorithms Intro, Sampling, Entropy


1
Data Stream Algorithms Intro, Sampling, Entropy
Graham Cormode graham_at_research.att.com
2
Outline
  • Introduction to Data Streams
  • Motivating examples and applications
  • Data Streaming models
  • Basic tail bounds
  • Sampling from data streams
  • Sampling to estimate entropy

3
Data is Massive
  • Data is growing faster than our ability to store
    or index it
  • There are 3 Billion Telephone Calls in US each
    day, 30 Billion emails daily, 1 Billion SMS,
    IMs.
  • Scientific data: NASA's observation satellites
    generate billions of readings each per day.
  • IP network traffic: up to 1 billion packets per
    hour per router. Each ISP has many (hundreds of)
    routers!
  • Whole genome sequences for many species are now
    available, each megabytes to gigabytes in size

4
Massive Data Analysis
  • Must analyze this massive data
  • Scientific research (monitor environment,
    species)
  • System management (spot faults, drops, failures)
  • Customer research (association rules, new offers)
  • For revenue protection (phone fraud, service
    abuse)
  • Else, why even measure this data?

5
Example: Network Data
  • Networks are sources of massive data: the
    metadata per hour per router is gigabytes
  • Fundamental problem of data stream analysis: too
    much information to store or transmit
  • So process data as it arrives, in one pass and small
    space: the data stream approach.
  • Approximate answers to many questions are OK, if
    there are guarantees of result quality

6
IP Network Monitoring Application
Example: NetFlow IP Session Data
  • 24x7 IP packet/flow data-streams at network
    elements
  • Truly massive streams arriving at rapid rates
  • AT&T/Sprint collect 1 Terabyte of NetFlow data
    each day
  • Often shipped off-site to data warehouse for
    off-line analysis

7
Packet-Level Data Streams
  • Single 2 Gb/sec link; say average packet size is
    50 bytes
  • Number of packets/sec: 5 million
  • Time per packet: 0.2 microsec
  • If we only capture header information per
    packet (src/dest IP, time, no. of bytes, etc.):
    at least 10 bytes.
  • Space per second is 50 MB
  • Space per day is about 4.3 TB per link
  • ISPs typically have hundreds of links!
  • Analyzing packet content streams is order(s) of
    magnitude harder
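The arithmetic on this slide can be checked with a few lines of Python. This is only a back-of-envelope sketch: the constant names are made up for illustration, and the inputs are the slide's own assumptions (2 Gb/sec link, 50-byte packets, 10 bytes kept per packet).

```python
# Back-of-envelope figures; every input is an assumption taken from the slide.
LINK_BITS_PER_SEC = 2e9   # 2 Gb/sec link
AVG_PACKET_BYTES = 50     # assumed average packet size
HEADER_BYTES = 10         # assumed bytes of header information kept per packet

packets_per_sec = LINK_BITS_PER_SEC / 8 / AVG_PACKET_BYTES
time_per_packet_us = 1e6 / packets_per_sec
bytes_per_sec = packets_per_sec * HEADER_BYTES
bytes_per_day = bytes_per_sec * 24 * 60 * 60

print(f"{packets_per_sec:.0f} packets/sec")        # 5 million
print(f"{time_per_packet_us:.1f} microsec/packet")  # 0.2
print(f"{bytes_per_sec / 1e6:.0f} MB/sec")          # 50
print(f"{bytes_per_day / 1e12:.2f} TB/day")         # 4.32
```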

8
Network Monitoring Queries
Off-line analysis: slow, expensive
[Figure: network diagram with labels Network Operations Center (NOC), Peer, routers R1, R2, R3, Enterprise Networks, PSTN, DSL/Cable Networks]
  • Extra complexity comes from limited space and
    time
  • Will introduce solutions for these and other
    problems

9
Streaming Data Questions
  • Network managers ask questions requiring us to
    analyze the data
  • How many distinct addresses seen on the network?
  • Which destinations or groups use most bandwidth?
  • Find hosts with similar usage patterns?
  • Extra complexity comes from limited space and
    time
  • Will introduce solutions for these and other
    problems

10
Other Streaming Applications
  • Sensor networks
  • Monitor habitat and environmental parameters
  • Track many objects, intrusions, trend analysis
  • Utility Companies
  • Monitor power grid, customer usage patterns etc.
  • Alerts and rapid response in case of problems

11
Streams Defining Frequency Dbns.
  • We will consider streams that define frequency
    distributions
  • E.g. frequency of packets from source A to source
    B
  • This simple setting captures many of the core
    algorithmic problems in data streaming
  • How many distinct (non-zero) values seen?
  • What is the entropy of the frequency
    distribution?
  • What (and where) are the highest frequencies?
  • More generally, can consider streams that define
    multi-dimensional distributions, graphs,
    geometric data etc.
  • But even for frequency distributions, several
    models are relevant

12
Data Stream Models
  • We model data streams as sequences of simple
    tuples
  • Complexity arises from massive length of streams
  • Arrivals only streams
  • Example: (x, 3), (y, 2), (x, 2) encodes the
    arrival of 3 copies of item x, 2 copies of y,
    then 2 copies of x.
  • Could represent e.g. packets on a network, power
    usage
  • Arrivals and departures
  • Example: (x, 3), (y, 2), (x, -2) encodes final
    state of (x, 1), (y, 2).
  • Can represent fluctuating quantities, or measure
    differences between two distributions
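As a small illustration of the two models, the sketch below replays a list of (item, count) updates into an explicit frequency table. The function name `apply_updates` is made up for this example, and storing the full table is exactly what a streaming algorithm tries to avoid; it is shown only to pin down what the updates mean.

```python
from collections import Counter


def apply_updates(updates):
    """Replay (item, count) updates and return the resulting frequencies."""
    freq = Counter()
    for item, count in updates:
        freq[item] += count       # count may be negative (a departure)
        if freq[item] == 0:
            del freq[item]        # drop items whose frequency returns to zero
    return dict(freq)


arrivals_only = apply_updates([("x", 3), ("y", 2), ("x", 2)])
arrivals_and_departures = apply_updates([("x", 3), ("y", 2), ("x", -2)])
print(arrivals_only)             # {'x': 5, 'y': 2}
print(arrivals_and_departures)   # {'x': 1, 'y': 2}
```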

13
Approximation and Randomization
  • Many things are hard to compute exactly over a
    stream
  • Is the count of all items the same in two
    different streams?
  • Requires linear space to compute exactly
  • Approximation: find an answer correct within some
    factor
  • Find an answer that is within 10% of correct
    result
  • More generally, a $(1 \pm \epsilon)$ factor approximation
  • Randomization: allow a small probability of
    failure
  • Answer is correct, except with probability 1 in
    10,000
  • More generally, success probability $(1-\delta)$
  • Approximation and Randomization: $(\epsilon, \delta)$-approximations

14
Basic Tools: Tail Inequalities
  • General bounds on tail probability of a random
    variable (probability that a random variable
    deviates far from its expectation)
  • Basic Inequalities: Let X be a random variable
    with expectation $\mu$ and variance $\mathrm{Var}[X]$. Then, for
    any $\epsilon > 0$:

15
Tail Bounds
  • Markov Inequality
  • For a random variable Y which takes only
    non-negative values:
  • $\Pr[Y \ge k] \le E(Y)/k$
  • (This will be < 1 only for k > E(Y))
  • Chebyshev's Inequality
  • For a random variable Y:
  • $\Pr[|Y - E(Y)| \ge k] \le \mathrm{Var}(Y)/k^2$
  • Proof: Set $X = (Y - E(Y))^2$
  • $E(X) = E(Y^2 + E(Y)^2 - 2Y E(Y)) = E(Y^2) + E(Y)^2 - 2E(Y)^2 = \mathrm{Var}(Y)$
  • So $\Pr[|Y - E(Y)| \ge k] = \Pr[(Y - E(Y))^2 \ge k^2]$.
  • Using Markov: $\le E[(Y - E(Y))^2]/k^2 = \mathrm{Var}(Y)/k^2$

16
Outline
  • Introduction to Data Streams
  • Motivating examples and applications
  • Data Streaming models
  • Basic tail bounds
  • Sampling from data streams
  • Sampling to estimate entropy

17
Sampling From a Data Stream
  • Fundamental problem: sample m items uniformly from
    stream
  • Useful: approximate costly computation on small
    sample
  • Challenge: don't know how long stream is
  • So when/how often to sample?
  • Two solutions, apply to different situations
  • Reservoir sampling (dates from 1980s?)
  • Min-wise sampling (dates from 1990s?)

18
Reservoir Sampling
  • Sample first m items
  • Choose to sample the i-th item (i > m) with
    probability m/i
  • If sampled, randomly replace a previously sampled
    item
  • Optimization: when i gets large, compute which
    item will be sampled next, skip over intervening
    items. [Vitter 85]
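The basic procedure above (without the skipping optimization) can be sketched as follows. The function name and the optional `rng` parameter are illustrative choices, not part of the original slides.

```python
import random


def reservoir_sample(stream, m, rng=None):
    """Return a uniform sample of m items from a stream of unknown length."""
    rng = rng or random.Random()
    sample = []
    for i, item in enumerate(stream, start=1):
        if i <= m:
            sample.append(item)                # keep the first m items
        elif rng.random() < m / i:             # sample the i-th item w.p. m/i
            sample[rng.randrange(m)] = item    # evict a uniformly chosen slot
    return sample


print(reservoir_sample(range(1, 1001), 5, random.Random(7)))
```

Each item costs one random draw here; the optimization attributed to Vitter instead draws how many items to skip, which this sketch does not attempt.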

19
Reservoir Sampling - Analysis
  • Analyze simple case: sample size m = 1
  • Probability i-th item is the sample from stream
    of length n:
  • Prob. i is sampled on arrival × prob. i survives
    to end
  • $\frac{1}{i} \times \frac{i}{i+1} \times \frac{i+1}{i+2} \times \cdots \times \frac{n-1}{n} = \frac{1}{n}$
  • Case for m > 1 is similar, easy to show uniform
    probability
  • Drawbacks of reservoir sampling: hard to
    parallelize

20
Min-wise Sampling
  • For each item, pick a random fraction between 0
    and 1
  • Store item(s) with the smallest random tag [Nath
    et al. 04]

[Figure: stream items with random tags 0.391, 0.908, 0.291, 0.555, 0.619, 0.273]
  • Each item has same chance of least tag, so
    uniform
  • Can run on multiple streams separately, then
    merge
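A minimal sketch of min-wise sampling for a single sample, including the merge step for separately processed streams. The names `minwise_sample` and `merge_samples` are hypothetical; a sample is represented as a `(tag, item)` pair.

```python
import random


def minwise_sample(stream, rng):
    """Return (tag, item) for the item that drew the smallest random tag."""
    best = None
    for item in stream:
        tag = rng.random()                 # random fraction between 0 and 1
        if best is None or tag < best[0]:
            best = (tag, item)
    return best


def merge_samples(a, b):
    """Combine samples from two streams: keep the smaller tag."""
    if a is None:
        return b
    if b is None:
        return a
    return a if a[0] <= b[0] else b


rng = random.Random(1)
left = minwise_sample("abc", rng)
right = minwise_sample("def", rng)
print(merge_samples(left, right))
```

Because the merged result is just the minimum tag over both streams, it is distributed exactly as if one sampler had seen the concatenated stream.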

21
Sampling Exercises
  • What happens when each item in the stream also
    has a weight attached, and we want to sample
    based on these weights?
  • Generalize the reservoir sampling algorithm to
    draw a single sample in the weighted case.
  • Generalize reservoir sampling to sample multiple
    weighted items, and show an example where it
    fails to give a meaningful answer.
  • Research problem: design new streaming algorithms
    for sampling in the weighted case, and analyze
    their properties.

22
Outline
  • Introduction to Data Streams
  • Motivating examples and applications
  • Data Streaming models
  • Basic tail bounds
  • Sampling from data streams
  • Sampling to estimate entropy

23
Application of Sampling: Entropy
  • Given a long sequence of characters
  • $S = \langle a_1, a_2, a_3, \ldots, a_m \rangle$, each $a_j \in \{1, \ldots, n\}$
  • Let $f_i$ = frequency of i in the sequence
  • Compute the empirical entropy:
  • $H(S) = -\sum_i (f_i/m) \log (f_i/m) = -\sum_i p_i \log p_i$
  • Example: $S = \langle a, b, a, b, c, a, d, a \rangle$
  • $p_a = 1/2$, $p_b = 1/4$, $p_c = 1/8$, $p_d = 1/8$
  • $H(S) = 1/2 + 1/4 \times 2 + 1/8 \times 3 + 1/8 \times 3 = 7/4$
  • Entropy promoted for anomaly detection in networks
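For reference, the exact (non-streaming) computation of the empirical entropy is short. The sketch below stores every frequency, so it uses space linear in the alphabet; logs are base 2, matching the worked example. The function name is an illustrative choice.

```python
from collections import Counter
from math import log2


def empirical_entropy(seq):
    """Exact empirical entropy H(S) in bits, using a full frequency table."""
    m = len(seq)
    return -sum((f / m) * log2(f / m) for f in Counter(seq).values())


print(empirical_entropy("ababcada"))  # 1.75, i.e. 7/4
```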

24
Challenge
  • Goal: approximate H(S) in space sublinear
    (poly-log) in m (stream length), n (alphabet
    size)
  • $(\epsilon, \delta)$ approx: answer is $(1 \pm \epsilon) H(S)$ w/prob $1-\delta$
  • Easy if we have O(n) space: compute each $f_i$
    exactly
  • More challenging if n is huge, m is huge, and we
    have only one pass over the input in order
  • (The data stream model)

25
Sampling Based Algorithm
  • Simple estimator:
  • Randomly sample a position j in the stream
  • Count how many times $a_j$ appears from position j
    onwards (including position j itself) = r
  • Output $X = -(r \log (r/m) - (r-1) \log((r-1)/m))$
  • Claim: Estimator is unbiased, $E[X] = H(S)$
  • Proof: prob of picking j = 1/m, sum telescopes
    correctly
  • Variance of estimate is not too large: $\mathrm{Var}[X] = O(\log^2 m)$
  • Observe that $|X| \le \log m$
  • $\mathrm{Var}[X] = E[(X - E[X])^2] < (\max(X) - \min(X))^2 = O(\log^2 m)$
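The unbiasedness claim can be checked numerically: averaging the estimator over every possible sampled position j gives H(S) exactly. The sketch below works on a stored sequence for clarity, so it is not a streaming implementation; the function names are illustrative and logs are base 2.

```python
from math import log2
import random


def r_log_term(r, m):
    """r * log(r/m), with the convention 0 * log 0 = 0."""
    return 0.0 if r == 0 else r * log2(r / m)


def estimate_at(seq, j):
    """Value of the estimator X when position j (0-based) is sampled."""
    m = len(seq)
    r = sum(1 for a in seq[j:] if a == seq[j])   # occurrences from j onwards
    return -(r_log_term(r, m) - r_log_term(r - 1, m))


def sampled_estimate(seq, rng):
    """One random draw of the estimator."""
    return estimate_at(seq, rng.randrange(len(seq)))


seq = "ababcada"
expectation = sum(estimate_at(seq, j) for j in range(len(seq))) / len(seq)
print(expectation)                       # equals H(S) = 1.75 up to rounding
print(sampled_estimate(seq, random.Random(0)))
```

In a true one-pass setting the position j would be chosen with reservoir or min-wise sampling and r counted as the stream continues.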

26
Analysis of Basic Estimator
  • A general technique in data streams:
  • Repeat in parallel an unbiased estimator with
    bounded variance, take average of estimates to
    improve variance
  • $\mathrm{Var}[\frac{1}{k}(Y_1 + Y_2 + \cdots + Y_k)] = \frac{1}{k} \mathrm{Var}[Y]$
  • By Chebyshev, need k repetitions to be
    $\mathrm{Var}[X]/(\epsilon^2 E^2[X])$
  • For entropy, this means space $k = O(\log^2 m/(\epsilon^2 H^2(S)))$
  • Problem for entropy: when H(S) is very small?
  • Space needed for an accurate approx goes as $1/H^2$!

27
Low Entropy
  • But... what does a low entropy stream look like?
  • aaaaaaaaaaaaaaaaaaaaaaaaaaaaaabaaaaa
  • Very boring most of the time, we are only rarely
    surprised
  • Can there be two frequent items?
  • aabababababababaababababbababababababa
  • No! That's high entropy (≈ 1 bit / character)
  • Only way to get H(S) = o(1) is to have only one
    character with $p_i$ close to 1

28
Removing the frequent character
  • Write entropy as
  • $-p_a \log p_a + (1-p_a) H(S')$
  • Where S' = stream S with all 'a's removed
  • Can show:
  • Doesn't matter if H(S') is small: as $p_a$ is large,
    additive error on H(S') ensures relative error on
    $(1-p_a) H(S')$
  • Relative error $(1-p_a)$ on $p_a$ gives relative error
    on $p_a \log p_a$
  • Summing both (positive) terms gives relative
    error overall

29
Finding the frequent character
  • Ejecting a is easy if we know in advance what it
    is
  • Can then compute $p_a$ exactly
  • Can find online deterministically
  • Assume $p_a > 2/3$ (if not, H(S) > 0.9, and original
    alg works)
  • Run a heavy hitters algorithm on the stream
    (see later)
  • Modify analysis, find a and $p_a \pm \epsilon(1-p_a)$
  • But... how to also compute H(S') simultaneously
    if we don't know a from the start... do we need
    two passes?

30
Always have a back up plan...
  • Idea: keep two samples to build our estimator
  • If at the end one of our samples is 'a', use the
    other
  • How to do this and ensure uniform sampling?
  • Pick first sample with min-wise sampling
  • At end of the stream, if the sampled character
    = 'a', we want to sample from the stream ignoring
    all 'a's
  • This is just the character achieving the
    smallest label distinct from the one that
    achieves the smallest label
  • Can track information to do this in a single
    pass, constant space

31
Sampling Two Tokens
[Figure: example stream of tokens A, B, C, D, each assigned a random tag between 0 and 1; rows labelled Stream, Tags, Repeats]
  • Assign tags, choose first token as before
  • Delete all occurrences of first token
  • Choose token with min remaining tag; count
    repeats
  • Implementation: keep track of two triples
  • (min tag, corresponding token, number of repeats)
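One possible single-pass realization of the two-triple idea is sketched below. It is an illustration under stated assumptions, not the authors' code: each triple is a list `[tag, token, repeats]`, tags are supplied explicitly so the run is reproducible, and `repeats` counts occurrences of the token from the position that achieved the tag onwards.

```python
import random


def two_token_sample(stream, tags):
    """Track the min-tag token and the min-tag token among the other tokens."""
    first = second = None                  # each is [tag, token, repeats]
    for token, tag in zip(stream, tags):
        if first is None:
            first = [tag, token, 1]
        elif token == first[1]:
            if tag < first[0]:
                first = [tag, token, 1]    # same token, new minimum position
            else:
                first[2] += 1              # one more repeat of the first token
        elif tag < first[0]:
            # New overall minimum with a different token: the old first
            # becomes the best candidate among the remaining tokens.
            first, second = [tag, token, 1], first
        elif second is None or tag < second[0]:
            second = [tag, token, 1]
        elif token == second[1]:
            second[2] += 1
    return first, second


rng = random.Random(0)
stream = "BCDBBBAAAAAC"
tags = [rng.random() for _ in stream]
print(two_token_sample(stream, tags))
```

The key invariant is that `second` always holds the smallest tag among positions whose token differs from `first`'s token, which is what "delete all occurrences of the first token, then take the minimum" would produce.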

32
Putting it all together
  • Can combine all these pieces
  • Build an estimator based on tracking this
    information, deciding whether there is a frequent
    character or not
  • A more involved Chernoff bounds argument improves
    number of repetitions of estimator from
    $O(\epsilon^{-2} \mathrm{Var}[X]/E^2[X])$ to $O(\epsilon^{-2} \mathrm{Range}[X]/E[X]) = O(\epsilon^{-2} \log m)$
  • In $O(\epsilon^{-2} \log m \log 1/\delta)$ space (words) we can
    compute an $(\epsilon, \delta)$ approximation to H(S) in a
    single pass

33
Entropy Exercises
  • As a subroutine, we need to find an element that
    occurs more than 2/3 of the time and estimate its
    weight
  • How can we find a frequently occurring item?
  • How can we estimate its weight p with $\epsilon(1-p)$
    error?
  • Our algorithm uses $O(\epsilon^{-2} \log m \log 1/\delta)$ space,
    could this be improved or is it optimal (lower
    bounds)?
  • Our algorithm updates each sampled pair for every
    update, how quickly can we implement it?
  • (Research problem) What if there are multiple
    distributed streams and we want to compute the
    entropy of their union?

34
Outline
  • Introduction to Data Streams
  • Motivating examples and applications
  • Data Streaming models
  • Basic tail bounds
  • Sampling from data streams
  • Sampling to estimate entropy

35
Data Stream Algorithms Frequency Moments
Graham Cormode graham_at_research.att.com
36
Frequency Moments
  • Introduction to Frequency Moments and Sketches
  • Count-Min sketch for $F_\infty$ and frequent items
  • AMS Sketch for F2
  • Estimating F0
  • Extensions
  • Higher frequency moments
  • Combined frequency moments

37
Last Time
  • Introduced data streams and data stream models
  • Focus on a stream defining a frequency
    distribution
  • Sampling to draw a uniform sample from the stream
  • Entropy estimation based on sampling

38
This Time Frequency Moments
  • Given a stream of updates, let $f_i$ be the number
    of times that item i is seen in the stream
  • Define $F_k$ of the stream as $\sum_i (f_i)^k$, the k-th
    Frequency Moment
  • Space Complexity of the Frequency Moments by
    Alon, Matias, Szegedy in STOC 1996 studied this
    problem
  • Awarded Gödel prize in 2005
  • Set the pattern for many streaming algorithms to
    follow
  • Frequency moments are at the core of many
    streaming problems

39
Frequency Moments
  • $F_0$: count 1 if $f_i \ne 0$, i.e. number of distinct items
  • $F_1$: length of stream, easy
  • $F_2$: sum the squares of the frequencies, the self
    join size
  • $F_k$: related to statistical moments of the
    distribution
  • $F_\infty$ (really $\lim_{k \to \infty} F_k^{1/k}$): dominated by the
    largest $f_i$, finds the largest frequency
  • Different techniques needed for each one.
  • Mostly sketch techniques, which compute a certain
    kind of random linear projection of the stream
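To make the definitions concrete, here is an exact, non-streaming computation of the moments from a full frequency table. The function name is illustrative; the sketches on later slides aim to approximate these values without storing the table.

```python
from collections import Counter


def frequency_moment(stream, k):
    """Exact F_k of a stream; k may be 0, a positive integer, or float('inf')."""
    freqs = list(Counter(stream).values())
    if k == 0:
        return sum(1 for f in freqs if f != 0)   # number of distinct items
    if k == float("inf"):
        return max(freqs, default=0)             # largest frequency
    return sum(f ** k for f in freqs)


stream = "ababcada"
print(frequency_moment(stream, 0))             # 4
print(frequency_moment(stream, 1))             # 8
print(frequency_moment(stream, 2))             # 22
print(frequency_moment(stream, float("inf")))  # 4
```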

40
Sketches
  • Not every problem can be solved with sampling
  • Example: counting how many distinct items in the
    stream
  • If a large fraction of items aren't sampled,
    don't know if they are all same or all different
  • Other techniques take advantage that the
    algorithm can see all the data even if it can't
    remember it all
  • (To me) a sketch is a linear transform of the
    input
  • Model stream as defining a vector, sketch is
    result of multiplying stream vector by an
    (implicit) matrix

[Figure: linear projection]
41
Trivial Example of a Sketch
1 0 1 1 1 0 1 0 1
1 0 1 1 0 0 1 0 1
  • Test if two (asynchronous) binary streams are
    equal: d(x,y) = 0 iff x = y, 1 otherwise
  • To test in small space: pick a random hash
    function h
  • Test h(x) = h(y): small chance of false positive,
    no chance of false negative.
  • Compute h(x), h(y) incrementally as new bits
    arrive (e.g. $h(x) = \sum_i x_i t^i \bmod p$ for random prime
    p, and t < p)
  • Exercise: extend to real valued vectors in update
    model
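A small sketch of the polynomial fingerprint, assuming updates arrive as `(position, delta)` pairs in any order. One simplification is labeled here and in the code: instead of a random prime p, it fixes p to the Mersenne prime 2^61 - 1 and randomizes only the evaluation point t. The class and function names are illustrative.

```python
import random

PRIME = 2**61 - 1  # assumption: a fixed Mersenne prime rather than a random prime


class StreamFingerprint:
    """Maintains h(x) = sum_i x_i * t^i mod p under position updates."""

    def __init__(self, t, p=PRIME):
        self.t, self.p, self.value = t, p, 0

    def update(self, i, delta=1):
        # Linear in x, so updates may arrive in any order.
        self.value = (self.value + delta * pow(self.t, i, self.p)) % self.p


def fingerprints_equal(updates_x, updates_y, rng=None):
    rng = rng or random.Random()
    t = rng.randrange(1, PRIME)            # shared random evaluation point
    fx, fy = StreamFingerprint(t), StreamFingerprint(t)
    for i, delta in updates_x:
        fx.update(i, delta)
    for i, delta in updates_y:
        fy.update(i, delta)
    return fx.value == fy.value


x = [1, 0, 1, 1, 1, 0, 1, 0, 1]
y = [1, 0, 1, 1, 0, 0, 1, 0, 1]
ux = [(i, b) for i, b in enumerate(x) if b]
uy = [(i, b) for i, b in enumerate(y) if b]
print(fingerprints_equal(ux, list(reversed(ux))))  # True: same vector, any order
print(fingerprints_equal(ux, uy))                  # False: they differ at one position
```

Equal vectors always produce equal fingerprints; unequal vectors collide only if t happens to be a root of the difference polynomial modulo p.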

42
Frequency Moments
  • Introduction to Frequency Moments and Sketches
  • Count-Min sketch for $F_\infty$ and frequent items
  • AMS Sketch for F2
  • Estimating F0
  • Extensions
  • Higher frequency moments
  • Combined frequency moments

43
Count-Min Sketch
  • Simple sketch idea, can be used as the basis
    of many different stream mining tasks.
  • Model input stream as a vector x of dimension U
  • Creates a small summary as an array of w × d in
    size
  • Use d hash functions to map vector entries to
    [1..w]
  • Works on arrivals only and arrivals & departures
    streams

[Figure: array CM[i,j] with d rows and w columns]
44
Count-Min Sketch Structure
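[Figure: update (j, +c) applied to a sketch with $d = \log 1/\delta$ rows and $w = 2/\epsilon$ columns]
  • Each entry in vector x is mapped to one bucket
    per row.
  • Merge two sketches by entry-wise summation
  • Estimate $x_j$ by taking $\min_k CM[k, h_k(j)]$
  • Guarantees error less than $\epsilon F_1$ in size $O(\frac{1}{\epsilon} \log \frac{1}{\delta})$
  • Probability of more error is less than $\delta$

[C, Muthukrishnan 04]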
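A compact sketch of the structure for arrivals-only streams of integer items. The hash family ((a·j + b) mod P) mod w with a fixed Mersenne prime P is an assumption made for this example, as are the class and method names.

```python
import math
import random


class CountMinSketch:
    PRIME = 2**61 - 1  # assumption: fixed prime larger than any item id

    def __init__(self, eps, delta, seed=None):
        rng = random.Random(seed)
        self.w = math.ceil(2 / eps)                # w = 2/eps columns
        self.d = math.ceil(math.log2(1 / delta))   # d = log(1/delta) rows
        # One pairwise-independent hash per row.
        self.hashes = [(rng.randrange(1, self.PRIME), rng.randrange(self.PRIME))
                       for _ in range(self.d)]
        self.table = [[0] * self.w for _ in range(self.d)]

    def _bucket(self, k, j):
        a, b = self.hashes[k]
        return ((a * j + b) % self.PRIME) % self.w

    def update(self, j, c=1):
        for k in range(self.d):                    # one bucket per row
            self.table[k][self._bucket(k, j)] += c

    def estimate(self, j):
        return min(self.table[k][self._bucket(k, j)] for k in range(self.d))

    def merge(self, other):
        """Entry-wise sum; only meaningful if both sketches share hashes."""
        for row, other_row in zip(self.table, other.table):
            for i, v in enumerate(other_row):
                row[i] += v


cm = CountMinSketch(eps=0.1, delta=0.01, seed=1)
for item in [3, 7, 3, 9, 3, 7, 100, 3]:
    cm.update(item)
print(cm.estimate(3))   # at least 4, the true count; never an underestimate
```

Two sketches built with the same `seed`, `eps` and `delta` share hash functions, so `merge` then summarizes the concatenation of their streams.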
45
Approximation of Point Queries
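  • Approximate point query: $\hat{x}_j = \min_k CM[k, h_k(j)]$
  • Analysis: In k'th row, $CM[k, h_k(j)] = x_j + X_{k,j}$
  • $X_{k,j} = \sum_{i \ne j} x_i \cdot [h_k(i) = h_k(j)]$
  • $E(X_{k,j}) = \sum_{i \ne j} x_i \Pr[h_k(i) = h_k(j)] \le \Pr[h_k(i) = h_k(j)] \cdot \sum_i x_i = \epsilon F_1/2$ by pairwise
    independence of h
  • $\Pr[X_{k,j} \ge \epsilon F_1] \le \Pr[X_{k,j} \ge 2E(X_{k,j})] \le 1/2$ by
    Markov inequality
  • So, $\Pr[\hat{x}_j \ge x_j + \epsilon F_1] = \Pr[\forall k.\ X_{k,j} > \epsilon F_1] \le (1/2)^{\log 1/\delta} = \delta$
  • Final result: with certainty $x_j \le \hat{x}_j$ and
    with probability at least $1-\delta$, $\hat{x}_j < x_j + \epsilon F_1$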

46
Applications of Count-Min to $F_\infty$
  • Count-Min sketch lets us estimate $f_i$ for any i
    (up to $\epsilon F_1$)
  • $F_\infty$ asks to find $\max_i f_i$
  • Slow way: test every i after creating sketch
  • Faster way: test every i after it is seen in the
    stream, and remember largest estimated value
  • Alternate way:
  • keep a binary tree over the domain of input
    items, where each node corresponds to a subset
  • keep sketches of all nodes at same level
  • descend tree to find large frequencies,
    discarding branches with low frequency

47
Count-Min Exercises
  1. The median of a distribution is the item so that
    the sum of the frequencies of lexicographically
    smaller items is ½ F1. Use CM sketch to find the
    (approximate) median.
  2. Assume the input frequencies follow the Zipf
    distribution so that the i-th largest frequency
    is $\Theta(i^{-z})$ for z > 1. Show that CM sketch only
    needs to be size $\epsilon^{-1/z}$ to give same guarantee
  3. Suppose we have arrival and departure streams
    where the frequencies of items are allowed to be
    negative. Extend CM sketch analysis to estimate
    these frequencies (note, Markov argument no
    longer works)
  4. How to find the large absolute frequencies when
    some are negative? Or in the difference of two
    streams?

48
Frequency Moments
  • Introduction to Frequency Moments and Sketches
  • Count-Min sketch for $F_\infty$ and frequent items
  • AMS Sketch for F2
  • Estimating F0
  • Extensions
  • Higher frequency moments
  • Combined frequency moments

49
F2 estimation
  • AMS sketch (for Alon-Matias-Szegedy) proposed in
    1996
  • Allows estimation of F2 (second frequency moment)
  • Used at the heart of many streaming and
    non-streaming mining applications: achieves
    dimensionality reduction
  • Here, describe AMS sketch by generalizing CM
    sketch.
  • Uses extra hash functions $g_1, \ldots, g_{\log 1/\delta}: \{1, \ldots, U\} \to \{+1, -1\}$
  • Now, given update (j, +c), set $CM[k, h_k(j)] \mathrel{+}= c \cdot g_k(j)$

[Figure: linear projection producing the AMS sketch]
50
F2 analysis
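[Figure: update (j, +c) applied to a sketch with $d = 8 \log 1/\delta$ rows and $w = 4/\epsilon^2$ columns]
  • Estimate $F_2 = \mathrm{median}_k \sum_i CM[k,i]^2$
  • Each row's result is $\sum_i g(i)^2 x_i^2 + \sum_{h(i)=h(j)} 2\, g(i)\, g(j)\, x_i x_j$
  • But $g(i)^2 = (-1)^2 = (+1)^2 = 1$, and $\sum_i x_i^2 = F_2$
  • g(i)g(j) has 1/2 chance of +1 or -1:
    expectation is 0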
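The AMS sketch described above can be sketched in code as follows. The hash constructions are assumptions for illustration: bucket hashes are degree-1 polynomials and sign hashes are degree-3 polynomials over a fixed Mersenne prime field, with the sign taken from the low bit (very slightly biased, which this sketch ignores). Class and method names are made up.

```python
import random
import statistics


class AMSSketch:
    PRIME = 2**61 - 1  # assumed field size for the polynomial hashes

    def __init__(self, w, d, seed=None):
        rng = random.Random(seed)
        self.w, self.d = w, d
        # h_k: 2 coefficients (pairwise independent bucket choice).
        self.h = [[rng.randrange(self.PRIME) for _ in range(2)] for _ in range(d)]
        # g_k: 4 coefficients (four-wise independent), reduced to +1/-1.
        self.g = [[rng.randrange(self.PRIME) for _ in range(4)] for _ in range(d)]
        self.table = [[0] * w for _ in range(d)]

    def _poly(self, coeffs, j):
        value = 0
        for c in coeffs:                   # Horner's rule modulo PRIME
            value = (value * j + c) % self.PRIME
        return value

    def update(self, j, c=1):
        for k in range(self.d):
            bucket = self._poly(self.h[k], j) % self.w
            sign = 1 if self._poly(self.g[k], j) & 1 else -1
            self.table[k][bucket] += c * sign

    def estimate_f2(self):
        # Median over rows of the sum of squared counters.
        return statistics.median(sum(v * v for v in row) for row in self.table)


sketch = AMSSketch(w=16, d=5, seed=0)
sketch.update(7, 3)
sketch.update(7, 2)
print(sketch.estimate_f2())   # 25: a single item of frequency 5 is recovered exactly
```

With several distinct items the estimate is random; the slides that follow bound its variance by $F_2^2/w$ per row.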

51
F2 Variance
  • Expectation of row estimate $R_k = \sum_i CM[k,i]^2$ is
    exactly $F_2$
  • Variance of row k, $\mathrm{Var}[R_k]$, is an expectation:
  • $\mathrm{Var}[R_k] = E[(\sum_{\text{buckets } b} (CM[k,b])^2 - F_2)^2]$
  • Good exercise in algebra: expand this sum and
    simplify
  • Many terms are zero in expectation because of
    terms like g(a)g(b)g(c)g(d) (degree at most 4)
  • Requires that hash function g is four-wise
    independent: it behaves uniformly over subsets of
    size four or smaller
  • Such hash functions are easy to construct

52
F2 Variance
  • Terms with odd powers of g(a) are zero in
    expectation
  • $g(a)g(b)g^2(c)$, $g(a)g(b)g(c)g(d)$, $g(a)g^3(b)$
  • Leaves $\mathrm{Var}[R_k] \le \sum_i g^4(i) x_i^4 + 2 \sum_{j \ne i} g^2(i) g^2(j) x_i^2 x_j^2 + 4 \sum_{h(i)=h(j)} g^2(i) g^2(j) x_i^2 x_j^2 - (x_i^4 + \sum_{j \ne i} 2 x_i^2 x_j^2) \le F_2^2/w$
  • Row variance can finally be bounded by $F_2^2/w$
  • Chebyshev for $w = 4/\epsilon^2$ gives probability ¼ of
    failure
  • How to amplify this to small $\delta$ probability of
    failure?

53
Tail Inequalities for Sums
  • We derive stronger bounds on tail probabilities
    for the sum of independent Bernoulli trials via
    the Chernoff Bound
  • Let $X_1, \ldots, X_m$ be independent Bernoulli trials
    s.t. $\Pr[X_i = 1] = p$ ($\Pr[X_i = 0] = 1-p$).
  • Let $X = \sum_{i=1}^m X_i$, and $\mu = mp$ be the expectation
    of X.
  • Then, for any $\epsilon > 0$:

54
Applying Chernoff Bound
  • Each row gives an estimate that is within $\epsilon$
    relative error with probability p > ¾
  • Take d repetitions and find the median. Why the
    median?
  • Because bad estimates are either too small or too
    large
  • Good estimates form a contiguous group in the
    middle
  • At least d/2 estimates must be bad for median to
    be bad
  • Apply Chernoff bound to d independent estimates,
    p = 3/4
  • Pr[More than d/2 bad estimates] < 2 exp(-d/8)
  • So we set $d = \Theta(\ln 1/\delta)$ to give $\delta$ probability of
    failure
  • Same outline used many times in data streams

55
Aside on Independence
  • Full independence is expensive in a streaming
    setting
  • If hash functions are fully independent over n
    items, then we need $\Omega(n)$ space to store their
    description
  • Pairwise and four-wise independent hash functions
    can be described in a constant number of words
  • The F2 algorithm uses a careful mix of limited
    and full independence
  • Each hash function is four-wise independent over
    all n items
  • Each repetition is fully independent of all
    others, but there are only $O(\log 1/\delta)$
    repetitions.

56
AMS Sketch Exercises
  • Let x and y be binary streams of length n. The
    Hamming distance $H(x,y) = |\{i : x_i \ne y_i\}|$. Show
    how to use AMS sketches to approximate H(x,y)
  • Extend for strings drawn from an arbitrary
    alphabet
  • The inner product of two strings x, y is $x \cdot y = \sum_{i=1}^n x_i y_i$.
    Use AMS sketches to estimate $x \cdot y$
  • Hint: try computing the inner product of the
    sketches. Show the estimator is unbiased (correct
    in expectation)
  • What form does the error in the approximation
    take?
  • Use Count-Min Sketches for the same problem and
    compare the errors.
  • Is it possible to build a $(1 \pm \epsilon)$ approximation of
    $x \cdot y$?

57
Frequency Moments
  • Introduction to Frequency Moments and Sketches
  • Count-Min sketch for $F_\infty$ and frequent items
  • AMS Sketch for F2
  • Estimating F0
  • Extensions
  • Higher frequency moments
  • Combined frequency moments

58
F0 Estimation
  • F0 is the number of distinct items in the stream
  • a fundamental quantity with many applications
  • Early algorithms by Flajolet and Martin 1983
    gave nice hashing-based solution
  • analysis assumed fully independent hash functions
  • Will describe a generalized version of the FM
    algorithm due to Bar-Yossef et al. with only
    pairwise independence

59
F0 Algorithm
  • Let [m] be the domain of stream elements
  • Each item in stream is from [1..m]
  • Pick a random hash function $h: [m] \to [m^3]$
  • With probability at least 1-1/m, no collisions
    under h
  • For each stream item i, compute h(i), and track
    the t distinct items achieving the smallest
    values of h(i)
  • Note: if same i is seen many times, h(i) is same
  • Let $v_t$ = t-th smallest value of h(i) seen.
  • If $F_0 < t$, give exact answer, else estimate $\hat{F}_0 = t m^3/v_t$
  • $v_t/m^3 \approx$ fraction of hash domain occupied by t
    smallest

[Figure: number line of the hash range from 0 to $m^3$]
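A minimal sketch of this estimator for integer items in 1..m. The hash function ((a·i + b) mod P) mod m³ with a fixed Mersenne prime P (assumed larger than m³) stands in for the pairwise independent family; the function name and parameters are illustrative.

```python
import heapq
import random


def estimate_f0(stream, m, t, seed=None):
    """Estimate the number of distinct items via the t smallest hash values."""
    rng = random.Random(seed)
    prime = 2**61 - 1                      # assumption: prime > m**3
    hash_range = m ** 3
    a, b = rng.randrange(1, prime), rng.randrange(prime)
    heap, kept = [], set()                 # max-heap (negated) of t smallest values
    for i in stream:
        v = ((a * i + b) % prime) % hash_range + 1   # hash value in 1..m^3
        if v in kept:
            continue                       # repeated item: same hash, ignore
        if len(heap) < t:
            heapq.heappush(heap, -v)
            kept.add(v)
        elif v < -heap[0]:
            evicted = -heapq.heappushpop(heap, -v)   # drop the current largest
            kept.discard(evicted)
            kept.add(v)
    if len(heap) < t:
        return len(heap)                   # fewer than t distinct values: exact
    v_t = -heap[0]                         # t-th smallest hash value seen
    return t * hash_range / v_t


stream = [random.Random(1).randrange(1, 1001) for _ in range(5000)]
print(estimate_f0(stream, m=1000, t=80, seed=2), len(set(stream)))
```

Repeating an item never changes the state, which is why the estimate depends only on the set of distinct items.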
60
Analysis of F0 algorithm
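  • Suppose $\hat{F}_0 = t m^3/v_t > (1+\epsilon) F_0$, i.e. estimate is
    too high
  • So for stream = set $S \in 2^{[m]}$, we have
  • $|\{s \in S : h(s) < t m^3/((1+\epsilon) F_0)\}| > t$
  • Because $\epsilon < 1$, we have $t m^3/((1+\epsilon) F_0) \le (1-\epsilon/2)\, t m^3/F_0$
  • $\Pr[h(s) < (1-\epsilon/2)\, t m^3/F_0] \approx \frac{1}{m^3} \cdot (1-\epsilon/2)\, t m^3/F_0 = (1-\epsilon/2)\, t/F_0$
  • (this analysis outline hides some rounding
    issues)

[Figure: the hash range from 0 to $m^3$, with $t m^3/((1+\epsilon) F_0)$ and $v_t$ marked]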
61
Chebyshev Analysis
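  • Let Y be number of items hashing to under
    $t m^3/((1+\epsilon) F_0)$
  • $E[Y] = F_0 \cdot \Pr[h(s) < t m^3/((1+\epsilon) F_0)] \le (1-\epsilon/2)\, t$
  • For each item i, variance of the event = p(1-p) < p
  • $\mathrm{Var}[Y] = \sum_{s \in S} \mathrm{Var}[h(s) < t m^3/((1+\epsilon) F_0)] < (1-\epsilon/2)\, t$
  • We sum variances because of pairwise independence
  • Now apply Chebyshev:
  • $\Pr[Y > t] \le \Pr[|Y - E[Y]| > \epsilon t/2] \le 4 \mathrm{Var}[Y]/(\epsilon^2 t^2) < 4t/(\epsilon^2 t^2)$
  • Set $t = 20/\epsilon^2$ to make this Prob ≤ 1/5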

62
Completing the analysis
  • We have shown $\Pr[\hat{F}_0 > (1+\epsilon) F_0] < 1/5$
  • Can show $\Pr[\hat{F}_0 < (1-\epsilon) F_0] < 1/5$ similarly
  • too few items hash below a certain value
  • So $\Pr[(1-\epsilon) F_0 \le \hat{F}_0 \le (1+\epsilon) F_0] > 3/5$ [Good
    estimate]
  • Amplify this probability: repeat $O(\log 1/\delta)$ times
    in parallel with different choices of hash
    function h
  • Take the median of the estimates, analysis as
    before

63
F0 Issues
  • Space cost
  • Store t hash values, so $O(1/\epsilon^2 \log m)$ bits
  • Can improve to $O(1/\epsilon^2 + \log m)$ with additional
    tricks
  • Time cost
  • Find if hash value h(i) < $v_t$
  • Update $v_t$ and list of t smallest if h(i) not
    already present
  • Total time $O(\log 1/\epsilon \cdot \log m)$ worst case

64
Range Efficiency
  • Sometimes input is specified as a stream of
    ranges [a,b]
  • [a,b] means insert all items (a, a+1, a+2, ..., b)
  • Trivial solution: just insert each item in the
    range
  • Range efficient F0 [Pavan, Tirthapura 05]
  • Start with an alg for F0 based on pairwise hash
    functions
  • Key problem: track which items hash into a
    certain range
  • Dives into hash fns to divide and conquer for
    ranges
  • Range efficient F2 [Calderbank et al. 05,
    Rusu, Dobra 06]
  • Start with sketches for F2 which sum hash values
  • Design new hash functions so that range sums are
    fast

65
F0 Exercises
  • Suppose the stream consists of a sequence of
    insertions and deletions. Design an algorithm to
    approximate F0 of the current set.
  • What happens when some frequencies are negative?
  • Give an algorithm to find F0 of the most recent W
    arrivals
  • Use F0 algorithms to approximate Max-dominance:
    given a stream of pairs (i,x(i)), approximate
    $\sum_i \max_{(i, x(i))} x(i)$

66
Frequency Moments
  • Introduction to Frequency Moments and Sketches
  • Count-Min sketch for $F_\infty$ and frequent items
  • AMS Sketch for F2
  • Estimating F0
  • Extensions
  • Higher frequency moments
  • Combined frequency moments

67
Higher Frequency Moments
  • $F_k$ for k > 2. Use sampling trick as with Entropy
    [Alon et al 96]
  • Uniformly pick an item from the stream (positions 1..n)
  • Set r = how many times that item appears
    subsequently
  • Set estimate $\hat{F}_k = n(r^k - (r-1)^k)$
  • $E[\hat{F}_k] = \frac{1}{n} \cdot n \cdot [f_1^k - (f_1-1)^k + (f_1-1)^k - (f_1-2)^k + \cdots + 1^k - 0^k + \cdots] = f_1^k + f_2^k + \cdots = F_k$
  • $\mathrm{Var}[\hat{F}_k] \le \frac{1}{n} \cdot n^2 \cdot [(f_1^k - (f_1-1)^k)^2 + \cdots]$
  • Use various bounds to bound the variance by $k\, m^{1-1/k} F_k^2$
  • Repeat $k\, m^{1-1/k}$ times in parallel to reduce
    variance
  • Total space needed is $O(k\, m^{1-1/k})$ machine words
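The telescoping argument can be verified directly: averaging the estimator over all n sampled positions returns $F_k$ exactly. As with the entropy example, this sketch operates on a stored sequence purely to check the expectation, and the function name is illustrative. Here r counts occurrences from the sampled position onwards, including that position.

```python
def fk_estimate_at(stream, j, k):
    """Value of n * (r^k - (r-1)^k) when position j (0-based) is sampled."""
    n = len(stream)
    r = sum(1 for a in stream[j:] if a == stream[j])
    return n * (r ** k - (r - 1) ** k)


stream = "ababcada"   # frequencies 4, 2, 1, 1
k = 3
mean = sum(fk_estimate_at(stream, j, k) for j in range(len(stream))) / len(stream)
print(mean)           # 74.0 = 4^3 + 2^3 + 1^3 + 1^3
```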

68
Improvements
  • Coppersmith and Kumar 04: Generalize the F2
    approach
  • E.g. For F3, set $p = 1/\sqrt{m}$, and hash items onto
    {1-1/p, -1/p} with probability {1/p, 1-1/p}
    respectively.
  • Compute cube of sum of the hash values of the
    stream
  • Correct in expectation, bound variance $\le O(\sqrt{m} F_3^2)$
  • Indyk, Woodruff 05, Bhuvanagiri et al. 06:
    Optimal solutions by extracting different
    frequencies
  • Use hashing to sample subsets of items and $f_i$'s
  • Combine these to build the correct estimator
  • Cost is $O(m^{1-2/k} \cdot \text{poly-log}(m, n, 1/\epsilon))$ space

69
Combined Frequency Moments
Consider network traffic data: defines a
communication graph, e.g. edge (source,
destination) or edge (source:port,
dest:port). Defines a (directed) multigraph. We are
interested in the underlying (support) graph on n
nodes
  • Want to focus on number of distinct communication
    pairs, not size of communication
  • So want to compute moments of F0 values...

70
Multigraph Problems
  • Let G[i,j] = 1 if (i,j) appears in stream: edge
    from i to j. Total of m distinct edges
  • Let $d_i = \sum_{j=1}^n G[i,j]$: degree of node i
  • Find aggregates of $d_i$'s:
  • Estimate heavy $d_i$'s (people who talk to many)
  • Estimate frequency moments: number of distinct $d_i$
    values, sum of squares
  • Range sums of $d_i$'s (subnet traffic)

71
$F_\infty(F_0)$ using CM-FM
  • Find i's such that $d_i > \phi \sum_i d_i$. Finds the people
    that talk to many others
  • Count-Min sketch only uses additions, so can
    apply

72
Accuracy for $F_\infty(F_0)$
  • Focus on point query accuracy: estimate $d_i$.
  • Can prove estimate has only small bias in
    expectation
  • Analysis is similar to original CM sketch
    analysis, but now have to take account of F0
    estimation of counts
  • Gives a bound of $O(1/\epsilon^3 \cdot \text{poly-log}(n))$ space
  • The product of the size of the sketches
  • Remains to fully understand other combinations of
    frequency moments, eg. F2(F0), F2(F2) etc.

73
Exercises / Problems
  1. (Research problem) What can be computed for other
    combinations of frequency moments, e.g. F2 of F2
    values, etc.?
  2. The F2 algorithm uses the fact that +1/-1 values
    square to preserve F2 but are 0 in expectation.
    Why won't it work to estimate F4 with
    $h \to \{-1, +1, -i, +i\}$?
  3. (Research problem) Read, understand and simplify
    analysis for optimal Fk estimation algorithms
  4. Take the sampling Fk algorithm and combine it
    with F0 estimators to approximate Fk of node
    degrees
  5. Why can't we use the sketch approach for F2 of
    node degrees? Show where the analysis breaks
    down

74
Frequency Moments
  • Introduction to Frequency Moments and Sketches
  • Count-Min sketch for $F_\infty$ and frequent items
  • AMS Sketch for F2
  • Estimating F0
  • Extensions
  • Higher frequency moments
  • Combined frequency moments

75
Data Stream Algorithms Lower Bounds
Graham Cormode graham_at_research.att.com
76
Streaming Lower Bounds
  • Lower bounds for data streams
  • Communication complexity bounds
  • Simple reductions
  • Hardness of Gap-Hamming problem
  • Reductions to Gap-Hamming

1 0 1 1 1 0 1 0 1
77
This Time Lower Bounds
  • So far, have seen many examples of things we can
    do with a streaming algorithm
  • What about things we can't do?
  • What's the best we could achieve for things we
    can do?
  • Will show some simple lower bounds for data
    streams based on communication complexity

78
Streaming As Communication
1 0 1 1 1 0 1 0 1
  • Imagine Alice processing a stream
  • Then take the whole working memory, and send to
    Bob
  • Bob continues processing the remainder of the
    stream

79
Streaming As Communication
  • Suppose Alice's part of the stream corresponds to
    string x, and Bob's part corresponds to string
    y...
  • ...and that computing the function on the stream
    corresponds to computing f(x,y)...
  • ...then if f(x,y) has communication complexity
    $\Omega(g(n))$, then the streaming computation has a
    space lower bound of $\Omega(g(n))$
  • Proof by contradiction: If there was an
    algorithm with better space usage, we could run
    it on x, then send the memory contents as a
    message, and hence solve the communication problem

80
Deterministic Equality Testing
1 0 1 1 1 0 1 0 1
1 0 1 1 0 0 1 0 1
  • Alice has string x, Bob has string y, want to
    test if x = y
  • Consider a deterministic (one-round, one-way)
    protocol that sends a message of length m < n
  • There are $2^m$ possible messages, so some strings
    must generate the same message: this would cause
    error
  • So a deterministic message (sketch) must be $\Omega(n)$
    bits
  • In contrast, we saw a randomized sketch of size
    O(log n)

81
Hard Communication Problems
  • INDEX: x is a binary string of length n; y is an
    index in [n]. Goal: output $x_y$. Result: (one-way)
    (randomized) communication complexity of INDEX is
    $\Omega(n)$ bits
  • DISJOINTNESS: x and y are both length n binary
    strings. Goal: output 1 if $\exists i: x_i = y_i = 1$, else
    0. Result: (multi-round) (randomized)
    communication complexity of DISJOINTNESS is $\Omega(n)$
    bits

82
Simple Reduction to Disjointness
x = 1 0 1 1 0 1 → items 1, 3, 4, 6
y = 0 0 0 1 1 0 → items 4, 5
  • F∞: output the highest frequency in a stream
  • Input: the two strings x and y from disjointness
  • Stream: if x_i = 1, then put i in the stream; then
    do the same for y
  • Analysis: if F∞ = 2, then the sets intersect; if
    F∞ ≤ 1, then they are disjoint.
  • Conclusion: giving an exact answer to F∞ requires
    Ω(N) bits
  • Even approximating up to 50% error is hard
  • Even with randomization: the DISJ bound allows
    randomness
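
A small sketch of this reduction, using the example strings from the slide. The function names are illustrative assumptions, and the highest frequency is computed exactly here, which is precisely what the lower bound says cannot be done in small space.

```python
from collections import Counter


def stream_from_bits(bits):
    """Emit the 1-based positions i where bits has a 1."""
    return [i for i, b in enumerate(bits, start=1) if b == 1]


def highest_frequency(stream):
    """Exact highest item frequency of the stream."""
    return max(Counter(stream).values(), default=0)


def disj_via_highest_frequency(x, y):
    """Decide DISJOINTNESS from the highest frequency of the combined stream."""
    stream = stream_from_bits(x) + stream_from_bits(y)
    return 1 if highest_frequency(stream) == 2 else 0


x = [1, 0, 1, 1, 0, 1]
y = [0, 0, 0, 1, 1, 0]
intersects = disj_via_highest_frequency(x, y)   # item 4 appears twice
```

Each item appears at most once per string, so the highest frequency is 2 exactly when some position is set in both x and y.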

83
Simple Reduction to Index
x = 1 0 1 1 0 1 → items 1, 3, 4, 6
y = 5 → item 5
  • F0: output the number of distinct items in the
    stream
  • Input: the string x and index y from INDEX
  • Stream: if x_i = 1, put i in the stream; then put
    y in the stream
  • Analysis: if (1-ε)F0(x∘y) > (1+ε)F0(x) then y is a
    new item, so x_y = 0; else x_y = 1
  • Conclusion: approximating F0 with ε < 1/N requires
    Ω(N) bits
  • Implies that space to approximate must be Ω(1/ε)
  • Bound allows randomization
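
The INDEX reduction can be sketched the same way. Appending y to Alice's stream increases the distinct count by one exactly when y was not already present, that is, when x_y = 0. The names below are illustrative, and F0 is computed exactly rather than approximated.

```python
def distinct_count(stream):
    """Exact F0: number of distinct items in the stream."""
    return len(set(stream))


def index_via_f0(x, y):
    """Recover x_y (y is 1-based) by comparing F0 before and after adding y."""
    alice_stream = [i for i, b in enumerate(x, start=1) if b == 1]
    before = distinct_count(alice_stream)
    after = distinct_count(alice_stream + [y])
    return 1 if after == before else 0


x = [1, 0, 1, 1, 0, 1]
bit_5 = index_via_f0(x, 5)   # item 5 is new, so x_5 = 0
bit_4 = index_via_f0(x, 4)   # item 4 is already present, so x_4 = 1
```

An approximation must separate F0 from F0 + 1, which is why the relative error has to be below roughly 1/N.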

84
Hardness Reduction Exercises
  • Use reductions to DISJ or INDEX to show the
    hardness of:
  • Frequent items: find all items in the stream
    whose frequency > φN, for some φ.
  • Sliding window: given a stream of binary (0/1)
    values, compute the sum of the last N values
  • Can this be approximated instead?
  • Min-dominance: given a stream of pairs (i,x(i)),
    approximate Σ_i min_{(i,x(i))} x(i)
  • Rank sum: given a stream of (x,y) pairs and query
    (p,q) specified after the stream, approximate
    |{(x,y) : x < p, y < q}|

86
Gap Hamming
  • GAP-HAMM communication problem:
  • Alice holds x ∈ {0,1}^N, Bob holds y ∈ {0,1}^N
  • Promise: H(x,y) is either ≤ N/2 - √N or ≥ N/2 +
    √N
  • Which is the case?
  • Model: one message from Alice to Bob
  • Requires Ω(N) bits of one-way randomized
    communication
  • [Indyk, Woodruff 03], [Woodruff 04], [Jayram,
    Kumar, Sivakumar 07]

87
Hardness of Gap Hamming
  • Reduction to an instance of INDEX
  • Map string x to u by 1 → +1, 0 → -1 (i.e. u_i =
    2x_i - 1)
  • Assume both Alice and Bob have access to public
    random strings r_j, where each bit of r_j is iid
    uniform over {-1, +1}
  • Assume w.l.o.g. that the length n of the string is
    odd (important!)
  • Alice computes a_j = sign(r_j · u)
  • Bob computes b_j = sign(r_{j,y})
  • Repeat N times with different random strings, and
    consider the Hamming distance of a_1 ... a_N with
    b_1 ... b_N
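
The construction can be simulated directly. This is a sketch under stated assumptions: a seeded `random.Random` stands in for the public random strings, and the function names are illustrative. Alice's bits depend only on x and the shared randomness, Bob's only on y and the shared randomness.

```python
import random


def sign(value):
    """Sign of a nonzero integer (sums of an odd number of ±1 are never 0)."""
    return 1 if value > 0 else -1


def gap_hamming_instance(x, y, num_rounds, seed=0):
    """Build (a, b) from an INDEX instance: bit string x, 1-based index y."""
    n = len(x)
    assert n % 2 == 1, "string length must be odd"
    u = [2 * bit - 1 for bit in x]              # map 1 -> +1, 0 -> -1
    shared = random.Random(seed)                # stand-in for public randomness
    a, b = [], []
    for _ in range(num_rounds):
        r = [shared.choice((-1, 1)) for _ in range(n)]
        a.append(sign(sum(ri * ui for ri, ui in zip(r, u))))   # Alice
        b.append(r[y - 1])                      # Bob: sign(r_{j,y}) = r_{j,y}
    return a, b


def hamming(a, b):
    return sum(p != q for p, q in zip(a, b))


x = [1, 0, 1, 1, 0, 1, 0, 0, 1]
a1, b1 = gap_hamming_instance(x, 3, num_rounds=4000, seed=1)   # x_3 = 1
a0, b0 = gap_hamming_instance(x, 2, num_rounds=4000, seed=1)   # x_2 = 0
```

With enough rounds, the Hamming distance falls clearly below half the length when x_y = 1 and clearly above it when x_y = 0.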

88
Probability of a Hamming Error
  • Consider the pair a_j = sign(r_j · u), b_j =
    sign(r_{j,y})
  • Let w = Σ_{i≠y} u_i r_{j,i}
  • w is a sum of (n-1) values distributed iid
    uniform over {-1, +1}
  • Case 1: w ≠ 0. So |w| ≥ 2, since (n-1) is even
  • so a_j = sign(w), independent of x_y
  • Then Pr[a_j = b_j] = Pr[sign(w) = sign(r_{j,y})]
    = ½
  • Case 2: w = 0. So a_j = sign(r_j · u) = sign(w +
    u_y r_{j,y}) = sign(u_y r_{j,y})
  • Then Pr[a_j = b_j] = Pr[sign(u_y r_{j,y}) =
    sign(r_{j,y})]
  • This probability is 1 if u_y = +1, 0 if u_y = -1
  • Completely biased by the answer to INDEX

89
Finishing the Reduction
  • So what is Pr[w = 0]?
  • w is a sum of (n-1) iid uniform {-1, +1} values
  • Textbook: Pr[w = 0] = c/√n, for some constant c
  • Do some probability manipulation:
  • Pr[a_j = b_j] = ½ + c/(2√n) if x_y = 1
  • Pr[a_j = b_j] = ½ - c/(2√n) if x_y = 0
  • Amplify this bias by making strings of length
    N = 4n/c²
  • Apply Chernoff bound on N instances
  • With prob > 2/3, either H(a,b) > N/2 + √N or
    H(a,b) < N/2 - √N
  • If we could solve GAP-HAMMING, could solve INDEX
  • Therefore, need Ω(N) = Ω(n) bits for GAP-HAMMING
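
The quantities on this slide can be computed exactly for small n. This sketch uses the fact that a sum of m = n-1 uniform ±1 values is zero exactly when half of them are +1, so Pr[w = 0] = C(m, m/2) / 2^m. The function names are illustrative, and `amplified_length` simply rewrites N = 4n/c² with c = Pr[w = 0]·√n.

```python
from math import ceil, comb, pi, sqrt


def prob_w_zero(n):
    """Exact Pr[w = 0] for w a sum of n-1 iid uniform ±1 values (n odd)."""
    m = n - 1
    return comb(m, m // 2) / 2 ** m


def agreement_probability(n, x_y):
    """Pr[a_j = b_j]: one half plus or minus half of Pr[w = 0]."""
    p0 = prob_w_zero(n)
    return 0.5 + p0 / 2 if x_y == 1 else 0.5 - p0 / 2


def amplified_length(n):
    """N = 4n/c^2 with c = Pr[w = 0] * sqrt(n), i.e. N = 4 / Pr[w = 0]^2."""
    return ceil(4 / prob_w_zero(n) ** 2)


n = 101
p0 = prob_w_zero(n)
N = amplified_length(n)
expected_shift = N * p0 / 2      # expected distance of H(a,b) from N/2
```

For large n, Pr[w = 0] approaches √(2/(π(n-1))), which gives the constant c, and the expected shift of H(a,b) from N/2 comes out at about √N, matching the gap in the promise.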

91
Lower Bound for Entropy
  • Alice: x ∈ {0,1}^N, Bob: y ∈ {0,1}^N
  • Entropy estimation algorithm A
  • Alice runs A on enc(x) = ⟨(1,x_1), (2,x_2), …,
    (N,x_N)⟩
  • Alice sends over memory contents to Bob
  • Bob continues A on enc(y) = ⟨(1,y_1), (2,y_2), …,
    (N,y_N)⟩

[Figure: Alice's bits 1 1 0 0 1 0 and Bob's bits 0 1 0 0 1 1]
92
Lower Bound for Entropy
  • Observe: there are
  • 2H(x,y) tokens with frequency 1 each
  • N-H(x,y) tokens with frequency 2 each
  • So, the entropy is H(S) = log N + H(x,y)/N, where
    H(x,y) is the Hamming distance
  • Thus the size of Alice's memory contents is Ω(N).
    Set ε = 1/(√N log N) to show a bound of
    Ω((ε log(1/ε))^-2)
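
The identity relating entropy and Hamming distance can be verified numerically. This sketch computes the empirical entropy (base 2) of the stream enc(x) followed by enc(y); the function names are illustrative.

```python
from collections import Counter
from math import log2


def enc(bits):
    """Encode a bit string as the tokens (1, b_1), (2, b_2), ..., (N, b_N)."""
    return [(i, b) for i, b in enumerate(bits, start=1)]


def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))


def empirical_entropy(stream):
    """Empirical entropy in bits: sum over tokens of (f/m) * log2(m/f)."""
    m = len(stream)
    return sum((f / m) * log2(m / f) for f in Counter(stream).values())


x = [1, 1, 0, 0, 1, 0]
y = [0, 1, 0, 0, 1, 1]
N = len(x)
stream_entropy = empirical_entropy(enc(x) + enc(y))
predicted = log2(N) + hamming(x, y) / N
```

Positions where x and y agree produce one token of frequency 2; positions where they differ produce two tokens of frequency 1, which is where the H(x,y)/N term comes from.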

93
Lower Bound for F0
  • Same encoding works for F0 (Distinct Elements)
  • 2H(x,y) tokens with frequency 1 each
  • N-H(x,y) tokens with frequency 2 each
  • F0(S) = N + H(x,y)
  • Either H(x,y) > N/2 + √N or H(x,y) < N/2 - √N
  • If we could approximate F0 with ε < 1/√N, could
    separate
  • But the space bound is Ω(N) = Ω(ε^-2) bits
  • Dependence on ε for F0 is tight
  • Similar arguments show Ω(ε^-2) bounds for Fk
  • Proof assumes k (and hence 2^k) are constants
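
The same encoding gives F0 = N + H(x,y), so the distinct count reveals which side of the gap the Hamming distance is on. A minimal sketch with exact counting and illustrative names (`gap_side` and its return labels are assumptions for this example):

```python
def enc(bits):
    """Encode a bit string as the tokens (1, b_1), (2, b_2), ..., (N, b_N)."""
    return [(i, b) for i, b in enumerate(bits, start=1)]


def distinct_count(stream):
    return len(set(stream))


def hamming_from_f0(x, y):
    """Recover the Hamming distance as F0(enc(x) + enc(y)) - N."""
    return distinct_count(enc(x) + enc(y)) - len(x)


def gap_side(x, y):
    """Report which side of the Gap-Hamming promise the pair falls on."""
    n = len(x)
    h = hamming_from_f0(x, y)
    if h > n / 2 + n ** 0.5:
        return "far"
    if h < n / 2 - n ** 0.5:
        return "near"
    return "promise violated"


zeros, ones = [0] * 16, [1] * 16
side_far = gap_side(zeros, ones)     # distance 16 > 8 + 4
side_near = gap_side(zeros, zeros)   # distance 0 < 8 - 4
```

Since F0 is between N and 2N, telling the two cases apart needs additive error below √N, hence relative error on the order of 1/√N.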

94
Lower Bounds Exercises
  1. Formally argue the space lower bound for F2 via
    Gap-Hamming
  2. Argue space lower bounds for Fk via Gap-Hamming
  3. (Research problem) Extend lower bounds for the
    case when the order of the stream is random or
    near-random
  4. (Research problem) Kumar conjectures the
    multi-round communication complexity of
    Gap-Hamming is Ω(n); this would give lower
    bounds for multi-pass streaming

96
Data Stream Algorithms: Extensions and Open Problems
Graham Cormode graham_at_research.att.com
97
This Time: Extensions
  • Have given the basics of streaming: streams of
    items, frequency moments, upper and lower bounds
  • Many variations with many open problems
  • Streams representing different combinatorial
    objects
  • Streams that are distributed, correlated,
    uncertain
  • Systems for processing streams
  • Different models of streams
  • See also "Open Problems in Data Streams"
    [McGregor 07]
  • Result of a workshop held at IIT Kanpur in Dec
    2006

98
Deterministic Streaming Algorithms
  • Focus so far has been on randomized algorithms
  • Many important problems can be solved
    deterministically!
  • Finding frequent items / heavy hitters
  • Finding quantiles of a distribution
  • For many problems, lower bounds show
    randomization is necessary for sublinear space
  • Anything involving equality testing as a special
    case
  • Frequency moments
  • When they are possible, deterministic algorithms
    are often faster and use less space, so they are
    more practical to implement

99
Clustering On Data Streams
  • Goal: output k cluster centers at end; any
    point can be classified using these centers.
  • Use divide and conquer approach [Guha et al.
    00]
  • Buffer as many points as possible, then cluster
    them
  • Cluster the clusters
  • Cluster the cluster clusters, etc...
  • Each level of clustering gives up extra factors
    in quality
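
A minimal sketch of the buffer-and-recluster idea. It is not the algorithm of Guha et al.: the greedy farthest-point heuristic `k_center` is a swapped-in stand-in for the clustering subroutine, centers are not weighted by the number of points they represent, and all names are illustrative. It assumes `buffer_size` is larger than `k`.

```python
import math


def k_center(points, k):
    """Greedy farthest-point heuristic, used here as a stand-in subroutine."""
    if not points:
        return []
    centers = [points[0]]
    while len(centers) < min(k, len(points)):
        farthest = max(points,
                       key=lambda p: min(math.dist(p, c) for c in centers))
        centers.append(farthest)
    return centers


def stream_cluster(stream, k, buffer_size):
    """Buffer points, cluster each full buffer, then cluster the centers."""
    centers, buffer = [], []
    for point in stream:
        buffer.append(point)
        if len(buffer) == buffer_size:
            centers.extend(k_center(buffer, k))    # cluster the buffer
            buffer = []
            if len(centers) >= buffer_size:
                centers = k_center(centers, k)     # cluster the clusters
    centers.extend(k_center(buffer, k))            # leftover points
    return k_center(centers, k)


# Two well-separated groups of points, interleaved in the stream.
stream = []
for i in range(50):
    stream.append((i % 5, i % 3))
    stream.append((100 + i % 5, 100 + i % 3))
final_centers = sorted(stream_cluster(stream, k=2, buffer_size=10))
```

Memory stays bounded by the buffer plus the retained centers; each extra level of reclustering is where the quality loss mentioned above comes from.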

100
Geometric Streaming
  • Stream specifies a sequence of d-dimensional
    points
  • Answer various geometric problems such as
  • Convex hull
  • Minimum spanning tree weight
  • Facility location
  • Minimum enclosing ball
  • Gridding approach: reduces to Fk or related
    problems [Indyk 03]
  • Core-set: keep a carefully chosen small subset of
    points and evaluate on them [Har-Peled 02],
    [Chan 06]
  • Simple example: for minimum enclosing ball, keep
    extremal points in evenly spaced directions

101
Sliding Window Computations
  • In a sliding window, we only consider the last W
    items
  • W still very large, so want polylog(W) solutions
  • Exponential Histograms [Datar et al. 02] and
    Waves [Gibbons, Tirthapura 02]
  • Deterministic structure tracks counts in a window
  • Based on doubling bucket sizes to give relative
    error
  • Same structure plus sketches solves for
    aggregates
  • Asynchronous streams: items not in timestamp
    order
  • Relative error counts possible [Busch, Tirthapura
    07]
  • Extend concept to other aggregates [C. et al. 08]
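
A simplified sketch in the spirit of exponential histograms, counting the 1s among the last W bits. The class name, the `max_per_size` parameter and the merge policy are assumptions made for this illustration rather than a faithful reproduction of the published structure. Buckets have power-of-two sizes, and the two oldest buckets of a size are merged once there are too many of that size.

```python
class ExponentialHistogram:
    """Approximate count of 1s in the last `window` bits (simplified sketch)."""

    def __init__(self, window, max_per_size=2):
        self.window = window
        self.max_per_size = max_per_size
        self.buckets = []        # (timestamp of newest 1, size), newest first
        self.time = 0

    def update(self, bit):
        self.time += 1
        # Drop buckets whose newest 1 has left the window.
        while self.buckets and self.buckets[-1][0] <= self.time - self.window:
            self.buckets.pop()
        if bit != 1:
            return
        self.buckets.insert(0, (self.time, 1))
        start = 0
        while True:
            size = self.buckets[start][1]
            end = start
            while end < len(self.buckets) and self.buckets[end][1] == size:
                end += 1
            if end - start <= self.max_per_size:
                break
            # Merge the two oldest buckets of this size into one of double size.
            merged = (self.buckets[end - 2][0], 2 * size)
            self.buckets[end - 2:end] = [merged]
            start = end - 2

    def estimate(self):
        if not self.buckets:
            return 0
        total = sum(size for _, size in self.buckets)
        # Only the oldest bucket may straddle the window edge: count half of it.
        return total - self.buckets[-1][1] // 2


histogram = ExponentialHistogram(window=8, max_per_size=2)
for bit in [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]:
    histogram.update(bit)
approx_count = histogram.estimate()
```

Only the oldest bucket is uncertain, and its size is small relative to the total of the newer buckets, which is what yields a relative error guarantee with a logarithmic number of buckets.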

102
Time Decay
  • Assign a weight to each item as a function of its
    age
  • E.g. Exponential decay or polynomial decay
  • Implies weighted versions of problems
  • Cohen and Strauss 2003
  • Can reduce sum and counts to multiple instances
    of sliding window queries
  • C., Korn and Tirthapura 2008
  • Same observation applies to other computations
    (quantiles, frequent items)
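
For the special case of exponential decay, a decayed sum can be maintained exactly in constant space by rescaling on each update. This sketch covers only that simple case and does not implement the reduction to sliding window queries mentioned above; the class name and the parameter `decay_rate` are illustrative, and timestamps are assumed to arrive in non-decreasing order.

```python
import math


class ExpDecayedSum:
    """Sum of x_i * exp(-decay_rate * (now - t_i)) over the items seen so far."""

    def __init__(self, decay_rate):
        self.decay_rate = decay_rate
        self.value = 0.0
        self.last_time = None

    def _decay_to(self, t):
        # Age the running sum from the last update time to time t.
        if self.last_time is None:
            return 0.0
        return self.value * math.exp(-self.decay_rate * (t - self.last_time))

    def update(self, t, x):
        self.value = self._decay_to(t) + x
        self.last_time = t

    def query(self, t):
        return self._decay_to(t)


items = [(0, 1.0), (1, 2.0), (3, 4.0)]
decayed = ExpDecayedSum(decay_rate=0.5)
for t, x in items:
    decayed.update(t, x)
decayed_sum_at_4 = decayed.query(4)
```

Polynomial decay does not factor this way, which is one reason more general reductions are needed for other decay functions.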

103
Multi-Pass Algorithms
  • Some situations allow multiple passes of the
    stream
  • E.g. scanning over slow storage (tape): random
    access not possible, but can scan multiple times
  • Earliest work in streaming [Munro, Paterson 78]
    studied the pass/space tradeoff for finding
    medians
  • Lower bounds can follow from multi-round
    communication complexity bounds

104
Other Massive Data Models
  • Massive Unordered Data (MUD) model [Feldman et
    al. 08]
  • Abstracts computations in MapReduce/Hadoop
    settings
  • Can provably simulate deterministic streaming
    algs
  • What about randomized computations, multiple
    passes?

105
Skewed Streams
  • In practice, not all frequency distributions are
    worst case
  • Few items are frequent, then a long tail of
    infrequent items
  • Such skew is prevalent in network data, word
    frequency, paper citations, city sizes, etc.
  • Zipfian distribution with skew z > 0 (z =
    1..2 typical)
  • Analyze algorithms under assumption of skewed
    data
  • Improved F2 space cost O(ε^(-2/(1+z)) log 1/δ),
    provided z > 1

106
Graph Streaming
(4,5) (2,3) (1,3) (3,5) (1,2) (2,4) (1,5) (3,4)
  • Stream specifies a massive graph edge by edge
  • Most natural problems have Ω(V) space lower
    bounds
  • Semi-streaming model: allow Ω(V) but o(E)
    space. Therefore also o(V²) space
  • Allow one (or few) passes to approximate
  • Minimum Spanning Tree Weight
  • Graph Distances (based on spanners)
  • Maximum weight matching
  • Counting Triangles
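
As one illustration of the semi-streaming model, a single pass can build a maximal matching greedily while storing only the matched vertices, which is O(V) space. This sketch handles the unweighted case only (an assumption; the weighted problem listed above needs more care), and it uses the edge stream shown on the slide.

```python
def greedy_matching(edge_stream):
    """One pass: keep an edge if neither endpoint is already matched."""
    matched = set()
    matching = []
    for u, v in edge_stream:
        if u != v and u not in matched and v not in matched:
            matching.append((u, v))
            matched.update((u, v))
    return matching


edges = [(4, 5), (2, 3), (1, 3), (3, 5), (1, 2), (2, 4), (1, 5), (3, 4)]
matching = greedy_matching(edges)
```

A maximal matching has at least half as many edges as a maximum matching, so this greedy pass is a factor-2 approximation in the unweighted setting.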

107
Matrix Streaming
  • Stream specifies a massive n × n matrix
  • Either by giving entries in some order, or
    updates to entries
  • In one (or few) passes, find
  • CUR Decomposition
  • Page Rank Vector
  • Approximate Matrix product
  • Singular Value Decomposition
  • Current methods take small constant number of
    passes, sample constant number of rows and
    columns by weight
  • Sketching methods don't seem so useful here

[Figure: CUR decomposition using O(1) columns, O(1) rows, and a carefully chosen U]
108
Permutation Streaming
  • Stream presents a permutation of items
  • Abstraction of several settings, more of
    theoretic interest
  • Approximate number of inversions in the stream
  • Locations where i > j but i appears before j in
    stream
  • Can be reduced to a variation of quantiles
    [Gupta, Zane 03]
  • Find length of longest increasing subsequence
  • Reduce (up to factor 2) to simpler function
    [Ergun, Jowhari 08]
  • Approximate this using a different variation of
    quantiles
  • Deterministic lower bound Ω(N^(1/2)), randomized
    bound open

109
Random Order Streaming
  • Lower bounds are sometimes based on carefully
    creating adversarial orders of streams
  • Random order streams: order is uniformly permuted
  • Can sometimes give much better upper bounds:
    prefix of stream gives a good sample of the
    distribution to come
  • Lower bounds in random order give stronger
    evidence of robust hardness, e.g. [Chakrabarti
    et al. 08]
  • Hardness via communication complexity of random
    partitions
  • GAP-HAMMING still has linear lower bound
  • t2-party DISJOINTNESS has Ω(n/t) lower bound

110
Probabilistic Streams
Example: S = (⟨x, ½⟩, ⟨y, 1/3⟩, ⟨y, ¼⟩) encodes
6 possible worlds
G:      ∅    x     y     x,y   y,y   x,y,y
Pr[G]:  ¼    ¼     5/24  5/24  1/24  1/24
  • Instead of exact values, stream of discrete
    distributions
  • Specify exponentially many possible worlds
  • Adds complexity to previously studied problems
  • Sum and Count are easy (by linearity of
    expectation)
  • Avg = Sum/Count is hard, because of the ratio
    [McGregor et al. 07]
  • Linearity of expectation, summation of variance
  • Allows estimation of Fk over streams [C.,
    Garofalakis 07]
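
The possible-worlds table for the example stream can be reproduced by brute-force enumeration, assuming the tuples are independent (each item appears with its stated probability). Exact fractions avoid rounding; the function name is illustrative.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

S = [("x", Fraction(1, 2)), ("y", Fraction(1, 3)), ("y", Fraction(1, 4))]


def possible_worlds(stream):
    """Map each possible world (a sorted tuple of items) to its probability."""
    worlds = defaultdict(Fraction)
    for present in product((False, True), repeat=len(stream)):
        probability = Fraction(1)
        items = []
        for keep, (item, p) in zip(present, stream):
            probability *= p if keep else 1 - p
            if keep:
                items.append(item)
        worlds[tuple(sorted(items))] += probability
    return dict(worlds)


worlds = possible_worlds(S)
# Expected count two ways: over all worlds, and by linearity of expectation.
expected_count = sum(prob * len(world) for world, prob in worlds.items())
by_linearity = sum(p for _, p in S)
```

Enumeration takes time exponential in the stream length, which is why Sum and Count rely on linearity of expectation instead.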

111
Distributed Streams
  • Motivated by sensor networks: large wireless
    nets
  • Communication drains battery: compute more, send
    less
  • Key problem: design stream summary data
    structures that can be combined to summarize the
    union of streams
  • Most sketches (AMS, Count-Min, F0) naturally
    distribute
  • Similar results needed for other problems
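
A small sketch of why a Count-Min sketch distributes naturally: two sketches built with the same hash functions merge by adding their tables entry by entry, and the result equals the sketch of the combined stream. The class below is illustrative; using SHA-256 to derive column indices is an assumption made for reproducibility and stands in for the pairwise-independent hash functions used in the analysis.

```python
import hashlib


class CountMin:
    """Minimal Count-Min sketch with entrywise merge (illustrative only)."""

    def __init__(self, width=64, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _column(self, row, item):
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def update(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._column(row, item)] += count

    def estimate(self, item):
        return min(self.table[row][self._column(row, item)]
                   for row in range(self.depth))

    def merge(self, other):
        """Add another sketch with identical dimensions and hash functions."""
        assert (self.width, self.depth) == (other.width, other.depth)
        for row in range(self.depth):
            for col in range(self.width):
                self.table[row][col] += other.table[row][col]


node_a, node_b, union = CountMin(), CountMin(), CountMin()
for item in [1, 2, 2, 3]:
    node_a.update(item)
    union.update(item)
for item in [2, 3, 3, 4]:
    node_b.update(item)
    union.update(item)
node_a.merge(node_b)      # e.g. performed at a parent node or base station
```

Each node then sends a fixed-size table rather than its raw stream, in line with the goal of computing more and sending less.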

[Figure: sensor network with base station (root, coordinator); image from http://www.intel.com/research/exploratory/motes.htm]
112
Continuous Distributed Model
  • Goal: continuously track a (global) query over
    streams at the coordinator while bounding the
    communication
  • Large-scale network-event monitoring, real-time
    anomaly / DDoS attack detection, power grid
    monitoring, ...
  • Results known for quantiles, Fk, clustering...
  • Cost not much higher than one-time computation
    [C. et al. 08]

113
Extensions for P2P Networks
  • Much work focused on specifics of sensor and
    wired nets
  • P2P and Grid computing present alternate models
  • Structure of multi-hop overlay networks
  • Controlled failure model: nodes explicitly
    leave and join
  • Allows us to think beyond model of highly
    resource constrained sensors.
  • Implementations such as OpenDHT over PlanetLab
    [Rhea et al. 05]

114
Authenticated Stream Aggregation
  • Wide-area query processing
  • Possible malicious aggregators
  • Can suppress or add spurious information
  • Authenticate query results at the querier?
  • Perhaps, to within some approximation error
  • Initial steps in [Garofalakis et al. 06]
  • Sliding window: [Hadjieleftheriou et al. 07]

115
Data Stream Algorithms
  • Slides are available on my website
  • Long list of references also on the web
  • http://dimacs.rutgers.edu/~graham