Graph Problems in the Streaming Model

1
Graph Problems in the Streaming Model
  • Sampath Kannan
  • University of Pennsylvania
  • Work done with Joan Feigenbaum, Andrew McGregor,
    Siddharth Suri and Jian Zhang

2
Graph Streaming
  • G(V,E)
  • V known, |V| = n
  • E revealed in arbitrary order (e1, e2, ...)
  • Space allowed: O(n polylog n) (semi-streaming)

3
Motivation?
  • Fundamental problems help calibrate model
  • Massive graphs such as the webgraph can appear as
    stream
  • Recommendation systems and more generally data
    mining

4
Why so much space?
  • Even simple problems need it
  • Given u, v, and a streamed graph G, is there a path
    of length 2 between u and v?
  • Requires Ω(n) space.
  • More generally, the same holds for balanced graph properties

5
Balanced Properties
A property is balanced if there exists a stream of
edges such that, before the last edge is seen, there
exists a vertex v such that, when the last edge is (v,x),
the property holds for Ω(n) choices of x and does not
hold for Ω(n) choices of x.
6
Lower Bound for Balanced Props
Consider all isomorphic versions of the
graph that demonstrates the balanced
property. Before seeing the last edge, the streaming
algorithm has to remember the subset of
vertices x such that the addition of edge (v,x)
causes the property to hold. As we range over
isomorphisms, this is an arbitrary subset of
the given cardinality, and there are
exponentially many possibilities.
7
Exceptions
  • Counting Local Structures
  • Counting triangles (Bar-Yossef et al., Buriol et al.)
  • Counting E(G^2) (Ganguly et al.)
  • Duplicate elimination and aggregation
    (Cormode, Muthukrishnan)

8
One algorithm design technique
  • Sparsification (Eppstein, Galil, Italiano, Nissenzweig 97)
  • For a graph property P, G' is a strong certificate for G
    if, for all H, (G ∪ H) ∈ P ⇔ (G' ∪ H) ∈ P.
  • Existence of quickly computable, sparse, strong
    certificates leads to good semi-streaming
    algorithms
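As one illustration of the certificate idea: for the property "connected", a spanning forest of G is a sparse strong certificate, and it can be maintained edge by edge with a union-find structure. The sketch below is only an illustration under that assumption; the names `DisjointSet` and `spanning_forest` are hypothetical and not taken from the slides.

```python
from typing import Iterable, List, Tuple

Edge = Tuple[int, int]


class DisjointSet:
    """Union-find with path halving and union by rank."""

    def __init__(self, n: int) -> None:
        self.parent = list(range(n))
        self.rank = [0] * n

    def find(self, x: int) -> int:
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a: int, b: int) -> bool:
        """Merge the sets of a and b; return False if already merged."""
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        if self.rank[ra] < self.rank[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        if self.rank[ra] == self.rank[rb]:
            self.rank[ra] += 1
        return True


def spanning_forest(n: int, stream: Iterable[Edge]) -> List[Edge]:
    """Keep an edge only if it joins two different components.

    At most n - 1 edges are stored, so the space fits the
    semi-streaming budget.
    """
    dsu = DisjointSet(n)
    forest: List[Edge] = []
    for u, v in stream:
        if dsu.union(u, v):
            forest.append((u, v))
    return forest
```

The kept edges connect exactly the same vertex pairs as the full stream, which is why adding any further edges H to the forest or to G gives the same answer for connectivity.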

9
Sparsification-based algorithms
  • Bipartiteness, 1-, 2-, 3-vertex connected
    components, 2-, 3-edge connected
    components: O(α(n)) per edge
  • MST, 4-vertex connected comps., 3-edge connected
    comps.: O(log n)
  • Higher connectivities: O(n). (Zelke)

10
Bipartite Matching
Matching (maximal) Augmenting path
Approximable with local greed
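A minimal sketch of the "local greed" idea, assuming edges arrive as (u, v) pairs: keep an edge exactly when both endpoints are still free. The function name `greedy_maximal_matching` is hypothetical.

```python
from typing import Hashable, Iterable, List, Tuple

Edge = Tuple[Hashable, Hashable]


def greedy_maximal_matching(stream: Iterable[Edge]) -> List[Edge]:
    """One pass: add an edge if neither endpoint is matched yet.

    The result is a maximal matching; it stores at most n/2 edges.
    """
    matched = set()
    matching: List[Edge] = []
    for u, v in stream:
        if u != v and u not in matched and v not in matched:
            matching.append((u, v))
            matched.update((u, v))
    return matching
```

On the path 0-1-2-3, seeing the middle edge first gives a matching of size 1 while the maximum has size 2, so the factor 1/2 can actually be attained.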
11
Constant-pass 2/3-approx for bip. matching
  • Maximal matching is a 0.5-approx: if M* is maximum
    and M is maximal, then M matches at least one
    endpoint of each edge in M*, so M has at least |M*|/2 edges.
  • If M has only α|M| vertex-disjoint 3-aug-paths, then
    |M| ≥ (1 - α) · 2·OPT/3.
  • With M* maximum, M* △ M is a bunch
    of augmenting paths. Count!
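The count can be spelled out as follows. This is a reconstruction of the standard argument, not text from the slides; M* denotes a maximum matching, OPT = |M*|, and a is the number of length-3 augmenting paths in the symmetric difference.

```latex
% M maximal, M^* maximum, OPT = |M^*|.
% M \triangle M^* contains |M^*| - |M| vertex-disjoint augmenting paths for M.
% Maximality of M rules out paths of length 1.
% Let a \le \alpha |M| of them have length 3; every other path has length
% at least 5 and therefore uses at least 2 edges of M.
\begin{align*}
|M^*| - |M| &\le a + \frac{|M| - a}{2}
             = \frac{|M| + a}{2}
             \le \frac{(1+\alpha)\,|M|}{2}, \\
|M| &\ge \frac{2}{3+\alpha}\,\mathrm{OPT}
      \ge (1-\alpha)\,\frac{2}{3}\,\mathrm{OPT},
\end{align*}
% where the last step uses (1-\alpha)(3+\alpha) = 3 - 2\alpha - \alpha^2 \le 3.
```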

12
  • Can find maximal matching
  • To go beyond: need to get most aug. paths of
    length 3.
  • Randomly project all free vertices into Layer
    0 or Layer 3
  • Matched edges go from Layer 1 to Layer 2.
  • Expect half the augmenting paths of length 3
    to respect layering
  • Use maximal matchings between successive
    layers to get constant fraction of these.
  • Gives constant-pass (2/3 - ε)-approximation

13
  • To get an approximation scheme: need to find most
    augmenting paths of length O(1/ε)
  • Again project vertices into k+1 layers to find
    augmenting paths of length k
  • Use carefully chosen maximal matching
    algorithms between successive layers
  • Repeat constant number of times
  • Gives streaming linear-time approx. scheme for
    unweighted matching in general graphs (McGregor)

14
Weighted Matching
15
A 1/6 Approximation in 1 Pass
  • At all times we store some matching M.
  • On seeing edge e = (u,v) we compare w(e) with
    the total weight W of the edges e1 and e2 in M incident on
    u and v.
  • If w(e) > 2W then
  • M ← M ∪ {e} \ {e1, e2}
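The update rule can be sketched in a few lines, assuming the stream yields (u, v, weight) triples. The function name `one_pass_weighted_matching` and the dictionary layout are hypothetical choices for this illustration.

```python
from typing import Dict, FrozenSet, Hashable, Iterable, Tuple

WeightedEdge = Tuple[Hashable, Hashable, float]


def one_pass_weighted_matching(
    stream: Iterable[WeightedEdge],
) -> Dict[FrozenSet[Hashable], float]:
    """Keep a matching M; swap in a new edge e when w(e) > 2W.

    W is the total weight of the (at most two) edges of M that
    share an endpoint with e.
    """
    edge_at: Dict[Hashable, FrozenSet[Hashable]] = {}  # vertex -> its edge in M
    weight: Dict[FrozenSet[Hashable], float] = {}      # edge in M -> weight
    for u, v, w in stream:
        if u == v:
            continue
        conflicts = {edge_at[x] for x in (u, v) if x in edge_at}
        total = sum(weight[e] for e in conflicts)
        if w > 2 * total:
            for e in conflicts:          # remove e1 and e2 from M
                for x in e:
                    del edge_at[x]
                del weight[e]
            key = frozenset((u, v))      # add e to M
            weight[key] = w
            edge_at[u] = edge_at[v] = key
    return weight
```

For the stream (1,2,1), (2,3,3), (3,4,5), (0,1,1) the edge of weight 5 is rejected because 5 is not more than 2·3, so the sketch ends with weight 4 against an optimum of 6.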

16
  • To show 1/6 approx: account for the weight of
    edges lost in terms of the weight of edges that
    survive
  • Can improve approx to 1/2 - ε (McGregor) in
    constant number of passes
  • Choose an edge if it is (1 + γ) times the weight
    of edges that it kills.

17
Approximating Distances
18
The Sketch Approach
  • A two-stage approach
  • First stage: while going through the stream,
    construct a small sketch of the input graph.
  • Second stage: compute the distance using the
    sketch, without further access to the stream.
  • Perform BFS-like computations in the second
    stage.

19
Graph Spanners as Sketches
  • Multiplicative t-spanner: an edge subgraph H of a
    graph G s.t., for any pair of vertices u and v,
    dist_H(u,v) ≤ t · dist_G(u,v).
  • There is a t-spanner with O(n^{1+1/t}) edges.
  • Reduce streaming graph distance to streaming
    spanner construction.
  • BFS-like subroutines are used in most existing
    spanner constructions.

20
Streaming Spanner Construction
  • For each incoming edge, decide whether it should
    be in the spanner.
  • If the edge causes a cycle of length ≤ t, do not
    put the edge in the spanner.
  • This gives a t-spanner, because there is a path P
    of length < t connecting the two endpoints of any
    discarded edge.
  • This spanner is sparse.
  • Thm [Bollobás 78]: A graph whose girth is
    larger than k can only have O(n^{1+2/(k-1)}) edges.
  • Need to know: for an incoming edge, does a short
    path exist?
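A naive sketch of this rule, assuming an unweighted edge stream: answer the "does a short path exist?" question with a depth-limited BFS over the spanner built so far. The helper names `within_distance` and `greedy_spanner` are hypothetical, and the per-edge BFS is exactly the cost that the cluster-based algorithm is designed to avoid.

```python
from collections import deque
from typing import Dict, Hashable, Iterable, List, Set, Tuple

Edge = Tuple[Hashable, Hashable]


def within_distance(adj: Dict[Hashable, Set[Hashable]],
                    source: Hashable, target: Hashable, limit: int) -> bool:
    """True if target is at most `limit` hops from source in adj."""
    if source == target:
        return True
    dist = {source: 0}
    queue = deque([source])
    while queue:
        x = queue.popleft()
        if dist[x] == limit:          # do not explore past the hop limit
            continue
        for y in adj.get(x, ()):
            if y not in dist:
                if y == target:
                    return True
                dist[y] = dist[x] + 1
                queue.append(y)
    return False


def greedy_spanner(stream: Iterable[Edge], t: int) -> List[Edge]:
    """Discard an edge if it would close a cycle of length <= t."""
    adj: Dict[Hashable, Set[Hashable]] = {}
    spanner: List[Edge] = []
    for u, v in stream:
        # A cycle of length <= t through (u, v) means a u-v path
        # of length <= t - 1 already exists in the spanner.
        if within_distance(adj, u, v, t - 1):
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
        spanner.append((u, v))
    return spanner
```

On a 4-cycle with a chord, t = 3 keeps the whole 4-cycle and drops the chord, while t = 4 also drops the edge that closes the 4-cycle.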

21
Baswana and Sen show an almost linear time
non-streaming algorithm for spanners, based on
growing BFS-trees from appropriate nodes.
This is difficult to do in streaming fashion.
Instead we grow a BFS-like tree, not just from its root!
Clusters: rooted BFS trees.
Preclusters: free-floating pieces of BFS trees that
will attach to clusters.
22
Summary of the One-Pass Algorithm
  • Use a vertex-labeling scheme to construct
    clusters.
  • Structure of the algorithm
  • In the pre-processing phase, generate a
    multi-level set of labels for the vertices.
  • Go through the stream; for each edge:
  • According to the current assignment of labels to
    vertices, decide whether to put this edge in the
    spanner.
  • Depending on the type of edge, possibly assign
    more labels to one of its endpoints.
  • Next, an example with t = log n

23
Labels
(2,2) (2,7)
(1,2) (1,4) (1,7) (1,11)
(0,1) (0,2) (0,3) (0,4) (0,5) (0,6) (0,7) (0,8) (0,9) (0,10) (0,11) (0,12)
  • log n / 2 levels
  • w.h.p., there are top-level labels.
  • Semantics of labels
  • The set of vertices assigned the same top-level
    label forms a cluster.
  • The set of vertices assigned the same lower-level
    label forms a pre-cluster.

24
Initial Label Assignment
(2,2) (2,7)
(1,2) (1,4) (1,7) (1,11)
(0,1) (0,2) (0,3) (0,4) (0,5) (0,6) (0,7) (0,8) (0,9) (0,10) (0,11) (0,12)
v1 v2 v3 v4 v5 v6 v7 v8 v9 v10 v11 v12
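One way to produce a label hierarchy shaped like the picture is to give every vertex i the label (0, i) and let each label survive to the next level with a fixed probability. The sketch below assumes a survival probability of 1/2 and assumes that a vertex starts out holding every surviving copy of its own label; both are assumptions made for illustration, and `generate_labels` is a hypothetical name.

```python
import random
from typing import Dict, List, Optional, Tuple

Label = Tuple[int, int]  # (level, index)


def generate_labels(n: int, levels: int, p: float = 0.5,
                    rng: Optional[random.Random] = None) -> Dict[int, List[Label]]:
    """Return, for each vertex 1..n, its labels from level 0 upward.

    A label (l, i) exists only if (l - 1, i) exists and survived
    an independent coin flip with success probability p (assumed).
    """
    rng = rng or random.Random()
    labels: Dict[int, List[Label]] = {v: [(0, v)] for v in range(1, n + 1)}
    survivors = list(range(1, n + 1))
    for level in range(1, levels + 1):
        survivors = [v for v in survivors if rng.random() < p]
        for v in survivors:
            labels[v].append((level, v))
    return labels
```

With n = 12 and two levels above level 0, a typical outcome has about 6 labels at level 1 and about 3 at level 2, similar in shape to the example on the slide.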
25
On arrival of an edge
  • Already know what to do with:
  • Intra-cluster/pre-cluster edges
  • Inter-cluster edges
  • Edges connecting pre-clusters: the sticky edges
  • They are added to the spanner.
  • They may lead to new label assignment and cluster
    growth.

26
Good Neighbor (1)
(3,2) (2,2) (1,2) (0,2)
(3,2)
(2,2)
Has marked labels
(1,6) (0,6)
v
u
27
Good Neighbor (2)
C(3,2)
C(2,2)
C(1,2)
C(1,6)
v
u
28
Bad Neighbor
No marked labels
(1,6)
(3,2)
v
u
29
Properties of the Clusters
  • Small diameter
  • Number of clusters bounded by .
  • Do not need to cover the whole graph with
    clusters, but the uncovered subgraph is sparse.

The uncovered subgraph consists of sticky edges,
and there are not too many of them.
30
Sticky Edges are Rare
[Figure: vertex v and the sequence of its neighbors u1, u2, u3, u4]
  • A neighbor is good with probability at least ½.
  • After seeing at most log n / 2 good neighbors, v
    will be assigned a top-level label and be
    included in a cluster. No more sticky edges for
    v.
  • The number of sticky edges can be bounded by the
    length of the shortest prefix in the above
    sequence that contains log n / 2 good neighbors.
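A back-of-the-envelope version of this bound, reading log n / 2 as (log n)/2 and treating the neighbors as independent trials; both readings are assumptions of this sketch rather than statements from the slides.

```latex
% k = (\log n)/2 good neighbors are needed before v joins a cluster.
% X = length of the shortest prefix of v's neighbor sequence that
%     contains k good neighbors.
% Assume each neighbor is good independently with probability p \ge 1/2.
\[
  \#\{\text{sticky edges at } v\} \;\le\; X,
  \qquad
  \mathbb{E}[X] \;\le\; \frac{k}{p} \;\le\; 2k \;=\; \log n ,
\]
% so, summing over all n vertices, the expected number of sticky edges
% is O(n \log n), which fits the semi-streaming space budget.
```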

31
4. Lower Bounds
32
One-pass diameter lower bound
  • Theorem: For any 0 < ε < 1, any one-pass algorithm
    that
  • returns a k-approximation (k slightly better than 1/ε) to
    the diameter
  • of a weighted graph requires Ω(n^{1+ε}) space.
  • Proof (Sketch):
  • Some properties of a random graph G in G(n,p) with
    p = 1/n^{1-ε}:
  • w.h.p. it contains a set E' of edges, |E'| = n^{1+ε}/64, such that
  • no edge in E' is in a cycle of length k or less.
  • When all edges in E' are removed, the graph still
    has diameter < 2/ε

Fix one such G = (V, E ∪ E')
33
  • Sketch (contd.): Reduce from INDEX (hard for
    comm. complexity)
  • INDEX: Alice has an m-bit string x and Bob has an index
    i. The one-way comm. complexity for Bob
    to learn x_i is Ω(m).
  • Reduction: the m edges in E' are enumerated 1 .. m.
  • Alice constructs the prefix of the stream, corresponding
    to multiple copies of
  • H = (V, E ∪ E''), where E'' ⊆ E' consists of the edges
    whose indices i have x_i = 1. All of Alice's edges
    have weight 1
  • Bob constructs the rest of the stream: if his index
    corresponds to edge (a,b) in E'
  • He connects vertex b in one copy with vertex a
    in the next copy at weight 0
  • He also creates source s and sink t, and connects s
    to a in the 1st copy and b in the last copy to t at
    high weight.
  • Properties: if x_i = 1, where i is Bob's index, then
    small diameter,
  • else large diameter.
  • Small-space streaming violates the comm. lower bound.

34
Open Problems
  • Are there interesting subclasses of graphs for
    which distances and diameters are easier in
    streaming model?
  • Is there a more generous but reasonable model?

35
Network Intrusion Detection Systems
  • Current techniques fairly primitive
  • Misuse: pattern match packets with misuse
    signatures in database
  • Anomaly: look for statistical anomalies in
    individual packet headers and payload
  • Needed:
  • Look across multiple packets for intrusions
  • Deal with interleaved traffic

36
An Example: Browsing habits
  • You read sports and cartoons. You're equally
    likely to read both. You do not remember what you
    read last.
  • You'd expect a random sequence:

SCSSCSSCSSCCSCCCSSSSCSC
37
Two readers
  • I like health, entertainment, and politics
  • I always read entertainment first, health next
    and politics last
  • The sequence would be
  • EHPEHPEHPEHPEHPEHPEHP

38
Two readers, one log file
  • If there is one log file
  • Assume there is no correlation between us

SECHSSPECSHPESCSSHCPCESCHCCPSESHPESSHPE
Is there enough information to tell that there
are two people browsing? What are they browsing?
How are they browsing?
39
Clues in stream?
  • Yes, under model assumptions.
  • H, E, P have special relationship.
  • They cannot belong to different (uncorrelated)
    people.
  • Not clear about S and C ... These could
    be two people or one person.

SECHSSPECSHPESCSSHCPCESCHCCPSESHPESSHPE
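The clue can be checked mechanically: restricting the merged log to the symbols E, H, P should leave a perfectly periodic sequence, while restricting it to S, C should not. Below is a small sketch under the model assumptions of the slides (one memoryless S/C reader, one strict E, H, P reader, interleaved at random); the function names and the 50/50 interleaving are assumptions made for illustration.

```python
import random


def merged_log(length: int, rng: random.Random) -> str:
    """Interleave a random S/C reader with a cyclic E -> H -> P reader."""
    cycle = "EHP"
    position = 0
    out = []
    for _ in range(length):
        if rng.random() < 0.5:        # assumed: readers alternate at random
            out.append(rng.choice("SC"))
        else:
            out.append(cycle[position % 3])
            position += 1
    return "".join(out)


def project(log: str, alphabet: str) -> str:
    """Keep only the symbols of `log` that belong to `alphabet`."""
    return "".join(c for c in log if c in alphabet)


def is_ehp_cycle(sequence: str) -> bool:
    """True if the sequence is a prefix of EHPEHPEHP..."""
    return sequence == ("EHP" * len(sequence))[: len(sequence)]
```

Applied to the log shown on the slide, `is_ehp_cycle(project(log, "EHP"))` holds, whereas the S/C projection shows no such pattern.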
40
Markov Chains as Stochastic Sources
[Figure: a Markov chain on states 1-7 with transition probabilities labeling its edges]
Output sequence: 1 4 7 7 1 2 5 7 ...
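A sketch of a Markov chain used as a stochastic source: from the current state, pick the next state according to that state's outgoing probabilities and emit it. The transition tables used with it are supplied by the caller; nothing here reproduces the chain drawn on the slide, and `sample_chain` is a hypothetical name.

```python
import random
from typing import Dict, Hashable, List

Transitions = Dict[Hashable, Dict[Hashable, float]]


def sample_chain(transitions: Transitions, start: Hashable,
                 steps: int, rng: random.Random) -> List[Hashable]:
    """Emit the start state followed by `steps` sampled transitions."""
    state = start
    output = [state]
    for _ in range(steps):
        next_states, probabilities = zip(*transitions[state].items())
        state = rng.choices(next_states, weights=probabilities, k=1)[0]
        output.append(state)
    return output
```

A chain whose transitions all have probability 1 emits a fixed cycle, which is the kind of source that models the strict E, H, P reader; a chain with two states and probability 1/2 everywhere models the memoryless S/C reader.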
41
Markov chains on S,E,C,H,F
Modeled by:
[Figure: states E, H, F joined by transitions of probability 1]
42
  • Need more realistic generalizations of such
    analysis to deal with:
  • Worm detection
  • Anomaly detection at high traffic links in a
    network
  • TCP compliance
  • BGP policy behavior

43
Partial Solution Clusters (1)
  • A cluster is a subset of vertices and a small
    diameter spanning tree built on these vertices.
  • Intra-cluster edge

44
Partial Solution Clusters (2)
  • Inter-cluster edges

Bollobás's result no longer applies. Need to
control the number of clusters (i.e., make it
).
45
Open Shortest Path First (OSPF)
  • Packet routing protocol
  • Each link broadcasts its weight (initially could
    be 1/bw...)
  • To route from A to B, each router sends along
    shortest path to B, dividing traffic evenly if
    many shortest paths.
  • Adjustments:
  • Human operator observing congestion on a link
    could raise its weight
  • Local decisions could lead to oscillation and
    suboptimality
  • Link latency: convex function of its utilization
  • Goal: minimize max link latency, total link
    latency, expected path latency, etc.
  • Exact optimizations typically NP-hard
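The routing rule described above (send along shortest paths to B, splitting evenly among equally short next hops) can be sketched as follows. This is an illustration only, not an OSPF implementation: it assumes strictly positive link weights and a graph given as a dictionary `{router: {neighbor: weight}}`, and the names `distances_to` and `even_split_loads` are hypothetical.

```python
import heapq
from collections import defaultdict
from typing import Dict, Tuple

Graph = Dict[str, Dict[str, float]]


def distances_to(dest: str, graph: Graph) -> Dict[str, float]:
    """Dijkstra on the reversed graph: shortest distance to dest."""
    reverse = defaultdict(dict)
    for u, neighbors in graph.items():
        for v, w in neighbors.items():
            reverse[v][u] = w
    dist = {dest: 0.0}
    heap = [(0.0, dest)]
    while heap:
        d, x = heapq.heappop(heap)
        if d > dist.get(x, float("inf")):
            continue
        for y, w in reverse[x].items():
            if d + w < dist.get(y, float("inf")):
                dist[y] = d + w
                heapq.heappush(heap, (d + w, y))
    return dist


def even_split_loads(graph: Graph, src: str, dest: str,
                     demand: float) -> Dict[Tuple[str, str], float]:
    """Load on each link when traffic is split evenly over shortest paths."""
    dist = distances_to(dest, graph)
    if src not in dist:
        return {}
    inflow = defaultdict(float)
    inflow[src] = demand
    loads: Dict[Tuple[str, str], float] = {}
    # Routers farther from dest are processed first (positive weights assumed).
    for u in sorted(dist, key=dist.get, reverse=True):
        if u == dest or inflow[u] == 0:
            continue
        next_hops = [v for v, w in graph.get(u, {}).items()
                     if v in dist and abs(dist[u] - (w + dist[v])) < 1e-9]
        share = inflow[u] / len(next_hops)
        for v in next_hops:
            loads[(u, v)] = loads.get((u, v), 0.0) + share
            inflow[v] += share
    return loads
```

Raising the weight of one link and recomputing the loads shows how a single local adjustment shifts traffic onto other links, which is the effect an operator (or an automated scheme) has to reason about.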

46
Streaming problem
  • Can we automate the weight adjustments?
  • Simple scenario
  • Assume weights have been optimized for current
    traffic matrix
  • Assume we now have a new (unknown) traffic
    matrix observed at routers
  • Assume some simple goal ... minimize time to
    converge to new solution ... or something ...
  • Streaming algorithm should itself be allowed to
    generate traffic for communication between
    monitors and for diagnostics, but this overhead
    should be low.

47
Early Worm Detection
  • EarlyBird system (Singh et al.) identifies the
    following characteristics:
  • Substantial volume of identical traffic
  • Rising infection levels (number of sources and
    destinations increasing)
  • Random probing (infected source tries many IP
    addresses)
  • 1. Top-k type streaming algorithm can identify
    high volume of identical traffic at one location.
  • Can we do better in distributed fashion?
  • 2. How do we communicate to detect rising inf.
    levels?
  • 3. Sophisticated worms may not use random
    probing.
  • What other discriminating tests are possible?
  • 4. Sophisticated worms are polymorphic: not
    identical traffic.
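As an example of a "top-k type" streaming algorithm for point 1, here is a sketch of the Misra-Gries frequent-items summary. It is a stand-in chosen for illustration, not necessarily the method used by EarlyBird; it assumes each packet payload has already been reduced to a hashable fingerprint and that k ≥ 2.

```python
from typing import Dict, Hashable, Iterable


def misra_gries(stream: Iterable[Hashable], k: int) -> Dict[Hashable, int]:
    """Keep at most k - 1 counters.

    Any item occurring more than m/k times in a stream of length m
    is guaranteed to be among the returned keys; the counts are
    underestimates of the true frequencies.
    """
    counters: Dict[Hashable, int] = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all, dropping those that hit zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```

Counter summaries of this kind can be combined across monitors, which is one possible starting point for the distributed question raised above.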