Efficient Top-k Queries in Large-Scale Networks

1
Efficient Top-k Queries in Large-Scale Networks
  • Pei Cao
  • Cisco Systems, Inc.
  • Consulting Faculty, Stanford University

2
Motivation
  • Enterprise content delivery networks (CDNs)
  • CE: web cache and streaming media cache combined
  • Number of branches: 50 - 2000

[Diagram: a Data Center hosting the Central Manager, linked over 56 Kbps, 128 Kbps, or DSL connections to Branch Offices, each with a CE]
3
Top-k Queries in CDNs
  • Example queries
  • Across all CEs, which URLs are accessed most
    often?
  • Across all CEs, which domains consume the most
    storage?
  • Across all CEs, which cached objects produced the
    biggest bandwidth savings?
  • etc.

4
Definitions
  • a network of m nodes, connected to a central
    manager (CM)
  • each node i has a reverse-sorted list of
    (x, Vi(x)) pairs
  • an object's sum:
  • V(x) = V1(x) + V2(x) + ... + Vm(x)
  • Problem: find the k objects with the highest sums
  • Goal: answer this question with minimum network
    traffic
  • → A generic problem in distributed systems
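
As a point of reference, the object sum and the top-k answer can be computed centrally in a few lines of Python. This is only a sketch: `NODES` holds the three-node example lists that appear later in the deck, entries the slides elide with ". . ." are left out (so they count as 0, an assumption), and the function names are illustrative.

```python
from collections import Counter

# Three-node example lists from the deck. Entries elided on the slides
# (". . .") are omitted here, which treats them as 0 (an assumption).
NODES = [
    [("A", 10), ("C", 8), ("E", 8), ("F", 8), ("B", 7), ("D", 5), ("J", 1), ("K", 1)],
    [("B", 10), ("D", 9), ("F", 8), ("H", 6), ("G", 5), ("C", 1), ("A", 1)],
    [("C", 10), ("A", 9), ("G", 8), ("J", 7), ("F", 6), ("D", 4), ("B", 1)],
]


def object_sums(nodes):
    """Return V(x) = V1(x) + ... + Vm(x) for every object x."""
    sums = Counter()
    for node in nodes:
        for name, value in node:
            sums[name] += value
    return sums


def exact_top_k(nodes, k):
    """Top-k objects by sum; ties are broken by name for determinism."""
    ranked = sorted(object_sums(nodes).items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:k]


print(exact_top_k(NODES, 2))  # [('F', 22), ('A', 20)]
```

This is the answer every algorithm in the deck has to reproduce; the question is how few bytes have to cross the network to get it.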

5
Existing Methods
  • Each node sends the full list of objects and
    their values to the Central Manager
  • Pro: simple to implement; works fine when the
    number of objects is small
  • Con: when the number of objects is large,
    consumes too much network bandwidth
  • Use the threshold algorithm (TA)
  • Proposed by multiple groups in the database
    research community

6
The Threshold Algorithm (TA)
  • Example: find the top 2 objects with max sums in
    three columns

Node 1: (A, 10) (C, 8) (E, 8) (F, 8) (B, 7) (D, 5) (J, 1) (K, 1) . . .
Node 2: (B, 10) (D, 9) (F, 8) (H, 6) (G, 5) (C, 1) (A, 1) . . .
Node 3: (C, 10) (A, 9) (G, 8) (J, 7) (F, 6) (D, 4) (B, 1) . . .
Central Manager (CM), threshold T and best sums after each step:
T = 30: V(A) = 20, V(C) = 19, V(B) = 18
T = 26: V(A) = 20, V(C) = 19, . . .
T = 24: V(F) = 22, V(A) = 20, . . .
T = 21: V(F) = 22, V(A) = 20, . . .
T = 18: V(F) = 22, V(A) = 20, . . .
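
The walk-through above can be reproduced with a minimal Python sketch of the threshold algorithm. It assumes that objects not shown in a list have value 0 there, and the name `threshold_algorithm` is illustrative rather than taken from the deck.

```python
# Example lists from the slide; elided entries are treated as 0 (assumption).
NODES = [
    [("A", 10), ("C", 8), ("E", 8), ("F", 8), ("B", 7), ("D", 5), ("J", 1), ("K", 1)],
    [("B", 10), ("D", 9), ("F", 8), ("H", 6), ("G", 5), ("C", 1), ("A", 1)],
    [("C", 10), ("A", 9), ("G", 8), ("J", 7), ("F", 6), ("D", 4), ("B", 1)],
]


def threshold_algorithm(nodes, k):
    """One sorted access per list per step, then random lookups for new names."""
    lookup = [dict(node) for node in nodes]  # random access by object name
    sums = {}                                # full sums of objects seen so far
    thresholds = []
    top = []
    for depth in range(min(len(node) for node in nodes)):
        row = [node[depth] for node in nodes]
        for name, _ in row:
            if name not in sums:
                sums[name] = sum(values.get(name, 0) for values in lookup)
        # T is the sum of the values at the current depth of every list.
        threshold = sum(value for _, value in row)
        thresholds.append(threshold)
        top = sorted(sums.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
        # Stop once the k-th best sum is at least T.
        if len(top) == k and top[-1][1] >= threshold:
            break
    return top, thresholds


top, thresholds = threshold_algorithm(NODES, 2)
print(thresholds)  # [30, 26, 24, 21, 18]
print(top)         # [('F', 22), ('A', 20)]
```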
7
Adapting TA for Distributed Environments
  • Consists of multiple rounds
  • Each round has two round trips
  • Round-trip 1 (sorted access): CM asks for the
    next B objects on the lists and nodes respond
  • Round-trip 2 (random lookup): CM sends a list of
    object names to nodes and nodes supply values
  • B = k

8
TA Running over Networks
Node 1: (A, 10) (C, 8) (E, 8) (F, 8) (B, 7) (D, 5) (J, 1) . . .
Node 2: (B, 10) (D, 9) (F, 8) (H, 6) (G, 5) (C, 1) (A, 1) . . .
Node 3: (C, 10) (A, 9) (G, 8) (J, 7) (F, 6) (D, 4) (B, 1) . . .
CM:
T = 26; looks up A, B, C, D → V(A) = 20, V(C) = 19; can't stop
T = 21; looks up E, F, G, H, J → V(F) = 22, V(A) = 20; can't stop
T = 10; stop
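
A rough Python sketch of the batched variant, with B = k objects per sorted-access round trip, reproduces the three thresholds above. The network is simulated with in-memory lists, elided entries are assumed to be 0, and `distributed_ta` is a hypothetical name.

```python
# Example lists as shown on this slide; elided entries count as 0 (assumption).
NODES = [
    [("A", 10), ("C", 8), ("E", 8), ("F", 8), ("B", 7), ("D", 5), ("J", 1)],
    [("B", 10), ("D", 9), ("F", 8), ("H", 6), ("G", 5), ("C", 1), ("A", 1)],
    [("C", 10), ("A", 9), ("G", 8), ("J", 7), ("F", 6), ("D", 4), ("B", 1)],
]


def distributed_ta(nodes, k):
    """TA in rounds: fetch the next B = k entries per node, then look up new names."""
    batch = k
    lookup = [dict(node) for node in nodes]  # stands in for round-trip 2 lookups
    sums = {}
    rounds = []  # (threshold, names looked up) per round
    top = []
    shortest = min(len(node) for node in nodes)
    for start in range(0, shortest, batch):
        # Round-trip 1: sorted access to the next batch on every list.
        chunks = [node[start:start + batch] for node in nodes]
        names = {name for chunk in chunks for name, _ in chunk}
        new_names = sorted(name for name in names if name not in sums)
        # Round-trip 2: random lookup of every newly seen name on all nodes.
        for name in new_names:
            sums[name] = sum(values.get(name, 0) for values in lookup)
        threshold = sum(chunk[-1][1] for chunk in chunks)
        rounds.append((threshold, new_names))
        top = sorted(sums.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
        if len(top) == k and top[-1][1] >= threshold:
            break
    return top, rounds


top, rounds = distributed_ta(NODES, 2)
for threshold, looked_up in rounds:
    print(threshold, looked_up)
print(top)  # [('F', 22), ('A', 20)]
```

On this input the query takes three rounds, that is six round trips, even though the final answer is already known after the second round.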
9
Problems with TA in Large Networks
  • The number of round trips required varies with
    the input data
  • High bandwidth consumption when the number of
    nodes is large
  • In round trip 2, the list of random-lookup
    objects is the union of all objects sent by the
    m nodes in round trip 1
  • In round trip 2, the list goes to all m nodes

10
New Algorithm: Two-Phase Uniform Threshold (TPUT)
  • Motivation: the algorithm should terminate in a
    fixed (and small) number of round trips
  • Operates in two phases
  • Phase 1: get a lower-bound estimate on the bottom
    value in the top-k set (i.e. the true bottom,
    denoted as E)
  • Phase 2: all nodes send objects whose sums are
    potentially higher than the lower bound; CM
    aggregates the info, refines the estimate,
    determines the candidates, and looks up
    candidates in all nodes

11
Partial Sums and Upper Bounds
  • Partial sum: PS(x) = Σ V'i(x)
  • Upper bound: U(x) = Σ Ui(x)

V'i(x) = Vi(x), if x has been reported by node i to CM; 0, otherwise
Ui(x) = Vi(x), if x has been reported by node i to CM; Li, otherwise
Li is the lowest value that node i has reported to CM
12
Examples
Node 1: (A, 10) (C, 8) (E, 8) (F, 8) (B, 7) (D, 5) (J, 1) . . .
Node 2: (B, 10) (D, 9) (F, 8) (H, 6) (G, 5) (C, 1) (A, 1) . . .
Node 3: (C, 10) (A, 9) (G, 8) (J, 7) (F, 6) (D, 4) (B, 1) . . .
CM, after each node has reported its top 2 objects:
PS(A) = 10 + 0 + 9 = 19, U(A) = 10 + 9 + 9 = 28
PS(B) = 0 + 10 + 0 = 10, U(B) = 8 + 10 + 9 = 27
For any object O, PS(O) ≤ V(O) ≤ U(O)
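
The two bounds are easy to compute from what the CM has received so far. The sketch below assumes each node has reported its top 2 objects, as the numbers on this slide imply, and uses an illustrative helper named `bounds`.

```python
# Example lists as shown on the slide.
NODES = [
    [("A", 10), ("C", 8), ("E", 8), ("F", 8), ("B", 7), ("D", 5), ("J", 1)],
    [("B", 10), ("D", 9), ("F", 8), ("H", 6), ("G", 5), ("C", 1), ("A", 1)],
    [("C", 10), ("A", 9), ("G", 8), ("J", 7), ("F", 6), ("D", 4), ("B", 1)],
]


def bounds(reported, name):
    """Return (PS(name), U(name)) given what each node has reported so far."""
    # Unreported values count as 0 in the partial sum ...
    partial = sum(values.get(name, 0) for values in reported)
    # ... and as Li, the lowest value node i has reported, in the upper bound.
    upper = sum(values.get(name, min(values.values())) for values in reported)
    return partial, upper


reported = [dict(node[:2]) for node in NODES]  # each node has sent its top 2
print(bounds(reported, "A"))  # (19, 28)
print(bounds(reported, "B"))  # (10, 27)
```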
13
Steps in TPUT
  • Round-trip 1
  • Manager → Nodes: start top-k query
  • Nodes → Manager: here are my top-k objects
  • Manager:
  • Calculate partial sums of all objects and sort
    them
  • Take the kth value, call it E1; E1 ≤ E
  • Set t = E1/m
  • Round-trip 2
  • Manager → Nodes: send me all objects with value
    ≥ t
  • Nodes → Manager: here they are
  • Manager:
  • Calculate partial sums of all objects and sort
    them; take the kth value, call it E2; E1 ≤ E2
    ≤ E
  • For each object, calculate its upper bound;
    select those objects whose upper bounds are ≥ E2;
    call the set S

14
TPUT
  • Round-trip 3
  • Manager → Nodes: here is S; send me all objects
    in S
  • Nodes → Manager: here they are
  • Manager: calculate sums for objects in S; select
    the top k objects
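
The three round trips can be sketched end to end in Python. This is a simplified in-memory simulation, not the author's implementation: elided list entries are assumed to be 0, values not reported in round-trip 2 are bounded by t, and an object is kept when its upper bound is at least E2. With that test an object whose bound only ties E2 (J in the deck's three-node data) stays in S, which costs one extra lookup and does not change the answer.

```python
# Three-node example lists from the deck; elided entries count as 0 (assumption).
NODES = [
    [("A", 10), ("C", 8), ("E", 8), ("F", 8), ("B", 7), ("D", 5), ("J", 1)],
    [("B", 10), ("D", 9), ("F", 8), ("H", 6), ("G", 5), ("C", 1), ("A", 1)],
    [("C", 10), ("A", 9), ("G", 8), ("J", 7), ("F", 6), ("D", 4), ("B", 1)],
]


def kth_largest(values, k):
    return sorted(values, reverse=True)[k - 1]


def partial_sums(reports):
    sums = {}
    for report in reports:
        for name, value in report.items():
            sums[name] = sums.get(name, 0) + value
    return sums


def tput(nodes, k):
    m = len(nodes)
    # Round-trip 1: every node reports its local top k.
    reports = [dict(node[:k]) for node in nodes]
    e1 = kth_largest(partial_sums(reports).values(), k)
    t = e1 / m
    # Round-trip 2: every node reports all objects with value >= t.
    for report, node in zip(reports, nodes):
        report.update({name: value for name, value in node if value >= t})
    sums = partial_sums(reports)
    e2 = kth_largest(sums.values(), k)
    # Anything a node did not report is below t, so t bounds it from above.
    upper = {name: sum(r.get(name, t) for r in reports) for name in sums}
    candidates = sorted(name for name in sums if upper[name] >= e2)
    # Round-trip 3: look up every candidate on every node and rank by full sum.
    lookup = [dict(node) for node in nodes]
    totals = {name: sum(v.get(name, 0) for v in lookup) for name in candidates}
    top = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
    return {"E1": e1, "t": t, "E2": e2, "S": candidates, "top": top}


result = tput(NODES, 2)
print(result["E1"], result["t"], result["E2"])  # 18 6.0 19
print(result["S"])    # ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'J']
print(result["top"])  # [('F', 22), ('A', 20)]
```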

15
Example
Node 1: (A, 10) (C, 8) (E, 8) (F, 8) (B, 7) (D, 5) (J, 1) . . .
Node 2: (B, 10) (D, 9) (F, 8) (H, 6) (G, 5) (C, 1) (A, 1) . . .
Node 3: (C, 10) (A, 9) (G, 8) (J, 7) (F, 6) (D, 4) (B, 1) . . .
CM:
Round-trip 1: PS(A) = 19, PS(C) = 18 → E1 = 18, t = 6
Round-trip 2: PS(F) = 22, PS(A) = 19 → E2 = 19; U(H) = 18, U(J) = 19 → H and J are out! S = (A, B, C, D, E, F, G)
Round-trip 3: S(F) = 22, S(A) = 20, S(C) = 19; top 2 objects are F and A.
16
Improving the Pruning Power
  • Observation: if E2 = E1, then no object can be
    pruned away
  • Solution: set t = (E1/m)·α, where 0 < α < 1
  • Effect:
  • Increases traffic in round-trip 2
  • But shrinks the set of candidates, hence reduces
    traffic in round-trip 3
  • Optimal α depends on the data set
  • Default α = 0.5
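
The trade-off can be seen on the deck's three-node example with a short Python sketch. It counts the (object, value) pairs sent in round-trip 2 and the size of the candidate set for a given α. The helper names are illustrative, elided list entries are assumed to be 0, unreported values are bounded by t, and an object is kept when its upper bound is at least E2.

```python
# Three-node example lists from the deck; elided entries count as 0 (assumption).
NODES = [
    [("A", 10), ("C", 8), ("E", 8), ("F", 8), ("B", 7), ("D", 5), ("J", 1)],
    [("B", 10), ("D", 9), ("F", 8), ("H", 6), ("G", 5), ("C", 1), ("A", 1)],
    [("C", 10), ("A", 9), ("G", 8), ("J", 7), ("F", 6), ("D", 4), ("B", 1)],
]


def kth_largest(values, k):
    return sorted(values, reverse=True)[k - 1]


def partial_sums(reports):
    sums = {}
    for report in reports:
        for name, value in report.items():
            sums[name] = sums.get(name, 0) + value
    return sums


def phase_two(nodes, k, alpha):
    """Return (pairs sent in round-trip 2, candidate set S) for a given alpha."""
    m = len(nodes)
    reports = [dict(node[:k]) for node in nodes]  # round-trip 1: local top k
    e1 = kth_largest(partial_sums(reports).values(), k)
    t = alpha * e1 / m                            # lowered uniform threshold
    pairs_sent = 0
    for report, node in zip(reports, nodes):      # round-trip 2
        above = {name: value for name, value in node if value >= t}
        pairs_sent += len(above)
        report.update(above)
    sums = partial_sums(reports)
    e2 = kth_largest(sums.values(), k)
    candidates = sorted(
        name for name in sums
        if sum(report.get(name, t) for report in reports) >= e2
    )
    return pairs_sent, candidates


print(phase_two(NODES, 2, alpha=1.0))  # (14, ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'J'])
print(phase_two(NODES, 2, alpha=0.5))  # (17, ['A', 'B', 'C', 'F'])
```

On this small input, α = 0.5 sends three more pairs in round-trip 2 but halves the number of candidates that must be looked up on every node in round-trip 3.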

17
Compression via Hashing
  • Problem: object IDs can be too long
  • Solution: send hashed keys of object IDs
  • Nodes report (hash(o), V(o)) to CM
  • If hash(o1) = hash(o2), then V = max(V(o1), V(o2))
  • Candidate set S is a set of hashed keys
  • Size of key: log(# of objects in all nodes)
  • Effect:
  • Maintains correctness of pruning by upper bounds
  • However, might need an additional round trip
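
A node-side report with hashed keys might look like the Python sketch below. The hash function (BLAKE2b truncated to `bits` bits) and both function names are assumptions made for illustration; the slides do not say which hash is used.

```python
import hashlib


def hashed_key(object_id, bits):
    """Map an object ID to a key of `bits` bits (hash choice is an assumption)."""
    digest = hashlib.blake2b(object_id.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big") % (1 << bits)


def hashed_report(pairs, bits):
    """Build {hash(o): V} for one node; colliding objects keep the larger value."""
    report = {}
    for object_id, value in pairs:
        key = hashed_key(object_id, bits)
        report[key] = max(value, report.get(key, 0))
    return report


pairs = [("http://example.com/a", 10), ("http://example.com/c", 8),
         ("http://example.com/e", 8)]
report = hashed_report(pairs, bits=16)
# Every object's value is covered by the value reported under its key, which
# is why upper bounds computed from hashed reports remain valid.
print(all(report[hashed_key(o, 16)] >= v for o, v in pairs))  # True
```

Taking the maximum on a collision can only overestimate a value, so pruning by upper bounds stays safe; resolving which object behind a key is the real candidate is what may cost the extra round trip.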

18
Evaluating TPUT Algorithm
  • Trace-driven simulation
  • Optimality analysis

19
Trace Data for Simulations
  • NLANR-10: daily web access from 10 NLANR proxies
  • Worldcup-30: 2-hr web accesses from 30 servers
    hosting the 1998 World Cup site
  • DEC-64: split one-day DEC traces into 64
    sub-traces by client IP
  • Simulating an enterprise with 64 branch offices
  • DEC-128: split two-day DEC traces into 128
    sub-traces by client IP
  • Simulating an enterprise with 128 branch offices
  • NLANR-208: split NLANR traces into 208 sub-proxy
    traces by client IP
  • Simulating an enterprise CDN of 208 nodes
  • Berkeley-512: split one-week UCB traces into 512
    sub-traces
  • Simulating an enterprise with 512 branch offices
    and 16 people per branch

20
Performance Metrics
  • Communication costs
  • Messages are always compressed by gzip
  • Unicast-bytes: assuming CM communicates with
    nodes via unicast
  • Multicast-bytes: assuming CM broadcasts to nodes
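
One plausible way to account for the two metrics is sketched below in Python. The accounting is an assumption, not taken from the slides: a CM-to-nodes request is charged once per node under unicast and once in total under multicast, while node replies are charged individually in both cases. Messages are JSON-encoded here purely for illustration.

```python
import gzip
import json


def gzip_size(message):
    """Size in bytes of a message after JSON encoding and gzip compression."""
    return len(gzip.compress(json.dumps(message).encode("utf-8")))


def round_trip_bytes(request, replies):
    """Return (unicast_bytes, multicast_bytes) for one round trip."""
    request_size = gzip_size(request)
    reply_size = sum(gzip_size(reply) for reply in replies)
    unicast = len(replies) * request_size + reply_size  # one request copy per node
    multicast = request_size + reply_size               # one broadcast copy
    return unicast, multicast


request = {"lookup": ["A", "B", "C", "D"]}
replies = [{"A": 10, "B": 7, "C": 8, "D": 5},
           {"A": 1, "B": 10, "C": 1, "D": 9},
           {"A": 9, "B": 1, "C": 10, "D": 4}]
unicast, multicast = round_trip_bytes(request, replies)
print(unicast > multicast)  # True
```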

21
Results on Unicast-Bytes
22
Results on Multicast-Bytes
23
Optimality Analysis
  • Main results:
  • TPUT is instance optimal for data sets following
    a log-log slope function.
  • Zipf distribution is a special case.
  • Zipf distribution: opt-ratio (m-1)·2m + km
  • Setting α < 1 reduces cost qualitatively.
  • Zipf distribution: ratio (m-1)·O(√m) + k·m/α

24
General Instance Optimality
  • Definition:
  • An algorithm T is instance-optimal with
    optimality ratio C1 if there exists C2 such that,
    for any data series D and any algorithm A,
  • cost(T, D) ≤ C1·cost(A, D) + C2
  • cost is the amount of network traffic
  • Threshold Algorithm is instance optimal with
    opt-ratio O(m²)

25
Worst Cases for Fixed-Number Round Trip Algorithms
  • TPUT is not general instance optimal
  • Nor can any algorithm that terminates in a fixed
    number of round trips regardless of input

Finding the object with the highest sum:
Node 1: (A, 1) (C, 1) (X1, 0.6) (X2, 0.6) . . . (Xn, 0.6) (B, 0.5) . . .
Node 2: (B, 1) (D, 0.2) . . .
26
Log-Log Slope Function
  • L(j) is the value at position j in a
    reverse-sorted list
  • The list satisfies log-log slope function C(n)
    if, for all j ≥ k, L(j·C(n)) < L(j)/n
  • For a Zipf-like distribution, L(j) = 1/j^θ and
    C(n) = n^(1/θ).

[Diagram: a reverse-sorted list; the value at position j is L(j), and the value at position j·C(n) is < L(j)/n]
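
For the Zipf-like case the relationship between C(n) and the list values can be checked numerically. In the Python sketch below, `theta` is a name chosen here for the exponent of the Zipf-like distribution, and positions are treated as real numbers so that j·C(n) need not be an integer.

```python
import math


def zipf_value(j, theta):
    """L(j) = 1 / j^theta, the value at position j of a Zipf-like list."""
    return 1.0 / j ** theta


def slope_function(n, theta):
    """C(n) = n^(1/theta): how far down the list to go to shrink values n-fold."""
    return n ** (1.0 / theta)


for theta in (0.7, 1.0, 2.0):
    for n in (2, 8, 64):
        for j in (1, 5, 100):
            further_down = zipf_value(j * slope_function(n, theta), theta)
            # Moving C(n) times further down divides the value by n
            # (up to floating-point rounding).
            assert math.isclose(further_down, zipf_value(j, theta) / n)

print(slope_function(8, 1.0))  # 8.0
```

With theta = 1 the slope function is simply C(n) = n, which is the case the later Zipf results in the deck use.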
27
Properties of the Two Lower Bounds
  • E1 ≥ E/m, where E is the true bottom
  • E2 > E/2
  • E2 ≥ E1
  • For any x, V(x) - PS(x) < (m-1)t → V(x) - PS(x)
    < (m-1)E1/m → E - E2 < E1(m-1)/m → E2 > E -
    E1(m-1)/m
  • E2 > (m/(2m-1))E
  • Consequently:
  • Since L(k) ≤ E1 in every node, each node sends at
    most k·C(m) objects to the manager in round trip 2
  • A candidate in round trip 3 has average value
    R > E/(2m)
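
The bound on E2 follows from the inequalities on this slide in a few steps. The derivation below uses t = E1/m; the only step not spelled out on the slide is substituting E1 ≤ E2 in the second line.

```latex
\begin{align*}
V(x) - PS(x) &< (m-1)\,t = \frac{m-1}{m}\,E_1
  && \text{for any object } x \\
E - E_2 &< \frac{m-1}{m}\,E_1 \le \frac{m-1}{m}\,E_2
  && \text{since } E_1 \le E_2 \\
E &< \frac{2m-1}{m}\,E_2 \\
E_2 &> \frac{m}{2m-1}\,E > \frac{E}{2}
  && \text{since } 2m > 2m-1
\end{align*}
```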

28
Restricted Instance Optimality of TPUT (α = 1)
  • Assume D is a collection of m lists all following
    log-log slope function C(n); then for any
    algorithm A,
  • cost(TPUT, D) ≤ cost(A, D)·((m-1)C(2m) + C(m)k)
  • Proof: assume the optimal algorithm for D stops
    at position bi on list i; then L(bi) < E → the
    number of candidates in round-trip 3 is at most
    Σbi·C(2m)

29
Effect of α < 1
  • Intuition:
  • if an object appears in few nodes and still
    makes the cut, then its average value must be
    high
  • if an object has a small value and makes the
    cut, then it must appear in many nodes
  • Let li be the number of objects that appear in
    exactly i nodes from round-trip 2; then
  • 1·l1 + 2·l2 + 3·l3 + ... + m·lm ≤
    C(m(1+α)/α)·Σbi
  • For each i, if an object appears in fewer than i
    nodes and still makes the cut, then its average
    value R ≥ E2(1-α)/i → l1 + l2 + ... + li ≤
    C(i(1+α)/(1-α))·Σbi
  • Size of candidate set is l1 + l2 + ... + lm

30
Analysis of α < 1
  • What's the maximum of l1 + l2 + ... + lm under
    the following constraints?
  • 1·l1 + 2·l2 + 3·l3 + ... + m·lm ≤ C(m(1+α)/α)·B
  • l1 ≤ C(1·β)·B
  • l1 + l2 ≤ C(2·β)·B
  • ...
  • l1 + l2 + ... + lm ≤ C(m·β)·B
  • where β = (1+α)/(1-α), B = Σbi
  • Solution: maximize l1, l2, ..., ld, and
    set ld+1, ld+2, ..., lm to 0
  • li = C(i·β)·B - C((i-1)·β)·B
  • d·C(d·β)·B - ΣC(i·β)·B = C(m(1+α)/α)·B
  • Candidate set size S = C(d·β)·B

31
α For Zipf Distributions
  • For Zipf distribution, where C(n) = n, size of
    candidate set is O(√m)·Σbi
  • → Optimality ratio for TPUT with α < 1 is
    (m-1)·c·√m + mk
  • Optimal α depends on m, but should be > 1/3;
    default 0.5

32
Summary and Open Questions
  • TPUT algorithm works well for top-k queries in
    distributed networks
  • Introducing α = 0.5 improves performance
    significantly
  • TPUT is instance-optimal under log-log slope
    function assumption
  • Easy to extend the algorithm to hierarchical
    networks
  • Open question
  • Is TPUT instance optimal compared with all fixed
    round trip algorithms over all data sets?

33
Performance of Threshold Algorithm
Trace     Raw Data   K=10 TA UniCast   K=10 TA MultiCast   K=100 TA UniCast   K=100 TA MultiCast
NL-10     26 MB
WC-30
DEC-64
DEC-128
NL-208
UCB-512
34
Unicast-Bytes for Top-100 Objects
35
Multicast-Bytes for Top-100 Objects
36
Fixed-Number Round Trip Algorithms
  • Criteria by which a node decides to send objects:
  • By position
  • By name
  • By value
  • Any fixed-number round trip algorithm must
    include a by-value operation
  • Any algorithm that includes a by-value operation
    won't be instance optimal

37
Why Uniform Threshold?
Node 1: (A, 10) (C, 8)
Node 2: (B, 10) (D, 9)
Node 3: (C, 10) (A, 9)
CM: T = 8 + 9 + 9 = 26, E1 = 18 → could set a per-node ti = Li·E1/T
  • Benefit of uniform threshold: E2 > E/2, where E
    is the true bottom
  • E2 ≥ E1
  • E2 > E - ((m-1)/m)·E1
  • because V(x) - PS(x) < (m-1)t for all x
  • → E2 > (m/(2m-1))·E
