Title: Programming Support and Resource Management for Cluster-based Internet Services
1. Programming Support and Resource Management for Cluster-based Internet Services
- Hong Tang
- Department of Computer Science
- University of California, Santa Barbara
2. Cluster-based Internet Services
- Advantages
  - Cost-effectiveness.
  - Incremental scalability.
  - High availability.
- Examples
  - Yahoo, MSN, AOL, Google, Teoma.
[Figure: cluster architecture — firewall/traffic switch, web server/query handlers, and service nodes connected by a local-area network]
3. Challenges
- Hardware failures and configuration errors due to a large number of components.
- Platform heterogeneity due to irregularity in hardware, networking, and data partitions.
- Serving highly concurrent and fluctuating traffic under interactive response constraints.
4. Neptune: Programming and Runtime Support for Cluster-based Internet Services
- Programming support
  - Component-oriented style allows programmers to focus on application functionality.
  - Neptune API provides high-level primitives for service programming.
- Runtime support
  - Neptune runtime glues components together and takes care of reliability and scalability issues.
- Applications
  - Discussion groups, online auctions, index search, persistent cache utility, BLAST-based protein sequence match.
- Industrial deployment: Teoma/Ask Jeeves.
5. Example: Document Search Engine
[Figure: search engine cluster — firewall/traffic switch, web server/query handlers, query caches, index servers (partitions 1-3), and doc servers (partitions 1-2) connected by a local-area network]
6. Outline
- Cluster-based Internet services: background and challenges.
- Programming support for data aggregation operations.
- Integrated resource management and QoS support.
- Future work.
7. Data Aggregation Operation
- Aggregate request processing results from multiple data partitions.
  - Examples: search engine, discussion groups.
- Naïve approach
  - Rely on a fixed server for data collection and aggregation.
  - The fixed server is a scalability bottleneck.
  - Actually used in the TACC framework [Fox97] and a previous version of the Neptune system.
- Need explicit programming support and efficient runtime system design!
8. Data Aggregation Operation: The Search Engine Example
[Figure: the search engine cluster of slide 5, with query results aggregated across index server partitions 1-3]
10. Design Objectives
- An easy-to-use programming primitive.
- Scalable to a large number of partitions.
- Interactive responses and high throughput.
- Reminder: all must be achieved in a cluster environment!
  - Component failures.
  - Platform heterogeneity.
11. Data Aggregation Call (DAC): The Basic Semantics
DAC(P, op_proc, op_reduce)
- Requirement on op_reduce: commutative and associative.
[Figure: op_proc applied to partitions 1-4, with results combined pairwise by op_reduce]
12. Adding Quality Control to DAC
- What if some server fails?
  - Partial aggregation results may still be useful.
  - Provide an aggregation quality guarantee.
  - Aggregation quality: the percentage of partitions that contributed to the aggregation result.
- What if some server is very slow?
  - Better to return partial results than to keep waiting.
  - Provide a soft deadline guarantee.
DAC(P, op_proc, op_reduce, q, T)
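The semantics above can be sketched in plain Python — a minimal illustration, not Neptune's implementation; the function name, threading details, and error handling are our own assumptions. It fans op_proc out over the partitions, folds partial results with the commutative and associative op_reduce as they arrive, honors the soft deadline T, and checks the quality guarantee q:

```python
import concurrent.futures as cf
import functools

def dac(partitions, op_proc, op_reduce, q=1.0, deadline=None):
    # Apply op_proc to every partition in parallel; op_reduce must be
    # commutative and associative since results arrive in any order.
    results = []
    pool = cf.ThreadPoolExecutor(max_workers=len(partitions))
    futures = [pool.submit(op_proc, p) for p in partitions]
    try:
        for fut in cf.as_completed(futures, timeout=deadline):
            results.append(fut.result())
    except cf.TimeoutError:
        pass  # soft deadline T expired: keep the partial results
    finally:
        pool.shutdown(wait=False, cancel_futures=True)  # Python 3.9+
    quality = len(results) / len(partitions)  # fraction of partitions heard from
    if quality < q:
        raise RuntimeError(f"aggregation quality {quality:.2f} below guarantee {q}")
    return functools.reduce(op_reduce, results), quality
```

For example, `dac([[3, 1], [4, 1], [5, 9], [2, 6]], max, max, q=0.75, deadline=1.0)` aggregates a per-partition maximum into a global one and reports the achieved quality.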
13. Design Alternatives
[Figure: three alternatives for aggregating results from partitions P1..Pn to a service client — (a) base, (b) flat, (c) hierarchical tree]
14. Tree-based Reduction
- The reduction tree is built dynamically for each request.
[Figure: a service client at the root of a reduction tree of participating servers]
15. Building Dynamic Reduction Trees
- Objective
  - High throughput and low response time.
- Achieving high throughput
  - Balance load; keep all servers busy.
- Achieving low response time?
16. Building Dynamic Reduction Trees
17. Building Dynamic Reduction Trees
- Objective
  - High throughput and low response time.
- Achieving high throughput
  - Balance load; keep all servers busy.
- Achieving low response time
  - Reduce the longest queue length.
  - Queue length indicates server load.
  - Balance load!
- Observation: under a highly concurrent workload, the goals of reducing response time and improving throughput both require us to balance load!
- Decisions: tree shape; server assignment.
18Load-aware Server Assignment
A
- A servers load increase is determined by of
children. - k children 1 local processing k reduction.
- Underloaded servers nodes with more children.
- Overloaded servers leaf nodes, or nodes with
fewer children.
B
C
D
E
F
G
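The assignment rule on this slide can be sketched as follows (an illustrative function with invented names, not Neptune's code): since a node with k children pays 1 local processing plus k reduction operations, the positions with the largest fan-out should go to the most lightly loaded servers.

```python
def assign_children(loads, fanouts):
    """Map tree positions to servers by load. `loads` gives each server's
    current queue length; `fanouts` lists the number of children each tree
    position requires. Lightly loaded servers take the big fan-outs."""
    order = sorted(range(len(loads)), key=lambda s: loads[s])  # lightest first
    slots = sorted(fanouts, reverse=True)                      # largest fan-out first
    return {server: k for server, k in zip(order, slots)}
```

With queue lengths `[5, 1, 3]` and fan-outs `[2, 1, 0]`, server 1 (lightest) becomes the 2-child node, server 2 the 1-child node, and server 0 (heaviest) a leaf.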
19. Choosing Reduction Tree Shapes
- Static tree shapes: balanced d-ary tree; binomial tree.
- Problem: not sufficient to correct the load imbalance caused by platform heterogeneity in a cluster environment.
20Load-adaptive Tree Formation (LAT)
7
G
H
6
5
4
3
2
1
D
E
F
A
B
D
C
E
F
G
H
21LAT Adjustment
- Problem When all servers have similar load, LAT
will assign one reduction operation per server,
resulting in a link list. - Solution Final adjustment ensures the depth is
no more than logN. If a subtree is in the form of
a link list, change it to a binomial tree.
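The chain-to-binomial-tree adjustment can be sketched like this (our own illustration; Neptune's actual adjustment code may differ): repeatedly merging pairs of subtrees halves their count each round, so the resulting depth is bounded by ceil(log2 N) instead of growing linearly like a linked list.

```python
def binomial_tree(nodes):
    """Arrange `nodes` into a binomial tree. Returns (parent_map, depth);
    depth is at most ceil(log2(len(nodes)))."""
    trees = [(n, 0) for n in nodes]  # each subtree is (root, depth)
    parent = {}
    while len(trees) > 1:
        merged = []
        for i in range(0, len(trees) - 1, 2):
            (a, da), (b, db) = trees[i], trees[i + 1]
            parent[b] = a                     # b's root becomes a child of a
            merged.append((a, max(da, db + 1)))
        if len(trees) % 2:                    # odd subtree carries over
            merged.append(trees[-1])
        trees = merged
    return parent, trees[0][1]
```

For 8 servers that LAT would otherwise chain to depth 7, this yields a tree of depth 3.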
22. LAT Summary
- Steps
  - Collect server load information.
  - Assign operations to servers.
  - Construct the reduction tree.
  - Adjust the tree shape.
- Time complexity: O(n log n).
23Request Scheduling in a Server
- Problem Blocking threads for data from children
will reduce throughput. - Solution Event-driven scheduling.
Data recved from child
reduction
All data from children aggregated
Timeout
Local proc done (non-leaf node)
Req recved
Local proc done (leaf node)
Local process initiated
Send data to parent
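The state machine above can be sketched as an event handler for one server (a minimal illustration; the event names and closure-based state are our assumptions, not Neptune's internals). Instead of blocking a thread per child, each incoming event advances the node's state:

```python
def make_node(num_children, op_reduce):
    """One server's event-driven state for a reduction node."""
    state = {"pending": num_children, "acc": None, "sent": False}

    def on_event(event, payload=None):
        if event == "local_proc_done":
            state["acc"] = payload
            if state["pending"] == 0:      # leaf node: forward immediately
                state["sent"] = True
        elif event == "child_data":
            state["acc"] = op_reduce(state["acc"], payload)  # incremental reduction
            state["pending"] -= 1
            if state["pending"] == 0:      # all children aggregated
                state["sent"] = True
        elif event == "timeout":
            state["sent"] = True           # soft deadline: forward partial result
        return state["sent"], state["acc"]

    return on_event
```

A node with two children accumulates its local result, folds each child's data as it arrives, and only "sends to parent" once all children (or the timeout) have fired.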
24. Handling Server Failures
- Failures
  - Server stopped: no heartbeat packets.
  - Server unresponsive: very long queue.
- Solutions
  - Exclude stopped servers from the reduction tree.
  - Use staged timeouts to eagerly prune unresponsive servers.
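One way to stage the timeouts (a sketch under our own assumptions — the per-level margin and formula are illustrative, not Neptune's values): give nodes deeper in the reduction tree earlier deadlines, so a parent can prune an unresponsive child and still forward its partial result before its own timeout fires.

```python
def staged_timeouts(deadline_ms, depth, stage_ms=50):
    """Per-level deadlines for a reduction tree of the given depth.
    The root keeps the full soft deadline; each level below gets
    stage_ms less, leaving parents slack to prune slow children."""
    return [deadline_ms - level * stage_ms for level in range(depth + 1)]
```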
25. Evaluation Settings
- A cluster of Linux servers (kernel ver. 2.4.18).
  - 30 dual-CPU (400MHz P-II) nodes with 512MB memory; 4 quad-CPU (500MHz P-II) nodes with 1GB memory.
- Benchmark I: search engine index server.
  - Dataset: 28 partitions, 1-1.2GB each.
  - Workload: trace-driven.
    - One-week trace from Ask Jeeves.
    - Contains only uncached queries.
- Benchmark II: CPU-spinning microbenchmark.
  - Workload: synthetic.
26Ease of Use
- Applications Index server NCBIs BLAST protein
sequence matcher online facial recognizer. - First implemented without DAC.
- A graduate student modified it with DAC.
Services Code Size (lines) Changed Lines Effort (days)
Index 2384 142 (5.9) 1.5
BLAST 1060K 307 (0.03) 2
Face 4306 190 (4.4) 1
27. Tree Formation Schemes
- 24 dual-CPU nodes, index server benchmark.
28. Tree Formation Schemes
- 20 dual-CPU, 4 quad-CPU nodes (heterogeneous).
[Figure: (A) response time (ms) and (B) throughput (req/sec) vs. request rate (10-30 req/sec), comparing Binomial and LAT]
29. Handling Server Failures
- LAT with staged timeout (ST).
- Event-driven request scheduling (ED).
- Three versions: None, ED-only, ED+ST.
30Scalability (simulation)
(B) Scalability Throughput
(A) Scalability Response Time
0.5
100
0.4
80
0.3
60
Throughput (req/sec)
Response Time (s)
40
0.2
Throughput
95 Demand level
60 Demand level
80 Demand level
0.1
20
90 Demand level
0
0
100
200
300
400
500
100
200
300
400
500
Number of Server Partitions
Number of Server Partitions
31. Summary
- Programming support
  - DAC primitive.
- Runtime system
  - LAT tree formation.
  - Event-driven scheduling.
  - Staged timeout.
- Published at PPoPP'03.
32. Outline
- Cluster-based Internet services: background and challenges.
- Programming support for data aggregation operations.
- Integrated resource management and QoS support.
- Future work.
33. Research Objectives
- Service-specific resource management objectives.
  - Previous research: rely on concrete metrics to measure resource management efficiency.
  - Observation: different services may have different objectives.
  - Statement: resource management objectives should not be built into the runtime system.
- Differentiated service qualities for multiple request classes (QoS).
  - Internet traffic is bursty: a 3:1 peak-to-average load ratio was reported at Ask Jeeves.
  - Prioritized resource allocation is desirable.
34. Service Yield Function
- Service yield: the benefit achieved from serving a request.
  - A monotonically non-increasing function of response time.
- Service yield function Y(r): specified by service providers.
- Optimization goal: maximize the aggregate yield Σ Y(rᵢ) over all served requests.
35. Sample Service Yield Functions
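Two representative yield-function shapes can be written down directly (illustrative examples consistent with the definition above — monotonically non-increasing in response time; the specific values and names are our assumptions, not the slides' actual curves):

```python
def full_until_deadline(yield_value, deadline):
    """Throughput-style yield: full credit if the request finishes
    by the deadline, zero yield afterwards."""
    return lambda r: yield_value if r <= deadline else 0.0

def linear_decay(yield_value, deadline):
    """Response-time-sensitive yield: decays linearly from full value
    at r = 0 down to zero at the deadline; non-increasing in r."""
    return lambda r: max(0.0, yield_value * (1 - r / deadline))
```

A provider could give the Gold class a slowly decaying function and the Bronze class a hard cutoff, encoding both the objective and the differentiation in one place.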
36. Service Differentiation
- Service class: a category of service requests that enjoy the same level of QoS support.
  - Client identity (paid vs. unpaid membership).
  - Service type (order placement vs. catalog browsing).
- Provision
  - Differentiated service yield functions.
  - Proportional resource allocation guarantees.
37. Runtime System: Request Scheduling
- Functionally homogeneous sub-cluster.
  - Example: replicas of index server partition 1.
- Cluster level
  - Which server should handle a request?
- Server level
  - When should a request be served?
[Figure: service clients perform cluster-level request dispatch to servers within a sub-cluster of the service cluster]
38Cluster Level Partitioning or Not?
- Periodic server partitioning Infocom01.
- Partition the sub-cluster among service classes.
- Periodically adjust server pool sizes based on
request demand of the service classes. - Problems
- Decisions are made by a centralized dispatcher.
- Periodical adjustment means slow response to
demand changes. - This work Random polling.
- Service differentiation at the server level.
- Functional-symmetry and decentralization.
- Better handling of demand spikes and failures.
39Server Level Scheduling
- Drop requests that are likely to generate zero
yield. - If there is any under-allocated service class,
schedule a request in that class. - Otherwise, find the request that has the best
chance to maximize aggregate yield. - System underloaded?
- Observation Yield loss due to missed deadlines.
- Idea Schedule requests with tight deadlines.
- Solution YID (yield-inflated deadline)
scheduling. - System overloaded?
- Observation Yield loss due to lack of resources.
- Idea Schedule requests with low resource
consumption. - Solution YIC (yield-inflated-cost) scheduling.
...
Class 1
Class 2
Class N
Request scheduling for service differentiation
Thread pool
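The two policies above can be sketched as one selection function (our own illustration: the request fields and the exact yield-inflation formula are assumptions, not the dissertation's definitions). Under light load it favors tight deadlines discounted by yield (YID-style); under overload it favors low cost per unit of yield (YIC-style):

```python
def pick_next(requests, overloaded):
    """Pick the next request to serve. Each request is a dict with
    'deadline', 'cost', and 'yield' keys (all positive numbers)."""
    if overloaded:
        key = lambda r: r["cost"] / r["yield"]      # YIC: cheap, high-yield first
    else:
        key = lambda r: r["deadline"] / r["yield"]  # YID: tight, high-yield first
    return min(requests, key=key)
```

An urgent low-yield request wins under light load, while under overload the scheduler switches to the cheaper, higher-yield request instead.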
40. Evaluation Settings
- A cluster of 24 dual-CPU Linux servers.
- Benchmark: differentiated index search service.
- Three service classes
  - Gold, Silver, Bronze memberships.
  - Request composition: 10% : 30% : 60%.
  - Service yield ratio: 4 : 2 : 1.
  - 20% resource guarantee for the Bronze class.
- Workload: trace-driven.
  - One-week trace from Ask Jeeves.
  - Contains only uncached queries.
[Figure: service yield functions for the Gold, Silver, and Bronze classes — yield vs. response time (0-6 seconds)]
41. Service Differentiation During a Demand Spike and Server Failure
- Demand spike for the Silver class between time 50 and 150.
- One server failure between time 200 and 250.
[Figure: achieved yields vs. elapsed time (sec)]
42. Service Differentiation During a Demand Spike and Server Failure
- Periodic server partitioning.
[Figure: achieved yields vs. elapsed time (sec)]
43. Summary
- Service yield function.
  - As a mechanism to express resource management objectives.
  - As a means to differentiate service qualities.
- Two-level decentralized request scheduling.
  - Cluster level: random polling.
  - Server level: adaptive scheduling.
- Published at OSDI'02.
44. Related Work
- Programming support for cluster-based Internet services: TACC [Fox97], MultiSpace [Gribble99], Ninja [von Behren02].
- Event-driven request processing: Flash [Pai99], SEDA [Welsh01].
- Tree-based reduction in MPI: [Gropp96], MagPIe [Kielmann99], TMPI [Tang01].
- Data aggregation: aggregation queries for databases [Saito99, Madden02]; scientific applications [Chang01].
- QoS for computer networks: Weighted Fair Queuing [Demers90, Parekh93], Leaky Bucket, LIRA [Stoica98], [Dovrolis99].
- QoS or real-time scheduling at the single-host level: [Huang89, Haritsa93, Waldspurger94, Mogul96], LRP [Druschel96], [Jones97], Eclipse [Bruno98], Resource Container [Banga99], [Steere99].
- QoS and resource management for Web servers: [Almeida98, Pandey98, Abdelzaher99, Bhatti99, Chandra00, Li00, Voigt01].
- QoS and load balancing for Internet services: LARD [Pai98], Cluster Reserves [Aron00], [Sullivan00], DDSD [Zhu01], [Chase01, Goswami93, Mitzenmacher97, Zhou87].
45. Outline
- Cluster-based Internet services: background and challenges.
- Programming support for data aggregation operations.
- Integrated resource management and QoS support.
- Future work.
46. Self-organizing Storage Cluster
- Challenge: distributed storage resources are hard to manage and utilize.
  - Fragmented storage space.
  - Frequent disk failures.
- Objective: let the cluster manage storage resources by itself.
  - Storage virtualization.
  - Incremental scalability.
  - Automatic redundancy maintenance.
47. Dynamic Service Composition
- Challenge: Internet services are evolving rapidly.
  - More functionality requires more service components.
  - Existing service components should be reusable.
- Objective: programming and runtime support for dynamic service composition.
  - Easy-to-use composition mechanisms.
  - On-the-fly service reconfiguration.
48. Q & A
- Acknowledgements
  - Tao Yang, Lingkun Chu (UCSB)
  - Kai Shen (University of Rochester)
- Project web site: http://www.cs.ucsb.edu/projects/neptune/
- Personal home page: http://www.cs.ucsb.edu/htang/
49. Event-driven Scheduling
[Figure: (A) response time (ms) and (B) throughput (req/sec) vs. request rate (5-25 req/sec) on 24 partitions, comparing Event Driven and No Event Driven]
50. Evaluation Workload Trace

  Trace               Total requests   Non-cached requests   Mean arrival interval   Mean service time
  Gold (Tue peak)     507,202          154,466               163.1ms                 247.9ms
  Silver (Wed peak)   512,227          151,827               166.0ms                 249.7ms
  Bronze (Thu peak)   517,116          156,214               161.3ms                 245.1ms
51. Comparing MPI_Reduce and DAC

                                                MPI_Reduce        DAC
  Primitive          Tolerating failures        All or nothing    Allows partial results
  semantics          Deadline requirement       No                Yes
                     Programming model          Procedure-based   Request-driven
  Runtime system     Tree shape                 Static            Dynamic
  design             Server assignment          Static            Dynamic