Title: Programming Support and Resource Management for Cluster-based Internet Services
1. Programming Support and Resource Management for Cluster-based Internet Services
- Hong Tang
- Department of Computer Science
- University of California, Santa Barbara
2. Cluster-based Internet Services
- Advantages
  - Cost-effectiveness.
  - Incremental scalability.
  - High availability.
- Examples
  - Yahoo, MSN, AOL, Google, Teoma.
[Figure: cluster architecture — firewall/traffic switch, web server/query handlers, and service nodes connected by a local-area network]
3. Challenges
- Hardware failures and configuration errors due to a large number of components.
- Platform heterogeneity due to irregularity in hardware, networking, and data partitions.
- Serving highly concurrent and fluctuating traffic under interactive response constraints.
4. Neptune: Programming and Runtime Support for Cluster-based Internet Services
- Programming support
  - Component-oriented style allows programmers to focus on application functionality.
  - Neptune API provides high-level primitives for service programming.
- Runtime support
  - Neptune runtime glues components together and takes care of reliability and scalability issues.
- Applications
  - Discussion groups, online auctions, index search, persistent cache utility, BLAST-based protein sequence match.
- Industrial deployment: Teoma/Ask Jeeves.
5. Example: Document Search Engine
[Figure: search engine cluster — firewall/traffic switch, web server/query handlers, query caches, index servers (partitions 1-3), and doc servers (partitions 1-2) connected by a local-area network]
6. Outline
- Cluster-based Internet services: background and challenges.
- Programming support for data aggregation operations.
- Integrated resource management and QoS support.
- Future work.
7. Data Aggregation Operation
- Aggregate request processing results from multiple data partitions.
  - Examples: search engine, discussion groups.
- Naïve approach
  - Rely on a fixed server for data collection and aggregation.
  - The fixed server is a scalability bottleneck.
  - Actually used in the TACC framework [Fox97] and a previous version of the Neptune system.
- Need explicit programming support and efficient runtime system design!
8. Data Aggregation Operation: The Search Engine Example
[Figure: the search engine cluster of slide 5, with query results aggregated across index server partitions 1-3]
10. Design Objectives
- An easy-to-use programming primitive.
- Scalable to a large number of partitions.
- Interactive responses and high throughput.
- Reminder: all must be achieved in a cluster environment!
  - Component failures.
  - Platform heterogeneity.
11. Data Aggregation Call (DAC): The Basic Semantics
DAC(P, op_proc, op_reduce)
- Requirement on op_reduce: commutative and associative.
[Figure: op_proc applied to partitions 1-4, with results combined pairwise by op_reduce]
12. Adding Quality Control to DAC
- What if some server fails?
  - Partial aggregation results may still be useful.
  - Provide an aggregation quality guarantee.
  - Aggregation quality: the percentage of partitions that contributed to the aggregation result.
- What if some server is very slow?
  - Better to return partial results than to keep waiting.
  - Provide a soft deadline guarantee.
DAC(P, op_proc, op_reduce, q, T)
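The semantics above can be sketched in plain Python — a minimal illustration, not Neptune's implementation; the function name, threading details, and error handling are our own assumptions. It fans op_proc out over the partitions, folds partial results with the commutative and associative op_reduce as they arrive, honors the soft deadline T, and checks the quality guarantee q:

```python
import concurrent.futures as cf
import functools

def dac(partitions, op_proc, op_reduce, q=1.0, deadline=None):
    # Apply op_proc to every partition in parallel; op_reduce must be
    # commutative and associative since results arrive in any order.
    results = []
    pool = cf.ThreadPoolExecutor(max_workers=len(partitions))
    futures = [pool.submit(op_proc, p) for p in partitions]
    try:
        for fut in cf.as_completed(futures, timeout=deadline):
            results.append(fut.result())
    except cf.TimeoutError:
        pass  # soft deadline T expired: keep the partial results
    finally:
        pool.shutdown(wait=False, cancel_futures=True)  # Python 3.9+
    quality = len(results) / len(partitions)  # fraction of partitions heard from
    if quality < q:
        raise RuntimeError(f"aggregation quality {quality:.2f} below guarantee {q}")
    return functools.reduce(op_reduce, results), quality
```

For example, `dac([[3, 1], [4, 1], [5, 9], [2, 6]], max, max, q=0.75, deadline=1.0)` aggregates a per-partition maximum into a global one and reports the achieved quality.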
13. Design Alternatives
[Figure: three alternatives for aggregating results from partitions P1..Pn to a service client — (a) base, (b) flat, (c) hierarchical tree]
14. Tree-based Reduction
- The reduction tree is built dynamically for each request.
[Figure: a service client at the root of a reduction tree of participating servers]
15. Building Dynamic Reduction Trees
- Objective
  - High throughput and low response time.
- Achieving high throughput
  - Balance load; keep all servers busy.
- Achieving low response time?
16. Building Dynamic Reduction Trees
17. Building Dynamic Reduction Trees
- Objective
  - High throughput and low response time.
- Achieving high throughput
  - Balance load; keep all servers busy.
- Achieving low response time
  - Reduce the longest queue length.
  - Queue length indicates server load.
  - Balance load!
- Observation: under a highly concurrent workload, the goals of reducing response time and improving throughput both require us to balance load!
- Decisions: tree shape; server assignment.
18Load-aware Server Assignment
A
- A servers load increase is determined by of
children. - k children 1 local processing k reduction.
- Underloaded servers nodes with more children.
- Overloaded servers leaf nodes, or nodes with
fewer children.
B
C
D
E
F
G
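The assignment rule on this slide can be sketched as follows (an illustrative function with invented names, not Neptune's code): since a node with k children pays 1 local processing plus k reduction operations, the positions with the largest fan-out should go to the most lightly loaded servers.

```python
def assign_children(loads, fanouts):
    """Map tree positions to servers by load. `loads` gives each server's
    current queue length; `fanouts` lists the number of children each tree
    position requires. Lightly loaded servers take the big fan-outs."""
    order = sorted(range(len(loads)), key=lambda s: loads[s])  # lightest first
    slots = sorted(fanouts, reverse=True)                      # largest fan-out first
    return {server: k for server, k in zip(order, slots)}
```

With queue lengths `[5, 1, 3]` and fan-outs `[2, 1, 0]`, server 1 (lightest) becomes the 2-child node, server 2 the 1-child node, and server 0 (heaviest) a leaf.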
19. Choosing Reduction Tree Shapes
- Static tree shapes: balanced d-ary tree; binomial tree.
- Problem: not sufficient to correct the load imbalance caused by platform heterogeneity in a cluster environment.
20Load-adaptive Tree Formation (LAT)
7
G
H
6
5
4
3
2
1
D
E
F
A
B
D
C
E
F
G
H
21LAT Adjustment
- Problem When all servers have similar load, LAT
will assign one reduction operation per server,
resulting in a link list. - Solution Final adjustment ensures the depth is
no more than logN. If a subtree is in the form of
a link list, change it to a binomial tree.
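The chain-to-binomial-tree adjustment can be sketched like this (our own illustration; Neptune's actual adjustment code may differ): repeatedly merging pairs of subtrees halves their count each round, so the resulting depth is bounded by ceil(log2 N) instead of growing linearly like a linked list.

```python
def binomial_tree(nodes):
    """Arrange `nodes` into a binomial tree. Returns (parent_map, depth);
    depth is at most ceil(log2(len(nodes)))."""
    trees = [(n, 0) for n in nodes]  # each subtree is (root, depth)
    parent = {}
    while len(trees) > 1:
        merged = []
        for i in range(0, len(trees) - 1, 2):
            (a, da), (b, db) = trees[i], trees[i + 1]
            parent[b] = a                     # b's root becomes a child of a
            merged.append((a, max(da, db + 1)))
        if len(trees) % 2:                    # odd subtree carries over
            merged.append(trees[-1])
        trees = merged
    return parent, trees[0][1]
```

For 8 servers that LAT would otherwise chain to depth 7, this yields a tree of depth 3.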
22. LAT Summary
- Steps
  - Collect server load information.
  - Assign operations to servers.
  - Construct the reduction tree.
  - Adjust the tree shape.
- Time complexity: O(n log n).
23Request Scheduling in a Server
- Problem Blocking threads for data from children
will reduce throughput. - Solution Event-driven scheduling.
Data recved from child
reduction
All data from children aggregated
Timeout
Local proc done (non-leaf node)
Req recved
Local proc done (leaf node)
Local process initiated
Send data to parent
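The state machine above can be sketched as an event handler for one server (a minimal illustration; the event names and closure-based state are our assumptions, not Neptune's internals). Instead of blocking a thread per child, each incoming event advances the node's state:

```python
def make_node(num_children, op_reduce):
    """One server's event-driven state for a reduction node."""
    state = {"pending": num_children, "acc": None, "sent": False}

    def on_event(event, payload=None):
        if event == "local_proc_done":
            state["acc"] = payload
            if state["pending"] == 0:      # leaf node: forward immediately
                state["sent"] = True
        elif event == "child_data":
            state["acc"] = op_reduce(state["acc"], payload)  # incremental reduction
            state["pending"] -= 1
            if state["pending"] == 0:      # all children aggregated
                state["sent"] = True
        elif event == "timeout":
            state["sent"] = True           # soft deadline: forward partial result
        return state["sent"], state["acc"]

    return on_event
```

A node with two children accumulates its local result, folds each child's data as it arrives, and only "sends to parent" once all children (or the timeout) have fired.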
24. Handling Server Failures
- Failures
  - Server stopped: no heartbeat packets.
  - Server unresponsive: very long queue.
- Solutions
  - Exclude stopped servers from the reduction tree.
  - Use staged timeouts to eagerly prune unresponsive servers.
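One way to stage the timeouts (a sketch under our own assumptions — the per-level margin and formula are illustrative, not Neptune's values): give nodes deeper in the reduction tree earlier deadlines, so a parent can prune an unresponsive child and still forward its partial result before its own timeout fires.

```python
def staged_timeouts(deadline_ms, depth, stage_ms=50):
    """Per-level deadlines for a reduction tree of the given depth.
    The root keeps the full soft deadline; each level below gets
    stage_ms less, leaving parents slack to prune slow children."""
    return [deadline_ms - level * stage_ms for level in range(depth + 1)]
```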
25. Evaluation Settings
- A cluster of Linux servers (kernel ver. 2.4.18).
  - 30 dual-CPU (400MHz P-II) nodes with 512MB memory; 4 quad-CPU (500MHz P-II) nodes with 1GB memory.
- Benchmark I: search engine index server.
  - Dataset: 28 partitions, 1-1.2GB each.
  - Workload: trace-driven.
    - One-week trace from Ask Jeeves.
    - Contains only uncached queries.
- Benchmark II: CPU-spinning microbenchmark.
  - Workload: synthetic.
26Ease of Use
- Applications Index server NCBIs BLAST protein
sequence matcher online facial recognizer. - First implemented without DAC.
- A graduate student modified it with DAC.
Services Code Size (lines) Changed Lines Effort (days)
Index 2384 142 (5.9) 1.5
BLAST 1060K 307 (0.03) 2
Face 4306 190 (4.4) 1
27. Tree Formation Schemes
- 24 dual-CPU nodes, index server benchmark.
28. Tree Formation Schemes
- 20 dual-CPU, 4 quad-CPU nodes (heterogeneous).
[Figure: (A) response time (ms) and (B) throughput (req/sec) vs. request rate (10-30 req/sec), comparing Binomial and LAT]
29. Handling Server Failures
- LAT with staged timeout (ST).
- Event-driven request scheduling (ED).
- Three versions: None, ED-only, ED+ST.
30Scalability (simulation)
(B) Scalability Throughput
(A) Scalability Response Time
0.5
100
0.4
80
0.3
60
Throughput (req/sec)
Response Time (s)
40
0.2
Throughput
95 Demand level
60 Demand level
80 Demand level
0.1
20
90 Demand level
0
0
100
200
300
400
500
100
200
300
400
500
Number of Server Partitions
Number of Server Partitions
31. Summary
- Programming support
  - DAC primitive.
- Runtime system
  - LAT tree formation.
  - Event-driven scheduling.
  - Staged timeout.
- Published at PPoPP'03.
32. Outline
- Cluster-based Internet services: background and challenges.
- Programming support for data aggregation operations.
- Integrated resource management and QoS support.
- Future work.
33. Research Objectives
- Service-specific resource management objectives.
  - Previous research: rely on concrete metrics to measure resource management efficiency.
  - Observation: different services may have different objectives.
  - Statement: resource management objectives should not be built into the runtime system.
- Differentiated service qualities for multiple request classes (QoS).
  - Internet traffic is bursty: a 3:1 peak-to-average load ratio was reported at Ask Jeeves.
  - Prioritized resource allocation is desirable.
34. Service Yield Function
- Service yield: the benefit achieved from serving a request.
  - A monotonically non-increasing function of response time.
- Service yield function Y(r): specified by service providers.
- Optimization goal: maximize the aggregate yield Σ Y(rᵢ) over all served requests.
35. Sample Service Yield Functions
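Two representative yield-function shapes can be written down directly (illustrative examples consistent with the definition above — monotonically non-increasing in response time; the specific values and names are our assumptions, not the slides' actual curves):

```python
def full_until_deadline(yield_value, deadline):
    """Throughput-style yield: full credit if the request finishes
    by the deadline, zero yield afterwards."""
    return lambda r: yield_value if r <= deadline else 0.0

def linear_decay(yield_value, deadline):
    """Response-time-sensitive yield: decays linearly from full value
    at r = 0 down to zero at the deadline; non-increasing in r."""
    return lambda r: max(0.0, yield_value * (1 - r / deadline))
```

A provider could give the Gold class a slowly decaying function and the Bronze class a hard cutoff, encoding both the objective and the differentiation in one place.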
36. Service Differentiation
- Service class: a category of service requests that enjoy the same level of QoS support.
  - Client identity (paid vs. unpaid membership).
  - Service type (order placement vs. catalog browsing).
- Provision
  - Differentiated service yield functions.
  - Proportional resource allocation guarantees.
37. Runtime System: Request Scheduling
- Functionally homogeneous sub-cluster.
  - Example: replicas of index server partition 1.
- Cluster level
  - Which server should handle a request?
- Server level
  - When should a request be served?
[Figure: service clients perform cluster-level request dispatch to servers within a sub-cluster of the service cluster]
38Cluster Level Partitioning or Not?
- Periodic server partitioning Infocom01.
- Partition the sub-cluster among service classes.
- Periodically adjust server pool sizes based on
request demand of the service classes. - Problems
- Decisions are made by a centralized dispatcher.
- Periodical adjustment means slow response to
demand changes. - This work Random polling.
- Service differentiation at the server level.
- Functional-symmetry and decentralization.
- Better handling of demand spikes and failures.
39Server Level Scheduling
- Drop requests that are likely to generate zero
yield. - If there is any under-allocated service class,
schedule a request in that class. - Otherwise, find the request that has the best
chance to maximize aggregate yield. - System underloaded?
- Observation Yield loss due to missed deadlines.
- Idea Schedule requests with tight deadlines.
- Solution YID (yield-inflated deadline)
scheduling. - System overloaded?
- Observation Yield loss due to lack of resources.
- Idea Schedule requests with low resource
consumption. - Solution YIC (yield-inflated-cost) scheduling.
...
Class 1
Class 2
Class N
Request scheduling for service differentiation
Thread pool
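The two policies above can be sketched as one selection function (our own illustration: the request fields and the exact yield-inflation formula are assumptions, not the dissertation's definitions). Under light load it favors tight deadlines discounted by yield (YID-style); under overload it favors low cost per unit of yield (YIC-style):

```python
def pick_next(requests, overloaded):
    """Pick the next request to serve. Each request is a dict with
    'deadline', 'cost', and 'yield' keys (all positive numbers)."""
    if overloaded:
        key = lambda r: r["cost"] / r["yield"]      # YIC: cheap, high-yield first
    else:
        key = lambda r: r["deadline"] / r["yield"]  # YID: tight, high-yield first
    return min(requests, key=key)
```

An urgent low-yield request wins under light load, while under overload the scheduler switches to the cheaper, higher-yield request instead.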
40. Evaluation Settings
- A cluster of 24 dual-CPU Linux servers.
- Benchmark: differentiated index search service.
- Three service classes
  - Gold, Silver, Bronze memberships.
  - Request composition: 10% : 30% : 60%.
  - Service yield ratio: 4 : 2 : 1.
  - 20% resource guarantee for the Bronze class.
- Workload: trace-driven.
  - One-week trace from Ask Jeeves.
  - Contains only uncached queries.
[Figure: service yield functions for the Gold, Silver, and Bronze classes — yield vs. response time (0-6 seconds)]
41. Service Differentiation During a Demand Spike and Server Failure
- Demand spike for the Silver class between time 50 and 150.
- One server failure between time 200 and 250.
[Figure: achieved yields vs. elapsed time (sec)]
42. Service Differentiation During a Demand Spike and Server Failure
- Periodic server partitioning.
[Figure: achieved yields vs. elapsed time (sec)]
43. Summary
- Service yield function.
  - As a mechanism to express resource management objectives.
  - As a means to differentiate service qualities.
- Two-level decentralized request scheduling.
  - Cluster level: random polling.
  - Server level: adaptive scheduling.
- Published at OSDI'02.
44. Related Work
- Programming support for cluster-based Internet services: TACC [Fox97], MultiSpace [Gribble99], Ninja [von Behren02].
- Event-driven request processing: Flash [Pai99], SEDA [Welsh01].
- Tree-based reduction in MPI: [Gropp96], MagPIe [Kielmann99], TMPI [Tang01].
- Data aggregation: aggregation queries for databases [Saito99, Madden02]; scientific applications [Chang01].
- QoS for computer networks: Weighted Fair Queuing [Demers90, Parekh93], Leaky Bucket, LIRA [Stoica98], [Dovrolis99].
- QoS or real-time scheduling at the single-host level: [Huang89, Haritsa93, Waldspurger94, Mogul96], LRP [Druschel96], [Jones97], Eclipse [Bruno98], Resource Container [Banga99], [Steere99].
- QoS and resource management for Web servers: [Almeida98, Pandey98, Abdelzaher99, Bhatti99, Chandra00, Li00, Voigt01].
- QoS and load balancing for Internet services: LARD [Pai98], Cluster Reserves [Aron00], [Sullivan00], DDSD [Zhu01], [Chase01, Goswami93, Mitzenmacher97, Zhou87].
45. Outline
- Cluster-based Internet services: background and challenges.
- Programming support for data aggregation operations.
- Integrated resource management and QoS support.
- Future work.
46. Self-organizing Storage Cluster
- Challenge: distributed storage resources are hard to manage and utilize.
  - Fragmented storage space.
  - Frequent disk failures.
- Objective: let the cluster manage storage resources by itself.
  - Storage virtualization.
  - Incremental scalability.
  - Automatic redundancy maintenance.
47. Dynamic Service Composition
- Challenge: Internet services are evolving rapidly.
  - More functionality requires more service components.
  - Existing service components should be reusable.
- Objective: programming and runtime support for dynamic service composition.
  - Easy-to-use composition mechanisms.
  - On-the-fly service reconfiguration.
48. Q & A
- Acknowledgements
  - Tao Yang, Lingkun Chu (UCSB)
  - Kai Shen (University of Rochester)
- Project web site: http://www.cs.ucsb.edu/projects/neptune/
- Personal home page: http://www.cs.ucsb.edu/htang/
49. Event-driven Scheduling
[Figure: (A) response time (ms) and (B) throughput (req/sec) vs. request rate (5-25 req/sec) on 24 partitions, comparing Event Driven and No Event Driven]
50. Evaluation Workload Trace

  Trace               Total requests   Non-cached requests   Mean arrival interval   Mean service time
  Gold (Tue peak)     507,202          154,466               163.1ms                 247.9ms
  Silver (Wed peak)   512,227          151,827               166.0ms                 249.7ms
  Bronze (Thu peak)   517,116          156,214               161.3ms                 245.1ms
51. Comparing MPI_Reduce and DAC

                                                MPI_Reduce        DAC
  Primitive          Tolerating failures        All or nothing    Allows partial results
  semantics          Deadline requirement       No                Yes
                     Programming model          Procedure-based   Request-driven
  Runtime system     Tree shape                 Static            Dynamic
  design             Server assignment          Static            Dynamic