Scheduling and Resource Management for Next-generation Clusters

1
Scheduling and Resource Management for
Next-generation Clusters

Yanyong Zhang
Penn State University
www.cse.psu.edu/yyzhang

2
What is a Cluster?

Cost effective
Easily scalable
Highly available
Readily upgradeable

3
Scientific and Engineering Applications

HPTi wins 5-year, $15M procurement to provide
systems for weather modeling (NOAA).
(http://www.noaanews.noaa.gov/stories/s419.htm)
Sandia's expansion of their Alpha-based C-plant
system.
Maui HPCC LosLobos Linux Super-cluster
(http://www.dl.ac.uk/CFS/benchmarks/beowulf/tsld007.htm)
A performance-price ratio of is demonstrated in
simulations of wind instruments using a cluster
of 20 .
(http://www.swiss.ai.mit.edu/pas/p/sc95.html)
The PC cluster based parallel simulation
environment and the technologies will have a
positive impact on networking research nationwide.
(http://www.osc.edu/press/releases/2001/approved.shtml)

4
Commercial Applications

Business applications
Transaction Processing (IBM DB2, Oracle)
Decision Support System (IBM DB2, Oracle)
Internet applications
Web serving / searching (Google.com)
Infowares (Yahoo.com, AOL.com)
Email, eChat, ePhone, eBook, eBank, eSociety,
eAnything
Computing portal

5
Resource Management

Each application is demanding
Several applications/users can be present at the
same time

Resource management and Quality-of-service
become important.
6
System Model

Each node is
independent
Maximum MPL
Arrival queue

7
Two Phases in Resource Management

Allocation Issues
Admission Control
Arrival Queue Principle
Scheduling Issues (CPU Scheduling)
Resource Isolation
Co-allocation

8
Co-allocation / Co-scheduling
[Figure: processes P0 and P1 on a time axis, between t0 and t1]
9
Outline

From the OS's perspective
Contribution 1: boosting the CPU utilization at
supercomputing centers
Contribution 2: providing quick responses for
commercial workloads
Contribution 3: scheduling multiple classes of
applications
From the applications' perspective
Contribution 4: optimizing clustered DB2

NEXT
10
Contribution 1: Boosting CPU Utilization at
Supercomputing Centers
11
Objective
Response Time
Wait Time
Execute Time
Wait in the arrival Q
Wait in the ready/blocked Q
minimize
12
Existing Techniques

Back Filling (BF)
Gang Scheduling (GS)
Migration (M)

[Figure: jobs placed on a time vs. space chart; # of CPUs = 14]
13
Proposed Scheme

MBGS = GS + BF + M
Use GS as the basic framework.
At each row of the GS matrix, apply the BF technique.
Whenever the GS matrix is re-calculated, M should be
considered.

14
How Does MBGS Perform?
15
Outline

From the OS's perspective
Contribution 1: boosting the CPU utilization at
supercomputing centers
Contribution 2: providing quick responses for
commercial workloads
Contribution 3: scheduling multiple classes of
applications
From the applications' perspective
Contribution 4: optimizing clustered DB2

NEXT
16
Contribution 2: Reducing Response Times for
Commercial Applications
17
Objective
Response Time
Wait Time
Execute Time
Wait in the arrival Q
Wait in the ready/blocked Q

Minimize wait time
Minimize response time

18
Previous Work I: Gang Scheduling (GS)
(1)
MINUTES !
(2)
GS is not responsive enough !
19
Previous Work II: Dynamic Co-scheduling
[Figure: processes A, B, C, D on nodes P0 to P3]
It's A's turn
C just finishes I/O
B just gets a msg
Everybody else is blocked
The scheduler on each node makes
independent decisions based on local events
without global synchronization.
20
Dynamic Co-scheduling Heuristics
21
Simulation Study

A detailed simulator at a microsecond granularity
System parameters
System configurations (maximum MPL, to partition
or not)
System overheads (context switch overheads,
interrupt costs, costs associated with
manipulating queues)

22
Simulation Study (Cont'd)

Application parameters
Injection load
Characteristics (CPU intensive, IO intensive,
communication intensive or somewhere in the
middle)

23
Impact of Load
24
Impact of Workload Characteristics
Comm intensive
I/O intensive
25
Periodic Boost Heuristics

S1: Compute Phase
S2: S1 + Unconsumed Msg.
S3: Recv. + Msg. Arrived
S4: Recv. + No Msg.
A: S3 -> S2, S1
B: S3 -> S2 -> S1
C: S3, S2, S1
D: S3, S2 -> S1
E: S2 -> S3 -> S1

26
Analytical Modeling Study

The state space is impossible to handle.

Dynamic arrival
27
Analysis Description
Number of jobs on node k
Original State Space (impossible to handle!!)
Assumption: The state of each processor is
stochastically independent of, and identically
distributed to, the state of the other processors.
28
Analysis Description (Cont'd)
Address the state transition rates
using a continuous-time Markov model. Build
the generator matrix Q.
Get the invariant probability vector π by
solving πQ = 0 and πe = 1.
Use fixed-point iteration to get the solution.
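The linear-algebra step on this slide (finding the invariant probability vector of a generator matrix Q, normalised to sum to 1) can be illustrated with a minimal sketch. The two-state chain and its rates below are made-up numbers, not taken from the talk, and the function name is hypothetical. In the talk's setting the rates themselves depend on the solution, which is why a fixed-point iteration wraps around a solve like this one; that outer loop is not shown.

```python
def stationary_distribution(Q):
    """Solve pi * Q = 0 with sum(pi) = 1 for a small generator matrix Q."""
    n = len(Q)
    # Transpose so the unknowns form a column vector: Q^T * pi^T = 0.
    A = [[Q[j][i] for j in range(n)] for i in range(n)]
    b = [0.0] * n
    # Replace the last (redundant) equation with the normalisation sum(pi) = 1.
    A[n - 1] = [1.0] * n
    b[n - 1] = 1.0
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # Back substitution.
    pi = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(A[r][c] * pi[c] for c in range(r + 1, n))
        pi[r] = (b[r] - s) / A[r][r]
    return pi


# Hypothetical two-state chain: state 0 = computing, state 1 = blocked.
# Assumed rates: 0 -> 1 at 2.0 per unit time, 1 -> 0 at 6.0 per unit time.
Q = [[-2.0, 2.0],
     [6.0, -6.0]]
pi = stationary_distribution(Q)
print(pi)
```

Each row of a generator matrix sums to zero, which is why one balance equation is redundant and can be swapped for the normalisation constraint.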
29
SB Example
30
Results
Optimal PB Frequency
Optimal Spin Time for SB
31
Results: Optimal Quantum Length
CPU Intensive
Comm Intensive
I/O Intensive
32
Outline

From the OS's perspective
Contribution 1: boosting the CPU utilization at
supercomputing centers
Contribution 2: providing quick responses for
commercial workloads
Contribution 3: scheduling multiple classes of
applications
From the applications' perspective
Contribution 4: optimizing clustered DB2

NEXT
33
Contribution 3: Scheduling Multiple Classes of
Applications
interactive
real time
batch
34
Objective
λBE
λRT
How long did it take me to finish? Response time
How many deadlines have been missed? Miss rate
cluster
35
Fairness Ratio (x:y)
Cluster resource shares: x/(x+y) and y/(x+y)
36
How to Adhere to Fairness Ratio?
37
BE response time
λRT : λBE = 2:1
λRT : λBE = 1:9
λRT : λBE = 9:1
38
RT Deadline Miss Rate
λRT : λBE = 1:9
λRT : λBE = 2:1
λRT : λBE = 9:1
39
Outline

From the OS's perspective
Contribution 1: boosting the CPU utilization at
supercomputing centers
Contribution 2: providing quick responses for
commercial workloads
Contribution 3: scheduling multiple classes of
applications
From the applications' perspective
Characterizing decision support workloads on the
clustered database server
Resource management for transaction processing
workloads on the clustered database server

NEXT
40
Experiment Setup

IBM DB2 Universal Database for Linux, EEE,
Version 7.2
8 dual node Linux/Pentium cluster, with 256
MB RAM and 18 GB disk on each node.
TPC-H workload. Queries are run sequentially (Q1
to Q20). Completion time for each query is
measured.

41
Platform
Select from T
Client
42
Methodology

Identify the components with high system
overhead.
For each such component, characterize the request
distribution.
Come up with ways of optimization.
Quantify potential benefits from the
optimization.

43
Sampling OS Statistics

Sample the statistics provided by stat, net/dev,
process/stat.
User/system CPU
# of page faults
# of blocks read/written
# of reads/writes
# of packets sent/received
CPU utilization during I/O

44
Kernel Instrumentation

Instrument each system call in the kernel.

Enter system call
Exit system call
unblock
block
resume execution
45
Operating System Profile

A considerable part of the execution time is taken
by the pread system call.
There is good overlap of computation with I/O for
some queries.
More reads than writes.

46
TPC-H pread Overhead
Query  % of exe time    Query  % of exe time
Q6     20.0             Q13    10.0
Q14    19.0             Q3      9.6
Q19    16.9             Q4      9.1
Q12    15.4             Q18     9.0
Q15    13.4             Q20     7.9
Q7     12.1             Q2      5.2
Q17    10.8             Q9      5.2
Q8     10.5             Q5      4.6
Q10    10.3             Q16     4.1
Q1     10.0             Q11     3.5
pread overhead = # of preads x overhead per pread.
47
pread Optimization
pread(dest, chunk)
    for each page in the chunk
        if the page is not in cache
            bring it in from disk
        copy the page into dest

Optimization
Re-mapping the buffer
Copy on write

30 μs
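To make the copy cost in the pread loop concrete, here is a toy model, not the kernel's or DB2's implementation. It counts disk reads and page copies for the baseline loop, and for a remapping variant that hands out references to cached pages and copies only when a page is written. The class name, method names and the 4096-byte page size are all assumptions made for illustration.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


class PageCache:
    """Toy buffer cache that counts disk reads and memory copies."""

    def __init__(self, disk):
        self.disk = disk        # dict: page number -> bytes of length PAGE_SIZE
        self.cache = {}
        self.disk_reads = 0
        self.copies = 0

    def _get_page(self, page_no):
        if page_no not in self.cache:                 # the page is not in cache
            self.cache[page_no] = self.disk[page_no]  # bring it in from disk
            self.disk_reads += 1
        return self.cache[page_no]

    def pread(self, dest, first_page, n_pages):
        """Baseline: every page of the chunk is copied into dest."""
        for i in range(n_pages):
            page = self._get_page(first_page + i)
            dest[i * PAGE_SIZE:(i + 1) * PAGE_SIZE] = page
            self.copies += 1

    def pread_remap(self, first_page, n_pages):
        """Remapping variant: return references to cached pages, no copy."""
        return [self._get_page(first_page + i) for i in range(n_pages)]

    def write_page(self, mapped, idx, data):
        """Copy on write: a private copy is made only when a page is modified."""
        mapped[idx] = bytes(data)
        self.copies += 1


disk = {n: bytes([n]) * PAGE_SIZE for n in range(4)}
pc = PageCache(disk)

dest = bytearray(4 * PAGE_SIZE)
pc.pread(dest, 0, 4)
baseline_copies = pc.copies

pages = pc.pread_remap(0, 4)
pc.write_page(pages, 2, b"x" * PAGE_SIZE)
cow_copies = pc.copies - baseline_copies
print(baseline_copies, cow_copies)
```

In this model a read-mostly workload pays for a copy only on the pages it actually modifies, which is the intuition behind the optimization named on the slide.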
48
Copy-on-write
Query  % reduction    Query  % reduction
Q1      98.9          Q11     96.1
Q2      85.7          Q12     87.1
Q3      96.0          Q13    100.0
Q4      80.9          Q14     96.1
Q5     100.0          Q15     96.8
Q6     100.0          Q16     70.7
Q7      79.7          Q17     94.5
Q8      79.3          Q18    100.0
Q9      88.7          Q19     95.7
Q10     77.8          Q20     94.4
% reduction = 1 - (# of copy-on-writes / # of preads)
49
Operating System Profile

Socket calls are the next dominant system calls.

50
Message Characteristics
Q11
Q16
Message Size (bytes)
Message Inter-injection Time (Millisecond)
Message Destination
51
Observations on Messages

Only a small set of message sizes is used.
Many messages are sent in a short period.
Message destination distribution is uniform.

Many messages are point-to-point implementations
of multicast/broadcast messages.
Multicast can reduce the # of messages.

52
Potential Reduction in Messages
Query  Total  Small  Large    Query  Total  Small  Large
Q1     44.7   71.4   38.7     Q11     9.6   28.6    0.1
Q2     20.4   58.7    0.2     Q12     8.3    7.8    2.9
Q3     48.2   64.3   38.0     Q13    24.5   75.2    0.1
Q4     22.6   58.6    0.1     Q14    27.9   80.4    0.7
Q5      8.0    7.1    8.4     Q15    46.6   56.5    0.7
Q6     76.4   78.6   45.5     Q16    59.1   63.0   56.9
Q7     57.5   71.4   56.2     Q17    41.5   66.7   27.3
Q8     29.1   75.5    4.8     Q18    11.4   32.3    0.0
Q9     66.8   78.5   61.1     Q19    26.7   79.4    0.2
Q10    25.0   73.6    0.1     Q20    21.1   62.8    0.1
53
Online Algorithm
Send(msg, dest)
    send msg to node dest

Send(msg, dest)
    if (msg == buffered_msg and dest ∉ dest_set)
        dest_set = dest_set ∪ {dest}
    else
        buffer the msg

Send_bg()
    foreach buffered_msg
        if (it has been buffered longer than threshold)
            send multicast msg to nodes in dest_set
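The buffering scheme on this slide can be sketched as follows, under one reading of the pseudocode: a message whose payload matches a buffered one and whose destination is new joins that buffered entry, anything else starts a new entry, and a background pass flushes entries older than the threshold as one multicast. The class name, the `transmit` callback and the explicit `now` clock argument are assumptions introduced for the example, not part of the original system.

```python
class MulticastBatcher:
    """Coalesce identical point-to-point sends into one multicast."""

    def __init__(self, threshold, transmit):
        self.threshold = threshold   # how long a message may stay buffered
        self.transmit = transmit     # hypothetical callback: transmit(msg, dests)
        self.buffered = []           # entries: [msg, dest_set, time_buffered]

    def send(self, msg, dest, now):
        for entry in self.buffered:
            if entry[0] == msg and dest not in entry[1]:
                entry[1].add(dest)   # same payload, new destination
                return
        self.buffered.append([msg, {dest}, now])   # otherwise buffer the msg

    def send_bg(self, now):
        """Background pass: flush entries buffered longer than the threshold."""
        still_waiting = []
        for msg, dests, t0 in self.buffered:
            if now - t0 > self.threshold:
                self.transmit(msg, sorted(dests))
            else:
                still_waiting.append([msg, dests, t0])
        self.buffered = still_waiting


sent = []
b = MulticastBatcher(threshold=5, transmit=lambda m, d: sent.append((m, d)))
b.send("scan", 1, now=0)
b.send("scan", 2, now=1)
b.send("scan", 3, now=2)
b.send("join", 1, now=3)
b.send_bg(now=6)   # "scan" has waited 6 time units, "join" only 3
print(sent)
```

The threshold trades latency for message count: a larger value gives more sends a chance to merge, but every buffered message waits at least that long before leaving the node.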
54
Impact of Threshold
[Figures: impact of threshold (milliseconds) for Q7 and Q16]
55
Outline

From the OS's perspective
Contribution 1: boosting the CPU utilization at
supercomputing centers
Contribution 2: providing quick responses for
commercial workloads
Contribution 3: scheduling multiple classes of
applications
From the applications' perspective
Characterizing decision support workloads on the
clustered database server
Resource management for clustered database
applications

NEXT
56
Ongoing/Near-term Work

What is the optimal number of jobs which should
be admitted?
Can we dynamically pause some processes based on
resource requirement and resource availability?
Which dynamic co-scheduling scheme works best
here?
How do we exploit application level information
in scheduling?

57
Future Work

Some next-generation applications
Real time medical imaging and collaborative
surgery

Application requirements
VAST processing power, disk
capacity and network bandwidth
absolute availability
deterministic performance

58
Future Work

E-business on demand

Requirements
performance
more users
responsive
Quality-of-service
availability
security
power consumption
pricing model

59
Future Work

What does it take to get there?
Hardware innovations
Resource management and isolation
Good scalability
High availability
Deterministic Performance

60
Future Work

Not only high performance
Energy consumption
Security
Pricing for service
User satisfaction
System management
Ease of use

61
Related Work

parallel job scheduling
Gang Scheduling (Ousterhout82)
Backfilling (Lifka95, Feitelson98)
Migration (Epima96)
Dynamic co-scheduling
Spin Block (Arpaci-Dusseau98, Anglano00)
Periodic Boost (Nagar99)
Demand-based Coscheduling (Sobalvarro97)

62
Related Work (Cont'd)

Real-time Scheduling
Earliest Deadline First
Rate Monotonic
Least Laxity First
Single node Multi-class scheduling
Hierarchical scheduling (Goyal96)
Proportional share (Waldspurger95)
Commercial clustered server (Pai98, reserve)

63
Related Work (Cont'd)

Commercial Workloads (CAECW, Barford99,
Kant99)
Database Characterizing (Keeton99,
Ailamaki99, Rosenblum97)
OS support for database (Stonebraker81,
Gray78, Christmann87)
Reducing copies in IO (Pai00, Druschel93,
Thadani95)

64
Publications

IEEE Transactions on Parallel and Distributed
Systems.
International Parallel and Distributed Processing
Symposium (IPDPS 2000)
ACM International Conference on Supercomputing
(ICS 2000)
International Euro-par Conference (Europar 2000)
ACM Symposium on Parallel Algorithms and
Architectures (SPAA 2001)
Workshop on Job Scheduling Strategies for
Parallel Processing (JSSPP 2001)
Workshop on Computer Architecture Evaluation
Using Commercial Workloads (CAECW 2002)

65
Publications I: Batch Applications

Y. Zhang, H. Franke, J. Moreira, A.
Sivasubramaniam. An Integrated Approach to
Parallel Scheduling Using Gang-Scheduling,
Backfilling and Migration, 7th Workshop on Job Scheduling
Strategies for Parallel Processing.
Y. Zhang, H. Franke, J. Moreira, A.
Sivasubramaniam. The Impact of Migration on
Parallel Job Scheduling for Distributed Systems.
Proceedings of 6th International Euro-Par
Conference Lecture Notes in Computer Science
1900, pages 242-251, Munich, Aug/Sep 2000.
Y. Zhang, H. Franke, J. Moreira, A.
Sivasubramaniam. Improving Parallel Job
Scheduling by combining Gang Scheduling and
Backfilling Techniques. International Parallel
and Distributed Processing Symposium
(IPDPS'2000), pages 133-142, May 2000.
Y. Zhang, H. Franke, J. Moreira, A.
Sivasubramaniam. A Comparative Analysis of Space-
and Time-Sharing Techniques for Parallel Job
Scheduling in Large Scale Parallel Systems.
Submitted to IEEE Transactions on Parallel and
Distributed Systems.

66
Publications II: Interactive Applications

M. Squillante, Y. Zhang, A. Sivasubramaniam, N.
Gautam, H. Franke, J. Moreira. Analytic Modeling
and Analysis of Dynamic Coscheduling for a Wide
Spectrum of Parallel and Distributed
Environments. Penn State CSE tech report
CSE-01-004.
Y. Zhang, A. Sivasubramaniam, J. Moreira, H.
Franke. Impact of Workload and System Parameters
on Next Generation Cluster Scheduling Mechanisms.
To appear in IEEE Transactions on Parallel and
Distributed Systems.
Y. Zhang, A. Sivasubramaniam, H. Franke, J.
Moreira. A Simulation-based Performance Study of
Cluster Scheduling Mechanisms. 14th ACM
International Conference on Supercomputing
(ICS'2000), pages 100-109, May 2000.
M. Squillante, Y. Zhang, A. Sivasubramaniam, N.
Gautam, H. Franke, J. Moreira. Analytic Modeling
and Analysis of Dynamic Coscheduling for a Wide
Spectrum of Parallel and Distributed
Environments. Submitted to ACM Transactions on
Modeling and Computer Simulation (TOMACS).

67
Publications III: Multi-class Applications

Y. Zhang, A. Sivasubramaniam. Scheduling
Best-Effort and Real-Time Pipelined Applications
on Time-Shared Clusters, the 13th Annual ACM
Symposium on Parallel Algorithms and
Architectures.
Y. Zhang, A. Sivasubramaniam. Scheduling
Best-Effort and Real-Time Pipelined Applications
on Time-Shared Clusters, Submitted to IEEE
Transactions on Parallel and Distributed Systems.

68
Publications IV: Database

Y. Zhang, J. Zhang, A. Sivasubramaniam, C. Liu,
H. Franke. Decision-Support Workload
Characteristics on a Clustered Database Server
from the OS Perspective. Penn State Technical
Report CSE-01-003

69
Thank You !
70
Applications

Numerous scientific and engineering apps
Parametric simulations
Business applications
E-commerce applications (Amazon.com, eBay.com)
Database applications (IBM DB2, Oracle)
Internet applications
Web serving / searching (Google.com)
Infowares (Yahoo.com, AOL.com)
ASPs (application service providers)
Email, eChat, ePhone, eBook, eBank, eSociety,
eAnything
Computing portals
Mission critical applications
Command control systems, banks, nuclear reactor
control, star-war, and handling life threatening
situations

71
Admission Control

Example 1
Example 2

Arrival Queue
CPU
CPU
72
Arrival Queue

First come first serve, shortest job first, first
fit, best fit, worst fit
Or is it possible to violate the priority
principle based on the resource availability? And
how?
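As a rough illustration of how the listed arrival-queue policies differ, the sketch below picks the next job to admit given the number of free CPUs. The job tuples, the policy names as strings and the exact tie-breaking are assumptions for the example. In particular, "shortest job first" is taken here to mean the shortest estimated job among those that currently fit, and best/worst fit are judged by how many CPUs would be left idle.

```python
def pick_next(queue, free_cpus, policy):
    """Return the job to admit next, or None.

    queue: list of (name, cpus, est_time) tuples in arrival order.
    """
    if policy == "fcfs":
        # Strict arrival order: only the head of the queue may start.
        if queue and queue[0][1] <= free_cpus:
            return queue[0]
        return None
    fits = [job for job in queue if job[1] <= free_cpus]
    if not fits:
        return None
    if policy == "sjf":
        return min(fits, key=lambda job: job[2])   # shortest estimate
    if policy == "first_fit":
        return fits[0]                             # earliest arrival that fits
    if policy == "best_fit":
        return max(fits, key=lambda job: job[1])   # leaves the fewest idle CPUs
    if policy == "worst_fit":
        return min(fits, key=lambda job: job[1])   # leaves the most idle CPUs
    raise ValueError("unknown policy: " + policy)


# Made-up queue: job A is at the head but needs more CPUs than are free.
queue = [("A", 8, 50), ("B", 2, 30), ("C", 4, 10), ("D", 6, 20)]
for policy in ("fcfs", "sjf", "first_fit", "best_fit", "worst_fit"):
    print(policy, pick_next(queue, 6, policy))
```

With 6 free CPUs, strict first come first serve admits nothing because the head job needs 8, while the other policies let a later job overtake it. That is the priority violation the slide asks about.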

Job 1
Job 2
system
73
Resource Isolation
74
Previous Work I: Space Sharing

Easy to implement
Severe fragmentation

[Figure: space sharing example]
75
Previous Work II: Backfilling (BF)
[Figure: arrival queue and a time vs. space chart; # of CPUs = 14]
76
Previous Work III: Gang Scheduling (GS)
[Figure: gang scheduling matrix example]
77
Previous Work IV: Migration (M)

Jobs can be migrated during execution.

78
A Question

Can we combine BF, GS, and M to deliver even
better performance ?

79
Basic GS

At each scheduling event, the matrix will be
re-calculated
Two optimizations to the matrix
CollapseMatrix
FillMatrix
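For readers unfamiliar with the gang-scheduling matrix, here is a minimal sketch of how such a matrix might be rebuilt at a scheduling event: rows are time slices (at most the maximum MPL), columns are CPUs, and each job occupies as many columns of a single row as it has tasks. The first-fit placement and the function name are assumptions. The sketch does not implement the two optimizations named on the slide.

```python
def build_matrix(jobs, n_cpus, max_mpl):
    """Place jobs into a gang-scheduling matrix using first fit.

    jobs: list of (name, cpus) in arrival order.
    Returns (matrix, waiting); matrix rows are time slices, None means idle.
    """
    matrix = []
    waiting = []
    for name, cpus in jobs:
        placed = False
        # Try to fit the whole job into an existing time slice.
        for row in matrix:
            free = [c for c, owner in enumerate(row) if owner is None]
            if len(free) >= cpus:
                for c in free[:cpus]:
                    row[c] = name
                placed = True
                break
        # Otherwise open a new time slice if the MPL limit allows it.
        if not placed and len(matrix) < max_mpl and cpus <= n_cpus:
            row = [None] * n_cpus
            for c in range(cpus):
                row[c] = name
            matrix.append(row)
            placed = True
        if not placed:
            waiting.append(name)   # stays in the arrival queue
    return matrix, waiting


# Made-up workload on 8 CPUs with a maximum MPL of 2.
jobs = [("J1", 5), ("J2", 3), ("J3", 6), ("J4", 2), ("J5", 4)]
matrix, waiting = build_matrix(jobs, n_cpus=8, max_mpl=2)
for row in matrix:
    print(row)
print("waiting:", waiting)
```

All tasks of a job sit in the same row, so they are scheduled in the same time slice on every CPU they occupy. Jobs that fit nowhere remain queued until the matrix is recomputed.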

80
GS + BF
81
Backfilling (BF)

Conservative BF
Φ model of overestimation

Characteristics of the Φ model
Φ is the fraction of jobs which run for at least
their estimated time.
The rest of the jobs execute for only some fraction
of their estimated time.
The fraction is uniformly distributed.
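The overestimation model described on this slide can be sketched as a small sampler. The parameter is called `phi` here, and the function name is hypothetical. One further assumption: a job that would run for at least its estimate is treated as running for exactly the estimate, as if it were stopped at its limit.

```python
import random


def actual_runtime(estimate, phi, rng):
    """Draw an actual runtime for a job with the given user estimate.

    With probability phi the job uses its full estimate (assumed to be
    cut off there); otherwise it runs for a uniform fraction of it.
    """
    if rng.random() < phi:
        return estimate
    return estimate * rng.random()


rng = random.Random(0)   # fixed seed so the run is repeatable
samples = [actual_runtime(100.0, 0.3, rng) for _ in range(10000)]
full = sum(1 for s in samples if s == 100.0) / len(samples)
print("fraction using the full estimate:", round(full, 3))
```

A backfilling scheduler plans with the estimates, so every early finish opens a hole that later jobs may fill. A lower `phi` means more overestimation and more such holes.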

82
One Problem

How to estimate job execution time in a time-shared
environment?
Upper bound = MPL x user-submitted estimate

Use the upper bound to backfill!
83
Why Migration?
84
What to migrate?
85
MBGS = GS + BF + M
87
Model Validation
88
Future Work

Can we piggyback some information from src node
to help the scheduling on the dest node?
Can the system dynamically figure out where it is
operating and choose the best scheme accordingly?
(global optimization)

89
Objective

To minimize the interference between jobs of
different classes.
To schedule jobs in each class efficiently.
Minimize BE response times
Minimize RT deadline miss rates

90
Previous Work

Parallel BE job scheduling
Gang scheduling
Dynamic co-scheduling
RT scheduling
Earliest Deadline First
Rate Monotonic
Least Laxity First, etc.
Single node multi-class scheduling
Hierarchical schedulers
Proportional share schedulers, etc.

91
Proposed Work

Look at coexisting multiple-class parallel
applications on a time-shared cluster!

92
One-level Gang Scheduler (1GS)
93
1GS Optimizations
Optimization Principle: A class can use the slots of
the other class if that class cannot utilize
them.
94
Two-level Dynamic Coscheduler with Rigid TDM
(2DCS-TDM)
x:y = 2:1

Globally decide when to schedule which class.
Locally decide the schedule within each class.

95
2DCS-TDM optimizations
Optimization Principle: A class can borrow time
from the other class when no job in that class can be
run!
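The time-division idea with borrowing can be sketched as a slot allocator: slots follow a fixed x:y pattern, and a slot whose owning class has nothing runnable is lent to the other class. Which class takes the x share is not stated on the slide, so the code assumes x belongs to RT and y to BE. The function name, the runnable predicates and the `*` marker for borrowed slots are hypothetical.

```python
def tdm_schedule(x, y, runnable_rt, runnable_be, n_slots):
    """Assign each global time slot to a class under an x:y split.

    runnable_rt(t) and runnable_be(t) say whether the class has a job
    that can run in slot t. A borrowed slot is marked with '*'.
    """
    pattern = ["RT"] * x + ["BE"] * y   # assumed: RT gets x, BE gets y
    schedule = []
    for t in range(n_slots):
        owner = pattern[t % len(pattern)]
        other = "BE" if owner == "RT" else "RT"
        runnable = {"RT": runnable_rt(t), "BE": runnable_be(t)}
        if runnable[owner]:
            schedule.append(owner)
        elif runnable[other]:
            schedule.append(other + "*")   # owner is idle, lend the slot
        else:
            schedule.append("idle")
    return schedule


# Made-up scenario: RT work exists only in the first three slots,
# BE work is always available, and the split is x:y = 2:1.
schedule = tdm_schedule(2, 1, lambda t: t < 3, lambda t: True, 6)
print(schedule)
```

Without the borrowing step, slots 3 and 4 would be wasted in this scenario. With it, the x:y split only constrains the schedule while both classes have work.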
96
Two-level Dynamic Coscheduler with Proportional
Share Scheduling (2DCS-PS)

Locally decide when to schedule which class.
Locally decide the schedule within each class.

x:y = 2:1
97
2DCS-PS Optimizations
Optimization Principle: If no job in one class
can be run, only jobs in the other class will be
scheduled!
98
Future Work

How can the admission control algorithms affect
the performance?
Can we provide a deterministic admission control
algorithm, since in some cases being
deterministic is very important?
Can we provide a tunable admission control so
that the users can specify how much performance
they are willing to lose to trade off the
admission rate?
How do the system parameters affect the
performance?