1
Arun Babu Nagarajan, Frank Mueller
North Carolina State University
Proactive Fault Tolerance for HPC using Xen Virtualization
2
Problem Statement
  • Trends in HPC: high-end systems with thousands of
    processors
  • Increased probability of a node failure: MTBF
    becomes shorter
  • MPI widely accepted in scientific computing
  • Problem with MPI: no recovery from faults in the
    standard
  • Fault-tolerance solutions exist today, but
  • only reactive: process checkpoint/restart
  • must restart the entire job
  • inefficient if only one (or a few) node(s) fails
  • overhead due to redoing some of the work
  • issue: checkpoint at what frequency?
  • a 100-hour job will run for an additional 150 hours on a
    petaflop machine (w/o failure) [I. Philp, 2005]

3
Our Solution
  • Proactive FT
  • anticipates node failure
  • takes preventive action instead of reacting
    to a failure
  • migrates the whole OS to a better physical node
  • entirely transparent to the application (the OS
    itself, rather than the application, is what migrates)
  • hence avoids the high overhead of reactive
    schemes (the overhead associated with our scheme is
    very small)

4
Design space
  • 1. A mechanism to predict/anticipate the failure
    of a node
  • OpenIPMI
  • lm_sensors (more system-specific: x86 Linux)
  • 2. A mechanism to identify the best target node
  • custom centralized approaches don't scale and are
    unreliable
  • scalable distributed approach: Ganglia
  • 3. More importantly, a mechanism (for preventive
    action) which supports the relocation of the
    running application with
  • its state preserved
  • minimum overhead on the application itself
  • Xen virtualization with live migration support
    [C. Clark et al., May 2005]
  • open source

5
Mechanisms explained
  • 1. Health monitoring with OpenIPMI
  • Baseboard Management Controller (BMC): equipped with
    sensors to monitor different properties of each node,
    such as temperature, fan speed, and voltage
  • IPMI (Intelligent Platform Management Interface)
  • increasingly common in HPC
  • standard message-based interface to monitor hardware
  • raw messaging is harder to use and debug
  • OpenIPMI: open source, higher-level abstraction
    over the raw IPMI message-response system to
    communicate w/ the BMC (i.e. to read sensors)
  • We use OpenIPMI to gather health information of
    nodes (a reading sketch follows this list)

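The slides use the OpenIPMI C library to query the BMC. As a minimal, hedged sketch of the same health-reading step, the Python snippet below shells out to the ipmitool CLI instead (a stand-in for the OpenIPMI calls, not the authors' code); sensor names and readings are machine-specific.

    # Sketch only: ipmitool stands in for the OpenIPMI library used by the
    # authors; it queries the local BMC and collects numeric sensor readings.
    import subprocess

    def read_sensors():
        """Return {sensor_name: numeric_reading} parsed from `ipmitool sensor`."""
        out = subprocess.run(["ipmitool", "sensor"],
                             capture_output=True, text=True, check=True).stdout
        readings = {}
        for line in out.splitlines():
            fields = [f.strip() for f in line.split("|")]
            if len(fields) < 2:
                continue
            try:
                readings[fields[0]] = float(fields[1])  # name, current reading
            except ValueError:
                pass  # skip "na" and discrete (non-numeric) sensors
        return readings

    if __name__ == "__main__":
        for name, value in sorted(read_sensors().items()):
            print(name, value)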
6
Mechanisms explained
  • 2. Ganglia
  • widely used, scalable distributed load
    monitoring tool
  • all the nodes in the cluster run a Ganglia daemon,
    and each node has an approximate view of the
    entire cluster
  • UDP used to transfer messages
  • measures
  • CPU usage, memory usage, network usage by default
  • We use Ganglia to identify the least loaded node →
    the migration target (a selection sketch follows
    this list)
  • also extended to distribute IPMI sensor data

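As a sketch of the target-selection step, the snippet below reads the cluster XML that a Ganglia gmond daemon publishes on TCP port 8649 and picks the host with the lowest one-minute load. "load_one" is a standard Ganglia metric; the lowest-load policy mirrors the slides, but the exact selection code here is an illustrative assumption.

    # Sketch: query gmond's XML report and return the least loaded host.
    import socket
    import xml.etree.ElementTree as ET

    def least_loaded_node(gmond_host="localhost", port=8649):
        with socket.create_connection((gmond_host, port)) as s:
            chunks = []
            while True:
                data = s.recv(4096)
                if not data:
                    break
                chunks.append(data)
        root = ET.fromstring(b"".join(chunks))
        best = None  # (hostname, load_one)
        for host in root.iter("HOST"):
            for metric in host.iter("METRIC"):
                if metric.get("NAME") == "load_one":
                    load = float(metric.get("VAL"))
                    if best is None or load < best[1]:
                        best = (host.get("NAME"), load)
        return best

    if __name__ == "__main__":
        print(least_loaded_node())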
7
Mechanisms explained
  • 3. Fault tolerance w/ Xen
  • para-virtualized environment
  • OS modified
  • application unchanged
  • privileged VM and guest VMs run on the Xen
    hypervisor/VMM
  • guest VMs can live migrate to other hosts →
    little overhead
  • state of the VM is preserved
  • VM halted only for an insignificant period of time
  • Migration phases (the migration command is sketched
    after this list)
  • phase 1: send guest image → dst node, app
    running
  • phase 2: repeated diffs → dst node, app still
    running
  • phase 3: commit final diffs → dst node, OS/app
    frozen
  • phase 4: activate guest on dst, app running again

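The live migration itself is driven through Xen's user-land tools. A minimal sketch of the preventive action, assuming a guest named "guestvm", a reachable target host, and xend relocation enabled on the destination, is:

    # Sketch: initiate Xen 3.x live migration from the privileged VM.
    # Phases 1-4 above (image copy, repeated diffs, final commit, activation)
    # are performed internally by Xen once this command is issued.
    import subprocess

    def live_migrate(guest="guestvm", target_host="spare-node"):
        subprocess.run(["xm", "migrate", "--live", guest, target_host],
                       check=True)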
8
Overall set-up of the components
[Diagram: cluster nodes, each with a BMC (Baseboard Management Controller) in
hardware and a Xen VMM hosting a privileged VM that runs the PFT daemon and
Ganglia; a "Migrate" arrow points to the stand-by host.]
  • Stand-by Xen host, no guest
  • Deteriorating health → migrate guest (w/ MPI app)
    to stand-by host
9
Overall set-up of the components
[Diagram: as before, each node runs the PFT daemon and Ganglia on the
privileged VM atop the Xen VMM, with a BMC in hardware; the guest VM running
the MPI task now executes on the former stand-by host.]
  • Stand-by Xen host, no guest
  • Deteriorating health → migrate guest (w/ MPI app)
    to stand-by host
  • The destination host generates an unsolicited ARP
    reply advertising that the guest VM's IP has moved to a
    new location [C. Clark et al., 2005]
  • this ensures that peers resend packets
    to the new host (an illustration follows)
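For illustration only: Xen's migration tools send this unsolicited (gratuitous) ARP reply themselves after migration. A manual equivalent using the iputils arping tool, with a placeholder interface and guest IP, would be:

    # Illustration: gratuitous ARP so peers update their ARP cache entry for
    # the migrated guest's IP. Xen performs this step automatically.
    import subprocess

    def advertise_new_location(guest_ip="10.1.1.10", iface="eth0", count=5):
        subprocess.run(["arping", "-U", "-I", iface, "-c", str(count), guest_ip],
                       check=True)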
10
Proactive Fault Tolerance (PFT) Daemon
PFT Daemon
  • Runs on the privileged VM (host)
  • Initialize
  • read safe thresholds from a config file
  • <Sensor name> <Low Thr> <Hi Thr>
    (a parsing sketch follows the flowchart below)
  • CPU temperature, fan speeds
  • extensible (corrupt sectors, network, voltage
    fluctuations, ...)
  • init connection w/ the IPMI BMC using authentication
    parameters and hostname
  • gathers a listing of available sensors in the
    system and validates it against our list

[Flowchart: PFTd reads sensors from the IPMI Baseboard Management Controller.
Initialize → Health Monitor → threshold breach? If no, keep monitoring; if
yes, Load Balance (via Ganglia) and raise an alarm / schedule maintenance of
the system.]
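A minimal sketch of the threshold-file parsing during initialization. The <Sensor name> <Low Thr> <Hi Thr> format comes from the slide; treating the last two whitespace-separated fields as the thresholds (so sensor names may contain spaces) and the file name are assumptions of this sketch.

    # Sketch: read the safe-threshold config into {sensor_name: (low, high)}.
    def load_thresholds(path="pftd.conf"):
        thresholds = {}
        with open(path) as f:
            for line in f:
                line = line.strip()
                if not line or line.startswith("#"):
                    continue
                *name_parts, low, high = line.split()
                thresholds[" ".join(name_parts)] = (float(low), float(high))
        return thresholds

    # Example config line (hypothetical sensor name and values):
    #   CPU0 Temp 10 70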
11
PFT Daemon
  • Health monitoring
  • interacts w/ the IPMI BMC (via OpenIPMI) to read
    sensors
  • periodic sampling of data (event-driven is also
    supported)
  • threshold exceeded → control handed over to load
    balancing
  • PFTd determines the migration target by contacting
    Ganglia
  • load-based selection (lowest load)
  • load obtained via the /proc file system
  • Invokes Xen live migration for the guest VM
  • Xen user-land tools (at VM/host)
  • command-line interface for live migration
  • PFT daemon initiates migration for the guest VM
    (a combined monitoring-loop sketch follows this list)

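Tying the pieces together, here is a sketch of the PFTd monitoring loop built from the helper sketches on the earlier slides (load_thresholds, read_sensors, least_loaded_node, live_migrate). The sampling interval and the migrate-on-first-breach policy are assumptions; the slides note that event-driven notification is also supported.

    # Sketch: periodic health check -> threshold test -> pick target -> migrate.
    # Relies on the helper functions sketched on the earlier slides.
    import time

    def pftd_loop(guest="guestvm", interval_s=10):
        thresholds = load_thresholds()        # initialization
        while True:
            readings = read_sensors()         # health monitoring via the BMC
            breached = [name for name, value in readings.items()
                        if name in thresholds and
                        not (thresholds[name][0] <= value <= thresholds[name][1])]
            if breached:
                target = least_loaded_node()  # load balancing via Ganglia
                if target is not None:
                    live_migrate(guest, target[0])  # preventive action
                    break  # node is then flagged for maintenance
            time.sleep(interval_s)            # periodic sampling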
12
Experimental Framework
  • Cluster of 16 nodes (dual-core, dual Opteron 265,
    1 Gbps Ethernet)
  • Xen-3.0.2-3 VMM
  • Privileged and guest VMs run a ported Linux kernel,
    version 2.6.16
  • Guest VM
  • same configuration as the privileged VM
  • has 1 GB RAM
  • booted on the VMM w/ PXE netboot via NFS
  • has access to NFS (same as the privileged VM)
  • Ganglia on the privileged VM (and also the guest VM) on
    all nodes
  • Node sensors obtained via OpenIPMI

13
Experimental Framework
  • NAS Parallel Benchmarks run on the guest virtual
    machines
  • MPICH-2 w/ an MPD ring on n guest VMs (no job-pause
    required!); launch commands are sketched after
    this list
  • Process on the privileged domain
  • monitors the MPI task runs
  • issues the migration command (NFS used for
    synchronization)
  • Measured
  • wallclock time with and w/o migration
  • actual downtime and migration overhead (modified
    Xen migration)
  • benchmarks run 10 times, results report the average
  • NPB V3.2.1: BT, CG, EP, LU and SP benchmarks
  • IS run is too short
  • MG requires > 1 GB for class C

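A sketch of the benchmark launch on the guest VMs, wrapping the MPICH-2 MPD commands the slides mention; the host-file name and benchmark binary path are placeholders.

    # Sketch: bring up an MPD ring across the guest VMs and run an NPB binary.
    import subprocess

    def run_npb(nprocs=4, hostfile="guest_hosts", binary="./bt.C.4"):
        subprocess.run(["mpdboot", "-n", str(nprocs), "-f", hostfile], check=True)
        subprocess.run(["mpiexec", "-n", str(nprocs), binary], check=True)
        subprocess.run(["mpdallexit"], check=True)  # tear down the ring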
14
Experimental Results
  • 1. Single node failure
  • 2. Double node failure

[Charts: NPB Class B / 4 nodes; NPB Class C / 4 nodes]
  • Single node failure: overhead of 1-4% over
    total wall clock time
  • Double node failure: overhead of 2-8% over
    total wall clock time

15
Experimental Results
  • 3. Behavior of problem scaling
  • chart depicts only the overhead portion
  • dark region represents the part for which the VM
    was halted
  • light region represents the delay incurred
    due to migration (diff operations, etc.)

[Chart: NPB, 4 nodes]
  • Generally, overhead increases with problem size
    (CG is an exception)

16
Experimental Results
  • 4. Behavior of task scaling
  • generally we expect a decrease in overhead when
    increasing the number of nodes
  • some discrepancies observed for BT and LU
    (migration duration is 40 s, but here we see 60 s)

[Chart: NPB Class C]
17
Experimental Results
  • 5. Migration duration

[Charts: NPB, 4 nodes; NPB, 4/8/16 nodes]
  • minimum of 13 s needed to transfer a 1 GB VM w/o any
    active processes
  • maximum of 40 s needed before migration is
    initiated
  • depends on the network bandwidth, RAM size, and the
    application

18
Experimental Results
  • 6. Scalability (total execution time)

[Chart: NPB Class C]
  • speedup is not significantly affected

19
Related Work
  • FT: the reactive approach is more common
  • automatic
  • checkpoint/restart (e.g. BLCR, Berkeley Lab
    Checkpoint/Restart) [S. Sankaran et al., LACSI '03;
    G. Stellner, IPPS '96]
  • log based (logging messages and their temporal
    ordering) [G. Bosilca et al., Supercomputing 2002]
  • non-automatic
  • explicit invocation of checkpoint routines
    [R. T. Aulwes et al., IPDPS 2004; G. E. Fagg and
    J. J. Dongarra, 2000]
  • Virtualization in HPC has little/no overhead
    [W. Huang et al., ICS '06]
  • To make virtualization competitive for MP
    environments, VMM-bypass I/O in VMs has been
    explored [J. Liu et al., USENIX '06]
  • network virtualization can be optimized [A. Menon
    et al., USENIX '06]

20
Conclusion
  • In contrast to the currently available reactive
    FT schemes, we have developed a proactive
    system with much lower overhead
  • Transparent and automatic FT for arbitrary MPI
    applications
  • Ideally suited to long-running MPI jobs
  • A proactive system complements reactive systems
    greatly (it helps reduce the high overhead
    associated with reactive schemes)