1
Arun Babu Nagarajan, Frank Mueller
North Carolina State University
Proactive Fault Tolerance for HPC using Xen Virtualization
2
Problem Statement
  • Trends in HPC: high-end systems with thousands of
    processors
  • Increased probability of a node failure: MTBF
    becomes shorter
  • MPI widely accepted in scientific computing
  • Problem with MPI: no recovery from faults in the
    standard
  • Fault-tolerance solutions exist today, but
  • only reactive: process checkpoint/restart
  • must restart the entire job
  • inefficient if only one (or a few) node(s) fails
  • overhead due to redoing some of the work
  • issue: checkpoint at what frequency?
  • a 100-hour job will run for an additional 150 hours on a
    petaflop machine (w/o failure) [I. Philp, 2005]

3
Our Solution
  • Proactive FT
  • anticipates node failure
  • takes preventive action instead of reacting
    to a failure
  • migrates the whole OS to a better physical node
  • entirely transparent to the application (the OS
    itself, rather than the application, is what migrates)
  • hence avoids the high overhead of reactive
    schemes (the overhead associated with our scheme is
    very small)

4
Design space
  • 1. A mechanism to predict/anticipate the failure
    of a node
  • OpenIPMI
  • lm_sensors (more system-specific: x86 Linux)
  • 2. A mechanism to identify the best target node
  • custom centralized approaches don't scale and are
    unreliable
  • scalable distributed approach: Ganglia
  • 3. More importantly, a mechanism (for preventive
    action) which supports the relocation of the
    running application with
  • its state preserved
  • minimum overhead on the application itself
  • Xen virtualization with live migration support
    [C. Clark et al., May 2005]
  • open source

5
Mechanisms explained
  • 1. Health monitoring with OpenIPMI
  • Baseboard Management Controller (BMC): equipped with
    sensors to monitor different properties of each node,
    such as temperature, fan speed, and voltage
  • IPMI (Intelligent Platform Management Interface)
  • increasingly common in HPC
  • standard message-based interface to monitor hardware
  • raw messaging is harder to use and debug
  • OpenIPMI: open source, higher-level abstraction
    over the raw IPMI message-response system to
    communicate w/ the BMC (i.e. to read sensors)
  • We use OpenIPMI to gather health information of
    nodes (a reading sketch follows this list)

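The slides use the OpenIPMI C library to query the BMC. As a minimal, hedged sketch of the same health-reading step, the Python snippet below shells out to the ipmitool CLI instead (a stand-in for the OpenIPMI calls, not the authors' code); sensor names and readings are machine-specific.

    # Sketch only: ipmitool stands in for the OpenIPMI library used by the
    # authors; it queries the local BMC and collects numeric sensor readings.
    import subprocess

    def read_sensors():
        """Return {sensor_name: numeric_reading} parsed from `ipmitool sensor`."""
        out = subprocess.run(["ipmitool", "sensor"],
                             capture_output=True, text=True, check=True).stdout
        readings = {}
        for line in out.splitlines():
            fields = [f.strip() for f in line.split("|")]
            if len(fields) < 2:
                continue
            try:
                readings[fields[0]] = float(fields[1])  # name, current reading
            except ValueError:
                pass  # skip "na" and discrete (non-numeric) sensors
        return readings

    if __name__ == "__main__":
        for name, value in sorted(read_sensors().items()):
            print(name, value)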
6
Mechanisms explained
  • 2. Ganglia
  • widely used, scalable distributed load
    monitoring tool
  • all the nodes in the cluster run a Ganglia daemon,
    and each node has an approximate view of the
    entire cluster
  • UDP used to transfer messages
  • measures
  • CPU usage, memory usage, network usage by default
  • We use Ganglia to identify the least loaded node →
    the migration target (a selection sketch follows
    this list)
  • also extended to distribute IPMI sensor data

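As a sketch of the target-selection step, the snippet below reads the cluster XML that a Ganglia gmond daemon publishes on TCP port 8649 and picks the host with the lowest one-minute load. "load_one" is a standard Ganglia metric; the lowest-load policy mirrors the slides, but the exact selection code here is an illustrative assumption.

    # Sketch: query gmond's XML report and return the least loaded host.
    import socket
    import xml.etree.ElementTree as ET

    def least_loaded_node(gmond_host="localhost", port=8649):
        with socket.create_connection((gmond_host, port)) as s:
            chunks = []
            while True:
                data = s.recv(4096)
                if not data:
                    break
                chunks.append(data)
        root = ET.fromstring(b"".join(chunks))
        best = None  # (hostname, load_one)
        for host in root.iter("HOST"):
            for metric in host.iter("METRIC"):
                if metric.get("NAME") == "load_one":
                    load = float(metric.get("VAL"))
                    if best is None or load < best[1]:
                        best = (host.get("NAME"), load)
        return best

    if __name__ == "__main__":
        print(least_loaded_node())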
7
Mechanisms explained
  • 3. Fault tolerance w/ Xen
  • para-virtualized environment
  • OS modified
  • application unchanged
  • privileged VM and guest VMs run on the Xen
    hypervisor/VMM
  • guest VMs can live migrate to other hosts →
    little overhead
  • state of the VM is preserved
  • VM halted only for an insignificant period of time
  • Migration phases (the migration command is sketched
    after this list)
  • phase 1: send guest image → dst node, app
    running
  • phase 2: repeated diffs → dst node, app still
    running
  • phase 3: commit final diffs → dst node, OS/app
    frozen
  • phase 4: activate guest on dst, app running again

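The live migration itself is driven through Xen's user-land tools. A minimal sketch of the preventive action, assuming a guest named "guestvm", a reachable target host, and xend relocation enabled on the destination, is:

    # Sketch: initiate Xen 3.x live migration from the privileged VM.
    # Phases 1-4 above (image copy, repeated diffs, final commit, activation)
    # are performed internally by Xen once this command is issued.
    import subprocess

    def live_migrate(guest="guestvm", target_host="spare-node"):
        subprocess.run(["xm", "migrate", "--live", guest, target_host],
                       check=True)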
8
Overall set-up of the components
[Diagram: cluster nodes, each with a BMC (Baseboard Management Controller) in
hardware and a Xen VMM hosting a privileged VM that runs the PFT daemon and
Ganglia; a "Migrate" arrow points to the stand-by host.]
  • Stand-by Xen host, no guest
  • Deteriorating health → migrate guest (w/ MPI app)
    to stand-by host
9
Overall set-up of the components
[Diagram: as before, each node runs the PFT daemon and Ganglia on the
privileged VM atop the Xen VMM, with a BMC in hardware; the guest VM running
the MPI task now executes on the former stand-by host.]
  • Stand-by Xen host, no guest
  • Deteriorating health → migrate guest (w/ MPI app)
    to stand-by host
  • The destination host generates an unsolicited ARP
    reply advertising that the guest VM's IP has moved to a
    new location [C. Clark et al., 2005]
  • this ensures that peers resend packets
    to the new host (an illustration follows)
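For illustration only: Xen's migration tools send this unsolicited (gratuitous) ARP reply themselves after migration. A manual equivalent using the iputils arping tool, with a placeholder interface and guest IP, would be:

    # Illustration: gratuitous ARP so peers update their ARP cache entry for
    # the migrated guest's IP. Xen performs this step automatically.
    import subprocess

    def advertise_new_location(guest_ip="10.1.1.10", iface="eth0", count=5):
        subprocess.run(["arping", "-U", "-I", iface, "-c", str(count), guest_ip],
                       check=True)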
10
Proactive Fault Tolerance (PFT) Daemon
PFT Daemon
  • Runs on the privileged VM (host)
  • Initialize
  • read safe thresholds from a config file
  • <Sensor name> <Low Thr> <Hi Thr>
    (a parsing sketch follows the flowchart below)
  • CPU temperature, fan speeds
  • extensible (corrupt sectors, network, voltage
    fluctuations, ...)
  • init connection w/ the IPMI BMC using authentication
    parameters and hostname
  • gathers a listing of available sensors in the
    system and validates it against our list

[Flowchart: PFTd reads sensors from the IPMI Baseboard Management Controller.
Initialize → Health Monitor → threshold breach? If no, keep monitoring; if
yes, Load Balance (via Ganglia) and raise an alarm / schedule maintenance of
the system.]
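A minimal sketch of the threshold-file parsing during initialization. The <Sensor name> <Low Thr> <Hi Thr> format comes from the slide; treating the last two whitespace-separated fields as the thresholds (so sensor names may contain spaces) and the file name are assumptions of this sketch.

    # Sketch: read the safe-threshold config into {sensor_name: (low, high)}.
    def load_thresholds(path="pftd.conf"):
        thresholds = {}
        with open(path) as f:
            for line in f:
                line = line.strip()
                if not line or line.startswith("#"):
                    continue
                *name_parts, low, high = line.split()
                thresholds[" ".join(name_parts)] = (float(low), float(high))
        return thresholds

    # Example config line (hypothetical sensor name and values):
    #   CPU0 Temp 10 70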
11
PFT Daemon
  • Health monitoring
  • interacts w/ the IPMI BMC (via OpenIPMI) to read
    sensors
  • periodic sampling of data (event-driven is also
    supported)
  • threshold exceeded → control handed over to load
    balancing
  • PFTd determines the migration target by contacting
    Ganglia
  • load-based selection (lowest load)
  • load obtained via the /proc file system
  • Invokes Xen live migration for the guest VM
  • Xen user-land tools (at VM/host)
  • command-line interface for live migration
  • PFT daemon initiates migration for the guest VM
    (a combined monitoring-loop sketch follows this list)

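Tying the pieces together, here is a sketch of the PFTd monitoring loop built from the helper sketches on the earlier slides (load_thresholds, read_sensors, least_loaded_node, live_migrate). The sampling interval and the migrate-on-first-breach policy are assumptions; the slides note that event-driven notification is also supported.

    # Sketch: periodic health check -> threshold test -> pick target -> migrate.
    # Relies on the helper functions sketched on the earlier slides.
    import time

    def pftd_loop(guest="guestvm", interval_s=10):
        thresholds = load_thresholds()        # initialization
        while True:
            readings = read_sensors()         # health monitoring via the BMC
            breached = [name for name, value in readings.items()
                        if name in thresholds and
                        not (thresholds[name][0] <= value <= thresholds[name][1])]
            if breached:
                target = least_loaded_node()  # load balancing via Ganglia
                if target is not None:
                    live_migrate(guest, target[0])  # preventive action
                    break  # node is then flagged for maintenance
            time.sleep(interval_s)            # periodic sampling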
12
Experimental Framework
  • Cluster of 16 nodes (dual-core, dual Opteron 265,
    1 Gbps Ethernet)
  • Xen-3.0.2-3 VMM
  • Privileged and guest VMs run a ported Linux kernel,
    version 2.6.16
  • Guest VM
  • same configuration as the privileged VM
  • has 1 GB RAM
  • booted on the VMM w/ PXE netboot via NFS
  • has access to NFS (same as the privileged VM)
  • Ganglia on the privileged VM (and also the guest VM) on
    all nodes
  • Node sensors obtained via OpenIPMI

13
Experimental Framework
  • NAS Parallel Benchmarks run on the guest virtual
    machines
  • MPICH-2 w/ an MPD ring on n guest VMs (no job-pause
    required!); launch commands are sketched after
    this list
  • Process on the privileged domain
  • monitors the MPI task runs
  • issues the migration command (NFS used for
    synchronization)
  • Measured
  • wallclock time with and w/o migration
  • actual downtime and migration overhead (modified
    Xen migration)
  • benchmarks run 10 times, results report the average
  • NPB V3.2.1: BT, CG, EP, LU and SP benchmarks
  • IS run is too short
  • MG requires > 1 GB for class C

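A sketch of the benchmark launch on the guest VMs, wrapping the MPICH-2 MPD commands the slides mention; the host-file name and benchmark binary path are placeholders.

    # Sketch: bring up an MPD ring across the guest VMs and run an NPB binary.
    import subprocess

    def run_npb(nprocs=4, hostfile="guest_hosts", binary="./bt.C.4"):
        subprocess.run(["mpdboot", "-n", str(nprocs), "-f", hostfile], check=True)
        subprocess.run(["mpiexec", "-n", str(nprocs), binary], check=True)
        subprocess.run(["mpdallexit"], check=True)  # tear down the ring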
14
Experimental Results
  • 1. Single node failure
  • 2. Double node failure

[Charts: NPB Class B / 4 nodes; NPB Class C / 4 nodes]
  • Single node failure: overhead of 1-4% over
    total wall clock time
  • Double node failure: overhead of 2-8% over
    total wall clock time

15
Experimental Results
  • 3. Behavior of problem scaling
  • chart depicts only the overhead portion
  • dark region represents the part for which the VM
    was halted
  • light region represents the delay incurred
    due to migration (diff operations, etc.)

[Chart: NPB, 4 nodes]
  • Generally, overhead increases with problem size
    (CG is an exception)

16
Experimental Results
  • 4. Behavior of task scaling
  • generally we expect a decrease in overhead when
    increasing the number of nodes
  • some discrepancies observed for BT and LU
    (migration duration is 40 s, but here we see 60 s)

[Chart: NPB Class C]
17
Experimental Results
  • 5. Migration duration

[Charts: NPB, 4 nodes; NPB, 4/8/16 nodes]
  • minimum of 13 s needed to transfer a 1 GB VM w/o any
    active processes
  • maximum of 40 s needed before migration is
    initiated
  • depends on the network bandwidth, RAM size, and the
    application

18
Experimental Results
  • 6. Scalability (total execution time)

[Chart: NPB Class C]
  • speedup is not significantly affected

19
Related Work
  • FT: the reactive approach is more common
  • automatic
  • checkpoint/restart (e.g. BLCR, Berkeley Lab
    Checkpoint/Restart) [S. Sankaran et al., LACSI '03;
    G. Stellner, IPPS '96]
  • log based (logging messages and their temporal
    ordering) [G. Bosilca et al., Supercomputing 2002]
  • non-automatic
  • explicit invocation of checkpoint routines
    [R. T. Aulwes et al., IPDPS 2004; G. E. Fagg and
    J. J. Dongarra, 2000]
  • Virtualization in HPC has little/no overhead
    [W. Huang et al., ICS '06]
  • To make virtualization competitive for MP
    environments, VMM-bypass I/O in VMs has been
    explored [J. Liu et al., USENIX '06]
  • network virtualization can be optimized [A. Menon
    et al., USENIX '06]

20
Conclusion
  • In contrast to the currently available reactive
    FT schemes, we have developed a proactive
    system with much lower overhead
  • Transparent and automatic FT for arbitrary MPI
    applications
  • Ideally suited to long-running MPI jobs
  • A proactive system complements reactive systems
    greatly (it helps reduce the high overhead
    associated with reactive schemes)