Title: High Availability: An Emerging Critical Factor for High Performance Computing
1. High Availability: An Emerging Critical Factor for High Performance Computing
- Dan Stanzione
- Director, High Performance Computing Initiative
- Arizona State University
- 10/12/04
- Los Alamos Computer Science Institute, High Availability and Performance Computing Workshop
2. Outline
- What is the modern definition of High Availability? (A little history)
- Why is it becoming more important?
- Examples
- Challenges
- New Ideas
3. A quick history of HA and HPC
- Most of the history of HA has nothing to do with HPC (makes for shorter slides)
- What it does have to do with:
- The phone system (the first failover systems)
- Database applications
- VMS
- SAN
- In most of the enterprise computing world, clustering means failover, not Beowulf
- In this space, HA is driven by up-time requirements
4. A quick history of HA and HPC
- RAID
- The issue: disks were reliable enough individually, but when clustered together, the aggregate MTBF (mean time between failures) goes way down (see the sketch below)
- The solution: redundancy and fault tolerance
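As a rough illustration of why the aggregate MTBF collapses: a minimal sketch, assuming independent, exponentially distributed failures; the per-disk figure is made up, not from the talk.

```python
# Sketch: aggregate MTBF of N independent components with
# exponential failure times. All numbers are illustrative.
disk_mtbf_hours = 100_000   # hypothetical per-disk MTBF
n_disks = 100               # disks clustered into one array

# Time to the FIRST failure anywhere in the array is MTBF / N.
array_mtbf_hours = disk_mtbf_hours / n_disks
print(f"per-disk MTBF: {disk_mtbf_hours:,} h, "
      f"array MTBF: {array_mtbf_hours:,.0f} h")   # 1,000 h, ~6 weeks
```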
5. Observations
- Fault tolerance becomes critical in two situations:
- Complex: when lots of components are clustered together, and all are required to complete a task (High Performance Computing)
- Critical: when failures are extremely costly (mission-critical applications)
- The current state of HPC is such that both the complex and the critical criteria are satisfied. Hence, HA is doubly important for HPC.
6. Recent history of HA and HPC
- As these two goals converge in HPC, HA has seen more activity, in projects and publishing:
- HA-Linux
- HA-OSCAR
- Fault-tolerant MPI (MPI-FT, FT-MPI, MPICH-V)
- Fault-tolerant Grid SW (Condor-G, Globus V3, GEMS)
- Hear much more about this today...
7. Levels of High Availability
- Hardware
- Software
- Services
8. Grids: making everything worse
- A grid application relies on a large, distributed set of computing resources, networks, and middleware
- Many, many more points of failure than ever before
- Central administration is removed
- More failure modes are possible (black holes)
- A highly available grid goes way beyond hardware failures:
- (application -> middleware -> transport -> OS -> hardware) x number of grid sites (see the sketch below)
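To put a number on "many, many more points of failure": a minimal sketch; the 0.999 per-component reliability is an assumed figure, not from the talk.

```python
# Sketch: P(a grid job sees no failure) when every layer at every
# site must work, assuming independent failures. Numbers illustrative.
layers = ["application", "middleware", "transport", "OS", "hardware"]
per_component_reliability = 0.999   # hypothetical

for n_sites in (1, 10, 100):
    p_ok = per_component_reliability ** (len(layers) * n_sites)
    print(f"{n_sites:3d} sites: P(no failure) = {p_ok:.3f}")
# 1 site -> 0.995, 10 sites -> 0.951, 100 sites -> 0.606
```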
9. Trends in science reinforce the trend towards complexity
- Bigger science, bigger data, bigger models
- Integration of online experimentation
- Multi-disciplinary science, multi-physics models (joined applications)
- More teams of researchers collaborating (no central administrative control!)
- More reliance on simulation
10. An ASU Example: LTER Scenario
- (Diagram: LTER scenario; only the labels "Price" and "Quotas" survive from the figure)
11. (Diagram: the ASU mini-grid and its component clusters)
- Translational Genomics Affiliated Cluster
- Biofluidics Cluster
- NanoStructures Cluster
- Decision Theatre Visualization Cluster
- Environmental Fluid Dynamics Cluster
- Mini-Grid Backbone
- Storage Cluster
- Fulton School Central Cluster (256 procs)
12. Examples: ASU Decision Theater
- Rendering Cluster
- Modeling Cluster
13. Examples: HPC centers grow up
- ASU
- TACC, OSC, NCSA, SDSC, DOE Labs, many others
- Production HPC is now a requirement for all (not five 9's yet, but demand grows)
- Must be 100% available for at least the length of the longest run!
- We need better metrics! Uptime is not adequate; how about successful completion? (see the sketch below)
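"Successful completion" can be made quantitative. Under the standard assumption of independent, exponentially distributed node failures, a job on N nodes for T hours finishes uninterrupted with probability exp(-N*T/MTBF). A minimal sketch with made-up numbers:

```python
import math

# Sketch: chance an uninterrupted run completes, assuming independent
# exponential node failures. All numbers are illustrative.
node_mtbf_hours = 50_000   # hypothetical per-node MTBF
n_nodes = 512
run_hours = 72             # a three-day run

p_complete = math.exp(-n_nodes * run_hours / node_mtbf_hours)
print(f"P(completion) = {p_complete:.2f}")   # ~0.48: a coin flip
```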
14. Challenges
- High Availability must go beyond hardware fault tolerance
- Computers, networks, and software services must all be reliable (the grid community is recognizing this)
- Detection of failure is more difficult
- Production HPC raises the bar
- High Availability must reach the application level
15. How can we achieve High Availability?
- More reliable components, HW and SW
- Just kidding.
- Redundancy: extra components take over for failed ones (e.g., HA-OSCAR)
- Restart/re-schedule: if at first you don't succeed...
- In the event of a failure, try again
- Most Grid FT approaches use this
- Essentially, the toolbox is redundancy in space (components) or time (restart), compared in the sketch below
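A back-of-the-envelope comparison of the two toolbox options: a sketch under the same exponential-failure assumption as above; all numbers illustrative.

```python
import math

# Sketch: redundancy in time (restart until success) vs space
# (run two copies), with illustrative numbers.
node_mtbf_hours, n_nodes, run_hours = 50_000, 512, 72
p = math.exp(-n_nodes * run_hours / node_mtbf_hours)  # one attempt succeeds

# Time: attempts until success are geometric with mean 1/p; count
# each attempt at full length (a pessimistic simplification).
restart_node_hours = (1 / p) * n_nodes * run_hours

# Space: two independent copies; succeed if either one survives.
replicate_node_hours = 2 * n_nodes * run_hours
p_either = 1 - (1 - p) ** 2

print(f"restart:   ~{restart_node_hours:,.0f} node-h, eventual success")
print(f"replicate: {replicate_node_hours:,} node-h, P = {p_either:.2f}")
```

Which side wins depends on job length, node count, and failure rates, which is part of the argument for pushing the choice up to policy (next slides).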
16. Silly Ideas: RAIC
- Redundant Array of Inexpensive Clusters
- Hardware vendors approve!
- More likely: replicate computation
- Are nodes really that expensive?
17. Make the redundancy tradeoff someone else's problem
- Let policy happen at the job/user/site level
- Specify to the resource manager? (see the sketch below)
- Possible modes. This job is:
- Fully redundant: use duplicate HW everywhere
- Automatically re-started on failure
- Run only on reliable resources (on a grid, i.e., not running on failover components)
- I'll take my chances
- (Inspiration on loan from the PVFS design)
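What handing the choice to the resource manager might look like: a sketch only; the field names and the submit() hook below are hypothetical, not from any real scheduler.

```python
# Sketch: per-job redundancy policy expressed as data and handed to a
# resource manager. Everything here is hypothetical, for illustration.
job = {
    "executable": "./climate_model",   # made-up job
    "nodes": 512,
    "walltime_hours": 72,
    # The four modes from this slide:
    #   "replicate"   - fully redundant, duplicate HW everywhere
    #   "restart"     - automatically re-run on failure
    #   "reliable"    - schedule only onto non-failover resources
    #   "best_effort" - "I'll take my chances"
    "failure_policy": "restart",
}

def submit(job: dict) -> None:
    """Hypothetical hook: a real scheduler would enforce the policy
    here instead of baking it into the system software."""
    print(f"submitting {job['executable']} "
          f"(policy = {job['failure_policy']})")

submit(job)
```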
18. Make the redundancy tradeoff someone else's problem
- In terms of nodes, all clusters are already redundant (if you make them half the size)
- In theory, grids are too
- FT is expensive: build the support, but not the policy
- Don't make the space/time redundancy tradeoff at the system software level!
19. Conclusions
- As HPC, and clustering in particular, succeeds, the stakes go up
- Research is bigger in terms of people, disciplines, dollars, and data
- Added complexity means bigger systems and less control
- More mission-critical, more components... High Availability is a must (at the application level)!
- There is more than one way to achieve it; probably no one-size-fits-all solution exists. Why try to make one?
20. Barriers?