Title: High Availability: An Emerging Critical Factor for High Performance Computing
1. High Availability: An Emerging Critical Factor for High Performance Computing
- Dan Stanzione
- Director, High Performance Computing Initiative
- Arizona State University
- 10/12/04
- Los Alamos Computer Science Institute, High Availability and Performance Computing Workshop
2. Outline
- What is the modern definition of High Availability? (A little history)
- Why is it becoming more important?
- Examples
- Challenges
- New Ideas
3. A quick history of HA and HPC
- Most of the history of HA has nothing to do with HPC (makes for shorter slides)
- What it does have to do with:
- The phone system (the first failover systems)
- Database applications
- VMS
- SAN
- In most of the enterprise computing world, clustering means failover, not Beowulf
- In this space, HA is driven by up-time requirements
4. A quick history of HA and HPC
- RAID
- The issue: disks were reliable enough individually, but when clustered together, the aggregate MTBF (mean time between failures) goes way down (see the sketch below)
- The solution: redundancy and fault tolerance
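As a rough illustration of why the aggregate MTBF collapses: a minimal sketch, assuming independent, exponentially distributed failures; the per-disk figure is made up, not from the talk.

```python
# Sketch: aggregate MTBF of N independent components with
# exponential failure times. All numbers are illustrative.
disk_mtbf_hours = 100_000   # hypothetical per-disk MTBF
n_disks = 100               # disks clustered into one array

# Time to the FIRST failure anywhere in the array is MTBF / N.
array_mtbf_hours = disk_mtbf_hours / n_disks
print(f"per-disk MTBF: {disk_mtbf_hours:,} h, "
      f"array MTBF: {array_mtbf_hours:,.0f} h")   # 1,000 h, ~6 weeks
```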
5. Observations
- Fault tolerance becomes critical in two situations:
- Complex: when lots of components are clustered together, and all are required to complete a task (High Performance Computing)
- Critical: when failures are extremely costly (mission-critical applications)
- The current state of HPC is such that both the complex and the critical criteria are satisfied. Hence, HA is doubly important for HPC.
6. Recent history of HA and HPC
- As these two goals converge in HPC, HA has seen more activity, in projects and publishing:
- HA-Linux
- HA-OSCAR
- Fault-tolerant MPI (MPI-FT, FT-MPI, MPICH-V)
- Fault-tolerant Grid SW (Condor-G, Globus V3, GEMS)
- Hear much more about this today...
7. Levels of High Availability
- Hardware
- Software
- Services
8. Grids: making everything worse
- A grid application relies on a large, distributed set of computing resources, networks, and middleware
- Many, many more points of failure than ever before
- Central administration is removed
- More failure modes are possible (black holes)
- A highly available grid goes way beyond hardware failures:
- (application -> middleware -> transport -> OS -> hardware) x number of grid sites (see the sketch below)
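To put a number on "many, many more points of failure": a minimal sketch; the 0.999 per-component reliability is an assumed figure, not from the talk.

```python
# Sketch: P(a grid job sees no failure) when every layer at every
# site must work, assuming independent failures. Numbers illustrative.
layers = ["application", "middleware", "transport", "OS", "hardware"]
per_component_reliability = 0.999   # hypothetical

for n_sites in (1, 10, 100):
    p_ok = per_component_reliability ** (len(layers) * n_sites)
    print(f"{n_sites:3d} sites: P(no failure) = {p_ok:.3f}")
# 1 site -> 0.995, 10 sites -> 0.951, 100 sites -> 0.606
```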
9. Trends in science reinforce the trend towards complexity
- Bigger science, bigger data, bigger models
- Integration of online experimentation
- Multi-disciplinary science, multi-physics models (joined applications)
- More teams of researchers collaborating (no central administrative control!)
- More reliance on simulation
10. An ASU Example: LTER Scenario
- (Diagram: LTER scenario; only the labels "Price" and "Quotas" survive from the figure)
11. (Diagram: the ASU mini-grid and its component clusters)
- Translational Genomics Affiliated Cluster
- Biofluidics Cluster
- NanoStructures Cluster
- Decision Theatre Visualization Cluster
- Environmental Fluid Dynamics Cluster
- Mini-Grid Backbone
- Storage Cluster
- Fulton School Central Cluster (256 procs)
12. Examples: ASU Decision Theater
- Rendering Cluster
- Modeling Cluster
13. Examples: HPC centers grow up
- ASU
- TACC, OSC, NCSA, SDSC, DOE Labs, many others
- Production HPC is now a requirement for all (not five 9's yet, but demand grows)
- Must be 100% available for at least the length of the longest run!
- We need better metrics! Uptime is not adequate; how about successful completion? (see the sketch below)
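"Successful completion" can be made quantitative. Under the standard assumption of independent, exponentially distributed node failures, a job on N nodes for T hours finishes uninterrupted with probability exp(-N*T/MTBF). A minimal sketch with made-up numbers:

```python
import math

# Sketch: chance an uninterrupted run completes, assuming independent
# exponential node failures. All numbers are illustrative.
node_mtbf_hours = 50_000   # hypothetical per-node MTBF
n_nodes = 512
run_hours = 72             # a three-day run

p_complete = math.exp(-n_nodes * run_hours / node_mtbf_hours)
print(f"P(completion) = {p_complete:.2f}")   # ~0.48: a coin flip
```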
14. Challenges
- High Availability must go beyond hardware fault tolerance
- Computers, networks, and software services must all be reliable (the grid community is recognizing this)
- Detection of failure is more difficult
- Production HPC raises the bar
- High Availability must reach the application level
15. How can we achieve High Availability?
- More reliable components, HW and SW
- Just kidding.
- Redundancy: extra components take over for failed ones (e.g., HA-OSCAR)
- Restart/re-schedule: if at first you don't succeed...
- In the event of a failure, try again
- Most Grid FT approaches use this
- Essentially, the toolbox is redundancy in space (components) or time (restart), compared in the sketch below
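A back-of-the-envelope comparison of the two toolbox options: a sketch under the same exponential-failure assumption as above; all numbers illustrative.

```python
import math

# Sketch: redundancy in time (restart until success) vs space
# (run two copies), with illustrative numbers.
node_mtbf_hours, n_nodes, run_hours = 50_000, 512, 72
p = math.exp(-n_nodes * run_hours / node_mtbf_hours)  # one attempt succeeds

# Time: attempts until success are geometric with mean 1/p; count
# each attempt at full length (a pessimistic simplification).
restart_node_hours = (1 / p) * n_nodes * run_hours

# Space: two independent copies; succeed if either one survives.
replicate_node_hours = 2 * n_nodes * run_hours
p_either = 1 - (1 - p) ** 2

print(f"restart:   ~{restart_node_hours:,.0f} node-h, eventual success")
print(f"replicate: {replicate_node_hours:,} node-h, P = {p_either:.2f}")
```

Which side wins depends on job length, node count, and failure rates, which is part of the argument for pushing the choice up to policy (next slides).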
16. Silly Ideas: RAIC
- Redundant Array of Inexpensive Clusters
- Hardware vendors approve!
- More likely: replicate computation
- Are nodes really that expensive?
17. Make the redundancy tradeoff someone else's problem
- Let policy happen at the job/user/site level
- Specify to the resource manager? (see the sketch below)
- Possible modes. This job is:
- Fully redundant: use duplicate HW everywhere
- Automatically re-started on failure
- Run only on reliable resources (on a grid, i.e., not running on failover components)
- I'll take my chances
- (Inspiration on loan from the PVFS design)
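What handing the choice to the resource manager might look like: a sketch only; the field names and the submit() hook below are hypothetical, not from any real scheduler.

```python
# Sketch: per-job redundancy policy expressed as data and handed to a
# resource manager. Everything here is hypothetical, for illustration.
job = {
    "executable": "./climate_model",   # made-up job
    "nodes": 512,
    "walltime_hours": 72,
    # The four modes from this slide:
    #   "replicate"   - fully redundant, duplicate HW everywhere
    #   "restart"     - automatically re-run on failure
    #   "reliable"    - schedule only onto non-failover resources
    #   "best_effort" - "I'll take my chances"
    "failure_policy": "restart",
}

def submit(job: dict) -> None:
    """Hypothetical hook: a real scheduler would enforce the policy
    here instead of baking it into the system software."""
    print(f"submitting {job['executable']} "
          f"(policy = {job['failure_policy']})")

submit(job)
```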
18. Make the redundancy tradeoff someone else's problem
- In terms of nodes, all clusters are already redundant (if you make them half the size)
- In theory, grids are too
- FT is expensive: build the support, but not the policy
- Don't make the space/time redundancy tradeoff at the system software level!
19. Conclusions
- As HPC, and clustering in particular, succeeds, the stakes go up
- Research is bigger in terms of people, disciplines, dollars, and data
- Added complexity means bigger systems and less control
- More mission-critical, more components... High Availability is a must (at the application level)!
- There is more than one way to achieve it; probably no one-size-fits-all solution exists. Why try to make one?
20. Barriers?