High Availability: An Emerging Critical Factor for High Performance Computing


1
High Availability: An Emerging Critical Factor
for High Performance Computing
  • Dan Stanzione
  • Director
  • High Performance Computing Initiative
  • Arizona State University
  • 10/12/04
  • Los Alamos Computer Sciences Institute
  • High Availability and Performance Computing
    Workshop

2
Outline
  • What is the modern definition of High
    Availability? (A little history)
  • Why is it becoming more important?
  • Examples
  • Challenges
  • New Ideas

3
A quick history of HA and HPC
  • Most of the history of HA has nothing to do with
    HPC (Makes for shorter slides)
  • What it does have to do with:
    • The phone system (first failover systems)
    • Database applications
    • VMS
    • SAN
  • In most of the enterprise computing world,
    "clustering" means failover, not Beowulf
  • In this space, HA is driven by up-time requirements

4
A quick history of HA and HPC
  • RAID
  • The issue:
    • Disks were reliable enough individually, but when
      clustered together, MTBF (mean time between
      failures) goes way down.
  • Solutions: redundancy and fault tolerance
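
A rough illustration of the MTBF point above, assuming identical, independent components with a constant failure rate, where the system fails as soon as any one component fails. The function name and the disk rating are hypothetical and do not come from the talk.

```python
def system_mtbf(component_mtbf_hours: float, n_components: int) -> float:
    """MTBF of a system that fails when any one of n identical,
    independent components fails (constant failure rate assumed)."""
    if n_components < 1:
        raise ValueError("n_components must be at least 1")
    # Failure rates add, so the combined MTBF shrinks by a factor of n.
    return component_mtbf_hours / n_components


# Hypothetical figure: a disk rated at 500,000 hours MTBF.
for n in (1, 10, 100, 1000):
    print(f"{n:5d} disks -> {system_mtbf(500_000, n):10.1f} hours")
```

Under these assumptions, a component that looks very reliable on its own becomes a frequent source of failure once hundreds of them must all work at the same time.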

5
Observations
  • Fault tolerance becomes critical in two
    situations:
    • Complex: when lots of components are clustered
      together, and all are required to complete a task
      (High Performance Computing)
    • Critical: when failures are extremely costly
      (mission-critical applications)
  • The current state of HPC is such that both the
    complex and critical criteria are satisfied.
    Hence, HA is doubly important for HPC.

6
Recent history of HA and HPC
  • As these two goals converge in HPC, HA has seen
    more activity, in projects and publishing:
    • HA-Linux
    • HA-OSCAR
    • Fault Tolerant MPI
      (MPI-FT, FT-MPI, MPICH-V)
    • Fault Tolerant Grid SW
      (Condor-G, Globus V3, GEMS)
  • Hear much more about this today...

7
Levels of High Availability
  • Hardware
  • Software
  • Services

8
Grids making everything worse
  • A grid application relies on a large, distributed
    set of computing resources, networks, and
    middleware
  • Many, many more points of failure than ever
    before
  • Central administration is removed
  • More failure modes possible (black holes)
  • A highly available grid goes way beyond hardware
    failures
  • (application -> middleware -> transport -> OS -> hardware)
    × number of grid sites
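
One way to put numbers on the "layers times sites" point is to treat the stack as a serial chain. This is a minimal sketch, assuming every layer at every site must be up and that failures are independent. The 0.99 per-layer availability and the function name are hypothetical.

```python
import math


def chain_availability(layer_availabilities, n_sites=1):
    """Availability of a job that needs every layer at every site to be up,
    assuming independent failures."""
    per_site = math.prod(layer_availabilities)
    return per_site ** n_sites


# Hypothetical: application, middleware, transport, OS, hardware at 99% each.
layers = [0.99] * 5
for sites in (1, 4, 16):
    print(f"{sites:2d} site(s): {chain_availability(layers, sites):.3f}")
```

Even with optimistic per-layer figures, the combined availability drops quickly as sites are added, which is the sense in which grids multiply the points of failure.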

9
Trends in science reinforce the trend towards
complexity
  • Bigger science, bigger data, bigger models
  • Integration of online experimentation
  • Multi-disciplinary science: multi-physics models
    (joined applications)
  • More teams of researchers collaborating (no
    central administrative control!)
  • More reliance on simulation

10
An ASU Example: LTER Scenario
[Figure; labels: Price, Quotas]

11
[Figure; labels:]
  • Translational Genomics Affiliated Cluster
  • Biofluidics Cluster
  • NanoStructures Cluster
  • Decision Theatre Visualization Cluster
  • Environmental Fluid Dynamics Cluster
  • Mini-Grid Backbone
  • Storage Cluster
  • Fulton School Central Cluster (256 Procs)

12
Examples: ASU Decision Theater
[Figure; labels: Rendering Cluster, Modeling Cluster]

13
Examples: HPC Centers grow up
  • ASU
  • TACC, OSC, NCSA, SDSC, DOE Labs, many others
  • Production HPC is now a requirement for all
    (not five 9s yet, but demand grows).
  • Must be 100% available for at least the length of
    the longest run!
  • We need better metrics! Uptime is not adequate;
    how about successful completion?
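
The question about metrics can be made concrete with a small sketch that scores a job log by successful completion rather than by uptime. The record layout, the metric names, and the sample numbers are all hypothetical; the talk does not define a specific metric.

```python
from dataclasses import dataclass


@dataclass
class JobRecord:
    node_hours: float  # node-hours consumed by the job
    completed: bool    # True if the job ran to successful completion


def completion_metrics(jobs):
    """Summarize a job log by successful completion instead of uptime."""
    if not jobs:
        raise ValueError("need at least one job record")
    total_hours = sum(j.node_hours for j in jobs)
    useful_hours = sum(j.node_hours for j in jobs if j.completed)
    return {
        "job_completion_rate": sum(j.completed for j in jobs) / len(jobs),
        "useful_node_hour_fraction": useful_hours / total_hours if total_hours else 0.0,
    }


# Hypothetical log: two short jobs finish, one long run is lost to a failure.
log = [JobRecord(100.0, True), JobRecord(2000.0, False), JobRecord(50.0, True)]
print(completion_metrics(log))
```

In this made-up log the machine could report high uptime, yet most of the node-hours went to a run that never finished, which is the gap an uptime figure hides.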

14
Challenges
  • High Availability must go beyond hardware fault
    tolerance.
  • Computers, networks, software services must all
    be reliable (grid community is recognizing this)
  • Detection of failure is more difficult
  • Production HPC raises the bar
  • High Availability must reach application level

15
How can we achieve High Availability?
  • More reliable components, HW and SW
  • Just kidding.
  • Redundancy: extra components take over for failed
    ones (e.g. HA-OSCAR).
  • Restart/Re-schedule: if at first you don't
    succeed...
    • In the event of a failure, try again.
    • Most Grid FT approaches use this.
  • Essentially, the toolbox is redundancy in space
    (components) or time (restart)
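
The two tools named above, redundancy in time and redundancy in space, can be sketched in a few lines. This is only an illustration under simplifying assumptions: a "task" is a plain Python callable that raises RuntimeError on failure, and threads stand in for duplicate hardware. All names here are hypothetical and are not taken from HA-OSCAR or any grid middleware.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_with_restart(task, max_attempts=3):
    """Redundancy in time: rerun the task until it succeeds or attempts run out."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return task()
        except RuntimeError as exc:
            last_error = exc
    raise RuntimeError(f"failed after {max_attempts} attempts") from last_error


def run_replicated(task, replicas=2):
    """Redundancy in space: run copies side by side and take the first success."""
    with ThreadPoolExecutor(max_workers=replicas) as pool:
        futures = [pool.submit(task) for _ in range(replicas)]
        for future in as_completed(futures):
            try:
                return future.result()
            except RuntimeError:
                continue  # this replica failed; wait for another one
    raise RuntimeError(f"all {replicas} replicas failed")


def make_flaky(fail_first):
    """Build a task whose first `fail_first` calls fail (simulated node failures)."""
    lock = threading.Lock()
    state = {"calls": 0}

    def task():
        with lock:
            state["calls"] += 1
            call_number = state["calls"]
        if call_number <= fail_first:
            raise RuntimeError("simulated node failure")
        return "done"

    return task


print(run_with_restart(make_flaky(2), max_attempts=3))
print(run_replicated(make_flaky(1), replicas=2))
```

Restart pays for a failure in extra wall-clock time, while replication pays up front in extra components, which is the space/time tradeoff the slide describes.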

16
Silly Ideas: RAIC
  • Redundant Array of Inexpensive Clusters
  • Hardware vendors approve!
  • More likely: replicate computation
  • Are nodes really that expensive?

17
Make the redundancy tradeoff someone else's
problem
  • Let policy happen at job/user/site level
  • Specify to resource manager?
  • Possible modes:
  • This job is:
    • Fully redundant: use duplicate HW everywhere
    • Automatically re-started on failure
    • Run only on reliable resources (grid, i.e., not
      running on failover components)
    • I'll take my chances
  • (Inspiration on loan from PVFS design)
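
A minimal sketch of how the modes above might be expressed as a per-job policy handed to a resource manager. The enum, the job record, and the reservation rule are hypothetical illustrations, not the interface of any existing scheduler.

```python
from dataclasses import dataclass
from enum import Enum


class RedundancyPolicy(Enum):
    FULLY_REDUNDANT = "fully_redundant"        # duplicate HW everywhere
    RESTART_ON_FAILURE = "restart_on_failure"  # automatic restart
    RELIABLE_ONLY = "reliable_only"            # only reliable resources
    BEST_EFFORT = "best_effort"                # "I'll take my chances"


@dataclass
class JobRequest:
    name: str
    nodes: int
    policy: RedundancyPolicy = RedundancyPolicy.BEST_EFFORT


def nodes_to_reserve(job: JobRequest) -> int:
    """Reserve duplicate hardware only when the job asks for full redundancy."""
    if job.policy is RedundancyPolicy.FULLY_REDUNDANT:
        return job.nodes * 2
    return job.nodes


for job in (
    JobRequest("climate-run", 64, RedundancyPolicy.FULLY_REDUNDANT),
    JobRequest("param-sweep", 64),
):
    print(job.name, job.policy.value, nodes_to_reserve(job))
```

The point of the sketch is that the cost of redundancy is chosen per job by the user or site, rather than fixed once in the system software.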

18
Make the redundancy tradeoff someone else's
problem
  • In terms of nodes, all clusters are already
    redundant (if you make them half the size).
  • In theory, grids too.
  • FT is expensive: build the support, but not the
    policy
  • Don't make the space/time redundancy tradeoff at
    the system software level!

19
Conclusions
  • As HPC, and in particular clustering, succeeds,
    the stakes go up.
  • Research is bigger in terms of people,
    disciplines, dollars and data.
  • Added complexity means bigger systems and less
    control
  • More mission critical, more components... High
    Availability is a must (at the application
    level)!
  • More than one way to achieve it: probably no
    one-size-fits-all solution. Why try to make one?

20
Barriers?
  • Let's discuss...