Transcript and Presenter's Notes

Title: Analysis of Cluster Failures on Blue Gene Supercomputers


1
Analysis of Cluster Failures on Blue Gene
Supercomputers

Tom Hacker, Fabian Romero, Chris Carothers

Scientific Computation Research Center, Department
of Computer Science, Rensselaer Polytechnic Institute
Department of Computer and Information Technology,
Discovery Cyber Center, Purdue University
2
Outline
  • Update on NSF PetaApps CFD Project
  • PetaApps Project Components
  • Current Scaling Results
  • Challenges for Fault Tolerance
  • Analysis of Clustered Failures
  • EPFL 8K Blue Gene/L
  • RPI 32K Blue Gene/L
  • Analysis Approach
  • Findings
  • Summary

To appear in an upcoming issue of JPDC
3
NSF PetaApps Parallel Adaptive CFD
  • PetaApps Components
  • CFD Solver
  • Adaptivity
  • Petascale Perf Sim
  • Fault Recovery
  • Demonstration Apps
  • Cardiovascular Flow
  • Flow Control
  • Two-phase Flow

Ken Jansen (PD), Onkar Sahni, Chris Carothers,
Mark S. Shephard

Scientific Computation Research Center
Rensselaer Polytechnic Institute
Acknowledgments
  • Partners: Simmetrix, Acusim, Kitware, IBM
  • NSF: PetaApps, ITR, CTS
  • DOE: SciDAC-ITAPS, NERI
  • AFOSR
  • Industry: IBM, Northrop Grumman, Boeing, Lockheed Martin, Motorola
  • Computer Resources: TeraGrid, ANL, NERSC, RPI-CCNI
4
PHASTA Flow Solver Parallel Paradigm
  • Time-accurate, stabilized FEM flow solver
  • Two types of work
  • Equation formation
  • O(40) peer-to-peer non-blocking comms
  • Overlapping comms with comp
  • Scales well on many machines
  • Implicit, iterative equation solution
  • Matrix assembled on processor ONLY
  • Each Krylov vector is q = Ap (matrix-vector product)
  • Same peer-to-peer comm of q PLUS
  • Orthogonalize against prior vectors
  • REQUIRES NORMS => MPI_Allreduce
  • This sets up a cycle of global comms separated
    by a modest amount of work (see the sketch after
    this slide's bullets)
  • Not currently able to overlap Comms
  • Even if work is balanced perfectly, OS jitter can
    imbalance it.
  • Imbalance WILL show up in MPI_Allreduce
  • Scales well on machines with low noise (like Blue
    Gene)
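
Below is a minimal sketch (not PHASTA source) of the communication cycle described in the bullets above: a non-blocking peer-to-peer exchange overlapped with the on-processor part of q = Ap, followed by the blocking MPI_Allreduce that the orthogonalization norms require. The peer count, buffer contents, and local_matvec stand-in are illustrative placeholders.

```c
/* Illustrative comm pattern only: non-blocking peer-to-peer exchange
 * overlapped with local work, then a blocking MPI_Allreduce for the norm. */
#include <mpi.h>
#include <math.h>
#include <stdlib.h>

#define NLOC 1000                         /* locally owned entries (placeholder) */

static void local_matvec(const double *p, double *q, int n) {
    for (int i = 0; i < n; i++) q[i] = 2.0 * p[i];   /* stand-in for A*p */
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double *p = malloc(NLOC * sizeof(double));
    double *q = malloc(NLOC * sizeof(double));
    for (int i = 0; i < NLOC; i++) p[i] = 1.0;

    int npeers = (size - 1 < 4) ? size - 1 : 4;      /* O(40) in the real solver */
    double sendbuf[4], recvbuf[4];
    MPI_Request reqs[8];
    int nreq = 0;

    /* 1. Post non-blocking peer-to-peer exchanges of shared-boundary data. */
    for (int k = 0; k < npeers; k++) {
        int dest = (rank + k + 1) % size;
        int src  = (rank - k - 1 + size) % size;
        sendbuf[k] = p[k];
        MPI_Irecv(&recvbuf[k], 1, MPI_DOUBLE, src,  0, MPI_COMM_WORLD, &reqs[nreq++]);
        MPI_Isend(&sendbuf[k], 1, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD, &reqs[nreq++]);
    }

    /* 2. Overlap: do the on-processor part of q = A*p while messages fly. */
    local_matvec(p, q, NLOC);
    MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);
    /* ...a real solver would now fold recvbuf contributions into q...     */

    /* 3. Orthogonalization norms force a global, blocking reduction; any
     *    load imbalance or OS jitter shows up as wait time here.          */
    double local_dot = 0.0, global_dot;
    for (int i = 0; i < NLOC; i++) local_dot += q[i] * q[i];
    MPI_Allreduce(&local_dot, &global_dot, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    double norm = sqrt(global_dot);
    (void)norm;

    free(p); free(q);
    MPI_Finalize();
    return 0;
}
```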

5
Parallel Implicit Flow Solver
Incompressible Abdominal Aorta Aneurysm (AAA)
IBM BG/L @ RPI-CCNI

  Cores (avg. elems./core)    t (secs.)    scale factor
  512    (204,800)            2119.7       1 (base)
  1024   (102,400)            1052.4       1.01
  2048   (51,200)              529.1       1.00
  4096   (25,600)              267.0       0.99
  8192   (12,800)              130.5       1.02
  16384  (6,400)                64.5       1.03
  32768  (3,200)                35.6       0.93

32K parts show modest degradation due to a 15% node
imbalance (with only about 600 mesh-nodes/part).
Rgn./elem. ratio_i = rgns_i / avg_rgns; Node ratio_i = nodes_i / avg_nodes (Min Zhou)
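
Reading the scale-factor column as strong-scaling efficiency relative to the 512-core base case is consistent with every row of the table; this reading is inferred from the numbers rather than stated on the slide:

```latex
s_n = \frac{512 \cdot t_{512}}{n \cdot t_n},
\qquad\text{e.g.}\quad
s_{1024} = \frac{512 \times 2119.7}{1024 \times 1052.4} \approx 1.01,
\qquad
s_{32768} = \frac{512 \times 2119.7}{32768 \times 35.6} \approx 0.93
```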
6
Scaling of AAA 105M Case
Scaling loss due to OS jitter in MPI_Allreduce
7
AAA Adapted to 10^9 Elements: Scaling on Blue Gene/P
8
ROSS Massively Parallel Discrete-Event
Simulation
[Figure: Time Warp control mechanisms in ROSS, plotted against virtual time
for three logical processes (LP 1-3).
Local Control Mechanism: error detection and rollback -- when a straggler
event arrives in an LP's past, (1) undo state changes (delta-S) and
(2) cancel sent events.
Global Control Mechanism: compute Global Virtual Time (GVT); collect
versions of state/events and perform I/O operations that are < GVT.
Legend: unprocessed, processed, straggler, and committed events.]
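
A highly simplified, illustrative Time Warp style skeleton of the two mechanisms in this diagram; it is not ROSS code, and the structures, state handling, and function names are placeholders.

```c
/* Illustrative Time Warp skeleton: local control (roll back past a
 * straggler) and global control (commit/fossil-collect events < GVT). */
#include <stdio.h>

typedef struct {
    double ts;       /* virtual timestamp of the event           */
    int    state;    /* copy of LP state saved BEFORE processing */
    int    done;     /* 1 = processed, 0 = unprocessed           */
} Event;

#define MAXEV 16
static Event ev[MAXEV];
static int   nev = 0;
static int   lp_state = 0;            /* the LP's current state  */

static void process(Event *e) {       /* optimistic forward execution      */
    e->state = lp_state;              /* state save for possible rollback  */
    lp_state += 1;                    /* stand-in for real event handling  */
    e->done = 1;
}

/* Local control: a straggler with timestamp ts arrived; undo every
 * processed event with a later timestamp.  A real simulator would also
 * send anti-messages to cancel the events it had sent out. */
static void rollback(double ts) {
    for (int i = nev - 1; i >= 0; i--) {
        if (ev[i].done && ev[i].ts > ts) {
            lp_state = ev[i].state;   /* (1) undo state changes            */
            ev[i].done = 0;           /* (2) event becomes unprocessed     */
        }
    }
}

/* Global control: events older than GVT can never be rolled back, so they
 * are committed (I/O performed, memory reclaimed). */
static int fossil_collect(double gvt) {
    int kept = 0;
    for (int i = 0; i < nev; i++)
        if (!(ev[i].done && ev[i].ts < gvt))
            ev[kept++] = ev[i];
    int committed = nev - kept;
    nev = kept;
    return committed;
}

int main(void) {
    /* Optimistically process a few events in timestamp order. */
    double stamps[] = {1.0, 2.0, 4.0, 5.0};
    for (int i = 0; i < 4; i++) {
        ev[nev].ts = stamps[i];
        ev[nev].done = 0;
        process(&ev[nev]);
        nev++;
    }

    /* A straggler with timestamp 3.0 arrives: roll back events 4.0 and 5.0. */
    rollback(3.0);
    printf("LP state after rollback: %d\n", lp_state);

    /* GVT is computed globally (an all-reduce style minimum in parallel);
     * here we simply assume GVT = 3.0 and commit everything older.        */
    printf("committed %d events below GVT\n", fossil_collect(3.0));
    return 0;
}
```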
9
  • PHOLD is a stress-test.
  • On BG/L @ CCNI
  • 1 Million LPs (note DES of BG/L comms had 6M
    LPs).
  • 10 events per LP
  • Up to 100% probability that an event is scheduled
    to some other LP (remote events)
  • Other events scheduled to self.

10
  • PHOLD on Blue Gene/P
  • At 64K cores, only 16 LPs per core with 10
    events per LP.
  • At 128K cores, only 8 LPs per core; the available
    parallelism is maxed out and performance drops off
    significantly.
  • Peak performance of 12.26 billion events/sec for
    the 10% remote case.

11
Challenges for Petascale Fault Tolerance
  • Good news with caveats
  • Our applications are scaling well.
  • But scaling runs are relatively short (i.e., < 5
    mins) and so don't experience failures
  • One early example
  • PHASTA could only run for at most 5 hours using
    32K nodes before Blue Gene/L lost at least one
    node and the whole program died
  • Systems are I/O bound... cannot checkpoint...
  • BG/P Intrepid has 550 TFlops of compute but
    only 55 to 60 GB/sec disk IO bandwidth using 4 MB
    data blocks
  • At 10% of peak flops, can only do 1 double of
    I/O per 8000 double-precision FLOPS (worked out
    below)
  • So we need to understand how systems fail in
    order to build efficient fault tolerant
    supercomputer systems
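
The 1 double per 8000 FLOPS figure follows directly from the numbers quoted above (using the lower 55 GB/sec bandwidth figure):

```latex
0.10 \times 550~\text{TFLOPS} = 5.5\times10^{13}~\text{FLOP/s},
\qquad
\frac{55~\text{GB/s}}{8~\text{B/double}} \approx 6.9\times10^{9}~\text{doubles/s},
\qquad
\frac{5.5\times10^{13}}{6.9\times10^{9}} \approx 8000~\text{FLOP per double of I/O}
```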

12
Assessing Reliability on Petascale Systems
  • Systems containing a large number of components
    experience a low mean time to failure
  • Usual methods of failure analysis assume
  • Independent failure events
  • Exponentially distributed time between events
  • Necessary for homogeneous Markov modeling and
    Poisson processes (contrast with the Weibull
    hazard sketched after this list)
  • Practical experience with systems of this scale
    shows that
  • Failures are frequently cascading
  • Analyzing time between events in reality is
    difficult
  • Difficult to put knowledge about reliability to
    practical use
  • Better understanding of reliability from a
    practical perspective would be useful
  • Increase reliability of large and long-running
    jobs
  • Decrease checkpoint frequency
  • Squeeze more efficiency from large-scale systems
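
For reference, the exponential assumption implies a constant hazard (failure) rate, while the Weibull distribution that the cleaned log data ends up fitting (slide 19) adds a shape parameter k and reduces to the exponential only when k = 1:

```latex
F_{\mathrm{exp}}(t) = 1 - e^{-\lambda t},
\qquad
F_{\mathrm{Weibull}}(t) = 1 - e^{-(t/\eta)^{k}},
\qquad
h_{\mathrm{Weibull}}(t) = \frac{k}{\eta}\left(\frac{t}{\eta}\right)^{k-1}
```

A shape parameter k < 1 gives a decreasing hazard rate, which is incompatible with the memoryless behavior that homogeneous Markov and Poisson models assume.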

13
Our approach
  • Understand underlying statistics of failure on a
    large system in the field
  • Use this understanding to attempt to predict node
    reliability
  • Put this knowledge to work to improve job
    reliability

14
Characteristics of Failure
  • Failures in large-scale systems are rarely
    independent singular events
  • A set of failures can arise from a single
    underlying cause
  • Network problems
  • Rack-level problems
  • Software subsystem problems
  • Failures are manifested as a cluster of failures
  • Grouped spatially (e.g., in a rack)
  • Grouped temporally

15
Blue Gene
  • We gathered RAS logs from two large Blue Gene
    systems (EPFL and RPI)

16
Blue Gene RAS Data
  • Events include a level of severity
  • INFO (least severe), WARNING, SEVERE, ERROR,
    FATAL, FAILURE
  • Location and time
  • Mapped these events into a 3D space to understand
    what was happening over time on the system
  • Node address -> X axis
  • Time of event -> Y axis
  • Severity level -> Z axis
    (a minimal mapping sketch follows this list)
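
A minimal sketch of the mapping just described; the struct fields and the numeric severity encoding are illustrative placeholders, not the actual Blue Gene RAS schema.

```c
/* Illustrative mapping of one RAS log record onto the 3D event space used
 * in the following graphs (node -> X, time -> Y, severity -> Z). */
#include <stdio.h>
#include <time.h>

typedef enum { INFO, WARNING, SEVERE, ERROR_SEV, FATAL, FAILURE } Severity;

typedef struct {
    int      node_address;   /* which compute node logged the event */
    time_t   timestamp;      /* when it was logged                  */
    Severity severity;       /* INFO (least severe) ... FAILURE     */
} RasEvent;

typedef struct { double x, y, z; } Point3D;

static Point3D to_point(const RasEvent *e, time_t t0) {
    Point3D p;
    p.x = (double)e->node_address;
    p.y = difftime(e->timestamp, t0);   /* seconds since start of the log */
    p.z = (double)e->severity;
    return p;
}

int main(void) {
    RasEvent e = { 1234, 86400, FATAL };           /* invented example record */
    Point3D  p = to_point(&e, 0);
    printf("(%.0f, %.0f, %.0f)\n", p.x, p.y, p.z);
    return 0;
}
```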

17
RPI Blue Gene Event Graph
18
EPFL Blue Gene Event Graph
19
Assessing Events
  • Significant spatial and temporal clustering
  • Used cluster analysis in R to reduce the clustered events
  • After this reduction, time between events follows
    a Weibull distribution
  • Needed a model to predict node reliability
  • In practice, nodes are either
  • Healthy and operating normally
  • Degraded and suspect
  • Down

20
Node Reliability Model
21
Two Markov Models
  • Accurate, but slow
  • Continuous-time Markov model
  • Cluster analysis takes over 10 hours
  • Less accurate, but much faster
  • Discrete-time Markov model (see the sketch below)
  • Time step adjustable
  • Computed in minutes
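
An illustrative sketch of how a discrete-time Markov model over the three node states from slide 19 (Healthy / Degraded / Down) can be stepped forward cheaply; the transition probabilities below are made up, since the real values are estimated from each system's RAS logs and the time step is adjustable.

```c
/* Illustrative discrete-time Markov chain over three node states.
 * The transition matrix is a placeholder, not fitted to real data. */
#include <stdio.h>

enum { HEALTHY, DEGRADED, DOWN, NSTATES };

int main(void) {
    /* P[i][j] = probability of moving from state i to state j per step. */
    const double P[NSTATES][NSTATES] = {
        { 0.990, 0.009, 0.001 },   /* Healthy                          */
        { 0.050, 0.930, 0.020 },   /* Degraded                         */
        { 0.000, 0.000, 1.000 },   /* Down (absorbing in this sketch)  */
    };

    double dist[NSTATES] = { 1.0, 0.0, 0.0 };   /* node starts Healthy */

    /* Step the chain; reliability = probability the node is not Down. */
    for (int step = 1; step <= 100; step++) {
        double next[NSTATES] = { 0.0, 0.0, 0.0 };
        for (int i = 0; i < NSTATES; i++)
            for (int j = 0; j < NSTATES; j++)
                next[j] += dist[i] * P[i][j];
        for (int j = 0; j < NSTATES; j++) dist[j] = next[j];
        if (step % 25 == 0)
            printf("step %3d: P(not Down) = %.4f\n", step, 1.0 - dist[DOWN]);
    }
    return 0;
}
```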

22
Predicted Reliability: RPI and EPFL
23
Practical Application
  • We can use this information to guide the
    scheduler
  • Rank nodes by predicted reliability
  • High reliability = least likely to fail
  • Low reliability = most likely to fail
  • Assign most reliable nodes to largest queues
  • Assign least reliable nodes to smallest queues
    (a toy ranking sketch follows)
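
A toy sketch of the scheduling policy above: sort nodes by predicted reliability and hand the most reliable ones to the large-job queues first. The node IDs and reliability scores are invented for illustration.

```c
/* Toy illustration: rank nodes by predicted reliability, then assign the
 * most reliable half to the large-job queue and the rest to the small one. */
#include <stdio.h>
#include <stdlib.h>

typedef struct { int id; double reliability; } Node;

static int by_reliability_desc(const void *a, const void *b) {
    double ra = ((const Node *)a)->reliability;
    double rb = ((const Node *)b)->reliability;
    return (ra < rb) - (ra > rb);    /* descending order */
}

int main(void) {
    Node nodes[] = { {101, 0.92}, {102, 0.99}, {103, 0.75}, {104, 0.98} };
    int n = (int)(sizeof nodes / sizeof nodes[0]);

    qsort(nodes, n, sizeof(Node), by_reliability_desc);

    for (int i = 0; i < n; i++)
        printf("node %d (reliability %.2f) -> %s queue\n",
               nodes[i].id, nodes[i].reliability,
               i < n / 2 ? "large-job" : "small-job");
    return 0;
}
```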

24
(No Transcript)
25
Summary
  • Our apps are scaling well on balanced hardware,
    but we have a strong need to understand how
    failures impact performance, especially at
    petascale levels
  • Analysis of failure logs suggests failures follow
    a Weibull distribution.
  • Semi-Markov models are able to assess the
    reliability of nodes on the systems
  • Nodes that log a large number of RAS events
    (i.e., are noisy) are less reliable than nodes
    that log few events (i.e., < 3).
  • Grouping less noisy nodes together creates a
    partition that is much less likely to fail which
    significantly improves overall job completion
    rates and reduces the need for checkpointing.