Slide 1: Analysis of Cluster Failures on Blue Gene Supercomputers

Tom Hacker and Fabian Romero
Department of Computer and Information Technology, Discovery Cyber Center, Purdue University

Chris Carothers
Scientific Computation Research Center, Department of Computer Science, Rensselaer Polytechnic Institute
Slide 2: Outline

- Update on NSF PetaApps CFD Project
- PetaApps Project Components
- Current Scaling Results
- Challenges for Fault Tolerance
- Analysis of Clustered Failures
- EPFL 8K Blue Gene/L
- RPI 32K Blue Gene/L
- Analysis Approach
- Findings
- Summary

To appear in an upcoming issue of JPDC.
Slide 3: NSF PetaApps Parallel Adaptive CFD

- PetaApps Components
  - CFD Solver
  - Adaptivity
  - Petascale Performance Simulation
  - Fault Recovery
- Demonstration Apps
  - Cardiovascular Flow
  - Flow Control
  - Two-phase Flow

Ken Jansen (PD), Onkar Sahni, Chris Carothers, Mark S. Shephard
Scientific Computation Research Center, Rensselaer Polytechnic Institute

Acknowledgments:
- Partners: Simmetrix, Acusim, Kitware, IBM
- Funding: NSF (PetaApps, ITR, CTS), DOE (SciDAC-ITAPS, NERI), AFOSR
- Industry: IBM, Northrop Grumman, Boeing, Lockheed Martin, Motorola
- Computer Resources: TeraGrid, ANL, NERSC, RPI-CCNI
Slide 4: PHASTA Flow Solver Parallel Paradigm

- Time-accurate, stabilized FEM flow solver
- Two types of work:
  - Equation formation
    - O(40) peer-to-peer non-blocking comms
    - Overlaps comms with computation
    - Scales well on many machines
  - Implicit, iterative equation solution
    - Matrix assembled on-processor ONLY
    - Each Krylov vector is q = Ap (matrix-vector product)
    - Same peer-to-peer comm of q, PLUS orthogonalization against prior vectors
    - Requires norms -> MPI_Allreduce (see the sketch below)
- This sets up a cycle of global comms separated by a modest amount of work
  - Not currently able to overlap these comms
  - Even if work is balanced perfectly, OS jitter can imbalance it
  - Imbalance WILL show up in MPI_Allreduce
  - Scales well on machines with low noise (like Blue Gene)
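
The communication pattern above can be made concrete with a minimal MPI sketch. This is illustrative only, not PHASTA's actual code; the vector layout and function names are assumptions.

```c
#include <mpi.h>
#include <stddef.h>

/* Distributed dot product: each rank computes a partial sum over its
 * local vector entries, then MPI_Allreduce combines them. This is the
 * global synchronization point where OS jitter on any rank shows up. */
static double dist_dot(const double *x, const double *y, size_t n_loc,
                       MPI_Comm comm)
{
    double local = 0.0, global = 0.0;
    for (size_t i = 0; i < n_loc; i++)
        local += x[i] * y[i];
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
    return global;
}

/* Orthogonalize q against k prior Krylov vectors V[0..k-1] (classical
 * Gram-Schmidt): one allreduce per projection, i.e., a cycle of global
 * communications separated by modest local work. */
static void orthogonalize(double *q, double *const *V, int k,
                          size_t n_loc, MPI_Comm comm)
{
    for (int j = 0; j < k; j++) {
        double h = dist_dot(q, V[j], n_loc, comm);
        for (size_t i = 0; i < n_loc; i++)
            q[i] -= h * V[j][i];
    }
}
```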
Slide 5: Parallel Implicit Flow Solver: Incompressible Abdominal Aortic Aneurysm (AAA)
IBM BG/L at RPI-CCNI:

  Cores (avg. elems./core) | t (secs.) | Scale factor
    512 (204,800)          |  2119.7   |  1 (base)
   1024 (102,400)          |  1052.4   |  1.01
   2048 (51,200)           |   529.1   |  1.00
   4096 (25,600)           |   267.0   |  0.99
   8192 (12,800)           |   130.5   |  1.02
  16384 (6,400)            |    64.5   |  1.03
  32768 (3,200)            |    35.6   |  0.93
32K parts show modest degradation due to a 15% node imbalance (with only about 600 mesh-nodes per part), where rgn./elem. ratio_i = rgns_i / avg_rgns and node ratio_i = nodes_i / avg_nodes (Min Zhou).
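
The scale factors in the table are consistent with the usual strong-scaling definition relative to the 512-core base case (an assumption on my part, but it reproduces the table's values):

\[
s_n = \frac{n_{\mathrm{base}}\, t_{\mathrm{base}}}{n\, t_n},
\qquad \text{e.g.} \quad
s_{16384} = \frac{512 \times 2119.7}{16384 \times 64.5} \approx 1.03 .
\]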
Slide 6: Scaling of the 105M-Element AAA Case

[Figure: scaling plot; the scaling loss is due to OS jitter in MPI_Allreduce.]
Slide 7: AAA Adapted to 10^9 Elements: Scaling on Blue Gene/P
Slide 8: ROSS Massively Parallel Discrete-Event Simulation

- Local control mechanism: error detection and rollback. On a straggler event, (1) undo state changes (Δs) and (2) cancel sent events.
- Global control mechanism: compute Global Virtual Time (GVT); collect versions of state/events and perform I/O operations that are < GVT.

[Figure: two virtual-time diagrams over LP1, LP2, and LP3 illustrating local rollback and GVT computation; legend: unprocessed, processed, straggler, and committed events.]
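
A minimal sketch of Time Warp's local control, to make the two rollback steps concrete. This is generic illustrative C, not ROSS's actual API; the types and helpers are assumptions.

```c
/* Generic Time Warp local control (illustrative, not ROSS's API):
 * a straggler event, older than events already processed
 * optimistically, forces the LP to (1) undo state changes and
 * (2) send anti-messages cancelling events it generated. */

typedef struct Event Event;
struct Event {
    double ts;         /* virtual timestamp                  */
    int    processed;  /* executed optimistically?           */
    Event *next;       /* processed-event list, oldest first */
};

typedef struct {
    double now;        /* LP's local virtual time                 */
    Event *history;    /* kept until GVT passes it (fossil coll.) */
} LP;

static void restore_state(LP *lp, double ts) { lp->now = ts; }
static void send_antimessage(Event *e)       { (void)e; }

void on_event(LP *lp, Event *e)
{
    if (e->ts < lp->now) {                    /* straggler detected */
        for (Event *p = lp->history; p; p = p->next)
            if (p->processed && p->ts > e->ts) {
                send_antimessage(p);          /* (2) cancel sent events */
                p->processed = 0;             /* re-execute later       */
            }
        restore_state(lp, e->ts);             /* (1) undo state Δs      */
    }
    /* ...then execute e, advance lp->now, and save state for rollback. */
}
```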
Slide 9: PHOLD on Blue Gene/L

- PHOLD is a stress test.
- On BG/L at CCNI:
  - 1 million LPs (note: the DES model of BG/L comms had 6M LPs)
  - 10 events per LP
  - Up to 100% probability that any event is scheduled to any other LP; other events are scheduled to self
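
The PHOLD workload itself is simple; here is a sketch of its event handler under the parameters above (illustrative C, not ROSS's API; the scheduling callback is an assumption):

```c
#include <stdlib.h>
#include <math.h>

#define NUM_LPS    1048576   /* 1 million LPs, as in the BG/L runs   */
#define PCT_REMOTE 1.0       /* fraction of events sent to other LPs */

/* Exponential timestamp increment keeps new events in the future. */
static double expovariate(double mean)
{
    return -mean * log(1.0 - drand48());
}

/* Each processed event schedules exactly one successor, so the event
 * population stays fixed at NUM_LPS * events-per-LP (10 here). */
void phold_event(long self, double now,
                 void (*schedule)(long dst, double ts))
{
    long dst = (drand48() < PCT_REMOTE)
                   ? lrand48() % NUM_LPS  /* any LP, uniformly  */
                   : self;                /* or back to this LP */
    schedule(dst, now + expovariate(1.0));
}
```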
Slide 10: PHOLD on Blue Gene/P

- At 64K cores, only 16 LPs per core with 10 events per LP.
- At 128K cores, only 8 LPs per core; at this maximum parallelism, performance drops off significantly.
- Peak performance of 12.26 billion events/sec for the 10% remote case.
Slide 11: Challenges for Petascale Fault Tolerance

- Good news, with caveats:
  - Our applications are scaling well.
  - But scaling runs are relatively short (i.e., < 5 mins) and so don't experience failures.
- One early example:
  - PHASTA could only run for at most 5 hours on 32K nodes before Blue Gene/L lost at least one node and the whole program died.
- Systems are I/O bound and cannot checkpoint cheaply:
  - BG/P Intrepid has 550 TFlops of compute but only 55 to 60 GB/sec of disk I/O bandwidth using 4 MB data blocks.
  - At 10% of peak flops, it can only do 1 double of I/O per 8000 double-precision FLOPS (see the arithmetic below).
- So we need to understand how systems fail in order to build efficient fault-tolerant supercomputer systems.
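
The 1-double-per-8000-FLOPS figure follows directly from the numbers above:

\[
\frac{0.10 \times 550\ \mathrm{TFLOPS}}{(55\ \mathrm{GB/s}) \,/\, (8\ \mathrm{B/double})}
= \frac{5.5 \times 10^{13}\ \mathrm{FLOPS/s}}{6.9 \times 10^{9}\ \mathrm{doubles/s}}
\approx 8000\ \mathrm{FLOPS\ per\ double\ of\ I/O}.
\]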
Slide 12: Assessing Reliability on Petascale Systems

- Systems containing a large number of components experience a low mean time to failure.
- Usual methods of failure analysis assume:
  - Independent failure events
  - Exponentially distributed time between events (see the formulas below)
  - These assumptions are necessary for homogeneous Markov modeling and Poisson processes.
- Practical experience with systems of this scale shows that:
  - Failures are frequently cascading.
  - Analyzing time between events is difficult in reality.
  - It is difficult to put knowledge about reliability to practical use.
- A better understanding of reliability from a practical perspective would be useful to:
  - Increase the reliability of large and long-running jobs
  - Decrease checkpoint frequency
  - Squeeze more efficiency from large-scale systems
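
For reference, the standard assumption being questioned here: exponentially distributed interarrival times are memoryless, which is what makes a homogeneous (constant-rate) Markov or Poisson failure model tractable:

\[
P(T > t) = e^{-\lambda t}, \qquad
P(T > s + t \mid T > s) = P(T > t),
\]

so the hazard rate is the constant \(\lambda\) and failure counts form a Poisson process. Cascading failures violate both the independence and the constant-rate assumptions.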
Slide 13: Our Approach

- Understand the underlying statistics of failure on a large system in the field.
- Use this understanding to attempt to predict node reliability.
- Put this knowledge to work to improve job reliability.
Slide 14: Characteristics of Failure

- Failures in large-scale systems are rarely independent, singular events.
- A set of failures can arise from a single underlying cause:
  - Network problems
  - Rack-level problems
  - Software subsystem problems
- Failures are manifested as a cluster of failures:
  - Grouped spatially (e.g., in a rack)
  - Grouped temporally
Slide 15: Blue Gene

- We gathered RAS logs from two large Blue Gene systems (EPFL and RPI).
Slide 16: Blue Gene RAS Data

- Events include a level of severity: INFO (least severe), WARNING, SEVERE, ERROR, FATAL, FAILURE.
- Events include a location and a time.
- We mapped these events into a 3D space to understand what was happening over time on the system (see the sketch below):
  - Node address -> X axis
  - Time of event -> Y axis
  - Severity level -> Z axis
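
A trivial sketch of that mapping (the record fields are assumptions, not IBM's actual RAS schema):

```c
/* Map one RAS log record to a 3D point for visualization. */
typedef enum { INFO, WARNING, SEVERE, ERROR, FATAL, FAILURE } Severity;

typedef struct {
    double x;   /* node address            */
    double y;   /* event time (epoch secs) */
    double z;   /* severity level          */
} Point3;

static Point3 ras_to_point(long node_addr, double epoch_secs, Severity sev)
{
    Point3 p = { (double)node_addr, epoch_secs, (double)sev };
    return p;
}
```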
Slide 17: RPI Blue Gene Event Graph [figure]

Slide 18: EPFL Blue Gene Event Graph [figure]
Slide 19: Assessing Events

- Significant spatial and temporal clustering.
- Used cluster analysis in R to reduce the clustering.
- Time between events transformed to a Weibull distribution (see below).
- Needed a model to predict node reliability.
- In practice, nodes are in one of three states:
  - Healthy and operating normally
  - Degraded and suspect
  - Down
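
For context, the Weibull distribution generalizes the exponential with a shape parameter k (the slides do not give the fitted parameters, so only the form is shown here):

\[
P(T > t) = e^{-(t/\lambda)^k}, \qquad
h(t) = \frac{k}{\lambda}\left(\frac{t}{\lambda}\right)^{k-1},
\]

where k = 1 recovers the exponential's constant hazard, while k < 1 gives a decreasing hazard rate, the usual signature of temporally clustered failures.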
Slide 20: Node Reliability Model [figure]
Slide 21: Two Markov Models

- Continuous-time Markov model: accurate, but slow.
  - Cluster analysis takes over 10 hours.
- Discrete-time Markov model: less accurate, but much faster (see the sketch below).
  - Adjustable time step
  - Computed in minutes
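
A minimal sketch of the discrete-time variant over the three node states; the transition probabilities below are placeholders, not values fitted from the RAS logs:

```c
#include <stdio.h>

enum { HEALTHY, DEGRADED, DOWN, NSTATES };

/* Placeholder one-step transition probabilities (each row sums to 1);
 * the real values would come from fitting the RAS logs. */
static const double P[NSTATES][NSTATES] = {
    /* to: HEALTHY DEGRADED DOWN */
    {      0.990,  0.009,   0.001 },  /* from HEALTHY  */
    {      0.050,  0.900,   0.050 },  /* from DEGRADED */
    {      0.100,  0.000,   0.900 },  /* from DOWN     */
};

/* One time step: p <- p * P (row vector times matrix). The step
 * length is whatever the model's adjustable time step is set to. */
static void step(double p[NSTATES])
{
    double next[NSTATES] = { 0 };
    for (int i = 0; i < NSTATES; i++)
        for (int j = 0; j < NSTATES; j++)
            next[j] += p[i] * P[i][j];
    for (int j = 0; j < NSTATES; j++)
        p[j] = next[j];
}

int main(void)
{
    double p[NSTATES] = { 1.0, 0.0, 0.0 };  /* node starts healthy */
    for (int t = 0; t < 1000; t++)
        step(p);
    printf("P(healthy)=%.3f P(degraded)=%.3f P(down)=%.3f\n",
           p[HEALTHY], p[DEGRADED], p[DOWN]);
    return 0;
}
```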
Slide 22: Predicted Reliability, RPI and EPFL [figure]
Slide 23: Practical Application

- We can use this information to guide the scheduler (see the sketch below):
  - Rank nodes by predicted reliability.
    - High reliability: least likely to fail
    - Low reliability: most likely to fail
  - Assign the most reliable nodes to the largest queues.
  - Assign the least reliable nodes to the smallest queues.
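
A sketch of that ranking step (illustrative; the Node type and queue split are assumptions, not the actual scheduler's interface):

```c
#include <stdlib.h>

typedef struct { int id; double reliability; } Node;

/* Sort descending by reliability predicted by the Markov model. */
static int by_reliability_desc(const void *a, const void *b)
{
    double ra = ((const Node *)a)->reliability;
    double rb = ((const Node *)b)->reliability;
    return (ra < rb) - (ra > rb);
}

void assign_nodes(Node *nodes, size_t n, size_t big_queue_size)
{
    qsort(nodes, n, sizeof *nodes, by_reliability_desc);
    /* nodes[0 .. big_queue_size-1] -> largest-job queues;
     * the remainder -> smaller queues. */
    (void)big_queue_size;
}
```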
Slide 25: Summary

- Our apps are scaling well on balanced hardware, but we have a strong need to understand how failures impact performance, especially at petascale levels.
- Analysis of failure logs suggests failures follow a Weibull distribution.
- Semi-Markov models are able to assess the reliability of nodes on these systems.
- Nodes that log a large number of RAS events (i.e., are noisy) are less reliable than nodes that log few events (i.e., < 3).
- Grouping less noisy nodes together creates a partition that is much less likely to fail, which significantly improves overall job completion rates and reduces the need for checkpointing.