Transcript and Presenter's Notes

Title: NAS Past and Future - The Road to TFLOPs


1
NAS Past and Future - The Road to TFLOPs
  • DOE Workshop
  • Architectures for Ultrascale Science
  • Charleston, SC
  • April 13-14, 2004

2
NASA Advanced Engineering, Modeling and
Simulation Facility
Lomax - 512 Processor SGI Origin 3800
Chapman - 1,024 Processor SGI Origin 3800
High End Computation
Digital Flight for CEV Development: Trajectory and mission simulation
Data Storage, Data Archiving, Data Mining
High End Data Analysis and Visualization
  • Computing Requirements
  • 20,000 hrs of 1024 Chapman equivalent (830 days
    for the first 10 minutes of flight)
  • Need 100X improvement in capability for three-day
    turn around
  • Need research on scaling and advanced mission
    computing strategies

Kalpana - 512 Processor SGI Altix
Tomorrow
Today
Advanced Networks
Impact of Large Scale Capability
Goal for Risk Assessment: Launch-to-orbit simulation
  • X37 - Drone Prototype
  • Designed to be dropped out of Shuttle for
    re-entry
  • Cannot be physically modeled (e.g. wind tunnels
    are out)
  • Run on Dedicated 512p systems
  • Capability allowed NASA to discover design flaw
    and initiate redesign (thermal overload >5000° on
    wing leading edge - a catastrophic failure
    condition)
  • Computing Requirements
  • 6,000 hrs of 1024 Chapman equivalent (250 days
    for the first 10 minutes of flight; the arithmetic
    is sketched after this list)
  • Need 100X improvement in capability for three-day
    turn around
  • Need research on scaling and advanced mission
    computing strategies
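The day counts quoted in both requirement lists follow from the dedicated-hour figures, assuming round-the-clock use of the full 1,024-processor Chapman (an assumption; the slide does not state it explicitly):

    \frac{20{,}000\ \text{hr}}{24\ \text{hr/day}} \approx 833\ \text{days},
    \qquad
    \frac{6{,}000\ \text{hr}}{24\ \text{hr/day}} = 250\ \text{days}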

3
Supercomputing Support of Columbia Accident
Investigation Board (CAIB)
Ames marshalled supercomputing-related R&D assets
and tools to produce time-critical analysis for
CAIB
R&D and Operations teams joined to form
integrated, interdisciplinary team to support
CAIB in evaluation of launch and re-entry events
Bringing the investigation full circle, the team
utilized advanced data analysis and visualization
technologies to explore and understand the vast
amount of data produced by this massive
simulation effort
R&D Team (Grand Challenge Driven): Ames Research Center, Glenn Research Center, Langley Research Center
Engineering Team (Operations Driven): Johnson Space Center, Marshall Space Flight Center, Industry Partners
The team used state-of-the-art codes to meet the
challenge of modeling a free object in the
flow around an airframe, with automatic re-gridding
at each time step
Grand Challenges
Next Generation Codes & Algorithms
OVERFLOW - Honorable Mention, NASA Software of the Year - STS-107
Supercomputers, Storage, Networks
Rapid, high-fidelity computational
investigations may involve hundreds or thousands
of large simulations and results datasets, but
automated computation and data management tools
are at the ready
INS3D - NASA Software of the Year - Turbopump Analysis
CART3D - NASA Software of the Year - STS-107
Modeling Environment (experts and tools) -
Compilers - Scaling and Porting -
Parallelization Tools
These codes have been honed by R&D experts in
large-scale simulation, with their environment of
programmer tools developed to minimize the effort
of executing efficiently
The codes and tools were tuned to exploit
advanced hardware (supercomputers, networks, and
storage) that was developed for just such
large, tightly-coupled computations, enabling
rapid turnaround
4
ECCO (Consortium for Estimating the Circulation
and Climate of the Ocean)
R&D Team (Grand Challenge Driven): Ames Research Center, Goddard Space Flight Center, Jet Propulsion Laboratory, MIT, Scripps Institute
Engineering Team (Operations Driven): Ames Research Center, Goddard Space Flight Center, Jet Propulsion Laboratory, MIT, Scripps Institute
R&D and Operations teams joined to form
integrated, interdisciplinary team to support ECCO
Bringing the investigation full circle, the team
utilized advanced data analysis and visualization
technologies to explore and understand the vast
amount of data produced by this massive
simulation effort
The team used state-of-the-art codes to meet the
Modeling and Simulation challenges.
Grand Challenges
Rapid Deployment of Resources - Networks
(2x to 3x faster) - Specialized Archive
for International Access.
Next Generation Codes & Algorithms
  • Team of scientists from JPL, MIT and Scripps
    coupled with computer science and graphics
    support from ARC with initiation and support from
    HQ, Code Y
  • 3 decadal simulations completed
  • Model formulation changed to the cubed sphere from
    a Mercator projection, enabling simulation of ocean
    circulation at the poles

Supercomputers, Storage, Networks
Rapid, high-fidelity computational
investigations may involve hundreds or thousands
of large simulations and results datasets, but
automated computation and data management tools
are at the ready
Modeling Environment (experts and tools) -
Compilers - Scaling and Porting -
Parallelization Tools
These codes have been honed by R&D experts in
large-scale simulation, with their environment of
programmer tools developed to minimize the effort
of executing efficiently
The codes and tools were tuned to exploit
advanced hardware (supercomputers, networks, and
storage) that was developed for just such
large, tightly-coupled computations, enabling
rapid turnaround
5
Past, Present, and Future Requirements
6
HEC is an Insurance Policy for STS, ISS, and OSP
7
Mission Critical Applications: Usage of High End
Computational Resources
[Chart: Available cycles/month (in thousands), October 2001 through May 1, 2004, split into testbed and capacity usage, annotated with X37, Fuel Line Crack, STS-107 Accident Investigation, and RTF/ITA/NESC work]
8
NASA Aerospace Missions Driving HEC
CEV/NGLT Design
Emergency Response
Shuttle Stewardship
Digital Astronaut
  • Each of these missions requires
  • Advanced Simulation Tools - Efficient codes -
    Advanced physical modeling
  • Ensemble Analysis - Data analysis/vis/understanding
    - Virtual/digital flights or test cells
  • Science/engineering environments - Access to
    experimental and observation data - Systems analysis

9
Power and Propulsion Simulation for Space
Exploration
Scenario: Power and propulsion technology is
critical for ETO, orbital transfer, and reaction
control during the mission. Propulsion performance
and power requirement data are generated for
mission simulation as well as risk assessment for
the crew due to fuel leaks.
  • Modeling and Simulation Capabilities
  • Performance and reliability prediction of widely
    varying engines, such as main engine, orbital
    transfer engine and attitude control engine
  • Prediction of impact of fuel/propellant leak on
    crew and environment
  • High-accuracy prediction of engine sub-systems
    performance and reliability
  • Data on fuel, i.e., reaction rate, Isp, stability
    in handling, toxicity
  • Systems analysis capability for trade study in
    designing components
  • Transient thermal load on structure during
start-up (during the initial 10 seconds)
  • Accurate prediction of systems vibration due to
    subsystem and plume
  • Prediction of chemical species from engine
    exhaust and base of multiple engines
  • Simulation of flight in rarefied gas environment
    during orbital maneuver

EXAMPLE OF ORBITAL TRANSFER AND ATTITUDE CONTROL
ENGINE (Currently Shuttle operates with 44
reaction control engines)
EXAMPLE OF A ROCKET ENGINE (SSME)
10
High-Fidelity Turbopump Simulation
  • Joint effort with NASA/MSFC and
    Boeing/Rocketdyne
  • Developed high-fidelity unsteady simulation
    capability using the impeller-diffuser of a
    wide-range engine turbopump (water test conditions
    provided by Boeing/Rocketdyne)
  • This procedure is being applied to the SSME Low
    Pressure Oxidizer Turbopump (LPOTP)
  • This software is applied to an independent
    analysis of the flowliner crack (pressure contours
    just upstream of the inducer show a low-pressure
    backflow region rotating with the inducer)

Transient flow simulation during start-up - Wide-range turbopump
CFD simulation of fuel line including inducer
POC: C. Kiris, D. Kwak, W. Chan
11
SSME LH2 Flowliner Crack Analysis
  • Unsteady CFD procedure enabled inducer and
    flowliner simulation
  • Root cause of the SSME LH2 flowliner crack is
    being investigated via high-fidelity unsteady
    flow simulation by ARC and MSFC
  • Extent of backflow varies with inducer design.
    The backflow also generates pre-swirl in the flow
    approaching the inducer.
  • CFD simulations indicate the presence of
    high-frequency pressure oscillations upstream of
    inducer where crack occurs between slots.
  • To date over 2M Chapman hours have been used

Cracked Flowliners
Inducer
Numerical simulation of helical backflow causing
vibration in the flowliner region
12
Simulation of Blood Circulation (For Human Space
Flights)
Blood circulation will have major impacts on the crew
during and after the mission. The Circle of Willis is
simulated as a building-block study toward
circulatory-system simulation
  • Circle of Willis in Brain
  • Image segmentation from MR Angiography
  • (MRA Courtesy of Prof. Tim David, Univ. of
    Canterbury, NZ)

A snapshot of INS3D Results superimposed on MRA
  • The carotid and vertebrobasilar arteries form a
    circle of communicating arteries at the base of the
    brain.
  • Simulation of blood circulation through the CoW is
    of current interest: even when one of the main
    arteries is occluded, blood is supplied to the small
    arteries through the other arteries. An
    auto-regulation model is included.

ECA (External Carotid Arteries): supply blood to the face and scalp
ICA (Internal Carotid Arteries): supply blood to the anterior 2/3 of the cerebrum
POC: S.C. Kim, C. Kiris, D. Kwak, T. David
13
NAS Parallel Processors / NAS Facility HEC
Chapman - 1,024 Processor SGI Origin 3800
Lomax - 512 Processor SGI Origin 3800
Kalpana - 512 Processor SGI Altix System
14
HEC Transformation Proposal
SGI Altix Systems (IA64 - 6 GFLOP/CPU): Peak 6 TFLOP/s (1,024 CPUs)
Cray X1 Systems (Custom - 12 GFLOP/MSP): Peak 6 TFLOP/s (512 MSPs)
Developing tomorrow's enabling technologies.
15
ECCO Code Performance 11/04/03
The ECCO code is a well-known ocean circulation
model, with features that allow it to run in a
coupled mode where land, ice, and atmospheric
models are run to provide a complete earth system
modeling capability. In addition, the code can run
in a data assimilation mode that allows
observational data to be used to improve the
quality of the various physical sub-models in the
calculation. The chart below shows the current
performance on the Altix and other platforms for
a 1/4 degree resolution global ocean circulation
problem. (In reality, much of the calculation runs
at an effectively much higher resolution due to
grid shrinkage at the poles.)
Note: Virtually no changes to the code have been
made across platforms; only changes needed to
make it functional have been done. The preliminary
Altix results are very good to date. A number of
code modifications have been identified that will
significantly improve on this performance number.
NOTE: The performance on both Chapman and Altix
with full I/O is super-linear. That is, as you add
more CPUs you get even faster speedups. The Alpha
numbers show a knee at 256 CPUs.
UPDATE: ECCO NOW RUNNING 6 YRS/DAY ON 480 CPUS (03/04)
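For reference, the super-linear behavior described above can be stated in terms of the chart's rate metric (notation assumed here, not from the slide): with R(p) the years simulated per day on p CPUs, scaling is super-linear when

    \frac{R(p)}{R(p_0)} > \frac{p}{p_0} \quad \text{for } p > p_0,

i.e., doubling the CPU count more than doubles the throughput.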
[Chart: Performance 10/04 - years simulated per day of compute vs. CPU count]
NOTE: Alpha data re-plotted from Gerhard Theurich's charts in NCCS paper
16
POP 1.4.3 - 0.1 Degree North Atlantic Problem
The second POP scenario was a run of the 0.1
degree North Atlantic simulation as defined by
LANL last year. The grid for this problem is
992x1280x40 (51M points). As stated before, no
significant code changes were made. The results
are presented below. Note that this simulation
contains about 10x more points than the 1 degree
problem above and requires about 6x more time
steps per day. Thus, the work is about 60x more,
yet the run performance is only about 17x slower
on 256 CPUs. The turnover in both the 1.0 and 0.1
degree problems is due to two effects: 1) scaling
in the barotropic computation, and 2) extra useless
work engendered by the extensive use of F90 array
syntax notation in the code.
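As an illustration of the second effect, here is a minimal sketch (hypothetical arrays, not POP source) using the 992x1280 horizontal grid quoted above: whole-array F90 syntax updates every point of a padded array, halo points included, while explicit loops confine the same update to the active interior.

      program array_syntax_cost
        implicit none
        integer, parameter :: nx = 992, ny = 1280    ! horizontal grid size quoted above
        real, parameter   :: dt = 1.0
        real, allocatable :: t(:,:), f(:,:)          ! padded with a 1-point halo
        integer :: i, j

        allocate(t(0:nx+1,0:ny+1), f(0:nx+1,0:ny+1))
        t = 0.0
        f = 1.0

        ! F90 array syntax: touches all (nx+2)*(ny+2) points,
        ! including halo rows/columns (and any masked points in a real model).
        t = t + dt*f

        ! Explicit loops: the same update confined to the nx*ny interior.
        do j = 1, ny
          do i = 1, nx
            t(i,j) = t(i,j) + dt*f(i,j)
          end do
        end do

        print *, 'interior mean after both updates:', sum(t(1:nx,1:ny))/(nx*ny)
      end program array_syntax_cost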
NOTE: POP graphics courtesy of Bob Malone, LANL
17
CCSM 2.0 Code Performance - 1000 year simulation
CCSM was used last year by NCAR to conduct a
1000 year global simulation using T42 resolution
for the atmosphere and 1 degree resolution for
the ocean. The simulation required 200 days of
compute time to complete. The Altix code at this
point has been partially optimized using MLP for
all inter-model communications. Some sub-models
have been optimized further. About 4 man-months
have been devoted to the project.
Compute time for the 1000-year simulation:
  MLP Altix 1.5 GHz - 53 days (192 CPUs)
  MLP O3K 0.6 GHz - 73 days (256 CPUs)
  MPI IBM Pwr3 - 200 days
  MPI SGI O3K - 318 days
18
OVERFLOW-MLP Performance
The following chart displays the performance of
OVERFLOW-MLP for Altix and Origin systems.
OVERFLOW-MLP is a hybrid multi-level parallel
code using OpenMP for loop-level parallelism and
MLP (a faster alternative to MPI) for the
coarse-grained parallelism. NOTE: This code is 99%
VECTOR per Cray. The performance below translates
into a problem run time of 0.9 seconds per step
on the 256p Altix.
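A minimal sketch of the loop-level OpenMP layer of such a hybrid scheme (the MLP coarse-grained layer and the actual OVERFLOW data structures are not shown; the array names here are hypothetical):

      program omp_loop_level
        use omp_lib
        implicit none
        integer, parameter :: n = 1000000
        real, parameter   :: dt = 1.0e-3
        real, allocatable :: q(:), rhs(:)
        integer :: i

        allocate(q(n), rhs(n))
        q = 1.0
        rhs = 2.0

        ! Fine-grained parallelism: OpenMP splits this loop across the
        ! CPUs assigned to one coarse-grained process.
!$omp parallel do
        do i = 1, n
          q(i) = q(i) + dt*rhs(i)
        end do
!$omp end parallel do

        print *, 'threads available:', omp_get_max_threads(), '  q(1) =', q(1)
      end program omp_loop_level

Built with an OpenMP flag (e.g. gfortran -fopenmp), the loop iterations are divided among threads; MLP then supplies the coarse-grained layer across processes on these shared-memory SGI systems in place of MPI.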
[Chart: 35 Million Point Airplane Problem - GFLOP/s versus CPU count]
19
The CART3D Code - OpenMP Test
The CART3D code was the NASA Software of the
Year winner for 2003. It is routinely used for a
number of CFD problems within the agency. Its
most recent application was to assist in the foam
impact analysis done for the STS-107 accident
investigation. The chart to the right presents
the results of executing the OpenMP-based CART3D
production CFD code on various problems across
differing CPU counts on the NAS Altix and O3K
systems. As can be seen, the scaling to 500 CPUs
on the weeks-old Altix 512-CPU system is
excellent.
20
NAS HSP3 Compute Server Suite Performance
The charts below present the relative performance
(O3K 600 MHz = 1) across 4 platforms for the NAS HSP3
Compute Server Suite. This selection of codes was
used historically as a benchmark suite for the
HSP3 procurement (C90) at NAS.
[Charts: HSP3 Compute Server Suite Performance - relative performance by code across the 4 platforms]
21
The NAS Parallel Benchmarks (NPBs) V2.1
The chart below presents the results of several
executions of the NAS Parallel Benchmarks (NPBs
2.1) on Origin 3000 and Altix 1.3/1.5 GHz
Systems. The NPBs are a collection of codes and
code segments used throughout industry to
comparatively rate the performance of alternative
HPC systems.
[Chart: NPB performance relative to the O3K 600 MHz - ratio to O3K by benchmark]
22
So What Do We Need?
  • Previous discussion shows need for 10-100x
    improvements in performance
  • Current: 5 TFLOP/s peak - thus we need at
    least 50 TFLOP/s peak ASAP
  • Particularly need 10x on individual
    applications (capability)
  • Can we get there with existing architectures?
  • Likely yes at the 10-100x level over next few
    years
  • Can we get 100 TFLOP/s sustained in 5 years?
  • Not likely, given the code/hardware mismatch
  • What is the solution?
  • A high/low mix of mainstream and exotic offerings
    (capacity versus capability architectures)

23
100 TFLOP/s Sustained Capacity is Not Hard
(Just very, very expensive)
First note that NASA, in aero and space sciences,
is not even on the map for a 100 TFLOP/s sustained
need in capacity or capability. Constraints in
budgets may have stunted the growth in code
capabilities and the specification of requirements.
The class of problems is certainly big enough;
Earth sciences has stated much larger compute
requirements - so large, in fact, that they may
never be met. Even a 100 TFLOP/s peak system would
be an outstanding upgrade for NASA. So how do we
get to 100 TFLOP/s peak? Simple: we get there with
16,000 clustered CPUs at 6 GFLOP/s each. Off the
shelf: $40-80M now, $15-30M in 1-2 years, depending
on CPU/interconnect. This solution, however, is not
a very robust capability engine for NASA.
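Checking the cluster arithmetic behind that claim:

    16{,}000\ \text{CPUs} \times 6\ \text{GFLOP/s per CPU} = 96\ \text{TFLOP/s} \approx 100\ \text{TFLOP/s peak}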
24
100 TFLOP/s Sustained Capability is Hard
Capability is critical to many NASA missions. It
often arises out of timelines driven by launch
windows, or unexpected real-time emergencies.
Currently there is no simple solution to
constructing even a 10 TFLOP/s sustained system
that supports many of the NASA mission-critical
codes. Issues in the way:
  • Codes and algorithms - even the best of today
    are latency driven in the end
  • Mainstream hardware communication latencies are
    not keeping up
  • 10 TFLOP/s sustained for NASA is a very tough
    problem
  • We do not generally grow the problem to ease the
    latency burden - often the resolution is
    sufficient, and growing it is irrelevant or
    counterproductive
ASSERTION: 10 TFLOP/s sustained is simply not
doable using current mainstream roadmaps
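One way to make the latency argument concrete (a generic strong-scaling estimate, not taken from the slide): for a fixed-size problem with total work W, per-CPU rate F, and a per-step communication cost dominated by latency \ell over m messages,

    T(p) \approx \frac{W}{p\,F} + \ell\, m .

The compute term shrinks as CPUs are added but the latency term does not, so sustained TFLOP/s saturates unless the problem is grown - which, as noted above, the resolution requirements often do not justify.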
25
So How Do We Get to 10 TFLOP/s Sustained
Capability for NASA
Hire Seymour Cray. There has never been an
equivalent since. It appears that 10-100 TFLOP/s
sustained is moving into the realm of custom or
semi-custom design. A number of exotic
architectures are showing some promise for limited
application areas. Comments on current
possibilities:
  • FPGA - not likely an option in our lifetime, for
    a myriad of reasons: memory access issues greatly
    reduce performance, and the software rewrite is
    beyond a nightmare in difficulty
  • Blue Gene clones/derivatives - maybe good for
    some apps; interconnects with nearest/butterfly
    connections show promise; the class of codes is
    very limited - NASA earth sciences, maybe
  • More exotic - it's in the interconnect, it's in
    the memory access; the interconnects of today do
    not address many app issues; custom connects with
    COTS CPUs can work - vendors just don't get it
26
100 TFLOP/s - Conclusions/Summary
NASA is in desperate need of just 5-10 TFLOP/s
sustained on its current workload, ASAP. 100
TFLOP/s sustained is not even on most NASA
enterprise road maps. The vast majority of NASA's
NAS facility cycles are CFD of some flavor, or use
algorithms that map closely to CFD communication
needs. These are the toughest problems to scale on
current mainstream vendor offerings. All NASA apps
could take advantage of a better interconnect
strategy than is available from the vendors; no
vendor seems to fully understand the need. FPGAs
or larger versions of existing systems will not
scale current problems to significantly increased
performance. Hybrids of standard CPU parts and
semi-custom interconnects could work; once again,
it needs an infusion of cash to get vendors
interested, and vendor recognition of the problem.
Within its budgetary constraints, NAS will continue
to work with vendors to investigate evolutionary
mods to architectures that show promise.
27
Supplementary comment: FLOP/s are the Wrong Metric
The modern RISC CPU can do hundreds of floating
point operations in the time it takes to do one
cache line fetch. FLOPs are free. This is the
inverse of the old-style Cray C90 architectures,
where FLOPs cost and memory transactions were
essentially free. Any measurement using FLOP count
can be highly deceiving. The example below shows a
simple case of FLOP packing that triples the net
FLOP rate with no difference in the resulting
answers; the measured FLOP rate is simply higher.
So how do we prevent this FLOP packing in the
field? Given a problem:

      DO I = 1, 100000
        X(I) = X(I) + Y(I)
      ENDDO

Assume this performance = 1 FLOP/s. Then on any
modern CPU we add extraneous, nonsensical work:

      DO I = 1, 100000
        Z = Z + 1.0
        R = R + 1.0
        X(I) = X(I) + Y(I)
      ENDDO

You will find the measured performance = 3. (This
is a simple illustration - you will have to be more
clever to keep the optimizing compiler from
recognizing the useless work and removing it.)