Title: Blue Waters and the Future of High Performance Computing
Marc Snir
Outline
- What a supercomputer looks like beyond 2010
- Cannot go into the details of the Blue Waters configuration
- What are the main challenges to making good use of such systems
Large NSF-Funded Supercomputers Beyond 2010
- One petascale platform -- Blue Waters at NCSA, U. Illinois
  - Sustained performance: petaflop range
  - Memory: petabyte range
  - Disk: 10s of petabytes
  - Archival storage: exabyte range
  - Power: megawatts
  - Price: $200M (not including building, operation, application development)
- Multiple 1/4-scale platforms at various universities
- Available to NSF-funded grand challenge teams on a competitive basis
The Uniprocessor Crisis
- Manufacturers cannot increase clock rate anymore (power problem)
- Computer architects have run out of productive ideas on how to use more transistors to increase single-thread performance
  - Diminishing returns on caches
  - Diminishing returns on instruction-level parallelism
- Increased processor performance will come only from the increase in the number of cores per chip
- Petascale: 250K -- 1M threads
- Need algorithms with massive levels of parallelism
Average Processors per Top 500 System
[Figure: chart only]
Mileage is Less than Advertised
[Figure: nominal IPC vs. achieved instructions per cycle, frequent item mining (M. Wei)]
It's the Memory, Stupid
[Figure: PC balance (word operands from memory per flop), seemingly stuck at a 1:10 ratio (source: McCalpin)]
The Memory Wall and Palliatives
- The problem
  - Memory bandwidth is limited (cost)
  - Queue of pending loads has limited depth (performance)
  - Compilers cannot issue enough concurrent loads to fill the memory pipe
  - Compilers cannot issue loads early enough to avoid stalls
- Solutions
  - Multicore and vector operations -- to fill the pipe (see the sketch below)
  - Simultaneous multithreading -- to tolerate latency
- Need even higher levels of parallelism!
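A minimal C sketch of the "fill the pipe" palliatives, assuming OpenMP 4.0 (the function name and flags are illustrative, not from the talk): the directives expose many independent iterations -- threads across cores, SIMD lanes within a core -- so many loads can be in flight at once. Compile with e.g. -O3 -fopenmp.

    #include <stddef.h>

    /* Dot product with thread- and vector-level parallelism: each core
       runs a chunk of the loop, and each chunk is vectorized, so the
       memory pipe sees many concurrent independent loads. */
    double dot(const double *restrict x, const double *restrict y, size_t n) {
        double s = 0.0;
        #pragma omp parallel for simd reduction(+:s)
        for (size_t i = 0; i < n; i++)
            s += x[i] * y[i];
        return s;
    }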
Solutions to the Memory Wall
- Caching and locality
  - Need algorithms with good locality
- Split communication
  - Memory prefetch (local memory)
  - Put/get (remote memory)
  - Need programmed communication to local and remote memory (not necessarily message passing); see the prefetch sketch after this list
- N.B. Compute power is essentially free; you pay for storing and moving data
  - Peak FLOPs are a silly measure of performance
  - A computer that achieves a high fraction of its peak flop rate is ill-designed
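A minimal sketch of programmed communication to local memory, assuming GCC/Clang's __builtin_prefetch and a hypothetical prefetch distance PF_DIST: the load of a future, irregularly indexed element is requested early, so its latency overlaps the work on the current element.

    #include <stddef.h>

    #define PF_DIST 16  /* tuning assumption: how far ahead to prefetch */

    double sum_indirect(const double *a, const size_t *idx, size_t n) {
        double s = 0.0;
        for (size_t i = 0; i < n; i++) {
            if (i + PF_DIST < n)
                /* issue the request early; hardware cannot predict a[idx[...]] */
                __builtin_prefetch(&a[idx[i + PF_DIST]], 0, 1);
            s += a[idx[i]];
        }
        return s;
    }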
Global Communication
- Under software control
  - Remote loads too expensive
  - Global coherence too expensive
- Software means user/library now (MPI); it can mean compiler/hw in 201x
  - But the programmer has to manage locality
- Probably moving from 2-sided communication to 1-sided communication (put/get), as sketched below
- May have hw accelerators for global operations (e.g., global barriers)
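A minimal sketch of 1-sided communication using the standard MPI-2 one-sided interface (not anything Blue Waters-specific): each rank puts a value into its right neighbor's exposed window, and the target never posts a matching receive.

    #include <mpi.h>

    int main(int argc, char **argv) {
        int rank, size;
        double local, incoming = -1.0;
        MPI_Win win;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        local = (double)rank;

        /* Expose 'incoming' for remote put/get. */
        MPI_Win_create(&incoming, sizeof(double), sizeof(double),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        MPI_Win_fence(0, win);
        MPI_Put(&local, 1, MPI_DOUBLE, (rank + 1) % size,
                0, 1, MPI_DOUBLE, win);
        MPI_Win_fence(0, win);  /* fence completes all pending puts */

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }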
I/O
- Parallel file system
  - Optimized for the case where 10K-100K processes share 1 file (see the sketch below)
  - Unfortunately, users often open multiple files per process
- File system is logically shared, physically distributed
- May need 2 parity disks
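A minimal sketch of the shared-file pattern using standard MPI-IO collective writes (file name and block size are illustrative): all ranks open one file, and each writes its block at a disjoint offset, which is the access pattern the parallel file system is optimized for.

    #include <mpi.h>

    int main(int argc, char **argv) {
        int rank;
        double buf[1024];
        MPI_File fh;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        for (int i = 0; i < 1024; i++) buf[i] = (double)rank;

        /* One shared file; the collective call lets the I/O layer
           coalesce the many per-rank requests. */
        MPI_File_open(MPI_COMM_WORLD, "shared.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
        MPI_File_write_at_all(fh, (MPI_Offset)rank * sizeof(buf),
                              buf, 1024, MPI_DOUBLE, MPI_STATUS_IGNORE);
        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }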
Supercomputer vs. Cluster -- is it Merely a Matter of Size?
- All systems use commodity processors
- Size matters: need denser packaging and higher-quality components
  - Supercomputers use more expensive server technology and more advanced cooling
- Need a more scalable switch, with higher bw and lower latency
  - Proprietary interface, vs. NIC on I/O bus
- Need a parallel file system
  - Cluster file systems still have a way to go
- Need very robust, out-of-band system control infrastructure
Do we Need Petascale Systems?
- Yes, every self-respecting science and engineering discipline has a roadmap explaining why it needs petascale performance, and beyond
  - Two buzzwords: multiphysics, multiscale
- Do we need this performance now (rather than in 2020, when it will be much cheaper)?
  - Yes, many simulations have high potential societal impact (health, energy, global warming)
  - Note: since programming for petascale performance will not be significantly easier in 2020, scientists (who do not pay for compute time) have no incentive to wait
Do We Have Applications with Sufficient Parallelism?
- Probably -- we have analyzed plausible applications in detail as part of the Track 1 competition
  - But simple benchmark applications are not the same as complete applications of scientific interest
- Solve a larger problem
  - Easy, but not always needed
- Increase resolution (finer mesh)
  - Parallelism increases (by k³)
  - Number of iterations also increases (by k); e.g., refining the mesh by k = 2 gives 8x as many cells and roughly 2x as many time steps
  - May be limited by the accuracy of initial conditions
- Increase complexity of the simulation
  - Hard
Will Codes be Ready in 201x?
- Likely; there are sufficiently many research groups who want to be ready
- Main obstacle: NSF underestimates the cost of application development
- Will programming be any easier than it is now?
  - No. High Performance Computing is about performance programming; this is hard, even on uniprocessors
  - The problem is inherently hard, and there are no good tools for performance tuning
  - The market for HPC software is too small
    - Practically no independent HPC software companies
Are we Making Progress on Software Productivity for HPC?
- Not much; there is significant focus on new languages for HPC -- most likely misplaced
  - Not clear that parallelism requires new high-level constructs: frameworks built with existing OO languages do hide parallelism
  - Not clear that the small HPC market can justify unique languages
  - New languages are needed for performance, not for raising the level of abstraction
- Good tools for debugging and performance tuning are missing
  - A screen has only one pixel per thread
Will Blue Waters Stay Up Long Enough to Complete Long Computations?
- Of course not -- MTBF is measured in days, if we are lucky
- Any petascale application must do periodic checkpointing and have restart code (a minimal sketch follows this list)
  - Also needed for splitting long computations into multiple submissions, possibly on different machines, and for checking computation evolution
  - User checkpoint, not system checkpoint
- The optimal checkpoint interval depends on system MTBF, checkpoint overhead, and recovery overhead
- Low MTBF means low machine utilization, but is not a disaster (assuming that the file system is reliable)
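A minimal serial C sketch of user-level checkpoint/restart (file name, state layout, and helper names are illustrative assumptions; a petascale code would write through the parallel file system, one block or file per rank):

    #include <stdio.h>

    #define N 1000000  /* illustrative state size */

    /* Save the solver state the application needs to resume. */
    static void checkpoint(const double *u, long step) {
        FILE *f = fopen("ckpt.tmp", "wb");
        if (!f) return;
        fwrite(&step, sizeof step, 1, f);
        fwrite(u, sizeof *u, N, f);
        fclose(f);
        rename("ckpt.tmp", "ckpt.dat");  /* atomic on POSIX: no torn checkpoint */
    }

    /* Return the step to resume from (0 if no checkpoint exists). */
    static long restart(double *u) {
        long step = 0;
        FILE *f = fopen("ckpt.dat", "rb");
        if (f) {
            if (fread(&step, sizeof step, 1, f) != 1 ||
                fread(u, sizeof *u, N, f) != (size_t)N)
                step = 0;
            fclose(f);
        }
        return step;
    }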
System Utilization as a Function of MTBF
[Figure: system utilization (%) vs. MTBF (hours), annotated with the optimal checkpoint interval]
- Assumes a 5 min checkpoint, 15 min recovery, and optimal checkpoint frequency; a sketch of the computation follows
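A minimal sketch of that computation, assuming Young's first-order approximation tau_opt = sqrt(2 * C * MTBF) and the simple waste model utilization = 1 - C/tau - (R + tau/2)/MTBF, with the slide's 5 min checkpoint (C) and 15 min recovery (R):

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        const double C = 5.0 / 60.0;   /* checkpoint overhead, hours */
        const double R = 15.0 / 60.0;  /* recovery overhead, hours */
        printf("MTBF(h)  tau_opt(h)  utilization\n");
        for (double mtbf = 1.0; mtbf <= 256.0; mtbf *= 2.0) {
            double tau = sqrt(2.0 * C * mtbf);  /* optimal interval */
            double util = 1.0 - C / tau - (R + tau / 2.0) / mtbf;
            printf("%7.0f %11.2f %12.2f\n", mtbf, tau, util);
        }
        return 0;
    }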
Now that any Processor has Multiple Cores, is High Performance Computing Getting out of the Ghetto?
- No. By definition, a supercomputer is at the bleeding edge
- Different concerns, different scales, different communities
  - 1-100-way parallelism vs. 100K-1M-way parallelism
  - Tightly coupled shared memory vs. distributed memory
  - $100B industry vs. $1B industry
  - Good sw (e.g., Windows) vs. lousy sw
  - Programming for the masses vs. programming for the elite
  - Expert environments vs. Joe Shmoe environments
  - I/O- and network-intensive vs. compute-intensive
  - Reactive vs. transformational
Summary
- Fitzgerald: "The very rich are different from you and me." Hemingway: "Yeah, they have more money."
- Supercomputing is different from normal computing: they have bigger machines
- P.S. The famous Scott-Ernest dialogue apparently never took place. Oh well.