1. Ultra-Efficient Scientific Computing: More Science, Less Power
John Shalf, Leonid Oliker, Michael Wehner, Kathy Yelick
RAMP Retreat, January 16, 2008
2. End of Dennard Scaling
- New constraints:
  - Power limits clock rates
  - Cannot squeeze more performance from ILP (complex cores) either!
- But Moore's Law continues!
  - What to do with all of those transistors if everything else is flat-lining?
  - Now, cores per chip double every 18 months instead of clock frequency!
- No more free lunch for performance improvement!
Figure courtesy of Kunle Olukotun, Lance Hammond, Herb Sutter, and Burton Smith
3. ORNL Computing Power and Cooling, 2006-2011
- Immediate need to add 8 MW to prepare for 2007 installs of new systems
- NLCF petascale system could require an additional 10 MW by 2008
- Need a total of 40-50 MW for projected systems by 2011
- Numbers are just for the computers; add 75% for cooling
- Cooling will require 12,000-15,000 tons of chiller capacity
YIKES!
(Chart: projected annual power costs rising through $3M, $9M, $17M, $23M, and $31M)
Cost estimates based on $0.05 per kWh
Data taken from Energy Management System-4
(EMS4). EMS4 is the DOE corporate system for
collecting energy information from the sites.
EMS4 is a web-based system that collects energy
consumption and cost information for all energy
sources used at each DOE site. Information is
entered into EMS4 by the site and reviewed at
Headquarters for accuracy.
OAK RIDGE NATIONAL LABORATORY U. S. DEPARTMENT OF
ENERGY
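The chart's cost figures follow directly from the quoted assumptions. A minimal back-of-envelope sketch, using the $0.05/kWh rate and the 75% cooling overhead from the slide (whether the chart's dollar figures include the cooling overhead is our assumption):

```python
# Back-of-envelope check of the ORNL cost figures: annual electricity cost
# for a given IT power draw at $0.05 per kWh, plus the 75% cooling overhead
# quoted on the slide.

HOURS_PER_YEAR = 24 * 365  # 8760

def annual_power_cost(it_load_mw, rate_per_kwh=0.05, cooling_overhead=0.75):
    """Annual cost in dollars for an IT load (MW) including cooling."""
    total_kw = it_load_mw * 1000 * (1 + cooling_overhead)
    return total_kw * HOURS_PER_YEAR * rate_per_kwh

# The 8 MW 2007 addition and the projected 40-50 MW load by 2011:
for mw in (8, 18, 40, 50):
    print(f"{mw:>2} MW IT load -> ${annual_power_cost(mw) / 1e6:.1f}M per year")
```

At 40 MW this reproduces roughly the $31M top of the chart, which suggests the chart's figures do include cooling.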
4. Top500 Estimated Power Requirements
5. Power is an Industry-Wide Problem
"Hiding in Plain Sight, Google Seeks More Power," by John Markoff, New York Times, June 14, 2006
(Photo: new Google plant in The Dalles, Oregon, from the NYT, June 14, 2006)
6. Cost of Power Will Dominate, and Ultimately Limit, the Practical Scale of Future Systems
"Unrestrained IT power consumption could eclipse hardware costs and put great pressure on affordability, data center infrastructure, and the environment."
Source: Luiz André Barroso (Google), "The Price of Performance," ACM Queue, vol. 3, no. 7, pp. 48-53, September 2005. (Modified with permission.)
7. Ultra-Efficient Computing: 100x over Business As Usual
- A cooperative effort we call "science-driven system architecture"
- Effective future exascale systems must be developed in the context of application requirements
- Radically change HPC system development via application-driven hardware/software co-design
- Achieve 100x the power efficiency and 100x the capability of the mainstream HPC approach for targeted high-impact applications
- Accelerate the development cycle for exascale HPC systems
- The approach is applicable to numerous scientific areas in the DOE Office of Science
- Proposed pilot application: ultra-high-resolution climate change simulation
8. New Design Constraint: POWER
- Transistors are still getting smaller
  - Moore's Law is alive and well
- But Dennard scaling is dead!
  - No power efficiency improvements with smaller transistors
  - No clock frequency scaling with smaller transistors
  - All magical improvement of silicon goodness has ended
- Traditional methods for extracting more performance are well-mined
  - Cannot expect exotic architectures to save us from the power wall
  - Even the resources of DARPA can only accelerate existing research prototypes (not magic new technology)!
9. Estimated Exascale Power Requirements
- LBNL IJHPCA study for 1/5 exaflop for climate science in 2008
  - Extrapolation of Blue Gene and AMD design trends
  - Estimate: 20 MW for Blue Gene and 179 MW for AMD
- DOE E3 report
  - Extrapolation of existing design trends to exascale in 2016
  - Estimate: 130 MW
- DARPA study
  - More detailed assessment of component technologies
  - Estimate: 20 MW just for memory alone, 60 MW aggregate extrapolated from current design trends
- Baltimore Sun article (Jan 23, 2007): NSA drawing 65-75 MW in Maryland
  - Crisis: Baltimore Gas & Electric does not have sufficient power for the city of Baltimore!
  - Expected to increase by 10-15 MW per year!
- The current approach is not sustainable!
10. Path to Power Efficiency: Reducing Waste in Computing
- Examine the methodology of the low-power embedded computing market
  - Optimized for low power, low cost, and high computational efficiency
- "Years of research in low-power embedded computing have shown only one design technique to reduce power: reduce waste." (Mark Horowitz, Stanford University / Rambus Inc.)
- Sources of waste:
  - Wasted transistors (surface area)
  - Wasted computation (useless work / speculation / stalls)
  - Wasted bandwidth (data movement)
  - Designing for serial performance
11. Our New Design Paradigm: Application-Driven HPC
- Identify high-impact exascale scientific applications
- Tailor the system architecture to highly parallel applications
- Co-design algorithms and software together with the hardware
- Enabled by hardware emulation environments
- Supported by auto-tuning for code generation
12. Designing for Efficiency is Application-Class Specific
13. Processor Power and Performance: Embedded Application-Specific Cores
Courtesy of Chris Rowen, Tensilica Inc.
Performance on EEMBC benchmarks, aggregate for Consumer, Telecom, Office, and Network, based on ARM1136J-S (Freescale i.MX31), ARM1026EJ-S, Tensilica Diamond 570T, T1050 and T1030, MIPS 20K, and NEC VR5000. MIPS M4K, MIPS 4Ke, MIPS 4Ks, MIPS 24K, ARM968E-S, ARM966E-S, ARM926EJ-S, and ARM7TDMI-S scaled by ratio of Dhrystone MIPS within the architecture family. All power figures from vendor websites, 2/23/2006.
14. How Small Is Small?
- Power5 (server): 389 mm², 120 W @ 1900 MHz
- Intel Core2 sc (laptop): 130 mm², 15 W @ 1000 MHz
- PowerPC 450 (BlueGene/P): 8 mm², 3 W @ 850 MHz
- Xtensa DP (cell phones): 0.8 mm², 0.09 W @ 650 MHz
Each small core operates at 1/3 to 1/10th the efficiency of the largest chip, but you can pack 100x more cores onto a chip and consume 1/20th the power!
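The area and power ratios implicit in that list can be made explicit. A rough sketch using only the slide's per-core numbers; per-core performance is not equal across these designs (1/3 to 1/10th, per the slide), so this compares silicon and watts, not throughput:

```python
# Per-core area and power ratios relative to the Power5 server chip, using
# only the numbers quoted on the slide. Performance per core differs, so
# this compares die area and power draw, not delivered throughput.

cores = {                  # name: (area mm^2, power W)
    "Power5":      (389.0, 120.0),
    "Intel Core2": (130.0, 15.0),
    "PowerPC 450": (8.0, 3.0),
    "Xtensa DP":   (0.8, 0.09),
}

ref_area, ref_power = cores["Power5"]
for name, (area, power) in cores.items():
    print(f"{name:12s}: {ref_area / area:5.0f}x smaller, "
          f"{ref_power / power:6.0f}x less power per core")
```

The Xtensa DP core is roughly 486x smaller and 1300x lower power per core, which is where the "100x more cores at a fraction of the power" headroom comes from even after discounting per-core efficiency.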
15. Chris Rowen Data
16. Intel
17. Partnerships for Power-Efficient Computing
- Identify high-impact exascale Office of Science projects!
- Embark on a targeted program of tightly coupled hardware/software co-design
  - Impossible using the typical two-year hardware lead times
  - Break the slow feedback loop for system designs via the RAMP hardware emulation platform and auto-tuned code generation
- Technology partners:
  - UC Berkeley: K. Yelick, J. Wawrzynek, K. Asanovic, K. Keutzer
  - Stanford University / Rambus Inc.: M. Horowitz
  - Tensilica Inc.: C. Rowen
- Pilot application: kilometer-scale climate model
  - Provides important answers to questions with multi-trillion-dollar ramifications
  - Climate community partners: Michael Wehner, Bill Collins, David Randall, et al.
18. Cloud-System-Resolving Climate Simulation
- A major source of errors in climate models is poor cloud simulation
- At ~1 km horizontal resolution, cloud systems can be resolved
- Requires significant algorithm work and unprecedented concurrencies
- Dave Randall's SciDAC-funded effort at Colorado State University offers an algorithm for this regime
  - Icosahedral grid is highly uniform
  - Amenable to massively concurrent architectures composed of power-efficient embedded cores
19. Effects of Finer Resolutions
Duffy, et al.
Enhanced resolution of mountains yields model improvements at larger scales
20. Pushing the Current Model to High Resolution
20 km resolution produces reasonable tropical cyclones
21. Kilometer-Scale Fidelity
- Current cloud parameterizations break down somewhere around 10 km
  - Deep convective processes responsible for moisture transport from near the surface to higher altitudes are inadequately represented at current resolutions
  - Assumptions regarding the distribution of cloud types become invalid in the Arakawa-Schubert scheme
  - Uncertainty in short- and long-term forecasts can be traced to these inaccuracies
- However, at 2 or 3 km, a radical reformulation of atmospheric general circulation models is possible
  - Cloud-system-resolving models replace cumulus convection and large-scale precipitation parameterizations
  - Will this lead to better global cloud distributions?
22. Extrapolating fvCAM to km Scale
- fvCAM: NCAR Community Atmospheric Model, version 3.1
  - Atmospheric component of the fully coupled climate model CCSM3.0
  - Finite-volume hydrostatic dynamics (Lin-Rood)
  - Parameterized physics is the same as in the spectral version
  - We use fvCAM as a tool to estimate future computational requirements
- Major algorithm components of fvCAM:
  - Dynamics: solves atmospheric motion (Navier-Stokes fluid dynamics)
    - Ops: O(mn²); time step determined by the Courant (CFL) condition
    - Time step depends on horizontal resolution (n)
  - Physics: parameterized external processes relevant to the state of the atmosphere
    - Ops: O(mn); time step can remain constant at Δt ≈ 30 minutes
    - Not subject to the CFL condition
  - Filtering: addresses high-aspect-ratio cells at the poles via FFT
    - Ops: O(m log(m) n²)
    - Allows violation of the overly restrictive Courant condition near the poles
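The asymmetry between the CFL-limited dynamics time step and the fixed 30-minute physics time step is what drives the extrapolation. A hedged sketch of that argument (the constants are arbitrary and normalized to a 200 km baseline mesh; only the growth rates matter):

```python
# Why dynamics comes to dominate at fine resolution: horizontal cells grow
# as 1/dx^2 for both components, but the CFL condition also forces the
# dynamics to take ~1/dx more time steps per simulated interval, while the
# physics keeps its fixed ~30-minute step.

def relative_cost(dx_km):
    """Relative dynamics and physics op counts vs. a 200 km baseline mesh."""
    cells = (200.0 / dx_km) ** 2      # horizontal cells grow as 1/dx^2
    dyn_steps = 200.0 / dx_km         # CFL: dynamics steps grow as 1/dx
    phys_steps = 1.0                  # fixed physics step, resolution-independent
    return cells * dyn_steps, cells * phys_steps

for dx in (200, 20, 1.5):
    dyn, phys = relative_cost(dx)
    print(f"dx = {dx:>5} km: dynamics/physics cost ratio = {dyn / phys:.0f}")
```

At 1.5 km the dynamics outgrows the physics by roughly 130x, consistent with the next slide's observation that physics and filtering overheads become negligible.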
23. Extrapolation to km Scale
Theoretical scaling behavior matches experimental measurements.
Extrapolating out to 1.5 km, we see that the dynamics dominates the calculation time, while the physics and filtering overheads become negligible.
24. Scaling Processor Performance Requirements
- A practical constraint: the number of subdomains is limited to at most the number of horizontal cells
- The current 1D approach is limited to only ~4000 subdomains at 1 km
  - Would require ~1 teraflop per subdomain using this approach!
- Number of 2D subdomains estimated using 3x3 or 10x10 cells
  - Can utilize millions of subdomains
  - Assuming 10x10x10 cells (given 100 vertical layers): 20M subdomains
  - 0.5 Gflop/s per processor would achieve a 1000x speedup over real time
  - Vertical solution requires high communication (aided with multi-core/SMP)
- This is a lower bound in the absence of communication costs and load imbalance
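The per-processor targets follow from dividing the machine's required sustained rate by the subdomain count. A hedged back-of-envelope check (the 10 Pflop/s sustained figure is the strawman's value from a later slide; the slide's own ~1 teraflop 1D figure is the same order of magnitude as the value computed here):

```python
# Sustained rate each subdomain's processor must deliver, assuming the
# machine as a whole sustains ~10 Pflop/s (the strawman's figure for
# running 1000x faster than real time).

SUSTAINED = 10e15            # flop/s sustained, whole machine (assumed)
SUBDOMAINS_2D = 20_000_000   # 10x10x10-cell subdomains
SUBDOMAINS_1D = 4000         # 1D decomposition limit at 1 km

per_proc_2d = SUSTAINED / SUBDOMAINS_2D
per_proc_1d = SUSTAINED / SUBDOMAINS_1D
print(f"2D decomposition: {per_proc_2d / 1e9:.1f} Gflop/s per processor")
print(f"1D decomposition: {per_proc_1d / 1e12:.1f} Tflop/s per subdomain")
```

The 2D decomposition lands at the modest 0.5 Gflop/s per processor that commodity embedded cores can sustain, while the 1D decomposition demands teraflop-class subdomains that no single processor provides.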
25. Memory Scaling Behavior
- Memory estimate at km scale is about 25 TB total
  - 100 TB total with 100 vertical levels
- Total memory requirement is independent of the domain decomposition
- Due to the Courant condition, the operation count scales at a greater rate than the number of mesh cells, hence the relatively low per-processor memory requirement
- Memory bytes per flop drop from 0.7 for the 200 km mesh to 0.009 for the 1.5 km mesh
- The current 1D approach requires 6 GB per processor
- The 2D approach requires only 5 MB per processor
26. Interconnect Requirements
Data assumes a 2D 10x10 decomposition where only 10% of the calculation time is devoted to communication.
- Three factors cause sustained performance to fall below peak:
  - Single-processor performance, interprocessor communication, and load balancing
- In the 2D case, message sizes are independent of horizontal resolution; in the 1D case, however, communication contains ghost cells over the entire range of longitudes
- Assuming (pessimistically) that communication occurs only during 10% of the calculation, not over the entire (100%) interval, increases bandwidth demands 10x
- The 2D 10x10 case requires a minimum of 277 MB/s bandwidth and a maximum of 18 μs latency
- The 1D case would require a minimum of 256 GB/s bandwidth
- Note that the hardware/algorithm ability to overlap computation with communication would decrease the interconnect requirements
- Load balance is an important issue, but is not examined in our study
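The 10x inflation factor works mechanically as follows. A minimal sketch; the 27.7 MB/s average rate below is a hypothetical value chosen only to reproduce the slide's 277 MB/s figure, not a number taken from the study:

```python
# If all ghost-cell traffic must complete within a fraction of the time
# step (here 10%), the link must run proportionally faster than the rate
# averaged over the whole step.

def required_bandwidth(avg_rate_mb_s, comm_fraction=0.10):
    """Peak link rate needed if all traffic fits in comm_fraction of the step."""
    return avg_rate_mb_s / comm_fraction

avg = 27.7   # MB/s averaged over the whole step (assumed for illustration)
print(f"required peak bandwidth: {required_bandwidth(avg):.0f} MB/s")
```

Overlapping communication with computation effectively raises `comm_fraction` back toward 1.0, which is why the slide notes that overlap would relax the requirement.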
27. Communication Topology
28. New Discretization for Massive Parallelism
- A latitude-longitude-based algorithm would not scale to 1 km
  - Filtering cost would be only 7% of the calculation
  - However, the semi-Lagrangian advection algorithm breaks down
  - Grid-cell aspect ratio at the pole is 10,000:1!
  - The advection time step is problematic at this scale
- Ultimately requires a new discretization for the atmosphere model
  - Must expose sufficient parallelism to exploit a power-efficient design
  - Investigating the cubed-sphere (NOAA) and icosahedral (Randall code) grids
(Figures: current latitude-longitude grid, cubed sphere, icosahedral grid)
29. Strawman 1 km Climate Computer
- 1.5 km mesh at 1000x real time
  - 0.015° x 0.02° x 100 levels (1.5 km)
- 10 petaflops sustained
  - 100-200 petaflops peak
- 100 terabytes total memory
  - Only 5 MB memory per processor
  - 5 GB/s local memory performance per domain (1 byte/flop)
- 2 million horizontal subdomains
  - 10 vertical domains (assume fast vertical communication)
- 20 million processors at 500 Mflop/s each, sustained
- 200 MB/s in the four nearest-neighbor directions
  - Tight coupling of communication in the vertical dimension
We now compare available technology in the current generation of HPC systems.
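The strawman's headline totals are internally consistent; a quick cross-check using only the numbers on the slide:

```python
# Cross-check of the strawman totals: 20 million processors at 500 Mflop/s
# sustained and 5 MB of memory each.

PROCESSORS = 20_000_000
PER_PROC = 500e6        # sustained flop/s per processor
MEM_PER_PROC = 5e6      # bytes per processor

print(f"sustained:    {PROCESSORS * PER_PROC / 1e15:.0f} Pflop/s")
print(f"total memory: {PROCESSORS * MEM_PER_PROC / 1e12:.0f} TB")
```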
30. Estimation of 1 km Climate Model Computational Requirements
- We have performed a detailed analysis of kilometer-scale climate model resource requirements
  - Paper in the International Journal of High Performance Computing Applications
- Equations of motion dominate at ultra-high resolutions because of the Courant stability condition
- Require the model to run 1000x faster than real time (minimum)
- A truly exascale-class scientific problem:
  - About 2 billion icosahedral points
  - 20 million processors with modest vertical parallelization
  - A modest 0.5 gigaflops per processor, with 5 MB of memory per processor
  - A modest 200 MB/s of communication bandwidth to nearest neighbors
31. Customization Continuum
- Application-driven architecture does NOT necessitate a special-purpose machine!
- D.E. Shaw system: semicustom design with some custom elements
  - Uses fully programmable cores with full-custom co-processors to achieve efficiency (~1 megawatt)
  - Simulates 100x-1000x longer timescales than ANY feasible HPC system
  - Programmability broadens application reach (but narrower than our approach)
- MD-GRAPE: full-custom ASIC design
  - 1 petaflop performance for one application using 260 kilowatts
  - Cost: $9M from concept to implementation
- Application-driven architecture (climate simulator): semicustom design
  - Highly programmable core architecture using C/C++/Fortran
  - 100x better power efficiency is modest compared to the demonstrated capability of more specialized approaches!
32. Climate Strawman System Design in 2008
- Design the system around the requirements of the massively parallel application
  - Example: kilometer-scale climate model application
- We examined three different approaches:
  - AMD Opteron: commodity approach; lower efficiency for scientific applications offset by the cost efficiencies of the mass market
  - BlueGene: generic embedded processor core with customized system-on-chip (SoC) services to improve power efficiency for scientific applications
  - Tensilica: customizing the embedded CPU as well as the SoC provides further power efficiency benefits while maintaining programmability
Solve an exascale problem without building an exaflop/s machine!
33. Climate System Design Concept: Strawman Design Study
From Chris Rowen, Tensilica
10 PF sustained, 120 m², < 3 MW, < $75M
34. Automatic Processor Generation (Example from Existing Tensilica Design Flow)
Application-optimized processor implementation (RTL/Verilog)
Components: base CPU, OCD, application datapaths, timer, cache, FPU, extended registers
- Processor configuration:
  - Select from a menu
  - Automatic instruction discovery (XPRES compiler)
  - Explicit instruction description (TIE)
Build with any process in any fab (costs ~$1M)
Tailored SW tools: compiler, debugger, simulators, Linux, and other OS ports (automatically generated together with the core)
35. Impact on the Broader DOE Scientific Workload
- We propose a cloud-resolving climate change simulation to illustrate our power-efficient, application-driven design methodology
- Our approach is geared toward a class of codes, not just a single code instantiation
- This methodology is broadly applicable and could be extended to other scientific disciplines
- BlueGene was originally targeted at chemistry and bioinformatics applications; the result was a very power-efficient architecture with applicability broader than the original target
36. More Info
- NERSC Science-Driven System Architecture Group
  - http://www.nersc.gov/projects/SDSA
- Power-Efficient Semi-custom Computing
  - http://vis.lbl.gov/~jshalf/SIAM_CSE07
- The View from Berkeley
  - http://view.eecs.berkeley.edu
- Memory Bandwidth
  - http://www.nersc.gov/projects/SDSA/reports/uploaded/SOS11_mem_Shalf.pdf
37. Extra
38. Consumer Electronics Convergence
From Tsugio Makimoto
39. Consumer Electronics Has Replaced PCs as the Dominant Market Force in CPU Design!!
From Tsugio Makimoto
- iPod + iTunes exceeds 50% of Apple's net profit
- Apple introduces the iPod
- Apple introduces a cell phone (iPhone)
40. Convergence of Platforms
- Multiple parallel general-purpose processors (GPPs)
- Multiple application-specific processors (ASPs)
"The processor is the new transistor." (Rowen)
41. BG/L: the Rise of the Embedded Processor?
(Chart: TOP500 aggregate Rmax in Tflop/s by architecture, 06/1993 to 06/2005, log scale. Categories: MPP, SMP, Cluster, Constellations, Single Processor, SIMD, Others, MPP embedded.)
42. Tension Between Commodity and Specialized Architecture
- Commodity components:
  - Amortize high development costs by sharing them with a high-volume market
  - Accept lower computational efficiency for much lower capital equipment costs!
- Specialization:
  - Specialize to the task in order to improve computational efficiency
  - Used very successfully by the embedded processor community
  - Not cost-effective if volume is too low
- When the cost of power exceeds capital equipment costs:
  - Commodity clusters are optimizing the wrong part of the cost model
  - Will the need for higher computational efficiency drive more specialization? (Look at the embedded market: lots of specialization.)
43. What is Happening Now?
- Moore's Law
  - Silicon lithography will improve by 2x every 18 months
  - Double the number of transistors per chip every 18 months
- CMOS power
  - Total power ≈ V²·f·C + V·I_leakage (active power + passive power)
  - As we reduce feature size, capacitance (C) decreases proportionally to transistor size
  - Enables increasing clock frequency (f) proportionally to Moore's Law lithography improvements, at the same power use
  - This is called fixed-voltage clock-frequency scaling (Borkar '99)
- Since 90 nm:
  - Can no longer take advantage of frequency scaling, because the passive power (V·I_leakage) dominates the active term (V²·f·C)
  - The result is the recent clock-frequency stall, reflected in the Patterson graph at right
SPEC_Int benchmark performance since 1978, from Patterson & Hennessy, 4th edition.
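The power identity above can be sketched numerically to show why fixed-voltage scaling worked, and then stopped. The per-generation scaling factors below are illustrative textbook values (~30% capacitance reduction, 1/0.7 frequency gain), not measured process data:

```python
# Total power P = V^2 * f * C + V * I_leak: a dynamic (switching) term plus
# a leakage term. Under classic fixed-voltage scaling, C shrinks ~30% per
# generation while f rises by 1/0.7, leaving the dynamic term unchanged.

def total_power(v, f, c, i_leak):
    active = v * v * f * c   # dynamic switching power
    passive = v * i_leak     # leakage (passive) power
    return active + passive

# One lithography generation at fixed voltage, leakage negligible:
before = total_power(v=1.0, f=1.0, c=1.0, i_leak=0.0)
after = total_power(v=1.0, f=1.0 / 0.7, c=0.7, i_leak=0.0)
print(f"dynamic power ratio across a shrink: {after / before:.2f}")

# Post-90nm picture: leakage adds a passive term that frequency and
# capacitance scaling cannot reduce, so total power grows anyway.
leaky = total_power(v=1.0, f=1.0 / 0.7, c=0.7, i_leak=0.5)
print(f"with leakage: {leaky:.2f}x the original power budget")
```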
44. What is Happening Now?
(Repeat of the previous slide, with a "We are here!" marker added at the clock-frequency stall.)
45. Some Final Comments on Convergence (Who Is in the Driver's Seat of the Multicore Revolution?)
46. Parallel Computing Everywhere: Cisco CRS-1 Terabit Router
16 clusters of 12 cores each (192 cores!), plus 16 PPE
- 188 Xtensa general-purpose processor cores per Silicon Packet Processor
- Up to 400,000 processors per system
- (This is not just about HPC!!!)
Replaces an ASIC using 188 general-purpose cores! Emulates the ASIC at nearly the same power/performance. Better power/performance than an FPGA! A new definition for "custom" in SoC.
47. Conclusions
- An enormous transition is underway that affects all sectors of the computing industry
  - Motivated by power limits
  - Proceeding before the emergence of a parallel programming model
  - Will lead to a new era of architectural exploration, given the uncertainties about programming and execution models (and we MUST explore!)
- Need to get involved now:
  - 3-5 years for new hardware designs to emerge
  - 3-5 years of lead time for the new software ideas necessary to support the new hardware to emerge
  - 5 MORE years to general adoption of the new software
48. Interconnect Design Considerations for Massive Concurrency
- Application studies provide insight into the requirements for interconnects (both on-chip and off-chip)
  - On-chip interconnect is 2D planar (a crossbar won't scale!)
  - Sparse connectivity for the dwarfs: a crossbar is overkill
  - No single best topology
- A bandwidth-oriented network for data
  - Most point-to-point messages exhibit a sparse topology and are bandwidth-bound
- A separate latency-oriented network for collectives
  - E.g., Thinking Machines CM-5, Cray T3D, IBM BlueGene/L
- Ultimately, need to be aware of the on-chip interconnect topology in addition to the off-chip topology
  - Adaptive topology interconnects (HFAST)
  - Intelligent task migration?
49. Reliable System Design
- The future is unreliable
  - As silicon lithography pushes toward the atomic scale, the opportunity for spurious hardware errors will increase dramatically
- Reliability of a system is not necessarily proportional to the number of cores in the system
  - Reliability is proportional to the number of sockets in the system (not cores per chip)
  - At LLNL, BG/L has a longer MTBF than Purple despite having 12x more processor cores
  - Integrating more peripheral devices onto a single chip (e.g., caches, memory controller, interconnect) can further reduce chip count and increase reliability (system-on-chip/SoC)
- A key limiting factor is the software infrastructure
  - Software was designed assuming perfect data integrity (but that is not a multicore issue)
  - Software was written with an implicit assumption of smaller concurrency (1M cores was not part of the original design assumptions)
  - Requires fundamental rethinking of OS and math library design assumptions
50. Operating Systems for CMP
- Old OS assumptions are bogus for hundreds of cores!
  - They assume a limited number of CPUs that must be shared
  - Old OS: time-multiplexing (context switching and cache pollution!)
  - New OS: spatial partitioning
- Greedy allocation of finite I/O device interfaces (e.g., 100 cores go after the network interface simultaneously)
  - Old OS: first process to acquire the lock gets the device (resource/lock contention! nondeterministic delay!)
  - New OS: QoS management for symmetric device access
- Background task handling via threads and signals
  - Old OS: interrupts and threads (time-multiplexing) (inefficient!)
  - New OS: side-cores dedicated to DMA and async I/O
- Fault isolation
  - Old OS: CPU failure -> kernel panic (will happen with increasing frequency in future silicon!)
  - New OS: CPU failure -> partition restart (partitioned device drivers)
- Old OS: invoked for any interprocessor communication or scheduling, vs. direct HW access
51. I/O for Massive Concurrency
- Scalable I/O for massively concurrent systems!
- Many issues with coordinating access to disk within a node (on-chip or CMP)
- The OS will need to devote more attention to QoS for cores competing for a finite resource; mutex locks and greedy resource allocation policies will not do! (Otherwise it is rugby, where the device is the ball.)
52. Intel
53. Chris Rowen Data
54. Increasing Blue Gene Impact
- SC2005 Gordon Bell Award: 101.7 Tflop/s on a real materials science simulation
- Recently exceeding 200 Tflop/s sustained
- Sweep of all four HPC Challenge class 1 benchmarks:
  - G-HPL (259 Tflop/s), G-RandomAccess (35 GUPS), EP-STREAM (160 TB/s), and G-FFT (2.3 Tflop/s)
- Over 80 large-scale applications ported and running on BG/L
27.6 kW power consumption per rack (max); 7 kW power consumption (idle)
Slide adapted from Rick Stevens, ANL
55. Future Scaling without Innovation
If we scale current peak performance numbers for the various architectures, allowing system peak to double every 18 months: trouble ahead.
Slide adapted from Rick Stevens, ANL
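The doubling assumption above is easy to make concrete. A minimal sketch (the 1 Pflop/s baseline is an arbitrary illustrative value, not a figure from the slide):

```python
# Project system peak forward under the "business as usual" assumption of
# doubling every 18 months, with no architectural innovation.

def projected_peak(base_pflops, years, doubling_months=18):
    """Peak after `years`, doubling every `doubling_months` months."""
    return base_pflops * 2 ** (years * 12 / doubling_months)

base = 1.0  # assumed 1 Pflop/s baseline
for years in (3, 6, 9):
    print(f"after {years} years: {projected_peak(base, years):.0f} Pflop/s")
```

The trouble the slide flags is that, as shown earlier in the deck, power and cost under current design trends grow along with peak rather than holding steady.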
56. Projected Electricity Use: Various Scenarios, 2007-2011
Green Grid / DOE energy savings goal: 10.7 billion kWh/yr by 2011
Source: "Report to Congress on Server and Data Center Energy Efficiency, Public Law 109-431," US EPA, August 2, 2007
57. Petascale Architectural Exploration (Back-of-the-Envelope Calculation)
- Software challenges (at all levels) are a tremendous obstacle for any of these approaches
- Unprecedented levels of concurrency are required
- Unprecedented levels of power are required if we adopt the conventional route
- The embedded route offers tractable power, but daunting concurrency!
- This only gets us to 10 petaflops peak; a 200 PF system is needed to meet the application's sustained performance requirement, so cost and power are likely to be 10x-20x more