1. Ultra-Efficient Scientific Computing: More Science, Less Power
John Shalf, Leonid Oliker, Michael Wehner, Kathy Yelick
RAMP Retreat, January 16, 2008
2. End of Dennard Scaling
- New constraints:
  - Power limits clock rates
  - Cannot squeeze more performance from ILP (complex cores) either!
- But Moore's Law continues!
  - What to do with all of those transistors if everything else is flat-lining?
  - Now, cores per chip double every 18 months instead of clock frequency!
- No more free lunch for performance improvement!
Figure courtesy of Kunle Olukotun, Lance Hammond, Herb Sutter, and Burton Smith
3. ORNL Computing Power and Cooling, 2006-2011
- Immediate need to add 8 MW to prepare for 2007 installs of new systems
- NLCF petascale system could require an additional 10 MW by 2008
- Need a total of 40-50 MW for projected systems by 2011
- Numbers are just for the computers; add 75% for cooling
- Cooling will require 12,000-15,000 tons of chiller capacity
YIKES!
(Chart: projected annual power costs rising through $3M, $9M, $17M, $23M, and $31M)
Cost estimates based on $0.05 per kWh
Data taken from Energy Management System-4
(EMS4). EMS4 is the DOE corporate system for
collecting energy information from the sites.
EMS4 is a web-based system that collects energy
consumption and cost information for all energy
sources used at each DOE site. Information is
entered into EMS4 by the site and reviewed at
Headquarters for accuracy.
OAK RIDGE NATIONAL LABORATORY U. S. DEPARTMENT OF
ENERGY
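The chart's cost figures follow directly from the quoted assumptions. A minimal back-of-envelope sketch, using the $0.05/kWh rate and the 75% cooling overhead from the slide (whether the chart's dollar figures include the cooling overhead is our assumption):

```python
# Back-of-envelope check of the ORNL cost figures: annual electricity cost
# for a given IT power draw at $0.05 per kWh, plus the 75% cooling overhead
# quoted on the slide.

HOURS_PER_YEAR = 24 * 365  # 8760

def annual_power_cost(it_load_mw, rate_per_kwh=0.05, cooling_overhead=0.75):
    """Annual cost in dollars for an IT load (MW) including cooling."""
    total_kw = it_load_mw * 1000 * (1 + cooling_overhead)
    return total_kw * HOURS_PER_YEAR * rate_per_kwh

# The 8 MW 2007 addition and the projected 40-50 MW load by 2011:
for mw in (8, 18, 40, 50):
    print(f"{mw:>2} MW IT load -> ${annual_power_cost(mw) / 1e6:.1f}M per year")
```

At 40 MW this reproduces roughly the $31M top of the chart, which suggests the chart's figures do include cooling.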
4. Top500 Estimated Power Requirements
5. Power is an Industry-Wide Problem
"Hiding in Plain Sight, Google Seeks More Power," by John Markoff, New York Times, June 14, 2006
(Photo: new Google plant in The Dalles, Oregon, from the NYT, June 14, 2006)
6. Cost of Power Will Dominate, and Ultimately Limit, the Practical Scale of Future Systems
"Unrestrained IT power consumption could eclipse hardware costs and put great pressure on affordability, data center infrastructure, and the environment."
Source: Luiz André Barroso (Google), "The Price of Performance," ACM Queue, vol. 3, no. 7, pp. 48-53, September 2005. (Modified with permission.)
7. Ultra-Efficient Computing: 100x over Business As Usual
- A cooperative effort we call "science-driven system architecture"
- Effective future exascale systems must be developed in the context of application requirements
- Radically change HPC system development via application-driven hardware/software co-design
- Achieve 100x the power efficiency and 100x the capability of the mainstream HPC approach for targeted high-impact applications
- Accelerate the development cycle for exascale HPC systems
- The approach is applicable to numerous scientific areas in the DOE Office of Science
- Proposed pilot application: ultra-high-resolution climate change simulation
8. New Design Constraint: POWER
- Transistors are still getting smaller
  - Moore's Law is alive and well
- But Dennard scaling is dead!
  - No power efficiency improvements with smaller transistors
  - No clock frequency scaling with smaller transistors
  - All magical improvement of silicon goodness has ended
- Traditional methods for extracting more performance are well-mined
  - Cannot expect exotic architectures to save us from the power wall
  - Even the resources of DARPA can only accelerate existing research prototypes (not magic new technology)!
9. Estimated Exascale Power Requirements
- LBNL IJHPCA study for 1/5 exaflop for climate science in 2008
  - Extrapolation of Blue Gene and AMD design trends
  - Estimate: 20 MW for Blue Gene and 179 MW for AMD
- DOE E3 report
  - Extrapolation of existing design trends to exascale in 2016
  - Estimate: 130 MW
- DARPA study
  - More detailed assessment of component technologies
  - Estimate: 20 MW just for memory alone, 60 MW aggregate extrapolated from current design trends
- Baltimore Sun article (Jan 23, 2007): NSA drawing 65-75 MW in Maryland
  - Crisis: Baltimore Gas & Electric does not have sufficient power for the city of Baltimore!
  - Expected to increase by 10-15 MW per year!
- The current approach is not sustainable!
10. Path to Power Efficiency: Reducing Waste in Computing
- Examine the methodology of the low-power embedded computing market
  - Optimized for low power, low cost, and high computational efficiency
- "Years of research in low-power embedded computing have shown only one design technique to reduce power: reduce waste." (Mark Horowitz, Stanford University / Rambus Inc.)
- Sources of waste:
  - Wasted transistors (surface area)
  - Wasted computation (useless work / speculation / stalls)
  - Wasted bandwidth (data movement)
  - Designing for serial performance
11. Our New Design Paradigm: Application-Driven HPC
- Identify high-impact exascale scientific applications
- Tailor the system architecture to highly parallel applications
- Co-design algorithms and software together with the hardware
- Enabled by hardware emulation environments
- Supported by auto-tuning for code generation
12. Designing for Efficiency is Application-Class Specific
13. Processor Power and Performance: Embedded Application-Specific Cores
Courtesy of Chris Rowen, Tensilica Inc.
Performance on EEMBC benchmarks, aggregate for Consumer, Telecom, Office, and Network, based on ARM1136J-S (Freescale i.MX31), ARM1026EJ-S, Tensilica Diamond 570T, T1050 and T1030, MIPS 20K, and NEC VR5000. MIPS M4K, MIPS 4Ke, MIPS 4Ks, MIPS 24K, ARM968E-S, ARM966E-S, ARM926EJ-S, and ARM7TDMI-S scaled by ratio of Dhrystone MIPS within the architecture family. All power figures from vendor websites, 2/23/2006.
14. How Small Is Small?
- Power5 (server): 389 mm², 120 W @ 1900 MHz
- Intel Core2 sc (laptop): 130 mm², 15 W @ 1000 MHz
- PowerPC 450 (BlueGene/P): 8 mm², 3 W @ 850 MHz
- Xtensa DP (cell phones): 0.8 mm², 0.09 W @ 650 MHz
Each small core operates at 1/3 to 1/10th the efficiency of the largest chip, but you can pack 100x more cores onto a chip and consume 1/20th the power!
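The area and power ratios implicit in that list can be made explicit. A rough sketch using only the slide's per-core numbers; per-core performance is not equal across these designs (1/3 to 1/10th, per the slide), so this compares silicon and watts, not throughput:

```python
# Per-core area and power ratios relative to the Power5 server chip, using
# only the numbers quoted on the slide. Performance per core differs, so
# this compares die area and power draw, not delivered throughput.

cores = {                  # name: (area mm^2, power W)
    "Power5":      (389.0, 120.0),
    "Intel Core2": (130.0, 15.0),
    "PowerPC 450": (8.0, 3.0),
    "Xtensa DP":   (0.8, 0.09),
}

ref_area, ref_power = cores["Power5"]
for name, (area, power) in cores.items():
    print(f"{name:12s}: {ref_area / area:5.0f}x smaller, "
          f"{ref_power / power:6.0f}x less power per core")
```

The Xtensa DP core is roughly 486x smaller and 1300x lower power per core, which is where the "100x more cores at a fraction of the power" headroom comes from even after discounting per-core efficiency.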
15. Chris Rowen Data
16. Intel
17. Partnerships for Power-Efficient Computing
- Identify high-impact exascale Office of Science projects!
- Embark on a targeted program of tightly coupled hardware/software co-design
  - Impossible using the typical two-year hardware lead times
  - Break the slow feedback loop for system designs via the RAMP hardware emulation platform and auto-tuned code generation
- Technology partners:
  - UC Berkeley: K. Yelick, J. Wawrzynek, K. Asanovic, K. Keutzer
  - Stanford University / Rambus Inc.: M. Horowitz
  - Tensilica Inc.: C. Rowen
- Pilot application: kilometer-scale climate model
  - Provides important answers to questions with multi-trillion-dollar ramifications
  - Climate community partners: Michael Wehner, Bill Collins, David Randall, et al.
18. Cloud-System-Resolving Climate Simulation
- A major source of errors in climate models is poor cloud simulation
- At ~1 km horizontal resolution, cloud systems can be resolved
- Requires significant algorithm work and unprecedented concurrencies
- Dave Randall's SciDAC-funded effort at Colorado State University offers an algorithm for this regime
  - Icosahedral grid is highly uniform
  - Amenable to massively concurrent architectures composed of power-efficient embedded cores
19. Effects of Finer Resolutions
Duffy, et al.
Enhanced resolution of mountains yields model improvements at larger scales
20. Pushing the Current Model to High Resolution
20 km resolution produces reasonable tropical cyclones
21. Kilometer-Scale Fidelity
- Current cloud parameterizations break down somewhere around 10 km
  - Deep convective processes responsible for moisture transport from near the surface to higher altitudes are inadequately represented at current resolutions
  - Assumptions regarding the distribution of cloud types become invalid in the Arakawa-Schubert scheme
  - Uncertainty in short- and long-term forecasts can be traced to these inaccuracies
- However, at 2 or 3 km, a radical reformulation of atmospheric general circulation models is possible
  - Cloud-system-resolving models replace cumulus convection and large-scale precipitation parameterizations
  - Will this lead to better global cloud distributions?
22. Extrapolating fvCAM to km Scale
- fvCAM: NCAR Community Atmospheric Model, version 3.1
  - Atmospheric component of the fully coupled climate model CCSM3.0
  - Finite-volume hydrostatic dynamics (Lin-Rood)
  - Parameterized physics is the same as in the spectral version
  - We use fvCAM as a tool to estimate future computational requirements
- Major algorithm components of fvCAM:
  - Dynamics: solves atmospheric motion (Navier-Stokes fluid dynamics)
    - Ops: O(mn²); time step determined by the Courant (CFL) condition
    - Time step depends on horizontal resolution (n)
  - Physics: parameterized external processes relevant to the state of the atmosphere
    - Ops: O(mn); time step can remain constant at Δt ≈ 30 minutes
    - Not subject to the CFL condition
  - Filtering: addresses high-aspect-ratio cells at the poles via FFT
    - Ops: O(m log(m) n²)
    - Allows violation of the overly restrictive Courant condition near the poles
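The asymmetry between the CFL-limited dynamics time step and the fixed 30-minute physics time step is what drives the extrapolation. A hedged sketch of that argument (the constants are arbitrary and normalized to a 200 km baseline mesh; only the growth rates matter):

```python
# Why dynamics comes to dominate at fine resolution: horizontal cells grow
# as 1/dx^2 for both components, but the CFL condition also forces the
# dynamics to take ~1/dx more time steps per simulated interval, while the
# physics keeps its fixed ~30-minute step.

def relative_cost(dx_km):
    """Relative dynamics and physics op counts vs. a 200 km baseline mesh."""
    cells = (200.0 / dx_km) ** 2      # horizontal cells grow as 1/dx^2
    dyn_steps = 200.0 / dx_km         # CFL: dynamics steps grow as 1/dx
    phys_steps = 1.0                  # fixed physics step, resolution-independent
    return cells * dyn_steps, cells * phys_steps

for dx in (200, 20, 1.5):
    dyn, phys = relative_cost(dx)
    print(f"dx = {dx:>5} km: dynamics/physics cost ratio = {dyn / phys:.0f}")
```

At 1.5 km the dynamics outgrows the physics by roughly 130x, consistent with the next slide's observation that physics and filtering overheads become negligible.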
23. Extrapolation to km Scale
Theoretical scaling behavior matches experimental measurements.
Extrapolating out to 1.5 km, we see that the dynamics dominates the calculation time, while the physics and filtering overheads become negligible.
24. Scaling Processor Performance Requirements
- A practical constraint: the number of subdomains is limited to at most the number of horizontal cells
- The current 1D approach is limited to only ~4000 subdomains at 1 km
  - Would require ~1 teraflop per subdomain using this approach!
- Number of 2D subdomains estimated using 3x3 or 10x10 cells
  - Can utilize millions of subdomains
  - Assuming 10x10x10 cells (given 100 vertical layers): 20M subdomains
  - 0.5 Gflop/s per processor would achieve a 1000x speedup over real time
  - Vertical solution requires high communication (aided with multi-core/SMP)
- This is a lower bound in the absence of communication costs and load imbalance
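The per-processor targets follow from dividing the machine's required sustained rate by the subdomain count. A hedged back-of-envelope check (the 10 Pflop/s sustained figure is the strawman's value from a later slide; the slide's own ~1 teraflop 1D figure is the same order of magnitude as the value computed here):

```python
# Sustained rate each subdomain's processor must deliver, assuming the
# machine as a whole sustains ~10 Pflop/s (the strawman's figure for
# running 1000x faster than real time).

SUSTAINED = 10e15            # flop/s sustained, whole machine (assumed)
SUBDOMAINS_2D = 20_000_000   # 10x10x10-cell subdomains
SUBDOMAINS_1D = 4000         # 1D decomposition limit at 1 km

per_proc_2d = SUSTAINED / SUBDOMAINS_2D
per_proc_1d = SUSTAINED / SUBDOMAINS_1D
print(f"2D decomposition: {per_proc_2d / 1e9:.1f} Gflop/s per processor")
print(f"1D decomposition: {per_proc_1d / 1e12:.1f} Tflop/s per subdomain")
```

The 2D decomposition lands at the modest 0.5 Gflop/s per processor that commodity embedded cores can sustain, while the 1D decomposition demands teraflop-class subdomains that no single processor provides.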
25. Memory Scaling Behavior
- Memory estimate at km scale is about 25 TB total
  - 100 TB total with 100 vertical levels
- Total memory requirement is independent of the domain decomposition
- Due to the Courant condition, the operation count scales at a greater rate than the number of mesh cells, hence the relatively low per-processor memory requirement
- Memory bytes per flop drop from 0.7 for the 200 km mesh to 0.009 for the 1.5 km mesh
- The current 1D approach requires 6 GB per processor
- The 2D approach requires only 5 MB per processor
26. Interconnect Requirements
Data assumes a 2D 10x10 decomposition where only 10% of the calculation time is devoted to communication.
- Three factors cause sustained performance to fall below peak:
  - Single-processor performance, interprocessor communication, and load balancing
- In the 2D case, message sizes are independent of horizontal resolution; in the 1D case, however, communication contains ghost cells over the entire range of longitudes
- Assuming (pessimistically) that communication occurs only during 10% of the calculation, not over the entire (100%) interval, increases bandwidth demands 10x
- The 2D 10x10 case requires a minimum of 277 MB/s bandwidth and a maximum of 18 μs latency
- The 1D case would require a minimum of 256 GB/s bandwidth
- Note that the hardware/algorithm ability to overlap computation with communication would decrease the interconnect requirements
- Load balance is an important issue, but is not examined in our study
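The 10x inflation factor works mechanically as follows. A minimal sketch; the 27.7 MB/s average rate below is a hypothetical value chosen only to reproduce the slide's 277 MB/s figure, not a number taken from the study:

```python
# If all ghost-cell traffic must complete within a fraction of the time
# step (here 10%), the link must run proportionally faster than the rate
# averaged over the whole step.

def required_bandwidth(avg_rate_mb_s, comm_fraction=0.10):
    """Peak link rate needed if all traffic fits in comm_fraction of the step."""
    return avg_rate_mb_s / comm_fraction

avg = 27.7   # MB/s averaged over the whole step (assumed for illustration)
print(f"required peak bandwidth: {required_bandwidth(avg):.0f} MB/s")
```

Overlapping communication with computation effectively raises `comm_fraction` back toward 1.0, which is why the slide notes that overlap would relax the requirement.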
27. Communication Topology
28. New Discretization for Massive Parallelism
- A latitude-longitude-based algorithm would not scale to 1 km
  - Filtering cost would be only 7% of the calculation
  - However, the semi-Lagrangian advection algorithm breaks down
  - Grid-cell aspect ratio at the pole is 10,000:1!
  - The advection time step is problematic at this scale
- Ultimately requires a new discretization for the atmosphere model
  - Must expose sufficient parallelism to exploit a power-efficient design
  - Investigating the cubed-sphere (NOAA) and icosahedral (Randall code) grids
(Figures: current latitude-longitude grid, cubed sphere, icosahedral grid)
29. Strawman 1 km Climate Computer
- 1.5 km mesh at 1000x real time
  - 0.015° x 0.02° x 100 levels (1.5 km)
- 10 petaflops sustained
  - 100-200 petaflops peak
- 100 terabytes total memory
  - Only 5 MB memory per processor
  - 5 GB/s local memory performance per domain (1 byte/flop)
- 2 million horizontal subdomains
  - 10 vertical domains (assume fast vertical communication)
- 20 million processors at 500 Mflop/s each, sustained
- 200 MB/s in the four nearest-neighbor directions
  - Tight coupling of communication in the vertical dimension
We now compare available technology in the current generation of HPC systems.
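The strawman's headline totals are internally consistent; a quick cross-check using only the numbers on the slide:

```python
# Cross-check of the strawman totals: 20 million processors at 500 Mflop/s
# sustained and 5 MB of memory each.

PROCESSORS = 20_000_000
PER_PROC = 500e6        # sustained flop/s per processor
MEM_PER_PROC = 5e6      # bytes per processor

print(f"sustained:    {PROCESSORS * PER_PROC / 1e15:.0f} Pflop/s")
print(f"total memory: {PROCESSORS * MEM_PER_PROC / 1e12:.0f} TB")
```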
30. Estimation of 1 km Climate Model Computational Requirements
- We have performed a detailed analysis of kilometer-scale climate model resource requirements
  - Paper in the International Journal of High Performance Computing Applications
- Equations of motion dominate at ultra-high resolutions because of the Courant stability condition
- Require the model to run 1000x faster than real time (minimum)
- A truly exascale-class scientific problem:
  - About 2 billion icosahedral points
  - 20 million processors with modest vertical parallelization
  - A modest 0.5 gigaflops per processor, with 5 MB of memory per processor
  - A modest 200 MB/s of communication bandwidth to nearest neighbors
31. Customization Continuum
- Application-driven architecture does NOT necessitate a special-purpose machine!
- D.E. Shaw system: semicustom design with some custom elements
  - Uses fully programmable cores with full-custom co-processors to achieve efficiency (~1 megawatt)
  - Simulates 100x-1000x longer timescales than ANY feasible HPC system
  - Programmability broadens application reach (but narrower than our approach)
- MD-GRAPE: full-custom ASIC design
  - 1 petaflop performance for one application using 260 kilowatts
  - Cost: $9M from concept to implementation
- Application-driven architecture (climate simulator): semicustom design
  - Highly programmable core architecture using C/C++/Fortran
  - 100x better power efficiency is modest compared to the demonstrated capability of more specialized approaches!
32. Climate Strawman System Design in 2008
- Design the system around the requirements of the massively parallel application
  - Example: kilometer-scale climate model application
- We examined three different approaches:
  - AMD Opteron: commodity approach; lower efficiency for scientific applications offset by the cost efficiencies of the mass market
  - BlueGene: generic embedded processor core with customized system-on-chip (SoC) services to improve power efficiency for scientific applications
  - Tensilica: customizing the embedded CPU as well as the SoC provides further power efficiency benefits while maintaining programmability
Solve an exascale problem without building an exaflop/s machine!
33. Climate System Design Concept: Strawman Design Study
From Chris Rowen, Tensilica
10 PF sustained, 120 m², < 3 MW, < $75M
34. Automatic Processor Generation (Example from Existing Tensilica Design Flow)
Application-optimized processor implementation (RTL/Verilog)
Components: base CPU, OCD, application datapaths, timer, cache, FPU, extended registers
- Processor configuration:
  - Select from a menu
  - Automatic instruction discovery (XPRES compiler)
  - Explicit instruction description (TIE)
Build with any process in any fab (costs ~$1M)
Tailored SW tools: compiler, debugger, simulators, Linux, and other OS ports (automatically generated together with the core)
35. Impact on the Broader DOE Scientific Workload
- We propose a cloud-resolving climate change simulation to illustrate our power-efficient, application-driven design methodology
- Our approach is geared toward a class of codes, not just a single code instantiation
- This methodology is broadly applicable and could be extended to other scientific disciplines
- BlueGene was originally targeted at chemistry and bioinformatics applications; the result was a very power-efficient architecture with applicability broader than the original target
36. More Info
- NERSC Science-Driven System Architecture Group
  - http://www.nersc.gov/projects/SDSA
- Power-Efficient Semi-custom Computing
  - http://vis.lbl.gov/~jshalf/SIAM_CSE07
- The View from Berkeley
  - http://view.eecs.berkeley.edu
- Memory Bandwidth
  - http://www.nersc.gov/projects/SDSA/reports/uploaded/SOS11_mem_Shalf.pdf
37. Extra
38. Consumer Electronics Convergence
From Tsugio Makimoto
39. Consumer Electronics Has Replaced PCs as the Dominant Market Force in CPU Design!!
From Tsugio Makimoto
- iPod + iTunes exceeds 50% of Apple's net profit
- Apple introduces the iPod
- Apple introduces a cell phone (iPhone)
40. Convergence of Platforms
- Multiple parallel general-purpose processors (GPPs)
- Multiple application-specific processors (ASPs)
"The processor is the new transistor." (Rowen)
41. BG/L: the Rise of the Embedded Processor?
(Chart: TOP500 aggregate Rmax in Tflop/s by architecture, 06/1993 to 06/2005, log scale. Categories: MPP, SMP, Cluster, Constellations, Single Processor, SIMD, Others, MPP embedded.)
42. Tension Between Commodity and Specialized Architecture
- Commodity components:
  - Amortize high development costs by sharing them with a high-volume market
  - Accept lower computational efficiency for much lower capital equipment costs!
- Specialization:
  - Specialize to the task in order to improve computational efficiency
  - Used very successfully by the embedded processor community
  - Not cost-effective if volume is too low
- When the cost of power exceeds capital equipment costs:
  - Commodity clusters are optimizing the wrong part of the cost model
  - Will the need for higher computational efficiency drive more specialization? (Look at the embedded market: lots of specialization.)
43. What is Happening Now?
- Moore's Law
  - Silicon lithography will improve by 2x every 18 months
  - Double the number of transistors per chip every 18 months
- CMOS power
  - Total power ≈ V²·f·C + V·I_leakage (active power + passive power)
  - As we reduce feature size, capacitance (C) decreases proportionally to transistor size
  - Enables increasing clock frequency (f) proportionally to Moore's Law lithography improvements, at the same power use
  - This is called fixed-voltage clock-frequency scaling (Borkar '99)
- Since 90 nm:
  - Can no longer take advantage of frequency scaling, because the passive power (V·I_leakage) dominates the active term (V²·f·C)
  - The result is the recent clock-frequency stall, reflected in the Patterson graph at right
SPEC_Int benchmark performance since 1978, from Patterson & Hennessy, 4th edition.
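The power identity above can be sketched numerically to show why fixed-voltage scaling worked, and then stopped. The per-generation scaling factors below are illustrative textbook values (~30% capacitance reduction, 1/0.7 frequency gain), not measured process data:

```python
# Total power P = V^2 * f * C + V * I_leak: a dynamic (switching) term plus
# a leakage term. Under classic fixed-voltage scaling, C shrinks ~30% per
# generation while f rises by 1/0.7, leaving the dynamic term unchanged.

def total_power(v, f, c, i_leak):
    active = v * v * f * c   # dynamic switching power
    passive = v * i_leak     # leakage (passive) power
    return active + passive

# One lithography generation at fixed voltage, leakage negligible:
before = total_power(v=1.0, f=1.0, c=1.0, i_leak=0.0)
after = total_power(v=1.0, f=1.0 / 0.7, c=0.7, i_leak=0.0)
print(f"dynamic power ratio across a shrink: {after / before:.2f}")

# Post-90nm picture: leakage adds a passive term that frequency and
# capacitance scaling cannot reduce, so total power grows anyway.
leaky = total_power(v=1.0, f=1.0 / 0.7, c=0.7, i_leak=0.5)
print(f"with leakage: {leaky:.2f}x the original power budget")
```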
44. What is Happening Now?
(Repeat of the previous slide, with a "We are here!" marker added at the clock-frequency stall.)
45. Some Final Comments on Convergence (Who Is in the Driver's Seat of the Multicore Revolution?)
46. Parallel Computing Everywhere: Cisco CRS-1 Terabit Router
16 clusters of 12 cores each (192 cores!), plus 16 PPE
- 188 Xtensa general-purpose processor cores per Silicon Packet Processor
- Up to 400,000 processors per system
- (This is not just about HPC!!!)
Replaces an ASIC using 188 general-purpose cores! Emulates the ASIC at nearly the same power/performance. Better power/performance than an FPGA! A new definition for "custom" in SoC.
47. Conclusions
- An enormous transition is underway that affects all sectors of the computing industry
  - Motivated by power limits
  - Proceeding before the emergence of a parallel programming model
  - Will lead to a new era of architectural exploration, given the uncertainties about programming and execution models (and we MUST explore!)
- Need to get involved now:
  - 3-5 years for new hardware designs to emerge
  - 3-5 years of lead time for the new software ideas necessary to support the new hardware to emerge
  - 5 MORE years to general adoption of the new software
48. Interconnect Design Considerations for Massive Concurrency
- Application studies provide insight into the requirements for interconnects (both on-chip and off-chip)
  - On-chip interconnect is 2D planar (a crossbar won't scale!)
  - Sparse connectivity for the dwarfs: a crossbar is overkill
  - No single best topology
- A bandwidth-oriented network for data
  - Most point-to-point messages exhibit a sparse topology and are bandwidth-bound
- A separate latency-oriented network for collectives
  - E.g., Thinking Machines CM-5, Cray T3D, IBM BlueGene/L
- Ultimately, need to be aware of the on-chip interconnect topology in addition to the off-chip topology
  - Adaptive topology interconnects (HFAST)
  - Intelligent task migration?
49. Reliable System Design
- The future is unreliable
  - As silicon lithography pushes toward the atomic scale, the opportunity for spurious hardware errors will increase dramatically
- Reliability of a system is not necessarily proportional to the number of cores in the system
  - Reliability is proportional to the number of sockets in the system (not cores per chip)
  - At LLNL, BG/L has a longer MTBF than Purple despite having 12x more processor cores
  - Integrating more peripheral devices onto a single chip (e.g., caches, memory controller, interconnect) can further reduce chip count and increase reliability (system-on-chip/SoC)
- A key limiting factor is the software infrastructure
  - Software was designed assuming perfect data integrity (but that is not a multicore issue)
  - Software was written with an implicit assumption of smaller concurrency (1M cores was not part of the original design assumptions)
  - Requires fundamental rethinking of OS and math library design assumptions
50. Operating Systems for CMP
- Old OS assumptions are bogus for hundreds of cores!
  - They assume a limited number of CPUs that must be shared
  - Old OS: time-multiplexing (context switching and cache pollution!)
  - New OS: spatial partitioning
- Greedy allocation of finite I/O device interfaces (e.g., 100 cores go after the network interface simultaneously)
  - Old OS: first process to acquire the lock gets the device (resource/lock contention! nondeterministic delay!)
  - New OS: QoS management for symmetric device access
- Background task handling via threads and signals
  - Old OS: interrupts and threads (time-multiplexing) (inefficient!)
  - New OS: side-cores dedicated to DMA and async I/O
- Fault isolation
  - Old OS: CPU failure -> kernel panic (will happen with increasing frequency in future silicon!)
  - New OS: CPU failure -> partition restart (partitioned device drivers)
- Old OS: invoked for any interprocessor communication or scheduling, vs. direct HW access
51. I/O for Massive Concurrency
- Scalable I/O for massively concurrent systems!
- Many issues with coordinating access to disk within a node (on-chip or CMP)
- The OS will need to devote more attention to QoS for cores competing for a finite resource; mutex locks and greedy resource allocation policies will not do! (Otherwise it is rugby, where the device is the ball.)
52. Intel
53. Chris Rowen Data
54. Increasing Blue Gene Impact
- SC2005 Gordon Bell Award: 101.7 Tflop/s on a real materials science simulation
- Recently exceeding 200 Tflop/s sustained
- Sweep of all four HPC Challenge class 1 benchmarks:
  - G-HPL (259 Tflop/s), G-RandomAccess (35 GUPS), EP-STREAM (160 TB/s), and G-FFT (2.3 Tflop/s)
- Over 80 large-scale applications ported and running on BG/L
27.6 kW power consumption per rack (max); 7 kW power consumption (idle)
Slide adapted from Rick Stevens, ANL
55. Future Scaling without Innovation
If we scale current peak performance numbers for the various architectures, allowing system peak to double every 18 months: trouble ahead.
Slide adapted from Rick Stevens, ANL
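The doubling assumption above is easy to make concrete. A minimal sketch (the 1 Pflop/s baseline is an arbitrary illustrative value, not a figure from the slide):

```python
# Project system peak forward under the "business as usual" assumption of
# doubling every 18 months, with no architectural innovation.

def projected_peak(base_pflops, years, doubling_months=18):
    """Peak after `years`, doubling every `doubling_months` months."""
    return base_pflops * 2 ** (years * 12 / doubling_months)

base = 1.0  # assumed 1 Pflop/s baseline
for years in (3, 6, 9):
    print(f"after {years} years: {projected_peak(base, years):.0f} Pflop/s")
```

The trouble the slide flags is that, as shown earlier in the deck, power and cost under current design trends grow along with peak rather than holding steady.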
56. Projected Electricity Use: Various Scenarios, 2007-2011
Green Grid / DOE energy savings goal: 10.7 billion kWh/yr by 2011
Source: "Report to Congress on Server and Data Center Energy Efficiency, Public Law 109-431," US EPA, August 2, 2007
57. Petascale Architectural Exploration (Back-of-the-Envelope Calculation)
- Software challenges (at all levels) are a tremendous obstacle for any of these approaches
- Unprecedented levels of concurrency are required
- Unprecedented levels of power are required if we adopt the conventional route
- The embedded route offers tractable power, but daunting concurrency!
- This only gets us to 10 petaflops peak; a 200 PF system is needed to meet the application's sustained performance requirement, so cost and power are likely to be 10x-20x more