Ultra-Efficient Scientific Computing: More Science, Less Power
John Shalf, Leonid Oliker, Michael Wehner (PowerPoint presentation)

Transcript and Presenter's Notes

1
Ultra-Efficient Scientific Computing: More Science, Less Power
John Shalf, Leonid Oliker, Michael Wehner, Kathy Yelick
RAMP Retreat, January 16, 2008
2
End of Dennard Scaling
  • New Constraints
  • Power limits clock rates
  • Cannot squeeze more performance from ILP (complex
    cores) either!
  • But Moore's Law continues!
  • What to do with all of those transistors if
    everything else is flat-lining?
  • Now, cores per chip doubles every 18 months
    instead of clock frequency!
  • No more free lunch for performance improvement!

Figure courtesy of Kunle Olukotun, Lance Hammond,
Herb Sutter, and Burton Smith
3
ORNL Computing Power and Cooling 2006 - 2011
  • Immediate need to add 8 MW to prepare for 2007
    installs of new systems
  • NLCF petascale system could require an additional
    10 MW by 2008
  • Need total of 40-50 MW for projected systems by
    2011
  • Numbers are for the computers only; add 75% for cooling
  • Cooling will require 12,000-15,000 tons of
    chiller capacity

YIKES!
(Chart: projected annual electricity costs of $3M, $9M, $17M, $23M, and $31M; a cost sketch follows below)
Cost estimates based on $0.05/kWh
Data taken from Energy Management System-4
(EMS4). EMS4 is the DOE corporate system for
collecting energy information from the sites.
EMS4 is a web-based system that collects energy
consumption and cost information for all energy
sources used at each DOE site. Information is
entered into EMS4 by the site and reviewed at
Headquarters for accuracy.
OAK RIDGE NATIONAL LABORATORY U. S. DEPARTMENT OF
ENERGY
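As a back-of-envelope check (an added sketch, not part of the original slide), the chart's annual cost figures follow directly from the stated $0.05/kWh rate; the facility power levels fed in below are the MW figures from the bullets above, treated as assumptions.

# Annual electricity cost estimate from the slide's $0.05/kWh figure.
# The IT loads below are the MW numbers quoted in the bullets, used as
# illustrative inputs rather than exact facility data.
RATE_PER_KWH = 0.05        # $/kWh, as stated on the slide
HOURS_PER_YEAR = 8760
COOLING_OVERHEAD = 0.75    # slide: "add 75% for cooling"

def annual_cost_millions(mw, include_cooling=False):
    """Annual cost in millions of dollars for a given IT load in MW."""
    kw = mw * 1000.0
    if include_cooling:
        kw *= 1.0 + COOLING_OVERHEAD
    return kw * HOURS_PER_YEAR * RATE_PER_KWH / 1e6

for mw in (8, 18, 40):   # assumed IT loads (MW) spanning the 2007-2011 range
    print(f"{mw:>3} MW IT load: ${annual_cost_millions(mw):.1f}M/yr computers only, "
          f"${annual_cost_millions(mw, True):.1f}M/yr with cooling")
# 40 MW works out to ~$17.5M/yr for the computers alone and ~$31M/yr with the
# 75% cooling overhead, consistent with the largest values in the chart.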
4
Top500 Estimated Power Requirements
5
Power is an Industry Wide Problem
"Hiding in Plain Sight, Google Seeks More Power,"
by John Markoff, New York Times, June 14, 2006
New Google plant in The Dalles, Oregon, from the
NYT, June 14, 2006
6
Cost of Power Will Dominate, and Ultimately Limit
Practical Scale of Future Systems
Unrestrained IT power consumption could eclipse
hardware costs and put great pressure on
affordability, data center infrastructure, and
the environment.
Source: Luiz André Barroso (Google), "The Price
of Performance," ACM Queue, Vol. 2, No. 7, pp.
48-53, September 2005. (Modified with
permission.)
7
Ultra-Efficient Computing: 100x over Business As
Usual
  • Cooperative effort we call science-driven system
    architecture
  • Effective future exascale systems must be
    developed in the context of application
    requirements
  • Radically change HPC system development via
    application-driven hardware/software co-design
  • Achieve 100x power efficiency and 100x capability
    of mainstream HPC approach for targeted high
    impact applications
  • Accelerate development cycle for exascale HPC
    systems
  • Approach is applicable to numerous scientific
    areas in the DOE Office of Science
  • Proposed pilot application: ultra-high resolution
    climate change simulation

8
New Design Constraint: POWER
  • Transistors are still getting smaller
  • Moore's Law is alive and well
  • But Dennard scaling is dead!
  • No power efficiency improvements with smaller
    transistors
  • No clock frequency scaling with smaller
    transistors
  • All magical improvement of silicon goodness has
    ended
  • Traditional methods for extracting more
    performance are well-mined
  • Cannot expect exotic architectures to save us
    from the power wall
  • Even resources of DARPA can only accelerate
    existing research prototypes (not magic new
    technology)!

9
Estimated Exascale Power Requirements
  • LBNL IJHPCA Study for 1/5 Exaflop for Climate
    Science in 2008
  • Extrapolation of Blue Gene and AMD design trends
  • Estimate 20 MW for BG and 179 MW for AMD
  • DOE E3 Report
  • Extrapolation of existing design trends to
    exascale in 2016
  • Estimate 130 MW
  • DARPA Study
  • More detailed assessment of component
    technologies
  • Estimate 20 MW just for memory alone, 60 MW
    aggregate extrapolated from current design trends
  • Baltimore Sun article (Jan 23, 2007): NSA drawing
    65-75 MW in Maryland
  • Crisis: Baltimore Gas & Electric does not have
    sufficient power for the city of Baltimore!
  • Expected to increase by 10-15MW per year!
  • The current approach is not
    sustainable!

10
Path to Power Efficiency: Reducing Waste in
Computing
  • Examine methodology of low-power embedded
    computing market
  • Optimized for low power, low cost, and high
    computational efficiency
  • Years of research in low-power embedded
    computing have shown only one design technique to
    reduce power: reduce waste.
    (Mark Horowitz, Stanford University / Rambus Inc.)
  • Sources of Waste
  • Wasted transistors (surface area)
  • Wasted computation (useless
    work/speculation/stalls)
  • Wasted bandwidth (data movement)
  • Designing for serial performance

11
Our New Design Paradigm Application-Driven HPC
  • Identify high-impact exascale scientific
    applications
  • Tailor system architecture to highly parallel
    applications
  • Co-design algorithms and software together with
    the hardware
  • Enabled by hardware emulation environments
  • Supported by auto-tuning for code generation

12
Designing for Efficiency is Application-Class
Specific
13
Processor Power and Performance: Embedded
Application-Specific Cores
Courtesy of Chris Rowen, Tensilica Inc.
Performance on EEMBC benchmarks aggregate for
Consumer, Telecom, Office, Network, based on
ARM1136J-S (Freescale i.MX31), ARM1026EJ-S,
Tensilica Diamond 570T, T1050 and T1030, MIPS
20K, NEC VR5000. MIPS M4K, MIPS 4Ke, MIPS 4Ks,
MIPS 24K, ARM 968E-S, ARM 966E-S, ARM926EJ-S,
ARM7TDMI-S scaled by ratio of Dhrystone MIPS
within architecture family. All power figures
from vendor websites, 2/23/2006.
14
How Small Is Small?
  • Power5 (Server): 389 mm², 120 W @ 1900 MHz
  • Intel Core2 sc (Laptop): 130 mm², 15 W @ 1000 MHz
  • PowerPC450 (BlueGene/P): 8 mm², 3 W @ 850 MHz
  • Tensilica XTensa DP (cell phones): 0.8 mm², 0.09 W @ 650 MHz

(Die photos to scale: Tensilica DP, PPC450, Intel Core2, Power 5)
Each small core operates at 1/3 to 1/10th the efficiency
of the largest chip, but you can pack 100x more cores
onto a chip and consume 1/20th the power (see the
sketch below)!
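A quick illustrative calculation (a sketch using only the die areas and power figures listed above, ignoring the per-core efficiency derating) shows where this claim comes from:

# Compare cores-per-die-area and power using the figures on this slide.
# Purely illustrative arithmetic; the "1/3 to 1/10th efficiency" derating
# per core is not modeled here.
chips = {
    # name: (die area in mm^2, power in watts)
    "Power5":         (389.0, 120.0),
    "Intel Core2 sc": (130.0, 15.0),
    "PowerPC450":     (8.0,   3.0),
    "XTensa DP":      (0.8,   0.09),
}

ref_area, ref_power = chips["Power5"]
for name, (area, power) in chips.items():
    cores_in_ref_area = ref_area / area            # cores that fit in a Power5-sized die
    power_for_those = cores_in_ref_area * power    # aggregate power of those cores
    print(f"{name:14s}: {cores_in_ref_area:7.1f} cores in 389 mm², "
          f"{power_for_those:6.1f} W total (vs. one Power5 at 120 W)")
# XTensa DP: ~486 cores fit in a Power5-sized die and together draw ~44 W,
# i.e. two orders of magnitude more cores at roughly a third of the power.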
15
Chris Rowen Data
16
Intel
17
Partnerships for Power-Efficient Computing
  • Identify high-impact exascale Office of Science
    projects!
  • Embark on targeted program of tightly coupled
    hardware/software co-design
  • Impossible using the typical two-year hardware
    lead times
  • Break slow feedback loop for system designs via
    RAMP hardware emulation platform and auto-tuned
    code generation
  • Technology partners
  • UC Berkeley: K. Yelick, J. Wawrzynek, K.
    Asanovic, K. Keutzer
  • Stanford University / Rambus Inc.: M. Horowitz
  • Tensilica Inc.: C. Rowen
  • Pilot application: kilometer-scale climate model
  • Provides important answers to questions with
    multi-trillion-dollar ramifications
  • Climate community partners: Michael Wehner, Bill
    Collins, David Randall, et al.

18
Cloud System Resolving Climate Simulation
  • A major source of errors in climate models is
    poor cloud simulation
  • At 1 km horizontal resolution, cloud systems
    can be resolved
  • Requires significant algorithm work and
    unprecedented concurrencies
  • Dave Randall's SciDAC-funded effort at Colorado
    State University offers an algorithm for this
    regime
  • Icosahedral grid is highly uniform
  • Amenable to massively concurrent architectures
    composed of power-efficient embedded cores

19
Effects of Finer Resolutions
Duffy, et al.
Enhanced resolution of mountains yields model
improvements at larger scales
20
Pushing Current Model to High Resolution
20 km resolution produces reasonable tropical
cyclones
21
Kilometer-scale fidelity
  • Current cloud parameterizations break down
    somewhere around 10km
  • Deep convective processes responsible for
    moisture transport from near surface to higher
    altitudes are inadequately represented at current
    resolutions
  • Assumptions regarding the distribution of cloud
    types become invalid in the Arakawa-Schubert
    scheme
  • Uncertainty in short and long term forecasts can
    be traced to these inaccuracies
  • However, at 2 or 3km, a radical reformulation of
    atmospheric general circulation models is
    possible
  • Cloud system resolving models replace cumulus
    convection and large scale precipitation
    parameterizations.
  • Will this lead to better global cloud
    distributions?

22
Extrapolating fvCAM to km Scale
  • fvCAM: NCAR Community Atmospheric Model version
    3.1
  • Atmospheric component of the fully coupled climate
    model, CCSM3.0
  • Finite Volume hydrostatic dynamics (Lin-Rood)
  • Parameterized physics is the same as in the spectral
    version
  • We use fvCAM as a tool to estimate future
    computational requirements.
  • Major algorithm components of fvCAM (see the
    scaling sketch after this list):
  • Dynamics - solves atmospheric motion
    (Navier-Stokes fluid dynamics)
  • Ops ~ O(mn²); time step determined by the Courant
    (CFL) condition
  • Time step depends on horizontal resolution (n)
  • Physics - parameterized external processes
    relevant to the state of the atmosphere
  • Ops ~ O(mn); time step can remain constant, Δt ≈
    30 minutes
  • Not subject to the CFL condition
  • Filtering
  • Ops ~ O(m·log(m)·n²); addresses high-aspect-ratio
    cells at the poles via FFT
  • Allows violation of the overly restrictive Courant
    condition near the poles
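A minimal back-of-envelope sketch (an illustration consistent with the bullets above, not code from the study) of why the dynamics term comes to dominate as resolution increases:

# Illustrative scaling only: relative growth of fvCAM dynamics and physics
# costs when refining from a 200 km to a 1.5 km horizontal mesh, assuming
# cost(dynamics) ~ cells * steps with a CFL-limited step (dt ~ dx) and
# cost(physics) ~ cells with a fixed 30-minute step, per the bullets above.
coarse_dx_km = 200.0
fine_dx_km = 1.5

refinement = coarse_dx_km / fine_dx_km       # ~133x finer in each direction
cell_growth = refinement ** 2                # horizontal cells: ~17,800x
step_growth = refinement                     # CFL: ~133x more dynamics steps

dynamics_growth = cell_growth * step_growth  # ~2.4 million-fold
physics_growth = cell_growth                 # ~17,800-fold (dt unchanged)

print(f"dynamics cost growth   : {dynamics_growth:,.0f}x")
print(f"physics cost growth    : {physics_growth:,.0f}x")
print(f"dynamics/physics ratio : {dynamics_growth / physics_growth:,.0f}x")
# The extra ~133x from the Courant condition is why dynamics dominates while
# the physics (and filtering) overheads become relatively negligible at 1.5 km.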

23
Extrapolation to km-Scale
Theoretical scaling behavior matches
experimental measurements
By extrapolating out to 1.5 km, we see that dynamics
dominates the calculation time while the physics and
filtering overheads become negligible
24
Scaling Processor Performance Requirements
  • A practical constraint is that the number of
    subdomains is limited to be less than or equal to
    the number of horizontal cells
  • The current 1D approach is limited to only
    4000 subdomains at 1 km
  • Would require 1 Teraflop/s per subdomain using this
    approach!
  • Number of 2D subdomains estimated using 3x3 or
    10x10 cells
  • Can utilize millions of subdomains
  • Assuming 10x10x10 cells (given 100 vertical
    layers): ~20M subdomains (see the arithmetic
    sketch after this list)
  • 0.5 Gflop/s per processor would achieve a 1000x
    speedup over real time
  • Vertical solution requires high communication
    (aided with multi-core/SMP)
  • This is a lower bound in the absence of
    communication costs and load imbalance
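The headline figures here (and on the strawman slide later) follow from simple arithmetic on the .015° x .02° x 100-level mesh quoted in this talk; a hedged sketch, not code from the study:

# Back-of-envelope decomposition arithmetic; inputs are the mesh and
# subdomain sizes quoted on these slides, treated as round numbers.
lon_cells = 360.0 / 0.02      # 18,000 cells in longitude (0.02 degree spacing)
lat_cells = 180.0 / 0.015     # 12,000 cells in latitude (0.015 degree spacing)
levels = 100                  # vertical levels

horiz_cells = lon_cells * lat_cells              # ~216 million columns
horiz_subdomains = horiz_cells / (10 * 10)       # 10x10 cells each -> ~2 million
vert_subdomains = levels / 10                    # 10 levels each -> 10
processors = horiz_subdomains * vert_subdomains  # ~20 million subdomains

sustained = processors * 0.5e9                   # 0.5 Gflop/s sustained each
memory_per_proc = 100e12 / processors            # 100 TB total memory

print(f"horizontal subdomains : {horiz_subdomains / 1e6:.1f} million")
print(f"processors            : {processors / 1e6:.1f} million")
print(f"sustained performance : {sustained / 1e15:.1f} Pflop/s")
print(f"memory per processor  : {memory_per_proc / 1e6:.1f} MB")
# ~2.2M horizontal subdomains x 10 vertical domains ~= 20M processors,
# ~10 Pflop/s sustained, ~5 MB per processor, matching the strawman slide.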

25
Memory Scaling Behavior
  • Memory estimate at km-scale is about 25 TB total
  • 100 TB total with 100 vertical levels
  • Total memory requirement is independent of the domain
    decomposition
  • Due to the Courant condition, the operation count
    scales at a greater rate than the mesh cells - thus
    a relatively low per-processor memory requirement
  • Memory bytes per flop drop from 0.7 for the 200 km
    mesh to 0.009 for the 1.5 km mesh (see the sketch
    after this list)
  • Using the current 1D approach requires 6 GB per
    processor
  • The 2D approach requires only 5 MB per processor
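The collapse in bytes-per-flop follows from the same Courant argument; a rough scaling sketch (constants ignored, so it reproduces only the order of magnitude of the quoted 0.009):

# Rough scaling argument: memory grows with the number of cells (~ n^2 for a
# horizontal refinement factor n), while the flop rate needed for a fixed
# time-to-solution grows ~ n^3 because the CFL condition also shrinks the
# time step. Bytes/flop therefore falls roughly as 1/n.
coarse_dx_km, fine_dx_km = 200.0, 1.5
bytes_per_flop_coarse = 0.7               # value quoted for the 200 km mesh

n = coarse_dx_km / fine_dx_km             # ~133x refinement per direction
bytes_per_flop_fine = bytes_per_flop_coarse / n

print(f"predicted bytes/flop at 1.5 km: {bytes_per_flop_fine:.4f}")
# ~0.005 from this crude 1/n scaling, the same order of magnitude as the
# 0.009 quoted on the slide (which includes terms this sketch ignores).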

26
Interconnect Requirements
Data assumes a 2D 10x10 decomposition
where only 10% of the calculation is devoted to
communication
  • Three factors cause sustained performance to be
    lower than peak:
  • Single processor performance, interprocessor
    communication, load balancing
  • In the 2D case, message sizes are independent of
    horizontal resolution; in the 1D case, however,
    communication contains ghost cells over the
    entire range of longitudes
  • Assuming (pessimistically) communication only
    occurs during 10% of the calculation - not over the
    entire (100%) interval - increases bandwidth
    demands 10x
  • The 2D 10x10 case requires a minimum of 277 MB/s
    bandwidth and a maximum of 18 μs latency (a
    parametric sketch follows this list)
  • The 1D case would require a minimum of 256 GB/s
    bandwidth
  • Note that the hardware/algorithm ability to
    overlap computation with communication would
    decrease interconnect requirements
  • Load balance is an important issue, but is not
    examined in our study
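A parametric sketch of where a per-processor bandwidth number of this magnitude comes from. The halo width, variable count, and step rate below are placeholders invented for illustration; the study's 277 MB/s figure rests on its own detailed operation counts.

# Parametric ghost-cell exchange estimate (all inputs below are assumptions
# for illustration, not values from the study).
subdomain = (10, 10, 10)   # cells per subdomain (x, y, z), per the slide
ghost_width = 1            # assumed halo width in cells
variables = 5              # assumed number of prognostic variables exchanged
bytes_per_value = 8        # double precision
steps_per_second = 1000    # assumed dynamics steps per wall-clock second
                           # (set by the CFL step and the 1000x-realtime goal)
comm_fraction = 0.10       # only 10% of the time is available for messaging

# Four nearest-neighbor faces, each 10 x 10 cells of halo data per exchange.
face_cells = subdomain[0] * subdomain[2]
bytes_per_step = 4 * face_cells * ghost_width * variables * bytes_per_value

required_bw = bytes_per_step * steps_per_second / comm_fraction
print(f"halo bytes per step: {bytes_per_step / 1e3:.1f} KB")
print(f"required bandwidth : {required_bw / 1e6:.0f} MB/s per processor")
# With these placeholder inputs: 16 KB/step -> ~160 MB/s, the same order of
# magnitude as the slide's 277 MB/s minimum.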

27
Communication Topology
28
New Discretization for Massive Parallelism
  • Latitude-longitude based algorithm would not
    scale to 1km
  • Filtering cost would be only 7% of the calculation
  • However, the semi-Lagrangian advection algorithm
    breaks down
  • Grid cell aspect ratio at the pole is ~10,000 to 1
    (see the sketch below)!
  • The advection time step is problematic at this scale
  • Ultimately requires new discretization for
    atmosphere model
  • Must expose sufficient parallelism to exploit
    power-efficient design
  • Investigating Cubed Sphere (NOAA) and Icosahedral
    (Randall code)

(Grid illustrations: current latitude-longitude mesh, cubed sphere, icosahedral)
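A small geometric sketch of why the polar aspect ratio blows up on the .015° x .02° latitude-longitude mesh (illustrative only; the exact value depends on where in the polar cell you measure):

# Aspect ratio of lat-lon grid cells approaching the pole (geometry only).
import math

dlat_deg, dlon_deg = 0.015, 0.02   # mesh spacing from the strawman slide
R = 6371.0                         # Earth radius, km

def cell_dimensions_km(lat_deg):
    """Meridional and zonal extent of a cell centered at lat_deg."""
    meridional = R * math.radians(dlat_deg)
    zonal = R * math.radians(dlon_deg) * math.cos(math.radians(lat_deg))
    return meridional, zonal

for lat in (0.0, 60.0, 89.0, 90.0 - dlat_deg / 2):   # equator ... last row
    ns, ew = cell_dimensions_km(lat)
    print(f"lat {lat:8.4f}: {ns:.3f} km x {ew:.6f} km, aspect ~ {ns / ew:,.1f}:1")
# At the equator the cells are ~1.7 km x 2.2 km; in the last row next to the
# pole the zonal width collapses and the aspect ratio climbs into the
# thousands, which is the pathology the "~10,000 to 1" bullet refers to.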
29
Strawman 1km Climate Computer
  • .015° x .02° x 100L mesh (1.5 km), run at 1000x
    real time
  • 10 Petaflops sustained
  • 100-200 Petaflops peak
  • 100 Terabytes total memory
  • Only 5 MB memory per processor
  • 5 GB/s local memory performance per domain (1
    byte/flop)
  • 2 million horizontal subdomains
  • 10 vertical domains (assume fast vertical
    communication)
  • 20 million processors at 500 Mflop/s each, sustained
  • 200 MB/s in four nearest neighbor directions
  • Tight coupling of communication in vertical
    dimension

We now compare this against available technology in the
current generation of HPC systems
30
Estimation of 1 km Climate Model Computational
Requirements
  • We have performed a detailed analysis of
    kilometer-scale climate model resource
    requirements
  • Paper in International Journal of High
    Performance Computing Applications
  • Equations of motion dominate at ultra-high
    resolutions because of Courant stability
    condition
  • Requires the model to run 1000x faster than real
    time (minimum)
  • A truly exascale class scientific problem
  • About 2 billion icosahedral points
  • 20 million processors with modest vertical
    parallelization
  • Modest 0.5 gigaflops/processor with 5 MB memory
    per processor
  • Modest 200 MB/s comm bandwidth to nearest neighbors

31
Customization Continuum
  • Application-driven architecture does NOT
    necessitate a special-purpose machine!
  • D.E. Shaw System: semicustom design with some
    custom elements
  • Uses fully programmable cores with full-custom
    co-processors to achieve efficiency (1 megawatt)
  • Simulates 100x-1000x longer timescales than ANY
    feasible HPC system
  • Programmability broadens application reach (but
    narrower than our approach)
  • MD-GRAPE: full-custom ASIC design
  • 1 petaflop performance for one application using
    260 kilowatts
  • Cost: $9M from concept to implementation
  • Application-Driven Architecture (Climate
    Simulator): semicustom design
  • Highly programmable core architecture using
    C/C++/Fortran
  • 100x better power efficiency is modest compared
    to the demonstrated capability of more specialized
    approaches!

32
Climate Strawman System Design in 2008
  • Design system around the requirements of the
    massively parallel application
  • Example kilometer scale climate model
    application
  • We examined three different approaches:
  • AMD Opteron: commodity approach; lower efficiency
    for scientific applications offset by cost
    efficiencies of the mass market
  • BlueGene: generic embedded processor core with
    customized system-on-chip (SoC) services to
    improve power efficiency for scientific
    applications
  • Tensilica: customized embedded CPU as well as
    SoC, provides further power efficiency benefits
    but maintains programmability

Solve an exascale problem without building an
exaflop/s machine!
33
Climate System Design Concept: Strawman Design
Study
From Chris Rowen, Tensilica
10 PF sustained, 120 m², <3 MW, <$75M
34
Automatic Processor Generation (Example from
Existing Tensilica Design Flow)
Application-optimized processor implementation
(RTL/Verilog)
(Block diagram: base CPU, OCD, application datapaths, timer, cache, FPU, extended registers)
  • Processor configuration:
  • Select from menu
  • Automatic instruction discovery (XPRES Compiler)
  • Explicit instruction description (TIE)

Build with any process in any fab (costs ~$1M)
Tailored SW tools: compiler, debugger,
simulators, Linux, other OS ports (automatically
generated together with the core)
35
Impact on Broader DOE Scientific Workload
  • We propose a cloud resolving climate change
    simulation to illustrate our power-efficient,
    application-driven design methodology
  • Our approach is geared for a class of codes, not
    just for a single code instantiation
  • This methodology is broadly applicable and could
    be extended to other scientific disciplines
  • BlueGene was originally targeted at chemistry and
    bioinformatics applications; the result was a very
    power-efficient architecture whose applicability was
    broader than the original target

36
More Info
  • NERSC Science Driven System Architecture Group
  • http://www.nersc.gov/projects/SDSA
  • Power Efficient Semi-custom Computing
  • http://vis.lbl.gov/jshalf/SIAM_CSE07
  • The View from Berkeley
  • http://view.eecs.berkeley.edu
  • Memory Bandwidth
  • http://www.nersc.gov/projects/SDSA/reports/uploaded/SOS11_mem_Shalf.pdf

37
Extra
38
Consumer Electronics Convergence
From Tsugio Makimoto
39
Consumer Electronics has Replaced PCs as the
Dominant Market Force in CPU Design!!
From Tsugio Makimoto
iPod + iTunes exceeds 50% of Apple's net profit
Apple introduces iPod
Apple introduces cell phone (iPhone)
40
Convergence of Platforms
  • Multiple parallel general-purpose processors
    (GPPs)
  • Multiple application-specific processors (ASPs)

"The processor is the new transistor" (Rowen)
41
BG/L: the Rise of the Embedded Processor?
(Chart: TOP500 aggregate Rmax (Tflop/s) by architecture, 06/1993 - 06/2005;
categories: MPP, SMP, Cluster, Constellations, Single Processor, SIMD,
Others, MPP embedded)
42
Tension Between Commodity and Specialized
Architecture
  • Commodity Components
  • Amortize high development costs by sharing costs
    with high volume market
  • Accept lower computational efficiency for much
    lower capital equipment costs!
  • Specialization
  • Specialize to task in order to improve
    computational efficiency.
  • Specialization used very successfully by the embedded
    processor community
  • Not cost effective if volume is too low.
  • When the cost of power exceeds capital equipment
    costs, commodity clusters are optimizing the wrong
    part of the cost model
  • Will the need for higher computational efficiency
    drive more specialization? (Look at the embedded
    market: lots of specialization.)

43
What is Happening Now?
  • Moore's Law
  • Silicon lithography will improve by 2x every 18
    months
  • Double the number of transistors per chip every
    18 months
  • CMOS Power
  • Total Power = V² · f · C + V · I_leakage
    (active power + passive power)
  • As we reduce feature size, capacitance (C)
    decreases proportionally to transistor size
  • Enables an increase of clock frequency (f)
    proportional to Moore's Law lithography
    improvements, with the same power use
  • This is called Fixed Voltage Clock Frequency
    Scaling (Borkar '99); a numerical sketch follows
    below
  • Since 90nm: in Total Power = V² · f · C + V · I_leakage,
    we can no longer take advantage of frequency scaling
    because passive power (V · I_leakage) dominates
  • The result is the recent clock-frequency stall
    reflected in the Patterson graph at right

SPEC_Int benchmark performance since 1978, from
Patterson & Hennessy, 4th edition.
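To make the fixed-voltage scaling argument concrete, here is a toy numerical sketch of the Total Power = V²·f·C + V·I_leakage model; every device parameter below is an invented round number chosen only to show the qualitative shift from active- to leakage-dominated power.

# Toy CMOS power model: P = V^2 * f * C + V * I_leak (active + passive).
# All parameter values are made-up illustrative numbers, not process data.
def total_power(v, f_ghz, c_nf, i_leak_a):
    active = v**2 * (f_ghz * 1e9) * (c_nf * 1e-9)   # switching power, watts
    passive = v * i_leak_a                          # leakage power, watts
    return active, passive

cases = [
    # Older node, leakage negligible: halving C lets f double at the same
    # power (classic fixed-voltage clock-frequency scaling).
    ("older node, base    ", dict(v=1.2, f_ghz=2.0, c_nf=30, i_leak_a=0.5)),
    ("older node, 2x clock", dict(v=1.2, f_ghz=4.0, c_nf=15, i_leak_a=0.5)),
    # Post-90nm regime, leakage large: passive power rivals active power, so
    # raising f no longer comes "for free" even though C keeps shrinking.
    ("post-90nm, leaky    ", dict(v=1.0, f_ghz=4.0, c_nf=15, i_leak_a=30.0)),
]
for label, params in cases:
    active, passive = total_power(**params)
    print(f"{label}: active {active:6.1f} W, passive {passive:5.1f} W")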
44
What is Happening Now?
(Repeat of the previous slide, with an annotation added to the graph:)
We are here!
SPEC_Int benchmark performance since 1978, from
Patterson & Hennessy, 4th edition.
45
Some Final Comments on Convergence (who is in the
driver's seat of the multicore revolution?)
46
Parallel Computing Everywhere: Cisco CRS-1 Terabit
Router
(Diagram: 16 clusters of 12 cores each (192 cores!); 16 PPE)
  • 188 Xtensa general-purpose processor cores per
    Silicon Packet Processor
  • Up to 400,000 processors per system
  • (this is not just about HPC!!!)

Replaces an ASIC using 188 GP cores! Emulates the ASIC
at nearly the same power/performance. Better
power/performance than FPGA! A new definition for
"custom" in SoC.
47
Conclusions
  • Enormous transition is underway that affects all
    sectors of computing industry
  • Motivated by power limits
  • Proceeding before emergence of the parallel
    programming model
  • Will lead to new era of architectural exploration
    given uncertainties about programming and
    execution model (and we MUST explore!)
  • Need to get involved now
  • 3-5 years for new hardware designs to emerge
  • 3-5 years lead for new software ideas necessary
    to support new hardware to emerge
  • 5 MORE years to general adoption of new software

48
Interconnect Design Considerations for Massive
Concurrency
  • Application studies provide insight to
    requirements for Interconnects (both on-chip and
    off-chip)
  • On-chip interconnect is 2D planar (a crossbar won't
    scale!)
  • Sparse connectivity for dwarfs; a crossbar is
    overkill
  • No single best topology
  • A bandwidth-oriented network for data
  • Most point-to-point messages exhibit a sparse
    topology and are bandwidth bound
  • A separate latency-oriented network for collectives
  • E.g., Thinking Machines CM-5, Cray T3D, IBM
    BlueGene/L and /P
  • Ultimately, need to be aware of the on-chip
    interconnect topology in addition to the off-chip
    topology
  • Adaptive topology interconnects (HFAST)
  • Intelligent task migration?

49
Reliable System Design
  • The future is unreliable
  • As silicon lithography pushes towards the atomic
    scale, the opportunity for spurious hardware
    errors will increase dramatically
  • Reliability of a system is not necessarily
    proportional to the number of cores in the system
  • Reliability is proportional to the number of sockets
    in the system (not cores/chip)
  • At LLNL, BG/L has longer MTBF than Purple despite
    having 12x more processor cores
  • Integrating more peripheral devices onto a single
    chip (e.g. caches, memory controller,
    interconnect) can further reduce chip count and
    increase reliability (System-on-Chip/SOC)
  • A key limiting factor is software infrastructure
  • Software was designed assuming perfect data
    integrity (but that is not a multicore issue)
  • Software written with implicit assumption of
    smaller concurrency (1M cores not part of
    original design assumptions)
  • Requires fundamental re-thinking of OS and math
    library design assumptions

50
Operating Systems for CMP
  • Old OS assumptions are bogus for hundreds of
    cores!
  • Assumes a limited number of CPUs that must be
    shared
  • Old OS: time-multiplexing (context switching and
    cache pollution!)
  • New OS: spatial partitioning
  • Greedy allocation of finite I/O device interfaces
    (e.g. 100 cores go after the network interface
    simultaneously)
  • Old OS: first process to acquire lock gets the device
    (resource/lock contention! Nondeterministic delay!)
  • New OS: QoS management for symmetric device
    access
  • Background task handling via threads and signals
  • Old OS: interrupts and threads (time-multiplexing)
    (inefficient!)
  • New OS: side-cores dedicated to DMA and async I/O
  • Fault isolation
  • Old OS: CPU failure --> kernel panic (will happen
    with increasing frequency in future silicon!)
  • New OS: CPU failure --> partition restart
    (partitioned device drivers)
  • Old OS: invoked for any interprocessor communication
    or scheduling vs. direct HW access

51
I/O For Massive Concurrency
  • Scalable I/O for massively concurrent systems!
  • Many issues with coordinating access to disk
    within node (on chip or CMP)
  • The OS will need to devote more attention to QoS for
    cores competing for a finite resource (mutex locks
    and greedy resource allocation policies will not
    do! It is rugby, where the device is the ball)

52
Intel
53
Chris Rowen Data
54
Increasing Blue Gene Impact
  • SC 2005 Gordon Bell Award, 101.7 TF/s on a real
    materials science simulation
  • Recently exceeding 200 TF/s sustained
  • Sweep of all four HPC Challenge Class 1
    benchmarks:
  • G-HPL (259 Tflop/s), G-RandomAccess (35 GUPS),
    EP-STREAM (160 TB/s) and G-FFT (2.3 Tflop/s)
  • Over 80 large-scale applications ported and
    running on BG/L

27.6 kW power consumption per rack (max), 7 kW
power consumption (idle)
Slide adapted from Rick Stevens, ANL
55
Future Scaling without Innovation
If we scale current peak performance numbers for
various architectures, allowing system peak to
double every 18 months, there is trouble ahead.
Slide adapted from Rick Stevens, ANL
56
Projected electricity use - various scenarios,
2007-2011
Green Grid - DOE energy savings goal: 10.7
billion kWh/yr by 2011
Source: Report to Congress on Server and Data
Center Energy Efficiency, Public Law 109-431, US
EPA, August 2, 2007
57
Petascale Architectural Exploration (back-of-the-envelope
calculation)
  • Software challenges (at all levels) are a
    tremendous obstacle for any of these approaches.
  • Unprecedented levels of concurrency are required.
  • Unprecedented levels of power are required if we
    adopt the conventional route
  • The embedded route offers tractable power, but
    daunting concurrency!
  • This only gets us to 10 Petaflops peak - a
    200 PF system is needed to meet application sustained
    performance requirements, thus cost and power are
    likely to be 10x-20x more.