Transcript and Presenter's Notes

Title: What hardware accelerators are you using/evaluating?


1
Performance
  • What hardware accelerators are you
    using/evaluating?
  • Cells in a Roadrunner configuration
  • 8-way SPE threads w/ local memory; DMA and
    vector-unit programming issues, but tremendous
    flexibility (see the SIMD sketch below)
  • Fast (25.6 GB/s), large memory (4 GB or larger)
  • Augmented C language; also C, now Fortran; GNU
    and XL variants; OpenMP is new; OpenCL is being
    prototyped
  • Opterons can run the bulk of the code not needing
    acceleration; Cell-only clusters possible
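
The SPE vector-unit programming mentioned above means writing explicit SIMD code against the SPU intrinsics. A minimal sketch, assuming spu-gcc, 16-byte-aligned arrays, and n a multiple of 4 (the function name is ours, not from the talk):

    #include <spu_intrinsics.h>

    /* saxpy on one SPE's vector unit: y = a*x + y, four floats at a time.
     * Assumes x and y are 16-byte aligned and n is a multiple of 4. */
    void saxpy_simd(float a, const float *x, float *y, int n)
    {
        vector float va = spu_splats(a);        /* broadcast a into all 4 lanes */
        const vector float *vx = (const vector float *)x;
        vector float *vy = (vector float *)y;
        for (int i = 0; i < n / 4; i++)
            vy[i] = spu_madd(va, vx[i], vy[i]); /* fused multiply-add per vector */
    }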

2
Performance
  • What hardware accelerators are you
    using/evaluating? (several years ago)
  • GPUs (pre-CUDA, pre-Tesla)
  • Brook, Scout (a LANL data-parallel language)
  • Only 32-bit at the time; limited memory; everything
    must be cast as a data-parallel problem
  • No ECC memory; insufficient parity/ECC
    protection of data paths and logic
  • Others at LANL are still working in this area
    (including Tesla and CUDA)
  • ClearSpeed (several years ago)
  • Earliest ClearSpeed boards, before the Advance
    families
  • Augmented C language; 96 SIMD PEs
  • Everything is done as long SIMD data-parallel
    operations, in sync
  • Low power
  • FPGAs (HDL, several years ago)
  • Programming is hard -- very hard
  • Logic space limited the number of 64-bit ops
  • Fast but small SRAM; external DRAM of modest
    size but no faster than the CPUs'
  • One algorithm at a time, so a significant impact
    when used for multi-physics
  • Low power

3
Performance
  • Describe the applications that you are porting to
    accelerators?
  • MD (materials), laser-plasma PIC, IMC X-ray
    (particle) transport, GROMACS, n-body
    universe/galaxies, DNS turbulence, supernovae,
    HIV genealogy, nanowire long-time-scale MD
  • Ocean circulation, wildfires, discrete social
    simulations, clouds/rain, influenza spread,
    plasma turbulence, plasma sheaths, fluid
    instabilities
  • My personal observations:
  • Particle methods are generally easiest
  • Codes with good characteristics:
  • A few computationally intense algorithms
  • Pre-existing or obvious fine-grain parallel
    work units
  • C language (versus Fortran or highly OO C++)

4
Performance
  • Describe the kinds of speed-ups you are seeing
    (provide the basis for the comparison)?
  • 5x to 10x over a single Opteron core for
    memory-bandwidth-intensive code running at 5-10%
    of peak
  • 10x to 25x on particle methods, searches, etc.
  • How does it compare to scaling out (i.e., just
    using more X86 processors)? What are the
    bottlenecks to further performance improvements?
  • Scaling out via more sockets is better, BUT:
  • Scaling efficiencies are already a problem for
    several LANL applications running at 4,000 to
    10,000 cores; scaling out LANL-sized machines
    means real money for hardware, space, and power
  • Scaling out by multi-core is not a clear winner
  • Memory BW and cache architectures often limit
    performance, which Cells mostly get around
  • Memory BW per core is decreasing at an inverse
    Moore's-law rate! (see the worked example below)
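
To see why memory bandwidth sets the ceiling, consider a STREAM-style triad against Cell's 25.6 GB/s memory. A back-of-the-envelope sketch; the kernel is a generic illustration, not from the talk:

    /* Triad: a[i] = b[i] + s*c[i] over doubles.
     * Per element: 2 flops and 24 bytes of traffic (read b, read c, write a).
     * At 25.6 GB/s that is at most 25.6e9 / 24 ~ 1.07e9 elements/s,
     * i.e. ~2.1 GFLOP/s total -- a ceiling shared by however many
     * cores sit behind the same memory bus. */
    void triad(double *a, const double *b, const double *c, double s, int n)
    {
        for (int i = 0; i < n; i++)
            a[i] = b[i] + s * c[i];
    }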

5
Economics
  • Describe the programming effort required to make
    use of the accelerator.
  • ½ to 1 man-year to convert a code, mostly
    dealing with data structures and threaded
    parallelism designs.
  • Lack of debugging and similar tools is like the
    earliest days of parallel computing (LANL was a
    leader then as well; remember the early PVM
    Ethernet workstation carpet clusters in the
    mid-80s, before MPPs)
  • We like to see 1-2 programming experts (PhD-level
    or equivalent) assigned to forefront-science code
    projects, which have 1 to 4 physics experts
    (PhD-level)
  • Amortization
  • Ready for the future: codes and skilled
    programmers. We expect our dual-level
    (MPI+threads) SIMD-vectorization techniques
    used for Roadrunner to pay off on future
    multi-core and many-core chips as well.
  • It's not just about running codes this year.
    Others will have to work through these new forms
    of parallelism soon.
  • We can do science now that isn't possible on
    most other machines

6
Economics
  • Compare accelerator cost to scaling out cost
  • Commodity-processor-only machines would have cost
    2x what Roadrunner did in 2006-2007 ($80M more)
  • They would have used 2x or more power ($1M per MW)
  • Significantly larger node counts cause scaling and
    reliability issues
  • Accelerators or heterogeneous chips should be
    greener
  • Ease of use issues
  • Newer Cell programming techniques (ALF, OpenMP)
    could make this easier.
  • A Cell cluster would be easier, but the PPE is
    really, really slow for non-SPU-accelerated code
    segments.
  • Not for the faint of heart, but Top20 machines
    never are

7
Futures
  • What is the future direction of hardware based
    accelerators?
  • Domain specific libraries can make them far more
    useful in those specific areas
  • Some may appear on Intel QPI or AMD HT.
  • Specialized cores will show up within commodity
    microprocessors; ignore them or use them
  • GPU-based systems will have to adopt ECC/parity
    protection
  • Convey appears to have the most viable FPGA
    approach (FPGA as compiler managed co-processor)
  • Software futures?
  • OpenCL looks promising, but doesn't address
    programming the specialized accelerator devices
    themselves
  • The uber-auto-wizard-compiler will never come
  • Heterogeneous compilers may come.
  • Debuggers and tools may come
  • What are your thoughts on what the vendors need
    to do to ensure wider acceptance of accelerators?
  • Create next generation versions and sell as
    mainstream products

8
Steps in a Cell Conversion
  • Compile and run on the PowerPC PPE
  • Identify and isolate the algorithm and data to run
    in parallel on the 8 remote SPEs
  • Compile scalar version of algorithm on SPE
  • Add SPE thread process control
  • Add DMAs
  • Use blocking DMAs at this stage just for
    functionality
  • Worry about data alignments
  • First on a single SPE, then on 8 SPEs
  • Optimize SPE code
  • SIMD; converting branches to merges (selects)
  • Add asynch double/triple buffering of DMAs
  • For Roadrunner, connect to the rest of the code on
    the Opteron via DaCS and a message relay (see the
    sketch below)
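
A minimal sketch of the thread-control and blocking-DMA steps above, using the libspe2 API on the PPE side and the MFC intrinsics on the SPE side. The embedded kernel name spu_kernel, the buffer sizes, and the doubling loop are our placeholders, not code from the talk:

    /* PPE side: spawn one pthread per SPE context. */
    #include <libspe2.h>
    #include <pthread.h>

    extern spe_program_handle_t spu_kernel;   /* embedded SPE binary (placeholder) */

    typedef struct { spe_context_ptr_t ctx; void *argp; } spe_arg_t;
    static spe_arg_t args[8];
    static float data[8][256] __attribute__((aligned(128))); /* per-SPE chunks */

    static void *run_spe(void *p)
    {
        spe_arg_t *a = p;
        unsigned int entry = SPE_DEFAULT_ENTRY;
        spe_context_run(a->ctx, &entry, 0, a->argp, NULL, NULL); /* blocks until SPE exits */
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[8];
        for (int i = 0; i < 8; i++) {
            args[i].ctx  = spe_context_create(0, NULL);
            args[i].argp = data[i];   /* effective address handed to the SPE */
            spe_program_load(args[i].ctx, &spu_kernel);
            pthread_create(&tid[i], NULL, run_spe, &args[i]);
        }
        for (int i = 0; i < 8; i++)
            pthread_join(tid[i], NULL);
        return 0;
    }

    /* SPE side: blocking DMAs, just for functionality (replace later with
     * asynchronous double/triple buffering). argp is the effective address
     * of this SPE's chunk in main memory. */
    #include <spu_mfcio.h>

    #define TAG 1
    static volatile float buf[256] __attribute__((aligned(128)));

    int main(unsigned long long speid, unsigned long long argp,
             unsigned long long envp)
    {
        mfc_get(buf, argp, sizeof(buf), TAG, 0, 0);  /* pull into local store */
        mfc_write_tag_mask(1 << TAG);
        mfc_read_tag_status_all();                   /* block until DMA completes */

        for (int i = 0; i < 256; i++)                /* compute in local store */
            buf[i] *= 2.0f;

        mfc_put(buf, argp, sizeof(buf), TAG, 0, 0);  /* push result back */
        mfc_write_tag_mask(1 << TAG);
        mfc_read_tag_status_all();
        return 0;
    }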

9
Roadrunner: LANL addressing the shock moving
through high-performance computing
  • Roadrunner is more than a petascale supercomputer
    for today's use
  • It provides a balanced platform to explore new
    algorithm designs and programming models, and to
    refresh developer skills
  • LANL has been an early adopter of
    transformational technology
  • 1970s: HPC is scalar; LANL adopts vector (Cray-1
    w/ no OS)
  • 1980s: HPC is vector; LANL adopts data parallel
    (big CM-2)
  • 2000s: HPC is multi-core clusters; LANL adopts
    hybrid (Roadrunner)

Credit to Scott Pakin, CCS-1, for this list idea
10
Perspective: Fun or Nightmare?
(Diagram: three columns, Opteron host, Cell PPC, and
Cell SPE (x8 parallel), with the numbered flow below)
  • (1) Host launches Cell code (MPI on the host; DaCS
    to the Cell)
  • (2) Host data pushed/pulled to the Cell (DaCS)
  • (3) Cell spawns parallel threads on SPEs (DMA)
  • (5a/5b) Node may need to push/pull more data
    to/from the Cell and to/from the cluster (MPI,
    DMA), or could be available for concurrent work
    during this time
  • (6) Parallel threads completed (DMA); updated data
    pushed/pulled to the Host (DaCS); Cell code
    completed (MPI)
How much can be automated in compilers or
languages? (A stub of this flow appears below.)
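
As a structural answer, here is a compilable stub of the flow above. Every helper is a hypothetical printf placeholder standing in for the real MPI, DaCS, and DMA calls, not any actual API:

    #include <stdio.h>

    /* Hypothetical stubs tracing the host <-> Cell message relay. */
    static void host_launch_cell_code(void)  { printf("(1) host launches Cell code (DaCS)\n"); }
    static void host_push_data_to_cell(void) { printf("(2) host data pushed/pulled to Cell (DaCS)\n"); }
    static void ppe_spawn_spe_threads(void)  { printf("(3) Cell spawns parallel SPE threads (DMA)\n"); }
    static void relay_more_data(void)        { printf("(5a/5b) more data to/from Cell and cluster (MPI, DMA)\n"); }
    static void spe_threads_complete(void)   { printf("(6) parallel threads completed (DMA)\n"); }
    static void host_pull_results(void)      { printf("updated data to host (DaCS); Cell code completed (MPI)\n"); }

    int main(void)
    {
        host_launch_cell_code();
        host_push_data_to_cell();
        ppe_spawn_spe_threads();
        relay_more_data();       /* or the host does concurrent cluster work */
        spe_threads_complete();
        host_pull_results();
        return 0;
    }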