Transcript and Presenter's Notes

Title: What hardware accelerators are you using/evaluating?


1
Performance
  • What hardware accelerators are you
    using/evaluating?
  • Cells in a Roadrunner configuration
  • 8-way SPE threads w/ local memory; DMA and
    vector-unit programming issues, but tremendous
    flexibility (see the SIMD sketch below)
  • Fast (25.6 GB/s), large memory (4 GB or larger)
  • Augmented C language; also C, now Fortran; GNU
    and XL variants; OpenMP is new; OpenCL is being
    prototyped
  • Opterons can run the bulk of the code not needing
    acceleration; Cell-only clusters possible
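
The SPE vector-unit programming mentioned above means writing explicit SIMD code against the SPU intrinsics. A minimal sketch, assuming spu-gcc, 16-byte-aligned arrays, and n a multiple of 4 (the function name is ours, not from the talk):

    #include <spu_intrinsics.h>

    /* saxpy on one SPE's vector unit: y = a*x + y, four floats at a time.
     * Assumes x and y are 16-byte aligned and n is a multiple of 4. */
    void saxpy_simd(float a, const float *x, float *y, int n)
    {
        vector float va = spu_splats(a);        /* broadcast a into all 4 lanes */
        const vector float *vx = (const vector float *)x;
        vector float *vy = (vector float *)y;
        for (int i = 0; i < n / 4; i++)
            vy[i] = spu_madd(va, vx[i], vy[i]); /* fused multiply-add per vector */
    }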

2
Performance
  • What hardware accelerators are you
    using/evaluating? (several years ago)
  • GPUs (pre-CUDA, pre-Tesla)
  • Brook, Scout (a LANL data-parallel language)
  • Only 32-bit at the time; limited memory; everything
    must be cast as a data-parallel problem
  • No ECC memory; insufficient parity/ECC
    protection of data paths and logic
  • Others at LANL are still working in this area
    (including Tesla and CUDA)
  • ClearSpeed (several years ago)
  • Earliest ClearSpeed boards, before the Advance
    families
  • Augmented C language; 96 SIMD PEs
  • Everything is done as long SIMD data-parallel
    operations, in sync
  • Low power
  • FPGAs (HDL, several years ago)
  • Programming is hard -- very hard
  • Logic space limited the number of 64-bit ops
  • Fast but small SRAM; external DRAM of modest
    size but no faster than the CPUs'
  • One algorithm at a time, so a significant impact
    when used for multi-physics
  • Low power

3
Performance
  • Describe the applications that you are porting to
    accelerators?
  • MD (materials), laser-plasma PIC, IMC X-ray
    (particle) transport, GROMACS, n-body
    universe/galaxies, DNS turbulence, supernovae,
    HIV genealogy, nanowire long-time-scale MD
  • Ocean circulation, wildfires, discrete social
    simulations, clouds/rain, influenza spread,
    plasma turbulence, plasma sheaths, fluid
    instabilities
  • My personal observations:
  • Particle methods are generally easiest
  • Codes with good characteristics:
  • A few computationally intense algorithms
  • Pre-existing or obvious fine-grain parallel
    work units
  • C language (versus Fortran or highly OO C++)

4
Performance
  • Describe the kinds of speed-ups you are seeing
    (provide the basis for the comparison)?
  • 5x to 10x over a single Opteron core for
    memory-bandwidth-intensive code running at 5-10%
    of peak
  • 10x to 25x on particle methods, searches, etc.
  • How does it compare to scaling out (i.e., just
    using more X86 processors)? What are the
    bottlenecks to further performance improvements?
  • Scaling out via more sockets is better, BUT:
  • Scaling efficiencies are already a problem for
    several LANL applications running at 4,000 to
    10,000 cores; scaling out LANL-sized machines
    means real money for hardware, space, and power
  • Scaling out by multi-core is not a clear winner
  • Memory BW and cache architectures often limit
    performance, which Cells mostly get around
  • Memory BW per core is decreasing at an inverse
    Moore's-law rate! (see the worked example below)
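
To see why memory bandwidth sets the ceiling, consider a STREAM-style triad against Cell's 25.6 GB/s memory. A back-of-the-envelope sketch; the kernel is a generic illustration, not from the talk:

    /* Triad: a[i] = b[i] + s*c[i] over doubles.
     * Per element: 2 flops and 24 bytes of traffic (read b, read c, write a).
     * At 25.6 GB/s that is at most 25.6e9 / 24 ~ 1.07e9 elements/s,
     * i.e. ~2.1 GFLOP/s total -- a ceiling shared by however many
     * cores sit behind the same memory bus. */
    void triad(double *a, const double *b, const double *c, double s, int n)
    {
        for (int i = 0; i < n; i++)
            a[i] = b[i] + s * c[i];
    }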

5
Economics
  • Describe the programming effort required to make
    use of the accelerator.
  • ½ to 1 man-year to convert a code, mostly
    dealing with data structures and threaded
    parallelism designs.
  • Lack of debugging and similar tools is like the
    earliest days of parallel computing (LANL was a
    leader then as well; remember the early PVM
    Ethernet workstation carpet clusters in the
    mid-80s, before MPPs)
  • We like to see 1-2 programming experts (PhD-level
    or equivalent) assigned to forefront-science code
    projects, which have 1 to 4 physics experts
    (PhD-level)
  • Amortization
  • Ready for the future: codes and skilled
    programmers. We expect our dual-level
    (MPI+threads) SIMD-vectorization techniques
    used for Roadrunner to pay off on future
    multi-core and many-core chips as well.
  • It's not just about running codes this year.
    Others will have to work through these new forms
    of parallelism soon.
  • We can do science now that isn't possible on
    most other machines

6
Economics
  • Compare accelerator cost to scaling out cost
  • Commodity-processor-only machines would have cost
    2x what Roadrunner did in 2006-2007 ($80M more)
  • They would have used 2x or more power ($1M per MW)
  • Significantly larger node counts cause scaling and
    reliability issues
  • Accelerators or heterogeneous chips should be
    greener
  • Ease of use issues
  • Newer Cell programming techniques (ALF, OpenMP)
    could make this easier.
  • A Cell cluster would be easier, but the PPE is
    really, really slow for non-SPU-accelerated code
    segments.
  • Not for the faint of heart, but Top20 machines
    never are

7
Futures
  • What is the future direction of hardware based
    accelerators?
  • Domain specific libraries can make them far more
    useful in those specific areas
  • Some may appear on Intel QPI or AMD HT.
  • Specialized cores will show up within commodity
    microprocessors; ignore them or use them
  • GPU-based systems will have to adopt ECC/parity
    protection
  • Convey appears to have the most viable FPGA
    approach (FPGA as compiler managed co-processor)
  • Software futures?
  • OpenCL looks promising, but doesn't address
    programming the specialized accelerator devices
    themselves
  • The uber-auto-wizard-compiler will never come
  • Heterogeneous compilers may come.
  • Debuggers and tools may come
  • What are your thoughts on what the vendors need
    to do to ensure wider acceptance of accelerators?
  • Create next generation versions and sell as
    mainstream products

8
Steps in a Cell Conversion
  • Compile and run on the PowerPC PPE
  • Identify and isolate the algorithm and data to run
    in parallel on the 8 remote SPEs
  • Compile scalar version of algorithm on SPE
  • Add SPE thread process control
  • Add DMAs
  • Use blocking DMAs at this stage just for
    functionality
  • Worry about data alignments
  • First on a single SPE, then on 8 SPEs
  • Optimize SPE code
  • SIMD; converting branches to merges (selects)
  • Add asynch double/triple buffering of DMAs
  • For Roadrunner, connect to the rest of the code on
    the Opteron via DaCS and a message relay (see the
    sketch below)
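
A minimal sketch of the thread-control and blocking-DMA steps above, using the libspe2 API on the PPE side and the MFC intrinsics on the SPE side. The embedded kernel name spu_kernel, the buffer sizes, and the doubling loop are our placeholders, not code from the talk:

    /* PPE side: spawn one pthread per SPE context. */
    #include <libspe2.h>
    #include <pthread.h>

    extern spe_program_handle_t spu_kernel;   /* embedded SPE binary (placeholder) */

    typedef struct { spe_context_ptr_t ctx; void *argp; } spe_arg_t;
    static spe_arg_t args[8];
    static float data[8][256] __attribute__((aligned(128))); /* per-SPE chunks */

    static void *run_spe(void *p)
    {
        spe_arg_t *a = p;
        unsigned int entry = SPE_DEFAULT_ENTRY;
        spe_context_run(a->ctx, &entry, 0, a->argp, NULL, NULL); /* blocks until SPE exits */
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[8];
        for (int i = 0; i < 8; i++) {
            args[i].ctx  = spe_context_create(0, NULL);
            args[i].argp = data[i];   /* effective address handed to the SPE */
            spe_program_load(args[i].ctx, &spu_kernel);
            pthread_create(&tid[i], NULL, run_spe, &args[i]);
        }
        for (int i = 0; i < 8; i++)
            pthread_join(tid[i], NULL);
        return 0;
    }

    /* SPE side: blocking DMAs, just for functionality (replace later with
     * asynchronous double/triple buffering). argp is the effective address
     * of this SPE's chunk in main memory. */
    #include <spu_mfcio.h>

    #define TAG 1
    static volatile float buf[256] __attribute__((aligned(128)));

    int main(unsigned long long speid, unsigned long long argp,
             unsigned long long envp)
    {
        mfc_get(buf, argp, sizeof(buf), TAG, 0, 0);  /* pull into local store */
        mfc_write_tag_mask(1 << TAG);
        mfc_read_tag_status_all();                   /* block until DMA completes */

        for (int i = 0; i < 256; i++)                /* compute in local store */
            buf[i] *= 2.0f;

        mfc_put(buf, argp, sizeof(buf), TAG, 0, 0);  /* push result back */
        mfc_write_tag_mask(1 << TAG);
        mfc_read_tag_status_all();
        return 0;
    }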

9
Roadrunner: LANL addressing the shock moving
through high-performance computing
  • Roadrunner is more than a petascale supercomputer
    for today's use
  • It provides a balanced platform to explore new
    algorithm designs and programming models, and to
    refresh developer skills
  • LANL has been an early adopter of
    transformational technology
  • 1970s: HPC is scalar; LANL adopts vector (Cray-1
    w/ no OS)
  • 1980s: HPC is vector; LANL adopts data parallel
    (big CM-2)
  • 2000s: HPC is multi-core clusters; LANL adopts
    hybrid (Roadrunner)

Credit to Scott Pakin, CCS-1, for this list idea
10
Perspective: Fun or Nightmare?
(Diagram: three columns, Opteron host, Cell PPC, and
Cell SPE (x8 parallel), with the numbered flow below)
  • (1) Host launches Cell code (MPI on the host; DaCS
    to the Cell)
  • (2) Host data pushed/pulled to the Cell (DaCS)
  • (3) Cell spawns parallel threads on SPEs (DMA)
  • (5a/5b) Node may need to push/pull more data
    to/from the Cell and to/from the cluster (MPI,
    DMA), or could be available for concurrent work
    during this time
  • (6) Parallel threads completed (DMA); updated data
    pushed/pulled to the Host (DaCS); Cell code
    completed (MPI)
How much can be automated in compilers or
languages? (A stub of this flow appears below.)
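
As a structural answer, here is a compilable stub of the flow above. Every helper is a hypothetical printf placeholder standing in for the real MPI, DaCS, and DMA calls, not any actual API:

    #include <stdio.h>

    /* Hypothetical stubs tracing the host <-> Cell message relay. */
    static void host_launch_cell_code(void)  { printf("(1) host launches Cell code (DaCS)\n"); }
    static void host_push_data_to_cell(void) { printf("(2) host data pushed/pulled to Cell (DaCS)\n"); }
    static void ppe_spawn_spe_threads(void)  { printf("(3) Cell spawns parallel SPE threads (DMA)\n"); }
    static void relay_more_data(void)        { printf("(5a/5b) more data to/from Cell and cluster (MPI, DMA)\n"); }
    static void spe_threads_complete(void)   { printf("(6) parallel threads completed (DMA)\n"); }
    static void host_pull_results(void)      { printf("updated data to host (DaCS); Cell code completed (MPI)\n"); }

    int main(void)
    {
        host_launch_cell_code();
        host_push_data_to_cell();
        ppe_spawn_spe_threads();
        relay_more_data();       /* or the host does concurrent cluster work */
        spe_threads_complete();
        host_pull_results();
        return 0;
    }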