Entering the Petaflop Era: The Architecture and Performance of Roadrunner

1
Entering the Petaflop Era: The Architecture and
Performance of Roadrunner

Kevin J. Barker, Kei Davis, Adolfy Hoisie,
Darren J. Kerbyson, Mike Lang, Scott Pakin, and
José C. Sancho
Performance and Architecture Lab (PAL)
Los Alamos National Laboratory
18 November 2008

2
Clarification

Many, many people at LANL and IBM contributed to
Roadrunner's success
LANL side: Brian Albright, Daniel Archuleta,
Kevin Barker, Brian Barrett, John Bent, Benjamin
Bergen, Kevin Bowers, Todd Bowman, Joseph
Bridges, Jeffrey Brown, Ernie Buenafe, Richard
Campion, Ralph Castain, John Cerutti, Mark
Chadwick, Hsing-Bung Chen, Phil Church, Vaughn
Clinton, Susan Coulter, Robert Cunningham, David
Daniel, Kei Davis, Nathan Debardeleben, Gabriel
de la Cruz, Nehal Desai, Guy Dimonte, Andrew
Dubois, Charles Ferenbaugh, Parks Fields, Timothy
Germann, Gary Grider, Daryl Grunau, David Gunter,
Chuck Hales, Doug Hefele, Paul Henning, Catherine
Hensley, Marissa Herrera, Stephen Hodson, Adolfy
Hoisie, Laura Hughes, Craig Idler, Jeff Inman,
Mohammed Jebbanema, Timothy Kelley, Kathleen
Kelly, Darren Kerbyson, Brett Kettering, Ken
Koch, Thomas Kwan, Michael Lang, Rick Light,
Diana Little, Josip Loncaric, Monica Lucero, Hal
Marshall, Rick Martineau, Gloria Martinez, Paul
Martinez, Benjamin McClelland, Patrick McCormick,
Michael McKay, Allen McPherson, Amy Meilander,
Sarah Michalak, Raymond Miller, Jamal Mohd-Yusof,
David Montoya, Terri Morris, John Morrison, James
Nuñez, Scott Pakin, Georgia Pedicini, Jennis
Pruett, Meghan Quist, Craig Rasmussen, Randal
Rheinheimer, Denny Rice, Rick Rivera, Bill Rust,
José Sancho, Rita Sandoval, Bob Shea, Matt
Sheats, Andrew Shewmaker, Galen Shipman, LeAnne
Silva, Randall Smith, Julianne Stidham, Joyce
Sullivan, Sriram Swaminarayan, Wayne Sweatt,
Martin Torrey, Alfred Torrez, Justin Tripp, John
Turner, Steve Turpin, Ron Velarde, Mark Vernon,
Manuel Vigil, Robert Villa, Cheryl Wampler,
Robert Webster, Andy White, Chuck Wilder,
Karl-Heinz Winkler, Lin Yin,
IBM side: Joe Abeyta, Mike Aho, Ben Alexander,
Tom Ballard, Greg Bellows, Brad Benton, Ken
Blake, Ann Borrett, Bill Brandmeyer, Henry
Brandt, Evelyn Brengle, Dan Brokenshire, Dean
Burdick, James Campa, Jim Carey, Paul Carey, Jeff
Ceason, Alex Chow, Stephen Coladangelo Jr.,
Myneeka Cook, Mike Corrado, Cait Crawford, Jason
Dale, Dave Darrington, Kris Davis, Mike Day,
Ester Deciulescu, Dennis DeLorme, Dan Dionne,
Niketa D'Mello, Dan Dumarot, Karl Duvalsaint,
Adrian Edwards, Adam Emerich, Chris Engel, Gordon
Fossum, Chris Frazier, Amir F. Sanjar, Suzanne
Gagnon, Scott Garfinkle, Tony Godwin, Stan Gowen,
Don Grice, John Gunnels, Bill Hanson, Dave
Heidel, Gail Hepworth, Paul Herb, Peter Hofstee,
Brian Horton, Murali Iyer, Ron Jones, Peter
Keller, Mike Kistler, Rudolf Land, Susan Lee,
Kelvin Li, Dave Limpert, Joaquin Madruga, Ted
Maeurer, Gerald Malagrino, Prashant Manikal,
Camille Mann, Matt Markland, Pat McCarthy, Mary
McLaughlin-English, Ross Mikosh, Barry Minor,
Reid Minyen, Gary Mullen-Schultz, Don Mulvey,
Mark Nutter, Jim O'Connor, Doug Oliver, Michael
Paolini, Mike Perks, Michael Perrone, David
Philipp, Liza Poggemeyer, Paula Richards, Phil
Sanders, Tim Schimke, Pete Schommer, Andy Schram,
Harrell Sellers, Luc Smolders, Mary Snow, Dennis
Spathis, Sean Starke, Greg Stewart, Larry Stoen,
Paul Swiatocha, Keith Tally, Sally Tekulsky, Van
To, Thinh Tran, Dave Turek, Brian Watt, Ulrich
Weigand, Cornell Wright, Shujun Zhou,
PAL was tasked with predicting, measuring, and
understanding Roadrunner's performance

3
Outline

Background
Architecture
Microbenchmark performance
Application performance
Conclusions

4
What is Roadrunner?

Built by IBM for Los Alamos National Laboratory
First supercomputer to achieve 1 Pflop/s on
LINPACK
1.38 Pflop/s peak, 1.026 Pflop/s on LINPACK
(June 2008)
Currently the world's fastest supercomputer
A number of other firsts
First #1 system to use a commodity interconnect
(InfiniBand)
First #1 system to run a commodity operating
system (Linux)
First #1 system to contain a mix of CPU types
(Opteron + Cell)
One of the most energy-efficient supercomputers
#3 on the Green500 list: more flop/s per watt than
any but two of the Top500

5
Roadrunner Performance in Perspective
[Chart: Total Top500 Performance (Tflop/s); axis from 0 to 1,200, with one off-scale value labeled 5,987]
(Data taken from the June 2008 Top500 list)
6
Roadrunner Architecture, Part 1: Opteron Blades
Opteron socket
Opteron core
Opteron core
1.8 GHz, 3.6 Gflop/s, 64+64 KB L1 cache, 2 MB L2
cache
Total cores: 0
Total flop/s: 0
Total cores: 1
Total flop/s: 3,600,000,000
Total cores: 2
Total flop/s: 7,200,000,000
7
Roadrunner Architecture, Part 1: Opteron Blades
LS21 Blade
8 GB DDR2 memory
8 GB DDR2 memory
Total cores: 2
Total flop/s: 7,200,000,000
Total cores: 4
Total flop/s: 14,400,000,000
8
Roadrunner Architecture, Part 2: Cell Blades
PowerXCell 8i socket
8 SPE cores and 1 PPE core connected by the EIB (204.8 GB/s)
PPE core: 3.2 GHz, 6.4 Gflop/s, 32+32 KB L1 cache, 512 KB L2
cache
SPE core: 3.2 GHz, 12.8 Gflop/s, 256 KB local store
Total cores: 0
Total flop/s: 0
Total cores: 1
Total flop/s: 12,800,000,000
Total cores: 9
Total flop/s: 108,800,000,000
Opteron blade (carried over)
Total cores: 4
Total flop/s: 14,400,000,000
9
Roadrunner Architecture, Part 2: Cell Blades

Not your average Cell processor

Feature | Original Cell BE (PlayStation 3) | PowerXCell 8i (Roadrunner)
SPE double-precision floating point operations | 6 cycle stall | Fully pipelined
SPE double-precision floating point operations | 13 cycle latency | 9 cycle latency
SPE double-precision floating point operations | Single issue | Dual issue
SPE double-precision floating point operations | 14.3 Gflop/s | 102.4 Gflop/s
External memory interface | Rambus XDR | DDR2
External memory interface | 2 GB limit | 16 GB limit
10
Roadrunner Architecture, Part 2: Cell Blades
QS22 Blade
4 GB DDR2 memory
4 GB DDR2 memory
Total cores: 9
Total flop/s: 108,800,000,000
Total cores: 18
Total flop/s: 217,600,000,000
Opteron blade (carried over)
Total cores: 4
Total flop/s: 14,400,000,000
11
Roadrunner Architecture, Part 3: Nodes
Triblade
HT x16: 6.4 GB/s
IB: 2 GB/s
PCIe x8: 2 GB/s
QS22 blade: 18 cores, 217,600,000,000 flop/s
QS22 blade: 18 cores, 217,600,000,000 flop/s
LS21 blade: 4 cores, 14,400,000,000 flop/s
Total cores: 40
Total flop/s: 449,600,000,000
12
Roadrunner Architecture, Part 4: Scaling Out
Rack
BladeCenter
Total cores: 40
Total flop/s: 449,600,000,000
Total cores: 120
Total flop/s: 1,348,800,000,000
Total cores: 480
Total flop/s: 5,395,200,000,000
13
Roadrunner Architecture, Part 4: Scaling Out
Compute Unit (CU)
Total cores: 480
Total flop/s: 5,395,200,000,000
Total cores: 7,200
Total flop/s: 80,928,000,000,000
14
Roadrunner Architecture, Part 4: Scaling Out
Roadrunner
Total cores: 7,200
Total flop/s: 80,928,000,000,000
Total cores: 122,400
Total flop/s: 1,375,776,000,000,000
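
The running totals on the architecture slides can be reproduced with a few lines of arithmetic. The sketch below is illustrative only. The per-core rates come from the slides. The multipliers (2 cores per Opteron socket, 2 sockets per blade, 2 QS22 blades per triblade, 3 triblades per BladeCenter, 4 BladeCenters per rack, 15 racks per CU, 17 CUs) are assumptions inferred from the ratios between successive totals, and the helper names are made up for this example.

```python
# Peak per-core rates quoted on the slides, in flop/s.
OPTERON_CORE_FLOPS = 3_600_000_000
PPE_CORE_FLOPS = 6_400_000_000
SPE_CORE_FLOPS = 12_800_000_000


def scale(unit, count):
    """Replicate a (cores, flop/s) unit `count` times."""
    cores, flops = unit
    return (cores * count, flops * count)


def combine(*units):
    """Add up several (cores, flop/s) units."""
    return (sum(u[0] for u in units), sum(u[1] for u in units))


# Multipliers below are inferred from the slide totals (assumption).
opteron_socket = scale((1, OPTERON_CORE_FLOPS), 2)
ls21_blade = scale(opteron_socket, 2)
cell_socket = combine((1, PPE_CORE_FLOPS), scale((1, SPE_CORE_FLOPS), 8))
qs22_blade = scale(cell_socket, 2)
triblade = combine(ls21_blade, scale(qs22_blade, 2))
bladecenter = scale(triblade, 3)
rack = scale(bladecenter, 4)
compute_unit = scale(rack, 15)
roadrunner = scale(compute_unit, 17)

for name, unit in [("triblade", triblade), ("CU", compute_unit),
                   ("Roadrunner", roadrunner)]:
    print(f"{name}: {unit[0]:,} cores, {unit[1]:,} flop/s")
```

Under these assumptions the final line matches the last slide total: 122,400 cores and 1,375,776,000,000,000 flop/s.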
15
Roadrunner Architecture: Summary of Key
Characteristics

Hybrid architecture
12,240 Opteron cores for control- or
network-intensive routines and irregular memory
accesses
12,240 Cell sockets (97,920 SPE cores) for
compute-intensive routines with regular memory
accesses
Equal memory (4 GB) per Opteron core and Cell
socket
Total of 98 TB memory
High peak performance
DP peak: 1.38 Pflop/s
SP peak: 2.91 Pflop/s

Ordinary InfiniBand network
Approximately a fat tree; see paper for details
Modest number of nodes (3,060)
91% of performance comes from SPE cores

SPEs (91%)
PPEs (6%)
Opterons (3%)
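
As a quick check on the summary figures, the following sketch recomputes the share of peak performance per core type and the total memory. It assumes the per-node composition implied by the architecture slides (32 SPEs, 4 PPEs and 4 Opteron cores per node, 4 GB per Opteron core and per Cell socket). The variable names are illustrative.

```python
NODES = 3_060

# (cores per node, peak flop/s per core), taken from the architecture slides.
PER_NODE = {
    "SPEs": (32, 12_800_000_000),
    "PPEs": (4, 6_400_000_000),
    "Opterons": (4, 3_600_000_000),
}

node_peak = sum(count * rate for count, rate in PER_NODE.values())
shares = {
    name: round(100 * count * rate / node_peak)
    for name, (count, rate) in PER_NODE.items()
}

# 4 GB for each of 4 Opteron cores plus 4 GB for each of 4 Cell sockets.
memory_per_node_gb = 4 * 4 + 4 * 4
total_memory_gb = NODES * memory_per_node_gb

print(shares)                  # {'SPEs': 91, 'PPEs': 6, 'Opterons': 3}
print(total_memory_gb, "GB")   # 97920 GB, i.e. about 98 TB
```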
16
Why This Architecture?

Attempt to optimize application performance given
multiple constraints
Cost
Flexibility
Power and cooling
Floor space
Delivery schedule
Roadrunner's hybrid architecture deemed the best
solution

Hybrid may be the new trend in HPC
1970s: HPC is scalar; LANL is an early adopter of
vector (Cray 1 #1)
1980s: HPC is vector; LANL is an early adopter of
data-parallel (TMC CM-1)
1990s: HPC is data-parallel; LANL is an early
adopter of distributed-memory (TMC CM-5)
2000s: HPC is distributed-memory; LANL is an
early adopter of hybrid (Roadrunner)

17
Outline

Background
Architecture
Microbenchmark performance
Application performance
Conclusions

18
Memory Subsystem Performance

Indicative of (and helps explain) computation
performance
Evaluated load/store accesses only (not DMA)
SPE: local store
PPE: main memory on QS22 blade
Opteron: main memory on LS21 blade
Memory bandwidth
Measured with Stream Triad
A(i) = B(i) + q*C(i)
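
For readers unfamiliar with the Triad kernel, here is a minimal pure-Python sketch of what it computes and how bandwidth is derived from it. It is not the benchmark used for the measurements in this talk. Interpreter overhead dominates, so the number it reports says nothing about the hardware. The function names and default sizes are assumptions chosen for illustration.

```python
import time
from array import array


def triad(a, b, c, q):
    """Stream Triad kernel: a[i] = b[i] + q * c[i] for every element."""
    for i in range(len(a)):
        a[i] = b[i] + q * c[i]


def triad_bandwidth(n=200_000, q=3.0, trials=3):
    """Run the kernel a few times and return (result array, bytes/s)."""
    a = array("d", [0.0]) * n
    b = array("d", [1.0]) * n
    c = array("d", [2.0]) * n
    best = float("inf")
    for _ in range(trials):
        start = time.perf_counter()
        triad(a, b, c, q)
        best = min(best, time.perf_counter() - start)
    # Stream convention: two loads and one store per element.
    bytes_moved = 3 * n * a.itemsize
    return a, bytes_moved / best


if __name__ == "__main__":
    _, bandwidth = triad_bandwidth()
    print(f"{bandwidth / 1e9:.3f} GB/s (interpreter-bound, illustrative only)")
```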

19
Memory Subsystem Performance

PPE core provides 78% more peak flop/s than
Opteron core
but
PPE observes only 16% of Opteron's memory
bandwidth
Our experience
PPE not fast enough for significant computation
PPE speed on small kernels is about 1/3 Opteron
speed
On Roadrunner, PPEs are best used for shuttling
data between SPEs and Opterons

[Figure: Stream Triad bandwidth (GB/s) by core type: SPE 29.28, PPE 0.89, Opteron 5.41]
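
One way to see why the PPE's peak flop/s does not translate into speed is to divide measured bandwidth by peak rate, giving bytes available per peak flop. The sketch below does that with the numbers from this slide and the architecture slides. Assigning 0.89 GB/s to the PPE and 5.41 GB/s to the Opteron is an inference from the "16%" statement, and the "bytes per flop" metric is added here for illustration rather than taken from the talk.

```python
# Peak flop/s per core (architecture slides) and measured Stream Triad
# bandwidth in bytes/s (this slide; PPE/Opteron assignment is inferred).
CORES = {
    "SPE": {"peak": 12.8e9, "bandwidth": 29.28e9},
    "PPE": {"peak": 6.4e9, "bandwidth": 0.89e9},
    "Opteron": {"peak": 3.6e9, "bandwidth": 5.41e9},
}

ppe_extra_flops_pct = round(
    (CORES["PPE"]["peak"] / CORES["Opteron"]["peak"] - 1) * 100)
ppe_bandwidth_pct = round(
    CORES["PPE"]["bandwidth"] / CORES["Opteron"]["bandwidth"] * 100)

bytes_per_flop = {
    name: core["bandwidth"] / core["peak"] for name, core in CORES.items()
}

print(ppe_extra_flops_pct, ppe_bandwidth_pct)   # 78 16
for name, value in bytes_per_flop.items():
    print(f"{name}: {value:.2f} bytes per peak flop")
```

With these inputs the PPE gets roughly a tenth of the bytes per peak flop that the Opteron does, which is consistent with the slide's advice to use PPEs for data movement.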
20
Communication Subsystem Performance

Key contributor to performance of many parallel
applications
Complexities of measuring on Roadrunner
Multiple networks: Element Interconnect Bus,
FlexIO, PCI Express, HyperTransport, InfiniBand
Different low-level protocols: put/get vs.
send/receive
Our approach
Normalize all protocols to send/receive
MPI send/receive for Opteron-Opteron
communication
DaCS send/receive for PPE-Opteron communication
Cell Messaging Layer for SPE-SPE communication
Measure ping-pong performance (half round-trip
time) across each interconnect type
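
The ping-pong method itself is simple: one side sends a message, the other echoes it back, and half the average round-trip time is reported as the one-way time. The sketch below illustrates the method using a local socket pair and a thread as stand-ins for MPI, DaCS, or the Cell Messaging Layer. It is an assumption-laden illustration, not the benchmark used on Roadrunner, and its timings are unrelated to the numbers in this talk.

```python
import socket
import threading
import time


def _recv_exact(sock, size):
    """Receive exactly `size` bytes from a stream socket."""
    chunks = []
    remaining = size
    while remaining:
        chunk = sock.recv(remaining)
        if not chunk:
            raise ConnectionError("peer closed the connection")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def _echo(sock, size, reps):
    """Pong side: receive each full message and send it straight back."""
    for _ in range(reps):
        sock.sendall(_recv_exact(sock, size))


def ping_pong_half_rtt(size, reps=1000):
    """Return half the mean round-trip time, in seconds, for `size` bytes."""
    ping, pong = socket.socketpair()
    worker = threading.Thread(target=_echo, args=(pong, size, reps))
    worker.start()
    payload = b"x" * size
    start = time.perf_counter()
    for _ in range(reps):
        ping.sendall(payload)
        _recv_exact(ping, size)
    elapsed = time.perf_counter() - start
    worker.join()
    ping.close()
    pong.close()
    return elapsed / reps / 2


if __name__ == "__main__":
    for size in (8, 1024, 65536):
        print(f"{size:>6} B: {ping_pong_half_rtt(size) * 1e6:.1f} us")
```

Sweeping the message size, as in the usage loop, is what separates the latency-dominated small-message regime from the bandwidth-dominated large-message regime.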

21
Small-Message Communication Time
[Figure: small-message communication times labeled on the interconnects: 0.3 µs, 0.8 µs, 3.2 µs, 2.1 µs]
22
Large-Message Communication Time
PPE-Opteron is bandwidth bottleneck
Same link technology as Opteron-Opteron (PCI
Express x8)
As SW matures, we expect performance to improve
to current MPI+IB performance

[Figure: communication time (µs, log scale, 0.1 to 1000) versus message size (B, log scale, 1 to 100000)]
23
Outline

Background
Architecture
Microbenchmark performance
Application performance
Conclusions

24
Early Roadrunner Applications

VPIC
Particle-in-cell code
7X improvement over Opteron-only
SPaSM
Short-range molecular dynamics code
6X improvement over Opteron-only
Milagro
Implicit Monte Carlo code
6X improvement over Opteron-only
PetaVision
Neuron synapse simulation
1.144 Pflop/s (single prec.)

Each ported to Roadrunner by a couple of people
in a short period of time
Months, not years
Most had to learn Cell programming first
All had to deal with preproduction HW and SW
Relatively few code changes
10-30% of code (est.)
Yes, Roadrunner is programmable

25
Challenge: Sweep3D

Neutron-transport kernel
3-D global grid with 2-D data decomposition
Wavefront communication
Receive boundaries from upstream
Compute
Send boundaries downstream
Repeat from each of eight corners
Hard to get performance at scale
Small messages (few KB)
Tightly coupled: runs only as fast as slowest link
Tradeoff between frequency of communication and
available parallelism
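
The tradeoff in the last bullet can be illustrated with a textbook pipeline model of a single sweep from one corner of the 2-D processor grid. This is a deliberately simplified sketch, not the performance model used by PAL. It assumes every block takes one time step to receive, compute, and send, and the function names are invented for this example.

```python
def wavefront_start_steps(px, py, corner=(0, 0)):
    """Step at which each rank of a px-by-py grid can start its first block.

    `corner` must be one of the four grid corners; a rank has to wait for
    the wavefront to travel from that corner through its upstream neighbors.
    """
    cx, cy = corner
    return {
        (i, j): abs(i - cx) + abs(j - cy)
        for i in range(px)
        for j in range(py)
    }


def sweep_steps(px, py, blocks):
    """Steps for one sweep: pipeline fill plus `blocks` steps of work."""
    return (px - 1) + (py - 1) + blocks


def pipeline_efficiency(px, py, blocks):
    """Fraction of steps in which a given rank is doing useful work."""
    return blocks / sweep_steps(px, py, blocks)


if __name__ == "__main__":
    for blocks in (1, 10, 100):
        eff = pipeline_efficiency(16, 16, blocks)
        print(f"16x16 grid, {blocks:>3} blocks per sweep: {eff:.0%} busy")
```

Splitting each rank's work into more, smaller blocks raises the busy fraction, but it also means more and smaller messages, which is where per-message latency starts to dominate.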

26
Sweep3D on Roadrunner
All compute done on SPEs
PPEs and Opterons used as smart NICs
(Remember: 91% of performance is on SPEs)
Same basic data structures and control flow as
conventional Sweep3D
Cell Messaging Layer provides MPI for SPEs
One MPI rank per SPE
Treat Roadrunner as a 97,920-SPE cluster

[Figure: iteration time (s), 0.0 to 0.8, versus node count, 1 to 3060. Curves: original Sweep3D (compute on Opterons); Roadrunner Sweep3D (compute on SPEs); Roadrunner Sweep3D modeled with PCIe bandwidth = IB bandwidth. Annotations: 2X improvement at scale; expect 4X with feasible SW modifications]
27
Conclusions

Roadrunner is a complex architecture
Three types of processor cores
Multiple internal interconnects
Many address spaces per node
Good performance is possible, however
Even communication-heavy Sweep3D sees
2X improvement over Opteron-only runs, with 4X
expected in the future
Other applications see 6-7X improvement
Hybrid computing may be here to stay
Higher performance/watt than other systems
Scales well (few powerful nodes enables smaller
networks)
Opens up exciting new research possibilities in
system architecture, programming models,
performance tools, and more

28
Additional Roadrunner Resources

Read the paper
Visit our Web site
http://www.lanl.gov/roadrunner
Download our software
http://cellmessaging.sf.net/ (Cell Messaging
Layer)
Stop by our booths
Los Alamos National Laboratory (booth #550)
NNSA Advanced Simulation and Computing (booth #512)
Attend our talks
Birds of a Feather session (Wednesday evening)
Roadrunner: First to a Petaflop, First of a New
Breed
ACM Gordon Bell Finalists session (Thursday
morning)
0.374 Pflop/s Trillion-particle Particle-in-cell
Modeling of Laser Plasma Interactions on
Roadrunner
369 Tflop/s Molecular Dynamics Simulations on
the Roadrunner General-purpose Heterogeneous
Supercomputer

29
Backup Slides
30
Intended Uses of Roadrunner

Ensure the safety and reliability of the nation's
nuclear weapons stockpile
also
Research into
astronomy
climate change
cosmology
energy
human genome science
material science

31
Programming Roadrunner
(SPaSM, Milagro, PetaVision)
(VPIC, Sweep3D)

Multiple low-level communication libraries
MPI for Opteron-Opteron communication
DaCS for PPE-Opteron communication
SPU intrinsics for PPE-SPE and SPE-SPE
communication
Alternatives
PetaVision: ALF for PPEs to coordinate SPEs
Sweep3D: Cell Messaging Layer for SPE-remote SPE
comm.

32
Why Doesn't Roadrunner
use only Cells? Need to run legacy codes
put the Cells on InfiniBand? Cost
use GPUs instead of Cells? Ease of programming and double-precision performance
do something else completely differently? Probably one of cost, flexibility, power and cooling, floor space, or delivery schedule
33
Hybrid vs. Conventional
Characteristic | Roadrunner | Jaguar (XT5 only)
Peak perf. (Pflop/s) | 1.38 | 1.38
LINPACK % of peak | 76 | 77
CPU type | Opteron + Cell | Opteron
Node count | 3,060 | 18,772
Core count | 122,400 | 150,176
Power (MW), measured | 2.35 | 6.95
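
A few derived ratios make the comparison easier to read. The sketch below computes them directly from the table values. Note that "peak flop/s per measured watt" is an illustrative metric chosen here and is not the Green500 figure, which is based on LINPACK results. The dictionary layout and names are assumptions for the example.

```python
# Values copied from the comparison table.
SYSTEMS = {
    "Roadrunner": {"peak_pflops": 1.38, "nodes": 3_060,
                   "cores": 122_400, "power_mw": 2.35},
    "Jaguar (XT5 only)": {"peak_pflops": 1.38, "nodes": 18_772,
                          "cores": 150_176, "power_mw": 6.95},
}


def peak_mflops_per_watt(system):
    """Peak Mflop/s per measured watt (not the Green500 metric)."""
    flops = system["peak_pflops"] * 1e15
    watts = system["power_mw"] * 1e6
    return flops / watts / 1e6


def cores_per_node(system):
    """Cores per node, using integer division on the table counts."""
    return system["cores"] // system["nodes"]


for name, system in SYSTEMS.items():
    print(f"{name}: {peak_mflops_per_watt(system):.0f} peak Mflop/s per watt, "
          f"{cores_per_node(system)} cores per node")
```

On these numbers the two systems have the same peak, but Roadrunner reaches it with about one sixth of the nodes and roughly one third of the measured power.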