Title: Entering the Petaflop Era: The Architecture and Performance of Roadrunner
1. Entering the Petaflop Era: The Architecture and Performance of Roadrunner
- Kevin J. Barker, Kei Davis, Adolfy Hoisie, Darren J. Kerbyson, Mike Lang, Scott Pakin, and José C. Sancho
- Performance and Architecture Lab (PAL)
- Los Alamos National Laboratory
- 18 November 2008
2. Clarification
- Many, many people at LANL and IBM contributed to Roadrunner's success
- LANL side: Brian Albright, Daniel Archuleta,
Kevin Barker, Brian Barrett, John Bent, Benjamin
Bergen, Kevin Bowers, Todd Bowman, Joseph
Bridges, Jeffrey Brown, Ernie Buenafe, Richard
Campion, Ralph Castain, John Cerutti, Mark
Chadwick, Hsing-Bung Chen, Phil Church, Vaughn
Clinton, Susan Coulter, Robert Cunningham, David
Daniel, Kei Davis, Nathan Debardeleben, Gabriel
de la Cruz, Nehal Desai, Guy Dimonte, Andrew
Dubois, Charles Ferenbaugh, Parks Fields, Timothy
Germann, Gary Grider, Daryl Grunau, David Gunter,
Chuck Hales, Doug Hefele, Paul Henning, Catherine
Hensley, Marissa Herrera, Stephen Hodson, Adolfy
Hoisie, Laura Hughes, Craig Idler, Jeff Inman,
Mohammed Jebbanema, Timothy Kelley, Kathleen
Kelly, Darren Kerbyson, Brett Kettering, Ken
Koch, Thomas Kwan, Michael Lang, Rick Light,
Diana Little, Josip Loncaric, Monica Lucero, Hal
Marshall, Rick Martineau, Gloria Martinez, Paul
Martinez, Benjamin McClelland, Patrick McCormick,
Michael McKay, Allen McPherson, Amy Meilander,
Sarah Michalak, Raymond Miller, Jamal Mohd-Yusof,
David Montoya, Terri Morris, John Morrison, James
Nuñez, Scott Pakin, Georgia Pedicini, Jennis
Pruett, Meghan Quist, Craig Rasmussen, Randal
Rheinheimer, Denny Rice, Rick Rivera, Bill Rust,
José Sancho, Rita Sandoval, Bob Shea, Matt
Sheats, Andrew Shewmaker, Galen Shipman, LeAnne
Silva, Randall Smith, Julianne Stidham, Joyce
Sullivan, Sriram Swaminarayan, Wayne Sweatt,
Martin Torrey, Alfred Torrez, Justin Tripp, John
Turner, Steve Turpin, Ron Velarde, Mark Vernon,
Manuel Vigil, Robert Villa, Cheryl Wampler,
Robert Webster, Andy White, Chuck Wilder,
Karl-Heinz Winkler, Lin Yin
- IBM side: Joe Abeyta, Mike Aho, Ben Alexander,
Tom Ballard, Greg Bellows, Brad Benton, Ken
Blake, Ann Borrett, Bill Brandmeyer, Henry
Brandt, Evelyn Brengle, Dan Brokenshire, Dean
Burdick, James Campa, Jim Carey, Paul Carey, Jeff
Ceason, Alex Chow, Stephen Coladangelo Jr.,
Myneeka Cook, Mike Corrado, Cait Crawford, Jason
Dale, Dave Darrington, Kris Davis, Mike Day,
Ester Deciulescu, Dennis DeLorme, Dan Dionne,
Niketa D'Mello, Dan Dumarot, Karl Duvalsaint,
Adrian Edwards, Adam Emerich, Chris Engel, Gordon
Fossum, Chris Frazier, Amir F. Sanjar, Suzanne
Gagnon, Scott Garfinkle, Tony Godwin, Stan Gowen,
Don Grice, John Gunnels, Bill Hanson, Dave
Heidel, Gail Hepworth, Paul Herb, Peter Hofstee,
Brian Horton, Murali Iyer, Ron Jones, Peter
Keller, Mike Kistler, Rudolf Land, Susan Lee,
Kelvin Li, Dave Limpert, Joaquin Madruga, Ted
Maeurer, Gerald Malagrino, Prashant Manikal,
Camille Mann, Matt Markland, Pat McCarthy, Mary
McLaughlin-English, Ross Mikosh, Barry Minor,
Reid Minyen, Gary Mullen-Schultz, Don Mulvey,
Mark Nutter, Jim O'Connor, Doug Oliver, Michael
Paolini, Mike Perks, Michael Perrone, David
Philipp, Liza Poggemeyer, Paula Richards, Phil
Sanders, Tim Schimke, Pete Schommer, Andy Schram,
Harrell Sellers, Luc Smolders, Mary Snow, Dennis
Spathis, Sean Starke, Greg Stewart, Larry Stoen,
Paul Swiatocha, Keith Tally, Sally Tekulsky, Van
To, Thinh Tran, Dave Turek, Brian Watt, Ulrich
Weigand, Cornell Wright, Shujun Zhou
- PAL was tasked with predicting, measuring, and understanding Roadrunner's performance
3. Outline
- Background
- Architecture
- Microbenchmark performance
- Application performance
- Conclusions
4. What is Roadrunner?
- Built by IBM for Los Alamos National Laboratory
- First supercomputer to achieve 1 Pflop/s on LINPACK
- 1.38 Pflop/s peak, 1.026 Pflop/s on LINPACK (June 2008)
- Currently the world's fastest supercomputer
- A number of other firsts
- First #1 system to use a commodity interconnect (InfiniBand)
- First #1 system to run a commodity operating system (Linux)
- First #1 system to contain a mix of CPU types (Opteron + Cell)
- One of the most energy-efficient supercomputers
- #3 on the Green500 list: more flop/s per watt than any but two of the Top500
5. Roadrunner Performance in Perspective
[Figure: bar chart; y-axis: Total Top500 Performance (Tflop/s), 0 to 1,200, with the value 5,987 marked beyond the axis range. Data taken from the June 2008 Top500 list.]
6. Roadrunner Architecture, Part 1: Opteron Blades
Opteron socket: 2 Opteron cores
Opteron core: 1.8 GHz, 3.6 Gflop/s, 64+64 KB L1 cache, 2 MB L2 cache
Running totals:
- 1 Opteron core: 1 core, 3,600,000,000 flop/s
- 1 Opteron socket: 2 cores, 7,200,000,000 flop/s
7. Roadrunner Architecture, Part 1: Opteron Blades
LS21 blade: 2 Opteron sockets, each with 8 GB DDR2 memory
Running totals:
- 1 Opteron socket: 2 cores, 7,200,000,000 flop/s
- 1 LS21 blade: 4 cores, 14,400,000,000 flop/s
8. Roadrunner Architecture, Part 2: Cell Blades
PowerXCell 8i socket: 1 PPE core and 8 SPE cores, connected by the EIB (204.8 GB/s)
PPE core: 3.2 GHz, 6.4 Gflop/s, 32+32 KB L1 cache, 512 KB L2 cache
SPE core: 3.2 GHz, 12.8 Gflop/s, 256 KB local store
Running totals:
- 1 SPE core: 1 core, 12,800,000,000 flop/s
- 1 PowerXCell 8i socket: 9 cores, 108,800,000,000 flop/s
- 1 LS21 blade: 4 cores, 14,400,000,000 flop/s
9. Roadrunner Architecture, Part 2: Cell Blades
- Not your average Cell processor
Feature | Original Cell BE (PlayStation 3) | PowerXCell 8i (Roadrunner)
SPE double-precision floating point operations | 6-cycle stall | Fully pipelined
SPE double-precision floating point operations | 13-cycle latency | 9-cycle latency
SPE double-precision floating point operations | Single issue | Dual issue
SPE double-precision floating point operations | 14.3 Gflop/s | 102.4 Gflop/s
External memory interface | Rambus XDR | DDR2
External memory interface | 2 GB limit | 16 GB limit
10. Roadrunner Architecture, Part 2: Cell Blades
QS22 blade: 2 PowerXCell 8i sockets, each with 4 GB DDR2 memory
Running totals:
- 1 PowerXCell 8i socket: 9 cores, 108,800,000,000 flop/s
- 1 QS22 blade: 18 cores, 217,600,000,000 flop/s
- 1 LS21 blade: 4 cores, 14,400,000,000 flop/s
11. Roadrunner Architecture, Part 3: Nodes
Triblade: 2 QS22 blades and 1 LS21 blade
Links: HT x16 (6.4 GB/s), PCIe x8 (2 GB/s), IB (2 GB/s)
Running totals:
- 2 QS22 blades: 18 cores and 217,600,000,000 flop/s each
- 1 LS21 blade: 4 cores, 14,400,000,000 flop/s
- 1 triblade: 40 cores, 449,600,000,000 flop/s
12. Roadrunner Architecture, Part 4: Scaling Out
Running totals:
- 1 triblade: 40 cores, 449,600,000,000 flop/s
- 1 BladeCenter: 120 cores, 1,348,800,000,000 flop/s
- 1 rack: 480 cores, 5,395,200,000,000 flop/s
13. Roadrunner Architecture, Part 4: Scaling Out
Running totals:
- 1 rack: 480 cores, 5,395,200,000,000 flop/s
- 1 Compute Unit (CU): 7,200 cores, 80,928,000,000,000 flop/s
14. Roadrunner Architecture, Part 4: Scaling Out
Running totals:
- 1 Compute Unit (CU): 7,200 cores, 80,928,000,000,000 flop/s
- Roadrunner: 122,400 cores, 1,375,776,000,000,000 flop/s
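The running totals on the architecture slides can be reproduced with a few lines of arithmetic. The following is a minimal sketch: the per-core peak rates are taken from the slides, while the multipliers at each level (3 triblades per BladeCenter, 4 BladeCenters per rack, 15 racks per CU, 17 CUs) are not stated explicitly and are inferred here from the ratios of successive totals. All names are illustrative.

```python
# Per-core peak rates in flop/s, as given on the slides.
OPTERON_CORE = 3_600_000_000
PPE_CORE = 6_400_000_000
SPE_CORE = 12_800_000_000


def times(n, unit):
    """Replicate a (cores, flop/s) unit n times."""
    cores, flops = unit
    return (n * cores, n * flops)


def combine(*parts):
    """Add up several (cores, flop/s) units."""
    return (sum(c for c, _ in parts), sum(f for _, f in parts))


opteron_socket = times(2, (1, OPTERON_CORE))
ls21_blade = times(2, opteron_socket)
cell_socket = combine((1, PPE_CORE), times(8, (1, SPE_CORE)))
qs22_blade = times(2, cell_socket)
triblade = combine(times(2, qs22_blade), ls21_blade)

# Assumption: the multipliers below are inferred from the slide totals
# (120/40, 480/120, 7,200/480, 122,400/7,200), not quoted from the slides.
bladecenter = times(3, triblade)
rack = times(4, bladecenter)
compute_unit = times(15, rack)
roadrunner = times(17, compute_unit)

for name, (cores, flops) in [
    ("Triblade", triblade),
    ("BladeCenter", bladecenter),
    ("Rack", rack),
    ("Compute Unit", compute_unit),
    ("Roadrunner", roadrunner),
]:
    print(f"{name:14s} {cores:>8,d} cores {flops:>26,d} flop/s")
```

Integer arithmetic is used throughout so that the totals match the slide figures digit for digit.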
15. Roadrunner Architecture: Summary of Key Characteristics
- Hybrid architecture
- 12,240 Opteron cores for control- or network-intensive routines and irregular memory accesses
- 12,240 Cell sockets (97,920 SPE cores) for compute-intensive routines with regular memory accesses
- Equal memory (4 GB) per Opteron core and Cell socket
- Total of 98 TB memory
- High peak performance
- DP peak: 1.38 Pflop/s
- SP peak: 2.91 Pflop/s
- Ordinary InfiniBand network
- Approximately a fat tree; see paper for details
- Modest number of nodes (3,060)
- 91% of performance comes from SPE cores
[Pie chart: share of peak performance by core type: SPEs 91%, PPEs 6%, Opterons 3%]
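The 91% / 6% / 3% split and the 98 TB memory total follow directly from the counts above. A minimal sketch of that arithmetic, assuming 4 Cell sockets and 4 Opteron cores per node (inferred from 12,240 / 3,060) and the per-core double-precision peak rates quoted on the architecture slides; the variable names are illustrative:

```python
NODES = 3_060
CELL_SOCKETS_PER_NODE = 4    # assumption: 12,240 Cell sockets / 3,060 nodes
OPTERON_CORES_PER_NODE = 4   # assumption: 12,240 Opteron cores / 3,060 nodes
SPES_PER_SOCKET = 8
GB_PER_UNIT = 4              # 4 GB per Opteron core and per Cell socket

# Per-core double-precision peak rates in flop/s.
SPE, PPE, OPTERON = 12_800_000_000, 6_400_000_000, 3_600_000_000

peak = {
    "SPEs": NODES * CELL_SOCKETS_PER_NODE * SPES_PER_SOCKET * SPE,
    "PPEs": NODES * CELL_SOCKETS_PER_NODE * PPE,
    "Opterons": NODES * OPTERON_CORES_PER_NODE * OPTERON,
}
total_peak = sum(peak.values())

# Percentage of total peak contributed by each core type.
shares = {name: round(100 * value / total_peak) for name, value in peak.items()}

# Total memory in GB (decimal units: 1 TB = 1,000 GB).
memory_gb = NODES * (CELL_SOCKETS_PER_NODE + OPTERON_CORES_PER_NODE) * GB_PER_UNIT

print(shares)
print(f"{memory_gb / 1000:.2f} TB")
```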
16. Why This Architecture?
- Attempt to optimize application performance given multiple constraints
- Cost
- Flexibility
- Power & cooling
- Floor space
- Delivery schedule
- Roadrunner's hybrid architecture deemed the best solution
- Hybrid may be the new trend in HPC
- 1970s: HPC is scalar; LANL is an early adopter of vector (Cray 1 #1)
- 1980s: HPC is vector; LANL is an early adopter of data-parallel (TMC CM-1)
- 1990s: HPC is data-parallel; LANL is an early adopter of distributed-memory (TMC CM-5)
- 2000s: HPC is distributed-memory; LANL is an early adopter of hybrid (Roadrunner)
17. Outline
- Background
- Architecture
- Microbenchmark performance
- Application performance
- Conclusions
18. Memory Subsystem Performance
- Indicative of (and helps explain) computation performance
- Evaluated load/store accesses only (not DMA)
- SPE: local store
- PPE: main memory on QS22 blade
- Opteron: main memory on LS21 blade
- Memory bandwidth
- Measured with Stream Triad
- A(i) = B(i) + q*C(i)
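For readers unfamiliar with the Triad kernel, here is a minimal sketch of it in Python. It is not the benchmark used for the measurements in this talk: in pure Python the interpreter overhead dominates, so the reported figure says little about the memory system. It only illustrates the kernel and the usual way bandwidth is derived from it (two 8-byte loads and one 8-byte store per element). The function names are illustrative.

```python
import time
from array import array


def triad(a, b, c, q):
    """Triad kernel: a[i] = b[i] + q * c[i] for every element."""
    for i in range(len(a)):
        a[i] = b[i] + q * c[i]


def triad_bandwidth_gb_s(n=1_000_000, q=3.0):
    """Time one Triad pass over n doubles and convert to GB/s."""
    a = array("d", [0.0]) * n
    b = array("d", [1.0]) * n
    c = array("d", [2.0]) * n
    start = time.perf_counter()
    triad(a, b, c, q)
    elapsed = time.perf_counter() - start
    bytes_moved = 3 * 8 * n  # load b[i], load c[i], store a[i]
    return bytes_moved / elapsed / 1e9


if __name__ == "__main__":
    print(f"Triad: {triad_bandwidth_gb_s():.3f} GB/s (interpreter-bound)")
```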
19. Memory Subsystem Performance
- PPE core provides 78% more peak flop/s than Opteron core
- but PPE observes only 16% of Opteron's memory bandwidth
- Our experience
- PPE not fast enough for significant computation
- PPE speed on small kernels is about 1/3 Opteron speed
- On Roadrunner, PPEs are best used for shuttling data between SPEs and Opterons
[Figure: bar chart of bandwidth (GB/s) by core type: SPE 29.28, PPE 0.89, Opteron 5.41]
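The two percentages on this slide can be checked against the per-core peak rates from the architecture slides and the measured Triad bandwidths. A short sketch of that arithmetic follows; the bytes-per-flop ratio at the end is an added illustration, not a figure from the talk. The SPE is left out of that comparison because its bandwidth is to local store rather than main memory.

```python
peak_gflops = {"PPE": 6.4, "Opteron": 3.6}      # per-core peak, from the architecture slides
triad_gb_s = {"PPE": 0.89, "Opteron": 5.41}     # measured Stream Triad bandwidth

# How much more peak flop/s the PPE has than an Opteron core, in percent.
extra_peak_pct = (peak_gflops["PPE"] / peak_gflops["Opteron"] - 1) * 100

# PPE memory bandwidth as a percentage of Opteron memory bandwidth.
bandwidth_pct = triad_gb_s["PPE"] / triad_gb_s["Opteron"] * 100

# Illustrative derived metric: bytes of memory bandwidth available per peak flop.
bytes_per_flop = {core: triad_gb_s[core] / peak_gflops[core] for core in peak_gflops}

print(f"PPE peak advantage: {extra_peak_pct:.0f}%")
print(f"PPE bandwidth relative to Opteron: {bandwidth_pct:.0f}%")
for core, ratio in bytes_per_flop.items():
    print(f"{core}: {ratio:.2f} bytes/flop")
```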
20. Communication Subsystem Performance
- Key contributor to performance of many parallel applications
- Complexities of measuring on Roadrunner
- Multiple networks: Element Interconnect Bus, FlexIO, PCI Express, HyperTransport, InfiniBand
- Different low-level protocols: put/get vs. send/receive
- Our approach
- Normalize all protocols to send/receive
- MPI send/receive for Opteron-Opteron communication
- DaCS send/receive for PPE-Opteron communication
- Cell Messaging Layer for SPE-SPE communication
- Measure ping-pong performance (half round-trip time) across each interconnect type
21. Small-Message Communication Time
[Figure: diagram annotated with per-link small-message communication times: 0.3 µs, 0.8 µs, 3.2 µs, 2.1 µs]
22. Large-Message Communication Time
- PPE-Opteron is bandwidth bottleneck
- Same link technology as Opteron-Opteron (PCI Express x8)
- As SW matures, we expect performance to improve to current MPI+IB performance
[Figure: log-log plot of Time (µs), 0.1 to 1000, vs. Message Size (B), 1 to 100000]
23. Outline
- Background
- Architecture
- Microbenchmark performance
- Application performance
- Conclusions
24. Early Roadrunner Applications
- VPIC
- Particle-in-cell code
- 7X improvement over Opteron-only
- SPaSM
- Short-range molecular dynamics code
- 6X improvement over Opteron-only
- Milagro
- Implicit Monte Carlo code
- 6X improvement over Opteron-only
- PetaVision
- Neuron synapse simulation
- 1.144 Pflop/s (single prec.)
- Each ported to Roadrunner by a couple of people in a short period of time
- Months, not years
- Most had to learn Cell programming first
- All had to deal with preproduction HW & SW
- Relatively few code changes
- 10-30% of code (est.)
- Yes, Roadrunner is programmable
25. Challenge: Sweep3D
- Neutron-transport kernel
- 3-D global grid with 2-D data decomposition
- Wavefront communication
- Receive boundaries from upstream
- Compute
- Send boundaries downstream
- Repeat from each of eight corners
- Hard to get performance at scale
- Small messages (few KB)
- Tightly coupled: runs only as fast as slowest link
- Tradeoff between frequency of communication and available parallelism
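The tradeoff in the last bullet can be illustrated with a toy model of a single sweep from one corner. The sketch below is not Sweep3D and not the performance model used by PAL; it assumes a px-by-py processor grid, a per-rank workload cut into nblocks equal blocks, unit compute time per block, and free communication. Each block on a rank can start only after the same block has finished on both upstream neighbors.

```python
def sweep_finish_step(px, py, nblocks):
    """Step at which the last rank finishes its last block (unit-time blocks)."""
    finish = {}
    for k in range(nblocks):
        for i in range(px):
            for j in range(py):
                upstream = (
                    finish.get((i - 1, j, k), 0),  # neighbor upstream in x
                    finish.get((i, j - 1, k), 0),  # neighbor upstream in y
                    finish.get((i, j, k - 1), 0),  # this rank's previous block
                )
                finish[(i, j, k)] = 1 + max(upstream)
    return finish[(px - 1, py - 1, nblocks - 1)]


def sweep_efficiency(px, py, nblocks):
    """Fraction of the sweep during which a rank is busy computing."""
    return nblocks / sweep_finish_step(px, py, nblocks)


if __name__ == "__main__":
    for nblocks in (1, 10, 100):
        eff = sweep_efficiency(16, 16, nblocks)
        print(f"{nblocks:>4d} blocks per rank: efficiency {eff:.2f}")
```

Under these assumptions, cutting the work into more blocks fills the pipeline sooner and raises efficiency, but each block boundary is also a message exchange, and the model deliberately leaves out that cost. On a real machine with small messages and non-trivial latency, the extra messages eventually outweigh the gain.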
26. Sweep3D on Roadrunner
- All compute done on SPEs
- PPEs and Opterons used as smart NICs
- (Remember: 91% of performance on SPEs)
- Same basic data structures and control flow as conventional Sweep3D
- Cell Messaging Layer provides MPI for SPEs
- One MPI rank per SPE
- Treat Roadrunner as a 97,920-SPE cluster
- Expect 4X with feasible SW modifications
[Figure: Iteration time (s), 0.0 to 0.8, vs. node count, 1 to 3060. Series: Original Sweep3D (compute on Opterons); Roadrunner Sweep3D (compute on SPEs); Roadrunner Sweep3D modeled with PCIe bandwidth = IB bandwidth]
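Treating the machine as a flat 97,920-SPE cluster means every SPE gets its own rank. One way to picture this is a mapping from a flat rank to a (node, Cell socket, SPE) triple. The sketch below uses a node-major ordering chosen purely for illustration; it is an assumption, not the numbering used by the Cell Messaging Layer, and the function names are hypothetical.

```python
NODES = 3_060
SOCKETS_PER_NODE = 4   # assumption: 12,240 Cell sockets / 3,060 nodes
SPES_PER_SOCKET = 8
TOTAL_RANKS = NODES * SOCKETS_PER_NODE * SPES_PER_SOCKET


def locate(rank):
    """Map a flat rank to (node, socket, spe) under a hypothetical node-major order."""
    if not 0 <= rank < TOTAL_RANKS:
        raise ValueError(f"rank {rank} out of range 0..{TOTAL_RANKS - 1}")
    node, rest = divmod(rank, SOCKETS_PER_NODE * SPES_PER_SOCKET)
    socket_index, spe = divmod(rest, SPES_PER_SOCKET)
    return node, socket_index, spe


def locality(rank_a, rank_b):
    """Classify how far apart two ranks are in the hardware hierarchy."""
    node_a, socket_a, _ = locate(rank_a)
    node_b, socket_b, _ = locate(rank_b)
    if node_a != node_b:
        return "different nodes"
    if socket_a != socket_b:
        return "same node, different sockets"
    return "same socket"


if __name__ == "__main__":
    print(TOTAL_RANKS, locate(40), locality(0, 7), locality(0, 8), locality(0, 32))
```

The locality classes matter because each one crosses a different set of interconnects, which is why the same send/receive call can have very different costs depending on the two ranks involved.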
27. Conclusions
- Roadrunner is a complex architecture
- Three types of processor cores
- Multiple internal interconnects
- Many address spaces per node
- Good performance is possible, however
- Even communication-heavy Sweep3D sees 2X improvement over Opteron-only runs with 4X expected in the future
- Other applications see 6-7X improvement
- Hybrid computing may be here to stay
- Higher performance/watt than other systems
- Scales well (a few powerful nodes enable smaller networks)
- Opens up exciting new research possibilities in system architecture, programming models, performance tools, and more
28. Additional Roadrunner Resources
- Read the paper
- Visit our Web site
- http://www.lanl.gov/roadrunner
- Download our software
- http://cellmessaging.sf.net/ (Cell Messaging Layer)
- Stop by our booths
- Los Alamos National Laboratory (550)
- NNSA Advanced Simulation and Computing (512)
- Attend our talks
- Birds of a Feather session (Wednesday evening)
- Roadrunner: First to a Petaflop, First of a New Breed
- ACM Gordon Bell Finalists session (Thursday morning)
- 0.374 Pflop/s Trillion-particle Particle-in-cell Modeling of Laser Plasma Interactions on Roadrunner
- 369 Tflop/s Molecular Dynamics Simulations on the Roadrunner General-purpose Heterogeneous Supercomputer
29. Backup Slides
30. Intended Uses of Roadrunner
- Ensure the safety and reliability of the nation's nuclear weapons stockpile
- Also research into
- astronomy
- climate change
- cosmology
- energy
- human genome science
- material science
31. Programming Roadrunner
[Figure labels: (SPaSM, Milagro, PetaVision); (VPIC, Sweep3D)]
- Multiple low-level communication libraries
- MPI for Opteron-Opteron communication
- DaCS for PPE-Opteron communication
- SPU intrinsics for PPE-SPE and SPE-SPE communication
- Alternatives
- PetaVision: ALF for PPEs to coordinate SPEs
- Sweep3D: Cell Messaging Layer for SPE-to-remote-SPE comm.
32. Why Doesn't Roadrunner...
...use only Cells? | Need to run legacy codes
...put the Cells on InfiniBand? | Cost
...use GPUs instead of Cells? | Ease of programming and double-precision performance
...do something else completely differently? | Probably one of cost, flexibility, power & cooling, floor space, or delivery schedule
33. Hybrid vs. Conventional
Characteristic | Roadrunner | Jaguar (XT5 only)
Peak perf. (Pflop/s) | 1.38 | 1.38
LINPACK % of peak | 76 | 77
CPU type | Opteron+Cell | Opteron
Node count | 3,060 | 18,772
Core count | 122,400 | 150,176
Power (MW), measured | 2.35 | 6.95
- Same peak flop/s, but Jaguar has
- 6X the number of nodes
- 23% more cores
- 3X the power requirement
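The three ratios in the closing bullets come straight from the table. A short sketch of the arithmetic, with illustrative variable names:

```python
roadrunner = {"nodes": 3_060, "cores": 122_400, "power_mw": 2.35}
jaguar = {"nodes": 18_772, "cores": 150_176, "power_mw": 6.95}

# Jaguar (XT5 only) relative to Roadrunner.
node_ratio = jaguar["nodes"] / roadrunner["nodes"]
extra_cores_pct = (jaguar["cores"] / roadrunner["cores"] - 1) * 100
power_ratio = jaguar["power_mw"] / roadrunner["power_mw"]

print(f"Nodes: {node_ratio:.1f}X")
print(f"Cores: {extra_cores_pct:.0f}% more")
print(f"Power: {power_ratio:.1f}X")
```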