1
Entering the Petaflop Era: The Architecture and
Performance of Roadrunner
  • Kevin J. Barker, Kei Davis, Adolfy Hoisie,
    Darren J. Kerbyson, Mike Lang, Scott Pakin, and
    José C. Sancho
  • Performance and Architecture Lab (PAL)
  • Los Alamos National Laboratory
  • 18 November 2008

2
Clarification
  • Many, many people at LANL and IBM contributed to
    Roadrunner's success
  • LANL side: Brian Albright, Daniel Archuleta,
    Kevin Barker, Brian Barrett, John Bent, Benjamin
    Bergen, Kevin Bowers, Todd Bowman, Joseph
    Bridges, Jeffrey Brown, Ernie Buenafe, Richard
    Campion, Ralph Castain, John Cerutti, Mark
    Chadwick, Hsing-Bung Chen, Phil Church, Vaughn
    Clinton, Susan Coulter, Robert Cunningham, David
    Daniel, Kei Davis, Nathan Debardeleben, Gabriel
    de la Cruz, Nehal Desai, Guy Dimonte, Andrew
    Dubois, Charles Ferenbaugh, Parks Fields, Timothy
    Germann, Gary Grider, Daryl Grunau, David Gunter,
    Chuck Hales, Doug Hefele, Paul Henning, Catherine
    Hensley, Marissa Herrera, Stephen Hodson, Adolfy
    Hoisie, Laura Hughes, Craig Idler, Jeff Inman,
    Mohammed Jebbanema, Timothy Kelley, Kathleen
    Kelly, Darren Kerbyson, Brett Kettering, Ken
    Koch, Thomas Kwan, Michael Lang, Rick Light,
    Diana Little, Josip Loncaric, Monica Lucero, Hal
    Marshall, Rick Martineau, Gloria Martinez, Paul
    Martinez, Benjamin McClelland, Patrick McCormick,
    Michael McKay, Allen McPherson, Amy Meilander,
    Sarah Michalak, Raymond Miller, Jamal Mohd-Yusof,
    David Montoya, Terri Morris, John Morrison, James
    Nuñez, Scott Pakin, Georgia Pedicini, Jennis
    Pruett, Meghan Quist, Craig Rasmussen, Randal
    Rheinheimer, Denny Rice, Rick Rivera, Bill Rust,
    José Sancho, Rita Sandoval, Bob Shea, Matt
    Sheats, Andrew Shewmaker, Galen Shipman, LeAnne
    Silva, Randall Smith, Julianne Stidham, Joyce
    Sullivan, Sriram Swaminarayan, Wayne Sweatt,
    Martin Torrey, Alfred Torrez, Justin Tripp, John
    Turner, Steve Turpin, Ron Velarde, Mark Vernon,
    Manuel Vigil, Robert Villa, Cheryl Wampler,
    Robert Webster, Andy White, Chuck Wilder,
    Karl-Heinz Winkler, Lin Yin, …
  • IBM side: Joe Abeyta, Mike Aho, Ben Alexander,
    Tom Ballard, Greg Bellows, Brad Benton, Ken
    Blake, Ann Borrett, Bill Brandmeyer, Henry
    Brandt, Evelyn Brengle, Dan Brokenshire, Dean
    Burdick, James Campa, Jim Carey, Paul Carey, Jeff
    Ceason, Alex Chow, Stephen Coladangelo Jr.,
    Myneeka Cook, Mike Corrado, Cait Crawford, Jason
    Dale, Dave Darrington, Kris Davis, Mike Day,
    Ester Deciulescu, Dennis DeLorme, Dan Dionne,
    Niketa D'Mello, Dan Dumarot, Karl Duvalsaint,
    Adrian Edwards, Adam Emerich, Chris Engel, Gordon
    Fossum, Chris Frazier, Amir F. Sanjar, Suzanne
    Gagnon, Scott Garfinkle, Tony Godwin, Stan Gowen,
    Don Grice, John Gunnels, Bill Hanson, Dave
    Heidel, Gail Hepworth, Paul Herb, Peter Hofstee,
    Brian Horton, Murali Iyer, Ron Jones, Peter
    Keller, Mike Kistler, Rudolf Land, Susan Lee,
    Kelvin Li, Dave Limpert, Joaquin Madruga, Ted
    Maeurer, Gerald Malagrino, Prashant Manikal,
    Camille Mann, Matt Markland, Pat McCarthy, Mary
    McLaughlin-English, Ross Mikosh, Barry Minor,
    Reid Minyen, Gary Mullen-Schultz, Don Mulvey,
    Mark Nutter, Jim O'Connor, Doug Oliver, Michael
    Paolini, Mike Perks, Michael Perrone, David
    Philipp, Liza Poggemeyer, Paula Richards, Phil
    Sanders, Tim Schimke, Pete Schommer, Andy Schram,
    Harrell Sellers, Luc Smolders, Mary Snow, Dennis
    Spathis, Sean Starke, Greg Stewart, Larry Stoen,
    Paul Swiatocha, Keith Tally, Sally Tekulsky, Van
    To, Thinh Tran, Dave Turek, Brian Watt, Ulrich
    Weigand, Cornell Wright, Shujun Zhou, …
  • PAL was tasked with predicting, measuring, and
    understanding Roadrunner's performance

3
Outline
  • Background
  • Architecture
  • Microbenchmark performance
  • Application performance
  • Conclusions

4
What is Roadrunner?
  • Built by IBM for Los Alamos National Laboratory
  • First supercomputer to achieve 1 Pflop/s on
    LINPACK
  • 1.38 Pflop/s peak, 1.026 Pflop/s on LINPACK
    (June 2008)
  • Currently the world's fastest supercomputer
  • A number of other firsts:
  • First #1 system to use a commodity interconnect (InfiniBand)
  • First #1 system to run a commodity operating system (Linux)
  • First #1 system to contain a mix of CPU types (Opteron+Cell)
  • One of the most energy-efficient supercomputers
  • #3 on the Green500 list: more flop/s per watt than all but two of the Top500

5
Roadrunner Performance in Perspective
[Bar chart: Total Top500 Performance (Tflop/s); y-axis 0 to 1,200, with one off-scale value labeled 5,987. Data taken from the June 2008 Top500 list.]
6
Roadrunner Architecture, Part 1: Opteron Blades

[Diagram: an Opteron socket contains two Opteron cores; each core runs at 1.8 GHz (3.6 Gflop/s) with 64+64 KB L1 cache and 2 MB L2 cache. Running totals: 2 cores, 7,200,000,000 flop/s.]
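As a sanity check on the per-core figure, assuming the standard rate for this Opteron generation of two double-precision flops per cycle (one SSE2 add plus one multiply; an assumption, not stated on the slide):

$$1.8\ \text{GHz} \times 2\ \tfrac{\text{flops}}{\text{cycle}} = 3.6\ \text{Gflop/s per core}$$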
7
Roadrunner Architecture, Part 1: Opteron Blades

[Diagram: an LS21 blade contains two Opteron sockets, each with 8 GB of DDR2 memory. Running totals: 4 cores, 14,400,000,000 flop/s.]
8
Roadrunner Architecture, Part 2: Cell Blades

[Diagram: a PowerXCell 8i socket contains one PPE core and eight SPE cores on the Element Interconnect Bus (EIB, 204.8 GB/s). PPE: 3.2 GHz, 6.4 Gflop/s, 32+32 KB L1 cache, 512 KB L2 cache. SPE: 3.2 GHz, 12.8 Gflop/s, 256 KB local store. Running totals: 9 Cell cores, 108,800,000,000 flop/s, plus the 4 Opteron cores (14,400,000,000 flop/s) from Part 1.]
9
Roadrunner Architecture, Part 2: Cell Blades
  • Not your average Cell processor:

Feature                                          Original Cell BE (PlayStation 3)   PowerXCell 8i (Roadrunner)
SPE double-precision floating-point operations   6-cycle stall                      Fully pipelined
                                                 13-cycle latency                   9-cycle latency
                                                 Single issue                       Dual issue
                                                 14.3 Gflop/s                       102.4 Gflop/s
External memory interface                        Rambus XDR                         DDR2
                                                 2 GB limit                         16 GB limit
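The 102.4 Gflop/s entry is the aggregate of the eight SPEs; assuming each fully pipelined SPE retires one 2-way SIMD double-precision fused multiply-add per cycle (4 flops/cycle; an assumption consistent with the per-SPE figure on the previous diagram):

$$8 \times 3.2\ \text{GHz} \times 4\ \tfrac{\text{flops}}{\text{cycle}} = 102.4\ \text{Gflop/s per socket}$$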
10
Roadrunner Architecture, Part 2: Cell Blades

[Diagram: a QS22 blade contains two PowerXCell 8i sockets, each with 4 GB of DDR2 memory. Running totals: 18 Cell cores, 217,600,000,000 flop/s, plus the 4 Opteron cores (14,400,000,000 flop/s).]
11
Roadrunner Architecture, Part 3: Nodes

[Diagram: a triblade node pairs one LS21 Opteron blade with two QS22 Cell blades via HyperTransport x16 (6.4 GB/s) and PCI Express x8 (2 GB/s), with an InfiniBand link (2 GB/s) to the network. Running totals per node: 40 cores, 449,600,000,000 flop/s.]
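The 2 GB/s PCI Express figure is consistent with eight first-generation lanes at 250 MB/s of payload bandwidth each (the link generation is an assumption, not stated on the slide):

$$8 \times 250\ \text{MB/s} = 2\ \text{GB/s}$$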
12
Roadrunner Architecture, Part 4: Scaling Out

[Diagram: a BladeCenter holds 3 triblades (120 cores, 1,348,800,000,000 flop/s); a rack holds 4 BladeCenters (480 cores, 5,395,200,000,000 flop/s).]
13
Roadrunner Architecture, Part 4: Scaling Out

[Diagram: a Compute Unit (CU) comprises 15 racks: 7,200 cores, 80,928,000,000,000 flop/s.]
14
Roadrunner Architecture, Part 4: Scaling Out

[Diagram: Roadrunner comprises 17 CUs: 122,400 cores, 1,375,776,000,000,000 flop/s.]
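The grand totals can be checked from the per-node counts above (3,060 triblade nodes, each with 32 SPEs, 4 PPEs, and 4 Opteron cores):

$$
\begin{aligned}
\text{SPEs:} &\quad 3{,}060 \times 32 \times 12.8\ \text{Gflop/s} = 1{,}253{,}376\ \text{Gflop/s}\\
\text{PPEs:} &\quad 3{,}060 \times 4 \times 6.4\ \text{Gflop/s} = 78{,}336\ \text{Gflop/s}\\
\text{Opterons:} &\quad 3{,}060 \times 4 \times 3.6\ \text{Gflop/s} = 44{,}064\ \text{Gflop/s}\\
\text{Total:} &\quad 1{,}375{,}776\ \text{Gflop/s} \approx 1.38\ \text{Pflop/s}
\end{aligned}
$$

The SPE row is 91% of the total, which is where the next slide's breakdown comes from.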
15
Roadrunner Architecture: Summary of Key Characteristics
  • Hybrid architecture
  • 12,240 Opteron cores for control- or network-intensive routines and irregular memory accesses
  • 12,240 Cell sockets (97,920 SPE cores) for compute-intensive routines with regular memory accesses
  • Equal memory (4 GB) per Opteron core and Cell socket
  • Total of 98 TB memory
  • High peak performance
  • DP peak: 1.38 Pflop/s
  • SP peak: 2.91 Pflop/s
  • Ordinary InfiniBand network
  • Approximately a fat tree (see paper for details)
  • Modest # of nodes (3,060)
  • 91% of performance comes from SPE cores

[Pie chart: SPEs 91%, PPEs 6%, Opterons 3% of peak performance.]
16
Why This Architecture?
  • Attempt to optimize application performance given multiple constraints:
  • Cost
  • Flexibility
  • Power & cooling
  • Floor space
  • Delivery schedule
  • Roadrunner's hybrid architecture deemed the best solution
  • Hybrid may be the new trend in HPC
  • 1970s: HPC is scalar; LANL is an early adopter of vector (Cray-1 #1)
  • 1980s: HPC is vector; LANL is an early adopter of data-parallel (TMC CM-1)
  • 1990s: HPC is data-parallel; LANL is an early adopter of distributed-memory (TMC CM-5)
  • 2000s: HPC is distributed-memory; LANL is an early adopter of hybrid (Roadrunner)

17
Outline
  • Background
  • Architecture
  • Microbenchmark performance
  • Application performance
  • Conclusions

18
Memory Subsystem Performance
  • Indicative of (and helps explain) computation performance
  • Evaluated load/store accesses only (not DMA):
  • SPE: local store
  • PPE: main memory on QS22 blade
  • Opteron: main memory on LS21 blade
  • Memory bandwidth
  • Measured with Stream Triad: A(i) = B(i) + q*C(i) (see the sketch below)
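For concreteness, a minimal C sketch of the Triad kernel; the array size, scalar, and timing harness here are illustrative, not the exact STREAM configuration used for the measurements:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 24)   /* illustrative array size; real STREAM sizes scale with memory */

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    const double q = 3.0;

    for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < N; i++)      /* Triad: a(i) = b(i) + q*c(i) */
        a[i] = b[i] + q * c[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    /* Triad moves 3 doubles per element: 2 loads + 1 store */
    printf("Triad bandwidth: %.2f GB/s\n", 3.0 * N * sizeof(double) / sec / 1e9);
    free(a); free(b); free(c);
    return 0;
}
```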

19
Memory Subsystem Performance
  • PPE core provides 78% more peak flop/s than the Opteron core...
  • ...but the PPE observes only 16% of the Opteron's memory bandwidth
  • Our experience:
  • PPE not fast enough for significant computation
  • PPE speed on small kernels is about 1/3 of Opteron speed
  • On Roadrunner, PPEs are best used for shuttling data between SPEs and Opterons

[Bar chart: Stream Triad bandwidth by core type: SPE (local store) 29.28 GB/s, Opteron 5.41 GB/s, PPE 0.89 GB/s.]
20
Communication Subsystem Performance
  • Key contributor to performance of many parallel applications
  • Complexities of measuring on Roadrunner:
  • Multiple networks: Element Interconnect Bus, FlexIO, PCI Express, HyperTransport, InfiniBand
  • Different low-level protocols: put/get vs. send/receive
  • Our approach:
  • Normalize all protocols to send/receive
  • MPI send/receive for Opteron-Opteron communication
  • DaCS send/receive for PPE-Opteron communication
  • Cell Messaging Layer for SPE-SPE communication
  • Measure ping-pong performance (half round-trip time) across each interconnect type (see the sketch below)
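A minimal MPI ping-pong sketch of the measurement just described; the message size and repetition count are illustrative, and the DaCS and Cell Messaging Layer variants follow the same pattern with their own send/receive calls:

```c
#include <mpi.h>
#include <stdio.h>
#include <string.h>

/* Run with at least two ranks, e.g., mpirun -np 2 ./pingpong */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    enum { REPS = 1000, BYTES = 8 };   /* illustrative message size */
    char buf[BYTES];
    memset(buf, 0, sizeof buf);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < REPS; i++) {
        if (rank == 0) {
            MPI_Send(buf, BYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, BYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, BYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, BYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    /* One-way latency = half the round-trip time */
    if (rank == 0)
        printf("latency: %.2f us\n", (t1 - t0) / REPS / 2.0 * 1e6);
    MPI_Finalize();
    return 0;
}
```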

21
Small-Message Communication Time
[Diagram: one-way small-message latencies across the measured interconnect types: 0.3 µs, 0.8 µs, 2.1 µs, and 3.2 µs.]
22
Large-Message Communication Time
  • PPE-Opteron is the bandwidth bottleneck
  • Same link technology as Opteron-Opteron (PCI Express x8)
  • As SW matures, we expect performance to improve to current MPI/IB performance

[Chart: communication time (µs, log scale, 0.1 to 1000) vs. message size (1 B to 100,000 B).]
23
Outline
  • Background
  • Architecture
  • Microbenchmark performance
  • Application performance
  • Conclusions

24
Early Roadrunner Applications
  • VPIC
  • Particle-in-cell code
  • 7X improvement over Opteron-only
  • SPaSM
  • Short-range molecular dynamics code
  • 6X improvement over Opteron-only
  • Milagro
  • Implicit Monte Carlo code
  • 6X improvement over Opteron-only
  • PetaVision
  • Neuron synapse simulation
  • 1.144 Pflop/s (single prec.)
  • Each ported to Roadrunner by a couple of people in a short period of time
  • Months, not years
  • Most had to learn Cell programming first
  • All had to deal with preproduction HW & SW
  • Relatively few code changes
  • 10-30% of code (est.)
  • Yes, Roadrunner is programmable

25
Challenge: Sweep3D
  • Neutron-transport kernel
  • 3-D global grid with 2-D data decomposition
  • Wavefront communication (see the sketch below):
  • Receive boundaries from upstream
  • Compute
  • Send boundaries downstream
  • Repeat from each of eight corners
  • Hard to get performance at scale
  • Small messages (few KB)
  • Tightly coupled: runs only as fast as the slowest link
  • Tradeoff between frequency of communication and available parallelism
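A hedged C/MPI sketch of one wavefront step; the rank geometry and the compute_block helper are simplified, hypothetical stand-ins for the real Sweep3D internals:

```c
#include <mpi.h>

/* Hypothetical local solver: updates the block and rewrites the x/y
 * boundary arrays with this rank's outgoing boundary values. */
void compute_block(double *block, double *bx, double *by, int nx, int ny);

/* One sweep direction on a 2-D process grid. Each rank waits for its
 * upstream neighbors' boundaries, computes its block, then feeds its
 * downstream neighbors. up_x/up_y/down_x/down_y are the neighbor ranks
 * for this direction, or MPI_PROC_NULL on the grid edge, in which case
 * the corresponding transfer completes immediately as a no-op. */
void sweep(double *block, double *bx, double *by, int nx, int ny,
           int up_x, int up_y, int down_x, int down_y, MPI_Comm comm)
{
    /* Receive boundaries from upstream */
    MPI_Recv(bx, ny, MPI_DOUBLE, up_x, 0, comm, MPI_STATUS_IGNORE);
    MPI_Recv(by, nx, MPI_DOUBLE, up_y, 1, comm, MPI_STATUS_IGNORE);

    /* Compute */
    compute_block(block, bx, by, nx, ny);

    /* Send boundaries downstream, extending the wavefront */
    MPI_Send(bx, ny, MPI_DOUBLE, down_x, 0, comm);
    MPI_Send(by, nx, MPI_DOUBLE, down_y, 1, comm);
}
```

Repeating the step from each of the eight corners gives the full kernel; as the next slide notes, the same MPI-style code can run on the SPEs via the Cell Messaging Layer, one rank per SPE.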

26
Sweep3D on Roadrunner
  • All compute done on SPEs
  • PPEs and Opterons used as smart NICs
  • (Remember: 91% of performance is in the SPEs)
  • Same basic data structures and control flow as conventional Sweep3D
  • Cell Messaging Layer provides MPI for SPEs
  • One MPI rank per SPE
  • Treat Roadrunner as a 97,920-SPE cluster
  • 2X improvement at scale
  • Expect 4X with feasible SW modifications

[Chart: iteration time (s) vs. node count (1 to 3,060) for three configurations: Original Sweep3D (compute on Opterons), Roadrunner Sweep3D (compute on SPEs), and Roadrunner Sweep3D modeled with PCIe bandwidth = IB bandwidth.]
27
Conclusions
  • Roadrunner is a complex architecture
  • Three types of processor cores
  • Multiple internal interconnects
  • Many address spaces per node
  • Good performance is possible, however
  • Even communication-heavy Sweep3D sees 2X improvement over Opteron-only runs, with 4X expected in the future
  • Other applications see 6-7X improvement
  • Hybrid computing may be here to stay
  • Higher performance/watt than other systems
  • Scales well (fewer, more powerful nodes enable smaller networks)
  • Opens up exciting new research possibilities in system architecture, programming models, performance tools, and more

28
Additional Roadrunner Resources
  • Read the paper
  • Visit our Web site
  • http://www.lanl.gov/roadrunner
  • Download our software
  • http://cellmessaging.sf.net/ (Cell Messaging Layer)
  • Stop by our booths
  • Los Alamos National Laboratory (#550)
  • NNSA Advanced Simulation and Computing (#512)
  • Attend our talks
  • Birds of a Feather session (Wednesday evening)
  • Roadrunner: First to a Petaflop, First of a New Breed
  • ACM Gordon Bell Finalists session (Thursday morning)
  • 0.374 Pflop/s Trillion-particle Particle-in-cell Modeling of Laser Plasma Interactions on Roadrunner
  • 369 Tflop/s Molecular Dynamics Simulations on the Roadrunner General-purpose Heterogeneous Supercomputer

29
Backup Slides
30
Intended Uses of Roadrunner
  • Ensure the safety and reliability of the nation's nuclear weapons stockpile
  • ... also ...
  • Research into:
  • astronomy
  • climate change
  • cosmology
  • energy
  • human genome science
  • material science

31
Programming Roadrunner
[Diagram: application groupings (SPaSM, Milagro, PetaVision) and (VPIC, Sweep3D) label the programming approaches below.]
  • Multiple low-level communication libraries
  • MPI for Opteron-Opteron communication
  • DaCS for PPE-Opteron communication
  • SPU intrinsics for PPE-SPE and SPE-SPE communication
  • Alternatives
  • PetaVision: ALF for PPEs to coordinate SPEs
  • Sweep3D: Cell Messaging Layer for SPE-to-remote-SPE communication

32
Why Doesn't Roadrunner...

...use only Cells?                          Need to run legacy codes
...put the Cells on InfiniBand?             Cost
...use GPUs instead of Cells?               Ease of programming and double-precision performance
...do something else completely different?  Probably one of cost, flexibility, power & cooling, floor space, or delivery schedule
33
Hybrid vs. Conventional
Characteristic          Roadrunner     Jaguar (XT5 only)
Peak perf. (Pflop/s)    1.38           1.38
LINPACK % of peak       76%            77%
CPU type                Opteron+Cell   Opteron
Node count              3,060          18,772
Core count              122,400        150,176
Power (MW), measured    2.35           6.95

  • Same peak flop/s, but Jaguar has
  • 6X the number of nodes
  • 23% more cores
  • 3X the power requirement
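The bulleted ratios follow directly from the table:

$$\frac{18{,}772}{3{,}060} \approx 6.1\times,\qquad \frac{150{,}176}{122{,}400} \approx 1.23,\qquad \frac{6.95}{2.35} \approx 3.0\times$$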