Title: Proprietary or Commodity? Interconnect Performance in Large-scale Supercomputers
1 Proprietary or Commodity? Interconnect
Performance in Large-scale Supercomputers
- Author: Olli-Pekka Lehto
- Supervisor: Prof. Jorma Virtamo
- Instructor: D.Sc. (Tech.) Jussi Heikonen
2 Contents
- Introduction
- Background
- Objectives
- Test platforms
- Testing methodology
- Results
- Conclusions
3 Introduction
- A modern High Performance Computing (HPC) system consists of
  - Login nodes, service nodes, I/O nodes, compute nodes
    - Usually multicore/SMP
  - Interconnect network(s) which link all nodes together
- Applications are run in parallel
  - The individual tasks of a parallel application exchange data via the interconnect
  - The interconnect plays a critical role in the overall performance of the parallel application
- Commodity system
  - Use of off-the-shelf components
  - Leverages economies of scale involved in the PC industry
  - Industry standard architectures which allow for the system to be extended in a vendor-independent fashion
- Proprietary system
  - A highly integrated system designed specifically for HPC
  - May contain some off-the-shelf components but cannot be extended in a vendor-independent fashion
4 Background: The Cluster Revolution
In the last decade clusters have rapidly become
the system architecture of choice in HPC. The
high end of the market is still dominated by
proprietary MPP systems.
Source: http://www.top500.org
5 Background: The Architecture Evolution
The architectures of proprietary and commodity
HPC systems have been converging. Nowadays it is
difficult to differentiate between the two.
Increasing R&D costs drive a move towards
commodity components
Competition between AMD and Intel
Competition in specialized cluster interconnects
The interconnect network is a key differentiating
factor between commodity and proprietary
architectures.
Disclaimer: IBM Blue Gene is a notable exception
6 Problem Statement
Does a supercomputer architecture with a
proprietary interconnection network offer a
significant performance advantage compared to a
similarly sized cluster with a commodity network?
7 Test Platforms
- In 2007 CSC - Scientific Computing Ltd. conducted a 10M procurement to acquire new HPC systems
- The procurement was split into two parts
  - Lot 1: Capability computing
    - Massively parallel Grand Challenge computation
  - Lot 2: Capacity computing
    - Sequential and small to medium size parallel problems
    - Problems requiring large amounts of memory
8 Test Platform 1: Louhi
- Winner of Lot 1: capability computing
- Cray XT4
  - Phase 1 (2007): 2024 2.6 GHz AMD Opteron cores (dual core), 10.5 TFlop/s
  - Phase 2 (2008): 6736 2.6 GHz AMD Opteron cores (quad core, 4 Flops/cycle), 70.1 TFlop/s
- Unicos/lc operating system
  - Linux on the login and service nodes
  - Catamount microkernel on compute nodes
- Proprietary SeaStar2 interconnection network
  - 3-dimensional torus topology
  - Each node connected to 6 neighbors with 7.6 GByte/s links
  - Each node has a router integrated into the NIC
  - NIC connected directly to the AMD HyperTransport bus
  - NIC has an onboard CPU for protocol processing (protocol offloading)
  - Remote Direct Memory Access (RDMA)
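The peak figures quoted for Louhi are consistent with the usual product of core count, clock rate and floating-point operations per cycle. The Python sketch below reproduces them. The value of 2 Flops/cycle for the dual-core Opterons is an assumption (the slide only states 4 Flops/cycle, presumably for the quad-core parts), and the function name is illustrative.

```python
def peak_tflops(cores: int, clock_ghz: float, flops_per_cycle: int) -> float:
    """Theoretical peak in TFlop/s: cores x clock (GHz) x Flops per cycle."""
    return cores * clock_ghz * flops_per_cycle / 1000.0

# Phase 1: dual-core Opterons, ASSUMING 2 Flops/cycle (not stated on the slide).
phase1 = peak_tflops(2024, 2.6, 2)
# Phase 2: quad-core Opterons at the 4 Flops/cycle noted on the slide.
phase2 = peak_tflops(6736, 2.6, 4)

print(f"Phase 1: {phase1:.1f} TFlop/s")
print(f"Phase 2: {phase2:.1f} TFlop/s")
```

With these inputs the results round to the 10.5 and 70.1 TFlop/s given for the two phases.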
9 Test Platform 1: Louhi
Source: Cray Inc.
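In a 3-dimensional torus every node has six neighbors, one in each direction along each axis, with wrap-around links at the edges. A minimal Python sketch of that addressing follows. The grid dimensions are hypothetical and do not describe Louhi's actual layout.

```python
from typing import List, Tuple

Coord = Tuple[int, int, int]

def torus_neighbors(node: Coord, dims: Coord) -> List[Coord]:
    """Return the six neighbors of `node` in a 3-D torus of size `dims`."""
    neighbors = []
    for axis in range(3):
        for step in (-1, 1):
            coord = list(node)
            # Modulo arithmetic provides the wrap-around links of the torus.
            coord[axis] = (coord[axis] + step) % dims[axis]
            neighbors.append(tuple(coord))
    return neighbors

# Hypothetical 8 x 8 x 8 torus; a corner node still has six neighbors.
print(torus_neighbors((0, 0, 0), (8, 8, 8)))
```

The wrap-around is what distinguishes a torus from a plain mesh: no node sits on an edge, so every node sees the same number of links.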
10 Test Platform 2: Murska
- Winner of Lot 2: capacity computing
- HP CP4000BL XC blade cluster
  - 2048 2.6 GHz AMD Opteron cores (dual-core), 10.6 TFlop/s
  - Dual-socket HP BL465c server blades
- HPC Linux operating system
  - HP's turnkey cluster OS based on RHEL
- InfiniBand interconnect network
  - A multipurpose high-speed network
  - Fat tree topology (blocking)
  - 24-port blade enclosure switches
    - 16 x 16 Gbit/s DDR downlinks
    - 8 x 16 Gbit/s DDR uplinks, running at 8 Gbit/s
  - 288-port master switch with SDR ports (8 Gbit/s)
    - Recently upgraded to DDR
  - Host Channel Adapters (HCA) connected to 8x PCIe buses
  - Remote Direct Memory Access (RDMA)
DDR: Double Data Rate. SDR: Single Data Rate.
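The "blocking" label can be quantified from the port counts above: each enclosure switch offers more downlink capacity to the blades than uplink capacity to the master switch. A rough Python calculation follows. The second case, with the uplinks at the full DDR rate, is an assumption about the state after the upgrade.

```python
def oversubscription(down_ports: int, down_gbit: float,
                     up_ports: int, up_gbit: float) -> float:
    """Ratio of aggregate downlink to aggregate uplink bandwidth of a switch."""
    return (down_ports * down_gbit) / (up_ports * up_gbit)

# Enclosure switch: 16 DDR downlinks at 16 Gbit/s, 8 uplinks running at 8 Gbit/s.
sdr_ratio = oversubscription(16, 16, 8, 8)
# Same switch if the uplinks ran at the full DDR rate (ASSUMED post-upgrade state).
ddr_ratio = oversubscription(16, 16, 8, 16)

print(f"Uplinks at 8 Gbit/s:  {sdr_ratio:.0f}:1")
print(f"Uplinks at 16 Gbit/s: {ddr_ratio:.0f}:1")
```

A ratio above 1:1 means that traffic leaving an enclosure can be limited by the uplinks when all blades communicate off-enclosure at once.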
11 Test Methodology
- Testing of individual parameters with microbenchmarks
  - End-to-end communication latency and bandwidth
  - Communication processing overhead
  - Consistency of performance across the system
- Testing of real-world behavior with a scientific application
  - Gromacs: a popular open source molecular dynamics application
- All measurements use the Message Passing Interface (MPI)
  - MPI is by far the most popular parallel programming application programming interface (API) for HPC
  - Murska uses HP's HP-MPI implementation
  - Louhi uses a Cray-modified version of the MPICH2 implementation
12 End-to-end Latency
- Intel MPI Benchmarks (IMB) PingPong test was used
  - Measures the latency to send and receive a single point-to-point message as a function of the message size
  - Arguably the most popular metric of interconnect performance (esp. short messages)
- Murska's HP-MPI has 2 modes of operation (RDMA and SRQ)
  - RDMA requires 256 kbytes of memory per MPI task while SRQ (Shared Receive Queue) has a constant memory requirement
SRQ causes a notable latency overhead
As the message size grows, Louhi outperforms
Murska
Murska using RDMA has a slightly lower
latency with short messages
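The memory side of the RDMA/SRQ tradeoff can be estimated with simple arithmetic. The Python sketch below reads the 256 kbytes as buffer space that each process reserves for every other task in the job. This is an interpretation of the slide rather than something it states, and the task counts are illustrative.

```python
RDMA_BYTES_PER_PEER = 256 * 1024  # 256 kbytes, the figure quoted for RDMA mode

def rdma_buffer_mib(num_tasks: int) -> float:
    """Estimated RDMA buffer memory per process, in MiB.

    ASSUMPTION: each process reserves one 256-kbyte buffer for every
    other task in the job; the slide does not spell this out.
    """
    return (num_tasks - 1) * RDMA_BYTES_PER_PEER / 2**20

for tasks in (128, 1024, 2048):
    print(f"{tasks:5d} tasks: {rdma_buffer_mib(tasks):7.2f} MiB per process")
```

Under this reading the RDMA requirement grows linearly with the job size while SRQ stays constant, so the choice between the two modes matters mainly for large jobs.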
13 End-to-end Bandwidth
- IMB PingPong test with large message sizes
  - Bandwidth is the message size divided by the transfer time, i.e. the inverse of the per-byte latency
  - Test was done between nearest-neighbor nodes
Louhi has still not reached peak bandwidth with
the largest message size (4 megabytes)
The gap between RDMA and SRQ narrows as the link
becomes saturated: SRQ doesn't affect
performance with large message sizes
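Bandwidth follows from the same PingPong measurement: the message size divided by the one-way transfer time. A small Python sketch is given below, assuming the usual PingPong convention of reporting half of the round-trip time. The sample timing is made up for illustration and is not a measured result from either system.

```python
def one_way_time_us(round_trip_time_us: float) -> float:
    """One-way time in microseconds, taken as half of the round trip."""
    return round_trip_time_us / 2.0

def bandwidth_mbytes_per_s(message_bytes: int, transfer_time_us: float) -> float:
    """Bandwidth in MByte/s (10**6 bytes/s) from a one-way time in microseconds."""
    # Bytes per microsecond is numerically equal to 10**6 bytes per second.
    return message_bytes / transfer_time_us

# HYPOTHETICAL sample: a 4-megabyte message with an 8000 microsecond round trip.
size = 4 * 2**20
t = one_way_time_us(8000.0)
print(f"{bandwidth_mbytes_per_s(size, t):.1f} MByte/s")
```

For short messages the fixed startup latency dominates the transfer time, which is why bandwidth is only meaningful at the large-message end of the curve.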
14 Communication Processing Overhead
- Measure of how much communication stresses the CPU
  - The C in HPC stands for Computing, not Communication
- MPI has asynchronous communication routines which overlap communication and computation
  - This requires autonomous communication mechanisms
  - Murska has RDMA, Louhi has protocol offloading and RDMA
- Sandia Nat'l Labs SMB benchmark was used
  - See how much work one process gets done while another process communicates with it constantly
  - Result as application availability percentage
    - 100%: communication is performed completely in the background
    - 0%: communication is performed completely in the foreground, no work done
  - Separate results for getting work done at the sender and at the receiver side
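The availability percentage can be thought of as the share of the message transfer time during which the CPU remains free for application work. The Python sketch below uses one plausible formulation in the spirit of the overhead test. It is not taken from the benchmark's source code, and the parameter names and sample timings are assumptions.

```python
def availability_percent(work_t: float, iter_t: float, base_t: float) -> float:
    """Application availability in percent.

    work_t: time for the compute work alone
    iter_t: time for the same work while a message transfer is in flight
    base_t: time for the message transfer alone
    """
    overhead = iter_t - work_t  # CPU time taken away by communication
    return 100.0 * (1.0 - overhead / base_t)

# Transfer fully overlapped with the work: nothing is added to the work time.
print(availability_percent(work_t=10.0, iter_t=10.0, base_t=2.0))
# Transfer handled entirely in the foreground: it adds its full duration.
print(availability_percent(work_t=10.0, iter_t=12.0, base_t=2.0))
```

The two calls correspond to the 100% and 0% end points described above; real measurements fall somewhere in between and vary with message size.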
15 Receiver Side Availability
100% availability: communication does not
interfere with processing at all
Louhi's availability improves with 8k-128k
messages
Murska's availability drops significantly between
16k and 32k
0% availability: processing communication hogs
the CPU completely
16 Sender Side Availability
Louhi's availability improves dramatically with
large messages: the offload engine can process
packets autonomously.
17 Gromacs
- A popular and mature molecular dynamics simulation package
  - Open Source, downloadable from http://www.gromacs.org
  - Programmed with MPI
  - Designed to exploit overlapping computation and communication, if available
- Parallel speedup was measured by using a fixed size molecular system
  - Run times for task counts from 16 to 128 were measured
- MPI calls were profiled
  - How much time is spent in communication subroutines?
  - Which subroutines were the most time-consuming?
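With a fixed-size system, speedup and parallel efficiency follow directly from the measured run times. The Python sketch below takes the smallest task count as the baseline. The run times are invented placeholders, not the measurements behind the results.

```python
from typing import Dict

def speedup(run_times: Dict[int, float]) -> Dict[int, float]:
    """Speedup of each task count relative to the smallest task count."""
    base_tasks = min(run_times)
    return {n: run_times[base_tasks] / t for n, t in sorted(run_times.items())}

def efficiency(run_times: Dict[int, float]) -> Dict[int, float]:
    """Parallel efficiency: speedup divided by the increase in task count."""
    base_tasks = min(run_times)
    return {n: s / (n / base_tasks) for n, s in speedup(run_times).items()}

# PLACEHOLDER run times in seconds for 16 to 128 tasks (not measured data).
times = {16: 400.0, 32: 220.0, 64: 150.0, 128: 160.0}
eff = efficiency(times)
for n, s in speedup(times).items():
    print(f"{n:4d} tasks: speedup {s:.2f}, efficiency {eff[n]:.2f}")
```

Scaling can be said to stop at the task count after which the run time no longer decreases, i.e. where the speedup curve flattens or turns down.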
18 Gromacs Run Time
Murska's scaling stops at 32 tasks
Louhi's scaling stops at 64 tasks
Time spent in MPI communication routines starts
increasing at 64 tasks
19 MPI Call Profile
Fraction of the total MPI time spent in a
specific MPI call
With small message sizes the MPI_Alltoall (all
processes send a message to each other) dominates
the time spent in MPI.
MPI_Wait (wait for an asynchronous message
transfer to complete) starts dominating time
usage on Murska as the task count grows.
MPI_Alltoall starts dominating MPI time usage on
Louhi again as the task count grows.
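The profile reports each call's share of the total MPI time. Computing that share from per-call timings is straightforward, as the Python sketch below shows. The timings are hypothetical, and the "other" bucket is an illustrative catch-all for the remaining calls.

```python
from typing import Dict

def mpi_time_fractions(call_times: Dict[str, float]) -> Dict[str, float]:
    """Fraction of the total MPI time spent in each MPI call."""
    total = sum(call_times.values())
    if total == 0:
        return {name: 0.0 for name in call_times}
    return {name: t / total for name, t in call_times.items()}

# HYPOTHETICAL per-call times in seconds for a single run.
profile = {"MPI_Alltoall": 6.0, "MPI_Wait": 3.0, "other": 1.0}
fractions = mpi_time_fractions(profile)
dominant = max(fractions, key=fractions.get)
print(dominant, f"{fractions[dominant]:.0%}")
```

Because the fractions are relative to MPI time only, a call can dominate the profile even when MPI as a whole is a small part of the run time.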
20 Conclusions
- On Murska, a tradeoff has to be made with large parallel problems
  - SRQ: sacrifice latency in favor of memory capacity
  - RDMA: sacrifice memory capacity in favor of latency
- Murska is able to outperform Louhi in some benchmarks
  - Especially in short message performance in RDMA mode
- Louhi was more consistent in providing low processing overhead
  - Being able to overlap long messages tends to be more important than short messages as they take more time to complete
- Gromacs scaled significantly better on Louhi
  - Most likely largely due to lower communication processing overhead
- A proprietary system still has its place
  - The interconnect is designed from the ground up to handle MPI communication and HPC workloads
  - The streamlined microkernel also helps out
- Focusing only on hero numbers (e.g. short message latency) can be misleading
21 Questions?
Thank you!