Proprietary or Commodity? Interconnect Performance in Large-scale Supercomputers

1
Proprietary or Commodity? Interconnect
Performance in Large-scale Supercomputers

Author: Olli-Pekka Lehto
Supervisor: Prof. Jorma Virtamo
Instructor: D.Sc. (Tech.) Jussi Heikonen

2
Contents

Introduction
Background
Objectives
Test platforms
Testing methodology
Results
Conclusions

3
Introduction

A modern High Performance Computing (HPC) system
consists of
Login nodes, service nodes, I/O nodes and compute
nodes
Usually multicore/SMP
Interconnect network(s) which link all nodes
together
Applications are run in parallel
The individual tasks of a parallel application
exchange data via the interconnect
The interconnect plays a critical role in the
overall performance of the parallel application
Commodity system
Use of off-the-shelf components
Leverages economies of scale involved in the PC
industry
Industry standard architectures which allow for
the system to be extended in a vendor-independent
fashion
Proprietary system
A highly integrated system designed specifically
for HPC
May contain some off-the-shelf components but
cannot be extended in a vendor-independent
fashion

4
Background: The Cluster Revolution
In the last decade clusters have rapidly become
the system architecture of choice in HPC. The
high end of the market is still dominated by
proprietary MPP systems.
Source: http://www.top500.org
5
Background: The Architecture Evolution
The architectures of proprietary and commodity
HPC systems have been converging. Nowadays it's
difficult to differentiate between the two.
Increasing R&D costs drive a move towards
commodity components
Competition between AMD and Intel
Competition in specialized cluster interconnects
The interconnect network is a key differentiating
factor between commodity and proprietary
architectures.
Disclaimer: IBM Blue Gene is a notable exception
6
Problem Statement
Does a supercomputer architecture with a
proprietary interconnection network offer a
significant performance advantage compared to a
similarly sized cluster with a commodity network?

7
Test Platforms

In 2007 CSC - Scientific Computing Ltd. conducted
a 10M procurement to acquire new HPC systems
The procurement was split into two parts
Lot 1: Capability computing
Massively parallel Grand Challenge computation
Lot 2: Capacity computing
Sequential and small to medium size parallel
problems
Problems requiring large amounts of memory

8
Test Platform 1: Louhi

Winner of Lot 1: capability computing
Cray XT4
Phase 1 (2007): 2024 2.6 GHz AMD Opteron cores
(dual core), 10.5 TFlop/s
Phase 2 (2008): 6736 2.6 GHz AMD Opteron cores
(quad core *), 70.1 TFlop/s
Unicos/lc operating system
Linux on the login and service nodes
Catamount microkernel on compute nodes
Proprietary SeaStar2 interconnection network
3-dimensional torus topology
Each node connected to 6 neighbors with
7.6 GByte/s links
Each node has a router integrated into the NIC
NIC connected directly to AMD HyperTransport bus
NIC has an onboard CPU for protocol processing
(protocol offloading)
Remote Direct Memory Access (RDMA)

* 4 Flops/cycle
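
The peak figures quoted for the two phases can be reproduced with a short calculation. This is a minimal sketch: the function name is illustrative, and the value of 2 flops/cycle for the dual-core parts is an assumption (the slide only states 4 flops/cycle for the quad-core parts), chosen because it matches the quoted 10.5 TFlop/s.

```python
def peak_tflops(cores: int, clock_ghz: float, flops_per_cycle: int) -> float:
    """Theoretical peak = cores x clock (GHz) x flops per cycle, in TFlop/s."""
    gflops = cores * clock_ghz * flops_per_cycle
    return gflops / 1000.0


# Louhi phase 1: dual-core Opterons, ASSUMED 2 flops/cycle per core.
louhi_phase1 = peak_tflops(2024, 2.6, 2)

# Louhi phase 2: quad-core Opterons, 4 flops/cycle per core (from the slide).
louhi_phase2 = peak_tflops(6736, 2.6, 4)

print(f"Phase 1: {louhi_phase1:.1f} TFlop/s")
print(f"Phase 2: {louhi_phase2:.1f} TFlop/s")
```

These are theoretical peaks only; sustained application performance depends on memory and interconnect behavior, which is what the rest of the presentation measures.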
9
Test Platform 1: Louhi
Source: Cray Inc.
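
The 3-dimensional torus described for Louhi can be illustrated with a small sketch that lists the six neighbors of a node. The coordinate scheme, the function name and the 4 x 4 x 4 size are assumptions for illustration only, not Louhi's actual dimensions or Cray's node numbering.

```python
from typing import List, Tuple

Coord = Tuple[int, int, int]


def torus_neighbors(node: Coord, dims: Coord) -> List[Coord]:
    """Return the 6 neighbors of a node in a 3-D torus.

    Each dimension wraps around, so a node on the "edge" is linked
    to the node on the opposite side.
    """
    x, y, z = node
    nx, ny, nz = dims
    return [
        ((x + 1) % nx, y, z), ((x - 1) % nx, y, z),
        (x, (y + 1) % ny, z), (x, (y - 1) % ny, z),
        (x, y, (z + 1) % nz), (x, y, (z - 1) % nz),
    ]


# Hypothetical 4 x 4 x 4 torus; the corner node still has 6 neighbors.
neighbors = torus_neighbors((0, 0, 0), (4, 4, 4))
print(neighbors)
```

The wraparound links are what distinguish a torus from a plain mesh: every node has the same number of links regardless of its position.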
10
Test Platform 2: Murska

Winner of Lot 2: capacity computing
HP CP4000BL XC blade cluster
2048 2.6 GHz AMD Opteron cores (dual-core), 10.6
TFlop/s
Dual-socket HP BL465c server blades
HPC Linux operating system
HP's turnkey cluster OS based on RHEL
InfiniBand interconnect network
A multipurpose high-speed network
Fat tree topology (blocking)
24-port blade enclosure switches
16 x 16 Gbit/s DDR *) downlinks
8 x 16 Gbit/s DDR uplinks, running at 8 Gbit/s
288-port master switch with SDR **) ports
(8 Gbit/s)
Recently upgraded to DDR
Host Channel Adapters (HCA) connected to 8x PCIe
buses
Remote Direct Memory Access (RDMA)

*) Double Data Rate  **) Single Data Rate
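
The "blocking" property of the fat tree can be made concrete by comparing the aggregate downlink and uplink bandwidth of one blade enclosure switch. This is a rough sketch using the link counts and rates listed above; the function name is illustrative, and the second figure assumes that after the DDR upgrade all 8 uplinks run at the full 16 Gbit/s.

```python
def oversubscription(down_links: int, down_gbit: float,
                     up_links: int, up_gbit: float) -> float:
    """Ratio of aggregate downlink to aggregate uplink bandwidth.

    A ratio above 1.0 means the switch is blocking: the nodes below it
    can offer more traffic than the uplinks can carry.
    """
    return (down_links * down_gbit) / (up_links * up_gbit)


# As listed: 16 DDR downlinks, 8 uplinks limited to SDR speed (8 Gbit/s).
ratio_sdr_uplinks = oversubscription(16, 16.0, 8, 8.0)

# ASSUMPTION: after the upgrade the uplinks run at DDR speed (16 Gbit/s).
ratio_ddr_uplinks = oversubscription(16, 16.0, 8, 16.0)

print(f"SDR uplinks: {ratio_sdr_uplinks:.0f}:1")
print(f"DDR uplinks: {ratio_ddr_uplinks:.0f}:1")
```

Under these assumptions traffic leaving an enclosure is oversubscribed even after the upgrade, which matters mainly for jobs that span several enclosures.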
11
Test Methodology

Testing of individual parameters with
microbenchmarks
End-to-end communication latency and bandwidth
Communication processing overhead
Consistency of performance across the system
Testing of real-world behavior with a scientific
application
Gromacs: a popular open source molecular
dynamics application
All measurements use Message Passing Interface
(MPI)
MPI is by far the most popular parallel
programming application programming interface
(API) for HPC
Murska uses HP's HP-MPI implementation
Louhi uses a Cray-modified version of the MPICH2
implementation

12
End-to-end Latency

Intel MPI Benchmarks (IMB) PingPong test was used
Measures the latency to send and receive a single
point-to-point message as a function of the
message size
Arguably the most popular metric of interconnect
performance (esp. short messages)
Murska's HP-MPI has 2 modes of operation (RDMA
and SRQ)
RDMA requires 256 kbytes of memory per MPI task
while SRQ (Shared Receive Queue) has a constant
memory requirement

SRQ causes a notable latency overhead
As the message size grows, Louhi outperforms
Murska
Murska using RDMA has a slightly lower
latency with short messages
13
End-to-end Bandwidth

IMB PingPong test with large message sizes
Bandwidth is the message size divided by the
measured latency
Test was done between nearest-neighbor nodes

Louhi has still not reached peak bandwidth with
the largest message size (4 megabytes)
The gap between RDMA and SRQ narrows as the link
becomes saturated: SRQ doesn't affect
performance with large message sizes
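
The relation between the PingPong latency and the bandwidth figures can be sketched as follows. The sketch assumes the usual PingPong convention that the reported latency is half of the measured round-trip time; the function name and the timing value are hypothetical placeholders, not measurements from Louhi or Murska.

```python
from typing import Tuple


def pingpong_metrics(message_bytes: int,
                     round_trip_seconds: float) -> Tuple[float, float]:
    """Derive one-way latency and bandwidth from a ping-pong round trip.

    Latency is taken as half the round-trip time; bandwidth is the
    message size divided by that one-way time.
    """
    one_way_seconds = round_trip_seconds / 2.0
    bandwidth_bytes_per_s = message_bytes / one_way_seconds
    return one_way_seconds, bandwidth_bytes_per_s


# Hypothetical example: a 4 MiB message with a 4 ms round trip.
latency, bandwidth = pingpong_metrics(4 * 1024 * 1024, 0.004)
print(f"latency   = {latency * 1e3:.1f} ms")
print(f"bandwidth = {bandwidth / 1e9:.2f} GByte/s")
```

For short messages the fixed per-message cost dominates, so latency is the useful number; for large messages the same measurement is better read as bandwidth.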
14
Communication Processing Overhead

Measure of how much communication stresses the
CPU
The C in HPC stands for Computing, not
Communication :)
MPI has asynchronous communication routines which
overlap communication and computation
This requires autonomous communication mechanisms
Murska has RDMA, Louhi has protocol offloading
and RDMA
Sandia Nat'l Labs SMB benchmark was used
See how much work one process gets done while
another process communicates with it constantly.
Result as application availability percentage
100%: communication is performed completely in
the background
0%: communication is performed completely in the
foreground, no work done
Separate results for getting work done at the
sender and at the receiver side
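
The availability percentage can be illustrated with a simplified calculation. This is a sketch of the idea only, not the exact formula used by the Sandia benchmark: it assumes that a fixed amount of work is timed once without communication and once while the peer process communicates constantly. The function and parameter names are hypothetical.

```python
def availability_percent(work_seconds_alone: float,
                         work_seconds_with_comm: float) -> float:
    """Share of the CPU left for the application during communication.

    100% means the work takes no longer than without communication
    (fully overlapped); values near 0% mean communication processing
    consumes almost all of the CPU time.
    """
    if work_seconds_alone <= 0 or work_seconds_with_comm < work_seconds_alone:
        raise ValueError("expected 0 < time alone <= time with communication")
    return 100.0 * work_seconds_alone / work_seconds_with_comm


# Hypothetical timings, for illustration only.
fully_overlapped = availability_percent(1.0, 1.0)
heavily_loaded = availability_percent(1.0, 4.0)
print(fully_overlapped, heavily_loaded)
```

A high value indicates that the NIC handles the transfer on its own, which is what protocol offloading and RDMA are meant to achieve.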

15
Receiver Side Availability
100% availability: communication does not
interfere with processing at all
Louhi's availability improves with 8k-128k
messages
Murska's availability drops significantly between
16k and 32k
0% availability: processing communication hogs
the CPU completely
16
Sender Side Availability
Louhi's availability improves dramatically with
large messages: the offload engine can process
packets autonomously.
17
Gromacs

A popular and mature molecular dynamics
simulation package
Open source, downloadable from
http://www.gromacs.org
Programmed with MPI
Designed to exploit overlapping computation and
communication, if available
Parallel speedup was measured by using a fixed
size molecular system
Run times for task counts from 16 to 128 were
measured
MPI calls were profiled
How much time spent in communication subroutines?
Which subroutines were the most time-consuming?
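
Parallel speedup for a fixed-size problem can be computed from the measured run times as sketched below. The run times in the example are placeholders for illustration, not the Gromacs results from Louhi or Murska, and the function names are hypothetical. The smallest task count (16) is used as the baseline because no single-task run is mentioned.

```python
from typing import Dict


def relative_speedup(run_times: Dict[int, float],
                     base_tasks: int) -> Dict[int, float]:
    """Speedup of each run relative to the run with base_tasks tasks."""
    base_time = run_times[base_tasks]
    return {tasks: base_time / t for tasks, t in run_times.items()}


def parallel_efficiency(run_times: Dict[int, float],
                        base_tasks: int) -> Dict[int, float]:
    """Speedup divided by the increase in task count (1.0 is ideal)."""
    speedup = relative_speedup(run_times, base_tasks)
    return {tasks: s * base_tasks / tasks for tasks, s in speedup.items()}


# Placeholder run times in seconds, keyed by MPI task count.
example_times = {16: 100.0, 32: 50.0, 64: 40.0, 128: 40.0}
speedup = relative_speedup(example_times, base_tasks=16)
efficiency = parallel_efficiency(example_times, base_tasks=16)
print(speedup)
print(efficiency)
```

Scaling "stops" at the task count where adding tasks no longer reduces the run time, which shows up here as a flat speedup and a falling efficiency.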

18
Gromacs Run Time
Murska's scaling stops at 32 tasks
Louhi's scaling stops at 64 tasks
Time spent in MPI communication routines starts
increasing at 64 tasks
19
MPI Call Profile
Fraction of the total MPI time spent in a
specific MPI call
With small message sizes the MPI_Alltoall (all
processes send a message to each other) dominates
the time spent in MPI.
MPI_Wait (wait for an asynchronous message
transfer to complete) starts dominating time
usage on Murska as the task count grows.
MPI_Alltoall starts dominating MPI time usage on
Louhi again as the task count grows.
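
The per-call fractions shown in such a profile can be derived from accumulated timings as sketched below. The timing values are placeholders, not profiling data from either system, and the function name is hypothetical; only the MPI call names are real.

```python
from typing import Dict


def mpi_time_fractions(seconds_per_call: Dict[str, float]) -> Dict[str, float]:
    """Fraction of the total MPI time spent in each MPI call."""
    total = sum(seconds_per_call.values())
    if total <= 0:
        raise ValueError("total MPI time must be positive")
    return {name: t / total for name, t in seconds_per_call.items()}


# Placeholder timings in seconds, for illustration only.
example_profile = {"MPI_Alltoall": 6.0, "MPI_Wait": 3.0, "MPI_Sendrecv": 1.0}
fractions = mpi_time_fractions(example_profile)

# Print the calls from most to least time-consuming.
for name, share in sorted(fractions.items(), key=lambda kv: -kv[1]):
    print(f"{name:14s} {share:.0%}")
```

The fractions are relative to MPI time only, so a call can dominate the profile even when total communication time is small.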
20
Conclusions

On Murska, a tradeoff has to be made with large
parallel problems
SRQ: sacrifice latency in favor of memory
capacity
RDMA: sacrifice memory capacity in favor of
latency
Murska is able to outperform Louhi in some
benchmarks
Especially in short message performance in RDMA
mode
Louhi was more consistent in providing low
processing overhead
Being able to overlap long messages tends to be
more important than short messages as they take
more time to complete
Gromacs scaled significantly better on Louhi
Most likely largely due to lower communication
processing overhead
A proprietary system still has its place
The interconnect is designed from the ground up
to handle MPI communication and HPC workloads
The streamlined microkernel also helps out
Focusing only on hero numbers (e.g. short
message latency) can be misleading