1
Proprietary or Commodity? Interconnect
Performance in Large-scale Supercomputers
  • Author: Olli-Pekka Lehto
  • Supervisor: Prof. Jorma Virtamo
  • Instructor: D.Sc. (Tech.) Jussi Heikonen

2
Contents
  • Introduction
  • Background
  • Objectives
  • Test platforms
  • Testing methodology
  • Results
  • Conclusions

3
Introduction
  • A modern High Performance Computing (HPC) system
    consists of:
    • Login nodes, service nodes, I/O nodes and compute
      nodes, usually multicore/SMP
    • Interconnect network(s) which link all nodes
      together
  • Applications are run in parallel
    • The individual tasks of a parallel application
      exchange data via the interconnect
    • The interconnect plays a critical role in the
      overall performance of the parallel application
  • Commodity system
    • Built from off-the-shelf components
    • Leverages the economies of scale of the PC
      industry
    • Industry-standard architectures allow the system
      to be extended in a vendor-independent fashion
  • Proprietary system
    • A highly integrated system designed specifically
      for HPC
    • May contain some off-the-shelf components, but
      cannot be extended in a vendor-independent
      fashion

4
Background: The Cluster Revolution
In the last decade clusters have rapidly become
the system architecture of choice in HPC. The
high end of the market is still dominated by
proprietary MPP systems.
Source: http://www.top500.org
5
Background: The Architecture Evolution
The architectures of proprietary and commodity
HPC systems have been converging; nowadays it is
difficult to differentiate between the two.
Increasing R&D costs drive the move towards
commodity components
Competition between AMD and Intel
Competition in specialized cluster interconnects
The interconnect network is a key differentiating
factor between commodity and proprietary
architectures.
Disclaimer: IBM Blue Gene is a notable exception
6
Problem Statement
Does a supercomputer architecture with a
proprietary interconnection network offer a
significant performance advantage compared to a
similarly sized cluster with a commodity network?

7
Test Platforms
  • In 2007 CSC - Scientific Computing Ltd. conducted
    a 10M procurement to acquire new HPC systems
  • The procurement was split into two parts
  • Lot 1: Capability computing
    • Massively parallel Grand Challenge computation
  • Lot 2: Capacity computing
    • Sequential and small- to medium-sized parallel
      problems
    • Problems requiring large amounts of memory

8
Test Platform 1: Louhi
  • Winner of Lot 1: capability computing
  • Cray XT4
    • Phase 1 (2007): 2024 2.6 GHz AMD Opteron cores
      (dual-core), 10.5 TFlop/s
    • Phase 2 (2008): 6736 2.6 GHz AMD Opteron cores
      (quad-core *), 70.1 TFlop/s
  • Unicos/lc operating system
    • Linux on the login and service nodes
    • Catamount microkernel on the compute nodes
  • Proprietary SeaStar2 interconnection network
    • 3-dimensional torus topology
    • Each node connected to 6 neighbors with
      7.6 GByte/s links
    • Each node has a router integrated into the NIC
    • NIC connected directly to the AMD HyperTransport
      bus
    • NIC has an onboard CPU for protocol processing
      (protocol offloading)
    • Remote Direct Memory Access (RDMA)

*) 4 Flops/cycle
9
Test Platform 1: Louhi
Source: Cray Inc.
10
Test Platform 2: Murska
  • Winner of Lot 2: capacity computing
  • HP CP4000BL XC blade cluster
    • 2048 2.6 GHz AMD Opteron cores (dual-core),
      10.6 TFlop/s
    • Dual-socket HP BL465c server blades
  • HPC Linux operating system
    • HP's turnkey cluster OS based on RHEL
  • InfiniBand interconnect network
    • A multipurpose high-speed network
    • Fat tree topology (blocking)
    • 24-port blade enclosure switches
      • 16 × 16 Gbit/s DDR *) downlinks
      • 8 × 16 Gbit/s DDR uplinks, running at 8 Gbit/s
    • 288-port master switch with SDR **) ports
      (8 Gbit/s)
      • Recently upgraded to DDR
    • Host Channel Adapters (HCA) connected to 8x PCIe
      buses
    • Remote Direct Memory Access (RDMA)

*) Double Data Rate    **) Single Data Rate
11
Test Methodology
  • Testing of individual parameters with
    microbenchmarks
    • End-to-end communication latency and bandwidth
    • Communication processing overhead
    • Consistency of performance across the system
  • Testing of real-world behavior with a scientific
    application
    • Gromacs: a popular open source molecular
      dynamics application
  • All measurements use the Message Passing Interface
    (MPI)
    • MPI is by far the most popular parallel
      programming application programming interface
      (API) for HPC
    • Murska uses HP's HP-MPI implementation
    • Louhi uses a Cray-modified version of the MPICH2
      implementation

12
End-to-end Latency
  • The Intel MPI Benchmarks (IMB) PingPong test was
    used (a minimal sketch of the pattern follows
    below)
    • Measures the latency to send and receive a single
      point-to-point message as a function of the
      message size
    • Arguably the most popular metric of interconnect
      performance (esp. for short messages)
  • Murska's HP-MPI has 2 modes of operation (RDMA
    and SRQ)
    • RDMA requires 256 kbytes of memory per MPI task,
      while SRQ (Shared Receive Queue) has a constant
      memory requirement

SRQ causes a notable latency overhead
As the message size grows, Louhi outperforms
Murska
Murska using RDMA has a slightly lower
latency with short messages
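The ping-pong pattern itself is straightforward. The sketch below
illustrates the idea in MPI C; it is a simplified illustration, not
the actual IMB code, and the message size, repetition count and the
two-task setup are arbitrary assumptions.

    /* Minimal ping-pong latency sketch (illustrative only, not IMB). */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        enum { MSG_SIZE = 8, REPS = 1000 };   /* arbitrary test parameters */
        char buf[MSG_SIZE] = {0};
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double t0 = MPI_Wtime();
        for (int i = 0; i < REPS; i++) {
            if (rank == 0) {                  /* ping: send, then wait for the echo */
                MPI_Send(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {           /* pong: echo the message back */
                MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double elapsed = MPI_Wtime() - t0;

        if (rank == 0)   /* one-way latency = half of the average round-trip time */
            printf("latency: %.2f us\n", elapsed / REPS / 2.0 * 1e6);

        MPI_Finalize();
        return 0;
    }

Repeating the loop over increasing message sizes yields the latency
curve; IMB also reports the corresponding bandwidth for each size.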
13
End-to-end Bandwidth
  • IMB PingPong test with large message sizes
  • Bandwidth is derived from the same measurement:
    message size divided by the one-way transfer time
    (see the formula below)
  • Test was done between nearest-neighbor nodes

Louhi has still not reached peak bandwidth with
the largest message size (4 megabytes)
The gap between RDMA and SRQ narrows as the link
becomes saturated: SRQ doesn't affect
performance with large message sizes
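In other words, the reported bandwidth for a message of size m follows
directly from the measured one-way transfer time t(m) (the notation is
ours, not from the thesis):

    BW(m) = m / t(m)

so the bandwidth plot is the large-message end of the same latency
measurement; once the link is saturated, increasing m no longer
improves BW(m).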
14
Communication Processing Overhead
  • A measure of how much communication stresses the
    CPU
    • The C in HPC stands for Computing, not
      Communication ;)
  • MPI has asynchronous communication routines which
    overlap communication and computation
    • This requires autonomous communication mechanisms
    • Murska has RDMA; Louhi has protocol offloading
      and RDMA
  • The Sandia Nat'l Labs SMB benchmark was used
    • Measures how much work one process gets done
      while another process communicates with it
      constantly (a simplified sketch of the idea
      follows below)
    • Result is an application availability percentage
      • 100%: communication is performed completely in
        the background
      • 0%: communication is performed completely in
        the foreground, no work gets done
    • Separate results for getting work done at the
      sender and at the receiver side
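The measurement principle can be sketched roughly as follows. This is
a simplified illustration, not the actual SMB code; the dummy work
loop, the message size and the two-task setup are our assumptions.

    /* Simplified availability sketch: how much of a fixed amount of work
       still gets done while an asynchronous transfer is in flight.
       Run with exactly two MPI tasks.  Not the actual Sandia SMB code. */
    #include <mpi.h>
    #include <stdio.h>

    #define MSG_SIZE   (1 << 20)        /* 1 MB message (arbitrary)            */
    #define WORK_ITERS 50000000L        /* size of the dummy work loop         */

    static double do_work(long iters)   /* dummy compute kernel */
    {
        double x = 0.0;
        for (long i = 0; i < iters; i++)
            x += (double)i * 0.5;
        return x;
    }

    int main(int argc, char **argv)
    {
        static char buf[MSG_SIZE];
        volatile double sink = 0.0;     /* keeps the work from being optimized away */
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Baseline: how long the work takes with no communication in flight. */
        double t0 = MPI_Wtime();
        sink += do_work(WORK_ITERS);
        double t_base = MPI_Wtime() - t0;

        /* The same work while an asynchronous transfer is in progress. */
        MPI_Request req;
        if (rank == 0)
            MPI_Isend(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &req);
        else
            MPI_Irecv(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &req);

        t0 = MPI_Wtime();
        sink += do_work(WORK_ITERS);
        double t_overlap = MPI_Wtime() - t0;
        MPI_Wait(&req, MPI_STATUS_IGNORE);

        /* 100% = the transfer is completely hidden behind the computation,
           0%   = the host CPU does all the communication work itself.      */
        if (rank == 0)
            printf("availability: %.1f %%\n", 100.0 * t_base / t_overlap);

        MPI_Finalize();
        return 0;
    }

The real benchmark keeps communication continuously in flight and
reports sender and receiver sides separately; the sketch above shows
only a single transfer to keep the idea visible.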

15
Receiver Side Availability
100% availability: communication does not
interfere with processing at all
Louhi's availability improves with 8k-128k
messages
Murska's availability drops significantly between
16k and 32k
0% availability: processing communication hogs
the CPU completely
16
Sender Side Availability
Louhi's availability improves dramatically with
large messages: the offload engine can process
packets autonomously.
17
Gromacs
  • A popular and mature molecular dynamics
    simulation package
    • Open source, downloadable from
      http://www.gromacs.org
    • Programmed with MPI
    • Designed to exploit overlapping computation and
      communication, if available
  • Parallel speedup was measured by using a
    fixed-size molecular system
    • Run times for task counts from 16 to 128 were
      measured
  • MPI calls were profiled (a sketch of the profiling
    approach follows below)
    • How much time was spent in communication
      subroutines?
    • Which subroutines were the most time-consuming?
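Per-call profiling of this kind is typically done through MPI's
standard profiling interface (PMPI), where each MPI routine of
interest is wrapped and timed. The thesis does not name the profiling
tool used, so the snippet below is only a generic illustration for a
single routine.

    /* Generic MPI call profiling sketch using the standard PMPI interface.
       Linked in front of the MPI library, it times every MPI_Wait call. */
    #include <mpi.h>
    #include <stdio.h>

    static double wait_time  = 0.0;     /* cumulative time spent inside MPI_Wait */
    static long   wait_calls = 0;

    /* Intercept MPI_Wait; the real implementation is reached via PMPI_Wait. */
    int MPI_Wait(MPI_Request *request, MPI_Status *status)
    {
        double t0 = PMPI_Wtime();
        int err = PMPI_Wait(request, status);
        wait_time += PMPI_Wtime() - t0;
        wait_calls++;
        return err;
    }

    /* Report the accumulated statistics when the application shuts down. */
    int MPI_Finalize(void)
    {
        int rank;
        PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
        printf("rank %d: %ld MPI_Wait calls, %.3f s total\n",
               rank, wait_calls, wait_time);
        return PMPI_Finalize();
    }

Ready-made profilers (e.g. mpiP or the vendors' own tools) work on the
same principle and also cover collectives such as MPI_Alltoall.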

18
Gromacs Run Time
Murskas scaling stops at 32 tasks
Louhis scaling stops at 64 tasks
Time spent in MPI communication routines starts
increasing at 64 tasks
19
MPI Call Profile
Fraction of the total MPI time spent in a
specific MPI call
With small message sizes, MPI_Alltoall (all
processes send a message to each other) dominates
the time spent in MPI.
MPI_Wait (wait for an asynchronous message
transfer to complete) starts dominating time
usage on Murska as the task count grows.
MPI_Alltoall starts dominating MPI time usage on
Louhi again as the task count grows.
20
Conclusions
  • On Murska, a tradeoff has to be made with large
    parallel problems
    • SRQ: sacrifice latency in favor of memory
      capacity
    • RDMA: sacrifice memory capacity in favor of
      latency
  • Murska is able to outperform Louhi in some
    benchmarks
    • Especially in short-message performance in RDMA
      mode
  • Louhi was more consistent in providing low
    processing overhead
    • Being able to overlap long messages tends to be
      more important than overlapping short ones, as
      long messages take more time to complete
  • Gromacs scaled significantly better on Louhi
    • Most likely largely due to lower communication
      processing overhead
  • A proprietary system still has its place
    • The interconnect is designed from the ground up
      to handle MPI communication and HPC workloads
    • The streamlined microkernel also helps
    • Focusing only on hero numbers (e.g. short
      message latency) can be misleading

21
Questions?
Thank you!