1
High-Performance Networking (HPN) Group
The HPN Group was formerly known as the SAN
group.
  • Distributed Shared-Memory Parallel Computing with
    UPC on SAN-based Clusters
  • Appendix for Q3 Status Report
  • DOD Project MDA904-03-R-0507
  • February 5, 2004

2
Outline
  • Objectives and Motivations
  • Background
  • Related Research
  • Approach
  • Results
  • Conclusions and Future Plans

3
Objectives and Motivations
  • Objectives
  • Support advancements for HPC with Unified
    Parallel C (UPC) on cluster systems exploiting
    high-throughput, low-latency system-area networks
    (SANs) and LANs
  • Design and analysis of tools to support UPC on
    SAN-based systems
  • Benchmarking and case studies with key UPC
    applications
  • Analysis of tradeoffs in application, network,
    service and system design
  • Motivations
  • Increasing demand in sponsor and scientific
    computing community for shared-memory parallel
    computing with UPC
  • New and emerging technologies in system-area
    networking and cluster computing
  • Scalable Coherent Interface (SCI)
  • Myrinet (GM)
  • InfiniBand
  • QsNet (Quadrics Elan)
  • Gigabit Ethernet and 10 Gigabit Ethernet
  • PCI Express (3GIO)
  • Clusters offer excellent cost-performance
    potential

4
Background
  • Key sponsor applications and developments toward
    shared-memory parallel computing with UPC
  • More details from sponsor are requested
  • UPC extends the C language to exploit parallelism
    (a minimal example appears at the end of this
    slide)
  • Currently runs best on shared-memory
    multiprocessors (notably HP/Compaq's UPC
    compiler)
  • First-generation UPC runtime systems becoming
    available for clusters (MuPC, Berkeley UPC)
  • Significant potential advantage in
    cost-performance ratio with COTS-based cluster
    configurations
  • Leverage economy of scale
  • Clusters exhibit low cost relative to
    tightly-coupled SMP, CC-NUMA, and MPP systems
  • Scalable performance with commercial
    off-the-shelf (COTS) technologies
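
As a minimal illustration of how UPC extends C (a generic
sketch, not one of the sponsor applications), the example below
declares an array shared across all threads and uses upc_forall
so each thread updates the elements it owns; shared, upc_forall,
MYTHREAD, THREADS, and upc_barrier are the UPC additions to C.

    #include <upc.h>     /* UPC declarations: upc_forall, upc_barrier, ... */
    #include <stdio.h>

    #define N 1024

    shared int a[N];     /* one logical array, spread cyclically over threads */

    int main(void) {
        int i;
        /* Each thread runs only the iterations whose affinity expression
           &a[i] refers to an element it owns, so work follows data. */
        upc_forall (i = 0; i < N; i++; &a[i])
            a[i] = i * i;
        upc_barrier;     /* wait until every thread has finished its updates */
        if (MYTHREAD == 0)
            printf("a[%d] = %d, computed by %d threads\n",
                   N - 1, a[N - 1], THREADS);
        return 0;
    }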

5
Related Research
  • University of California at Berkeley
  • UPC runtime system
  • UPC to C translator
  • Global-Address Space Networking (GASNet) design
    and development
  • Application benchmarks
  • George Washington University
  • UPC specification
  • UPC documentation
  • UPC testing strategies, testing suites
  • UPC benchmarking
  • UPC collective communications
  • Parallel I/O
  • Michigan Tech University
  • Michigan Tech UPC (MuPC) design and development
  • UPC collective communications
  • Memory model research
  • Programmability studies
  • Test suite development
  • Ohio State University
  • UPC benchmarking
  • HP/Compaq
  • UPC compiler
  • Intrepid
  • GCC UPC compiler

6
Approach
  • Collaboration
  • HP/Compaq UPC Compiler V2.1 running in the lab on
    the new ES80 AlphaServer (Marvel)
  • Support of testing by OSU, MTU, UCB/LBNL, UF, et
    al. with leading UPC tools and systems for
    functional and performance evaluation
  • Field test of newest compiler and system
  • Benchmarking
  • Use and design of applications in UPC to grasp
    key concepts and understand performance issues
  • Exploiting SAN Strengths for UPC
  • Design and develop new SCI Conduit for GASNet in
    collaboration with UCB/LBNL
  • Evaluate DSM for SCI as an option for executing
    UPC
  • Performance Analysis
  • Network communication experiments
  • UPC computing experiments
  • Emphasis on SAN Options and Tradeoffs
  • SCI, Myrinet, InfiniBand, Quadrics, GigE, 10GigE,
    etc.

7
GASNet - Experimental Setup and Analysis
  • Experimental Results
  • Throughput
  • Elan shows best performance with approx. 300MB/s
    in both put and get operations
  • Myrinet and SCI very close with 200MB/s on put
    operations
  • Myrinet obtains nearly the same performance with
    get operations
  • SCI suffers from the reference extended API in
    get operations (approx. 7 MB/s) due to greatly
    increased latency
  • get operations will benefit the most from an
    extended API implementation
  • Currently being addressed in UF's design of the
    extended API for SCI
  • MPI suffers from high latency but still performs
    well on GigE with almost 50 MB/s
  • Latency
  • Elan again performs best: put/get (8 µs)
  • Myrinet put (20 µs), get (33 µs)
  • SCI put and get (both 25 µs) better than
    Myrinet get for small messages
  • Larger messages suffer from the AM RPC protocol
  • MPI latency too high to show (250 µs)
  • Elan is the best performer in low-level API tests
  • Testbed
  • Elan, MPI and SCI conduits
  • Dual 2.4 GHz Intel Xeon, 1GB DDR PC2100 (DDR266)
    RAM, Intel SE7501BR2 server motherboard with
    E7501 chipset
  • Specs: 667 MB/s (300 MB/s sustained) Dolphin SCI
    D337 (2D/3D) NICs, using PCI 64/66, in a 4x2 torus
  • Specs: 528 MB/s (340 MB/s sustained) Elan3, using
    PCI-X in two nodes with a QM-S16 16-port switch
  • RedHat 9.0 with gcc compiler V3.3.2
  • GM (Myrinet) conduit (c/o access to cluster at
    MTU)
  • Dual 2.0 GHz Intel Xeon, 2GB DDR PC2100 (DDR266)
    RAM
  • Specs: 250 MB/s Myrinet 2000, using PCI-X, on
    8 nodes connected with a 16-port M3F-SW16 switch
  • RedHat 7.3 with Intel C compiler V7.1
  • Experimental Setup
  • Elan, GM conduits executed with extended API
    implemented
  • SCI, MPI executed with the reference API (based
    on AM in core API)
  • GASNet Conduit experiments
  • Berkeley GASNet Test suite
  • Average of 1000 iterations
  • Each uses bulk transfers to take advantage of the
    implemented extended APIs
  • Latency results use testsmall

Testbed made available by Michigan Tech
8
GASNet Throughput on Conduits
For get operations, the requesting node must wait
for an RPC to execute on the remote node before the
data can be pushed back, as sketched below
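
The sketch below shows that control flow in plain C. The helpers
am_request, am_reply, and poll_until are hypothetical stand-ins
for the core-API active-message calls (the real GASNet functions
have different names and argument conventions); the point is only
that a reference-API get costs a full request/reply round trip
before any data lands locally.

    #include <string.h>   /* memcpy */
    #include <stddef.h>   /* size_t */

    /* Hypothetical active-message primitives standing in for the GASNet
       core API; real GASNet AM calls have different names/signatures. */
    void am_request(int node, void (*handler)(int, void *, size_t, void *),
                    void *remote_src, size_t nbytes, void *dest);
    void am_reply(int node, void (*handler)(void *, size_t, void *),
                  void *payload, size_t nbytes, void *dest);
    void poll_until(volatile int *flag);

    static volatile int get_done;              /* set by the reply handler */

    /* Runs back on the requesting node once the payload arrives. */
    void get_reply_handler(void *payload, size_t nbytes, void *dest) {
        memcpy(dest, payload, nbytes);         /* data finally lands locally */
        get_done = 1;
    }

    /* Runs on the node that owns the data: push it back to the requester. */
    void get_request_handler(int requester, void *remote_src,
                             size_t nbytes, void *dest) {
        am_reply(requester, get_reply_handler, remote_src, nbytes, dest);
    }

    /* Reference-API get: one full round trip per call. */
    void reference_get(void *dest, int node, void *remote_src, size_t nbytes) {
        get_done = 0;
        am_request(node, get_request_handler, remote_src, nbytes, dest);
        poll_until(&get_done);                 /* wait for the reply RPC */
    }

A conduit-specific extended API can avoid this round trip by
using the network's native remote-memory access, which is why
get operations stand to gain the most from the SCI extended-API
work.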
9
GASNet Latency on Conduits
Despite not yet having a conduit-specific extended
API, which would allow better hardware exploitation,
the SCI conduit still manages to keep pace with the
GM conduit for throughput and most small-message
latencies. The Q1 report identifies 10 µs latency as
a possible target.
SCI results based on generic GASNet version of
extended API, which limits performance.
10
UPC Benchmarks: IS from the NAS Benchmarks
  • Class A executed with Berkeley UPC runtime system
    V1.1, with gcc V3.3.2 for Elan and MPI; Intel
    V7.1 for GM
  • IS (Integer Sort): lots of fine-grain
    communication, low computation (access pattern
    sketched at the end of this slide)
  • Communication layer should have greatest effect
    on performance
  • Single thread shows performance without use of
    communication layer
  • Poor performance in the GASNet communication
    system does NOT necessarily indicate poor
    performance in the UPC application
  • MPI results poor for GASNet but decent for UPC
    applications
  • Application may need to be larger to confirm this
    assertion
  • GM conduit shows greatest gain in parallelization
    (could be partly due to better compiler)

Only two nodes were available with Elan, so
scalability cannot be determined at this point
TCP/IP overhead outweighs the benefit of
parallelization
Code developed at GWU
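
A hedged sketch (generic code, not the GWU benchmark) of the
kind of access pattern that makes IS communication-bound in UPC:
each iteration reads one element of a shared array that usually
lives on another thread, so the run is dominated by many small
remote gets with almost no computation per access.

    #include <upc.h>
    #include <stdio.h>

    #define NKEYS 65536

    shared int key[NKEYS];          /* keys spread cyclically over all threads */

    int main(void) {
        long local_sum = 0;
        int i;
        upc_forall (i = 0; i < NKEYS; i++; &key[i])
            key[i] = i & 1023;      /* each thread fills the elements it owns */
        upc_barrier;
        /* Scattered reads: most key[...] references are remote, so runtime is
           dominated by fine-grained communication, not by the additions. */
        for (i = MYTHREAD; i < NKEYS; i += THREADS)
            local_sum += key[(i * 7919) % NKEYS];
        upc_barrier;
        printf("thread %d partial sum %ld\n", MYTHREAD, local_sum);
        return 0;
    }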
11
Network Performance Tests
  • Detailed understanding of high-performance
    cluster interconnects
  • Identifies suitable networks for UPC over
    clusters
  • Aids in smooth integration of interconnects with
    upper-layer UPC components
  • Enables optimization of network communication,
    both unicast and collective
  • Various levels of network performance analysis
  • Low-level tests
  • InfiniBand based on Virtual Interface Provider
    Library (VIPL)
  • SCI based on Dolphin SISCI and SCALI SCI
  • Myrinet based on Myricom GM
  • QsNet based on Quadrics Elan Communication
    Library
  • Host architecture issues (e.g. CPU, I/O, etc.)
  • Mid-level tests
  • Sockets (a ping-pong timing sketch follows this
    list)
  • Dolphin SCI Sockets on SCI
  • BSD Sockets on Gigabit and 10 Gigabit Ethernet
  • GM Sockets on Myrinet
  • SOVIA on InfiniBand
  • MPI
  • InfiniBand and Myrinet based on MPI/PRO
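
The mid-level socket tests boil down to a ping-pong loop over an
established connection. The sketch below is a generic example,
not the actual test code: it assumes a TCP socket fd that is
already connected (setup omitted) and a peer running the
mirror-image recv-then-send loop, and it reports half the
average round-trip time as the one-way latency.

    #include <stdio.h>
    #include <sys/time.h>
    #include <sys/types.h>
    #include <sys/socket.h>

    #define ITERS 1000

    /* One-way latency estimate (microseconds) over an already-connected
       socket 'fd'; 'buf' must hold at least 'msg_size' bytes. */
    double pingpong_latency_us(int fd, char *buf, size_t msg_size) {
        struct timeval t0, t1;
        gettimeofday(&t0, NULL);
        for (int i = 0; i < ITERS; i++) {
            send(fd, buf, msg_size, 0);              /* ping */
            size_t got = 0;
            while (got < msg_size) {                 /* pong may arrive in pieces */
                ssize_t n = recv(fd, buf + got, msg_size - got, 0);
                if (n <= 0)
                    return -1.0;                     /* connection error */
                got += (size_t)n;
            }
        }
        gettimeofday(&t1, NULL);
        double us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
        return us / ITERS / 2.0;                     /* half of a round trip */
    }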

12
Network Performance Tests
  • Tests run on two Elan3 cards connected by QM-S16
    16-port switch
  • Quadrics dping used for raw tests
  • GASNet testsmall used for latency, testlarge for
    throughput
  • Utilizes extended API
  • Results obtained from put operations (a put-timing
    sketch follows this list)
  • Elan conduit for GASNet more than doubles
    hardware latency, but still maintains sub-10 µs
    for small messages
  • Conduit throughput matches hardware
  • Elan conduit does not add appreciably to
    performance overhead
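
The conduit-side measurement is essentially a timed loop of
blocking one-sided puts, in the spirit of Berkeley's
testsmall/testlarge. A minimal sketch follows, assuming GASNet
has already been initialized and attached elsewhere, that
remote_dst points into the peer's registered segment, and that
the GASNet-1 blocking put gasnet_put(node, dest, src, nbytes) is
available; the real test programs add warm-up, message-size
sweeps, and barrier synchronization.

    #include <gasnet.h>
    #include <stddef.h>
    #include <sys/time.h>

    #define ITERS 1000

    /* Average latency (us) of a blocking put of 'nbytes' to node 'peer'.
       Assumes gasnet_init()/gasnet_attach() were called earlier and that
       'remote_dst' lies inside the peer's attached segment. */
    double put_latency_us(gasnet_node_t peer, void *remote_dst,
                          void *local_src, size_t nbytes) {
        struct timeval t0, t1;
        gettimeofday(&t0, NULL);
        for (int i = 0; i < ITERS; i++)
            gasnet_put(peer, remote_dst, local_src, nbytes);
        gettimeofday(&t1, NULL);
        double us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
        return us / ITERS;
    }

For throughput, the same loop over large messages gives bytes
transferred divided by elapsed time.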

13
Low Level vs. GASNet Conduit
  • Tests run on two Myrinet 2000 cards connected by
    M3F-SW16 switch
  • Myricom gm_allsize used for raw tests
  • GASNet testsmall used for latency, testlarge for
    throughput
  • Utilizes extended API
  • Results obtained from puts
  • GM conduit almost doubles the hardware latency,
    with latencies of 19 µs for small messages
  • Conduit throughput follows the trend of the
    hardware but differs by an average of 60 MB/s for
    messages larger than 1024 bytes
  • Conduit peaks at 204MB/s compared to 238MB/s for
    hardware
  • GM conduit adds a small amount to performance
    overhead

14
Architectural Performance Tests
  • Opteron
  • Features
  • 64-bit processor
  • Real-time support of 32-bit OS
  • On-chip memory controllers
  • Eliminates 4 GB memory barrier imposed by 32-bit
    systems
  • 19.2 GB/s I/O bandwidth per processor
  • Future plans
  • UPC and other parallel application benchmarks
  • Pentium 4 Xeon
  • Features
  • 32-bit processor
  • Hyper-Threading technology
  • Increased CPU utilization
  • Intel NetBurst microarchitecture
  • RISC processor core
  • 4.3 GB/s I/O bandwidth
  • Future Plans
  • UPC and other parallel application benchmarks

15
CPU Performance Results
  • NAS Benchmarks
  • Computationally intensive
  • EP, FT, and IS
  • Class A problem set size
  • Opteron and Xeon comparable for floating-point
    operations (FT)
  • For integer operations, Opteron performs better
    than Xeon (EP, IS)
  • DOD Seminumeric Benchmark 2
  • Radix sort
  • Measures setup time, sort time, and time to
    verify the sort (phase structure sketched at the
    end of this slide)
  • Sorting is the dominant component of execution
    time
  • Results Analysis
  • Opteron architecture outperforms the Xeon in all
    tests performed, for all iterations
  • Setup and verify times are roughly half those of
    the Xeon architecture
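
A hedged sketch (not the DOD benchmark code) of the three timed
phases described above, using a base-256 LSD radix sort on
32-bit keys: set up the keys, sort them, then verify the
ordering.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <stdint.h>
    #include <sys/time.h>

    #define N (1 << 20)                      /* number of 32-bit keys */

    static double seconds(void) {            /* wall-clock time in seconds */
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec * 1e-6;
    }

    /* Stable LSD radix sort, one byte (base 256) per pass. */
    static void radix_sort(uint32_t *a, uint32_t *tmp, size_t n) {
        for (int shift = 0; shift < 32; shift += 8) {
            size_t count[257] = {0};
            for (size_t i = 0; i < n; i++)           /* histogram this byte */
                count[((a[i] >> shift) & 0xFF) + 1]++;
            for (int b = 0; b < 256; b++)            /* prefix sums -> offsets */
                count[b + 1] += count[b];
            for (size_t i = 0; i < n; i++)           /* stable scatter */
                tmp[count[(a[i] >> shift) & 0xFF]++] = a[i];
            memcpy(a, tmp, n * sizeof *a);
        }
    }

    int main(void) {
        uint32_t *a = malloc(N * sizeof *a), *tmp = malloc(N * sizeof *tmp);

        double t0 = seconds();
        for (size_t i = 0; i < N; i++)               /* set up: random keys */
            a[i] = (uint32_t)rand();
        double t1 = seconds();
        radix_sort(a, tmp, N);                       /* sort: dominant phase */
        double t2 = seconds();
        int ok = 1;                                  /* verify: nondecreasing? */
        for (size_t i = 1; i < N; i++)
            if (a[i - 1] > a[i]) { ok = 0; break; }
        double t3 = seconds();

        printf("setup %.3f s  sort %.3f s  verify %.3f s  (%s)\n",
               t1 - t0, t2 - t1, t3 - t2, ok ? "sorted" : "NOT sorted");
        free(a); free(tmp);
        return 0;
    }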

16
Memory Performance Results
  • Lmbench-3.0-a3 (a simplified pointer-chasing
    sketch of this kind of measurement follows this
    list)
  • Opteron latency/throughput worsen as expected at
    size 64KB (L1 cache size) and 1MB (L2 cache size)
  • Xeon latency/throughput show the same trend for
    L1 (8KB) but start earlier for L2 (256KB instead
    of 512KB)
  • Cause under investigation
  • Between CPU / L1 / L2, Opteron outperforms Xeon,
    but Xeon outperforms Opteron when loading data
    from disk into main memory
  • Write throughput for Xeon stays relatively
    constant for sizes < L2 cache size, suggesting a
    write-through policy between L1 and L2
  • Xeon read > Opteron write > Opteron read > Xeon
    write
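
The cache-size knees above come from latency curves like the one
a pointer-chasing loop produces. The sketch below is a rough
stand-in for lmbench's lat_mem_rd, not its actual code: it links
a working set of the requested size into one random cycle and
times dependent loads through it, so the average time per load
steps up each time the working set outgrows a cache level.

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>

    #define ACCESSES 10000000UL

    static double seconds(void) {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec * 1e-6;
    }

    /* Average load-to-use latency (ns) for a working set of 'bytes' bytes.
       Each load depends on the previous one, so latency cannot be hidden. */
    static double chase_latency_ns(size_t bytes) {
        size_t n = bytes / sizeof(void *), i, j;
        void **ring = malloc(n * sizeof(void *));
        size_t *order = malloc(n * sizeof(size_t));

        for (i = 0; i < n; i++) order[i] = i;
        for (i = n - 1; i > 0; i--) {        /* shuffle the visit order */
            j = (size_t)rand() % (i + 1);
            size_t t = order[i]; order[i] = order[j]; order[j] = t;
        }
        for (i = 0; i < n; i++)              /* one cycle through every slot */
            ring[order[i]] = &ring[order[(i + 1) % n]];

        void **p = &ring[order[0]];
        double t0 = seconds();
        for (unsigned long k = 0; k < ACCESSES; k++)
            p = (void **)*p;                 /* dependent, mostly-random loads */
        double t1 = seconds();

        volatile void *sink = p;             /* keep the chain from being optimized away */
        (void)sink;
        free(ring); free(order);
        return (t1 - t0) * 1e9 / ACCESSES;
    }

    int main(void) {
        for (size_t kb = 8; kb <= 8192; kb *= 2)
            printf("%6zu KB : %5.1f ns per load\n", kb, chase_latency_ns(kb * 1024));
        return 0;
    }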

17
File I/O Results
  • Bonnie
  • 10 iterations of writing and reading a 2GB file
    using per character functions and efficient block
    functions
  • stdio overhead is large for per-character
    functions (compared in the sketch at the end of
    this slide)
  • Efficient block reads and writes greatly reduce
    the CPU utilization
  • Throughput results were directly proportional to
    CPU utilization
  • Shows the same trend as observed in the memory
    performance testing
  • Xeon read > Opteron write > Opteron read > Xeon
    write
  • Suggesting memory access and I/O access might use
    the same mechanism
  • AIM 9
  • 10 iterations using 5MB files testing sequential
    and random reads, writes, and copies
  • Opteron consistently outperforms Xeon by a wide
    margin
  • Large increase in performance for disk reads as
    compared to writes
  • Xeon read speeds are very high for all results,
    with much lower write performance
  • Opteron read speeds are also very high, and its
    write performance greatly outperforms the Xeon's
    in all cases
  • Xeon sequential read is actually worse than
    Opteron, but still comparable
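
A hedged sketch (Bonnie-like in spirit, not Bonnie's code) of
the two write styles compared above: putc issues one library
call per byte, while fwrite pushes 64 KB blocks, which is why
the per-character path is dominated by stdio overhead and CPU
utilization. The file size here is deliberately smaller than the
2 GB used in the actual runs.

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>

    #define FILE_BYTES (64UL * 1024 * 1024)   /* 64 MB; the real tests use 2 GB */
    #define BLOCK      (64 * 1024)

    static double seconds(void) {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec * 1e-6;
    }

    int main(void) {
        char *block = calloc(1, BLOCK);
        unsigned long i;

        FILE *f = fopen("perchar.dat", "wb");
        if (!f) return 1;
        double t0 = seconds();
        for (i = 0; i < FILE_BYTES; i++)        /* one stdio call per byte */
            putc((int)(i & 0xFF), f);
        fclose(f);
        double t1 = seconds();

        f = fopen("block.dat", "wb");
        if (!f) return 1;
        for (i = 0; i < FILE_BYTES; i += BLOCK) /* one call per 64 KB block */
            fwrite(block, 1, BLOCK, f);
        fclose(f);
        double t2 = seconds();

        printf("per-char write: %.1f MB/s   block write: %.1f MB/s\n",
               FILE_BYTES / 1e6 / (t1 - t0), FILE_BYTES / 1e6 / (t2 - t1));
        free(block);
        return 0;
    }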

18
Conclusions and Future Plans
  • Accomplishments to date
  • Baselining of UPC on shared-memory
    multiprocessors
  • Evaluation of promising tools for UPC on clusters
  • Leverage and extend communication and UPC layers
  • Conceptual design of new tools
  • Preliminary network and system performance
    analyses
  • Completed V1.0 of the GASNet Core API SCI conduit
    for UPC
  • Key insights
  • Inefficient communication system does not
    necessarily translate to poor UPC application
    performance
  • Xeon cluster suitable for applications with a
    high read/write ratio
  • Opteron cluster suitable for generic applications
    due to comparable read/write capability
  • Future Plans
  • Comprehensive performance analysis of new SANs
    and SAN-based clusters
  • Evaluation of UPC methods and tools on various
    architectures and systems
  • UPC benchmarking on cluster architectures,
    networks, and conduits
  • Continuing effort in stabilizing/optimizing
    GASNet SCI Conduit
  • Cost/Performance analysis for all options