UPC Research at University of Florida - PowerPoint PPT Presentation

About This Presentation
Title:

UPC Research at University of Florida

Description:

Nodes: Dual AMD Opteron 240, 1GB DDR PC2700 (DDR333) RAM, Tyan Thunder K8S server motherboard ... GPSHMEM over MPI lacks good performance vs. plain MPI, ... – PowerPoint PPT presentation

Slides: 22
Provided by: hcs8
Transcript and Presenter's Notes



1
UPC Research at University of Florida
  • Alan D. George, Hung-Hsun Su, Bryan Golden, Adam
    Leko
  • HCS Research Laboratory
  • University of Florida

2
UPC/SHMEM Performance Analysis Tool (PAT)
3
UPC/SHMEM PAT - Introduction
  • Motivations
  • When UPC/SHMEM programs do not yield the desired
    or expected performance, why not?
  • Due to the complexity of parallel computing, this
    is difficult to determine without performance
    analysis tools
  • Discouraging for users, new and old; few options
    available for shared-memory computing in the UPC
    and SHMEM communities
  • Goals
  • Identify important performance factors in
    UPC/SHMEM computing
  • Develop framework for a performance analysis tool
  • As new tool or as extension/redesign of existing
    non-UPC/SHMEM tools
  • Design with both performance and user
    productivity in mind
  • Attract new UPC/SHMEM users and support improved
    performance
  • Develop model to predict optimal performance

4
UPC/SHMEM PAT Approach (1)
  • Define layers to divide the workload and
    investigate important issues from different
    perspectives
  • Application layer deals with general parallel
    programming issues, as well as issues unique to
    the problem at hand
  • Language layer involves issues specific to the
    UPC or SHMEM programming model
  • Compiler layer includes the effects of
    different compilers and their optimization
    techniques on performance
  • Middleware layer includes all system software
    that relates to system resources, such as the
    communication protocol stack, OS, and runtime
    system
  • Hardware layer comprises key issues within
    system resources such as CPU architecture, memory
    hierarchy, communication and synchronization
    network

5
UPC/SHMEM PAT Approach (2)
  • Research strategies
  • Tool-driven approach (bottom-up, time-saving)
  • Study existing tools for other programming models
  • Identify suitable factors applicable to UPC and
    SHMEM
  • Conduct experiments to verify the relevancy of
    these factors
  • Extend/create UPC/SHMEM-specific PAT
  • Layer-driven approach (top-down, comprehensive)
  • Identify all possible factors in each of the five
    layers defined
  • Conduct experiments to verify the relevancy of
    these factors
  • Determine suitable tool(s) able to provide
    measurements for these factors
  • Extend/create UPC/SHMEM-specific PAT
  • Hybrid approach (simult. pursuit of tool-driven
    and layer-driven approaches)
  • Minimize development time
  • Maximize usefulness of PAT

Hybrid Approach
6
UPC/SHMEM PAT Areas of Study
  • Research areas currently under investigation that
    are important to the development of a successful
    PAT
  • Algorithm Analysis
  • Analytical Models
  • Categorization of Platforms/Systems
  • Factor Classification and Determination
  • Performance Analysis Strategies
  • Profiling/Tracing Methods
  • Program/Compiler Optimization Techniques
  • Tool Design (Production/Theoretical)
  • Tool Evaluation Strategies
  • Tool Framework/Approach (Theoretical and
    Experimental)
  • Usability (includes presentation methodology)

7
GASNet SCI Conduit for UPC Computing
8
GASNet SCI Conduit Introduction and Design (1)
  • Scalable Coherent Interface (SCI)
  • Low-latency, high-bandwidth SAN
  • Shared-memory capabilities
  • Requires memory exporting and importing
  • PIO (requires importing); DMA (requires 8-byte
    alignment)
  • Remote write 10x faster than remote read
  • SCI conduit
  • AM enabling (core API)
  • Dedicated AM message channels (Command)
  • Request/Response pairs to prevent deadlock
  • Flags to signal arrival of new AM (Control)
  • Put/Get enabling (extended API)
  • Global segment (Payload)

This work is in collaboration with the UPC Group at
UC Berkeley.
9
GASNet SCI Conduit Design (2)
  • Active Message Transfer
  • Obtain free slot
  • Track locally using array of flags
  • Package AM header
  • Transfer data
  • Short AM
  • PIO write (Header)
  • Medium AM
  • PIO write (Header)
  • PIO write (Medium Payload)
  • Long AM
  • PIO write (Header)
  • PIO write (Long Payload)
  • Payload size ≤ 1024
  • Unaligned portion of payload
  • DMA write (multiple of 64 bytes)
  • Wait for transfer completion
  • Signal AM arrival

10
GASNet SCI Conduit Results
  • Objective - compare the performance of SCI
    conduit to other existing conduits
  • Experimental Setup
  • GASNet configured with segment Large
  • GASNet conduit experiments
  • Berkeley GASNet test suite
  • Average of 1000 iterations
  • Executed with target memory falling inside and
    then outside the GASNet segment
  • Latency results use testsmall
  • Throughput results use testlarge
  • Analysis
  • Elan (Quadrics) shows best performance for
    latency of puts and gets
  • VAPI (InfiniBand) is by far the best for
    bandwidth; latency also very good
  • GM (Myrinet) latencies a little higher than all
    the rest
  • Our SCI conduit shows better put latency than
    the MPI conduit on SCI for sizes > 64 bytes;
    very close to MPI on SCI for smaller messages
  • Our SCI conduit has get latency slightly higher
    than MPI on SCI
  • GM and SCI provide about the same throughput
  • Our SCI conduit attains slightly higher
    bandwidth for the largest message sizes
  • Quick look at estimated total cost to support 8
    nodes of these interconnect architectures
  • SCI: $8,700

Via a testbed made available courtesy of Michigan
Tech.
11
GASNet SCI Conduit Results (Latency)
12
GASNet SCI Conduit Results (Bandwidth)
13
GASNet SCI Conduit - Conclusions
  • An experimental version of our conduit is
    available as part of the Berkeley UPC 2.0 release
  • Despite being limited by the existing SCI driver
    from the vendor, it achieves performance fairly
    comparable to other conduits
  • Enhancements to resolve driver limitations are
    being investigated in close collaboration with
    Dolphin
  • Support access of all virtual memory on remote
    node
  • Minimize transfer setup overhead

14
UPC Benchmarking
15
UPC Benchmarking - Overview
  • Goals
  • Produce interesting and useful benchmarks for UPC
  • Compare the performance of Berkeley UPC using
    various clusters/conduits and HP UPC on HP/Compaq
    AlphaServer
  • Testbed
  • Intel Xeon cluster
  • Nodes: Dual 2.4 GHz Intel Xeons, 1GB DDR PC2100
    (DDR266) RAM, Intel SE7501BR2 server motherboard
    with E7501 chipset
  • SCI: 667 MB/s (300 MB/s sustained) Dolphin SCI
    D337 (2D/3D) NICs, using PCI 64/66, 4x2 torus
  • MPI: MPICH 1.2.5
  • RedHat 9.0 with gcc compiler V 3.3.2, Berkeley
    UPC runtime system 2.0
  • AMD Opteron cluster
  • Nodes: Dual AMD Opteron 240, 1GB DDR PC2700
    (DDR333) RAM, Tyan Thunder K8S server motherboard
  • InfiniBand: 4x (10Gb/s, 800 MB/s sustained)
    Voltaire HCS 400LP, using PCI-X 64/100, 24-port
    Voltaire ISR 9024 switch
  • MPI: package provided by Voltaire
  • SuSE 9.0 with gcc compiler V 3.3.3, Berkeley UPC
    runtime system 2.0
  • HP/Compaq ES80 AlphaServer (Marvel)
  • Four 1GHz EV7 Alpha processors, 8GB RD1600 RAM,
    proprietary inter-processor connections
  • Tru64 5.1B Unix, HP UPC V2.3-test compiler

16
UPC Benchmarking - Differential Cryptanalysis for
CAMEL Cipher
  • Description
  • Uses 1024-bit S-Boxes
  • Given a key, encrypts data, then tries to guess
    key solely based on encrypted data using
    differential attack
  • Has three main phases
  • Compute optimal difference pair based on S-Box
    (not very CPU-intensive)
  • Performs main differential attack (extremely
    CPU-intensive, brute force using optimal
    difference pair)
  • Analyze data from differential attack (not very
    CPU-intensive)
  • Computationally intensive (independent processes)
    with several synchronization points
  • Analysis
  • Marvel attained almost perfect speedup;
    synchronization cost very low
  • Berkeley UPC
  • Speedup decreases with increasing number of
    threads (cost of synchronization increases with
    number of threads)
  • Run time varied greatly as number of threads
    increased
  • Still decent performance for 32 threads (76.25%
    efficiency, VAPI)
  • Performance is more sensitive to data affinity

Parameters: MAINKEYLOOP = 256, NUMPAIRS = 400,000,
Initial Key = 12345
17
UPC Benchmarking - DES Differential Attack
Simulator
  • Description
  • S-DES (8-bit key) cipher (integer-based)
  • Creates basic components used in differential
    cryptanalysis
  • S-Boxes, Difference Pair Tables (DPT), and
    Differential Distribution Tables (DDT)
  • Bandwidth-intensive application
  • Designed for high cache miss rate, so very costly
    in terms of memory access
  • Analysis
  • With increasing number of nodes, bandwidth and
    NIC response time become more important
  • Interconnects with higher bandwidth and faster
    response times perform best
  • Marvel shows near-perfect linear speedup, but
    processing time of integers an issue
  • MPI conduit clearly inadequate for high-bandwidth
    programs

18
UPC Benchmarking - Concurrent Wave Equation
  • Description
  • A vibrating string is decomposed into points
  • In the parallel version, each processor
    responsible for updating amplitude of N/P points
  • Each iteration, each processor exchanges boundary
    points with nearest neighbors
  • Coarse-grained communication
  • Algorithm complexity of O(N)
  • Analysis
  • Sequential C
  • Modified version was 30% faster than baseline for
    Xeon, but only 17% faster for Opteron
  • Opteron and Xeon sequential unmodified code have
    nearly identical execution times
  • UPC
  • Near linear speedup
  • Fairly straightforward to port from sequential
    code
  • upc_forall loop ran faster with array[j] as the
    affinity expression than with &array[j]

19
UPC Benchmarking - Mod 2^N Inverse
  • Description
  • Given list A of size listsize (64-bit integers),
    size ranges from 0 to 2j 1
  • Compute
  • list B, where BiAi right justified
  • list C, such that (Bi Ci) 2j 1 (iterative
    algorithm)
  • Check section (gather)
  • First node checks all values to verify (Bi Ci)
    2j 1.
  • Computation is embarrassingly parallel and very
    communication intensive
  • MPI, UPC, and SHMEM versions used same algorithm
  • Analysis
  • AlphaServer: relatively good; UPC, SHMEM, and
    MPI show comparable performance
  • Opteron
  • Suboptimal performance, communication time
    dominates overall execution time
  • MPI gave best results (more mature compiler),
    although code was much more laborious to write
  • GPSHMEM over MPI lacks good performance vs. plain
    MPI, although much easier to write with one-sided
    communication functions
  • However, GPSHMEM gives adequate performance for
    applications that are not extremely sensitive to
    bandwidth
  • MPI could use asynchronous calls to hide latency
    on check, although communication time dominates
    by a large factor

20
UPC Benchmarking - Convolution
  • Description
  • Compute convolution of two sequences
  • Classic definition of convolution (not image
    processing definition)
  • Embarrassingly parallel (gives an idea of
    language overhead), O(N^2) algorithm complexity
  • Parameters: sequence sizes of 100,000 elements;
    data types: 32-bit integer and double-precision
    floating point
  • MPI, UPC, and SHMEM versions used same algorithm
  • Analysis
  • Overall language overhead
  • MPI version required most effort to code
  • SHMEM slightly easier than MPI because of
    one-sided communication functions
  • UPC easiest to code (conversion of sequential
    code very easy), but has potentially limited
    performance unless optimizations (get, for, cast)
    are used
  • Overall language performance overhead
  • On AlphaServer: MPI had most runtime overhead in
    most cases; GPSHMEM performed surprisingly well
  • On Opteron: runtime overhead MPI (least) < SHMEM
    < Berkeley UPC

21
Final Conclusions
  • Florida group active in three UPC research areas
  • UPC/SHMEM performance analysis tools (PAT)
  • Network and system infrastructure for UPC
    computing
  • UPC benchmarking and optimization
  • Status
  • PAT research for UPC is now underway
  • SCI networking infrastructure for UPC/GASNet
    cluster computing shows promising results
  • Broad range of UPC benchmarks under development
  • Developing plans for additional UPC projects
  • Integration of UPC and RC (reconfigurable)
    computing
  • Simulation modeling of UPC systems and apps.