1
Implementing a Global Address Space Language on the Cray X1: the Berkeley UPC Experience
Christian Bell and Wei Chen
CS252 Class Project
December 10, 2003
2
Outline
  • An Overview of UPC and the Berkeley UPC Compiler
  • Overview of the Cray X1
  • Implementing the GASNet layer on the X1
  • Implementing the runtime layer on the X1
  • Serial performance
  • Evaluation of compiler optimizations

3
Unified Parallel C (UPC)
  • UPC is an explicitly parallel global address
    space language with SPMD parallelism
  • An extension of ISO C
  • User-level shared memory, partitioned among threads
  • One-sided (bulk and fine-grained) communication
    through reads/writes of shared variables (see the
    sketch below)

[Figure: the UPC global address space -- a shared region partitioned into
per-thread sections X0, X1, ..., XP, above each thread's private memory.]
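As a concrete illustration of the model above, a minimal UPC sketch (the
variable names are illustrative): each thread writes the element of a
cyclically distributed shared array that has affinity to it, then reads its
neighbor's element through an ordinary shared reference, which the compiler
and runtime turn into a one-sided remote read.

    #include <upc.h>

    shared int x[THREADS];          /* one element with affinity to each thread */

    int main(void) {
        x[MYTHREAD] = MYTHREAD;     /* write to my partition of the shared space */
        upc_barrier;
        /* one-sided read; remote if the neighbor lives on another thread */
        int neighbor = x[(MYTHREAD + 1) % THREADS];
        upc_barrier;
        return neighbor < 0;        /* trivial use of the value */
    }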
4
Shared Arrays and Pointers in UPC
[Figure: layout of the arrays below across two threads. T0 holds A0 A2 A4,
B0 B1 B4 B5, and C0 C1 C2; T1 holds A1 A3 A5 and B2 B3 B6 B7.]
  • Cyclic: shared int A[n]
  • Block-cyclic: shared [2] int B[n]
  • Indefinite: shared [0] int *C =
    (shared [0] int *) upc_alloc(n)
  • Use pointers-to-shared to access shared data
  • The block size is part of the pointer type
  • A generic pointer-to-shared contains an
    address, a thread id, and a phase
  • Cyclic and indefinite pointers are phaseless

5
Accessing Shared Memory in UPC
[Figure: a blocked shared array spread across Thread 0 ... Thread N-1. A
pointer-to-shared names an element by its address (the start of its block
within the array object), its thread, and its phase -- the element's offset
within the block, between 0 and the block size.]
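A sketch of the arithmetic behind the figure, assuming the generic
(address, thread, phase) representation from the previous slide; the struct
layout and helper name are illustrative, not the Berkeley UPC runtime's
actual representation.

    #include <stddef.h>

    typedef struct {
        char  *addr;    /* start of the current block in the array object */
        int    thread;  /* thread that owns the current block */
        int    phase;   /* element offset within the block */
    } sptr_t;

    /* Advance a generic pointer-to-shared by one element of 'elemsz' bytes,
       for block size 'blocksz' distributed over 'nthreads' threads.
       The element itself lives at p.addr + p.phase * elemsz on p.thread. */
    static sptr_t sptr_inc(sptr_t p, size_t elemsz, int blocksz, int nthreads) {
        if (++p.phase == blocksz) {         /* crossed a block boundary */
            p.phase = 0;
            if (++p.thread == nthreads) {   /* wrapped past the last thread */
                p.thread = 0;
                p.addr += blocksz * elemsz; /* next round of blocks */
            }
        }
        return p;
    }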
6
UPC Programming Model Features
  • Block-cyclically distributed arrays
  • Shared and private pointers
  • Global synchronization -- barriers
  • Pair-wise synchronization -- locks
  • Parallel loops
  • Dynamic shared memory allocation
  • Bulk shared memory accesses
  • Strict vs. relaxed memory consistency models
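A short UPC sketch exercising several of these features -- a block-cyclic
array, upc_forall, a barrier, a lock, and a bulk upc_memget; the sizes and
names are illustrative.

    #include <upc.h>

    #define B 4                              /* block size */
    shared [B] double a[B*THREADS];          /* block-cyclic distribution */
    shared double sum = 0.0;
    upc_lock_t *lock;

    int main(void) {
        double buf[B], mine = 0.0;
        int i;

        lock = upc_all_lock_alloc();         /* collective lock creation */

        /* parallel loop; each iteration runs where a[i] has affinity */
        upc_forall (i = 0; i < B*THREADS; i++; &a[i])
            a[i] = (double)i;

        upc_barrier;                         /* global synchronization */

        for (i = 0; i < B*THREADS; i++)
            if (upc_threadof(&a[i]) == MYTHREAD)
                mine += a[i];

        upc_lock(lock);                      /* pair-wise synchronization */
        sum += mine;
        upc_unlock(lock);

        upc_memget(buf, &a[0], B * sizeof(double));  /* bulk one-sided copy */
        upc_barrier;
        return 0;
    }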

7
Overview of the Berkeley UPC Compiler
Two goals: portability and high performance.

[Figure: the layered compiler architecture]
  UPC Code
  Translator -- lowers UPC code into ANSI C code (platform-independent)
  Translator-Generated C Code
  Berkeley UPC Runtime System -- shared memory management and
    pointer operations (compiler-independent)
  GASNet Communication System -- uniform get/put interface for the
    underlying networks (language- and network-independent)
  Network Hardware
8
A Layered Design
  • Portable
  • C is our intermediate language
  • Can run on top of MPI (with performance penalty)
  • GASNet has a layered design with a small core
  • High-Performance
  • Native C compiler optimizes serial code
  • Translator can perform high-level communication
    optimizations
  • GASNet can access network hardware directly,
    provides a rich set of communication /
    synchronization primitives

9
Outline
  • An Overview of UPC and the Berkeley UPC Compiler
  • Overview of the Cray X1
  • Implementing the GASNet layer on the X1
  • Implementing the runtime layer on the X1
  • Serial performance
  • Evaluation of compiler optimizations

10
The Cray X1 Architecture
  • New line of Vector Architecture
  • Two modes of operation
  • SSP up to 16 CPUs/node
  • MSP multistreams long loops
  • Single-node UMA, multi-node NUMA (no caching
    remote data)
  • Global pointers
  • Low latency, high bandwidth
  • All Gets/Puts must be loads/stores (directly or
    shmem interface)
  • Only puts are non-blocking, gets are blocking
  • Vectorization is crucial
  • Vector pipeline 2x faster than scalar
  • Utilization of memory bandwidth
  • Strided accesses, scatter-gather, reduction, etc.
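A sketch of remote access through the shmem interface mentioned above: a put
followed by shmem_quiet to force completion. The initialization call differs
between SHMEM generations (start_pes(0) on older Cray systems, shmem_init()
in OpenSHMEM), so treat the setup calls as assumptions.

    #include <shmem.h>

    long dst[8];                     /* static, therefore symmetric on every PE */
    long src[8];

    int main(void) {
        shmem_init();                /* OpenSHMEM-style startup (assumed) */
        int me = shmem_my_pe();
        int np = shmem_n_pes();

        for (int i = 0; i < 8; i++) src[i] = me;

        /* put 8 longs into the next PE's dst; puts may complete asynchronously */
        shmem_long_put(dst, src, 8, (me + 1) % np);
        shmem_quiet();               /* wait for completion of outstanding puts */

        shmem_barrier_all();
        shmem_finalize();
        return 0;
    }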

11
Outline
  • An Overview of UPC and the Berkeley UPC Compiler
  • Overview of the Cray X1
  • Implementing the GASNet layer on the X1
  • Implementing the runtime layer on the X1
  • Serial performance
  • Evaluation of compiler optimizations

12
GASNet Communication System -- Architecture
  • 2-Level architecture to ease implementation
  • Core API
  • Based heavily on Active Messages
  • Compatibility layer
  • Ported to the X1 in 2 days, with a new algorithm to
    manipulate queues in shared memory
  • Extended API
  • Wider interface that includes more complicated
    operations (puts, gets)
  • A reference implementation of the extended API in
    terms of the core API is provided
  • Current revision is tuned especially for the X1
    with shared memory as the primary focus (minimal
    overhead)

[Figure: the software stack -- compiler-generated code over the
compiler-specific runtime system, over the GASNet Extended API, over the
GASNet Core API, over the network hardware.]
13
GASNet Extended API -- Remote memory operations
  • GASNet offers expressive Put/Get primitives
  • All gets/puts can be blocking or non-blocking
  • Non-blocking can be explicit (handle-based)
  • Non-blocking can be implicit (global or
    region-based)
  • Synchronization can poll or block
  • Paves the way for complex split-phase
    communication (compiler optimizations)
  • The Cray X1 relies exclusively on shared memory
  • All gets/puts must be loads/stores
  • Only puts are non-blocking; gets are blocking
  • Very limited synchronization mechanisms
  • Efficient communication only through vectors (one
    order of magnitude between scalar and vector
    communication)
  • Vectorization instead of split-phase?
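A sketch of the split-phase style the Extended API supports, using
explicit-handle calls from the GASNet 1.x spec (gasnet_put_nb,
gasnet_wait_syncnb, gasnet_get); the buffers and peer are illustrative, and
rbuf is assumed to lie in the peer's registered segment.

    #include <gasnet.h>

    void exchange(gasnet_node_t peer, double *lbuf, double *rbuf, size_t n) {
        /* initiate a non-blocking put; returns a handle immediately */
        gasnet_handle_t h = gasnet_put_nb(peer, rbuf, lbuf, n * sizeof(double));

        /* ...independent computation could overlap the transfer here... */

        gasnet_wait_syncnb(h);             /* block until the put completes */

        /* blocking get of the peer's buffer back into local memory */
        gasnet_get(lbuf, peer, rbuf, n * sizeof(double));
    }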

14
GASNet and Cray X1 -- Remote memory operations

  GASNet operation                           Cray X1 instruction   Comment
  Bulk operations                            Vector bcopy()        Fully vectorized; suitable for GASNet/UPC
  Non-bulk blocking puts                     Store + gsync         No vectorization
  Non-bulk blocking gets                     Load
  Non-bulk non-blocking explicit puts/gets   Store/load + gsync    No vectorization if the sync is done in the loop
  Non-bulk non-blocking implicit puts/gets   Store/load + gsync    No vectorization if the sync is done in the loop
  • Flexible communication provides no benefit
    without vectorization (a factor of 10 between
    vector and scalar; sketched below)
  • It is difficult to expose vectorization through a
    layered software stack -- the native C compiler now
    has to optimize parallel code!
  • The Cray X1's big-hammer gsync() prevents
    interesting communication optimizations
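A sketch of the point in the table above: a loop of remote stores vectorizes
when the completion sync is hoisted out of the loop, and serializes when it
is not. sync_remote_ops() is a hypothetical placeholder for the gsync-style
completion call, not a real Cray intrinsic.

    void sync_remote_ops(void);      /* hypothetical stand-in for gsync */

    /* Vectorizable: plain stores to remote memory, one sync after the loop. */
    void put_vector(double *remote, const double *local, int n) {
        for (int i = 0; i < n; i++)
            remote[i] = local[i];
        sync_remote_ops();
    }

    /* Not vectorizable: the per-iteration sync serializes the loop. */
    void put_scalar(double *remote, const double *local, int n) {
        for (int i = 0; i < n; i++) {
            remote[i] = local[i];
            sync_remote_ops();
        }
    }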

15
GASNet/X1 Performance
  • GASNet/X1 improves small-message performance
  • Minimal overhead as a portable "network assembly
    language"
  • The Core API (Active Messages) solves the Cray
    problem of upc_global_alloc (non-collective memory
    allocation)
  • Synthetic benchmarks show no GASNet interference,
    but that is not necessarily the case for application
    benchmarks

16
Outline
  • An Overview of UPC and the Berkeley UPC Compiler
  • Overview of the Cray X1
  • Implementing the GASNet layer on the X1
  • Implementing the runtime layer on the X1
  • Serial performance
  • Evaluation of compiler optimizations

17
Shared Pointer Representations
  • The Cray X1 "memory centrifuge" is useful for UPC
  • Possible to manipulate UPC phaseless pointers
    directly as X1 global pointers allocated from the
    symmetric heap
  • Heavy function inlining and macros remove all
    traces of UPC Runtime and GASNet calls

18
Cost of Shared Pointer Arithmetic and Accesses
19
Outline
  • An Overview of UPC and the Berkeley UPC Compiler
  • Overview of the Cray X1
  • Implementing the GASNet layer on the X1
  • Implementing the runtime layer on the X1
  • Serial performance
  • Evaluation of compiler optimizations

20
Serial Performance
  • It's all about vectorization
  • Cray C is highly sensitive to changes in inner loops
  • We want the translator's output to be as vectorizable
    as the original C source
  • Strategy: keep translated code syntactically close
    to the source
  • Preserve high-level loops
  • a[exp] becomes (a[exp])
  • Multidimensional arrays are linearized
  • Preserve the restrict qualifier and ivdep pragmas
    (see the sketch below)
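A sketch of the kind of output this strategy aims for (hypothetical
translator output, not the Berkeley translator's literal code): the loop
structure, the array syntax, the restrict qualifiers, and the ivdep pragma
all survive translation, so Cray C can still vectorize the inner loop. The
exact pragma spelling is compiler-dependent.

    /* Hypothetical UPC source:   for (i = 0; i < n; i++) a[i] = s * b[i];
       Hypothetical translated C for the locally owned portion of the arrays: */
    void scale(double * restrict a, const double * restrict b, double s, int n) {
        int i;
    #pragma ivdep                /* Cray C also accepts #pragma _CRI ivdep */
        for (i = 0; i < n; i++)
            (a[i]) = s * (b[i]); /* a[exp] stays (a[exp]); loop stays vectorizable */
    }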

21
Livermore Loop Kernels
22
Evaluating Communication Optimizations on Cray X1
  • Message aggregation
  • LogGP model: fewer messages means less overhead
  • Techniques: message vectorization, coalescing,
    bulk prefetching (sketched below)
  • Is this still true on the Cray X1?
  • Remote access latency is comparable to local
    accesses
  • Vectorization should hide most of the overhead of
    small messages
  • Remote data is not cached coherently, so it may
    still help to stage it in local buffers
  • Essentially a question of fine-grained vs.
    coarse-grained programming models
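A sketch contrasting the two styles (names illustrative): fine-grained,
element-wise remote reads versus one aggregated upc_memget into a private
buffer followed by purely local computation.

    #include <upc.h>

    #define B 256
    shared [B] double grid[B*THREADS];

    /* Fine-grained: each iteration may turn into a small remote read. */
    double sum_fine(int peer) {
        double s = 0.0;
        for (int i = 0; i < B; i++)
            s += grid[peer*B + i];
        return s;
    }

    /* Aggregated: one bulk get, then local computation on a private buffer. */
    double sum_bulk(int peer) {
        double buf[B], s = 0.0;
        upc_memget(buf, &grid[peer*B], B * sizeof(double));
        for (int i = 0; i < B; i++)
            s += buf[i];
        return s;
    }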

23
NAS CG: OpenMP style vs. MPI style
  • The fine-grained (OpenMP-style) version is still
    slower
  • The shared memory programming style leads to more
    overhead (redundant boundary computation)
  • UPC's hybrid programming model can really help

24
More Optimizations
  • Overlapping communication and computation
  • Hides communication latencies with independent
    computation
  • Examples: communication scheduling, message
    pipelining
  • Requires split-phase operations: try to separate the
    sync() as far as possible from the non-blocking
    get/put (see the sketch below)
  • But the Cray X1 lacks support for non-blocking gets
  • No user/compiler-level overlapping
  • All communication optimizations rely on
    vectorization (e.g., gups)
  • Vectorization is too restrictive in our opinion: it
    gives up on pointer code, sync() calls, bulk
    synchronous programs, etc.
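A sketch of the overlap pattern described above, again using GASNet 1.x
explicit-handle calls; note that it depends on a non-blocking get, which is
exactly what the X1 lacks. compute_on() stands in for independent user
computation and is assumed, not part of GASNet.

    #include <gasnet.h>

    void compute_on(double *buf, size_t n);   /* independent work (assumed) */

    /* Fetch the next tile while computing on the current one. */
    void pipeline_step(gasnet_node_t peer, double *cur, double *next,
                       double *remote, size_t tile) {
        gasnet_handle_t h =
            gasnet_get_nb(next, peer, remote, tile * sizeof(double));

        compute_on(cur, tile);                /* overlaps the in-flight get */

        gasnet_wait_syncnb(h);                /* sync as late as possible */
    }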

25
Conclusion
  • We have an efficient UPC implementation on the Cray
    X1
  • Evaluation of the Cray X1 for GAS languages
  • Great latency/bandwidth for both local and remote
    memory operations
  • Remote communication is transparent through global
    loads and stores
  • The lack of split-phase gets means lost
    optimization opportunities
  • Poor user-level support for communication and
    synchronization of remote operations (no
    prefetching, no non-binding or per-operation
    completion mechanisms)
  • Heavy reliance on vectorization for performance:
    great when it happens, not so much otherwise
    (slow scalar processor)
  • The first platform that is more sensitive to the
    translated code and less to communication/computation
    scheduling
  • The first possible mismatch for GASNet between
    semantics and platform; we're hoping the X2 can
    address our concerns