1
Implementing a Global Address Space Language on the Cray X1: the Berkeley UPC Experience
Christian Bell and Wei Chen
CS252 Class Project
December 10, 2003
2
Outline
  • An Overview of UPC and the Berkeley UPC Compiler
  • Overview of the Cray X1
  • Implementing the GASNet layer on the X1
  • Implementing the runtime layer on the X1
  • Serial performance
  • Evaluation of compiler optimizations

3
Unified Parallel C (UPC)
  • UPC is an explicitly parallel global address
    space language with SPMD parallelism
  • An extension of ISO C
  • User-level shared memory, partitioned among threads
  • One-sided (bulk and fine-grained) communication
    through reads/writes of shared variables (see the
    sketch below)

[Figure: the UPC global address space -- a shared region partitioned into
per-thread sections X0, X1, ..., XP, above each thread's private memory.]
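As a concrete illustration of the model above, a minimal UPC sketch (the
variable names are illustrative): each thread writes the element of a
cyclically distributed shared array that has affinity to it, then reads its
neighbor's element through an ordinary shared reference, which the compiler
and runtime turn into a one-sided remote read.

    #include <upc.h>

    shared int x[THREADS];          /* one element with affinity to each thread */

    int main(void) {
        x[MYTHREAD] = MYTHREAD;     /* write to my partition of the shared space */
        upc_barrier;
        /* one-sided read; remote if the neighbor lives on another thread */
        int neighbor = x[(MYTHREAD + 1) % THREADS];
        upc_barrier;
        return neighbor < 0;        /* trivial use of the value */
    }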
4
Shared Arrays and Pointers in UPC
[Figure: layout of the arrays below across two threads. T0 holds A0 A2 A4,
B0 B1 B4 B5, and C0 C1 C2; T1 holds A1 A3 A5 and B2 B3 B6 B7.]
  • Cyclic: shared int A[n]
  • Block-cyclic: shared [2] int B[n]
  • Indefinite: shared [0] int *C =
    (shared [0] int *) upc_alloc(n)
  • Use pointers-to-shared to access shared data
  • The block size is part of the pointer type
  • A generic pointer-to-shared contains an
    address, a thread id, and a phase
  • Cyclic and indefinite pointers are phaseless

5
Accessing Shared Memory in UPC
[Figure: a blocked shared array spread across Thread 0 ... Thread N-1. A
pointer-to-shared names an element by its address (the start of its block
within the array object), its thread, and its phase -- the element's offset
within the block, between 0 and the block size.]
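A sketch of the arithmetic behind the figure, assuming the generic
(address, thread, phase) representation from the previous slide; the struct
layout and helper name are illustrative, not the Berkeley UPC runtime's
actual representation.

    #include <stddef.h>

    typedef struct {
        char  *addr;    /* start of the current block in the array object */
        int    thread;  /* thread that owns the current block */
        int    phase;   /* element offset within the block */
    } sptr_t;

    /* Advance a generic pointer-to-shared by one element of 'elemsz' bytes,
       for block size 'blocksz' distributed over 'nthreads' threads.
       The element itself lives at p.addr + p.phase * elemsz on p.thread. */
    static sptr_t sptr_inc(sptr_t p, size_t elemsz, int blocksz, int nthreads) {
        if (++p.phase == blocksz) {         /* crossed a block boundary */
            p.phase = 0;
            if (++p.thread == nthreads) {   /* wrapped past the last thread */
                p.thread = 0;
                p.addr += blocksz * elemsz; /* next round of blocks */
            }
        }
        return p;
    }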
6
UPC Programming Model Features
  • Block-cyclically distributed arrays
  • Shared and private pointers
  • Global synchronization -- barriers
  • Pair-wise synchronization -- locks
  • Parallel loops
  • Dynamic shared memory allocation
  • Bulk shared memory accesses
  • Strict vs. relaxed memory consistency models
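A short UPC sketch exercising several of these features -- a block-cyclic
array, upc_forall, a barrier, a lock, and a bulk upc_memget; the sizes and
names are illustrative.

    #include <upc.h>

    #define B 4                              /* block size */
    shared [B] double a[B*THREADS];          /* block-cyclic distribution */
    shared double sum = 0.0;
    upc_lock_t *lock;

    int main(void) {
        double buf[B], mine = 0.0;
        int i;

        lock = upc_all_lock_alloc();         /* collective lock creation */

        /* parallel loop; each iteration runs where a[i] has affinity */
        upc_forall (i = 0; i < B*THREADS; i++; &a[i])
            a[i] = (double)i;

        upc_barrier;                         /* global synchronization */

        for (i = 0; i < B*THREADS; i++)
            if (upc_threadof(&a[i]) == MYTHREAD)
                mine += a[i];

        upc_lock(lock);                      /* pair-wise synchronization */
        sum += mine;
        upc_unlock(lock);

        upc_memget(buf, &a[0], B * sizeof(double));  /* bulk one-sided copy */
        upc_barrier;
        return 0;
    }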

7
Overview of the Berkeley UPC Compiler
Two goals: portability and high performance.

[Figure: the layered compiler architecture]
  UPC Code
  Translator -- lowers UPC code into ANSI C code (platform-independent)
  Translator-Generated C Code
  Berkeley UPC Runtime System -- shared memory management and
    pointer operations (compiler-independent)
  GASNet Communication System -- uniform get/put interface for the
    underlying networks (language- and network-independent)
  Network Hardware
8
A Layered Design
  • Portable
  • C is our intermediate language
  • Can run on top of MPI (with performance penalty)
  • GASNet has a layered design with a small core
  • High-Performance
  • Native C compiler optimizes serial code
  • Translator can perform high-level communication
    optimizations
  • GASNet can access network hardware directly,
    provides a rich set of communication /
    synchronization primitives

9
Outline
  • An Overview of UPC and the Berkeley UPC Compiler
  • Overview of the Cray X1
  • Implementing the GASNet layer on the X1
  • Implementing the runtime layer on the X1
  • Serial performance
  • Evaluation of compiler optimizations

10
The Cray X1 Architecture
  • New line of Vector Architecture
  • Two modes of operation
  • SSP up to 16 CPUs/node
  • MSP multistreams long loops
  • Single-node UMA, multi-node NUMA (no caching
    remote data)
  • Global pointers
  • Low latency, high bandwidth
  • All Gets/Puts must be loads/stores (directly or
    shmem interface)
  • Only puts are non-blocking, gets are blocking
  • Vectorization is crucial
  • Vector pipeline 2x faster than scalar
  • Utilization of memory bandwidth
  • Strided accesses, scatter-gather, reduction, etc.
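A sketch of remote access through the shmem interface mentioned above: a put
followed by shmem_quiet to force completion. The initialization call differs
between SHMEM generations (start_pes(0) on older Cray systems, shmem_init()
in OpenSHMEM), so treat the setup calls as assumptions.

    #include <shmem.h>

    long dst[8];                     /* static, therefore symmetric on every PE */
    long src[8];

    int main(void) {
        shmem_init();                /* OpenSHMEM-style startup (assumed) */
        int me = shmem_my_pe();
        int np = shmem_n_pes();

        for (int i = 0; i < 8; i++) src[i] = me;

        /* put 8 longs into the next PE's dst; puts may complete asynchronously */
        shmem_long_put(dst, src, 8, (me + 1) % np);
        shmem_quiet();               /* wait for completion of outstanding puts */

        shmem_barrier_all();
        shmem_finalize();
        return 0;
    }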

11
Outline
  • An Overview of UPC and the Berkeley UPC Compiler
  • Overview of the Cray X1
  • Implementing the GASNet layer on the X1
  • Implementing the runtime layer on the X1
  • Serial performance
  • Evaluation of compiler optimizations

12
GASNet Communication System -- Architecture
  • 2-Level architecture to ease implementation
  • Core API
  • Based heavily on Active Messages
  • Compatibility layer
  • Ported to the X1 in 2 days, with a new algorithm to
    manipulate queues in shared memory
  • Extended API
  • Wider interface that includes more complicated
    operations (puts, gets)
  • A reference implementation of the extended API in
    terms of the core API is provided
  • Current revision is tuned especially for the X1
    with shared memory as the primary focus (minimal
    overhead)

[Figure: the software stack -- compiler-generated code over the
compiler-specific runtime system, over the GASNet Extended API, over the
GASNet Core API, over the network hardware.]
13
GASNet Extended API -- Remote memory operations
  • GASNet offers expressive Put/Get primitives
  • All gets/puts can be blocking or non-blocking
  • Non-blocking can be explicit (handle-based)
  • Non-blocking can be implicit (global or
    region-based)
  • Synchronization can poll or block
  • Paves the way for complex split-phase
    communication (compiler optimizations)
  • The Cray X1 relies exclusively on shared memory
  • All gets/puts must be loads/stores
  • Only puts are non-blocking; gets are blocking
  • Very limited synchronization mechanisms
  • Efficient communication only through vectors (one
    order of magnitude between scalar and vector
    communication)
  • Vectorization instead of split-phase?
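A sketch of the split-phase style the Extended API supports, using
explicit-handle calls from the GASNet 1.x spec (gasnet_put_nb,
gasnet_wait_syncnb, gasnet_get); the buffers and peer are illustrative, and
rbuf is assumed to lie in the peer's registered segment.

    #include <gasnet.h>

    void exchange(gasnet_node_t peer, double *lbuf, double *rbuf, size_t n) {
        /* initiate a non-blocking put; returns a handle immediately */
        gasnet_handle_t h = gasnet_put_nb(peer, rbuf, lbuf, n * sizeof(double));

        /* ...independent computation could overlap the transfer here... */

        gasnet_wait_syncnb(h);             /* block until the put completes */

        /* blocking get of the peer's buffer back into local memory */
        gasnet_get(lbuf, peer, rbuf, n * sizeof(double));
    }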

14
GASNet and Cray X1 -- Remote memory operations

  GASNet operation                           Cray X1 instruction   Comment
  Bulk operations                            Vector bcopy()        Fully vectorized; suitable for GASNet/UPC
  Non-bulk blocking puts                     Store + gsync         No vectorization
  Non-bulk blocking gets                     Load
  Non-bulk non-blocking explicit puts/gets   Store/load + gsync    No vectorization if the sync is done in the loop
  Non-bulk non-blocking implicit puts/gets   Store/load + gsync    No vectorization if the sync is done in the loop
  • Flexible communication provides no benefit
    without vectorization (a factor of 10 between
    vector and scalar; sketched below)
  • It is difficult to expose vectorization through a
    layered software stack -- the native C compiler now
    has to optimize parallel code!
  • The Cray X1's big-hammer gsync() prevents
    interesting communication optimizations
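A sketch of the point in the table above: a loop of remote stores vectorizes
when the completion sync is hoisted out of the loop, and serializes when it
is not. sync_remote_ops() is a hypothetical placeholder for the gsync-style
completion call, not a real Cray intrinsic.

    void sync_remote_ops(void);      /* hypothetical stand-in for gsync */

    /* Vectorizable: plain stores to remote memory, one sync after the loop. */
    void put_vector(double *remote, const double *local, int n) {
        for (int i = 0; i < n; i++)
            remote[i] = local[i];
        sync_remote_ops();
    }

    /* Not vectorizable: the per-iteration sync serializes the loop. */
    void put_scalar(double *remote, const double *local, int n) {
        for (int i = 0; i < n; i++) {
            remote[i] = local[i];
            sync_remote_ops();
        }
    }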

15
GASNet/X1 Performance
  • GASNet/X1 improves small-message performance
  • Minimal overhead as a portable "network assembly
    language"
  • The Core API (Active Messages) solves the Cray
    problem of upc_global_alloc (non-collective memory
    allocation)
  • Synthetic benchmarks show no GASNet interference,
    but that is not necessarily the case for application
    benchmarks

16
Outline
  • An Overview of UPC and the Berkeley UPC Compiler
  • Overview of the Cray X1
  • Implementing the GASNet layer on the X1
  • Implementing the runtime layer on the X1
  • Serial performance
  • Evaluation of compiler optimizations

17
Shared Pointer Representations
  • The Cray X1 "memory centrifuge" is useful for UPC
  • Possible to manipulate UPC phaseless pointers
    directly as X1 global pointers allocated from the
    symmetric heap
  • Heavy function inlining and macros remove all
    traces of UPC Runtime and GASNet calls

18
Cost of Shared Pointer Arithmetic and Accesses
19
Outline
  • An Overview of UPC and the Berkeley UPC Compiler
  • Overview of the Cray X1
  • Implementing the GASNet layer on the X1
  • Implementing the runtime layer on the X1
  • Serial performance
  • Evaluation of compiler optimizations

20
Serial Performance
  • It's all about vectorization
  • Cray C is highly sensitive to changes in inner loops
  • We want the translator's output to be as vectorizable
    as the original C source
  • Strategy: keep translated code syntactically close
    to the source
  • Preserve high-level loops
  • a[exp] becomes (a[exp])
  • Multidimensional arrays are linearized
  • Preserve the restrict qualifier and ivdep pragmas
    (see the sketch below)
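A sketch of the kind of output this strategy aims for (hypothetical
translator output, not the Berkeley translator's literal code): the loop
structure, the array syntax, the restrict qualifiers, and the ivdep pragma
all survive translation, so Cray C can still vectorize the inner loop. The
exact pragma spelling is compiler-dependent.

    /* Hypothetical UPC source:   for (i = 0; i < n; i++) a[i] = s * b[i];
       Hypothetical translated C for the locally owned portion of the arrays: */
    void scale(double * restrict a, const double * restrict b, double s, int n) {
        int i;
    #pragma ivdep                /* Cray C also accepts #pragma _CRI ivdep */
        for (i = 0; i < n; i++)
            (a[i]) = s * (b[i]); /* a[exp] stays (a[exp]); loop stays vectorizable */
    }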

21
Livermore Loop Kernels
22
Evaluating Communication Optimizations on Cray X1
  • Message aggregation
  • LogGP model: fewer messages means less overhead
  • Techniques: message vectorization, coalescing,
    bulk prefetching (sketched below)
  • Is this still true on the Cray X1?
  • Remote access latency is comparable to local
    accesses
  • Vectorization should hide most of the overhead of
    small messages
  • Remote data is not cached coherently, so it may
    still help to stage it in local buffers
  • Essentially a question of fine-grained vs.
    coarse-grained programming models
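A sketch contrasting the two styles (names illustrative): fine-grained,
element-wise remote reads versus one aggregated upc_memget into a private
buffer followed by purely local computation.

    #include <upc.h>

    #define B 256
    shared [B] double grid[B*THREADS];

    /* Fine-grained: each iteration may turn into a small remote read. */
    double sum_fine(int peer) {
        double s = 0.0;
        for (int i = 0; i < B; i++)
            s += grid[peer*B + i];
        return s;
    }

    /* Aggregated: one bulk get, then local computation on a private buffer. */
    double sum_bulk(int peer) {
        double buf[B], s = 0.0;
        upc_memget(buf, &grid[peer*B], B * sizeof(double));
        for (int i = 0; i < B; i++)
            s += buf[i];
        return s;
    }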

23
NAS CG: OpenMP style vs. MPI style
  • The fine-grained (OpenMP-style) version is still
    slower
  • The shared memory programming style leads to more
    overhead (redundant boundary computation)
  • UPC's hybrid programming model can really help

24
More Optimizations
  • Overlapping communication and computation
  • Hides communication latencies with independent
    computation
  • Examples: communication scheduling, message
    pipelining
  • Requires split-phase operations: try to separate the
    sync() as far as possible from the non-blocking
    get/put (see the sketch below)
  • But the Cray X1 lacks support for non-blocking gets
  • No user/compiler-level overlapping
  • All communication optimizations rely on
    vectorization (e.g., gups)
  • Vectorization is too restrictive in our opinion: it
    gives up on pointer code, sync() calls, bulk
    synchronous programs, etc.
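A sketch of the overlap pattern described above, again using GASNet 1.x
explicit-handle calls; note that it depends on a non-blocking get, which is
exactly what the X1 lacks. compute_on() stands in for independent user
computation and is assumed, not part of GASNet.

    #include <gasnet.h>

    void compute_on(double *buf, size_t n);   /* independent work (assumed) */

    /* Fetch the next tile while computing on the current one. */
    void pipeline_step(gasnet_node_t peer, double *cur, double *next,
                       double *remote, size_t tile) {
        gasnet_handle_t h =
            gasnet_get_nb(next, peer, remote, tile * sizeof(double));

        compute_on(cur, tile);                /* overlaps the in-flight get */

        gasnet_wait_syncnb(h);                /* sync as late as possible */
    }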

25
Conclusion
  • We have an efficient UPC implementation on the Cray
    X1
  • Evaluation of the Cray X1 for GAS languages
  • Great latency/bandwidth for both local and remote
    memory operations
  • Remote communication is transparent through global
    loads and stores
  • The lack of split-phase gets means lost
    optimization opportunities
  • Poor user-level support for communication and
    synchronization of remote operations (no
    prefetching, no non-binding or per-operation
    completion mechanisms)
  • Heavy reliance on vectorization for performance:
    great when it happens, not so much otherwise
    (slow scalar processor)
  • The first platform that is more sensitive to the
    translated code and less to communication/computation
    scheduling
  • The first possible mismatch for GASNet between
    semantics and platform; we're hoping the X2 can
    address our concerns