1
The Berkeley UPC Compiler: Implementation and Performance
Wei Chen, Dan Bonachea, Jason Duell, Parry Husbands, Costin Iancu, Kathy Yelick
the LBNL/Berkeley UPC Group
http://upc.lbl.gov
2
Outline
  • An Overview of UPC
  • Design and Implementation of the Berkeley UPC
    Compiler
  • Preliminary Performance Results
  • Communication Optimizations

3
Unified Parallel C (UPC)
  • UPC is an explicitly parallel global address
    space language with SPMD parallelism
  • An extension of C
  • Shared memory is partitioned among threads
  • One-sided (bulk and fine-grained) communication
    through reads/writes of shared variables (see the
    sketch below)
  • A collective effort by industry, academia, and
    government
  • http://upc.gwu.edu

[Figure: the partitioned global address space — each thread owns a
shared partition (holding X0, X1, …, XP) visible to all threads, plus
its own private memory]
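
To make the model concrete, a minimal sketch (not from the original
slides): each thread writes the shared element it owns, then performs a
one-sided read of a neighbor's element, with no matching action by the
owning thread.

  #include <upc_relaxed.h>
  #include <stdio.h>

  shared int x[THREADS];    /* one element with affinity to each thread */

  int main(void) {
      x[MYTHREAD] = MYTHREAD;    /* write my own element */
      upc_barrier;               /* global synchronization */
      /* One-sided read of a (possibly remote) shared element. */
      int left = x[(MYTHREAD + THREADS - 1) % THREADS];
      printf("thread %d read %d\n", MYTHREAD, left);
      return 0;
  }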
4
UPC Programming Model Features
  • Block-cyclically distributed arrays
  • Shared and private pointers
  • Global synchronization: barriers
  • Pair-wise synchronization: locks
  • Parallel loops
  • Dynamic shared memory allocation
  • Bulk shared memory accesses
  • Strict vs. relaxed memory consistency models
    (several of these appear in the sketch below)
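
An illustrative UPC sketch (not from the original slides) touching
several of these features: a block-cyclic shared array, a parallel loop
with an affinity expression, a lock-guarded update, and a barrier.

  #include <upc_relaxed.h>

  #define N 1024

  shared [4] int a[N];    /* block-cyclic distribution, blocks of 4 */
  shared int total;       /* shared counter, affinity to thread 0 */
  upc_lock_t *lock;       /* pair-wise synchronization */

  int main(void) {
      int i, mine = 0;
      lock = upc_all_lock_alloc();          /* collective lock creation */
      upc_forall (i = 0; i < N; i++; &a[i]) /* each thread runs only the
                                               iterations it owns */
      {
          a[i] = i;
          mine++;
      }
      upc_lock(lock);                       /* guarded shared update */
      total += mine;
      upc_unlock(lock);
      upc_barrier;                          /* global synchronization */
      return 0;
  }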

5
Design and Implementation of the Berkeley UPC
Compiler
6
Overview of the Berkeley UPC Compiler
Two goals: portability and high performance

[Figure: the compiler stack —
  UPC Code
    → Translator (lowers UPC code into ANSI C; platform-independent)
    → Translator-Generated C Code
    → Berkeley UPC Runtime System (shared memory management and
      pointer operations; network-independent)
    → GASNet Communication System (uniform get/put interface for the
      underlying networks; compiler-independent, language-independent)
    → Network Hardware]
7
A Layered Design
  • Portable
    • C is our intermediate language
    • Can run on top of MPI (with a performance penalty)
    • GASNet has a layered design with a small core
  • High-performance
    • The native C compiler optimizes the serial code
    • The translator can perform high-level communication
      optimizations
    • GASNet can access network hardware directly
8
Implementing the UPC to C Translator
  • Based on the Open64 compiler
  • Source-to-source transformation
  • Converts shared memory operations into runtime
    library calls (see the sketch after the pipeline)
  • Designed to incorporate the existing optimization
    framework in Open64
  • Communicates with the runtime via a standard API

[Figure: translation pipeline — Preprocessed File → C front end →
WHIRL w/ shared types → Backend lowering → WHIRL w/ runtime calls →
whirl2c → ANSI-compliant C Code]
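
A schematic sketch of the lowering step (pts_t and bupc_runtime_get are
made-up names for illustration, not the actual runtime API): a read of
a shared scalar becomes a runtime library call on a pointer-to-shared.

  /* UPC source */
  shared int s;
  int y = s;

  /* Translator output, schematically */
  pts_t s_ptr;                                /* pointer-to-shared for s */
  int y;
  bupc_runtime_get(&y, s_ptr, sizeof(int));   /* the runtime resolves the
                                                 access, local or remote */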
9
Shared Arrays and Pointers in UPC
[Figure: layout of A, B, and C across two threads —
  T0: A0 A2 A4 | B0 B1 B4 B5 | C0 C1 C2
  T1: A1 A3 A5 | B2 B3 B6 B7]

  • Cyclic: shared int A[n]
  • Block-cyclic: shared [2] int B[n]
  • Indefinite: shared [0] int *C =
    (shared [0] int *) upc_alloc(n)
  • Use global pointers (pointers-to-shared) to access
    shared (possibly remote) data
  • The block size is part of the pointer type
  • A generic pointer-to-shared contains an
    address, a thread id, and a phase

10
Accessing Shared Memory in UPC
[Figure: how a pointer-to-shared addresses shared memory — the address
locates the start of the array object on one of threads 0 … N-1, the
block size bounds each block, and the phase (2 in the figure) is the
offset within the current block]
11
Phaseless Pointer-to-Shared
  • A pointer needs a phase to keep track of where
    it is within a block
  • This is a source of overhead for pointer arithmetic
  • Special cases of phaseless pointers: cyclic and
    indefinite
  • Cyclic pointers always have phase 0
  • Indefinite pointers only have one block
  • No need to track the phase in pointer operations
    for cyclic and indefinite pointers
  • No need to update the thread id for indefinite
    pointer arithmetic (see the sketch below)
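
A sketch of why the phase matters (field and parameter names are
illustrative, not the runtime's): incrementing a generic blocked
pointer must track the phase and hop between threads, whereas a cyclic
pointer only advances the thread id and an indefinite pointer only
advances the address.

  #include <stddef.h>

  typedef struct { size_t addr, thread, phase; } pts_t;  /* illustrative */

  /* p = p + 1 for a generic pointer-to-shared with element size e,
     block size B, and nthreads threads. */
  void pts_inc(pts_t *p, size_t e, size_t B, size_t nthreads) {
      p->phase++;
      p->addr += e;
      if (p->phase == B) {              /* crossed a block boundary */
          p->phase = 0;
          p->thread++;
          if (p->thread == nthreads)    /* wrapped around: the next block
                                           lives back on thread 0 */
              p->thread = 0;
          else
              p->addr -= B * e;         /* same block offset, next thread */
      }
      /* Cyclic (B == 1): the phase is always 0, so only the thread id
         and, on wrap-around, the address change. Indefinite: only
         addr += e is needed. */
  }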

12
Pointer-to-Shared Representation
  • Pointer size
    • Want to allow pointers to reside in a register
    • But very large machines may require a longer
      representation
  • Datatype
    • Use of a scalar type (long) rather than a struct
      may improve backend code quality
    • Faster pointer manipulation, e.g., ptr + int, as
      well as dereferencing
  • Portability/performance balance in the UPC compiler
    • 8-byte scalar vs. struct format (a configuration-
      time option)
    • The pointer representation is hidden in the
      runtime layer
    • The modular design makes it easy to add new
      representations (see the sketch below)
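
A sketch of the two formats (field names, ordering, and bit widths are
assumptions for illustration, not the actual Berkeley layouts):

  #include <stdint.h>

  /* Struct format: roomy (16 bytes here) but harder for the backend
     to keep in a register. */
  typedef struct {
      uint64_t addr;     /* virtual address on the owning thread */
      uint32_t thread;   /* owning thread id */
      uint32_t phase;    /* offset within the current block */
  } pts_struct_t;

  /* Packed 8-byte scalar format: fits in one register, so arithmetic
     and dereferencing compile to plain integer operations. */
  typedef uint64_t pts_packed_t;

  static inline uint64_t pts_addr(pts_packed_t p)   { return p & ((1ULL << 40) - 1); }
  static inline unsigned pts_thread(pts_packed_t p) { return (p >> 40) & 0x3FF; }
  static inline unsigned pts_phase(pts_packed_t p)  { return (p >> 50) & 0x3FFF; }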

13
Performance Results
14
Performance Evaluation
  • Testbed
    • HP AlphaServer (1 GHz processors) with a Quadrics
      interconnect
    • Compaq C compiler for the translated C code
    • Compared against HP UPC 2.1
  • Cost of language features
    • Shared pointer arithmetic, shared memory
      accesses, parallel loops, etc.
  • Application benchmarks
    • EP: no communication
    • IS: large bulk memory operations
    • MG: bulk memgets
    • CG: fine-grained vs. bulk memputs
  • Potential of optimizations
    • Measure the effectiveness of various
      communication optimizations

15
Performance of Shared Pointer Arithmetic
(1 cycle = 1 ns; the struct representation is 16 bytes)
  • The phaseless pointer is an important optimization
  • The packed representation also helps

16
Cost of Shared Memory Access
  • Local shared accesses are somewhat slower than
    private accesses
  • The layered design does not add extra overhead
  • Remote accesses are a few orders of magnitude
    slower than local ones

17
Parallel Loops in UPC
  • UPC has a forall construct for distributing
    computation:

    shared int v1[N], v2[N], v3[N];
    upc_forall (i = 0; i < N; i++; &v3[i])
        v3[i] = v2[i] + v1[i];

  • An affinity test is performed on every iteration to
    decide whether it should execute
  • Two kinds of affinity expressions:
    • Integer (compared with the thread id)
    • Shared address (check the affinity of the address)
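
A sketch of the naive lowering of the loop above (upc_threadof is the
standard UPC affinity query; the rest is illustrative):

  /* Every thread runs all N iterations, testing affinity each time. */
  for (i = 0; i < N; i++)
      if (upc_threadof(&v3[i]) == MYTHREAD)   /* per-iteration test */
          v3[i] = v2[i] + v1[i];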

18
Application Performance
19
NAS Benchmarks (EP and IS)
  • EP shows that the backend C compiler can still
    successfully optimize translated C code
  • IS shows that the Berkeley UPC compiler handles
    bulk communication operations effectively

20
NAS Benchmarks (CG and MG)
  • The Berkeley UPC compiler scales well

21
Performance of Fine-grained Applications
  • Doesn't scale well, due to the nature of the
    benchmark (lots of small reads)
  • HP UPC's software caching helps its performance

22
Observations on the Results
  • Acceptable worst-case overhead for shared memory
    access latencies
    • < 10-cycle overhead for shared local accesses
    • ~1.5 µs overhead on top of the end-to-end
      network latency
  • Optimizations on the pointer-to-shared
    representation are effective
    • Both phaseless pointers and the packed 8-byte
      format
  • Good performance compared to HP UPC 2.1

23
Communication Optimizations for UPC
24
Communication Optimizations
  • Hiding communication latencies
    • Use of non-blocking operations
    • Possible placement analysis: separate get()/put()
      initiations as far as possible from their sync()
      (see the sketch below)
    • Message pipelining to overlap communication with
      more communication
  • Optimizing shared memory accesses
    • Eliminating locality tests for local shared
      pointers: flow- and context-sensitive analysis
    • Transforming forall loops into equivalent for
      loops
    • Eliminating redundant pointer arithmetic for
      pointers with the same thread and phase
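
A sketch of the split-phase idea using GASNet-style non-blocking gets
(do_independent_work stands in for computation that does not touch dst;
error handling and GASNet initialization are omitted):

  #include <gasnet.h>

  extern void do_independent_work(void);   /* hypothetical overlap work */

  /* Initiate early, sync late: the transfer proceeds while the CPU
     does unrelated computation. */
  void fetch_with_overlap(void *dst, gasnet_node_t node,
                          void *src, size_t nbytes) {
      gasnet_handle_t h = gasnet_get_nb(dst, node, src, nbytes); /* init */
      do_independent_work();        /* overlap window */
      gasnet_wait_syncnb(h);        /* sync placed as late as possible */
  }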

25
More Optimizations
  • Message vectorization and aggregation
    • Scatter/gather techniques
    • Packing generally pays off for small (< 500-byte)
      messages
  • Software caching and prefetching
    • A prototype implementation
    • Local knowledge only: no coherence messages
    • Caches remote reads and buffers outgoing writes
    • Based on the weak ordering consistency model

26
Example: Optimizing Loops
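
The code from this slide is not in the transcript; a sketch of the
transformation it illustrated, for an integer affinity expression (the
array names are illustrative):

  /* Before: the implicit affinity test runs on every iteration. */
  upc_forall (i = 0; i < N; i++; i)
      a[i] = b[i] + c[i];

  /* After: an equivalent for loop with no per-iteration test; each
     thread strides directly over its own iterations. */
  for (i = MYTHREAD; i < N; i += THREADS)
      a[i] = b[i] + c[i];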
27
Experimental Results
  • Computation/communication overlap works better than
    communication/communication overlap on Quadrics
  • Results are likely to differ on other networks

28
Example: Optimizing Local Shared Accesses
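
This slide's code is also missing from the transcript; the
privatization it demonstrated can be sketched as follows (the array
name and block size are illustrative). In UPC, a pointer-to-shared
whose target has local affinity may legally be cast to a plain C
pointer:

  #define B 64
  shared [B] double a[B * THREADS];    /* one block per thread */

  /* Naive version: each access goes through the pointer-to-shared
     machinery and an affinity check. */
  for (int i = 0; i < B; i++)
      a[MYTHREAD * B + i] *= 2.0;

  /* Privatized version: cast the locally-owned block to a private
     pointer; every access becomes a direct load/store. */
  double *pa = (double *) &a[MYTHREAD * B];
  for (int i = 0; i < B; i++)
      pa[i] *= 2.0;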
29
Experimental Results
  • Neither compiler performs well on the naïve version
  • Culprits: pointer-to-shared operations and affinity
    tests
  • Privatizing local shared accesses improves
    performance by an order of magnitude

30
Compiler Status
  • A fully UPC 1.1-compliant public release in April
  • Supported platforms
    • HP AlphaServer, IBM SP, Linux x86/Itanium, SGI
      Origin 2000, Solaris SPARC/x86, Mac OS X PowerPC
  • Supported networks
    • Quadrics/Elan, Myrinet/GM, IBM/LAPI, and MPI
  • A release this summer will add
    • Pthreads/System V shared memory support
    • GASNet Infiniband support

31
Conclusion
  • The Berkeley UPC Compiler achieves both
    portability and good performance
    • Layered, modular design
    • Effective pointer-to-shared optimizations
    • Good performance compared to a commercially
      available UPC compiler
  • Still lots of opportunities for communication
    optimizations
  • Available for download at http://upc.lbl.gov