Unified Parallel C - PowerPoint PPT Presentation

1
Unified Parallel C
  • Kathy Yelick
  • EECS, U.C. Berkeley and NERSC/LBNL
  • NERSC Team Dan Bonachea, Jason Duell, Paul
    Hargrove, Parry Husbands, Costin Iancu, Mike
    Welcome, Christian Bell

2
Outline
  • Global Address Space Languages in General
  • Programming models
  • Overview of Unified Parallel C (UPC)
  • Programmability advantage
  • Performance opportunity
  • Status
  • Next step
  • Related projects

3
Programming Model 1: Shared Memory
  • Program is a collection of threads of control.
  • Many languages allow threads to be created
    dynamically.
  • Each thread has a set of private variables, e.g.
    local variables on the stack.
  • Threads also share a set of shared variables,
    e.g., static variables, shared common blocks,
    and the global heap.
  • Threads communicate implicitly by writing/reading
    shared variables.
  • Threads coordinate using synchronization
    operations on shared variables

[Figure: threads P0 ... Pn, each with private variables (e.g., y) on its own stack, all reading and writing a variable x in the shared region]
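A minimal C sketch of this model (not part of the original slides), using POSIX threads: shared_sum plays the role of a shared variable, my_id is private to each thread, and a mutex provides the synchronization on shared data.

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4

    static int shared_sum = 0;                 /* shared variable */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg)
    {
        int my_id = *(int *)arg;               /* private: lives on this thread's stack */
        pthread_mutex_lock(&lock);             /* coordinate via a synchronization op */
        shared_sum += my_id;                   /* communicate implicitly through shared memory */
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        int ids[NTHREADS];
        for (int i = 0; i < NTHREADS; i++) {
            ids[i] = i;
            pthread_create(&tid[i], NULL, worker, &ids[i]);
        }
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);
        printf("sum = %d\n", shared_sum);      /* 0+1+2+3 = 6 */
        return 0;
    }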
4
Programming Model 2: Message Passing
  • Program consists of a collection of named
    processes.
  • Usually fixed at program startup time
  • Thread of control plus local address space -- NO
    shared data.
  • Logically shared data is partitioned over local
    processes.
  • Processes communicate by explicit send/receive
    pairs
  • Coordination is implicit in every communication
    event.
  • MPI is the most common example

[Figure: processes P0 ... Pn, each with its own local address space; data moves via matching send/receive pairs (send P0,X on one process, recv Pn,Y on the other)]
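A minimal MPI sketch of the explicit send/receive pairing described above (illustrative only; the variable names X and Y follow the figure):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (size >= 2) {
            if (rank == 0) {
                int X = 42;
                /* explicit send: communication and coordination in one event */
                MPI_Send(&X, 1, MPI_INT, size - 1, 0, MPI_COMM_WORLD);
            } else if (rank == size - 1) {
                int Y;
                MPI_Recv(&Y, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                printf("P%d received Y = %d from P0\n", rank, Y);
            }
        }
        MPI_Finalize();
        return 0;
    }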
5
Tradeoffs Between the Models
  • Shared memory
  • Programming is easier
  • Can build large shared data structures
  • Machines don't scale
  • SMPs typically < 16 processors (Sun, DEC, Intel,
    IBM)
  • Distributed shared memory < 128 (SGI)
  • Performance is hard to predict and control
  • Message passing
  • Machines easier to build from commodity parts
  • Can scale (given sufficient network)
  • Programming is harder
  • Distributed data structures only in the
    programmer's mind
  • Tedious packing/unpacking of irregular data
    structures

6
Global Address Space Programming
  • Intermediate point between message passing and
    shared memory
  • Program consists of a collection of processes.
  • Fixed at program startup time, like MPI
  • Local and shared data, as in shared memory model
  • But, shared data is partitioned over local
    processes
  • Remote data stays remote on distributed memory
    machines
  • Processes communicate by reads/writes to shared
    variables
  • Examples are UPC, Titanium, CAF, Split-C
  • Note: These are not data-parallel languages
  • heroic compilers not required

7
GAS Languages on Clusters of SMPs
  • Clusters of SMPs (CLUMPs)
  • IBM SP 16-way SMP nodes
  • Berkeley Millennium 2-way and 4-way nodes
  • What is an appropriate programming model?
  • Use message passing throughout
  • Most common model
  • Unnecessary packing/unpacking overhead
  • Hybrid models
  • Write 2 parallel programs (MPI + OpenMP or
    Threads)
  • Global address space
  • Only adds test (on/off node) before local
    read/write

8
Support for GAS Languages
  • Unified Parallel C (UPC)
  • Funded by the NSA
  • Compaq compiler for Alpha/Quadrics
  • HP, Sun and Cray compilers under development
  • Gcc-based compiler for SGI (Intrepid)
  • Gcc-based compiler (SRC) for Cray T3E
  • MTU and Compaq effort for MPI-based compiler
  • LBNL compiler based on Open64
  • Co-Array Fortran (CAF)
  • Cray compiler
  • Rice and UMN effort based on Open64
  • SPMD Java (Titanium)
  • UCB compiler available for most machines

9
Parallelism Model in UPC
  • UPC uses an SPMD model of parallelism
  • A set of THREADS threads working independently
  • Two compilation models
  • THREADS may be fixed at compile time or
  • Dynamically set at program startup time
  • MYTHREAD specifies thread index (0..THREADS-1)
  • Basic synchronization mechanisms
  • Barriers (normal and split-phase), locks
  • What UPC does not do automatically
  • Determine data layout
  • Load balance: move computations
  • Caching: move data
  • These are intentionally left to the programmer
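A minimal UPC sketch of the SPMD model (not from the slides): every thread executes main, MYTHREAD picks the element each thread owns, and a barrier synchronizes before thread 0 reads the whole shared array.

    #include <upc_relaxed.h>
    #include <stdio.h>

    shared int hits[THREADS];          /* one element with affinity to each thread */

    int main(void)
    {
        hits[MYTHREAD] = MYTHREAD;     /* each thread writes the element it owns */

        upc_barrier;                   /* wait until all threads have written */

        if (MYTHREAD == 0) {
            int i, total = 0;
            for (i = 0; i < THREADS; i++)
                total += hits[i];      /* remote elements read like ordinary array elements */
            printf("sum over %d threads = %d\n", THREADS, total);
        }
        return 0;
    }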

10
UPC Pointers
  • Pointers may point to shared or private variables
  • Same syntax for use, just add qualifier
  • shared int *sp;
  • int *lp;
  • sp is a pointer to an integer residing in the
    shared memory space.
  • sp is called a shared pointer (somewhat sloppy).

[Figure: global address space. The shared region holds x = 3; each thread's private region holds its own sp, all pointing at x.]
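A short illustrative UPC fragment (not from the slides) using the two declarations above; the shared variable x is an assumption added for the example.

    #include <upc_relaxed.h>
    #include <stdio.h>

    shared int x;                  /* shared scalar, affinity to thread 0 */

    int main(void)
    {
        shared int *sp = &x;       /* pointer to shared: may reference remote memory */
        int local = 0;
        int *lp = &local;          /* ordinary private pointer */

        if (MYTHREAD == 0)
            *sp = 3;               /* write through the shared pointer */
        upc_barrier;

        *lp = *sp;                 /* every thread reads x, possibly a remote read */
        printf("thread %d sees x = %d\n", MYTHREAD, *lp);
        return 0;
    }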
11
Shared Arrays in UPC
  • Shared array elements are spread across the
    threads
  • shared int x[THREADS]     /* One element per
    thread */
  • shared int y[3][THREADS]  /* 3 elements per
    thread */
  • shared int z[3*THREADS]   /* 3 elements per
    thread, cyclic */
  • In the pictures below
  • Assume THREADS = 4
  • Elements with affinity to processor 0 are red

[Figure: layouts of x, y (blocked -- of course, this is really a 2D array), and z (cyclic) across the 4 threads]
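A small sketch (not from the slides) showing how these arrays are typically used; upc_forall's affinity expression assigns each iteration to the thread that owns the element, so the accesses below stay local.

    #include <upc_relaxed.h>

    shared int x[THREADS];     /* one element per thread */
    shared int y[3][THREADS];  /* thread t owns y[0][t], y[1][t], y[2][t] */
    shared int z[3*THREADS];   /* cyclic: z[i] has affinity to thread i % THREADS */

    int main(void)
    {
        int i;

        /* the 4th expression gives the affinity: iteration i runs on the
           thread that owns z[i], so the write is always local */
        upc_forall (i = 0; i < 3 * THREADS; i++; &z[i])
            z[i] = i;

        upc_forall (i = 0; i < THREADS; i++; &x[i])
            x[i] = z[i];       /* z[i] for i < THREADS is also local to thread i */

        return 0;
    }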
12
Overlapping Communication in UPC
  • Programs with fine-grained communication require
    overlap for performance
  • UPC compiler does this automatically for
    relaxed accesses.
  • Accesses may be designated as strict, relaxed, or
    unqualified (the default).
  • There are several ways of designating the
    ordering type.
  • A type qualifier, strict or relaxed can be used
    to affect all variables of that type.
  • Labels strict or relaxed can be used to control
    the accesses within a statement.
  • strict: x = y; z = y+1;
  • A strict or relaxed cast can be used to override
    the current label or type qualifier.
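A hedged producer/consumer sketch (not from the slides) of how the type qualifiers interact: the relaxed write to data may be overlapped or reordered, but it cannot appear to move past the strict write to flag, so the consumer is guaranteed to see the data once it sees the flag.

    #include <upc_relaxed.h>   /* unqualified shared accesses default to relaxed */
    #include <stdio.h>

    relaxed shared int data;   /* may be reordered/overlapped by the compiler */
    strict  shared int flag;   /* strict accesses act as ordering points */

    int main(void)
    {
        if (MYTHREAD == 0) {
            data = 42;         /* relaxed write ... */
            flag = 1;          /* ... must complete before this strict write is visible */
        } else if (MYTHREAD == 1) {
            while (flag == 0)  /* strict reads poll the flag */
                ;
            printf("consumer sees data = %d\n", data);  /* guaranteed to print 42 */
        }
        return 0;
    }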

13
Performance of UPC
  • Reasons why UPC may be slower than MPI
  • Shared array indexing is expensive
  • Small messages encouraged by model
  • Reasons why UPC may be faster than MPI
  • MPI encourages synchrony
  • Buffering required for many MPI calls
  • Remote read/write of a single word may require
    very little overhead
  • Cray T3E, Quadrics interconnect (next version)
  • Assuming overlapped communication, the real
    issue is overhead: how much time does it take to
    issue a remote read/write?

14
UPC vs. MPI: Sparse MatVec Multiply
  • Short term goal
  • Evaluate language and compilers using small
    applications
  • Longer term, identify a large application
  • Show advantage of T3E network model and UPC
  • Performance on the Compaq machine is worse
  • Serial code
  • Communication performance
  • New compiler just released

15
UPC versus MPI for Edge detection
[Figures: a. Execution time, b. Scalability]
  • Performance from Cray T3E
  • Benchmark developed by El Ghazawi's group at GWU

16
UPC versus MPI for Matrix Multiplication
[Figures: a. Execution time, b. Scalability]
  • Performance from Cray T3E
  • Benchmark developed by El Ghazawi's group at GWU

17
Implementing UPC
  • UPC extensions to C are small
  • < 1 person-year to implement in existing compiler
  • Simplest approach
  • Reads and writes of shared pointers become small
    message puts/gets
  • UPC has relaxed keyword for nonblocking
    communication
  • Small message performance is key
  • Advanced optimizations include conversions to
    bulk communication by either
  • Application programmer
  • Compiler
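A small UPC sketch of the last point (illustrative, not from the slides): the same remote copy written as fine-grained element reads, which the simple translation turns into many small gets, and as one bulk upc_memget. The array a and size N are assumptions added for the example.

    #include <upc_relaxed.h>
    #include <stdio.h>

    #define N 1024

    shared [N] double a[THREADS][N];   /* N contiguous elements per thread */

    int main(void)
    {
        double local[N];
        int i;

        for (i = 0; i < N; i++)
            a[MYTHREAD][i] = MYTHREAD + i;   /* each thread fills its own row */
        upc_barrier;

        if (MYTHREAD == 0 && THREADS > 1) {
            /* fine-grained: each iteration can become a separate small-message get */
            for (i = 0; i < N; i++)
                local[i] = a[1][i];

            /* bulk: programmer (or compiler) converts the loop into one large transfer */
            upc_memget(local, &a[1][0], N * sizeof(double));

            printf("copied %d elements from thread 1\n", N);
        }
        return 0;
    }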

18
Overview of NERSC Compiler
  • Compiler
  • Portable compiler infrastructure (UPC -> C)
  • Explore optimizations: communication, shared
    pointers
  • Based on Open64; plan to release sources
  • Runtime systems for multiple compilers
  • Allow use by other languages (Titanium and CAF)
  • And in other UPC compilers, e.g., Intrepid
  • Performance of small message put/get is key
  • Designed to be easily ported, then tuned
  • Also designed for low overhead (macros, inline
    functions)

19
Compiler and Runtime Status
  • Basic parsing and type-checking complete
  • Generates code for small serial kernels
  • Still testing and debugging
  • Needs runtime for complete testing
  • UPC runtime layer
  • Initial implementation should be done this month
  • Based on processes (not threads) on GASNet
  • GASNet
  • Initial specification complete
  • Reference implementation done on MPI
  • Working on Quadrics and IBM (LAPI)

20
Benchmarks for GAS Languages
  • EEL: End-to-end latency, or time spent sending a
    short message between two processes.
  • BW: Large message network bandwidth
  • Parameters of the LogP Model
  • L: Latency, or time spent on the network
  • During this time, processor can be doing other
    work
  • o: Overhead, or processor busy time on the
    sending or receiving side.
  • During this time, processor cannot be doing other
    work
  • We distinguish between send and recv overhead
  • g: Gap, or the rate at which messages can be
    pushed onto the network.
  • P: the number of processors

21
LogP Parameters: Overhead and Latency
  • Non-overlapping overhead
  • Send and recv overhead can overlap

[Figure: message timelines on P0 and P1 showing send overhead (o_send), network latency (L), and receive overhead (o_recv)]
Non-overlapping: EEL = o_send + L + o_recv
Overlapping: EEL = f(o_send, L, o_recv)
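For example, with purely hypothetical numbers o_send = 1.5 us, L = 3.0 us, and o_recv = 1.5 us, the non-overlapping case gives EEL = 1.5 + 3.0 + 1.5 = 6.0 us; when send and receive overhead can overlap the network latency, the measured EEL is some function of the three terms that is smaller than their sum.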
22
Benchmarks
  • Designed to measure the network parameters
  • Also provide gap as a function of queue depth
  • Measured for best case in general
  • Implemented once in MPI
  • For portability and comparison to target specific
    layer
  • Implemented again in target specific
    communication layer
  • LAPI
  • ELAN
  • GM
  • SHMEM
  • VIPL
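A minimal sketch of what the portable MPI version of the EEL measurement could look like (a standard ping-pong, not the actual benchmark code): EEL is approximated as half the average round-trip time for a short message.

    #include <mpi.h>
    #include <stdio.h>

    #define REPS 10000

    int main(int argc, char **argv)
    {
        int rank, size;
        char byte = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* run with at least 2 processes */

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < REPS && size >= 2; i++) {
            if (rank == 0) {
                MPI_Send(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();

        if (rank == 0 && size >= 2)   /* EEL ~ round-trip time / 2 */
            printf("EEL ~ %g us\n", (t1 - t0) / (2.0 * REPS) * 1e6);

        MPI_Finalize();
        return 0;
    }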

23
Results: EEL and Overhead
24
Results: Gap and Overhead
25
Send Overhead Over Time
  • Overhead has not improved significantly; the T3D
    was best
  • Lack of integration and lack of attention in
    software

26
Summary
  • Global address space languages offer alternative
    to MPI for large machines
  • Easier to use shared data structures
  • Recover users left behind on shared memory?
  • Performance tuning still possible
  • Implementation
  • Small compiler effort given lightweight
    communication
  • Portable communication layer: GASNet
  • Difficulty with small message performance on IBM
    SP platform