Unified Parallel C (UPC) - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Unified Parallel C (UPC)


1
Unified Parallel C (UPC)
  • Costin Iancu
  • The Berkeley UPC Group: C. Bell, D. Bonachea,
    W. Chen, J. Duell, P. Hargrove, P. Husbands,
    C. Iancu, R. Nishtala, M. Welcome, K. Yelick
  • http://upc.lbl.gov

Slides edited by K. Yelick, T. El-Ghazawi, P.
Husbands, C. Iancu
2
Context
  • Most parallel programs are written using either:
  • Message passing with a SPMD model (MPI)
  • Usually for scientific applications with
    C/Fortran
  • Scales easily
  • Shared memory with threads in OpenMP,
    threads + C/C++/Fortran, or Java
  • Usually for non-scientific applications
  • Easier to program, but less scalable performance
  • Partitioned Global Address Space (PGAS) Languages
    take the best of both
  • SPMD parallelism like MPI (performance)
  • Local/global distinction, i.e., layout matters
    (performance)
  • Global address space like threads
    (programmability)
  • 3 current languages: UPC (C), CAF (Fortran), and
    Titanium (Java)
  • 3 new languages: Chapel, Fortress, X10

3
Partitioned Global Address Space
  • Shared memory is logically partitioned by
    processors
  • Remote memory may stay remote: no automatic
    caching implied
  • One-sided communication: reads/writes of shared
    variables
  • Both individual and bulk memory copies
  • Some models have a separate private memory area
  • Distributed arrays: generality and how they are
    constructed

4
Partitioned Global Address Space Languages
  • Explicitly-parallel programming model with SPMD
    parallelism
  • Fixed at program start-up, typically 1 thread per
    processor
  • Global address space model of memory
  • Allows programmer to directly represent
    distributed data structures
  • Address space is logically partitioned
  • Local vs. remote memory (two-level hierarchy)
  • Programmer control over performance critical
    decisions
  • Data layout and communication
  • Performance transparency and tunability are goals
  • Initial implementation can use fine-grained
    shared memory

5
Current Implementations
  • A successful language/library must run everywhere
  • UPC
  • Commercial compilers: Cray, SGI, HP, IBM
  • Open source compilers: LBNL/UCB
    (source-to-source), Intrepid (gcc)
  • CAF
  • Commercial compilers: Cray
  • Open source compilers: Rice (source-to-source)
  • Titanium
  • Open source compilers: UCB (source-to-source)
  • Common tools
  • Open64: open-source research compiler
    infrastructure
  • ARMCI, GASNet for distributed memory
    implementations
  • Pthreads, System V shared memory

6
Talk Overview
  • UPC Language Design
  • Data Distribution (layout, memory management)
  • Work Distribution (data parallelism)
  • Communication (implicit, explicit, collective
    operations)
  • Synchronization (memory model, locks)
  • Programming in UPC
  • Performance (one-sided communication)
  • Application examples: FFT, PC
  • Productivity (compiler support)
  • Performance tuning and modeling

7
UPC Overview and Design
  • Unified Parallel C (UPC) is
  • An explicit parallel extension of ANSI C with
    common and familiar syntax and semantics for
    parallel C and simple extensions to ANSI C
  • A partitioned global address space language
    (PGAS)
  • Based on ideas in Split-C, AC, and PCP
  • Similar to the C language philosophy:
  • Programmers are clever and careful, and may need
    to get close to the hardware to get performance,
    but can get in trouble
  • SPMD execution model (THREADS, MYTHREAD),
    static vs. dynamic threads (a minimal SPMD
    sketch follows below)
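
A minimal SPMD sketch (assumes a UPC compiler such as upcc; this
program text is illustrative, not from the slides):

    /* every thread executes main(); THREADS and MYTHREAD are built-in */
    #include <upc.h>
    #include <stdio.h>

    int main(void) {
        printf("Hello from thread %d of %d\n", MYTHREAD, THREADS);
        upc_barrier;          /* all threads synchronize here */
        return 0;
    }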

8
Data Distribution
9
Data Distribution
  • Distinction between memory spaces through
    extensions of the type system (shared qualifier)
  • shared int ours;
  • shared int X[THREADS];
  • shared int *ptr;
  • int mine;
  • Data in the shared address space:
  • Static: scalars (on thread 0), distributed
    arrays
  • Dynamic: dynamic memory management
  • (upc_alloc, upc_global_alloc, upc_all_alloc);
    see the allocation sketch below
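
A sketch of the three allocation flavors (array sizes are illustrative,
not from the slides):

    #include <upc.h>

    shared int ours;              /* static scalar, affinity to thread 0 */
    shared int X[THREADS];        /* one element per thread (cyclic)     */

    void alloc_examples(void) {
        /* collective: every thread calls it and gets the same pointer
           to THREADS blocks of 10 ints each                             */
        shared int *a = upc_all_alloc(THREADS, 10 * sizeof(int));
        /* non-collective: one thread allocates a distributed array      */
        shared int *g = upc_global_alloc(THREADS, 10 * sizeof(int));
        /* shared memory with affinity to the calling thread only        */
        shared int *l = upc_alloc(10 * sizeof(int));
        upc_free(l);              /* free the purely local allocation    */
        (void)a; (void)g;         /* not freed here for brevity          */
    }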

10
Data Layout
  • Data layout is controlled through extensions of
    the type system (layout specifiers):
  • [0] or [] (indefinite layout, all on 1 thread)
  • shared [] int *p;
  • empty (cyclic layout)
  • shared int array[THREADS*M];
  • [*] (blocked layout)
  • shared [*] int array[THREADS*M];
  • [b] or [b1][b2]...[bn] = [b1*b2*...*bn]
    (block cyclic)
  • shared [B] int array[THREADS*M];
  • Element array[i] has affinity with thread
    (i / B) % THREADS
  • Layout determines pointer arithmetic rules
  • Introspection (upc_threadof, upc_phaseof,
    upc_blocksizeof); see the layout sketch below
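
A layout and introspection sketch (the array names, N, and the element
index are illustrative):

    #include <upc.h>
    #include <stddef.h>
    #define N 16

    shared      int a[N*THREADS];   /* no specifier: cyclic, block size 1  */
    shared [*]  int b[N*THREADS];   /* blocked: N consecutive elems/thread */
    shared [4]  int c[N*THREADS];   /* block-cyclic with block size 4      */
    shared []   int d[N];           /* indefinite: everything on thread 0  */

    void who_owns(void) {
        size_t t  = upc_threadof(&c[10]);     /* (10 / 4) % THREADS */
        size_t ph = upc_phaseof(&c[10]);      /* 10 % 4             */
        size_t bs = upc_blocksizeof(c);       /* 4                  */
        (void)t; (void)ph; (void)bs;
    }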

11
UPC Pointers Implementation
  • In UPC pointers to shared objects have three
    fields
  • thread number
  • local address of block
  • phase (specifies position in the block)
  • Example: Cray T3E implementation (a conceptual
    struct sketch follows below)
  • Pointer arithmetic can be expensive in UPC

(Figure: packed pointer formats showing the phase, thread, and virtual
address fields.)
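
A conceptual sketch of the three fields of a pointer-to-shared (the
real representation is implementation specific, e.g. packed into one
word on the Cray T3E; this struct is only illustrative):

    #include <stdint.h>

    typedef struct {
        uint32_t  thread;   /* owning thread number               */
        uint32_t  phase;    /* element position within the block  */
        uintptr_t addr;     /* local virtual address of the block */
    } shared_ptr_sketch;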
12
UPC Pointers
Where does the pointer point?

                        Local      Shared
  resides in Private    PP (p1)    PS (p2)
  resides in Shared     SP (p3)    SS (p4)
int *p1;                /* private pointer to local memory */
shared int *p2;         /* private pointer to shared space */
int *shared p3;         /* shared pointer to local memory  */
shared int *shared p4;  /* shared pointer to shared space  */
13
UPC Pointers
int *p1;                /* private pointer to local memory */
shared int *p2;         /* private pointer to shared space */
int *shared p3;         /* shared pointer to local memory  */
shared int *shared p4;  /* shared pointer to shared space  */

Pointers to shared often require more storage and
are more costly to dereference; they may refer to
local or remote memory.
14
Common Uses for UPC Pointer Types
  • int *p1;
  • These pointers are fast (just like C pointers)
  • Use to access local data in the parts of the code
    performing local work
  • Often cast a pointer-to-shared to one of these to
    get faster access to shared data that is local
  • shared int *p2;
  • Use to refer to remote data
  • Larger and slower due to test-for-local and
    possible communication
  • int *shared p3;
  • Not recommended
  • shared int *shared p4;
  • Use to build shared linked structures, e.g., a
    linked list
  • typedef is the UPC programmer's best friend
    (see the local-access sketch below)
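
A sketch of the "cast to local for fast access" idiom plus a typedef
(the array name, NB, and the typedef name are illustrative):

    #include <upc.h>
    #define NB 8

    shared [NB] int data[NB*THREADS];      /* blocked: NB elements per thread */
    typedef shared [NB] int *blocked_ptr;  /* private pointer to blocked data */

    void scale_local(int factor) {
        /* legal: data[MYTHREAD*NB] has affinity with this thread */
        int *mine = (int *)&data[MYTHREAD * NB];
        for (int i = 0; i < NB; i++)
            mine[i] *= factor;        /* plain C pointer, no runtime checks */
    }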

15
UPC Pointers Usage Rules
  • Pointer arithmetic supports blocked and
    non-blocked array distributions
  • Casting of shared to private pointers is allowed,
    but not vice versa!
  • When casting a pointer-to-shared to a
    pointer-to-local, the thread number of the
    pointer-to-shared may be lost
  • Casting of shared to local is well defined only
    if the object pointed to by the pointer-to-shared
    has affinity with the thread performing the cast
    (see the guarded cast sketch below)
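
A sketch of guarding a cast with an affinity test (the array and the
stored values are illustrative):

    #include <upc.h>

    shared int x[THREADS];

    void cast_examples(void) {
        /* well defined: x[MYTHREAD] has affinity with this thread */
        int *ok = (int *)&x[MYTHREAD];
        *ok = 42;

        /* a neighbor's element is generally remote, so test first */
        int peer = (MYTHREAD + 1) % THREADS;
        if (upc_threadof(&x[peer]) == MYTHREAD) { /* only true if THREADS == 1 */
            int *also_ok = (int *)&x[peer];
            *also_ok = 43;
        }
    }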

16
Work Distribution
17
Work Distribution upc_forall()
  • Owner-computes rule: loop over all, work on those
    owned by you
  • UPC adds a special type of loop:
  • upc_forall(init; test; step; affinity)
  •     statement;
  • Programmer indicates the iterations are
    independent
  • Undefined if there are dependencies across
    threads
  • Affinity expression indicates which iterations to
    run on each thread. It may have one of two
    types:
  • Integer: affinity % THREADS == MYTHREAD
  • Pointer: upc_threadof(affinity) == MYTHREAD
  • Syntactic sugar for
  • for (i = MYTHREAD; i < N; i += THREADS) ...
  • or equivalently
  • for (i = 0; i < N; i++)
  •     if (MYTHREAD == i % THREADS) ...
  • (a vector-add sketch follows below)
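
A sketch of owner-computes vector addition using pointer affinity
(array names and N are illustrative):

    #include <upc.h>
    #define N (100*THREADS)

    shared int v1[N], v2[N], sum[N];

    void vadd(void) {
        int i;
        /* each iteration executes on the thread that owns sum[i] */
        upc_forall(i = 0; i < N; i++; &sum[i])
            sum[i] = v1[i] + v2[i];
    }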

18
Inter-Processor Communication
19
Data Communication
  • Implicit (assignments)
  • shared int *p;
  • *p = 7;
  • Explicit (bulk synchronous) point-to-point
  • (upc_memget, upc_memput, upc_memcpy,
    upc_memset)
  • Collective operations (see the sketch below):
    http://www.gwu.edu/upc/docs/
  • Data movement: broadcast, scatter, gather, ...
  • Computational: reduce, prefix, ...
  • Interface has synchronization modes (??)
  • Avoid over-synchronizing (barrier before/after is
    the simplest semantics, but may be unnecessary)
  • Data being collected may be read/written by any
    thread simultaneously
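
A sketch of one bulk transfer and one collective (array sizes and the
choice of ALLSYNC flags are illustrative; collectives are declared in
<upc_collective.h>):

    #include <upc.h>
    #include <upc_collective.h>
    #define NB 256

    shared [NB] int src[NB*THREADS];
    shared [NB] int dst[NB*THREADS];
    int local_buf[NB];

    void move_data(void) {
        /* explicit bulk get of my own block into private memory */
        upc_memget(local_buf, &src[MYTHREAD * NB], NB * sizeof(int));

        /* broadcast thread 0's block into every thread's block of dst;
           IN/OUT_ALLSYNC gives the simplest barrier-like semantics     */
        upc_all_broadcast(dst, src, NB * sizeof(int),
                          UPC_IN_ALLSYNC | UPC_OUT_ALLSYNC);
    }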

20
Data Communication
  • The UPC Language Specification V 1.2 does not
    contain non-blocking communication primitives
  • Extensions for non-blocking communication are
    available in the BUPC implementation (sketched
    below)
  • UPC V1.2 does not have higher-level communication
    primitives for point-to-point communication.
  • See the BUPC extensions for
  • scatter, gather
  • VIS (vector/indexed/strided transfers)
  • Should non-blocking communication be a first
    class language citizen?
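
A sketch using the Berkeley UPC explicit-handle extensions (these are
non-standard BUPC extensions; the exact header name and the transfer
size are assumptions):

    #include <upc.h>
    #include <bupc_extensions.h>    /* BUPC-specific, not UPC 1.2 */
    #define NB 4096

    shared [NB] double remote[NB*THREADS];
    double local_buf[NB];

    void overlapped_get(void) {
        int peer = (MYTHREAD + 1) % THREADS;
        /* initiate the transfer ... */
        bupc_handle_t h = bupc_memget_async(local_buf, &remote[peer * NB],
                                            NB * sizeof(double));
        /* ... do computation independent of local_buf here ... */
        bupc_waitsync(h);           /* complete before using local_buf */
    }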

21
Synchronization
22
Synchronization
  • Point-to-point synchronization: locks (see the
    sketch below)
  • opaque type upc_lock_t
  • dynamically managed: upc_all_lock_alloc,
    upc_global_lock_alloc
  • Global synchronization:
  • Barriers (unaligned): upc_barrier
  • Split-phase barriers:
  • upc_notify: this thread is ready for the barrier
  • do computation unrelated to the barrier
  • upc_wait: wait for the others to be ready
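
A sketch of a lock-protected update plus a split-phase barrier (the
shared counter is illustrative):

    #include <upc.h>

    shared int counter;

    void work(void) {
        /* collective: every thread receives the same lock pointer */
        upc_lock_t *lock = upc_all_lock_alloc();
        if (MYTHREAD == 0) counter = 0;
        upc_barrier;

        upc_lock(lock);
        counter += 1;          /* critical section */
        upc_unlock(lock);

        upc_notify;            /* I have reached the barrier ...      */
        /* ... computation unrelated to the barrier can go here ...   */
        upc_wait;              /* ... now wait for everyone else      */
    }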

23
Memory Consistency in UPC
  • The consistency model defines the order in which
    one thread may see another thread's accesses to
    memory
  • If you write a program with un-synchronized
    accesses, what happens?
  • Does this work? (a complete sketch follows below)
  •     // Thread 1        // Thread 2
  •     data = ...;        while (!flag) { };
  •     flag = 1;          ... = data;  // use the data
  • UPC has two types of accesses:
  • Strict: will always appear in order
  • Relaxed: may appear out of order to other threads
  • Consistency is associated either with a program
    scope (file, statement)
  •     #pragma upc strict
  •     flag = 1;
  • or with a type
  •     shared strict int flag;
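
A producer/consumer sketch with a strict flag (the thread numbers and
the value 42 are illustrative):

    #include <upc_relaxed.h>

    strict shared int flag = 0;
    shared int data;

    void producer_consumer(void) {
        if (MYTHREAD == 0) {
            data = 42;            /* relaxed write of the payload          */
            flag = 1;             /* strict write: data write cannot pass it */
        } else if (MYTHREAD == 1) {
            while (!flag) ;       /* strict read: spin until the flag is set */
            /* data is now safe to read */
        }
    }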

24
Sample UPC Code
25
Matrix Multiplication in UPC
  • Given two integer matrices A(NxP) and B(PxM), we
    want to compute C A x B.
  • Entries c[i][j] in C are computed by the formula
    c[i][j] = Σ_{l=0..P-1} a[i][l] * b[l][j]

26
Serial C code
  #define N 4
  #define P 4
  #define M 4
  int a[N][P] = {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16}, c[N][M];
  int b[P][M] = {0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1};

  void main (void) {
    int i, j, l;
    for (i = 0; i < N; i++)
      for (j = 0; j < M; j++) {
        c[i][j] = 0;
        for (l = 0; l < P; l++)
          c[i][j] += a[i][l] * b[l][j];
      }
  }

27
Domain Decomposition
  • Exploits locality in matrix multiplication
  • A (N × P) is decomposed row-wise into blocks of
    size (N × P) / THREADS as shown below
  • B (P × M) is decomposed column-wise into
    M/THREADS blocks as shown below

(Figure: row-wise decomposition of A — thread k owns elements
k*(N*P/THREADS) .. (k+1)*(N*P/THREADS) - 1.)
  • Note N and M are assumed to be multiples of
    THREADS

(Figure: column-wise decomposition of B — thread k owns columns
k*(M/THREADS) .. (k+1)*(M/THREADS) - 1.)
28
UPC Matrix Multiplication Code
#include <upc_relaxed.h>
#define N 4
#define P 4
#define M 4
// data distribution: a and c are blocked shared matrices
shared [N*P/THREADS] int a[N][P] = {1,..,16}, c[N][M];
shared [M/THREADS] int b[P][M] = {0,1,0,1, ..., 0,1};

void main (void) {
  int i, j, l;                              // private variables
  upc_forall(i = 0; i < N; i++; &c[i][0])   // work distribution
    for (j = 0; j < M; j++) {
      c[i][j] = 0;
      for (l = 0; l < P; l++)
        // implicit communication
        c[i][j] += a[i][l] * b[l][j];
    }
}
29
UPC Matrix Multiplication With Block Copy
#include <upc_relaxed.h>
// a and c are blocked shared matrices
shared [N*P/THREADS] int a[N][P], c[N][M];
shared [M/THREADS] int b[P][M];
int b_local[P][M];

void main (void) {
  int i, j, l;                              // private variables
  // explicit bulk communication
  upc_memget(b_local, b, P*M*sizeof(int));
  // work distribution (c aligned with a??)
  upc_forall(i = 0; i < N; i++; &c[i][0])
    for (j = 0; j < M; j++) {
      c[i][j] = 0;
      for (l = 0; l < P; l++)
        c[i][j] += a[i][l] * b_local[l][j];
    }
}
30
Programming in UPC
  • Don't ask yourself "what can my compiler do for
    me"; ask yourself "what can I do for my compiler!"

31
Principles of Performance Software
  • To minimize the cost of communication:
  • Use the best available communication mechanism on
    a given machine
  • Hide communication by overlapping
    (programmer or compiler or runtime)
  • Avoid synchronization using data-driven execution
    (programmer or runtime)
  • Tune communication using performance models when
    they work (??), search when they don't
    (programmer or compiler/runtime)

32
Best Available Communication Mechanism
  • Performance is determined by overhead, latency
    and bandwidth
  • Data transfer (one-sided communication) is often
    faster than (two sided) message passing
  • Semantics limit performance
  • In-order message delivery
  • Message and tag matching
  • Need to acquire information from remote host
    processor
  • Synchronization (message receipt) tied to data
    transfer

33
One-Sided vs Two-Sided Theory
(Figure: a two-sided message carries a message id plus the data payload
and must be matched on the host CPU / network interface; a one-sided
put carries the destination memory address plus the data payload and
can be deposited directly into memory.)
  • A two-sided message needs to be matched with a
    receive to identify the memory address where the
    data should be put
  • Offloaded to Network Interface in networks like
    Quadrics
  • Need to download match tables to interface (from
    host)
  • A one-sided put/get message can be handled
    directly by a network interface with RDMA support
  • Avoid interrupting the CPU or storing data from
    CPU (preposts)

34
GASNet Portability and High-Performance
GASNet better for overhead and latency across
machines
UPC Group; GASNet design by Dan Bonachea
35
GASNet Portability and High-Performance
GASNet excels at mid-range sizes important for
overlap
GASNet is at least comparable for large messages
Joint work with the UPC Group; GASNet design by Dan
Bonachea
36
One-Sided vs. Two-Sided Practice
NERSC Jacquard machine with Opteron processors
  • InfiniBand GASNet vapi-conduit and OSU MVAPICH
    0.9.5
  • Half-power point (N1/2) differs by one order of
    magnitude
  • This is not a criticism of the implementation!

Yelick, Hargrove, Bonachea
37
Overlap
38
Hide Communication by Overlapping
  • A programming model that decouples data transfer
    and synchronization (init, sync)
  • BUPC has several extensions (programmer)
  • explicit handle based
  • region based
  • implicit handle based
  • Examples
  • 3D FFT (programmer)
  • split-phase optimizations (compiler)
  • automatic overlap (runtime)

39
Performing a 3D FFT
  • NX x NY x NZ elements spread across P processors
  • We will use a 1-dimensional layout in the Z
    dimension
  • Each processor gets NZ / P planes of NX x NY
    elements per plane

(Figure: example with P = 4 — the NX × NY × NZ cube is split into
NZ/P planes per processor p0..p3.)
Bell, Nishtala, Bonachea, Yelick
40
Performing a 3D FFT (part 2)
  • Perform an FFT in all three dimensions
  • With 1D layout, 2 out of the 3 dimensions are
    local while the last Z dimension is distributed

Step 1: FFTs on the columns (all elements local)
Step 2: FFTs on the rows (all elements local)
Step 3: FFTs in the Z dimension (requires
communication)
Bell, Nishtala, Bonachea, Yelick
41
Performing the 3D FFT (part 3)
  • Can perform Steps 1 and 2 since all the data is
    available without communication
  • Perform a Global Transpose of the cube
  • Allows step 3 to continue

Transpose
Bell, Nishtala, Bonachea, Yelick
42
Communication Strategies for 3D FFT
chunk = all rows with the same destination
  • Three approaches
  • Chunk
  • Wait for 2nd dim FFTs to finish
  • Minimize messages
  • Slab
  • Wait for chunk of rows destined for 1 proc to
    finish
  • Overlap with computation
  • Pencil
  • Send each row as it completes
  • Maximize overlap and
  • Match natural layout

pencil = 1 row
slab = all rows in a single plane with the same
destination
Bell, Nishtala, Bonachea, Yelick
43
NAS FT Variants Performance Summary
(Chart: MFlops per thread for Chunk (NAS FT with FFTW), Best MPI
(always slabs), and Best UPC (always pencils); peak aggregate
performance around 0.5 Tflops.)
  • Slab is always best for MPI small message cost
    too high
  • Pencil is always best for UPC more overlap

(Platforms/procs: Myrinet 64, InfiniBand 256, Elan3 256, Elan3 512,
Elan4 256, Elan4 512.)
44
Bisection Bandwidth Limits
  • Full bisection bandwidth is (too) expensive
  • During an all-to-all communication phase
  • Effective (per-thread) bandwidth is fractional
    share
  • Significantly lower than link bandwidth
  • Use smaller messages mixed with computation to
    avoid swamping the network

Bell, Nishtala, Bonachea, Yelick
45
Compiler Optimizations
  • Naïve scheme (blocking call for each load/store)
    not good enough
  • PRE (partial redundancy elimination) on shared
    expressions
  • Reduce the amount of unnecessary communication
  • Apply also to UPC shared pointer arithmetic
  • Split-phase communication
  • Hide communication latency through overlapping
  • Message coalescing
  • Reduce number of messages to save startup
    overhead and achieve better bandwidth

Chen, Iancu, Yelick
46
Benchmarks
  • Gups
  • Random access (read/modify/write) to distributed
    array
  • Mcop
  • Parallel dynamic programming algorithm
  • Sobel
  • Image filter
  • Psearch
  • Dynamic load balancing/work stealing
  • Barnes Hut
  • Shared memory style code from SPLASH2
  • NAS FT/IS
  • Bulk communication

47
Performance Improvements
(Chart: improvement over the unoptimized version.)
Chen, Iancu, Yelick
48
Data Driven Execution
49
Data-Driven Execution
  • Many algorithms require synchronization with
    remote processor
  • Mechanisms (BUPC extensions)
  • Signaling store: raise a semaphore upon transfer
  • Remote enqueue: put a task in a remote queue
  • Remote execution: floating functions (X10
    activities)
  • Many algorithms have irregular data dependencies
    (LU)
  • Mechanisms (BUPC extensions)
  • Cooperative multithreading

50
Matrix Factorization
(Figure: block view of LU factorization — completed parts of L and U,
the panel being factored, and the trailing matrix to be updated; panel
factorizations involve communication for pivoting.)
Husbands,Yelick
51
Three Strategies for LU Factorization
  • Organize in bulk-synchronous phases (ScaLAPACK)
  • Factor a block column, then perform updates
  • Relatively easy to understand/debug, but extra
    synchronization
  • Overlapping phases (HPL)
  • Work associated with one block-column
    factorization can be overlapped
  • Parameter to determine how many (need temp space
    accordingly)
  • Event-driven multithreaded (UPC Linpack)
  • Each thread runs an event handler loop
  • Tasks: factorization (w/ pivoting), update
    trailing, update upper
  • Tasks may suspend (voluntarily) to wait for data,
    synchronization, etc.
  • Data moved with remote gets (synchronization
    built in)
  • Must gang together for factorizations
  • Scheduling priorities are key to performance and
    deadlock avoidance

Husbands,Yelick
52
UPC-HP Linpack Performance
  • Comparable to HPL (numbers from HPCC database)
  • Faster than ScaLAPACK due to less synchronization
  • Good scaling of the UPC code on Itanium/Quadrics
    (Thunder)
  • 2.2 TFlops on 512 procs and 4.4 TFlops on 1024
    procs

Husbands, Yelick
53
Performance Tuning
Iancu, Strohmaier
54
Efficient Use of One-Sided
  • Implementations need to be efficient and have
    scalable performance
  • Application-level use of non-blocking
    communication benefits from new design techniques:
    finer-grained decompositions and overlap
  • Overlap exercises the system in unexpected
    ways
  • Prototyping of implementations for large-scale
    systems is a hard problem: non-linear behavior of
    networks, communication scheduling is NP-hard
  • Need a methodology for fast prototyping
  • understand the network/CPU interaction at large
    scale

55
Performance Tuning
  • Performance is determined by overhead, latency
    and bandwidth, computational characteristics, and
    communication topology
  • It's all relative: performance characteristics
    are determined by system load
  • Basic principles:
  • Minimize communication overhead
  • Avoid congestion
  • control injection rate (end-point)
  • avoid hotspots (end-point, network routes)
  • Have to use models.
  • What kind of questions can a model answer?

56
Example Vector-Add
  • Blocking bulk version:
      shared double *rdata;
      double *ldata, *buf;
      upc_memget(buf, rdata, N * sizeof(double));
      for (i = 0; i < N; i++)
        ldata[i] = buf[i];
  • Strip-mined, non-blocking version:
      for (i = 0; i < N/B; i++)
        h[i] = upc_memget_nb(&buf[i*B], &rdata[i*B], B * sizeof(double));
      for (i = 0; i < N/B; i++) {
        sync(h[i]);
        for (j = 0; j < B; j++)
          ldata[i*B + j] = buf[i*B + j];
      }

(Figure: pipelined schedule — GET_nb(B0), GET_nb(Bb), ...; then for
each block k: sync(Bk), compute(Bk), overlapped with issuing later
GET_nb calls.)
Which implementation is faster? What are B and b?
57
Prototyping
  • Usual approach: use a time-accurate performance
    model (applications, automatically tuned
    collectives)
  • Models (LogP, ...) don't capture important
    behavior (parallelism, congestion, resource
    constraints, non-linear behavior)
  • Exhaustive search of the optimization space
  • Validated only at low concurrency (tens of
    procs); might break at high concurrency, might
    break for torus networks
  • Our approach:
  • Use performance model for ideal implementation
  • Understand hardware resource constraints and the
    variation of performance parameters (understand
    trends not absolute values)
  • Derive implementation constraints to satisfy both
    optimal implementation and hardware constraints
  • Force implementation parameters to converge
    towards optimal

58
Performance
  • Network bandwidth and overhead
  • Application communication pattern and schedule
    (congestion), computation

Overhead is determined by message size,
communication schedule, and hardware flow control.
Bandwidth is determined by message size,
communication schedule, and fairness of allocation.
Iancu, Strohmaier
59
Validation
  • Understand network behavior in the presence of
    non-blocking communication (Infiniband, Elan)
  • Develop performance model for scenarios widely
    encountered in applications (p2p, scatter,
    gather, all-to-all) and a variety of aggressive
    optimization techniques (strip mining,
    pipelining, communication schedule skewing)
  • Use both micro-benchmarks and application kernels
    to validate approach

Iancu, Strohmaier
60
Findings
  • Can choose optimal values for implementation
    parameters
  • A time-accurate model for an implementation is
    hard to develop and inaccurate at high concurrency
  • Methodology does not require exhaustive search of
    the optimization space (only p2p and qualitative
    behavior of gather)
  • In practice one can produce templatized
    implementations of an algorithm and use our
    approach to determine optimal values: code
    generation (UPC), automatic tuning of collective
    operations, application development
  • Need to further understand the mathematical and
    statistical properties

61
End
62
Avoid synchronization: Data-driven Execution
  • Many algorithms require synchronization with
    remote processor
  • What is the right mechanism in a PGAS model for
    doing this?
  • Is it still one-sided?
  • Part 3: Event-Driven Execution Models

63
Mechanisms for Event-Driven Execution
  • Put operation does a send-side notification
  • Needed for memory consistency model ordering
  • Need to signal the remote side on completion
  • Strategies:
  • Have the remote side do a get (works in some
    algorithms)
  • Put + strict flag write: do a put, wait for
    completion, then do another (strict) put
    (see the sketch below)
  • Pipelined put: put + flag works only on ordered
    networks
  • Signaling put: add a new store operation that
    embeds the signal (2nd remote address) into a
    single message
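
A sketch of the "put + strict flag write" strategy using only standard
UPC (buffer size, peer choice, and the flag array are illustrative):

    #include <upc_relaxed.h>
    #define NB 1024

    shared [NB] double buf[NB*THREADS];
    strict shared int ready[THREADS];

    void send_to_right(double *local_data) {
        int peer = (MYTHREAD + 1) % THREADS;
        /* 1. move the payload */
        upc_memput(&buf[peer * NB], local_data, NB * sizeof(double));
        /* 2. make sure the put has completed */
        upc_fence;
        /* 3. the strict write signals the peer that its data arrived */
        ready[peer] = 1;
    }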

64
Mechanisms for Event-Driven Execution
Preliminary results on Opteron/InfiniBand