Transcript and Presenter's Notes

Title: CS 267 Applications of Parallel Computers Lecture 9: Split-C


1
CS 267 Applications of Parallel Computers
Lecture 9: Split-C
  • James Demmel
  • http://www.cs.berkeley.edu/~demmel/cs267_Spr99

2
Comparison of Programming Models
  • Data Parallel (HPF)
  • Good for regular applications; compiler controls performance
  • Message Passing SPMD (MPI)
  • Standard and portable
  • Needs low-level programmer control; no global data structures
  • Shared Memory with Dynamic Threads
  • Shared data is easy, but locality cannot be
    ignored
  • Virtual processor model adds overhead
  • Shared Address Space SPMD
  • Single thread per processor
  • Address space is partitioned, but shared
  • Encourages shared data structures matched to the
    architecture
  • Titanium - targets (adaptive) grid computations
  • Split-C - simple parallel extension to C
  • F77 + Heroic Compiler
  • Depends on compiler to discover parallelism
  • Hard to do except for fine grain parallelism,
    usually in loops

3
Overview
  • Split-C
  • Systems programming language based on C
  • Creating Parallelism: SPMD
  • Communication: Global pointers and spread arrays
  • Memory consistency model
  • Synchronization
  • Optimization opportunities

4
Split-C Systems Programming
  • Widely used parallel extension to C
  • Supported on most large-scale parallel machines
  • Tunable performance
  • Consistent with C

5
Split-C Overview
[Figure: the global address space spans processors P0-P3; each processor sees globally-addressable local memory and globally-addressable remote memory above its private local address space, e.g. a global pointer g_P on one processor referring to an int x on another.]
  • Adds two new levels to the memory hierarchy
  • - Local in the global address space
  • - Remote in the global address space
  • Model is a collection of processors + a global address space
  • SPMD Model
  • Same program on each node

6
SPMD Control Model
  • PROCS threads of control
  • independent
  • explicit synchronization
  • Synchronization
  • global barrier
  • locks

[Figure: PROCS processing elements (PEs) synchronizing at a barrier().]
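To make the SPMD control model concrete, here is a minimal sketch using only constructs that appear in this deck (MYPROC, PROCS, barrier(), and the on_one construct from the Memory Model slide); the function name all_hello and the printed messages are illustrative assumptions, not part of the original slides.

  #include <stdio.h>

  /* Every processor executes the same code (SPMD).  MYPROC and PROCS
     identify the independent threads of control; barrier() is the
     global synchronization point. */
  void all_hello(void)
  {
    printf("hello from processor %d of %d\n", MYPROC, PROCS);
    barrier();   /* no processor proceeds until all have arrived */
    on_one { printf("all %d processors passed the barrier\n", PROCS); }
  }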
7
C Pointers
  • &x read as "pointer to x" (the address of x)
  • Types read right to left
  • int * read as "pointer to int"
  • *P read as "value at P"
  • /* assign the value of 6 to x */
  • int x;
  • int *P = &x;
  • *P = 6;

8
Global Pointers
A global pointer may refer to an object anywhere in the machine. Each
object (C structure) lives on one processor. Global pointers can be
dereferenced, incremented, and indexed just like local pointers.
  • int *global gp1;          /* global ptr to an int */
  • typedef int *global g_ptr;
  • g_ptr gp2;                 /* same */
  • typedef double foo;
  • foo *global *global gp3;   /* global ptr to a global ptr to a foo */
  • int *global *gp4;          /* local ptr to a global ptr to an int */
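As a small, hypothetical illustration of the point that global pointers can be dereferenced, incremented, and indexed like local pointers (the function name and the remote block it assumes are not from the slides), every access through gp below may be a remote operation when the target lives on another processor:

  /* Sum b doubles reachable through a global pointer, then write the
     result back through the same pointer. */
  double sum_through_gp(double *global gp, int b)
  {
    int i;
    double s = 0.0;
    for (i = 0; i < b; i++)
      s += gp[i];     /* indexing a global pointer */
    *gp = s;          /* dereferencing a global pointer */
    return s;
  }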

9
Memory Model
[Figure: processor 0 writes through a global pointer g_P to a variable x that lives on processor 2.]
  • on_one {
  •   double *global g_P = toglobal(2, &x);
  •   *g_P = 6;
  • }

10
C Arrays
  • Set 4 values to 0, 2, 4, 6
  • Origin is 0
  • for (i = 0; i < 4; i++)
  •   A[i] = i*2;
  • Pointers and Arrays:
  • A[i] == *(A + i)

11
Spread Arrays
  • Spread Arrays are spread over the entire machine
  • The spreader "::" determines which dimensions are spread
  • dimensions to the right define the objects on individual processors
  • dimensions to the left are linearized and spread in a cyclic map
  • Example
  • double A[n][r]::[b][b];

Per-processor blocks: the [b][b] dimensions to the right of the spreader.
Spread (high) dimensions: [n][r], to the left of the spreader.
A[i][j] => A + (i*r + j), in units of sizeof(double)*b*b.
The traditional C duality between arrays and
pointers is preserved through spread pointers.
12
Spread Pointers
Global pointers, but with index arithmetic across processors (cyclic): a
1-dimensional address space, i.e. wrap and increment. The processor
component varies fastest.
No communication
  • double A[PROCS]::;
  • for_my_1d (i, PROCS) A[i] = i*2;
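The sketch below (the function name is an assumption) makes the "processor component varies fastest" rule explicit: processor 0 walks the array A from the example above through a spread pointer, and each p++ moves to the next processor; the pointer arithmetic itself involves no communication, only the *p accesses may be remote.

  void all_walk(void)
  {
    on_one {
      double *spread p = A;   /* the spread array above decays to a spread pointer */
      int i;
      for (i = 0; i < PROCS; i++, p++)
        *p = (double) i;      /* element i lives on processor i */
    }
    barrier();
  }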

13
Blocked Matrix Multiply
  • void all_mat_mult_blk(int n, int r, int m, int b,
  •                       double C[n][m]::[b][b],
  •                       double A[n][r]::[b][b],
  •                       double B[r][m]::[b][b])
  • {
  •   int i, j, k, l;
  •   double la[b][b], lb[b][b];
  •   for_my_2D(i, j, l, n, m) {
  •     double (*lc)[b] = tolocal(C[i][j]);
  •     for (k = 0; k < r; k++) {
  •       bulk_read(la, A[i][k], b*b*sizeof(double));
  •       bulk_read(lb, B[k][j], b*b*sizeof(double));
  •       matrix_mult(b, b, b, lc, la, lb);
  •     }
  •   }
  •   barrier();
  • }

Configuration independent use of spread arrays
Local copies of subblocks
Highly optimized local routine
Blocking improves performance because the number
of remote accesses is reduced.
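A hypothetical call site for the routine above, showing the configuration-independent declarations it expects (the sizes nb and blk are assumptions): square matrices of nb-by-nb spread blocks, each a blk-by-blk block of doubles on one processor.

  #define nb  8     /* blocks per matrix dimension (assumed) */
  #define blk 32    /* doubles per block dimension (assumed) */

  double C[nb][nb]::[blk][blk];
  double A[nb][nb]::[blk][blk];
  double B[nb][nb]::[blk][blk];

  void all_test_mult(void)
  {
    all_mat_mult_blk(nb, nb, nb, blk, C, A, B);
  }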
14
An Irregular Problem EM3D
Maxwell's Equations on an Unstructured 3D Mesh
Irregular Bipartite Graph of varying
degree (about 20) with weighted edges
[Figure: irregular bipartite graph of E and H nodes; an E node's update reads neighboring values v1, v2 through weighted edges w1, w2, and similarly for H nodes (fields E, H, B, D).]
Basic operation is to subtract the weighted sum of neighboring values:
for all E nodes; for all H nodes.
15
EM3D Uniprocessor Version
  • typedef struct node_t {
  •   double value;
  •   int edge_count;
  •   double *coeffs;
  •   double *(*values);
  •   struct node_t *next;
  • } node_t;
  • void all_compute_E()
  • {
  •   node_t *n;
  •   int i;
  •   for (n = e_nodes; n; n = n->next)
  •     for (i = 0; i < n->edge_count; i++)
  •       n->value = n->value - *(n->values[i]) * (n->coeffs[i]);
  • }

[Figure: an E node with its value, a coeffs array of edge weights, and a values array of pointers to neighboring H node values.]
How would you optimize this for a uniprocessor? Minimize cache misses by
organizing the list so that neighboring nodes are visited in order.
16
EM3D Simple Parallel Version
Each processor has list of local nodes
  • typedef struct node_t {
  •   double value;
  •   int edge_count;
  •   double *coeffs;
  •   double *global (*values);   /* neighbor values may be remote */
  •   struct node_t *next;
  • } node_t;
  • void all_compute_e()
  • {
  •   node_t *n;
  •   int i;
  •   for (n = e_nodes; n; n = n->next)
  •     for (i = 0; i < n->edge_count; i++)
  •       n->value = n->value - *(n->values[i]) * (n->coeffs[i]);
  •   barrier();
  • }

[Figure: the mesh partitioned across processors; edges that cross the partition become remote value pointers.]
How do you optimize this? Minimize remote edges; balance load across
processors: C(p) = a*Nodes + b*Edges + c*Remotes.
17
EM3D Eliminate Redundant Remote Accesses
  • void all_compute_e()
  • {
  •   ghost_node_t *g;
  •   node_t *n;
  •   int i;
  •   for (g = h_ghost_nodes; g; g = g->next)
  •     g->value = *(g->rval);   /* copy each remote value into its local ghost node once */
  •   for (n = e_nodes; n; n = n->next)
  •     for (i = 0; i < n->edge_count; i++)
  •       n->value = n->value - *(n->values[i]) * (n->coeffs[i]);
  •   barrier();
  • }

18
EM3D Overlap Global Reads GET
  • void all_compute_e()
  • {
  •   ghost_node_t *g;
  •   node_t *n;
  •   int i;
  •   for (g = h_ghost_nodes; g; g = g->next)
  •     g->value := *(g->rval);   /* split-phase get */
  •   sync();                     /* wait for all outstanding gets */
  •   for (n = e_nodes; n; n = n->next)
  •     for (i = 0; i < n->edge_count; i++)
  •       n->value = n->value - *(n->values[i]) * (n->coeffs[i]);
  •   barrier();
  • }
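A minimal standalone sketch of the split-phase get used above (the function and variable names are assumptions): := issues the read without waiting, independent work can overlap the communication, and sync() blocks until all split-phase operations issued by this processor have completed.

  double get_one(double *global gp)
  {
    double local_copy;
    local_copy := *gp;   /* issue the remote read; returns immediately */
    /* ... independent local computation can overlap the communication ... */
    sync();              /* local_copy now holds the remote value */
    return local_copy;
  }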

19
Split-C Systems Programming
  • Tuning affects application performance

[Figure: EM3D performance of successive versions, in microseconds per edge.]
20
Global Operations and Shared Memory
  • int all_bcast(int val)
  • {
  •   int left  = 2*MYPROC + 1;
  •   int right = 2*MYPROC + 2;
  •   if (MYPROC > 0) {
  •     while (spread_lock[MYPROC] == 0) ;   /* spin until the parent has stored */
  •     spread_lock[MYPROC] = 0;
  •     val = spread_buf[MYPROC];
  •   }
  •   if (left < PROCS) {
  •     spread_buf[left] = val;
  •     spread_lock[left] = 1;
  •   }
  •   if (right < PROCS) {
  •     spread_buf[right] = val;
  •     spread_lock[right] = 1;
  •   }
  •   return val;
  • }

Requires sequential consistency
21
Global Operations and Signaling Store
  • int all_bcast(int val)
  • {
  •   int left  = 2*MYPROC + 1;
  •   int right = 2*MYPROC + 2;
  •   if (MYPROC > 0) {
  •     store_sync(4);            /* wait until 4 bytes (the int) have been stored here */
  •     val = spread_buf[MYPROC];
  •   }
  •   if (left < PROCS)
  •     spread_buf[left] :- val;  /* signaling store */
  •   if (right < PROCS)
  •     spread_buf[right] :- val;
  •   return val;
  • }

22
Signaling Store and Global Communication
  • void all_block_to_cyclic(int m,
  •                          double B[PROCS*m]::,
  •                          double A[PROCS]::[m])
  • {
  •   int i;
  •   double *a = tolocal(A[MYPROC]);  /* local pointer to my block */
  •   for (i = 0; i < m; i++)
  •     B[m*MYPROC + i] :- a[i];       /* signaling store into the cyclic array */
  •   all_store_sync();                /* all processors wait for all stores */
  • }

[Figure: each PE's contiguous block is scattered cyclically across all PEs using signaling stores.]
23
Split-C Summary
  • Performance tuning capabilities of message
    passing
  • Support for shared data structures
  • Installed on NOW and available on most platforms
  • http://www.cs.berkeley.edu/projects/split-c
  • Consistent with C design
  • arrays are simply blocks of memory
  • no linguistic support for data abstraction
  • interfaces difficult for complex data structures
  • explicit memory management