Eliminating affinity tests and simplifying shared accesses in UPC

1
Eliminating affinity tests and simplifying shared
accesses in UPC

Rahul Garg, Kit Barton, Calin Cascaval
Gheorghe Almasi, Jose Nelson Amaral
University of Alberta
IBM Research

2
(No Transcript)
3
Shared arrays

Arrays can be shared between all threads
E.g. shared [2] double A[9];
Assuming THREADS = 3
1-D block-cyclic distribution, similar to HPF cyclic(k)

(Figure: array elements 0 through 8 of A and their distribution across the threads)
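The block-cyclic layout described on this slide can be sketched in plain C. The helper name block_cyclic_owner is hypothetical and not part of UPC; it only models the rule that blocks of BF consecutive elements are dealt to threads round-robin.

```c
#include <assert.h>
#include <stddef.h>

/* Owner thread of element i of an array declared `shared [bf] T A[n]`
 * when the program runs with `threads` threads.
 * block_cyclic_owner is a hypothetical helper, not a UPC library call. */
static size_t block_cyclic_owner(size_t i, size_t bf, size_t threads)
{
    assert(bf > 0 && threads > 0);
    /* i / bf is the block number; blocks are dealt to threads cyclically. */
    return (i / bf) % threads;
}
```

For shared [2] double A[9] with THREADS = 3, this places elements 0 and 1 on thread 0, elements 2 and 3 on thread 1, elements 4 and 5 on thread 2, then wraps around so that elements 6 and 7 are back on thread 0 and element 8 is on thread 1.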
4
Vector addition example

#include <upc.h>
#include <stdio.h>
shared [2] double A[10];
shared [3] double B[10], C[10];
int main() {
  int i;
  upc_forall(i = 0; i < 10; i++; &C[i])
    C[i] = A[i] + B[i];
}
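To see why this small example already stresses the compiler, the following plain C sketch counts how many iterations find an operand on the same thread as C[i]. The function names are hypothetical, and THREADS = 3 is an assumption carried over from the previous slide rather than something this slide states.

```c
#include <assert.h>
#include <stddef.h>

/* Owner of element i under a 1-D block-cyclic layout (hypothetical helper). */
static size_t owner(size_t i, size_t bf, size_t threads)
{
    assert(bf > 0 && threads > 0);
    return (i / bf) % threads;
}

/* Number of iterations i in [0, n) for which an operand with blocking
 * factor bf_operand lives on the thread chosen by the affinity
 * expression, whose array has blocking factor bf_affinity. */
static size_t count_local_operand(size_t n, size_t bf_affinity,
                                  size_t bf_operand, size_t threads)
{
    size_t local = 0;
    for (size_t i = 0; i < n; i++)
        if (owner(i, bf_affinity, threads) == owner(i, bf_operand, threads))
            local++;
    return local;
}
```

With the declarations above and three threads, B[i] is local in all 10 iterations because B and C share a blocking factor, while A[i] is local in only 3 of them (i = 0, 1 and 3).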

5
Outline of talk

upc_forall loops: syntax and uses
Compiling upc_forall loops
Data distributions in UPC
Multiblocking distributions
Privatization of access
Results

6
upc_forall and affinity tests

upc_forall is a work distribution construct
Form:
  shared [BF] double A[M];
  upc_forall(i = 0; i < N; i++; &A[i])
    // loop body
The affinity test expression (here &A[i]) determines which thread executes which iteration.
7
Affinity test elimination: naive
  shared [BF] double A[M];
  upc_forall(i = 0; i < M; i++; &A[i])
    // loop body
becomes
  shared [BF] double A[M];
  for (i = 0; i < M; i++)
    if (upc_threadof(&A[i]) == MYTHREAD)
      // loop body
8
Affinity test elimination: optimized
  shared [BF] double A[M];
  upc_forall(i = 0; i < M; i++; &A[i])
    // loop body
becomes
  shared [BF] double A[M];
  for (i = MYTHREAD*BF; i < M; i += BF*THREADS)
    for (j = i; j < i+BF; j++)
      // loop body
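The equivalence of the naive and optimized forms can be checked with a plain C model in which MYTHREAD becomes an ordinary parameter. This is only a sketch: the function names are hypothetical, the owner computation stands in for upc_threadof, and the extra j < m bound in the inner loop is an added guard (not on the slide) for the case where M is not a multiple of BF.

```c
#include <assert.h>
#include <stddef.h>

/* Naive form: scan every iteration and keep those owned by thread `me`.
 * (i / bf) % threads stands in for upc_threadof(&A[i]). */
static size_t naive_iters(size_t m, size_t bf, size_t threads,
                          size_t me, size_t *out)
{
    size_t n = 0;
    for (size_t i = 0; i < m; i++)
        if ((i / bf) % threads == me)
            out[n++] = i;
    return n;
}

/* Optimized form: jump from one owned block to the next.
 * The `j < m` bound is an added guard for a partial last block. */
static size_t strided_iters(size_t m, size_t bf, size_t threads,
                            size_t me, size_t *out)
{
    size_t n = 0;
    for (size_t i = me * bf; i < m; i += bf * threads)
        for (size_t j = i; j < i + bf && j < m; j++)
            out[n++] = j;
    return n;
}

/* Returns 1 if both forms visit the same iterations, in the same order,
 * for every thread. */
static int forms_agree(size_t m, size_t bf, size_t threads)
{
    size_t a[64], b[64];
    assert(m <= 64 && bf > 0 && threads > 0);
    for (size_t me = 0; me < threads; me++) {
        size_t na = naive_iters(m, bf, threads, me, a);
        size_t nb = strided_iters(m, bf, threads, me, b);
        if (na != nb)
            return 0;
        for (size_t k = 0; k < na; k++)
            if (a[k] != b[k])
                return 0;
    }
    return 1;
}
```

The naive form performs M ownership tests on every thread, whereas the strided form performs none and touches only the iterations the thread owns. For M = 10, BF = 2 and three threads, thread 0 visits iterations 0, 1, 6 and 7 in both forms.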
9
Integer affinity tests
  upc_forall(i = 0; i < M; i++; i)
    // loop body
becomes
  for (i = MYTHREAD; i < M; i += THREADS)
    // loop body
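With an integer affinity expression, iteration i runs on thread i % THREADS. The following plain C sketch (hypothetical function name, MYTHREAD modelled as a loop variable) checks that the strided loop gives every iteration to exactly that thread.

```c
#include <assert.h>
#include <stddef.h>

/* Runs the strided loop once per simulated thread and records how often
 * each iteration executes. Returns 1 if every iteration i in [0, m) runs
 * exactly once, on thread i % threads. */
static int strided_matches_integer_affinity(size_t m, size_t threads)
{
    int hits[64] = {0};
    assert(m <= 64 && threads > 0);
    for (size_t me = 0; me < threads; me++)
        for (size_t i = me; i < m; i += threads) {
            if (i % threads != me)
                return 0;
            hits[i]++;
        }
    for (size_t i = 0; i < m; i++)
        if (hits[i] != 1)
            return 0;
    return 1;
}
```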
10
Data distributions for shared arrays

The official UPC spec supports only 1-D block-cyclic distributions
The IBM xlupc compiler supports a more general data distribution, 'multi-dimensional blocking'
E.g. shared [2][3] double A[5][5];
Divide the array into multidimensional tiles
Distribute the tiles among processors in a cyclic fashion
More general than the UPC spec, but not as general as ScaLAPACK or HPF
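A minimal sketch of how a tiled layout might map elements to threads, in plain C. The slide only says that tiles are distributed cyclically; numbering the tiles in row-major order is an assumption made here for illustration, and tile_owner is a hypothetical helper, not an xlupc function.

```c
#include <assert.h>
#include <stddef.h>

/* Owner of element (i, j) of an array declared
 * `shared [bf0][bf1] T A[rows][cols]`.
 * ASSUMPTION: tiles are numbered row-major and dealt to threads
 * round-robin; the slides do not specify the tile order. */
static size_t tile_owner(size_t i, size_t j, size_t bf0, size_t bf1,
                         size_t cols, size_t threads)
{
    assert(bf0 > 0 && bf1 > 0 && threads > 0);
    size_t tiles_per_row = (cols + bf1 - 1) / bf1;  /* partial tiles count */
    size_t tile = (i / bf0) * tiles_per_row + (j / bf1);
    return tile % threads;
}
```

Under these assumptions, and with three threads (also an assumption), shared [2][3] double A[5][5] has two tiles per tile row, so A[0][0] falls on thread 0, A[0][3] on thread 1, A[2][0] on thread 2 and A[2][3] on thread 0 again.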

11
(No Transcript)
12
Locality analysis and privatization

Consider
  shared [2][3] A[5][6], B[5][6];
  for (i = 0; i < 4; i++)
    upc_forall(j = 0; j < 4; j++; &A[i][j])
      A[i][j] = B[i+1][j];
What code should we generate for the references A[i][j] and B[i+1][j]?

13
Shared access code generation
  for (i = 0; i < 4; i++)
    upc_forall(j = 0; j < 4; j++; &A[i][j])
      A[i][j] = B[i+1][j];
becomes
  for (i = 0; i < 4; i++)
    upc_forall(j = 0; j < 4; j++; &A[i][j]) {
      val = shared_deref(B, i+1, j);
      shared_assign(A, i, j, val);
    }
14
Shared access code generation
  for (i = 0; i < 4; i++)
    upc_forall(j = 0; j < 4; j++; &A[i][j])
      A[i][j] = B[i+1][j];

Do we really need the function calls?
A[i][j] should only be a memory load/store?
What about B[i+1][j] on SMP? This should be just a load? On hybrids?

15
(No Transcript)
16
Locality analysis: intuition
  for (i = 0; i < 4; i++)
    upc_forall(j = 0; j < 4; j++; &A[i][j])
      A[i][j] = B[i+1][j];

The locality can only change if the index (i+1) crosses a block boundary in a direction
Block boundaries: 0, BF, 2*BF, ...
(i+1) % BF == 0 gives a block boundary
So we only need to check whether (i+1) % BF == 0 to find the places where locality can change!

17
Locality Analysis
  for (i = 0; i < 4; i++)
    upc_forall(j = 0; j < 4; j++; &A[i][j])
      A[i][j] = B[i+1][j];

Define the offset vector [k1, k2]: here k1 = 1, k2 = 0
k1 and k2 are integer constants
A block boundary is crossed at (i + k1) % BF == 0
Cases: i % BF < (BF - k1 % BF) and i % BF >= (BF - k1 % BF)
We refer to i % BF < (BF - k1) as a 'cut'
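The two cases can be checked numerically. In the plain C sketch below (hypothetical names, non-negative offsets only), block_shift is how many blocks separate index i from index i + k. The claim tested is that this shift is k / BF on the low side of the cut and k / BF + 1 on the high side, so it is constant within each case.

```c
#include <assert.h>
#include <stddef.h>

/* Position of the cut inside a block for a constant offset k. */
static size_t cut_position(size_t k, size_t bf)
{
    assert(bf > 0);
    return bf - k % bf;
}

/* Number of blocks between the block holding i and the one holding i + k. */
static size_t block_shift(size_t i, size_t k, size_t bf)
{
    assert(bf > 0);
    return (i + k) / bf - i / bf;
}

/* Returns 1 if, for all i in [0, n), the cut correctly predicts the
 * block shift: k / bf before the cut, k / bf + 1 at or after it. */
static int cut_predicts_shift(size_t n, size_t k, size_t bf)
{
    size_t cut = cut_position(k, bf);
    for (size_t i = 0; i < n; i++) {
        size_t expected = k / bf + (i % bf >= cut ? 1 : 0);
        if (block_shift(i, k, bf) != expected)
            return 0;
    }
    return 1;
}
```

For the running example (BF = 2, k1 = 1) the cut sits at 1: even values of i keep B[i+1][j] in the same block row as A[i][j], and odd values move it to the next one.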

18
Shared access code generation
  for (i = 0; i < 4; i++)
    if ((i % 2) < 1)
      upc_forall(j = 0; j < 4; j++; &A[i][j]) {
        val = memory_load(B, i+1, j);
        memory_store(A, i, j, val);
      }
    else
      upc_forall(j = 0; j < 4; j++; &A[i][j]) {
        val = shared_deref(B, i+1, j);
        memory_store(A, i, j, val);
      }
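As a rough model of what the versioned loop buys, the plain C sketch below tallies how many reads of B[i+1][j] take each branch, summed over all threads. The struct and function names are hypothetical, and the sketch assumes A and B share the same blocking and that the offset is smaller than the blocking factor, so that the low side of the cut stays inside the tile the executing thread already owns.

```c
#include <assert.h>
#include <stddef.h>

/* Tally of how the reads of B[i+k][j] would be compiled. */
struct access_counts {
    size_t direct_loads;   /* plain memory loads (memory_load branch) */
    size_t runtime_calls;  /* runtime calls (shared_deref branch)     */
};

/* Walks the rows x cols iteration space of the example loop nest and
 * classifies each read by the cut on the row index.
 * ASSUMPTIONS: A and B have identical blocking, and 0 <= k < bf. */
static struct access_counts split_loop_counts(size_t rows, size_t cols,
                                              size_t bf, size_t k)
{
    struct access_counts c = {0, 0};
    assert(bf > 0 && k < bf);
    size_t cut = bf - k % bf;
    for (size_t i = 0; i < rows; i++) {
        if (i % bf < cut) {
            /* Rows i and i + k lie in the same block row. */
            assert((i + k) / bf == i / bf);
            c.direct_loads += cols;
        } else {
            c.runtime_calls += cols;
        }
    }
    return c;
}
```

For the 4 x 4 iteration space on the slide with BF = 2 and k1 = 1, half of the 16 reads (rows 0 and 2) become direct loads and the other half (rows 1 and 3) still go through the runtime.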

19
Locality analysis algorithm

For each shared reference in loop
Check if blocking factor matches
Check if distance vector is constant
If reference is eligible
Generate cut expressions
Put cut in a sorted cut list
Replicate loop body as necessary
Insert memory load/store if the reference is local;
otherwise insert a runtime system (RTS) call
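The "generate cut expressions" and "sorted cut list" steps might look like the plain C sketch below for a single loop dimension. All names are hypothetical. Dropping duplicate cuts, and skipping offsets that are multiples of the blocking factor, are assumptions of this sketch rather than details given on the slide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* qsort comparator for size_t values. */
static int cmp_size(const void *a, const void *b)
{
    size_t x = *(const size_t *)a, y = *(const size_t *)b;
    return (x > y) - (x < y);
}

/* Computes one cut per eligible reference offset, then sorts the cuts
 * and removes duplicates. Returns the number of distinct cuts written
 * to cuts[], which must have room for n entries. */
static size_t build_cut_list(const size_t *offsets, size_t n, size_t bf,
                             size_t *cuts)
{
    size_t count = 0;
    assert(bf > 0);
    for (size_t r = 0; r < n; r++) {
        size_t c = bf - offsets[r] % bf;
        /* c == bf means the offset is a multiple of bf: that reference
         * stays in one case for the whole block, so it needs no split. */
        if (c < bf)
            cuts[count++] = c;
    }
    qsort(cuts, count, sizeof cuts[0], cmp_size);
    size_t unique = 0;
    for (size_t r = 0; r < count; r++)
        if (unique == 0 || cuts[unique - 1] != cuts[r])
            cuts[unique++] = cuts[r];
    return unique;
}

/* Each distinct cut splits the block range once, so the loop body is
 * replicated (number of distinct cuts + 1) times. */
static size_t count_loop_versions(const size_t *offsets, size_t n, size_t bf)
{
    size_t cuts[16];
    assert(n <= 16);
    return build_cut_list(offsets, n, bf, cuts) + 1;
}
```

With a single reference at offset 1 and BF = 2, this yields one cut and two loop versions, matching the two upc_forall copies in the generated code on the previous slide.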

20
Improvements of locality analysis in isolation
21
Improvements of affinity test elimination in
isolation
22
Results: Vector addition
23
Matrix-vector multiplication
24
Matrix-vector scalability
25
Conclusions

UPC requires extensive compiler support
upc_forall is a challenging construct to compile
efficiently
Shared access implementation requires compiler
support
Optimizations working together produce good
results
Compiler optimizations can produce >80x speedup
over unoptimized code
If one optimization fails, then results can still
be bad
