Eliminating affinity tests and simplifying shared accesses in UPC - PowerPoint PPT Presentation

Slides: 26
Provided by: rahul48

Transcript and Presenter's Notes


1
Eliminating affinity tests and simplifying shared
accesses in UPC
  • Rahul Garg, Kit Barton, Calin Cascaval
  • Gheorghe Almasi, Jose Nelson Amaral
  • University of Alberta
  • IBM Research

2
(No Transcript)
3
Shared arrays
  • Arrays can be shared between all threads
  • E.g. shared [2] double A[9];
  • Assuming THREADS = 3
  • 1-d block-cyclic distribution, similar to HPF
    cyclic(k)

(Figure: array elements 0 to 8 of A, distributed among the threads in blocks of 2)
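The block-cyclic layout on this slide can be sketched in plain C (not UPC). The helper name `owner_of` is hypothetical and only illustrates the rule that blocks of `bf` consecutive elements are dealt to the threads round-robin.

```c
#include <assert.h>

/* Owning thread of element i of `shared [bf] T A[n]` under a 1-d
   block-cyclic layout: block number i / bf is assigned to thread
   (i / bf) % threads. Plain C stand-in, not a UPC runtime call. */
int owner_of(int i, int bf, int threads)
{
    assert(i >= 0 && bf > 0 && threads > 0);
    return (i / bf) % threads;
}
```

For `shared [2] double A[9]` with THREADS = 3, elements 0 and 1 map to thread 0, elements 2 and 3 to thread 1, elements 4 and 5 to thread 2, and element 6 wraps around to thread 0 again.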
4
Vector addition example
  • #include <upc.h>
  • #include <stdio.h>
  • shared [2] double A[10];
  • shared [3] double B[10], C[10];
  • int main() {
  • int i;
  • upc_forall(i = 0; i < 10; i++; &C[i])
  • C[i] = A[i] + B[i];
  • }

5
Outline of talk
  • upc_forall loops syntax and uses
  • Compiling upc_forall loops
  • Data distributions in UPC
  • Multiblocking distributions
  • Privatization of access
  • Results

6
upc_forall and affinity tests
  • upc_forall is a work distribution construct
  • Form
  • shared [BF] double A[M];
  • upc_forall(i = 0; i < N; i++; &A[i])
  • //loop body
  • The affinity test expression (here &A[i]) determines which
    thread executes which iteration.
7
Affinity test elimination: naive
shared [BF] double A[M];
upc_forall(i = 0; i < M; i++; &A[i])
    //loop body
becomes
shared [BF] double A[M];
for(i = 0; i < M; i++)
    if(upc_threadof(&A[i]) == MYTHREAD)
        //loop body
8
Affinity test elimination: optimized
shared [BF] double A[M];
upc_forall(i = 0; i < M; i++; &A[i])
    //loop body
becomes
shared [BF] double A[M];
for(i = MYTHREAD*BF; i < M; i += (BF*THREADS))
    for(j = i; j < i+BF; j++)
        //loop body
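A minimal plain-C sketch (not UPC) can check that the optimized loop nest visits the same iterations as the naive affinity test. The function names and `MAX_M` are hypothetical, the threads are simulated by an outer loop over `me`, and the clamp on the inner loop bound is an added assumption so that a partial last block (M not a multiple of BF) does not run past M.

```c
#include <assert.h>

#define MAX_M 64

/* Naive form: each simulated thread scans all m iterations and keeps
   those whose element it owns. (i / bf) % threads stands in for
   upc_threadof(&A[i]). */
void run_naive(int m, int bf, int threads, int executed_by[])
{
    for (int me = 0; me < threads; me++)
        for (int i = 0; i < m; i++)
            if ((i / bf) % threads == me)
                executed_by[i] = me;
}

/* Optimized form: each simulated thread jumps directly from one of its
   blocks to the next, with no per-iteration test. */
void run_strided(int m, int bf, int threads, int executed_by[])
{
    for (int me = 0; me < threads; me++)
        for (int i = me * bf; i < m; i += bf * threads) {
            int end = (i + bf < m) ? i + bf : m; /* clamp partial block */
            for (int j = i; j < end; j++)
                executed_by[j] = me;
        }
}

/* Returns 1 if both forms assign every iteration to the same thread. */
int same_schedule(int m, int bf, int threads)
{
    int a[MAX_M], b[MAX_M];
    assert(m >= 0 && m <= MAX_M && bf > 0 && threads > 0);
    for (int i = 0; i < m; i++)
        a[i] = b[i] = -1;
    run_naive(m, bf, threads, a);
    run_strided(m, bf, threads, b);
    for (int i = 0; i < m; i++)
        if (a[i] < 0 || a[i] != b[i])
            return 0;
    return 1;
}
```

The naive form does M tests on every thread, while the strided form touches only the iterations a thread owns, which is where the saving comes from.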
9
Integer Affinity Tests
upc_forall(i = 0; i < M; i++; i)
    //loop body
becomes
for(i = MYTHREAD; i < M; i += THREADS)
    //loop body
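The integer-affinity rewrite can be checked with a small plain-C sketch. It assumes the usual UPC rule that an integer affinity expression `i` selects the thread with `i % THREADS == MYTHREAD` (for non-negative `i`). The function name is hypothetical.

```c
#include <assert.h>

/* Returns 1 if the strided loops of all simulated threads together
   visit each iteration in [0, m) exactly once, and each visited i
   satisfies the integer affinity test i % threads == me. */
int integer_affinity_matches(int m, int threads)
{
    int visited = 0;
    assert(m >= 0 && threads > 0);
    for (int me = 0; me < threads; me++)
        for (int i = me; i < m; i += threads) {
            if (i % threads != me)
                return 0;   /* strided loop visited a foreign iteration */
            visited++;
        }
    /* Each i has exactly one owner, so m visits means full coverage. */
    return visited == m;
}
```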
10
Data distributions for shared arrays
  • UPC official spec only supports 1d block cyclic
  • IBM xlupc compiler supports more general data
    distribution 'multi-dimensional blocking'
  • E.g. shared [2][3] double A[5][5];
  • Divide the array into multidimensional tiles
  • Distribute the tiles among processors in cyclic
    fashion
  • More general than UPC spec, but not as general as
    ScaLAPACK or HPF
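The tiling idea can be sketched in plain C. The mapping below is an assumption made for illustration only: tiles are numbered in row-major order and dealt to threads cyclically. The slides do not state the exact tile-to-thread mapping used by xlupc, and `tile_owner` is a hypothetical helper.

```c
#include <assert.h>

/* Owning thread of A[i][j] for `shared [b0][b1] T A[rows][cols]`,
   ASSUMING tiles are numbered row-major and assigned cyclically.
   This mapping is illustrative, not taken from the xlupc runtime. */
int tile_owner(int i, int j, int b0, int b1, int cols, int threads)
{
    assert(i >= 0 && j >= 0 && j < cols);
    assert(b0 > 0 && b1 > 0 && threads > 0);
    int tiles_per_row = (cols + b1 - 1) / b1;  /* partial tiles count */
    int tile = (i / b0) * tiles_per_row + (j / b1);
    return tile % threads;
}
```

Under this assumed mapping, `shared [2][3] double A[5][5]` has two tiles per row of tiles, so with three threads A[0][0] lands on thread 0, A[0][3] on thread 1, and A[2][0] on thread 2.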

11
(No Transcript)
12
Locality analysis and privatization
  • Consider
  • shared [2][3] A[5][6], B[5][6];
  • for(i = 0; i < 4; i++)
  • upc_forall(j = 0; j < 4; j++; &A[i][j])
  • A[i][j] = B[i+1][j];
  • What code should we generate for references
    A[i][j] and B[i+1][j]?

13
Shared access code generation
for(i = 0; i < 4; i++)
    upc_forall(j = 0; j < 4; j++; &A[i][j])
        A[i][j] = B[i+1][j];
becomes
for(i = 0; i < 4; i++)
    upc_forall(j = 0; j < 4; j++; &A[i][j]) {
        val = shared_deref(B, i+1, j);
        shared_assign(A, i, j, val);
    }
14
Shared access code generation
for(i = 0; i < 4; i++)
    upc_forall(j = 0; j < 4; j++; &A[i][j])
        A[i][j] = B[i+1][j];
  • Do we really need the function calls?
  • Should A[i][j] be just a memory load/store?
  • What about B[i+1][j] on an SMP? Should this be just
    a load? What about on hybrids?

15
(No Transcript)
16
Locality Analysis Intuition
for(i = 0; i < 4; i++)
    upc_forall(j = 0; j < 4; j++; &A[i][j])
        A[i][j] = B[i+1][j];
  • The locality can only change if the index (i+1)
    crosses a block boundary in a dimension
  • Block boundaries: 0, BF, 2*BF, ...
  • (i+1) % BF == 0 gives a block boundary
  • So we only need to check whether (i+1) % BF == 0 to find
    the places where locality can change!

17
Locality Analysis
for(i = 0; i < 4; i++)
    upc_forall(j = 0; j < 4; j++; &A[i][j])
        A[i][j] = B[i+1][j];
  • Define the offset vector [k1, k2]; here k1 = 1, k2 = 0
  • k1 and k2 are integer constants
  • A block boundary is crossed at (i+k1) % BF == 0
  • Cases: i % BF < (BF - k1 % BF) and i % BF >= (BF - k1 % BF)
  • We refer to i % BF < (BF - k1) as the 'cut'
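The cut condition can be checked against a direct block comparison in plain C. This is a sketch under the assumption 0 <= k1 < BF (the general form on the slide reduces k1 modulo BF first). All function names are hypothetical.

```c
#include <assert.h>

/* 1 if indices i and i + k1 fall in the same block of size bf. */
int same_block(int i, int k1, int bf)
{
    return (i / bf) == ((i + k1) / bf);
}

/* Cut predicate from the slide: true while i + k1 has not yet
   crossed into the next block. Assumes 0 <= k1 < bf. */
int before_cut(int i, int k1, int bf)
{
    return (i % bf) < (bf - k1);
}

/* Returns 1 if the cut predicate agrees with the direct block
   comparison for every i in [0, n). */
int cut_agrees(int n, int k1, int bf)
{
    assert(n >= 0 && bf > 0 && k1 >= 0 && k1 < bf);
    for (int i = 0; i < n; i++)
        if (same_block(i, k1, bf) != before_cut(i, k1, bf))
            return 0;
    return 1;
}
```

With a blocking factor of 2 in the first dimension and k1 = 1, the predicate becomes i % 2 < 1: for even i, B[i+1][j] sits in the same block row as A[i][j], and for odd i it sits in the next one.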

18
Shared access code generation
for(i = 0; i < 4; i++)
    if((i % 2) < 1)
        upc_forall(j = 0; j < 4; j++; &A[i][j]) {
            val = memory_load(B, i+1, j);
            memory_store(A, i, j, val);
        }
    else
        upc_forall(j = 0; j < 4; j++; &A[i][j]) {
            val = shared_deref(B, i+1, j);
            memory_store(A, i, j, val);
        }

19
Locality analysis algorithm
  • For each shared reference in loop
  • Check if blocking factor matches
  • Check if distance vector is constant
  • If reference is eligible
  • Generate cut expressions
  • Put cut in a sorted cut list
  • Replicate loop body as necessary
  • Insert memory load/store if local reference
    otherwise insert RTS call

20
Improvements of locality analysis in isolation
21
Improvements of affinity test elimination in
isolation
22
Results: Vector addition
23
Matrix-vector multiplication
24
Matrix-vector scalability
25
Conclusions
  • UPC requires extensive compiler support
  • upc_forall is a challenging construct to compile
    efficiently
  • Shared access implementation requires compiler
    support
  • Optimizations working together produce good
    results
  • Compiler optimizations can produce >80x speedup
    over unoptimized code
  • If one optimization fails, then results can still
    be bad