Affine Partitioning for Parallelism

Provided by: Monic79
Learn more at: http://suif.stanford.edu

1
Affine Partitioning for Parallelism & Locality
  • Amy Lim
  • Stanford University
  • http://suif.stanford.edu/

2
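Useful Transforms for Parallelism & Locality
  • INTERCHANGE: FOR i { FOR j { A[i,j] = ... } }
    => FOR j { FOR i { A[i,j] = ... } }
  • REVERSAL: FOR i = 1 to n { A[i] = ... }
    => FOR i = n downto 1 { A[i] = ... }
  • SKEWING: FOR i = 1 TO n { FOR j = 1 TO n { A[i,j] = ... } }
    => FOR i = 1 TO n { FOR k = i+1 to i+n { A[i,k-i] = ... } }
  • FUSION/FISSION: FOR i = 1 TO n { A[i] = ... } FOR i = 1 TO n { B[i] = ... }
    <=> FOR i = 1 TO n { A[i] = ... ; B[i] = ... }
  • REINDEXING: FOR i = 1 to n { A[i] = B[i-1] ; C[i] = A[i+1] }
    => A[1] = B[0] ; FOR i = 1 to n-1 { A[i+1] = B[i] ; C[i] = A[i+1] } ;
    C[n] = A[n+1]
  • Traditional approach: is it legal and desirable to
    apply one transform?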
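
As a rough illustration of what INTERCHANGE, REVERSAL, and SKEWING do to the iteration order, the Python sketch below records which array element each loop form touches. The function names are made up for this example, and the loop bodies are reduced to noting the accessed index.

```python
def original(n):
    """FOR i = 1 TO n, FOR j = 1 TO n: touch A[i,j]."""
    return [(i, j) for i in range(1, n + 1) for j in range(1, n + 1)]

def interchanged(n):
    """Same body, but j is now the outer loop."""
    return [(i, j) for j in range(1, n + 1) for i in range(1, n + 1)]

def reversal(n):
    """FOR i = n downto 1: touch A[i]."""
    return [i for i in range(n, 0, -1)]

def skewed(n):
    """FOR i = 1 TO n, FOR k = i+1 TO i+n: touch A[i,k-i]."""
    return [(i, k - i) for i in range(1, n + 1) for k in range(i + 1, i + n + 1)]
```

In this sketch, skewing on its own visits the same elements in the same order as the original loop; it only relabels the inner index, so it becomes useful once it is combined with another transform such as interchange.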

3
Question: How to combine the transformations?
  • Affine mappings [Lim & Lam, POPL 97, ICS 99]
  • Domain: arbitrary loop nesting, affine loop
    indices; each instruction optimized separately
  • Unifies:
  • Permutation
  • Skewing
  • Reversal
  • Fusion
  • Fission
  • Statement reordering
  • Supports blocking across all (non-perfectly
    nested) loops
  • Optimal: maximum degree of parallelism and minimum
    degree of synchronization
  • Minimize communication by aligning the
    computation and pipelining

4
Loop Transforms: Cholesky Factorization Example
  • DO 1 J = 0, N
  • I0 = MAX ( -M, -J )
  • DO 2 I = I0, -1
  • DO 3 JJ = I0 - I, -1
  • DO 3 L = 0, NMAT
  • 3 A(L,I,J) = A(L,I,J) - A(L,JJ,I+J) *
    A(L,I+JJ,J)
  • DO 2 L = 0, NMAT
  • 2 A(L,I,J) = A(L,I,J) * A(L,0,I+J)
  • DO 4 L = 0, NMAT
  • 4 EPSS(L) = EPS * A(L,0,J)
  • DO 5 JJ = I0, -1
  • DO 5 L = 0, NMAT
  • 5 A(L,0,J) = A(L,0,J) - A(L,JJ,J) ** 2
  • DO 1 L = 0, NMAT
  • 1 A(L,0,J) = 1. / SQRT ( ABS (EPSS(L) +
    A(L,0,J)) )
  • DO 6 I = 0, NRHS
  • DO 7 K = 0, N
  • DO 8 L = 0, NMAT

5
Results for Optimizing Perfect Nests
Speedup on a Digital TurboLaser with eight 300 MHz
21164 processors
6
Optimizing Arbitrary Loop Nesting Using Affine
Partitions
  • DO 1 J = 0, N
  • I0 = MAX ( -M, -J )
  • DO 2 I = I0, -1
  • DO 3 JJ = I0 - I, -1
  • DO 3 L = 0, NMAT
  • 3 A(L,I,J) = A(L,I,J) - A(L,JJ,I+J) *
    A(L,I+JJ,J)
  • DO 2 L = 0, NMAT
  • 2 A(L,I,J) = A(L,I,J) * A(L,0,I+J)
  • DO 4 L = 0, NMAT
  • 4 EPSS(L) = EPS * A(L,0,J)
  • DO 5 JJ = I0, -1
  • DO 5 L = 0, NMAT
  • 5 A(L,0,J) = A(L,0,J) - A(L,JJ,J) ** 2
  • DO 1 L = 0, NMAT
  • 1 A(L,0,J) = 1. / SQRT ( ABS (EPSS(L) +
    A(L,0,J)) )
  • DO 6 I = 0, NRHS
  • DO 7 K = 0, N
  • DO 8 L = 0, NMAT


[Figure: diagram labeled with arrays A, B, and EPSS, each with index L]
7
Results with Affine Partitioning + Blocking
8
A Simple Example
  • FOR i = 1 TO n DO
  • FOR j = 1 TO n DO
  • A[i,j] = A[i,j] + B[i-1,j] (S1)
  • B[i,j] = A[i,j-1] + B[i,j] (S2)

[Figure: iteration space over i and j showing instances of S1 and S2]
9
Best Parallelization Scheme
  • SPMD code: let p be the processor's ID number
  • if (1-n <= p <= n) then
  • if (1 <= p) then
  • B[p,1] = A[p,0] + B[p,1] (S2)
  • for i1 = max(1,1+p) to min(n,n-1+p) do
  • A[i1,i1-p] = A[i1,i1-p] + B[i1-1,i1-p] (S1)
  • B[i1,i1-p+1] = A[i1,i1-p] + B[i1,i1-p+1] (S2)
  • if (p <= 0) then
  • A[n+p,n] = A[n+p,n] + B[n+p-1,n] (S1)
  • Solution can be expressed as affine partitions:
  • S1: Execute iteration (i, j) on processor i-j.
  • S2: Execute iteration (i, j) on processor
    i-j+1.
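
The SPMD scheme can be checked against the original loop nest with a small simulation. The following Python sketch is only an illustration: the helper names and initial array values are invented, and the operator in S1 and S2 is assumed to be `+` because the transcript lost it. Each call to `run_processor` plays the role of one processor p.

```python
def make_arrays(n):
    """Build (n+1) x (n+1) integer arrays; row and column 0 hold boundary values."""
    A = [[7 * i + 3 * j + 1 for j in range(n + 1)] for i in range(n + 1)]
    B = [[i + 2 * j for j in range(n + 1)] for i in range(n + 1)]
    return A, B

def run_sequential(n, A, B):
    """The original doubly nested loop from the simple example."""
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            A[i][j] = A[i][j] + B[i - 1][j]      # S1
            B[i][j] = A[i][j - 1] + B[i][j]      # S2

def run_processor(p, n, A, B):
    """Work given to processor p by the partitions i-j (S1) and i-j+1 (S2)."""
    if not (1 - n <= p <= n):
        return
    if 1 <= p:
        B[p][1] = A[p][0] + B[p][1]                              # S2
    for i1 in range(max(1, 1 + p), min(n, n - 1 + p) + 1):
        A[i1][i1 - p] = A[i1][i1 - p] + B[i1 - 1][i1 - p]        # S1
        B[i1][i1 - p + 1] = A[i1][i1 - p] + B[i1][i1 - p + 1]    # S2
    if p <= 0:
        A[n + p][n] = A[n + p][n] + B[n + p - 1][n]              # S1

def run_spmd(n, A, B, order=None):
    """Run every processor; they share no data, so any order is valid."""
    for p in (order if order is not None else range(1 - n, n + 1)):
        run_processor(p, n, A, B)
```

Running the processors in reverse order and still matching the sequential result is a quick way to see that no processor depends on data produced by another.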

10
Maximum Parallelism & No Communication
  • Let $F_{xj}$ be an access to array $x$ in statement $j$,
  • $i_j$ be an iteration index for statement $j$,
  • $B_j i_j \ge 0$ represent loop bound constraints
    for statement $j$,
  • Find $C_j$, which maps an instance of statement $j$
    to a processor:
  • $\forall\, i_j, i_k:\ B_j i_j \ge 0,\ B_k i_k \ge 0,$
    $F_{xj}(i_j) = F_{xk}(i_k) \Rightarrow C_j(i_j) = C_k(i_k)$
  • with the objective of maximizing the rank of $C_j$

11
Algorithm
  • $\forall\, i_j, i_k:\ B_j i_j \ge 0,\ B_k i_k \ge 0,$
    $F_{xj}(i_j) = F_{xk}(i_k) \Rightarrow C_j(i_j) = C_k(i_k)$
  • Rewrite partition constraints as systems of
    linear equations:
  • use the affine form of Farkas' Lemma to rewrite
    constraints as systems of linear inequalities in
    $C$ and $\lambda$
  • use the Fourier-Motzkin algorithm to eliminate the Farkas
    multipliers $\lambda$ and get systems of linear equations
    $AC = 0$
  • Find solutions using linear algebra techniques:
  • the null space of matrix $A$ gives a solution for $C$
    with maximum rank.
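
The last step can be tried by hand on the simple two-statement example. The sketch below is an assumption-laden illustration, not the authors' implementation: it skips the Farkas and Fourier-Motzkin steps by assuming the iteration space is large enough that the equalities must hold identically, writes the partitions as C1(i,j) = a1*i + b1*j + c1 and C2(i,j) = a2*i + b2*j + c2, and uses a hand-rolled exact null-space routine (`null_space` is a made-up helper).

```python
from fractions import Fraction

def null_space(rows, ncols):
    """Basis of {x : rows . x = 0}, via exact Gauss-Jordan elimination."""
    m = [[Fraction(v) for v in r] for r in rows]
    pivots, r = [], 0
    for c in range(ncols):
        if r == len(m):
            break
        pr = next((k for k in range(r, len(m)) if m[k][c] != 0), None)
        if pr is None:
            continue                      # no pivot here: c is a free column
        m[r], m[pr] = m[pr], m[r]
        pv = m[r][c]
        m[r] = [v / pv for v in m[r]]
        for k in range(len(m)):
            if k != r and m[k][c] != 0:
                f = m[k][c]
                m[k] = [a - f * b for a, b in zip(m[k], m[r])]
        pivots.append(c)
        r += 1
    basis = []
    for f in (c for c in range(ncols) if c not in pivots):
        v = [Fraction(0)] * ncols
        v[f] = Fraction(1)
        for row_idx, pc in enumerate(pivots):
            v[pc] = -m[row_idx][f]
        basis.append(v)
    return basis

# Unknowns, in order: a1, b1, c1, a2, b2, c2.
# S1(i,j) writes A[i,j], read by S2(i,j+1):  C1(i,j) = C2(i,j+1)
# S2(i,j) writes B[i,j], read by S1(i+1,j):  C2(i,j) = C1(i+1,j)
A_MATRIX = [
    [1, 0, 0, -1, 0, 0],    # a1 = a2
    [0, 1, 0, 0, -1, 0],    # b1 = b2
    [0, 0, 1, 0, -1, -1],   # c1 = b2 + c2
    [-1, 0, -1, 0, 0, 1],   # c2 = a1 + c1
]

BASIS = null_space(A_MATRIX, 6)
SLIDE_SOLUTION = [1, -1, 0, 1, -1, 1]   # C1 = i - j, C2 = i - j + 1

def satisfies(x):
    """True when every row of A_MATRIX maps x to zero."""
    return all(sum(a * b for a, b in zip(row, x)) == 0 for row in A_MATRIX)
```

Under these assumptions the null space has two dimensions: one is a constant shift of both processor IDs, and the other is, up to sign and shift, the partition i-j for S1 and i-j+1 for S2 from the SPMD slide.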

12
Pipelining: Alternating Direction Integration
Example
  • Requires transposing data:
  • DO J = 1 to N (parallel)
  • DO I = 1 to N
  • A(I,J) = f(A(I,J), A(I-1,J))
  • DO J = 1 to N
  • DO I = 1 to N (parallel)
  • A(I,J) = g(A(I,J), A(I,J-1))
  • Moves only boundary data:
  • DO J = 1 to N (parallel)
  • DO I = 1 to N
  • A(I,J) = f(A(I,J), A(I-1,J))
  • DO J = 1 to N (pipelined)
  • DO I = 1 to N
  • A(I,J) = g(A(I,J), A(I,J-1))
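
A small simulation can show why the pipelined form of the second loop nest is legal. In this Python sketch, `g` is a hypothetical stand-in for the slide's unspecified function, column J is imagined to live on processor J, and time stage t = I + J is one possible pipelining schedule chosen for illustration.

```python
def g(x, y):
    """Hypothetical stand-in for the slide's g; any function would do."""
    return 3 * x + y

def make_grid(N):
    """(N+1) x (N+1) integer grid; column 0 holds boundary values."""
    return [[I * I + 2 * J for J in range(N + 1)] for I in range(N + 1)]

def sweep_sequential(N, A):
    """The second loop nest exactly as written on the slide."""
    for J in range(1, N + 1):
        for I in range(1, N + 1):
            A[I][J] = g(A[I][J], A[I][J - 1])

def sweep_pipelined(N, A):
    """Same work, grouped into time stages t = I + J."""
    for t in range(2, 2 * N + 1):
        # Within one stage the columns are independent, so they are
        # deliberately visited in reverse to make that visible.
        for J in range(N, 0, -1):
            I = t - J
            if 1 <= I <= N:
                A[I][J] = g(A[I][J], A[I][J - 1])
```

In each stage, processor J only needs the single value A(I,J-1) that its neighbor produced one stage earlier, which matches the slide's point that only boundary data moves.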

13
Finding the Maximum Degree of Pipelining
[Figure: loop iterations i1 and i2 mapped to array elements by F1(i1) and F2(i2), and to time stages by T1(i1) and T2(i2)]
  • Let $F_{xj}$ be an access to array $x$ in statement $j$,
  • $i_j$ be an iteration index for statement $j$,
  • $B_j i_j \ge 0$ represent loop bound constraints
    for statement $j$,
  • Find $T_j$, which maps an instance of statement $j$
    to a time stage:
  • $\forall\, i_j, i_k:\ B_j i_j \ge 0,\ B_k i_k \ge 0,$
    $(i_j \prec i_k) \wedge (F_{xj}(i_j) = F_{xk}(i_k)) \Rightarrow T_j(i_j) \le T_k(i_k)$
  • where $i_j \prec i_k$ is taken lexicographically
  • with the objective of maximizing the rank of $T_j$

14
Key Insight
  • Choice in time mapping => (pipelined) parallelism
  • Degrees of parallelism = rank(T) - 1
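
The rank(T) - 1 rule can be tried on the loop nest A(I,J) = g(A(I,J), A(I,J-1)) from the pipelining slide. This Python sketch is a brute-force illustration under stated assumptions: it only considers one-dimensional time maps T(I,J) = a*I + b*J with coefficients in {-1, 0, 1}, checks the time constraint on a small N, and uses made-up helper names.

```python
from itertools import product

def dependent_pairs(N):
    """(source, sink) pairs: iteration (I, J-1) writes what (I, J) reads."""
    return [((I, J - 1), (I, J))
            for I in range(1, N + 1) for J in range(2, N + 1)]

def is_legal_time_map(a, b, N):
    """T(I,J) = a*I + b*J must never place a sink before its source."""
    return all(a * si + b * sj <= a * di + b * dj
               for (si, sj), (di, dj) in dependent_pairs(N))

def rank_2d(vectors):
    """Rank (0, 1, or 2) of a collection of 2-D integer vectors."""
    nonzero = [v for v in vectors if v != (0, 0)]
    if not nonzero:
        return 0
    u = nonzero[0]
    independent = any(u[0] * v[1] - u[1] * v[0] != 0 for v in nonzero)
    return 2 if independent else 1

N = 4
legal = [(a, b) for a, b in product((-1, 0, 1), repeat=2)
         if is_legal_time_map(a, b, N)]
degrees_of_parallelism = rank_2d(legal) - 1
```

Here both T = I and T = J are legal, so the legal time maps span two dimensions and the rule gives one degree of pipelined parallelism for this nest.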

15
Putting it All Together
  • Find maximum outer-loop parallelism with minimum
    synchronization
  • Divide into strongly connected components
  • Apply processor mapping algorithm (no
    communication) to program
  • If no parallelism found,
  • Apply time mapping algorithm to find pipelining
  • If no pipelining found (found outer sequential
    loop)
  • Repeat process on inner loops
  • Minimize communication
  • Use a greedy method to order communicating pairs
  • Try to find communication-free or neighborhood-only
    communication by solving similar equations
  • Aggregate computations of consecutive data to
    improve spatial locality

16
Use of Affine Partitioning in Locality Opt.
  • Promotes array contraction
  • Finds independent threads and shortens the live
    ranges of variables
  • Supports blocking of imperfectly nested loops
  • Finds largest fully permutable loop nest via
    affine partitioning
  • Fully permutable loop nest -> blockable