Compiling High Performance Fortran - PowerPoint PPT Presentation

1
Compiling High Performance Fortran
  • Allen and Kennedy, Chapter 14

2
Overview
  • Motivation for HPF
  • Overview of compiling HPF programs
  • Basic Loop Compilation for HPF
  • Optimizations for compiling HPF
  • Results and Summary

3
Motivation for HPF
  • Message passing is required to communicate data
    between processors
  • Approach 1: Use MPI calls in Fortran/C code

Scalable Distributed Memory Multiprocessor
4
Motivation for HPF
  • Consider the following sum reduction
  • PROGRAM SUM
  • REAL A(10000)
  • READ (9) A
  • SUM = 0.0
  • DO I = 1, 10000
  •   SUM = SUM + A(I)
  • ENDDO
  • PRINT SUM
  • END
  • PROGRAM SUM
  • REAL A(100), BUFF(100)
  • IF (PID == 0) THEN
  •   DO IP = 0, 99
  •     READ (9) BUFF(1:100)
  •     IF (IP == 0)
  •       A(1:100) = BUFF(1:100)
  •     ELSE SEND(IP, BUFF, 100)
  •   ENDDO
  • ELSE RECV(0, A, 100)
  • ENDIF
  • /* Actual sum reduction code here */
  • IF (PID == 0) SEND(1, SUM, 1)
  • IF (PID > 0) THEN
  •   RECV(PID-1, T, 1)
  •   SUM = SUM + T
  •   IF (PID < 99) SEND(PID+1, SUM, 1)
  •   ELSE SEND(0, SUM, 1)
  • ENDIF
  • IF (PID == 0) PRINT SUM

MPI implementation
5
Motivation for HPF
  • Disadvantages of MPI approach
  • User has to rewrite the program in SPMD form
    (Single Program Multiple Data)
  • User has to manage data movement (send/receive),
    data placement, and synchronization
  • Too messy and not easy to master

6
Motivation for HPF
  • Approach 2: Use HPF
  • HPF is an extended version of Fortran 90
  • HPF has Fortran 90 features and a few directives
  • Directives
  • Tell how data is laid out in processor memories
    in the parallel machine configuration. For example,
  • !HPF$ DISTRIBUTE A(BLOCK)
  • Assist in identifying parallelism. For example,
  • !HPF$ INDEPENDENT

7
Motivation for HPF
  • The same sum reduction code
  • PROGRAM SUM
  • REAL A(10000)
  • READ (9) A
  • SUM = 0.0
  • DO I = 1, 10000
  •   SUM = SUM + A(I)
  • ENDDO
  • PRINT SUM
  • END
  • When written in HPF...
  • PROGRAM SUM
  • REAL A(10000)
  • !HPF$ DISTRIBUTE A(BLOCK)
  • READ (9) A
  • SUM = 0.0
  • DO I = 1, 10000
  •   SUM = SUM + A(I)
  • ENDDO
  • PRINT SUM
  • END
  • Minimum modification
  • Easy to write
  • Now compiler has to do more work

8
Motivation for HPF
  • Advantages of HPF
  • User only needs to write a few simple directives,
    and need not rewrite the whole program in SPMD form
  • User does not need to manage data movement
    (send/receive) or synchronization
  • Simple and easy to master

9
Overview
  • Motivation for HPF
  • Overview of compiling HPF programs
  • Basic Loop Compilation for HPF
  • Optimizations for compiling HPF
  • Results and Summary

10
HPF Compilation Overview
  • Dependence Analysis
  • Used for communication analysis
  • Fact used: no dependence is carried by the I loops
  • Running example
  • REAL A(10000), B(10000)
  • !HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)
  • DO J = 1, 10000
  •   DO I = 2, 10000
  • S1:   A(I) = B(I-1) + C
  •   ENDDO
  •   DO I = 1, 10000
  • S2:   B(I) = A(I)
  •   ENDDO
  • ENDDO

11
HPF Compilation Overview
  • Running example
  • REAL A(10000), B(10000)
  • !HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)
  • DO J = 1, 10000
  •   DO I = 2, 10000
  • S1:   A(I) = B(I-1) + C
  •   ENDDO
  •   DO I = 1, 10000
  • S2:   B(I) = A(I)
  •   ENDDO
  • ENDDO
  • Dependence Analysis
  • Distribution Analysis

12
HPF Compilation Overview
  • Running example
  • REAL A(10000), B(10000)
  • !HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)
  • DO J = 1, 10000
  •   DO I = 2, 10000
  • S1:   A(I) = B(I-1) + C
  •   ENDDO
  •   DO I = 1, 10000
  • S2:   B(I) = A(I)
  •   ENDDO
  • ENDDO
  • Dependence Analysis
  • Distribution Analysis
  • Computation Partitioning
  • Partition so as to distribute work of the I loops

13
HPF Compilation Overview
  • REAL A(1:100), B(0:100)
  • !HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)
  • DO J = 1, 10000
  • I1:  IF (PID /= 99) SEND(PID+1, B(100), 1)
  • I2:  IF (PID /= 0) THEN
  •        RECV(PID-1, B(0), 1)
  •        A(1) = B(0) + C
  •      ENDIF
  •   DO I = 2, 100
  • S1:   A(I) = B(I-1) + C
  •   ENDDO
  •   DO I = 1, 100
  • S2:   B(I) = A(I)
  •   ENDDO
  • ENDDO
  • Dependence Analysis
  • Distribution Analysis
  • Computation Partitioning
  • Communication Analysis and placement
  • Communication required for B(0) in each iteration
    of the J loop
  • Shadow region: B(0)

14
HPF Compilation Overview
  • REAL A(1:100), B(0:100)
  • !HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)
  • DO J = 1, 10000
  • I1:  IF (PID /= 99) SEND(PID+1, B(100), 1)
  •   DO I = 2, 100
  • S1:   A(I) = B(I-1) + C
  •   ENDDO
  • I2:  IF (PID /= 0) THEN
  •        RECV(PID-1, B(0), 1)
  •        A(1) = B(0) + C
  •      ENDIF
  •   DO I = 1, 100
  • S2:   B(I) = A(I)
  •   ENDDO
  • ENDDO
  • Dependence Analysis
  • Distribution Analysis
  • Computation Partitioning
  • Communication Analysis and placement
  • Optimization
  • Aggregation
  • Overlap communication and computation
  • Recognition of reduction

15
Overview
  • Motivation for HPF
  • Overview of compiling HPF programs
  • Basic Loop Compilation for HPF
  • Optimizations for compiling HPF
  • Results and Summary

16
Basic Loop Compilation
  • Distribution Propagation and analysis
  • Analyze what distribution holds for a given array
    at a given point in the program
  • Difficult due to
  • REALIGN and REDISTRIBUTE directives
  • Distribution of formal parameters inherited from
    calling procedure
  • Use Reaching Decompositions data flow analysis
    and its interprocedural version

17
Basic Loop Compilation
  • For simplicity, assume a single distribution for an
    array at all points in a subprogram
  • Define the block size of a BLOCK distribution
  • For example, suppose an array A of size N is block-
    distributed over p processors
  • Block size: b = CEIL(N/p)
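The block-size rule can be checked with a one-line calculation (a Python sketch for illustration only; `block_size` is a made-up helper, not part of any HPF compiler):

```python
import math

def block_size(n, p):
    """Block size for a BLOCK distribution: b = ceil(n / p)."""
    return math.ceil(n / p)

# The running example: 10000 elements over 100 processors.
print(block_size(10000, 100))  # -> 100
```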

18
Basic Loop Compilation
  • Iteration Partitioning
  • Dividing work among processors
  • Computation partitioning
  • Determine which iterations of a loop will be
    executed on which processor
  • Owner-computes rule
  • REAL A(10000)
  • !HPF$ DISTRIBUTE A(BLOCK)
  • DO I = 1, 10000
  •   A(I) = A(I) + C
  • ENDDO
  • Iteration I is executed on the owner of A(I)
  • With 100 processors: the first 100 iterations run on
    processor 0, the next 100 on processor 1, and so on
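The owner-computes rule for this example can be sketched as follows (Python for illustration; `owner` is a hypothetical helper based on the block-size formula, not compiler code):

```python
import math

def owner(i, n=10000, p=100):
    """Owner of A(i) under a BLOCK distribution of n elements over p
    processors: elements fall into contiguous blocks of b = ceil(n/p)."""
    b = math.ceil(n / p)
    return (i - 1) // b

# Iteration I runs on the owner of A(I): the first 100 iterations on
# processor 0, the next 100 on processor 1, and so on.
print(owner(1), owner(100), owner(101), owner(10000))  # -> 0 0 1 99
```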

19
Iteration Partitioning
  • With multiple statements in a loop or in a
    recurrence: choose a partitioning reference
  • The processor responsible for performing the
    computation for iteration I is the owner of the
    partitioning reference in that iteration
  • Set of indices executed on processor p: the
    iterations whose partitioning reference falls in
    p's block

20
Iteration Partitioning
  • Have to map the global loop index to a local loop
    index
  • The smallest global index executed on a processor
    maps to 1
  • REAL A(10000)
  • !HPF$ DISTRIBUTE A(BLOCK)
  • DO I = 1, N
  •   A(I+1) = B(I) + C
  • ENDDO

21
Iteration Partitioning
  • REAL A(10000), B(10000)
  • !HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)
  • DO I = 1, N
  •   A(I+1) = B(I) + C
  • ENDDO
  • Map the global iteration space I to the local
    iteration space i as follows: i = I - 100*PID + 1
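The global-to-local mapping can be sketched directly (Python for illustration; the formula i = I - 100*PID + 1 is inferred from the generated code on the following slides, and both helper names are made up):

```python
BLOCK = 100  # block size from the running example

def to_local(I, pid, b=BLOCK):
    """Global iteration I -> local index i on processor pid; the
    smallest global iteration assigned to pid maps to i = 1."""
    return I - b * pid + 1

def to_global(i, pid, b=BLOCK):
    """Inverse mapping: local index i on processor pid -> global I."""
    return b * pid + i - 1

# Processor 3 executes global iterations 300..399 as local 1..100.
print(to_local(300, 3), to_local(399, 3))  # -> 1 100
print(to_global(1, 3))                     # -> 300
```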

22
Iteration Partitioning
  • Adjust array subscripts for the local iterations:
    A(I+1) becomes A(i) and B(I) becomes B(i-1)

23
Iteration Partitioning
  • For interior processors the code becomes...
  • DO i = 1, 100
  •   A(i) = B(i-1) + C
  • ENDDO
  • Adjust the lower bound for the first processor and
    the upper bound of the last processor to take care
    of boundary conditions...
  • lo = 1
  • IF (PID == 0) lo = 2
  • hi = 100
  • IF (PID == CEIL((N+1)/100)-1) hi = MOD(N,100) + 1
  • DO i = lo, hi
  •   A(i) = B(i-1) + C
  • ENDDO

24
Communication Generation
  • For our example, no communication is required for
    iterations in [100p+1, 100p+99]
  • Iterations which require receiving data are
    [100p, 100p]
  • Iterations which require sending data are
    [100p+100, 100p+100]

25
Communication Generation
  • REAL A(10000), B(10000)
  • !HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)
  • ...
  • DO I = 1, N
  •   A(I+1) = B(I) + C
  • ENDDO
  • Receive required for iterations in [100p, 100p]
  • Send required for iterations in
    [100p+100, 100p+100]
  • No communication required for iterations in
    [100p+1, 100p+99]
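These three intervals can be checked mechanically (Python sketch; `classify` is an illustrative helper built from the intervals above, not compiler code):

```python
BLOCK = 100  # block size from the running example

def classify(I, pid, b=BLOCK):
    """For A(I+1) = B(I) + C under a BLOCK distribution, classify global
    iteration I from processor pid's point of view: must pid receive
    B(I) for it, send a B value for it, or neither?"""
    if I == b * pid:              # reads B(b*pid), owned by pid-1
        return "recv"
    if I == b * pid + b:          # runs on pid+1, reads B owned by pid
        return "send"
    if b * pid < I < b * pid + b:
        return "local"
    return "other"                # iteration not adjacent to pid's block

print(classify(300, 3))  # -> recv
print(classify(350, 3))  # -> local
print(classify(400, 3))  # -> send
```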

26
Communication Generation
  • After inserting the receive...
  • lo = 1
  • IF (PID == 0) lo = 2
  • hi = 100
  • IF (PID == CEIL((N+1)/100)-1)
  •   hi = MOD(N,100) + 1
  • DO i = lo, hi
  •   IF (i == 1 .AND. PID /= 0)
  •     RECV(PID-1, B(0), 1)
  •   A(i) = B(i-1) + C
  • ENDDO
  • The send must happen in the 101st iteration
  • lo = 1
  • IF (PID == 0) lo = 2
  • hi = 100
  • lastP = CEIL((N+1)/100) - 1
  • IF (PID == lastP)
  •   hi = MOD(N,100) + 1
  • DO i = lo, hi+1
  •   IF (i == 1 .AND. PID /= 0)
  •     RECV(PID-1, B(0), 1)
  •   IF (i <= hi) THEN
  •     A(i) = B(i-1) + C
  •   ENDIF
  •   IF (i == hi+1 .AND. PID /= lastP)
  •     SEND(PID+1, B(100), 1)
  • ENDDO

27
Communication Generation
  • lo = 1
  • IF (PID == 0) lo = 2
  • hi = 100
  • lastP = CEIL((N+1)/100) - 1
  • IF (PID == lastP) hi = MOD(N,100) + 1
  • DO i = lo, hi+1
  •   IF (i == 1 .AND. PID /= 0)
  •     RECV(PID-1, B(0), 1)
  •   IF (i <= hi) THEN
  •     A(i) = B(i-1) + C
  •   ENDIF
  •   IF (i == hi+1 .AND. PID /= lastP)
  •     SEND(PID+1, B(100), 1)
  • ENDDO
  • Move the SEND outside the loop
  • lo = 1
  • IF (PID == 0) lo = 2
  • hi = 100
  • lastP = CEIL((N+1)/100) - 1
  • IF (PID == lastP) hi = MOD(N,100) + 1
  • IF (PID <= lastP) THEN
  •   DO i = lo, hi
  •     IF (i == 1 .AND. PID /= 0)
  •       RECV(PID-1, B(0), 1)
  •     A(i) = B(i-1) + C
  •   ENDDO
  •   IF (PID /= lastP)
  •     SEND(PID+1, B(100), 1)
  • ENDIF

28
Communication Generation
  • lo = 1
  • IF (PID == 0) lo = 2
  • hi = 100
  • lastP = CEIL((N+1)/100) - 1
  • IF (PID == lastP) hi = MOD(N,100) + 1
  • IF (PID <= lastP) THEN
  •   DO i = lo, hi
  •     IF (i == 1 .AND. PID /= 0)
  •       RECV(PID-1, B(0), 1)
  •     A(i) = B(i-1) + C
  •   ENDDO
  •   IF (PID /= lastP)
  •     SEND(PID+1, B(100), 1)
  • ENDIF
  • Move the receive outside the loop and peel the
    first iteration
  • lo = 1
  • IF (PID == 0) lo = 2
  • hi = 100
  • lastP = CEIL((N+1)/100) - 1
  • IF (PID == lastP) hi = MOD(N,100) + 1
  • IF (PID <= lastP) THEN
  •   IF (lo == 1 .AND. PID /= 0) THEN
  •     RECV(PID-1, B(0), 1)
  •     A(1) = B(0) + C
  •   ENDIF
  •   ! lo = MAX(lo, 1+1) = 2
  •   DO i = 2, hi
  •     A(i) = B(i-1) + C
  •   ENDDO
  •   IF (PID /= lastP)
  •     SEND(PID+1, B(100), 1)
  • ENDIF

29
Communication Generation
  • lo = 1
  • IF (PID == 0) lo = 2
  • hi = 100
  • lastP = CEIL((N+1)/100) - 1
  • IF (PID == lastP) hi = MOD(N,100) + 1
  • IF (PID <= lastP) THEN
  •   IF (lo == 1 .AND. PID /= 0) THEN
  •     RECV(PID-1, B(0), 1)
  •     A(1) = B(0) + C
  •   ENDIF
  •   ! lo = MAX(lo, 1+1) = 2
  •   DO i = 2, hi
  •     A(i) = B(i-1) + C
  •   ENDDO
  •   IF (PID /= lastP)
  •     SEND(PID+1, B(100), 1)
  • ENDIF
  • Moving the SEND ahead of the RECV gives...
  • lo = 1
  • IF (PID == 0) lo = 2
  • hi = 100
  • lastP = CEIL((N+1)/100) - 1
  • IF (PID == lastP) hi = MOD(N,100) + 1
  • IF (PID <= lastP) THEN
  •   IF (PID /= lastP)
  •     SEND(PID+1, B(100), 1)
  •   IF (lo == 1 .AND. PID /= 0) THEN
  •     RECV(PID-1, B(0), 1)
  •     A(1) = B(0) + C
  •   ENDIF
  •   DO i = 2, hi
  •     A(i) = B(i-1) + C
  •   ENDDO
  • ENDIF

30
Communication Generation
  • When is such rearrangement legal?
  • Receive = copy from a global to a local location
  • Send = copy from a local to a global location
  • IF (PID <= lastP) THEN
  • S1:  IF (lo == 1 .AND. PID /= 0) THEN
  •        B(0) = Bg(0)  ! RECV
  •        A(1) = B(0) + C
  •      ENDIF
  •   DO i = 2, hi
  •     A(i) = B(i-1) + C
  •   ENDDO
  • S2:  IF (PID /= lastP) Bg(100) = B(100)  ! SEND
  • ENDIF

No chain of dependences from S1 to S2
31
Communication Generation
  • REAL A(10000), B(10000)
  • !HPF$ DISTRIBUTE A(BLOCK)
  • ...
  • DO I = 1, N
  •   A(I+1) = A(I) + C
  • ENDDO
  • Would be rewritten as...
  • IF (PID <= lastP) THEN
  • S1:  IF (lo == 1 .AND. PID /= 0) THEN
  •        A(0) = Ag(0)  ! RECV
  •        A(1) = A(0) + C
  •      ENDIF
  •   DO i = 2, hi
  •     A(i) = A(i-1) + C
  •   ENDDO
  • S2:  IF (PID /= lastP)
  •        Ag(100) = A(100)  ! SEND
  • ENDIF
  • The rearrangement would not be correct: there is a
    chain of dependences from S1 to S2, because the
    sent value A(100) is computed from the received
    A(0)
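The difference from the earlier B example can be seen concretely: there the sent value B(100) was never written by the loop, while here A(100) is. A minimal Python illustration of the recurrence (the numeric values are made up):

```python
# A version of the loop: A(i) = A(i-1) + C. The value that would be
# sent onward, A(100), transitively depends on the received A(0), so
# the SEND cannot be moved ahead of the RECV.
C = 1.0
A = [0.0] * 101
A[0] = 5.0                 # value delivered by the RECV into A(0)
for i in range(1, 101):    # DO i = 1, 100: A(i) = A(i-1) + C
    A[i] = A[i - 1] + C
print(A[100])              # -> 105.0, i.e. it depends on A[0]
```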

32
Overview
  • Motivation for HPF
  • Overview of compiling HPF programs
  • Basic Loop Compilation for HPF
  • Optimizations for compiling HPF
  • Results and Summary

33
Communication Vectorization
  • REAL A(10000,100), B(10000,100)
  • !HPF$ DISTRIBUTE A(BLOCK,*),
  •       B(BLOCK,*)
  • DO J = 1, M
  •   DO I = 1, N
  •     A(I+1,J) = B(I,J) + C
  •   ENDDO
  • ENDDO
  • Using basic loop compilation gives...
  • DO J = 1, M
  •   lo = 1
  •   IF (PID == 0) lo = 2
  •   hi = 100
  •   lastP = CEIL((N+1)/100) - 1
  •   IF (PID == lastP)
  •     hi = MOD(N,100) + 1
  •   IF (PID <= lastP) THEN
  •     IF (PID /= lastP)
  •       SEND(PID+1, B(100,J), 1)
  •     IF (lo == 1) THEN
  •       RECV(PID-1, B(0,J), 1)
  •       A(1,J) = B(0,J) + C
  •     ENDIF
  •     DO i = lo+1, hi
  •       A(i,J) = B(i-1,J) + C
  •     ENDDO
  •   ENDIF
  • ENDDO

34
Communication Vectorization
  • DO J = 1, M
  •   lo = 1
  •   IF (PID == 0) lo = 2
  •   hi = 100
  •   lastP = CEIL((N+1)/100) - 1
  •   IF (PID == lastP)
  •     hi = MOD(N,100) + 1
  •   IF (PID <= lastP) THEN
  •     IF (PID /= lastP)
  •       SEND(PID+1, B(100,J), 1)
  •     IF (lo == 1) THEN
  •       RECV(PID-1, B(0,J), 1)
  •       A(1,J) = B(0,J) + C
  •     ENDIF
  •     DO i = lo+1, hi
  •       A(i,J) = B(i-1,J) + C
  •     ENDDO
  •   ENDIF
  • ENDDO
  • lo = 1
  • IF (PID == 0) lo = 2
  • hi = 100
  • lastP = CEIL((N+1)/100) - 1
  • IF (PID == lastP) hi = MOD(N,100) + 1
  • IF (PID <= lastP) THEN
  •   DO J = 1, M
  •     IF (PID /= lastP)
  •       SEND(PID+1, B(100,J), 1)
  •   ENDDO
  •   DO J = 1, M
  •     IF (lo == 1) THEN
  •       RECV(PID-1, B(0,J), 1)
  •       A(1,J) = B(0,J) + C
  •     ENDIF
  •   ENDDO
  •   DO J = 1, M
  •     DO i = lo+1, hi
  •       A(i,J) = B(i-1,J) + C
  •     ENDDO
  •   ENDDO
  • ENDIF
Distribute J Loop
35
Communication Vectorization
  • lo = 1
  • IF (PID == 0) lo = 2
  • hi = 100
  • lastP = CEIL((N+1)/100) - 1
  • IF (PID == lastP) hi = MOD(N,100) + 1
  • IF (PID <= lastP) THEN
  •   DO J = 1, M
  •     IF (PID /= lastP)
  •       SEND(PID+1, B(100,J), 1)
  •   ENDDO
  •   DO J = 1, M
  •     IF (lo == 1) THEN
  •       RECV(PID-1, B(0,J), 1)
  •       A(1,J) = B(0,J) + C
  •     ENDIF
  •   ENDDO
  •   DO J = 1, M
  •     DO i = lo+1, hi
  •       A(i,J) = B(i-1,J) + C
  •     ENDDO
  •   ENDDO
  • ENDIF
  • lo = 1
  • IF (PID == 0) lo = 2
  • hi = 100
  • lastP = CEIL((N+1)/100) - 1
  • IF (PID == lastP) hi = MOD(N,100) + 1
  • IF (PID <= lastP) THEN
  •   IF (lo == 1) THEN
  •     RECV(PID-1, B(0,1:M), M)
  •     DO J = 1, M
  •       A(1,J) = B(0,J) + C
  •     ENDDO
  •   ENDIF
  •   DO J = 1, M
  •     DO i = lo+1, hi
  •       A(i,J) = B(i-1,J) + C
  •     ENDDO
  •   ENDDO
  •   IF (PID /= lastP)
  •     SEND(PID+1, B(100,1:M), M)
  • ENDIF

36
Communication Vectorization
  • DO J = 1, M
  •   lo = 1
  •   IF (PID == 0) lo = 2
  •   hi = 100
  •   lastP = CEIL((N+1)/100) - 1
  •   IF (PID == lastP)
  •     hi = MOD(N,100) + 1
  •   IF (PID <= lastP) THEN
  • S1:   IF (PID /= lastP)
  •         Bg(100,J) = B(100,J)
  •     IF (lo == 1) THEN
  • S2:     B(0,J) = Bg(0,J)
  • S3:     A(1,J) = B(0,J) + C
  •     ENDIF
  •     DO i = lo+1, hi
  • S4:     A(i,J) = B(i-1,J) + C
  •     ENDDO
  •   ENDIF
  • ENDDO
  • Communication statements resulting from an inner
    loop can be vectorized with respect to an outer
    loop if the communication statements are not
    involved in a recurrence carried by the outer loop

37
Communication Vectorization
  • REAL A(10000,100), B(10000,100)
  • !HPF$ DISTRIBUTE A(BLOCK,*),
  •       B(BLOCK,*)
  • DO J = 1, M
  •   DO I = 1, N
  •     A(I+1,J) = A(I,J) + B(I,J)
  •   ENDDO
  • ENDDO
  • Can sends be done before the receives?
  • Can communication be vectorized?
  • REAL A(10000,100)
  • !HPF$ DISTRIBUTE A(BLOCK,*)
  • DO J = 1, M
  •   DO I = 1, N
  •     A(I+1,J+1) = A(I,J) + C
  •   ENDDO
  • ENDDO
  • Can sends be done before the receives?
  • Can communication be fully vectorized?

38
Overlapping Communication and Computation
  • lo = 1
  • IF (PID == 0) lo = 2
  • hi = 100
  • lastP = CEIL((N+1)/100) - 1
  • IF (PID == lastP) hi = MOD(N,100) + 1
  • IF (PID <= lastP) THEN
  • S0:  IF (PID /= lastP)
  •        SEND(PID+1, B(100), 1)
  • S1:  IF (lo == 1 .AND. PID /= 0) THEN
  •        RECV(PID-1, B(0), 1)
  •        A(1) = B(0) + C
  •      ENDIF
  • L1:  DO i = 2, hi
  •        A(i) = B(i-1) + C
  •      ENDDO
  • ENDIF
  • lo = 1
  • IF (PID == 0) lo = 2
  • hi = 100
  • lastP = CEIL((N+1)/100) - 1
  • IF (PID == lastP) hi = MOD(N,100) + 1
  • IF (PID <= lastP) THEN
  • S0:  IF (PID /= lastP)
  •        SEND(PID+1, B(100), 1)
  • L1:  DO i = 2, hi
  •        A(i) = B(i-1) + C
  •      ENDDO
  • S1:  IF (lo == 1 .AND. PID /= 0) THEN
  •        RECV(PID-1, B(0), 1)
  •        A(1) = B(0) + C
  •      ENDIF
  • ENDIF

39
Pipelining
  • REAL A(10000,100)
  • !HPF$ DISTRIBUTE A(BLOCK,*)
  • DO J = 1, M
  •   DO I = 1, N
  •     A(I+1,J) = A(I,J) + C
  •   ENDDO
  • ENDDO
  • Initial code generation for the I loop gives...
  • lo = 1
  • IF (PID == 0) lo = 2
  • hi = 100
  • lastP = CEIL((N+1)/100) - 1
  • IF (PID == lastP) hi = MOD(N,100) + 1
  • IF (PID <= lastP) THEN
  •   DO J = 1, M
  •     IF (lo == 1) THEN
  •       RECV(PID-1, A(0,J), 1)
  •       A(1,J) = A(0,J) + C
  •     ENDIF
  •     DO i = lo+1, hi
  •       A(i,J) = A(i-1,J) + C
  •     ENDDO
  •     IF (PID /= lastP)
  •       SEND(PID+1, A(100,J), 1)
  •   ENDDO
  • ENDIF

Communication can be vectorized, but that gives up the pipelined parallelism
40
Pipelining
  • Pipelined parallelism with communication

41
Pipelining
  • Pipelined parallelism with communication overhead

42
Pipelining Blocking
  • lo = 1
  • IF (PID == 0) lo = 2
  • hi = 100
  • lastP = CEIL((N+1)/100) - 1
  • IF (PID == lastP) hi = MOD(N,100) + 1
  • IF (PID <= lastP) THEN
  •   DO J = 1, M
  •     IF (lo == 1) THEN
  •       RECV(PID-1, A(0,J), 1)
  •       A(1,J) = A(0,J) + C
  •     ENDIF
  •     DO i = lo+1, hi
  •       A(i,J) = A(i-1,J) + C
  •     ENDDO
  •     IF (PID /= lastP)
  •       SEND(PID+1, A(100,J), 1)
  •   ENDDO
  • ENDIF
  • ...
  • IF (PID <= lastP) THEN
  •   DO J = 1, M, K
  •     IF (lo == 1) THEN
  •       RECV(PID-1, A(0,J:J+K-1), K)
  •       DO j = J, J+K-1
  •         A(1,j) = A(0,j) + C
  •       ENDDO
  •     ENDIF
  •     DO j = J, J+K-1
  •       DO i = lo+1, hi
  •         A(i,j) = A(i-1,j) + C
  •       ENDDO
  •     ENDDO
  •     IF (PID /= lastP)
  •       SEND(PID+1, A(100,J:J+K-1), K)
  •   ENDDO
  • ENDIF
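The blocking factor K trades communication overhead against pipeline delay: larger K means fewer, larger messages, but each processor's successor waits longer before it can start. The message-count side of that trade-off is simple arithmetic (Python sketch; `message_count` is an illustrative helper, not from the slides):

```python
import math

def message_count(M, K):
    """With blocking factor K, the M single-column messages of the
    fully pipelined version are coalesced into ceil(M / K) messages
    of up to K columns each."""
    return math.ceil(M / K)

# M = 100 columns: K = 1 is fully pipelined (100 messages); K = 100
# is fully vectorized (1 message, but the pipeline is serialized).
for K in (1, 10, 100):
    print(K, message_count(100, K))
```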

43
Other Optimizations
  • Alignment and Replication
  • Identification of Common recurrences
  • Storage Management
  • Minimize temporary storage used for communication
  • Space taken for temporary storage should be at
    most equal to the space taken by the arrays
  • Interprocedural Optimizations

44
Results
45
Summary
  • HPF is easy to code
  • But hard to compile
  • Steps required to compile HPF programs
  • Basic loop compilation
  • Communication generation
  • Optimizations
  • Communication vectorization
  • Overlapping communication with computation
  • Pipelining