Compiling High Performance Fortran - PowerPoint PPT Presentation

1
Compiling High Performance Fortran
  • Allen and Kennedy, Chapter 14

2
Overview
  • Motivation for HPF
  • Overview of compiling HPF programs
  • Basic Loop Compilation for HPF
  • Optimizations for compiling HPF
  • Results and Summary

3
Motivation for HPF
  • Message passing is required to communicate data
    between processors
  • Approach 1: Use MPI calls in Fortran/C code

Scalable Distributed Memory Multiprocessor
4
Motivation for HPF
  • Consider the following sum reduction
  • PROGRAM SUM
  • REAL A(10000)
  • READ (9) A
  • SUM = 0.0
  • DO I = 1, 10000
  •   SUM = SUM + A(I)
  • ENDDO
  • PRINT SUM
  • END
  • PROGRAM SUM
  • REAL A(100), BUFF(100)
  • IF (PID == 0) THEN
  •   DO IP = 0, 99
  •     READ (9) BUFF(1:100)
  •     IF (IP == 0)
  •       A(1:100) = BUFF(1:100)
  •     ELSE SEND(IP, BUFF, 100)
  •   ENDDO
  • ELSE RECV(0, A, 100)
  • ENDIF
  • /* Actual sum reduction code here */
  • IF (PID == 0) SEND(1, SUM, 1)
  • IF (PID > 0) THEN
  •   RECV(PID-1, T, 1)
  •   SUM = SUM + T
  •   IF (PID < 99) SEND(PID+1, SUM, 1)
  •   ELSE SEND(0, SUM, 1)
  • ENDIF
  • IF (PID == 0) PRINT SUM

MPI implementation
5
Motivation for HPF
  • Disadvantages of MPI approach
  • User has to rewrite the program in SPMD form
    (Single Program Multiple Data)
  • User has to manage data movement (send/receive),
    data placement, and synchronization
  • Too messy and not easy to master

6
Motivation for HPF
  • Approach 2: Use HPF
  • HPF is an extended version of Fortran 90
  • HPF has Fortran 90 features and a few directives
  • Directives
  • Tell how data is laid out in processor memories
    in the parallel machine configuration. For example,
  • !HPF$ DISTRIBUTE A(BLOCK)
  • Assist in identifying parallelism. For example,
  • !HPF$ INDEPENDENT

7
Motivation for HPF
  • The same sum reduction code
  • PROGRAM SUM
  • REAL A(10000)
  • READ (9) A
  • SUM = 0.0
  • DO I = 1, 10000
  •   SUM = SUM + A(I)
  • ENDDO
  • PRINT SUM
  • END
  • When written in HPF...
  • PROGRAM SUM
  • REAL A(10000)
  • !HPF$ DISTRIBUTE A(BLOCK)
  • READ (9) A
  • SUM = 0.0
  • DO I = 1, 10000
  •   SUM = SUM + A(I)
  • ENDDO
  • PRINT SUM
  • END
  • Minimum modification
  • Easy to write
  • Now compiler has to do more work

8
Motivation for HPF
  • Advantages of HPF
  • User only needs to write a few simple directives,
    and need not rewrite the whole program in SPMD form
  • User does not need to manage data movement
    (send/receive) or synchronization
  • Simple and easy to master

9
Overview
  • Motivation for HPF
  • Overview of compiling HPF programs
  • Basic Loop Compilation for HPF
  • Optimizations for compiling HPF
  • Results and Summary

10
HPF Compilation Overview
  • Dependence Analysis
  • Used for communication analysis
  • Fact used: no dependence is carried by the I loops
  • Running example
  • REAL A(10000), B(10000)
  • !HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)
  • DO J = 1, 10000
  •   DO I = 2, 10000
  • S1:   A(I) = B(I-1) + C
  •   ENDDO
  •   DO I = 1, 10000
  • S2:   B(I) = A(I)
  •   ENDDO
  • ENDDO

11
HPF Compilation Overview
  • Running example
  • REAL A(10000), B(10000)
  • !HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)
  • DO J = 1, 10000
  •   DO I = 2, 10000
  • S1:   A(I) = B(I-1) + C
  •   ENDDO
  •   DO I = 1, 10000
  • S2:   B(I) = A(I)
  •   ENDDO
  • ENDDO
  • Dependence Analysis
  • Distribution Analysis

12
HPF Compilation Overview
  • Running example
  • REAL A(10000), B(10000)
  • !HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)
  • DO J = 1, 10000
  •   DO I = 2, 10000
  • S1:   A(I) = B(I-1) + C
  •   ENDDO
  •   DO I = 1, 10000
  • S2:   B(I) = A(I)
  •   ENDDO
  • ENDDO
  • Dependence Analysis
  • Distribution Analysis
  • Computation Partitioning
  • Partition so as to distribute work of the I loops

13
HPF Compilation Overview
  • REAL A(1:100), B(0:100)
  • !HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)
  • DO J = 1, 10000
  • I1:  IF (PID /= 99) SEND(PID+1, B(100), 1)
  • I2:  IF (PID /= 0) THEN
  •        RECV(PID-1, B(0), 1)
  •        A(1) = B(0) + C
  •      ENDIF
  •   DO I = 2, 100
  • S1:   A(I) = B(I-1) + C
  •   ENDDO
  •   DO I = 1, 100
  • S2:   B(I) = A(I)
  •   ENDDO
  • ENDDO
  • Dependence Analysis
  • Distribution Analysis
  • Computation Partitioning
  • Communication Analysis and placement
  • Communication required for B(0) in each iteration
    of the J loop
  • Shadow region: B(0)

14
HPF Compilation Overview
  • REAL A(1:100), B(0:100)
  • !HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)
  • DO J = 1, 10000
  • I1:  IF (PID /= 99) SEND(PID+1, B(100), 1)
  •   DO I = 2, 100
  • S1:   A(I) = B(I-1) + C
  •   ENDDO
  • I2:  IF (PID /= 0) THEN
  •        RECV(PID-1, B(0), 1)
  •        A(1) = B(0) + C
  •      ENDIF
  •   DO I = 1, 100
  • S2:   B(I) = A(I)
  •   ENDDO
  • ENDDO
  • Dependence Analysis
  • Distribution Analysis
  • Computation Partitioning
  • Communication Analysis and placement
  • Optimization
  • Aggregation
  • Overlap communication and computation
  • Recognition of reduction

15
Overview
  • Motivation for HPF
  • Overview of compiling HPF programs
  • Basic Loop Compilation for HPF
  • Optimizations for compiling HPF
  • Results and Summary

16
Basic Loop Compilation
  • Distribution Propagation and analysis
  • Analyze what distribution holds for a given array
    at a given point in the program
  • Difficult due to
  • REALIGN and REDISTRIBUTE directives
  • Distribution of formal parameters inherited from
    calling procedure
  • Use Reaching Decompositions data flow analysis
    and its interprocedural version

17
Basic Loop Compilation
  • For simplicity, assume a single distribution for an
    array at all points in a subprogram
  • Define the block size of a BLOCK distribution
  • For example, suppose an array A of size N is block-
    distributed over p processors
  • Block size: b = CEIL(N/p)
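The block-size rule can be checked with a one-line calculation (a Python sketch for illustration only; `block_size` is a made-up helper, not part of any HPF compiler):

```python
import math

def block_size(n, p):
    """Block size for a BLOCK distribution: b = ceil(n / p)."""
    return math.ceil(n / p)

# The running example: 10000 elements over 100 processors.
print(block_size(10000, 100))  # -> 100
```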

18
Basic Loop Compilation
  • Iteration Partitioning
  • Dividing work among processors
  • Computation partitioning
  • Determine which iterations of a loop will be
    executed on which processor
  • Owner-computes rule
  • REAL A(10000)
  • !HPF$ DISTRIBUTE A(BLOCK)
  • DO I = 1, 10000
  •   A(I) = A(I) + C
  • ENDDO
  • Iteration I is executed on the owner of A(I)
  • With 100 processors: the first 100 iterations run on
    processor 0, the next 100 on processor 1, and so on
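The owner-computes rule for this example can be sketched as follows (Python for illustration; `owner` is a hypothetical helper based on the block-size formula, not compiler code):

```python
import math

def owner(i, n=10000, p=100):
    """Owner of A(i) under a BLOCK distribution of n elements over p
    processors: elements fall into contiguous blocks of b = ceil(n/p)."""
    b = math.ceil(n / p)
    return (i - 1) // b

# Iteration I runs on the owner of A(I): the first 100 iterations on
# processor 0, the next 100 on processor 1, and so on.
print(owner(1), owner(100), owner(101), owner(10000))  # -> 0 0 1 99
```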

19
Iteration Partitioning
  • With multiple statements in a loop or in a
    recurrence: choose a partitioning reference
  • The processor responsible for performing the
    computation for iteration I is the owner of the
    partitioning reference in that iteration
  • Set of indices executed on processor p: the
    iterations whose partitioning reference falls in
    p's block

20
Iteration Partitioning
  • Have to map the global loop index to a local loop
    index
  • The smallest global index executed on a processor
    maps to 1
  • REAL A(10000)
  • !HPF$ DISTRIBUTE A(BLOCK)
  • DO I = 1, N
  •   A(I+1) = B(I) + C
  • ENDDO

21
Iteration Partitioning
  • REAL A(10000), B(10000)
  • !HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)
  • DO I = 1, N
  •   A(I+1) = B(I) + C
  • ENDDO
  • Map the global iteration space I to the local
    iteration space i as follows: i = I - 100*PID + 1
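The global-to-local mapping can be sketched directly (Python for illustration; the formula i = I - 100*PID + 1 is inferred from the generated code on the following slides, and both helper names are made up):

```python
BLOCK = 100  # block size from the running example

def to_local(I, pid, b=BLOCK):
    """Global iteration I -> local index i on processor pid; the
    smallest global iteration assigned to pid maps to i = 1."""
    return I - b * pid + 1

def to_global(i, pid, b=BLOCK):
    """Inverse mapping: local index i on processor pid -> global I."""
    return b * pid + i - 1

# Processor 3 executes global iterations 300..399 as local 1..100.
print(to_local(300, 3), to_local(399, 3))  # -> 1 100
print(to_global(1, 3))                     # -> 300
```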

22
Iteration Partitioning
  • Adjust array subscripts for the local iterations:
    A(I+1) becomes A(i) and B(I) becomes B(i-1)

23
Iteration Partitioning
  • For interior processors the code becomes...
  • DO i = 1, 100
  •   A(i) = B(i-1) + C
  • ENDDO
  • Adjust the lower bound for the first processor and
    the upper bound of the last processor to take care
    of boundary conditions...
  • lo = 1
  • IF (PID == 0) lo = 2
  • hi = 100
  • IF (PID == CEIL((N+1)/100)-1) hi = MOD(N,100) + 1
  • DO i = lo, hi
  •   A(i) = B(i-1) + C
  • ENDDO

24
Communication Generation
  • For our example, no communication is required for
    iterations in [100p+1, 100p+99]
  • Iterations which require receiving data are
    [100p, 100p]
  • Iterations which require sending data are
    [100p+100, 100p+100]

25
Communication Generation
  • REAL A(10000), B(10000)
  • !HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)
  • ...
  • DO I = 1, N
  •   A(I+1) = B(I) + C
  • ENDDO
  • Receive required for iterations in [100p, 100p]
  • Send required for iterations in
    [100p+100, 100p+100]
  • No communication required for iterations in
    [100p+1, 100p+99]
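These three intervals can be checked mechanically (Python sketch; `classify` is an illustrative helper built from the intervals above, not compiler code):

```python
BLOCK = 100  # block size from the running example

def classify(I, pid, b=BLOCK):
    """For A(I+1) = B(I) + C under a BLOCK distribution, classify global
    iteration I from processor pid's point of view: must pid receive
    B(I) for it, send a B value for it, or neither?"""
    if I == b * pid:              # reads B(b*pid), owned by pid-1
        return "recv"
    if I == b * pid + b:          # runs on pid+1, reads B owned by pid
        return "send"
    if b * pid < I < b * pid + b:
        return "local"
    return "other"                # iteration not adjacent to pid's block

print(classify(300, 3))  # -> recv
print(classify(350, 3))  # -> local
print(classify(400, 3))  # -> send
```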

26
Communication Generation
  • After inserting the receive...
  • lo = 1
  • IF (PID == 0) lo = 2
  • hi = 100
  • IF (PID == CEIL((N+1)/100)-1)
  •   hi = MOD(N,100) + 1
  • DO i = lo, hi
  •   IF (i == 1 .AND. PID /= 0)
  •     RECV(PID-1, B(0), 1)
  •   A(i) = B(i-1) + C
  • ENDDO
  • The send must happen in the 101st iteration
  • lo = 1
  • IF (PID == 0) lo = 2
  • hi = 100
  • lastP = CEIL((N+1)/100) - 1
  • IF (PID == lastP)
  •   hi = MOD(N,100) + 1
  • DO i = lo, hi+1
  •   IF (i == 1 .AND. PID /= 0)
  •     RECV(PID-1, B(0), 1)
  •   IF (i <= hi) THEN
  •     A(i) = B(i-1) + C
  •   ENDIF
  •   IF (i == hi+1 .AND. PID /= lastP)
  •     SEND(PID+1, B(100), 1)
  • ENDDO

27
Communication Generation
  • lo = 1
  • IF (PID == 0) lo = 2
  • hi = 100
  • lastP = CEIL((N+1)/100) - 1
  • IF (PID == lastP) hi = MOD(N,100) + 1
  • DO i = lo, hi+1
  •   IF (i == 1 .AND. PID /= 0)
  •     RECV(PID-1, B(0), 1)
  •   IF (i <= hi) THEN
  •     A(i) = B(i-1) + C
  •   ENDIF
  •   IF (i == hi+1 .AND. PID /= lastP)
  •     SEND(PID+1, B(100), 1)
  • ENDDO
  • Move the SEND outside the loop
  • lo = 1
  • IF (PID == 0) lo = 2
  • hi = 100
  • lastP = CEIL((N+1)/100) - 1
  • IF (PID == lastP) hi = MOD(N,100) + 1
  • IF (PID <= lastP) THEN
  •   DO i = lo, hi
  •     IF (i == 1 .AND. PID /= 0)
  •       RECV(PID-1, B(0), 1)
  •     A(i) = B(i-1) + C
  •   ENDDO
  •   IF (PID /= lastP)
  •     SEND(PID+1, B(100), 1)
  • ENDIF

28
Communication Generation
  • lo = 1
  • IF (PID == 0) lo = 2
  • hi = 100
  • lastP = CEIL((N+1)/100) - 1
  • IF (PID == lastP) hi = MOD(N,100) + 1
  • IF (PID <= lastP) THEN
  •   DO i = lo, hi
  •     IF (i == 1 .AND. PID /= 0)
  •       RECV(PID-1, B(0), 1)
  •     A(i) = B(i-1) + C
  •   ENDDO
  •   IF (PID /= lastP)
  •     SEND(PID+1, B(100), 1)
  • ENDIF
  • Move the receive outside the loop and peel the
    first iteration
  • lo = 1
  • IF (PID == 0) lo = 2
  • hi = 100
  • lastP = CEIL((N+1)/100) - 1
  • IF (PID == lastP) hi = MOD(N,100) + 1
  • IF (PID <= lastP) THEN
  •   IF (lo == 1 .AND. PID /= 0) THEN
  •     RECV(PID-1, B(0), 1)
  •     A(1) = B(0) + C
  •   ENDIF
  •   ! lo = MAX(lo, 1+1) = 2
  •   DO i = 2, hi
  •     A(i) = B(i-1) + C
  •   ENDDO
  •   IF (PID /= lastP)
  •     SEND(PID+1, B(100), 1)
  • ENDIF

29
Communication Generation
  • lo = 1
  • IF (PID == 0) lo = 2
  • hi = 100
  • lastP = CEIL((N+1)/100) - 1
  • IF (PID == lastP) hi = MOD(N,100) + 1
  • IF (PID <= lastP) THEN
  •   IF (lo == 1 .AND. PID /= 0) THEN
  •     RECV(PID-1, B(0), 1)
  •     A(1) = B(0) + C
  •   ENDIF
  •   ! lo = MAX(lo, 1+1) = 2
  •   DO i = 2, hi
  •     A(i) = B(i-1) + C
  •   ENDDO
  •   IF (PID /= lastP)
  •     SEND(PID+1, B(100), 1)
  • ENDIF
  • Moving the SEND ahead of the RECV gives...
  • lo = 1
  • IF (PID == 0) lo = 2
  • hi = 100
  • lastP = CEIL((N+1)/100) - 1
  • IF (PID == lastP) hi = MOD(N,100) + 1
  • IF (PID <= lastP) THEN
  •   IF (PID /= lastP)
  •     SEND(PID+1, B(100), 1)
  •   IF (lo == 1 .AND. PID /= 0) THEN
  •     RECV(PID-1, B(0), 1)
  •     A(1) = B(0) + C
  •   ENDIF
  •   DO i = 2, hi
  •     A(i) = B(i-1) + C
  •   ENDDO
  • ENDIF

30
Communication Generation
  • When is such rearrangement legal?
  • Receive = copy from a global to a local location
  • Send = copy from a local to a global location
  • IF (PID <= lastP) THEN
  • S1:  IF (lo == 1 .AND. PID /= 0) THEN
  •        B(0) = Bg(0)  ! RECV
  •        A(1) = B(0) + C
  •      ENDIF
  •   DO i = 2, hi
  •     A(i) = B(i-1) + C
  •   ENDDO
  • S2:  IF (PID /= lastP) Bg(100) = B(100)  ! SEND
  • ENDIF

No chain of dependences from S1 to S2
31
Communication Generation
  • REAL A(10000), B(10000)
  • !HPF$ DISTRIBUTE A(BLOCK)
  • ...
  • DO I = 1, N
  •   A(I+1) = A(I) + C
  • ENDDO
  • Would be rewritten as...
  • IF (PID <= lastP) THEN
  • S1:  IF (lo == 1 .AND. PID /= 0) THEN
  •        A(0) = Ag(0)  ! RECV
  •        A(1) = A(0) + C
  •      ENDIF
  •   DO i = 2, hi
  •     A(i) = A(i-1) + C
  •   ENDDO
  • S2:  IF (PID /= lastP)
  •        Ag(100) = A(100)  ! SEND
  • ENDIF
  • The rearrangement would not be correct: there is a
    chain of dependences from S1 to S2, because the
    sent value A(100) is computed from the received
    A(0)
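The difference from the earlier B example can be seen concretely: there the sent value B(100) was never written by the loop, while here A(100) is. A minimal Python illustration of the recurrence (the numeric values are made up):

```python
# A version of the loop: A(i) = A(i-1) + C. The value that would be
# sent onward, A(100), transitively depends on the received A(0), so
# the SEND cannot be moved ahead of the RECV.
C = 1.0
A = [0.0] * 101
A[0] = 5.0                 # value delivered by the RECV into A(0)
for i in range(1, 101):    # DO i = 1, 100: A(i) = A(i-1) + C
    A[i] = A[i - 1] + C
print(A[100])              # -> 105.0, i.e. it depends on A[0]
```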

32
Overview
  • Motivation for HPF
  • Overview of compiling HPF programs
  • Basic Loop Compilation for HPF
  • Optimizations for compiling HPF
  • Results and Summary

33
Communication Vectorization
  • REAL A(10000,100), B(10000,100)
  • !HPF$ DISTRIBUTE A(BLOCK,*),
  •       B(BLOCK,*)
  • DO J = 1, M
  •   DO I = 1, N
  •     A(I+1,J) = B(I,J) + C
  •   ENDDO
  • ENDDO
  • Using basic loop compilation gives...
  • DO J = 1, M
  •   lo = 1
  •   IF (PID == 0) lo = 2
  •   hi = 100
  •   lastP = CEIL((N+1)/100) - 1
  •   IF (PID == lastP)
  •     hi = MOD(N,100) + 1
  •   IF (PID <= lastP) THEN
  •     IF (PID /= lastP)
  •       SEND(PID+1, B(100,J), 1)
  •     IF (lo == 1) THEN
  •       RECV(PID-1, B(0,J), 1)
  •       A(1,J) = B(0,J) + C
  •     ENDIF
  •     DO i = lo+1, hi
  •       A(i,J) = B(i-1,J) + C
  •     ENDDO
  •   ENDIF
  • ENDDO

34
Communication Vectorization
  • DO J = 1, M
  •   lo = 1
  •   IF (PID == 0) lo = 2
  •   hi = 100
  •   lastP = CEIL((N+1)/100) - 1
  •   IF (PID == lastP)
  •     hi = MOD(N,100) + 1
  •   IF (PID <= lastP) THEN
  •     IF (PID /= lastP)
  •       SEND(PID+1, B(100,J), 1)
  •     IF (lo == 1) THEN
  •       RECV(PID-1, B(0,J), 1)
  •       A(1,J) = B(0,J) + C
  •     ENDIF
  •     DO i = lo+1, hi
  •       A(i,J) = B(i-1,J) + C
  •     ENDDO
  •   ENDIF
  • ENDDO
  • lo = 1
  • IF (PID == 0) lo = 2
  • hi = 100
  • lastP = CEIL((N+1)/100) - 1
  • IF (PID == lastP) hi = MOD(N,100) + 1
  • IF (PID <= lastP) THEN
  •   DO J = 1, M
  •     IF (PID /= lastP)
  •       SEND(PID+1, B(100,J), 1)
  •   ENDDO
  •   DO J = 1, M
  •     IF (lo == 1) THEN
  •       RECV(PID-1, B(0,J), 1)
  •       A(1,J) = B(0,J) + C
  •     ENDIF
  •   ENDDO
  •   DO J = 1, M
  •     DO i = lo+1, hi
  •       A(i,J) = B(i-1,J) + C
  •     ENDDO
  •   ENDDO
  • ENDIF
Distribute J Loop
35
Communication Vectorization
  • lo = 1
  • IF (PID == 0) lo = 2
  • hi = 100
  • lastP = CEIL((N+1)/100) - 1
  • IF (PID == lastP) hi = MOD(N,100) + 1
  • IF (PID <= lastP) THEN
  •   DO J = 1, M
  •     IF (PID /= lastP)
  •       SEND(PID+1, B(100,J), 1)
  •   ENDDO
  •   DO J = 1, M
  •     IF (lo == 1) THEN
  •       RECV(PID-1, B(0,J), 1)
  •       A(1,J) = B(0,J) + C
  •     ENDIF
  •   ENDDO
  •   DO J = 1, M
  •     DO i = lo+1, hi
  •       A(i,J) = B(i-1,J) + C
  •     ENDDO
  •   ENDDO
  • ENDIF
  • lo = 1
  • IF (PID == 0) lo = 2
  • hi = 100
  • lastP = CEIL((N+1)/100) - 1
  • IF (PID == lastP) hi = MOD(N,100) + 1
  • IF (PID <= lastP) THEN
  •   IF (lo == 1) THEN
  •     RECV(PID-1, B(0,1:M), M)
  •     DO J = 1, M
  •       A(1,J) = B(0,J) + C
  •     ENDDO
  •   ENDIF
  •   DO J = 1, M
  •     DO i = lo+1, hi
  •       A(i,J) = B(i-1,J) + C
  •     ENDDO
  •   ENDDO
  •   IF (PID /= lastP)
  •     SEND(PID+1, B(100,1:M), M)
  • ENDIF

36
Communication Vectorization
  • DO J = 1, M
  •   lo = 1
  •   IF (PID == 0) lo = 2
  •   hi = 100
  •   lastP = CEIL((N+1)/100) - 1
  •   IF (PID == lastP)
  •     hi = MOD(N,100) + 1
  •   IF (PID <= lastP) THEN
  • S1:   IF (PID /= lastP)
  •         Bg(100,J) = B(100,J)
  •     IF (lo == 1) THEN
  • S2:     B(0,J) = Bg(0,J)
  • S3:     A(1,J) = B(0,J) + C
  •     ENDIF
  •     DO i = lo+1, hi
  • S4:     A(i,J) = B(i-1,J) + C
  •     ENDDO
  •   ENDIF
  • ENDDO
  • Communication statements resulting from an inner
    loop can be vectorized with respect to an outer
    loop if the communication statements are not
    involved in a recurrence carried by the outer loop

37
Communication Vectorization
  • REAL A(10000,100), B(10000,100)
  • !HPF$ DISTRIBUTE A(BLOCK,*),
  •       B(BLOCK,*)
  • DO J = 1, M
  •   DO I = 1, N
  •     A(I+1,J) = A(I,J) + B(I,J)
  •   ENDDO
  • ENDDO
  • Can sends be done before the receives?
  • Can communication be vectorized?
  • REAL A(10000,100)
  • !HPF$ DISTRIBUTE A(BLOCK,*)
  • DO J = 1, M
  •   DO I = 1, N
  •     A(I+1,J+1) = A(I,J) + C
  •   ENDDO
  • ENDDO
  • Can sends be done before the receives?
  • Can communication be fully vectorized?

38
Overlapping Communication and Computation
  • lo = 1
  • IF (PID == 0) lo = 2
  • hi = 100
  • lastP = CEIL((N+1)/100) - 1
  • IF (PID == lastP) hi = MOD(N,100) + 1
  • IF (PID <= lastP) THEN
  • S0:  IF (PID /= lastP)
  •        SEND(PID+1, B(100), 1)
  • S1:  IF (lo == 1 .AND. PID /= 0) THEN
  •        RECV(PID-1, B(0), 1)
  •        A(1) = B(0) + C
  •      ENDIF
  • L1:  DO i = 2, hi
  •        A(i) = B(i-1) + C
  •      ENDDO
  • ENDIF
  • lo = 1
  • IF (PID == 0) lo = 2
  • hi = 100
  • lastP = CEIL((N+1)/100) - 1
  • IF (PID == lastP) hi = MOD(N,100) + 1
  • IF (PID <= lastP) THEN
  • S0:  IF (PID /= lastP)
  •        SEND(PID+1, B(100), 1)
  • L1:  DO i = 2, hi
  •        A(i) = B(i-1) + C
  •      ENDDO
  • S1:  IF (lo == 1 .AND. PID /= 0) THEN
  •        RECV(PID-1, B(0), 1)
  •        A(1) = B(0) + C
  •      ENDIF
  • ENDIF

39
Pipelining
  • REAL A(10000,100)
  • !HPF$ DISTRIBUTE A(BLOCK,*)
  • DO J = 1, M
  •   DO I = 1, N
  •     A(I+1,J) = A(I,J) + C
  •   ENDDO
  • ENDDO
  • Initial code generation for the I loop gives...
  • lo = 1
  • IF (PID == 0) lo = 2
  • hi = 100
  • lastP = CEIL((N+1)/100) - 1
  • IF (PID == lastP) hi = MOD(N,100) + 1
  • IF (PID <= lastP) THEN
  •   DO J = 1, M
  •     IF (lo == 1) THEN
  •       RECV(PID-1, A(0,J), 1)
  •       A(1,J) = A(0,J) + C
  •     ENDIF
  •     DO i = lo+1, hi
  •       A(i,J) = A(i-1,J) + C
  •     ENDDO
  •     IF (PID /= lastP)
  •       SEND(PID+1, A(100,J), 1)
  •   ENDDO
  • ENDIF

Communication can be vectorized, but that gives up the pipelined parallelism
40
Pipelining
  • Pipelined parallelism with communication

41
Pipelining
  • Pipelined parallelism with communication overhead

42
Pipelining Blocking
  • lo = 1
  • IF (PID == 0) lo = 2
  • hi = 100
  • lastP = CEIL((N+1)/100) - 1
  • IF (PID == lastP) hi = MOD(N,100) + 1
  • IF (PID <= lastP) THEN
  •   DO J = 1, M
  •     IF (lo == 1) THEN
  •       RECV(PID-1, A(0,J), 1)
  •       A(1,J) = A(0,J) + C
  •     ENDIF
  •     DO i = lo+1, hi
  •       A(i,J) = A(i-1,J) + C
  •     ENDDO
  •     IF (PID /= lastP)
  •       SEND(PID+1, A(100,J), 1)
  •   ENDDO
  • ENDIF
  • ...
  • IF (PID <= lastP) THEN
  •   DO J = 1, M, K
  •     IF (lo == 1) THEN
  •       RECV(PID-1, A(0,J:J+K-1), K)
  •       DO j = J, J+K-1
  •         A(1,j) = A(0,j) + C
  •       ENDDO
  •     ENDIF
  •     DO j = J, J+K-1
  •       DO i = lo+1, hi
  •         A(i,j) = A(i-1,j) + C
  •       ENDDO
  •     ENDDO
  •     IF (PID /= lastP)
  •       SEND(PID+1, A(100,J:J+K-1), K)
  •   ENDDO
  • ENDIF
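The blocking factor K trades communication overhead against pipeline delay: larger K means fewer, larger messages, but each processor's successor waits longer before it can start. The message-count side of that trade-off is simple arithmetic (Python sketch; `message_count` is an illustrative helper, not from the slides):

```python
import math

def message_count(M, K):
    """With blocking factor K, the M single-column messages of the
    fully pipelined version are coalesced into ceil(M / K) messages
    of up to K columns each."""
    return math.ceil(M / K)

# M = 100 columns: K = 1 is fully pipelined (100 messages); K = 100
# is fully vectorized (1 message, but the pipeline is serialized).
for K in (1, 10, 100):
    print(K, message_count(100, K))
```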

43
Other Optimizations
  • Alignment and Replication
  • Identification of Common recurrences
  • Storage Management
  • Minimize temporary storage used for communication
  • Space taken for temporary storage should be at
    most equal to the space taken by the arrays
  • Interprocedural Optimizations

44
Results
45
Summary
  • HPF is easy to code
  • But hard to compile
  • Steps required to compile HPF programs
  • Basic loop compilation
  • Communication generation
  • Optimizations
  • Communication vectorization
  • Overlapping communication with computation
  • Pipelining