Tuesday, September 19, 2006 - PowerPoint presentation transcript

Transcript and Presenter's Notes

Title: Tuesday, September 19, 2006
1
Tuesday, September 19, 2006
  • The practical scientist is trying to solve
    tomorrow's problem on yesterday's computer.
    Computer scientists often have it the other way
    around.
  • - Numerical Recipes, C Edition

2
Reference Material
  • Lectures 1 and 2
  • Parallel Computer Architecture by David Culler et al., Chapter 1.
  • Sourcebook of Parallel Computing by Jack Dongarra et al., Chapters 1 and 2.
  • Introduction to Parallel Computing by Grama et al., Chapter 1 and Chapter 2 through Section 2.4.
  • www.top500.org
  • Lecture 3
  • Introduction to Parallel Computing by Grama et al., Chapter 2, Section 2.3.
  • Introduction to Parallel Computing, Lawrence Livermore National Laboratory, http://www.llnl.gov/computing/tutorials/parallel_comp/
  • Lectures 4 and 5
  • Techniques for Optimizing Applications by Garg et al., Chapter 9.
  • Software Optimizations for High Performance Computing by Wadleigh et al., Chapter 5.
  • Introduction to Parallel Computing by Grama et al., Chapter 2, Sections 2.1-2.2.

3
Software Optimizations
  • Optimize serial code before parallelizing it.

4
Loop Unrolling
Assumption: n is divisible by 4

    do i=1,n,4
      A(i)=B(i)
      A(i+1)=B(i+1)
      A(i+2)=B(i+2)
      A(i+3)=B(i+3)
    enddo

    do i=1,n
      A(i)=B(i)
    enddo

  • Unrolled by 4.
  • Some compilers allow users to specify the unrolling depth.
  • Avoid excessive unrolling: register pressure and spills can hurt performance.
  • Enables pipelining to hide instruction latencies.
  • Reduces the overhead of index increments and conditional checks.
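The unrolled copy can be sketched as a compilable C function. The function name `copy4` is my addition, and the cleanup loop removes the slide's divisibility-by-4 assumption:

```c
#include <stddef.h>

/* Copy b into a, unrolled by 4; the cleanup loop handles n % 4 != 0. */
void copy4(double *a, const double *b, size_t n) {
    size_t i;
    size_t nend = n - (n % 4);      /* largest multiple of 4 <= n */
    for (i = 0; i < nend; i += 4) { /* one test/increment per 4 copies */
        a[i]     = b[i];
        a[i + 1] = b[i + 1];
        a[i + 2] = b[i + 2];
        a[i + 3] = b[i + 3];
    }
    for (; i < n; i++)              /* remainder iterations */
        a[i] = b[i];
}
```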

5
Loop Unrolling
    do j=1,N
      do i=1,N
        Z(i,j)=Z(i,j)+X(i)*Y(j)
      enddo
    enddo

Unroll outer loop by 2
6
Loop Unrolling
    do j=1,N
      do i=1,N
        Z(i,j)=Z(i,j)+X(i)*Y(j)
      enddo
    enddo

    do j=1,N,2
      do i=1,N
        Z(i,j)=Z(i,j)+X(i)*Y(j)
        Z(i,j+1)=Z(i,j+1)+X(i)*Y(j+1)
      enddo
    enddo

7
Loop Unrolling
    do j=1,N
      do i=1,N
        Z(i,j)=Z(i,j)+X(i)*Y(j)
      enddo
    enddo

    do j=1,N,2
      do i=1,N
        Z(i,j)=Z(i,j)+X(i)*Y(j)
        Z(i,j+1)=Z(i,j+1)+X(i)*Y(j+1)
      enddo
    enddo

The number of load operations can be reduced, e.g. half as many loads of X.
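The same transformation in C (hypothetical name `rank1_update`; the Fortran Z(i,j) is stored here as Z[j][i] so the inner loop stays unit-stride, and N is assumed even):

```c
#define N 4
/* Z[j][i] += X[i] * Y[j], outer loop unrolled by 2 (N assumed even). */
void rank1_update(double Z[N][N], const double X[N], const double Y[N]) {
    for (int j = 0; j < N; j += 2) {
        for (int i = 0; i < N; i++) {
            double xi = X[i];            /* X[i] loaded once, used twice */
            Z[j][i]     += xi * Y[j];
            Z[j + 1][i] += xi * Y[j + 1];
        }
    }
}
```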
8
Loop Fusion
  • Beneficial in loop-intensive programs.
  • Decreases index-calculation overhead.
  • Can also improve instruction-level parallelism.
  • Beneficial when the same data structures are used in different loops.

9
Loop Fusion
    for (i = 0; i < n; i++)
      temp[i] = x[i] + y[i];

    for (i = 0; i < n; i++)
      z[i] = w[i] + temp[i];

10
Loop Fusion
    for (i = 0; i < n; i++)
      temp[i] = x[i] + y[i];
    for (i = 0; i < n; i++)
      z[i] = w[i] + temp[i];

    /* fused */
    for (i = 0; i < n; i++)
      z[i] = x[i] + y[i] + w[i];

Check for register pressure before fusing
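A minimal runnable sketch of the fused form (names z, x, y, w, n follow the slide; the `+` operators are an assumption, since the transcript dropped them):

```c
/* Fused version of the two loops: one pass over the data, no temp
   array, and x[i]+y[i] feeds z[i] while still in a register. */
void fused(double *z, const double *x, const double *y,
           const double *w, int n) {
    for (int i = 0; i < n; i++)
        z[i] = x[i] + y[i] + w[i];
}
```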
11
Loop Fission
  • Conditional statements can hurt pipelining.
  • Split the loop into two: one with the conditional statements and the other without.
  • The compiler can then apply optimizations such as unrolling to the condition-free loop.
  • Also beneficial for fat loops that may lead to register spills.

12
Loop Fission
    for (i = 0; i < nodes; i++) {
      a[i] = a[i] + small;
      dtime = a[i] + b[i];
      dtime = fabs(dtime * ratinpmt);
      temp1[i] = dtime * relaxn;
      if (temp1[i] > hgreat)
        temp1[i] = 1;
    }

13
Loop Fission
    /* after fission: condition-free loop */
    for (i = 0; i < nodes; i++) {
      a[i] = a[i] + small;
      dtime = a[i] + b[i];
      dtime = fabs(dtime * ratinpmt);
      temp1[i] = dtime * relaxn;
    }
    /* loop containing the conditional */
    for (i = 0; i < nodes; i++) {
      if (temp1[i] > hgreat)
        temp1[i] = 1;
    }

    /* original loop */
    for (i = 0; i < nodes; i++) {
      a[i] = a[i] + small;
      dtime = a[i] + b[i];
      dtime = fabs(dtime * ratinpmt);
      temp1[i] = dtime * relaxn;
      if (temp1[i] > hgreat)
        temp1[i] = 1;
    }
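The fissioned version can be written as a compilable function. The scalar names (small, ratinpmt, relaxn, hgreat) follow the slide; the function name, parameter list, and arithmetic operators are assumptions added for illustration:

```c
#include <math.h>   /* fabs */

/* Loop fission sketch: the condition-free first loop can be unrolled
   and pipelined by the compiler; the branch is isolated in the second. */
void update(double *a, const double *b, double *temp1, int nodes,
            double small, double ratinpmt, double relaxn, double hgreat) {
    for (int i = 0; i < nodes; i++) {
        a[i] = a[i] + small;
        double dtime = a[i] + b[i];
        dtime = fabs(dtime * ratinpmt);
        temp1[i] = dtime * relaxn;
    }
    for (int i = 0; i < nodes; i++)
        if (temp1[i] > hgreat)
            temp1[i] = 1;
}
```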

14
Reductions
    for (i = 0; i < n; i++)
      sum += x[i];

Normally a single register would be used for the reduction variable. How can floating-point instruction latency be hidden?
15
Reductions
    sum1 = sum2 = sum3 = sum4 = 0.0;
    nend = (n >> 2) << 2;
    for (i = 0; i < nend; i += 4) {
      sum1 += x[i];
      sum2 += x[i+1];
      sum3 += x[i+2];
      sum4 += x[i+3];
    }
    sum = sum1 + sum2 + sum3 + sum4;
    for (i = nend; i < n; i++)
      sum += x[i];

    /* original */
    for (i = 0; i < n; i++)
      sum += x[i];
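As a self-contained C function (hypothetical name `sum4`), the four partial sums break the serial dependence on a single accumulator:

```c
/* Four partial sums break the dependence chain on one register,
   hiding floating-point add latency. */
double sum4(const double *x, int n) {
    double s1 = 0.0, s2 = 0.0, s3 = 0.0, s4 = 0.0;
    int nend = (n >> 2) << 2;     /* round n down to a multiple of 4 */
    int i;
    for (i = 0; i < nend; i += 4) {
        s1 += x[i];
        s2 += x[i + 1];
        s3 += x[i + 2];
        s4 += x[i + 3];
    }
    double sum = s1 + s2 + s3 + s4;
    for (; i < n; i++)            /* leftover elements */
        sum += x[i];
    return sum;
}
```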

16
  • pow(a, 0.5) vs sqrt(a)

17
  • pow(a, 0.5) vs sqrt(a)
  • Appropriate include files can help in generating faster code, e.g. math.h.

18
  • The time to access memory has not kept pace with CPU clock speeds.
  • Program performance can be suboptimal because the data needed for an operation has not been delivered from memory to the registers by the time the processor is ready to use it.
  • Wasted CPU cycles: CPU starvation.

19
(No Transcript)
20
  • The ability of the memory system to feed data to the processor is characterized by:
  • Memory latency
  • Memory bandwidth

21
Effect of Memory Latency
  • 1 GHz processor (1ns clock)
  • Capable of executing 4 instructions in each cycle
    of 1ns
  • DRAM with latency 100ns
  • Cache block size 1 word
  • Peak processor rating?

22
Effect of Memory Latency
  • 1 GHz processor (1ns clock)
  • Capable of executing 4 instructions in each cycle
    of 1ns
  • DRAM with latency 100ns (no caches)
  • Memory block 1 word
  • Peak processor rating 4 GFlops

23
Effect of Memory Latency
  • 1 GHz processor (1ns clock)
  • Capable of executing 4 instructions in each cycle
    of 1ns
  • DRAM with latency 100ns (no caches)
  • Memory block 1 word
  • Peak processor rating 4 GFlops
  • Dot product of two vectors
  • Peak speed of computation?

24
Effect of Memory Latency
  • 1 GHz processor (1ns clock)
  • Capable of executing 4 instructions in each cycle
    of 1ns
  • DRAM with latency 100ns (no caches)
  • Memory block 1 word
  • Peak processor rating 4 GFlops
  • Dot product of two vectors
  • Peak speed of computation? One floating-point operation every 100ns, i.e. 10 MFLOPS.

25
Effect of Memory Latency Introduce Cache
  • 1 GHz processor (1ns clock)
  • Capable of executing 4 instructions in each cycle
    of 1ns
  • DRAM with latency 100ns
  • Memory block 1 word
  • Cache 32KB with 1ns latency
  • Multiply two matrices A and B of 32x32 words each, with the result in C. (Note: the previous example had no data reuse.)
  • Assume ideal cache placement and enough capacity to hold A, B, and C.

26
Effect of Memory Latency Introduce Cache
  • Multiply two matrices A and B of 32x32 words with
    result in C
  • 32x32 1K words
  • Total operations and total time taken?

27
Effect of Memory Latency Introduce Cache
  • Multiply two matrices A and B of 32x32 words with
    result in C
  • 32x32 1K words
  • Total operations and total time taken?
  • The two input matrices require 2K words.
  • Multiplying two n x n matrices requires 2n^3 operations.

28
Effect of Memory Latency Introduce Cache
  • Multiply two matrices A and B of 32x32 words with the result in C.
  • 32x32 = 1K words per matrix.
  • Fetching the two input matrices: 2K words x 100ns = 200µs (approx.).
  • Multiplying two matrices requires 2n^3 = 2 x 32^3 = 64K operations.
  • At 4 operations per cycle, we need 64K/4 = 16K cycles = 16µs.
  • Total time: 200µs + 16µs = 216µs.
  • Computation rate: 64K operations / 216µs = approx. 303 MFLOPS.
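The arithmetic on this slide can be re-checked in code, using the slide's own rounded figures (200µs memory time, 16µs compute time):

```c
/* Recompute the slide's figures with its own rounding:
   2K word fetches at 100 ns ~= 200 us of memory time,
   64K operations at 4 per 1 ns cycle ~= 16 us of compute time. */
double mflops(void) {
    double ops = 2.0 * 32 * 32 * 32;   /* 2n^3 = 64K = 65536 operations */
    double total_s = 200e-6 + 16e-6;   /* 216 us total */
    return ops / total_s / 1e6;        /* roughly 303 MFLOPS */
}
```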

29
Effect of Memory Bandwidth
  • 1 GHz processor (1ns clock)
  • Capable of executing 4 instructions in each cycle
    of 1ns
  • DRAM with latency 100ns
  • Memory block 4 words
  • Cache 32KB with 1ns latency
  • Dot product example again.
  • Bandwidth is increased 4-fold: each 100ns access now delivers 4 words.

30
  • Reduce cache misses.
  • Spatial locality
  • Temporal locality

31
Impact of strided access
    for (i = 0; i < 1000; i++) {
      column_sum[i] = 0.0;
      for (j = 0; j < 1000; j++)
        column_sum[i] += b[j][i];
    }

32
Eliminating strided access
    for (i = 0; i < 1000; i++)
      column_sum[i] = 0.0;
    for (j = 0; j < 1000; j++)
      for (i = 0; i < 1000; i++)
        column_sum[i] += b[j][i];

Assumption Vector column_sum is retained in the
cache
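The interchange can be sketched as a compilable C function (a small 2x3 matrix stands in for the slide's 1000x1000; the function name is mine):

```c
#define ROWS 2
#define COLS 3

/* Column sums with unit-stride inner access: with the j loop outside,
   consecutive inner iterations touch adjacent memory in row-major b. */
void column_sums(double b[ROWS][COLS], double *column_sum) {
    for (int i = 0; i < COLS; i++)
        column_sum[i] = 0.0;
    for (int j = 0; j < ROWS; j++)        /* walk b row by row */
        for (int i = 0; i < COLS; i++)    /* unit stride in i */
            column_sum[i] += b[j][i];
}
```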
33
    do i = 1, N
      do j = 1, N
        A(i) = A(i) + B(j)
      enddo
    enddo

N is large, so B(j) cannot remain in cache until it is used again in the next iteration of the outer loop. Little reuse between touches. How many cache misses for A and B?