Title: Tuesday, September 19, 2006
1. Tuesday, September 19, 2006
"The practical scientist is trying to solve tomorrow's problem on yesterday's computer. Computer scientists often have it the other way around."
- Numerical Recipes, C Edition
2. Reference Material
- Lectures 1 and 2
  - Parallel Computer Architecture by David Culler et al., Chapter 1.
  - Sourcebook of Parallel Computing by Jack Dongarra et al., Chapters 1 and 2.
  - Introduction to Parallel Computing by Grama et al., Chapter 1 and Chapter 2 up to Section 2.4.
  - www.top500.org
- Lecture 3
  - Introduction to Parallel Computing by Grama et al., Chapter 2, Section 2.3.
  - Introduction to Parallel Computing, Lawrence Livermore National Laboratory, http://www.llnl.gov/computing/tutorials/parallel_comp/
- Lectures 4 and 5
  - Techniques for Optimizing Applications by Garg et al., Chapter 9.
  - Software Optimizations for High Performance Computing by Wadleigh et al., Chapter 5.
  - Introduction to Parallel Computing by Grama et al., Chapter 2, Sections 2.1-2.2.
3. Software Optimizations
- Optimize serial code before parallelizing it.
4. Loop Unrolling
Assumption: n is divisible by 4.
  do i = 1, n, 4
     A(i)   = B(i)
     A(i+1) = B(i+1)
     A(i+2) = B(i+2)
     A(i+3) = B(i+3)
  enddo
- Unrolled by 4.
- Some compilers allow users to specify the unrolling depth.
- Avoid excessive unrolling: register pressure and spills can hurt performance.
- Unrolling exposes independent operations that can be pipelined to hide instruction latencies.
- Reduces the overhead of index increments and conditional checks.
- A C sketch handling the leftover iterations when n is not divisible by 4 follows below.
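The slides assume n is divisible by 4; as a minimal C sketch (the names copy_unrolled4, a, b and n are placeholders, not from the slides), the same idea can be written with a cleanup loop for the leftover iterations, using the same (n >> 2) << 2 trick that reappears on the Reductions slide:

  /* Copy b into a, unrolled by 4, with a cleanup loop for the
     remaining n % 4 elements. */
  void copy_unrolled4(double *a, const double *b, int n)
  {
      int i;
      int nend = (n >> 2) << 2;      /* largest multiple of 4 <= n */
      for (i = 0; i < nend; i += 4) {
          a[i]     = b[i];
          a[i + 1] = b[i + 1];
          a[i + 2] = b[i + 2];
          a[i + 3] = b[i + 3];
      }
      for (; i < n; i++)             /* leftover iterations */
          a[i] = b[i];
  }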
5. Loop Unrolling
  do j = 1 to N
     do i = 1 to N
        Z(i,j) = Z(i,j) + X(i) * Y(j)
     enddo
  enddo
Unroll the outer loop by 2.
6. Loop Unrolling
Original:
  do j = 1 to N
     do i = 1 to N
        Z(i,j) = Z(i,j) + X(i) * Y(j)
     enddo
  enddo
Outer loop unrolled by 2:
  do j = 1 to N step 2
     do i = 1 to N
        Z(i,j)   = Z(i,j)   + X(i) * Y(j)
        Z(i,j+1) = Z(i,j+1) + X(i) * Y(j+1)
     enddo
  enddo
7. Loop Unrolling
Original:
  do j = 1 to N
     do i = 1 to N
        Z(i,j) = Z(i,j) + X(i) * Y(j)
     enddo
  enddo
Outer loop unrolled by 2:
  do j = 1 to N step 2
     do i = 1 to N
        Z(i,j)   = Z(i,j)   + X(i) * Y(j)
        Z(i,j+1) = Z(i,j+1) + X(i) * Y(j+1)
     enddo
  enddo
The number of load operations is reduced: each X(i) now serves two columns of Z, so only half as many loads of X are needed (see the C sketch below).
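As a rough C rendering of the same transformation (a sketch; the flat z array, the function name, and the even-n assumption are mine, not the slides'), hoisting x[i] into a local makes the halved number of X loads explicit:

  /* Z(i,j) = Z(i,j) + X(i)*Y(j) with the outer (j) loop unrolled by 2.
     z is stored as a flat n*n array indexed z[j*n + i]; n is assumed even. */
  void rank1_update_unrolled2(int n, double *z, const double *x, const double *y)
  {
      for (int j = 0; j < n; j += 2) {
          for (int i = 0; i < n; i++) {
              double xi = x[i];               /* one load of X serves two columns */
              z[j * n + i]       += xi * y[j];
              z[(j + 1) * n + i] += xi * y[j + 1];
          }
      }
  }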
8. Loop Fusion
- Beneficial in loop-intensive programs.
- Decreases index-calculation overhead.
- Can also improve instruction-level parallelism.
- Beneficial if the same data structures are used in different loops.
9. Loop Fusion
  for (i = 0; i < n; i++)
      temp[i] = x[i] * y[i];
  for (i = 0; i < n; i++)
      z[i] = w[i] + temp[i];
10. Loop Fusion
Original:
  for (i = 0; i < n; i++)
      temp[i] = x[i] * y[i];
  for (i = 0; i < n; i++)
      z[i] = w[i] + temp[i];
Fused:
  for (i = 0; i < n; i++)
      z[i] = x[i] * y[i] + w[i];
Check for register pressure before fusing.
11. Loop Fission
- Conditional statements can hurt pipelining.
- Split the loop into two: one with the conditional statements and the other without.
- The compiler can then apply optimizations such as unrolling to the condition-free loop.
- Also beneficial for fat loops that may lead to register spills.
12. Loop Fission
  for (i = 0; i < nodes; i++) {
      a[i]     = a[i] * small;
      dtime    = a[i] - b[i];
      dtime    = fabs(dtime * ratinpmt);
      temp1[i] = dtime * relaxn;
      if (temp1[i] > hgreat)
          temp1[i] = 1;
  }
13. Loop Fission
Original loop:
  for (i = 0; i < nodes; i++) {
      a[i]     = a[i] * small;
      dtime    = a[i] - b[i];
      dtime    = fabs(dtime * ratinpmt);
      temp1[i] = dtime * relaxn;
      if (temp1[i] > hgreat)
          temp1[i] = 1;
  }
After fission:
  for (i = 0; i < nodes; i++) {
      a[i]     = a[i] * small;
      dtime    = a[i] - b[i];
      dtime    = fabs(dtime * ratinpmt);
      temp1[i] = dtime * relaxn;
  }
  for (i = 0; i < nodes; i++) {
      if (temp1[i] > hgreat)
          temp1[i] = 1;
  }
14. Reductions
Normally a single register would be used for the reduction variable, which serializes the additions. How can we hide the floating-point instruction latency?
15. Reductions
  sum1 = sum2 = sum3 = sum4 = 0.0;
  nend = (n >> 2) << 2;
  for (i = 0; i < nend; i += 4) {
      sum1 += x[i];
      sum2 += x[i+1];
      sum3 += x[i+2];
      sum4 += x[i+3];
  }
  sumx = sum1 + sum2 + sum3 + sum4;
  for (i = nend; i < n; i++)
      sumx += x[i];
16. (No transcript)
17.
- a**0.5 vs. sqrt(a)
- Appropriate include files can help the compiler generate faster code, e.g. math.h (see the sketch below).
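A small C illustration of both points (the surrounding program is a sketch; pow and sqrt are the standard math.h functions): sqrt(a) is normally much cheaper than the general power routine pow(a, 0.5), and including math.h gives the compiler the proper prototypes so it can type-check the calls and use builtin versions.

  #include <math.h>
  #include <stdio.h>

  int main(void)
  {
      double a = 2.0;
      double slow = pow(a, 0.5);   /* general x**y routine */
      double fast = sqrt(a);       /* dedicated square-root path */
      printf("%f %f\n", slow, fast);
      return 0;
  }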
18.
- The time to access memory has not kept pace with CPU clock speeds.
- A program's performance can be suboptimal because the data needed for an operation are not delivered from memory to registers by the time the processor is ready to use them.
- This wastes CPU cycles: the CPU is starved.
19. (No transcript)
20.
- Ability of the memory system to feed data to the processor, determined by:
  - Memory latency
  - Memory bandwidth
21. Effect of Memory Latency
- 1 GHz processor (1 ns clock)
- Capable of executing 4 instructions in each 1 ns cycle
- DRAM with 100 ns latency
- Cache block size: 1 word
- Peak processor rating?
22. Effect of Memory Latency
- 1 GHz processor (1 ns clock)
- Capable of executing 4 instructions in each 1 ns cycle
- DRAM with 100 ns latency (no caches)
- Memory block: 1 word
- Peak processor rating: 4 GFLOPS
23. Effect of Memory Latency
- 1 GHz processor (1 ns clock)
- Capable of executing 4 instructions in each 1 ns cycle
- DRAM with 100 ns latency (no caches)
- Memory block: 1 word
- Peak processor rating: 4 GFLOPS
- Dot product of two vectors
- Peak speed of the computation?
24. Effect of Memory Latency
- 1 GHz processor (1 ns clock)
- Capable of executing 4 instructions in each 1 ns cycle
- DRAM with 100 ns latency (no caches)
- Memory block: 1 word
- Peak processor rating: 4 GFLOPS
- Dot product of two vectors
- Peak speed of the computation? Since every data word must come from memory with 100 ns latency, the processor completes at most one floating-point operation every 100 ns, i.e. a speed of 10 MFLOPS (see the sketch below).
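A minimal C sketch of the dot-product loop under these assumptions (the names dot, x, y and n are placeholders); with no cache and 1-word memory blocks, every operand comes from DRAM, so the loop is latency-bound:

  double dot(const double *x, const double *y, int n)
  {
      double sum = 0.0;
      for (int i = 0; i < n; i++)
          sum += x[i] * y[i];   /* roughly one FLOP per 100 ns memory access -> ~10 MFLOPS */
      return sum;
  }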
25. Effect of Memory Latency: Introduce Cache
- 1 GHz processor (1 ns clock)
- Capable of executing 4 instructions in each 1 ns cycle
- DRAM with 100 ns latency
- Memory block: 1 word
- Cache: 32 KB with 1 ns latency
- Multiply two 32x32-word matrices A and B, with the result in C. (Note: the previous example had no data reuse.)
- Assume ideal cache placement and enough cache capacity to hold A, B, and C.
26. Effect of Memory Latency: Introduce Cache
- Multiply two 32x32-word matrices A and B, with the result in C
- 32x32 = 1K words
- Total operations and total time taken?
27. Effect of Memory Latency: Introduce Cache
- Multiply two 32x32-word matrices A and B, with the result in C
- 32x32 = 1K words
- Total operations and total time taken?
- Fetching the two input matrices requires 2K words.
- Multiplying two matrices requires 2n^3 operations.
28. Effect of Memory Latency: Introduce Cache
- Multiply two 32x32-word matrices A and B, with the result in C
- 32x32 = 1K words
- Fetching the two input matrices (2K words) takes 2K x 100 ns = 200 µs.
- Multiplying two matrices requires 2n^3 = 2 x 32^3 = 64K operations.
- At 4 operations per cycle, this takes 64K/4 cycles = 16 µs.
- Total time = 200 + 16 = 216 µs.
- Computation rate = 64K operations / 216 µs ≈ 303 MFLOPS.
29. Effect of Memory Bandwidth
- 1 GHz processor (1 ns clock)
- Capable of executing 4 instructions in each 1 ns cycle
- DRAM with 100 ns latency
- Memory block: 4 words
- Cache: 32 KB with 1 ns latency
- Dot product example again
- Bandwidth is increased 4-fold (a rough estimate follows below).
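A rough estimate, derived from the assumptions above rather than stated on the slide: each 100 ns access now delivers 4 consecutive words, so one block of x plus one block of y feeds four multiply-add pairs.

  2 block fetches x 100 ns = 200 ns of memory traffic  ->  4 multiply-add pairs = 8 FLOPs
  8 FLOPs / 200 ns ≈ 40 MFLOPS   (vs. 10 MFLOPS with 1-word blocks)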
30.
- Reduce cache misses by exploiting:
  - Spatial locality
  - Temporal locality
31. Impact of strided access
  for (i = 0; i < 1000; i++) {
      column_sum[i] = 0.0;
      for (j = 0; j < 1000; j++)
          column_sum[i] += b[j][i];
  }
32. Eliminating strided access
  for (i = 0; i < 1000; i++)
      column_sum[i] = 0.0;
  for (j = 0; j < 1000; j++)
      for (i = 0; i < 1000; i++)
          column_sum[i] += b[j][i];
Assumption: the vector column_sum is retained in the cache. Since C stores b in row-major order, the inner loop over i now sweeps consecutive elements of row j, so the accesses to b have unit stride.
33.
  do i = 1, N
     do j = 1, N
        A(i) = A(i) + B(j)
     enddo
  enddo
N is large, so B(j) cannot remain in the cache until it is used again in the next iteration of the outer loop; there is little reuse between touches. How many cache misses are incurred for A and B?