Title: Tuesday, September 19, 2006
1. Tuesday, September 19, 2006
"The practical scientist is trying to solve tomorrow's problem on yesterday's computer. Computer scientists often have it the other way around."
- Numerical Recipes, C Edition
2. Reference Material
- Lectures 1 and 2
  - Parallel Computer Architecture by David Culler et al., Chapter 1.
  - Sourcebook of Parallel Computing by Jack Dongarra et al., Chapters 1 and 2.
  - Introduction to Parallel Computing by Grama et al., Chapter 1 and Chapter 2 up to Section 2.4.
  - www.top500.org
- Lecture 3
  - Introduction to Parallel Computing by Grama et al., Chapter 2, Section 2.3.
  - Introduction to Parallel Computing, Lawrence Livermore National Laboratory, http://www.llnl.gov/computing/tutorials/parallel_comp/
- Lectures 4 and 5
  - Techniques for Optimizing Applications by Garg et al., Chapter 9.
  - Software Optimizations for High Performance Computing by Wadleigh et al., Chapter 5.
  - Introduction to Parallel Computing by Grama et al., Chapter 2, Sections 2.1-2.2.
3. Software Optimizations
- Optimize serial code before parallelizing it.
4. Loop Unrolling
Assumption: n is divisible by 4.
  do i = 1, n, 4
     A(i)   = B(i)
     A(i+1) = B(i+1)
     A(i+2) = B(i+2)
     A(i+3) = B(i+3)
  enddo
- Unrolled by 4.
- Some compilers allow users to specify the unrolling depth.
- Avoid excessive unrolling: register pressure and spills can hurt performance.
- Unrolling exposes independent operations that can be pipelined to hide instruction latencies.
- Reduces the overhead of index increments and conditional checks.
- A C sketch handling the leftover iterations when n is not divisible by 4 follows below.
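The slides assume n is divisible by 4; as a minimal C sketch (the names copy_unrolled4, a, b and n are placeholders, not from the slides), the same idea can be written with a cleanup loop for the leftover iterations, using the same (n >> 2) << 2 trick that reappears on the Reductions slide:

  /* Copy b into a, unrolled by 4, with a cleanup loop for the
     remaining n % 4 elements. */
  void copy_unrolled4(double *a, const double *b, int n)
  {
      int i;
      int nend = (n >> 2) << 2;      /* largest multiple of 4 <= n */
      for (i = 0; i < nend; i += 4) {
          a[i]     = b[i];
          a[i + 1] = b[i + 1];
          a[i + 2] = b[i + 2];
          a[i + 3] = b[i + 3];
      }
      for (; i < n; i++)             /* leftover iterations */
          a[i] = b[i];
  }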
5. Loop Unrolling
  do j = 1 to N
     do i = 1 to N
        Z(i,j) = Z(i,j) + X(i) * Y(j)
     enddo
  enddo
Unroll the outer loop by 2.
6. Loop Unrolling
Original:
  do j = 1 to N
     do i = 1 to N
        Z(i,j) = Z(i,j) + X(i) * Y(j)
     enddo
  enddo
Outer loop unrolled by 2:
  do j = 1 to N step 2
     do i = 1 to N
        Z(i,j)   = Z(i,j)   + X(i) * Y(j)
        Z(i,j+1) = Z(i,j+1) + X(i) * Y(j+1)
     enddo
  enddo
7. Loop Unrolling
Original:
  do j = 1 to N
     do i = 1 to N
        Z(i,j) = Z(i,j) + X(i) * Y(j)
     enddo
  enddo
Outer loop unrolled by 2:
  do j = 1 to N step 2
     do i = 1 to N
        Z(i,j)   = Z(i,j)   + X(i) * Y(j)
        Z(i,j+1) = Z(i,j+1) + X(i) * Y(j+1)
     enddo
  enddo
The number of load operations is reduced: each X(i) now serves two columns of Z, so only half as many loads of X are needed (see the C sketch below).
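As a rough C rendering of the same transformation (a sketch; the flat z array, the function name, and the even-n assumption are mine, not the slides'), hoisting x[i] into a local makes the halved number of X loads explicit:

  /* Z(i,j) = Z(i,j) + X(i)*Y(j) with the outer (j) loop unrolled by 2.
     z is stored as a flat n*n array indexed z[j*n + i]; n is assumed even. */
  void rank1_update_unrolled2(int n, double *z, const double *x, const double *y)
  {
      for (int j = 0; j < n; j += 2) {
          for (int i = 0; i < n; i++) {
              double xi = x[i];               /* one load of X serves two columns */
              z[j * n + i]       += xi * y[j];
              z[(j + 1) * n + i] += xi * y[j + 1];
          }
      }
  }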
8. Loop Fusion
- Beneficial in loop-intensive programs.
- Decreases index-calculation overhead.
- Can also improve instruction-level parallelism.
- Beneficial if the same data structures are used in different loops.
9. Loop Fusion
  for (i = 0; i < n; i++)
      temp[i] = x[i] * y[i];
  for (i = 0; i < n; i++)
      z[i] = w[i] + temp[i];
10. Loop Fusion
Original:
  for (i = 0; i < n; i++)
      temp[i] = x[i] * y[i];
  for (i = 0; i < n; i++)
      z[i] = w[i] + temp[i];
Fused:
  for (i = 0; i < n; i++)
      z[i] = x[i] * y[i] + w[i];
Check for register pressure before fusing.
11. Loop Fission
- Conditional statements can hurt pipelining.
- Split the loop into two: one with the conditional statements and the other without.
- The compiler can then apply optimizations such as unrolling to the condition-free loop.
- Also beneficial for fat loops that may lead to register spills.
12. Loop Fission
  for (i = 0; i < nodes; i++) {
      a[i]     = a[i] * small;
      dtime    = a[i] - b[i];
      dtime    = fabs(dtime * ratinpmt);
      temp1[i] = dtime * relaxn;
      if (temp1[i] > hgreat)
          temp1[i] = 1;
  }
13. Loop Fission
Original loop:
  for (i = 0; i < nodes; i++) {
      a[i]     = a[i] * small;
      dtime    = a[i] - b[i];
      dtime    = fabs(dtime * ratinpmt);
      temp1[i] = dtime * relaxn;
      if (temp1[i] > hgreat)
          temp1[i] = 1;
  }
After fission:
  for (i = 0; i < nodes; i++) {
      a[i]     = a[i] * small;
      dtime    = a[i] - b[i];
      dtime    = fabs(dtime * ratinpmt);
      temp1[i] = dtime * relaxn;
  }
  for (i = 0; i < nodes; i++) {
      if (temp1[i] > hgreat)
          temp1[i] = 1;
  }
14. Reductions
Normally a single register would be used for the reduction variable, which serializes the additions. How can we hide the floating-point instruction latency?
15. Reductions
  sum1 = sum2 = sum3 = sum4 = 0.0;
  nend = (n >> 2) << 2;
  for (i = 0; i < nend; i += 4) {
      sum1 += x[i];
      sum2 += x[i+1];
      sum3 += x[i+2];
      sum4 += x[i+3];
  }
  sumx = sum1 + sum2 + sum3 + sum4;
  for (i = nend; i < n; i++)
      sumx += x[i];
16. (No transcript)
17.
- a**0.5 vs. sqrt(a)
- Appropriate include files can help the compiler generate faster code, e.g. math.h (see the sketch below).
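A small C illustration of both points (the surrounding program is a sketch; pow and sqrt are the standard math.h functions): sqrt(a) is normally much cheaper than the general power routine pow(a, 0.5), and including math.h gives the compiler the proper prototypes so it can type-check the calls and use builtin versions.

  #include <math.h>
  #include <stdio.h>

  int main(void)
  {
      double a = 2.0;
      double slow = pow(a, 0.5);   /* general x**y routine */
      double fast = sqrt(a);       /* dedicated square-root path */
      printf("%f %f\n", slow, fast);
      return 0;
  }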
18.
- The time to access memory has not kept pace with CPU clock speeds.
- A program's performance can be suboptimal because the data needed for an operation are not delivered from memory to registers by the time the processor is ready to use them.
- This wastes CPU cycles: the CPU is starved.
19. (No transcript)
20.
- Ability of the memory system to feed data to the processor, determined by:
  - Memory latency
  - Memory bandwidth
21. Effect of Memory Latency
- 1 GHz processor (1 ns clock)
- Capable of executing 4 instructions in each 1 ns cycle
- DRAM with 100 ns latency
- Cache block size: 1 word
- Peak processor rating?
22. Effect of Memory Latency
- 1 GHz processor (1 ns clock)
- Capable of executing 4 instructions in each 1 ns cycle
- DRAM with 100 ns latency (no caches)
- Memory block: 1 word
- Peak processor rating: 4 GFLOPS
23. Effect of Memory Latency
- 1 GHz processor (1 ns clock)
- Capable of executing 4 instructions in each 1 ns cycle
- DRAM with 100 ns latency (no caches)
- Memory block: 1 word
- Peak processor rating: 4 GFLOPS
- Dot product of two vectors
- Peak speed of the computation?
24. Effect of Memory Latency
- 1 GHz processor (1 ns clock)
- Capable of executing 4 instructions in each 1 ns cycle
- DRAM with 100 ns latency (no caches)
- Memory block: 1 word
- Peak processor rating: 4 GFLOPS
- Dot product of two vectors
- Peak speed of the computation? Since every data word must come from memory with 100 ns latency, the processor completes at most one floating-point operation every 100 ns, i.e. a speed of 10 MFLOPS (see the sketch below).
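A minimal C sketch of the dot-product loop under these assumptions (the names dot, x, y and n are placeholders); with no cache and 1-word memory blocks, every operand comes from DRAM, so the loop is latency-bound:

  double dot(const double *x, const double *y, int n)
  {
      double sum = 0.0;
      for (int i = 0; i < n; i++)
          sum += x[i] * y[i];   /* roughly one FLOP per 100 ns memory access -> ~10 MFLOPS */
      return sum;
  }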
25. Effect of Memory Latency: Introduce Cache
- 1 GHz processor (1 ns clock)
- Capable of executing 4 instructions in each 1 ns cycle
- DRAM with 100 ns latency
- Memory block: 1 word
- Cache: 32 KB with 1 ns latency
- Multiply two 32x32-word matrices A and B, with the result in C. (Note: the previous example had no data reuse.)
- Assume ideal cache placement and enough cache capacity to hold A, B, and C.
26. Effect of Memory Latency: Introduce Cache
- Multiply two 32x32-word matrices A and B, with the result in C
- 32x32 = 1K words
- Total operations and total time taken?
27. Effect of Memory Latency: Introduce Cache
- Multiply two 32x32-word matrices A and B, with the result in C
- 32x32 = 1K words
- Total operations and total time taken?
- Fetching the two input matrices requires 2K words.
- Multiplying two matrices requires 2n^3 operations.
28. Effect of Memory Latency: Introduce Cache
- Multiply two 32x32-word matrices A and B, with the result in C
- 32x32 = 1K words
- Fetching the two input matrices (2K words) takes 2K x 100 ns = 200 µs.
- Multiplying two matrices requires 2n^3 = 2 x 32^3 = 64K operations.
- At 4 operations per cycle, this takes 64K/4 cycles = 16 µs.
- Total time = 200 + 16 = 216 µs.
- Computation rate = 64K operations / 216 µs ≈ 303 MFLOPS.
29. Effect of Memory Bandwidth
- 1 GHz processor (1 ns clock)
- Capable of executing 4 instructions in each 1 ns cycle
- DRAM with 100 ns latency
- Memory block: 4 words
- Cache: 32 KB with 1 ns latency
- Dot product example again
- Bandwidth is increased 4-fold (a rough estimate follows below).
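A rough estimate, derived from the assumptions above rather than stated on the slide: each 100 ns access now delivers 4 consecutive words, so one block of x plus one block of y feeds four multiply-add pairs.

  2 block fetches x 100 ns = 200 ns of memory traffic  ->  4 multiply-add pairs = 8 FLOPs
  8 FLOPs / 200 ns ≈ 40 MFLOPS   (vs. 10 MFLOPS with 1-word blocks)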
30.
- Reduce cache misses by exploiting:
  - Spatial locality
  - Temporal locality
31. Impact of strided access
  for (i = 0; i < 1000; i++) {
      column_sum[i] = 0.0;
      for (j = 0; j < 1000; j++)
          column_sum[i] += b[j][i];
  }
32. Eliminating strided access
  for (i = 0; i < 1000; i++)
      column_sum[i] = 0.0;
  for (j = 0; j < 1000; j++)
      for (i = 0; i < 1000; i++)
          column_sum[i] += b[j][i];
Assumption: the vector column_sum is retained in the cache. Since C stores b in row-major order, the inner loop over i now sweeps consecutive elements of row j, so the accesses to b have unit stride.
33.
  do i = 1, N
     do j = 1, N
        A(i) = A(i) + B(j)
     enddo
  enddo
N is large, so B(j) cannot remain in the cache until it is used again in the next iteration of the outer loop; there is little reuse between touches. How many cache misses are incurred for A and B?