CSE 260 - PowerPoint PPT Presentation

About This Presentation

Title:

CSE 260

Description:

Highly recommended talk tomorrow (Weds) at 3:00 in SDSC's auditorium: ... 1-way (4-way on Nighthawk 2) associative, randomized ... – PowerPoint PPT presentation

Provided by: car72

Learn more at: https://cseweb.ucsd.edu

Transcript and Presenter's Notes

Title: CSE 260

1
CSE 260 Introduction to Parallel Computation

Topic 7: A few words about performance programming
October 23, 2001

2
Announcements

Office hours tomorrow: 1:30 - 2:50
Highly recommended talk tomorrow (Weds) at 3:00 in SDSC's auditorium:
"Benchmarking and Performance Characterization in High Performance Computing"
John McCalpin, IBM
For more info, see www.sdsc.edu/CSSS/
Disclaimer: I've only read the abstract, but it sounds relevant to this class and interesting.

3
Approach to Tuning Code

Engineer's method:
  DO UNTIL (exhausted)
    tweak something
    IF (better) THEN accept_change
Scientific method:
  DO UNTIL (enlightened)
    make hypothesis
    experiment
    revise hypothesis

4
IBM Power3's power and limits
Processor in Blue Horizon

Eight pipelined functional units
2 floating point
2 load/store
2 single-cycle integer
1 multi-cycle integer
1 branch
Powerful operations
Fused multiply-add (FMA)
Load (or Store) update
Branch on count

Launch ≤ 4 ops per cycle
Can't launch 2 stores/cycle
FMA pipe 3-4 cycles long
Memory hierarchy speed

5
Can its power be harnessed?

CL.6:
  FMA   fp31=fp31,fp2,fp0,fcr
  LFL   fp1=(*)double(gr3,16)
  FNMS  fp30=fp30,fp2,fp0,fcr
  LFDU  fp3,gr3=(*)double(gr3,32)
  FMA   fp24=fp24,fp0,fp1,fcr
  FNMS  fp25=fp25,fp0,fp1,fcr
  LFL   fp0=(*)double(gr3,24)
  FMA   fp27=fp27,fp2,fp3,fcr
  FNMS  fp26=fp26,fp2,fp3,fcr
  LFL   fp2=(*)double(gr3,8)
  FMA   fp29=fp29,fp1,fp3,fcr
  FNMS  fp28=fp28,fp1,fp3,fcr
  BCT   ctr=CL.6,

for (j=0; j<n; j+=4) {
  p00 += a[j+0]*a[j+2];
  m00 -= a[j+0]*a[j+2];
  p01 += a[j+1]*a[j+3];
  m01 -= a[j+1]*a[j+3];
  p10 += a[j+0]*a[j+3];
  m10 -= a[j+0]*a[j+3];
  p11 += a[j+1]*a[j+2];
  m11 -= a[j+1]*a[j+2];
}

8 FMAs, 4 loads
Runs at 4.6 cycles/iteration (= 1544 MFLOP/s on 444 MHz processor)
6
Can its power be harnessed? (part II)

8 FMA, 4 loads: 1544 MFLOP/sec (1.15 cycle/load) (previous slide)
8 FMA, 8 loads:
for (j=0; j<n; j+=8) {
  p00 += a[j+0]*a[j+2];
  m00 -= a[j+0]*a[j+2];
  p01 += a[j+1]*a[j+3];
  m01 -= a[j+1]*a[j+3];
  p10 += a[j+4]*a[j+6];
  m10 -= a[j+4]*a[j+6];
  p11 += a[j+5]*a[j+7];
  m11 -= a[j+5]*a[j+7];
}
1480 MFLOP/sec (0.6 cycle/load)
Interactive node: 740 MFLOP/sec (1.2 cycle/load)
Interactive nodes have a 1 cycle/MemOp barrier!
(The AGEN unit is disabled!)
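The MFLOP/sec and cycle/load figures on these two slides follow from simple arithmetic: each FMA counts as 2 floating point operations, so 8 FMAs are 16 flops per iteration. A minimal sketch of that arithmetic follows; the function names `mflops` and `cycles_per_load` are made up for illustration and do not come from the slides.

```c
#include <assert.h>

/* MFLOP/s from flops per loop iteration, measured cycles per iteration,
 * and the clock rate in MHz (hypothetical helper, not from the slides). */
double mflops(double flops_per_iter, double cycles_per_iter, double clock_mhz)
{
    assert(cycles_per_iter > 0.0);
    return flops_per_iter * clock_mhz / cycles_per_iter;
}

/* Average number of cycles spent per load in one loop iteration. */
double cycles_per_load(double cycles_per_iter, double loads_per_iter)
{
    assert(loads_per_iter > 0.0);
    return cycles_per_iter / loads_per_iter;
}
```

With 16 flops, 4.6 cycles/iteration and a 444 MHz clock, `mflops` gives roughly 1544, and 4.6 cycles over 4 loads gives 1.15 cycle/load, matching the first loop.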

7
A more realistic computation: Dot Product (DDOT)
Z = Σ Xi × Yi
2N float ops, 2N+2 load/stores, 4-way concurrency
(Diagram legend: load, store)
8
Optimized matrix-vector product: y = A x

Steady state: 6 float ops, 4 load/stores, 10-way concurrency
(Diagram labels: y[i], y[i+1], y[i+2]; x[j], x[j+1], x[j+2], x[j+3])
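One way to reach the steady-state count of 6 float ops per 4 loads is to compute three rows of y at once, so that each loaded x[j] is reused by three FMAs. The sketch below assumes a row-major n x n matrix; the function name `matvec_rows3` is hypothetical. It does not also unroll over j, as the diagram's x[j] ... x[j+3] labels suggest the slide's version does.

```c
#include <assert.h>
#include <stddef.h>

/* y = A x for a row-major n x n matrix, three rows at a time.
 * Inner loop per j: 4 loads (x[j] plus one element from each of 3 rows)
 * feeding 3 FMAs = 6 float ops. */
void matvec_rows3(size_t n, const double *A, const double *x, double *y)
{
    assert(n == 0 || (A && x && y));
    size_t i = 0;
    for (; i + 2 < n; i += 3) {
        const double *r0 = A + (i + 0) * n;
        const double *r1 = A + (i + 1) * n;
        const double *r2 = A + (i + 2) * n;
        double y0 = 0.0, y1 = 0.0, y2 = 0.0;
        for (size_t j = 0; j < n; j++) {
            double xj = x[j];          /* loaded once, used three times */
            y0 += r0[j] * xj;
            y1 += r1[j] * xj;
            y2 += r2[j] * xj;
        }
        y[i + 0] = y0;
        y[i + 1] = y1;
        y[i + 2] = y2;
    }
    for (; i < n; i++) {               /* leftover rows when n % 3 != 0 */
        double yi = 0.0;
        for (size_t j = 0; j < n; j++)
            yi += A[i * n + j] * x[j];
        y[i] = yi;
    }
}
```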
9
FFT butterfly
Note: on a typical processor, this leaves half the ALU power unused.
10 float ops, 10 load/stores, 4-way concurrency
(Diagram legend: load, store)
10
FLOP to MemOp ratio

Most programs have at most one FMA per MemOp:
DAXPY (Zi = A × Xi + Yi): 2 loads, 1 store, 1 FMA
DDOT (Z = Σ Xi × Yi): 2 loads, 1 FMA
Matrix-vector product: (k+1) loads, k FMAs
FFT butterfly: 8 MemOps, 10 floats (5 or 6 FMA)
A few have more (but they are in libraries):
Matrix multiply (well-tuned): 2 FMAs per load
Radix-8 FFT
Your program is probably limited by loads and stores!

FMA = Floating point Multiply Add
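To make the counts concrete, here are plain C versions of the first two kernels with the per-element memory operations noted in comments. This is a sketch, not library code; the names `daxpy` and `ddot` only echo the BLAS routines, and the signatures are simplified assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* DAXPY: z[i] = a*x[i] + y[i].
 * Per element: 2 loads (x[i], y[i]), 1 store (z[i]), 1 FMA. */
void daxpy(size_t n, double a, const double *x, const double *y, double *z)
{
    assert(n == 0 || (x && y && z));
    for (size_t i = 0; i < n; i++)
        z[i] = a * x[i] + y[i];
}

/* DDOT: z = sum of x[i]*y[i].
 * Per element: 2 loads (x[i], y[i]), 1 FMA; the sum stays in a register. */
double ddot(size_t n, const double *x, const double *y)
{
    assert(n == 0 || (x && y));
    double z = 0.0;
    for (size_t i = 0; i < n; i++)
        z += x[i] * y[i];
    return z;
}
```

In both loops every FMA needs at least two memory operations, which is why such code tends to be bound by the load/store units and not by the floating point units.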
11
Need independent instructions

Remember Little's Law!
4 ins/cyc x 3 cycle pipe => need 12-way independence
Many recent and future processors need even more.
Out-of-order execution helps.
But limited by instruction window size and branch prediction.
Compiler unrolling of inner loop also helps.
Compiler has inner loop execute, say, 4 points,
then interleaves the operations.
Requires lots of registers.

12
How unrolling gets more concurrency
12 independent operations/cycle
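As a rough illustration of what unrolling by 4 and interleaving looks like, here is a hand-unrolled DAXPY: all loads for four points are issued first, then four independent FMAs, then four stores, giving 12 operations per iteration that do not depend on each other within a group. A compiler would normally do this itself; the function name `daxpy_unroll4` is hypothetical, and the sketch assumes n is a multiple of 4 so it needs no cleanup loop.

```c
#include <assert.h>
#include <stddef.h>

/* z[i] = a*x[i] + y[i], four points per iteration with interleaved phases. */
void daxpy_unroll4(size_t n, double a, const double *x, const double *y,
                   double *z)
{
    assert(n % 4 == 0);  /* assumption of this sketch: no cleanup loop */
    for (size_t i = 0; i < n; i += 4) {
        /* 8 independent loads */
        double x0 = x[i], x1 = x[i + 1], x2 = x[i + 2], x3 = x[i + 3];
        double y0 = y[i], y1 = y[i + 1], y2 = y[i + 2], y3 = y[i + 3];
        /* 4 independent FMAs: none waits on another's result */
        double z0 = a * x0 + y0;
        double z1 = a * x1 + y1;
        double z2 = a * x2 + y2;
        double z3 = a * x3 + y3;
        /* 4 stores */
        z[i] = z0;
        z[i + 1] = z1;
        z[i + 2] = z2;
        z[i + 3] = z3;
    }
}
```

The twelve named temporaries are the reason unrolling requires lots of registers.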
13
Improving the MemOp to FLOP Ratio

for (i=1; i<N; i++)
  for (j=1; j<N; j++)
    b[i][j] = 0.25 * (a[i-1][j] + a[i+1][j] + a[i][j-1] + a[i][j+1]);
3 loads, 1 store, 4 floats

for (i=1; i<N-2; i+=3)
  for (j=1; j<N; j++) {
    b[i+0][j] = ... ;
    b[i+1][j] = ... ;
    b[i+2][j] = ... ;
  }
for (i=i; i<N; i++)
  ... /* Do last rows */
5 loads, 3 stores, 12 floats
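A runnable sketch of both loop nests follows, written against flat row-major arrays of size (N+1) x (N+1) so it stays self-contained. The names `avg4`, `relax_plain` and `relax_unroll3` are illustrative, not from the slides. The load counts on the slide assume the compiler keeps the left and centre neighbours in registers from the previous j iteration; the source below does not force that.

```c
#include <assert.h>
#include <stddef.h>

/* Average of the four neighbours of point (i, j); ld is the row length. */
static double avg4(const double *a, size_t ld, size_t i, size_t j)
{
    return 0.25 * (a[(i - 1) * ld + j] + a[(i + 1) * ld + j] +
                   a[i * ld + j - 1] + a[i * ld + j + 1]);
}

/* Straightforward version: one output row per outer iteration. */
void relax_plain(size_t N, const double *a, double *b)
{
    size_t ld = N + 1;
    for (size_t i = 1; i < N; i++)
        for (size_t j = 1; j < N; j++)
            b[i * ld + j] = avg4(a, ld, i, j);
}

/* Unrolled version: three output rows per outer iteration, so rows
 * i, i+1 and i+2 of a are shared between neighbouring stencils. */
void relax_unroll3(size_t N, const double *a, double *b)
{
    assert(N >= 2);
    size_t ld = N + 1, i;
    for (i = 1; i + 2 < N; i += 3)
        for (size_t j = 1; j < N; j++) {
            b[(i + 0) * ld + j] = avg4(a, ld, i + 0, j);
            b[(i + 1) * ld + j] = avg4(a, ld, i + 1, j);
            b[(i + 2) * ld + j] = avg4(a, ld, i + 2, j);
        }
    for (; i < N; i++)                 /* do last rows */
        for (size_t j = 1; j < N; j++)
            b[i * ld + j] = avg4(a, ld, i, j);
}
```

Both functions perform the same arithmetic on each point in the same order, so their results should be identical; only the amount of memory traffic per flop differs.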
14
Overcoming pipeline latency
for (i=0; i<size; i++) sum = a[i] + sum;
3.86 cycles/addition
Next add can't start until previous is finished (3 to 4 cycles later)

for (i=0; i<size; i+=8) {
  s0 += a[i+0]; s4 += a[i+4];
  s1 += a[i+1]; s5 += a[i+5];
  s2 += a[i+2]; s6 += a[i+6];
  s3 += a[i+3]; s7 += a[i+7];
}
sum = s0+s1+s2+s3+s4+s5+s6+s7;
0.5 cycle/addition
May change answer due to different rounding.
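A self-contained version of the two summation loops is sketched below. The function names `sum_serial` and `sum_unroll8` are made up here, and a cleanup loop has been added (it is not on the slide) so that sizes that are not a multiple of 8 also work.

```c
#include <assert.h>
#include <stddef.h>

/* One dependent chain: each add waits for the previous one to finish. */
double sum_serial(const double *a, size_t n)
{
    assert(n == 0 || a);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum = a[i] + sum;
    return sum;
}

/* Eight independent partial sums keep the floating point pipes full. */
double sum_unroll8(const double *a, size_t n)
{
    assert(n == 0 || a);
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0, s4 = 0, s5 = 0, s6 = 0, s7 = 0;
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        s0 += a[i + 0]; s4 += a[i + 4];
        s1 += a[i + 1]; s5 += a[i + 5];
        s2 += a[i + 2]; s6 += a[i + 6];
        s3 += a[i + 3]; s7 += a[i + 7];
    }
    for (; i < n; i++)                 /* leftover elements */
        s0 += a[i];
    return s0 + s1 + s2 + s3 + s4 + s5 + s6 + s7;
}
```

Because floating point addition is not associative, the two functions can differ in the last bits for general data; they agree exactly only when every partial sum is exactly representable, as with small integers.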
15
The SP Memory Hierarchy

L1 data cache
64 KBytes = 512 lines, each 128 Bytes long
128-way set associative (!!)
Prefetching for up to 4 streams
6 or 7 cycle penalty for L1 cache miss
Data TLB
1 MB = 256 pages, each 4 KBytes
2-way set associative
25 to 100s cycle penalty for TLB miss
L2 cache
4 MByte = 32,768 lines, each 128 Bytes long
1-way (4-way on Nighthawk 2) associative, randomized (!!)
Can only hold data from 1024 different pages
35 cycle penalty for L2 miss
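The sizes above are related by simple division, and it can help to check them. A small sketch follows, with hypothetical helper names (`cache_lines`, `cache_sets`, `tlb_reach_bytes`) chosen for this example.

```c
#include <assert.h>

/* Number of lines in a cache of the given total size and line size. */
unsigned long cache_lines(unsigned long cache_bytes, unsigned long line_bytes)
{
    assert(line_bytes > 0 && cache_bytes % line_bytes == 0);
    return cache_bytes / line_bytes;
}

/* Number of sets: lines divided by the associativity (ways per set). */
unsigned long cache_sets(unsigned long cache_bytes, unsigned long line_bytes,
                         unsigned long ways)
{
    unsigned long lines = cache_lines(cache_bytes, line_bytes);
    assert(ways > 0 && lines % ways == 0);
    return lines / ways;
}

/* Amount of memory the TLB can map at once. */
unsigned long tlb_reach_bytes(unsigned long entries, unsigned long page_bytes)
{
    assert(entries > 0 && page_bytes > 0);
    return entries * page_bytes;
}
```

For example, 64 KBytes in 128-Byte lines is 512 lines; at 128 ways per set that is only 4 sets. The 4 MByte L2 divided into 4 KByte pages gives the 1024 pages mentioned above.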

16
So what??

Sequential access (or small stride) is good
Random access within a limited range is OK:
Within 64 KBytes: in L1; 1 cycle per MemOp
Within 1 MByte: up to 7-cycle L1 penalty per 16 words (prefetching hides some cache miss penalty)
Within 4 MByte: may get 25 cycle TLB penalty
Larger range: huge (80 - 200 cycle) penalties

17
Stride-one memory access: sum list of floats (times for second time through data)
Uncached data: 4.6 cycles per load
(Graph labels: L1 cache, L2 cache, TLB)
18
Strided Memory Access
Program adds 4440 4-Byte integers located at given stride (performed on interactive node)
(Graph labels: > 1 element/cacheline; > 1 element/page; 1 element/page (as bad as it gets); TLB misses start; L1 misses start)
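An experiment of this kind can be approximated with a loop like the one below. This is a sketch, not the program used for the slide; the names `sum_strided` and `seconds_per_element` and the use of `clock()` for timing are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Add `count` integers spaced `stride` elements apart.
 * The caller must provide at least (count - 1) * stride + 1 elements. */
long sum_strided(const int *data, size_t count, size_t stride)
{
    assert(count == 0 || (data && stride > 0));
    long total = 0;
    for (size_t k = 0; k < count; k++)
        total += data[k * stride];
    return total;
}

/* Average CPU seconds per element over `repeats` passes. The running
 * total is written to *sink so the compiler cannot drop the loop. */
double seconds_per_element(const int *data, size_t count, size_t stride,
                           int repeats, long *sink)
{
    assert(count > 0 && repeats > 0 && sink);
    long total = 0;
    clock_t t0 = clock();
    for (int r = 0; r < repeats; r++)
        total += sum_strided(data, count, stride);
    clock_t t1 = clock();
    *sink = total;
    return (double)(t1 - t0) / CLOCKS_PER_SEC / ((double)count * repeats);
}
```

Running `seconds_per_element` with count 4440 over a range of strides (in elements, so 32 elements is one 128-Byte line and 1024 elements is one 4 KByte page) should show the same kind of steps where L1 and TLB misses start, though the exact numbers depend on the machine.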
19
Sun E10000's Sparc II processor