Principles of High Performance Computing ICS 632
Transcript and Presenter's Notes

Title: Principles of High Performance Computing ICS 632


1
Principles of High Performance Computing (ICS 632)
  • Performance of Sequential Programs

2
Performance
  • We will mostly talk about how to make code go
    fast, hence the high performance
  • Performance conflicts with other concerns
  • Correctness
  • You will see that when trying to make code go
    fast one often breaks it
  • Readability
  • Fast code typically requires more lines!
  • Modularity can hurt performance
  • e.g., virtual classes
  • Portability
  • Code that is fast on machine A can be slow on
    machine B
  • At the extreme, highly optimized code is not
    portable at all, and in fact is done in hardware!

3
Why Performance?
  • To do a time-consuming operation in less time
  • I am an aircraft engineer
  • I need to run a simulation to test the stability of the wings at high aircraft velocity
  • I'd rather have the result in 5 minutes than in 5 hours so that I can complete the aircraft final design sooner.
  • To do an operation before a tighter deadline
  • I am a weather prediction agency
  • I am getting input from weather stations/sensors
  • I'd like to make the forecast for tomorrow before tomorrow
  • To do a high number of operations per second
  • I am the CTO of Amazon.com
  • My Web server gets 1,000 hits per second
  • I'd like my Web server and my databases to handle 1,000 transactions per second to reduce customer delay
  • Amazon does process several GBytes of data per second

4
How to Improve Performance?
  • Option 1: Buy faster hardware
  • Only gets you so far for so long
  • Sometimes the amount of hardware to buy would be staggering, and one can't just wait for technology improvements and price drops
  • Better to achieve the same effect by modifying the code a little bit

5
How to Improve Performance?
  • Option 2: Modify the algorithm
  • Example: searching for an element in a sorted array
  • First implementation: a linear search
  • Easy to write at first
  • Does the job
  • When performance becomes an issue, replace the linear search by a binary search (see the sketch below)
  • More complex code
  • Goes much faster for large arrays
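  • As a concrete illustration (not in the original slides), a minimal C sketch of the two alternatives; the function names and the int element type are just assumptions:

      #include <stddef.h>

      /* O(n): scan every element until a match is found */
      int linear_search(const int *a, size_t n, int key) {
          for (size_t i = 0; i < n; i++)
              if (a[i] == key) return (int)i;
          return -1;
      }

      /* O(log n): repeatedly halve the search interval (a must be sorted) */
      int binary_search(const int *a, size_t n, int key) {
          size_t lo = 0, hi = n;              /* search within [lo, hi) */
          while (lo < hi) {
              size_t mid = lo + (hi - lo) / 2;
              if (a[mid] == key) return (int)mid;
              if (a[mid] < key)  lo = mid + 1;
              else               hi = mid;
          }
          return -1;
      }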

6
How to Improve Performance?
  • Option 3: Modify the data structures
  • Example: Linked List
  • The list.length() method computes the length by going through the list and incrementing a counter
  • If users call the method often and/or the list is long, this can cause significant overhead
  • Instead, add a length attribute to the list class, and do +1 and -1 on it on insertion and removal
  • The new list.length() method just returns the length attribute
  • This will vastly speed up list.length(), will minimally slow down list.insert() and list.remove(), and will minimally increase memory consumption (by 4 bytes)
  • A sketch of this idea appears below
  • Example: Replace a List by a Heap
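  • A minimal C sketch (not from the original slides; the struct and function names are made up) of a list that caches its length:

      #include <stdlib.h>

      typedef struct node { int value; struct node *next; } node_t;
      typedef struct list { node_t *head; size_t length; } list_t;

      /* O(1): return the cached length instead of walking the list */
      size_t list_length(const list_t *l) { return l->length; }

      /* insertion pays one extra increment to keep the counter up to date */
      void list_insert_front(list_t *l, int value) {
          node_t *n = malloc(sizeof *n);
          n->value = value;
          n->next = l->head;
          l->head = n;
          l->length++;               /* the "+1" mentioned above */
      }

      /* removal pays one extra decrement */
      void list_remove_front(list_t *l) {
          if (l->head == NULL) return;
          node_t *n = l->head;
          l->head = n->next;
          free(n);
          l->length--;               /* the "-1" mentioned above */
      }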

7
How to Improve Performance?
  • Option 4: Modify the implementation
  • Do not change the spirit of the algorithm but...
  • Shuffle lines of code around
  • to do instructions in a different order
  • to remove optimization blockers
  • Modify code organization
  • e.g., remove classes
  • e.g., modify data structures
  • etc.

8
How to Improve Performance
  • Option 5: Use concurrency
  • Multi-threaded code on a single-CPU machine to
    utilize hardware resources more effectively
  • Multi-threaded code on a multi-CPU/multi-core
    machine

9
Performance as Time
  • Time between the start and the end of an
    operation
  • Also called running time, elapsed time,
    wall-clock time, response time, latency,
    execution time, ...
  • Most straightforward measure: my program takes 12.5s on a 3.5GHz Pentium
  • Can be normalized to some reference time
  • Must be measured on a dedicated machine

10
Performance as Rate
  • Used often so that performance can be independent of the size of the application
  • e.g., compressing a 1MB file takes 1 minute and compressing a 2MB file takes 2 minutes: the performance is the same
  • Millions of instructions / sec (MIPS)
  • MIPS = instruction count / (execution time × 10^6) = clock rate / (CPI × 10^6)
  • But Instruction Set Architectures are not equivalent
  • 1 CISC instruction = many RISC instructions
  • Programs use different instruction mixes
  • May be OK for comparing the same program on the same architecture

11
Performance as Rate
  • Millions of floating point operations / sec (MFlops)
  • Very popular, but often misleading
  • e.g., a high MFlops rate for a stupid algorithm can go with poor application performance
  • Application-specific rates:
  • Millions of frames rendered per second
  • Millions of amino-acids compared per second
  • Millions of HTTP requests served per second
  • Application-specific metrics are often preferable; other metrics may be misleading
  • MFlops can be application-specific though
  • For instance:
  • I want to add two n-element vectors
  • This requires n floating point operations
  • Therefore MFlops is a good measure

12
Measuring Performance Rates
  • How do we measure performance rates?
  • Time a section of code
  • Count how many items are done in that section
    of the code
  • Compute the rate as the number of items divided
    by the measured time
  • Example

      start_stopwatch();
      for (i = 0; i < 1000000; i++)
        x = y * z + a;
      stop_stopwatch();

  • Number of MFlop = 2 (1,000,000 additions, 1,000,000 multiplications)
  • MFlops rate = 2 / time (with the time measured in seconds)

13
Peak Performance?
  • Resource vendors always talk about peak
    performance rate
  • Computed based on specifications of the machine
  • For instance
  • I build a machine with 2 floating point units
  • Each unit can do an operation in 2 cycles
  • My CPU runs at 1GHz
  • Therefore I have a 2 × 1GHz / 2 = 1 GFlops machine
  • Problem
  • In real code you will never be able to use the two floating point units constantly
  • Data needs to come from memory, which causes the floating point units to be idle
  • Typically, real code achieves only an (often small) fraction of the peak performance

14
Benchmarks
  • Since many performance metrics turn out to be
    misleading, people have designed benchmarks
  • Example: the SPEC benchmarks
  • Integer benchmark
  • Floating point benchmark
  • These benchmarks are typically a collection of several codes that come from real-world software
  • The question "what is a good benchmark?" is difficult
  • If the benchmarks do not correspond to what you'll do with the computer, then the benchmark results are not relevant to you

15
How About GHz?
  • This is often the way in which people say that a computer is better than another
  • More instructions per second for a higher clock rate
  • Faces the same problems as MIPS
  • But usable within a specific architecture

16
Program Performance
  • In this class we're not really concerned with determining the performance of a compute platform (whichever way it is defined)
  • Instead we're concerned with improving a program's performance
  • For a given platform, take a given program
  • Run it and measure its wall-clock time
  • Enhance it, run it, and quantify the performance improvement
  • i.e., the reduction in wall-clock time
  • For each version compute its performance
  • preferably as a relevant performance rate
  • so that you can say "the best implementation we have so far goes this fast" (perhaps as a % of the peak performance)

17
The UNIX time Command
  • You can put time in front of any UNIX command you
    invoke
  • When the invoked command completes, time prints
    out timing (and other) information
  • time ls /home/casanova/ -la -R
  • 0.520u 1.570s 0:20.58 10.1% 0+0k 570+105io 0pf+0w
  • 0.520u : 0.52 seconds of user time
  • 1.570s : 1.57 seconds of system time
  • 0:20.58 : 20.58 seconds of wall-clock time
  • 10.1% : 10.1% of the CPU was used
  • 0+0k : memory used (text + data)
  • 570+105io : 570 input, 105 output (file system I/O)
  • 0pf+0w : 0 page faults and 0 swaps

18
User, System, Wall-Clock?
  • User Time time that the code spends executing
    user code (i.e., non system calls)
  • System Time time that the code spends executing
    system calls
  • Wall-Clock Time time from start to end
  • Wall-Clock ≠ User + System
  • in our example: 20.58 > 0.52 + 1.57
  • Why?
  • because the process can be suspended by the O/S
    due to contention for the CPU by other processes
  • because the process can be blocked waiting for
    I/O

19
Using time
  • It's interesting to know what the user time and the system time are
  • for instance, if the system time is really high, it may be that the code makes too many calls to malloc(), for instance
  • But one would really need more information to fix the code (it is not always clear which system calls may be responsible for the high system time)
  • Wall-clock - system - user = I/O + suspended
  • If the system is dedicated, suspended ≈ 0
  • Therefore one can estimate the cost of I/O
  • In our example: 20.58 - 1.57 - 0.52 = 18.49 seconds attributable to I/O (on a dedicated machine)
  • If I/O is really high, one may want to look at reducing I/O or doing I/O better
  • Therefore, time gives us insight into bottlenecks and gives us the wall-clock time

20
Drawbacks of UNIX time
  • The time command has poor resolution
  • Only milliseconds
  • Sometimes we want higher precision, especially if our performance improvements are in the 1-2% range
  • time times the whole code
  • Sometimes we're only interested in timing some part of the code, for instance the one that we are trying to optimize
  • Sometimes we want to compare the execution times of different sections of the code

21
Timing with gettimeofday
  • gettimeofday from the standard C library
  • Measures the time elapsed since midnight (UTC), Jan 1st, 1970, expressed in seconds and microseconds

      #include <sys/time.h>

      struct timeval start;
      ...
      gettimeofday(&start, NULL);
      printf("%ld,%ld\n", start.tv_sec, start.tv_usec);
      ...

  • Can be used to time sections of code
  • Call gettimeofday at the beginning of the section
  • Call gettimeofday at the end of the section
  • Compute the elapsed time, e.g., in seconds:
      (end.tv_sec*1000000.0 + end.tv_usec - start.tv_sec*1000000.0 - start.tv_usec) / 1000000.0
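  • Putting it together, a minimal sketch (not from the original slides; the dot-product loop and the array size are made up) that times a section and reports a MFlops rate:

      #include <stdio.h>
      #include <stdlib.h>
      #include <sys/time.h>

      #define N 1000000

      int main(void) {
          double *x = malloc(N * sizeof *x);
          double *y = malloc(N * sizeof *y);
          for (int i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }

          struct timeval start, end;
          gettimeofday(&start, NULL);

          double sum = 0.0;
          for (int i = 0; i < N; i++)
              sum += x[i] * y[i];          /* 1 addition + 1 multiplication per iteration */

          gettimeofday(&end, NULL);

          double elapsed = (end.tv_sec * 1000000.0 + end.tv_usec
                            - start.tv_sec * 1000000.0 - start.tv_usec) / 1000000.0;

          /* 2N floating-point operations in total */
          printf("sum = %g, time = %g s, rate = %g MFlops\n",
                 sum, elapsed, 2.0 * N / elapsed / 1e6);

          free(x); free(y);
          return 0;
      }

  • On a fast machine the timed loop may need to be made longer for the measurement to be meaningful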

22
Other Ways to Time Code
  • ntp_gettime() (Internet RFC 1589)
  • Sort of like gettimeofday, but reports the estimated error on the time measurement
  • Not available on all systems
  • Part of the GNU C Library
  • Java: System.currentTimeMillis()
  • Known to have resolution problems, with a resolution coarser than 1 millisecond!
  • Solution: use a native interface to a better timer
  • Java: System.nanoTime()
  • Added in J2SE 5.0
  • Probably not accurate at the nanosecond level
  • There is plenty of material on high-precision timing in Java on the Web

23
Dedicated Systems
  • Measuring the performance of a code must be done on a dedicated system
  • No other user can start a process
  • The user measuring the performance only runs the
    minimum amount of processes
  • basically, a shell
  • single-user mode is typically considered
    overkill
  • Nevertheless, one should always present
    measurement results as averages over several
    experiments
  • Because the (small) load imposed by the O/S is
    not deterministic
  • In your assignments, always show averages over 10
    experiments, or more if asked to do so explicitly

24
How do I speed up my code?
  • One option to make code faster is basically to
    monkey around with the code
  • Let's look at some examples of what one can do by hand
  • These techniques were very popular before compilers were any good
  • Of course, we'll talk about what the compiler can do nowadays

25
Optimization Techniques
  • Technique 1: identify loop constants

      for (k = 0; k < N; k++)
        c[i][j] += a[i][k] * b[k][j];

      sum = 0;
      for (k = 0; k < N; k++)
        sum += a[i][k] * b[k][j];
      c[i][j] = sum;

26
Optimization Techniques
  • Technique 2: replace array accesses by pointer dereferences

      for (j = 0; j < N; j++)
        a[i][j] = 2;              // 2N adds, N multiplies (index computation)

      double *ptr = &(a[i][0]);   // 2 adds, 1 multiply
      for (j = 0; j < N; j++) {
        *ptr = 2;
        ptr++;                    // N integer additions
      }

27
Optimization Techniques
  • Technique 3: loop unrolling

      for (i = 0; i < 100; i++)
        a[i] = i;

      i = 0;
      do {
        a[i] = i; i++;
        a[i] = i; i++;
        a[i] = i; i++;
        a[i] = i; i++;
      } while (i < 100);   // fewer comparisons

28
Optimization Techniques
  • Technique 4: code motion

      sum = 0;
      for (i = 0; i < fact(n); i++)
        sum += i;

      sum = 0;
      f = fact(n);
      for (i = 0; i < f; i++)
        sum += i;

29
Optimization Techniques
  • Technique 5: inlining

      for (i = 0; i < N; i++) sum += cube(i);
      ...
      int cube(int i) { return i*i*i; }

      for (i = 0; i < N; i++) sum += i*i*i;

30
Other Techniques
  • Common sub-expression elimination

      x = a + b - c;
      y = a + d + e + b;

      tmp = a + b;
      x = tmp - c;
      y = tmp + d + e;

31
Other Techniques
  • Dead code elimination

      x = 12;
      ...
      x = a + c;

      ...
      x = a + c;

    Seems obvious, but it may be hidden:

      int x = 0;
      ...
      #ifdef FOO
        x = f(3);
      #else
        ...
32
Other Techniques
  • Strength reduction

      a = i * 3;        becomes        a = i + i + i;

  • Constant propagation

      int speedup = 3;
      efficiency = 100 * speedup / numprocs;
      x = efficiency * 2;

      x = 600 / numprocs;

33
Now what?
  • There are many other techniques
  • We could apply them all to the code but this
    would result in completely unreadable/undebuggable
    code
  • Fortunately, the compiler should come to the
    rescue
  • To some extent, at least
  • Good compilers can do a lot for you
  • Typically compilers provided by a vendor can do
    pretty tricky optimizations

34
What do compilers do?
  • All modern compilers perform some automatic
    optimization when generating code
  • In fact, you implement some of those in a
    graduate-level compiler class, and sometimes at
    the undergraduate level.
  • Most compilers provide several levels of
    optimization
  • -O0: no optimization
  • in fact some is always done
  • -O1, -O2, ..., -OX
  • The higher the optimization level, the higher the probability that a debugger may have trouble dealing with the code
  • Always debug with -O0
  • some compilers enforce that -g means -O0
  • Some compilers will flat out tell you that higher levels of optimization may break some code!
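  • For example, a typical set of gcc invocations (the file names are just placeholders):

      gcc -O0 -g prog.c -o prog_debug    # no optimization, keep debugging info
      gcc -O2    prog.c -o prog          # standard optimization level
      gcc -O3    prog.c -o prog_fast     # more aggressive optimizations
      gcc -Os    prog.c -o prog_small    # optimize for code size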

35
Compiler optimizations
  • gcc is a pretty good, free compiler
  • -Os: optimize for size
  • Some optimizations increase code size tremendously
  • Do a "man gcc" and look at the many optimization options
  • one can pick and choose,
  • or just use standard sets via -O1, -O2, etc.
  • The fanciest compilers are typically the ones provided by vendors
  • You can't sell a good machine if it has a bad compiler
  • Compiler technology used to be really poor
  • also, languages used to be designed without thinking of compilers (FORTRAN, Ada)
  • no longer true: every language designer has an in-depth understanding of compiler technology today

36
What can compilers do?
  • Many, many things
  • Inlining
  • Assignment of variables to registers
  • It's a difficult problem
  • Dead code elimination
  • Algebraic simplification
  • Moving invariant code out of loops
  • Constant propagation
  • Control flow simplification
  • Instruction scheduling, reordering
  • Strength reduction
  • e.g., add to pointers, rather than doing array
    index computation
  • Loop unrolling and software pipelining
  • Dead store elimination
  • and many others...

37
Instruction scheduling
  • Modern computers have multiple functional units
    that could be used in parallel
  • Or at least ones that are pipelined
  • if fed operands at each cycle they can produce a
    result at each cycle
  • although a computation may require 20 cycles
  • Instruction scheduling
  • Reorder the instructions of a program
  • e.g., at the assembly code level
  • Preserve correctness
  • Make it possible to use functional units optimally

38
Instruction Scheduling
  • One cannot just shuffle all instructions around
  • Preserving correctness means that data
    dependences are unchanged
  • Three types of data dependences (see the C illustration below)
  • True dependence
      a = ...
      ... = a
  • Output dependence
      a = ...
      a = ...
  • Anti dependence
      ... = a
      a = ...
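  • A concrete C illustration (not from the original slides; the variable names just mirror the patterns above):

      void dependences(int b, int c, int e) {
          int a, d;

          /* true dependence: the second statement reads the a written by the first */
          a = b + c;
          d = a * 2;

          /* output dependence: both statements write a, so their order must be preserved */
          a = d - e;
          a = e * 2;

          /* anti dependence: the first statement reads a before the second overwrites it */
          d = a + 1;
          a = b * 2;

          (void)a; (void)d;   /* silence unused-value warnings */
      }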

39
Instruction Scheduling Example
      Before:               After:
        ...                   ...
        ADD  R1,R2,R4         ADD  R1,R2,R4
        ADD  R2,R2,1          LOAD R4,@2
        ADD  R3,R6,R2         ADD  R2,R2,1
        LOAD R4,@2            ADD  R3,R6,R2
        ...                   ...

  • Since loading from memory can take many cycles, one may as well do it as early as possible
  • The LOAD cannot be moved any earlier because of the anti-dependence on R4

40
Software Pipelining
  • Fancy name for instruction scheduling for loops
  • Can be done by a good compiler
  • First unroll the loop
  • Then make sure that instructions can happen in
    parallel
  • i.e., scheduling them on functional units
  • Let's see a simple example

41
Example
  • Source code: for (i = 0; i < n; i++) sum += a[i];
  • Loop body in assembly:

      r1 = L r0    --- stall    r2 = Add r2,r1    r0 = Add r0,4

  • Unroll the loop and allocate registers:

      r1 = L r0    --- stall    r2 = Add r2,r1    r0 = Add r0,12
      r4 = L r3    --- stall    r2 = Add r2,r4    r3 = Add r3,12
      r7 = L r6    --- stall    r2 = Add r2,r7    r6 = Add r6,12
      r10 = L r9   --- stall    r2 = Add r2,r10   r9 = Add r9,12

  • May be very difficult
42
Example (cont.)
Schedule Unrolled Instructions, exploiting
instruction level parallelism if possible
      r1 = L r0         r4 = L r3
      r2 = Add r2,r1    r7 = L r6         r0 = Add r0,12
      r2 = Add r2,r4    r10 = L r9        r3 = Add r3,12
      r2 = Add r2,r7    r1 = L r0         r6 = Add r6,12
      r2 = Add r2,r10   r4 = L r3         r9 = Add r9,12
      r2 = Add r2,r1    r7 = L r6         r0 = Add r0,12
      r2 = Add r2,r4    r10 = L r9        r3 = Add r3,12
      r2 = Add r2,r7    r1 = L r0         r6 = Add r6,12
      r2 = Add r2,r10   r4 = L r3         r9 = Add r9,12
      r2 = Add r2,r1    r7 = L r6         . . .
      r0 = Add r0,12    r2 = Add r2,r4    r10 = L r9
      r3 = Add r3,12    r2 = Add r2,r7
      r6 = Add r6,12    r2 = Add r2,r10   r9 = Add r9,12

  Identify the repeating pattern (the kernel)
43
Example (cont.)
  • Loop becomes

    prologue:
      r1 = L r0         r4 = L r3
      r2 = Add r2,r1    r7 = L r6         r0 = Add r0,12
      r2 = Add r2,r4    r10 = L r9        r3 = Add r3,12

    kernel (repeated):
      r2 = Add r2,r7    r1 = L r0         r6 = Add r6,12
      r2 = Add r2,r10   r4 = L r3         r9 = Add r9,12
      r2 = Add r2,r1    r7 = L r6         r0 = Add r0,12
      r2 = Add r2,r4    r10 = L r9        r3 = Add r3,12

    epilogue:
      r2 = Add r2,r7    r6 = Add r6,12
      r2 = Add r2,r10   r9 = Add r9,12
44
Software Pipelining
  • The kernel may require many registers, and it's nice to know how to use as few as possible
  • otherwise, one may have to go to cache more often, which may negate the benefits of software pipelining
  • Dependency constraints must be respected
  • May be very difficult to analyze for complex nested loops
  • Software pipelining with registers is a well-known NP-hard problem

45
Limits to Compiler Optimization
  • Behavior that may be obvious to the programmer
    can be obfuscated by languages and coding styles
  • e.g., data ranges may be more limited than
    variable types suggest
  • e.g., using an int in C for what could be an
    enumerated type
  • Most analysis is performed only within procedures
  • whole-program analysis is too expensive in most
    cases
  • Most analysis is based only on static information
  • compiler has difficulty anticipating run-time
    inputs
  • When in doubt, the compiler must be conservative
  • cannot perform optimization if it changes program
    behavior under any realizable circumstance
  • even if circumstances seem quite bizarre and
    unlikely

46
So where are we now?
  • We have seen techniques to optimize code
  • reducing the number of instructions
  • instruction scheduling
  • memory access management
  • But compilers do a lot of things
  • So, does it mean that we, as software developers
    have nothing to worry about?
  • Sadly, no

47
Good practice
  • Writing code for high performance means working
    hand-in-hand with the compiler
  • Principle 1 Optimize things that we know the
    compiler cannot deal with
  • We'll see a few such examples in the next set of slides
  • Principle 2 Write code so that the compiler can
    do its optimizations
  • Remove optimization blockers

48
Optimization blocker aliasing
  • Aliasing: two pointers point to the same location
  • If a compiler can't tell what a pointer points at, it must assume it can point at almost anything
  • Example:

      void foo(int *q, int *p) {
        *q = 3;
        (*p)++;
        *q *= 4;
      }

    cannot be safely optimized to

        (*p)++;
        *q = 12;

    because perhaps p == q
  • Some compilers have pretty fancy aliasing analysis capabilities

49
Blocker False Dependencies
  • A special case of aliasing

      a[i]   = b[i]   + c;
      a[i+1] = b[i+1] + d;

  • The compiler cannot know that &(b[i+1]) is different from &(a[i])
  • Therefore, it can't do efficient instruction scheduling
  • Instead, one should write the code as

      float f1 = b[i];
      float f2 = b[i+1];
      a[i]   = f1 + c;
      a[i+1] = f2 + d;

  • Use local variables to expose the independent operations
  • Some compilers allow users to give them hints
  • e.g., declare arrays a and b unaliased via some keyword (see the restrict sketch below)
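  • In C99, for instance, the restrict qualifier is one such hint; a minimal sketch (the function and parameter names are made up for illustration):

      /* By qualifying the pointers with restrict, the programmer promises that
         a and b do not overlap, so the compiler may schedule loads/stores freely. */
      void update(float *restrict a, const float *restrict b,
                  float c, float d, int i) {
          float f1 = b[i];
          float f2 = b[i + 1];
          a[i]     = f1 + c;
          a[i + 1] = f2 + d;
      }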

50
Blocker Function Call
      sum = 0;
      for (i = 0; i < fact(n); i++)
        sum += i;
  • A compiler cannot optimize this because
  • function fact may have side-effects
  • e.g., modifies global variables
  • Function May Not Return Same Value for Given
    Arguments
  • Depends on other parts of global state, which may
    be modified in the loop
  • Why doesn't the compiler look at the code for fact?
  • Linker may overload with different version
  • Unless declared static
  • Interprocedural optimization is not used
    extensively due to cost
  • Inlining can achieve the same effect for small
    procedures
  • Again
  • Compiler treats procedure call as a black box
  • Weakens optimizations in and around them

51
Other Techniques
  • Use more local variables

      while (...) {
        *res = filter[0]*signal[0]
             + filter[1]*signal[1]
             + filter[2]*signal[2];
        signal++;
      }

    Helps some compilers:

      register float f0 = filter[0];
      register float f1 = filter[1];
      register float f2 = filter[2];
      while (...) {
        *res = f0*signal[0]
             + f1*signal[1]
             + f2*signal[2];
        signal++;
      }
52
Other Techniques
  • Replace pointer updates for strided memory
    addressing with constant array offsets

    Option 1 (pointer updates):

      f0 = *r8;  r8 += 4;
      f1 = *r8;  r8 += 4;
      f2 = *r8;  r8 += 4;

    Option 2 (constant offsets):

      f0 = r8[0];
      f1 = r8[4];
      f2 = r8[8];
      r8 += 12;

    Some compilers are better at figuring this out than others.
    Some systems may go faster with option 1, some others with option 2!
53
Bottom line
  • Know your compilers
  • Some are great
  • Some are not so great
  • Some will not do things that you think they
    should do
  • often because you forget about things like
    aliasing
  • There is no golden rule, because there are some system-dependent behaviors
  • Although the general principles typically hold
  • Doing all optimization by hand is a bad idea in general
  • But we're doing it in the class for some of the programming assignments, to truly understand code, hardware, and performance.

54
By-hand Optimization of Matrix Multiplication
    Optimized version:

      for (i = 0; i < SIZE; i++) {
        int *orig_pa = &a[i][0];
        for (j = 0; j < SIZE; j++) {
          int *pa = orig_pa;
          int *pb = &b[0][j];
          int sum = 0;
          for (k = 0; k < SIZE; k++) {
            sum += *pa * *pb;
            pa++;
            pb += SIZE;
          }
          c[i][j] = sum;
        }
      }

    Original version:

      for (i = 0; i < SIZE; i++)
        for (j = 0; j < SIZE; j++)
          for (k = 0; k < SIZE; k++)
            c[i][j] += a[i][k] * b[k][j];

  • Turned array accesses into pointer dereferences
  • Assign to each element of c just once

55
Results (Courtesy of CMU)
56
Why is Simple Sometimes Better?
  • Easier for humans and the compiler to understand
  • The more the compiler knows the more it can do
  • Pointers are hard to analyze, arrays are easier
  • You never know how fast code will run until you
    time it on a dedicated system
  • The transformations we did by hand, good optimizers will often do for us
  • And they will often do a better job than we can, but not always
  • Pointers may cause aliases and data dependences
    where the array code had none

57
Bottom Line
  • How should I write my programs, given that I have
    a good, optimizing compiler?
  • Don't smash code into oblivion
  • Hard to read, maintain, and ensure correctness
  • Do
  • Select the best algorithm
  • Write code that's readable and maintainable
  • Procedures, recursion, without built-in constant limits
  • Even though these factors can slow down code
  • Eliminate optimization blockers
  • Allows compiler to do its job

58
Good Performance?
  • You have a code that was given to you or that you
    wrote
  • You compile it with your favorite optimizing
    compiler, you have removed obvious optimization
    blockers
  • And then, performance is poor
  • Not fast enough for the code to be useful or to meet deadlines
  • The code could still be usable, but it leads to long waits, and you can tell that the performance is way below the peak performance
  • What do you do?

59
Why is Performance Poor?
  • Performance is poor because the code suffers from
    a performance bottleneck
  • Definition
  • An application runs on a platform that has many
    components
  • CPU, Memory, Operating System, Network, Hard
    Drive, Video Card, etc.
  • Pick a component and make it faster
  • If the application performance increases, that
    component was the bottleneck!

60
Identifying a Bottleneck
  • It can be difficult
  • You're not going to change the memory bus just to see what happens to the application
  • But you can run the code on a different machine
    and see what happens
  • Typical Approach
  • Know/discover the characteristics of the machine
  • Know/discover the characteristics of the
    application
  • Observe the application execution on the machine
  • Reason about what the bottleneck is
  • Luckily there are well-known bottlenecks that are
    likely candidates when performance is poor

61
Removing a Bottleneck
  • Brute force Hardware Upgrade
  • Sometimes necessary, but can only get you so far
    and may be very costly
  • e.g., memory technology
  • Instead, modify the code
  • The bottleneck is there because the code uses a resource heavily or in a non-intelligent manner
  • This is, unfortunately, what we often have to do after the fact
  • You wrote a beautifully structured/modular code
  • It's slow, and you have to decrease readability and modularity to increase performance

62
The Memory Bottleneck
  • The memory is a very common bottleneck that beginning programmers often don't think about
  • When you look at code, you often pay more attention to computation

      a[i] = b[j] + c[k];

  • The accesses to the 3 arrays take more time than doing the addition
  • For the code above, the memory is the bottleneck for many machines!

63
Why the Memory Bottleneck?
  • In the 70s, everything was balanced
  • The memory kept pace with the CPU
  • n cycles to execute an instruction, n cycles to
    bring in a word from memory
  • No longer true
  • CPUs have gotten 1,000x faster
  • Memory has gotten 10x faster and 1,000,000x larger
  • Flops are free and bandwidth is expensive and
    processors are STARVED for data

64
Current Memory Technology
source: http://www.xbitlabs.com/articles/memory/display/ddr2-ddr_2.html
65
Memory Bottleneck Example
  • Fragment of code: a[i] = b[j] + c[k];
  • Three memory references: 2 reads, 1 write
  • One addition can be done in one cycle
  • If the memory bandwidth is 12.8 GB/sec, then the rate at which the processor can access integers (4 bytes) is 12.8 × 1024 × 1024 × 1024 / 4 ≈ 3.4 GHz
  • The above code needs to access 3 integers per addition
  • Therefore, the rate at which the code gets its data is only ≈ 1.1 GHz
  • But the CPU could perform additions at 4 GHz!
  • Therefore: the memory is the bottleneck
  • And we assumed the memory worked at its peak!!!
  • We ignored other possible overheads on the bus
  • In practice the gap can be around a factor of 15 or higher

66
Reducing the Memory Bottleneck
  • The way in which computer architects have dealt
    with the memory bottleneck is via the memory
    hierarchy

    The hierarchy, from small and fast to larger, slower, and cheaper: registers, caches, memory, disk.
    Approximate access times:

      register reference:           sub-ns
      L1-cache (SRAM) reference:    1-2 cycles
      L2-cache (SRAM) reference:    10 cycles
      L3-cache (DRAM) reference:    20 cycles
      memory (DRAM) reference:      hundreds of cycles
      disk reference:               tens of thousands of cycles
67
Locality
  • The memory hierarchy is useful because of
    locality
  • Temporal locality a memory location that was
    referenced in the past is likely to be referenced
    again
  • Spatial locality a memory location next to one
    that was referenced in the past is likely to be
    referenced in the near future
  • This is great, but when we write our code for performance we want our code to have the maximum amount of locality
  • The compiler can do some work for us regarding
    locality
  • But unfortunately not everything

68
Programming for Locality
  • Essentially, a programmer should keep a mental
    picture of the memory layout of the application,
    and reason about locality
  • When writing concurrent code on a multi-core architecture, one must also think of which caches are shared/private
  • This can be extremely complex, but there are a
    few well-known techniques
  • The typical example is with 2-D arrays

69
Example 2-D Array Initialization
    Option 1 (i outer, j inner):       Option 2 (j outer, i inner):
      int a[200][200];                   int a[200][200];
      for (i = 0; i < 200; i++)          for (j = 0; j < 200; j++)
        for (j = 0; j < 200; j++)          for (i = 0; i < 200; i++)
          a[i][j] = 2;                       a[i][j] = 2;

  • Which alternative is best?
  • i,j?
  • j,i?
  • To answer this, one must understand the memory layout of a 2-D array

70
2-D Arrays in Memory
  • A static 2-D array is one declared as
  •   <type> <name>[<size>][<size>]
  •   int myarray[10][30];
  • The elements of a 2-D array are stored in
    contiguous memory cells
  • The problem is that
  • The array is 2-D, conceptually
  • Computer memory is 1-D
  • 1-D computer memory a memory location is
    described by a single number, its address
  • Just like a single axis
  • Therefore, there must be a mapping from 2-D to
    1-D
  • From a 2-D abstraction to a 1-D implementation

71
Mapping from 2-D to 1-D?
  [Figure: an n×n 2-D array being mapped onto 1-D computer memory]
72
Row-Major, Column-Major
  • Luckily, only 2 of the (n²)! possible mappings are ever implemented in a language
  • Row-Major
  • Rows are stored contiguously (1st row, then 2nd row, 3rd row, 4th row, ...)
  • Column-Major
  • Columns are stored contiguously (1st column, then 2nd column, 3rd column, 4th column, ...)
73
Row-Major
  • C uses Row-Major order
  • Matrix elements are stored in contiguous memory/cache lines: each cache line holds several consecutive elements of the same row
74
Row-Major
  • C uses Row-Major
  • First option:

      int a[200][200];
      for (i = 0; i < 200; i++)
        for (j = 0; j < 200; j++)
          a[i][j] = 2;

  • Second option:

      int a[200][200];
      for (j = 0; j < 200; j++)
        for (i = 0; i < 200; i++)
          a[i][j] = 2;

75
Counting cache misses
  • n×n 2-D array, element size e bytes, cache line size b bytes

    Sequential (row-order) traversal:
  • One cache miss for every cache line: n² × e / b misses
  • Total number of memory accesses: n²
  • Miss rate: e/b
  • Example: miss rate = 4 bytes / 64 bytes = 6.25%
  • Unless the array is very small

    Strided (column-order) traversal:
  • One cache miss for every access
  • Example: miss rate = 100%
  • Unless the array is very small

76
Array Initialization in C
  • First option (good locality):

      int a[200][200];
      for (i = 0; i < 200; i++)
        for (j = 0; j < 200; j++)
          a[i][j] = 2;

  • Second option (poor locality):

      int a[200][200];
      for (j = 0; j < 200; j++)
        for (i = 0; i < 200; i++)
          a[i][j] = 2;
77
Performance Measurements
  • Option 1:

      int a[X][X];
      for (i = 0; i < X; i++)
        for (j = 0; j < X; j++)
          a[i][j] = 2;

  • Option 2:

      int a[X][X];
      for (j = 0; j < X; j++)
        for (i = 0; i < X; i++)
          a[i][j] = 2;

    Experiments on my laptop (performance chart not reproduced); a sketch of such a measurement follows below

  • Note that other languages use column-major order
  • e.g., FORTRAN
78
Matrix Multiplication
  • A classic example for locality-aware programming
    is matrix multiplication
      for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
          for (k = 0; k < N; k++)
            c[i][j] += a[i][k] * b[k][j];

  • There are 6 possible orders for the three loops
  • i-j-k, i-k-j, j-i-k, j-k-i, k-i-j, k-j-i
  • Each order corresponds to a different access pattern over the matrices
  • Let's focus on the inner loop, as it is the one that's executed most often

79
Inner Loop Memory Accesses
  • Each matrix element can be accessed in three
    modes in the inner loop
  • Constant: doesn't depend on the inner loop's index
  • Sequential: contiguous addresses
  • Strided: non-contiguous addresses (N elements apart)

             c[i][j]      a[i][k]      b[k][j]
    i-j-k    Constant     Sequential   Strided
    i-k-j    Sequential   Constant     Sequential
    j-i-k    Constant     Sequential   Strided
    j-k-i    Strided      Strided      Constant
    k-i-j    Sequential   Constant     Sequential
    k-j-i    Strided      Strided      Constant

80
Loop order and Performance
  • Constant access is better than sequential access
  • it's always good to have constants in loops because they can be put in registers (as we've seen in our very first optimization)
  • Sequential access is better than strided access
  • sequential access is better than strided because it utilizes the cache better
  • Let's go back to the previous slide

81
Best Loop Ordering?
             c[i][j]      a[i][k]      b[k][j]
    i-j-k    Constant     Sequential   Strided
    i-k-j    Sequential   Constant     Sequential
    j-i-k    Constant     Sequential   Strided
    j-k-i    Strided      Strided      Constant
    k-i-j    Sequential   Constant     Sequential
    k-j-i    Strided      Strided      Constant

  • k-i-j and i-k-j should have the best performance
  • i-j-k and j-i-k should be worse
  • j-k-i and k-j-i should be the worst
  • You will measure this in a Programming Assignment

82
How good is the best ordering?
  • Let us assume that i-k-j is best
  • How many cache misses?
      for (i = 0; i < N; i++)
        for (k = 0; k < N; k++) {
          x = a[i][k];
          for (j = 0; j < N; j++)
            c[i][j] += x * b[k][j];
        }
  • Clearly this is not easy to compute
  • e.g., if the matrix is twice the size of the
    cache, there is a lot of loading/evicting and
    obtaining a formula would be complicated
  • Let L be the cache line size in number of matrix
    elements
  • How about a very coarse approximation, by
    assuming that the matrix is much larger than the
    cache?
  • determine what matrix pieces are loaded/written
  • Figure out the expected number of cache misses

83
Slow Memory Operations
      for (i = 0; i < N; i++) {
        // (1) read row i of a into cache
        // (2) write row i of c back to memory
        for (j = 0; j < N; j++) {
          // (3) read column j of b into cache
          for (k = 0; k < N; k++)
            c[i][j] += a[i][k] * b[k][j];
        }
      }

  • L = cache line size (in number of matrix elements)
  • (1): N × (N / L) cache misses
  • (2): N × (N / L) cache misses
  • (3): N × N × N cache misses
  • Although the access to B is sequential, it is sequential along a column, and the matrix is stored in row-major fashion!
  • Total: 2N²/L + N³ ≈ N³ (for large N)

84
Bad News
  • N³ slow memory operations and 2N³ arithmetic operations
  • Ratio = ops / mem = 2
  • This is bad news because we know that computer
    architectures are NOT balanced and memory
    operations are orders of magnitude slower than
    arithmetic operations
  • Therefore, the memory is still the bottleneck for
    this implementation of matrix multiplication (the
    ratio should be much higher)
  • BUT we have only N² matrix elements; how come we perform N³ slow memory accesses?
  • Because we access matrix B very inefficiently, trying to load entire columns one after the other
  • Lesson: counting the number of operations and comparing it with the size of the data is not sufficient to ascertain that an algorithm will not suffer from the memory bottleneck

85
Better cache reuse?
  • Since we really need only N2 elements, perhaps
    there is a better way to reorganize the
    operations of the matrix multiplication for a
    higher number of cache hits
  • Possible because + and × are associative and commutative
  • Researchers have spent a lot of time trying to find the best ordering
  • There are even theorems!
  • Let q = the ratio of operations to slow memory accesses
  • q must be as high as possible to remove the memory bottleneck
  • Hong & Kung, 1981: any reorganization of the algorithm is limited to q = O(√M), where M is the size of the cache (in number of elements)
  • obtained with a lot of unrealistic assumptions about the cache
  • still shows that q won't scale with N, unlike what one may think when dividing 2N³ by N²

86
Blocked Matrix Multiplication
  • One problem with our implementation is that we
    try to access entire columns of matrix B.
  • What about accessing only a subset of a column,
    or of multiple columns, at a time?

87
Blocked Matrix Multiplication
  [Figure: matrices A, B, and C, with one cache line highlighted]

  Key idea: reuse the other elements in each cache line as much as possible
88
Blocked Matrix Multiplication
  [Figure: matrices A, B, and C; b-element pieces of columns j and j+1 of B fall on the same cache lines]

  May as well compute c[i][j+1], since one loads column j+1 of B into the cache lines anyway. But one must reorder the operations as follows:
    compute the first b terms of c[i][j], then the first b terms of c[i][j+1],
    compute the next b terms of c[i][j], then the next b terms of c[i][j+1],
    .....
89
Blocked Matrix Multiplication
  [Figure: matrices A, B, and C, with a sub-row of C highlighted]

  May as well compute a whole sub-row of C, with the same reordering of the operations. But by computing a whole row of C, one has to load all the columns of B, which one has to do again for computing the next row of C. Idea: reuse the blocks of B that we have just loaded.
90
Blocked Matrix Multiplication
  [Figure: matrices A, B, and C, with a b×b block of C highlighted]

  Order of the operations:
    compute the first b terms of all the c[i][j] values in the C block,
    compute the next b terms of all the c[i][j] values in the C block,
    . . .
    compute the last b terms of all the c[i][j] values in the C block
91
Blocked Matrix Multiplication
  N = 4b: each matrix is partitioned into 4 × 4 blocks of size b × b

      C11 C12 C13 C14      A11 A12 A13 A14      B11 B12 B13 B14
      C21 C22 C23 C24      A21 A22 A23 A24      B21 B22 B23 B24
      C31 C32 C33 C34      A31 A32 A33 A34      B31 B32 B33 B34
      C41 C42 C43 C44      A41 A42 A43 A44      B41 B42 B43 B44

  • C22 = A21×B12 + A22×B22 + A23×B32 + A24×B42
  • 4 matrix multiplications
  • 4 matrix additions
  • Main point: each multiplication operates on small block matrices, whose size may be chosen so that they fit in the cache.

92
Blocked Algorithm
  • The blocked version of the i-j-k algorithm is written simply as follows (a plain C sketch appears below)

      for (i = 0; i < N/b; i++)
        for (j = 0; j < N/b; j++)
          for (k = 0; k < N/b; k++)
            C[i][j] += A[i][k] * B[k][j];

  • where b is the block size (which we assume divides N)
  • where X[i][j] is the block of matrix X on block row i and block column j
  • where += means matrix addition
  • where * means matrix multiplication
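  • A minimal sketch in plain C (assuming row-major 1-D storage and that b divides N; not tuned for any particular machine):

      void blocked_matmul(int N, int b,
                          const double *A, const double *B, double *C) {
          for (int i = 0; i < N; i += b)
              for (int j = 0; j < N; j += b)
                  for (int k = 0; k < N; k += b)
                      /* C block (i,j) += A block (i,k) * B block (k,j) */
                      for (int ii = i; ii < i + b; ii++)
                          for (int jj = j; jj < j + b; jj++) {
                              double sum = C[ii * N + jj];
                              for (int kk = k; kk < k + b; kk++)
                                  sum += A[ii * N + kk] * B[kk * N + jj];
                              C[ii * N + jj] = sum;
                          }
      }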

93
Cache Misses?
      for (i = 0; i < N/b; i++)
        for (j = 0; j < N/b; j++) {
          // (1) write block C[i][j] to memory
          for (k = 0; k < N/b; k++) {
            // (2) load block A[i][k] from memory
            // (3) load block B[k][j] from memory
            C[i][j] += A[i][k] * B[k][j];
          }
        }

  • (1): (N/b) × (N/b) × b × b
  • (2): (N/b) × (N/b) × (N/b) × b × b
  • (3): (N/b) × (N/b) × (N/b) × b × b
  • Total: N² + 2N³/b ≈ 2N³/b

94
Performance?
  • Slow memory accesses: 2N³/b
  • Number of operations: 2N³
  • Therefore, ratio = ops / mem = b
  • This ratio should be as high as possible
  • (Compare to the value of 2 that we obtained with the non-blocked implementation)
  • This implies that one should make the block size as large as possible
  • But, if we take this result to the extreme, then the block size should be equal to N!!
  • This clearly doesn't make sense, because then we're back to the non-blocked implementation
95
Maximum Block Size
  • The blocking optimization only works if the blocks fit in cache
  • That is, 3 blocks of size b × b must fit in cache (for A, B, and C)
  • Let M be the cache size (in number of elements)
  • We must have 3b² ≤ M, or b ≤ √(M/3)
  • For instance (made-up numbers): a 32KB cache holding 8-byte elements gives M = 4096, hence b ≤ √(4096/3) ≈ 36
  • Therefore, in the best case, the ratio of the number of operations to slow memory accesses is √(M/3)

96
Optimizing Further?
  • At this point we know that blocking is a good
    idea
  • It turns out that the best block size isn't that easy to determine
  • There are many other things we could do to the
    code
  • loop unrolling
  • instruction reordering
  • ...
  • There are many things the compiler can do to the
    code and there are many compiler flags we could
    use
  • In the end, how do we determine the best
    implementation for a given architecture?

97
Automatic Program Generation
  • It is difficult to optimize code because
  • There are many possible options for
    tuning/modifying the code
  • These options interact in complex ways with the
    compiler and the hardware
  • This is really an optimization problem
  • The objective function is the code's performance
  • The feasible solutions are all possible ways to
    implement the software
  • Typically a finite number of implementation
    decisions are to be made
  • Each decision can take a range of values
  • e.g., the 7th loop in the 3rd function can be
    unrolled 1, 2, ..., 20 times
  • e.g., the block size could be 2x2, 4x4, ...,
    400x400
  • e.g., function could be recursive or iterative
  • And one needs to do it again and again for
    different platforms

98
Automatic Program Generation
  • What is good at solving hard optimization
    problems?
  • computers
  • Therefore, a computer program could generate the
    computer program with the best performance
  • Could use a brute force approach try all
    possible solutions
  • but there is an exponential number of them
  • Could use genetic algorithms
  • Could use some ad-hoc optimization technique

99
Matrix Multiplication
  • We have seen that for matrix multiplication there
    are several possible ways to optimize the code
  • block size
  • optimization flag to the compiler
  • order of loops
  • ...
  • It is difficult to find the best one
  • People have written automatic matrix
    multiplication program generators!

100
The ATLAS Project
  • ATLAS is software that you can download and run on most platforms.
  • It runs for a while (perhaps a couple of hours) and generates a .c file that implements matrix multiplication!
  • ATLAS optimizes for
  • Instruction cache reuse
  • Floating point instruction ordering
  • pipelined functional units
  • Reducing loop overhead
  • Exposing parallelism
  • multiple functional units
  • Cache reuse

101
ATLAS (500x500 matrices)
Source: Jack Dongarra
  • ATLAS is faster than all other portable BLAS
    implementations and it is comparable with
    machine-specific libraries provided by the vendor.

102
Improving an Application
  • So, we have seen ways in which to improve pieces
    of code
  • The problem is that one typically doesn't have an application that just performs an array initialization or a matrix multiplication
  • In fact, there are many parts of the application that one could think of optimizing for memory, etc.

103
Profiling
  • Question: how do we know which part of the code is the most expensive?
  • If you've not written the code, you may not know
  • If you've written the code, you may have some idea (although experience shows that many programmers don't)
  • The most expensive part may be in some library function you haven't written
  • You could put gettimeofday() calls everywhere, but that gets really cumbersome for large projects
  • The standard way: use a profiler

104
What is a Profiler?
  • A profiler is a tool that monitors the execution
    of a program and that reports the amount of time
    spent in different functions
  • Useful to identify the expensive functions
  • Profiling cycle
  • Compile the code with the profiler
  • Run the code
  • Identify the most expensive function
  • Optimize that function
  • call it less often if possible
  • make it faster
  • Repeat until you can't think of any ways to further optimize the most expensive function
  • UNIX has a good, free profiler called gprof

105
Using gprof
  • Compile your code with gcc using the -pg option
  • Run your code to completion
  • Then run gprof with your program's name as its single command-line argument
  • Example:

      gcc -pg prog.c -o prog
      ./prog
      gprof prog > profile_file

  • The output file contains all the profiling information
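  • As an illustration (not part of the original slides), a tiny hypothetical program to profile this way; slow_sum should dominate the flat profile:

      /* toy.c: slow_sum does ~n² work, fast_part only ~n */
      #include <stdio.h>

      double slow_sum(int n) {
          double s = 0.0;
          for (int i = 0; i < n; i++)
              for (int j = 0; j < n; j++)
                  s += i * 0.5 + j;
          return s;
      }

      double fast_part(int n) {
          double s = 0.0;
          for (int i = 0; i < n; i++)
              s += i;
          return s;
      }

      int main(void) {
          printf("%g\n", slow_sum(20000) + fast_part(20000));
          return 0;
      }

  • Compiling with "gcc -pg toy.c -o toy", running "./toy", and then "gprof toy" should list slow_sum at the top of the flat profile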

106
Profiling output
  • The content of the file is explained in detail in
    the file itself
  • At the beginning of the file is a summary of
    which fraction of the code is spent in which
    function
  • In the middle section is a detailed entry for
    each function
  • At the end of the file is a function index, in which each function is assigned a number in brackets, e.g., [3]

107
Profiling Output
  • Flat profile summary:

      %time  cumulative  self
             seconds     seconds  name
      30.9   0.77        0.77     ___multadd_D2A [1]
      16.9   1.19        0.42     _scheduler <cycle 1> [3]
      15.3   1.57        0.38     _scandir [5]
       9.2   1.80        0.23     _NSLookupAndBindSymbolHint [6]
       6.4   1.96        0.16     _job <cycle 1> [8]
       4.4   2.07        0.11     _NSIsSymbolNameDefinedHint [9]
       1.6   2.11        0.04     _hash_nkey [10]
       1.6   2.15        0.04     _pthread_key_create [11]
       1.2   2.18        0.03     ___quorem_D2A [12]
       1.2   2.21        0.03     __mh_dylib_header [13]
       1.2   2.24        0.03     _probe_submitter [14]
       1.2   2.27        0.03     _request_submitter [15]

  • self seconds: time spent in the function itself
  • cumulative seconds: running total of the self times down the (sorted) list
108
Profiling output
  • The middle section of the file provides detailed
    information for each function
  • Entry format:

      index  %time  self  children  called   name
                    1.21  3.10      80/132     f1 [111]
                    0.69  1.13      52/132     f2 [123]
      [1]    23.1   2.12  4.23      132      func [1]
                    4.23  0.00      32/5231    c [39]

  • The exact layout can vary depending on the version of gprof
  • You should really read the explanations in the file to be sure

109
Profiling output
  • (Same entry as on the previous slide.) The line whose index column shows [1] is the function this entry describes: func [1]
110
Profiling output
  • The lines above func's own line are its parents (its callers): f1 [111] and f2 [123]
111
Profiling output
  • The line below func's own line is its child (the function it calls): c [39]
112
Profiling output
  • Call graph view: f1 and f2 both call func, and func calls c
113
Profiling output
  • Call counts: f1 calls func 80 times and f2 calls func 52 times, out of 132 calls to func in total; func calls c 32 times, out of 5231 total calls to c