1
Code Optimization
  • Orion Sky Lawlor
  • olawlor_at_uiuc.edu
  • 2003/9/17

2
Roadmap
  • Introduction
  • gprof
  • Timer calls
  • Understanding Performance

3
Introduction
  • Scientific Performance Method
  • Measure (don't assume!)
  • Find the bottlenecks in the code
  • They aren't where you expect!
  • Fix the worst problems first
  • Consider stopping: is it good enough?
  • Fix
  • Improve algorithms first
  • Improve implementations second
  • Repeat (indefinitely)

4
gprof: UNIX performance tool
  • Compile and link with -pg flag
  • Adds instrumentation to code
  • Run serial program normally
  • Instrumentation runs automatically
  • Run gprof to analyze trace
  • gprof pgm gmon.out
  • Shows a function-level view of execution time
  • Heisenberg measurement error!
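
As a concrete target for the steps above, here is a minimal sketch of code worth profiling. The function names `slow_sum` and `fast_sum` and the file name `pgm.cpp` are hypothetical, and the commands in the comments assume a GNU toolchain (`g++` and `gprof`).

```cpp
// Hypothetical file pgm.cpp. After adding a main() that calls both functions:
//   g++ -pg pgm.cpp -o pgm     (compile and link with instrumentation)
//   ./pgm                      (run normally; writes gmon.out)
//   gprof pgm gmon.out         (print the function-level profile)
#include <cstdint>

// Deliberately quadratic: recomputes every prefix sum from scratch.
std::uint64_t slow_sum(int n) {
    std::uint64_t total = 0;
    for (int i = 1; i <= n; ++i)
        for (int j = 1; j <= i; ++j)
            total += j;
    return total;
}

// Same result in linear time: the i-th prefix sum is i*(i+1)/2.
std::uint64_t fast_sum(int n) {
    std::uint64_t total = 0;
    for (int i = 1; i <= n; ++i)
        total += static_cast<std::uint64_t>(i) * (i + 1) / 2;
    return total;
}
```

With a large enough `n`, the profile would be expected to attribute most of the run time to `slow_sum`, which is the bottleneck to fix first.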

5
Timer Calls
  • CPU time (virtual time)
  • Time spent running your code
  • CmiCpuTimer() 10ms resolution
  • Wall clock time (real time)
  • Includes OS interference, network delays,
    context-switching overhead
  • CmiWallTimer() to 1ns resolution
  • This is what really counts
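
The difference between the two kinds of time can be seen with standard C++ alone. This is a sketch that uses `std::clock` and `std::chrono::steady_clock` as stand-ins for `CmiCpuTimer()` and `CmiWallTimer()`; the names `Times` and `time_a_sleep` are hypothetical, and it assumes a POSIX-like system where `std::clock` reports process CPU time.

```cpp
#include <chrono>
#include <ctime>
#include <thread>

struct Times {
    double cpu_s;   // CPU time consumed by this process, in seconds
    double wall_s;  // wall-clock time elapsed, in seconds
};

// Sleep for ms milliseconds and report both kinds of elapsed time.
// A sleeping process uses almost no CPU, so cpu_s stays near zero
// while wall_s covers the whole sleep.
Times time_a_sleep(int ms) {
    std::clock_t c0 = std::clock();
    auto w0 = std::chrono::steady_clock::now();
    std::this_thread::sleep_for(std::chrono::milliseconds(ms));
    std::clock_t c1 = std::clock();
    auto w1 = std::chrono::steady_clock::now();
    Times t;
    t.cpu_s = static_cast<double>(c1 - c0) / CLOCKS_PER_SEC;
    t.wall_s = std::chrono::duration<double>(w1 - w0).count();
    return t;
}
```

Time spent waiting (on the OS, the network, or a sleep) shows up only in the wall-clock number, which is why that is the one that really counts.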

6
How to call the timer: one time
  • What's wrong with this?
  • double s=CmiWallTimer();
  • foo();
  • double e=CmiWallTimer()-s;
  • CkPrintf("foo took %f s\n",e);
  • If CmiWallTimer takes 100ns, and foo takes 50ns,
    this may print 150ns!
  • Only a problem for very fast functions (or slow
    timers!)
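
One way to check whether timer overhead matters is to estimate it directly. The sketch below uses `std::chrono::steady_clock` as a stand-in for `CmiWallTimer()`; the names `wall_timer` and `timer_overhead` are hypothetical.

```cpp
#include <chrono>

// Wall-clock seconds since an arbitrary fixed point.
// Stand-in for CmiWallTimer(), so the sketch runs without Charm++.
double wall_timer() {
    using clock = std::chrono::steady_clock;
    return std::chrono::duration<double>(clock::now().time_since_epoch()).count();
}

// Estimate the cost of one wall_timer() call by calling it back to back
// and dividing the elapsed time by the number of calls.
double timer_overhead(int calls) {
    double start = wall_timer();
    double last = start;
    for (int i = 0; i < calls; ++i)
        last = wall_timer();
    return (last - start) / calls;
}
```

If a single-shot measurement of `foo` is not much larger than this estimate, the measurement mostly reflects the timer, not `foo`.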

7
How to call the timer: repeat
  • Repetition can decrease apparent timer overhead
    and increase resolution
  • const int n=1000;
  • double s=CmiWallTimer();
  • for (int i=0;i<n;i++) foo();
  • double e=(CmiWallTimer()-s)/n;
  • CkPrintf("foo took %f s\n",e);
  • Problem: what if foo's performance is
    cache-sensitive?
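
The repeat loop can be wrapped in a small helper. This is a sketch in standard C++ with `std::chrono::steady_clock` standing in for `CmiWallTimer()`; `seconds_per_call`, `sink`, and this particular `foo` are hypothetical names chosen for illustration.

```cpp
#include <chrono>

// Average wall-clock seconds per call of f over n back-to-back calls.
// The two timer reads are amortized over all n calls.
template <typename F>
double seconds_per_call(F f, int n) {
    using clock = std::chrono::steady_clock;
    auto s = clock::now();
    for (int i = 0; i < n; ++i)
        f();
    auto e = clock::now();
    return std::chrono::duration<double>(e - s).count() / n;
}

// Hypothetical workload. Writing the result to a volatile variable
// keeps the compiler from discarding the call entirely.
volatile long sink = 0;

void foo() {
    long t = 0;
    for (int i = 0; i < 1000; ++i)
        t += i;
    sink = t;
}
```

Calling `seconds_per_call(foo, 1000)` gives the per-call average. Only the first call is likely to run with cold caches, so the average mostly describes warm-cache behavior; timing the first call on its own gives a rough cold-cache number for comparison.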

8
Understanding Performance: CPU
  • [Figure: items placed on a logarithmic time scale with ticks at
    1ns (GHz), 10ns, 100ns, 1us (MHz), 10us, 100us, 1ms (KHz)]
  • Labels: Integer Arithmetic, Floating Point, Branches, Cache,
    (int)x, Subroutine, Memory, /, / or, inline, Bitshifts and masks,
    (int )x, Cache-friendly algorithms
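
The "cache-friendly algorithms" label can be illustrated with a standard example that is not from the slides: traversing the same row-major matrix in two different orders. The function names are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Sum an n-by-n matrix stored row-major in a flat vector.
// Row order: consecutive accesses touch adjacent memory.
double sum_row_order(const std::vector<double>& a, std::size_t n) {
    double total = 0;
    for (std::size_t r = 0; r < n; ++r)
        for (std::size_t c = 0; c < n; ++c)
            total += a[r * n + c];
    return total;
}

// Column order: consecutive accesses are n elements apart,
// so for large n most of them land on a different cache line.
double sum_column_order(const std::vector<double>& a, std::size_t n) {
    double total = 0;
    for (std::size_t c = 0; c < n; ++c)
        for (std::size_t r = 0; r < n; ++r)
            total += a[r * n + c];
    return total;
}
```

Both functions do the same arithmetic; any difference in measured time comes from the memory access pattern, and it should be measured, not assumed.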

9
Understanding Performance: OS
  • [Figure: items placed on a logarithmic time scale with ticks at
    1ns (GHz), 10ns, 100ns, 1us (MHz), 10us, 100us, 1ms (KHz)]
  • Labels: Timer, sin, tan,..., Malloc, Syscall, Tables, identities,
    Reuse buffers, Avoid, Disk, Timeslice

10
Understanding Performance: Net
  • [Figure: items placed on a logarithmic time scale with ticks at
    1ns (GHz), 10ns, 100ns, 1us (MHz), 10us, 100us, 1ms (KHz)]
  • Labels: Message Combining, Rethink, Probe, RDMA, Barrier, Message

11
Conclusions
  • Performance is the whole point of parallel
    programming
  • painful but necessary
  • Asymptotics matter: find O(n)
  • Constants matter: count mallocs
  • Scientific Performance Method
  • Measure
  • Fix
  • Repeat
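
One possible way to count allocations in a C++ program is to replace the global `operator new` with a version that increments a counter. Note that this sketch counts `operator new` calls, not raw `malloc` calls, and the names `g_alloc_count` and `allocations_to_fill` are hypothetical.

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>
#include <vector>

// Number of calls to the global operator new so far.
static std::size_t g_alloc_count = 0;

// Replacement global allocation function: count, then defer to malloc.
void* operator new(std::size_t size) {
    ++g_alloc_count;
    if (void* p = std::malloc(size ? size : 1))
        return p;
    throw std::bad_alloc();
}

void operator delete(void* p) noexcept { std::free(p); }
void operator delete(void* p, std::size_t) noexcept { std::free(p); }

// Count the allocations needed to fill a vector with n ints,
// with or without reserving the capacity up front.
std::size_t allocations_to_fill(int n, bool reserve_first) {
    std::size_t before = g_alloc_count;
    std::vector<int> v;
    if (reserve_first)
        v.reserve(n);
    for (int i = 0; i < n; ++i)
        v.push_back(i);
    return g_alloc_count - before;
}
```

Comparing `allocations_to_fill(1000, false)` with `allocations_to_fill(1000, true)` shows how many allocations a single `reserve` call saves; the exact count without `reserve` depends on the library's growth policy.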