mhz: Anatomy of a microbenchmark

1
mhz: Anatomy of a micro-benchmark
  • Carl Staelin, Hewlett-Packard Laboratories
  • Larry McVoy, BitMover, Inc.

2
Outline
  • Problem statement
  • Solution
  • Experimental method
  • Data analysis
  • Results
  • Conclusions

3
Problem
  • How to find the CPU clock speed?
  • Measure a single clock tick
      • Clock resolution is too coarse
  • Measure the time to execute a known number of
    clock ticks
      • Unknown number of clock ticks per C expression

4
Benchmarking basics
  • Accurate timing
      • gettimeofday() resolution is poor
      • timing interval must be much larger than the clock resolution
  • Overhead
      • time to measure the time interval
      • for() loop overhead

5
Simple solution
    start();
    for (i = 0; i < 10000; i++) {
        HUNDRED(a);
    }
    usecs = stop();
    mhz = 1000000 / (double)usecs;

6
Limitations
  • Assumes a known number of clock ticks per expression
  • Increasing processor speeds invalidate timing
  • Single experiment

7
mhz solution
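[Figure: expression label a<<1]

8
mhz solution
[Figure: expression labels a, a<<1]

9
mhz solution
[Figure: expression labels a<<1, a, pp]

10
mhz solution
[Figure: expression labels a<<1, a, pp]
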
11
Greatest common divisor
  • Measure the duration of N expressions
  • Find the greatest common divisor
  • Requires expressions whose durations in clock ticks are relatively prime
  • But, time is not integral!
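
Since measured durations are real-valued and noisy, Euclid's algorithm has to stop at "close enough to zero" instead of exactly zero. The following is a minimal sketch of that idea, not necessarily how mhz itself computes the GCD; the name approx_gcd and the tolerance parameter tol are assumptions made for illustration.

```c
#include <assert.h>

/* Euclid's algorithm on positive real-valued durations. The loop ends
 * when the remainder is no larger than tol (an assumed noise floor),
 * since noisy measurements never divide each other exactly. */
double approx_gcd(double a, double b, double tol)
{
    assert(a > 0.0 && b >= 0.0 && tol > 0.0);
    while (b > tol) {
        /* remainder of a / b for positive values, without libm */
        double r = a - b * (double)(long long)(a / b);
        a = b;
        b = r;
    }
    return a;
}
```

With durations of 9.02 and 6.0 time units and a tolerance of 0.1, the sketch returns roughly 2.98, close to the underlying period of 3. Choosing tol is the hard part: too small and noise is treated as signal, too large and the true divisor is skipped.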

12
Measurement error
[Figure: expression labels a<<1, a, pp, annotated with "Experimental error" and "Experimental result"]

13
Computing GCD
[Figure: axis labels t and cycles]

14
lmbench 2.0 timing
  • Uniform experimental interface
      • BENCH(), BENCH1()
  • Manages
      • Auto-sizing timing loops
      • Multiple experiments, reports median
      • Removes timing measurement overhead
      • Removes loop overhead

15
Clock resolution
  • Determine the duration needed for accurate
    measurements
      • Between 5,000 µs and 1 s
  • Determine how many iterations of the loop inside
    the timing interval are needed
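
One way to estimate the clock resolution is to read the clock back to back until its value changes and keep the smallest step observed. The sketch below assumes gettimeofday() as the time source; the function names are hypothetical and this is not the lmbench implementation.

```c
#include <stddef.h>
#include <sys/time.h>

/* Current time in microseconds, read from gettimeofday(). */
static double now_usecs(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1e6 + tv.tv_usec;
}

/* Smallest non-zero step seen between consecutive clock reads over
 * `trials` attempts: an estimate of the usable clock resolution. */
double clock_resolution_usecs(int trials)
{
    double best = 1e12;
    for (int i = 0; i < trials; i++) {
        double t0 = now_usecs(), t1;
        do {
            t1 = now_usecs();   /* spin until the clock value changes */
        } while (t1 == t0);
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best;
}
```

A timing interval would then be chosen as some large multiple of this value, clamped to the range given above; the multiple itself is a tuning decision that this sketch leaves open.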

16
Timing interface
  • BENCH(bench, enough)
      • measure the performance of bench
      • run each test for at least enough µs
      • report the median of 11 experiments
  • BENCH1(bench, enough)
      • report the result of 1 experiment
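
The behaviour described for BENCH() can be approximated with an ordinary function. This is a sketch under stated assumptions, not the lmbench macro: bench_median, one_experiment and sample_work are hypothetical names, the loop is sized by simple doubling, and neither timing overhead nor loop overhead is subtracted.

```c
#include <stdlib.h>
#include <sys/time.h>

#define RUNS 11   /* "median of 11 experiments" */

static double now_usecs(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1e6 + tv.tv_usec;
}

static int cmp_double(const void *x, const void *y)
{
    double a = *(const double *)x, b = *(const double *)y;
    return (a > b) - (a < b);
}

/* Median of n values (n odd); sorts the array in place. */
double median(double *v, int n)
{
    qsort(v, (size_t)n, sizeof v[0], cmp_double);
    return v[n / 2];
}

/* One experiment: double the iteration count until the loop runs for
 * at least `enough` microseconds, then return microseconds per call. */
static double one_experiment(void (*bench)(void), double enough)
{
    for (long n = 1; ; n *= 2) {
        double t0 = now_usecs();
        for (long i = 0; i < n; i++)
            bench();
        double dt = now_usecs() - t0;
        if (dt >= enough)
            return dt / (double)n;
    }
}

/* Run RUNS experiments and report the median time per call. */
double bench_median(void (*bench)(void), double enough)
{
    double r[RUNS];
    for (int k = 0; k < RUNS; k++)
        r[k] = one_experiment(bench, enough);
    return median(r, RUNS);
}

/* Example workload: the volatile store keeps the call from being removed. */
static volatile int sink;
void sample_work(void) { sink = rand(); }
```

Calling bench_median(sample_work, 5000.0) returns the median cost of one rand() call in microseconds, with each experiment running for at least 5,000 µs.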

17
Numerical analysis
  • Minimize noise in result
  • Eliminate obvious outliers
  • Use minimal timing results rather than median
  • Compute GCD for each subset of data points
  • Only compute GCD for independent points
  • Choose mode of GCD computations
  • Detect when result is incorrect
  • If the mhz values for the minimal and almost-minimal
    results differ too much, ignore the result
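
Two of the steps above, choosing the mode of several GCD-based estimates and rejecting inconsistent results, can be sketched as follows. The function names and the relative tolerance rel_tol are assumptions for illustration; the slides do not say how mhz defines "mode" for real-valued data or how large a difference counts as too large.

```c
#include <assert.h>

/* Return the candidate with the most neighbours within a relative
 * tolerance: a simple stand-in for the mode of real-valued estimates.
 * Ties go to the earliest candidate. */
double mode_estimate(const double *v, int n, double rel_tol)
{
    assert(n > 0);
    int best_i = 0, best_count = 0;
    for (int i = 0; i < n; i++) {
        int count = 0;
        for (int j = 0; j < n; j++) {
            double d = v[j] - v[i];
            if (d < 0.0)
                d = -d;
            if (d <= rel_tol * v[i])
                count++;
        }
        if (count > best_count) {
            best_count = count;
            best_i = i;
        }
    }
    return v[best_i];
}

/* Accept a result only if the estimates derived from the minimal and
 * almost-minimal timings agree to within rel_tol. */
int estimates_agree(double mhz_min, double mhz_next, double rel_tol)
{
    double d = mhz_min - mhz_next;
    if (d < 0.0)
        d = -d;
    return d <= rel_tol * mhz_min;
}
```

Given the candidates 400.1, 399.9, 200.0, 400.0 and 133.3 with a 1% tolerance, the three values near 400 outvote the two stray ones, and a pair such as 400 and 200 is rejected as inconsistent.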

18
mhz results
  • Processors
      • x86, Alpha, PowerPC, SPARC, PA-RISC,
  • Operating Systems
      • Linux, HP-UX, SunOS, AIX, IRIX, ...
  • 642 runs of mhz

19
Build your own benchmarks
  • Latency and bandwidth benchmarks
  • lmbench 2.0 timing harness
      • Accurate timing results
      • Simple interface
  • Quickly develop benchmarks to
      • Understand system performance
      • Compare implementations

20
Simple latency benchmark
    #include "bench.h"

    int
    main(int argc, char **argv)
    {
        BENCH(lrand48(), 0);
        micro("lrand48()", get_n());
        exit(0);
    }

21
Simple bandwidth benchmark
    #include "bench.h"

    #define M (1024 * 1024)

    int
    main(int argc, char **argv)
    {
        char *a = malloc(M);
        char *b = malloc(M);

        BENCH(bcopy(a, b, M), 0);
        mb(get_n() * M);
        exit(0);
    }

22
Pitfalls
  • Measuring the wrong operation
  • Beware of caching effects!
  • Measuring partial operations
  • Subtracting two measurements
  • Operation/benchmarking overhead

23
Example
  • Convolution
  • Similar to matrix multiply
  • What is fastest convolution method?
  • Pointers vs. array indexing?
  • Calculate result a point at a time, or partial
    results for each input point?
  • Short vs. integer vs. float data types?

24
Convolution
  • Quickly wrote about 6 convolution implementations
  • Inserted them in the lmbench 2.0 harness
  • Measured performance
  • Improved convolution performance 3x over initial
    implementation

25
Conclusions
  • mhz is an accurate, platform-independent
    benchmark
  • lmbench timing harness is useful and accurate
  • http://www.bitmover.com/lmbench