1
COMP 206: Computer Architecture and Implementation
  • Montek Singh
  • Wed., Sep 3, 2003
  • Lecture 2

2
Outline
  • Quantitative Principles of Computer Design
  • Amdahl's Law (make the common case fast)
  • Performance Metrics
  • MIPS, FLOPS, and all that
  • Examples

3
Quantitative Principles of Computer Design
  • Performance: rate of producing results = throughput = bandwidth
  • Execution time = response time = latency
4
Comparison
  • "Y is n times larger than X": n = Y / X
  • "Y is n% larger than X": n = 100 * (Y - X) / X
5
Amdahl's Law (1967)
"Validity of the single processor approach to
achieving large scale computing capabilities," G.
M. Amdahl, AFIPS Conference Proceedings, pp.
483-485, April 1967
  • Historical context
  • Amdahl was demonstrating the continued validity
    of the single processor approach and the
    weaknesses of the multiple processor approach
  • Paper contains no mathematical formulation, just
    arguments and simulation
  • "The nature of this overhead appears to be
    sequential so that it is unlikely to be amenable
    to parallel processing techniques."
  • "A fairly obvious conclusion which can be drawn
    at this point is that the effort expended on
    achieving high parallel performance rates is
    wasted unless it is accompanied by achievements
    in sequential processing rates of very nearly the
    same magnitude."
  • Nevertheless, Amdahl's Law is of widespread
    applicability in all kinds of situations

6
Amdahl's Law
"Bottleneckology: Evaluating Supercomputers,"
Jack Worlton, COMPCON 85, pp. 405-406
The average execution rate (performance) is the
weighted harmonic mean of the individual rates:
R = 1 / (f1/R1 + f2/R2 + ... + fn/Rn),
where fi is the fraction of results generated at rate Ri.
7
Example of Amdahl's Law
30% of results are generated at the rate of 1
MFLOPS, 20% at 10 MFLOPS, and 50% at 100 MFLOPS.
What is the average performance? What is the
bottleneck?
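This example can be worked out with the weighted harmonic mean from the previous slide. A minimal Python sketch (the helper name `harmonic_mean_rate` is mine, not from the course):

```python
def harmonic_mean_rate(fractions, rates):
    """Average rate R = 1 / sum(f_i / R_i) (weighted harmonic mean)."""
    return 1.0 / sum(f / r for f, r in zip(fractions, rates))

# The slide's numbers: 30% of results at 1 MFLOPS, 20% at 10, 50% at 100.
fractions = [0.30, 0.20, 0.50]
rates = [1.0, 10.0, 100.0]          # MFLOPS

avg = harmonic_mean_rate(fractions, rates)           # ~3.08 MFLOPS
# The per-rate time shares f_i / R_i expose the bottleneck: the 1 MFLOPS
# rate accounts for 0.30 of the 0.325 total time units, i.e. over 90%.
shares = [f / r for f, r in zip(fractions, rates)]
print(avg, shares)
```

The time-share view makes the bottleneck explicit: improving the 100 MFLOPS rate would barely move the average.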
8
Amdahl's Law (HP3 book, pp. 40-41)
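The formula on this slide did not survive extraction; the standard HP3 form of Amdahl's Law, sketched here as a small helper (the function name is mine):

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when a fraction F of execution time is sped up S-fold:

    Speedup_overall = 1 / ((1 - F) + F / S)
    """
    return 1.0 / ((1.0 - fraction_enhanced)
                  + fraction_enhanced / speedup_enhanced)

# Even an unbounded speedup of half the program caps the overall gain at 2x.
print(amdahl_speedup(0.5, 2.0))     # 1.333...
print(amdahl_speedup(0.5, 1e12))    # approaches 2.0
```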
9
Implications of Amdahl's Law
  • The performance improvements provided by a
    feature are limited by how often that feature is
    used
  • As stated, Amdahl's Law is valid only if the
    system always works with exactly one of the rates
  • If a non-blocking cache is used, or there is
    overlap between CPU and I/O operations, Amdahl's
    Law as given here is not applicable
  • Bottleneck is the most promising target for
    improvements
  • Make the common case fast
  • Infrequent events, even if they consume a lot of
    time, will make little difference to performance
  • Typical use: change only one parameter of the
    system, and compute the effect of this change
  • The same program, with the same input data,
    should run on the machine in both cases

10
Make The Common Case Fast
  • All instructions require an instruction fetch,
    only a fraction require a data fetch/store
  • Optimize instruction access over data access
  • Programs exhibit locality
  • Spatial Locality
  • items with addresses near one another tend to be
    referenced close together in time
  • Temporal Locality
  • recently accessed items are likely to be accessed
    in the near future
  • Access to small memories is faster
  • Provide a storage hierarchy such that the most
    frequent accesses are to the smallest (closest)
    memories.

11
Make The Common Case Fast (2)
  • What is the common case?
  • The rate at which the system spends most of its
    time
  • The bottleneck
  • What does this statement mean precisely?
  • Make the common case faster, rather than making
    some other case faster
  • Make the common case faster by a certain amount,
    rather than making some other case faster by the
    same amount
  • Absolute amount?
  • Relative amount?
  • This principle is merely an informal statement of
    a frequently correct consequence of Amdahl's Law

12
Make The Common Case Fast (3a)
A machine produces 20% and 80% of its results at
the rates of 1 and 3 MFLOPS, respectively. Which
is more advantageous: to improve the 1 MFLOPS
rate, or to improve the 3 MFLOPS rate?
Generalize the problem: assume the rates are x and y
MFLOPS.
At (x, y) = (1, 3), the analysis indicates that it is
better to improve x, the 1 MFLOPS rate, which is
not the common case.
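The claim can be checked numerically. A sketch (function and variable names are mine): add the same absolute increment to each rate and compare the resulting average rates.

```python
def avg_rate(x, y, fx=0.2, fy=0.8):
    """Weighted harmonic mean: 20% of results at rate x, 80% at rate y."""
    return 1.0 / (fx / x + fy / y)

x, y, delta = 1.0, 3.0, 0.1          # same absolute improvement of 0.1 MFLOPS
improve_x = avg_rate(x + delta, y)
improve_y = avg_rate(x, y + delta)
# At (x, y) = (1, 3), the absolute improvement helps more when applied
# to x, the 1 MFLOPS rate -- the *uncommon* case.
print(improve_x, improve_y)
```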
13
Make The Common Case Fast (3b)
Let's say that we want to make the same relative
change to one rate or the other, rather than the
same absolute change.
At (x, y) = (1, 3), this indicates that it is
better to improve y, the 3 MFLOPS rate, which is
the common case.
If there are two different execution rates,
making the common case faster by the same
relative amount is always more advantageous than
the alternative. However, this does not
necessarily hold if we make absolute changes of
the same magnitude. For three or more rates,
further analysis is needed.
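The relative-change case can be checked the same way (a sketch, names mine): scale each rate by the same factor and compare.

```python
def avg_rate(x, y, fx=0.2, fy=0.8):
    """Weighted harmonic mean: 20% of results at rate x, 80% at rate y."""
    return 1.0 / (fx / x + fy / y)

x, y, factor = 1.0, 3.0, 1.10        # same 10% relative improvement
improve_x = avg_rate(x * factor, y)
improve_y = avg_rate(x, y * factor)
# With equal relative changes, improving y -- the 3 MFLOPS rate, the
# common case -- now wins, as the slide states.
print(improve_x, improve_y)
```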
14
Basics of Performance
15
Details of CPI
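The bodies of these two slides did not survive extraction; the standard textbook relations they cover (CPU time = IC x CPI x clock cycle time, with CPI as the instruction-mix-weighted average) can be sketched as follows. The instruction mix below is purely illustrative, not from the course:

```python
def overall_cpi(mix_fractions, class_cpis):
    """CPI = sum over instruction classes of (IC_i / IC) * CPI_i."""
    return sum(f * c for f, c in zip(mix_fractions, class_cpis))

def cpu_time(instruction_count, cpi, cycle_time_s):
    """CPU time = IC * CPI * clock cycle time."""
    return instruction_count * cpi * cycle_time_s

# Hypothetical mix: 50% ALU (CPI 1), 30% load/store (CPI 2), 20% branch (CPI 2)
cpi = overall_cpi([0.5, 0.3, 0.2], [1, 2, 2])        # 1.5
print(cpu_time(1_000_000_000, cpi, 2e-9))            # 3.0 s at a 500 MHz clock
```

The later examples on CPI and MIPS all reduce to these two relations.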
16
MIPS
  • Machines with different instruction sets?
  • Programs with different instruction mixes?
  • Dynamic frequency of instructions
  • Uncorrelated with performance
  • Marketing metric
  • Meaningless Indicator of Processor Speed

17
MFLOP/s
  • Popular in supercomputing community
  • Often not where time is spent
  • Not all FP operations are equal
  • Normalized MFLOP/s
  • Can magnify performance differences
  • A better algorithm (e.g., with better data reuse)
    can run faster even with higher FLOP count
  • DGEQRF vs. DGEQR2 in LAPACK

18
Aspects of CPU Performance
19
Example 1 (see HP3 pp. 42-45 for more examples)
Which change is more effective on a certain
machine: speeding up 10-fold the floating-point
square root operation only, which takes up 20% of
execution time, or speeding up 2-fold all
floating-point operations, which take up 50% of
total execution time? (Assume that the cost of
accomplishing either change is the same, and that
the two changes are mutually exclusive.)
Fsqrt = fraction of FP sqrt results
Rsqrt = rate of producing FP sqrt results
Fnon-sqrt = fraction of non-sqrt results
Rnon-sqrt = rate of producing non-sqrt results
Ffp = fraction of FP results
Rfp = rate of producing FP results
Fnon-fp = fraction of non-FP results
Rnon-fp = rate of producing non-FP results
Rbefore = average rate of producing results before enhancement
Rafter = average rate of producing results after enhancement
20
Example 1 (Solution using Amdahl's Law)
Improving all FP operations is more effective
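The solution can be reproduced directly with the Amdahl speedup formula (a sketch; the helper name is mine):

```python
def amdahl_speedup(fraction, speedup):
    """Speedup_overall = 1 / ((1 - F) + F / S)."""
    return 1.0 / ((1.0 - fraction) + fraction / speedup)

sqrt_only = amdahl_speedup(0.20, 10.0)   # ~1.22x overall
all_fp    = amdahl_speedup(0.50, 2.0)    # ~1.33x overall
# A 2-fold speedup of all FP operations beats a 10-fold sqrt-only speedup,
# because the FP fraction of execution time (50%) is larger.
print(sqrt_only, all_fp)
```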
21
Example 2
Which CPU performs better? Why?
22
Example 2 (Solution)
If the clock cycle time of A were only 1.1x the
clock cycle time of B, then CPU B would have about
9% higher performance.
23
Example 3
A LOAD/STORE machine has the characteristics
shown below. We also observe that 25% of the
ALU operations directly use a loaded value that
is not used again. Thus we hope to improve
things by adding new ALU instructions that have
one source operand in memory. The CPI of the new
instructions is 2. The only unpleasant
consequence of this change is that the CPI of
branch instructions will increase from 2 to 3.
Overall, will CPU performance increase?
24
Example 3 (Solution)
Before change
After change
Since CPU time increases, the change will not
improve performance.
25
Example 4
A load-store machine has the characteristics
shown below. An optimizing compiler for the
machine discards 50% of the ALU operations,
although it cannot reduce loads, stores, or
branches. Assuming a 500 MHz (2 ns) clock, what
is the MIPS rating for optimized code versus
unoptimized code? Does the ranking by MIPS agree
with the ranking by execution time?
26
Example 4 (Solution)
Without optimization
With optimization
Performance increases, but MIPS decreases!
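The slide's instruction-mix table did not survive extraction, so the mix below is purely illustrative; it nevertheless reproduces the slide's punch line: discarding ALU operations lowers the MIPS rating even though execution time improves.

```python
clock_hz = 500e6     # 500 MHz, 2 ns cycle (from the problem statement)

# Hypothetical mix per original instruction: class -> (fraction, CPI)
mix = {"alu": (0.43, 1), "load": (0.21, 2),
       "store": (0.12, 2), "branch": (0.24, 2)}

def stats(alu_scale):
    """Return (CPU time in s, MIPS) after scaling the ALU instruction count."""
    counts = {k: f * (alu_scale if k == "alu" else 1.0)
              for k, (f, _) in mix.items()}
    instrs = sum(counts.values())
    cycles = sum(counts[k] * cpi for k, (_, cpi) in mix.items())
    time_s = cycles / clock_hz
    mips = instrs / time_s / 1e6
    return time_s, mips

t_before, mips_before = stats(1.0)   # unoptimized code
t_after,  mips_after  = stats(0.5)   # optimizer discards 50% of ALU ops
print(t_after < t_before, mips_after < mips_before)
```

MIPS drops because the discarded instructions were the cheapest (CPI 1) ones, raising the average CPI while total cycles still fall.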
27
Performance of (Blocking) Caches
28
Example
Assume we have a machine where the CPI is 2.0
when all memory accesses hit in the cache. The
only data accesses are loads and stores, and
these total 40% of the instructions. If the miss
penalty is 25 clock cycles and the miss rate is
2%, how much faster would the machine be if all
memory accesses were cache hits?
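Worked out with the standard blocking-cache formula, CPI_effective = CPI_base + accesses/instruction x miss rate x miss penalty. A sketch, assuming (as in the usual textbook treatment) 1.4 memory accesses per instruction: one instruction fetch plus 0.4 data accesses:

```python
cpi_base = 2.0                    # CPI if every memory access hits
accesses_per_instr = 1.0 + 0.40   # 1 instruction fetch + 40% loads/stores
miss_rate = 0.02
miss_penalty = 25                 # clock cycles

cpi_real = cpi_base + accesses_per_instr * miss_rate * miss_penalty  # 2.7
speedup_if_all_hit = cpi_real / cpi_base                             # 1.35
print(cpi_real, speedup_if_all_hit)
```

The all-hits machine would be 1.35 times faster: memory stalls add 0.7 cycles to every instruction on average.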
29
Means
30
Weighted Means
31
Relations among Means
For positive values, harmonic mean ≤ geometric mean ≤ arithmetic mean.
Equality holds if and only if all the elements are identical.
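A quick numerical check of the inequality chain among the three Pythagorean means (a sketch, not course code):

```python
import math

def arithmetic(xs): return sum(xs) / len(xs)
def geometric(xs):  return math.prod(xs) ** (1 / len(xs))
def harmonic(xs):   return len(xs) / sum(1 / x for x in xs)

rates = [1.0, 10.0, 100.0]
# harmonic <= geometric <= arithmetic; the gap is large for spread-out data
print(harmonic(rates), geometric(rates), arithmetic(rates))
# Equality only when all elements are identical:
print(harmonic([4.0, 4.0]), geometric([4.0, 4.0]), arithmetic([4.0, 4.0]))
```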
32
Summarizing Computer Performance
Characterizing Computer Performance with a
Single Number, J. E. Smith, CACM, October 1988,
pp. 1202-1206
  • The starting point is universally accepted
  • The time required to perform a specified
    amount of computation is the ultimate measure of
    computer performance
  • How should we summarize (reduce to a single
    number) the measured execution times (or measured
    performance values) of several benchmark
    programs?
  • Two required properties
  • A single-number performance measure for a set of
    benchmarks expressed in units of time should be
    directly proportional to the total (weighted)
    time consumed by the benchmarks.
  • A single-number performance measure for a set of
    benchmarks expressed as a rate should be
    inversely proportional to the total (weighted)
    time consumed by the benchmarks.

33
Arithmetic Mean for Times
Smaller is better for execution times
34
Harmonic Mean for Rates
Larger is better for execution rates
35
Avoid the Geometric Mean
  • If benchmark execution times are normalized to
    some reference machine, and means of normalized
    execution times are computed, only the geometric
    mean gives consistent results no matter what the
    reference machine is (see Figure 1.17 in HP3, pg.
    38)
  • This has led to declaring the geometric mean as
    the preferred method of summarizing execution
    time (e.g., SPEC)
  • Smith's comments
  • "The geometric mean does provide a consistent
    measure in this context, but it is consistently
    wrong."
  • If performance is to be normalized with respect
    to a specific machine, an aggregate performance
    measure such as total time or harmonic mean rate
    should be calculated before any normalizing is
    done. That is, benchmarks should not be
    individually normalized first.
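Smith's objection can be demonstrated with two hypothetical machines and two benchmarks (numbers invented purely for illustration):

```python
import math

def geo_mean(xs):
    return math.prod(xs) ** (1 / len(xs))

# Execution times in seconds for two benchmarks on two machines.
machine_a = [1.0, 1000.0]
machine_b = [10.0, 100.0]

# By total time, B is about 9x faster than A.
print(sum(machine_a), sum(machine_b))            # 1001.0 vs 110.0

# Normalize each benchmark to machine A, then take geometric means:
ratios_a = [a / a for a in machine_a]                      # [1.0, 1.0]
ratios_b = [b / a for b, a in zip(machine_b, machine_a)]   # [10.0, 0.1]
print(geo_mean(ratios_a), geo_mean(ratios_b))    # both 1.0 -- "equal"!
# The geometric mean of normalized times is consistent across reference
# machines, but it hides the 9x difference in total time.
```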

36
Programs to Evaluate Performance
  • (Toy) Benchmarks
  • 10-100 line program
  • sieve, puzzle, quicksort
  • Synthetic Benchmarks
  • Attempt to match average frequencies of real
    workloads
  • Whetstone, Dhrystone
  • Kernels
  • Time-critical excerpts of real programs
  • Livermore loops
  • Real programs
  • gcc, compress