COMP 206: Computer Architecture and Implementation - PowerPoint PPT Presentation

Slides: 39
Provided by: Montek5
Learn more at: http://www.cs.unc.edu

1
COMP 206: Computer Architecture and Implementation
  • Montek Singh
  • Wed., Sep 1, 2004
  • Lecture 3
  • (continuation of Lecture 2)

2
Outline
  • Quantitative Principles of Computer Design
  • Amdahl's law (make the common case fast)
  • Performance Metrics
  • MIPS, FLOPS, and all that
  • Examples

3
Example 1 (see HP3 pp. 42-45 for more examples)
Which change is more effective on a certain
machine: speeding up 10-fold the floating point
square root operation only, which takes up 20% of
execution time, or speeding up 2-fold all
floating point operations, which take up 50% of
total execution time? (Assume that the cost of
accomplishing either change is the same, and
the two changes are mutually exclusive.)
Fsqrt = fraction of FP sqrt results
Rsqrt = rate of producing FP sqrt results
Fnon-sqrt = fraction of non-sqrt results
Rnon-sqrt = rate of producing non-sqrt results
Ffp = fraction of FP results
Rfp = rate of producing FP results
Fnon-fp = fraction of non-FP results
Rnon-fp = rate of producing non-FP results
Rbefore = average rate of producing results before enhancement
Rafter = average rate of producing results after enhancement
4
Example 1 (Soln. using Amdahl's Law)
Improving all FP operations is more effective
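The slide's arithmetic did not survive in this transcript. A minimal sketch of the comparison using the standard form of Amdahl's Law, with the fractions and speedups taken from the problem statement (the function name amdahl_speedup is only illustrative):

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup when a fraction of the original execution time
    is accelerated by the given factor (Amdahl's Law)."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

# Option 1: FP square root only (20% of time) made 10x faster.
speedup_sqrt = amdahl_speedup(0.20, 10.0)
# Option 2: all FP operations (50% of time) made 2x faster.
speedup_fp = amdahl_speedup(0.50, 2.0)

print(f"sqrt only: {speedup_sqrt:.2f}x, all FP: {speedup_fp:.2f}x")
```

The overall speedups come out to roughly 1.22 and 1.33, which agrees with the slide's conclusion.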
5
Example 2
Why?
Which CPU performs better?
6
Example 2 (Solution)
If clock cycle time of A was only 1.1x clock
cycle time of B, then CPU B would have about 9%
higher performance.
7
Example 3
A LOAD/STORE machine has the characteristics
shown below. We also observe that 25% of the
ALU operations directly use a loaded value that
is not used again. Thus we hope to improve
things by adding new ALU instructions that have
one source operand in memory. The CPI of the new
instructions is 2. The only unpleasant
consequence of this change is that the CPI of
branch instructions will increase from 2 to 3.
Overall, will CPU performance increase?
8
Example 3 (Solution)
Before change
After change
Since CPU time increases, change will not improve
performance.
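The table of machine characteristics and the before/after arithmetic are missing from this transcript. The sketch below shows the shape of the calculation only. The instruction frequencies and the CPIs for ALU, load, and store instructions are assumed values chosen for illustration. Only the CPI of the new instructions (2) and the branch CPI (2 before, 3 after) come from the problem statement.

```python
# ASSUMED instruction mix and CPIs (the slide's table is not in the transcript).
freq = {"alu": 0.43, "load": 0.21, "store": 0.12, "branch": 0.24}
cpi_before = {"alu": 1, "load": 2, "store": 2, "branch": 2}

# Cycles per original instruction before the change.
cycles_before = sum(freq[k] * cpi_before[k] for k in freq)

# 25% of ALU operations fold their load into a new register-memory ALU
# instruction: each one removes one ALU op and one load, and adds one new op.
merged = 0.25 * freq["alu"]
count_after = {
    "alu": freq["alu"] - merged,
    "load": freq["load"] - merged,
    "store": freq["store"],
    "branch": freq["branch"],
    "alu_mem": merged,
}
# From the problem statement: new instructions have CPI 2, branches go to CPI 3.
cpi_after = {"alu": 1, "load": 2, "store": 2, "branch": 3, "alu_mem": 2}

# Cycles after the change, still measured per ORIGINAL instruction, so the
# two numbers are directly comparable when the clock cycle time is unchanged.
cycles_after = sum(count_after[k] * cpi_after[k] for k in count_after)

print(f"before: {cycles_before:.4f}  after: {cycles_after:.4f}")
```

Under these assumed frequencies the cycle count rises (about 1.57 to about 1.70 per original instruction) even though fewer instructions execute, which matches the slide's conclusion.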
9
Example 4
A load-store machine has the characteristics
shown below. An optimizing compiler for the
machine discards 50% of the ALU operations,
although it cannot reduce loads, stores, or
branches. Assuming a 500 MHz (2 ns) clock, what
is the MIPS rating for optimized code versus
unoptimized code? Does the ranking of MIPS agree
with the ranking of execution time?
10
Example 4 (Solution)
Without optimization
With optimization
Performance increases, but MIPS decreases!
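The table and the two calculations are missing from this transcript. A sketch of how the comparison works, using MIPS = clock rate in MHz / CPI: the instruction mix and per-class CPIs below are assumed for illustration, while the 500 MHz clock and the 50% reduction in ALU operations come from the problem statement.

```python
CLOCK_MHZ = 500.0  # from the problem statement (2 ns cycle)

# ASSUMED instruction mix and CPIs (the slide's table is not in the transcript).
freq = {"alu": 0.43, "load": 0.21, "store": 0.12, "branch": 0.24}
cpi = {"alu": 1, "load": 2, "store": 2, "branch": 2}

def summarize(counts):
    """Return (relative instruction count, average CPI, MIPS, relative cycles)."""
    instr = sum(counts.values())
    cycles = sum(counts[k] * cpi[k] for k in counts)
    avg_cpi = cycles / instr
    return instr, avg_cpi, CLOCK_MHZ / avg_cpi, cycles

# Unoptimized code: the full mix.
ic_unopt, cpi_unopt, mips_unopt, cycles_unopt = summarize(freq)

# Optimized code: the compiler removes half of the ALU operations only.
optimized = dict(freq, alu=0.5 * freq["alu"])
ic_opt, cpi_opt, mips_opt, cycles_opt = summarize(optimized)

# Execution time is proportional to total cycles at a fixed clock rate.
print(f"unoptimized: MIPS={mips_unopt:.1f}, relative time={cycles_unopt:.3f}")
print(f"optimized:   MIPS={mips_opt:.1f}, relative time={cycles_opt:.3f}")
```

Removing the cheapest instructions raises the average CPI, so the MIPS rating falls while total cycles (and therefore execution time) also fall.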
11
Performance of (Blocking) Caches
12
Example
Assume we have a machine where the CPI is 2.0
when all memory accesses hit in the cache. The
only data accesses are loads and stores, and
these total 40% of the instructions. If the miss
penalty is 25 clock cycles and the miss rate is
2%, how much faster would the machine be if all
memory accesses were cache hits?
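A sketch of the calculation. It assumes that every instruction makes one instruction-fetch access in addition to any data access, and that the 2% miss rate applies to all accesses. Variable names are illustrative.

```python
base_cpi = 2.0               # CPI when every access hits in the cache
data_access_fraction = 0.40  # loads and stores, as a fraction of instructions
miss_rate = 0.02             # ASSUMED to apply to instruction and data accesses
miss_penalty = 25            # clock cycles per miss

# One instruction fetch per instruction, plus data accesses for loads/stores.
accesses_per_instruction = 1.0 + data_access_fraction

# Extra cycles per instruction spent waiting on cache misses.
stall_cpi = accesses_per_instruction * miss_rate * miss_penalty

cpi_with_misses = base_cpi + stall_cpi
speedup_if_all_hits = cpi_with_misses / base_cpi

print(f"stall CPI = {stall_cpi:.2f}, speedup with perfect cache = {speedup_if_all_hits:.2f}")
```

Under these assumptions the stalls add 0.7 cycles per instruction, so a machine with no misses would be about 1.35 times faster.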
13
Means
14
Weighted Means
15
Relations among Means
Equality holds if and only if all the elements
are identical.
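The formulas on the three slides above are missing from this transcript. The relation referred to is presumably the standard chain harmonic mean <= geometric mean <= arithmetic mean for positive values. A small sketch that checks it on made-up sample data:

```python
import math

def arithmetic_mean(xs):
    return sum(xs) / len(xs)

def geometric_mean(xs):
    # Computed via logarithms to avoid overflow on long inputs.
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

def harmonic_mean(xs):
    return len(xs) / sum(1.0 / x for x in xs)

sample = [1.0, 4.0, 16.0]  # hypothetical positive values
am, gm, hm = arithmetic_mean(sample), geometric_mean(sample), harmonic_mean(sample)
print(f"HM={hm:.3f} <= GM={gm:.3f} <= AM={am:.3f}")

# When all elements are identical the three means coincide.
same = [5.0, 5.0, 5.0]
print(arithmetic_mean(same), geometric_mean(same), harmonic_mean(same))
```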
16
Summarizing Computer Performance
"Characterizing Computer Performance with a
Single Number", J. E. Smith, CACM, October 1988,
pp. 1202-1206
  • The starting point is universally accepted:
  • The time required to perform a specified
    amount of computation is the ultimate measure of
    computer performance
  • How should we summarize (reduce to a single
    number) the measured execution times (or measured
    performance values) of several benchmark
    programs?
  • Two required properties:
  • A single-number performance measure for a set of
    benchmarks expressed in units of time should be
    directly proportional to the total (weighted)
    time consumed by the benchmarks.
  • A single-number performance measure for a set of
    benchmarks expressed as a rate should be
    inversely proportional to the total (weighted)
    time consumed by the benchmarks.

17
Arithmetic Mean for Times
Smaller is better for execution times
18
Harmonic Mean for Rates
Larger is better for execution rates
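A sketch of why the harmonic mean is the right summary for rates. It uses made-up numbers and assumes each benchmark performs the same amount of work, so equal weights apply.

```python
# Hypothetical: two benchmarks, each performing 100 million FP operations.
work_mflop = [100.0, 100.0]
times_s = [1.0, 4.0]                                    # measured execution times
rates = [w / t for w, t in zip(work_mflop, times_s)]    # MFLOPS per benchmark

# The true aggregate rate: total work divided by total time.
true_rate = sum(work_mflop) / sum(times_s)

harmonic = len(rates) / sum(1.0 / r for r in rates)
arithmetic = sum(rates) / len(rates)

print(f"true={true_rate:.1f}  harmonic={harmonic:.1f}  arithmetic={arithmetic:.1f}")
```

The harmonic mean of the rates equals total work over total time, so it is inversely proportional to total time as required. The arithmetic mean of the rates overstates the aggregate rate.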
19
Avoid the Geometric Mean
  • If benchmark execution times are normalized to
    some reference machine, and means of normalized
    execution times are computed, only the geometric
    mean gives consistent results no matter what the
    reference machine is (see Figure 1.17 in HP3, pg.
    38)
  • This has led to declaring the geometric mean as
    the preferred method of summarizing execution
    time (e.g., SPEC)
  • Smith's comments:
  • The geometric mean does provide a consistent
    measure in this context, but it is consistently
    wrong.
  • If performance is to be normalized with respect
    to a specific machine, an aggregate performance
    measure such as total time or harmonic mean rate
    should be calculated before any normalizing is
    done. That is, benchmarks should not be
    individually normalized first.
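A sketch of the problem, with made-up execution times for two hypothetical machines X and Y. The geometric mean of normalized times gives the same answer whichever machine is the reference, yet it disagrees with total time.

```python
import math

def geometric_mean(xs):
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

# Hypothetical execution times (seconds) for two benchmark programs.
times_x = [1.0, 1000.0]
times_y = [10.0, 100.0]

# Normalize each benchmark individually, then take the geometric mean.
gm_y_relative_to_x = geometric_mean([y / x for x, y in zip(times_x, times_y)])
gm_x_relative_to_y = geometric_mean([x / y for x, y in zip(times_x, times_y)])

# Aggregate first (total time), as Smith recommends, then compare.
total_x, total_y = sum(times_x), sum(times_y)

print(f"GM of Y normalized to X: {gm_y_relative_to_x:.3f}")
print(f"GM of X normalized to Y: {gm_x_relative_to_y:.3f}")
print(f"total time X: {total_x:.0f} s, total time Y: {total_y:.0f} s")
```

Both geometric means are 1, suggesting the machines are equivalent, while machine Y finishes the whole workload in far less total time.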

20
Programs to Evaluate Performance
  • (Toy) Benchmarks
  • 10-100 line program
  • sieve, puzzle, quicksort
  • Synthetic Benchmarks
  • Attempt to match average frequencies of real
    workloads
  • Whetstone, Dhrystone
  • Kernels
  • Time-critical excerpts of real programs
  • Livermore loops
  • Real programs
  • gcc, compress

21
SPEC: Standard Performance Evaluation Corporation
  • First round 1989 (SPEC CPU89)
  • 10 programs yielding a single number
  • Second round 1992 (SPEC CPU92)
  • SPECint92 (6 integer programs) and SPECfp92 (14
    floating point programs)
  • Compiler flags unlimited. March 93 of DEC 4000
    Model 610:
  • spice: unix.c:/def=(sysv,has_bcopy,bcopy(a,b,c)=memcpy(b,a,c)
  • wave5: /ali=(all,dcom=nat)/ag=a/ur=4/ur=200
  • nasa7: /norecu/ag=a/ur=4/ur2=200/lc=blas
  • Third round 1995 (SPEC CPU95)
  • Single flag setting for all programs; new set of
    programs (8 integer, 10 floating point)
  • Phased out in June 2000
  • SPEC CPU2000 released April 2000

22
SPEC95 Details
  • Reference machine
  • Sun SPARCstation 10/40
  • 128 MB memory
  • Sun SC 3.0.1 compilers
  • Benchmarks larger than SPEC92
  • Larger code size
  • More memory activity
  • Minimal calls to library routines
  • Greater reproducibility of results
  • Standardized build and run environment
  • Manual intervention forbidden
  • Definitions of baseline tightened
  • Multiple numbers
  • SPECint_95base, SPECint_95, SPECfp_95base,
    SPECfp_95

Source: SPEC
23
Trends in Integer Performance
Source: Microprocessor Report 13(17), 27 Dec 1999
24
Trends in Floating Point Performance
Source: Microprocessor Report 13(17), 27 Dec 1999
25
SPEC95 Ratings of Processors
Source: Microprocessor Report, 24 Apr 2000
26
SPEC95 vs SPEC CPU2000
Source: Microprocessor Report, 17 Apr 2000
Read: "SPEC CPU2000: Measuring CPU Performance in
the New Millennium", John L. Henning, Computer,
July 2000, pages 28-35
27
SPEC CPU2000 Example
  • Baseline machine: Sun Ultra 5, 300 MHz UltraSPARC
    IIi, 256 KB L2
  • Running time ratios scaled by factor of 100
  • Reference score of baseline machine is 100
  • Reference time of 176.gcc should be 1100, not 110
  • Example shows 667 MHz Alpha processor on both
    CINT2000 and CINT95

Source: Microprocessor Report, 17 Apr 2000
28
Performance Evaluation
  • Given: sales is a function of performance relative
    to the competition
  • There's a big investment in improving product as
    reported by performance summary
  • Good products are created when you have:
  • Good benchmarks
  • Good ways to summarize performance
  • If benchmarks/summary inadequate, then choose
    between improving product for real programs vs.
    improving product to get more sales
  • Sales almost always wins!
  • Execution time is the measure of computer
    performance!
  • What about cost?

29
Cost of Integrated Circuits
30
Explanations
The second term in "Dies per wafer" corrects for the
rectangular dies near the periphery of round
wafers.
Die yield assumes a simple empirical model:
defects are randomly distributed over the wafer,
and yield is inversely proportional to the
complexity of the fabrication process (indicated
by α).
α = 3 for modern processes implies that cost of
die is proportional to (Die area)^4.
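The formulas from the preceding slide are not in this transcript. The sketch below assumes the commonly used textbook model that these explanations appear to describe: dies per wafer is wafer area over die area minus an edge-loss term, and die yield is wafer yield × (1 + defect density × die area / α)^(−α). The wafer size, wafer cost, and defect density are made-up illustration values.

```python
import math

def dies_per_wafer(wafer_diameter_cm: float, die_area_cm2: float) -> float:
    """Wafer area / die area, minus a correction for partial dies at the edge."""
    wafer_area = math.pi * (wafer_diameter_cm / 2.0) ** 2
    edge_loss = math.pi * wafer_diameter_cm / math.sqrt(2.0 * die_area_cm2)
    return wafer_area / die_area_cm2 - edge_loss

def die_yield(defects_per_cm2: float, die_area_cm2: float,
              alpha: float = 3.0, wafer_yield: float = 1.0) -> float:
    """Empirical yield model; alpha reflects process complexity."""
    return wafer_yield * (1.0 + defects_per_cm2 * die_area_cm2 / alpha) ** (-alpha)

def cost_per_die(wafer_cost: float, wafer_diameter_cm: float,
                 die_area_cm2: float, defects_per_cm2: float) -> float:
    """Wafer cost spread over the good dies only."""
    good_dies = (dies_per_wafer(wafer_diameter_cm, die_area_cm2)
                 * die_yield(defects_per_cm2, die_area_cm2))
    return wafer_cost / good_dies

# Hypothetical process: 30 cm wafer, $5000 per wafer, 0.6 defects per cm^2.
small = cost_per_die(5000.0, 30.0, 1.0, 0.6)
large = cost_per_die(5000.0, 30.0, 2.0, 0.6)
print(f"1 cm^2 die: ${small:.2f}, 2 cm^2 die: ${large:.2f}, ratio {large / small:.2f}")
```

Doubling the die area more than doubles the cost per die, because both the die count and the yield fall. In the limit of large defect density × area, yield falls as area^(−α), which together with the 1/area die count gives the (Die area)^4 behaviour for α = 3.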
31
Real World Examples
"Revised Model Reduces Cost Estimates", Linley
Gwennap, Microprocessor Report 10(4), 25 Mar 1996
  • 0.18-micron process standard, 0.11-micron
    available now
  • BiCMOS is dead
  • Silicon-on-Insulator (SOI) process in works

32
Moore's Law
"Cramming More Components onto Integrated
Circuits", G. E. Moore, Electronics, pp. 114-117,
April 1965
  • Historical context
  • Predicting implications of technology scaling
  • Makes over 25 predictions, and all of them have
    come true
  • Read the paper and find out these predictions!
  • Moore's Law
  • The complexity for minimum component costs has
    increased at a rate of roughly a factor of two
    per year.
  • Based on extrapolation from five points!
  • Later, more accurate formula
  • Technology scaling of integrated circuits
    following this trend has been driver of much
    economic productivity over last two decades

33
Moore's Law in Action at Intel
Source: Microprocessor Report 9(6), 8 May 1995
34
Moore's Law At Risk?
Source: Microprocessor Report, 24 Aug 1998
35
Where Do The Transistors Go?
Source: Microprocessor Report, 24 Apr 2000
  • Logic contributes a (vanishingly) small fraction
    of the number of transistors
  • Memory (mostly on-chip cache) is the biggest
    fraction
  • Computing is free, communication is expensive

36
Chip Photographs
Source: http://micro.magnet.fsu.edu/chipshots/index.html
UltraSparc
HP-PA 8000
37
Embedded Processors
  • More new instruction sets introduced in 1999 than
    in PC market for last 15 years
  • Hot trends of 1999
  • Network processors
  • Configurable cores
  • VLIW-based processors
  • ARM unit sales now surpass 68K/Coldfire unit
    sales
  • Diversity of market supports wide range of
    performance, power, and cost

Source: Microprocessor Report, 17 Jan 2000
38
Power-Performance Tradeoff (Embedded)
Source: Microprocessor Report, 17 Jan 2000