Performance, ALUs and such like - PowerPoint PPT Presentation

1 / 72
About This Presentation
Title:

Performance, ALUs and such like

Description:

Performance, ALUs and such like – PowerPoint PPT presentation

Number of Views:35
Avg rating:3.0/5.0
Slides: 73
Provided by: tar115
Learn more at: http://cseweb.ucsd.edu
Category:
Tags: alus | fact | performance | such

less

Transcript and Presenter's Notes

Title: Performance, ALUs and such like


1
Performance, ALUs and such like
  • The good news no quiz today !
  • Homework 1 is on the net now, so are the slides
    from previous class.
  • Home page is www.cs.ucsd.edu/tsoni/cse141
  • Finals will be the last day of class, no special
    time slot
  • Add-drops shall be handled at break.
  • Today Chap 2 and 4 of the text.

2
The Story so far
  • Computer organization concept of abstraction
  • Instruction Set Architectures Definition,
    types, examples
  • Instruction formats operands, addressing modes
  • Operations load, store, arithmetic, logical
  • Control instructions branch, jump, procedures
  • Stacks

Basically learnt about Instruction Set
Architectures
3
MIPS Software Register Conventions
0 zero constant 0 1 at reserved for
assembler 2 v0 expression evaluation
3 v1 function results 4 a0 arguments 5 a1 6 a2 7
a3 8 t0 temporary caller saves . . . (callee
can clobber) 15 t7
16 s0 callee saves . . . (caller can
clobber) 23 s7 24 t8 temporary
(contd) 25 t9 26 k0 reserved for OS
kernel 27 k1 28 gp Pointer to global
area 29 sp Stack pointer 30 fp frame
pointer 31 ra Return Address (HW)
4
Example Swap()
swap(int v, int k) int temp temp
vk vk vk1 vk1 temp
  • Can we figure out the code?

swap // 4v, 5k muli 2, 5, 4
// 2 k4 add 2, 4, 2 // 2 v(4k) lw
15, 0(2) // 15temp (20)(vk) lw 16,
4(2) // 16 (24) (vk1) sw 16,
0(2) // (vk) 16 (vk1) sw 15, 4(2)
// (vk1) 15 temp jr 31 //
return
5
Example Leaf_procedure()
int PairDiff(int a, int b, int c,int d) int
temp temp (ab)-(cd) return temp
  • Procedures?

Assume caller puts a0-a3 a,b,c,d and wants
result in v0 PairDiff // sub
sp,sp,12 // Make space for 3 temp
locations sw t1, 8(sp) // save t1 (optional
if MIPS convention) sw t0, 4(sp) // save t0
(optional if MIPS convention) sw s0, 0(sp) //
save s0 add t0,a0,a1 // (t0ab) add
t1,a2,a3 // (t1cd) sub s0,t0,t1 //
(s0t0-t1) add v0,s0,zero // store return
value in v0 lw s0,0(sp) // restore
registers lw t0,4(sp) // (optional if MIPS
convention) lw t1,8(sp) // (optional if
MIPS convention) add sp,sp,12 // pop the
stack jr ra // The actual return
to calling routine
6
Example Nested_procedure()
int fact(int n) if(nlt1) return(1) else
return (nfact(n-1))
  • What about nested procedures? ra ??
  • Recursive procedures?

Assume a0 n fact // sub
sp,sp,8 // Make space for 2 temp
locations sw ra, 4(sp) // save return
address sw a0, 4(sp) // save argument
n slt t0,a0,1 // test for nlt1 beq
t0,zero, L1 // if (ngt1) goto L1 add
v0,zero,1 // v01 add sp,sp,8 //
pop the stack jr ra // return
L1 sub a0,a0,1 // n-- jal fact
// call fact again. lw a0,0(sp) //
fact() returns here. Restore n lw ra,4(sp)
// restore return address add sp,sp,8 //
pop stack mult v0,a0,v0 // v0
nfact(n-1) jr ra // return to
caller
(nlt1) case
(ngt1) case
7
Comparing Instruction Set Architectures
Design-time metrics Can it be implemented, in
how long, at what cost? Can it be programmed?
Ease of compilation? Static Metrics How many
bytes does the program occupy in memory? Dynamic
Metrics How many instructions are
executed? How many bytes does the processor
fetch to execute the program? How many clocks
are required per instruction? How "lean" a
clock is practical? Best Metric Time to
execute the program!
  • This depends on
  • instruction set,
  • processor organization, and
  • compilation techniques.

8
Computer Performance
Measuring and Discussing Computer System
Performance
or My computer is faster than your computer
9
SPEC Performance
RISC introduction
performance now improves 50 per year (2x every
1.5 years)
But what is performance ??
10
Performance depends on the eyes of the beholder?
  • Purchasing perspective
  • given a collection of machines, which has the
  • best performance ?
  • least cost ?
  • best performance / cost ?
  • Design perspective
  • faced with design options, which has the
  • best performance improvement ?
  • least cost ?
  • best performance / cost ?
  • Both require
  • basis for comparison
  • metric for evaluation
  • Our goal is to understand cost performance
    implications of architectural choices

11
Two ideas
  • How much faster is the Concorde compared to the
    747?
  • How much bigger is the 747 than the Douglas DC-8?

Which has higher performance?
Time to do the task (Execution Time)
execution time, response time, latency Tasks
per day, hour, week, sec, ns. .. (Performance)
throughput, bandwidth Response time and
throughput often are in opposition
12
Two mechanisms of getting to the bay-area
Vehicle
Speed
Time to Bay Area
Passengers
Throughput (pm/h)
Ferrari
160 mph
3.1 hours
2
320
Greyhound
65 mph
7.7 hours
60
3900
Time to do the task from start to finish
execution time, response time, latency Tasks
per unit time throughput, bandwidth
mostly used for data movement
Response time and throughput often are in
opposition
13
Relative performance ?
  • can be confusing
  • A runs in 12 seconds
  • B runs in 20 seconds
  • A/B .6 , so A is 40 faster, or 1.4X faster, or
    B is 40 slower
  • B/A 1.67, so A is 67 faster, or 1.67X faster,
    or B is 67 slower
  • needs a precise definition

14
Relative performance ?
  • Performance is in units of things-per-second
  • bigger is better
  • If we are primarily concerned with response time
  • performance(x) 1
    execution_time(x)
  • " X is n times faster than Y" means
  • Performance(X)
  • n ----------------------
  • Performance(Y)

PerformanceX
Relative Performance
Execution TimeY
n



Execution TimeX
PerformanceY
15
How many times ?
  • Time of Concorde vs. Boeing 747?
  • Concord is 1350 mph / 610 mph 2.2 times faster

  • 6.5 hours / 3 hours
  • Throughput of Concorde vs. Boeing 747 ?
  • Concord is 178,200 pmph / 286,700 pmph 0.62
    times faster
  • Boeing is 286,700 pmph / 178,200 pmph 1.6
    times faster
  • Boeing is 1.6 times (60)faster in terms of
    throughput
  • Concord is 2.2 times (120) faster in terms of
    flying time
  • We will focus primarily on execution time for a
    single job

16
Some grammar?
  • times faster than (or times as fast as)
  • theres a multiplicative factor relating
    quantities
  • X was 3 time faster than Y ? speed(X) 3
    speed(Y)
  • percent faster than
  • implies an additive relationship
  • X was 25 faster than Y ? speed(X) (125/100)
    speed(Y)
  • percent slower than
  • implies subtraction
  • X was 5 slower than Y ? speed(X) (1-5/100)
    speed(Y)
  • 100 slower means it doesnt move at all !
  • times slower than or times as slow as
  • is awkward.
  • X was 3 times slower than Y means speed(X)
    1/3 speed(Y)

17
Avoid Linguistic Confusion
  • X is r times faster than Y ? speed(X) r
    speed(Y)
  • ? speed(Y)
    1/r speed(X)
  • ? Y is r times
    slower than X
  • X is r times faster than Y, Y is s times faster
    than Z
  • ? speed(X) r speed(Y) rs speed(Z)
  • ? X is rs faster than Z
  • (Cannot do this with numbers !)
  • Easiest way to avoid confusion
  • Convert faster to times faster
  • then do calculation and convert back if needed.
  • Example change 25 faster to 5/4 times
    faster.

18
Which time anyways ?
gt time foo ... foos results ... 90.7u 12.9s 239
65 gt
  • user CPU time? (time CPU spends running your
    code)
  • total CPU time (user kernel)? (includes op.
    sys. code)
  • Wallclock time? (total elapsed time)
  • Includes time spent waiting for I/O, other users,
    ...
  • Answer depends ...
  • For measuring processor speed, we can use total
    CPU.
  • If no I/O or interrupts, wallclock may be better
  • more precise (microseconds rather than 1/100 sec)
  • can measure individual sections of code

19
Metrics of Performance
Answers per month Useful Operations per second
Application
Programming Language
Compiler
(millions) of Instructions per second
MIPS (millions) of (F.P.) operations per second
MFLOP/s
ISA
Datapath
Megabytes per second
Control
Function Units
Cycles per second (clock rate)
Transistors
Wires
Pins
Each metric has a place and a purpose, and each
can be misused
20
Levels of benchmarking
Cons
Pros
  • very specific
  • non-portable
  • difficult to run, or
  • measure
  • hard to identify cause
  • representative

Actual Target Workload
  • portable
  • widely used
  • improvements useful in reality
  • less representative

Full Application Benchmarks
  • easy to fool

Small Kernel Benchmarks
  • easy to run, early in design cycle
  • peak may be a long way from application
    performance
  • identify peak capability and potential
    bottlenecks

Microbenchmarks
21
Cycle Time
  • Instead of reporting execution time in seconds,
    we often use cycles
  • Clock ticks indicate when to start activities
    (one abstraction)
  • cycle time time between ticks seconds per
    cycle
  • clock rate (frequency) cycles per second (1
    Hz. 1 cycle/sec)A 200 Mhz. clock has a cycle
    time of

22
Cycle Time
seconds
instructions
seconds/cycle
cycles/instruction
  • Improve performance gt reduce execution time
  • Reduce instruction count (Programmer, Compiler)
  • Reduce cycles per instruction (ISA, Machine
    designer)
  • Reduce clock cycle time (Hardware designer,
    Physicist)

23
Performance Variation
CPU Execution Time
Instruction Count
CPI
Clock Cycle Time

X
X
24
Amdahls Law
  • Execution Time After Improvement Execution
    Time Unaffected
  • ( Execution Time Affected / Amount of
    Improvement )
  • Example "Suppose a program runs in 100 seconds
    on a machine, with multiply responsible for 80
    seconds of this time. How much do we have to
    improve the speed of multiplication if we want
    the program to run 4 times faster?" How about
    making it 5 times faster?
  • Principle Make the common case fast

25
MIPS, MFLOPS etc.
  • MIPS - million instructions per second
  • number of instructions executed in program
    Clock rate
  • execution time in seconds 106
    CPI 106
  • MFLOPS - million floating point operations per
    second
  • number of floating point operations executed in
    program
  • execution time in seconds 106
  • program-independent
  • deceptive

26
Example RISC Processor
Base Machine (Reg / Reg) Op Freq Cycles CPI(i)
Time ALU 50 1 .5 23 Load 20 5
1.0 45 Store 10 3 .3 14 Branch 20 2
.4 18 2.2
Typical Mix
How much faster would the machine be if a better
data cache reduced the average load time to 2
cycles? How does this compare with using branch
prediction to shave a cycle off the branch
time? What if two ALU instructions could be
executed at once?
27
SPEC
Which Programs?
  • peak throughput measures (simple programs)
  • synthetic benchmarks (whetstone, dhrystone,...)
  • Real applications
  • SPEC (best of both worlds, but with problems of
    their own)
  • System Performance Evaluation Cooperative
  • Provides a common set of real applications along
    with strict guidelines for how to run them.
  • provides a relatively unbiased means to compare
    machines.

28
SPEC89
  • Compiler enhancements and performance

29
SPECCPU2000 Suite
  • SPECint2000
  • gzip and bzip2 - compression
  • gcc compiler 205K lines of messy code!
  • crafty chess program
  • parser word processing
  • vortex object-oriented database
  • perlbmk PERL interpreter
  • eon computer visualization
  • vpr, twolf CAD tools for VLSI
  • mcf, gap combinatorial programs
  • SPECfp2000 10 Fortran, 3 C programs
  • scientific application programs (physics,
    chemistry, image processing, number theory, ...)

30
Performance is always misleading
  • Performance is specific to a particular program/s
  • Total execution time is a consistent summary of
    performance
  • For a given architecture performance increases
    come from
  • increases in clock rate (without adverse CPI
    affects)
  • improvements in processor organization that lower
    CPI
  • compiler enhancements that lower CPI and/or
    instruction count
  • Pitfall expecting improvement in one aspect of
    a machines performance to affect the total
    performance
  • You should not always believe everything you
    read! Read carefully!

31
Computer Arithmetic
What do all those bits mean now?
bits (011011011100010 ....01)
data
instruction
number
text chars
..............
R-format
I-format ...
integer
floating point
single precision
double precision
signed
unsigned
...
...
...
...
32
Computer Arithmetic
  • How do you represent
  • negative numbers?
  • fractions?
  • really large numbers?
  • really small numbers?
  • How do you
  • do arithmetic?
  • identify errors (e.g. overflow)?
  • What is an ALU and what does it look like?
  • ALUarithmetic logic unit

33
Big Endian vs. Little Endian
Big Endian IBM, Mot, HP, Sun
Little Endian Dec, Intel
  • Some processors (e.g. PowerPC) provide both
  • If you can figure out how to switch modes or get
    the compiler to issue Byte-reversed loads and
    stores

34
Binary Numbers An Introduction
Consider a 4-bit binary number Examples of
binary arithmetic 3 2 5 3 3 6
Binary
Binary
Decimal
Decimal
0
0000
4
0100
1
0001
5
0101
2
0010
6
0110
3
0011
7
0111
1
1
1
0
0
1
1
0
0
1
1
0
0
1
0

0
0
1
1

0
1
0
1
0
1
1
0
35
Negative Numbers Some options
  • Sign Magnitude -- MSB is sign bit, rest the same
  • -1 1001
  • -5 1101
  • Ones complement -- flip all bits to negate
  • -1 1111
  • -5 1010
  • We would like a number system that provides
  • obvious representation of 0,1,2...
  • uses adder for addition
  • single value of 0
  • equal coverage of positive and negative numbers
  • easy detection of sign
  • easy negation

36
Negative Numbers twos complement
  • Positive numbers normal binary representation
  • Negative numbers flip bits (0 ??1) , then add 1

Decimal -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7
Twos Complement Binary 1000 1001 1010 1011 1100 1
101 1110 1111 0000 0001 0010 0011 0100 0101 0110 0
111
Smallest 4-bit number -8
Biggest 4-bit number 7
37
Twos complement arithmetic
2s Complement Binary
2s Complement Binary
Decimal
Decimal
0
0000
1111
-1
1
0001
1110
-2
2
0010
1101
-3
3
0011
1100
-4
4
0100
1011
-5
5
0101
1010
-6
6
0110
1001
-7
7
0111
1000
-8
  • Examples 7 - 6 7 (- 6) 1 3 -
    5 3 (- 5) -2

1
1
1
1
1
0
1
1
1
0
0
1
1
1
0
1
0

1
0
1
1

0
0
0
1
1
1
1
0
Uses simple adder for and - numbers
38
Things to keep in mind
  • Negation
  • flip bits and add 1. (Works for and -)
  • Might cause overflow.
  • Extend sign when loading into large register
  • 3 gt 0011, 00000011, 0000000000000011
  • -3 gt 1101, 11111101, 1111111111111101
  • Overflow detection
  • (need to raise exception when answer cant be
    represented)
  • 0101 5
  • 0110 6
  • 1011 -5 ??!!!

39
Overflow detection again
1
1
0
0
0
1
0
0
0
0
1
0
1
1
0
0
2
- 4
0
0
1
1

1
1
1
0

3
- 2
0
1
0
1
1
0
1
0
5
- 6
1
1
1
0
1
0
1
0
0
1
1
1
1
1
0
0
7
- 4
3
- 5
0
0
1
1

1
0
1
1

1
0
1
0
0
1
1
1
-6
7
So how do we detect overflow?
Carry into MSB ! Carry out of MSB
40
Execution the heart of it all
Instruction Fetch
Instruction Decode
Operand Fetch
Execute
Result Store
Next Instruction
41
A Basic ALU
  • ALU Control Lines (ALUop) Function
  • 000 And
  • 001 Or
  • 010 Add
  • 110 Subtract
  • 111 Set-on-less-than

General idea Build for 1-bit numbers and then
extend for n-bits!
42
Some basics of digital logic
43
1-bit ALU
  • ALU Control Lines (ALUop) Function
  • 000 And
  • 001 Or
  • ALU Control Lines (ALUop) Function
  • 000 And
  • 001 Or
  • 010 Add

But how do we make the adder?
44
1-bit Full Adder
  • This is also called a (3, 2) adder
  • Half Adder No CarryIn nor CarryOut
  • Truth Table

45
1-bit Full Adder CarryOut
CarryOut (!A B CarryIn) (A !B
CarryIn) (A B !CarryIn) (A B
CarryIn) CarryOut B CarryIn A CarryIn
A B
46
1-bit Full Adder Sum
Sum (!A !B CarryIn) (!A B
!CarryIn) (A !B !CarryIn) (A B
CarryIn)
47
32-bit ALU
  • (ALUop) Function
  • And
  • Or
  • Add

The 1-bit ALU
What about other operations sub s1, s, s3
s1 s2 - s3 Subtraction slt s1, s2,
s3 if (s2 lt s3) s1 1 else s1
0 set on less than (SLT)
The 32-bit ALU
48
32-bit ALU
  • Keep in mind the following
  • (A - B) is the same as A (-B)
  • 2s Complement negate Take the inverse of every
    bit and add 1
  • Bit-wise inverse of B is !B
  • A - B A (-B) A (!B 1) A !B 1

Binvert provides the negation
What about the 1 ?
49
32-bit ALU
Setting CarryIn0 1 provides the 1 for the
32-bit adder.
50
32-bit ALU slt
  • slt instruction
  • If ( altb) result 1, else result 0
  • If ( a-b lt 0 ) result 1, else result 0
  • Do a subtract
  • use sign bit
  • route to bit 0 of result
  • all other bits zero

51
32-bit ALU Special conditions
Overflow Detection Logic
  • Carry into MSB ! Carry out of MSB
  • For a N-bit ALU Overflow CarryInN - 1 XOR
    CarryOutN - 1

CarryIn0
A0
1-bit ALU
Result0
X
Y
X XOR Y
B0
0
0
0
CarryOut0
0
1
1
1
0
1
1
1
0
CarryIn2
A2
1-bit ALU
Result2
B2
CarryIn3
Overflow
A3
1-bit ALU
Result3
B3
CarryOut3
52
32-bit ALU Special conditions
  • Thus MSB block has special logic to generate
  • Set line (sign bit)
  • Overflow line

53
32-bit ALU Special conditions
Zero Detection Logic
  • Zero Detection Logic is just one BIG NOR gate
  • Any non-zero input to the NOR gate will cause its
    output to be zero

CarryIn0
Zero
54
32 bit ALU
  • Notice control lines000 and001 or010
    add110 subtract111 slt
  • zero is a 1 when the result is zero!

But what about performance ?
55
32 bit ALU
  • We can build an ALU to support the MIPS
    instruction set
  • key idea use multiplexor to select the output
    we want
  • we can efficiently perform subtraction using
    twos complement
  • we can replicate a 1-bit ALU to produce a 32-bit
    ALU
  • Important points about hardware
  • all of the gates are always working
  • the speed of a gate is affected by the number of
    inputs to the gate
  • the speed of a circuit is affected by the number
    of gates in series (on the critical path or
    the deepest level of logic)
  • Our primary focus comprehension, however,
  • Clever changes to organization can improve
    performance (similar to using better algorithms
    in software)
  • well look at two examples for addition and
    multiplication

56
Performance
Ideal (CS) versus Reality (EE)
  • When input 0 -gt 1, output 1 -gt 0 but NOT
    instantly
  • Output goes 1 -gt 0 output voltage goes from Vdd
    (5v) to 0v
  • When input 1 -gt 0, output 0 -gt 1 but NOT
    instantly
  • Output goes 0 -gt 1 output voltage goes from 0v
    to Vdd (5v)
  • Voltage does not like to change instantaneously

Voltage
Vout
1 gt Vdd
Vin
0 gt GND
Time
57
Performance
Series Connection
Vdd
V1
Vin
G1
G2
G1
G2
C1
Time
  • Total Propagation Delay Sum of individual
    delays d1 d2
  • Capacitance C1 has two components
  • Capacitance of the wire connecting the two gates
  • Input capacitance of the second inverter

58
Performance Calculating Delays
Vdd
Vdd
V1
Vin
V2
V2
V1
Vin
G1
G2
C1
V3
Vdd
V3
G3
  • Sum delays along serial paths
  • Delay (Vin -gt V2) ! Delay (Vin -gt V3)
  • Delay (Vin -gt V2) Delay (Vin -gt V1) Delay (V1
    -gt V2)
  • Delay (Vin -gt V3) Delay (Vin -gt V1) Delay (V1
    -gt V3)
  • Critical Path The longest among the N parallel
    paths
  • C1 Wire C Cin of Gate 2 Cin of Gate 3

59
Performance Storage elements
  • Storage element D flip flop with negative edge
    triggered

Clk
Setup
Hold
D
Q
D
Dont Care
Dont Care
Clock-to-Q
Unknown
Q
  • Setup Time Input must be stable BEFORE the
    trigger clock edge
  • Hold Time Input must REMAIN stable after the
    trigger clock edge
  • Clock-to-Q time
  • Output cannot change instantaneously at the
    trigger clock edge
  • Similar to delay in logic gates, two components
  • Internal Clock-to-Q
  • Load dependent Clock-to-Q

60
Performance Synchronous logic
  • All storage elements are clocked by the same
    clock edge
  • The combination logic blocks
  • Inputs are updated at each clock tick
  • All outputs MUST be stable before the next clock
    tick

61
Performance Critical Path
Clk
  • Critical path the slowest path between any two
    storage devices
  • Cycle time is a function of the critical path
  • must be greater than
  • Clock-to-Q Longest Path through the Combination
    Logic Setup

62
Clock Skew
Clk1
Clock Skew
Clk2
  • The worst case scenario for cycle time
    consideration
  • The input register sees CLK1
  • The output register sees CLK2
  • Cycle Time
Write a Comment
User Comments (0)
About PowerShow.com