Title: Performance, ALUs and such like
1  Performance, ALUs and such like
- The good news: no quiz today!
- Homework 1 is on the net now, and so are the slides from the previous class.
- Home page is www.cs.ucsd.edu/tsoni/cse141
- Finals will be on the last day of class, no special time slot.
- Add-drops will be handled at the break.
- Today: Chapters 2 and 4 of the text.
2  The Story so far
- Computer organization: the concept of abstraction
- Instruction Set Architectures: definition, types, examples
- Instruction formats: operands, addressing modes
- Operations: load, store, arithmetic, logical
- Control instructions: branch, jump, procedures
- Stacks
Basically, we learnt about Instruction Set Architectures.
3  MIPS Software Register Conventions
Number  Name       Usage
0       $zero      constant 0
1       $at        reserved for assembler
2       $v0        expression evaluation and function results
3       $v1
4-7     $a0-$a3    arguments
8-15    $t0-$t7    temporaries: caller saves (callee can clobber)
16-23   $s0-$s7    callee saves (caller can clobber)
24-25   $t8-$t9    temporaries (cont'd)
26-27   $k0-$k1    reserved for OS kernel
28      $gp        pointer to global area
29      $sp        stack pointer
30      $fp        frame pointer
31      $ra        return address (HW)
4  Example: Swap()
swap(int v[], int k)
{ int temp;
  temp = v[k];
  v[k] = v[k+1];
  v[k+1] = temp;
}
- Can we figure out the code?
swap:               # $4 = v, $5 = k
  muli $2, $5, 4    # $2 = k*4
  add  $2, $4, $2   # $2 = v + (4*k)
  lw   $15, 0($2)   # $15 = temp = ($2+0) = v[k]
  lw   $16, 4($2)   # $16 = ($2+4) = v[k+1]
  sw   $16, 0($2)   # v[k] = $16 = v[k+1]
  sw   $15, 4($2)   # v[k+1] = $15 = temp
  jr   $31          # return
5  Example: Leaf_procedure()
int PairDiff(int a, int b, int c, int d)
{ int temp;
  temp = (a+b) - (c+d);
  return temp;
}
Assume the caller puts a, b, c, d in $a0-$a3 and wants the result in $v0.
PairDiff:
  sub $sp, $sp, 12    # make space for 3 temp locations
  sw  $t1, 8($sp)     # save $t1 (optional if MIPS convention)
  sw  $t0, 4($sp)     # save $t0 (optional if MIPS convention)
  sw  $s0, 0($sp)     # save $s0
  add $t0, $a0, $a1   # $t0 = a+b
  add $t1, $a2, $a3   # $t1 = c+d
  sub $s0, $t0, $t1   # $s0 = $t0 - $t1
  add $v0, $s0, $zero # store return value in $v0
  lw  $s0, 0($sp)     # restore registers
  lw  $t0, 4($sp)     # (optional if MIPS convention)
  lw  $t1, 8($sp)     # (optional if MIPS convention)
  add $sp, $sp, 12    # pop the stack
  jr  $ra             # the actual return to the calling routine
6  Example: Nested_procedure()
int fact(int n)
{ if (n < 1) return 1;
  else return n * fact(n-1);
}
- What about nested procedures? $ra = ??
- Recursive procedures?
Assume $a0 = n.
fact:
  sub  $sp, $sp, 8     # make space for 2 temp locations
  sw   $ra, 4($sp)     # save return address
  sw   $a0, 0($sp)     # save argument n
  slti $t0, $a0, 1     # test for n < 1
  beq  $t0, $zero, L1  # if (n >= 1) goto L1
  add  $v0, $zero, 1   # $v0 = 1            (n < 1 case)
  add  $sp, $sp, 8     # pop the stack
  jr   $ra             # return
L1:
  sub  $a0, $a0, 1     # n--                (n >= 1 case)
  jal  fact            # call fact again
  lw   $a0, 0($sp)     # fact() returns here; restore n
  lw   $ra, 4($sp)     # restore return address
  add  $sp, $sp, 8     # pop the stack
  mul  $v0, $a0, $v0   # $v0 = n * fact(n-1)
  jr   $ra             # return to caller
7Comparing Instruction Set Architectures
Design-time metrics Can it be implemented, in
how long, at what cost? Can it be programmed?
Ease of compilation? Static Metrics How many
bytes does the program occupy in memory? Dynamic
Metrics How many instructions are
executed? How many bytes does the processor
fetch to execute the program? How many clocks
are required per instruction? How "lean" a
clock is practical? Best Metric Time to
execute the program!
- This depends on
- instruction set,
- processor organization, and
- compilation techniques.
8  Computer Performance
Measuring and Discussing Computer System Performance
or: "My computer is faster than your computer"
9  SPEC Performance
Since the RISC introduction, performance improves 50% per year (2x every 1.5 years).
But what is performance??
10  Performance depends on the eyes of the beholder
- Purchasing perspective
- given a collection of machines, which has the
- best performance?
- least cost?
- best performance / cost?
- Design perspective
- faced with design options, which has the
- best performance improvement?
- least cost?
- best performance / cost?
- Both require
- a basis for comparison
- a metric for evaluation
- Our goal is to understand the cost-performance implications of architectural choices.
11  Two ideas
- How much faster is the Concorde compared to the 747?
- How much bigger is the 747 than the Douglas DC-8?
Which has higher performance?
Time to do the task (execution time): execution time, response time, latency.
Tasks per day, hour, week, sec, ns, ... (performance): throughput, bandwidth.
Response time and throughput are often in opposition.
12  Two mechanisms of getting to the bay-area
Vehicle     Speed     Time to Bay Area   Passengers   Throughput (pmph)
Ferrari     160 mph   3.1 hours          2            320
Greyhound   65 mph    7.7 hours          60           3900
Time to do the task from start to finish: execution time, response time, latency.
Tasks per unit time: throughput, bandwidth (mostly used for data movement).
Response time and throughput are often in opposition.
13  Relative performance?
- can be confusing
- A runs in 12 seconds
- B runs in 20 seconds
- A/B = 0.6, so A is 40% faster, or 1.4X faster, or B is 40% slower
- B/A = 1.67, so A is 67% faster, or 1.67X faster, or B is 67% slower
- needs a precise definition
14  Relative performance?
- Performance is in units of things-per-second
- bigger is better
- If we are primarily concerned with response time:
- performance(X) = 1 / execution_time(X)
- "X is n times faster than Y" means
- n = Performance(X) / Performance(Y) = Execution Time(Y) / Execution Time(X)
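The definition above can be checked with a small sketch (Python here purely for illustration; the 12 s / 20 s figures echo the A-versus-B example on the previous slide):

```python
def performance(exec_time):
    """performance(X) = 1 / execution_time(X)"""
    return 1.0 / exec_time

def times_faster(time_x, time_y):
    """'X is n times faster than Y': n = perf(X)/perf(Y) = time(Y)/time(X)."""
    return performance(time_x) / performance(time_y)

# A runs in 12 s, B runs in 20 s
n = times_faster(12, 20)   # ~1.67: A is 1.67X faster (67% faster) than B
```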
15  How many times?
- Time of Concorde vs. Boeing 747?
- Concorde is 1350 mph / 610 mph = 2.2 times faster
- = 6.5 hours / 3 hours
- Throughput of Concorde vs. Boeing 747?
- Concorde is 178,200 pmph / 286,700 pmph = 0.62 times faster
- Boeing is 286,700 pmph / 178,200 pmph = 1.6 times faster
- Boeing is 1.6 times (60%) faster in terms of throughput
- Concorde is 2.2 times (120%) faster in terms of flying time
- We will focus primarily on execution time for a single job.
16  Some grammar
- "times faster than" (or "times as fast as")
- there's a multiplicative factor relating the quantities
- X was 3 times faster than Y: speed(X) = 3 x speed(Y)
- "percent faster than"
- implies an additive relationship
- X was 25% faster than Y: speed(X) = (125/100) x speed(Y)
- "percent slower than"
- implies subtraction
- X was 5% slower than Y: speed(X) = (1 - 5/100) x speed(Y)
- 100% slower means it doesn't move at all!
- "times slower than" or "times as slow as"
- is awkward
- X was 3 times slower than Y means speed(X) = 1/3 x speed(Y)
17  Avoid Linguistic Confusion
- X is r times faster than Y: speed(X) = r x speed(Y)
- so speed(Y) = 1/r x speed(X)
- so Y is r times slower than X
- X is r times faster than Y, Y is s times faster than Z
- so speed(X) = r x speed(Y) = rs x speed(Z)
- so X is rs times faster than Z
- (Cannot do this with % numbers!)
- Easiest way to avoid confusion:
- convert "% faster" to "times faster"
- then do the calculation and convert back if needed.
- Example: change "25% faster" to "5/4 times faster".
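A quick sketch (Python, illustrative) of why the conversion matters: "% faster" claims compose through multiplicative factors, not by adding percentages:

```python
def pct_faster_to_factor(pct):
    """Convert '25% faster' to a multiplicative factor (5/4 times faster)."""
    return 1 + pct / 100

# X is 25% faster than Y, and Y is 25% faster than Z:
factor = pct_faster_to_factor(25) * pct_faster_to_factor(25)
# factor = 1.5625, so X is 56.25% faster than Z -- not 50%
```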
18  Which time anyways?
> time foo
... foo's results ...
90.7u 12.9s 2:39 65%
>
- user CPU time? (time the CPU spends running your code)
- total CPU time (user + kernel)? (includes op. sys. code)
- wallclock time? (total elapsed time)
- includes time spent waiting for I/O, other users, ...
- Answer: it depends ...
- For measuring processor speed, we can use total CPU time.
- If there is no I/O or interrupts, wallclock may be better:
- more precise (microseconds rather than 1/100 sec)
- can measure individual sections of code
19  Metrics of Performance
[Diagram: metrics at each level of the system stack]
- Application: answers per month, useful operations per second
- Programming language / Compiler
- ISA: (millions of) instructions per second (MIPS); (millions of) floating-point operations per second (MFLOP/s)
- Datapath / Control: megabytes per second
- Function units: cycles per second (clock rate)
- Transistors, wires, pins
Each metric has a place and a purpose, and each can be misused.
20Levels of benchmarking
Cons
Pros
- very specific
- non-portable
- difficult to run, or
- measure
- hard to identify cause
Actual Target Workload
- portable
- widely used
- improvements useful in reality
Full Application Benchmarks
Small Kernel Benchmarks
- easy to run, early in design cycle
- peak may be a long way from application
performance
- identify peak capability and potential
bottlenecks
Microbenchmarks
21  Cycle Time
- Instead of reporting execution time in seconds, we often use cycles.
- Clock ticks indicate when to start activities (one abstraction).
- cycle time = time between ticks = seconds per cycle
- clock rate (frequency) = cycles per second (1 Hz = 1 cycle/sec)
- A 200 MHz clock has a cycle time of 1/(200 x 10^6) = 5 ns.
22  Cycle Time
seconds/program = (instructions/program) x (cycles/instruction) x (seconds/cycle)
- Improve performance = reduce execution time
- Reduce instruction count (programmer, compiler)
- Reduce cycles per instruction (ISA, machine designer)
- Reduce clock cycle time (hardware designer, physicist)
23  Performance Variation
CPU Execution Time = Instruction Count x CPI x Clock Cycle Time
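The product form of the equation can be exercised directly (a Python sketch; the example numbers are made up for illustration):

```python
def cpu_time(instruction_count, cpi, clock_cycle_time):
    """CPU Execution Time = Instruction Count x CPI x Clock Cycle Time."""
    return instruction_count * cpi * clock_cycle_time

# e.g. 10^9 instructions at CPI 2.2 on a 200 MHz clock (5 ns cycle)
t = cpu_time(1e9, 2.2, 5e-9)   # 11.0 seconds
```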
24  Amdahl's Law
- Execution Time After Improvement = Execution Time Unaffected + (Execution Time Affected / Amount of Improvement)
- Example: "Suppose a program runs in 100 seconds on a machine, with multiply responsible for 80 seconds of this time. How much do we have to improve the speed of multiplication if we want the program to run 4 times faster?" How about making it 5 times faster?
- Principle: Make the common case fast.
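Working the multiply example through the formula (a Python sketch of Amdahl's Law):

```python
def time_after_improvement(time_unaffected, time_affected, improvement):
    """Amdahl's Law: new time = unaffected time + affected time / improvement."""
    return time_unaffected + time_affected / improvement

# 100 s total, 80 s of multiply, 20 s unaffected.
# To run 4x faster the total must drop to 25 s, so 80/n = 5: n = 16.
assert time_after_improvement(20, 80, 16) == 25
# 5x faster needs a total of 20 s, but the unaffected 20 s alone already
# takes that long -- no multiply speedup can get there.
```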
25  MIPS, MFLOPS etc.
- MIPS - million instructions per second
- MIPS = number of instructions executed in program / (execution time in seconds x 10^6) = clock rate / (CPI x 10^6)
- MFLOPS - million floating point operations per second
- MFLOPS = number of floating point operations executed in program / (execution time in seconds x 10^6)
- program-independent?
- deceptive?
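A sketch of why the MIPS rating can deceive (Python; the machines and numbers are hypothetical): the rating depends on CPI but not on how much work each instruction does, so a machine executing more, simpler instructions can post a higher MIPS yet take longer overall:

```python
def mips_rating(clock_hz, cpi):
    """MIPS = clock rate / (CPI x 10^6)."""
    return clock_hz / (cpi * 1e6)

def exec_time(instr_count, cpi, clock_hz):
    """Execution time = instruction count x CPI / clock rate."""
    return instr_count * cpi / clock_hz

# same 200 MHz clock; machine B needs 1.5x the instructions at lower CPI
a = exec_time(1.0e9, 2.2, 200e6)   # 11.0 s at ~90.9 MIPS
b = exec_time(1.5e9, 1.6, 200e6)   # 12.0 s at 125 MIPS: higher MIPS, slower!
```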
26Example RISC Processor
Base Machine (Reg / Reg) Op Freq Cycles CPI(i)
Time ALU 50 1 .5 23 Load 20 5
1.0 45 Store 10 3 .3 14 Branch 20 2
.4 18 2.2
Typical Mix
How much faster would the machine be if a better
data cache reduced the average load time to 2
cycles? How does this compare with using branch
prediction to shave a cycle off the branch
time? What if two ALU instructions could be
executed at once?
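The first two questions can be answered by recomputing the weighted CPI (a Python sketch using the mix above):

```python
# instruction mix from the slide: operation -> (frequency, cycles)
mix = {"ALU": (0.50, 1), "Load": (0.20, 5), "Store": (0.10, 3), "Branch": (0.20, 2)}

def avg_cpi(mix):
    """Weighted CPI = sum of frequency x cycles over the instruction mix."""
    return sum(freq * cycles for freq, cycles in mix.values())

base = avg_cpi(mix)                                     # 2.2
cache = base / avg_cpi({**mix, "Load": (0.20, 2)})      # loads in 2 cycles: 2.2/1.6 = 1.375x
branch = base / avg_cpi({**mix, "Branch": (0.20, 1)})   # 1-cycle branches: 2.2/2.0 = 1.1x
```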
27  SPEC
Which programs?
- peak throughput measures (simple programs)
- synthetic benchmarks (whetstone, dhrystone, ...)
- real applications
- SPEC (best of both worlds, but with problems of their own)
- System Performance Evaluation Cooperative
- provides a common set of real applications along with strict guidelines for how to run them
- provides a relatively unbiased means to compare machines
28  SPEC89
- Compiler enhancements and performance
29  SPEC CPU2000 Suite
- SPECint2000
- gzip and bzip2: compression
- gcc: compiler, 205K lines of messy code!
- crafty: chess program
- parser: word processing
- vortex: object-oriented database
- perlbmk: PERL interpreter
- eon: computer visualization
- vpr, twolf: CAD tools for VLSI
- mcf, gap: combinatorial programs
- SPECfp2000: 10 Fortran, 3 C programs
- scientific application programs (physics, chemistry, image processing, number theory, ...)
30  Performance is always misleading
- Performance is specific to a particular program or programs.
- Total execution time is a consistent summary of performance.
- For a given architecture, performance increases come from:
- increases in clock rate (without adverse CPI effects)
- improvements in processor organization that lower CPI
- compiler enhancements that lower CPI and/or instruction count
- Pitfall: expecting improvement in one aspect of a machine's performance to affect the total performance.
- You should not always believe everything you read! Read carefully!
31  Computer Arithmetic
What do all those bits mean now?
bits (011011011100010....01) can be interpreted as:
- instruction: R-format, I-format, ...
- data:
  - number: integer (signed, unsigned), floating point (single precision, double precision), ...
  - text: chars, ...
32  Computer Arithmetic
- How do you represent
- negative numbers?
- fractions?
- really large numbers?
- really small numbers?
- How do you
- do arithmetic?
- identify errors (e.g. overflow)?
- What is an ALU and what does it look like?
- ALU = arithmetic logic unit
33  Big Endian vs. Little Endian
Big Endian: IBM, Motorola, HP, Sun
Little Endian: DEC, Intel
- Some processors (e.g. PowerPC) provide both.
- If you can figure out how to switch modes, or get the compiler to issue byte-reversed loads and stores.
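The difference is just byte order in memory; Python's struct module makes it visible (a sketch, not tied to any particular processor):

```python
import struct

# the 32-bit value 0x01020304 laid out in each byte order
little = struct.pack("<I", 0x01020304)   # b'\x04\x03\x02\x01' (Intel, DEC)
big    = struct.pack(">I", 0x01020304)   # b'\x01\x02\x03\x04' (IBM, Sun)

# a "byte-reversed load" is just unpacking with the other byte order
assert struct.unpack(">I", little)[0] == 0x04030201
```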
34  Binary Numbers: An Introduction
Consider a 4-bit binary number:
Decimal  Binary     Decimal  Binary
0        0000       4        0100
1        0001       5        0101
2        0010       6        0110
3        0011       7        0111
Examples of binary arithmetic: 3 + 2 = 5 and 3 + 3 = 6.
    0011  (3)          0011  (3)
  + 0010  (2)        + 0011  (3)
  ---------          ---------
    0101  (5)          0110  (6)
35  Negative Numbers: Some options
- Sign Magnitude -- MSB is the sign bit, the rest stays the same
- -1 = 1001
- -5 = 1101
- One's complement -- flip all bits to negate
- -1 = 1110
- -5 = 1010
- We would like a number system that provides
- obvious representation of 0, 1, 2, ...
- uses an adder for addition
- a single value of 0
- equal coverage of positive and negative numbers
- easy detection of sign
- easy negation
36  Negative Numbers: two's complement
- Positive numbers: normal binary representation
- Negative numbers: flip the bits (0 <-> 1), then add 1
Decimal:             -8   -7   -6   -5   -4   -3   -2   -1    0    1    2    3    4    5    6    7
Two's Complement:  1000 1001 1010 1011 1100 1101 1110 1111 0000 0001 0010 0011 0100 0101 0110 0111
Smallest 4-bit number: -8
Biggest 4-bit number: 7
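The flip-and-add-one rule is easy to check mechanically (a Python sketch for the 4-bit case):

```python
def twos_complement(x, bits=4):
    """Encode x in two's complement: masking does flip-bits-and-add-1 for us."""
    assert -(1 << (bits - 1)) <= x < (1 << (bits - 1))
    return format(x & ((1 << bits) - 1), "0{}b".format(bits))

def negate(bit_string):
    """Flip the bits, then add 1 (modulo the word size)."""
    bits = len(bit_string)
    flipped = int(bit_string, 2) ^ ((1 << bits) - 1)
    return format((flipped + 1) & ((1 << bits) - 1), "0{}b".format(bits))

assert twos_complement(-5) == "1011"
assert negate("0101") == "1011"       # -(+5) = -5
assert twos_complement(-8) == "1000"  # smallest 4-bit number
```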
37  Two's complement arithmetic
Decimal  Binary     Decimal  Binary
0        0000       -1       1111
1        0001       -2       1110
2        0010       -3       1101
3        0011       -4       1100
4        0100       -5       1011
5        0101       -6       1010
6        0110       -7       1001
7        0111       -8       1000
- Examples: 7 - 6 = 7 + (-6) = 1 and 3 - 5 = 3 + (-5) = -2
    0111  ( 7)         0011  ( 3)
  + 1010  (-6)       + 1011  (-5)
  ---------          ---------
    0001  ( 1)         1110  (-2)
(In the first example, the carry out of the MSB is discarded.)
Uses a simple adder for + and - numbers.
38  Things to keep in mind
- Negation
- flip the bits and add 1 (works for + and -)
- might cause overflow
- Extend the sign when loading into a larger register
- 3 = 0011, 00000011, 0000000000000011
- -3 = 1101, 11111101, 1111111111111101
- Overflow detection
- (need to raise an exception when the answer can't be represented)
-   0101 = 5
- + 0110 = 6
-   1011 = -5 ??!!!
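Sign extension can be sketched the same way (Python; widths chosen to match the examples above):

```python
def sign_extend(bit_string, to_bits):
    """Replicate the sign bit when loading into a larger register."""
    return bit_string[0] * (to_bits - len(bit_string)) + bit_string

assert sign_extend("0011", 8)  == "00000011"            # 3
assert sign_extend("1101", 8)  == "11111101"            # -3
assert sign_extend("1101", 16) == "1111111111111101"
```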
39  Overflow detection again
No overflow:
    0010  ( 2)         1100  (-4)
  + 0011  ( 3)       + 1110  (-2)
  ---------          ---------
    0101  ( 5)         1010  (-6)
Overflow:
    0111  ( 7)         1100  (-4)
  + 0011  ( 3)       + 1011  (-5)
  ---------          ---------
    1010  (-6)         0111  ( 7)
So how do we detect overflow?
Carry into MSB != Carry out of MSB
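The carry rule can be checked with a small 4-bit adder model (a Python sketch):

```python
def add_4bit(a, b):
    """Add 4-bit two's-complement values; overflow when carry into MSB != carry out."""
    ua, ub = a & 0xF, b & 0xF
    carry_in_msb = ((ua & 0x7) + (ub & 0x7)) >> 3   # carry into bit 3
    total = ua + ub
    carry_out_msb = total >> 4                      # carry out of bit 3
    result = total & 0xF
    if result & 0x8:                                # interpret result as signed
        result -= 16
    return result, carry_in_msb != carry_out_msb

assert add_4bit(2, 3)   == (5, False)
assert add_4bit(-4, -2) == (-6, False)
assert add_4bit(7, 3)   == (-6, True)    # overflow
assert add_4bit(-4, -5) == (7, True)     # overflow
```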
40  Execution: the heart of it all
Instruction Fetch -> Instruction Decode -> Operand Fetch -> Execute -> Result Store -> Next Instruction
41  A Basic ALU
- ALU Control Lines (ALUop): Function
- 000: And
- 001: Or
- 010: Add
- 110: Subtract
- 111: Set-on-less-than
General idea: build for 1-bit numbers and then extend to n bits!
42  Some basics of digital logic
43  1-bit ALU
- ALU Control Lines (ALUop): Function
- 000: And
- 001: Or
- 010: Add
But how do we make the adder?
44  1-bit Full Adder
- This is also called a (3, 2) adder
- Half Adder: no CarryIn or CarryOut
- Truth Table
45  1-bit Full Adder: CarryOut
CarryOut = (!A & B & CarryIn) | (A & !B & CarryIn) | (A & B & !CarryIn) | (A & B & CarryIn)
         = (B & CarryIn) | (A & CarryIn) | (A & B)
46  1-bit Full Adder: Sum
Sum = (!A & !B & CarryIn) | (!A & B & !CarryIn) | (A & !B & !CarryIn) | (A & B & CarryIn)
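The two equations translate directly into code (a Python sketch; & is AND, | is OR, ^ is XOR):

```python
def full_adder(a, b, carry_in):
    """(3,2) adder: two inputs plus carry-in give sum and carry-out."""
    s = a ^ b ^ carry_in                                   # 1 when an odd number of inputs are 1
    carry_out = (b & carry_in) | (a & carry_in) | (a & b)  # simplified CarryOut equation
    return s, carry_out

# exhaustively matches the truth table
for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            s, cout = full_adder(a, b, c)
            assert a + b + c == 2 * cout + s
```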
47  32-bit ALU
- (ALUop): Function
- And
- Or
- Add
The 1-bit ALU
What about other operations?
sub s1, s2, s3   # s1 = s2 - s3                         (subtraction)
slt s1, s2, s3   # if (s2 < s3) s1 = 1; else s1 = 0     (set on less than, SLT)
The 32-bit ALU
48  32-bit ALU
- Keep in mind the following:
- (A - B) is the same as A + (-B)
- 2's complement negate: take the inverse of every bit and add 1
- the bit-wise inverse of B is !B
- A - B = A + (-B) = A + (!B + 1) = A + !B + 1
Binvert provides the negation.
What about the +1?
49  32-bit ALU
Setting CarryIn0 = 1 provides the +1 for the 32-bit adder.
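Putting the pieces together: a ripple-carry sketch in Python where Binvert flips B and also supplies CarryIn0 = 1, so the same adder computes both A + B and A - B:

```python
MASK32 = 0xFFFFFFFF

def alu32_add_sub(a, b, binvert):
    """Ripple a 1-bit full adder through 32 bits; binvert=1 computes a - b."""
    carry = binvert                      # CarryIn0 = 1 when subtracting
    result = 0
    for i in range(32):
        ai = (a >> i) & 1
        bi = ((b >> i) & 1) ^ binvert    # !B when subtracting
        s = ai ^ bi ^ carry
        carry = (ai & bi) | (ai & carry) | (bi & carry)
        result |= s << i
    return result & MASK32

assert alu32_add_sub(7, 5, 0) == 12   # add
assert alu32_add_sub(7, 5, 1) == 2    # subtract: A + !B + 1
```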
50  32-bit ALU: slt
- slt instruction
- if (a < b) result = 1, else result = 0
- i.e., if (a - b < 0) result = 1, else result = 0
- Do a subtract
- use the sign bit
- route it to bit 0 of the result
- all other bits zero
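A sketch of the slt datapath idea in Python: do the subtract, then route the sign bit of the difference to bit 0 (ignoring the overflow corner cases, which the real MSB logic handles):

```python
def slt(a, b):
    """Set-on-less-than via subtraction: result bit 0 is the sign of a - b."""
    diff = (a - b) & 0xFFFFFFFF
    return diff >> 31              # sign bit routed to bit 0, all other bits zero

assert slt(3, 5) == 1
assert slt(5, 3) == 0
assert slt(4, 4) == 0
```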
51  32-bit ALU: Special conditions
Overflow Detection Logic
- Carry into MSB != Carry out of MSB
- For an N-bit ALU: Overflow = CarryIn[N-1] XOR CarryOut[N-1]
XOR truth table:
X  Y  X XOR Y
0  0  0
0  1  1
1  0  1
1  1  0
[Diagram: chain of 1-bit ALUs (A0/B0/Result0 through A3/B3/Result3), each stage's CarryOut feeding the next stage's CarryIn; the MSB stage XORs CarryIn3 with CarryOut3 to produce Overflow.]
52  32-bit ALU: Special conditions
- Thus the MSB block has special logic to generate
- the Set line (sign bit)
- the Overflow line
53  32-bit ALU: Special conditions
Zero Detection Logic
- Zero detection logic is just one BIG NOR gate over all the result bits.
- Any non-zero input to the NOR gate will cause its output to be zero.
54  32-bit ALU
- Notice the control lines: 000 = and, 001 = or, 010 = add, 110 = subtract, 111 = slt
- zero is 1 when the result is zero!
But what about performance?
55  32-bit ALU
- We can build an ALU to support the MIPS instruction set
- key idea: use a multiplexor to select the output we want
- we can efficiently perform subtraction using two's complement
- we can replicate a 1-bit ALU to produce a 32-bit ALU
- Important points about hardware
- all of the gates are always working
- the speed of a gate is affected by the number of inputs to the gate
- the speed of a circuit is affected by the number of gates in series (on the "critical path", or the deepest level of logic)
- Our primary focus is comprehension; however,
- clever changes to organization can improve performance (similar to using better algorithms in software)
- we'll look at two examples, for addition and multiplication
56  Performance
Ideal (CS) versus Reality (EE)
- When the input goes 0 -> 1, the output goes 1 -> 0, but NOT instantly
- the output voltage goes from Vdd (5v) to 0v
- When the input goes 1 -> 0, the output goes 0 -> 1, but NOT instantly
- the output voltage goes from 0v to Vdd (5v)
- Voltage does not like to change instantaneously
[Plot: Vin and Vout versus time; logic 1 = Vdd, logic 0 = GND]
57  Performance: Series Connection
[Diagram: two gates G1 and G2 in series between Vin and Vdd, with capacitance C1 on node V1 between them; waveform of V1 versus time]
- Total propagation delay = sum of the individual delays = d1 + d2
- Capacitance C1 has two components:
- the capacitance of the wire connecting the two gates
- the input capacitance of the second inverter
58  Performance: Calculating Delays
[Diagram: Vin drives gate G1, whose output V1 fans out to gates G2 (output V2) and G3 (output V3); node V1 carries capacitance C1]
- Sum delays along serial paths
- Delay (Vin -> V2) != Delay (Vin -> V3)
- Delay (Vin -> V2) = Delay (Vin -> V1) + Delay (V1 -> V2)
- Delay (Vin -> V3) = Delay (Vin -> V1) + Delay (V1 -> V3)
- Critical path: the longest among the N parallel paths
- C1 = wire C + Cin of Gate 2 + Cin of Gate 3
59  Performance: Storage elements
- Storage element: D flip-flop, negative edge triggered
[Waveform: Clk, D, and Q versus time, showing the setup and hold windows around the triggering clock edge and the Clock-to-Q delay before Q leaves its unknown state; D is "don't care" outside the setup/hold window]
- Setup time: the input must be stable BEFORE the trigger clock edge
- Hold time: the input must REMAIN stable after the trigger clock edge
- Clock-to-Q time:
- the output cannot change instantaneously at the trigger clock edge
- similar to delay in logic gates, it has two components:
- internal Clock-to-Q
- load-dependent Clock-to-Q
60  Performance: Synchronous logic
- All storage elements are clocked by the same clock edge
- The combinational logic blocks:
- inputs are updated at each clock tick
- all outputs MUST be stable before the next clock tick
61  Performance: Critical Path
- Critical path: the slowest path between any two storage devices
- Cycle time is a function of the critical path
- it must be greater than:
- Clock-to-Q + Longest Path through the Combinational Logic + Setup
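The cycle-time constraint is easy to turn into arithmetic (Python; the delay numbers are hypothetical):

```python
def min_cycle_time(clock_to_q, longest_logic_path, setup):
    """Cycle time must exceed Clock-to-Q + longest combinational delay + setup."""
    return clock_to_q + longest_logic_path + setup

# hypothetical delays in ns
t = min_cycle_time(0.5, 3.0, 0.3)   # 3.8 ns
max_clock_mhz = 1e3 / t             # ~263 MHz
```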
62  Clock Skew
- The worst case scenario for cycle time consideration:
- the input register sees CLK1
- the output register sees CLK2, skewed later by the clock skew
- Cycle Time >= Clock-to-Q + Longest Combinational Delay + Setup + Clock Skew