Title: Performance, ALUs and such like
1  Performance, ALUs and such like
- The good news: no quiz today!
- Homework 1 is on the net now, and so are the slides from the previous class.
- Home page is www.cs.ucsd.edu/tsoni/cse141
- Finals will be on the last day of class, no special time slot.
- Add-drops will be handled at the break.
- Today: Chapters 2 and 4 of the text.
2  The Story so far
- Computer organization: the concept of abstraction
- Instruction Set Architectures: definition, types, examples
- Instruction formats: operands, addressing modes
- Operations: load, store, arithmetic, logical
- Control instructions: branch, jump, procedures
- Stacks
Basically, we learnt about Instruction Set Architectures.
3  MIPS Software Register Conventions
Number  Name       Usage
0       $zero      constant 0
1       $at        reserved for assembler
2       $v0        expression evaluation and function results
3       $v1
4-7     $a0-$a3    arguments
8-15    $t0-$t7    temporaries: caller saves (callee can clobber)
16-23   $s0-$s7    callee saves (caller can clobber)
24-25   $t8-$t9    temporaries (cont'd)
26-27   $k0-$k1    reserved for OS kernel
28      $gp        pointer to global area
29      $sp        stack pointer
30      $fp        frame pointer
31      $ra        return address (HW)
4  Example: Swap()
swap(int v[], int k)
{ int temp;
  temp = v[k];
  v[k] = v[k+1];
  v[k+1] = temp;
}
- Can we figure out the code?
swap:               # $4 = v, $5 = k
  muli $2, $5, 4    # $2 = k*4
  add  $2, $4, $2   # $2 = v + (4*k)
  lw   $15, 0($2)   # $15 = temp = ($2+0) = v[k]
  lw   $16, 4($2)   # $16 = ($2+4) = v[k+1]
  sw   $16, 0($2)   # v[k] = $16 = v[k+1]
  sw   $15, 4($2)   # v[k+1] = $15 = temp
  jr   $31          # return
5  Example: Leaf_procedure()
int PairDiff(int a, int b, int c, int d)
{ int temp;
  temp = (a+b) - (c+d);
  return temp;
}
Assume the caller puts a, b, c, d in $a0-$a3 and wants the result in $v0.
PairDiff:
  sub $sp, $sp, 12    # make space for 3 temp locations
  sw  $t1, 8($sp)     # save $t1 (optional if MIPS convention)
  sw  $t0, 4($sp)     # save $t0 (optional if MIPS convention)
  sw  $s0, 0($sp)     # save $s0
  add $t0, $a0, $a1   # $t0 = a+b
  add $t1, $a2, $a3   # $t1 = c+d
  sub $s0, $t0, $t1   # $s0 = $t0 - $t1
  add $v0, $s0, $zero # store return value in $v0
  lw  $s0, 0($sp)     # restore registers
  lw  $t0, 4($sp)     # (optional if MIPS convention)
  lw  $t1, 8($sp)     # (optional if MIPS convention)
  add $sp, $sp, 12    # pop the stack
  jr  $ra             # the actual return to the calling routine
6  Example: Nested_procedure()
int fact(int n)
{ if (n < 1) return 1;
  else return n * fact(n-1);
}
- What about nested procedures? $ra = ??
- Recursive procedures?
Assume $a0 = n.
fact:
  sub  $sp, $sp, 8     # make space for 2 temp locations
  sw   $ra, 4($sp)     # save return address
  sw   $a0, 0($sp)     # save argument n
  slti $t0, $a0, 1     # test for n < 1
  beq  $t0, $zero, L1  # if (n >= 1) goto L1
  add  $v0, $zero, 1   # $v0 = 1            (n < 1 case)
  add  $sp, $sp, 8     # pop the stack
  jr   $ra             # return
L1:
  sub  $a0, $a0, 1     # n--                (n >= 1 case)
  jal  fact            # call fact again
  lw   $a0, 0($sp)     # fact() returns here; restore n
  lw   $ra, 4($sp)     # restore return address
  add  $sp, $sp, 8     # pop the stack
  mul  $v0, $a0, $v0   # $v0 = n * fact(n-1)
  jr   $ra             # return to caller
7Comparing Instruction Set Architectures
Design-time metrics Can it be implemented, in
how long, at what cost? Can it be programmed?
Ease of compilation? Static Metrics How many
bytes does the program occupy in memory? Dynamic
Metrics How many instructions are
executed? How many bytes does the processor
fetch to execute the program? How many clocks
are required per instruction? How "lean" a
clock is practical? Best Metric Time to
execute the program!
- This depends on
- instruction set,
- processor organization, and
- compilation techniques.
8  Computer Performance
Measuring and Discussing Computer System Performance
or: "My computer is faster than your computer"
9  SPEC Performance
Since the RISC introduction, performance improves 50% per year (2x every 1.5 years).
But what is performance??
10  Performance depends on the eyes of the beholder
- Purchasing perspective
- given a collection of machines, which has the
- best performance?
- least cost?
- best performance / cost?
- Design perspective
- faced with design options, which has the
- best performance improvement?
- least cost?
- best performance / cost?
- Both require
- a basis for comparison
- a metric for evaluation
- Our goal is to understand the cost-performance implications of architectural choices.
11  Two ideas
- How much faster is the Concorde compared to the 747?
- How much bigger is the 747 than the Douglas DC-8?
Which has higher performance?
Time to do the task (execution time): execution time, response time, latency.
Tasks per day, hour, week, sec, ns, ... (performance): throughput, bandwidth.
Response time and throughput are often in opposition.
12  Two mechanisms of getting to the bay-area
Vehicle     Speed     Time to Bay Area   Passengers   Throughput (pmph)
Ferrari     160 mph   3.1 hours          2            320
Greyhound   65 mph    7.7 hours          60           3900
Time to do the task from start to finish: execution time, response time, latency.
Tasks per unit time: throughput, bandwidth (mostly used for data movement).
Response time and throughput are often in opposition.
13  Relative performance?
- can be confusing
- A runs in 12 seconds
- B runs in 20 seconds
- A/B = 0.6, so A is 40% faster, or 1.4X faster, or B is 40% slower
- B/A = 1.67, so A is 67% faster, or 1.67X faster, or B is 67% slower
- needs a precise definition
14  Relative performance?
- Performance is in units of things-per-second
- bigger is better
- If we are primarily concerned with response time:
- performance(X) = 1 / execution_time(X)
- "X is n times faster than Y" means
- n = Performance(X) / Performance(Y) = Execution Time(Y) / Execution Time(X)
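The definition above can be checked with a small sketch (Python here purely for illustration; the 12 s / 20 s figures echo the A-versus-B example on the previous slide):

```python
def performance(exec_time):
    """performance(X) = 1 / execution_time(X)"""
    return 1.0 / exec_time

def times_faster(time_x, time_y):
    """'X is n times faster than Y': n = perf(X)/perf(Y) = time(Y)/time(X)."""
    return performance(time_x) / performance(time_y)

# A runs in 12 s, B runs in 20 s
n = times_faster(12, 20)   # ~1.67: A is 1.67X faster (67% faster) than B
```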
15  How many times?
- Time of Concorde vs. Boeing 747?
- Concorde is 1350 mph / 610 mph = 2.2 times faster
- = 6.5 hours / 3 hours
- Throughput of Concorde vs. Boeing 747?
- Concorde is 178,200 pmph / 286,700 pmph = 0.62 times faster
- Boeing is 286,700 pmph / 178,200 pmph = 1.6 times faster
- Boeing is 1.6 times (60%) faster in terms of throughput
- Concorde is 2.2 times (120%) faster in terms of flying time
- We will focus primarily on execution time for a single job.
16  Some grammar
- "times faster than" (or "times as fast as")
- there's a multiplicative factor relating the quantities
- X was 3 times faster than Y: speed(X) = 3 x speed(Y)
- "percent faster than"
- implies an additive relationship
- X was 25% faster than Y: speed(X) = (125/100) x speed(Y)
- "percent slower than"
- implies subtraction
- X was 5% slower than Y: speed(X) = (1 - 5/100) x speed(Y)
- 100% slower means it doesn't move at all!
- "times slower than" or "times as slow as"
- is awkward
- X was 3 times slower than Y means speed(X) = 1/3 x speed(Y)
17  Avoid Linguistic Confusion
- X is r times faster than Y: speed(X) = r x speed(Y)
- so speed(Y) = 1/r x speed(X)
- so Y is r times slower than X
- X is r times faster than Y, Y is s times faster than Z
- so speed(X) = r x speed(Y) = rs x speed(Z)
- so X is rs times faster than Z
- (Cannot do this with % numbers!)
- Easiest way to avoid confusion:
- convert "% faster" to "times faster"
- then do the calculation and convert back if needed.
- Example: change "25% faster" to "5/4 times faster".
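A quick sketch (Python, illustrative) of why the conversion matters: "% faster" claims compose through multiplicative factors, not by adding percentages:

```python
def pct_faster_to_factor(pct):
    """Convert '25% faster' to a multiplicative factor (5/4 times faster)."""
    return 1 + pct / 100

# X is 25% faster than Y, and Y is 25% faster than Z:
factor = pct_faster_to_factor(25) * pct_faster_to_factor(25)
# factor = 1.5625, so X is 56.25% faster than Z -- not 50%
```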
18  Which time anyways?
> time foo
... foo's results ...
90.7u 12.9s 2:39 65%
>
- user CPU time? (time the CPU spends running your code)
- total CPU time (user + kernel)? (includes op. sys. code)
- wallclock time? (total elapsed time)
- includes time spent waiting for I/O, other users, ...
- Answer: it depends ...
- For measuring processor speed, we can use total CPU time.
- If there is no I/O or interrupts, wallclock may be better:
- more precise (microseconds rather than 1/100 sec)
- can measure individual sections of code
19  Metrics of Performance
[Diagram: metrics at each level of the system stack]
- Application: answers per month, useful operations per second
- Programming language / Compiler
- ISA: (millions of) instructions per second (MIPS); (millions of) floating-point operations per second (MFLOP/s)
- Datapath / Control: megabytes per second
- Function units: cycles per second (clock rate)
- Transistors, wires, pins
Each metric has a place and a purpose, and each can be misused.
20Levels of benchmarking
Cons
Pros
- very specific
- non-portable
- difficult to run, or
- measure
- hard to identify cause
Actual Target Workload
- portable
- widely used
- improvements useful in reality
Full Application Benchmarks
Small Kernel Benchmarks
- easy to run, early in design cycle
- peak may be a long way from application
performance
- identify peak capability and potential
bottlenecks
Microbenchmarks
21  Cycle Time
- Instead of reporting execution time in seconds, we often use cycles.
- Clock ticks indicate when to start activities (one abstraction).
- cycle time = time between ticks = seconds per cycle
- clock rate (frequency) = cycles per second (1 Hz = 1 cycle/sec)
- A 200 MHz clock has a cycle time of 1/(200 x 10^6) = 5 ns.
22  Cycle Time
seconds/program = (instructions/program) x (cycles/instruction) x (seconds/cycle)
- Improve performance = reduce execution time
- Reduce instruction count (programmer, compiler)
- Reduce cycles per instruction (ISA, machine designer)
- Reduce clock cycle time (hardware designer, physicist)
23  Performance Variation
CPU Execution Time = Instruction Count x CPI x Clock Cycle Time
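The product form of the equation can be exercised directly (a Python sketch; the example numbers are made up for illustration):

```python
def cpu_time(instruction_count, cpi, clock_cycle_time):
    """CPU Execution Time = Instruction Count x CPI x Clock Cycle Time."""
    return instruction_count * cpi * clock_cycle_time

# e.g. 10^9 instructions at CPI 2.2 on a 200 MHz clock (5 ns cycle)
t = cpu_time(1e9, 2.2, 5e-9)   # 11.0 seconds
```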
24  Amdahl's Law
- Execution Time After Improvement = Execution Time Unaffected + (Execution Time Affected / Amount of Improvement)
- Example: "Suppose a program runs in 100 seconds on a machine, with multiply responsible for 80 seconds of this time. How much do we have to improve the speed of multiplication if we want the program to run 4 times faster?" How about making it 5 times faster?
- Principle: Make the common case fast.
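Working the multiply example through the formula (a Python sketch of Amdahl's Law):

```python
def time_after_improvement(time_unaffected, time_affected, improvement):
    """Amdahl's Law: new time = unaffected time + affected time / improvement."""
    return time_unaffected + time_affected / improvement

# 100 s total, 80 s of multiply, 20 s unaffected.
# To run 4x faster the total must drop to 25 s, so 80/n = 5: n = 16.
assert time_after_improvement(20, 80, 16) == 25
# 5x faster needs a total of 20 s, but the unaffected 20 s alone already
# takes that long -- no multiply speedup can get there.
```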
25  MIPS, MFLOPS etc.
- MIPS - million instructions per second
- MIPS = number of instructions executed in program / (execution time in seconds x 10^6) = clock rate / (CPI x 10^6)
- MFLOPS - million floating point operations per second
- MFLOPS = number of floating point operations executed in program / (execution time in seconds x 10^6)
- program-independent?
- deceptive?
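A sketch of why the MIPS rating can deceive (Python; the machines and numbers are hypothetical): the rating depends on CPI but not on how much work each instruction does, so a machine executing more, simpler instructions can post a higher MIPS yet take longer overall:

```python
def mips_rating(clock_hz, cpi):
    """MIPS = clock rate / (CPI x 10^6)."""
    return clock_hz / (cpi * 1e6)

def exec_time(instr_count, cpi, clock_hz):
    """Execution time = instruction count x CPI / clock rate."""
    return instr_count * cpi / clock_hz

# same 200 MHz clock; machine B needs 1.5x the instructions at lower CPI
a = exec_time(1.0e9, 2.2, 200e6)   # 11.0 s at ~90.9 MIPS
b = exec_time(1.5e9, 1.6, 200e6)   # 12.0 s at 125 MIPS: higher MIPS, slower!
```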
26Example RISC Processor
Base Machine (Reg / Reg) Op Freq Cycles CPI(i)
Time ALU 50 1 .5 23 Load 20 5
1.0 45 Store 10 3 .3 14 Branch 20 2
.4 18 2.2
Typical Mix
How much faster would the machine be if a better
data cache reduced the average load time to 2
cycles? How does this compare with using branch
prediction to shave a cycle off the branch
time? What if two ALU instructions could be
executed at once?
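The first two questions can be answered by recomputing the weighted CPI (a Python sketch using the mix above):

```python
# instruction mix from the slide: operation -> (frequency, cycles)
mix = {"ALU": (0.50, 1), "Load": (0.20, 5), "Store": (0.10, 3), "Branch": (0.20, 2)}

def avg_cpi(mix):
    """Weighted CPI = sum of frequency x cycles over the instruction mix."""
    return sum(freq * cycles for freq, cycles in mix.values())

base = avg_cpi(mix)                                     # 2.2
cache = base / avg_cpi({**mix, "Load": (0.20, 2)})      # loads in 2 cycles: 2.2/1.6 = 1.375x
branch = base / avg_cpi({**mix, "Branch": (0.20, 1)})   # 1-cycle branches: 2.2/2.0 = 1.1x
```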
27  SPEC
Which programs?
- peak throughput measures (simple programs)
- synthetic benchmarks (whetstone, dhrystone, ...)
- real applications
- SPEC (best of both worlds, but with problems of their own)
- System Performance Evaluation Cooperative
- provides a common set of real applications along with strict guidelines for how to run them
- provides a relatively unbiased means to compare machines
28  SPEC89
- Compiler enhancements and performance
29  SPEC CPU2000 Suite
- SPECint2000
- gzip and bzip2: compression
- gcc: compiler, 205K lines of messy code!
- crafty: chess program
- parser: word processing
- vortex: object-oriented database
- perlbmk: PERL interpreter
- eon: computer visualization
- vpr, twolf: CAD tools for VLSI
- mcf, gap: combinatorial programs
- SPECfp2000: 10 Fortran, 3 C programs
- scientific application programs (physics, chemistry, image processing, number theory, ...)
30  Performance is always misleading
- Performance is specific to a particular program or programs.
- Total execution time is a consistent summary of performance.
- For a given architecture, performance increases come from:
- increases in clock rate (without adverse CPI effects)
- improvements in processor organization that lower CPI
- compiler enhancements that lower CPI and/or instruction count
- Pitfall: expecting improvement in one aspect of a machine's performance to affect the total performance.
- You should not always believe everything you read! Read carefully!
31  Computer Arithmetic
What do all those bits mean now?
bits (011011011100010....01) can be interpreted as:
- instruction: R-format, I-format, ...
- data:
  - number: integer (signed, unsigned), floating point (single precision, double precision), ...
  - text: chars, ...
32  Computer Arithmetic
- How do you represent
- negative numbers?
- fractions?
- really large numbers?
- really small numbers?
- How do you
- do arithmetic?
- identify errors (e.g. overflow)?
- What is an ALU and what does it look like?
- ALU = arithmetic logic unit
33  Big Endian vs. Little Endian
Big Endian: IBM, Motorola, HP, Sun
Little Endian: DEC, Intel
- Some processors (e.g. PowerPC) provide both.
- If you can figure out how to switch modes, or get the compiler to issue byte-reversed loads and stores.
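The difference is just byte order in memory; Python's struct module makes it visible (a sketch, not tied to any particular processor):

```python
import struct

# the 32-bit value 0x01020304 laid out in each byte order
little = struct.pack("<I", 0x01020304)   # b'\x04\x03\x02\x01' (Intel, DEC)
big    = struct.pack(">I", 0x01020304)   # b'\x01\x02\x03\x04' (IBM, Sun)

# a "byte-reversed load" is just unpacking with the other byte order
assert struct.unpack(">I", little)[0] == 0x04030201
```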
34  Binary Numbers: An Introduction
Consider a 4-bit binary number:
Decimal  Binary     Decimal  Binary
0        0000       4        0100
1        0001       5        0101
2        0010       6        0110
3        0011       7        0111
Examples of binary arithmetic: 3 + 2 = 5 and 3 + 3 = 6.
    0011  (3)          0011  (3)
  + 0010  (2)        + 0011  (3)
  ---------          ---------
    0101  (5)          0110  (6)
35  Negative Numbers: Some options
- Sign Magnitude -- MSB is the sign bit, the rest stays the same
- -1 = 1001
- -5 = 1101
- One's complement -- flip all bits to negate
- -1 = 1110
- -5 = 1010
- We would like a number system that provides
- obvious representation of 0, 1, 2, ...
- uses an adder for addition
- a single value of 0
- equal coverage of positive and negative numbers
- easy detection of sign
- easy negation
36  Negative Numbers: two's complement
- Positive numbers: normal binary representation
- Negative numbers: flip the bits (0 <-> 1), then add 1
Decimal:             -8   -7   -6   -5   -4   -3   -2   -1    0    1    2    3    4    5    6    7
Two's Complement:  1000 1001 1010 1011 1100 1101 1110 1111 0000 0001 0010 0011 0100 0101 0110 0111
Smallest 4-bit number: -8
Biggest 4-bit number: 7
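The flip-and-add-one rule is easy to check mechanically (a Python sketch for the 4-bit case):

```python
def twos_complement(x, bits=4):
    """Encode x in two's complement: masking does flip-bits-and-add-1 for us."""
    assert -(1 << (bits - 1)) <= x < (1 << (bits - 1))
    return format(x & ((1 << bits) - 1), "0{}b".format(bits))

def negate(bit_string):
    """Flip the bits, then add 1 (modulo the word size)."""
    bits = len(bit_string)
    flipped = int(bit_string, 2) ^ ((1 << bits) - 1)
    return format((flipped + 1) & ((1 << bits) - 1), "0{}b".format(bits))

assert twos_complement(-5) == "1011"
assert negate("0101") == "1011"       # -(+5) = -5
assert twos_complement(-8) == "1000"  # smallest 4-bit number
```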
37  Two's complement arithmetic
Decimal  Binary     Decimal  Binary
0        0000       -1       1111
1        0001       -2       1110
2        0010       -3       1101
3        0011       -4       1100
4        0100       -5       1011
5        0101       -6       1010
6        0110       -7       1001
7        0111       -8       1000
- Examples: 7 - 6 = 7 + (-6) = 1 and 3 - 5 = 3 + (-5) = -2
    0111  ( 7)         0011  ( 3)
  + 1010  (-6)       + 1011  (-5)
  ---------          ---------
    0001  ( 1)         1110  (-2)
(In the first example, the carry out of the MSB is discarded.)
Uses a simple adder for + and - numbers.
38  Things to keep in mind
- Negation
- flip the bits and add 1 (works for + and -)
- might cause overflow
- Extend the sign when loading into a larger register
- 3 = 0011, 00000011, 0000000000000011
- -3 = 1101, 11111101, 1111111111111101
- Overflow detection
- (need to raise an exception when the answer can't be represented)
-   0101 = 5
- + 0110 = 6
-   1011 = -5 ??!!!
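Sign extension can be sketched the same way (Python; widths chosen to match the examples above):

```python
def sign_extend(bit_string, to_bits):
    """Replicate the sign bit when loading into a larger register."""
    return bit_string[0] * (to_bits - len(bit_string)) + bit_string

assert sign_extend("0011", 8)  == "00000011"            # 3
assert sign_extend("1101", 8)  == "11111101"            # -3
assert sign_extend("1101", 16) == "1111111111111101"
```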
39  Overflow detection again
No overflow:
    0010  ( 2)         1100  (-4)
  + 0011  ( 3)       + 1110  (-2)
  ---------          ---------
    0101  ( 5)         1010  (-6)
Overflow:
    0111  ( 7)         1100  (-4)
  + 0011  ( 3)       + 1011  (-5)
  ---------          ---------
    1010  (-6)         0111  ( 7)
So how do we detect overflow?
Carry into MSB != Carry out of MSB
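The carry rule can be checked with a small 4-bit adder model (a Python sketch):

```python
def add_4bit(a, b):
    """Add 4-bit two's-complement values; overflow when carry into MSB != carry out."""
    ua, ub = a & 0xF, b & 0xF
    carry_in_msb = ((ua & 0x7) + (ub & 0x7)) >> 3   # carry into bit 3
    total = ua + ub
    carry_out_msb = total >> 4                      # carry out of bit 3
    result = total & 0xF
    if result & 0x8:                                # interpret result as signed
        result -= 16
    return result, carry_in_msb != carry_out_msb

assert add_4bit(2, 3)   == (5, False)
assert add_4bit(-4, -2) == (-6, False)
assert add_4bit(7, 3)   == (-6, True)    # overflow
assert add_4bit(-4, -5) == (7, True)     # overflow
```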
40  Execution: the heart of it all
Instruction Fetch -> Instruction Decode -> Operand Fetch -> Execute -> Result Store -> Next Instruction
41  A Basic ALU
- ALU Control Lines (ALUop): Function
- 000: And
- 001: Or
- 010: Add
- 110: Subtract
- 111: Set-on-less-than
General idea: build for 1-bit numbers and then extend to n bits!
42  Some basics of digital logic
43  1-bit ALU
- ALU Control Lines (ALUop): Function
- 000: And
- 001: Or
- 010: Add
But how do we make the adder?
44  1-bit Full Adder
- This is also called a (3, 2) adder
- Half Adder: no CarryIn or CarryOut
- Truth Table
45  1-bit Full Adder: CarryOut
CarryOut = (!A & B & CarryIn) | (A & !B & CarryIn) | (A & B & !CarryIn) | (A & B & CarryIn)
         = (B & CarryIn) | (A & CarryIn) | (A & B)
46  1-bit Full Adder: Sum
Sum = (!A & !B & CarryIn) | (!A & B & !CarryIn) | (A & !B & !CarryIn) | (A & B & CarryIn)
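The two equations translate directly into code (a Python sketch; & is AND, | is OR, ^ is XOR):

```python
def full_adder(a, b, carry_in):
    """(3,2) adder: two inputs plus carry-in give sum and carry-out."""
    s = a ^ b ^ carry_in                                   # 1 when an odd number of inputs are 1
    carry_out = (b & carry_in) | (a & carry_in) | (a & b)  # simplified CarryOut equation
    return s, carry_out

# exhaustively matches the truth table
for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            s, cout = full_adder(a, b, c)
            assert a + b + c == 2 * cout + s
```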
47  32-bit ALU
- (ALUop): Function
- And
- Or
- Add
The 1-bit ALU
What about other operations?
sub s1, s2, s3   # s1 = s2 - s3                         (subtraction)
slt s1, s2, s3   # if (s2 < s3) s1 = 1; else s1 = 0     (set on less than, SLT)
The 32-bit ALU
48  32-bit ALU
- Keep in mind the following:
- (A - B) is the same as A + (-B)
- 2's complement negate: take the inverse of every bit and add 1
- the bit-wise inverse of B is !B
- A - B = A + (-B) = A + (!B + 1) = A + !B + 1
Binvert provides the negation.
What about the +1?
49  32-bit ALU
Setting CarryIn0 = 1 provides the +1 for the 32-bit adder.
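Putting the pieces together: a ripple-carry sketch in Python where Binvert flips B and also supplies CarryIn0 = 1, so the same adder computes both A + B and A - B:

```python
MASK32 = 0xFFFFFFFF

def alu32_add_sub(a, b, binvert):
    """Ripple a 1-bit full adder through 32 bits; binvert=1 computes a - b."""
    carry = binvert                      # CarryIn0 = 1 when subtracting
    result = 0
    for i in range(32):
        ai = (a >> i) & 1
        bi = ((b >> i) & 1) ^ binvert    # !B when subtracting
        s = ai ^ bi ^ carry
        carry = (ai & bi) | (ai & carry) | (bi & carry)
        result |= s << i
    return result & MASK32

assert alu32_add_sub(7, 5, 0) == 12   # add
assert alu32_add_sub(7, 5, 1) == 2    # subtract: A + !B + 1
```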
50  32-bit ALU: slt
- slt instruction
- if (a < b) result = 1, else result = 0
- i.e., if (a - b < 0) result = 1, else result = 0
- Do a subtract
- use the sign bit
- route it to bit 0 of the result
- all other bits zero
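A sketch of the slt datapath idea in Python: do the subtract, then route the sign bit of the difference to bit 0 (ignoring the overflow corner cases, which the real MSB logic handles):

```python
def slt(a, b):
    """Set-on-less-than via subtraction: result bit 0 is the sign of a - b."""
    diff = (a - b) & 0xFFFFFFFF
    return diff >> 31              # sign bit routed to bit 0, all other bits zero

assert slt(3, 5) == 1
assert slt(5, 3) == 0
assert slt(4, 4) == 0
```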
51  32-bit ALU: Special conditions
Overflow Detection Logic
- Carry into MSB != Carry out of MSB
- For an N-bit ALU: Overflow = CarryIn[N-1] XOR CarryOut[N-1]
XOR truth table:
X  Y  X XOR Y
0  0  0
0  1  1
1  0  1
1  1  0
[Diagram: chain of 1-bit ALUs (A0/B0/Result0 through A3/B3/Result3), each stage's CarryOut feeding the next stage's CarryIn; the MSB stage XORs CarryIn3 with CarryOut3 to produce Overflow.]
52  32-bit ALU: Special conditions
- Thus the MSB block has special logic to generate
- the Set line (sign bit)
- the Overflow line
53  32-bit ALU: Special conditions
Zero Detection Logic
- Zero detection logic is just one BIG NOR gate over all the result bits.
- Any non-zero input to the NOR gate will cause its output to be zero.
54  32-bit ALU
- Notice the control lines: 000 = and, 001 = or, 010 = add, 110 = subtract, 111 = slt
- zero is 1 when the result is zero!
But what about performance?
55  32-bit ALU
- We can build an ALU to support the MIPS instruction set
- key idea: use a multiplexor to select the output we want
- we can efficiently perform subtraction using two's complement
- we can replicate a 1-bit ALU to produce a 32-bit ALU
- Important points about hardware
- all of the gates are always working
- the speed of a gate is affected by the number of inputs to the gate
- the speed of a circuit is affected by the number of gates in series (on the "critical path", or the deepest level of logic)
- Our primary focus is comprehension; however,
- clever changes to organization can improve performance (similar to using better algorithms in software)
- we'll look at two examples, for addition and multiplication
56  Performance
Ideal (CS) versus Reality (EE)
- When the input goes 0 -> 1, the output goes 1 -> 0, but NOT instantly
- the output voltage goes from Vdd (5v) to 0v
- When the input goes 1 -> 0, the output goes 0 -> 1, but NOT instantly
- the output voltage goes from 0v to Vdd (5v)
- Voltage does not like to change instantaneously
[Plot: Vin and Vout versus time; logic 1 = Vdd, logic 0 = GND]
57  Performance: Series Connection
[Diagram: two gates G1 and G2 in series between Vin and Vdd, with capacitance C1 on node V1 between them; waveform of V1 versus time]
- Total propagation delay = sum of the individual delays = d1 + d2
- Capacitance C1 has two components:
- the capacitance of the wire connecting the two gates
- the input capacitance of the second inverter
58  Performance: Calculating Delays
[Diagram: Vin drives gate G1, whose output V1 fans out to gates G2 (output V2) and G3 (output V3); node V1 carries capacitance C1]
- Sum delays along serial paths
- Delay (Vin -> V2) != Delay (Vin -> V3)
- Delay (Vin -> V2) = Delay (Vin -> V1) + Delay (V1 -> V2)
- Delay (Vin -> V3) = Delay (Vin -> V1) + Delay (V1 -> V3)
- Critical path: the longest among the N parallel paths
- C1 = wire C + Cin of Gate 2 + Cin of Gate 3
59  Performance: Storage elements
- Storage element: D flip-flop, negative edge triggered
[Waveform: Clk, D, and Q versus time, showing the setup and hold windows around the triggering clock edge and the Clock-to-Q delay before Q leaves its unknown state; D is "don't care" outside the setup/hold window]
- Setup time: the input must be stable BEFORE the trigger clock edge
- Hold time: the input must REMAIN stable after the trigger clock edge
- Clock-to-Q time:
- the output cannot change instantaneously at the trigger clock edge
- similar to delay in logic gates, it has two components:
- internal Clock-to-Q
- load-dependent Clock-to-Q
60  Performance: Synchronous logic
- All storage elements are clocked by the same clock edge
- The combinational logic blocks:
- inputs are updated at each clock tick
- all outputs MUST be stable before the next clock tick
61  Performance: Critical Path
- Critical path: the slowest path between any two storage devices
- Cycle time is a function of the critical path
- it must be greater than:
- Clock-to-Q + Longest Path through the Combinational Logic + Setup
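The cycle-time constraint is easy to turn into arithmetic (Python; the delay numbers are hypothetical):

```python
def min_cycle_time(clock_to_q, longest_logic_path, setup):
    """Cycle time must exceed Clock-to-Q + longest combinational delay + setup."""
    return clock_to_q + longest_logic_path + setup

# hypothetical delays in ns
t = min_cycle_time(0.5, 3.0, 0.3)   # 3.8 ns
max_clock_mhz = 1e3 / t             # ~263 MHz
```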
62  Clock Skew
- The worst case scenario for cycle time consideration:
- the input register sees CLK1
- the output register sees CLK2, skewed later by the clock skew
- Cycle Time >= Clock-to-Q + Longest Combinational Delay + Setup + Clock Skew