ECE200 - PowerPoint PPT Presentation

About This Presentation

Title:

ECE200

Slides: 91

Provided by: PAJ58

Learn more at: http://www2.ece.rochester.edu

Transcript and Presenter's Notes

Title: ECE200

1
ECE200 Computer Organization

Chapter 5 The Processor Datapath and Control

2
Homework 5

5.5, 5.7, 5.9, 5.15, 5.16, 5.22, 5.24, 5.29

3
What we've covered so far

Computer abstractions and technology (Ch 1)
Defining, measuring, evaluating performance (Ch 2)
Instruction set architecture and assembly
language programming (Ch 3)
Computer arithmetic (Ch 4)
Basic CPU organization (Ch 5)
Advanced CPU organization (Ch 6)
Caches and main memories (Ch 7)
Input/Output (Ch 8 and Motorola HC11 manuals)
Multiprocessors (Ch 9) if we get to it

4
Outline for Chapter 5 lectures

Goals in processor implementation
Brief review of sequential logic design
Pieces of the processor implementation puzzle
A simple implementation of a MIPS integer
instruction subset
Datapath
Control logic design
A multi-cycle MIPS implementation
Datapath
Control logic design
Microcoded control
Exceptions
Some real microprocessor datapath and control

5
Goals in processor implementation

Balance the rate of supply of instructions and
data and the rate at which the execution core can
consume them and can update memory

[Figure: instruction supply, data supply, execution core]
6
Goals in processor implementation

Recall from Chapter 2
CPU Time = INST x CPI x CT
INST: largely a function of the ISA and compiler
Objective: minimize CPI x CT within design
constraints (cost, power, etc.)
Trading off CPI and CT is tricky

[Figure: blocks labeled multiplier and logic]
7
Brief review of sequential logic design

State elements are clocked devices
Flip flops, etc
Combinatorial elements hold no state
ALU, caches, multiplier, multiplexers, etc.
In edge triggered clocking, state elements are
only updated on the (rising) edge of the clock
pulse

8
Brief review of sequential logic design

The same state element can be read at the
beginning of a clock cycle and updated at the end
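Example: incrementing the PC

[Figure: timing diagram for the PC register and adder. With PC = 8, the Add input is 8 and the Add output is 12 (PC + 4); the PC register is updated from 8 to 12 at the next clock edge.]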
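
The read-then-update behavior of the PC can be sketched in a few lines of Python. This is only an illustrative model, not part of the original slides: the `Register` class and `adder` function are hypothetical names, and the starting value 8 is taken from the figure.

```python
class Register:
    """Edge-triggered state element: the output changes only on tick()."""

    def __init__(self, value=0):
        self.value = value       # output visible during the current cycle
        self.next_value = value  # input that will be latched at the edge

    def tick(self):
        # Rising clock edge: latch whatever is on the input.
        self.value = self.next_value


def adder(a, b):
    # Combinational element: output depends only on the current inputs.
    return (a + b) & 0xFFFFFFFF


pc = Register(8)
trace = []
for _ in range(3):
    pc.next_value = adder(pc.value, 4)       # PC is read early in the cycle
    trace.append((pc.value, pc.next_value))  # (Add input, Add output)
    pc.tick()                                # PC is updated at the end
```

Within one cycle the old PC value stays visible to the adder; the new value only appears after `tick()`.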
9
Our processor design progression

(1) Instruction fetch, execute, and operand reads
from data memory all take place in a single clock
cycle
(2) Instruction fetch, execute, and operand reads
from data memory take place in successive clock
cycles
(3) A pipelined design (Chapter 6)

10
Pieces of the processor puzzle

Instruction fetch
Execution
Data memory

[Figure: instruction supply, data supply, execution core]
11
Instruction fetch datapath

Memory to hold instructions
Register to hold the instruction memory address
Logic to generate the next instruction address

[Figure: instruction fetch datapath; next address is PC + 4]
12
Execution datapath

Focus on only a subset of all MIPS instructions
add, sub, and, or
lw, sw
slt
beq, j
For all instructions except j, we
Read operands from the register file
Perform an ALU operation
For all instructions except sw, beq, and j, we
write a result into the register file

13
Execution datapath

Register file block diagram
Read register 1,2: source operand register
numbers
Read data 1,2: source operands (32 bits each)
Write register: destination operand register
number
Write data: data written into register file
RegWrite: when asserted, enables the writing of
Write data

14
Execution datapath

Datapath for R-type (add, sub, and, or, slt)
R-type instruction format

Bits 31-26: op
Bits 25-21: rs
Bits 20-16: rt
Bits 15-11: rd
Bits 10-6: shamt
Bits 5-0: funct
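
As a rough illustration of the R-type layout, the sketch below extracts each field from a 32-bit instruction word with shifts and masks. The function name `decode_r_type` is hypothetical; the example word is built for `add $4, $18, $30` (rd = 4, rs = 18, rt = 30, funct = 100000).

```python
def decode_r_type(word):
    """Split a 32-bit R-type instruction word into its six fields."""
    return {
        "op":    (word >> 26) & 0x3F,  # bits 31-26
        "rs":    (word >> 21) & 0x1F,  # bits 25-21
        "rt":    (word >> 16) & 0x1F,  # bits 20-16
        "rd":    (word >> 11) & 0x1F,  # bits 15-11
        "shamt": (word >> 6) & 0x1F,   # bits 10-6
        "funct": word & 0x3F,          # bits 5-0
    }


# add $4, $18, $30: op = 0, rs = 18, rt = 30, rd = 4, shamt = 0, funct = 100000
word = (0 << 26) | (18 << 21) | (30 << 16) | (4 << 11) | (0 << 6) | 0b100000
fields = decode_r_type(word)
```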
15
Execution datapath

Datapath for beq instruction
I-type instruction format
Zero ALU output indicates whether rs = rt (branch
is taken/not taken)
Branch target address is the sign extended
immediate left shifted two positions, and added
to PC + 4

Bits 31-26: op
Bits 25-21: rs
Bits 20-16: rt
Bits 15-0: immediate
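
The beq behavior described on this slide can be sketched as follows. The helper names (`sign_extend16`, `branch_target`, `next_pc_beq`) are hypothetical and the addresses are made-up example values; the sketch assumes 32-bit wraparound arithmetic.

```python
def sign_extend16(imm):
    """Interpret a 16-bit field as a signed two's complement value."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm


def branch_target(pc, imm16):
    # Sign-extend the immediate, shift left two positions, add to PC + 4.
    return (pc + 4 + (sign_extend16(imm16) << 2)) & 0xFFFFFFFF


def next_pc_beq(pc, rs_val, rt_val, imm16):
    # The ALU subtracts; its Zero output is asserted when rs equals rt.
    zero = ((rs_val - rt_val) & 0xFFFFFFFF) == 0
    return branch_target(pc, imm16) if zero else (pc + 4) & 0xFFFFFFFF


taken = next_pc_beq(0x1000, 5, 5, 3)      # equal operands: branch taken
not_taken = next_pc_beq(0x1000, 5, 6, 3)  # unequal operands: fall through
```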
16
Data memory

Used for lw, sw (I-type format)
Block diagram
Address: memory location to be read or written
Read data: data out of the memory on a load
Write data: data into the memory on a store
MemRead: indicates a read operation is to be
performed
MemWrite: indicates a write operation is to be
performed

17
Execution datapath data memory

Datapath for lw, sw
Address is the sign-extended immediate added to
the source operand read out of the register file
sw: data written to memory from specified
register
lw: data written to register file from specified
memory address
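
A minimal sketch of the lw/sw behavior, assuming a register file modeled as a Python list and data memory modeled as a dictionary keyed by address. The function names and the register contents are hypothetical example values.

```python
def sign_extend16(imm):
    """Interpret a 16-bit field as a signed two's complement value."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm


def effective_address(base, imm16):
    # Address = sign-extended immediate + source operand from register file.
    return (base + sign_extend16(imm16)) & 0xFFFFFFFF


def sw(regs, memory, rt, imm16, rs):
    # Store: register rt is written to memory at the computed address.
    memory[effective_address(regs[rs], imm16)] = regs[rt]


def lw(regs, memory, rt, imm16, rs):
    # Load: memory at the computed address is written to register rt.
    regs[rt] = memory[effective_address(regs[rs], imm16)]


regs = [0] * 32
memory = {}
regs[3], regs[10] = 2000, 77  # hypothetical register contents
regs[2] = 1888
sw(regs, memory, rt=10, imm16=0, rs=3)   # sw $10, 0($3)
lw(regs, memory, rt=8, imm16=112, rs=2)  # lw $8, 112($2)
```

Both instructions resolve to address 2000 here, so the value stored by `sw` is the one read back by `lw`.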

18
Putting the pieces together

Single clock cycle for fetch, execute, and
operand read from data memory
3 MUXes
Register file operand or sign extended immediate
to ALU
ALU or data memory output written to register
file
PC + 4 or branch target address written to PC
register
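
The three MUXes listed above can be modeled as simple two-way selectors. The select signal names used here (`alu_src`, `mem_to_reg`, `pc_src`) are assumptions borrowed from common textbook naming, not names given on this slide.

```python
def mux(select, when0, when1):
    """Two-input multiplexer: pass when1 if select is asserted, else when0."""
    return when1 if select else when0


def alu_operand_b(alu_src, reg_read_data2, sign_ext_imm):
    # Register file operand or sign extended immediate to the ALU.
    return mux(alu_src, reg_read_data2, sign_ext_imm)


def write_back_data(mem_to_reg, alu_result, mem_read_data):
    # ALU output or data memory output written to the register file.
    return mux(mem_to_reg, alu_result, mem_read_data)


def next_pc(pc_src, pc_plus_4, branch_target):
    # PC + 4 or branch target address written to the PC register.
    return mux(pc_src, pc_plus_4, branch_target)
```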

19
Datapath for R-type instructions
Example: add $4, $18, $30
20
Datapath for I-type ALU instructions
Example: slti $7, $4, 100
21
Datapath for not taken beq instruction
Example: beq $28, $13, EXIT
22
Datapath for taken beq instruction
Example: beq $28, $13, EXIT
23
Datapath for load instruction
Example: lw $8, 112($2)
24
Datapath for store instruction
Example: sw $10, 0($3)
25
Control signals we need to generate
26
ALU operation control

ALU control input codes from Chapter 4
Two steps to generate the ALU control input
Use the opcode to distinguish R-type, lw and sw,
and beq
If R-type, use funct field to determine the ALU
control input

ALU control input   ALU operation      Used for
000                 and                and
001                 or                 or
010                 add                add, lw, sw
110                 subtract           sub, beq
111                 set on less than   slt
27
ALU operation control

Opcode used to generate a 2-bit signal called
ALUOp with the following encodings
00: lw or sw, perform an ALU add
01: beq, perform an ALU subtract
10: R-type, ALU operation is determined by the
funct field

Funct    Instruction   ALU control input
100000   add           010
100010   sub           110
100100   and           000
100101   or            001
101010   slt           111
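
The two-step ALU control generation can be sketched as a small lookup, using the ALUOp encodings and funct codes from the tables above. The function name `alu_control` and the dictionary name are hypothetical.

```python
# funct field -> 3-bit ALU control input (R-type instructions only)
FUNCT_TO_ALU_CONTROL = {
    0b100000: 0b010,  # add
    0b100010: 0b110,  # sub
    0b100100: 0b000,  # and
    0b100101: 0b001,  # or
    0b101010: 0b111,  # slt
}


def alu_control(alu_op, funct):
    """Return the 3-bit ALU control input for a 2-bit ALUOp and funct field."""
    if alu_op == 0b00:   # lw or sw: add to form the address
        return 0b010
    if alu_op == 0b01:   # beq: subtract to compare the operands
        return 0b110
    if alu_op == 0b10:   # R-type: the funct field decides
        return FUNCT_TO_ALU_CONTROL[funct]
    raise ValueError("ALUOp encoding not used in this subset")
```

For lw, sw, and beq the funct argument is ignored, since those formats carry an immediate in the low bits.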
28
Comparing instruction fields
R-type: op = 0 (bits 31-26), rs (25-21), rt (20-16), rd (15-11), shamt (10-6), funct (5-0)
beq: op = 4 (bits 31-26), rs (25-21), rt (20-16), immediate (offset) (15-0)
lw (sw): op = 35 (43) (bits 31-26), rs (25-21), rt (20-16), immediate (offset) (15-0)

Opcode, source registers, function code, and
immediate fields always in same place
Destination register is
bits 15-11 (rd) for R-type
bits 20-16 (rt) for lw
MUX to select the right one
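
A short sketch of the destination register selection. The select signal name `reg_dst` is an assumption (a common textbook name), and the two instruction words are built here only as examples.

```python
def write_register(instr, reg_dst):
    """Pick the destination register number from an instruction word."""
    rt = (instr >> 16) & 0x1F  # bits 20-16, destination for lw
    rd = (instr >> 11) & 0x1F  # bits 15-11, destination for R-type
    return rd if reg_dst else rt


# add $4, $18, $30 (R-type, opcode 0, funct 100000)
add_word = (0 << 26) | (18 << 21) | (30 << 16) | (4 << 11) | 0b100000
# lw $8, 112($2) (opcode 35)
lw_word = (35 << 26) | (2 << 21) | (8 << 16) | 112

add_dest = write_register(add_word, reg_dst=1)
lw_dest = write_register(lw_word, reg_dst=0)
```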

29
Datapath with instr fields and ALU control
30
Main control unit design
31
Main control unit design

Truth table

32
Adding support for jump instructions

J-type format
Next PC formed by shifting left the 26-bit target
two bits and combining it with the 4 high-order
bits of PC + 4
Now the next PC will be one of
PC + 4
beq target address
j target address
We need another MUX and control bit

Bits 31-26: op = 2
Bits 25-0: target
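
The jump address formation can be sketched as below. The function name `jump_target` and the example address are hypothetical.

```python
def jump_target(pc, target26):
    """Form the next PC for a j instruction."""
    pc_plus_4 = (pc + 4) & 0xFFFFFFFF
    upper = pc_plus_4 & 0xF0000000        # 4 high-order bits of PC + 4
    lower = (target26 & 0x03FFFFFF) << 2  # 26-bit target shifted left two bits
    return upper | lower


example = jump_target(0x10000008, 0x100)
```

Because only the low 28 bits come from the instruction, a jump stays within the 256 MB region selected by the upper four bits of PC + 4.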
33
Adding support for jump instructions
34
Evaluation of the simple implementation

All instructions take one clock cycle (CPI = 1)
Assume the following worst case delays
Instruction memory: 4 time units
Data memory: 4 time units (read), 2 time units
(write)
ALU: 4 time units
Adders: 3 time units
Register file: 2 time units (read), 1 time unit
(write)
MUXes, sign extension, gates, and shifters: 1
time unit
Large disparity in worst case delays among
instruction types
R-type: 4 + 2 + 1 + 4 + 1 + 1 = 13 time units
beq: 4 + 2 + 1 + 4 + 1 + 1 + 1 = 14 time units
j: 4 + 1 + 1 = 6 time units
store: 4 + 2 + 4 + 2 = 12 time units
load: 4 + 2 + 4 + 4 + 1 + 1 = 16 time units
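
The per-instruction totals above can be checked with a few lines of Python. The sketch just sums the delay terms listed on the slide; it does not model which component contributes each term. The single-cycle clock period has to cover the slowest instruction.

```python
# Delay terms per instruction type, in time units, copied from the slide.
DELAY_TERMS = {
    "R-type": [4, 2, 1, 4, 1, 1],
    "beq":    [4, 2, 1, 4, 1, 1, 1],
    "j":      [4, 1, 1],
    "store":  [4, 2, 4, 2],
    "load":   [4, 2, 4, 4, 1, 1],
}

totals = {name: sum(terms) for name, terms in DELAY_TERMS.items()}

# In a single-cycle design every instruction takes the worst-case time.
cycle_time = max(totals.values())
slowest = max(totals, key=totals.get)
```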

35
Evaluation of the simple implementation

Disparity would be worse in a real machine
Even slower integer instructions (e.g.,
multiply/divide in MIPS)
Floating point instructions
Simple instructions take as long as complex ones

36
A multicycle implementation

Instruction fetch, register file access, etc
occur in separate clock cycles
Different instruction types take different
numbers of cycles to complete
Clock cycle time should be faster

37
High level view of datapath

New registers store results of each step
Not programmer visible!
Hardware can be shared
One ALU for PC + 4, branch target calculation,
effective address (EA) calculation, and
arithmetic operations
One memory for instructions and data

38
Detailed multi-cycle datapath
39
Multi-cycle control
40
First two cycles for all instructions

Instruction fetch (1st cycle)
Load the instruction into the IR register
IR = Memory[PC]
Increment the PC
PC = PC + 4
Instruction decode and register fetch (2nd cycle)
Read register file locations rs and rt, results
into the A and B registers
A = Reg[IR[25-21]]
B = Reg[IR[20-16]]
Calculate the branch target address and load into
ALUOut
ALUOut = PC + (sign-extend(IR[15-0]) << 2)
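
The two common cycles can be sketched as register transfers on a small state dictionary. This is an illustrative model only: the function names, the dictionary layout, and the example instruction (beq $28, $13 with offset 3 at a made-up address) are assumptions.

```python
def sign_extend16(imm):
    """Interpret a 16-bit field as a signed two's complement value."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm


def fetch(state, memory):
    # 1st cycle: IR = Memory[PC]; PC = PC + 4
    state["IR"] = memory[state["PC"]]
    state["PC"] = (state["PC"] + 4) & 0xFFFFFFFF


def decode(state, regs):
    # 2nd cycle: A = Reg[IR[25-21]]; B = Reg[IR[20-16]]
    ir = state["IR"]
    state["A"] = regs[(ir >> 21) & 0x1F]
    state["B"] = regs[(ir >> 16) & 0x1F]
    # ALUOut = PC + (sign-extend(IR[15-0]) << 2); PC already holds PC + 4
    offset = sign_extend16(ir & 0xFFFF) << 2
    state["ALUOut"] = (state["PC"] + offset) & 0xFFFFFFFF


regs = [0] * 32
regs[28], regs[13] = 7, 9                              # hypothetical values
beq_word = (4 << 26) | (28 << 21) | (13 << 16) | 3     # beq $28, $13, +3
memory = {0x100: beq_word}
state = {"PC": 0x100, "IR": 0, "A": 0, "B": 0, "ALUOut": 0}

fetch(state, memory)
decode(state, regs)
```

The branch target is computed before the instruction type is known; it is simply ignored if the instruction turns out not to be a branch.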

41
Instruction fetch

IR = Memory[PC]

42
Instruction fetch

PC = PC + 4

43
Instruction decode and register fetch

A = Reg[IR[25-21]], B = Reg[IR[20-16]]

44
Instruction decode and register fetch

ALUOut = PC + (sign-extend(IR[15-0]) << 2)

45
Additional cycles for R-type

Execution
ALUOut = A op B
Completion
Reg[IR[15-11]] = ALUOut

46
R-type execution cycle

ALUOut = A op B

47
R-type completion cycle

Reg[IR[15-11]] = ALUOut

48
Additional cycles for store

Address computation
ALUOut = A + sign-extend(IR[15-0])
Memory access
Memory[ALUOut] = B

49
Store address computation cycle

ALUOut = A + sign-extend(IR[15-0])

50
Store memory access cycle

Memory[ALUOut] = B

51
Additional cycles for load

Address computation
ALUOut = A + sign-extend(IR[15-0])
Memory access
MDR = Memory[ALUOut]
Read completion
Reg[IR[20-16]] = MDR

52
Load memory access cycle

MDR = Memory[ALUOut]

53
Load read completion cycle

Reg[IR[20-16]] = MDR

54
Additional cycle for beq

Branch completion
if (A == B) PC = ALUOut

55
Branch completion cycle for beq

if (A == B) PC = ALUOut

56
Additional cycle for j

Jump completion
PC = PC[31-28] || (IR[25-0] << 2)

57
Jump completion cycle for j

PC = PC[31-28] || (IR[25-0] << 2)

58
Control logic design

Implemented as a Finite State Machine
Inputs: 6 opcode bits
Outputs: 16 control signals
State: 4 bits for 10 states
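
A minimal sketch of the control FSM's next-state behavior, assuming the ten states correspond to the cycles named in these slides. The state names and instruction-kind strings are hypothetical labels, and the 16 control outputs are not modeled.

```python
STATES = [
    "fetch", "decode", "mem_addr", "load_access", "load_completion",
    "store_access", "r_exec", "r_completion",
    "branch_completion", "jump_completion",
]

# State entered after decode, chosen by the instruction kind (opcode).
AFTER_DECODE = {
    "lw": "mem_addr", "sw": "mem_addr", "r-type": "r_exec",
    "beq": "branch_completion", "j": "jump_completion",
}


def next_state(state, kind):
    if state == "fetch":
        return "decode"
    if state == "decode":
        return AFTER_DECODE[kind]
    if state == "mem_addr":
        return "load_access" if kind == "lw" else "store_access"
    if state == "load_access":
        return "load_completion"
    if state == "r_exec":
        return "r_completion"
    return "fetch"  # every final state returns to instruction fetch


def cycles(kind):
    """Count the clock cycles one instruction spends in the FSM."""
    state, count = "fetch", 0
    while True:
        count += 1
        state = next_state(state, kind)
        if state == "fetch":
            return count


state_bits = (len(STATES) - 1).bit_length()  # bits needed to encode 10 states
```

Ten states need four state bits because three bits can only distinguish eight.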

59
High-level view of FSM
60
Instruction fetch cycle
61
Instruction decode/register fetch cycle
62
R-type execution cycle
63
R-type completion cycle
64
Memory address computation cycle
65
Store memory access cycle
66
Load memory access cycle
67
Load read completion cycle
68
beq branch completion cycle
69
j jump completion cycle
70
Complete FSM
71
Evaluation of the multi-cycle design

CPI calculated based on the instruction mix
For gcc (Figure 4.54)
23% loads (5 cycles each)
13% stores (4 cycles each)
19% branches (3 cycles each)
2% jumps (3 cycles each)
43% ALU (4 cycles each)
CPI = 0.23 x 5 + 0.13 x 4 + 0.19 x 3 + 0.02 x 3 + 0.43 x 4 = 4.02
Cycle time is calculated from the longest delay
path assuming the same timing delays as before
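
The CPI figure is a weighted average of cycles per instruction class. The sketch below reproduces the arithmetic from the gcc mix quoted above; the dictionary name is hypothetical.

```python
# Instruction class -> (fraction of instructions, cycles per instruction)
GCC_MIX = {
    "loads":    (0.23, 5),
    "stores":   (0.13, 4),
    "branches": (0.19, 3),
    "jumps":    (0.02, 3),
    "alu":      (0.43, 4),
}

total_fraction = sum(fraction for fraction, _ in GCC_MIX.values())
cpi = sum(fraction * cycle_count for fraction, cycle_count in GCC_MIX.values())
```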

72
Worst case datapath branch target

ALUOut = PC + (sign-extend(IR[15-0]) << 2)
Delay = 7 time units (delay of the simple
implementation was 16)

73
Evaluation of the multi-cycle design

Time per instruction of simple and multi-cycle
TPI(simple) = CPI(simple) x cycle time(simple)
= 1 x 16 = 16
TPI(multi-cycle) = 4.02 x 7 = 28.1
Simple single-cycle implementation is faster
Multicycle with pipelining will be considerably
faster than single-cycle implementation
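
The comparison can be worked through numerically. This sketch only restates the numbers from the slides (CPI of 1 and cycle time of 16 for the simple design, CPI of 4.02 and cycle time of 7 for the multi-cycle design); the variable names are hypothetical.

```python
# Time per instruction = CPI x cycle time, in the slides' time units.
tpi_simple = 1 * 16
tpi_multi = 4.02 * 7

# How many times longer the multi-cycle design takes per instruction.
slowdown = tpi_multi / tpi_simple
```

With these delays the multi-cycle design takes roughly 1.76 times as long per instruction, because the cycle time shrinks by less than the CPI grows.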

74
Exceptions

An exception is an event that causes a deviation
from the normal execution of instructions
Types of exceptions
Operating system call (e.g., read a file, print a
file)
Input/output device request
Page fault (request for instruction/data not in
memory; see Ch 7)
Arithmetic error (overflow, underflow, etc.)
Undefined instruction
Misaligned memory access (e.g., word access to
odd address)
Memory protection violation
Hardware error
Power failure
An exception is not usually due to an error!
We need to be able to restart the program at the
point where the exception was detected

75
Handling exceptions

Detect the exception
Save enough information about the exception to
handle it properly
Save enough information about the program to
resume it after the exception is handled
Handle the exception
Either terminate the program or resume executing
it depending on the exception type

76
Detecting exceptions

Performed by hardware
Overflow determined from the opcode and the
overflow output of the ALU
Undefined instruction determined from
The opcode in the main control unit
The function code and ALUop in the ALU control
logic

77
Detecting exceptions
overflow
undefined instruction
78
Saving exception information

Performed by hardware
We need the type of exception and the PC of the
instruction when the exception occurred
In MIPS, the Cause register holds the exception
type
Need an encoding for each exception type
Need a signal from the control unit to load it
into the Cause register
The Exception Program Counter (EPC) register
holds the PC
Need to subtract 4 from the PC register to get
the correct PC (since we loaded PC + 4 into the PC
register during the Instruction Fetch cycle)
Need a signal from the control unit to load it
into EPC

79
Saving exception information
80
Saving program information

Needed in order to restart the program from the
point where the exception occurred
Performed by hardware and software
EPC register holds the PC of the instruction that
had the exception (where we will restart the
program)
The software routine that handles the exception
saves any registers that it will need to the
stack and restores them when it is done

81
Handling the exception

Performed by hardware and software
Need to transfer control to a software routine to
handle the exception (exception handler)
The exception handler runs in a privileged mode
that allows it to use special instructions and
access all of memory
Our programs run in user mode
The hardware enables the privileged mode, loads
PC with the address of the exception handler, and
transitions to the Fetch state
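
The hardware side of exception entry, together with the Cause and EPC saving described earlier, can be sketched as follows. The cause encodings and the handler address are hypothetical placeholders chosen for illustration, not values given in these slides.

```python
CAUSE_UNDEFINED_INSTRUCTION = 0  # hypothetical encoding
CAUSE_OVERFLOW = 1               # hypothetical encoding
HANDLER_ADDRESS = 0x80000180     # hypothetical handler entry point


def take_exception(state, cause):
    """Model what the hardware does when an exception is detected."""
    state["Cause"] = cause                          # record the exception type
    state["EPC"] = (state["PC"] - 4) & 0xFFFFFFFF   # PC already holds PC + 4
    state["mode"] = "privileged"                    # enable privileged mode
    state["PC"] = HANDLER_ADDRESS                   # enter the handler
    state["fsm_state"] = "fetch"                    # go to the Fetch state


def resume(state):
    """Model the handler returning to the interrupted program."""
    state["PC"] = state["EPC"]
    state["mode"] = "user"
    state["fsm_state"] = "fetch"


# The excepting instruction was fetched from 0x100, so PC now holds 0x104.
state = {"PC": 0x104, "EPC": 0, "Cause": 0, "mode": "user",
         "fsm_state": "r_exec"}
take_exception(state, CAUSE_OVERFLOW)
saved = dict(state)  # snapshot taken while the handler would be running
resume(state)
```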

82
Handling the exception

Loading the PC with exception handler address

83
Exception handler

Stores the values of the registers that it will
need to the stack
Handles the particular exception
Operating system call: calls the subroutine
associated with the call
Underflow: sets register to zero or uses
denormalized numbers
I/O: handles the particular I/O request, e.g.,
keyboard input
Restores registers from the stack (if program is
to be restarted)
Terminates the program, or resumes execution by
loading the PC with EPC and transitioning to the
Instruction Fetch state

84
FSM modifications
85
The Intel Pentium processor

Introduced in 1993
Uses a multi-cycle datapath with the following
steps for integer instructions
Prefetch (PF): read instruction from the
instruction memory
Decode 1 (D1): first stage of instruction decode
Decode 2 (D2): second stage of instruction decode
Execute (E): perform the ALU operation
Write back (WB): write the result to the register
file
Datapath usage varies by instruction type
Simple instructions make one pass through the
datapath using state machine control
Complex instructions make multiple passes,
reusing the same hardware elements under
microcode control

86
The Intel Pentium processor

The Pentium is a 2-way superscalar design as two
instructions can simultaneously execute
Ideal CPI for a 2-way superscalar is 0.5
Conditions for superscalar execution
Both must be simple instructions
The result of the first instruction cannot be
needed by the second
Both instructions cannot write the same register
The first instruction in program sequence cannot
be a jump
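
The four pairing conditions can be expressed as a small predicate. This is a simplified sketch of the rules as stated on this slide only; the `Instr` record and its fields are hypothetical, and real Pentium pairing rules have more cases.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Instr:
    simple: bool                    # True for a "simple" instruction
    dest: Optional[int]             # destination register, if any
    sources: Tuple[int, ...] = ()   # source register numbers
    is_jump: bool = False


def can_pair(first, second):
    """Check whether two adjacent instructions may issue together."""
    if not (first.simple and second.simple):
        return False  # both must be simple instructions
    if first.is_jump:
        return False  # the first in program order cannot be a jump
    if first.dest is not None and first.dest in second.sources:
        return False  # second needs the result of the first
    if first.dest is not None and first.dest == second.dest:
        return False  # both would write the same register
    return True


a = Instr(simple=True, dest=1, sources=(2, 3))
b = Instr(simple=True, dest=4, sources=(5, 6))
c = Instr(simple=True, dest=7, sources=(1, 2))   # reads a's result
d = Instr(simple=True, dest=1, sources=(8, 9))   # writes a's destination
jump = Instr(simple=True, dest=None, is_jump=True)
```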

[Figure: pipeline stages D1, D2, E, WB for the U pipe and V pipe]
87
The Intel Pentium Pro processor

Introduced in 1995 as the successor to the
Pentium
The basis for the Pentium II and Pentium III
Implements a 14-cycle, 3-way superscalar integer
datapath
Very high frequency is the goal
Uses out-of-order execution in that instructions
may execute out of their original program order
Completely handled by hardware transparently to
software
Instructions execute as soon as their source
operands become available
Complicates exception handling
Some instructions before the excepting one may
not have executed, while some after it may have
executed

88
The Intel Pentium Pro processor

Pentium Pro designers (and AMD designers before
them) used innovative engineering to overcome the
disadvantages of CISC ISAs
Many complex X86 instructions are internally
translated by hardware into RISC-like micro-ops
with state machine control
Achieves a very low CPI for simple integer
operations even on programs compiled for older
implementations
Combination of high frequency and low CPI gave
the Pentium Pro extremely competitive integer
performance versus RISC microprocessors
Result has been that RISC CPUs have failed to
gain the desktop market share that had been
expected

89
The Intel Pentium 4 processor

20 cycle superscalar integer pipeline
Extremely high frequency (>3 GHz)
Major effort to lower power dissipation
Clock gating: clock to a unit is turned off when
the unit is not in use
Trace cache: caches micro-ops of previously
decoded complex instructions to avoid
power-consuming decode operation

90
Questions?
