Title: ECE200
1ECE200 Computer Organization
- Chapter 5 The Processor Datapath and Control
2Homework 5
- 5.5, 5.7, 5.9, 5.15, 5.16, 5.22, 5.24, 5.29
3What we've covered so far
- Computer abstractions and technology (Ch 1)
- Defining, measuring, evaluating performance (Ch 2)
- Instruction set architecture and assembly language programming (Ch 3)
- Computer arithmetic (Ch 4)
- Basic CPU organization (Ch 5)
- Advanced CPU organization (Ch 6)
- Caches and main memories (Ch 7)
- Input/Output (Ch 8 and Motorola HC11 manuals)
- Multiprocessors (Ch 9) if we get to it
4Outline for Chapter 5 lectures
- Goals in processor implementation
- Brief review of sequential logic design
- Pieces of the processor implementation puzzle
- A simple implementation of a MIPS integer instruction subset
- Datapath
- Control logic design
- A multi-cycle MIPS implementation
- Datapath
- Control logic design
- Microcoded control
- Exceptions
- Some real microprocessor datapath and control
5Goals in processor implementation
- Balance the rate of supply of instructions and
data and the rate at which the execution core can
consume them and can update memory
[Figure: instruction supply and data supply feeding the execution core]
6Goals in processor implementation
- Recall from Chapter 2
- CPU Time = INST x CPI x CT
- INST largely a function of the ISA and compiler
- Objective: minimize CPI x CT within design constraints (cost, power, etc.)
- Trading off CPI and CT is tricky
[Figure: multiplier and logic blocks illustrating the CPI versus CT tradeoff]
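The CPU time relation above can be sketched numerically. The instruction count and the two design points below are hypothetical values chosen only to show how a higher CPI can still win if the cycle time drops enough.

```python
def cpu_time(inst_count, cpi, cycle_time):
    """CPU time = instruction count x cycles per instruction x cycle time."""
    return inst_count * cpi * cycle_time

# Hypothetical designs running the same 1,000,000-instruction program.
design_a = cpu_time(1_000_000, cpi=1.0, cycle_time=16)  # CPI of 1, long cycle
design_b = cpu_time(1_000_000, cpi=2.0, cycle_time=7)   # higher CPI, short cycle

print(design_a, design_b)
```

Neither CPI nor CT alone decides which design is faster; only their product does.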
7Brief review of sequential logic design
- State elements are clocked devices
- Flip-flops, etc.
- Combinational elements hold no state
- ALU, adders, multiplier, multiplexers, etc.
- In edge-triggered clocking, state elements are only updated on the (rising) edge of the clock pulse
8Brief review of sequential logic design
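- The same state element can be read at the beginning of a clock cycle and updated at the end
- Example: incrementing the PC
[Figure: timing diagram of the PC register and the adder. With PC = 8 and a constant input of 4, the Add output is 12 during the cycle, and the PC register changes from 8 to 12 at the next clock edge]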
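A minimal sketch of the PC increment example, assuming a software model in which a register exposes its stored value during the cycle and only takes its new value when `clock_edge()` is called. The class and function names are hypothetical.

```python
class PCRegister:
    """Edge-triggered register: the stored value changes only on clock_edge()."""

    def __init__(self, value=0):
        self.value = value        # state visible to the rest of the datapath
        self.next_value = value   # input driven by combinational logic

    def clock_edge(self):
        # The register captures its input only at the clock edge.
        self.value = self.next_value


def adder(a, b):
    """Combinational 32-bit adder: output depends only on current inputs."""
    return (a + b) & 0xFFFFFFFF


pc = PCRegister(8)
pc.next_value = adder(pc.value, 4)  # during the cycle the adder output is 12
during_cycle = pc.value             # the register still holds 8
pc.clock_edge()
after_edge = pc.value               # the register now holds 12
```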
9Our processor design progression
- (1) Instruction fetch, execute, and operand reads from data memory all take place in a single clock cycle
- (2) Instruction fetch, execute, and operand reads from data memory take place in successive clock cycles
- (3) A pipelined design (Chapter 6)
10Pieces of the processor puzzle
- Instruction fetch
- Execution
- Data memory
[Figure: instruction supply and data supply feeding the execution core]
11Instruction fetch datapath
- Memory to hold instructions
- Register to hold the instruction memory address
- Logic to generate the next instruction address (PC + 4)
12Execution datapath
- Focus on only a subset of all MIPS instructions
- add, sub, and, or
- lw, sw
- slt
- beq, j
- For all instructions except j, we
- Read operands from the register file
- Perform an ALU operation
- For all instructions except sw, beq, and j, we
write a result into the register file
13Execution datapath
- Register file block diagram
- Read register 1, 2: source operand register numbers
- Read data 1, 2: source operands (32 bits each)
- Write register: destination operand register number
- Write data: data written into register file
- RegWrite: when asserted, enables the writing of Write Data
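The register file ports listed above can be modeled roughly as follows. This is a behavioral sketch, not the hardware design; the class and parameter names are hypothetical. It relies on the MIPS convention that register 0 always reads as zero.

```python
class RegisterFile:
    """Behavioral model of a 32-entry, 32-bit register file."""

    def __init__(self):
        self.regs = [0] * 32

    def read(self, read_reg1, read_reg2):
        # Two read ports: returns (Read data 1, Read data 2).
        return self.regs[read_reg1], self.regs[read_reg2]

    def write(self, write_reg, write_data, reg_write):
        # The write happens only when RegWrite is asserted.
        # MIPS register 0 is hardwired to zero, so writes to it are ignored.
        if reg_write and write_reg != 0:
            self.regs[write_reg] = write_data & 0xFFFFFFFF


rf = RegisterFile()
rf.write(4, 7, reg_write=False)   # RegWrite deasserted: no change
before = rf.read(4, 0)
rf.write(4, 7, reg_write=True)    # RegWrite asserted: register 4 updated
after = rf.read(4, 0)
```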
14Execution datapath
- Datapath for R-type (add, sub, and, or, slt)
- R-type instruction format
Bits:  31-26  25-21  20-16  15-11  10-6   5-0
Field: op     rs     rt     rd     shamt  funct
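As an illustration of the R-type field positions, the sketch below extracts each field from a 32-bit instruction word with shifts and masks. The function name is hypothetical; the example word is built by hand for a MIPS add with rd = 4, rs = 18, rt = 30.

```python
def decode_r_type(word):
    """Split a 32-bit R-type instruction word into its six fields."""
    return {
        "op":    (word >> 26) & 0x3F,  # bits 31-26
        "rs":    (word >> 21) & 0x1F,  # bits 25-21
        "rt":    (word >> 16) & 0x1F,  # bits 20-16
        "rd":    (word >> 11) & 0x1F,  # bits 15-11
        "shamt": (word >> 6) & 0x1F,   # bits 10-6
        "funct": word & 0x3F,          # bits 5-0
    }


# add with rd = 4, rs = 18, rt = 30: op = 0, funct = 100000 (binary)
word = (0 << 26) | (18 << 21) | (30 << 16) | (4 << 11) | (0 << 6) | 0b100000
fields = decode_r_type(word)
```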
15Execution datapath
- Datapath for beq instruction
- I-type instruction format
Bits:  31-26  25-21  20-16  15-0
Field: op     rs     rt     immediate
- Zero ALU output indicates if rs = rt (branch is taken/not taken)
- Branch target address is the sign-extended immediate left shifted two positions, and added to PC + 4
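The beq behavior described above can be sketched as follows. The function names are hypothetical, and the subtraction stands in for the ALU whose Zero output decides whether the branch is taken.

```python
def sign_extend16(imm):
    """Interpret the low 16 bits of imm as a signed two's complement value."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm


def branch_target(pc, imm16):
    """Sign-extended immediate, shifted left two positions, added to PC + 4."""
    return (pc + 4 + (sign_extend16(imm16) << 2)) & 0xFFFFFFFF


def next_pc_beq(pc, rs_val, rt_val, imm16):
    # The ALU subtracts; its Zero output is asserted when rs equals rt.
    zero = ((rs_val - rt_val) & 0xFFFFFFFF) == 0
    return branch_target(pc, imm16) if zero else (pc + 4) & 0xFFFFFFFF


taken = next_pc_beq(0x100, 5, 5, 3)      # rs = rt: branch to PC + 4 + 12
not_taken = next_pc_beq(0x100, 5, 6, 3)  # rs != rt: fall through to PC + 4
```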
16Data memory
- Used for lw, sw (I-type format)
- Block diagram
- Address: memory location to be read or written
- Read data: data out of the memory on a load
- Write data: data into the memory on a store
- MemRead: indicates a read operation is to be performed
- MemWrite: indicates a write operation is to be performed
17Execution datapath data memory
- Datapath for lw, sw
- Address is the sign-extended immediate added to the source operand read out of the register file
- sw: data written to memory from specified register
- lw: data written to register file from specified memory address
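A rough behavioral sketch of lw and sw, assuming a register list and a dictionary standing in for data memory. All names are hypothetical and alignment checks are omitted.

```python
def sign_extend16(imm):
    """Interpret the low 16 bits of imm as a signed two's complement value."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm


def effective_address(base, imm16):
    """Base register value plus sign-extended immediate."""
    return (base + sign_extend16(imm16)) & 0xFFFFFFFF


regs = [0] * 32
memory = {}  # address -> 32-bit word; stand-in for data memory


def sw(rt, imm16, rs):
    # Store: register rt is written to memory at the effective address.
    memory[effective_address(regs[rs], imm16)] = regs[rt]


def lw(rt, imm16, rs):
    # Load: memory at the effective address is written to register rt.
    regs[rt] = memory.get(effective_address(regs[rs], imm16), 0)


regs[2] = 0x1000
regs[10] = 99
sw(10, 112, 2)   # memory[0x1000 + 112] = 99
lw(8, 112, 2)    # register 8 = memory[0x1000 + 112]
```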
18Putting the pieces together
- Single clock cycle for fetch, execute, and operand read from data memory
- 3 MUXes
- Register file operand or sign extended immediate to ALU
- ALU or data memory output written to register file
- PC + 4 or branch target address written to PC register
19Datapath for R-type instructions
Example: add $4, $18, $30
20Datapath for I-type ALU instructions
Example: slti $7, $4, 100
21Datapath for not taken beq instruction
Example: beq $28, $13, EXIT
22Datapath for taken beq instruction
Example: beq $28, $13, EXIT
23Datapath for load instruction
Example: lw $8, 112($2)
24Datapath for store instruction
Example: sw $10, 0($3)
25Control signals we need to generate
26ALU operation control
- ALU control input codes from Chapter 4
- Two steps to generate the ALU control input
- Use the opcode to distinguish R-type, lw and sw, and beq
- If R-type, use funct field to determine the ALU control input
ALU control input   ALU operation      Used for
000                 and                and
001                 or                 or
010                 add                add, lw, sw
110                 subtract           sub, beq
111                 set on less than   slt
27ALU operation control
- Opcode used to generate a 2-bit signal called ALUOp with the following encodings
- 00: lw or sw, perform an ALU add
- 01: beq, perform an ALU subtract
- 10: R-type, ALU operation is determined by the funct field
Funct    Instruction   ALU control input
100000   add           010
100010   sub           110
100100   and           000
100101   or            001
101010   slt           111
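The two-step ALU control generation above can be written as a small lookup. This is a sketch of the truth table only, not of the gate-level design; the function and table names are hypothetical.

```python
# Funct field -> ALU control input, used only when ALUOp = 10 (R-type).
FUNCT_TO_ALU_CONTROL = {
    0b100000: 0b010,  # add
    0b100010: 0b110,  # sub
    0b100100: 0b000,  # and
    0b100101: 0b001,  # or
    0b101010: 0b111,  # slt
}


def alu_control(alu_op, funct):
    """Return the 3-bit ALU control input from ALUOp and the funct field."""
    if alu_op == 0b00:   # lw or sw: add
        return 0b010
    if alu_op == 0b01:   # beq: subtract
        return 0b110
    if alu_op == 0b10:   # R-type: decided by the funct field
        return FUNCT_TO_ALU_CONTROL[funct]
    raise ValueError("ALUOp encoding not defined in this subset")
```

For lw, sw, and beq the funct argument is ignored, since those bits belong to the immediate field.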
28Comparing instruction fields
Format    31-26     25-21  20-16  15-11  10-6   5-0
R-type    0         rs     rt     rd     shamt  funct
beq       4         rs     rt     immediate (offset), bits 15-0
lw (sw)   35 (43)   rs     rt     immediate (offset), bits 15-0
- Opcode, source registers, function code, and immediate fields always in same place
- Destination register is
- bits 15-11 (rd) for R-type
- bits 20-16 (rt) for lw
- MUX to select the right one
29Datapath with instr fields and ALU control
30Main control unit design
31Main control unit design
32Adding support for jump instructions
- J-type format
Bits:  31-26  25-0
Field: 2      target
- Next PC formed by shifting left the 26-bit target two bits and combining it with the 4 high-order bits of PC + 4
- Now the next PC will be one of
- PC + 4
- beq target address
- j target address
- We need another MUX and control bit
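The jump address formation can be sketched with a mask and a shift. The function name is hypothetical.

```python
def jump_target(pc, target26):
    """4 high-order bits of PC + 4, concatenated with the 26-bit target << 2."""
    pc_plus_4 = (pc + 4) & 0xFFFFFFFF
    high_bits = pc_plus_4 & 0xF0000000          # bits 31-28 of PC + 4
    low_bits = (target26 & 0x03FFFFFF) << 2     # 28-bit word-aligned offset
    return high_bits | low_bits


example = jump_target(0x40000000, 0x100)
```

Because the upper four bits come from PC + 4, a jump in this format stays within the current 256 MB region of the address space.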
33Adding support for jump instructions
34Evaluation of the simple implementation
- All instructions take one clock cycle (CPI = 1)
- Assume the following worst case delays
- Instruction memory: 4 time units
- Data memory: 4 time units (read), 2 time units (write)
- ALU: 4 time units
- Adders: 3 time units
- Register file: 2 time units (read), 1 time unit (write)
- MUXes, sign extension, gates, and shifters: 1 time unit
- Large disparity in worst case delays among instruction types
- R-type: 4 + 2 + 1 + 4 + 1 + 1 = 13 time units
- beq: 4 + 2 + 1 + 4 + 1 + 1 + 1 = 14 time units
- j: 4 + 1 + 1 = 6 time units
- store: 4 + 2 + 4 + 2 = 12 time units
- load: 4 + 2 + 4 + 4 + 1 + 1 = 16 time units
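The per-instruction delays above can be recomputed from the component delays. Which component each term of a sum corresponds to is an assumption here (the slide gives only the numbers), and the dictionary keys are hypothetical names.

```python
DELAY = {
    "imem": 4, "dmem_read": 4, "dmem_write": 2, "alu": 4,
    "reg_read": 2, "reg_write": 1,
    "small": 1,  # MUX, sign extension, gate, or shifter
}

# Assumed component sequence along each instruction's critical path.
PATHS = {
    "R-type": ["imem", "reg_read", "small", "alu", "small", "reg_write"],
    "beq":    ["imem", "reg_read", "small", "alu", "small", "small", "small"],
    "j":      ["imem", "small", "small"],
    "store":  ["imem", "reg_read", "alu", "dmem_write"],
    "load":   ["imem", "reg_read", "alu", "dmem_read", "small", "reg_write"],
}

totals = {name: sum(DELAY[c] for c in path) for name, path in PATHS.items()}

# In a single-cycle design the clock must accommodate the slowest instruction.
cycle_time = max(totals.values())
```

The cycle time is set by load, so every other instruction wastes part of each cycle.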
35Evaluation of the simple implementation
- Disparity would be worse in a real machine
- Even slower integer instructions (e.g., multiply/divide in MIPS)
- Floating point instructions
- Simple instructions take as long as complex ones
36A multicycle implementation
- Instruction fetch, register file access, etc. occur in separate clock cycles
- Different instruction types take different numbers of cycles to complete
- Clock cycle time should be shorter
37High level view of datapath
- New registers store results of each step
- Not programmer visible!
- Hardware can be shared
- One ALU for PC + 4, branch target calculation, EA calculation, and arithmetic operations
- One memory for instructions and data
38Detailed multi-cycle datapath
39Multi-cycle control
40First two cycles for all instructions
- Instruction fetch (1st cycle)
- Load the instruction into the IR register
- IR = Memory[PC]
- Increment the PC
- PC = PC + 4
- Instruction decode and register fetch (2nd cycle)
- Read register file locations rs and rt, results into the A and B registers
- A = Reg[IR[25-21]]
- B = Reg[IR[20-16]]
- Calculate the branch target address and load into ALUOut
- ALUOut = PC + (sign-extend(IR[15-0]) << 2)
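The two common cycles can be sketched as register transfers on a dictionary holding the internal registers (PC, IR, A, B, ALUOut). The dictionary layout and function names are hypothetical; memory is modeled as a word-addressed dictionary.

```python
def sign_extend16(imm):
    """Interpret the low 16 bits of imm as a signed two's complement value."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm


def fetch_cycle(cpu, memory):
    cpu["IR"] = memory[cpu["PC"]]              # IR = Memory[PC]
    cpu["PC"] = (cpu["PC"] + 4) & 0xFFFFFFFF   # PC = PC + 4


def decode_cycle(cpu, regs):
    ir = cpu["IR"]
    cpu["A"] = regs[(ir >> 21) & 0x1F]         # A = Reg[IR[25-21]]
    cpu["B"] = regs[(ir >> 16) & 0x1F]         # B = Reg[IR[20-16]]
    # The branch target is computed speculatively, before the opcode is
    # known to be beq; PC already holds PC + 4 from the fetch cycle.
    cpu["ALUOut"] = (cpu["PC"] + (sign_extend16(ir & 0xFFFF) << 2)) & 0xFFFFFFFF


# beq with rs = 1, rt = 2, offset = 3, stored at address 0
memory = {0: (4 << 26) | (1 << 21) | (2 << 16) | 3}
regs = [0] * 32
regs[1], regs[2] = 5, 7
cpu = {"PC": 0, "IR": 0, "A": 0, "B": 0, "ALUOut": 0}
fetch_cycle(cpu, memory)
decode_cycle(cpu, regs)
```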
41Instruction fetch
42Instruction fetch
43Instruction decode and register fetch
44Instruction decode and register fetch
- ALUOut = PC + (sign-extend(IR[15-0]) << 2)
45Additional cycles for R-type
- Execution
- ALUOut = A op B
- Completion
- Reg[IR[15-11]] = ALUOut
46R-type execution cycle
47R-type completion cycle
48Additional cycles for store
- Address computation
- ALUOut = A + sign-extend(IR[15-0])
- Memory access
- Memory[ALUOut] = B
49Store address computation cycle
- ALUOut = A + sign-extend(IR[15-0])
50Store memory access cycle
51Additional cycles for load
- Address computation
- ALUOut = A + sign-extend(IR[15-0])
- Memory access
- MDR = Memory[ALUOut]
- Read completion
- Reg[IR[20-16]] = MDR
52Load memory access cycle
53Load read completion cycle
54Additional cycle for beq
- Branch completion
- if (A == B) PC = ALUOut
55Branch completion cycle for beq
56Additional cycle for j
- Jump completion
- PC = PC[31-28] || (IR[25-0] << 2)
57Jump completion cycle for j
58Control logic design
- Implemented as a Finite State Machine
- Inputs: 6 opcode bits
- Outputs: 16 control signals
- State: 4 bits for 10 states
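A sketch of the control FSM's next-state function, assuming one state per cycle type named in this chapter. The state names are hypothetical labels; the opcodes (0, 2, 4, 35, 43) are the MIPS values for R-type, j, beq, lw, and sw. The 16 control outputs are omitted.

```python
OP_RTYPE, OP_J, OP_BEQ, OP_LW, OP_SW = 0, 2, 4, 35, 43

STATES = [
    "fetch", "decode", "mem_address", "load_access", "load_completion",
    "store_access", "r_execute", "r_completion",
    "branch_completion", "jump_completion",
]


def next_state(state, opcode):
    """Next state as a function of the current state and the 6-bit opcode."""
    if state == "fetch":
        return "decode"
    if state == "decode":
        return {
            OP_RTYPE: "r_execute",
            OP_LW: "mem_address",
            OP_SW: "mem_address",
            OP_BEQ: "branch_completion",
            OP_J: "jump_completion",
        }[opcode]
    if state == "mem_address":
        return "load_access" if opcode == OP_LW else "store_access"
    if state == "load_access":
        return "load_completion"
    if state == "r_execute":
        return "r_completion"
    return "fetch"  # every final state returns to instruction fetch


def cycles(opcode):
    """Count the states visited before the FSM returns to fetch."""
    state, count = "fetch", 0
    while True:
        state = next_state(state, opcode)
        count += 1
        if state == "fetch":
            return count


# Number of state bits needed to encode all states.
state_bits = (len(STATES) - 1).bit_length()
```

Walking the FSM this way gives the cycle count for each instruction class directly.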
59High-level view of FSM
60Instruction fetch cycle
61Instruction decode/register fetch cycle
62R-type execution cycle
63R-type completion cycle
64Memory address computation cycle
65Store memory access cycle
66Load memory access cycle
67Load read completion cycle
68beq branch completion cycle
69j jump completion cycle
70Complete FSM
71Evaluation of the multi-cycle design
- CPI calculated based on the instruction mix
- For gcc (Figure 4.54)
- 23% loads (5 cycles each)
- 13% stores (4 cycles each)
- 19% branches (3 cycles each)
- 2% jumps (3 cycles each)
- 43% ALU (4 cycles each)
- CPI = 0.23 x 5 + 0.13 x 4 + 0.19 x 3 + 0.02 x 3 + 0.43 x 4 = 4.02
- Cycle time is calculated from the longest delay
path assuming the same timing delays as before
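The CPI figure above is a weighted average over the instruction mix, which can be checked as follows. The variable names are hypothetical; the fractions and cycle counts are the gcc values listed above.

```python
# instruction class -> (fraction of instructions, cycles per instruction)
GCC_MIX = {
    "load":   (0.23, 5),
    "store":  (0.13, 4),
    "branch": (0.19, 3),
    "jump":   (0.02, 3),
    "alu":    (0.43, 4),
}

cpi = sum(fraction * cycles for fraction, cycles in GCC_MIX.values())
print(round(cpi, 2))
```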
72Worst case datapath branch target
- ALUOut = PC + (sign-extend(IR[15-0]) << 2)
- Delay: 7 time units (versus 16 for the simple single-cycle design)
73Evaluation of the multi-cycle design
- Time per instruction of simple and multi-cycle
- TPI(simple) = CPI(simple) x cycle time(simple) = 1 x 16 = 16
- TPI(multi-cycle) = 4.02 x 7 = 28.1
- Simple single-cycle implementation is faster
- Multicycle with pipelining will be considerably
faster than single-cycle implementation
74Exceptions
- An exception is an event that causes a deviation from the normal execution of instructions
- Types of exceptions
- Operating system call (e.g., read a file, print a file)
- Input/output device request
- Page fault (request for instruction/data not in memory, Ch 7)
- Arithmetic error (overflow, underflow, etc.)
- Undefined instruction
- Misaligned memory access (e.g., word access to odd address)
- Memory protection violation
- Hardware error
- Power failure
- An exception is not usually due to an error!
- We need to be able to restart the program at the point where the exception was detected
75Handling exceptions
- Detect the exception
- Save enough information about the exception to handle it properly
- Save enough information about the program to resume it after the exception is handled
- Handle the exception
- Either terminate the program or resume executing it depending on the exception type
76Detecting exceptions
- Performed by hardware
- Overflow: determined from the opcode and the overflow output of the ALU
- Undefined instruction: determined from
- The opcode in the main control unit
- The function code and ALUOp in the ALU control logic
77Detecting exceptions
[Figure: datapath and control showing the overflow and undefined instruction signals]
78Saving exception information
- Performed by hardware
- We need the type of exception and the PC of the instruction when the exception occurred
- In MIPS, the Cause register holds the exception type
- Need an encoding for each exception type
- Need a signal from the control unit to load it into the Cause register
- and the Exception Program Counter (EPC) register holds the PC
- Need to subtract 4 from the PC register to get the correct PC (since we loaded PC + 4 into the PC register during the Instruction Fetch cycle)
- Need a signal from the control unit to load it into EPC
79Saving exception information
80Saving program information
- Needed in order to restart the program from the point where the exception occurred
- Performed by hardware and software
- EPC register holds the PC of the instruction that had the exception (where we will restart the program)
- The software routine that handles the exception saves any registers that it will need to the stack and restores them when it is done
81Handling the exception
- Performed by hardware and software
- Need to transfer control to a software routine to handle the exception (exception handler)
- The exception handler runs in a privileged mode that allows it to use special instructions and access all of memory
- Our programs run in user mode
- The hardware enables the privileged mode, loads PC with the address of the exception handler, and transitions to the Fetch state
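The hardware side of exception entry (saving EPC and Cause, then transferring control) might be modeled as below. The Cause encodings and the handler address are hypothetical placeholders, since the slides do not give them, and the dictionary layout is an assumption of this sketch.

```python
# Hypothetical encodings and handler address, for illustration only.
CAUSE_UNDEFINED_INSTRUCTION = 0
CAUSE_OVERFLOW = 1
HANDLER_ADDRESS = 0x80000180


def take_exception(cpu, cause):
    """Hardware actions when an exception is detected after instruction fetch."""
    # PC already holds PC + 4 from the fetch cycle, so subtract 4 to
    # recover the address of the excepting instruction.
    cpu["EPC"] = (cpu["PC"] - 4) & 0xFFFFFFFF
    cpu["Cause"] = cause
    cpu["privileged"] = True       # the handler runs in privileged mode
    cpu["PC"] = HANDLER_ADDRESS    # transfer control to the handler
    cpu["state"] = "fetch"         # FSM returns to instruction fetch
    return cpu


cpu = {"PC": 0x00400028, "EPC": 0, "Cause": 0,
       "privileged": False, "state": "r_execute"}
take_exception(cpu, CAUSE_OVERFLOW)
```

To resume the program, the handler would load PC from EPC, which here points back at the instruction that overflowed.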
82Handling the exception
- Loading the PC with exception handler address
83Exception handler
- Stores the values of the registers that it will need to the stack
- Handles the particular exception
- Operating system call: calls the subroutine associated with the call
- Underflow: sets register to zero or uses denormalized numbers
- I/O: handles the particular I/O request, e.g., keyboard input
- Restores registers from the stack (if program is to be restarted)
- Terminates the program, or resumes execution by loading the PC with EPC and transitioning to the Instruction Fetch state
84FSM modifications
85The Intel Pentium processor
- Introduced in 1993
- Uses a multi-cycle datapath with the following steps for integer instructions
- Prefetch (PF): read instruction from the instruction memory
- Decode 1 (D1): first stage of instruction decode
- Decode 2 (D2): second stage of instruction decode
- Execute (E): perform the ALU operation
- Write back (WB): write the result to the register file
- Datapath usage varies by instruction type
- Simple instructions make one pass through the datapath using state machine control
- Complex instructions make multiple passes, reusing the same hardware elements under microcode control
86The Intel Pentium processor
- The Pentium is a 2-way superscalar design as two instructions can simultaneously execute
- Ideal CPI for a 2-way superscalar is 0.5
- Conditions for superscalar execution
- Both must be simple instructions
- The result of the first instruction cannot be needed by the second
- Both instructions cannot write the same register
- The first instruction in program sequence cannot be a jump
[Figure: the U pipe and V pipe, showing the D1, D2, E, and WB stages]
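The four pairing conditions listed for the Pentium can be expressed as a simple check. This is a sketch of the stated rules only, not of the actual Pentium pairing logic; the `Instr` record and its fields are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Instr:
    simple: bool                   # True for a "simple" instruction
    dest: Optional[str] = None     # register written, if any
    srcs: Tuple[str, ...] = ()     # registers read
    is_jump: bool = False


def can_pair(first, second):
    """True if two adjacent instructions meet the four listed conditions."""
    if not (first.simple and second.simple):
        return False               # both must be simple instructions
    if first.dest is not None and first.dest in second.srcs:
        return False               # second needs the result of the first
    if first.dest is not None and first.dest == second.dest:
        return False               # both write the same register
    if first.is_jump:
        return False               # first in program order is a jump
    return True


a = Instr(simple=True, dest="eax", srcs=("ebx",))
b = Instr(simple=True, dest="ecx", srcs=("edx",))
c = Instr(simple=True, dest="edx", srcs=("eax",))
independent = can_pair(a, b)   # no dependence between a and b
dependent = can_pair(a, c)     # c reads eax, which a writes
```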
87The Intel Pentium Pro processor
- Introduced in 1995 as the successor to the Pentium
- The basis for the Pentium II and Pentium III
- Implements a 14-cycle, 3-way superscalar integer datapath
- Very high frequency is the goal
- Uses out-of-order execution in that instructions may execute out of their original program order
- Completely handled by hardware transparently to software
- Instructions execute as soon as their source operands become available
- Complicates exception handling
- Some instructions before the excepting one may not have executed, while some after it may have executed
88The Intel Pentium Pro processor
- Pentium Pro designers (and AMD designers before them) used innovative engineering to overcome the disadvantages of CISC ISAs
- Many complex x86 instructions are internally translated by hardware into RISC-like micro-ops with state machine control
- Achieves a very low CPI for simple integer operations even on programs compiled for older implementations
- Combination of high frequency and low CPI gave the Pentium Pro extremely competitive integer performance versus RISC microprocessors
- Result has been that RISC CPUs have failed to gain the desktop market share that had been expected
89The Intel Pentium 4 processor
- 20 cycle superscalar integer pipeline
- Extremely high frequency (> 3 GHz)
- Major effort to lower power dissipation
- Clock gating: clock to a unit is turned off when the unit is not in use
- Trace cache: caches micro-ops of previously decoded complex instructions to avoid power-consuming decode operation
90Questions?