ECE200 - PowerPoint PPT Presentation

About This Presentation
Title:

ECE200

Slides: 91
Provided by: PAJ58

Transcript and Presenter's Notes

1
ECE200 Computer Organization
  • Chapter 5 The Processor Datapath and Control

2
Homework 5
  • 5.5, 5.7, 5.9, 5.15, 5.16, 5.22, 5.24, 5.29

3
What we've covered so far
  • Computer abstractions and technology (Ch 1)
  • Defining, measuring, evaluating performance (Ch
    2)
  • Instruction set architecture and assembly
    language programming (Ch 3)
  • Computer arithmetic (Ch 4)
  • Basic CPU organization (Ch 5)
  • Advanced CPU organization (Ch 6)
  • Caches and main memories (Ch 7)
  • Input/Output (Ch 8 and Motorola HC11 manuals)
  • Multiprocessors (Ch 9) if we get to it

4
Outline for Chapter 5 lectures
  • Goals in processor implementation
  • Brief review of sequential logic design
  • Pieces of the processor implementation puzzle
  • A simple implementation of a MIPS integer
    instruction subset
  • Datapath
  • Control logic design
  • A multi-cycle MIPS implementation
  • Datapath
  • Control logic design
  • Microcoded control
  • Exceptions
  • Some real microprocessor datapath and control

5
Goals in processor implementation
  • Balance the rate of supply of instructions and
    data and the rate at which the execution core can
    consume them and can update memory

[Figure: instruction supply and data supply feeding the execution core]
6
Goals in processor implementation
  • Recall from Chapter 2
  • CPU Time = INST x CPI x CT
  • INST is largely a function of the ISA and compiler
  • Objective: minimize CPI x CT within design
    constraints (cost, power, etc.)
  • Trading off CPI and CT is tricky

[Figure: multiplier and logic blocks]
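The CPU time relation on this slide can be tried out numerically. The sketch below is only an illustration: the instruction count, CPI values, and cycle times are made-up numbers, chosen to show that a lower CPI does not help if the cycle time grows too much (and vice versa).

```python
def cpu_time(inst_count, cpi, cycle_time):
    """CPU Time = INST x CPI x CT, with cycle_time in seconds per cycle."""
    return inst_count * cpi * cycle_time

# Hypothetical designs running the same 1,000,000-instruction program.
design_a = cpu_time(1_000_000, cpi=1.0, cycle_time=16e-9)  # low CPI, slow clock
design_b = cpu_time(1_000_000, cpi=4.0, cycle_time=5e-9)   # high CPI, fast clock

faster = "A" if design_a < design_b else "B"
```

With these assumed numbers design A wins even though design B has the faster clock, which is the kind of tradeoff the slide calls tricky.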
7
Brief review of sequential logic design
  • State elements are clocked devices
  • Flip flops, etc
  • Combinatorial elements hold no state
  • ALU, caches, multiplier, multiplexers, etc.
  • In edge triggered clocking, state elements are
    only updated on the (rising) edge of the clock
    pulse

8
Brief review of sequential logic design
  • The same state element can be read at the
    beginning of a clock cycle and updated at the end
  • Example: incrementing the PC

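[Figure: timing and block diagrams for incrementing the PC. The PC register output (8) is the Add input, the adder adds 4, and the Add output (12) is loaded into the PC register on the next clock edge.]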
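The read-then-update behavior of an edge-triggered state element can be sketched in software. This is a minimal model, not a hardware description: the `Register` class and its `d`/`q` attribute names are assumptions made for illustration.

```python
class Register:
    """Edge-triggered register: the visible output changes only on tick()."""

    def __init__(self, value=0):
        self.q = value  # value visible during the current cycle
        self.d = value  # value waiting at the input

    def tick(self):
        self.q = self.d  # rising clock edge: capture the input


pc = Register(8)
trace = []
for _ in range(3):
    pc.d = pc.q + 4             # the adder reads the current PC during the cycle
    trace.append((pc.q, pc.d))  # (Add input, Add output) seen this cycle
    pc.tick()                   # the PC is updated at the end of the cycle
```

Each tuple in `trace` pairs the value read at the start of a cycle with the value written at its end, matching the 8 to 12 step in the slide's example.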
9
Our processor design progression
  • (1) Instruction fetch, execute, and operand reads
    from data memory all take place in a single clock
    cycle
  • (2) Instruction fetch, execute, and operand reads
    from data memory take place in successive clock
    cycles
  • (3) A pipelined design (Chapter 6)

10
Pieces of the processor puzzle
  • Instruction fetch
  • Execution
  • Data memory

[Figure: instruction supply and data supply feeding the execution core]
11
Instruction fetch datapath
  • Memory to hold instructions
  • Register to hold the instruction memory address
  • Logic to generate the next instruction address

[Figure: instruction fetch datapath; the next instruction address is PC + 4]
12
Execution datapath
  • Focus on only a subset of all MIPS instructions
  • add, sub, and, or
  • lw, sw
  • slt
  • beq, j
  • For all instructions except j, we
  • Read operands from the register file
  • Perform an ALU operation
  • For all instructions except sw, beq, and j, we
    write a result into the register file

13
Execution datapath
  • Register file block diagram
  • Read register 1, 2: source operand register
    numbers
  • Read data 1, 2: source operands (32 bits each)
  • Write register: destination operand register
    number
  • Write data: data written into register file
  • RegWrite: when asserted, enables the writing of
    Write Data
14
Execution datapath
  • Datapath for R-type (add, sub, and, or, slt)
  • R-type instruction format

R-type format (bit positions in brackets):
op [31-26] | rs [25-21] | rt [20-16] | rd [15-11] | shamt [10-6] | funct [5-0]
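The R-type field layout can be exercised with a few shifts and masks. The sketch below uses hypothetical helper names (`encode_r_type`, `decode_r_type`) and encodes add $4, $18, $30, where rd = 4, rs = 18, rt = 30, and the funct code for add is 100000.

```python
def encode_r_type(rs, rt, rd, shamt, funct, op=0):
    """Pack R-type fields into a 32-bit instruction word."""
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct


def decode_r_type(word):
    """Split a 32-bit instruction word into its R-type fields."""
    return {
        "op": (word >> 26) & 0x3F,     # bits 31-26
        "rs": (word >> 21) & 0x1F,     # bits 25-21
        "rt": (word >> 16) & 0x1F,     # bits 20-16
        "rd": (word >> 11) & 0x1F,     # bits 15-11
        "shamt": (word >> 6) & 0x1F,   # bits 10-6
        "funct": word & 0x3F,          # bits 5-0
    }


word = encode_r_type(rs=18, rt=30, rd=4, shamt=0, funct=0b100000)
fields = decode_r_type(word)
```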
15
Execution datapath
  • Datapath for beq instruction
  • I-type instruction format
  • Zero ALU output indicates whether rs = rt (branch
    is taken/not taken)
  • Branch target address is the sign extended
    immediate left shifted two positions, and added
    to PC + 4

I-type format (bit positions in brackets):
op [31-26] | rs [25-21] | rt [20-16] | immediate [15-0]
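The branch target calculation for beq can be sketched as follows. The function names are assumptions for illustration; the arithmetic follows the slide: sign extend the 16-bit immediate, shift it left two positions, and add it to PC + 4.

```python
def sign_extend16(imm):
    """Interpret the low 16 bits of imm as a signed two's complement value."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm


def branch_target(pc, imm16):
    """Target = (PC + 4) + (sign-extended immediate << 2), wrapped to 32 bits."""
    return (pc + 4 + (sign_extend16(imm16) << 2)) & 0xFFFFFFFF


def next_pc_beq(pc, rs_val, rt_val, imm16):
    """The ALU subtracts the operands; a zero result means the branch is taken."""
    zero = (rs_val - rt_val) == 0
    return branch_target(pc, imm16) if zero else pc + 4
```

For example, `next_pc_beq(0x100, 5, 5, 3)` yields `0x110`, while unequal operands fall through to `0x104`.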
16
Data memory
  • Used for lw, sw (I-type format)
  • Block diagram
  • Address: memory location to be read or written
  • Read data: data out of the memory on a load
  • Write data: data into the memory on a store
  • MemRead: indicates a read operation is to be
    performed
  • MemWrite: indicates a write operation is to be
    performed

17
Execution datapath data memory
  • Datapath for lw, sw
  • Address is the sign-extended immediate added to
    the source operand read out of the register file
  • sw: data written to memory from specified
    register
  • lw: data written to register file from specified
    memory address
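A small software model makes the lw and sw data movement concrete. This is a sketch under simplifying assumptions: memory is a Python dict keyed by byte address, the register file is a list, and the helper names are hypothetical.

```python
def sign_extend16(imm):
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm


def effective_address(base, imm16):
    """Address = base register value + sign-extended immediate."""
    return (base + sign_extend16(imm16)) & 0xFFFFFFFF


regs = [0] * 32
memory = {}


def sw(rt, imm16, rs):
    """Store: register rt is written to memory."""
    memory[effective_address(regs[rs], imm16)] = regs[rt]


def lw(rt, imm16, rs):
    """Load: memory contents are written to register rt."""
    regs[rt] = memory.get(effective_address(regs[rs], imm16), 0)


regs[3], regs[10] = 0x1000, 42
sw(10, 0, 3)            # sw $10, 0($3)
regs[2] = 0x1000 - 112
lw(8, 112, 2)           # lw $8, 112($2) reads the same location back
```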

18
Putting the pieces together
  • Single clock cycle for fetch, execute, and
    operand read from data memory
  • 3 MUXes
  • Register file operand or sign extended immediate
    to ALU
  • ALU or data memory output written to register
    file
  • PC + 4 or branch target address written to PC
    register

19
Datapath for R-type instructions
Example: add $4, $18, $30
20
Datapath for I-type ALU instructions
Example: slti $7, $4, 100
21
Datapath for not taken beq instruction
Example: beq $28, $13, EXIT
22
Datapath for taken beq instruction
Example: beq $28, $13, EXIT
23
Datapath for load instruction
Example: lw $8, 112($2)
24
Datapath for store instruction
Example: sw $10, 0($3)
25
Control signals we need to generate
26
ALU operation control
  • ALU control input codes from Chapter 4
  • Two steps to generate the ALU control input
  • Use the opcode to distinguish R-type, lw and sw,
    and beq
  • If R-type, use funct field to determine the ALU
    control input

ALU control input   ALU operation      Used for
000                 and                and
001                 or                 or
010                 add                add, lw, sw
110                 subtract           sub, beq
111                 set on less than   slt
27
ALU operation control
  • Opcode used to generate a 2-bit signal called
    ALUOp with the following encodings
  • 00: lw or sw, perform an ALU add
  • 01: beq, perform an ALU subtract
  • 10: R-type, ALU operation is determined by the
    funct field

Funct    Instruction   ALU control input
100000   add           010
100010   sub           110
100100   and           000
100101   or            001
101010   slt           111
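The two-step ALU control generation can be written as a small lookup. The sketch below assumes a function named `alu_control`; the ALUOp encodings and the funct-to-control mapping are taken from the tables above.

```python
# funct field -> 3-bit ALU control input (used only when ALUOp is 10)
FUNCT_TO_ALU = {
    0b100000: 0b010,  # add
    0b100010: 0b110,  # sub
    0b100100: 0b000,  # and
    0b100101: 0b001,  # or
    0b101010: 0b111,  # slt
}


def alu_control(alu_op, funct=0):
    """Return the 3-bit ALU control input for a 2-bit ALUOp and funct field."""
    if alu_op == 0b00:   # lw or sw: add
        return 0b010
    if alu_op == 0b01:   # beq: subtract
        return 0b110
    if alu_op == 0b10:   # R-type: decided by the funct field
        return FUNCT_TO_ALU[funct]
    raise ValueError("unsupported ALUOp")
```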
28
Comparing instruction fields
Bits:      [31-26]   [25-21]   [20-16]   [15-11]   [10-6]   [5-0]
R-type:    0         rs        rt        rd        shamt    funct
beq:       4         rs        rt        immediate (offset) [15-0]
lw (sw):   35 (43)   rs        rt        immediate (offset) [15-0]
  • Opcode, source registers, function code, and
    immediate fields always in same place
  • Destination register is
  • bits 15-11 (rd) for R-type
  • bits 20-16 (rt) for lw
  • MUX to select the right one

29
Datapath with instr fields and ALU control
30
Main control unit design
31
Main control unit design
  • Truth table

32
Adding support for jump instructions
  • J-type format
  • Next PC formed by shifting left the 26-bit target
    two bits and combining it with the 4 high-order
    bits of PC + 4
  • Now the next PC will be one of
  • PC + 4
  • beq target address
  • j target address
  • We need another MUX and control bit

J-type format (bit positions in brackets):
op = 2 [31-26] | target [25-0]
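The jump address formation can be sketched with a mask and a shift. The function name `jump_target` is an assumption for illustration; the computation follows the slide: the 26-bit target shifted left two bits, combined with the 4 high-order bits of PC + 4.

```python
def jump_target(pc, target26):
    """Next PC for j: upper 4 bits of PC + 4, then the 26-bit target << 2."""
    pc_plus_4 = (pc + 4) & 0xFFFFFFFF
    return (pc_plus_4 & 0xF0000000) | ((target26 & 0x3FFFFFF) << 2)
```

For instance, a jump at address `0x10000000` with a target field of `0x100` goes to `0x10000400`.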
33
Adding support for jump instructions
34
Evaluation of the simple implementation
  • All instructions take one clock cycle (CPI = 1)
  • Assume the following worst case delays
  • Instruction memory: 4 time units
  • Data memory: 4 time units (read), 2 time units
    (write)
  • ALU: 4 time units
  • Adders: 3 time units
  • Register file: 2 time units (read), 1 time unit
    (write)
  • MUXes, sign extension, gates, and shifters: 1
    time unit
  • Large disparity in worst case delays among
    instruction types
  • R-type: 4 + 2 + 1 + 4 + 1 + 1 = 13 time units
  • beq: 4 + 2 + 1 + 4 + 1 + 1 + 1 = 14 time units
  • j: 4 + 1 + 1 = 6 time units
  • store: 4 + 2 + 4 + 2 = 12 time units
  • load: 4 + 2 + 4 + 4 + 1 + 1 = 16 time units
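The single-cycle clock period is set by the slowest instruction, which a short calculation confirms. The per-instruction delay lists below are copied from the sums on this slide; the variable names are assumptions.

```python
# Component delays along each instruction's path, in time units.
PATH_DELAYS = {
    "R-type": [4, 2, 1, 4, 1, 1],
    "beq":    [4, 2, 1, 4, 1, 1, 1],
    "j":      [4, 1, 1],
    "store":  [4, 2, 4, 2],
    "load":   [4, 2, 4, 4, 1, 1],
}

totals = {name: sum(path) for name, path in PATH_DELAYS.items()}
cycle_time = max(totals.values())     # every instruction pays for the slowest
slowest = max(totals, key=totals.get)
```

With these delays the cycle time is 16 time units, set by the load, so a jump that needs only 6 units still occupies a full 16-unit cycle.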

35
Evaluation of the simple implementation
  • Disparity would be worse in a real machine
  • Even slower integer instructions (e.g.,
    multiply/divide in MIPS)
  • Floating point instructions
  • Simple instructions take as long as complex ones

36
A multicycle implementation
  • Instruction fetch, register file access, etc
    occur in separate clock cycles
  • Different instruction types take different
    numbers of cycles to complete
  • Clock cycle time should be faster

37
High level view of datapath
  • New registers store results of each step
  • Not programmer visible!
  • Hardware can be shared
  • One ALU for PC + 4, branch target calculation,
    effective address (EA) calculation, and arithmetic
    operations
  • One memory for instructions and data

38
Detailed multi-cycle datapath
39
Multi-cycle control
40
First two cycles for all instructions
  • Instruction fetch (1st cycle)
  • Load the instruction into the IR register
  • IR = Memory[PC]
  • Increment the PC
  • PC = PC + 4
  • Instruction decode and register fetch (2nd cycle)
  • Read register file locations rs and rt, results
    into the A and B registers
  • A = Reg[IR[25-21]]
  • B = Reg[IR[20-16]]
  • Calculate the branch target address and load into
    ALUOut
  • ALUOut = PC + (sign-extend(IR[15-0]) << 2)

41
Instruction fetch
  • IR = Mem[PC]

42
Instruction fetch
  • PC = PC + 4

43
Instruction decode and register fetch
  • A = Reg[IR[25-21]], B = Reg[IR[20-16]]

44
Instruction decode and register fetch
  • ALUOut = PC + (sign-extend(IR[15-0]) << 2)

45
Additional cycles for R-type
  • Execution
  • ALUOut = A op B
  • Completion
  • Reg[IR[15-11]] = ALUOut

46
R-type execution cycle
  • ALUOut = A op B

47
R-type completion cycle
  • Reg[IR[15-11]] = ALUOut

48
Additional cycles for store
  • Address computation
  • ALUOut = A + sign-extend(IR[15-0])
  • Memory access
  • Memory[ALUOut] = B

49
Store address computation cycle
  • ALUOut = A + sign-extend(IR[15-0])

50
Store memory access cycle
  • Memory[ALUOut] = B

51
Additional cycles for load
  • Address computation
  • ALUOut = A + sign-extend(IR[15-0])
  • Memory access
  • MDR = Memory[ALUOut]
  • Read completion
  • Reg[IR[20-16]] = MDR

52
Load memory access cycle
  • MDR = Memory[ALUOut]

53
Load read completion cycle
  • Reg[IR[20-16]] = MDR

54
Additional cycle for beq
  • Branch completion
  • if (A == B) PC = ALUOut

55
Branch completion cycle for beq
  • if (A == B) PC = ALUOut

56
Additional cycle for j
  • Jump completion
  • PC = PC[31-28] || (IR[25-0] << 2)

57
Jump completion cycle for j
  • PC = PC[31-28] || (IR[25-0] << 2)
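The per-cycle register transfers on the preceding slides can be strung together into a small simulator. This is a rough sketch, not the course's design: memory is a dict holding both instructions and data, only add and sub are modeled for R-type, and the names `run_instruction`, `i_type`, and `state` are assumptions. Opcodes 0, 4, 35, 43, and 2 are the R-type, beq, lw, sw, and j values used in these slides.

```python
def sign_extend16(imm):
    return imm - 0x10000 if imm & 0x8000 else imm


def run_instruction(state):
    """Run one instruction step by step; return the number of cycles used."""
    mem, reg = state["mem"], state["reg"]
    ir = mem[state["pc"]]                    # cycle 1: IR = Memory[PC]
    state["pc"] += 4                         #          PC = PC + 4
    op = ir >> 26
    a = reg[(ir >> 21) & 0x1F]               # cycle 2: A = Reg[IR[25-21]]
    b = reg[(ir >> 16) & 0x1F]               #          B = Reg[IR[20-16]]
    imm = sign_extend16(ir & 0xFFFF)
    alu_out = state["pc"] + (imm << 2)       #          branch target
    if op == 0:                              # R-type (add or sub only)
        alu_out = a + b if (ir & 0x3F) == 0b100000 else a - b  # cycle 3
        reg[(ir >> 11) & 0x1F] = alu_out & 0xFFFFFFFF          # cycle 4
        return 4
    if op == 35:                             # lw
        alu_out = a + imm                    # cycle 3: address computation
        mdr = mem[alu_out]                   # cycle 4: MDR = Memory[ALUOut]
        reg[(ir >> 16) & 0x1F] = mdr         # cycle 5: Reg[IR[20-16]] = MDR
        return 5
    if op == 43:                             # sw
        alu_out = a + imm                    # cycle 3: address computation
        mem[alu_out] = b                     # cycle 4: Memory[ALUOut] = B
        return 4
    if op == 4:                              # beq
        if a == b:                           # cycle 3: branch completion
            state["pc"] = alu_out
        return 3
    if op == 2:                              # j, cycle 3: jump completion
        state["pc"] = (state["pc"] & 0xF0000000) | ((ir & 0x3FFFFFF) << 2)
        return 3
    raise ValueError("unsupported opcode")


def i_type(op, rs, rt, imm):
    return (op << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)


state = {"pc": 0, "reg": [0] * 32, "mem": {
    0: i_type(35, 0, 1, 100),                         # lw  $1, 100($0)
    4: (1 << 21) | (1 << 16) | (2 << 11) | 0b100000,  # add $2, $1, $1
    8: i_type(43, 0, 2, 104),                         # sw  $2, 104($0)
    100: 21,
}}
cycles = [run_instruction(state) for _ in range(3)]
```

Running the three-instruction program takes 5, 4, and 4 cycles and leaves 42 at address 104.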

58
Control logic design
  • Implemented as a Finite State Machine
  • Inputs: 6 opcode bits
  • Outputs: 16 control signals
  • State: 4 bits for 10 states
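The next-state logic of such a finite state machine can be sketched as a function of the current state and the opcode. The state names below are invented labels for the ten cycles named in these slides, and the 16 control outputs are omitted; only the state sequencing is modeled.

```python
STATES = [
    "fetch", "decode", "mem_addr", "load_access", "load_complete",
    "store_access", "rtype_exec", "rtype_complete",
    "beq_complete", "jump_complete",
]

# Opcode values used in these slides: R-type 0, lw 35, sw 43, beq 4, j 2.
AFTER_DECODE = {0: "rtype_exec", 35: "mem_addr", 43: "mem_addr",
                4: "beq_complete", 2: "jump_complete"}


def next_state(state, opcode):
    if state == "fetch":
        return "decode"
    if state == "decode":
        return AFTER_DECODE[opcode]
    if state == "mem_addr":
        return "load_access" if opcode == 35 else "store_access"
    if state == "load_access":
        return "load_complete"
    if state == "rtype_exec":
        return "rtype_complete"
    return "fetch"  # every completion state returns to instruction fetch


def cycles_for(opcode):
    """Count the states visited before the FSM returns to fetch."""
    state, cycles = "fetch", 0
    while True:
        cycles += 1
        state = next_state(state, opcode)
        if state == "fetch":
            return cycles


STATE_BITS = (len(STATES) - 1).bit_length()  # bits needed to number 10 states
```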

59
High-level view of FSM
60
Instruction fetch cycle
61
Instruction decode/register fetch cycle
62
R-type execution cycle
63
R-type completion cycle
64
Memory address computation cycle
65
Store memory access cycle
66
Load memory access cycle
67
Load read completion cycle
68
beq branch completion cycle
69
j jump completion cycle
70
Complete FSM
71
Evaluation of the multi-cycle design
  • CPI calculated based on the instruction mix
  • For gcc (Figure 4.54)
  • 23% loads (5 cycles each)
  • 13% stores (4 cycles each)
  • 19% branches (3 cycles each)
  • 2% jumps (3 cycles each)
  • 43% ALU (4 cycles each)
  • CPI = 0.23 x 5 + 0.13 x 4 + 0.19 x 3 + 0.02 x 3 + 0.43 x 4 = 4.02
  • Cycle time is calculated from the longest delay
    path assuming the same timing delays as before
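The CPI figure is a weighted average over the instruction mix, which can be checked directly. The dictionary name `MIX` is an assumption; the fractions and cycle counts are the gcc numbers from this slide.

```python
# instruction class -> (fraction of instructions, cycles per instruction)
MIX = {
    "load":   (0.23, 5),
    "store":  (0.13, 4),
    "branch": (0.19, 3),
    "jump":   (0.02, 3),
    "alu":    (0.43, 4),
}

total_fraction = sum(fraction for fraction, _ in MIX.values())
cpi = sum(fraction * cycles for fraction, cycles in MIX.values())
```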

72
Worst case datapath branch target
  • ALUOut PC(sign-extend (IR15-0) ltlt2)
  • Delay: 7 time units (versus 16 for the simple
    single-cycle design)

73
Evaluation of the multi-cycle design
  • Time per instruction of simple and multi-cycle
  • TPI(simple) = CPI(simple) x cycle time(simple)
    = 1 x 16 = 16
  • TPI(multi-cycle) = 4.02 x 7 = 28.1
  • Simple single-cycle implementation is faster
  • Multicycle with pipelining will be considerably
    faster than single-cycle implementation
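The comparison can be reproduced with the numbers from these slides. The break-even figure at the end is an extra calculation added here for illustration: it asks how short the multi-cycle clock would have to be, at a CPI of 4.02, to match the single-cycle design.

```python
tpi_simple = 1 * 16       # CPI of 1, cycle time of 16 time units
tpi_multi = 4.02 * 7      # CPI of 4.02, cycle time of 7 time units

# Cycle time at which the multi-cycle design would match the simple one.
break_even_cycle_time = tpi_simple / 4.02
```

The break-even cycle time comes out just under 4 time units, well below the 7 units set by the branch target path.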

74
Exceptions
  • An exception is an event that causes a deviation
    from the normal execution of instructions
  • Types of exceptions
  • Operating system call (e.g., read a file, print a
    file)
  • Input/output device request
  • Page fault (request for instruction/data not in
    memory; see Ch 7)
  • Arithmetic error (overflow, underflow, etc.)
  • Undefined instruction
  • Misaligned memory access (e.g., word access to
    odd address)
  • Memory protection violation
  • Hardware error
  • Power failure
  • An exception is not usually due to an error!
  • We need to be able to restart the program at the
    point where the exception was detected

75
Handling exceptions
  • Detect the exception
  • Save enough information about the exception to
    handle it properly
  • Save enough information about the program to
    resume it after the exception is handled
  • Handle the exception
  • Either terminate the program or resume executing
    it depending on the exception type

76
Detecting exceptions
  • Performed by hardware
  • Overflow determined from the opcode and the
    overflow output of the ALU
  • Undefined instruction determined from
  • The opcode in the main control unit
  • The function code and ALUop in the ALU control
    logic

77
Detecting exceptions
[Figure: datapath and control showing the overflow and undefined instruction detection signals]
78
Saving exception information
  • Performed by hardware
  • We need the type of exception and the PC of the
    instruction when the exception occurred
  • In MIPS, the Cause register holds the exception
    type
  • Need an encoding for each exception type
  • Need a signal from the control unit to load it
    into the Cause register
  • and the Exception Program Counter (EPC) register
    holds the PC
  • Need to subtract 4 from the PC register to get
    the correct PC (since we loaded PC + 4 into the PC
    register during the Instruction Fetch cycle)
  • Need a signal from the control unit to load it
    into EPC
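Overflow detection and the saving of exception information can be sketched together. The cause encodings below are assumed values chosen for illustration, and the `state` dictionary and function names are hypothetical; the EPC calculation follows the slide by subtracting 4 from the already incremented PC.

```python
CAUSE_UNDEFINED_INSTRUCTION = 0  # assumed encoding, for illustration only
CAUSE_OVERFLOW = 1               # assumed encoding, for illustration only


def add_overflows(a, b):
    """Add two 32-bit patterns; return (32-bit result, signed overflow flag)."""
    result = (a + b) & 0xFFFFFFFF
    sign_a, sign_b, sign_r = a >> 31 & 1, b >> 31 & 1, result >> 31 & 1
    # Signed overflow: operands share a sign that the result does not have.
    return result, (sign_a == sign_b and sign_r != sign_a)


def save_exception(state, cause):
    state["cause"] = cause             # Cause register holds the exception type
    state["epc"] = state["pc"] - 4     # PC already holds PC + 4 after fetch


state = {"pc": 0x104, "cause": None, "epc": None}  # instruction was at 0x100
result, overflow = add_overflows(0x7FFFFFFF, 1)
if overflow:
    save_exception(state, CAUSE_OVERFLOW)
```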

79
Saving exception information
80
Saving program information
  • Needed in order to restart the program from the
    point where the exception occurred
  • Performed by hardware and software
  • EPC register holds the PC of the instruction that
    had the exception (where we will restart the
    program)
  • The software routine that handles the exception
    saves any registers that it will need to the
    stack and restores them when it is done

81
Handling the exception
  • Performed by hardware and software
  • Need to transfer control to a software routine to
    handle the exception (exception handler)
  • The exception handler runs in a privileged mode
    that allows it to use special instructions and
    access all of memory
  • Our programs run in user mode
  • The hardware enables the privileged mode, loads
    PC with the address of the exception handler, and
    transitions to the Fetch state

82
Handling the exception
  • Loading the PC with exception handler address

83
Exception handler
  • Stores the values of the registers that it will
    need to the stack
  • Handles the particular exception
  • Operating system call: calls the subroutine
    associated with the call
  • Underflow: sets register to zero or uses
    denormalized numbers
  • I/O: handles the particular I/O request, e.g.,
    keyboard input
  • Restores registers from the stack (if program is
    to be restarted)
  • Terminates the program, or resumes execution by
    loading the PC with EPC and transitioning to the
    Instruction Fetch state

84
FSM modifications
85
The Intel Pentium processor
  • Introduced in 1993
  • Uses a multi-cycle datapath with the following
    steps for integer instructions
  • Prefetch (PF): read instruction from the
    instruction memory
  • Decode 1 (D1): first stage of instruction decode
  • Decode 2 (D2): second stage of instruction decode
  • Execute (E): perform the ALU operation
  • Write back (WB): write the result to the register
    file
  • Datapath usage varies by instruction type
  • Simple instructions make one pass through the
    datapath using state machine control
  • Complex instructions make multiple passes,
    reusing the same hardware elements under
    microcode control

86
The Intel Pentium processor
  • The Pentium is a 2-way superscalar design as two
    instructions can simultaneously execute
  • Ideal CPI for a 2-way superscalar is 0.5
  • Conditions for superscalar execution
  • Both must be simple instructions
  • The result of the first instruction cannot be
    needed by the second
  • Both instructions cannot write the same register
  • The first instruction in program sequence cannot
    be a jump

[Figure: U pipe and V pipe stages (D1, D2, E, WB)]
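The four pairing conditions listed on this slide can be expressed as a predicate over two adjacent instructions. The `Instr` record and its fields are a hypothetical simplification for illustration, not a model of the real Pentium pairing rules, which have further cases.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Instr:
    simple: bool                  # True for a simple (non-microcoded) instruction
    dest: Optional[int] = None    # destination register, if any
    srcs: Tuple[int, ...] = ()    # source registers
    is_jump: bool = False


def can_pair(first, second):
    """Apply the slide's four conditions to two instructions in program order."""
    if not (first.simple and second.simple):
        return False              # both must be simple instructions
    if first.dest is not None and first.dest in second.srcs:
        return False              # second needs the result of the first
    if first.dest is not None and first.dest == second.dest:
        return False              # both write the same register
    if first.is_jump:
        return False              # first instruction cannot be a jump
    return True
```

For example, two simple instructions writing different registers with no shared operands can issue together, while a pair where the second reads the first's destination cannot.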
87
The Intel Pentium Pro processor
  • Introduced in 1995 as the successor to the
    Pentium
  • The basis for the Pentium II and Pentium III
  • Implements a 14-cycle, 3-way superscalar integer
    datapath
  • Very high frequency is the goal
  • Uses out-of-order execution in that instructions
    may execute out of their original program order
  • Completely handled by hardware transparently to
    software
  • Instructions execute as soon as their source
    operands become available
  • Complicates exception handling
  • Some instructions before the excepting one may
    not have executed, while some after it may have
    executed

88
The Intel Pentium Pro processor
  • Pentium Pro designers (and AMD designers before
    them) used innovative engineering to overcome the
    disadvantages of CISC ISAs
  • Many complex x86 instructions are internally
    translated by hardware into RISC-like micro-ops
    with state machine control
  • Achieves a very low CPI for simple integer
    operations even on programs compiled for older
    implementations
  • Combination of high frequency and low CPI gave
    the Pentium Pro extremely competitive integer
    performance versus RISC microprocessors
  • Result has been that RISC CPUs have failed to
    gain the desktop market share that had been
    expected

89
The Intel Pentium 4 processor
  • 20-cycle superscalar integer pipeline
  • Extremely high frequency (> 3 GHz)
  • Major effort to lower power dissipation
  • Clock gating: clock to a unit is turned off when
    the unit is not in use
  • Trace cache: caches micro-ops of previously
    decoded complex instructions to avoid
    power-consuming decode operation

90
Questions?