CS5222 Advanced Computer Architecture Part 4: Superscalar Processors
1
CS5222 Advanced Computer Architecture, Part 4:
Superscalar Processors
  • Fall Term, 2004/2005
  • Chi Chi Hung (email: chich@comp.nus.edu.sg)
  • Building S/17, Rm 5-13
  • Phone 874-2832

2
Part A
  • Emergence of Superscalar Processors

3
Introduction
  • Three phases
  • Idea (in early 70s)
  • Architecture proposals and prototype machines
  • Commercial products

4
Proposed/Prototypes
5
Appearance of Superscalar Processors (I)
  • Superscalar processors appeared in the early
    1990s, when VLSI technology started to accelerate.

6
Appearance of Superscalar Processors (II)
  • They came from either
  • Converting an existing (scalar) RISC line,
    e.g. Intel 960, MC 88000, HP PA, Sun SPARC, MIPS
    R, AMD, or
  • Conceiving a new architecture, e.g. Power 1,
    Alpha.
  • CISC superscalar processors appeared later than
    RISC ones because of
  • The complexity of decoding multiple variable-length
    instructions,
  • The complexity of handling the memory architecture, and
  • The relatively lower issue rate of CISC processors
    (note that for the same performance, a RISC
    superscalar processor needs a higher issue rate).

7
Commercial Superscalar Processors
8
Part B
  • Tasks of Superscalar Processing

9
Specific Tasks of Superscalar Processing (I)
10
Specific Tasks of Superscalar Processing (II)
  • Five aspects:
  • Parallel decoding
  • Sophisticated hardware with a high issue rate.
  • Length of the decoding stage: multiple cycles?
    Predecoding?
  • Superscalar instruction issue
  • A high issue rate implies a smaller gap between two
    sequential instructions.
  • Amplifies the restrictive effects of control/data
    dependencies.
  • Sample solutions: shelving, register renaming,
    speculative branch processing.
  • Parallel instruction execution
  • Preserving sequential consistency of execution
  • Retain logical consistency of program execution
    despite out-of-order execution.
  • Preserving sequential consistency of exception
    processing

11
Part C
  • Parallel Decoding

12
Sequential Decoding vs. Parallel Decoding
13
Basic Ideas of Parallel Decoding
  • Parallel decoding: decoding multiple instructions
    per cycle.
  • Hardware complexity increases with issue rate.
  • Check dependencies w.r.t.
  • Instructions currently being executed.
  • Instruction candidates to be issued next.
  • Decoding multiple instructions in a clock cycle
  • The decode-issue path becomes critical for high
    clock frequencies.
  • Solutions
  • Multiple pipeline cycles for decoding
  • E.g. PowerPC 601/604 and UltraSPARC: 2 cycles;
    Alpha 21064: 3 cycles; Pentium Pro: 4.5 cycles.
  • Predecoding

14
Principle of Pre-Decoding
  • Move part of the decode task into the loading phase
    of the on-chip instruction cache.
  • Shortens the overall decoding time, or reduces the
    number of cycles needed for decoding and instruction
    issue.
  • Appends a number of decode bits to each instruction:
  • Instruction class
  • Type of resources required for execution
  • Calculation of the branch address (for some
    processors)
  • CISC processors require more bits for information
    such as variable instruction length (e.g. the
    start/end of an instruction).
  • Extra cache space is required. E.g. the K5 adds 5
    extra bits to each byte.
  • Common to most predominant processor lines.

15
Example of Pre-Decoding
16
Pre-decode Bits
17
Superscalar Processing with Pre-Decoding
18
Part D
Superscalar Instruction Issue
19
Design Space
  • Issue policy specifies how dependencies are
    handled during issue process.
  • Issue rate specifies the max. no. of instructions
    a superscalar processor is able to issue in each
    cycle.

20
Design Space of Issue Policies (I)
21
Design Space of Issue Policies (II)
  • Four main aspects:
  • False data dependencies
  • E.g. WAR, WAW (note that this applies to
    registers only, not memory).
  • Solution: register renaming, i.e. renaming the
    destination register. The result is written
    into a dynamically allocated spare register
    instead of the specified one.
  • Unresolved control dependencies
  • Solution: speculative branch processing. A guess
    about the outcome of the unresolved conditional
    branch is made.
  • Use of shelving
  • Separates issue/dispatch into two stages.
  • Handles blockages either directly (with an issue
    window) or by decoupling (no dependency checking
    at issue).
  • Handling of issue blockages
  • Preserving issue order: in-order vs. out-of-order
  • Alignment of issue: aligned vs. unaligned issue

22
Principle of Blocking Issue Mode
23
Principle of Shelving
24
Design Aspects Related to Handling of Blockages
25
Issue Order of Instructions (I)
26
Issue Order of Instructions (II)
  • In-order
  • A dependent instruction blocks the issue of
    all subsequent instructions until the dependency
    is resolved.
  • Out-of-order
  • An independent instruction can be issued even if
    a dependent instruction is still in the issue
    window.
  • Some processors allow partial out-of-order issue.
    E.g. the PowerPC 601 issues branches and FP
    instructions out of order; the MC 88100 does so
    only for FP instructions.
  • Not many processors employ out-of-order issue because
  • Preserving sequential consistency requires much
    more effort.
  • Shelving reduces the need for out-of-order issue.
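The contrast between the two issue orders can be sketched directly. This is an illustrative model under a simplifying assumption: each window entry carries a precomputed "blocked" flag standing in for an unresolved dependency.

```python
# Toy contrast of issue orders. Each window entry is (name, blocked),
# where blocked=True stands for an unresolved dependency.

def issue_in_order(window):
    """In-order: a blocked instruction stalls everything behind it."""
    issued = []
    for name, blocked in window:
        if blocked:
            break            # all subsequent instructions wait
        issued.append(name)
    return issued

def issue_out_of_order(window):
    """Out-of-order: independent instructions may bypass a blocked one."""
    return [name for name, blocked in window if not blocked]

window = [("i1", False), ("i2", True), ("i3", False)]
print(issue_in_order(window))      # ['i1'] -- i2 blocks i3
print(issue_out_of_order(window))  # ['i1', 'i3'] -- i3 bypasses i2
```

The extra yield of out-of-order issue (i3 in the example) is exactly what must later be paid for by the sequential-consistency machinery the slide mentions.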

27
Aligned Issue of Instructions (I)
28
Aligned Issue of Instructions (II)
  • Aligned issue
  • No instructions of the next window are
    considered as candidates for issue until all
    instructions in the current window have been
    issued.
  • Unaligned issue
  • A gliding window whose width equals the issue
    rate is employed.
  • In every cycle, all instructions in the window
    are checked for dependencies. The independent
    ones are issued, either in order or out of
    order. The window is then refilled.

29
Most Frequently Used Issue Policies of Scalar
Processors
30
Most Frequently Used Issue Policies of
SuperScalar Proc.
31
Trend in Instruction Issue Policies
32
Issue Rate (I)
  • Issue rate (or superscalarity) refers to the
    maximum number of instructions a superscalar
    processor can issue in one cycle.
  • A higher issue rate potentially offers higher
    performance, at the cost of more complex
    circuitry. A balance between the two is needed.

33
Issue Rate (II)
34
Part E
Superscalar Instruction Issue: Shelving
35
Introduction
  • Eliminates issue blockages due to dependencies.
  • Makes use of dedicated instruction buffers, called
    shelving buffers, in front of the EU(s).
  • Shelving decouples dependency checking from
    instruction issue and defers it to instruction
    dispatch.
  • Decoded instructions are issued to the shelving
    buffers without any checks for data or control
    dependencies or for busy EU(s).
  • Processors with shelving usually employ in-order,
    aligned issue policies, together with register
    renaming and speculative conditional branch
    execution (so only true dependencies can block
    instruction execution). (Why in-order, aligned
    issue?)
  • Dependency checking is done during the instruction
    dispatch phase (from shelving buffer to EU).
    Dependency-free instructions, with their operands
    available, become available for execution: the
    dataflow principle of operation.
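The decoupling described above can be sketched as a small class. This is a minimal model under stated assumptions: a single reservation station, instructions represented as dicts with illustrative field names, and operand readiness represented by a simple set of register names.

```python
# Minimal sketch of shelving: issue moves decoded instructions into a
# reservation station with no dependency checks; dispatch later selects
# only instructions whose source operands are available (dataflow).

class ReservationStation:
    def __init__(self):
        self.entries = []              # shelved instructions

    def issue(self, instr):
        """Issue is never blocked by dependencies, only by a full RS."""
        self.entries.append(instr)

    def dispatch(self, ready_regs):
        """Dispatch every shelved instruction whose operands are ready."""
        runnable = [i for i in self.entries
                    if all(src in ready_regs for src in i["srcs"])]
        for i in runnable:
            self.entries.remove(i)
        return [i["op"] for i in runnable]

rs = ReservationStation()
rs.issue({"op": "add", "srcs": ["r1", "r2"]})
rs.issue({"op": "mul", "srcs": ["r3", "r9"]})    # r9 not yet produced
print(rs.dispatch(ready_regs={"r1", "r2", "r3"}))  # ['add']; mul stays shelved
```

Note how the dependent `mul` does not block `add`: the blockage moved from the issue stage into the RS, which is the whole point of shelving.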

36
Principle of Straightforward Issue Policy
37
Principle of Shelving
38
Design Space of Shelving
39
Part E-1
  • Design Space Topic of Shelving
  • Scope of Shelving

40
Scope of Shelving
  • Scope of shelving specifies whether shelving is
    restricted to a few instruction types or is
    performed for all instructions.

41
Part E-2
  • Design Space Topic of Shelving
  • Layout of Shelving Buffers

42
Layout of Shelving Buffers
43
Part E-2-1
  • Design Space Topic of Shelving
  • Layout of Shelving Buffers
  • Type of Buffers

44
Type of Shelving Buffers (I)
  • Standalone buffers are buffers which are used
    exclusively for shelving.
  • Combined buffers are those with multiple
    functionalities.

45
Type of Shelving Buffers (II)
  • Standalone using reservation station (RS)
  • Individual
  • Earliest to be adopted
  • In front of each EU
  • Size usually small (2-4)
  • Group
  • Hold instructions for a group of EUs that execute
    inst. of the same type
  • More reliable
  • Large in size (8-16)
  • Shelving or dispatching more than one instruction
    per cycle

46
Type of Shelving Buffers (III)
  • Standalone, using reservation stations (RS)
    (Contd)
  • Central
  • Most flexible
  • Disadvantages
  • Needs a word length equal to the longest possible
    data word
  • Much more complex
  • Size about 20
  • Combined buffers (reorder buffer, ROB) for
    shelving, renaming and reordering.
  • Expected to be the future trend

47
Type of Shelving Buffers (IV)
48
Combined Buffer for Shelving, Renaming and
Reordering
49
Part E-2-2
  • Design Space Topic of Shelving
  • Layout of Shelving Buffers
  • Number of Buffer Entries

50
Shelving Buffer Entries in Superscalar Processors
What types of RSs should be expected?
51
Part E-2-3
  • Design Space Topic of Shelving
  • Layout of Shelving Buffers
  • Number of Read/Write Ports

52
Number of Read/Write Ports for Shelving Buffers
  • Individual reservation stations only need to
    forward a single instruction per cycle.
  • Group/Central reservation stations need to
    deliver multiple instructions per cycle, ideally
    as many as the number of EU(s) connected.
  • Study the relationship between read/write ports
    and no. of shelving buffer entries

53
Part E-3
Design Space Topic of Shelving: Operand Fetch
Policy
54
Types of Operand Fetch Policies (I)
  • Two types
  • Issue bound
  • Operands fetched during instruction issue.
  • Shelving buffers provide entries long enough to
    hold source operands.
  • Dispatch bound
  • Operands fetched during instruction dispatch.
  • Shelving buffers contain short register
    identifiers.

55
Types of Operand Fetch Policies (II)
56
Operand Fetch During Instr. Issue w/ Single
Register File
57
Operand Fetch During Instr. Dispatch w/ Single
Register File
58
Comparison of Operand Fetch Policies
  • Policy comparison
  • Issue bound
  • The register file supplies all operands for all
    issued instructions.
  • Needs twice as many read ports in the register
    file as the max. issue rate.
  • The size of an RS entry is relatively larger.
  • Dispatch bound
  • The no. of read ports should equal twice the
    dispatch rate (note that the max. dispatch rate is
    usually higher than the max. issue rate; why?).
  • The critical decode/issue path is shorter.
  • Shelving buffers are relatively less complex.
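The difference between the two policies is easiest to see in the shape of an RS entry. The sketch below is illustrative (field names and the tiny register file are assumptions): an issue-bound entry carries full operand values and is therefore wide, while a dispatch-bound entry carries only short register identifiers and reads the register file at dispatch time.

```python
# Sketch of the two RS entry layouts (illustrative field names).

regfile = {"r1": 10, "r2": 32}     # toy architectural register file

def shelve_issue_bound(op, src1, src2):
    # Operands are fetched now: the entry holds full data values (wide).
    return {"op": op, "val1": regfile[src1], "val2": regfile[src2]}

def shelve_dispatch_bound(op, src1, src2):
    # Only short register identifiers are shelved (narrow entry).
    return {"op": op, "src1": src1, "src2": src2}

def dispatch(entry):
    # Dispatch-bound entries read the register file here instead.
    if "val1" in entry:
        return entry["val1"] + entry["val2"]
    return regfile[entry["src1"]] + regfile[entry["src2"]]

print(dispatch(shelve_issue_bound("add", "r1", "r2")))     # 42
print(dispatch(shelve_dispatch_bound("add", "r1", "r2")))  # 42
```

Both routes compute the same result; the trade-off is where the register-file read ports are needed (at issue vs. at dispatch) and how wide each shelving entry must be.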

59
Issue Bound Operand Fetch with Multiple Register
Files
60
Dispatch Bound Operand Fetch with Multiple
Register Files
61
Most Frequently Used Shelving Buffer Types and
Operand Fetch Policies
62
Part E-4
Design Space Topic of Shelving: Instruction
Dispatch Scheme
63
Design Space of Inst. Dispatch
  • Instruction dispatch involves two basic tasks:
    scheduling the instructions held in a particular
    RS for execution, and disseminating the scheduled
    instruction(s) to the allocated EU(s).

Instruction dispatch scheme
64
Part E-4-1
  • Design Space Topic of Shelving
  • Instruction Dispatch Scheme
  • Dispatch policy

65
Design Space of Dispatch Policy
Dispatch policy
66
Consideration of Dispatch Policy (I)
  • Dispatch policy specifies how instructions are
    selected for execution and how dispatch blockages
    are handled.
  • Selection rule
  • Specifies when instructions are considered
    executable.
  • Arbitration rule
  • Chooses a subset of instructions when more
    instructions are eligible for execution than can
    be disseminated in the next cycle.
  • Usually, older instructions are preferred
    over younger ones.

67
Consideration of Dispatch Policy (II)
  • Dispatch policy (Contd)
  • Dispatch order
  • Will a non-executable instruction block all
    subsequent instructions from being dispatched?
  • Three types:
  • In-order: simple (only the last instruction needs
    to be inspected)
  • Partially out-of-order (for certain instruction
    types)
  • Out-of-order
  • Complex
  • Needs to check all instructions in the shelving
    buffer for executable instructions.
  • Expected to be used in group or central RSs.
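The selection and arbitration rules above can be combined into a one-cycle dispatch function. The sketch is an assumed model: entries are `(age, op, srcs)` tuples with lower age meaning older, readiness is a set of register names, and oldest-first is used as the arbitration rule, as the slides suggest is typical.

```python
# Sketch of a dispatch policy: the selection rule keeps only
# instructions whose operands are ready; the arbitration rule then
# prefers older instructions when more are eligible than the dispatch
# rate allows in one cycle.

def dispatch_cycle(rs_entries, ready_regs, dispatch_rate):
    """rs_entries: list of (age, op, srcs); lower age = older."""
    eligible = [e for e in rs_entries                      # selection rule
                if all(s in ready_regs for s in e[2])]
    eligible.sort(key=lambda e: e[0])                      # arbitration: oldest first
    return [e[1] for e in eligible[:dispatch_rate]]

rs = [(3, "sub", ["r4"]), (1, "add", ["r1"]), (2, "mul", ["r9"])]
print(dispatch_cycle(rs, ready_regs={"r1", "r4"}, dispatch_rate=1))
# ['add'] -- both add and sub are eligible, but add is older
```

With a dispatch rate of 2 the same cycle would send out both `add` and `sub`, while `mul` keeps waiting for `r9`: an out-of-order dispatch order, since a non-executable instruction does not block the ones behind it.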

68
Dispatch Order
69
Part E-4-2
  • Design Space Topic of Shelving
  • Instruction Dispatch Scheme
  • Dispatch rate

70
Considerations of Dispatch Rate
  • Dispatch rate is defined as the no. of
    instructions that can be dispatched from each
    reservation station per cycle.
  • The ideal dispatch rate is one instruction per EU.
  • Easier to achieve in individual and group RSs.
  • Future dispatch rates are expected to get higher
    because of fewer restrictions imposed by data
    paths, ports, and transistor count.
  • Note that very often, the max. issue rate is less
    than the max. dispatch rate.

71
Multiplicity of Dispatched Instructions
72
Max. Issue and Dispatch Rates of Superscalar Proc.
  • Study relationship between issue rate and
    dispatch rate.

73
Part E-4-3
  • Design Space Topic of Shelving
  • Instruction Dispatch Scheme
  • Checking for Operand Availability

74
Intro. to Checking for Operand Availability
  • Availability checking is done
  • when operands are fetched from the register file,
    and
  • (during dispatch) when checking whether operands of
    instructions in the shelving buffers are available.
  • Solution: scoreboard
  • Direct check of the scoreboard bits
  • The RS does not hold any explicit status
    information indicating whether source operands
    are available.
  • Employed when operands are fetched during
    instruction dispatch.
  • Check of explicit status bits
  • Availability is indicated in the RS through status
    bits.
  • Employed if operands are fetched during
    instruction issue.
  • An additional associative search is needed for
    value updating in the RS.
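A scoreboard in this sense is just one valid bit per architectural register. The sketch below is an assumed minimal model (eight registers, dict-based bits): the bit is cleared when an instruction that writes the register is issued, and set again at write-back, so dispatch can check operand availability directly.

```python
# Minimal scoreboard sketch: one "value available" bit per register.

scoreboard = {f"r{i}": True for i in range(8)}   # True = value available

def issue(dest):
    """Issuing an instruction that writes dest marks it pending."""
    scoreboard[dest] = False

def writeback(dest):
    """Write-back makes the value available again."""
    scoreboard[dest] = True

def operands_available(srcs):
    """Direct check of the scoreboard bits at dispatch time."""
    return all(scoreboard[s] for s in srcs)

issue("r3")                               # some instruction will produce r3
print(operands_available(["r1", "r3"]))   # False -- r3 still pending
writeback("r3")
print(operands_available(["r1", "r3"]))   # True
```

This is the "direct check" variant; the explicit-status-bit variant instead copies such information into each RS entry at issue time, which is why it needs the associative update search mentioned above.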

75
Principle of Scoreboarding
76
Scheme for Checking Operand Availability
77
Use of Multiple Buses for Updating Multiple RSs
  • If multiple RSs exist, they must be updated
    globally.

78
Updating RSs in case of Multiple Register Files
79
Internal Data Paths of PowerPC604
80
Part E-4-4
  • Design Space Topic of Shelving
  • Instruction Dispatch Scheme
  • Treatment of Empty Reservation Station

81
Treatment of Empty Reservation Stations
82
Part E-4-5
  • Design Space Topic of Shelving
  • Instruction Dispatch Scheme
  • Typical Dispatch Schemes

83
Typical Approaches in Dispatching (I)
  • Assumptions for typical solutions:
  • Register renaming and speculative execution are
    usually employed.
  • If operands are fetched during instruction
    dispatch, use the direct checking method.
  • If operands are fetched during instruction issue,
    use explicit status bits to maintain and check
    operand availability.
  • An empty RS is usually bypassed.

84
Typical Approaches in Dispatching (II)
85
Part F
Superscalar Instruction Issue: Register Renaming
86
Introduction to Register Renaming
  • A standard technique for removing false data
    dependencies (i.e. WAR, WAW).
  • Instructions are turned into three-operand form by
    renaming the destination operand.
  • Two implementations:
  • Static
  • Done by the compiler.
  • Dynamic
  • Takes place in hardware during execution.
  • Requires extra circuitry for supplementary
    register space, additional data paths and logic.
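Dynamic renaming can be sketched with two structures: a mapping table from architectural to physical registers and a free list of spare physical registers. The register names and sizes below are illustrative assumptions, not any particular processor's organization.

```python
# Sketch of dynamic register renaming (map table + free list).

rename_map = {f"r{i}": f"p{i}" for i in range(4)}   # architectural -> physical
free_list = [f"p{i}" for i in range(4, 8)]          # spare physical registers

def rename(dest, srcs):
    """Sources read the current mapping; the destination gets a fresh
    physical register, which removes WAR and WAW hazards."""
    mapped_srcs = [rename_map[s] for s in srcs]
    new_phys = free_list.pop(0)
    rename_map[dest] = new_phys
    return new_phys, mapped_srcs

# WAW example: two writes to r1 land in different physical registers,
# so the two instructions may complete in any order.
d1, _ = rename("r1", ["r2"])
d2, _ = rename("r1", ["r3"])
print(d1, d2)   # p4 p5 -- the false output dependency is gone
```

A real implementation also needs the reclamation scheme mentioned later in this deck: a physical register returns to the free list once no in-flight instruction can still read it.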

87
Implementation of Register Renaming
88
Chronology of Renaming in Commercial Processors
89
Design Space of Register Renaming
90
Part F-1
  • Design Space Topic Register Renaming
  • Scope of Register Renaming

91
Scope of Renaming
92
Part F-2
Design Space Topic of Register Renaming: Layout of
Rename Buffers
93
Layout of Rename Buffers
94
Types of Rename Buffers
95
Architecture of Rename Buffers
  • For the merged architectural and rename register
    file:
  • A free physical register is allocated to each
    destination register specified in an instruction.
  • A mapping table is used to track all allocated
    register pairs.
  • A scheme is required to reclaim physical registers
    no longer in use.
  • In all three other cases, intermediate results
    are held in the respective rename buffer until
    retirement. During retirement, the content of the
    rename buffer is written back to the architectural
    register file.

96
Example of Renaming Architectural Registers (I)
97
Example of Renaming Architectural Registers (II)
98
Number of Rename Buffers
99
Access Mechanism of Rename Buffers (I)
  • Rename buffers need to be accessed to
  • Fetch operands
  • Update rename registers
  • Deallocate rename registers
  • Two distinct mechanisms:
  • Associative mechanism
  • Indexed access mechanism

100
Access Mechanism of Rename Buffers (II)
101
Part F-3
Design Space Topic of Register Renaming: Operand
Fetch Policy
102
Operand Fetch Policies of Rename Buffers
  • Two policies
  • Rename bound
  • Fetch referenced operands during renaming
  • Dispatch bound
  • Defer operand fetch until dispatching

103
Part F-4
Design Space Topic of Register Renaming: Rename Rate
104
Rename Rate
  • Rename rate is the max. number of renames per
    cycle that a processor is able to perform.
  • To avoid bottlenecks, the rename rate should equal
    the issue rate.
  • HW requirements: a large number of ports on the
    register files and the mapping tables.

105
Part F-5
Design Space Topic of Register Renaming: Most
Frequently Used Renaming
106
Most Frequently Used Basic Renaming
107
Part G
Parallel Execution
108
Concept of Parallel Execution
  • Independent of whether instructions are issued or
    dispatched in order or out of order, they will
    generally finish out of program order.
  • Three terms:
  • To finish: the operation is completed, except for
    writing back the result into the architectural
    register or memory (and status bits).
  • To complete: the last action of instruction
    execution (i.e. the write-back to architectural
    registers) is finished.
  • To retire: write back to the architectural
    registers and delete the completed instruction
    from the ROB (Reorder Buffer).
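The finish/complete/retire distinction shows up directly in a reorder buffer sketch. The model below is an illustrative assumption: instructions may *finish* out of order (their `finished` flag flips whenever their EU is done), but they *retire* strictly in program order, from the head of the ROB only.

```python
# Sketch of in-order retirement from a reorder buffer (ROB).
from collections import deque

rob = deque([{"op": "mul", "finished": False},
             {"op": "add", "finished": True}])   # add finished first (out of order)

def retire():
    """Retire from the head only; a younger finished instruction waits
    behind an older unfinished one."""
    retired = []
    while rob and rob[0]["finished"]:
        retired.append(rob.popleft()["op"])
    return retired

print(retire())        # [] -- add is finished, but mul, ahead of it, is not
rob[0]["finished"] = True
print(retire())        # ['mul', 'add'] -- retirement follows program order
```

This head-only rule is what turns out-of-order finishing back into the strong processor consistency discussed on the next slides.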

109
Part H
Preserving Sequential Consistency of Instruction
Execution
110
Sequential Consistency (I)
  • Two aspects:
  • The order in which instructions are completed.
  • The order in which memory is accessed due to
    loads/stores.
  • Processor consistency indicates the consistency
    of instruction completion with sequential
    instruction execution.
  • Two possible processor consistencies:
  • Weak: instructions may complete out of order,
    provided that no data dependencies are
    violated.
  • Strong: instructions are forced to complete in
    strict program order. Usually achieved with a ROB.

111
Sequential Consistency (II)
  • Memory consistency indicates whether memory
    accesses are performed in the same order as on a
    sequential processor.
  • Two possible memory access consistencies:
  • Weak: memory accesses may be out of order
    compared with strict sequential program
    execution, provided that data dependencies are
    not violated.
  • Strong: memory accesses occur strictly in program
    order.

112
Sequential Consistency (III)
113
Sequential Consistency Model
114
Concept of Load/Store Reordering
115
Principle of Reorder Buffer
116
Use of Reorder Buffer in Commercial Processors
117
Design Space of Reorder Buffers
118
Basic Layout of Reorder Buffers
119
Sample Implementation of Reorder Buffers
120
Comparison of Shelves and Reorder Buffer Entries
121
Part I
Preserving Sequential Consistency of Exception
Processing
122
Sequential Consistency of Exception Processing
123
  • END