Superscalar Processors - PowerPoint PPT Presentation

1
Superscalar Processors
  • J. Nelson Amaral

2
Scalar to Superscalar
  • Scalar processor: one instruction passes through
    each pipeline stage in each cycle
  • Superscalar processor: multiple instructions at
    each pipeline stage in each cycle
  • Wider pipeline
  • Superpipelined processor: decompose stages into
    smaller stages → more stages
  • Deeper pipeline

Baer p. 75
3
Superscalar
  • Front end (IF and ID)
  • Must fetch and decode multiple instructions per
    cycle
  • m-way superscalar brings (ideally) m
    instructions per cycle into the pipeline
  • Back end (EX, Mem and WB)
  • Must execute and write back several instructions
    per cycle

Baer p. 75
4
Superscalar
  • In-order (or static)
  • Instructions leave the front end in program order
  • Out-of-order (or dynamic)
  • Instructions leave the front end, and execute, in
    an order different from the program order
  • WB is called the commit stage
  • Must ensure that the program semantics are
    preserved
  • More complex design

Baer p. 76
5
Limits to Superscalar Performance
  • Superscalars rely on exploiting Instruction-Level
    Parallelism (ILP)
  • They remove WAR and WAW dependences
  • But the amount of ILP is limited by RAW (true)
    dependences

[Figure: data dependence graph over statements S0 to S3, with edges labeled RAW, WAR, and WAW]
Baer p. 76
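As a rough illustration of the three dependence types named on this slide, the sketch below classifies the dependences between two instructions from the registers they write and read. The `Instr` tuple, the `dependences` function, and the register names are hypothetical, chosen only for illustration.

```python
from typing import NamedTuple, Tuple

class Instr(NamedTuple):
    dest: str               # register written by the instruction
    srcs: Tuple[str, ...]   # registers read by the instruction

def dependences(first: Instr, second: Instr) -> set:
    """Dependences of `second` on `first`, where `first` is earlier in program order."""
    found = set()
    if first.dest in second.srcs:
        found.add("RAW")    # second reads what first writes (true dependence)
    if second.dest in first.srcs:
        found.add("WAR")    # second overwrites a register that first reads
    if second.dest == first.dest:
        found.add("WAW")    # both write the same register
    return found

a = Instr("R1", ("R2", "R3"))
b = Instr("R4", ("R1", "R5"))
c = Instr("R5", ("R6", "R7"))
d = Instr("R1", ("R8", "R9"))
print(dependences(a, b))    # {'RAW'}
print(dependences(b, c))    # {'WAR'}
print(dependences(a, d))    # {'WAW'}
```

Only the RAW case reflects a real flow of data; the WAR and WAW cases come from reusing register names, which is why renaming can remove them.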
6
Limits to Superscalar Performance
[Figure: data dependence graph with S0 and S1 joined by a RAW edge; register labels RA, RB, RA]
Baer p. 76
7
Limits to Superscalar Performance
  • Complexity of logic to remove dependencies
  • Designers predicted 8-way and 16-way superscalars
  • We have 6-way superscalars and m is not likely to
    grow

Baer p. 76
8
Limits to Superscalar Performance: Number of
Forward Paths
[Figure: forward paths in a 1-way pipeline]
Baer p. 76
9
Limits to Superscalar Performance: Number of
Forward Paths
[Figure: forward paths in a 2-way pipeline]
An m-way superscalar requires m² paths.
Paths may become too long for signal propagation
within a single clock cycle.
Baer p. 76
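To make the quadratic growth concrete, here is a minimal sketch that tabulates the number of forward paths using the m² rule stated on this slide. The function name `forwarding_paths` is hypothetical.

```python
def forwarding_paths(m: int) -> int:
    """Number of forward paths for an m-way superscalar, per the m^2 rule on the slide."""
    return m * m

for m in (1, 2, 4, 6, 8, 16):
    print(f"{m}-way: {forwarding_paths(m)} paths")
```

Doubling the width quadruples the path count, which is one reason wide machines run into the signal-propagation limit mentioned above.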
10
Limits to Clock Cycle Reduction
  • Power dissipation increases with frequency
  • Reading and writing pipeline registers in every
    cycle
  • Time to access pipeline register imposes a bound
    on the duration of a pipeline stage

Baer p. 76
11
Limits on Pipeline Length
  • Speculative actions (e.g., branch prediction) are
    resolved later in a longer pipeline
  • Recovery from misspeculation is delayed

Baer p. 76
12
Why the Multicore Revolution?
Baer p. 77
13
Speed Demons vs. Brainiacs
register renaming
reorder buffer
reservation stations
Baer p. 77
14
Out-of-Order and Memory Hierarchy
  • Question: Does out-of-order execution help hide
    memory latencies?
  • Short answer: No.
  • Latencies of 100 cycles or more are too long and
    fill up all internal queues and stall pipelines
  • Latencies around 100 cycles are too short to
    justify context switching.
  • Solution: hardware for several contexts to enable
    fast context switching → multithreading

Baer p. 78
15
DEC Alpha 21164
4-way in-order RISC
[Figure: 21164 block diagram; surviving labels: virtually indexed, 32, 32 64-bit]
Baer p. 79
16
21164 Instruction Pipeline
Integer pipe 1: shifter and multiplier
Integer pipe 2: branches
48-entry I-TLB
64-entry D-TLB
Baer p. 79
17
Integer pipe 1: shifter and multiplier
Integer pipe 2: branches
48-entry I-TLB
64-entry D-TLB
Baer p. 80
18
Example
i1: R1 ← R2 op R3    Use integer pipeline 1
i2: R4 ← R1 op R5    Use integer pipeline 2
i3: R7 ← R8 op R9    Requires an integer pipeline
i4: F0 ← F2 + F4     Floating point add
i5 i6 i7 i8 i9 i10 i11 i12
Assume no structural or data hazard for these
instructions.
Baer p. 81
19
Front-end Occupancy
i1: R1 ← R2 op R3
i2: R4 ← R1 op R5
i3: R7 ← R8 op R9
i4: F0 ← F2 + F4
[Figure: occupancy of front-end stages S0 to S3 and the backend at times t0 and t0 + 1]
Baer p. 82
20
Front-end Occupancy
i1: R1 ← R2 op R3
i2: R4 ← R1 op R5
i3: R7 ← R8 op R9
i4: F0 ← F2 + F4
[Figure: occupancy of front-end stages S0 to S3 and the backend at times t0 + 1 and t0 + 2]
Baer p. 82
21
Front-end Occupancy
i1: R1 ← R2 op R3
i2: R4 ← R1 op R5
i3: R7 ← R8 op R9
i4: F0 ← F2 + F4
[Figure: occupancy of front-end stages S0 to S3 and the backend at times t0 + 2 and t0 + 3; surviving labels: i3 to i12]
i3 cannot move to S3 because of a resource conflict
(there are only two integer pipelines).
i4 does not move to S3, to preserve program order
(it is blocked by i3).
Baer p. 82
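The blocking rule on this slide can be sketched as follows. This is a simplified model, assuming only structural hazards matter (it ignores the RAW check applied in a later stage); the function name `advance_in_order` and the count of two floating-point pipelines are assumptions made for the example.

```python
def advance_in_order(group, free_units):
    """Return (advancing, blocked) for instructions leaving a stage in program order.

    group: list of (name, unit_kind) pairs in program order.
    free_units: dict mapping unit_kind to the number of free pipelines this cycle.
    """
    free = dict(free_units)          # copy so the caller's dict is untouched
    advancing = []
    for idx, (name, kind) in enumerate(group):
        if free.get(kind, 0) == 0:
            # Resource conflict: this instruction waits, and so does every
            # younger instruction, because program order must be preserved.
            return advancing, [n for n, _ in group[idx:]]
        free[kind] -= 1
        advancing.append(name)
    return advancing, []

group = [("i1", "int"), ("i2", "int"), ("i3", "int"), ("i4", "fp")]
moving, waiting = advance_in_order(group, {"int": 2, "fp": 2})
print(moving, waiting)   # ['i1', 'i2'] ['i3', 'i4']
```

Note that i4 stays behind even though a floating-point pipeline is free: the early return is what models in-order issue.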
22
Front-end Occupancy
i1: R1 ← R2 op R3
i2: R4 ← R1 op R5
i3: R7 ← R8 op R9
i4: F0 ← F2 + F4
[Figure: occupancy of front-end stages S0 to S3 and the backend at times t0 + 3 and t0 + 4; surviving labels: i1 to i12]
i2 cannot move to the backend because of the RAW
dependency on i1.
Baer p. 82
23
Front-end Occupancy
i1: R1 ← R2 op R3
i2: R4 ← R1 op R5
i3: R7 ← R8 op R9
i4: F0 ← F2 + F4
[Figure: occupancy of front-end stages S0 to S3 and the backend at times t0 + 4 and t0 + 5; surviving label: i1]
Baer p. 82
24
Backend
Baer p. 82
25
Scoreboard Speculation
If the load hits L1-cache, then schedule L at t1
and U at t3.
Scoreboard assumes it is a hit.
If it is a miss, abort any dependent instruction
already issued.
Baer p. 82
26
Can the Compiler Help Performance? (Example)
i1: R1 ← Mem[R2]
i2: R4 ← R1 op R3
i3: R5 ← R1 op R6
i4: R7 ← R4 op R5
Assume that all instructions are in the issue slot
(stage S2) at time t.
27
Compiler Effect
i1: R1 ← Mem[R2]
i2: R4 ← R1 op R3
i3: R5 ← R1 op R6
i4: R7 ← R4 op R5
[Figure: occupancy of stages S0 to S3 and the backend at times t and t + 1; surviving labels: i3, i4]
Instruction i3 cannot advance to S3 because of a
structural hazard: the load in i1 uses an
integer pipe to compute the address.
Baer p. 82
28
Compiler Effect
i1: R1 ← Mem[R2]
i2: R4 ← R1 op R3
i3: R5 ← R1 op R6
i4: R7 ← R4 op R5
[Figure: occupancy of stages S0 to S3 and the backend at times t + 1, t + 2, and t + 3; surviving labels: i1 to i4]
i2 cannot advance because of the RAW dependency
on i1.
At t + 3 the load continues execution in the back
end (2-cycle latency).
Baer p. 82
29
Compiler Effect
i1: R1 ← Mem[R2]
i2: R4 ← R1 op R3
i3: R5 ← R1 op R6
i4: R7 ← R4 op R5
[Figure: occupancy of stages S0 to S3 and the backend at times t + 3 and t + 4; surviving label: i1]
Baer p. 82
30
Compiler Effect
i1: R1 ← Mem[R2]
i2: R4 ← R1 op R3
i3: R5 ← R1 op R6
i4: R7 ← R4 op R5
[Figure: occupancy of stages S0 to S3 and the backend at times t + 4 and t + 5; surviving labels: i2, i3, i4]
i4 cannot advance because of the RAW dependency
on i3.
Baer p. 82
31
Compiler Effect
i1: R1 ← Mem[R2]
i2: R4 ← R1 op R3
i3: R5 ← R1 op R6
i4: R7 ← R4 op R5
[Figure: occupancy of stages S0 to S3 and the backend at times t + 5 and t + 6; surviving label: i3]
i4 advances to execution at t + 6, and it will be
the only integer instruction executing in that
cycle.
Baer p. 82
32
After Compiler Optimization
i1:  R1 ← Mem[R2]
i1': integer nop
i2:  R4 ← R1 op R3
i3:  R5 ← R1 op R6
i4:  R7 ← R4 op R5
[Figure: occupancy of stages S0 to S3 and the backend at times t and t + 1; surviving labels: i2 to i7]
Two integer instructions advance to S3.
Baer p. 82
33
After Compiler Optimization
i1:  R1 ← Mem[R2]
i1': integer nop
i2:  R4 ← R1 op R3
i3:  R5 ← R1 op R6
i4:  R7 ← R4 op R5
[Figure: occupancy of stages S0 to S3 and the backend at times t + 1 and t + 2]
Baer p. 82
34
After Compiler Optimization
i1:  R1 ← Mem[R2]
i1': integer nop
i2:  R4 ← R1 op R3
i3:  R5 ← R1 op R6
i4:  R7 ← R4 op R5
[Figure: occupancy of stages S0 to S3 and the backend at times t + 2, t + 3, and t + 4; surviving labels: i1, i1', i2 to i7]
The load in i1 still needs two cycles to execute.
Baer p. 82
35
After Compiler Optimization
i1:  R1 ← Mem[R2]
i1': integer nop
i2:  R4 ← R1 op R3
i3:  R5 ← R1 op R6
i4:  R7 ← R4 op R5
[Figure: occupancy of stages S0 to S3 and the backend at times t + 4 and t + 5; surviving label: i1]
i2 and i3 can advance to the backend together. There
is no dependency between them.
Baer p. 82
36
After Compiler Optimization
i1:  R1 ← Mem[R2]
i1': integer nop
i2:  R4 ← R1 op R3
i3:  R5 ← R1 op R6
i4:  R7 ← R4 op R5
[Figure: occupancy of stages S0 to S3 and the backend at times t + 4, t + 5, and t + 6; surviving labels: i4 to i7, i12]
i4 still advances to the backend at t + 6,
but now i5 could advance along with i4.
The textbook says that i4 would advance to the
backend at t + 5.
Baer p. 82
37
Scoreboarding
Scoreboarding allows instructions to execute out
of order when there are sufficient resources and
no data dependences.
John L. Hennessy and David A. Patterson, Computer
Architecture: A Quantitative Approach, Third
Edition, p. A-69.
38
Another scoreboarding
39
Scoreboarding
  • Thornton's algorithm (scoreboarding): CDC 6600
    (1964)
  • A single unit (the scoreboard) monitors the
    progress of the execution of instructions and the
    status of all registers.
  • Tomasulo's algorithm: IBM 360/91 (1967)
  • Reservation stations buffer operands and results.
    A Common Data Bus (CDB) distributes results
    directly to functional units.

Some of this material is from Prof. Vojin G.
Oklobdzija's tutorial at ISSCC '97.
Baer p. 81
40
CDC 6600
Not shown: branch unit that modifies the PC
Baer p. 86
41
CDC 6600 Scoreboard Operation
Issue
free functional unit?
Baer p. 86
42
CDC 6600 Scoreboard Operation
Dispatch
Mark execution unit busy
Baer p. 87
43
CDC 6600 Scoreboard Operation
Execution
Baer p. 87
44
CDC 6600 Scoreboard Operation
Write result
WAR example:
i0: DIV.D F0, F2, F4
i1: ADD.D F10, F0, F8
i2: SUB.D F8, F8, F14
The write of i2 has to stall until i1 has read F8.
Baer p. 87
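The WAR condition in this example can be sketched as a simple check at the write-result step. This is a simplified model, not the CDC 6600 logic itself: the function `may_write` and the set of pending reads are hypothetical names, and the scoreboard's other bookkeeping is left out.

```python
def may_write(writer_dest, pending_reads):
    """Write-result check: allow the write only if no earlier instruction
    still has to read the destination register (no WAR hazard)."""
    return writer_dest not in pending_reads

# i1: ADD.D F10, F0, F8 is waiting for F0 from the divide, so it has
# not yet read either of its source registers.
pending_reads_of_i1 = {"F0", "F8"}
print(may_write("F8", pending_reads_of_i1))   # False: i2 must stall its write

# Once i1 has read its operands, nothing earlier still needs the old F8.
pending_reads_of_i1 = set()
print(may_write("F8", pending_reads_of_i1))   # True: i2 may write F8
```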
45
Scoreboarding Example
i1: R4 ← R0 op R2      Uses multiplier 1
i2: R6 ← R4 op R8      Uses multiplier 2
i3: R8 ← R2 op R12     Uses adder
i4: R4 ← R14 op R16    Uses adder
Baer p. 88
46
i1: R4 ← R0 op R2      Uses multiplier 1
i2: R6 ← R4 op R8      Uses multiplier 2
i3: R8 ← R2 op R12     Uses adder
i4: R4 ← R14 op R16    Uses adder
Cycle 1
i1: issued | R4, R0, R2 | 1, 1
Unit Busy (U)?  Mult1 0, Mult2 0, Adder 0
Register Unit:  R4 NIL, R6 NIL, R8 NIL
[Animation overlay label: Mult1]
Baer p. 88
47
i1: R4 ← R0 op R2      Uses multiplier 1
i2: R6 ← R4 op R8      Uses multiplier 2
i3: R8 ← R2 op R12     Uses adder
i4: R4 ← R14 op R16    Uses adder
Cycle 2
i1: issued, dispatched | R4, R0, R2 | 1, 1
i2: issued | R6, R4, R8 | Mult1 | 0, 1
Unit Busy (U)?  Mult1 0, Mult2 0, Adder 0
Register Unit:  R4 Mult1, R6 NIL, R8 NIL
[Animation overlay labels: 1, Mult2]
Baer p. 88
48
i1: R4 ← R0 op R2      Uses multiplier 1
i2: R6 ← R4 op R8      Uses multiplier 2
i3: R8 ← R2 op R12     Uses adder
i4: R4 ← R14 op R16    Uses adder
Cycle 3
i1: dispatched, execute | R4, R0, R2 | 1, 1
i2: issued | R6, R4, R8 | Mult1 | 0, 1
i3: issued | R8, R2, R12 | 1, 1
Unit Busy (U)?  Mult1 1, Mult2 0, Adder 0
Register Unit:  R4 Mult1, R6 Mult2, R8 NIL
[Animation overlay label: Adder]
Baer p. 88
49
i1: R4 ← R0 op R2      Uses multiplier 1
i2: R6 ← R4 op R8      Uses multiplier 2
i3: R8 ← R2 op R12     Uses adder
i4: R4 ← R14 op R16    Uses adder
Cycle 4
i1: execute | R4, R0, R2 | 1, 1
i2: issued | R6, R4, R8 | Mult1 | 0, 1
i3: issued, dispatched | R8, R2, R12 | 1, 1
Unit Busy (U)?  Mult1 1, Mult2 0, Adder 0
Register Unit:  R4 Mult1, R6 Mult2, R8 Adder
[Animation overlay label: 1]
Baer p. 88
50
i1: R4 ← R0 op R2      Uses multiplier 1
i2: R6 ← R4 op R8      Uses multiplier 2
i3: R8 ← R2 op R12     Uses adder
i4: R4 ← R14 op R16    Uses adder
Cycle 5
(No change)
i1: execute | R4, R0, R2 | 1, 1
i2: issued | R6, R4, R8 | Mult1 | 0, 1
i3: dispatched, execute | R8, R2, R12 | 1, 1
Unit Busy (U)?  Mult1 1, Mult2 0, Adder 1
Register Unit:  R4 Mult1, R6 Mult2, R8 Adder
Baer p. 88
51
i1: R4 ← R0 op R2      Uses multiplier 1
i2: R6 ← R4 op R8      Uses multiplier 2
i3: R8 ← R2 op R12     Uses adder
i4: R4 ← R14 op R16    Uses adder
Cycle 6
i3 asks for permission to write. Permission is
denied (WAR with i2).
i1: execute | R4, R0, R2 | 1, 1
i2: issued | R6, R4, R8 | Mult1 | 0, 1
i3: execute | R8, R2, R12 | 1, 1
Unit Busy (U)?  Mult1 1, Mult2 0, Adder 1
Register Unit:  R4 Mult1, R6 Mult2, R8 Adder
Baer p. 88
52
i1: R4 ← R0 op R2      Uses multiplier 1
i2: R6 ← R4 op R8      Uses multiplier 2
i3: R8 ← R2 op R12     Uses adder
i4: R4 ← R14 op R16    Uses adder
Cycle 8
i1 asks for permission to write. Permission
is granted.
i1: execute, write | R4, R0, R2 | 1, 1
i2: issued | R6, R4, R8 | Mult1 | 0, 1
i3: execute | R8, R2, R12 | 1, 1
Unit Busy (U)?  Mult1 1, Mult2 0, Adder 1
Register Unit:  R4 Mult1, R6 Mult2, R8 Adder
Baer p. 88
53
i1: R4 ← R0 op R2      Uses multiplier 1
i2: R6 ← R4 op R8      Uses multiplier 2
i3: R8 ← R2 op R12     Uses adder
i4: R4 ← R14 op R16    Uses adder
Cycle 9
i2: issued, dispatched | R6, R4, R8 | Mult1 | 0, 1
i3: execute, write | R8, R2, R12 | 1, 1
Unit Busy (U)?  Mult1 0, Mult2 1, Adder 1
Register Unit:  R4 (blank), R6 Mult2, R8 Adder
[Animation overlay label: Adder]
Baer p. 88
54
Register Renaming, Reorder Buffer, and
Reservation Stations
  • Differences between in-order and out-of-order
    execution
  • When do instructions leave the front end?
  • In-order: WAR and WAW prevent dispatch
  • Out-of-order: register renaming avoids WAR and
    WAW
  • How are instructions processed in the back end?
  • Instructions can wait in reservation stations
    because of RAW dependencies or structural hazards
  • A reorder buffer enforces commitment in program
    order

Baer p. 89
55
Register Renaming (example)
i1: R1 ← R2/R3       (takes a long time)
i2: R4 ← R1 op R5
i3: R5 ← R6 op R7
i4: R1 ← R8 op R9
The registers that appear in the program are
logical or architectural registers.
In-order: only i1 issues. The others are blocked by
the RAW dependency.
At the last stage of the front end all registers
are mapped to physical registers.
Out-of-order: i3 and i4 can issue and finish
execution while i1 executes.
Baer p. 89
56
Renaming Process
Renaming Stage
Ri ← Rj op Rk   is renamed to   Ra ← Rb op Rc
Rb = Rename(Rj)
Rc = Rename(Rk)
Ra = freelist(first)
Rename(Ri) = freelist(first)
first ← next(first)
Baer p. 90
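The renaming steps above can be sketched in a few lines. This is a minimal model under stated assumptions: 32 logical registers mapped to themselves at the start, a free list beginning at R32, and 64 physical registers in total (the last number is an assumption for the example). The function name `rename` is hypothetical, and the sketch never returns registers to the free list.

```python
from collections import deque

def rename(instrs, num_logical=32, num_physical=64):
    """Rename (dest, src1, src2) triples of logical register numbers."""
    rename_map = {r: r for r in range(num_logical)}       # identity mapping at start
    freelist = deque(range(num_logical, num_physical))    # unused physical registers
    renamed = []
    for dest, src1, src2 in instrs:
        rb = rename_map[src1]       # Rb = Rename(Rj)
        rc = rename_map[src2]       # Rc = Rename(Rk)
        ra = freelist.popleft()     # Ra = freelist(first); first <- next(first)
        rename_map[dest] = ra       # Rename(Ri) = Ra
        renamed.append((ra, rb, rc))
    return renamed, rename_map

# Register numbers of a four-instruction sequence:
# R1 <- R2/R3, R4 <- R1 op R5, R5 <- R6 op R7, R1 <- R8 op R9
program = [(1, 2, 3), (4, 1, 5), (5, 6, 7), (1, 8, 9)]
renamed, final_map = rename(program)
print(renamed)   # [(32, 2, 3), (33, 32, 5), (34, 6, 7), (35, 8, 9)]
```

The sources are looked up before the destination is remapped, so an instruction that reads and writes the same logical register still reads the old physical register. After renaming, only the RAW dependence through physical register 32 remains.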
57
Register Renaming (example)
How about i3: can it write into R5 before i1 and
i2 complete?
If i1 generates an exception, what will be the
value of R5 in the exception state?
i1: R1 ← R2/R3
i2: R4 ← R1 op R5
i3: R5 ← R6 op R7
i4: R1 ← R8 op R9
Ri    Rename(Ri)
R1    R1
R2    R2
R3    R3
R4    R4
R5    R5
R6    R6
R7    R7
R8    R8
R9    R9
[Animation overlay labels: R32, R32, R35, R32, R33, R34, R35, R33, R34]
i4 will finish execution before i1. Can we allow
it to write the result to R1 before i1?
Freelist: R32, R33, R34, R35, R36, ...
Baer p. 90
58
Reorder Buffer
  • Even though we allow out-of-order execution, we
    require in-order completion.
  • A reorder buffer (ROB) ensures that the results
    produced by instructions are committed to the
    logical registers in program order.

Baer p. 91
59
Reorder Buffer (cont.)
  • Each entry in the ROB has the following fields:
  • flag: has the instruction completed?
  • value: value computed by the instruction
  • result register name: logical register
  • instruction type: arithmetic/load/store/branch/...
  • Each instruction that has its destination
    register renamed is entered in the ROB

Baer p. 91
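A minimal sketch of a reorder buffer with the four fields listed above, assuming a simple queue with the oldest instruction at the head. The class and method names (`RobEntry`, `ReorderBuffer`, `allocate`, `commit`) are hypothetical, and exceptions and branch recovery are not modeled.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional

@dataclass
class RobEntry:
    dest: str                     # result register name (logical register)
    kind: str                     # instruction type, e.g. "arithmetic"
    done: bool = False            # flag: has the instruction completed?
    value: Optional[int] = None   # value computed by the instruction

class ReorderBuffer:
    def __init__(self):
        self.entries = deque()    # oldest instruction at the left

    def allocate(self, dest, kind):
        """Enter an instruction in program order and return its entry."""
        entry = RobEntry(dest, kind)
        self.entries.append(entry)
        return entry

    def commit(self, regfile):
        """Retire completed entries from the head; stop at the first incomplete one."""
        committed = []
        while self.entries and self.entries[0].done:
            entry = self.entries.popleft()
            regfile[entry.dest] = entry.value
            committed.append(entry.dest)
        return committed

rob, regs = ReorderBuffer(), {}
older = rob.allocate("R1", "arithmetic")
younger = rob.allocate("R5", "arithmetic")
younger.done, younger.value = True, 7    # the younger instruction finishes first
first_commit = rob.commit(regs)          # nothing retires: the head is still pending
older.done, older.value = True, 3
second_commit = rob.commit(regs)         # both retire, in program order
print(first_commit, second_commit, regs)
```

Because the logical register file is only updated at commit, an exception in the older instruction would leave R5 with its old value.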
60
[Table: ROB with columns Instruction, Flag, Value, Reg. Name, Type; row contents not recoverable]
i1: R1 ← R2/R3
i2: R4 ← R1 op R5
i3: R5 ← R6 op R7
i4: R1 ← R8 op R9
Ri    Rename(Ri)
R1    R1
R2    R2
R3    R3
R4    R4
R5    R5
R6    R6
R7    R7
R8    R8
R9    R9
[Animation overlay labels: R32, R35, R32, R32, R33, R33, R34, R34, R35]
Freelist: R32, R33, R34, R35, R36, ...
Baer p. 92
61
But...
  • Where do instructions wait before being executed?
  • How does an instruction know that it is ready to
    be executed?

Baer p. 93
62
Reservation Stations
  • After register renaming, the front-end dispatches
    the instruction to a reservation station.
  • Reservation stations can:
  • be grouped into a centralized queue called an
    instruction window, or
  • be associated with functional units according to
    the opcode.

Baer p. 93
63
Reservation Stations (cont.)
  • Each entry in the reservation station must
    contain:
  • Operation to be performed
  • Source operands (either the value or the physical
    name of the register); a flag indicates which one
  • Physical name of the result register
  • ROB entry where the result will be stored

Baer p. 93
64
Scheduling
  • Scheduling: selection of which instruction should
    execute next in a given execution unit
  • Oldest instruction
  • Critical instruction

Baer p. 93
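The oldest-instruction policy can be sketched as below. This is one possible reading, assuming each waiting instruction carries an age tag (smaller means older) and a ready flag; the function name `select_oldest_ready` and the tuple layout are hypothetical.

```python
def select_oldest_ready(entries):
    """Pick the oldest ready instruction.

    entries: list of (age, name, ready) tuples; a smaller age means older.
    Returns the chosen instruction name, or None if nothing is ready.
    """
    ready = [entry for entry in entries if entry[2]]
    if not ready:
        return None
    return min(ready, key=lambda entry: entry[0])[1]

window = [(3, "i4", True), (1, "i2", False), (2, "i3", True)]
print(select_oldest_ready(window))   # i3
```

i2 is the oldest instruction in the window, but it is not ready, so the scheduler picks i3. A critical-instruction policy would replace the age key with some estimate of how many later instructions depend on each candidate.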
65
Ready Bit
  • A ready bit is associated with each physical
    register.
  • When an instruction that uses a physical register
    Ri is dispatched:
  • if Ri is ready, pass the value of Ri to the
    reservation station and set the flag to true (ready)
  • if Ri is not ready, pass the name of Ri to the
    reservation station and set the flag to false (not
    ready)
  • When both flags are true, the instruction is
    ready to be issued.

Baer p. 93
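The dispatch rule above can be sketched as follows. The names `RSEntry` and `dispatch`, the dictionary-based register file, and the step that clears the ready bit of the destination register are assumptions made for this example.

```python
from dataclasses import dataclass, field

@dataclass
class RSEntry:
    op: str
    dest: str                                      # physical name of the result register
    operands: list = field(default_factory=list)   # value, or register name if not ready
    flags: list = field(default_factory=list)      # True means the operand holds a value

    def ready(self):
        return all(self.flags)

def dispatch(op, dest, srcs, ready_bit, regfile):
    """Build a reservation-station entry from the ready bits of the sources."""
    entry = RSEntry(op, dest)
    for reg in srcs:
        if ready_bit[reg]:
            entry.operands.append(regfile[reg])    # pass the value
            entry.flags.append(True)
        else:
            entry.operands.append(reg)             # pass the register name
            entry.flags.append(False)
    # Assumption: the result register is not ready until this instruction completes.
    ready_bit[dest] = False
    return entry

ready_bit = {"R32": False, "R5": True, "R33": True}
regfile = {"R5": 10}
entry = dispatch("add", "R33", ["R32", "R5"], ready_bit, regfile)
print(entry.operands, entry.flags, entry.ready())   # ['R32', 10] [False, True] False
```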
66
Ready Bit (cont.)
  • Upon completion, an instruction broadcasts the
    name and content of its result physical register
    to all reservation stations (RS).
  • Each RS that needs it grabs the content and
    updates its flags.

Baer p. 93
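A rough sketch of the broadcast step: the completing instruction sends the name and value of its result register to every reservation-station entry, and entries waiting on that name capture the value. The class `RSEntry` and the function `broadcast` are hypothetical names, and the sketch ignores bus arbitration and the ROB update.

```python
class RSEntry:
    def __init__(self, name, operands, flags):
        self.name = name
        self.operands = operands   # value if the flag is True, register name otherwise
        self.flags = flags

    def ready(self):
        return all(self.flags)

def broadcast(stations, reg_name, value):
    """Deliver a result to all entries; return the names of entries that became ready."""
    woken = []
    for entry in stations:
        was_ready = entry.ready()
        for i, (operand, flag) in enumerate(zip(entry.operands, entry.flags)):
            if not flag and operand == reg_name:
                entry.operands[i] = value   # grab the content
                entry.flags[i] = True       # update the flag
        if entry.ready() and not was_ready:
            woken.append(entry.name)
    return woken

stations = [RSEntry("i2", ["R32", 10], [False, True]),
            RSEntry("i3", [4, 6], [True, True])]
woken = broadcast(stations, "R32", 5)
print(woken, stations[0].operands)   # ['i2'] [5, 10]
```

The flag check matters: an operand that already holds a value is never compared against the broadcast name, so a value cannot be mistaken for a register name.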