Tomasulos Approach and Hardware Based Speculation - PowerPoint PPT Presentation

1 / 31
About This Presentation
Title:

Tomasulos Approach and Hardware Based Speculation

Description:

Hardware Schemes for ILP. Why in hardware at run time? ... write result to CDB through to ROB and any waiting reservation stations. ... – PowerPoint PPT presentation

Number of Views:554
Avg rating:3.0/5.0
Slides: 32
Provided by: thayeren
Category:

less

Transcript and Presenter's Notes

Title: Tomasulos Approach and Hardware Based Speculation


1
Tomasulos Approach andHardware Based Speculation
  • Vincent H. Berk
  • October 22nd
  • Reading for Today 3.1 3.7

2
Hardware Schemes for ILP
  • Why in hardware at run time?
  • Works when dependence is not known at run time
  • Simplifies compiler
  • Allows code for one machine to run well on
    another
  • Key idea Allow instructions behind stall to
    proceed
  • DIVD F0, F2, F4
  • ADDD F10, F0, F8
  • SUBD F12, F8, F14
  • Enables out-of-order execution ? out-of-order
    completion
  • ID stage checks for both structural hazards and
    data dependences

3
Hardware Schemes for ILP
  • Out-of-order execution divides ID stage
  • 1. Issue decode instructions, check for
    structural hazards
  • 2. Read operands wait until no data hazards,
    then read operands

4
Tomasulos Algorithm
  • For IBM 360/91 about 3 years after CDC 6600
  • Goal High performance without special compilers
  • Differences between IBM 360 CDC 6600 ISA
  • IBM has only 2 register specifiers/instruction
    vs. 3 in CDC 6600
  • IBM has 4 FP registers vs. 8 in CDC 6600
  • Differences between Tomasulos Algorithm
    Scoreboard
  • Control buffers (called reservation stations)
    distributed with functional units vs. centralized
    in scoreboard
  • Registers in instructions replaced by pointers to
    reservation station buffer
  • HW renaming of registers to avoid WAR, WAW
    hazards
  • Common data bus (CDB) broadcasts results to
    functional units
  • Load and stores treated as functional units as
    well
  • Alpha 21264, HP 8000, MIPS 10000, Pentium III,
    PowerPC 604, ...

5
Three Stages of Tomasulo Algorithm
  • 1. Issue Get instruction from FP operation
    queue
  • If reservation station free, issues instruction
    sends operands (renames registers).
  • 2. Execution Operate on operands (EX)
  • When operands ready then execute if not ready,
    watch common data bus for result.
  • 3. Write result Finish execution (WB)
  • Write on common data bus to all awaiting units
    mark reservation station available.
  • Common data bus data source (come from bus)

6
Tomasulo Organization
From Instruction Unit
FP Registers
From Memory

Load Buffers
FP Op Queue
Store Buffers
Operand Bus
To Memory
Operation Bus
FP Mul Res. Station
FP Add Res. Station
Reservation Stations
FP Adders
FP Multipliers
Common data bus (CDB)
7
Reservation Station Components
  • Op Operation to perform in the unit (e.g., or
    )
  • Qj, Qk Reservation stations producing source
    registers
  • Vj, Vk Value of source operands
  • Rj, Rk Flags indicating when Vj, Vk are ready
  • Busy Indicates reservation station and FU is
    busy
  • Register result status Indicates which
    functional unit will write each register, if one
    exists. Blank when no pending instructions will
    write that register.

8
Tomasulo Example Cycle 1
9
Tomasulo Example Cycle 2
10
Tomasulo Example Cycle 3
Register names are renamed in reservation
stations Load1 completing who is waiting for
Load1?
11
Tomasulo Example Cycle 4
Load2 completing who is waiting for it?
12
Tomasulo Example Cycle 5
13
Tomasulo Example Cycle 6
14
Tomasulo Summary
  • Reservation stations renaming to larger set of
    registers buffering source operands
  • Prevents registers as bottleneck
  • Avoids WAR, WAW hazards of scoreboard
  • Allows loop unrolling in HW
  • Not limited to basic blocks
  • (integer units get ahead, beyond branches)
  • Lasting Contributions
  • Dynamic scheduling
  • Register renaming
  • Load/store disambiguation
  • 360/91 descendants are Pentium III PowerPC 604
    MIPS R10000 HP-PA 8000 Alpha 21264

15
Hardware-Based Speculation
  • Instead of just instruction fetch and decode,
    also execute instructions based on prediction of
    branch.
  • Execute instructions out of order as soon as
    their operands are available.
  • Wait with instruction commit until branch is
    decided.
  • Re-order instructions after execution and commit
    them in order
  • reorder buffer or ROB
  • register file not updated until commit
  • Do not raise exceptions until instruction is
    committed
  • ROB holds and provides operands until commit.

16
Tomasulo with Speculation
  • Issue Empty reservation station and an empty
    ROB slot. Send operands to reservation station
    from register file or from ROB. This stage is
    often referred to as dispatch
  • Execute Monitor CDB for operands, check RAW
    hazards. When both operands are available, then
    execute.
  • Write Result When available, write result to
    CDB through to ROB and any waiting reservation
    stations. Stores write to value field in ROB.
  • Commit Three cases
  • Normal Commit write registers, in order commit
  • Store update memory
  • Incorrect branch flush ROB, reservation
    stations and restart execution at correct PC

17
(No Transcript)
18
Problems with speculation
  • Multi Issue Machines
  • Must be able to commit multiple instructions from
    ROB
  • More registers, more renaming
  • How much speculation
  • How many branches deep?
  • What to do on a cache miss?
  • TLB miss?
  • Cache interference due to incorrect branch
    prediction

19
Figure 3.41 Number of registers available for
renaming.
20
Figure 3.45 Window size the number of
instructions the issue unit may look ahead and
schedule from.
21
ILP Software Approaches 2
22
HW Support for More ILP
  • Avoid branch prediction by turning branches into
    conditionally executed instructions
  • If (X) then A B op C else NOP
  • If false, then neither store result nor cause
    exception
  • Expanded ISA of Alpha, MIPS, PowerPC, SPARC have
    conditional move PA-RISC can annul any
    following instruction.
  • IA-64 61 1-bit condition fields selected so
    conditional execution of any instruction
  • Drawbacks to conditional instructions
  • Still takes a clock even if annulled
  • Stall if condition evaluated late
  • Complex conditions reduce effectiveness
    condition becomes known late in pipeline

X
A B op C
23
Software Pipelining
  • Observation if iterations from loops are
    independent, then can get more ILP by taking
    instructions from different iterations
  • Software pipelining reorganizes loops so that
    each iteration is made from instructions chosen
    from different iterations of the original loop

Software-
iteration
24
SW Pipelining Example
1 LD F0, 0 (R1) LD F0, 0 (R1) 2 ADDD F4, F0,
F2 ADDD F4, F0, F2 3 SD 0 (R1), F4 LD F0, 8
(R1) 4 LD F6, 8 (R1) 1 SD 0 (R1), F4 Stores
Mi 5 ADDD F8, F6, F2 2 ADDD F4, F0, F2 Adds to
Mi-1 6 SD 8, (R1), F8 3 LD F0, 16 (R1) Loads
Mi-2 7 LD F10, 16 (R1) 4 SUBI R1, R1,
8 8 ADDD F12, F10, F2 5 BNEZ R1, LOOP 9 SD 16
(R1), F12 SD 0 (R1), F4 10 SUBI R1, R1,
24 ADDD F4, F0, F2 11 BNEZ R1, LOOP SD 8
(R1), F4
Read F4 Read F0 SD IF ID EX Mem WB Write
F4 ADD IF ID EX Mem WB LD IF ID EX Mem WB
Write F0
25
SW Pipelining Example
  • Symbolic Loop Unrolling
  • Smaller code space
  • Overhead paid only once vs. each iteration in
    loop unrolling
  • 100 iterations 25 loops with 4 unrolled
    iterations each

Software Pipelining
Number of overlapped operations
(a) Software pipelining
Time
Loop Unrolling
Number of overlapped operations
Time
(b) Loop unrolling
26
Trace Scheduling
  • Focus on critical path (trace selection)
  • Compiler has to decide what the critical path
    (the trace) is
  • Most likely basic blocks are put in the trace
  • Loops are unrolled in the trace
  • Now speed it up (trace compaction)
  • Focus on limiting instruction count
  • Branches are seen as jumps into or out of the
    trace
  • Problem
  • Significant overhead for parts that are not in
    the trace
  • Unclear if it is feasible in practice

27
Superblocks
  • Similar to Trace Scheduling but
  • Single entrance, multiple exits
  • Tail duplication
  • Handle cases that exited the superblock
  • Residual loop handling
  • Could in itself be a superblock
  • Problem
  • Code size
  • Worth the hassle?

28
(No Transcript)
29
Conditional instructions
  • Instruction that is executed depending on one of
    its arguments
  • Instruction is executed but results are not
    always written.
  • Should only be used for very small sequences,
    else use normal branch.

BNEZ R1, L ADDU R2, R3, R0 L CMOVZ R2,
R3, R1
30
Speculation
  • Compiler moves instructions before branch if
  • Data flow is not affected (optionally with use of
    renaming)
  • Preserve exception behavior
  • Avoid load/store address conflicts (no renaming
    for memory loc.)
  • Preserving exception behavior
  • Mechanism to indicate an instruction is
    speculative
  • Poison bit raise exception when value is used
  • Using Conditional instructions
  • Requires In-Order instruction commit
  • Register renaming
  • Writeback at commit
  • Forwarding
  • Raise exceptions at commit

31
Speculation
if (A0) AB else AA4 LD R1, 0(R3)
load A BNEZ R1, L1 test A LD R1, 0(R2)
then J L2 skip else L1 DADDI R1, R1, 4
else L2 SD R1, 0(R3) store A
LD R1, 0(R3) load A LD R14, 0(R2) load B
(speculative) BEQZ R1, L3 branch
if DADDI R14, R1, 4 else L3 SD R14, 0(R3)
store A
Write a Comment
User Comments (0)
About PowerShow.com