Instruction Level Parallelism and Dynamic Execution

Slides: 41

Provided by: Ran5172

1
Instruction Level Parallelism and Dynamic
Execution
2
Recall from Pipelining Review

Pipeline CPI = Ideal pipeline CPI + Structural Stalls + Data Hazard Stalls + Control Stalls
Ideal pipeline CPI: a measure of the maximum performance attainable by the implementation
Structural hazards: the hardware cannot support this combination of instructions
Data hazards: an instruction depends on the result of a prior instruction still in the pipeline
Control hazards: caused by the delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)
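
As a small worked example of the CPI equation above, the following Python sketch adds per-instruction stall contributions to the ideal CPI. The function name `pipeline_cpi` and the stall values are assumptions chosen for illustration; they are not measurements from the presentation.

```python
def pipeline_cpi(ideal_cpi, structural_stalls, data_hazard_stalls, control_stalls):
    """Pipeline CPI = ideal CPI + average stall cycles per instruction."""
    return ideal_cpi + structural_stalls + data_hazard_stalls + control_stalls


# Hypothetical stall rates (cycles per instruction), for illustration only.
cpi = pipeline_cpi(ideal_cpi=1.0,
                   structural_stalls=0.05,
                   data_hazard_stalls=0.20,
                   control_stalls=0.15)
print(f"Pipeline CPI: {cpi:.2f}")
```

Reducing any one stall term moves the pipeline CPI closer to the ideal CPI, which is the motivation for the scheduling techniques on the following slides.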

3
Data Hazards Review

RAW (read after write) hazard
  Instruction I occurs before instruction J in the program, but instruction J tries to read an operand before instruction I writes it, so J incorrectly gets the old value
Example
  I: LW R1, 0(R2)
  J: ADD R3, R1, R4
A RAW hazard is a true data dependence: there is a programmer-mandated flow of data from one instruction (the producer) to another (the consumer)
  Therefore, the consumer must wait for the producer to finish computing and writing its result

4
Data Hazards Review

WAW (write after write) hazard
  Instruction I occurs before instruction J in the program, but instruction J tries to write an operand before instruction I writes it, so the wrong order of writes causes the destination register to end up with the value from I rather than that from J
Example
  I: SUB R1, R2, R3
  J: ADD R1, R3, R4
A WAW hazard is not a true data dependence, but rather a kind of name dependence, called output dependence, because of the (avoidable?) same name of the destination registers
WAW hazards cannot occur in the classic 5-stage MIPS integer pipeline. Why?
  Registers are written in only one stage, the WB stage, and
  instructions enter the pipeline in order
However, we shall deal with situations where instructions may be executed out of order

5
Data Hazards Review

WAR (write after read) hazard
  Instruction I occurs before instruction J in the program, but instruction J tries to write an operand before instruction I reads it, so I incorrectly gets the later value
Example
  I: SUB R2, R1, R3
  J: ADD R1, R3, R4
A WAR hazard is not a true data dependence, but rather a kind of name dependence, called antidependence, because of the (avoidable?) shared name of two registers
WAR hazards cannot occur in the classic 5-stage MIPS integer pipeline. Why?
  Registers are read early and written late, and
  instructions enter the pipeline in order
However, we shall deal with situations where instructions may be executed out of order
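
The three hazard definitions can be checked mechanically by comparing register names. The sketch below is a minimal illustration in Python; the `Instr` type and the `hazards` function are hypothetical names introduced here, and the check only reports potential hazards from register names, not whether a given pipeline would actually stall.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    op: str
    dest: str      # register written
    srcs: tuple    # registers read


def hazards(i, j):
    """Potential hazards when instruction i precedes instruction j in program order."""
    found = set()
    if i.dest in j.srcs:
        found.add("RAW")   # j reads the register that i writes
    if i.dest == j.dest:
        found.add("WAW")   # both write the same register
    if j.dest in i.srcs:
        found.add("WAR")   # j writes a register that i reads
    return found


# The three example pairs from the Data Hazards Review slides.
raw = hazards(Instr("LW", "R1", ("R2",)), Instr("ADD", "R3", ("R1", "R4")))
waw = hazards(Instr("SUB", "R1", ("R2", "R3")), Instr("ADD", "R1", ("R3", "R4")))
war = hazards(Instr("SUB", "R2", ("R1", "R3")), Instr("ADD", "R1", ("R3", "R4")))
print(raw, waw, war)
```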

6
Why Dynamic Scheduling?
[Flowchart: static pipeline scheduling. Data hazard? No: pipeline processing. Yes: bypass possible? Yes: bypass or forwarding. No: stall instruction.]
Goal of ILP: to get as many instructions as possible executing in parallel while respecting dependencies
7
Recall Data Hazard Resolution: In-order issue, in-order completion
[Pipeline diagram: instruction order against time (clock cycles), with stage labels ALU and DMem and a bubble in the pipeline]
  lw r1, 0(r2)
  sub r4,r1,r6
  and r6,r2,r7
  or r8,r2,r9
Extend to multiple instruction issue? What if the load had a longer delay? Can the and instruction issue?
8
In-Order Issue, Out-of-order Completion

Which hazards are present? RAW? WAR? WAW?
  load r3 <- r1, r2
  add r1 <- r5, r2
  sub r3 <- r3, r1 or r3 <- r2, r1
Register Reservations
  At issue, mark the destination register busy until the instruction completes
  Check all register reservations before issue
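
The register reservation scheme can be sketched as a set of busy registers that is consulted before each issue. This is a minimal Python illustration under stated assumptions: the class `RegisterReservations` and its method names are hypothetical, and the model assumes operands are read at issue, so only pending writes need to be tracked.

```python
class RegisterReservations:
    """Tracks registers that have a write pending (hypothetical helper)."""

    def __init__(self):
        self.busy = set()

    def can_issue(self, dest, srcs):
        # A busy source would be a RAW hazard; a busy destination would be a WAW hazard.
        return dest not in self.busy and not any(s in self.busy for s in srcs)

    def issue(self, dest):
        self.busy.add(dest)        # mark destination busy until completion

    def complete(self, dest):
        self.busy.discard(dest)    # reservation released when the result is written


rr = RegisterReservations()
rr.issue("r3")                                   # load r3 <- r1, r2
add_ok = rr.can_issue("r1", ["r5", "r2"])        # add r1 <- r5, r2
rr.issue("r1")
sub_ok = rr.can_issue("r3", ["r3", "r1"])        # sub r3 <- r3, r1
print(add_ok, sub_ok)
```

In this sketch the add can issue while the load is outstanding, but the sub must wait because both its sources and its destination are still reserved.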

9
Advantages of Dynamic Scheduling

Handles cases when dependences are unknown at compile time (e.g., because they may involve a memory reference)
It simplifies the compiler
Allows code that was compiled for one pipeline to run efficiently on a different pipeline
Enables hardware speculation, a technique with significant performance advantages that builds on dynamic scheduling

10
HW Schemes: Instruction Parallelism

Key idea: allow instructions behind a stall to proceed
  DIVD F0,F2,F4
  ADDD F10,F0,F8
  SUBD F12,F8,F14
Enables out-of-order execution and allows out-of-order completion
We will distinguish when an instruction begins execution from when it completes execution; between the two times, the instruction is in execution
In a dynamically scheduled pipeline, all instructions pass through the issue stage in order (in-order issue)

11
Dynamic Scheduling: Step 1

Simple pipeline has 1 stage to check both structural and data hazards: Instruction Decode (ID), also called Instruction Issue
Split the ID pipe stage of the simple 5-stage pipeline into 2 stages
  Issue: decode instructions, check for structural hazards
  Read operands: wait until no data hazards, then read operands

12
A Dynamic Algorithm: Tomasulo's Algorithm

For IBM 360/91 (before caches!)
Goal: high performance without special compilers
Small number of floating point registers (4 in 360) prevented interesting compiler scheduling of operations
  This led Tomasulo to try to figure out how to get more effective registers: renaming in hardware!
Why study a 1966 computer?
The descendants of this have flourished!
  Alpha 21264, HP 8000, MIPS 10000, Pentium III, PowerPC 604, ...

13
Tomasulo Algorithm

Control and buffers distributed with Function Units (FU)
  FU buffers, called reservation stations, hold pending operands
Registers in instructions replaced by values or pointers to reservation stations (RS)
  A form of register renaming
  Avoids WAR, WAW hazards
  More reservation stations than registers, so can do optimizations compilers can't
Results go to FUs from RSs, not through registers, over a Common Data Bus that broadcasts results to all FUs
Loads and stores treated as FUs with RSs as well
Integer instructions can go past branches, allowing FP ops beyond the basic block in the FP queue
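
To illustrate why renaming removes WAR and WAW hazards, the Python sketch below gives every result a fresh tag and rewrites later sources to refer to the most recent tag. The tags `T1`, `T2`, ... and the function `rename` are assumptions for illustration; in Tomasulo's scheme the tags are reservation station names rather than a separate counter.

```python
from itertools import count


def rename(program):
    """Give each result a fresh tag; sources refer to the latest tag for their register."""
    latest = {}          # architectural register -> tag of its most recent writer
    tags = count(1)
    renamed = []
    for op, dest, srcs in program:
        # Sources with no pending writer keep their architectural name.
        new_srcs = tuple(latest.get(s, s) for s in srcs)
        tag = f"T{next(tags)}"
        latest[dest] = tag
        renamed.append((op, tag, new_srcs))
    return renamed


program = [
    ("SUB", "R2", ("R1", "R3")),   # reads R1
    ("ADD", "R1", ("R3", "R4")),   # WAR on R1 with the SUB
    ("MUL", "R1", ("R2", "R4")),   # WAW on R1 with the ADD, RAW on R2 with the SUB
]
renamed = rename(program)
for instr in renamed:
    print(instr)
```

After renaming, no two instructions write the same name and no instruction writes a name that an earlier one reads, so only the true (RAW) dependence from SUB to MUL remains.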

14
Tomasulo Organization
[Block diagram labels: FP Op Queue, FP Registers, Load Buffers (Load1 to Load6), Store Buffers, Reservation Stations (Add1 to Add3, Mult1 and Mult2), FP adders, FP multipliers, From Mem, To Mem, Common Data Bus (CDB)]
15
Reservation Station Components

Op: Operation to perform in the unit (e.g., + or -)
Vj, Vk: Values of source operands
  Store buffers have a V field: the result to be stored
Qj, Qk: Reservation stations producing the source registers (value to be written)
  Note: Qj, Qk = 0 => ready
  Store buffers only have Qi, for the RS producing the result
Busy: Indicates that the reservation station or FU is busy
Register result status: Indicates which functional unit will write each register, if one exists. Blank when there are no pending instructions that will write that register.
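
The fields listed above map naturally onto a small record type. The following Python sketch is one possible representation, not the layout used by any real machine: the class name, the use of `None` in place of the slide's 0 for "no producer", and the `capture` method are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReservationStation:
    name: str
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[float] = None   # value of first source operand, once known
    vk: Optional[float] = None   # value of second source operand, once known
    qj: Optional[str] = None     # station producing the first operand (None means ready)
    qk: Optional[str] = None     # station producing the second operand (None means ready)

    def ready(self):
        """An occupied station may execute once neither operand is pending."""
        return self.busy and self.qj is None and self.qk is None

    def capture(self, tag, value):
        """Pick up a broadcast result if this station is waiting on `tag`."""
        if self.qj == tag:
            self.vj, self.qj = value, None
        if self.qk == tag:
            self.vk, self.qk = value, None


add1 = ReservationStation("Add1", busy=True, op="SUBD", qj="Load1", vk=2.0)
before = add1.ready()
add1.capture("Load1", 5.0)       # hypothetical value broadcast by Load1
after = add1.ready()
print(before, after, add1.vj)
```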

16
Three Stages of Tomasulo Algorithm

1. Issue: get instruction from FP Op Queue
  If a reservation station is free (no structural hazard), control issues the instruction and sends operands (renames registers).
2. Execute: operate on operands (EX)
  When both operands are ready, execute; if not ready, watch the Common Data Bus for the result
3. Write result: finish execution (WB)
  Write on the Common Data Bus to all awaiting units; mark the reservation station available
Normal data bus: data + destination ("go to" bus)
Common data bus: data + source ("come from" bus)
  64 bits of data + 4 bits of Functional Unit source address
  Write if it matches the expected Functional Unit (produces result)
  Does the broadcast
Example speed: 2 clks for load, 3 clks for +/-, 10 clks for *, 40 clks for /
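
The three stages can be sketched as a small cycle-by-cycle loop. The Python code below is a simplified illustration, not a cycle-accurate model of the 360/91 or of the worked example on the following slides: it omits loads and stores, issues at most one instruction per cycle, allows one CDB broadcast per cycle, and uses hypothetical register values. The names `simulate`, `LATENCY`, `UNIT` and `APPLY` are assumptions introduced here.

```python
import operator

LATENCY = {"ADDD": 3, "SUBD": 3, "MULTD": 10, "DIVD": 40}   # clocks, as on the slide
UNIT = {"ADDD": "Add", "SUBD": "Add", "MULTD": "Mult", "DIVD": "Mult"}
APPLY = {"ADDD": operator.add, "SUBD": operator.sub,
         "MULTD": operator.mul, "DIVD": operator.truediv}


def simulate(program, regs, names=("Add1", "Add2", "Add3", "Mult1", "Mult2")):
    rs = {n: None for n in names}   # None means the station is free
    reg_status = {}                 # register -> station that will write it
    completed = []                  # destination registers in completion order
    pc = cycle = 0
    while pc < len(program) or any(e is not None for e in rs.values()):
        cycle += 1
        # 3. Write result: one finished station broadcasts on the CDB.
        finished = [n for n, e in rs.items() if e is not None and e["left"] == 0]
        if finished:
            tag = finished[0]
            e = rs[tag]
            value = APPLY[e["op"]](e["vj"], e["vk"])
            for w in rs.values():               # waiting stations capture the value
                if w is not None:
                    for q, v in (("qj", "vj"), ("qk", "vk")):
                        if w[q] == tag:
                            w[v], w[q] = value, None
            for r in [r for r, t in reg_status.items() if t == tag]:
                regs[r] = value                 # write only if still the latest writer
                del reg_status[r]
            rs[tag] = None                      # mark the station available
            completed.append(e["dest"])         # "dest" is kept only for this report
        # 2. Execute: stations with both operands ready count down.
        for e in rs.values():
            if e is not None and e["qj"] is None and e["qk"] is None and e["left"] > 0:
                e["left"] -= 1
        # 1. Issue: in program order, if a matching station is free.
        if pc < len(program):
            op, dest, s1, s2 = program[pc]
            free = next((n for n in rs if n.startswith(UNIT[op]) and rs[n] is None), None)
            if free is not None:
                entry = {"op": op, "dest": dest, "left": LATENCY[op]}
                for q, v, src in (("qj", "vj", s1), ("qk", "vk", s2)):
                    entry[q] = reg_status.get(src)              # pending producer, if any
                    entry[v] = None if entry[q] else regs[src]  # else read the value now
                rs[free] = entry
                reg_status[dest] = free                         # rename the destination
                pc += 1
    return completed, cycle


regs = {"F2": 8.0, "F4": 2.0, "F8": 1.0, "F14": 0.5}   # hypothetical initial values
program = [("DIVD", "F0", "F2", "F4"),
           ("ADDD", "F10", "F0", "F8"),
           ("SUBD", "F12", "F8", "F14")]
completed, cycles = simulate(program, regs)
print(completed, cycles)
```

With the three-instruction sequence from the earlier key-idea slide, the SUBD finishes first even though it was issued last, because the ADDD waits on the CDB for the DIVD result.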

17
Tomasulo Example
18
Tomasulo Example Cycle 1
19
Tomasulo Example Cycle 2
Note: Can have multiple loads outstanding
20
Tomasulo Example Cycle 3

Note: register names are removed (renamed) in Reservation Stations; MULT issued
Load1 completing; what is waiting for Load1?

21
Tomasulo Example Cycle 4

Load2 completing; what is waiting for Load2?

22
Tomasulo Example Cycle 5

Timer starts counting down for Add1, Mult1

23
Tomasulo Example Cycle 6

Issue ADDD here despite name dependency on F6?

24
Tomasulo Example Cycle 7

Add1 (SUBD) completing; what is waiting for it?

25
Tomasulo Example Cycle 8
26
Tomasulo Example Cycle 9
27
Tomasulo Example Cycle 10

Add2 (ADDD) completing; what is waiting for it?

28
Tomasulo Example Cycle 11

Write result of ADDD here?
All quick instructions complete in this cycle!

29
Tomasulo Example Cycle 12
30
Tomasulo Example Cycle 13
31
Tomasulo Example Cycle 14
32
Tomasulo Example Cycle 15

Mult1 (MULTD) completing; what is waiting for it?

33
Tomasulo Example Cycle 16

Just waiting for Mult2 (DIVD) to complete

34
After skipping a couple of cycles
35
Tomasulo Example Cycle 55
36
Tomasulo Example Cycle 56

Mult2 (DIVD) is completing; what is waiting for it?

37
Tomasulo Example Cycle 57

Once again: in-order issue, out-of-order execution and out-of-order completion.

38
Tomasulo Drawbacks

Complexity
  Delays of 360/91, MIPS 10000, Alpha 21264, IBM PPC 620 in CAAQA 2/e, but not in silicon!
Many associative stores (CDB) at high speed
Performance limited by Common Data Bus
  Each CDB must go to multiple functional units => high capacitance, high wiring density
  Number of functional units that can complete per cycle limited to one!
    Multiple CDBs => more FU logic for parallel associative stores
Non-precise interrupts!
  We will address this later

39
Superscalar Architecture

A superscalar processor executes more than one instruction during a clock cycle by simultaneously dispatching multiple instructions to redundant functional units on the processor.
Each functional unit is not a separate CPU core but an execution resource within a single CPU.

[Figures: superscalar pipeline; typical 5-stage pipeline]
40
Conclusion
Pipeline design and scheduling are techniques for achieving significant throughput improvements in modern CPUs.
[Figure: 20-stage pipeline]
