Structure of Computer Systems - PowerPoint PPT Presentation

About This Presentation

Title:

Structure of Computer Systems

Slides: 23

Provided by: sebes4

Transcript and Presenter's Notes

Title: Structure of Computer Systems

1
Structure of Computer Systems

Course 5
The Central Processing Unit - CPU

2
Solutions for hazard cases

Scoreboard method
Tomasulo's method
Branch prediction

3
Scoreboard method

General considerations (wiki):
first used in the CDC 6600 computer (1964)
dynamically schedules a pipeline so that instructions can execute out of order when there are no conflicts and the hardware is available (no structural hazard is present)
the data dependencies of every instruction are logged
instructions are released only when the scoreboard determines that there are no conflicts with previously issued, incomplete instructions
if an instruction is stalled because it is unsafe to continue, the scoreboard monitors the flow of executing instructions until all dependencies have been resolved, and only then issues the stalled instruction

4
Scoreboard method

Implementation of the scoreboard method
Every instruction goes through 4 stages:
Issue (ID1)
  decode the instruction
  check for structural and WAW hazards
  stall until structural and WAW hazards are resolved
Read operands (ID2)
  wait until there are no RAW hazards
  then read the operands
Execution (EX)
  operate on the operands
  may take multiple cycles; notify the scoreboard when done
Write result (WB)
  finish execution
  stall if there is a WAR hazard
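
The three data hazards checked in these stages can be illustrated with a small sketch. The `Instr` record and the `hazards` helper are hypothetical names introduced only for this example; they compare one earlier and one later instruction by register name and ignore timing.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Instr:
    dest: str     # destination register
    srcs: tuple   # source registers

def hazards(earlier: Instr, later: Instr) -> set:
    """Return the data hazards `later` has with respect to `earlier`."""
    found = set()
    if earlier.dest in later.srcs:
        found.add("RAW")   # later reads what earlier writes
    if later.dest in earlier.srcs:
        found.add("WAR")   # later overwrites what earlier still has to read
    if later.dest == earlier.dest:
        found.add("WAW")   # both write the same register
    return found

div = Instr(dest="F0", srcs=("F2", "F4"))
add = Instr(dest="F6", srcs=("F0", "F8"))
sub = Instr(dest="F8", srcs=("F10", "F14"))
print(hazards(div, add))   # RAW on F0: checked in Read operands
print(hazards(add, sub))   # WAR on F8: checked in Write result
```

In scoreboard terms, a WAW result would stall the later instruction at Issue, a RAW result at Read operands, and a WAR result at Write result.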

5
Scoreboard method

Scoreboard structure
Instruction status
  Indicates which of the 4 steps the instruction is in: ID1, ID2, EX, or WB
Functional unit status: indicates the state of the functional unit (FU)
  Busy: indicates whether the unit is busy or not
  Op: operation to perform in the unit (e.g., + or -)
  Fi: destination register
  Fj, Fk: source-register numbers
  Qj, Qk: functional units producing source registers Fj, Fk
  Rj, Rk: flags indicating when Fj, Fk are ready
Register result status
  Indicates which functional unit will write each register, if one exists
  Blank when no pending instruction will write that register
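
A minimal sketch of these tables follows, assuming a simplified model: the class and method names are hypothetical, the WAR check at write-back is omitted, and the Rj/Rk flags are not cleared after the operands are read.

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class FunctionalUnit:
    busy: bool = False
    op: Optional[str] = None
    fi: Optional[str] = None   # destination register
    fj: Optional[str] = None   # source registers
    fk: Optional[str] = None
    qj: Optional[str] = None   # units producing Fj, Fk (None: no pending producer)
    qk: Optional[str] = None
    rj: bool = False           # True when Fj, Fk are ready
    rk: bool = False

class Scoreboard:
    def __init__(self, unit_names):
        self.units: Dict[str, FunctionalUnit] = {n: FunctionalUnit() for n in unit_names}
        self.result: Dict[str, str] = {}   # register result status: register -> unit

    def can_issue(self, unit, dest):
        # structural hazard: unit busy; WAW hazard: a pending write to dest
        return not self.units[unit].busy and dest not in self.result

    def issue(self, unit, op, dest, src1, src2):
        if not self.can_issue(unit, dest):
            return False
        qj, qk = self.result.get(src1), self.result.get(src2)
        self.units[unit] = FunctionalUnit(busy=True, op=op, fi=dest, fj=src1, fk=src2,
                                          qj=qj, qk=qk, rj=qj is None, rk=qk is None)
        self.result[dest] = unit
        return True

    def can_read_operands(self, unit):
        fu = self.units[unit]
        return fu.rj and fu.rk             # no RAW hazard left

    def write_result(self, unit):
        # simplification: the WAR check is not modelled here
        fu = self.units[unit]
        for other in self.units.values():
            if other.qj == unit:
                other.qj, other.rj = None, True
            if other.qk == unit:
                other.qk, other.rk = None, True
        del self.result[fu.fi]
        self.units[unit] = FunctionalUnit()

sb = Scoreboard(["Mult", "Add"])
sb.issue("Mult", "MUL", "F0", "F2", "F4")
sb.issue("Add", "ADD", "F6", "F0", "F8")   # F0 is still being produced by Mult
print(sb.can_read_operands("Add"))         # Add waits on the RAW hazard
sb.write_result("Mult")
print(sb.can_read_operands("Add"))         # operands are now ready
```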

6
Scoreboard method

Speedup from the scoreboard
  1.7 for FORTRAN programs
  2.5 for hand-coded assembly language programs
Hardware
  The scoreboard requires approximately as much hardware as one FPU
  Main cost: buses (4 times the normal amount)
  The cost could be more severe for modern processors

7
Scoreboard and Tomasulo's algorithm

Issues with the scoreboard method:
  it does not solve structural hazards
  it has no forwarding logic
  it introduces stalls when a required functional unit is busy; the stall also affects the following instructions
Tomasulo's algorithm:
  avoids structural hazards and also resolves WAR and WAW dependencies through register renaming and a common data bus (CDB)
  first used in the IBM 360/91 computer (1967)
Register renaming: keeps multiple copies of the same architectural register
  avoids dependencies that are caused by the limited number of registers and not by a real data dependency
Common data bus: a result is put on a common bus as soon as it is available, avoiding an unnecessary stall until the data is written to the destination register
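
Register renaming can be sketched as follows. The `rename` function is a hypothetical helper and assumes an unlimited supply of physical registers named `p0`, `p1`, and so on; real hardware has a finite pool and must recycle it.

```python
from itertools import count

def rename(program, arch_regs):
    """Give every write a fresh physical register.

    program: list of (dest, src1, src2) architectural register names.
    Returns the same program expressed with physical register names.
    """
    fresh = count()
    mapping = {r: f"p{next(fresh)}" for r in arch_regs}   # current copy of each register
    renamed = []
    for dest, s1, s2 in program:
        p1, p2 = mapping[s1], mapping[s2]    # sources read the current copies
        mapping[dest] = f"p{next(fresh)}"    # the destination gets a new copy
        renamed.append((mapping[dest], p1, p2))
    return renamed

program = [
    ("F0", "F2", "F4"),   # F0 <- F2 op F4
    ("F6", "F0", "F8"),   # reads F0: a real (RAW) dependency, preserved
    ("F8", "F2", "F4"),   # WAR with the previous instruction on F8
    ("F0", "F8", "F2"),   # WAW with the first instruction on F0
]
renamed = rename(program, ["F0", "F2", "F4", "F6", "F8"])
for line in renamed:
    print(line)
```

After renaming, no two instructions write the same physical register, so the WAR and WAW conflicts disappear while the RAW chain through F0 is still visible as a shared physical name.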

8
Tomasulo's algorithm

Instruction stages:
Issue: an instruction is issued if the required functional unit and all operands are available; otherwise it is stalled and the next instruction is tested and, if possible, issued. If a real value is not yet available, a virtual value is used until the real value becomes available
Registers are renamed to avoid WAR and WAW hazards
Execute: the instruction is carried out once the necessary operands are available or present on the CDB; special care must be taken with Load and Store instructions, which require access to the memory
Write result: the result of the executed instruction is written back into the destination register, and Store operations are performed on the memory (see the commit stage later)

9
Tomasulo's algorithm

Reservation stations:
buffers that fetch and store instruction operands as soon as they are available
A reservation station holds the data and the result of an instruction
It points to registers (if the data is available) or to other reservation stations that will contain the necessary data as soon as it becomes available (before it is written back to the register)
The reservation station stores the result of an instruction's execution and releases the functional unit as soon as the instruction has executed; the result then becomes available to the other reservation stations. In this way WAR and RAW stalls are avoided
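
The way a reservation station waits for an operand and captures it from the common data bus can be sketched as below. The `Station` class, its field names (`vj`, `vk` for values, `qj`, `qk` for producer tags) and the `broadcast` helper are assumptions made for illustration, not a description of a specific machine.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Station:
    name: str
    op: str
    vj: Optional[float] = None   # operand values, once known
    vk: Optional[float] = None
    qj: Optional[str] = None     # station that will produce the missing operand
    qk: Optional[str] = None

    def ready(self):
        return self.qj is None and self.qk is None

    def snoop(self, tag, value):
        """Capture a result broadcast on the common data bus."""
        if self.qj == tag:
            self.vj, self.qj = value, None
        if self.qk == tag:
            self.vk, self.qk = value, None

def broadcast(stations, tag, value):
    """Put (tag, value) on the bus; every waiting station sees it."""
    for s in stations:
        s.snoop(tag, value)

mul = Station("Mult1", "MUL", vj=2.0, vk=3.0)
add = Station("Add1", "ADD", vk=1.0, qj="Mult1")   # waits for Mult1's result
print(add.ready())                                  # still waiting
broadcast([mul, add], "Mult1", mul.vj * mul.vk)     # result goes out on the CDB
print(add.ready(), add.vj + add.vk)                 # operand captured without a register read
```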

10
Tomasulo's algorithm

To avoid structural hazards, redundant functional units are used, such as multiple integer ALUs, floating-point ALUs or address-computing ALUs
Example: the P6 architecture (Pentium II and III) contains 7 ALUs: 2 IEU, 1 FEU, 1 MMX, 3 AGU
In front of every functional unit, a buffer or a list may store the requests (instructions) destined for that unit; e.g. the NetBurst architecture (Pentium 4) has a list of requests for every reservation station
In this way every functional unit is scheduled in advance and can work almost without stalling

11
Tomasulo's algorithm

Commit: an extra stage in the instruction execution sequence, besides issue, execute and write result
Used to further improve Tomasulo's solution
In the Write result stage the result is written to the re-order buffer (ROB) and not directly to the destination register or memory; all data in the ROB may be used by other instructions, so some stall periods may be avoided
Re-order buffer (ROB): used to commit instructions that were executed out of order
Contains data about the instructions in their original order; some entries may be filled in ahead of time as a result of out-of-order execution
The instructions are committed in their original order
The ROB is useful for roll-back procedures in case of a branch misprediction or an exception
In the commit stage, data from the re-order buffer is copied into the real registers or into the memory in the order specified by the program and not in the order of execution
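
A minimal sketch of in-order commit through a re-order buffer follows. The `ReorderBuffer` class and its dictionary-based entries are hypothetical; memory stores and exceptions are left out.

```python
from collections import deque

class ReorderBuffer:
    def __init__(self):
        self.entries = deque()   # program order, oldest first
        self.registers = {}      # the real (architectural) registers

    def allocate(self, dest):
        entry = {"dest": dest, "value": None, "done": False}
        self.entries.append(entry)
        return entry

    def write_result(self, entry, value):
        # may happen out of order
        entry["value"], entry["done"] = value, True

    def commit(self):
        """Retire finished instructions from the head, in program order."""
        committed = []
        while self.entries and self.entries[0]["done"]:
            e = self.entries.popleft()
            self.registers[e["dest"]] = e["value"]
            committed.append(e["dest"])
        return committed

    def flush(self):
        """Roll back: discard every uncommitted entry."""
        self.entries.clear()

rob = ReorderBuffer()
a = rob.allocate("F0")
b = rob.allocate("F2")
rob.write_result(b, 5.0)   # the younger instruction finishes first
first = rob.commit()       # nothing commits: the head (F0) is not done yet
rob.write_result(a, 1.0)
second = rob.commit()      # both commit, in original order
print(first, second, rob.registers)
```

Because the registers are only updated at commit, calling `flush` after a misprediction leaves them untouched by the discarded instructions.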

12
Branch prediction

A method for solving control hazards
Problem: a branch in the program disturbs pipeline execution; if the branch is taken, the pipeline must be flushed and refilled with instructions from the target address
Principle: try to guess the direction of a branch instruction (mainly a conditional branch) and load the pipeline with instructions from the correct path
Methods:
Static prediction: based on the nature of the branch instruction
Dynamic prediction: takes into consideration the history of the branch instructions (whether they were taken or not in the past may predict their future behavior)

13
Branch prediction

Static prediction: based on the nature of the branch instruction
Cases:
  Procedure calls: taken
  Unconditional jumps: taken
  Backward branches: taken (considered to be loops in the program)
  Forward branches: not taken (considered to be exceptions from normal execution)
Advantages:
  it is simple and fast
  works well for programs having many loops
Drawback:
  does not work well if there are a lot of conditional jumps
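
The static rules above fit in a few lines. The function name and the three `kind` labels are assumptions chosen for this sketch.

```python
def static_predict(kind, branch_addr, target_addr):
    """Return True if the branch is predicted taken.

    kind: "call", "jump" (unconditional) or "cond" (conditional branch).
    """
    if kind in ("call", "jump"):
        return True
    # conditional branch: a backward target is assumed to close a loop
    return target_addr < branch_addr

print(static_predict("cond", 0x200, 0x1F0))   # backward branch
print(static_predict("cond", 0x200, 0x240))   # forward branch
```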

14
Branch prediction

Dynamic prediction: takes into consideration the history of the branch instructions
Principle: use previous executions of a conditional jump in order to better predict its next executions
Methods:
Next line predictor: stores the pointer to the next instruction (or group of instructions, if multiple instructions are fetched at the same time); the method stores the decision as well as the target (pointer) of the branch
Saturating counters: store the decisions made before in 1 or 2 bits; a 2-bit counter has 4 states:
  Strongly not taken (00): not taken is predicted
  Weakly not taken (01): not taken is predicted
  Weakly taken (10): taken is predicted
  Strongly taken (11): taken is predicted
  every occurrence of the branch updates the state of the counter
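
The 2-bit saturating counter can be sketched as a small state machine. The class name, the `mispredictions` helper and the initial state (weakly not taken) are assumptions for this example.

```python
class TwoBitCounter:
    def __init__(self, state=0b01):
        self.state = state   # 00, 01, 10 or 11

    def predict(self):
        return self.state >= 0b10   # 10 and 11 predict taken

    def update(self, taken):
        if taken:
            self.state = min(self.state + 1, 0b11)   # saturate at strongly taken
        else:
            self.state = max(self.state - 1, 0b00)   # saturate at strongly not taken

def mispredictions(outcomes, counter):
    misses = 0
    for taken in outcomes:
        if counter.predict() != taken:
            misses += 1
        counter.update(taken)
    return misses

# a loop branch taken 9 times and then not taken, with the loop run 3 times
loop = ([True] * 9 + [False]) * 3
print(mispredictions(loop, TwoBitCounter()))
```

After the first pass, the only misprediction per loop execution is the exit, because a single not-taken outcome moves the counter from strongly taken to weakly taken and the prediction stays "taken".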

15
Branch prediction

Dynamic prediction methods (cont.)
store the decision and the target address for every executed conditional jump in a BHT (Branch History Table) and a BTB (Branch Target Buffer); this information helps predict the next executions of the same instructions with approx. 90% probability
the BHT and BTB are indexed with the least significant bits of the address (of the PC); the number of bits used determines the size of the tables
Two-level adaptive predictor
  necessary for alternating and nested conditional jumps
  idea: memorize jump sequence patterns
  prediction is based on a pattern of taken (1) and not taken (0) branches

a two-level adaptive predictor with an n-bit history can predict any repetitive sequence with any period if all n-bit sub-sequences are different
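
A sketch of a two-level adaptive predictor for a single branch follows: an n-bit history register selects one 2-bit counter in a pattern table. The class name, the counter initial value and the helper `count_misses` are assumptions for this example.

```python
class TwoLevelPredictor:
    def __init__(self, history_bits):
        self.n = history_bits
        self.history = 0                       # last n outcomes, newest in bit 0
        self.pht = [1] * (1 << history_bits)   # 2-bit counters, start weakly not taken

    def predict(self):
        return self.pht[self.history] >= 2

    def update(self, taken):
        c = self.pht[self.history]
        self.pht[self.history] = min(c + 1, 3) if taken else max(c - 1, 0)
        mask = (1 << self.n) - 1
        self.history = ((self.history << 1) | int(taken)) & mask

def count_misses(predictor, outcomes):
    misses = 0
    for taken in outcomes:
        misses += predictor.predict() != taken
        predictor.update(taken)
    return misses

pattern = [True, True, False]   # repeating 1 1 0: its 2-bit sub-sequences 11, 10, 01 all differ
p = TwoLevelPredictor(history_bits=2)
count_misses(p, pattern * 10)                   # warm-up
steady_misses = count_misses(p, pattern * 10)   # measured after warm-up
print(steady_misses)
```

With a 2-bit history every position in the repeating pattern is identified by a distinct history value, so after warm-up each counter sees only one outcome.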

16
Branch prediction

Dynamic prediction methods (cont.)
Local branch prediction
  a separate history buffer for each conditional jump instruction
  it may use a 2-level branch predictor with a common or an individual pattern history table
  Pentium II and III have local branch predictors with a local 4-bit history and a local pattern history table with 16 entries for each conditional jump
Global branch predictor
  keeps a shared (global) history of all conditional jumps
  any correlation between two branches is used for prediction
  poor results if the branches are not correlated
  usually not as good as local predictors
  variants:
    gshare predictor
    gselect predictor

17
Branch prediction

Dynamic prediction methods (cont.)
Global branch predictor, possible implementation: a two-level adaptive predictor with a globally shared history buffer and pattern history table
  gshare predictor: the index into the prediction history table is an XOR between the global history buffer and the jump address
  gselect predictor: the index is obtained by concatenating the history buffer and the jump's address
  Pentium M, Core 2 and AMD processors use global branch prediction
Combinations of local and global predictors:
  Alloyed branch prediction: concatenates the local and global branch history buffers, sometimes also with the address of the jump
  Agree predictor: makes an XOR between the local and global predictors (used in Pentium 4)
  Hybrid predictor: a combination of predictors; the result is selected through voting or taken from the predictor with the best hit rate
Loop predictor: detects whether a conditional jump is a loop (taken N-1 times and not taken once); it may use a counter for the loop and may be part of a hybrid predictor
Prediction of indirect jumps: used when the target of a jump has multiple choices; stores the previous targets and more bits in the prediction history buffer for such a jump
Prediction of function returns: stores a copy of the stack that contains the return addresses of the executed functions
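
The gshare and gselect index computations mentioned above can be sketched as bit operations. The function names and the choice of using the lowest address bits are assumptions; real designs typically skip alignment bits and pick their own widths.

```python
def gshare_index(pc, history, bits):
    """XOR the low `bits` of the jump address with the global history."""
    mask = (1 << bits) - 1
    return (pc ^ history) & mask

def gselect_index(pc, history, pc_bits, hist_bits):
    """Concatenate low address bits (upper part) with history bits (lower part)."""
    pc_part = pc & ((1 << pc_bits) - 1)
    hist_part = history & ((1 << hist_bits) - 1)
    return (pc_part << hist_bits) | hist_part

pc = 0b10110110        # hypothetical jump address
history = 0b1100       # hypothetical global history, newest outcome in bit 0
print(bin(gshare_index(pc, history, bits=4)))
print(bin(gselect_index(pc, history, pc_bits=2, hist_bits=2)))
```

For the same table size, gshare mixes all history bits with all address bits, while gselect has to split the index width between the two.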

18
Branch prediction

Correlated prediction
an example of a combination of local and global prediction
how it works:
  every entry in the history table has 4 predictors (e.g. 2-bit counters)
  the 2-bit global history buffer selects between the 4 predictors
  the state of the selected predictor is updated according to the decision made
  the global branch history gives the context, and the local predictors store the behavior of the different jump instructions
(2,2) predictor: 2-bit counters and a 2-bit history buffer
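
A (2,2) predictor can be sketched as a table whose entries each hold four 2-bit counters, with the 2-bit global history choosing among them. The class name, table size, indexing by `pc % entries` and the two example addresses are assumptions for illustration.

```python
class CorrelatingPredictor:
    """(2,2) predictor: 2 bits of global history select one of four 2-bit counters."""

    def __init__(self, entries=1024):
        self.entries = entries
        self.table = [[1, 1, 1, 1] for _ in range(entries)]   # start weakly not taken
        self.global_history = 0                                 # last two outcomes

    def _counters(self, pc):
        return self.table[pc % self.entries]

    def predict(self, pc):
        return self._counters(pc)[self.global_history] >= 2

    def update(self, pc, taken):
        counters = self._counters(pc)
        c = counters[self.global_history]
        counters[self.global_history] = min(c + 1, 3) if taken else max(c - 1, 0)
        self.global_history = ((self.global_history << 1) | int(taken)) & 0b11

# branch B (at 0x20) always goes the same way as the branch A (at 0x10) just before it
pattern = [True, False, False, True, True, False, True, False]
predictor = CorrelatingPredictor()
misses_b = 0
for i, a in enumerate(pattern * 20):
    predictor.update(0x10, a)               # branch A resolves
    if i >= len(pattern) * 10:              # skip the warm-up half
        misses_b += predictor.predict(0x20) != a
    predictor.update(0x20, a)               # branch B follows A's outcome
print(misses_b)
```

Branch B on its own looks irregular, but the newest history bit (A's outcome) tells the predictor which of B's four counters to consult, which is the context the global history is meant to provide.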

19
Misprediction statistics for SPEC tests
1. 4096 entries, 2-bit BHT
2. Unlimited entries, 2-bit BHT
3. 1024 entries, local and global prediction, (2,2) BHT
Configurations 1 and 3 require the same amount of memory: 8 Kbits
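
The equal storage of the first and third configurations can be checked with a short calculation. The helper `bht_bits` is a hypothetical name; it assumes one counter per history pattern in each entry.

```python
def bht_bits(entries, counter_bits, history_bits=0):
    """Storage of a branch history table with 2**history_bits counters per entry."""
    return entries * (2 ** history_bits) * counter_bits

plain = bht_bits(4096, 2)                        # 4096 entries, one 2-bit counter each
correlated = bht_bits(1024, 2, history_bits=2)   # 1024 entries, four 2-bit counters each
print(plain, correlated)                         # both are 8192 bits = 8 Kbits
```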
20
Branch prediction

Tournament predictor
2-bit local predictors fail on important branches; by adding global information, performance may be improved
Tournament predictors use two predictors, one based on global information and one based on local information, and combine them with a selector
The aim is to select the right predictor for the right branch (or for the right context of a branch)
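
A tournament predictor can be sketched with three tables of 2-bit counters: a local one indexed by the branch address, a global one indexed by the global history, and a selector per branch. All names, table sizes and the rule of training the selector only when the two predictors disagree are assumptions for this example.

```python
def bump(counter, up):
    """Move a 2-bit saturating counter up or down."""
    return min(counter + 1, 3) if up else max(counter - 1, 0)

class TournamentPredictor:
    def __init__(self, entries=256, history_bits=4):
        self.entries = entries
        self.hist_mask = (1 << history_bits) - 1
        self.local = [1] * entries               # indexed by branch address
        self.glob = [1] * (1 << history_bits)    # indexed by global history
        self.selector = [1] * entries            # >= 2 means "trust the global predictor"
        self.history = 0

    def _components(self, pc):
        i = pc % self.entries
        return i, self.local[i] >= 2, self.glob[self.history] >= 2

    def predict(self, pc):
        i, local_pred, global_pred = self._components(pc)
        return global_pred if self.selector[i] >= 2 else local_pred

    def update(self, pc, taken):
        i, local_pred, global_pred = self._components(pc)
        if local_pred != global_pred:            # train the selector only on disagreement
            self.selector[i] = bump(self.selector[i], global_pred == taken)
        self.local[i] = bump(self.local[i], taken)
        self.glob[self.history] = bump(self.glob[self.history], taken)
        self.history = ((self.history << 1) | int(taken)) & self.hist_mask

# an alternating branch defeats a lone 2-bit counter, but its history pattern is regular
outcomes = [True, False] * 20
t = TournamentPredictor()
misses = 0
for n, taken in enumerate(outcomes):
    if n >= 20:                                  # count only after warm-up
        misses += t.predict(0x40) != taken
    t.update(0x40, taken)
print(misses)
```

In this run the selector moves toward the global predictor because the local counter keeps oscillating between its two middle states while the history-indexed counters settle.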

21
Misprediction statistics
22
Branch prediction