Superscalar Processors by - PowerPoint PPT Presentation

1
Superscalar Processors by
  • Sherri Sparks

2
Overview
  1. What are superscalar processors?
  2. Program Representation, Dependencies, & Parallel
    Execution
  3. Microarchitecture of a typical superscalar
    processor
  4. A look at 3 superscalar implementations
  5. Conclusion: the future of superscalar processing

3
What are superscalars and how do they differ
from pipelines?
  • In simple pipelining, you are limited to fetching
    a single instruction into the pipeline per clock
    cycle. This causes a performance bottleneck.
  • Superscalar processors overcome the one-instruction-per-cycle
    limit of simple pipelines and can fetch multiple
    instructions during the same clock cycle. They
    also employ advanced techniques like branch
    prediction to ensure an uninterrupted stream of
    instructions.

4
Development History of Superscalars
  • Pipelining was developed in the late 1950s and
    became popular in the 1960s.
  • Examples of early pipelined architectures are the
    CDC 6600 and the IBM 360/91 (Tomasulo's
    algorithm)
  • Superscalars appeared in the mid to late 1980s

5
Instruction Processing Model
  • Need to maintain software compatibility.
  • The assembly (binary) instruction set was the
    level chosen for maintaining compatibility,
    because it does not affect existing software.
  • Need to maintain at least a semblance of a
    sequential execution model for programmers who
    rely on the concept of sequential execution in
    software design.
  • A superscalar processor may execute instructions
    out of order at the hardware level, but execution
    must appear sequential at the programming
    level.

6
Superscalar Implementation
  • Instruction fetch strategies that simultaneously
    fetch multiple instructions, often by using branch
    prediction techniques.
  • Methods for determining data dependencies and
    keeping track of register values during execution.
  • Methods for issuing multiple instructions in
    parallel.
  • Resources for parallel execution of many
    instructions, including multiple pipelined
    functional units and memory hierarchies capable
    of simultaneously servicing multiple memory
    references.
  • Methods for communicating data values through
    memory via load and store instructions.
  • Methods for committing the process state in
    correct order, to maintain the outward
    appearance of sequential execution.

7
From Sequential to Parallel
  • Parallel execution often results in instructions
    completing non-sequentially.
  • Speculative execution means that some
    instructions may be executed when they would not
    have been executed at all according to the
    sequential model (i.e. incorrect branch
    prediction).
  • To maintain the outward appearance of sequential
    execution for the programmer, storage cannot be
    updated immediately. The results must be held in
    temporary status until the storage is updated.
    Meanwhile, these temporary results must be usable
    by dependent instructions.
  • When it is determined that the sequential model
    would have executed an instruction, the temporary
    results are made permanent by updating the
    outward state of the machine. This process is
    called committing the instruction.

8
Dependencies
  • Parallel execution introduces 2 types of
    dependencies:
  • Control dependencies, due to incrementing or
    updating the program counter in response to
    conditional branch instructions.
  • Data dependencies, due to resource contention as
    instructions may need to read / write the same
    storage or memory locations.

9
Overcoming Control Dependencies Example
  L2:  mov r3,r7
       lw  r8,(r3)
       add r3,r3,4
       lw  r9,(r3)
       ble r8,r9,L3
       mov r3,r7
       sw  r9,(r3)
       add r3,r3,4
       sw  r8,(r3)
       add r5,r5,1
  L3:  add r6,r6,1
       add r7,r7,4
       blt r6,r4,L2
  • The code falls into three basic blocks (Block 1:
    L2 through the ble; Block 2: the not-taken path;
    Block 3: from L3). Blocks are initiated into the
    window of execution.
10
Control Dependencies: Branch Prediction
  • To gain the most parallelism, control
    dependencies due to conditional branches have to
    be overcome.
  • Branch prediction attempts to overcome this by
    predicting the outcome of a branch and
    speculatively fetching and executing instructions
    from the predicted path.
  • If the predicted path is correct, the speculative
    status of the instructions is removed and they
    affect the state of the machine like any other
    instruction.
  • If the predicted path is wrong, then recovery
    actions are taken so as not to incorrectly modify
    the state of the machine.

11
Data Dependencies
  • Data dependencies occur because instructions may
    access the same register or memory location
  • 3 types of data dependencies or hazards:
  • RAW (read after write): occurs because a later
    instruction can only read a value after a
    previous instruction has written it.
  • WAR (write after read): occurs when an
    instruction needs to write a new value into a
    storage location but must wait until all
    preceding instructions needing to read the old
    value have done so.
  • WAW (write after write): occurs when multiple
    instructions update the same storage location;
    it must appear that these updates occur in the
    proper sequence.

12
Data Dependency Example
  mov r3,r7
  lw  r8,(r3)
  add r3,r3,4
  lw  r9,(r3)
  ble r8,r9,L3

  • This sequence contains all three hazard types
    (shown as arrows on the slide): RAW (lw r8,(r3)
    reads the r3 written by mov), WAR (add r3,r3,4
    overwrites the r3 read by lw r8,(r3)), and WAW
    (mov and add both write r3).
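The three hazard categories above can be detected mechanically from each instruction's destination and source registers. The sketch below is illustrative only (the function and tuple encoding are my own, not from the slides):

```python
# Illustrative hazard classifier: each instruction is encoded as
# (dest_register_or_None, {source_registers}).
def hazards(earlier, later):
    found = set()
    e_dst, e_src = earlier
    l_dst, l_src = later
    if e_dst is not None and e_dst in l_src:
        found.add("RAW")   # later reads what earlier writes
    if l_dst is not None and l_dst in e_src:
        found.add("WAR")   # later overwrites a register earlier still reads
    if e_dst is not None and e_dst == l_dst:
        found.add("WAW")   # both write the same location
    return found

# From the slide: "mov r3,r7" then "lw r8,(r3)" share r3
print(hazards(("r3", {"r7"}), ("r8", {"r3"})))  # -> {'RAW'}
```

Register renaming (slide 23) removes the WAR and WAW cases this function reports, leaving only the true RAW dependencies.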
13
Parallel Execution Method
  1. Instructions are fetched using branch prediction
    to form a dynamic stream of instructions
  2. Instructions are examined for dependencies and
    dependencies are removed
  3. Examined instructions are dispatched to the
    window of execution. (These instructions are no
    longer in sequential order, but are ordered
    according to their data dependencies.)
  4. Instructions are issued from the window in an
    order determined by their dependencies and
    hardware resource availability.
  5. Following execution, instructions are put back
    into their sequential program order and then
    committed so their results update the machine
    state.

14
Superscalar Microarchitecture
  • Parallel execution method, summarized in 5 phases:
  • 1. Instruction Fetch and Branch Prediction
  • 2. Decode and Register Dependence Analysis
  • 3. Issue and Execution
  • 4. Memory Operation Analysis and Execution
  • 5. Instruction Reorder and Commit

15
Superscalar Microarchitecture
16
Instruction Fetch and Branch Prediction
  • Fetch phase must fetch multiple instructions per
    cycle from cache memory to keep a steady feed of
    instructions going to the other stages.
  • The number of instructions fetched per cycle
    should match or be greater than the peak
    instruction decode/execution rate (to allow for
    cache misses or occasions where the maximum
    number of instructions can't be fetched).
  • For conditional branches, fetch mechanism must be
    redirected to fetch instructions from branch
    targets.
  • 4 steps to processing conditional branch
    instructions:
  • 1. Recognizing that an instruction is a
    conditional branch
  • 2. Determining the branch outcome (taken or not
    taken)
  • 3. Computing the branch target
  • 4. Transferring control by redirecting
    instruction fetch (as in the case of a taken
    branch)

17
Processing Conditional Branches
  • STEP 1: Recognizing Conditional Branches
  • Instruction decode information is held in the
    instruction cache. These extra bits are used to
    identify the basic instruction types.

18
Processing Conditional Branches
  • STEP 2: Determining Branch Outcome
  • Static predictions (information determined from
    the static binary). E.g. certain opcode types might
    result in more branches taken than others, or a
    backward branch direction might be more likely
    in loops.
  • Predictions based on profiling information
    (execution statistics collected during a previous
    run of the program).
  • Dynamic Predictions (information gathered during
    program execution about past history of branch
    outcomes). Branch history outcomes are stored in
    a branch history table or a branch prediction
    table.
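The classic dynamic scheme keeps a 2-bit saturating counter per table entry, indexed by the branch's address. A minimal sketch, assuming this particular organization (the class name and table size are illustrative, not from the slides):

```python
# Sketch of a dynamic predictor: a table of 2-bit saturating counters
# indexed by the low bits of the branch PC.
# Counter states 0-1 predict not taken; 2-3 predict taken.
class TwoBitPredictor:
    def __init__(self, table_bits=10):
        self.mask = (1 << table_bits) - 1
        self.table = [1] * (1 << table_bits)  # start weakly not-taken

    def predict(self, pc):
        return self.table[pc & self.mask] >= 2

    def update(self, pc, taken):
        # Saturate: repeated outcomes push the counter to 0 or 3.
        i = pc & self.mask
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)
```

The two-bit hysteresis means a single anomalous outcome (e.g. a loop exit) does not flip a strongly-established prediction.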

19
Processing Conditional Branches
  • STEP 3: Computing Branch Targets
  • Branch targets are usually relative to the
    program counter and are computed as
  • branch target = program counter + offset
  • Finding target addresses can be sped up by having
    a branch target buffer, which holds the target
    address used the last time the branch was
    executed.
  • Ex: the Branch Target Address Cache used in the
    PowerPC 604.
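A branch target buffer amounts to a small cache keyed by the branch's address. A minimal sketch under that assumption (the function name is illustrative):

```python
# Sketch of a branch target buffer: maps a branch's PC to the target
# address used the last time the branch executed.
btb = {}

def lookup_target(pc, offset):
    if pc in btb:
        return btb[pc]            # hit: no need to wait for the adder
    target = pc + offset          # branch target = program counter + offset
    btb[pc] = target              # cache the computed target
    return target
```

A real BTB has a fixed number of entries with tags and an eviction policy; the dictionary here stands in for that hardware structure.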

20
Processing Conditional Branches
  • STEP 4: Transferring Control
  • Problem: there is often a delay in recognizing a
    branch, modifying the program counter, and
    fetching the target instructions.
  • Several solutions:
  • Use the stockpiled instructions in the
    instruction buffer to mask the delay.
  • Use a buffer that contains instructions from both
    the taken and not-taken branch paths.
  • Delayed branches: the branch does not take effect
    until the instruction after the branch. This
    allows the fetch of target instructions to
    overlap execution of the instruction following
    the branch. Delayed branches also introduce
    assumptions about pipeline structure, and are
    therefore rarely used anymore.

21
Instruction Decoding, Renaming, Dispatch
  • Instructions are removed from the fetch buffers,
    decoded and examined for control and data
    dependencies.
  • Instructions are dispatched to buffers associated
    with hardware functional units for later issuing
    and execution.

22
Instruction Decoding
  • The decode phase sets up execution tuples for
    each instruction.
  • An execution tuple contains:
  • An operation to be executed
  • The identities of storage elements where input
    operands will eventually reside
  • The locations where an instruction's result must
    be placed

23
Register Renaming
  • Used to eliminate WAW and WAR dependencies.
  • 2 types:
  • Physical register file: larger than the logical
    register file, with a mapping table used to
    associate physical register values with logical
    register values. Physical registers are assigned
    from a free list.
  • Reorder buffer: uses the same size physical and
    logical register files. There is also a reorder
    buffer that contains 1 entry per active
    instruction and maintains the sequential ordering
    of instructions. It is a circular queue
    implemented in hardware. As instructions are
    dispatched, they enter the queue at the tail. As
    instructions complete, their results are inserted
    into their assigned locations in the reorder
    buffer. When an instruction reaches the head of
    the queue, its entry is removed and its result is
    placed in the register file.
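The first renaming method (map table plus free list) can be sketched as follows. This is a minimal illustration, assuming 8 logical and 16 physical registers (class and method names are my own, not from the slides):

```python
# Sketch of register renaming with a physical register file larger
# than the logical file, a map table, and a free list.
class Renamer:
    def __init__(self, n_logical=8, n_physical=16):
        # Initially, logical register ri maps to physical register pi.
        self.map = {f"r{i}": f"p{i}" for i in range(n_logical)}
        self.free = [f"p{i}" for i in range(n_logical, n_physical)]

    def rename(self, dest, sources):
        # Sources read the current mapping; the destination gets a
        # fresh physical register, removing WAR and WAW hazards.
        srcs = [self.map[s] for s in sources]
        phys = self.free.pop(0)
        self.map[dest] = phys
        return phys, srcs
```

Two successive writes to the same logical register receive distinct physical registers, so they can complete in either order; freeing a physical register once no consumer needs it is omitted here for brevity.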

24
Register Renaming I
25
Register Renaming II(using a reorder buffer)
26
Instruction Issuing and Parallel Execution
  • Instruction issuing is defined as the run-time
    checking for availability of data and resources.
  • Constraints on instruction issue:
  • Availability of physical resources like
    functional units, interconnect, and the register
    file
  • Organization of the buffers holding execution
    tuples

27
Single Queue Method
  • If there is no out of order issuing, operand
    availability can be managed via reservation bits
    assigned to each register.
  • A register is reserved when an instruction
    modifying the register issues.
  • A register is cleared when the instruction
    completes.
  • An instruction may issue if there are no
    reservations on its operands.
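The reservation-bit check above can be sketched in a few lines. This is an illustrative model only (the function names and the set standing in for per-register bits are my own):

```python
# Sketch of single-queue issue: a reservation bit per register,
# modeled here as membership in a set.
reserved = set()  # registers with a pending write

def try_issue(dest, sources):
    # An instruction may issue only if none of its operands
    # are reserved by an in-flight instruction.
    if any(r in reserved for r in sources):
        return False
    reserved.add(dest)       # reserve the destination until completion
    return True

def complete(dest):
    reserved.discard(dest)   # clear the reservation at write-back
```

Because the queue issues in order, an instruction that fails the check simply blocks the queue until the producing instruction completes.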

28
Multiple Queue Method
  • There are multiple queues organized according to
    instruction type.
  • Instructions issue from individual queues in
    sequential order.
  • Individual queues may issue out of order with
    respect to one another.

29
Reservation Stations
  • Instructions issue out of order.
  • Reservation stations hold information about the
    source operands for an operation.
  • When all operands are present, the instruction
    may issue.
  • Reservation stations may be partitioned according
    to instruction type or pooled into a single large
    block.
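A reservation station entry can be sketched as a record that captures operand values as completing instructions broadcast their results. A minimal illustration (names are mine, not from the slides):

```python
# Sketch of one reservation station entry: the instruction waits
# until every source operand has arrived, then becomes ready.
class ReservationStation:
    def __init__(self, op, sources):
        self.op = op
        self.waiting = {s: None for s in sources}  # operand tag -> value

    def deliver(self, tag, value):
        # A completing instruction broadcasts (tag, value);
        # any matching operand captures the value.
        if tag in self.waiting:
            self.waiting[tag] = value

    def ready(self):
        return all(v is not None for v in self.waiting.values())
```

In hardware, the broadcast corresponds to a common data bus that all stations monitor simultaneously.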

30
Memory Operation Analysis and Execution
  • To reduce latency, memory hierarchies are used
    that may contain primary and secondary caches.
  • Address translation to physical addresses is
    sped up by a translation lookaside buffer,
    which caches recently used page translations.
  • Multiported memory hierarchy is used to allow
    multiple memory requests to be serviced
    simultaneously. Multiporting is achieved by
    having multiple memory banks or making multiple
    serial requests during the same cycle.
  • Store address buffers are used to make sure
    memory operations don't violate hazard
    conditions. Store address buffers contain the
    addresses of all pending store operations.
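The store-address-buffer check can be sketched as follows. This is an illustrative model, not the slides' hardware design (function names are mine):

```python
# Sketch of memory hazard detection via a store address buffer:
# a load may bypass earlier stores only if its address matches
# no pending (uncommitted) store.
pending_stores = []  # addresses of stores not yet committed

def issue_store(addr):
    pending_stores.append(addr)

def commit_store(addr):
    pending_stores.remove(addr)

def load_may_proceed(addr):
    # A match means a memory RAW hazard: the load must wait
    # (or forward the store's data, in a fuller design).
    return addr not in pending_stores
```

Real designs compare addresses associatively in one cycle and often forward the pending store's data to the load instead of stalling it.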

31
Memory Hazard Detection
32
Instruction Reorder and Commit
  • When an instruction is committed, its result is
    allowed to modify the logical state of the
    machine.
  • The purpose of the commit phase is to maintain
    the illusion of a sequential execution model.
  • 2 methods:
  • 1. The state of the machine is saved in a
    history buffer. Instructions update the state of
    the machine as they execute, and when there is a
    problem, the state of the machine can be
    recovered from the history buffer. The commit
    phase gets rid of the history state that's no
    longer needed.
  • 2. The state of the machine is separated into a
    physical state and a logical state. The physical
    state is updated in memory as instructions
    complete. The logical state is updated in a
    sequential order as the speculative status of
    instructions is cleared. The speculative state is
    maintained in a reorder buffer and during the
    commit phase, the result of an operation is moved
    from the reorder buffer to a logical register or
    memory.
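Method 2's reorder buffer can be sketched as a circular queue that accepts results out of order but retires them strictly in program order. A minimal illustration (the class and dict-based entries are my own modeling choice):

```python
from collections import deque

# Sketch of a reorder buffer: results complete out of order, but
# the logical state is updated only from the head, in order.
class ReorderBuffer:
    def __init__(self):
        self.entries = deque()   # entries in program order

    def dispatch(self, dest):
        entry = {"dest": dest, "value": None, "done": False}
        self.entries.append(entry)  # enter at the tail
        return entry

    def commit(self, logical_state):
        # Retire from the head only while instructions are done,
        # preserving the appearance of sequential execution.
        while self.entries and self.entries[0]["done"]:
            e = self.entries.popleft()
            logical_state[e["dest"]] = e["value"]
```

If a later instruction finishes before an earlier one, its result sits speculatively in the buffer and nothing reaches the logical state until the earlier instruction completes.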

33
The Role of Software
  • Superscalars can be made more efficient if
    parallelism in software can be increased.
  • 1. By increasing the likelihood that a group of
    instructions can be issued simultaneously
  • 2. By decreasing the likelihood that an
    instruction has to wait for the result of a
    previous instruction

34
A Look At 3 Superscalar Processors
  1. MIPS R10000
  2. DEC Alpha 21164
  3. AMD K5

35
MIPS R10000
  • Typical superscalar processor
  • Able to fetch 4 instructions at a time
  • Uses predecode to generate bits to assist with
    branch prediction (512 entry prediction table)
  • A resume cache is used to fetch not-taken-path
    instructions and has space to handle 4 branch
    predictions at a time
  • Register renaming uses a physical register file
    2x the size of the logical register file.
    Physical registers are allocated from a free list
  • 3 instruction queues: memory, integer, and
    floating point
  • 5 functional units (an address adder, 2 integer
    ALUs, a floating point multiplier / divider /
    square rooter, and a floating point adder)
  • Supports an on-chip primary data cache (32 KB,
    2-way set associative) and an off-chip secondary
    cache.
  • Uses a reorder buffer mechanism to maintain
    machine state during exceptions.
  • Instructions are committed 4 at a time.

36
Alpha 21164
  • Simple superscalar that forgoes the advantage of
    dynamic scheduling in favor of a high clock rate
  • 4 Instructions at a time are fetched from an 8K
    instruction cache
  • 2 instruction buffers that issue instructions in
    program order
  • Branches are predicted using a history table
    associated with the instruction cache
  • Uses the single queue method of instruction
    issuing
  • 4 functional units (2 ALUs, a floating point
    adder, and a floating point multiplier)
  • 2-level cache memory (primary: 8 KB cache;
    secondary: 96 KB, 3-way set associative cache)
  • Sequential machine state is maintained during
    interrupts because instructions are not issued
    out of order
  • The pipeline functions as a simple reorder buffer
    since instructions in the pipeline are maintained
    in sequential order

37
Alpha 21164 Superscalar Organization
38
AMD-K5
  • Implements the complex Intel x86 instruction set
  • Uses 5 pre-decode bits for decoding
    variable-length instructions
  • Instructions are fetched from the instruction
    cache at a rate of 16 bytes / cycle and placed
    in a 16-element queue.
  • Branch prediction is integrated with the
    instruction cache. There is 1 prediction entry
    per cache line.
  • Due to instruction set complexity, 2 cycles are
    required to decode.
  • Instructions are converted to ROPs (simple
    RISC-like operations).
  • Instructions read operand data and are dispatched
    to functional unit reservation stations.
  • There are 6 functional units: 2 integer ALUs, 1
    floating point unit, 2 load/store units, and a
    branch unit.
  • Up to 4 ROPs can be issued per clock cycle.
  • Has an 8 KB data cache with 4 banks. Dual
    loads/stores are allowed to different banks.
  • 16 entry reorder buffer maintains machine state
    when there is an exception and recovers from
    incorrect branch predictions

39
AMD K5 Superscalar Organization
40
The Future of Superscalar Processing
  • Superscalar design has yielded performance gains.
  • BUT increasing hardware parallelism may be a case
    of diminishing returns.
  • There are limits to instruction level parallelism
    in programs that can be exploited.
  • Simultaneously issuing more instructions
    increases complexity and requires more cross
    checking. This will eventually affect the clock
    rate.
  • There is a widening gap between processor and
    memory performance
  • Many believe that the 8-way superscalar is the
    limit and that we will reach this limit within 2
    years.
  • Some believe VLIW will replace superscalars and
    offers advantages:
  • Because software is responsible for creating the
    execution schedule, the size of the instruction
    window that can be examined for parallelism is
    larger than what a superscalar can examine in
    hardware.
  • Since there is no dependence checking by the
    processor, VLIW hardware is simpler to implement
    and may allow a faster clock.

41
Reference
  • "The Microarchitecture of Superscalar Processors"
    by James E. Smith and Gurindar S. Sohi,
    Proceedings of the IEEE, December 1995.

42
  • ?