Provided by: aash63
Title: Superscalar Organization


1
Chapter 4
  • Superscalar Organization

2
Limitations of Scalar Pipelines
  • Maximum throughput bounded by one instruction per
    cycle
  • Inefficient unification into one pipeline
  • ALU, MEM stages very diverse, e.g. FP
  • Rigid nature of in-order pipeline
  • If a leading instruction is stalled, every subsequent
    instruction is stalled

3
A Rigid Pipeline
4
Solving Problems of Scalar Pipelines
  • Maximum throughput bounded by one instruction per
    cycle -> parallel pipelines
  • Inefficient unification into single pipeline
    -> diversified pipelines
  • Rigid nature of in-order pipeline
    -> allow out-of-order execution
  • OOO pipelines or dynamic pipelines

5
Machine Parallelism
  • No Parallelism (Nonpipelined)
  • Temporal Parallelism (Pipelined)
  • Spatial Parallelism (Multiple units)
  • Combined Temporal and Spatial Parallelism

6
A Parallel Pipeline
Width = 3
7
Scalar and Parallel Pipeline
  • The five-stage i486 scalar pipeline
  • The five-stage Pentium parallel pipeline of
    width = 2

8
Diversified Parallel Pipeline
9
Diversified Functional Units
CDC 6600 with 10 diversified functional units
10
Motorola 88110 Superscalar uP
11
Interpipeline Stage Buffer
  • Single entry buffer
  • Multientry buffer
  • Multientry buffer with reordering

12
A Dynamic Pipeline
13
In-order Issue into Diversified Pipelines
RD <- Fn (RS, RT)
  • RD: destination register
  • Fn: functional unit
  • RS, RT: source registers
  • Instructions arrive as an in-order instruction stream
Issue stage needs to check:
  1. Structural dependence
  2. RAW hazard
  3. WAW hazard
  4. WAR hazard
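
A minimal sketch of the four issue-stage checks, assuming each instruction has the form RD = Fn(RS, RT) used on this slide. The `Instr` class, the `issue_hazards` function, and the register and unit names are hypothetical. Every in-flight instruction is conservatively treated as not yet having read its sources or written its result.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    fu: str        # functional unit type (Fn)
    dest: str      # destination register (RD)
    srcs: tuple    # source registers (RS, RT)


def issue_hazards(instr, in_flight, busy_fus):
    """Return the set of hazards that would block issuing `instr`."""
    hazards = set()
    if instr.fu in busy_fus:
        hazards.add("structural")      # required unit is occupied
    for older in in_flight:
        if older.dest in instr.srcs:
            hazards.add("RAW")         # reads a result not yet written
        if older.dest == instr.dest:
            hazards.add("WAW")         # both write the same register
        if instr.dest in older.srcs:
            hazards.add("WAR")         # overwrites a register still to be read
    return hazards


older = Instr("fmul", "F2", ("F0", "F1"))
print(issue_hazards(Instr("fadd", "F3", ("F2", "F4")), [older], {"fmul"}))
```

An empty result means the instruction can issue this cycle under these assumptions.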
14
A Superscalar Pipeline
A six-stage template (TEM) superscalar pipeline
15
Superscalar Pipeline Design
Instruction Flow
Data Flow
Retire
16
Inorder Pipelines
In-order pipeline: no WAW, no WAR hazards (almost always
true)
17
Limitations of Inorder Pipelines
  • CPI of in-order pipelines degrades very sharply if
    the machine parallelism is increased beyond a
    certain point, i.e. when N x M approaches the average
    distance between dependent instructions
  • Forwarding is no longer effective
    -> must stall more often
  • Pipeline may never be full due to frequent
    dependency stalls
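
The stall behavior can be illustrated with a toy model. This is a sketch under strong assumptions, not a model of any real machine: every instruction depends on the one `dep_distance` positions earlier, results are forwarded after `latency` cycles, and issue is strictly in order. The function name `inorder_cycles` and its parameters are hypothetical.

```python
def inorder_cycles(n_instr, width, dep_distance, latency=1):
    """Count issue cycles for an in-order pipeline of the given width."""
    issue_cycle = [0] * n_instr
    cycle = 0
    i = 0
    while i < n_instr:
        issued = 0
        while i < n_instr and issued < width:
            producer = i - dep_distance
            if producer >= 0 and issue_cycle[producer] + latency > cycle:
                break  # in order: a stalled instruction blocks all younger ones
            issue_cycle[i] = cycle
            i += 1
            issued += 1
        cycle += 1
    return cycle


for width in (1, 2, 4):
    print(width, inorder_cycles(12, width, dep_distance=2))
```

In this model, widening the pipeline from 2 to 4 gives no speedup once the width exceeds the dependence distance of 2.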

18
Out-of-order Pipelining 101
Pipeline stages: IF, ID, RD, EX, WB
EX pipes: INT; Fadd1, Fadd2; Fmult1, Fmult2, Fmult3; LD/ST
19
Instruction Fetching
  • Fetch should not be bottleneck
  • Wide I-caches
  • Fetch bandwidth
  • Two major problems
  • Misaligned accesses
  • Control flow (branch) instructions
  • 2 solutions for misaligned accesses
  • Compiler
  • Hardware alignment network
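
A small sketch of why misalignment costs fetch bandwidth, assuming a fetch group cannot cross a physical I-cache row boundary and no alignment hardware is present. The names `instrs_fetched` and `average_fetch` are hypothetical, and addresses are instruction indices rather than byte addresses.

```python
def instrs_fetched(pc_index, fetch_width, row_width):
    """Instructions delivered in one cycle when fetch starts at pc_index."""
    offset = pc_index % row_width          # position within the physical row
    return min(fetch_width, row_width - offset)


def average_fetch(fetch_width, row_width):
    """Average delivery if every starting offset is equally likely."""
    total = sum(instrs_fetched(o, fetch_width, row_width)
                for o in range(row_width))
    return total / row_width


print(instrs_fetched(10, fetch_width=4, row_width=4))
print(average_fetch(fetch_width=4, row_width=4))
```

A compiler that aligns branch targets to row boundaries, or a hardware alignment network, moves the average back toward the full fetch width.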

20
Organization of wide I-cache
(a) One cache line is equal to one physical row
(b) One cache line is equal to two physical rows
21
Misalignment of the Fetch Group
22
RS/6000 with Auto-Realignment
23
Instruction Decoding
  • Identify individual instructions
  • Determine I-types
  • Detect dependencies
  • 2 major complexity factors
  • ISA (RISC/CISC)
  • Width of the parallel pipeline

24
Instruction Decoding (2)
  • ISA (RISC/CISC)
  • Uniform width
  • Regular encoding
  • Few different formats
  • Few addressing modes
  • Dependences -> comparators
  • Multiported reg files, multiple buses
  • Branches
  • CISC examples: Pentium, K5
  • How to identify the start of the next instruction
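
To see why pipeline width drives decode complexity, this sketch counts the register comparisons needed to detect dependences inside one decode group. It assumes two source registers per instruction and that each instruction is compared only against older instructions in the same group. The function `intra_group_raw` and the tuple layout are hypothetical.

```python
def intra_group_raw(group):
    """group: list of (dest, srcs) tuples in program order.

    Returns (pairs, comparisons). pairs lists (consumer, producer) indices
    with a RAW dependence; comparisons counts register-name comparisons.
    """
    pairs = []
    comparisons = 0
    for i, (_, srcs) in enumerate(group):
        for j in range(i):                 # only older instructions
            older_dest = group[j][0]
            comparisons += len(srcs)       # one comparator per source
            if older_dest in srcs:
                pairs.append((i, j))
    return pairs, comparisons


group = [
    ("r1", ("r2", "r3")),
    ("r4", ("r1", "r5")),
    ("r6", ("r7", "r8")),
    ("r9", ("r4", "r1")),
]
print(intra_group_raw(group))
```

Under these assumptions a group of width n needs n(n-1) comparisons: 12 at width 4 and 56 at width 8.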

25
Fetch/Decode Unit of Intel P6 Pipeline
  • Uops employ load/store model
  • Decoder 0 is the generalized decoder
  • Decode needs multiple stages
    -> concept of predecoding
26
Pre-decoding Mechanism of AMD K5
  • Predecode logic sits between memory and I-cache
  • Additional info stored into cache: 5 bits
  • Start/end of instruction, number of uops,
    location of opcodes, prefixes
27
Predecoding
  • Overhead of predecoding
  • Increase in I-cache size
  • K5 increase is about 50%
  • Increased I-cache miss penalty
  • Not a problem if I-miss rates low
  • Predecoding in RISC machines too
  • Identify branch instructions
  • PowerPC: 7 bits
  • UltraSPARC, R10000, HP PA-8000: 4/5 bits

28
Instruction Dispatch in Superscalar Pipeline
29
Instruction Dispatching
  • Route instruction to appropriate functional unit
    (FU) for execution
  • Temporary buffering before execution
  • Prior to execution, an instruction must have all its
    operands
  • Reservation Stations
  • Fetch, decode centralized whereas dispatch
    decentralized
  • Centralized/distributed reservation station
  • Pentium 4: centralized; PowerPC: distributed

30
Centralized Reservation Station
31
Distributed Reservation Station
32
Dispatch (contd)
  • Hybrid approaches
  • Clustered reservation stations
  • Not centralized, but some reservation stations
    feed more than 1 FU
  • Centralized: best overall utilization
  • Needs to be multiported
  • Distributed: possible idling

33
Dispatch vs Issue
  • Dispatching: associating instructions with FU
    types
  • Issuing: initiation of execution in FUs
  • Separation of dispatch/issue makes sense only for
    distributed reservation stations
  • For centralized instruction windows (reservation
    stations), the two terms are interchangeable
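
The split between dispatch and issue can be sketched with one distributed reservation station. This is an illustrative model under simple assumptions, not any processor's design: the class `ReservationStation`, its methods, and the register tags are hypothetical. Dispatch places an instruction in the station for its FU type, and issue starts the oldest entry whose operands are all available.

```python
class ReservationStation:
    def __init__(self, fu_type, size):
        self.fu_type = fu_type
        self.size = size
        self.entries = []                  # (name, pending source tags)

    def dispatch(self, name, pending_srcs):
        """Buffer an instruction; returns False if the station is full."""
        if len(self.entries) >= self.size:
            return False                   # dispatch must stall
        self.entries.append((name, set(pending_srcs)))
        return True

    def wakeup(self, tag):
        """A result tagged `tag` was produced; mark that operand ready."""
        for _, pending in self.entries:
            pending.discard(tag)

    def issue(self):
        """Start the oldest instruction with all operands ready, if any."""
        for idx, (name, pending) in enumerate(self.entries):
            if not pending:
                del self.entries[idx]
                return name
        return None


rs = ReservationStation("fadd", size=2)
rs.dispatch("add1", {"F2"})   # waits for F2
rs.dispatch("add2", set())    # operands already available
print(rs.issue())             # add2 issues ahead of add1
```

With a single centralized station feeding every FU, buffering and selection happen in the same structure, which is why the two terms merge in that case.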

34
Instruction Execution
  • Heart of a CPU
  • Lots of FUs in current superscalars
  • INT (1 or more), FP, LD/ST,
  • Some FUs are specialized
  • TI SuperSPARC contains cascaded ALUs
  • IBM RS/6000 has multiply-add units

35
Functional Units
(a) Int Functional Unit in TI SuperSPARC
(b) FP Functional Unit in IBM RS/6000
36
Instruction Execution
  • LD/ST units (L/S units) (L/S pipes)
  • Dedicated L/S unit, or INT units used
  • Branch units
  • Dedicated or INT units used
  • Graphics and image processing units
  • Pixel processing units
  • Bit manipulation units
  • Trimedia media processor
  • Quad avg

37
Instruction Execution (2)
  • Quad average
  • Quad avg used in MPEG decoding for decompressing
    compressed videos
  • (a+e+1)/2, (b+f+1)/2, (c+g+1)/2, (d+h+1)/2
  • a, b, c, d, e, f, g, h are bytes
  • Stored as two 32-bit quantities
  • Replaces numerous add and divide instructions
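
A software sketch of the quad average operation, assuming the eight bytes are packed into two unsigned 32-bit words and each lane computes the rounded average (a+e+1)/2 with integer division. The function name `quadavg` and the byte ordering are assumptions made for illustration.

```python
def quadavg(x, y):
    """Byte-wise rounded average of two packed 32-bit words."""
    result = 0
    for shift in (0, 8, 16, 24):
        a = (x >> shift) & 0xFF            # one byte from the first word
        e = (y >> shift) & 0xFF            # matching byte from the second
        result |= ((a + e + 1) >> 1) << shift
    return result


print(hex(quadavg(0x01020304, 0x03040506)))
```

The loop shows the work a dedicated unit replaces: four extracts, four adds with rounding, four shifts, and a repack.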

38
Instruction Execution (3)
  • Best mix of FUs for a superscalar processor
  • Depends on application domain and instruction mix
  • If 40% ALU, 20% branches, 40% L/S: 4-2-4 rule of
    thumb
  • ALU units abundant in current processors
  • L/S units are more scarce
  • Needs D-cache to be multiported
  • Banked memory easier to make than multiported

39
Instruction Execution (4)
  • Banked D-caches/Multiported D-caches
  • If no bank conflicts, good bandwidth
  • Intel Pentium 8-banked D-cache
  • Banking cheaper than true multiporting or
    multiple copies
  • Total number of FUs often more than processor
    superscalarity
  • Superscalarity: typically F/D/retire width
  • Complexity grows as n^2 for n FUs
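
A sketch of how a banked D-cache handles several accesses in one cycle. The bank is chosen from the low-order address bits, and two accesses to the same bank conflict. The 4-byte interleave granularity, the function name `split_by_bank`, and the addresses are assumptions for illustration.

```python
def split_by_bank(addrs, n_banks=8, interleave_bytes=4):
    """Return (serviced, deferred) for accesses issued in the same cycle."""
    serviced, deferred = [], []
    busy = set()
    for addr in addrs:
        bank = (addr // interleave_bytes) % n_banks
        if bank in busy:
            deferred.append(addr)          # bank conflict: retry next cycle
        else:
            busy.add(bank)
            serviced.append(addr)
    return serviced, deferred


print(split_by_bank([0x100, 0x104]))   # different banks
print(split_by_bank([0x100, 0x120]))   # same bank
```

A true multiported cache would service both accesses in the second case, at a higher hardware cost.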

40
Completion and Retiring
  • Completion means finishing execution
  • Retiring means updating machine state
  • Retiring, in my opinion: committing results to
    register or memory (book uses different
    terminology, pg 207)
  • Completion Buffer/Reorder Buffer (ROB)
  • Out of order execution, in-order retiring
  • Stages between instruction buffer and ROB are out
    of order

41
A Dynamic Pipeline
42
Reorder Buffers (ROB)
  • Put instructions back to order
  • Instructions enter ROB in o-o-o (out of order)
  • Instructions leave ROB in order
  • An instruction is architecturally complete when
    it leaves ROB
  • Registers updated, memory updated
  • Sometimes memory instructions have their own
    ROB or MOB (memory ordering buffer)
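
A minimal sketch of in-order retirement from a reorder buffer. It uses one common arrangement, which is an assumption here rather than the organization drawn in the slides: entries are allocated in program order, marked finished in whatever order execution ends, and removed only from the head. The class `ReorderBuffer` and its method names are hypothetical.

```python
from collections import deque


class ReorderBuffer:
    def __init__(self):
        self.entries = deque()             # oldest instruction at the left

    def allocate(self, name):
        entry = {"name": name, "done": False}
        self.entries.append(entry)
        return entry

    def finish(self, entry):
        entry["done"] = True               # may happen out of order

    def retire(self, width):
        """Retire up to `width` finished instructions from the head."""
        retired = []
        while (self.entries and self.entries[0]["done"]
               and len(retired) < width):
            retired.append(self.entries.popleft()["name"])
        return retired


rob = ReorderBuffer()
i1, i2, i3 = (rob.allocate(n) for n in ("i1", "i2", "i3"))
rob.finish(i3)
rob.finish(i2)
print(rob.retire(4))   # nothing retires while i1 is unfinished
rob.finish(i1)
print(rob.retire(4))
```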

43
Interrupts/Exceptions
  • Interrupts: external (I/O devices, OS)
  • Exceptions: internal (errors)
  • Illegal opcode, divide by 0, overflow/underflow,
    page faults
  • OS needs to intervene to handle exceptions
  • 2 ways of interrupt/exception handling
  • Precise interrupts
  • Imprecise interrupts

44
Precise and Imprecise Interrupts
  • Save state of machine at interrupt, restart from
    the point of interrupt
  • Stop fetching - Allow pipeline to drain
  • Another interrupt might occur while allowing to
    drain
  • Process interrupt from an earlier instruction
    first
  • Precise interrupts: exceptions are processed
    in the same order as on a non-pipelined machine
  • Imprecise interrupts: exception processing
    order differs from the non-pipelined order

45
ROB for exception handling
  • When an exception occurs, the instruction is tagged in the ROB
  • Completion stage checks each instruction before
    it is completed
  • Tagged instructions are not allowed to complete
  • Instructions prior to the tagged instruction are allowed to
    complete
  • Machine state is checkpointed or saved
  • Remaining instructions in the pipeline, some of
    which may have finished, are discarded
  • Exception processed, checkpoint restored,
    execution resumes
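
The steps on this slide can be sketched as a single pass over the ROB in program order. This is an illustrative model only: the function `retire_with_exceptions` and the entry fields `name`, `done`, and `exception` are hypothetical, and checkpointing and the handler itself are left out.

```python
def retire_with_exceptions(rob):
    """rob: list of dicts in program order.

    Returns (retired, faulting, discarded): instructions allowed to
    complete, the tagged instruction (or None), and the younger
    instructions that are thrown away.
    """
    retired = []
    for idx, entry in enumerate(rob):
        if entry["exception"]:
            discarded = [e["name"] for e in rob[idx + 1:]]
            return retired, entry["name"], discarded
        if not entry["done"]:
            break                          # head not finished: stop retiring
        retired.append(entry["name"])
    return retired, None, []


rob = [
    {"name": "i1", "done": True, "exception": False},
    {"name": "i2", "done": True, "exception": True},   # tagged
    {"name": "i3", "done": True, "exception": False},  # finished, still discarded
    {"name": "i4", "done": False, "exception": False},
]
print(retire_with_exceptions(rob))
```

Because only instructions older than the tagged one update state, the machine state seen by the handler matches that of a non-pipelined machine stopped at the faulting instruction.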