CSC: 345 Computer Architecture - PowerPoint PPT Presentation

Transcript and Presenter's Notes


1
CSC 345 Computer Architecture
  • Jane Huang
  • Instruction Pipelining / RISC

2
We can think of the functionality of the CPU in
terms of
  • Fetch instructions
  • Interpret instructions
  • Fetch data
  • Process data
  • Write data

[Figure: instruction cycle state diagram, including indirection.]
3
Sequential Laundry
[Figure: sequential laundry timeline from 6 PM to midnight. Four loads each take 30 min wash, 40 min dry, and 20 min fold, run one after another.]
T a s k O r d e r
  • Sequential laundry takes 6 hours for 4 loads
  • If they learned pipelining, how long would
    laundry take?

David Patterson's Lecture Slides
4
Pipelined Laundry: Start work ASAP
[Figure: pipelined laundry timeline from 6 PM to about 9:30 PM; the wash, dry, and fold stages of successive loads overlap.]
  • Pipelined laundry takes 3.5 hours for 4 loads

5
Pipelining Lessons
  • Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload
  • Pipeline rate is limited by the slowest pipeline stage
  • Multiple tasks operate simultaneously
  • Potential speedup = number of pipe stages
  • Unbalanced pipe-stage lengths reduce speedup
  • Time to fill the pipeline and time to drain it reduce speedup
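These lessons can be checked numerically. A minimal sketch, assuming the slide's laundry numbers (30 min wash, 40 min dry, 20 min fold, four loads):

```python
# Illustrative laundry-pipeline timing, using the stage lengths
# from the slide's figure: wash 30 min, dry 40 min, fold 20 min.
stages = [30, 40, 20]   # minutes per stage
loads = 4

# Sequential: each load finishes completely before the next starts.
sequential = loads * sum(stages)

# Pipelined: the slowest stage (the 40-minute dryer) sets the rate.
# First load takes the full sum; each later load finishes one
# bottleneck-interval after the previous one.
bottleneck = max(stages)
pipelined = sum(stages) + (loads - 1) * bottleneck

print(sequential)  # 360 minutes = 6 hours
print(pipelined)   # 210 minutes = 3.5 hours
```

The pipelined total is governed by the slowest stage, which is exactly the "pipeline rate limited by slowest pipeline stage" lesson above.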

6
Instruction Prefetch
  • As a simple approach, the instruction cycle could be split into two stages: fetch instruction and execute instruction.
  • Stage 1: Fetch and buffer the instruction.
  • Stage 2: Execute the instruction.
  • If both stages were of equal duration, the instruction cycle time would be halved. But:
  • Execution time is usually longer than fetch time.
  • A conditional branch instruction means that we wouldn't know the address of the next instruction. (Guessing can reduce the overall delay.)

7
Further Speedup
  • Further speedup can be gained by introducing more
    stages into the pipeline.
  • Fetch Instruction (FI)
  • Decode Instruction (DI)
  • Calculate Operands (CO)
  • Fetch Operands (FO)
  • Execute Instruction (EI)
  • Write Operand (WO)
  • Various stages can be of more equal duration.
  • Note some instructions do NOT need all six stages. For example, a load instruction does not need the WO stage.
  • To simplify pipeline hardware, timing is set up
    to assume that each instruction requires all six
    stages.

8
Further Speedup
9
Performance Enhancement
  • Without a pipeline, the 9 instructions would take 9 × 6 = 54 time units to complete.
  • With a pipeline, the 9 instructions take 6 + (9 − 1) = 14 time units: number of stages + (number of instructions − 1).
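The slide's formula can be expressed directly; `pipeline_time` is an illustrative helper name, not from the slides:

```python
def pipeline_time(n_instructions, n_stages):
    # Ideal pipeline: the first instruction occupies all n_stages
    # time units; every later instruction completes one unit after
    # its predecessor.
    return n_stages + (n_instructions - 1)

print(9 * 6)                # 54 time units without a pipeline
print(pipeline_time(9, 6))  # 14 time units with a six-stage pipeline
```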

Limiting Factors
  • Stages that are not of equal duration will create waiting.
  • The problem of the conditional branch, which can invalidate instructions.
  • Interrupts
  • The CO stage might depend on a result in a register from a previous instruction that has not yet completed.
  • Overhead in moving data from buffer to buffer in the pipeline can lengthen the execution time of an individual instruction. This is significant when sequential instructions are logically dependent.

10
Pipeline Hazards
Hazards are situations in pipelining that prevent the next instruction in the instruction stream from executing during its designated clock cycle. Hazards reduce the ideal speedup gained from pipelining and are classified into three classes:
  • Structural hazards: arise from hardware resource conflicts, when the available hardware cannot support all possible combinations of instructions.
  • Data hazards: arise when an instruction depends on the result of a previous instruction in a way that is exposed by the overlapping of instructions in the pipeline.
  • Control hazards: arise from the pipelining of conditional branches and other instructions that change the PC.
11
Data Hazards
  • Read After Write (RAW): InstrJ tries to read an operand before InstrI writes it.
  • Caused by a data dependence (in compiler nomenclature). This hazard results from an actual need for communication.

I: add r1,r2,r3    J: sub r4,r1,r3

  • Write After Read (WAR): InstrJ writes an operand before InstrI reads it. Called an anti-dependence by compiler writers; this results from reuse of the name r1.

Patterson Slides
12
Data Hazards (cont..)
  • Write After Write (WAW): InstrJ writes an operand before InstrI writes it.
  • Called an output dependence by compiler writers. This also results from reuse of the name r1.

13
Assume instruction 3 is a conditional branch to instruction 15. There is no way to know which branch is taken until after EI. Assume instruction 4 will be taken. If instruction 15 is actually taken, the pipeline is flushed and instruction 15 is fetched.
14
Dealing with Branches
  • Branches impede the consistent flow of data to
    the instruction pipeline.
  • Several approaches have been proposed
  • Multiple Streams
  • Prefetch branch target
  • Loop buffer
  • Branch prediction
  • Delayed branch
  • Multiple Streams
  • A brute-force approach that replicates the initial portion of the pipeline, allowing both instruction streams to be fetched.
  • Contention delays arise for register and memory access between the parallel streams.
  • Additional branch instructions may enter the pipeline before the original branch has been resolved (each would need its own multiple streams).
  • This approach is used in the IBM 370/168 and IBM 3033.

15
Dealing with Branches
  • Prefetch Branch Target
  • Target of branch is prefetched in addition to the
    instruction following the branch.
  • The target is saved until the branch instruction
    is executed.
  • Loop Buffer
  • A small, very high-speed memory, maintained by the instruction fetch stage, containing the n most recently fetched instructions.
  • When used in conjunction with prefetching, the
    loop buffer contains some instructions ahead of
    the current instruction.
  • Instructions fetched in sequence will be
    available without usual memory access time.
  • If a branch occurs to a target just ahead of the
    current instruction it might already be in the
    buffer.
  • WELL SUITED to handling loops.
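The loop-buffer behavior above can be sketched as a tiny address-indexed cache; the names (`LoopBuffer`, `fetch`) are illustrative, not from any real design:

```python
class LoopBuffer:
    """A small buffer of the n most recently fetched instructions,
    indexed by address, with FIFO replacement."""
    def __init__(self, size=16):
        self.size = size
        self.buffer = {}    # address -> instruction
        self.order = []     # FIFO of buffered addresses

    def fetch(self, addr, memory):
        if addr in self.buffer:
            return self.buffer[addr], True      # hit: no memory access
        instr = memory[addr]                    # miss: go to memory
        if len(self.order) == self.size:        # buffer full: evict oldest
            self.buffer.pop(self.order.pop(0))
        self.buffer[addr] = instr
        self.order.append(addr)
        return instr, False

mem = {a: f"instr{a}" for a in range(100)}
lb = LoopBuffer(size=4)
# A tight 3-instruction loop iterated 5 times: after the first
# iteration fills the buffer, every fetch hits.
hits = sum(lb.fetch(a, mem)[1] for a in [0, 1, 2] * 5)
print(hits)  # 12 hits out of 15 fetches
```

This is why the technique is well suited to loops: once the loop body fits in the buffer, subsequent iterations avoid the memory access time entirely.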

16
Branch Prediction
  • Static Approaches
  • Predict never taken
  • Predict always taken: studies show that conditional branches are taken more than 50% of the time. In a paged machine, prefetching the branch target is more likely to cause a page fault (an avoidance mechanism is needed).
  • Predict by opcode: the decision is based on the opcode of the branch instruction. One study reported success rates of over 75% with this strategy.
  • Dynamic Approaches
  • Attempt to improve the prediction rate by recording the history of conditional branches in the program.
  • Taken / not taken switch
  • Branch history table

17
Branch Prediction
  • Taken / Not taken switch
  • A single bit is associated with each conditional branch.
  • It directs the processor to make the same decision the next time around.

18
Branch Prediction
  • Storing 2 history bits can improve the situation.
  • Two consecutive wrong predictions are needed to
    change the prediction decision.

[Figure: state diagram for two-bit prediction. Two "predict taken" states and two "predict not taken" states; a taken outcome moves toward "predict taken" and a not-taken outcome toward "predict not taken", so only two consecutive wrong predictions flip the decision (illustrated with a "do while (condition)" loop).]
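The two-bit scheme is usually implemented as a saturating counter. A minimal sketch (class and state encoding are illustrative):

```python
class TwoBitPredictor:
    """2-bit saturating counter: states 0-1 predict not taken,
    states 2-3 predict taken. Two consecutive mispredictions are
    needed to change the prediction decision."""
    def __init__(self):
        self.state = 3                  # start at "strongly taken"

    def predict(self):
        return self.state >= 2          # True = predict taken

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

p = TwoBitPredictor()
correct = 0
# A loop branch: taken, taken, one not-taken exit, then taken again.
for taken in [True, True, False, True, True]:
    if p.predict() == taken:
        correct += 1
    p.update(taken)
print(correct)  # 4 of 5 predicted correctly
```

Note that the single not-taken outcome costs one misprediction but does not flip the prediction, so the loop branch is still predicted taken afterwards.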
19
Introduction to RISC
  • RISC Reduced Instruction Set Computing
  • Large number of general-purpose registers
  • Use of compiler technology to optimize register
    usage
  • Emphasis on optimizing the instruction pipeline.

20
Trends
  • To reduce programming errors and simplify programming, there has been a trend toward developing powerful and complex high-level programming languages (HLLs).
  • HLLs support OO and other high-level concepts.
  • This introduces a SEMANTIC GAP, i.e., a large gap between the HLL and the instruction set, which leads to:
  • Program inefficiency
  • Compiler complexity
  • Excessive machine program size
  • Computer architects attempted to close this gap by creating more complex instruction sets.
  • Several studies were conducted to try to
    understand the behaviour of HLL programming
    languages.

21
Trends
  • Operations
  • Assignment statements predominate: simple data movement is important.
  • Conditional statements (IF, LOOP) are numerous, implemented using compare and branch instructions: the sequence-control mechanism is important.
  • Operands
  • From the Patterson study, the majority of references are to simple scalar variables.
  • 80% of these variables were local to a procedure.
  • References to arrays and structures require an earlier reference to a pointer, which is usually local.
  • In the Patterson study, each instruction referenced an average of 0.5 memory operands and 1.4 registers.
  • Fast operand referencing is important.

22
Trends
  • Procedure Calls
  • Procedure calls are the most time-consuming operations in HLL programs.
  • Two significant factors: the number of parameters and the depth of nesting.
  • Tanenbaum's study:
  • 98% of procedures passed fewer than 6 arguments.
  • 92% used fewer than six local scalar variables.
  • Implications
  • Attempting to make the instruction set architecture close to HLLs may NOT be the most effective design strategy.
  • Optimize the performance of the most time-consuming aspects of HLL programs.
  • RISC therefore:
  • Uses a large number of registers (or compiler optimization) to optimize operand referencing.
  • Reduces memory references in favor of register references (locality of reference supports this).
  • Straightforward instruction pipelining will be inefficient because of the high percentage of branches.
  • A simplified instruction set is needed.

23
Registers
  • Use of large set of registers decreases need to
    access memory.
  • Favor the use of registers for local scalars.
  • Multiple sets of registers each assigned to a
    procedure.
  • Procedure call switches the processor to use a
    different fixed-size window of registers.
  • Windows for adjacent procedures are overlapped to
    allow parameter passing.

[Figure: overlapping register windows. Each level (J, J+1) has parameter registers, local registers, and temporary registers; a call/return moves between levels, with level J's temporary registers overlapping level J+1's parameter registers.]
24
Circular Buffer Organization of Overlapped Windows
  • To handle an unbounded number of procedure calls, a circular buffer is used.
  • Studies showed that with 8 windows, a save or restore is needed on only 1% of calls or returns.
  • Global variables cannot be stored here (they must go in special registers or an area of main memory).
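The save/restore behavior of the circular window buffer can be sketched with a simple depth counter. A simplified model (names and the bare-bones bookkeeping are illustrative; real hardware spills whole windows via traps):

```python
N_WINDOWS = 8   # number of register windows in the circular buffer

class WindowedRegisterFile:
    """On a call, advance to the next window; once call depth
    exceeds the number of windows, the oldest window must be
    saved to memory. On a return past a saved window, restore it."""
    def __init__(self):
        self.depth = 0      # current call-nesting depth
        self.saves = 0      # windows spilled to memory
        self.restores = 0   # windows reloaded from memory

    def call(self):
        self.depth += 1
        if self.depth > N_WINDOWS:      # buffer full: spill oldest window
            self.saves += 1

    def ret(self):
        if self.depth > N_WINDOWS:      # caller's window was spilled
            self.restores += 1
        self.depth -= 1

w = WindowedRegisterFile()
for _ in range(10):     # nest 10 calls with only 8 windows
    w.call()
for _ in range(10):
    w.ret()
print(w.saves, w.restores)  # 2 2: only the two deepest calls spill
```

With typical nesting depths well under 8, most call/return sequences never touch memory, which is the point of the 1% figure above.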

25
Large Register File versus Cache
  • When the register file is organized into windows
    it acts more like a specialized cache memory.
    (but faster!)
  • Register file may make inefficient use of space.
  • Cache must read an entire block at a time (may
    increase or decrease efficiency).
  • Cache can hold global or local variables.

26
RISC Architecture
  • One instruction per cycle
  • Machine cycle supports fetching 2 operands from
    registers, performing an ALU operation, and
    storing results in a register.
  • Register-to-register operations
  • Most instructions should be register to register
  • Only simple LOAD and STORE instructions access
    memory.
  • Simple addressing modes
  • Almost all instructions are simple register
    addressing
  • Simple instruction formats
  • Instruction length is fixed and aligned on word
    boundaries.
  • Field locations, especially the opcode, are fixed.
  • Fixed-length fields mean that opcode decoding and operand fetch can occur simultaneously.
  • Control unit is simplified.

28
RISC Pipelining
A RISC instruction consists of three primary stages:
  • I: Instruction fetch
  • E: Execute (an ALU operation, or calculation of a memory address)
  • D: Memory: a register-to-memory or memory-to-register operation.
Without pipelining (13 time units)
29
RISC Pipelining
  • Two-stage pipelining can speed up performance.
  • Problems:
  • Single-port memory is used, so only one memory access is possible per stage; wait states must be inserted.
  • A branch instruction interrupts the sequential flow, so a NOOP must be inserted.

30
RISC Pipelining
  • Three-stage pipelining can occur IF dual memory accesses are allowed per stage.
  • Problems
  • Branch instructions cause speedup to fall short
    of maximum.
  • Data dependencies are introduced (for example, if the output of one instruction is needed as input to the next).

31
RISC Pipelining
Further improvement can be gained by splitting the E stage into two substages:
  • E1: Register file read
  • E2: ALU operation and register write
32
RISC Pipelining
  • Optimization
  • Problems occur because of data and branch
    dependencies.
  • Code reorganization techniques can be used.
  • One example of code reorganization is the
    delayed branch.

33
RISC Pipelining
  • Optimization
  • Instead of inserting a NOOP, the compiler can try to find something useful for the processor to do.
  • For example, swap the ADD and the JUMP.
  • If the BRANCH is conditional, this can ONLY happen if executing the instruction early makes no difference whether or not the branch is taken.
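The ADD/JUMP swap can be illustrated with toy instruction lists (the mnemonics and label `L` are illustrative, not a real compiler pass):

```python
# Before reorganization: the delay slot after the unconditional
# JUMP is wasted on a NOOP.
before = [
    "load r1, X",
    "add  r1, 1",
    "jump L",
    "noop",           # delay slot: executes before the jump takes effect
]

# The ADD does not affect the JUMP's target or condition, so the
# compiler can move it into the delay slot, replacing the NOOP.
after = [
    "load r1, X",
    "jump L",
    "add  r1, 1",     # now does useful work in the delay slot
]

print(len(before) - len(after))  # one wasted cycle eliminated
```

Both sequences compute the same result; the reorganized one simply fills the delay slot with work that would have been done anyway.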