CS5222 Advanced Processor Architecture Part 5: Processing of Control Transfer Instructions

Slides: 106
Provided by: chich8

1
CS5222 Advanced Processor Architecture Part 5
Processing of Control Transfer Instructions
  • Fall Term, 2004/2005
  • Chi Chi Hung (email chich_at_comp.nus.edu.sg)
  • Building S/17, Rm 5-13
  • Phone 6874-2832

2
Overview
  • Introduction to branch problem
  • Basic approaches to branch handling
  • Delayed branching
  • Branch handling
  • Multiway branching
  • Guarded execution

3
Part A
  • Introduction

4
Types of Branches
  • Branches are used to transfer control to a
    specified locus of the program.
  • Types of branches
  • Unconditional branches: always taken
  • simple unconditional branch
  • branch to subroutine
  • return from subroutine
  • Conditional branches: a condition determines whether
    the branch is taken
  • loop-closing conditional branch: backward
    branches that are taken for all but the last
    iteration of a loop
  • other conditional branches
  • Implementation decisions (Example from PowerPC)
  • All types of branch instructions merge into
    branch to subroutine instructions with the help
    of a specified bit in instruction encoding?
  • Conditional and unconditional branches are
    handled in a unified way?

5
Types of Branches
6
Checking of Branch Condition
  • Two approaches to check the branch condition
  • Check the result of an instruction
  • Compare two operands
  • Two approaches to implement checking
  • result state concept
  • direct state concept

7
Implementation of Checking Branch Condition
8
Implementation of Checking Branch Condition
Result State
  • Under the result state concept, a result state,
    normally in the form of a condition code or
    flags, is declared in the ISA.
  • The result state holds relevant information about
    the results of operations.
  • It is updated automatically during instruction
    execution.
  • It can be tested by a conditional branch for
  • relevant conditions of an operand generated immediately
    before the branch
  • relevant conditions of an operand generated
    earlier
  • An extra instruction such as teq is needed to test
    an operand generated earlier.
  • Many architectures use the result state approach,
    such as IBM mainframes, VAX, x86, SPARC, PowerPC.

9
Implementation of Checking Branch Condition
Result State
  • Example
  • add r1, r2, r3 // r1 <- r2 + r3
  • beq label // test whether the result equals zero;
    if yes,
  • branch to the location label
  • label:
  • teq r1 // test for (r1) = 0 and update the result
    state
  • beq label // test whether the result equals zero; if
    yes,
  • branch to the location label
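
The result state mechanism on this slide can be mimicked in a few lines of Python. This is a minimal sketch rather than a model of any real ISA: the `Machine` class, its single `zero_flag` bit, and the register names are assumptions chosen to mirror the mnemonics `add`, `teq`, and `beq` used above.

```python
class Machine:
    """Toy register machine with one result-state bit (a zero flag)."""

    def __init__(self):
        self.regs = {f"r{i}": 0 for i in range(8)}
        self.zero_flag = False  # result state, updated as a side effect

    def add(self, dst, a, b):
        # dst <- a + b; the result state is updated automatically
        self.regs[dst] = self.regs[a] + self.regs[b]
        self.zero_flag = self.regs[dst] == 0

    def teq(self, src):
        # Explicit test: set the result state from an operand produced earlier
        self.zero_flag = self.regs[src] == 0

    def beq(self):
        # Branch decision: taken when the recorded result was zero
        return self.zero_flag


m = Machine()
m.regs["r2"], m.regs["r3"] = 5, -5
m.add("r1", "r2", "r3")      # the flag is set by the add itself
taken_after_add = m.beq()

m.regs["r4"] = 7
m.add("r5", "r4", "r4")      # a later operation overwrites the flag
m.teq("r1")                  # so the older operand r1 must be re-tested
taken_after_teq = m.beq()
```

The second half shows why an instruction like `teq` is needed: any intervening operation overwrites the single shared flag.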

10
Implementation of Checking Branch Condition
Result State
  • Disadvantages
  • Generation of the result state is not
    straightforward: it requires an irregular structure and
    occupies additional chip area.
  • A mechanism is needed to preserve sequential consistency
    under parallel execution (e.g. superscalar
    and VLIW), avoiding multiple or out-of-order
    updates of the result state.
  • To retain sequential consistency in
    superscalar and VLIW processors
  • Support multiple sets of condition codes or
    flags.
  • Adopt the direct check concept
  • Advantage: saves a small percentage of code
    length, which is not important.
  • Expect novel architectures to use the other approach
    (direct check concept).

11
Implementation of Checking Branch Condition
Direct Check
  • Alternative for checking results of operations.
  • Instead of result state, results of operations
    are checked directly for specified conditions by
    using dedicated instructions. If the specified
    condition is met, the conditional branch will be
    initiated.
  • Implemented as either two instructions or one
    instruction.
  • Two-instruction approach
  • the result value is checked by an appropriate compare
    instruction and the outcome is written into a
    chosen register (the value is Boolean).
  • a conditional branch instruction is then used to
    test the deposited outcome and branch
    accordingly.
  • One-instruction approach
  • testing and conditional branching are done by the
    same instruction.

12
Implementation of Checking Branch Condition
Direct Check
  • Example
  • add r1, r2, r3 // r1 <- r2 + r3
  • cmpeq r7, r1 // r7 = true if r1 = 0, else NOP
  • bt r7, zero // branch to label if r7 = true,
    else NOP
  • label:
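
The direct check concept can be sketched in the same toy style. The helper names `cmpeq`, `bt`, and `branch_if_zero`, and the dictionary used as a register file, are assumptions for illustration only; the point is that the outcome lands in an ordinary register instead of a shared flag.

```python
def cmpeq(regs, dst, src):
    # Two-instruction approach, step 1: write a Boolean outcome into a
    # chosen general register; no global result state is touched
    regs[dst] = regs[src] == 0


def bt(regs, cond):
    # Two-instruction approach, step 2: the branch decision reads that register
    return bool(regs[cond])


def branch_if_zero(regs, src):
    # One-instruction approach: test and branch decision in a single step
    return regs[src] == 0


regs = {"r1": 0, "r2": 4, "r3": -4, "r7": False}
regs["r1"] = regs["r2"] + regs["r3"]   # add r1, r2, r3
cmpeq(regs, "r7", "r1")                # cmpeq r7, r1
two_step_taken = bt(regs, "r7")        # bt r7, ...
one_step_taken = branch_if_zero(regs, "r1")
```

Because each compare names its own destination register, several compares can be in flight at once without competing for one condition code, which is the sequential-consistency benefit the slides attribute to this approach.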

13
Implementation of Checking Branch Condition
Summary
14
Branch Statistics
  • Branches affect program parallelism/performance.
  • Branch percentage: branches account for about 20%
    of general-purpose code and about 5-10% of
    scientific/technical code.
  • Conditional branches: the majority of branches are
    conditional (80%).
  • Ratio of taken to not-taken branches: about 5/6
    of conditional branches are taken.

15
Branch Statistics
16
Branch Statistics
17
Branch Problem
  • Branches interrupt pipelining, thus introducing
    bubbles (or wasted cycles) in the instruction
    pipes and resulting in performance loss.
  • Processing a conditional branch usually causes a
    longer penalty than an unconditional branch
    (usually, an extra cycle is required to evaluate
    the specified condition).
  • Performance gets worse for unresolved conditional
    branches (it takes more cycles to determine the
    condition).
  • Due to the high frequency of branches (1/4 to 1/6
    of instructions), ineffective branch processing can
    seriously impede performance.
  • Recent processor development makes the situation
    worse
  • Pipelines are becoming deeper.
  • Higher probability of encountering branches, as
    the instruction consumption rate increases.

18
Branch Problem
19
Performance Measurement
20
Performance Measurement
  • Branch penalty is defined as the number of
    additional delay cycles occurring until the
    target instruction is fetched over the natural
    1-cycle delay.
  • E.g. effective penalty of branch processing P
  • $P = f_t P_t + f_{nt} P_{nt}$, where t = taken, nt = not
    taken
  • $P_{80386} = 0.75 \times 8 + 0.25 \times 2 = 6.5$ cycles
  • $P_{i486} = 0.75 \times 2 + 0.25 \times 0 = 1.5$ cycles
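
The effective-penalty figures on this slide can be re-derived with a short Python sketch. The function name `effective_penalty` is an assumption; the taken fraction 0.75 and the per-case penalties are the ones quoted on the slide.

```python
def effective_penalty(f_taken, p_taken, p_not_taken):
    """Weighted branch penalty: P = f_t * P_t + f_nt * P_nt, with f_nt = 1 - f_t."""
    return f_taken * p_taken + (1.0 - f_taken) * p_not_taken


# Values quoted on the slide: 75% of branches taken
p_80386 = effective_penalty(0.75, p_taken=8, p_not_taken=2)
p_i486 = effective_penalty(0.75, p_taken=2, p_not_taken=0)
print(p_80386, p_i486)
```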

21
Performance Measurement
  • E.g. effective penalty of branch processing P
    with prediction
  • $P = f_{tc} P_{tc} + f_{tm} P_{tm} + f_{ntc} P_{ntc} + f_{ntm} P_{ntm}$
  • where tc = correctly predicted taken
  • ntc = correctly predicted not-taken
  • tm = mis-predicted taken
  • ntm = mis-predicted not-taken
  • If $P_{tc} = P_{ntc} = P_c$ and $P_{tm} = P_{ntm} = P_m$, with $f_c = f_{tc} + f_{ntc}$
    and $f_m = f_{tm} + f_{ntm}$, then
  • $P = f_c P_c + f_m P_m$
  • $P_{Pentium} = 0.9 \times 0 + 0.1 \times 3.5 = 0.35$ cycles
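
A small Python sketch can confirm that the four-term formula collapses to the two-term one when correct and mis-predicted cases share a penalty. The function names are assumptions, and the split of the 0.9/0.1 frequencies into four categories below is purely illustrative (the slide gives only the totals).

```python
def penalty_four_way(freq, penalty):
    # P = f_tc*P_tc + f_tm*P_tm + f_ntc*P_ntc + f_ntm*P_ntm
    return sum(freq[k] * penalty[k] for k in ("tc", "tm", "ntc", "ntm"))


def penalty_two_way(f_correct, p_correct, p_mispredict):
    # P = f_c*P_c + f_m*P_m, with f_m = 1 - f_c
    return f_correct * p_correct + (1.0 - f_correct) * p_mispredict


# Illustrative split only: f_c = 0.7 + 0.2 = 0.9 and f_m = 0.06 + 0.04 = 0.1
freq = {"tc": 0.7, "ntc": 0.2, "tm": 0.06, "ntm": 0.04}
penalty = {"tc": 0.0, "ntc": 0.0, "tm": 3.5, "ntm": 3.5}

p_full = penalty_four_way(freq, penalty)
p_pentium = penalty_two_way(0.9, 0.0, 3.5)
```

Both calls agree to within floating-point rounding with the 0.35 cycles quoted for the Pentium.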

22
Performance Measurement 0-Cycle Branch
  • Zero-cycle branching (branch folding) refers to
    branch implementations which allow execution of
    branches with a one-cycle gain compared to a
    sequential execution.
  • Instruction logically following the branch is
    executed immediately after the instruction which
    precedes the branch.
  • It can be implemented with conditional branches,
    when it provides a seamless execution along both
    the taken and not-taken paths.
  • Examples of processors using branch folding
  • PowerPC, PA8000

23
Performance Measurement 0-Cycle Branch
24
Part B
  • Basic Approach to Branch Handling

25
Basic Questions to Branch Handling
  • Three basic questions
  • Whether branch delay slots are used.
  • With delayed branching, otherwise unused bubbles
    are filled as far as possible with executable
    instructions.
  • A change of execution semantics (after a delayed
    branch) is needed.
  • How unresolved conditional branches are handled.
  • Simplest way is to block branch processing
  • Prediction or speculation of branching with
    reasonable accuracy can help.
  • Whether the architecture provides means to avoid
    conditional branches.
  • Control dependencies are changed to data
    dependencies.
  • Redefinition of ISA is needed.

26
Basic Approaches to Branch Handling
27
Basic Approaches to Branch Handling
28
Part C
  • Delayed Branching

29
Basic Delayed Branching Scheme
  • Instruction slots following branches are called
    branch delay slots.
  • Branch delay slots are wasted during traditional
    execution.
  • With delayed branching, the instruction(s) following
    the branch is (are) always executed in the
    delay slot(s), and the branch takes effect
    later, delayed by n cycles.
  • The scheme applies to both conditional and
    unconditional branches, although the former
    have longer delays.
  • Because the execution order is reversed (w.r.t. the branch),
    delayed branching requires an architectural
    redefinition of the execution sequence of
    instructions.

30
Branch Delay Slot
31
Principle of Delay Branching
32
Performance Gain of Delayed Branching
  • Performance gain $G_d$ by delayed branching
  • Assume 100 instructions have $100 f_b$ delay
    slots.
  • Slots utilized: $n_u = 100 f_b f_f$, where $f_b$ = branch
    frequency and $f_f$ = frequency of filled delay slots
  • $G_d = (100 + n_u)/100 - 1 = n_u/100 = f_b f_f$
  • $G_{max} = f_b$ with $f_f = 1$
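
The gain estimate can be turned into a small calculation. This is a sketch under the slide's own counting argument (one delay slot per branch, 100 instructions as the base); the function name and the fill ratio of 0.6 used in the example are assumptions, while the 20% branch frequency is the general-purpose figure quoted earlier in the deck.

```python
def delayed_branch_gain(f_b, f_f, base=100):
    """Relative gain G_d = n_u / base, with n_u = base * f_b * f_f."""
    delay_slots = base * f_b        # one delay slot per branch
    n_u = delay_slots * f_f         # slots filled with useful instructions
    return n_u / base               # algebraically equal to f_b * f_f


gain = delayed_branch_gain(f_b=0.20, f_f=0.6)      # hypothetical fill ratio
max_gain = delayed_branch_gain(f_b=0.20, f_f=1.0)  # G_max = f_b
```

With every slot filled, the gain is bounded by the branch frequency itself, which is why the slides describe the benefit as only a slight performance increase.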

33
Analysis of Delayed Branching
  • Advantage
  • Slightly increased performance.
  • Disadvantage
  • Redefinition of architecture
  • Slight code expansion due to NOPs
  • Interrupt processing becomes more difficult.
  • Additional hardware for delayed branching.

34
Extensions to Basic Scheme
Annulment: options in using delay slots that
permit more delay slots to be filled.
35
Kinds of Annulment for Conditional Branches
  • With annulment, 95% filling of delay slots is
    achieved.

36
Annulment Options in Arch. with Delayed Branching
37
Relation Between Delayed Branching and Superscalar
Execution
  • Two problems with superscalar architecture
  • Fewer instructions will be available for the
    delay slots fill-in because independent
    instructions are used for parallel execution.
  • With modification to the architecture, it will be
    difficult to reverse branch with multiple delayed
    slot instructions.
  • Most architectures using delayed branching only
    allow one delay slot, irrespective of the issue
    rate.
  • Apparently, the concept has no future.

38
Part D
  • Branch Processing

39
Branch Processing Design Space
40
Part D-1
  • Branch Detection

41
Concepts of Branch Detection
  • The point at which a branch is detected determines the
    penalties seen by the program.
  • During instruction decoding? Or earlier than
    that?
  • Two approaches
  • Master pipeline: during the common instruction
    decoding stage
  • Early branch detection: earlier than the instruction
    decoding stage
  • In parallel
  • Look-ahead
  • Integrated instruction fetch and branch detection
    (the next instruction to be fetched is checked for a
    branch; if it is one, both the sequential and the target
    address/instruction are fetched)

42
Branch Detection Schemes
43
Parallel Branch Detection
44
Lookahead Branch Detection
45
Lookahead Branch Detection
46
Part D-2
  • Handling of Unresolved Conditional Branches

47
Design Space of Branch Processing Policies
48
Penalties in Blocking Branch Processing
  • Study: the taken penalty is much higher than the
    not-taken penalty

49
Speculative Branch Processing
  • Speculative branch processing
  • A guess is made about the outcome of an unresolved
    conditional branch and execution continues speculatively
    along the guessed path.
  • In case of a correct prediction, the speculative
    execution can be confirmed and then continued.
  • For an incorrect guess, all speculatively
    executed instructions have to be discarded and
    execution restarted along the correct path.
  • Three key aspects
  • Prediction scheme
  • Extent of speculative execution
  • Recovery scheme in case of mis-prediction.

50
Multiway Branching
  • Most ambitious scheme to cope with unresolved
    conditional branches.
  • Both possible paths of branching are pursued.
    Once the specified condition is evaluated, the
    correct path is confirmed and the incorrect path
    is cancelled.

51
Part D-2-1
  • Branch Prediction

52
Kinds of Branch Predictions
53
Fixed Prediction
54
Penalty Figures of Always Not Taken Fixed
Prediction
55
Penalty Figures of Always Taken Fixed Prediction
  • Study: why are there many more processors using the
    complex Always Not Taken approach?

56
Alternatives of Static Branch Prediction
57
Static Opcode-based Prediction in MC88110
58
Dynamic Branch Prediction
  • Dynamic prediction is based on branch history:
    branches which were taken at their last (n)
    occurrences are also likely to be taken at their
    next occurrence.
  • It offers higher performance but needs a more complex
    implementation.
  • Two ways of expressing branch history
  • Explicit dynamic technique
  • Implicit dynamic technique
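
One common explicit technique keeps a small saturating counter per branch. The sketch below is a generic 2-bit counter scheme, not the design of any specific processor in these slides; the class name, table size, indexing by `pc % entries`, and the initial counter value are all assumptions.

```python
class TwoBitPredictor:
    """Table of 2-bit saturating counters: 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self, entries=16):
        self.entries = entries
        self.counters = [1] * entries   # assumed start state: weakly not taken

    def _index(self, pc):
        return pc % self.entries        # assumed simple direct-mapped indexing

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def accuracy(predictor, pc, outcomes):
    # Predict first, then train on the real outcome, as hardware would
    hits = 0
    for taken in outcomes:
        hits += predictor.predict(pc) == taken
        predictor.update(pc, taken)
    return hits / len(outcomes)


# A loop-closing branch: taken 9 times, then falls through, repeated 3 times
loop_branch = ([True] * 9 + [False]) * 3
acc = accuracy(TwoBitPredictor(), pc=0x40, outcomes=loop_branch)
```

On this pattern the counter mispredicts once while warming up and once at each loop exit, but a single not-taken outcome does not flip the prediction for the next pass through the loop.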

59
Types of Dynamic Branch Prediction
60
Explicit Prediction 1-Bit Dynamic Prediction
61
Explicit Prediction 2-Bit Dynamic Prediction
62
Explicit Prediction 3-Bit Dynamic Prediction
63
Implicit Dynamic Prediction
  • Two branch target access schemes can be used for
    implicit branch prediction. They are
  • BTAC (Branch Target Address Cache)
  • BTIC (Branch Target Instruction Cache)
  • An extra cache is introduced to hold
  • the most recently used branch addresses and either
  • the corresponding branch target addresses (for BTAC) or
  • the corresponding branch target instructions (for
    BTIC)
  • Note that implicit dynamic prediction is a
    special kind of 1-bit prediction.

64
Implementation Alternatives of History Bits
65
Example of Implementing History Bits BHT of
PowerPC604
  • Note that there are 2 prediction schemes here!

66
Multiple Prediction Techniques in a Processor
  • Implicit scheme: faster but less accurate; n-bit
    predictors: slower but more accurate.

67
Accuracy of Branch Prediction (by Yeh & Patt, 1992)
[Figure annotations: More Accuracy, More Complex]
68
Accuracy of Branch Prediction (by Yeh & Patt, 1992)
  • Study: range of performance among benchmarks;
    accuracy in floating-point vs. integer
    applications.

69
Prediction Accuracy of Recent Processors
70
Branch Prediction Trends
71
Extent of Speculativeness
72
Basic Tasks During Recovery from Mis-Prediction
73
Recovery Activities from Mis-Prediction
74
Frequently Employed Recovery Schemes from
Mis-Pred.
75
2-Buffers Recovery for Recovery of Mis-Prediction
76
3-Buffers Recovery for Recovery of Mis-Prediction
77
Part D-3
  • Accessing Branch Target Path

78
Branch Target Accessing Schemes
  • Taken conditional branches have a higher
    occurrence than not-taken ones.
  • The higher penalty for taken guesses depends
    heavily on how the branch target path is accessed.
  • Four schemes

79
Accessing Branch Target Path Compute/Fetch
  • Branch target address (BTA) is computed and the
    corresponding branch target instruction (BTI) is
    fetched.
  • When a branch is encountered, the next sequential
    address is overwritten by the computed branch
    target address (BTA).
  • Drawbacks
  • Accessing the BTA and the BTI sequentially results
    in a considerable taken-path access penalty

80
Accessing Branch Target Path Compute/Fetch
81
Accessing Branch Target Path BTAC Scheme
  • BTAC contains pairs of recently used branch
    addresses (BA) and branch target addresses
    (BTAs).
  • When the actual instruction fetch address is a
    branch address and there is a corresponding entry
    in the BTAC, the BTA is fetched along with the
    branch instruction in the same cycle.
  • Issues of implementation
  • Associativity of BTAC
  • How BTAC is initialized
  • Whether BTAC contains recent branches or only
    recently taken branches
  • Replacement of entries in BTAC
  • If prediction bits are used, will they be stored
    in the BTAC?
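
The BTAC lookup can be sketched as a small associative table. Every design choice below answers one of the implementation issues listed above with an assumption rather than a documented design: the table is fully associative, starts empty, records only taken branches, uses LRU replacement, and stores no prediction bits. The class and function names are also hypothetical.

```python
from collections import OrderedDict


class BTAC:
    """Branch address (BA) -> branch target address (BTA), with LRU replacement."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.entries = OrderedDict()    # starts empty (assumed initialization)

    def lookup(self, fetch_addr):
        # Hit: the BTA is available together with the fetched branch
        bta = self.entries.get(fetch_addr)
        if bta is not None:
            self.entries.move_to_end(fetch_addr)   # mark as most recently used
        return bta

    def record_taken(self, branch_addr, target_addr):
        # Only taken branches are entered (assumed policy)
        self.entries[branch_addr] = target_addr
        self.entries.move_to_end(branch_addr)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)       # evict least recently used


def next_fetch_address(btac, fetch_addr, instr_size=4):
    # A hit redirects fetch to the BTA; a miss continues sequentially
    bta = btac.lookup(fetch_addr)
    return bta if bta is not None else fetch_addr + instr_size


btac = BTAC(capacity=2)
miss_next = next_fetch_address(btac, 0x100)   # not yet known: sequential fetch
btac.record_taken(0x100, 0x200)
hit_next = next_fetch_address(btac, 0x100)    # now redirected to the target
```

A hit here doubles as an implicit "predict taken", which is why the slides treat the BTAC as a special kind of 1-bit prediction.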

82
Accessing Branch Target Path BTAC Scheme
83
Examples of Processors Using BTAC Scheme
84
Accessing Branch Target Path BTIC Scheme
  • The BTIC is a small extra cache which
    delivers, for taken or predicted taken branches,
    the branch target instruction (BTI) or a
    specified number of BTIs.
  • Two implementation choices
  • Contains
  • addresses of recently taken BA
  • BTI
  • instruction addresses following BTI
  • Contains
  • addresses of recently taken BA
  • BTI and BTI+1
  • instruction addresses following BTI are
    calculated dynamically

85
Principle of BTIC Scheme with Storing Taken Path
Contn
86
Principle of BTIC Scheme w/ Calculating Taken
Path Contn
87
Examples of Processors Using BTIC
88
Accessing Branch Target Path Successor Index in
I-Cache
89
Trends in Branch Target Accessing Schemes
90
Subroutine Return Stack For Return Addresses
  • Dedicated hardware stack to hold the return
    addresses of subroutine calls.
  • Size of the return stack depends on the expected
    nesting depth of subroutines.
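
A return stack of fixed depth can be sketched as follows. The class name, the default depth, and the overflow policy (discard the oldest entry) are assumptions; real designs differ in how they handle overflow.

```python
class ReturnStack:
    """Fixed-depth stack of predicted return addresses."""

    def __init__(self, depth=4):
        self.depth = depth
        self.stack = []

    def on_call(self, return_addr):
        # Branch to subroutine: remember where execution should resume
        if len(self.stack) == self.depth:
            self.stack.pop(0)           # overflow: drop the oldest entry (assumed)
        self.stack.append(return_addr)

    def on_return(self):
        # Return from subroutine: predicted target, or None if the stack is empty
        return self.stack.pop() if self.stack else None


rs = ReturnStack(depth=2)
for addr in (0x10, 0x20, 0x30):         # three nested calls, only two slots
    rs.on_call(addr)
predicted = [rs.on_return(), rs.on_return(), rs.on_return()]
```

The third return finds the stack empty because the nesting exceeded the depth, which illustrates why the stack is sized for the expected nesting depth.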

91
Branch Penalties in Current Processors
92
Part D-4
  • Microarchitectural Implementation of Branch
    Processing

93
Basic Subtasks of Branch Processing
94
Part D-5
  • Multiway Branching

95
Multiway Branching
  • Both sequential and taken paths of an unresolved
    conditional branch are pursued.
  • Multiple unresolved branches are allowed in
    multiway branching.

96
Multiway Branching for Multiple Unresolved
Branching
97
Part E
  • Guarded Execution

98
Concept of Guarded Execution
  • Guarded execution is a means to eliminate, at
    least partly, conditional branches.
  • It introduces conditional operate instructions
    into the architecture and uses them to replace
    conditional branches.
  • A guarded instruction has two parts: a condition part,
    called the guard, and an operation part, which is a
    traditional instruction.
  • E.g. cmovxx ra.rq, rb.rq, rc.wq
  • where xx is the condition
  • ra.rq and rb.rq are read-only operand registers.
  • rc.wq is the write-only operand register.
  • Example: cmoveq // conditional move if ra is equal to zero
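
The semantics of a guarded move can be illustrated with a toy register file. This sketch models only the architectural effect of the equal-to-zero conditional move from the slide (written `cmoveq` here); the dictionary register file and the helper `select_if_zero` are assumptions, and Python's own `if` stands in for what hardware does without redirecting instruction fetch.

```python
def cmoveq(regs, ra, rb, rc):
    # Guard: ra == 0. Operation: rc <- rb.
    # When the guard is false the instruction has no effect on rc.
    if regs[ra] == 0:
        regs[rc] = regs[rb]


def select_if_zero(cond, new_value, old_value):
    # Branch-free view of the same thing for one value:
    # the control dependency becomes a data dependency on `cond`
    regs = {"ra": cond, "rb": new_value, "rc": old_value}
    cmoveq(regs, "ra", "rb", "rc")
    return regs["rc"]


moved = select_if_zero(0, new_value=42, old_value=7)    # guard true
kept = select_if_zero(5, new_value=42, old_value=7)     # guard false
```

A source fragment such as "if a is zero, set c to b" can therefore be compiled to a single guarded instruction instead of a conditional branch around a move.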

99
Instruction Mix about Branch Execution
100
Performance Study of Guarded Execution
101
Basic Block Enlargement for Full Guarded Execution
  • Full guarding: all instructions are assumed to be
    guarded

102
Basic Block Enlargement for Restricted Guarded
Execution
  • Restricted guarding: only operate instructions
    have a guarded form.

103
Path Expansion for Full Guarding
104
Path Expansion for Restricted Guarding
105
  • END