Title: CS5222 Advanced Processor Architecture Part 5: Processing of Control Transfer Instructions
1 CS5222 Advanced Processor Architecture Part 5
Processing of Control Transfer Instructions
- Fall Term, 2004/2005
- Chi Chi Hung (email chich_at_comp.nus.edu.sg)
- Building S/17, Rm 5-13
- Phone 6874-2832
2Overview
- Introduction to branch problem
- Basic approaches to branch handling
- Delay branch
- Branch handling
- Multiway branching
- Guarded execution
3Part A
4Types of Branches
- Branches are used to transfer control to a specified locus of the program.
- Types of branches
- Unconditional branches: always taken
- simple unconditional branch
- branch to subroutine
- return from subroutine
- Conditional branches: a condition determines whether the branch is taken
- loop-closing conditional branch: backward branches that are taken for all but the last iteration of a loop
- other conditional branches
- Implementation decisions (Example from PowerPC)
- All types of branch instructions merge into branch to subroutine instructions with the help of a specified bit in instruction encoding?
- Conditional and unconditional branches are handled in a unified way?
5Types of Branches
6Checking of Branch Condition
- Two approaches to check the branch condition
- Check the result of an instruction
- Compare two operands
- Two approaches to implement checking
- result state concept
- direct state concept
7Implementation of Checking Branch Condition
8Implementation of Checking Branch Condition
Result State
- Under the result state concept, a result state, normally in the form of a condition code or flags, is declared in the ISA.
- The result state holds relevant information about the results of operations.
- It is updated automatically during instruction execution.
- It can be tested by a conditional branch
- relevant conditions of any operand immediately before the branch
- relevant conditions of any operand generated earlier
- Need some new instruction such as teq to test an operand.
- Many architectures use the result state approach, such as IBM mainframes, VAX, x86, SPARC, PowerPC.
9Implementation of Checking Branch Condition
Result State
- Example
- add r1, r2, r3 // r1 <- r2 + r3
- beq label // test whether the result equals zero; if yes, branch to the location label
-
- label:
- teq r1 // test for (r1) = 0 and update the result state
- beq label // test whether the result equals zero; if yes, branch to the location label
-
10Implementation of Checking Branch Condition
Result State
- Disadvantages
- Generation of the result state is not straightforward: it requires an irregular structure and occupies additional chip area.
- A mechanism is needed to preserve sequential consistency in case of parallel execution (e.g. superscalar and VLIW): avoid multiple or out-of-order updating of the result state.
- To retain sequential consistency in superscalar and VLIW
- Support multiple sets of condition codes or flags.
- Adopt the direct check concept
- Advantage: saves a small percentage of code length, which is not important.
- Expect novel architectures to use the other approach (direct check concept).
11Implementation of Checking Branch Condition
Direct Check
- Alternative for checking results of operations.
- Instead of using a result state, results of operations are checked directly for specified conditions by using dedicated instructions. If the specified condition is met, the conditional branch will be initiated.
- Implemented as either two instructions or one instruction.
- Two-instruction approach
- the result value is checked by an appropriate compare instruction and the outcome is written into a chosen register (value is Boolean).
- a conditional branch instruction is then used to test the deposited outcome and branch accordingly.
- One-instruction approach
- testing and conditional branching are done by the same instruction.
12Implementation of Checking Branch Condition
Direct Check
- Example
- add r1, r2, r3 // r1 <- r2 + r3
- cmpeq r7, r1 // r7 <- true if r1 = 0, else NOP
-
- bt r7, zero // branch to label if r7 = true, else NOP
-
- label
13Implementation of Checking Branch Condition
Summary
14Branch Statistics
- Branches affect program parallelism/performance.
- Branch percentage: branches account for about 20% of general-purpose code and about 5-10% of scientific/technical code.
- Conditional branches: the majority of branches are conditional (80%).
- Ratio of taken to not-taken branches: about 5/6 of conditional branches are taken.
15Branch Statistics
16Branch Statistics
17Branch Problem
- Branches interrupt pipelining, introducing bubbles (wasted cycles) in the instruction pipes and resulting in performance loss.
- Processing a conditional branch usually causes a longer penalty than an unconditional branch (usually, an extra cycle is required to evaluate the specified condition).
- Performance gets worse for unresolved conditional branches (it takes more cycles to determine the condition).
- Due to the high frequency of branches (1/4 to 1/6 of instructions), ineffective branch processing can seriously impede performance.
- Recent processor development makes the situation worse
- Pipelines are becoming deeper.
- Higher probability of encountering branches, as the instruction consumption rate increases.
18Branch Problem
19Performance Measurement
20Performance Measurement
- Branch penalty is defined as the number of additional delay cycles, over the natural 1-cycle delay, that occur until the target instruction is fetched.
- E.g. effective penalty of branch processing, $P$
- $P = f_t \cdot P_t + f_{nt} \cdot P_{nt}$, where $t$ = taken and $nt$ = not taken
- $P_{80386} = 0.75 \cdot 8 + 0.25 \cdot 2 = 6.5$ cycles
- $P_{i486} = 0.75 \cdot 2 + 0.25 \cdot 0 = 1.5$ cycles
21Performance Measurement
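- E.g. effective penalty of branch processing, $P$, with prediction
- $P = f_{tc} \cdot P_{tc} + f_{tm} \cdot P_{tm} + f_{ntc} \cdot P_{ntc} + f_{ntm} \cdot P_{ntm}$
- where $tc$ = correctly predicted taken
- $ntc$ = correctly predicted not taken
- $tm$ = mispredicted taken
- $ntm$ = mispredicted not taken
- If $P_{tc} = P_{ntc} = P_c$ and $P_{tm} = P_{ntm} = P_m$, then with $f_c = f_{tc} + f_{ntc}$ and $f_m = f_{tm} + f_{ntm}$
- $P = f_c \cdot P_c + f_m \cdot P_m$
- $P_{Pentium} = 0.9 \cdot 0 + 0.1 \cdot 3.5 = 0.35$ cycles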
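As a rough check of the prediction-based penalty, the sketch below evaluates both the four-class sum and its simplified two-class form. The function names and the split of the 90%/10% figures into four classes are assumptions chosen for illustration; only the Pentium figures (90% correct at 0 cycles, 10% mispredicted at 3.5 cycles) come from the slide.

```python
CLASSES = ("tc", "tm", "ntc", "ntm")


def penalty_by_class(freq, penalty):
    """P = sum of f_x * P_x over the classes tc, tm, ntc, ntm (cycles)."""
    total = sum(freq[c] for c in CLASSES)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("class frequencies must sum to 1")
    return sum(freq[c] * penalty[c] for c in CLASSES)


def penalty_simplified(f_correct, p_correct, p_mispredict):
    """P = f_c * P_c + f_m * P_m, with f_m = 1 - f_c."""
    return f_correct * p_correct + (1.0 - f_correct) * p_mispredict


# Slide figures for the Pentium.
p_pentium = penalty_simplified(0.9, 0, 3.5)

# Hypothetical split of the same 90%/10% into the four classes.
freq = {"tc": 0.6, "ntc": 0.3, "tm": 0.05, "ntm": 0.05}
penalty = {"tc": 0, "ntc": 0, "tm": 3.5, "ntm": 3.5}
p_four_way = penalty_by_class(freq, penalty)

print(f"{p_pentium:.2f} {p_four_way:.2f}")
```

When the penalties depend only on whether the prediction was correct, the two forms agree regardless of how the correct and mispredicted fractions divide between taken and not-taken branches.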
22Performance Measurement 0-Cycle Branch
- Zero-cycle branching (branch folding) refers to branch implementations which allow execution of branches with a one-cycle gain compared to sequential execution.
- The instruction logically following the branch is executed immediately after the instruction which precedes the branch.
- It can be implemented with conditional branches, when it provides a seamless execution along both the taken and not-taken paths.
- Examples of processors using branch folding
- PowerPC, PA8000
23Performance Measurement 0-Cycle Branch
24Part B
- Basic Approach to Branch Handling
25Basic Questions to Branch Handling
- Three basic questions
- Whether branch delay slots are used.
- With delayed branching, otherwise unused bubbles are filled as far as possible with executable instructions.
- Change of execution semantics (after delayed branch) is needed.
- How unresolved conditional branches are handled.
- Simplest way is to block branch processing
- Prediction or speculation of branching with reasonable accuracy can help.
- Whether the architecture provides means to avoid conditional branches.
- Control dependencies are changed to data dependencies.
- Redefinition of ISA is needed.
26Basic Approaches to Branch Handling
27Basic Approaches to Branch Handling
28Part C
29Basic Delayed Branching Scheme
- Instruction slots following branches are called branch delay slots.
- Branch delay slots are wasted during traditional execution.
- With delayed branching, the instruction(s) following the branch is (are) always executed in the delay slot(s), and the branch instruction takes effect later, delayed by n cycles.
- The scheme applies to both conditional and unconditional branches, although the former have longer delays.
- Because the execution order is reversed (w.r.t. the branch), delayed branching requires an architectural redefinition of the execution sequence of instructions.
30Branch Delay Slot
31Principle of Delay Branching
32Performance Gain of Delayed Branching
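- Performance gain $G_d$ by delayed branching
- Assume 100 instructions have $100 \cdot f_b$ delay slots.
- $n_u$ = number of slots utilized = $100 \cdot f_b \cdot f_f$, where $f_b$ = branch frequency and $f_f$ = fraction of delay slots filled
- $G_d = (100 + n_u)/100 - 1 = n_u/100 = f_b \cdot f_f$
- $G_{max} = f_b$ with $f_f = 1$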
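The gain estimate above can be tried with sample numbers. In this minimal sketch, the function name `delayed_branch_gain` and the 60% fill rate are assumptions for illustration; the 20% branch frequency is the general-purpose figure quoted under Branch Statistics.

```python
def delayed_branch_gain(f_branch, f_filled):
    """Gain G_d from delayed branching, following the 100-instruction argument."""
    delay_slots = 100 * f_branch          # one delay slot per branch assumed
    n_used = delay_slots * f_filled       # slots filled with useful instructions
    return (100 + n_used) / 100 - 1       # simplifies to f_branch * f_filled


f_b = 0.20   # branch frequency from the Branch Statistics slide
f_f = 0.60   # hypothetical fill rate, not a figure from the slides
gain = delayed_branch_gain(f_b, f_f)
max_gain = delayed_branch_gain(f_b, 1.0)  # every delay slot filled
print(f"{gain:.2f} {max_gain:.2f}")
```

Even with every slot filled, the gain cannot exceed the branch frequency itself, which is why the advantage is described as slight.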
33Analysis of Delayed Branching
- Advantage
- Slightly increased performance.
- Disadvantage
- Redefinition of architecture
- Slight code expansion due to NOPs
- Interrupt processing being more difficult.
- Additional hardware for delayed branching.
34Extensions to Basic Scheme
Annulment: options in using delay slots, which permit more delay slots to be filled.
35Kinds of Annulment for Conditional Branches
- With annulment, 95% filling of delay slots is achieved.
36Annulment Options in Arch. with Delayed Branching
37Relation Between Delayed Branching and Superscalar Execution
- Two problems with superscalar architecture
- Fewer instructions will be available to fill the delay slots, because independent instructions are used for parallel execution.
- With modification to the architecture, it will be difficult to reverse the execution order of a branch with multiple delay-slot instructions.
- Most architectures using delayed branching only allow one delay slot, irrespective of the issue rate.
- Apparently, the concept has no future.
38Part D
39Branch Processing Design Space
40Part D-1
41Concepts of Branch Detection
- The point at which a branch is detected can determine the penalties seen by the program.
- During instruction decoding? Or earlier than that?
- Two approaches
- Master pipeline: during the common instruction decoding stage
- Early branch detection: earlier than the instruction decoding stage
- In parallel
- Look-ahead
- Integrated instruction fetch and branch detection (the next instruction to be fetched is checked for a branch; if yes, both the sequential and the target address/instruction will be fetched as well)
42Branch Detection Schemes
43Parallel Branch Detection
44Lookahead Branch Detection
45Lookahead Branch Detection
46Part D-2
- Handling of Unresolved Conditional Br.
47Design Space of Branch Processing Policies
48Penalties in Blocking Branch Processing
- Study: Taken penalty is much higher than not-taken penalty
49Speculative Branch Processing
- Speculative branch processing
- A guess is made for an unresolved conditional branch and execution continues speculatively along the guessed path.
- In case of a correct prediction, the speculative execution can be confirmed and then continued.
- For an incorrect guess, all speculatively executed instructions have to be discarded and execution restarted along the correct path.
- Three key aspects
- Prediction scheme
- Extent of speculative execution
- Recovery scheme in case of mis-prediction.
50Multiway Branching
- Most ambitious scheme to cope with unresolved conditional branches.
- Both possible paths of branching are pursued. Once the specified condition is evaluated, the correct path is confirmed and the incorrect path is cancelled.
51Part D-2-1
52Kinds of Branch Predictions
53Fixed Prediction
54Penalty Figures of Always Not Taken Fixed
Prediction
55Penalty Figures of Always Taken Fixed Prediction
- Study: Why are there many more processors using complex Always not Taken approach?
56Alternatives of Static Branch Prediction
57Static Opcode-based Prediction in MC88110
58Dynamic Branch Prediction
- Dynamic prediction is based on branch history: branches which were taken at their last (n) occurrences are also likely to be taken at their next occurrence.
- It has higher performance as well as a more complex implementation.
- Two ways of expressing branch history
- Explicit dynamic technique
- Implicit dynamic technique
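To make the history-based idea concrete, here is a minimal sketch of an explicit 2-bit predictor built from saturating counters, a common way to realize such schemes. The class name, the per-address dictionary, and the initial "weakly not taken" state are assumptions for illustration, not details taken from the slides.

```python
class TwoBitPredictor:
    """Per-branch 2-bit saturating counter: states 0-1 predict not taken, 2-3 taken."""

    def __init__(self):
        self.counters = {}  # branch address -> counter state in 0..3

    def predict(self, addr):
        # Unseen branches start in state 1 (weakly not taken); an assumption.
        return self.counters.get(addr, 1) >= 2

    def update(self, addr, taken):
        state = self.counters.get(addr, 1)
        # Move one step toward the actual outcome, saturating at 0 and 3.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
        self.counters[addr] = state


def accuracy(outcomes, addr=0x40):
    """Fraction of correct predictions for one branch over a list of outcomes."""
    predictor = TwoBitPredictor()
    correct = 0
    for taken in outcomes:
        if predictor.predict(addr) == taken:
            correct += 1
        predictor.update(addr, taken)
    return correct / len(outcomes)


# Loop-closing branch: taken 9 times, then falls through; the loop runs 3 times.
loop_outcomes = ([True] * 9 + [False]) * 3
print(accuracy(loop_outcomes))
```

With two bits of history, the single not-taken outcome at each loop exit does not flip the prediction, so only the exit itself is mispredicted once the counter has warmed up.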
59Types of Dynamic Branch Prediction
60Explicit Prediction 1-Bit Dynamic Prediction
61Explicit Prediction 2-Bit Dynamic Prediction
62Explicit Prediction 3-Bit Dynamic Prediction
63Implicit Dynamic Prediction
- Two branch target access schemes can be used for branch prediction. They are
- BTAC (Branch Target Address Cache)
- BTIC (Branch Target Instruction Cache)
- An extra cache is introduced to hold
- the most recently used branch addresses and either
- the corresponding branch target addresses (for BTAC), or
- the corresponding branch target instructions (for BTIC)
- Note that implicit dynamic prediction is a special kind of 1-bit prediction.
64Implementation Alternatives of History Bits
65Example of Implementing History Bits BHT of
PowerPC604
- Note that there are 2 prediction schemes here!
66Multiple Prediction Techniques in a Processor
- Implicit scheme: faster but less accurate; n-bit predictors: slower but more accurate.
67Accuracy of Branch Prediction (by Yeh & Patt 1992)
[Figure axis labels: More Accuracy, More Complex]
68Accuracy of Branch Prediction (by Yeh & Patt 1992)
- Study: range of performance among benchmarks; accuracy in floating point vs. integer applications.
69Prediction Accuracy of Recent Processors
70Branch Prediction Trends
71Extent of Speculativeness
72Basic Tasks During Recovery from Mis-Prediction
73Recovery Activities from Mis-Prediction
74Frequently Employed Recovery Schemes from
Mis-Pred.
75 2-Buffers Recovery for Recovery of Mis-Prediction
76 3-Buffers Recovery for Recovery of Mis-Prediction
77Part D-3
- Accessing Branch Target Path
78Branch Target Accessing Schemes
- Taken conditional branches have a higher occurrence than not-taken ones.
- The higher penalty for "taken" guesses depends heavily on how the branch target path is accessed.
- Four schemes
79Accessing Branch Target Path Compute/Fetch
- The branch target address (BTA) is computed and the corresponding branch target instruction (BTI) is fetched.
- When a branch is encountered, the next sequential address is overwritten by the computed branch target address (BTA).
- Drawbacks
- Accessing the BTA and then the BTI sequentially results in a considerable taken-path access penalty
80Accessing Branch Target Path Compute/Fetch
81Accessing Branch Target Path BTAC Scheme
- The BTAC contains pairs of recently used branch addresses (BAs) and branch target addresses (BTAs).
- When the actual instruction fetch address is a branch address and there is a corresponding entry in the BTAC, the BTA is fetched along with the branch instruction in the same cycle.
- Issues of implementation
- Associativity of the BTAC
- How the BTAC is initialized
- Whether the BTAC contains recent branches or only recently taken branches
- Replacement of entries in the BTAC
- If prediction bits are used, will they be stored in the BTAC?
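The lookup described above can be sketched in software. This is an illustrative model only: the class name `BTAC`, full associativity, LRU replacement, storing only taken branches, and a 4-byte instruction size are all assumptions that pick one answer to each of the implementation issues listed.

```python
from collections import OrderedDict


class BTAC:
    """Small fully associative cache of (branch address -> branch target address)."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.entries = OrderedDict()  # ordered from least to most recently used

    def lookup(self, fetch_addr):
        """Return the cached BTA on a hit, or None on a miss."""
        bta = self.entries.get(fetch_addr)
        if bta is not None:
            self.entries.move_to_end(fetch_addr)  # mark as most recently used
        return bta

    def record_taken(self, branch_addr, target_addr):
        """Insert or refresh an entry after a branch is resolved as taken."""
        self.entries[branch_addr] = target_addr
        self.entries.move_to_end(branch_addr)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)      # evict least recently used


def next_fetch_address(btac, fetch_addr, instr_size=4):
    """A hit redirects fetch to the BTA; a miss falls through sequentially."""
    bta = btac.lookup(fetch_addr)
    return bta if bta is not None else fetch_addr + instr_size


btac = BTAC(capacity=2)
first = next_fetch_address(btac, 0x100)   # miss: sequential address
btac.record_taken(0x100, 0x200)
second = next_fetch_address(btac, 0x100)  # hit: branch target address
print(hex(first), hex(second))
```

Because a hit is treated as "predict taken", this model also shows why the implicit scheme behaves like a 1-bit prediction: presence in the cache is the only history kept.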
82Accessing Branch Target Path BTAC Scheme
83Examples of Processors Using BTAC Scheme
84Accessing Branch Target Path BTIC Scheme
- The BTIC is a small extra cache which delivers, for taken or predicted-taken branches, the branch target instruction (BTI) or a specified number of BTIs.
- Two implementation choices
- Contains
- addresses of recently taken branches (BA)
- BTI
- instruction addresses following BTI
- Contains
- addresses of recently taken branches (BA)
- BTI and BTI+1
- instruction addresses following BTI are calculated dynamically
85Principle of BTIC Scheme with Storing Taken Path
Contn
86Principle of BTIC Scheme w/ Calculating Taken
Path Contn
87Examples of Processors Using BTIC
88Accessing Branch Target Path Successor Index in
I-Cache
89Trends in Branch Target Accessing Schemes
90Subroutine Return Stack For Return Addresses
- Dedicated hardware stack to hold the return addresses of subroutines.
- The size of the return stack depends on the expected depth of subroutine nesting.
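A return stack of this kind can be modelled in a few lines. In the sketch below, the class name, the default depth, and the policy of discarding the oldest entry on overflow are assumptions for illustration; real designs may handle overflow differently.

```python
class ReturnAddressStack:
    """Fixed-depth stack that predicts the target of subroutine returns."""

    def __init__(self, depth=4):
        self.depth = depth
        self.stack = []

    def on_call(self, return_addr):
        """Push the address of the instruction following the call."""
        if len(self.stack) == self.depth:
            self.stack.pop(0)  # overflow: discard the oldest entry (assumption)
        self.stack.append(return_addr)

    def on_return(self):
        """Pop the predicted return address, or None if the stack is empty."""
        return self.stack.pop() if self.stack else None


ras = ReturnAddressStack(depth=2)
for addr in (0x10, 0x20, 0x30):   # three nested calls, one more than the depth
    ras.on_call(addr)
predictions = [ras.on_return() for _ in range(3)]
print(predictions)
```

The innermost returns are predicted correctly, while the outermost one is lost, which illustrates why the stack is sized for the expected nesting depth.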
91Branch Penalties in Current Processors
92Part D-4
- Microarchitectural Implementation of Branch
Processing
93Basic Subtasks of Branch Processing
94Part D-5
95Multiway Branching
- Both sequential and taken paths of an unresolved conditional branch are pursued.
- Multiple unresolved branches are allowed in multiway branching.
96Multiway Branching for Multiple Unresolved
Branching
97Part E
98Concept of Guarded Execution
- Guarded execution is a means to eliminate, at least partly, conditional branches.
- It introduces conditional operate instructions into the architecture and uses them to replace conditional branches.
- A conditional operate instruction has two parts: a condition part, called the guard, and an operation part, which is a traditional instruction.
- E.g. cmovxx ra.rq, rb.rq, rc.wq
- where xx = condition
- ra.rq and rb.rq are read-only operand registers.
- rc.wq is the write-only operand register.
- Example: cmoveq // cmove if ra is equal to zero
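The effect of a guarded move can be illustrated by comparing it with the branch it replaces. The sketch below models only the semantics of a conditional move on equal-to-zero; the function names are hypothetical, and Python's conditional expression stands in for the hardware guard rather than demonstrating branch-free execution.

```python
def move_with_branch(ra, rb, rc):
    """Branching form: skip the move unless ra is zero. Returns the new rc."""
    if ra != 0:        # conditional branch around the move
        return rc
    rc = rb            # the move executes only on the fall-through path
    return rc


def move_with_guard(ra, rb, rc):
    """Guarded form: the move always issues, the guard decides whether it writes."""
    guard = (ra == 0)              # condition part (the guard)
    return rb if guard else rc     # operation part: rc <- rb when the guard holds


# Both forms must agree for any register contents.
samples = [(0, 7, 1), (5, 7, 1), (-3, 0, 9), (0, 0, 0)]
results = [(move_with_branch(*s), move_with_guard(*s)) for s in samples]
print(results)
```

In the guarded form the control dependency on the branch becomes a data dependency on the guard, which is the transformation mentioned under the basic questions of branch handling.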
99Instruction Mix about Branch Execution
100Performance Study of Guarded Execution
101Basic Block Enlargement for Full Guarded Execution
- Full guarding: all instructions are assumed to be guarded
102Basic Block Enlargement for Restricted Guarded
Execution
- Restricted guarding: only operate instructions have a guarded form.
103Path Expansion for Full Guarding
104Path Expansion for Restricted Guarding