RISC Processors - PowerPoint PPT Presentation

1
RISC Processors
  • Chapter 14
  • S. Dandamudi

2
Outline
  • Introduction
  • Evolution of CISC processors
  • RISC design principles
  • PowerPC processor
  • Architecture
  • Addressing modes
  • Instruction set
  • Itanium processor
  • Architecture
  • Addressing modes
  • Instruction set
  • Instruction-level parallelism
  • Branch handling
  • Speculative execution

3
Introduction
  • CISC
  • Complex instruction set
  • Pentium is the most popular example
  • RISC
  • Simple instructions
  • Reduced complexity
  • Modern processors use this design philosophy
  • PowerPC, MIPS, SPARC, Intel Itanium
  • Borrow some features from CISC
  • No precise definition
  • We can identify some common characteristics

4
Evolution of CISC Designs
  • Motivation to efficiently use expensive resources
  • Processor
  • Memory
  • High density code
  • Complex instructions
  • Hardware complexity is handled by
    microprogramming
  • Microprogramming is also helpful to
  • Reduce the impact of memory access latency
  • Offer flexibility
  • Low-cost members of the same family
  • Tailored to high-level language constructs

5
Evolution of CISC Designs (contd)
6
Evolution of CISC Designs (contd)
  • Example
  • Autoincrement addressing mode of VAX
  • Performs the following actions
  • (R2) = (R2) + R3; R2 = R2 + 1
  • RISC equivalent
  • R4 = (R2)
  • R4 = R4 + R3
  • (R2) = R4
  • R2 = R2 + 1
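The equivalence between the single autoincrement instruction and the four-step RISC sequence can be checked with a small simulation. This is only a sketch: the dictionary-based memory and register file, the function names, and the sample values are assumptions made for illustration.

```python
def cisc_add_autoincrement(mem, regs):
    """One complex instruction: (R2) = (R2) + R3, then R2 = R2 + 1."""
    mem[regs["R2"]] = mem[regs["R2"]] + regs["R3"]
    regs["R2"] += 1


def risc_sequence(mem, regs):
    """Four simple steps; only the load and the store touch memory."""
    regs["R4"] = mem[regs["R2"]]           # load:  R4 = (R2)
    regs["R4"] = regs["R4"] + regs["R3"]   # add:   R4 = R4 + R3
    mem[regs["R2"]] = regs["R4"]           # store: (R2) = R4
    regs["R2"] = regs["R2"] + 1            # add:   R2 = R2 + 1


# Hypothetical starting state: R2 points at address 100, which holds 7.
mem_a, regs_a = {100: 7}, {"R2": 100, "R3": 5, "R4": 0}
mem_b, regs_b = {100: 7}, {"R2": 100, "R3": 5, "R4": 0}
cisc_add_autoincrement(mem_a, regs_a)
risc_sequence(mem_b, regs_b)
```

Both versions leave the same memory contents and the same R2; the RISC version needs one extra scratch register (R4).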

7
Why RISC?
  • Simple instructions are preferred
  • Complex instructions are mostly ignored by
    compilers
  • Due to semantic gap
  • Simple data structures
  • Complex data structures are used relatively
    infrequently
  • Better to support a few simple data types
    efficiently
  • Synthesize complex ones
  • Simple addressing modes
  • Complex addressing modes lead to variable length
    instructions
  • Lead to inefficient instruction decoding and
    scheduling

8
Why RISC? (contd)
  • Large register set
  • Efficient support for procedure calls and returns
  • Patterson and Séquin's study
  • Procedure call/return: 12-15% of HLL statements
  • Constitute 31-33% of machine language
    instructions
  • Generate nearly half (45%) of memory references
  • Small activation record
  • Tanenbaum's study
  • Only 1.25% of the calls have more than 6
    arguments
  • More than 93% have fewer than 6 local scalar
    variables
  • Large register set can avoid memory references

9
RISC Design Principles
  • Simple operations
  • Simple instructions that can execute in one cycle
  • Register-to-register operations
  • Only load and store operations access memory
  • Rest of the operations on a register-to-register
    basis
  • Simple addressing modes
  • A few addressing modes (1 or 2)
  • Large number of registers
  • Needed to support register-to-register operations
  • Minimize the procedure call and return overhead

10
RISC Design Principles (contd)
Register windows storing activation records
11
RISC Design Principles (contd)
  • Fixed-length instructions
  • Facilitates efficient instruction execution
  • Simple instruction format
  • Fixed boundaries for various fields
  • opcode, source operands, ...
  • Other features
  • Tend to use Harvard architecture
  • Pipelining is visible at the architecture level

12
PowerPC
  • Registers
  • 32 general-purpose registers (GPR0 - GPR31)
  • 32 floating-point registers (FPR0 - FPR31)
  • Condition register (CR)
  • Similar to Pentium's flags register
  • Divided into 8 CR fields (4 bits each)
  • less than (LT), greater than (GT), equal to
    (EQ), summary overflow (SO)
  • CR1 is for floating-point exceptions
  • Other CR fields can be used for integer or FP
    exceptions
  • Branch instructions can test a specific CR field
    bit

13
PowerPC (contd)
14
PowerPC (contd)
  • XER register serves two distinct purposes
  • Bits 0, 1, and 2 are used to capture
  • Summary overflow (SO), overflow (OV), carry (CA)
  • OV and CA are similar to Pentium's overflow and
    carry
  • SO, once set, only a special instruction can
    clear it
  • Bits 25 to 31 (7 bits)
  • Specifies the number of bytes to be transferred
    between memory and registers
  • Two instructions
  • Load string word indexed (lswx)
  • Store string word indexed (stswx)
  • Can load/store all 32 registers (GPR0-GPR31)

15
PowerPC (contd)
  • Link register (LR)
  • Used to store the procedure return address
  • Stores the effective address of the instruction
    following the procedure call instruction
  • Procedure calls use the branch instructions
  • Example: b (branch), bl (procedure call)
  • Count register (CTR)
  • Maintains loop count value
  • Similar to Pentium's ECX register
  • Branch instructions can test the value
  • 32-bit PowerPC implementations use segmentation
    like the Pentium

16
PowerPC (contd)
  • Addressing modes
  • Load/store instructions support three addressing
    modes
  • Can use GPRs
  • Register Indirect
  • Effective address = contents of rA or 0
  • Specifying 0 generates address 0
  • Register Indirect with Immediate Index
  • Effective address = contents of rA or 0 + imm16
  • Register Indirect with Index
  • Effective address = contents of rA or 0 +
    contents of rB
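The three effective-address calculations can be sketched as follows. The function names and the list-based register file are hypothetical; the sign extension of the 16-bit immediate is not stated on the slide and is included here as the usual PowerPC behaviour for displacement fields.

```python
MASK32 = 0xFFFFFFFF


def sign_extend16(imm16):
    """Interpret a 16-bit field as a signed displacement (assumed behaviour)."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def base(gpr, rA):
    # Specifying 0 for rA means the literal value 0, not the contents of GPR0.
    return 0 if rA == 0 else gpr[rA]


def ea_register_indirect(gpr, rA):
    return base(gpr, rA) & MASK32


def ea_immediate_index(gpr, rA, imm16):
    return (base(gpr, rA) + sign_extend16(imm16)) & MASK32


def ea_index(gpr, rA, rB):
    return (base(gpr, rA) + gpr[rB]) & MASK32


# Hypothetical register contents for illustration.
gpr = [0] * 32
gpr[0] = 0xDEAD
gpr[3] = 0x1000
gpr[4] = 0x20
```

With these values, `ea_index(gpr, 3, 4)` gives `0x1020`, and any mode with rA = 0 ignores the value stored in GPR0.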

17
PowerPC (contd)
Instruction format
18
PowerPC (contd)
  • Bits 0-5
  • Specify primary opcode
  • Other fields specify suboperations
  • Depends on instruction type
  • AA bit
  • 1 (use absolute address)
  • 0 (use relative address)
  • LK bit
  • 0 (no link --- branch)
  • 1 (link --- turns branch into a procedure call)
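A minimal sketch of how the primary opcode and the AA and LK bits might be extracted from an unconditional branch word. The positions of AA (bit 30) and LK (bit 31) and the opcode value 18 are taken from the standard PowerPC I-form layout rather than from the slide; PowerPC numbers bits from the most significant end, so bits 0-5 are the top six bits.

```python
def decode_branch(word):
    """Split a 32-bit I-form branch word into (opcode, AA, LK)."""
    opcode = (word >> 26) & 0x3F   # bits 0-5: primary opcode
    aa = (word >> 1) & 1           # bit 30: absolute address
    lk = word & 1                  # bit 31: link
    return opcode, aa, lk


def mnemonic(word):
    """Name the unconditional branch variant selected by AA and LK."""
    opcode, aa, lk = decode_branch(word)
    if opcode != 18:
        raise ValueError("not an I-form unconditional branch")
    return "b" + ("l" if lk else "") + ("a" if aa else "")
```

For example, `mnemonic(0x48000001)` yields `"bl"`: the same encoding as `b` with the LK bit set, which is what turns the branch into a procedure call.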

19
PowerPC Instruction Set
  • Data Transfer instructions
  • Byte loads
  • lbz rD,disp(rA) Load byte and zero
  • lbzu rD,disp(rA) Load byte and zero with update
  • Effective address = contents of rA + disp
  • lbzx rD,rA,rB Load byte and zero indexed
  • lbzux rD,rA,rB Load byte and zero with update indexed
  • Effective address = contents of rA + contents of
    rB
  • Upper three bytes of rD are zeroed
  • Update versions: rA ← effective address

20
PowerPC Instruction Set (contd)
  • Similar instructions for halfword and word loads
  • lhz, lhzu, lhzx, lhzux
  • lwz, lwzu, lwzx, lwzux
  • For halfword loads, sign extension is possible
  • lha, lhau, lhax, lhaux
  • Multiword load
  • lmw rD,disp(rA)
  • Loads n consecutive words at EA to registers rD,
    ..., r31

21
PowerPC Instruction Set (contd)
  • Similar instructions for store
  • stb, stbu, stbx, stbux
  • sth, sthu, sthx, sthux
  • stw, stwu, stwx, stwux
  • Multiword store
  • stmw rS,disp(rA)
  • Stores the n consecutive registers rS, ..., r31
    to memory starting at EA

22
PowerPC Instruction Set (contd)
  • Arithmetic Instructions
  • Add instructions
  • add rD,rA,rB rD ← rA + rB
  • Status and overflow bits of CR0 and XER are not
    altered
  • add. rD,rA,rB alters LT,GT,EQ,SO of CR0
  • addo rD,rA,rB alters SO,OV of XER
  • addo. rD,rA,rB alters LT,GT,EQ,SO of CR0 and SO,OV of XER
  • These four instructions do not alter the CA bit
    of XER

23
PowerPC Instruction Set (contd)
  • To alter CA bit, use
  • adde rD,rA,rB
  • To alter the other bits, use
  • adde., addeo, addeo.
  • Immediate operand version
  • addi rD,rA,Simm16
  • We can use addi to implement other instructions
  • li rD,value as addi rD,0,value
  • la rD,disp(rA) as addi rD,rA,disp
  • subi rD,rA,value as addi rD,rA,-value

24
PowerPC Instruction Set (contd)
  • Subtract instructions
  • subf rD,rA,rB rD ← rB - rA
  • subf = subtract from
  • Like add, other forms are available
  • subf., subfo, subfo.
  • Negate instruction
  • neg rD,rA rD ← 0 - rA

25
PowerPC Instruction Set (contd)
  • Multiply instructions
  • Two instructions to get upper and lower 32 bits
    of the 64-bit result
  • mullw rD,rA,rB signed/unsigned multiply
  • Stores the lower-order 32 bits of the result
  • Use the following to get the upper 32 bits
  • mulhw rD,rA,rB signed
  • mulhwu rD,rA,rB unsigned
  • Immediate form
  • mulli rD,rA,Simm16
  • Stores only lower 32 bits of the 48-bit result
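The split between low-word and high-word multiplies can be illustrated with a sketch that models 32-bit registers as Python integers. The helper `to_signed32` is an assumption of this example, not a PowerPC instruction.

```python
MASK32 = 0xFFFFFFFF


def to_signed32(x):
    """Reinterpret a 32-bit pattern as a signed integer (illustrative helper)."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def mullw(a, b):
    """Lower 32 bits of the product; identical for signed and unsigned operands."""
    return (a * b) & MASK32


def mulhw(a, b):
    """Upper 32 bits of the signed 64-bit product."""
    return ((to_signed32(a) * to_signed32(b)) >> 32) & MASK32


def mulhwu(a, b):
    """Upper 32 bits of the unsigned 64-bit product."""
    return (((a & MASK32) * (b & MASK32)) >> 32) & MASK32
```

With `a = 0xFFFFFFFF` and `b = 2`, the low word is `0xFFFFFFFE` either way, but the high word is `0xFFFFFFFF` when the operands are read as signed (-1 × 2 = -2) and `1` when read as unsigned, which is why only the high-word instruction needs two versions.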

26
PowerPC Instruction Set (contd)
  • Divide instructions
  • Two divide instructions
  • Signed (divw)
  • divw rD,rA,rB rD = rA/rB
  • Unsigned (divwu)
  • Both give only quotient
  • For quotient and remainder, use
  • divw rD,rA,rB quotient in rD
  • mullw rX,rD,rB
  • subf rC,rX,rA remainder in rC
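The three-instruction quotient-and-remainder sequence can be traced in a short sketch. It assumes that the signed divide truncates toward zero and ignores 32-bit overflow and division by zero; the function names are hypothetical.

```python
def divw(a, b):
    """Signed divide that keeps only the quotient, truncating toward zero."""
    q = abs(a) // abs(b)
    return q if (a < 0) == (b < 0) else -q


def div_with_remainder(rA, rB):
    rD = divw(rA, rB)   # divw  rD,rA,rB   quotient in rD
    rX = rD * rB        # mullw rX,rD,rB
    rC = rA - rX        # subf  rC,rX,rA   remainder in rC (rC = rA - rX)
    return rD, rC
```

For example, `div_with_remainder(17, 5)` returns `(3, 2)`, and `div_with_remainder(-17, 5)` returns `(-3, -2)`, so the remainder takes the sign of the dividend under the truncation assumption.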

27
PowerPC Instruction Set (contd)
  • Logical instructions
  • and rD,rS,rB and. rD,rS,rB
  • andi. rD,rS,Uimm16 andis. rD,rS,Uimm16
  • andc rD,rS,rB andc. rD,rS,rB
  • andis. left-shifts Uimm16 by 16 bit positions
    before ANDing
  • andc complements rB before ANDing
  • Dot versions update the LT, GT, EQ, SO bits of
    CR0
  • Logical OR also has these six versions
  • Move register instruction is implemented using OR
  • mr rA,rS is equivalent to or rA,rS,rS
  • NOP is implemented as
  • ori 0,0,0

28
PowerPC Instruction Set (contd)
  • Other logical operations
  • NAND
  • nand
  • nand.
  • NOR
  • nor
  • nor.
  • XOR
  • xor, xor.
  • xori, xoris
  • Equivalence (exclusive-NOR)
  • eqv
  • eqv.

29
PowerPC Instruction Set (contd)
  • Shift and Rotate instructions
  • Shift left
  • slw rA,rS,rB shift left word
  • Shift left the word in rS by rB positions and
    store result in rA
  • Vacated bit positions are filled with zeros
  • Also have the dot version slw.
  • Shift right
  • srw srw. (logical)
  • sraw sraw. (arithmetic)
  • Rotate left instructions
  • rlwnm rA,rS,rB,MB,ME
  • rotlw rA,rS,rB is equivalent to rlwnm rA,rS,rB,0,31
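A sketch of rotate-left-then-mask may help explain why rotlw is just rlwnm with MB = 0 and ME = 31. The mask semantics below (ones from bit MB through bit ME, with bit 0 as the most significant bit, wrapping around when MB > ME) are the standard PowerPC definition and are not spelled out on the slide; the function names are illustrative.

```python
MASK32 = 0xFFFFFFFF


def rotl32(x, n):
    """Rotate a 32-bit value left by n positions."""
    n &= 31
    x &= MASK32
    return ((x << n) | (x >> (32 - n))) & MASK32


def mask(mb, me):
    """Ones from bit mb through bit me, where bit 0 is the most significant bit."""
    from_mb = MASK32 >> mb                        # ones in bits mb..31
    through_me = (MASK32 << (31 - me)) & MASK32   # ones in bits 0..me
    if mb <= me:
        return from_mb & through_me
    return from_mb | through_me                   # wrap-around mask


def rlwnm(rS, rB, mb, me):
    """Rotate rS left by the low 5 bits of rB, then AND with the mask."""
    return rotl32(rS, rB & 31) & mask(mb, me)


def rotlw(rS, rB):
    return rlwnm(rS, rB, 0, 31)   # full mask: a plain rotate
```

For instance, `rlwnm(0x12345678, 8, 24, 31)` rotates the top byte into the low byte and keeps only that byte, giving `0x12`.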

30
PowerPC Instruction Set (contd)
  • Compare instructions
  • Two versions
  • For signed and unsigned
  • Two formats
  • Register and immediate
  • Register compare
  • cmp crfD,rA,rB
  • Updates LT (rA < rB), GT (rA > rB), EQ, SO bits
    in the crfD
  • If crfD is not specified, CR0 is used
  • Immediate version
  • cmpi crfD,rA,Simm16
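How a compare fills one 4-bit CR field can be sketched as below. The helper names are hypothetical; copying the SO bit from XER and placing CR0 in the most significant four bits of the CR follow the usual PowerPC conventions and are assumptions beyond what the slide states.

```python
MASK32 = 0xFFFFFFFF


def compare_bits(a, b, xer_so=0):
    """Return the 4-bit field LT,GT,EQ,SO (LT is the leftmost bit)."""
    lt, gt, eq = int(a < b), int(a > b), int(a == b)
    return (lt << 3) | (gt << 2) | (eq << 1) | xer_so


def cmp_signed(rA, rB, xer_so=0):
    return compare_bits(rA, rB, xer_so)


def cmp_unsigned(rA, rB, xer_so=0):
    return compare_bits(rA & MASK32, rB & MASK32, xer_so)


def set_cr_field(cr, crfD, nibble):
    """Write a 4-bit result into CR field crfD (field 0 is the top nibble)."""
    shift = 4 * (7 - crfD)
    return (cr & ~(0xF << shift) & MASK32) | (nibble << shift)
```

The same bit pattern compares differently in the two versions: `cmp_signed(-1, 1)` sets LT, while `cmp_unsigned(-1, 1)` treats -1 as `0xFFFFFFFF` and sets GT.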

31
PowerPC Instruction Set (contd)
  • Branch Instructions
  • Used for both branch (LK = 0) and procedure calls
    (LK = 1)
  • Can use absolute (AA = 1) or relative address
    (AA = 0)
  • b target (AA=0, LK=0) Branch
  • ba target (AA=1, LK=0) Branch Absolute
  • bl target (AA=0, LK=1) Branch then link
  • bla target (AA=1, LK=1) Branch Absolute then
    link
  • The last two are procedure calls
  • Three types of conditional branches
  • Direct address
  • Register indirect
  • CTR or LR

32
PowerPC Instruction Set (contd)
  • Conditional branch instructions (direct address)
  • bc BO,BI,target (AA=0, LK=0)
  • Branch Conditional
  • bca BO,BI,target (AA=1, LK=0)
  • Branch Conditional Absolute
  • bcl BO,BI,target (AA=0, LK=1)
  • Branch Conditional then link
  • bcla BO,BI,target (AA=1, LK=1)
  • Branch Conditional Absolute then link
  • BO: branch options (5 bits), specifies the branch
    condition
  • BI: branch input (5 bits), specifies a bit in
    the CR

33
PowerPC Instruction Set (contd)
  • Nine different branch conditions can be specified
  • Decrement CTR; branch if CTR ≠ 0 AND cond false
  • Decrement CTR; branch if CTR = 0 AND cond false
  • Decrement CTR; branch if CTR ≠ 0 AND cond true
  • Decrement CTR; branch if CTR = 0 AND cond true
  • Branch if cond false
  • Branch if cond true
  • Decrement CTR; branch if CTR ≠ 0
  • Decrement CTR; branch if CTR = 0
  • Branch always
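The nine conditions fall out of how the 5-bit BO field is decoded. The sketch below uses the standard PowerPC meaning of the BO bits (the slide lists only the resulting conditions, so the bit assignments are an assumption here), ignores the last bit, which is a prediction hint, and models CTR as a 32-bit value.

```python
def bc_taken(bo, cond_bit, ctr):
    """Return (taken, new_ctr) for a 5-bit BO value.

    cond_bit is the CR bit selected by BI (0 or 1); ctr is the count register.
    """
    ignore_cond = (bo >> 4) & 1     # BO bit 0: 1 = do not test the CR bit
    want_true = (bo >> 3) & 1       # BO bit 1: value the CR bit must have
    no_ctr = (bo >> 2) & 1          # BO bit 2: 1 = do not decrement CTR
    want_ctr_zero = (bo >> 1) & 1   # BO bit 3: 1 = branch if CTR = 0
    if not no_ctr:
        ctr = (ctr - 1) & 0xFFFFFFFF
    ctr_ok = bool(no_ctr) or ((ctr == 0) == bool(want_ctr_zero))
    cond_ok = bool(ignore_cond) or (cond_bit == want_true)
    return ctr_ok and cond_ok, ctr
```

For example, `bc_taken(0b10000, 0, 2)` models "decrement CTR; branch if CTR ≠ 0" and returns `(True, 1)`, while `bc_taken(0b10100, 0, 7)` is "branch always" and leaves CTR untouched.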

34
PowerPC Instruction Set (contd)
  • LR-based branch instructions
  • bclr BO,BI (LK=0)
  • Branch Conditional to Link Register
  • bclrl BO,BI (LK=1)
  • Branch Conditional to Link Register then Link
  • Target address is taken from LR
  • Used to return from procedure calls
  • CTR-based branch instructions
  • bcctr BO,BI (LK=0)
  • bcctrl BO,BI (LK=1)
  • CTR instead of LR is used to get target

35
Itanium
  • Intel's 64-bit processor
  • RISC based
  • Based on EPIC design philosophy
  • Explicitly Parallel Instruction Computing
  • Support for ILP
  • 3-instruction wide word
  • Speculative computation
  • Hides memory latency
  • Predication
  • Improves branch handling
  • Large number of registers
  • 128 integer and 128 FP
  • Aids in efficient procedure calls

36
Itanium (contd)
37
Itanium (contd)
  • Registers
  • 128 general-purpose registers (gr0 - gr127)
  • 64-bit wide
  • NaT (Not-a-Thing) bit
  • Used in speculative loading
  • Divided into static and stacked
  • Static
  • First 32 registers (gr0 - gr31)
  • gr0 is read-only (always provides zero)
  • Stacked
  • Available for programs
  • Used as register stack frame

38
Itanium (contd)
  • Registers
  • Branch registers
  • 8 in total (br0 - br7)
  • 64-bit wide
  • Specify target address for
  • Conditional branches
  • Procedure calls
  • Return
  • User mask register
  • Alignment, byte ordering, ...
  • Other registers
  • Predicate register, Application registers,
    Current frame marker

39
Itanium (contd)
  • Addressing modes
  • Load/store instructions can access memory
  • Specify three registers r1, r2, r3
  • r2 and r3 are used to compute effective address
  • r1 receives/supplies data
  • Register indirect addressing
  • Effective address = contents of r3
  • Register indirect with immediate addressing
  • Effective address = contents of r3 + imm9
  • r3 = Effective address
  • Register indirect with index addressing
  • Effective address = contents of r3 + contents of
    r2
  • r3 = Effective address

40
Itanium (contd)
  • Instruction Format
  • (qp) mnemonic.comp dests = srcs
  • qp: qualifying predicate
  • Specifies a predicate register
  • 64 1-bit registers
  • Executed if the specified PR is 1
  • Otherwise, instruction is treated as NOP
  • mnemonic
  • Identifies an instruction (e.g., compare)
  • comp
  • Gives more information to completely specify
    instruction
  • E.g., Type of comparison is equality

41
Itanium (contd)
42
Itanium (contd)
43
Itanium (contd)
  • Examples
  • add r1 = r2,r3
  • Predicated instruction
  • (p4) add r1 = r2,r3
  • add r1 = r2,r3,1
  • Compare instructions
  • cmp.eq p3 = r2,r4
  • cmp.gt p2,p3 = r3,r4
  • Branch instruction
  • br.cloop.sptk loop_back

44
Instruction-level Parallelism
  • Itanium provides
  • Runtime support for explicit parallelism
  • Compiler/assembler can indicate parallelism
  • Instruction groups
  • Large number of registers
  • Instruction groups
  • Set of instructions that do not have conflicting
    dependencies
  • Can be executed in parallel
  • Compiler/assembler can indicate this by the ;;
    notation

45
Instruction-level Parallelism
  • Example: Logical expression with four terms
  • if (r10 || r11 || r12 || r13)
  • { /* if-block code */ }
  • can be done using or-tree evaluation
  • or r1 = r10,r11 /* Group 1 */
  • or r2 = r12,r13 ;;
  • or r3 = r1,r2 ;; /* Group 2 */
  • Other instructions /* Group 3 */
  • Processor can execute as many instructions from
    group as it can
  • Depends on the available resources

46
Itanium Instruction Bundle
  • Each instruction is encoded using 41 bits
  • Three instructions are bundled together
  • 128-bit Instruction bundle
  • No conflicting dependencies among the three
    instructions
  • Aids in instruction-level parallelism
  • 5-bit template
  • Specifies mapping of instruction slots to
    execution unit types
  • Six instruction types
  • Integer ALU, non-ALU integer, memory, branch, FP,
    extended
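The bundle arithmetic (3 × 41 + 5 = 128 bits) can be made concrete with a pack/unpack sketch. Placing the template in the five least significant bits followed by slots 0, 1, and 2 is the layout assumed here; the slot values in the usage note are arbitrary placeholders, not real instruction encodings.

```python
SLOT_BITS = 41
TEMPLATE_BITS = 5
SLOT_MASK = (1 << SLOT_BITS) - 1


def pack_bundle(template, slot0, slot1, slot2):
    """Combine a 5-bit template and three 41-bit slots into one 128-bit integer."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template must fit in 5 bits")
    for slot in (slot0, slot1, slot2):
        if not 0 <= slot <= SLOT_MASK:
            raise ValueError("each instruction slot must fit in 41 bits")
    return (template
            | (slot0 << TEMPLATE_BITS)
            | (slot1 << (TEMPLATE_BITS + SLOT_BITS))
            | (slot2 << (TEMPLATE_BITS + 2 * SLOT_BITS)))


def unpack_bundle(bundle):
    """Recover (template, slot0, slot1, slot2) from a 128-bit bundle."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slot0 = (bundle >> TEMPLATE_BITS) & SLOT_MASK
    slot1 = (bundle >> (TEMPLATE_BITS + SLOT_BITS)) & SLOT_MASK
    slot2 = (bundle >> (TEMPLATE_BITS + 2 * SLOT_BITS)) & SLOT_MASK
    return template, slot0, slot1, slot2
```

A round trip such as `unpack_bundle(pack_bundle(0x10, 1, 2, 3))` returns `(0x10, 1, 2, 3)`, and the largest possible bundle occupies exactly 128 bits.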

47
Itanium Instructions
  • Data transfer instructions
  • Load and store instructions are more complicated
    than a typical RISC processor
  • Load instructions
  • (qp) ldSZ.ldtype.ldhint r1 = [r3]
  • (qp) ldSZ.ldtype.ldhint r1 = [r3],r2
  • (qp) ldSZ.ldtype.ldhint r1 = [r3],imm9
  • Loads SZ bytes from memory
  • SZ can be 1, 2, 4, or 8 to load 1, 2, 4, or 8
    bytes
  • Example
  • ld8 r5 = [r6]

48
Itanium Instructions (contd)
  • ldtype
  • This completer can be used to specify special
    load operations
  • Advanced
  • ld8.a r5 = [r6]
  • Speculative
  • ld8.s r5 = [r6]
  • ldhint
  • Locality of memory access
  • None: temporal locality, level 1
  • nt1: no temporal locality, level 1
  • nta: no temporal locality, all levels

49
Itanium Instructions (contd)
  • Store instructions
  • Simpler than load instructions
  • (qp) stSZ.sttype.sthint [r3] = r1
  • (qp) stSZ.sttype.sthint [r3] = r1,imm9
  • Move instructions
  • (qp) mov r1 = r3
  • (qp) mov r1 = imm22
  • (qp) mov r1 = imm64
  • First two are pseudo-instructions
  • Implemented using other processor instructions

50
Itanium Instructions (contd)
  • Arithmetic instructions
  • Simpler than load instructions
  • (qp) add r1 = r2,r3
  • (qp) add r1 = r2,r3,1
  • (qp) add r1 = imm,r4
  • Move instruction
  • (qp) mov r1 = r3
  • implemented as
  • (qp) add r1 = 0,r3
  • Move instruction
  • (qp) mov r1 = imm22
  • implemented as
  • (qp) add r1 = imm22,r0

(In the add r1 = imm,r4 form, imm can be imm14 or imm22.)
51
Itanium Instructions (contd)
  • Similar instructions for subtraction
  • Shift-add
  • (qp) shladd r1 = r2,count,r3
  • Before adding, r2 is left-shifted by count bit
    positions
  • Integer multiply is realized using the xma
    instruction and floating-point registers
  • No divide instruction
  • Done in software

52
Itanium Instructions (contd)
  • Logical instructions
  • AND
  • OR
  • XOR
  • No NOT operation
  • Can use and-complement (andcm)
  • Complements one of the operands before ANDing
  • Format
  • (qp) and r1 = r2,r3
  • (qp) and r1 = imm8,r3

53
Itanium Instructions (contd)
  • Shift instructions
  • Left-shift
  • Right-shift
  • Format
  • (qp) shl r1 = r2,r3
  • (qp) shl r1 = r2,count
  • Right-shift
  • (qp) shr r1 = r2,r3 (signed version)
  • (qp) shr.u r1 = r2,r3 (unsigned version)

54
Itanium Instructions (contd)
  • Compare instructions
  • Format
  • (qp) cmp.crel.ctype p1,p2 = r2,r3
  • (qp) cmp.crel.ctype p1,p2 = imm8,r3
  • crel: type of comparison
  • Comparison   signed   unsigned
  • <            lt       ltu
  • <=           le       leu
  • >            gt       gtu
  • >=           ge       geu
  • =            eq       eq

55
Itanium Instructions (contd)
  • ctype Specifies how the two predicate registers
    are to be updated
  • Default
  • Comparison result in p1 and its complement in p2
  • or-type
  • p1 and p2 are set to 1 only if the comparison
    result is 1
  • Otherwise, p1 and p2 are not altered
  • Useful in OR-type simultaneous execution
  • and-type
  • p1 and p2 are set to 0 only if the comparison
    result is 0
  • Otherwise, p1 and p2 are not altered
  • Useful in AND-type simultaneous execution
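The three update rules can be written out as a small function. This is a sketch of the rules exactly as listed on this slide; the function names and the string labels for the compare types are hypothetical.

```python
def cmp_update(ctype, result, p1, p2):
    """Return the new (p1, p2) after a compare whose outcome is `result`."""
    if ctype == "default":
        return int(result), int(not result)    # result and its complement
    if ctype == "or":
        return (1, 1) if result else (p1, p2)  # only ever writes 1
    if ctype == "and":
        return (p1, p2) if result else (0, 0)  # only ever writes 0
    raise ValueError("unknown compare type: " + ctype)


def any_nonzero(values):
    """OR several terms into one predicate using or-type compares."""
    p1 = p2 = 0                      # predicates start cleared
    for v in values:
        p1, p2 = cmp_update("or", v != 0, p1, p2)
    return p1
```

Because an or-type compare either writes 1 or leaves the predicates alone, the order of the compares does not matter, which is what allows them to be placed in the same instruction group; the loop here is sequential only because it is a simulation.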

56
Itanium Instructions (contd)
  • Branch instructions
  • Used for jump as well as procedure calls
  • Supports both direct and indirect branching
  • All direct branches are IP-relative
  • IP-relative form
  • (qp) br.btype.bwh.ph.dh target25 (basic form)
  • (qp) br.btype.bwh.ph.dh b1 = target25 (call form)
  • br.btype.bwh.ph.dh target25 (counted loop form)

57
Itanium Instructions (contd)
  • Indirect form
  • (qp) br.btype.bwh.ph.dh b2 (basic form)
  • (qp) br.btype.bwh.ph.dh b1 = b2 (call form)
  • btype: type of branch
  • cond or none (for basic form)
  • Branch is taken if qp is 1; otherwise it is not
  • To invoke a procedure
  • Use the call form with btype set to call
  • Turns branch into a conditional procedure call
  • Procedure is invoked only if qp is 1
  • Return address is saved in b1 branch register

58
Itanium Instructions (contd)
  • Counted loop version
  • Set btype to cloop
  • Loop count is in application register ar65
  • If ar65 is not zero, it is decremented and the
    branch is taken
  • RET version
  • Use btype ret
  • Should use the indirect form and specify the
    branch register that has the return address
  • Example 1 Conditional skip
  • (p3) br skip or
  • (p3) br.cond skip

59
Itanium Instructions (contd)
  • Example 2: Loop iterates 100 times
  • mov lc = 99 // br.cloop branches back while lc is nonzero, so the body runs lc + 1 times
  • loop_back:
  • . . .
  • br.cloop loop_back
  • Example 3: Procedure call to sum
  • (p0) br.call br2 = sum
  • Example 4: Return from a procedure
  • (p0) br.ret br2

60
Handling Branches
  • Three techniques
  • Branch elimination
  • Eliminate branches
  • Best way to handle branches is not to have
    branches
  • Possible to eliminate some types of branches
  • Branch speedup
  • Reduce the delay associated with branches
  • Reorder instructions
  • Speculative execution
  • Branch prediction
  • Discussed before (see Chapter 8)

61
Handling Branches (contd)
  • Branch elimination in Itanium
  • Can be done using predication
  • if (R1 == R2)
  •   R3 = R3 + R1
  • else
  •   R3 = R3 - R1

Code with branches:
    cmp r1,r2
    je equal
    sub r3,r1
    jmp next
    equal: add r3,r1
    next:
Itanium code with predication:
    cmp.eq p1,p2 = r1,r2
    (p1) add r3 = r3,r1
    (p2) sub r3 = r3,r1
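The effect of predication on this if/else can be sketched by computing both results and letting the predicates select which one is committed. The function names are hypothetical, and the arithmetic select is only a software stand-in for "the instruction is treated as a NOP when its predicate is 0".

```python
def with_branches(r1, r2, r3):
    """Conventional version: control flow decides which instruction runs."""
    if r1 == r2:
        r3 = r3 + r1
    else:
        r3 = r3 - r1
    return r3


def with_predication(r1, r2, r3):
    """Predicated version: both instructions are issued, one takes effect."""
    p1 = int(r1 == r2)       # cmp.eq p1,p2 = r1,r2
    p2 = 1 - p1              # p2 receives the complement
    add_result = r3 + r1     # (p1) add r3 = r3,r1
    sub_result = r3 - r1     # (p2) sub r3 = r3,r1
    # Exactly one predicate is 1, so exactly one result is committed.
    return p1 * add_result + p2 * sub_result
```

Both functions return the same value for any inputs, for example `14` for `(4, 4, 10)` and `6` for `(4, 5, 10)`, but the predicated version contains no branch to mispredict.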
62
Handling Branches (contd)
  • switch (r6) {
  •   case 1: r2 = r3 + r4; break;
  •   case 2: r2 = r3 - r4; break;
  •   case 3: r2 = r3 + r5; break;
  •   case 4: r2 = r3 - r5; break;
  • }
  • cmp.eq p1,p0 = r6,1
  • cmp.eq p2,p0 = r6,2
  • cmp.eq p3,p0 = r6,3
  • cmp.eq p4,p0 = r6,4
  • (p1) add r2 = r3,r4
  • (p2) sub r2 = r3,r4
  • (p3) add r2 = r3,r5
  • (p4) sub r2 = r3,r5

63
Speculative Execution
  • Instructions are executed in expectation that
    they will be needed
  • Keeps pipeline full
  • Masks memory latency
  • Itanium supports two types
  • Handles data dependencies
  • Data dependencies are discussed in Chapter 8
  • Handles control dependencies
  • Both are compiler optimizations
  • Reorders instructions

64
Speculative Execution (contd)
  • Data speculation

Before:
    sub r6 = r7,r8      //cycle 1
    sub r9 = r10,r6     //cycle 2
    ld8 r4 = [r5]
    add r11 = r12,r4    //cycle 4
After moving the load up:
    ld8 r4 = [r5]       //cycle 1
    sub r6 = r7,r8
    sub r9 = r10,r6     //cycle 2
    add r11 = r12,r4    //cycle 3
65
Speculative Execution (contd)
  • Ambiguous dependency between first st8 and ld8

    sub r6 = r7,r8      //cycle 1
    st8 [r9] = r6       //cycle 2
    ld8 r4 = [r5]
    add r11 = r12,r4    //cycle 4
    st8 [r10] = r11     //cycle 5
66
Speculative Execution (contd)
  • We can move such load instructions using advanced
    load (ld.a) and check load (ld.c)

    ld8.a r4 = [r5]     //cycle 0 or earlier
    . . .
    sub r6 = r7,r8      //cycle 1
    st8 [r9] = r6       //cycle 2
    ld8.c r4 = [r5]
    add r11 = r12,r4
    st8 [r10] = r11     //cycle 3
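The interplay between an advanced load, an intervening store, and the check load can be sketched with a toy tracking table. This is a heavy simplification made for illustration: the class, its method names, and the dictionary-based memory are assumptions, and the real hardware table that tracks advanced loads is more involved.

```python
class AdvancedLoadTable:
    """Toy tracker: remembers which address each advanced load read from."""

    def __init__(self):
        self.entries = {}   # target register name -> address

    def advanced_load(self, mem, regs, reg, addr):
        """ld.a: load early and record the address for later checking."""
        regs[reg] = mem[addr]
        self.entries[reg] = addr

    def store(self, mem, addr, value):
        """st: write memory and drop any entry that matches the address."""
        mem[addr] = value
        self.entries = {r: a for r, a in self.entries.items() if a != addr}

    def check_load(self, mem, regs, reg, addr):
        """ld.c: reload only if the entry was invalidated; report whether it did."""
        if reg in self.entries:
            return False
        regs[reg] = mem[addr]
        return True


def run(store_addr):
    """Replay the slide's sequence with the store aimed at store_addr."""
    mem, regs = {100: 10, 200: 0}, {}
    table = AdvancedLoadTable()
    table.advanced_load(mem, regs, "r4", 100)     # ld8.a r4 = [r5]
    table.store(mem, store_addr, 99)              # st8 [r9] = r6
    reloaded = table.check_load(mem, regs, "r4", 100)   # ld8.c r4 = [r5]
    return regs["r4"], reloaded
```

`run(200)` returns `(10, False)`: the store went elsewhere, so the early value is still good. `run(100)` returns `(99, True)`: the store hit the same address, so the check load fetches the new value.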
67
Speculative Execution (contd)
  • Further improvement with advanced load check (chk.a)

    ld8.a r4 = [r5]     //cycle -1 or earlier
    . . .
    add r11 = r12,r4    //cycle 1
    sub r6 = r7,r8
    st8 [r9] = r6       //cycle 2
    chk.a r4,recover
    back: st8 [r10] = r11
    recover:
    ld8 r4 = [r5]       // reload
    add r11 = r12,r4    // reexecute add
    br back             // jump back
68
Speculative Execution (contd)
  • Control speculation
  • To reduce the impact of long-latency instructions
    such as loads, move them earlier in the code

    cmp.eq p1,p0 = r10,10   //cycle 0
    (p1) br.cond skip       //cycle 0
    ld8 r1 = [r2]           //cycle 1
    add r3 = r1,r4          //cycle 3
    skip:                   // other instructions
The ld8 cannot be advanced because of the branch
69
Speculative Execution (contd)
    ld8.s r1 = [r2]         //cycle -2 or earlier
    // other instructions
    cmp.eq p1,p0 = r10,10   //cycle 0
    (p1) br.cond skip       //cycle 0
    chk.s r1,recovery       //cycle 0
    add r3 = r1,r4          //cycle 0
    skip:                   // other instructions
    recovery:
    ld8 r1 = [r2]
    br skip
Speculative check chk.s allows us to advance ld8
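The role of the NaT marker in control speculation can be sketched as a deferred exception. In this illustration a missing dictionary key stands in for a faulting address, the `NAT` sentinel stands in for the NaT bit, and, as an assumption of the sketch, the add is re-executed after the recovery load.

```python
NAT = object()   # sentinel standing in for a register whose NaT bit is set


def speculative_load(mem, addr):
    """ld8.s: return the value, or NAT instead of faulting on a bad address."""
    return mem.get(addr, NAT)


def run(mem, r2, r4, r10):
    r1 = speculative_load(mem, r2)   # ld8.s r1 = [r2], hoisted above the branch
    if r10 == 10:                    # cmp.eq p1,p0 = r10,10 / (p1) br.cond skip
        return None                  # skipped: the loaded value is never used
    if r1 is NAT:                    # chk.s r1,recovery
        r1 = mem[r2]                 # recovery: ordinary ld8, faults for real
    return r1 + r4                   # add r3 = r1,r4
```

With `mem = {8: 5}`, `run(mem, 8, 1, 0)` returns `6`. `run(mem, 999, 1, 10)` returns `None` without raising, because the bad load was only speculative and its result was never needed. `run(mem, 999, 1, 0)` raises `KeyError`, modelling the fault being taken only once the value is actually required.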
70
Branch Prediction
  • Branch hints
  • bwh completer (branch whether hint)
  • spnt: static branch not taken
  • sptk: static branch taken
  • dpnt: dynamic branch not taken
  • dptk: dynamic branch taken
  • Prefetch hint (ph)
  • Hint about sequential prefetch
  • few or many
  • Deallocation hint (dh)
  • Specifies whether branch cache should be cleared
  • clr indicates deallocation
