RISC Processors


Slides: 71

Provided by: S316


1
RISC Processors

Chapter 14
S. Dandamudi

2
Outline

Introduction
Evolution of CISC processors
RISC design principles
PowerPC processor
Architecture
Addressing modes
Instruction set

Itanium processor
Architecture
Addressing modes
Instruction set
Instruction-level parallelism
Branch handling
Speculative execution

3
Introduction

CISC
Complex instruction set
Pentium is the most popular example
RISC
Simple instructions
Reduced complexity
Modern processors use this design philosophy
PowerPC, MIPS, SPARC, Intel Itanium
Borrow some features from CISC
No precise definition
We can identify some common characteristics

4
Evolution of CISC Designs

Motivation to efficiently use expensive resources
Processor
Memory
High density code
Complex instructions
Hardware complexity is handled by
microprogramming
Microprogramming is also helpful to
Reduce the impact of memory access latency
Offers flexibility
Low-cost members of the same family
Tailored to high-level language constructs

5
Evolution of CISC Designs (contd)
6
Evolution of CISC Designs (contd)

Example
Autoincrement addressing mode of VAX
Performs the following actions
(R2) = (R2) + R3; R2 = R2 + 1
RISC equivalent
R4 = (R2)
R4 = R4 + R3
(R2) = R4
R2 = R2 + 1
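
The equivalence of the two sequences above can be checked with a small simulation. This is only a sketch: the dictionary-based register file and memory, the function names, and the sample values are assumptions made for illustration, not part of the original slides.

```python
def cisc_autoincrement_add(regs, mem):
    """One complex instruction: (R2) = (R2) + R3, then R2 = R2 + 1."""
    mem[regs["R2"]] = mem[regs["R2"]] + regs["R3"]
    regs["R2"] += 1


def risc_sequence(regs, mem):
    """The same work as four simple instructions."""
    regs["R4"] = mem[regs["R2"]]          # R4 = (R2)      load
    regs["R4"] = regs["R4"] + regs["R3"]  # R4 = R4 + R3   register-to-register add
    mem[regs["R2"]] = regs["R4"]          # (R2) = R4      store
    regs["R2"] = regs["R2"] + 1           # R2 = R2 + 1    pointer update


def run(step):
    """Run one variant on fresh (hypothetical) state; return (R2, memory word)."""
    regs = {"R2": 100, "R3": 7, "R4": 0}
    mem = {100: 35}
    step(regs, mem)
    return regs["R2"], mem[100]
```

Both variants leave R2 and the memory word in the same state; the RISC version needs a scratch register (R4) and touches memory only in its load and store.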

7
Why RISC?

Simple instructions are preferred
Complex instructions are mostly ignored by
compilers
Due to semantic gap
Simple data structures
Complex data structures are used relatively
infrequently
Better to support a few simple data types
efficiently
Synthesize complex ones
Simple addressing modes
Complex addressing modes lead to variable length
instructions
Lead to inefficient instruction decoding and
scheduling

8
Why RISC? (contd)

Large register set
Efficient support for procedure calls and returns
Patterson and Séquin's study
Procedure call/return: 12-15% of HLL statements
Constitute 31-33% of machine language instructions
Generate nearly half (45%) of memory references
Small activation record
Tanenbaum's study
Only 1.25% of the calls have more than 6 arguments
More than 93% have fewer than 6 local scalar variables
Large register set can avoid memory references

9
RISC Design Principles

Simple operations
Simple instructions that can execute in one cycle
Register-to-register operations
Only load and store operations access memory
Rest of the operations on a register-to-register
basis
Simple addressing modes
A few addressing modes (1 or 2)
Large number of registers
Needed to support register-to-register operations
Minimize the procedure call and return overhead

10
RISC Design Principles (contd)
Register windows storing activation records
11
RISC Design Principles (contd)

Fixed-length instructions
Facilitates efficient instruction execution
Simple instruction format
Fixed boundaries for various fields
opcode, source operands, ...
Other features
Tend to use Harvard architecture
Pipelining is visible at the architecture level

12
PowerPC

Registers
32 general-purpose registers (GPR0-GPR31)
32 floating-point registers (FPR0-FPR31)
Condition register (CR)
Similar to Pentium's flags register
Divided into 8 CR fields (4 bits each)
less than (LT), greater than (GT), equal to (EQ), summary overflow (SO)
CR1 is for floating-point exceptions
Other CR fields can be used for integer or FP
exceptions
Branch instructions can test a specific CR field
bit
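
As an illustration of how a branch instruction can pick out one CR bit, here is a minimal sketch. It assumes the PowerPC convention that bit 0 is the most significant bit, so CR0 occupies the top four bits of the 32-bit CR; the function names are hypothetical.

```python
FLAG_NAMES = ("LT", "GT", "EQ", "SO")  # order of the bits inside one CR field


def cr_field(cr, n):
    """Return the four flags of CR field n (0..7) as a dict of name -> 0/1."""
    shift = 28 - 4 * n                 # CR0 sits in the most significant nibble
    nibble = (cr >> shift) & 0xF
    return {name: (nibble >> (3 - i)) & 1 for i, name in enumerate(FLAG_NAMES)}


def cr_bit(cr, bit_index):
    """Return CR bit bit_index (0..31), counting from the most significant bit."""
    return (cr >> (31 - bit_index)) & 1
```

For example, with cr = 0x40000002 the GT bit of CR0 (bit 1) and the EQ bit of CR7 (bit 30) are set.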

13
PowerPC (contd)
14
PowerPC (contd)

XER register serves two distinct purposes
Bits 0, 1, and 2 are used to capture
Summary overflow (SO), overflow (OV), carry (CA)
OV and CA are similar to Pentium's overflow and carry
Once set, SO can be cleared only by a special instruction
Bits 25 to 31 (7 bits)
Specifies the number of bytes to be transferred
between memory and registers
Two instructions
Load string word indexed (lswx)
Store string word indexed (stswx)
Can load/store all 32 registers (GPR0-GPR31)

15
PowerPC (contd)

Link register (LR)
Used to store the procedure return address
Stores the effective address of the instruction
following the procedure call instruction
Procedure calls use the branch instructions
Example: b (branch), bl (procedure call)
Count register (CTR)
Maintains loop count value
Similar to Pentium's ECX register
Branch instructions can test the value
32-bit PowerPC implementations use segmentation
like the Pentium

16
PowerPC (contd)

Addressing modes
Load/store instructions support three addressing
modes
Can use GPRs
Register Indirect
Effective address = contents of rA or 0
Specifying 0 generates address 0
Register Indirect with Immediate Index
Effective address = (contents of rA or 0) + imm16
Register Indirect with Index
Effective address = (contents of rA or 0) + contents of rB
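
The three effective-address calculations can be sketched as follows. This is an illustrative model only: the list-based register file, the function names, and the 32-bit wraparound are assumptions, and imm16 is treated as a signed 16-bit displacement.

```python
MASK32 = 0xFFFFFFFF


def base(gpr, ra):
    """'rA or 0': an rA field of 0 contributes the value 0, not the contents of GPR0."""
    return 0 if ra == 0 else gpr[ra]


def ea_register_indirect(gpr, ra):
    return base(gpr, ra) & MASK32


def ea_immediate_index(gpr, ra, imm16):
    if not -0x8000 <= imm16 <= 0x7FFF:
        raise ValueError("imm16 must fit in a signed 16-bit field")
    return (base(gpr, ra) + imm16) & MASK32


def ea_index(gpr, ra, rb):
    return (base(gpr, ra) + gpr[rb]) & MASK32
```

Note that only the base register slot gets the "or 0" treatment; the index register rB always supplies its contents.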

17
PowerPC (contd)
Instruction format
18
PowerPC (contd)

Bits 0-5
Specify primary opcode
Other fields specify suboperations
Depends on instruction type
AA bit
1 (use absolute address)
0 (use relative address)
LK bit
0 (no link --- branch)
1 (link --- turns branch into a procedure call)

19
PowerPC Instruction Set

Data Transfer instructions
Byte loads
lbz rD,disp(rA)    Load byte and zero
lbzu rD,disp(rA)   Load byte and zero with update
Effective address = contents of rA + disp
lbzx rD,rA,rB      Load byte and zero indexed
lbzux rD,rA,rB     Load byte and zero with update indexed
Effective address = contents of rA + contents of rB
Upper three bytes of rD are zeroed
Update versions: rA ← effective address

20
PowerPC Instruction Set (contd)

Similar instructions for halfword and word loads
lhz, lhzu, lhzx, lhzux
lwz, lwzu, lwzx, lwzux
For halfword loads, sign extension is possible
lha, lhau, lhax, lhaux
Multiword load
lmw rD,disp(rA)
Loads n consecutive words at EA to registers rD, ..., r31

21
PowerPC Instruction Set (contd)

Similar instructions for store
stb, stbu, stbx, stbux
sth, sthu, sthx, sthux
stw, stwu, stwx, stwux
Multiword store
stmw rS,disp(rA)
Stores n consecutive words from registers rS, ..., r31 to memory at EA

22
PowerPC Instruction Set (contd)

Arithmetic Instructions
Add instructions
add rD,rA,rB    rD ← rA + rB
Status and overflow bits of CR0 and XER are not
altered
add. rD,rA,rB alters LT,GT,EQ,SO of CR0
addo rD,rA,rB alters SO,OV of XER
addo. rD,rA,rB alters LT,GT,EQ,SO of CR0
and SO,OV of XER
These four instructions do not alter the CA bit
of XER

23
PowerPC Instruction Set (contd)

To alter CA bit, use
addc rD,rA,rB (add carrying: rD ← rA + rB)
adde rD,rA,rB (add extended: rD ← rA + rB + CA)
To alter the other bits, use
addc., addco, addco.
adde., addeo, addeo.
Immediate operand version
addi rD,rA,Simm16
We can use addi to implement other instructions
li rD,value as addi rD,0,value
la rD,disp(rA) as addi rD,rA,disp
subi rD,rA,value as addi rD,rA,-value

24
PowerPC Instruction Set (contd)

Subtract instructions
subf rD,rA,rB    rD ← rB - rA
subf = subtract from
Like add, other forms are available
subf., subfo, subfo.
Negate instruction
neg rD,rA    rD ← 0 - rA

25
PowerPC Instruction Set (contd)

Multiply instructions
Two instructions to get upper and lower 32 bits
of the 64-bit result
mullw rD,rA,rB signed/unsigned multiply
Stores the lower-order 32 bits of the result
Use the following to get the upper 32 bits
mulhw rD,rA,rB signed
mulhwu rD,rA,rB unsigned
Immediate form
mulli rD,rA,Simm16
Stores only lower 32 bits of the 48-bit result

26
PowerPC Instruction Set (contd)

Divide instructions
Two divide instructions
Signed (divw)
divw rD,rA,rB    rD = rA/rB
Unsigned (divwu)
Both give only quotient
For quotient and remainder, use
divw rD,rA,rB quotient in rD
mullw rX,rD,rB
subf rC,rX,rA remainder in rC
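
The three-instruction remainder sequence above can be traced with a small model. This is a sketch under stated assumptions: the helper names mirror the mnemonics but are ordinary Python functions, the quotient is assumed to truncate toward zero, and the divide-by-zero and overflow cases are not modeled.

```python
def to_signed32(x):
    """Wrap an integer to a signed 32-bit value."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x & 0x80000000 else x


def divw(ra, rb):
    """Signed divide; quotient truncated toward zero (rb != 0 assumed)."""
    q = abs(ra) // abs(rb)
    return to_signed32(q if (ra < 0) == (rb < 0) else -q)


def mullw(ra, rb):
    """Low-order 32 bits of the product."""
    return to_signed32(ra * rb)


def subf(ra, rb):
    """Subtract from: rb - ra."""
    return to_signed32(rb - ra)


def quotient_and_remainder(ra, rb):
    rd = divw(ra, rb)    # divw  rD,rA,rB   quotient in rD
    rx = mullw(rd, rb)   # mullw rX,rD,rB
    rc = subf(rx, ra)    # subf  rC,rX,rA   remainder in rC
    return rd, rc
```

With a truncating quotient the remainder takes the sign of the dividend, so dividing -17 by 5 gives quotient -3 and remainder -2.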

27
PowerPC Instruction Set (contd)

Logical instructions
and rD,rS,rB          and. rD,rS,rB
andi. rD,rS,Uimm16    andis. rD,rS,Uimm16
andc rD,rS,rB         andc. rD,rS,rB
andis. left-shifts Uimm16 by 16 bit positions before ANDing
andc complements rB before ANDing
Dot versions update the LT, GT, EQ, SO bits of CR0
Logical OR also has these six versions
Move register instruction is implemented using OR
mr rA,rS is equivalent to or rA,rS,rS
NOP is implemented as
ori 0,0,0

28
PowerPC Instruction Set (contd)

Other logical operations
NAND
nand
nand.
NOR
nor
nor.
XOR
xor, xor.
xori, xoris
Equivalence (exclusive-NOR)
eqv
eqv.

29
PowerPC Instruction Set (contd)

Shift and Rotate instructions
Shift left
slw rA,rS,rB shift left word
Shift left the word in rS by rB positions and store result in rA
Vacated bits are filled with zeros
Also have the dot version slw.
Shift right
srw srw. (logical)
sraw sraw. (arithmetic)
Rotate left instructions
rlwnm rA,rS,rB,MB,ME
rotlw rA,rS,rB ≡ rlwnm rA,rS,rB,0,31
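
A rotate-and-mask instruction such as rlwnm can be modeled in a few lines. This sketch assumes the usual PowerPC conventions: bit 0 is the most significant bit, the mask has ones from bit MB through bit ME, and the mask wraps around when MB is greater than ME. The function names are illustrative.

```python
MASK32 = 0xFFFFFFFF


def rotl32(x, n):
    """Rotate a 32-bit value left by n (taken modulo 32) positions."""
    n &= 31
    return ((x << n) | (x >> (32 - n))) & MASK32


def mask(mb, me):
    """Ones from bit mb through bit me (bit 0 = MSB); wraps if mb > me."""
    left = MASK32 >> mb                       # ones in bits mb..31
    right = (MASK32 << (31 - me)) & MASK32    # ones in bits 0..me
    return (left & right) if mb <= me else (left | right)


def rlwnm(rs, rb, mb, me):
    """Rotate rs left by the low five bits of rb, then AND with the mask."""
    return rotl32(rs, rb & 31) & mask(mb, me)


def rotlw(rs, rb):
    """Plain rotate: the mask covers all 32 bits."""
    return rlwnm(rs, rb, 0, 31)
```

With MB = 0 and ME = 31 the mask is all ones, which is why rotlw reduces to a plain rotate.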

30
PowerPC Instruction Set (contd)

Compare instructions
Two versions
For signed and unsigned
Two formats
Register and immediate
Register compare
cmp crfD,rA,rB
Updates LT (rA < rB), GT (rA > rB), EQ, SO bits in the crfD
If crfD is not specified, CR0 is used
Immediate version
cmpi crfD,rA,Simm16

31
PowerPC Instruction Set (contd)

Branch Instructions
Used for both branch (LK = 0) and procedure calls (LK = 1)
Can use absolute (AA = 1) or relative address (AA = 0)
b target (AA = 0, LK = 0) Branch
ba target (AA = 1, LK = 0) Branch Absolute
bl target (AA = 0, LK = 1) Branch then link
bla target (AA = 1, LK = 1) Branch Absolute then link
The last two are procedure calls
Three types of conditional branches
Direct address
Register indirect
CTR or LR

32
PowerPC Instruction Set (contd)

Conditional branch instructions (direct address)
bc BO,BI,target (AA = 0, LK = 0)
Branch Conditional
bca BO,BI,target (AA = 1, LK = 0)
Branch Conditional Absolute
bcl BO,BI,target (AA = 0, LK = 1)
Branch Conditional then link
bcla BO,BI,target (AA = 1, LK = 1)
Branch Conditional Absolute then link
BO = branch options (5 bits): specifies branch condition
BI = branch input (5 bits): specifies a bit in the CR

33
PowerPC Instruction Set (contd)

Nine different branch conditions can be specified
Decrement CTR; branch if CTR ≠ 0 AND cond false
Decrement CTR; branch if CTR = 0 AND cond false
Decrement CTR; branch if CTR ≠ 0 AND cond true
Decrement CTR; branch if CTR = 0 AND cond true
Branch if cond false
Branch if cond true
Decrement CTR; branch if CTR ≠ 0
Decrement CTR; branch if CTR = 0
Branch always
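
The nine conditions fall out of four independent control bits in BO. The sketch below assumes the standard PowerPC encoding (BO bit 0, the most significant, disables the condition test; bit 1 is the condition value to match; bit 2 disables the CTR decrement; bit 3 selects CTR = 0 rather than CTR ≠ 0; bit 4 is a prediction hint and is ignored here). The function name and return convention are hypothetical.

```python
def branch_taken(bo, cond_bit, ctr):
    """Evaluate a 5-bit BO field.

    cond_bit is the CR bit selected by BI (0 or 1).
    Returns (taken, new_ctr).
    """
    ignore_cond = (bo >> 4) & 1   # BO bit 0: 1 = do not test the condition
    want_true = (bo >> 3) & 1     # BO bit 1: condition value required
    no_ctr = (bo >> 2) & 1        # BO bit 2: 1 = do not decrement or test CTR
    want_zero = (bo >> 1) & 1     # BO bit 3: 1 = branch if CTR == 0, else CTR != 0

    if not no_ctr:
        ctr = (ctr - 1) & 0xFFFFFFFF
    ctr_ok = bool(no_ctr) or ((ctr == 0) == bool(want_zero))
    cond_ok = bool(ignore_cond) or (cond_bit == want_true)
    return ctr_ok and cond_ok, ctr
```

For instance, BO = 0b10000 gives "decrement CTR; branch if CTR ≠ 0", the pattern used for counted loops, while BO = 0b10100 is "branch always".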

34
PowerPC Instruction Set (contd)

LR-based branch instructions
bclr BO,BI (LK = 0)
Branch Conditional to Link Register
bclrl BO,BI (LK = 1)
Branch Conditional to Link Register then Link
Target address is taken from LR
Used to return from procedure calls
CTR-based branch instructions
bcctr BO,BI (LK = 0)
bcctrl BO,BI (LK = 1)
CTR instead of LR is used to get target

35
Itanium

Intel's 64-bit processor
RISC based
Based on EPIC design philosophy
Explicit Parallel Instruction Computing
Support for ILP
3-instruction wide word
Speculative computation
Hides memory latency
Predication
Improves branch handling
Large number of registers
128 integer and 128 FP
Aids in efficient procedure calls

36
Itanium (contd)
37
Itanium (contd)

Registers
128 general-purpose registers (gr0-gr127)
64-bit wide
NaT (Not-a-Thing) bit
Used in speculative loading
Divided into static and stacked
Static
First 32 registers (gr0-gr31)
gr0 is read-only (always provides zero)
Stacked
Available for programs
Used as register stack frame

38
Itanium (contd)

Registers
Branch registers
8 in total (br0-br7)
64-bit wide
Specify target address for
Conditional branches
Procedure calls
Return
User mask register
Alignment, byte ordering, ...
Other registers
Predicate register, Application registers, ...
Current frame marker

39
Itanium (contd)

Addressing modes
Load/store instructions can access memory
Specify three registers r1, r2, r3
r2 and r3 are used to compute effective address
r1 receives/supplies data
Register indirect addressing
Effective address = contents of r3
Register indirect with immediate addressing
Effective address = contents of r3 + imm9
r3 = Effective address
Register indirect with index addressing
Effective address = contents of r3 + contents of r2
r3 = Effective address

40
Itanium (contd)

Instruction Format
(qp) mnemonic.comp dests = srcs
qp: qualifying predicate
Specifies a predicate register
64 1-bit registers
Executed if the specified PR is 1
Otherwise, instruction is treated as NOP
mnemonic
Identifies an instruction (e.g., compare)
comp
Gives more information to completely specify
instruction
E.g., Type of comparison is equality

41
Itanium (contd)
42
Itanium (contd)
43
Itanium (contd)

Examples
add r1 = r2,r3
Predicate instruction
(p4) add r1 = r2,r3
add r1 = r2,r3,1
Compare instructions
cmp.eq p3 = r2,r4
cmp.gt p2,p3 = r3,r4
Branch instruction
br.cloop.sptk loop_back

44
Instruction-level Parallelism

Itanium provides
Runtime support for explicit parallelism
Compiler/assembler can indicate parallelism
Instruction groups
Large number of registers
Instruction groups
Set of instructions that do not have conflicting
dependencies
Can be executed in parallel
Compiler/assembler can indicate this by the ;; notation

45
Instruction-level Parallelism

Example Logical expression with four terms
if (r10 || r11 || r12 || r13)
/* if-block code */
can be done using or-tree evaluation
or r1 = r10,r11 /* Group 1 */
or r2 = r12,r13 ;;
or r3 = r1,r2 ;; /* Group 2 */
Other instructions /* Group 3 */
Processor can execute as many instructions from
group as it can
Depends on the available resources
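
The or-tree idea generalizes to any number of terms: each level of the tree is one instruction group whose ORs are mutually independent. A minimal sketch follows; the function name and the (result, groups) return convention are assumptions for illustration.

```python
def or_tree(values):
    """Pairwise OR reduction.

    Returns (result, groups), where groups is the number of dependent
    levels, i.e. instruction groups needed when each level runs in parallel.
    """
    level = list(values)
    groups = 0
    while len(level) > 1:
        # All ORs in this list are independent of each other: one group.
        nxt = [level[i] | level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])   # odd element passes through unchanged
        level = nxt
        groups += 1
    return level[0], groups
```

Four terms need two groups, as in the slide's example; a left-to-right chain of ORs would need three dependent steps.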

46
Itanium Instruction Bundle

Each instruction is encoded using 41 bits
Three instructions are bundled together
128-bit Instruction bundle
No conflicting dependencies among the three
instructions
Aids in instruction-level parallelism
5-bit template
Specifies mapping of instruction slots to execution unit types
Six instruction types
Integer ALU, non-ALU integer, memory, branch, FP,
extended
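
The arithmetic behind the bundle size is 5 + 3 × 41 = 128 bits. The sketch below packs and unpacks a bundle as a Python integer; placing the template in the five least significant bits followed by slots 0, 1, and 2 is an assumption about the layout, and the function names are illustrative.

```python
TEMPLATE_BITS = 5
SLOT_BITS = 41
SLOT_MASK = (1 << SLOT_BITS) - 1


def pack_bundle(template, slot0, slot1, slot2):
    """Combine a 5-bit template and three 41-bit instructions into 128 bits."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template must fit in 5 bits")
    for slot in (slot0, slot1, slot2):
        if not 0 <= slot <= SLOT_MASK:
            raise ValueError("each instruction must fit in 41 bits")
    return (template
            | (slot0 << TEMPLATE_BITS)
            | (slot1 << (TEMPLATE_BITS + SLOT_BITS))
            | (slot2 << (TEMPLATE_BITS + 2 * SLOT_BITS)))


def unpack_bundle(bundle):
    """Split a 128-bit bundle back into (template, slot0, slot1, slot2)."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slot0 = (bundle >> TEMPLATE_BITS) & SLOT_MASK
    slot1 = (bundle >> (TEMPLATE_BITS + SLOT_BITS)) & SLOT_MASK
    slot2 = (bundle >> (TEMPLATE_BITS + 2 * SLOT_BITS)) & SLOT_MASK
    return template, slot0, slot1, slot2
```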

47
Itanium Instructions

Data transfer instructions
Load and store instructions are more complicated
than a typical RISC processor
Load instructions
(qp) ldSZ.ldtype.ldhint r1 = [r3]
(qp) ldSZ.ldtype.ldhint r1 = [r3],r2
(qp) ldSZ.ldtype.ldhint r1 = [r3],imm9
Loads SZ bytes from memory
SZ can be 1, 2, 4, or 8 to load 1, 2, 4, or 8 bytes
Example
ld8 r5 = [r6]
ldhint: locality of memory access
ldtype: special load operations (advanced, speculative)
48
Itanium Instructions (contd)

ldtype
This completer can be used to specify special
load operations
Advanced
ld8.a r5 = [r6]
Speculative
ld8.s r5 = [r6]
ldhint
Locality of memory access
None: Temporal locality, level 1
nt1: No temporal locality, level 1
nta: No temporal locality, all levels

49
Itanium Instructions (contd)

Store instructions
Simpler than load instructions
(qp) stSZ.sttype.sthint [r3] = r1
(qp) stSZ.sttype.sthint [r3] = r1,imm9
Move instructions
(qp) mov r1 = r3
(qp) mov r1 = imm22
(qp) mov r1 = imm64
First two are pseudo-instructions
Implemented using other processor instructions

50
Itanium Instructions (contd)

Arithmetic instructions
Simpler than load instructions
(qp) add r1 = r2,r3
(qp) add r1 = r2,r3,1
(qp) add r1 = imm,r4
imm can be imm14 or imm22
Move instruction
(qp) mov r1 = r3
implemented as
(qp) add r1 = 0,r3
Move instruction
(qp) mov r1 = imm22
implemented as
(qp) add r1 = imm22,r0
51
Itanium Instructions (contd)

Similar instructions for subtraction
Shift-add
(qp) shladd r1 = r2,count,r3
Before adding, r2 is left-shifted by count bit positions
Integer multiply is realized using the xma
instruction and floating-point registers
No divide instruction
Done in software

52
Itanium Instructions (contd)

Logical instructions
AND
OR
XOR
No NOT operation
Can use and-complement (andcm)
Complements one of the operands before ANDing
Format
(qp) and r1 = r2,r3
(qp) and r1 = imm8,r3

53
Itanium Instructions (contd)

Shift instructions
Left-shift
Right-shift
Format
(qp) shl r1 = r2,r3
(qp) shl r1 = r2,count
Right-shift
(qp) shr r1 = r2,r3 (signed version)
(qp) shr.u r1 = r2,r3 (unsigned version)

54
Itanium Instructions (contd)

Compare instructions
Format
(qp) cmp.crel.ctype p1,p2 = r2,r3
(qp) cmp.crel.ctype p1,p2 = imm8,r3
crel: Type of comparison
Cmp type   signed   unsigned
<          lt       ltu
≤          le       leu
>          gt       gtu
≥          ge       geu
=          eq       eq

55
Itanium Instructions (contd)

ctype Specifies how the two predicate registers
are to be updated
Default
Comparison result in p1 and its complement in p2
or type
p1 and p2 are set to 1 only if the comparison
result is 1
Otherwise, p1 and p2 are not altered
Useful in OR-type simultaneous execution
and type
p1 and p2 are set to 0 only if the comparison
result is 0
Otherwise, p1 and p2 are not altered
Useful in AND-type simultaneous execution
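
The three update rules can be written out as a small function. This is a sketch of the behavior described above, not of the hardware; the function names and the "default"/"or"/"and" strings are illustrative labels.

```python
def cmp_update(ctype, result, p1, p2):
    """Return the new (p1, p2) after a compare with the given ctype."""
    result = 1 if result else 0
    if ctype == "default":
        return result, 1 - result            # p1 = result, p2 = complement
    if ctype == "or":
        return (1, 1) if result == 1 else (p1, p2)
    if ctype == "and":
        return (0, 0) if result == 0 else (p1, p2)
    raise ValueError("unknown ctype: " + ctype)


def any_equal(value, candidates):
    """OR-type use: p1 ends up 1 if value equals any candidate."""
    p1, p2 = 0, 0                            # predicates are cleared first
    for c in candidates:
        # Each compare only ever sets the predicates, so the compares
        # are independent of one another and could execute together.
        p1, p2 = cmp_update("or", value == c, p1, p2)
    return p1
```

Because an or-type compare never clears a predicate, the order of the compares does not matter, which is what makes simultaneous execution safe.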

56
Itanium Instructions (contd)

Branch instructions
Used for jump as well as procedure calls
Supports both direct and indirect branching
All direct branches are IP-relative
IP relative form
(qp) br.btype.bwh.ph.dh target25 (basic form)
(qp) br.btype.bwh.ph.dh b1 = target25 (call form)
br.btype.bwh.ph.dh target25 (counted loop form)

57
Itanium Instructions (contd)

Indirect form
(qp) br.btype.bwh.ph.dh b2 (basic form)
(qp) br.btype.bwh.ph.dh b1 = b2 (call form)
btype: Type of branch
cond or none (for basic form)
Branch taken if qp is 1; otherwise not
To invoke a procedure
Use the call form with btype = call
Turns branch into a conditional procedure call
Procedure invoked only if qp is 1; otherwise not
Return address is saved in b1 branch register

58
Itanium Instructions (contd)

Counted loop version
Set btype = cloop
Loop count is in application register ar65
If ar65 is not zero, decrements it and takes the branch
RET version
Use btype = ret
Should use the indirect form and specify the branch register that has the return address
Example 1: Conditional skip
(p3) br skip or
(p3) br.cond skip

59
Itanium Instructions (contd)

Example 2: Loop iterates 100 times
mov lc = 100
loop_back:
. . .
br.cloop loop_back
Example 3: Procedure call to sum
(p0) br.call br2 = sum
Example 4: Return from a procedure
(p0) br.ret br2

60
Handling Branches

Three techniques
Branch elimination
Eliminate branches
Best way to handle branches is not to have
branches
Possible to eliminate some types of branches
Branch speedup
Reduce the delay associated with branches
Reorder instructions
Speculative execution
Branch prediction
Discussed before (see Chapter 8)

61
Handling Branches (contd)

Branch elimination in Itanium
Can be done using predication
if (R1 == R2)
R3 = R3 + R1
else
R3 = R3 - R1

Branch-based code:
cmp r1,r2
je equal
sub r3,r1
jmp next
equal:
add r3,r1
next:

Predicated Itanium code:
cmp.eq p1,p2 = r1,r2
(p1) add r3 = r3,r1
(p2) sub r3 = r3,r1
62
Handling Branches (contd)

switch (r6)
case 1:
r2 = r3 + r4
break
case 2:
r2 = r3 - r4
break
case 3:
r2 = r3 + r5
break
case 4:
r2 = r3 - r5
break

cmp.eq p1,p0 = r6,1
cmp.eq p2,p0 = r6,2
cmp.eq p3,p0 = r6,3
cmp.eq p4,p0 = r6,4
(p1) add r2 = r3,r4
(p2) sub r2 = r3,r4
(p3) add r2 = r3,r5
(p4) sub r2 = r3,r5
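
The predicated version of the switch statement can be checked against the branching version with a short model. This is an illustrative sketch: the function names are hypothetical, and the Python conditional expression inside predicated() stands in for the hardware's nullification of an instruction whose predicate is 0.

```python
def switch_branching(r3, r4, r5, r6, r2=0):
    """Reference: the switch statement, implemented with branches."""
    if r6 == 1:
        r2 = r3 + r4
    elif r6 == 2:
        r2 = r3 - r4
    elif r6 == 3:
        r2 = r3 + r5
    elif r6 == 4:
        r2 = r3 - r5
    return r2


def predicated(pred, new_value, old_value):
    """A predicated instruction writes its result only when its predicate is 1."""
    return new_value if pred else old_value


def switch_predicated(r3, r4, r5, r6, r2=0):
    """Four compares set four predicates; four guarded instructions follow."""
    p1 = int(r6 == 1)                   # cmp.eq p1,p0 = r6,1
    p2 = int(r6 == 2)                   # cmp.eq p2,p0 = r6,2
    p3 = int(r6 == 3)                   # cmp.eq p3,p0 = r6,3
    p4 = int(r6 == 4)                   # cmp.eq p4,p0 = r6,4
    r2 = predicated(p1, r3 + r4, r2)    # (p1) add r2 = r3,r4
    r2 = predicated(p2, r3 - r4, r2)    # (p2) sub r2 = r3,r4
    r2 = predicated(p3, r3 + r5, r2)    # (p3) add r2 = r3,r5
    r2 = predicated(p4, r3 - r5, r2)    # (p4) sub r2 = r3,r5
    return r2
```

At most one predicate is 1 for any value of r6, so the four guarded instructions never conflict; if r6 matches no case, r2 is left unchanged in both versions.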

63
Speculative Execution

Instructions are executed in expectation that
they will be needed
Keeps pipeline full
Masks memory latency
Itanium supports two types
Data speculation
Handles data dependencies
Data dependencies are discussed in Chapter 8
Control speculation
Handles control dependencies
Both are compiler optimizations
The compiler reorders instructions

64
Speculative Execution (contd)

Data speculation

Original code
sub r6 = r7,r8 //cycle 1
sub r9 = r10,r6 //cycle 2
ld8 r4 = [r5]
add r11 = r12,r4 //cycle 4

Reordered code
ld8 r4 = [r5] //cycle 1
sub r6 = r7,r8
sub r9 = r10,r6 //cycle 2
add r11 = r12,r4 //cycle 3
65
Speculative Execution (contd)

Ambiguous dependency between first st8 and ld8

sub r6 = r7,r8 //cycle 1
st8 [r9] = r6 //cycle 2
ld8 r4 = [r5]
add r11 = r12,r4 //cycle 4
st8 [r10] = r11 //cycle 5
66
Speculative Execution (contd)

We can move such load instructions using advanced load (ld.a) and check load (ld.c)

ld8.a r4 = [r5] //cycle 0 or earlier
. . .
sub r6 = r7,r8 //cycle 1
st8 [r9] = r6 //cycle 2
ld8.c r4 = [r5]
add r11 = r12,r4
st8 [r10] = r11 //cycle 3
67
Speculative Execution (contd)

Further improvement with advance check (chk.a)

ld8.a r4 = [r5] //cycle -1 or earlier
. . .
add r11 = r12,r4 //cycle 1
sub r6 = r7,r8
st8 [r9] = r6 //cycle 2
chk.a r4,recover
back:
st8 [r10] = r11
recover:
ld8 r4 = [r5] // reload
add r11 = r12,r4 // reexecute add
br back // jump back
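
The advanced-load and check mechanism can be modeled with a toy table that remembers which addresses were loaded early. This is a sketch under stated assumptions: the class and function names are hypothetical, memory is a plain dictionary, and the table is a simplified stand-in for the hardware structure that tracks advanced loads.

```python
class AdvancedLoadTable:
    """Toy stand-in for the hardware table that tracks advanced loads."""

    def __init__(self):
        self.entries = {}  # target register name -> address it was loaded from

    def advanced_load(self, reg, addr, mem):
        """ld8.a: perform the load early and remember its address."""
        self.entries[reg] = addr
        return mem[addr]

    def store(self, addr, value, mem):
        """st8: write memory and drop any advanced load of the same address."""
        mem[addr] = value
        self.entries = {r: a for r, a in self.entries.items() if a != addr}

    def check(self, reg):
        """chk.a: True if the advanced load is still valid."""
        return reg in self.entries


def in_order(mem, r5, r9, r10, r7, r8, r12):
    """Reference: the original order, with the load after the first store."""
    mem = dict(mem)
    r6 = r7 - r8
    mem[r9] = r6
    r4 = mem[r5]
    r11 = r12 + r4
    mem[r10] = r11
    return mem


def speculative(mem, r5, r9, r10, r7, r8, r12):
    """The reordered sequence with an advanced load and an advance check."""
    mem = dict(mem)
    table = AdvancedLoadTable()
    r4 = table.advanced_load("r4", r5, mem)  # ld8.a r4 = [r5]
    r11 = r12 + r4                           # add r11 = r12,r4 (hoisted too)
    r6 = r7 - r8                             # sub r6 = r7,r8
    table.store(r9, r6, mem)                 # st8 [r9] = r6
    if not table.check("r4"):                # chk.a r4,recover
        r4 = mem[r5]                         # recover: reload
        r11 = r12 + r4                       #          re-execute add
    table.store(r10, r11, mem)               # back: st8 [r10] = r11
    return mem
```

When r9 and r5 hold different addresses the check succeeds and no recovery code runs; when they alias, the store invalidates the entry and the recovery path reloads r4 and redoes the add, so both cases match the in-order result.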
68
Speculative Execution (contd)

Control speculation
To hide the latency of long-latency instructions such as loads, move them earlier in the code

cmp.eq p1,p0 = r10,10 //cycle 0
(p1) br.cond skip //cycle 0
ld8 r1 = [r2] //cycle 1
add r3 = r1,r4 //cycle 3
skip:
// other instructions
Cannot advance ld8 because of the branch
69
Speculative Execution (contd)
ld8.s r1 = [r2] //cycle -2 or earlier
//other instructions
cmp.eq p1,p0 = r10,10 //cycle 0
(p1) br.cond skip //cycle 0
chk.s r1,recovery //cycle 0
add r3 = r1,r4 //cycle 0
skip:
//other instructions
recovery:
ld8 r1 = [r2]
br skip
Speculative check chk.s allows us to advance ld8
70
Branch Prediction

Branch hints
bwh completer (branch whether hint)
spnt static branch not taken
sptk static branch taken
dpnt dynamic branch not taken
dptk dynamic branch taken
Prefetch hint (ph)
Hint about sequential prefetch
few or many
Deallocation hint (dh)
Specifies whether branch cache should be cleared
clr indicates deallocation
