ARM Introduction - PowerPoint PPT Presentation

Provided by: Aleksandar58
Learn more at: http://www.ece.uah.edu

Transcript and Presenter's Notes

1
ARM Introduction: Instruction Set Architecture
  • Aleksandar Milenkovic
  • E-mail: milenka_at_ece.uah.edu
  • Web: http://www.ece.uah.edu/milenka

2
Outline
  • ARM Architecture
  • ARM Organization and Implementation
  • ARM Instruction Set
  • Thumb Instruction Set
  • Architectural Support for System Development
  • ARM Processor Cores
  • Memory Hierarchy
  • Architectural Support for Operating Systems
  • ARM CPU Cores
  • Embedded ARM Applications

3
ARM History
  • ARM: Acorn RISC Machine (1983-1985)
  • Acorn Computers Limited, Cambridge, England
  • ARM: Advanced RISC Machine (1990)
  • ARM Limited, 1990
  • ARM has been licensed to many semiconductor
    manufacturers

4
ARM's visible registers
  • User level
  • 15 GPRs, PC, CPSR (current program status
    register)
  • Remaining registers are used for system-level
    programming and for handling exceptions

5
ARM CPSR format
  • N (Negative), Z (Zero), C (Carry), V (oVerflow)
  • mode: controls the processor mode
  • T: controls the instruction set
  • T = 1: instruction stream is 16-bit Thumb
    instructions
  • T = 0: instruction stream is 32-bit ARM
    instructions
  • I, F: interrupt enables (IRQ and FIQ; a set bit
    masks the corresponding interrupt)
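
A minimal sketch of how these fields can be pulled out of a CPSR value. The bit positions (N, Z, C, V in bits 31 to 28; I, F, T in bits 7, 6, 5; mode in bits 4 to 0) are the classic 32-bit ARM layout rather than something stated on the slide, and decode_cpsr is a hypothetical helper name.

```python
# Mode encodings of the classic 32-bit ARM CPSR (assumed, not from the slide).
MODES = {
    0b10000: "usr", 0b10001: "fiq", 0b10010: "irq", 0b10011: "svc",
    0b10111: "abt", 0b11011: "und", 0b11111: "sys",
}

def decode_cpsr(cpsr):
    """Split a 32-bit CPSR value into its named fields."""
    return {
        "N": (cpsr >> 31) & 1,   # negative
        "Z": (cpsr >> 30) & 1,   # zero
        "C": (cpsr >> 29) & 1,   # carry
        "V": (cpsr >> 28) & 1,   # overflow
        "I": (cpsr >> 7) & 1,    # IRQ mask bit
        "F": (cpsr >> 6) & 1,    # FIQ mask bit
        "T": (cpsr >> 5) & 1,    # 1 = Thumb, 0 = ARM
        "mode": MODES.get(cpsr & 0x1F, "unknown"),
    }

fields = decode_cpsr(0x600000D3)
print(fields)
```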

6
ARM memory organization
  • Linear array of bytes numbered from 0 to 2^32 - 1
  • Data items
  • bytes (8 bits)
  • half-words (16 bits): always aligned to 2-byte
    boundaries (start at an even byte address)
  • words (32 bits): always aligned to 4-byte
    boundaries (start at a byte address which is a
    multiple of 4)

7
ARM instruction set
  • Load-store architecture
  • operands are in GPRs
  • load/store: the only instructions that operate
    on memory
  • Instructions
  • Data Processing: use and change only register
    values
  • Data Transfer: copy memory values into registers
    (load) or copy register values into memory
    (store)
  • Control Flow
  • branch
  • branch-and-link: save the return address to
    resume the original sequence
  • trapping into system code: supervisor calls

8
ARM instruction set (cont'd)
  • Three-address data processing instructions
  • Conditional execution of every instruction
  • Powerful load/store multiple register
    instructions
  • Ability to perform a general shift operation and
    a general ALU operation in a single instruction
    that executes in a single clock cycle
  • Open instruction set extension through the
    coprocessor instruction set, including adding new
    registers and data types to the programmer's
    model
  • Very dense 16-bit compressed representation of
    the instruction set in the Thumb architecture

9
I/O system
  • I/O is memory mapped
  • internal registers of peripherals (disk
    controllers, network interfaces, etc.) are
    addressable locations within the ARM's memory map
    and may be read and written using the load-store
    instructions
  • Peripherals may use either the normal interrupt
    (IRQ) or fast interrupt (FIQ) input
  • normally most interrupt sources share the IRQ
    input, while just one or two time-critical
    sources are connected to the FIQ input
  • Some systems may include external DMA hardware to
    handle high-bandwidth I/O traffic

10
ARM exceptions
  • ARM supports a range of interrupts, traps, and
    supervisor calls; all are grouped under the
    general heading of exceptions
  • Handling exceptions
  • current state is saved by copying the PC into
    r14_exc and CPSR into SPSR_exc (exc stands for
    exception type)
  • processor operating mode is changed to the
    appropriate exception mode
  • PC is forced to a value between 0x00 and 0x1C,
    the particular value depending on the type of
    exception
  • instruction at the location PC is forced to (the
    vector address) usually contains a branch to the
    exception handler
  • the exception handler will use r13_exc, which is
    normally initialized to point to a dedicated
    stack in memory, to save some user registers
  • return: restore the user registers and then
    restore PC and CPSR atomically
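
The entry and return steps above can be sketched as a small state model. This is an illustration under stated assumptions: the vector addresses and modes are those of the classic ARM exception table, the CPSR is modeled as a (flags, mode) pair, and the per-exception return-address adjustment and interrupt masking are left out. enter_exception and return_from_exception are hypothetical names.

```python
# Classic ARM vector table (assumed): exception -> (vector address, mode entered).
VECTORS = {
    "reset": (0x00, "svc"),
    "undefined": (0x04, "und"),
    "swi": (0x08, "svc"),
    "prefetch_abort": (0x0C, "abt"),
    "data_abort": (0x10, "abt"),
    "irq": (0x18, "irq"),
    "fiq": (0x1C, "fiq"),
}

def enter_exception(state, kind):
    """Save PC and CPSR in the banked registers, switch mode, jump to the vector."""
    vector, mode = VECTORS[kind]
    flags, _old_mode = state["cpsr"]
    new = dict(state)
    new["r14_" + mode] = state["pc"]      # PC -> r14_exc
    new["spsr_" + mode] = state["cpsr"]   # CPSR -> SPSR_exc
    new["cpsr"] = (flags, mode)           # change to the exception mode
    new["pc"] = vector                    # PC forced to the vector address
    return new

def return_from_exception(state):
    """Restore PC and CPSR together from the banked copies."""
    _flags, mode = state["cpsr"]
    new = dict(state)
    new["pc"] = state["r14_" + mode]
    new["cpsr"] = state["spsr_" + mode]
    return new

user = {"pc": 0x8004, "cpsr": ("nzcv", "usr")}
in_irq = enter_exception(user, "irq")
back = return_from_exception(in_irq)
```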

11
ARM cross-development toolkit
  • Software development
  • tools developed by ARM Limited
  • public domain tools (ARM back end for the gcc C
    compiler)
  • Cross-development
  • tools run on a different architecture from the
    one for which they produce code

12
Outline
  • ARM Architecture
  • ARM Assembly Language Programming
  • ARM Organization and Implementation
  • ARM Instruction Set
  • Architectural Support for High-level Languages
  • Thumb Instruction Set
  • Architectural Support for System Development
  • ARM Processor Cores
  • Memory Hierarchy
  • Architectural Support for Operating Systems
  • ARM CPU Cores
  • Embedded ARM Applications

13
ARM Instruction Set
  • Data Processing Instructions
  • Data Transfer Instructions
  • Control flow Instructions

14
Data Processing Instructions
  • Classes of data processing instructions
  • Arithmetic operations
  • Bit-wise logical operations
  • Register-movement operations
  • Comparison operations
  • Operands: 32 bits wide; there are 3 ways to
    specify operands
  • come from registers
  • the second operand may be a constant (immediate)
  • shifted register operand
  • Result: 32 bits wide, placed in a register
  • long multiply produces a 64-bit result

15
Data Processing Instructions (cont'd)
Arithmetic Operations
ADD r0, r1, r2 ; r0 := r1 + r2
ADC r0, r1, r2 ; r0 := r1 + r2 + C
SUB r0, r1, r2 ; r0 := r1 - r2
SBC r0, r1, r2 ; r0 := r1 - r2 + C - 1
RSB r0, r1, r2 ; r0 := r2 - r1
RSC r0, r1, r2 ; r0 := r2 - r1 + C - 1
Bit-wise Logical Operations
AND r0, r1, r2 ; r0 := r1 and r2
ORR r0, r1, r2 ; r0 := r1 or r2
EOR r0, r1, r2 ; r0 := r1 xor r2
BIC r0, r1, r2 ; r0 := r1 and (not r2)
Register Movement
MOV r0, r2 ; r0 := r2
MVN r0, r2 ; r0 := not r2
Comparison Operations
CMP r1, r2 ; set cc on r1 - r2
CMN r1, r2 ; set cc on r1 + r2
TST r1, r2 ; set cc on r1 and r2
TEQ r1, r2 ; set cc on r1 xor r2
16
Data Processing Instructions (cont'd)
  • Immediate operands: immediate = (0 -> 255) x
    2^(2n), 0 <= n <= 12

ADD r3, r3, #3 ; r3 := r3 + 3
AND r8, r7, #&ff ; r8 := r7[7:0], & for hex

  • Shifted register operands
  • the second operand is subject to a shift
    operation before it is combined with the first
    operand

ADD r3, r2, r1, LSL #3 ; r3 := r2 + 8 x r1
ADD r5, r5, r3, LSL r2 ; r5 := r5 + 2^r2 x r3
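
A short sketch of how one might test whether a constant fits the immediate form. It assumes the standard ARM data-processing encoding, an 8-bit value rotated right by an even amount (2 x rot, rot from 0 to 15); the slide's formula covers the cases where that rotation does not wrap around the word. encode_immediate is a hypothetical helper, not an assembler API.

```python
MASK32 = 0xFFFFFFFF

def rotate_left(value, amount):
    """Rotate a 32-bit value left by amount bits."""
    amount %= 32
    return ((value << amount) | (value >> (32 - amount))) & MASK32

def encode_immediate(value):
    """Return (imm8, rot) with value == imm8 rotated right by 2*rot, or None."""
    value &= MASK32
    for rot in range(16):
        # Undo a rotate-right by 2*rot; the result must fit in 8 bits.
        imm8 = rotate_left(value, 2 * rot)
        if imm8 <= 0xFF:
            return imm8, rot
    return None

print(encode_immediate(0xFF000000))
print(encode_immediate(0x101))
```

Constants that return None have to be built another way, for example loaded from memory or assembled from two instructions.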
17
ARM shift operations
  • LSL: Logical Shift Left
  • LSR: Logical Shift Right
  • ASR: Arithmetic Shift Right
  • ROR: Rotate Right
  • RRX: Rotate Right Extended by 1 place
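
The five operations can be modeled on 32-bit values as below. This is an illustrative sketch, assuming shift amounts in the range 0 to 31; the function names mirror the mnemonics but are otherwise hypothetical.

```python
MASK32 = 0xFFFFFFFF

def lsl(x, n):
    """Logical shift left: zeros enter from the right."""
    return (x << n) & MASK32

def lsr(x, n):
    """Logical shift right: zeros enter from the left."""
    return (x & MASK32) >> n

def asr(x, n):
    """Arithmetic shift right: the sign bit is copied into vacated positions."""
    x &= MASK32
    shifted = x >> n
    if x >> 31:
        shifted |= (MASK32 << (32 - n)) & MASK32
    return shifted

def ror(x, n):
    """Rotate right: bits leaving on the right re-enter on the left."""
    n %= 32
    x &= MASK32
    return ((x >> n) | (x << (32 - n))) & MASK32

def rrx(x, carry_in):
    """Rotate right by one place through the carry; returns (result, carry_out)."""
    x &= MASK32
    return (carry_in << 31) | (x >> 1), x & 1
```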

18
Setting the condition codes
  • Any DPI (data processing instruction) can set the
    condition codes (N, Z, V, and C)
  • for all DPIs except the comparison operations a
    specific request must be made
  • at the assembly language level this request is
    indicated by adding an S to the opcode
  • Example (r3-r2 := r1-r0 + r3-r2)

ADDS r2, r2, r0 ; carry out to C
ADC r3, r3, r1 ; ... add into high word

  • Arithmetic operations set all the flags (N, Z, C,
    and V)
  • Logical and move operations set N and Z
  • preserve V and either preserve C when there is no
    shift operation, or set C according to the shift
    operation (fall-off bit)
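
The ADDS/ADC pair can be traced with a small model: the low words are added first and the carry out feeds the addition of the high words. A sketch only; adds, adc and add64 are hypothetical helper names.

```python
MASK32 = 0xFFFFFFFF

def adds(a, b):
    """ADDS: 32-bit add that also returns the carry flag."""
    total = a + b
    return total & MASK32, total >> 32

def adc(a, b, carry):
    """ADC: 32-bit add including the incoming carry."""
    return (a + b + carry) & MASK32

def add64(lo_a, hi_a, lo_b, hi_b):
    """Add two 64-bit values held as (low word, high word) register pairs."""
    lo, carry = adds(lo_a, lo_b)      # ADDS r2, r2, r0
    hi = adc(hi_a, hi_b, carry)       # ADC  r3, r3, r1
    return lo, hi

print(add64(0xFFFFFFFF, 0x1, 0x1, 0x0))
```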
19
Multiplies
  • Example (Multiply, Multiply-Accumulate)

MUL r4, r3, r2 ; r4 := (r3 x r2)[31:0]
MLA r4, r3, r2, r1 ; r4 := (r3 x r2 + r1)[31:0]

  • Note
  • least significant 32 bits are placed in the
    result register, the rest are ignored
  • immediate second operand is not supported
  • result register must not be the same as the
    first source register
  • if the S bit is set, V is preserved and C is
    rendered meaningless
  • Example (r0 := r0 x 35)

ADD r0, r0, r0, LSL #2 ; r0 := r0 x 5
RSB r0, r0, r0, LSL #3 ; r0 := 7 x r0
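
The multiply-by-35 example relies on 35 = 5 x 7: an add with the operand shifted left by 2 gives 5 x r0, and a reverse subtract from the operand shifted left by 3 gives 7 times that. A minimal sketch with hypothetical helper names, which also shows the 32-bit truncation of MUL and MLA:

```python
MASK32 = 0xFFFFFFFF

def times35(r0):
    """Multiply by 35 using two shift-and-ALU steps instead of MUL."""
    r0 = (r0 + (r0 << 2)) & MASK32   # r0 := r0 + 4 x r0 = 5 x r0
    r0 = ((r0 << 3) - r0) & MASK32   # r0 := 8 x r0 - r0 = 7 x r0
    return r0

def mul(rm, rs):
    """MUL: only the least significant 32 bits of the product are kept."""
    return (rm * rs) & MASK32

def mla(rm, rs, rn):
    """MLA: multiply-accumulate, again truncated to 32 bits."""
    return (rm * rs + rn) & MASK32

print(times35(3))
```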
20
Data transfer instructions
  • Single register load and store instructions
  • transfer of a data item (byte, half-word, word)
    between ARM registers and memory
  • Multiple register load and store instructions
  • enable transfer of large quantities of data
  • used for procedure entry and exit, to
    save/restore workspace registers, to copy blocks
    of data around memory
  • Single register swap instructions
  • allow exchange between a register and memory in
    one instruction
  • used to implement semaphores to ensure mutual
    exclusion on accesses to shared data in multis

21
Data Transfer Instructions (cont'd)
Single register load and store
Register-indirect addressing
LDR r0, [r1] ; r0 := mem32[r1]
STR r0, [r1] ; mem32[r1] := r0
Note: r1 keeps a word address (2 LSBs are 0)
LDRB r0, [r1] ; r0 := mem8[r1]
Note: no restrictions for r1
Base+offset addressing (offset of up to 4 Kbytes)
LDR r0, [r1, #4] ; r0 := mem32[r1 + 4]
Auto-indexing addressing
LDR r0, [r1, #4]! ; r0 := mem32[r1 + 4]; r1 := r1 + 4
Post-indexed addressing
LDR r0, [r1], #4 ; r0 := mem32[r1]; r1 := r1 + 4
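
The three offset forms differ only in which address is used for the access and whether the base register is updated. The sketch below models that on a small byte array; little-endian byte order and the helper names are assumptions made for the illustration.

```python
def load_word(mem, addr):
    """Read an aligned 32-bit word (little-endian assumed)."""
    assert addr % 4 == 0, "word accesses must be 4-byte aligned"
    return int.from_bytes(mem[addr:addr + 4], "little")

def ldr_offset(mem, base, offset):
    """Base plus offset: access base + offset, base unchanged."""
    return load_word(mem, base + offset), base

def ldr_auto_index(mem, base, offset):
    """Auto-indexing: access base + offset, then base := base + offset."""
    addr = base + offset
    return load_word(mem, addr), addr

def ldr_post_index(mem, base, offset):
    """Post-indexed: access base, then base := base + offset."""
    return load_word(mem, base), base + offset

mem = bytearray(16)
mem[0:4] = (0x11111111).to_bytes(4, "little")
mem[4:8] = (0x22222222).to_bytes(4, "little")
```

Each helper returns the loaded value together with the new base register value, so the write-back (or its absence) is visible to the caller.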
22
Data Transfer Instructions (cont'd)
COPY ADR r1, TABLE1 ; r1 points to TABLE1
     ADR r2, TABLE2 ; r2 points to TABLE2
LOOP LDR r0, [r1]
     STR r0, [r2]
     ADD r1, r1, #4
     ADD r2, r2, #4
     ...
TABLE1 ...
TABLE2 ...

COPY ADR r1, TABLE1 ; r1 points to TABLE1
     ADR r2, TABLE2 ; r2 points to TABLE2
LOOP LDR r0, [r1], #4
     STR r0, [r2], #4
     ...
TABLE1 ...
TABLE2 ...
23
Data Transfer Instructions
Multiple register data transfers
LDMIA r1, {r0, r2, r5} ; r0 := mem32[r1]
                       ; r2 := mem32[r1 + 4]
                       ; r5 := mem32[r1 + 8]
Note: any subset (or all) of the registers may be
transferred with a single instruction
Note: the order of registers within the list is
insignificant
Note: including r15 in the list will cause a change
in the control flow
  • Block copy view
  • data is to be stored above or below the
    address held in the base register
  • address incrementing or decrementing begins
    before or after storing the first value
  • Stack organizations
  • FA: full ascending
  • EA: empty ascending
  • FD: full descending
  • ED: empty descending

24
Multiple register transfer addressing modes
[Figure: four memory diagrams spanning addresses 0x1000, 0x100c and 0x1018, showing where r0, r1 and r5 are stored and the value of r9 before and after each of STMIA r9!, {r0,r1,r5}; STMIB r9!, {r0,r1,r5}; STMDA r9!, {r0,r1,r5}; STMDB r9!, {r0,r1,r5}]
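
The four store-multiple variants on this slide (STMIA, STMIB, STMDA, STMDB) can be summarized by which addresses they touch and where the base register ends up after write-back. A sketch assuming the usual rule that the lowest-numbered register goes to the lowest address; stm_addresses is a hypothetical helper.

```python
def stm_addresses(mode, base, count):
    """Return (addresses in ascending order, base after write-back)."""
    size = 4 * count
    if mode == "IA":      # increment after
        first, new_base = base, base + size
    elif mode == "IB":    # increment before
        first, new_base = base + 4, base + size
    elif mode == "DA":    # decrement after
        first, new_base = base - size + 4, base - size
    elif mode == "DB":    # decrement before
        first, new_base = base - size, base - size
    else:
        raise ValueError("unknown mode: " + mode)
    return [first + 4 * i for i in range(count)], new_base

for mode in ("IA", "IB", "DA", "DB"):
    addrs, new_base = stm_addresses(mode, 0x100C, 3)
    print(mode, [hex(a) for a in addrs], hex(new_base))
```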
25
The mapping between the stack and block copy views
26
Control flow instructions
27
Conditional execution
  • Conditional execution to avoid branch
    instructions used to skip a small number of
    non-branch instructions
  • Example

CMP r0, #5
BEQ BYPASS ; if (r0 != 5) {
ADD r1, r1, r0 ; r1 := r1 + r0 - r2
SUB r1, r1, r2 ; }
BYPASS ...

With conditional execution

CMP r0, #5
ADDNE r1, r1, r0
SUBNE r1, r1, r2
...

Note: add the 2-letter condition after the 3-letter
opcode

; if ((a == b) && (c == d)) e++;
CMP r0, r1
CMPEQ r2, r3
ADDEQ r4, r4, #1
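
The CMP / CMPEQ / ADDEQ sequence can be traced by modeling only the Z flag: the second compare runs only if the first found equality, so Z ends up set only when both comparisons match. A sketch with a hypothetical function name; the other flags are ignored.

```python
MASK32 = 0xFFFFFFFF

def conditional_increment(r0, r1, r2, r3, r4):
    """Model of: CMP r0, r1 / CMPEQ r2, r3 / ADDEQ r4, r4, #1."""
    z = (r0 == r1)            # CMP r0, r1 sets Z when equal
    if z:                     # CMPEQ executes only when Z is set
        z = (r2 == r3)
    if z:                     # ADDEQ executes only when Z is still set
        r4 = (r4 + 1) & MASK32
    return r4

print(conditional_increment(1, 1, 2, 2, 10))
print(conditional_increment(1, 1, 2, 3, 10))
```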
28
Branch and link instructions
  • Branch to subroutine (r14 serves as a link
    register)

     BL SUBR ; branch to SUBR
     .. ; return here
SUBR .. ; SUBR entry point
     MOV pc, r14 ; return

  • Nested subroutines

     BL SUB1
     ..
SUB1 ; save work and link register
     STMFD r13!, {r0-r2,r14}
     BL SUB2
     ..
     LDMFD r13!, {r0-r2,pc}
SUB2 ..
     MOV pc, r14 ; copy r14 into r15
29
Supervisor calls
  • Supervisor is a program which operates at a
    privileged level; it can do things that a
    user-level program cannot do directly
  • Example: send text to the display
  • ARM ISA includes SWI (SoftWare Interrupt)

; output r0[7:0]
SWI SWI_WriteC
; return from a user program back to monitor
SWI SWI_Exit
30
Jump tables
  • Call one of a set of subroutines depending on a
    value computed by the program

     BL JTAB
     ...
JTAB CMP r0, #0
     BEQ SUB0
     CMP r0, #1
     BEQ SUB1
     CMP r0, #2
     BEQ SUB2

Note: slow when the list is long, and all
subroutines are equally frequent

       BL JTAB
       ...
JTAB   ADR r1, SUBTAB
       CMP r0, #SUBMAX ; overrun?
       LDRLS pc, [r1, r0, LSL #2]
       B ERROR
SUBTAB DCD SUB0
       DCD SUB1
       DCD SUB2
       ...
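
The table-driven form replaces the chain of compares with one bounds check and one indexed load. A rough high-level analogue, assuming SUBMAX is the highest valid index (the LS condition is an unsigned "lower or same"); the subroutine bodies here are placeholders.

```python
def sub0():
    return "SUB0"

def sub1():
    return "SUB1"

def sub2():
    return "SUB2"

def error():
    return "ERROR"

SUBTAB = [sub0, sub1, sub2]     # one table entry per subroutine
SUBMAX = len(SUBTAB) - 1        # highest valid index (assumption)

def jtab(r0):
    """Dispatch on r0 with a single range check, whatever the table length."""
    if 0 <= r0 <= SUBMAX:       # unsigned compare: negatives count as overrun
        return SUBTAB[r0]()
    return error()

print(jtab(1))
```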
31
Hello ARM World!
           AREA HelloW, CODE, READONLY ; declare code area
SWI_WriteC EQU &0 ; output character in r0
SWI_Exit   EQU &11 ; finish program
           ENTRY ; code entry point
START      ADR r1, TEXT ; r1 <- "Hello ARM World!"
LOOP       LDRB r0, [r1], #1 ; get the next byte
           CMP r0, #0 ; check for text end
           SWINE SWI_WriteC ; if not end of string, print
           BNE LOOP
           SWI SWI_Exit ; end of execution
TEXT       = "Hello ARM World!", &0a, &0d, 0
           END
32
ARM Organization and Implementation
  • Aleksandar Milenkovic
  • E-mail: milenka_at_ece.uah.edu
  • Web: http://www.ece.uah.edu/milenka

33
Outline
  • ARM Architecture
  • ARM Organization and Implementation
  • ARM Instruction Set
  • Architectural Support for High-level Languages
  • Thumb Instruction Set
  • Architectural Support for System Development
  • ARM Processor Cores
  • Memory Hierarchy
  • Architectural Support for Operating Systems
  • ARM CPU Cores
  • Embedded ARM Applications

34
ARM organization
  • Register file
  • 2 read ports, 1 write port, plus 1 read and 1
    write port reserved for r15 (pc)
  • Barrel shifter: shifts or rotates one operand by
    any number of bits
  • ALU: performs the arithmetic and logic functions
    required
  • Memory address register + incrementer
  • Memory data registers
  • Instruction decoder and associated control logic

[Figure: ARM datapath block diagram with address bus A[31:0], address register, incrementer, register bank (including PC), instruction decode and control, multiply register, A bus, B bus, ALU bus, barrel shifter, ALU, data out register, data in register and data bus D[31:0]]
35
Three-stage pipeline
  • Fetch
  • the instruction is fetched from memory and placed
    in the instruction pipeline
  • Decode
  • the instruction is decoded and the datapath
    control signals prepared for the next cycle; in
    this stage the instruction owns the decode logic
    but not the datapath
  • Execute
  • the instruction owns the datapath; the register
    bank is read, an operand shifted, the ALU
    result generated and written back into a
    destination register

36
ARM single-cycle instruction pipeline
37
ARM single-cycle instruction pipeline
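[Figure: pipeline timing over cycles 1 to 3 for add r0,r1,#5; sub r2,r3,r6; cmp r2,#3, each instruction passing through fetch, decode and execute, one cycle apart]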
38
ARM multi-cycle instruction pipeline
Decode logic is always generating the control
signals for the datapath to use in the next cycle
39
ARM multi-cycle LDMIA (load multiple) instruction
Decode stage occupied since ldmia must continue
to remember decoded instruction
[Figure: pipeline timing for ldmia r0,{r2,r3}; sub r2,r3,r6; cmp r2,#3; ldmia occupies the execute stage for two cycles (ex ld r2, ex ld r3) before ex sub and ex cmp]
sub fetched at normal time but not decoded until
LDMIA is finishing
Instruction delayed
40
Control stalls due to branches
  • Branches often introduce stalls (branch penalty)
  • Stall time may depend on whether branch is taken
  • May have to squash instructions that already
    started executing
  • Don't know what to fetch until the condition is
    evaluated

41
ARM pipelined branch
Decision not made until the third clock cycle
Two cycles of work thrown away if bne takes place
42
Pipeline: how it works
  • All instructions occupy the datapath for one or
    more adjacent cycles
  • For each cycle that an instruction occupies the
    datapath, it occupies the decode logic in the
    immediately preceding cycle
  • During the first datapath cycle each instruction
    issues a fetch for the next instruction but one
  • Branch instructions flush and refill the
    instruction pipeline
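
The first three rules are enough to lay out a timing chart for a straight-line sequence. The sketch below is a simplified model, assuming the first instruction is fetched in cycle 1 and ignoring branches and memory wait states; schedule is a hypothetical helper that takes the number of datapath cycles for each instruction.

```python
def schedule(exec_cycles):
    """Return fetch, decode and execute cycles for each instruction in order."""
    exec_starts = []
    cycle = 3                       # first instruction: fetch 1, decode 2, execute 3
    for count in exec_cycles:
        exec_starts.append(cycle)   # datapath is used in adjacent cycles
        cycle += count
    rows = []
    for i, count in enumerate(exec_cycles):
        execute = list(range(exec_starts[i], exec_starts[i] + count))
        decode = [c - 1 for c in execute]   # decode logic owned one cycle earlier
        # Each instruction fetches "the next but one" in its first datapath cycle.
        fetch = i + 1 if i < 2 else exec_starts[i - 2]
        rows.append({"fetch": fetch, "decode": decode, "execute": execute})
    return rows

# A two-cycle load multiple followed by two single-cycle instructions.
for row in schedule([2, 1, 1]):
    print(row)
```

With [2, 1, 1] the second instruction is still fetched in cycle 2 but its decode slips to cycle 4, which is the delay the multi-cycle slides describe.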

43
ARM9TDMI 5-stage pipeline
  • Fetch
  • Decode
  • instruction is decoded
  • register operands read (3 read ports)
  • Execute
  • an operand is shifted and the ALU result
    generated, or
  • address is computed
  • Buffer/data
  • data memory is accessed (load, store)
  • Write-back
  • write to register file

44
ARM9TDMI Data Forwarding
Data Forwarding
ADD r3, r2, r1, LSL #3 ; r3 := r2 + 8 x r1
ADD r5, r5, r3, LSL r2 ; r5 := r5 + 2^r2 x r3

ADD r3, r2, r1, LSL #3 ; r3 := r2 + 8 x r1
ADD r8, r9, r10 ; r8 := r9 + r10
ADD r5, r5, r3, LSL r2 ; r5 := r5 + 2^r2 x r3

Stall?
LDR r3, [r2] ; r3 := mem[r2]
ADD r1, r2, r3 ; r1 := r2 + r3
45
ARM9TDMI PC generation
  • 3-stage pipeline
  • PC behavior: operands are read in the execute
    stage, so r15 = PC + 8
  • 5-stage pipeline
  • operands are read in the decode stage, so
    r15 = PC + 4?
  • incompatibilities between 3-stage and 5-stage
    implementations => unacceptable
  • to avoid this, 5-stage pipeline ARMs emulate the
    behavior of the older 3-stage designs

46
Data processing instruction datapath activity (Ex)
  • Reg-Reg
  • Rd = Rn op Rm
  • r15 = AR + 4; AR = AR + 4
  • Reg-Imm
  • Rd = Rn op Imm
  • r15 = AR + 4; AR = AR + 4

[Figure: (a) register-register operations; (b) register-immediate operations]
47
STR (store register) datapath activity (Ex1, Ex2)
  • Compute address (Ex1)
  • AR = Rn op Disp
  • r15 = AR + 4
  • Store data (Ex2)
  • AR = PC
  • mem[AR] = Rd<x:y>
  • If auto-indexing => Rn = Rn +/- 4

[Figure: STR datapath activity: (a) 1st cycle, compute address; (b) 2nd cycle, store data and auto-index]
48
The first two (of three) cycles of a branch
instruction
  • Compute target address
  • AR = PC + Disp, lsl #2
  • Save return address (if required)
  • r14 = PC
  • AR = AR + 4

Third cycle: do a small correction to the value
stored in the link register so that it points
directly at the instruction which follows the
branch?
[Figure: (a) 1st cycle, compute branch target; (b) 2nd cycle, save return address]
49
ARM Implementation
  • Datapath
  • RTL (Register Transfer Level)
  • Control unit
  • FSM (Finite State Machine)

50
2-phase non-overlapping clock scheme
  • Most ARMs do not operate on edge-sensitive
    registers
  • Instead the design is based around 2-phase
    non-overlapping clocks which are generated
    internally from a single clock signal
  • Data movement is controlled by passing the data
    alternately through latches which are open
    during phase 1 and latches which are open during
    phase 2

51
ARM datapath timing
  • Register read
  • Register read buses: dynamic, precharged during
    phase 2
  • During phase 1 selected registers discharge the
    read buses, which become valid early in phase 1
  • Shift operation
  • second operand passes through the barrel shifter
  • ALU operation
  • ALU has input latches which are open in phase
    1, allowing the operands to begin combining in
    the ALU as soon as they are valid, but they close
    at the end of phase 1 so that the phase 2
    precharge does not get through to the ALU
  • ALU processes the operands during phase 2,
    producing the valid output towards the end of the
    phase
  • the result is latched in the destination register
    at the end of phase 2

52
ARM datapath timing (cont'd)
Minimum Datapath Delay = Register read time
  + Shifter Delay + ALU Delay + Register write
  set-up time + Phase 2 to phase 1 non-overlap time
53
The original ARM1 ripple-carry adder
  • Carry logic: uses a CMOS AOI (And-Or-Invert) gate
  • Even bits use the circuit shown below
  • Odd bits use the dual circuit with inverted
    inputs and outputs and AND and OR gates swapped
    around
  • Worst-case path: 32 gates long

54
ARM2 4-bit carry look-ahead scheme
  • Carry Generate (G), Carry Propagate (P)
  • Cout[3] = Cin[0].P + G
  • Use AOI and alternate AND/OR gates
  • Worst case: 8 gates long
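
A bit-level sketch of the group generate/propagate idea: for a 4-bit group, the carry out is G + P.Cin, where G says the group produces a carry by itself and P says it passes an incoming carry through. The per-bit definitions (g = a and b, p = a or b) are the textbook ones and are an assumption here, not taken from the ARM2 circuit.

```python
def cla4_carry_out(a, b, cin):
    """Carry out of a 4-bit group computed from group generate and propagate."""
    g = [(a >> i) & (b >> i) & 1 for i in range(4)]      # bit i generates a carry
    p = [((a >> i) | (b >> i)) & 1 for i in range(4)]    # bit i propagates a carry
    group_p = p[0] & p[1] & p[2] & p[3]
    group_g = (g[3]
               | (p[3] & g[2])
               | (p[3] & p[2] & g[1])
               | (p[3] & p[2] & p[1] & g[0]))
    return group_g | (group_p & cin)                     # Cout = G + P.Cin

print(cla4_carry_out(0b1111, 0b0000, 1))
```

The carry only has to cross one such stage per 4 bits, which is where the shorter worst-case path comes from.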

55
The ARM2 ALU logic for one result bit
  • ALU functions
  • data operations (add, sub, ...)
  • address computations for memory accesses
  • branch target computations
  • bit-wise logical operations
  • ...

56
ARM2 ALU function codes
57
The ARM6 carry-select adder scheme
  • Compute sums of various fields of the word for
    carry-in of zero and carry-in of one
  • Final result is selected by using the correct
    carry-in value to control a multiplexor

Worst case: O(log2(word width)) gates long
Note: Be careful! Fan-out on some of these gates
is high, so direct comparison with previous
schemes is not applicable.
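
The carry-select idea can be sketched as follows: every field is summed twice, once per possible carry-in, and the real carry merely picks one of the two precomputed results. This is a simplified model assuming equal 8-bit fields selected one after another; the ARM6 scheme uses fields of various sizes and a logarithmic selection tree, which the sketch does not reproduce.

```python
def carry_select_add(a, b, field_bits=8, width=32):
    """Add two width-bit values field by field; returns (sum, carry out)."""
    mask = (1 << field_bits) - 1
    result, carry = 0, 0
    for shift in range(0, width, field_bits):
        fa = (a >> shift) & mask
        fb = (b >> shift) & mask
        sum0 = fa + fb          # precomputed for carry-in = 0
        sum1 = fa + fb + 1      # precomputed for carry-in = 1
        chosen = sum1 if carry else sum0   # the multiplexor
        result |= (chosen & mask) << shift
        carry = chosen >> field_bits
    return result, carry

print(carry_select_add(0xFFFFFFFF, 1))
```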
58
The ARM6 ALU organization
  • Not easy to merge the arithmetic and logic
    functions => a separate logic unit runs in
    parallel with the adder, and a multiplexor
    selects the output

59
ARM9 carry arbitration encoding
  • Carry arbitration adder

ai     bi     ai-1   bi-1   Ci   vi, wi
0      0      -      -      0    0, 0
1      1      -      -      1    1, 1
0(1)   1(0)   0      0      0    0, 0
0(1)   1(0)   1      1      1    1, 1
0(1)   1(0)   0(1)   1(0)   u    1, 0

ai     bi     Ci   vi, wi
0      0      0    0, 0
1      1      1    1, 1
1      0      u    1, 0
0      1      u    1, 0
60
The cross-bar switch barrel shifter
  • Shifter delay is critical since it contributes
    directly to the datapath cycle time
  • Cross-bar switch matrix (32 x 32)
  • Principle for 4x4 matrix

61
The cross-bar switch barrel shifter (cont'd)
  • Precharged logic is used => each switch is a
    single NMOS transistor
  • Precharging sets all outputs to logic 0, so those
    which are not connected to any input during
    switching remain at 0, giving the zero filling
    required by the shift semantics
  • For rotate right, the right shift diagonal is
    enabled + the complementary shift left diagonal
    (e.g., right 1 + left 3)
  • Arithmetic shift right: uses sign-extension =>
    separate logic is used to decode the shift amount
    and discharge those outputs appropriately

62
Multiplier design
  • All ARMs apart from the first prototype have
    included support for integer multiplication
  • older ARM cores include low-cost multiplication
    hardware that supports only the 32-bit result
    multiply and multiply-accumulate
  • recent ARM cores have high-performance
    multiplication hardware and support 64-bit result
    multiply and multiply-accumulate
  • Low cost implementation
  • Use the datapath iteratively, employing the
    barrel shifter and ALU to generate a 2-bit product
    in each clock cycle
  • use early termination to stop the iterations when
    there are no more ones in the multiply register
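
The iterative scheme with early termination can be sketched as below. Note the simplification: this is a plain two-bits-per-cycle shift-and-add loop, not the Booth's-algorithm recoding the hardware uses, and the cycle count is only meant to show how early termination depends on the multiplier value. multiply_2bit is a hypothetical name.

```python
MASK32 = 0xFFFFFFFF

def multiply_2bit(multiplicand, multiplier):
    """Return (32-bit product, iterations used), consuming 2 multiplier bits per step."""
    product, cycles, shift = 0, 0, 0
    multiplier &= MASK32
    while multiplier:                       # early termination: no ones left
        digit = multiplier & 0b11           # next two multiplier bits
        product = (product + ((digit * multiplicand) << shift)) & MASK32
        multiplier >>= 2                    # two-bits-per-cycle shift register
        shift += 2
        cycles += 1
    return product, cycles

print(multiply_2bit(7, 35))
```

A small multiplier such as 35 finishes in 3 iterations, while a multiplier with ones in its top bits needs all 16.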

63
The 2-bit multiplication algorithm, Nth cycle
  • Control settings for the Nth cycle of the
    multiplication
  • Use existing shifter and ALU + additional
    hardware
  • dedicated two-bits-per-cycle shift register for
    the multiplier and a few gates for the Booth's
    algorithm control logic (overhead is a few per
    cent on the area of the ARM core)

64
High speed multiplication
  • Where multiplication performance is very
    important, more hardware resources must be
    dedicated
  • in some embedded systems the ARM core is used to
    perform real-time digital signal processing
    (DSP); DSP programs are typically multiplication
    intensive
  • Use intermediate results which include partial
    sums and partial carries
  • Carry-save adders are used for this
  • These two binary results are added together at
    the end of multiplication
  • The main ALU is used for this

65
Carry-propagate (a) and carry-save (b) adder
structures
  • Carry propagate adder takes two conventional
    (irredundant) binary numbers as inputs and
    produces a binary sum
  • Carry save adder takes one binary and one
    redundant (partial sum and partial carry) input
    and produces a sum in redundant binary
    representation (sum and carry)
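
A minimal sketch of the carry-save step: each bit position is reduced independently, so no carry ripples along the word, and only the final conversion needs a carry-propagate addition. The function names are hypothetical, and Python's unbounded integers stand in for a sufficiently wide datapath.

```python
def carry_save_add(a, partial_sum, partial_carry):
    """One carry-save step: three inputs in, redundant (sum, carry) pair out."""
    new_sum = a ^ partial_sum ^ partial_carry
    new_carry = ((a & partial_sum)
                 | (a & partial_carry)
                 | (partial_sum & partial_carry)) << 1
    return new_sum, new_carry

def sum_with_csa(values):
    """Accumulate values in redundant form, then resolve with one normal add."""
    partial_sum, partial_carry = 0, 0
    for value in values:
        partial_sum, partial_carry = carry_save_add(value, partial_sum, partial_carry)
    return partial_sum + partial_carry     # final carry-propagate addition

print(sum_with_csa([3, 5, 7, 11]))
```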

66
ARM high-speed multiplier organization
  • CSA has 4 layers of adders, each handling 2
    multiplier bits => multiplies 8 bits per clock
    cycle
  • Partial sum and carry are cleared at the
    beginning or initialized to accumulate a value
  • Multiplier is shifted right 8 bits per cycle in
    the Rs register
  • Partial sum and carry are rotated right 8 bits
    per cycle
  • Performance: up to 4 clock cycles (early
    termination is possible)
  • Complexity: 160 bits in shift registers, 128
    bits of carry-save adder logic (up to 10% of
    simpler cores)

67
ARM high-speed multiplier organization
68
ARM2 register cell circuit
69
ARM register bank floorplan
70
ARM core datapath buses
71
ARM control logic structure