Chapter 15 IA-64 Architecture - PowerPoint PPT Presentation

Provided by: adria216

Learn more at: http://faculty.washington.edu

1
Chapter 15 IA-64 Architecture
2
Reflection on Superscalar Machines

Superscalar machine
A superscalar machine employs multiple
independent pipelines to execute multiple
independent instructions in parallel.
Particularly common instructions (arithmetic,
load/store, conditional branch) can be executed
independently.
Superpipelined machine
Superpipelined machines overlap pipeline stages.
Relies on each stage being able to begin its
operation before the previous one is complete.

3
Reflecting on Superscalar Machines
Example
4
Reflecting on Superscalar Machines
Superscalar vs Superpipelined
5
Reflection on Superscalar Machines

Challenges
Data Dependencies
Requires reordering of instructions
Procedural Dependencies
Requires reordering of fetch, execute, updating
of registers
Requires register renaming
Requires committing or retiring instructions
Resource Conflicts
Requires reordering of instructions
Superscaling has scaling challenges
Control complexity increases exponentially
Time delay increases exponentially

6
IA-64 Background

Explicitly Parallel Instruction Computing (EPIC)
- Jointly developed by Intel and
Hewlett-Packard (HP)
New 64 bit architecture
Not extension of x86
Not adaptation of HP 64bit RISC architecture
To exploit increasing chip transistors and
increasing speeds
Utilizes systematic parallelism
Departure from superscalar
Note: has become the architecture of the Intel
Itanium

7
Why This New Architecture?

Processor designers' obvious choices for use of
increasing number of transistors on chip and
extra speed:
Bigger caches -> diminishing returns
Increase degree of superscaling by adding more
execution units -> complexity wall: more logic,
need improved branch prediction, more renaming
registers, more complicated dependencies
Multiple processors -> challenge to use them
effectively in general computing

8
Design Approach Explicit Parallelism
Compiler statically schedules instructions at
compile time, rather than processor dynamically
scheduling them at run time.

Compiler has vision of whole program and what is
coming
Increase the execution units and use them
effectively
Reduce dynamic reconfigurations
Avoid exponentially increasing complex circuitry

9
Basic Concepts for IA-64

Instruction level parallelism
Explicit in machine instruction rather than
determined at run time by processor
Long or very long instruction words (LIW/VLIW)
Fetch bigger chunks already preprocessed
Branch predication (not the same as branch
prediction)
Go ahead and fetch and decode instructions, but
keep track of them so the decision to issue
them, or not, can be practically made later
Speculative loading
Go ahead and load data so it is ready when needed,
and have a practical way to recover if
speculation proves wrong

10
Superscalar v IA-64
11
General Organization
12
Intel's Itanium Implements the IA-64
13
IA-64 Key Features

Large number of registers
IA-64 instruction format assumes 256 registers
128 64-bit integer, logical and general purpose
128 82-bit floating point and graphic
64 1-bit predicated execution registers
(To support high degree of parallelism)
Multiple execution units
8 or more

14
Predicate Registers

Used as a flag for instructions that may or may
not be executed.
A set of instructions is assigned a predicate
register when it is uncertain whether the
instruction sequence will actually be executed
(think branch).
Only instructions with a predicate value of true
are executed.
When it is known that the instruction is going to
be executed, its predicate is set. All
instructions with that predicate true can now be
completed.
Those instructions with predicate false are now
candidates for cleanup.
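
The predicate mechanism above can be sketched in a few lines of Python. This is an illustrative model only, not how the hardware is built: the function name `run_predicated`, the tuple layout, and the register values are all assumptions made for the example. Each instruction carries a qualifying predicate; its result is committed only when that predicate is true.

```python
def run_predicated(program, regs, preds):
    """Run (qp, dest, fn) triples; commit a result only if predicate qp is true.

    Hypothetical model: 'fn' computes the result from the register dict.
    """
    for qp, dest, fn in program:
        result = fn(regs)        # the operation itself may be carried out either way
        if preds[qp]:            # predicate true: commit the result
            regs[dest] = result
        # predicate false: the result is simply discarded
    return regs


# p0 is always 1, so instructions predicated on p0 are effectively unpredicated.
preds = {0: True, 1: True, 2: False}
program = [
    (1, "r1", lambda r: r["r1"] + 1),    # (p1) add r1 = r1, 1
    (2, "r2", lambda r: r["r2"] - 1),    # (p2) sub r2 = r2, 1
    (0, "r3", lambda r: r["r3"] + 10),   # unpredicated add
]
regs = run_predicated(program, {"r1": 5, "r2": 5, "r3": 0}, preds)
print(regs)   # {'r1': 6, 'r2': 5, 'r3': 10}
```

Only the instructions guarded by p1 and p0 change the register state; the p2 instruction leaves r2 untouched.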

15
IA-64 Execution Units

I-Unit
Integer arithmetic
Shift and add
Logical
Compare
Integer multimedia ops
M-Unit
Load and store
Between register and memory
Some integer ALU operations
B-Unit
Branch instructions
F-Unit
Floating point instructions

16
Relationship between Instruction Type &
Execution Unit
17
Instruction Format

128 bit bundles
Can fetch one or more bundles at a time
Bundle holds three instructions plus template
Instructions are usually 41 bits long
Have associated predicated execution registers
Template contains info on which instructions can
be executed in parallel
Not confined to single bundle
e.g. a stream of 8 instructions may be executed
in parallel
Compiler will have re-ordered instructions to
form contiguous bundles
Can mix dependent and independent instructions in
same bundle
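
As a rough illustration of the bundle layout, the sketch below packs and unpacks a 128-bit bundle as three 41-bit instruction slots plus a template, which leaves 5 bits for the template (128 - 3 x 41). The placement of the template in the low-order bits, the helper names `pack_bundle` and `unpack_bundle`, and the sample values are assumptions for this example; the slides only give the field sizes.

```python
TEMPLATE_BITS = 5                      # 128 - 3 * 41
SLOT_BITS = 41
SLOT_MASK = (1 << SLOT_BITS) - 1
TEMPLATE_MASK = (1 << TEMPLATE_BITS) - 1


def pack_bundle(template, slots):
    """Combine a template and three 41-bit slots into one 128-bit integer."""
    if not (0 <= template <= TEMPLATE_MASK) or len(slots) != 3:
        raise ValueError("need a 5-bit template and exactly three slots")
    value = template                   # assumed: template sits in the low bits
    for i, slot in enumerate(slots):
        if not (0 <= slot <= SLOT_MASK):
            raise ValueError("slot does not fit in 41 bits")
        value |= slot << (TEMPLATE_BITS + i * SLOT_BITS)
    return value


def unpack_bundle(value):
    """Split a 128-bit integer back into (template, [slot0, slot1, slot2])."""
    template = value & TEMPLATE_MASK
    slots = [(value >> (TEMPLATE_BITS + i * SLOT_BITS)) & SLOT_MASK
             for i in range(3)]
    return template, slots


bundle = pack_bundle(0x10, [1, 2, SLOT_MASK])   # arbitrary sample values
print(bundle.bit_length())                      # 128
print(unpack_bundle(bundle) == (0x10, [1, 2, SLOT_MASK]))   # True
```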

18
Instruction Format Diagram
19
Field Encoding & Instr Set Mapping
Note: BAR indicates stops. Possible dependencies
with instructions after the stop
20
Assembly Language Format

[qp] mnemonic[.comp] dest = srcs //
qp - predicate register
1 at execution -> execute and commit result to
hardware
0 -> result is discarded
mnemonic - name of instruction
comp - one or more instruction completers used to
qualify mnemonic
dest - one or more destination operands
srcs - one or more source operands
;; - instruction group stops (when
appropriate)
Sequence without read after write or write after
write
Do not need hardware register dependency checks
// - comment follows

21
Assembly Example
Register Dependency

ld8 r1 = [r5] ;; //first group
add r3 = r1, r4 //second group
Second instruction depends on value in r1
Changed by first instruction
Cannot be in same group for parallel execution
Note: ;; ends the group of instructions that can
be executed in parallel

22
Assembly Example
Multiple Register Dependencies

ld8 r1 = [r5] //first group
sub r6 = r8, r9 ;; //first group
add r3 = r1, r4 //second group
st8 [r6] = r12 //second group
Last instruction stores in the memory location
whose address is in r6, which is established in
the second instruction
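
The grouping rule from the two examples above (no read after write or write after write inside a group) can be sketched as a small pass over an instruction list. This is a simplified illustration, not a real IA-64 assembler: the function `split_into_groups` and the (text, destinations, sources) tuple layout are assumptions, and memory dependencies are ignored.

```python
def split_into_groups(instrs):
    """Split instructions into groups, as if inserting ';;' stops.

    instrs: list of (text, dest_regs, src_regs). A new group starts when an
    instruction reads or writes a register already written in the current
    group (read-after-write or write-after-write).
    """
    groups, current, written = [], [], set()
    for text, dests, srcs in instrs:
        if written & (set(srcs) | set(dests)):
            groups.append(current)          # place a stop before this instruction
            current, written = [], set()
        current.append(text)
        written |= set(dests)
    if current:
        groups.append(current)
    return groups


program = [
    ("ld8 r1 = [r5]",   ["r1"], ["r5"]),
    ("sub r6 = r8, r9", ["r6"], ["r8", "r9"]),
    ("add r3 = r1, r4", ["r3"], ["r1", "r4"]),
    ("st8 [r6] = r12",  [],     ["r6", "r12"]),   # r6 is read as the address
]
groups = split_into_groups(program)
for n, group in enumerate(groups, start=1):
    print(n, group)
```

For the four-instruction example this yields two groups, with the stop falling after the sub, which matches the first group / second group comments on the slide.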

23
Predication
24
Speculative Loading
25
Assembly Example Predicated Code
Consider the following program with branches:

if (a && b)
    j = j + 1;
else
    if (c)
        k = k + 1;
    else
        k = k - 1;
i = i + 1;

26
Assembly Example Predicated Code
Pentium Assembly Code

    cmp a, 0    ; compare with 0
    je L1       ; branch to L1 if a = 0
    cmp b, 0
    je L1
    add j, 1    ; j = j + 1
    jmp L3
L1: cmp c, 0
    je L2
    add k, 1    ; k = k + 1
    jmp L3
L2: sub k, 1    ; k = k - 1
L3: add i, 1    ; i = i + 1

Source Code

if (a && b)
    j = j + 1;
else
    if (c)
        k = k + 1;
    else
        k = k - 1;
i = i + 1;

27
Assembly Example Predicated Code
Pentium Code

    cmp a, 0
    je L1
    cmp b, 0
    je L1
    add j, 1
    jmp L3
L1: cmp c, 0
    je L2
    add k, 1
    jmp L3
L2: sub k, 1
L3: add i, 1

IA-64 Code

     cmp.eq p1, p2 = 0, a
(p2) cmp.eq p1, p3 = 0, b
(p3) add j = 1, j
(p1) cmp.ne p4, p5 = 0, c
(p4) add k = 1, k
(p5) add k = -1, k
     add i = 1, i

Source Code

if (a && b)
    j = j + 1;
else
    if (c)
        k = k + 1;
    else
        k = k - 1;
i = i + 1;
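
To check that the predicated IA-64 sequence computes the same result as the branching source, the sketch below models both in Python. Two assumptions are made that the slide does not state: all predicates start out false, and a compare whose own predicate is false leaves its target predicates unchanged. The Python `if p[n]:` lines stand in for the hardware's per-instruction predicate check; the function names are made up for the example.

```python
def branching(a, b, c, i, j, k):
    """The source program, written with ordinary branches."""
    if a and b:
        j = j + 1
    elif c:
        k = k + 1
    else:
        k = k - 1
    i = i + 1
    return i, j, k


def predicated(a, b, c, i, j, k):
    """The IA-64 sequence, one line per instruction, no jumps between them."""
    p = [False] * 6                       # p[1]..p[5], assumed cleared on entry
    p[1], p[2] = (a == 0), (a != 0)       #      cmp.eq p1, p2 = 0, a
    if p[2]:
        p[1], p[3] = (b == 0), (b != 0)   # (p2) cmp.eq p1, p3 = 0, b
    if p[3]:
        j = j + 1                         # (p3) add j = 1, j
    if p[1]:
        p[4], p[5] = (c != 0), (c == 0)   # (p1) cmp.ne p4, p5 = 0, c
    if p[4]:
        k = k + 1                         # (p4) add k = 1, k
    if p[5]:
        k = k - 1                         # (p5) add k = -1, k
    i = i + 1                             #      add i = 1, i
    return i, j, k


values = (0, 1, 5)
same = all(branching(a, b, c, 0, 0, 0) == predicated(a, b, c, 0, 0, 0)
           for a in values for b in values for c in values)
print(same)   # True
```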

28
Example of Predication
29
Control & Data Speculation

Control
AKA Speculative loading
Load data from memory before needed
Data
Load moved before store that might alter memory
location
Subsequent check on value

30
Assembly Example Control Speculation
Consider the following program:

(p1) br some_label // cycle 0
ld8 r1 = [r5] // cycle 1
add r1 = r1, r3 // cycle 3

31
Assembly Example Control Speculation
Consider the following program:

Original code

(p1) br some_label //cycle 0
ld8 r1 = [r5] //cycle 1
add r1 = r1, r3 //cycle 3

Speculated code

ld8.s r1 = [r5] //cycle -2
// other instructions
(p1) br some_label //cycle 0
chk.s r1, recovery //cycle 0
add r2 = r1, r3 //cycle 0
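
A small Python model may help show why the speculative load is safe to hoist above the branch. It is a sketch under assumptions: the sentinel `DEFERRED` stands in for whatever marker the hardware attaches to a register when a speculative load faults, a missing dictionary key stands in for a faulting address, and the function names are invented for the example.

```python
DEFERRED = object()   # hypothetical marker for "this speculative load faulted"


def ld_s(memory, addr):
    """Speculative load: a fault is recorded in the result, not raised."""
    return memory.get(addr, DEFERRED)


def chk_s(value, recovery):
    """Check: if the load deferred a fault, run the recovery code instead."""
    return recovery() if value is DEFERRED else value


def normal_load(memory, addr):
    """Ordinary load: a bad address raises KeyError (standing in for a fault)."""
    return memory[addr]


def speculated(memory, r5, r3, p1):
    r1 = ld_s(memory, r5)                 # ld8.s r1 = [r5], hoisted above branch
    if p1:                                # (p1) br some_label
        return "branch taken"             # loaded value is never used
    r1 = chk_s(r1, lambda: normal_load(memory, r5))   # chk.s r1, recovery
    return r1 + r3                        # add r2 = r1, r3


print(speculated({0x100: 7}, 0x100, 3, p1=False))   # 10
print(speculated({}, 0x100, 3, p1=True))            # branch taken
```

In the second call the address is invalid, but because the branch is taken the fault never surfaces. If the branch is not taken, the check reruns the load non-speculatively and the fault is raised at that point.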

32
Assembly Example Data Speculation
Consider the following program:

st8 [r4] = r12 //cycle 0
ld8 r6 = [r8] //cycle 0
add r5 = r6, r7 //cycle 2
st8 [r18] = r5 //cycle 3

What if r4 and r8 point to the same address?
33
Assembly Example Data Speculation
Consider the following program:

Without data speculation

st8 [r4] = r12 //cycle 0
ld8 r6 = [r8] //cycle 0
add r5 = r6, r7 //cycle 2
st8 [r18] = r5 //cycle 3

With data speculation

ld8.a r6 = [r8] //cycle -2, advanced load
// other instructions
st8 [r4] = r12 //cycle 0
ld8.c r6 = [r8] //cycle 0, check
add r5 = r6, r7 //cycle 0
st8 [r18] = r5 //cycle 1

What if r4 and r8 point to the same address?
34
Assembly Example Data Speculation
Speculation with data dependency

ld8.a r6 = [r8] //cycle -3, adv ld
// other instructions
add r5 = r6, r7 //cycle -1, uses r6
// other instructions
st8 [r4] = r12 //cycle 0
chk.a r6, recover //cycle 0, check
back: //return pt
st8 [r18] = r5 //cycle 0
recover:
ld8 r6 = [r8] //reload r6 from [r8]
add r5 = r6, r7 //re-execute
br back //jump back

Data speculation

ld8.a r6 = [r8] //cycle -2
// other instructions
st8 [r4] = r12 //cycle 0
ld8.c r6 = [r8] //cycle 0
add r5 = r6, r7 //cycle 0
st8 [r18] = r5 //cycle 1
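
The advanced load and its check can be modelled with a small table that remembers which addresses were loaded early. This is a hedged sketch: the class `AdvancedLoadTable`, its methods, and the memory contents are assumptions for illustration, and the slides do not describe the actual hardware structure. A store to a tracked address drops the entry, and the check load reloads the register only when its entry is gone.

```python
class AdvancedLoadTable:
    """Hypothetical model of the bookkeeping behind ld.a / ld.c."""

    def __init__(self):
        self.entries = {}                 # register name -> address loaded early

    def ld_a(self, regs, memory, reg, addr):
        regs[reg] = memory[addr]          # advanced load, done before the store
        self.entries[reg] = addr

    def st(self, memory, addr, value):
        memory[addr] = value
        # a store to a tracked address invalidates the matching entries
        self.entries = {r: a for r, a in self.entries.items() if a != addr}

    def ld_c(self, regs, memory, reg, addr):
        if reg in self.entries:
            return False                  # speculation held, nothing to do
        regs[reg] = memory[addr]          # entry gone: redo the load
        return True


def run(r4, r8, r18=0x300):
    memory = {0x100: 1, 0x200: 2, 0x300: 0}
    regs = {"r7": 10, "r12": 99}
    table = AdvancedLoadTable()
    table.ld_a(regs, memory, "r6", r8)               # ld8.a r6 = [r8]
    table.st(memory, r4, regs["r12"])                # st8 [r4] = r12
    reloaded = table.ld_c(regs, memory, "r6", r8)    # ld8.c r6 = [r8]
    regs["r5"] = regs["r6"] + regs["r7"]             # add r5 = r6, r7
    table.st(memory, r18, regs["r5"])                # st8 [r18] = r5
    return memory[r18], reloaded


print(run(r4=0x100, r8=0x200))   # (12, False): no conflict, early value used
print(run(r4=0x200, r8=0x200))   # (109, True): conflict, load redone
```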
35
Software Pipelining

L1: ld4 r4 = [r5], 4 //cycle 0 load postinc 4
add r7 = r4, r9 //cycle 2
st4 [r6] = r7, 4 //cycle 3 store postinc 4
br.cloop L1 //cycle 3
Adds constant to one vector and stores result in
another
No opportunity for instruction level parallelism
Instructions in iteration x all executed before
iteration x+1 begins
If no address conflicts between loads and stores,
can move independent instructions from loop x+1
to loop x

36
Unrolled Loop

ld4 r32 = [r5], 4 //cycle 0
ld4 r33 = [r5], 4 //cycle 1
ld4 r34 = [r5], 4 //cycle 2
add r36 = r32, r9 //cycle 2
ld4 r35 = [r5], 4 //cycle 3
add r37 = r33, r9 //cycle 3
st4 [r6] = r36, 4 //cycle 3
ld4 r36 = [r5], 4 //cycle 4
add r38 = r34, r9 //cycle 4
st4 [r6] = r37, 4 //cycle 4
add r39 = r35, r9 //cycle 5
st4 [r6] = r38, 4 //cycle 5
add r40 = r36, r9 //cycle 6
st4 [r6] = r39, 4 //cycle 6
st4 [r6] = r40, 4 //cycle 7

37
Unrolled Loop Detail

Completes 5 iterations in 7 cycles
Compared with 20 cycles in original code
Assumes two memory ports
Load and store can be done in parallel

38
Software Pipeline Example Diagram
39
Support For Software Pipelining

Automatic register renaming
Fixed size area of predicate and fp register file
(p16-p63, fr32-fr127) and programmable size area
of gp register file (max r32-r127) capable of
rotation
Loop using r32 on first iteration automatically
uses r33 on second
Predication
Each instruction in loop predicated on rotating
predicate register
Determines whether pipeline is in prolog, kernel,
or epilog
Special loop termination instructions
Branch instructions that cause registers to
rotate and loop counter to decrement
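
Register rotation can be pictured as an offset added to every register number in the rotating area. The sketch below is a simplified model under assumptions: the class `RotatingFile`, a single base counter `rrb`, and a full 96-register rotating area are illustrative choices, not a description of the real register file.

```python
ROT_BASE = 32      # first rotating general register (r32)
ROT_SIZE = 96      # r32-r127


class RotatingFile:
    """Hypothetical model of a rotating register area."""

    def __init__(self):
        self.phys = [0] * ROT_SIZE
        self.rrb = 0                       # rotating register base (offset)

    def _index(self, logical):
        return (logical - ROT_BASE + self.rrb) % ROT_SIZE

    def read(self, logical):
        return self.phys[self._index(logical)]

    def write(self, logical, value):
        self.phys[self._index(logical)] = value

    def rotate(self):
        """Done by the loop branch: every value moves up one register name."""
        self.rrb = (self.rrb - 1) % ROT_SIZE


regs = RotatingFile()
regs.write(32, 111)        # iteration 1 loads into r32
regs.rotate()              # loop branch rotates the registers
regs.write(32, 222)        # iteration 2 also writes "r32", a different register
print(regs.read(33))       # 111, the first iteration's value is now r33
print(regs.read(32))       # 222
```

The same instruction text can therefore be reused in every iteration while each iteration works on its own physical registers.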

40
IA-64 Register Set
41
IA-64 Registers (1)

General Registers
128 gp 64 bit registers
r0-r31 static
references interpreted literally
r32-r127 can be used as rotating registers for
software pipeline or register stack
References are virtual
Hardware may rename dynamically
Floating Point Registers
128 fp 82 bit registers
Will hold IEEE 754 double extended format
fr0-fr31 static, fr32-fr127 can be rotated for
pipeline
Predicate registers
64 1 bit registers used as predicates
pr0 always 1 to allow unpredicated instructions
pr1-pr15 static, pr16-pr63 can be rotated

42
IA-64 Registers (2)

Branch registers
8 64 bit registers
Instruction pointer
Bundle address of currently executing instruction
Current frame marker
State info relating to current general register
stack frame
Rotation info for fr and pr
User mask
Set of single bit values
Alignment traps, performance monitors, fp
register usage monitoring
Performance monitoring data registers
Support performance monitoring hardware
Application registers
Special purpose registers

43
Register Stack

Avoids unnecessary movement of data at procedure
call and return
Provides procedure with new frame up to 96
registers on entry
r32-r127
Compiler specifies required number
Local
Output
Registers renamed so local registers from
previous frame hidden
Output registers from calling procedure now have
numbers starting r32
Physical registers r32-r127 allocated in circular
buffer to virtual registers
Hardware moves register contents between
registers and memory if more registers needed
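
The renaming done at a procedure call can be sketched as moving a base pointer through a circular buffer. This is a minimal model under assumptions: the class `RegisterStack` and its method names are invented, only the local/output split is tracked, and the hardware spill of registers to memory is not modelled.

```python
NUM_PHYS = 96      # physical registers backing r32-r127


class RegisterStack:
    """Hypothetical model of register stack frames in a circular buffer."""

    def __init__(self):
        self.phys = [0] * NUM_PHYS
        self.base = 0              # physical index of the current frame's r32
        self.n_local = 0           # number of local registers in this frame
        self.saved = []            # (base, n_local) of calling procedures

    def alloc(self, n_local):
        """The compiler states how many local registers the frame needs."""
        self.n_local = n_local

    def _index(self, virtual):
        return (self.base + virtual - 32) % NUM_PHYS

    def read(self, virtual):
        return self.phys[self._index(virtual)]

    def write(self, virtual, value):
        self.phys[self._index(virtual)] = value

    def call(self):
        """Caller's locals are hidden; its output registers start at r32."""
        self.saved.append((self.base, self.n_local))
        self.base = (self.base + self.n_local) % NUM_PHYS
        self.n_local = 0

    def ret(self):
        self.base, self.n_local = self.saved.pop()


rs = RegisterStack()
rs.alloc(n_local=4)        # caller: locals r32-r35, outputs from r36
rs.write(32, 7)            # a caller local
rs.write(36, 42)           # first output register, the argument
rs.call()
print(rs.read(32))         # 42, the callee sees the argument as r32
rs.ret()
print(rs.read(32))         # 7, the caller's local is untouched
```

No data is copied at the call or the return; only the mapping from register names to physical registers changes.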