AMD Opteron Overview - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: AMD Opteron Overview

1
AMD Opteron Overview
  • Michael Trotter (mjt5v)
  • Tim Kang (tjk2n)
  • Jeff Barbieri (jjb3v)

2
Introduction
  • AMD Opteron
  • Focuses on Barcelona
  • Barcelona is AMD's 65nm 4-core CPU

3
Fetch
  • Fetches 32B from L1 cache to pre-decode/Pick
    buffer
  • For simplicity, Barcelona uses pre-decode
    information to mark the end of each instruction.
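The boundary-marking idea can be sketched as follows. This is an illustrative model, not Barcelona's actual pre-decode logic; the instruction lengths and window size are made up:

```python
# Hypothetical sketch of end-of-instruction pre-decode marking for a
# variable-length ISA like x86. One bit per byte of the fetch window
# marks the last byte of each instruction, so later pick logic can
# split the raw byte stream without re-scanning it.

def mark_instruction_ends(lengths, window=32):
    """lengths: byte lengths of consecutive instructions.
    Returns one bit per byte of the fetch window; a 1 marks the
    final byte of an instruction."""
    end_bits = [0] * window
    pos = -1
    for n in lengths:
        pos += n
        if pos >= window:
            break  # instruction straddles the window boundary
        end_bits[pos] = 1
    return end_bits

bits = mark_instruction_ends([3, 1, 5, 2], window=16)
# end bytes fall at offsets 2, 3, 8 and 10
```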

4
Inst. Decode
  • The instruction cache contains a pre-decoder
    which scans 4B of the instruction stream each
    cycle
  • Inserts pre-decode information from the ECC bits
    of the L1I, L2 and L3 caches, along with each
    line of instructions
  • Instructions are then passed through the sideband
    stack optimizer
  • x86 includes instructions (e.g. push, pop, call,
    ret) that directly manipulate each thread's stack
  • AMD introduced a side-band stack optimizer to
    remove these stack manipulations from the
    instruction stream
  • Thus, many stack operations can be processed in
    parallel
  • Frees up the reservation stations, re-order
    buffers, and regular ALUs for other work
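A sideband stack optimizer can be illustrated with a toy model: track the running stack-pointer delta so push/pop adjustments never occupy a regular ALU. This is a generic sketch of the idea, not AMD's actual implementation:

```python
# Illustrative sketch of sideband stack optimization: instead of a
# serial chain of rsp adjustments, each stack micro-op is annotated
# with a pre-computed rsp-relative offset, and a single final rsp
# adjustment is committed at the end.

def resolve_stack_ops(ops, word=8):
    """ops: list of ('push'|'pop'|'other', payload) tuples.
    Returns each op annotated with the rsp-relative offset it should
    use, plus one final rsp adjustment to commit."""
    delta = 0
    resolved = []
    for kind, payload in ops:
        if kind == 'push':
            delta -= word                        # rsp moves down first
            resolved.append(('store', delta, payload))
        elif kind == 'pop':
            resolved.append(('load', delta, payload))
            delta += word                        # rsp moves up after
        else:
            resolved.append(('other', delta, payload))
    return resolved, delta

ops = [('push', 'rax'), ('push', 'rbx'), ('pop', 'rcx')]
resolved, final_delta = resolve_stack_ops(ops)
# the three stack ops no longer depend serially on rsp updates
```

Because each op now carries an absolute offset, the stack accesses can execute in parallel, matching the "many stack operations can be processed in parallel" point above.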

5
Branch Prediction
  • Branch selector chooses between a bi-modal
    predictor and a global predictor
  • The bi-modal predictor and branch selector are
    both stored in the ECC bits of the instruction
    cache, as pre-decode information
  • The global predictor combines the relative
    instruction pointer (RIP) for a conditional
    branch with a global history register
  • Tracks the last 12 branches with a 16K-entry
    prediction table of 2-bit saturating counters
  • The branch target address calculator (BTAC)
    checks the targets for relative branches
  • Can correct mis-predictions with a two-cycle
    penalty.
  • Barcelona uses an indirect predictor
  • Specifically designed to handle branches with
    multiple targets (e.g. switch or case statements)
  • Return address stack has 24 entries
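A toy model of the global predictor described above: a 12-bit global history register combined with the branch address indexes a 16K-entry table of 2-bit saturating counters. The XOR hash is a common textbook choice and an assumption here, not necessarily Barcelona's exact indexing:

```python
# Simplified global branch predictor: 12 bits of history, 16K-entry
# table of 2-bit saturating counters (0-1 predict not-taken,
# 2-3 predict taken).

TABLE_SIZE = 16 * 1024

class GlobalPredictor:
    def __init__(self):
        self.table = [1] * TABLE_SIZE  # start weakly not-taken
        self.history = 0               # outcomes of last 12 branches

    def _index(self, rip):
        # hash branch address with global history (assumed XOR hash)
        return (rip ^ self.history) % TABLE_SIZE

    def predict(self, rip):
        return self.table[self._index(rip)] >= 2  # True = taken

    def update(self, rip, taken):
        i = self._index(rip)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)  # saturate at 3
        else:
            self.table[i] = max(0, self.table[i] - 1)  # saturate at 0
        # shift outcome into the 12-bit history register
        self.history = ((self.history << 1) | int(taken)) & 0xFFF
```

The saturating counters give hysteresis: a single anomalous outcome does not immediately flip a well-trained prediction.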

6
Pipeline
  • Uses a 12 stage pipeline

7
OO (ROB)
  • The Pack Buffer (post-decoding buffer) sends
    groups of 3 micro-ops to the re-order buffer
    (ROB)
  • The re-order buffer contains 24 entries, with 3
    lanes per entry
  • Holds a total of 72 instructions
  • Instructions can be moved between lanes to avoid
    a congested reservation station or to observe
    issue restrictions
  • From the ROB, instructions issue to the
    appropriate scheduler
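The dispatch step above can be sketched as a small model: fixed groups of three micro-ops enter a 24-entry, 3-lane ROB (72 micro-ops in flight). The class and method names are illustrative:

```python
# Rough sketch of ROB dispatch as described above: the pack buffer
# hands the ROB groups of three micro-ops (one per lane); the ROB
# holds at most 24 such entries, i.e. 72 micro-ops in flight.

ROB_ENTRIES, LANES = 24, 3

class ReorderBuffer:
    def __init__(self):
        self.entries = []

    def dispatch(self, group):
        assert len(group) == LANES, "always dispatched in groups of 3"
        if len(self.entries) == ROB_ENTRIES:
            return False  # ROB full: front end must stall
        self.entries.append(list(group))
        return True

    def retire_oldest(self):
        # retirement is in program order: oldest group leaves first
        return self.entries.pop(0) if self.entries else None

rob = ReorderBuffer()
ok = rob.dispatch(['add', 'load', 'nop'])
```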

8
ROB
9
Integer Future File and Register File (IFFRF)
  • The IFFRF contains 40 registers broken up into
    three distinct sets
  • The Architectural Register File
  • Contains 16x64 bit non-speculative registers
  • Instructions modify the Architectural Register
    File only when they commit
  • Speculative instructions read from and write to
    the Future File
  • The Future File contains the most recent
    speculative state of the 16 architectural
    registers
  • The last 8 registers are scratchpad registers
    used by the microcode.
  • Should a branch mis-prediction or an exception
    occur, the pipeline rolls back, and the
    architectural register file overwrites the
    contents of the Future File
  • There are three reservation stations, i.e.
    schedulers, within the integer cluster
  • Each station is tied to a specific lane in the
    ROB and holds 8 instructions
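The future-file rollback described above can be modeled in a few lines. Names and sizes are illustrative (the real IFFRF also holds 8 microcode scratchpad registers, omitted here):

```python
# Minimal sketch of a future file / architectural register file pair:
# speculative writes land in the future file, retirement copies them
# into the architectural file, and a mispredict or exception restores
# the future file from the architectural (committed) state.

class FutureFile:
    def __init__(self, nregs=16):
        self.arch = [0] * nregs    # committed, non-speculative state
        self.future = [0] * nregs  # most recent speculative state

    def spec_write(self, reg, value):
        self.future[reg] = value           # speculative result

    def commit(self, reg):
        self.arch[reg] = self.future[reg]  # retirement updates ARF

    def rollback(self):
        # mispredict/exception: ARF overwrites the future file
        self.future = list(self.arch)

ff = FutureFile()
ff.spec_write(3, 42)  # speculative result for r3
ff.rollback()         # the branch turned out to be mispredicted
```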

10
Integer Execution
  • Barcelona uses three symmetric ALUs which can
    execute almost any integer instruction
  • Three full featured ALUs require more die area
    and power
  • Can provide higher performance for certain edge
    cases
  • Enables a simpler design for the ROB and
    schedulers.

11
Floating Point Execution
  • Floating Point operations are first sent to the
    FP Mapper and Renamer
  • In the Renamer, up to 3 FP instructions each
    cycle are assigned a destination register from
    the 120-entry FP register file.
  • Once the micro-ops have been renamed, they may be
    issued to the three FP schedulers
  • Operands can be obtained from either the FP
    register file, or the forwarding network
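The renaming step above can be sketched with a generic free-list renamer: up to three FP micro-ops per cycle each receive a free physical register out of 120. The free-list mechanics are textbook-style assumptions, not AMD's documented design:

```python
# Sketch of per-cycle FP register renaming: a free list of 120
# physical registers, a rename width of 3, and a map from
# architectural to physical registers.

PRF_SIZE, RENAME_WIDTH = 120, 3

class FPRenamer:
    def __init__(self):
        self.free = list(range(PRF_SIZE))  # all registers start free
        self.map = {}  # architectural reg -> physical reg

    def rename_cycle(self, dests):
        """Rename up to RENAME_WIDTH destination registers in one
        cycle; any further ops must wait for the next cycle."""
        out = []
        for arch_reg in dests[:RENAME_WIDTH]:
            phys = self.free.pop(0)        # grab a free physical reg
            self.map[arch_reg] = phys
            out.append((arch_reg, phys))
        return out

r = FPRenamer()
renamed = r.rename_cycle(['xmm0', 'xmm1', 'xmm2', 'xmm3'])  # 4th waits
```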

12
Floating Point Execution (SIMD)
  • The FPUs are 128 bits wide so that Streaming SIMD
    Extension (SSE) instructions can execute in a
    single pass.
  • Similarly, the load-store units, and the FMISC
    unit load 128 bit wide data, to improve SSE
    performance.

13
Memory Overview
14
Memory Hierarchy
  • L1: 4 separate 128KB (64KB instruction + 64KB
    data per core) 2-way set-associative caches
  • Latency 3 cycles
  • Write-back to L2
  • The data paths into and out of the L1D cache
    were also widened to 256 bits (128 bits transmit
    and 128 bits receive)
  • L2: 4 separate 512KB 16-way set-associative
    caches
  • Latency 12 cycles
  • Line size is 64B
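As a worked example of these parameters: a 512KB, 16-way, 64B-line L2 has 512KB / (64B x 16 ways) = 512 sets, so the set index comes from the address bits just above the 6-bit line offset. The helper names below are illustrative:

```python
# Address decomposition for the 512KB, 16-way, 64B-line L2 described
# above: offset = addr mod 64, set = next log2(512) = 9 bits, tag =
# the remaining upper bits.

LINE = 64
WAYS = 16
SIZE = 512 * 1024
SETS = SIZE // (LINE * WAYS)  # 512 sets

def l2_set_index(addr):
    return (addr // LINE) % SETS

def l2_tag(addr):
    return addr // (LINE * SETS)
```

Two addresses exactly LINE * SETS (32KB) apart map to the same set and compete for its 16 ways.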

15
L3 Cache
  • Shared 2MB 32-way set associative L3
  • Latency 38 cycles
  • Uses 64B lines
  • The L3 cache was designed with data sharing in
    mind
  • When a line is requested, if it is likely to be
    shared, then it will remain in the L3
  • This leads to duplication which would not happen
    in an exclusive hierarchy
  • In the past, a pseudo-LRU algorithm would evict
    the (approximately) least-recently-used line in
    the cache.
  • In Barcelona's L3, the replacement algorithm has
    been changed to prefer evicting unshared lines
  • Access to the L3 must be arbitrated since the L3
    is shared between four different cores
  • A round-robin algorithm is used to give access
    to one of the four cores each cycle.
  • Each core has 8 data prefetchers (a total of 32
    per device)
  • Fill the L1D cache
  • Can have up to 2 outstanding fetches to any
    address
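The sharing-aware replacement policy can be sketched as a two-step victim selection: prefer the oldest unshared line, and fall back to plain age-based eviction only when every way is shared. This is a toy model of the stated preference, not AMD's actual circuit:

```python
# Toy victim selection for a sharing-aware L3 replacement policy:
# evict the oldest line that no other core has requested; if all
# lines are shared, fall back to the oldest line overall.

def pick_victim(lines):
    """lines: one dict per way with 'shared' (bool) and 'age'
    (higher = older / less recently used). Returns the index of the
    way to evict."""
    unshared = [i for i, l in enumerate(lines) if not l['shared']]
    candidates = unshared if unshared else range(len(lines))
    return max(candidates, key=lambda i: lines[i]['age'])

ways = [
    {'shared': True,  'age': 9},   # oldest, but shared: protected
    {'shared': False, 'age': 2},
    {'shared': False, 'age': 5},   # oldest unshared line: victim
]
victim = pick_victim(ways)
```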

16
Memory Controllers
  • Each memory controller supports independent 64B
    transactions
  • The integrated DDR2 memory controller ensures
    that an L3 cache miss is resolved in less than
    60 nanoseconds

17
TLB
  • Barcelona offers non-speculative memory access
    re-ordering in the form of Load Store Units (LSU)
  • Thus, some memory operations can be issued
    out-of-order
  • In the 12 entry LSU1, the oldest operations
    translate their addresses from the virtual
    address space to the physical address space using
    the L1 DTLB
  • During this translation, the lower 12 bits of
    the load operation's address are tested against
    previously stored addresses
  • If they are different, then the load proceeds
    ahead of the store
  • If they are the same, load-store forwarding
    occurs
  • Should a miss in the L1 DTLB occur, the L2 DTLB
    will be checked
  • Once the load or store has located its address
    in the cache, the operation moves on to LSU2.
  • LSU2 holds up to 32 memory accesses, where they
    stay until they are removed
  • The LSU2 handles any cache or TLB misses via
    scheduling and probing
  • In the case of a cache miss, the LSU2 will then
    look in the L2, L3 and then memory
  • In the case of TLB misses, it will look in the L2
    TLB and then main memory
  • The LSU2 also holds store instructions, which are
    not allowed to actually modify the caches until
    retirement to ensure correctness
  • Thus, the LSU2 absorbs most of the complexity
    of the memory pipeline
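The 12-bit disambiguation check above exploits the fact that the low 12 bits (the 4KB page offset) are unchanged by virtual-to-physical translation, so they can be compared before the full translation completes. A simplified sketch, with invented names:

```python
# Sketch of load/store disambiguation on the low 12 address bits:
# a load is checked against older, not-yet-retired stores. A match
# triggers store-to-load forwarding; otherwise the load may proceed
# ahead of the stores.

PAGE_MASK = 0xFFF  # page offset: survives virtual->physical translation

def check_load(load_addr, older_stores):
    """older_stores: (addr, data) pairs for stores still buffered in
    LSU2, oldest first. Returns ('forward', data) on a match, else
    ('proceed', None)."""
    for store_addr, data in reversed(older_stores):  # youngest match wins
        if (store_addr & PAGE_MASK) == (load_addr & PAGE_MASK):
            return ('forward', data)
    return ('proceed', None)

stores = [(0x7f001230, 'old'), (0x7f005230, 'new')]
result = check_load(0x1230, stores)  # low bits 0x230 match both stores
```

Note the check is conservative: two different pages can share the same low 12 bits, so a "match" may forward unnecessarily, but a "no match" safely lets the load go ahead of the store.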

18
Hypertransport
  • Barcelona has four HyperTransport 3.0 lanes for
    inter-processor communications and I/O devices
  • HyperTransport 3.0 adds a feature called
    unganging or lane-splitting
  • The HT3.0 links are composed of two 16-bit
    lanes (in both directions)
  • Each can be split up into a pair of independent
    8-bit wide links

19
Shanghai
  • The latest model of the Opteron series
  • Several improvements over Barcelona
  • 45nm
  • 6MB L3 cache
  • Improved clock speeds
  • A host of other improvements