AMD Opteron Overview - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: AMD Opteron Overview

1
AMD Opteron Overview
  • Michael Trotter (mjt5v)
  • Tim Kang (tjk2n)
  • Jeff Barbieri (jjb3v)

2
Introduction
  • AMD Opteron
  • Focuses on Barcelona
  • Barcelona is AMD's 65nm 4-core CPU

3
Fetch
  • Fetches 32B from L1 cache to pre-decode/Pick
    buffer
  • For simplicity, Barcelona uses pre-decode
    information to mark the end of each instruction.
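The boundary-marking idea can be sketched as follows. This is an illustrative model, not Barcelona's actual pre-decode logic; the instruction lengths and window size are made up:

```python
# Hypothetical sketch of end-of-instruction pre-decode marking for a
# variable-length ISA like x86. One bit per byte of the fetch window
# marks the last byte of each instruction, so later pick logic can
# split the raw byte stream without re-scanning it.

def mark_instruction_ends(lengths, window=32):
    """lengths: byte lengths of consecutive instructions.
    Returns one bit per byte of the fetch window; a 1 marks the
    final byte of an instruction."""
    end_bits = [0] * window
    pos = -1
    for n in lengths:
        pos += n
        if pos >= window:
            break  # instruction straddles the window boundary
        end_bits[pos] = 1
    return end_bits

bits = mark_instruction_ends([3, 1, 5, 2], window=16)
# end bytes fall at offsets 2, 3, 8 and 10
```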

4
Inst. Decode
  • The instruction cache contains a pre-decoder
    which scans 4B of the instruction stream each
    cycle
  • Inserts pre-decode information from the ECC bits
    of the L1I, L2 and L3 caches, along with each
    line of instructions
  • Instructions are then passed through the sideband
    stack optimizer
  • x86 includes instructions (e.g. push, pop, call,
    ret) that directly manipulate each thread's stack
  • AMD introduced a side-band stack optimizer to
    remove these stack manipulations from the
    instruction stream
  • Thus, many stack operations can be processed in
    parallel
  • Frees up the reservation stations, re-order
    buffers, and regular ALUs for other work
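A sideband stack optimizer can be illustrated with a toy model: track the running stack-pointer delta so push/pop adjustments never occupy a regular ALU. This is a generic sketch of the idea, not AMD's actual implementation:

```python
# Illustrative sketch of sideband stack optimization: instead of a
# serial chain of rsp adjustments, each stack micro-op is annotated
# with a pre-computed rsp-relative offset, and a single final rsp
# adjustment is committed at the end.

def resolve_stack_ops(ops, word=8):
    """ops: list of ('push'|'pop'|'other', payload) tuples.
    Returns each op annotated with the rsp-relative offset it should
    use, plus one final rsp adjustment to commit."""
    delta = 0
    resolved = []
    for kind, payload in ops:
        if kind == 'push':
            delta -= word                        # rsp moves down first
            resolved.append(('store', delta, payload))
        elif kind == 'pop':
            resolved.append(('load', delta, payload))
            delta += word                        # rsp moves up after
        else:
            resolved.append(('other', delta, payload))
    return resolved, delta

ops = [('push', 'rax'), ('push', 'rbx'), ('pop', 'rcx')]
resolved, final_delta = resolve_stack_ops(ops)
# the three stack ops no longer depend serially on rsp updates
```

Because each op now carries an absolute offset, the stack accesses can execute in parallel, matching the "many stack operations can be processed in parallel" point above.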

5
Branch Prediction
  • Branch selector chooses between a bi-modal
    predictor and a global predictor
  • The bi-modal predictor and branch selector are
    both stored in the ECC bits of the instruction
    cache, as pre-decode information
  • The global predictor combines the relative
    instruction pointer (RIP) for a conditional
    branch with a global history register
  • Tracks the last 12 branches with a 16K-entry
    prediction table of 2-bit saturating counters
  • The branch target address calculator (BTAC)
    checks the targets for relative branches
  • Can correct mis-predictions with a two-cycle
    penalty.
  • Barcelona uses an indirect predictor
  • Specifically designed to handle branches with
    multiple targets (e.g. switch or case statements)
  • Return address stack has 24 entries
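A toy model of the global predictor described above: a 12-bit global history register combined with the branch address indexes a 16K-entry table of 2-bit saturating counters. The XOR hash is a common textbook choice and an assumption here, not necessarily Barcelona's exact indexing:

```python
# Simplified global branch predictor: 12 bits of history, 16K-entry
# table of 2-bit saturating counters (0-1 predict not-taken,
# 2-3 predict taken).

TABLE_SIZE = 16 * 1024

class GlobalPredictor:
    def __init__(self):
        self.table = [1] * TABLE_SIZE  # start weakly not-taken
        self.history = 0               # outcomes of last 12 branches

    def _index(self, rip):
        # hash branch address with global history (assumed XOR hash)
        return (rip ^ self.history) % TABLE_SIZE

    def predict(self, rip):
        return self.table[self._index(rip)] >= 2  # True = taken

    def update(self, rip, taken):
        i = self._index(rip)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)  # saturate at 3
        else:
            self.table[i] = max(0, self.table[i] - 1)  # saturate at 0
        # shift outcome into the 12-bit history register
        self.history = ((self.history << 1) | int(taken)) & 0xFFF
```

The saturating counters give hysteresis: a single anomalous outcome does not immediately flip a well-trained prediction.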

6
Pipeline
  • Uses a 12 stage pipeline

7
OO (ROB)
  • The Pack Buffer (post-decoding buffer) sends
    groups of 3 micro-ops to the re-order buffer
    (ROB)
  • The re-order buffer contains 24 entries, with 3
    lanes per entry
  • Holds a total of 72 instructions
  • Instructions can be moved between lanes to avoid
    a congested reservation station or to observe
    issue restrictions
  • From the ROB, instructions issue to the
    appropriate scheduler
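The dispatch step above can be sketched as a small model: fixed groups of three micro-ops enter a 24-entry, 3-lane ROB (72 micro-ops in flight). The class and method names are illustrative:

```python
# Rough sketch of ROB dispatch as described above: the pack buffer
# hands the ROB groups of three micro-ops (one per lane); the ROB
# holds at most 24 such entries, i.e. 72 micro-ops in flight.

ROB_ENTRIES, LANES = 24, 3

class ReorderBuffer:
    def __init__(self):
        self.entries = []

    def dispatch(self, group):
        assert len(group) == LANES, "always dispatched in groups of 3"
        if len(self.entries) == ROB_ENTRIES:
            return False  # ROB full: front end must stall
        self.entries.append(list(group))
        return True

    def retire_oldest(self):
        # retirement is in program order: oldest group leaves first
        return self.entries.pop(0) if self.entries else None

rob = ReorderBuffer()
ok = rob.dispatch(['add', 'load', 'nop'])
```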

8
ROB
9
Integer Future File and Register File (IFFRF)
  • The IFFRF contains 40 registers broken up into
    three distinct sets
  • The Architectural Register File
  • Contains 16x64 bit non-speculative registers
  • Instructions modify the Architectural Register
    File only when they commit
  • Speculative instructions read from and write to
    the Future File
  • The Future File contains the most recent
    speculative state of the 16 architectural
    registers
  • The last 8 registers are scratchpad registers
    used by the microcode.
  • Should a branch mis-prediction or an exception
    occur, the pipeline rolls back, and the
    architectural register file overwrites the
    contents of the Future File
  • There are three reservation stations, i.e.
    schedulers, within the integer cluster
  • Each station is tied to a specific lane in the
    ROB and holds 8 instructions
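The future-file rollback described above can be modeled in a few lines. Names and sizes are illustrative (the real IFFRF also holds 8 microcode scratchpad registers, omitted here):

```python
# Minimal sketch of a future file / architectural register file pair:
# speculative writes land in the future file, retirement copies them
# into the architectural file, and a mispredict or exception restores
# the future file from the architectural (committed) state.

class FutureFile:
    def __init__(self, nregs=16):
        self.arch = [0] * nregs    # committed, non-speculative state
        self.future = [0] * nregs  # most recent speculative state

    def spec_write(self, reg, value):
        self.future[reg] = value           # speculative result

    def commit(self, reg):
        self.arch[reg] = self.future[reg]  # retirement updates ARF

    def rollback(self):
        # mispredict/exception: ARF overwrites the future file
        self.future = list(self.arch)

ff = FutureFile()
ff.spec_write(3, 42)  # speculative result for r3
ff.rollback()         # the branch turned out to be mispredicted
```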

10
Integer Execution
  • Barcelona uses three symmetric ALUs which can
    execute almost any integer instruction
  • Three full featured ALUs require more die area
    and power
  • Can provide higher performance for certain edge
    cases
  • Enables a simpler design for the ROB and
    schedulers.

11
Floating Point Execution
  • Floating Point operations are first sent to the
    FP Mapper and Renamer
  • In the Renamer, up to 3 FP instructions each
    cycle are assigned a destination register from
    the 120-entry FP register file.
  • Once the micro-ops have been renamed, they may be
    issued to the three FP schedulers
  • Operands can be obtained from either the FP
    register file, or the forwarding network
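The renaming step above can be sketched with a generic free-list renamer: up to three FP micro-ops per cycle each receive a free physical register out of 120. The free-list mechanics are textbook-style assumptions, not AMD's documented design:

```python
# Sketch of per-cycle FP register renaming: a free list of 120
# physical registers, a rename width of 3, and a map from
# architectural to physical registers.

PRF_SIZE, RENAME_WIDTH = 120, 3

class FPRenamer:
    def __init__(self):
        self.free = list(range(PRF_SIZE))  # all registers start free
        self.map = {}  # architectural reg -> physical reg

    def rename_cycle(self, dests):
        """Rename up to RENAME_WIDTH destination registers in one
        cycle; any further ops must wait for the next cycle."""
        out = []
        for arch_reg in dests[:RENAME_WIDTH]:
            phys = self.free.pop(0)        # grab a free physical reg
            self.map[arch_reg] = phys
            out.append((arch_reg, phys))
        return out

r = FPRenamer()
renamed = r.rename_cycle(['xmm0', 'xmm1', 'xmm2', 'xmm3'])  # 4th waits
```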

12
Floating Point Execution (SIMD)
  • The FPUs are 128 bits wide so that Streaming SIMD
    Extension (SSE) instructions can execute in a
    single pass.
  • Similarly, the load-store units, and the FMISC
    unit load 128 bit wide data, to improve SSE
    performance.

13
Memory Overview
14
Memory Hierarchy
  • L1: 4 separate 128KB (64KB instruction + 64KB
    data per core) 2-way set-associative caches
  • Latency 3 cycles
  • Write-back to L2
  • The data paths into and out of the L1D cache
    were also widened to 256 bits (128 bits transmit
    and 128 bits receive)
  • L2: 4 separate 512KB 16-way set-associative
    caches
  • Latency 12 cycles
  • Line size is 64B
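As a worked example of these parameters: a 512KB, 16-way, 64B-line L2 has 512KB / (64B x 16 ways) = 512 sets, so the set index comes from the address bits just above the 6-bit line offset. The helper names below are illustrative:

```python
# Address decomposition for the 512KB, 16-way, 64B-line L2 described
# above: offset = addr mod 64, set = next log2(512) = 9 bits, tag =
# the remaining upper bits.

LINE = 64
WAYS = 16
SIZE = 512 * 1024
SETS = SIZE // (LINE * WAYS)  # 512 sets

def l2_set_index(addr):
    return (addr // LINE) % SETS

def l2_tag(addr):
    return addr // (LINE * SETS)
```

Two addresses exactly LINE * SETS (32KB) apart map to the same set and compete for its 16 ways.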

15
L3 Cache
  • Shared 2MB 32-way set associative L3
  • Latency 38 cycles
  • Uses 64B lines
  • The L3 cache was designed with data sharing in
    mind
  • When a line is requested, if it is likely to be
    shared, then it will remain in the L3
  • This leads to duplication which would not happen
    in an exclusive hierarchy
  • In the past, a pseudo-LRU algorithm would evict
    the (approximately) least-recently-used line in
    the cache.
  • In Barcelona's L3, the replacement algorithm has
    been changed to prefer evicting unshared lines
  • Access to the L3 must be arbitrated since the L3
    is shared between four different cores
  • A round-robin algorithm is used to give access
    to one of the four cores each cycle.
  • Each core has 8 data prefetchers (a total of 32
    per device)
  • Fill the L1D cache
  • Can have up to 2 outstanding fetches to any
    address
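The sharing-aware replacement policy can be sketched as a two-step victim selection: prefer the oldest unshared line, and fall back to plain age-based eviction only when every way is shared. This is a toy model of the stated preference, not AMD's actual circuit:

```python
# Toy victim selection for a sharing-aware L3 replacement policy:
# evict the oldest line that no other core has requested; if all
# lines are shared, fall back to the oldest line overall.

def pick_victim(lines):
    """lines: one dict per way with 'shared' (bool) and 'age'
    (higher = older / less recently used). Returns the index of the
    way to evict."""
    unshared = [i for i, l in enumerate(lines) if not l['shared']]
    candidates = unshared if unshared else range(len(lines))
    return max(candidates, key=lambda i: lines[i]['age'])

ways = [
    {'shared': True,  'age': 9},   # oldest, but shared: protected
    {'shared': False, 'age': 2},
    {'shared': False, 'age': 5},   # oldest unshared line: victim
]
victim = pick_victim(ways)
```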

16
Memory Controllers
  • Each memory controller supports independent 64B
    transactions
  • The integrated DDR2 memory controller ensures
    that an L3 cache miss is resolved in less than
    60 nanoseconds

17
TLB
  • Barcelona offers non-speculative memory access
    re-ordering in the form of Load Store Units (LSU)
  • Thus, some memory operations can be issued
    out-of-order
  • In the 12 entry LSU1, the oldest operations
    translate their addresses from the virtual
    address space to the physical address space using
    the L1 DTLB
  • During this translation, the lower 12 bits of
    the load operation's address are tested against
    previously stored addresses
  • If they are different, then the load proceeds
    ahead of the store
  • If they are the same, load-store forwarding
    occurs
  • Should a miss in the L1 DTLB occur, the L2 DTLB
    will be checked
  • Once the load or store has located its address
    in the cache, the operation moves on to LSU2.
  • LSU2 holds up to 32 memory accesses, where they
    stay until they are removed
  • The LSU2 handles any cache or TLB misses via
    scheduling and probing
  • In the case of a cache miss, the LSU2 will then
    look in the L2, L3 and then memory
  • In the case of TLB misses, it will look in the L2
    TLB and then main memory
  • The LSU2 also holds store instructions, which are
    not allowed to actually modify the caches until
    retirement to ensure correctness
  • Thus, the LSU2 absorbs most of the complexity
    of the memory pipeline
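The 12-bit disambiguation check above exploits the fact that the low 12 bits (the 4KB page offset) are unchanged by virtual-to-physical translation, so they can be compared before the full translation completes. A simplified sketch, with invented names:

```python
# Sketch of load/store disambiguation on the low 12 address bits:
# a load is checked against older, not-yet-retired stores. A match
# triggers store-to-load forwarding; otherwise the load may proceed
# ahead of the stores.

PAGE_MASK = 0xFFF  # page offset: survives virtual->physical translation

def check_load(load_addr, older_stores):
    """older_stores: (addr, data) pairs for stores still buffered in
    LSU2, oldest first. Returns ('forward', data) on a match, else
    ('proceed', None)."""
    for store_addr, data in reversed(older_stores):  # youngest match wins
        if (store_addr & PAGE_MASK) == (load_addr & PAGE_MASK):
            return ('forward', data)
    return ('proceed', None)

stores = [(0x7f001230, 'old'), (0x7f005230, 'new')]
result = check_load(0x1230, stores)  # low bits 0x230 match both stores
```

Note the check is conservative: two different pages can share the same low 12 bits, so a "match" may forward unnecessarily, but a "no match" safely lets the load go ahead of the store.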

18
Hypertransport
  • Barcelona has four HyperTransport 3.0 lanes for
    inter-processor communications and I/O devices
  • HyperTransport 3.0 adds a feature called
    unganging or lane-splitting
  • The HT3.0 links are composed of two 16-bit
    lanes (in both directions)
  • Each can be split up into a pair of independent
    8-bit wide links

19
Shanghai
  • The latest model of the Opteron series
  • Several improvements over Barcelona
  • 45nm
  • 6MB L3 cache
  • Improved clock speeds
  • A host of other improvements