1
Itanium Family Architecture
  • Marty Nicholes for EEC272
  • UCDavis

2
Topics
  • Architectural Overview
  • Registers, and ISA
  • Implementation
  • Instruction Flow
  • Data Flow
  • Benchmarks
  • Futures
  • References

3
Architectural Overview
  • What is exposed to the compiler

4
Architectural Overview - IPF Register Set
5
Architectural Overview - Rotating Register Set
  • The first 32 general/FP registers are available
    to all procedures
  • Each procedure uses the ALLOC instruction to gain
    access to up to 96 more registers.
  • If more logical registers are allocated than
    physical, the Register Stack Engine must spill
    older physical registers to make room for new
    allocated registers.
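
The spill behaviour described above can be sketched as a simple occupancy counter. This is a minimal illustration, not the Register Stack Engine's actual logic: the class name RegisterStack and its fields are hypothetical, frame overlap between caller and callee is ignored, and returns (fills) are not modelled.

```python
class RegisterStack:
    """Toy model of stacked-register occupancy (hypothetical names)."""

    def __init__(self, physical=96):
        self.physical = physical  # stacked physical registers available
        self.resident = 0         # registers currently held in the register file
        self.spilled = 0          # registers written out to memory so far

    def alloc(self, size):
        """Allocate `size` registers for a new frame; return how many older
        registers had to be spilled to make room."""
        if not 0 <= size <= self.physical:
            raise ValueError("frame size must be between 0 and 96 registers")
        self.resident += size
        overflow = max(0, self.resident - self.physical)
        self.resident -= overflow  # the oldest registers leave the file
        self.spilled += overflow
        return overflow


rs = RegisterStack()
first = rs.alloc(60)   # fits entirely in the 96 stacked registers
second = rs.alloc(60)  # 120 requested in total, so 24 must be spilled
```

Two nested procedures that each allocate 60 registers need 120 in total, so the second allocation forces 24 of the older registers out.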

6
Architectural Overview - Rotating Register Set
7
Architectural Overview - Instruction Set Architecture
  • Instructions organized into groups
    • No dependencies or interlock needed
    • Can execute concurrently
    • Arbitrary number of instructions
    • Groups delimited by stops
  • Instructions come in bundles (128 bits)
    • Contains 3 instructions
    • Contains a bundle template

8
Architectural Overview - Instruction Set Architecture
  • Bundle
    • Three 41-bit instructions
    • 5-bit bundle type (template)
    • Bundle type communicates stop location
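
The bundle layout above (5 + 3 × 41 = 128 bits) can be illustrated with a small packing and unpacking sketch. It assumes the template sits in the low-order 5 bits with the three slots following in order; the function names are hypothetical and the slot contents are treated as opaque integers.

```python
TEMPLATE_BITS = 5
SLOT_BITS = 41
SLOTS_PER_BUNDLE = 3


def encode_bundle(template, slots):
    """Pack a 5-bit template and three 41-bit slots into one 128-bit integer."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template must fit in 5 bits")
    if len(slots) != SLOTS_PER_BUNDLE:
        raise ValueError("a bundle holds exactly 3 instructions")
    bundle = template
    for i, slot in enumerate(slots):
        if not 0 <= slot < (1 << SLOT_BITS):
            raise ValueError("each instruction must fit in 41 bits")
        bundle |= slot << (TEMPLATE_BITS + i * SLOT_BITS)
    return bundle


def decode_bundle(bundle):
    """Split a 128-bit integer into (template, [slot0, slot1, slot2])."""
    if not 0 <= bundle < (1 << 128):
        raise ValueError("bundle must fit in 128 bits")
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slot_mask = (1 << SLOT_BITS) - 1
    slots = [(bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & slot_mask
             for i in range(SLOTS_PER_BUNDLE)]
    return template, slots


template, slots = decode_bundle(encode_bundle(0x10, [1, 2, 3]))
```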

9
Architectural Overview - Instruction Set Architecture
  • Template helps to steer instruction to functional
    unit
  • Implementation affects which bundle combos can
    fully issue

10
Architectural Overview
11
Itanium Implementation
  • 3 processor designs announced
    • Merced (Itanium)
    • McKinley (Itanium 2)
    • Madison (1.5 GHz Itanium 2)

12
Implementation (Itanium Die)
13
Itanium Block Diagram
14
Itanium Pipeline
15
Itanium 2 Die
16
Itanium 2 Pipeline
17
Itanium 2 Block Diagram
18
1.5 GHz Itanium 2 Die
19
Instruction Flow
  • How to feed the hungry execution units

20
I-Cache (Itanium)
  • 16 KB (32 B line size)
  • 4-way set associative
  • Fully pipelined, single clock
  • 64-entry I-TLB
    • Fully associative
    • On-chip page walker
  • Both have extra port for miss checking

21
Instruction Fetch (Itanium)
22
Branch Prediction (Itanium)
  • 90% accurate 2-level BPT
    • 512 entries (128 sets × 4 ways); each entry is
      4 bits
    • This value indexes into the pattern table for
      the set (128 pattern tables)
    • Each PT has 16 entries; each entry is a 2-bit
      saturating counter
  • Also a 64-entry, multi-way branch prediction
    table (MBPT) that is similar but has 3 history
    registers per bundle entry. Find-first-taken
    algorithm.
  • Branch prediction penalties
    • IP-relative branch w/correct prediction - 1 cycle
    • IP-relative branch w/wrong target - 1 cycle
    • Return branch w/correct prediction - 1 cycle
    • Last branch in counted loop prediction - 2 cycles
    • Branch misprediction - 9 cycles
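
The two-level scheme in the BPT bullets (a 4-bit history selecting one of 16 two-bit saturating counters) can be sketched for a single branch as follows. This is a simplified illustration: the class name is hypothetical, set/way lookup and pattern-table sharing within a set are not modelled, and the initial counter value is an assumption.

```python
class TwoLevelPredictor:
    """One branch history register plus its 16-entry pattern table."""

    HISTORY_BITS = 4

    def __init__(self):
        self.history = 0
        # 2-bit saturating counters, 0..3; start at "weakly not taken" (assumed)
        self.counters = [1] * (1 << self.HISTORY_BITS)

    def predict(self):
        """Predict taken when the selected counter is in its upper half."""
        return self.counters[self.history] >= 2

    def update(self, taken):
        """Train the selected counter, then shift the outcome into the history."""
        c = self.counters[self.history]
        self.counters[self.history] = min(3, c + 1) if taken else max(0, c - 1)
        mask = (1 << self.HISTORY_BITS) - 1
        self.history = ((self.history << 1) | int(taken)) & mask


predictor = TwoLevelPredictor()
outcomes = [True, False] * 10  # strictly alternating branch
for taken in outcomes:
    predictor.update(taken)
```

After this warm-up the history pattern distinguishes the two phases of the alternating branch, which a single saturating counter on its own could not do.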

23
I-Cache (Itanium 2)
  • 16 KB (4-way, 64 B line size)
  • LRU replacement algorithm
  • 32 GB/sec bandwidth
  • 1-cycle load to use
  • 2-level I-TLB
    • ITC (32-entry, fully assoc., 0.5 clock)
    • ITLB (128-entry, fully assoc., 1 clock)
    • Supports 4 KB to 4 GB page sizes
    • Supports 64 ITRs
    • HW page walker starts on miss

24
Branch Prediction (Itanium 2)
  • Zero-clock branch prediction
  • 2-level branch prediction hierarchy
    • L1IBR - Level 1 Branch Cache
      • Part of the L1 I-cache
      • 1K trigger predictions, 0.5K target addresses
    • L2B - Level 2 Branch Cache (12K histories)
    • PHT - Pattern History Table (16K counters)
  • Reduced prediction penalties
    • IP-relative branch w/correct prediction - 0 cycles
    • IP-relative branch w/wrong target - 1 cycle
    • Return branch w/correct prediction - 1 cycle
    • Last branch in counted loop prediction - 0 cycles
    • Branch misprediction - 6 cycles

25
Instruction Dispersal (Itanium)
  • Stop bits eliminate dependency checking
  • Templates simplify routing
  • Map instructions to first available of 9 issue
    ports. Keep issuing until either
    • stop bit is hit
    • required issue port is unavailable
  • Re-map virtual register to physical register
  • New bundles presented as bundles fully issue
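
The issue rule above (keep issuing until a stop bit is hit or the required port is unavailable) can be sketched as a short loop. The split of the 9 ports into 2 memory, 2 integer, 2 floating-point and 3 branch ports is taken as given here; the function name and the (unit, stop) tuple format are hypothetical, and details such as ALU instructions that can use either M or I ports are ignored.

```python
ITANIUM_PORTS = {"M": 2, "I": 2, "F": 2, "B": 3}  # 9 issue ports in total


def disperse(instructions, ports=ITANIUM_PORTS):
    """Issue instructions in order; return the unit types that issued this cycle.

    Each instruction is a (unit, stop) pair, where `stop` is True when a
    stop bit follows the instruction.
    """
    free = dict(ports)  # remaining ports of each type this cycle
    issued = []
    for unit, stop in instructions:
        if free.get(unit, 0) == 0:
            break  # required issue port is unavailable
        free[unit] -= 1
        issued.append(unit)
        if stop:
            break  # end of the instruction group
    return issued


# Two bundles: M I I followed by M M F. The third memory operation
# finds both M ports taken, so dispersal ends before it.
window = [("M", False), ("I", False), ("I", False),
          ("M", False), ("M", False), ("F", False)]
issued_now = disperse(window)
```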

26
Instruction Dispersal (Itanium)
27
Instruction Dispersal (Itanium 2)
28
Register Remapping (Itanium)
  • One 7-bit adder for each register specifier
  • In total 98 7-bit adders, and 42 MUXs
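
The per-specifier adder can be pictured as adding a frame base to the 7-bit logical register number. The sketch below is an assumption-laden illustration rather than the actual datapath: the name `base` is hypothetical, rotation is not modelled, and the wrap-around over 96 stacked registers is written as a modulo for clarity.

```python
NUM_STATIC = 32    # registers visible to all procedures
NUM_STACKED = 96   # stacked physical registers
NUM_LOGICAL = 128  # 7-bit register specifier


def rename(logical, base):
    """Map a logical register number to a physical one.

    Static registers map to themselves; stacked registers are offset by
    `base` (hypothetical name for the current frame's starting offset)
    and wrap around within the stacked region.
    """
    if not 0 <= logical < NUM_LOGICAL:
        raise ValueError("register specifier must fit in 7 bits")
    if logical < NUM_STATIC:
        return logical
    return NUM_STATIC + (logical - NUM_STATIC + base) % NUM_STACKED


physical = [rename(r, base=10) for r in (5, 32, 127)]
```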

29
Instruction Dispersal (Itanium 2)
  • Itanium 2 implements 11 issue ports
    • 4 Mem/ALU/Multi-Media
    • 2 Integer/ALU/Multi-Media
    • 2 FMAC
    • 3 branch

30
Execution (Itanium and Itanium 2)
  • 17 execution units fed by 9 issue ports (Itanium)
  • 20 execution units fed by 11 issue ports (Itanium
    2)
  • Scoreboard based, stall-on-use
    • Enhanced to support predication
    • Hazard evaluation in REG stage
    • Hazards can proceed into EXE stage
    • Stall occurs in EXE stage (deferred stall)
  • Obtaining operands in the EXE stage
    • Stalled instructions snoop for data values
    • Utilize register bypass hardware from REG

31
Data Flow
  • Getting operands into the core
  • Getting results stored

32
Data Flow (Itanium)
  • 128-entry (64-bit) integer register file
    • 8 read and 6 write ports
    • Supports 2 MEM and 2 ALU instructions
    • 2 write ports for pending loads
  • 128-entry (82-bit) floating point register file
    • 8 read and 4 write ports
  • Predicate register file (1 bit × 64)
    • 15 read and 11 write ports to single registers
    • Broadside read and write capability

33
Data Speculation
  • Control speculation
    • On exception, NaT bit (or NaTVal) is set
    • On consumption, exception is reported
  • Data speculation
    • Special (advanced) load issued early
    • Address, size, and destination saved in ALAT
    • ALAT used to check for overlapping stores
    • If a match
      • the load is invalidated
      • must be reissued later when the data is to be
        used
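
The ALAT bookkeeping described above can be sketched as a small table keyed by destination register. This is a minimal model under stated assumptions: the class and method names are hypothetical, capacity is unbounded, and full addresses are compared, which need not match how the hardware table is organized.

```python
class ALAT:
    """Toy advanced-load address table (hypothetical interface)."""

    def __init__(self):
        self.entries = {}  # destination register -> (address, size)

    def advanced_load(self, reg, addr, size):
        """Record an early load: address, size and destination are saved."""
        self.entries[reg] = (addr, size)

    def store(self, addr, size):
        """Invalidate every recorded load whose bytes overlap this store."""
        for reg, (a, s) in list(self.entries.items()):
            if addr < a + s and a < addr + size:
                del self.entries[reg]

    def check(self, reg):
        """True if the early load is still valid; False means reissue it."""
        return reg in self.entries


alat = ALAT()
alat.advanced_load(reg=8, addr=0x1000, size=8)
alat.advanced_load(reg=9, addr=0x2000, size=4)
alat.store(addr=0x1004, size=4)  # overlaps the load into register 8
needs_reload = not alat.check(8)
still_valid = alat.check(9)
```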

34
ALAT Structure
35
Data Flow - FMAC Unit (Itanium)
36
Data Flow (Itanium 2 deltas)
  • Integer register file
    • 12 read and 8 write ports
  • Floating point register file
    • 8 read and 6 write ports
  • FPU to L2D cache
    • 4 82-bit read ports (6-cycle latency)
    • 2 82-bit write ports

37
Data Flow - Caches (Itanium)
38
Data Flow - Caches (Itanium 2)
39
Data Flow (Itanium 2 L1D)
  • L1D - 16 KB of data (4 ways, 4 KB/way)
  • Prevalidated tags for fast loads
  • 4 ports (2 load ports, 2 store ports)
    • Load ports are independent.
    • Each store port has a 1 in 8 chance of
      conflicting with each other valid port.
  • True single-cycle load access, including
    • address translation, data delivery
    • data read, integer unit data bypass
  • Physically addressed (50 physical address bits)
  • Write-through policy
  • Each way is 64 B lines × 64 indices
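
The geometry figures above are consistent with each other, which a short calculation makes explicit. The tag width computed here is what a conventional physically tagged cache of this shape would need; the prevalidated-tag scheme mentioned in the list stores something different, so treat that number as illustrative only. The function name is hypothetical.

```python
CACHE_BYTES = 16 * 1024
WAYS = 4
LINE_BYTES = 64
PHYS_ADDR_BITS = 50

way_bytes = CACHE_BYTES // WAYS              # 4 KB per way
sets = way_bytes // LINE_BYTES               # 64 indices per way
offset_bits = LINE_BYTES.bit_length() - 1    # 6 bits select a byte in the line
index_bits = sets.bit_length() - 1           # 6 bits select the set
tag_bits = PHYS_ADDR_BITS - index_bits - offset_bits  # conventional tag width


def split_address(addr):
    """Split a physical address into (tag, index, offset) fields."""
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> offset_bits) & (sets - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset


fields = split_address(0x1234)
```

Because index and offset together cover exactly 12 bits, the set can be selected from address bits that lie within a 4 KB page offset.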

40
Data Flow (Itanium 2)
41
L1D addressing (Itanium 2)
42
L1D Addressing (Itanium 2)
43
Addressing
  • Itanium
    • 44-bit physical addressing
    • 50-bit virtual addressing
    • Maximum page size of 256 MB
  • Itanium 2
    • 50-bit physical addressing
    • 64-bit virtual addressing
    • Maximum page size of 4 GB
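
To put the address widths above in more familiar units, the following sketch converts bit counts to capacities and page sizes to offset widths. The helper names are hypothetical and binary units (1 TB = 2^40 bytes) are assumed.

```python
def addressable_bytes(address_bits):
    """Number of bytes reachable with the given address width."""
    return 1 << address_bits


def page_offset_bits(page_bytes):
    """Bits of an address that select a byte within a page of this size."""
    if page_bytes <= 0 or page_bytes & (page_bytes - 1):
        raise ValueError("page size must be a power of two")
    return page_bytes.bit_length() - 1


TB = 1 << 40
itanium_phys_tb = addressable_bytes(44) // TB    # 44-bit physical: 16 TB
itanium2_phys_tb = addressable_bytes(50) // TB   # 50-bit physical: 1024 TB
itanium_max_page_bits = page_offset_bits(256 * (1 << 20))  # 256 MB pages
itanium2_max_page_bits = page_offset_bits(4 * (1 << 30))   # 4 GB pages
```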

44
Benchmarks
  • SPECfp2000
    • hp server rx5670 (1P, 1000 MHz, Itanium 2)
    • 1431 (Dec-2002, rank 2)
    • Rank 1: Alpha @ 1482
  • SPECint2000
    • hp server rx2600 (1P, 900 MHz, Itanium 2)
    • 674 (Dec-2002; rank 1: P4 3 GHz @ 1200)
  • TPC-C (non-clustered)
    • HP Superdome (64P, 1.5 GHz Itanium 2)
    • 658,278 (Apr-2003, rank 2)
    • Rank 1: IBM @ 680,613 (32P Power4, 1.7 GHz)

45
Itanium Family Futures
  • IA-32 Execution Layer, expected to debut in 2003.
    A 1.5 GHz Itanium 2 is expected to run 32-bit
    code about as fast as a 1.5 GHz Xeon MP chip
  • HP mx2 dual-processor module using Intel
    Itanium 2 processors
    • Combines two future Itanium 2 processors and a
      32 MB L4 cache onto a single daughter card
      module that is pin-compatible with existing
      Madison Itanium 2 processor sockets.
  • 2004 model of Madison is expected to be faster
    and carry 9 MB of on-chip L3 cache on a 0.13 µm
    process
  • Deerfield: between 70 and 80 watts, and lower
    cost.
  • Montecito processor in 2005: 90-nanometer,
    dual-core technology, 18 MB L3 cache; each core
    will have its own L3 cache
  • Multi-threading?

46
References
  • Naffziger et al., The Implementation of the
    Itanium 2 Microprocessor, IEEE Journal,
    November 2002
  • Naffziger, Hammond, The Implementation of the
    Next-Generation 64b Itanium Microprocessor,
    ISSCC, 2002
  • Fetzer, Orton, A Fully-Bypassed 6-Issue Integer
    Datapath and Register File on an Itanium
    Microprocessor, ISSCC, 2002
  • Lyon, Delano, Data Cache Design Considerations
    for the Itanium 2 Processor, Proc. Of the 2002
    IEEE Intl. Conf. On Computer Design, 2002
  • Shankland, CNET News.com, September 10, 2002
  • Singer, Intel Adds More Itanium 2 to its
    Future, siliconvalley.internet.com, January 16,
    2003
  • Swoyer, Intel Promises Improved Performance,
    Enterprise Systems, April 2003
  • www.intel.com
  • www.hp.com
  • www.spec.org
  • Sharangpani, Arora, Itanium Processor
    Microarchitecture, IEEE, 2000
  • Bradley, Mahoney, Stackhouse, The 16kB
    Single-Cycle Read-Access Cache on a
    Next-Generation 64b Itanium Microprocessor,
  • Stinson, Rusu, The 16kB Single-Cycle Read-Access
    Cache on a Next-Generation 64b Itanium
    Microprocessor,
  • Naffziger, Hammond, Next Generation Itanium
    Processor Overview, Intel Devel. Forum 2001
  • Barcella, Sankaranarayanan, Pai, ITANIUM, An
    EPIC Architecture,
  • Intel Itanium 2 Processor Reference Manual,
    Intel Corp., June 2002
  • Parmenter, Levy, The Intel Itanium,
  • K. Diefendorff, HP, Intel Complete IA-64
    Rollout, Microprocessor Report, MicroDesign
    Resources, Sunnyvale, CA, April 10, 2000
  • McNairy, Soltis, Itanium 2 Microarchitecture,
    IEEE Micro, March-April 2003