Intel Core 2 Duo - PowerPoint PPT Presentation

1
Intel Core 2 Duo
  • CS 6354
  • by WeiKeng Qin, Jian Xiang, Ren Xu
  • December 8, 2009

2
Introduction
  • Motivation
  • A Multi-Core on our desks
  • A new microarchitecture to replace Netburst
  • Intel Core 2 Duo
  • A dual-core CPU
  • ISA with SIMD Extension
  • Intel Core microarchitecture
  • Memory Hierarchy System

3
Instruction Set Architecture
  • Base X86-64
  • No VLIW (Itanium)
  • SIMD Extensions MMX, SSE, SSE2, SSE3, SSSE3,
    SSE4.1

  • Wolfdale, SSE4.1, Sep 2006
  • Core 2, SSSE3, July 2006: e.g. permuting bytes in a
    word
  • Prescott, SSE3, 2004: DSP-oriented math, process
    management
  • Pentium 4, SSE2, 2001: double precision, 128-bit
    register support
  • Pentium III, SSE, 1999: 8 new registers,
    floating-point operations
  • Pentium MMX, 1996: 8 new registers, packed data type,
    integer operations
4
Streaming SIMD Extension (SSE) 4.1
  • Beginning with the 45 nm processors
  • 47 instructions that improve performance of media
    data manipulation
  • e.g. Fast and efficient bit width conversions
  • Convert single byte values to word (16-bit)
    values.

5
SSE2 Code
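  • MOVQ XMM0, M64 (load the 8 source bytes into the low
    half of XMM0)
  • PXOR XMM1, XMM1 (zero XMM1)
  • PUNPCKLBW XMM0, XMM1 (interleave with zeros, so each
    byte becomes a 16-bit word)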

6
SSE4.1 Code
  • PMOVZXBW XMM0, M64
  • DEST[15:0] <- ZeroExtend(SRC[7:0])
  • DEST[31:16] <- ZeroExtend(SRC[15:8])
  • DEST[47:32] <- ZeroExtend(SRC[23:16])
  • DEST[63:48] <- ZeroExtend(SRC[31:24])
  • DEST[79:64] <- ZeroExtend(SRC[39:32])
  • DEST[95:80] <- ZeroExtend(SRC[47:40])
  • DEST[111:96] <- ZeroExtend(SRC[55:48])
  • DEST[127:112] <- ZeroExtend(SRC[63:56])
  • Benefits
  • Fewer instructions (3 -> 1)
  • Better performance (40% speedup per loop)
  • Reduced register pressure (2 -> 1 registers)
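
The equivalence of the two code sequences can be checked with a small software model. The following is only a sketch, not how the hardware works: it assumes an XMM register can be modeled as a 16-byte little-endian byte string (byte 0 holds bits 7:0), and the function names are made up for illustration.

```python
def punpcklbw(dst: bytes, src: bytes) -> bytes:
    """Model of PUNPCKLBW: interleave the low 8 bytes of dst and src."""
    out = bytearray()
    for d, s in zip(dst[:8], src[:8]):
        out += bytes([d, s])  # dst byte lands in the low half of each word
    return bytes(out)


def sse2_widen(m64: bytes) -> bytes:
    """Three-step SSE2 sequence: load, zero a second register, unpack."""
    xmm0 = m64[:8] + bytes(8)   # load 8 bytes into the low half, rest zero
    xmm1 = bytes(16)            # PXOR XMM1, XMM1
    return punpcklbw(xmm0, xmm1)


def pmovzxbw(m64: bytes) -> bytes:
    """Single-step SSE4.1 form: zero-extend 8 bytes to 8 words."""
    out = bytearray()
    for b in m64[:8]:
        out += b.to_bytes(2, "little")
    return bytes(out)


src = bytes([0x01, 0x7F, 0x80, 0xFF, 0x10, 0x20, 0x30, 0x40])
assert sse2_widen(src) == pmovzxbw(src)
```

The model shows why the second register disappears: the zeros that PUNPCKLBW takes from XMM1 are supplied implicitly by the zero extension.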

7
Microarchitecture
  • The Cores
  • Single die (107 mm²)
  • Two identical cores (L1 cache: 64 KB x 2)
  • Shared L2 cache: 6 MB
  • No Hyper-threading, no L3 cache
  • Keep front-side bus
  • Larger L2 cache

8
Microarchitecture
  • 14-stage Pipeline
  • 4-wide decode
  • 4-wide retire
  • Macro-fusion
  • Enhanced ALUs
  • Deeper Buffers

9
Another View
10
Decode Hardware
  • 128-bit fetch bandwidth
  • 18-entry instruction queue (IQ)
  • Complex decoder: produces 1-4 micro-ops
  • Microcode sequencer
11
Macro-fusion
  • New Micro-op
  • Represent instruction pair as single micro-op
  • Enhanced ALUs
  • To execute new compare and jump (CMPJCC)
    micro-op in one clock
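
A decoder-side view of macro-fusion can be sketched in a few lines. This is a simplified illustration, not Intel's decode logic: it assumes that a CMP or TEST immediately followed by a conditional jump is fusible, ignores the extra restrictions real hardware applies, and uses a made-up tuple format `(mnemonic, operands)` for instructions.

```python
CMP_LIKE = {"CMP", "TEST"}  # assumed set of fusible first instructions


def is_conditional_jump(mnemonic: str) -> bool:
    # Any Jcc, but not the unconditional JMP
    return mnemonic.startswith("J") and mnemonic != "JMP"


def decode(instrs):
    """Turn a list of (mnemonic, operands) into a list of micro-ops,
    fusing each compare + conditional-jump pair into one CMPJCC."""
    uops = []
    i = 0
    while i < len(instrs):
        cur = instrs[i]
        nxt = instrs[i + 1] if i + 1 < len(instrs) else None
        if cur[0] in CMP_LIKE and nxt is not None and is_conditional_jump(nxt[0]):
            uops.append(("CMPJCC", cur, nxt))  # one micro-op for the pair
            i += 2
        else:
            uops.append(("UOP", cur))
            i += 1
    return uops


program = [
    ("MOV", "eax, 1"),
    ("CMP", "eax, ebx"),
    ("JNE", "loop"),
    ("ADD", "eax, 2"),
]
uops = decode(program)
```

Here four x86 instructions become three micro-ops, so the fused pair takes one slot instead of two in the buffers downstream of the decoders.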

12
Out of Order Execution
96-entry ROB, 32-entry reservation station
13
Execution Units
  • 6 dispatch ports (1 load, 2 store, 3 universal
    ports)
  • 3 integer ALUs, 2 floating-point ALUs

14
Branch Predictor
  • Loop Detector
  • Tracks the number of loop iterations for future
    reference
  • The branch prediction unit (BPU) selects among the
    following for every branch
  • - bimodal predictor
  • - global predictor
  • - loop detector
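
The idea behind the loop detector can be sketched as follows. This is an illustrative model only, not the actual Core 2 design: it assumes one entry per branch address that remembers how many times the branch was taken before the loop last exited, and the class and function names are hypothetical.

```python
class LoopDetector:
    def __init__(self):
        self.trip = {}   # branch address -> taken count of the last full run
        self.count = {}  # branch address -> taken count in the current run

    def predict(self, pc: int) -> bool:
        """Return True for 'taken'. With no history, assume taken."""
        trip = self.trip.get(pc)
        if trip is None:
            return True
        return self.count.get(pc, 0) < trip

    def update(self, pc: int, taken: bool) -> None:
        if taken:
            self.count[pc] = self.count.get(pc, 0) + 1
        else:
            # Loop exit: remember the iteration count for future reference
            self.trip[pc] = self.count.get(pc, 0)
            self.count[pc] = 0


def run(detector, pc, outcomes):
    """Feed one run of branch outcomes; return the number of mispredictions."""
    misses = 0
    for taken in outcomes:
        if detector.predict(pc) != taken:
            misses += 1
        detector.update(pc, taken)
    return misses


detector = LoopDetector()
pattern = [True, True, True, False]   # loop branch taken 3 times, then exit
first = run(detector, 0x400, pattern)
second = run(detector, 0x400, pattern)
```

In this model the first pass mispredicts only the loop exit; once the trip count is recorded, the second pass predicts the exit correctly, which a plain bimodal counter would not.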

15
  • Cache Organization
  • Private L1 DCache and ICache: 32 KB each per core,
    8-way, 64 B line size, write-back (directory-based
    coherence)
  • Shared L2 cache: 8-way, 64 B line size (E8xxx)
  • Pros: potentially less bus traffic
  • Cons: longer access latency than a private L2
    cache
  • Potential conflicts between threads
  • FSB: 1333 MHz (E8xxx)
  • Memory disambiguation
  • Aggressive memory dependence speculation based on
    a hash table indexed by the load's EIP address
  • Watchdog mechanism
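
As a worked example of the L1 parameters above (32 KB, 8-way, 64 B lines), the number of sets and the address split can be computed directly. This sketch assumes power-of-two sizes and ignores the virtual/physical address distinction; the function names are illustrative.

```python
def cache_geometry(size_bytes: int, ways: int, line_bytes: int):
    """Return (number of sets, offset bits, index bits)."""
    sets = size_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1  # log2 for powers of two
    index_bits = sets.bit_length() - 1
    return sets, offset_bits, index_bits


def split_address(addr: int, offset_bits: int, index_bits: int):
    """Split an address into (tag, set index, byte offset within the line)."""
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset


sets, offset_bits, index_bits = cache_geometry(32 * 1024, 8, 64)
tag, index, offset = split_address(0x12345, offset_bits, index_bits)
```

With these numbers the L1 has 64 sets, so 6 address bits select the byte within a line and the next 6 bits select the set.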

16
  • Prediction Implementation
  • History table indexed by instruction pointer
  • Each entry in the history array holds a saturating
    counter
  • Once the counter saturates, disambiguation is
    possible for this load (taking effect from the next
    iteration): the load is allowed to go even if it
    meets unknown store addresses
  • When a particular load fails disambiguation,
    reset its counter
  • Each time a particular load is correctly
    disambiguated, increment its counter
17
Predictor Lookup
  • When a load is sent from the RS, set its
    disambiguation bit
  • Load Dispatch
  • If the load meets an older unknown store address,
    set "update"
  • If the prediction is "go", dispatch and set "done"
  • Otherwise the load is blocked
  • Prediction Verification
  • A store scans all previous loads in the load
    buffer; if a match is found, the "reset" bit is set
  • When the load commits, update the history
18
  • Execute Disable Bit Support
  • Counterparts: AMD Enhanced Virus Protection, ARM
    eXecute Never
  • Helps prevent buffer overflow attacks
  • Reduces the need for software patches against
    buffer overflow attacks
  • Segregates memory into regions that store either
    code or data
  • The processor disables code execution when
    malicious worms try to insert code into data
    buffers (requires OS support)
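
The effect of the Execute Disable bit can be illustrated with a toy page-permission check. This is only a conceptual sketch: the `Page` record, the fault class, and the fetch function are hypothetical, and the 4 KB page size is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Page:
    writable: bool
    no_execute: bool  # models the XD/NX bit in the page-table entry


class ExecuteDisabledFault(Exception):
    """Raised when an instruction fetch hits a no-execute page."""


def fetch_instruction(pages, addr: int, page_size: int = 4096) -> int:
    page = pages[addr // page_size]
    if page.no_execute:
        raise ExecuteDisabledFault(hex(addr))
    return addr  # fetch allowed


# The OS marks the code page executable and the data page no-execute
pages = {0: Page(writable=False, no_execute=False),
         1: Page(writable=True, no_execute=True)}

ok = fetch_instruction(pages, 0x0010)
try:
    fetch_instruction(pages, 0x1000)  # injected code in a data buffer
    blocked = False
except ExecuteDisabledFault:
    blocked = True
```

The check is only as good as the page markings, which is why OS support is required.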

19
  • Instruction Pointer Based Prefetcher
  • L1 DCache: 2 IP prefetchers/core
  • L1 ICache: 1 traditional prefetcher
  • L2 cache: 2 IP prefetchers
  • Predicts which memory address will be used next
    and delivers the data in time
  • Records every load's history, indexed by
    instruction pointer
  • IP history array
  • Parameters for prefetch traffic control are
    fine-tuned for different platforms
  • Prefetch monitor
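
One way an IP history array can predict the next address is by detecting a constant stride per load instruction. The sketch below assumes that stride detection is the prediction rule, which the slides do not spell out; the class name and the two-matching-strides threshold are illustrative choices.

```python
class IPStridePrefetcher:
    def __init__(self):
        self.history = {}  # load IP -> (last address, last stride)

    def access(self, ip: int, addr: int):
        """Record a load and return the predicted next address, or None."""
        prev = self.history.get(ip)
        if prev is None:
            self.history[ip] = (addr, 0)
            return None
        last_addr, last_stride = prev
        stride = addr - last_addr
        self.history[ip] = (addr, stride)
        # Predict only after the same non-zero stride is seen twice in a row
        if stride != 0 and stride == last_stride:
            return addr + stride
        return None


prefetcher = IPStridePrefetcher()
predictions = [prefetcher.access(0x40, a) for a in (1000, 1064, 1128, 1192)]
```

Because the history is keyed by instruction pointer, two loads that walk different arrays do not disturb each other's stride.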

20
(No Transcript)
21
References
  • Intel's Next Generation Microarchitecture
    Unveiled, by David Kanter, Real World
    Technologies
  • Intel Core Microarchitecture Briefing, by Stephen
    Smith and Bob Valentine, Intel
  • Inside Intel Core Microarchitecture: Setting New
    Standards for Energy-Efficient Performance, by
    Ofri Wechsler, Technology@Intel Magazine
  • Intel Core: A Next-Generation Microarchitecture,
    by Alan Zeichick, DevX
  • too many

22
Questions?