Intel Pentium M - PowerPoint PPT Presentation

About This Presentation
Title:

Intel Pentium M

Description:

Register Alias Table (RAT) consulted ... A delta register (ESPD) is maintained in the front end ... Execution Unit into the renamed register during stage 83 ... – PowerPoint PPT presentation

Slides: 75
Provided by: eecsHa

Transcript and Presenter's Notes

Title: Intel Pentium M


1
Intel Pentium M
2
Outline
  • History
  • P6 Pipeline in detail
  • New features
  • Improved Branch Prediction
  • Micro-ops fusion
  • Speed Step technology
  • Thermal Throttle 2
  • Power and Performance

3
Quick Review of x86
  • 8080 - 8-bit
  • 8086/8088 - 16-bit (8088 had 8-bit external data
    bus) - segmented memory model
  • 286 - introduction of protected mode, which
    included segment limit checking,
    privilege levels, read-only and execute-only
    segment options
  • 386 - 32-bit - segmented and flat memory
    model - paging
  • 486 - first pipeline - expanded the 386's ID
    and EX units into five-stage pipeline - first
    to include on-chip cache - integrated x87 FPU
    (before it was a coprocessor)
  • Pentium (586) - first superscalar - included
    two pipelines, u and v - virtual-8086 mode
    extensions - MMX soon after
  • Pentium Pro (686 or P6) - three-way superscalar
    - dynamic execution - out-of-order execution,
    branch prediction, speculative execution -
    very successful micro-architecture
  • Pentium 2 and 3 - both P6
  • Pentium 4 - new NetBurst architecture
  • Pentium M - enhanced P6

4
Pentium Pro Roots
  • NexGen 586 (1994)
  • Decomposes IA32 instructions into simpler
    RISC-like operations (R-ops or micro-ops)
  • Decoupled Approach
  • NexGen bought by AMD
  • AMD K5 (1995) also used micro-ops
  • Intel Pentium Pro
  • Intel's first use of decoupled architecture

5
Pentium-M Overview
  • Introduced March 12, 2003
  • Initially called Banias
  • Created by Israeli team
  • Missed deadline by less than 5 days
  • Marketed with Intel's Centrino Initiative
  • Based on P6 microarchitecture

6
P6 Pipeline in a Nutshell
  • Divided into three clusters (front, middle, back)
  • In-order Front-End
  • Out-of-order Execution Core
  • Retirement
  • Each cluster is independent
  • E.g. if a mispredicted branch is detected in the
    front-end, the front-end will flush and re-fetch
    from the corrected branch target, all while the
    execution core continues working on previous
    instructions

7
P6 Pipeline in a Nutshell
8
(No Transcript)
9
P6 Front-End
  • Major units: IFU, ID, RAT, Allocator, BTB, BAC
  • Fetching (IFU)
  • Includes I-cache, I-streaming cache, ITLB, ILD
  • No pre-decoding
  • Boundary markings by instruction-length decoder
    (ILD)
  • Branch Prediction
  • Predicted (speculative) instructions are marked
  • Decoding (ID)
  • Conversion of instructions (macro-ops) into
    micro-ops
  • Allocation of buffer entries: RS, ROB, MOB

10
P6 Execution Core
  • Reservation Station (RS)
  • Waiting micro-ops ready to go
  • Scheduler
  • Out-of-order Execution of micro-ops
  • Independent execution units (EU)
  • Must be careful about out-of-order memory access
  • Memory ordering buffer (MOB) interfaces to the
    memory subsystem
  • Requirements for execution
  • Available operands, EU, and write-back bus
  • Optimal performance

11
P6 Retirement
  • In-order updating of architected machine state
  • Re-order buffer (ROB)
  • Micro-op retirement: all or none
  • Architecturally illegal to retire only part of an
    IA-32 instruction
  • In-order handling of exceptions
  • Legal to handle mid-execution, but illegal to
    handle mid-retirement

12
PM Changes to P6
  • Most changes made in P6 front-end
  • Added and expanded on P4 branch predictor
  • Micro-ops fusion
  • Addition of dedicated stack engine
  • Pipeline length
  • Longer than P3, shorter than P4
  • Accommodates extra features above

13
PM Changes to P6, cont.
  • Intel has not released the exact length of the
    pipeline.
  • Known to be somewhere between the P4 (20
    stages) and the P3 (10 stages). Rumored to be 12
    stages.
  • Trades off slightly lower clock frequencies (than
    P4) for better performance per clock and smaller
    branch misprediction penalties

14
Blue Man Group Commercial Break
15
Banias
  • 1st version
  • 77 million transistors, 23 million more than P4
  • 1 MB on die Level 2 cache
  • 400 MHz FSB (quad pumped 100 MHz)
  • 130 nm process
  • Frequencies between 1.3 and 1.7 GHz
  • Thermal Design Point of 24.5 watts

http://www.intel.com/pressroom/archive/photos/centrino.htm
16
Dothan
  • Launched May 10, 2004
  • 140 million transistors
  • 2 MB Level 2 cache
  • 400 or 533 MHz FSB
  • Frequencies from 1.0 to 2.26 GHz
  • Thermal Design Point of 21 (400 MHz FSB) to 27
    watts

http://www.intel.com/pressroom/archive/photos/centrino.htm
17
Dothan cont.
  • 90 nm process technology on 300 mm wafers
  • 300 mm wafers provide roughly twice the capacity
    of 200 mm wafers, while the smaller process
    dimensions double the transistor density
  • Gate dimensions are 50 nm, or approximately half
    the diameter of the influenza virus
  • P and n gate voltages are reduced by enhancing
    the carrier mobility of the Si lattice by 10-20%
  • Draws less than 1 W average power

18
Bus
  • Utilizes a split transaction deferred reply
    protocol
  • 64-bit width
  • Delivers up to 3.2 GB/s (Banias) or 4.2 GB/s
    (Dothan) in and out of the processor
  • Utilizes source synchronous transfer of addresses
    and data
  • Data transferred 4 times per bus clock
  • Addresses can be delivered 2 times per bus clock
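The peak bandwidth figures on this slide follow from bus width, transfers per clock, and bus clock. A minimal sketch of that arithmetic, worked in bytes per second; the function name and the 133.33 MHz base clock used for the 533 MHz FSB are illustrative assumptions:

```python
def fsb_bandwidth_gb_per_s(bus_clock_mhz, width_bits=64, transfers_per_clock=4):
    """Peak data bandwidth of a quad-pumped front-side bus in GB/s."""
    bytes_per_transfer = width_bits / 8
    transfers_per_second = bus_clock_mhz * 1e6 * transfers_per_clock
    return transfers_per_second * bytes_per_transfer / 1e9


banias = fsb_bandwidth_gb_per_s(100.0)    # 400 MHz effective FSB
dothan = fsb_bandwidth_gb_per_s(133.33)   # 533 MHz effective FSB (assumed base clock)
print(f"Banias: {banias:.2f} GB/s, Dothan: {dothan:.2f} GB/s")
```

The 533 MHz case works out to slightly more than the 4.2 quoted on the slide because the slide truncates the result.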

19
  • Bus update in Dothan
  • http://www.intel.com/technology/itj/2005/volume09issue01/art05_perf_power

20
L1 Cache
  • 64KB total
  • 32 K instruction
  • 32 K data (4 times P4M)
  • Write-back vs. write-through on P4
  • In a write-through cache, data is written to both
    L1 and main memory simultaneously
  • In a write-back cache, data is written to L1 only
    and reaches main memory when the line is evicted,
    increasing speed by reducing the number of slow
    memory writes
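The difference between the two write policies can be illustrated with a toy model that only counts writes reaching main memory. This is a sketch under simplifying assumptions (one value per address, explicit eviction); the class and function names are hypothetical and not taken from the slides:

```python
class ToyCache:
    """Counts main-memory writes under a write-through or write-back policy."""

    def __init__(self, write_back):
        self.write_back = write_back
        self.lines = {}        # address -> cached value
        self.dirty = set()     # addresses modified but not yet in memory
        self.memory_writes = 0

    def write(self, addr, value):
        self.lines[addr] = value
        if self.write_back:
            self.dirty.add(addr)       # defer the memory update
        else:
            self.memory_writes += 1    # write-through: update memory now

    def evict(self, addr):
        if addr in self.dirty:         # write-back: flush the dirty line once
            self.dirty.discard(addr)
            self.memory_writes += 1
        self.lines.pop(addr, None)


def count_memory_writes(write_back, writes=100):
    cache = ToyCache(write_back)
    for i in range(writes):
        cache.write(0x1000, i)         # repeated writes to the same line
    cache.evict(0x1000)
    return cache.memory_writes
```

With 100 writes to the same address, the write-through model performs 100 memory writes while the write-back model performs one, at eviction.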

21
L2 cache
  • 1 MB (Banias) or 2 MB (Dothan)
  • 8-way set associative
  • Each set is divided into 4 separate power
    quadrants.
  • Each individual power quadrant can be set to a
    sleep mode, shutting off power to those quadrants
  • Allows for only 1/32 of cache to be powered at
    any time
  • Increased latency vs. improved power consumption

22
Prefetch
  • Prefetch logic fetches data to the level 2 cache
    before L1 cache requests occur
  • Reduces compulsory misses due to an increase of
    valid data in cache
  • Reduces bus cycle penalties

23
Schedule
  • P6 Pipeline in detail
  • Front-End
  • Execution Core
  • Back-End
  • Power Issues
  • Intel SpeedStep
  • Testing the Features
  • x86 system registers
  • Performance Testing

24
P6 Front-end Instruction Fetching
  • IA-32 Memory Management
  • Classic segmented model (cannot be disabled in
    protected mode)
  • Separation of code, data, and stack into
    "segments"
  • Optional paging
  • Segments divided into pages (typically 4KB)
  • Additional protection to segment-protection
  • E.g. provides read-write protection on a
    page-by-page basis
  • Stage 11 (stage 1) - Selection of address for
    next I-cache access
  • Speculation address chosen from competing
    sources (i.e. BTB, BAC, loop detector, etc.)
  • Calculation of linear address from logical
    (segment selector:offset)
  • Segment selector indexes into a table of segment
    descriptors, which include base address, size,
    type, and access rights of the segment
  • Remember: only six segment registers, so only six
    segments are usable at a time
  • 32-bit code nowadays uses flat model, so OS can
    make do with only a few (typically four) segments
  • IFU chooses address with highest priority and
    sends it to stage two
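The logical-to-linear address calculation described on this slide can be sketched as a table lookup plus an add. This is a simplified model: the descriptor holds only a base and a limit, and the `SegmentDescriptor` and `linear_address` names are hypothetical. Real descriptors also carry type and access-right fields, which are omitted here:

```python
from dataclasses import dataclass


@dataclass
class SegmentDescriptor:
    base: int    # linear address where the segment starts
    limit: int   # highest valid offset within the segment


def linear_address(descriptor_table, selector_index, offset):
    """Translate (segment selector, offset) into a 32-bit linear address."""
    desc = descriptor_table[selector_index]
    if offset > desc.limit:
        # The hardware would raise a protection fault here.
        raise ValueError("offset exceeds segment limit")
    return (desc.base + offset) & 0xFFFFFFFF


# Flat model: one segment covering the whole 4 GB space, so linear == offset.
flat_table = [SegmentDescriptor(base=0, limit=0xFFFFFFFF)]
# Classic segmented model: a 64 KB segment starting at 0x10000.
segmented_table = [SegmentDescriptor(base=0x10000, limit=0xFFFF)]
```

In the flat model the segment base is zero, which is why a 32-bit OS can get by with only a handful of segments.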

25
P6 Front-end Instruction Fetching
  • Stage 12-13 - Accessing of caches
  • Accesses instruction caches with address
    calculated in stage one
  • Includes standard cache, victim cache, and
    streaming buffer
  • With paging, consults ITLB to determine physical
    page number (tag bits)
  • Without paging, linear address from stage one
    becomes physical address
  • Obtains branch prediction from branch target
    buffer (BTB)
  • BTB takes two cycles to complete one access
  • Instruction boundary (ILD) and BTB markings
  • Stage 14 - Completion of instruction cache access
  • Instructions and their marks are sent to
    instruction buffer or steered to ID

26
P6 Front-end Instruction Fetching
27
P6 Front-end Instruction Decoding
  • Stage 15-16 - Decoding of IA32 Instructions
  • Alignment of instruction bytes
  • Identification of the ends of up to three
    instructions
  • Conversion of instructions into micro-ops
  • Stage 17 - Branch Decoding
  • If the ID notices a branch that went unpredicted
    by the BTB (i.e. if the BTB had never seen the
    branch before), flushes the in-order pipe, and
    re-fetches from the branch target
  • Branch target calculated by BAC
  • Early catch saves speculative instructions from
    being sent through the pipeline
  • Stage 21 - Register Allocation and Renaming
  • Synonymous with stage 17 (a reminder of
    independent working units)
  • Allocator used to allocate required entries in
    ROB, RS, LB, and SB
  • Register Alias Table (RAT) consulted
  • Maps logical sources/destinations to physical
    entries in the ROB (or sometimes RRF)
  • Stage 22 - Completion of Front-End
  • Marked micro-ops are forwarded to RS and ROB,
    where they await execution and retirement,
    respectively.

28
P6 Front-end Instruction Decoding
29
Register Alias Table Introduction
  • Provides register renaming of integer and
    floating-point registers and flags
  • Maps logical (architected) entries to physical
    entries usually in the re-order buffer (ROB)
  • Physical entries are actually allocated by the
    Allocator
  • The physical entry pointers become a part of the
    micro-op's overall state as it travels through
    the pipeline

30
RAT Details
  • P6 is 3-way super-scalar, so the RAT must be able
    to rename up to six logical sources per cycle
  • Any data dependences must be handled
  • Example:
    op1) ADD EAX, EBX, ECX (dest. EAX)
    op2) ADD EAX, EAX, EDX
    op3) ADD EDX, EAX, EDX
  • Instead of making op2 wait for op1 to retire, the
    RAT provides data forwarding
  • Same case for op3, but RAT must make sure that it
    gets the result from op2 and not op1
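The renaming behaviour in this example can be sketched with a small table that maps each logical register to the ROB entry of its most recent producer. This is an illustrative model only: the `rename` function and the `ROBn` / `RRF.reg` labels are hypothetical, and real hardware renames three micro-ops in parallel rather than in a loop:

```python
def rename(ops):
    """Rename (dest, src1, src2) micro-ops against a register alias table."""
    rat = {}       # logical register -> index of the ROB entry that will hold it
    renamed = []
    for rob_entry, (dest, src1, src2) in enumerate(ops):
        # A source with a live mapping reads the producer's ROB entry;
        # otherwise it reads the architected value from the RRF.
        sources = tuple(
            f"ROB{rat[src]}" if src in rat else f"RRF.{src}"
            for src in (src1, src2)
        )
        renamed.append((f"ROB{rob_entry}",) + sources)
        rat[dest] = rob_entry   # later readers of dest now see this op
    return renamed


ops = [
    ("EAX", "EBX", "ECX"),   # op1
    ("EAX", "EAX", "EDX"),   # op2 reads op1's EAX
    ("EDX", "EAX", "EDX"),   # op3 must read op2's EAX, not op1's
]
for line in rename(ops):
    print(line)
```

Because the table is updated after every micro-op, op3 picks up the mapping written by op2, which is the ordering requirement the slide points out.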

31
RAT Implementation Difficulties
  • Speculative Renaming
  • Since speculative micro-ops flow by, the RAT must
    be able to undo its mappings in the case of a
    branch misprediction
  • Partial-width register reads and writes
  • Consider a partial-width write followed by a
    larger-width read
  • Data required by the read is a combination of
    multiple previous writes to the register; to
    guarantee correctness, the RAT must stall the
    pipeline
  • Retirement Overrides
  • Common interaction between RAT and ROB
  • When a micro-op retires, its ROB entry is removed
    and its result may be latched into an architected
    destination register
  • If any active micro-ops source the retired op's
    destination, they must not reference the outdated
    ROB entry
  • Mismatch stalls
  • Associated with flag renaming

32
The Allocator
  • Works in conjunction with RAT to allocate
    required entries
  • In each cycle, assumes three ROB, RS, and LB
    entries and two SB entries will be needed
  • Once micro-ops arrive, it determines how many
    entries are really needed
  • ROB Allocation
  • If three entries aren't available the allocator
    will stall
  • RS Allocation
  • A bitmap is used to determine which entries are
    free
  • If the RS is full, pipeline is stalled
  • RS must make sure valid entries are not
    overwritten
  • MOB Allocation
  • Allocation of LB and SB entries also done by
    allocator
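The bitmap-based RS allocation mentioned on this slide might look roughly like the following sketch. The class name is hypothetical and the default of 20 entries is an assumption for illustration; the real allocator handles several micro-ops per cycle:

```python
class ReservationStationAllocator:
    """Tracks free reservation-station entries with a bitmap."""

    def __init__(self, entries=20):
        self.entries = entries
        self.busy = 0   # bit i set means entry i holds a valid micro-op

    def allocate(self):
        """Return a free entry index, or None if the RS is full (stall)."""
        for i in range(self.entries):
            if not (self.busy >> i) & 1:
                self.busy |= 1 << i    # mark the entry valid so it is not overwritten
                return i
        return None

    def free(self, index):
        """Release an entry once its micro-op has been dispatched."""
        self.busy &= ~(1 << index)
```

A `None` result corresponds to the stall condition on the slide: the front end has to wait until some entry is released.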

33
PM Changes to P6 Front-End
  • Micro-op fusion
  • Dedicated Stack Engine
  • Enhanced branch prediction
  • Additional stages
  • Intel's secret
  • Most likely required for extra functionality
    above

34
Micro-ops Fusion
  • Fusion of multiple micro-ops into one micro-op
  • Less contention for buffer entries
  • Similarity to SIMD data packing
  • Two examples of fusion from Intel documentation
  • IA32 load-and-operate and store instructions
  • Not known for certain whether these are the only
    cases of fusion
  • Possibly inspired by MacroOps used in K7 (Athlon)

35
Dedicated Stack Engine
  • Traditional out-of-order implementations update
    the Stack Pointer Register (ESP) by sending a µop
    to update the ESP register with every stack
    related instruction
  • Pentium M implementation
  • A delta register (ESPD) is maintained in the
    front end
  • A historic ESP (ESPO) is then kept in the
    out-of-order execution core
  • Dedicated logic was added to update the ESP by
    adding the ESPO with the ESPD

36
Improvements
  • The ESPO value kept in the out-of-order machine
    is not changed during a sequence of stack
    operations, which allows more parallelism
    opportunities to be realized
  • Since ESPD updates are now done by a dedicated
    adder, the execution unit is now free to work on
    other µops and the ALUs are freed to work on
    more complex operations
  • Decreased power consumption since large adders
    are not used for small operations and the
    eliminated µops do not toggle through the machine
  • Approximately 5% of the µops have been eliminated

37
Complications
  • Since the new adder lives in the front end, all of
    its calculations are speculative. This
    necessitates the addition of a recovery table for
    all values of ESPO and ESPD
  • If the architectural value of ESP is needed
    inside of the out-of-order machine the decode
    logic then needs to insert a µop that will carry
    out the ESP calculation
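The stack engine described on the preceding slides can be modelled as a front-end delta (ESPD) added to a base value held in the out-of-order core (ESPO). The sketch below is a loose illustration: the class and method names are hypothetical, 4-byte pushes and pops are assumed, and the speculative recovery table is left out:

```python
class StackEngine:
    """Toy model: architectural ESP = ESPO + ESPD."""

    def __init__(self, esp):
        self.espo = esp      # base value kept in the out-of-order core
        self.espd = 0        # delta maintained by the front end
        self.sync_uops = 0   # extra micro-ops inserted to materialise ESP

    def push(self):
        self.espd -= 4                  # handled by the dedicated adder
        return self.espo + self.espd    # address used by the store

    def pop(self):
        address = self.espo + self.espd  # address used by the load
        self.espd += 4
        return address

    def read_esp(self):
        """The core needs the architectural ESP: fold the delta into ESPO."""
        if self.espd != 0:
            self.espo += self.espd
            self.espd = 0
            self.sync_uops += 1          # the inserted ESP-calculation micro-op
        return self.espo
```

Note that `espo` stays constant across a run of pushes and pops; only an explicit use of ESP costs a micro-op in this model.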

38
Branch Prediction
  • Longer pipelines mean higher penalties for
    mispredicted branches
  • Improvements result in added performance and
    hence less energy spent per instruction retired

39
Branch Prediction in Pentium M
  • Enhanced version of Pentium 4 predictor
  • Two branch predictors added that run in tandem
    with P4 predictor
  • Loop detector
  • Indirect branch detector
  • 20% lower misprediction rate than PIII, resulting
    in up to 7% gain in real performance

40
Branch Prediction
Based on diagram found here: http://www.cpuid.org/reviews/PentiumM/index.php
41
Loop Detector
  • A predictor that always predicts a loop branch
    as taken will always mispredict the last
    iteration
  • Detector analyzes branches for loop behavior
  • Benefits a wide variety of program types
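The idea behind the loop detector can be sketched by learning how many times a loop branch is taken before it falls through, then predicting the exit at that count. This is only an illustration of the principle with hypothetical names; it is not the actual Pentium M hardware structure:

```python
class LoopDetector:
    """Learns a loop's trip count and predicts the exit iteration."""

    def __init__(self):
        self.trip_count = None   # taken outcomes seen before the last exit
        self.current = 0         # taken outcomes seen in the current run

    def predict(self):
        if self.trip_count is not None and self.current == self.trip_count:
            return False         # predict not-taken: the loop is about to exit
        return True

    def update(self, taken):
        if taken:
            self.current += 1
        else:
            self.trip_count = self.current   # remember the length of this run
            self.current = 0


def count_mispredictions(outcomes, use_loop_detector):
    detector = LoopDetector()
    wrong = 0
    for taken in outcomes:
        prediction = detector.predict() if use_loop_detector else True
        if prediction != taken:
            wrong += 1
        detector.update(taken)
    return wrong


# A loop whose branch is taken 3 times and then exits, executed 10 times.
outcomes = [True, True, True, False] * 10
```

On this trace an always-taken predictor misses every exit, while the detector misses only the first one, before it has learned the trip count.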

http://www.intel.com/technology/itj/2003/volume07issue02/art03_pentiumm/p05_branch.htm
42
Indirect Branch Predictor
  • Picks targets based on global flow control
    history
  • Benefits programs compiled to branch to
    calculated addresses

http://www.intel.com/technology/itj/2003/volume07issue02/art03_pentiumm/p05_branch.htm
43
Reservation Station
  • Used as a store for µops to wait for their
    operands and execution units to become available
  • Consists of 20 entries
  • Control portion of the entry can be written to
    from one of three ports
  • Data portion can be written to from one of 6
    available ports
  • 3 for ROB
  • 3 for EU write backs
  • Scheduler then uses this to schedule up to 5 µops
    at a time
  • During pipeline stage 31 entries that are ready
    for dispatch are then sent to stage 32

44
Cancellation
  • Reservation Station assumes that all cache
    accesses will be hits
  • In the case of a cache miss, micro-ops that are
    dependent on the write-back data need to be
    cancelled and rescheduled at a later time
  • Can also occur due to a future resource conflict

45
Retirement
  • Takes 2 clock cycles to complete
  • Utilizes reorder buffer (ROB) to control
    retirement or completion of µops
  • ROB is a multi-ported register file with separate
    ports for
  • Allocation time writes of µop fields needed at
    retirement
  • Execution Unit write-backs
  • ROB reads of sources for the Reservation Station
  • Retirement logic reads of speculative result data
  • Consists of 40 entries with each entry 157 bits
    wide
  • The ROB participates in
  • Speculative execution
  • Register renaming
  • Out-of-order execution

46
Speculative Execution
  • Buffers results of the execution unit before
    commit
  • Allows maximum rate for fetch and execute by
    assuming that branch prediction is perfect and no
    exceptions have occurred
  • If a misprediction occurs
  • Speculative results stored in the ROB are
    immediately discarded
  • Microengine will restart by examining the
    committed state in the ROB

47
Register Renaming
  • Entries in the ROB that will hold the results of
    speculative µops are allocated during stage 21 of
    the pipeline
  • In stage 22 the sources for the µops are
    delivered based upon the allocation in stage 21.
  • Data is written to the ROB by the Execution Unit
    into the renamed register during stage 83

48
Out-of-order Execution
  • Allows µops to complete and write back their
    results without concern for other µops executing
    simultaneously
  • The ROB reorders the completed µops into the
    original sequence and updates the architectural
    state
  • Entries in ROB are treated as FIFO during
    retirement
  • µops are originally allocated in sequential order
    so the retirement will also follow the original
    program order
  • Happens during pipeline stages 92 and 93
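The FIFO retirement behaviour described on this slide can be sketched with a queue whose head may only leave once its result has been written back. The names are hypothetical, and the default size of 40 entries and the retirement width of three µops per cycle are assumptions made for illustration:

```python
from collections import deque


class ReorderBuffer:
    """Toy ROB: out-of-order write-back, in-order retirement."""

    def __init__(self, size=40):
        self.size = size
        self.entries = deque()   # allocated in program order

    def allocate(self, name):
        if len(self.entries) >= self.size:
            return False         # ROB full: the allocator must stall
        self.entries.append({"name": name, "done": False})
        return True

    def write_back(self, name):
        """An execution unit reports a result, in any order."""
        for entry in self.entries:
            if entry["name"] == name:
                entry["done"] = True

    def retire(self, width=3):
        """Retire completed micro-ops from the head, oldest first."""
        retired = []
        while self.entries and self.entries[0]["done"] and len(retired) < width:
            retired.append(self.entries.popleft()["name"])
        return retired
```

A µop that finishes early simply waits in the buffer until everything older than it has retired, which is how the original program order is restored.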

49
Exception Handling
  • Events are sent to the ROB by the EU during stage
    83
  • Results sent to the ROB from the Execution Unit
    are speculative results, therefore any exceptions
    encountered may not be real
  • If the ROB determines that branch prediction was
    incorrect it inserts a clear signal at the point
    just before the retirement of this operation and
    then flushes all the speculative operations from
    the machine
  • If speculation is correct, the ROB will invoke
    the correct microcode exception handler
  • All event records are saved to allow the handler
    to repair the result or invoke the correct macro
    handler
  • Pointers for the macro and micro instructions are
    also needed to allow the program to resume after
    completion by the event handler
  • If the ROB retires an operation that faults, both
    the in-order and out-of-order sections are
    cleared. This happens during pipeline stages 93
    and 94

50
Memory Subsystem
  • Memory Ordering Buffer (MOB)
  • Execution is out-of-order, but memory accesses
    cannot just be done in any order
  • Contains mainly the LB and the SB
  • Speculative loads and stores
  • Not all loads can be speculative
  • E.g. a memory-mapped I/O load could have
    unrecoverable side effects
  • Stores are never speculative (can't get back
    overwritten bits)
  • But to improve performance, stores are queued in
    the store buffer (SB) to allow pending loads to
    proceed
  • Similar to a write-back cache
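A store buffer that lets loads proceed while stores are still queued can be sketched as below. This is a toy model with hypothetical names; forwarding the youngest matching store to a later load is an assumption added here to keep the model correct, and is not something the slide states:

```python
class StoreBuffer:
    """Queues stores in program order; loads do not wait for them to drain."""

    def __init__(self, memory):
        self.memory = memory   # dict standing in for the memory subsystem
        self.pending = []      # (address, value) pairs, oldest first

    def store(self, address, value):
        self.pending.append((address, value))   # memory is not touched yet

    def load(self, address):
        # Assumed behaviour: take the youngest queued store to this address.
        for queued_address, value in reversed(self.pending):
            if queued_address == address:
                return value
        return self.memory.get(address, 0)

    def drain_one(self):
        """Commit the oldest queued store to memory."""
        if self.pending:
            address, value = self.pending.pop(0)
            self.memory[address] = value
```

Stores only reach memory when drained, in order, which mirrors the comparison with a write-back cache on the slide.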

51
Schedule
  • P6 Pipeline in detail
  • Front-End
  • Execution Core
  • Back-End
  • Power Issues
  • Intel SpeedStep
  • Testing the Features
  • x86 system registers
  • Performance Testing

52
Power Issues
  • Power use = a × C × V² × F
  • a = activity factor
  • C = effective capacitance
  • V = voltage
  • F = operating frequency
  • Power use can be reduced linearly by lowering
    frequency and capacitance and quadratically by
    scaling voltage
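A quick numeric check of the scaling claims on this slide, using normalised values rather than real Pentium M figures (the function name and the unit inputs are illustrative assumptions):

```python
def dynamic_power(activity, capacitance, voltage, frequency):
    """Dynamic power: activity factor x capacitance x voltage squared x frequency."""
    return activity * capacitance * voltage ** 2 * frequency


baseline = dynamic_power(1.0, 1.0, 1.0, 1.0)
half_frequency = dynamic_power(1.0, 1.0, 1.0, 0.5)   # linear: half the power
half_voltage = dynamic_power(1.0, 1.0, 0.5, 1.0)     # quadratic: a quarter
both_halved = dynamic_power(1.0, 1.0, 0.5, 0.5)      # combined: one eighth
print(half_frequency / baseline, half_voltage / baseline, both_halved / baseline)
```

Lowering voltage and frequency together is what makes a combined voltage/frequency scheme such as SpeedStep attractive: the savings multiply.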

53
Mobile Use
  • Mobile use is bursty: full power is only necessary
    for brief periods
  • Intel developed SpeedStep technology to take
    advantage of this fact and reduce power
    consumption during periods of inactivity

http://www.intel.com/technology/itj/2003/volume07issue02/art05_power/p05_thermal.htm
54
SpeedStep I and II
  • SpeedStep I and II used in previous generations
  • Only two states
  • High performance (High frequency mode)
  • Lower power use (Low frequency mode)
  • Problems
  • Slow transition times
  • Limited opportunity for optimization

55
Pentium M Goals
  • Optimize for performance when plugged in
  • Optimize for long battery-life when unplugged

56
SpeedStep III
  • Optimized to fix limitations of previous
    generations
  • Three innovations
  • Voltage-Frequency switching separation
  • Clock partitioning and recovery
  • Event blocking

The 6 states of the Pentium M 1.6 GHz
57
Voltage-Frequency switching separation
  • Voltage scaling is stepped up and down
    incrementally
  • This prevents clock noise and allows the
    processor to remain responsive during transition
  • Once voltage target is reached, frequency is
    throttled

http://www.intel.com/technology/itj/2003/volume07issue02/art03_pentiumm/p10_speedstep.htm
58
Clock partitioning and recovery
  • During transition, only the core clock and
    phase-locked-loop are stopped
  • This keeps logic active even while the clock is
    stopped

http://www.intel.com/technology/itj/2003/volume07issue02/art03_pentiumm/p10_speedstep.htm
59
Event blocking
  • To prevent loss of events during frequency and
    voltage scaling when the core clock is stopped,
    interrupts, pin events, and snoop requests are
    sampled and saved
  • These events are retransmitted once the core
    clock becomes available

http://www.intel.com/technology/itj/2003/volume07issue02/art03_pentiumm/p10_speedstep.htm
60
Leakage
  • Transistors in off state still draw current
  • As transistors shrink and clock speed increases,
    transistors leak more current causing higher
    temperatures and more power use

61
Strained Silicon
http://www.research.ibm.com/resources/press/strainedsilicon/
62
Benefits of Strained Silicon
  • Electrons flow up to 70% faster due to reduced
    resistance
  • This leads to chips which are up to 35% faster,
    without a decrease in chip size
  • Intel's "uni-axial" strained silicon process
    reduces leakage by at least five times without
    reducing performance; the 65nm process will
    realize another reduction of at least four times

63
High-K Transistor Gate Dielectric (coming soon)
  • The dielectric used since the 1960s, silicon
    dioxide, is so thin now that leakage is a
    significant problem
  • A high-k (high dielectric constant) material has
    been developed by Intel to replace silicon
    dioxide
  • This high-k material reduces leakage by a factor
    of 100 below silicon dioxide

64
More Advances to Expect
  • Continued lowering of capacitance has helped
    reduce power consumption
  • Tri-gate transistors decrease leakage by
    increasing the amount of surface area for
    electrons to flow through

65
Schedule
  • P6 Pipeline in detail
  • Front-End
  • Execution Core
  • Back-End
  • Power Issues
  • Intel SpeedStep
  • Testing the Features
  • x86 system registers
  • Performance Testing

66
x86 System Registers
  • EFLAGS
  • Various system flags
  • CPUID
  • Exposes type and available features of processor
  • Model Specific Registers (MSRs)
  • rdmsr and wrmsr
  • Examples
  • Enabling/Disabling SpeedStep
  • Determining and changing voltage/frequency points
  • More
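As one way to try the MSR examples above on Linux, the current voltage/frequency point can be read through the `msr` driver's `/dev/cpu/N/msr` device (root access required). The sketch below rests on several assumptions that should be checked against Intel's documentation: that the operating point is reported in MSR 0x198 (IA32_PERF_STATUS), that bits 15:8 hold the bus ratio and bits 7:0 the voltage ID, and that the voltage is 700 mV plus 16 mV per step, which is the encoding used by the Linux speedstep-centrino driver. The function names are hypothetical:

```python
import os
import struct

IA32_PERF_STATUS = 0x198   # assumed MSR holding the current operating point


def read_msr(msr, cpu=0):
    """Read a 64-bit MSR via the Linux msr driver (needs root and 'modprobe msr')."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        # The file offset selects the MSR number; each MSR is 8 bytes.
        return struct.unpack("<Q", os.pread(fd, 8, msr))[0]
    finally:
        os.close(fd)


def decode_perf_state(value, bus_mhz=100):
    """Decode (frequency in MHz, core voltage in mV) from the low 16 bits."""
    ratio = (value >> 8) & 0xFF   # bus-clock multiplier (assumed field layout)
    vid = value & 0xFF            # voltage ID (assumed field layout)
    return ratio * bus_mhz, 700 + 16 * vid


# On a real system: decode_perf_state(read_msr(IA32_PERF_STATUS))
print(decode_perf_state(0x1031))
```

The decoding is kept separate from the device access so it can be exercised without special privileges.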

67
Performance Testing
  • P4 2.2GHz vs. PM 1.6GHz

68
Benchmark
69
Battery Life
70
Pentium M vs AMD Turion
71
(No Transcript)
72
Gaming
73
Battery Life
74
Future Processors
  • Yonah
  • Dual-core processor
  • Manufactured on a 65 nm process
  • Starting at 2.16GHz with a 667 MHz FSB (166MHz
    quad-pumped)
  • Shared 2MB L2 cache
  • Increased floating point performance with SSE3
    instructions
  • Merom
  • Based on EM64T ISA
  • Consumes 0.5 W of power, half of what the Dothan
    consumes
  • Possibility of laptops with 10 hours of battery
    life