1
Microarchitecture: Hows, and Whys!
  • Saeed Beiki
  • SIC, Intel Innovation Center (IIC)
  • Dubai, Dubai Internet City (DIC), Microprocessor
    TechZone Staff Congress,
  • November 2007

2
Who Am I
  • A Senior Information Contributor (SIC) at Intel
    Innovation Center, Dubai
  • Involved in other fields that aren't related to
    the present subject

3
Agenda
  • Memory Subsystem
  • Segmented and Harvard Models
  • Virtual Addresses and TLB
  • General Principles
  • Five Steps: IA, ID, EX, DA, WB
  • Instruction Lifecycle
  • RISC vs. CISC Architecture
  • Register-Register vs. Register-Memory
    Architecture
  • ILP
  • What Is ILP?
  • Pipelining
  • Pipeline Hazards
  • Block-Instructions, Pipeline Bubbles and Stall
    Cycles
  • Locality of Reference
  • Data Forwarding
  • Limits
  • Superscalar Processing
  • Single-Issue vs. Multi-Issue Architecture
  • Limits

4
Agenda (cont.)
  • DLP
  • What Is DLP?
  • SIMD and MIMD vs. SISD
  • SIMD Streaming Types
  • Limits
  • Branch Prediction
  • What Is BP?
  • Types
  • Static
  • Dynamic (BTB/BHT)
  • BHB
  • BTB
  • RAB
  • Limits
  • OOO (Out-of-Order) or Dynamic Execution
  • Speculation phenomenon
  • Register-Renaming and RAT
  • ROB
  • RLSB

5
Agenda (cont.)
  • Evolutions (selected)
  • Drive Stages
  • GEHL
  • Profile Propagation
  • Overall Problems
  • Instruction Fetch & Data Access latency
  • Wasted-Cycles on Deeply-Pipelined Architectures
  • Low-rated Entropy
  • SupILP-DLP
  • My Proposals
  • Sequential Speculation
  • Custom Storage
  • Line Splitter Strategy
  • QA
  • Estimated Time: 2:15 hours

6
Memory Subsystem
7
Segmented and Harvard model
  • The memory is split into exclusive parts
    (segments), one for each kind of computing
    resource
  • CODE: Text (or Code)
  • DATA: Stack, Heap, BSS, Data
  • Effect of Register Length and Register State
  • Registers and Base Locations and Indexing
  • Properties of segments (sketched in the code
    below)
  • CODE: Instructions, RO (read-only)
  • DATA:
  • Data: Global, Static, Initialized, Writable,
    Fixed
  • BSS: Global, Static, Uninitialized, Writable,
    Fixed
  • Heap: Other Data Types, RunTime-Alloc, Writable,
    Non-Fixed
  • Stack: Other Data Types, Temporary, Writable,
    Abstract, Non-Fixed
  • LIFO or FIFO Types
  • Harvard model?
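
A minimal C sketch of the segment layout above: it
prints the addresses of objects the toolchain places
in each segment. The exact addresses and their
ordering are platform-dependent and purely
illustrative.

    /* Segmented memory model: one object per segment. */
    #include <stdio.h>
    #include <stdlib.h>

    int initialized_global = 42;   /* DATA: global, static, initialized */
    int uninitialized_global;      /* BSS: global, static, zero-filled  */

    void code_example(void) {}     /* CODE (Text): instructions, RO     */

    int main(void) {
        int stack_local = 0;                      /* Stack: temporary     */
        int *heap_obj = malloc(sizeof *heap_obj); /* Heap: run-time alloc */

        printf("text : %p\n", (void *)code_example);
        printf("data : %p\n", (void *)&initialized_global);
        printf("bss  : %p\n", (void *)&uninitialized_global);
        printf("heap : %p\n", (void *)heap_obj);
        printf("stack: %p\n", (void *)&stack_local);

        free(heap_obj);
        return 0;
    }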

8
Virtual Address
  • Access State
  • Object Hierarchy
  • Translation Look-aside Buffer (TLB)
  • Memory
  • Disk
  • TLB-Miss
  • Memory-Miss
  • Disk-Failure
  • TLB Size Calculation (worked example below)
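
One way to make the size calculation concrete: the
memory a TLB can map without missing (its reach) is
entries x page size. The numbers below are assumed
for illustration, not taken from any particular part.

    /* TLB reach = entry count * page size (illustrative numbers). */
    #include <stdio.h>

    int main(void) {
        const unsigned entries   = 64;    /* assumed TLB entry count */
        const unsigned page_size = 4096;  /* 4 KiB pages             */
        printf("TLB reach = %u KiB\n", entries * page_size / 1024);
        return 0;  /* 64 entries * 4 KiB = 256 KiB before a miss */
    }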

9
General Principles
10
Five Key Steps in Instruction Processing
  • Any computer system must perform the following
    steps to execute an instruction completely and
    successfully (a toy model follows the list)
  • Instruction Access (IA)
  • Instruction Decode (ID)
  • Execution (EX)
  • Data Access (DA)
  • Write-Back (WB)
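
As a toy model of these five steps, here is a tiny C
interpreter whose loop fetches by PC, decodes an
opcode, executes, accesses data, and writes back. The
LOAD/ADD/STORE/HALT mini-ISA is invented purely for
illustration.

    /* Five steps in software: IA, ID, EX, DA, WB. */
    #include <stdio.h>

    enum { LOAD, ADD, STORE, HALT };
    typedef struct { int op, reg, addr; } Insn;

    int main(void) {
        int mem[4] = {5, 7, 0, 0}, regs[2] = {0, 0};
        Insn program[] = {
            {LOAD, 0, 0}, {LOAD, 1, 1}, {ADD, 0, 0},
            {STORE, 0, 2}, {HALT, 0, 0}
        };
        for (int pc = 0; ; pc++) {
            Insn i = program[pc];            /* IA: fetch via the PC   */
            switch (i.op) {                  /* ID: decode the opcode  */
            case LOAD:  regs[i.reg] = mem[i.addr];       break; /* DA+WB */
            case ADD:   regs[i.reg] = regs[0] + regs[1]; break; /* EX+WB */
            case STORE: mem[i.addr] = regs[i.reg];       break; /* DA    */
            case HALT:  printf("mem[2] = %d\n", mem[2]); return 0; /* 12 */
            }
        }
    }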

11
Instruction Access (IA)
  • Read from Memory Subsystem
  • Indicated by Program-Counter (PC)
  • Only Physical Addresses!? Yes, TLB!
  • Can be equalized with Instruction Fetch (IF)?
    Nope!
  • Latency is Alive
  • Latency in Translation
  • Latency in Reading / Fetching

12
Instruction Decode (ID)
  • Control Information
  • Operation Code (OPCode)
  • Immediate Data
  • Embedded
  • Consequence
  • Target Address
  • Instruction, e.g., branch instruction(s)
  • Data, e.g., load/store instruction(s)
  • Register Data Fetch
  • CISC to RISC conversion

13
Execute (EX)
  • Perspectives
  • Mathematical operations
  • Movement operations
  • Effective Address (EA)
  • Address Offsets
  • Indirect Memory Reference
  • Register Address-housing
  • Execution Units
  • Memory units
  • Arithmetic units
  • Execution Core?
  • Flush (back)

14
Data Access (DA)
  • Give your ticket, take your data!
  • Address Bus
  • Data Bus
  • Control Bus
  • So what about Memory Store instructions?

15
Write-back (WB)
  • What is Write-back?
  • Data Load/ Data Store
  • Types
  • Register
  • Memory
  • Disk
  • Register-Memory vs. Register-Register

16
So what is IT, then?
  • Instruction Translation (IT)
  • OS-side
  • Assemblers
  • Stateful IT
  • Pre-Explored IT (PEIT)
  • Roles of Directives
  • Roles of Memory layout
  • Processor-based IT
  • Note: so our five-stage model will be modified,
    and will run into challenges and added
    complexity

17
Instruction Lifecycle
[Diagram: Load → Fetch → Decode → Execution →
Write-back → Retire, with a Flush path back into the
cycle]
  • Facts
  • Reload
  • Pre-Fetch
  • OOO Execution
  • In-Order Retire
  • Infinite flush hazards

18
RISC vs. CISC Architecture
  • RISC (Reduced Instruction Set Computer)
  • CISC (Complex Instruction Set Computer)
  • Execution speed
  • Memory Parsing
  • Symbol states
  • Conversion Rule
  • Every modern CISC (e.g., x86) processor converts
    its instructions into internal RISC-like
    operations

19
Register-Register vs. Register-Memory
  • Register-Register (RISC)
  • Only Load/Store operations can access Memory
  • Register-Memory (CISC)
  • Designated units (like the ALU) can access Memory
  • Contrast: Register-Memory vs. the RISC principle
  • Architectural Contract
  • Contrast of CISC-to-RISC Conversion
  • Solution: Extra RISC Instructions
  • Clock Rate state
  • Hardware Optimization state

20
Instruction-Level Parallelism (ILP)
21
Abstract
  • What is ILP?
  • Executing more instructions at the same time, in
    parallel
  • Single-core ILP
  • Multi-core ILP
  • Methods
  • Pipelining
  • Superscalar
  • Super-pipelining
  • Combinations

22
Pipelining
    Cycle       1   2   3   4   5   6   7
    Normal      IA  ID  EX  DA  WB
    Pipelined
      Instr 1   IA  ID  EX  DA  WB
      Instr 2       IA  ID  EX  DA  WB
      Instr 3           IA  ID  EX  DA  WB
  • First instruction: 5 cycles
  • Subsequent instructions: 1 cycle each
  • Still, each one passes through all 5 stages (5
    cycles); a cycle-count sketch follows
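
The cycle counts follow directly: for n instructions
on an ideal 5-stage pipeline with no hazards,
sequential execution costs 5n cycles while pipelining
costs 5 + (n - 1). A trivial C check of the
arithmetic:

    /* Ideal pipeline speedup: sequential vs. pipelined cycles. */
    #include <stdio.h>

    int main(void) {
        const int stages = 5, n = 10;       /* n = instruction count */
        printf("sequential: %d cycles\n", stages * n);        /* 50 */
        printf("pipelined : %d cycles\n", stages + (n - 1));  /* 14 */
        return 0;
    }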

23
Pipelining (cont.)
  • Pipeline Hazard?
  • Hazard types
  • Data hazard, aka Data Dependency (e.g.,
    unavailable data access)
  • Control hazard (e.g., pipelined branch
    instructions)
  • Structural hazard (e.g., instruction conflicts,
    same-time access (sta) problems)
  • This is why we separate instruction and data
    flows/ports
  • Stall (freeze) time
  • Wait to load, wait to fetch, wait to execute
  • Pipeline Bubble
  • Blocked pipes
  • A related set of instructions
  • Non-blocked pipes

24
Pipelining (cont.)
  • Locality of Reference
  • Data Bypassing technique (a cycle model follows
    the list)
  • Data Splitting technique
  • Pipelining Limits
  • Deep pipelines are more prone to hazards
  • Solution
  • Hyper-pipelining (super-pipelining)
  • Circuit design
  • Hazard cranking
  • Circuit skew
  • Energy (Watt) limitations
  • Latch and setup-and-hold times
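
A rough cycle model of the bypassing trade-off, with
assumed textbook penalties: without forwarding, an
instruction that reads its predecessor's result waits
about 3 bubble cycles for write-back; with EX-to-EX
forwarding the bubble disappears for simple ALU
chains. Real penalties vary by design.

    /* RAW hazard cost, with and without data forwarding. */
    #include <stdio.h>

    int main(void) {
        const int stages = 5, n = 2;   /* i1: ADD r1,...  i2: reads r1 */
        const int stall_no_fwd = 3;    /* i2 waits until WB of i1      */
        const int stall_fwd    = 0;    /* EX result bypassed to i2     */
        printf("no forwarding  : %d cycles\n",
               stages + (n - 1) + stall_no_fwd);
        printf("with forwarding: %d cycles\n",
               stages + (n - 1) + stall_fwd);
        return 0;
    }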

25
Superscalar Processing
  • Single-issue architecture
  • One clock cycle, one instruction
  • Superscalar architecture
  • Fetch Bandwidth
  • Limits
  • Data hazards
  • Duplicating the hardware?

26
Putting it together!
  • Duplication is still here
  • Pipeline duplication
  • OOO entrance
  • Execution types
  • Floating point
  • Integer
  • Media
  • Media instructions
  • Video processing
  • Audio processing
  • Solutions: 3DNow! (AMD), SSE, and AVX

27
Data-Level Parallelism (DLP)
28
Abstract
  • What is DLP?
  • Objective of DLP
  • For the sake of media instructions
  • Requirements
  • More data needs to be accessed (DA) for a
    single instruction
  • One instruction is repeated over a data set
  • The data elements are commonly independent

29
SIMD and MIMD vs. SISD
[Diagram: SISD routes one instruction and one data
stream to a single EX unit; SIMD routes one
instruction to several EX units, each with its own
data stream; MIMD routes multiple instructions to
multiple EX units with multiple data streams]
  • Objectives and Limits
  • SISD: Single Instruction, Single Data
  • SIMD: Single Instruction, Multiple Data
  • MIMD: Multiple Instruction, Multiple Data
  • Oops! What about MISD? Is that applicable? And
    why should it be implemented?

30
SIMD Streaming Types
  • Multiple data Load/Store operations
  • Timing constraints of media-type nature
  • Data Cache (an SSE example follows)
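
A concrete SSE streaming example in C: one _mm_add_ps
instruction adds four float pairs at once instead of
issuing four scalar adds. Requires an x86 compiler
with SSE enabled (e.g., gcc -msse).

    /* SIMD: four float additions in one SSE instruction. */
    #include <stdio.h>
    #include <xmmintrin.h>

    int main(void) {
        __m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
        __m128 b = _mm_set_ps(40.0f, 30.0f, 20.0f, 10.0f);
        __m128 sum = _mm_add_ps(a, b);  /* one instruction, four adds */

        float out[4];
        _mm_storeu_ps(out, sum);
        printf("%.1f %.1f %.1f %.1f\n", out[0], out[1], out[2], out[3]);
        return 0;   /* prints: 11.0 22.0 33.0 44.0 */
    }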

31
Faults and Limits
  • Data underflow
  • More and More Memory pressure (latency)
  • Solutions
  • Instruction Buffers (Inst. TLB)
  • Data Buffers (Data TLB)
  • I-Fetch Latency
  • Cache hierarchy
  • Branch Prediction

32
Branch Prediction
33
Abstract
  • What is Branch Prediction (BP)?
  • Guessing the branch direction
  • Branch direction
  • Forward
  • Backward
  • Pre-fetch phenomenon
  • Fetch before Need
  • There are instructions that allow software to
    cope with locality of data (which is now
    limited)
  • Load Request Buffering
  • Branch types
  • Forward Conditional (+PC)
  • Backward Conditional (−PC)
  • Unconditional (?)

34
Branch Prediction Types
  • Static Prediction
  • Statistical Analysis
  • 4/1 Comparison
  • 60% of Forward branches are taken
  • 85% of Backward branches are taken (for the
    sake of LOOPs)
  • So coding style is critical!
  • Accuracy is not absolute!
  • Dynamic Prediction (a 2-bit counter sketch
    follows the list)
  • BHB (Branch History Buffer) or Branch History
    Table (BHT)
  • Indexed by low-order bits of the branch address;
    each entry holds (usually 2) prediction bits for
    recently taken branches
  • Accuracy depends on the indexing bits
  • The number of indexing bits depends on the bound
    of accessible memory
  • BTB (Branch Target Buffer)
  • Storing actual addresses of recently taken
    branches
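
A minimal BHT sketch in C, assuming a 1024-entry
table of 2-bit saturating counters indexed by
low-order branch-address bits (both sizes are
assumptions): counter values 0-1 predict not-taken,
2-3 predict taken.

    /* 2-bit saturating-counter branch history table. */
    #include <stdio.h>

    #define BHT_SIZE 1024                /* assumed table size   */
    static unsigned char bht[BHT_SIZE];  /* counters start at 0  */

    int predict(unsigned addr) { return bht[addr % BHT_SIZE] >= 2; }

    void update(unsigned addr, int taken) {
        unsigned char *c = &bht[addr % BHT_SIZE];
        if (taken  && *c < 3) (*c)++;    /* saturate at strongly-taken */
        if (!taken && *c > 0) (*c)--;    /* saturate at strongly-not   */
    }

    int main(void) {
        unsigned pc = 0x4004;
        for (int i = 0; i < 4; i++) {    /* train on a taken loop branch */
            printf("predict=%d\n", predict(pc));
            update(pc, 1);
        }
        return 0;  /* predictions: 0 0 1 1 as the counter saturates */
    }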

35
Branch History Buffer (BHB)
  • 1-bit prediction
  • 1 = taken
  • 0 = not taken
  • Fault: long loops cause mispredictions
  • 2 bits to achieve more accuracy
  • Misprediction
  • Bits are inverted
  • Relational Branches
  • Global History Counter (GHC) or Two-level
    predictor
  • Global Branch History (GBH)
  • Updating other branches' bits
  • Implementation (sketched after the list)
  • GShare Algorithm
  • GBHR (Global Branch History Register)
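
A GShare sketch under assumed sizes: a 10-bit GBHR is
XORed with branch-address bits to index a table of
2-bit counters, and the outcome is shifted into the
history after every branch.

    /* GShare: global history XOR branch address indexes the table. */
    #include <stdio.h>

    #define BITS 10
    #define SIZE (1 << BITS)
    static unsigned char table[SIZE];  /* 2-bit counters        */
    static unsigned gbhr;              /* global branch history */

    int gshare_predict(unsigned addr) {
        unsigned idx = (addr ^ gbhr) & (SIZE - 1);  /* XOR index */
        return table[idx] >= 2;
    }

    void gshare_update(unsigned addr, int taken) {
        unsigned idx = (addr ^ gbhr) & (SIZE - 1);
        if (taken  && table[idx] < 3) table[idx]++;
        if (!taken && table[idx] > 0) table[idx]--;
        gbhr = ((gbhr << 1) | (unsigned)taken) & (SIZE - 1);
    }

    int main(void) {
        for (int i = 0; i < 5; i++) {  /* alternating branch pattern */
            int outcome = i & 1;
            printf("pred=%d actual=%d\n", gshare_predict(0x400), outcome);
            gshare_update(0x400, outcome);
        }
        return 0;
    }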

36
Branch Target Buffer (BTB)
  • Storing Instruction address
  • Storing Target address
  • Scope of Pre-fetching
  • Recent Branch Addresses → Next PC
  • Subroutines
  • It's a branch, though!
  • Return Addresses?
  • Return Address Buffer (RAB), sketched below
  • Caching recent return addresses
  • Repetitive subroutines
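
A Return Address Buffer sketch: a small hardware
stack that pushes the return address on CALL and pops
a prediction on RET. The 16-entry depth and
wrap-around policy are assumptions.

    /* Return address buffer: push on CALL, pop on RET. */
    #include <stdio.h>

    #define RAB_DEPTH 16
    static unsigned rab[RAB_DEPTH];
    static int top;

    void on_call(unsigned return_addr) {
        rab[top % RAB_DEPTH] = return_addr;  /* wrap on overflow */
        top++;
    }

    unsigned predict_return(void) {
        return top > 0 ? rab[--top % RAB_DEPTH] : 0;  /* 0 = no hit */
    }

    int main(void) {
        on_call(0x1004);                            /* outer CALL  */
        on_call(0x2008);                            /* nested CALL */
        printf("ret -> %#x\n", predict_return());   /* 0x2008      */
        printf("ret -> %#x\n", predict_return());   /* 0x1004      */
        return 0;
    }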

37
Limits
  • Cache Size
  • Bus Rate
  • Loop Detection

38
Out-of-Order (OOO) Execution
39
Abstract
  • Steps
  • Speculating related instruction blocks
  • Executing related instruction blocks out of
    order in the execution units
  • Committing results (data) or instructions
    in-order
  • Retiring instructions

40
OOO (or Dynamic/Speculative) Execution
  • OOO or Speculative or Dynamic Execution
  • They are the same, with different names
  • How does the OOO process work?
  • Fetching, Decoding (and implicit conversion to
    RISC instructions)
  • Execution
  • Instructions are executed in whatever order best
    matches the available resources
  • So they may be executed out of original order
  • The results will commit back
  • Write-back
  • Stronger branch prediction is required!
  • Memory bandwidth waste
  • Execution time waste
  • Power waste

41
Register Renaming
  • Fact
  • OOO-executed blocks can't access the same
    registers
  • OOO execution needs more than the real registers
    to keep track of results
  • Solution: having virtual registers
  • Register Alias Table (RAT), sketched below
  • It renames and maps GPRs to a set of temporary,
    chip-internal register locations
  • Maximum number of instances of each register?
  • 128 / 8 = 16
  • Does it depend on the processor's bit width? NO!
  • Committed back when instructions are committed
    back!

42
Reorder Buffer (ROB) / Retirement Unit
  • Roles
  • Keeping track of Instruction Status (e.g.,
    Available Data)
  • Keeping track of Instruction State (e.g.,
    Completed, Flushed, Refetched, etc.)
  • Retiring instructions
  • Instructions use the RAT
  • Instructions are dispatched to EUs (as data
    becomes available)
  • Instructions are queued in RSs (reservation
    stations)
  • Instructions are retired in their original
    program order
  • Performance improvement
  • Results can be bypassed directly to another
    instruction's renamed registers
  • Data hazards are limited
  • The pipeline queue keeps moving in sequence

43
Recent Load/Store Buffer (RLSB)
  • A speculative approach to data
  • Store instructions
  • Stores execute only once we are sure the data
    should be changed
  • Stores are buffered and then committed
  • Program order is maintained through the RLSB
  • Load instructions
  • They're more critical and latency-sensitive
  • Speculative execution of loads
  • OOO calculation of the EA
  • OOO Cache Access
  • Avoiding access conflicts
  • Relational Cache Access (to know whether a store
    instruction is in the pipeline and not yet
    committed back)
  • Bypassing Store results (sketched below)
  • If a load requires a former store's result
  • Saves load time
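
A store-to-load forwarding sketch: pending
(uncommitted) stores sit in a buffer, and a load
checks that buffer before memory, taking the data
directly on an address match. Buffer size and
replacement policy are assumptions.

    /* Store buffer with store-to-load forwarding. */
    #include <stdio.h>

    #define SB_SIZE 8
    typedef struct { unsigned addr; int data; int valid; } StoreEntry;
    static StoreEntry sb[SB_SIZE];
    static int mem[256];

    void buffered_store(unsigned addr, int data) {
        for (int i = 0; i < SB_SIZE; i++)          /* fill in order */
            if (!sb[i].valid) { sb[i] = (StoreEntry){addr, data, 1}; return; }
    }

    int load(unsigned addr) {
        for (int i = SB_SIZE - 1; i >= 0; i--)     /* newest match wins */
            if (sb[i].valid && sb[i].addr == addr)
                return sb[i].data;                 /* forwarded result  */
        return mem[addr];                          /* else go to memory */
    }

    int main(void) {
        buffered_store(16, 99);               /* store not yet committed */
        printf("load(16) = %d\n", load(16));  /* 99, via forwarding      */
        return 0;
    }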

44
Evolutions
45
Drive Stage
  • More parallelism
  • Used to drive data across the microchip
  • Overcoming IC design limits
  • Transmitting the signal across wires
  • Design is no longer concerned only with the
    speed of transistors
  • Physical metal: Aluminum → Copper
  • A paradox
  • A deep pipeline has to run at a higher frequency
    to do the same work as a shorter pipeline

46
Geometric History Length (GEHL)
  • Predictor Tables (Ti)
  • Indexed with functions (hashes) of the GBH and
    Branch Addresses
  • Functions ↔ Branches
  • Geometric series: L(i) = a^(i-1) × L(1)
  • Counters (Ci)
  • One counter is read from each predictor table
  • Counters are Signed (S)
  • Counter Scope number (M)
  • Prediction Calculation
  • S = M/2 + Σ(0 < i < M) C(i)
  • S positive → Taken
  • S negative → Not Taken
  • Predictor Update (sketched after the list)
  • Done only on mispredictions or when the sum is
    smaller than the threshold S_H
  • S < S_H
  • Predictor Output
  • Out ≠ 0 → C(i) = C(i) − 1
  • Out = 0 → C(i) = C(i) + 1
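
A GEHL-style sketch in C following the slide's
formulas: S = M/2 plus the sum of the counters, the
sign gives the prediction, and tables update only on
a misprediction or when |S| is below the threshold.
The table indexing is reduced to a trivial hash here;
a real GEHL indexes each table with a geometrically
growing history length L(i) = a^(i-1) × L(1), and its
counters saturate.

    /* Simplified GEHL: sum signed counters, update on weak/wrong. */
    #include <stdio.h>
    #include <stdlib.h>

    #define M 4                     /* number of predictor tables */
    #define SIZE 256
    static signed char T[M][SIZE];  /* signed counters C(i)       */
    static unsigned hist;           /* global branch history      */
    static const int THRESH = M;    /* update threshold S_H       */

    static int sum_s(unsigned addr) {
        int s = M / 2;
        for (int i = 0; i < M; i++)           /* one counter per table */
            s += T[i][(addr ^ (hist >> i)) % SIZE];
        return s;
    }

    int gehl_predict(unsigned addr) { return sum_s(addr) >= 0; }

    void gehl_update(unsigned addr, int taken) {
        int s = sum_s(addr), pred = s >= 0;
        if (pred != taken || abs(s) < THRESH)  /* mispredict or weak */
            for (int i = 0; i < M; i++) {
                signed char *c = &T[i][(addr ^ (hist >> i)) % SIZE];
                *c += taken ? 1 : -1;          /* move toward outcome */
            }
        hist = (hist << 1) | (unsigned)taken;
    }

    int main(void) {
        for (int k = 0; k < 6; k++) {          /* train on taken branch */
            printf("pred=%d\n", gehl_predict(0x40));
            gehl_update(0x40, 1);
        }
        return 0;
    }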

47
Profile Propagation
  • Software-based
  • Map program sections in old and new versions of
    program
  • Matching .text and .data blocks
  • Fuzzy Match
  • Changing pairs (primary, secondary)
  • Hardware-based
  • ID match
  • OPCode sequence matching
  • uOP down-ride time matching (statistically)
  • uOP Block matching
  • Data value matching
  • Data position matching

48
Overall Problems (a quick and brief overview of
important problems in the topics discussed)
49
Overall Problems
  • Instruction Fetch & Data Access latency
  • Latency is still alive
  • Solutions to I-Fetch latency: Branch prediction,
    Super-pipelining
  • Wasted Cycles on Deeply-Pipelined Architectures
  • More pipeline stages are good, but the
    performance hit is great when a misprediction
    occurs
  • Flush back
  • Low-rated Entropy
  • Entropy vs. Block state
  • SupILP-DLP (Super Instruction/Data-Level
    Parallelism)
  • More prediction accuracy
  • More pipeline stages
  • More superscalar execution

50
Proposals
51
Sequential Speculation
  • Rational and Sequential

[Diagram: a sequential speculation hierarchy - node
ONE speculates To Take (2) and To Take (4) toward
TWO; on a Miss, To Take (n+2) toward THREE; further
misses fall back through the hierarchy (Hier. Back)
and edit the TLB entry]
52
Custom Storage
  • Write/Read storage
  • Signed by Directive
  • Bus Bandwidth Limit
  • Matrix objective

[Diagram: the Data TLB, Inst. TLB, L1 Cache, and L2
Cache feed the proposed IWR and DWR custom storages
(mixed, with different depths)]
53
Line Splitter Strategy
  • Splitting and indexing depend on attributes:
    attrib1, attrib2, ..., attrib n
  • Stored in Custom Storages (Matrix) if free space
    is available (signed as not used by the
    software side)
  • It's like the Drive Stages

[Diagram: Custom Storage matrix view - the data or
instruction flow is split by attributes (attrib1 ...
attrib n) across index lines 1 ... n]
54
Conclusion
  • Latency is always alive!
  • I think software-based optimization scenarios
    (like PPO) are, sooner or later, going to come
    into play! I just mentioned some.
  • Hardware optimization is not just Clock Cycle
    Rate improvement. It includes more accurate
    branch prediction, wire attribute improvements,
    and of course calculus innovations, too! I just
    mentioned some, too.

55
References
  • Intel Architecture Optimization Manual, Intel.com
  • Intel IPP (Integrated Performance Primitive)
    manuals, Intel.com
  • Intel 64 and IA-32 Architectures Optimization
    Reference Manual, Intel.com
  • Intel 64 and IA-32 Architectures Software
    Developer's Manual - Documentation Changes,
    Intel.com
  • Intel 64 Architecture Memory Ordering White
    Paper, Intel.com
  • Intel 64 and IA-32 Architectures Software
    Developer's Manual - Volume 3A - System
    Programming Guide, Part 1, Intel.com
  • Intel Debugger (IDB) Manual, Intel.com
  • Intel Hyper-Threading Technology Technical
    User's Guide, Intel.com
  • Intel Math Kernel Library (Reference Manual),
    Intel.com
  • Intel Pentium 4 Processor Optimization
    (Reference Manual), Intel.com
  • Intel Pentium 4 Processor with 512-KB L2 Cache
    on 0.13 Micron Process Thermal Design Guidelines,
    Intel.com
  • Intel SSE4 Programming Reference, Intel.com
  • Intel Virtualization Technology (VT) in
    Converged Application Platforms, Intel.com
  • Arstechnica.com Resource website
  • I386 - System V Application Binary Interface,
    Intel.com
  • IA64 - Intel Itanium - System V Application
    Binary Interface, Intel.com
  • ARM Architecture Reference Manual

56
Q&A: Any questions around?
57
Thank you! "The important thing is not to stop
questioning." - Albert Einstein