1
An Approach for Implementing Efficient Superscalar CISC Processors
Shiliang Hu, Ilhyun Kim, Mikko Lipasti, James Smith
Presenter: Ilhyun Kim
HPCA 2006, Austin, TX
2
Processor Design Challenges
  • CISC challenges -- suboptimal internal micro-ops
  • Complex decoders, obsolete features/instructions
  • Instruction count expansion (40 to 50%) → more instruction management and communication
  • Redundancy and inefficiency in the cracked micro-ops
  • Solution: Dynamic optimization
  • Other current challenges (CISC & RISC)
  • Efficiency (nowadays, less performance gain per transistor)
  • Power consumption has become acute
  • Solution: Novel, efficient microarchitectures

3
Solution: Architecture Innovations
[Layer diagram: Software in the architected ISA (OS, drivers, library code, apps) → Architected ISA (e.g. x86) → Dynamic Translation → Implementation ISA (e.g. fusible ISA) → HW Implementation (processors, memory system, I/O devices)]
  • ISA mapping
  • Hardware: simple translation, good for startup performance
  • Software: dynamic optimization, good for hotspots
  • Can we combine the advantages of both?
  • Startup: fast, simple translation
  • Steady state: intelligent translation/optimization for hotspots (a sketch follows below)
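As a rough illustration of this startup/steady-state split, the sketch below dispatches a code region through simple hardware cracking until an execution counter marks it hot, after which the software VMM supplies an optimized translation. All names, types, and the threshold value are hypothetical, not taken from the presentation.

    /* Minimal sketch of combining both translation paths (illustrative only). */
    #include <stddef.h>

    #define HOT_THRESHOLD 4096          /* assumed promotion threshold */

    typedef struct region_profile {
        unsigned long exec_count;       /* entries into this x86 region        */
        void         *optimized_code;   /* set once the software VMM translates */
    } region_profile_t;

    void *hw_crack_and_execute(region_profile_t *r);  /* startup path (assumed)  */
    void *vmm_optimize_hotspot(region_profile_t *r);  /* hotspot path (assumed)  */

    /* Called on entry to an x86 code region. */
    void *dispatch_region(region_profile_t *r)
    {
        if (r->optimized_code != NULL)          /* steady state: run optimized hotspot code */
            return r->optimized_code;
        if (++r->exec_count >= HOT_THRESHOLD)   /* promote when the region becomes hot */
            r->optimized_code = vmm_optimize_hotspot(r);
        return hw_crack_and_execute(r);         /* startup: simple hardware decode/crack */
    }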

4
Microarchitecture: Macro-op Execution
  • Enhanced OoO superscalar microarchitecture
  • Process and execute fused macro-ops as single instructions throughout the entire pipeline
  • Analogy: all lanes → car-pool lanes on a highway → reduce congestion (higher throughput) AND raise the speed limit from 65 mph to 80 mph

[Pipeline diagram: Fetch → Align/Fuse → Decode → Rename → Dispatch → Wakeup → Select → RF → EXE/MEM → WB → Retire, with a fuse bit marking macro-ops, collapsed 3-1 ALUs, and shared cache ports in the execution stage]
5
Related Work: x86 processors
  • AMD K7/K8 microarchitecture
  • Macro-Operations
  • High performance, efficient pipeline
  • Intel Pentium M
  • Micro-op fusion.
  • Stack manager.
  • High performance, low power.
  • Transmeta x86 processors
  • Co-Designed x86 VM
  • VLIW engine + code morphing software

6
Related Work
  • Co-designed VM: IBM DAISY, BOA
  • Full system translator on tree regions + VLIW engine
  • Other research projects: e.g. DBT for ILDP
  • Macro-op execution
  • ILDP, Dynamic Strands, Dataflow Mini-graph, CCG
  • Fill Unit, SCISM, rePLay, PARROT
  • Dynamic Binary Translation / Optimization
  • SW based (often user mode only): UQBT, Dynamo (RIO), IA-32 EL; Java and .NET HLL VM runtime systems
  • HW based: trace cache fill units, rePLay, PARROT, etc.

7
Co-designed x86 processor architecture
  • Co-designed virtual machine paradigm
  • Startup: simple hardware decode/crack for fast translation
  • Steady state: dynamic software translation/optimization for hotspots

8
Fusible Instruction Set
  • RISC-ops with unique features
  • A fusible bit per instr. for fusing
  • Dense encoding, 16/32-bit ISA
  • Special Features to Support x86
  • Condition codes
  • Addressing modes
  • Aware of long immediate values

[Instruction format diagram: every format carries a fusible bit F; 32-bit formats combine a 10-bit or 16-bit opcode with 5-bit Rsrc/Rds register specifiers and a 10-bit immediate/displacement; 16-bit formats combine a 5-bit opcode with a 5-bit Rds and either a 5-bit Rsrc or a 5-bit immediate]
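The sketch below shows one way a 32-bit fusible-ISA word could be packed and how the fuse bit marks the head of a pair. The field order and the 11-bit immediate (chosen so the fields sum to 32 bits) are assumptions for illustration; only the fuse bit and the 5-bit register specifiers follow the format list above.

    #include <stdint.h>

    /* Illustrative packing of one 32-bit fusible-ISA word:
     * F (1 bit) | opcode (10) | Rsrc (5) | Rds (5) | immediate (11).
     * The layout is an assumption, not the actual encoding. */
    static inline uint32_t encode_op(unsigned f, unsigned opcode,
                                     unsigned rsrc, unsigned rds,
                                     unsigned imm)
    {
        return ((f      & 0x1u)   << 31) |
               ((opcode & 0x3FFu) << 21) |
               ((rsrc   & 0x1Fu)  << 16) |
               ((rds    & 0x1Fu)  << 11) |
                (imm    & 0x7FFu);
    }

    /* A set fuse bit on the head instruction tells the pipeline that the
     * immediately following instruction is the tail of the same macro-op. */
    static inline int is_fused_head(uint32_t word)
    {
        return (int)(word >> 31);
    }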
9
Macro-op Fusing Algorithm
  • Objectives
  • Maximize fused dependent pairs
  • Simple & fast
  • Heuristics
  • Pipelined scheduler: only single-cycle ALU ops can be a head; minimize non-fused single-cycle ALU ops
  • Criticality: fuse instructions that are close in the original sequence; ALU-op criticality is easier to estimate
  • Simplicity: 2 or fewer distinct register operands per fused pair
  • Solution: Two-pass fusing algorithm (a sketch follows below)
  • The 1st pass, a forward scan, prioritizes ALU ops: for each ALU-op tail candidate, look backward in the scan for its head
  • The 2nd pass considers all kinds of RISC-ops as tail candidates
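A minimal sketch of the two-pass scan, assuming simplified data structures; depends_on, operands_fit, and no_cycle are hypothetical stand-ins for the dependence, operand-count, and cycle (anti-scan) checks named above.

    /* Sketch of the two-pass fusing scan; the caller initializes
     * every fused_with field to -1 before calling fuse_superblock. */
    typedef struct {
        int is_single_cycle_alu;   /* eligible to be a macro-op head      */
        int fused_with;            /* index of fused partner, or -1       */
        /* ... register operands, opcode, etc. (omitted)                  */
    } riscop_t;

    int depends_on(const riscop_t *ops, int tail, int head);   /* true dependence            */
    int operands_fit(const riscop_t *ops, int head, int tail); /* <= 2 distinct source regs  */
    int no_cycle(const riscop_t *ops, int head, int tail);     /* anti-scan / cycle check    */

    static void fuse_pass(riscop_t *ops, int n, int alu_tails_only)
    {
        for (int tail = 0; tail < n; tail++) {
            if (ops[tail].fused_with >= 0) continue;
            if (alu_tails_only && !ops[tail].is_single_cycle_alu) continue;
            /* backward pairing: prefer the closest head (criticality heuristic) */
            for (int head = tail - 1; head >= 0; head--) {
                if (ops[head].fused_with >= 0) continue;
                if (!ops[head].is_single_cycle_alu) continue;  /* head must be a 1-cycle ALU op */
                if (!depends_on(ops, tail, head)) continue;
                if (!operands_fit(ops, head, tail)) continue;
                if (!no_cycle(ops, head, tail)) continue;
                ops[head].fused_with = tail;
                ops[tail].fused_with = head;
                break;
            }
        }
    }

    void fuse_superblock(riscop_t *ops, int n)
    {
        fuse_pass(ops, n, 1);   /* pass 1: prioritize single-cycle ALU tails */
        fuse_pass(ops, n, 0);   /* pass 2: consider all RISC-ops as tails    */
    }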

10
Fusing Algorithm Example
x86 asm
-----------------------------------------------------------
1. lea   eax, DS:[edi + 01]
2. mov   DS:[080b8658], eax
3. movzx ebx, SS:[ebp + ecx << 1]
4. and   eax, 0000007f
5. mov   edx, DS:[eax + esi << 0 + 0x7c]

RISC-ops
-----------------------------------------------------------
1. ADD   Reax, Redi, 1
2. ST    Reax, mem[R22]
3. LD.zx Rebx, mem[Rebp + Recx << 1]
4. AND   Reax, 0000007f
5. ADD   R17, Reax, Resi
6. LD    Redx, mem[R17 + 0x7c]

After fusing: Macro-ops
-----------------------------------------------------------
1. ADD   R18, Redi, 1  ::  AND Reax, R18, 007f
2. ST    R18, mem[R22]
3. LD.zx Rebx, mem[Rebp + Recx << 1]
4. ADD   R17, Reax, Resi  ::  LD Redx, mem[R17 + 0x7c]

11
Instruction Fusing Profile
  • 55% of RISC-ops are fused → increases effective ILP by a factor of 1.4
  • Only 6% single-cycle ALU ops are left un-fused

12
Processor Pipeline
Reduced instruction traffic throughout
Reduced forwarding
Pipelined scheduler
  • Macro-op pipeline for efficient hotspot execution
  • Execute macro-ops
  • Higher IPC and higher clock speed potential
  • Shorter pipeline front-end

13
Co-designed x86 pipeline front-end
14
Co-designed x86 pipeline back-end
15
Experimental Evaluation
  • x86vm: Experimental framework for exploring the co-designed x86 virtual machine paradigm
  • Proposed co-designed x86 processor: a specific instantiation of the framework
  • Software components: VMM (DBT, code caches, VM runtime control and resource management system); some source code extracted from BOCHS 2.2
  • Hardware components: microarchitecture timing simulators (baseline OoO superscalar, macro-op execution, etc.)
  • Benchmarks: SPEC2000 integer

16
Performance Evaluation SPEC2000
17
Performance Contributors
  • Many factors contribute to the IPC performance improvement
  • Code straightening
  • Macro-op fusing and execution
  • Reduced pipeline front-end (reduced branch penalty)
  • Collapsed 3-1 ALUs (resolve branch addresses sooner)
  • Besides the baseline and macro-op models, we model three intermediate configurations
  • M0: baseline + code cache
  • M1: M0 + macro-op fusing
  • M2: M1 + shorter pipeline front-end (macro-op mode)
  • Macro-op: M2 + collapsed 3-1 ALUs

18
Performance Contributors SPEC2000
19
Conclusions
  • Architecture Enhancement
  • Hardware/software co-designed paradigm → enables novel designs and more desirable system features
  • Fusing dependent instruction pairs → collapses the dataflow graph to increase ILP
  • Complexity Effectiveness
  • Pipelined 2-cycle instruction scheduler
  • Significantly reduced ALU value forwarding network
  • DBT software reduces hardware complexity
  • Power Consumption Implications
  • Reduced pipeline width
  • Reduced inter-instruction communication and instruction management

20
Finale: Questions & Answers
Suggestions and comments are welcome. Thank you!
21
Outline
  • Motivation & Introduction
  • Processor Microarchitecture Details
  • Evaluation & Conclusions

22
Performance Simulation Configuration
23
Fuse Macro-ops: An Illustrative Example
24
Translation Framework
Dynamic binary translation framework (a driver sketch follows below):
1. Form hotspot superblock. Crack x86 instructions into RISC-style micro-ops.
2. Perform cluster analysis of embedded long immediate values and assign them to registers if necessary.
3. Generate RISC-ops (IR form) in the implementation ISA.
4. Construct the DDG (Data Dependency Graph) for the superblock.
5. Fusing algorithm: scan looking for dependent pairs to be fused. Forward scan, backward pairing; two passes to prioritize ALU ops.
6. Assign registers: re-order fused dependent pairs together, extend live ranges for precise traps, use consistent state mapping at superblock exits.
7. Generate code into the code cache.
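A skeleton driver for the seven steps above; every helper routine and type name here is a hypothetical placeholder, one per step, not an API from the presentation.

    /* Hypothetical driver for the translation framework above. */
    typedef struct superblock  superblock_t;
    typedef struct code_cache  code_cache_t;

    void crack_x86_to_microops(superblock_t *sb);                 /* step 1 */
    void assign_long_immediates(superblock_t *sb);                /* step 2 */
    void generate_riscops_ir(superblock_t *sb);                   /* step 3 */
    void build_ddg(superblock_t *sb);                             /* step 4 */
    void fuse_dependent_pairs(superblock_t *sb);                  /* step 5 */
    void allocate_registers(superblock_t *sb);                    /* step 6 */
    void emit_to_code_cache(superblock_t *sb, code_cache_t *cc);  /* step 7 */

    void translate_hotspot(superblock_t *sb, code_cache_t *cc)
    {
        crack_x86_to_microops(sb);     /* 1. form superblock, crack x86 instructions */
        assign_long_immediates(sb);    /* 2. cluster long immediates into registers  */
        generate_riscops_ir(sb);       /* 3. emit RISC-ops in the implementation ISA */
        build_ddg(sb);                 /* 4. data dependency graph                   */
        fuse_dependent_pairs(sb);      /* 5. forward scan, backward pairing          */
        allocate_registers(sb);        /* 6. reorder pairs, extend live ranges,
                                             consistent state at superblock exits    */
        emit_to_code_cache(sb, cc);    /* 7. code generation into the code cache     */
    }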
25
Other DBT Software Profile
  • Of all fused macro-ops:
  • 50% → ALU-ALU pairs
  • 30% → fused condition test + conditional branch pairs
  • Others → mostly ALU + memory-op pairs
  • Of all fused macro-ops:
  • 70% are inter-x86-instruction fusions
  • 46% access two distinct source registers
  • Only 15% (6% of all instruction entities) write two distinct destination registers
  • Translation overhead profile:
  • About 1000 instructions per translated hotspot instruction

26
Dependence Cycle Detection
  • All cases are generalized to (c) due to the Anti-Scan Fusing Heuristic

27
HST back-end profile
  • Light-weight opts (ProcLongImm, DDG setup, encode): tens of instrs. each of overhead per x86 instruction -- plus the initial load from disk
  • Heavy-weight opts (uops translation, fusing, codegen): none dominates

28
Hotspot Coverage vs. runs
29
Hotspot Detected vs. runs
30
Performance Evaluation SPEC2000
31
Performance Evaluation (WSB2004)
32
Performance Contributors (WSB2004)
33
Future Directions
  • Co-Designed Virtual Machine Technology
  • Confidence: more realistic benchmark studies, important for whole-workload behavior such as hotspot behavior and the impact of context switches
  • Enhancement: more synergistic, complexity-effective HW/SW co-design techniques
  • Application: specific enabling techniques for specific novel computer architectures of the future
  • Example: co-designed x86 processor design
  • Confidence: studies as above
  • Enhancement: HW µ-arch → reduce register write ports; VMM → more dynamic optimizations in HST, e.g. CSE, software stack manager, SIMDification