1
An Approach for Implementing Efficient Superscalar CISC Processors
Shiliang Hu, Ilhyun Kim, Mikko Lipasti, James Smith
Presenter: Ilhyun Kim
HPCA 2006, Austin, TX
2
Processor Design Challenges
  • CISC challenges -- suboptimal internal micro-ops
  • Complex decoders, obsolete features/instructions
  • Instruction count expansion (40 to 50%) → more instruction management and communication
  • Redundancy and inefficiency in the cracked micro-ops
  • Solution: Dynamic optimization
  • Other current challenges (CISC & RISC)
  • Efficiency (nowadays, less performance gain per transistor)
  • Power consumption has become acute
  • Solution: Novel, efficient microarchitectures

3
Solution: Architecture Innovations
[Layer diagram: Software in the architected ISA (OS, drivers, library code, apps) → Architected ISA (e.g. x86) → Dynamic Translation → Implementation ISA (e.g. fusible ISA) → HW Implementation (processors, memory system, I/O devices)]
  • ISA mapping
  • Hardware: simple translation, good for startup performance
  • Software: dynamic optimization, good for hotspots
  • Can we combine the advantages of both?
  • Startup: fast, simple translation
  • Steady state: intelligent translation/optimization for hotspots (a sketch follows below)
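As a rough illustration of this startup/steady-state split, the sketch below dispatches a code region through simple hardware cracking until an execution counter marks it hot, after which the software VMM supplies an optimized translation. All names, types, and the threshold value are hypothetical, not taken from the presentation.

    /* Minimal sketch of combining both translation paths (illustrative only). */
    #include <stddef.h>

    #define HOT_THRESHOLD 4096          /* assumed promotion threshold */

    typedef struct region_profile {
        unsigned long exec_count;       /* entries into this x86 region        */
        void         *optimized_code;   /* set once the software VMM translates */
    } region_profile_t;

    void *hw_crack_and_execute(region_profile_t *r);  /* startup path (assumed)  */
    void *vmm_optimize_hotspot(region_profile_t *r);  /* hotspot path (assumed)  */

    /* Called on entry to an x86 code region. */
    void *dispatch_region(region_profile_t *r)
    {
        if (r->optimized_code != NULL)          /* steady state: run optimized hotspot code */
            return r->optimized_code;
        if (++r->exec_count >= HOT_THRESHOLD)   /* promote when the region becomes hot */
            r->optimized_code = vmm_optimize_hotspot(r);
        return hw_crack_and_execute(r);         /* startup: simple hardware decode/crack */
    }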

4
Microarchitecture: Macro-op Execution
  • Enhanced OoO superscalar microarchitecture
  • Process and execute fused macro-ops as single instructions throughout the entire pipeline
  • Analogy: all lanes → car-pool lanes on a highway → reduce congestion (higher throughput) AND raise the speed limit from 65 mph to 80 mph

[Pipeline diagram: Fetch → Align/Fuse → Decode → Rename → Dispatch → Wakeup → Select → RF → EXE/MEM → WB → Retire, with a fuse bit marking macro-ops, collapsed 3-1 ALUs, and shared cache ports in the execution stage]
5
Related Work: x86 processors
  • AMD K7/K8 microarchitecture
  • Macro-Operations
  • High performance, efficient pipeline
  • Intel Pentium M
  • Micro-op fusion.
  • Stack manager.
  • High performance, low power.
  • Transmeta x86 processors
  • Co-Designed x86 VM
  • VLIW engine + code morphing software

6
Related Work
  • Co-designed VM: IBM DAISY, BOA
  • Full system translator on tree regions + VLIW engine
  • Other research projects: e.g. DBT for ILDP
  • Macro-op execution
  • ILDP, Dynamic Strands, Dataflow Mini-graph, CCG
  • Fill Unit, SCISM, rePLay, PARROT
  • Dynamic Binary Translation / Optimization
  • SW based (often user mode only): UQBT, Dynamo (RIO), IA-32 EL; Java and .NET HLL VM runtime systems
  • HW based: trace cache fill units, rePLay, PARROT, etc.

7
Co-designed x86 processor architecture
  • Co-designed virtual machine paradigm
  • Startup: simple hardware decode/crack for fast translation
  • Steady state: dynamic software translation/optimization for hotspots

8
Fusible Instruction Set
  • RISC-ops with unique features
  • A fusible bit per instr. for fusing
  • Dense encoding, 16/32-bit ISA
  • Special Features to Support x86
  • Condition codes
  • Addressing modes
  • Aware of long immediate values

[Instruction format diagram: every format carries a fusible bit F; 32-bit formats combine a 10-bit or 16-bit opcode with 5-bit Rsrc/Rds register specifiers and a 10-bit immediate/displacement; 16-bit formats combine a 5-bit opcode with a 5-bit Rds and either a 5-bit Rsrc or a 5-bit immediate]
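The sketch below shows one way a 32-bit fusible-ISA word could be packed and how the fuse bit marks the head of a pair. The field order and the 11-bit immediate (chosen so the fields sum to 32 bits) are assumptions for illustration; only the fuse bit and the 5-bit register specifiers follow the format list above.

    #include <stdint.h>

    /* Illustrative packing of one 32-bit fusible-ISA word:
     * F (1 bit) | opcode (10) | Rsrc (5) | Rds (5) | immediate (11).
     * The layout is an assumption, not the actual encoding. */
    static inline uint32_t encode_op(unsigned f, unsigned opcode,
                                     unsigned rsrc, unsigned rds,
                                     unsigned imm)
    {
        return ((f      & 0x1u)   << 31) |
               ((opcode & 0x3FFu) << 21) |
               ((rsrc   & 0x1Fu)  << 16) |
               ((rds    & 0x1Fu)  << 11) |
                (imm    & 0x7FFu);
    }

    /* A set fuse bit on the head instruction tells the pipeline that the
     * immediately following instruction is the tail of the same macro-op. */
    static inline int is_fused_head(uint32_t word)
    {
        return (int)(word >> 31);
    }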
9
Macro-op Fusing Algorithm
  • Objectives
  • Maximize fused dependent pairs
  • Simple & fast
  • Heuristics
  • Pipelined scheduler: only single-cycle ALU ops can be a head; minimize non-fused single-cycle ALU ops
  • Criticality: fuse instructions that are close in the original sequence; ALU-op criticality is easier to estimate
  • Simplicity: 2 or fewer distinct register operands per fused pair
  • Solution: Two-pass fusing algorithm (a sketch follows below)
  • The 1st pass, a forward scan, prioritizes ALU ops: for each ALU-op tail candidate, look backward in the scan for its head
  • The 2nd pass considers all kinds of RISC-ops as tail candidates
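A minimal sketch of the two-pass scan, assuming simplified data structures; depends_on, operands_fit, and no_cycle are hypothetical stand-ins for the dependence, operand-count, and cycle (anti-scan) checks named above.

    /* Sketch of the two-pass fusing scan; the caller initializes
     * every fused_with field to -1 before calling fuse_superblock. */
    typedef struct {
        int is_single_cycle_alu;   /* eligible to be a macro-op head      */
        int fused_with;            /* index of fused partner, or -1       */
        /* ... register operands, opcode, etc. (omitted)                  */
    } riscop_t;

    int depends_on(const riscop_t *ops, int tail, int head);   /* true dependence            */
    int operands_fit(const riscop_t *ops, int head, int tail); /* <= 2 distinct source regs  */
    int no_cycle(const riscop_t *ops, int head, int tail);     /* anti-scan / cycle check    */

    static void fuse_pass(riscop_t *ops, int n, int alu_tails_only)
    {
        for (int tail = 0; tail < n; tail++) {
            if (ops[tail].fused_with >= 0) continue;
            if (alu_tails_only && !ops[tail].is_single_cycle_alu) continue;
            /* backward pairing: prefer the closest head (criticality heuristic) */
            for (int head = tail - 1; head >= 0; head--) {
                if (ops[head].fused_with >= 0) continue;
                if (!ops[head].is_single_cycle_alu) continue;  /* head must be a 1-cycle ALU op */
                if (!depends_on(ops, tail, head)) continue;
                if (!operands_fit(ops, head, tail)) continue;
                if (!no_cycle(ops, head, tail)) continue;
                ops[head].fused_with = tail;
                ops[tail].fused_with = head;
                break;
            }
        }
    }

    void fuse_superblock(riscop_t *ops, int n)
    {
        fuse_pass(ops, n, 1);   /* pass 1: prioritize single-cycle ALU tails */
        fuse_pass(ops, n, 0);   /* pass 2: consider all RISC-ops as tails    */
    }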

10
Fusing Algorithm Example
x86 asm
-----------------------------------------------------------
1. lea   eax, DS:[edi + 01]
2. mov   DS:[080b8658], eax
3. movzx ebx, SS:[ebp + ecx << 1]
4. and   eax, 0000007f
5. mov   edx, DS:[eax + esi << 0 + 0x7c]

RISC-ops
-----------------------------------------------------------
1. ADD   Reax, Redi, 1
2. ST    Reax, mem[R22]
3. LD.zx Rebx, mem[Rebp + Recx << 1]
4. AND   Reax, 0000007f
5. ADD   R17, Reax, Resi
6. LD    Redx, mem[R17 + 0x7c]

After fusing: Macro-ops
-----------------------------------------------------------
1. ADD   R18, Redi, 1  ::  AND Reax, R18, 007f
2. ST    R18, mem[R22]
3. LD.zx Rebx, mem[Rebp + Recx << 1]
4. ADD   R17, Reax, Resi  ::  LD Redx, mem[R17 + 0x7c]

11
Instruction Fusing Profile
  • 55% of RISC-ops are fused → increases effective ILP by a factor of 1.4
  • Only 6% single-cycle ALU ops are left un-fused

12
Processor Pipeline
Reduced instruction traffic throughout
Reduced forwarding
Pipelined scheduler
  • Macro-op pipeline for efficient hotspot execution
  • Execute macro-ops
  • Higher IPC and higher clock speed potential
  • Shorter pipeline front-end

13
Co-designed x86 pipeline front-end
14
Co-designed x86 pipeline back-end
15
Experimental Evaluation
  • x86vm: Experimental framework for exploring the co-designed x86 virtual machine paradigm
  • Proposed co-designed x86 processor: a specific instantiation of the framework
  • Software components: VMM (DBT, code caches, VM runtime control and resource management system); some source code extracted from BOCHS 2.2
  • Hardware components: microarchitecture timing simulators (baseline OoO superscalar, macro-op execution, etc.)
  • Benchmarks: SPEC2000 integer

16
Performance Evaluation SPEC2000
17
Performance Contributors
  • Many factors contribute to the IPC performance improvement
  • Code straightening
  • Macro-op fusing and execution
  • Reduced pipeline front-end (reduced branch penalty)
  • Collapsed 3-1 ALUs (resolve branch addresses sooner)
  • Besides the baseline and macro-op models, we model three intermediate configurations
  • M0: baseline + code cache
  • M1: M0 + macro-op fusing
  • M2: M1 + shorter pipeline front-end (macro-op mode)
  • Macro-op: M2 + collapsed 3-1 ALUs

18
Performance Contributors SPEC2000
19
Conclusions
  • Architecture Enhancement
  • Hardware/software co-designed paradigm → enables novel designs and more desirable system features
  • Fusing dependent instruction pairs → collapses the dataflow graph to increase ILP
  • Complexity Effectiveness
  • Pipelined 2-cycle instruction scheduler
  • Significantly reduced ALU value forwarding network
  • DBT software reduces hardware complexity
  • Power Consumption Implications
  • Reduced pipeline width
  • Reduced inter-instruction communication and instruction management

20
Finale: Questions & Answers
Suggestions and comments are welcome. Thank you!
21
Outline
  • Motivation & Introduction
  • Processor Microarchitecture Details
  • Evaluation & Conclusions

22
Performance Simulation Configuration
23
Fuse Macro-ops: An Illustrative Example
24
Translation Framework
Dynamic binary translation framework (a driver sketch follows below):
1. Form hotspot superblock. Crack x86 instructions into RISC-style micro-ops.
2. Perform cluster analysis of embedded long immediate values and assign them to registers if necessary.
3. Generate RISC-ops (IR form) in the implementation ISA.
4. Construct the DDG (Data Dependency Graph) for the superblock.
5. Fusing algorithm: scan looking for dependent pairs to be fused. Forward scan, backward pairing; two passes to prioritize ALU ops.
6. Assign registers: re-order fused dependent pairs together, extend live ranges for precise traps, use consistent state mapping at superblock exits.
7. Generate code into the code cache.
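A skeleton driver for the seven steps above; every helper routine and type name here is a hypothetical placeholder, one per step, not an API from the presentation.

    /* Hypothetical driver for the translation framework above. */
    typedef struct superblock  superblock_t;
    typedef struct code_cache  code_cache_t;

    void crack_x86_to_microops(superblock_t *sb);                 /* step 1 */
    void assign_long_immediates(superblock_t *sb);                /* step 2 */
    void generate_riscops_ir(superblock_t *sb);                   /* step 3 */
    void build_ddg(superblock_t *sb);                             /* step 4 */
    void fuse_dependent_pairs(superblock_t *sb);                  /* step 5 */
    void allocate_registers(superblock_t *sb);                    /* step 6 */
    void emit_to_code_cache(superblock_t *sb, code_cache_t *cc);  /* step 7 */

    void translate_hotspot(superblock_t *sb, code_cache_t *cc)
    {
        crack_x86_to_microops(sb);     /* 1. form superblock, crack x86 instructions */
        assign_long_immediates(sb);    /* 2. cluster long immediates into registers  */
        generate_riscops_ir(sb);       /* 3. emit RISC-ops in the implementation ISA */
        build_ddg(sb);                 /* 4. data dependency graph                   */
        fuse_dependent_pairs(sb);      /* 5. forward scan, backward pairing          */
        allocate_registers(sb);        /* 6. reorder pairs, extend live ranges,
                                             consistent state at superblock exits    */
        emit_to_code_cache(sb, cc);    /* 7. code generation into the code cache     */
    }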
25
Other DBT Software Profile
  • Of all fused macro-ops:
  • 50% → ALU-ALU pairs
  • 30% → fused condition test + conditional branch pairs
  • Others → mostly ALU + memory-op pairs
  • Of all fused macro-ops:
  • 70% are inter-x86-instruction fusions
  • 46% access two distinct source registers
  • Only 15% (6% of all instruction entities) write two distinct destination registers
  • Translation overhead profile:
  • About 1000 instructions per translated hotspot instruction

26
Dependence Cycle Detection
  • All cases are generalized to (c) due to the Anti-Scan Fusing Heuristic

27
HST back-end profile
  • Light-weight opts (ProcLongImm, DDG setup, encode): tens of instrs. each of overhead per x86 instruction -- plus the initial load from disk
  • Heavy-weight opts (uops translation, fusing, codegen): none dominates

28
Hotspot Coverage vs. runs
29
Hotspot Detected vs. runs
30
Performance Evaluation SPEC2000
31
Performance Evaluation (WSB2004)
32
Performance Contributors (WSB2004)
33
Future Directions
  • Co-Designed Virtual Machine Technology
  • Confidence: more realistic benchmark studies, important for whole-workload behavior such as hotspot behavior and the impact of context switches
  • Enhancement: more synergistic, complexity-effective HW/SW co-design techniques
  • Application: specific enabling techniques for specific novel computer architectures of the future
  • Example: co-designed x86 processor design
  • Confidence: studies as above
  • Enhancement: HW µ-arch → reduce register write ports; VMM → more dynamic optimizations in HST, e.g. CSE, software stack manager, SIMDification