Title: Safe RTL Annotations for Low Power Microprocessor Design
1. Safe RTL Annotations for Low Power Microprocessor Design
Vinod Viswanath
Department of Electrical and Computer Engineering, University of Texas at Austin
2. Outline
- Power Dissipation in Hardware Circuits
- Instruction-driven Slicing to attain lower power dissipation
- Automatically annotates microprocessor description at the Register Transfer Level and Architectural level
- Correctness of the introduced annotations
- Case studies
3. Power Dissipation
- Switching activity power dissipation
- To charge and discharge nodes
- Short Circuit power dissipation
- High only for output drivers, clock buffers
- Static power dissipation
- Due to leakage current
4. Switching Activity Power Dissipation
- Reduce the squared term, VDD
- But this leads to an exponential increase in Ileak
- Host of techniques to reduce switching power at the gate level
- Clock gating
- Far fewer such techniques exist at the RTL
- Use program structure and dataflow information
available at that level of abstraction
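The equation this slide refers to did not survive extraction. For reference, the standard textbook expression for switching power is given below; the symbols α, C_L and f are the usual ones and are an assumption here, not taken from the slide.

```latex
P_{\text{switching}} = \alpha \, C_L \, V_{DD}^{2} \, f
```

Here α is the switching activity factor, C_L the switched load capacitance, and f the clock frequency. V_DD enters quadratically, which is why it is called the squared term. Clock gating and the RTL-level methods discussed in the following slides reduce α instead.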
5. Transistor-level Methods
- Designing Complex Gates
- Reordering transistors for optimizing power/delay
- Transistor Sizing
- Transistor size inversely proportional to gate delay
- Transistor size proportional to power dissipated at the gate
- Given a delay constraint, size the transistor to minimize power dissipation
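The sizing trade-off above can be sketched with a toy first-order model. The constants, the candidate widths and the function names are all hypothetical. The model simply takes the two proportionalities on the slide at face value.

```python
# Toy first-order model (assumed, not from the slides):
#   delay(w) = K_DELAY / w   (larger transistor -> faster gate)
#   power(w) = K_POWER * w   (larger transistor -> more power at the gate)
K_DELAY = 12.0  # hypothetical constant
K_POWER = 0.5   # hypothetical constant


def gate_delay(width):
    return K_DELAY / width


def gate_power(width):
    return K_POWER * width


def size_for_delay(max_delay, widths):
    """Pick the lowest-power width from `widths` that meets `max_delay`."""
    feasible = [w for w in widths if gate_delay(w) <= max_delay]
    if not feasible:
        raise ValueError("no candidate width meets the delay constraint")
    return min(feasible, key=gate_power)


candidate_widths = [1, 2, 3, 4, 6, 8]
best = size_for_delay(max_delay=4.0, widths=candidate_widths)
```

Under this model the smallest width that still meets the delay constraint is always the answer, because power grows monotonically with size.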
6. Gate-level Methods
- Combinational logic optimizations
- Don't-care optimizations
- Path balancing
- Sequential logic optimizations
- Encoding
- Pre-computation based optimization
- Guarded evaluation
7. Combinational Logic Optimizations
- Don't-care optimization
- Optimize for input/output patterns that cannot or should not ever occur
- Path Balancing
- Typically path balancing is done to eliminate spurious transitions
- Adds unit delay buffers, increasing the power dissipation
- Useful skew in clock trees
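As a small illustration of don't-care optimization, consider a function that detects whether a BCD digit is 8 or 9. The BCD example and the function names are assumptions chosen for illustration and do not come from the slides. Because the codes 10 to 15 never occur, the logic can be simplified to a single input bit.

```python
def is_8_or_9_exact(x3, x2, x1, x0):
    # Fully specified function: true only for 1000 and 1001.
    return bool(x3 and not x2 and not x1)


def is_8_or_9_with_dont_cares(x3, x2, x1, x0):
    # Codes 10..15 never occur for a BCD digit, so they are don't-cares
    # and the function collapses to the most significant bit.
    return bool(x3)


def bits(value):
    """Return the four bits of `value`, most significant first."""
    return [(value >> i) & 1 for i in (3, 2, 1, 0)]


care_set = range(10)  # legal BCD digits
agree_on_care_set = all(
    is_8_or_9_exact(*bits(v)) == is_8_or_9_with_dont_cares(*bits(v))
    for v in care_set
)
differ_outside_care_set = any(
    is_8_or_9_exact(*bits(v)) != is_8_or_9_with_dont_cares(*bits(v))
    for v in range(10, 16)
)
```

The two functions agree on every legal input and differ only on inputs that cannot occur, which is exactly the freedom the optimization exploits.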
8. Sequential Logic Optimizations
- Encoding
- Encode state transition graphs
- Encode values in datapath logic
- Passing data value 0010 followed by 1101 on a bus
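The slide does not name the encoding it has in mind for the bus example. One well-known scheme that fits the values 0010 followed by 1101 is bus-invert coding, sketched below as an assumption. The function names and the extra invert line are part of that assumed scheme, not of the slides.

```python
def hamming(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def bus_invert_encode(words, width):
    """Return (bus_value, invert_flag) pairs for a sequence of data words."""
    mask = (1 << width) - 1
    prev_bus, encoded = 0, []
    for word in words:
        # Send the complement when more than half the data lines would toggle.
        if hamming(prev_bus, word) > width // 2:
            bus, inv = word ^ mask, 1
        else:
            bus, inv = word, 0
        encoded.append((bus, inv))
        prev_bus = bus
    return encoded


def bus_invert_decode(encoded, width):
    mask = (1 << width) - 1
    return [bus ^ mask if inv else bus for bus, inv in encoded]


def data_line_toggles(bus_values):
    """Total data-line transitions, starting from an all-zero bus."""
    prev, total = 0, 0
    for value in bus_values:
        total += hamming(prev, value)
        prev = value
    return total


words = [0b0010, 0b1101]
encoded = bus_invert_encode(words, width=4)
plain_toggles = data_line_toggles(words)
coded_toggles = data_line_toggles([bus for bus, _ in encoded])
```

Sent unencoded, the second word toggles all four data lines. With the assumed encoding the data lines stay at 0010 and only the invert line changes.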
9. Sequential Logic Optimizations
- Pre-computation based optimization
- Selectively precompute outputs of the circuit one cycle before they are required
- If the output value is precomputed, the circuit can be turned off for the next cycle
- Size of pre-computation logic determines power dissipation reduction, area increase, and delay increase
- Use predictor functions
- Pre-compute outputs based on subset of inputs
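A common textbook illustration of predictor functions is an unsigned comparator whose result is often decided by the most significant bits alone. The sketch below assumes that example; the 4-bit width and all names are hypothetical. It models only the decision logic, not the register-level timing.

```python
WIDTH = 4  # hypothetical operand width


def msb(x):
    return (x >> (WIDTH - 1)) & 1


def precompute(a, b):
    """Predictors on the MSBs only.

    Returns the comparator output if it is already decided, else None.
    """
    if msb(a) and not msb(b):
        return True   # a > b is certainly true
    if msb(b) and not msb(a):
        return False  # a > b is certainly false
    return None


def compare(a, b):
    """Return (a > b, full_comparator_used)."""
    early = precompute(a, b)
    if early is not None:
        return early, False  # remaining input registers can stay disabled
    return a > b, True


pairs = [(a, b) for a in range(2 ** WIDTH) for b in range(2 ** WIDTH)]
all_correct = all(compare(a, b)[0] == (a > b) for a, b in pairs)
gated_fraction = sum(not compare(a, b)[1] for a, b in pairs) / len(pairs)
```

For uniformly distributed inputs the MSBs differ half of the time, so in this toy model the full comparator is idle in half of the cycles at the cost of two small predictor gates.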
10. Sequential Logic Optimizations
11. Instruction-driven Slice
- An instruction-driven slice of a microprocessor design is all the relevant circuitry of the design required to completely execute a specific instruction
- Parts of the decode, execute, writeback etc. blocks
- Cone of influence of the semantics of the instruction
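The cone-of-influence idea can be sketched as backward reachability over a driver graph. The netlist, the signal names and the choice of `alu_out` as the node that captures the semantics of an add are all hypothetical. A real slice would be computed on the RTL rather than on a hand-written dictionary.

```python
def cone_of_influence(fanin, sinks):
    """All nodes that can affect any node in `sinks`, including the sinks."""
    cone, stack = set(), list(sinks)
    while stack:
        node = stack.pop()
        if node in cone:
            continue
        cone.add(node)
        stack.extend(fanin.get(node, ()))
    return cone


# Hypothetical netlist: each node maps to the nodes that drive it.
fanin = {
    "alu_out": ["decode_alu_op", "rf_rdata"],
    "lsu_out": ["decode_lsu_op", "dcache_data"],
    "fpu_out": ["decode_fpu_op", "fp_rf_rdata"],
    "decode_alu_op": ["insn"],
    "decode_lsu_op": ["insn"],
    "decode_fpu_op": ["insn"],
}
all_nodes = set(fanin) | {n for drivers in fanin.values() for n in drivers}

# Assume the semantics of "add" are captured at the node alu_out.
add_slice = cone_of_influence(fanin, ["alu_out"])
outside_add_slice = all_nodes - add_slice
```

Everything in `outside_add_slice` is a candidate for being shut off while an add executes.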
12. Instruction-driven Slicing
- Given a microprocessor design and an instruction
- Identify the instruction-driven slice
- Shut off the rest of the circuitry
- This might include
- Gating out parts of different blocks
- Gating out floating point units during integer ALU execution
- Turning off certain FSMs in different control blocks since exact constraints on their inputs are available due to instruction-driven slicing
13. Algorithm (High Level)
- Algorithm instruction-driven-slicing.
- Begin
- Inputs: vRTL (Verilog RTL), insts (instructions)
- Output: aRTL (Annotated RTL)
- Parse vRTL to obtain the Abstract Syntax Program Graph (ASPG)
- For each instruction I in insts repeat
- Slice the ASPG for instruction I
- Traverse the ASPG
- Add annotation variables if such a block is found
- If a particular flop is already gated, then
- add the current annotation in an optimal fashion
- Return the annotated ASPG
- Generate Verilog code (aRTL) for the annotated ASPG
- End.
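The annotation loop can be sketched as follows, assuming the per-instruction slices have already been computed. The flop names, the `is_<instruction>` decode signals and the `<flop>_en` enables are hypothetical. Merging conditions into a single OR is only one plain reading of "in an optimal fashion", not necessarily what the tool does.

```python
def annotate(flops, slices):
    """Return {flop: [instructions during which the flop may be gated off]}."""
    gate_off = {}
    for instr, needed in slices.items():  # for each instruction I
        for flop in flops:                # traverse the design
            if flop in needed:
                continue                  # inside the slice: leave it alone
            # Outside the slice. If the flop is already gated, extend the
            # existing condition instead of adding a second gate.
            gate_off.setdefault(flop, []).append(instr)
    return gate_off


def emit_enable(flop, instrs):
    """Emit one Verilog-style enable assignment (signal names are made up)."""
    condition = " | ".join(f"is_{i}" for i in instrs)
    return f"assign {flop}_en = ~({condition});"


flops = ["alu_r", "lsu_r", "fpu_r", "pc"]
slices = {
    "add": {"alu_r", "pc"},
    "load": {"lsu_r", "alu_r", "pc"},
}
annotations = annotate(flops, slices)
verilog_lines = [emit_enable(f, instrs) for f, instrs in annotations.items()]
```

Flops that every sliced instruction needs, such as `pc` here, receive no annotation at all, and the enable stays high for any instruction that was not sliced.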
14. or1200_ctrl.lsu_op
15. or1200_ctrl.pre_branch_op
16. Methodology
- In order to demonstrate our technique
- We have incorporated instruction-driven slicing as part of the traditional design flow
- The vRTL model is annotated to obtain the aRTL model
- The Synopsys Design Environment has been modified to accept the aRTL, SPEC2000 benchmarks and power process parameters, and to estimate the power dissipation due to switching activity
- The annotated Architectural model is fed to the SimpleScalar simulator with the Wattch power estimator to estimate the power dissipation
17. Methodology
18. Experiment: OR1200
- We have used our tool-chain to test our methodology on OR1200
- OR1200 is a pipelined microprocessor implementing the OpenRISC ISA.
- 4-stage integer pipeline with single instruction issue per cycle
- We have annotated both the RTL and the architectural models of OR1200
19. OR1200 single instruction issue pipelined microprocessor
20. OR1200 Power Gain Results
- Results are shown after annotating the RTL (left) and Architectural (right) models
- For un-sliced and sliced on 1, 4, 10 instructions
- For SPECINT2000 benchmarks
- Power dissipation decreases consistently
21. OR1200 Results (contd.)
[Figures 1, 2a, 2b, 2c]
- Power gains are consistently good (Fig. 1)
- Power gains far outperform area losses (Fig. 1)
- Flop distribution shown before slicing (Fig. 2a), after slicing on add (Fig. 2b) and after slicing on load (Fig. 2c)
22. Experiment: PUMA
- We have used our tool-chain to test our methodology on PUMA
- PUMA is a dual-issue, out-of-order super-scalar, fixed-point PowerPC core
- We have annotated both the RTL and the architectural models of PUMA
23. PUMA: a fixed-point PowerPC core
24. PUMA Power Gain Results
- Results are shown after annotating the RTL (left) and Architectural (right) models
- For un-sliced and sliced on 1, 4, 10 instructions
- For SPECINT2000 benchmarks
- Power dissipation decreases consistently
25. PUMA Results (contd.)
[Figures 1, 2, 3a, 3b, 3c]
- Power gains are good upon slicing for a few instructions (7) before delay losses start dominating (Fig. 1)
- Power gains far outperform area losses (Fig. 2)
- Flop distribution shown before slicing (Fig. 3a), after slicing on add (Fig. 3b) and after slicing on load (Fig. 3c)
26. Comparing OR1200 and PUMA
27. Correct Annotations
- Notion of correctness
- Original RTL and the annotated RTL should be functionally equivalent under all conditions
- Correctness theorem
(defthm or1200_slicing_correct
  (equal (or1200_cpu n)
         (or1200_cpu_sliced n)))
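To make the property concrete, here is a bounded, simulation-based check on a toy two-unit datapath. This is not the ACL2 proof and does not model OR1200. The datapath, the operand set and all names are hypothetical. It only illustrates what "functionally equivalent" means when the sliced design stops clocking registers outside the slice.

```python
from itertools import product

MASK = 0xF  # 4-bit toy datapath


def step(state, instr, sliced):
    """One cycle: returns (new_state, writeback_value)."""
    alu_r, mul_r = state
    op, a, b = instr
    # The unsliced design clocks both result registers every cycle.
    # The sliced design clocks only the register in the instruction's slice.
    if not sliced or op == "add":
        alu_r = (a + b) & MASK
    if not sliced or op == "mul":
        mul_r = (a * b) & MASK
    writeback = alu_r if op == "add" else mul_r
    return (alu_r, mul_r), writeback


def run(program, sliced):
    """Return the sequence of writeback values for a program."""
    state, trace = (0, 0), []
    for instr in program:
        state, writeback = step(state, instr, sliced)
        trace.append(writeback)
    return trace


def equivalent_up_to(n, operands=(0, 3, 5)):
    """Compare both designs on every program of length 1..n."""
    instrs = [(op, a, b) for op in ("add", "mul")
              for a in operands for b in operands]
    for length in range(1, n + 1):
        for program in product(instrs, repeat=length):
            if run(program, False) != run(program, True):
                return False
    return True


bounded_ok = equivalent_up_to(3)
```

A bounded check like this can only find counterexamples within the chosen depth and operand set. The theorem on the slide claims equivalence for all n, which is why a theorem prover is used.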
28. ACL2 Theorem Prover
- First order logic general purpose theorem prover
- Break down the theorem into sub-goals
- Many engines work on the sub-goals and will either prove them or break them down further and add to the central pool of goals to be proved
- Success story in Hardware
- Verified FDIV in the AMD processors
29. Proof Methodology
30. Proof Methodology
- The RTL is a shallow embedding in ACL2
- Convert Verilog RTL into ACL2RTL
- We have created a large RTL library to recognize as well as analyze ACL2RTL
- Slicing is done on the Verilog code
- Both original and annotated Verilog are converted into ACL2 and we construct the functional equivalence proof in ACL2
31. Verilog to ACL2
32. Proof Structure
- Create a library of functions to interpret the ACL2 model of the RTL
- Functional equivalence theorem is built up block by block
- Per instruction basis
33. Conclusions
- Proposed Instruction-driven Slicing as a new technique to automatically reduce power dissipation
- Implemented the methodology of incorporating instruction-driven slicing into the design flow tool-chain
- Inserting these annotations preserves the functionality of the circuit
34. Conclusions (continued)
- This technique seems most applicable to single-issue multi-staged pipelined machines.
- When there are multiple instructions in-flight in the same pipeline stage, the gains of a single-instruction-abstraction are lost.
- Graphics processors and various embedded applications are often better suited for this technique than general purpose out-of-order superscalars.