Title: Safe RTL Annotations for Low Power Microprocessor Design
1Safe RTL Annotations for Low Power Microprocessor
Design
Vinod Viswanath Department of Electrical and
Computer Engineering University of Texas at Austin
Talk at Tata Institute of Fundamental Research,
Mumbai, India.
2Outline
- Power Dissipation in Hardware Circuits
- Instruction-driven Slicing to attain lower power dissipation
- Automatically annotates microprocessor description at the Register Transfer Level and Architectural level
- Correctness of the introduced annotations
- Case studies
3Power Dissipation
$$P = \frac{1}{2} C V_{DD}^{2} f N + Q_{SC} V_{DD} f N + I_{leak} V_{DD}$$
- Switching activity power dissipation
- To charge and discharge nodes
- Short Circuit power dissipation
- High only for output drivers, clock buffers
- Static power dissipation
- Due to leakage current
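As a rough numerical illustration of the three terms listed above, the following minimal sketch evaluates each component separately. The function and parameter names are hypothetical, and the numbers in the example are placeholders rather than values from the talk.

```python
def switching_power(c_load, v_dd, freq, activity):
    """Switching term: 1/2 * C * V_DD^2 * f * N."""
    return 0.5 * c_load * v_dd ** 2 * freq * activity


def short_circuit_power(q_sc, v_dd, freq, activity):
    """Short-circuit term: Q_SC * V_DD * f * N."""
    return q_sc * v_dd * freq * activity


def static_power(i_leak, v_dd):
    """Static term: I_leak * V_DD."""
    return i_leak * v_dd


def total_power(c_load, q_sc, i_leak, v_dd, freq, activity):
    """Sum of the three components of the power equation."""
    return (switching_power(c_load, v_dd, freq, activity)
            + short_circuit_power(q_sc, v_dd, freq, activity)
            + static_power(i_leak, v_dd))


if __name__ == "__main__":
    # Placeholder values, chosen only to show how the terms scale.
    base = switching_power(c_load=1e-9, v_dd=1.2, freq=1e9, activity=0.2)
    half_activity = switching_power(c_load=1e-9, v_dd=1.2, freq=1e9, activity=0.1)
    print("switching power ratio after halving N:", half_activity / base)
```

The switching term is linear in the activity factor N, which is the quantity that gating unused circuitry aims to reduce without touching VDD.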
4Switching Activity Power Dissipation
- Reduce the squared term VDD
- Leads to exponential increase in Ileak
- Host of techniques to reduce switching power at the gate level
- Clock gating
- Comparatively few techniques at the RTL
- Use program structure and dataflow information available at that level of abstraction
5Instruction-driven Slice
- An instruction-driven slice of a microprocessor design is
- all the relevant circuitry of the design required to completely execute a specific instruction
- Parts of the decode, execute, writeback etc. blocks
- Cone of influence of the semantics of the instruction
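The cone-of-influence idea can be sketched as backward reachability over a fan-in graph. The netlist below is a made-up toy (all signal names are hypothetical, not taken from any real design), and the sketch only illustrates the general notion, not the tool described in the talk.

```python
from collections import deque


def cone_of_influence(fanin, targets):
    """Return every signal that can affect `targets`.

    `fanin` maps a signal name to the signals it directly depends on.
    """
    seen = set(targets)
    queue = deque(targets)
    while queue:
        sig = queue.popleft()
        for src in fanin.get(sig, ()):
            if src not in seen:
                seen.add(src)
                queue.append(src)
    return seen


# Hypothetical toy datapath: an integer path and a floating point path
# sharing the decoder and the register file.
TOY_FANIN = {
    "wb_int_result": ["alu_out"],
    "alu_out": ["operand_a", "operand_b", "alu_op"],
    "alu_op": ["decode_op"],
    "wb_fp_result": ["fpu_out"],
    "fpu_out": ["operand_a", "operand_b", "fpu_op"],
    "fpu_op": ["decode_op"],
    "decode_op": ["insn_word"],
    "operand_a": ["regfile"],
    "operand_b": ["regfile"],
}


def all_signals(fanin):
    """Collect every signal mentioned as a sink or a source."""
    signals = set(fanin)
    for sources in fanin.values():
        signals.update(sources)
    return signals


if __name__ == "__main__":
    add_slice = cone_of_influence(TOY_FANIN, ["wb_int_result"])
    outside = all_signals(TOY_FANIN) - add_slice
    print("candidates for gating during an integer add:", sorted(outside))
```

In this toy, everything on the floating point path falls outside the slice of the integer result, which matches the intuition of gating out floating point units during integer ALU execution.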
6Instruction-driven Slicing
- Given a microprocessor design and an instruction
- Identify the instruction-driven slice
- Shut off the rest of the circuitry
- This might include
- Gating out parts of different blocks
- Gating out floating point units during integer ALU execution
- Turning off certain FSMs in different control blocks since exact constraints on their inputs are available due to instruction-driven slicing
7Algorithm (High Level)
- Algorithm instruction-driven-slicing.
- Begin
- Inputs: vRTL (Verilog RTL), insts (instructions)
- Output: aRTL (Annotated RTL)
- Parse vRTL to obtain the Abstract Syntax Program Graph (ASPG)
- For each instruction I in insts repeat
- Slice the ASPG for instruction I
- Traverse the ASPG
- Add annotation variables if such a block is found
- If a particular flop is already gated, then add the current annotation in an optimal fashion
- Return the annotated ASPG
- Generate Verilog code (aRTL) for the annotated ASPG
- End.
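The loop structure of the algorithm can be sketched on a toy dependency graph. This is only an illustration under several assumptions: the graph stands in for the ASPG, pipeline stages are ignored, and "combining annotations" is modeled as a plain set union because the talk does not specify what the optimal combination is. All names (`backward_slice`, `annotate`, `enable_expr`, the signals) are hypothetical.

```python
def backward_slice(fanin, roots):
    """Every signal reachable backwards from `roots` through fan-in edges."""
    seen, stack = set(roots), list(roots)
    while stack:
        for src in fanin.get(stack.pop(), ()):
            if src not in seen:
                seen.add(src)
                stack.append(src)
    return seen


def annotate(fanin, flops, insts):
    """Map each flop to the annotation variables under which it may be gated.

    `insts` maps an instruction name to the signals that instruction must
    drive. A flop outside an instruction's slice gets that instruction's
    annotation variable; an already-gated flop accumulates further variables.
    """
    gate_off = {}
    for name, roots in insts.items():
        in_slice = backward_slice(fanin, roots)
        for flop in flops:
            if flop not in in_slice:
                gate_off.setdefault(flop, set()).add("i" + name)
    return gate_off


def enable_expr(flop, gate_off):
    """Verilog-style clock-enable expression for one flop."""
    terms = sorted(gate_off.get(flop, ()))
    return "1'b1" if not terms else "!(" + " | ".join(terms) + ")"


# Hypothetical toy design: the load path reuses the ALU for address generation.
FANIN = {
    "wb_int_result": ["alu_out"],
    "wb_fp_result": ["fpu_out"],
    "wb_mem_result": ["lsu_out"],
    "lsu_out": ["alu_out"],
}
FLOPS = ["alu_out", "fpu_out", "lsu_out"]
INSTS = {"ADD": ["wb_int_result"], "LOAD": ["wb_mem_result"]}

if __name__ == "__main__":
    gates = annotate(FANIN, FLOPS, INSTS)
    for flop in FLOPS:
        print(flop, "enable =", enable_expr(flop, gates))
```

A flop that lies in every instruction's slice keeps a constant enable, while a flop outside several slices ends up gated by the disjunction of the corresponding annotation variables.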
8Instructions as LTL Properties
- Let I = i_1 ∧ X i_2 ∧ XX i_3 ∧ ... ∧ X^(n-1) i_n be an instruction written as an LTL property, such that i_r represents the conditions for the instruction I on clock cycle r.
- i_1 represents the instruction word.
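To make the shape of such a property concrete, here is a minimal sketch that checks a conjunction of next-shifted conditions on a finite simulation trace. It assumes a finite-trace reading in which the property is false when the trace ends before all n cycles are seen; that choice, the state fields and the helper name `holds_at` are illustrative assumptions.

```python
def holds_at(trace, conditions, t):
    """Check i_1 at t, i_2 at t+1, ..., i_n at t+n-1 on a finite trace.

    `trace` is a list of per-cycle states and `conditions` is a list of
    predicates [i_1, ..., i_n], one per clock cycle of the instruction.
    """
    if t + len(conditions) > len(trace):
        return False  # assumed finite-trace semantics: not enough cycles left
    return all(cond(trace[t + r]) for r, cond in enumerate(conditions))


if __name__ == "__main__":
    # Hypothetical three-cycle instruction: opcode match, then two
    # cycles in which the pipeline is not frozen.
    conditions = [
        lambda s: s["opcode"] == 0b111000 and not s["freeze"],
        lambda s: not s["freeze"],
        lambda s: not s["freeze"],
    ]
    trace = [
        {"opcode": 0b111000, "freeze": False},
        {"opcode": 0, "freeze": False},
        {"opcode": 0, "freeze": True},
        {"opcode": 0b111000, "freeze": False},
        {"opcode": 0, "freeze": False},
        {"opcode": 0, "freeze": False},
    ]
    for t in range(len(trace)):
        print(t, holds_at(trace, conditions, t))
```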
9RISC Pipeline (OR1200)
- 5 stage RISC pipeline implementation
- Condition for slicing on ADDC instruction
- i_1 = ((icpu_dat_i[31:26] == 6'b111000) ∧ (!rst) ∧ (!flushpipe) ∧ (!if_freeze))
- i_2 = (!id_freeze)
- i_3 = (!ex_freeze)
- i_4 = (!mem_freeze)
- i_5 = (!wb_freeze)
- I = i_1 ∧ X i_2 ∧ X^2 i_3 ∧ X^3 i_4 ∧ X^4 i_5
10OR1200 ADDC Instruction
- Introduces five variables
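- iADDC_if = i_1
- iADDC_id = (iADDC_if in the previous cycle) ∧ i_2
- iADDC_ex = (iADDC_id in the previous cycle) ∧ i_3
- iADDC_mem = (iADDC_ex in the previous cycle) ∧ i_4
- iADDC_wb = (iADDC_mem in the previous cycle) ∧ i_5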
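One way to read the five variables is as a chain of one-cycle-delayed flags that follow the instruction down the pipeline. The sketch below simulates that reading. It is an assumption about the intended semantics (each variable is the previous stage's variable from the preceding cycle, ANDed with the current stage condition), not a transcription of the OR1200 annotations, and the helper names are hypothetical.

```python
STAGES = ["if", "id", "ex", "mem", "wb"]


def cycle(i1=False, i2=True, i3=True, i4=True, i5=True):
    """Per-cycle truth values of the stage conditions i_1..i_5."""
    return {"i1": i1, "i2": i2, "i3": i3, "i4": i4, "i5": i5}


def track_addc(cycles):
    """Return the iADDC_<stage> flags for every clock cycle."""
    prev = {stage: False for stage in STAGES}
    history = []
    for conds in cycles:
        cur = {"if": conds["i1"]}
        for k in range(1, len(STAGES)):
            # Assumed semantics: previous stage's flag, delayed by one
            # cycle, ANDed with this stage's condition.
            cur[STAGES[k]] = prev[STAGES[k - 1]] and conds["i" + str(k + 1)]
        history.append({"iADDC_" + stage: value for stage, value in cur.items()})
        prev = cur
    return history


if __name__ == "__main__":
    # One ADDC fetched in cycle 0, no freezes afterwards.
    for t, flags in enumerate(track_addc([cycle(i1=True)] + [cycle()] * 4)):
        active = [name for name, value in flags.items() if value]
        print("cycle", t, active)
```

Under this reading, exactly one flag is set per cycle as the instruction advances, and a failed stage condition clears the chain from that point on.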
11or1200_ctrl.lsu_op
12or1200_ctrl.pre_branch_op
13Correct Annotations
- Notion of correctness
- Original RTL and the annotated RTL should be functionally equivalent under all conditions
- Correctness theorem
(defthm or1200_slicing_correct
  (equal (or1200_cpu n)
         (or1200_cpu_sliced n)))
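The flavor of this equivalence claim can be illustrated on a toy design by exhaustive bounded simulation. This is not the ACL2 proof and covers only input sequences up to a fixed length; the two-stage datapath, its registers and the function names are hypothetical.

```python
from itertools import product


def run(inputs, gated):
    """Simulate a toy two-stage datapath and return its output stream.

    `inputs` is a sequence of (op, value) pairs with op in {"int", "fp"}.
    With `gated` set, each operand register is clocked only when the
    incoming operation actually uses it.
    """
    op_reg, int_reg, fp_reg = "int", 0, 0
    outputs = []
    for op, value in inputs:
        # Stage 2: produce a result from what was latched last cycle.
        outputs.append(int_reg + 1 if op_reg == "int" else fp_reg * 2)
        # Stage 1: latch operands, optionally gated by the operation type.
        if not gated or op == "int":
            int_reg = value
        if not gated or op == "fp":
            fp_reg = value
        op_reg = op
    return outputs


def equivalent_up_to(n, values=(0, 1, 2)):
    """Compare both versions on every input sequence of length <= n."""
    alphabet = [(op, v) for op in ("int", "fp") for v in values]
    return all(
        run(seq, gated=False) == run(seq, gated=True)
        for length in range(n + 1)
        for seq in product(alphabet, repeat=length)
    )


if __name__ == "__main__":
    print("equivalent on all sequences up to length 4:", equivalent_up_to(4))
```

The gated version leaves stale values in the unused register, yet the observable outputs agree because the output stage only ever reads the register that was clocked for the latched operation. A theorem prover establishes this for all n rather than for a bounded set of sequences.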
14ACL2 Theorem Prover
- First-order logic, general-purpose theorem prover
- Breaks down the theorem into sub-goals
- Many engines work on the sub-goals and will either prove them or break them down further and add to the central pool of goals to be proved
- Success story in hardware
- Verified FDIV in the AMD processors
15Proof Methodology
- The RTL is a shallow embedding in ACL2
- Convert Verilog RTL into ACL2RTL
- We have created a large RTL library to recognize as well as analyze ACL2RTL
- Slicing is done on the Verilog code
- Both original and annotated Verilog are converted into ACL2 and we construct the functional equivalence proof in ACL2
16Verilog to ACL2
17Methodology
- In order to demonstrate our technique
- We have incorporated instruction-driven slicing as part of the traditional design flow
- The vRTL model is annotated to obtain the aRTL model
- Synopsys Design Environment has been modified to accept the aRTL, SPEC2000 benchmarks and power process parameters, and to estimate the power dissipation due to switching activity
- The annotated Architectural model is fed to the SimpleScalar simulator with the Wattch power estimator to estimate the power dissipation
18Methodology
19Experiment OR1200
- We have used our tool-chain to test our methodology on OR1200
- OR1200 is a pipelined microprocessor implementing the OpenRISC ISA.
- 5-stage integer pipeline with single instruction issue per cycle
- We have annotated both the RTL and the architectural models of OR1200
20OR1200 single instruction issue pipelined
microprocessor
21OR1200 Power Gain Results
- Results are shown after annotating the RTL (left) and Architectural (right) models
- For un-sliced and sliced on 1, 4, 10 instructions
- For SPECINT2000 benchmarks
- Power dissipation decreases consistently
22OR1200 Results (contd.)
[Figures: Fig. 1, Fig. 2a, Fig. 2b, Fig. 2c]
- Power gains are consistently good (Fig. 1)
- Power gains far outperform area losses (Fig. 1)
- Flop distribution shown before slicing (Fig. 2a), after slicing on add (Fig. 2b) and after slicing on load (Fig. 2c)
23Experiment PUMA
- We have used our tool-chain to test our methodology on PUMA
- PUMA is a dual-issue, out-of-order super-scalar, fixed-point PowerPC core
- We have annotated both the RTL and the architectural models of PUMA
24PUMA a fixed point PowerPC core
25PUMA Power Gain Results
- Results are shown after annotating the RTL (left) and Architectural (right) models
- For un-sliced and sliced on 1, 4, 10 instructions
- For SPECINT2000 benchmarks
- Power dissipation decreases consistently
26PUMA Results (contd.)
[Figures: Fig. 1, Fig. 2, Fig. 3a, Fig. 3b, Fig. 3c]
- Power gains are good upon slicing for a few instructions (7) before delay losses start dominating (Fig. 1)
- Power gains far outperform area losses (Fig. 2)
- Flop distribution shown before slicing (Fig. 3a), after slicing on add (Fig. 3b) and after slicing on load (Fig. 3c)
27Comparing OR1200 and PUMA
28Conclusions
- Proposed Instruction-driven Slicing as a new technique to automatically reduce power dissipation
- Implemented the methodology of incorporating instruction-driven slicing into the design flow tool-chain
- Inserting these annotations preserves the functionality of the circuit
29Conclusions (continued)
- This technique seems most applicable to single-issue multi-staged pipelined machines.
- When there are multiple instructions in-flight in the same pipeline stage, the gains of a single-instruction-abstraction are lost.
- Graphics processors and various embedded applications are often better suited for this technique than general-purpose out-of-order superscalars.