Safe RTL Annotations for Low Power Microprocessor Design

1
Safe RTL Annotations for Low Power Microprocessor
Design
Vinod Viswanath Department of Electrical and
Computer Engineering University of Texas at Austin
2
Outline
  • Power Dissipation in Hardware Circuits
  • Instruction-driven Slicing to attain lower power
    dissipation
  • Automatic annotation of the microprocessor
    description at the Register Transfer Level and
    Architectural level
  • Correctness of the introduced annotations
  • Case studies

3
Power Dissipation
  • Switching activity power dissipation
  • To charge and discharge nodes
  • Short Circuit power dissipation
  • High only for output drivers, clock buffers
  • Static power dissipation
  • Due to leakage current

4
Switching Activity Power Dissipation
  • Reduce the squared term, VDD
  • But scaling VDD (and the threshold voltage with
    it) leads to an exponential increase in Ileak
  • Host of techniques to reduce switching power at
    the gate level
  • Clock gating
  • Relatively few techniques exist at the RTL
  • Use program structure and dataflow information
    available at that level of abstraction
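
For reference, the "squared term" above refers to the standard first-order model of switching power. The symbols below (activity factor, load capacitance, clock frequency, subthreshold slope factor, thermal voltage) are conventional textbook names and do not appear on the slides:

```latex
P_{\text{switching}} = \alpha \, C_L \, V_{DD}^{2} \, f
\qquad
I_{\text{leak}} \propto e^{-V_{th} / (n \, V_T)}
```

Here \alpha is the switching activity factor, C_L the switched load capacitance, f the clock frequency, V_{th} the threshold voltage, and V_T the thermal voltage. Lowering V_{DD} cuts switching power quadratically, but keeping gate delay constant usually means lowering V_{th} as well, which is where the exponential leakage growth comes from. Techniques such as clock gating target \alpha instead.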

5
Transistor-level Methods
  • Designing Complex Gates
  • Reordering transistors for optimizing power/delay
  • Transistor Sizing
  • Transistor size inversely proportional to gate
    delay
  • Transistor size proportional to power dissipated
    at the gate
  • Given a delay constraint, size the transistor to
    minimize power dissipation

6
Gate-level Methods
  • Combinational logic optimizations
  • Don't-care optimizations
  • Path balancing
  • Sequential logic optimizations
  • Encoding
  • Pre-computation based optimization
  • Guarded evaluation

7
Combinational Logic Optimizations
  • Don't-care optimization
  • Exploit input/output patterns that cannot or
    should not ever occur
  • Path Balancing
  • Typically path balancing is done to eliminate
    spurious transitions
  • Adds unit delay buffers, increasing the power
    dissipation
  • Useful skew in clock trees

8
Sequential Logic Optimizations
  • Encoding
  • Encode state transition graphs
  • Encode values in datapath logic
  • Passing data value 0010 followed by 1101 on a bus
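
The bus example above can be made concrete with a small sketch. The slide does not name an encoding scheme, so bus-invert coding is used here purely as an assumed illustration; the function names and the all-zero initial bus state are hypothetical.

```python
def transitions(prev: int, curr: int) -> int:
    """Number of bus lines that toggle between two consecutive words."""
    return bin(prev ^ curr).count("1")


def bus_invert_encode(words, width=4):
    """Bus-invert coding (assumed example scheme, not named on the slide).

    Each word is sent either as-is or complemented, whichever toggles
    fewer data lines; an extra 'invert' line tells the receiver which.
    Returns a list of (data_lines, invert) pairs.
    """
    mask = (1 << width) - 1
    encoded = []
    prev_lines = 0  # assumption: the bus starts at all zeros
    for w in words:
        if transitions(prev_lines, w) > width // 2:
            lines, invert = (~w) & mask, 1
        else:
            lines, invert = w, 0
        encoded.append((lines, invert))
        prev_lines = lines
    return encoded


def bus_invert_decode(encoded, width=4):
    """Recover the original words from (data_lines, invert) pairs."""
    mask = (1 << width) - 1
    return [lines ^ mask if invert else lines for lines, invert in encoded]


def total_toggles(encoded):
    """Toggles on data lines plus the invert line, between consecutive words."""
    toggles = 0
    for (p_lines, p_inv), (c_lines, c_inv) in zip(encoded, encoded[1:]):
        toggles += transitions(p_lines, c_lines) + (p_inv ^ c_inv)
    return toggles


raw = [0b0010, 0b1101]            # the two values from the slide
plain = [(w, 0) for w in raw]     # unencoded bus, invert line unused
coded = bus_invert_encode(raw)
```

Sent unencoded, 0010 followed by 1101 toggles all four data lines. Because 1101 is the bitwise complement of 0010, the encoded bus keeps its data lines unchanged and toggles only the invert line.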

9
Sequential Logic Optimizations
  • Pre-computation based optimization
  • Selectively precompute outputs of the circuit one
    cycle before they are required
  • If the output value is precomputed, the circuit
    can be turned off for the next cycle
  • Size of pre-computation logic determines power
    dissipation reduction, area increase, and delay
    increase
  • Use predictor functions
  • Pre-compute outputs based on subset of inputs
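
A minimal sketch of the predictor idea, assuming an unsigned comparator as the circuit being optimized (the slides do not specify one). The predictor looks only at the most significant bits; when they differ, the output is already known and the main comparator block would not need to be clocked. All names are hypothetical.

```python
def full_compare(a: int, b: int) -> bool:
    """The 'main' circuit: full-width unsigned a > b."""
    return a > b


def predict_from_msb(a: int, b: int, width: int):
    """Predictor function over a subset of the inputs (the two MSBs).

    Returns True or False when the MSBs alone decide a > b,
    and None when the full comparator is still required.
    """
    msb = width - 1
    a_msb = (a >> msb) & 1
    b_msb = (b >> msb) & 1
    if a_msb != b_msb:
        return a_msb == 1
    return None


def compare_with_precomputation(a: int, b: int, width: int):
    """Returns (result, main_block_enabled)."""
    predicted = predict_from_msb(a, b, width)
    if predicted is not None:
        return predicted, False   # main block can be turned off
    return full_compare(a, b), True


# Count how often the main block can stay off over all 4-bit input pairs.
width = 4
pairs = [(a, b) for a in range(1 << width) for b in range(1 << width)]
disabled = sum(
    1 for a, b in pairs if not compare_with_precomputation(a, b, width)[1]
)
```

For uniformly distributed inputs the MSBs differ in half of all pairs, so this one-bit predictor lets the main block idle about half the time. A wider predictor would disable it more often, at the cost of more pre-computation logic, which is the trade-off the slide describes.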

10
Sequential Logic Optimizations
  • Guarded Evaluation

11
Instruction-driven Slice
  • An instruction-driven slice of a microprocessor
    design is
  • all the relevant circuitry of the design required
    to completely execute a specific instruction
  • Parts of the decode, execute, writeback, and
    other blocks
  • Cone of influence of the semantics of the
    instruction

12
Instruction-driven Slicing
  • Given a microprocessor design and an instruction
  • Identify the instruction-driven slice
  • Shut off the rest of the circuitry
  • This might include
  • Gating out parts of different blocks
  • Gating out floating point units during integer
    ALU execution
  • Turning off certain FSMs in different control
    blocks since exact constraints on their inputs
    are available due to instruction-driven slicing

13
Algorithm (High Level)
  • Algorithm instruction-driven-slicing
  • Begin
  • Inputs: vRTL (Verilog RTL), insts (instructions)
  • Output: aRTL (Annotated RTL)
  • Parse vRTL to obtain the Abstract Syntax Program
    Graph (ASPG)
  • For each instruction I in insts, repeat:
  • Slice the ASPG for instruction I
  • Traverse the ASPG
  • Add annotation variables if such a block is found
  • If a particular flop is already gated, then
    add the current annotation in an optimal
    fashion
  • Return the annotated ASPG
  • Generate Verilog code (aRTL) for the annotated
    ASPG
  • End
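
The loop above can be sketched on a toy dependency graph. This is an illustration under strong assumptions, not the authors' tool: the ASPG is reduced to a plain fan-in map, the slice is taken as the backward cone of influence from signals an instruction must produce, and "adding the annotation in an optimal fashion" is simplified to collecting a set of instructions per flop. All signal names are hypothetical.

```python
from collections import deque


def cone_of_influence(fanin, roots):
    """Backward reachability: every node that can affect any root."""
    seen = set(roots)
    queue = deque(roots)
    while queue:
        node = queue.popleft()
        for src in fanin.get(node, ()):
            if src not in seen:
                seen.add(src)
                queue.append(src)
    return seen


def annotate(fanin, flops, roots_by_inst):
    """Map each flop to the set of instructions during which it may be gated.

    A flop outside an instruction's slice gets that instruction added to
    its annotation; a flop already annotated simply accumulates more.
    """
    gated_during = {}
    for inst, roots in roots_by_inst.items():
        live = cone_of_influence(fanin, roots)
        for flop in flops:
            if flop not in live:
                gated_during.setdefault(flop, set()).add(inst)
    return gated_during


# Hypothetical toy design: signal -> signals it depends on.
fanin = {
    "alu_out": ["op_a", "op_b"],
    "lsu_out": ["addr_reg", "mem_data"],
    "addr_reg": ["op_a"],
    "fpu_out": ["fp_a", "fp_b"],
}
flops = ["op_a", "op_b", "addr_reg", "mem_data", "fp_a", "fp_b"]
roots_by_inst = {"add": ["alu_out"], "load": ["lsu_out"]}

gated = annotate(fanin, flops, roots_by_inst)
```

In this toy design the floating-point operand flops fall outside both slices and can be gated during add and load, while op_a is needed by both and receives no annotation. A real flow would then emit the gating conditions back into Verilog.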

14
or1200_ctrl.lsu_op
15
or1200_ctrl.pre_branch_op
16
Methodology
  • In order to demonstrate our technique
  • We have incorporated instruction-driven slicing
    as part of the traditional design flow
  • The vRTL model is annotated to obtain the aRTL
    model
  • The Synopsys design environment has been
    modified to accept the aRTL, SPEC2000 benchmarks,
    and power process parameters, and to estimate the
    power dissipation due to switching activity
  • The annotated Architectural model is fed to the
    SimpleScalar simulator with the Wattch power
    estimator to estimate the power dissipation

17
Methodology
18
Experiment: OR1200
  • We have used our tool-chain to test our
    methodology on OR1200
  • OR1200 is a pipelined microprocessor implementing
    the OpenRISC ISA.
  • 4-stage integer pipeline with single instruction
    issue per cycle
  • We have annotated both the RTL and the
    architectural models of OR1200

19
OR1200: single-instruction-issue pipelined microprocessor
20
OR1200 Power Gain Results
  • Results are shown after annotating the RTL (left)
    and Architectural (right) models
  • For un-sliced and sliced on 1, 4, 10 instructions
  • For SPECINT2000 benchmarks
  • Power dissipation decreases consistently

21
OR1200 Results (contd.)
[Figures: Fig. 1, Fig. 2a, Fig. 2b, Fig. 2c]
  • Power gains are consistently good (Fig. 1)
  • Power gains far outweigh area losses (Fig. 1)
  • Flop distribution shown before slicing (Fig. 2a),
    after slicing on add (Fig. 2b), and after slicing
    on load (Fig. 2c)

22
Experiment: PUMA
  • We have used our tool-chain to test our
    methodology on PUMA
  • PUMA is a dual-issue, out-of-order, superscalar,
    fixed-point PowerPC core
  • We have annotated both the RTL and the
    architectural models of PUMA

23
PUMA: a fixed-point PowerPC core
24
PUMA Power Gain Results
  • Results are shown after annotating the RTL (left)
    and Architectural (right) models
  • For un-sliced and sliced on 1, 4, 10 instructions
  • For SPECINT2000 benchmarks
  • Power dissipation decreases consistently

25
PUMA Results (contd.)
[Figures: Fig. 1, Fig. 2, Fig. 3a, Fig. 3b, Fig. 3c]
  • Power gains are good when slicing on a few
    instructions (7), before delay losses start
    dominating (Fig. 1)
  • Power gains far outweigh area losses (Fig. 2)
  • Flop distribution shown before slicing (Fig. 3a),
    after slicing on add (Fig. 3b), and after slicing
    on load (Fig. 3c)

26
Comparing OR1200 and PUMA
27
Correct Annotations
  • Notion of correctness
  • Original RTL and the annotated RTL should be
    functionally equivalent under all conditions
  • Correctness theorem
      (defthm or1200_slicing_correct
        (equal (or1200_cpu n)
               (or1200_cpu_sliced n)))
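
To illustrate what this notion of equivalence asks for, here is a bounded, exhaustive check on a toy two-unit datapath. It is a testing sketch, not a proof and not the ACL2 model: the datapath, its 2-bit operands, and the reading of the bound as a cycle count are all assumptions made for the example.

```python
from itertools import product


def step(state, inst, a, b, sliced):
    """One cycle of a hypothetical datapath; state = (add_reg, xor_reg, out).

    Unsliced: both unit registers load every cycle.
    Sliced:   only the register of the unit the instruction uses loads;
              the other one is gated and holds its old value.
    """
    add_reg, xor_reg, _ = state
    if not sliced or inst == "add":
        add_reg = (a + b) & 0x3
    if not sliced or inst == "xor":
        xor_reg = a ^ b
    out = add_reg if inst == "add" else xor_reg
    return (add_reg, xor_reg, out)


def run(program, sliced):
    """Run a list of (inst, a, b) triples; return the observable outputs."""
    state = (0, 0, 0)
    outputs = []
    for inst, a, b in program:
        state = step(state, inst, a, b, sliced)
        outputs.append(state[2])
    return outputs


def equivalent_up_to(n):
    """Compare both models on every program of exactly n cycles."""
    cycle_inputs = list(product(("add", "xor"), range(4), range(4)))
    return all(
        run(program, sliced=False) == run(program, sliced=True)
        for program in product(cycle_inputs, repeat=n)
    )


ok = equivalent_up_to(2)
```

The gated registers hold different internal values in the two models, yet the observable output sequence matches, which is the sense of "functionally equivalent" used here. A theorem prover establishes this for all lengths rather than for a fixed bound.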

28
ACL2 Theorem Prover
  • General-purpose first-order logic theorem prover
  • Breaks down the theorem into sub-goals
  • Many engines work on the sub-goals and will
    either prove them or break them down further and
    add them to the central pool of goals to be proved
  • Success story in Hardware
  • Verified FDIV in the AMD processors

29
Proof Methodology
30
Proof Methodology
  • The RTL is a shallow embedding in ACL2
  • Convert Verilog RTL into ACL2RTL
  • We have created a large RTL library to recognize
    as well as analyze ACL2RTL
  • Slicing is done on the Verilog code
  • Both original and annotated Verilog are converted
    into ACL2 and we construct the functional
    equivalence proof in ACL2

31
Verilog to ACL2
32
Proof Structure
  • Create a library of functions to interpret the
    ACL2 model of the RTL
  • Functional equivalence theorem is built up block
    by block
  • Per instruction basis

33
Conclusions
  • Proposed Instruction-driven Slicing as a new
    technique to automatically reduce power
    dissipation
  • Implemented the methodology of incorporating
    instruction-driven slicing into the design flow
    tool-chain
  • Inserting these annotations preserves the
    functionality of the circuit

34
Conclusions (continued)
  • This technique seems most applicable to
    single-issue, multi-stage pipelined machines.
  • When there are multiple instructions in flight in
    the same pipeline stage, the gains of a
    single-instruction abstraction are lost.
  • Graphics processors and various embedded
    applications are often better suited to this
    technique than general-purpose out-of-order
    superscalars.