Safe RTL Annotations for Low Power Microprocessor Design - PowerPoint PPT Presentation

1 / 29
About This Presentation
Title:

Safe RTL Annotations for Low Power Microprocessor Design

Description:

Instruction-driven Slicing to attain lower power dissipation ... microprocessor description at the Register Transfer Level and Architectural level ... – PowerPoint PPT presentation

Number of Views:133
Avg rating:3.0/5.0
Slides: 30
Provided by: cercU
Category:

less

Transcript and Presenter's Notes

Title: Safe RTL Annotations for Low Power Microprocessor Design


1
Safe RTL Annotations for Low Power Microprocessor
Design
Vinod Viswanath Department of Electrical and
Computer Engineering University of Texas at Austin
Talk at Tata Institute of Fundamental Research,
Mumbai, India.
2
Outline
  • Power Dissipation in Hardware Circuits
  • Instruction-driven Slicing to attain lower power
    dissipation
  • Automatically annotates microprocessor
    description at the Register Transfer Level and
    Architectural level
  • Correctness of the introduced annotations
  • Case studies

3
Power Dissipation
P 1/2 C V2DD f N QSC VDD f
N Ileak VDD
  • Switching activity power dissipation
  • To charge and discharge nodes
  • Short Circuit power dissipation
  • High only for output drivers, clock buffers
  • Static power dissipation
  • Due to leakage current

4
Switching Activity Power Dissipation
  • Reduce the squared term VDD
  • Leads to exponential increase in Ileak
  • Host of techniques to reduce switching power at
    the gate level
  • Clock gating
  • Relatively much lesser at the RTL
  • Use program structure and dataflow information
    available at that level of abstraction

5
Instruction-driven Slice
  • An instruction-driven slice of a microprocessor
    design is
  • all the relevant circuitry of the design required
    to completely execute a specific instruction
  • Parts of the decode, execute, writeback etc.
    blocks
  • Cone of influence of the semantics of the
    instruction

6
Instruction-driven Slicing
  • Given a microprocessor design and an instruction
  • Identify the instruction-driven slice
  • Shut off the rest of the circuitry
  • This might include
  • Gating out parts of different blocks
  • Gating out floating point units during integer
    ALU execution
  • Turning off certain FSMs in different control
    blocks since exact constraints on their inputs
    are available due to instruction-driven slicing

7
Algorithm (High Level)
  • Algorithm instruction-driven-slicing.
  • Begin
  • Inputs vRTL (Verilog RTL), insts (instructions)
  • Output aRTL (Annotated RTL)
  • Parse vRTL to obtain the Abstract Syntax Program
    Graph (ASPG)
  • For each instruction I in insts repeat
  • Slice the ASPG for instruction I
  • Traverse the ASPG
  • Add annotation variables if such a block is found
  • If a particular flop is already gated, then
  • add the current annotation in an optimal
    fashion
  • Return the annotated ASPG
  • Generate Verilog code (aRTL) for the annotated
    ASPG
  • End.

8
Instructions as LTL Properties
  • Let I i1 Æ X i2 Æ XX i3 ... Xn-1 in be an
    instruction written as an LTL property, such that
    ir represents the conditions for the instruction
    I on clock cycle r.
  • i1 represents the instruction word.

9
RISC Pipeline (OR1200)
  • 5 stage RISC pipeline implementation
  • Condition for slicing on ADDC instruction
  • i1 ((icpu_dat_i31266b 111000) Æ
  • (!rst) Æ (!flushpipe) Æ (!if_freeze))
  • i2 (!id_freeze)
  • i3 (!ex_freeze)
  • i4 (!mem_freeze)
  • i5 (!wb_freeze)
  • I i1 Æ X i2 Æ X2i3 Æ X3i4 Æ X4i5

10
OR1200 ADDC Instruction
  • Introduces five variables
  • iADDC_if i1
  • iADDC_id 1 iADDC_if Æ i2
  • iADDC_ex 1 iADDC_id Æ i3
  • iADDC_mem 1 iADDC_ex Æ i4
  • iADDC_wb 1 iADDC_mem Æ i5

11
or1200_ctrl.lsu_op
12
or1200_ctrl.pre_branch_op
13
Correct Annotations
  • Notion of correctness
  • Original RTL and the annotated RTL should be
    functionally equivalent under all conditions
  • Correctness theorem
  • (defthm or1200_slicing_correct
  • (equal (or1200_cpu n)
  • (or1200_cpu_sliced n)))

14
ACL2 Theorem Prover
  • First order logic general purpose theorem prover
  • Breakdown the theorem into sub-goals
  • Many engines work on the sub-goals and will
    either prove them or break them down further and
    add to the central pool of goals to be proved
  • Success story in Hardware
  • Verified FDIV in the AMD processors

15
Proof Methodology
  • The RTL is a shallow embedding in ACL2
  • Convert Verilog RTL into ACL2RTL
  • We have created a large RTL library to recognize
    as well as analyze ACL2RTL
  • Slicing is done on the Verilog code
  • Both original and annotated Verilog are converted
    into ACL2 and we construct the functional
    equivalence proof in ACL2

16
Verilog to ACL2
17
Methodology
  • In order to demonstrate our technique
  • We have incorporated instruction-driven slicing
    as part of the traditional design flow
  • The vRTL model is annotated to obtain the aRTL
    model
  • Synopsys Design Environment has been sufficiently
    modified to accept the aRTL, SPEC2000 benchmarks
    and power process parameters and estimate the
    power dissipation due to switching activity
  • The annotated Architectural model is fed to the
    SimpleScalar simulator with the Wattch power
    estimator to estimate the power dissipation

18
Methodology
19
Experiment OR1200
  • We have used our tool-chain to test our
    methodology on OR1200
  • OR1200 is a pipelined microprocessor implementing
    the OpenRISC ISA.
  • 5-stage integer pipeline with single instruction
    issue per cycle
  • We have annotated both the RTL and the
    architectural models of OR1200

20
OR1200 single instruction issue pipelined
microprocessor
21
OR1200 Power Gain Results
  • Results are shown after annotating the
  • RTL (left) and Architectural (Right) models
  • For un-sliced and sliced on 1, 4, 10 instructions
  • For SPECINT2000 benchmarks
  • Power dissipation decreases consistently

22
OR1200 Results (contd.)
Fig.2a
Fig.2b
Fig. 1
  • Power gains are consistently good (Fig. 1)
  • Power gains far outperform area losses (Fig 1)
  • Flop distribution shown before slicing (Fig. 2a)
    after slicing on add (Fig. 2b) and after slicing
    on load (Fig. 2c)

Fig.2c
23
Experiment PUMA
  • We have used our tool-chain to test our
    methodology on PUMA
  • PUMA is a dual-issue, out-of-order super-scalar,
    fixed-point PowerPC core
  • We have annotated both the RTL and the
    architectural models of PUMA

24
PUMA a fixed point PowerPC core
25
PUMA Power Gain Results
  • Results are shown after annotating the
  • RTL (left) and Architectural (Right) models
  • For un-sliced and sliced on 1, 4, 10 instructions
  • For SPECINT2000 benchmarks
  • Power dissipation decreases consistently

26
PUMA Results (contd.)
Fig.3a
Fig. 1
Fig. 2
Fig.3b
  • Power gains are good upon slicing for a few
    instructions (7) before delay losses start
    dominating (Fig. 1)
  • Power gains far outperform area losses (Fig 2)
  • Flop distribution shown before slicing (Fig. 3a)
    after slicing on add (Fig. 3b) and after slicing
    on load (Fig. 3c)

Fig.3c
27
Comparing OR1200 and PUMA
28
Conclusions
  • Proposed Instruction-driven Slicing as a new
    technique to automatically reduce power
    dissipation
  • Implemented the methodology of incorporating
    instruction-driven slicing into the design flow
    tool-chain
  • Inserting these annotations preserves the
    functionality of the circuit

29
Conclusions (continued)
  • This technique seems most applicable to
    single-issue multi-staged pipelined machines.
  • When there are multiple instructions in-flight in
    the same pipeline stage, the gains of a
    single-instruction-abstraction are lost.
  • Graphics processors, various embedded
    applications are more often better suited for
    this technique than general purpose out-of-order
    superscalars.
Write a Comment
User Comments (0)
About PowerShow.com