Power Optimization for Clock Network with Clock Gate Cloning and Flip-Flop Merging

Provided by: terence

1
Power Optimization for Clock Network with
Clock Gate Cloning and Flip-Flop Merging
  • Shih-Chuan Lo
  • Chih-Cheng Hsu
  • Mark Po-Hung Lin

2
Outline
  • Introduction
  • Preliminaries
  • The Proposed Algorithms
  • Experimental Results
  • Conclusions

3
Outline
  • Introduction
  • Low Power Design Methodologies
  • The Concept of Clock-Gating Cell
  • The Concept of Clock-Gate Cloning
  • The Concept of Flip-Flop Merging
  • Previous Work
  • Our Contributions
  • Preliminaries
  • The Proposed Algorithms
  • Experimental Results
  • Conclusions

4
Low Power Design Methodologies
  • Clock gating cell (CG)
  • Wu et al., TCAS'00; Shen et al., TVLSI'10
  • Clock gate cloning
  • Teng and Soin, ICSE'10; Vishweshwara et al.,
    ISQED'12
  • Multi-bit flip-flop (MBFF)
  • Pokala et al., ASIC'92; Kretchmer, EE Times
    Asia'01
  • Chen et al., SNUG'10; Lin et al.,
    TCAD'11
  • Wang et al., TCAD'12; Jiang et al.,
    TCAD'12
  • Shyu et al., TVLSI'13; Tsai et al.,
    ISPD'13

5
The Concept of Clock-Gating Cell
  • A clock-gating cell can turn off the clocks at
    flip-flop inputs when they are not required.
  • In Fig.(a), the FFs will load new data at their
    input pins D only when the enable signal EN
    is active.
  • In Fig.(b), the CG can shut off gclk to the FFs
    when Din is not changed.

Less clock network power and smaller chip area
6
The Concept of Clock-Gate Cloning
  • A clock buffer chain may result in
  • Longer delay
  • Degraded circuit performance
  • Additional power consumption
  • After replicating sufficient CGs and connecting each
    CG to a smaller number of FFs
  • The number of required clock buffers can be
    reduced.
  • Power consumption and path delay of the gated
    clock network can be minimized.

7
The Concept of Multi-Bit Flip-Flop
  • Replacing 1-bit FFs with MBFFs can reach up to
    30% total clock power reduction.
  • Jiang et al., TCAD'12
  • An MBFF contains several 1-bit FFs which share
    common inverters in the MBFF cell.
  • Chen et al., SNUG'10
  • Replacing several 1-bit FFs with an MBFF will
    reduce
  • Inverters in FF cells
  • Clock sinks
  • Clock drivers

8
Previous Work of CG Cloning
  • Teng and Soin, ICSE'10
  • Introduced a cutting-based algorithm to split a CG
    and redistribute the CG fanout according to the
    cut line.
  • The CG splitting algorithm is iteratively
    performed until the timing violation of each CG's
    enable signal is eliminated.
  • Vishweshwara et al., ISQED'12
  • Proposed a clustering-based algorithm to
    recursively replicate a CG when the CG has a
    large fanout, or when the spreading
    area of its fanout is larger than a limit.

9
Previous Work of FF Merging
  • Kretchmer, EE Times Asia'01; Chen et al.,
    SNUG'10
  • Demonstrated the feasibility of applying MBFFs
    during logic synthesis.
  • Pokala et al., ASIC'92
  • Applied MBFFs before placement optimization.
  • Tsai et al., ISPD'13
  • Applied MBFFs during placement optimization.
  • Lin et al., TCAD'11; Wang et al., TCAD'12;
    Jiang et al., TCAD'12; Shyu et al.,
    TVLSI'13
  • Performed power optimization with MBFFs at the
    post-placement stage for better timing budgeting.

10
Our Contributions
  • We present the first problem formulation
  • For gated clock network optimization with
    simultaneous CG cloning and FF merging.
  • We introduce a novel optimization flow consisting
    of
  • MBFF-aware CG cloning
  • CG-based FF merging
  • MBFF and CG placement optimization
  • We formulate the MBFF-aware CG cloning
    optimization problem as a partitioning problem.
  • Our formulation is to maximize skew slack
    corresponding to different CGs subject to bounded
    slack constraints.
  • Our experimental results show that the proposed
    approach leads to better dynamic power and clock
    wirelength.

11
Outline
  • Introduction
  • Preliminaries
  • Power Model of Gated Clock Network
  • Inter-CG Clock Skew due to CG Cloning
  • Control-Path Timing Constraint for Gated Clock
    Network
  • Data-Path Timing Constraint for FF Merging
  • Placement Density Constraint for CGs and MBFFs
  • Problem Formulation
  • The Proposed Algorithms
  • Experimental Results
  • Conclusions

12
Power Model of Gated Clock Network
  • The power dissipated in the gated clock network
    can be modelled as follows.
  • Shen et al., TVLSI'10

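Components of the model (the formula itself is not reproduced in this transcript):
  • Clock net
  • Gated clock tree
  • Enable signal net

Quantities used in the model:
  • Dynamic power consumption (Pd)
  • Supply voltage
  • Clock period
  • Unit wire capacitance
  • Input capacitance
  • Wirelength
  • Switching activity
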
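Since the formula did not survive in this transcript, the following is only a rough illustration of how the listed quantities typically combine. It uses the generic switched-capacitance relation (power = activity x capacitance x supply voltage squared / clock period). The function names, and the way the three components are summed, are assumptions for illustration and not necessarily the model of Shen et al.

```python
def net_capacitance(unit_wire_cap, wirelength, input_caps):
    """Wire capacitance plus the input capacitance of every driven pin."""
    return unit_wire_cap * wirelength + sum(input_caps)


def dynamic_power(vdd, t_clk, nets):
    """Generic switched-capacitance estimate (an assumed form).

    nets: iterable of (switching_activity, capacitance) pairs, e.g. one
    entry each for the clock net, the gated clock tree, and the enable
    signal net.
    """
    return (vdd ** 2 / t_clk) * sum(act * cap for act, cap in nets)


# Hypothetical numbers, chosen only to exercise the functions.
clock_net = (1.0, net_capacitance(0.5, 4.0, [1.0]))
gated_tree = (0.5, net_capacitance(0.5, 8.0, [1.0, 1.0]))
enable_net = (0.25, net_capacitance(0.5, 2.0, [1.0]))
pd = dynamic_power(1.0, 2.0, [clock_net, gated_tree, enable_net])
```

In this form, lowering the switching activity of the gated clock tree or shortening its wirelength both reduce the estimate, which is the lever that clock gating and CG placement act on.
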
13
Inter-CG Clock Skew due to CG Cloning
  • When a CG is replicated in the gated clock
    network, the inter-CG clock skew, Tskew, can be
    calculated from the quantities listed below.
  • To minimize Tskew, we shall balance the
    wirelength and flip-flop fanout numbers among all
    different CGs.

Quantities used in the calculation (the equation itself is not reproduced in this transcript):
  • The CG (gi)
  • CG delay
  • Inter-CG clock skew among gated FFs (Tskew)
  • Interconnection delay from the clock root to gi
  • Interconnection delay from gi to the farthest gated FF

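A minimal sketch of one plausible reading of these quantities: the clock path delay through a CG is taken as the sum of the root-to-CG interconnection delay, the CG delay, and the delay from the CG to its farthest gated FF, and the skew is the spread of that sum across CGs. This form is an assumption; the slide's own equation is not available here, and the function names are hypothetical.

```python
def path_delay(root_to_cg, cg_delay, cg_to_farthest_ff):
    """Clock arrival delay at the farthest FF gated by one CG."""
    return root_to_cg + cg_delay + cg_to_farthest_ff


def inter_cg_skew(cgs):
    """Spread between the slowest and fastest CG paths (assumed form).

    cgs: list of (root_to_cg, cg_delay, cg_to_farthest_ff) tuples.
    """
    delays = [path_delay(*cg) for cg in cgs]
    return max(delays) - min(delays)


# Two hypothetical CG clones with path delays 6.0 and 7.5.
skew = inter_cg_skew([(2.0, 1.0, 3.0), (4.0, 1.0, 2.5)])
```

Under this reading, balancing wirelength and fanout across the clones brings the per-CG sums closer together, which is what shrinks the skew.
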
14
Control-Path Timing Constraint for Gated Clock
Network
  • The figure shows the control-path timing of the
    gated clock network.

[Figure: control-path timing of the gated clock network. Labeled quantities: CG delay; interconnection delay from the clock root to gi.]
15
Data-Path Timing Constraint for FF Merging
  • Only the FFs which have common intersection of
    their timing-feasible regions can be merged.
  • Lin et al., TCAD'11; Wang et al., TCAD'12;
    Jiang et al., TCAD'12
  • The timing-feasible region of a flip-flop can be
    obtained from the available timing slack on the
    corresponding data paths.
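The intersection test can be sketched as follows, assuming each timing-feasible region is simplified to a single Manhattan-distance diamond given as (center x, center y, radius). Rotating the coordinates by 45 degrees turns each diamond into an axis-aligned rectangle, so a common intersection reduces to a rectangle overlap check. The simplification to one diamond per FF and all function names are assumptions for illustration.

```python
def to_rotated_rect(cx, cy, radius):
    """Map a Manhattan diamond to a rectangle in (u, v) = (x + y, x - y)."""
    u, v = cx + cy, cx - cy
    return (u - radius, v - radius, u + radius, v + radius)


def intersect(rects):
    """Common intersection of rectangles (u0, v0, u1, v1), or None."""
    u0 = max(r[0] for r in rects)
    v0 = max(r[1] for r in rects)
    u1 = min(r[2] for r in rects)
    v1 = min(r[3] for r in rects)
    return (u0, v0, u1, v1) if u0 <= u1 and v0 <= v1 else None


def can_merge(diamonds):
    """True if all feasible regions share at least one common point."""
    return intersect([to_rotated_rect(*d) for d in diamonds]) is not None
```

For example, `can_merge([(0, 0, 2), (3, 0, 2)])` holds because the two centers are 3 apart and the radii sum to 4, while `can_merge([(0, 0, 2), (10, 0, 1)])` does not.
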

16
Placement Density Constraint for CGs and MBFFs
  • We divide the chip area into a number of bins
    with equal size.
  • Lin et al., TCAD'11; Wang et al., TCAD'12;
    Jiang et al., TCAD'12
  • A CG or an MBFF can only be placed in a bin whose
    density is less than the maximum placement
    density.
  • This distributes logic cells evenly throughout the
    chip area and avoids routing congestion.
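A minimal sketch of the bin density check, assuming square bins and a dictionary that records the cell area already used in each bin. The data layout and function names are hypothetical.

```python
def bin_index(x, y, bin_size):
    """Grid coordinates of the bin that contains point (x, y)."""
    return (int(x // bin_size), int(y // bin_size))


def bin_density(used_area, index, bin_size):
    """Fraction of the bin's area already occupied by cells."""
    return used_area.get(index, 0.0) / (bin_size * bin_size)


def can_place(used_area, x, y, bin_size, max_density):
    """A cell may go into a bin only while its density is below the limit."""
    index = bin_index(x, y, bin_size)
    return bin_density(used_area, index, bin_size) < max_density


# Hypothetical usage: 10 x 10 bins, 70% maximum placement density.
used = {(0, 0): 80.0, (1, 0): 50.0}
```

With these numbers, `can_place(used, 5, 5, 10, 0.7)` is rejected because bin (0, 0) is already 80% full, while bin (1, 0) at 50% still accepts a cell.
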

17
Problem Formulation
  • Input
  • A clock gating domain contains a set of FFs which
    are controlled by the gated clock signals whose
    switching activities are the same.
  • A cell library containing both CG and MBFF cells.
  • Objectives
  • Minimize Pd and Tskew of the clock-gating domain
  • (Pd is the primary objective, while Tskew is
    the secondary one because Tskew can be further
    minimized after clock tree routing.)
  • Constraints
  • Control-path timing constraint
  • Data-path timing constraint
  • Placement density constraint

18
Outline
  • Introduction
  • Preliminaries
  • The Proposed Algorithms
  • The Proposed Algorithms Flow
  • MBFF-aware CG Cloning
  • CG-based FF Merging
  • MBFF and CG Placement Optimization
  • Experimental Results
  • Conclusions

19
The Proposed Algorithms Flow
  • Input: initial placement, cell library, and design constraints
  • MBFF-aware CG Cloning
  • CG-based FF Merging
  • MBFF & CG Placement Opt.
  • Output: optimized placement containing newly generated CGs and MBFFs
20
MBFF-aware CG Cloning
  • The CG must be replicated and its fanout FFs
    bisected when
  • The control path violates the timing constraint
  • The CG drives too many FFs, leading to larger clock
    power consumption.

21
Hypergraph Construction
  • According to the timing-feasible region of each
    FF, we construct the hypergraph H(V,E).
  • vi: the timing-feasible region of the FF fi.
  • ei: the intersection among the timing-feasible
    regions of different fi.
  • w(ei): the number of vertices connected by ei.

Example weights from the figure: w(e1) = 4, w(e2) = 2, w(e3) = 3

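The construction can be sketched as follows, assuming the timing-feasible regions have already been converted to axis-aligned rectangles (x0, y0, x1, y1). Each hyperedge is a maximal group of regions that share a common point, and its weight is the number of regions in the group. The candidate-point enumeration is one simple way to find those groups, not necessarily the authors' procedure; the function name is hypothetical.

```python
from itertools import product


def build_hypergraph(regions):
    """Return {frozenset of FF names: weight} for maximal overlap groups.

    regions: dict mapping an FF name to its rectangle (x0, y0, x1, y1).
    Any non-empty common intersection of rectangles contains a point whose
    x and y are each the lower bound of some rectangle, so testing those
    candidate points is sufficient.
    """
    xs = {r[0] for r in regions.values()}
    ys = {r[1] for r in regions.values()}
    groups = set()
    for x, y in product(xs, ys):
        members = frozenset(
            name
            for name, (x0, y0, x1, y1) in regions.items()
            if x0 <= x <= x1 and y0 <= y <= y1
        )
        if len(members) >= 2:
            groups.add(members)
    # Keep only groups that are not strictly contained in a larger group.
    maximal = [g for g in groups if not any(g < other for other in groups)]
    return {g: len(g) for g in maximal}


# Hypothetical regions: f1/f2 overlap, f2/f3 overlap, f4 is isolated.
example = build_hypergraph({
    "f1": (0, 0, 2, 2),
    "f2": (1, 1, 3, 3),
    "f3": (2.5, 2.5, 4, 4),
    "f4": (10, 10, 11, 11),
})
```

A hyperedge of weight w then corresponds to w FFs that could be merged into one MBFF, which is why cutting a heavy edge during partitioning is costly.
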
22
Cut-line Determination with Inter-CG Skew
Budgeting
  • The cut direction is determined by the physical
    dimension of the FF bounding box (Teng and Soin,
    ICSE'10).
  • A vertical (horizontal) cut is applied if the
    dimension in x-direction is larger (smaller) than
    that in y-direction.
  • To balance the delay passing through different
    CGs, we sweep the cut line to search for the
    maximum skew slack.
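The direction rule and the sweep can be sketched as below. FFs are given as (x, y) points, candidate cut lines are placed midway between consecutive distinct coordinates, and the skew slack is supplied by the caller through `slack_fn`, a hypothetical callback. Treating a square bounding box as a horizontal cut is an assumption, since the slide does not cover the tie.

```python
def cut_direction(ffs):
    """'vertical' if the FF bounding box is wider than tall, else 'horizontal'."""
    xs = [x for x, _ in ffs]
    ys = [y for _, y in ffs]
    width, height = max(xs) - min(xs), max(ys) - min(ys)
    return "vertical" if width > height else "horizontal"


def sweep_cut(ffs, slack_fn):
    """Try every cut position and keep the one with the largest slack.

    Returns (slack, cut_position, first_set, second_set), or None when all
    FFs share the same coordinate along the sweep axis.
    """
    axis = 0 if cut_direction(ffs) == "vertical" else 1
    coords = sorted({p[axis] for p in ffs})
    best = None
    for lo, hi in zip(coords, coords[1:]):
        cut = (lo + hi) / 2
        first = [p for p in ffs if p[axis] < cut]
        second = [p for p in ffs if p[axis] > cut]
        slack = slack_fn(first, second)
        if best is None or slack > best[0]:
            best = (slack, cut, first, second)
    return best


# Toy stand-in for the skew slack: prefer equally sized FF sets.
demo = sweep_cut(
    [(0, 0), (1, 0), (2, 1), (10, 1)],
    lambda a, b: -abs(len(a) - len(b)),
)
```

In the toy run the bounding box is 10 wide and 1 tall, so a vertical cut is swept along x and the balanced split at x = 1.5 wins.
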

23
Skew Slack (1/2)
  • In Fig.(c) (Fig.(d)), the CGs are placed at the
    position closest to (farthest from) the clock
    root within the respective FF bounding boxes,
    resulting in the shortest (longest) clock signal
    delay from the clock root to the FFs.

24
Skew Slack (2/2)
  • The skew slack can be calculated as the
    difference between the minimum longest and the
    maximum shortest clock signal delay.
  • To more easily balance the delay passing through
    different CGs, we would like to find a
    physical cut line which maximizes the skew slack.
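The skew slack definition above translates directly into a few lines. Each CG is described by the shortest and longest clock signal delay it can achieve within its FF bounding box; the function name and input layout are assumptions.

```python
def skew_slack(delay_ranges):
    """Minimum of the longest delays minus maximum of the shortest delays.

    delay_ranges: list of (shortest_delay, longest_delay), one pair per CG.
    """
    min_longest = min(longest for _, longest in delay_ranges)
    max_shortest = max(shortest for shortest, _ in delay_ranges)
    return min_longest - max_shortest


# Two hypothetical CGs with delay ranges [3.0, 7.0] and [4.0, 6.5].
slack = skew_slack([(3.0, 7.0), (4.0, 6.5)])
```

A non-negative value means the delay ranges of all CGs overlap, so there is at least one delay that every CG can reach; a larger value leaves more room for balancing later in the flow.
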

25
MBFF-aware FF Swapping
  • We perform the FM algorithm on H(V,E) to move FFs
    between different FF sets such that the cut size
    is minimized.
  • Cut size: the sum of edge weights on the cut line.
  • A balance condition requires that the skew slack after
    moving an FF to the other FF set must not be less
    than a bound set by a balance factor.
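A simplified FM-style pass over the hypergraph might look like the sketch below. It recomputes the cut size for every trial move instead of maintaining gain buckets, so it shows the move, lock, and best-prefix logic rather than the efficient data structures of the full algorithm. The `balance_ok` callback stands in for the skew slack balance condition and is hypothetical, as are the function names.

```python
def cut_size(edges, side):
    """Sum of weights of hyperedges whose vertices lie on both sides."""
    return sum(w for verts, w in edges if len({side[v] for v in verts}) > 1)


def fm_pass(edges, side, balance_ok):
    """One pass: move each vertex at most once, keep the best partition seen.

    edges: list of (set of vertices, weight); side: dict vertex -> 0 or 1.
    """
    side = dict(side)
    best_side, best_cut = dict(side), cut_size(edges, side)
    locked = set()
    while len(locked) < len(side):
        best_move = None
        for v in sorted(side):
            if v in locked:
                continue
            trial = dict(side)
            trial[v] = 1 - trial[v]
            if not balance_ok(trial):
                continue
            c = cut_size(edges, trial)
            if best_move is None or c < best_move[0]:
                best_move = (c, v)
        if best_move is None:
            break  # no unlocked vertex can move without breaking balance
        c, v = best_move
        side[v] = 1 - side[v]
        locked.add(v)
        if c < best_cut:
            best_cut, best_side = c, dict(side)
    return best_side, best_cut


# Toy example: two weight-2 hyperedges, both initially cut.
toy_edges = [({"a", "b"}, 2), ({"c", "d"}, 2)]
toy_side = {"a": 0, "b": 1, "c": 0, "d": 1}
size_balance = lambda s: abs(2 * sum(s.values()) - len(s)) <= 2
result_side, result_cut = fm_pass(toy_edges, toy_side, size_balance)
```

In the toy run the pass ends with a and b on one side and c and d on the other, so neither hyperedge is cut and both groups remain mergeable.
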

26
CG-based FF Merging
  • We merge 1-bit FFs into MBFFs starting from the
    four boundaries of the FF bounding box to the
    center area, based on
  • INTEGRA (Jiang et al., TCAD'12)
  • Spiral clustering technique (Chang et al.,
    ISPD'12)

27
MBFF and CG Placement Optimization
  • We perform MBFF and CG placement optimization to
  • Minimize inter-CG clock skew
  • Minimize wirelength
  • Minimize required clock buffers
  • Satisfy control/data-path timing constraints
  • Satisfy placement density constraints

28
MBFF Placement
  • When placing the MBFFs controlled by the same CG,
    we search for placement bins such that
  • The placement density constraint is satisfied.
  • Each MBFF lies in its corresponding
    timing-feasible region.
  • The FF bounding box of the CG fanouts is
    minimized.
  • A smaller FF bounding box results in shorter
    gated clock signal wirelength, and hence smaller
    Pd and Tskew.

29
CG Placement
  • The CGs are initially placed at
    feasible positions which satisfy the control-path
    timing constraints.
  • The feasible region of a CG is roughly an ellipse
    whose two foci are at the positions of the
    enable logic and one of the CG fanout FFs.
  • We perform an iterative optimization algorithm
    to
  • Move CGs around their feasible regions until
    inter-CG clock skew cannot be further minimized.
  • Add clock buffers to either clock path from the
    clock root to a CG for delay balance.
  • Insert buffers to either enable signal path from
    the enable logic to a CG for a larger feasible
    region of the CG.
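The elliptical feasible region mentioned above can be checked with the defining property of an ellipse: the sum of the distances from a point to the two foci is bounded. In this sketch the bound `max_total_dist` is a hypothetical distance budget standing in for the control-path timing constraint, and Euclidean distance is assumed.

```python
import math


def in_cg_feasible_region(pos, enable_logic, fanout_ff, max_total_dist):
    """True if pos lies inside the ellipse with the two given foci.

    pos, enable_logic, fanout_ff: (x, y) tuples.
    max_total_dist: assumed upper bound on the summed distance to both foci.
    """
    total = math.dist(pos, enable_logic) + math.dist(pos, fanout_ff)
    return total <= max_total_dist
```

For instance, with foci at (0, 0) and (6, 0) and a budget of 12, the point (3, 4) is feasible while (3, 8) is not. Inserting a buffer on the enable path would correspond to a larger budget, which enlarges the ellipse.
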

30
Outline
  • Introduction
  • Preliminaries
  • The Proposed Algorithms
  • Experiments
  • Experimental Setups
  • Experimental Comparisons
  • Experimental Results
  • Conclusions

31
Experimental Setups
  • Programming language
  • C
  • Platform
  • 2.26GHz Intel Xeon machine under the Linux
    operating system
  • We adopted the benchmark circuits in Jiang et
    al., TCAD'12
  • Added other logical, physical, and timing
    information for CGs, the clock root, and the
    enable logic (EL).
  • Referred to the Nangate 45nm Open Cell Library to
    set the input capacitance.
  • Assumed that all FFs in each circuit are
    initially connected to the same CG.
  • Chose the circuits containing fewer than 1,000 FFs
    with reasonable FF bounding boxes.

32
Experimental Comparisons
  • Reference Flows 1 and 2
  • CG cloning technique is based on the MBFF-aware
    CG cloning without applying MBFF-aware FF
    swapping.
  • FF merging technique is exactly the same as the
    CG-based FF merging.

33
Experimental Results (1/2)
  • Comparison of the numbers of MBFFs with different
    bit numbers (# of FFs) and CG numbers (# of
    CGs).
  • Compared with Reference Flow 1, the
    proposed flow results in many more MBFFs with
    similar clock gate numbers.
  • Compared with Reference Flow 2, the
    proposed flow results in slightly more CGs
    and slightly fewer MBFFs.

34
Experimental Results (2/2)
  • Comparison of the dynamic power consumption
  • 15% less than that resulting from Reference
    Flow 1.
  • 10% less than that resulting from Reference
    Flow 2.
  • Comparison of the clock net wirelength
  • 22% less than that resulting from Reference
    Flow 1.
  • 18% less than that resulting from Reference
    Flow 2.
  • Comparison of the signal net wirelength
  • 2% less than that resulting from Reference Flow
    2.

35
Outline
  • Introduction
  • Preliminaries
  • The Proposed Algorithms
  • Experimental Results
  • Conclusions

36
Conclusions
  • We have presented a new problem formulation for
    clock network optimization with both CGs and
    MBFFs.
  • We have also introduced novel techniques to
    optimize the gated clock network with CG cloning and
    FF merging simultaneously.
  • The experimental results have shown that the
    proposed approach results in better dynamic power
    and clock wirelength compared with flows that
    optimize the gated clock network with CGs and MBFFs
    separately.

37
Thanks for Your Attention