Title: Power Optimization for Clock Network with Clock Gate Cloning and Flip-Flop Merging
1 Power Optimization for Clock Network with Clock Gate Cloning and Flip-Flop Merging
- Shih-Chuan Lo
- Chih-Cheng Hsu
- Mark Po-Hung Lin
2Outline
- Introduction
- Preliminaries
- The Proposed Algorithms
- Experimental Results
- Conclusions
3Outline
- Introduction
- Low Power Design Methodologies
- The Concept of Clock-Gating Cell
- The Concept of Clock-Gate Cloning
- The Concept of Flip-Flop Merging
- Previous Work
- Our Contributions
- Preliminaries
- The Proposed Algorithms
- Experimental Results
- Conclusions
4Low Power Design Methodologies
- Clock gating cell (CG)
- Wu et al., TCAS'00, Shen et al., TVLSI'10
- Clock gate cloning
- Teng and Soin, ICSE'10, Vishweshwara et al., ISQED'12
- Multi-bit flip-flop (MBFF)
- Pokala et al., ASIC'92, Kretchmer, EE Times Asia'01, Chen et al., SNUG'10, Lin et al., TCAD'11, Wang et al., TCAD'12, Jiang et al., TCAD'12, Shyu et al., TVLSI'13, Tsai et al., ISPD'13
5The Concept of Clock-Gating Cell
- A clock-gating cell can turn off the clocks at flip-flop inputs when they are not required.
- In Fig.(a), the FFs load new data at their input pins D only when the enable signal EN is active.
- In Fig.(b), the CG can shut off gclk to the FFs when Din does not change.
- Result: less clock network power and smaller chip area
6The Concept of Clock-Gate Cloning
- A clock buffer chain may result in
- Longer delay
- Degraded circuit performance
- Additional power consumption
- After replicating sufficient CGs and connecting each CG to a smaller number of FFs
- The number of required clock buffers can be reduced.
- Power consumption and path delay of the gated clock network can be reduced.
7The Concept of Multi-bit flip-flop
- Replacing 1-bit FFs with MBFFs can reach up to 30% total clock power reduction.
- Jiang et al., TCAD'12
- An MBFF contains several 1-bit FFs which share common inverters in the MBFF cell.
- Chen et al., SNUG'10
- Replacing several 1-bit FFs with an MBFF will reduce
- Inverters in FF cells
- Clock sinks
- Clock drivers
8Previous Work of CG Cloning
- Teng and Soin, ICSE'10
- Introduced a cutting-based algorithm to split a CG and redistribute the CG fanout according to the cut line.
- The CG splitting algorithm is performed iteratively until the timing violation of each CG's enable signal is eliminated.
- Vishweshwara et al., ISQED'12
- Proposed a clustering-based algorithm to recursively replicate a CG when the CG has a large fanout, or when the spreading area of its fanout is larger than a limit.
9Previous Work of FF Merging
- Kretchmer, EE Times Asia'01, Chen et al., SNUG'10
- Demonstrated the feasibility of applying MBFFs during logic synthesis.
- Pokala et al., ASIC'92
- Applied MBFFs before placement optimization.
- Tsai et al., ISPD'13
- Applied MBFFs during placement optimization.
- Lin et al., TCAD'11, Wang et al., TCAD'12, Jiang et al., TCAD'12, Shyu et al., TVLSI'13
- Performed power optimization with MBFFs at the post-placement stage for better timing budgeting.
10Our Contributions
- We present the first problem formulation for gated clock network optimization with simultaneous CG cloning and FF merging.
- We introduce a novel optimization flow consisting of
- MBFF-aware CG cloning
- CG-based FF merging
- MBFF and CG placement optimization
- We formulate the MBFF-aware CG cloning optimization problem as a partitioning problem.
- Our formulation maximizes the skew slack corresponding to different CGs subject to bounded slack constraints.
- Our experimental results show that the proposed approach leads to lower dynamic power and shorter clock wirelength.
11Outline
- Introduction
- Preliminaries
- Power Model of Gated Clock Network
- Inter-CG Clock Skew due to CG Cloning
- Control-Path Timing Constraint for Gated Clock Network
- Data-Path Timing Constraint for FF Merging
- Placement Density Constraint for CGs and MBFFs
- Problem Formulation
- The Proposed Algorithms
- Experimental Results
- Conclusions
12Power Model of Gated Clock Network
- The power dissipated in the gated clock network can be modelled as follows.
- Shen et al., TVLSI'10
- The model covers the dynamic power consumption of the clock net, the gated clock tree, and the enable signal net.
- Its parameters are the supply voltage, clock period, unit wire capacitance, input capacitance, wirelength, and switching activity.
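A minimal sketch of how such a model can be evaluated, assuming the standard switched-capacitance form (activity x capacitance x Vdd^2 / clock period) for each net. The exact equation and any constant factor from Shen et al. are not reproduced here, so the `Net` class, the function names, and the units are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Net:
    wirelength: float  # total wirelength of the net (um), assumed unit
    input_cap: float   # sum of input capacitances driven by the net (fF)
    activity: float    # switching activity, toggles per clock cycle


def net_dynamic_power(net, vdd, t_clk, unit_wire_cap):
    """Assumed model: activity * (wire cap + pin cap) * Vdd^2 / T_clk."""
    cap = unit_wire_cap * net.wirelength + net.input_cap
    return net.activity * cap * vdd ** 2 / t_clk


def gated_clock_power(clock_net, gated_trees, enable_nets, vdd, t_clk, unit_wire_cap):
    """Sum the three contributions named on the slide:
    clock net, gated clock tree(s), and enable signal net(s)."""
    nets = [clock_net, *gated_trees, *enable_nets]
    return sum(net_dynamic_power(n, vdd, t_clk, unit_wire_cap) for n in nets)
```

With capacitance in fF, voltage in V, and period in ns, the result is in microwatts. The clock net would normally carry an activity of 1.0, while the gated tree and the enable net carry lower activities, which is where gating saves power in this sketch.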
13Inter-CG Clock Skew due to CG Cloning
- When a CG is replicated in the gated clock network, the inter-CG clock skew, Tskew, can be calculated as follows.
- To minimize Tskew, we shall balance the wirelength and flip-flop fanout numbers among all different CGs.
- Terms of the skew expression: the CG gi, the CG delay, the inter-CG clock skew among gated FFs, the interconnection delay from the clock root to gi, and the interconnection delay from gi to the farthest gated FF.
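One plausible reading of these terms, shown as a sketch: the clock arrival through each CG is the root-to-CG delay plus the CG delay plus the delay from the CG to its farthest gated FF, and the inter-CG skew is the spread of those arrivals. The slide's own equation is not available, so this max-minus-min form and the function names are assumptions.

```python
def cg_arrival(root_to_cg, cg_delay, cg_to_farthest_ff):
    """Clock arrival time at the farthest FF gated by one CG."""
    return root_to_cg + cg_delay + cg_to_farthest_ff


def inter_cg_skew(cgs, cg_delay):
    """cgs: list of (root_to_cg_delay, cg_to_farthest_ff_delay) pairs.
    Assumed definition: largest arrival minus smallest arrival."""
    arrivals = [cg_arrival(r, cg_delay, f) for r, f in cgs]
    return max(arrivals) - min(arrivals)
```

Under this reading, balancing wirelength and fanout across CGs narrows the spread of the two interconnection terms, which is what drives the skew down.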
14Control-Path Timing Constraint for Gated Clock Network
- The figure shows the control-path timing of the gated clock network.
- Terms of the constraint: the CG delay and the interconnection delay from the clock root to gi.
15Data-Path Timing Constraint for FF Merging
- Only FFs whose timing-feasible regions share a common intersection can be merged.
- Lin et al., TCAD'11, Wang et al., TCAD'12, Jiang et al., TCAD'12
- The timing-feasible region of a flip-flop can be obtained from the available timing slack on the corresponding data paths.
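The intersection test can be sketched as follows, assuming each timing-feasible region is a Manhattan-distance diamond (all points within a wirelength budget of a pin). Rotating coordinates to u = x + y, v = x - y turns a diamond into an axis-aligned square, so intersections reduce to interval arithmetic. The radius values and function names are hypothetical.

```python
def diamond_to_rect(cx, cy, radius):
    """Map the diamond |x - cx| + |y - cy| <= radius to (u_lo, u_hi, v_lo, v_hi)
    in rotated coordinates u = x + y, v = x - y."""
    u, v = cx + cy, cx - cy
    return (u - radius, u + radius, v - radius, v + radius)


def intersect(rects):
    """Common intersection of rotated-space rectangles, or None if empty."""
    u_lo = max(r[0] for r in rects)
    u_hi = min(r[1] for r in rects)
    v_lo = max(r[2] for r in rects)
    v_hi = min(r[3] for r in rects)
    if u_lo > u_hi or v_lo > v_hi:
        return None
    return (u_lo, u_hi, v_lo, v_hi)


def can_merge(regions):
    """FFs are merge candidates only if their feasible regions overlap."""
    return intersect(regions) is not None
```

A point (u, v) in the resulting rectangle maps back to x = (u + v) / 2, y = (u - v) / 2, which gives a legal location for the merged MBFF under this model.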
16Placement Density Constraint for CGs and MBFFs
- We divide the chip area into a number of bins with equal size.
- Lin et al., TCAD'11, Wang et al., TCAD'12, Jiang et al., TCAD'12
- A CG or an MBFF can only be placed in a bin whose density is less than the maximum placement density.
- This distributes logic cells evenly throughout the chip area and avoids routing congestion.
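A small sketch of a bin grid with a density check. The class name, the area-based density definition, and the choice to test the density after adding the new cell are assumptions for illustration.

```python
import math


class BinGrid:
    def __init__(self, chip_w, chip_h, bin_size, max_density):
        self.bin_size = bin_size
        self.max_density = max_density
        self.cols = math.ceil(chip_w / bin_size)
        self.rows = math.ceil(chip_h / bin_size)
        # Occupied cell area per bin
        self.used = [[0.0] * self.cols for _ in range(self.rows)]

    def bin_of(self, x, y):
        """Bin indices (col, row) containing the point, clamped to the grid."""
        col = min(int(x // self.bin_size), self.cols - 1)
        row = min(int(y // self.bin_size), self.rows - 1)
        return col, row

    def can_place(self, x, y, cell_area):
        """True if the bin stays within the maximum density after placement."""
        col, row = self.bin_of(x, y)
        density = (self.used[row][col] + cell_area) / self.bin_size ** 2
        return density <= self.max_density

    def place(self, x, y, cell_area):
        """Place the cell if allowed and report whether it was placed."""
        if not self.can_place(x, y, cell_area):
            return False
        col, row = self.bin_of(x, y)
        self.used[row][col] += cell_area
        return True
```

A rejected placement leaves the grid unchanged, so a caller can simply try the next candidate bin.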
17Problem Formulation
- Input
- A clock-gating domain containing a set of FFs controlled by gated clock signals whose switching activities are the same.
- A cell library containing both CG and MBFF cells.
- Objectives
- Minimize Pd and Tskew of the clock-gating domain
- (Pd is the primary objective, while Tskew is the secondary one because Tskew can be further minimized after clock tree routing.)
- Constraints
- Control-path timing constraint
- Data-path timing constraint
- Placement density constraint.
18Outline
- Introduction
- Preliminaries
- The Proposed Algorithms
- The Proposed Algorithms Flow
- MBFF-aware CG Cloning
- CG-based FF Merging
- MBFF and CG Placement Optimization
- Experimental Results
- Conclusions
19The Proposed Algorithms Flow
- Input: initial placement, cell library, design constraints
- Step 1: MBFF-aware CG cloning
- Step 2: CG-based FF merging
- Step 3: MBFF and CG placement optimization
- Output: optimized placement containing newly generated CGs and MBFFs
20MBFF-aware CG Cloning
- The CG must be replicated and its fanout FFs bisected when
- The control path violates the timing constraint
- The CG drives too many FFs, leading to larger clock power consumption.
21Hypergraph Construction
- According to the timing-feasible region of each FF, we construct the hypergraph H(V,E).
- vi: the timing-feasible region of the FF fi.
- ei: the intersection among the timing-feasible regions of different fi.
- w(ei): the number of vertices connected by ei.
- Example in the figure: w(e1) = 4, w(e2) = 2, w(e3) = 3
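A sketch of one way to build such a hypergraph, assuming each timing-feasible region is an axis-aligned rectangle (u_lo, u_hi, v_lo, v_hi) in rotated coordinates and that a hyperedge is a maximal group of FFs whose regions share a common point. The maximality rule and the function names are assumptions; the slide does not spell them out.

```python
from itertools import product


def contains(rect, u, v):
    u_lo, u_hi, v_lo, v_hi = rect
    return u_lo <= u <= u_hi and v_lo <= v <= v_hi


def build_hypergraph(regions):
    """Return a list of (vertex_indices, weight) hyperedges.

    The lower-left corner of any common intersection lies at some
    (u_lo of one region, v_lo of another), so probing those points
    finds every group of mutually overlapping regions.
    """
    candidates = set()
    u_values = {r[0] for r in regions}
    v_values = {r[2] for r in regions}
    for u, v in product(u_values, v_values):
        members = frozenset(i for i, r in enumerate(regions) if contains(r, u, v))
        if len(members) >= 2:
            candidates.add(members)
    # Keep only maximal groups; the weight is the number of connected vertices
    maximal = [e for e in candidates if not any(e < other for other in candidates)]
    maximal.sort(key=lambda e: sorted(e))
    return [(sorted(e), len(e)) for e in maximal]
```

This brute-force probe is cubic in the number of FFs, which is acceptable for a sketch but would need a sweep-line structure on large designs.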
22Cut-line Determination with Inter-CG Skew Budgeting
- The cut direction is determined by the physical dimension of the FF bounding box.
- Teng and Soin, ICSE'10
- A vertical (horizontal) cut is applied if the dimension in x-direction is larger (smaller) than that in y-direction.
- To balance the delay passing through different CGs, we sweep the cut line to search for the maximum skew slack.
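The cut-direction rule and the bisection of the fanout FFs can be sketched as below. The tie case (equal width and height) is not covered by the slide; treating it as a horizontal cut is an arbitrary assumption, as are the function names.

```python
def bounding_box(points):
    """Return (x_min, y_min, x_max, y_max) of the FF locations."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return min(xs), min(ys), max(xs), max(ys)


def cut_direction(points):
    """Vertical cut if the box is wider than tall, otherwise horizontal."""
    x0, y0, x1, y1 = bounding_box(points)
    return "vertical" if (x1 - x0) > (y1 - y0) else "horizontal"


def bisect_ffs(points, direction, cut):
    """Split FFs by a cut line at coordinate `cut` along the chosen axis."""
    axis = 0 if direction == "vertical" else 1
    low = [p for p in points if p[axis] <= cut]
    high = [p for p in points if p[axis] > cut]
    return low, high
```

Sweeping then amounts to calling `bisect_ffs` for each candidate `cut` value and scoring the two resulting FF sets.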
23Skew Slack (1/2)
- In Fig.(c) (Fig.(d)), the CGs are placed at the
position closest to (farthest from) the clock
root within the respective FF bounding boxes,
resulting in the shortest (longest) clock signal
delay from the clock root to the FFs.
24Skew Slack (2/2)
- The skew slack can be calculated as the difference between the minimum longest and the maximum shortest clock signal delay.
- To more easily balance the delay passing through different CGs, we would like to find a physical cut line which maximizes the skew slack.
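The definition above translates directly into a short computation. In this sketch each CG contributes a (shortest, longest) delay pair, and the sweep picks the candidate cut with the largest slack; the dictionary layout for candidate cuts is an assumption.

```python
def skew_slack(delay_ranges):
    """delay_ranges: list of (shortest, longest) clock delays, one per CG.
    Skew slack = minimum of the longest delays - maximum of the shortest delays."""
    min_longest = min(longest for _, longest in delay_ranges)
    max_shortest = max(shortest for shortest, _ in delay_ranges)
    return min_longest - max_shortest


def best_cut(candidates):
    """candidates: {cut_position: [(shortest, longest), ...]}.
    Return the cut position whose FF sets give the largest skew slack."""
    return max(candidates, key=lambda cut: skew_slack(candidates[cut]))
```

A non-negative slack means the delay ranges of all CGs overlap, so CG positions exist that give every CG the same delay; a negative slack means some skew is unavoidable without extra buffering.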
25MBFF-aware FF Swapping
- We perform the FM (Fiduccia-Mattheyses) algorithm on H(V,E) to move FFs between different FF sets such that the cut size is minimized.
- Cut size: sum of edge weights on the cut line
- Balance condition: the skew slack after moving an FF to the other FF set must not fall below a bound set by a balance factor.
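A simplified sketch of the move step. It is a greedy, positive-gain variant rather than full FM (no gain buckets, no negative-gain moves with rollback), and the balance condition is passed in as a caller-supplied predicate because the slide's exact bound is not available. All names are illustrative.

```python
def cut_size(edges, side):
    """Sum of weights of hyperedges whose vertices lie on both sides."""
    return sum(w for members, w in edges if len({side[v] for v in members}) > 1)


def move_gain(edges, side, v):
    """Reduction in cut size if vertex v switches sides."""
    before = cut_size(edges, side)
    side[v] ^= 1
    after = cut_size(edges, side)
    side[v] ^= 1  # restore the original side
    return before - after


def greedy_fm_pass(edges, side, balanced):
    """Repeatedly apply the best positive-gain move that keeps `balanced` true.
    Each vertex moves at most once per pass, as in FM."""
    side = dict(side)
    locked = set()
    while True:
        best = None
        for v in side:
            if v in locked:
                continue
            gain = move_gain(edges, side, v)
            if gain <= 0:
                continue
            side[v] ^= 1
            ok = balanced(side)
            side[v] ^= 1
            if ok and (best is None or gain > best[0]):
                best = (gain, v)
        if best is None:
            return side
        side[best[1]] ^= 1
        locked.add(best[1])
```

In the flow described here, `balanced` would recompute the skew slack of the two FF sets and compare it with the bound; a simple size check can stand in while experimenting.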
26CG-based FF Merging
- We merge 1-bit FFs into MBFFs, starting from the four boundaries of the FF bounding box and moving toward the center area, based on
- INTEGRA (Jiang et al., TCAD'12)
- Spiral clustering technique (Chang et al., ISPD'12)
27MBFF and CG Placement Optimization
- We perform MBFF and CG placement optimization to
- Minimize inter-CG clock skew
- Minimize wirelength
- Minimize required clock buffers
- Satisfy control/data-path timing constraints
- Satisfy placement density constraints
28MBFF Placement
- When placing the MBFFs controlled by the same CG, we search for placement bins which satisfy the following conditions
- The placement density constraint is met.
- The bin lies in the timing-feasible region corresponding to each MBFF.
- The FF bounding box of the CG fanouts is minimized.
- The smaller FF bounding box can result in shorter gated clock signal wirelength, and hence smaller Pd and Tskew.
29CG Placement
- The CGs are initially placed at feasible positions which satisfy the control-path timing constraints.
- The feasible region of a CG is roughly an ellipse whose two foci are at the positions of the enable logic and one of the CG fanout FFs.
- We perform an iterative optimization algorithm to
- Move CGs around their feasible regions until inter-CG clock skew cannot be further reduced.
- Add clock buffers to either clock path from the clock root to a CG for delay balance.
- Insert buffers to either enable signal path from the enable logic to a CG for a larger feasible region of the CG.
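A sketch of the feasible-region test, assuming the "ellipse" is defined by a bound on the summed distance from a candidate CG position to its two foci (the enable logic and one fanout FF). Manhattan distance is used as a wirelength proxy and the `budget` parameter stands for a wirelength bound derived from the control-path timing; both are assumptions.

```python
def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def cg_position_feasible(pos, enable_logic, fanout_ff, budget):
    """True if the summed distance to the two foci stays within the budget."""
    return manhattan(pos, enable_logic) + manhattan(pos, fanout_ff) <= budget


def feasible_positions(candidates, enable_logic, fanout_ff, budget):
    """Filter candidate CG positions (for example, bin centers)."""
    return [p for p in candidates
            if cg_position_feasible(p, enable_logic, fanout_ff, budget)]
```

In this model, inserting a buffer on the enable path corresponds to raising `budget`, which enlarges the feasible region.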
30Outline
- Introduction
- Preliminaries
- The Proposed Algorithms
- Experiments
- Experimental Setups
- Experimental Comparisons
- Experimental Results
- Conclusions
31Experimental Setups
- Programming language
- C
- Platform
- 2.26 GHz Intel Xeon machine under the Linux operating system
- We adopted the benchmark circuits in Jiang et al., TCAD'12
- Added other logical, physical, and timing information for CGs, the clock root, and the enable logic (EL).
- Referred to the Nangate 45nm Open Cell Library to set the input capacitance.
- Assumed that all FFs in each circuit are initially connected to the same CG.
- Chose the circuits containing fewer than 1,000 FFs with reasonable FF bounding boxes.
32Experimental Comparisons
- Reference Flows 1 and 2
- The CG cloning technique is based on the MBFF-aware CG cloning without applying MBFF-aware FF swapping.
- The FF merging technique is exactly the same as the CG-based FF merging.
33Experimental Results (1/2)
- Comparison of the numbers of MBFFs with different bit numbers (# of FFs) and the CG numbers (# of CGs).
- Compared with Reference Flow 1, the proposed flow results in many more MBFFs with a similar number of clock gates.
- Compared with Reference Flow 2, the proposed flow results in slightly more CGs and slightly fewer MBFFs.
34Experimental Results (2/2)
- Comparisons of the dynamic power consumption
- 15% less than that resulting from Reference Flow 1.
- 10% less than that resulting from Reference Flow 2.
- Comparisons of the clock net wirelength
- 22% less than that resulting from Reference Flow 1.
- 18% less than that resulting from Reference Flow 2.
- Comparisons of the signal net wirelength
- 2% less than that resulting from Reference Flow 2.
36Conclusions
- We have presented a new problem formulation for clock network optimization with both CGs and MBFFs.
- We have also introduced novel techniques to optimize the gated clock network with CG cloning and FF merging simultaneously.
- The experimental results have shown that the proposed approach results in lower dynamic power and shorter clock wirelength compared with flows which optimize the gated clock network with CGs and MBFFs separately.
37Thanks for Your Attention