Efficient DAG mapping using decomposition selection and area-delay curves using a mapping graph

Slides: 45

Provided by: manpree9

Learn more at: https://people.eecs.berkeley.edu

1
Efficient DAG mapping using decomposition
selection and area-delay curves using a mapping
graph

Dirk-Jan Jongeneel
dirkjjn_at_cas.et.tudelft.nl

2
Acknowledgements

R.H.J.M. Otten
R. Brayton
Y. Watanabe
Y. Kukimoto
P. Sawkar
S. Burns

3
Agenda

Why a new mapping approach
The Lehman-Watanabe algorithm.
Potential problems for practical use.
Area-delay optimization
Massoud's extension.
Implementation using fanout load and area guesses
Repowering possibilities.
Heuristic for backward traversal for optimal
cover selection to get multiple design points.
Results
Encountered problems/Potential solutions

4
Technology mapping

Input
Technology independent optimized logic network.
Description of gates in a library with their
costs.
Output
Netlist of gates (from the library) which
minimizes total cost.
General approach
Construct a subject DAG for the network.
Represent each gate in target library by pattern
DAGs.
Find an optimal-cost covering of the subject DAG
using the collection of pattern DAGs.

5
Current Mapping strategies

Complexity of DAG covering
NP-hard.
Remains NP-hard even when the nodes have degree ≤ 2.
Tree-mapping was proposed for optimal min-area covers
and was later also used for min delay [Keutzer].
If the subject DAG and the pattern DAGs are trees, an
efficient algorithm to find the best cover
exists.
It is based on dynamic programming.
DAG-mapping is possible for optimal min delay
[Kukimoto].
The subject DAG is not broken into trees and the
matching part of the algorithm is slightly
modified.

Normal approach
Phase 1: Technology-independent optimization
Commit to a particular Boolean network.
Algebraic decomposition is used.
Phase 2: AND2/INV decomposition
Commit to a particular decomposition of a general
Boolean network using 2-input ANDs and inverters.
Phase 3: Technology mapping
A two-step dynamic programming algorithm is
used.
From PI to PO: for all nodes, find all the matches
at a node with their costs using tree-matching
and select the one with the lowest cost.
From PO to PI: select the best match at a node to
cover a part of the subject DAG and continue
recursively at the inputs of the currently selected
match.
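
The two passes of phase 3 can be sketched as follows. This is a minimal illustration, not the mapper from the talk: the `Node` class, the three-gate library (`INV`, `AND2`, `NAND2`) and its area values are hypothetical, and the matcher is hard-coded for these patterns instead of being a general tree matcher.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    op: str            # "in", "inv" or "and"
    kids: tuple = ()
    name: str = ""     # distinguishes primary inputs

AREA = {"INV": 1.0, "AND2": 3.0, "NAND2": 2.0}   # hypothetical library areas

def matches(node):
    """Yield (gate, leaves) for every library pattern rooted at node."""
    if node.op == "inv":
        yield "INV", node.kids
        child = node.kids[0]
        if child.op == "and":
            yield "NAND2", child.kids        # INV(AND2(x, y)) is a NAND2
    elif node.op == "and":
        yield "AND2", node.kids

def best_cover(node, memo):
    """Forward pass (PI to PO): minimum area and best match per node."""
    if node.op == "in":
        return 0.0
    if node not in memo:
        options = [(AREA[g] + sum(best_cover(l, memo) for l in leaves), g, leaves)
                   for g, leaves in matches(node)]
        memo[node] = min(options, key=lambda o: o[0])
    return memo[node][0]

def select(node, memo, out):
    """Backward pass (PO to PI): follow the stored best matches."""
    if node.op == "in":
        return
    _, gate, leaves = memo[node]
    out.append(gate)
    for leaf in leaves:
        select(leaf, memo, out)

# Subject tree for NAND(NAND(a, b), c), written with AND2/INV nodes.
a, b, c = (Node("in", name=n) for n in "abc")
nand_ab = Node("inv", (Node("and", (a, b)),))
root = Node("inv", (Node("and", (nand_ab, c)),))
memo = {}
total_area = best_cover(root, memo)
gates = []
select(root, memo, gates)
print(total_area, gates)
```

With these assumed areas the cover uses two NAND2 gates; changing the `AREA` table changes which matches win.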

7
Current drawback and solution using
Lehman-Watanabe Method

Drawbacks: procedures in each phase are
disconnected, resulting in optimal sub-results but
a possibly sub-optimal overall result.
Phase 1 and 2 make critical decisions about
algebraic and AND2/INV decompositions without
knowing much about constraints and library.
Phase 3 knows about the constraints and library
but the solution space has already been limited
by the decisions made earlier.
Lehman-Watanabe Method.
Efficiently encode a set of AND2/INV
decompositions into a single structure called a
mapping graph.
Apply a modified tree-based or a new partial
technology mapper while dynamically performing
algebraic logic decomposition on the mapping
graph.
DAG-mapping is naturally introduced.

8
Mapping graph AND2/INV decompositions

f = abc can be represented in various ways.
We can combine them with a choice node.

9
Mapping graph AND2/INV decompositions

This can be represented compactly as follows,
which also encodes the following new
decomposition.

10
Mapping graph AND2/INV decompositions

The complete decomposition in ugates is:

11
Mapping graph AND2/INV decompositions

The mapping graph is a modified Boolean network.
Choice node: makes choices possible between
different decompositions.
Cyclic: functions are written in terms of each
other, e.g. an inverter chain of arbitrary length.
Reduced: no two choice nodes with the same
function; no two AND2s with the same fanins.
Ugates: efficient implementation because of
regularity.
For the cht benchmark (MCNC91), there are 2.2 × 10^93
AND2/INV decompositions. All are encoded with only
400 ugates containing 599 AND2s in total.
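
To give a feeling for why shared choice nodes are so compact, the sketch below models only the AND2 decompositions of a single n-input AND. It assumes one choice node per subset of the inputs and one AND2 per unordered split of that subset. This is an illustrative model, not the ugate data structure of the talk, and the function names are hypothetical.

```python
from functools import lru_cache
from itertools import combinations

def splits(s):
    """All unordered splits of frozenset s into two non-empty parts."""
    items = sorted(s)
    first, rest = items[0], items[1:]
    for k in range(len(rest)):               # `first` always stays on the left
        for extra in combinations(rest, k):
            left = frozenset((first, *extra))
            yield left, s - left

@lru_cache(maxsize=None)
def decompositions(s):
    """Number of AND2 trees encoded under the choice node for subset s."""
    if len(s) == 1:
        return 1
    return sum(decompositions(l) * decompositions(r) for l, r in splits(s))

def and2_count(inputs):
    """Number of distinct AND2 nodes when every subset is one choice node."""
    total = 0
    for size in range(2, len(inputs) + 1):
        for subset in combinations(inputs, size):
            total += sum(1 for _ in splits(frozenset(subset)))
    return total

print(decompositions(frozenset("abc")), and2_count("abc"))
print(decompositions(frozenset("abcdefgh")), and2_count("abcdefgh"))
```

For f = abc this gives 3 decompositions; for an 8-input AND the number of encoded trees is already far larger than the number of AND2 nodes needed to encode them.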

12
Tree-mapping on a Mapping Graph

Every time a choice node is reached, an input is
selected and tree matching continues as usual.
For every choice node all inputs have to be
tried.
Cycles may occur: iterate until costs are stable.
There are the inverter cycles and the cycles
introduced by reduction and multiple encoding,
e.g. f = gh1 and g = fh2.
DAG-mapping is automatically introduced.
Because the number of fanouts is unknown during
mapping, splitting into trees is not possible, so
matches pass multi-fanout points, resulting
in DAG-mapping.
Select the cover as usual.
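
The "iterate until costs are stable" step can be sketched as a simple relaxation over choice nodes. The signal names, delays and the `choices` table below are hypothetical; the only point is that a signal and its complement may each be derived from the other through an inverter, so costs must be recomputed until nothing improves.

```python
import math

INV_DELAY = 1.0  # hypothetical inverter delay

# Each signal is a choice node: a list of alternatives (delay, fanin signals).
# "x" and "nx" are complements, so each can be built from the other (a cycle).
choices = {
    "a":  [(0.0, [])],                        # primary input
    "x":  [(5.0, ["a"]), (INV_DELAY, ["nx"])],
    "nx": [(2.0, ["a"]), (INV_DELAY, ["x"])],
}

def stable_arrivals(choices):
    """Relax all choice nodes repeatedly until no arrival time improves."""
    arrival = {s: math.inf for s in choices}
    changed = True
    while changed:
        changed = False
        for sig, alts in choices.items():
            best = min(d + max((arrival[f] for f in fanins), default=0.0)
                       for d, fanins in alts)
            if best < arrival[sig]:
                arrival[sig] = best
                changed = True
    return arrival

arrivals = stable_arrivals(choices)
print(arrivals)
```

Here the direct implementation of `x` takes 5 time units, but after one more pass the route through `nx` and an inverter is found to be faster. The loop terminates because all delays are positive.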

13
Example Tree-mapping

Best choice if c is later than a and b.
(Figure labels: subject graph, library pattern graph.)
i3 is faster than i1 and i2.

14
Graph-Mapping Theory

$\text{graph-mapping}(\Gamma) = \min_{\theta \in \Gamma} \text{tree-mapping}(\theta)$
$\Gamma$: mapping graph
$\theta$: AND2/INV decomposition encoded in $\Gamma$
Graph-mapping finds an optimal tree
implementation for each primary output over all
the AND2/INV decompositions encoded in $\Gamma$.
Graph-mapping is as powerful as applying
tree-matching exhaustively, but is typically
exponentially faster.

15
Lambda and Delta Mapping

Lambda mapping
1. Encode all the AND2 decompositions of the
product terms and then of all the sum terms for all
the nodes.
2. Apply graph-mapping.
Combines phases 2 and 3 (AND2/INV
decomposition and mapping).
Delta mapping
1. Encode all the AND2 decompositions of the
product terms and then of all the sum terms for all
the nodes.
2. Iteratively apply graph-mapping and logic
decomposition until nothing changes any more.
Combines phases 1, 2 and 3 (algebraic
optimization, AND2/INV decomposition and
mapping).

16
Dynamic Logic Decomposition

During mapping, find D-patterns and add the
corresponding F-patterns dynamically.
D-pattern: ab + ac. F-pattern: a(b + c).
If a is critical, the F-pattern is usually better.
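
A small timing sketch shows why the F-pattern helps when a is late. It assumes the D-pattern is the distributed form ab + ac and the F-pattern is the factored form a(b + c), and it uses hypothetical unit delays for AND2 and OR2.

```python
AND_DELAY = 1.0   # hypothetical unit delays
OR_DELAY = 1.0

def d_pattern(ta, tb, tc):
    """Arrival time of ab + ac: every input passes an AND2 and the OR2."""
    return max(max(ta, tb) + AND_DELAY, max(ta, tc) + AND_DELAY) + OR_DELAY

def f_pattern(ta, tb, tc):
    """Arrival time of a(b + c): input a only passes the final AND2."""
    return max(ta, max(tb, tc) + OR_DELAY) + AND_DELAY

late_a = (d_pattern(5, 0, 0), f_pattern(5, 0, 0))
late_b = (d_pattern(0, 5, 0), f_pattern(0, 5, 0))
print(late_a, late_b)
```

With a late, the factored form saves one gate delay; with b late, both forms give the same arrival time in this model.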

17
Dynamic Logic Decomposition

D-pattern search and F-pattern adding in a graph.
Note: adding an F-pattern may introduce a new
D-pattern.

18
Example choosing the right decomposition

Tree-matching on the graph and on AND2/INV
decompositions.
AND8 node with arrival time a = 3 × delay(AND2).
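
The effect of choosing the decomposition during mapping can be sketched for an 8-input AND. The example assumes, for illustration, that one input arrives three AND2 delays after the others and that an AND2 has unit delay. `best_and_arrival` searches all AND2 trees (the role of the choice nodes), while `balanced_arrival` evaluates one fixed balanced tree. Both function names are hypothetical.

```python
from functools import lru_cache

def best_and_arrival(arrivals, and2_delay=1.0):
    """Earliest output time of an n-input AND over all AND2 trees."""
    n = len(arrivals)

    @lru_cache(maxsize=None)
    def best(mask):
        if (mask & (mask - 1)) == 0:             # single input left
            return arrivals[mask.bit_length() - 1]
        result = float("inf")
        sub = (mask - 1) & mask                  # enumerate proper submasks
        while sub:
            other = mask ^ sub
            if sub > other:                      # visit each split once
                result = min(result, max(best(sub), best(other)) + and2_delay)
            sub = (sub - 1) & mask
        return result

    return best((1 << n) - 1)

def balanced_arrival(arrivals, and2_delay=1.0):
    """Output time of one fixed, balanced AND2 tree."""
    level = list(arrivals)
    while len(level) > 1:
        nxt = [max(level[i], level[i + 1]) + and2_delay
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])                # odd input passes through
        level = nxt
    return level[0]

arrivals = [3.0] + [0.0] * 7                     # input a is late
print(best_and_arrival(arrivals), balanced_arrival(arrivals))
```

The exhaustive search places the late input at the last AND2, which the fixed balanced tree cannot do.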

19
Possible problems for practical use

The size of the graph grows with the
initial node size N, requiring more memory.
The size of the choice nodes also grows
with the initial size N, slowing down
tree-matching.

These two problems can partially be solved by
choosing a value for N.
Large value: more memory and longer run time, but
many more possibilities to find matches,
resulting in a better cover.
Small value: less memory and shorter run time, but
not such a good cover.
Another possibility is the use of partial
matching.
Depending on the library model and delay model, the
same matches can be found much faster because of
pruning, except for leaf DAGs.
Worst case: AND10 decomposition and AND4 cell
tree-matching: 61 sec
partial-matching: 0.74 sec
Disadvantage: as soon as we do more complex
modeling it becomes an approximation and we might
prune away potentially better matches.
We save all the different partial matches at each
node, causing an increase in memory use.

21
Partial matching

The library
A library cell is composed of the
root cell of its AND2/INV decomposition and the
partials that represent its inputs.

22
Example Partial matching

At each node try to find all the partial matches
by combining all the partials at the inputs of
the root-cell. Then evaluate the partials that
are also complete matches and save the best as
-.

23
Condition for equality

Partial match and tree match will give equal
results if we assume that:
The delays from all inputs to the output of the library
cell are equal under all conditions.
partial1: i1 = 1, i2 = 4 (better or worse?)
partial2: i1 = 2, i2 = 3
Different input-output delays: and2 a = 2, b = 2
-> 2; and4 a = 1, b = 4, c = 4, d = 4 -> 1
Change due to load dependency: and2 a = 2, b = 2
-> 2; loaded and2 a = 3, b = 6 -> 1
The area of the match is the sum of the areas of
the inputs and of the cell itself.
Leaf DAGs are not possible because no relation
is known between the partials of two inputs.
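
The equality condition can be illustrated with two partials whose input arrival times are (1, 4) and (2, 3). This is a sketch under assumptions: the reading of the two partials as arrival-time pairs and the pin delays used below are hypothetical and are not the cell data from the slide.

```python
def output_arrival(input_arrivals, pin_delays):
    """Arrival at the cell output for given input arrivals and pin delays."""
    return max(t + d for t, d in zip(input_arrivals, pin_delays))

partial1 = (1, 4)   # assumed arrival times at inputs i1, i2
partial2 = (2, 3)

def better(pin_delays):
    """Return which partial yields the earlier output for these pin delays."""
    t1 = output_arrival(partial1, pin_delays)
    t2 = output_arrival(partial2, pin_delays)
    return "partial1" if t1 < t2 else "partial2"

equal_pins = better((2, 2))     # same delay from both inputs
unequal_pins = better((4, 1))   # slow pin on i1, fast pin on i2
print(equal_pins, unequal_pins)
```

With equal pin delays the partial with the smaller worst arrival always wins, so pruning the other one is safe. With unequal pin delays the ranking can flip, which is why pruning partials becomes an approximation.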

24
Advantages of Graph-mapping

Optimal decomposition is chosen with respect to
constraints.
Dynamic decomposition can do a better job than
technology independent optimization.
Encoding more initial circuits possible.
Sharing is maximized resulting in lower area.
Has potential for interesting repowering
decisions during matching.
Tradeoffs possible between runtime, quality and
memory

25
Area-Delay estimation

Massoud's proposal for area-delay tradeoff
Maintain an area-delay curve at each node
composed of non-inferior results of matching.
Solution $(t_1, a_1)$ is non-inferior if there is no
solution $(t_2, a_2)$ such that $t_2 \le t_1$ and $a_2 < a_1$,
or $t_2 < t_1$ and $a_2 \le a_1$.
Use a delta-t to cut down the number of points in
the curve, because combining curves could give an
exponential increase at each stage.
During selection of matches to cover the graph,
select the match that meets the required timing and
recurse as usual.
To improve results, loads can be used as soon as
they are known.
Outside the critical path we can now select
smaller covers.
Optimal only for trees under a no-load condition.
DAG approximation is possible using an area guess of
area × 1/n for an n-fanout point to encourage
sharing.
Load can be approximated by n × a default value
and could be corrected for the real load during
matching and covering.
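
Keeping only non-inferior points, and thinning them with a delta-t, can be sketched as below. Points are (delay, area) pairs. The rule of keeping the faster point when two points are closer than delta-t is an assumption made for this example, and the function names are hypothetical.

```python
def non_inferior(points):
    """Keep only (delay, area) points not dominated by another point."""
    curve = []
    best_area = float("inf")
    for t, a in sorted(set(points)):      # by delay, then by area
        if a < best_area:                 # smaller area than every faster point
            curve.append((t, a))
            best_area = a
    return curve

def thin(curve, delta_t):
    """Drop points closer than delta_t in delay to the last kept point."""
    kept = []
    for t, a in curve:
        if not kept or t - kept[-1][0] >= delta_t:
            kept.append((t, a))
    return kept

points = [(1, 10), (2, 8), (2, 9), (3, 8), (4, 5), (5, 6)]
curve = non_inferior(points)
thinned = thin(curve, delta_t=2)
print(curve, thinned)
```

The resulting curve is sorted by increasing delay and strictly decreasing area, which makes later selection of the cheapest point that meets a required time straightforward.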

26
Example using area-delay curve

Combine the points of the two inputs to create a
new curve with non-inferior points.
At cover selection we note a non-critical input
(A) and select a match with lower area.
Result: area = 7.5; normally it would be 9.
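
Combining the curves of two inputs at a 2-input gate can be sketched as follows. The input curves, gate delay and gate area are hypothetical and are not the numbers from the slide's figure. Loads and fanout are ignored, which matches the tree, no-load case.

```python
def non_inferior(points):
    """Keep only (delay, area) points not dominated by another point."""
    curve, best_area = [], float("inf")
    for t, a in sorted(set(points)):
        if a < best_area:
            curve.append((t, a))
            best_area = a
    return curve

def combine(curve_a, curve_b, gate_delay, gate_area):
    """Curve at the output of a 2-input gate from the curves of its inputs."""
    candidates = [(max(ta, tb) + gate_delay, aa + ab + gate_area)
                  for ta, aa in curve_a
                  for tb, ab in curve_b]
    return non_inferior(candidates)

curve_a = [(1, 4), (3, 2)]      # (delay, area) points at input A
curve_b = [(2, 3), (4, 1)]      # (delay, area) points at input B
out_curve = combine(curve_a, curve_b, gate_delay=1, gate_area=1)
print(out_curve)
```

Four candidate points are generated and one is discarded as inferior. Without pruning, the number of points would multiply at every gate.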

27
Implementation using area and fanout load guesses

We use area × 1/n as the area guess to encourage
sharing, but n is special.
n is the number of fanouts of the node in the
original network. The number of real fanouts in
the graph can't be used because it is unknown how
many of them will actually be used.
Nodes inside a decomposition will most likely
only be used to get the best decomposition, thus
n = 1.
When a part of a match crosses a multi-fanout
point, the areas of these inputs of the match are
also multiplied by 1/n, where n is the value at
the multi-fanout point.
The assumption is that after reconversion the
divided areas add up to the original value again.

28
Implementation using area and fanout load guesses

To account for load, so that we can keep the
non-inferior points of the curve and put them in
the right place, we use a load guess n × l_avr.
n is the same as in the area guess.
l_avr should be about equal for all library gates.
PO loads are used directly to get as exact a
result as possible. They can be considerably higher.
If a match crosses a multi-fanout point, the
delays of the inputs that cross are increased.
This accounts for the fact that the load at
nodes inside the decomposition is possibly no longer
1 × l_avr, because they could be used more than
once.

29
Reduction of curve/runtime

Area-spacing-based reduction to a certain
limit, say 10, such that for an n-input cell at
most 10^n matches have to be
evaluated.
If the area difference < 2% of the area -> delete the
slowest.
Increase the percentage until the number of points
≤ limit.
More points give exponentially growing runtime.
A tradeoff is possible between runtime and quality of
the result.
If we do this based on time, we get a curve with a
few fast points and a lot of slow points with almost
equal area.
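
The area-spacing reduction can be sketched as follows. The starting percentage of 2, the step of 1 and comparing each point against the last kept point are assumptions for this example. Areas are assumed to be positive, otherwise the loop would not terminate.

```python
def reduce_curve(curve, limit=10, start_pct=2.0, step_pct=1.0):
    """Thin a non-inferior curve (sorted by delay) by area spacing."""
    points = list(curve)
    pct = start_pct
    while len(points) > limit:
        kept = [points[0]]
        for t, a in points[1:]:
            prev_area = kept[-1][1]
            if abs(prev_area - a) < pct / 100.0 * prev_area:
                continue                  # nearly equal area: drop the slower point
            kept.append((t, a))
        points = kept
        pct += step_pct                   # be more aggressive on the next pass
    return points

curve = [(1, 100), (2, 99), (3, 98.5), (4, 90), (5, 89.5), (6, 80)]
reduced = reduce_curve(curve, limit=3)
print(reduced)
```

Because the surviving points are spread over area, the reduced curve keeps both the fastest solution and clearly cheaper alternatives, instead of many slow points with almost equal area.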

30
Repowering opportunities

Several different repowering possibilities will
be in the curve and the best one under the
current conditions will be chosen.
Capacitance splitting; serial repowering.

31
Repowering opportunities

Complex repowering

32
Heuristics for cover selection.

Select a match which meets timing and calculate
the required times for its fanins.
Do parallel repowering in case of a critical path.
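
The backward traversal can be sketched as below. The `Point` record, the node names and all numbers are hypothetical. The sketch picks the smallest-area point that meets the required time at each node and passes `required - delay` to the fanins. It ignores repowering and the case where a multi-fanout node is reached with different required times.

```python
from typing import NamedTuple

class Point(NamedTuple):
    arrival: float   # arrival time at the node output
    area: float      # accumulated area guess
    gate: str        # matched library gate
    delay: float     # delay of that gate
    fanins: tuple    # names of the nodes at the match inputs

def select_cover(curves, node, required, chosen=None):
    """Pick the smallest-area point meeting `required`, then recurse."""
    chosen = {} if chosen is None else chosen
    feasible = [p for p in curves[node] if p.arrival <= required]
    point = min(feasible, key=lambda p: p.area)   # raises if timing cannot be met
    chosen[node] = point.gate
    for fanin in point.fanins:
        select_cover(curves, fanin, required - point.delay, chosen)
    return chosen

curves = {
    "a":   [Point(0, 0, "PI", 0, ())],
    "b":   [Point(0, 0, "PI", 0, ())],
    "n1":  [Point(1, 4, "AND2_fast", 1, ("a", "b")),
            Point(3, 2, "AND2_small", 3, ("a", "b"))],
    "out": [Point(2, 5, "INV", 1, ("n1",)),
            Point(4, 3, "INV", 1, ("n1",))],
}
relaxed = select_cover(curves, "out", required=4)
tight = select_cover(curves, "out", required=2)
print(relaxed["n1"], tight["n1"])
```

Running the selection with several required times at the primary outputs yields several design points from the same set of curves.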

33
Results

MCNC test cases
using a modified lib2.genlib (input-output delays are
equal)
using partial matching (equal to tree matching because
of the library)
Nmax = 10
using load and area guesses

34
Results

Compare the results of DAG mapping, DAG mapping
with area recovery, and graph mapping with area
recovery.

39

Difference between partial and tree matching when
crossing multi-fanout points.
Partial matching -> hardly any crossover.
Tree matching -> too much crossover if
non-critical.
This is a result of the model used at crossover.
Partial matching: crossover guesses are made only at
complete matches -> matches that cross are pruned
out early. This can be improved by using crossing
information also for partials.
Tree matching: crossover matches compete
with matches at the fanout point -> they are somewhat
faster but have about an equal area guess. Crossover
matches should only be used for critical paths,
and if we therefore make their area somewhat larger
they will not compete with the other matches on
non-critical paths.
Important: keep comparisons fair, not only within a
cone but also between cones.

40
Encountered problems/Potential solutions

Runtime is a problem using tree matching, and quality
using partial matching.
C1355: tree -> > 12 hours; partial -> 10 min.
Tree matching is a very simple implementation.
Explore faster techniques using properties of the
graph.
It is needed now for leaf DAGs (but MUX4 is
already a disaster).
Partial matching
Approximate library cells by one delay for inputs
-> output.
Use tree matching if one input is really much
slower.
Use partial ordering for area-delay and the 1,4 <->
2,3 problem.

BDDs are used for reduction of the Graph causing
a problem in case of blow up.
This often occurs when there are a lot of
XOR/selector type of gates.
Not always using BDDs or doing something about
ordering could be a solution.
It is possible not to use BDDs, but then sharing
will be less. Multiple encoding of networks is
not possible.

Cycles or loops are difficult in matching and
covering.
Large loops have to be matched by iteration until
nothing changes, giving problem for area and load
guesses.
The inverter cycle exists always and has a
typical problem for the load guesses because of
inverters following each other after connecting
two matches.
Extended data structures are needed to store more
information.
During iteration we have to keep track of what
information has come from what assumption
considering fanouts.
For the inverter cycle we need an extended data
structure to take into account where a match
comes from and where it connects to.

Inverter problems for guesses.
Multi-fanout point f: a NAND match crossing f ->
increase delay.
But if we add an inverter we end at the multi-fanout
point f and add delay again.
Do not count in case of a NAND. But if connected to a
NOR, the delay should increase to favor sharing.

44
Conclusion

Graph mapping has the potential of finding
smaller and/or faster results.
It offers different design points to choose from.
Run time, quality and memory use are adjustable
and can be traded off against each other.
It could be extended to also make decisions about
other aspects, such as power.
