Title: Efficient DAG mapping using decomposition selection and area-delay curves using a mapping graph
1 Efficient DAG mapping using decomposition selection and area-delay curves using a mapping graph
- Dirk-Jan Jongeneel
- dirkjjn_at_cas.et.tudelft.nl
2 Acknowledgements
- R.H.J.M. Otten
- R. Brayton
- Y. Watanabe
- Y. Kukimoto
- P. Sawkar
- S. Burns
3 Agenda
- Why a new mapping approach
- Lehman-Watanabe algorithm
- Potential problems for practical use
- Area-delay optimization
- Massoud's extension
- Implementation using fanout load and area guesses
- Repowering possibilities
- Heuristic for backward traversal for optimal cover selection to get multiple design points
- Results
- Encountered problems/Potential solutions
4 Technology mapping
- Input
- Technology-independent optimized logic network.
- Description of the gates in a library with their costs.
- Output
- Netlist of gates (from the library) that minimizes total cost.
- General approach
- Construct a subject DAG for the network.
- Represent each gate in the target library by pattern DAGs.
- Find an optimal-cost covering of the subject DAG using the collection of pattern DAGs.
5 Current Mapping strategies
- Complexity of DAG covering
- NP-hard
- Remains NP-hard even when the nodes have degree ≤ 2.
- Tree-mapping was proposed for an optimal min-area cover and later also used for min delay (Keutzer).
- If the subject DAG and pattern DAGs are trees, an efficient algorithm to find the best cover exists.
- Based on a dynamic programming algorithm.
- DAG-mapping is possible for optimal min delay (Kukimoto).
- The subject DAG is not broken into trees and the matching part of the algorithm is slightly modified.
6
- Normal approach
- Phase 1: Technology-independent optimization
- Commit to a particular Boolean network.
- Algebraic decomposition is used.
- Phase 2: AND2/INV decomposition
- Commit to a particular decomposition of a general Boolean network using 2-input ANDs and inverters.
- Phase 3: Technology mapping
- A two-step dynamic programming algorithm is used.
- From PI to PO: for all nodes, find all the matches at a node with their costs using tree-matching and select the one with the lowest cost.
- From PO to PI: select the best match at a node to cover a part of the subject DAG and continue recursively at the inputs of the currently selected match.
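The two passes of Phase 3 can be sketched as follows. This is a minimal illustration, not the mapper used in the talk: the pattern matcher is assumed to have already run, the `MATCHES` table, the gate names and the areas are hypothetical, and the cost is plain area summed as for a tree.

```python
# Hypothetical data: each node lists its candidate matches as
# (gate name, gate area, input nodes). Primary inputs have no matches.
MATCHES = {
    "a": [], "b": [], "c": [],
    "n1": [("AND2", 2.0, ["a", "b"])],
    "n2": [("AND2", 2.0, ["n1", "c"]), ("AND3", 3.0, ["a", "b", "c"])],
}
ORDER = ["a", "b", "c", "n1", "n2"]  # topological order, PI to PO


def forward_pass(matches, order):
    """PI -> PO: store the lowest-cost match and its cost at every node."""
    best = {}
    for node in order:
        if not matches[node]:
            best[node] = (0.0, None)  # primary input: no gate, no cost
            continue
        candidates = []
        for gate, area, inputs in matches[node]:
            cost = area + sum(best[i][0] for i in inputs)
            candidates.append((cost, (gate, inputs)))
        best[node] = min(candidates, key=lambda c: c[0])
    return best


def backward_pass(best, output):
    """PO -> PI: take the stored match and continue at its inputs."""
    cover, stack = [], [output]
    while stack:
        node = stack.pop()
        _, match = best[node]
        if match is None:
            continue
        gate, inputs = match
        cover.append((node, gate))
        stack.extend(inputs)
    return cover


best = forward_pass(MATCHES, ORDER)
cover = backward_pass(best, "n2")
print(best["n2"][0], cover)
```

In this toy case the AND3 match at `n2` is cheaper than two AND2 gates, so the backward pass never visits `n1`.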
7 Current drawback and solution using Lehman-Watanabe Method
- Drawbacks: the procedures in each phase are disconnected, resulting in optimal sub-results but a possibly sub-optimal overall result.
- Phases 1 and 2 make critical decisions about algebraic and AND2/INV decompositions without knowing much about the constraints and library.
- Phase 3 knows about the constraints and library, but the solution space has already been limited by the decisions made earlier.
- Lehman-Watanabe Method
- Efficiently encode a set of AND2/INV decompositions into a single structure called a mapping graph.
- Apply a modified tree-based or a new partial technology mapper while dynamically performing algebraic logic decomposition on the mapping graph.
- DAG-mapping is naturally introduced.
8 Mapping graph: AND2/INV decompositions
- f = abc can be represented in various ways.
- We can combine them with a choice node.
9 Mapping graph: AND2/INV decompositions
- This can be represented compactly by this.
- Which also encodes the following new decomposition.
10 Mapping graph: AND2/INV decompositions
- The complete decomposition in ugates is
11 Mapping graph: AND2/INV decompositions
- The mapping graph is a modified Boolean network.
- Choice node: makes choices possible between different decompositions.
- Cyclic: functions written in terms of each other, e.g. an inverter chain of arbitrary length.
- Reduced: no two choice nodes with the same function; no two AND2s with the same fanins.
- Ugates: efficient implementation because of regularity.
- For the cht benchmark (MCNC91), there are 2.2 × 10^93 AND2/INV decompositions. All are encoded with only 400 ugates containing 599 AND2s in total.
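To get a feel for why the number of decompositions explodes, the sketch below counts only the distinct AND2 trees of a single n-input AND, using the standard count (2n-3)!! for unordered binary trees with n labeled leaves. It ignores inverters, sharing and the rest of the network, so it does not reproduce the cht figure; the function name is hypothetical.

```python
def and2_tree_count(n):
    """Number of distinct AND2 trees for an n-input AND: (2n-3)!!."""
    if n < 1:
        raise ValueError("n must be at least 1")
    count = 1
    for k in range(3, 2 * n - 2, 2):  # odd factors 3, 5, ..., 2n-3
        count *= k
    return count


for n in (2, 3, 4, 8, 10):
    print(n, and2_tree_count(n))
```

For n = 3 this gives the three decompositions of f = abc: (ab)c, (ac)b and (bc)a.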
12 Tree-mapping on a Mapping Graph
- Every time a choice node is reached, an input is selected and tree matching continues as usual.
- For every choice node all inputs have to be tried.
- Cycles may occur: iterate until the costs are stable.
- There are the inverter cycles and the cycles introduced by reduction and multiple encoding, fgh1 and gfh2.
- DAG-mapping is automatically introduced.
- Because the number of fanouts is unknown during mapping, splitting into trees is not possible, so matches pass through multi-fanout points, resulting in DAG-mapping.
- Select the cover as usual.
13 Example: Tree-mapping
- Best choice if c is later than a and b.
- Subject graph, library pattern graph
- i3 is faster than i1 and i2.
14 Graph-Mapping Theory
- Graph-mapping(G) = min { tree-mapping(d) : d ∈ G }
- G: mapping graph
- d: AND2/INV decomposition encoded in G
- Graph-mapping finds an optimal tree implementation for each primary output over all the AND2/INV decompositions encoded in G.
- Graph-mapping is as powerful as applying tree-matching exhaustively, but is typically exponentially faster.
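The "minimum over all encoded decompositions" idea can be illustrated on f = abc. The sketch below evaluates every decomposition explicitly, which is exactly the exhaustive approach that graph-mapping avoids; the unit AND2 delay, the arrival times and the function names are assumptions made for the example.

```python
AND2_DELAY = 1.0  # assumed unit delay for every AND2


def tree_arrival(tree, arrival):
    """Arrival time at the root of one AND2 tree (a leaf name or a pair)."""
    if isinstance(tree, str):
        return arrival[tree]
    left, right = tree
    return max(tree_arrival(left, arrival),
               tree_arrival(right, arrival)) + AND2_DELAY


def choice_arrival(decompositions, arrival):
    """A choice node keeps the best result over all of its inputs."""
    return min(tree_arrival(d, arrival) for d in decompositions)


# The three AND2 decompositions of f = abc.
DECOMPOSITIONS = [(("a", "b"), "c"), (("a", "c"), "b"), (("b", "c"), "a")]

arrival = {"a": 0.0, "b": 0.0, "c": 2.0}  # c is the late input
per_tree = [tree_arrival(d, arrival) for d in DECOMPOSITIONS]
best = choice_arrival(DECOMPOSITIONS, arrival)
print(per_tree, best)
```

With c arriving late, (ab)c is the best choice because c passes through only one AND2.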
15 Lambda and Delta Mapping
- Lambda mapping
- 1. Encode all the AND2 decompositions of the product terms and then all the sum terms for all the nodes.
- 2. Apply graph-mapping.
- Combines phases 2 and 3 (AND2/INV decomposition and mapping).
- Delta mapping
- 1. Encode all the AND2 decompositions of the product terms and then all the sum terms for all the nodes.
- 2. Iteratively apply graph-mapping and logic decomposition until nothing changes any more.
- Combines phases 1, 2 and 3 (algebraic optimization, AND2/INV decomposition and mapping).
16 Dynamic Logic Decomposition
- During mapping, find D-patterns and add the corresponding F-pattern dynamically.
- D-pattern: ab + ac; F-pattern: a(b + c)
- If a is critical, the F-pattern is usually better.
17 Dynamic Logic Decomposition
- D-pattern search and F-pattern adding in a graph.
- Note: adding an F-pattern may introduce a new D-pattern.
18 Example: choosing the right decomposition
- Tree-matching on graph and AND2/INV decompositions.
- AND8 node with arrival time a = 3 × delay(AND2).
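A small sketch of why the decomposition matters for such a node, assuming one of the eight inputs arrives at 3 × delay(AND2) and the others at time 0. It compares a fixed balanced AND2 tree with a greedy decomposition that always combines the two earliest signals; the greedy rule is a stand-in chosen for illustration, not the graph-mapping algorithm itself, and both function names are hypothetical.

```python
import heapq


def greedy_and_arrival(arrivals, and2_delay=1.0):
    """Repeatedly AND the two earliest signals (arrivals must be non-empty)."""
    heap = list(arrivals)
    heapq.heapify(heap)
    while len(heap) > 1:
        first = heapq.heappop(heap)
        second = heapq.heappop(heap)
        heapq.heappush(heap, max(first, second) + and2_delay)
    return heap[0]


def balanced_and_arrival(arrivals, and2_delay=1.0):
    """Fixed balanced tree that pairs neighbours in the given order."""
    level = list(arrivals)
    while len(level) > 1:
        nxt = [max(level[i], level[i + 1]) + and2_delay
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # odd one out moves up unchanged
        level = nxt
    return level[0]


arrivals = [3.0] + [0.0] * 7  # one late input, in units of delay(AND2)
print(balanced_and_arrival(arrivals), greedy_and_arrival(arrivals))
```

The balanced tree puts the late input under three AND2 levels, while the greedy decomposition lets it bypass all but the last gate.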
19 Possible problems for practical use
- The graph grows with the initial node size N, requiring more memory.
- The choice nodes grow with the initial node size N, slowing down tree-matching.
20
- These two problems can partially be solved by choosing a value for N.
- Large value: more memory and longer run time, but many more possibilities to find matches, resulting in a better cover.
- Small value: less memory and shorter run time, but not such a good cover.
- Another possibility is the use of partial matching.
- Depending on the library model and delay model, the same matches can be found much faster because of pruning, except for leaf DAGs.
- Worst case: AND10 decomposition and AND4 cell
- tree-matching: 61 sec
- partial-matching: 0.74 sec
- Disadvantage: as soon as we do some more complex modeling it becomes an approximation and we might prune away potentially better matches.
- We save all the different partial matches at each node, causing an increase in memory use.
21 Partial matching
- The library
- A library cell is composed of the root cell of its AND2/INV decomposition and the partials that represent its inputs.
22 Example: Partial matching
- At each node, try to find all the partial matches by combining all the partials at the inputs of the root cell. Then evaluate the partials that are also complete matches and save the best as -.
23 Condition for equality
- Partial match and tree match will give equal results if we assume that
- Delays of all inputs to the output of the library cell are equal under all conditions.
- partial1: i1 = 1, i2 = 4 (better ? worse) partial2: i1 = 2, i2 = 3
- Different input-output delays: and2: a = 2, b = 2 -> 2; and4: a = 1, b = 4, c = 4, d = 4 -> 1
- Change due to load dependency: and2: a = 2, b = 2 -> 2; loaded and2: a = 3, b = 6 -> 1
- The area of the match is the sum of the areas of the inputs and the cell itself.
- Leaf DAGs are not possible because there is no known relation between the partials of two inputs.
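The reason the equal-delay assumption matters can be shown with the two partials (1, 4) and (2, 3). In the sketch below the pin delays are hypothetical values chosen only to make the ordering flip; they are not the library numbers from the slide.

```python
def match_arrival(partial, pin_delays):
    """Output arrival = latest (input arrival + delay of that pin)."""
    return max(t + d for t, d in zip(partial, pin_delays))


partial1 = (1, 4)  # arrival times at inputs i1, i2
partial2 = (2, 3)

equal_pins = (1, 1)    # every pin has the same delay
unequal_pins = (3, 0)  # hypothetical cell with a slow first pin

print(match_arrival(partial1, equal_pins), match_arrival(partial2, equal_pins))
print(match_arrival(partial1, unequal_pins), match_arrival(partial2, unequal_pins))
```

With equal pin delays partial2 always wins, so partial1 can be pruned early. With unequal pin delays partial1 becomes the better one, and pruning it at the partial stage would lose the best match.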
24 Advantages of Graph-mapping
- The optimal decomposition is chosen with respect to the constraints.
- Dynamic decomposition can do a better job than technology-independent optimization.
- Encoding more initial circuits is possible.
- Sharing is maximized, resulting in lower area.
- Has potential for interesting repowering decisions during matching.
- Tradeoffs are possible between runtime, quality and memory.
25 Area-Delay estimation
- Massoud's proposal for area-delay tradeoff
- Maintain an area-delay curve at each node composed of the non-inferior results of matching.
- Solution (t1, a1) is non-inferior if there is no solution (t2, a2) such that t2 ≤ t1 and a2 < a1, OR t2 < t1 and a2 ≤ a1.
- Use a delta-t to cut down the number of points in the curve, because combining curves could give an exponential increase at each stage.
- During selection of matches to cover the graph, select the match that meets the required timing and recurse as usual.
- To improve the results, loads can be used as soon as they are known.
- Outside the critical path we can now select smaller covers.
- Optimal only for trees under a no-load condition.
- DAG approximation is possible using an area guess of area × 1/n for an n-fanout point to encourage sharing.
- Load can be approximated by n × default value, and could be corrected for the real load during matching and covering.
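The non-inferiority test can be sketched as a simple filter over (delay, area) points. The function name and the sample points are hypothetical; the dominance rule is the one stated above.

```python
def non_inferior(points):
    """Keep the (delay, area) points that no other point dominates."""
    kept = []
    best_area = float("inf")
    # After sorting by delay (then area), a point survives only if its
    # area is strictly smaller than that of every faster-or-equal point.
    for t, a in sorted(set(points)):
        if a < best_area:
            kept.append((t, a))
            best_area = a
    return kept


points = [(5, 3), (4, 4), (6, 3), (4, 6), (7, 1)]
curve = non_inferior(points)
print(curve)
```

Here (4, 6) is dropped because (4, 4) is equally fast and smaller, and (6, 3) is dropped because (5, 3) is faster with the same area.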
26 Example using area-delay curve
- Combine the points of the two inputs to create a new curve with non-inferior points.
- At cover selection we note a non-critical input (A) and select a match with lower area.
- Result: area = 7.5 -> normally would be 9.
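A minimal sketch of both steps, assuming a two-input gate with a single delay and area, and tree-style area summation without load effects. The curves, the gate numbers and the function names are made up for illustration and do not reproduce the 7.5 versus 9 figures of the slide.

```python
def prune(points):
    """Keep only non-inferior (delay, area) points."""
    kept, best_area = [], float("inf")
    for t, a in sorted(set(points)):
        if a < best_area:
            kept.append((t, a))
            best_area = a
    return kept


def combine(curve_a, curve_b, gate_delay, gate_area):
    """Curve at a gate output from the curves at its two inputs."""
    candidates = [(max(ta, tb) + gate_delay, aa + ab + gate_area)
                  for ta, aa in curve_a for tb, ab in curve_b]
    return prune(candidates)


def select(curve, required_time):
    """Cover selection: smallest point that still meets the timing."""
    feasible = [p for p in curve if p[0] <= required_time]
    return min(feasible, key=lambda p: p[1]) if feasible else None


curve_a = [(1, 4), (3, 2)]  # input A: fast but large, or slow but small
curve_b = [(2, 5), (4, 3)]
out = combine(curve_a, curve_b, gate_delay=1, gate_area=1)
print(out, select(out, 4), select(out, 10))
```

A relaxed required time lets the selection step fall back to the small, slow points, which is how area is recovered off the critical path.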
27 Implementation using area and fanout load guesses
- We use as area guess area × 1/n to encourage sharing, but n is special.
- n is the number of fanouts of the node in the original network. The number of real fanouts in the graph can't be used because it is unknown how many of them will actually be used.
- Nodes inside a decomposition will most likely only be used to get the best decomposition, thus n = 1.
- When a part of a match crosses a multi-fanout point, the areas of these inputs of the match are also multiplied by 1/n, where n is the value at the multi-fanout point.
- The assumption is that after reconvergence the divided areas add up to the original value again.
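The arithmetic behind the area guess is small enough to write out. This is only a sketch under the stated assumption that each of the n fanouts is charged an equal share; the function names and the numbers are hypothetical.

```python
def shared_area_guess(area, fanout):
    """Area charged to one fanout branch of a node with n fanouts."""
    if fanout < 1:
        raise ValueError("fanout must be at least 1")
    return area / fanout


def match_area_guess(cell_area, inputs):
    """Cell area plus the guessed share of each input, given as (area, n)."""
    return cell_area + sum(shared_area_guess(a, n) for a, n in inputs)


# A shared cone of area 6 with 3 fanouts and a private cone of area 4.
guess = match_area_guess(2.0, [(6.0, 3), (4.0, 1)])
total_shares = sum(shared_area_guess(6.0, 3) for _ in range(3))
print(guess, total_shares)
```

If all three fanouts end up using the shared cone, their shares add up to the original area of 6, which is the assumption the guess relies on.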
28 Implementation using area and fanout load guesses
- To account for load, so that we can keep the non-inferior points of the curve and put them in the right place, we use a load guess n × lavr.
- n is the same as in the area guess.
- lavr should be about equal for all library gates.
- PO loads are used directly to get a result that is as exact as possible. They can be considerably higher.
- If a match crosses a multi-fanout point, the delays of the inputs that cross are increased. This accounts for the fact that the load at nodes inside the decomposition is then possibly no longer 1 × lavr, because they could be used more than once.
29 Reduction of curve/runtime
- Area-spacing-based reduction to a certain limit, say 10, such that for an n-input cell at most 10^n matches have to be evaluated.
- If the area difference is < 2% of the area -> delete the slowest.
- Increase the percentage until the number of points is within the limit.
- More points give exponentially growing runtime.
- Tradeoff possible between runtime and quality of the result.
- If we do this based on time, we get a curve with a few fast points and a lot of slow points with almost equal area.
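One possible reading of the area-spacing reduction is sketched below. The starting percentage of 2, the step of 1 and the comparison against the last kept point are assumptions, as are the function name and the sample curve; the input is expected to be a non-inferior curve with positive areas.

```python
def reduce_curve(curve, limit=10, start_pct=2.0, step_pct=1.0):
    """Thin a non-inferior (delay, area) curve to at most `limit` points."""
    points = sorted(curve)  # increasing delay, hence decreasing area
    pct = start_pct
    while len(points) > limit and pct <= 100.0:
        kept = [points[0]]
        for t, a in points[1:]:
            prev_area = kept[-1][1]
            if prev_area - a < pct / 100.0 * prev_area:
                continue  # nearly the same area as a faster point: drop it
            kept.append((t, a))
        points = kept
        pct += step_pct  # loosen the spacing and try again if needed
    return points[:limit]


curve = [(1, 100), (2, 99), (3, 98.5), (4, 90), (5, 89.5), (6, 80)]
reduced = reduce_curve(curve, limit=3)
print(reduced)
```

Spacing by area keeps the fast points and removes slow points that buy almost no area, which avoids the crowding of slow, near-equal-area points that time-based spacing would leave.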
30 Repowering opportunities
- Several different repowering possibilities will be in the curve and the best one under the current conditions will be chosen.
- Capacitance-splitting, serial repowering
31 Repowering opportunities
32 Heuristics for cover selection
- Select a match that meets timing and calculate the required times for its fanins.
- Do parallel repowering in case of a critical path.
33 Results
- MCNC test cases
- Use modified lib2.genlib (input-output delays are equal)
- Using partial matching (equal to tree matching because of the library)
- Nmax = 10
- Using load and area guesses
34 Results
- Compare the results of DAG mapping with DAG mapping with area recovery and graph mapping with area recovery.
39
- Difference between partial and tree matching when crossing multi-fanout points.
- Partial matching -> hardly any crossover
- Tree matching -> too much crossover if non-critical
- This results from the model used at crossover.
- Partial matching: crossover guesses are made only at complete matches -> matches that cross are pruned out early. This can be improved by using crossing information for partials as well.
- Tree matching: crossover matches compete with matches at the fanout point -> they are somewhat faster but have about equal area guess. Crossover matches should only be used for critical paths; if we therefore make their area somewhat larger, they will not compete with the other matches on non-critical paths.
- Important: keep comparisons fair, not only within a cone but also between cones.
40 Encountered problems/Potential solutions
- Runtime is a problem using tree matching, and quality using partial matching.
- C1355: tree -> > 12 hours; partial -> 10 min.
- Tree matching is a very simple implementation.
- Explore faster techniques using properties of the graph.
- It is needed now for leaf DAGs (but MUX4 is already a disaster).
- Partial matching
- Approximate library cells by a single input-to-output delay.
- Use tree matching if one input is really much slower.
- Use partial ordering for area-delay and the (1,4) <-> (2,3) problem.
41
- BDDs are used for reduction of the graph, causing a problem in case of blow-up.
- This often occurs when there are a lot of XOR/selector-type gates.
- Not always using BDDs, or doing something about ordering, could be a solution.
- It is possible not to use BDDs, but then there will be less sharing and multiple encoding of networks is not possible.
42
- Cycles or loops are difficult in matching and covering.
- Large loops have to be matched by iterating until nothing changes, which causes problems for the area and load guesses.
- The inverter cycle always exists and poses a typical problem for the load guesses, because inverters can follow each other after two matches are connected.
- Extended data structures are needed to store more information.
- During iteration we have to keep track of which information has come from which assumption about fanouts.
- For the inverter cycle we need an extended data structure that takes into account where a match comes from and where it connects to.
43
- Inverter problems for guesses.
- Multi-fanout point f: a NAND match crossing f -> increase delay.
- But if we add an inverter we end at the multi-fanout point f and add delay again.
- Do not count it in case of a NAND. But if connected to a NOR, the delay should increase to favor sharing.
44 Conclusion
- Graph mapping has the potential of finding smaller and/or faster results.
- Offers different design points to choose from.
- Run time, quality and memory use are adjustable and can be traded off against each other.
- Could be extended to also make decisions about other aspects, such as power.