Fatih Kocan and Jason Meyer - PowerPoint PPT Presentation


1
Novel SRAM-based FPGA Architectures and
Supporting CAD Tools
  • Fatih Kocan and Jason Meyer
  • Computer Science
  • Southern Methodist University
  • October 10, 2007

2
Publications and Patent
  • 1. J. Meyer and F. Kocan, "Sharing of SRAM Tables
    among NPN-Equivalent LUTs in SRAM-based FPGAs,"
    IEEE Transactions on VLSI Systems, vol. 15,
    no. 2, pp. 182-195, Feb. 2007.
  • 2. J. Meyer and F. Kocan, "Improving Critical
    Path Delay and Sharing in Shared SRAM Table based
    FPGAs," in preparation.
  • 3. J. Meyer and F. Kocan, "Reducing Critical Path
    Delay in FPGAs with SRAM Tables Shared by
    NPN-Equivalent Functions," International
    Conference on Engineering of Reconfigurable
    Systems and Algorithms, June 25-28, 2007.
  • 4. J. Meyer and F. Kocan, "Sharing FPGA SRAM
    Tables among NPN Equivalent LUTs," IEEE Int'l.
    Midwest Symposium on Circuits and Systems,
    Cincinnati, Ohio, August 7-10, 2005.
  • 5. F. Kocan and J. Meyer, "Logic Modules with
    Shared SRAM Tables for Field-Programmable Gate
    Arrays," in Field Programmable Logic and
    Application, 14th International Conference,
    Leuven, Belgium, August 30-September 1, 2004,
    Proceedings, vol. 3203 of Lecture Notes in
    Computer Science, pp. 289-300, Springer, 2004.
  • Fatih Kocan, "Sharing a static random-access
    memory (SRAM) table between two or more lookup
    tables (LUTs) that are equivalent to each other,"
    US Patent 20070046324

3
Outline
  • FPGA Architectures
  • CLB Architectures
  • CAD for FPGAs
  • Synthesis, Placement, Routing
  • NPN Equivalence Classes of Functions
  • Motivation
  • Analysis of Benchmarks
  • Proposed LUT
  • CLB with Proposed LUTs
  • Experimental Results
  • Conclusions
  • Future Work

4
Typical Island Style SRAM-Based FPGA
  • Reconfigurable Computing
  • Bridges general purpose and application specific
    computing
  • Faster than general purpose
  • More flexible than application specific
  • FPGAs are building blocks of reconfigurable
    computing

5
Fundamental Components of FPGAs
[Figure: a 2-input LUT (inputs x, y; output z), a Basic Logic Element (BLE) with a K-input LUT and a D flip-flop between Inputs and Out, and a Configurable Logic Block (CLB)]
6
Past Work on FPGA Architectures
  • Triptych architecture
  • Logic cells allocated for either logic or routing
  • Hybrid FPGAs
  • Combine LUT-based FPGAs and PLA-based CPLDs
  • Some parts suited for LUTs, others suited for
    products
  • Vantis FPGA
  • Variable granularity
  • Configurable building block (CBB)
  • Variable-grain block (VGB) consisting of 4 CBBs
  • Super variable-grain consisting of 4 VGBs
  • Function folding method attempted to reduce
    memory sizes based on fractions of functions.

7
Past Work on CLB Architectures (1)
  • Internal connections
  • Initially assumed to be fully connected
  • Sparsely populated connections were proposed
  • Single Event Upset faults
  • New architecture proposed to detect and correct
    these faults
  • Based on maps and Remaps

8
Past Work on CLB Architectures (2)
  • Altera Stratix II ALM Architecture
  • ALM: 8 inputs divided into 2 functions (with
    different inputs)
  • LAB (CLB): 8 ALMs

9
Lessons Learned from CLB Research
  • For a cluster of size N, 2N+2 inputs are
    sufficient
  • For a cluster of size N with k-input LUTs,
    (k/2)(N+1) inputs are sufficient
  • 4-input functions (LUTs) are sufficient
  • Good results are obtained with various cluster
    sizes between 4 and 8
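The two input-count rules on this slide can be checked against each other with a few lines of Python. This is only a sketch: it reads the rules as 2N+2 and (k/2)(N+1), and the function name `cluster_inputs` is made up for illustration.

```python
def cluster_inputs(k: int, n: int) -> float:
    """Cluster inputs suggested by the (k/2)(N+1) rule for N k-input LUTs."""
    return (k / 2) * (n + 1)


# For 4-input LUTs the general rule reduces to the 2N+2 rule.
for n in (4, 8, 16):
    print(f"N={n}: (k/2)(N+1)={cluster_inputs(4, n):.0f}, 2N+2={2 * n + 2}")
```

Under this reading, a cluster of sixteen 4-input LUTs needs 34 inputs.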

10
FPGA CAD Flow
[Flow diagram: Design → Synthesis → Pack CLBs → Placement → Routing]
  • Design: circuit is created using Verilog, VHDL,
    etc.
  • Synthesis: circuit is technology mapped to FPGA
    (SIS/RASP)
  • Packing: LUTs are packed into CLBs (e.g., T-VPACK)
  • Placement: CLBs are physically placed in FPGA
    (VPR)
  • Routing: connections are wired to channels (VPR)

11
VPR GUI from the University of Toronto
12
Synthesis Fundamentals
  • Goal: Given a multilevel network of logic gates,
    transform it into a network of LUTs, each of no
    more than K inputs
  • Objectives
  • Minimize number of LUTs
  • Minimize delay, area, power
  • Parts of Synthesis
  • Logic Optimization
  • Transform gate-level network into smaller
    gate-level network (fewer gates)
  • Technology Mapping
  • Cover the gate-level network with K-LUTs

13
Logic Optimization Methods
  • Node Decomposition: re-express a single node as a
    logically equivalent composition of 2 or more
    nodes
  • Structural Decomposition
  • Symbolic Decomposition
  • Boolean Decomposition
  • Network Simplification

14
Technology Mapping
  • Example: Technology Mapping with K = 3

15
NPN-equivalence classes of Boolean Functions (1)
  • Input Negation (NI) equivalence
  • Negate some inputs to g so that g = f
  • For example, let f(a,b) = ab and g(a,b) = a'b
    (a' denotes the complement of a)
  • g is made equivalent to f by inverting a
  • Extra inverters and conditional negation required
  • Permutation (P) equivalence
  • Reorder some of the inputs to g so that g = f
  • Let f(a,b) = a'b and g(a,b) = ab'
  • g is made equivalent to f by reordering
  • No extra logic required

[Figure: LUT diagrams of f and g for the input negation and permutation examples]
16
NPN-equivalence classes of Boolean Functions (2)
  • Output Negation (NO) equivalence
  • g and f are NO equivalent if g = f or g = f'
  • Let f(a,b) = ab and g(a,b) = a' or b'
  • g is made equivalent to f by inverting the output
  • Extra inverters and conditional negation required
  • NPN equivalence
  • g and f are NPN equivalent if some combination of
    NI, P, and NO equivalence yields g = f
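As a rough illustration of the definition, the sketch below tests NPN equivalence by brute force: it tries every input permutation, every input negation pattern, and both output polarities. Truth tables are stored as tuples indexed by the input assignment (bit j of the index is input j). The helper names `table`, `transform` and `npn_equivalent` are made up for this example and are not taken from the authors' tools.

```python
from itertools import permutations, product


def table(func, n):
    """Truth table of an n-input function as a tuple of 0/1 values."""
    return tuple(func(*[(i >> j) & 1 for j in range(n)]) & 1
                 for i in range(1 << n))


def transform(tt, n, perm, neg_in, neg_out):
    """Apply input permutation, input negation and output negation to tt."""
    out = []
    for idx in range(1 << n):
        src = 0
        for j in range(n):
            # Input j of the original function reads input perm[j],
            # inverted when neg_in[j] is 1.
            src |= (((idx >> perm[j]) & 1) ^ neg_in[j]) << j
        out.append(tt[src] ^ neg_out)
    return tuple(out)


def npn_equivalent(f, g, n):
    """True if some combination of NI, P and NO turns f into g."""
    for perm in permutations(range(n)):
        for neg_in in product((0, 1), repeat=n):
            for neg_out in (0, 1):
                if transform(f, n, perm, neg_in, neg_out) == g:
                    return True
    return False


AND = table(lambda a, b: a & b, 2)
OR = table(lambda a, b: a | b, 2)
XOR = table(lambda a, b: a ^ b, 2)

print(npn_equivalent(AND, OR, 2))   # AND and OR differ only by negations
print(npn_equivalent(AND, XOR, 2))  # no combination maps AND to XOR
```

The search visits n! * 2^(n+1) transformations, which is only 768 for 4-input LUTs.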

[Figure: LUT diagrams of f and g for the output negation and NPN equivalence examples]
17
Specialization (Bridging)
  • Bridging
  • Function f1 is said to be bridged over f2 iff
    there exists an xi such that f1(x1, . . . , xn, xi)
    = f2(y1, . . . , yn).
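A small sketch of bridging on truth tables, assuming the definition means tying the last input of an (n+1)-input function to one of its other inputs. The function name `bridge` and the majority-function example are illustrative choices, not from the slides.

```python
def bridge(tt, n_plus_1, i):
    """Tie the last input of an (n+1)-input function to input i.

    tt is a tuple indexed by the input assignment (bit j = input j).
    Returns the truth table of the resulting n-input function.
    """
    n = n_plus_1 - 1
    out = []
    for idx in range(1 << n):
        xi = (idx >> i) & 1              # value currently on input i
        out.append(tt[idx | (xi << n)])  # feed the same value to the last input
    return tuple(out)


# 3-input majority: 1 when at least two of (a, b, c) are 1.
MAJ = tuple(1 if bin(i).count("1") >= 2 else 0 for i in range(8))

print(bridge(MAJ, 3, 0))  # MAJ(a, b, a) collapses to a
print(bridge(MAJ, 3, 1))  # MAJ(a, b, b) collapses to b
```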

18
Specialization (Constant Assignment)
  • Constant Assignment
  • f1 is said to be C+ over f2 iff the positive
    cofactor of f1 with respect to xn+1 is equivalent
    to f2
  • f1(x1, . . . , xn, 1) = f2(y1, . . . , yn).
  • f1 is said to be C- over f2 iff the negative
    cofactor of f1 with respect to xn+1 is equivalent
    to f2
  • f1(x1, . . . , xn, 0) = f2(y1, . . . , yn).
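Constant assignment amounts to taking a cofactor with respect to the last input. The sketch below does this on a truth table stored as a tuple (bit j of the index is input j); the name `cofactor` and the multiplexer example are assumptions made for illustration.

```python
def cofactor(tt, n_plus_1, value):
    """Fix the last input of an (n+1)-input function to 0 or 1."""
    n = n_plus_1 - 1
    base = value << n  # selects the half of the table where the last input equals value
    return tuple(tt[base | idx] for idx in range(1 << n))


# 2:1 multiplexer mux(a, b, s): returns b when s = 1, otherwise a.
MUX = tuple(((i >> 1) & 1) if (i >> 2) & 1 else (i & 1) for i in range(8))

print(cofactor(MUX, 3, 1))  # s = 1 leaves the function b
print(cofactor(MUX, 3, 0))  # s = 0 leaves the function a
```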

19
Universal Logic Module (ULM) Design
  • Universal logic blocks
  • Prior to SRAM-based logic modules
  • Blocks that supported a majority of functions
  • Can implement functions that are
  • Negated at primary inputs
  • Permuted at the inputs
  • Negated at primary outputs
  • NPN Equivalence studied in this context
  • SRAM Tables already Universal
  • Our goal: Why study NPN equivalence?
  • Answer: Sharing SRAM Tables among NPN-equivalent
    LUTs

20
Motivation for Sharing SRAM Tables
  • Reducing SRAM Cells implies
  • Reduced Area
  • Reduced Power
  • Reduced Number of Configuration Bits
  • Reduced Configuration Time → Reduced Test Time
  • Even if area is not lowered, extra resources can
    be used to
  • Radiation harden the circuit,
  • Increase routing resources
  • Buffer I/O
  • Potential Adverse Effects of Sharing
  • Increased routing resources
  • Increased critical path delays

21
Practicality of NPN-Equivalence Based Sharing
  • We analyzed MCNC, ITC99, and ISCAS85 Benchmarks
    with
  • Academic Tools (SIS, RASP)
  • Industrial Tools (Mentor Precision Synthesis RTL)
  • We expect to find an abundance of NPN-equivalent
    functions

22
Analysis of MCNC Benchmarks
Benchmark descriptions
23
NPN-equivalence classes used, Combinational MCNC
Benchmarks
24
NPN-equivalence classes used, Sequential
Benchmarks
25
ITC 99 and ISCAS 85 Benchmarks
26
Mentor Precision RTL Synthesis
27
Analysis Results
  • Expectations met!!!
  • Synthesis tools are biased towards some classes
    of functions
  • Assuming tools give near-optimal solutions, maybe
    it is not necessary to utilize all equivalence
    classes of functions to implement a circuit?
  • Another research problem for a PhD student

28
Possible Sharing Architectures
  • P
  • NP
  • PN
  • NPN1
  • NPN2
29
Architectural Changes to Support NPN Sharing
  • Changes to LUT structures

[Figures: Conditional Negation logic (CN); MUX with CN]
30
Power and Delay Measurements
  • Plug-in added to VPR to calculate power
  • Added power for conditional negation
  • Subtracted power for fewer SRAM tables
  • Added extra delay for conditional negation
  • ORCAD 9.2 and PSpice used to determine power and
    delay with the NPN architectures

31
Shared CLB
32
Updated CAD Flow
[Flow diagram: Design → Synthesis → Equiv. Analysis (produces Equiv. Table) → Pack CLBs → Placement → Routing]
  • Add equivalency analysis stage before packing
    CLBs.
  • Take NPN equivalence into account when packing
    CLBs.
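One way an equivalence analysis stage could build its table is to reduce every LUT truth table to a canonical NPN representative and group LUTs that share it. The sketch below takes the smallest table in the NPN orbit as the representative. This is an assumed approach for illustration only: the slides do not describe how the authors' tool works, and the names `npn_canonical`, `equivalence_table` and the example LUT names are hypothetical.

```python
from collections import defaultdict
from itertools import permutations, product


def npn_canonical(tt, n):
    """Smallest truth table reachable from tt by NI, P and NO."""
    best = None
    for perm in permutations(range(n)):
        for neg_in in product((0, 1), repeat=n):
            cand = tuple(
                tt[sum((((idx >> perm[j]) & 1) ^ neg_in[j]) << j
                       for j in range(n))]
                for idx in range(1 << n)
            )
            inverted = tuple(b ^ 1 for b in cand)  # output negation
            for c in (cand, inverted):
                if best is None or c < best:
                    best = c
    return best


def equivalence_table(luts, n):
    """Group LUT names by NPN class; each group could share one SRAM table."""
    classes = defaultdict(list)
    for name, tt in luts.items():
        classes[npn_canonical(tt, n)].append(name)
    return list(classes.values())


# Hypothetical 2-input LUTs taken from a synthesized netlist.
luts = {
    "and": (0, 0, 0, 1),
    "nor": (1, 0, 0, 0),
    "xor": (0, 1, 1, 0),
    "or": (0, 1, 1, 1),
    "xnor": (1, 0, 0, 1),
}
print(equivalence_table(luts, 2))
```

A packer could then prefer placing LUTs from the same group into the LUT pairs of a CLB that share an SRAM table.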

33
Searching for Optimal CLB Architectures
  • Investigated (near) optimal CLB architectures
  • Homogeneous larger, mixed CLBs
  • 8-16 LUTs per CLB
  • About 25% of SRAM tables are shared by 2 LUTs
  • Good architectures tested

34
Delay Results
35
Routing Results
36
Power Results
37
Area Results
38
The Need for Post-Routing Delay Improvement
  • Unbalanced delays
  • Place and route as if balanced delays
  • After routing, do iterative improvements to
    critical path delays

39
Local Configuration Changes
  • Two LUTs can be swapped without requiring global
    reconfiguration if
  • Both fanouts on the same side of the CLB
  • Neither fanout outside of the CLB
  • Fanout to different sides of CLB but converge
    before fanning out further

40
Critical Path Improvement Algorithms
  • Greedy algorithm does not work
  • Swap1 would not be made in a greedy algorithm
  • However, if both Swap1 and Swap2 made, critical
    path is reduced
  • Developed three post-routing algorithms to
    improve the critical path delays
  • Genetic Algorithm
  • Simulated Annealing
  • Branch and Bound

41
Genetic Algorithm
  • Objective function: critical path delay
  • Linear time algorithm, one pass through the
    circuit
  • Population
  • Gene
  • 1 bit for every shared SRAM table in FPGA
  • 1: first LUT fast, second LUT slow
  • 0: first LUT slow, second LUT fast
  • Start with 300 individuals, created randomly
  • Operations
  • Run set number of generations
  • Crossover
  • Mutation

42
Simulated Annealing Algorithm
  • Objective function: critical path delay
  • Linear time algorithm, one pass through the
    circuit
  • Gene
  • 1 bit for every shared SRAM table in FPGA
  • 1: first LUT fast, second LUT slow
  • 0: first LUT slow, second LUT fast
  • Start with 1 individual, created randomly
  • Operation
  • Mutation
  • Pick a single bit at random and flip

43
Branch and Bound Algorithm
  • Enumerate and check all possible swaps
  • Representation
  • 1 bit for every LUT in FPGA
  • 1: fast side of SRAM table
  • 0: slow side of SRAM table
  • Configuration is consistent
  • If one LUT for an SRAM is fast, other must be
    slow
  • LUTs mapped to unshared SRAM are fast
  • Each iteration, swap 2 bits for LUTs on SRAM
    table
  • If critical path lowers, keep the swap
  • If critical path route changes, keep the swap
  • Exponential, but
  • Only need to swap LUTs on critical path
  • Must save off configurations to prevent trying
    them again

44
Critical Path Improvement Algorithm Results
45
Comparing Sharing and Non-sharing when Branch and
Bound Algorithm Utilized
46
Conclusions
  • Optimal CLB architecture for studied benchmarks
  • 16 LUTs/CLB, 34 inputs, 7 shared SRAM tables
  • Potential for large savings in SRAM cells
  • Pessimistic results
  • Reduced number of SRAM tables by 44%!
  • Reduced area by 4.4%, power by 2%!
  • Reduced configuration bits, and with them
    configuration time and test time
  • No degradation in routing, wirelength, or
    critical path delay
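The table-count figure can be reproduced with simple arithmetic, assuming each shared SRAM table serves exactly two LUTs and that the LUTs have 4 inputs (16 SRAM cells per table). The function names below are made up for this sketch.

```python
def sram_table_reduction(luts_per_clb, shared_tables):
    """Tables needed per CLB and the fraction saved by sharing.

    Assumes every shared table serves exactly two LUTs, so each
    shared table removes one table from the CLB.
    """
    tables = luts_per_clb - shared_tables
    return tables, shared_tables / luts_per_clb


def sram_cells(tables, k=4):
    """A k-input LUT table holds 2**k SRAM cells."""
    return tables * 2 ** k


tables, saved = sram_table_reduction(16, 7)
print(tables, f"{saved:.1%}")
print(sram_cells(16), "->", sram_cells(tables))
```

With 16 LUTs and 7 shared tables per CLB, 9 tables remain, a reduction of 43.75%, which rounds to the 44% quoted above.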

47
Future Work
  • Develop synthesis and resynthesis algorithms to
    increase NPN equivalence in a LUT-level circuit
  • Possibly increase amount of logic, but offset by
    greater SRAM table sharing
  • 3 synthesis approaches
  • Restrict synthesis to functions that belong to a
    specific set of permissible equivalence classes
    (e.g., NAND gates).
  • Restrict synthesis to specific equivalence
    classes locally, but no restrictions globally
  • Restrict LUT sizes to 3 inputs instead of 4
    inputs.
  • Only 14 3-input NPN equivalence classes, but 222
    4-input classes
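The class counts quoted above can be reproduced by brute force. The sketch below encodes each n-input function as an integer truth table, walks through all functions, and marks the whole NPN orbit of every function it has not seen yet. The function name `count_npn_classes` is made up for illustration.

```python
from itertools import permutations, product


def count_npn_classes(n):
    """Count NPN equivalence classes of n-input Boolean functions."""
    size = 1 << n
    mask = (1 << size) - 1
    # Each transform is an index map: entry idx says which row of f to read.
    transforms = []
    for perm in permutations(range(n)):
        for neg in product((0, 1), repeat=n):
            transforms.append([
                sum((((idx >> perm[j]) & 1) ^ neg[j]) << j for j in range(n))
                for idx in range(size)
            ])
    seen = set()
    classes = 0
    for f in range(1 << size):      # f encodes a truth table as an integer
        if f in seen:
            continue
        classes += 1                # f starts a new class
        for m in transforms:
            g = 0
            for idx in range(size):
                g |= ((f >> m[idx]) & 1) << idx
            seen.add(g)             # input negation and permutation
            seen.add(g ^ mask)      # plus output negation
    return classes


for n in (2, 3, 4):
    print(n, count_npn_classes(n))
```

Only one representative per class expands its orbit, so the 4-input case stays cheap even though there are 65,536 functions.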

48
Example Re-synthesis
  • Fig. (a): No NPN equivalency
  • Fig. (b): All equivalent

49
LP-based Synthesis
  • Alan Walker is doing a PhD thesis on this topic.

50
Complementary Nano-Electro Mechanical Switch
(CNEMS) and CNEM LUTs
51
Two CNEMS-based LUTs
52
(No Transcript)