Post Placement C-Slow Retiming for Xilinx Virtex FPGAs

Provided by: Nicholas222
1
Post Placement C-Slow Retiming for Xilinx Virtex FPGAs
  • UC Berkeley Reconfigurable Architectures,
    Systems, and Software (BRASS) Group
  • ACM Symposium on Field Programmable Gate Arrays
    (FPGA)
  • February 2x, 2003
  • http://www.cs.berkeley.edu/nweaver/cslow.html

2
Outline
  • Automatically Double Your Throughput
  • You paid for those registers, here's how to
    use them
  • Retiming and C-slow Retiming
  • The transformation
  • C-slow Retiming and the Virtex FPGA
  • The target
  • Retiming 3 Benchmarks
  • The tests

3
Retiming and Repipelining
  • Retiming
  • Automatically moving registers to minimize the
    clock period
  • Benefits limited by the number of registers
  • Algorithm developed by Leiserson et al
  • Repipelining
  • Adding registers to the front or back
  • Let retiming then move them around
  • But What About Feedback Loops?
  • Retiming and repipelining are of limited benefit
    when you have feedback loops
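
The point about feedback loops can be illustrated with a small sketch, assuming the usual retiming-graph model: nodes carry a logic delay, edges carry a register count, and a retiming r changes an edge u→v from w to w + r(v) − r(u). The function names, node names, and delay values here are hypothetical and chosen only for illustration.

```python
from functools import lru_cache

def clock_period(delay, edges):
    """Longest register-free path. edges maps (u, v) to a register count."""
    preds = {}
    for (u, v), regs in edges.items():
        if regs == 0:
            preds.setdefault(v, []).append(u)

    @lru_cache(maxsize=None)
    def arrival(v):
        # Assumes no register-free cycle (no combinational loop).
        return delay[v] + max((arrival(u) for u in preds.get(v, [])), default=0)

    return max(arrival(v) for v in delay)

def retime(edges, r):
    """Move registers: the new count on u->v is w + r[v] - r[u]."""
    moved = {(u, v): w + r[v] - r[u] for (u, v), w in edges.items()}
    if any(w < 0 for w in moved.values()):
        raise ValueError("illegal retiming: negative register count")
    return moved

# Feed-forward chain a -> b -> c with both registers bunched at the end.
chain_delay = {"a": 3, "b": 3, "c": 3}
chain = {("a", "b"): 0, ("b", "c"): 2}
chain_before = clock_period(chain_delay, chain)
chain_after = clock_period(chain_delay, retime(chain, {"a": 0, "b": 1, "c": 0}))

# Feedback loop x -> y -> x with a single register: retiming can only
# move that register around the loop, never add a second one.
loop_delay = {"x": 3, "y": 3}
loop = {("x", "y"): 0, ("y", "x"): 1}
loop_before = clock_period(loop_delay, loop)
loop_after = clock_period(loop_delay, retime(loop, {"x": 0, "y": 1}))
```

In this toy case the chain's period drops from 6 to 3 once the registers are spread out, while the loop stays at 6 regardless of where its one register sits.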

4
C-Slow Retiming
  • Replace every register with a sequence of C
    registers.
  • With more registers, retiming can break the
    design into finer pieces
  • Again proposed by Leiserson et al., to meet
    systolic slowdown
  • Semantics-altering transformation
  • But resulting semantics are predictable and
    useful
  • Ideal: C-slow in synthesis, retime after
    placement
  • Our prototype: C-slow and retime after placement
5
Design Semantics After C-Slowing
  • Design operates on C independent data streams
  • Data streams are externally interleaved on a
    round-robin basis
  • Semantics apply to designs with Task Level
    Parallelism
  • Encryption
  • Counter (CTR) mode works on independent blocks
  • Sequence matching
  • Compare sequence vs database
  • C-slowing improves throughput but adds latency
    and registers
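
These semantics can be seen in a small simulation. The sketch below assumes a hypothetical running-sum circuit whose single state register has been replaced by a chain of C registers; it is not taken from the benchmarks, only a toy that shows C interleaved streams being processed independently.

```python
from collections import deque
from itertools import accumulate

def c_slowed_accumulator(samples, c):
    """Running-sum circuit whose one state register became c registers."""
    chain = deque([0] * c)           # c registers in the feedback loop, reset to 0
    outputs = []
    for x in samples:                # one input sample per clock cycle
        total = chain.pop() + x      # state that entered the chain c cycles ago
        chain.appendleft(total)
        outputs.append(total)
    return outputs

def interleave(streams):
    """Round-robin interleave equal-length streams (done externally)."""
    return [x for group in zip(*streams) for x in group]

def deinterleave(samples, c):
    """Split the interleaved output back into c streams."""
    return [samples[k::c] for k in range(c)]

streams = [[1, 2, 3], [10, 20, 30]]
mixed_out = c_slowed_accumulator(interleave(streams), c=2)
per_stream = deinterleave(mixed_out, 2)
expected = [list(accumulate(s)) for s in streams]
```

Each stream's running sum comes out unchanged, but every stream only advances once every C cycles, which is where the added latency comes from.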

6
C-slowing, Retiming, and the Virtex FPGA
  • Every 4-LUT has an associated register
  • Register can, almost always, be used
    independently of the LUT
  • LUTs can act as clocked shift registers (SRL16s)
  • Used in our AES hand-benchmark
  • Not used in our tool
  • Many designs have low register utilization
  • Excess of registers available in unoptimized
    designs
  • Retiming best performed with/after placement
  • Xilinx placement operates on mapped slices
  • Need net delay information for better results

7
Sketch of Tool's Operation
  • Convert .ncd to .xdl after placement
  • Load design into graph representation
  • Replace registers with edge annotations to
    represent registers
  • Replace every single register with C registers
  • Compute costs based on delay model
  • Retime
  • Convert edge annotations back to instance
    registers
  • Write out .xdl, convert to .ncd
  • Route
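
The "replace registers with edge annotations" and "replace every register with C registers" steps might look roughly like the following. This is a sketch under stated assumptions: the netlist is a hypothetical dictionary of instance kinds plus a list of wires, not real .xdl, and parallel paths between the same pair of LUTs are assumed to carry the same number of flip-flops.

```python
def to_edge_annotations(kinds, wires):
    """Collapse flip-flop instances into register counts on LUT-to-LUT edges.

    kinds maps instance name -> "LUT" or "FF"; wires is a list of
    (driver, sink) pairs. Assumes every loop passes through a LUT.
    """
    fanout = {}
    for src, dst in wires:
        fanout.setdefault(src, []).append(dst)

    edges = {}
    for lut in (name for name, kind in kinds.items() if kind == "LUT"):
        # Walk forward from each LUT, counting flip-flops until the next LUT.
        frontier = [(dst, 0) for dst in fanout.get(lut, [])]
        while frontier:
            node, regs = frontier.pop()
            if kinds[node] == "LUT":
                edges[(lut, node)] = regs
            else:
                frontier.extend((nxt, regs + 1) for nxt in fanout.get(node, []))
    return edges

# Hypothetical design: LUT a drives LUT b; b feeds back to a through one FF.
kinds = {"a": "LUT", "b": "LUT", "ff1": "FF"}
wires = [("a", "b"), ("b", "ff1"), ("ff1", "a")]

edges = to_edge_annotations(kinds, wires)
two_slowed = {edge: regs * 2 for edge, regs in edges.items()}  # C = 2
```

After retiming, the reverse step would turn each edge's register count back into that many flip-flop instances, which is where flip-flop allocation comes in.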

[Flow diagram labels: Placer, Router]
8
Experiment 1: How Good is the Tool?
  • Tool is a simple prototype
  • Manhattan distance delay estimate
  • No attempt to minimize flip-flops
  • Basic flip-flop allocation
  • Two benchmarks: AES and Smith/Waterman
  • Hand mapped
  • (optionally) hand placed
  • (optionally) hand C-slowed and retimed
  • Our best hand AES implementation
  • 1.3 Gb/s
  • <800 Slices, 10 BlockRAMs
  • $10 part, Spartan II-100
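
The Manhattan distance delay estimate listed above could be as simple as the sketch below. The fixed and per-unit costs are hypothetical placeholders, not the prototype's calibrated values, and the coordinates are assumed to be placed slice positions.

```python
def manhattan_delay(src, dst, fixed_ns=0.5, per_unit_ns=0.1):
    """Estimate net delay from placement alone.

    src and dst are (x, y) placement coordinates. fixed_ns and
    per_unit_ns are made-up constants used only for illustration.
    """
    (x0, y0), (x1, y1) = src, dst
    distance = abs(x1 - x0) + abs(y1 - y0)
    return fixed_ns + per_unit_ns * distance

# A net from slice (0, 0) to slice (3, 4) spans 7 units of Manhattan distance.
estimate = manhattan_delay((0, 0), (3, 4), fixed_ns=1.0, per_unit_ns=0.5)
```

An estimate like this needs placement but not routing, which fits a flow that retimes between the placer and the router.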

9
Experiment 1: AES, Automatically Placed
  • Just retiming is of no benefit
  • Automatic C-slowing very effective
  • But could do even better

10
Experiment 1: Smith/Waterman, Automatically Placed
  • Again, just retiming is of no benefit
  • C-slowing highly effective
  • Within 7% of hand-built implementation

11
Experiment 1: Comments
  • Just retiming is of no benefit
  • Both designs limited by single cycle feedback
    loops
  • C-Slowing very effective
  • Able to automatically nearly double throughput
  • Hand implementations more than doubled throughput
  • Reasonable numbers of additional registers
  • Limitations of prototype tool
  • Flip-flop allocation routines could be better
  • Some AES hand benchmarks used SRL16 delay chains
  • Simple is pretty good
  • Relatively simplistic implementation gets
    reasonably close to hand-mapped performance

12
Experiment 2: Retiming LEON
  • Can we automatically C-slow a large, synthesized
    design?
  • Leon 1: a synthesized, GPLed, SPARC-compatible
    microprocessor core [1]
  • 5 stage pipeline, integer only
  • Modify register file to use BlockRAMs
  • BlockRAMs are used as negative edge devices
  • Remove caches, I/O, etc
  • Synthesize, using Synplify with CEs disabled
  • Edit EDIF to replace Sets/Resets
  • Retime and C-slow with prototype tool
  • Prototype tool converts BlockRAMs to positive
    edge
  • C-slow a microprocessor core...
  • Get an interleaved multithreaded architecture

[1] Leon 1, by Jiri Gaisler, http://www.gaisler.com/leonmain.html
13
Experiment 2: Results
  • Retiming alone worked surprisingly well
  • 2-slowing very effective
  • 3-slowing hit diminishing returns

6132 LUTs for all designs
14
Experiment 2: Comments
  • Retiming alone worked surprisingly well
  • Tool automatically converted BlockRAMs to
    positive-edge clocking and rebalanced the
    pipeline
  • 2-slowing very effective
  • Effectively doubled the initial throughput
  • NO slowdown in latency over initial design
    because retiming was effective without C-slowing
  • Used many more registers, but fewer registers
    than LUTs
  • 3-slowing hit diminishing returns
  • Too many registers required combined with poor
    register allocation → poor performance

15
Conclusions
  • C-slow retiming is very effective
  • "Automatically double your throughput"
  • Benefits: More throughput
  • Costs: More Flip Flops, worse latency
  • Post-placement retiming appropriate
  • Independent Flip Flop usage critical
  • Have delay model for interconnect as well as
    logic
  • Some room for improvement
  • Faster/Better implementation
  • Minimize Flip Flop usage as well as delay
  • Use SRL16s
  • Better placement of Flip Flops
  • Experience suggests more Flip Flops/LUT would be
    useful

16
(Backup Slide) Why Not Use (Current) Synthesis Tools?
  • Many synthesis tools support retiming, but with
    caveats
  • ONLY works for synthesized items
  • AES and Smith/Waterman didn't use synthesis
  • Can't automatically C-slow
  • Can't retime through memory blocks
  • Can't accurately guesstimate interconnect delay
    before placement
  • >½ of the delay is the interconnect
  • Can't effectively scavenge unused flip-flops
    before placement
  • Xilinx placement operates on slices, not LUTs

17
(Backup Slide) Why the limitations on total speedup?
  • Absolute maximum
  • Interconnect + LUT + Flip-Flop
  • Practical maximums
  • Too many flip-flops to allocate
  • Only one flip-flop per LUT available
  • Flip-flop allocation poor
  • Quick and dirty greedy heuristic
  • Works well for mild C-slowing
  • Fails with highly aggressive C-slowing
  • Tool doesn't minimize flip-flops
  • Critical path is defined by the single worst path
  • Tool uses cheap and dirty interconnect delay
    model

18
(Backup Slide) Design Restrictions to Enable C-slowing
  • Resets and Clock Enables
  • Convert to explicit logic
  • Memories
  • Increase by a factor of C
  • Add high bits of addr to provide round-robin
    access
  • Every stream sees an independent memory
  • Global Set/Reset
  • Convert to individual resets
  • Still highly restrictive
  • Interleave/deinterleave IO
  • Requires external logic
  • No asynchronous sets/resets
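
The memory restriction above (grow the memory by a factor of C and add high address bits for round-robin access) can be sketched as follows. The class and its cycle counter are hypothetical; the sketch assumes a power-of-two depth and that the stream index is simply the clock cycle modulo C.

```python
class CSlowedMemory:
    """Memory enlarged by a factor of c; high address bits select the stream."""

    def __init__(self, depth, c):
        if depth <= 0 or depth & (depth - 1):
            raise ValueError("depth is assumed to be a power of two")
        self.addr_bits = depth.bit_length() - 1
        self.c = c
        self.cells = [0] * (depth * c)
        self.cycle = 0                  # stands in for a modulo-c stream counter

    def _physical(self, addr):
        stream = self.cycle % self.c    # round-robin stream for this cycle
        return (stream << self.addr_bits) | addr

    def write(self, addr, value):
        self.cells[self._physical(addr)] = value

    def read(self, addr):
        return self.cells[self._physical(addr)]

    def tick(self):
        self.cycle += 1

mem = CSlowedMemory(depth=4, c=2)
mem.write(1, "a")          # cycle 0: stream 0 writes address 1
mem.tick()
seen_by_stream_1 = mem.read(1)   # cycle 1: stream 1 has its own copy
mem.write(1, "b")
mem.tick()
seen_by_stream_0 = mem.read(1)   # cycle 2: stream 0 again
```

Both streams use the same logical address, yet neither sees the other's data, which is the "every stream sees an independent memory" property.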
