Fast 16point FFT Core for VirtexII FPGA - PowerPoint PPT Presentation

Provided by: mesl2

1
Fast 16-point FFT Core for Virtex-II FPGA
  • Subhradyuti Sarkar
  • Rakesh Kumar
  • s1sarkar, rakumar_at_cs.ucsd.edu

2
Organization
  • Goal
  • Design
  • Implementation
  • Results

3
Goal of the Project
  • The Fast Fourier Transform is a very commonly used
    routine in DSP applications
  • It is computationally intensive as well, hence a
    software implementation might not meet the timing
    requirements of a real-time embedded system
  • Our aim was to design and implement a lightweight and
    fast FFT processor targeting the Virtex-II FPGA

4
High-level Schematic
  • Block: 16-point FFT processor
  • Inputs: input (streaming), clock, reset
  • Output: output

5
What is SPARK? And why didn't we use it?
  • SPARK is a high-level synthesis framework that
    can read a restricted C program and convert it
    into a VHDL description
  • Since our core will accept streaming input, it
    must have pipelined READ, EXECUTE and OUTPUT
    stages
  • Nearly impossible to express streaming and
    pipelining in C

6
FFT 101
  • The original DFT [equation not preserved in this
    transcript]
  • This requires O(N^2) computations. The FFT uses
    divide-and-conquer techniques and brings down the
    complexity to O(N log N).
  • Since 16 is a power of four, we used the radix-4
    decimation-in-frequency algorithm, breaking the
    16-point DFT formula into four smaller DFTs. The
    final set of transforms [equation not preserved in
    this transcript]
  • Original DFT: 256 multiplications, 240
    additions
  • Radix-4 FFT: 64 multiplications, 192 additions!
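
The slide's equations are not preserved in this transcript. As a hedged illustration only, the textbook definitions are $X(k) = \sum_{n=0}^{N-1} x(n)\,W_N^{nk}$ with $W_N = e^{-j2\pi/N}$, and the usual radix-4 decimation-in-frequency split $X(4r+p) = \sum_{m=0}^{N/4-1} \big[\sum_{l=0}^{3} x(m + lN/4)\,W_4^{lp}\big]\,W_N^{mp}\,W_{N/4}^{mr}$ for $p = 0,\dots,3$. The Python sketch below implements that textbook split and checks it against the direct DFT. It is not the authors' VHDL, and the function names `dft` and `fft_radix4_dif` are made up for this example. The direct form needs 16 × 16 = 256 multiplications and 16 × 15 = 240 additions, which matches the counts quoted for the original DFT.

```python
import cmath

def dft(x):
    """Direct O(N^2) DFT: X(k) = sum_n x(n) * exp(-2j*pi*n*k/N)."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
            for k in range(n)]

def fft_radix4_dif(x):
    """Textbook radix-4 decimation-in-frequency FFT (length must be a power of 4)."""
    n = len(x)
    if n == 1:
        return list(x)
    q = n // 4
    subs = [[0j] * q for _ in range(4)]
    for m in range(q):
        a, b, c, d = x[m], x[m + q], x[m + 2 * q], x[m + 3 * q]
        # 4-point butterfly: only additions and multiplications by +-1, +-j
        t = [a + b + c + d,
             a - 1j * b - c + 1j * d,
             a - b + c - d,
             a + 1j * b - c - 1j * d]
        for p in range(4):
            # twiddle factor W_N^(m*p) applied after the butterfly
            subs[p][m] = t[p] * cmath.exp(-2j * cmath.pi * m * p / n)
    out = [0j] * n
    for p in range(4):
        # sub-transform p produces the outputs X(4r + p)
        out[p::4] = fft_radix4_dif(subs[p])
    return out

# Arbitrary 16-point test signal (illustrative, not from the slides)
signal = [complex(i, -i % 3) for i in range(16)]
max_error = max(abs(u - v) for u, v in zip(fft_radix4_dif(signal), dft(signal)))
print("max |radix-4 - direct| =", max_error)
```

For N = 16 the recursion runs two levels (16 → 4 → 1), which corresponds to the "four smaller DFTs" mentioned above.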

7
Our Design: version 1.0
  • Pipelined READ, EXECUTE and OUTPUT stages
  • Read one complex input (32 bits) in every cycle
  • 16-point input is read in 16 cycles
  • Output one complex result (32 bits) in every
    cycle
  • EXECUTE stage also takes 16 cycles
  • Performance
    • Latency: 17 cycles
    • Throughput: 1 transform per 16 cycles

Figure: pipeline diagram with READ, EXEC and OUTPUT stages shown for consecutive transforms.

8
Our Design: version 2.0
  • Completing the execution stage in 16 cycles
    resulted in large fan-outs for the internal
    registers, leading to very poor timing characteristics
  • By trial and error, we divided the execution
    stage into 4 pipelined sub-stages
  • That necessitated 5x internal storage, but 2560
    bits of storage is no big deal for the XC2V2000.
  • Final performance
    • Latency: 65 cycles
    • Throughput: 1 transform per 16 cycles (same as
      before)

Figure: pipeline diagram with stages R, E1, E2, E3, E4 and O shown for consecutive transforms.

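
The 2560-bit figure is consistent with five buffers of 16 complex samples at 32 bits each. The sketch below is only that arithmetic. The buffer breakdown is an assumption inferred from the "5x" remark and the 32-bit sample width, and the constant names are illustrative.

```python
SAMPLE_BITS = 32     # one complex input/output word (from the slides)
POINTS = 16          # transform size
BUFFER_COPIES = 5    # "5x internal storage" (assumed: one 16-sample buffer per copy)

storage_bits = BUFFER_COPIES * POINTS * SAMPLE_BITS
print("internal storage:", storage_bits, "bits")
```
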
9
Xilinx ISE Design Flow
  • Inputs: VHDL, constraints (pin, area, timing)
  • Synthesize: produces net-list(s)
  • Translate: merges net-lists and constraints into a
    single database
  • Map: maps the design to the board logic
  • Place/Route: floorplanned, placed and routed design
  • Configure: the design is downloaded to the board
  • Verify: using Xilinx ChipScope Pro

10
Characteristics of the Final Design
  • Resource
    • 1432 out of 21504 flip-flops (6%)
    • 1037 out of 21504 LUTs (4%)
    • 65 out of 624 I/O pins (10%)
  • Timing
    • Minimum clock period: 5.899 ns (maximum frequency:
      169.520 MHz)
  • Power
    • 440 mW estimated by XPower
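
As a quick cross-check of the figures above, the sketch below recomputes the utilization percentages and the maximum frequency from the reported counts and clock period. The variable names are illustrative. The computed utilizations (about 6.7%, 4.8% and 10.4%) suggest the whole-number percentages were truncated rather than rounded.

```python
# (used, available) pairs as reported for the final design
resources = {
    "flip-flops": (1432, 21504),
    "LUTs": (1037, 21504),
    "I/O pins": (65, 624),
}

utilization = {name: 100.0 * used / total
               for name, (used, total) in resources.items()}

min_period_ns = 5.899
fmax_mhz = 1000.0 / min_period_ns   # period in ns -> frequency in MHz

for name, pct in utilization.items():
    print(f"{name}: {pct:.2f}%")
print(f"fmax: {fmax_mhz:.2f} MHz")
```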

11
Placed-and-routed Design
Without any PIN constraint
  • Input pins are constrained to Bank0
  • Output pins are constrained to Bank1

12
Post Place-and-route Simulation
  • Input data: 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32,
    0, 32, 0, 32
  • Expected output: 16, 0, 0, 0, 0, 0, 0, 0, -16, 0, 0,
    0, 0, 0, 0, 0
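
The expected output can be reproduced in software with a direct DFT, as sketched below. Two assumptions are inferred from the numbers rather than stated on the slide: the core scales its result by 1/16, and the listed values are the real parts of the 16 outputs. The helper name `scaled_dft` is made up for this example.

```python
import cmath

N = 16
samples = [0, 32] * (N // 2)   # the test vector from the slide

def scaled_dft(x):
    """Direct DFT divided by the transform length (assumed 1/N scaling)."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n)) / n
            for k in range(n)]

spectrum = scaled_dft(samples)
real_parts = [round(v.real) for v in spectrum]
print(real_parts)
```

Only bins 0 and 8 are non-zero because the input is a constant offset plus a component alternating every sample.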

13
Speedup from Software Implementation
  • Naïve Software Implementation
    • 2.4 GHz P-4, 512 MB RAM, gcc 3.3.1
    • On average every transform takes 325 ns
  • FPGA Implementation
    • Virtex-II XC2V2000 FF896 board with speed grade
      -4
    • In steady state every transform takes 5.9 × 16 =
      94.4 ns
  • 3.44X speedup
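
The speedup follows directly from the clock period and the 16-cycle throughput. A minimal sketch of that arithmetic, using the unrounded 5.899 ns period from the timing report (the variable names are illustrative):

```python
clock_period_ns = 5.899       # minimum clock period of the final design
cycles_per_transform = 16     # steady-state throughput: 1 transform per 16 cycles
software_ns = 325.0           # measured average for the software version

fpga_ns = clock_period_ns * cycles_per_transform
speedup = software_ns / fpga_ns

print(f"FPGA: {fpga_ns:.1f} ns per transform, speedup {speedup:.2f}x")
```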

14
Configuration and Verification
  • The board was connected to the LPT1 port of the
    PC using a Xilinx Parallel Cable IV (JTAG mode)
  • iMPACT successfully downloaded the bitstream file
    to the board
  • Unfortunately, ChipScope Pro could not upload
    the data generated by our test bench

15
Thank You
QUESTIONS?