Fast 16-point FFT Core for Virtex-II FPGA

Provided by: mesl2

1
Fast 16-point FFT Core for Virtex-II FPGA

Subhradyuti Sarkar
Rakesh Kumar
s1sarkar, rakumar_at_cs.ucsd.edu

2
Organization

Goal
Design
Implementation
Results

3
Goal of the Project

The Fast Fourier Transform (FFT) is a very commonly used routine in DSP applications
It is also computationally intensive, so a software implementation might not meet the timing requirements of a real-time embedded system
Our aim was to design and implement a lightweight, fast FFT processor targeting the Virtex-II FPGA

4
High-level Schematic

[Block diagram: 16-point FFT processor with clock, reset and input (streaming) ports and an output port]

5
What is SPARK? And why didn't we use it?

SPARK is a high-level synthesis framework that
can read a restricted C program and convert it
into a VHDL description
Since our core will accept streaming input, it
must have pipelined READ, EXECUTE and OUTPUT
stages
Nearly impossible to express streaming and
pipelining in C

6
FFT 101

The original DFT
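
For reference, the standard N-point DFT definition can be written as follows. The symbols x(n), X(k) and W_N are assumed notation, not necessarily the one used on the slide.

```latex
X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk}, \qquad k = 0, 1, \dots, N-1, \qquad W_N = e^{-j 2\pi / N}
```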

This requires O(N^2) computations. The FFT uses
divide-and-conquer techniques to bring the
complexity down to O(N log N).
Since 16 is a power of four, we used the radix-4
decimation-in-frequency algorithm, breaking the
16-point DFT formula into four smaller DFTs. The
final set of transforms looks like
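
One standard way to write the radix-4 decimation-in-frequency split for N = 16 is given here, with four 4-point DFTs indexed by r = 0, 1, 2, 3. The notation (x(n), X(k), W_N = e^{-j2\pi/N}) is assumed, and the authors' exact form may differ.

```latex
\begin{aligned}
X(4r)   &= \sum_{n=0}^{3} \bigl[x(n) + x(n+4) + x(n+8) + x(n+12)\bigr]\, W_4^{nr} \\
X(4r+1) &= \sum_{n=0}^{3} \bigl[x(n) - j\,x(n+4) - x(n+8) + j\,x(n+12)\bigr]\, W_{16}^{n}\, W_4^{nr} \\
X(4r+2) &= \sum_{n=0}^{3} \bigl[x(n) - x(n+4) + x(n+8) - x(n+12)\bigr]\, W_{16}^{2n}\, W_4^{nr} \\
X(4r+3) &= \sum_{n=0}^{3} \bigl[x(n) + j\,x(n+4) - x(n+8) - j\,x(n+12)\bigr]\, W_{16}^{3n}\, W_4^{nr}
\end{aligned}
```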

Original DFT: 256 multiplications, 240 additions
Radix-4 FFT: 64 multiplications, 192 additions!
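
A minimal software sketch of the radix-4 decimation-in-frequency idea, checked against a direct DFT. This is an illustration in floating point, not the authors' VHDL design. The function names are hypothetical, and the operation-count helper only reproduces the direct-DFT figures (N^2 multiplications, N(N-1) additions).

```python
import cmath


def dft(x):
    """Direct O(N^2) DFT, used only as a reference."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
            for k in range(n)]


def fft_radix4_dif(x):
    """Recursive radix-4 decimation-in-frequency FFT; len(x) must be a power of 4."""
    n = len(x)
    if n == 1:
        return list(x)
    q = n // 4
    branches = [[], [], [], []]
    for m in range(q):
        a, b, c, d = x[m], x[m + q], x[m + 2 * q], x[m + 3 * q]
        w = cmath.exp(-2j * cmath.pi * m / n)  # twiddle factor W_N^m
        branches[0].append(a + b + c + d)
        branches[1].append((a - 1j * b - c + 1j * d) * w)
        branches[2].append((a - b + c - d) * w ** 2)
        branches[3].append((a + 1j * b - c - 1j * d) * w ** 3)
    out = [0j] * n
    for p in range(4):
        sub = fft_radix4_dif(branches[p])  # quarter-length DFT of branch p
        for r in range(q):
            out[4 * r + p] = sub[r]        # branch p yields outputs X(4r + p)
    return out


def direct_dft_op_counts(n):
    """Complex multiplications and additions of the direct DFT: N^2 and N(N-1)."""
    return n * n, n * (n - 1)


if __name__ == "__main__":
    samples = [complex(m, 15 - m) for m in range(16)]  # arbitrary test data
    error = max(abs(g - r) for g, r in zip(fft_radix4_dif(samples), dft(samples)))
    print("max deviation from direct DFT:", error)
    print("direct DFT (mults, adds):", direct_dft_op_counts(16))
```

For N = 16 the recursion has two levels, so each output passes through two radix-4 butterflies.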

7
Our Design: version 1.0

Pipelined READ, EXECUTE and OUTPUT stages
Read one complex input (32 bits) in every cycle
The 16-point input is read in 16 cycles
Output one complex result (32 bits) in every cycle
The EXECUTE stage also takes 16 cycles
Performance
Latency: 17 cycles
Throughput: 1 transform per 16 cycles

[Pipeline diagram: the READ, EXEC and OUTPUT stages of successive transforms overlap]

8
Our Design: version 2.0

Completing the execution stage in 16 cycles resulted in large fan-outs for the internal registers, and hence very poor timing characteristics
By trial and error, we divided the execution stage into 4 pipelined sub-stages
That necessitated 5x internal storage, but 2560 bits of storage is no big deal for the XC2V2000
Final performance
Latency: 65 cycles
Throughput: 1 transform per 16 cycles (same as before)
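
The latency and storage figures for both design versions can be reproduced with a small model. The formula latency = 16 × (number of execute sub-stages) + 1 is an inferred fit to the two numbers on the slides (17 and 65), not something the authors state. Reading "5x" as five 16-sample buffers of 32 bits each is likewise an assumption.

```python
POINTS = 16            # samples per transform
BITS_PER_SAMPLE = 32   # one complex sample, as stated on the slides


def pipeline_model(exec_stages):
    """Return (latency_cycles, cycles_per_transform) for a given number of
    16-cycle execute sub-stages. The latency formula is an inferred fit."""
    latency_cycles = exec_stages * POINTS + 1
    cycles_per_transform = POINTS  # one sample enters and leaves per cycle
    return latency_cycles, cycles_per_transform


def storage_bits(buffers):
    """Bits needed for the given number of 16-sample buffers (assumed layout)."""
    return buffers * POINTS * BITS_PER_SAMPLE


if __name__ == "__main__":
    print("version 1.0:", pipeline_model(1))       # one execute stage
    print("version 2.0:", pipeline_model(4))       # four execute sub-stages
    print("version 2.0 storage:", storage_bits(5), "bits")
```

Splitting the execute stage lengthens the latency but leaves the throughput unchanged, because a new transform still enters the pipeline every 16 cycles.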

[Pipeline diagram: stages R, E1, E2, E3, E4 and O, overlapped across successive transforms]

9
Xilinx ISE Design Flow

VHDL
Constraints (pin, area, timing)
Synthesize: net-list(s)
Translate: merges net-lists and constraints into a single database
Map: maps the design to the board logic
Place/Route: floorplanned, placed and routed design
Configure: the design is downloaded to the board
Verify: using Xilinx ChipScope Pro

10
Characteristics of the Final Design

Resource
1432 out of 21504 flip-flops (6%)
1037 out of 21504 LUTs (4%)
65 out of 624 I/O pins (10%)
Timing
Minimum clock period: 5.899 ns (maximum frequency: 169.520 MHz)
Power
440 mW estimated by XPower
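
The utilization percentages and the maximum frequency follow directly from the raw numbers above. A small sketch of the arithmetic, with hypothetical helper names; the bracketed figures on the slide appear to be truncated rather than rounded.

```python
def utilization_percent(used, available):
    """Resource utilization as a percentage."""
    return 100.0 * used / available


def max_frequency_mhz(period_ns):
    """Maximum clock frequency in MHz for a minimum period given in ns."""
    return 1000.0 / period_ns


if __name__ == "__main__":
    for name, used, available in [("flip-flops", 1432, 21504),
                                  ("LUTs", 1037, 21504),
                                  ("I/O pins", 65, 624)]:
        print(f"{name}: {utilization_percent(used, available):.2f}%")
    print(f"f_max: {max_frequency_mhz(5.899):.3f} MHz")
```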

11
Placed-and-routed Design
Without any PIN constraint

Input pins are constrained to Bank0
Output pins are constrained to Bank1

12
Post Place-and-route Simulation

Input data: 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32
Expected output: 16, 0, 0, 0, 0, 0, 0, 0, -16, 0, 0, 0, 0, 0, 0, 0
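
The expected output can be cross-checked in software. The sketch below assumes the inputs are purely real and that the core scales its result by 1/16; both are inferences from the numbers on this slide (an unscaled DFT would give 256 and -256), not statements from the authors.

```python
import cmath


def scaled_dft(x):
    """Direct DFT divided by N (the 1/N scaling is an assumption)."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n)) / n
            for k in range(n)]


samples = [0, 32] * 8                       # the 16 input values from the slide
spectrum = scaled_dft(samples)
real_parts = [round(v.real) for v in spectrum]

if __name__ == "__main__":
    print(real_parts)
```

Only bins 0 and 8 are non-zero because the input is a constant offset plus a component at half the sampling rate.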

13
Speedup from Software Implementation

Naïve Software Implementation
2.4 GHz P-4, 512 MB RAM, gcc 3.3.1
On average every transform takes 325 ns
FPGA Implementation
Virtex-II XC2V2000 FF896 board with speed grade -4
In steady state every transform takes 5.9 × 16 = 94.4 ns
3.44X speedup
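
The speedup figure is the ratio of the two per-transform times. A short sketch of the arithmetic, using the rounded 5.9 ns clock period from the slide; the helper names are hypothetical.

```python
def fpga_time_ns(clock_period_ns, cycles_per_transform=16):
    """Steady-state time per transform: one transform completes every 16 cycles."""
    return clock_period_ns * cycles_per_transform


def speedup(software_ns, hardware_ns):
    """How many times faster the hardware is than the software baseline."""
    return software_ns / hardware_ns


if __name__ == "__main__":
    hw = fpga_time_ns(5.9)
    print(f"FPGA: {hw:.1f} ns per transform")
    print(f"speedup over 325 ns software: {speedup(325, hw):.2f}X")
```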

14
Configuration and Verification

The board was connected to the LPT1 port of the PC using a Xilinx Parallel Cable IV (JTAG mode)
iMPACT successfully downloaded the bitstream file to the board
Unfortunately, ChipScope Pro could not upload the data generated by our test-bench

15
Thank You
QUESTIONS?
