Title: Advanced Topics on FPGA Applications Screen A
1 Advanced Topics on FPGA Applications: Screen A
- Wu, Jinyuan
- Fermilab
- IEEE NSS 2007 Refresher Course
- Supplemental Materials
- Oct, 2007
2 Doublet Matching, Hash Sorter
3 Hit Matching
Complexity | Software | FPGA Typical | FPGA Resource Saving Approaches
O(n^2) | for() for() | O(n)*O(N), Comparator Array | Hash Sorter, O(n)*O(N) in RAM
O(n^3) | for() for() for() | O(n)*O(N^2), CAM, Hough Trans. | Tiny Triplet Finder, O(n)*O(N log N)
O(n^4) | for() for() for() for() | |
4 Hash Sorter
- Pass 1: Data in Group 1 are stored in the hash sorter bins based on key number K.
- Pass 2: Data in Group 2 are fetched through and paired up with the corresponding Group 1 data with the same key number K.
(Figure: Group 1 and Group 2 entries, each carrying a key K and data D.)
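The two passes above can be sketched in software as follows. This is a minimal illustration, not the FPGA implementation: a Python dictionary stands in for the RAM bins, and the function name `hash_sort_match` and the `(key, data)` tuple layout are assumptions made for this example.

```python
from collections import defaultdict


def hash_sort_match(group1, group2):
    """Pair Group 2 items with Group 1 items that share the same key K.

    group1, group2: iterables of (key, data) tuples (hypothetical layout).
    Returns a list of (key, group1_data, group2_data) tuples.
    """
    bins = defaultdict(list)

    # Pass 1: store Group 1 data in the bin selected by its key.
    for key, data in group1:
        bins[key].append(data)

    # Pass 2: for each Group 2 item, read only the bin with the same key.
    pairs = []
    for key, data in group2:
        for stored in bins.get(key, ()):
            pairs.append((key, stored, data))
    return pairs


group1 = [(3, "a"), (7, "b"), (3, "c")]
group2 = [(7, "x"), (3, "y"), (5, "z")]
print(hash_sort_match(group1, group2))
```

Each item is visited once per pass, so the nested for() for() comparison is avoided; the cost moves into the storage for the bins.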
5 Hash Sorter
(Figure: hash sorter bins indexed by key number K.)
6 Hash Sorter Implementation
- Single-clock-cycle fast reset
- Pipelined structure
- Single-clock-cycle push or pop
7 An Example of Track Recognition: Event
- We explain the track recognition process using
this 20-track example.
8 Tangent Angle Measurements
- There are various techniques to measure the tangent angle of the track segment (or doublet, or cluster).
- Sometimes extra ghost segments may exist.
- The ghost segments may be resolved later in the track recognition process.
(Figure label: a)
9 A Large Curvature Track
- A soft track hits the large φ region.
- A global algorithm is better suited.
- The high-pT approximation is not valid globally.
- The exact track equation is needed.
(Figure labels: R, r, φ, a0. Parameters: radius of curvature, initial angle. Measure the tangent angle.)
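As an illustration of what an exact track equation can look like, the sketch below assumes the track is a circle through the origin in the transverse plane. That geometry, the function name `track_parameters`, and the sign convention are assumptions of this example and are not taken from the slides. Under that assumption a hit at polar position (r, phi) with measured tangent angle alpha satisfies r = 2 R sin(phi - a0) and alpha = 2 phi - a0, so both parameters follow from a single doublet.

```python
import math


def track_parameters(r, phi, alpha):
    """Return (a0, curvature) for a circular track through the origin.

    r, phi : polar coordinates of the hit
    alpha  : tangent angle of the track measured at the hit
    a0     : initial angle of the track at the origin
    curvature = 1/R, signed by the bending direction
    """
    # The chord from the origin to the hit makes equal angles with the
    # tangents at both ends: phi - a0 = alpha - phi.
    a0 = 2.0 * phi - alpha
    # Chord length of a circle: r = 2 R sin(phi - a0).
    curvature = 2.0 * math.sin(phi - a0) / r
    return a0, curvature


# Build a hit from known parameters, then recover them.
R_true, a0_true, half_arc = 2.0, 0.3, 0.4
r = 2.0 * R_true * math.sin(half_arc)
phi = a0_true + half_arc
alpha = a0_true + 2.0 * half_arc
print(track_parameters(r, phi, alpha))
```

No small-angle (high-pT) approximation is used here, which is why the same relations remain usable for soft tracks.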
10 An Example of Track Recognition: Clustering
- The 9-bin scheme: for doublets on the seeding super layer in this bin, search for coincidence in these 9 bins.
- The 4-bin scheme: for doublets on the seeding super layer in this bin, search for coincidence in these 4 bins.
- Clustering: the doublets in clusters are grouped together.
- The ghost doublets are gone.
(Figure labels: a0, c0.)
11 FPGA Block Diagram
Hash sorters for a0
Hash sorters for c0
12 Without Full Track Recognition
- Two track parameters can be calculated for each doublet.
- Useful trigger primitives can be found without full track recognition.
- For example
13 Triplet Finding, Tiny Triplet Finder
14 Hit Matching
Complexity | Software | FPGA Typical | FPGA Resource Saving Approaches
O(n^2) | for() for() | O(n)*O(N), Comparator Array | Hash Sorter, O(n)*O(N) in RAM
O(n^3) | for() for() for() | O(n)*O(N^2), AM, CAM, Hough Trans. | Tiny Triplet Finder, O(n)*O(N log N)
O(n^4) | for() for() for() for() | |
15 Hits, Hit Data & Triplets
- Hit data come out of the detector planes in random order.
- Hit data from 3 planes generated by the same particle tracks are organized together to form triplets.
16 TTF Operations, Phase I: Filling Bit Arrays
(Figure labels: Physical Planes; Bit Array/Shifters. Note: flipped bit order.)
- xA + xC = 2 xB
- xA - xC = constant
- For any hit, fill a corresponding logic cell.
17 TTF Operations, Phase II: Making Match
(Figure labels: Physical Planes; Bit Array/Shifters.)
- For any center plane hit, logically shift the bit array.
- Perform bit-wise AND in this range.
- Triplet is found.
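The two phases can be modeled in software for straight tracks through three equally spaced planes A, B, C, where a triplet satisfies xA + xC = 2 xB. The sketch below is an assumption-laden illustration rather than the FPGA design: Python integers stand in for the bit arrays, hit positions are integer bin numbers, and the function name `find_triplets` is hypothetical. The C array is filled in flipped bit order so that a single shift of the A array aligns every valid (xA, xC) pair for a given center hit.

```python
def find_triplets(hits_a, hits_b, hits_c, n_bins):
    """Return (xA, xB, xC) triplets with xA + xC == 2 * xB."""
    mask = (1 << n_bins) - 1

    # Phase I: fill the bit arrays. Plane C uses flipped bit order.
    arr_a = 0
    arr_c = 0
    for x in hits_a:
        arr_a |= 1 << x
    for x in hits_c:
        arr_c |= 1 << (n_bins - 1 - x)

    # Phase II: for each center-plane hit, shift and AND.
    triplets = []
    for xb in hits_b:
        # Bit j of arr_c holds xC = n_bins - 1 - j; the matching xA is j + shift.
        shift = 2 * xb - (n_bins - 1)
        if shift >= 0:
            aligned = arr_a >> shift
        else:
            aligned = (arr_a << -shift) & mask
        match = aligned & arr_c
        j = 0
        while match:
            if match & 1:
                triplets.append((j + shift, xb, n_bins - 1 - j))
            match >>= 1
            j += 1
    return triplets


print(find_triplets([2, 10], [5, 9], [8, 3], n_bins=16))
```

In this model one shift and one bit-wise AND replace the inner two loops of a three-deep nested search; the optional restriction of the AND to a limited range is omitted for brevity.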
18 Tiny Triplet Finder: Reuse Coincidence Logic via Shifting Hit Patterns
(Figure labels: C3, C2, C1.)
One set of coincidence logic is implemented.
For an arbitrary hit on C3, rotate, i.e., shift the hit patterns for C1 and C2 to search for coincidence.
19 Tiny Triplet Finder for Circular Tracks
- Also works with more than 3 layers.
- Fill the C1 and C2 bit arrays. (n1 clock cycles)
- Loop over C3 hits, shift bit arrays and check for coincidence. (n3 clock cycles)
(Figure labels: Bit Array (x2), Shifter (x2), R1/R3, R2/R3, Bit-wise Coincidence Logic, Triplet Map Output to Decoder.)
20 Tiny? Yes, Tiny! Logic Cell Usage
- AM, CAM, Hough Transform, etc.: O(N^2)
- Tiny Triplet Finder: O(N log N)
21 Complex Triplet Finding Problems
22 Options of Sequence Control
23 Micro-computing vs. Reconfigurable Computing
(Figure: a calculation on the data 100, 3, 4, 5, 7, controlled by an LD step followed by four arithmetic operations. CPU inputs: Data, Program. FPGA inputs: Data, Program, Configuration.)
- In a microprocessor, the users specify a program on fixed logic circuits.
- In an FPGA, the users specify logic circuits (as well as a program).
- FPGA computing need not follow microprocessor architectures. (But useful experiences can be borrowed.)
- The usefulness of FPGA reconfigurable computing is still to be fully appreciated.
24 ELMS: Enclosed Loop Micro-Sequencer
(Figure labels: Allows jump back as in microprocessors. Special in ELMS: supports FOR loops at machine code level.)
- PC+ROM is a good sequencer in FPGA.
- Adding Conditional Branch Logic allows the program to loop back.
- Loop & Return Logic + Stack is a special feature in ELMS that supports FOR loops at machine code level.
PC | Control Signals | Operation
00 | 000000000000000 |
01 | 001000100011010 | LD R1, n
02 | 000010001000000 | LD R2, addr_a
03 | 000000000000100 | LD R3, addr_X
04 | 000000010001000 | LD R7, 0
05 | 000000000100001 | BckA1 LD R4, (R2)
06 | 000100000010000 | INC R2
07 | 000001000100000 | LD R5, (R3)
08 | 000100010000001 | INC R3
09 | 001001000100000 | MUL R6, R4, R5
0a | 000000010001000 | EndA1 ADD R7, R7, R6
0b | 000010000010000 | DEC R1
0c | 000000100000100 | BRNZ BckA1
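The listing above is a multiply-accumulate loop over two arrays. To illustrate how a sequencer with a loop stack could run such a loop without DEC/BRNZ instructions on the data path, here is a small behavioral model. It is a sketch under assumptions, not the ELMS hardware: the ("FOR", count, end_pc) instruction format, the function name `run_sequencer`, and the use of Python callables as stand-ins for data-processor operations are all invented for this example.

```python
def run_sequencer(program, state):
    """Run `program` on `state`; return the list of executed addresses.

    Each entry is ("FOR", count, end_pc) or ("DO", action). "FOR" repeats
    the instructions from the next address through end_pc `count` times
    (count >= 1 assumed). Loop bookkeeping lives on a stack inside the
    sequencer, so the data path is not used for counting or branching.
    """
    pc = 0
    loop_stack = []              # entries: [back_pc, end_pc, passes_left]
    executed = []
    while pc < len(program):
        instr = program[pc]
        if instr[0] == "FOR":
            _, count, end_pc = instr
            loop_stack.append([pc + 1, end_pc, count - 1])
            pc += 1
            continue
        instr[1](state)          # hand the operation to the data processor
        executed.append(pc)
        jumped = False
        # Several nested loops may end on the same address.
        while loop_stack and loop_stack[-1][1] == pc:
            if loop_stack[-1][2] > 0:
                loop_stack[-1][2] -= 1
                pc = loop_stack[-1][0]
                jumped = True
                break
            loop_stack.pop()
        if not jumped:
            pc += 1
    return executed


def multiply_accumulate(state):
    i = state["i"]
    state["acc"] += state["a"][i] * state["x"][i]


def advance(state):
    state["i"] += 1


state = {"a": [1, 2, 3], "x": [4, 5, 6], "i": 0, "acc": 0}
program = [
    ("FOR", 3, 2),                 # address 0: repeat addresses 1..2 three times
    ("DO", multiply_accumulate),   # address 1
    ("DO", advance),               # address 2 (end of loop)
]
executed = run_sequencer(program, state)
print(state["acc"], executed)
```

Because the stack holds one entry per active loop, nested FOR loops work without any extra instructions in the loop body.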
25 Software: Using Spreadsheet as Compiler
26 What's Good about ELMS: No ALU => Small Resource Usage
(Figure: three block diagrams. Princeton Architecture: Program & DATA Memory, Program Control, ALU. Harvard Architecture: Program Memory, Program Control, ALU, DATA Memory. Fermilab Architecture(?): Program Memory, Sequencer (ELMS), DATA Memory, Data Processor.)
- The Princeton Architecture is more suitable at the system level, while the Harvard Architecture is better suited at the micro-structure level.
- Regular microprocessors cannot run looped programs without an ALU.
- The ALU takes a large amount of resources and may not be used efficiently for data processing tasks in an FPGA.
- The ELMS can run nested-loop programs without an ALU.
- Further separation of program and data is therefore possible.
- The ELMS is kept small.
27 Recursive Structure
28 The Digitizer Card for the Fermilab Beam Loss Monitor System
- Beam loss input signals from ion chambers are integrated and digitized.
- Sliding sums are accumulated and compared with pre-loaded thresholds.
- Over-threshold conditions in several places cause a beam abort based on pre-defined settings.
- Beam loss signals are filtered and de-rippled for display purposes.
- The sequence is controlled by the Seq128 block.
29 Filter Functions
(Figure: frequency responses of the sliding sum and of the 2nd-order Cascaded Integrator Comb (CIC) sum; 21 µs/sample, 124 samples; first zero at 360 Hz; horizontal axis: frequency.)
- The CIC sum is a sliding sum of sliding sums.
- The frequency response of the CIC sum is a sinc^2(x) function that has 2nd-order zeros and better stop band suppression.
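The statement that the CIC sum is a sliding sum of sliding sums can be checked with a short model. The function names `sliding_sum` and `cic_sum` and the choice of zero history before the first sample are assumptions of this sketch. Feeding in a unit impulse yields a triangular impulse response, which is the time-domain counterpart of a squared sinc response.

```python
def sliding_sum(samples, length):
    """Sum of the most recent `length` samples (zeros before the start)."""
    total = 0
    out = []
    for k, x in enumerate(samples):
        total += x
        if k >= length:
            total -= samples[k - length]   # drop the sample leaving the window
        out.append(total)
    return out


def cic_sum(samples, length):
    """2nd-order CIC sum: a sliding sum applied to a sliding sum."""
    return sliding_sum(sliding_sum(samples, length), length)


impulse = [1] + [0] * 8
print(sliding_sum(impulse, 4))   # rectangular response
print(cic_sum(impulse, 4))       # triangular response
```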
30 Filter Implementation
Recursive ≠ IIR
Implementation | Finite Impulse Response (FIR) | Infinite Impulse Response (IIR)
Non-Recursive Implementation | Yes | No
Recursive Implementation | Possible (e.g., Sliding Sum; resource friendly) | Yes
- The non-recursive implementation needs
  - 124 memory fetches,
  - 124 additions, and
  - more operations for longer sum lengths.
- The recursive implementation needs
  - 1 memory fetch and
  - 2 add/subtract operations,
  - regardless of sum length.
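The operation counts above can be seen side by side in a small model. This is an illustrative sketch only: the function names are hypothetical, a `collections.deque` stands in for the sample memory, and samples before the start are taken as zero.

```python
from collections import deque


def sliding_sum_direct(samples, length):
    """Non-recursive form: re-add the whole window for every output."""
    out = []
    for k in range(len(samples)):
        window = samples[max(0, k - length + 1): k + 1]   # up to `length` fetches
        out.append(sum(window))                           # up to `length` additions
    return out


def sliding_sum_recursive(samples, length):
    """Recursive form: S[k] = S[k-1] + x[k] - x[k-length]."""
    history = deque([0] * length, maxlen=length)   # circular sample buffer
    total = 0
    out = []
    for x in samples:
        oldest = history[0]          # one memory fetch
        total = total + x - oldest   # one addition, one subtraction
        history.append(x)            # overwrites the oldest sample
        out.append(total)
    return out


data = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]
print(sliding_sum_direct(data, 4))
print(sliding_sum_recursive(data, 4))
```

Both functions produce the same output; only the recursive one has a per-sample cost that does not grow with the sum length.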
31 BLM DC Process Sequencing
(Figure labels: Fully Sequencing; Partially Flat.)
- The processes of calculating sliding sums and CIC sums are fully sequenced.
- The de-ripple processor is flat for the process path, but it operates sequentially for 4 channels.
32 The End, Thanks
33 Resource Saving Tricks
- Loop Reduction Tricks: The number of computations in a given task is reduced by (1) using fewer iterations in loops and/or (2) using fewer operations in each iteration.
- Non-Loop Reduction Tricks: The number of computations in a given task is unchanged. The FPGA resource is saved by (1) reusing the resources multiple times via sequencing and/or (2) using transistor-saving resources such as RAM.
34 Resource Saving Tricks: Loop-Reduction
- Recursive Implementation of FIR Filter
- Tiny Triplet Finder: O(n)*O(N log N)
- Multiplier-less (ML) Approaches
- FFT: O(n)*O(log N)
35 Resource Saving Tricks: Non-Loop-Reduction
- Sequencing
- Using RAM: Hash Sorter/Histogram
(Figure labels: Initialization; Initialization 1, 2, 3; operation blocks OP1 to OP4, repeated.)