Master Thesis Defense: A Scalable Multi-FPGA Network System for Gene Sequencing


1
Master Thesis Defense: A Scalable Multi-FPGA Network System for Gene Sequencing

Jong-Ho Byun, M.S. Candidate
Department of Electrical and Computer Engineering
Committee Members
Dr. Arindam Mukherjee (Advisor)
Dr. Yogendra P. Kakad
Dr. Arun Ravindran
May 27, 2005, Cameron 119

2
Contents

Motivation
Our Vision
Methods
Smith-Waterman Algorithm
Implementation
Experimental Results
Checkpointing
Conclusions

3
Motivation

Bioinformatics: the convergence of computing and biology
GenBank data doubles every 16 months, while transistor count doubles every 18 months (Moore's law)
Proteomic and cellular imaging data are expected to grow even faster
New computing architectures are required for high-performance bioinformatics computing

4
Our Vision

Give a large number of users access to low-cost, high-performance computing platforms
Enhance existing computing resources
(workstations, clusters etc.) with high speed
reconfigurable FPGAs
Design flexible architectures, optimized for
diverse bioinformatics algorithms
Allow user programs to access our accelerated
computation platforms through APIs
Automate optimal mapping of applications to
computation platforms

5
Methods

Hardware Devices
Virtex-II 3000/6000 FPGAs
CLBs, Switch boxes
BenNuey-PCI Motherboard
32/64-bit, 66 MHz PCI connection
Three slots for DIME-II modules
PCI/User FPGAs
BenBlue-II Module
One of the DIME-II modules
Two user FPGAs

Software Tools
ISE development system
Synthesis, placement and routing
DIMEtalk
A system level FPGA development tool
Four basic elements
(Edge, Router, Bridge, Node)
DIMEtalk API
Runtime host connectivity
Direct communication

6
Smith-Waterman (1)

One of the sequence alignment methods
Identifies the similarities/dissimilarities between two biological sequences (gene or protein) by means of a resulting score
Resulting score
Edit distance (d)
Computed over the two entire strings from penalties, using a dynamic programming solution
Penalties (between two sequences S and T of length n and m respectively)
1 for a gap (insertion or deletion)
2 for a substitution
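
The scoring scheme above can be sketched in software as a small dynamic-programming routine. This is only an illustration, not the thesis implementation: the function name `edit_distance` is hypothetical, and the boundary initialization (cell (i, 0) = i and cell (0, j) = j) is an assumption, since the slides say only that the first row and column are initialized with a specific number.

```python
def edit_distance(s: str, t: str) -> int:
    """Edit distance with gap penalty 1 and substitution penalty 2."""
    n, m = len(s), len(t)
    # Assumed boundary: aligning a prefix with the empty string costs one gap per base.
    prev = list(range(m + 1))
    for i in range(1, n + 1):
        curr = [i] + [0] * m
        for j in range(1, m + 1):
            sub = 0 if s[i - 1] == t[j - 1] else 2   # match costs 0, substitution costs 2
            curr[j] = min(prev[j - 1] + sub,         # diagonal: match or substitution
                          prev[j] + 1,               # gap in t
                          curr[j - 1] + 1)           # gap in s
        prev = curr
    return prev[m]


if __name__ == "__main__":
    print(edit_distance("ACGT", "AGT"))  # 1 (one deleted base)
```

The sketch keeps only two rows of the matrix because it returns the score alone; the running time is still proportional to n times m.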

7
Smith-Waterman (2)

Fig. 1 The time complexity of the Smith-Waterman algorithm

Time/Space complexity
O(nm)
The first row/column cells are initialized with a specific number
The nucleotide bases
Adenine (A)
Thymine (T)
Guanine (G)
Cytosine (C)

8
Implementation (1)

Multi-FPGA
Design Components
PCI Interface
Clock Driver, Clock Deskew
Router, 34-bit wide bus Bridge
32-bit 512 FIFO, 4K BRAM
SW_IF, SW_Core
BUSes
64-bit PCI bus
40-bit PCI communication bus
122-bit Adjacent bus
159-bit Primary and Secondary communication bus

Fig. 2 Block Diagram of Multi-FPGA Network

9
Implementation (2)

Fig. 3 Design flow for implementing
Smith-Waterman on FPGA network

10
Implementation (3)

Systolic Arrays (SAs)
Generated by a C-based routine
Length of the systolic array
Corresponds to the length of the query sequence (S)
Time complexity O(mn)
Fig. 4 A systolic array for the Smith-Waterman algorithm

11
Implementation (4)

Processing Element (PE)

Two bits for each of the four nucleotide bases (A, T, G, C)
Two bits for each intermediate value (a, b, c, d)
By the observation of Lipton and Lopresti
Saves 3 bits for the intermediate values
LSB of b and c is always the opposite of the LSB of a
LSB of d is always equal to the LSB of a
By encoding the nucleotide bases of the query sequence S
Saves 2 bits

12
Implementation (5)

UP/DOWN Counter
After the last column in the two-dimensional matrix
Calculates the edit distance
Previous d (2-bit) of the last column
Current d (2-bit) of the last column
Initialize the 32-bit final edit distance with the number of PEs in the systolic array
00 -> 01 -> 10 -> 11: count up
11 -> 10 -> 01 -> 00: count down

Table 1
The operation of UP/DOWN Counter
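
The 2-bit intermediate values and the UP/DOWN counter can be illustrated with a small software model. This is a sketch under stated assumptions, not the thesis hardware: the function name `edit_distance_mod4` is hypothetical, the update rule is the standard modulo-4 formulation of the Lipton and Lopresti observation (with penalties 1 and 2, neighbouring cells differ by exactly 1, so d equals a when the bases match or when b or c equals a - 1, and a + 2 otherwise), and the counter is assumed to wrap from 11 to 00 when counting up.

```python
def edit_distance_mod4(query: str, database: str) -> int:
    """Recover the edit distance from 2-bit (mod 4) cell values plus an up/down counter."""
    n = len(query)                              # one PE per query base
    prev_col = [i % 4 for i in range(n + 1)]    # column 0: d(i, 0) = i, stored mod 4
    distance = n                                # counter starts at the number of PEs
    for j, t in enumerate(database, start=1):
        col = [j % 4]                           # row 0: d(0, j) = j, stored mod 4
        for i in range(1, n + 1):
            a = prev_col[i - 1]                 # d(i-1, j-1)
            b = col[i - 1]                      # d(i-1, j)
            c = prev_col[i]                     # d(i, j-1)
            if query[i - 1] == t or b == (a - 1) % 4 or c == (a - 1) % 4:
                d = a
            else:
                d = (a + 2) % 4
            col.append(d)
        # UP/DOWN counter on the output of the last PE: compare current d with previous d.
        if col[n] == (prev_col[n] + 1) % 4:
            distance += 1                       # e.g. 00 -> 01: count up
        else:
            distance -= 1                       # e.g. 01 -> 00: count down
        prev_col = col
    return distance


if __name__ == "__main__":
    print(edit_distance_mod4("ACGT", "AGT"))    # 1
```

Only two bits per cell are ever stored; the full 32-bit value exists only in the counter, which is why the counter must start from the known boundary value d(n, 0) = n.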

13
Implementation (6)

Fig. 5 FSM for reading operation

SW Interface
Between the memories and a systolic array
Reading operation
FSM
Unpacks individual nucleotide bases (each 32-bit FIFO word consists of 8 nucleotide bases)
Writing operation
Stores the final edit_distance (32-bit)
Maximum size of BRAM
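
A software model of the unpacking step in the reading operation might look like the sketch below. The layout is an assumption taken from the API slides later in the deck: 4 bits per base in the form `ux` plus a 2-bit code (A = 00, C = 01, G = 10, T = 11), with the first base in the least significant nibble. The function name `unpack_word` is hypothetical, and the real FSM emits one base per state rather than a list.

```python
BASES = "ACGT"  # index = 2-bit code: A = 00, C = 01, G = 10, T = 11


def unpack_word(word: int) -> list[tuple[str, bool]]:
    """Split one 32-bit FIFO word into 8 (base, is_first_base) pairs, LSB nibble first."""
    pairs = []
    for k in range(8):
        nibble = (word >> (4 * k)) & 0xF
        base = BASES[nibble & 0b11]        # low two bits carry the base code
        is_first = bool(nibble & 0b0100)   # x bit marks the first base of a sequence
        pairs.append((base, is_first))
    return pairs


if __name__ == "__main__":
    print("".join(base for base, _ in unpack_word(0x31103124)))  # AGCTACCT
```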

14
Experimental Results (1)

Optimized Implementation of a Systolic Array
Use an RPM macro for each PE
Reuse the PE design for different systolic arrays
More efficient implementation with the ISE Floorplanner editor
Fig. 6 Design flow of RPM for a single Processing Element design
The systolic array in Fig. 7 contains 2037 PEs

15
Experimental Results (2)

(a) Map without NGC-based RPM of PE (b) Map with NGC-based RPM of PE
Fig. 7 Maps on XC2V3000-4 for a systolic array containing 2037 PEs

16
Experimental Results (3)

Table 2 Used Silicon Resources
Estimating Power Dissipation (only inputs and
outputs)
Table 3 Results of Estimating Power Dissipation

17
Experimental Results (4)

Implementation of the Multi-FPGA Network
Systolic arrays
XC2V3000: systolic array of 3026 PEs (85)
XC2V6000: systolic array of 5028 PEs (51)
Fig. 8 Different implementations in the Multi-FPGA Network

18
Experimental Results (5)

Performance of a single XC2V3000
Systolic array of 3026 PEs
Systems
SunFire 280R: two UltraSPARC III processors at 1.05 GHz, 8 MB L2 cache and 8 GB memory, running Solaris 9
XC2V3000: clocked at 100 MHz
Fig. 9 Performance of a single XC2V3000 for Genbank databases

19
Experimental Results (6)

Application Program Interface
Formatting and Packing
Genbank databases
Configuration bitstreams
Set/Reset system
Control reading/writing databases/results
Compare with expected results
Formatting/Packing
A = ux00, C = ux01, G = ux10, T = ux11
8 nucleotide bases packed into each 32-bit PCI bus word

20
Experimental Results (7)

Hybrid serial-parallel performance
Time required to compare a query sequence
Processing time of a truly parallel FPGA network
Processing time of a serial FPGA network

D: the length of the given database sequence
Q: the length of the query sequence
T: the number of nucleotide sequences in the database
N: the number of FPGAs in the network
tclk: the clock period in the FPGA network

21
Checkpointing (1)

Concept
To reduce the re-computation time in the presence
of faults
Assumption
The host computer or any FPGA can detect a fault during operation
A fault is detected as soon as it occurs
No fault occurs during checkpoint retrieval or saving
Implementations
Using local registers (each PE has its own 7-bit register)
Using global BRAMs (several PEs share one BRAM)

22
Checkpointing (2)
Fig. 10 Comparison of Execution Time
23
Checkpointing (3)

Total operation time of a task
(a) without checkpointing
(b) with checkpointing
Using local registers
Ts = 0 and Tr = 1
Operation time saved
(positive number)

Using global BRAM
Ts = 40 and Tr = 42 (40 PEs)

24
Conclusion

Conclusion
The system computes orders of magnitude faster
than a microprocessor based computational server.
Additional FPGAs in the network speed up the
computation.
On-board memory is necessary to minimize the
impact of the bottleneck caused by the single PCI
interface.
Future Directions
Optimal partitioning of bioinformatics applications across entire networks of computers and FPGAs.
Implementations of other bioinformatics applications such as HMMER, FASTA, BLAST, etc.

25
Questions?
26
Clocks/Buses

CLKA: system clock (20-120 MHz)
CLKB: DSP clock (35-40 MHz)
1 64-bit PCI bus
2 40-bit PCI communication bus (120 MHz)
3 122-bit Adjacent bus (150 MHz)
4 159-bit Primary and Secondary communication
bus (120 MHz)

27
Time Complexity

Fig. Time complexity on systolic array

28
SW_IF Reading Operation

Fig. Output of FSM

29
API Execution

Prepare data
Formatting
Each nucleotide base uses 4 bits
A -> ux00 (0), C -> ux01 (1), G -> ux10 (2), T -> ux11 (3)
u is set to 0
x is set to 1 for the first nucleotide base, else set to 0
Packing
8 nucleotide bases into 32 bits
Ex) database sequence (first to last): A G C T A C C T
32 bits, MSB..LSB:
ux11 ux01 ux01 ux00 ux11 ux01 ux10 ux00
(0011 0001 0001 0000 0011 0001 0010 0100)
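
The formatting and packing rules above can be sketched as a short routine that reproduces the slide's example word. The function name `pack_word` and the `starts_sequence` flag are hypothetical names for illustration. How the API pads a sequence whose length is not a multiple of 8 is not stated in the slides, so this sketch simply rejects such input.

```python
CODES = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def pack_word(bases: str, starts_sequence: bool = False) -> int:
    """Pack 8 nucleotide bases into one 32-bit word, first base in the LSB nibble."""
    if len(bases) != 8:
        raise ValueError("expected exactly 8 bases per 32-bit word")
    word = 0
    for k, base in enumerate(bases):
        x = 1 if (starts_sequence and k == 0) else 0   # x flags the first base only
        nibble = (x << 2) | CODES[base]                # u (bit 3) is always 0
        word |= nibble << (4 * k)
    return word


if __name__ == "__main__":
    print(hex(pack_word("AGCTACCT", starts_sequence=True)))  # 0x31103124
```

The printed value corresponds to the bit pattern 0011 0001 0001 0000 0011 0001 0010 0100 given in the example.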

30
Checkpointing
31
Checkpointing
32
Estimating Power

Fig. Steps for Estimating Power
The script file generates a VCD data file that records the transitions of all inputs and outputs in the given simulation after one 100 ns interval.