Master Thesis Defense: A Scalable Multi-FPGA Network System for Gene Sequencing

1
Master Thesis Defense A Scalable Multi-FPGA
Network System for Gene Sequencing
  • Jong-Ho Byun, M.S. Candidate
  • Department of Electrical and Computer Engineering
  • Committee Members
  • Dr. Arindam Mukherjee (Advisor)
  • Dr. Yogendra P. Kakad
  • Dr. Arun Ravindran
  • May 27, 2005, Cameron 119

2
Contents
  • Motivation
  • Our Vision
  • Methods
  • Smith-Waterman Algorithm
  • Implementation
  • Experimental Results
  • Checkpointing
  • Conclusions

3
Motivation
  • Bioinformatics: convergence of computing and
    biology
  • GenBank data doubling every 16 months vs.
    transistor count doubling every 18 months
    (Moore's law)
  • Proteomic and cellular imaging data is expected
    to grow even faster
  • New computing architectures required for
    bioinformatics high performance computing

4
Our Vision
  • Give a large number of users access to low-cost,
    high-performance computing platforms
  • Enhance existing computing resources
    (workstations, clusters etc.) with high speed
    reconfigurable FPGAs
  • Design flexible architectures, optimized for
    diverse bioinformatics algorithms
  • Allow user programs to access our accelerated
    computation platforms through APIs
  • Automate optimal mapping of applications to
    computation platforms

5
Methods
  • Hardware Devices
  • Virtex-II 3000/6000 FPGAs
  • CLBs, Switch boxes
  • BenNuey-PCI Motherboard
  • 32/64-bit, 66 MHz PCI connection
  • Three slots for DIME-II modules
  • PCI/User FPGAs
  • BenBlue-II Module
  • One of the DIME-II modules
  • Two user FPGAs
  • Software Tools
  • ISE development system
  • Synthesis, place and route
  • DIMEtalk
  • A system level FPGA development tool
  • Four basic elements
  • (Edge, Router, Bridge, Node)
  • DIMEtalk API
  • Runtime host connectivity
  • Direct communication

6
Smith-Waterman (1)
  • One of the sequence alignment methods
  • Scores the similarity/dissimilarity of two
    biological sequences (gene or protein)
  • Resulting score
  • edit distance (d)
  • computed over the two entire strings using penalties
  • using a dynamic programming solution
  • Penalties (between two sequences S and T of lengths
    n and m respectively)
  • 1 for a gap (insertion or deletion)
  • 2 for a substitution
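
The dynamic programming recurrence behind this score can be sketched in a few lines of Python. This is a minimal software illustration, not the thesis implementation: the function name `edit_distance` is hypothetical, and initializing the first row and column with multiples of the gap penalty is an assumption (the slides only say they are "initialized with a specific number").

```python
def edit_distance(s: str, t: str, gap: int = 1, sub: int = 2) -> int:
    """Edit distance between s and t with the slide's penalties:
    1 per gap (insertion or deletion) and 2 per substitution."""
    n, m = len(s), len(t)
    # Assumed initialization: cell (0, j) costs j gaps, cell (i, 0) costs i gaps.
    prev = [j * gap for j in range(m + 1)]
    for i in range(1, n + 1):
        curr = [i * gap] + [0] * m
        for j in range(1, m + 1):
            # Diagonal move: free on a match, substitution penalty otherwise.
            diag = prev[j - 1] + (0 if s[i - 1] == t[j - 1] else sub)
            # Vertical and horizontal moves each cost one gap.
            curr[j] = min(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev[m]


print(edit_distance("AGCT", "AGT"))  # one deletion of C
```

Only two rows are kept in memory at a time, so the sketch needs O(m) space while still visiting all n*m cells.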

7
Smith-Waterman (2)
  • Fig. 1 The time complexity of the Smith-Waterman
    algorithm
  • Time/Space complexity
  • O(nm)
  • The first row/column cells
  • initialized with a specific number
  • The nucleotide bases
  • Adenine (A)
  • Thymine (T)
  • Guanine (G)
  • Cytosine (C)

8
Implementation (1)
  • Multi-FPGA
  • Design Components
  • PCI Interface
  • Clock Driver, Clock Deskew
  • Router, 34-bit wide bus Bridge
  • 32-bit 512 FIFO, 4K BRAM
  • SW_IF, SW_Core
  • BUSes
  • 64-bit PCI bus
  • 40-bit PCI communication bus
  • 122-bit Adjacent bus
  • 159-bit Primary and Secondary communication bus
  • Fig. 2 Block Diagram of Multi-FPGA Network

9
Implementation (2)
  • Fig. 3 Design flow for implementing
    Smith-Waterman on FPGA network

10
Implementation (3)
  • Systolic Arrays (SAs)
  • Generated by a C-based routine
  • Length of the systolic array
  • corresponds to the length of the query sequence (S)
  • Time complexity O(mn)
  • Fig. 4 A systolic array for the Smith-Waterman
    algorithm
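
Why a systolic array fits this problem can be shown with a small software model. Cells on the same anti-diagonal of the matrix depend only on earlier anti-diagonals, so a linear array with one PE per query base can update a whole anti-diagonal per step. The sketch below assumes the same boundary initialization as a plain edit-distance matrix; the function name `wavefront_edit_distance` is hypothetical, and the inner loop stands in for PEs that would run in parallel in hardware.

```python
def wavefront_edit_distance(query, database, gap=1, sub=2):
    """Fill the matrix one anti-diagonal at a time and count the steps."""
    m, n = len(query), len(database)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i * gap
    for j in range(n + 1):
        D[0][j] = j * gap
    steps = 0
    for k in range(2, m + n + 1):          # anti-diagonal with i + j == k
        rows = range(max(1, k - n), min(m, k - 1) + 1)
        for i in rows:                     # one PE per row, parallel in hardware
            j = k - i
            cost = 0 if query[i - 1] == database[j - 1] else sub
            D[i][j] = min(D[i - 1][j - 1] + cost,
                          D[i - 1][j] + gap,
                          D[i][j - 1] + gap)
        if len(rows) > 0:
            steps += 1
    return D[m][n], steps


print(wavefront_edit_distance("AGCT", "AGT"))  # (distance, steps)
```

The total number of cell updates is still m*n, but the number of sequential steps is m + n - 1 when both sequences are non-empty.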

11
Implementation (4)
  • Processing Element (PE)
  • Two bits for each of the four nucleotide bases (A, T, G,
    C)
  • Two bits for each intermediate value (a, b,
    c, d)
  • By the observation of Lipton and Lopresti
  • saves 3 bits across the intermediate values
  • the LSBs of b and c are always the opposite of the LSB of a
  • the LSB of d is always equal to the LSB of a
  • By coding the nucleotide bases of query sequence S
  • saves 2 bits
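
The two-bit representation works because, with gap penalty 1 and substitution penalty 2, neighbouring matrix cells always differ by exactly 1, so each value only needs to be known modulo 4. The sketch below checks this in software. It assumes a is the diagonal neighbour, b and c are the two adjacent neighbours, and d is the new cell (the slide does not define the letters); `pe_mod4`, `full_matrix` and `check` are hypothetical names, not part of the original design.

```python
import random


def pe_mod4(a, b, c, match):
    """One PE update on 2-bit values: d is a, or a + 2 (mod 4)."""
    if match or b == (a - 1) % 4 or c == (a - 1) % 4:
        return a
    return (a + 2) % 4


def full_matrix(s, t):
    """Reference matrix with full-width integers (gap 1, substitution 2)."""
    D = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(len(s) + 1):
        D[i][0] = i
    for j in range(len(t) + 1):
        D[0][j] = j
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 2
            D[i][j] = min(D[i - 1][j - 1] + cost, D[i - 1][j] + 1, D[i][j - 1] + 1)
    return D


def check(trials=200, seed=0):
    """Compare the 2-bit PE with the reference on random sequences."""
    rng = random.Random(seed)
    for _ in range(trials):
        s = "".join(rng.choice("ACGT") for _ in range(rng.randint(1, 12)))
        t = "".join(rng.choice("ACGT") for _ in range(rng.randint(1, 12)))
        D = full_matrix(s, t)
        for i in range(1, len(s) + 1):
            for j in range(1, len(t) + 1):
                a, b, c, d = D[i - 1][j - 1], D[i - 1][j], D[i][j - 1], D[i][j]
                if pe_mod4(a % 4, b % 4, c % 4, s[i - 1] == t[j - 1]) != d % 4:
                    return False
                # LSB relations stated on the slide.
                if (a ^ b) & 1 != 1 or (a ^ c) & 1 != 1 or (a ^ d) & 1 != 0:
                    return False
    return True


print(check())
```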

12
Implementation (5)
  • UP/DOWN Counter
  • After last column in two dimensional matrix
  • Calculate edit-distance
  • Previous d (2-bit) of last column
  • Current d (2-bit) of last column
  • Initialize the 32-bit final edit distance with the number
    of PEs in the systolic array
  • 00 -> 01 -> 10 -> 11: up count
  • 11 -> 10 -> 01 -> 00: down count
  • Table 1
  • The operation of UP/DOWN Counter
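
A software model of the counter, as a rough sketch: the full edit distance is rebuilt from the stream of 2-bit d values leaving the last PE by counting up or down on each change. Two points are assumptions rather than statements from the slides: the wrap-around transitions (11 -> 00 counts up, 00 -> 11 counts down), and taking the first "previous d" as the number of PEs modulo 4. The name `reconstruct_distance` is hypothetical.

```python
def reconstruct_distance(num_pes, last_pe_values):
    """Rebuild the full edit distance from successive 2-bit d values."""
    distance = num_pes            # counter starts at the number of PEs
    prev = num_pes % 4            # assumed initial 2-bit value
    for cur in last_pe_values:
        delta = (cur - prev) % 4
        if delta == 1:            # e.g. 00 -> 01, or 11 -> 00 (assumed wrap)
            distance += 1
        elif delta == 3:          # e.g. 01 -> 00, or 00 -> 11 (assumed wrap)
            distance -= 1
        else:
            raise ValueError("consecutive values must differ by exactly 1")
        prev = cur
    return distance


# Query AGCT (4 PEs) against database AGT: the last PE outputs 3, 2, 1.
print(reconstruct_distance(4, [3, 2, 1]))
```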

13
Implementation (6)
  • Fig. 5 FSM for reading operation
  • SW Interface
  • Between the memories and a systolic array
  • Reading operation
  • FSM
  • Unpack individual nucleotide bases (each 32-bit FIFO
    word holds 8 nucleotide bases)
  • Writing operation
  • Store final edit_distance (32-bit)
  • Maximum size of BRAM
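
The unpacking step of the reading operation can be modelled in software as follows. This is only a sketch: it assumes the 4-bit-per-base format described on the API Execution slide (two code bits plus a first-base flag, first base in the least significant nibble), and the names `unpack_word` and `BASES` are hypothetical.

```python
BASES = "ACGT"  # 2-bit codes 00, 01, 10, 11 as listed on the API slides


def unpack_word(word):
    """Split a 32-bit FIFO word into 8 (base, is_first_base) pairs."""
    result = []
    for k in range(8):
        nibble = (word >> (4 * k)) & 0xF      # nibble k, starting at the LSB
        base = BASES[nibble & 0b11]           # low two bits select the base
        is_first = bool(nibble & 0b0100)      # x flag marks the first base
        result.append((base, is_first))
    return result


print("".join(base for base, _ in unpack_word(0x31103124)))
```

The constant 0x31103124 is the bit pattern given for the example sequence on the API Execution slide.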

14
Experimental Results (1)
  • Optimized Implementation of a Systolic Array
  • Use RPM macro for each PE
  • Reuse PE Design for different systolic arrays
  • More efficient implementation with the ISE Floorplanner
    editor
  • Fig. 6 Design Flow of RPM for a single Processing
    Element Design
  • Systolic array in Fig.7 corresponding 2037

15
Experimental Results (2)
  • (a) Map without NGC-based RPM of PE (b) Map with
    NGC-based RPM of PE
  • Fig. 7 Maps on XC2V3000-4 for a systolic array
    containing 2037 PEs

16
Experimental Results (3)
  • Table 2 Used Silicon Resources
  • Estimating Power Dissipation (only inputs and
    outputs)
  • Table 3 Results of Estimating Power Dissipation

17
Experimental Results (4)
  • Implementation of the Multi-FPGA Network
  • Systolic arrays
  • XC2V3000 corresponding 3026 (85)
  • XC2V6000 corresponding 5028 (51)
  • Fig. 8 Different Implementations in Multi-FPGA
    Network

18
Experimental Results (5)
  • Performance of a single XC2V3000
  • Systolic arrays corresponding 3026
  • Systems
  • Sun Fire 280R: two UltraSPARC III processors at 1.05
    GHz, 8 MB L2 cache and 8 GB memory, running
    Solaris 9
  • XC2V3000: clocked at 100 MHz
  • Fig. 9 Performance of a single XC2V3000 for
    GenBank databases

19
Experimental Results (6)
  • Application Program Interface
  • Formatting and Packing
  • GenBank databases
  • Configuration bitstreams
  • Set/Reset system
  • Control reading/writing databases/results
  • Compare with expected results
  • Formatting/Packing
  • A -> ux00, C -> ux01, G -> ux10, T -> ux11
  • 8 nucleotide bases packed into each 32-bit PCI bus word

20
Experimental Results (7)
  • Hybrid serial-parallel performance
  • Time required to compare a query sequence
  • Processing time of a truly parallel FPGA network
  • Processing time of a serial FPGA network
  • D: the length of the given database sequence D
  • Q: the length of the query sequence Q
  • T: the number of nucleotide sequences in the database
  • N: the number of FPGAs in the network
  • tclk: the clock period in the FPGA network

21
Checkpointing (1)
  • Concept
  • To reduce the re-computation time in the presence
    of faults
  • Assumption
  • The host computer or any FPGA can detect a fault
    during the operation
  • A fault is detected as soon as it occurs
  • No fault occurs during checkpoint
    retrieval or saving
  • Implementations
  • Using local registers (each PE has its own 7-bit
    register)
  • Using global BRAMs (several PEs share one BRAM)

22
Checkpointing (2)
  • Fig. 10 Comparison of execution time

23
Checkpointing (3)
  • Total operation time of a task
  • (a) without checkpointing
  • (b) with checkpointing
  • Using local registers
  • Ts = 0 and Tr = 1
  • Operation time saved
  • (positive number)
  • Using global BRAM
  • Ts = 40 and Tr = 42 (40 PEs)
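
The trade-off between the two schemes can be illustrated with a simple single-fault timing model. Everything in this sketch is an assumption made for illustration, since the slide's own formulas did not survive in the transcript: checkpoints are taken at a fixed interval of useful work, exactly one fault occurs, Ts and Tr are read as save and retrieval costs in clock cycles (Ts = 0, Tr = 1 for local registers; Ts = 40, Tr = 42 for global BRAM), and the function names are hypothetical.

```python
def time_without_checkpointing(task_cycles, fault_at):
    """A fault after fault_at cycles of work forces a restart from zero."""
    return fault_at + task_cycles


def time_with_checkpointing(task_cycles, fault_at, interval, t_save, t_restore):
    """Roll back to the last checkpoint instead of restarting.

    fault_at counts useful-work cycles; faults never hit a save or a
    retrieval, matching the slide's assumption.
    """
    total_checkpoints = (task_cycles - 1) // interval
    saved_before_fault = fault_at // interval
    before_fault = fault_at + saved_before_fault * t_save
    resume_from = saved_before_fault * interval
    # Retrieval only costs time if there is a checkpoint to retrieve.
    restore = t_restore if saved_before_fault > 0 else 0
    after_fault = (restore + (task_cycles - resume_from)
                   + (total_checkpoints - saved_before_fault) * t_save)
    return before_fault + after_fault


baseline = time_without_checkpointing(1000, 950)
local = time_with_checkpointing(1000, 950, 100, t_save=0, t_restore=1)
bram = time_with_checkpointing(1000, 950, 100, t_save=40, t_restore=42)
print(baseline - local, baseline - bram)  # cycles saved by each scheme
```

In this model the saving is positive whenever the rolled-back work exceeds the total save and retrieval overhead, which is the sense in which the slide speaks of the operation time saved as a positive number.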

24
Conclusion
  • Conclusion
  • The system computes orders of magnitude faster
    than a microprocessor based computational server.
  • Additional FPGAs in the network speed up the
    computation.
  • On-board memory is necessary to minimize the
    impact of the bottleneck caused by the single PCI
    interface.
  • Future Directions
  • Optimal partitioning of bioinformatics
    applications across the entire networks of
    computers and FPGAs.
  • Implementations of other bioinformatics applications such as
    HMMER, FASTA, BLAST, etc.

25
Questions?

26
Clocks/Buses
  • CLKA: system clock (20-120 MHz)
  • CLKB: DSP clock (35-40 MHz)
  • (1) 64-bit PCI bus
  • (2) 40-bit PCI communication bus (120 MHz)
  • (3) 122-bit Adjacent bus (150 MHz)
  • (4) 159-bit Primary and Secondary communication
    bus (120 MHz)

27
Time Complexity
  • Fig. Time complexity on systolic array

28
SW_IF Reading Operation
  • Fig. Output of FSM

29
API Execution
  • Prepare data
  • Formatting
  • each nucleotide base uses 4 bits
  • A -> ux00 (0), C -> ux01 (1), G -> ux10 (2), T ->
    ux11 (3)
  • u: set to 0
  • x: set to 1 for the first nucleotide base, else
    set to 0
  • Packing
  • 8 nucleotide bases into 32 bits
  • Ex) database sequence (first to last):
    A G C T A C C T
  • 32 bits, MSB..LSB:
    ux11 ux01 ux01 ux00 ux11 ux01 ux10 ux00
  • (0011 0001 0001 0000 0011 0001
    0010 0100)
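
The formatting and packing rules above can be checked with a short host-side sketch. The names `pack_bases` and `CODE` are hypothetical and not part of the DIMEtalk API; the bit layout follows the slide's worked example, with the first base placed in the least significant nibble.

```python
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def pack_bases(bases, first_in_sequence=False):
    """Pack exactly 8 nucleotide bases into one 32-bit word.

    Each base takes 4 bits "ux<code>": u is always 0, and x is 1 only for
    the first base of a sequence.
    """
    if len(bases) != 8:
        raise ValueError("exactly 8 bases are packed per 32-bit word")
    word = 0
    for k, base in enumerate(bases):
        nibble = CODE[base]
        if first_in_sequence and k == 0:
            nibble |= 0b0100                  # set the x flag
        word |= nibble << (4 * k)             # base k goes into nibble k
    return word


print(f"{pack_bases('AGCTACCT', first_in_sequence=True):032b}")
```

Grouped into nibbles, the printed pattern can be compared directly with the binary string in the example above.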

30
Checkpointing

31
Checkpointing

32
Estimating Power
  • Fig. Steps for Estimating Power
  • The script file generates a VCD
    data file that records the transitions of
    all the inputs and outputs in the given
    simulation after one 100 ns interval.

33
BenNuey/BenBlue

34
DIMEtalk

35
DIMEtalk

36
Performance
  • Hybrid serial-parallel performance

37
Optimized map

38
PE