Title: Efficient Implementation of a String Matching Algorithm for SRC and Cray Reconfigurable Computers
1Efficient Implementation of a String Matching Algorithm for SRC and Cray Reconfigurable Computers
- Esam El-Araby¹, Mohamed Taher¹, Tarek El-Ghazawi¹, Mohamed Abouellail¹, Nandakishore Sastry², and Kris Gaj²
- ¹The George Washington University, ²George Mason University
2Outline
- Introduction
- SRC Hardware and Software
- Cray XD1 Hardware and Software
- String Matching Algorithms
- Implementation Methodology
- Results and Comparisons
- Conclusions
3Introduction
5SRC Architecture (Hi-Bar™ Based Systems)
- Hi-Bar sustains 1.4 GB/s per port with 180 ns latency per tier
- Up to 256 input and 256 output ports with two tiers of switch
- Common Memory (CM) has a controller with DMA capability
- Controller can perform other functions such as scatter/gather
- Up to 8 GB DDR SDRAM supported per CM node
6SRC Reconfigurable Processor
7SRC Programming Environment
8SRC Programming Environment (cntd)
9SRC Programming Environment (cntd)
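[Figure: FPGA contents after the Function_1 call. A main program written in C or Fortran calls Function_1(a, d, e) and Function_2(d, e, f). Function_1 is built from Macro_1(a, b, c), Macro_2(b, d), and Macro_2(c, e); Function_2 is built from Macro_3(s, t), Macro_1(n, b), and Macro_4(t, k). After the Function_1 call the FPGA holds Macro_1 and two instances of Macro_2, with a as input, b and c as intermediate values, and d and e as outputs.]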
11Cray XD1 System Architecture (One Chassis)
- Compute
  - 12 AMD Opteron 32/64 bit, x86 processors
  - High Performance Linux
- RapidArray Interconnect
  - 12 communications processors
  - 1 Tb/s switch fabric
- Active Management
  - Dedicated processor
- Application Acceleration
  - 6 co-processors
FPGA and 2nd RAP are on Expansion Module
12Cray XD1 Application Acceleration Interfaces
- XC2VP30-50 running at up to 200 MHz
- 4 QDR II RAM with over 400 HSTL-I I/O at 200 MHz DDR (400 MTransfers/s)
- 16-bit simplified HyperTransport I/F at 400 MHz DDR (800 MTransfers/s)
- QDR and HT I/F take up <20% of XC2VP30. The rest is available for user applications
13Cray XD1 Development Flow
14Cray XD1 Hardware Development Flow
15Design Methodology using Cray XD1
- Write application in C for system microprocessor
- Identify computation-intensive routine(s)
- Generate a bitstream using Cray Cores (RT and QDRII) and language of choice
  - Create module in HDL (Verilog, VHDL)
  - Create module using High Level Language Tools
  - Validate module
  - Synthesize using XST, Leonardo, or Synplify Pro
  - Create bitstream using Xilinx place and route tools
- Replace routines with Cray API calls
- Run application
17String Matching - Introduction
- String matching: detecting the occurrence of a particular substring, called the pattern, in another string, called the text
- Types of string matching
  - Exact string matching
  - Approximate string matching
- Exact string matching
  - Involves matching patterns where they exist completely, that is, unbroken and with no irrelevant data in between any letters
  - Numerous applications: NIDS, text editing, etc.
- Approximate string matching
  - Pattern rarely matches the text completely
  - Finds application in computational biology (DNA matching), image detection, handwriting recognition, etc.
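As a small illustration of the exact case described above, the following sketch reports every position at which a pattern occurs unbroken in a text. The function name `find_all` and the sample strings are hypothetical, chosen only for this example; it relies on Python's built-in `str.find` rather than on any algorithm from the slides.

```python
def find_all(text, pattern):
    """Return every start index at which pattern occurs in text (overlaps included)."""
    if not pattern:
        return []
    hits = []
    start = text.find(pattern)
    while start != -1:
        hits.append(start)
        # Resume one character later so overlapping occurrences are also found.
        start = text.find(pattern, start + 1)
    return hits


print(find_all("GAATCGAATC", "AATC"))
```

Approximate matching cannot be reduced to such a lookup, because the pattern is allowed to differ from the text by substitutions and gaps; that is why it needs a scoring scheme.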
18DNA Matching Basics
- Problem
  - Find the best pairwise alignment of GAATC and CATAC
- Why align two protein or DNA sequences?
  - Determine whether they are descended from a common ancestor (homologous)
  - Infer a common function
  - Locate functional elements
  - Infer protein structure, if the structure of one of the sequences is known
- We need a way to measure the quality of a candidate alignment
- Alignment scores consist of two parts
  - substitution matrix
  - gap penalty
19DNA Matching Basics (cntd)
Scoring aligned bases
- Purines: A, G
- Pyrimidines: C, T
- Transition (cheap): substitution within the same class (A/G or C/T)
- Transversion (expensive): substitution between a purine and a pyrimidine
GAAT-C
CA-TAC
-5 + 10 + ? + 10 + ? + 10 = ?
Scoring gaps
- Linear gap penalty: every gap receives a score of d
GAAT-C    (d = -4)
CA-TAC
-5 + 10 + (-4) + 10 + (-4) + 10 = 17
- Affine gap penalty: opening a gap receives a score of d; extending a gap receives a score of e
G--AATC   (d = -4, e = -1)
CATA--C
-5 + (-4) + (-1) + 10 + (-4) + (-1) + 10 = 5
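The two worked examples above can be checked with a short script. This is a minimal sketch, not the authors' code: the function names are hypothetical, the match score (10), transversion score (-5), and gap values are read off the slide, and the transition score is not given in the slides, so the sketch requires the caller to supply one if a transition ever occurs.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_score(x, y, match=10, transversion=-5, transition=None):
    """Score one aligned pair of bases using the values shown on the slide."""
    if x == y:
        return match
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    if same_class:
        if transition is None:
            # Assumption: the slides do not state this value, so none is guessed here.
            raise ValueError("transition score must be supplied by the caller")
        return transition
    return transversion


def alignment_score(top, bottom, gap_open, gap_extend):
    """Sum column scores; a gap run in the same row pays gap_open once, then gap_extend."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    gap_row = None  # row ("top"/"bottom") holding the gap in the previous column
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            row = "top" if x == "-" else "bottom"
            total += gap_extend if row == gap_row else gap_open
            gap_row = row
        else:
            total += substitution_score(x, y)
            gap_row = None
    return total


# Linear penalty is the special case gap_open == gap_extend == d.
print(alignment_score("GAAT-C", "CA-TAC", gap_open=-4, gap_extend=-4))    # 17
print(alignment_score("G--AATC", "CATA--C", gap_open=-4, gap_extend=-1))  # 5
```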
20Approximate String Matching Algorithm (Smith-Waterman Algorithm)
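For readers unfamiliar with the algorithm, the sketch below shows the standard Smith-Waterman recurrence in software. It is an illustrative reference only, not the FPGA design discussed in this talk. It assumes a linear gap penalty and a single mismatch score (it does not distinguish transitions from transversions); the function name and default scores are taken from or modeled on the scoring example earlier in the deck.

```python
def smith_waterman_score(a, b, match=10, mismatch=-5, gap=-4):
    """Return the best local alignment score of a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)  # previous row of the score matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                   # start a new local alignment
                prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                prev[j] + gap,       # gap in b
                curr[j - 1] + gap,   # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("GAATC", "CATAC"))  # 26, from AT-C aligned with ATAC
```

Each cell depends only on its left, upper, and upper-left neighbours, so all cells on one anti-diagonal are independent of each other; this is the property that makes the algorithm attractive for arrays of hardware processing elements.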
22Implementation Schemes in SRC
23Operational Scenarios for Cray XD1
25Performance Results
- Rate = (FPGA freq.) × (cells/cycle) × (# SWPEs)
- Opteron Implementation (SSEARCH34)
  - 100 Million Cell Updates Per Second (CUPS)
- Cray Inc. Implementation
  - Current unoptimized design
    - 80 MHz × 1 × 32 = 2.56 Billion CUPS (GCUPS)
  - With optimization
    - 100 MHz × 1 × 50 = 5.0 GCUPS
  - With future Virtex 4 FPGA
    - 100 MHz × 1 × 150 = 15 GCUPS
  - 25x speedup vs. Opteron
- Our Implementation
  - SRC-6
    - Current unoptimized design
      - 100 MHz × 1 × (16×16) = 25.6 GCUPS
    - 10x speedup vs. Cray
    - 256x speedup vs. Opteron
  - Cray XD1
    - Current unoptimized design
CUG05, New Mexico, May 2005
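The figures on this slide follow from multiplying the clock frequency, the number of cells updated per cycle by each processing element, and the number of Smith-Waterman processing elements (SWPEs). The sketch below reproduces that arithmetic; the function and variable names are hypothetical, and the input numbers are the ones quoted on the slide.

```python
def cups(freq_hz, cells_per_cycle, num_pes):
    """Cell updates per second for an array of processing elements."""
    return freq_hz * cells_per_cycle * num_pes


OPTERON = 100_000_000                        # SSEARCH34 on Opteron: 100 MCUPS
cray_current = cups(80_000_000, 1, 32)       # Cray Inc., current unoptimized design
cray_optimized = cups(100_000_000, 1, 50)    # Cray Inc., with optimization
cray_virtex4 = cups(100_000_000, 1, 150)     # Cray Inc., with future Virtex 4 FPGA
src6 = cups(100_000_000, 1, 16 * 16)         # SRC-6, current unoptimized design

for name, rate in [
    ("Cray current", cray_current),
    ("Cray optimized", cray_optimized),
    ("Cray Virtex 4", cray_virtex4),
    ("SRC-6", src6),
]:
    print(f"{name}: {rate / 1e9:.2f} GCUPS, {rate / OPTERON:.1f}x vs. Opteron")

print(f"SRC-6 vs. Cray current: {src6 / cray_current:.1f}x")
```

With these inputs the current Cray design comes out at 25.6x the Opteron rate, which the slide rounds to 25x, and the SRC-6 design at 10x the current Cray design and 256x the Opteron.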
26Conclusions
- Smith-Waterman sequence alignment algorithm has been implemented on both SRC-6 and Cray XD1 systems
- Similarities and differences are highlighted with regard to
  - System hardware architecture
  - Ease of programming
  - Programming model
  - Development time
  - Hardware/software libraries
  - Performance
- The speed-up vs. microprocessor is reported
- Primary bottlenecks limiting the performance of both systems are recognized
- The capability to share and port applications between the SRC and Cray systems is explored