Title: CS 6293 Advanced Topics: Current Bioinformatics
1CS 6293 Advanced Topics Current Bioinformatics
- Lectures 3-4 Pair-wise Sequence Alignment
2Outline
- Biological background
- Global sequence alignment
- Local sequence alignment
- Optional linear-space alignment algorithm
- Heuristic alignment BLAST
3Evolution at the DNA level
C
ACGGTGCAGTCACCA
ACGTTGC-GTCCACCA
DNA evolutionary events (sequence
edits) Mutation, deletion, insertion
4Sequence conservation implies function
next generation
OK
OK
OK
X
X
Still OK?
5Why sequence alignment?
- Conserved regions are more likely to be
functional - Can be used for finding genes, regulatory
elements, etc. - Similar sequences often have similar origin and
function - Can be used to predict functions for new genes /
proteins - Sequence alignment is one of the most widely used
computational tools in biology
6Global Sequence Alignment
AGGCTATCACCTGACCTCCAGGCCGATGCCC TAGCTATCACGACCGCGG
TCGATTTGCCCGAC
S
T
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC-
-GACCGC--GGTCGATTTGCCCGAC
S
T
- Definition
- An alignment of two strings S, T is a pair of
strings S, T (with spaces) s.t. - S T, and (S length of S)
- removing all spaces in S, T leaves S, T
7What is a good alignment?
- Alignment
- The best way to match the letters of one
sequence with those of the other - How do we define best?
8S -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- T
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
- The score of aligning (characters or spaces) x
y is s (x,y). - Score of an alignment
- An optimal alignment one with max score
9Scoring Function
- Sequence edits
- AGGCCTC
- Mutations AGGACTC
- Insertions AGGGCCTC
- Deletions AGG-CTC
- Scoring Function
- Match m AAC
- Mismatch -s A-A
- Gap (indel) -d
10- Match 2, mismatch -1, gap -1
- Score 3 x 2 2 x 1 1 x 1 3
11More complex scoring function
- Substitution matrix
- Similarity score of matching two letters a, b
should reflect the probability of a, b derived
from the same ancestor - It is usually defined by log likelihood ratio
- Active research area. Especially for proteins.
- Commonly used PAM, BLOSUM
12An example substitution matrix
A C G T
A 3 -2 -1 -2
C 3 -2 -1
G 3 -2
T 3
13How to find an optimal alignment?
- A naïve algorithm
- for all subseqs A of S, B of T s.t. A B do
- align Ai with Bi, 1 i A
- align all other chars to spaces
- compute its value
- retain the max
- end
- output the retained alignment
S abcd A cd T wxyz B xz -abc-d
a-bc-d w--xyz -w-xyz
14Analysis
- Assume S T n
- Cost of evaluating one alignment n
- How many alignments are there
- pick n chars of S,T together
- say k of them are in S
- match these k to the k unpicked chars of T
- Total time
- E.g., for n 20, time is gt 240 gt1012 operations
15Dynamic Programming for sequence alignment
- Suppose we wish to align
- x1xM
- y1yN
- Let F(i,j) optimal score of aligning
- x1xi
- y1yj
- Scoring Function
- Match m
- Mismatch -s
- Gap (indel) -d
16Elements of dynamic programming
- Optimal sub-structures
- Optimal solutions to the original problem
contains optimal solutions to sub-problems - Overlapping sub-problems
- Some sub-problems appear in many solutions
- Memorization and reuse
- Carefully choose the order that sub-problems are
solved
17Optimal substructure
- If xi is aligned to yj in the optimal
alignment between x1..M and y1..N, then - The alignment between x1..i and y1..j is also
optimal - Easy to prove by contradiction
18Recursive solution
- Notice three possible cases
- xM aligns to yN
- xM
- yN
- 2. xM aligns to a gap
- xM
- ?
- yN aligns to a gap
- ?
- yN
m, if xM yN F(M,N)
F(M-1, N-1) -s, if not
F(M,N) F(M-1, N) - d
F(M,N) F(M, N-1) - d
19Recursive solution
- Generalize
- F(i-1, j-1) ?(Xi,Yj)
- F(i,j) max F(i-1, j) d
- F(i, j-1) d
- ?(Xi,Yj) m if Xi Yj, and s otherwise
- Boundary conditions
- F(0, 0) 0.
- F(0, j) ?
- F(i, 0) ?
-jd y1..j aligned to gaps.
-id x1..i aligned to gaps.
20What order to fill?
F(0,0)
F(M,N)
i
j
21What order to fill?
F(0,0)
F(M,N)
22Example
F(i,j) i 0 1 2 3 4
A G T A
A
T
A
j 0
1
2
3
23Example
F(i,j) i 0 1 2 3 4
A G T A
0 -1 -2 -3 -4
A -1
T -2
A -3
j 0
1
2
3
24Example
F(i,j) i 0 1 2 3 4
A G T A
0 -1 -2 -3 -4
A -1 1 0 -1 -2
T -2
A -3
j 0
1
2
3
25Example
F(i,j) i 0 1 2 3 4
A G T A
0 -1 -2 -3 -4
A -1 1 0 -1 -2
T -2 0 0 1 0
A -3
j 0
1
2
3
26Example
F(i,j) i 0 1 2 3 4
A G T A
0 -1 -2 -3 -4
A -1 1 0 -1 -2
T -2 0 0 1 0
A -3 -1 -1 0 2
Optimal Alignment F(4,3) 2
j 0
1
2
3
27Example
F(i,j) i 0 1 2 3 4
A G T A
0 -1 -2 -3 -4
A -1 1 0 -1 -2
T -2 0 0 1 0
A -3 -1 -1 0 2
Optimal Alignment F(4,3) 2 This only tells us
the best score
j 0
1
2
3
28Trace-back
F(i-1, j-1) ?(Xi,Yj) F(i,j)
max F(i-1, j) d F(i,
j-1) d
F(i,j) i 0 1 2 3 4
A G T A
0 -1 -2 -3 -4
A -1 1 0 -1 -2
T -2 0 0 1 0
A -3 -1 -1 0 2
j 0
1
A
A
2
3
29Trace-back
F(i-1, j-1) ?(Xi,Yj) F(i,j)
max F(i-1, j) d F(i,
j-1) d
F(i,j) i 0 1 2 3 4
A G T A
0 -1 -2 -3 -4
A -1 1 0 -1 -2
T -2 0 0 1 0
A -3 -1 -1 0 2
j 0
1
T A
T A
2
3
30Trace-back
F(i-1, j-1) ?(Xi,Yj) F(i,j)
max F(i-1, j) d F(i,
j-1) d
F(i,j) i 0 1 2 3 4
A G T A
0 -1 -2 -3 -4
A -1 1 0 -1 -2
T -2 0 0 1 0
A -3 -1 -1 0 2
j 0
1
G T A
- T A
2
3
31Trace-back
F(i-1, j-1) ?(Xi,Yj) F(i,j)
max F(i-1, j) d F(i,
j-1) d
F(i,j) i 0 1 2 3 4
A G T A
0 -1 -2 -3 -4
A -1 1 0 -1 -2
T -2 0 0 1 0
A -3 -1 -1 0 2
j 0
1
A G T A
A - T A
2
3
32Trace-back
F(i-1, j-1) ?(Xi,Yj) F(i,j)
max F(i-1, j) d F(i,
j-1) d
F(i,j) i 0 1 2 3 4
A G T A
0 -1 -2 -3 -4
A -1 1 0 -1 -2
T -2 0 0 1 0
A -3 -1 -1 0 2
Optimal Alignment F(4,3) 2 AGTA A?TA
j 0
1
2
3
33Using trace-back pointers
F(i,j) i 0 1 2 3 4
A G T A
0 -1 -2 -3 -4
A -1
T -2
A -3
j 0
1
2
3
34Using trace-back pointers
F(i,j) i 0 1 2 3 4
A G T A
0 -1 -2 -3 -4
A -1 1 0 -1 -2
T -2
A -3
j 0
1
2
3
35Using trace-back pointers
F(i,j) i 0 1 2 3 4
A G T A
0 -1 -2 -3 -4
A -1 1 0 -1 -2
T -2 0 0 1 0
A -3
j 0
1
2
3
36Using trace-back pointers
F(i,j) i 0 1 2 3 4
A G T A
0 -1 -2 -3 -4
A -1 1 0 -1 -2
T -2 0 0 1 0
A -3 -1 -1 0 2
j 0
1
2
3
37Using trace-back pointers
F(i,j) i 0 1 2 3 4
A G T A
0 -1 -2 -3 -4
A -1 1 0 -1 -2
T -2 0 0 1 0
A -3 -1 -1 0 2
j 0
1
2
3
38Using trace-back pointers
F(i,j) i 0 1 2 3 4
A G T A
0 -1 -2 -3 -4
A -1 1 0 -1 -2
T -2 0 0 1 0
A -3 -1 -1 0 2
j 0
1
2
3
39Using trace-back pointers
F(i,j) i 0 1 2 3 4
A G T A
0 -1 -2 -3 -4
A -1 1 0 -1 -2
T -2 0 0 1 0
A -3 -1 -1 0 2
j 0
1
2
3
40Using trace-back pointers
F(i,j) i 0 1 2 3 4
A G T A
0 -1 -2 -3 -4
A -1 1 0 -1 -2
T -2 0 0 1 0
A -3 -1 -1 0 2
Optimal Alignment F(4,3) 2 AGTA A?TA
j 0
1
2
3
41The Needleman-Wunsch Algorithm
- Initialization.
- F(0, 0) 0
- F(0, j) - j ? d
- F(i, 0) - i ? d
- Main Iteration. Filling in scores
- For each i 1M
- For each j 1N
- F(i-1,j) d case 1
- F(i, j) max F(i, j-1) d
case 2 - F(i-1, j-1) s(xi, yj) case 3
- UP, if case 1
- Ptr(i,j) LEFT if case 2
- DIAG if case 3
- Termination. F(M, N) is the optimal score, and
from Ptr(M, N) can trace back optimal alignment
42Complexity
- Time
- O(NM)
- Space
- O(NM)
- Linear-space algorithms do exist (with the same
time complexity)
43Equivalent graph problem
S1
G
A
T
A
(0,0)
? a gap in the 2nd sequence ? a gap in the 1st
sequence match / mismatch
1
1
A
S2
1
T
Value on vertical/horizontal line -d Value on
diagonal m or -s
1
1
A
(3,4)
- Number of steps length of the alignment
- Path length alignment score
- Optimal alignment find the longest path from (0,
0) to (3, 4) - General longest path problem cannot be found with
DP. Longest path on this graph can be found by DP
since no cycle is possible.
44Question
- If we change the scoring scheme, will the optimal
alignment be changed? - Old Match 1, mismatch gap -1
- New match 2, mismatch gap 0
- New Match 2, mismatch gap -2?
45Question
- What kind of alignment is represented by these
paths?
A
A
A
A
A
B C
B C
B C
B C
B C
A- BC
A-- -BC
--A BC-
-A- B-C
-A BC
Alternating gaps are impossible if s gt -2d
46A variant of the basic algorithm
- Scoring scheme m s d 1
- Seq1 CAGCA-CTTGGATTCTCGG
-
- Seq2 ---CAGCGTGG--------
- Seq1 CAGCACTTGGATTCTCGG
-
- Seq2 CAGC-----G-T----GG
- The first alignment may be biologically more
realistic in some cases (e.g. if we know s2 is a
subsequence of s1)
Score -7
Score -2
47A variant of the basic algorithm
- Maybe it is OK to have an unlimited of gaps in
the beginning and end
----------CTATCACCTGACCTCCAGGCCGATGCCCCTTCCGGC GCG
AGTTCATCTATCAC--GACCGC--GGTCG--------------
- Then, we dont want to penalize gaps in the ends
48The Overlap Detection variant
- Changes
- Initialization
- For all i, j,
- F(i, 0) 0
- F(0, j) 0
- Termination
- maxi F(i, N)
- FOPT max maxj F(M, j)
x1 xM
yN y1
49Different types of overlaps
x
x
y
y
50The local alignment problem
- Given two strings X x1xM,
- Y y1yN
- Find substrings x, y whose similarity (optimal
global alignment value) is maximum - e.g. X abcxdex X cxde
- Y xxxcde Y c-de
x
y
51Why local alignment
- Conserved regions may be a small part of the
whole - Global alignment might miss them if flanking
junk outweighs similar regions - Genes are shuffled between genomes
C
D
B
A
D
A
B
C
52Naïve algorithm
- for all substrings X of X and Y of Y
- Align X Y via dynamic programming
- Retain pair with max value
- end
- Output the retained pair
- Time O(n2) choices for A, O(m2) for B, O(nm) for
DP, so O(n3m3 ) total.
53Reminder
- The overlap detection algorithm
- We do not give penalty to gaps at either end
Free gap
Free gap
54The local alignment idea
- Do not penalize the unaligned regions (gaps or
mismatches) - The alignment can start anywhere and ends
anywhere - Strategy whenever we get to some low similarity
region (negative score), we restart a new
alignment - By resetting alignment score to zero
55The Smith-Waterman algorithm
- Initialization F(0, j) F(i, 0) 0
-
- 0
- F(i 1, j) d
- F(i, j 1) d
- F(i 1, j 1) ?(xi, yj)
Iteration F(i, j) max
56The Smith-Waterman algorithm
- Termination
- If we want the best local alignment
- FOPT maxi,j F(i, j)
- If we want all local alignments scoring gt t
- For all i, j find F(i, j) gt t, and trace back
57x x x c d e
0 0 0 0 0 0 0
a 0
b 0
c 0
x 0
d 0
e 0
x 0
Match 2 Mismatch -1 Gap -1
58x x x c d e
0 0 0 0 0 0 0
a 0 0 0 0 0 0 0
b 0 0 0 0 0 0 0
c 0
x 0
d 0
e 0
x 0
Match 2 Mismatch -1 Gap -1
59x x x c d e
0 0 0 0 0 0 0
a 0 0 0 0 0 0 0
b 0 0 0 0 0 0 0
c 0 0 0 0 2 1 0
x 0
d 0
e 0
x 0
Match 2 Mismatch -1 Gap -1
60x x x c d e
0 0 0 0 0 0 0
a 0 0 0 0 0 0 0
b 0 0 0 0 0 0 0
c 0 0 0 0 2 1 0
x 0 2 2 2 1 0 0
d 0
e 0
x 0
Match 2 Mismatch -1 Gap -1
61x x x c d e
0 0 0 0 0 0 0
a 0 0 0 0 0 0 0
b 0 0 0 0 0 0 0
c 0 0 0 0 2 1 0
x 0 2 2 2 1 0 0
d 0 1 1 1 1 3 2
e 0
x 0
Match 2 Mismatch -1 Gap -1
62x x x c d e
0 0 0 0 0 0 0
a 0 0 0 0 0 0 0
b 0 0 0 0 0 0 0
c 0 0 0 0 2 1 0
x 0 2 2 2 1 0 0
d 0 1 1 1 1 3 2
e 0 0 0 0 0 2 5
x 0
Match 2 Mismatch -1 Gap -1
63x x x c d e
0 0 0 0 0 0 0
a 0 0 0 0 0 0 0
b 0 0 0 0 0 0 0
c 0 0 0 0 2 1 0
x 0 2 2 2 1 1 0
d 0 1 1 1 1 3 2
e 0 0 0 0 0 2 5
x 0 2 2 2 1 1 4
Match 2 Mismatch -1 Gap -1
64Trace back
x x x c d e
0 0 0 0 0 0 0
a 0 0 0 0 0 0 0
b 0 0 0 0 0 0 0
c 0 0 0 0 2 1 0
x 0 2 2 2 1 1 0
d 0 1 1 1 1 3 2
e 0 0 0 0 0 2 5
x 0 2 2 2 1 1 4
Match 2 Mismatch -1 Gap -1
65Trace back
x x x c d e
0 0 0 0 0 0 0
a 0 0 0 0 0 0 0
b 0 0 0 0 0 0 0
c 0 0 0 0 2 1 0
x 0 2 2 2 1 1 0
d 0 1 1 1 1 3 2
e 0 0 0 0 0 2 5
x 0 2 2 2 1 1 4
Match 2 Mismatch -1 Gap -1
cxde c-de
x-de xcde
66- No negative values in local alignment DP array
- Optimal local alignment will never have a gap on
either end - Local alignment Smith-Waterman
- Global alignment Needleman-Wunsch
67Analysis
- Time
- O(MN) for finding the best alignment
- Time to report all alignments depends on the
number of sub-opt alignments - Memory
- O(MN)
- O(MN) possible
68- Optional more efficient alignment algorithms
69- Given two sequences of length M, N
- Time O(MN)
- ok
- Space O(MN)
- bad
- 1Mb seq x 1Mb seq 1000G memory
- Can we do better?
70Bounded alignment
- Good alignment should appear near the diagonal
71Bounded Dynamic Programming
- If we know that x and y are very similar
- Assumption gaps(x, y) lt k
- xi
- Then, implies i j lt k
- yj
72Bounded Dynamic Programming
- Initialization
- F(i,0), F(0,j) undefined for i, j gt k
- Iteration
- For i 1M
- For j max(1, i k)min(N, ik)
- F(i 1, j 1) ?(xi, yj)
- F(i, j) max F(i, j 1) d, if j gt i k
- F(i 1, j) d, if j lt i k
- Termination same
x1 xM
yN y1
k
73Analysis
- Time O(kM) ltlt O(MN)
- Space O(kM) with some tricks
gt
M
M
2k
2k
74(No Transcript)
75- Given two sequences of length M, N
- Time O(MN)
- ok
- Space O(MN)
- bad
- 1mb seq x 1mb seq 1000G memory
- Can we do better?
76Linear space algorithm
- If all we need is the alignment score but not the
alignment, easy!
We only need to keep two rows (You only need one
row, with a little trick)
But how do we get the alignment?
77Linear space algorithm
- When we finish, we know how we have aligned the
ends of the sequences
XM YN
Naïve idea Repeat on the smaller subproblem
F(M-1, N-1) Time complexity O((MN)(MN))
78(0, 0)
M/2
(M, N)
Key observation optimal alignment (longest path)
must use an intermediate point on the M/2-th row.
Call it (M/2, k), where k is unknown.
79(0,0)
(3,2)
(3,4)
(3,6)
(3,0)
(6,6)
- Longest path from (0, 0) to (6, 6) is max_k
(LP(0,0,3,k) LP(6,6,3,k))
80Hirschbergs idea
Y
Forward algorithm Align x1x2xM/2 with Y
X
M/2
F(M/2, k) represents the best alignment between
x1x2xM/2 and y1y2yk
81Backward Algorithm
Y
Backward algorithm Align reverse(xM/21xM) with
reverse(Y)
X
M/2
B(M/2, k) represents the best alignment between
reverse(xM/21xM) and reverse(ykyk1yN )
82Linear-space alignment
- Using 2 (4) rows of space, we can compute
- for k 1N, F(M/2, k), B(M/2, k)
M/2
83Linear-space alignment
- Now, we can find k maximizing F(M/2, k) B(M/2,
k) - Also, we can trace the path exiting column M/2
from k
Conclusion In O(NM) time, O(N) space, we found
optimal alignment path at row M/2
84Linear-space alignment
- Iterate this procedure to the two sub-problems!
M/2
k
M/2
N-k
85Analysis
- Memory O(N) for computation, O(NM) to store the
optimal alignment - Time
- MN for first iteration
- k M/2 (N-k) M/2 MN/2 for second
k
M/2
M/2
N-k
86MN
MN/2
MN/4
MN MN/2 MN/4 MN/8 MN (1 ½ ¼
1/8 1/16 ) 2MN O(MN)
MN/8
87Heuristic Local Sequence Alignment BLAST
88State of biological databases
- Sequenced Genomes
- Human 3?109 Yeast 1.2?107
- Mouse 2.7?109 Rat 2.6?109
- Neurospora 4?107 Fugu fish 3.3?108
- Tetraodon 3?108 Mosquito 2.8?108
- Drosophila 1.2?108 Worm 1.0?108
- Rice 1.0?109 Arabidopsis 1.2?108
- sea squirts 1.6?108
- Current rate of sequencing (before new-generation
sequencing) - 4 big labs ? 3 ?109 bp /year/lab
- 10s small labs
- Private sectors
- With new-generation sequencing
- Easily generating billions of reads daily
89Some useful applications of alignments
- Given a newly discovered gene,
- - Does it occur in other species?
- Assume we try Smith-Waterman
Our new gene
104
The entire genomic database
1010 - 1011
May take several weeks!
90Some useful applications of alignments
- Given a newly sequenced organism,
- - Which subregions align with other organisms?
- - Potential genes
- - Other functional units
- Assume we try Smith-Waterman
Our newly sequenced mammal
3?109
The entire genomic database
1010 - 1011
gt 1000 years ???
91BLAST
- Basic Local Alignment Search Tool
- Altschul, Gish, Miller, Myers, Lipman, J Mol Biol
1990 - The most widely used bioinformatics tool
- Which is better long mediocre match or a few
nearby, short, strong matches with the same total
score? - Score-wise, exactly equivalent
- Biologically, later may be more interesting, is
common - At least, if must miss some, rather miss the
former - BLAST is a heuristic algorithm emphasizing the
later - speed/sensitivity tradeoff BLAST may miss
former, but gains greatly in speed
92BLAST
- Available at NCBI (National Center for
Biotechnology Information) for download and
online use. http//blast.ncbi.nlm.nih.gov/ - Along with many sequence databases
- Main idea
- Construct a dictionary of all the words in the
query - Initiate a local alignment for each word match
between query and DB - Running Time O(MN)
- However, orders of magnitude faster than
Smith-Waterman
query
DB
93BLAST ? Original Version
- Dictionary
- All words of length k (11 for DNA, 3 for
proteins) - Alignment initiated between words of alignment
score ? T (typically T k) - Alignment
- Ungapped extensions until score
- below statistical threshold
- Output
- All local alignments with score
- gt statistical threshold
query
scan
DB
query
94BLAST ? Original Version
A C G A A G T A A G G T C
C A G T
Example k 4, T 4 The matching word GGTC
initiates an alignment Extension to the left and
right with no gaps until alignment falls lt
50 Output GTAAGGTCC GTTAGGTCC
C C C T T C C T G G A T T
G C G A
95Gapped BLAST
A C G A A G T A A G G T C
C A G T
- Added features
- Pairs of words can initiate alignment
- Extensions with gaps in a band around anchor
- Output
- GTAAGGTCCAGT
- GTTAGGTC-AGT
C T G A T C C T G G A T T
G C G A
96Example
- Query gattacaccccgattacaccccgattaca (29 letters)
2 mins - Database All GenBankEMBLDDBJPDB sequences
(but no EST, STS, GSS, or phase 0, 1 or 2 HTGS
sequences) 1,726,556 sequences 8,074,398,388
total letters - gtgi28570323gbAC108906.9 Oryza sativa
chromosome 3 BAC OSJNBa0087C10 genomic sequence,
complete sequence Length 144487 Score 34.2
bits (17), Expect 4.5 Identities 20/21 (95)
Strand Plus / Plus - Query 4 tacaccccgattacaccccga 24
-
- Sbjct 125138 tacacccagattacaccccga 125158
- Score 34.2 bits (17),
- Expect 4.5 Identities 20/21 (95) Strand
Plus / Plus - Query 4 tacaccccgattacaccccga 24
-
- Sbjct 125104 tacacccagattacaccccga 125124
- gtgi28173089gbAC104321.7 Oryza sativa
chromosome 3 BAC OSJNBa0052F07 genomic sequence,
complete sequence Length 139823 Score 34.2
bits (17), Expect 4.5 Identities 20/21 (95)
Strand Plus / Plus
97Example
- Query Human atoh enhancer, 179 letters 1.5
min - Result 57 blast hits
- gi7677270gbAF218259.1AF218259 Homo sapiens
ATOH1 enhanc... 355 1e-95 - gi22779500gbAC091158.11 Mus musculus Strain
C57BL6/J ch... 264 4e-68 - gi7677269gbAF218258.1AF218258 Mus musculus
Atoh1 enhanc... 256 9e-66 - gi28875397gbAF467292.1 Gallus gallus CATH1
(CATH1) gene... 78 5e-12 - gi27550980embAL807792.6 Zebrafish DNA
sequence from clo... 54 7e-05 - gi22002129gbAC092389.4 Oryza sativa
chromosome 10 BAC O... 44 0.068 - gi22094122refNM_013676.1 Mus musculus
suppressor of Ty ... 42 0.27 - gi13938031gbBC007132.1 Mus musculus, Similar
to suppres... 42 0.27 - gi7677269gbAF218258.1AF218258 Mus musculus
Atoh1 enhancer sequence Length 1517 - Score 256 bits (129), Expect 9e-66
Identities 167/177 (94), - Gaps 2/177 (1) Strand Plus / Plus
- Query 3 tgacaatagagggtctggcagaggctcctggccgcggt
gcggagcgtctggagcggagca 62 -
- Sbjct 1144 tgacaatagaggggctggcagaggctcctggccccggt
gcggagcgtctggagcggagca 1203
98Different types of BLAST
- blastn search nucleic acid databases
- blastp search protein databases
- blastx you give a nucleic acid sequence, search
protein databases - tblastn you give a protein sequence, search
nucleic acid databases - tblastx you give a nucleic sequence, search
nucleic acid database, implicitly translate both
into protein sequences
99BLAST cons and pros
- Advantages
- Fast!!!!
- A few minutes to search a database of 1011 bases
- Disadvantages
- Sensitivity may be low
- Often misses weak homologies
- New improvement
- Make it even faster
- Mainly for aligning very similar sequences or
really long sequences - E.g. whole genome vs whole genome
- Make it more sensitive
- PSI-BLAST iteratively add more homologous
sequences - PatternHunter discontinuous seeds
100Variants of BLAST
- NCBI-BLAST most widely used version
- WU-BLAST (Washington University BLAST) another
popular version - Optimized, added features
- MEGABLAST
- Optimized to align very similar sequences. Linear
gap penalty - BLAT Blast-Like Alignment Tool
- BlastZ
- Optimized for aligning two genomes
- PSI-BLAST
- BLAST produces many hits
- Those are aligned, and a pattern is extracted
- Pattern is used for next search above steps
iterated - Sensitive for weak homologies
- Slower
101Pattern hunter
- Instead of exact matches of consecutive matches
of k-mer, we can - look for discontinuous matches
- My query sequence looks like
- ACGTAGACTAGCAGTTAAG
- Search for sequences in database that match
- AXGXAGXCTAXC
- X stands for dont care
- Seed 101011011101
102Pattern hunter
- A good seed may give you both a higher
sensitivity and higher specificity - You may think 110110110110 is the best seed
- Because mutation in the third position of a codon
often doesnt change the amino acid - Best seed is actually 110100110010101111
- Empirically determined
- How to design such seed is an open problem
- May combine multiple random seeds
103Things weve covered so far
- Global alignment
- Needleman-Wunsch and variants
- Local Alignment
- Smith-Waterman
- Improvement on space and time
- Heuristic algorithms
- BLAST families
- Things we did not cover
- Statistics for sequence alignment
- To handle gaps more accurately affine gap
penalty - Multiple sequence alignment