CS 6293 Advanced Topics: Current Bioinformatics - PowerPoint PPT Presentation

1 / 103
About This Presentation
Title:

CS 6293 Advanced Topics: Current Bioinformatics

Description:

CS 6293 Advanced Topics: Current Bioinformatics Lectures 3-4: Pair-wise Sequence Alignment Outline Biological background Global sequence alignment Local sequence ... – PowerPoint PPT presentation

Number of Views:74
Avg rating:3.0/5.0
Slides: 104
Provided by: Jianhu3
Learn more at: http://www.cs.utsa.edu
Category:

less

Transcript and Presenter's Notes

Title: CS 6293 Advanced Topics: Current Bioinformatics


1
CS 6293 Advanced Topics Current Bioinformatics
  • Lectures 3-4 Pair-wise Sequence Alignment

2
Outline
  • Biological background
  • Global sequence alignment
  • Local sequence alignment
  • Optional linear-space alignment algorithm
  • Heuristic alignment BLAST

3
Evolution at the DNA level
C
ACGGTGCAGTCACCA
ACGTTGC-GTCCACCA
DNA evolutionary events (sequence
edits) Mutation, deletion, insertion
4
Sequence conservation implies function



next generation
OK



OK



OK



X



X



Still OK?



5
Why sequence alignment?
  • Conserved regions are more likely to be
    functional
  • Can be used for finding genes, regulatory
    elements, etc.
  • Similar sequences often have similar origin and
    function
  • Can be used to predict functions for new genes /
    proteins
  • Sequence alignment is one of the most widely used
    computational tools in biology

6
Global Sequence Alignment
AGGCTATCACCTGACCTCCAGGCCGATGCCC TAGCTATCACGACCGCGG
TCGATTTGCCCGAC
S
T
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC-
-GACCGC--GGTCGATTTGCCCGAC
S
T
  • Definition
  • An alignment of two strings S, T is a pair of
    strings S, T (with spaces) s.t.
  • S T, and (S length of S)
  • removing all spaces in S, T leaves S, T

7
What is a good alignment?
  • Alignment
  • The best way to match the letters of one
    sequence with those of the other
  • How do we define best?

8
S -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- T
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
  • The score of aligning (characters or spaces) x
    y is s (x,y).
  • Score of an alignment
  • An optimal alignment one with max score

9
Scoring Function
  • Sequence edits
  • AGGCCTC
  • Mutations AGGACTC
  • Insertions AGGGCCTC
  • Deletions AGG-CTC
  • Scoring Function
  • Match m AAC
  • Mismatch -s A-A
  • Gap (indel) -d

10
  • Match 2, mismatch -1, gap -1
  • Score 3 x 2 2 x 1 1 x 1 3

11
More complex scoring function
  • Substitution matrix
  • Similarity score of matching two letters a, b
    should reflect the probability of a, b derived
    from the same ancestor
  • It is usually defined by log likelihood ratio
  • Active research area. Especially for proteins.
  • Commonly used PAM, BLOSUM

12
An example substitution matrix
A C G T
A 3 -2 -1 -2
C 3 -2 -1
G 3 -2
T 3
13
How to find an optimal alignment?
  • A naïve algorithm
  • for all subseqs A of S, B of T s.t. A B do
  • align Ai with Bi, 1 i A
  • align all other chars to spaces
  • compute its value
  • retain the max
  • end
  • output the retained alignment

S abcd A cd T wxyz B xz -abc-d
a-bc-d w--xyz -w-xyz
14
Analysis
  • Assume S T n
  • Cost of evaluating one alignment n
  • How many alignments are there
  • pick n chars of S,T together
  • say k of them are in S
  • match these k to the k unpicked chars of T
  • Total time
  • E.g., for n 20, time is gt 240 gt1012 operations

15
Dynamic Programming for sequence alignment
  • Suppose we wish to align
  • x1xM
  • y1yN
  • Let F(i,j) optimal score of aligning
  • x1xi
  • y1yj
  • Scoring Function
  • Match m
  • Mismatch -s
  • Gap (indel) -d

16
Elements of dynamic programming
  • Optimal sub-structures
  • Optimal solutions to the original problem
    contains optimal solutions to sub-problems
  • Overlapping sub-problems
  • Some sub-problems appear in many solutions
  • Memorization and reuse
  • Carefully choose the order that sub-problems are
    solved

17
Optimal substructure
  • If xi is aligned to yj in the optimal
    alignment between x1..M and y1..N, then
  • The alignment between x1..i and y1..j is also
    optimal
  • Easy to prove by contradiction

18
Recursive solution
  • Notice three possible cases
  • xM aligns to yN
  • xM
  • yN
  • 2. xM aligns to a gap
  • xM
  • ?
  • yN aligns to a gap
  • ?
  • yN

m, if xM yN F(M,N)
F(M-1, N-1) -s, if not
F(M,N) F(M-1, N) - d
F(M,N) F(M, N-1) - d
19
Recursive solution
  • Generalize
  • F(i-1, j-1) ?(Xi,Yj)
  • F(i,j) max F(i-1, j) d
  • F(i, j-1) d
  • ?(Xi,Yj) m if Xi Yj, and s otherwise
  • Boundary conditions
  • F(0, 0) 0.
  • F(0, j) ?
  • F(i, 0) ?

-jd y1..j aligned to gaps.
-id x1..i aligned to gaps.
20
What order to fill?
F(0,0)





F(M,N)
i
j
21
What order to fill?
F(0,0)





F(M,N)
22
Example
  • x AGTA m 1
  • y ATA s 1
  • d 1

F(i,j) i 0 1 2 3 4
A G T A

A
T
A
j 0
1
2
3
23
Example
  • x AGTA m 1
  • y ATA s 1
  • d 1

F(i,j) i 0 1 2 3 4
A G T A
0 -1 -2 -3 -4
A -1
T -2
A -3
j 0
1
2
3
24
Example
  • x AGTA m 1
  • y ATA s 1
  • d 1

F(i,j) i 0 1 2 3 4
A G T A
0 -1 -2 -3 -4
A -1 1 0 -1 -2
T -2
A -3
j 0
1
2
3
25
Example
  • x AGTA m 1
  • y ATA s 1
  • d 1

F(i,j) i 0 1 2 3 4
A G T A
0 -1 -2 -3 -4
A -1 1 0 -1 -2
T -2 0 0 1 0
A -3
j 0
1
2
3
26
Example
  • x AGTA m 1
  • y ATA s 1
  • d 1

F(i,j) i 0 1 2 3 4
A G T A
0 -1 -2 -3 -4
A -1 1 0 -1 -2
T -2 0 0 1 0
A -3 -1 -1 0 2
Optimal Alignment F(4,3) 2
j 0
1
2
3
27
Example
  • x AGTA m 1
  • y ATA s 1
  • d 1

F(i,j) i 0 1 2 3 4
A G T A
0 -1 -2 -3 -4
A -1 1 0 -1 -2
T -2 0 0 1 0
A -3 -1 -1 0 2
Optimal Alignment F(4,3) 2 This only tells us
the best score
j 0
1
2
3
28
Trace-back
  • x AGTA m 1
  • y ATA s 1
  • d 1

F(i-1, j-1) ?(Xi,Yj) F(i,j)
max F(i-1, j) d F(i,
j-1) d
F(i,j) i 0 1 2 3 4
A G T A
0 -1 -2 -3 -4
A -1 1 0 -1 -2
T -2 0 0 1 0
A -3 -1 -1 0 2
j 0
1
A
A
2
3
29
Trace-back
  • x AGTA m 1
  • y ATA s 1
  • d 1

F(i-1, j-1) ?(Xi,Yj) F(i,j)
max F(i-1, j) d F(i,
j-1) d
F(i,j) i 0 1 2 3 4
A G T A
0 -1 -2 -3 -4
A -1 1 0 -1 -2
T -2 0 0 1 0
A -3 -1 -1 0 2
j 0
1
T A
T A
2
3
30
Trace-back
  • x AGTA m 1
  • y ATA s 1
  • d 1

F(i-1, j-1) ?(Xi,Yj) F(i,j)
max F(i-1, j) d F(i,
j-1) d
F(i,j) i 0 1 2 3 4
A G T A
0 -1 -2 -3 -4
A -1 1 0 -1 -2
T -2 0 0 1 0
A -3 -1 -1 0 2
j 0
1
G T A
- T A
2
3
31
Trace-back
  • x AGTA m 1
  • y ATA s 1
  • d 1

F(i-1, j-1) ?(Xi,Yj) F(i,j)
max F(i-1, j) d F(i,
j-1) d
F(i,j) i 0 1 2 3 4
A G T A
0 -1 -2 -3 -4
A -1 1 0 -1 -2
T -2 0 0 1 0
A -3 -1 -1 0 2
j 0
1
A G T A
A - T A
2
3
32
Trace-back
  • x AGTA m 1
  • y ATA s 1
  • d 1

F(i-1, j-1) ?(Xi,Yj) F(i,j)
max F(i-1, j) d F(i,
j-1) d
F(i,j) i 0 1 2 3 4
A G T A
0 -1 -2 -3 -4
A -1 1 0 -1 -2
T -2 0 0 1 0
A -3 -1 -1 0 2
Optimal Alignment F(4,3) 2 AGTA A?TA
j 0
1
2
3
33
Using trace-back pointers
  • x AGTA m 1
  • y ATA s 1
  • d 1

F(i,j) i 0 1 2 3 4
A G T A
0 -1 -2 -3 -4
A -1
T -2
A -3
j 0
1
2
3
34
Using trace-back pointers
  • x AGTA m 1
  • y ATA s 1
  • d 1

F(i,j) i 0 1 2 3 4
A G T A
0 -1 -2 -3 -4
A -1 1 0 -1 -2
T -2
A -3
j 0
1
2
3
35
Using trace-back pointers
  • x AGTA m 1
  • y ATA s 1
  • d 1

F(i,j) i 0 1 2 3 4
A G T A
0 -1 -2 -3 -4
A -1 1 0 -1 -2
T -2 0 0 1 0
A -3
j 0
1
2
3
36
Using trace-back pointers
  • x AGTA m 1
  • y ATA s 1
  • d 1

F(i,j) i 0 1 2 3 4
A G T A
0 -1 -2 -3 -4
A -1 1 0 -1 -2
T -2 0 0 1 0
A -3 -1 -1 0 2
j 0
1
2
3
37
Using trace-back pointers
  • x AGTA m 1
  • y ATA s 1
  • d 1

F(i,j) i 0 1 2 3 4
A G T A
0 -1 -2 -3 -4
A -1 1 0 -1 -2
T -2 0 0 1 0
A -3 -1 -1 0 2
j 0
1
2
3
38
Using trace-back pointers
  • x AGTA m 1
  • y ATA s 1
  • d 1

F(i,j) i 0 1 2 3 4
A G T A
0 -1 -2 -3 -4
A -1 1 0 -1 -2
T -2 0 0 1 0
A -3 -1 -1 0 2
j 0
1
2
3
39
Using trace-back pointers
  • x AGTA m 1
  • y ATA s 1
  • d 1

F(i,j) i 0 1 2 3 4
A G T A
0 -1 -2 -3 -4
A -1 1 0 -1 -2
T -2 0 0 1 0
A -3 -1 -1 0 2
j 0
1
2
3
40
Using trace-back pointers
  • x AGTA m 1
  • y ATA s 1
  • d 1

F(i,j) i 0 1 2 3 4
A G T A
0 -1 -2 -3 -4
A -1 1 0 -1 -2
T -2 0 0 1 0
A -3 -1 -1 0 2
Optimal Alignment F(4,3) 2 AGTA A?TA
j 0
1
2
3
41
The Needleman-Wunsch Algorithm
  • Initialization.
  • F(0, 0) 0
  • F(0, j) - j ? d
  • F(i, 0) - i ? d
  • Main Iteration. Filling in scores
  • For each i 1M
  • For each j 1N
  • F(i-1,j) d case 1
  • F(i, j) max F(i, j-1) d
    case 2
  • F(i-1, j-1) s(xi, yj) case 3
  • UP, if case 1
  • Ptr(i,j) LEFT if case 2
  • DIAG if case 3
  • Termination. F(M, N) is the optimal score, and
    from Ptr(M, N) can trace back optimal alignment

42
Complexity
  • Time
  • O(NM)
  • Space
  • O(NM)
  • Linear-space algorithms do exist (with the same
    time complexity)

43
Equivalent graph problem
S1
G
A
T
A
(0,0)



? a gap in the 2nd sequence ? a gap in the 1st
sequence match / mismatch
1
1
A
S2
1
T
Value on vertical/horizontal line -d Value on
diagonal m or -s
1
1
A
(3,4)
  • Number of steps length of the alignment
  • Path length alignment score
  • Optimal alignment find the longest path from (0,
    0) to (3, 4)
  • General longest path problem cannot be found with
    DP. Longest path on this graph can be found by DP
    since no cycle is possible.

44
Question
  • If we change the scoring scheme, will the optimal
    alignment be changed?
  • Old Match 1, mismatch gap -1
  • New match 2, mismatch gap 0
  • New Match 2, mismatch gap -2?

45
Question
  • What kind of alignment is represented by these
    paths?

A
A
A
A
A










B C
B C
B C
B C
B C
A- BC
A-- -BC
--A BC-
-A- B-C
-A BC
Alternating gaps are impossible if s gt -2d
46
A variant of the basic algorithm
  • Scoring scheme m s d 1
  • Seq1 CAGCA-CTTGGATTCTCGG
  • Seq2 ---CAGCGTGG--------
  • Seq1 CAGCACTTGGATTCTCGG
  • Seq2 CAGC-----G-T----GG
  • The first alignment may be biologically more
    realistic in some cases (e.g. if we know s2 is a
    subsequence of s1)

Score -7
Score -2
47
A variant of the basic algorithm
  • Maybe it is OK to have an unlimited of gaps in
    the beginning and end

----------CTATCACCTGACCTCCAGGCCGATGCCCCTTCCGGC GCG
AGTTCATCTATCAC--GACCGC--GGTCG--------------
  • Then, we dont want to penalize gaps in the ends

48
The Overlap Detection variant
  • Changes
  • Initialization
  • For all i, j,
  • F(i, 0) 0
  • F(0, j) 0
  • Termination
  • maxi F(i, N)
  • FOPT max maxj F(M, j)

x1 xM
yN y1
49
Different types of overlaps
x
x
y
y
50
The local alignment problem
  • Given two strings X x1xM,
  • Y y1yN
  • Find substrings x, y whose similarity (optimal
    global alignment value) is maximum
  • e.g. X abcxdex X cxde
  • Y xxxcde Y c-de

x
y
51
Why local alignment
  • Conserved regions may be a small part of the
    whole
  • Global alignment might miss them if flanking
    junk outweighs similar regions
  • Genes are shuffled between genomes

C
D
B
A
D
A
B
C
52
Naïve algorithm
  • for all substrings X of X and Y of Y
  • Align X Y via dynamic programming
  • Retain pair with max value
  • end
  • Output the retained pair
  • Time O(n2) choices for A, O(m2) for B, O(nm) for
    DP, so O(n3m3 ) total.

53
Reminder
  • The overlap detection algorithm
  • We do not give penalty to gaps at either end

Free gap
Free gap
54
The local alignment idea
  • Do not penalize the unaligned regions (gaps or
    mismatches)
  • The alignment can start anywhere and ends
    anywhere
  • Strategy whenever we get to some low similarity
    region (negative score), we restart a new
    alignment
  • By resetting alignment score to zero

55
The Smith-Waterman algorithm
  • Initialization F(0, j) F(i, 0) 0
  • 0
  • F(i 1, j) d
  • F(i, j 1) d
  • F(i 1, j 1) ?(xi, yj)

Iteration F(i, j) max
56
The Smith-Waterman algorithm
  • Termination
  • If we want the best local alignment
  • FOPT maxi,j F(i, j)
  • If we want all local alignments scoring gt t
  • For all i, j find F(i, j) gt t, and trace back

57
x x x c d e
0 0 0 0 0 0 0
a 0
b 0
c 0
x 0
d 0
e 0
x 0
Match 2 Mismatch -1 Gap -1
58
x x x c d e
0 0 0 0 0 0 0
a 0 0 0 0 0 0 0
b 0 0 0 0 0 0 0
c 0
x 0
d 0
e 0
x 0
Match 2 Mismatch -1 Gap -1
59
x x x c d e
0 0 0 0 0 0 0
a 0 0 0 0 0 0 0
b 0 0 0 0 0 0 0
c 0 0 0 0 2 1 0
x 0
d 0
e 0
x 0
Match 2 Mismatch -1 Gap -1
60
x x x c d e
0 0 0 0 0 0 0
a 0 0 0 0 0 0 0
b 0 0 0 0 0 0 0
c 0 0 0 0 2 1 0
x 0 2 2 2 1 0 0
d 0
e 0
x 0
Match 2 Mismatch -1 Gap -1
61
x x x c d e
0 0 0 0 0 0 0
a 0 0 0 0 0 0 0
b 0 0 0 0 0 0 0
c 0 0 0 0 2 1 0
x 0 2 2 2 1 0 0
d 0 1 1 1 1 3 2
e 0
x 0
Match 2 Mismatch -1 Gap -1
62
x x x c d e
0 0 0 0 0 0 0
a 0 0 0 0 0 0 0
b 0 0 0 0 0 0 0
c 0 0 0 0 2 1 0
x 0 2 2 2 1 0 0
d 0 1 1 1 1 3 2
e 0 0 0 0 0 2 5
x 0
Match 2 Mismatch -1 Gap -1
63
x x x c d e
0 0 0 0 0 0 0
a 0 0 0 0 0 0 0
b 0 0 0 0 0 0 0
c 0 0 0 0 2 1 0
x 0 2 2 2 1 1 0
d 0 1 1 1 1 3 2
e 0 0 0 0 0 2 5
x 0 2 2 2 1 1 4
Match 2 Mismatch -1 Gap -1
64
Trace back
x x x c d e
0 0 0 0 0 0 0
a 0 0 0 0 0 0 0
b 0 0 0 0 0 0 0
c 0 0 0 0 2 1 0
x 0 2 2 2 1 1 0
d 0 1 1 1 1 3 2
e 0 0 0 0 0 2 5
x 0 2 2 2 1 1 4
Match 2 Mismatch -1 Gap -1
65
Trace back
x x x c d e
0 0 0 0 0 0 0
a 0 0 0 0 0 0 0
b 0 0 0 0 0 0 0
c 0 0 0 0 2 1 0
x 0 2 2 2 1 1 0
d 0 1 1 1 1 3 2
e 0 0 0 0 0 2 5
x 0 2 2 2 1 1 4
Match 2 Mismatch -1 Gap -1
cxde c-de
x-de xcde
66
  • No negative values in local alignment DP array
  • Optimal local alignment will never have a gap on
    either end
  • Local alignment Smith-Waterman
  • Global alignment Needleman-Wunsch

67
Analysis
  • Time
  • O(MN) for finding the best alignment
  • Time to report all alignments depends on the
    number of sub-opt alignments
  • Memory
  • O(MN)
  • O(MN) possible

68
  • Optional more efficient alignment algorithms

69
  • Given two sequences of length M, N
  • Time O(MN)
  • ok
  • Space O(MN)
  • bad
  • 1Mb seq x 1Mb seq 1000G memory
  • Can we do better?

70
Bounded alignment
  • Good alignment should appear near the diagonal





















71
Bounded Dynamic Programming
  • If we know that x and y are very similar
  • Assumption gaps(x, y) lt k
  • xi
  • Then, implies i j lt k
  • yj

72
Bounded Dynamic Programming
  • Initialization
  • F(i,0), F(0,j) undefined for i, j gt k
  • Iteration
  • For i 1M
  • For j max(1, i k)min(N, ik)
  • F(i 1, j 1) ?(xi, yj)
  • F(i, j) max F(i, j 1) d, if j gt i k
  • F(i 1, j) d, if j lt i k
  • Termination same

x1 xM
yN y1
k
73
Analysis
  • Time O(kM) ltlt O(MN)
  • Space O(kM) with some tricks

gt
M
M
2k
2k
74




(No Transcript)
75
  • Given two sequences of length M, N
  • Time O(MN)
  • ok
  • Space O(MN)
  • bad
  • 1mb seq x 1mb seq 1000G memory
  • Can we do better?

76
Linear space algorithm
  • If all we need is the alignment score but not the
    alignment, easy!

We only need to keep two rows (You only need one
row, with a little trick)










But how do we get the alignment?
77
Linear space algorithm
  • When we finish, we know how we have aligned the
    ends of the sequences

XM YN










Naïve idea Repeat on the smaller subproblem
F(M-1, N-1) Time complexity O((MN)(MN))
78
(0, 0)
M/2
(M, N)
Key observation optimal alignment (longest path)
must use an intermediate point on the M/2-th row.
Call it (M/2, k), where k is unknown.
79
(0,0)






(3,2)
(3,4)
(3,6)
(3,0)
(6,6)
  • Longest path from (0, 0) to (6, 6) is max_k
    (LP(0,0,3,k) LP(6,6,3,k))

80
Hirschbergs idea
  • Divide and conquer!

Y










Forward algorithm Align x1x2xM/2 with Y
X
M/2
F(M/2, k) represents the best alignment between
x1x2xM/2 and y1y2yk
81
Backward Algorithm
Y










Backward algorithm Align reverse(xM/21xM) with
reverse(Y)
X
M/2
B(M/2, k) represents the best alignment between
reverse(xM/21xM) and reverse(ykyk1yN )
82
Linear-space alignment
  • Using 2 (4) rows of space, we can compute
  • for k 1N, F(M/2, k), B(M/2, k)





















M/2
83
Linear-space alignment
  • Now, we can find k maximizing F(M/2, k) B(M/2,
    k)
  • Also, we can trace the path exiting column M/2
    from k











Conclusion In O(NM) time, O(N) space, we found
optimal alignment path at row M/2
84
Linear-space alignment
  • Iterate this procedure to the two sub-problems!

M/2
k
M/2
N-k
85
Analysis
  • Memory O(N) for computation, O(NM) to store the
    optimal alignment
  • Time
  • MN for first iteration
  • k M/2 (N-k) M/2 MN/2 for second

k
M/2
M/2
N-k
86
MN
MN/2
MN/4
MN MN/2 MN/4 MN/8 MN (1 ½ ¼
1/8 1/16 ) 2MN O(MN)
MN/8
87
Heuristic Local Sequence Alignment BLAST
88
State of biological databases
  • Sequenced Genomes
  • Human 3?109 Yeast 1.2?107
  • Mouse 2.7?109 Rat 2.6?109
  • Neurospora 4?107 Fugu fish 3.3?108
  • Tetraodon 3?108 Mosquito 2.8?108
  • Drosophila 1.2?108 Worm 1.0?108
  • Rice 1.0?109 Arabidopsis 1.2?108
  • sea squirts 1.6?108
  • Current rate of sequencing (before new-generation
    sequencing)
  • 4 big labs ? 3 ?109 bp /year/lab
  • 10s small labs
  • Private sectors
  • With new-generation sequencing
  • Easily generating billions of reads daily

89
Some useful applications of alignments
  • Given a newly discovered gene,
  • - Does it occur in other species?
  • Assume we try Smith-Waterman

Our new gene
104
The entire genomic database
1010 - 1011
May take several weeks!
90
Some useful applications of alignments
  • Given a newly sequenced organism,
  • - Which subregions align with other organisms?
  • - Potential genes
  • - Other functional units
  • Assume we try Smith-Waterman

Our newly sequenced mammal
3?109
The entire genomic database
1010 - 1011
gt 1000 years ???
91
BLAST
  • Basic Local Alignment Search Tool
  • Altschul, Gish, Miller, Myers, Lipman, J Mol Biol
    1990
  • The most widely used bioinformatics tool
  • Which is better long mediocre match or a few
    nearby, short, strong matches with the same total
    score?
  • Score-wise, exactly equivalent
  • Biologically, later may be more interesting, is
    common
  • At least, if must miss some, rather miss the
    former
  • BLAST is a heuristic algorithm emphasizing the
    later
  • speed/sensitivity tradeoff BLAST may miss
    former, but gains greatly in speed

92
BLAST
  • Available at NCBI (National Center for
    Biotechnology Information) for download and
    online use. http//blast.ncbi.nlm.nih.gov/
  • Along with many sequence databases
  • Main idea
  • Construct a dictionary of all the words in the
    query
  • Initiate a local alignment for each word match
    between query and DB
  • Running Time O(MN)
  • However, orders of magnitude faster than
    Smith-Waterman

query
DB
93
BLAST ? Original Version
  • Dictionary
  • All words of length k (11 for DNA, 3 for
    proteins)
  • Alignment initiated between words of alignment
    score ? T (typically T k)
  • Alignment
  • Ungapped extensions until score
  • below statistical threshold
  • Output
  • All local alignments with score
  • gt statistical threshold


query

scan
DB
query
94
BLAST ? Original Version
A C G A A G T A A G G T C
C A G T
Example k 4, T 4 The matching word GGTC
initiates an alignment Extension to the left and
right with no gaps until alignment falls lt
50 Output GTAAGGTCC GTTAGGTCC

















C C C T T C C T G G A T T
G C G A
95
Gapped BLAST
A C G A A G T A A G G T C
C A G T
  • Added features
  • Pairs of words can initiate alignment
  • Extensions with gaps in a band around anchor
  • Output
  • GTAAGGTCCAGT
  • GTTAGGTC-AGT


















C T G A T C C T G G A T T
G C G A
96
Example
  • Query gattacaccccgattacaccccgattaca (29 letters)
    2 mins
  • Database All GenBankEMBLDDBJPDB sequences
    (but no EST, STS, GSS, or phase 0, 1 or 2 HTGS
    sequences) 1,726,556 sequences 8,074,398,388
    total letters
  • gtgi28570323gbAC108906.9 Oryza sativa
    chromosome 3 BAC OSJNBa0087C10 genomic sequence,
    complete sequence Length 144487 Score 34.2
    bits (17), Expect 4.5 Identities 20/21 (95)
    Strand Plus / Plus
  • Query 4 tacaccccgattacaccccga 24
  • Sbjct 125138 tacacccagattacaccccga 125158
  • Score 34.2 bits (17),
  • Expect 4.5 Identities 20/21 (95) Strand
    Plus / Plus
  • Query 4 tacaccccgattacaccccga 24
  • Sbjct 125104 tacacccagattacaccccga 125124
  • gtgi28173089gbAC104321.7 Oryza sativa
    chromosome 3 BAC OSJNBa0052F07 genomic sequence,
    complete sequence Length 139823 Score 34.2
    bits (17), Expect 4.5 Identities 20/21 (95)
    Strand Plus / Plus

97
Example
  • Query Human atoh enhancer, 179 letters 1.5
    min
  • Result 57 blast hits
  • gi7677270gbAF218259.1AF218259 Homo sapiens
    ATOH1 enhanc... 355 1e-95
  • gi22779500gbAC091158.11 Mus musculus Strain
    C57BL6/J ch... 264 4e-68
  • gi7677269gbAF218258.1AF218258 Mus musculus
    Atoh1 enhanc... 256 9e-66
  • gi28875397gbAF467292.1 Gallus gallus CATH1
    (CATH1) gene... 78 5e-12
  • gi27550980embAL807792.6 Zebrafish DNA
    sequence from clo... 54 7e-05
  • gi22002129gbAC092389.4 Oryza sativa
    chromosome 10 BAC O... 44 0.068
  • gi22094122refNM_013676.1 Mus musculus
    suppressor of Ty ... 42 0.27
  • gi13938031gbBC007132.1 Mus musculus, Similar
    to suppres... 42 0.27
  • gi7677269gbAF218258.1AF218258 Mus musculus
    Atoh1 enhancer sequence Length 1517
  • Score 256 bits (129), Expect 9e-66
    Identities 167/177 (94),
  • Gaps 2/177 (1) Strand Plus / Plus
  • Query 3 tgacaatagagggtctggcagaggctcctggccgcggt
    gcggagcgtctggagcggagca 62

  • Sbjct 1144 tgacaatagaggggctggcagaggctcctggccccggt
    gcggagcgtctggagcggagca 1203

98
Different types of BLAST
  • blastn search nucleic acid databases
  • blastp search protein databases
  • blastx you give a nucleic acid sequence, search
    protein databases
  • tblastn you give a protein sequence, search
    nucleic acid databases
  • tblastx you give a nucleic sequence, search
    nucleic acid database, implicitly translate both
    into protein sequences

99
BLAST cons and pros
  • Advantages
  • Fast!!!!
  • A few minutes to search a database of 1011 bases
  • Disadvantages
  • Sensitivity may be low
  • Often misses weak homologies
  • New improvement
  • Make it even faster
  • Mainly for aligning very similar sequences or
    really long sequences
  • E.g. whole genome vs whole genome
  • Make it more sensitive
  • PSI-BLAST iteratively add more homologous
    sequences
  • PatternHunter discontinuous seeds

100
Variants of BLAST
  • NCBI-BLAST most widely used version
  • WU-BLAST (Washington University BLAST) another
    popular version
  • Optimized, added features
  • MEGABLAST
  • Optimized to align very similar sequences. Linear
    gap penalty
  • BLAT Blast-Like Alignment Tool
  • BlastZ
  • Optimized for aligning two genomes
  • PSI-BLAST
  • BLAST produces many hits
  • Those are aligned, and a pattern is extracted
  • Pattern is used for next search above steps
    iterated
  • Sensitive for weak homologies
  • Slower

101
Pattern hunter
  • Instead of exact matches of consecutive matches
    of k-mer, we can
  • look for discontinuous matches
  • My query sequence looks like
  • ACGTAGACTAGCAGTTAAG
  • Search for sequences in database that match
  • AXGXAGXCTAXC
  • X stands for dont care
  • Seed 101011011101

102
Pattern hunter
  • A good seed may give you both a higher
    sensitivity and higher specificity
  • You may think 110110110110 is the best seed
  • Because mutation in the third position of a codon
    often doesnt change the amino acid
  • Best seed is actually 110100110010101111
  • Empirically determined
  • How to design such seed is an open problem
  • May combine multiple random seeds

103
Things weve covered so far
  • Global alignment
  • Needleman-Wunsch and variants
  • Local Alignment
  • Smith-Waterman
  • Improvement on space and time
  • Heuristic algorithms
  • BLAST families
  • Things we did not cover
  • Statistics for sequence alignment
  • To handle gaps more accurately affine gap
    penalty
  • Multiple sequence alignment
Write a Comment
User Comments (0)
About PowerShow.com