Title: Bioinformatics PhD. Course
1 Bioinformatics PhD. Course
Summary (approximate)
- 1. Biological introduction
- 2. Comparison of short sequences (< 10,000 bp)
- 3. Comparison of large sequences (up to 250,000,000 bp)
- 5. Efficient data search structures and algorithms
2 2. Comparison of short sequences (< 10,000 bp)
Summary (more or less)
- 2.1 Dot matrix
- 2.2 Pairwise alignment.
- 2.3 Hash algorithms.
- 2.4 Multiple alignment.
3 2.2 Pairwise alignment
Given two DNA sequences A = (a1a2...an) and B = (b1b2...bm) over the alphabet {a, c, t, g}, we say that A' and B' over {a, c, t, g, -} are aligned iff
- A' and B' become A and B if the gaps (-) are removed.
- |A'| = |B'|.
- For all i, it is not possible that a'i = b'i = -.
How many alignments of two sequences exist?
Which is the best alignment?
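The three conditions above can be checked mechanically. The following is a minimal sketch, not part of the original slides; the function name `is_alignment` and the use of lowercase strings for sequences are assumptions made for illustration.

```python
GAP = "-"
DNA = set("acgt")


def is_alignment(a, b, a_aligned, b_aligned):
    """Return True if (a_aligned, b_aligned) is an alignment of a and b."""
    # Only DNA letters and gaps may appear in the gapped sequences.
    allowed = DNA | {GAP}
    if not set(a_aligned) <= allowed or not set(b_aligned) <= allowed:
        return False
    # Removing the gaps must give back the original sequences.
    if a_aligned.replace(GAP, "") != a or b_aligned.replace(GAP, "") != b:
        return False
    # Both gapped sequences must have the same length.
    if len(a_aligned) != len(b_aligned):
        return False
    # No column may hold a gap in both sequences.
    return all(x != GAP or y != GAP for x, y in zip(a_aligned, b_aligned))


print(is_alignment("ac", "ctac", "--ac", "ctac"))  # True
print(is_alignment("ac", "c", "ac-", "-c-"))       # False: last column is gap/gap
```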
4 2.2 Number of alignments
Given two DNA sequences A = (a1a2...an) and B = (b1b2...bm), there are
#(a1a2...an, b1b2...bm) =
    #(a1a2...an-1, b1b2...bm)      those that end with (an, -)
  + #(a1a2...an, b1b2...bm-1)      those that end with (-, bm)
  + #(a1a2...an-1, b1b2...bm-1)    those that end with (an, bm)
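The recurrence can be evaluated bottom-up over prefix lengths. This is a small sketch rather than anything from the slides; the name `count_alignments` is hypothetical, and the base case (an empty prefix can be aligned in exactly one way) is an assumption taken from the row and column of 1s in the slide's table.

```python
def count_alignments(n, m):
    """Number of alignments of a sequence of length n with one of length m."""
    # count[i][j] = number of alignments of the first i letters of A
    # with the first j letters of B. Row 0 and column 0 stay at 1.
    count = [[1] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            count[i][j] = (
                count[i - 1][j]        # alignments that end with (ai, -)
                + count[i][j - 1]      # alignments that end with (-, bj)
                + count[i - 1][j - 1]  # alignments that end with (ai, bj)
            )
    return count[n][m]


print(count_alignments(1, 1))  # 3
print(count_alignments(1, 3))  # 7
print(count_alignments(2, 2))  # 13
```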
8 2.2 Number of alignments
Given two DNA sequences A = (a1a2...an) and B = (b1b2...bm), then
#(a1a2...an, b1b2...bm) =
    #(a1a2...an-1, b1b2...bm)      those that end with (an, -)
  + #(a1a2...an, b1b2...bm-1)      those that end with (-, bm)
  + #(a1a2...an-1, b1b2...bm-1)    those that end with (an, bm)
Number of alignments by prefix length (first row and first column: empty prefix):
    1   1   1   1
    1   3   5   7
    1   5
    1   7
But what is the asymptotic value?
9 2.2 Asymptotic value
As
#(a1a2...an, b1b2...bm) > C(n+m, n)   (binomial coefficient)
and
n! ≈ n^n e^(-n)   (Stirling approximation)
then, for m = n and large n,
#(a1a2...an, b1b2...bn) > 2^(2n)
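A worked version of this step, under one assumption that is not legible on the slide: that the lower bound used is the binomial coefficient C(2n, n), which counts the alignments in which every column contains a gap. With the simplified Stirling formula quoted above:

```latex
\#(a_1 a_2 \dots a_n,\; b_1 b_2 \dots b_n)
  \;>\; \binom{2n}{n}
  \;=\; \frac{(2n)!}{(n!)^{2}}
  \;\approx\; \frac{(2n)^{2n}\, e^{-2n}}{\left(n^{n} e^{-n}\right)^{2}}
  \;=\; \frac{2^{2n}\, n^{2n}\, e^{-2n}}{n^{2n}\, e^{-2n}}
  \;=\; 2^{2n}
```

With the full Stirling formula the binomial coefficient is closer to 2^(2n) / sqrt(pi n), so 2^(2n) is best read as a growth rate: the number of alignments is exponential in n.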
10 2.2 Best alignment
How can an alignment be scored?
catcactactgacgactatcgtagcgcggctatacatctacgccaa-ctac-t-gtgtagatcgccgg
c-tgactgc--acgactatcgt-attgcggctacacactacgcacaactactgtatgtcgc-cgg----
Then we assign a score to each case (match, mismatch, gap), for example 1, -1, -2.
How can the best alignment be found?
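To make the scoring concrete, here is a short sketch that adds up the column scores of a given alignment. It assumes that the three example values 1, -1, -2 stand for match, mismatch and gap respectively; the function name `score_alignment` is hypothetical.

```python
MATCH, MISMATCH, GAP_SCORE = 1, -1, -2  # assumed meaning of "1, -1, -2"


def score_alignment(a_aligned, b_aligned):
    """Sum the per-column scores of two gapped sequences of equal length."""
    if len(a_aligned) != len(b_aligned):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for x, y in zip(a_aligned, b_aligned):
        if x == "-" or y == "-":
            total += GAP_SCORE   # one of the two letters faces a gap
        elif x == y:
            total += MATCH       # identical letters
        else:
            total += MISMATCH    # different letters
    return total


# First 22 columns of the example alignment on this slide:
# 17 matches, 2 mismatches and 3 gaps.
print(score_alignment("catcactactgacgactatcgt",
                      "c-tgactgc--acgactatcgt"))  # 9
```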
11 2.2 Edit distance and alignment of strings
The best alignment of two strings is related to the edit distance, first discussed in 1966...
The most efficient algorithm was proposed in 1968 and in 1970, using the technique called dynamic programming.
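As an illustration of the dynamic programming technique, the sketch below computes a unit-cost edit distance, where each insertion, deletion or substitution costs 1. The slide gives no definition, so the cost model and the name `edit_distance` are assumptions.

```python
def edit_distance(s, t):
    """Unit-cost edit distance between strings s and t."""
    # previous[j] holds the distance between the prefix of s processed
    # so far and t[:j]; only two rows are kept at any time.
    previous = list(range(len(t) + 1))
    for i, x in enumerate(s, start=1):
        current = [i]  # distance from s[:i] to the empty string
        for j, y in enumerate(t, start=1):
            current.append(min(
                previous[j] + 1,              # delete x
                current[j - 1] + 1,           # insert y
                previous[j - 1] + (x != y),   # substitute x by y, or keep
            ))
        previous = current
    return previous[-1]


print(edit_distance("kitten", "sitting"))  # 3
```

Each cell depends only on three neighbours that were computed earlier, which is the same pattern the alignment matrix in the following slides uses.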
12 2.2 Best alignment
           C   T   A   C   T   A   C   T   A   C   G   T

   A
   C
   T
   G
   A
14 2.2 Best alignment
           C   T   A   C   T   A   C   T   A   C   G   T

   A
   C                       *
   T
   G
   A
The marked cell (*) contains the score of the best alignment of AC and CTACT.
15 2.2 Best alignment
           C   T   A   C   T   A   C   T   A   C   G   T
       0   ?
   A
   C
   T
   G
   A
16 2.2 Best alignment
           C   T   A   C   T   A   C   T   A   C   G   T
       0  -2   ?
   A
   C
   T
   G
   A
Alignment for the -2 cell:
-
C
17 2.2 Best alignment
           C   T   A   C   T   A   C   T   A   C   G   T
       0  -2  -4   ?
   A
   C
   T
   G
   A
Alignment for the -4 cell:
--
CT
18 2.2 Best alignment
           C   T   A   C   T   A   C   T   A   C   G   T
       0  -2  -4  -6  -8 ...
   A
   C
   T
   G
   A
Alignment for the sixth cell of the row:
------
CTACTA
19 2.2 Best alignment
           C   T   A   C   T   A   C   T   A   C   G   T
       0  -2  -4  -6  -8 ...
   A   ?
   C   ?
   T   ?
   G
   A
20 2.2 Best alignment
           C   T   A   C   T   A   C   T   A   C   G   T
       0  -2  -4  -6  -8 ...
   A  -2
   C  -4
   T  -6
   G
   A
Alignment for the -6 cell:
ACT
---
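The first row and first column walked through above can be written down directly: aligning a prefix of length k against the empty prefix needs k gaps. A minimal sketch, assuming a gap score of -2 as in the slides; the name `init_matrix` is hypothetical.

```python
GAP_SCORE = -2  # gap score used in the slides' example


def init_matrix(rows, cols):
    """Score matrix with only the first row and first column filled in."""
    score = [[None] * (len(cols) + 1) for _ in range(len(rows) + 1)]
    for j in range(len(cols) + 1):
        score[0][j] = j * GAP_SCORE  # j letters of cols against j gaps
    for i in range(len(rows) + 1):
        score[i][0] = i * GAP_SCORE  # i letters of rows against i gaps
    return score


matrix = init_matrix("ACTGA", "CTACTACTACGT")
print(matrix[0][:5])                    # [0, -2, -4, -6, -8]
print([row[0] for row in matrix][:4])   # [0, -2, -4, -6]
```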
21 2.2 Best alignment
           C   T   A   C   T   A   C   T   A   C   G   T
       0  -2  -4  -6  -8 ...
   A  -2
   C  -4
   T  -6
   G
   A
s(AC,CTAC) = max of
    s(AC,CTA) - 2
    s(A,CTA) + 1
    s(A,CTAC) - 2
BA(AC,CTAC) = best alignment of AC and CTAC
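The cell rule above extends to the whole matrix. The following sketch fills every cell, assuming scores of 1 for a match, -1 for a mismatch and -2 for a gap; the function name `alignment_scores` and its parameters are hypothetical.

```python
def alignment_scores(a, b, match=1, mismatch=-1, gap=-2):
    """Matrix s where s[i][j] is the best score for a[:i] against b[:j]."""
    s = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for j in range(1, len(b) + 1):
        s[0][j] = j * gap            # empty prefix of a against b[:j]
    for i in range(1, len(a) + 1):
        s[i][0] = i * gap            # a[:i] against empty prefix of b
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            s[i][j] = max(
                s[i][j - 1] + gap,       # b[j-1] faces a gap
                s[i - 1][j - 1] + pair,  # a[i-1] faces b[j-1]
                s[i - 1][j] + gap,       # a[i-1] faces a gap
            )
    return s


s = alignment_scores("AC", "CTAC")
# s(AC,CTAC) = max(s(AC,CTA) - 2, s(A,CTA) + 1, s(A,CTAC) - 2)
print(s[2][3], s[1][3], s[1][4], s[2][4])  # -4 -3 -5 -2
```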
22 Best alignment
Given the maximum score, how can the best alignment be found?
- Quadratic cost in space and time
- Sequences up to 10,000 bp in length
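One common answer is to keep the whole matrix and walk back from the last cell, at each step choosing a neighbour that explains the current score. The sketch below does this under the same assumed scores (1, -1, -2); the name `global_alignment` is hypothetical, and when several alignments are optimal it returns only one of them.

```python
def global_alignment(a, b, match=1, mismatch=-1, gap=-2):
    """Return (best score, gapped a, gapped b) for a global alignment."""
    n, m = len(a), len(b)
    # Fill the score matrix: quadratic in time and in space.
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for j in range(1, m + 1):
        s[0][j] = j * gap
    for i in range(1, n + 1):
        s[i][0] = i * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            s[i][j] = max(s[i - 1][j - 1] + pair,
                          s[i - 1][j] + gap,
                          s[i][j - 1] + gap)
    # Trace back from the bottom-right cell to the top-left cell.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and s[i][j] == s[i - 1][j - 1] + pair:
            top.append(a[i - 1]); bottom.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and s[i][j] == s[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1])
            j -= 1
    return s[n][m], "".join(reversed(top)), "".join(reversed(bottom))


print(global_alignment("AC", "CTAC"))  # (-2, '--AC', 'CTAC')
```

Storing the full matrix is what makes the space cost quadratic, which is consistent with the length limit quoted above.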
23 2.2 Best alignment
- Connect to http://alggen.lsi.upc.es/docencia/ember/lepa/Tfc1.htm and use the global method.