Title: Exon prediction by Genomic Sequence alignment Burkhard Morgenstern and Oliver Rinner
1 Lecture: Foundations of Bioinformatics (Vorlesung Grundlagen der Bioinformatik)
http://gobics.de/lectures/ss07/grundlagen
3 Sequence alignment in molecular data analysis
Information from a Single Sequence Alone
Multi-Organism High Quality Sequences
(M. Brudno)
4Tools for multiple sequence alignment
- seq1 T Y I M R E A Q Y E
- seq2 T C I V M R E A Y E
- seq3 Y I M Q E V Q Q E
- seq4 Y I A M R E Q Y E
-
5Tools for multiple sequence alignment
- seq1 T Y I - M R E A Q Y E
- seq2 T C I V M R E A - Y E
- seq3 Y - I - M Q E V Q Q E
- seq4 Y I A M R E - Q Y E
-
10Tools for multiple sequence alignment
- seq1 T Y I - M R E A Q Y E
- seq2 T C I V M R E A - Y E
- seq3 Y - I - M Q E V Q Q E
- seq4 Y I A M R E - Q Y E
- Functionally important regions more conserved than non-functional regions
- Local sequence conservation indicates functionality!
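To make "local sequence conservation" concrete, the following minimal sketch scores each alignment column by the fraction of rows that share its most common residue. The function name and the scoring rule are illustrative assumptions, not part of the lecture. The toy rows follow the slide's sequences, but the leading gaps in seq3 and seq4 are an assumed padding so that all rows have equal length.

```python
from collections import Counter


def column_conservation(rows):
    """Fraction of rows sharing the most common residue, per column.

    Gap characters ('-') never count as the shared residue, but they
    do count in the denominator (an assumed, simple convention).
    """
    if len({len(r) for r in rows}) != 1:
        raise ValueError("all alignment rows must have the same length")
    scores = []
    for column in zip(*rows):
        residues = [c for c in column if c != "-"]
        top = Counter(residues).most_common(1)[0][1] if residues else 0
        scores.append(top / len(column))
    return scores


# Toy alignment based on the slide; leading gaps in rows 3 and 4 are assumed.
toy = [
    "TYI-MREAQYE",
    "TCIVMREA-YE",
    "-YI-MQEVQQE",
    "-YIAMRE-QYE",
]

for position, score in enumerate(column_conservation(toy)):
    print(position, round(score, 2))
```

Columns with a score near 1 are the candidates for functionally important positions in the sense of the slide.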
11Tools for multiple sequence alignment
- seq1 T Y I - M R E A Q Y E
- seq2 T C I V M R E A - Y E
- seq3 - Y I - M Q E V Q Q E
- seq4 Y I A M R E - Q Y E
- Astronomical Number of possible alignments!
12Tools for multiple sequence alignment
- seq1 T Y I - M R E A Q Y E
- seq2 T C I V - M R E A Y E
- seq3 - Y I - M Q E V Q Q E
- seq4 Y I A M R E - Q Y E
- Astronomical Number of possible alignments!
13Tools for multiple sequence alignment
- seq1 T Y I - M R E A Q Y E
- seq2 T C I V M R E A - Y E
- seq3 - Y I - M Q E V Q Q E
- seq4 Y I A M R E - Q Y E
- Which one is the best?
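How "astronomical" the number is can be illustrated for the pairwise case. The sketch below counts all pairwise alignments of two sequences of lengths m and n, where each column is a match/mismatch, a gap in the first sequence, or a gap in the second. This count is given by the Delannoy recurrence; the function name is an assumption made for illustration, and alignments that differ only in the order of adjacent gaps are counted separately.

```python
def count_alignments(m, n):
    """Number of pairwise alignments of sequences of lengths m and n.

    Uses the recurrence D(i, j) = D(i-1, j) + D(i, j-1) + D(i-1, j-1)
    with D(i, 0) = D(0, j) = 1, evaluated row by row.
    """
    row = [1] * (n + 1)
    for _ in range(m):
        new = [1]
        for j in range(1, n + 1):
            new.append(row[j] + new[j - 1] + row[j - 1])
        row = new
    return row[n]


for length in (1, 2, 3, 10, 20):
    print(length, count_alignments(length, length))
```

Even for two short sequences the count grows very quickly with length, and for multiple alignments the search space is larger still.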
14Tools for multiple sequence alignment
- Questions in development of alignment programs
- (1) What is a good alignment?
- -> objective function (score)
- (2) How to find a good alignment?
- -> optimization algorithm
- The first question is far more important!
15Tools for multiple sequence alignment
- Most important scoring scheme for multiple alignment
- Sum-of-pairs score for global alignment.
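As an illustration of the sum-of-pairs idea, the sketch below adds up a pairwise score over every pair of rows in every column. The match, mismatch and gap values are assumed toy parameters; real programs use substitution matrices and more elaborate gap penalties.

```python
from itertools import combinations


def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score one pair of characters from the same column (toy values)."""
    if a == "-" and b == "-":
        return 0  # gap against gap carries no score in this sketch
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def sum_of_pairs(rows):
    """Sum the pair scores over all columns and all pairs of rows."""
    if len({len(r) for r in rows}) != 1:
        raise ValueError("all alignment rows must have the same length")
    return sum(
        pair_score(a, b)
        for column in zip(*rows)
        for a, b in combinations(column, 2)
    )


print(sum_of_pairs(["AC-", "ACG", "A-G"]))
```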
16Divide-and-Conquer Alignment (DCA)
- J. Stoye, A. Dress (Bielefeld)
- Approximate optimal global multiple alignment
- Divide sequences into small sub-sequences
- Use MSA to calculate optimal alignment for sub-sequences
- Concatenate sub-alignments
19Tools for multiple sequence alignment
- Problems with traditional approach
- Results depend on gap penalty
- Heuristic guide tree determines alignment; alignment used for phylogeny reconstruction
- Algorithm produces global alignments.
20 First step in sequence comparison: alignment
- global alignment (Needleman and Wunsch, 1970; Clustal W)
- atctaatagttaatactcgtccaagtat
-
- atctgtattactaaacaactggtgctacta
21 First step in sequence comparison: alignment
- global alignment (Needleman and Wunsch, 1970; Clustal W)
- atc--taatagttaat--actcgtccaagtat
-
- atctgtattact-aaacaactggtgctacta-
22 First step in sequence comparison: alignment
- global alignment (Needleman and Wunsch, 1970; Clustal W)
- atc--taatagttaat--actcgtccaagtat
-
- atctgtattact-aaacaactggtgctacta-
- local alignment (Smith and Waterman, 1981)
- atctaatagttaatactcgtccaagtat
-
- gcgtgtattactaaacggttcaatctaacat
24 First step in sequence comparison: alignment
- global alignment (Needleman and Wunsch, 1970; Clustal W)
- atc--taatagttaat--actcgtccaagtat
-
- atctgtattact-aaacaactggtgctacta-
- local alignment (Smith and Waterman, 1981)
- atc--taatagttaatactcgtccaagtat
-
- gcgtgtattact-aaacggttcaatctaacat
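The difference between the two classical formulations can be sketched with one dynamic-programming routine. With local=False it computes a Needleman-Wunsch style global score; with local=True it computes a Smith-Waterman style local score, where cell values are clamped at zero and the best cell anywhere is reported. The linear gap cost and the match/mismatch values are assumptions chosen for brevity; only the score is computed, not the alignment itself.

```python
def alignment_score(a, b, local=False, match=1, mismatch=-1, gap=-1):
    """Optimal global (default) or local alignment score of a and b.

    Uses a linear gap cost and keeps only two rows of the DP matrix.
    """
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(diag, prev[j] + gap, cur[j - 1] + gap)
            if local:
                score = max(score, 0)  # a local alignment may start anywhere
                best = max(best, score)
            cur.append(score)
        prev = cur
    return best if local else prev[-1]


seq1 = "atctaatagttaatactcgtccaagtat"
seq2 = "gcgtgtattactaaacggttcaatctaacat"
print("global:", alignment_score(seq1, seq2))
print("local: ", alignment_score(seq1, seq2, local=True))
```

The global score is forced to account for every residue of both sequences, whereas the local score reflects only the best-matching region.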
25 New question: sequence families with multiple local similarities
Neither local nor global methods applicable
26 New question: sequence families with multiple local similarities
Alignment possible if order conserved
27 The DIALIGN approach
- Morgenstern, Dress, Werner (1996), PNAS 93, 12098-12103
- Combination of global and local methods
- Assemble multiple alignment from gap-free local pair-wise alignments ("fragments")
28The DIALIGN approach
-
- atctaatagttaaactcccccgtgcttag
- cagtgcgtgtattactaacggttcaatcgcg
- caaagagtatcacccctgaattgaataa
-
34The DIALIGN approach
-
- atc------taatagttaaactcccccgtgcttag
- cagtgcgtgtattactaacggttcaatcgcg
- caaagagtatcacccctgaattgaataa
-
35The DIALIGN approach
-
- atc------taatagttaaactcccccgtgcttag
- cagtgcgtgtattactaacggttcaatcgcg
- caaa--gagtatcacccctgaattgaataa
-
36The DIALIGN approach
-
- atc------taatagttaaactcccccgtgcttag
- cagtgcgtgtattactaacggttcaatcgcg
- caaa--gagtatcacc----------cctgaattgaataa
-
37The DIALIGN approach
-
- atc------taatagttaaactcccccgtgc-ttag
- cagtgcgtgtattactaac----------gg-ttcaatcgcg
- caaa--gagtatcacc----------cctgaattgaataa
-
38The DIALIGN approach
Consistency!
-
- atc------taatagttaaactcccccgtgc-ttag
- cagtgcgtgtattactaac----------gg-ttcaatcgcg
- caaa--gagtatcacc----------cctgaattgaataa
-
39The DIALIGN approach
-
- atc------TAATAGTTAaactccccCGTGC-TTag
- cagtgcGTGTATTACTAAc----------GG-TTCAATcgcg
- caaa--GAGTATCAcc----------CCTGaaTTGAATaa
-
40The DIALIGN approach
- Score of an alignment
- Define score of fragment f
- l(f): length of f
- s(f): sum of matches (similarity values)
- P(f): probability to find a fragment with length l(f) and at least s(f) matches in random sequences that have the same length as the input sequences.
- Score: w(f) = -ln P(f)
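The weight definition can be illustrated numerically. The sketch below is a deliberate simplification: it takes P(f) to be the binomial probability that a single random gap-free segment pair of length l(f) shows at least s(f) identical positions, with an assumed per-position match probability of 0.25 for DNA. The dependence of P(f) on the lengths of the input sequences, which the slide mentions, is not modelled here, and the function names are illustrative.

```python
import math


def match_tail_probability(length, matches, p_match=0.25):
    """P(at least `matches` identities among `length` random positions)."""
    return sum(
        math.comb(length, k) * p_match**k * (1 - p_match) ** (length - k)
        for k in range(matches, length + 1)
    )


def fragment_weight(length, matches, p_match=0.25):
    """w(f) = -ln P(f) under the simplified binomial model above."""
    return -math.log(match_tail_probability(length, matches, p_match))


for l, s in [(10, 5), (10, 10), (20, 10)]:
    print(l, s, round(fragment_weight(l, s), 3))
```

Under this model a fragment gets a high weight when its similarity is unlikely to arise by chance, so longer and better-matching fragments score higher.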
41The DIALIGN approach
- Score of an alignment
- Define score of fragment f
- Define score of alignment as sum of scores of involved fragments
- No gap penalty!
42The DIALIGN approach
- Score of an alignment
- Goal in fragment-based alignment approach: find a consistent collection of fragments with maximum sum of weight scores
43The DIALIGN approach
-
- atctaatagttaaaccccctcgtgcttagagatccaaac
- cagtgcgtgtattactaacggttcaatcgcgcacatccgc
-
- Pair-wise alignment
-
44The DIALIGN approach
-
- atctaatagttaaaccccctcgtgcttagagatccaaac
- cagtgcgtgtattactaacggttcaatcgcgcacatccgc
-
- Pair-wise alignment
- recursive algorithm finds optimal chain of fragments.
-
45The DIALIGN approach
-
- ------atctaatagttaaaccccctcgtgcttag-------agatccaaac
- cagtgcgtgtattactaac----------ggttcaatcgcgcacatccgc--
-
- Pair-wise alignment
- recursive algorithm finds optimal chain of fragments.
-
46The DIALIGN approach
-
- ------atctaatagttaaaccccctcgtgcttag-------agatccaaac
- cagtgcgtgtattactaac----------ggttcaatcgcgcacatccgc--
-
- Optimal pairwise alignment: chain of fragments with maximum sum of weights found by dynamic programming
- Standard fragment-chaining algorithm
- Space-efficient algorithm
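The chaining step can be sketched as a simple quadratic dynamic program. This is an illustrative version, not the space-efficient algorithm mentioned on the slide. A fragment is represented here by an assumed tuple (start_a, start_b, length, weight) with length at least 1, and one fragment may precede another only if it ends before the other starts in both sequences.

```python
def best_chain(fragments):
    """Return (total_weight, chain) for the heaviest chain of fragments.

    fragments: iterable of (start_a, start_b, length, weight) tuples.
    """
    # Sorting by end position in sequence a guarantees that every
    # possible predecessor of a fragment appears before it.
    frags = sorted(fragments, key=lambda f: (f[0] + f[2], f[1] + f[2]))
    best = []  # best[k] = (best chain weight ending in k, predecessor)
    for k, (i, j, _length, weight) in enumerate(frags):
        score, pred = weight, None
        for h in range(k):
            pi, pj, plen, _ = frags[h]
            ends_before = pi + plen <= i and pj + plen <= j
            if ends_before and best[h][0] + weight > score:
                score, pred = best[h][0] + weight, h
        best.append((score, pred))
    if not best:
        return 0.0, []
    k = max(range(len(best)), key=lambda idx: best[idx][0])
    total, chain = best[k][0], []
    while k is not None:  # trace the predecessors back to the start
        chain.append(frags[k])
        k = best[k][1]
    return total, chain[::-1]


example = [(0, 0, 3, 2.0), (5, 4, 4, 3.0), (2, 6, 5, 4.0)]
print(best_chain(example))
```

In the example the third fragment has the largest single weight, but it is incompatible with the other two, whose combined weight is higher.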
47The DIALIGN approach
-
- Multiple alignment
- atctaatagttaaactcccccgtgcttag
- cagtgcgtgtattactaacggttcaatcgcg
- caaagagtatcacccctgaattgaataa
-
-
48The DIALIGN approach
-
- Multiple alignment
- atctaatagttaaactcccccgtgcttag
- cagtgcgtgtattactaacggttcaatcgcg
- caaccctgaattgaagagtatcacataa
- (1) Calculate all optimal pair-wise alignments
-
49The DIALIGN approach
-
- Multiple alignment
- atctaatagttaaactcccccgtgcttag
- cagtgcgtgtattactaacggttcaatcgcg
- caaagagtatcacccctgaattgaataa
- (1) Calculate all optimal pair-wise alignments
-
51The DIALIGN approach
-
- Fragments from optimal pair-wise alignments might be inconsistent
-
-
52The DIALIGN approach
-
-
- atctaatagttaaactcccccgtgcttag
- cagtgcgtgtattactaacggttcaatcgcg
- caaagagtatcacccctgaattgaataa
-
55The DIALIGN approach
-
-
- atctaatagttaaactcccccgtgcttag
- cagtgcgtgtattactaacggttcaatcgcg
- caaa--gagtatcacccctgaattgaataa
-
56The DIALIGN approach
-
-
- atc------taatagttaaactcccccgtgcttag
- cagtgcgtgtattactaacggttcaatcgcg
- caaa--gagtatcacccctgaattgaataa
-
57The DIALIGN approach
-
-
- atctaatagttaaactcccccgtgcttag
- cagtgcgtgtattactaacggttcaatcgcg
- caaagagtatcacccctgaattgaataa
-
58The DIALIGN approach
- Fragments from optimal pair-wise alignments might be inconsistent
- Sort fragments according to scores
- Include them one-by-one into growing multiple alignment as long as they are consistent
- (greedy algorithm, comparable to knapsack problem)
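The greedy step can be sketched as follows. This is a brute-force illustration, not the efficient method used in the program. A fragment is an assumed tuple (seq_a, start_a, seq_b, start_b, length, weight) whose positions lie inside the sequences. Consistency is tested by merging aligned sites into classes and checking that the successor relation between classes contains no cycle, which is one way to express that all accepted fragments fit into a single alignment.

```python
from collections import defaultdict, deque


def is_consistent(seq_lengths, fragments):
    """True if all fragments can appear together in one alignment."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the sites that each fragment aligns with each other.
    for s1, p1, s2, p2, length, _weight in fragments:
        for k in range(length):
            parent[find((s1, p1 + k))] = find((s2, p2 + k))

    # Edge from the class of each site to the class of its successor.
    succ, indegree, nodes = defaultdict(set), defaultdict(int), set()
    for s, n in enumerate(seq_lengths):
        for p in range(n):
            nodes.add(find((s, p)))
        for p in range(n - 1):
            u, v = find((s, p)), find((s, p + 1))
            if v not in succ[u]:
                succ[u].add(v)
                indegree[v] += 1

    # Kahn's algorithm: every class is visited only if there is no cycle.
    queue = deque(u for u in nodes if indegree[u] == 0)
    visited = 0
    while queue:
        u = queue.popleft()
        visited += 1
        for v in succ[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    return visited == len(nodes)


def greedy_select(seq_lengths, fragments):
    """Accept fragments by decreasing weight while they stay consistent."""
    accepted = []
    for frag in sorted(fragments, key=lambda f: f[5], reverse=True):
        if is_consistent(seq_lengths, accepted + [frag]):
            accepted.append(frag)
    return accepted


frags = [(0, 0, 1, 2, 3, 5.0), (0, 3, 1, 0, 2, 3.0), (0, 4, 1, 5, 1, 1.0)]
print(greedy_select([6, 6], frags))
```

In the example the second fragment crosses the first one and is rejected, while the third is accepted. Rechecking everything from scratch for each candidate is simple but slow.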
59The DIALIGN approach
-
- atctaatagttaaactcccccgtgcttag
- cagtgcgtgtattactaacggttcaatcgcg
- caaagagtatcacccctgaattgaataa
-
63The DIALIGN approach
-
- atctaatagttaaactcccccgtgcttag
- cagtgcgtgtattactaacggttcaatcgcg
- caaagagtatcacccctgaattgaataa
- Consistency problem
-
65The DIALIGN approach
-
- atctaatagttaaactcccccgtgcttag
- cagtgcgtgtattactaacggttcaatcgcg
- caaagagtatcacccctgaattgaataa
- Upper and lower bounds for alignable positions
-
66The DIALIGN approach
-
- atc------taatagttaaactcccccgtgcttag
- cagtgcgtgtattactaacggttcaatcgcg
- caaa--gagtatcacccctgaattgaataa
- Upper and lower bounds for alignable positions
-
67The DIALIGN approach
-
- atc------taatagt taaactcccccgtgcttag
- cagtgcgtgtattact aacggttcaatcgcg
- caaa--gagtatcacccctgaattgaataa
- Upper and lower bounds for alignable positions
-
68The DIALIGN approach
-
- atc------taata-----gttaaactcccccgtgcttag
- cagtgcgtgtatta-----ctaacggttcaatcgcg
- caaa--gagtatcacccctgaattgaataa
- Upper and lower bounds for alignable positions
-
69The DIALIGN approach
site x = [i,p] (sequence i, position p)
-
- atctaatagttaaactcccccgtgcttag
- cagtgcgtgtattactaacggttcaatcgcg
- caaagagtatcacccctgaattgaataa
- Upper and lower bounds for alignable positions
-
70The DIALIGN approach
Calculate lower bound bl(x,i) and upper bound bu(x,i) for each x and sequence i
-
- atctaatagttaaactcccccgtgcttag
- cagtgcgtgtattactaacggttcaatcgcg
- caaagagtatcacccctgaattgaataa
- Upper and lower bounds for alignable positions
-
71The DIALIGN approach
bl(x,i) and bu(x,i) updated for each new
fragment in alignment
-
- atctaatagttaaactcccccgtgcttag
- cagtgcgtgtattactaacggttcaatcgcg
- caaagagtatcacccctgaattgaataa
- Upper and lower bounds for alignable positions
-
72The DIALIGN approach
-
- Consistency bounds have to be updated for each new fragment that is included in the growing alignment
- Efficient algorithm (Abdeddaim and Morgenstern, 2002)
-
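What the bounds bl(x,i) and bu(x,i) express can be sketched by brute force. This illustration recomputes them from scratch by graph search and is not the efficient update algorithm cited above. Sites are (sequence, position) pairs, aligned_pairs lists the site pairs already aligned, and the sentinel values -1 and the sequence length stand for "no bound"; all names are assumptions made for illustration.

```python
from collections import defaultdict, deque


def consistency_bounds(seq_lengths, aligned_pairs, site):
    """Return (lower, upper) bound lists for `site`, one entry per sequence.

    lower[i]: largest position in sequence i that must come at or before
              `site`; upper[i]: smallest position that must come at or after.
    """
    forward, backward = defaultdict(set), defaultdict(set)

    def add_edge(u, v):
        forward[u].add(v)
        backward[v].add(u)

    for s, n in enumerate(seq_lengths):
        for p in range(n - 1):
            add_edge((s, p), (s, p + 1))  # natural order in a sequence
    for x, y in aligned_pairs:
        add_edge(x, y)  # aligned sites are equivalent,
        add_edge(y, x)  # so they are linked in both directions

    def reachable(start, graph):
        seen, queue = {start}, deque([start])
        while queue:
            u = queue.popleft()
            for v in graph.get(u, ()):
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        return seen

    after = reachable(site, forward)
    before = reachable(site, backward)
    lower = [max((p for t, p in before if t == s), default=-1)
             for s in range(len(seq_lengths))]
    upper = [min((p for t, p in after if t == s), default=n)
             for s, n in enumerate(seq_lengths)]
    return lower, upper


print(consistency_bounds([5, 5], [((0, 2), (1, 3))], (0, 1)))
```

In the example, position 1 of the first sequence can still be aligned only with positions 0 to 2 of the second sequence, because position 2 of the first sequence is already aligned with position 3 of the second. When the two bounds coincide, the site is already aligned with that position.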
73The DIALIGN approach
-
- Advantages of segment-based approach
- Program can produce global and local alignments!
- Sequence families alignable that cannot be
aligned with standard methods
74Program input
-
- Program usage
- > dialign2-2 [options] <input_file>
- <input_file>: multi-sequence file in FASTA format
75Program output
-
- DIALIGN 2.2.1
-
- Program code written by Burkhard Morgenstern and Said Abdeddaim
- e-mail contact: bmorgen_at_gwdg.de
-
- Published research assisted by DIALIGN 2 should cite:
-
- Burkhard Morgenstern (1999).
- DIALIGN 2: improvement of the segment-to-segment
- approach to multiple sequence alignment.
- Bioinformatics 15, 211 - 218.
-
- For more information, please visit the DIALIGN home page at
- http://bibiserv.techfak.uni-bielefeld.de/dialign/
76Program output
- Alignment (DIALIGN format)
-
-
- dog_il4   1  cagg------ ----GTTTGA atctgataca ttgc------ ----------
- bla       1  ctga------ ---------- ---------- --------GC CAAGTGGGAA
- blu       1  ttttgatatg agaaGTGTGA aacaagctat cctatattGC TAAGTGGCAG
-
-              0000000000 0000000000 0000000000 0000000011 1111111111
-
- dog_il4  25  ---------- --ATGGCACT GGGGTGAATG AGGCAGGCAG CAGAATGATC
- bla      17  ggtgtgaata catgggtttc cagtaccttc tgaggtccag agtacc----
- blu      51  ccctggcttt ctATGTGCAC AGAATGGGAG GAAAGTGCCT GCTAGTGAGC
-
-              0000000000 0000000000 0000000000 0000000000 0000000000
-
- dog_il4  63  GTACTGCAGC CCTGAGCTTC CACTGGCCCA TGTTGGTATC CTTGTATTTT
- bla      63  ---------- ---------- ---TTTCCCA TGTGCTCCAT GGTGGAATGG
77The DIALIGN approach
-
- atctaatagttaaactcccccgtgcttag
- cagtgcgtgtattactaacggttcaatcgcg
- caaagagtatcacccctgaattgaataa
-
83The DIALIGN approach
-
- atc------taatagttaaactcccccgtgcttag
- cagtgcgtgtattactaacggttcaatcgcg
- caaagagtatcacccctgaattgaataa
-
84The DIALIGN approach
-
- atc------taatagttaaactcccccgtgcttag
- cagtgcgtgtattactaacggttcaatcgcg
- caaa--gagtatcacccctgaattgaataa
-
85The DIALIGN approach
-
- atc------taatagttaaactcccccgtgcttag
- cagtgcgtgtattactaacggttcaatcgcg
- caaa--gagtatcacc----------cctgaattgaataa
-
86The DIALIGN approach
-
- atc------taatagttaaactcccccgtgcttag
- cagtgcgtgtattactaac----------ggttcaatcgcg
- caaa--gagtatcacc----------cctgaattgaataa
-
87The DIALIGN approach
-
- atc------taatagttaaactcccccgtgc-ttag
- cagtgcgtgtattactaac----------gg-ttcaatcgcg
- caaa--gagtatcacc----------cctgaattgaataa
-
88The DIALIGN approach
-
- atc------TAATAGTTAaactccccCGTGC-TTag------
- cagtgcGTGTATTACTAAc----------GG-TTCAATcgcg
- caaa--GAGTATCAcc----------CCTGaaTTGAATaa--
-
90Alignment of large genomic sequences
- Fragment-based alignment approach useful for alignment of genomic sequences.
- Possible applications
- Detection of regulatory elements
- Identification of pathogenic microorganisms
- Gene prediction
91DIALIGN alignment of human and murine genomic
sequences
92DIALIGN alignment of tomato and Thaliana genomic
sequences
93Alignment of large genomic sequences
Gene-regulatory sites identified by multiple sequence alignment (phylogenetic footprinting)
94Alignment of large genomic sequences
95Performance of long-range alignment programs for
exon discovery (human - mouse comparison)
96Performance of long-range alignment programs for
exon discovery (thaliana - tomato comparison)