Title: Annotation and Alignment of the Drosophila Genomes
4 Genes or Regulation?
- 10,516 putative orthologs have been identified
as a core gene set conserved over 25-55 million
years (Myr) since the pseudoobscura/melanogaster
divergence
- Cis-regulatory sequences are more conserved
than random and nearby sequences between the
species, but the difference is slight, suggesting
that the evolution of cis-regulatory elements is
flexible
Richards et al., Comparative genome sequencing of
Drosophila pseudoobscura: Chromosomal, gene, and
cis-element evolution, Genome Res., Jan 2005.
5 Genes or Regulatory Elements?
- 10,516 10,867 putative orthologs have been
identified as a core gene set conserved over
25-55 million years (Myr) since the
pseudoobscura/melanogaster divergence
- Cis-regulatory sequences are more conserved
than random and nearby sequences between the
species, but the difference is slight, suggesting
that the evolution of cis-regulatory elements is
flexible
Richards et al., Comparative genome sequencing of
Drosophila pseudoobscura: Chromosomal, gene, and
cis-element evolution, Genome Res., Jan 2005.
6 B. P. England, U. Heberlein, R. Tjian. Purified
Drosophila transcription factor, Adh distal
factor-1 (Adf-1), binds to sites in several
Drosophila promoters and activates transcription,
J Biol Chem 1990.
7 S. Chatterji and L. Pachter, GeneMapper:
Reference based annotation with GeneMapper, 2005.
9 http://rana.lbl.gov/drosophila/
10 Alignment of coding sequence
DroAna_20041206_  GTCGCTCAACCAGCATTTGCAAAAGTCGCAGAACTTGCGCTCATTGGATTTCCAGTACTC
DroMel_4_         GTCGCTCAGCCAGCATTTGCAGAAGTCGCAGAACTTGCGCTCGTTTGATTTCCAGTACTC
DroMoj_20041206_  GTCGCTTAACCAGCATTTACAGAAATCGCAATACTTGCGTTCATTGGATTTCCAGTACTC
DroPse_1_         GTCGCTCAGCCAGCACTTGCAGAAGTCGCAGTACTTGCGCTCGTTTGATTTCCAGAATTC
DroSim_20040829_  GTCGCTCAGCCAGCATTTGCAGAAGTCGCAGAACTTGCGCTCGTTTGATTTCCAGTACTC
DroVir_20041029_  GTCGCTCAACCAGCATTTGCAGAAGTCGCAATACTTGCGTTCATTCGACTTCCAGTACTC
DroYak_1_         GTCGCTCAGCCAGCATTTGCAGAAGTCGCAGAACTTCCGCTCGTTTGACTTCCAGTACTC

Alignment of non-coding sequence
DroAna_20041206_  CTGAAGGAAT-------TCTATATT---------AAAGAAGATTTCTCATCATTGGTTG
DroMel_4_         CTGCGGGATTAGGGGTCATTAGAGT---------GCCGAAAAGCGA---------GTTT
DroMoj_20041206_  CTGGAATAGTTAATTTCATTGTAACACATAAACGTTTTAAATTCTATTGAAA-------
DroPse_1_         CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG----
DroSim_20040829_  CTGCGGGATTAGGAGTCATTAGAGT---------GCGGAAAAGCGG---------GTT-
DroVir_20041029_  CTGCAGCAGTTAAATA-ATTGTAATAAACAATTCTCT--AATTTGGTCCAAA-------
DroYak_1_         CTGCGGGATTAGCGGTCATTGGTGT---------GAAGAATAGATC---------CTTT

DroAna_20041206_  AATC-----ACTTAC
DroMel_4_         ATTCTATGGACTCAC
DroMoj_20041206_  ----TATTTACTCAC
DroPse_1_         ------TGTACTTAC
DroSim_20040829_  ATTCTATGGACTCAC
DroVir_20041029_  ----TATTTACTCAC
DroYak_1_         ATTTCATAAACTCAC
11 Alignment of coding sequence
DroAna_20041206_  GTCGCTCAACCAGCATTTGCAAAAGTCGCAGAACTTGCGCTCATTGGATTTCCAGTACTC
DroMel_4_         GTCGCTCAGCCAGCATTTGCAGAAGTCGCAGAACTTGCGCTCGTTTGATTTCCAGTACTC
DroMoj_20041206_  GTCGCTTAACCAGCATTTACAGAAATCGCAATACTTGCGTTCATTGGATTTCCAGTACTC
DroPse_1_         GTCGCTCAGCCAGCACTTGCAGAAGTCGCAGTACTTGCGCTCGTTTGATTTCCAGAATTC
DroSim_20040829_  GTCGCTCAGCCAGCATTTGCAGAAGTCGCAGAACTTGCGCTCGTTTGATTTCCAGTACTC
DroVir_20041029_  GTCGCTCAACCAGCATTTGCAGAAGTCGCAATACTTGCGTTCATTCGACTTCCAGTACTC
DroYak_1_         GTCGCTCAGCCAGCATTTGCAGAAGTCGCAGAACTTCCGCTCGTTTGACTTCCAGTACTC

Alignment of non-coding sequence
droAna1.2448876      CTGAAGGAATTCTA--TATTAAAG-------------------------------
dm2.chr2L            CTGCGGGATTAGGGGTCATTAGAG---------TGCCGAAAAGCGAGT-TTATTC
droMoj1.contig_2959  CTGGAATAGTTAATTTCATTGTAA---------CACATAAA--CGTTTTAAATTC
dp3.chr4_group3      CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG
droSim1.chr2L        CTGCGGGATTAGGAGTCATTAGAG---------TGCGGAAAAGCGGG--TTATTC
droVir1.scaffold_6   CTGCAGCAGTTAA-ATAATTGTAA---------TAAACAA----TTCTCTAATTT
droYak1.chr2L        CTGCGGGATTAGCGGTCATTGGTG---------TGAAGAATAGATCCT-TTATTT

droAna1.2448876      AAGATTTCTCATCATTGGTTGAATC---------------------ACTTAC
dm2.chr2L            -----------------------------------------TATGGACTCAC
droMoj1.contig_2959  -------------------------AAATATTT--------TATTGACTCAC
dp3.chr4_group3      -----------------------------------------TGT--ACTTAC
droSim1.chr2L        -----------------------------------------TATGGACTCAC
droVir1.scaffold_6   ---------------------------------AAATATTTGGTCCACTCAC
droYak1.chr2L        -----------------------------------------CATAAACTCAC
12 Per site analysis
Group 1 mean per site identity             51.3    51.3    47.8
Group 2 mean per site identity             47.8    42.9    42.9
Difference of means (group 1 - group 2)     3.6     8.4     4.9
Difference of means resampling p-value     0.05    0.003   1E-5
Distribution comparison KS p-value         0.026   0.0016  2E-6

Per base analysis
Group 1 mean per base identity             47.8    47.8    46.3
Group 2 mean per base identity             46.3    42.4    42.4
Difference of means (group 1 - group 2)     1.5     5.4     3.9

Richards et al., Comparative genome sequencing of
Drosophila pseudoobscura: Chromosomal, gene, and
cis-element evolution, Genome Res., Jan 2005.
13 How is an alignment made from two sequences?
Given two sequences of lengths n, m:
>dm2.chr2L
CTGCGGGATTAGGGGTCATTAGAGTGCCGAAAAGCGAGTTTATTCTATGGACTCAC
>dp3.chr4_group3
CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCGTGTACTTAC
n = 56
m = 64
?
dm2.chr2L        CTGCGGGATTAGGGGTCATTAGAG---------TGCCGAAAAGCGAGT-TTATTC
dp3.chr4_group3  CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG

dm2.chr2L        TATGGACTCAC
dp3.chr4_group3  TGT--ACTTAC
14
dm2.chr2L        CTGCGGGATTAGGGGTCATTAGAG---------TGCCGAAAAGCGAGT-TTATTC
dp3.chr4_group3  CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG

dm2.chr2L        TATGGACTCAC
dp3.chr4_group3  TGT--ACTTAC

DroMel_4_        CTGCGGGATTAGGGGTCATTAGAGT---------GCCGAAAAGCGA---------GTTT
DroPse_1_        CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG----

DroMel_4_        ATTCTATGGACTCAC
DroPse_1_        ------TGTACTTAC

Each alignment can be summarized by counting the
number of matches (M), mismatches (X), gaps
(G), and spaces (S).
2(M + X) + S = 112, so X, G, and S suffice to
specify a summary.
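The counting step can be sketched in a few lines of Python. This is an illustrative helper, not code from the talk: the name `summarize` is hypothetical, and it assumes that a gap is a maximal run of consecutive spaces within one row. The identity above holds because every match or mismatch column consumes two bases and every space column consumes one, so the final assertion checks 2(M + X) + S against the number of bases in the two rows as typed here.

```python
def summarize(row_a, row_b):
    """Return (M, X, G, S) for a pairwise alignment given as two gapped rows.

    Assumption: a gap is a maximal run of consecutive '-' in one row.
    """
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    matches = mismatches = gaps = spaces = 0
    open_gap = None  # which row ('a' or 'b') carried a space in the previous column
    for ca, cb in zip(row_a, row_b):
        if ca == "-" and cb == "-":
            raise ValueError("a column cannot consist of two spaces")
        if ca == "-" or cb == "-":
            side = "a" if ca == "-" else "b"
            spaces += 1
            if side != open_gap:  # a new run of spaces starts here
                gaps += 1
            open_gap = side
        else:
            open_gap = None
            if ca == cb:
                matches += 1
            else:
                mismatches += 1
    return matches, mismatches, gaps, spaces


# The dm2 / dp3 rows from the slide, with the two blocks concatenated.
DM2 = ("CTGCGGGATTAGGGGTCATTAGAG---------TGCCGAAAAGCGAGT-TTATTC"
       "TATGGACTCAC")
DP3 = ("CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG"
       "TGT--ACTTAC")

M, X, G, S = summarize(DM2, DP3)
total_bases = sum(c != "-" for c in DM2 + DP3)
assert 2 * (M + X) + S == total_bases
print("M, X, G, S =", (M, X, G, S))
```

Because M is determined by X and S once the sequence lengths are fixed, the helper could return only (X, G, S); all four counts are returned here so the identity can be checked directly.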
16 The summary of an alignment is a point in
3-dimensional space, with coordinates (X, G, S).
For example, the two alignments just shown
correspond to the points (22,3,12) and (18,3,28).
17 In the example of our two sequences there are
434615666279134990029695804618937526970374145
different alignments.
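Counts of this size can be reproduced with a short recurrence. The sketch below assumes the usual convention that a global alignment is a sequence of columns of three kinds (two bases, a space in the first row, a space in the second row), in which case the number of alignments of sequences of lengths n and m is a Delannoy number. The function name `count_alignments` is illustrative, and the slide does not state which convention or which sequence lengths produced its figure.

```python
def count_alignments(n, m):
    """Number of global alignments of sequences of lengths n and m.

    Assumed convention: each alignment column is one of three kinds, so the
    count satisfies D(i, j) = D(i-1, j) + D(i, j-1) + D(i-1, j-1),
    with D(i, 0) = D(0, j) = 1.
    """
    row = [1] * (m + 1)  # D(0, j) for j = 0..m
    for _ in range(n):
        new_row = [1]  # D(i, 0)
        for j in range(1, m + 1):
            new_row.append(row[j] + new_row[j - 1] + row[j - 1])
        row = new_row
    return row[m]


# Python integers are arbitrary precision, so long sequences pose no problem.
print(count_alignments(56, 64))
```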
18 In the example of our two sequences there are
379522884096444556699773447791552717765633
different alignments, but only 53890 different
summaries. So we don't need to plot that many
points.
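Many alignments share the same counts, which is why there are far fewer summaries than alignments. A minimal sketch, assuming a gap is a maximal run of spaces in one row, collects the distinct (X, G, S) triples by dynamic programming over prefixes. The function `alignment_summaries` is a hypothetical illustration and is only practical for short sequences; it is not the method used to obtain the numbers on the slide.

```python
def alignment_summaries(a, b):
    """Set of distinct (X, G, S) summaries over all global alignments of a and b.

    cells[i][j][op] holds the summaries of alignments of a[:i] and b[:j]
    whose last column has type op:
      0 = two bases, 1 = base of a over a space, 2 = space over a base of b.
    """
    n, m = len(a), len(b)
    cells = [[[set(), set(), set()] for _ in range(m + 1)] for _ in range(n + 1)]
    cells[0][0][0].add((0, 0, 0))  # the empty alignment has no open gap
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                miss = 0 if a[i - 1] == b[j - 1] else 1
                for prev in cells[i - 1][j - 1]:
                    for x, g, s in prev:
                        cells[i][j][0].add((x + miss, g, s))
            if i > 0:
                for op, prev in enumerate(cells[i - 1][j]):
                    for x, g, s in prev:
                        # a new gap opens unless the previous column had the same type
                        cells[i][j][1].add((x, g + (op != 1), s + 1))
            if j > 0:
                for op, prev in enumerate(cells[i][j - 1]):
                    for x, g, s in prev:
                        cells[i][j][2].add((x, g + (op != 2), s + 1))
    return set().union(*cells[n][m])


summaries = alignment_summaries("AC", "A")
print(len(summaries), sorted(summaries))
```

The two sequences "AC" and "A" have five alignments under the three-column-type convention but only four distinct summaries, which shows the collapse on a scale small enough to check by hand.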
19 But 53890 is still quite a large number.
Fortunately, there are only 69 vertices on the
convex hull of the 53890 points. These are the
interesting ones, and we can even draw them.
20 For the sequences
>mel
CTGCGGGATTAGGGGTCATTAGAGTGCCGAAAAGCGAGTTTATTCTATGGAC
>pse
CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCGTGTAC
the alignment polytope is
21
mel  CTGCGGGATTAGGGGTCATTAGAGT---------GCCGAAAAGCGAGTTTATTCTATGGAC
pse  CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATC-GTGTAC

mel  CTGCGGGATTAGGGGTCATTAGAGT---------GCCGAAAAGCGAGTTTATTCTATGGAC
pse  CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG-TGTAC

mel  CTGCGGGATTAGGGGTCATTAGAG---------TGCCGAAAAGCGAGTTTATTCTATGGAC
pse  CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATC-GTGTAC

mel  CTGCGGGATTAGGGGTCATTAGAG---------TGCCGAAAAGCGAGTTTATTCTATGGAC
pse  CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG-TGTAC

mel  CTGCGGGATTAGGGGTCATTAGA---------GTGCCGAAAAGCGAGTTTATTCTATGGAC
pse  CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATC-GTGTAC

mel  CTGCGGGATTAGGGGTCATTAGA---------GTGCCGAAAAGCGAGTTTATTCTATGGAC
pse  CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG-TGTAC

mel  CTGCGGGATTAGGGGTCATTAG---------AGTGCCGAAAAGCGAGTTTATTCTATGGAC
pse  CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATC-GTGTAC

mel  CTGCGGGATTAGGGGTCATTAG---------AGTGCCGAAAAGCGAGTTTATTCTATGGAC
pse  CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG-TGTAC
22 Consensus at a vertex
mel  CTGCGGGATTAGGGGTCATTAGAGT------GCCGAAAAGCGAGTTTATTCTATGGAC
pse  CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCGTGTAC
23 The vertices of the polytope have special
significance. Given parameters for a model,
e.g. the default parameters for MULTIZ,
M = 100, X = -100, S = -30, G = -400,
the summary is the result of maximizing the
linear form
-200(X) - 400(G) - 80(S)
over the polytope. Thus, the vertices of the
polytope correspond to optimal alignments.
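The coefficients of the linear form follow from the parameters once M is eliminated. A short derivation, writing M, X, G, S for the counts in an alignment, and assuming the earlier identity 2(M + X) + S = n + m, where n + m is the combined length of the two sequences:

```latex
\begin{align*}
\text{score} &= 100\,M - 100\,X - 400\,G - 30\,S, \\
2(M + X) + S = n + m \;&\Longrightarrow\; M = \tfrac{1}{2}(n + m - S) - X, \\
\text{score} &= 50\,(n + m) - 200\,X - 400\,G - 80\,S .
\end{align*}
```

The term 50(n + m) is the same for every alignment of the two sequences, so maximizing the score is equivalent to maximizing -200X - 400G - 80S.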
24 Needleman-Wunsch Alignment
What is usually done is that a single set of
parameters is specified (M = 100, X = -100, S =
-30, G = -400 is a standard default) and then the
optimal vertex is identified using dynamic
programming. An alignment optimal for the vertex
is then selected. The running time of the
algorithm is O(nm) [Needleman-Wunsch, 1970;
Smith-Waterman, 1981] and it requires O(n+m)
space [Hirschberg, 1975].
Standard scoring schemes are:
Parameters             Model
M, X, S                Jukes-Cantor with linear gap penalty
M, X, S, G             Jukes-Cantor with affine gap penalty
M, X_TS, X_TV, S, G    Kimura 2-parameter with affine gap penalty
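25 Needleman-Wunsch algorithm
W_{i,j} is the score of the best alignment of
positions 1..i and 1..j in each sequence:
W_{i,j} = max( S + W_{i-1,j}, S + W_{i,j-1}, (X or M) + W_{i-1,j-1} )
[Figure: dynamic programming matrix; the axis
labels as extracted read A C A T T A G A A A and
G A T T A C C A C A]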
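The recurrence can be sketched in Python as follows. This is a minimal illustration with a linear gap penalty (parameters M, X, S only; no separate gap-open term G), using the default values quoted earlier in the talk. The function name and the tie-breaking order in the traceback are choices made here, not taken from the slides, and this version stores the full O(nm) table rather than using the linear-space refinement.

```python
def needleman_wunsch(a, b, M=100, X=-100, S=-30):
    """Global alignment with match score M, mismatch score X, space score S.

    Returns (score, row_a, row_b). W[i][j] is the best score for aligning
    a[:i] with b[:j].
    """
    n, m = len(a), len(b)
    W = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        W[i][0] = i * S  # a[:i] aligned against spaces only
    for j in range(1, m + 1):
        W[0][j] = j * S  # b[:j] aligned against spaces only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = W[i - 1][j - 1] + (M if a[i - 1] == b[j - 1] else X)
            W[i][j] = max(diag, W[i - 1][j] + S, W[i][j - 1] + S)

    # Trace back one optimal alignment, preferring the diagonal move on ties.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and \
                W[i][j] == W[i - 1][j - 1] + (M if a[i - 1] == b[j - 1] else X):
            row_a.append(a[i - 1]); row_b.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and W[i][j] == W[i - 1][j] + S:
            row_a.append(a[i - 1]); row_b.append("-")
            i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1])
            j -= 1
    return W[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


score, top, bottom = needleman_wunsch("ACATTAGAAA", "GATTACCACA")
print(score)
print(top)
print(bottom)
```

Several alignments can attain the optimal score; the traceback returns just one of them, which is the "alignment optimal for the vertex" selection step described on the previous slide.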
26 Building Drosophila whole genome multiple
alignments
- MAVID
- http://hanuman.math.berkeley.edu/kbrowser
- MULTIZ
- http://genome.ucsc.edu/
- (currently no D. erecta)
27
DroAna_20041206_  CTGAAGGAAT-------TCTATATT---------AAAGAAGATTTCTCATCATTGGTTG
DroMel_4_         CTGCGGGATTAGGGGTCATTAGAGT---------GCCGAAAAGCGA---------GTTT
DroMoj_20041206_  CTGGAATAGTTAATTTCATTGTAACACATAAACGTTTTAAATTCTATTGAAA-------
DroPse_1_         CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG----
DroSim_20040829_  CTGCGGGATTAGGAGTCATTAGAGT---------GCGGAAAAGCGG---------GTT-
DroVir_20041029_  CTGCAGCAGTTAAATA-ATTGTAATAAACAATTCTCT--AATTTGGTCCAAA-------
DroYak_1_         CTGCGGGATTAGCGGTCATTGGTGT---------GAAGAATAGATC---------CTTT

DroAna_20041206_  AATC-----ACTTAC
DroMel_4_         ATTCTATGGACTCAC
DroMoj_20041206_  ----TATTTACTCAC
DroPse_1_         ------TGTACTTAC
DroSim_20040829_  ATTCTATGGACTCAC
DroVir_20041029_  ----TATTTACTCAC
DroYak_1_         ATTTCATAAACTCAC

MAVID
N. Bray and L. Pachter, MAVID: Constrained
ancestral alignment of multiple sequences, Genome
Research 14 (2004) p 693--699
29
droAna1.2448876      CTGAAGGAATTCTA--TATTAAAG-------------------------------
dm2.chr2L            CTGCGGGATTAGGGGTCATTAGAG---------TGCCGAAAAGCGAGT-TTATTC
droMoj1.contig_2959  CTGGAATAGTTAATTTCATTGTAA---------CACATAAA--CGTTTTAAATTC
dp3.chr4_group3      CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG
droSim1.chr2L        CTGCGGGATTAGGAGTCATTAGAG---------TGCGGAAAAGCGGG--TTATTC
droVir1.scaffold_6   CTGCAGCAGTTAA-ATAATTGTAA---------TAAACAA----TTCTCTAATTT
droYak1.chr2L        CTGCGGGATTAGCGGTCATTGGTG---------TGAAGAATAGATCCT-TTATTT

droAna1.2448876      AAGATTTCTCATCATTGGTTGAATC---------------------ACTTAC
dm2.chr2L            -----------------------------------------TATGGACTCAC
droMoj1.contig_2959  -------------------------AAATATTT--------TATTGACTCAC
dp3.chr4_group3      -----------------------------------------TGT--ACTTAC
droSim1.chr2L        -----------------------------------------TATGGACTCAC
droVir1.scaffold_6   ---------------------------------AAATATTTGGTCCACTCAC
droYak1.chr2L        -----------------------------------------CATAAACTCAC

MULTIZ
Blanchette et al., Aligning multiple sequences
with the threaded blockset aligner, Genome
Research 14 (2004) p 708--715
31 One (possibly wrong) alignment is not enough:
the history of parametric inference
- 1992 Waterman, M., Eggert, M. and Lander, E.
  - Parametric sequence comparisons, Proc. Natl.
    Acad. Sci. USA 89, 6090-6093.
- 1994 Gusfield, D., Balasubramanian, K. and Naor, D.
  - Parametric optimization of sequence alignment,
    Algorithmica 12, 312-326.
- 2003 Wang, L. and Zhao, J.
  - Parametric alignment of ordered trees,
    Bioinformatics 19, 2237-2245.
- 2004 Fernández-Baca, D., Seppäläinen, T. and
  Slutzki, G.
  - Parametric Multiple Sequence Alignment and
    Phylogeny Construction, Journal of Discrete
    Algorithms 2, 271-287.
XPARAL by Kristian Stevens and Dan Gusfield
32 Whole Genome Parametric Alignment
Colin Dewey, Peter Huggins, Lior Pachter, Bernd
Sturmfels and Kevin Woods
- Mathematics and Computer Science
  - Parametric alignment in higher dimensions.
  - Faster new algorithms.
  - Deeper understanding of alignment polytopes.
- Biology
  - Whole genome parametric alignment.
  - Biological implications of alignment
    parameters.
  - Alignment with biology rather than for biology.