Annotation and Alignment of the Drosophila Genomes

Provided by: Lior55, U.C. Berkeley

Transcript and Presenter's Notes

1
Annotation and Alignment of the Drosophila Genomes
2
(No Transcript)
3
(No Transcript)
4
Genes or Regulation?
  • 10,516 putative orthologs have been identified
    as a core gene set conserved over 25-55 million
    years (Myr) since the pseudoobscura/melanogaster
    divergence
  • Cis-regulatory sequences are more conserved
    than random and nearby sequences between the
    species, but the difference is slight, suggesting
    that the evolution of cis-regulatory elements is
    flexible

Richards et al., Comparative genome sequencing of
Drosophila pseudoobscura: Chromosomal, gene, and
cis-element evolution, Genome Res., Jan 2005.
5
Genes or Regulatory Elements?
  • 10,516 10,867 putative orthologs have been
    identified as a core gene set conserved over
    25-55 million years (Myr) since the
    pseudoobscura/melanogaster divergence
  • Cis-regulatory sequences are more conserved
    than random and nearby sequences between the
    species, but the difference is slight, suggesting
    that the evolution of cis-regulatory elements is
    flexible

Richards et al., Comparative genome sequencing of
Drosophila pseudoobscura: Chromosomal, gene, and
cis-element evolution, Genome Res., Jan 2005.
6
BP England, U Heberlein, R Tjian. Purified
Drosophila transcription factor, Adh distal
factor-1 (Adf-1), binds to sites in several
Drosophila promoters and activates transcription,
J Biol Chem 1990.
7
S. Chatterji and L. Pachter, GeneMapper:
Reference based annotation with GeneMapper, 2005.
8
Genes or Regulatory Elements?
  • 10,516 10,867 putative orthologs have been
    identified as a core gene set conserved over
    25-55 million years (Myr) since the
    pseudoobscura/melanogaster divergence
  • Cis-regulatory sequences are more conserved
    than random and nearby sequences between the
    species, but the difference is slight, suggesting
    that the evolution of cis-regulatory elements is
    flexible

Richards et al., Comparative genome sequencing of
Drosophila pseudoobscura: Chromosomal, gene, and
cis-element evolution, Genome Res., Jan 2005.
9
http://rana.lbl.gov/drosophila/
10
Alignment of coding sequence
DroAna_20041206_  GTCGCTCAACCAGCATTTGCAAAAGTCGCAGAACTTGCGCTCATTGGATTTCCAGTACTC
DroMel_4_         GTCGCTCAGCCAGCATTTGCAGAAGTCGCAGAACTTGCGCTCGTTTGATTTCCAGTACTC
DroMoj_20041206_  GTCGCTTAACCAGCATTTACAGAAATCGCAATACTTGCGTTCATTGGATTTCCAGTACTC
DroPse_1_         GTCGCTCAGCCAGCACTTGCAGAAGTCGCAGTACTTGCGCTCGTTTGATTTCCAGAATTC
DroSim_20040829_  GTCGCTCAGCCAGCATTTGCAGAAGTCGCAGAACTTGCGCTCGTTTGATTTCCAGTACTC
DroVir_20041029_  GTCGCTCAACCAGCATTTGCAGAAGTCGCAATACTTGCGTTCATTCGACTTCCAGTACTC
DroYak_1_         GTCGCTCAGCCAGCATTTGCAGAAGTCGCAGAACTTCCGCTCGTTTGACTTCCAGTACTC

Alignment of non-coding sequence
DroAna_20041206_  CTGAAGGAAT-------TCTATATT---------AAAGAAGATTTCTCATCATTGGTTG
DroMel_4_         CTGCGGGATTAGGGGTCATTAGAGT---------GCCGAAAAGCGA---------GTTT
DroMoj_20041206_  CTGGAATAGTTAATTTCATTGTAACACATAAACGTTTTAAATTCTATTGAAA-------
DroPse_1_         CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG----
DroSim_20040829_  CTGCGGGATTAGGAGTCATTAGAGT---------GCGGAAAAGCGG---------GTT-
DroVir_20041029_  CTGCAGCAGTTAAATA-ATTGTAATAAACAATTCTCT--AATTTGGTCCAAA-------
DroYak_1_         CTGCGGGATTAGCGGTCATTGGTGT---------GAAGAATAGATC---------CTTT

DroAna_20041206_  AATC-----ACTTAC
DroMel_4_         ATTCTATGGACTCAC
DroMoj_20041206_  ----TATTTACTCAC
DroPse_1_         ------TGTACTTAC
DroSim_20040829_  ATTCTATGGACTCAC
DroVir_20041029_  ----TATTTACTCAC
DroYak_1_         ATTTCATAAACTCAC
11
Alignment of coding sequence
DroAna_20041206_  GTCGCTCAACCAGCATTTGCAAAAGTCGCAGAACTTGCGCTCATTGGATTTCCAGTACTC
DroMel_4_         GTCGCTCAGCCAGCATTTGCAGAAGTCGCAGAACTTGCGCTCGTTTGATTTCCAGTACTC
DroMoj_20041206_  GTCGCTTAACCAGCATTTACAGAAATCGCAATACTTGCGTTCATTGGATTTCCAGTACTC
DroPse_1_         GTCGCTCAGCCAGCACTTGCAGAAGTCGCAGTACTTGCGCTCGTTTGATTTCCAGAATTC
DroSim_20040829_  GTCGCTCAGCCAGCATTTGCAGAAGTCGCAGAACTTGCGCTCGTTTGATTTCCAGTACTC
DroVir_20041029_  GTCGCTCAACCAGCATTTGCAGAAGTCGCAATACTTGCGTTCATTCGACTTCCAGTACTC
DroYak_1_         GTCGCTCAGCCAGCATTTGCAGAAGTCGCAGAACTTCCGCTCGTTTGACTTCCAGTACTC

Alignment of non-coding sequence
droAna1.2448876      CTGAAGGAATTCTA--TATTAAAG-------------------------------
dm2.chr2L            CTGCGGGATTAGGGGTCATTAGAG---------TGCCGAAAAGCGAGT-TTATTC
droMoj1.contig_2959  CTGGAATAGTTAATTTCATTGTAA---------CACATAAA--CGTTTTAAATTC
dp3.chr4_group3      CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG
droSim1.chr2L        CTGCGGGATTAGGAGTCATTAGAG---------TGCGGAAAAGCGGG--TTATTC
droVir1.scaffold_6   CTGCAGCAGTTAA-ATAATTGTAA---------TAAACAA----TTCTCTAATTT
droYak1.chr2L        CTGCGGGATTAGCGGTCATTGGTG---------TGAAGAATAGATCCT-TTATTT

droAna1.2448876      AAGATTTCTCATCATTGGTTGAATC---------------------ACTTAC
dm2.chr2L            -----------------------------------------TATGGACTCAC
droMoj1.contig_2959  -------------------------AAATATTT--------TATTGACTCAC
dp3.chr4_group3      -----------------------------------------TGT--ACTTAC
droSim1.chr2L        -----------------------------------------TATGGACTCAC
droVir1.scaffold_6   ---------------------------------AAATATTTGGTCCACTCAC
droYak1.chr2L        -----------------------------------------CATAAACTCAC
12
Per site analysis
  Group 1 mean per site identity            51.3     51.3     47.8
  Group 2 mean per site identity            47.8     42.9     42.9
  Difference of means (group 1 - group 2)   3.6      8.4      4.9
  Difference of means resampling p-value    0.05     0.003    1E-5
  Distribution comparison KS p-value        0.026    0.0016   2E-6
Per base analysis
  Group 1 mean per base identity            47.8     47.8     46.3
  Group 2 mean per base identity            46.3     42.4     42.4
  Difference of means (group 1 - group 2)   1.5      5.4      3.9
Richards et al., Comparative genome sequencing of
Drosophila pseudoobscura: Chromosomal, gene, and
cis-element evolution, Genome Res., Jan 2005.
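
The table reports a "difference of means resampling p-value" without saying how the resampling was done. As a rough illustration only, the sketch below uses a generic one-sided permutation test. The function name, the number of trials, the seed, and the identity values are all hypothetical and are not taken from Richards et al.

```python
import random


def mean(values):
    return sum(values) / len(values)


def resampling_p_value(group1, group2, trials=2000, seed=0):
    """One-sided permutation test for mean(group1) - mean(group2).

    Assumption: group labels are exchangeable under the null hypothesis.
    """
    rng = random.Random(seed)
    observed = mean(group1) - mean(group2)
    pooled = list(group1) + list(group2)
    n1 = len(group1)
    hits = 0
    for _ in range(trials):
        rng.shuffle(pooled)
        diff = mean(pooled[:n1]) - mean(pooled[n1:])
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (trials + 1)


# Made-up per-site identities (percent), for illustration only.
cis_regulatory = [55.0, 52.0, 49.0, 51.0, 53.0, 50.0]
nearby = [47.0, 49.0, 46.0, 50.0, 45.0, 48.0]
print(resampling_p_value(cis_regulatory, nearby))
```

A smaller p-value means that random relabelings rarely produce a difference of means as large as the observed one.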
13
How is an alignment made from two sequences?
Given two sequences of lengths n, m
>dm2.chr2L
CTGCGGGATTAGGGGTCATTAGAGTGCCGAAAAGCGAGTTTATTCTATGGACTCAC
>dp3.chr4_group3
CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCGTGTACTTAC
n = 56
m = 64
?
dm2.chr2L        CTGCGGGATTAGGGGTCATTAGAG---------TGCCGAAAAGCGAGT-TTATTC
dp3.chr4_group3  CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG

dm2.chr2L        TATGGACTCAC
dp3.chr4_group3  TGT--ACTTAC
14
dm2.chr2L        CTGCGGGATTAGGGGTCATTAGAG---------TGCCGAAAAGCGAGT-TTATTC
dp3.chr4_group3  CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG

dm2.chr2L        TATGGACTCAC
dp3.chr4_group3  TGT--ACTTAC

DroMel_4_  CTGCGGGATTAGGGGTCATTAGAGT---------GCCGAAAAGCGA---------GTTT
DroPse_1_  CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG----

DroMel_4_  ATTCTATGGACTCAC
DroPse_1_  ------TGTACTTAC

Each alignment can be summarized by counting the
number of matches (M), mismatches (X), gaps
(G), and spaces (S).
15
dm2.chr2L        CTGCGGGATTAGGGGTCATTAGAG---------TGCCGAAAAGCGAGT-TTATTC
dp3.chr4_group3  CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG

dm2.chr2L        TATGGACTCAC
dp3.chr4_group3  TGT--ACTTAC

DroMel_4_  CTGCGGGATTAGGGGTCATTAGAGT---------GCCGAAAAGCGA---------GTTT
DroPse_1_  CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG----

DroMel_4_  ATTCTATGGACTCAC
DroPse_1_  ------TGTACTTAC

Each alignment can be summarized by counting the
number of matches (M), mismatches (X), gaps
(G), and spaces (S).
2(M + X) + S = 112, so X, G and S suffice to
specify a summary.
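
A minimal sketch of how such a summary could be computed from two gapped rows. It assumes that a "space" is a single "-" character and a "gap" is a maximal run of spaces within one row; the slides do not spell out these definitions, and the function name is hypothetical. The example rows are the 52- and 60-base mel/pse sequences used later in the talk, in one of the alignments shown there.

```python
def alignment_summary(row1, row2):
    """Return (M, X, G, S) for two equal-length gapped rows."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    M = X = G = S = 0
    prev = None  # which row held a '-' in the previous column, if any
    for a, b in zip(row1, row2):
        if a == "-" and b == "-":
            raise ValueError("a column cannot hold two spaces")
        if a == "-" or b == "-":
            S += 1
            side = 1 if a == "-" else 2
            if side != prev:
                G += 1  # a new run of spaces starts here
            prev = side
        else:
            prev = None
            if a == b:
                M += 1
            else:
                X += 1
    return M, X, G, S


mel_row = "CTGCGGGATTAGGGGTCATTAGAGT" + "-" * 9 + "GCCGAAAAGCGAGTTTATTCTATGGAC"
pse_row = ("CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATC"
           + "-" + "GTGTAC")

M, X, G, S = alignment_summary(mel_row, pse_row)
# Every base sits in a match, a mismatch, or opposite a space,
# so 2(M + X) + S equals the combined sequence length (52 + 60).
print(M, X, G, S, 2 * (M + X) + S)
```

Because M is determined by X and S once the sequence lengths are fixed, the triple (X, G, S) carries the same information as the full count.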
16
The summary of an alignment is a point in
3-dimensional space. For example, the two
alignments just shown correspond to the
points (22,3,12) and (18,3,28).
17
The summary of an alignment is a point in
3-dimensional space. For example, the two
alignments just shown correspond to the
points (22,3,12) and (18,3,28). In the example
of our two sequences there are
434615666279134990029695804618937526970374145
different alignments.
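
The slides do not say how alignments are counted. One common convention, assumed here, treats an alignment as a lattice path from (0,0) to (n,m) whose steps are a match or mismatch column, a space in the first sequence, or a space in the second. The sketch below counts such paths; the function name is hypothetical, and the result is not claimed to reproduce the figure quoted on the slide.

```python
def count_alignments(n, m):
    """Count lattice paths from (0,0) to (n,m) with steps (1,0), (0,1), (1,1).

    Assumption: each path corresponds to exactly one alignment.
    """
    prev = [1] * (m + 1)  # row 0: only one way, all spaces in one sequence
    for _ in range(n):
        cur = [1]
        for j in range(1, m + 1):
            # arrive from above, from the left, or along the diagonal
            cur.append(prev[j] + cur[j - 1] + prev[j - 1])
        prev = cur
    return prev[m]


# Python integers are arbitrary precision, so long sequences are fine.
print(count_alignments(56, 64))
```

Even for sequences of a few dozen bases the count is astronomically large, which is why enumerating alignments one by one is not an option.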
18
The summary of an alignment is a point in
3-dimensional space. For example, the two
alignments just shown correspond to the
points (22,3,12) and (18,3,28). In the example
of our two sequences there are
379522884096444556699773447791552717765633
different alignments, but only 53890 different
summaries. So we don't need to plot that many
points.
19
The summary of an alignment is a point in
3-dimensional space. For example, the two
alignments just shown correspond to the
points (22,3,12) and (18,3,28). In the example
of our two sequences there are
379522884096444556699773447791552717765633
different alignments, but only 53890 different
summaries. So we don't need to plot that many
points. But 53890 is still quite a large number.
Fortunately, there are only 69 vertices on the
convex hull of the 53890 points. These are the
interesting ones, and we can even draw them.
20
For the sequences
>mel
CTGCGGGATTAGGGGTCATTAGAGTGCCGAAAAGCGAGTTTATTCTATGGAC
>pse
CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCGTGTAC
the alignment polytope is
21
mel  CTGCGGGATTAGGGGTCATTAGAGT---------GCCGAAAAGCGAGTTTATTCTATGGAC
pse  CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATC-GTGTAC

mel  CTGCGGGATTAGGGGTCATTAGAGT---------GCCGAAAAGCGAGTTTATTCTATGGAC
pse  CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG-TGTAC

mel  CTGCGGGATTAGGGGTCATTAGAG---------TGCCGAAAAGCGAGTTTATTCTATGGAC
pse  CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATC-GTGTAC

mel  CTGCGGGATTAGGGGTCATTAGAG---------TGCCGAAAAGCGAGTTTATTCTATGGAC
pse  CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG-TGTAC

mel  CTGCGGGATTAGGGGTCATTAGA---------GTGCCGAAAAGCGAGTTTATTCTATGGAC
pse  CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATC-GTGTAC

mel  CTGCGGGATTAGGGGTCATTAGA---------GTGCCGAAAAGCGAGTTTATTCTATGGAC
pse  CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG-TGTAC

mel  CTGCGGGATTAGGGGTCATTAG---------AGTGCCGAAAAGCGAGTTTATTCTATGGAC
pse  CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATC-GTGTAC

mel  CTGCGGGATTAGGGGTCATTAG---------AGTGCCGAAAAGCGAGTTTATTCTATGGAC
pse  CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG-TGTAC
22
mel  CTGCGGGATTAGGGGTCATTAGAGT------GCCGAAAAGCGAGTTTATTCTATGGAC
pse  CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCGTGTAC
Consensus at a vertex
23
The vertices of the polytope have special
significance. Given parameters for a model,
e.g. the default parameters for MULTIZ
(M = 100, X = -100, S = -30, G = -400),
the summary is the result of maximizing the
linear form -200(X) - 400(G) - 80(S) over the
polytope. Thus, the vertices of the polytope
correspond to optimal alignments.
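
The coefficients of the linear form follow from eliminating M with the relation 2(M + X) + S = 112 stated earlier: 100M - 100X - 30S - 400G = 5600 - 200X - 400G - 80S. The sketch below checks this numerically for the two summaries quoted in the talk. The function names are hypothetical, and the summaries are read as (X, G, S), which is an assumption based on the statement that X, G and S suffice.

```python
def full_score(M, X, G, S, wM=100, wX=-100, wG=-400, wS=-30):
    """Score of a summary under the MULTIZ default weights quoted above."""
    return wM * M + wX * X + wG * G + wS * S


def reduced_score(X, G, S, total_length=112):
    """Same score with M eliminated via 2(M + X) + S = total_length."""
    return 50 * total_length - 200 * X - 400 * G - 80 * S


def matches_from(X, S, total_length=112):
    """Recover M from X and S using the length relation."""
    return (total_length - S) // 2 - X


for X, G, S in [(22, 3, 12), (18, 3, 28)]:
    M = matches_from(X, S)
    print((X, G, S), full_score(M, X, G, S), reduced_score(X, G, S))
```

The constant 5600 does not affect which summary is best, so maximizing the score is the same as maximizing -200X - 400G - 80S, and a linear form over a polytope attains its maximum at a vertex.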
24
Needleman-Wunsch Alignment
What is usually done is that a single set of
parameters is specified (M = 100, X = -100,
S = -30, G = -400 is a standard default) and then
the optimal vertex is identified using dynamic
programming. An alignment optimal for the vertex
is then selected. The running time of the
algorithm is O(nm) [Needleman-Wunsch, 1970;
Smith-Waterman, 1981] and it requires O(n+m)
space [Hirschberg, 1975]. Standard scoring
schemes are:
  Parameters            Model
  M, X, S               Jukes-Cantor with linear gap penalty
  M, X, S, G            Jukes-Cantor with affine gap penalty
  M, X_TS, X_TV, S, G   Kimura 2-parameter with affine gap penalty
25
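Needleman-Wunsch algorithm
W_{i,j} = score of the best alignment of positions
1..i and 1..j in each sequence

W_{i,j} = max( S + W_{i-1,j},
               S + W_{i,j-1},
               (X or M) + W_{i-1,j-1} )

(Dynamic programming matrix figure; sequence letters
along its axes: A C A T T A G A A A G A T T A C C A C A)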
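
The recursion for W can be sketched as follows for the simplest scheme, with parameters M, X and S only (a linear gap penalty, so the gap-open term G is left out). The function name and the boundary conditions (leading spaces cost S each) are assumptions, and the version below keeps the full table rather than using Hirschberg's space-saving technique.

```python
def needleman_wunsch_score(a, b, M=100, X=-100, S=-30):
    """Best global alignment score of a and b with a linear gap penalty."""
    n, m = len(a), len(b)
    W = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        W[i][0] = i * S  # a[:i] aligned entirely against spaces
    for j in range(1, m + 1):
        W[0][j] = j * S  # b[:j] aligned entirely against spaces
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diagonal = M if a[i - 1] == b[j - 1] else X
            W[i][j] = max(
                S + W[i - 1][j],             # space in b
                S + W[i][j - 1],             # space in a
                diagonal + W[i - 1][j - 1],  # match or mismatch
            )
    return W[n][m]


print(needleman_wunsch_score("AC", "AG"))
```

With these weights, "AC" against "AG" scores 40: one match plus two spaces (100 - 30 - 30) beats one match plus one mismatch (100 - 100).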
26
Building Drosophila whole genome multiple
alignments
  • MAVID
  • http://hanuman.math.berkeley.edu/kbrowser
  • MULTIZ
  • http://genome.ucsc.edu/
  • (currently no D. erecta)

27
DroAna_20041206_  CTGAAGGAAT-------TCTATATT---------AAAGAAGATTTCTCATCATTGGTTG
DroMel_4_         CTGCGGGATTAGGGGTCATTAGAGT---------GCCGAAAAGCGA---------GTTT
DroMoj_20041206_  CTGGAATAGTTAATTTCATTGTAACACATAAACGTTTTAAATTCTATTGAAA-------
DroPse_1_         CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG----
DroSim_20040829_  CTGCGGGATTAGGAGTCATTAGAGT---------GCGGAAAAGCGG---------GTT-
DroVir_20041029_  CTGCAGCAGTTAAATA-ATTGTAATAAACAATTCTCT--AATTTGGTCCAAA-------
DroYak_1_         CTGCGGGATTAGCGGTCATTGGTGT---------GAAGAATAGATC---------CTTT

DroAna_20041206_  AATC-----ACTTAC
DroMel_4_         ATTCTATGGACTCAC
DroMoj_20041206_  ----TATTTACTCAC
DroPse_1_         ------TGTACTTAC
DroSim_20040829_  ATTCTATGGACTCAC
DroVir_20041029_  ----TATTTACTCAC
DroYak_1_         ATTTCATAAACTCAC
MAVID
N. Bray and L. Pachter, MAVID: Constrained
ancestral alignment of multiple sequences, Genome
Research 14 (2004), pp. 693-699
28
(No Transcript)
29
droAna1.2448876      CTGAAGGAATTCTA--TATTAAAG-------------------------------
dm2.chr2L            CTGCGGGATTAGGGGTCATTAGAG---------TGCCGAAAAGCGAGT-TTATTC
droMoj1.contig_2959  CTGGAATAGTTAATTTCATTGTAA---------CACATAAA--CGTTTTAAATTC
dp3.chr4_group3      CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG
droSim1.chr2L        CTGCGGGATTAGGAGTCATTAGAG---------TGCGGAAAAGCGGG--TTATTC
droVir1.scaffold_6   CTGCAGCAGTTAA-ATAATTGTAA---------TAAACAA----TTCTCTAATTT
droYak1.chr2L        CTGCGGGATTAGCGGTCATTGGTG---------TGAAGAATAGATCCT-TTATTT

droAna1.2448876      AAGATTTCTCATCATTGGTTGAATC---------------------ACTTAC
dm2.chr2L            -----------------------------------------TATGGACTCAC
droMoj1.contig_2959  -------------------------AAATATTT--------TATTGACTCAC
dp3.chr4_group3      -----------------------------------------TGT--ACTTAC
droSim1.chr2L        -----------------------------------------TATGGACTCAC
droVir1.scaffold_6   ---------------------------------AAATATTTGGTCCACTCAC
droYak1.chr2L        -----------------------------------------CATAAACTCAC

MULTIZ
Blanchette et al., Aligning multiple sequences
with the threaded blockset aligner, Genome
Research 14 (2004), pp. 708-715
30
(No Transcript)
31
One (possibly wrong) alignment is not enough: the
history of parametric inference
  • 1992 Waterman, M., Eggert, M. and Lander, E.
  • Parametric sequence comparisons, Proc. Natl.
    Acad. Sci. USA 89, 6090-6093.
  • 1994 Gusfield, D., Balasubramanian, K. and
    Naor, D.
  • Parametric optimization of sequence alignment,
    Algorithmica 12, 312-326.
  • 2003 Wang, L. and Zhao, J.
  • Parametric alignment of ordered trees,
    Bioinformatics 19, 2237-2245.
  • 2004 Fernández-Baca, D., Seppäläinen, T. and
    Slutzki, G.
  • Parametric Multiple Sequence Alignment and
    Phylogeny Construction, Journal of Discrete
    Algorithms 2, 271-287.

XPARAL by Kristian Stevens and Dan Gusfield
32
Whole Genome Parametric Alignment
Colin Dewey, Peter Huggins, Lior Pachter,
Bernd Sturmfels and Kevin Woods
  • Mathematics and Computer Science
  • Parametric alignment in higher dimensions.
  • Faster new algorithms.
  • Deeper understanding of alignment polytopes.
  • Biology
  • Whole genome parametric alignment.
  • Biological implications of alignment
    parameters.
  • Alignment with biology rather than for biology.