Overview - PowerPoint PPT Presentation

About This Presentation
Title:

Overview

Description:

... to Data Analysis. Data {GTCAT,GTTGGT,GTCA,CTCA} GT-CAT. GTTGGT ... Can align alignments and given a tree make a multiple alignment. alkmny-trwq acdeqrt ... – PowerPoint PPT presentation

Slides: 37
Provided by: Jotun

Transcript and Presenter's Notes

Title: Overview


1
Overview (http://www.stats.ox.ac.uk/people/hein/lectures.htm)
Pairwise Alignment Again
Triple, Quadruple, Many
Similarity-Distance Conversion
Local Alignment
Statistical Alignment: Pairwise, Multiple
Conclusion
2
Approaches to Data Analysis
Data: {GTCAT, GTTGGT, GTCA, CTCA}
Parsimony, similarity, optimisation.
GT-CAT
GTTGGT
GT-CA-
CT-CA-
Statistics
Ideal practice: 1-phase analysis.
Actual practice: 2-phase analysis.
3
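Parsimony Alignment of two strings.
Sequences: s1 = CTAGG, s2 = TTGT.
Basic operations: transitions 2 (C-T, A-G), transversions 5, indels (g) 10.
An alignment of CTAG and TTG ends in one of three ways:
{CTA,TT}_AL followed by the column G/G
{CTA,TTG}_AL followed by the column G/-
{CTAG,TT}_AL followed by the column -/G
Initial condition: $D_{0,0} = 0$, where $D_{i,j} = D(s1[1..i], s2[1..j])$.
$D_{i,j} = \min\{ D_{i-1,j-1} + d(s1[i], s2[j]),\; D_{i,j-1} + g,\; D_{i-1,j} + g \}$
Example: $D_{4,3} = D(CTAG, TTG)$ is the minimum of
$D(CTA,TT) + w(G,G) = 12 + 0 = 12$
$D(CTA,TTG) + w(G,-) = 4 + 10 = 14$
$D(CTAG,TT) + w(-,G) = 22 + 10 = 32$
so $D_{4,3} = 12$.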
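The recursion above can be sketched in a few lines of Python. This is a minimal illustration using the slide's costs (transition 2, transversion 5, indel g = 10); the names `sub_cost` and `parsimony_matrix` are invented for this sketch and are not part of the lecture.

```python
def sub_cost(a, b):
    """Substitution cost: identity 0, transition 2, transversion 5."""
    if a == b:
        return 0
    purines = {"A", "G"}
    # A<->G and C<->T stay within one chemical class (transition).
    return 2 if (a in purines) == (b in purines) else 5


def parsimony_matrix(s1, s2, g=10):
    """Fill D[i][j] = cost of the best alignment of s1[:i] with s2[:j]."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * g  # s1[:i] aligned against gaps only
    for j in range(1, m + 1):
        D[0][j] = j * g  # s2[:j] aligned against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(
                D[i - 1][j - 1] + sub_cost(s1[i - 1], s2[j - 1]),
                D[i][j - 1] + g,
                D[i - 1][j] + g,
            )
    return D


D = parsimony_matrix("CTAGG", "TTGT")
print(D[4][3], D[5][4])  # D(CTAG,TTG) and the full distance
```

With the slide's sequences this gives D(CTAG, TTG) = 12 and a total cost of 17, the values quoted in the transcript.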
4
T  40  32  22  14   9  17
G  30  22  12   4  12  22
T  20  12   2  12  22  32
T  10   2  10  20  30  40
    0  10  20  30  40  50
        C   T   A   G   G
Alignment (i = transition, v = transversion), cost 17:
CTAGG
i   v
TT-GT
5
Alignment of three sequences.
s1 = ATCG, s2 = ATGCC, s3 = CTCC
Alignment:
AT-CG
ATGCC
CT-CC
Consensus sequence: ATCC
Configurations in an alignment column (n = nucleotide, - = gap), $2^3 - 1 = 7$ in all:
n n n - n - -
n n - n - n -
n - n n - - n
Recursion: $D_{i,j,k} = \min\{ D_{i-i',j-j',k-k'} + d(i,i',j,j',k,k') \}$, with $(i',j',k') \in \{0,1\}^3 \setminus \{(0,0,0)\}$
Initial condition: $D_{0,0,0} = 0$.
Running time: $l_1 l_2 l_3 (2^3 - 1)$. Memory requirement: $l_1 l_2 l_3$.
New phenomena: ancestral sequence.
6
Parsimony Alignment of four sequences.
s1 = ATCG, s2 = ATGCC, s3 = CTCC, s4 = ACGCG
Alignment:
AT-CG
ATGCC
CT-CC
ACGCG
Configurations in alignment columns (n = nucleotide, - = gap), $2^4 - 1 = 15$ in all:
n n n n - n n n - - - n - - -
n n n - n n - - n n - - n - -
n n - n n - n - n - n - - n -
n - n n n - - n - n n - - - n
Recursion: $D_i = \min_{\Delta}\{ D_{i-\Delta} + d(i,\Delta) \}$, $\Delta \in \{0,1\}^4 \setminus \{0\}^4$
Initial condition: $D_0 = 0$.
Computation time: $l_1 l_2 l_3 l_4 2^4$. Memory: $l_1 l_2 l_3 l_4$.
7
Alignment of many sequences.
s1 = ATCG, s2 = ATGCC, ......., sn = ACGCG
Alignment:
AT-CG
ATGCC
.....
.....
ACGCG
[Figure: tree relating s1, s2, s3, s4, s5]
Configurations in an alignment column: $2^n - 1$
Recursion: $D_i = \min_{\Delta}\{ D_{i-\Delta} + d(i,\Delta) \}$, $\Delta \in \{0,1\}^n \setminus \{0\}^n$
Initial condition: $D_{0,0,..,0} = 0$.
Computation time: $l^n (2^n - 1) n$. Memory requirement: $l^n$.
(l: sequence length, n: number of sequences)
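The n-dimensional recursion can be sketched as follows. The column cost d is not spelled out in the transcript, so this sketch assumes a star tree: each column is scored against the single best ancestral symbol (a nucleotide or a gap). That scoring rule, and the names `pair_cost`, `column_cost` and `multi_align_cost`, are assumptions made for illustration.

```python
from itertools import product

SYMBOLS = "ACGT-"


def pair_cost(a, b):
    """Identity 0, transition 2, transversion 5, indel 10."""
    if a == b:
        return 0
    if a == "-" or b == "-":
        return 10
    purines = {"A", "G"}
    return 2 if (a in purines) == (b in purines) else 5


def column_cost(column):
    """ASSUMPTION: score a column against its cheapest single ancestor."""
    return min(sum(pair_cost(anc, x) for x in column) for anc in SYMBOLS)


def multi_align_cost(seqs):
    """Exact DP over the len(seqs)-dimensional lattice of prefixes."""
    n = len(seqs)
    lengths = [len(s) for s in seqs]
    # The 2^n - 1 column configurations: which sequences emit a letter.
    moves = [d for d in product((0, 1), repeat=n) if any(d)]
    D = {(0,) * n: 0}
    # product() yields cells in lexicographic order, so every
    # predecessor idx - d has been filled before idx is visited.
    for idx in product(*(range(l + 1) for l in lengths)):
        if not any(idx):
            continue
        best = None
        for d in moves:
            prev = tuple(i - di for i, di in zip(idx, d))
            if min(prev) < 0:
                continue
            column = [seqs[k][idx[k] - 1] if d[k] else "-" for k in range(n)]
            cand = D[prev] + column_cost(column)
            if best is None or cand < best:
                best = cand
        D[idx] = best
    return D[tuple(lengths)]


print(multi_align_cost(["CTAGG", "TTGT"]))
print(multi_align_cost(["ATCG", "ATGCC", "CTCC"]))
```

For two sequences the star-tree column cost reduces to the pairwise cost, so the first call reproduces the pairwise distance. The dictionary holds one entry per lattice cell, which is the $l^n$ memory requirement stated on the slide.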
8
Fitch-Hartigan-Sankoff Algorithm
Costs: transition 2, transversion 5, indel 10.
[Figure: tree with a cost vector over (A,C,G,T) at each node. Root: (9,7,7,7). Internal node: (10,2,10,2). Three leaves, each with an (A,C,G,T) vector containing a 0.]
Indel constraint: the nucleotides form a connected set.
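The cost vectors in the figure can be reproduced with a short Sankoff-style pass. The leaf nucleotides are not readable in the transcript; this sketch assumes leaves C and T under the internal node and a leaf G as the other child of the root, chosen because they reproduce the two vectors that are readable. Indels are left out, and the helper names are invented for the example.

```python
import math

NUCS = "ACGT"


def cost(a, b):
    """Identity 0, transition 2, transversion 5."""
    if a == b:
        return 0
    purines = {"A", "G"}
    return 2 if (a in purines) == (b in purines) else 5


def leaf(nucleotide):
    """Cost vector of a leaf: 0 for the observed state, infinity otherwise."""
    return [0 if n == nucleotide else math.inf for n in NUCS]


def combine(children):
    """Cost vector of a parent: for each parent state, add up the cheapest
    (child state cost + substitution cost) over all children."""
    return [
        sum(min(c[j] + cost(a, NUCS[j]) for j in range(4)) for c in children)
        for a in NUCS
    ]


# ASSUMED leaf labels (not legible in the transcript).
internal = combine([leaf("C"), leaf("T")])
root = combine([internal, leaf("G")])
print(internal, root)
```

Under these assumed leaves the internal node gets (10, 2, 10, 2) and the root (9, 7, 7, 7) over (A, C, G, T), so the most parsimonious cost at the root is 7.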
9
Longer Indels
TCATGGTACCGTTAGCGT
GCA-----------GCAT
$g_k$: cost of indel of length k.
Initial condition: $D_{0,0} = 0$
$D_{i,j} = \min\{ D_{i-1,j-1} + d(s1[i], s2[j]),\; D_{i,j-1} + g_1,\; D_{i,j-2} + g_2,\; D_{i,j-3} + g_3,\; \dots,\; D_{i-1,j} + g_1,\; D_{i-2,j} + g_2,\; D_{i-3,j} + g_3,\; \dots \}$
Cubic running time. Quadratic memory.
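A direct transcription of this recursion makes the cubic cost visible: each cell scans all gap lengths in both directions. The sketch below takes the gap cost as an arbitrary function `gap_cost(k)`; the function names and the example cost 7 + 3k are assumptions for illustration only.

```python
import math


def sub_cost(a, b):
    """Identity 0, transition 2, transversion 5."""
    if a == b:
        return 0
    purines = {"A", "G"}
    return 2 if (a in purines) == (b in purines) else 5


def general_gap_distance(s1, s2, gap_cost):
    """Alignment distance with an arbitrary indel cost gap_cost(k)."""
    n, m = len(s1), len(s2)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            best = math.inf
            if i > 0 and j > 0:
                best = D[i - 1][j - 1] + sub_cost(s1[i - 1], s2[j - 1])
            for k in range(1, j + 1):  # last k letters of s2 against a gap
                best = min(best, D[i][j - k] + gap_cost(k))
            for k in range(1, i + 1):  # last k letters of s1 against a gap
                best = min(best, D[i - k][j] + gap_cost(k))
            D[i][j] = best
    return D[n][m]


print(general_gap_distance("CTAGG", "TTGT", lambda k: 10 * k))
print(general_gap_distance("TCATGGTACCGTTAGCGT", "GCAGCAT", lambda k: 7 + 3 * k))
```

With the linear cost 10k the first call falls back to the simple pairwise distance. The second call uses the slide's two sequences with a hypothetical affine cost, under which one long indel is cheaper than many short ones.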
10
If $g_k = a + bk$, then quadratic running time. Gotoh (1982).
$D_{i,j}$ is split into 3 types:
1. $D0_{i,j}$: as $D_{i,j}$, except s1[i] must match s2[j].
2. $D1_{i,j}$: as $D_{i,j}$, except s2[j] is matched with "-".
3. $D2_{i,j}$: as $D_{i,j}$, except s1[i] is matched with "-".
Then
$D0_{i,j} = \min(D0_{i-1,j-1},\; D1_{i-1,j-1},\; D2_{i-1,j-1}) + d(s1[i], s2[j])$
$D1_{i,j} = \min(D1_{i,j-1} + b,\; D0_{i,j-1} + a + b)$
$D2_{i,j} = \min(D2_{i-1,j} + b,\; D0_{i-1,j} + a + b)$
Comment 1. Evolutionary Consistency Condition: $g_i + g_j > g_{i+j}$
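The three-matrix scheme can be sketched as below. It follows the recursions as written on the slide, in which a gap run is opened only from a match state; here D1 ends with a letter of s2 against a gap and D2 with a letter of s1 against a gap. The function names and the parameter values a = 7, b = 3 are assumptions for illustration.

```python
import math


def sub_cost(a, b):
    """Identity 0, transition 2, transversion 5."""
    if a == b:
        return 0
    purines = {"A", "G"}
    return 2 if (a in purines) == (b in purines) else 5


def gotoh_distance(s1, s2, a, b):
    """Affine-gap alignment distance, indel of length k costs a + b*k."""
    n, m = len(s1), len(s2)
    inf = math.inf
    D0 = [[inf] * (m + 1) for _ in range(n + 1)]  # ends in a match column
    D1 = [[inf] * (m + 1) for _ in range(n + 1)]  # ends with s2[j] vs "-"
    D2 = [[inf] * (m + 1) for _ in range(n + 1)]  # ends with s1[i] vs "-"
    D0[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                D0[i][j] = min(
                    D0[i - 1][j - 1], D1[i - 1][j - 1], D2[i - 1][j - 1]
                ) + sub_cost(s1[i - 1], s2[j - 1])
            if j > 0:  # extend a gap run (b) or open one (a + b)
                D1[i][j] = min(D1[i][j - 1] + b, D0[i][j - 1] + a + b)
            if i > 0:
                D2[i][j] = min(D2[i - 1][j] + b, D0[i - 1][j] + a + b)
    return min(D0[n][m], D1[n][m], D2[n][m])


print(gotoh_distance("CTAGG", "TTGT", a=0, b=10))
print(gotoh_distance("AAAA", "A", a=7, b=3))
```

Each cell does constant work, which is where the quadratic running time comes from. With a = 0 and b = 10 the gap cost is linear and the result agrees with the simple pairwise recursion.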
11
Distance-Similarity. (Smith-Waterman-Fitch, 1982)
$D_{i,j} = \min\{ D_{i-1,j-1} + d(s1[i], s2[j]),\; D_{i,j-1} + g,\; D_{i-1,j} + g \}$
$S_{i,j} = \max\{ S_{i-1,j-1} + s(s1[i], s2[j]),\; S_{i,j-1} - w,\; S_{i-1,j} - w \}$
Distance parameters: transitions 2, transversions 5, indels 10.
M: largest distance between two nucleotides (5).
Similarity: $s(n_1, n_2) = M - d(n_1, n_2)$, $w_k = k/(2M) + g_k$, $w = 1/(2M) + g$
Similarity parameters: transversions 0, transitions 3, identity 5, indels 10 + 1/10.
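The conversion can be checked on a fixed alignment by scoring it both ways. The sketch below scores the two rows of an alignment column by column with the distance parameters and with the derived similarity parameters (s = M - d, w = 1/(2M) + g); the function names are invented for this example.

```python
def sub_cost(a, b):
    """Identity 0, transition 2, transversion 5."""
    if a == b:
        return 0
    purines = {"A", "G"}
    return 2 if (a in purines) == (b in purines) else 5


def score_alignment(row1, row2, M=5, g=10):
    """Return (distance, similarity) of a given gapped alignment."""
    w = 1 / (2 * M) + g  # similarity penalty for one indel
    distance, similarity = 0, 0.0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            distance += g
            similarity -= w
        else:
            d = sub_cost(a, b)
            distance += d
            similarity += M - d  # s(n1, n2) = M - d(n1, n2)
    return distance, similarity


print(score_alignment("CTAGG", "TT-GT"))
```

For the alignment CTAGG / TT-GT this gives distance 17 and similarity 3 + 5 - 10.1 + 5 + 0 = 2.9.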
12
Each cell: distance/similarity.
T  40/-40.4  32/-27.3  22/-12.2  14/0.9    9/11.0    17/2.9
G  30/-30.3  22/-17.2  12/-2.1   4/11.0    12/2.9    22/-7.2
T  20/-20.2  12/-7.1   2/8.0     12/-2.1   22/-12.2  32/-22.3
T  10/-10.1  2/3.0     10/-7.1   20/-17.2  30/-27.3  40/-37.4
   0/0       10/-10.1  20/-20.2  30/-30.3  40/-40.4  50/-50.5
             C         T         A         G         G
Comments
1. The switch from Dist to Sim is highly analogous to maximizing -f(x) instead of minimizing f(x).
2. Dist will be based on a metric:
i. $d(x,x) = 0$
ii. $d(x,y) > 0$ for $x \ne y$
iii. $d(x,y) = d(y,x)$
iv. $d(x,z) + d(z,y) \ge d(x,y)$
There are no analogous restrictions on Sim, giving it a larger parameter space.
13
Local alignment. Smith, Waterman (1981)
Global alignment: $S_{i,j} = \max\{ S_{i-1,j-1} + s(s1[i], s2[j]),\; S_{i,j-1} - w,\; S_{i-1,j} - w \}$
Local alignment: $S_{i,j} = \max\{ S_{i-1,j-1} + s(s1[i], s2[j]),\; S_{i,j-1} - w,\; S_{i-1,j} - w,\; 0 \}$
Score parameters: match 1, mismatch -1/3, gap 1 + k/3.
Best local alignment:
GCC-UCG
GCCAUUG
[Figure: local score matrix. Column labels: C A G C C U C G C U U. Largest entry: 3.3.]
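The local recursion differs from the global one only by the extra 0, which lets an alignment restart anywhere. A minimal sketch with the slide's score parameters (match 1, mismatch -1/3, gap of length k costing 1 + k/3) follows. It is run on the two aligned fragments only, since the full sequences are not legible in the transcript; the function name is invented for the example.

```python
def local_alignment_score(s1, s2, match=1.0, mismatch=-1 / 3,
                          gap=lambda k: 1 + k / 3):
    """Best local alignment score with length-dependent gap penalties."""
    n, m = len(s1), len(s2)
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = match if s1[i - 1] == s2[j - 1] else mismatch
            score = H[i - 1][j - 1] + step
            for k in range(1, i + 1):  # gap of length k in s2
                score = max(score, H[i - k][j] - gap(k))
            for k in range(1, j + 1):  # gap of length k in s1
                score = max(score, H[i][j - k] - gap(k))
            H[i][j] = max(score, 0.0)  # the 0 allows a fresh start
            best = max(best, H[i][j])
    return best


print(round(local_alignment_score("GCCUCG", "GCCAUUG"), 2))
```

The alignment GCC-UCG / GCCAUUG has five matches, one mismatch and one gap of length 1, so its score is 5 - 1/3 - 4/3 = 10/3, in line with the largest entry of about 3.3 in the slide's matrix.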
14
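Progressive Alignment (Feng-Doolittle 1987, J. Mol. Evol.)
Can align alignments and, given a tree, make a multiple alignment.
First alignment:
alkmny-trwq
akkmdyftrwq
kkkmemftrwq
Second alignment:
acdeqrt
acdehrt
Score for aligning the column (n, d, e) with the column (q, h):
[P(n,q) + P(n,h) + P(d,q) + P(d,h) + P(e,q) + P(e,h)]/6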
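The column score is the average of the pairwise scores over all pairs drawn from the two columns. A minimal sketch follows; the pair score P is not given in the transcript, so a 0/1 identity score stands in for it purely as an assumption, and gap characters are not treated specially.

```python
from itertools import product


def column_score(col_a, col_b, pair_score):
    """Average pair_score over all (x, y) with x in col_a and y in col_b."""
    pairs = list(product(col_a, col_b))
    return sum(pair_score(x, y) for x, y in pairs) / len(pairs)


def identity_score(x, y):
    """HYPOTHETICAL stand-in for the substitution score P(x, y)."""
    return 1.0 if x == y else 0.0


# Column (n, d, e) from the first alignment against (q, h) from the second:
print(column_score("nde", "qh", identity_score))
# First columns of the two alignments: (a, a, k) against (a, a):
print(column_score("aak", "aa", identity_score))
```

A real implementation would plug in an amino acid score matrix for `pair_score` and use `column_score` as the substitution term of an ordinary pairwise recursion over alignment columns.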

Sodh  atkavcvlkgdgpqvqgsinfeqkesdgpvkvwgsikglte-glhgfhvh qfg----ndtagct sagphfnp lsrk
Sodb  atkavcvlkgdgpqvqgtinfeak-gdtvkvwgsikglte-glhgfhvh qfg----ndtagct sagphfnp lsrk
Sodl  atkavcvlkgdgpqvqgsinfeqkesdgpvkvwgsikglte-glhgfhvh qfg----ndtagct sagphfnp lsrk
Sddm  atkavcvlkgdgpqvq -infeak-gdtvkvwgsikglte-glhgfhvh qfg----ndtagct sagphfnp lsrk
Sdmz  atkavcvlkgdgpqvq infeqkesdgpvkvwgsikglteglhgfhvh qfg----ndtagct sagphfnp Lsrk
Sods  vatkavcvlkgdgpqvq infeak-gdtvkvwgsikgltepnglhgfhv hqfg----ndtagct sagphfnp lsrk
Sdpb  datkavcvlkgdgpqvq-infeqkesdgpv----wgsikgltglhgfhv hqfgscasndtagctvlggssagphfnpehtnk
[Figure: tree with leaves sddm, Sodb, Sodl, Sodh, Sdmz, sods, Sdpb]
15
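Thorne-Kishino-Felsenstein Process
$\lambda < \mu$
$P(s) = (1 - \lambda/\mu)\,(\lambda/\mu)^{l}\,\pi_A^{\#A} \cdots \pi_T^{\#T}$, where l is the length of s and #A, ..., #T count the nucleotides in s.
Time reversible.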
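The equilibrium distribution is a geometric length distribution times independent letter frequencies. The sketch below evaluates log P(s) and the implied mean length λ/(μ - λ). The function names and the uniform frequencies are assumptions; the rate values are the λt and μt estimates quoted later in the transcript for the globin pair.

```python
import math


def tkf_log_equilibrium(seq, lam, mu, pi):
    """log P(s) = log(1 - lam/mu) + l*log(lam/mu) + sum of log pi[letter]."""
    r = lam / mu
    return (math.log(1 - r) + len(seq) * math.log(r)
            + sum(math.log(pi[c]) for c in seq))


def expected_length(lam, mu):
    """Mean of the geometric length distribution (1 - r) r^l, r = lam/mu."""
    return lam / (mu - lam)


uniform = {n: 0.25 for n in "ACGT"}  # ASSUMED letter frequencies
print(tkf_log_equilibrium("ACGT", 0.0371805, 0.0374396, uniform))
print(expected_length(0.0371805, 0.0374396))
```

With those two rates the expected length comes out near 143.5, which agrees with the E(Length) value reported for the globin example.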
16
Time reversibility
$P_{i,j}(t)$: probability that i has evolved into j after time t.
$\pi(i)$: probability of i after infinitely long time (equilibrium distribution).
$\pi(i)\, P_{i,j}(t) = \pi(j)\, P_{j,i}(t)$
[Figure: time line t1 ---- t2 ---- t3]
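Detailed balance is easy to verify numerically for a concrete model. The transcript does not name a substitution model, so the sketch below assumes a simple one in which a state is kept with probability e^{-t} and otherwise redrawn from the equilibrium frequencies; the frequencies and function names are likewise made up for the example.

```python
import math

PI = {"A": 0.1, "C": 0.2, "G": 0.3, "T": 0.4}  # ASSUMED equilibrium


def transition_prob(i, j, t, pi=PI):
    """ASSUMED model: P_ij(t) = e^-t [i == j] + (1 - e^-t) pi_j."""
    keep = math.exp(-t)
    return keep * (i == j) + (1 - keep) * pi[j]


def is_reversible(t, pi=PI, tol=1e-12):
    """Check pi(i) P_ij(t) == pi(j) P_ji(t) for every pair of states."""
    return all(
        abs(pi[i] * transition_prob(i, j, t, pi)
            - pi[j] * transition_prob(j, i, t, pi)) <= tol
        for i in pi for j in pi
    )


print(is_reversible(0.5))
```

For this model both sides equal π(i) π(j) (1 - e^{-t}) whenever i differs from j, so the check holds for every t.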
17
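Diff. Equations for p-functions
$\Delta p_k = \Delta t\,[\lambda (k-1)\, p_{k-1} + \mu k\, p_{k+1} - (\lambda+\mu) k\, p_k]$
$\Delta p'_k = \Delta t\,[\lambda (k-1)\, p'_{k-1} + \mu (k+1)\, p'_{k+1} - (\lambda+\mu) k\, p'_k + \mu\, p_{k+1}]$
$\Delta p''_k = \Delta t\,[\lambda k\, p''_{k-1} + \mu (k+1)\, p''_{k+1} - ((k+1)\lambda + \mu k)\, p''_k]$
Initial conditions: $p_k(0) = p'_k(0) = p''_k(0) = 0$ for $k > 1$; $p_1(0) = p''_0(0) = 1$; $p'_0(0) = 0$.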
18
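λ and μ into Alignment Blocks
A. Amino acids ignored
[Figure: the three kinds of alignment block, each with k descendants]
$p_k(t) = e^{-\mu t}\,[1-\lambda\beta(t)]\,[\lambda\beta(t)]^{k-1}$
$p'_k(t) = [1 - e^{-\mu t} - \mu\beta(t)]\,[1-\lambda\beta(t)]\,[\lambda\beta(t)]^{k-1}$
$p''_k(t) = [1-\lambda\beta(t)]\,[\lambda\beta(t)]^{k}$
$p'_0(t) = \mu\beta(t)$
where $\beta(t) = \dfrac{1 - e^{(\lambda-\mu)t}}{\mu - \lambda e^{(\lambda-\mu)t}}$
B. Amino acids considered
T - - -
R Q S W
$P_t(T \to R)\,\pi_Q \cdots \pi_W\, p_4(t)$
T - - - -
- R Q S W
$\pi_R\, \pi_Q \cdots \pi_W\, p'_4(t)$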
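The closed forms can be evaluated directly. The sketch below assumes the usual TKF expression β(t) = (1 - e^{(λ-μ)t}) / (μ - λ e^{(λ-μ)t}) and the indexing in which k counts descendants, starting at 1 for a surviving link and at 0 for a dead or immortal link. The function names are invented, and the rates are the λt and μt estimates quoted later in the transcript, used with t = 1.

```python
import math


def beta(lam, mu, t):
    e = math.exp((lam - mu) * t)
    return (1 - e) / (mu - lam * e)


def p_survive(k, lam, mu, t):
    """Link survives and has k descendants including itself (k >= 1)."""
    if k < 1:
        return 0.0
    lb = lam * beta(lam, mu, t)
    return math.exp(-mu * t) * (1 - lb) * lb ** (k - 1)


def p_die(k, lam, mu, t):
    """Link dies and leaves k descendants (k >= 0)."""
    b = beta(lam, mu, t)
    if k == 0:
        return mu * b
    lb = lam * b
    return (1 - math.exp(-mu * t) - mu * b) * (1 - lb) * lb ** (k - 1)


def p_immortal(k, lam, mu, t):
    """Immortal link leaves k mortal descendants (k >= 0)."""
    lb = lam * beta(lam, mu, t)
    return (1 - lb) * lb ** k


lam, mu, t = 0.0371805, 0.0374396, 1.0
for k in range(3):
    print(k, p_survive(k, lam, mu, t), p_die(k, lam, mu, t),
          p_immortal(k, lam, mu, t))
```

The three families are probability distributions: the survive and die terms together sum to 1, and so do the immortal-link terms. At these rates β(t) is about 0.964 and p_1 about 0.93, close to the table of p-functions given later in the transcript.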
19
Basic Pairwise Recursion (O(length^3))
[Figure: recursion cases for the last link of s1, which either survives or dies; axis labels i-1, i and j-2, j-1, j.]
20
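Fundamental Pairwise Recursion.
Notation: $s1_i$ is the prefix s1[1..i], and $\pi_{s2[a..b]}$ is the product of the equilibrium frequencies of s2[a], ..., s2[b].
$P(s1_i \to s2_j) = p'_0\, P(s1_{i-1} \to s2_j) + \sum_{k=1}^{j} \left[ p_k\, f(s1[i], s2[j-k+1]) + p'_k\, \pi_{s2[j-k+1]} \right] \pi_{s2[j-k+2..j]}\, P(s1_{i-1} \to s2_{j-k})$
Initial condition: $P(s1_0 \to s2_j) = p''_j\, \pi_{s2[1..j]}$
Simplification:
$R_{i,j} = \left( p_1\, f(s1[i], s2[j]) + p'_1\, \pi_{s2[j]} \right) P(s1_{i-1} \to s2_{j-1})$
$P(s1_i \to s2_j) = R_{i,j} + \lambda\beta\, \pi_{s2[j]}\, P(s1_i \to s2_{j-1}) + p'_0 \left[ P(s1_{i-1} \to s2_j) - \lambda\beta\, \pi_{s2[j]}\, P(s1_{i-1} \to s2_{j-1}) \right]$
Probability of observation: $P(s1, s2) = P(s1)\, P(s1 \to s2)$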
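The simplification can be checked numerically by computing P(s1 → s2) twice: once by summing over the number k of letters produced by each link of s1 (cubic time), and once with the quadratic form that reuses the previous column. Both use the geometric structure p_k = p_1 (λβ)^{k-1} and p'_k = p'_1 (λβ)^{k-1}. The sketch below is an illustration under several assumptions: the substitution model f, the uniform frequencies, the rate values and all function names are made up for the example.

```python
import math


def tkf_terms(lam, mu, t):
    """Return (p'_0, p_1, p'_1, lambda*beta) from the TKF closed forms."""
    e = math.exp((lam - mu) * t)
    b = (1 - e) / (mu - lam * e)
    lb, alive = lam * b, math.exp(-mu * t)
    return mu * b, alive * (1 - lb), (1 - alive - mu * b) * (1 - lb), lb


def subst(a, b, pi, t):
    """ASSUMED model f: keep the letter, or redraw it from pi."""
    keep = math.exp(-t)
    return keep * (a == b) + (1 - keep) * pi[b]


def first_row(s2, lb, pi):
    """Initial condition: immortal link emits s2[:j], p''_j * pi-product."""
    row, prod = [], 1.0
    for j in range(len(s2) + 1):
        if j > 0:
            prod *= pi[s2[j - 1]]
        row.append((1 - lb) * lb ** j * prod)
    return row


def tkf_cubic(s1, s2, lam, mu, t, pi):
    pd0, p1, pd1, lb = tkf_terms(lam, mu, t)
    P = [first_row(s2, lb, pi)]
    for i in range(1, len(s1) + 1):
        row = []
        for j in range(len(s2) + 1):
            total = pd0 * P[i - 1][j]  # link i dies without descendants
            tail = 1.0                 # pi-product of the inserted letters
            for k in range(1, j + 1):  # link i accounts for last k letters
                first = s2[j - k]
                head = p1 * subst(s1[i - 1], first, pi, t) + pd1 * pi[first]
                total += head * lb ** (k - 1) * tail * P[i - 1][j - k]
                tail *= pi[first]
            row.append(total)
        P.append(row)
    return P[-1][-1]


def tkf_quadratic(s1, s2, lam, mu, t, pi):
    pd0, p1, pd1, lb = tkf_terms(lam, mu, t)
    P = [first_row(s2, lb, pi)]
    for i in range(1, len(s1) + 1):
        row = [pd0 * P[i - 1][0]]
        for j in range(1, len(s2) + 1):
            c = s2[j - 1]
            R = (p1 * subst(s1[i - 1], c, pi, t) + pd1 * pi[c]) * P[i - 1][j - 1]
            row.append(R + lb * pi[c] * row[j - 1]
                       + pd0 * (P[i - 1][j] - lb * pi[c] * P[i - 1][j - 1]))
        P.append(row)
    return P[-1][-1]


pi = {n: 0.25 for n in "ACGT"}
print(tkf_cubic("ACGT", "AGTTC", 0.03, 0.04, 1.0, pi))
print(tkf_quadratic("ACGT", "AGTTC", 0.03, 0.04, 1.0, pi))
```

The two functions agree up to rounding, because every block of length k + 1 ending at position j is λβ π_{s2[j]} times a block of length k ending at position j - 1.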
21
α-globin (141) and β-globin (146)
430.108: -log(α-globin)
327.320: -log(α-globin → β-globin)
730.428: -log(l(sumalign))
λt = 0.0371805 ± 0.0135899
μt = 0.0374396 ± 0.0136846
st = 0.91701 ± 0.119556
E(Length) = 143.499, E(Insertions, Deletions) = 5.37255, E(Substitutions) = 131.59
Maximum contributing alignment:
V-LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS--H---GSAQVKGHGKKVADAL
VHLTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAF

TNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
SDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH
Ratio l(maxalign)/l(sumalign) = 0.00565064
22
Probability for substitution: 0.46
children  p             p'            p''
0         0.0           3.60 × 10^-2  9.64 × 10^-1
1         9.28 × 10^-1  6.30 × 10^-4  3.45 × 10^-2
2         3.32 × 10^-2  2.26 × 10^-5  1.23 × 10^-3
3         1.19 × 10^-3  8.10 × 10^-7  4.43 × 10^-5
4         4.27 × 10^-5  2.90 × 10^-8  1.59 × 10^-6
5         1.53 × 10^-6  1.04 × 10^-9  5.70 × 10^-8
β(t) = 9.64 × 10^-1
λβ(t) = 3.46 × 10^-2
23
Length Evolution Immortal Link
24
Accelerations of pairwise algorithm
1
2 - Better numerical search
3 - Simpler recursion
4 - Better computers
1991 → 2000: a 10^6 acceleration for 2 proteins 1500 long.
25
Likelihood Surface
26
Homology Test
$W_{i,j} = -\ln\left( \pi_i P^{2.5}_{i,j} / (\pi_i \pi_j) \right)$
D(s1, s2) is evaluated in the distribution of D(s1, s2*), where s2* is a permuted s2.
Real:
s1   ATWYFCAK-AC
s2   ETWYKCALLAD
Random:
s1   ATWYFC-AKAC
s2*  LTAYKADCWLE
This test:
1. Tests the competing hypothesis that 2 sequences are 2.5 events apart versus infinitely far apart.
2. It only handles substitutions correctly. The rationale for indel costs is more arbitrary.
3. It samples in ($\pi_i \pi_j$) by permuting the order of amino acids in the second sequence, i.e. it uses drawing without replacement (a hypergeometric distribution).
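The permutation procedure can be sketched as follows: compute the distance between the real sequences, then recompute it many times with the second sequence shuffled, and report how often a shuffled sequence does at least as well. The log-odds costs W are not available from the transcript, so a unit-cost edit distance stands in for D as an assumption; the function names, number of permutations and seed are also made up.

```python
import random


def edit_distance(s1, s2):
    """ASSUMED stand-in for D: mismatch 1, indel 1, identity 0."""
    prev = list(range(len(s2) + 1))
    for i, a in enumerate(s1, 1):
        cur = [i]
        for j, b in enumerate(s2, 1):
            cur.append(min(prev[j - 1] + (a != b), prev[j] + 1, cur[j - 1] + 1))
        prev = cur
    return prev[-1]


def permutation_test(s1, s2, n_perm=200, seed=0):
    """Return (observed distance, estimated p-value under shuffling of s2)."""
    rng = random.Random(seed)
    observed = edit_distance(s1, s2)
    letters = list(s2)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(letters)  # same composition, random order
        if edit_distance(s1, "".join(letters)) <= observed:
            hits += 1
    # Count the observed arrangement itself so the estimate is never 0.
    return observed, (hits + 1) / (n_perm + 1)


print(permutation_test("ATWYFCAKAC", "ETWYKCALLAD"))
```

Shuffling keeps the amino acid composition of the second sequence fixed, which is the drawing-without-replacement scheme described in point 3.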
27
(No Transcript)
28
Steel-Hein Algorithm
[Figure: tree with ancestor a and sequences s1, s2, s3; sequence labels TTGT, ACGC, ACGGT.]
29
Binary Tree Problem
[Figure: tree with ancestors a1, a2 and sequences s1, s2, s3, s4; sequence labels TGA, ACCT, GTT, ACG.]
30
Markov Chains Generating the p-functions.
31
Generating Ancestral Sequences
1 sequence: transition probabilities λ/μ and 1 - λ/μ (the latter into the end state E).
2 sequences (a1, a2): transition probabilities appearing in the chain:
λβ
(λ/μ)(1 - λβ)e^{-μ}
(λ/μ)(1 - λβ)(1 - e^{-μ})
(1 - λ/μ)(1 - λβ)
[Figure: transition table of the two-sequence chain, ending in state E.]
32
Fundamental Multiple Recursion I
[Figure: sequences s1, s2, s3, s4 related through ancestors a1 and a2.]
Sum over all:
String partitions
Anc. state survivals
Anc. state MC jumps
33
Fundamental Multiple Recursion II
Pa(Sk) Epifixes (S1k1l1) starting in given
MC starts i state a.
Pa(Sk)
Where P(kS i,H ! ?)
F(kSi,H)
34
Fundamental 4 sequence Recursion
Not a proper recursion!
Initialisation: P_EE(Ø) = 1 and P_a(Ø) are directly calculable.
O(l^{2k}) shown, O(l^k) algorithm possible.
Toy 4-sequence program almost ready.
This approach could analyse up to 6-7 sequences.
Jens Ledet and others are working on a Gibbs sampler approach.
35
Statistical Alignment Summary
Motivation for statistical alignment: data is sequences, not alignment!
Problems ahead:
Longer insertion-deletion process
Position-heterogeneous process
Simultaneous comparative gene finding and alignment
Making a TKF process with a given HMM as stationary distribution
Explore non-TKF processes
36
References