Phylogenetic trees

1
Phylogenetic trees: What to look for and where?
Lessons from Statistical Physics
Elchanan Mossel, U.C. Berkeley and Microsoft Research
mossel_at_stat.berkeley.edu, www.stat.berkeley.edu/mossel/
2
Statistical physics
  • Statistical physics is a sub-field of
    mathematical physics studying complex systems
    with simple microscopic interactions.
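  • The Ising model on a graph $G = (V,E)$ is a
    probability measure (Gibbs distribution) on the
    space of configurations $\sigma : V \to \{-1,1\}$ such that
    $P[\sigma]$ is given by
    $\exp\big(\sum_{(v,w) \in E} \sigma(v)\sigma(w)/T\big)/Z = \exp\big(\beta \sum_{(v,w) \in E} \sigma(v)\sigma(w)\big)/Z$.
  • Or, $\mathrm{Weight}(\sigma) = \exp\big(\beta \sum_{u \sim v} \sigma(u)\sigma(v)\big)$.
  • Traditionally studied on cubes in $\mathbb{Z}^d$.

[Figure: The Ising model on a 200 x 200 grid]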
3
Statistical physics - intuition
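  • The Ising model on the $n \times n$ grid is given by
    $P[\sigma] = \exp\big(\sum_{(v,w) \in E} \sigma(v)\sigma(w)/T\big)/Z = \exp\big(\beta \sum_{(v,w) \in E} \sigma(v)\sigma(w)\big)/Z$.
  • We expect that:
  • $T$ small, $\beta$ large $\Rightarrow$ strong correlations:
    $\mathrm{Corr}(\sigma_{\mathrm{boundary}}, \sigma_0) > \epsilon > 0$ for all $n$.
  • $T$ large, $\beta$ small $\Rightarrow$ weak correlations:
    $\mathrm{Corr}(\sigma_{\mathrm{boundary}}, \sigma_0) \to 0$ as $n \to \infty$.

[Figure: grid of side $2n$ with the origin 0 and the boundary marked]
  • Onsager (1944) proved it, where the
    critical value is $\beta = \beta_c = \ln(1 + 2^{1/2})/2$.
  • For most other graphs, we know very little.

[Figure: The Ising model on a 200 x 200 grid, $\beta = \beta_c$]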
4
Statistical physics on trees
  • The Ising model on a tree $T = (V,E)$ is given by
    $P[\sigma] = \exp\big(\sum_{(v,w) \in E} \beta(v,w)\, \sigma(v)\sigma(w)\big)/Z$.
  • It is equivalent to the following model:
  • Let $r$ be a root (chosen arbitrarily).
  • Let $\sigma(r) = \pm 1$ with probability ½ each, and for
    each edge $(u,v)$ directed away from the root, let
  • $\sigma(v) = \sigma(u)$ with probability $\theta(u,v)$;
  • $\sigma(v)$ is an independent $\pm 1$ otherwise.
  • $\theta(u,v) = \big(e^{\beta(u,v)} - e^{-\beta(u,v)}\big) / \big(e^{\beta(u,v)} + e^{-\beta(u,v)}\big) = \tanh \beta(u,v)$.
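
The equivalence above can be checked by brute force on a small tree. The sketch below is only an illustration: the tree, its edge weights, and the function names are hypothetical choices, not taken from the slides. It enumerates all spin configurations, computes the Gibbs measure directly from the edge weights $\beta(v,w)$, computes the law of the copy-or-resample process with $\theta = \tanh \beta$, and compares the two.

```python
import itertools
import math

# Hypothetical toy tree: vertex 0 is the root; each edge is (parent, child, beta).
EDGES = [(0, 1, 0.3), (0, 2, 0.8), (1, 3, 0.5), (1, 4, 1.2)]
N = 5  # number of vertices

def gibbs_distribution(edges, n):
    """Ising measure: P[sigma] proportional to exp(sum of beta(v,w) * sigma(v) * sigma(w))."""
    weights = {}
    for sigma in itertools.product((-1, 1), repeat=n):
        energy = sum(beta * sigma[u] * sigma[v] for u, v, beta in edges)
        weights[sigma] = math.exp(energy)
    z = sum(weights.values())  # partition function Z
    return {sigma: w / z for sigma, w in weights.items()}

def broadcast_distribution(edges, n):
    """Root is uniform; each child copies its parent w.p. theta, else is a fresh uniform spin."""
    dist = {}
    for sigma in itertools.product((-1, 1), repeat=n):
        prob = 0.5  # uniform root spin
        for u, v, beta in edges:
            theta = math.tanh(beta)
            # Copy contributes theta only if spins agree; resampling contributes (1 - theta) / 2.
            prob *= theta * (sigma[u] == sigma[v]) + (1 - theta) / 2
        dist[sigma] = prob
    return dist

gibbs = gibbs_distribution(EDGES, N)
broadcast = broadcast_distribution(EDGES, N)
max_gap = max(abs(gibbs[s] - broadcast[s]) for s in gibbs)
print(f"largest difference between the two measures: {max_gap:.2e}")
```

The two dictionaries should agree up to floating-point rounding, because each edge contributes a factor $e^{\pm\beta}/(2\cosh\beta) = (1 \pm \tanh\beta)/2$ in both descriptions.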





5
Ising Model on binary Trees
[Figure: three regimes (low, interm., high), labeled "bias" or "no bias", with "typical boundary" marked]
Unique Gibbs measure: $\forall e$, $2\theta(e) \le 1$
Extremality: $\forall e$, $2\theta(e) > 1$ and $\forall e$, $2\theta^2(e) \le 1$
Non-extremality: $\forall e$, $2\theta(e)^2 > 1$
6
Statistical physics on trees: History
  • Uniqueness studied by Bethe (1930s).
  • Extremality phase more recently: Spitzer 75,
    Higuchi 77, Bleher-Ruiz-Zagrebnov 95,
    Evans-Kenyon-Peres-Schulman 2000, Ioffe 99, M 98,
    Haggstrom-M 2000, Kenyon-M-Peres 2001,
    Martinelli-Sinclair-Weitz 2003, Martin 2003.
  • Many problems are still open.
  • Extremality has rich connections with:
  • Noisy computation/communication:
    von Neumann 53, Evans-Schulman 00.
  • Mixing of Markov chains: Berger-Kenyon-Mossel-Peres 01,
    Martinelli-Sinclair-Weitz 05.
  • Spin glasses and random SAT problems:
    Parisi, Mezard, Montanari; Mezard-Montanari 06.

7
Phylogeny
  • Phylogeny: the true evolutionary relationships
    between groups of living things.

[Figure: family tree with the names Noah, Shem, Ham, Japheth, Cush, Mizraim, Kannan]
8
History of Phylogeny
  • Intuitively: animal kingdom or plant
    kingdom.
  • More scientifically: morphology, fossils, etc.
    (Darwin).
  • But: is a human more like a great ape or like a
    chimpanzee?

[Figure labels: "No brain, can't move"; "Stupid, walks"; "Stupid, swims"; "Stupid, flies"; "Too smart, barely moves"]
9
Molecular Phylogeny
  • Molecular phylogeny: based on DNA, RNA or protein
    sequences of organisms.
  • Mutation mechanisms:
  • Substitutions
  • Transpositions
  • Insertions, deletions, etc.
  • We will only consider substitutions
    and assume sequences are aligned.

[Figure: family tree with a DNA sequence at each node. Root: Noah (acctga). Second level: Shem, Ham, Japheth (acctga, acctaa, acctga). Third level: Put, Cush, Mizraim, Kannan (acctga, acctga, agctga, acctga).]
10
Simplifying assumptions: models
  • Assumption: Letters of sequences (characters)
    evolve independently and identically.
  • CFN model: the first stochastic model, invented by
    Cavender, Farris and Neyman (70s):
  • Let $\sigma(r) = \pm 1$ with probability ½ each, and for
    each edge $(u,v)$ directed away from the root, let
  • $\sigma(v) = \sigma(u)$ with probability $\theta(u,v)$;
  • $\sigma(v)$ is an independent $\pm 1$ otherwise.
  • This is exactly the Ising model on the
    evolutionary tree!
  • Dictionary: C,T = $+$ (pyrimidine group); A,G = $-$
    (purine group).
  • Some results can be generalized to other models.
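
As a rough illustration of the CFN model, the following sketch evolves $k$ independent characters down a small rooted tree and collects the sequences seen at the leaves. The tree shape, the $\theta$ values, the seed, and all function names are hypothetical. It also checks a consequence of the model: the correlation between two leaves is the product of the $\theta(e)$ along the path joining them.

```python
import random

# Hypothetical rooted tree as child lists; THETA[(u, v)] is the copy probability on edge (u, v).
CHILDREN = {"r": ["u", "w"], "u": ["a", "b"], "w": ["c", "d"]}
THETA = {("r", "u"): 0.8, ("r", "w"): 0.8,
         ("u", "a"): 0.9, ("u", "b"): 0.9,
         ("w", "c"): 0.9, ("w", "d"): 0.9}

def evolve_character(children, theta, root, rng):
    """One character: uniform +/-1 at the root, then copy or resample along each edge."""
    state = {root: rng.choice((-1, 1))}
    stack = [root]
    while stack:
        u = stack.pop()
        for v in children.get(u, []):
            if rng.random() < theta[(u, v)]:
                state[v] = state[u]               # copy the parent
            else:
                state[v] = rng.choice((-1, 1))    # independent uniform spin
            stack.append(v)
    return state

def leaf_sequences(children, theta, root, k, seed=0):
    """Evolve k i.i.d. characters and return the length-k sequence at each leaf."""
    rng = random.Random(seed)
    leaves = sorted(v for kids in children.values() for v in kids if v not in children)
    seqs = {leaf: [] for leaf in leaves}
    for _ in range(k):
        state = evolve_character(children, theta, root, rng)
        for leaf in leaves:
            seqs[leaf].append(state[leaf])
    return seqs

def correlation(x, y):
    """Empirical correlation of two +/-1 sequences (both have mean zero under the model)."""
    return sum(a * b for a, b in zip(x, y)) / len(x)

seqs = leaf_sequences(CHILDREN, THETA, "r", k=20000)
corr_ab = correlation(seqs["a"], seqs["b"])  # path a-u-b: about 0.9 * 0.9
corr_ac = correlation(seqs["a"], seqs["c"])  # path a-u-r-w-c: about 0.9 * 0.8 * 0.8 * 0.9
print(corr_ab, corr_ac)
```

Only the leaf sequences would be available to a reconstruction method; the internal states are discarded here on purpose.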

11
Simplifying assumptions: trees
  • Assumption 1: Evolution is on a tree.
  • Assumption 2: Trees are binary, i.e. all internal
    degrees are 3.
  • Given a set of species (labeled vertices) $X$, an
    X-tree is a tree which has $X$ as the set of
    leaves.
  • Two X-trees $T_1$ and $T_2$ are identical if there is a
    graph isomorphism between $T_1$ and $T_2$ that is the
    identity map on $X$.
  • Most results extend to trees all of whose internal
    degrees are at least 3.
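
One practical way to test whether two X-trees are identical is to compare their splits: removing an edge cuts the leaf set into two parts, and a tree whose internal degrees are all at least 3 is determined by its set of such bipartitions. The sketch below assumes trees given as plain edge lists; the representation, the function names, and the example trees are illustrative only.

```python
def splits(edges, leaves):
    """Return the set of nontrivial leaf bipartitions induced by the edges of a tree."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    leaves = frozenset(leaves)
    result = set()
    for u, v in edges:
        # Collect the leaves on v's side once the edge (u, v) is removed.
        side, stack, seen = set(), [v], {u, v}
        while stack:
            x = stack.pop()
            if x in leaves:
                side.add(x)
            for y in adj[x]:
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
        side = frozenset(side)
        other = leaves - side
        if len(side) > 1 and len(other) > 1:  # skip splits that isolate a single leaf
            result.add(frozenset((side, other)))
    return result

def same_x_tree(edges1, edges2, leaves):
    """Two X-trees are identical exactly when they induce the same splits of X."""
    return splits(edges1, leaves) == splits(edges2, leaves)

X = ["a", "b", "c", "d"]
T1 = [("a", "u"), ("b", "u"), ("u", "w"), ("c", "w"), ("d", "w")]
T2 = [("a", "p"), ("b", "p"), ("p", "q"), ("c", "q"), ("d", "q")]  # internal names differ
T3 = [("a", "u"), ("c", "u"), ("u", "w"), ("b", "w"), ("d", "w")]  # a is paired with c

print(same_x_tree(T1, T2, X), same_x_tree(T1, T3, X))
```

Renaming internal vertices does not change the X-tree, while regrouping the leaves does.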

[Figure: X-trees on the leaves a, b, c, d with internal vertices u, v, w]
12
The Phylogenetic Challenge
[Figure labels: Time; Contemporary genetic sequences; Evolutionary model; Genetic sequences; ??]
  • How can we reconstruct the phylogenetic tree from the
    genetic data of contemporary species?

13
Phylogeny
  • Tree is unknown.
  • Given sequences at the leaves of the tree.
  • Want to reconstruct the tree (un-rooted).
  • How hard is it as a function of:
  • $n$ = size of the tree (number of leaves),
  • $k$ = length of the sequences?

14
Phylogeny
15

n and k
Length of sequence!
  • Interested to know the number $k$ of characters needed to
    reconstruct the tree with $n$ leaves.
  • Erdos-Steel-Szekely-Warnow 96:
    if $\epsilon < \theta(e) < 1 - \epsilon$ for all $e$, the
    tree can be recovered from sequences of length $k = n^c$,
    in polynomial time.
  • Question: How about shorter sequences?
  • Previously, the best lower bound on sequence length
    was $k = \Omega(\log n)$.
  • However, in practice:
  • Sometimes it is hard to find long sequences.
  • Short sequences often suffice.
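
A bound of order $\log n$ can be motivated by simple counting, which may or may not be the argument behind the bound cited above. There are $(2n-5)!!$ unrooted binary trees on $n$ labeled leaves but only $2^{nk}$ possible data sets of $k$ binary characters, so a method that can output every tree needs $2^{nk} \ge (2n-5)!!$. The function names below are made up for this sketch.

```python
def count_binary_x_trees(n):
    """Number of unrooted binary trees with n labeled leaves: (2n-5)!! for n >= 3."""
    count = 1
    for m in range(3, 2 * n - 4, 2):  # odd factors 3, 5, ..., 2n-5
        count *= m
    return count

def counting_lower_bound(n):
    """Smallest k with 2**(n*k) >= number of trees; fewer characters cannot separate all trees."""
    trees = count_binary_x_trees(n)
    k = 0
    while 2 ** (n * k) < trees:
        k += 1
    return k

for n in (4, 10, 100, 1000):
    print(n, counting_lower_bound(n))
```

The printed values grow slowly with $n$, in line with a logarithmic bound; this says nothing about whether $O(\log n)$ characters are actually enough.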

16

Lesson 1: Phylogenetic lower bound for forgetful
trees
  • Th [M 2004; Trans. AMS]:
  • If $2\theta^2(e) < 1$ for all $e$, then we show
    a lower bound on sequence length of $k = n^c$, where
    $c > 0$ is a function of $\theta = \max_e \theta(e)$ and
    $c \to \infty$ as $\theta \to 0$.
  • Th [M 2003; JCB]:
  • Similar theorem for general mutation models if
    mutation rates are high.
  • Proofs are easy.
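
To get a feel for why $2\theta^2 = 1$ is the natural threshold, one can look at a single simple statistic: the sum $S$ of the leaf spins of a complete binary tree of depth $d$ with the same $\theta$ on every edge. This is only a heuristic for one estimator, not the proof of the theorems above. Since $E[S\,\sigma(r)] = (2\theta)^d$ and $E[S^2] = 2^d \big(1 + \sum_{j=1}^{d} 2^{j-1}\theta^{2j}\big)$, the squared correlation between $S$ and the root spin is $(2\theta^2)^d / \big(1 + \tfrac12 \sum_{j=1}^{d} (2\theta^2)^j\big)$. The function name is hypothetical.

```python
def root_leafsum_corr_sq(theta, depth):
    """Squared correlation between the root spin and the sum of the 2**depth leaf spins."""
    a = 2 * theta ** 2
    # E[S * sigma_root]**2 / E[S**2], after cancelling a common factor 2**depth.
    denom = 1 + sum(0.5 * a ** j for j in range(1, depth + 1))
    return a ** depth / denom

for theta in (0.5, 0.9):
    print(theta, [round(root_leafsum_corr_sq(theta, d), 4) for d in (1, 5, 20, 60)])
```

For $2\theta^2 < 1$ the value decays to 0 as the depth grows (the tree "forgets" its root), while for $2\theta^2 > 1$ it converges to the positive constant $(2\theta^2 - 1)/\theta^2$.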

17

Poly. lower bound for Phylogeny
  • Proof by coupling.

[Figure: coupling diagram with labels $X_T$, "?", $L$, $k$, "Known", $q-L$]
  • If for all $k$ characters we can couple the bottom $q-L$
    levels, then $X$ is independent of the data.
  • By forgetfulness of the tree, if $k < n^c$, $X$ is
    independent of the data with high probability.
  • A similar idea can be used to test trees
    (M-Riesenfeld).

18

Lesson 2: Recent history is easy
  • In the proof of the lower bound, the deep
    convergences were hard to reconstruct.
  • Theorem [M04]:
  • If $\epsilon < \theta(e) < 1 - \epsilon$ for all $e$, then
    most of the tree can be reconstructed from
    sequences of length $k = O(\log n)$.
  • "Most of the tree" = a forest $F$ such that the true
    tree is obtained from $F$ by adding $o(n)$ edges.
  • Results were refined, with experiments, in
    Daskalakis-Hill-Jaffe-Mihaescu-Mossel-Rao.
  • Proof is not easy; it is based on distorted metrics.
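
Distance-based reconstruction starts from the fact that, in the CFN model, the correlation between two leaves is the product of the $\theta(e)$ along the path joining them, so $-\log(\text{correlation})$ is a tree metric. The sketch below uses exact correlations on a hypothetical quartet and the classical four-point rule; it is not the distorted-metrics algorithm itself. In practice correlations are estimated from the sequences, and estimates of long distances are unreliable, which is the issue a distorted metric is meant to handle.

```python
import math

# Hypothetical quartet ab|cd: leaves a, b attach to u; leaves c, d attach to w.
PENDANT = {"a": 0.9, "b": 0.7, "c": 0.8, "d": 0.95}  # theta on the edge above each leaf
SIDE = {"a": "u", "b": "u", "c": "w", "d": "w"}       # internal vertex each leaf attaches to
MIDDLE = 0.6                                           # theta on the internal edge (u, w)

def correlation(x, y):
    """Exact leaf-to-leaf correlation: product of theta along the connecting path."""
    corr = PENDANT[x] * PENDANT[y]
    if SIDE[x] != SIDE[y]:
        corr *= MIDDLE
    return corr

def distance(x, y):
    """Log-correlation distance; additive along paths of the tree."""
    return -math.log(correlation(x, y))

def four_point_split(leaves, dist):
    """Pick the pairing of four leaves with the smallest sum of within-pair distances."""
    w, x, y, z = leaves
    pairings = [((w, x), (y, z)), ((w, y), (x, z)), ((w, z), (x, y))]
    return min(pairings, key=lambda p: dist(*p[0]) + dist(*p[1]))

split = four_point_split(["a", "c", "b", "d"], distance)
print(split)
```

The pairing that avoids the internal edge twice has the smallest total distance, which recovers the split ab|cd.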

19

Lesson 3: Species that remember their past can
reconstruct their history.
  • Thm [Daskalakis-Mossel-Roch; to appear, STOC 06]:
  • If $2\theta^2(e) > 1$ for all $e$, then
    the tree can be recovered with high probability
    from sequences of length $k = O(\log n)$.
  • Solves M. Steel's favourite conjecture.
  • Builds on [M 2004; Trans. AMS].
  • Hard proof: mixes probability, algorithms, and
    statistical physics.

20

Proof Sketch: Logarithmic reconstruction
  • Two parts of the proof:
  • I. Statistical / algorithmic.
  • II. Probability / statistical physics.
  • I. By the forest result, we may recover a forest
    containing 90% of the edges of the tree from
    $O(\log n)$ samples.
  • This part doesn't use the condition $2\theta^2 > 1$.

21

Logarithmic Reconstruction
  • II. Here we use the condition that $2\theta^2 > 1$ in
    order to estimate the characters at the inner
    nodes of the forest.

Like I.
22
Ising Model on binary Trees
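[Figure: three regimes (low, interm., high), labeled "bias" or "no bias", with "typical boundary" marked, and annotated with "$k = \Omega(n^c)$", "Most of the tree from $k = O(\log n)$" and "$k = O(\log n)$"]
Unique Gibbs measure: $\forall e$, $2\theta(e) \le 1$
Extremality: $\forall e$, $2\theta(e) > 1$ and $\forall e$, $2\theta^2(e) \le 1$
Non-extremality: $\forall e$, $2\theta(e)^2 > 1$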
23

Many more challenges to come
  • We know very little.
  • We don't understand methods used in practice:
  • Maximum Likelihood (NP-hard on arbitrary data;
    Chor-Tuller 05, Roch 05).
  • Markov Chain Monte Carlo (can be exponentially
    slow on mixtures; M-Vigoda 05).
  • In what sense is Parsimony = Maximum Likelihood?
    (2 conjectures by Steel.)
  • Other mutation models: rates across sites, gene
    order, etc.
  • All the problems on Gibbs measures on trees.