Loading...

PPT – Parallel Bayesian Phylogenetic Inference PowerPoint presentation | free to view - id: 715787-MjA4M

The Adobe Flash plugin is needed to view this content

Parallel Bayesian Phylogenetic Inference

- Xizhou Feng
- Directed by Dr. Duncan Buell
- Department of Computer Science and Engineering
- University of South Carolina, Columbia
- 2002-10-11

Topics

- Background
- Bayesian phylogenetic inference and MCMC
- Serial implementation
- Parallel implementation
- Numerical result
- Future research

Background

- Darwin Species are related through a history of

common descent, and - The history can be organized as a tree structure

(phylogeny). - Modern species are put on the leaf nodes
- Ancient species are put on the internal nodes
- The time of the divergence is described by the

length of the branches. - A clade is a group of organisms whose members

share homologous features derived from a common

ancestor.

Phylogenetic tree

Internal node Ancestral species

Clade

Leaf node Current species

Branch

Branch length

Applications

- Phylogenetic tree is
- fundamental to understand evolution and diversity
- Principle to organize biological data
- Central to organism comparison
- Practical examples
- Resolve quarrel over bacteria-to-human gene

transfers (Nature 2001) - Tracing route of infectious disease transmission
- Identify new pathogens
- Phylogenetic distribution of biochemical pathways

Use DNA data for phylogenetic inference

Objectives of phylogenetic inference

output

input

- Major objectives include
- Estimate the tree topology
- Estimate the branch Length, and
- Describe the credibility of the result

Phylogenetic inference methods

- Algorithmic methods
- Defining a sequence of specific steps that lead

to the determination of a tree - e.g. UPGMA (unweighted pair group method using

arithmetic average) and Neighbor-Joining - Optimality criterion-based methods
- 1) Define a criterion 2)Search the tree with

best values - Maximum parsimony (minimize the total tree

length) - Maximum likelihood (maximize the likelihood)
- Maximum posterior probability (the tree with the

highest probability to be the true tree)

Common Used phylogeny methods

Algorithm

Statistical Supported

Search Strategy

Bayesian Methods

Stochastic search

Maximum Likelihood

Optimization method

Divide Conquer

Maximum Parsimony

Greedy search

GA, SA MCMC

Exact search

Fitch-Margolish

DCM, HGT, Quartet

Algorithmic method

Stepwise addition Global arrangement Star

decomposition

Exhaustive Branch Bound

Neighbor-join

UPGMA

Data set

Distance matrix

Character data

Aspects of phylogenetic methods

- Accuracy
- Is the constructed tree a true tree?
- If not, the percentage of the wrong edges?
- Complexity
- Neighbor-Join O(n3)
- Maximum Parsimony (provably NP hard)
- Maximum Likelihood (conjectured NP hard)
- Scalability
- Good for small tree, how about large tree

- Robustness
- If the model or assumption or the data is not

exact correct, how about the result? - Convergence rate
- How long a sequence is needed to recover the true

tree? - Statistical support
- With what probability is the computed tree the

true tree?

The computational challenge

- gt1.7 million known species
- Number of trees increase exponentially as new

species was added

- The complex of evolution
- Data Collection Computational system

Compute the tree of life Source

http//www.npaci.edu/envision/v16.3/hillis.html

Topics

- Background
- Bayesian phylogenetic inference and MCMC
- Serial implementation
- Parallel implementation
- Numerical result
- Future research

Bayesian Inference-1

- Both observed data and parameters of models are

random variables - Setting up the joint distribution

Topology

Branch length

Parameter of models

- When data D is known, Bayes theory gives

likelihood

Prior probability

Posterior probability

Unconditional Probability of data

Bayesian Inference-2

- Having the posterior probability distribution ,

we can compute the marginal probability of T as

- P(TD) can be interpreted as the probability of

the tree is correct - We need to do at least two things
- Approximate the posterior probability

distribution - Evaluate the integral for P(TD)
- These can be done via Markov Chain Monte Carlo

Method

Markov chain Monte Carlo (MCMC)

- The basic idea of MCMC is
- To construct a Markov chain such that
- Have the parameters as the state space, and
- the stationary distribution is the posterior

probability distribution of the parameters - Simulate the chain
- Treat the realization as a sample from the

posterior probability distribution - MCMC sampling continue search

Markov chain

- A Markov chain is a sequence of random variables

X0, X1, X2, whose transition kernel T(Xt,

Xt1) is determined only by the value of Xt

(tgt0). - Stationary distribution ?(x)Sx(?(x)T(x,x))

is invariant - Ergodic property pn(x) converges to ?(x) as n??
- A homogeneous Markov chain is ergodic if

min(T(x,x)/ ?(x)gt0

Metropolis-Hasting algorithm-1

- Cornerstone of all MCMC methods, Metropolis(1953)
- Hasting proposed a generalized version in (1970)
- The key point is to how to define the accepted

probability - Metropolis
- Hasting

Proposal probability Can be any form Such that

Metropolis-Hasting algorithm-2

- Initialize x0, set t0
- Repeat
- Sample x from T(xt, x)
- Draw Uuniform0,1
- Update

Problems of MH Algorithm Improvement

- Problems
- Mixing rate is slow when
- Small step-gtlow movement
- Larger step-gtlow acceptance
- Stopped at local optima
- Dimension of state space may vary
- Improvement
- Metropolis-coupled MCMC
- Multipoint MCMC
- Population-based MCMC
- Time-reversible jump MCMC

Metropolis-coupled MCMC (Geyer 1991)

- Run several MCMC chains with different

distribution ?i(x) (i1..m) in parallel - ?1(x) is used to sampling
- ?i(x) (i2..m) are used to improve mixing
- For example ?i(x) ?(x)1/(1?(I-1))
- After each iteration, attempt to swap the states

of two chains using a Metropolis-Hasting step

with acceptance probability of

Illustration of Metropolis-coupled MCMC

?1(x)T10

?2(x) T22

?3(x) T34

?4(x) T48

Metropolis-coupled MCMC is also called Parallel

tempering

Multiple Try Metropolis (Liu et al 2000)

Population-based MCMC

- Metropolis-coupled MCMC uses a minimal

interaction between multiple chains, why not more

active interaction - Evolutionary Monte Carlo (Liang et al 2000)
- Combine Genetic Algorithm with MCMC
- Used to Simulate protein folding
- Conjugate Gradient Monte Carlo (Liu et al 2000)
- Use local optimization for adaptation
- An improvement of ADS (Adaptive Direction

Sampling)

Topics

- Background
- Bayesian phylogenetic inference and MCMC
- Serial implementation
- Choose the evolutionary model
- Compute the likelihood
- Design proposal mechanisms
- Parallel implementation
- Numerical result
- Future research

DNA substitution rate matrix

Purine

transition

transversion

Pyrimidine

- Consider inference of un-rooted tree and the

computational complications, some simplified

models are used (see next slide)

GTR-family of substitution models

GTR

- GTR general time-reversible model, corresponding

to a symmetric rate matrix.

Three substitution type (1 transversion, 2

transition)

Equal base frequencies

TN93

SYM

Two substitution types (transition v.s.

tranversion)

Three substitution type (1 transversion, 2

transition)

HKY85 F84

K3ST

Equal base frequencies

Two substitution types (transition v.s.

tranversion)

Single substitution type

K2P

F81

Single substitution type

Equal base frequencies

JC69

More complex models

- Substitute rates vary across sites
- Invariable sites models gamma distribution
- Correlation in the rates at adjacent sites
- Codon models 61X61 instantaneous rate matrix
- Secondary structure models

Compute conditional probability of branch

- Given substitution rate matrix, how to compute

p(ba,t)-the probability of a is substituted by b

after time t

Eigenvalue of Q

Likelihood of a phylogeny tree for one site

When x4 x5 are known,

When x4 x5 are unknown,

Likelihood calculation (Felsenstein 1981)

- Given a rooted tree with n leaf nodes (species) ,

and each leaf node is represented by a sequence

xi with length N, the likelihood of a rooted tree

is represented as

Likelihood calculation-2

- Felsensteins algorithms for likelihood

calculation(1981) - Initiation
- Set k2n-1
- Recursion Compute for all a as follows
- If k is a leaf node
- Set if
- Set if .
- If k is not a leaf node
- compute for all a its children

nodes i, j. - And set
- Termination Likelihood at site u is

Note algorithm modified from Durbin et al (1998)

Likelihood calculation-3

Site 1

Site 2

Site m-1

Site m-2

Taxa-1

Taxa-2

Taxa-3

Taxa-n

- The likelihood calculation requires filling an N

X M X S X R table - N number of sequences M number of sites
- S number of state of characters R number of

rate categories

Local update Likelihood

- If we just change the topology and branch length

of tree locally, we only need refresh the table

at those affected nodes. - In the following example, only the nodes with red

color need to change their conditional likelihood.

Original tree

Proposed tree

Proposal mechanism for trees

- Stochastic Nearest-neighbor interchange (NNI)
- Larget et al (1999) Huelsenbeck (2000)

Proposal mechanisms for parameters

Larget et al. (1999)

- Independent parameter
- e.g. transition/tranversion ratio k

- A set of parameters constrained to sum to a

constant - e.g. base frequency distribution
- Draw a sample from the Dirichlet distribution

Bayesian phylogenetic inference

Prior probability

DNA Data

Likelihood

Evolutionary model

Phylogenetic tree

Posterior prob.

Starting tree

Proposal

inference

MCMC

A sequence of Samples

Approximate the distribution

Topics

- Background
- Bayesian phylogenetic inference and MCMC
- Serial implementation
- Parallel implementation
- Challenges of serial computation
- Difficulty
- MCMC is a serial algorithm
- Multiple chains need to be synchronized
- Choose appropriate grid topology
- Synchronize using random number
- Numerical result
- Future research

Computational challenge

- Computing global likelihood needs O(NMRS2)

multiplications - Local updating topology branch length needs

O(MRS2log(N)) - Updating model parameter needs O(NMRS2)
- local update needs all required data in memory
- Given N1000 species, each sequence has M5000

sites, rate category R5, and DNA nucleotide

model S4 - Run 5 chains each with length of 100 million

generations - Needs 400 days (assume 1 global updates, 99

local update) - And O(NMRSLX2X2X8)32Gigabyte memory
- gtSo until more advanced algorithms are

developed, parallel computation is the direct

solution. - Use 32 processor with 1 gigabyte memory, we can

compute the problem in 2 weeks

Characteristic of good parallel algorithms

- Balancing workload
- Concurrency identify, manage, and granularity
- Reducing communication
- Communication-to-computation ratio
- Frequency, volume, balance
- Reducing extra work
- Computing assignment
- Redundant work

Single-chain MCMC algorithm

Multiple-chain MCMC algorithm

Chain 1

Chain 2

Chain 3

Chain 4

Generate S1(0) t0

Generate S2(0) t0

Generate S3(0) t0

Generate S4(0) t0

Propose Update S1(t)

Propose Update S2(t)

Propose Update S3(t)

Propose Update S4(t)

choose two chains to swap

Compute R and U

UltR

Swap the two selected chains

tt1

tt1

tt1

tt1

Task Concurrency

- In one likelihood evaluation, the task can be

divided into n_chain x n_site x n_rate concurrent

tasks

Task decomposition and assignment

Concurrent tasks

Processors virtual topology

- We organize the processor pool into an nx X ny X

nz 3D grid - Each grid runs n_chain/ny chains, each chain has

n_site/nx sites, and each site has n_rate/nz

category rate. (my current implementation dosent

consider rate dimension, so the topology is a 2D

grid). - Choose nx, ny, and nz such that each processor

has equal load

Parallel configuration Modes

- Chain-level parallel configuration (Mode I)
- Each chain runs on one row
- Number of chains number of rows
- Site-level parallel configuration (Mode II)
- All chains run on the same row
- Different subsequences run on different columns
- Number of subsequences number of columns
- Hybrid parallel configuration (Mode III)
- Combine the first two configurations
- Several chains may run on the same row and also
- Divide the whole into segments

Orchestration

- The data can be distributed at chain level and

sub-sequence level, spatial locality is always

preserved. At the same time, each process only

needs 1/(nxXny) original memory - Each step, each processor needs 1 collective

communication along its row and at most 2 (or 1)

direct communication if it is one of the selected

chain for swapping. - Processors at the same row can easily be

synchronized because they use the same random

number.

Communication between different rows

- Processors at different rows need some trick to

synchronize without communication - The trick is to use two random number generators
- the first one is used to generate the proposal,

processors at different row use different seeds

for random number generator. - The second one is used to synchronize processors

in different rows, using the same seed. So

processors know whats going on in the whole

grid. - To reduce the size of the message, the selected

processors swap message in two steps - Swap current likelihood and temperature
- Compute the acceptance ratio r and draw a random

number U, if Ultr, then swap the tree and

parameters else continue to next step.

Symmetric parallel algorithm-1

- Build grid topology
- Initialize
- Set seed for random number Rand_1(for proposal)

and Rand_2(for synchronize) - For each chain
- t 0
- Building random tree T0
- Choose initial parameter ?0
- Set initial state as x0
- While (tltmax_generation) do
- 3.1 In-chain update
- For each chain
- Choose proposal type using Rand_2
- Generate the proposal x
- Evaluate log likelihood locally
- Collect the sum of the local log likelihood
- For each chain
- Compute r using hasting transition kernel
- Draw a sample u from uniform distribution using

Rand_1 - If ultr set xx

Symmetric parallel algorithm-2

- 3.2 inter-chain swap
- Draw two chain indexes I and J from Rand_2
- If IJ go to 3.3
- Map I, J to the processor coordinate row_I and

row_J - If row_I row_J
- Compute acceptance ratio r
- Draw u from uniform distribution using Rand_2
- If ultr Swap state x and the temperature of

chain I and chain J - Go to 3.3
- If row_I ! row_J
- Swap log likelihood and temperature of chain I

and chain J between processors - on chain row_I and row_J
- Compute acceptance r
- Draw u from from uniform distribution using

Rand_2 - If ultr Swap state x of chain I and chain J

between between processors - on chain row_I and row_J
- 3.3 tt1
- 3.4 if tsample_cycle0 output state x

Load Balancing

- For simplicity, from now we just consider 2D grid

(x-y). - Along x-axis, the processors are synchronized

since they evaluate the same topology, same

branch length, and same local update step - But along y-axis, different chains will propose

different topologies, so according to local

likelihood evaluation, imbalance will occur. - In the parameter update step, we need global

likelihood evaluation, balancing is not a problem.

Asymmetric parallel algorithm

- The symmetric algorithm assumes each processor

has the same performance and deals with the

imbalance between different rows by waiting and

synchronization. - An improvement is to use an asymmetric algorithm.

The basic idea is - Introduce a processor as the mediator
- Each row send its state to the mediator after it

finished in-chain update - The mediator keeps the current states of

different rows, and decides which two chains

should swap - The selected chains get new states from the

mediator - Asymmetric algorithm can solve the unbalance

problem, but assumes that different chains dont

need to synchronize.

Topics

- Background
- Bayesian phylogenetic inference and MCMC
- Serial implementation
- Parallel implementation
- Numerical result
- Future research

The likelihood evolution of single chain

The likelihood evolution of 4 chains

Numerical result Case 1 run time

Dataset 7X2439 of chains4 generation10000 Pa

rallel Mode I

Numerical result Case 1s speedup

Dataset 7X2439 of chains4 generation10000 Pa

rallel Mode I

Speedup obtained by Mode II

Note Use the same dataset as the previous one.

But only parallel at sequence level.

Comparison of two parallel modes

Conclusion

- Bayesian phylogenetic inference
- Using MCMC sample posterior probability
- Efficient parallel strategies
- Stochastic algorithms vs. NP-problems

Future research directions

- Improved parallel strategies
- Connection between Bayesian Bootstrap
- Performance of different Bayesian methods
- Alternative MCMC methods
- MCMC for other computational biology problems
- The power of stochastic algorithms
- Super tree construction