
Useful algorithms and statistical methods in genome data analysis

- Ahmed Rebaï
- Centre of Biotechnology of Sfax
- ahmed.rebai_at_cbs.rnrt.tn

Al-Khwarizmi (780-840), at the origin of the word "algorithm"

Basic statistical concepts and tools

- A primer on
- Random variables
- Probability distributions
- Statistical inference

Statistics

- Statistics concerns the optimal methods of analyzing data generated by some chance mechanism (random phenomena).
- Optimal means an appropriate choice of what is to be computed from the data to carry out the statistical analysis.

Random variables

- A random variable is a numerical quantity that, in some experiment involving a degree of randomness, takes one value from some set of possible values.
- The probability distribution is the set of values that the random variable can take, together with their associated probabilities.

Three very important distributions!

- The Binomial distribution: the number of successes in n trials, each with two possible outcomes (success/failure), where the probability of success p is the same for all trials and the trials are independent.
- Pr(X = k) = [n! / ((n-k)! k!)] p^k (1-p)^(n-k)
- The Poisson distribution: the limiting form of the binomial for large n and small p (rare events), when np = λ is moderate.
- Pr(X = k) = e^(-λ) λ^k / k!
- The Normal (Gaussian) distribution: arises in many biological processes; by the central limit theorem, it is the limiting distribution of sums of many random variables for large n.
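As a check on these formulas, here is a minimal sketch in plain Python (standard library only) of the binomial and Poisson probability mass functions, illustrating the limiting relationship for large n and small p; the example values of n and p are illustrative:

```python
# Minimal sketch of the two discrete distributions defined above.
from math import comb, exp, factorial

def binomial_pmf(k: int, n: int, p: float) -> float:
    """Pr(X = k) = n!/((n-k)! k!) * p^k * (1-p)^(n-k)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k: int, lam: float) -> float:
    """Pr(X = k) = e^(-lambda) * lambda^k / k! (limiting form of the binomial)."""
    return exp(-lam) * lam**k / factorial(k)

# For large n and small p, the Poisson with lambda = n*p is close to the binomial:
n, p = 1000, 0.002
print(binomial_pmf(3, n, p))   # ~0.1806
print(poisson_pmf(3, n * p))   # ~0.1804
```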

Statistical Inference

- The process of drawing conclusions about a population on the basis of measurements or observations on a sample of individuals from the population.
- Frequentist inference: an approach based on a frequency (long-run) view of probability, whose main tools are tests and the likelihood (the probability of a set of observations x given the value of some set of parameters θ: L(x|θ)).
- Bayesian inference

Hypothesis testing

- Significance testing: a procedure that, when applied to a set of observations, results in a p-value relative to some null hypothesis.
- P-value: the probability of the observed data (or more extreme data) when the null hypothesis is true, computed from the distribution of the test statistic.
- Significance level: the level of probability at which it is argued that the null hypothesis will be rejected; conventionally set to 0.05.
- False positive: rejecting the null hypothesis when it is true (type I error).
- False negative: accepting the null hypothesis when it is false (type II error).

Bayesian inference

- An approach based on Bayes' theorem:
- Obtain the likelihood L(x|θ)
- Obtain the prior distribution L(θ), expressing what is known about θ before observing the data
- Apply Bayes' theorem to derive the posterior distribution L(θ|x), expressing what is known about θ after observing the data
- Derive appropriate inference statements from the posterior distribution: estimation, probability of hypotheses
- The prior distribution represents the investigator's knowledge (belief) about the parameters before seeing the data
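A minimal sketch of this recipe, assuming a coin-tossing experiment and a discrete grid of candidate θ values (both illustrative, not from the slides), using NumPy:

```python
# Illustrative sketch: Bayes' theorem for a coin's head probability theta,
# using a discrete grid for the prior so no conjugacy results are needed.
import numpy as np

theta = np.linspace(0.01, 0.99, 99)          # candidate parameter values
prior = np.ones_like(theta) / theta.size     # flat prior L(theta)

# Likelihood L(x|theta) for observing 7 heads in 10 tosses (illustrative data)
heads, tosses = 7, 10
likelihood = theta**heads * (1 - theta)**(tosses - heads)

# Posterior L(theta|x) is proportional to L(x|theta) * L(theta); normalize over the grid
posterior = likelihood * prior
posterior /= posterior.sum()

print(theta[np.argmax(posterior)])   # posterior mode, ~0.7
print((theta * posterior).sum())     # posterior mean
```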

Resampling techniques: the Bootstrap

- Used when we cannot know the exact distribution of a parameter
- Sampling with replacement from the original data produces random samples of the same size n
- Each bootstrap sample produces an estimate of the parameter of interest
- Repeating the process a large number of times provides information on the variability of the estimator
- Other techniques:
- Jackknife: generates n samples, each of size n-1, by omitting each member in turn from the data
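A short sketch of both resampling schemes, assuming the median of a small illustrative sample as the estimator of interest:

```python
# Sketch of the bootstrap (and jackknife) for the variability of an estimator.
import random
import statistics

data = [4.1, 5.6, 3.8, 7.2, 5.0, 6.3, 4.9, 5.8, 6.1, 4.4]  # illustrative sample

def bootstrap_estimates(data, estimator, n_boot=1000):
    """Resample with replacement (same size n) and re-apply the estimator."""
    n = len(data)
    return [estimator([random.choice(data) for _ in range(n)])
            for _ in range(n_boot)]

boots = bootstrap_estimates(data, statistics.median)
print(statistics.stdev(boots))   # bootstrap standard error of the median

# Jackknife: n samples of size n-1, each omitting one observation in turn
jack = [statistics.median(data[:i] + data[i+1:]) for i in range(len(data))]
```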

Multivariate analysis

- A primer on
- Discriminant analysis
- Principal component analysis
- Correspondence analysis

Exploratory Data Analysis (EDA)

- Data analysis that emphasizes the use of informal graphical procedures, not based on prior assumptions about the structure of the data or on formal models for the data
- Data = smooth + rough, where the smooth is the underlying regularity or pattern in the data. The objective of EDA is to separate the smooth from the rough with minimal use of formal mathematics or statistical methods

Classification and clustering

Classification

- In classification you do have a class label (o and x), each defined in terms of G1 and G2 values.
- You are trying to find a model that splits the data elements into their existing classes.
- You then assume that this model can be used to assign new data points x and y to the right class.

Linear Discriminant Analysis

- Proposed by Fisher (1936) for classifying an observation into one of two possible groups using measurements x1, x2, ..., xp.
- Seek a linear transformation of the variables, z = a1x1 + a2x2 + ... + apxp, such that the separation between the group means on the transformed scale is maximal; the solution is a = S^-1 (x̄1 - x̄2), where x̄1 and x̄2 are the mean vectors of the two groups and S is the pooled covariance matrix of the two groups.
- Assign a subject to group 1 if zi - zc < 0 and to group 2 if zi - zc > 0, where zc = (z̄1 + z̄2)/2 is the mean of the group means.
- Subsets of variables most useful for discrimination can be selected by regression procedures.
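A sketch of Fisher's rule with NumPy on illustrative toy data; note that with a = S^-1(x̄1 - x̄2), group 1's mean projects above zc, so the comparison sign depends on the orientation of a:

```python
# Sketch of Fisher's linear discriminant: a = S^-1 (xbar1 - xbar2),
# threshold z_c = (zbar1 + zbar2)/2, on illustrative two-group data.
import numpy as np

def fisher_lda(X1, X2):
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    n1, n2 = len(X1), len(X2)
    # Pooled covariance matrix of the two groups
    S = ((n1 - 1) * np.cov(X1, rowvar=False) +
         (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    a = np.linalg.solve(S, m1 - m2)       # discriminant direction
    zc = (a @ m1 + a @ m2) / 2            # midpoint of the group means on z
    return a, zc

rng = np.random.default_rng(0)
X1 = rng.normal([0, 0], 1.0, size=(50, 2))   # toy group 1
X2 = rng.normal([3, 2], 1.0, size=(50, 2))   # toy group 2
a, zc = fisher_lda(X1, X2)

x_new = np.array([2.5, 1.5])
# With this orientation of a, group 1's side is z > z_c
group = 1 if a @ x_new - zc > 0 else 2
print(group)
```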

Linear (LDA) and Quadratic (QDA) discriminant analysis for gene prediction

- Simple approach
- Training set (exons and non-exons)
- Set of features (for internal exons):
- donor/acceptor site recognizers
- octonucleotide preferences for coding regions
- octonucleotide preferences for intron interiors on either side
- Rule to be induced: a linear (or quadratic) combination of the features
- Good accuracy in some circumstances
- (programs: Zcurve, MZEF)

LDA and QDA in gene prediction

Principal Component Analysis

- Introduced by Pearson (1901) and Hotelling (1933) to describe the variation in a set of multivariate data in terms of a set of uncorrelated variables Yi, each of which is a linear combination of the original variables xi:
- Yi = ai1x1 + ai2x2 + ... + aipxp, where i = 1, ..., p
- The new variables Yi are derived in decreasing order of importance; they are called principal components

Principal Component Analysis

How to avoid correlated features? Correlations ⇒ the covariance matrix is non-diagonal. Solution: diagonalize it, then use the transformation that makes it diagonal to de-correlate the features. The columns of Z are the eigenvectors and the diagonal elements of Λ are the eigenvalues.

Properties of PCA

Small λi ⇒ small variance ⇒ the data change little in the direction Yi. PCA minimizes the reconstruction error of the covariance matrix C: the eigenvectors Zi for large λi are sufficient, because the vectors for small eigenvalues contribute very little to the covariance matrix. The adequacy of the representation by the first two components is measured by the explained proportion (λ1 + λ2) / Σ λi.

True covariance matrices are usually not known; they are estimated from the data.
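A sketch of PCA by diagonalizing the estimated covariance matrix, following the slides; the data here are illustrative:

```python
# PCA via eigendecomposition of the estimated covariance matrix.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
X[:, 1] = 0.9 * X[:, 0] + rng.normal(scale=0.3, size=200)  # correlated features

C = np.cov(X, rowvar=False)              # estimated covariance matrix
eigvals, Z = np.linalg.eigh(C)           # columns of Z are eigenvectors
order = np.argsort(eigvals)[::-1]        # decreasing order of importance
eigvals, Z = eigvals[order], Z[:, order]

Y = (X - X.mean(axis=0)) @ Z             # de-correlated principal components

# Adequacy of the first two components: (lambda1 + lambda2) / sum(lambda_i)
print((eigvals[0] + eigvals[1]) / eigvals.sum())
```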

Use of PCA

- PCA is useful for finding new, more informative, uncorrelated features and for reducing dimensionality by rejecting low-variance features
- Analysis of expression data

Correspondence analysis (Benzécri, 1973)

- For uncovering and understanding the structure and pattern in data in contingency tables.
- Involves finding coordinate values which represent the row and column categories in some optimal way.
- Coordinates are obtained from the singular value decomposition (SVD) of the matrix E whose elements are eij = (pij - pi. p.j) / (pi. p.j)^(1/2)
- E = UDV', with U the eigenvectors of EE', V the eigenvectors of E'E, and D a diagonal matrix of the singular values δk (the δ²k are the eigenvalues of either EE' or E'E)
- The adequacy of the representation by the first two coordinates is measured by the inertia, (δ²1 + δ²2) / Σ δ²k

Properties of CA

- Allows consideration of dummy variables and observations (called illustrative variables or observations) as additional variables (or observations) which do not contribute to the construction of the factorial space, but can be displayed on it.
- With such a representation it is possible to determine the proximity between the observations and variables and the illustrative variables and observations.

Example: analysis of codon usage and gene expression in E. coli (McInerney, 1997)

- A gene can be represented by a 59-dimensional vector (universal code)
- A genome consists of hundreds (thousands) of these genes
- Variation in the variables (RSCU values) might be governed by only a small number of factors
- For each gene and each codon i, calculate
- RSCUi = observed count of codon i / expected count of codon i
- Do correspondence analysis on the RSCU values (see the sketch below)
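A sketch of the RSCU computation for one amino acid's synonymous codon family; the codon counts are illustrative:

```python
# RSCU_i = observed codon i / expected codon i, where the expected count is
# the amino acid's total split equally among its synonymous codons.
from collections import Counter

# Illustrative observed counts of the four glycine codons in one gene
counts = Counter({"GGT": 12, "GGC": 20, "GGA": 3, "GGG": 5})

def rscu(counts):
    total = sum(counts.values())
    expected = total / len(counts)          # equal usage among synonyms
    return {codon: n / expected for codon, n in counts.items()}

print(rscu(counts))   # e.g. GGC -> 2.0: used twice as often as expected
```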

[Figure: plot of the two most important axes after correspondence analysis of RSCU values from the E. coli genome; three clusters are labelled: lowly-expressed genes, highly expressed genes, and recently acquired genes.]

Other Methods

- Factor analysis: explains the correlations between a set of p variables in terms of a smaller number k < p of unobservable latent variables (factors): x = Af + u, S = AA' + D, where D is a diagonal matrix and A is a matrix of factor loadings
- Multidimensional scaling: constructs a low-dimensional geometrical representation of a distance matrix
- Cluster analysis: a set of methods for constructing sensible and informative classifications of an initially unclassified set of data, using the variable values observed on each individual (hierarchical clustering, K-means clustering, ...)

Data mining or KDD (Knowledge Discovery in Databases)

- The process of seeking interesting and valuable information within large databases
- An extension and adaptation of standard EDA procedures
- A confluence of ideas from statistics, EDA, machine learning, pattern recognition, database technology, etc.
- The emphasis is more on algorithms than on mathematical or statistical models

Famous Algorithms in Bioinformatics

- A primer on
- HMM
- EM
- Gibbs sampling (MCMC)

Hidden Markov Models

- An introduction

Historical overview

- Introduced in the 1960s-70s to model sequences of observations
- Observations are discrete (characters from some alphabet) or continuous (signal frequency)
- First applied in speech recognition, from the 1970s
- Applied to biological sequences since 1992: multiple alignment, profiles, gene prediction, fold recognition, ...

Hidden Markov Model (HMM)

- Can be viewed as an abstract machine with k hidden states.
- Each state has its own probability distribution, and the machine switches between states according to this probability distribution.
- At each step, the machine makes two decisions:
- What state should it move to next?
- What symbol from its alphabet should it emit?

Why Hidden?

- Observers can see the emitted symbols of an HMM but have no ability to know which state the HMM is currently in.
- Thus, the goal is to infer the most likely sequence of states of an HMM based on some given sequence of emitted symbols.

HMM Parameters

Σ: the set of all possible emission characters, e.g. Σ = {H, T} for coin tossing or Σ = {1, 2, 3, 4, 5, 6} for dice tossing.

Q: the set of hidden states {q1, q2, ..., qn}, each emitting symbols from Σ.

A = (akl): a |Q| x |Q| matrix of the probabilities of changing from state k to state l.

E = (ek(b)): a |Q| x |Σ| matrix of the probabilities of emitting symbol b during a step in which the HMM is in state k.

The fair bet casino problem

- The game is to flip coins, with only two possible outcomes: Head or Tail.
- Suppose that the dealer uses both Fair and Biased coins.
- The Fair coin gives Heads and Tails with the same probability of ½.
- The Biased coin gives Heads with probability ¾.
- Thus, we define the probabilities:
- P(H|F) = P(T|F) = ½
- P(H|B) = ¾, P(T|B) = ¼
- The crooked dealer switches between the Fair and Biased coins with probability 0.1

The Fair Bet Casino Problem

- Input: a sequence x = x1x2x3...xn of coin tosses made with two possible coins (F or B).
- Output: a sequence p = p1p2p3...pn, with each pi being either F or B, indicating that xi is the result of tossing the Fair or Biased coin respectively.

HMM for Fair Bet Casino

- The Fair Bet Casino can be defined in HMM terms as follows:
- Σ = {0, 1} (0 for Tails, 1 for Heads)
- Q = {F, B}: F for the Fair, B for the Biased coin
- aFF = aBB = 0.9
- aFB = aBF = 0.1
- eF(0) = ½, eF(1) = ½
- eB(0) = ¼, eB(1) = ¾
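Written out as plain Python dictionaries (a representation chosen here for clarity, not prescribed by the slides), the model is:

```python
# The Fair Bet Casino HMM as plain Python data structures.
states = ("F", "B")          # Q = {F, B}
alphabet = (0, 1)            # Sigma = {0, 1}: 0 = Tails, 1 = Heads

# Transition probabilities a_kl
A = {("F", "F"): 0.9, ("F", "B"): 0.1,
     ("B", "F"): 0.1, ("B", "B"): 0.9}

# Emission probabilities e_k(b)
E = {"F": {0: 0.50, 1: 0.50},
     "B": {0: 0.25, 1: 0.75}}
```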

HMM for Fair Bet Casino

Visualization of the transition probabilities A:

          Biased      Fair
Biased    aBB = 0.9   aFB = 0.1
Fair      aBF = 0.1   aFF = 0.9

Visualization of the emission probabilities E:

          Biased      Fair
Tails(0)  eB(0) = ¼   eF(0) = ½
Heads(1)  eB(1) = ¾   eF(1) = ½

HMM for Fair Bet Casino

[Figure: the HMM model for the Fair Bet Casino Problem, as a state diagram.]

Hidden Paths

- A path p = p1...pn in the HMM is defined as a sequence of states.
- Consider the path p = FFFBBBBBFFF and the sequence x = 01011101001:

x            0    1    0    1    1    1    0    1    0    0    1
p            F    F    F    B    B    B    B    B    F    F    F
P(xi|pi)     1/2  1/2  1/2  3/4  3/4  3/4  1/4  3/4  1/2  1/2  1/2
P(pi-1→pi)   1/2  9/10 9/10 1/10 9/10 9/10 9/10 9/10 1/10 9/10 9/10

P(x|p) Calculation

- P(x|p): the probability that sequence x was generated by the path p, given the model M:

P(x|p) = P(p0 → p1) · Π_{i=1..n} P(xi | pi) · P(pi → pi+1)
       = a_{p0,p1} · Π_{i=1..n} e_{pi}(xi) · a_{pi,pi+1}
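A sketch evaluating this product for the hidden-path example above, with the initial term P(p0 → p1) taken as ½ as on that slide:

```python
# Evaluating P(x|p) for the casino HMM; A and E are restated so the
# snippet is self-contained.
A = {("F", "F"): 0.9, ("F", "B"): 0.1, ("B", "F"): 0.1, ("B", "B"): 0.9}
E = {"F": {0: 0.50, 1: 0.50}, "B": {0: 0.25, 1: 0.75}}

def path_probability(x, p, A, E, initial=0.5):
    """P(x|p) = a_{p0,p1} * prod_i [ e_{pi}(x_i) * a_{pi,pi+1} ]."""
    prob = initial                           # P(p0 -> p1), taken as 1/2 here
    for i in range(len(x)):
        prob *= E[p[i]][x[i]]                # emission e_{pi}(x_i)
        if i + 1 < len(x):
            prob *= A[(p[i], p[i + 1])]      # transition a_{pi,pi+1}
    return prob

x = [0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1]
p = list("FFFBBBBBFFF")
print(path_probability(x, p, A, E))          # probability of x along this path
```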

Forward-Backward Algorithm

- Given: a sequence of coin tosses generated by an HMM.
- Goal: find the probability that the dealer was using a biased coin at a particular time.
- The probability that the dealer used a biased coin at moment i is:

P(pi = k | x) = P(x, pi = k) / P(x) = fk(i) · bk(i) / P(x)

Forward-Backward Algorithm

- fk,i (forward probability) is the probability of emitting the prefix x1...xi and reaching the state pi = k.
- The recurrence for the forward algorithm:

fk,i = ek(xi) · Σ_{l ∈ Q} fl,i-1 · alk

- The backward probability bk,i is the probability of being in state pi = k and emitting the suffix xi+1...xn.
- The backward algorithm's recurrence:

bk,i = Σ_{l ∈ Q} el(xi+1) · bl,i+1 · akl
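A direct sketch of the two recurrences for the casino HMM; no numerical scaling is done, so it is only suitable for short sequences:

```python
# Forward-backward posterior P(p_i = k | x) for the Fair Bet Casino HMM.
states = ("F", "B")
A = {("F", "F"): 0.9, ("F", "B"): 0.1, ("B", "F"): 0.1, ("B", "B"): 0.9}
E = {"F": {0: 0.50, 1: 0.50}, "B": {0: 0.25, 1: 0.75}}
init = {"F": 0.5, "B": 0.5}

def forward_backward(x):
    n = len(x)
    f = [{} for _ in range(n)]
    b = [{} for _ in range(n)]
    for k in states:                     # forward initialization
        f[0][k] = init[k] * E[k][x[0]]
    for i in range(1, n):                # f_{k,i} = e_k(x_i) sum_l f_{l,i-1} a_{lk}
        for k in states:
            f[i][k] = E[k][x[i]] * sum(f[i-1][l] * A[(l, k)] for l in states)
    for k in states:                     # backward initialization
        b[n-1][k] = 1.0
    for i in range(n - 2, -1, -1):       # b_{k,i} = sum_l e_l(x_{i+1}) b_{l,i+1} a_{kl}
        for k in states:
            b[i][k] = sum(E[l][x[i+1]] * b[i+1][l] * A[(k, l)] for l in states)
    px = sum(f[n-1][k] for k in states)  # P(x)
    # Posterior P(p_i = k | x) = f_k(i) * b_k(i) / P(x)
    return [{k: f[i][k] * b[i][k] / px for k in states} for i in range(n)]

posterior = forward_backward([0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1])
print(posterior[5]["B"])   # prob. the dealer used the biased coin at toss 6
```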

HMM Parameter Estimation

- So far, we have assumed that the transition and emission probabilities are known.
- However, in most HMM applications the probabilities are not known and are hard to estimate.
- Let T be a vector combining the unknown transition and emission probabilities.
- Given training sequences x1, ..., xm, let P(x|T) be the probability of x given the assignment of parameters T.
- Our goal is then to find max_T Π_{j=1..m} P(xj | T)

The Expectation Maximization (EM) algorithm

- Dempster, Laird and Rubin (1977)

EM Algorithm in statistics

- An algorithm producing a sequence of parameter estimates that converges to the MLE. It has two steps:
- E-step: calculate the expected value of the likelihood conditional on the data and the current estimates of the parameters, E(L(x|θi))
- M-step: maximize the likelihood to give updated parameter estimates, increasing L
- The two steps are alternated until convergence
- Needs a first guess estimate to start

The EM Algorithm for motif detection

- This algorithm is used to identify conserved areas in unaligned sequences.
- Assume that a set of sequences is expected to share a common sequence pattern.
- An initial guess is made as to the location and size of the site of interest in each of the sequences, and these parts are loosely aligned.
- This alignment provides an estimate of the base or amino-acid composition of each column in the site.

EM algorithm in motif finding

- The EM algorithm consists of two steps, which are repeated consecutively:
- Step 1, the Expectation step: the column-by-column composition of the site is used to estimate the probability of finding the site at any position in each of the sequences. These probabilities are used in turn to provide new information as to the expected base or amino-acid distribution for each column in the site.
- Step 2, the Maximization step: the new counts of bases or amino acids for each position in the site found in step 1 are substituted for the previous set.

Expectation Maximization (EM) Algorithm

[Figure: a set of sequences in which a motif (X positions) is embedded in background sequence (O positions); the motif occurrences are loosely aligned under one another.]

Columns defined by a preliminary alignment of the sequences provide initial estimates of the frequencies of each base in each motif column; columns not in the motif provide the background frequencies:

Bases   Background   Site column 1   Site column 2
G       0.27         0.4             0.1
C       0.25         0.4             0.1
A       0.25         0.2             0.1
T       0.23         0.2             0.7
Total   1.00         1.00            1.00

EM Algorithm in motif finding

[Figure: two candidate motif positions, A and B, shown in sequence 1.]

The resulting score gives the likelihood that the motif matches position A, B, or any other position in seq 1. Repeat for all other positions and find the most likely location; then repeat for the remaining sequences.

EM Algorithm: 1st expectation step calculations

- Assume that seq1 is 100 bases long and the length of the site is 20 bases.
- Suppose that the site starts in column 1 and the first two positions are A and T.
- The site will then end at position 20, and positions 21 and 22 do not belong to the site. Assume that these two positions are A and T as well.
- The probability of this location of the site in seq1 is given by:
- Psite1,seq1 = 0.2 (for A in position 1) x 0.7 (for T in position 2) x (site-column probabilities for the next 18 positions of the site) x 0.25 (for A in the first flanking position) x 0.23 (for T in the second flanking position) x (background probabilities for the next 78 flanking positions).

EM Algorithm: 1st expectation step calculations (continued)

- The same procedure is applied to calculate the probabilities Psite2,seq1 to Psite78,seq1, thus providing a comparative set of probabilities for the site location.
- The probability of the best location in seq1, say at site k, is the ratio of the site probability at k to the sum of all the site probabilities.
- The procedure is then repeated for all the other sequences.

EM Algorithm: 2nd optimisation step calculations

- The site probabilities for each sequence calculated in the 1st step are then used to create a new table of expected base counts for each of the site positions, using the site probabilities as weights.
- Suppose that P(site 1 in seq 1) = Psite1,seq1 / (Psite1,seq1 + Psite2,seq1 + ... + Psite78,seq1) = 0.01 and P(site 2 in seq 1) = 0.02.
- These values are then added to the previous table, as shown below.

Bases            Background   Site column 1   Site column 2
G                0.27         0.4             0.1
C                0.25         0.4             0.1
A                0.25         0.2 + 0.01      0.1
T                0.23         0.2             0.7 + 0.02
Total/weighted   1.00         1.00            1.00

EM Algorithm: 2nd optimisation step calculations (continued)

This procedure is repeated for every other possible first column in seq1, and the process then continues for all the other sequences, resulting in a new version of the table. The expectation and maximization steps are repeated until the estimates of the base frequencies no longer change.

Gibbs sampling

- Application to motif discovery

Gibbs sampling and the Motif Finding Problem

Given a list of t sequences, each of length n, find the best pattern of length l that appears in each of the t sequences. Let s = (s1, ..., st) denote the set of starting positions for l-mers in our t sequences. The substrings corresponding to these starting positions then form a t x l alignment matrix and a 4 x l profile matrix, which we will call P. Let Prob(a|P) be the probability that an l-mer a was created by the profile P.

Scoring Strings with a Profile

Given a profile P:

A   1/2   7/8   3/8   0     1/8   0
C   1/8   0     1/2   5/8   3/8   0
T   1/8   1/8   0     0     1/4   7/8
G   1/4   0     1/8   3/8   1/4   1/8

The probability of the consensus string:
Prob(aaacct|P) = 1/2 x 7/8 x 3/8 x 5/8 x 3/8 x 7/8 = 0.033646

The probability of a different string:
Prob(atacat|P) = 1/2 x 1/8 x 3/8 x 5/8 x 1/8 x 7/8 = 0.001602
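A sketch of this scoring function; the two prints reproduce the slide's numbers (the second string is taken as atacat, whose factors match the product shown):

```python
# Score an l-mer against the profile P above: multiply the profile
# entries for the observed base down each column.
P = {"A": [1/2, 7/8, 3/8, 0,   1/8, 0],
     "C": [1/8, 0,   1/2, 5/8, 3/8, 0],
     "T": [1/8, 1/8, 0,   0,   1/4, 7/8],
     "G": [1/4, 0,   1/8, 3/8, 1/4, 1/8]}

def prob(a, P):
    """Prob(a|P): product of the profile entries along the string a."""
    p = 1.0
    for i, base in enumerate(a.upper()):
        p *= P[base][i]
    return p

print(prob("aaacct", P))   # consensus string: ~0.033646
print(prob("atacat", P))   # a different string: ~0.001602
```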

Gibbs Sampling

1) Choose starting positions s = (s1, ..., st) uniformly at random and form the set of l-mers associated with these starting positions.
2) Randomly choose one of the t sequences.
3) Create a profile P from the other t - 1 sequences.
4) For each position in the removed sequence, calculate the probability that the l-mer starting at that position was generated by P.
5) Choose a new starting position for the removed sequence at random, based on the probabilities calculated in step 4.
6) Repeat steps 2-5 until no improvement in the starting positions can be made.
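A compact sketch of steps 1-6; for simplicity it runs a fixed number of iterations instead of testing for convergence, and it omits the pseudocounts a practical implementation would add:

```python
# Gibbs sampling for motif finding, following the six steps above.
import random

def profile(lmers):
    """Column-wise base frequencies of the given l-mers (step 3)."""
    cols = list(zip(*lmers))
    return [{b: col.count(b) / len(col) for b in "ACGT"} for col in cols]

def prob(lmer, P):
    """Prob(a|P): product of column frequencies (step 4)."""
    p = 1.0
    for i, b in enumerate(lmer):
        p *= P[i][b]
    return p

def gibbs(seqs, l, iters=1000):
    s = [random.randrange(len(seq) - l + 1) for seq in seqs]     # step 1
    for _ in range(iters):
        i = random.randrange(len(seqs))                          # step 2
        others = [seq[s[j]:s[j] + l]
                  for j, seq in enumerate(seqs) if j != i]
        P = profile(others)                                      # step 3
        weights = [prob(seqs[i][k:k + l], P)                     # step 4
                   for k in range(len(seqs[i]) - l + 1)]
        if sum(weights) > 0:                                     # step 5
            s[i] = random.choices(range(len(weights)), weights=weights)[0]
    return s                                                     # step 6 (fixed iters)

seqs = ["GTAAACAATATTTATAGC", "AAAATTTACCTCGCAAGG", "CCGTACTGTCAAGCGTGG",
        "TGAGTAAACGACGTCCCA", "TACTTAACACCCTGTCAA"]
print(gibbs(seqs, 8))
```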

An Example of Gibbs Sampling: motif finding in DNA sequences

Input: t = 5 sequences, motif length l = 8:

1. GTAAACAATATTTATAGC
2. AAAATTTACCTCGCAAGG
3. CCGTACTGTCAAGCGTGG
4. TGAGTAAACGACGTCCCA
5. TACTTAACACCCTGTCAA

Start the Gibbs Sampling

- 1) Randomly choose starting positions, s = (s1, s2, s3, s4, s5), in the 5 sequences and figure out what they correspond to:
- s1 = 7: GTAAACAATATTTATAGC
- s2 = 11: AAAATTTACCTTAGAAGG
- s3 = 9: CCGTACTGTCAAGCGTGG
- s4 = 4: TGAGTAAACGACGTCCCA
- s5 = 1: TACTTAACACCCTGTCAA
- 2) Choose one of the sequences at random:
- Sequence 2: AAAATTTACCTTAGAAGG

An Example of Gibbs Sampling

3) Create a profile P from the remaining subsequences:

1   A A T A T T T A
3   T C A A G C G T
4   G T A A A C G A
5   T A C T T A A C

A   1/4   2/4   2/4   3/4   1/4   1/4   1/4   2/4
C   0     1/4   1/4   0     0     2/4   0     1/4
T   2/4   1/4   1/4   1/4   2/4   1/4   1/4   1/4
G   1/4   0     0     0     1/4   0     3/4   0

Consensus string: T A A A T C G A

An Example of Gibbs Sampling

4) Calculate Prob(a|P) for every possible 8-mer in the removed sequence AAAATTTACCTTAGAAGG:

Start   8-mer      Prob(a|P)
1       AAAATTTA   0.000732
2       AAATTTAC   0.000122
3       AATTTACC   0
4       ATTTACCT   0
5       TTTACCTT   0
6       TTACCTTA   0
7       TACCTTAG   0
8       ACCTTAGA   0.000183
9       CCTTAGAA   0
10      CTTAGAAG   0
11      TTAGAAGG   0

An Example of Gibbs Sampling

5) Create a distribution of probabilities and, based on this distribution, randomly select a new starting position.

a) Create a ratio by dividing each probability by the lowest non-zero probability:

Starting position 1: prob(AAAATTTA|P) = 0.000732 / 0.000122 = 6
Starting position 2: prob(AAATTTAC|P) = 0.000122 / 0.000122 = 1
Starting position 8: prob(ACCTTAGA|P) = 0.000183 / 0.000122 = 1.5

Ratio = 6 : 1 : 1.5

An Example of Gibbs Sampling

b) Create a distribution so that the most probable starting position is chosen with the highest probability (dividing each ratio by their sum, 6 + 1 + 1.5 = 8.5):

P(selecting starting position 1) = 6 / 8.5 ≈ 0.706
P(selecting starting position 2) = 1 / 8.5 ≈ 0.118
P(selecting starting position 8) = 1.5 / 8.5 ≈ 0.176

An Example of Gibbs Sampling

Assume we select the substring with the highest probability; we are then left with the following new substrings and starting positions:

s1 = 7: GTAAACAATATTTATAGC
s2 = 1: AAAATTTACCTCGCAAGG
s3 = 9: CCGTACTGTCAAGCGTGG
s4 = 5: TGAGTAATCGACGTCCCA
s5 = 1: TACTTCACACCCTGTCAA

6) We iterate the procedure again with the above starting positions until we can no longer improve the score.

Problems with Gibbs Sampling

- Gibbs sampling often converges to locally optimal motifs rather than globally optimal motifs.
- Gibbs sampling can be very complicated when applied to samples with unequal distributions of nucleotides.

Useful books (general, technical, and statistical)