# Useful algorithms and statistical methods in Genome data analysis - PowerPoint PPT Presentation

PPT – Useful algorithms and statistical methods in Genome data analysis PowerPoint presentation | free to download - id: d1228-ZDc1Z

The Adobe Flash plugin is needed to view this content

Get the plugin now

View by Category
Title:

## Useful algorithms and statistical methods in Genome data analysis

Description:

### Exploratory Data Analysis (EDA) ... The objective of EDA is to separate the smooth from the rough with minimal use ... and adaptation of standard EDA procedures ... – PowerPoint PPT presentation

Number of Views:148
Avg rating:3.0/5.0
Slides: 68
Provided by: ahmed99
Category:
Tags:
Transcript and Presenter's Notes

Title: Useful algorithms and statistical methods in Genome data analysis

1
Useful algorithms and statistical methods in
Genome data analysis
• Ahmed Rebaï
• Centre of Biotechnology of Sfax
• ahmed.rebai_at_cbs.rnrt.tn

Al-kawarizmi (780-840) at the origin of the word
Algorithm
2
Basic statistical concepts and tools
• Primer on
• Random variable
• Random distribution
• Statistical inference

3
Statistics
• Statistics concerns the optimal methods of
analyzing data generated from some chance
mechanism (random phenomena).
• Optimal means appropriate choice of what is to
be computed from the data to carry out
statistical analysis

4
Random variables
• A random variable is a numerical quantity that in
some experiment, that involve some degree of
randomness, takes one value from some discrete
set of possible values
• The probability distribution is a set of values
that this random variable takes together with
their associated probability

5
Three very important distributions!
• The Binomial distribution the number of
successes in n trials with two outcomes
(succes/failure) each, where proba. of sucess p
is the same for all trials and trials are
independant
• Pr(Xk)n!/(n-k)!k! pk (1-p)n-k
• Poisson distribution the limiting form of the
binomial for large n and small p (rare events)
when np? is moderate
• Pr(Xk)e-? ?k/ k!
• Normal (Gaussian) distribution arises in many
biological processes, limiting distribution of
all random variables for large n.

6
Statistical Inference
• Process of drawing conclusions about a population
on the basis of measurments or observation on a
sample of individuals from the population
• Frequentist inference an approach basedon a
frequency (long-run) view of probability with
main tools Tests, likelihood (the probability of
a set of observations x giventhe value of some
set of parameters ? L(x/?))
• Bayesian inference

7
Hypothesis testing
• Significance testing a procedure when applied to
a set of obs. results in a p-value relative to
some null hypothesis
• P-value probability of observed data when the
null hypothesis is true, computed from the
distribution of the test statistic
• Significance level the level of probability at
which it is argued that the null hypothesis will
be rejected. Conventionnally set to 0.05.
• False positive reject nul hypothsis when it is
true (type I error)
• False negative accept null hypothesis when it
is false (type II error)

8
Bayesian inference
• Approach based on Bayes theorem
• Obtain the likelihood L(x/?)
• Obtain the prior distrbution L(?) expressing what
is known about ?, before observing data
• Apply Bayes theorrem to derive de the posterior
distribution L(?/x) expresing what is known about
? after observing the data
• Derive appropriate inference statement from the
posterior distribution estimation, probability
of hypothesis
• Prior distribution respresent the investigators
knowledge (belief) about the parameters before
seeing the data

9
Resampling techniques Bootstrap
• Used when we cannot know the exact distribution
of a parameter
• sampling with replacement from the orignal data
to produce random samples of same size n
• each bootstrap sample produce an estimate of the
parameter of interest.
• Repeating the process a large number of times
provide information on the variability of the
estimator
• Other techniques
• Jackknife generates n samples each of size n-1
by omitting each member in turn from the data

10
Multivariate analysis
• A primer on
• Discriminant analysis
• Principal component analysis
• Correspondance analysis

11
Exploratory Data Analysis (EDA)
• Data analysis that emphasizes the use of informal
graphical porcedures not based on prior
assumptions about the structure of the data or on
formal models for the data
• Data smooth rough where the smooth is the
underlying regulariry or pattern in the data. The
objective of EDA is to separate the smooth from
the rough with minimal use of formal mathematics
or statistics methods

12
Classification and clustering
13
Classification
• In classification you do have a class label (o
and x), each defined in terms of G1 and G2
values.
• You are trying to find a model that splits the
data elements into their existing classes
• You then assume that this model can be used to
assign new data points x and y to the right class

14
Linear Discriminant Analysis
• Proposed by Fisher (1936) for classifying an
obervation into one of two possible groups using
measurements x1,x2,xp.
• Seek a linear transformation of the variables
za1x1a2x2..apxp such that the separation
between the group means on the transformed scale
aS-1 (x1-x2) where x1 and x2 are mean vectors
of the two groups and S the pooled covariance
matrix of the two groups
• Assign subject to group 1 if zi-zclt0 and to group
2 if zi-zcgt0 where zc(z1z2)/2 is the group
means
• Subsets of variable most useful for
discrimination can be selected by regression
procedures

15
Linear (LDA) and Quadratic (QDA) discriminant
analysis for gene prediction
• Simple approach.
• Training set (exons and non-exons)
• Set of features (for internal exons)
• donor/acceptor site recognizers
• octonucleotide preferences for coding region
• octonucleotide preferences for intron interiors
• on either side
• Rule to be induced linear (or quadratic)
combination of features.
• Good accuracy in some circumstances
• (programs Zcurve, MZEF)

16
LDA and QDA in gene prediction
17
Principal Component Analysis
• Introduced by Pearson (1901) and Hotelling (1933)
to describe the variation in a set of
multivariate data in terms of a set of
uncorrelated variables, yi each of which is a
linear combination of the original variables
(xi)
• Yi ai1x1ai2x2aipxp where i1..p
• The new variables Yi are derived in decreasing
order of importance they are called principal
components

18
Principal Component Analysis
How to avoid correlated features? Correlations ?
covariance matrix is non-diagonal Solution
diagonalize it, then use transformation that
makes it diagonal to de-correlate
features Columns of Z are eigenvectors and
diagonal elements of ? are eigenvalues
19
Properties of PCA
Small li ? small variance ? data change little in
direction Yi PCA minimizes C matrix
reconstruction errors Zi vectors for large li
are sufficient because vectors for small
eigenvalues will have very small contribution to
the covariance matrix. The adequacy of
representation by the two first components is
measured by explanation (l1 l2) /? li
True covariance matrices are usually not known,
estimated from data.
20
Use of PCA
uncorrelated features reducing dimensionality
reject low variance features,..
• Analysis of expression data

21
Correspondance analysis (Benzecri, 1973)
• For uncovering and understanding the structure
and pattern in data in contingency tables.
• Involves finding coordinate values which
represent the row and column categories in some
optimal way
• Coordinates are obtained from the Singular value
decomposition (SVD) of a matrix E which elements
are eij(pij- pi. p.j)/(pi. p.j)1/2
• EUDV, U eigenvectors of EE, V eigenvectors of
EE and D a diagonal matrix with ?k singular
values (?²k eigenvalues of either EE or EE)
• The adequacy of representation by the two first
coordinates is measured by
• inertia (?²1?²2)/??²k

22
Properties of CA
• allows consideration of dummy variables and
observations (called illustrative variables or
observations) which do not contribute to the
construction of the factorial space, but can be
displayed on this factorial space.
• With such a representation it is possible to
determine the proximity between observations and
variables and the illustrative variables and
observations.

23
Example analysis of codon usage and gene
expression in E. coli (McInerny, 1997)
• A gene can be represented by a 59-dimensional
vector (universal code)
• A genome consists of hundreds (thousands) of
these genes
• Variation in the variables (RSCU values) might be
governed by only a small number of factors
• For each gene and each codon i calculate
• RCSUi observed cordon i/expected codon i
• Do Correspondance analysis on RSCU

24
Plot of the two most important axes after
correspondence analysis of RSCU values from the
E. coli genome.
Lowly-expressed genes
Highly expressed genes
Recently acquired genes
25
Other Methods
• Factor analysis explain correlations between a
set of p variables in terms of a smaller number
kltp of unobservable latent variables (factors)
xAfu , SAAD where D is a diagonal matrix and
• Multidimensional Scaling construct a
low-dimentional geometrical representation of a
distance matrix
• Cluster analysisa set of methods for contructing
sensible and informative classification of an
initially unclassified set of data using the
variable values observed on each individual
(hierarchical clustering, K-means clustering, ..)

26
Datamining or KDD (Knowledge Discovery in
Databases)
• Process of seeking interesting and valuable
information within large databases
• Extension and adaptation of standard EDA
procedures
• Confluence of ideas from statistics, EDA, machine
learning, pattern recognition, database
technology, etc.
• Emphasis is more on algorithms rather than
mathematical or statistical models

27
Famous Algorithms in Bioinformatics
• A primer on
• HMM
• EM
• Gibbs sampling (MCMC)

28
Hidden Markov Models
• An introduction

29
Historical overview
• Introduced in 1960-70 to model sequences of
observations
• Observations are discrete (character from some
alphabet) or continious (signal frequency)
• First applied in speech recognition since 1970
• Applied to biological sequences since 1992
multiple alignment, profile, gene prediction,
fold recognition, ..

30
Hidden Markov Model (HMM)
• Can be viewed as an abstract machine with k
hidden states.
• Each state has its own probability distribution,
and the machine switches between states according
to this probability distribution.
• At each step, the machine makes 2 decisions
• What state should it move to next?
• What symbol from its alphabet should it emit?

31
Why Hidden?
• Observers can see the emitted symbols of an HMM
but has no ability to know which state the HMM is
currently in.
• Thus, the goal is to infer the most likely states
of an HMM basing on some given sequence of
emitted symbols.

32
HMM Parameters
S set of all possible emission characters. Ex.
S H, T for coin tossing S 1, 2, 3,
4, 5, 6 for dice tossing Q set of hidden states
q1, q2, .., qn, each emitting symbols from S. A
(akl) a Q x Q matrix of probability of
changing from state k to state l. E (ek(b)) a
Q x S matrix of probability of emitting
symbol b during a step in which the HMM is in
state k.
33
The fair bet casino problem
• The game is to flip coins, which results in only
two possible outcomes Head or Tail.
• Suppose that the dealer uses both Fair and Biased
coins.
• The Fair coin will give Heads and Tails with same
probability of ½.
• The Biased coin will give Heads with prob. of ¾.
• Thus, we define the probabilities
• P(HF) P(TF) ½
• P(HB) ¾, P(TB) ¼
• The crooked dealer changes between Fair and
Biased coins with probability 10

34
The Fair Bet Casino Problem
• Input A sequence of x x1x2x3xn of coin tosses
made by two possible coins (F or B).
• Output A sequence p p1 p2 p3 pn, with each pi
being either F or B indicating that xi is the
result of tossing the Fair or Biased coin
respectively.

35
HMM for Fair Bet Casino
• The Fair Bet Casino can be defined in HMM terms
as follows
• S 0, 1 (0 for Tails and 1 Heads)
• Q F,B F for Fair B for Biased coin.
• aFF aBB 0.9
• aFB aBF 0.1
• eF(0) ½ eF(1) ½
• eB(0) ¼ eB(1) ¾

36
HMM for Fair Bet Casino
Visualization of the Transition probabilities A
Biased Fair
Biased aBB 0.9 aFB 0.1
Fair aBF 0.1 aFF 0.9
Visualization of the Emission Probabilities E
Biased Fair
Tails(0) eB(0) ¼ eF(0) ½
37
HMM for Fair Bet Casino
HMM model for the Fair Bet Casino Problem

38
Hidden Paths
• A path p p1 pn in the HMM is defined as a
sequence of states.
• Consider path p FFFBBBBBFFF and sequence x
01011101001

x 0 1 0 1 1 1
0 1 0 0 1 p F F F B
B B B B F F F P(xipi) ½ ½
½ ¾ ¾ ¾ ¾ ¾ ½ ½ ½ P(pi-1 ? pi) ½
9/10 9/10 1/10 9/10 9/10 9/10 9/10
1/10 9/10 9/10
39
P(xp) Calculation
• P(xp) Probability that sequence x was generated
by the path p, given
• the model M.
• n
• P(xp) P(p0? p1) . ? P(xi pi).P(pi ? pi1)
• i1
• n
• a p0, p1 . ? e pi (xi) . a pi, pi1
• i1

40
Forward-Backward Algorithm
• Given a sequence of coin tosses generated by an
HMM.
• Goal find the probability that the dealer was
using a biased coin at a particular time.
• The probability that the dealer used a biased
coin at any moment i is as follows

P(x, pi k)
fk(i) . bk(i) P(pi kx) _______________
______________
P(x) P(x)
41
Forward-Backward Algorithm
• fk,i (forward probability) is the probability of
emitting the prefix x1xi and reaching the state
p k.
• The recurrence for the forward algorithm is
• fk,i ek(xi) . S fk,i-1 . alk
• l ? Q
• Backward probability bk,i the probability of
being in state pi k and emitting the suffix
xi1xn.
• The backward algorithms recurrence
• bk,i S el(xi1) . bl,i1 . Akl
• l ? Q

42
HMM Parameter Estimation
• So far, we have assumed that the transition and
emission probabilities are known.
• However, in most HMM applications, the
probabilities are not known and hard to estimate.
• Let T be a vector combining the unknown
transition and emission probabilities.
• Given training sequences x1,,xm, let P(xT) be
the max. prob. of x given the assignment of
param.s T. m
• Then our goal is to find maxT ? P(xiT)
• j1

43
The Expectation Maximization (EM) algorithm
• Dempster and Laird (1977)

44
EM Algorithm in statistics
• An algorithm producing a sequence of parameter
estimates that converge to MLE. Have two steps
• E-step calculate de expected value of the
likelihood conditionnal on the data and the
current estimates of the parameters E(L(x/?i))
• M-step maximize the likeliood to give updated
parameter estimates increasing L
• The two steps are alternated until convergence
• Needs a first guess estimate to start

45
The EM Algorithm for motif detection
• This algorithm is used to identify conserved
areas in unaligned sequences.
• Assume that a set of sequences is expected to
have a common sequence pattern.
• An initial guess is made as to location and size
of the site of interest in each of the
sequences and these parts are loosely
aligned.
• This alignment provides an estimate of base or
aa composition of each column in the site.

46
EM algorithm in motif finding
• The EM algorithm consists of the two steps, which
are repeated consecutively
• Step 1 the Expectation step, the
column-by-column composition of the site is used
to estimate the probability of finding the site
at any position in each of the sequences. These
probabilities are used to provide new information
as to expected base or aa distribution for each
column in the site.
• Step 2 the Maximization step, the new counts
for bases or aa for each position in the site
found in the step 1 are substituted for the
previous set.

47
Expectation Maximization (EM) Algorithm
OOOOOOOOXXXXOOOOOOOOOOOOOOOOXXXXOOOOOOOO o o o o
o o o o o o o o o o o o o o o o o o o
o OOOOOOOOXXXXOOOOOOOO OOOOOOOOXXXXOOOOOOOO
IIII IIIIIIII IIIIIII
Columns defined by a preliminary alignment of
the sequences provide initial estimates of
frequencies of aa in each motif column
Bases Background Site column 1 Site column 2
G 0.27 0.4 0.1
C 0.25 0.4 0.1
A 0.25 0.2 0.1
T 0.23 0.2 0.7
Total 1.00 1.00 1.00
Columns not in motif provide background
frequencies
48
EM Algorithm in motif finding
A
B
The resulting score gives the likelihood that the
motif matches positions A, B or other in seq 1.
Repeat for all other positions and find most
likely locator. Then repeat for the remaining
seqs.
49
EM Algorithm 1st expectation step calculations
• Assume that the seq1 is 100 bases long and the
length of the site
• is 20 bases.
• Suppose that the site starts in the column 1 and
the first two positions are A and T.
• The site will end at the position 20 and
positions 21 and 22 do not belong to the site.
Assume that these two positions are A and T also.
• The Probability of this location of the site in
seq1 is given by
• Psite1,seq1 0.2 (for A in position 1) x 0.7
(for T in position 2) x Ps (for the next 18
positions in site) x 0.25 (for A in first
flanking position) x 0.23 (for T in second
flanking position x Ps (for the next 78 flanking
positions).

50
EM Algorithm 1st expectation step calculations
(continued)
• The same procedure is applied for calculation of
probabilities for Psite2,seq1 to Psite78, seq1,
thus providing a comparative set of probabilities
for the site location.
• The probability of the best location in seq1,
say at site k, is the ratio of the site
probability at k divided by the sum of all the
other site probabilities.
• Then the procedure repeated for all other
sequences.

51
EM Algorithm 2nd optimisation step calculations
• The site probabilities for each seq calculated
at the 1st step are then used to create a new
table of expected values for base counts for each
of the site positions using the site
probabilities as weights.
• Suppose that P (site 1 in seq 1) Psite1,seq1 /
(Psite1,seq1 Psite2,seq1 Psite78,seq1 )
0.01 and P (site 2 in seq 1) 0.02.
• Then this values are added to the previous table
as shown in the table below.

Bases Background Site column 1 Site column 2
G 0.27 0.4 0.1
C 0.25 0.4 0.1
A 0.25 0.2 0.01 0.1
T 0.23 0.2 0.7 0.02
Total/ weighted 1.00 1.00 1.00
52
EM Algorithm 2nd optimisation step calculations
This procedure is repeated for every other
possible first columns in seq1 and then the
process continues for all other sequences
resulting in a new version of the table. The
expectation and maximization steps are repeated
until the estimates of base frequencies do not
change
53
Gibbs sampling
• Application to motif discovery

54
Gibbs sampling and the Motif Finding Problem
Given a list of t sequences each of length n,
Find the best pattern of length l that appears
in each of the t sequences. Let s denote the set
of starting positions s (s1,...,st) for
l-mers in our t sequences. Then the substrings
corresponding to these starting positions will
form a t x l alignment matrix and a 4 x l
profile matrix which we will call P. Let
Prob(aP) be defined as the probability that an
l-mer a was created by Profile P
55
Scoring Strings with a Profile
Given a profile P
A 1/2 7/8 3/8 0 1/8 0
C 1/8 0 1/2 5/8 3/8 0
T 1/8 1/8 0 0 1/4 7/8
G 1/4 0 1/8 3/8 1/4 1/8
The probability of the consensus string
Prob(aaacctP) 1/2 x 7/8 x 3/8 x 5/8 x 3/8 x
7/8 .033646
Probability of a different string
Prob(atacagP) 1/2 x 1/8 x 3/8 x 5/8 x 1/8 x
7/8 .001602
56
Gibbs Sampling
1) Choose uniformly at random starting
positions s (s1,...,st) and form the set of
l-tuples associated with these starting
positions. 2) Randomly choose one of the t
sequences. 3) Create a profile P from the other
t -1 seq. 4) For each position in the removed
sequence, calculate the probability that the
l-mer starting at that position was generated by
P. 5) Choose a new starting position for the
removed sequence at random based on the
probabilities calculated in step 4. 6) Repeat
steps 2-5 until no improvement on the starting
57
An Example of Gibbs Sampling motifs finding in
DNA sequences
Input t 5 sequences where L 8 1.
GTAAACAATATTTATAGC 2. AAAATTTACCTCGCAAGG 3.
CCGTACTGTCAAGCGTGG 4. TGAGTAAACGACGTCCCA 5.
TACTTAACACCCTGTCAA
58
Start the Gibbs Smapling
• 1) Randomly choose starting positions,
S(s1,s2,s3,s4) in the 4 sequences and figure out
what they correspond to
• S17 GTAAACAATATTTATAGC
• S211 AAAATTTACCTTAGAAGG
• S39 CCGTACTGTCAAGCGTGG
• S44 TGAGTAAACGACGTCCCA
• S51 TACTTAACACCCTGTCAA
• 2) Choose one of the sequences at random
• Sequence 2 AAAATTTACCTTAGAAGG

59
An Example of Gibbs Sampling
3) Create profile from remaining
subsequences
1 A A T A T T T A
3 T C A A G C G T
4 G T A A A C G A
5 T A C T T A A C
A 1/4 2/4 2/4 3/4 1/4 1/4 1/4 2/4
C 0 1/4 1/4 0 0 2/4 0 1/4
T 2/4 1/4 1/4 1/4 2/4 1/4 1/4 1/4
G 1/4 0 0 0 1/4 0 3/4 0
Consensus String T A A A T C G A
60
An Example of Gibbs Sampling
4) Calculate the prob(aP) for every possible
8-mer in the removed sequence Strings
Highlighted in Red prob(aP)
AAAATTTACCTTAGAAGG .000732
AAAATTTACCTTAGAAGG .000122
AAAATTTACCTTAGAAGG 0
AAAATTTACCTTAGAAGG 0
AAAATTTACCTTAGAAGG 0
AAAATTTACCTTAGAAGG 0
AAAATTTACCTTAGAAGG 0
AAAATTTACCTTAGAAGG .000183
AAAATTTACCTTAGAAGG 0
AAAATTTACCTTAGAAGG 0
AAAATTTACCTTAGAAGG 0
61
An Example of Gibbs Sampling

5) Create a distribution of probabilities, and
based on this distribution, randomly select a new
starting position.
a) Create a ratio by dividing each probability
by the lowest probability
Starting Position 1 prob( AAAATTTA P )
.000732 / .000122 6 Starting Position 2
prob( AAATTTAC P ) .000122 / .000122
1 Starting Position 8 prob( ACCTTAGA P )
.000183 / .000122 1.5
Ratio 6 1 1.5
62
An Example of Gibbs Sampling
b) Create a distribution so that the most
probable start position is chosen at random with
the highest probability
P(selecting starting position 1)
.706 P(selecting starting position 2)
.118 P(selecting starting position 8) .176
63
An Example of Gibbs Sampling
Assume we select the substring with the highest
probability then we are left with the following
new substrings and starting positions. S17 GTAA
ACAATATTTATAGC S21 AAAATTTACCTCGCAAGG S39 CC
GTACTGTCAAGCGTGG S45 TGAGTAATCGACGTCCCA S51
TACTTCACACCCTGTCAA 6) We iterate the procedure
again with the above starting positions until we
cannot improve the score any more.
64
Problems with Gibbs Sampling
1. Gibbs Sampling often converges to local optimum
motifs instead of global optimum motifs.
2. Gibbs Sampling can be very complicated when
applied to samples with unequal distributions of
nucleotides

65
Useful books general
66
Useful book technical
67
Useful books statistical