Biological sequence analysis - PowerPoint PPT Presentation

Biological sequence analysis Terry Speed Wald Lecture II, August 8, 2001 – PowerPoint PPT presentation

Transcript and Presenter's Notes

Title: Biological sequence analysis

1
Biological sequence analysis

Terry Speed
Wald Lecture II,
August 8, 2001

2
The objects of our study

DNA, RNA and proteins: macromolecules which are unbranched polymers built up from smaller units.
DNA: units are the nucleotide residues A, C, G and T.
RNA: units are the nucleotide residues A, C, G and U.
Proteins: units are the amino acid residues A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W and Y.
To a considerable extent, the chemical properties of DNA, RNA and protein molecules are encoded in the linear sequence of these basic units: their primary structure.

3
The use of statistics to study linear sequences
of biomolecular units

Can be descriptive, predictive or everything else in between: almost business as usual.
Stochastic mechanisms should never be taken literally, but nevertheless can be amazingly useful.
Care is always needed: a model or method can break down at any time without notice.
Biological confirmation of predictions is almost always necessary.

4
The statistics of biological sequences can
be global or local

Base composition of genomes:
E. coli: 25% A, 25% C, 25% G, 25% T
P. falciparum: 82% A+T
Translation initiation:
ATG is the near-universal motif indicating the start of translation in DNA coding sequence.

5
From certainty to statistical models: a brief case study
1ZNF: Cys-Cys-His-His zinc finger DNA binding domain
6
Cys-Cys-His-His zinc finger DNA binding domain

Its characteristic motif has the regular expression
C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H
1ZNF: XYKCGLCERSFVEKSALSRHQRVHKNX
Regular expressions can be limiting, and position-specific distributions came to represent the variability in motifs.
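The motif above is written in PROSITE-style notation, which maps directly onto an ordinary regular expression. The following is a minimal sketch of that mapping; the helper name `prosite_to_regex` is an assumption for illustration, and it handles only the elements that occur in this pattern (single residues, bracketed residue sets, and `x` with repeat counts).

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE-style pattern into a Python regex (sketch)."""
    element_re = re.compile(r"([A-Zx]|\[[A-Z]+\])(?:\((\d+)(?:,(\d+))?\))?")
    pieces = []
    for element in pattern.split("-"):
        m = element_re.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        residue, lo, hi = m.groups()
        piece = "." if residue == "x" else residue  # x stands for any residue
        if lo is not None:
            # x(3) becomes .{3}; x(2,4) becomes .{2,4}
            piece += f"{{{lo},{hi}}}" if hi is not None else f"{{{lo}}}"
        pieces.append(piece)
    return "".join(pieces)

ZINC_FINGER = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"
SEQ_1ZNF = "XYKCGLCERSFVEKSALSRHQRVHKNX"

regex = prosite_to_regex(ZINC_FINGER)
match = re.search(regex, SEQ_1ZNF)
print(regex)
print(match.group(0), match.span())
```

The pattern either matches or it does not, which is the limitation the slide points to: a near miss at one position is rejected outright, whereas a position-specific distribution can score it.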

7
Cys-Cys-His-His profile: sequence logo form
A sequence logo is a scaled position-specific amino acid distribution. Scaling is by a measure of a position's information content.
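The slide does not say which information measure is used for the scaling. One common choice, assumed here, is log2(20) minus the Shannon entropy of the column, with no small-sample correction. The sketch below computes that quantity and the resulting letter heights; the function names and the toy columns are hypothetical.

```python
import math
from collections import Counter

def column_information(column: str, alphabet_size: int = 20) -> float:
    """Information content in bits of one alignment column: log2(N) - entropy."""
    counts = Counter(column)
    total = len(column)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return math.log2(alphabet_size) - entropy

def logo_heights(column: str, alphabet_size: int = 20) -> dict:
    """Letter heights: each residue's frequency scaled by the column's information."""
    info = column_information(column, alphabet_size)
    total = len(column)
    return {aa: info * n / total for aa, n in Counter(column).items()}

# Hypothetical toy columns, not taken from the zinc finger profile.
conserved = "CCCCCCCC"   # one residue only: maximal information
mixed = "LLIIVVFF"       # four residues, equally frequent

print(column_information(conserved))
print(logo_heights(mixed))
```

A fully conserved column gets the full log2(20) bits, so its single letter is drawn tall; a variable column gets a short stack.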
8
Representation of motifs: the next steps

Missing from the position-specific distribution representation of motifs are good ways of dealing with:
Length distributions for insertions/deletions
Cross-position association of amino acids
Hidden Markov models help with the first.
The second remains a hard unsolved problem.

9
Hidden Markov models

Processes $(S_t, O_t)$, $t = 1, 2, \ldots$, where $S_t$ is the hidden state and $O_t$ the observation at time $t$, such that
$\mathrm{pr}(S_t \mid S_{t-1}, O_{t-1}, S_{t-2}, O_{t-2}, \ldots) = \mathrm{pr}(S_t \mid S_{t-1})$
$\mathrm{pr}(O_t \mid S_t, S_{t-1}, O_{t-1}, S_{t-2}, O_{t-2}, \ldots) = \mathrm{pr}(O_t \mid S_t, S_{t-1})$
The basics of HMMs were laid bare in a series of beautiful papers by L E Baum and colleagues around 1970, and their formulation has been used almost unchanged to this day.

10
Hidden Markov models: extensions

Many variants are now used. For example, the distribution of O may not depend on the previous S, or may also depend on previous O values:
$\mathrm{pr}(O_t \mid S_t, S_{t-1}, O_{t-1}, \ldots) = \mathrm{pr}(O_t \mid S_t)$, or
$\mathrm{pr}(O_t \mid S_t, S_{t-1}, O_{t-1}, \ldots) = \mathrm{pr}(O_t \mid S_t, S_{t-1}, O_{t-1})$.
Most importantly for us, the times of S and O may be decoupled, permitting the Observation corresponding to State time t to be a string whose length and composition depend on $S_t$ (and possibly $S_{t-1}$ and part or all of the previous Observations). This is called a hidden semi-Markov or generalized hidden Markov model.
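To make the decoupling of state time and observation time concrete, here is a minimal generative sketch of a generalized HMM in which each visit to a state emits a whole string. Every state name, probability and length range below is made up for illustration, and the uniform length distribution is an assumption chosen for simplicity.

```python
import random

# Every state name, probability and length range here is made up for illustration.
STATES = {
    "AT_rich": {
        "next": {"GC_rich": 1.0},
        "length_range": (5, 15),
        "bases": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    },
    "GC_rich": {
        "next": {"AT_rich": 1.0},
        "length_range": (2, 6),
        "bases": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    },
}

def sample_ghmm(n_segments, start="AT_rich", rng=None):
    """Sample n_segments (state, emitted string) pairs from the toy model."""
    rng = rng or random.Random()
    state = start
    segments = []
    for _ in range(n_segments):
        spec = STATES[state]
        # The length of the emitted string depends on the current state.
        length = rng.randint(*spec["length_range"])
        letters, weights = zip(*spec["bases"].items())
        # So does its composition (here: independent draws of bases).
        emitted = "".join(rng.choices(letters, weights=weights, k=length))
        segments.append((state, emitted))
        successors, probs = zip(*spec["next"].items())
        state = rng.choices(successors, weights=probs, k=1)[0]
    return segments

for state, emitted in sample_ghmm(4, rng=random.Random(1)):
    print(state, emitted)
```

One state step produces several observed letters, so the observed sequence is longer than the state sequence. An ordinary HMM is the special case in which every emitted string has length one.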

11
A simple HMM (Churchill, 1989)
[Figure: state diagram with transition probabilities 0.01, 0.99, 0.9 and 0.1, above a row of hidden states and a row of observations.]
12
The algorithms

As the name suggests, with an HMM the series $O = (O_1, O_2, O_3, \ldots, O_T)$ is observed, while the states $S = (S_1, S_2, S_3, \ldots, S_T)$ are not.
There are elegant algorithms for calculating $\mathrm{pr}(O \mid \theta)$, $\arg\max_\theta \mathrm{pr}(O \mid \theta)$ in certain special cases, and $\arg\max_S \mathrm{pr}(S \mid O, \theta)$.
Here $\theta$ are the parameters of the model, e.g. transition and observation probabilities.
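The first of these quantities, the probability of the observed series given the parameters, is usually computed with the forward recursion. Below is a minimal sketch for the plain case where each observation depends only on the current state. The two states and all numbers in `INIT`, `TRANS` and `EMIT` are hypothetical, and every probability is assumed to be strictly positive.

```python
import math

# Hypothetical two-state model over DNA bases; all numbers are illustrative.
INIT = {"AT_rich": 0.5, "GC_rich": 0.5}
TRANS = {
    "AT_rich": {"AT_rich": 0.9, "GC_rich": 0.1},
    "GC_rich": {"AT_rich": 0.2, "GC_rich": 0.8},
}
EMIT = {
    "AT_rich": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
    "GC_rich": {"A": 0.10, "C": 0.40, "G": 0.40, "T": 0.10},
}

def forward_log_likelihood(obs, init, trans, emit):
    """Return log pr(O | parameters) using the scaled forward recursion."""
    states = list(init)
    alpha = {s: init[s] * emit[s][obs[0]] for s in states}
    log_lik = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Sum over the previous state, then emit the current observation.
            alpha = {
                s: emit[s][obs[t]] * sum(alpha[r] * trans[r][s] for r in states)
                for s in states
            }
        # Rescale to avoid underflow; the scale factors multiply to pr(O).
        scale = sum(alpha.values())
        log_lik += math.log(scale)
        alpha = {s: a / scale for s, a in alpha.items()}
    return log_lik

print(forward_log_likelihood("GGCACTGAA", INIT, TRANS, EMIT))
```

The recursion costs on the order of T times the square of the number of states, rather than summing over every possible state path.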

13

Some early applications of HMMs:
finance, but we never saw them
speech recognition
modelling ion channels
In the mid-late 1980s HMMs entered genetics and molecular biology, and they are now firmly entrenched.

Some current applications of HMMs to biology:
mapping chromosomes
aligning biological sequences
predicting sequence structure
inferring evolutionary relationships
finding genes in DNA sequence
14
HMMs representing coiled-coil domains
2TMA Tropomyosin
15
Coiled-coil domains, schematically
dimeric parallel helices, heptad repeats,
knobs-into-holes
16
(hydrophobic)
17
Designing the HMM, I
18
Designing the HMM, 2
19
Designing the HMM, 3
20
HMM decoding
Sequence  WGP ARQLNES VKDINKM LER HP
Labels    BBB CCCCCCC CCCCCCC CCC BB
Path1     000 abcdefg abcdefg abc 00
Path2     00c defgabc defgabc def g0

VITERBI decoding: of all possible state paths, we determine the one with maximum probability given the amino acid sequence O.
POSTERIOR decoding: at each position, we determine the state with the highest probability given O.
Issue: how should we measure the strength of a potential CC domain, and how should this depend on the length of a domain?
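The two decoding rules can be sketched side by side on a small model. The two-state model below is a hypothetical stand-in, not the coiled-coil HMM of the slides, and all probabilities are illustrative and assumed strictly positive.

```python
import math

# Hypothetical two-state model; all numbers are illustrative.
INIT = {"AT_rich": 0.5, "GC_rich": 0.5}
TRANS = {"AT_rich": {"AT_rich": 0.9, "GC_rich": 0.1},
         "GC_rich": {"AT_rich": 0.2, "GC_rich": 0.8}}
EMIT = {"AT_rich": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
        "GC_rich": {"A": 0.10, "C": 0.40, "G": 0.40, "T": 0.10}}

def viterbi(obs, init, trans, emit):
    """Most probable state path given obs (log space, positive probabilities)."""
    states = list(init)
    score = {s: math.log(init[s]) + math.log(emit[s][obs[0]]) for s in states}
    back = []
    for o in obs[1:]:
        prev, score, pointers = score, {}, {}
        for s in states:
            best = max(states, key=lambda r: prev[r] + math.log(trans[r][s]))
            pointers[s] = best
            score[s] = prev[best] + math.log(trans[best][s]) + math.log(emit[s][o])
        back.append(pointers)
    path = [max(states, key=lambda s: score[s])]
    for pointers in reversed(back):      # trace the best path backwards
        path.append(pointers[path[-1]])
    return path[::-1]

def posterior_decode(obs, init, trans, emit):
    """Per-position most probable state, plus the posterior probabilities."""
    states = list(init)
    fwd = []
    alpha = {s: init[s] * emit[s][obs[0]] for s in states}
    for t, o in enumerate(obs):          # scaled forward pass
        if t > 0:
            alpha = {s: emit[s][o] * sum(fwd[-1][r] * trans[r][s] for r in states)
                     for s in states}
        z = sum(alpha.values())
        fwd.append({s: a / z for s, a in alpha.items()})
    bwd = [{s: 1.0 for s in states}]
    for o in reversed(obs[1:]):          # scaled backward pass
        beta = {r: sum(trans[r][s] * emit[s][o] * bwd[-1][s] for s in states)
                for r in states}
        z = sum(beta.values())
        bwd.append({r: b / z for r, b in beta.items()})
    bwd.reverse()
    post = []
    for a, b in zip(fwd, bwd):           # per-position scaling cancels here
        w = {s: a[s] * b[s] for s in states}
        z = sum(w.values())
        post.append({s: w[s] / z for s in states})
    return [max(p, key=p.get) for p in post], post

obs = "ATTAGCGCGGCATTA"
print(viterbi(obs, INIT, TRANS, EMIT))
print(posterior_decode(obs, INIT, TRANS, EMIT)[0])
```

Viterbi returns one globally consistent path, while posterior decoding picks the best state position by position, so the two can disagree. The per-position posteriors are also the kind of quantity that can be plotted as a probability profile along the sequence.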

21
CC-PROBABILITY PROFILE
Fusion protein of simian parainfluenza virus 5
22
Assessing performance: terms
TP (true positive): a predicted fragment that overlaps the annotated fragment (aa in the annotated region)
FP (false positive): a predicted fragment that does not overlap any annotated fragment (aa not in the annotated region)
LS: learning set of sequences
NTS: negative test set, sequences with no CCD, used for estimating FP
PTS: positive test set, used for estimating TP
Much care/effort required to create these sets
23
Assessing performance: study design
Study the variability of performance under variation of the sequences used for determining the model parameters.
Compare methods using the same set of aa-frequencies / emission probabilities.
Use the same set of domains for learning and testing instead of testing on different protein families.
Choose a number of FP-rates and calculate the corresponding TP-rates (ROC curve).
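The last step of the design, fixing an FP-rate on negatives and reading off the TP-rate on positives, can be sketched as follows. This simplifies the slides' setting by assuming one scalar score per sequence rather than overlapping predicted fragments; the function names and all scores are hypothetical.

```python
def threshold_at_fp_rate(negative_scores, fp_rate):
    """Score threshold above which at most fp_rate of the negatives fall."""
    ranked = sorted(negative_scores, reverse=True)
    allowed = int(fp_rate * len(ranked))   # negatives allowed above threshold
    return ranked[allowed] if allowed < len(ranked) else float("-inf")

def tp_rate(positive_scores, threshold):
    """Fraction of positives scoring strictly above the threshold."""
    return sum(s > threshold for s in positive_scores) / len(positive_scores)

# Hypothetical scores: one per sequence in the negative and positive test sets.
nts_scores = [0.1, 0.2, 0.3, 0.35, 0.4, 0.5, 0.55, 0.6, 0.7, 0.9]
pts_scores = [0.3, 0.65, 0.8, 0.95, 0.85]

roc_points = [
    (fp, tp_rate(pts_scores, threshold_at_fp_rate(nts_scores, fp)))
    for fp in (0.0, 0.1, 0.2)
]
print(roc_points)
```

Because the threshold is set on the negative set alone, methods are compared at the same FP-rate, and repeating the calculation over resampled learning sets would show how variable each TP-rate is.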
24
PTS subdivided 150 times at random (stratified) into 2/3 for learning and the rest for testing.
150 x:
LEARNING PHASE -> PARAMETERS
TESTING ON NTS -> THRESHOLDS
TESTING ON PTS -> TP-VALUES
ANALYSIS OF RESULTS
25
Assessing performance: summaries

TP-rate at given FP-rate per family / per length-class
TP- and FP-rates for amino acids per family / per length-class
Accuracy at the borders and of length prediction

26
Finding genes in DNA sequence
This is one of the most challenging and
interesting problems in computational biology at
the moment. With so many genomes being sequenced
so rapidly, it remains important to begin by
identifying genes computationally.
27
What is a (protein-coding) gene?
CCTGAGCCAACTATTGATGAA
CCUGAGCCAACUAUUGAUGAA
PEPTIDE
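The three lines above show a DNA sequence, its RNA transcript and the encoded peptide. A minimal sketch of the two steps, using the standard genetic code and treating the given DNA as the coding strand (an assumption), is:

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def transcribe(dna: str) -> str:
    """RNA with the same sequence as the coding strand: T is replaced by U."""
    return dna.replace("T", "U")

def translate(dna: str) -> str:
    """Translate consecutive codons of a coding-strand DNA sequence ('*' = stop)."""
    usable = len(dna) - len(dna) % 3   # ignore a trailing partial codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))

dna = "CCTGAGCCAACTATTGATGAA"
print(transcribe(dna))
print(translate(dna))
```

Read in codons, the slide's sequence is CCT GAG CCA ACT ATT GAT GAA, which spells out the one-letter amino acid codes P, E, P, T, I, D, E.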
28
What is a gene, ctd?

In general the transcribed sequence is longer than the translated portion: parts called introns (intervening sequence) are removed, leaving exons (expressed sequence), and yet other regions remain untranslated. The translated sequence comes in triples called codons, beginning with a unique start codon (ATG) and ending with one of three stop codons (TAA, TAG, TGA).
There are also characteristic intron-exon boundaries called splice donor and acceptor sites, and a variety of other motifs: promoters, transcription start sites, polyA sites, branching sites, and so on.
All of the foregoing have statistical characterizations.
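The start and stop codon rule alone already gives a crude deterministic signal. The sketch below scans the three reading frames of one strand for stretches that begin with ATG and end at the first in-frame stop codon. It ignores introns, the reverse strand and every other signal mentioned above, so it is an illustration of the codon rule rather than a gene finder; the function name and test sequence are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna: str, min_codons: int = 2):
    """Return sorted (start, end) slices of ATG...stop stretches, all 3 frames."""
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first ATG opens a reading frame
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))  # end index includes the stop
                start = None
    return sorted(orfs)

# Hypothetical sequence containing one short open reading frame.
dna = "CCATGGCTTAAGG"
for start, end in find_orfs(dna):
    print(start, end, dna[start:end])
```

In genomes with introns, the coding sequence is interrupted, so a scan like this finds at best fragments of genes. That is one reason the remaining signals are modelled statistically.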

29
In more detail (color state)
30
Some facts about human genes

Comprise about 3% of the genome
Average gene length 8,000 bp
Average of 5-6 exons/gene
Average exon length 200 bp
Average intron length 2,000 bp
8% of genes have a single exon
Some exons can be as small as 1 or 3 bp.
HUMFMR1S is not atypical: 17 exons 40-60 bp long, comprising 3% of a 67,000 bp gene

31
The idea behind a GHMM genefinder

States represent standard gene features: intergenic region, exon, intron, perhaps more (promoter, 5'UTR, 3'UTR, poly-A, ...).
Observations embody state-dependent base composition, dependence, and signal features.
In a GHMM, duration must be included as well.
Finally, reading frames and both strands must be dealt with.

32
Half a model for a genefinder
33
Splice sites can be included in the exons
34
Beyond position-specific distributions

The bases in splice sites exhibit dependence, and not simply of the nearest-neighbor kind.
High-order (non-stationary) Markov models would be one option, but the number of parameters in relation to the amount of data rules them out.
The class of variable length Markov models (VLMMs), deriving from early research by Rissanen, has proved valuable in this context. However, there is likely to be room for more research here.
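The parameter-count objection to high-order non-stationary Markov models can be made concrete with a little arithmetic. In such a model each position has its own conditional distribution for every context of preceding bases, and each distribution over 4 bases has 3 free parameters. The site length of 9 used below is a hypothetical choice for illustration.

```python
def free_parameters(order: int, positions: int, alphabet: int = 4) -> int:
    """Free parameters of a non-stationary Markov model of the given order.

    Position j (0-based) conditions on min(j, order) preceding bases, and each
    conditional distribution has alphabet - 1 free parameters.
    """
    return sum(
        (alphabet - 1) * alphabet ** min(j, order)
        for j in range(positions)
    )

# Hypothetical site length of 9 positions.
for order in range(5):
    print(order, free_parameters(order, positions=9))
```

Order 0 is the position-specific distribution, and each extra order multiplies the count at later positions by 4. A variable length model instead lengthens the context only where the data support it.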

35
GENSCAN (Burge & Karlin)
36
Remark

In general the problem of identifying (annotating) human genes is considerably harder than β-globin might suggest.
The human factor VIII gene (whose mutations cause hemophilia A) is spread over 186,000 bp. It consists of 26 exons ranging in size from 69 to 3,106 bp, and its 25 introns range in size from 207 to 32,400 bp. The complete gene is thus 9 kb of exon and 177 kb of intron.
The biggest human gene yet is for dystrophin. It has > 30 exons and is spread over 2.4 million bp.

37
Challenges in the analysis of sequence data

Understanding the biology well enough to begin.
Designing HMM architecture, e.g. in Marcoil for coiled-coils.
Modelling the parts, e.g. VLMMs for splice sites.
Coding: software engineering is the hardest and most important task of all, making it all work.
Obtaining good data sets for use in careful evaluation and comparison with competing algorithms: designing the studies.
Opportunities for methodological research.

38
Topics not mentioned include

Molecular evolution, including phylogenetic
inference (building trees from aligned sequence
data)
Sequence alignment (pairwise, multiple),
including use of Gibbs sampler
Stochastic context-free grammar models and the
analysis of RNA sequence data.

39
Acknowledgements