Bioinformatics For MNW 2nd Year presentation

1
Bioinformatics For MNW 2nd Year

Jaap Heringa
FEW/FALW
Integrative Bioinformatics Institute VU (IBIVU)
heringa@cs.vu.nl

2
Current Bioinformatics Unit

Jens Kleinjung (1/11/02)
Victor Simossis - PhD (1/12/02)
Radek Szklarczyk - PhD (1/01/03)
John Romein (1/12/02, Henri Bal)

3
Bioinformatics course 2nd year MNW spring 2003

Pattern recognition
Supervised/unsupervised learning
Types of data, data normalisation, lacking data
Search image
Similarity tables
Clustering
Principal component analysis
Discriminant analysis

4
Bioinformatics course 2nd year MNW spring 2003

Protein
Folding
Structure and function
Protein structure prediction
Secondary structure
Tertiary structure
Function
Post-translational modification
Prot.-Prot. Interaction -- Docking algorithm
Molecular dynamics/Monte Carlo

5
Bioinformatics course 2nd year MNW spring 2003

Sequence analysis
Pairwise alignment
Dynamic programming (NW, SW, shortcuts)
Multiple alignment
Combining information
Database/homology searching (Fasta, Blast,
Statistical issues-E/P values)

6
Bioinformatics course 2nd year MNW spring 2003

Gene structure and gene finding algorithm
Omics
DNA makes RNA makes protein
Expression data, Nucleus to ribosome,
translation, etc.
Metabolomics
Physiomics
Databases
DNA, EST
Protein sequence
Protein structure

7
Bioinformatics course 2nd year MNW spring 2003

Microarray data
Protein structure (PDB)
Proteomics
Mass spectrometry/NMR/X-ray?

8
Bioinformatics course 2nd year MNW spring 2003

Bioinformatics method development
IPR issues
Programming and scripting languages
Web solutions
Computational issues
NP-complete problems
CPU, memory, storage problems
Parallel computing
Bioinformatics method usage/application
Molecular viewers (RasMol, MolMol, etc.)

9
Gathering knowledge

Anatomy, architecture
Dynamics, mechanics
Informatics
(Cybernetics Wiener, 1948)
(Cybernetics has been defined as the science of
control in machines and animals, and hence it
applies to technological, animal and
environmental systems)
Genomics, bioinformatics

Rembrandt, 1632
Newton, 1726
10
Bioinformatics
Chemistry
Biology Molecular biology
Mathematics Statistics
Bioinformatics
Computer Science Informatics
Medicine
Physics
11
Bioinformatics

Studying informational processes in biological
systems (Hogeweg, early 1970s)
No computers necessary
Back of envelope OK

Information technology applied to the management
and analysis of biological data (Attwood and
Parry-Smith)
Applying algorithms with mathematical formalisms
in biology (genomics) -- USA
12
Bioinformatics in the olden days

Close to Molecular Biology
(Statistical) analysis of protein and nucleotide
structure
Protein folding problem
Protein-protein and protein-nucleotide
interaction
Many essential methods were created early on (BG
era)
Protein sequence analysis (pairwise and multiple
alignment)
Protein structure prediction (secondary, tertiary
structure)

13
Bioinformatics in the olden days (Cont.)

Evolution was studied and methods created
Phylogenetic reconstruction (clustering, NJ method)

14
The Human Genome -- 26 June 2000
15
The Human Genome -- 26 June 2000
Dr. Craig Venter Celera Genomics -- Shotgun method
Sir John Sulston Human Genome Project
16
Human DNA

There are about 3bn ($3 \times 10^9$) nucleotides in the
nucleus of almost all of the trillions ($3.5 \times 10^{12}$)
of cells of a human body (an exception is, for
example, red blood cells, which have no nucleus and
therefore no DNA): a total of about $10^{22}$
nucleotides!
Many DNA regions code for proteins, and are
called genes (1 gene codes for 1 protein in
principle)
Human DNA contains 30,000 expressed genes
Deoxyribonucleic acid (DNA) comprises 4 different
types of nucleotides: adenine (A), thymine (T),
cytosine (C) and guanine (G). These nucleotides
are sometimes also called bases

17
Human DNA (Cont.)

All people are different, but the DNA of
different people only varies by 0.2% or less.
So, only 2 letters in 1000 are expected to be
different. Over the whole genome ($3 \times 10^9$ letters),
this means that up to about 6 million letters would
differ between individuals.
The structure of DNA is the so-called double
helix, discovered by Watson and Crick in 1953,
where the two helices are cross-linked by A-T and
C-G base pairs (nucleotide pairs; so-called
Watson-Crick base pairing).

18
Up to here: 3/2, 10.45-12.30
19
DNA compositional biases

Base composition of genomes
E. coli: 25% A, 25% C, 25% G, 25% T
P. falciparum (Malaria parasite): 82% A+T
Translation initiation
ATG is the near universal motif indicating the
start of translation in DNA coding sequence.

20
Some facts about human genes

Comprise about 3% of the genome
Average gene length 8,000 bp
Average of 5-6 exons/gene
Average exon length 200 bp
Average intron length 2,000 bp
8% of genes have a single exon
Some exons can be as small as 1 or 3 bp.
HUMFMR1S is not atypical: 17 exons, 40-60 bp long,
comprising 3% of a 67,000 bp gene

21
Genetic diseases

Many diseases run in families and are a result of
genes which predispose such family members to
these illnesses
Examples are Alzheimer's disease, cystic fibrosis
(CF), breast or colon cancer, or heart diseases.
Some of these diseases can be caused by a problem
within a single gene, such as with CF.

22
Genetic diseases (Cont.)

For other illnesses, like heart disease, at least
20-30 genes are thought to play a part, and it is
still unknown which combination of problems
within which genes are responsible.
A problem within a gene means that a single
nucleotide, or a combination of nucleotides,
within the gene causes the disease (or prevents
the body from fighting the disease
sufficiently).
Persons with different combinations of these
nucleotides could then be unaffected by these
diseases.

23
Genetic diseases (Cont.): Cystic Fibrosis

Known since very early on (Celtic gene)
Inherited autosomal recessive condition (Chr. 7)
Symptoms
Clogging and infection of lungs (early death)
Intestinal obstruction
Reduced fertility and (male) anatomical anomalies
CF gene CFTR has a 3-bp deletion leading to Del508
(Phe) in the 1480 aa protein (epithelial Cl- channel);
the protein is degraded in the ER instead of being
inserted into the cell membrane

24
Genomic Data Sources

DNA/protein sequence
Expression (microarray)
Proteome (xray, NMR,
mass spectrometry)
Metabolome
Physiome (spatial,
temporal)

Integrative bioinformatics
25
Genomic Data Sources Vertical Genomics
genome
transcriptome
proteome
metabolome
physiome
Dinner discussion Integrative Bioinformatics
Genomics VU
26
A gene codes for a protein
CCTGAGCCAACTATTGATGAA
CCUGAGCCAACUAUUGAUGAA
PEPTIDE
27
Humans have spliced genes
28
DNA makes RNA makes Protein
29
Remark

The problem of identifying (annotating) human
genes is considerably harder than the early
success story for β-globin might suggest.
The human factor VIII gene (whose mutations cause
hemophilia A) is spread over 186,000 bp. It
consists of 26 exons ranging in size from 69 to
3,106 bp, and its 25 introns range in size from
207 to 32,400 bp. The complete gene is thus 9 kb
of exon and 177 kb of intron.
The biggest human gene yet is for dystrophin. It
has > 30 exons and is spread over 2.4
million bp.

30
DNA makes RNA makes Protein: Expression data

More copies of mRNA for a gene lead to more
protein
mRNA can now be measured for all the genes in a
cell at once through microarray technology
Can have 60,000 spots (genes) on a single gene
chip
Colour change gives intensity of gene expression
(over- or under-expression)

32
Metabolic networks: Glycolysis and
Gluconeogenesis
Kegg database (Japan)
33
High-throughput Biological Data

Enormous amounts of biological data are being
generated by high-throughput capabilities; even
more are coming
genomic sequences
gene expression data
mass spec. data
protein-protein interaction
protein structures
......

34
Protein structural data explosion
Protein Data Bank (PDB): 14,500 structures (6
March 2001): 10,900 X-ray crystallography, 1,810
NMR, 278 theoretical models, others...
35
Dickerson's formula (equivalent to Moore's law):
$n = e^{0.19(y-1960)}$, with y the year.
On 27 March 2001 there were 12,123 3D protein
structures in the PDB; Dickerson's formula
predicts 12,066 (within 0.5%)!
36
Sequence versus structural data

Despite structural genomics efforts, growth of the
PDB slowed down in 2001-2002 (i.e. did not keep up
with Dickerson's formula)
More than 100 completely sequenced genomes
Increasing gap between structural and sequence
data

37
Bioinformatics
[Diagram: Bioinformatics among the sciences, on a scale from large/external (integrative) to small/internal (individual). Labels: Science; Human; Planetary Science; Cultural Anthropology; Population Biology; Sociology; Sociobiology; Psychology; Systems Biology; Biology; Medicine; Molecular Biology; Chemistry; Physics]
38
Bioinformatics

Offers an ever more essential input to
Molecular Biology
Pharmacology (drug design)
Agriculture
Biotechnology
Clinical medicine
Anthropology
Forensic science
Chemical industries (detergent industries, etc.)

39
High-throughput Biological Data: The data deluge

Hidden in these data is information that reflects
existence, organization, activity, functionality
of biological machineries at different levels
in living organisms

Most effectively utilising this information will
prove to be essential for Integrative
Bioinformatics
40
Data Issues

Data collection: getting the data
Data representation: data standards, data normalisation ...
Data organisation and storage: database issues ...
Data analysis and data mining: discovering knowledge, patterns/signals, from data, establishing associations among data patterns
Data utilisation and application: from data patterns/signals to models for bio-machineries
Data visualization: viewing complex data
Data transmission: data collection, retrieval, ...

41
Up to here: 5/2
42
Bioinformatics

Nothing in Biology makes sense except in the
light of evolution (Theodosius Dobzhansky
(1900-1975))
Nothing in bioinformatics makes sense except in
the light of Biology

43
Pair-wise alignment
T D W V T A L K
T D W L - - I K
Combinatorial explosion
- 1 gap in 1 sequence: n+1 possibilities
- 2 gaps in 1 sequence: (n+1)n
- 3 gaps in 1 sequence: (n+1)n(n-1), etc.
$$\binom{2n}{n} = \frac{(2n)!}{(n!)^2} \approx \frac{2^{2n}}{\sqrt{\pi n}}$$
2 sequences of 300 a.a.: ~$10^{179}$ alignments
2 sequences of 1000 a.a.: ~$10^{600}$ alignments!
44
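Dynamic programming: Scoring alignments
$S_{a,b}$ = (sum of exchange scores over aligned residue pairs) - (sum of gap penalties $gp(k)$ over all gaps)
$gp(k) = p_i + k \cdot p_e$ (affine gap penalties, for a gap of length k)
$p_i$ and $p_e$ are the penalties for gap initialisation and extension, respectively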
45
Dynamic programming: Scoring alignments
T D W V T A L K
T D W L - - I K
Amino Acid Exchange Matrix (20×20)
Gap penalties (open, extension): 10, 1
Score = s(T,T) + s(D,D) + s(W,W) + s(V,L) - Po - 2Px + s(L,I) + s(K,K)
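
The scoring scheme on this slide can be sketched in a few lines of Python. This is only an illustration: `toy_score` is a made-up stand-in for a real 20×20 amino acid exchange matrix, and the function names are hypothetical. Each run of gap characters is charged one opening penalty plus one extension penalty per gapped position, as in the Score line above.

```python
def gap_penalty(k, p_open, p_ext):
    """Affine penalty for a gap of k positions: p_open + k * p_ext."""
    return p_open + k * p_ext

def score_alignment(row_a, row_b, s, p_open, p_ext):
    """Sum exchange scores over aligned pairs and subtract one affine
    penalty per run of gap columns (simplification: a run is not split
    when the gap switches from one sequence to the other)."""
    total, gap_len = 0.0, 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            gap_len += 1
            continue
        if gap_len:
            total -= gap_penalty(gap_len, p_open, p_ext)
            gap_len = 0
        total += s(a, b)
    if gap_len:  # gap run at the very end of the alignment
        total -= gap_penalty(gap_len, p_open, p_ext)
    return total

def toy_score(a, b):
    """Hypothetical exchange scores, NOT a real matrix such as PAM250."""
    if a == b:
        return 5
    if {a, b} in ({"V", "L"}, {"L", "I"}):
        return 2
    return -1

# 5 + 5 + 5 + 2 - (10 + 2*1) + 2 + 5
print(score_alignment("TDWVTALK", "TDWL--IK", toy_score, 10, 1))  # 12.0
```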
46
Pairwise sequence alignment: Global dynamic
programming
MDAGSTVILCFVG
Evolution
M D A A S T I L C G S
Amino Acid Exchange Matrix
Search matrix
Gap penalties (open,extension)
MDAGSTVILCFVG-
MDAAST-ILC--GS
47
Global dynamic programming
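$$S_{i,j} = s_{i,j} + \max \begin{cases} \max_{0 < x < i-1} \left[ S_{x,j-1} - P_i - (i-x-1) P_x \right] \\ S_{i-1,j-1} \\ \max_{0 < y < j-1} \left[ S_{i-1,y} - P_i - (j-y-1) P_x \right] \end{cases}$$
with $s_{i,j}$ the exchange score for residues i and j, and $P_i$ and $P_x$ the gap initialisation and extension penalties (cells taken from column j-1 and row i-1 of the search matrix)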
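
A minimal sketch of global dynamic programming in Python follows. It deliberately uses a simplified textbook variant, not the exact recurrence of this slide: a single linear gap penalty replaces the separate initialisation and extension penalties, and a match/mismatch score replaces the amino acid exchange matrix. The function name and the default parameter values are assumptions made for the example.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment with a linear gap penalty (simplified sketch).
    Returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = i * gap          # leading gap in b
    for j in range(1, m + 1):
        S[0][j] = j * gap          # leading gap in a
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + sub,   # align a[i-1] with b[j-1]
                          S[i - 1][j] + gap,       # gap in b
                          S[i][j - 1] + gap)       # gap in a
    # Trace back from the bottom-right corner to recover one optimal alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + sub:
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and S[i][j] == S[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return S[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))

score, top, bottom = needleman_wunsch("MDAGSTVILCFVG", "MDAASTILCGS")
print(score)
print(top)
print(bottom)
```

With these toy scores the alignment found need not coincide with the one drawn on the slide, which depends on the exchange matrix and gap penalties actually used.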
48
Global dynamic programming
49
Global dynamic programming
50
Up to here: 17/02/03
51
Local dynamic programming (Smith & Waterman,
1981)
LCFVMLAGSTVIVGTR
E D A S T I L C G S
Negative numbers
Amino Acid Exchange Matrix
Search matrix
Gap penalties (open, extension)
AGSTVIVG
A-STILCG
52
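Local dynamic programming (Smith & Waterman,
1981)
$$S_{i,j} = \max \begin{cases} s_{i,j} + \max_{0 < x < i-1} \left[ S_{x,j-1} - P_i - (i-x-1) P_x \right] \\ s_{i,j} + S_{i-1,j-1} \\ s_{i,j} + \max_{0 < y < j-1} \left[ S_{i-1,y} - P_i - (j-y-1) P_x \right] \\ 0 \end{cases}$$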
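
The effect of the extra 0 in the recurrence can be illustrated with a short Python sketch. As an assumption made for brevity, it uses a linear gap penalty and a match/mismatch score instead of the affine penalties and exchange matrix of the slide; the function name and defaults are hypothetical. Cells never drop below zero, the best cell anywhere in the matrix gives the local score, and the traceback stops at the first zero.

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap=-1):
    """Local alignment with a linear gap penalty (simplified sketch).
    Returns (best_score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(0,                       # start a new local alignment
                          S[i - 1][j - 1] + sub,
                          S[i - 1][j] + gap,
                          S[i][j - 1] + gap)
            if S[i][j] > best:
                best, best_pos = S[i][j], (i, j)
    # Trace back from the best cell until a zero cell is reached.
    row_a, row_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and S[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if S[i][j] == S[i - 1][j - 1] + sub:
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif S[i][j] == S[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(row_a)), "".join(reversed(row_b))

print(smith_waterman("XXACGTXX", "YYACGTYY"))  # (4, 'ACGT', 'ACGT')
```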
53
Local dynamic programming
54
Sequence database searching: Homology searching

DP too slow for repeated database searches
FASTA
BLAST and PSI-BLAST
QUEST
HMMER
SAM-T98

Fast heuristics
Hidden Markov modelling
55
FASTA

Compares a given query sequence with a library of
sequences and calculates for each pair the
highest scoring local alignment
Speed is obtained by delaying application of the
dynamic programming technique to the moment where
the most similar segments are already identified
by faster and less sensitive techniques
FASTA routine operates in four steps

56
FASTA

Operates in four steps
(1) Rapid searches for identical words of a user-specified length occurring in query and database sequence(s) (Wilbur and Lipman, 1983, 1984). For each target sequence the 10 regions with the highest density of ungapped common words are determined.
(2) These 10 regions are rescored using the Dayhoff PAM-250 residue exchange matrix (Dayhoff et al., 1983) and the best scoring region of the 10 is reported under init1 in the FASTA output.
(3) Regions scoring higher than a threshold value and being sufficiently near each other in the sequence are joined, now allowing gaps. The highest score of these new fragments can be found under initn in the FASTA output.
(4) Full dynamic programming alignment (Chao et al., 1992) over the final region, which is widened by 32 residues at either side; its score is written under opt in the FASTA output.

57
FASTA output example
DE METAL RESISTANCE PROTEIN YCF1 (YEAST CADMIUM FACTOR 1). . . .
SCORES Init1: 161 Initn: 161 Opt: 162 z-score: 229.5 E(): 3.4e-06
Smith-Waterman score: 162; 35.1% identity in 57 aa overlap

                                              10        20        30
test.seq                              MQRSPLEKASVVSKLFFSWTRPILRKGYRQRLE
YCFI_YEAST CASILLLEALPKKPLMPHQHIHQTLTRRKPNPYDSANIFSRITFSWMSGLMKTGYEKYLV
           180 190 200 210 220 230

                40        50        60
test.seq   LSDIYQIPSVDSADNLSEKLEREWDRE
YCFI_YEAST EADLYKLPRNFSSEELSQKLEKNWENELKQKSNPSLSWAICRTFGSKMLLAAFFKAIHDV
           240 250 260 270 280 290

58
FASTA

(1) Rapid identical word searches
Searching for k-tuples of a certain size within a
specified bandwidth along search matrix
diagonals.
For not-too-distant sequences (> 35% residue
identity), little sensitivity is lost while speed
is greatly increased.
The technique employed is known as hash coding or
hashing: a lookup table is constructed for all
words in the query sequence, which is then used
to compare all encountered words in each database
sequence.
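
The hashing step can be sketched as follows in Python. This is a minimal illustration, not the FASTA implementation: the function names are hypothetical, and only the first idea of the method is shown (a lookup table of query k-tuples, and a count of identical k-tuples per search-matrix diagonal, where a diagonal is identified by the offset query position minus target position).

```python
from collections import Counter, defaultdict

def build_word_table(query, k):
    """Lookup table: k-tuple -> list of start positions in the query."""
    table = defaultdict(list)
    for i in range(len(query) - k + 1):
        table[query[i:i + k]].append(i)
    return table

def diagonal_hits(query, target, k):
    """Count identical k-tuples per diagonal (offset = query pos - target pos)."""
    table = build_word_table(query, k)
    hits = Counter()
    for j in range(len(target) - k + 1):
        for i in table.get(target[j:j + k], ()):
            hits[i - j] += 1
    return hits

# Four consecutive shared dipeptides (AC, CD, DE, EF) fall on one diagonal.
print(dict(diagonal_hits("ACDEFGH", "XXACDEFYY", k=2)))  # {-2: 4}
```

Diagonals with many hits correspond to the dense ungapped regions that are rescored in the later steps.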

59
FASTA

The k-tuple length is user-defined and is usually
1 or 2 for protein sequences (i.e. either the
positions of each of the individual 20 amino
acids or the positions of each of the 400
possible dipeptides are located).
For nucleic acid sequences, the k-tuple is 5-20,
and should be longer because short k-tuples are
much more common due to the 4 letter alphabet of
nucleic acids. The larger the k-tuple chosen, the
more rapid but less thorough, a database search.

60
BLAST

blastp compares an amino acid query sequence
against a protein sequence database
blastn compares a nucleotide query sequence
against a nucleotide sequence database
blastx compares the six-frame conceptual protein
translation products of a nucleotide query
sequence against a protein sequence database
tblastn compares a protein query sequence against
a nucleotide sequence database translated in six
reading frames
tblastx compares the six-frame translations of a
nucleotide query sequence against the six-frame
translations of a nucleotide sequence database.

61
BLAST

Generates all tripeptides from a query sequence
and, for each of those, derives a table of
similar tripeptides; their number is only a fraction
of the total number possible.
Quickly scans a database of protein sequences for
ungapped regions showing high similarity, which
are called high-scoring segment pairs (HSP),
using the tables of similar peptides. The initial
search is done for a word of length W that scores
at least the threshold value T when compared to
the query using a substitution matrix.
Word hits are then extended in either direction
in an attempt to generate an alignment with a
score exceeding the threshold of S, and as far as
the cumulative alignment score can be increased.

62
BLAST

Extension of the word hits in each direction is
halted
- when the cumulative alignment score falls off by the quantity X from its maximum achieved value
- when the cumulative score goes to zero or below due to the accumulation of one or more negative-scoring residue alignments
- upon reaching the end of either sequence
The T parameter is the most important for the
speed and sensitivity of the search resulting in
the high-scoring segment pairs
A Maximal-scoring Segment Pair (MSP) is defined
as the highest scoring of all possible segment
pairs produced from two sequences.
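
The halting rules for hit extension can be sketched in Python as below. This is an illustration under stated assumptions: a toy match/mismatch score stands in for a real substitution matrix, and `score_pair`, `extend`, `extend_hit` and the parameter values are hypothetical names chosen for the example, not BLAST's own code.

```python
def score_pair(a, b, match=2, mismatch=-3):
    """Toy residue score (assumption; BLAST uses a substitution matrix)."""
    return match if a == b else mismatch

def extend(query, target, q, t, step, start_score, x_drop):
    """Ungapped extension in one direction (step = +1 or -1).
    Returns (best cumulative score, number of residues kept)."""
    best = current = start_score
    best_len = length = 0
    while 0 <= q < len(query) and 0 <= t < len(target):  # end of either sequence
        current += score_pair(query[q], target[t])
        length += 1
        if current > best:
            best, best_len = current, length
        if current <= 0 or best - current >= x_drop:      # zero/below, or X drop-off
            break
        q += step
        t += step
    return best, best_len

def extend_hit(query, target, q_pos, t_pos, w, x_drop=5):
    """Extend a word hit of length w in both directions into a segment pair."""
    seed = sum(score_pair(query[q_pos + k], target[t_pos + k]) for k in range(w))
    right_best, right_len = extend(query, target, q_pos + w, t_pos + w, +1, seed, x_drop)
    left_best, left_len = extend(query, target, q_pos - 1, t_pos - 1, -1, seed, x_drop)
    total = right_best + left_best - seed  # the seed score is counted once
    return (total,
            query[q_pos - left_len:q_pos + w + right_len],
            target[t_pos - left_len:t_pos + w + right_len])

print(extend_hit("AAAWWWCCCC", "GGGWWWCCTT", 3, 3, 3))  # (10, 'WWWCC', 'WWWCC')
```

In the example the rightward extension stops through the X drop-off and the leftward extension stops because the cumulative score reaches zero.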

63
PSI-BLAST

Query sequences are first scanned for the
presence of so-called low-complexity regions
(Wootton and Federhen, 1996), i.e. regions with a
biased composition likely to lead to spurious
hits; these are excluded from alignment.
The program then initially operates on a single
query sequence by performing a gapped BLAST
search
Then, the program takes significant local
alignments found, constructs a multiple alignment
and abstracts a position specific scoring matrix
(PSSM) from this alignment.
The database is rescanned in a subsequent round to
find more homologous sequences. Iteration continues
until the user decides to stop or the search has
converged.
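
How a position specific scoring matrix can be abstracted from a multiple alignment is sketched below in Python. This is a strongly simplified illustration, not PSI-BLAST's procedure: it uses plain column counts with a uniform pseudocount and log-odds against a background distribution, and omits sequence weighting. The function names, the five-letter toy alphabet and the background frequencies are assumptions made for the example.

```python
import math
from collections import Counter

def build_pssm(aligned_rows, background, pseudocount=1.0):
    """One dict per alignment column: residue -> log2(observed / background)."""
    alphabet = sorted(background)
    pssm = []
    for column in zip(*aligned_rows):
        counts = Counter(c for c in column if c != "-")   # ignore gaps
        total = sum(counts.values()) + pseudocount * len(alphabet)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background[aa])
            for aa in alphabet
        })
    return pssm

def score_segment(pssm, segment):
    """Score an ungapped segment of the same length as the profile."""
    return sum(column[aa] for column, aa in zip(pssm, segment))

# Toy alphabet with uniform background frequencies (hypothetical values).
background = {aa: 0.2 for aa in "ACDEG"}
pssm = build_pssm(["ACD", "ACE", "AGD"], background)
print(round(pssm[0]["A"], 3))  # conserved A in column 0 scores positive
print(score_segment(pssm, "ACD") > score_segment(pssm, "GEE"))  # True
```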

64
PSI-BLAST iteration
[Figure: PSI-BLAST iteration. The query sequence Q is used in a gapped BLAST search; the database hits are turned into a PSSM (columns A, C, D, ..., Y; gap penalties Pi, Px); the PSSM is used for the next gapped BLAST search, which yields new database hits and an updated PSSM.]
65
PSI-BLAST output example
66
Multiple alignment profiles (Gribskov et al., 1987)
Profile row for alignment position i:
A    C    D    ...  W    Y
0.3  0.1  0    ...  0.3  0.3
Gap penalties: 0.5, 1.0
Position dependent gap penalties
67
Normalised sequence similarity
The p-value is defined as the probability of
seeing at least one unrelated score S greater
than or equal to a given score x in a database
search over n sequences. This probability
follows the Poisson distribution (Waterman and
Vingron, 1994):
$$P(x, n) = 1 - e^{-n \cdot P(S \ge x)}$$
where n is the number of sequences in the database.
Depending on x and n (fixed)
68
Normalised sequence similarity: Statistical
significance
The E-value is defined as the expected number of
non-homologous sequences with score greater than
or equal to a score x in a database of n
sequences:
$$E(x, n) = n \cdot P(S \ge x)$$
If the E-value is 0.01, then the expected number of
random hits with score $S \ge x$ is 0.01, which means
that such a hit is expected by chance only once in
100 independent searches over the database.
If the E-value of a hit is 5, then five fortuitous
hits with $S \ge x$ are expected within a single
database search, which renders the hit not significant.
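
The relation between the two statistics can be checked numerically with a small Python sketch. It assumes the definitions E(x, n) = n·P(S ≥ x) and P(x, n) = 1 − exp(−n·P(S ≥ x)) from these slides; the function names, the database size and the per-sequence probability are hypothetical values chosen for illustration.

```python
import math

def e_value(p_single, n):
    """Expected number of chance hits with S >= x in a database of n sequences."""
    return n * p_single

def p_value(p_single, n):
    """Probability of at least one chance hit with S >= x (Poisson approximation)."""
    return 1.0 - math.exp(-n * p_single)

n = 100_000          # hypothetical database size
p_single = 1e-7      # hypothetical P(S >= x) for one unrelated sequence
print(f"E = {e_value(p_single, n):.5f}")   # E = 0.01000
print(f"P = {p_value(p_single, n):.5f}")   # P = 0.00995

# For an E-value of 5 a chance hit is almost certain.
print(f"P = {p_value(5 / n, n):.5f}")      # P = 0.99326
```

For small E-values the two numbers are nearly identical, which is why they are often used interchangeably for significant hits.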
69
Normalised sequence similarity: Statistical
significance

Database searching is commonly performed using an
E-value threshold between 0.001 and 0.1.
Low E-value thresholds decrease the number of false
positives in a database search, but increase the
number of false negatives, thereby lowering the
sensitivity of the search.

70
HMM-based homology searching

Most widely used HMM-based profile searching
tools currently are SAM-T98 (Karplus et al.,
1998) and HMMER2 (Eddy, 1998)
formal probabilistic basis and consistent theory
behind gap and insertion scores
HMMs good for profile searches, bad for alignment
HMMs are slow

71
The HMM algorithms

Questions
What is the most likely die (predicted) sequence?
Viterbi
What is the probability of the observed sequence?
Forward
What is the probability that the 3rd state is B,
given the observed sequence? Backward
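
The first two questions can be made concrete with a small Python sketch of the Viterbi and Forward algorithms for a two-state die model (fair die F, loaded die L). All probabilities below are hypothetical values chosen for illustration, and the function names are not taken from any library. The sketch multiplies raw probabilities, which is fine for short sequences; longer ones would need log-space or scaling.

```python
STATES = ("F", "L")                       # fair die, loaded die
START = {"F": 0.5, "L": 0.5}              # hypothetical start probabilities
TRANS = {"F": {"F": 0.95, "L": 0.05},     # hypothetical transition probabilities
         "L": {"F": 0.10, "L": 0.90}}
EMIT = {"F": {r: 1 / 6 for r in "123456"},
        "L": {**{r: 0.1 for r in "12345"}, "6": 0.5}}

def viterbi(obs):
    """Most likely state (die) sequence for the observed rolls."""
    v = [{s: START[s] * EMIT[s][obs[0]] for s in STATES}]
    back = []
    for x in obs[1:]:
        column, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: v[-1][p] * TRANS[p][s])
            column[s] = v[-1][prev] * TRANS[prev][s] * EMIT[s][x]
            pointers[s] = prev
        v.append(column)
        back.append(pointers)
    path = [max(STATES, key=lambda s: v[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return "".join(reversed(path))

def forward(obs):
    """Probability of the observed rolls, summed over all state paths."""
    f = {s: START[s] * EMIT[s][obs[0]] for s in STATES}
    for x in obs[1:]:
        f = {s: EMIT[s][x] * sum(f[p] * TRANS[p][s] for p in STATES)
             for s in STATES}
    return sum(f.values())

print(viterbi("666666"))   # LLLLLL
print(viterbi("1234"))     # FFFF
print(forward("6"))        # 0.5/6 + 0.5*0.5, i.e. about 0.3333
```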

72
HMM-based homology searching
Transition probabilities and emission
probabilities. Gapped HMMs also have insertion
and deletion states.
73
Profile HMM: m = match state, I = insert state,
d = delete state; go from left to right. I and m
states output amino acids; d states are silent.
74
Homology-derived Secondary Structure of Proteins
(HSSP) (Sander & Schneider, 1991)
75
Up to here: 17/02/03
76
Bio-Data Analysis and Data Mining

Existing/emerging bio-data analysis and mining
tools for
DNA sequence assembly
Genetic map construction
Sequence comparison and database searching
Gene finding
.
Gene expression data analysis
Phylogenetic tree analysis to infer
horizontally-transferred genes
Mass spec. data analysis for protein complex
characterization
Current mode of work

Often enough developing ad hoc tools for each
individual application
77
Bio-Data Analysis and Data Mining

As the amount and types of data and their cross
connections increase rapidly
the number of analysis tools needed will go up
exponentially
blast, blastp, blastx, blastn, from BLAST
family of tools
gene finding tools for human, mouse, fly, rice,
cyanobacteria, ..
tools for finding various signals in genomic
sequences, protein-binding sites, splice junction
sites, translation start sites, ..

78
Bio-Data Analysis and Data Mining
Many of these data analysis problems are
fundamentally the same problem(s) and can be
solved using the same set of tools e.g.
clustering or optimal segmentation by Dynamic
Programming
Developing ad hoc tools for each application (by
each group of individual researchers) may soon
become inadequate as bio-data production
capabilities further ramp up
79
Bio-data Analysis, Data Mining and Integrative
Bioinformatics
To have analysis capabilities covering wide
range of problems, we need to discover the common
fundamental structures of these problems HOWEVER
in biology one size does NOT fit all
Goal is development of a data analysis
infrastructure in support of Genomics and beyond
80
Algorithms in bioinformatics

string algorithms
dynamic programming
machine learning (NN, k-NN, SVM, GA, ..)
Markov chain models
hidden Markov models
Markov Chain Monte Carlo (MCMC) algorithms
stochastic context free grammars
EM algorithms
Gibbs sampling
clustering
tree algorithms
text analysis
hybrid/combinatorial techniques and more

81
Sequence analysis and homology searching
82
Finding genes and regulatory elements
83
Expression data
84
Functional genomics
Monte Carlo
85
Protein translation
86
Example of algorithm reuse: Data clustering

Many biological data analysis problems can be
formulated as clustering problems
microarray gene expression data analysis
identification of regulatory binding sites
(similarly, splice junction sites, translation
start sites, ......)
(yeast) two-hybrid data analysis (for inference
of protein complexes)
phylogenetic tree clustering (for inference of
horizontally transferred genes)
protein domain identification
identification of structural motifs
prediction reliability assessment of protein
structures
NMR peak assignments
......

87
Data Clustering Problems

Clustering: partition a data set into clusters so
that data points of the same cluster are
similar and points of different clusters are
dissimilar
Cluster identification: identifying clusters
with significantly different features than the
background

88
Application Examples

Regulatory binding site identification CRP (CAP)
binding site
Two hybrid data analysis

Gene expression data analysis

Are all solvable by the same algorithm!
89
Other Application Examples

Phylogenetic tree clustering analysis
Protein sidechain packing prediction
Assessment of prediction reliability of protein
structures
Protein secondary structures
Protein domain prediction
NMR peak assignments

90
Integrative bioinformatics @ VU

Studying informational processes at biological
system level
From gene sequence to intercellular processes
Computers necessary
We have biology, statistics, computational
intelligence (AI), HTC, ..
VUMC microarray facility
Enabling technology new glue to integrate
New integrative algorithms
Goals understanding cells in terms of genomes,
fighting disease (VUMC)

91
Bioinformatics @ VU

Progression
DNA gene prediction, predicting regulatory
elements
mRNA expression
Proteins docking, domain prediction
Metabolic pathways metabolic control
Cell-cell communication

93
Protein structure and function can be complex
Pyruvate kinase: Phosphotransferase
β barrel regulatory domain; α/β barrel catalytic
substrate binding domain; α/β nucleotide binding
domain
1 continuous + 2 discontinuous domains
94
Bioinformatics @ VU

Qualitative challenges
High quality alignments (alternative splicing)
In-silico structural genomics
In-silico functional genomics reliable
annotation
Protein-protein interactions.
Metabolic pathways assign the edges in the
networks
Cell-cell communication find membrane associated
components
New algorithms

95
Bioinformatics @ VU

Quantitative challenges
Understanding mRNA expression levels
Understanding resulting protein activity
Time dependencies
Spatial constraints, compartmentalisation
Are classical differential equation models
adequate or do we need more individual modeling
(e.g. macromolecular crowding and activity at
oligomolecular level)?
Metabolic pathways calculate fluxes through time
Cell-cell communication tissues, hormones,
innervations

Need complete experimental data for good
biological model system to learn to integrate
96
Bioinformatics @ VU

VUMC
Neuropeptide addiction
Oncogenes disease patterns
Rheumatic disease
CNCR
From synapses to higher order behaviour
Addiction
FPP
Genetic psychology twin data bank

97
Integrative Genomics
98
Recurrent theme Integration from molecule to
health
Leiden-VU-TNO (Centre for Medical Systems Biology)
CRCS
VUMC
99
genome
transcriptome
proteome
metabolome
physiome
100
Integrative bioinformatics

Calculate from sequence to molecular behaviour
Calculate from molecular behaviour and
interactions to cells
Calculate from cellular interactions to tissues
Calculate from tissue to organism
Calculate from organisms to ecosystem and society
Do this in conjunction with data analysis at all
levels
AND CALCULATE BACK (induction)

101
Bioinformatics @ VU

Quantitative challenges
How much protein produced from single gene?
What time dependencies?
What spatial constraints (compartmentalisation)?
Metabolic pathways assign the edges in the
networks
Cell-cell communication find membrane associated
components

102
Integrative bioinformatics

Integrate data sources
Integrate methods
Integrate data through method integration
(biological model)

103
Bioinformatics tool
Algorithm
Data
tool
Biological Interpretation (model)
104
Bioinformatics

Nothing in Biology makes sense except in the
light of evolution (Theodosius Dobzhansky
(1900-1975))
Nothing in Bioinformatics makes sense except in
the light of Biology

105
Pair-wise sequence alignment (more than just
string matching)
Global dynamic programming
MDAGSTVILCFVG
Evolution
M D A A S T I L C G S
Amino Acid Exchange Matrix
Search matrix
Gap penalties (open,extension)
MDAGSTVILCFVG-
MDAAST-ILC--GS
106
Pair-wise alignment search explosions
T D W V T A L K
T D W L - - I K
Combinatorial explosion
- 1 gap in 1 sequence: n+1 possibilities
- 2 gaps in 1 sequence: (n+1)n
- 3 gaps in 1 sequence: (n+1)n(n-1), etc.
$$\binom{2n}{n} = \frac{(2n)!}{(n!)^2} \approx \frac{2^{2n}}{\sqrt{\pi n}}$$
2 sequences of 300 a.a.: ~$10^{179}$ alignments
2 sequences of 1000 a.a.: ~$10^{600}$ alignments!
107
Global dynamic programming
108
This talk: from our own kitchen

Three integrative methods to predict protein
structural aspects
Iterative multiple alignment protein secondary
structure (Praline)
Intermezzo 2½-D structure prediction of
flavodoxin fold by hand
Protein domain delineation based on consistency
of multiple ab initio model tertiary structures
(SnapDRAGON)
Protein domain delineation based on combining
homology searching with domain prediction
(Domaination)

109
Comparing sequences - Similarity Score -

Many properties can be used
Nucleotide or amino acid composition
Isoelectric point
Molecular weight
Morphological characters

110
Multivariate statistics: Cluster analysis
[Diagram: raw table (objects 1-5 × characters C1, C2, C3, C4, C5, C6, ..) → similarity criterion → similarity matrix (5×5 scores) → cluster criterion → phylogenetic tree]
111
Human Evolution
112
Comparing sequences - Similarity Score -

Many properties can be used
Nucleotide or amino acid composition
Isoelectric point
Molecular weight
Morphological characters
But molecular evolution through sequence
alignment

113
Multivariate statistics: Cluster analysis
[Diagram: multiple alignment (sequences 1-5) → similarity criterion → similarity matrix (5×5 scores) → phylogenetic tree]
114
Lactate dehydrogenase multiple alignment
Distance Matrix
                     1     2     3     4     5     6     7     8     9    10    11    12    13
 1 Human         0.000 0.112 0.128 0.202 0.378 0.346 0.530 0.551 0.512 0.524 0.528 0.635 0.637
 2 Chicken       0.112 0.000 0.155 0.214 0.382 0.348 0.538 0.569 0.516 0.524 0.524 0.631 0.651
 3 Dogfish       0.128 0.155 0.000 0.196 0.389 0.337 0.522 0.567 0.516 0.512 0.524 0.600 0.655
 4 Lamprey       0.202 0.214 0.196 0.000 0.426 0.356 0.553 0.589 0.544 0.503 0.544 0.616 0.669
 5 Barley        0.378 0.382 0.389 0.426 0.000 0.171 0.536 0.565 0.526 0.547 0.516 0.629 0.575
 6 Maizey        0.346 0.348 0.337 0.356 0.171 0.000 0.557 0.563 0.538 0.555 0.518 0.643 0.587
 7 Lacto_casei   0.530 0.538 0.522 0.553 0.536 0.557 0.000 0.518 0.208 0.445 0.561 0.526 0.501
 8 Bacillus_stea 0.551 0.569 0.567 0.589 0.565 0.563 0.518 0.000 0.477 0.536 0.536 0.598 0.495
 9 Lacto_plant   0.512 0.516 0.516 0.544 0.526 0.538 0.208 0.477 0.000 0.433 0.489 0.563 0.485
10 Therma_mari   0.524 0.524 0.512 0.503 0.547 0.555 0.445 0.536 0.433 0.000 0.532 0.405 0.598
11 Bifido        0.528 0.524 0.524 0.544 0.516 0.518 0.561 0.536 0.489 0.532 0.000 0.604 0.614
12 Thermus_aqua  0.635 0.631 0.600 0.616 0.629 0.643 0.526 0.598 0.563 0.405 0.604 0.000 0.641
13 Mycoplasma    0.637 0.651 0.655 0.669 0.575 0.587 0.501 0.495 0.485 0.598 0.614 0.641 0.000
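
How such a distance matrix leads to a tree can be sketched with average-linkage clustering (UPGMA) in Python. Note the swap: this is a simpler stand-in for the neighbour-joining (NJ) method named elsewhere in this presentation, chosen because it fits in a few lines. The function name is hypothetical, and only the first six taxa of the matrix are used.

```python
from itertools import combinations

NAMES = ["Human", "Chicken", "Dogfish", "Lamprey", "Barley", "Maizey"]
ROWS = [  # upper-left 6x6 block of the lactate dehydrogenase distance matrix
    [0.000, 0.112, 0.128, 0.202, 0.378, 0.346],
    [0.112, 0.000, 0.155, 0.214, 0.382, 0.348],
    [0.128, 0.155, 0.000, 0.196, 0.389, 0.337],
    [0.202, 0.214, 0.196, 0.000, 0.426, 0.356],
    [0.378, 0.382, 0.389, 0.426, 0.000, 0.171],
    [0.346, 0.348, 0.337, 0.356, 0.171, 0.000],
]
DIST = {frozenset((NAMES[i], NAMES[j])): ROWS[i][j]
        for i in range(len(NAMES)) for j in range(i + 1, len(NAMES))}

def upgma_merge_order(names, dist):
    """Repeatedly merge the two clusters with the smallest average distance.
    Returns the list of merges as (cluster_1, cluster_2, average_distance)."""
    def average(c1, c2):
        return sum(dist[frozenset((a, b))] for a in c1 for b in c2) / (len(c1) * len(c2))

    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: average(*pair))
        merges.append((c1, c2, average(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges

for c1, c2, d in upgma_merge_order(NAMES, DIST):
    print(f"{d:.3f}  {'+'.join(c1)}  |  {'+'.join(c2)}")
```

On these six taxa the vertebrates and the two plants end up in separate clusters that are joined last.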
116
Multiple sequence alignment: Why?

It is the most important means to assess
relatedness of a set of sequences
Gain information about the structure/function of
a query sequence (conservation patterns)
Construct a phylogenetic tree
Putting together a set of sequenced fragments
(Fragment assembly)
Comparing a segment sequenced by two different
labs
Many bioinformatics methods depend on it (e.g.
secondary/tertiary structure prediction)

117
Flavodoxin fold: aligning 13 flavodoxins + cheY
5(βα) fold
118
Flavodoxin-cheY multiple alignment: Praline with
pre-processing

1fx1        -PKALIVYGSTTGNT-EYTAETIARQLANAG-YEVDSRDAASVEAGGLFEGFDLVLLGCSTWGDDSI------ELQDDFIPLF-DSLEETGAQGRKVACF
FLAV_DESDE  MSKVLIVFGSSTGNT-ESIaQKLEELIAAGG-HEVTLLNAADASAENLADGYDAVLFgCSAWGMEDL------EMQDDFLSLF-EEFNRFGLAGRKVAAf
FLAV_DESVH  MPKALIVYGSTTGNT-EYTaETIARELADAG-YEVDSRDAASVEAGGLFEGFDLVLLgCSTWGDDSI------ELQDDFIPLF-DSLEETGAQGRKVACf
FLAV_DESSA  MSKSLIVYGSTTGNT-ETAaEYVAEAFENKE-IDVELKNVTDVSVADLGNGYDIVLFgCSTWGEEEI------ELQDDFIPLY-DSLENADLKGKKVSVf
FLAV_DESGI  MPKALIVYGSTTGNT-EGVaEAIAKTLNSEG-METTVVNVADVTAPGLAEGYDVVLLgCSTWGDDEI------ELQEDFVPLY-EDLDRAGLKDKKVGVf
2fcr        --KIGIFFSTSTGNT-TEVADFIGKTLGA---KADAPIDVDDVTDPQALKDYDLLFLGAPTWNTG----ADTERSGTSWDEFLYDKLPEVDMKDLPVAIF
FLAV_AZOVI  -AKIGLFFGSNTGKT-RKVaKSIKKRFDDET-MSDA-LNVNRVS-AEDFAQYQFLILgTPTLGEGELPGLSSDCENESWEEFL-PKIEGLDFSGKTVALf
FLAV_ENTAG  MATIGIFFGSDTGQT-RKVaKLIHQKLDG---IADAPLDVRRAT-REQFLSYPVLLLgTPTLGDGELPGVEAGSQYDSWQEFT-NTLSEADLTGKTVALf
FLAV_ANASP  SKKIGLFYGTQTGKT-ESVaEIIRDEFGN---DVVTLHDVSQAE-VTDLNDYQYLIIgCPTWNIGEL--------QSDWEGLY-SELDDVDFNGKLVAYf
FLAV_ECOLI  -AITGIFFGSDTGNT-ENIaKMIQKQLGK---DVADVHDIAKSS-KEDLEAYDILLLgIPTWYYGE--------AQCDWDDFF-PTLEEIDFNGKLVALf
4fxn        -MK--IVYWSGTGNT-EKMAELIAKGIIESG-KDVNTINVSDVNIDELL-NEDILILGCSAMGDEVL-------EESEFEPFI-EEIS-TKISGKKVALF
FLAV_MEGEL  MVE--IVYWSGTGNT-EAMaNEIEAAVKAAG-ADVESVRFEDTNVDDVA-SKDVILLgCPAMGSEEL-------EDSVVEPFF-TDLA-PKLKGKKVGLf
FLAV_CLOAB  -MKISILYSSKTGKT-ERVaKLIEEGVKRSGNIEVKTMNLDAVD-KKFLQESEGIIFgTPTYYAN---------ISWEMKKWI-DESSEFNLEGKLGAAf
3chy        ADKELKFLVVDDFSTMRRIVRNLLKELGFN--NVEEAEDGVDALNKLQAGGYGFVI---SDWNMPNM----------DGLELL-KTIRADGAMSALPVLM
T

1fx1        GCGDS-SY-EYFCGA-VDAIEEKLKNLGAEIVQD---------------------GLRIDGD--PRAARDDIVGWAHDVRGAI--------
FLAV_DESDE  ASGDQ-EY-EHFCGA-VPAIEERAKELgATIIAE---------------------GLKMEGD--ASNDPEAVASfAEDVLKQL--------
FLAV_DESVH  GCGDS-SY-EYFCGA-VDAIEEKLKNLgAEIVQD---------------------GLRIDGD--PRAARDDIVGwAHDVRGAI--------

119
Flavodoxin-cheY NJ tree
120
Integrating secondary structure prediction in
multiple alignment (Victor Simossis)

Praline multiple alignment method
(Heringa, Comp. Chem. 23, 341-364, 1999; Comp.
Chem. 26, 459-477, 2002;
Kleinjung, Douglas & Heringa, Bioinformatics, in
press, 2002)
Combining sequence data and secondary structure
prediction (Heringa, Curr. Prot. Pept. Sci. 1
(3), 273-301, 2000)
Secondary structure methods: PHD, Predator,
PSIPred, Jpred, SSPRED, ...

121
Using secondary structure in multiple alignment

Structure more conserved than sequence

122
Protein structure hierarchical levels
123
Protein structure hierarchical levels
124
Secondary structure-induced alignment
125
Using secondary structure in multiple alignment
Dynamic programming search matrix
Amino acid exchange weights matrices
MDAGSTVILCFV
HHHCCCEEEEEE
MDAASTILCGS
HHHHCCEEECC
Exchange matrices (figure labels): H-H, C-C, E-E, Default
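A minimal sketch of how one cell of the search matrix might be filled, assuming a structure-specific exchange matrix is used when both residues share the same predicted state and a default matrix otherwise. The toy weights and the names `toy_matrix`, `exchange_weight` and `search_matrix` are hypothetical placeholders, not the weights or code of the method on the slide.

```python
# Toy exchange weights: a match/mismatch pair stands in for a full 20x20 matrix.
def toy_matrix(match, mismatch):
    return lambda a, b: match if a == b else mismatch

# Placeholder values, chosen only so the four matrices are distinguishable.
MATRICES = {
    "H": toy_matrix(4, -1),        # both residues predicted helix
    "E": toy_matrix(6, -2),        # both residues predicted strand
    "C": toy_matrix(3, 0),         # both residues predicted coil
    "default": toy_matrix(5, -1),  # predicted states differ
}

def exchange_weight(res_a, ss_a, res_b, ss_b):
    """Pick the matrix from the two predicted states, then score the residue pair."""
    key = ss_a if ss_a == ss_b and ss_a in MATRICES else "default"
    return MATRICES[key](res_a, res_b)

def search_matrix(seq_a, ss_a, seq_b, ss_b):
    """Fill the dynamic programming search matrix (rows: seq_a, columns: seq_b)."""
    return [[exchange_weight(a, sa, b, sb) for b, sb in zip(seq_b, ss_b)]
            for a, sa in zip(seq_a, ss_a)]

# The two sequences and predicted states shown on the slide.
m = search_matrix("MDAGSTVILCFV", "HHHCCCEEEEEE",
                  "MDAASTILCGS", "HHHHCCEEECC")
```

With these toy weights, `m[0][0]` (M/H against M/H) comes from the helix matrix, while `m[3][3]` (G/C against A/H) falls back to the default matrix.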
126
Flavodoxin-cheY predicted secondary structure (PREDATOR)

1fx1
-PK-ALIVYGSTTGNTEYTAETIARQLANAG-YEVDSRDAASVEAGGLFEGFDLVLLGCSTWGDDSI------ELQDDFIPLFDS-LEETGAQGRKVACF
e eeee b ssshhhhhhhhhhhhhhttt eeeee stt tttttt seeee b ee sss ee ttthhhhtt ttss tt eeeee

FLAV_DESVH
MPK-ALIVYGSTTGNTEYTaETIARELADAG-YEVDSRDAASVEAGGLFEGFDLVLLgCSTWGDDSI------ELQDDFIPLFDS-LEETGAQGRKVACf
e eeeeee hhhhhhhhhhhhhhh eeeeee eeeeee hhhhhh eeeee

FLAV_DESGI
MPK-ALIVYGSTTGNTEGVaEAIAKTLNSEG-METTVVNVADVTAPGLAEGYDVVLLgCSTWGDDEI------ELQEDFVPLYED-LDRAGLKDKKVGVf
e eeeeee hhhhhhhhhhhhhh eeeeee hhhhhh eeeeeee hhhhhh eeeeee

FLAV_DESSA
MSK-SLIVYGSTTGNTETAaEYVAEAFENKE-IDVELKNVTDVSVADLGNGYDIVLFgCSTWGEEEI------ELQDDFIPLYDS-LENADLKGKKVSVf
eeeeee hhhhhhhhhhhhhh eeeee eeeee hhhhhhh h eeeee

FLAV_DESDE
MSK-VLIVFGSSTGNTESIaQKLEELIAAGG-HEVTLLNAADASAENLADGYDAVLFgCSAWGMEDL------EMQDDFLSLFEE-FNRFGLAGRKVAAf
eeee hhhhhhhhhhhhhh eeeee hhhhhhhhhhheeeee hhhhhhh hh eeeee

2fcr
--K-IGIFFSTSTGNTTEVADFIGKTLGAK---ADAPIDVDDVTDPQALKDYDLLFLGAPTWNTGAD----TERSGTSWDEFLYDKLPEVDMKDLPVAIF
eeeee ssshhhhhhhhhhhhhggg b eeggg s gggggg seeeeeee stt s s s sthhhhhhhtggg tt eeeee

FLAV_ANASP
SKK-IGLFYGTQTGKTESVaEIIRDEFGND--VVTL-HDVSQAE-VTDLNDYQYLIIgCPTWNIGEL--------QSDWEGLYSE-LDDVDFNGKLVAYf
eeeee hhhhhhhhhhhh eee hhh hhhhhhheeeeee hhhhhhhhh eeeeee

FLAV_ECOLI
-AI-TGIFFGSDTGNTENIaKMIQKQLGKD--VADV-HDIAKSS-KEDLEAYDILLLgIPTWYYGEA--------QCDWDDFFPT-LEEIDFNGKLVALf
eee hhhhhhhhhhhh eee hhh hhhhhhheeeee hhhhh eeeeee

FLAV_AZOVI
-AK-IGLFFGSNTGKTRKVaKSIKKRFDDET-MSDA-LNVNRVS-AEDFAQYQFLILgTPTLGEGELPGLSSDCENESWEEFLPK-IEGLDFSGKTVALf
eee hhhhhhhhhhhhh hhh hhhhhhheeeee hhhhhhhhh eeeeee

FLAV_ENTAG
MAT-IGIFFGSDTGQTRKVaKLIHQKLDG---IADAPLDVRRAT-REQFLSYPVLLLgTPTLGDGELPGVEAGSQYDSWQEFTNT-LSEADLTGKTVALf
eeee hhhhhhhhhhhh hhh hhhhhhheeeee hhhhh eeeee

4fxn
----MKIVYWSGTGNTEKMAELIAKGIIESG-KDVNTINVSDVNIDELLNE-DILILGCSAMGDEVL------E-ESEFEPFIEE-IST-KISGKKVALF
eeeee ssshhhhhhhhhhhhhhhtt eeeettt sttttt seeeeee btttb ttthhhhhhh hst t tt eeeee

FLAV_MEGEL
M---VEIVYWSGTGNTEAMaNEIEAAVKAAG-ADVESVRFEDTNVDDVASK-DVILLgCPAMGSEEL------E-DSVVEPFFTD-LAP-KLKGKKVGLf
hhhhhhhhhhhhhh eeeee hhhhhhhh eeeee eeeee

FLAV_CLOAB
M-K-ISILYSSKTGKTERVaKLIEEGVKRSGNIEVKTMNL-DAVDKKFLQESEGIIFgTPTY-YANI--------SWEMKKWIDE-SSEFNLEGKLGAAf
eee hhhhhhhhhhhhhh eeeeee hhhhhhhhhh eeee hhhhhhhhh eeeee

3chy
ADKELKFLVVDDFSTMRRIVRNLLKELGFNN-VEEAEDGV-DALNKLQAGGYGFVISD---WNMPNM----------DGLELLKTIRADGAMSALPVLMV
tt eeee s hhhhhhhhhhhhhht eeeesshh hhhhhhhh eeeee s sss hhhhhhhhhh ttttt eeee

1fx1
GCGDS-SY-EYFCGAVDAIEEKLKNLGAEIVQD---------------------GLRIDGD--PRAARDDIVGWAHDVRGAI--------
eee s ss sstthhhhhhhhhhhttt ee s eeees gggghhhhhhhhhhhhhh

FLAV_DESVH
GCGDS-SY-EYFCGAVDAIEEKLKNLgAEIVQD---------------------GLRIDGD--PRAARDDIVGwAHDVRGAI--------
eee hhhhhhhhhhhh eeeee eeeee hhhhhhhhhhhhhh

FLAV_DESGI
GCGDS-SY-TYFCGAVDVIEKKAEELgATLVAS---------------------SLKIDGE--P--DSAEVLDwAREVLARV--------
eee hhhhhhhhhhhh eeeee hhhhhhhhhhh

FLAV_DESSA
GCGDS-DY-TYFCGAVDAIEEKLEKMgAVVIGD---------------------SLKIDGD--P--ERDEIVSwGSGIADKI--------
hhhhhhhhhhhh eeeee e eee

FLAV_DESDE
ASGDQ-EY-EHFCGAVPAIEERAKELgATIIAE---------------------GLKMEGD--ASNDPEAVASfAEDVLKQL--------
e hhhhhhhhhhhhhh eeeee ee hhhhhhhhhhh

2fcr
GLGDAEGYPDNFCDAIEEIHDCFAKQGAKPVGFSNPDDYDYEESKSVRD-GKFLGLPLDMVNDQIPMEKRVAGWVEAVVSETGV------
eee ttt ttsttthhhhhhhhhhhtt eee b gggs s tteet teesseeeettt ss hhhhhhhhhhhhhhhht

FLAV_ANASP
GTGDQIGYADNFQDAIGILEEKISQRgGKTVGYWSTDGYDFNDSKALR-NGKFVGLALDEDNQSDLTDDRIKSwVAQLKSEFGL------
hhhhhhhhhhhhhh eeee hhhhhhhhhhhhhhhh

FLAV_ECOLI
GCGDQEDYAEYFCDALGTIRDIIEPRgATIVGHWPTAGYHFEASKGLADDDHFVGLAIDEDRQPELTAERVEKwVKQISEELHLDEILNA
hhhhhhhhhhhhhh eeee hhhhhhhhhhhhhhhhhh

FLAV_AZOVI
GLGDQVGYPENYLDALGELYSFFKDRgAKIVGSWSTDGYEFESSEAVVD-GKFVGLALDLDNQSGKTDERVAAwLAQIAPEFGLS--L--
e hhhhhhhhhhhhhh eeeee hhhhhhhhhhh

FLAV_ENTAG
GLGDQLNYSKNFVSAMRILYDLVIARgACVVGNWPREGYKFSFSAALLENNEFVGLPLDQENQYDLTEERIDSwLEKLKPAV-L------
hhhhhhhhhhhhhhh eeee hhhhhhh hhhhhhhhhhhh

4fxn
G-----SYGWGDGKWMRDFEERMNGYGCVVVET---------------------PLIVQNE--PDEAEQDCIEFGKKIANI---------
e eesss shhhhhhhhhhhhtt ee s eeees ggghhhhhhhhhhhht

FLAV_MEGEL
G-----SYGWGSGEWMDAWKQRTEDTgATVIGT----------------------AIVNEM--PDNAPE-CKElGEAAAKA---------
hhhhhhhhhhh eeeee eeee h hhhhhhhh

FLAV_CLOAB
STANSIA-GGSDIALLTILNHLMVK-gMLVYSG----GVAFGKPKTHLG-----YVHINEI--QENEDENARIfGERiANkV--KQIF--
hhhhhhhhhhhhhh eeeee hhhh hhh hhhhhhhhhhhh h

3chy
-----------TAEAKKENIIAAAQAGASGY-------------------------VVK----P-FTAATLEEKLNKIFEKLGM------
ess hhhhhhhhhtt see ees s hhhhhhhhhhhhhhht

G
Enough to predict 5(βα) topology
127
Secondary structure-induced alignment
128
Iteration
Convergence
Limit cycle
Divergence
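The three outcomes can be told apart by comparing the result of each iteration with all earlier ones. A minimal sketch, assuming the state produced by one alignment/prediction round (for example a predicted secondary structure string) is hashable and can be compared for equality. `classify_iteration` and `step` are hypothetical names, and "divergence" here only means that no state repeated within `max_iter` rounds.

```python
def classify_iteration(step, state, max_iter=50):
    """Iterate `step` and report how the sequence of states behaves.

    Returns (outcome, period): period is 1 for convergence, >1 for a
    limit cycle, and None when no state repeats within max_iter rounds.
    """
    seen = {state: 0}  # state -> iteration at which it first appeared
    for i in range(1, max_iter + 1):
        state = step(state)
        if state in seen:
            period = i - seen[state]
            outcome = "convergence" if period == 1 else "limit cycle"
            return outcome, period
        seen[state] = i
    return "divergence", None

# Toy stand-ins for one alignment/prediction round.
halving = classify_iteration(lambda x: x // 2, 8)        # settles at 0
cycling = classify_iteration(lambda x: (x + 1) % 3, 0)   # 0, 1, 2, 0, ...
growing = classify_iteration(lambda x: x + 1, 0)         # never repeats
```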
129
Flavodoxin-cheY multiple alignment/secondary
structure iteration: cheY SSEs
3chy-AA SEQUENCE  AA   ADKELKFLVVDDFSTMRRIVRNLLKELGFNNVEEAEDGVDALNKLQAGGYGFVISDWNMP
3chy-ITERATION-0  PHD  EEEEEEE HHHHHHHHHHHHHHHHH E HHHHHHHHHH HHHEEE
3chy-ITERATION-1  PHD  EEEEEEEE HHHHHHHHHHHHHHH HHHHHHHH EEEEEE
3chy-ITERATION-2  PHD  EEEEEEEE HHHHHHHHHHHHHH HHHHHHHHH EEEEEE
3chy-ITERATION-3  PHD  EEEEEEEE HHHHHHHHHHHHHH EEE HHHHHH EEEEE
3chy-ITERATION-4  PHD  EEEEEEEE HHHHHHHHHHHHHH HHHHHHH EEEEE
3chy-ITERATION-5  PHD  EEEEEEEE HHHHHHHHHHHHHH EEE HHHHHH EEEEE
3chy-ITERATION-6  PHD  EEEEEEEE HHHHHHHHHHHHHH HHHHHHHH EEEEEE
3chy-ITERATION-7  PHD  EEEEEEEE HHHHHHHHHHHHHH EEE HHHHHH EEEEE
3chy-ITERATION-8  PHD  EEEEEEEE HHHHHHHHHHHHHH HHHHHHH EEEEEE
3chy-ITERATION-9  PHD  EEEEEEEE HHHHHHHHHHHHHH HHHHHHHHHH EEEEE

3chy-AA SEQUENCE  AA   NMDGLELLKTIRADGAMSALPVLMVTAEAKKENIIAAAQAGASGYVVKPFTAATLEEKLNKIFEKLGM
3chy-ITERATION-0  PHD  HHHHHHEEEEEE HHHHHHHHHHHHHHHHH HHHHHHHHHHHHHH
3chy-ITERATION-1  PHD  HHHHHHEEEEEE HHH HHHHHHHHHHHHHHHHHH EEE HHHHHHHHHHHHHH
3chy-ITERATION-2  PHD  HHHHHHEEEEEE HHHHHHHHHHHHHHHHHH EEE HHHHHHHHHHHHHH
3chy-ITERATION-3  PHD  HHHHHHHHHHHH HHHHHHHHHHHHHHHHHH EEE HHHHHHHHHHHHHH
3chy-ITERATION-4  PHD  HHHHH EEEEE HHHHHHHHHHHHHHHHH EEE HHHHHHHHHHHHHH
3chy-ITERATION-5  PHD  HHHHHHHH EEEEE HHHHHHHHHHHHHHHH EEE HHHHHHHHHHHHHH
3chy-ITERATION-6  PHD  HHHHHHHH EEEEE HHHHHHHHHHHHHHHH EEEE HHHHHHHHHHHHHH
3chy-ITERATION-7  PHD  HHHHHHHH EEEEEE HHHHHHHHHHHHHHHH EEE HHHHHHHHHHHHHH
3chy-ITERATION-8  PHD  HHHHHHHH EEEEE HHHHHHHHHHHHHHHH EEE HHHHHHHHHHHHHH
3chy-ITERATION-9  PHD  HHHHHHHH EEEEE HHHHHHHHHHHHHHH EEEE HHHHHHHHHHHHHH
130
4fxn-AA SEQUENCE  AA   MKIVYWSGTGNTEKMAELIAKGIIESGKDVNTINVSDVNIDELLNEDILILGCSAMGDEV
4fxn-ITERATION-0  PHD  EEEEE HHHHHHHHHHHHHHH EEE EEEEE
4fxn-ITERATION-1  PHD  EEEEE HHHHHHHHHHHHHHH EEEE EEEEE
4fxn-ITERATION-2  PHD  EEEEE HHHHHHHHHHHHHHH EEEE EEEEE
4fxn-ITERATION-3  PHD  EEEEE HHHHHHHHHHHHHHH E EEEEE
4fxn-ITERATION-4  PHD  EEEEEE HHHHHHHHHHHHHHH EEEE EEEEE
4fxn-ITERATION-5  PHD  EEEEEE HHHHHHHHHHHHHHH EE EEEEE
4fxn-ITERATION-6  PHD  EEEEEE HHHHHHHHHHHHHHH EEEE EEEEE
4fxn-ITERATION-7  PHD  EEEEEE HHHHHHHHHHHHHHH EE EEEEE
4fxn-ITERATION-8  PHD  EEEEEE HHHHHHHHHHHHHHH EEE EEEEE
4fxn-ITERATION-9  PHD  EEEEE HHHHHHHHHHHHHHH EEE EEEEE

4fxn-AA SEQUENCE  AA   LEESEFEPFIEEISTKISGKKVALFGSYGWGDGKWMRDFEERMNGYGCVVVETPLIVQNE
4fxn-ITERATION-0  PHD  EEEEE HHHHHHHHHHHHHHHHH EEE EEE
4fxn-ITERATION-1  PHD  HHHH EEEEE HHHHHHHHHHHHHHH EEE EE
4fxn-ITERATION-2  PHD  HHHHHHHHHHHH EEEEEE HHHHHHHHHHHHHHH EEE EE
4fxn-ITERATION-3  PHD  HHHHHHHHHHHH EEEEE HHHHHHHHHHHHHHH EEE EE
4fxn-ITERATION-4  PHD  HHHHHHHHHHHH EEEEE HHHHHHHHHHHHHHHHH EEE E
4fxn-ITERATION-5  PHD  HHHHHHHHHHHH EEEEE HHHHHHHHHHHHHHHHH EEE E
4fxn-ITERATION-6  PHD  HHHHHHHHHHHH EEEEEE HHHHHHHHHHHHHHHH EEE E
4fxn-ITERATION-7  PHD  HHHHHHHHHHHH EEEEE HHHHHHHHHHHHHHHHH EEE E
4fxn-ITERATION-8  PHD  HHHHHHHHHHHH EEEEE HHHHHHHHHHHHHHHHH EEE E
4fxn-ITERATION-9  PHD  HHHHHHHHHHHH EEEEEE HHHHHHHHHHHHHHHH EEE E

4fxn-AA SEQUENCE  AA   PDEAEQDCIEFGKKIANI
4fxn-ITERATION-0  PHD  HHHHHHHHHHHHH
4fxn-ITERATION-1  PHD  HHHHHHHHHHHHH
4fxn-ITERATION-2  PHD  HHHHHHHHHHHHH
4fxn-ITERATION-3  PHD  HHHHHHHHHHHHH
4fxn-ITERATION-4  PHD  HHHHHHHHHHHH
4fxn-ITERATION-5  PHD  HHHHHHHHHHHHH
4fxn-ITERATION-6  PHD  HHHHHHHHHHHH
4fxn-ITERATION-7  PHD  HHHHHHHHHHHHH
4fxn-ITERATION-8  PHD  HHHHHHHHHHHHH
4fxn-ITERATION-9  PHD  HHHHHHHHHHHH
131
Predicting sec. struct. with PHD, etc.
(Figure labels: A, B, C, D; 1-6)
132
Secondary structure prediction using MA (SymSS)
1 2 3 4
2 1 3 4
3 1 2 4
4 1 2 3
1 1 1 1
EEEEE HHHHHH EEEEE HH EEEE? ?HHHHH
EEE H EEEEE HHHHH? ??EE
HH EEEEEE ?HHHHH EEEE HH
EEEEE HHHHHH EEE HH EEEE? ?HHHHH
EEE H EEEEE HHHHH? ??EE HH EEEEE
?HHHHH EEEE HH
EEEEE HHHH EEE HH EEEE? ?HHH EEE
H EEEEE HHH? ??EE HH EEEEE HHH?
EEEE HH
EEEEE HHHHHH EEE HHHH EEEE? ?HHHHH
EEE ?HHH EEEEE HHHHH? ??EE HHHH EEEEE
?HHHHH EEEE HHHH
EEEEE HHHHH EEE H
EEEE HHHH EE HHH
EEEE HHHHH EEE H
EEEE HHH EEE HH
133
Flavodoxin-cheY
3chy     ------------ GYVVKPFTAATLEEKLNKIFEKLGM------
PHD      --------------- hhhhhhhhhhhhhh ------
13->0    ee ??hhhhhhhhhhh?
13->1    ee ??hhhhhhhhhhh??
13->2    ee ??hhhhhhhhhhh?
13->3    eee ?hhhhhhhhhhh?
13->4    eee ?hhhhhhhhhhh?
13->5    eee h?hhhhhhhhhhh
13->6    eee hh hhhhhhhhhhh
13->7    e eeeeeee hhhhhhhhhhhhh??
13->8    eeeeeee hhhhhhhhhhhhh??
13->9    eeeeeee hhhhhhhhhhhhh?? ?????
13->10   eeeeeee hhhhhhhhhhhhh??
13->11   e eeeeeeee hhhhhhhhhhhhh???
13->12   eeeeeee hhhhhhhhhh
13->13   hhhhhhhhhhhhhh h
DSSP     ...............EEEESS HHHHHHHHHHHHHHHT ......
134
Optimal segmentation of predicted secondary
structures
Each sequence within an alignment gives rise to
a library of n secondary structure predictions,
where n is the number of sequences in the
alignment. The predictions are recorded by
secondary structure type and region position in a
single matrix.
1 2 3 4
1->1 1->2 1->3 1->4
EEEEE HHHHHH EEEEE HH EEEE? ?HHHHH EEE
H EEEEE HHHHH? ??EE HH EEEEEE ?HHHHH
EEEE HH
C
E
H
H score  0 0 0 0 0 ...
E score  3 4 4 4 3 ...
C score  1 0 0 0 0 ...
? score  0 0 0 0 1 ...
Region   0 1 1 1 0 ...
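A minimal sketch of the recording step, assuming every prediction in the library has already been mapped onto the same positions and uses the symbols H, E, C and ?. The function name `record_predictions` and the four toy predictions are hypothetical; they were chosen only so that the counts reproduce the first five columns of the score rows on the slide. The Region row is not reproduced because the slide does not define how it is derived.

```python
def record_predictions(library, symbols="HEC?"):
    """Count, per position, how many predictions assign each symbol.

    library: list of equal-length prediction strings for one sequence.
    Returns {symbol: [count at position 0, count at position 1, ...]}.
    """
    length = len(library[0])
    if any(len(pred) != length for pred in library):
        raise ValueError("all predictions must have the same length")
    return {
        sym: [sum(pred[i] == sym for pred in library) for i in range(length)]
        for sym in symbols
    }

# Four toy predictions for the same five positions (n = 4 sequences).
library = ["EEEEE", "EEEEE", "EEEEE", "CEEE?"]
scores = record_predictions(library)
```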
135
Optimal segmentation of predicted secondary
structures by Dynamic Programming
The recorded values are used in a weighted
function according to their secondary structure
type, which gives each position a window-specific
score. The more probable the secondary structure
element, the higher the score.
Restrictions: H only if ws > 4; E only if ws > 2
(ws: window size)
Figure labels: H score, E score, C score, ? score,
Region, window size, sequence position,
Segmentation score (total score of each path),
Max score, Offset, Label
Example values in figure: 2, 6, 5, H
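A minimal sketch of segmentation by dynamic programming, assuming the score of a segment is simply the sum of the per-position scores for its label and reading the restrictions as minimum segment lengths (H at least 5, E at least 3, C at least 1). The slide does not give the actual weighted function, so the scoring, the name `segment` and the toy scores below are all hypothetical.

```python
from itertools import accumulate

# Minimum segment lengths: H only if ws > 4, E only if ws > 2 (C is an assumption).
MIN_LEN = {"H": 5, "E": 3, "C": 1}

def segment(scores, min_len=MIN_LEN):
    """Return (best total score, label string) over all valid segmentations.

    scores: {label: [score per position]}; must include a label with
    minimum length 1 (here C) so that every prefix can be segmented.
    """
    n = len(next(iter(scores.values())))
    # Prefix sums give the score of any segment in constant time.
    prefix = {lab: [0.0] + list(accumulate(vals)) for lab, vals in scores.items()}
    best = [float("-inf")] * (n + 1)  # best[i]: best score for positions 0..i-1
    best[0] = 0.0
    back = [None] * (n + 1)           # back[i]: (start, label) of the last segment
    for end in range(1, n + 1):
        for lab in scores:
            for start in range(0, end - min_len[lab] + 1):
                if best[start] == float("-inf"):
                    continue
                cand = best[start] + prefix[lab][end] - prefix[lab][start]
                if cand > best[end]:
                    best[end] = cand
                    back[end] = (start, lab)
    # Trace back the chosen segments from the end of the sequence.
    labels, pos = [], n
    while pos > 0:
        start, lab = back[pos]
        labels.append(lab * (pos - start))
        pos = start
    return best[n], "".join(reversed(labels))

long_case = segment({"H": [1, 1, 1, 1, 1, 0, 0, 0],
                     "E": [0, 0, 0, 0, 0, 1, 1, 1],
                     "C": [0.25] * 8})
short_case = segment({"H": [1, 1, 1, 1],
                      "E": [0, 0, 0, 0],
                      "C": [0.25] * 4})
```

In the second call the helix scores are high, but four positions are too few for an H segment, so the whole stretch is labelled C.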
136
Example of an optimally segmented secondary
structure prediction library for sequence 3chy
3chy                  ---------------GYVV----- KPFTAATLEEKLNKIFEKLGM------
3chy <- 1fx1          ??????????????? ee ?? hhhhhhhhhhhhhh ????????
3chy <- FLAV_DESDE    ??????????????? ee ?? hhhhhhhhhhhhhhh ????????
3chy <- FLAV_DESVH    ??????????????? ee ?? hhhhhhhhhhhhhh ????????
3chy <- FLAV_DESGI    ??????????????? eee ?? ??hhhhhhhhhhhhh ????????
3chy <- FLAV_DESSA    ??????????????? eee ?? ??hhhhhhhhhhhhh ????????
3chy <- 4fxn          ??????????????? eee ?? hhhhhhhhhhhhh ?????????
3chy <- FLAV_MEGEL    ????????????????eee ?? hh?hhhhhhhhhhh ?????????
3chy <- 2fcr          e ? eeeeeee hhhhhhhhhhhhhhh ??????
3chy <- FLAV_ANASP    ? eeeeeee hhhhhhhhhhhhhhh ??????
3chy <- FLAV_ECOLI    eeeeeee hhhhhhhhhhhhhhh hhhhh
3chy <- FLAV_AZOVI    ? eeeeeee hhhhhhhhhhhhhhh ????
3chy <- FLAV_ENTAG    e eeeeeeee hhhhhhhhhhhhhhhh? ??????
3chy <- FLAV_CLOAB    eeeeeee hhhhhhhhhh ???????????
3chy <- 3chy          --------------- ----- hhhhhhhhhhhhhh ------
Consensus             ---------------EEEE----- HHHHHHHHHHHHH ------
Consensus-DSSP        ....................xx..... .
PHD                   --------------- ----- HHHHHHHHHHHHHH ------
PHD-DSSP              ...............xxxx..... x......
DSSP                  ...............EEEE.....SS HHHHHHHHHHHHHHHT ......
LumpDSSP              ...............EEEE..... HHHHHHHHHHHHHHH ......
137
What to do with a multiple alignment?

Use it to eyeball and detect structural/functional
features
Use it to make a profile and search a database
for homologs
Give it to other bioinformatics methods and
predict secondary structure, functional residues,
correlated mutations, phylogenetic trees, etc.
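A minimal sketch of the profile idea, assuming the simplest possible profile: per-column residue frequencies with gaps ignored, scored against a sequence window by summing the frequency of each observed residue. Real profile methods add substitution-matrix weighting, sequence weighting and gap penalties. The names `build_profile`, `score_window` and `best_window`, and the four short toy rows, are hypothetical.

```python
from collections import Counter

def build_profile(alignment, gap="-"):
    """Per-column residue frequencies; gaps are ignored, case is folded."""
    profile = []
    for column in zip(*alignment):
        counts = Counter(res.upper() for res in column if res != gap)
        total = sum(counts.values())
        profile.append({aa: c / total for aa, c in counts.items()} if total else {})
    return profile

def score_window(profile, window):
    """Sum the profile frequency of each residue in an ungapped window."""
    return sum(col.get(aa, 0.0) for col, aa in zip(profile, window.upper()))

def best_window(profile, sequence):
    """Slide the profile along a sequence; return (best score, offset)."""
    width = len(profile)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the profile")
    return max((score_window(profile, sequence[i:i + width]), i)
               for i in range(len(sequence) - width + 1))

# Toy alignment of four short rows (illustrative only).
alignment = ["MPKALI", "--KIGI", "-AKIGL", "MATIGI"]
profile = build_profile(alignment)
hit = best_window(profile, "GGMAKIGIGG")
```

Searching a database then amounts to calling `best_window` for every database sequence and ranking the scores.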

138
Rules of thumb when looking at a multiple
alignment (MA)

Hydrophobic residues are internal
Gly (Thr, Ser) in loops
MA: hydrophobic block -> internal β-strand
MA: alternating (1-1) hydrophobic/hydrophilic ->
edge β-strand
MA: alternating 2-2 (or 3-1) periodicity ->
α-helix
MA: gaps in loops
MA: conserved column -> functional? -> active
site
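Two of these rules of thumb can be sketched in code: flag each alignment column as hydrophobic or not, then look for an all-hydrophobic block or a strict 1-1 alternation. This is a minimal sketch under loud assumptions: the hydrophobic residue set, the majority threshold, the minimum window of four columns and the names `hydrophobic_columns` and `classify_window` are all hypothetical, and the output is a hint to check by eye rather than a prediction.

```python
HYDROPHOBIC = set("AVLIMFWC")  # assumed set; hydrophobicity scales differ

def hydrophobic_columns(alignment, gap="-"):
    """True for columns where more than half of the non-gap residues are hydrophobic."""
    flags = []
    for column in zip(*alignment):
        residues = [r.upper() for r in column if r != gap]
        n_hydrophobic = sum(r in HYDROPHOBIC for r in residues)
        flags.append(bool(residues) and 2 * n_hydrophobic > len(residues))
    return flags

def classify_window(flags):
    """Apply the block and 1-1 alternation rules to a window of column flags."""
    if len(flags) < 4:
        return "window too short"
    if all(flags):
        return "hydrophobic block: internal beta-strand?"
    if all(a != b for a, b in zip(flags, flags[1:])):
        return "alternating 1-1: edge beta-strand?"
    return "no simple pattern"

block = classify_window(hydrophobic_columns(["VIVLA", "LIVIA", "VLILV"]))
edge = classify_window(hydrophobic_columns(["VKVKV", "LEIEL", "ISVTI"]))
other = classify_window(hydrophobic_columns(["VVKKV"]))
```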

139
Rules of thumb when looking at a multiple
alignment (MA)

Active site residues are together in 3D structure
Helices often cover up core of strands
Helices less extended than strands -> more
residues to cross protein
β-α-β motif is right-handed in >95% of cases
(with parallel strands)
MA: inconsistent alignment columns and match
errors!
Secondary structures have local anomalies, e.g.
β-bulges
