Protein Classification and Metaorganization' Methods for Global Organization of the Protein Universe - PowerPoint PPT Presentation

Loading...

PPT – Protein Classification and Metaorganization' Methods for Global Organization of the Protein Universe PowerPoint presentation | free to download - id: 10751d-Mjg4O



Loading


The Adobe Flash plugin is needed to view this content

Get the plugin now

View by Category
About This Presentation
Title:

Protein Classification and Metaorganization' Methods for Global Organization of the Protein Universe

Description:

Methods for Global Organization of the Protein Universe. Golan Yona ... FSSP (Holm & Sander NAR 1999). Significant overlap, but no two databases are identical. ... – PowerPoint PPT presentation

Number of Views:181
Avg rating:3.0/5.0
Slides: 83
Provided by: gol100
Learn more at: http://biozon.org
Category:

less

Write a Comment
User Comments (0)
Transcript and Presenter's Notes

Title: Protein Classification and Metaorganization' Methods for Global Organization of the Protein Universe


1
Protein Classification and Meta-organization.Meth
ods for Global Organization of the Protein
Universe
Golan Yona
Department of Computer Science Cornell University
2
Outline
  • Introduction
  • Sequence-based classifications of proteins
  • Methods
  • Sequence comparison algorithms
  • Mathematical models of protein families
  • Databases
  • Domain-based protein sequence classification
  • Protein-wise protein sequence classification
  • Alternative representations of proteins
  • Structure-based classifications of proteins
  • Structure comparison algorithms
  • Databases
  • Other classifications

3
What are proteins ?
  • Structural framework (keratin, collagen)
  • Transport and storage of small molecules
    (hemoglobin)
  • Transmit information (hormones, receptors)
  • Antibodies
  • Blood clotting factors
  • EnzymesThe protein is created in the cell as a
    unique sequence
  • of amino acids

4
Sequence
ACMVLLCEVEKYP
folding
Structure
Function
?????
Ismb02
5
Background and Problem definition
About protein sequences are known today
(non-redundant database). This number keeps
rapidly growing (large scale sequencing
projects).
!
The function of 40-50 of the new proteins is
unknown.
Understanding biological function is important
for
  • Study of fundamental biological processes
  • Drug design
  • Genetic engineering

6
Basic approach
  • Calculate pairwise distances/similarities with
    sequences of known proteins
  • Pairwise similarities are computed using the
    rigorous dynamic programming algorithm
    (Smith-Waterman), or heuristic algorithms such
    as Fasta and Blast.

V MCTKKLLVYD
VRSM KLLLGF E
7
Database search
  • A sequence is compared with all sequences in the
    database
  • Its properties are extrapolated from neighboring
    sequences (nearest neighbor approach to learning
    and generalization)

MFGPVRACAATFD...
MCTKKLVTD....
RVSMLILLGFH.....
KAACGCGLMDRFYVVLVV....
.....
8
BLASTP 2.2.2 Jan-08-2002 Reference Altschul,
Stephen F., Thomas L. Madden, Alejandro A.
Schaffer, Jinghui Zhang, Zheng Zhang, Webb
Miller, and David J. Lipman (1997), "Gapped
BLAST and PSI-BLAST a new generation of protein
database search programs", Nucleic Acids Res.
253389-3402. Query nr001640000880 trembl
(Q935W2) DNA mismatch repair protein (Fragment).
(164 letters) Database
/local/databases/nr/nr 933,075
sequences 292,319,403 total letters Searching...
...............................................don
e
Score E Sequences
producing significant alignments
(bits) Value nr008510000010 swissprot
(Q99XL8) DNA mismatch repair protein ... 316
5e-86 nr001640000880 trembl (Q935W2) DNA
mismatch repair protein (Fr... 298
1e-80 nr001640000882 trembl (Q936C5) DNA
mismatch repair protein (Fr... 290
3e-78 nr001640000877 trembl (Q933P8) DNA
mismatch repair protein (Fr... 283
4e-76 nr001640000883 trembl (Q936C9) DNA
mismatch repair protein (Fr... 280
5e-75 nr001350001337 trembl (Q9ETZ4) DNA
mismatch repair protein (Fr... 248
2e-65 nr001350001345 trembl (Q9EVZ2) DNA
mismatch repair protein (Fr... 247
3e-65 nr001350000805 trembl (Q933M4) Putative
DNA mismatch repair en... 246 7e-65
9
BLASTP 2.2.2 Jan-08-2002 Reference Altschul,
Stephen F., Thomas L. Madden, Alejandro A.
Schaffer, Jinghui Zhang, Zheng Zhang, Webb
Miller, and David J. Lipman (1997), "Gapped
BLAST and PSI-BLAST a new generation of protein
database search programs", Nucleic Acids Res.
253389-3402. Query nr001640000690 trembl
(Q22102) Hypothetical 19.0 kDa protein.
(164 letters) Database /local/databases/nr/nr
933,075 sequences 292,319,403 total
letters Searching................................
..................done
Score
E Sequences producing significant alignments
(bits) Value nr0016400006
90 trembl (Q22102) Hypothetical 19.0 kDa
protein. 310 5e-84 nr009250000047 trembl
(Q9GUG1) Hypothetical 104.9 kDa protein. 230
4e-60 nr009660000030 trembl (Q20175)
Hypothetical 110.0 kDa protein. 225
1e-58 nr008350000033 trembl (Q17506)
Hypothetical 95.8 kDa protein. 152
8e-37 nr011390000015 trembl (O76601) H02F09.4
protein. 152
1e-36 nr002650001251 trembl (Q9XUR0) T05F1.12
protein. 148
2e-35 nr002140000359 trembl (O16440)
Hypothetical 24.8 kDa protein. 139
2e-32 nr002210001099 trembl (Q9N5M4)
Hypothetical 25.9 kDa protein. 135 1e-31
10
BLASTP 2.2.2 Jan-08-2002 Query nr001640000685
trembl (Q19708) F22B5.4 protein. (164
letters) Database /local/databases/nr/nr
933,075 sequences 292,319,403 total
letters Searching................................
..................done
Score
E Sequences producing significant alignments
(bits) Value nr001670000430
trembl (P90860) F36A2.7 protein.
153 8e-37 gtnr001670000430 trembl
(P90860) F36A2.7 protein. Length
167 Score 153 bits (386), Expect 8e-37
Identities 83/157 (52), Positives 108/157
(67), Gaps 2/157 (1) Query 8
RKTMFRFVLSRNASTSNVPSPARIQLKKPAEAGHFQYSRNWSRDPRFVKV
AIQKGDTPYQ 67 T R VL RNAS S
K G FY R SRD R A GDT Sbjct 13
KNTIARIVLVRNAS-SGLTLKHEQTIKINDQQGFFKYQRDVSRDTRYSNP
A-KPGDTAAR 70 Query 68 FLVRRLGHAYEVYPLFVLTAAWFV
LFCSASYWSFGKAEIWLDRSNSKAPWDWERLRDTYW 127
F RLGHAYEYPLF L A W VLF SF KAEIWLDRS
APWDWERR YW Sbjct 71 FMFRKLGHAYEIYPLFGLLAIWCVL
FGYTVWYSFEKAEIWLDRSHTVAPWDWERIRNNYW 130
11
Smith Waterman (Adv. App. Math 1981)
Based on the dynamic programming algorithm. Finds
the maximal local similarity given the parameters
for point mutations s(ai,bj) and the gap
penalties g The optimal alignment of the
a1a2a3.ai with b1b2b3.bj must end in one of
three possible ways ai OR ai OR
- bj - bj If S(i,j) is the best
score over all possible alignments of a1a2a3.ai
with b1b2b3.bj then
S(i,j) MAXS(i-1,j-1)s(ai,bj),
S(i-1,j)-g, S(i,j-1)-g
12
Smith Waterman (Adv. App. Math 1981)
Implementation
0 . . . j
. . m
0 . i . n
S(i-1,j-1)s(ai,bj)
S(i-1,j)-g
S(i,j-1)-g
13
FASTA (Pearson Lipman, PNAS 1988)
Create a hash table with all k-tuples that appear
in the query sequence, index vectors and an
offset vector. Run a bounded dynamic programming
around the best diagonal/region
Library sequence 0 . . .
j . . m
0 . i . n
14
BLAST and Gapped-BLAST (Altschul et al., JMB
1990, Altschul et al. NAR 1997)
For each word length k in the query generate all
k-tuples with score gt T. Use these k-tuples to
search the library sequence. Extend the seeds to
generate HSPs (high scoring segment pairs without
gaps). Combine HSPs
0 . . . j
. . m
0 . i . n
15
PSI-BLAST (Altschul et al. NAR 1997)
  • Iterative procedure that uses a
    position-specific scoring matrix.
  • Generate a profile P p1p2pn from hits
    detected at iteration I, to search the database
    in iteration I1, until convergence.
  • The profile stores the probabilities to observe
    the 20 amino acids at each position along the
    query sequence (derived from hits with related
    sequences, after weighting and addition of
    pseudo-counts)

1 2 n
1 2 . . 20
16
Limitations of sequence comparison
In many cases sequences have diverged to the
extent that their sequence similarity is
undetectable with the above algorithms. More
than 50 of the 1,000,000 proteins that are known
today have an unidentified function Iterative
procedures such as PSI-BLAST are more powerful,
but are more susceptible to parameter tuning and
have higher false positive rate Use reliable
mathematical models of protein families to
classify new genes
17
Protein classification
More specific
Sub-families (rab, ras, ran, rac, rho, ARF1,
transducin,..) Families (G-proteins, Motor
proteins, DNA helicases,..) Super-families
(P-loop containing nucleotide triphosphate
hydrolases) Fold families (P-loop containing
hydrolases, beta/alpha tim barrel) Classes
(alpha and beta, all-alpha, ...)
Common evolutionary origin
18
Why is protein classification important?
  • Protein classification can help to elucidate the
    function of new genes.
  • Comparing a sequence with a database of protein
    families is more effective than a standard
    database search.
  • Each family is represented as a model
  • regular expression
  • profile
  • HMM
  • Those models can capture subtle similarities.
    Protein classification is useful in structure
    and function prediction, and especially
    important in large-scale annotation efforts.

Family1
Family2
Family3
19
Databases of protein families
  • Domain-based clusterings
  • Prosite (Bairoch Bucher ISMB 1994, Laurent
    et al. NAR 2002).
  • Pfam (Sonnhammer et al. Proteins 1997,
    Bateman et al. NAR 2002).
  • ProDom (Gouzy et al. Computers and Chemistry
    1999, Corpet et al. NAR 2000).
  • Prints (Attwood et al. NAR 2002).
  • Domo (Gracy Argos Bioinformatics 1998).
  • Blocks (Henikoff et al. NAR 1999).
  • Protein-based clusterings
  • ProtoMap (Yona et at. Proteins 1999, Yona et
    al. NAR 2000).
  • Cogs (Tatusov et al. Science 1997, Tatusov et
    al. NAR 2001).
  • Systers (Krause Vingron Bioinformatics 1998,
    Krause et al. NAR 2002).
  • PIR (Barker et al. NAR 2000, Wu et al. NAR
    2002).
  • Structural classifications
  • SCOP (Murzin et al. JMB 1995).
  • CATH (Orengo et al. Structure 1997).
  • FSSP (Holm Sander NAR 1999).

Significant overlap, but no two databases are
identical.
20
Mathematical representations of protein families
  • Regular expression
  • Position specific scoring matrices (profiles)
  • Hidden Markov Models
  • Probabilistic suffix trees
  • Sparse Markov transducers

21
Regular expression
G-x(2,3)-MLIV-x-P-K,H-x(2)-C
GAY_LDPAKSC GMF_VEPNTSC GKRFVRPFHNC GKSTVDPFHNC
G - a match with amino acid G x a match with
any amino acid x(i,j) a match with any amino
acid at least i times and no more than j times
MLIV a match with any amino acid of the four
M, L, I or V K,H a match with any amino acid
other than K and H
22
Regular expression - examples
ACTININ_1. EQ-x(2)-ATV-FY-x(2)-W-x-N. HO
MEOBOX_1. LIVMFYG-ASLVR-x(2)-LIVMSTACN-x-
LIVM-x(4)-LIV-RKNQESTAIY- LIVFSTNK
H-W-FYVC-x-NDQTAH-x(5)- RKNAIMW. HELIX
_LOOP_HELIX. DENSTAP-KR-LIVMAGSNT- F
YWCPHKR-LIVMT-LIVM-x(2)- STAV-LIVMST
ACKR-x-VMFYH- LIVMTA-P-P-LIVMRKHQ
.
23
Profile (Gribskov et al. PNAS 1987)
GAA_TGAAGTC GGT_TTAGGCC GGCAATAGGGC GTTAATAACAC
Position specific scoring matrix A 0 0.25
0.25 1 0.5 0 1 0.5 0 0.25 0 C
0 0 0.25 0 0 0 0 0 0.25 0.25
1 G 1 0.5 0 0 0 0.25 0 0.5 0.75
0.25 0 T 0 0.25 0.5 0 0.5 0.75 0 0
0 0.25 0 Add pseudo counts Na(position
s) Ntotal(s) Samino acid i p(a/i) p(i/s) Final
score score(a/position s) Samino acid i
p(i/s) score(i,a)
24
Hidden Markov Models
v1
v2
a22
v3
v4
w2
a11
a12
a23
w1
w3
b11
v1
b14
v2
v3
v4
  • Hidden states w and observations v
  • w(t) the state at time t
  • Transition probabilities
  • P(wj(t1)/wi(t)) aij
  • Emission probabilities
  • P(vk(t)/wj(t)) bjk
  • Transition and emission probabilities are learned
    from training data

25
Hidden Markov Models (cont.)
a11
a22
a12
w1
w2
b11
A
a21
b14
C
T
G
  • A simple intron/exon model for DNA
  • Hidden states are w1exon and w2intron.
  • Observations (visible symbols) are A, C, G and T

26
Hidden Markov Models (cont.)
  • HMMs used to model protein families are profile
    HMMs with three different types of hidden states
    Match (M), delete (D) and insertion (I) states
    (Krogh et al. JMB 1996)
  • The observations (visible symbols) are the amino
    acids

27
Hidden Markov Models (cont.)
Given a sequence of length T, what is the
probability that the sequence was generated by
the model? Answer Sum over all possible paths
that can generate this sequence Define the
forward variable aj(t) to be the probability to
observe the sequence Vt v(1)v(2)v(t) and to
reach hidden state j at time t Note P(VT) Sj
aj(T) The HMM forward algorithm Initialize
aj(1) P0(j) bjk where vkv(1) Iteration for
t1 till T-1 aj(t1) Si ai(t) aij bjk where
vkv(t1)
28
Hidden Markov Models (cont.)
What is the most probable sequence of hidden
states that generated it? The Viterbi
algorithm Define the variable dj(t) to be the
probability to observe the sequence Vt
v(1)v(2)v(t) and to reach hidden state j at time
t maximized over all sequences of hidden states
of length t-1. Define the pointer variable
pathj(t) that contains the pointer to the most
probable hidden state at time t-1 from which we
can reach state j at time t Initialize dj(1)
P0(j) bjk where vkv(1) pathj(1) 0
Iteration for t1 till T-1 dj(t1) maxi
di(t) aij bjk where vkv(t) pathj(t1)
arg maxi di(t) aij Output probability P
maxi di(T) final state wT arg maxi di(T)
and trace back
29
Probabilistic Suffix Trees
(Ron et al. Machine Learning 1996)
  • Identify short significant patterns (contiguous
    segments)
  • Does not require multiple alignment
  • Induces a probability distribution on the next
    symbol to appear right after the segment (short
    term memory)
  • Variable memory length
  • More efficient than order L Markov chains
  • Longer memory length compared to first-order
    HMMs, and easier to learn

30
Probabilistic Suffix Trees (cont)
  • The learning algorithm
  • Initialize let tree T consist of a single root
    node (with empty label)
  • let S s/ s in S and P(s) gt Pmin
  • Building the PST while S is not empty, pick s in
    S
  • Remove s from S
  • If there exists a symbol s in S such that
  • P(s/s) gt Tmin
  • and P(s/s) gt r
  • P(s/suf(s)) lt 1/r
  • then add to T the node s and all the nodes on
    the path to s from the deepest node of T that is
    a suffix of s
  • If s lt L then add the strings ss / s in S and
    P(ss) gt Pmin to S
  • Finalize smooth tree probabilities

or
31
Databases of protein domains
  • Prosite http//www.expasy.ch/prosite/
  • Pfam http//www.sanger.ac.uk/Software/Pfam/
  • Blocks http//www.blocks.fhcrc.org/
  • ProDom http//prodes.toulouse.inra.fr/prodom/d
    oc/prodom.html
  • Prints http//www.bioinf.man.ac.uk/dbbrowser/PRI
    NTS/
  • Domo http//www.infobiogen.fr/services/domo/
  • InterPro http//www.ebi.ac.uk/interpro/
  • Smart http//smart.embl-heidelberg.de/
  • eMotif http//dna.stanford.edu/identify

32
Motifs and domains
  • Motif a simple combination of a few consecutive
    secondary structure elements with a specific
    geometric arrangement (e.g., helix-loop-helix).
    Some, but not all motifs are associated with a
    specific biological function.
  • Domain the fundamental unit of structure folding
    and evolution. It combines several secondary
    elements and motifs, not necessarily contiguous,
    which are packed in a compact globular structure.
    A domain can fold independently into a stable 3D
    structure, and it has a specific function.
  • Domain family proteins that share a domain
    (possibly in combination with other domains)
  • Protein family proteins that have the same
    combination of domains

33
General approaches
  • Motif based databases
  • Prosite, Prints, Blocks, eMotif, InterPro
  • Domain-based databases
  • Pfam, ProDom, Domo, Smart
  • Manual/Semi-manual
  • Prosite
  • Semi-automatically
  • Pfam, Smart
  • Fully automatic
  • ProDom, Blocks, Domo, eMotif
  • Use different models (regular expressions,
    profiles, HMMs)
  • Based on each other

34
Prosite
  • A dictionary of functional and structural motifs
    and domains
  • Valuable biological information on each family
  • Each motif/domain/family is represented as a
    regular expression, a rule or a profile
  • Models are generated from (usually published)
    multiple alignments, manually calibrated to
    ensure selectivity and sensitivity
  • Patterns do not always cover complete domains
    whereas profiles usually span the whole domain
  • Some families have more than one signature
    pattern and /or profiles
  • Pattern length can vary from a few amino acids to
    several hundreds
  • As of June 2002 contains 1564 patterns and
    profiles describing 1144 families or domains

1 2 3 4 5 6 7 8 9
10 11 A 0 0.25 0.25 1 0.5 0 1
0.5 0 0.25 0 C 0 0 0.25 0 0 0
0 0 0.25 0.25 1 G 1 0.5 0 0 0
0.25 0 0.5 0.75 0.25 0 T 0 0.25 0.5
0 0.5 0.75 0 0 0 0.25 0
OR
G-x(2,3)-MLIV-x-P-K,H-x(2)-C
35
Prosite - example
Profile
ID TEST MATRIX. AC PS50999 MA /DISJOINT
DEFINITIONPROTECT N14 N227 MA
/NORMALIZATION MODE1 FUNCTIONLINEAR
R11.1126 R20.02183468 TEXT'NScore' MA
/CUT_OFF LEVEL0 SCORE110 N_SCORE8.5
MODE1 MA /DEFAULT M0-8 D-20 I-20
B1-60 E1-60 MI-105 MD-105 IM-105
DM-105 MA /I B10 BI-105 BD-105 CC
A, B, C, D, E, F, G, H, I,
K, L, M, N, P, Q, R, S, T, V, W, Y,
Z MA /M SY'R' M-12, -9,-30, -9, -6,-24,
10, -8,-33, 16,-24,-13, 0,-19, 0, 36,
-7,-13,-23,-20,-17, -6 MA /M SY'R' M -2,
-4,-24,-10, -4,-19,-16, -6,-14, 8,-14, -7,
4,-17, 4, 23, -2, -3,-12,-23,-12, -3 MA /M
SY'G' M -5, -7,-27, -8, -7,-18, 18,-15,-30,
7,-26,-14, -1,-16, -8, 0, -2,-11,-21,-19,-15,
-7 MA /M SY'G' M -9, 4,-27, -2, -9,-24,
24, -6,-32, 0,-27,-17, 16,-20, -6, 12,
0,-12,-27,-25,-21, -9 MA /M SY'K' M-11,
-1,-30, -1, 9,-29,-20, -9,-30, 48,-29,-10,
0,-11, 10, 34,-10,-10,-20,-20,-10, 9 MA /M
SY'K' M -7, -6,-24, -7, -1,-19, -4,-12,-20,
9,-19,-10, -5,-15, -4, 5, -3, -1,-11,-21,-10,
-3 MA /M SY'V' M -1,-30,-12,-31,-30,
0,-31,-30, 32,-21, 11, 11,-29,-29,-29,-21,-11,
-1, 48,-29, -9,-30 MA /M SY'T' M -1, -3,
7,-13,-13,-11,-21,-21,-13,-13,-11,-11,
-3,-14,-13,-13, 16, 42, -1,-33,-13,-13 MA /M
SY'V' M -6,-18,-19,-21,-16, -5,-26,-20,
9,-11, 8, 4,-15,-21,-13, -6, -7, 6, 12,-25,
-7,-16 MA /M SY'V' M -4,-30,-17,-33,-29,
1,-33,-29, 35,-24, 17, 14,-27,-27,-26,-23,-15,
-4, 40,-26, -6,-29 MA /M SY'S' M -3,
4,-19, 1, 7,-22,-12, -5,-20, 3,-22,-14,
9,-11, 10, 5, 16, 13,-16,-31,-15, 8 MA /M
SY'G' M -2, 0,-28, -4,-16,-28,
56,-14,-36,-16,-30,-20, 12,-20,-16,-16,
2,-16,-30,-24,-28,-16 MA /M SY'L'
M-10,-29,-21,-33,-23, 14,-31,-21, 23,-28, 36,
20,-27,-28,-21,-21,-25, -9, 15,-17, 3,-23 MA
/M SY'E' M -6, 20,-26, 26, 30,-31,-12,
-1,-28, 3,-23,-20, 11, -7, 12, -4, 4,
-6,-26,-32,-19, 21 MA /M SY'L' M
5,-15,-18,-19, -8, -8,-19,-15, 3,-10, 10,
10,-16,-18, -9,-12, -9, -2, 5,-23, -9, -8 MA
/M SY'F' M -6, -7,-21,-11,-12,
21,-18,-10,-12,-10, -9, -9, -4,-21,-17,-11, -5,
-5, -9, -9, 11,-13 MA /I I -6
MD-32 MA /M SY'G' M -2, 3,-25, 7,
2,-27, 18,-10,-27, -9,-18,-14, 0,-14, -2,-12,
-1,-12,-22,-23,-20, 0 D-6 MA /I I
-6 MI-32 IM-32 DM-32 MA /M SY'I' M
0,-21,-20,-25,-20, 4,-25,-20, 17,-22, 16,
12,-19,-22,-18,-21,-15, -7, 14,-19, -3,-20 MA
/M SY'D' M -7, 27,-27, 36, 21,-32, -3,
-3,-32, 0,-25,-23, 14, -9, 2, -9, 2,
-9,-27,-34,-21, 12 MA /M SY'L' M
-9,-23,-27,-21,-11, -8,-27,-18, 6,-20, 14,
5,-22, 8, -9,-18,-18, -6, -4,-23,-10,-12 MA
/M SY'K' M-11, 2,-27, 3, 10,-22,-20,
-4,-24, 23,-21,-11, 0,-12, 10, 16, -4,
-2,-19,-19, -3, 9 MA /M SY'K' M 2,
-2,-20, -3, 5,-22,-13,-12,-21, 16,-22,-12,
0,-10, 3, 6, 8, 5,-13,-27,-14, 4 MA /M
SY'L' M -5,-28,-19,-31,-23, 10,-28,-21,
21,-26, 32, 18,-26,-27,-21,-20,-22, -8, 17,-19,
-1,-22 MA /M SY'A' M 31,-14,
0,-22,-15,-17, -2,-22, -8,-15, -6,
-8,-13,-17,-15,-21, 2, -3, 1,-24,-19,-15 MA
/M SY'A' M 16, -5,-17, -9, -2,-22, -7,
-3,-20, 5,-21,-11, -1,-12, 0, 1, 10,
0,-11,-26,-13, -2 MA /M SY'E' M 3,
4,-23, 5, 13,-23,-16, -9,-13, -3,-12, -7, -2,
-9, 5,-10, 1, -2,-12,-27,-15, 9 MA /I
E10 IE-105 DE-105
Pattern
K-x-V-TC-x-IVL-x(2)-LMF
Courtesy of Nicolas Hulo
36
Prints
  • A database of gene- and domain-family
    fingerprints in the form of collections of short
    motifs aligned without gaps
  • Fingerprints comprise several motifs from
    different parts of a multiple alignment
  • Discrimination power increases with the number of
    motifs in a fingerprint
  • Start from a seed alignment of a few sequences.
    Search the database and refine the fingerprint
    based on the set of sequences that match the
    fingerprint completely (add sequences and/or
    add/remove motifs), until convergence
  • Partial match possibly subfamily members
  • As of July 2002, contains 10,500 motifs grouped
    into 1750 families

37
Prints - a fingerprinting overview
PRINTS
annotation
N
C
Courtesy of Terri Attwood
38
Prints - example
SUMMARY INFORMATION 37 codes involving 8
elements 0 codes involving 7 elements
0 codes involving 6 elements 0 codes
involving 5 elements 0 codes involving 4
elements 1 codes involving 3 elements 0
codes involving 2 elements COMPOSITE
FINGERPRINT INDEX 8 37 37 37 37
37 37 37 37 7 0 0 0 0
0 0 0 0 6 0 0 0 0 0
0 0 0 5 0 0 0 0 0
0 0 0 4 0 0 0 0 0
0 0 0 3 1 0 0 0 1 1
0 0 2 0 0 0 0 0 0 0
0 ----------------------------------------
-- 1 2 3 4 5 6 7 8
True positives.. PRIO_COLGU PRIO_MACFA
PRIO_CEREL PRIO_ODOHE PRIO_GORGO PRIO_PANTR
PRIO_HUMAN O46648 PRIO_SHEEP PRIO_CALJA
PRIO_BOVIN PRP2_BOVIN PRIO_ATEPA PRIO_SAISC
PRIO_PREFR PRIO_PONPY O75942 PRIO_CAPHI
PRIO_CEBAP PRIO_CAMDR PRIO_FELCA PRP1_TRAST
PRIO_RABIT PRP2_TRAST PRIO_PIG
PRIO_CANFA PRIO_CRIGR PRIO_CRIMI Q15216
PRIO_RAT PRIO_CERAE PRIO_MUSPF PRIO_MUSVI
PRIO_MESAU PRIO_MOUSE O46593 PRIO_TRIVU
Subfamily Codes involving 3 elements
Subfamily True positives.. PRIO_CHICK
Courtesy of Terri Attwood
39
Prints - website
Courtesy of Terri Attwood
40
ProDom
  • A database of automatically generated protein
    domains
  • Old procedure
  • Identify high-scoring segment pairs (HSPs) using
    BLAST for all-versus-all comparison of SwissProt
  • Construct homologous segment sets by transitive
    closure of overlapping HSPs
  • Break into domains
  • New procedure
  • Construct homologous segment sets by PSI-BLAST
    recursive search
  • Generate a multiple alignment and a consensus
    sequence for each family
  • May break real domains into smaller domains (more
    than 305,000 domain families as of January 2002)

41
ith iteration
DB
query
internal repeat detection
yes
no
query
PSI-BLAST
no match
matches
repeat matches
DB changes remove newly found domains split
modified sequences sort by size
(i1)th iteration
DB
query
Courtesy of Daniel Kahn
42
ProDom - website
  • Domain analysis
  • Which proteins contain a given domain?
  • Which proteins share a homologous domain with a
    given protein?

Domain motif
Links to other representations of the family
Summary alignment
Parameters to define sub-family levels
Courtesy of Daniel Kahn
43
ProDom - website
  • Synthetic views of alignments and trees
  • Homology search with graphical output

Courtesy of Daniel Kahn
44
  • Links to SWISS-PROT, TREMBL, PROSITE PDB
  • Links to 3-D modelling with SWISS-MODEL

Courtesy of Daniel Kahn
45
Blocks
  • A collection of blocks (short ungapped
    alignments)
  • Each represent a conserved region of a protein
    family (not necessarily associated with a known
    function)
  • Blocks are generated from multiple alignments of
    groups of related proteins.
  • Groups are derived from protein families in
    Prosite, Smart, Pfam, ProDom
  • Used only the SWISSPROT entries to define the
    blocks (to avoid false positive sequences from
    TrEMBL)
  • Complementary to PRINTS
  • As of August 2001 contains 8656 blocks
    representing 2101 groups

46
Blocks (cont)
  • Detection
  • Identify short motifs using the algorithm by
    Smith 1990
  • Short patterns are then extended in both
    directions until the similarity score drops.
    Maximal length of a block is 60
  • Assembly
  • Create an ordered set of non overlapping blocks
    (path) that occur in a critical number of
    sequences
  • Scoring
  • Score individual blocks based on length,
    similarity, sequences, etc
  • Score a path based on scores of the constituent
    blocks and sequences

GAYKLDPAKSC GMFMVEPNTSC GKRFVRPFHNC GKSTVDPFHNC
DKGAYKLDPAKSCNL HSGMFMVEPNTSCHF GLGKRFVRPFHNCWM
MMGKSTVDPFHNCHD
DKGAYKLDPAKSCNL HSGMFMVEPNTSCHF GLGKRFVRPFHNCWM
MMGKSTVDPFHNCHD
KKVLDEAKSCN KLILEENTSCH SLFLREFHNCW AAVLDEFHNC
H SKLLEEFSECH
KHAYKLDAKSNLL VHMFMVDNTSHFL LHKRFVDFHSWMM
47
Pfam
  • A database of Hidden Markov Models for protein
    domains/families generated semi-automatically
  • Start from a seed multiple alignment (published
    or from other databases, such as Prosite and
    ProDom).
  • The alignment usually represent complete domains
    (confirmed with structural data when available).
  • After manual inspection a profile HMM is built
    from the alignment
  • The HMM is used to search the database. If a true
    member is missed, then it is added to the seed
    alignment and the process is repeated.
  • A full alignment is created for all members of
    the family. If it is not correct the process is
    repeated with another seed alignment or another
    alignment method
  • As of May 2002, contains 3849 families, domains,
    repeats and motifs (in addition Pfam B derived
    from ProDom)

48
Domo
  • Combines compositional and local similarity
    search followed by multiple sequence alignments
  • Start by detecting global similarities from the
    comparison of amino acid and dipeptide
    composition of each protein.
  • Group similar proteins into clusters
  • Pick one representative from each cluster and
    compile into a suffix tree
  • Compare the suffix tree with itself local
    similarities
  • Cluster local similarities and align (without
    gaps)
  • Analyze alignments to detect domain boundaries
    and split
  • Create multiple alignments
  • Iterative family extension using automatically
    generated regular expressions
  • Contains 8877 entries

49
InterPro
  • Combined resource of proteins domain and motifs
    from ProSite, ProDom, Prints, Smart and Pfam

50
Databases of protein clusters
  • Cogs http//www.ncbi.nlm.nih.gov/COG/
  • Systers http//systers.molgen.mpg.de/
  • PIR http//wwwnbrf.georgetown.edu/pirwww/pirhome.
    shtml
  • ProtoMap http//protomap.cornell.edu/
  • What are the differences?

51
General approach
  • Guide lines
  • Two proteins are homologous if they evolved from
    the same ancestor protein.
  • We can deduce homology based on statistically
    significant sequence similarities
  • Homology is a transitive relation

52
Transitive closure leads to false connections
  • Similarity is not transitive and it does not
    necessarily entail homologyMain traps
  • 1) Overestimates of the statistical significance
    of similarity scores.
  • Chance similarities
  • 2) Multi-domain proteins

53
Systers
  • Run all-against-all comparison (Smith-Waterman)
    of database sequences
  • Apply the single linkage clustering algorithm to
    construct hierarchy of all sequneces
  • Derive superfamilies from the structure of the
    hierarchy
  • Construct distance graph for each superfamily
  • Cut weak connections in each superfamily distance
    graph to derive family clusters
  • 290,811 non-redundant sequences (without
    inclusions and duplicates) from Swiss-Prot,
    TrEMBL, PIR, wormbase, FlyBase, etc. (as of
    August 2000)
  • cluster into 64,282 superfamilies,
  • split into 82,449 family clusters (including
    55,182 single sequence clusters of mostly
    fragmental sequences),
  • annotated with Pfam domains

Seed 1e-80 1e-72 1e-57 1e-42 1e-38 1e-31 1e-14 1e-
08 1e-01
Seed 1e-85 1e-76 1e-40 1e-36 1e-22 1e-02
54
Systers
Superfamilies as well as family clusters are
derivedfrom the structure generated by the data
itself? no need for a user defined static cutoff
Courtesy of Antje Krause
55
COGs
  • Based on comparison of 43 complete genomes (as of
    May 2002)
  • Apply single linkage clustering algorithm to get
  • COGs - clusters of orthologous proteins (genes
    in different species that evolved from the same
    ancestral protein) or orthologous sets of
    paralogs (genes from the same genome which are
    related by duplication).
  • Each COG consists of at least 3 species.
  • Start by forming minimal COGs (triangles)
  • Merge COGs that share a side edge
  • Refine split COGs that were incorrectly merged
    due to multi-domain proteins or other reasons
  • COGNITOR a tool to add new proteins to COGs on
    the basis of genome-specific best hits used to
    incorporate new genomes.
  • Total of 3307 COGs as of May 2002 (including
    74,059 of 104,101 proteins encoded in the
    analyzed genomes)

56
ProtoMap
A graph-based approach sequence space is
represented as a directed graph where vertices
protein sequences and edges represent
relationships between proteins. The width
(weight) of an edge reflects the significance.
1 Starting from a very high resolution
classification Only connections of very high
statistical significance (10-100) are considered.
The graph splits into connected
components 2 Lower the threshold (consider
weaker similarities). Merge together clusters
that are strongly connected obtain the
classification at the next level of confidence
Myoglobin
Hemoglobin alpha
Leghemoglobin
Hemoglobin beta
57
The clustering algorithm (cont.)
This process is repeated at thresholds 10-95
,10-90,10-85 ,10-80 ..... 10-5 ,10-0
(1) 365,174 proteins are classified into 18,140
clusters (29,000 singletons)
58
ProtoMap - website
Results of the search on keyword electron
transport.
59
Results of the search on sequence ID acha_human
60
Browsing a cluster (part1 list)
61
Browsing a cluster (part2 higher level
constituents)
62
Higher level constituents - cont.
63
Alternative representations
  • Is there a unique universal representation of
    proteins?

64
Alternative representations
  • Studies based on dipeptide composition - a 400
    dimensional vector
  • van Heel 1991
  • Compared sequences by means of the correlation of
    their representations (advantages..)
  • Linear projection and clustering (k-means) into
    fixed number of clusters
  • Ferran et al. 1994
  • Used a two-layer neural network to classify
    sequences based on their dipeptide composition
    into a fixed number of clusters (trained using
    the Kohonen learning algorithm)
  • Meaning of weight vectors..

65
Alternative representations
  • Combination of compositional properties and other
    physical/chemical properties (Hobohm Sander
    1995)
  • Composition of the 20 amino acids
  • Composition of different groups of amino acids
    (polar, charged, aromatic, bulky, etc)
  • Isoelectric point
  • Molecular weight
  • Average hydrophobicity
  • Length
  • Dipeptide composition (used reduced
    representation)
  • The combination is weighted so as to achieve
    maximal separation over the train set
  • Can induce alternative similarity measures
    between proteins and used to search for close
    relatives

66
Alternative representations
  • Casari, et al. (1995) A method to predict
    functional residues in proteins
  • Input a multiple-sequence alignment
  • each sequence is converted to a vector of size
    (20 L) where L is length of the alignment
  • Generation of of N x (20L) matrix
  • one sequence produces a vector of dimensions
    20L
  • N sequences to produce N vectors of dimension
    20L
  • Use Principal Component Analysis
  • calculate the covariance matrix- indicates how
    different dimensions are correlated to one
    another
  • largest eigenvalues and corresponding
    eigenvectors are the principal components
  • take the three largest (the largest of which
    represents consensus sequence)
  • project their 20L dimensional data onto those 3
    dimensions (minimize covariance by projection
    onto a new basis of the eigenvectors of the
    covariance matrix)
  • this can be used to predict a protein subfamily
    for a given protein and functional residues

67
Structural classifications
  • SCOP http//scop.mrc-lmb.cam.ac.uk/scop/
  • CATH http//www.biochem.ucl.ac.uk/bsm/cath_new/in
    dex.html
  • FSSP http//www.ebi.ac.uk/dali/fssp/fssp.html
  • Structure comparison algorithms
  • Dali
  • CE
  • Structal

68
Dali (Holm Sander JMB 1993)
  • Comparison of protein structures using 2D
    distance matrices to represent 3D structures

CYC b56
ROP
CYC b56
ROP
69
Dali (cont.)
  • Assignment of equivalent residue pairs

70
Dali (cont.)
  • The similarity score of two structures
  • i and j are labeled pairs of equivalent (matched)
    residues (i.e. i iA,iB).
  • f similarity measure based on Ca-Ca distances
    dAij and dBij
  • Largest S corresponds to optimal set of
    equivalencies.
  • Rigid similarity scores
  • q R zero level of similarity
  • Elastic similarity score
  • dij the average of dAij and dBij
  • q E tolerance of 20 deviation
  • w(r) envelope function exp(-r2/a2)

fR(i,j) q R dAij dBij
71
CE (Shindyalov Bourne, Protein Eng. 1998)
  • Protein Structure Alignment by Incremental
    Combinatorial Extension (CE) of the Optimal Path
  • Define alignment fragment pair (AFP) as a
    continuous segment of protein A aligned against a
    continuous segment of protein B (without gaps).
  • An alignment is a path of AFPs s.t. for every two
    consecutive AFPs there may be gaps inserted into
    either A or B, but not into both. That is, for
    every two consecutive AFPs i and i1
  • and
  • or and
  • or and
  • where piA is the starting position of AFP i in
    protein A

72
CE
  • What is a goodAFP?
  • Define the distance between two different AFPs i
    and j as
  • dA(p,q) represents the distance between the
    alpha carbon atoms at positions p and q in
    protein A.
  • If you already have n-1 AFPs and consider adding
    the n-th AFN, do so only if

Protein B
i j i j
Protein A
73
CE (cont.)
  • Select an initial AFP.
  • Build an alignment path by incrementally adding
    good AFPs that satisfy the conditions of paths
  • Repeat step (2) until the proteins are completely
    matched, or until no good AFPs remain.
  • To assess the significance of the alignment,
    compare it to the alignment of a random pairs of
    structures, and compute the Z-score based on the
    RMSD and number of gaps in the final alignment.

Protein B
Protein A
74
Structal (Levitt Gerstein, PNAS 1998)
  • An initial equivalence is chosen, based on
    matching the ends of the two structures.
  • Repeat until convergence
  • Superimpose the two structures so as to minimize
    the RMS, given the equivalence
  • Given the superposition, calculate the distances
    dij between any atom i in the first protein and
    any atom j in the second protein
  • Transform distances into similarities sij M/1
    (dij/d0)2 where M20 and d0 2.24A
  • Apply dynamic programming to define a new set of
    equivalences

75
Structal (cont)
2) Superimpose to minimize RMS
1) Alignment fixed
4) Use dynamic prog. to find the best set of
equivalences
3) Calculate distances between all atoms
5) Superimpose given the new alignment
6) Recalculate distances between all atoms
76
SCOP
  • Structural classification of proteins with 5
    level hierarchy
  • Domains the individual entries
  • Family homologous proteins with significant
    sequence similarity
  • Superfamily protein families that share weak
    sequence similarity but with conserved functional
    residues (e.g. in active sites) believed to be
    evolutionary related
  • Fold protein superfamilies that share he same
    fold (not necessarily due to common evolutionary
    ancestry)
  • Class all-alpha, all-beta, alpha/beta,
    alphabeta, membrane proteins, small proteins
  • The classification is based on manual analysis by
    experts (Dr. Alexy Murzin)
  • As of May 2002, 7 main classes, 686 folds, 1073
    superfamilies, 1827 families

77
CATH
  • Structural classification of proteins with 5
    level hierarchy
  • Protein chains the individual entries
  • Homologous superfamily proteins with highly
    similar structures and functions.
  • Topology clusters according to the topological
    connections and numbers of secondary structures.
  • Architecture describes the gross orientation of
    secondary structures, independent of
    connectivities (assigned manually).
  • Class derived from secondary structure content,
    is assigned for more than 90 of protein
    structures automatically.
  • The assignments of structures to topology
    families and homologous superfamilies are made by
    sequence and structure comparisons.
  • As of Jan 2002, 8 main classes, 46 architectures,
    1453 topologies, more than 2000 superfamilies.

78
FSSP
  • Structural classification of proteins into a tree
    hierarchy
  • Protein domains the individual entries (defined
    using the algorithm of Holm and Sander 1994)
  • Start with all-vs-all structure comparison of
    protein domains
  • Domains are clustered automatically into clusters
    using the single linkage algorithm based on the
    z-scores of the structure similarity scores
  • 3242 families of more than 30,000 structures as
    of June 2002

79
Combined sequence and structure studies
  • Han Baker 1996
  • Multiple alignments of protein families
  • Create profiles
  • Cluster profile columns to clusters
  • Each cluster (signature pattern) is mapped to a
    structural motif
  • Rigoutsos et al. 1999
  • Analysis of 17 complete genomes
  • Unsupervised pattern discovery (represented as
    extended regular expressions) 1D dictionary
  • Locate motifs in the PDB database and align the
    corresponding substructures
  • If the RMSD is below 2.5A enter to 3D dictionary
  • Elofsson Shonnhammer 1999
  • Mapping of Pfam domains to SCOP domains and vice
    versa.
  • Kasuya and Thoronton 1999
  • Salamov et al. 1999
  • Jonassen et al. 2000

80
BioSpace (Yona Levitt, ISMB 2000)
MCAVSST...
RASWIILTC...
DYEEPEERT...
81
Summary
  • Many different classification systems
  • Different source databases, different objects
  • Different representations
  • Different methodologies
  • Significant overlap, but differences are too
    fundamental to accommodate in a single
    classification system.
  • Is there a universal definition and
    characterization of domain families and protein
    families?

82
Acknowledgments
  • Terri Attwood (prints)
  • Jerome Gracy (domo)
  • Steven Henikoff (blocks)
  • Nicolas Hulo (prosite)
  • Daniel Kahn (prodom)
  • Eugene Koonin (cogs)
  • Antje Krause (systers)
  • Erik Sonnhammer (Pfam)
  • Chris Chau, Nick DeNunzio, Leonid Meyerguz
About PowerShow.com