Title: Pfam: multiple sequence alignments and HMM-profiles of protein domains
1Pfam multiple sequence alignments and
HMM-profiles of protein domains
2Outline
- What is Pfam?
- What is a hidden Markov model (the methodology
underlying Pfam)?
- How to use Pfam, with sample output
3Pfam
- Pfam is a database of multiple alignments of
protein domains or conserved protein regions.
- The alignments represent evolutionarily
conserved structure that has implications for
the protein's function.
- Profile hidden Markov models (profile HMMs) built
from the Pfam alignments can be very useful for
automatically recognizing that a new protein
belongs to an existing protein family, even if
the homology is weak.
4Overview of Pfam Database
- Pfam A contains curated families, each with an
associated profile HMM that can be used for
alignment and database searching
- Annotation: contains several compulsory fields
- Seed alignment: a manually verified multiple
alignment of a representative set of sequences
- HMM profile: turns a multiple sequence
alignment into a position-specific scoring
system
- Full alignment: generated automatically from the
seed HMM profile by searching Swissprot for all
detectable members and aligning them to the HMM
profile
- Pfam B families are clustered automatically, allowing
Pfam to be comprehensive
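To make the idea of a "position-specific scoring system" concrete, here is a minimal Python sketch that turns a toy seed alignment into per-column log-odds scores. The alignment, the uniform background frequency, and the add-one pseudocount are all assumptions made for illustration; this is a simplified profile without insert or delete states, not the procedure Pfam or HMMER actually uses.

```python
import math
from collections import Counter

# Hypothetical toy seed alignment (not taken from Pfam); '-' marks a gap.
seed = [
    "ACDE",
    "ACDE",
    "AC-E",
    "GCDE",
]

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1.0 / len(ALPHABET)  # assumed uniform background frequency
PSEUDOCOUNT = 1.0                 # assumed add-one smoothing


def column_scores(alignment):
    """Return one dict per alignment column mapping residue -> log2-odds score."""
    scores = []
    for column in zip(*alignment):
        counts = Counter(r for r in column if r != "-")
        total = sum(counts.values()) + PSEUDOCOUNT * len(ALPHABET)
        scores.append({
            aa: math.log2(((counts[aa] + PSEUDOCOUNT) / total) / BACKGROUND)
            for aa in ALPHABET
        })
    return scores


def score_sequence(seq, scores):
    """Sum the per-position scores of an ungapped sequence of profile length."""
    return sum(col[aa] for aa, col in zip(seq, scores))


profile = column_scores(seed)
# A sequence resembling the seed should outscore an unrelated one.
print(score_sequence("ACDE", profile) > score_sequence("WWWW", profile))
```

The same residue gets a different score in each column, which is what lets a profile pick up weak homology that a single substitution matrix would miss.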
5Pfam Sequence Database Coverage
[Figure: Pfam coverage of the sequence database, by residue and by sequence]
Data shown is from Pfam v2.0 as of 1998, with 527
families. The current version, Pfam 12.0 (January
2004), contains alignments and models for 7316
protein families, based on the Swissprot 42.5 and
SP-TrEMBL 25.6 protein sequence databases.
6Markov Model
- Simplest example: each state emits (or,
equivalently, recognizes) a particular element
with probability 1.
Example sequences: 1234, 234, 14, 121214, 2123334
7Probabilistic Emission
- If we let the states define a set of emission
probabilities for elements, we can no longer be
sure which state we are in given a particular
element of a sequence: the same sequence BCCD can
come from more than one state path.
8Hidden Markov Models (HMM)
- Emission uncertainty means the sequence doesn't
identify a unique path. The states are
"hidden".
- The probability of a sequence is the sum over all
paths that can produce it:
p(bccd) = 0.5 × 0.2 × 0.1 × 0.3 × 0.75 × 0.6 × 0.8 × 0.9
        + 0.5 × 0.7 × 0.75 × 0.6 × 0.2 × 0.6 × 0.8 × 0.9
        = 0.000972 + 0.013608
        = 0.01458
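The sum over paths for p(bccd) can be checked numerically. The Python sketch below first multiplies out the two groups of factors given on the slide, then shows the same "sum over all paths" idea on a small two-state HMM, once by brute-force enumeration and once with the forward recursion. The two-state model (its states, start, transition, and emission probabilities) is entirely hypothetical, since the slide's state diagram is not available here.

```python
import math
from itertools import product

# Factors of the two paths, as listed in the slide's arithmetic for p(bccd).
path_1 = [0.5, 0.2, 0.1, 0.3, 0.75, 0.6, 0.8, 0.9]
path_2 = [0.5, 0.7, 0.75, 0.6, 0.2, 0.6, 0.8, 0.9]
p1, p2 = math.prod(path_1), math.prod(path_2)
p_bccd = p1 + p2  # sum over the two paths that can produce bccd

# Hypothetical two-state HMM; every number below is made up for illustration.
states = ("s1", "s2")
start = {"s1": 0.6, "s2": 0.4}
trans = {"s1": {"s1": 0.7, "s2": 0.3}, "s2": {"s1": 0.4, "s2": 0.6}}
emit = {
    "s1": {"b": 0.5, "c": 0.3, "d": 0.2},
    "s2": {"b": 0.1, "c": 0.6, "d": 0.3},
}


def path_probability(path, seq):
    """Joint probability of one state path and the observed sequence."""
    p = start[path[0]] * emit[path[0]][seq[0]]
    for prev, cur, sym in zip(path, path[1:], seq[1:]):
        p *= trans[prev][cur] * emit[cur][sym]
    return p


def brute_force(seq):
    """Sum the joint probability over every possible state path."""
    return sum(path_probability(path, seq)
               for path in product(states, repeat=len(seq)))


def forward(seq):
    """Same sum via the forward recursion, one sequence position at a time."""
    f = {s: start[s] * emit[s][seq[0]] for s in states}
    for sym in seq[1:]:
        f = {s: emit[s][sym] * sum(f[r] * trans[r][s] for r in states)
             for s in states}
    return sum(f.values())


print(round(p1, 6), round(p2, 6), round(p_bccd, 6))
print(math.isclose(brute_force("bccd"), forward("bccd")))
```

Enumerating paths costs time exponential in sequence length, while the forward recursion gives the same total in time linear in the length, which is why it is the usual way to score a sequence against an HMM.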
9HMMs for homology
- Homology model: ancestral residue (match)
states, insertion states, deletion states.
10Profile HMM
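As a rough illustration of how match, insertion, and deletion states fit together, the Python sketch below lists the allowed transitions of a profile HMM with a given number of match states. It assumes the common textbook layout (Begin treated as M0, End as the state after the last match column, insert states I0..IL, delete states D1..DL). The function name `profile_hmm_transitions` is hypothetical, and the architecture HMMER itself uses may differ in detail.

```python
def profile_hmm_transitions(length):
    """List (from_state, to_state) pairs for a profile HMM with `length` match states.

    Assumed textbook layout: Begin acts as M0, End as M(length + 1),
    insert states I0..I(length), delete states D1..D(length).
    """
    def name(kind, k):
        if kind == "M" and k == 0:
            return "Begin"
        if kind == "M" and k == length + 1:
            return "End"
        return f"{kind}{k}"

    edges = []
    for k in range(length + 1):
        # There is no delete state at position 0.
        sources = ["M", "I"] + (["D"] if k >= 1 else [])
        for src in sources:
            # Stay at the same position by emitting an inserted residue.
            edges.append((name(src, k), name("I", k)))
            # Advance to the next position by matching a residue (or reach End).
            edges.append((name(src, k), name("M", k + 1)))
            # Advance to the next position by skipping it; no delete after the last column.
            if k < length:
                edges.append((name(src, k), name("D", k + 1)))
    return edges


for edge in profile_hmm_transitions(2):
    print(edge)
```

Under this layout a model with L match states has 9L + 3 transitions, so the model grows linearly with the length of the alignment it was built from.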
11Searching Pfam
- Web sites: users can search
query protein sequences against one, a few, or
all Pfam HMMs.
- http://www.sanger.ac.uk/Pfam
- http://genome.wustl.edu/Pfam
- http://www.cgr.ki.se/Pfam
- Software: users can search locally with the Pfam
HMM profiles using the freely available
HMMER software package at
http://genome.wustl.edu/eddy/hmmer.html#hmmer
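A local search might be scripted along the following lines. This Python sketch assumes a HMMER 3 installation, whose `hmmscan` program searches a sequence file against an HMM database that has already been prepared with `hmmpress`. The slides predate HMMER 3, so the commands available at the time differed (HMMER 2 used `hmmpfam`). The file names `Pfam-A.hmm`, `query.fasta`, and `hits.tbl` are placeholders.

```python
import shutil
import subprocess


def hmmscan_command(hmm_db, query_fasta, table_out):
    """Build an argv list for hmmscan that writes a per-domain hit table."""
    return ["hmmscan", "--domtblout", table_out, hmm_db, query_fasta]


def run_hmmscan(hmm_db, query_fasta, table_out):
    """Run hmmscan if it is installed; return True when the search succeeded."""
    if shutil.which("hmmscan") is None:
        return False  # HMMER is not on PATH, so nothing is run
    result = subprocess.run(
        hmmscan_command(hmm_db, query_fasta, table_out),
        capture_output=True,
        text=True,
    )
    return result.returncode == 0


# Placeholder file names; the HMM database must have been pressed beforehand.
print(" ".join(hmmscan_command("Pfam-A.hmm", "query.fasta", "hits.tbl")))
```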
12Sample Pfam Query Results
Score   Query Start  Query End  Hmm Start  Hmm End  Pfam Family  Description
97.57   104          153        1          50       DAG_PE-bind  Phorbol esters/diacylglycerol binding domain
92.44   169          216        1          50       DAG_PE-bind  Phorbol esters/diacylglycerol binding domain
137.88  240          328        1          92       C2           C2 domain
276.16  413          674        1          247      pkinase      Eukaryotic protein kinase domain
84.44   675          741        1          69       pkinase_C    Protein kinase C terminal domain
70.99   807          857        17         69       pkinase_C    Protein kinase C terminal domain
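One way to read a result table like this is to sort the hits by their position on the query and list the families in order, which gives the domain architecture of the protein. The Python sketch below does that for the six rows shown, with descriptions omitted. Treating a hit whose HMM start is greater than 1 as a partial match to the model is an interpretive assumption, and the `Hit` record and `parse` helper are hypothetical names.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    score: float
    query_start: int
    query_end: int
    hmm_start: int
    hmm_end: int
    family: str


# The six rows of the sample result table, without the description column.
RAW = """\
97.57 104 153 1 50 DAG_PE-bind
92.44 169 216 1 50 DAG_PE-bind
137.88 240 328 1 92 C2
276.16 413 674 1 247 pkinase
84.44 675 741 1 69 pkinase_C
70.99 807 857 17 69 pkinase_C
"""


def parse(raw):
    """Parse whitespace-separated rows into Hit records."""
    hits = []
    for line in raw.splitlines():
        score, qs, qe, hs, he, family = line.split()
        hits.append(Hit(float(score), int(qs), int(qe), int(hs), int(he), family))
    return hits


hits = sorted(parse(RAW), key=lambda h: h.query_start)
architecture = [h.family for h in hits]
# Assumption: a hit that does not start at HMM position 1 covers only part of the model.
partial = [h for h in hits if h.hmm_start > 1]

print(" - ".join(architecture))
print([(h.family, h.query_start, h.query_end) for h in partial])
```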
13Acknowledgements
- Some slides adapted from lectures by Larry Hunter
at University of Colorado Health Sciences Center
- Altmann Lab for critical comments