Title: Good solutions are advantageous
1. Good solutions are advantageous
Christophe Roos - MediCel ltd, christophe.roos_at_medicel.fi
Evolution changes sequences
Motifs, profiles, structures
Part 5: modular proteins
Similarity is a tool in understanding the
information in a sequence
2. Proteins share similar domains
By comparing several related sequences with each
other, one can distinguish segments with a higher
level of conservation. These segments usually have
a key role in the function of the protein. BLAST
identifies related sequences fast, but only
roughly.
3. Refine the comparison
- A multiple sequence alignment of the best-scoring
  sequences found by BLAST (or in some other way) is
  made with a more sensitive algorithm.
- Example: the eyeless gene of the fruit fly is
  also found in several other species: birds, mammals,
  reptiles, fish, invertebrates. There it is called
  PAX6.
4. Visualise the relationship
- Once a multiple sequence alignment is done, it
  can also be used to estimate relationships
  (evolutionary distances).
- The distance is calculated as the number of
  mutations needed to evolve from a putative
  ancestor to all the present-day sequences used.
  Then a path including all sequences is computed.
  Different criteria can be used (maximum parsimony,
  maximum likelihood, etc.).
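As a minimal sketch of the distance step, the code below computes the simplest possible measure, the fraction of aligned positions that differ (often called the p-distance), for every pair of sequences in an alignment. This is a deliberately simpler stand-in for the parsimony and likelihood methods named above, and the three aligned fragments are invented for illustration; they are not real PAX6 data.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Fraction of aligned, ungapped columns in which a and b differ."""
    if len(a) != len(b):
        raise ValueError("sequences must come from the same alignment")
    # Columns with a gap in either sequence are ignored.
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_matrix(alignment: dict[str, str]) -> dict[tuple[str, str], float]:
    """Pairwise p-distances, keyed by (name_1, name_2)."""
    return {
        (m, n): p_distance(alignment[m], alignment[n])
        for m, n in combinations(alignment, 2)
    }


# Hypothetical aligned fragments (assumption, for illustration only).
toy_alignment = {
    "seq_a": "MKRPCV-SKI",
    "seq_b": "MKRPCVISRI",
    "seq_c": "MRRPCA-SRL",
}
distances = distance_matrix(toy_alignment)
for pair, d in distances.items():
    print(pair, round(d, 3))
```

A tree-building method would take such a matrix as input; the smallest distance (here between seq_a and seq_b) marks the pair that would be joined first.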
5. Visualise the output of aligned domains
First all sequence pairs are aligned and scored,
then in a second round a multiple sequence
alignment is built up. In this case (PAX6
proteins from vertebrates and fruit fly), two
domains are more conserved than the rest of the
sequence. The most conserved areas have been
highlighted by the use of black or gray
background and white text. Only part of the
alignment is shown.
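The highlighting of conserved areas can be sketched as a per-column conservation score. The code below is an illustration under assumptions: the scoring rule (share of rows holding the most common residue), the 0.75 threshold, the marker characters and the toy alignment rows are all chosen for the example and are not taken from the tool that produced the figure.

```python
from collections import Counter


def column_conservation(rows: list[str]) -> list[float]:
    """For each column, the share of rows holding the most common residue.

    Gaps ('-') never count towards conservation.
    """
    scores = []
    for column in zip(*rows):
        residues = [c for c in column if c != "-"]
        top = Counter(residues).most_common(1)[0][1] if residues else 0
        scores.append(top / len(column))
    return scores


def conservation_line(rows: list[str], threshold: float = 0.75) -> str:
    """'*' for fully conserved columns, ':' for columns at or above threshold."""
    marks = []
    for score in column_conservation(rows):
        if score == 1.0:
            marks.append("*")
        elif score >= threshold:
            marks.append(":")
        else:
            marks.append(" ")
    return "".join(marks)


# Hypothetical aligned rows (assumption, for illustration only).
toy_rows = ["RRPCAVS", "KRPCGVS", "RRPC-VS", "RRPCAIS"]
line = conservation_line(toy_rows)
for row in toy_rows:
    print(row)
print(line)
```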
6. Profiles and motifs
- A sequence motif is a locally conserved region of
  a sequence, or a short sequence pattern shared by
  a set of sequences.
- The term motif refers to any sequence pattern
  that is predictive of a molecule's function, a
  structural feature, or family membership.
- Motifs can be detected in protein, DNA and RNA
  sequences, but the term most commonly refers to
  protein motifs.
- Motifs can be represented for computational
  purposes as
  - Flexible patterns: [KR]-R-P-C-x(11)-C-V-S
    (qualitative, unweighted; see the Prosite
    database at www.expasy.org)
  - Position-specific scoring matrices (PSSM, see
    next page)
  - Profile hidden Markov models (HMM). These are a
    rigorous probabilistic formulation of a sequence
    profile. They contain the same probability
    information as PSSMs but can also account for
    gaps.
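A flexible pattern of this kind maps directly onto a regular expression. The sketch below assumes that the paired-box pattern reads [KR]-R-P-C-x(11)-C-V-S in Prosite syntax (K or R, then R, P, C, any 11 residues, then C, V, S), and it handles only a subset of that syntax. The function name and the test sequence are made up for the example.

```python
import re

# One pattern element: a residue, x, [alternatives] or {exclusions},
# optionally followed by a repeat count (n) or (n,m).
ELEMENT = re.compile(
    r"([A-Z]|x|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?"
)


def prosite_to_regex(pattern: str) -> str:
    """Translate a subset of Prosite pattern syntax into a Python regex."""
    parts = []
    for element in pattern.strip(".").split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = match.groups()
        if core == "x":                # any residue
            core = "."
        elif core.startswith("{"):     # any residue except these
            core = "[^" + core[1:-1] + "]"
        if low:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)


PAIRED_BOX = "[KR]-R-P-C-x(11)-C-V-S"  # assumed reading of the slide's pattern
paired_regex = prosite_to_regex(PAIRED_BOX)

# Hypothetical sequence containing one instance of the pattern.
toy_sequence = "MGQ" + "KRPC" + "A" * 11 + "CVS" + "KL"
hit = re.search(paired_regex, toy_sequence)
print(paired_regex, hit.start(), hit.group())
```

Such a pattern either matches or does not; it carries no weights, which is the limitation the PSSM and HMM representations address.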
7. Position specific scoring matrix
- This corresponds to the flexible pattern of the
  paired box [KR]-R-P-C-x(11)-C-V-S

Pos   A   B   C   D   E   F   G   H   I   K   L   M   N   P   Q   R   S   T   V   W   X   Y   Z
  1 -22 -22 -35 -26 -15 -37 -30  -9 -38  35 -36 -23 -16 -34  -5  53 -23 -24 -35 -40 -19 -31  -9   0   0
  2 -51 -52 -62 -57 -46 -64 -59 -33 -66 -16 -63 -49 -44 -64 -34  70 -51 -53 -63 -64 -46 -57 -40   0   0
  3 -42 -58 -59 -55 -53 -68 -59 -54 -63 -51 -65 -57 -62  73 -54 -56 -50 -53 -59 -72 -51 -69 -54   0   0
  4 -42 -69  99 -75 -84 -49 -66 -72 -43 -76 -54 -53 -62 -79 -74 -75 -51 -48 -42 -65 -58 -59 -79   0   0
  5 -21 -38 -19 -41 -30 -29 -43 -36   6  32 -16 -13 -35 -44 -25 -15 -34 -22  47 -41 -18 -36 -27   0   0
  6 -21   6  -8 -12 -27  -7 -25 -13  26 -22  23   8  30 -39 -21 -23 -20 -13  10 -30  -9 -19 -24   0   0
  7 -31 -40 -21 -43 -34 -23 -48 -36  50  33  -9  -8 -37 -47 -27 -17 -39 -28   5 -46 -20 -33 -30   0   0
  8 -27 -36 -24 -38 -30 -12 -40 -30  -3  31  39   3 -32 -42 -20 -11 -35 -28 -10 -37 -16 -29 -24   0   0
  9  -5  11  -7  -8 -18 -24 -15 -11   2 -17 -17 -13  35 -32 -17 -20  20  -2  23 -33  -7 -26 -18   0   0
 10  24 -20   0 -22 -19 -21 -12 -20   5 -19 -12  -9 -16 -24 -19 -22  21   0  24 -29  -7 -25 -19   0   0
 11  21  11  -3  -6 -16 -28 -10  -9 -19 -13 -25 -17  33 -26 -13 -16   2  28 -10 -35  -8 -25 -14   0   0
 12  -3 -17  -4 -21 -21 -11 -19 -18  -1 -20  19   2 -12 -29 -19 -21  20  27  -3 -30  -6 -21 -20   0   0
 13 -18  16 -17  33  -6 -20 -26  52   2 -21 -17 -13  -5 -35 -12 -19 -21 -16  20 -30 -10  -8 -10   0   0
 14 -26 -41 -12 -45 -40  10 -43 -10  30 -33   5  45 -37 -44 -27 -31 -34 -21   7  -4 -17  45 -33   0   0
 15 -27  12 -22  33 -13  -8 -28 -21 -10 -27  -5  42 -15 -40 -20 -28 -28 -24 -14  73 -14  -5 -17   0   0
 16 -42 -69  99 -75 -84 -49 -66 -72 -43 -76 -54 -53 -62 -79 -74 -75 -51 -48 -42 -65 -58 -59 -79   0   0
 17 -40 -73 -33 -75 -63 -45 -72 -68  -6 -66 -29 -28 -71 -71 -65 -67 -59 -40  64 -57 -45 -56 -64   0   0
 18 -25 -40 -35 -44 -45 -59 -39 -45 -60 -47 -63 -56 -36 -55 -47 -48  61 -24 -52 -66 -39 -57 -46   0   0
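A PSSM is used by sliding it along a sequence and summing, for each window, the score of the residue found at each position. The sketch below does this with the first four rows of the matrix above (the [KR], R, P and C positions), read in the column order A B C ... Z of the header. The function names, the toy sequence and the choice to score unknown letters with the X column are assumptions made for the example.

```python
ALPHABET = "ABCDEFGHIKLMNPQRSTVWXYZ"  # column order of the matrix header
COLUMN = {letter: i for i, letter in enumerate(ALPHABET)}

# First four rows of the matrix above, one list per motif position.
PSSM_ROWS = [
    [-22, -22, -35, -26, -15, -37, -30, -9, -38, 35, -36, -23,
     -16, -34, -5, 53, -23, -24, -35, -40, -19, -31, -9],
    [-51, -52, -62, -57, -46, -64, -59, -33, -66, -16, -63, -49,
     -44, -64, -34, 70, -51, -53, -63, -64, -46, -57, -40],
    [-42, -58, -59, -55, -53, -68, -59, -54, -63, -51, -65, -57,
     -62, 73, -54, -56, -50, -53, -59, -72, -51, -69, -54],
    [-42, -69, 99, -75, -84, -49, -66, -72, -43, -76, -54, -53,
     -62, -79, -74, -75, -51, -48, -42, -65, -58, -59, -79],
]


def score_window(window: str) -> int:
    """Sum of the per-position scores for a window as long as the matrix."""
    if len(window) != len(PSSM_ROWS):
        raise ValueError("window length must equal the number of matrix rows")
    # Letters missing from ALPHABET are scored as X (assumption).
    return sum(
        row[COLUMN.get(residue, COLUMN["X"])]
        for row, residue in zip(PSSM_ROWS, window)
    )


def best_hit(sequence: str) -> tuple[int, int]:
    """(start index, score) of the highest-scoring window in the sequence."""
    width = len(PSSM_ROWS)
    scores = [
        (score_window(sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    ]
    score, start = max(scores)
    return start, score


# Hypothetical sequence (assumption, for illustration only).
print(best_hit("AAKRPCAA"))
print(score_window("RRPC"), score_window("KRPC"))
```

Unlike the flexible pattern, the matrix ranks the two allowed first residues: a window starting with R scores higher than one starting with K.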
8. Motifs and databases: mode of use
- Motifs can be used to search sequence databases
  - take a family of related sequences
  - align and define motifs
  - use the motifs to search a database of sequences
    to find novel family members
  - motifs can also be generated from unaligned
    sequences (e.g. MEME, see next page)
- Motif databases can be searched with sequences
  - take one sequence and ask what known motifs it
    contains
  - deduce its function using knowledge about those
    motifs in other sequences
- Databases
  - Blocks, Fred Hutchinson Cancer Research Center
    (ungapped alignments)
  - COG, clusters of orthologous groups, NCBI (21
    complete genomes)
  - Pfam, Sanger Center (gapped profiles, curated)
  - Prints, Univ. Manchester (fingerprints, i.e. more
    than one pattern)
  - Prosite, Univ. Geneva (consensus patterns,
    expert-curated)
  - SMART, EMBL-Heidelberg
  - InterPro, EBI (multiple, curated), includes Pfam,
    SMART, etc. (see two pages forward)
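The "align and define motifs" step can be sketched for the simplest case, an ungapped block of aligned sequences. The rule used here (one residue gives a fixed position, up to two residues give a bracketed alternative, anything more variable becomes x) is an assumption chosen for illustration, not the procedure of any of the databases listed, and the aligned rows are invented.

```python
def pattern_from_alignment(rows: list[str], max_alternatives: int = 2) -> str:
    """Derive a Prosite-style pattern from an ungapped alignment block."""
    if len({len(row) for row in rows}) != 1:
        raise ValueError("all rows must have the same length")

    # One pattern element per alignment column.
    elements = []
    for column in zip(*rows):
        residues = sorted(set(column))
        if len(residues) == 1:
            elements.append(residues[0])
        elif len(residues) <= max_alternatives:
            elements.append("[" + "".join(residues) + "]")
        else:
            elements.append("x")

    # Collapse runs of x into x(n).
    merged = []
    run = 0
    for element in elements + [None]:  # None flushes a trailing run
        if element == "x":
            run += 1
            continue
        if run:
            merged.append("x" if run == 1 else f"x({run})")
            run = 0
        if element is not None:
            merged.append(element)
    return "-".join(merged)


# Hypothetical ungapped block (assumption, for illustration only).
toy_block = ["KRPCAGCVS", "RRPCDECVS", "RRPCSTCVS"]
derived = pattern_from_alignment(toy_block)
print(derived)
```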
9. Motif discovery tools and PSSM creators
- The MEME tool takes unaligned sequences as input
  and searches for patterns according to several
  parameters, such as
  - Minimum and maximum length
  - Number per sequence
  - Number per set
- MEME also generates a PSSM for each motif found.
- MAST is a tool for searching databases with PSSMs.
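To show what a PSSM is made of, the sketch below builds one from a handful of already aligned motif instances, using log-odds scores with a pseudocount and a uniform background. This is not the MEME algorithm (MEME also has to find the motif instances in unaligned sequences), and the scaling differs from the integer matrix shown earlier; the sites, the pseudocount and the background are assumptions made for the example.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(sites: list[str], pseudocount: float = 1.0) -> list[dict[str, float]]:
    """Log-odds PSSM (log2) from equal-length motif instances.

    Assumes a uniform background frequency for the 20 amino acids.
    """
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    background = 1 / len(AMINO_ACIDS)
    pssm = []
    for column in zip(*sites):
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        row = {}
        for aa in AMINO_ACIDS:
            # The pseudocount keeps unseen residues from scoring minus infinity.
            frequency = (column.count(aa) + pseudocount) / total
            row[aa] = math.log2(frequency / background)
        pssm.append(row)
    return pssm


# Hypothetical motif instances (assumption, for illustration only).
toy_sites = ["KRPC", "RRPC", "RRPC", "RRPC"]
toy_pssm = build_pssm(toy_sites)
for position, row in enumerate(toy_pssm, start=1):
    best = max(row, key=row.get)
    print(position, best, round(row[best], 2))
```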
10. The InterPro database of motifs at EBI
- InterPro (Nov 2001) was built from Pfam 6.6,
  PRINTS 31.0, PROSITE 16.37, ProDom 2001.2, SMART
  3.1, TIGRFAMs 1.2, and the current SWISS-PROT and
  TrEMBL data. This release of InterPro contains
  4691 entries, representing 1068 domains, 3532
  families, 74 repeats and 15 post-translational
  modification sites.
11. Scan the InterPro database - example
- The InterPro database was scanned with the PAX6
sequence from the fruit fly.
12. Protein 3D structure
- 3D is better than linear strings of letters
- Protein folding is critical for function
- Protein folding is ordered
- Structures consist of folds
- 3D structure can be determined experimentally, but
  computational ab initio structure prediction is a
  tough task and nearly impossible above a certain
  protein size (limited by CPU power and by the
  folding rules known)
13. Protein 3D structure building blocks
- Primary structure: the linear array of amino acids
- Secondary structures
  - Alpha helix
  - Beta-strand
- Tertiary structures
[Figure: DNA-binding protein (DNA helix, white;
helices, pink; sheets of beta-strands, ochre)]