Title: Good solutions are advantageous

1
Good solutions are advantageous
Christophe Roos - MediCel ltd christophe.roos_at_medicel.fi
Evolution changes sequences
Motifs, profiles, structures
Part 5: modular proteins
Similarity is a tool in understanding the
information in a sequence
2
Proteins share similar domains
By comparing several related sequences to each other, one can distinguish segments with a higher level of conservation.
These segments usually have a key role in the function of the protein.
BLAST identifies related sequences fast, but only roughly.
3
Refine the comparison

A multiple sequence alignment of the best-scoring sequences found by BLAST (or in some other way) is done with a more sensitive algorithm.
Example: the eyeless gene in the fruit fly is also found in several groups of species: birds, mammals, reptiles, fish, invertebrates. There it is called PAX6.

4
Visualise the relationship

Once a multiple sequence alignment is done, it can also be used for finding relationships (evolutionary distance).
The distance is calculated as the number of mutations needed to evolve from a putative ancestor to all the present-day sequences used.
Then a path including all sequences is computed.
Different criteria can be used (most parsimonious, maximum likelihood, etc.).
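The distance idea can be sketched in a few lines of Python. This is a minimal illustration, not the method used for the slides: instead of counting mutations from a putative ancestor, it uses the simpler pairwise proportion of differing aligned positions (the p-distance), which is a common starting point for tree building. The sequences and the names `p_distance` and `distance_matrix` are made up for the example.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of aligned, ungapped columns in which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must come from the same alignment")
    # Ignore columns where either sequence has a gap.
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_matrix(aligned):
    """Pairwise p-distances for a dict mapping name -> aligned sequence."""
    return {
        (m, n): p_distance(aligned[m], aligned[n])
        for m, n in combinations(aligned, 2)
    }


# Toy alignment (hypothetical sequences, already aligned).
toy = {
    "seq_a": "MKRPC-LV",
    "seq_b": "MKRPCALV",
    "seq_c": "MRRPCAIV",
}
dist = distance_matrix(toy)
for pair, d in dist.items():
    print(pair, round(d, 3))
```

A tree-building method (parsimony, likelihood, or a clustering method) would take over from a matrix like this one; that step is not shown here.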

5
Visualise the output of aligned domains
First all sequence pairs are aligned and scored,
then in a second round a multiple sequence
alignment is built up. In this case (PAX6
proteins from vertebrates and fruit fly), two
domains are more conserved than the rest of the
sequence. The most conserved areas have been
highlighted by the use of black or gray
background and white text. Only part of the
alignment is shown.
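How conserved columns could be picked out for highlighting can be sketched as follows. This is an assumption about one simple scoring rule (fraction of sequences sharing the most common residue in a column), not the rule used by the alignment viewer on the slide. The toy alignment and the function names are hypothetical.

```python
from collections import Counter


def column_conservation(aligned):
    """Per column: fraction of sequences carrying the most common residue.

    Gaps ('-') never count as the most common residue.
    """
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("all sequences must have the same aligned length")
    scores = []
    for col in zip(*aligned):
        residues = [c for c in col if c != "-"]
        top = Counter(residues).most_common(1)[0][1] if residues else 0
        scores.append(top / len(col))
    return scores


def highlight(aligned, threshold=1.0):
    """Mark columns at or above the threshold with '*', others with ' '."""
    return "".join(
        "*" if s >= threshold else " " for s in column_conservation(aligned)
    )


# Toy alignment (hypothetical sequences).
toy_alignment = ["MKRPC-LV", "MKRPCALV", "MRRPCAIV"]
marks = highlight(toy_alignment)
for row in toy_alignment:
    print(row)
print(marks)
```

Lowering `threshold` (for example to 0.6) would correspond to the gray, partially conserved columns in a shaded alignment.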
6
Profiles and motifs

A sequence motif is a locally conserved region of a sequence, or a short sequence pattern shared by a set of sequences.
The term motif refers to any sequence pattern that is predictive of a molecule's function, a structural feature, or a family membership.
Motifs can be detected in protein, DNA and RNA sequences, but the term most commonly refers to protein motifs.
Motifs can be represented for computational purposes as:
  Flexible patterns: [KR]-R-P-C-x(11)-C-V-S (qualitative, unweighted; see the Prosite database at www.expasy.org)
  Position-specific scoring matrices (PSSM, see next page)
  Profile hidden Markov models (HMM). These are a rigorous probabilistic formulation of a sequence profile. They contain the same probability information as PSSMs but can also account for gaps.
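A flexible pattern of this kind maps almost directly onto a regular expression. The sketch below converts a subset of the Prosite pattern syntax (single residues, `x`, `[..]`, `{..}`, repeat counts, and the `<` / `>` anchors) to a Python regex. The function name `prosite_to_regex` and the test sequence are hypothetical, and the pattern is written here with `[KR]` for "K or R" at the first position.

```python
import re

# One Prosite element: optional '<', a core, an optional (n) or (n,m), optional '>'.
ELEMENT = re.compile(
    r"(<?)(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?(>?)"
)


def prosite_to_regex(pattern):
    """Translate a (subset of) Prosite pattern syntax into a Python regex."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        start, core, lo, hi, end = m.groups()
        if core == "x":                 # x = any residue
            core = "."
        elif core.startswith("{"):      # {AB} = any residue except A or B
            core = "[^" + core[1:-1] + "]"
        repeat = ""
        if lo is not None:              # x(11) or x(2,4)
            repeat = "{" + lo + ("," + hi if hi else "") + "}"
        parts.append(("^" if start else "") + core + repeat + ("$" if end else ""))
    return "".join(parts)


PATTERN = "[KR]-R-P-C-x(11)-C-V-S"
regex = prosite_to_regex(PATTERN)

# Made-up sequence containing one occurrence of the pattern.
toy_seq = "MG" + "KRPC" + "A" * 11 + "CVS" + "TT"
hit = re.search(regex, toy_seq)
print(regex, hit.start(), hit.end())
```

Because such a pattern is unweighted, a sequence either matches or it does not; a near miss at one position is rejected outright, which is the main limitation the weighted representations address.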

7
Position specific scoring matrix

This corresponds to the flexible pattern of the paired box [KR]-R-P-C-x(11)-C-V-S

Pos   A   B   C   D   E   F   G   H   I   K   L   M   N   P   Q   R   S   T   V   W   X   Y   Z   *   -
  1 -22 -22 -35 -26 -15 -37 -30  -9 -38  35 -36 -23 -16 -34  -5  53 -23 -24 -35 -40 -19 -31  -9   0   0
  2 -51 -52 -62 -57 -46 -64 -59 -33 -66 -16 -63 -49 -44 -64 -34  70 -51 -53 -63 -64 -46 -57 -40   0   0
  3 -42 -58 -59 -55 -53 -68 -59 -54 -63 -51 -65 -57 -62  73 -54 -56 -50 -53 -59 -72 -51 -69 -54   0   0
  4 -42 -69  99 -75 -84 -49 -66 -72 -43 -76 -54 -53 -62 -79 -74 -75 -51 -48 -42 -65 -58 -59 -79   0   0
  5 -21 -38 -19 -41 -30 -29 -43 -36   6  32 -16 -13 -35 -44 -25 -15 -34 -22  47 -41 -18 -36 -27   0   0
  6 -21   6  -8 -12 -27  -7 -25 -13  26 -22  23   8  30 -39 -21 -23 -20 -13  10 -30  -9 -19 -24   0   0
  7 -31 -40 -21 -43 -34 -23 -48 -36  50  33  -9  -8 -37 -47 -27 -17 -39 -28   5 -46 -20 -33 -30   0   0
  8 -27 -36 -24 -38 -30 -12 -40 -30  -3  31  39   3 -32 -42 -20 -11 -35 -28 -10 -37 -16 -29 -24   0   0
  9  -5  11  -7  -8 -18 -24 -15 -11   2 -17 -17 -13  35 -32 -17 -20  20  -2  23 -33  -7 -26 -18   0   0
 10  24 -20   0 -22 -19 -21 -12 -20   5 -19 -12  -9 -16 -24 -19 -22  21   0  24 -29  -7 -25 -19   0   0
 11  21  11  -3  -6 -16 -28 -10  -9 -19 -13 -25 -17  33 -26 -13 -16   2  28 -10 -35  -8 -25 -14   0   0
 12  -3 -17  -4 -21 -21 -11 -19 -18  -1 -20  19   2 -12 -29 -19 -21  20  27  -3 -30  -6 -21 -20   0   0
 13 -18  16 -17  33  -6 -20 -26  52   2 -21 -17 -13  -5 -35 -12 -19 -21 -16  20 -30 -10  -8 -10   0   0
 14 -26 -41 -12 -45 -40  10 -43 -10  30 -33   5  45 -37 -44 -27 -31 -34 -21   7  -4 -17  45 -33   0   0
 15 -27  12 -22  33 -13  -8 -28 -21 -10 -27  -5  42 -15 -40 -20 -28 -28 -24 -14  73 -14  -5 -17   0   0
 16 -42 -69  99 -75 -84 -49 -66 -72 -43 -76 -54 -53 -62 -79 -74 -75 -51 -48 -42 -65 -58 -59 -79   0   0
 17 -40 -73 -33 -75 -63 -45 -72 -68  -6 -66 -29 -28 -71 -71 -65 -67 -59 -40  64 -57 -45 -56 -64   0   0
 18 -25 -40 -35 -44 -45 -59 -39 -45 -60 -47 -63 -56 -36 -55 -47 -48  61 -24 -52 -66 -39 -57 -46   0   0
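Using such a matrix to scan a sequence amounts to sliding a window along the sequence and summing one score per position. The sketch below does this with the scores of the first four matrix positions on this slide, restricted to the residues A, C, K, P and R to keep it short; that restriction, the function names, and the toy sequence are assumptions made for illustration only.

```python
# Scores copied from the first four positions of the matrix on this slide,
# for five residues only (a residue outside this set raises KeyError).
PSSM = [
    {"A": -22, "C": -35, "K": 35, "P": -34, "R": 53},
    {"A": -51, "C": -62, "K": -16, "P": -64, "R": 70},
    {"A": -42, "C": -59, "K": -51, "P": 73, "R": -56},
    {"A": -42, "C": 99, "K": -76, "P": -79, "R": -75},
]


def score_window(window, pssm):
    """Sum the position-specific score of each residue in the window."""
    return sum(column[residue] for column, residue in zip(pssm, window))


def best_hit(seq, pssm):
    """Return (start, score) of the highest-scoring window in the sequence."""
    width = len(pssm)
    if len(seq) < width:
        raise ValueError("sequence is shorter than the matrix")
    scored = [
        (score_window(seq[i:i + width], pssm), i)
        for i in range(len(seq) - width + 1)
    ]
    score, start = max(scored, key=lambda t: t[0])
    return start, score


# Made-up sequence with "KRPC" starting at index 2.
print(best_hit("AAKRPCAA", PSSM))
print(score_window("RRPC", PSSM))
```

Unlike the flexible pattern, the matrix ranks candidates: both `KRPC` and `RRPC` score highly here, while a window with a mismatch is penalised rather than rejected.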
8
Motifs and databases: mode of use

Motifs can be used to search sequence databases:
  take a family of related sequences
  align them and define motifs
  use the motifs to search a database of sequences to find novel family members
  motifs can also be generated from unaligned sequences (e.g. MEME, see next page)
Motif databases can be searched with sequences:
  take one sequence and ask what known motifs it contains
  deduce its function using knowledge about those motifs in other sequences
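The "align and define motifs" step can be sketched as turning an ungapped block of aligned family members into a scoring matrix. The version below uses log2-odds against a uniform background with a pseudocount of 1 per residue; both choices are assumptions for illustration and differ from the weighting schemes real motif databases use. The block of sequences and the name `build_pssm` are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def build_pssm(block, pseudocount=1.0):
    """Build log2-odds scores per column from an ungapped aligned block.

    Assumes a uniform background frequency (1/20) for every residue.
    """
    width = len(block[0])
    if any(len(s) != width for s in block):
        raise ValueError("all sequences in the block must have equal length")
    background = 1 / len(ALPHABET)
    pssm = []
    for col in zip(*block):
        counts = Counter(col)
        total = len(col) + pseudocount * len(ALPHABET)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in ALPHABET
        })
    return pssm


# Made-up block of three aligned family members.
block = ["KRPC", "RRPC", "KRPC"]
pssm = build_pssm(block)
for pos, column in enumerate(pssm, start=1):
    best = max(column, key=column.get)
    print(pos, best, round(column[best], 2))
```

The resulting matrix can then be slid along database sequences to look for new family members, which is the third step in the list above.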
Databases:
  Blocks, Fred Hutchinson Cancer Research Center (ungapped alignments)
  COG, clusters of orthologous groups, NCBI (21 complete genomes)
  Pfam, Sanger Center (gapped profiles, curated)
  Prints, Univ. Manchester (fingerprints, i.e. more than one pattern)
  Prosite, Univ. Geneva (consensus patterns, expert-curated)
  SMART, EMBL-Heidelberg
  InterPro, EBI (multiple, curated), includes Pfam, SMART, etc. (see 2 pages forward)

9
Motif discovery tools and PSSM creators

The MEME tool takes unaligned sequences as input and searches for patterns according to several parameters, such as:
  minimum and maximum length
  number of occurrences per sequence
  number of occurrences per set
MEME also generates a PSSM for each of the motifs found.
MAST is a tool for searching databases with PSSMs.
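To show what the parameters mean, here is a deliberately crude stand-in for motif discovery on unaligned sequences: it counts exact substrings of a given length range and keeps those present in enough sequences, counting each substring at most once per sequence. MEME itself fits a statistical model by expectation maximization and tolerates mismatches, so this sketch is not MEME's algorithm. The function name, parameter names, and sequences are hypothetical.

```python
from collections import Counter


def shared_kmers(seqs, min_w, max_w, min_sites):
    """Exact substrings of width min_w..max_w found in >= min_sites sequences.

    Each substring is counted at most once per sequence.
    Results are sorted longest first, then by support, then alphabetically.
    """
    hits = {}
    for w in range(min_w, max_w + 1):
        support = Counter()
        for s in seqs:
            # A set, so repeated occurrences within one sequence count once.
            support.update({s[i:i + w] for i in range(len(s) - w + 1)})
        for kmer, n in support.items():
            if n >= min_sites:
                hits[kmer] = n
    return sorted(hits.items(), key=lambda kv: (-len(kv[0]), -kv[1], kv[0]))


# Made-up unaligned sequences sharing a short segment.
seqs = ["AAKRPCGG", "TTKRPCLL", "MKRPCW"]
found = shared_kmers(seqs, min_w=3, max_w=5, min_sites=3)
print(found)
```

Here `min_w` and `max_w` play the role of the min-max length, and `min_sites` the number of occurrences required across the set.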

10
The InterPro database of motifs at EBI

The release of Nov 2001 was built from Pfam 6.6, PRINTS 31.0, PROSITE 16.37, ProDom 2001.2, SMART 3.1, TIGRFAMs 1.2, and the current SWISS-PROT + TrEMBL data.
This release of InterPro contains 4691 entries, representing 1068 domains, 3532 families, 74 repeats and 15 post-translational modification sites.

11
Scan the InterPro database - example

The InterPro database was scanned with the PAX6
sequence from the fruit fly.

12
Protein 3D structure

3D is better than linear strings of letters...
Protein folding is critical for function
Protein folding is ordered
Structures consist of folds
3D structure can be measured experimentally, but computational ab initio structure prediction is a tough task and nearly impossible above a certain protein size (CPU and rule limits)

13
Protein 3D structure building blocks