Good solutions are advantageous

1
Good solutions are advantageous
Christophe Roos - MediCel ltd christophe.roos_at_medicel.fi
Evolution changes sequences
Motifs, profiles, structures
Part 5: modular proteins
Similarity is a tool in understanding the
information in a sequence
2
Proteins share similar domains
By comparing several related sequences to each
other, one can distinguish segments with a higher
level of conservation. Usually they have a key
role in the function of the protein. BLAST
identifies related sequences fast but only
roughly.
3
Refine the comparison
  • A multiple sequence alignment of the best-scoring
    sequences found by BLAST (or in some other way) is
    done with a more sensitive algorithm.
  • Example: the eyeless gene of the fruit fly is
    also found in several other species (birds,
    mammals, reptiles, fish, invertebrates). There it
    is called PAX6.

4
Visualise the relationship
  • Once a multiple sequence alignment is done, it
    can also be used for finding relationships
    (evolutionary distances).
  • The distance is calculated as the number of
    mutations needed to evolve from a putative
    ancestor to all the present-day sequences used.
    Then a path including all sequences is computed.
    Different criteria can be used (maximum parsimony,
    maximum likelihood, etc.).
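
As a rough illustration of turning an alignment into distances, the sketch below computes a simple pairwise p-distance (fraction of differing aligned positions). This is a simplification chosen for brevity: it does not reconstruct a putative ancestor and is not a parsimony or maximum-likelihood method. The function names and the toy alignment are hypothetical.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of differing positions between two aligned sequences.

    Columns where either sequence has a gap ('-') are skipped.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    differing = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue
        compared += 1
        if x != y:
            differing += 1
    return differing / compared if compared else 0.0


def distance_matrix(alignment):
    """Map each unordered pair of sequence names to its p-distance."""
    return {
        (m, n): p_distance(alignment[m], alignment[n])
        for m, n in combinations(alignment, 2)
    }


# Hypothetical toy alignment, not real PAX6 data.
toy_alignment = {
    "seq_a": "MKRPC-VS",
    "seq_b": "MRRPCAVS",
    "seq_c": "MKKPC-IS",
}

distances = distance_matrix(toy_alignment)
for pair, d in distances.items():
    print(pair, round(d, 3))
```

A matrix like this is the usual input to distance-based tree building; the character-based criteria named on the slide work on the alignment columns directly instead.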

5
Visualise the output of aligned domains
First all sequence pairs are aligned and scored,
then in a second round a multiple sequence
alignment is built up. In this case (PAX6
proteins from vertebrates and fruit fly), two
domains are more conserved than the rest of the
sequence. The most conserved areas have been
highlighted by the use of black or gray
background and white text. Only part of the
alignment is shown.
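
The highlighting of conserved areas described above can be approximated by scoring each alignment column. The sketch below is one simple way to do it, assuming conservation is measured as the share of sequences carrying the most common residue in a column; the actual tool behind the slide may use a different measure. Function names and the toy alignment are hypothetical.

```python
from collections import Counter


def column_conservation(alignment):
    """Per-column share of sequences that carry the most common residue.

    Gaps ('-') never count as the conserved residue, but they do count
    in the denominator, so gappy columns score lower.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("aligned sequences must have equal length")
    scores = []
    for column in zip(*alignment):
        residues = [r for r in column if r != "-"]
        if not residues:
            scores.append(0.0)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores


def highlight(alignment, threshold=1.0):
    """Marker line with '*' under columns scoring at or above threshold."""
    return "".join(
        "*" if score >= threshold else " "
        for score in column_conservation(alignment)
    )


# Hypothetical toy alignment, not real PAX6 data.
toy = ["MKRPCVS", "MRRPCVS", "MKKPCIS"]
for seq in toy:
    print(seq)
print(highlight(toy))
```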
6
Profiles and motifs
  • A sequence motif is a locally conserved region of
    a sequence or a short sequence pattern shared by
    a set of sequences.
  • The term motif refers to any sequence pattern
    that is predictive of a molecule's function, a
    structural feature, or a family membership.
  • Motifs can be detected in protein, DNA and RNA
    sequences, but the term most commonly refers to
    protein motifs.
  • Motifs can be represented for computational
    purposes as:
      • Flexible patterns, e.g. [KR]-R-P-C-x(11)-C-V-S
        (qualitative, unweighted; see the Prosite
        database at www.expasy.org)
      • Position-specific scoring matrices (PSSM, see
        next page)
      • Profile hidden Markov models (HMM). These are a
        rigorous probabilistic formulation of a sequence
        profile. They contain the same probability
        information as PSSMs but can also account for
        gaps.
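
To show how a flexible pattern can be applied to a sequence, the sketch below translates the paired-box pattern, written in standard PROSITE bracket syntax as [KR]-R-P-C-x(11)-C-V-S, into a Python regular expression. The function name and the test sequence are hypothetical, and only the basic PROSITE elements are handled.

```python
import re


def prosite_to_regex(pattern):
    """Translate a simple PROSITE pattern into a Python regex string.

    Handles single residues, 'x' wildcards, [..] allowed sets,
    {..} forbidden sets, (n) and (n,m) repeats, and < > anchors.
    """
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        # {ABC} means "any residue except A, B, C".
        element = element.replace("{", "[^").replace("}", "]")
        # (n) or (n,m) is a repeat count.
        element = element.replace("(", "{").replace(")", "}")
        # Lowercase x matches any residue.
        element = element.replace("x", ".")
        # < and > anchor the pattern to the sequence ends.
        element = element.replace("<", "^").replace(">", "$")
        parts.append(element)
    return "".join(parts)


regex = prosite_to_regex("[KR]-R-P-C-x(11)-C-V-S")
print(regex)

# Hypothetical sequence built to contain one match.
sequence = "AA" + "KRPC" + "A" * 11 + "CVS" + "AA"
match = re.search(regex, sequence)
if match:
    print(match.start(), match.end(), match.group())
```

A pattern of this kind either matches or does not; it carries no weights, which is the limitation that scoring matrices and profile HMMs address.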

7
Position specific scoring matrix
  • This matrix corresponds to the flexible pattern of
    the paired box, [KR]-R-P-C-x(11)-C-V-S. Each row is
    one position of the motif, each column one residue.

   A   B   C   D   E   F   G   H   I   K   L   M   N   P   Q   R   S   T   V   W   X   Y   Z   *   -
 -22 -22 -35 -26 -15 -37 -30  -9 -38  35 -36 -23 -16 -34  -5  53 -23 -24 -35 -40 -19 -31  -9   0   0
 -51 -52 -62 -57 -46 -64 -59 -33 -66 -16 -63 -49 -44 -64 -34  70 -51 -53 -63 -64 -46 -57 -40   0   0
 -42 -58 -59 -55 -53 -68 -59 -54 -63 -51 -65 -57 -62  73 -54 -56 -50 -53 -59 -72 -51 -69 -54   0   0
 -42 -69  99 -75 -84 -49 -66 -72 -43 -76 -54 -53 -62 -79 -74 -75 -51 -48 -42 -65 -58 -59 -79   0   0
 -21 -38 -19 -41 -30 -29 -43 -36   6  32 -16 -13 -35 -44 -25 -15 -34 -22  47 -41 -18 -36 -27   0   0
 -21   6  -8 -12 -27  -7 -25 -13  26 -22  23   8  30 -39 -21 -23 -20 -13  10 -30  -9 -19 -24   0   0
 -31 -40 -21 -43 -34 -23 -48 -36  50  33  -9  -8 -37 -47 -27 -17 -39 -28   5 -46 -20 -33 -30   0   0
 -27 -36 -24 -38 -30 -12 -40 -30  -3  31  39   3 -32 -42 -20 -11 -35 -28 -10 -37 -16 -29 -24   0   0
  -5  11  -7  -8 -18 -24 -15 -11   2 -17 -17 -13  35 -32 -17 -20  20  -2  23 -33  -7 -26 -18   0   0
  24 -20   0 -22 -19 -21 -12 -20   5 -19 -12  -9 -16 -24 -19 -22  21   0  24 -29  -7 -25 -19   0   0
  21  11  -3  -6 -16 -28 -10  -9 -19 -13 -25 -17  33 -26 -13 -16   2  28 -10 -35  -8 -25 -14   0   0
  -3 -17  -4 -21 -21 -11 -19 -18  -1 -20  19   2 -12 -29 -19 -21  20  27  -3 -30  -6 -21 -20   0   0
 -18  16 -17  33  -6 -20 -26  52   2 -21 -17 -13  -5 -35 -12 -19 -21 -16  20 -30 -10  -8 -10   0   0
 -26 -41 -12 -45 -40  10 -43 -10  30 -33   5  45 -37 -44 -27 -31 -34 -21   7  -4 -17  45 -33   0   0
 -27  12 -22  33 -13  -8 -28 -21 -10 -27  -5  42 -15 -40 -20 -28 -28 -24 -14  73 -14  -5 -17   0   0
 -42 -69  99 -75 -84 -49 -66 -72 -43 -76 -54 -53 -62 -79 -74 -75 -51 -48 -42 -65 -58 -59 -79   0   0
 -40 -73 -33 -75 -63 -45 -72 -68  -6 -66 -29 -28 -71 -71 -65 -67 -59 -40  64 -57 -45 -56 -64   0   0
 -25 -40 -35 -44 -45 -59 -39 -45 -60 -47 -63 -56 -36 -55 -47 -48  61 -24 -52 -66 -39 -57 -46   0   0
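
A matrix like this is typically used by sliding it along a sequence and summing, at each offset, the score of the residue found at every position. The sketch below shows that idea under simplifying assumptions: it uses only the first three positions of the matrix above and only five residues (A, G, K, P, R), and the function names and the test sequence are hypothetical.

```python
def score_window(pssm, window, default=0):
    """Sum the per-position scores of the residues in one window."""
    return sum(column.get(res, default) for column, res in zip(pssm, window))


def scan(pssm, sequence):
    """Score every window of the sequence that is as wide as the PSSM."""
    width = len(pssm)
    return [
        score_window(pssm, sequence[i:i + width])
        for i in range(len(sequence) - width + 1)
    ]


# First three matrix positions, restricted to five residues for brevity.
partial_pssm = [
    {"A": -22, "G": -30, "K": 35, "P": -34, "R": 53},
    {"A": -51, "G": -59, "K": -16, "P": -64, "R": 70},
    {"A": -42, "G": -59, "K": -51, "P": 73, "R": -56},
]

# Hypothetical sequence, chosen so that one window matches K-R-P.
sequence = "GAKRPG"
scores = scan(partial_pssm, sequence)
best_start = max(range(len(scores)), key=scores.__getitem__)
print(scores)
print(best_start, sequence[best_start:best_start + len(partial_pssm)])
```

Unlike a flexible pattern, the score is graded: a window with R at the first position scores higher than one with K, and both score far above any other residue.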
8
Motif and databases mode of use
  • Motifs can be used to search sequence databases:
      • take a family of related sequences
      • align them and define motifs
      • use the motifs to search a database of sequences
        to find novel family members
      • motifs can also be generated from unaligned
        sequences (e.g. MEME, see next page)
  • Motif databases can be searched with sequences:
      • take one sequence and ask what known motifs it
        contains
      • deduce its function using knowledge about those
        motifs in other sequences
  • Databases:
      • Blocks, Fred Hutchinson Cancer Research Center
        (ungapped alignments)
      • COG, clusters of orthologous groups, NCBI (21
        complete genomes)
      • Pfam, Sanger Center (gapped profiles, curated)
      • Prints, Univ. Manchester (fingerprints, i.e. more
        than one pattern)
      • Prosite, Univ. Geneva (consensus patterns,
        expert-curated)
      • SMART, EMBL-Heidelberg
      • InterPro, EBI (multiple, curated), includes Pfam,
        SMART, etc. (see two pages ahead)

9
Motif discovery tools and PSSM creators
  • The MEME tool takes unaligned sequences as input
    and searches for patterns according to several
    parameters such as:
      • Minimum and maximum motif length
      • Number of motifs per sequence
      • Number of motifs per set
  • MEME also generates PSSMs for the domains found.
  • MAST is a tool for searching databases with PSSMs.
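
The sketch below illustrates only the last step, turning a set of already aligned motif occurrences into a log-odds PSSM. It is not MEME's algorithm, which has to discover the occurrences in unaligned sequences first. The uniform background, the pseudocount of 1 and the toy sites are assumptions made for the example, and the function name is hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(sites, alphabet=AMINO_ACIDS, background=None, pseudocount=1.0):
    """Build a log2-odds PSSM from equal-length, ungapped motif sites.

    Each position becomes a dict mapping residue -> log2(observed
    frequency / background frequency), with pseudocounts added so that
    unseen residues get a finite negative score.
    """
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    if background is None:
        # Assumption: uniform background frequencies.
        background = {res: 1.0 / len(alphabet) for res in alphabet}
    total = len(sites) + pseudocount * len(alphabet)
    pssm = []
    for column in zip(*sites):
        counts = Counter(column)
        pssm.append({
            res: math.log2(((counts[res] + pseudocount) / total) / background[res])
            for res in alphabet
        })
    return pssm


# Hypothetical aligned occurrences of a three-residue motif.
sites = ["KRP", "RRP", "KRP", "RRP"]
pssm = build_pssm(sites)
for position, column in enumerate(pssm, start=1):
    best = max(column, key=column.get)
    print(position, best, round(column[best], 2))
```

Published matrices are usually scaled and rounded to integers, and use background frequencies estimated from a sequence database rather than a uniform distribution.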

10
The InterPro database of motifs at EBI
  • The Nov 2001 release was built from Pfam 6.6,
    PRINTS 31.0, PROSITE 16.37, ProDom 2001.2, SMART
    3.1, TIGRFAMs 1.2, and the current SWISS-PROT and
    TrEMBL data.
    This release of InterPro contains 4691 entries,
    representing 1068 domains, 3532 families, 74
    repeats and 15 post-translational modification
    sites.

11
Scan the InterPro database - example
  • The InterPro database was scanned with the PAX6
    sequence from the fruit fly.

12
Protein 3D structure
  • 3D is better than linear strings of letters...
  • Protein folding is critical for function
  • Protein folding is ordered
  • Structures consist of folds
  • 3D structure can be measured experimentally, but
    computational ab initio structure prediction is a
    tough task and nearly impossible above a certain
    protein size (limits of CPU power and of the known
    folding rules)

13
Protein 3D structure building blocks
  • Primary structure: the linear array of amino acids
  • Secondary structures:
      • Alpha helix
      • Beta-strand
  • Tertiary structures

Figure: DNA-binding protein (DNA helix: white;
helices: pink; sheets of beta-strands: ochre)