Biological Databases for Protein Sequence Analysis - PowerPoint PPT Presentation

Slides: 121
Provided by: attw

1
Biological Databases for Protein Sequence Analysis
  • Terri Attwood
  • School of Biological Sciences
  • University of Manchester, Oxford Road
  • Manchester M13 9PT, UK
  • http://www.bioinf.man.ac.uk/dbbrowser/

2
Overview
  • Introduction
  • Web practical, science fact & fiction
  • the Twilight Zone, the Midnight Zone
  • Biological databases
  • primary, secondary pattern, composite, etc.
  • Pattern recognition
  • regular expressions, fingerprints, profiles,
    etc.
  • Building a search protocol
  • combining results, estimating significance

3
The practical - BioActivity
  • BioActivity is intended to support the lectures
  • you begin with a DNA sequence fragment
  • try to find out what protein this codes for, the
    family to which it belongs, whether its function
    & structure are known, etc.
  • The practical is entirely Web-based
  • largely uses local servers, but also links to
    external sites
  • be patient & mindful of traffic - don't waste
    time on slow links
  • Most important of all
  • please read the instructions!
  • The Web is constantly evolving....
  • please report dead links (otherwise they'll stay
    dead)!

4
(No Transcript)
5
The stuff you have to know
  • Single- & three-letter amino acid codes
  • G Glycine (Gly); P Proline (Pro)
  • A Alanine (Ala); V Valine (Val)
  • L Leucine (Leu); I Isoleucine (Ile)
  • M Methionine (Met); C Cysteine (Cys)
  • F Phenylalanine (Phe); Y Tyrosine (Tyr)
  • W Tryptophan (Trp); H Histidine (His)
  • K Lysine (Lys); R Arginine (Arg)
  • Q Glutamine (Gln); N Asparagine (Asn)
  • E Glutamic Acid (Glu); D Aspartic Acid (Asp)
  • S Serine (Ser); T Threonine (Thr)
  • Additional codes
  • B Asn/Asp; Z Gln/Glu; X Any amino acid
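
The code table above maps naturally onto a lookup structure. A minimal Python sketch follows; the names ONE_TO_THREE, AMBIGUOUS and to_three_letter are illustrative choices, and the three-letter forms Asx, Glx and Xaa for the ambiguity codes are the conventional ones rather than something shown on the slide.

```python
# One-letter -> three-letter amino acid codes, as listed on the slide.
ONE_TO_THREE = {
    "G": "Gly", "P": "Pro", "A": "Ala", "V": "Val",
    "L": "Leu", "I": "Ile", "M": "Met", "C": "Cys",
    "F": "Phe", "Y": "Tyr", "W": "Trp", "H": "His",
    "K": "Lys", "R": "Arg", "Q": "Gln", "N": "Asn",
    "E": "Glu", "D": "Asp", "S": "Ser", "T": "Thr",
}

# Ambiguity codes: B = Asn/Asp, Z = Gln/Glu, X = any amino acid.
# Asx, Glx and Xaa are the conventional three-letter forms (an addition here).
AMBIGUOUS = {"B": "Asx", "Z": "Glx", "X": "Xaa"}


def to_three_letter(sequence):
    """Convert a one-letter protein sequence to hyphen-separated three-letter codes."""
    table = {**ONE_TO_THREE, **AMBIGUOUS}
    return "-".join(table[residue] for residue in sequence.upper())


print(to_three_letter("DAVID"))  # Asp-Ala-Val-Ile-Asp
```

An unknown letter raises a KeyError, which is usually what you want when checking hand-typed sequences.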

6
(No Transcript)
7
Basic definitions
  • Primary structure
  • the linear sequence of amino acids in a protein
  • Secondary structure
  • regions of local regularity
  • i.e., α-helices, β-strands, β-sheets & β-turns

8
Definitions contd.
  • Super-secondary structure
  • the packing of secondary structure elements into
    stable units
  • e.g., β-barrels, βαβ units, Greek keys, etc.

9
Definitions contd.
  • Tertiary structure
  • the overall chain fold that results from packing
    of secondary structure elements

10
Definitions contd.
  • Quaternary structure
  • the arrangement of separate chains within a
    protein that has more than one subunit
  • e.g., haemoglobin

11
Definitions contd.
  • Quinternary structure
  • the arrangement of separate molecules, such as in
    protein-protein or protein-nucleic acid
    interactions

12
Definitions contd.
  • Bioinformatics
  • broadly, Information Technology applied to
    biology
  • this can mean anything from AI & robotics to
    genome analysis!
  • boundaries with computational biology now
    blurred
  • originally coined in the '80s to mean
    bio-sequence analysis
  • with increasing availability of protein
    structures, the term now also encompasses
    structure analysis
  • but the scale of the problem here is vastly
    different.....

13
Importance of sequence analysis
  • 694,000 sequences available in public databases
  • millions more (including ESTs) in proprietary
    databases
  • these numbers will snowball with completion of more
    genomes
  • so what?
  • Locked up in sequences is a huge amount of
    structural, functional & evolutionary info
  • they're a highly valuable resource
  • By contrast, the number of unique protein structures
    is 2000
  • this represents a huge information deficit

14
Sequence-structure deficit
  • Figure: non-redundant growth of sequences during
    1988-1998 & the corresponding growth in
    the number of structures.

15
Challenges for bioinformatics
  • Spurred on by the sequence/structure deficit, the
    challenges are to
  • rationalise the mass of sequence data
  • derive more efficient means of data storage
  • design more incisive & reliable analysis tools
  • The imperative - to convert sequence information
    into biochemical & biophysical knowledge
  • to decipher the structural, functional &
    evolutionary clues encoded in the language of
    biological sequences

16
The Holy Grail of bioinformatics
  • ...to be able to understand the words in a
    sequence sentence that form a particular protein
    structure

17
The reality of sequence analysis
  • ...isn't so glamorous....but means we can
    recognise words that form characteristic
    patterns, even if we don't know the precise
    syntax to build complete protein sentences

18
Pattern recognition & prediction
  • In investigating the meaning of sequences, 2
    distinct analytical approaches have emerged
  • pattern recognition is used to detect similarity
    between sequences & hence to infer related
    structures & functions
  • ab initio prediction is used to deduce structure,
    & to infer function, directly from sequence
  • These methods are different & shouldn't be
    confused
  • Sequence- & structure-based pattern recognition
    methods demand that some characteristic has been
    seen before & housed in a db
  • Prediction methods remove the need for template
    dbs because deductions are made directly from
    sequence

19
Science fact & fiction
  • Sequence pattern recognition is easier to
    achieve, & is much more reliable, than fold
    recognition
  • which is 40-50% reliable even in expert hands
  • Prediction is still not possible
  • & is unlikely to be so for decades to come (if
    ever)
  • Structural genomics will yield representative
    structures for more proteins in future
  • structures of new sequences will be determined by
    modelling
  • prediction will become an academic exercise
  • But, to debunk a popular myth, knowing structure
    alone does not inherently tell us function

20
A reality check
  • What is the function of this structure?
  • What is the function of this sequence?
  • What is the function of this motif?
  • the fold provides a scaffold, which can be
    decorated in different ways by different
    sequences to confer different functions - knowing
    the fold & function allows us to rationalise how
    the structure effects its function at the
    molecular level

21
A test case for structural genomics
Structure-based assignment of the biochemical
function of hypothetical protein mj0577
(Zarembinski et al., PNAS 95 1998)
Although the structure co-crystallised with ATP,
the biochemical function of the protein is
unknown

22
The Twilight Zone
  • Prediction methods don't work because we don't
    fully understand the Folding Problem
  • we can't read the language sequences use to
    create their folds
  • But, with sequence analysis techniques, we can
    try to find similarities between new sequences
    & those in dbs
  • whose structures & functions we hope have been
    elucidated
  • This is straightforward at high levels of
    identity, but below 50% it is difficult to
    establish relationships reliably
  • Analyses can be pursued with decreasing certainty
    towards the Twilight Zone
  • 20% identity, where results may look plausible
    to the eye, but are no longer statistically
    significant

23
Application areas of analysis tools
  • The scale indicates % identity between aligned
    sequences
  • Alignment of 2 random seqs can produce 20%
    identity
  • less than 20% does not constitute a significant
    alignment
  • around this threshold is the Twilight Zone,
    where alignments may appear plausible to the eye,
    but can't be proved by conventional methods
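
The identity scale is easy to compute for a pair of already-aligned sequences. The sketch below is one possible convention (identical residues divided by the number of gap-free columns); other denominators are in use, so treat the function name percent_identity and this choice as assumptions. The 20% cut-off is the figure quoted on the slide, not a statistical test.

```python
TWILIGHT_THRESHOLD = 20.0  # % identity quoted on the slide


def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of gap-free alignment columns that hold identical residues."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # columns containing a gap are not counted
        compared += 1
        identical += a == b
    return 100.0 * identical / compared if compared else 0.0


def below_threshold(identity):
    """True when identity falls under the slide's 20% rule of thumb."""
    return identity < TWILIGHT_THRESHOLD


identity = percent_identity("GAVDFIALCDRYF", "GPIDFVCFCERFY")
print(round(identity, 1), below_threshold(identity))
```

A value above the threshold only says the alignment is worth a closer look; it says nothing by itself about biological significance.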

24
Homology & analogy
  • The term homology is confounded & abused in the
    literature!
  • sequences are homologous if they're related by
    divergence from a common ancestor
  • analogy relates to the acquisition of common
    features from unrelated ancestors via convergent
    evolution
  • e.g., β-barrels occur in soluble serine proteases
    & integral membrane porins; chymotrypsin &
    subtilisin share groups of catalytic residues,
    with near identical spatial geometries, but no
    other similarities
  • Homology is not a measure of similarity & is not
    quantifiable
  • it is an absolute statement that sequences have a
    divergent rather than a convergent relationship
  • the phrases "the level of homology is high" or
    "the sequences show 50% homology", or any like
    them, are strictly meaningless!
  • This is not just a semantic issue
  • loose use muddies thinking about evolutionary
    relationships

25
A terminology muddle
  • In comparing 3D structures, exactly the same
    arguments apply
  • structures may be similar, as denoted by RMS
    positional deviation between compared atomic
    positions
  • common evolutionary origin remains a hypothesis,
    until supported by other evidence
  • homology among similar structures is a
    hypothesis
  • This may be correct or mistaken, but their
    similarity is a fact, no matter how it is
    interpreted
  • Similarity of sequence or structure is just that
    - similarity
  • Homology connotes a common evolutionary origin
  • Reeck, G.R., de Haen, C., Teller, D.C.,
    Doolittle, R.F., Fitch, W.M., Dickerson, R.E.,
    Chambon, P., McLachlan, A.D., Margoliash, E.,
    Jukes, T.H. & Zuckerkandl, E. (1987) Homology
    in proteins and nucleic acids: a terminology
    muddle and a way out of it. Cell, 50, 667.

26
Orthology & paralogy
  • Among homologous sequences we can distinguish
  • orthologues - largely perform the same function
    in different species
  • paralogues - perform different but related
    functions in one organism
  • Studying orthologues opens the way to molecular
    palaeontology
  • e.g., using phylogenetic trees to show
    cross-species relationships
  • Paralogues shed light on underlying evolutionary
    mechanisms
  • paralogous proteins are thought to have arisen
    from single genes via successive duplication
    events
  • duplicated genes follow separate evolutionary
    pathways & new specificities evolve through
    variation & adaptation
  • Such complexity presents real challenges for
    sequence analysis

27
Challenges for sequence analysis
  • Much of the challenge is in getting the biology
    right
  • complicated by orthology vs paralogy
  • Following a db search, it may be unclear how much
    functional annotation can be legitimately
    inherited by a query
  • source of numerous annotation errors in dbs
  • propagation could lead to an error catastrophe
  • Further complications result from the modular
    nature of proteins
  • modules are autonomous folding units, used as
    protein building blocks - like Lego bricks, they
    can confer a variety of functions on the parent
    protein, either by multiple combinations of the
    same module, or via different modules to form
    mosaics
  • Automatic systems don't distinguish orthologues
    from paralogues & don't consider the modular
    nature of proteins

28
(No Transcript)
29
  • Monkeys are exploited in different Goldberg
    machines, where they perform different functions
    - here, we couldn't predict a monkey in that
    spot, even with total knowledge of the rest of
    the machine
  • Similarity searches are just like this
  • identifying the presence of a module tells little
    of the function of the complete system
  • knowing most components of a mosaic, we can't
    predict a missing one
  • modules (monkeys) in different proteins don't
    always perform exactly the same function

30
The Midnight Zone
  • Identifying evolutionary links between sequences
    is useful
  • this often implies a shared function
  • Arguably, prediction of function from sequence is
    of more immediate value than the prediction of
    structure
  • However, between distantly-related proteins,
    structure is more conserved than the underlying
    sequences
  • thus, some relationships are only apparent at the
    structural level
  • Such relationships can't be detected by even the
    most sensitive sequence comparison methods
  • the region of identity where sequence comparisons
    fail completely to detect structural similarity
    is the Midnight Zone - there is thus a
    theoretical limit to the effectiveness of
    sequence analysis methods

31
Ground rules for bioinformatics
  • Don't always believe what programs tell you
  • they're often misleading & sometimes wrong!
  • Don't always believe what databases tell you
  • they're often misleading & sometimes wrong!
  • Don't always believe what lecturers tell you
  • they're often misleading & sometimes wrong!
  • In short, don't be a naive user
  • when computers are applied to biology, it is
    vital to understand the difference between
    mathematical & biological significance
  • computers don't do biology
  • they do sums
  • quickly!

32
Significance
  • Appreciating that mathematical & biological
    significance are different is crucial
  • It is especially important in understanding the
    limitations of
  • database search algorithms
  • multiple sequence alignment algorithms
  • pattern recognition techniques
  • functional site & structure prediction tools
  • Contrary to popular opinion, there is currently
    still
  • no biologically-reliable automatic multiple
    alignment algorithm
  • no infallible pattern-recognition technique
  • no reliable gene, function or structure
    prediction algorithm

33
(No Transcript)
34
(No Transcript)
35
Biological Databases
  • Overview
  • Primary data sources
  • GenBank, SWISS-PROT & TrEMBL
  • Composite sequence databases
  • NRDB, OWL, SPTrEMBL
  • Secondary pattern databases
  • PROSITE, PRINTS, Profiles, Pfam, BLOCKS,
    IDENTIFY
  • Composite pattern databases
  • BLOCKS, InterPro

36
Primary sequence databases
  • In the '80s, when sequences started to
    accumulate, several labs saw advantages to
    establishing central repositories
  • trouble is, many labs thought the same & made
    their own
  • Nucleic acid: EMBL, GenBank, DDBJ
  • Protein: SWISS-PROT, PIR, MIPS, TrEMBL, NRL-3D
  • The proliferation of dbs causes problems
  • do they have the same format? Which is the most
    accurate? The most up-to-date? The most
    comprehensive? Which should we use?

37
Composite sequence databases
  • A solution to proliferating dbs is to compile a
    composite
  • these render searches very efficient, especially
    if non-redundant
  • Trouble is, there are now several composites,
    each with their own format & redundancy criteria
  • NRDB: PDB, SWISS-PROT, PIR, GenPept, GenPept updates
  • OWL: SWISS-PROT, PIR, GenBank, NRL-3D
  • SPTrEMBL: SWISS-PROT, TrEMBL
  • NRDB & SPTrEMBL are non-identical, not
    non-redundant
  • but which is best? Which the most comprehensive?
    The most up-to-date? Which should we use?
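
The difference between "non-identical" and "non-redundant" can be made concrete. The sketch below merges sources in priority order and drops only exact duplicate sequences, which is the weaker, non-identical criterion. The function name build_composite, the record layout and the toy identifiers are hypothetical; real composites apply their own, more elaborate rules.

```python
def build_composite(sources):
    """Merge (source_name, records) pairs in priority order.

    records is a list of (identifier, sequence) tuples. A sequence is kept
    only the first time it is seen, so exact duplicates from lower-priority
    sources are dropped. Near-identical variants are NOT removed.
    """
    seen = set()
    composite = []
    for source_name, records in sources:
        for identifier, sequence in records:
            key = sequence.upper()
            if key in seen:
                continue  # exact duplicate of a higher-priority entry
            seen.add(key)
            composite.append((source_name, identifier, sequence))
    return composite


# Toy data with made-up identifiers.
sources = [
    ("SWISS-PROT", [("sp1", "MKTAYIAK"), ("sp2", "GAVDFIALCDRYF")]),
    ("PIR", [("pir1", "MKTAYIAK"), ("pir2", "GAVDFIALCDRYY")]),
]

for entry in build_composite(sources):
    print(entry)
```

Here pir1 is dropped as an exact copy of sp1, but pir2 survives although it differs from sp2 by a single residue, which is why such a composite is non-identical rather than non-redundant.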

38
Secondary pattern databases
  • As well as 1° resources, there are also many 2°
    pattern dbs derived from them
  • trouble is, they use different 1° sources &
    different analysis methods, & all have different
    formats!
  • But it isn't all bad - SWISS-PROT is emerging as
    a standard, & most of the 2° dbs use it as their
    basis
  • 2° database: 1° source; stored information
  • PROSITE: SWISS-PROT; regular expressions
    (patterns)
  • PRINTS: SWISS-PROT/TrEMBL; aligned motifs
    (fingerprints)
  • Pfam: SWISS-PROT/TrEMBL; Hidden Markov Models
    (HMMs)
  • Profiles: SWISS-PROT; weight matrices (profiles)
  • BLOCKS: PRINTS/InterPro/Domo; weighted motifs
    (blocks)
  • IDENTIFY: PRINTS/InterPro; permissive regular
    expressions

39
Why create pattern databases?
  • Arise from the need to make more specific
    functional diagnoses than are possible by just
    searching the 1° dbs
  • They're built on the principle that homologous
    sequences may be gathered into alignments, within
    which are regions (motifs) that show little
    variation
  • these usually reflect vital structural or
    functional roles
  • Motifs are exploited in different ways to build
    diagnostic patterns for protein families
  • new sequences can be searched against dbs of such
    patterns to see if they can be assigned to known
    families
  • hence they offer a fast track to the inference of
    function

40
What's in a sequence?
41
  • Single motif methods: fuzzy regex (IDENTIFY),
    exact regex (PROSITE)
  • Multiple motif methods: identity matrices
    (PRINTS), weight matrices (BLOCKS)
  • Full domain alignment methods: profiles (PROFILE
    LIBRARY), HMMs (Pfam)
42
The challenge of family analysis
43
Know your family
44
The problem with domains
45
PROSITE
  • This was the first pattern database
  • protein families characterised by single motifs
  • Sequence information in motifs is reduced to
    consensus or regular expressions & the seed
    pattern is used to search SWISS-PROT
  • results are checked by hand to determine true
    & false matches
  • noisy patterns are revised to achieve optimal
    results
  • Some families can't be characterised by single
    motifs
  • here, additional patterns are created & refined
    until an optimal set of patterns is achieved that
    captures most or all of the family
  • results are then manually annotated for inclusion
    in the db

46
(No Transcript)
47
(No Transcript)
48
PRINTS
  • Most protein families are characterised by >1
    motif
  • it is sensible to use them all to build a
    diagnostic signature
  • This is the principle of fingerprints
  • these offer improved diagnostic reliability by
    virtue of the biological context provided by
    motif neighbours
  • Motifs are excised from alignments by hand &
    encoded as ungapped, unweighted local alignments
  • residue information is augmented via iterative
    searches
  • sequences matching all motifs that weren't in the
    original alignment are added to the motifs, & the
    db searched again
  • The process is repeated until convergence
  • results are manually annotated prior to inclusion
    in the db

49
(No Transcript)
50
SUMMARY INFORMATION
  37 codes involving 8 elements
   0 codes involving 7 elements
   0 codes involving 6 elements
   0 codes involving 5 elements
   0 codes involving 4 elements
   1 codes involving 3 elements
   0 codes involving 2 elements

COMPOSITE FINGERPRINT INDEX
  8 | 37 37 37 37 37 37 37 37
  7 |  0  0  0  0  0  0  0  0
  6 |  0  0  0  0  0  0  0  0
  5 |  0  0  0  0  0  0  0  0
  4 |  0  0  0  0  0  0  0  0
  3 |  1  0  0  0  1  1  0  0
  2 |  0  0  0  0  0  0  0  0
  --+------------------------
    |  1  2  3  4  5  6  7  8

True positives..
  PRIO_COLGU PRIO_MACFA PRIO_CEREL PRIO_ODOHE
  PRIO_GORGO PRIO_PANTR PRIO_HUMAN O46648
  PRIO_SHEEP PRIO_CALJA PRIO_BOVIN PRP2_BOVIN
  PRIO_ATEPA PRIO_SAISC PRIO_PREFR PRIO_PONPY
  O75942     PRIO_CAPHI PRIO_CEBAP PRIO_CAMDR
  PRIO_FELCA PRP1_TRAST PRIO_RABIT PRP2_TRAST
  PRIO_PIG   PRIO_CANFA PRIO_CRIGR PRIO_CRIMI
  Q15216     PRIO_RAT   PRIO_CERAE PRIO_MUSPF
  PRIO_MUSVI PRIO_MESAU PRIO_MOUSE O46593
  PRIO_TRIVU

Subfamily: Codes involving 3 elements
Subfamily True positives..
  PRIO_CHICK
51
(No Transcript)
52
Profiles & Pfam
  • An alternative to motif-based methods exploits
    regions between motifs, which contain valuable
    information
  • the full alignment effectively becomes the
    discriminator
  • A complex scoring scheme allowing for
    substitutions & INDELs is used to create
    family-specific profiles
  • These profiles can be used to detect distant
    relationships, where only a few residues are
    conserved
  • this is the basis of the Profile library
  • In an extension of this approach, alignments are
    encoded as probabilistic models termed HMMs
  • this is the basis of Pfam

53
BLOCKS & IDENTIFY
  • There are advantages to storing motifs in a raw
    form
  • no information is lost
  • different scoring schemes may be used to confer
    different diagnostic potentials on the same data
  • Additional pattern databases have arisen in this
    way
  • BLOCKS - processed PROSITE families automatically
    (BLOCKS includes many other sources)
  • BLOCKS-format PRINTS - PRINTS motifs with BLOCKS
    scoring
  • IDENTIFY - creates fuzzy expressions from PRINTS
    & InterPro
  • These databases are derived fully automatically,
    & hence offer
  • no family annotation (they link back to PRINTS
    & InterPro)
  • no further family coverage

54
Composite pattern databases
  • To simplify sequence analysis, the pattern
    databases are being integrated to create a
    unified protein family resource - InterPro
  • this is a central annotation resource (derived
    from PRINTS & PROSITE documentation), with
    pointers to its satellite databases
  • release 3.0 contains 3591 entries
  • current partners are PRINTS, PROSITE, Profiles,
    Pfam & ProDom
  • future partners will include SMART, TigrFam &
    hopefully others (BLOCKS, MetaFam, etc.)
  • lags behind its sources

55
(No Transcript)
56
(No Transcript)
57
Pattern Recognition
  • Overview
  • Pattern recognition methods
  • regular expressions, fingerprints, blocks,
    profiles & HMMs
  • Which method is best?

58
Pattern recognition methods
  • These methods classify proteins into families
  • the basis of the methods is multiple sequence
    alignment
  • They depend on developing a representation of
    conserved elements of alignments that may be
    diagnostic of structure or function, whether
    from
  • homologous sequence families
  • sequences that share some structural/functional
    domains

59
Determining significance of database matches
  • When searching a db, the challenge for analysis
    methods is to determine if matches are related
    (true-positive) or unrelated (true-negative)
  • At a given scoring threshold, it is likely that
    unrelated sequences will be matched erroneously
    (false-positives) & some correct matches will be
    missed (false-negatives)
  • The aim is to improve the resolution between the
    curves - in the overlap, it is difficult or
    impossible to establish if matches are
    significant
  • Different methods tackle this problem in
    different ways
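
The effect of moving the scoring threshold can be illustrated with a few made-up search hits. In this sketch the scores, the family labels and the function name classify_matches are all hypothetical; it simply counts the four outcomes named on the slide at a given threshold.

```python
def classify_matches(scored_hits, threshold):
    """Count true/false positives and negatives at one scoring threshold.

    scored_hits: iterable of (score, is_family_member) pairs.
    """
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for score, is_member in scored_hits:
        if score >= threshold:
            counts["TP" if is_member else "FP"] += 1
        else:
            counts["FN" if is_member else "TN"] += 1
    return counts


# Hypothetical search results: (score, known to belong to the family?)
hits = [(95, True), (80, True), (62, False), (60, True), (45, False), (30, False)]

for threshold in (61, 40):
    print(threshold, classify_matches(hits, threshold))
```

Lowering the threshold from 61 to 40 recovers the missed family member but admits a second unrelated sequence, which is the overlap between the curves in miniature.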

60
Regular expressions/patterns
  • These are derived from single conserved regions,
    which are reduced to consensus expressions for db
    searches
  • they are minimal expressions, so sequence
    information is lost
  • the more divergent the sequences used, the more
    fuzzy & poorly discriminating the pattern
    becomes
  • Alignment:
  • GAVDFIALCDRYF
  • GPIDFVCFCERFY
  • GRVEFLNRCDRYY
  • Pattern: G-X-[IV]-[DE]-F-[IVL]-X2-C-[DE]-R-[FY]2
  • Patterns do not tolerate similarity
  • sequences either match or not, regardless of how
    similar they are
  • matching is a binary on-off event & frequently
    misses true matches
  • single-motif methods are very hit-or-miss - how
    do you know if you've encoded the best region?
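
The on-off nature of pattern matching can be checked mechanically. The sketch below writes the residue alternatives at each alignment position in square brackets, converts that notation to a Python regular expression and tests it. The converter slide_pattern_to_regex is an illustrative helper for this slide's notation only (X for any residue, a trailing number for repeats); it is not a full PROSITE parser.

```python
import re

ALIGNMENT = ["GAVDFIALCDRYF", "GPIDFVCFCERFY", "GRVEFLNRCDRYY"]
PATTERN = "G-X-[IV]-[DE]-F-[IVL]-X2-C-[DE]-R-[FY]2"


def slide_pattern_to_regex(pattern):
    """Translate the slide's dash-separated notation into a compiled regex."""
    parts = []
    for element in pattern.split("-"):
        m = re.fullmatch(r"(X|[A-Z]|\[[A-Z]+\])(\d*)", element)
        if m is None:
            raise ValueError("unsupported pattern element: " + element)
        symbol, count = m.groups()
        if symbol == "X":
            symbol = "."  # X stands for any residue
        parts.append(symbol + ("{%s}" % count if count else ""))
    return re.compile("".join(parts))


regex = slide_pattern_to_regex(PATTERN)
print(regex.pattern)

for sequence in ALIGNMENT + ["GAVNFIALCDRYF"]:
    print(sequence, bool(regex.fullmatch(sequence)))
```

The last test sequence differs from the first by a single Asp-to-Asn change, yet it fails outright: the pattern has no notion of "nearly matching".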

61
In the beginning was PROSITE
  • G_PROTEIN_RECEPTOR PATTERN
  • PS00237
  • G-protein coupled receptor signature
  • [GSTALIVMYWC]-[GSTANCPDE]-{EDPKRH}-X(2)-[LIVMNQGA]-
    X(2)-[LIVMFT]-[GSTANC]-[LIVMFYWSTAC]-[DENH]-R
  • /TOTAL=919(919) /POS=869(869) /FALSE_POS=50(50)
    /FALSE_NEG=70 /PARTIAL=49 UNKNOWN=0(0)
  • This represents an apparent 18% error rate
  • the actual rate is probably higher
  • Thus, a match to a pattern is not necessarily
    true
  • a mis-match is not necessarily false!
  • False-negatives are a fundamental limitation to
    this type of pattern matching
  • if you don't know what you're looking for, you'll
    never know you missed it!
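
The slide does not say how the apparent error rate of about 18 (read here as a percentage) is obtained from the entry's statistics. One reading that reproduces it, offered purely as an assumption, counts false positives, false negatives and partial matches against the total number of hits. The precision and sensitivity lines are standard definitions added for comparison.

```python
# Figures from the PS00237 statistics line on the slide.
total_hits = 919
true_pos = 869
false_pos = 50
false_neg = 70
partial = 49

# Assumed formula: all problematic cases relative to the total hits.
apparent_error = (false_pos + false_neg + partial) / total_hits
print(f"apparent error rate: {apparent_error:.1%}")

# Standard measures, for comparison (not quoted on the slide).
precision = true_pos / (true_pos + false_pos)
sensitivity = true_pos / (true_pos + false_neg)
print(f"precision: {precision:.3f}, sensitivity: {sensitivity:.3f}")
```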

62
R-Y-x-[DT]-W-x-[LIVM]-[ST]-T-P-[LIVM](3)
63
(No Transcript)
64
Regular expressions/rules
  • Regular expression patterns are most effective
    when applied to highly-conserved, family-specific
    motifs
  • It is often possible to identify shorter generic
    patterns that are characteristic of common
    functional sites
  • Functional site: Rule
  • N-glycosylation: N-{P}-[ST]-{P}
  • Protein kinase C phosphorylation: [ST]-X-[RK]
  • Casein kinase II phosphorylation: [ST]-X2-[DE]
  • Such features result from convergence to a common
    property
  • glycosylation sites, phosphorylation sites, etc.
  • They cannot be used for family diagnosis & don't
    discriminate
  • they can only be used to suggest whether a
    certain functional site might exist (which must
    then be tested by experiment)
  • such patterns are termed rules
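
Rules of this kind are short enough to scan for directly. The sketch below writes the three rules as Python regular expressions, using the usual PROSITE reading in which N-glycosylation excludes Pro at the second and fourth positions. The test sequence and the names RULES and find_sites are made up for illustration. A lookahead is used so that overlapping sites are all reported.

```python
import re

# Functional-site rules written as Python regular expressions.
RULES = {
    "N-glycosylation": r"N[^P][ST][^P]",
    "Protein kinase C phosphorylation": r"[ST].[RK]",
    "Casein kinase II phosphorylation": r"[ST].{2}[DE]",
}


def find_sites(sequence):
    """Return (rule name, 1-based start, matched text) for every candidate site."""
    hits = []
    for name, regex in RULES.items():
        # The lookahead lets matches overlap instead of consuming the sequence.
        for m in re.finditer("(?=(%s))" % regex, sequence):
            hits.append((name, m.start() + 1, m.group(1)))
    return sorted(hits, key=lambda hit: hit[1])


for hit in find_sites("MNGTASLRSAAE"):
    print(hit)
```

Even this 12-residue toy sequence yields three candidate sites, which shows why such matches can only suggest a site and never diagnose a family.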

65
Diagnostic limitations of short motifs
  • Consider the sequence motif Asp-Ala-Val-Ile-Asp
    (DAVID)
  • results of db searching for such a sequence will
    differ, depending on whether we search for exact
    or permissive (fuzzy) matches
  • Pattern: number of matches in OWL29.6 (OWL31.1)
  • D-A-V-I-D: 71 (99)
  • D-A-V-I-[DEQN]: 252
  • [DEQN]-A-V-I-[DEQN]: 925
  • [DEQN]-A-[VLI]-I-[DEQN]: 2,739
  • [DEQN]-[AG]-[VLI]-[VLI]-[DEQN]: 51,506
  • D-A-V-E: 1,088 (1,493)
  • Use of fuzzy regular expressions has the
    potential advantage of being able to recognise
    more distant relationships
  • but the inherent disadvantage that more matches
    will be made by chance, making it difficult to
    separate out true matches from noise
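
Why fuzzier patterns pick up more chance matches can be seen from a back-of-envelope calculation. The sketch below assumes all 20 residues are equally likely and independent, which real databases are not, so it reproduces the trend rather than the counts in the table. The function name chance_probability is illustrative, and alternatives at a position are written in square brackets.

```python
def chance_probability(pattern, alphabet_size=20):
    """Probability that a random window matches, assuming uniform residue use."""
    probability = 1.0
    for element in pattern.split("-"):
        allowed = len(element.strip("[]"))  # number of residues accepted here
        probability *= allowed / alphabet_size
    return probability


patterns = [
    "D-A-V-I-D",
    "D-A-V-I-[DEQN]",
    "[DEQN]-A-V-I-[DEQN]",
    "[DEQN]-A-[VLI]-I-[DEQN]",
    "[DEQN]-[AG]-[VLI]-[VLI]-[DEQN]",
]

exact = chance_probability(patterns[0])
for pattern in patterns:
    ratio = chance_probability(pattern) / exact
    print(f"{pattern:32s} {ratio:6.0f}x the exact pattern")
```

Each added alternative multiplies the chance-match rate, so the most permissive pattern is expected to hit hundreds of times more often than the exact one before any biology is involved.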

66
Residue groups for fuzzy patterns
  • It is possible to assign residues to groups
    corresponding to various biochemical properties -
    e.g., charge & size
  • using such groups to create fuzzy expressions
    theoretically ensures that resulting motifs have
    sensible biochemical interpretations
  • small: Ala, Gly
  • small hydroxyl: Ser, Thr
  • basic: His, Lys, Arg
  • aromatic: Phe, Tyr, Trp
  • aliphatic: Val, Leu, Ile, Met
  • acidic/amide: Asp, Glu, Asn, Gln
  • small/polar: Ala, Gly, Ser, Thr, Pro
  • This is more flexible than exact regular
    expression matching
  • but the inherent permissiveness of the fuzzy
    approach brings an inevitable signal-to-noise
    trade-off
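
A fuzzy expression can be generated from an exact motif by swapping each residue for its group. This is a minimal sketch, not the procedure used by any particular database: the group table follows the slide, but the rule of taking the first group that contains a residue, and the name fuzzy_regex, are assumptions made here.

```python
import re

# Residue groups from the slide, as one-letter codes. Order matters:
# a residue is assigned to the first group that contains it.
RESIDUE_GROUPS = {
    "small": "AG",
    "small hydroxyl": "ST",
    "basic": "HKR",
    "aromatic": "FYW",
    "aliphatic": "VLIM",
    "acidic/amide": "DENQ",
    "small/polar": "AGSTP",
}


def fuzzy_regex(motif):
    """Replace every residue of an exact motif by a character class for its group."""
    parts = []
    for residue in motif.upper():
        for members in RESIDUE_GROUPS.values():
            if residue in members:
                parts.append("[" + members + "]")
                break
        else:
            parts.append(residue)  # no group listed (e.g. Cys): keep it exact
    return "".join(parts)


pattern = fuzzy_regex("DAVID")
print(pattern)
print(bool(re.fullmatch(pattern, "EGLME")))
```

The sequence EGLME shares no residue with DAVID yet matches the fuzzy pattern, which shows both the added reach and the added noise of the approach.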

67
Fingerprints
  • Fingerprints are groups of motifs excised from
    alignments & used for iterative db searching
  • no weighting scheme is used
  • searches depend only on residue frequencies
  • resulting scoring matrices are thus sparse
  • Each motif trawls the database independently
  • search results are correlated to determine which
    sequences match all the motifs & which match only
    partially
  • no information is thrown away
  • Iteration refines the fingerprint & increases its
    potency
  • fingerprints are diagnostically more powerful
    than regular expressions

68
TM domain
TM domain
69
loop region
70
A fingerprinting overview
71
  • (a)
    YVTVQHKKLRTPL
    YVTVQHKKLRTPL
    YVTVQHKKLRTPL
    AATMKFKKLRHPL
    AATMKFKKLRHPL
    YIFATTKSLRTPA
    VATLRYKKLRQPL
    YIFGGTKSLRTPA
    WVFSAAKSLRTPS
    WIFSTSKSLRTPS
    YLFSKTKSLQTPA
    YLFTKTKSLQTPA
  • (b)
      T  C  A  G  N  S  P  F  L  Y  H  Q  V  K  D  E  I  W  R  M  B  X  Z
      0  0  2  0  0  0  0  0  0  7  0  0  1  0  0  0  0  2  0  0  0  0  0
      0  0  3  0  0  0  0  0  2  0  0  0  4  0  0  0  3  0  0  0  0  0  0
      6  0  0  0  0  0  0  6  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
      1  0  1  1  0  3  0  0  1  0  0  0  3  0  0  0  0  0  0  2  0  0  0
      2  0  1  1  0  0  0  0  0  0  0  3  0  4  0  0  0  0  1  0  0  0  0
      4  0  1  0  0  1  0  2  0  1  3  0  0  0  0  0  0  0  0  0  0  0  0
      0  0  0  0  0  0  0  0  0  0  0  0  0 12  0  0  0  0  0  0  0  0  0
      0  0  0  0  0  6  0  0  0  0  0  0  0  6  0  0  0  0  0  0  0  0  0
      0  0  0  0  0  0  0  0 12  0  0  0  0  0  0  0  0  0  0  0  0  0  0
      0  0  0  0  0  0  0  0  0  0  0  2  0  0  0  0  0  0 10  0  0  0  0
      9  0  0  0  0  0  0  0  0  0  2  1  0  0  0  0  0  0  0  0  0  0  0
      0  0  0  0  0  0 12  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
      0  0  4  0  0  2  0  0  6  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  • (c) (first 3 positions only)
      T  C  A  G  N  S  P  F  L  Y  H  Q  V  K  D  E  I  W  R  M  B  X  Z
      0  0  4  0  0  0  0  8  4 34  0  0 15  0  0  0  1  7  0  0  0  0  0
      0  4 15  0  0  0  0  0  7  0  0  0 37  0  0  0 10  0  0  0  0  0  0
     50  0  0  0  0  3  0 18  0  0  0  0  0  0  0  0  0  0  0  2  0  0  0
  • Key
  • (a) motif, with 3 conserved positions
  • (b) corresponding frequency matrix
  • (c) same matrix, but after 3 iterations
  • (d) same matrix, with PAM250 weighting
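
The step from motif (a) to frequency matrix (b) is a plain residue count per position, and can be reproduced in a few lines. In the sketch below the function names are illustrative, and identity_score (summing the observed counts for a candidate segment) is a simplified stand-in for unweighted frequency-based scoring, not the exact scheme used by PRINTS.

```python
# Motif (a) from the slide: 12 aligned, ungapped segments of length 13.
MOTIF = [
    "YVTVQHKKLRTPL", "YVTVQHKKLRTPL", "YVTVQHKKLRTPL",
    "AATMKFKKLRHPL", "AATMKFKKLRHPL", "YIFATTKSLRTPA",
    "VATLRYKKLRQPL", "YIFGGTKSLRTPA", "WVFSAAKSLRTPS",
    "WIFSTSKSLRTPS", "YLFSKTKSLQTPA", "YLFTKTKSLQTPA",
]
COLUMNS = "TCAGNSPFLYHQVKDEIWRMBXZ"  # column order used on the slide


def frequency_matrix(sequences, columns=COLUMNS):
    """Count how often each residue occurs at each motif position."""
    matrix = [[0] * len(columns) for _ in range(len(sequences[0]))]
    for sequence in sequences:
        for position, residue in enumerate(sequence):
            matrix[position][columns.index(residue)] += 1
    return matrix


def identity_score(matrix, segment, columns=COLUMNS):
    """Sum the observed counts of the segment's residues (no weighting)."""
    return sum(row[columns.index(res)] for row, res in zip(matrix, segment))


matrix = frequency_matrix(MOTIF)
conserved = [i + 1 for i, row in enumerate(matrix) if max(row) == len(MOTIF)]
print("conserved positions:", conserved)
print("score of first segment:", identity_score(matrix, MOTIF[0]))
```

Most cells are zero, which is the sparseness noted for fingerprint matrices: a residue never seen at a position contributes nothing, however chemically similar it is to one that was.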

72
Fingerprint visualisation
  • Full potency of fingerprinting is gained from the
    mutual context provided by motif neighbours
  • Important, as it inherently implies a biological
    context to motifs matched in the correct order,
    with appropriate distances between them
  • results are thus biologically more meaningful
    than those from single motifs
  • Allows sequence identification even when parts of
    the fingerprint are absent
  • such matches are best visualised graphically

73
(No Transcript)
74
(No Transcript)
75
Blocks
  • Blocks are groups of motifs derived automatically
    from families identified in PRINTS & InterPro
  • sequences are aligned automatically & motifs are
    automatically identified by searching for spaced
    residue triplets (e.g., AxxxVxxC)
  • a block score is calculated using the BLOSUM62
    matrix
  • validity of blocks is confirmed with a 2nd
    motif-finding algorithm
  • blocks found by both methods are considered
    reliable
  • Sequences within motifs are clustered to reduce
    contributions to residue frequencies from sets of
    closely-related sequences
  • each cluster is treated as a single sequence &
    given a score that gives a measure of its
    relatedness
  • the higher the weight, the more dissimilar the
    segment from others in the block, the most
    distant segments being given a score of 100

76
(No Transcript)
77
(No Transcript)
78
Profiles
  • Profiles are scoring tables derived from full
    alignments
  • these define which residues are allowed at given
    positions
  • which positions are conserved & which degenerate
  • which positions, or regions, can tolerate
    insertions
  • the scoring system is intricate, & may include
    evolutionary weights, results from structural
    studies, & data implicit in the alignment
  • variable penalties are specified to weight
    against INDELs occurring in core 2° structure
    elements
  • Within a profile, the I & M fields contain
    position-specific scores for insert & match
    positions
  • in conserved regions, INDELs aren't totally
    forbidden, but are strongly impeded by large
    penalties defined in the DEFAULT field
  • these are superseded by more permissive values in
    gapped regions
  • the inherent complexity of profiles renders them
    highly potent discriminators, but they are
    time-consuming to derive

79
(No Transcript)
80
(No Transcript)
81
Hidden Markov Models
  • HMMs are similar in concept to profiles
  • they are probabilistic models consisting of
    inter-connecting states
  • essentially, linear chains of match, delete or
    insert states
  • Match states are assigned to conserved columns in
    an alignment
  • insert states allow for insertions relative to
    match states
  • delete states allow match positions to be
    skipped
  • thus, building an HMM requires each position in
    an alignment to be assigned to match, delete or
    insert states
  • HMMs usually perform well, but can be
    over-trained
  • they may also suffer if created from automatic
    iterative processes
  • if it once accepts a false match, an HMM becomes
    corrupt
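
Assigning alignment positions to states can be sketched with a simple rule. The 50% gap cut-off used below is a common heuristic and an assumption here, not something stated on the slide, and the toy alignment and function names are made up. Columns that are mostly residues become match columns; within each sequence, a gap in a match column is a delete and a residue in any other column is an insert.

```python
def assign_columns(alignment, gap="-", max_gap_fraction=0.5):
    """Label each alignment column 'M' (match) or 'I' (insert)."""
    labels = []
    for column in zip(*alignment):
        gap_fraction = column.count(gap) / len(alignment)
        labels.append("M" if gap_fraction < max_gap_fraction else "I")
    return "".join(labels)


def state_path(aligned_sequence, labels, gap="-"):
    """States visited by one aligned sequence: M (match), D (delete), I (insert)."""
    path = []
    for symbol, label in zip(aligned_sequence, labels):
        if label == "M":
            path.append("D" if symbol == gap else "M")
        elif symbol != gap:
            path.append("I")  # residue sitting in an insert column
    return "".join(path)


# Toy alignment (hypothetical sequences).
alignment = ["CL-YE", "CL-WD", "CLAWE", "C--WD"]
labels = assign_columns(alignment)
print(labels)
for row in alignment:
    print(row, state_path(row, labels))
```

Changing the cut-off changes which columns count as match states, and therefore the model itself, which is one reason fully automatic HMM construction needs checking.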

82
An HMM
C
L
Y
E
C
L
W
D
83
Which method is best?
  • The range of methods available leads to familiar
    problems
  • which should we use?
  • which is the most reliable?
  • which is the most comprehensive?
  • None of the pattern-recognition techniques is
    infallible
  • each has its optimum area of application
  • None of the resulting pattern databases is
    complete
  • none is the best
  • bearing in mind the diagnostic strengths &
    weaknesses of the different approaches, & keeping
    biological significance in mind, the best
    strategy is to use them all

84
Current status of pattern databases
  • PROSITE (SIB) - 1034 entries
  • single motifs (regexs) - best with small highly
    conserved sites
  • Profile library (ISREC) - 300 entries
  • weight matrices - good with divergent domains
    superfamilies
  • PRINTS (Manchester) - 1500 entries
  • multiple motifs (fingerprints) - best for
    families and sub-families
  • Pfam (Sanger Centre) - 2727 entries
  • HMMs - good with divergent domains and
    superfamilies
  • InterPro (EBI) - 3591 entries
  • derived from PRINTS, PROSITE, Profiles, Pfam,
    ProDom, etc.
  • BLOCKS (FHCRC) - 2433 entries
  • multiple motifs (derived from PRINTS, InterPro,
    Domo etc.)
  • IDENTIFY (Stanford)
  • permissive regexs (derived from PRINTS and InterPro)
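
To make the regex approach concrete, the sketch below translates a PROSITE-style pattern into a Python regular expression and scans a sequence with it. It is a simplified converter written for illustration: it handles only the common syntax elements (`x`, `[..]`, `{..}`, repeat counts and the `<` / `>` anchors), and the function name and test sequences are assumptions. The example pattern `N-{P}-[ST]-{P}` is the N-glycosylation site pattern.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    pattern = pattern.rstrip(".")
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Split off an optional repeat count such as (2) or (2,4).
        m = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = m.group(1), m.group(2)
        if core == "x":                  # any residue
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"   # any residue except these
        # A single residue or an [ABC] class passes through unchanged.
        parts.append(prefix + core + ("{" + repeat + "}" if repeat else "") + suffix)
    return "".join(parts)


glyco = prosite_to_regex("N-{P}-[ST]-{P}")   # "N[^P][ST][^P]"
hit = re.search(glyco, "AANGSAA")            # matches "NGSA" at position 2
miss = re.search(glyco, "AANPSAA")           # proline after N blocks the match
```

Because such a short pattern matches many unrelated sequences by chance, a hit of this kind is a hint to be checked, not a diagnosis.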

85
Tools for predicting protein function from
sequence
86
Building a search protocol
  • Overview
  • The usual starting point
  • searching the primary data sources
  • NRDB, SPTR, etc.
  • Pattern recognition methods
  • searching the secondary sources
  • patterns, profiles, blocks, fingerprints and HMMs
  • Estimating significance
  • when do we believe a result?

87
A practical approach
  • A central goal is to predict protein function
    from sequence
  • Given a newly-determined sequence, we want to
    know
  • what is my protein?
  • to what family does it belong?
  • what is its function?
  • how can we explain its function in structural
    terms?
  • By searching pattern dbs and fold libraries, we
    may recognise patterns that allow us to infer
    relationships with previously-characterised
    families and folds
  • Given the variety of dbs to search, how do we use
    them to build a sensible search protocol?

88
  • Protein sequence database identity search
  • e.g., for short fragments, pinpoints identical
    matches to probe - may identify correct reading
    frame
  • Protein sequence database similarity search
  • e.g., nrdb, OWL, SP+SP-TrEMBL - identifies
    homologues to probe
  • Protein pattern database search
  • e.g., PROSITE, profiles, PRINTS, BLOCKS, Pfam -
    identifies family relationships or pinpoints key
    structural or functional sites
  • Known structure: structure classification
    database query
  • e.g., scop, CATH, FSSP - provides details of
    structural class, secondary structure
    information, ligand-binding, etc.
  • No known structure: protein fold pattern library
    search
  • e.g., threading - identifies compatible folds
    for the probe sequence
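
The order of steps in this protocol, including the branch on whether a structure is known, can be written down as a small function. This is only a sketch of the control flow; the function name and step descriptions are illustrative, and no actual database is queried.

```python
from typing import List


def search_plan(structure_known: bool) -> List[str]:
    """Return the ordered search steps for a probe sequence."""
    steps = [
        "identity search of a protein sequence database",
        "similarity search of a protein sequence database",
        "pattern database search",
    ]
    if structure_known:
        # A structure exists: consult the classification resources.
        steps.append("structure classification database query")
    else:
        # No structure: look for compatible folds instead.
        steps.append("fold pattern library search")
    return steps


plan = search_plan(structure_known=False)
```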

89
Similarity searching
  • Whether or not an identity search finds a match,
    the next step is to look for similar sequences
  • e.g., you may wish to know if a wider family
    exists
  • The most rapid option is to use BLAST (Basic Local
    Alignment Search Tool), flavours of it, or
    FastA
  • In BLAST output, look for
  • high scores with low P-values (unlikely to be
    random)
  • clusters of high scores at the top of the hitlist
    (a family?)
  • trends in the type of sequences matched
  • To ensure a comprehensive search, identity and
    similarity searches are best performed on
    composite databases
  • e.g., NRDB, SP+SP-TrEMBL
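
Search output may quote either P-values or E-values, so it helps to know how the two relate. Assuming the number of chance hits follows a Poisson distribution (the usual assumption behind these statistics), the probability of at least one chance hit is P = 1 - exp(-E). The helper names below are illustrative.

```python
import math


def e_to_p(e_value: float) -> float:
    """Probability of at least one chance hit, given the expected number E."""
    return -math.expm1(-e_value)      # 1 - exp(-E), accurate for small E


def p_to_e(p_value: float) -> float:
    """Expected number of chance hits corresponding to probability P."""
    return -math.log1p(-p_value)      # -ln(1 - P)


small = e_to_p(0.001)   # very close to 0.001
large = e_to_p(10.0)    # very close to 1
```

For small values the two are nearly identical, which is why "low P-value" and "low E-value" are often used interchangeably; they diverge once E approaches 1, where P saturates.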

90
Ideal results show high scores and low E-values
91
Why bother with pattern searches?
  • Primary searches won't always allow outright
    diagnosis
  • BLAST and FASTA are not infallible
  • often can't assign mathematically significant
    scores
  • results may be complicated by modules, domains or
    compositionally-biased regions
  • annotations of retrieved hits may be incorrect
  • Pattern databases contain potent descriptors
  • so, distant relationships missed by BLAST may be
    captured by one or more of the family or
    functional site distillations

100
Structural and functional interpretation
  • Running db searches often does little more than
    identify a protein family
  • this only scratches the surface - we still want
    to know what our protein does and what it might
    look like
  • The first step is to examine the detailed family
    documentation in PROSITE, PRINTS and InterPro
  • these should help to elucidate the function of
    the protein
  • The next step is to examine the fold
    classification and structure summary resources
  • e.g., SCOP, CATH and PDBsum (assuming the
    structure is known)

106
Estimating significance
  • When do we believe a result?
  • A real example.....

119
Conclusions
  • Gene prediction and structure and function
    prediction are non-trivial
  • structure and function prediction tools are, at
    best, 70% accurate
  • What are the lessons for sequence analysis?
  • when searching for distant homologues, several
    dbs should be searched
  • different methods provide different perspectives
  • dbs aren't complete and their contents don't
    fully overlap
  • The more dbs searched, the more difficult it can
    be to interpret results
  • The more computers are involved in automating
    genome annotation, the greater the need for
    collaboration with biologists
  • The more data we have to handle, the more
    rigorous we must be in our thinking (and writing)
    if we are to make sense of the complexities
  • We are still a long way from having reliable
    tools for deducing protein function from
    sequence
  • but with the right approach, there is hope

120
The right approach risks being obscured by other
issues...