1
Matching Problems in Bioinformatics
  • Charles Yan
  • Fall 2008

2
Matching Problem
  • Given a string P (pattern) and a long string T
    (text), find all occurrences, if any, of P in T.
  • Example
  • T: Given a string P (pattern) and a long string T
    (text), find all occurrences, if any, of P in T.
  • P: any
  • Exact matching: does not allow any mismatch
  • Inexact matching: allows up to k mismatches
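
The two variants of the problem can be sketched in a few lines of Python. This is a naive illustration only, not the method used by any particular tool: the function names are made up for this example, and "inexact" is taken to mean at most k mismatching positions with no insertions or deletions.

```python
def exact_matches(pattern, text):
    """Return the 0-based start positions of every exact occurrence of pattern in text."""
    positions = []
    start = text.find(pattern)
    while start != -1:
        positions.append(start)
        # Advance by one so that overlapping occurrences are reported too.
        start = text.find(pattern, start + 1)
    return positions


def k_mismatch_matches(pattern, text, k):
    """Return start positions where pattern aligns to text with at most k mismatches."""
    m = len(pattern)
    positions = []
    for i in range(len(text) - m + 1):
        # Count the positions at which the window and the pattern differ.
        mismatches = sum(1 for a, b in zip(pattern, text[i:i + m]) if a != b)
        if mismatches <= k:
            positions.append(i)
    return positions


# The example from the slide: T is the problem statement itself, P is "any".
text = ("Given a string P (pattern) and a long string T (text), "
        "find all occurrences, if any, of P in T.")
print(exact_matches("any", text))
print(k_mismatch_matches("ACGT", "ACGAACCT", 1))
```

Both functions compare the pattern against every window of the text, so they take time proportional to len(P) x len(T) in the worst case. That is acceptable for short texts but not for genome-scale searches.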

3
Matching Problem
  • Unix grep
  • MS Word find
  • GenBank: http://www.ncbi.nlm.nih.gov/Genbank/
  • Human genome:
    http://www.ncbi.nlm.nih.gov/projects/mapview/map_search.cgi?taxid=9606
  • Given TTGTTCCGGTTAAAGATGGTGAAAATTTTT, does it
    appear in the human genome? Where?
  • How about
    ACCCCCAGGCGAGCATCTGACAGCCTGGAGCAGCACACACAACCCCAGGCGAG?

4
Motifs
  • A motif is a conserved element corresponding to a
    certain function (or structure). Occurrence of a
    motif in a protein is likely to indicate that the
    protein has the corresponding function.
  • Motifs are usually represented using an alignment
    or a regular expression.

5
Motifs
6
Motifs
  • Protein function prediction using motifs
  • Each protein function is characterized by a
    single motif or by multiple motifs.
  • If a protein contains the motif(s), it probably
    has the function that the motif(s) correspond
    to.
  • A pertinent analogy is the use of fingerprints by
    the police for identification purposes. A
    fingerprint is generally sufficient to identify a
    given individual. Similarly, motif(s) can be used
    to formulate hypotheses about the function of a
    newly discovered protein.

7
PROSITE
  • PROSITE (http://ca.expasy.org/prosite/) is a
    database of protein families and domains,
    started in 1988.
  • PROSITE currently contains patterns (motifs) and
    profiles specific for more than a thousand
    protein families or domains (Release 20.36 of
    22-Jul-2006 contains 1528 documentation
    entries).
  • Each of these signatures comes with documentation
    providing background information on the structure
    and function of these proteins.

8
PROSITE
9
PROSITE
10
PROSITE
11
PROSITE
12
PROSITE
  • Steps in the development of a new motif:
  • Select a set of sequences that belong to a
    functional family. Make a multiple alignment.
  • Find a short (not more than four or five residues
    long) conserved sequence (core motif) which is
    part of a region known to be important or which
    includes biologically significant residue(s).
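
The second step can be illustrated with a small sketch that looks for conserved columns in a multiple alignment. It is not the PROSITE curators' actual procedure, which also relies on biological knowledge of the region. The alignment below is invented toy data, and the function names and the min_fraction threshold are assumptions made for this example.

```python
def conserved_columns(alignment, min_fraction=1.0):
    """Return indices of columns whose most common residue is present in at
    least min_fraction of the sequences (gap characters '-' never count)."""
    n = len(alignment)
    columns = []
    for j in range(len(alignment[0])):
        residues = [seq[j] for seq in alignment if seq[j] != "-"]
        if not residues:
            continue
        top = max(residues.count(r) for r in set(residues))
        if top / n >= min_fraction:
            columns.append(j)
    return columns


def longest_conserved_run(columns):
    """Return (start, end) of the longest run of consecutive indices, end inclusive."""
    best = None
    run_start = prev = None
    for j in columns:
        if prev is None or j != prev + 1:
            run_start = j          # a new run begins here
        prev = j
        if best is None or j - run_start > best[1] - best[0]:
            best = (run_start, j)
    return best


# Hypothetical aligned sequences (equal length, '-' marks a gap).
alignment = [
    "MKACLVGDSG",
    "MRACIVGDSG",
    "-KACLVGESG",
    "MKSCLVGDSA",
]
strict = conserved_columns(alignment)           # fully conserved columns only
print(strict, longest_conserved_run(strict))
```

Lowering min_fraction (for example to 0.75) tolerates some variation and yields longer candidate regions, at the cost of a less specific core motif.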

13
PROSITE
  • Steps in the development of a new motif (cont.):
  • The most recent version of the Swiss-Prot
    knowledgebase is then scanned with the core
    motif(s). If a core motif detects all the
    proteins in the family and none (or very few) of
    the other proteins, we can stop at this stage.
  • In most cases we are not so lucky and we pick up
    a lot of extra sequences which clearly do not
    belong to the group of proteins under
    consideration. A further series of scans,
    involving a gradual increase in the size of the
    motif, is then necessary. In some cases we never
    manage to find a good motif.
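
The scan-and-extend loop described above can be mimicked on toy data. This is a sketch under stated assumptions: the sequences are invented, the candidate motifs are plain Python regular expressions rather than PROSITE patterns, and evaluate_motif is a hypothetical helper, not part of any PROSITE tool.

```python
import re


def evaluate_motif(regex, family, others):
    """Count how many family and non-family sequences a candidate motif hits."""
    pattern = re.compile(regex)
    true_pos = sum(1 for s in family if pattern.search(s))
    false_pos = sum(1 for s in others if pattern.search(s))
    return {
        "true_positives": true_pos,
        "false_negatives": len(family) - true_pos,
        "false_positives": false_pos,
    }


# Hypothetical family members and unrelated sequences.
family = ["MKTCGHCAA", "LLCGHCSTV", "QCGHCKKPW"]
others = ["AACGAAHCA", "PPCGWWWW", "TTTCGHAAA"]

# Grow the core motif step by step and watch the false positives drop.
for candidate in ["CG", "CGH", "CGHC"]:
    print(candidate, evaluate_motif(candidate, family, others))
```

On this toy data the shortest candidate hits every unrelated sequence, and each extension removes false positives while still detecting the whole family. That is the favorable case; as the slide notes, a good motif is not always found.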

14
PROSITE
  • Motifs are described using the following
    conventions:
  • The standard IUPAC one-letter codes for the amino
    acids are used.
  • The symbol 'x' is used for a position where any
    amino acid is accepted.
  • Ambiguities are indicated by listing the
    acceptable amino acids for a given position
    between square brackets '[ ]'. For example,
    [ALT] stands for Ala or Leu or Thr.
  • Ambiguities are also indicated by listing between
    a pair of curly brackets '{ }' the amino acids
    that are not accepted at a given position. For
    example, {AM} stands for any amino acid except
    Ala and Met.
  • Each element in a pattern is separated from its
    neighbor by a '-'.

15
PROSITE
  • Motifs are described using the following
    conventions (cont.):
  • Repetition of an element of the pattern can be
    indicated by following that element with a
    numerical value or a numerical range between
    parentheses. Examples: x(3) corresponds to x-x-x;
    x(2,4) corresponds to x-x or x-x-x or x-x-x-x.
  • When a pattern is restricted to either the N- or
    C-terminal of a sequence, that pattern either
    starts with a '<' symbol or, respectively, ends
    with a '>' symbol. In some rare cases (e.g.
    PS00267 or PS00539), '>' can also occur inside
    square brackets for the C-terminal element.
    'F-[GSTV]-P-R-L-[G>]' means that either
    'F-[GSTV]-P-R-L-G' or 'F-[GSTV]-P-R-L>' is
    considered.
  • A period ends the pattern.
  • Example:
  • [AC]-x-V-x(4)-{ED}. This pattern is translated
    as [Ala or Cys]-any-Val-any-any-any-any-{any but
    Glu or Asp}.
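
Because these conventions map almost one-to-one onto regular-expression syntax, motif scanning with a PROSITE pattern can be reduced to a regular-expression search. The converter below is a minimal sketch covering only the conventions listed on these two slides. The function name prosite_to_regex is an assumption for this example, and the code does not validate malformed patterns.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE pattern into a Python regular expression string."""
    pattern = pattern.strip().rstrip(".")            # a period ends the pattern
    parts = []
    for element in pattern.split("-"):               # elements are separated by '-'
        prefix = suffix = ""
        if element.startswith("<"):                  # N-terminal anchor
            prefix, element = "^", element[1:]
        if element.endswith(">"):                    # C-terminal anchor
            suffix, element = "$", element[:-1]
        # Split off an optional repetition such as (3) or (2,4).
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = m.groups()
        if core == "x":                              # any amino acid
            regex = "."
        elif core.startswith("[") and core.endswith("]"):
            residues = core[1:-1]
            if ">" in residues:                      # e.g. [G>]: residue or sequence end
                regex = "(?:[" + residues.replace(">", "") + "]|$)"
            else:
                regex = "[" + residues + "]"
        elif core.startswith("{") and core.endswith("}"):
            regex = "[^" + core[1:-1] + "]"          # any residue except those listed
        else:
            regex = core                             # a single one-letter code
        if low is not None:
            regex += "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + regex + suffix)
    return "".join(parts)


print(prosite_to_regex("[AC]-x-V-x(4)-{ED}."))       # [AC].V.{4}[^ED]

# Scan a made-up sequence for the example pattern.
hit = re.search(prosite_to_regex("[AC]-x-V-x(4)-{ED}."), "MKAGVLLLLKT")
print(hit.start(), hit.group())
```

Note that re.search reports only the first occurrence. To list every occurrence, including overlapping ones, the search has to be restarted one position after each hit.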

16
PROSITE
17
PROSITE
18
PROSITE
  • A profile or weight matrix is a table of
    position-specific amino acid weights and gap
    costs. These numbers (also referred to as scores)
    are used to calculate a similarity score for any
    alignment between a profile and a sequence, or
    parts of a profile and a sequence. An alignment
    with a similarity score higher than or equal to a
    given cut-off value constitutes a motif
    occurrence.
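
A stripped-down version of profile scoring is sketched below. It makes two simplifying assumptions that real PROSITE profiles do not: gap costs are ignored, so the profile is only slid along the sequence without insertions or deletions, and the weights are invented toy numbers. The names scan_profile and toy_profile, and the default weight for unlisted residues, are hypothetical.

```python
def scan_profile(profile, sequence, cutoff, default=-1):
    """Slide an ungapped profile along sequence and return (start, score)
    for every window whose score is greater than or equal to cutoff."""
    width = len(profile)
    hits = []
    for i in range(len(sequence) - width + 1):
        # Sum the position-specific weight of each residue in the window;
        # residues missing from a column get the default weight.
        score = sum(profile[j].get(sequence[i + j], default) for j in range(width))
        if score >= cutoff:
            hits.append((i, score))
    return hits


# Toy profile: one dict of residue weights per motif position.
toy_profile = [
    {"A": 3, "C": 2},
    {"G": 1},
    {"V": 4, "I": 2},
]

print(scan_profile(toy_profile, "MAGVKCGIL", cutoff=5))
```

Unlike a pattern, which either matches or does not, the profile ranks windows by score, so raising or lowering the cut-off trades sensitivity against specificity.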

19
PROSITE
20
Motifs and Matching
  • Motif Finding
  • Given a set of protein sequences, find the
    motif(s) that are shared by these proteins.
  • Motif Scanning
  • Given a motif and a protein sequence, find
    the occurrences (not necessarily identical) of
    the motif in the protein sequence.
  • --The Matching Problem!

21
From Single Motif to Multiple Motifs
  • A single motif is often not sufficient to predict
    a protein function. Multiple motifs have stronger
    predictive power.

22
Multiple Motifs
  • Protein function prediction using multiple motifs
  • Each protein function is characterized by a set
    of motifs (instead of a single one).
  • If a protein contains a set of motifs, it
    probably has the function that the set of motifs
    corresponds to.

23
PRINTS
  • PRINTS (http://umber.sbs.man.ac.uk/dbbrowser/PRINTS/)
    is a database of protein fingerprints.
  • A fingerprint is a group of conserved motifs used
    to characterize a protein family
  • ftp.bioinf.man.ac.uk/pub/prints
  • PRINTS is now maintained at the University of
    Manchester
  • PRINTS VERSION 38.1 (25 May, 2007)
  • 1904 FINGERPRINTS, encoding 11,451 single motifs

24
PRINTS
  • Two types of fingerprint are represented in the
    database: simple and composite, depending on
    their complexity. Simple fingerprints are
    essentially single motifs, while composite
    fingerprints encode multiple motifs.
    The bulk of the database entries are of the
    latter type because discrimination power is
    greater for multi-component searches.
  • Usually the motifs do not overlap, but are
    separated along a sequence, though they may be
    contiguous in 3D-space.
  • Fingerprints can encode protein folds and
    functionalities more flexibly and powerfully than
    can single motifs, full diagnostic potency
    deriving from the mutual context provided by
    motif neighbors.
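
The idea of a composite fingerprint can be sketched as a check that every motif occurs, one after another, without overlap. Several simplifications are assumed here that the slides do not state: motifs are treated as exact substrings (real fingerprint scanning scores inexact occurrences), they are required to appear in a fixed order, and the fingerprint and sequences are invented. match_fingerprint is a hypothetical name.

```python
def match_fingerprint(motifs, sequence):
    """Return the start position of each motif if all motifs occur in the
    given order without overlapping, otherwise return None."""
    positions = []
    cursor = 0
    for motif in motifs:
        # Taking the earliest occurrence leaves the most room for later motifs.
        start = sequence.find(motif, cursor)
        if start == -1:
            return None
        positions.append(start)
        cursor = start + len(motif)    # the next motif must begin after this one
    return positions


# Hypothetical three-motif fingerprint.
fingerprint = ["CGHC", "WDE", "KPL"]

print(match_fingerprint(fingerprint, "MMCGHCAAAWDETTTKPLQ"))
print(match_fingerprint(fingerprint, "MMCGHCAAAKPLQ"))   # middle motif missing
```

A sequence that contains only one of the motifs is rejected here, which illustrates why multi-component searches discriminate better than a single motif. A practical scanner would also report partial matches rather than a plain yes or no.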

25
PRINTS
26
PRINTS
27
PRINTS
  • a) General field

28
PRINTS
  • FPScan
  • Submit a protein sequence to find the closest
    matching PRINTS fingerprint(s).

29
PRINTS
30
PRINTS
31
PRINTS
32
PRINTS
33
Related Projects
  • InterPro - Integrated Resource of Protein
    Domains and Functional Sites
  • BLOCKS - BLOCKS db
  • Pfam - Protein families db (HMM derived) Mirror
    at St. Louis (USA)
  • PRINTS - Protein Motif fingerprint db
  • ProDom - Protein domain db (Automatically
    generated)
  • PROTOMAP - An automatic hierarchical
    classification of Swiss-Prot proteins
  • SBASE - SBASE domain db
  • SMART - Simple Modular Architecture Research Tool
  • TIGRFAMs - TIGR protein families db

34
Motifs and Matching
  • Motif Finding
  • Given a set of protein sequences, find the
    motif(s) that are shared by these proteins.
  • Motif Scanning
  • Given a motif and a protein sequence, find
    the occurrences (not necessarily identical) of
    the motif in the protein sequence.
  • --The Matching Problem!