Introduction to Bioinformatics 20120 - PowerPoint PPT Presentation

Slides: 27
Provided by: gruye

1
Introduction to Bioinformatics 20120
  • Gianluca Pollastri
  • Office: CS A1.07
  • Email: gianluca.pollastri_at_ucd.ie

2
Credits
  • Richard Lathrop and Pierre Baldi's Bioinformatics
    courses at the University of California, Irvine.

3
Credits (2)
  • PSI-BLAST tutorial on the NCBI web site

4
Course overview
  • Context: DNA, RNA, proteins.
  • Resources: GenBank, PDB, etc.
  • Algorithms for sequence comparison.
  • Phylogenetics.
  • Structural bioinformatics: protein structure
    prediction.

5
Lecture notes
  • http://gruyere.ucd.ie/2007_courses/20120/
  • confidential..

6
Recommended/useful readings
  • No book is actually required
  • Introduction to Bioinformatics (Lesk)
  • Introduction to Computational Molecular Biology
    (Setubal, Meidanis)
  • Bioinformatics: the Machine Learning Approach
    (Baldi, Brunak)

7
BLAST
  • Basic Local Alignment Search Tool
  • Popular package for comparing one sequence
    against a whole database of sequences.
  • Approximates Smith-Waterman (local alignment).
  • But while SW would take ages in many practical
    cases (time proportional to sequence_length^2 x
    number_of_sequences), BLAST is very fast
    (sequence_length x number_of_sequences).

8
PSI-BLAST
  • Position-Specific Iterated BLAST.
  • The substitution matrix is not fixed anymore. A
    position specific scoring matrix (PSSM) is
    constructed automatically from a multiple
    alignment of the highest scoring hits in an
    initial BLAST search
  • This process can be iterated any number of times

9
PSI-BLAST (2)
  • The PSSM is generated by calculating
    position-specific scores for each position in the
    alignment.
  • Highly conserved positions receive high scores
    and weakly conserved positions receive scores
    near zero (similar to building a standard
    matrix, except that here the score depends on
    the position).
  • The PSSM is used to perform a second (etc.) BLAST
    search, and the results of each iteration are
    used to refine the PSSM.
  • This iterative searching strategy results in
    increased sensitivity, because the
    penalties/rewards for substitutions are specific
    to a set, or family, of proteins.
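
The idea of position-specific scores can be sketched as a log-odds calculation per alignment column. This is a simplified illustration, not PSI-BLAST's actual procedure: PSI-BLAST also weights sequences and uses more elaborate pseudocounts. The function name, the uniform pseudocount, the factor of 2 (half-bit units), and the toy four-letter alphabet are all assumptions.

```python
import math
from collections import Counter


def build_pssm(aligned, background, pseudocount=1.0):
    """Return one {residue: score} dict per alignment column.

    score = round(2 * log2(observed frequency / background frequency)),
    with a flat pseudocount added to every residue (an assumption).
    """
    alphabet = sorted(background)
    pssm = []
    for column in zip(*aligned):
        # Count residues only; gap characters are ignored.
        counts = Counter(c for c in column if c in background)
        total = sum(counts.values()) + pseudocount * len(alphabet)
        row = {}
        for residue in alphabet:
            freq = (counts[residue] + pseudocount) / total
            row[residue] = round(2 * math.log2(freq / background[residue]))
        pssm.append(row)
    return pssm


if __name__ == "__main__":
    # Toy nucleotide-like alphabet with a uniform background.
    background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    aligned = ["ACA", "ACT", "ACG", "ACC"]
    for position, row in enumerate(build_pssm(aligned, background), start=1):
        print(position, row)
```

In the toy alignment the first two columns are fully conserved, so the conserved residue gets a positive score and the others a negative one, while the third column is uniformly mixed and scores zero for every residue.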

10
  • Last position-specific scoring matrix computed,
    weighted observed percentages rounded down,
    information per position, and relative weight of
    gapless real matches to pseudocounts
  • Position-specific scores (one row per position
    and residue, one column per amino acid):
    Pos     A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
      1 M  -1  -2  -3  -4  -2  -1  -3  -3  -2   1   1  -2   7  -1  -3  -2  -1  -2  -2   2
      2 N   0  -1   3  -1  -2  -1  -1  -1  -1  -2  -3  -1  -2  -3  -2   3   4  -4  -2  -2
      3 I  -2  -4  -4  -4  -2  -3  -4  -5  -4   5   2  -3   3  -1  -4  -3  -2  -3  -2   2
      4 F   0  -1  -1   0  -3   0   4  -3  -1  -3  -3   2  -2   2  -2  -1   1  -3  -1  -2
      5 E  -2  -1   0   5  -4   2   4  -3  -1  -4  -4   2  -3  -4  -2  -1  -2  -4  -3  -4
      6 M  -2  -2  -3  -4  -2  -1  -3  -4  -3   0   1  -2   9  -1  -4  -2  -2  -2  -2   0
      7 L  -2  -3  -4  -5  -2  -3  -4  -5  -4   1   5  -4   1   0  -4  -3  -2  -3  -2   0
      8 R  -2   5  -2  -3  -4   0  -1  -3  -2  -2  -2   3  -2  -3  -3  -2  -2  -4  -3   1
      9 I  -2   2   1  -2  -4   3  -1   0   0   0  -2  -1  -2   0  -3  -2  -2  -1   6  -2
     10 D  -3  -3   0   7  -5  -1   1  -2  -2  -4  -5  -2  -4  -5  -2  -1  -2  -5  -4  -4
     11 E  -2  -1  -1   1  -5   1   6  -3  -1  -4  -4   0  -3  -4  -2  -1  -2  -4  -3  -3
     12 G  -1  -3  -1  -2  -4  -3  -3   7  -3  -5  -5  -3  -4  -4  -3  -1  -3  -4  -4  -4
     13 L  -2   1  -3  -2  -3  -1   2  -4  -2   1   3  -2   0   0  -3  -2  -2  -2   4   1
     14 R  -2   6  -1  -2  -4   0  -1  -3  -1  -4  -3   4  -2  -4  -3  -1  -2  -4  -3  -3
     15 L  -2  -3  -3   1  -3  -2   2  -4  -3   0   4  -2   0  -1  -3  -2   0  -3  -2   1
     16 K  -1   0  -1  -1  -3   0   2  -3  -2  -2  -3   5  -2  -4  -2   1   1  -4  -3   0
     17 I  -2  -3  -4  -4  -2  -3  -4  -4  -4   2   1  -3   4  -1  -4  -3  -2   7  -1   4
  • Weighted observed percentages (rounded down),
    information per position (Info), and relative
    weight of gapless real matches to pseudocounts
    (Weight):
    Pos     A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V  Info Weight
      1 M   0   0   0   0   0   0   0   0   0   0   0   0  77   0   0   0   0   0   0  23  0.62   0.18
      2 N   0   0  21   0   0   0   0   0   0   0   0   0   0   0   0  40  39   0   0   0  0.55   0.22
      3 I   0   0   0   0   0   0   0   0   0  75  14   0  11   0   0   0   0   0   0   0  0.71   0.28
      4 F  11   0   0   0   0   0  49   0   0   0   0  15   0  15   0   0  10   0   0   0  0.46   0.28
      5 E   0   0   0  41   0  10  30   0   0   0   0  19   0   0   0   0   0   0   0   0  0.78   0.32
      6 M   0   0   0   0   0   0   0   0   0   0   0   0 100   0   0   0   0   0   0   0  1.14   0.33
      7 L   0   0   0   0   0   0   0   0   0   0 100   0   0   0   0   0   0   0   0   0  0.92   0.33
      8 R   0  53   0   0   0   0   0   0   0   0   0  28   0   0   0   0   0   0   0  19  0.71   0.33
      9 I   0  16   9   0   0  18   0   9   0  10   0   0   0   0   0   0   0   0  38   0  0.48   0.33
     10 D   0   0   0 100   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0  1.47   0.33
     11 E   0   0   0   0   0   0 100   0   0   0   0   0   0   0   0   0   0   0   0   0  1.22   0.33
     12 G   0   0   0   0   0   0   0 100   0   0   0   0   0   0   0   0   0   0   0   0  1.51   0.33
     13 L   0  10   0   0   0   0  18   0   0   9  35   0   0   0   0   0   0   0  18   9  0.33   0.33
     14 R   0  62   0   0   0   0   0   0   0   0   0  38   0   0   0   0   0   0   0   0  1.00   0.33
     15 L   0   0   0   8   0   0  18   0   0   0  56   0   0   0   0   0   9   0   0   9  0.39   0.33
     16 K   0   0   0   0   0   0  18   0   0   0   0  54   0   0   0   9   9   0   0   9  0.52   0.33
     17 I   0   0   0   0   0   0   0   0   0  10   9   0  18   0   0   0   0  18   0  45  0.71   0.33

11
PSI-BLAST (3)
  • Most parameters are similar to BLAST
  • An important new option is -k. This is the
    maximum E-value (expectation that a match
    occurred by chance) for a sequence to be included
    in the PSSM.
  • Normally -k is set to stricter values (often
    10^-10 or smaller) than -e, to ensure that the
    proteins that determine the substitution model
    for the next round are truly close to the query.

12
PSI-BLAST (4)
  • Summarising:
    1. Search for sequences that are similar to the
       query, and align them.
    2. Compute the substitution model for each
       position, based on the alignment above.
    3. If satisfied, stop; otherwise go back to 1.
  • 3-4 rounds is generally considered to be ok. More
    can introduce problems.
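
The loop summarised above can be sketched as plain control flow. This is only an outline under stated assumptions: `search` and `build_model` are hypothetical callables standing in for a BLAST search and for PSSM construction, and stopping when the set of included hits no longer changes is one possible reading of "if satisfied, stop". The default inclusion threshold of 1e-10 and the cap of 4 rounds follow the values quoted in these slides.

```python
def iterate_search(query, search, build_model,
                   include_evalue=1e-10, max_rounds=4):
    """Repeat search -> select hits -> rebuild model until stable.

    search(query, model) must return a list of (sequence_id, evalue);
    model is None in the first round (plain search, no PSSM yet).
    """
    model = None
    included = set()
    rounds_run = 0
    for rounds_run in range(1, max_rounds + 1):
        hits = search(query, model)
        # Only sufficiently significant hits contribute to the model.
        selected = {sid for sid, evalue in hits if evalue <= include_evalue}
        if selected == included:
            break                      # nothing new: treat as converged
        included = selected
        model = build_model(query, included)
    return model, included, rounds_run


# Toy stand-ins (hypothetical) so that the sketch can be run.
def fake_search(query, model):
    if model is None:
        return [("a", 1e-20), ("b", 1e-3)]
    return [("a", 1e-30), ("b", 1e-12), ("c", 0.5)]


def fake_build_model(query, included_ids):
    return sorted(included_ids)


if __name__ == "__main__":
    print(iterate_search("QUERY", fake_search, fake_build_model))
```

In the toy run, hit "b" is too weak in the first round but becomes significant once the model built from "a" is used, which is the sensitivity gain the iteration is meant to provide.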

13
PSI-BLAST (5)
  • PSI-BLAST can find hits at similarity levels that
    BLAST would consider too low to be reliable:
    remote homologues.
  • (Two proteins are homologous if they are
    evolutionarily related, in which case they are
    often structurally and functionally related.)
  • Hits at 10-15% sequence similarity can sometimes
    be found by PSI-BLAST.
  • Still a long way to go to find all homology: the
    average sequence similarity between structural
    homologues is 8-10%, barely above pure chance.

14
End of pairwise sequence alignments...
  • yippie!

15
Multiple sequence alignments (MA)
  • We may want to find the optimal alignment of
    multiple sequences instead of pairs of sequences.
  • For instance, we have proteins with the same
    function from multiple organisms: we want to
    find out which parts of the sequences match and
    which parts contain most gaps and mismatches.

16
MA (2)
  • The most informative alignment will contain a
    mixture of closely related and distantly related
    sequences:
  • if all are closely related, there is not much
    information, not much to learn (we are looking
    at the result of a short evolutionary stretch)
  • if all are very remotely related, it is hard to
    even get an alignment (unless structures, or
    other bits of information, are available)

17
What do we use MA for?
  • E.g. compress them into profiles, i.e. tables of
    frequencies of the various residues at different
    positions across a family of proteins.
  • Profiles can be used to:
  • retrieve remote homologues (as in PSI-BLAST)
  • identify active sites (highly conserved)
  • identify surface loops (can help, for instance,
    to design vaccines)
  • predict more accurately just about anything about
    a protein: there is more information in an MA
    (an evolutionary snapshot) than in a single
    sequence
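
As a small illustration of what "compressing an MA into a profile" means, the sketch below turns aligned sequences into a per-column frequency table and then lists the columns dominated by a single residue. The function names, the toy alignment, and the 0.9 conservation threshold are assumptions for illustration only.

```python
from collections import Counter


def profile(aligned):
    """One {residue: relative frequency} dict per column (gaps ignored)."""
    table = []
    for column in zip(*aligned):
        counts = Counter(c for c in column if c != "-")
        n = sum(counts.values())
        table.append({r: k / n for r, k in counts.items()} if n else {})
    return table


def conserved_positions(table, threshold=0.9):
    """0-based indices of columns where one residue reaches the threshold."""
    return [i for i, column in enumerate(table)
            if column and max(column.values()) >= threshold]


if __name__ == "__main__":
    toy_alignment = ["MKV-", "MRVA", "MKVG", "MQVS"]   # hypothetical family
    table = profile(toy_alignment)
    print(table)
    print(conserved_positions(table))
```

Highly conserved columns are the natural candidates when looking for active sites, while columns with many gaps and mixed residues are the kind of signal one would inspect for surface loops.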

18
  • The position-specific scoring matrix shown on
    slide 10 is repeated here as an example of a
    profile.

19
Rules about MA
  • Insert spaces in the sequences so that all their
    lengths are the same. Illegal (pointless): all
    sequences having a space in the same position.
  • Find the MA that has the best score according to
    some criterion.

20
SP measure
  • Scoring alignments isn't as trivial as in the
    pairwise case.
  • SP (sum-of-pairs) is a reasonable scoring
    function:
    SP = \sum_k \sum_{i<j} p(s_i[k], s_j[k])
  • where k ranges over the positions of the
    alignment, s_i and s_j are sequences i and j as
    aligned (with gaps), and p() is the pairwise
    penalty function.

21
Example
  • Alignment of Coronavirus sequences and 1 SARS
    sequence
  • Murine_hepatitis_vir_AF207902
    CCATACCGGCGTATGCGAAGCAGTGGTT-GCAACCCTGGTCCATCCTTCT
  • Bovine_coronavirus___NC_003045
    TTATGCCTGTGCAATCCCGGAAATTTAT-TGTTCCTTGGGTTATGTACTT
  • Avian_infectious_bro_NC_001451
    CTAGCCTTGCGCTAGATTTTTAACTTA--ACAAAACGGACTTAAATACCT
  • Porcine_epidemic_dia_NC_003436
    TAGGTTTGCTTAAGTAGCCATCGCAAGT-GCTGTGCTGTCCTCTAGTTCC
  • Human_coronavirus_22_NC_002645
    TTGGGTTGCAACAGTT-TGGAAGCAAGT-GCTGTG-TGTCCTAGTCTAAG
  • Transmissible_gastro_NC_002306
    TCAGTTTG--GCAATC--ACTCCTTGGA-ACGGGGTTGAGCGAACGGTGC
  • gi|30468046|gb|AY283798.1
    TAAACGTTCTGATGCCTTAAGCACCAATCACGGCCACAAGGTCGTTGAGC
  • First column
  • p(C,T) p(C,C) p(C,T) p(C,T) p(C,T)
    p(C,T)
  • p(T,C) p(T,T) p(T,T) p(T,T) p(T,T)
  • p(C,T) p(C,T) p(C,T) p(C,T)
  • p(T,T) p(T,T) p(T,T)
  • p(T,T) p(T,T)
  • p(T,T)

22
p()
  • We can use the same penalty function used for
    pairwise alignments:
  • match: 1
  • mismatch: -1
  • gap: -2

23
Example
  • Same alignment of Coronavirus sequences and 1
    SARS sequence as in the previous example
  • First column
  • (-1) 1 (-1) (-1) (-1) (-1)
  • (-1) 1 1 1 1
  • (-1) (-1) (-1) (-1)
  • 1 1 1
  • 1 1
  • 1
  • Sum for the first column: 1

24
Example
  • Same alignment of Coronavirus sequences and 1
    SARS sequence as in the previous examples
  • What about columns with multiple gaps?
  • What is p(-,-)?
  • Usually 0..
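
The column scores worked out in these examples can be reproduced with a short sketch of the SP measure. It uses the penalties given earlier (match 1, mismatch -1, gap -2) and p(-,-) = 0; the function names are assumptions. The two test columns are taken from the Coronavirus alignment: the first column reads C, T, C, T, T, T, T, and column 29 has a gap in every sequence except the SARS one.

```python
from itertools import combinations


def p(a, b, match=1, mismatch=-1, gap=-2):
    """Pairwise penalty for two aligned characters; '-' is a gap."""
    if a == "-" and b == "-":
        return 0                       # gap against gap costs nothing
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def sp_column(column):
    """Sum of p() over all unordered pairs of characters in one column."""
    return sum(p(a, b) for a, b in combinations(column, 2))


def sp_score(alignment):
    """SP score of a multiple alignment given as equal-length strings."""
    return sum(sp_column(column) for column in zip(*alignment))


if __name__ == "__main__":
    print(sp_column("CTCTTTT"))   # first column: 11 matches, 10 mismatches -> 1
    print(sp_column("------C"))   # column 29: 6 gap-residue pairs -> -12
```

With 7 sequences each column contributes 21 pairs, so the number of terms grows quadratically with the number of sequences.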

25
Property of SP
  • If the gap-gap penalty is set to zero, then the
    SP score of an MA is equal to the sum of all
    pairwise scores induced by it.
  • We can either sum over all pairs for each column,
    or, for each pair of sequences, sum over all
    columns (throwing away simultaneous gaps). This
    just means inverting the sums in
    \sum_k \sum_{i<j} p(s_i[k], s_j[k]) = \sum_{i<j} \sum_k p(s_i[k], s_j[k])
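
This property can be checked numerically on a toy alignment. The sketch below computes the score once column by column and once pair by pair, dropping columns where both sequences of the pair have a gap. The function names and the three toy sequences are assumptions; the penalties are match 1, mismatch -1, gap -2, with gap-gap scored as 0.

```python
from itertools import combinations


def p(a, b, match=1, mismatch=-1, gap=-2):
    """Pairwise penalty; a gap aligned with a gap scores 0."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def sp_by_columns(alignment):
    """Outer sum over columns, inner sum over pairs of sequences."""
    return sum(p(a, b)
               for column in zip(*alignment)
               for a, b in combinations(column, 2))


def induced_pairwise_score(s, t):
    """Score of the pairwise alignment induced by the MA on s and t.

    Columns where both sequences have a gap are thrown away.
    """
    return sum(p(a, b) for a, b in zip(s, t)
               if not (a == "-" and b == "-"))


def sp_by_pairs(alignment):
    """Outer sum over pairs of sequences, inner sum over columns."""
    return sum(induced_pairwise_score(s, t)
               for s, t in combinations(alignment, 2))


if __name__ == "__main__":
    toy = ["AC-GT", "A--GT", "ACCG-"]      # hypothetical toy alignment
    print(sp_by_columns(toy), sp_by_pairs(toy))   # both give -4
```

The two orders of summation agree only because discarded gap-gap columns contribute 0; with a nonzero gap-gap penalty the pairwise view would lose those terms.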

26
MA
  • How can we obtain the actual MA?
  • Dynamic programming, as in the pairwise case,
    would work.
  • The complexity is a bit higher though: with n
    sequences of length L, the dynamic programming
    table has on the order of L^n cells.