Introduction to Bioinformatics 20120 - PowerPoint PPT Presentation

Slides: 35
Provided by: gruye
Transcript and Presenter's Notes

1
Introduction to Bioinformatics 20120
  • Gianluca Pollastri
  • office: CS A1.07
  • email: gianluca.pollastri_at_ucd.ie

2
Credits
  • Richard Lathrop's and Pierre Baldi's Bioinformatics
    courses at the University of California, Irvine.

3
Course overview
  • Context: DNA, RNA, proteins.
  • Resources: GenBank, PDB, etc.
  • Algorithms for sequence comparison.
  • Phylogenetics.
  • Structural bioinformatics: protein structure
    prediction.

4
Lecture notes
  • http://gruyere.ucd.ie/2007_courses/20120/
  • confidential..

5
Recommended/useful readings
  • No book is actually required
  • Introduction to Bioinformatics (Lesk)
  • Introduction to Computational Molecular Biology
    (Setubal, Meidanis)
  • Bioinformatics: the Machine Learning Approach
    (Baldi, Brunak)

6
  • CS 20120, Introduction to Bioinformatics
  • Assignment 1, 29 January 2007
  • 10% of the overall mark
  • To hand in by midnight of February 12
  • 1. Identify your favourite pet.
  • 2. Get the protein sequence for one of its genes
    on
  • a. http://www.ncbi.nlm.nih.gov/entrez/
  • 3. BLAST your sequence against UniProt at
  • a. http://www.ebi.ac.uk/blast2/index.html?UniProt
  • 4. If you get fewer than 6 results from 6
    different organisms, go back to 2 and choose
    another protein.
  • 5. Select 6 sequences returned by BLAST, from 6
    different organisms (ticking the appropriate
    boxes and downloading them in FASTA format will
    give you the right input format for the next
    step).
  • 6. Run ClustalW on them using the page below (be
    patient, it might take time)
  • a. http://www.ebi.ac.uk/clustalw/index.html
  • 7. Draw a phylogenetic tree from your guide tree
    (.dnd) using an online viewer, e.g.
  • a. http://bioweb.pasteur.fr/seqanal/interfaces/drawtree.html
  • 8. Email me (gianluca.pollastri_at_ucd.ie)
  • a. your protein sequence UniProt record
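
Step 5 of the assignment relies on FASTA-formatted text. As a rough sketch (not part of the original assignment), the Python snippet below reads FASTA text into (header, sequence) pairs so that you can check how many sequences a download contains. The function name read_fasta and the two-record example are made up for illustration; real BLAST downloads have longer headers.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines may be wrapped; collect and join them.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical input, made up for this example.
example = """>seq1 made-up header
MKTAYIAK
QRQISFVK
>seq2 another made-up header
MKSAYIAR
"""

records = read_fasta(example)
for header, seq in records:
    print(header, len(seq))
print("number of sequences:", len(records))
```
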

7
Public tools summary
  • DNA/RNA databases: GenBank/EMBL (3-way
    consortium)
  • Protein sequences: UniProt (SWISS-PROT, TrEMBL,
    etc.)
  • Protein structures: Protein Data Bank (PDB)
  • A massive number of portals, servers, boutique
    databases, etc., some good, some a bit less so.
  • To sort out the above, there are many servers
    that benchmark other servers (e.g. EVA,
    LiveBench, etc.)

8
Some goals of Bioinformatics
  • Understand biology based on sequences
  • Interrelate sequence, structure, expression
    (presence/absence), function: understand the
    system
  • Use current sequence data to travel back in
    time..
  • Use all this knowledge to produce technologies
    (for health, agriculture, etc.)

9
More specific goals
  • Given a sequence, find sequences in a database
    that are similar to it (sequence comparison).
  • Given a structure, find structures in a database
    that are similar to it (structure comparison).
  • Given a sequence, find its structure (protein
    structure prediction).
  • Given a structure, find a sequence whose
    structure is similar to it (protein design).

10
Data size
  • 10^11 letters in DNA repositories: a decent-sized
    hard disk
  • 6 complete years of issues of the NY Times (which
    has notoriously large weekend supplements..)
  • A formidable increase rate..

11
(No Transcript)
12
Computer Science needed
  • Given the size and nature of molecular biology
    data, a set of specific computer science
    technologies is especially crucial for
    bioinformatics:
  • Fast algorithms for comparing strings, 3D
    structures
  • Efficient data structures
  • Data mining/machine learning

13
Sequence Comparison
14
Sequence comparison
  • Most important primitive operation in
    computational biology. Almost everything else we
    can do relies on it.
  • Similarity between two sequences (DNA, protein):
    how much they look alike (are they likely to be
    evolutionarily related?).
  • Alignment: how we place one sequence against
    another to maximise matches between the two,
    often to measure similarity.

15
Sequence comparison (2)
  • Similarity between two sequences:
  • number of identical letters in the same positions
  • OK, but this works only if all letters are
    equally dissimilar..
  • evolutionary distance
  • great, but how do we compute that?
  • something in between?
  • any ideas?
  • DADLAKKNNCIACHQVETKVVGPALKDIAAKYADKDDAATYLAGKIKGGSSGVWGQIPMPPNVNVSDADAKALADWILTLK
  • ...LYAEKACAGCHSTDSRLVGPSYKGLFGSTRGVIADENYIRKSILQPTAQVVKGYPMPSQGQLSDDEINALIEYIKTLK
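
The first similarity measure on this slide (number of identical letters in the same positions) can be sketched in a few lines of Python. This is only an illustration: the function name count_identical and the two short DNA strings are made up, and the sketch assumes the two sequences have the same length and are compared without gaps.

```python
def count_identical(a, b):
    """Count positions at which two equal-length sequences carry the same letter."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(1 for x, y in zip(a, b) if x == y)


# Made-up example sequences (not taken from the slides).
s1 = "ACCTGGGCTACG"
s2 = "ACCTGAGCTACC"

identical = count_identical(s1, s2)
print(identical, "identical positions out of", len(s1))
print("fraction identical:", identical / len(s1))
```

Note that every mismatch counts the same here, which is exactly the limitation the slide points out: the measure is only fair if all letters are equally dissimilar.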

16
Scenario 1
  • 2 sequences over the same alphabet (e.g. both
    proteins, or both DNA), roughly the same length.
    Small differences. We want to find them.

ACCTGGGCTACGTGACTTA-AACT
ACCTG-GCTACGAGACTTATAACT
     d             i
17
Scenario 1 (2)
  • E.g. multiple labs sequencing the same gene, or
    protein. There may be small differences due to
    natural variations and sequencing errors.

ACCTGGGCTACGTGACTTA-AACT
ACCTG-GCTACGAGACTTATAACT
     d             i
18
Scenario 2
  • 2 sequences over the same alphabet. We want to
    figure out if the suffix of one is similar to a
    prefix of another

..ACCCGACCTGGGCTACGTGACTTA-AA
       ACCTG-GCTACGAGACTTATAACTTCAA
...
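
As a simplified illustration of Scenario 2, the Python sketch below finds the longest suffix of one sequence that exactly equals a prefix of another. This is a deliberate simplification: the scenario on the slide tolerates mismatches and gaps, while this sketch only accepts exact overlaps. The function name longest_overlap and the example strings are made up.

```python
def longest_overlap(a, b, min_len=1):
    """Length of the longest suffix of a that exactly equals a prefix of b.

    Returns 0 when no overlap of at least min_len letters exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if k > 0 and a[-k:] == b[:k]:
            return k
    return 0


# Made-up example: the end of `left` matches the start of `right`.
left = "ACCCGACCTGGGCT"
right = "CTGGGCTACGTGAC"

k = longest_overlap(left, right)
print("overlap length:", k)
# Merge the two fragments by writing the shared part only once.
print("merged:", left + right[k:])
```
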
19
Scenario 3
  • Same as 2 (compare the end of one sequence vs the
    beginning of the other) but now we have hundreds
    of sequences.
  • We also know that many of these sequences are
    likely to be unrelated, i.e. many/most pairs of
    sequences don't match. Two problems in one:
    finding which sequences match, and
    quantifying/qualifying the match.
  • E.g. genome sequencing: DNA is cut in different
    places by enzymes, sequenced, and then
    reassembled.

20
Scenario 4
  • Find substrings of two sequences that are similar
    to each other. All the surrounding stuff in both
    sequences can be different.

..ACCCGACCTGGGCTACGTGACTTA-AAGGACGC
  TTTGTACCTG-GCTACGAGACTTATAACTTCAA
       -----...------
21
Scenario 4 (2)
  • E.g. finding a common motif in two regions,
    finding a common domain (functional unit) between
    two different proteins.

..ACCCGACCTGGGCTACGTGACTTA-AAGGACGC
  TTTGTACCTG-GCTACGAGACTTATAACTTCAA
       -----...------
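
One common way to address Scenario 4 is local alignment by dynamic programming, usually known as the Smith-Waterman algorithm. The slides do not name a method at this point, so the Python sketch below is an assumption about how the problem could be approached, not the course's own solution. The scoring values (+1 match, -1 mismatch, -2 gap) and the function name local_alignment_score are also assumptions chosen for illustration.

```python
def local_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best score of any alignment between a substring of a and a substring of b."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 lets an alignment start afresh anywhere, so the
            # surrounding unrelated letters are never penalised.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# Made-up example: the two strings share only the substring ACGT.
print(local_alignment_score("GGGACGTCCC", "TTTACGTAAA"))
```
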
22
Scenario 5
  • Same as 4, but now we have 1 sequence A vs
    thousands of sequences. Most of these sequences
    are NOT similar to A. Two problems in one: find
    which sequences are similar, and gauge their
    similarity to A.
  • We have a motif and want to find all the
    sequences that include it
  • We have a protein domain and want to find all the
    proteins that include it
  • We have a gene and ...
  • In general, when we are trying to find biological
    similarity, we are in scenario 4 or 5.

23
Global sequence comparison
  • The two sequences below look similar. We are
    looking for an algorithm to detect this, and
    align the sequences optimally (including gaps).

ACCTGGGCTACGTGACTTA-AACT
ACCTG-GCTACGAGACTTATAACT
     d             i
24
Global sequence comparison (2)
  • We want to compare whole sequences: we are not
    interested in small regions of local similarity
    in the middle of irrelevant/random/unrelated
    information.
  • E.g. we have a gene A, and want to compare it to
    a list of genes L. We are sure that A and all
    sequences in L are actual genes.
  • We may not know much or anything about A, but we
    have information about the elements of L. If A
    looks like an element B of L then some of what we
    know about B may be transferred to A.

25
Alignments
  • We want to insert arbitrary spaces (gaps) in both
    sequences so that we end up with the max number
    of matches (same letter in the same position).
  • Random deletions and insertions are common during
    evolution: without allowing gaps we wouldn't be
    able to detect many evolutionary relationships.
  • Spaces at the end of the sequences are OK:
  • perhaps one sequence is longer than the other,
    and insertions at the end of a sequence tend to
    be more neutral than insertions in the middle.
  • Only illegal (silly) thing: two spaces in the
    same position.

26
OK
---TGGGCTACGTGACTTA-AACT
ACCTG-GCTACGAGACTTATAACT
27
NO! (silly)
---TGGGCTA-CGTGACTTA-AACT
ACCTG-GCTA-CGAGACTTATAACT
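
The rule from the Alignments slide (two spaces in the same position are not allowed) can be checked mechanically. The Python sketch below is one possible way to do it, using the OK and NO alignments shown on the two slides above; the function name is_valid_alignment is made up, and "-" is assumed to be the gap character.

```python
GAP = "-"


def is_valid_alignment(row1, row2, gap=GAP):
    """True if both rows have equal length and no column holds two gaps."""
    if len(row1) != len(row2):
        return False
    return all(not (x == gap and y == gap) for x, y in zip(row1, row2))


ok = ("---TGGGCTACGTGACTTA-AACT",
      "ACCTG-GCTACGAGACTTATAACT")
silly = ("---TGGGCTA-CGTGACTTA-AACT",
         "ACCTG-GCTA-CGAGACTTATAACT")

print(is_valid_alignment(*ok))
print(is_valid_alignment(*silly))
```
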
28
Visualising sequence similarity using dotplots:
example
  • ACCTGGGCCACGT
  • ACCAGGCTACGA
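
A dotplot for the two example sequences above can be sketched in Python as a grid with one row per letter of the first sequence and one column per letter of the second, marking a cell whenever the two letters are equal. The function name dotplot and the choice of "*" and "." as symbols are made up for this illustration. Runs of "*" along a diagonal correspond to stretches where the two sequences match.

```python
def dotplot(a, b):
    """Return a grid (list of rows): "*" where a[i] == b[j], "." elsewhere."""
    return [["*" if x == y else "." for y in b] for x in a]


seq_a = "ACCTGGGCCACGT"
seq_b = "ACCAGGCTACGA"

grid = dotplot(seq_a, seq_b)

# Print seq_b along the top and seq_a down the left-hand side.
print("  " + seq_b)
for letter, row in zip(seq_a, grid):
    print(letter + " " + "".join(row))
```
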

29
(No Transcript)
30
(No Transcript)
31
Score
  • How do we score an alignment?
  • For instance
  • +1: match
  • -1: mismatch
  • -2: gap
  • (Not at all the only way, e.g. for proteins there
    are much better ways, as we will see)

32
Score
ACCTGGGCTACGTGACTTA-AACT
ACCAG-GCTACGAGACTTATAGCT
-> 19 matches, 3 mismatches, 2 gaps
-> score = 19x1 + 3x(-1) + 2x(-2) = 12
33
Score (2)
...TGGGCTACGTGACTTA-AACT
ACCAG-GCTACGAGACTTATAGCT
-> 16 matches, 3 mismatches, 2 gaps, 3 gaps at the
beginning of a sequence
-> score = 16x1 + 3x(-1) + 2x(-2) + 3x0 = 9
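
The two score calculations above can be reproduced with a short Python sketch. It assumes the scoring scheme from the Score slide (+1 match, -1 mismatch, -2 gap) and, when free_end_gaps is set, gives a score of 0 to gaps before the first letter or after the last letter of either sequence. The function names are made up, and the leading gaps of the second example are written as "---" rather than "..." so that one gap character is used throughout.

```python
def end_gap_columns(row, gap="-"):
    """Indices of columns that lie in the leading or trailing gap run of a row."""
    n = len(row)
    lead = n - len(row.lstrip(gap))
    trail = n - len(row.rstrip(gap))
    return set(range(lead)) | set(range(n - trail, n))


def score_alignment(row1, row2, match=1, mismatch=-1, gap=-2,
                    free_end_gaps=False, gap_char="-"):
    """Sum the column scores of an alignment given as two equal-length rows."""
    if len(row1) != len(row2):
        raise ValueError("alignment rows must have the same length")
    free = set()
    if free_end_gaps:
        free = end_gap_columns(row1, gap_char) | end_gap_columns(row2, gap_char)
    total = 0
    for i, (x, y) in enumerate(zip(row1, row2)):
        if i in free:
            continue  # end gaps cost nothing
        if x == gap_char or y == gap_char:
            total += gap
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


first = score_alignment("ACCTGGGCTACGTGACTTA-AACT",
                        "ACCAG-GCTACGAGACTTATAGCT")
second = score_alignment("---TGGGCTACGTGACTTA-AACT",
                         "ACCAG-GCTACGAGACTTATAGCT",
                         free_end_gaps=True)
print(first, second)  # 12 9
```
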
34
Algorithms for sequence comparison
  • Generating all possible alignments and picking
    the best one: impossibly slow.
  • Dynamic programming (here programming has
    nothing to do with computers): solving a problem
    by splitting it dynamically into subparts.
  • We build up a solution based on similarity
    between prefixes of the two sequences.
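
The idea of building the solution from prefixes can be sketched as follows. This is the standard dynamic-programming recurrence for global alignment, usually called Needleman-Wunsch; the slides up to this point do not name a specific algorithm, so treat the sketch as an assumption about where the lecture is heading. It uses +1 match, -1 mismatch, -2 gap, and, unlike the Score (2) example, it charges for gaps at the ends as well. The function name global_alignment_score is made up.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of a and b, computed prefix by prefix."""
    # prev[j] = best score for aligning the empty prefix of a with b[:j],
    # which can only be done with j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            # Last column is either a[i-1] over b[j-1], or a letter over a gap.
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev[len(b)]


# The two sequences from the Score slide, with their gaps removed.
s1 = "ACCTGGGCTACGTGACTTAAACT"
s2 = "ACCAGGCTACGAGACTTATAGCT"
print(global_alignment_score(s1, s2))
```

Since the alignment on the Score slide scores 12, the optimal score printed here is at least 12. The table has (len(a) + 1) x (len(b) + 1) cells and each cell takes constant time to fill, which is what makes this approach feasible where enumerating all alignments is not.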