Sequencing, Sequence Alignment - PowerPoint PPT Presentation

Sequencing, Sequence Alignment & Software Lushan Wang, Shandong University – PowerPoint PPT presentation

1
Sequencing, Sequence Alignment & Software
Lushan Wang, Shandong University
2
Objectives
  • Understand how DNA sequence data is collected and
    prepared
  • Be aware of the importance of sequence searching
    and sequence alignment in biology and medicine
  • Be familiar with the different algorithms and
    scoring schemes used in sequence searching and
    sequence alignment

3
30,000
4
Shotgun Sequencing
Isolate Chromosome
Shear DNA into Fragments
Clone into Seq. Vectors
Sequence
5
Principles of DNA Sequencing
Primer
DNA fragment
Amp
pBR322
Tet
Ori
Denature with heat to produce ssDNA
Klenow ddNTP dNTP primers
6
The Secret to Sanger Sequencing
7
Principles of DNA Sequencing
3' Template
G C A T G C
5'
5' Primer
dATP dCTP dGTP dTTP
ddCTP
GddC
GCddA
GCAddT
ddG
GCATGddC
GCATddG
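As a rough illustration of the chain-termination idea on this slide, the sketch below lists, for each ddNTP, the fragments that end where that dideoxy base was incorporated, and then reads the sequence back by sorting the fragments from short to long. The function names are made up for this example; the strand GCATGC is the one shown on the slide.

```python
def chain_termination_fragments(strand):
    """Group the terminated fragments of a newly synthesized strand by ddNTP.

    Every position in the strand yields one fragment that stops there,
    written as the bases before it followed by 'dd' + the terminating base.
    """
    fragments = {base: [] for base in "ACGT"}
    for i, base in enumerate(strand):
        fragments[base].append(strand[:i] + "dd" + base)
    return fragments


def read_sequence(fragments):
    """Recover the sequence by ordering fragments from shortest to longest."""
    labelled = [
        (len(frag) - 2, base)  # subtract the two characters of the 'dd' marker
        for base, frags in fragments.items()
        for frag in frags
    ]
    return "".join(base for _, base in sorted(labelled))


if __name__ == "__main__":
    frags = chain_termination_fragments("GCATGC")
    for base in "ACGT":
        print("dd" + base, frags[base])
    print(read_sequence(frags))
```

Sorting by fragment length plays the role of the gel or capillary separation described on the following slides.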
8
Principles of DNA Sequencing
[Figure: sequencing gel with lanes G, T, C and A; fragments run from short to long and spell out G C A T G C]
9
Capillary Electrophoresis
Separation by Electro-osmotic Flow
10
Multiplexed CE with Fluorescent detection
ABI 377, 3700
96x700 bases
11
Shotgun Sequencing
Assembled Sequence
Sequence Chromatogram
Send to Computer
12
Shotgun Sequencing
  • Very efficient process for small-scale (10 kb)
    sequencing (preferred method)
  • First applied to whole genome sequencing in 1995
    (H. influenzae)
  • Now standard for all prokaryotic genome
    sequencing projects
  • Successfully applied to D. melanogaster
  • Moderately successful for H. sapiens

13
The Finished Product
GATTACAGATTACAGATTACAGATTACAGATTACAG ATTACAGATTACA
GATTACAGATTACAGATTACAGA TTACAGATTACAGATTACAGATTACA
GATTACAGAT TACAGATTAGAGATTACAGATTACAGATTACAGATT AC
AGATTACAGATTACAGATTACAGATTACAGATTA CAGATTACAGATTAC
AGATTACAGATTACAGATTAC AGATTACAGATTACAGATTACAGATTAC
AGATTACA GATTACAGATTACAGATTACAGATTACAGATTACAG ATTA
CAGATTACAGATTACAGATTACAGATTACAGA TTACAGATTACAGATTA
CAGATTACAGATTACAGAT
14
Sequencing Successes
T7 bacteriophage, completed in 1983: 39,937 bp, 59 coded proteins
Escherichia coli, completed in 1998: 4,639,221 bp, 4293 ORFs
Saccharomyces cerevisiae, completed in 1996: 12,069,252 bp, 5800 genes
15
Sequencing Successes
Caenorhabditis elegans, completed in 1998: 95,078,296 bp, 19,099 genes
Drosophila melanogaster, completed in 2000: 116,117,226 bp, 13,601 genes
Homo sapiens, completed in 2003: 3,201,762,515 bp, 31,780 genes
16
Genomes to Date
  • 8 vertebrates (human, mouse, rat, fugu,
    zebrafish)
  • 3 plants (Arabidopsis, rice, poplar)
  • 2 insects (fruit fly, mosquito)
  • 2 nematodes (C. elegans, C. briggsae)
  • 1 sea squirt
  • 4 parasites (plasmodium, guillardia)
  • 4 fungi (S. cerevisiae, S. pombe)
  • 200 bacteria and archaebacteria
  • 2000 viruses

17
So what do we do with all this sequence data?
18
Sequence Alignment
19
Alignments tell us about...
  • Function or activity of a new gene/protein
  • Structure or shape of a new protein
  • Location or preferred location of a protein
  • Stability of a gene or protein
  • Origin of a gene or protein
  • Origin or phylogeny of an organelle
  • Origin or phylogeny of an organism

20
Factoid
Sequence comparisons lie at the heart of
all bioinformatics
21
Similarity versus Homology
  • Similarity refers to the likeness or identity
    between 2 sequences
  • Similarity means sharing a statistically
    significant number of bases or amino acids
  • Similarity does not imply homology
  • Homology refers to shared ancestry
  • Two sequences are homologous if they are derived
    from a common ancestral sequence
  • Homology usually implies similarity

22
Similarity versus Homology
  • Similarity can be quantified
  • It is correct to say that two sequences are X%
    identical
  • It is correct to say that two sequences have a
    similarity score of Z
  • It is generally incorrect to say that two
    sequences are X% similar

23
Similarity versus Homology
  • Homology cannot be quantified
  • If two sequences have a high identity it is OK
    to say they are homologous
  • It is incorrect to say two sequences have a
    homology score of Z
  • It is incorrect to say two sequences are X%
    homologous

24
Homologues & All That
  • Homologue (or Homolog)
  • Protein/gene that shares a common ancestor and
    which has good sequence and/or structure
    similarity to another (general term)
  • Paralogue (or Paralog)
  • A homologue which arose through gene duplication
    in the same species/chromosome
  • Orthologue (or Ortholog)
  • A homologue which arose through speciation (found
    in different species)

25
Sequence Complexity
MCDEFGHIKLAN. High Complexity
ACTGTCACTGAT. Mid Complexity
NNNNTTTTTNNN. Low Complexity
Translate those DNA sequences!!!
26
Assessing Sequence Similarity
Two Character Strings
THESTORYOFGENESIS
THISBOOKONGENETICS
Character Comparison
THESTORYOFGENESI-S
THISBOOKONGENETICS
Context Comparison
THE STORY OF GENESIS
THIS BOOK ON GENETICS
27
Assessing Sequence Similarity
is this alignment significant?
28
Is This Alignment Significant?
29
Some Simple Rules
  • If two sequences are > 100 residues and > 25%
    identical, they are likely related
  • If two sequences are 15-25% identical they may be
    related, but more tests are needed
  • If two sequences are < 15% identical they are
    probably not related
  • If you need more than 1 gap for every 20 residues
    the alignment is suspicious
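
The rules above can be applied mechanically once an alignment exists. The following is a minimal sketch, not a standard tool: the function names are invented here, percent identity is taken over all alignment columns, every '-' character is counted as one gap, and alignments that are more than 25% identical but 100 residues or shorter are lumped into the "more tests needed" group. All three choices are assumptions, since the slide does not specify them.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of alignment columns holding the same residue.

    Assumption: the denominator is the full alignment length, and a gap
    character ('-') never counts as a match.
    """
    if not aligned_a or len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must be non-empty and equal in length")
    matches = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    return 100.0 * matches / len(aligned_a)


def rule_of_thumb(aligned_a, aligned_b):
    """Return (verdict, suspicious_gaps) following the simple rules on the slide."""
    identity = percent_identity(aligned_a, aligned_b)
    residues = min(len(aligned_a.replace("-", "")), len(aligned_b.replace("-", "")))
    gaps = aligned_a.count("-") + aligned_b.count("-")
    suspicious_gaps = gaps * 20 > residues  # more than 1 gap per 20 residues
    if identity < 15:
        verdict = "probably not related"
    elif identity > 25 and residues > 100:
        verdict = "likely related"
    else:
        # Assumption: covers 15-25% identity and short, highly identical pairs
        verdict = "may be related, more tests needed"
    return verdict, suspicious_gaps


if __name__ == "__main__":
    a, b = "THESTORYOFGENESI-S", "THISBOOKONGENETICS"
    print(round(percent_identity(a, b), 1), rule_of_thumb(a, b))
```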

30
Doolittle's Rules of Thumb
31
Sequence Alignment - Methods
  • Dot Plots
  • Dynamic Programming
  • Heuristic (Fast) Local Alignment
  • Multiple Sequence Alignment
  • Contig Assembly

32
Dot Plots
33
Dot Plots
  • Invented in 1970 by Gibbs & McIntyre
  • Good for quick graphical overview
  • Simplest method for sequence comparison
  • Inter-sequence comparison
  • Intra-sequence comparison
  • Identifies internal repeats
  • Identifies domains or modules

34
Dot Plot Algorithm
  • Take two sequences (A & B), write sequence A out
    as a row (length = m) and sequence B as a column
    (length = n)
  • Create a table or matrix of m columns and n
    rows
  • Compare each letter of sequence A with every
    letter in sequence B. If there's a match, mark it
    with a dot; if not, leave it blank
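
The three steps above translate almost directly into code. This is a bare-bones sketch with no windowing or filtering; `dot_plot` and `render` are names chosen for this example only.

```python
def dot_plot(seq_a, seq_b):
    """Return an n x m matrix of booleans: rows follow seq_b, columns follow seq_a."""
    return [[a == b for a in seq_a] for b in seq_b]


def render(seq_a, seq_b):
    """Draw the matrix as text, marking each match with a dot."""
    matrix = dot_plot(seq_a, seq_b)
    lines = ["  " + " ".join(seq_a)]
    for b, row in zip(seq_b, matrix):
        lines.append(b + " " + " ".join("." if hit else " " for hit in row))
    return "\n".join(lines)


if __name__ == "__main__":
    # Comparing a sequence with itself: the main diagonal is filled, and the
    # repeated G shows up as two extra off-diagonal dots.
    print(render("ACDEFGHG", "ACDEFGHG"))
```

Comparing a sequence against itself in this way is the intra-sequence use mentioned earlier: any dots off the main diagonal point to internal repeats.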

35
Dot Plot Algorithm
A C D E F G H G
A C D E F G H G
36
Dot Plots Internal Repeats
37
Dot Plots
  • Most commercial programs offer pretty good dot
    plot programs including
  • GCG/Omiga/DS gene (Accelrys Inc.)
  • PepTool (BioTools Inc.)
  • LaserGene (DNAStar)
  • Popular freeware package is Dotter
    www.cgr.ki.se/cgr/groups/sonnhammer/Dotter.html
  • Dotlet http://www.isrec.isb-sib.ch/java/dotlet/Dotlet.html
  • JDotter http://athena.bioc.uvic.ca/sars/jdotter/main.php

38
Dynamic Programming
39
Dynamic Programming
  • Developed by Needleman & Wunsch (1970)
  • Refined by Smith & Waterman (1981)
  • Ideal for quantitative assessment
  • Guaranteed to be mathematically optimal
  • Slow N^2 algorithm
  • Performed in 2 stages
  • Prepare a scoring matrix using recursive function
  • Scan matrix diagonally using traceback protocol

40
Identity Scoring Matrix (Sij)
41
(No Transcript)
42
The Recursive Function
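\[
S_{ij} = s_{ij} + \max
\begin{cases}
S_{i-1,j-1} \\
\max_{2 \le x < i} \left( S_{i-x,j-1} + w_{x-1} \right) \\
\max_{2 \le y < j} \left( S_{i-1,j-y} + w_{y-1} \right)
\end{cases}
\]
w = gap penalty, S = alignment score, s = score from the scoring matrix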
43
A Simple Example...
Partially filled matrix:
      A  A  T  V  D
  A   1  1  0  0  0
  V   0  1  1  2  1
  V
  D
Completed matrix:
      A  A  T  V  D
  A   1  1  0  0  0
  V   0  1  1  2  1
  V   0  1  1  2  2
  D   0  1  1  1  3
Alignments:
  A A T V D
  A - V V D

  A A T V D
  A V V D

  A A T V D
  A V - V D
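
The matrix-filling stage of this example can be sketched in a few lines. Each cell takes its own match score plus the best of three options: the diagonal neighbour, any cell higher up in the previous column, or any cell further left in the previous row, the last two standing for a gap. The sketch assumes identity scoring (1 for a match, 0 otherwise) and a gap penalty of zero, which is what reproduces the numbers in this example; the linear `gap * length` penalty and all function names are illustrative assumptions, and the traceback stage is left out.

```python
def identity_score(x, y):
    """Identity scoring: 1 for identical residues, 0 otherwise."""
    return 1 if x == y else 0


def fill_matrix(seq_cols, seq_rows, score=identity_score, gap=0):
    """Fill the dynamic programming matrix row by row.

    seq_cols runs along the top, seq_rows down the side. `gap` is the
    penalty per skipped residue (assumed linear; use a value <= 0).
    """
    n, m = len(seq_rows), len(seq_cols)
    S = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            best = 0
            if i > 0 and j > 0:
                best = S[i - 1][j - 1]                # no gap
                for x in range(2, i + 1):             # gap of x-1 in one sequence
                    best = max(best, S[i - x][j - 1] + gap * (x - 1))
                for y in range(2, j + 1):             # gap of y-1 in the other
                    best = max(best, S[i - 1][j - y] + gap * (y - 1))
            S[i][j] = score(seq_rows[i], seq_cols[j]) + best
    return S


if __name__ == "__main__":
    for residue, row in zip("AVVD", fill_matrix("AATVD", "AVVD")):
        print(residue, row)
```

The bottom-right cell holds the score of the best global alignment; the traceback then walks back from it to recover the aligned residues.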
44
Could We Do Better?
  • Key to the performance of Dynamic Programming is
    the scoring function
  • Dynamic Programming always gives the
    mathematically correct answer
  • Dynamic Programming does not always give the
    biologically correct answer
  • The weakest link -- The Scoring Matrix

45
Scoring Matrices
  • An empirical model of evolution, biology and
    chemistry all wrapped up in a 20 X 20 table of
    integers
  • Structurally or chemically similar residues
    should ideally have high diagonal or off-diagonal
    numbers
  • Structurally or chemically dissimilar residues
    should ideally have low diagonal or off-diagonal
    numbers

46
A Better Matrix - PAM250
47
Using PAM250...
PAM250 values used:
      A  T  V  D
  A   2
  T   1  3
  V   0  0  4
  D   0  0 -2  4
Gap Penalty -1
Partially filled matrix:
      A  A  T  V  D
  A   2  1  0 -1 -1
  V  -1  2  1  5 -1
  V
  D
Completed matrix:
      A  A  T  V  D
  A   2  1  0 -1 -1
  V  -1  2  1  5 -1
  V  -1  1  2  5  3
  D  -1  1  1  0  9
Alignment:
  A A T V D
  A V - V D
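
As a quick check on the final score, an alignment can be scored column by column from the substitution values and gap penalty on this slide. The sketch below only covers the four residues used in the example (A, T, V, D); the dictionary layout and function names are choices made for illustration.

```python
# PAM250 values for the residues in this example, as listed on the slide.
# Keys are stored with the two residues in alphabetical order.
PAM250_SUBSET = {
    ("A", "A"): 2,
    ("A", "T"): 1, ("T", "T"): 3,
    ("A", "V"): 0, ("T", "V"): 0, ("V", "V"): 4,
    ("A", "D"): 0, ("D", "T"): 0, ("D", "V"): -2, ("D", "D"): 4,
}
GAP_PENALTY = -1


def pair_score(x, y):
    """Look up the substitution score regardless of residue order."""
    return PAM250_SUBSET[tuple(sorted((x, y)))]


def score_alignment(aligned_a, aligned_b, gap=GAP_PENALTY):
    """Sum substitution scores over all columns, charging `gap` per gap column."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for x, y in zip(aligned_a, aligned_b):
        total += gap if "-" in (x, y) else pair_score(x, y)
    return total


if __name__ == "__main__":
    print(score_alignment("AATVD", "AV-VD"))
```

For the alignment shown, the columns contribute 2 + 0 - 1 + 4 + 4, matching the 9 in the bottom-right cell of the completed matrix.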
48
PAM Matrices
  • Developed by M.O. Dayhoff (1978)
  • PAM = Point Accepted Mutation
  • Matrix assembled by looking at patterns of
    substitutions in closely related proteins
  • 1 PAM corresponds to 1 amino acid change per 100
    residues
  • 1 PAM corresponds to 1% divergence or 1 million
    years in evolutionary history

49
Dynamic Programming
  • Great for doing pairwise global alignments
  • Produces a quantitative alignment score
  • Problems if one tries to do alignments with very
    large sequences (memory requirement grows as N^2
    or as N x M)
  • Serious problems if one tries to align one
    sequence against a database (10s of hours)
  • Need an alternative..

50
Fast Local Alignment Methods
ACDEAGHNKLM...
KKDEFGHPKLM...
SCDEFCHLKLM...
MCDEFGHNKLV...
ACDEFGHIKLM...
QCDEFGHAKLM...
AQQQFGHIKLPI...
WCDEFGHLKLM...
SMDEFAHVKLM...
ACDEFGFKKLM...
51
Fast Local Alignment Methods
  • Developed by Lipman & Pearson (1985/88)
  • Refined by Altschul et al. (1990/97)
  • Ideal for large database comparisons
  • Uses heuristics & statistical simplification
  • Fast N-type algorithm (similar to Dot Plot)
  • Cuts sequences into short words (k-tuples)
  • Uses Hash Tables to speed comparison
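
The word-and-hash-table idea can be sketched as follows. This shows only the lookup step and is not how FASTA or BLAST are actually implemented: one sequence is cut into k-tuples and stored in a hash table, then each k-tuple of the other sequence is looked up and the hits are tallied per diagonal. The function names and the default k = 2 are assumptions for this example.

```python
from collections import defaultdict


def build_word_index(seq, k):
    """Map every k-tuple in seq to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(seq) - k + 1):
        index[seq[pos:pos + k]].append(pos)
    return index


def diagonal_hits(query, target, k=2):
    """Count shared k-tuples per diagonal (target position minus query position)."""
    index = build_word_index(target, k)
    diagonals = defaultdict(int)
    for q_pos in range(len(query) - k + 1):
        word = query[q_pos:q_pos + k]
        for t_pos in index.get(word, []):
            diagonals[t_pos - q_pos] += 1
    return dict(diagonals)


if __name__ == "__main__":
    # Two of the sequences listed on the previous slide
    print(diagonal_hits("ACDEFGHIKLM", "MCDEFGHNKLV"))
```

A diagonal that collects many hits marks a region worth aligning in detail, which is where the speed-up over filling a full matrix comes from.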

52
Fast Alignment Algorithm
53
Fast Alignment Algorithm
54
Fast Alignment Algorithm
A C D E F G D E F...
L M R G CD D Y G
55
Fast Alignment Algorithm
56
Multiple Sequence Alignment
Multiple alignment of Calcitonins
57
Multiple Alignment Algorithm
  • Take all n sequences and perform all possible
    pairwise (n(n-1)/2) alignments
  • Identify highest scoring pair, perform an
    alignment & create a consensus sequence
  • Select next most similar sequence and align it to
    the initial consensus, regenerate a second
    consensus
  • Repeat step 3 until finished
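
The ordering logic of these steps can be sketched without a real aligner. In the sketch below the sequences are assumed to be already the same length, the pairwise "alignment score" is replaced by a simple count of identical positions, and the consensus is a per-column majority vote. All of these are stand-ins chosen for illustration, as are the function names.

```python
from collections import Counter
from itertools import combinations


def identity(a, b):
    """Stand-in for a pairwise alignment score: number of identical positions."""
    return sum(x == y for x, y in zip(a, b))


def consensus(seqs):
    """Most common residue in each column of equal-length sequences."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))


def progressive_consensus(seqs):
    """Follow the slide's steps; returns (final consensus, number of pairwise comparisons)."""
    pairs = list(combinations(range(len(seqs)), 2))          # n(n-1)/2 pairs
    i, j = max(pairs, key=lambda p: identity(seqs[p[0]], seqs[p[1]]))
    aligned = [seqs[i], seqs[j]]                              # highest scoring pair
    remaining = [s for idx, s in enumerate(seqs) if idx not in (i, j)]
    cons = consensus(aligned)
    while remaining:
        nxt = max(remaining, key=lambda s: identity(s, cons))  # next most similar
        remaining.remove(nxt)
        aligned.append(nxt)
        cons = consensus(aligned)                              # regenerate consensus
    return cons, len(pairs)


if __name__ == "__main__":
    family = ["ACDEAGHNKLM", "MCDEFGHNKLV", "ACDEFGHIKLM", "QCDEFGHAKLM", "WCDEFGHLKLM"]
    print(progressive_consensus(family))
```

A real program would replace `identity` with a dynamic programming alignment and would allow gaps when adding each new sequence to the growing alignment.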

58
Multiple Sequence Alignment
  • Developed and refined by many (Doolittle, Barton,
    Corpet) through the 1980s
  • Used extensively for extracting hidden
    phylogenetic relationships and identifying
    sequence families
  • Powerful tool for extracting new sequence motifs
    and signature sequences

59
Multiple Alignment
  • Most commercial vendors offer good multiple
    alignment programs including
  • GCG (Accelrys)
  • PepTool/GeneTool (BioTools Inc.)
  • LaserGene (DNAStar)
  • Popular web servers include T-COFFEE, MULTALIN
    and CLUSTALW
  • Popular freeware includes PHYLIP & PAUP

60
Multi-Align Websites
  • Match-Box http://www.fundp.ac.be/sciences/biologie/bms/matchbox_submit.shtml
  • MUSCA http://cbcsrv.watson.ibm.com/Tmsa.html
  • T-Coffee http://www.ch.embnet.org/software/TCoffee.html
  • MULTALIN http://www.toulouse.inra.fr/multalin.html
  • CLUSTALW http://www.ebi.ac.uk/clustalw/

61
(No Transcript)
62
Multi-alignment Contig Assembly
ATCGATGCGTAGCAGACTACCGTTACGATGCCTT TAGCTACGCATCGT
CTGATGGCAATGCTACGGAA..
TAGCTACGCATCGT
TAGCAGACTACCGTT
ATCGATGCGTAGC
GTTACGATGCCTT
63
Contig Assembly
  • Read, edit & trim DNA chromatograms
  • Remove overlaps & ambiguous calls
  • Read in all sequence files (10-10,000)
  • Reverse complement all sequences (doubles the
    number of sequences to align)
  • Remove vector sequences (vector trim)
  • Remove regions of low complexity
  • Perform multiple sequence alignment
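
The reverse-complement step is easy to show in code. This is a small sketch; the helper names are invented here, and only the four unambiguous bases (upper or lower case) are handled.

```python
# Translation table for the four unambiguous bases, upper and lower case
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def with_reverse_complements(reads):
    """Pair every read with its reverse complement, doubling the set to align."""
    doubled = []
    for read in reads:
        doubled.append(read)
        doubled.append(reverse_complement(read))
    return doubled


if __name__ == "__main__":
    print(reverse_complement("TGCTACGCATCG"))
```

Because a read may come from either strand, the assembler has to consider both orientations, which is why this step doubles the number of sequences.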

64
Contig Assembly Multiple Alignment
  1. Only accept a very high sequence identity
  2. Accept unlimited number of end gaps
  3. Very high cost for opening internal gaps
  4. A short match with high score/residue is
    preferred over a long match with low score/residue

65
Chromatogram Editing
66
Sequence Loading
67
Sequence Alignment
68
Contig Alignment - Process
ATCGATGCGTAGC
TAGCAGACTACCGTT
GTTACGATGCCTT
TGCTACGCATCG
CGATGCGTAGCA
CGATGCGTAGCA
ATCGATGCGTAGC
TAGCAGACTACCGTT
GTTACGATGCCTT
ATCGATGCGTAGCAGACTACCGTTACGATGCCTT
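The joining of overlapping reads into a contig can be sketched with a simple greedy merge: repeatedly find the two sequences with the longest exact suffix-to-prefix overlap and fuse them. This is a toy illustration, not how Phrap or other assemblers work. It assumes error-free reads that are already in the same orientation, and the function names and the minimum overlap of 3 bases are arbitrary choices.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if < min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Merge the best-overlapping pair until no overlaps of min_len remain."""
    contigs = list(reads)
    while len(contigs) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing left to join
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


if __name__ == "__main__":
    reads = ["ATCGATGCGTAGC", "TAGCAGACTACCGTT", "GTTACGATGCCTT"]
    print(greedy_assemble(reads))
```

The all-against-all overlap search in each round is the kind of pairwise comparison load that makes large assemblies expensive, and repeats can fool a greedy merge like this one into joining the wrong reads.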
69
Problems for Assembly
  • Repeat regions
  • Capture sequences from non-contiguous regions
  • Polymorphisms
  • Cause failure to join correct regions
  • Large data volume
  • Requires large numbers of pair-wise comparisons

70
Sequence Assembly Programs
  • Phred - base calling program that does detailed
    statistical analysis (UNIX)
    http://www.phrap.org/
  • Phrap - sequence assembly program (UNIX)
    http://www.phrap.org/
  • TIGR Assembler - microbial genomes (UNIX)
    http://www.tigr.org/softlab/assembler/
  • The Staden Package (UNIX)
    http://www.mrc-lmb.cam.ac.uk/pubseq/
  • GeneTool/ChromaTool/Sequencher (PC/Mac)

71
Phrap
  • Phrap is a program for assembling shotgun DNA
    sequence data
  • Uses a combination of user-supplied and
    internally computed data quality information to
    improve assembly accuracy in the presence of
    repeats
  • Constructs the contig sequence as a mosaic of the
    highest quality read segments rather than a
    consensus
  • Handles large datasets

72
http://bio.ifom-firc.it/ASSEMBLY/assemble.html
73
Conclusions
  • Sequence alignments and database searching are
    key to all of bioinformatics
  • There are four different methods for doing
    sequence comparisons 1) Dot Plots 2) Dynamic
    Programming 3) Fast Alignment and 4) Multiple
    Alignment
  • Understanding the significance of alignments
    requires an understanding of statistics and
    distributions

74
MOLECULAR BIOLOGY SOFTWARE
  • Windows NT Desktop Sequence Analysis
  • Omiga: Now part of Accelrys (formerly Oxford
    Molecular of GCG fame), they also supply the Mac
    program MacVector
  • Discovery Studio Gene (DS Gene): Replacement
    program for OMIGA. A trial version is available
    on the Accelrys website.
  • BioEdit: Freeware, with a very nice sequence
    editor.

75
An Introduction to DS Gene for sequence analysis
the Windows sequence analysis program (GCG-Unix)
  • Downloading sequences
  • Sequence Alignments and Dotplots
  • Restriction maps
  • Primer design
  • Annotating your sequence with feature information
  • Database searching

76
DS Gene supports the following file formats
File format      Nucleic acid sequences   Protein sequences
                 Import    Export         Import    Export
EMBL             Yes       Yes            -         -
FastA            Yes       Yes            Yes       Yes
GCG              Yes       Yes            Yes       Yes
GenBank          Yes       Yes            -         -
GenPept          -         -              Yes       Yes
Staden           Yes       No             Yes       No
PIR/NBRF         -         -              Yes       No
Swissprot        -         -              Yes       No
MacVector        Yes       No             Yes       No
Omiga            Yes       No             Yes       No
PDB              -         -              Yes       Yes
Text             Yes       No             Yes       No
ABI trace file   Yes       No             -         -
SCF trace file   Yes       No             -         -
77
DS Gene accepts multiple sequence files
Multiple sequence format   Nucleic acid sequences   Protein sequences
                           Import    Export         Import    Export
GCG (.msf)                 Yes       Yes            Yes       Yes
Phylip (.phy)              Yes       Yes            Yes       Yes
Nexus (.nex)               Yes       Yes            Yes       Yes
78
The DS Gene window is divided into two parts: the
navigation pane and the view pane.
79
Map, Editor tab, Features tab, Properties tab
80
In the Map view you can modify feature
information and edit the appearance of these
features. Any changes made in this view are
automatically transmitted to the other views.
81
There is also a Trace view, which is visible only
when a trace file is active. Chromatogram files
generated from automated sequencers can be edited
within DS Gene.
82
  • A number of analyses can be found under the
    Analyze menu. These include restriction or
    proteolytic digests, design of sequencing and PCR
    primers, nucleic acid or protein motifs, dotplots
    and translations. This is an example of a
    results view for a restriction digest

83
Analysis toolboxes
  • The nucleic acid analysis and protein analysis
    toolboxes enable you to apply a range of
    algorithms to your sequences. The toolboxes will
    perform useful analyses such as searching for
    open reading frames (ORFs) and codon preference on
    nucleic acids and hydrophobicity plots on protein
    sequences.
  • Alignments can be performed using ClustalW from
    the Analyze menu. The alignment display can be
    coloured in a variety of ways and the example
    below displays the alignment according to percent
    identity

84
The Database menu will allow access to the
databases at NCBI using Entrez and Blast and
alternatively the GCG databases on MoBiCS can be
accessed with Blast and FastA (currently not
tested).  
85
Conclusions
  • DS Gene covers database searching, alignments,
    primer design and digests.
  • The package has been integrated well with the
    NCBI search facilities, including the Blast and
    Entrez outputs.
  • Restriction mapping, motif searching and primer
    design
  • Multiple sequence alignment is available with
    ClustalW and there is an extensive array of
    choices for displaying the output. 
  • DS Gene is the replacement for the OMIGA program
  • Requirements: Microsoft Windows 98/2000/XP

86
(No Transcript)
87
(No Transcript)
88
(No Transcript)
89
(No Transcript)
90
(No Transcript)
91
Analyze Restriction enzyme
92
Analyze Nucleic acid analysis toolbox
93
(No Transcript)
94
Analyze Find PCR primer pairs
95
Analyze find sequence primers
96
Analyze translation analysis
97
Analyze proteolysis enzyme analysis
98
Analyze Protein analysis toolbox
99
(No Transcript)
100
Analyze Protein analysis toolbox pI value
101
Entrez Blast
102
Hit list
103
map
text
104
Download sequence
105
map
editor
SGTATYSGNPFVGVTPWANAYYASEVSSLAIPSLTGAMATAAAAVAKVPS
FMWLDTLDKTPLMEQTLADIRTANKNGGNYAGQFVVFDLPDRDCAALASN
GEYSIADGGVAKYKNYIDTIRQIVVEYSDIRTLLVIEPDSLANLVTNLGT
PKCANAQSAYLECINYAVTQLNLPNVAMYLDAGHAGWLGWPANQDPAAQL
FANVYKNASSPRALRGLATNVANYNGWNITSPPSYTQGNAVYNEKLYIHA
IGPLLANHGWSNAFFITDQGRSGKQPTGQQQWGDWCNVIGTGFGIRPSAN
TGDSLLDSFVWVKPGGECDGTSDSSAPRFDSHCALPDALQPAPQAGAWFQ
AYFVQLLTNANPSFL
Feature
DISULFID(94 ...153) DISULFID(286 ...333)
Properties
106
Dot plot
107
ClustalW
108
(No Transcript)
109
Phylogeny
110
  • Most file formats accepted
  • Restriction mapping
  • Multiple sequence alignment
  • Graphics
  • Pairwise Alignment
  • Nucleotide and Protein Analyses
  • Database Searches
  • Internet Links
  • Sequence Trace Files
  • Editing
  • Plasmid Drawing
  • Value for money: It's free!

111
Split view of an alignment of 6205 prokaryotic
16S RNAs showing identities and similarities
112
Shows single sequence editing and GenBank info
113
Hydrophobicity plots
114
Plasmid drawing and annotation
115
Mutual information analysis and graphical data
examination
116
ABI and SCF trace viewing, editing and conversion
117
User-defined motif searching
118
Automated link to ClustalW with command line
options available
119
Configuration of external accessory analysis
applications
120
Graphic shaded view of alignment
121
Dynamic feature annotation shading and
information
122
Easy editing of graphical feature annotations
123
Editing sequence groups or families