One cannot compare apples and pears (German saying)

Slides: 89
Provided by: wiem
Transcript and Presenter's Notes

1
One cannot compare apples and pears (German saying)
Is this true ?
2
or
the origin of life
3
One cannot compare apples and pears (German saying)
Is this true?
Apple (Malus):
OC Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
OC Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots; Rosidae;
OC eurosids I; Rosales; Rosaceae; Maloideae; Malus.
Pear (Pyrus):
OC Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
OC Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots; Rosidae;
OC eurosids I; Rosales; Rosaceae; Maloideae; Pyrus.
4
  • Nothing in Biology Makes Sense Except in the
    Light of Evolution

Theodosius Dobzhansky (1900-1975)
5
Whole genome sequencing facilitates a systematic
analysis of redundancy within the genome
Arabidopsis thaliana (MIPS)
6
How does evolution work ?
7
Redundancy / homology: intra- and inter-genome
  • Diversification within one genome
  • Speciation/diversification between genomes

8
Pairwise sequence alignment
  • Why sequence alignment and definitions
  • What is sequence alignment
  • How to align sequences
  • global vs. local alignment
  • gaps
  • Substitution matrix
  • Dot plot
  • Dynamic programming

9
Why sequence alignment
  • Lots of sequences with unknown structure and
    function vs. a few (but growing number) sequences
    with known structure and function
  • If they align, they are similar
  • If they are similar, then they might have similar
    structure and/or function. Identify conserved
    patterns (motifs)
  • If one of them has a known structure/function, then
    aligning the other might yield insight into how
    its structure/function works. Similar motif
    content might hint at a similar function
  • Define evolutionary relationships

10
Basics in sequence comparison
  • Identity
  • The extent to which two (nucleotide or amino
    acid) sequences are invariant (identical).
  • Similarity
  • The extent to which (nucleotide or amino acid)
    sequences are related. The extent of similarity
    between two sequences can be based on percent
    sequence identity and/or conservation. In BLAST,
    similarity refers to a positive matrix score.
    The notion is quite flexible (see the later examples
    of DNA polymerases): sequences can be similar across
    their whole length, or similarity can be restricted
    to domains!
  • Homology
  • Similarity attributed to descent from a common
    ancestor.

11
Homologous: similarity attributed to any gene
descending from a common ancestor.
Orthologous: homologous sequences in different species
that arose from a common ancestral gene during
speciation. Orthologous genes may or may not have
the same function! In most cases they will.
Paralogous: homologous sequences within a
single species that arose by gene duplication.
12
Pairwise sequence alignment
  • Why sequence alignment and definitions
  • What is sequence alignment
  • How to align sequences
  • global vs. local alignment
  • gaps
  • Substitution matrix
  • Dot plot
  • Dynamic programming

13
Aligning biological sequences
  • Nucleic acid (4-letter alphabet + gap)
  • TT-GCAC
  • TTTACAC
  • Proteins (20-letter alphabet + gap)
  • RKVA--GMAKPNM
  • RKIAVAAASKPAV

14
Problem
  • Any two sequences can always be aligned
  • There are many possible alignments
  • Sequence alignment needs to be scored to find the
    optimal alignment
  • In many cases there will be several solutions
    with the same score

ACGTACGTACGTACGTACGTACGTACGT
GATCGATCGATCGATCGATCGATCGATC
ACGTACGTACGTACGTACGTACGTACGT
GATCGATCGATCGATCGATCGATCGATC
ACGTACGTACGTACGTACGTACGTACGT
GATCGATCGATCGATCGATCGATCGATC
ACGTACGTACGTACGTACGTACGTACGT
GATCGATCGATCGATCGATCGATCGATC
Question: what is similar enough to be relevant?
ACCGGTACGTTACGATACGTAACGTTACTGTACTGT
GATCGATCGATCGATCGATCGATCGATC
15
What is sequence alignment
  • Given two sentences of letters (strings), and a
    scoring scheme for evaluating matching letters,
    find the optimal pairing of letters from one
    sequence to letters of the other sequence
  • Align
  • THIS IS A RATHER LONGER SENTENCE THAN THE NEXT
  • THIS IS A SHORT SENTENCE
  • THIS IS A RATHER LONGER - SENTENCE THAN THE NEXT
  • ---- ---- - ---- --- ----
  • THIS IS A --SH-- -O---R T SENTENCE ---- --- ----
  • or
  • THIS IS A RATHER LONGER SENTENCE THAN THE NEXT
  • ------ ------ ---- --- ----
  • THIS IS A SHORT- ------ SENTENCE ---- --- ----

16
Statement of problem
  • Given
  • 2 sequences
  • Scoring system for evaluating match (or mismatch)
    of two characters (simple for nucleic acids /
    difficult for proteins)
  • Penalty function for gaps in sequences
  • Produce
  • Optimal pairing of sequences that retains the
    order of characters in each sequence, perhaps
    introducing gaps, such that the total score is
    optimal.

17
Pairwise sequence alignment
  • Why sequence alignment and definitions
  • What is sequence alignment
  • How to align sequences
  • global vs. local alignment
  • gaps
  • Substitution matrix
  • Dot plot
  • Dynamic programming

18
Pairwise sequence alignment
  • Global alignment
  • Local alignment

19
Global alignment
  • 1 TGTCGATTAAGCGGTCGTAGCTGACCTGAGATTGCCCGATGGCGTAGTAGCTGACC 56
  • 1 TGTCGATTATGCGGTCGTAG..GACCTGAGTTTCCCCGATGGCGTAGTAGGTGACC 54

Two closely related sequences
Algorithm: GAP (Needleman & Wunsch). Produces an
end-to-end alignment.
20
Global alignment
1 AGGATTGGAATGCTCAGAAGCAGCTAAAGCGTGTATGCAGGATTGGAATTAAAGAGGAGGTAGACCG... 67
1 AGGATTGGAATGCTAGGCTTGATTGCCTACCTGTAGCCACATCAGAAGCACTAAAGCGTCAGCGAGACCG 70
Two sequences sharing several regions of
local similarity
Algorithm: GAP (Needleman & Wunsch). Produces an
end-to-end alignment.
21
Local alignment
Algorithm: Bestfit (Smith & Waterman). Identifies
the region with the best local similarity.
Algorithm: Similarity (X. Huang). Identifies all
regions with local similarity.
22
Pairwise sequence alignment
  • Why sequence alignment and definitions
  • What is sequence alignment
  • How to align sequences
  • global vs. local alignment
  • gaps
  • Substitution matrix
  • Dot plot
  • Dynamic programming

23
Global alignment: the gap
1 AGGATTGGAATGCTCAGAAGCAGCTAAAGCGTGTATGCAGGATTGGAATTAAAGAGGAGGTAGACCG 67
1 AGGATTGGAATGCTACAGAAGCAGCTAAAGCGTGTATGCAGGATTGGAATTAAAGAGGAGGTAGACCG 68
24
Parameters for sequence alignment
Gap penalties
  Opening: the cost to introduce a gap
  Extension: the cost to extend a gap
Scoring systems
  Every symbol pairing is assigned a numerical value
  that is based on a symbol comparison or replacement
  table/matrix
25
Drawing alignments
  • Exact matches OK, inexact costly, gaps cheap:
    This is a rather longer sentence than the next
    This is a ------ ------ sentence ---- --- ----
  • Exact matches OK, inexact costly, gaps cheap:
    This is a -rather longer - sentence than the next
    This is a s---h-- -o---r-t sentence ---- --- ----
  • Exact matches OK, inexact moderately costly, gap extension cheap:
    This is a rather longer sentence than the next
    This is a -----s hort-- sentence ---- --- ----
  • Exact matches OK, inexact cheap, gaps costly:
    This is a rather longer sentence than the next
    This is a short sentence----------------------

Which is best? What is "best"? How can this be
calculated?
26
Gaps (insertions/deletions) are scored
ATGTAATGCA
TATGTGGAATGA
or
ATGT..AATGCA
TATGTGGAATGA
Insertion / deletion
The generation of a gap is penalized with a
negative score
27
Why gap penalties ?
Gaps not permitted. Score: 0
1 GTGATAGACACAGACCGGTGGCATTGTGG 29
1 GTGTCGGGAAGAGATAACTCCGATGGTTG 29
13 matches, 16 mismatches
Gaps allowed but not penalized. Score: 88
1 GTG.ATAG.ACACAGA..CCGGT..GGCATTGTGG 29
1 GTGTAT.GGA.AGAGATACC..TCCG..ATGGTTG 29
20 matches, 3 mismatches
Match: 5, mismatch: -4
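The scoring of the gapped alignment above can be reproduced with a few lines of Python. This is a minimal sketch: the function name `score_alignment` and the gap penalty of -8 in the second call are illustrative assumptions, not part of the slides. Match = 5 and mismatch = -4 are the values given on the slide.

```python
def score_alignment(row1, row2, match=5, mismatch=-4, gap=0):
    """Score two equal-length alignment rows; '.' marks a gap position."""
    if len(row1) != len(row2):
        raise ValueError("alignment rows must have equal length")
    total = 0
    for a, b in zip(row1, row2):
        if a == "." or b == ".":
            total += gap        # gap column (penalty is zero or negative)
        elif a == b:
            total += match      # identical symbols
        else:
            total += mismatch   # different symbols
    return total

row1 = "GTG.ATAG.ACACAGA..CCGGT..GGCATTGTGG"
row2 = "GTGTAT.GGA.AGAGATACC..TCCG..ATGGTTG"

print(score_alignment(row1, row2))          # gaps free: 20*5 + 3*(-4) = 88
print(score_alignment(row1, row2, gap=-8))  # 12 gap columns at -8: 88 - 96 = -8
```

With free gaps the alignment looks excellent; as soon as each gap column costs something (the -8 is a made-up value), the same alignment scores poorly, which is the point of the slide.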
28
Why gap penalties ?
  • The optimal alignment of two similar sequences
    usually
  • maximizes the number of matches and
  • minimizes the number of gaps.
  • Permitting the insertion of arbitrarily many gaps
    might lead to high scoring alignments of
    non-homologous sequences.
  • Penalizing gaps forces alignments to have
    relatively few gaps.

Gap penalties increase the quality of an
alignment: non-homologous sequences are not
aligned
29
Gap penalties
Linear gap penalty score: $\gamma(g) = -g\,d$
Affine gap penalty score: $\gamma(g) = -d - (g - 1)\,e$
$\gamma(g)$: gap penalty score of a gap of length $g$
$d$: gap opening penalty
$e$: gap extension penalty
$g$: gap length
30
Scoring insertions and deletions
Without gap:
T A T G T G C G T A T A
  A T G T T A T A C
Total score: 4
With gap:
T A T G T G C G T A T A
  A T G T - - - T A T A C
Total score: 8 + (-3.2) = 4.8
Match: 1, mismatch: 0
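The two gap penalty schemes and the worked example above can be written down directly. This is a sketch under the assumption that the example uses an affine penalty with gap opening 3 and gap extension 0.1 (the values that reproduce the -3.2 on the slide); the function names are illustrative.

```python
def linear_gap(g, d):
    """Linear gap penalty: every gap position costs d."""
    return -g * d

def affine_gap(g, d, e):
    """Affine gap penalty: d to open the gap, e for each further position (g >= 1)."""
    return -d - (g - 1) * e

# Worked example: 8 matches (score 1 each) and one gap of length 3
matches = 8
gap_score = affine_gap(3, d=3, e=0.1)
total = matches * 1 + gap_score

print(round(gap_score, 1))  # -3.2
print(round(total, 1))      # 4.8
```

Under a linear penalty with the same d = 3, the three-position gap would cost -9, so the gapped alignment would score below the ungapped one.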
31
Modification of gap penalties
Score matrix: BLOSUM62
Gap opening penalty 3, gap extension penalty 0.1, score 6.3:
1 ...VLSPADKFLTNV 12
1 VFTELSPAKTV.... 11
Gap opening penalty 0, gap extension penalty 0.1, score 11.3:
1 V...LSPADKFLTNV 12
1 VFTELSPA.K..T.V 11
32
Aligning biological sequences: where to put the gap?
  • Nucleic acid (4-letter alphabet + gap)
  • TT-GCAC
  • TTTACAC

or
TTG-CAC
TTTACAC
Pair G with A, or G with T?
33
Pairwise sequence alignment
  • Why sequence alignment and definitions
  • What is sequence alignment
  • How to align sequences
  • global vs. local alignment
  • gaps
  • Substitution matrix
  • Dot plot
  • Dynamic programming

34
Parameters for sequence alignment
Gap penalties
  Opening: the cost to introduce a gap
  Extension: the cost to extend a gap
Scoring systems
  Every symbol pairing is assigned a numerical value
  that is based on a symbol comparison or substitution
  table/matrix
35
DNA scoring systems
Sequence 1 ACTACCAGTTCATTTGATACTTCTCAAA
Sequence 2 TACCATTACCGTGTTAACTGAAAGGACTTAAAGACT
    A  C  G  T
A   1  0  0  0
C   0  1  0  0
G   0  0  1  0
T   0  0  0  1
Match: 5 x 1 = 5; mismatch: 19 x 0 = 0; score: 5
36
DNA scoring systems
Sequence 1 ACTACCAGTTCATTTGATACTTCTCAAA
Sequence 2 TACCATTACCGTGTTAACTGAAAGGACTTAAAGACT
Negative scoring values to penalize mismatches
    A  C  G  T
A   5 -4 -4 -4
C  -4  5 -4 -4
G  -4 -4  5 -4
T  -4 -4 -4  5
Match: 5 x 5 = 25; mismatch: 19 x (-4) = -76; score: -51
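Both DNA scoring tables above follow the same pattern: one value on the diagonal, another everywhere else. The sketch below builds such a table and scores an ungapped pairing. The helper names and the two short toy sequences are assumptions made for illustration; only the match/mismatch values and the 5/19 tallies come from the slides.

```python
def dna_matrix(match, mismatch, alphabet="ACGT"):
    """Build a symbol comparison table as a dict keyed by (base, base)."""
    return {(a, b): (match if a == b else mismatch)
            for a in alphabet for b in alphabet}

def ungapped_score(seq1, seq2, matrix):
    """Sum the table entries over paired positions (no gaps allowed)."""
    return sum(matrix[(a, b)] for a, b in zip(seq1, seq2))

identity = dna_matrix(1, 0)      # first table: match 1, mismatch 0
penalizing = dna_matrix(5, -4)   # second table: match 5, mismatch -4

s1, s2 = "ACGTAC", "ACCTAG"      # toy sequences: 4 matches, 2 mismatches
print(ungapped_score(s1, s2, identity))    # 4
print(ungapped_score(s1, s2, penalizing))  # 4*5 + 2*(-4) = 12

# Tallies from the slides: 5 matches and 19 mismatches
print(5 * 1 + 19 * 0)      # 5
print(5 * 5 + 19 * (-4))   # -51
```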
37
Complication: "inexact" is not binary (1/0) but
something relative.
Amino acids have different physical and
biochemical properties that may or may not be important
for function, and these properties influence the
probability that they are replaced in evolution.
38
Protein scoring systems
Substitution matrix
    C  S  T  P  A  G  N  D ...
C   9
S  -1  4
T  -1  1  5
P  -3 -1 -1  7
A   0  1  0 -1  4
G  -3  0 -2 -2  0  6
N  -3  1  0 -2 -2  0  5
D  -3  0 -1 -1 -2 -1  1  6
...
T:G = -2, T:T = 5, ... Score = 48
39
Substitution (scoring) matrix
Grouping of side chains by charge, polarity ...
Exchange of D (Asp) by E (Glu) is better (both
are negatively charged) than replacement by, e.g.,
F (Phe) (aromatic).
C (Cys) forms disulphide bridges and cannot be
exchanged for any other residue → high score of 9.
40
Alignment of human hemoglobin α and β chains
Legend: | identical, : highly similar, . similar,
(blank) unrelated
Symbol comparison table: PAM250; gap opening
penalty: 3; gap extension penalty: 0.1; score: 116
41
How are substitution matrices generated ?
  • Manually align protein structures (or, more
    risky, sequences)
  • Look for frequency of amino acid substitutions at
    structurally constant sites.
  • Entry $\propto \log\left(\dfrac{\text{freq(observed)}}{\text{freq(expected)}}\right)$
  • + → more likely than random
  • 0 → at random base rate
  • - → less likely than random
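The sign convention in the list above follows directly from taking the logarithm of an observed/expected ratio. A minimal sketch, in which the base-2 logarithm and the example frequencies are assumptions (published matrices additionally scale and round the values):

```python
import math

def log_odds(observed, expected):
    """Log-odds entry: log2 of observed over expected pair frequency."""
    return math.log2(observed / expected)

# Hypothetical pair frequencies, chosen only to show the three cases
print(log_odds(0.020, 0.010) > 0)   # seen more often than by chance: positive
print(log_odds(0.010, 0.010) == 0)  # seen exactly as often as by chance: zero
print(log_odds(0.005, 0.010) < 0)   # seen less often than by chance: negative
```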

42
PAM (Percent Accepted Mutations) matrices
  • Derived from global alignments of protein
    families. Family members share at least 85%
    identity (Dayhoff et al., 1978).
  • Construction of phylogenetic tree and ancestral
    sequences of each protein family
  • Computation of number of substitutions for each
    pair of amino acids

43
PAM (Percent Accepted Mutations) matrices
  • The PAM-1 matrix, which was computed by counting
    the number of substitutions, reflects an average
    change of 1% of all amino acid positions. PAM
    matrices for larger evolutionary distances can be
    extrapolated from the PAM-1 matrix.
  • PAM250: 250 mutations per 100 residues.
  • Greater numbers mean bigger evolutionary distance

44
PAM250
45
BLOSUM (Blocks Substitution Matrix)
  • Derived from alignments of domains of distantly
    related proteins (Henikoff & Henikoff, 1992)

A A C E C
  • Occurrences of each amino acid pair in each
    column of each block alignment is counted
  • The numbers derived from all blocks were used to
    compute the BLOSUM matrices

Column: A A C E C
Pair counts: A-A 1, A-C 4, A-E 2, C-E 2, C-C 1
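The pair counts for the example column can be checked by enumerating every pair of residues in the column. This is a minimal sketch of the counting step only; the function name is illustrative, and the later steps (summing over all blocks, converting counts to log-odds) are not shown.

```python
from collections import Counter
from itertools import combinations

def count_pairs(column):
    """Count unordered residue pairs in one column of a block alignment."""
    counts = Counter()
    for a, b in combinations(column, 2):
        counts[tuple(sorted((a, b)))] += 1  # treat A-C and C-A as the same pair
    return counts

pairs = count_pairs("AACEC")
for (a, b), n in sorted(pairs.items()):
    print(f"{a}-{b}: {n}")
# A-A: 1
# A-C: 4
# A-E: 2
# C-C: 1
# C-E: 2
```

A column of 5 residues yields 5 x 4 / 2 = 10 pairs in total, which matches the sum of the counts.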
46
BLOSUM (Blocks Substitution Matrix)
  • Sequences within blocks are clustered according
    to their level of identity
  • Clusters are counted as a single sequence
  • Different BLOSUM matrices differ in the
    percentage of sequence identity used in
    clustering
  • The number in the matrix name (e.g. 62 in
    BLOSUM62) refers to the percentage of sequence
    identity used to build the matrix
  • Greater numbers mean smaller evolutionary distance

47
Different substitution matrices for different
alignments
less stringent
more stringent
  • BLOSUM matrices usually perform better than PAM
    matrices for local similarity searches (Henikoff
    & Henikoff, 1993)
  • When comparing closely related proteins one
    should use lower PAM or higher BLOSUM matrices,
    for distantly related proteins higher PAM or
    lower BLOSUM matrices
  • For database searching the commonly used matrix
    (default) is BLOSUM62

48
Calculating alignments: global vs. local alignment
  • For optimal GLOBAL alignment, we want best score
    in the final row or final column
  • GLOBAL - best alignment of entirety of both
    sequences (possibly at expense of great local
    similarity)
  • For optimal LOCAL alignment, we want best score
    anywhere in matrix
  • LOCAL - best alignment of segments, without
    regard to rest of two sequences (at the expense
    of the overall score)

49
Pairwise sequence alignment
  • Why sequence alignment and definitions
  • What is sequence alignment
  • How to align sequences
  • global vs. local alignment
  • gaps
  • Substitution matrix
  • Dot plot
  • Dynamic programming

50
Dotplot
Dotplot gives an overview of all possible
alignments. The ideal case: two identical sequences.
Sequence 1
T A T C G A A G T A T A T C G A A G T A
Every word in one sequence is aligned with each
word in the second sequence
Sequence 2
51
Dotplot
Dotplot gives an overview of all possible
alignments. The normal case: two somewhat similar
sequences.
Sequence 1
T A T C G A A G T A T A T T C A T G T A
isolated dots
2 dots form a diagonal
Sequence 2
3 dots form a diagonal
52
Dotplot
Dotplot gives an overview of all possible
alignments
Sequence 1
T A T C G A A G T A T A T T C A T G T A
Sequence 2
Word Size 1
53
Dotplot
In a dotplot each diagonal corresponds to a
possible (ungapped) alignment
Sequence 1
T A T C G A A G T A T A T T C A T G T A
One possible alignment
Sequence 2
TATCGAAGTA TATTCATGTA
Word Size 1
54
Dotplot
Dotplot gives an overview of all possible
alignments. Filters (word size) can be introduced
to get rid of noise.
Sequence 1
Word size 1
T A T C G A A G T A T A T T C A T G T A
isolated dots
2 dots form a diagonal
Sequence 2
3 dots form a diagonal
55
Dotplot
Dotplot gives an overview of all possible
alignments. Filters (word size) can be introduced
to get rid of noise.
Sequence 1
Word size 2
T A T C G A A G T A T A T T C A T G T A
2 dots form a diagonal
Sequence 2
3 dots form a diagonal
56
Dotplot
Dotplot gives an overview of all possible
alignments. Filters (word size) can be introduced
to get rid of noise.
Sequence 1
Word size 3
T A T C G A A G T A T A T T C A T G T A
Sequence 2
3 dots form a diagonal
57
Dotplot
Dotplot gives an overview of all possible
alignments. Filters (word size) can be introduced
to get rid of noise.
Sequence 1
Word size 4
T A T C G A A G T A T A T T C A T G T A
Sequence 2
conditions too stringent !!
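The effect of the word size filter on the two example sequences can be reproduced with a small dotplot routine. This is a sketch under one assumption: a dot is recorded at the start position of every identical word, so a run of matching letters appears as a single dot per word rather than one dot per letter. The function name is illustrative.

```python
def dotplot(seq1, seq2, word_size=1):
    """Return (i, j) start positions where words of the given size are identical."""
    dots = []
    for i in range(len(seq1) - word_size + 1):
        for j in range(len(seq2) - word_size + 1):
            if seq1[i:i + word_size] == seq2[j:j + word_size]:
                dots.append((i, j))
    return dots

seq1 = "TATCGAAGTA"
seq2 = "TATTCATGTA"

for w in (1, 2, 3, 4):
    print(w, len(dotplot(seq1, seq2, w)))

print(dotplot(seq1, seq2, 3))  # [(0, 0), (7, 7)]: only TAT and GTA survive
print(dotplot(seq1, seq2, 4))  # []: conditions too stringent
```

Increasing the word size removes the isolated dots first; at word size 4 the two sequences share no word at all, so the plot is empty.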
58
Word size algorithm
sliding window
Sequence 1
T A C G G T A T G A C A G T A
T C T A C G G T A T G A C A G
T A T C T A C G G T A T G A C
A G T A T C T A C G G T A T G
A C A G T A T C T A C G G T A T G
A C A G T A T C T A C G G T A T G
A C A G T A T C
T A C G G T A T G A C A G T A T C
Sequence 2
Word size 3: only perfect matches are counted
59
Word size algorithm
Stringency
Sequence 1
T A C G G T A T G A C A G T A
T C T A C G G T A T G A C A G
T A T C T A C G G T A T G A C
A G T A T C T A C G G T A T G
A C A G T A T C T A C G G T A T G
A C A G T A T C T A C G G T A T G
A C A G T A T C
T A C G G T A T G A C A G T A T C
Sequence 2
Word size 3, one mismatch allowed. The number of matches
required within the word is called the stringency.
60
Word size algorithm
sliding window
Sequence 1
T A C G G T A T G A C A G T A
T C T A C G G T A T G A C A G
T A T C T A C G G T A T G A C
A G T A T C T A C G G T A T G
A C A G T A T C T A C G G T A T G
A C A G T A T C T A C G G T A T G
A C A G T A T C
T A C G G T A T G A C A G T A T C
Sequence 2
Word size 5, stringency 4
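The window/stringency variant relaxes the word filter: a dot is drawn when at least `stringency` positions inside the window match. The sketch below assumes that reading of "stringency" and reuses the two dotplot example sequences from the earlier slides; the function name is illustrative.

```python
def window_dotplot(seq1, seq2, window, stringency):
    """Return (i, j) window starts with at least `stringency` matching positions."""
    dots = []
    for i in range(len(seq1) - window + 1):
        for j in range(len(seq2) - window + 1):
            matches = sum(a == b for a, b in
                          zip(seq1[i:i + window], seq2[j:j + window]))
            if matches >= stringency:
                dots.append((i, j))
    return dots

seq1 = "TATCGAAGTA"
seq2 = "TATTCATGTA"

exact = window_dotplot(seq1, seq2, window=5, stringency=5)    # perfect words only
relaxed = window_dotplot(seq1, seq2, window=5, stringency=4)  # one mismatch allowed

print((5, 5) in exact)    # False: AAGTA vs ATGTA differ at one position
print((5, 5) in relaxed)  # True: 4 of 5 positions match
```

Setting the stringency equal to the window size gives back the plain word size method.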
61
window / stringency
PTHPLASKTQILPEDLASEDLTI
PTHPLAGERAIGLARLAEEDFGM
Substitution matrix filtering
Matrix: PAM250; window: 12; stringency: 9
Window scores shown: 11, 11, 7
62
Insertions / deletions in a Dotplot
Sequence 1
T A C G A C T A T G T A C G A T A T G
Sequence 2
T A C G A C T A T G T A C G A - T A T G
Word Size 1 Stringency 1
63
Insertions / deletions in a Dotplot
Sequence 1
T A C G A C T A T G T A C G A T A T G
Sequence 2
T A C G A C T A T G T A C G A - T A T G
Word Size 2 Stringency 2
64
Insertions / deletions in a Dotplot
Sequence 1
T A C G G G T A T G T A C G G T A T G
Sequence 2
T A C G G G T A T G T A C G G - T A T G
T A C G G G T A T G T A C G - G T A T G
T A C G G G T A T G T A C - G G T A T G
Word Size 1 Stringency 1
65
considerations
  • The window/stringency method is more sensitive
    than the wordsize method (ambiguities are
    permitted)
  • The smaller the window, the larger the weight of
    statistical (unspecific) matches
  • With large windows the sensitivity for short
    sequences is reduced
  • Insertions/deletions are not treated explicitly

66
Dot matrix: example of a repetitive DNA sequence
  • In addition to the main diagonal, there are
    several other diagonals
  • Only one half of the matrix is shown because of
    the symmetry

Perfect tool to visualize repeats
67
Problems with Dot matrices
  • Rely on visual analysis (necessarily merely a
    screen dump due to number of operations).
    Improvement: Dotter (Sonnhammer et al.)
  • Difficult to find optimal alignments
  • Difficult to estimate significance of alignments
  • Insensitive to conservative substitutions (e.g. L ↔
    I or S ↔ T) if no substitution matrix can be
    applied
  • Compares only two sequences (vs. multiple
    alignment)
  • Time consuming (1,000 bp vs. 1,000 bp: 10^6
    operations; 1,000,000 bp vs. 1,000,000 bp: 10^12
    operations)

81 days!
68
Sequence alignment of phage T7 and Thermus
aquaticus DNA polymerases
Exonuclease domain
Which are the catalytically important residues?
Polymerase domain
69
Multiple sequence alignment to identify
catalytically important residues
F discriminates between dNTP and ddNTP; Y does
not discriminate: important for DNA sequencing.
If not identical, what is significant?
Pairwise sequence alignment
  • Why sequence alignment and definitions
  • What is sequence alignment
  • How to align sequences
  • global vs. local alignment
  • gaps
  • Substitution matrix
  • Dot plot
  • Dynamic programming

71
Dynamic Programming
  • Automatic procedure that finds the best alignment
    with an optimal score depending on the selected
    parameters.
  • Needleman and Wunsch Algorithm - Global
    Alignment -
  • Smith and Waterman Algorithm - Local Alignment -

72
Needleman-Wunsch: global alignment
Align the sequences MGKP and MGPKKP
MGKP MGPKKP
MGKP MG.PKKP
MGKP MGPKKP
MGK..P MGPKKP
MGKP MGPKKP
?
MG..KP MGPKKP
MG.K.P MGPKKP
73
Generation of an alignment path matrix
Idea: build up an optimal alignment using
previous solutions for optimal alignments of
smaller subsequences
  • Construct matrix F indexed by i and j (one index
    for each sequence)
  • F(i,j) is the score of the best alignment between
    the initial segment x1...i (x1 up to xi) and the
    initial segment y1...j (y1 up to yj)
  • Build F(i,j) recursively, beginning with F(0,0) = 0

74
Generation of an alignment path matrix
  • We can calculate F(i,j) if F(i-1,j-1), F(i-1,j)
    and F(i,j-1) are known
  • Three possibilities
  • xi and yj are aligned: F(i,j) = F(i-1,j-1) +
    s(xi,yj)
  • xi is aligned to a gap: F(i,j) = F(i-1,j) - d
  • yj is aligned to a gap: F(i,j) = F(i,j-1) - d
  • The best score up to (i,j) will be the highest
    value of the three options

$$F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i, y_j) \\ F(i-1,j) - d \\ F(i,j-1) - d \end{cases}$$
s(xi,yj): score from the substitution matrix; d: gap
penalty
75
Problem align sequences MGKP and MGPKKP
Generation of an alignment path matrix
1. Gap penalties (d = gap penalty, here 6)
Boundary conditions: F(i,0) = F(i-1,0) - d; F(0,j) = F(0,j-1) - d
Path matrix with the boundary values filled in (i across, j down):
 j\i    0    1    2    3    4
 0      0   -6  -12  -18  -24
 1     -6
 2    -12
 3    -18
 4    -24
 5    -30
 6    -36
76
Problem align sequences MGKP and MGPKKP
Generation of an alignment path matrix
2. Substitution matrix (here BLOSUM62); d = gap penalty (here 6)
Path matrix so far (boundary values only): i = 0..4: 0, -6, -12, -18, -24; j = 1..6: -6, -12, -18, -24, -30, -36
Displaying the score matrix BLOSUM62:
    A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  B  Z  X
A   4
R  -1  5
N  -2  0  6
D  -2 -2  1  6
C   0 -3 -3 -3  9
Q  -1  1  0  0 -3  5
E  -1  0  0  2 -4  2  5
G   0 -2  0 -1 -3 -2 -2  6
H  -2  0  1 -1 -3  0  0 -2  8
I  -1 -3 -3 -3 -1 -3 -3 -4 -3  4
L  -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4
K  -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5
M  -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5
F  -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6
P  -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7
S   1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4
T   0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5
W  -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11
Y  -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7
V   0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
B  -2 -1  3  4 -3  0  1 -1  0 -3 -4  0 -3 -3 -2  0 -1 -4 -3 -3  4
Z  -1  0  0  1 -3  3  4 -2  0 -3 -3  1 -1 -3 -1  0 -1 -3 -2 -2  1  4
X   0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1 -1 -1 -1
77
Problem align sequences MGKP and MGPKKP
Generation of an alignment path matrix
2. Substitution matrix (here BLOSUM62); d = gap penalty (here 6)
Path matrix so far (boundary values only): i = 0..4: 0, -6, -12, -18, -24; j = 1..6: -6, -12, -18, -24, -30, -36
78
Problem align sequences MGKP and MGPKKP
Generation of an alignment path matrix
3. Path matrix generation: calculation of F(i,j); d = gap penalty (here 6)
Boundary values: i = 0..4: 0, -6, -12, -18, -24; j = 1..6: -6, -12, -18, -24, -30, -36
First cell filled in: F(1,1) = 5
79
Problem align sequences MGKP and MGPKKP
Generation of an alignment path matrix
3. Path matrix generation: calculation of F(i,j); d = gap penalty (here 6)
Boundary values: i = 0..4: 0, -6, -12, -18, -24; j = 1..6: -6, -12, -18, -24, -30, -36
Cells filled in so far: F(1,1) = 5, F(2,1) = -1
80
Problem align sequences MGKP and MGPKKP
Generation of an alignment path matrix
3. Path matrix generation: calculation of F(i,j); d = gap penalty (here 6)
Path matrix so far (i across, j down):
 j\i    0    1    2    3    4
 0      0   -6  -12  -18  -24
 1     -6    5   -1   -7  -13
 2    -12   -1   11
 3    -18
 4    -24
 5    -30
 6    -36
81
Calculating the possible paths
Diagonal: add blue value from field below.
Vertical/horizontal axis: add penalty score (here -6).
F(3,4) = max of:
  F(2,3) + s(xi,yj) = 5 + 5 = 10
  F(2,4) - d = -2 - 6 = -8
  F(3,3) - d = 10 - 6 = 4
F(3,4) = 10
82
Constructing the alignment
4. Alignment (traced back through the path matrix with axes i and j)
Diagonal: match. Vertical/horizontal: gap.
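The four steps (gap penalties, substitution matrix, path matrix, alignment) can be put together in a short global alignment routine. This is a sketch, not the lecturer's implementation: the dictionary `S` holds only the BLOSUM62 entries for M, G, K and P as they appear in the matrix shown earlier, the gap penalty is linear with d = 6, and the traceback prefers the diagonal when several predecessors tie, so it reports just one of the equally good alignments.

```python
# BLOSUM62 entries for the residues M, G, K, P only
S = {
    ("M", "M"): 5, ("M", "G"): -3, ("M", "K"): -1, ("M", "P"): -2,
    ("G", "G"): 6, ("G", "K"): -2, ("G", "P"): -2,
    ("K", "K"): 5, ("K", "P"): -1,
    ("P", "P"): 7,
}

def s(a, b):
    """Symmetric lookup in the substitution table."""
    return S[(a, b)] if (a, b) in S else S[(b, a)]

def needleman_wunsch(x, y, d=6):
    """Global alignment with linear gap penalty d; returns (F, aligned x, aligned y)."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):            # boundary: F(i,0) = F(i-1,0) - d
        F[i][0] = F[i - 1][0] - d
    for j in range(1, m + 1):            # boundary: F(0,j) = F(0,j-1) - d
        F[0][j] = F[0][j - 1] - d
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                          F[i - 1][j] - d,
                          F[i][j - 1] - d)
    # Traceback from the bottom-right corner back to F(0,0)
    ax, ay, i, j = [], [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F, "".join(reversed(ax)), "".join(reversed(ay))

F, ax, ay = needleman_wunsch("MGKP", "MGPKKP")
print(F[1][1], F[2][1], F[2][2])  # 5 -1 11, the first cells of the path matrix
print(F[4][6])                    # 11, the optimal global score
print(ax)                         # MG--KP
print(ay)                         # MGPKKP
```

Other alignments, such as MGK--P over MGPKKP, reach the same score of 11; which one is reported depends only on the tie-breaking order in the traceback.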
83
Dynamic Programming
  • Automatic procedure that finds the best alignment
    with an optimal score depending on the selected
    parameters.
  • Needleman and Wunsch Algorithm - Global
    Alignment -
  • Smith and Waterman Algorithm - Local Alignment -

84
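Smith-Waterman: local alignment
Two differences:
1. F(i,j) is set to 0 whenever all three options would be negative, so an alignment can start anywhere in the matrix
2. An alignment can end anywhere in the matrix
Example: Sequence 1: H E A G A W G H E E;
Sequence 2: P A W H E A E
Scoring parameters: log-odds ratios. Gap penalty: linear gap penalty
of 8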
85
Smith Waterman
        H   E   A   G   A   W   G   H   E   E
    0   0   0   0   0   0   0   0   0   0   0
P   0   0   0   0   0   0   0   0   0   0   0
A   0   0   0   5   0   5   0   0   0   0   0
W   0   0   0   0   2   0  20  12   4   0   0
H   0  10   2   0   0   0  12  18  22  14   6
E   0   2  16   8   0   0   4  10  18  28  20
A   0   0   8  21  13   5   0   4  10  20  27
E   0   0   6  13  18  12   4   0   4  16  26
High score: 28
AWGHE
AW.HE
Optimal local alignment
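The matrix and the optimal local alignment above can be reproduced with a short local alignment routine. This is a sketch under a stated assumption: the slides only say "log-odds ratios", and the pair scores in `S` are BLOSUM50 values for the residues involved, chosen because they reproduce the numbers in the matrix. Gaps are written as "-" rather than ".".

```python
# Assumed BLOSUM50 pair scores for the residues H, E, A, G, W, P
S = {
    ("P", "H"): -2, ("P", "E"): -1, ("P", "A"): -1, ("P", "G"): -2, ("P", "W"): -4,
    ("A", "A"): 5, ("A", "H"): -2, ("A", "E"): -1, ("A", "G"): 0, ("A", "W"): -3,
    ("W", "W"): 15, ("W", "H"): -3, ("W", "E"): -3, ("W", "G"): -3,
    ("H", "H"): 10, ("H", "E"): 0, ("H", "G"): -2,
    ("E", "E"): 6, ("E", "G"): -3,
}

def s(a, b):
    """Symmetric lookup in the substitution table."""
    return S[(a, b)] if (a, b) in S else S[(b, a)]

def smith_waterman(x, y, d=8):
    """Local alignment with linear gap penalty d; returns (score, aligned x, aligned y)."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(0,                                    # difference 1: floor at 0
                          F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                          F[i - 1][j] - d,
                          F[i][j - 1] - d)
            if F[i][j] > best:                                  # difference 2: end anywhere
                best, best_pos = F[i][j], (i, j)
    # Traceback from the best cell until a cell with score 0 is reached
    ax, ay = [], []
    i, j = best_pos
    while i > 0 and j > 0 and F[i][j] > 0:
        if F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))

score, ax, ay = smith_waterman("HEAGAWGHEE", "PAWHEAE")
print(score)  # 28
print(ax)     # AWGHE
print(ay)     # AW-HE
```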
86
extended Smith Waterman
  • To obtain multiple local alignments
  • delete regions around best path
  • repeat backtracking

87
extended Smith Waterman
        H   E   A   G   A   W   G   H   E   E
    0   0   0   0   0   0   0   0   0   0   0
P   0   0   0   0   0   0   0   0   0   0   0
A   0   0   0   5   0   5   0   0   0   0   0
W   0   0   0   0   2   0  20  12   4   0   0
H   0  10   2   0   0   0  12  18  22  14   6
E   0   2  16   8   0   0   4  10  18  28  20
A   0   0   8  21  13   5   0   4  10  20  27
E   0   0   6  13  18  12   4   0   4  16  26
88
extended Smith Waterman
        H   E   A   G   A   W   G   H   E   E
    0   0   0   0   0   0   0   0   0   0   0
P   0   0   0   0   0   0   0   0   0   0   0
A   0   0   0   5   0   5   0   0   0   0   0
W   0   0   0   0   2   0  20  12   4   0   0
H   0  10   2   0   0   0  12  18  22  14   6
E   0   2  16   8   0   0   4  10  18  28  20
A   0   0   8  21  13   5   0   4  10  20  27
E   0   0   6  13  18  12   4   0   4  16  26
High score: 21
HEA
HEA
Second best local alignment
89
Summary
  • Critical user choices are
  • Availability (web-server)
  • Speed
  • Gap penalty
  • Replacement matrix
  • Local vs. Global alignment

90
  • Mind the parameters you apply in sequence
    alignment and comparison !
  • If you can change parameters/search criteria, you
    need to know what the consequences will be