Algorithms for Pairwise Sequence Alignment - PowerPoint PPT Presentation

1
Algorithms for Pairwise Sequence Alignment
  • Craig A. Struble, Ph.D.
  • Marquette University

2
Overview
  • Pairwise Sequence Alignment
  • Dynamic Programming Solution
  • Global Alignment
  • Local Alignment

3
Goals
  • Define the pairwise sequence alignment problem
  • Understand the difference between global and
    local alignment
  • Understand the significance of pairwise sequence
    alignment

4
Pairwise Sequence Alignment
  • Problem
  • Given two sequences (DNA or AA), line them up
    in a biologically meaningful way.

Sequences:
HEAGAWGHEE
PAWHEAE
One possible alignment:
HEAGAWGHE-E
P-A--W-HEAE
5
Origins Of Similar Sequences
[Figure: a gene a undergoes duplication into a1 and a2 and speciation into Species 1 and Species 2, each carrying a1 and a2; further panels are labeled Transfer and Convergence.]
6
Why is comparing sequences important?
  • One of the fundamental phenomena explored by
    bioinformatics, around which many tools are built
  • Databases, data selection, etc.
  • Researchers compare sequences in order to
  • infer the function of genes
  • infer the structure of genes and gene products
  • infer the evolutionary history of genes and
    organisms
  • identify variation responsible for disease and
    other complex phenotypes

7
Why is this a challenging problem?
  • Similar sequences contain variation
  • Sequences mutate over time
  • Mutations are spontaneous changes in sequence
    caused by replication (or other) errors. Mutation
    rates vary, and can be influenced by many
    factors.
  • Sequence data contains errors
  • Sequencing techniques are imperfect

8
Four Basic Types of Mutations
  • Original: ACC TAT TTG CTG (Thr Tyr Leu Leu)
  • A. Substitution: ACC TCT TTG CTG (Thr Ser Leu Leu)
  • B. Deletion: ACC TAT TGC TG- (Thr Tyr Cys)
  • C. Insertion: ACC TAC TTT GCT G-- (Thr Tyr Phe Ala)
  • D. Inversion: ACC TTT ATG CTG (Thr Phe Met Leu)

9
Influences on Variation
  • Rates of mutations are influenced by
  • Substitution class (transition/transversion)
  • Coding site (synonymous/nonsynonymous)
  • Length of insertion/deletion
  • Codon usage bias
  • Nucleotide composition (GC content)
  • Stability/fate of variation depends upon
  • Drift
  • Selection (positive Darwinian/purifying, sexual,
    artificial)
  • Other mutations (reversions are not uncommon)

10
Homology vs. Similarity
  • Homology is a discrete state pertaining to
    relatedness - two genes are homologues if and
    only if they share a common gene ancestor
  • Orthologues: in different organisms, a result of
    speciation
  • Paralogues: in the same organism, a result of
    gene duplication
  • Homologues may have the same, similar, or
    different functions
  • Similarity is a continuous state describing the
    degree to which two homologues share
    characteristics
  • Generally a percentage
  • Distance estimates are also estimates of
    similarity

11
Kinds of Alignments
  • The local alignment includes only regions of
    identity (or strong similarity). This favors
    finding conserved regions.
  • The global alignment is stretched over the entire
    sequence length, including as many matches as
    possible.

12
When do you choose local vs. global?
  • Choose local alignment when
  • DNA sequences encode genes with introns
  • Amino acid sequences encoding proteins
  • Choose a global alignment when
  • Sequences can be seen to be very similar
  • Similar regions are in the same order and
    orientation

13
Methods Of Sequence Alignment
  • Dot matrix analysis
  • Dynamic programming algorithms
  • Word or k-tuple methods
  • BLAST, FASTA
  • Discussed later in the semester

14
Dot Matrix Analysis
  • Visualization of sequence similarity
  • First technique to use on pairs of sequences
  • Insertions/deletions
  • Inverted repeats
  • Does not show actual alignment
  • Optimal alignment not obvious

15
Simple Dot Matrix Example
  • For sequences
  • ATGCGTCGTT
  • ATCCGCGAT

[Figure: dot matrix with ATGCGTCGTT along one axis and ATCCGCGAT along the other]
  • Steps
  • Arrange sequences on a matrix
  • Place a dot anywhere nucleotides match
  • Diagonal stretches (here indicated by a line) are
    areas of alignment
  • More than one area of alignment can appear
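
The steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the method of any particular dot matrix program; the function names `dot_matrix` and `render` are made up for this example.

```python
def dot_matrix(x, y):
    """Return a grid where cell [i][j] is True when x[i] matches y[j]."""
    return [[a == b for b in y] for a in x]


def render(x, y):
    """Draw the grid as text: y across the top, x down the side."""
    lines = ["  " + " ".join(y)]
    for a, row in zip(x, dot_matrix(x, y)):
        lines.append(a + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


if __name__ == "__main__":
    # The two sequences from the slide.
    print(render("ATGCGTCGTT", "ATCCGCGAT"))
```

Diagonal runs of `*` in the printed grid correspond to the areas of alignment mentioned in the last two bullets.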

16
DNA sequence matrix: Noisy
  • Sequence alignment of 2 long DNA sequences
  • Many random matches make it difficult or
    impossible to find areas of alignment
  • Using a window and stringency setting, we can
    eliminate some of the noise

17
DNA sequence matrix: Less noisy
  • To decrease noise of random matches, a window of
    11 nucleotides was defined, and a dot placed when
    at least 7 matches occur
  • Window 11, Stringency 7
  • Some diagonal lines begin to appear
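
A window/stringency filter of this kind might look like the sketch below. It assumes the window starts at position (i, j) and runs down the diagonal; real programs may center the window instead, so treat the anchoring and the name `windowed_dot_matrix` as assumptions made for illustration.

```python
def windowed_dot_matrix(x, y, window, stringency):
    """Place a dot at (i, j) when at least `stringency` of the `window`
    positions x[i+k], y[j+k] (k = 0..window-1) match."""
    rows = len(x) - window + 1
    cols = len(y) - window + 1
    grid = []
    for i in range(rows):
        row = []
        for j in range(cols):
            matches = sum(x[i + k] == y[j + k] for k in range(window))
            row.append(matches >= stringency)
        grid.append(row)
    return grid


if __name__ == "__main__":
    # Window 11, stringency 7 would be used on long sequences;
    # a small setting is used here so the short example still has dots.
    grid = windowed_dot_matrix("ATGCGTCGTT", "ATCCGCGAT", window=3, stringency=2)
    for row in grid:
        print("".join("*" if hit else "." for hit in row))
```

With `window=1` and `stringency=1` the function reduces to the plain dot matrix, which is why that setting is the noisiest.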

18
DNA sequence matrix: Less noisy
  • To decrease noise of random matches, a window of
    23 nucleotides was defined, and a dot placed when
    at least 15 matches occur
  • Window 23, Stringency 15
  • A clear diagonal line appears, indicating an area
    of alignment
  • A few other areas are still apparent - probably
    long random matches

19
Protein sequence matrix: Noisy
  • Sequence comparison of amino acid sequence (same
    gene as previous example)
  • Window 1, stringency 1
  • To decrease noise due to random matches,
    conditions can be tightened

20
Protein sequence matrix: Less noisy
  • Same sequence comparison, tighter analysis
    conditions
  • Window 3, stringency 2
  • A single aligned region is visible, with a number
    of areas of random matches

21
Evidence of repeats in a DNA sequence
Window 1, stringency 1
Window 23, stringency 7
22
Programs for Dot Matrix Analysis
  • DNA Strider (Macintosh)
  • Dotter (Unix/Linux, X-Windows)
  • In the lab
  • COMPARE, DOTPLOT in GCG
  • PLALIGN (FASTA)
  • Plots alignments found by DP method
  • Dotlet
  • http://www.isrec.isb-sib.ch/java/dotlet/Dotlet.html

23
Optimal Sequence Alignments
  • Example
  • Which one is better?

Sequences:
HEAGAWGHEE
PAWHEAE
Alignment 1:
HEAGAWGHE-E
P-A--W-HEAE
Alignment 2:
HEAGAWGHE-E
--P-AW-HEAE
24
Scoring
  • To compare two sequence alignments, calculate a
    score
  • Scoring matrix
  • Provide a score for each match/mismatch
  • Sometimes a mismatch is acceptable
  • PAM, BLOSUM are two classes of scoring matrices
  • Gap penalty
  • Initiating a gap
  • Gap extension penalty
  • Extending a gap

25
Scoring Matrix Example
  • Gap penalty: -8
  • Gap extension: -4

HEAGAWGHE-E
--P-AW-HEAE
Column scores: (-8) (-4) (-1) (-8) 5 15 (-8) 10 6 (-8) 6
Total: 5
Exercise: Calculate for
HEAGAWGHE-E
P-A--W-HEAE
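
Scoring an existing alignment column by column can be sketched as follows. The gap penalties (-8 to open, -4 to extend) are the ones on this slide; the substitution scores (+5 for a match, -1 for a mismatch) and the toy alignment are hypothetical stand-ins for a real scoring matrix such as PAM or BLOSUM, and `score_alignment` is a name invented for this example.

```python
def score_alignment(a, b, sub, gap_open, gap_extend):
    """Sum column scores of two equal-length aligned strings.

    The first '-' of a run in one row costs gap_open, each further '-'
    in the same run costs gap_extend; other columns use sub(ca, cb).
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    prev_gap_row = None  # row (0 or 1) holding '-' in the previous column
    for ca, cb in zip(a, b):
        if ca == "-" or cb == "-":
            gap_row = 0 if ca == "-" else 1
            total += gap_extend if prev_gap_row == gap_row else gap_open
            prev_gap_row = gap_row
        else:
            total += sub(ca, cb)
            prev_gap_row = None
    return total


def toy_sub(ca, cb):
    """Hypothetical substitution scores, not taken from PAM or BLOSUM."""
    return 5 if ca == cb else -1


if __name__ == "__main__":
    print(score_alignment("GATTACA", "GC--ACA", toy_sub, -8, -4))
```

For the exercise, the same function can be reused once `toy_sub` is replaced by a lookup into the scoring matrix used in class.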
26
Formal Description
  • Problem: PairSeqAlign
  • Input: Two sequences x, y
  • Scoring matrix s
  • Gap penalty d
  • Gap extension penalty e
  • Output: The optimal sequence alignment

27
How Difficult Is This?
  • Consider two sequences of length n
  • There are
    $\binom{2n}{n} = \frac{(2n)!}{(n!)^2} \approx \frac{2^{2n}}{\sqrt{\pi n}}$
    possible global alignments, and we need to find
    an optimal one from amongst those!

28
So what?
  • So at n = 20, we have over 120 billion possible
    alignments
  • We want to be able to align much, much longer
    sequences
  • Some proteins have 1000 amino acids
  • Genes can have several thousand base pairs
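
A quick check of the growth is sketched below. It assumes the count on the previous slide is the central binomial coefficient C(2n, n); that is an assumption, chosen because it gives about 1.4 x 10^11 at n = 20, in line with "over 120 billion". The function name is made up for this example.

```python
from math import comb


def count_global_alignments(n):
    """Central binomial coefficient C(2n, n), assumed here to be the
    number of global alignments of two sequences of length n."""
    return comb(2 * n, n)


if __name__ == "__main__":
    for n in (5, 10, 20, 50):
        print(n, count_global_alignments(n))
```

The count roughly quadruples each time n grows by one, so enumerating alignments is hopeless well before n reaches the length of a typical protein.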

29
Dynamic Programming
  • General algorithmic development technique
  • Reuses the results of previous computations
  • Store intermediate results in a table for reuse
  • Look up in table for earlier result to build from

30
Global Alignment
  • Needleman-Wunsch (1970)
  • Idea: Build up optimal alignment from optimal
    alignments of subsequences
  • Three ways to align x1..i with y1..j

Extend both strings at the same time:
I G A xi
L G V yj
yj already aligned, align xi with a gap:
A I G A xi
G V yj - -
xi already aligned, align yj with a gap:
G A xi - -
S L G V yj
31
Global Alignment
  • Notation
  • xi: ith letter of string x
  • yj: jth letter of string y
  • x1..i: prefix of x from letters 1 through i
  • F: matrix of optimal scores
  • F(i,j): optimal score lining up x1..i
    with y1..j
  • d: gap penalty
  • s: scoring matrix

32
Global Alignment
  • The work is to build up F
  • Initialize F(0,0) = 0, F(i,0) = id, F(0,j) = jd
  • Fill from top left to bottom right using the
    recursive relation

33
Global Alignment
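$$F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i, y_j) & \text{move ahead in both} \\ F(i-1,j) + d & x_i \text{ aligned to gap} \\ F(i,j-1) + d & y_j \text{ aligned to gap} \end{cases}$$
While building the table, keep track of where each
optimal score came from (reverse arrows).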
34
Example
35
Completed Table
36
Traceback
  • Trace arrows back from the lower right to top
    left
  • Diagonal: both
  • Up: upper gap
  • Left: lower gap

HEAGAWGHE-E
--P-AW-HEAE
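
The table fill and traceback can be sketched as below. This is a minimal version with a single linear gap penalty d, taken to be a negative number as in the scoring example. The +1/-1 substitution scores and the short test sequences are hypothetical, and instead of storing arrows the traceback re-derives each move from the table, which is a simplification of what the slides describe.

```python
def needleman_wunsch(x, y, score, d):
    """Global alignment with linear gap penalty d (a negative number).

    Returns (optimal score, aligned x, aligned y).
    """
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * d
    for j in range(1, m + 1):
        F[0][j] = j * d
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + score(x[i - 1], y[j - 1]),  # both move
                F[i - 1][j] + d,                              # x[i-1] vs gap
                F[i][j - 1] + d,                              # y[j-1] vs gap
            )
    # Traceback from the lower right corner to the top left.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + score(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[n][m], "".join(reversed(ax)), "".join(reversed(ay))


def toy_score(a, b):
    """Hypothetical scores: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1


if __name__ == "__main__":
    print(needleman_wunsch("ACGT", "AGT", toy_score, -1))
```

When several alignments share the optimal score, the order of the checks in the traceback decides which one is returned.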
37
Summary
  • Uses recursion to fill in intermediate results
    table
  • Uses O(nm) space and time
  • O(n^2) algorithm
  • Feasible for moderate sized sequences, but not
    for aligning whole genomes.

38
Local Alignment
  • Smith-Waterman (1981)
  • Another dynamic programming solution

39
Example
40
Traceback
Start at the highest score and trace back to the first 0
AWGHE
AW-HE
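
A minimal local alignment sketch in the same style follows. It assumes the usual Smith-Waterman change to the global recurrence, namely that every cell is floored at 0; the +2/-1 substitution scores, the gap penalty of -2, and the test sequences are hypothetical choices for illustration.

```python
def smith_waterman(x, y, score, d):
    """Local alignment with linear gap penalty d (a negative number).

    Returns (best score, aligned piece of x, aligned piece of y).
    """
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                0,  # start a fresh local alignment here
                F[i - 1][j - 1] + score(x[i - 1], y[j - 1]),
                F[i - 1][j] + d,
                F[i][j - 1] + d,
            )
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)
    # Trace back from the highest score until the first 0.
    ax, ay = [], []
    i, j = best_pos
    while i > 0 and j > 0 and F[i][j] > 0:
        if F[i][j] == F[i - 1][j - 1] + score(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] + d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


def toy_score(a, b):
    """Hypothetical scores: +2 for a match, -1 for a mismatch."""
    return 2 if a == b else -1


if __name__ == "__main__":
    print(smith_waterman("TTACGTT", "GGACGGG", toy_score, -2))
```

The negative mismatch score matters: if random pairings scored positively on average, the best local alignment would simply stretch across both sequences.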
41
Summary
  • Similar to global alignment algorithm
  • For this to work, expected match with random
    sequence must have negative score.
  • Behavior is like global alignment otherwise
  • Similar extensions for repeated and overlap
    matching
  • Care must be given to gap penalties to maintain
    O(nm) time complexity

42
Scoring Matrices
  • Substitutions
  • Models of substitutions
  • PAM
  • BLOSUM
  • Gap penalties

43
DNA
44
Transitional and Transversional Nucleotide
Substitutions
  • α and β are the rates of transitional and
    transversional substitutions, respectively
  • Generally, α > β
  • Possible substitutions (total 16)
  • Identical (freq O): 4
  • Transitions (P): 4
  • Transversions (Q): 8
  • Giving us
  • p = P + Q
  • R = P/Q
  • R is usually between 0.5 and 2 for nuclear
    genes, higher for mitochondrial genes (up to 15)
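
The quantities P, Q, p and R can be estimated from two aligned, ungapped DNA sequences as in the sketch below. The function names and the example sequences are invented for illustration; the only biology assumed is that A and G are purines and C and T are pyrimidines.

```python
PURINES = {"A", "G"}


def classify(a, b):
    """Label a nucleotide pair as identical, transition or transversion."""
    if a == b:
        return "identical"
    # Purine <-> purine or pyrimidine <-> pyrimidine is a transition.
    if (a in PURINES) == (b in PURINES):
        return "transition"
    return "transversion"


def substitution_summary(x, y):
    """Return (P, Q, p, R) for two equal-length, ungapped sequences.

    R is None when no transversions are observed.
    """
    if len(x) != len(y):
        raise ValueError("sequences must have equal length")
    kinds = [classify(a, b) for a, b in zip(x, y)]
    P = kinds.count("transition") / len(x)
    Q = kinds.count("transversion") / len(x)
    R = P / Q if Q > 0 else None
    return P, Q, P + Q, R


if __name__ == "__main__":
    print(substitution_summary("ACGTACGT", "GCGTTCGA"))
```

Running `classify` over all 16 ordered nucleotide pairs reproduces the 4 / 4 / 8 split listed on the slide.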

45
Synonymous and Non-synonymous substitutions
Original: ACC TAT TTG CTG (Thr Tyr Leu Leu)
Non-synonymous: ACC TCT TTG CTG (Thr Ser Leu Leu)
Synonymous: ACC TAC TTG CTG (Thr Tyr Leu Leu)
  • Synonymous substitutions more likely to occur
  • Preserve AA

46
Categories of Amino Acids
Grouped according to properties of side chain
47
Amino Acid Substitutions
  • Tend to preserve chemical similarity
  • Tend to preserve structure
  • Tend to preserve function
  • More frequent in non-functional domains

48
Models of Substitution
  • Percent Accepted Mutation (PAM)
  • Dayhoff 1978
  • Accepted mutation: changes accepted by natural
    selection
  • PAM1 represents evolutionary divergence where 1%
    of amino acids change
  • Blocks Amino Acid Substitution Matrices (BLOSUM)
  • Henikoff and Henikoff 1992
  • Observed AA substitutions in conserved AA blocks
  • Maximum level of identity: BLOSUM62 represents
    62% identity

49
PAM
  • Markov model

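[Figure: Markov model with one state for each amino acid (S, T, C and P are shown). Arrows carry the probability of transitioning from one state to another; the arrows between S and T are labeled pst and pts.]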
50
PAM
  • Assumes substitutions are independent
  • pxy is calculated from observations
  • 1572 changes in 71 groups of proteins
  • Organized into phylogenetic trees
  • Changes counted
  • Divided by normalizing factor
  • The probabilities are stored in a matrix
  • Probability form
  • PAM1 represents 10 million years (my) of
    evolutionary distance
  • PAMN is derived as (PAM1)^N because a Markov
    model is used

51
PAM1 for DNA
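[Figure: each nucleotide stays unchanged with probability 0.99 and changes to each of the other three nucleotides with probability 0.00333.]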
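
Under the Markov assumption, a matrix for a longer evolutionary distance is obtained by multiplying PAM1 by itself. The sketch below assumes the DNA values shown here (0.99 on the diagonal, 0.00333 elsewhere) and uses plain Python lists; `mat_mul` and `mat_pow` are helper names made up for this example.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, m)
    return result


# PAM1 for DNA: rows and columns in the order A, C, G, T.
PAM1_DNA = [[0.99 if i == j else 0.00333 for j in range(4)] for i in range(4)]

if __name__ == "__main__":
    pam250 = mat_pow(PAM1_DNA, 250)
    for row in pam250:
        print(" ".join(f"{p:.4f}" for p in row))
```

As N grows, the diagonal entries shrink and all entries drift toward the same value, which is why high-numbered PAM matrices suit distant sequences.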
52
BLOSUM
  • 2000 conserved amino acid patterns
  • Blocks: ungapped patterns
  • 3-60 AA long
  • >500 families of related proteins
  • Software
  • MOTIF (H. Smith et al. 1990)
  • PROTOMAT (Henikoff and Henikoff)

53
Computing BLOSUM Scores
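  • Consider all pairs (don't know ancestor)
  • $f_{AA} = 3 + 2 + 1 = 6$, $f_{AL} = 4$, $f_{AS} = 4$, $f_{LS} = 1$
  • Calculate frequency of occurrence
  • $q_{AA} = f_{AA} / (f_{AA} + f_{AL} + f_{AS} + f_{LS}) = 0.4$
  • Calculate expected frequency of being in a pair
  • $p_A = q_{AA} + q_{AS}/2 + q_{AL}/2 \approx 0.67$
  • Calculate expected frequency of a pair
  • $e_{AA} = p_A \times p_A \approx 0.44$
  • Matrix entry for pair
  • $m_{AA} = q_{AA} / e_{AA} = 0.9$

Example block column: A L A S A A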
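
The arithmetic for this one-column example can be reproduced with the sketch below. It assumes the block column holds the six residues A, L, A, S, A, A, which gives the pair counts 6, 4, 4 and 1 used on this slide; the function names are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def pair_frequencies(column):
    """Count every unordered residue pair in a column and normalize."""
    counts = Counter(tuple(sorted(pair)) for pair in combinations(column, 2))
    total = sum(counts.values())
    return counts, {pair: c / total for pair, c in counts.items()}


def residue_frequency(q, residue):
    """Expected frequency of a residue in a pair: q_aa + sum of q_ab / 2."""
    total = 0.0
    for (a, b), freq in q.items():
        if residue in (a, b):
            total += freq if a == b else freq / 2
    return total


if __name__ == "__main__":
    counts, q = pair_frequencies("ALASAA")
    p_a = residue_frequency(q, "A")
    e_aa = p_a * p_a
    print(counts)
    print(q[("A", "A")], p_a, e_aa, q[("A", "A")] / e_aa)
```

A real BLOSUM calculation sums these counts over every column of every block before normalizing; a single column is used here only to mirror the slide.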
54
Log Odds Scoring
  • Each of the previous matrices is converted to
    a log odds matrix
  • DP algorithm based on addition
  • $\log(xy) = \log(x) + \log(y)$
  • Compares real occurrence with random occurrence.
  • BLOSUM
  • $s_{AA} = 2 \log_2(q_{AA}/e_{AA}) = -0.304$ (will be rounded)
  • PAM1 DNA (uniform)
  • $s_{CT} = \log_2(p_C M_{CT} / (p_C p_T))$
  • $= \log_2(0.25 \times 0.00333 / 0.25^2)$
  • $= -6.23$
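
Both conversions on this slide are a ratio of observed to expected frequency followed by a base-2 logarithm. The sketch below recomputes them; the helper name `log_odds` and its `scale` parameter (2 for the BLOSUM example, 1 for the PAM1 DNA example) are assumptions introduced for illustration.

```python
from math import log2


def log_odds(observed, expected, scale=1):
    """Return scale * log2(observed / expected)."""
    return scale * log2(observed / expected)


if __name__ == "__main__":
    # BLOSUM example: q_AA = 0.4, e_AA = (2/3)^2, scaled by 2.
    print(log_odds(0.4, (2 / 3) ** 2, scale=2))
    # PAM1 DNA example: p_C * M_CT against p_C * p_T with uniform 0.25.
    print(log_odds(0.25 * 0.00333, 0.25 * 0.25))
```

Because the scores are logarithms, adding them along an alignment corresponds to multiplying the underlying odds ratios, which is what lets the DP algorithm work by addition.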

55
The PAM250 Matrix
Each matrix value is calculated by first dividing
the frequency of change, for each amino acid
pair, in related proteins separated by one step
in an evolutionary tree by the probability of a
chance alignment based on the frequency of the
amino acids. The ratios are expressed as
logarithms to the base 10 (approx. 1/3 bit
values).
  • Note
  • High values on diagonal
  • High values for similar groups

56
The Blosum62 Matrix
Each entry is the actual frequency of occurrence
of the amino acid pair in the blocks database,
clustered at the 62% level, divided by the
expected probability of occurrence. The expected
value is calculated from the frequency of
occurrence of each of the two individual amino
acids in the blocks database, and provides a
measure of a chance alignment of the two amino
acids. The actual/expected ratio is expressed as
a log odds. A zero score means that the frequency
of the amino acid pair in the database was as
expected by chance, a positive score that the
pair was found more often than by chance, and a
negative score that the pair was found less often
than by chance.
57
Selecting Matrices
  • PAM
  • Mutational model of evolution
  • Tracks evolutionary origins of proteins/sequences
  • Use lower numbers for evolutionarily close
    sequences, higher numbers for distant sequences
  • BLOSUM
  • No model of evolution, conserved AA motifs
  • Designed to find conserved domains
  • Similar sequences, use higher numbers,
  • Divergent sequences, use lower numbers.

58
Gap Penalties
  • Recall d is gap opening penalty, e is gap
    extension penalty
  • Total gap penalty for a gap of length x:
    $w_x = d + e(x-1)$
  • In order to make things work properly, need
    affine gap function (Smith et al. 1981)
  • $w_x = dx$
  • Any affine function works
  • For the linear function above, $e = d$
  • Typical gap penalties (Mount p.113)
  • BLOSUM50: d = 15, e = 8-15
  • PAM250: d = 15, e = 5-15
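
The total penalty for a gap of length x can be written as a one-line function. The sketch below uses d = -8 and e = -4 from the earlier scoring example as sample values; `gap_penalty` is a name chosen for this illustration.

```python
def gap_penalty(length, d, e):
    """Affine gap penalty: d for opening plus e for each extra position."""
    if length < 1:
        raise ValueError("gap length must be at least 1")
    return d + e * (length - 1)


if __name__ == "__main__":
    for x in (1, 2, 3, 10):
        # Affine (d = -8, e = -4) next to linear (e = d = -8).
        print(x, gap_penalty(x, -8, -4), gap_penalty(x, -8, -8))
```

Setting e equal to d recovers the linear penalty dx, so the linear case is just one member of the affine family.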

59
Next Time
  • Put on your math/stats caps
  • Significance of scores
  • Bayesian statistics
  • Bayes block alignment