Algorithms for Sequence Alignments - PowerPoint PPT Presentation

1 / 37
About This Presentation
Title:

Algorithms for Sequence Alignments

Description:

Dynamic Programming: Smith-Waterman, Needleman-Wunsch. Methods for Multiple Sequence Alignment ... local alignment: Smith-Waterman. statistical significance can ... – PowerPoint PPT presentation

Number of Views:319
Avg rating:3.0/5.0
Slides: 38
Provided by: z149
Category:

less

Transcript and Presenter's Notes

Title: Algorithms for Sequence Alignments


1
Algorithms forSequence Alignments
  • A. Brüngger, Labhead BioinformaticsNovartis
    Pharma AG
  • adrian.bruengger_at_pharma.novartis.com

2
Algorithms for Sequence Alignments
  • Definition of Sequence Alignment and Origins of
    Sequence Similarity
  • Global Alignment
  • Local Alignment
  • Alignment of Pairs of Sequences
  • Alignment of Multiple Sequences
  • Methods for Pairwise Sequence Alignments
  • Dot matrix
  • Scoring Matrices
  • Dynamic Programming Smith-Waterman,
    Needleman-Wunsch
  • Methods for Multiple Sequence Alignment
  • Dynamic Programming
  • Progressive multiple alignments CLUSTALW, PILEUP
  • Database Searching for Similar Sequences
  • FASTA, BLAST
  • Patscan
  • Hidden Markov Models

Outline follows David W. Mount, "Bioionformatics
- Sequence and Genome Analysis Cold Spring
Harbour Laboratory Press, 2001. Online
http//www.bioinformaticsonline.org
3
Rational for Sequence Analysis, Origins of
Sequence Similarity
Similar sequence leads to similar function
Sequence Analysis as the basic tool to
discover functional, structural,evolutionary
information in biological sequences
Sequence A
Sequence B
Evolutionary relationship between two similar
sequences and a possible common ancestor. The
number of steps to convert one sequence into the
other is the "evolutionary" distance between the
sequences (x y). Usually, the ancestor sequence
is not available, only (x y) can be computed.
y Steps
x Steps
common ancestor sequence
4
Origins of Homology ? Significance of Sequence
Alignments
  • Possible Origins of Sequence Homology
  • orthologs (panel A and B) a1 in species I and a1
    in species II (same ancestor!)
  • paralogs (panel A and B) a1 and a2 (arose from
    gene duplication event)
  • analogs (panel C) different genes converge to
    same function by different evolutionary paths
  • transfer of genetic material (panel D) between
    different species
  • Homology vs. Similarity
  • Similarity can be computed (by sequence
    alignments)
  • Homology is deduced (e.g. from similarity, but
    also from other evidence!)

5
Definition of Sequence Alignment
  • Computational procedure (algorithm) for
    comparing two/many sequences
  • identify series of identical residues or patterns
    of identical residuesthat appear in the same
    order in the sequences
  • visualized by writing sequences as follows
  • sequence alignment is an optimiztion
    problembringing as many identical residues as
    possible into corresponding positions

MLGPSSKQTGKGS-SRIWDN
MLN-ITKSAGKGAIMRLGDA
Pairwise Global Alignment (over whole length of
sequences)
GKG GKG
Pairwise Local Alignment (similar parts of
sequences)
6
Pairwise Sequence Alignment
  • Alignment Score Quantify the Quality of an
    alignment
  • simple scoring system (eg. for DNA sequences)
  • when two identical residues are aligned 1
  • when two non-identical residues are aligned -1
  • when a gap is introduced -1
  • total score of alignment is the sum of the scores
    in each position

agaag-tagattcta aggaggtag-tt-ta
  • Compute Score
  • 11 matches
  • 1 mismatch
  • 3 gaps
  • Score 11 - 1 -3 7

Alignment is 'optimal' when assigned score is
maximal Note Scoring scheme defines 'optimal'
alignment
7
Algorithms for Pairwise Sequence Alignments Dot
Plot
  • Algorithm introduced by Gibbs and McIntyre
    (1970)
  • write one sequence on top from left to right
  • write other sequence on the left from top to
    bottom
  • place a dot whenever two residues are identical
    variants use colors
  • visual inspection (variants identify diagonals
    with many dots

a g a c c t a g a g
a c c
t
8
Algorithms for Pairwise Sequence Alignments Dot
Plot
  • Example
  • Dot Plots of DNA and Protein of Phage l cI
    (horizontal) and P22 c3 (vertical)

DNA
Protein
Removing noise Plot a dot only if 7
("stringency") out of the next 11 ("window size")
residues are identical!
9
Algorithms for Pairwise Sequence Alignments Dot
Plot
  • Example Human LDL receptor against itself.

Window size 1, stringency 1.
Window size 23, stringency 7.
"Repeats" in first part of sequence.
10
Algorithms for Pairwise Sequence Alignments Dot
Plot
  • Strengths
  • visualization of sequence similarity
  • finding direct and inverted repeats in sequences
  • finding self-complementary regions in RNA
    (secondary structures)
  • simple to compute, simple to visualize
  • Implementations
  • DNA Strider (Macintosh)
  • DOTTER (UNIX X-Windows)
  • GCG "COMPARE", "DOTPLOT"
  • online SIB http//www.isrec.isb-sib.ch/java/dotle
    t/Dotlet.html
  • Computational Complexity
  • two sequences of length n, m O(nm)

11
Scoring Matrices for Proteins
  • Observation Protein (Function, Structure) is
    preserved when substitutions in certain AAs
    happen
  • scoring system less simple in case of proteins
  • Example The same protein in three different
    species

A W T V A S A V R T S I A Y T V A S A V R T S I A
W T V A A A V L T S I
Organism A Organism B Organism C
A
B
C
Therefore Assigning L to R scores higher than
assigning it to another AA
W to Y
L to R A to S
12
Scoring Matrices for Proteins
  • Generalization Dayhoff PAM Weight Matrices
    (percent accepted mutation)
  • observed mutations during a unit of evolutionary
    time(1 PAM time that is required that an AA is
    changed with probability 1)
  • initial PAM matrix ("PAM1") derived from 1572
    changes in 71 groups of protein sequences that
    are at least 85 similar
  • observed in proteins that have the same function
    gt "accepted" mutations

C S T P .. C fcc fcs fct fcp S fsc
fss fst fsp . .
13
Scoring Matrices for Proteins
  • Extrapolation to longer evolutionary distances
    possible!

M
PAM1 p(HM) 0 PAM2 p(HM) p(HK)p(KM) p(HR
)p(RM)
R
K
H
14
Scoring Matrices for Proteins
  • For each pair of residues, a score is given by a
    "scoring matrix"
  • Scoring of two pairs of residues depends on
  • frequency of the two residues
  • exchange that tends to preserves function should
    have high score
  • Scoring matrices are constructed by empirically
    evaluating (manually constructed) alignments of
    closely related proteins
  • PAM
  • introduced by Margarete Dayhoff, 1978
  • inspecting 1572 changes in 71 groups of protein
    sequences
  • different matrices for varying degree of
    similarity (or, evolutionary distance)
  • BLOSUM
  • inspection of about 2'000 conserved AA-stretches,
    "BLOCKS"

Sequence 1 V D S - C Y Sequence 2 V E
S L C Y score 4 2 4 -2 4 4 Total
16
15
Dynamic Programming Approach to Sequence
Alignments
  • Description by Example
  • input two AA sequences VDSCY and
    VESLCY scoring matrix match 4, mismatch
    2, gap -2
  • Some possible alignments and their
    score VDSCY VD-SCY VDS-CY
    VESLCY V-ESLCY VESLCY 14 8 16
  • Observation Longer alignments can be derived
    from shorter

new score old score score of new
pair VDS-CY VDS-C Y VESLCY
VESLC Y 16 12
4 VDS-C VDS- C VESLC
VESL C 12 8
4
16
Dynamic Programming Approach to Sequence
Alignments
  • Alignment Path in matrix from "top left" to
    "bottom right"

Seq 1 V D S - C Y Seq 2 V E S L C Y
possible extensions of "current" alignment ( )
  • Three situations for extending previous
    alignment

introduce gap in sequence 1
introduce gap in sequence 2
17
Dynamic Programming Approach to Sequence
Alignments
  • In each matrix element, we can write the score of
    the best alignment up to this element, and a
    pointer to the shorter alignment it is derived
    from

V
D
S
Y
C
Goal is to compute this path!
4
V
6
E
Seq 1 V D S - C Y Seq 2 V E S L C
Y 4 2 4 -2 4 4
10
S
L
C
12
match 4, mismatch 2, gap -2
16
Y
V
D
S
Y
C
two possibilities 4 (E-gtS) gap 4 2
(E-gtS) 4 choose one (first)
V
D
S
Y
C
4
fill firstrow/column
V
2
2
2
2
4
V
2
2
2
2
2
6
4
4
4
E
2
E
2
4
S
2
S
three possibilities 4 (L-gtD) 2 gaps 2 2
(L-gtD) 1 gap 2 2 (L-gtD)
4 choose third (score max)
L
2
4
L
2
C
2
4
C
2
Y
2
4
Y
2
18
Dynamic Programming Approach to Sequence
Alignments
  • Formally
  • ai, bj Residue of Sequence A at position i and
    Residue of Sequence B at position j
  • wx Penalty for introducing gap of length x
  • Sij Score of alignment at position (i, j)
  • s(a-gtb) Score for aligning a and b (known from
    scoring matrix)
  • Iterate process until whole matrix is filled with
    scores and back-pointers
  • Choose maximum score in last column or row
  • Follow pointers to construct alignment

Si-1, j-1 s(ai -gt bi)
Sij max
max x³1 (Si-x,j - wx)
max y³1 (Si,j-y - wy)
19
Dynamic Programming Approach to Sequence
Alignments
  • Global AlignmentNeedleman and Wunsch(1970)
  • Local AlignmentSmith-Waterman(minor
    modification)

20
Dynamic Programming Approach to Sequence
Alignments
  • Implementations available on the web

21
Dynamic Programming Approach to Sequence
Alignments
  • Strengths
  • finds optimal (mathematically best) alignment
  • suited for both, local and global alignments
  • global alignment Needleman-Wunsch
  • local alignment Smith-Waterman
  • statistical significance can be
    attached(recompute alignment when one or both
    sequences are randomly changed)
  • Implementations
  • LALIGN
  • GCG "GAP" (global) and BESTFIT (local)
  • ...
  • Complexity
  • two sequences of length n, m
  • time O(nm), space O(nm)
  • space is crucial
  • example matrix element 4 bytes, nm10000, space
    requirement 400MB
  • can be improved trade time for space

22
Multiple Sequence Alignment
  • When aligning k (kgt2) sequences, the dynamic
    programming idea can theoretically be used
  • instead of a two-dimensional scoring schema, a
    k-dimensional one is used
  • instead of two-dimensional residue substitution
    matrix, a k-dimensional one is used
  • However, this is not feasible in practical terms
  • size of these matrices increase too rapidly!
  • example 5 sequences, 100 residues each, matrix
    size 1005 1010in terms of computer memory
    10 GB
  • ConsequenceOptimal alignment for multiple
    sequences is not computable!
  • Approximative methods used (heuristics)

23
Multiple Sequence Alignment Progressive methodes
  • Idea for progressive methodes
  • step 1 Align two of the sequences
  • step 2 Compute "consensus" sequence
  • step 3 Align a third sequence to the consensus
  • step 4 Repeat steps 2 and 3

24
Multiple Sequence Alignment Progressive
methodes, CLUSTALW
  • Basic steps
  • step 1 Perform pairwise sequence alignments of
    all possible pairs
  • step 2 Use scores to produce a phylogenetic
    tree
  • step 3 align the sequences sequentially, guided
    by the tree
  • the most closely related sequences are aligned
    first
  • other sequences or groups of sequences are added
  • Example Phylogenetic tree for 4 sequences

A B C D A 100 90 85 90 B 100 95
80 C 100 97 D
100 Percentage Identities inpairwise alignment
A
B
90
90
80
85
95
C
D
97
Join sequences with highest similarity
25
Multiple Sequence Alignment Progressive
methodes, CLUSTALW
  • Computation of Phylogenetic tree

Join sequences with highest similarity
A B C D A 100 95 80 90 B 100 95
85 C 100 97 D
100 Percentage Identities inpairwise alignment
A
A
B
B
95
95
90
85
80
(8595)/2 90
(8090)/2 85
95
C
D
97
C, D
Join sequences with highest similarity
A, B
(8590)/2 87.5
A
B
C, D
C
D
26
Multiple Sequence Alignment Progressive
methodes, CLUSTALW
  • Example SevenGlobins fromSWISSPROT

27
Multiple Sequence Alignment Progressive
methodes, CLUSTALW
  • Available Implementations

28
Riddle Probability that a short string occurs in
a longer text
  • Given an alphabet A a, c, g, t a random
    string s of length k over A a random text t of
    length n (ngtk) over A
  • Find probability p, by which the string s occurs
    in the text t.
  • Find expected number of occurrences e of s in t.
  • Example s tggtacaaatgttct (glucocorticoid
    response element GRE) t 10000 bp (promoter,
    upstream DNA to start codon)

29
Basics of Sequence DB Searches Definition
  • Given one short sequence q (query) sequence
    DB, conceptually one long sequence s (subject)
  • Find occurrences of similar sequences to q in
    s.
  • Attach to each similar occurrence a statistical
    significance.
  • Example s human genomic DNA, 3109 b q
    tggtacaaatgttct (glucocorticoid response element
    GRE)
  • Dynamic programming approach not feasible!

30
Basics of Sequence DB Searches Detection of
identical k-mers
  • Idea identify identical k-mers in s and q
    (seed) expand alignment from seed in both
    directions
  • Example

q MAAARLCLSLLLLSTCVALLLQPLLGAQGAPLEPVYPGDNATPEQM
AQYAADLRRYINMLTRPRYGKRHKEDTLAFSEWGS

...
MAVAYCCLSLFLVSTWVALLLQPLQGTWGAPLEPMYPGDYATPEQMAQYE
TQLRRYINTLTRPRYGKRAEEENTGGLP...
  • BLAST
  • HSP, high scoring pair
  • gapped alignment
  • starting extension also from similar (and not
    only identical) seeds

31
Basics of Sequence DB Dearches Detection of
identical k-mers
  • Precompute position of all k-mers in DB sequence
  • Indexing all Peptides of length k in "database
  • Example

0 1 2 3 1234567890123456789
012345678901234 MAAARLCLSLLLLSTCVALLLQPLLGAQGAPLEP
MAAAR AAARL AARLC ....
APLEP Sorted AAARL 2
AARLC 3 APLEP 30 ... MAAAR 1 ...
VALLL 17
  • For each peptide of length k in "query", the
    identical peptides in "database" are identified
    efficiently (binary search in sorted list)
  • For identical pairs, extension step in both
    directions is performed

32
Basics of Sequence DB Dearches Detection of
identical k-mers
  • Indexing all Peptides of length k in "database
    some refinements

0 1 2 3 1234567890123456789
012345678901234 MAAARLCVALLLLSTCVALLLQPLLGAQGAPLEP
- 8 Pointers
to previous occurrences List of all words of
length k and last occurrence in query AAAAA
- AAAAC - AAAAD - .... AARLC 3
.... APLEP 30 ... ... VALLL 17 ...
Simple, yet efficient data-structure - array of
integers (sizelength of db) - array of integers
(sizenumber of words with length
k) Book-keeping - more than one db/query
sequence - build database chunks that fit into
main memory(speeds up computation
1000x) Extension step optimizations
  • For each peptide of length k in "query", the
    position in the wordlist can be easily computed
    (no binary search!)

33
Significance of matches DNA case
  • issue searching with short query vs. large
    database ? found match could have occurred by
    pure chance
  • assume equal distribution of c,g,a,t
  • what is ...
  • the probability q, that sequence B (lenm) is
    contained in sequence A (lenn)?
  • the expected length of a common subsequence of
    two sequences?
  • the expected score when locally aligning two
    sequences of length n, m?
  • Solution for first
  • a. There are (n-m) 'words' of length m in
    sequence A
  • b. In total, there are 4m sequences of length m
  • c. p (n-m) / 4m

This is wrong. Why?
34
Significance of matches Statistics
  • statistics
  • the statistical distribution of alignment scores
    found in a DB search follows the extreme value
    distribution (not normal distribution)
  • extreme value distribution changes with length of
    sequences and their residual composition
  • scores of actual database search results are
    plotted vs. expected scores(FASTA)
  • BLAST computes E-Value (number of expected hits
    with this score, when comparing the query with
    unrelated database sequences)

35
Putting it together BLAST2
A W T V A S A V R T S I
  • (optional) filtering for low complexity region in
    query
  • all words of length 3 are listed
  • to each word, 50 'high scoring' additional words
    are added
  • matching words are found in DB (with
    datastructure as described before)
  • ungapped alignment constructed from word matches,
    'HSP'
  • statistics determines, whether HSP is significant
  • SW-alignment for significant HSPs

AWT VAS AVR TSI WTV ASA VRT TVA SAV RTS
AWT VAS AVR TSI WTV ASA VRT TVA SAV RTS AWA
IAS TVR ... TWA LAS AIR ... ART ITS AVS ... ...
... ...
36
An example comparing mus chr X vs. hum chr X,
syntenic regions
mouse chromosome X (147Mbp)
human chromosome X (149Mbp)
Each dot conserved stretch of AA HSP, high
scoring pair
37
Conclusions
  • types of sequence alignments pairwise, multiple,
    query vs. database
  • local and global alignment
  • scoring matrices attach score to each pair of
    residues(allow similar residues to be aligned)
  • sequence alignment problem is mathematical
    optimization problemfind optimal alignment
    maximise score
  • optimal alignment in practical terms only
    feasible for pairwise alignment
  • Heuristics for multiple sequence
    alignmentsprogressive alignment guided by all
    pairwise alignments
  • Sequence database searches
  • efficiently find occurrence of query k-mers in db
    sequences (seeds)
  • expand (ungapped) HSP from seeds
  • attach statistical significance
Write a Comment
User Comments (0)
About PowerShow.com