E. Sequence Alignment - PowerPoint PPT Presentation

1 / 36

About This Presentation

Title:

E. Sequence Alignment

Description:

Part I: Overview and foundation: A. Central Dogma, B. Accessing Databases, C. ... If A is homologous to B, and B to C, then A must be homologous to C, even if ... – PowerPoint PPT presentation

Number of Views:112

Avg rating:3.0/5.0

Slides: 37

Provided by: stephe87

Category:

Tags: alignment | bb | sequence

more less

Transcript and Presenter's Notes

Title: E. Sequence Alignment

1
E. Sequence Alignment

Part I Overview and foundation A. Central
Dogma, B. Accessing Databases, C. Data Mining,
D. Tree of life.
Part II Gene Sequence E. Sequence Alignment, F.
Blasting NCBI Tutorial, G. SNP Video, H.
Methods for Manipulating DNA and RNA Ch 8. MBoC
pp 469-513, I. Hidden Markov Model Handouts,
J. Gene Finding Handout, K. Microarray Analysis
Ch 4 and Ch 5 DGPB
Part III Gene Regulation
L. Cell Chemistry and Biosynthesis Ch 2 MBoC,
M. Proteins Ch 3 MBoC, N. DNA and Chromosomes
Ch 4 MBoC, O. From DNA to Protein Ch 6. MBoC,
P. Control of Gene Expression Ch 7 MBoC
Part IV Proteonomics
Q. Introduction Ch. 6 DGPB, R. Subcellular
Localization Handout, S. Structural Prediction
Handout, T. Protein Interaction Handout, U.
Mass Spec Proteomic Informatics Handout
Part V Whole Genome Perspective
V. Protein and Gene Networking Handout Some
parts of Unit Three DGPB
Part V Applications and Conclusion
W. Cancer Ch. 23 MBoC, X. Pathogens Ch. 25
MBoC, Y. Case Studies Some Parts of Unit Four
DGPB, Z. Future Directions Handout

2
E. Sequence Alignment

E. Sequence Alignment
E.1. Sequence Similarity
E.2. Dynamic Programming
E.3. Blasting and (Psi-Blast)
Reading
The NCBI BLAST education pages (all three
sections).
Altschul, et al on PSI-BLAST and on improving
PSI-BLAST (using ROC curves)

3
E. Sequence Alignment Purpose of Sequence
Alignment and Blasting

What is an alignment, and why might it be
significant?
An alignment is a mapping from one sequence to
another, identifying elements that are likely to
have arisen from a common ancestor
A good alignment is an indication of homology
Alignments are NOT exact matches. We will need a
method to find good alignments in a database...

4
E. Sequence Alignment Phylogenetic Tree
5
E.1 Sequence Similarity Similarity vs. Homology
Paralogs vs. Orthologs

Homology is an evolutionary relationship that
either exists or does not. It cannot be partial.
An ortholog is a homolog with shared function.
A paralog is a homolog that arose through a gene
duplication event. Paralogs often have divergent
function.
Similarity is a measure of the quality of
alignment between two sequences. High similarity
is evidence for homology. Similar sequences may
be orthologs or paralogs.

6
E.1 Sequence Similarity How do we compute
similarity?

Similarity can be defined by counting positions
that are identical between two sequences
Gaps (insertions/deletions) can be important
abcdef abcdef abcdef
abceef acdef a-cdef

7
E.1 Sequence Similarity Not all mismatches are
the same

Some amino acids are more substitutable for each
other than others. Serine and threonine are
more alike than tryptophan and lysine.
We can introduce "mismatch costs" for handling
different substitutions.
We don't usually use mismatch costs in aligning
nucleotide sequences, since no substitution is
per se better than any other.

8
E.1 Sequence Similarity Many possible alignments
to consider

Without gaps, there are are NxM possible
alignments between sequences of length N and M
Once we start allowing gaps, there are many
possible arrangements to consider abcbcd
abcbcd abcbcd
abc--d a--bcd ab--cd
This becomes a very large number when we allow
mismatches, since we then need to look at every
possible pairing between elements there are
roughly NM possible alignments.

9
E.1 Sequence Similarity Avoiding random
alignments with a score function

Not only are there many possible gapped
alignments, but introducing too many gaps makes
nonsense alignments possible
s--e-----qu---en--ce sometimesquipsentice
Need to distinguish between alignments that occur
due to homology, and those that could be expected
to be seen just by chance.
Define a score function that accounts for both
element mismatches and a gap penalty

10
E.1 Sequence Similarity Match scores

Match scores are often calculated on the basis of
the frequency of particular mutations in very
similar sequences.
We can transform substitution frequencies into
log odds scores, which can then be added together.

11
E.1 Sequence Similarity Local vs. Global
alignments

A global alignment includes all elements of a
sequence, and includes gaps
A global alignment may or may not include "end
gap" penalties.
A local alignment is includes only subsequences,
and sometimes computed without gaps.
Local alignments can find shared domains in
divergent proteins and are fast to compute
Global alignments are better indicators of
homology and take longer to compute.

12
E.1 Sequence Similarity An alignment score

An alignment score is the sum of all the match
scores of an alignment, with a penalty subtracted
for each gap.
Gap penalties are usually "affine" meaning that
the penalty for one long gap is smaller than the
penalty for many smaller gaps that add up to the
same size.
a b c - - da c c e f d9 2 7 6 gt 24 - (10
2) 12

Gap start continuationpenalty
AlignmentScore
Matchscore
13
E.2 Dynamic Programming Finding the optimal
alignment

Given a pair of sequences and a score function,
identify the best scoring (optimal) alignment
between the sequences.
Remember, exponential number of possible
alignments (most with terrible scores).
Computer science to the rescue dynamic
programming identifies optimal alignments in time
proportional to the sum of the lengths of the
sequences

14
E.2. Dynamic programming

The name comes from an operations research task,
and has nothing to do with writing programs.
Dynamic programming alignments are a key
technology in bioinformatics, and you should
understand how they work.
Called Needleman-Wunch or Smith-Waterman

15
E.2. Dynamic Programming The recurrence

The idea Recursion for the cost that you want to
optimize
Boundary Condition
Recurrence
Example Pairwise sequence alignment
Ci,j optimal cost for aligning sequence
A1..i and B1..j where A and B are input
sequences of length n and m.
Boundary Condition
C0,j Ci,0 0 for i,j 0.
C0,j Ci,0 cost of deletion for i,j gt 0.
Recurrence Si,j
Si-1,j-1 cost of substituting Ai with Bj.
Si-1,j cost of deletion
Si,j-1 cost of insertion

16
E.2. Dynamic Programming Filling up the table

Using the recurrence compute all Ci,j in a
table C
At the same time, keep track of which case you
use to get Ci,j in another table F
Use F to backtrack and construct the solution.

Ci.j
For backtracking
(1)
(1)
(2)
(2)
(3)
(3)
Ci,j
17
E.2. Dynamic Programming Dynamic programming
alignment

Each cell has the score for the best aligned
sequence prefix up to that position.
Start by filling in initial gap and first element
to first element match score
Use arrow to indicate path to that alignment

18
E.2. Dynamic Programming Continue filling in
optimalpath scores

For each cell, have three choices for how to get
there from the last optimal alignment (match, gap
sequence 1, gap sequence 2).
Best score(s) are selected, and arrows added
indicated route.-5 5 0 5 -5 0-7 -5
-12

19
E.2. Dynamic Programming Optimal alignment by
traceback

We traceback a path that gets us the highest
score. If we don't have end gap penalties,
then take any path from the last row or columnto
the first.
Otherwise we needto include the top and bottom
corners
AACADCD---A-CD

20
E.2. Dynamic Programming How do we pick match
scores?

For match scores, two main options
PAM based on global alignments of closely related
sequences. Normalized to changes per 100 sites,
then exponentiated for more distant relatives.
BLOSUM based on local alignments in much more
diverse sequences
Picking the right distance is important, and may
be hard to do. BLOSUM seems to work better for
more evolutionarily distant sequences. BLOSUM62
is a good default.

21
E.2. Dynamic Programming Picking gap penalties

Many different possible forms
Most common is affine (gap open gap continue
penalities)
More complex penalties have been proposed.
Penalties must be commensurate with match scores.
Therefore, the match scoring scheme influences
the gap penalty
Most alignment programs suggest appropriate
penalties for each match score option.

22
E.2. Dynamic Programming Searching for optimal
scores

One possibility is to try several different match
score and gap penalties, and choose the best
In general, this is called parameter space search
and it is important in many areas.
Problems
requires a lot computation
we need some principled way to compare the
results.
Use significance testing to compare...

23
E.2. Dynamic Programming The significance of an
alignment

Significance testing is the branch of statistics
that is concerned with assessing the probability
that a particular result could have occurred by
chance.
How do we calculate the probability that an
alignment occurred by chance?
Either with a model of evolution, or
Empirically, by scrambling our sequences and
calculating scores on many randomized sequences.
Extreme value statistics (max, not sum)

24
E.2. Dynamic Programming Metric-space database
search

A metric is a function with these
characteristics
f(A, B) f (B,A)
f(A,A) 0
f(A,B) f(B,C) ? f(A,C) the triangle inequality
Log(n) database search for closest (inexact)
match can be done for metrics
Miranker, et al. have applied this to sequence
databases (transforming score matrix)

25
E.3 Blasting Why BLAST?

Dynamic programming solutions to alignment
problems are relatively slow, and don't lend
themselves to efficient database search.
Need some way to search a large database to find
sequences that have an inexact match to a query
sequence
Competing solutions FASTA BLAST.
Both imperfect approximations to DP. DP finds
some distantly related sequences the
approximations don't
BLAST is more commonly used, although both are
fine.

26
E.3 Blasting Sequence search basics

BLAST/FASTA are 50-100x faster than DP
If searching for coding regions, always translate
nucleotide to amino acid sequence.
Use appropriate substitution and gap scores
BLOSUM62 is good for weak protein similarities
Use PAM30, PAM70 or BLOSUM45 for better results
on more similar sequences, BLOSUM80 for most
distant
Use Low-complexity filters and, for human
sequence, filter out human repeats (ALUs, etc)

27
E.3 Blasting How does BLAST work

BLAST2 (gapped BLAST)
Break sequence into overlapping words, by
default of length 3. n-l1 l-size words for
sequence of length n. ABCDE ? ABC, BCD, CDE
For each word, define 50 other words that are
similar (use substitution matrix threshold T)
Repeat for each of the n-l1 words, giving about
50n words (out of 2038000 possible)
Use a hash table to find all places in DB with
exact match to any of those words.

28
E.3 Blasting Blasting Extending alignments

Identify database sequences that contain several
matching words on the same diagonal (think DP
alignments) and within a short distance.
Extend these short, ungapped alignments in both
directions along the sequence so long as score of
alignment increases.
Call these extended alignments HSP's for high
scoring pairs

29
E.3 Blasting PSI Blast

Position Specific Iterated BLAST
Intuition substitution matrices should be
specific to a particular site. Penalize
alanine?glycine more in a helix.
Idea Use BLAST with high stringency to get a set
of closely related sequences. Align those
sequences to create a new substitution matrix for
each position. Then use that matrix
(iteratively) to find additional sequences.

30
E.3 Blasting Why (not) PSI-BLAST

If the sequences used to construct the Position
Specific Scoring Matrices (PSSMs) are all
homologous, the sensitivity at a given
specificity improves significantly.
However, if non-homologous sequences are included
in the PSSMs, they are corrupted. Then they
pull in more non-homologous sequences, and become
worse than generic

31
E.3 Blasting How to use PSI BLAST

Set initial thresholds high. Inspect each
iteration's result for suspicious sequences.
Do several iterations (5), or until no new
sequences are found
Even if only looking for a small set of
sequences, make the initial search very broad
First, use NR with up to 5 iterations to set PSSM
Then use that PSSM to search in restricted domain

32
E.3 Blasting PSI-BLAST example
33
E.3 Blasting First Iteration
...
34
E.3 Blasting Second iteration
...
...
35
E.3 Blasting PSI-BLAST caveats

Increased ability to find distant homologues
Cost of additional required care to prevent
non-homologous sequences from being included in
the PSSM calculation.
When in doubt, leave it out!
Examine sequences with moderate similarity
carefully.
Be particularly cautious about matches to
sequences with highly biased amino acid content

36
E.3 Blasting Some notes on sequence-based
database searching

Matches of gt50 identity in a 20-40 amino acid
region occur frequently by chance.
Most sequences that share statistically
significant similarity throughout their entire
lengths are homologous.
If A is homologous to B, and B to C, then A must
be homologous to C, even if they share no
significant sequence similarity.
Low complexity regions, transmembrane regions and
coiled-coil regions often display significant
similarity without homology
Screen them out of your query sequences!