Class 4: Sequence Alignment II Gaps, Heuristic Search presentation

About This Presentation

Transcript and Presenter's Notes

Title: Class 4: Sequence Alignment II Gaps, Heuristic Search

1
Class 4 Sequence Alignment IIGaps, Heuristic
Search
2
Alignment with Gaps Example
1
A A C A A T T A A G A C T A C G T T C A T G A C
A C G A T T A G C A C A C T G T A G A
2
A A C A A T T A A G A C T A C G T T C A T G A C
A A C A A T T G T T C A T G A C G C A
3
Gaps

Both alignments have the same number of matches
and spaces but alignment 2 seems better
Definition A gap is any maximal, consecutive run
of spaces in a single string.
The length of a gap the number of spaces in it
Example 1 has 11 gaps, example 2 only 2 gaps
Idea develop alignment scores that take gaps
(not spaces) into account

4
Biological Motivation

Number of mutational events
A single gap due to a single event that removed
a number of residues
Each gap due to distinct, independent event
Protein structure
Protein secondary structure consists of alpha
helices, beta sheets and loops
Loops of varying size can lead to very similar
structure

5
Biological Motivation
6
cDNA Mataching

cDNA is the sequence after splicing (introns have
been removed) and editing
We expect regions of high similarity, separated
by long gaps

7
Gap Penalty Models (I)

Constant model
Gives each gap a constant score, spaces are free
Maximize
Time O(mn)
Works well with cDNA matching
Affine model
Penalty for starting a gap penalty for each
additional space
Each gap costs Wg qWs
Maximize
Time O(mn)
Widely used

8
Gap Penalty Models (II)

Convex model
Each extra space contributes less penalty
Gap function is convex in its length
Example Ws log(q)
Time O(mnlogm)
A better model of biology
General model
The weight of a gap is some arbitrary w(q)
Time O(mn2 nm2)

9
Example Revised
1
A A C A A T T A A G A C T A C G T T C A T G A C
A C G A T T A G C A C A C T G T A G A
2
A A C A A T T A A G A C T A C G T T C A T G A C
A A C A A T T G T T C A T G A C G C A
10
Indel Model
1
A A C A A T T A A G A C T A C G T T C A T G A C
A C G A T T A G C A C A C T G T A G A
Score -6
2
A A C A A T T A A G A C T A C G T T C A T G A C
A A C A A T T G T T C A T G A C G C A
Score -6
Scoring Parameters Match 1 Indel -2
11
Constant Model
1
A A C A A T T A A G A C T A C G T T C A T G A C
A C G A T T A G C A C A C T G T A G A
Score -6
2
A A C A A T T A A G A C T A C G T T C A T G A C
A A C A A T T G T T C A T G A C G C A
Score 12
Scoring Parameters Match 1 Open gap -2
12
Affine Model
1
A A C A A T T A A G A C T A C G T T C A T G A C
A C G A T T A G C A C A C T G T A G A
Score -17
2
A A C A A T T A A G A C T A C G T T C A T G A C
A A C A A T T G T T C A T G A C G C A
Score 1
Scoring Parameters Match 1 Open gap -2, each
space -1
13
Convex Model
1
A A C A A T T A A G A C T A C G T T C A T G A C
A C G A T T A G C A C A C T G T A G A
Score -6
2
A A C A A T T A A G A C T A C G T T C A T G A C
A A C A A T T G T T C A T G A C G C A
Score 7
Scoring Parameters Match 1 Open gap -2, gap
length -logn
14
Affine Weight Model

We divide all possible alignments of the prefixes
s1..i and t1..j into 3 types
s i
t j
s i-----
t j
s i
t j-----

15
Affine Weight Model
Recurrence relations
16
Affine Weight Model
Initial condition Optimal alignment Complex
ity Time O(mn) Space O(mn)
17
Affine Weight Model

This model has a natural explanation as a finite
state automata

18
Alignment in Real Life

One of the major uses of alignments is to find
sequences in a database
Such collections contain massive number of
sequences (order of 106)
Finding homologies in these databases with the
standard dynamic programming can take too long
Example
query protein 232 AAs
NR protein DB 2.7 million sequences 748
million AAs
mn 1.7 1011 cells !

19
Heuristic Search

Instead, most searches rely on heuristic
procedures
These are not guaranteed to find the best match
Sometimes, they will completely miss a
high-scoring match
We now describe the main ideas used by some of
these procedures
Actual implementations often contain additional
tricks and hacks

20
Basic Intuition

The main resource consuming factor in the
standard DP is decision of where the gaps are. If
there were no gaps, life was easy!
Almost all heuristic search procedures are based
on the observation that real-life well-matching
pairs of sequences often do contain long strings
with gap-less matches.
These heuristics try to find significant local
gap-less matches and then extend them.

21
Banded DP

Suppose that we have two strings s1..n and
t1..m such that n?m
If the optimal global alignment of s and t has
few gaps, then path of the alignment will be
close to the diagonal

s
t
22
Banded DP

To find such a path, it suffices to search in a
diagonal region of the matrix
If the diagonal band has presumed width a, then
the dynamic programming step takes O(an)
Much faster than O(n2) of standard DP in this case

s
a
t
23
Banded DP

Problem (for local alignment)
If we know that ti..j matches the query
sp..q, then we can use banded DP to evaluate
quality of the match
However, we do not know i,j,p,q !
How do we select which sub-sequences to align
using banded DP?

24
FASTA Overview

Main idea
Find (fast!) good diagonals and extend them to
complete matches
Suppose that we have a relatively long gap-less
local match (diagonal)
AGCGCCATGGATTGAGCGA
TGCGACATTGATCGACCTA
Can we find clues that will let us find it
quickly?

25
Signature of a Match

Assumption good matches contain several
patches of perfect matches
AGCGCCATGGATTGAGCGA
TGCGACATTGATCGACCTA

26
FASTA

Given s and t, and a parameter k
Find all pairs (i,j) such that si..ik and
tj..jk match perfectly
Locate sets of pairs that are on the same
diagonal by sorting according to i-j thus
Locating diagonals that contain
many close pairs.
This is faster than O(nm) !

s
i ik
j
jk
t
27
FASTA

Extend the best diagonal matches to imperfect
(yet ungapped) matches, compute alignment scores
per diagonal. Pick the best-scoring matches.
Try to combine close diagonals to potential
gapped matches, picking the best-scoring matches.
Finally, run banded DP on the regions containing
these matches, resulting in several good
candidate alignments.
Most applications of FASTA use very small k (2
for proteins, and 4-6 for DNA)

28
BLAST Overview

FASTA drawback is its reliance on perfect matches
BLAST (Basic Local Alignment Search Tool) uses
similar intuition, but relies on high scoring
matches rather than exact matches
Given parameters length k, and threshold T
Two strings s and t of length k are a high
scoring pair (HSP) if d(s,t) gt T

29
High-Scoring Pair

Given a query string s, BLAST construct all words
w (neighborhood words), such that w is an HSP
with a k-substring of s.
Note not all k-mers have an HSP in s

30
BLAST phase 1

Phase 1 compile a list of word pairs (k3) above
threshold T
Example for the following query
FSGTWYA (query word is in green)
A list of words (k3) is
FSG SGT GTW TWY WYA
YSG TGT ATW SWY WFA
FTG SVT GSW TWF WYS

31
BLAST phase 1
scores
GTW 6,5,11 22 neighborhood ASW 6,1,11
18 word hits ATW 0,5,11 16 gt threshold NTW
0,5,11 16 GTY 6,5,2 13 GNW 10 neighborh
ood GAW 9 word hits below threshold
(T11)
32
BLAST phase 2

Search the database for perfect matches with
neighborhood words. Those are hits for further
alignment.
We can locate seed words in a large database in a
single pass, given the database is properly
preprocessed (using hashing techniques).

33
Extending Potential Matches

Once a hit is found, BLAST attempts to find a
local alignment that extends it.
Seeds on the same diagonal tend to be combined
(as in FASTA)

34
Two HSP diagonal

An improvement look for 2 HSPs on close
diagonals
Extend the alignment between them
Fewer extensions considered
There is a version of BLAST,
involving gapped
extensions.
Generally faster then FASTA,
arguably better.

s
t
35
Blast Variants

blastn (nucleotide BLAST)
blastp (protein BLAST)
tblastn (protein query, translated DB BLAST)
blastx (translated query, protein DB BLAST)
tblastx (translated query, translated DB BLAST)
bl2seq (pairwise alignment)

36
Biological Databases

Today, most of the biological information can be
freely accessed on the web.
One can
Search for information on a known gene
Check if a sequence exists in a database
Find a homologous protein, helping us guess
Structure
Function

37
Databases and Tool

Important gateways
National Center for Biotechnology (GenBank)
http//www.ncbi.nlm.nih.gov/
European Bioinformatics Institue (EMBL-Bank)
http//www.ebi.ac.uk/
Expert Protein Analysis System (SwissProt)
http//www.expasy.org/
? Different tools and DBs to allow biologists a
rich suite of queries

38
Database Types

Nucleotide DBs (GenBank, EMBL-Bank)
Contain any and every type of DNA fragment
Full cDNA, ESTs, repeats, fragments
Dirty and redundant
Protein DBs (SwissProt)
Contain amino-acid sequences for full proteins
High quality, strict screening process
Lots of annotated information on each protein

Write a Comment

User Comments (0)

About PowerShow.com

Class 4: Sequence Alignment II Gaps, Heuristic Search PowerPoint PPT Presentation