MUMmer: fast alignment of large-scale DNA and protein sequences

Provided by: jmt7

Learn more at: https://www.cse.lehigh.edu

1
MUMmer: fast alignment of large-scale DNA and
protein sequences

Presented by: Arthur Loder
Course: CSE 497, Computational Issues in
Molecular Biology
Date: February 17, 2004

2
MUMmer's Significance

MUMmer is a system for rapidly aligning entire
genomes or very large protein sequences
Input: 2 strings. Output: base-to-base alignment
What distinguishes MUMmer from previous
algorithms?
It can align very large sequences (millions of
nucleotides long)

3
Complete Genome Alignment

What it means to align complete genomes
Human chromosome 1: over 200 million base pairs
Previous alignment algorithms are space-efficient,
O(n), but their O(n^2) time complexity is
unacceptably slow at this scale
We would like to align using a global dynamic
programming algorithm (Needleman/Wunsch), but that
is infeasible: we need a shortcut

4
MUMmer vs. BLAST

Instead, use technique somewhat similar to BLAST
(find similar regions between strings)
BLAST: for comparing 1 known string to a large set
of unknown strings
MUMmer: for aligning 2 very similar known strings

5
You know you've made it when
6
Overview: MUMs, Part 1

String A
CTTCAGCTAGCTAGTCCCTAGCTAATCAAGAGACTAGGATCAAGGCTAG
SCTGAAGTGCACCAGGTTGCAATCCATCTATGAGCGATCGAATCGGATCG
AGTCGAGCTAGCTAAGCTAGCTAGGGAGTCCAAAGACTGCGGATGCGAGT
CGAGGCTTTAGAGCTAGCTAGCGCGATCGAGGCTAGCTATGCTAGCTATC
ATCGCAAGCTAGCTGAGTCGCGATCGGGCGTAGCGATGTCTCTAGTCTCT
AGTCGAGCTGATCGTAGCTAGTAATGTATCCATCTACTCTAGTAGATCGA
TTAGTCGATCGATGCTAGATCGGATCGAGTCGAGATCGATGGAGTCGAGA
TCGATCTAATCTATCTCTAAATGGAGCGA
String B
GCATCGTAGGCTGAGGCTTCGAGGCTAGTCGATGCTAGGTTGCAATCCA
TCTATGAGCGATCGAATCGGATCGAGTCGAGCTAGCTAAGCTAGCTAGGG
AGTCCAAACTCGCAAAGCTAGTGATCGATCGATATCGATTCGATCGGTGT
CGCGATCGGGCGTAGCGATGTCTCTAGTCTCTAGTCGAGCTGATCGTAGC
TAGTAATGTATCATAGCTAATCGCACTACTACGATGCGATCTCTAGTCGA
TCTATCTCGGCTTCGATCGTA
How do we align without Needleman/Wunsch?

7
Overview: MUMs, Part 2

[Strings A and B as on the previous slide, with the
large exact matches highlighted]
Easier with large exact matches highlighted?

8
Overview: MUMs, Part 3

Any optimal global alignment will probably use
these two subsequences as anchors
This is the shortcut needed to calculate a global
alignment quickly on large sequences
Very intuitive in the alignment process, but only a
heuristic

9
Overview: MUMs, Part 4

MUM: Maximal Unique Match
MUMs occur exactly once in each sequence
Maximal: the match cannot be extended in either
direction without hitting a mismatch
Ignores repeat sequences
MUMs are found efficiently using a suffix tree data
structure (to be explained later)
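
The MUM definition above can be sketched with a brute-force search. This is only an illustration of the definition, not how MUMmer finds MUMs (it uses a suffix tree); the function names and the `min_len` parameter are assumptions made for this example.

```python
def count_occurrences(text, pattern):
    """Count occurrences of pattern in text, allowing overlaps."""
    last_start = len(text) - len(pattern)
    return sum(text.startswith(pattern, p) for p in range(last_start + 1))


def find_mums(a, b, min_len=3):
    """Return (pos_a, pos_b, length) for each maximal unique match.

    Brute force, far slower than a suffix tree, but it follows the
    definition directly: maximal in both directions, unique in a and b.
    """
    mums = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            # Not left-maximal if the previous characters also match.
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            # Extend to the right as far as the characters agree.
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            if k < min_len:
                continue
            match = a[i:i + k]
            # Unique: exactly one occurrence in each sequence.
            if count_occurrences(a, match) == 1 and count_occurrences(b, match) == 1:
                mums.append((i, j, k))
    return mums


print(find_mums("TTACGTGG", "CCACGTAA"))    # [(2, 2, 4)]
print(find_mums("ACGTTTACGT", "GGACGTGG"))  # [] since ACGT repeats in the first string
```

The second call shows why repeats are ignored: ACGT matches, but it occurs twice in the first string, so it is not unique.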

10
Overview: Choosing MUMs

Once we have anchors, we need to choose which ones
to use in the alignment
All MUMs

MUMs used in alignment (subset)

11
Overview: Closing the Gaps

After choosing and aligning the anchors, what next?

Close the gaps
Use dynamic programming (Smith-Waterman)
to extend MUMs
Smaller regions, so can compute quickly
Implicit assumption: the sequences are very similar

12
3 Phases of MUMmer

3 Phases
1) Obtaining MUMs (via Suffix Trees)
2) MUM choosing (via Longest Increasing
Subsequences)
3) Gap closing (via dynamic programming,
Smith-Waterman)
Composed of previously known algorithms, packaged
to form a new algorithm

13
Critiquing MUMmer's Output

Sample Output:
Sequence A: ACTGC_TGAC_CTA
Sequence B: ACC_CA_GGCTCG_
MUMmer best case: same alignment as
Needleman/Wunsch
MUMmer worst case: suboptimal alignment
At least it is computable, whereas Needleman/Wunsch
is not feasible on sequences this large

14
Phase 1: Obtaining MUMs

via Suffix Trees
Edward M. McCreight. (1976) A Space-Economical
Suffix Tree Construction Algorithm.
http://doi.acm.org/10.1145/321941.321946

15
Suffix Trees Outline

I. Suffix Trees
A. Motivations
B. Tries
C. Suffix Trees
D. How MUMmer utilizes suffix trees

16
Tries

The term "trie" comes from "retrieval"
Introduced in the 1960s by Fredkin
Suffix trees are a type of trie
Uses:
Quickly search large text via preprocessing
Used for regular expressions, longest common
substring, automatic command completion, etc.

17
Non-Compact Trie Example

5 strings encoded: BIG, BIGGER, BILL, GOOD, GOSH
Every edge represents a symbol of the alphabet

18
Implementation of Tries

Use a linked list
Each node includes pointers to its next sibling and
its first child
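
A minimal sketch of the first-child / next-sibling layout described above, using the five example strings from the previous slide. The class and function names are assumptions for illustration; the slides do not give an implementation.

```python
class TrieNode:
    """Trie node stored as a linked list: first child plus next sibling."""

    def __init__(self, symbol):
        self.symbol = symbol
        self.first_child = None   # head of this node's list of children
        self.next_sibling = None  # next node in the parent's child list
        self.is_word = False      # True if an inserted string ends here


def find_child(node, symbol):
    """Walk the sibling list looking for a child labelled symbol."""
    child = node.first_child
    while child is not None:
        if child.symbol == symbol:
            return child
        child = child.next_sibling
    return None


def insert(root, word):
    node = root
    for symbol in word:
        child = find_child(node, symbol)
        if child is None:
            # Push the new child onto the front of the sibling list.
            child = TrieNode(symbol)
            child.next_sibling = node.first_child
            node.first_child = child
        node = child
    node.is_word = True


def contains(root, word):
    node = root
    for symbol in word:
        node = find_child(node, symbol)
        if node is None:
            return False
    return node.is_word


def count_nodes(node):
    """Count this node and everything below it."""
    total = 1
    child = node.first_child
    while child is not None:
        total += count_nodes(child)
        child = child.next_sibling
    return total


root = TrieNode("")
for w in ["BIG", "BIGGER", "BILL", "GOOD", "GOSH"]:
    insert(root, w)

print(contains(root, "BIGGER"), contains(root, "BI"))  # True False
print(count_nodes(root))  # 15: the root plus 14 symbol nodes
```

Shared prefixes such as BI and GO are stored once, which is why 22 input characters need only 14 symbol nodes.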

19
Compacting Tries, Part 1

Method 1: trim chains leading to leaves
Compact trie for strings BIG, BIGGER, BILL,
GOOD, GOSH

20
Compacting Tries, Part 2

Method 2: Patricia Tries
Before, one edge per character
Now, unary nodes are collapsed

21
Suffix Trie

Normal trie, but input strings are suffixes
Assume text string t1...tn
Q: How many leaves does the tree have?
A: The tree has n leaves (one per suffix, provided
no suffix is a prefix of another)
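
The leaf count can be checked with a naive, non-compact suffix trie. This sketch uses nested dictionaries rather than the linked-list layout, purely for brevity; the function names are assumptions, and the `$` end marker in the last call is a common convention rather than something the slides specify.

```python
def build_suffix_trie(text):
    """Insert every suffix of text into a trie of nested dicts."""
    root = {}
    for start in range(len(text)):
        node = root
        for symbol in text[start:]:
            node = node.setdefault(symbol, {})
    return root


def count_leaves(node):
    """A node with no children is a leaf."""
    if not node:
        return 1
    return sum(count_leaves(child) for child in node.values())


print(count_leaves(build_suffix_trie("ababc")))  # 5 = n
print(count_leaves(build_suffix_trie("abab")))   # 2: "ab" and "b" end inside longer paths
print(count_leaves(build_suffix_trie("abab$")))  # 5: a unique end marker restores one leaf per suffix
```

This naive construction takes quadratic time and space, which is what the linear-time algorithms later in the talk avoid.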

22
Suffix Tree

First: compact the suffix trie
Next: collapse unary nodes

23
Suffix Trees: Decreasing Storage

Rather than storing strings, store a pair of
indices (x, y), where x is the beginning of the
string and y is the end
Storage becomes O(n)

24
Suffix Tree Algorithms

First linear-time algorithm given by Weiner
(1973)
McCreight developed a more space-efficient
algorithm (1976)
Both original papers have a reputation for being
difficult to understand

25
McCreight's Algorithm, Part 1

Algorithm M
Maps a finite string S of characters into an
auxiliary index to S in the form of a digital
search tree T whose paths are the suffixes of S,
and whose terminal nodes correspond uniquely to
positions within S.
A Space-Economical Suffix Tree Construction
Algorithm. Edward M. McCreight.
http://doi.acm.org/10.1145/321941.321946

26
McCreight's Algorithm, Part 2

S = ababc
Definitions:
suf_i is the suffix of S beginning at character
position i
head_i is the longest prefix of suf_i which is also
a prefix of suf_j for some j < i
tail_i is suf_i - head_i (suf_i with head_i removed)
suf_3 = ? abc
head_3 = ? ab
tail_3 = ? abc - ab = c
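
The three definitions can be checked directly on S = ababc. This is a naive sketch of the definitions only (positions are 1-based, as on the slide), not McCreight's linear-time procedure; the function names are chosen for this example.

```python
def suf(s, i):
    """Suffix of s beginning at 1-based character position i."""
    return s[i - 1:]


def head(s, i):
    """Longest prefix of suf_i that is also a prefix of suf_j for some j < i."""
    current = suf(s, i)
    best = ""
    for j in range(1, i):
        earlier = suf(s, j)
        k = 0
        while k < len(current) and k < len(earlier) and current[k] == earlier[k]:
            k += 1
        if k > len(best):
            best = current[:k]
    return best


def tail(s, i):
    """suf_i with head_i removed from the front."""
    return suf(s, i)[len(head(s, i)):]


S = "ababc"
print(suf(S, 3), head(S, 3), tail(S, 3))  # abc ab c
```

For i = 1 there is no earlier suffix, so head_1 is empty and tail_1 is the whole string.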

27
McCreight's Algorithm, Part 3

Builds the suffix tree by adding suf_i to Tree_(i-1)
Initially, Tree_1 contains only suf_1 (the entire
string)
To obtain Tree_2, add suf_2 to Tree_1
Continue until you have added suf_n to
Tree_(n-1)
Tree_n is the final suffix tree

28
McCreight's Algorithm, Part 4

Adding a suffix (going from Tree_2 to Tree_3)
suf_3 = abc, head_3 = ab, tail_3 = c

29
Suffix Trees: Complexity

Adding a non-terminal node and a new arc that
corresponds to the tail takes at most constant time
If the head could also be found in constant time,
the algorithm would run in time linear in n, the
length of the string S
This is achieved by using suffix links (see paper
for details)

30
Finding MUMs from Suffix Trees
31
Finding MUMs from Suffix Trees 2
32
Finding MUMs from Suffix Trees 3

More general case

33
Phase 2: Choosing MUMs for Alignment

via Longest Increasing Subsequence (LIS)
Gusfield, D. (1997) Algorithms on Strings, Trees
and Sequences: Computer Science and Computational
Biology.

34
Motivation for Choosing MUMs

Q: Why can't we use all MUMs for alignment?
A: Because MUMs can cross, we can only choose an
increasing set of MUMs (same order in both sequences)
Problem: given a set of MUMs, how do we choose
the optimal sequence?

35
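Choosing MUMs (Continued)

Configuration can be uniquely represented as a
sequence P: number the MUMs by their order in one
genome, then list them in their order in the other
P = 1, 2, 3, 4, 6, 7, 5
LIS(P) = 1, 2, 3, 4, 6, 7

Determining the optimal sequence of MUMs reduces to
finding the LIS of P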
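
A sketch of this reduction, assuming each MUM is given as a hypothetical (position in A, position in B) pair; the coordinates below are invented so that they reproduce the P on this slide. The LIS here is a simple quadratic dynamic program, not the cover-based method presented on the following slides.

```python
def mum_permutation(mums):
    """Number MUMs by order in A, then list those numbers in order of B."""
    by_a = sorted(mums, key=lambda m: m[0])
    label = {m: rank for rank, m in enumerate(by_a, start=1)}
    by_b = sorted(mums, key=lambda m: m[1])
    return [label[m] for m in by_b]


def longest_increasing_subsequence(p):
    """Quadratic-time LIS with back-pointers for reconstruction."""
    if not p:
        return []
    best_len = [1] * len(p)
    prev = [-1] * len(p)
    for i in range(len(p)):
        for j in range(i):
            if p[j] < p[i] and best_len[j] + 1 > best_len[i]:
                best_len[i] = best_len[j] + 1
                prev[i] = j
    end = max(range(len(p)), key=lambda i: best_len[i])
    result = []
    while end != -1:
        result.append(p[end])
        end = prev[end]
    return result[::-1]


# Hypothetical MUM coordinates: the fifth MUM in A appears last in B.
mums = [(10, 5), (20, 15), (30, 25), (40, 35), (50, 65), (60, 45), (70, 55)]
P = mum_permutation(mums)
print(P)                                  # [1, 2, 3, 4, 6, 7, 5]
print(longest_increasing_subsequence(P))  # [1, 2, 3, 4, 6, 7]
```

MUM 5 is dropped because keeping it would force MUMs 6 and 7 to cross it.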

36
IS Definition

Increasing Subsequence: values (strictly)
increase from left to right
Sequence P = 4, 2, 1, 5, 8, 6, 9, 10
Examples of two increasing subsequences:
{4, 5, 9} or {2, 5, 6, 9, 10}

37
DS Definition

Decreasing Subsequence: numbers that are
decreasing from left to right
Sequence P = 4, 2, 1, 5, 8, 6, 9, 10
Examples? <insert class participation here>
{4, 2, 1}, {4, 2}, {4, 1}, {2, 1}, or {8, 6}

38
Covers Definition, Part 1

Cover of P: a set of decreasing subsequences of P
that contains all numbers of P
P = 7, 3, 4, 8, 6, 2, 1
Some possible covers?
{7, 3} {4} {8} {6, 2, 1}
OR
{7, 3, 2, 1} {4} {8, 6}
And others

39
Covers Definitions, Part 2

Size of cover: number of decreasing subsequences
it contains
Smallest cover: cover with minimum size
If I is an increasing subsequence of P with
length equal to the size of a cover C of P,
then I is a longest increasing subsequence of
P and C is a smallest cover of P
Why?

40
Covers: Relation to LIS

If I is an increasing subsequence of P with
length equal to the size of a cover C of P,
then I is a longest increasing subsequence of
P and C is a smallest cover of P. Why?
Because no increasing subsequence can contain
more than one element from each decreasing
subsequence in a cover

41
Covers: Examples

P = 7, 3, 4, 8, 6, 2, 1
Two possible covers:
{7, 3} {4} {8} {6, 2, 1}
{7, 3, 2, 1} {4} {8, 6}
What is the size of the smallest cover?
3 (no cover can contain < 3 decreasing
subsequences)
How many elements in the LIS?
3

42
Covers: Examples Continued

P = 7, 3, 4, 8, 6, 2, 1
For a particular cover, say
{7, 3, 2, 1} {4} {8, 6}
You can only choose one element from each
subsequence, otherwise the result would not be
increasing.
Example:
IS = {3, 4, 6}: to add an element, you would need to
choose from a subsequence from which you already
chose

43
Greedy Cover Algorithm, Part 1

To create a smallest cover, use the Greedy Cover
algorithm:
Start from the left of sequence P
Examine each number
Place the number at the end of the left-most
subsequence it can extend
If none exists, make a new decreasing subsequence
(to the far right)
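
The steps above can be sketched directly. This version scans the subsequences from left to right for each number, so it is the straightforward form rather than the O(n log n) one; the function name is chosen for this example.

```python
def greedy_cover(p):
    """Build a cover of p by decreasing subsequences, left to right."""
    cover = []
    for x in p:
        for subsequence in cover:
            # x can extend a decreasing subsequence only if it is
            # smaller than that subsequence's current last element.
            if x < subsequence[-1]:
                subsequence.append(x)
                break
        else:
            # No subsequence could take x: start a new one on the far right.
            cover.append([x])
    return cover


print(greedy_cover([6, 3, 5, 1, 9, 7, 2, 10, 4]))
# [[6, 3, 1], [5, 2], [9, 7, 4], [10]]
print(greedy_cover([7, 3, 4, 8, 6, 2, 1]))
# [[7, 3, 2, 1], [4], [8, 6]]
```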

44
Greedy Cover Algorithm, Part 2

Example: P = 6, 3, 5, 1, 9, 7, 2, 10, 4
{6}
{6, 3}
{6, 3} {5}
{6, 3, 1} {5}
{6, 3, 1} {5} {9}
{6, 3, 1} {5} {9, 7}
{6, 3, 1} {5, 2} {9, 7}
{6, 3, 1} {5, 2} {9, 7} {10}
{6, 3, 1} {5, 2} {9, 7, 4} {10}
{6, 3, 1} {5, 2} {9, 7, 4} {10} (smallest cover)

45
Obtaining LIS From Smallest Cover

LIS Algorithm
Set i = number of subsequences in the greedy cover
Set I to the empty list
Choose any element x in subsequence i and place
it at the front of list I
While i > 1:
Scan from the top of subsequence (i-1) and find the
first element y smaller than x
Set x = y and i = i - 1
Place x at the front of list I
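
A sketch of the LIS algorithm above. It assumes its input is a cover produced by the greedy procedure (given here as a list of lists, left-most subsequence first); on an arbitrary cover the scan is not guaranteed to find a smaller element. The function name is chosen for this example.

```python
def lis_from_cover(cover):
    """Recover a longest increasing subsequence from a greedy cover."""
    if not cover:
        return []
    i = len(cover)
    x = cover[i - 1][0]  # any element of the last subsequence works
    result = [x]
    while i > 1:
        # Scan subsequence i-1 from the top for the first element below x.
        y = next(value for value in cover[i - 2] if value < x)
        x = y
        i -= 1
        result.insert(0, x)
    return result


print(lis_from_cover([[6, 3, 1], [5, 2], [9, 7, 4], [10]]))  # [3, 5, 9, 10]
print(lis_from_cover([[7, 3, 2, 1], [4], [8, 6]]))           # [3, 4, 8]
```

The result has one element per subsequence in the cover, which is why its length equals the cover size.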

46
Obtaining LIS: Example

P = 6, 3, 5, 1, 9, 7, 2, 10, 4
Smallest cover: {6, 3, 1} {5, 2} {9, 7, 4} {10}
6  5  9  10
3  2  7
1     4
Number of subsequences: i = 4
i = 4, x = 10, I = 10
i = 3, x = 9, I = 9, 10
i = 2, x = 5, I = 5, 9, 10
i = 1, x = 3, I = 3, 5, 9, 10
P = 6, 3, 5, 1, 9, 7, 2, 10, 4
LIS = 3, 5, 9, 10

47
How MUMmer Utilizes LIS

P = 1, 2, 3, 4, 6, 7, 5

LIS = 1, 2, 3, 4, 6, 7

48
Obtaining LIS: Complexity Analysis

Greedy cover can be found in O(n log n)
LIS found from greedy cover in O(n)
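
One way to reach the O(n log n) bound, sketched under the assumption that the values in P are distinct: the last elements of the greedy cover's subsequences are always in increasing order from left to right, so the left-most subsequence a number can extend can be found by binary search. Only those last elements are tracked here, which is enough to get the cover size (and therefore the LIS length).

```python
from bisect import bisect_right


def lis_length(p):
    """Length of the LIS of p, equal to the size of the greedy cover."""
    ends = []  # ends[k] = current last element of subsequence k
    for x in p:
        # First subsequence whose last element is larger than x.
        k = bisect_right(ends, x)
        if k == len(ends):
            ends.append(x)  # no such subsequence: start a new one
        else:
            ends[k] = x     # x becomes the new end of subsequence k
    return len(ends)


print(lis_length([6, 3, 5, 1, 9, 7, 2, 10, 4]))  # 4
print(lis_length([7, 3, 4, 8, 6, 2, 1]))         # 3
```

Each number costs one binary search over at most n ends, giving O(n log n) overall.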

49
Phase 3: Closing the Gaps

via Smith-Waterman
Smith, T. and Waterman, M. (1981) Identification
of Common Molecular Subsequences. Journal of
Molecular Biology, 147, 195-197.
http://citeseer.ist.psu.edu/smith81identification.html

50
Closing the Gaps

After the global MUM-alignment is found, need to
close local gaps
Gap: an interruption in the MUM-alignment
Types of gaps:
1) SNP: Single Nucleotide Polymorphism
2) Insertion
3) Highly polymorphic region
4) Repeat

51
Types of Gaps: Examples
52
SNP Processing

SNP: Single Nucleotide Polymorphism
SNPs in human DNA appear to be associated with
many health issues (genetic disease)
Q: How can we determine SNPs by using MUMmer?
A: By looking between MUMs: SNPs are surrounded by
matching subsequences

53
Inserts: Transpositions

Large gaps in one genome but not the other
Transpositions: subsequences that were deleted
from one location and inserted elsewhere
Detected during post-processing
How?
Part of MUM-alignment

54
Simple Inserts

Simple inserts: subsequences that appear in only
one genome
Do not appear in the MUM-alignment

55
Polymorphic Regions

Gaps in the alignment caused by sequences with
large numbers of differences
If the regions are small enough, align using dynamic
programming (Smith-Waterman)
Gives an optimal alignment given pre-specified
insertion and mutation costs
If the regions are too large, recursively call the
MUMmer algorithm
Q: What must change when running MUMmer again?
A: The minimum MUM size must be smaller
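
A minimal sketch of the Smith-Waterman recurrence for a small gap region, computing only the best local alignment score. The scoring values (match, mismatch, gap) are placeholder assumptions standing in for the "pre-specified costs" above, not MUMmer's actual parameters, and a linear gap penalty is assumed.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (linear gap penalty)."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            current[j] = max(
                0,                                # start a new local alignment
                previous[j - 1] + substitution,   # align a[i-1] with b[j-1]
                previous[j] + gap,                # gap in b
                current[j - 1] + gap,             # gap in a
            )
            best = max(best, current[j])
        previous = current
    return best


print(smith_waterman_score("ACGT", "ACGT"))          # 8
print(smith_waterman_score("GGACGTGG", "TTACGTTT"))  # 8, from the shared ACGT
print(smith_waterman_score("AAAA", "TTTT"))          # 0
```

The table has len(a) x len(b) cells, so this is only practical because the gaps between MUMs are short.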

56
Repeat Processing

Repeat sequences do not appear in the MUM-alignment
It only includes sequences that appear exactly once
in each genome
In a sense, repeat sequences can fake out the
alignment

57
Finding Inversions

Professor Lopresti mentioned a method of finding
inversions last class.
Q: How can we identify inversions by using
MUMmer?
A: By running in both directions on gap regions

58
Results: MUMmer 1.0 Storage

12 bytes per leaf node (suffix tree)
24 bytes per internal node (suffix tree)
1 byte for each base in genome
Generous upper bound: 37 bytes per base
Therefore, comparing two 100 Mb genomes would
require < 8 gigabytes of memory
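
The arithmetic behind the bound, as a small sketch. It assumes one leaf and at most one internal node per base (a suffix tree with n leaves has fewer than n internal nodes), which is what makes 12 + 24 + 1 = 37 a generous per-base figure; the constant and function names are chosen for this example.

```python
BYTES_PER_LEAF = 12
BYTES_PER_INTERNAL_NODE = 24
BYTES_PER_BASE = 1


def upper_bound_bytes(total_bases):
    """Upper bound on memory: one leaf, one internal node, one byte per base."""
    per_base = BYTES_PER_LEAF + BYTES_PER_INTERNAL_NODE + BYTES_PER_BASE
    return per_base * total_bases


# Two genomes of 100 million bases each.
total = upper_bound_bytes(2 * 100_000_000)
print(total)        # 7400000000
print(total / 1e9)  # 7.4, i.e. under 8 gigabytes
```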

59
Improvements: MUMmer 2.0

New suffix tree algorithm (Kurtz)
At most 20 bytes per base pair (compared to 37)
Stream query string against suffix tree, cutting
down suffix tree storage

60
Improvements: MUMmer 3.0

New graphical modules
Search operations are optimal or near-optimal
Code rewrite (modular design)
Open source

61
MUMmer Conclusion

via Arthur Loder

62
In Conclusion

MUMmer allows alignment of sequences which are
too long to be aligned using previous algorithms
Utilizes suffix trees, LIS, and Smith-Waterman to
obtain results
Outputs a base-to-base alignment of two sequences

63
MUMmer Discussion
64
Discussion Questions

What if MUMs overlap? What if some are longer
than others? How does LIS take this into account?
Are repeats really ignored?
Should repeats be ignored? Aren't they part of
the global alignment? The goal is to obtain the
same optimal alignment as in Needleman/Wunsch,
which does not ignore repeats.

65
Demonstration