MUMmer: fast alignment of large-scale DNA and protein sequences

1
MUMmer: fast alignment of large-scale DNA and
protein sequences
  • Presented by: Arthur Loder
  • Course: CSE 497, Computational Issues in
    Molecular Biology
  • Date: February 17, 2004

2
MUMmer's Significance
  • MUMmer is a system for rapidly aligning entire
    genomes or very large protein sequences
  • Input: 2 strings. Output: base-to-base alignment
  • What distinguishes MUMmer from previous
    algorithms?
  • Can align very large sequences (millions of
    nucleotides long)

3
Complete Genome Alignment
  • What it means to align complete genomes
  • Human chromosome 1: 200 million base pairs
  • Previous alignment algorithms are space
    efficient: O(n)
  • But their O(n^2) time complexity is unacceptably slow
  • We would like to align using a global dynamic
    programming algorithm (Needleman/Wunsch), but that is
    infeasible; we need a shortcut
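To make the quadratic cost concrete, here is a minimal sketch of Needleman/Wunsch scoring. The function name `nw_score` and the scoring values (match +1, mismatch -1, gap -1) are illustrative assumptions, not taken from the slides. It keeps only two rows of the table, so space is O(n), but it still fills every cell, so time is O(n^2).

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score of a and b (score only, no traceback)."""
    # prev[j] = best score for aligning the processed prefix of a with b[:j]
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against the empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]
```

For two sequences of 200 million bases each, the inner statement would run 4 x 10^16 times, which is why a shortcut is needed.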

4
MUMmer vs. BLAST
  • Instead, use a technique somewhat similar to BLAST
    (find similar regions between strings)
  • BLAST: for comparing 1 known string to a large set
    of unknown strings
  • MUMmer: for aligning 2 very similar known strings

5
You know you've made it when ...
6
Overview MUMs Part 1
  • String A
  • CTTCAGCTAGCTAGTCCCTAGCTAATCAAGAGACTAGGATCAAGGCTAG
    SCTGAAGTGCACCAGGTTGCAATCCATCTATGAGCGATCGAATCGGATCG
    AGTCGAGCTAGCTAAGCTAGCTAGGGAGTCCAAAGACTGCGGATGCGAGT
    CGAGGCTTTAGAGCTAGCTAGCGCGATCGAGGCTAGCTATGCTAGCTATC
    ATCGCAAGCTAGCTGAGTCGCGATCGGGCGTAGCGATGTCTCTAGTCTCT
    AGTCGAGCTGATCGTAGCTAGTAATGTATCCATCTACTCTAGTAGATCGA
    TTAGTCGATCGATGCTAGATCGGATCGAGTCGAGATCGATGGAGTCGAGA
    TCGATCTAATCTATCTCTAAATGGAGCGA
  • String B
  • GCATCGTAGGCTGAGGCTTCGAGGCTAGTCGATGCTAGGTTGCAATCCA
    TCTATGAGCGATCGAATCGGATCGAGTCGAGCTAGCTAAGCTAGCTAGGG
    AGTCCAAACTCGCAAAGCTAGTGATCGATCGATATCGATTCGATCGGTGT
    CGCGATCGGGCGTAGCGATGTCTCTAGTCTCTAGTCGAGCTGATCGTAGC
    TAGTAATGTATCATAGCTAATCGCACTACTACGATGCGATCTCTAGTCGA
    TCTATCTCGGCTTCGATCGTA
  • How do we align without Needleman/Wunsch?

7
Overview MUMs Part 2
  • String A
  • CTTCAGCTAGCTAGTCCCTAGCTAATCAAGAGACTAGGATCAAGGCTAG
    SCTGAAGTGCACCAGGTTGCAATCCATCTATGAGCGATCGAATCGGATCG
    AGTCGAGCTAGCTAAGCTAGCTAGGGAGTCCAAAGACTGCGGATGCGAGT
    CGAGGCTTTAGAGCTAGCTAGCGCGATCGAGGCTAGCTATGCTAGCTATC
    ATCGCAAGCTAGCTGAGTCGCGATCGGGCGTAGCGATGTCTCTAGTCTCT
    AGTCGAGCTGATCGTAGCTAGTAATGTATCCATCTACTCTAGTAGATCGA
    TTAGTCGATCGATGCTAGATCGGATCGAGTCGAGATCGATGGAGTCGAGA
    TCGATCTAATCTATCTCTAAATGGAGCGA
  • String B
  • GCATCGTAGGCTGAGGCTTCGAGGCTAGTCGATGCTAGGTTGCAATCCA
    TCTATGAGCGATCGAATCGGATCGAGTCGAGCTAGCTAAGCTAGCTAGGG
    AGTCCAAACTCGCAAAGCTAGTGATCGATCGATATCGATTCGATCGGTGT
    CGCGATCGGGCGTAGCGATGTCTCTAGTCTCTAGTCGAGCTGATCGTAGC
    TAGTAATGTATCATAGCTAATCGCACTACTACGATGCGATCTCTAGTCGA
    TCTATCTCGGCTTCGATCGTA
  • Easier with large exact matches highlighted?

8
Overview MUMs Part 3
  • Any optimal global alignment will probably use
    these two subsequences as anchors
  • This is the shortcut needed to calculate global
    alignment quickly on large sequences
  • Very intuitive in alignment process, but only a
    heuristic

9
Overview MUMs Part 4
  • MUM: Maximal Unique Match
  • MUMs occur exactly once in each sequence
  • Ignores repeat sequences
  • MUMs found efficiently using suffix tree data
    structure (to be explained later)
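The MUM definition above can be checked directly with a brute-force sketch. This is not the suffix-tree method MUMmer uses; it is a quadratic illustration only, and the names `naive_mums`, `count_occurrences` and `min_len` are assumptions made for this example.

```python
def count_occurrences(text, pattern):
    """Count (possibly overlapping) occurrences of pattern in text."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def naive_mums(a, b, min_len=1):
    """Return (pos_a, pos_b, length) for every maximal unique match."""
    mums = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            # Skip starts that could be extended to the left (not maximal).
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            # Extend to the right as far as the characters agree.
            length = 0
            while (i + length < len(a) and j + length < len(b)
                   and a[i + length] == b[j + length]):
                length += 1
            if length < min_len:
                continue
            match = a[i:i + length]
            # Keep the match only if it occurs exactly once in each sequence.
            if count_occurrences(a, match) == 1 and count_occurrences(b, match) == 1:
                mums.append((i, j, length))
    return mums


print(naive_mums("TTGATTACACC", "GGGATTACAAA", min_len=4))  # [(2, 2, 7)]
```

The single reported match is GATTACA, which starts at offset 2 in both strings.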

10
Overview Choosing MUMs
  • Once have anchors, need to choose which ones to
    use in alignment
  • All MUMs
  • MUMs used in alignment (subset)

11
Overview Closing the Gaps
  • After choose/align anchors, what next?
  • Close the gaps
  • Use dynamic programming (Smith-Waterman)
    to extend MUMs
  • Smaller regions, so can compute quickly
  • Implicit assumption: sequences are very similar

12
3 Phases of MUMmer
  • 3 Phases
  • 1) Obtaining MUMs (via Suffix Trees)
  • 2) MUM choosing (via Longest Increasing
    Subsequences)
  • 3) Gap closing (via dynamic programming,
    Smith-Waterman)
  • Comprised of previously known algorithms packaged
    to form a unique algorithm

13
Critiquing MUMmer's Output
  • Sample Output
  • Sequence A: ACTGC_TGAC_CTA
  • Sequence B: ACC_CA_GGCTCG_
  • MUMmer best case: same alignment as
    Needleman/Wunsch
  • MUMmer worst case: suboptimal alignment
  • At least computable at genome scale, whereas
    Needleman/Wunsch is not

14
Phase 1 Obtaining MUMs
  • via Suffix Trees
  • Edward M. McCreight. (1976) A Space-Economical
    Suffix Tree Construction Algorithm.
    http://doi.acm.org/10.1145/321941.321946

15
Suffix Trees Outline
  • I. Suffix Trees
  • A. Motivations
  • B. Tries
  • C. Suffix Trees
  • D. How Mummer utilizes suffix trees

16
Tries
  • The term "trie" comes from "retrieval"
  • Introduced in 1960s by Fredkin
  • Suffix trees are a type of trie
  • Uses
  • Quickly search large text via preprocessing
  • Used for regular expressions, longest common
    substring, automatic command completion, etc

17
Non-Compact Trie Example
  • 5 strings encoded: BIG, BIGGER, BILL, GOOD, GOSH
  • Every edge represents a symbol of the alphabet

18
Implementation of Tries
  • Use linked list
  • Include pointers to sibling and first child
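A minimal sketch of the linked-list layout described above, using the five strings from the previous slide. The class and function names (`TrieNode`, `insert`, `contains`, `count_nodes`) are assumptions chosen for this example.

```python
class TrieNode:
    def __init__(self, symbol):
        self.symbol = symbol       # character on the edge leading to this node
        self.first_child = None    # head of the linked list of children
        self.next_sibling = None   # next node in the parent's child list
        self.is_word = False       # True if a stored string ends here


def find_child(node, symbol):
    """Walk the sibling list of node's children looking for symbol."""
    child = node.first_child
    while child is not None and child.symbol != symbol:
        child = child.next_sibling
    return child


def insert(root, word):
    node = root
    for symbol in word:
        child = find_child(node, symbol)
        if child is None:
            child = TrieNode(symbol)
            child.next_sibling = node.first_child  # push onto front of list
            node.first_child = child
        node = child
    node.is_word = True


def contains(root, word):
    node = root
    for symbol in word:
        node = find_child(node, symbol)
        if node is None:
            return False
    return node.is_word


def count_nodes(node):
    """Number of nodes below node (one per edge in the non-compact trie)."""
    total = 0
    child = node.first_child
    while child is not None:
        total += 1 + count_nodes(child)
        child = child.next_sibling
    return total


root = TrieNode(None)
for word in ["BIG", "BIGGER", "BILL", "GOOD", "GOSH"]:
    insert(root, word)
print(contains(root, "BIGGER"), contains(root, "BI"), count_nodes(root))  # True False 14
```

Each node carries only two pointers regardless of alphabet size, at the cost of a linear scan through siblings on lookup.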

19
Compacting Tries Part 1
  • Method 1: trim chains leading to leaves
  • Compact trie for strings BIG, BIGGER, BILL,
    GOOD, GOSH

20
Compacting Tries Part 2
  • Method 2: Patricia Tries
  • Before, one edge per character
  • Now, unary nodes are collapsed

21
Suffix Trie
  • Normal trie, but input strings are suffixes
  • Assume text string t_1 ... t_n
  • Q: Tree has how many leaves?
  • A: Tree has n leaves
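The leaf count can be checked with a small suffix trie built from nested dictionaries. This is a quadratic-size illustration, not a suffix tree, and the helper names are assumptions. The count of n leaves relies on no suffix being a prefix of another suffix, which is why a unique end character is usually appended.

```python
def build_suffix_trie(text):
    """Insert every suffix of text into a trie of nested dicts."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root


def count_leaves(node):
    """A node with no children is a leaf."""
    if not node:
        return 1
    return sum(count_leaves(child) for child in node.values())


print(count_leaves(build_suffix_trie("ababc")))  # 5: the final 'c' is unique
print(count_leaves(build_suffix_trie("abab")))   # 2: 'ab' and 'b' end inside longer suffixes
```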

22
Suffix Tree
  • First, compact the suffix trie
  • Next, collapse unary nodes

23
Suffix Trees Decreasing storage
  • Rather than storing strings, store a pair of
    indices (x,y) where x is beginning of string and
    y is the end
  • Storage becomes O(n)

24
Suffix Tree Algorithms
  • First linear-time algorithm given by Weiner
    (1973)
  • McCreight developed more space efficient
    algorithm (1976)
  • The two original papers have a reputation for
    being difficult to understand

25
McCreight's Algorithm Part 1
  • Algorithm M
  • Maps a finite string S of characters into an
    auxiliary index to S in the form of a digital
    search tree T whose paths are the suffixes of S,
    and whose terminal nodes correspond uniquely to
    positions within S.
  • A Space-Economical Suffix Tree Construction
    Algorithm. Edward M. McCreight.
  • http://doi.acm.org/10.1145/321941.321946

26
McCreight's Algorithm Part 2
  • S = ababc
  • Definitions
  • suf_i is the suffix of S beginning at character
    position i
  • head_i is the longest prefix of suf_i which is also
    a prefix of suf_j for some j < i
  • tail_i is suf_i - head_i (suf_i with the prefix
    head_i removed)
  • suf_3 = ?
  • abc
  • head_3 = ?
  • ab
  • tail_3 = ?
  • abc - ab = c
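The three definitions can be evaluated naively to check the worked example. The functions below are an illustrative sketch with 1-based positions as on the slide; they compute the values by direct comparison and are not how McCreight's algorithm finds them.

```python
def suf(s, i):
    """Suffix of s beginning at 1-based position i."""
    return s[i - 1:]


def head(s, i):
    """Longest prefix of suf_i that is also a prefix of suf_j for some j < i."""
    current = suf(s, i)
    best = ""
    for j in range(1, i):
        earlier = suf(s, j)
        k = 0
        while k < len(current) and k < len(earlier) and current[k] == earlier[k]:
            k += 1
        if k > len(best):
            best = current[:k]
    return best


def tail(s, i):
    """suf_i with the prefix head_i removed."""
    return suf(s, i)[len(head(s, i)):]


S = "ababc"
print(suf(S, 3), head(S, 3), tail(S, 3))  # abc ab c
```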

27
McCreight's Algorithm Part 3
  • Builds the suffix tree by adding suf_i to Tree_(i-1)
  • Initially, Tree_1 contains only suf_1 (the entire
    string)
  • To obtain Tree_2, add suf_2 to Tree_1
  • Continue until you have added suf_n to
    Tree_(n-1)
  • Tree_n is the final suffix tree

28
McCreight's Algorithm Part 4
  • Adding a suffix (going from T_2 to T_3)
  • suf_3 = abc, head_3 = ab, tail_3 = c

29
Suffix Trees Complexity
  • Adding a non-terminal and a new arc that
    corresponds to tail takes at most constant time
  • If head could be found in at most constant time,
    the algorithm would run in time linear in n, the
    length of the string S
  • Do so by using suffix links (see paper for
    details)

30
Finding MUMs from Suffix Trees
31
Finding MUMs from Suffix Trees 2
32
Finding MUMs from Suffix Trees 3
  • More general case

33
Phase 2 Choosing MUMs For Alignment
  • via Longest Increasing Subsequence (LIS)
  • Gusfield, D. (1997) Algorithms on Strings, Trees
    and Sequences: Computer Science and Computational
    Biology.

34
Motivation For Choosing MUMs
  • Q: Why can't we use all MUMs for alignment?
  • A: Because MUMs can cross, we can only choose an
    increasing set of MUMs
  • Problem: given a set of MUMs, how do we choose
    the optimal sequence?

35
Choosing MUMs (Continued)
  • Configuration can be uniquely represented
  • P = 1, 2, 3, 4, 6, 7, 5
  • LIS(P) = 1, 2, 3, 4, 6, 7
  • Determining the optimal sequence of MUMs reduces to
    finding the LIS of P
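One way to obtain such a P is to number the MUMs by their order in the first sequence and then read the numbers off in the order the MUMs appear in the second. The sketch below assumes each MUM is given as a (position in A, position in B) pair; the function name and the sample positions are hypothetical and chosen only to reproduce the P on the slide.

```python
def mum_permutation(mums):
    """mums: list of (pos_a, pos_b) start positions, one pair per MUM."""
    # Label MUMs 1..k by their order in sequence A.
    by_a = sorted(mums, key=lambda m: m[0])
    label = {m: k for k, m in enumerate(by_a, start=1)}
    # Read the labels off in the order the MUMs appear in sequence B.
    by_b = sorted(mums, key=lambda m: m[1])
    return [label[m] for m in by_b]


# Hypothetical start positions in which MUM 5 lies after MUMs 6 and 7 in B.
mums = [(10, 12), (50, 55), (90, 97), (130, 140),
        (170, 300), (210, 180), (250, 220)]
print(mum_permutation(mums))  # [1, 2, 3, 4, 6, 7, 5]
```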

36
IS Definition
  • Increasing Subsequence: values (strictly)
    increase from left to right
  • Sequence P = 4, 2, 1, 5, 8, 6, 9, 10
  • Examples of two increasing subsequences:
  • 4, 5, 9 or 2, 5, 6, 9, 10

37
DS Subsequence
  • Decreasing Subsequence: numbers that are
    decreasing from left to right
  • Sequence P = 4, 2, 1, 5, 8, 6, 9, 10
  • Examples? <insert class participation here>
  • {4, 2, 1}, {4, 2}, {4, 1}, {2, 1}, or {8, 6}

38
Covers Definition Part 1
  • Cover of P: set of decreasing subsequences of P
    that contains all numbers of P
  • P = 7, 3, 4, 8, 6, 2, 1
  • Some possible covers?
  • {7, 3} {4} {8, 6, 2, 1}
  • OR
  • {7, 3, 2, 1} {4} {8, 6}
  • And others

39
Covers Definitions Part 2
  • Size of cover: number of decreasing subsequences
    it contains
  • Smallest cover: cover with minimum size
  • If I is an increasing subsequence of P with
    length equal to the size of a cover of P, call it
    C, then I is a longest increasing subsequence of
    P and C is a smallest cover of P
  • Why?

40
Covers Relation to LIS
  • If I is an increasing subsequence of P with
    length equal to the size of a cover of P, call it
    C, then I is a longest increasing subsequence of
    P and C is a smallest cover of P. Why?
  • Because no increasing subsequence can contain
    more than one character from each decreasing
    subsequence in a cover

41
Covers Examples
  • P = 7, 3, 4, 8, 6, 2, 1
  • Two possible covers:
  • {7, 3} {4} {8, 6, 2, 1}
  • {7, 3, 2, 1} {4} {8, 6}
  • What is the size of the smallest cover?
  • 3 (no cover can contain < 3 decreasing
    subsequences)
  • How many elements in LIS?
  • 3
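Both claims in this example can be verified by brute force. The helpers below are an illustrative sketch with assumed names; `is_cover` additionally requires each number to be used exactly once, which is slightly stricter than the definition on the slide, and the exhaustive LIS search is only practical for very short sequences.

```python
from itertools import combinations


def is_decreasing(seq):
    return all(x > y for x, y in zip(seq, seq[1:]))


def is_subsequence(sub, seq):
    """True if sub appears in seq in the same left-to-right order."""
    it = iter(seq)
    return all(x in it for x in sub)


def is_cover(cover, p):
    """Every part is a decreasing subsequence of p and each number is used once."""
    used = sorted(x for part in cover for x in part)
    return used == sorted(p) and all(
        is_decreasing(part) and is_subsequence(part, p) for part in cover
    )


def lis_length_bruteforce(p):
    """Length of the longest strictly increasing subsequence (exponential time)."""
    for k in range(len(p), 0, -1):
        for cand in combinations(p, k):  # combinations keep the original order
            if all(x < y for x, y in zip(cand, cand[1:])):
                return k
    return 0


P = [7, 3, 4, 8, 6, 2, 1]
print(is_cover([[7, 3], [4], [8, 6, 2, 1]], P))   # True
print(is_cover([[7, 3, 2, 1], [4], [8, 6]], P))   # True
print(lis_length_bruteforce(P))                   # 3
```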

42
Covers Examples Continued
  • P = 7, 3, 4, 8, 6, 2, 1
  • For a particular cover, say
  • {7, 3, 2, 1} {4} {8, 6}
  • You can only choose one element from each
    subsequence, otherwise subsequence would not be
    increasing.
  • Example
  • IS = 3, 4, 6: to add an element, would need to
    choose from a subsequence from which you already
    chose

43
Greedy Cover Algorithm Part 1
  • To create a smallest cover, use Greedy Cover
    algorithm
  • Start from left of sequence P
  • Examine each number
  • Place number at the end of the left-most
    subsequence it can extend
  • If none exists, make a new decreasing subsequence
    (to the far right)

44
Greedy Cover Algorithm Part 2
  • Example: P = 6, 3, 5, 1, 9, 7, 2, 10, 4
  • {6}
  • {6, 3}
  • {6, 3} {5}
  • {6, 3, 1} {5}
  • {6, 3, 1} {5} {9}
  • {6, 3, 1} {5} {9, 7}
  • {6, 3, 1} {5, 2} {9, 7}
  • {6, 3, 1} {5, 2} {9, 7} {10}
  • {6, 3, 1} {5, 2} {9, 7, 4} {10}
  • {6, 3, 1} {5, 2} {9, 7, 4} {10} (smallest cover)
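The Greedy Cover steps traced above can be sketched as a direct linear scan over the existing subsequences. The function name `greedy_cover` is an assumption, and this simple version takes O(n) per element rather than using a faster search.

```python
def greedy_cover(p):
    """Build a cover of p from decreasing subsequences, left to right."""
    cover = []
    for x in p:
        for pile in cover:
            # Extend the left-most subsequence whose last element is larger.
            if x < pile[-1]:
                pile.append(x)
                break
        else:
            # No subsequence can be extended: start a new one on the far right.
            cover.append([x])
    return cover


print(greedy_cover([6, 3, 5, 1, 9, 7, 2, 10, 4]))
# [[6, 3, 1], [5, 2], [9, 7, 4], [10]]
```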

45
Obtaining LIS From Smallest Cover
  • LIS Algorithm
  • Set i = number of subsequences in greedy cover
  • Set I to empty list
  • Choose any element x in subsequence i and place
    it at the front of list I
  • While i > 1
  • Scan from top of subsequence (i-1) and find first
    element y smaller than x
  • Set x = y and i = i - 1
  • Place x at the front of list I

46
Obtaining LIS Example
  • P = 6, 3, 5, 1, 9, 7, 2, 10, 4
  • Smallest Cover: {6, 3, 1} {5, 2} {9, 7, 4} {10}
  • 6  5  9  10
  • 3  2  7
  • 1     4
  • i = number of subsequences = 4
  • i = 4, x = 10, I = 10
  • i = 3, x = 9, I = 9, 10
  • i = 2, x = 5, I = 5, 9, 10
  • i = 1, x = 3, I = 3, 5, 9, 10
  • P = 6, 3, 5, 1, 9, 7, 2, 10, 4
  • LIS = 3, 5, 9, 10
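The backward walk in this example can be sketched as follows. The function takes a cover that was produced by the greedy procedure (passed in here as a literal list); the name `lis_from_cover` and the choice of the top element of the last subsequence as the starting x are assumptions.

```python
def lis_from_cover(cover):
    """Recover one longest increasing subsequence from a greedy cover."""
    if not cover:
        return []
    i = len(cover)
    x = cover[i - 1][0]          # any element of the last subsequence works
    lis = [x]
    while i > 1:
        # Scan subsequence i-1 from the top for the first element below x.
        x = next(y for y in cover[i - 2] if y < x)
        lis.insert(0, x)
        i -= 1
    return lis


print(lis_from_cover([[6, 3, 1], [5, 2], [9, 7, 4], [10]]))  # [3, 5, 9, 10]
```

The scan always succeeds on a greedy cover: when x was placed in subsequence i, the last element of subsequence i-1 at that moment was smaller than x, otherwise x would have been placed there.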

47
How Mummer Utilizes LIS
  • P = 1, 2, 3, 4, 6, 7, 5
  • LIS = 1, 2, 3, 4, 6, 7

48
Obtaining LIS Complexity Analysis
  • Greedy cover can be found in O(n log n)
  • LIS found from greedy cover in O(n)
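The O(n log n) bound comes from the fact that the last elements of the greedy subsequences increase from left to right, so the left-most subsequence that a number can extend is found by binary search. The sketch below tracks only those last elements and returns the LIS length; it assumes all values are distinct, as they are for MUM labels, and the function name is an assumption.

```python
from bisect import bisect_right


def lis_length(p):
    """Length of the longest increasing subsequence of distinct values."""
    tails = []  # tails[k] = last element of greedy subsequence k (increasing in k)
    for x in p:
        k = bisect_right(tails, x)  # left-most subsequence whose last element > x
        if k == len(tails):
            tails.append(x)         # start a new subsequence on the far right
        else:
            tails[k] = x            # x becomes the new last element
    return len(tails)


print(lis_length([6, 3, 5, 1, 9, 7, 2, 10, 4]))  # 4
print(lis_length([1, 2, 3, 4, 6, 7, 5]))         # 6
```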

49
Phase 3 Closing the Gaps
  • via Smith-Waterman
  • Smith, T. and Waterman, M. (1981) Identification
    of Common Molecular Subsequences. Journal of
    Molecular Biology , 147, 195-197.
  • http://citeseer.ist.psu.edu/smith81identification.html

50
Closing the Gaps
  • After global-MUM alignment found, need to close
    local gaps
  • Gap: interruption in MUM-alignment
  • Types of gaps
  • 1) SNP: Single Nucleotide Polymorphism
  • 2) Insertion
  • 3) Highly polymorphic region
  • 4) Repeat

51
Types of Gaps Examples
52
SNP Processing
  • SNP: Single Nucleotide Polymorphism
  • SNPs in human DNA appear to be associated with
    many health issues (genetic disease)
  • Q: How can we determine SNPs by using MUMmer?
  • A: By looking between MUMs: SNPs are surrounded by
    matching subsequences
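As a rough sketch of "looking between MUMs", the function below reports every place where two consecutive MUMs in the alignment are separated by exactly one base in both sequences. The MUM representation as (pos_a, pos_b, length) triples, the function name and the sample data are all assumptions for illustration; MUMmer's own gap classification is more involved.

```python
def snps_between_mums(a, b, mums):
    """mums: (pos_a, pos_b, length) triples used in the alignment, in order."""
    snps = []
    for (a1, b1, len1), (a2, b2, _) in zip(mums, mums[1:]):
        gap_a = a2 - (a1 + len1)   # unmatched bases in a between the two MUMs
        gap_b = b2 - (b1 + len1)   # unmatched bases in b between the two MUMs
        if gap_a == 1 and gap_b == 1:
            pa, pb = a1 + len1, b1 + len1
            snps.append((pa, pb, a[pa], b[pb]))
    return snps


# Hypothetical sequences that differ only at offset 5.
seq_a = "ACGTAGCTTG"
seq_b = "ACGTATCTTG"
print(snps_between_mums(seq_a, seq_b, [(0, 0, 5), (6, 6, 4)]))
# [(5, 5, 'G', 'T')]
```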

53
Inserts Transpositions
  • Large gaps in one genome but not the other
  • Transpositions: subsequences that were deleted
    from one location, inserted elsewhere
  • Detected during post-processing
  • How?
  • Part of MUM-alignment

54
Simple Inserts
  • Simple inserts: subsequences that appear in only
    one genome
  • Do not appear in MUM-alignment

55
Polymorphic Regions
  • Gaps in alignment caused by sequences with large
    numbers of differences
  • If regions small enough, align using dynamic
    programming (Smith-Waterman)
  • Gives optimal alignment given pre-specified
    insertion and mutation costs
  • If regions too large, recursively call MUMmer
    algorithm
  • Q: What must change when running MUMmer again?
  • A: Minimum MUM size must be smaller
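The dynamic programming step for small regions can be sketched with a score-only version of Smith-Waterman. The costs used here (match +2, mismatch -1, gap -2) stand in for the "pre-specified insertion and mutation costs" and are assumptions, as is the function name; a full implementation would also trace back to recover the alignment itself.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (no traceback)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


print(smith_waterman_score("TTTACGTAAA", "CCCACGTGGG"))  # 8 (the shared ACGT)
```

Because the cost is proportional to the product of the two region lengths, this is only practical once the anchors have reduced the problem to short gaps.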

56
Repeat Processing
  • Repeat sequences do not appear in MUM alignment
  • Only includes sequences that appear exactly once
    in each genome
  • In a sense, repeat sequences can fake out the
    alignment

57
Finding Inversions
  • Professor Lopresti mentioned a method of finding
    inversions last class.
  • Q: How can we identify inversions by using
    MUMmer?
  • A: By running in both directions on gap regions
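One common way to search a DNA region "in the other direction" is to match against its reverse complement. The slide does not spell out the exact procedure, so treating it this way is an assumption; the helper below is only a sketch of that transformation.

```python
# Translation table pairing each base with its complement (A-T, C-G).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


print(reverse_complement("ATGC"))  # GCAT
```

An inverted segment in one genome would then show up as ordinary matches between the gap region of one sequence and the reverse complement of the corresponding region in the other.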

58
Results Mummer 1.0 Storage
  • 12 bytes per leaf node (suffix tree)
  • 24 bytes per internal node (suffix tree)
  • 1 byte for each base in genome
  • Generous upper bound: 37 bytes per base (12 + 24 + 1)
  • Therefore, comparing two 100 Mb sequences would
    require < 8 gigabytes of memory
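The estimate follows from simple arithmetic, reproduced here as a sketch. The function name is an assumption, and gigabytes are taken as 10^9 bytes.

```python
def suffix_tree_memory_bytes(total_bases, bytes_per_base=37):
    """Upper bound on memory, using the per-base figure from the slide."""
    return total_bases * bytes_per_base


total = suffix_tree_memory_bytes(2 * 100_000_000)  # two 100 Mb sequences
gigabytes = total / 1e9
print(total, gigabytes)  # 7400000000 7.4
```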

59
Improvements Mummer 2.0
  • New suffix tree algorithm (Kurtz)
  • At most 20 bytes per base pair (compared to 37)
  • Stream query string against suffix tree, cutting
    down suffix tree storage

60
Improvements Mummer 3.0
  • New graphical modules
  • Search operations are optimal or near-optimal
  • Code rewrite (modular design)
  • Open source

61
MUMmer Conclusion
  • via Arthur Loder

62
In Conclusion
  • MUMmer allows alignment of sequences which are
    too long to be aligned using previous algorithms
  • Utilizes suffix trees, LIS, and Smith-Waterman to
    obtain results
  • Outputs a base-to-base alignment of two sequences

63
MUMmer Discussion
64
Discussion Questions
  • What if MUMs overlap? What if some are longer
    than others? How does LIS take this into account?
  • Are repeats really ignored?
  • Should repeats be ignored? Aren't they part of
    the global alignment? The goal is to obtain the
    same optimal alignment as in Needleman/Wunsch,
    which does not ignore repeats.

65
Demonstration
  • http://www.tigr.org/tigr-scripts/CMR2/webmum/mumplot