CSE 5290: Algorithms for Bioinformatics Fall 2011 - PowerPoint PPT Presentation

About This Presentation
Title:

CSE 5290: Algorithms for Bioinformatics Fall 2011

Description:

CSE 5290: Algorithms for Bioinformatics Fall 2011 Suprakash Datta datta_at_cse.yorku.ca Office: CSEB 3043 Phone: 416-736-2100 ext 77875 Course page: http://www.cse.yorku ... – PowerPoint PPT presentation

Slides: 84
Provided by: Wim92
Learn more at: http://www.eecs.yorku.ca
Transcript and Presenter's Notes

Title: CSE 5290: Algorithms for Bioinformatics Fall 2011


1
CSE 5290: Algorithms for Bioinformatics, Fall 2011
  • Suprakash Datta
  • datta_at_cse.yorku.ca
  • Office: CSEB 3043
  • Phone: 416-736-2100 ext 77875
  • Course page: http://www.cse.yorku.ca/course/5290

2
Last time
  • Graph algorithms for sequence assembly
  • Next: Pattern matching
  • The following slides are based on slides by the
    authors of our text.

3
Genomic Repeats
  • Example of repeats
  • ATGGTCTAGGTCCTAGTGGTC
  • Motivation to find them
  • Genomic rearrangements are often associated with
    repeats
  • Trace evolutionary secrets
  • Many tumors are characterized by an explosion of
    repeats

4
Genomic Repeats
  • The problem is often more difficult: the repeats may be inexact
  • ATGGTCTAGGACCTAGTGTTC
  • Motivation to find them
  • Genomic rearrangements are often associated with
    repeats
  • Trace evolutionary secrets
  • Many tumors are characterized by an explosion of
    repeats

5
l-mer Repeats
  • Long repeats are difficult to find
  • Short repeats are easy to find (e.g., hashing)
  • Simple approach to finding long repeats
  • Find exact repeats of short l-mers (l is usually
    10 to 13)
  • Use l-mer repeats to potentially extend into
    longer, maximal repeats

6
l-mer Repeats (contd)
  • There are typically many locations where an l-mer
    is repeated
  • GCTTACAGATTCAGTCTTACAGATGGT
  • The 4-mer TTAC starts at locations 3 and 17

7
Extending l-mer Repeats
  • GCTTACAGATTCAGTCTTACAGATGGT
  • Extend these 4-mer matches
  • GCTTACAGATTCAGTCTTACAGATGGT
  • Maximal repeat: CTTACAGAT, starting at locations 2 and 16
    (TTACAGAT if the 4-mer match is extended only to the right)
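
The extension step can be sketched in a few lines of Python. This is a minimal illustration, not the course's implementation: the function name `extend_repeat` and the 0-based indexing are assumptions, and overlapping copies are not treated specially. On the slide's example string, extending the TTAC match in both directions gives CTTACAGAT (one base to the left, four to the right).

```python
def extend_repeat(genome: str, i: int, j: int, l: int) -> tuple:
    """Extend an exact l-mer match at 0-based positions i < j in both directions.

    Returns (start_i, start_j, length) of the extended repeat.
    """
    left = 0
    # Grow to the left while the preceding letters agree.
    while i - left - 1 >= 0 and genome[i - left - 1] == genome[j - left - 1]:
        left += 1
    right = l
    # Grow to the right while the following letters agree.
    while j + right < len(genome) and genome[i + right] == genome[j + right]:
        right += 1
    return i - left, j - left, left + right


genome = "GCTTACAGATTCAGTCTTACAGATGGT"
start_i, start_j, length = extend_repeat(genome, 2, 16, 4)  # TTAC at 1-based 3 and 17
print(start_i + 1, start_j + 1, genome[start_i:start_i + length])  # 2 16 CTTACAGAT
```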

8
Maximal Repeats
  • To find maximal repeats in this way, we need ALL
    start locations of all l-mers in the genome
  • Hashing lets us find repeats quickly in this
    manner

9
Hashing: What is it?
  • What does hashing do?
  • For each piece of data, generate an integer (ideally a
    different integer for different data)
  • Store data in an array at the integer index
    generated from the data
  • Hashing is a very efficient way to store and
    retrieve data

10
Hashing: Definitions
  • Hash table: array used in hashing
  • Records: data stored in a hash table
  • Keys: identify sets of records
  • Buckets
  • Hash function: uses a key to generate an index
    (bucket) to insert at in the hash table
  • Collision: when more than one record is mapped to
    the same index in the hash table

11
When does hashing work well?
  • When there are few collisions!
  • Good hash functions minimize collisions
  • Problems:
  • Knowing what function to use
  • Knowing what buckets to use

12
Hashing: Example
  • Where do the animals eat?
  • Records: each animal
  • Keys: where each animal eats

13
Hashing DNA sequences
  • Each l-mer can be translated into a binary string
    (A, T, C, G can be represented as 00, 01, 10, 11)
  • After assigning a unique integer per l-mer it is
    easy to get all start locations of each l-mer in
    a genome
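
A minimal sketch of this encoding, using the slide's assignment A = 00, T = 01, C = 10, G = 11. The function name `lmer_to_int` is hypothetical.

```python
# Two-bit codes from the slide: A=00, T=01, C=10, G=11.
CODE = {"A": 0b00, "T": 0b01, "C": 0b10, "G": 0b11}


def lmer_to_int(lmer: str) -> int:
    """Pack an l-mer into an integer, two bits per base."""
    value = 0
    for base in lmer:
        value = (value << 2) | CODE[base]
    return value


print(lmer_to_int("TTAC"))  # binary 01 01 00 10 = 82
```

Because every l-mer of a fixed length l maps to a different integer in the range 0 to 4^l - 1, the integer can serve directly as an array index when 4^l is small enough to allocate.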

14
Hashing: Maximal Repeats
  • To find repeats in a genome:
  • For all l-mers in the genome, note the start
    position and the sequence
  • Generate a hash table index for each unique l-mer
    sequence
  • In each index of the hash table, store all genome
    start locations of the l-mer which generated that
    index
  • Extend l-mer repeats to maximal repeats
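
The indexing steps above might look as follows in Python. This sketch leans on Python's built-in `dict` as the hash table instead of an explicit hash function, and the name `index_lmers` is an assumption made for illustration.

```python
from collections import defaultdict


def index_lmers(genome: str, l: int) -> dict:
    """Map each l-mer to the list of its 1-based start positions."""
    index = defaultdict(list)
    for start in range(len(genome) - l + 1):
        index[genome[start:start + l]].append(start + 1)
    return index


genome = "GCTTACAGATTCAGTCTTACAGATGGT"
# Keep only l-mers that occur more than once: these are the repeat seeds.
repeats = {lmer: pos for lmer, pos in index_lmers(genome, 4).items() if len(pos) > 1}
print(repeats["TTAC"])  # [3, 17]
```

Each repeated l-mer is then a seed that can be extended to a maximal repeat.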

15
Hashing: Collisions
  • Dealing with collisions:
  • Chain all start locations of l-mers (linked
    list)
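
Chaining can be illustrated with a deliberately small table so that collisions are unavoidable. In this sketch (names `bucket_of`, `build_table`, and `lookup` are hypothetical, and Python lists stand in for linked lists), each chain stores the l-mer alongside its position so that different l-mers sharing a bucket can be told apart.

```python
CODE = {"A": 0, "T": 1, "C": 2, "G": 3}


def bucket_of(lmer: str, n_buckets: int) -> int:
    """Hash function: base-4 value of the l-mer, reduced modulo the table size."""
    value = 0
    for base in lmer:
        value = value * 4 + CODE[base]
    return value % n_buckets


def build_table(genome: str, l: int, n_buckets: int) -> list:
    """Hash table with chaining: one list of (l-mer, 1-based start) per bucket."""
    table = [[] for _ in range(n_buckets)]
    for start in range(len(genome) - l + 1):
        lmer = genome[start:start + l]
        table[bucket_of(lmer, n_buckets)].append((lmer, start + 1))
    return table


def lookup(table: list, lmer: str) -> list:
    """Walk the chain and keep only entries whose key really equals lmer."""
    chain = table[bucket_of(lmer, len(table))]
    return [pos for key, pos in chain if key == lmer]


table = build_table("GCTTACAGATTCAGTCTTACAGATGGT", 4, n_buckets=7)
print(lookup(table, "TTAC"))  # [3, 17]
```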

16
Hashing: Summary
  • When finding genomic repeats from l-mers:
  • Generate a hash table index for each l-mer
    sequence
  • In each index, store all genome start locations
    of the l-mer which generated that index
  • Extend l-mer repeats to maximal repeats

17
Pattern Matching
  • What if, instead of finding repeats in a genome,
    we want to find all sequences in a database that
    contain a given pattern?
  • This leads us to a different problem, the Pattern
    Matching Problem

18
Pattern Matching Problem
  • Goal: Find all occurrences of a pattern in a text
  • Input: Pattern $p = p_1 \ldots p_n$ and text $t = t_1 \ldots t_m$
  • Output: All positions $1 \le i \le m - n + 1$ such
    that the n-letter substring of t starting at i
    matches p
  • Motivation: Searching a database for a known pattern

19
Exact Pattern Matching: A Brute-Force Algorithm
  • PatternMatching(p, t)
  • n ← length of pattern p
  • m ← length of text t
  • for i ← 1 to (m - n + 1)
  •   if $t_i \ldots t_{i+n-1} = p$
  •     output i
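
A direct Python rendering of the brute-force algorithm, as a sketch: the function name is an assumption, and positions are reported 1-based to match the slides.

```python
def pattern_matching(p: str, t: str) -> list:
    """Return all 1-based positions i where p occurs in t (brute force, O(nm))."""
    n, m = len(p), len(t)
    # Compare p against every length-n window of t.
    return [i + 1 for i in range(m - n + 1) if t[i:i + n] == p]


print(pattern_matching("GCAT", "CGCATC"))  # [2]
```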

20
Exact Pattern Matching: An Example
  • PatternMatching algorithm for
  • Pattern: GCAT
  • Text: CGCATC

CGCATC
GCAT      (i = 1: no match)
 GCAT     (i = 2: match, output 2)
  GCAT    (i = 3: no match)
21
Exact Pattern Matching: Running Time
  • PatternMatching runtime: O(nm)
  • The average case is often O(m)
  • Rarely will there be close to n comparisons in
    the substring test (line 4 of PatternMatching)
  • Better solution: suffix trees
  • Can solve the problem in O(m) time
  • Conceptually related to keyword trees

22
Keyword Trees Example
  • Keyword tree
  • Apple

23
Keyword Trees Example (contd)
  • Keyword tree
  • Apple
  • Apropos

24
Keyword Trees Example (contd)
  • Keyword tree
  • Apple
  • Apropos
  • Banana

25
Keyword Trees Example (contd)
  • Keyword tree
  • Apple
  • Apropos
  • Banana
  • Bandana

26
Keyword Trees Example (contd)
  • Keyword tree
  • Apple
  • Apropos
  • Banana
  • Bandana
  • Orange

27
Keyword Trees: Properties
  • Stores a set of keywords in a rooted labeled
    tree
  • Each edge is labeled with a letter from an alphabet
  • Any two edges coming out of the same vertex have
    distinct labels
  • Every keyword stored can be spelled on a path
    from the root to some leaf
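
These properties can be sketched with nested dictionaries, one dictionary per vertex and one key per outgoing edge label. The end-of-keyword marker `END` is an assumption added here so that a keyword that is a prefix of another keyword can still be recognized; the slides only consider keywords that end at leaves.

```python
END = None  # marker key: a keyword ends at this vertex (assumption, see above)


def build_keyword_tree(keywords) -> dict:
    """Build a keyword tree as nested dicts; takes time linear in total keyword length."""
    root = {}
    for word in keywords:
        node = root
        for letter in word:
            # Reuse the edge if it exists, so edges out of a vertex stay distinct.
            node = node.setdefault(letter, {})
        node[END] = word
    return root


tree = build_keyword_tree(["apple", "apropos", "banana", "bandana", "orange"])
print(sorted(tree))            # ['a', 'b', 'o']
print(sorted(tree["a"]["p"]))  # ['p', 'r']
```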

28
Keyword Trees: Threading (cont'd)
  • Thread "appeal"
  • appeal

32
Keyword Trees: Threading (cont'd)
  • Thread "apple"
  • apple

37
Multiple Pattern Matching Problem
  • Goal: Given a set of patterns and a text, find
    all occurrences of any of the patterns in the text
  • Input: k patterns $p^1, \ldots, p^k$, and text $t = t_1 \ldots t_m$
  • Output: Positions $1 \le i \le m$ where the substring of t
    starting at i matches $p^j$ for some $1 \le j \le k$
  • Motivation: Searching a database for multiple known
    patterns

38
Multiple Pattern Matching: Straightforward Approach
  • Can solve as k Pattern Matching Problems
  • Runtime: O(kmn), using the PatternMatching algorithm k times
  • m: length of the text
  • n: average length of the patterns

39
Multiple Pattern Matching: Keyword Tree Approach
  • Or, we could use keyword trees
  • Build keyword tree in O(N) time; N is the total
    length of all patterns
  • With naive threading: O(N + nm)
  • Aho-Corasick algorithm: O(N + m)

40
Keyword Trees: Threading
  • To match patterns in a text using a keyword tree:
  • Build keyword tree of patterns
  • Thread the text through the keyword tree

41
Keyword Trees: Threading (cont'd)
  • Threading is complete when we reach a leaf in
    the keyword tree
  • When threading is complete, we've found a
    pattern in the text
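
Naive threading can be sketched as follows: build the keyword tree once, then restart at the root for every position of the text. This is the O(N + nm) approach, not Aho-Corasick. The names `build_keyword_tree`, `thread_text`, and the `END` marker are assumptions; the marker lets the sketch report a pattern even when it is a prefix of a longer pattern.

```python
END = None  # marker key: a pattern ends at this vertex (assumption)


def build_keyword_tree(patterns) -> dict:
    """Keyword tree as nested dicts, one dict per vertex."""
    root = {}
    for pattern in patterns:
        node = root
        for letter in pattern:
            node = node.setdefault(letter, {})
        node[END] = pattern
    return root


def thread_text(patterns, text: str) -> list:
    """Return (1-based position, pattern) for every occurrence of any pattern."""
    root = build_keyword_tree(patterns)
    hits = []
    for start in range(len(text)):
        node = root
        for letter in text[start:]:
            if letter not in node:
                break  # threading from this position fails
            node = node[letter]
            if END in node:
                hits.append((start + 1, node[END]))
    return hits


print(thread_text(["GCAT", "CAT", "AT"], "CGCATC"))
# [(2, 'GCAT'), (3, 'CAT'), (4, 'AT')]
```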

42
Suffix Trees = Collapsed Keyword Trees
  • Similar to keyword trees, except edges that form
    paths are collapsed
  • Each edge is labeled with a substring of a text
  • All internal vertices have at least two outgoing
    edges
  • Leaves labeled by the index of the pattern.

43
Suffix Tree of a Text
  • The suffix tree of a text is constructed from all its
    suffixes

Suffixes of ATCATG: ATCATG, TCATG, CATG, ATG, TG, G
[Figure: keyword tree of these suffixes and the corresponding suffix tree]
44
Suffix Tree of a Text
  • The suffix tree of a text is constructed from all its
    suffixes
  • How much time does it take?

Suffixes of ATCATG: ATCATG, TCATG, CATG, ATG, TG, G
[Figure: keyword tree of these suffixes and the corresponding suffix tree]
45
Suffix Tree of a Text
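  • The suffix tree of a text is constructed from all its
    suffixes
  • Keyword tree of all suffixes: quadratic. Time is linear in
    the total size of all suffixes, i.e., it is quadratic in
    the length of the text

Suffixes of ATCATG: ATCATG, TCATG, CATG, ATG, TG, G
[Figure: keyword tree of these suffixes and the corresponding suffix tree]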
46
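Suffix Trees: Advantages
  • The suffix tree of a text is constructed from all its
    suffixes
  • Suffix trees build faster than keyword trees
  • Keyword tree of all suffixes: quadratic time
  • Suffix tree: linear time (Weiner suffix tree algorithm)

Suffixes of ATCATG: ATCATG, TCATG, CATG, ATG, TG, G
[Figure: keyword tree of these suffixes and the corresponding suffix tree]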
47
Use of Suffix Trees
  • Suffix trees hold all suffixes of a text
  • e.g., ATCGC: ATCGC, TCGC, CGC, GC, C
  • Builds in O(m) time for a text of length m
  • To find any pattern of length n in a text:
  • Build suffix tree for the text
  • Thread the pattern through the suffix tree
  • Can find the pattern in the text in O(n) time!
  • O(n + m) time for the Pattern Matching Problem
  • Build suffix tree and look up pattern

48
Pattern Matching with Suffix Trees
  • SuffixTreePatternMatching(p, t)
  • Build suffix tree for text t
  • Thread pattern p through suffix tree
  • if threading is complete
  •   output positions of all p-matching leaves in
      the tree
  • else
  •   output "Pattern does not appear in text"
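
The idea can be tried out with an uncollapsed suffix trie, which is the quadratic keyword tree of all suffixes and not a linear-time suffix tree such as Weiner's. In this sketch every vertex records the start positions of the suffixes passing through it, which stands in for collecting the leaves below that vertex. The names `build_suffix_trie`, `suffix_tree_pattern_matching`, and `POSITIONS` are assumptions.

```python
POSITIONS = None  # key holding the 1-based suffix starts that pass through a vertex


def build_suffix_trie(t: str) -> dict:
    """Keyword tree of all suffixes of t (quadratic time and space)."""
    root = {}
    for start in range(len(t)):
        node = root
        for letter in t[start:]:
            node = node.setdefault(letter, {})
            node.setdefault(POSITIONS, []).append(start + 1)
    return root


def suffix_tree_pattern_matching(p: str, t: str) -> list:
    """Thread p through the trie; return all 1-based match positions ([] if none)."""
    node = build_suffix_trie(t)
    for letter in p:
        if letter not in node:
            return []  # threading failed: pattern does not appear in text
        node = node[letter]
    return node.get(POSITIONS, [])


print(suffix_tree_pattern_matching("AT", "ATCATG"))  # [1, 4]
```

Once the trie is built, threading costs time proportional to the pattern length, independent of the text length, which is the property the slides highlight.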

49
Suffix Trees Example
50
Multiple Pattern Matching Summary
  • Keyword and suffix trees are used to find
    patterns in a text
  • Keyword trees
  • Build keyword tree of patterns, and thread text
    through it
  • Suffix trees
  • Build suffix tree of text, and thread patterns
    through it

51
Approximate vs. Exact Pattern Matching
  • So far all we've seen are exact pattern matching
    algorithms
  • Usually, because of mutations, it makes much more
    biological sense to find approximate pattern
    matches
  • Biologists often use fast heuristic approaches
    (rather than local alignment) to find approximate
    matches

52
Heuristic Similarity Searches
  • Genomes are huge: Smith-Waterman quadratic
    alignment algorithms are too slow
  • Alignment of two sequences usually has short
    identical or highly similar fragments
  • Many heuristic methods (e.g., FASTA) are based on
    the same idea of filtration
  • Find short exact matches, and use them as seeds
    for potential match extension
  • Filter out positions with no extendable matches

53
Dot Matrices
  • Dot matrices show similarities between two
    sequences
  • FASTA makes an implicit dot matrix from short
    exact matches, and tries to find long diagonals
    (allowing for some mismatches)

54
Dot Matrices (contd)
  • Identify diagonals above a threshold length
  • Diagonals in the dot matrix indicate exact
    substring matching
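
Finding diagonals above a threshold length does not require storing the dot matrix. The sketch below walks each diagonal and reports runs of consecutive matches; the function name `diagonal_runs` and the (start in a, start in b, length) output format are assumptions, and this is not FASTA's actual procedure.

```python
def diagonal_runs(a: str, b: str, min_len: int) -> list:
    """Exact-match runs of length >= min_len along diagonals of the dot matrix.

    Each run is reported as (1-based start in a, 1-based start in b, length).
    """
    runs = []
    for d in range(-(len(a) - 1), len(b)):  # diagonal d holds cells with j - i = d
        i = max(0, -d)
        j = i + d
        run = 0
        while i < len(a) and j < len(b):
            if a[i] == b[j]:
                run += 1
            else:
                if run >= min_len:
                    runs.append((i - run + 1, j - run + 1, run))
                run = 0
            i += 1
            j += 1
        if run >= min_len:  # run that reaches the end of the diagonal
            runs.append((i - run + 1, j - run + 1, run))
    return runs


print(diagonal_runs("GTAAGGTCC", "GTTAGGTCC", min_len=3))  # [(4, 4, 6)]
```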

55
Diagonals in Dot Matrices
  • Extend diagonals and try to link them together,
    allowing for minimal mismatches/indels
  • Linking diagonals reveals approximate matches
    over longer substrings

56
Approximate Pattern Matching Problem
  • Goal: Find all approximate occurrences of a
    pattern in a text
  • Input: A pattern $p = p_1 \ldots p_n$, text $t = t_1 \ldots t_m$, and
    k, the maximum number of mismatches
  • Output: All positions $1 \le i \le m - n + 1$ such
    that $t_i \ldots t_{i+n-1}$ and $p_1 \ldots p_n$ have at most k
    mismatches (i.e., Hamming distance between
    $t_i \ldots t_{i+n-1}$ and p is $\le k$)

57
Approximate Pattern Matching: A Brute-Force Algorithm
  • ApproximatePatternMatching(p, t, k)
  • n ← length of pattern p
  • m ← length of text t
  • for i ← 1 to m - n + 1
  •   dist ← 0
  •   for j ← 1 to n
  •     if $t_{i+j-1} \ne p_j$
  •       dist ← dist + 1
  •   if dist ≤ k
  •     output i
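
The same algorithm as a short Python sketch (the function name is an assumption; positions are 1-based as on the slides):

```python
def approximate_pattern_matching(p: str, t: str, k: int) -> list:
    """Return all 1-based positions where p matches t with at most k mismatches."""
    n, m = len(p), len(t)
    hits = []
    for i in range(m - n + 1):
        # Hamming distance between p and the window of t starting at i.
        dist = sum(1 for a, b in zip(p, t[i:i + n]) if a != b)
        if dist <= k:
            hits.append(i + 1)
    return hits


print(approximate_pattern_matching("GCAT", "GCTTGCAT", k=1))  # [1, 5]
```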

58
Approximate Pattern Matching: Running Time
  • The brute-force algorithm runs in O(nm).
  • Landau-Vishkin algorithm: O(km), where m is the
    length of the text
  • We can generalize the Approximate Pattern
    Matching Problem into a Query Matching
    Problem
  • We want to match substrings in a query to
    substrings in a text with at most k mismatches
  • Motivation: we want to see similarities to some
    gene, but we may not know which parts of the gene
    to look for

59
Query Matching Problem
  • Goal: Find all substrings of the query that
    approximately match the text
  • Input: Query $q = q_1 \ldots q_w$, text $t = t_1 \ldots t_m$,
    n (length of matching substrings), and
    k (maximum number of mismatches)
  • Output: All pairs of positions (i, j) such that
    the n-letter substring of q starting at i
    approximately matches the n-letter substring
    of t starting at j, with at most k mismatches

60
Approximate Pattern Matching vs Query Matching
61
Query Matching Main Idea
  • Approximately matching strings share some
    perfectly matching substrings.
  • Instead of searching for approximately matching
    strings (difficult) search for perfectly matching
    substrings (easy).

62
Filtration in Query Matching
  • We want all n-matches between a query and a text
    with up to k mismatches
  • Filter out positions we know do not match
    between text and query
  • Potential match detection: find all matches of
    l-tuples in query and text for some small l
  • Potential match verification: verify each
    potential match by extending it to the left and
    right, until (k + 1) mismatches are found

63
Filtration: Match Detection
  • If $x_1 \ldots x_n$ and $y_1 \ldots y_n$ match with at most k
    mismatches, they must share an l-tuple that is
    perfectly matched, with $l = \lfloor n/(k+1) \rfloor$
  • Break the string of length n into k + 1 parts,
    each of length $\lfloor n/(k+1) \rfloor$
  • k mismatches can affect at most k of these k + 1
    parts
  • At least one of these k + 1 parts is perfectly
    matched
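
Filtration for the Query Matching Problem can be sketched by combining detection and verification. The version below is a simplification made for illustration: instead of extending left and right until k + 1 mismatches are seen, it checks every length-n window pair that contains a shared l-tuple at the same offset. By the argument above this still finds every n-match with at most k mismatches. The names `hamming` and `query_matching` are assumptions.

```python
from collections import defaultdict


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def query_matching(q: str, t: str, n: int, k: int) -> list:
    """All 1-based (i, j) with Hamming(q[i..i+n-1], t[j..j+n-1]) <= k."""
    l = n // (k + 1)
    if l == 0:
        raise ValueError("need n >= k + 1 so that the seed length is positive")
    # Detection: index every l-tuple of the text by its start position.
    index = defaultdict(list)
    for b in range(len(t) - l + 1):
        index[t[b:b + l]].append(b)
    hits = set()
    for a in range(len(q) - l + 1):
        for b in index.get(q[a:a + l], ()):
            # Verification: try every window that contains the seed at offset d.
            for d in range(n - l + 1):
                i, j = a - d, b - d
                if i < 0 or j < 0 or i + n > len(q) or j + n > len(t):
                    continue
                if hamming(q[i:i + n], t[j:j + n]) <= k:
                    hits.add((i + 1, j + 1))
    return sorted(hits)


print((1, 3) in query_matching("ACGTTGCA", "TTACGATGCAAC", n=4, k=1))  # True
```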

64
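Filtration: Match Detection (cont'd)
  • Suppose k = 3. We would then have
    $l = n/(k+1) = n/4$
  • There are at most k mismatches in n, so at the
    very least there must be one out of the k + 1
    l-tuples without a mismatch

[Figure: a string of length n split into k + 1 = 4 parts covering positions
1 to l, l + 1 to 2l, 2l + 1 to 3l, and 3l + 1 to n; the parts are numbered 1, 2, ..., k, k + 1]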
65
Filtration: Match Verification
  • For each l-match we find, try to extend the
    match further to see if it is substantial

[Figure: query and text; extend the perfect match of length l until we find an
approximate match of length n with k mismatches]
66
Filtration: Example

                  k = 0   k = 1   k = 2   k = 3   k = 4   k = 5
l-tuple length    n       n/2     n/3     n/4     n/5     n/6

As k grows, shorter perfect matches are required and performance decreases
67
Local alignment is too slow
  • Quadratic local alignment is too slow while
    looking for similarities between long strings
    (e.g. the entire GenBank database)

70
Local alignment is too slow
  • Quadratic local alignment is too slow while
    looking for similarities between long strings
    (e.g. the entire GenBank database)
  • Guaranteed to find the optimal local alignment
  • Sets the standard for sensitivity

71
Local alignment is too slow
  • Quadratic local alignment is too slow while
    looking for similarities between long strings
    (e.g. the entire GenBank database)
  • BLAST: Basic Local Alignment Search Tool
  • Altschul, S., Gish, W., Miller, W., Myers, E., and
    Lipman, D.J.
  • Journal of Mol. Biol., 1990
  • Search sequence databases for local alignments to
    a query

72
Next
  • Pattern matching using BLAST
  • Some of the following slides are based on slides
    by the authors of our text.

73
BLAST
  • Great improvement in speed, with a modest
    decrease in sensitivity
  • Minimizes search space instead of exploring
    entire search space between two sequences
  • Finds short exact matches (seeds), only
    explores locally around these hits

74
What Similarity Reveals
  • BLASTing a new gene
  • Evolutionary relationship
  • Similarity between protein function
  • BLASTing a genome
  • Potential genes

75
BLAST algorithm
  • Keyword search of all words of length w from the
    query of length n in database of length m with
    score above threshold
  • w = 11 for DNA queries, w = 3 for proteins
  • Local alignment extension for each found keyword
  • Extend result until longest match above threshold
    is achieved
  • Running time: O(nm)

76
BLAST algorithm (cont'd)

Query: KRHRKVLRDNIQGITKPAIRRLARRGGVKRISGLIYEETRGVLKIFLENVIRD
Keyword: GVK
Neighborhood words (neighborhood score threshold T = 13):
GVK 18, GAK 16, GIK 16, GGK 14, GLK 13, GNK 12, GRK 11, GEK 11, GDK 11
Extension to a high-scoring pair (HSP):
Query  22  VLRDNIQGITKPAIRRLARRGGVKRISGLIYEETRGVLK  60
              DN  G     IR L    G K I  L  E  RG  K
Sbjct 226  IIKDNGRGFSGKQIRNLNYGIGLKVIADLV-EKHRGIIK 263
77
Original BLAST
  • Dictionary:
  • All words of length w
  • Alignment:
  • Ungapped extensions until score falls below some
    statistical threshold
  • Output:
  • All local alignments with score > threshold
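
The seed-and-extend idea can be sketched in Python under several stated assumptions: a toy scoring of +1 per match and -1 per mismatch, a drop-off rule (`x_drop`) that stops an extension once the running score falls too far below the best seen, and a fixed `min_score` in place of a statistical threshold. None of these values come from BLAST itself, and the function names are hypothetical.

```python
from collections import defaultdict

MATCH, MISMATCH = 1, -1  # illustrative scores, not BLAST's defaults


def extend_one_way(q, s, qi, si, step, x_drop):
    """Extend without gaps from (qi, si) in direction step (+1 or -1).

    Returns (best score gain, number of letters in the best extension).
    """
    score = best = 0
    length = best_len = 0
    while 0 <= qi < len(q) and 0 <= si < len(s):
        score += MATCH if q[qi] == s[si] else MISMATCH
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > x_drop:
            break  # score fell too far below the best seen so far
        qi += step
        si += step
    return best, best_len


def ungapped_hits(q, s, w, x_drop=2, min_score=6):
    """Seed with exact w-mers, extend without gaps, keep high-scoring pairs.

    Each hit is (0-based start in q, 0-based start in s, length, score).
    """
    words = defaultdict(list)  # dictionary of all words of length w in q
    for qi in range(len(q) - w + 1):
        words[q[qi:qi + w]].append(qi)
    hits = set()
    for si in range(len(s) - w + 1):
        for qi in words.get(s[si:si + w], ()):
            left, left_len = extend_one_way(q, s, qi - 1, si - 1, -1, x_drop)
            right, right_len = extend_one_way(q, s, qi + w, si + w, +1, x_drop)
            score = w * MATCH + left + right
            if score >= min_score:
                hits.add((qi - left_len, si - left_len,
                          w + left_len + right_len, score))
    return sorted(hits)


print(ungapped_hits("GTAAGGTCC", "GTTAGGTCC", w=4))  # [(0, 0, 9, 7)]
```

Several seeds on the same diagonal extend to the same segment pair, which is why the hits are collected in a set.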

78
Original BLAST: Example
  • w = 4
  • Exact keyword match of GGTC
  • Extend diagonals with mismatches until score is
    under 50
  • Output result:
  • GTAAGGTCC
  • GTTAGGTCC

[Figure: dot matrix; axis labels as extracted: A C G A A G T A A G G T C C A G T
and C T G A T C C T G G A T T G C G A]
From lectures by Serafim Batzoglou (Stanford)
79
Gapped BLAST: Example
  • Original BLAST exact keyword search, THEN:
  • Extend with gaps around ends of exact match until
    score < threshold
  • Output result:
  • GTAAGGTCCAGT
  • GTTAGGTC-AGT

[Figure: dot matrix; axis labels as extracted: A C G A A G T A A G G T C C A G T
and C T G A T C C T G G A T T G C G A]
From lectures by Serafim Batzoglou (Stanford)
80
Assessing sequence similarity
  • Need to know how strong an alignment can be
    expected from chance alone
  • Chance relates to comparison of sequences that
    are generated randomly based upon a certain
    sequence model
  • Sequence models may take into account
  • GC content
  • Junk DNA
  • Codon bias
  • Etc.

81
BLAST: Segment Score
  • BLAST uses scoring matrices ($\delta$) to improve on
    efficiency of match detection
  • Some proteins may have very different amino acid
    sequences, but are still similar
  • For any two l-mers $x_1 \ldots x_l$ and $y_1 \ldots y_l$:
  • Segment pair: pair of l-mers, one from each
    sequence
  • Segment score: $\sum_{i=1}^{l} \delta(x_i, y_i)$
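
The segment score is a plain sum over aligned positions. The sketch below uses a toy DNA scoring matrix (+5 for a match, -4 for a mismatch) chosen only for illustration; it is not a real protein matrix, and the name `segment_score` is an assumption.

```python
def segment_score(x: str, y: str, delta: dict) -> int:
    """Sum of delta(x_i, y_i) over the positions of two equal-length segments."""
    if len(x) != len(y):
        raise ValueError("segments must have the same length")
    return sum(delta[a, b] for a, b in zip(x, y))


bases = "ACGT"
# Toy scoring matrix (assumption): +5 on the diagonal, -4 elsewhere.
delta = {(a, b): 5 if a == b else -4 for a in bases for b in bases}
print(segment_score("GTAAGGTCC", "GTTAGGTCC", delta))  # 8 matches, 1 mismatch: 36
```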

82
BLAST: Locally Maximal Segment Pairs
  • A segment pair is maximal if it has the best
    score over all segment pairs
  • A segment pair is locally maximal if its score
    can't be improved by extending or shortening
  • Statistically significant locally maximal segment
    pairs are of biological interest
  • BLAST finds all locally maximal segment pairs
    with scores above some threshold
  • A sufficiently high threshold will filter out
    some statistically insignificant matches

83
BLAST: Statistics
  • Threshold: Altschul-Dembo-Karlin statistics
  • Identifies smallest segment score that is
    unlikely to happen by chance
  • The number of matches above $\theta$ has mean
    $E(\theta) = K m n e^{-\lambda \theta}$; K is a
    constant, m and n are the lengths of the two
    compared sequences
  • Parameter $\lambda$ is the positive root of
  • $\sum_{x, y \in A} p_x p_y e^{\lambda \delta(x, y)} = 1$, where $p_x$ and $p_y$ are the
    frequencies of amino acids x and y, and A is the
    twenty-letter amino acid alphabet
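
The positive root can be found numerically. The sketch below uses bisection on a toy model chosen for illustration: a four-letter DNA alphabet with uniform frequencies and scores +1 for a match and -1 for a mismatch, not the twenty-letter amino acid alphabet. For this model the equation reduces to $0.25 e^{\lambda} + 0.75 e^{-\lambda} = 1$, whose positive root is $\ln 3$. The name `solve_lambda` is an assumption.

```python
import math


def solve_lambda(freqs: dict, delta: dict) -> float:
    """Positive root of sum_{x,y} p_x p_y exp(lambda * delta(x, y)) = 1, by bisection."""
    pairs = [(freqs[x] * freqs[y], delta[x, y]) for x in freqs for y in freqs]
    expected = sum(p * d for p, d in pairs)
    if expected >= 0 or max(d for _, d in pairs) <= 0:
        raise ValueError("need a negative expected score and some positive score")

    def f(lam: float) -> float:
        return sum(p * math.exp(lam * d) for p, d in pairs) - 1.0

    # f(0) = 0, f is negative just right of 0 and grows without bound,
    # so bracket the positive root between lo (f < 0) and hi (f > 0).
    hi = 1.0
    while f(hi) <= 0:
        hi *= 2
    lo = hi
    while f(lo) >= 0:
        lo /= 2
    for _ in range(100):
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


bases = "ACGT"
freqs = {b: 0.25 for b in bases}
delta = {(x, y): 1 if x == y else -1 for x in bases for y in bases}
print(solve_lambda(freqs, delta))  # close to ln 3
```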