CSE 5290: Algorithms for Bioinformatics, Fall 2011

- Suprakash Datta
- datta_at_cse.yorku.ca
- Office: CSEB 3043
- Phone: 416-736-2100 ext 77875
- Course page: http://www.cse.yorku.ca/course/5290

Last time

- Graph algorithms for sequence assembly
- Next: Pattern matching
- The following slides are based on slides by the authors of our text.

Genomic Repeats

- Example of repeats:
- ATGGTCTAGGTCCTAGTGGTC
- Motivation to find them:
- Genomic rearrangements are often associated with repeats
- Trace evolutionary secrets
- Many tumors are characterized by an explosion of repeats

Genomic Repeats

- The problem is often more difficult (the repeated copies are not exact):
- ATGGTCTAGGACCTAGTGTTC
- Motivation to find them:
- Genomic rearrangements are often associated with repeats
- Trace evolutionary secrets
- Many tumors are characterized by an explosion of repeats

l-mer Repeats

- Long repeats are difficult to find
- Short repeats are easy to find (e.g., hashing)
- Simple approach to finding long repeats:
- Find exact repeats of short l-mers (l is usually 10 to 13)
- Use l-mer repeats to potentially extend into longer, maximal repeats

l-mer Repeats (contd)

- There are typically many locations where an l-mer is repeated:
- GCTTACAGATTCAGTCTTACAGATGGT
- The 4-mer TTAC starts at locations 3 and 17

Extending l-mer Repeats

- GCTTACAGATTCAGTCTTACAGATGGT
- Extend these 4-mer matches:
- GCTTACAGATTCAGTCTTACAGATGGT
- Maximal repeat: CTTACAGAT (both copies of TTACAGAT are preceded by C, so the repeat extends one base further to the left)
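The extension step can be sketched as follows. This is a minimal illustration, not the slide authors' code: the function name `extend_repeat` and the use of 0-based positions are assumptions made here. Starting from the two TTAC occurrences (1-based positions 3 and 17, i.e. 0-based 2 and 16), it grows the match in both directions while the flanking letters agree.

```python
def extend_repeat(genome: str, i: int, j: int, l: int) -> tuple[int, int, int]:
    """Extend an exact l-mer match at 0-based positions i < j in both directions.

    Returns (start_i, start_j, length) of the maximal repeat containing the seed.
    """
    left = 0
    # Grow to the left while the preceding letters of both copies agree.
    while i - left - 1 >= 0 and genome[i - left - 1] == genome[j - left - 1]:
        left += 1
    right = 0
    # Grow to the right while the following letters of both copies agree.
    while j + l + right < len(genome) and genome[i + l + right] == genome[j + l + right]:
        right += 1
    return i - left, j - left, l + left + right


genome = "GCTTACAGATTCAGTCTTACAGATGGT"
start_i, start_j, length = extend_repeat(genome, 2, 16, 4)
print(genome[start_i:start_i + length])  # CTTACAGAT
```

On this string the seed gains one letter on the left (C) and four on the right (AGAT), so the two copies start at 1-based positions 2 and 16.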

Maximal Repeats

- To find maximal repeats in this way, we need ALL start locations of all l-mers in the genome
- Hashing lets us find repeats quickly in this manner

Hashing: What is it?

- What does hashing do?
- For different data, generate a unique integer
- Store data in an array at the unique integer index generated from the data
- Hashing is a very efficient way to store and retrieve data

Hashing: Definitions

- Hash table: array used in hashing
- Records: data stored in a hash table
- Keys: identify sets of records
- Buckets
- Hash function: uses a key to generate an index (bucket) to insert at in the hash table
- Collision: when more than one record is mapped to the same index in the hash table

When does hashing work well?

- When there are few collisions!
- Good hash functions minimize collisions
- Problems:
- Knowing what function to use
- Knowing what buckets to use

Hashing: Example

- Where do the animals eat?
- Records: each animal
- Keys: where each animal eats

Hashing DNA sequences

- Each l-mer can be translated into a binary string (A, T, C, G can be represented as 00, 01, 10, 11)
- After assigning a unique integer per l-mer it is easy to get all start locations of each l-mer in a genome
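A minimal sketch of this encoding, using the two-bit codes listed above (A = 00, T = 01, C = 10, G = 11). The names `CODE` and `lmer_to_int` are illustrative choices, not taken from the slides.

```python
# Two-bit code per nucleotide, as on the slide.
CODE = {"A": 0b00, "T": 0b01, "C": 0b10, "G": 0b11}


def lmer_to_int(lmer: str) -> int:
    """Read an l-mer as a base-4 number: shift left two bits per letter."""
    value = 0
    for base in lmer:
        value = (value << 2) | CODE[base]
    return value


print(lmer_to_int("TTAC"))  # 01 01 00 10 in binary = 82
```

For a fixed l, distinct l-mers always receive distinct integers in the range 0 to 4^l - 1, so the integer can serve directly as an array index.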

Hashing: Maximal Repeats

- To find repeats in a genome:
- For all l-mers in the genome, note the start position and the sequence
- Generate a hash table index for each unique l-mer sequence
- In each index of the hash table, store all genome start locations of the l-mer which generated that index
- Extend l-mer repeats to maximal repeats
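The first three steps can be sketched with Python's built-in `dict`, which is itself a hash table, so the l-mer string can be used as the key without computing an index by hand. The function name `index_lmers` and the choice to report 1-based positions are assumptions made for this example.

```python
from collections import defaultdict


def index_lmers(genome: str, l: int) -> dict[str, list[int]]:
    """Map every l-mer that occurs more than once to its 1-based start positions."""
    positions = defaultdict(list)
    for start in range(len(genome) - l + 1):
        positions[genome[start:start + l]].append(start + 1)
    # Keep only the l-mers that are actually repeated.
    return {lmer: starts for lmer, starts in positions.items() if len(starts) > 1}


repeats = index_lmers("GCTTACAGATTCAGTCTTACAGATGGT", 4)
print(repeats["TTAC"])  # [3, 17]
```

Each repeated l-mer found this way is a seed for the final step, extending to a maximal repeat.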

Hashing: Collisions

- Dealing with collisions:
- Chain all start locations of l-mers (linked list)

Hashing: Summary

- When finding genomic repeats from l-mers:
- Generate a hash table index for each l-mer sequence
- In each index, store all genome start locations of the l-mer which generated that index
- Extend l-mer repeats to maximal repeats

Pattern Matching

- What if, instead of finding repeats in a genome, we want to find all sequences in a database that contain a given pattern?
- This leads us to a different problem, the Pattern Matching Problem

Pattern Matching Problem

- Goal: Find all occurrences of a pattern in a text
- Input: Pattern $p = p_1 \ldots p_n$ and text $t = t_1 \ldots t_m$
- Output: All positions $1 \le i \le m - n + 1$ such that the n-letter substring of t starting at i matches p
- Motivation: Searching a database for a known pattern

Exact Pattern Matching: A Brute-Force Algorithm

- PatternMatching(p, t)
  - n ← length of pattern p
  - m ← length of text t
  - for i ← 1 to (m − n + 1)
    - if $t_i \ldots t_{i+n-1} = p$
      - output i
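A direct Python rendering of the brute-force pseudocode, as a sketch. The function name `pattern_matching` is chosen here; positions are reported 1-based to agree with the slides.

```python
def pattern_matching(p: str, t: str) -> list[int]:
    """Return all 1-based positions where pattern p occurs in text t."""
    n, m = len(p), len(t)
    # Slide a window of length n over the text and compare it with p.
    return [i + 1 for i in range(m - n + 1) if t[i:i + n] == p]


print(pattern_matching("GCAT", "CGCATC"))  # [2]
```

Each window comparison costs up to n letter comparisons and there are m - n + 1 windows, which gives the O(nm) worst case.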

Exact Pattern Matching: An Example

- PatternMatching algorithm for
- Pattern: GCAT
- Text: CGCATC
- i = 1: GCAT vs. CGCA, no match
- i = 2: GCAT vs. GCAT, match (output 2)
- i = 3: GCAT vs. CATC, no match

Exact Pattern Matching: Running Time

- PatternMatching runtime: O(nm)
- The average case is often O(m)
- Rarely will there be close to n comparisons in line 4 (the substring comparison)
- Better solution: suffix trees
- Can solve the problem in O(m) time
- Conceptually related to keyword trees

Keyword Trees: Example

- Keyword tree, built by adding one keyword at a time:
- Apple
- Apropos
- Banana
- Bandana
- Orange

Keyword Trees: Properties

- Stores a set of keywords in a rooted labeled tree
- Each edge is labeled with a letter from an alphabet
- Any two edges coming out of the same vertex have distinct labels
- Every keyword stored can be spelled on a path from root to some leaf
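These properties can be illustrated with a small sketch in which each vertex is a Python `dict` from edge letter to child vertex, so distinct labels per vertex hold automatically. The names `build_keyword_tree` and `is_keyword` are illustrative, and the sketch assumes that no keyword is a prefix of another (true for the five example words), so every keyword ends at a leaf.

```python
def build_keyword_tree(keywords: list[str]) -> dict:
    """Build a keyword tree; each vertex is a dict {letter: child vertex}."""
    root: dict = {}
    for word in keywords:
        node = root
        for letter in word:
            # Reuse the edge if it exists, otherwise create a new child.
            node = node.setdefault(letter, {})
    return root


def is_keyword(tree: dict, word: str) -> bool:
    """True if word spells a path from the root to a leaf (an empty dict)."""
    node = tree
    for letter in word:
        if letter not in node:
            return False
        node = node[letter]
    return not node


tree = build_keyword_tree(["apple", "apropos", "banana", "bandana", "orange"])
print(is_keyword(tree, "apple"), is_keyword(tree, "appeal"))  # True False
```

Building the tree touches each letter of each keyword once, which matches the O(N) construction time for total keyword length N.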

Keyword Trees: Threading (contd)

- Thread "appeal"
- Threading follows a, p, p and then stops: no edge labeled e leaves that vertex in the tree for Apple, Apropos, Banana, Bandana, Orange

Keyword Trees: Threading (contd)

- Thread "apple"
- Threading follows a, p, p, l, e and reaches a leaf

Multiple Pattern Matching Problem

- Goal: Given a set of patterns and a text, find all occurrences of any of the patterns in the text
- Input: k patterns $p_1, \ldots, p_k$, and text $t = t_1 \ldots t_m$
- Output: Positions $1 \le i \le m$ where the substring of t starting at i matches $p_j$ for some $1 \le j \le k$
- Motivation: Searching a database for known multiple patterns

Multiple Pattern Matching: Straightforward Approach

- Can solve as k Pattern Matching Problems
- Runtime: O(kmn), using the PatternMatching algorithm k times
- m: length of the text
- n: average length of the pattern

Multiple Pattern Matching: Keyword Tree Approach

- Or, we could use keyword trees:
- Build keyword tree in O(N) time; N is the total length of all patterns
- With naive threading: O(N + nm)
- Aho-Corasick algorithm: O(N + m)

Keyword Trees: Threading

- To match patterns in a text using a keyword tree:
- Build keyword tree of patterns
- Thread the text through the keyword tree

Keyword Trees: Threading (contd)

- Threading is complete when we reach a leaf in the keyword tree
- When threading is complete, we've found a pattern in the text
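Naive threading can be sketched as below: the text is threaded through the tree once from every start position. This is the O(N + nm) variant, not Aho-Corasick. The function names are chosen for this example, vertices are plain dicts, and the sketch assumes no pattern is a prefix of another, so reaching a leaf (an empty dict) means a complete pattern.

```python
def build_keyword_tree(patterns: list[str]) -> dict:
    """Keyword tree as nested dicts {letter: child}; a leaf is an empty dict."""
    root: dict = {}
    for pattern in patterns:
        node = root
        for letter in pattern:
            node = node.setdefault(letter, {})
    return root


def thread_text(patterns: list[str], text: str) -> list[tuple[int, str]]:
    """Return (1-based start, pattern) for every pattern occurrence in text."""
    tree = build_keyword_tree(patterns)
    matches = []
    for start in range(len(text)):
        node = tree
        for pos in range(start, len(text)):
            node = node.get(text[pos])
            if node is None:   # no edge for this letter: threading fails
                break
            if not node:       # reached a leaf: threading is complete
                matches.append((start + 1, text[start:pos + 1]))
                break
    return matches


words = ["apple", "apropos", "banana", "bandana", "orange"]
print(thread_text(words, "bandanapple"))  # [(1, 'bandana'), (7, 'apple')]
```

Aho-Corasick avoids restarting from the root at every text position, which is where the improvement from O(N + nm) to O(N + m) comes from.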

Suffix Trees = Collapsed Keyword Trees

- Similar to keyword trees, except edges that form paths are collapsed
- Each edge is labeled with a substring of a text
- All internal vertices have at least two outgoing edges
- Leaves are labeled by the index of the pattern

Suffix Tree of a Text

- The suffix tree of a text is constructed from all its suffixes
- Suffixes of ATCATG: ATCATG, TCATG, CATG, ATG, TG, G
- [Figure: the keyword tree of these suffixes and the corresponding suffix tree]
- How much time does it take? Building the keyword tree takes time linear in the total size of all suffixes, i.e., quadratic in the length of the text

Suffix Trees: Advantages

- The suffix tree of a text is constructed from all its suffixes
- Suffix trees build faster than keyword trees
- Suffixes of ATCATG: ATCATG, TCATG, CATG, ATG, TG, G
- Keyword tree: quadratic
- Suffix tree: linear (Weiner suffix tree algorithm)

Use of Suffix Trees

- Suffix trees hold all suffixes of a text
- e.g., ATCGC: ATCGC, TCGC, CGC, GC, C
- Builds in O(m) time for text of length m
- To find any pattern of length n in a text:
- Build suffix tree for text
- Thread the pattern through the suffix tree
- Can find pattern in text in O(n) time!
- O(n + m) time for the Pattern Matching Problem
- Build suffix tree and look up pattern

Pattern Matching with Suffix Trees

- SuffixTreePatternMatching(p, t)
  - Build suffix tree for text t
  - Thread pattern p through suffix tree
  - if threading is complete
    - output positions of all p-matching leaves in the tree
  - else
    - output "Pattern does not appear in text"
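The threading idea can be tried out with a simplified stand-in for a suffix tree. The sketch below builds an uncollapsed suffix trie (the keyword tree of all suffixes), which takes quadratic time and space, not the linear time of Weiner's construction. As a shortcut, each vertex stores the start positions of all suffixes passing through it, which is the same set as the leaves below it. The class and function names are assumptions made for this example.

```python
class Node:
    def __init__(self) -> None:
        self.children: dict[str, "Node"] = {}
        # 1-based starts of every suffix whose path passes through this vertex.
        self.starts: list[int] = []


def build_suffix_trie(text: str) -> Node:
    """Insert every suffix of text letter by letter (quadratic, uncollapsed)."""
    root = Node()
    for start in range(len(text)):
        node = root
        for letter in text[start:]:
            node = node.children.setdefault(letter, Node())
            node.starts.append(start + 1)
    return root


def suffix_trie_pattern_matching(p: str, t: str) -> list[int]:
    """Thread p through the suffix trie of t; return all 1-based match positions."""
    node = build_suffix_trie(t)
    for letter in p:
        node = node.children.get(letter)
        if node is None:
            return []          # pattern does not appear in text
    return node.starts


print(suffix_trie_pattern_matching("AT", "ATCATG"))  # [1, 4]
```

Once the structure is built, threading a pattern of length n costs O(n) steps regardless of the text length, which is the property the slides rely on.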

Suffix Trees Example

Multiple Pattern Matching: Summary

- Keyword and suffix trees are used to find patterns in a text
- Keyword trees:
- Build keyword tree of patterns, and thread text through it
- Suffix trees:
- Build suffix tree of text, and thread patterns through it

Approximate vs. Exact Pattern Matching

- So far all we've seen are exact pattern matching algorithms
- Usually, because of mutations, it makes much more biological sense to find approximate pattern matches
- Biologists often use fast heuristic approaches (rather than local alignment) to find approximate matches

Heuristic Similarity Searches

- Genomes are huge: Smith-Waterman quadratic alignment algorithms are too slow
- Alignment of two sequences usually has short identical or highly similar fragments
- Many heuristic methods (e.g., FASTA) are based on the same idea of filtration:
- Find short exact matches, and use them as seeds for potential match extension
- Filter out positions with no extendable matches

Dot Matrices

- Dot matrices show similarities between two sequences
- FASTA makes an implicit dot matrix from short exact matches, and tries to find long diagonals (allowing for some mismatches)

Dot Matrices (contd)

- Identify diagonals above a threshold length
- Diagonals in the dot matrix indicate exact substring matching
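As a rough illustration of finding diagonals above a threshold length, the sketch below walks every diagonal of the (implicit) dot matrix and reports maximal runs of exact matches. It compares all letter pairs directly instead of seeding from short exact matches as FASTA does, so it is a simplification for small inputs. The function name `diagonal_runs` is an assumption.

```python
def diagonal_runs(s: str, t: str, min_len: int) -> list[tuple[int, int, int]]:
    """Return (i, j, length), 1-based, for each maximal exact diagonal run
    of at least min_len letters between s and t."""
    runs = []
    for d in range(-(len(s) - 1), len(t)):   # d = j - i identifies a diagonal
        i = max(0, -d)
        j = i + d
        run = 0
        while i < len(s) and j < len(t):
            if s[i] == t[j]:
                run += 1
            else:
                if run >= min_len:
                    runs.append((i - run + 1, j - run + 1, run))
                run = 0
            i += 1
            j += 1
        if run >= min_len:                   # run that reaches the matrix border
            runs.append((i - run + 1, j - run + 1, run))
    return runs


print(diagonal_runs("GTAAGGTCC", "GTTAGGTCC", 4))  # [(4, 4, 6)]
```

The single run found here is AGGTCC, starting at position 4 of both strings; the shorter GT match at the start falls below the threshold.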

Diagonals in Dot Matrices

- Extend diagonals and try to link them together, allowing for minimal mismatches/indels
- Linking diagonals reveals approximate matches over longer substrings

Approximate Pattern Matching Problem

- Goal: Find all approximate occurrences of a pattern in a text
- Input: A pattern $p = p_1 \ldots p_n$, text $t = t_1 \ldots t_m$, and k, the maximum number of mismatches
- Output: All positions $1 \le i \le m - n + 1$ such that $t_i \ldots t_{i+n-1}$ and $p_1 \ldots p_n$ have at most k mismatches (i.e., the Hamming distance between $t_i \ldots t_{i+n-1}$ and $p$ is $\le k$)

Approximate Pattern Matching: A Brute-Force Algorithm

- ApproximatePatternMatching(p, t, k)
  - n ← length of pattern p
  - m ← length of text t
  - for i ← 1 to m − n + 1
    - dist ← 0
    - for j ← 1 to n
      - if $t_{i+j-1} \ne p_j$
        - dist ← dist + 1
    - if dist ≤ k
      - output i
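The same brute-force procedure as a runnable sketch. The function name is chosen here, and positions are 1-based as in the slides. The example text is the inexact-repeat string from the Genomic Repeats slide.

```python
def approximate_pattern_matching(p: str, t: str, k: int) -> list[int]:
    """Return all 1-based positions i where the n-letter window of t
    starting at i differs from p in at most k positions."""
    n, m = len(p), len(t)
    hits = []
    for i in range(m - n + 1):
        # Hamming distance between the pattern and the current window.
        dist = sum(1 for a, b in zip(p, t[i:i + n]) if a != b)
        if dist <= k:
            hits.append(i + 1)
    return hits


text = "ATGGTCTAGGACCTAGTGTTC"
print(approximate_pattern_matching("GGTC", text, 0))  # [3]
print(approximate_pattern_matching("GGTC", text, 1))  # [3, 9, 18]
```

With k = 1 the search also reports GGAC at position 9 and GTTC at position 18, the mutated copies that exact matching misses.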

Approximate Pattern Matching: Running Time

- That algorithm runs in O(nm).
- Landau-Vishkin algorithm: O(km)
- We can generalize the Approximate Pattern Matching Problem into a Query Matching Problem:
- We want to match substrings in a query to substrings in a text with at most k mismatches
- Motivation: we want to see similarities to some gene, but we may not know which parts of the gene to look for

Query Matching Problem

- Goal: Find all substrings of the query that approximately match the text
- Input: Query $q = q_1 \ldots q_w$, text $t = t_1 \ldots t_m$, n (length of matching substrings), k (maximum number of mismatches)
- Output: All pairs of positions (i, j) such that the n-letter substring of q starting at i approximately matches the n-letter substring of t starting at j, with at most k mismatches

Approximate Pattern Matching vs Query Matching

Query Matching: Main Idea

- Approximately matching strings share some perfectly matching substrings.
- Instead of searching for approximately matching strings (difficult), search for perfectly matching substrings (easy).

Filtration in Query Matching

- We want all n-matches between a query and a text with up to k mismatches
- Filter out positions we know do not match between text and query
- Potential match detection: find all matches of l-tuples in query and text for some small l
- Potential match verification: verify each potential match by extending it to the left and right, until (k + 1) mismatches are found

Filtration: Match Detection

- If $x_1 \ldots x_n$ and $y_1 \ldots y_n$ match with at most k mismatches, they must share an l-tuple that is perfectly matched, with $l = \lfloor n/(k+1) \rfloor$
- Break string of length n into k + 1 parts, each of length $\lfloor n/(k+1) \rfloor$
- k mismatches can affect at most k of these k + 1 parts
- At least one of these k + 1 parts is perfectly matched
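A sketch of filtration for the Query Matching Problem, built on the pigeonhole argument above. Detection indexes all l-tuples of the text in a dict and looks up each l-tuple of the query. Verification here is simplified: instead of extending left and right, it tries each of the k + 1 block offsets the seed could occupy inside an n-letter window and checks the Hamming distance of that window. The function names are assumptions, and positions are reported 1-based.

```python
from collections import defaultdict


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def query_matching(q: str, t: str, n: int, k: int) -> list[tuple[int, int]]:
    """All 1-based (i, j) such that q[i..i+n-1] and t[j..j+n-1] have <= k mismatches."""
    l = n // (k + 1)
    if l == 0:
        raise ValueError("n must be at least k + 1")
    # Potential match detection: index every l-tuple of the text.
    text_index = defaultdict(list)
    for b in range(len(t) - l + 1):
        text_index[t[b:b + l]].append(b)
    results = set()
    for a in range(len(q) - l + 1):
        for b in text_index.get(q[a:a + l], []):
            # The perfectly matched part may be any of the k + 1 blocks.
            for c in range(k + 1):
                i, j = a - c * l, b - c * l
                if i < 0 or j < 0 or i + n > len(q) or j + n > len(t):
                    continue
                # Potential match verification.
                if hamming(q[i:i + n], t[j:j + n]) <= k:
                    results.add((i + 1, j + 1))
    return sorted(results)


print(query_matching("ACGTACGT", "TTACGAACGTT", 6, 1))
```

Only window pairs that share an exact l-tuple in a matching block are ever verified, which is what makes filtration faster than comparing every pair of windows.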

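Filtration: Match Detection (contd)

- Suppose k = 3. We would then have $l = n/(k+1) = n/4$
- There are at most k mismatches in n, so at the very least there must be one out of the k + 1 l-tuples without a mismatch
- [Figure: the string split into k + 1 = 4 parts covering positions 1 to l, l + 1 to 2l, 2l + 1 to 3l, and 3l + 1 to n, labeled 1, 2, k, k + 1]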

Filtration: Match Verification

- For each l-match we find, try to extend the match further to see if it is substantial
- Extend the perfect match of length l until we find an approximate match of length n with k mismatches
- [Figure: an l-match between query and text being extended]

Filtration: Example

k               0    1    2    3    4    5
l-tuple length  n    n/2  n/3  n/4  n/5  n/6

- As k grows: shorter perfect matches required
- Performance decreases

Local alignment is too slow

- Quadratic local alignment is too slow while looking for similarities between long strings (e.g. the entire GenBank database)
- Guaranteed to find the optimal local alignment
- Sets the standard for sensitivity
- Basic Local Alignment Search Tool (BLAST)
- Altschul, S., Gish, W., Miller, W., Myers, E. & Lipman, D.J.
- Journal of Mol. Biol., 1990
- Search sequence databases for local alignments to a query

Next

- Pattern matching using BLAST
- Some of the following slides are based on slides by the authors of our text.

BLAST

- Great improvement in speed, with a modest decrease in sensitivity
- Minimizes search space instead of exploring entire search space between two sequences
- Finds short exact matches (seeds), only explores locally around these hits

What Similarity Reveals

- BLASTing a new gene
- Evolutionary relationship
- Similarity between protein function
- BLASTing a genome
- Potential genes

BLAST algorithm

- Keyword search of all words of length w from the query of length n in database of length m with score above threshold
- w = 11 for DNA queries, w = 3 for proteins
- Local alignment extension for each found keyword
- Extend result until longest match above threshold is achieved
- Running time O(nm)

BLAST algorithm (contd)

- Query: KRHRKVLRDNIQGITKPAIRRLARRGGVKRISGLIYEETRGVLKIFLENVIRD
- Keyword: GVK
- Neighborhood words: GVK 18, GAK 16, GIK 16, GGK 14, GLK 13, GNK 12, GRK 11, GEK 11, GDK 11
- Neighborhood score threshold: T = 13
- Extension:
- Query 22 VLRDNIQGITKPAIRRLARRGGVKRISGLIYEETRGVLK 60
- Sbjct 226 IIKDNGRGFSGKQIRNLNYGIGLKVIADLV-EKHRGIIK 263
- Identical residues in the alignment: DN, G, IR, L, G, K, I, L, E, RG, K
- High-scoring Pair (HSP)

Original BLAST

- Dictionary:
- All words of length w
- Alignment:
- Ungapped extensions until score falls below some statistical threshold
- Output:
- All local alignments with score > threshold

Original BLAST: Example

- [Figure: dot matrix with axis labels A C G A A G T A A G G T C C A G T and C T G A T C C T G G A T T G C G A]
- w = 4
- Exact keyword match of GGTC
- Extend diagonals with mismatches until score is under 50
- Output result:
- GTAAGGTCC
- GTTAGGTCC

From lectures by Serafim Batzoglou (Stanford)

Gapped BLAST: Example

- [Figure: dot matrix with axis labels A C G A A G T A A G G T C C A G T and C T G A T C C T G G A T T G C G A]
- Original BLAST exact keyword search, THEN:
- Extend with gaps around ends of exact match until score < threshold
- Output result:
- GTAAGGTCCAGT
- GTTAGGTC-AGT

From lectures by Serafim Batzoglou (Stanford)

Assessing sequence similarity

- Need to know how strong an alignment can be expected from chance alone
- Chance relates to comparison of sequences that are generated randomly based upon a certain sequence model
- Sequence models may take into account:
- GC content
- Junk DNA
- Codon bias
- Etc.

BLAST: Segment Score

- BLAST uses scoring matrices ($\delta$) to improve on efficiency of match detection
- Some proteins may have very different amino acid sequences, but are still similar
- For any two l-mers $x_1 \ldots x_l$ and $y_1 \ldots y_l$:
- Segment pair: pair of l-mers, one from each sequence
- Segment score: $\sum_{i=1}^{l} \delta(x_i, y_i)$
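The segment score is a position-by-position sum of scoring-matrix entries, which can be sketched as below. The scoring function `toy_delta` (+5 for a match, -4 for a mismatch) is a hypothetical stand-in chosen for this example; it is not a real substitution matrix such as those BLAST uses for proteins.

```python
from typing import Callable


def segment_score(x: str, y: str, delta: Callable[[str, str], int]) -> int:
    """Sum delta(x_i, y_i) over the aligned positions of two l-mers."""
    if len(x) != len(y):
        raise ValueError("a segment pair consists of two l-mers of equal length")
    return sum(delta(a, b) for a, b in zip(x, y))


def toy_delta(a: str, b: str) -> int:
    """Hypothetical scoring function: +5 for identical letters, -4 otherwise."""
    return 5 if a == b else -4


print(segment_score("GTAAGGTCC", "GTTAGGTCC", toy_delta))  # 8 * 5 - 4 = 36
```

Passing the scoring function as a parameter keeps the sum independent of the alphabet, so a table lookup for amino acid pairs could be dropped in without changing `segment_score`.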

BLAST: Locally Maximal Segment Pairs

- A segment pair is maximal if it has the best score over all segment pairs
- A segment pair is locally maximal if its score can't be improved by extending or shortening
- Statistically significant locally maximal segment pairs are of biological interest
- BLAST finds all locally maximal segment pairs with scores above some threshold
- A sufficiently high threshold will filter out some statistically insignificant matches

BLAST: Statistics

- Threshold: Altschul-Dembo-Karlin statistics
- Identifies smallest segment score that is unlikely to happen by chance
- The number of matches with score above $\theta$ has mean $E(\theta) = K m n e^{-\lambda\theta}$; K is a constant, m and n are the lengths of the two compared sequences
- Parameter $\lambda$ is the positive root of:
- $\sum_{x,y \in A} p_x p_y e^{\lambda \delta(x,y)} = 1$, where $p_x$ and $p_y$ are frequencies of amino acids x and y, and A is the twenty-letter amino acid alphabet
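To make the two formulas concrete, the sketch below finds the positive root lambda by bisection and then evaluates the expected number of matches. Everything specific in it is an assumption for illustration: a four-letter DNA alphabet with uniform frequencies instead of the twenty amino acids, a +1/-1 scoring function, and a made-up value for the constant K. The bisection relies on the expected score per letter pair being negative and at least one score being positive, so that a positive root exists.

```python
import math
from typing import Callable


def solve_lambda(freqs: dict[str, float], delta: Callable[[str, str], float]) -> float:
    """Positive root of sum_{x,y} p_x p_y exp(lambda * delta(x, y)) = 1."""
    def f(lam: float) -> float:
        return sum(px * py * math.exp(lam * delta(x, y))
                   for x, px in freqs.items()
                   for y, py in freqs.items()) - 1.0

    hi = 1.0
    while f(hi) <= 0:              # move right until the sum exceeds 1
        hi *= 2
        if hi > 1e6:
            raise ValueError("no positive root: is any score positive?")
    lo = hi
    while f(lo) >= 0:              # move left into the region where the sum is below 1
        lo /= 2
    for _ in range(200):           # bisection between lo and hi
        mid = (lo + hi) / 2
        if f(mid) > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2


def expected_matches(K: float, m: int, n: int, lam: float, theta: float) -> float:
    """E(theta) = K * m * n * exp(-lambda * theta)."""
    return K * m * n * math.exp(-lam * theta)


uniform = {base: 0.25 for base in "ACGT"}
lam = solve_lambda(uniform, lambda x, y: 1 if x == y else -1)
print(lam)                                         # close to ln 3
print(expected_matches(0.1, 100, 1000, lam, 10))   # K = 0.1 is a made-up constant
```

For this toy model the equation reduces to $\tfrac{1}{4}e^{\lambda} + \tfrac{3}{4}e^{-\lambda} = 1$, whose positive root is $\lambda = \ln 3$, which gives a closed-form check on the numerical result.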