Combinatorial Pattern Matching - PowerPoint PPT Presentation

About This Presentation
Title:

Combinatorial Pattern Matching

Description:

This leads us to a different problem, the Pattern Matching Problem ... When threading is 'complete,' we've found a pattern in the text ... – PowerPoint PPT presentation

Number of Views:105
Avg rating:3.0/5.0
Slides: 103
Provided by: mch116
Category:

less

Transcript and Presenter's Notes

Title: Combinatorial Pattern Matching


1
Combinatorial Pattern Matching
2
Outline
  • Hash Tables
  • Repeat Finding
  • Exact Pattern Matching
  • Keyword Trees
  • Suffix Trees
  • Heuristic Similarity Search Algorithms
  • Approximate String Matching
  • Filtration
  • Comparing a Sequence Against a Database
  • Algorithm behind BLAST
  • Statistics behind BLAST
  • PatternHunter and BLAT

3
Outline- CHANGES
  • Hash Tables CHANGE- start from an
    example-modify
  • Heuristic Similarity Search Algorithms _SHOW
    FILTERED DOT-MATRIX

4
Genomic Repeats
  • Example of repeats
  • ATGGTCTAGGTCCTAGTGGTC
  • Motivation to find them
  • Genomic rearrangements are often associated with
    repeats
  • Trace evolutionary secrets
  • Many tumors are characterized by an explosion of
    repeats

5
Genomic Repeats
  • The problem is often more difficult
  • ATGGTCTAGGACCTAGTGTTC
  • Motivation to find them
  • Genomic rearrangements are often associated with
    repeats
  • Trace evolutionary secrets
  • Many tumors are characterized by an explosion of
    repeats

6
l-mer Repeats
  • Long repeats are difficult to find
  • Short repeats are easy to find (e.g., hashing)
  • Simple approach to finding long repeats
  • Find exact repeats of short l-mers (l is usually
    10 to 13)
  • Use l-mer repeats to potentially extend into
    longer, maximal repeats

7
l-mer Repeats (contd)
  • There are typically many locations where an l-mer
    is repeated
  • GCTTACAGATTCAGTCTTACAGATGGT
  • The 4-mer TTAC starts at locations 3 and 17

8
Extending l-mer Repeats
  • GCTTACAGATTCAGTCTTACAGATGGT
  • Extend these 4-mer matches
  • GCTTACAGATTCAGTCTTACAGATGGT
  • Maximal repeat TTACAGAT

9
Maximal Repeats
  • To find maximal repeats in this way, we need ALL
    start locations of all l-mers in the genome
  • Hashing lets us find repeats quickly in this
    manner

10
Hashing What is it?
  • What does hashing do?
  • For different data, generate a unique integer
  • Store data in an array at the unique integer
    index generated from the data
  • Hashing is a very efficient way to store and
    retrieve data

11
Hashing Definitions
  • Hash table array used in hashing
  • Records data stored in a hash table
  • Keys identifies sets of records
  • Hash function uses a key to generate an index to
    insert at in hash table
  • Collision when more than one record is mapped to
    the same index in the hash table

12
Hashing Example
  • Where do the animals eat?
  • Records each animal
  • Keys where each animal eats

13
Hashing DNA sequences
  • Each l-mer can be translated into a binary string
    (A, T, C, G can be represented as 00, 01, 10, 11)
  • After assigning a unique integer per l-mer it is
    easy to get all start locations of each l-mer in
    a genome

14
Hashing Maximal Repeats
  • To find repeats in a genome
  • For all l-mers in the genome, note the start
    position and the sequence
  • Generate a hash table index for each unique l-mer
    sequence
  • In each index of the hash table, store all genome
    start locations of the l-mer which generated that
    index
  • Extend l-mer repeats to maximal repeats

15
Hashing Collisions
  • Dealing with collisions
  • Chain all start locations of l-mers (linked
    list)

16
Hashing Summary
  • When finding genomic repeats from l-mers
  • Generate a hash table index for each l-mer
    sequence
  • In each index, store all genome start locations
    of the l-mer which generated that index
  • Extend l-mer repeats to maximal repeats

17
Pattern Matching
  • What if, instead of finding repeats in a genome,
    we want to find all sequences in a database that
    contain a given pattern?
  • This leads us to a different problem, the Pattern
    Matching Problem

18
Pattern Matching Problem
  • Goal Find all occurrences of a pattern in a text
  • Input Pattern p p1pn and text t t1tm
  • Output All positions 1lt i lt (m n 1) such
    that the n-letter substring of t starting at i
    matches p
  • Motivation Searching database for a known pattern

19
Exact Pattern Matching A Brute-Force Algorithm
  • PatternMatching(p,t)
  • n ? length of pattern p
  • m ? length of text t
  • for i ? 1 to (m n 1)
  • if titin-1 p
  • output i

20
Exact Pattern Matching An Example
GCAT
  • PatternMatching algorithm for
  • Pattern GCAT
  • Text CGCATC

CGCATC
GCAT
CGCATC
GCAT
CGCATC
GCAT
CGCATC
GCAT
CGCATC
21
Exact Pattern Matching Running Time
  • PatternMatching runtime O(nm)
  • Probability-wise, its more like O(m)
  • Rarely will there be close to n comparisons in
    line 4
  • Better solution suffix trees
  • Can solve problem in O(m) time
  • Conceptually related to keyword trees

22
Keyword Trees Example
  • Keyword tree
  • Apple

23
Keyword Trees Example (contd)
  • Keyword tree
  • Apple
  • Apropos

24
Keyword Trees Example (contd)
  • Keyword tree
  • Apple
  • Apropos
  • Banana

25
Keyword Trees Example (contd)
  • Keyword tree
  • Apple
  • Apropos
  • Banana
  • Bandana

26
Keyword Trees Example (contd)
  • Keyword tree
  • Apple
  • Apropos
  • Banana
  • Bandana
  • Orange

27
Keyword Trees Properties
  • Stores a set of keywords in a rooted labeled tree
  • Each edge labeled with a letter from an alphabet
  • Any two edges coming out of the same vertex have
    distinct labels
  • Every keyword stored can be spelled on a path
    from root to some leaf

28
Keyword Trees Threading (contd)
  • Thread appeal
  • appeal

29
Keyword Trees Threading (contd)
  • Thread appeal
  • appeal

30
Keyword Trees Threading (contd)
  • Thread appeal
  • appeal

31
Keyword Trees Threading (contd)
  • Thread appeal
  • appeal

32
Keyword Trees Threading (contd)
  • Thread apple
  • apple

33
Keyword Trees Threading (contd)
  • Thread apple
  • apple

34
Keyword Trees Threading (contd)
  • Thread apple
  • apple

35
Keyword Trees Threading (contd)
  • Thread apple
  • apple

36
Keyword Trees Threading (contd)
  • Thread apple
  • apple

37
Multiple Pattern Matching Problem
  • Goal Given a set of patterns and a text, find
    all occurrences of any of patterns in text
  • Input k patterns p1,,pk, and text t t1tm
  • Output Positions 1 lt i lt m where substring of t
    starting at i matches pj for 1 lt j lt k
  • Motivation Searching database for known multiple
    patterns

38
Multiple Pattern Matching Straightforward
Approach
  • Can solve as k Pattern Matching Problems
  • Runtime
  • O(kmn)
  • using the PatternMatching algorithm k times
  • m - length of the text
  • n - average length of the pattern

39
Multiple Pattern Matching Keyword Tree Approach
  • Or, we could use keyword trees
  • Build keyword tree in O(N) time N is total
    length of all patterns
  • With naive threading O(N nm)
  • Aho-Corasick algorithm O(N m)

40
Keyword Trees Threading
  • To match patterns in a text using a keyword tree
  • Build keyword tree of patterns
  • Thread the text through the keyword tree

41
Keyword Trees Threading (contd)
  • Threading is complete when we reach a leaf in
    the keyword tree
  • When threading is complete, weve found a
    pattern in the text

42
Suffix TreesCollapsed Keyword Trees
  • Similar to keyword trees, except edges that form
    paths are collapsed
  • Each edge is labeled with a substring of a text
  • All internal edges have at least two outgoing
    edges
  • Leaves labeled by the index of the pattern.

43
Suffix Tree of a Text
  • Suffix trees of a text is constructed for all its
    suffixes

ATCATG TCATG CATG ATG
TG G
Keyword Tree
Suffix Tree
44
Suffix Tree of a Text
  • Suffix trees of a text is constructed for all its
    suffixes

ATCATG TCATG CATG ATG
TG G
Keyword Tree
Suffix Tree
How much time does it take?
45
Suffix Tree of a Text
  • Suffix trees of a text is constructed for all its
    suffixes

ATCATG TCATG CATG ATG
TG G
Keyword Tree
Suffix Tree
quadratic
Time is linear in the total size of all suffixes,
i.e., it is quadratic in the length of the text
46
Suffix Trees Advantages
  • Suffix trees of a text is constructed for all its
    suffixes
  • Suffix trees build faster than keyword trees

ATCATG TCATG CATG ATG
TG G
Keyword Tree
Suffix Tree
quadratic
linear (Weiner suffix tree algorithm)
47
Use of Suffix Trees
  • Suffix trees hold all suffixes of a text
  • i.e., ATCGC ATCGC, TCGC, CGC, GC, C
  • Builds in O(m) time for text of length m
  • To find any pattern of length n in a text
  • Build suffix tree for text
  • Thread the pattern through the suffix tree
  • Can find pattern in text in O(n) time!
  • O(n m) time for Pattern Matching Problem
  • Build suffix tree and lookup pattern

48
Pattern Matching with Suffix Trees
  • SuffixTreePatternMatching(p,t)
  • Build suffix tree for text t
  • Thread pattern p through suffix tree
  • if threading is complete
  • output positions of all p-matching leaves in
    the tree
  • else
  • output Pattern does not appear in text

49
Suffix Trees Example
50
Multiple Pattern Matching Summary
  • Keyword and suffix trees are used to find
    patterns in a text
  • Keyword trees
  • Build keyword tree of patterns, and thread text
    through it
  • Suffix trees
  • Build suffix tree of text, and thread patterns
    through it

51
Approximate vs. Exact Pattern Matching
  • So far all weve seen exact pattern matching
    algorithms
  • Usually, because of mutations, it makes much more
    biological sense to find approximate pattern
    matches
  • Biologists often use fast heuristic approaches
    (rather than local alignment) to find approximate
    matches

52
Heuristic Similarity Searches
  • Genomes are huge Smith-Waterman quadratic
    alignment algorithms are too slow
  • Alignment of two sequences usually has short
    identical or highly similar fragments
  • Many heuristic methods (i.e., FASTA) are based on
    the same idea of filtration
  • Find short exact matches, and use them as seeds
    for potential match extension
  • Filter out positions with no extendable matches

53
Dot Matrices
  • Dot matrices show similarities between two
    sequences
  • FASTA makes an implicit dot matrix from short
    exact matches, and tries to find long diagonals
    (allowing for some mismatches)

54
Dot Matrices (contd)
  • Identify diagonals above a threshold length
  • Diagonals in the dot matrix indicate exact
    substring matching

55
Diagonals in Dot Matrices
  • Extend diagonals and try to link them together,
    allowing for minimal mismatches/indels
  • Linking diagonals reveals approximate matches
    over longer substrings

56
Approximate Pattern Matching Problem
  • Goal Find all approximate occurrences of a
    pattern in a text
  • Input A pattern p p1pn, text t t1tm, and
    k, the maximum number of mismatches
  • Output All positions 1 lt i lt (m n 1) such
    that titin-1 and p1pn have at most k
    mismatches (i.e., Hamming distance between
    titin-1 and p lt k)

57
Approximate Pattern Matching A Brute-Force
Algorithm
  • ApproximatePatternMatching(p, t, k)
  • n ? length of pattern p
  • m ? length of text t
  • for i ? 1 to m n 1
  • dist ? 0
  • for j ? 1 to n
  • if tij-1 ! pj
  • dist ? dist 1
  • if dist lt k
  • output i

58
Approximate Pattern Matching Running Time
  • That algorithm runs in O(nm).
  • Landau-Vishkin algorithm O(kn)
  • We can generalize the Approximate Pattern
    Matching Problem into a Query Matching
    Problem
  • We want to match substrings in a query to
    substrings in a text with at most k mismatches
  • Motivation we want to see similarities to some
    gene, but we may not know which parts of the gene
    to look for

59
Query Matching Problem
  • Goal Find all substrings of the query that
    approximately match the text
  • Input Query q q1qw,
  • text t t1tm,
  • n (length of matching
    substrings),
  • k (maximum number of
    mismatches)
  • Output All pairs of positions (i, j) such that
    the
  • n-letter substring of q starting
    at i approximately matches the
  • n-letter substring of t starting
    at j,
  • with at most k mismatches

60
Approximate Pattern Matching vs Query Matching
61
Query Matching Main Idea
  • Approximately matching strings share some
    perfectly matching substrings.
  • Instead of searching for approximately matching
    strings (difficult) search for perfectly matching
    substrings (easy).

62
Filtration in Query Matching
  • We want all n-matches between a query and a text
    with up to k mismatches
  • Filter out positions we know do not match
    between text and query
  • Potential match detection find all matches of
    l-tuples in query and text for some small l
  • Potential match verification Verify each
    potential match by extending it to the left and
    right, until (k 1) mismatches are found

63
Filtration Match Detection
  • If x1xn and y1yn match with at most k
    mismatches, they must share an l-tuple that is
    perfectly matched, with l ?n/(k 1)?
  • Break string of length n into k1 parts, each
    each of length ?n/(k 1)?
  • k mismatches can affect at most k of these k1
    parts
  • At least one of these k1 parts is perfectly
    matched

64
Filtration Match Detection (contd)
  • Suppose k 3. We would then have
    ln/(k1)n/4
  • There are at most k mismatches in n, so at the
    very least there must be one out of the k1 l
    tuples without a mismatch

1l
l 12l
2l 13l
3l 1n
1
2
k
k 1
65
Filtration Match Verification
  • For each l -match we find, try to extend the
    match further to see if it is substantial

Extend perfect match of length l until we find an
approximate match of length n with k mismatches
query
text
66
Filtration Example
k 0 k 1 k 2 k 3 k 4 k 5
l -tuple length n n/2 n/3 n/4 n/5 n/6
Shorter perfect matches required
Performance decreases
67
Local alignment is to slow
  • Quadratic local alignment is too slow while
    looking for similarities between long strings
    (e.g. the entire GenBank database)

68
Local alignment is to slow
  • Quadratic local alignment is too slow while
    looking for similarities between long strings
    (e.g. the entire GenBank database)

69
Local alignment is to slow
  • Quadratic local alignment is too slow while
    looking for similarities between long strings
    (e.g. the entire GenBank database)

70
Local alignment is to slow
  • Quadratic local alignment is too slow while
    looking for similarities between long strings
    (e.g. the entire GenBank database)
  • Guaranteed to find the optimal local alignment
  • Sets the standard for sensitivity

71
Local alignment is to slow
  • Quadratic local alignment is too slow while
    looking for similarities between long strings
    (e.g. the entire GenBank database)
  • Basic Local Alignment Search Tool
  • Altschul, S., Gish, W., Miller, W., Myers, E.
    Lipman, D.J.
  • Journal of Mol. Biol., 1990
  • Search sequence databases for local alignments to
    a query

72
BLAST
  • Great improvement in speed, with a modest
    decrease in sensitivity
  • Minimizes search space instead of exploring
    entire search space between two sequences
  • Finds short exact matches (seeds), only
    explores locally around these hits

73
What Similarity Reveals
  • BLASTing a new gene
  • Evolutionary relationship
  • Similarity between protein function
  • BLASTing a genome
  • Potential genes

74
BLAST algorithm
  • Keyword search of all words of length w from the
    query of length n in database of length m with
    score above threshold
  • w 11 for DNA queries, w 3 for proteins
  • Local alignment extension for each found keyword
  • Extend result until longest match above threshold
    is achieved
  • Running time O(nm)

75
BLAST algorithm (contd)
keyword
Query KRHRKVLRDNIQGITKPAIRRLARRGGVKRISGLIYEETRGVL
KIFLENVIRD
GVK 18 GAK 16 GIK 16 GGK 14 GLK 13 GNK 12 GRK
11 GEK 11 GDK 11
Neighborhood words
neighborhood score threshold (T 13)
extension
Query 22 VLRDNIQGITKPAIRRLARRGGVKRISGLIYEETRGVLK
60 DN G IR L GK I L E
RGK Sbjct 226 IIKDNGRGFSGKQIRNLNYGIGLKVIADLV-EK
HRGIIK 263
High-scoring Pair (HSP)
76
Original BLAST
  • Dictionary
  • All words of length w
  • Alignment
  • Ungapped extensions until score falls below some
    statistical threshold
  • Output
  • All local alignments with score gt threshold

77
Original BLAST Example
A C G A A G T A A G G T C
C A G T
  • w 4
  • Exact keyword match of GGTC
  • Extend diagonals with mismatches until score is
    under 50
  • Output result
  • GTAAGGTCC
  • GTTAGGTCC


















C T G A T C C T G G A T T
G C G A
From lectures by Serafim Batzoglou (Stanford)
78
Gapped BLAST Example
A C G A A G T A A G G T C
C A G T
  • Original BLAST exact keyword search, THEN
  • Extend with gaps around ends of exact match until
    score lt threshold
  • Output result
  • GTAAGGTCCAGT
  • GTTAGGTC-AGT


















C T G A T C C T G G A T T
G C G A
From lectures by Serafim Batzoglou (Stanford)
79
Incarnations of BLAST
  • blastn Nucleotide-nucleotide
  • blastp Protein-protein
  • blastx Translated query vs. protein database
  • tblastn Protein query vs. translated database
  • tblastx Translated query vs. translated
  • database (6 frames each)

80
Incarnations of BLAST (contd)
  • PSI-BLAST
  • Find members of a protein family or build a
    custom position-specific score matrix
  • Megablast
  • Search longer sequences with fewer differences
  • WU-BLAST (Wash U BLAST)
  • Optimized, added features

81
Assessing sequence similarity
  • Need to know how strong an alignment can be
    expected from chance alone
  • Chance relates to comparison of sequences that
    are generated randomly based upon a certain
    sequence model
  • Sequence models may take into account
  • GC content
  • Poly-A tails
  • Junk DNA
  • Codon bias
  • Etc.

82
BLAST Segment Score
  • BLAST uses scoring matrices (d) to improve on
    efficiency of match detection
  • Some proteins may have very different amino acid
    sequences, but are still similar
  • For any two l-mers x1xl and y1yl
  • Segment pair pair of l-mers, one from each
    sequence
  • Segment score Sli1 d(xi, yi)

83
BLAST Locally Maximal Segment Pairs
  • A segment pair is maximal if it has the best
    score over all segment pairs
  • A segment pair is locally maximal if its score
    cant be improved by extending or shortening
  • Statistically significant locally maximal segment
    pairs are of biological interest
  • BLAST finds all locally maximal segment pairs
    with scores above some threshold
  • A significantly high threshold will filter out
    some statistically insignificant matches

84
BLAST Statistics
  • Threshold Altschul-Dembo-Karlin statistics
  • Identifies smallest segment score that is
    unlikely to happen by chance
  • matches above q has mean E(q) Kmne-lq K is a
    constant, m and n are the lengths of the two
    compared sequences
  • Parameter l is positive root of
  • S x,y in A(pxpyed(x,y)) 1, where px and py are
    frequenceies of amino acids x and y, and A is the
    twenty letter amino acid alphabet

85
P-values
  • The probability of finding b HSPs with a score S
    is given by
  • (e-EEb)/b!
  • For b 0, that chance is
  • e-E
  • Thus the probability of finding at least one HSP
    with a score S is
  • P 1 e-E

86
Sample BLAST output
  • Blast of human beta globin protein against zebra
    fish
  • Score E
  • Sequences producing significant alignments
    (bits) Value
  • gi18858329refNP_571095.1 ba1 globin Danio
    rerio gtgi147757... 171 3e-44
  • gi18858331refNP_571096.1 ba2 globin
    SIdZ118J2.3 Danio rer... 170 7e-44
  • gi37606100embCAE48992.1 SIbY187G17.6 (novel
    beta globin) D... 170 7e-44
  • gi31419195gbAAH53176.1 Ba1 protein Danio
    rerio 168 3e-43
  • ALIGNMENTS
  • gtgi18858329refNP_571095.1 ba1 globin Danio
    rerio
  • Length 148
  • Score 171 bits (434), Expect 3e-44
  • Identities 76/148 (51), Positives 106/148
    (71), Gaps 1/148 (0)
  • Query 1 MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWT
    QRFFESFGDLSTPDAVMGNPK 60
  • MV T EA LWGKNDEG AL R
    LVYPWTQRF FGLSP AMGNPK
  • Sbjct 1 MVEWTDAERTAILGLWGKLNIDEIGPQALSRCLIVYPWT
    QRYFATFGNLSSPAAIMGNPK 60

87
Sample BLAST output (contd)
  • Blast of human beta globin DNA against human DNA
  • Score E
  • Sequences producing significant alignments
    (bits) Value
  • gi19849266gbAF487523.1 Homo sapiens gamma A
    hemoglobin (HBG1... 289 1e-75
  • gi183868gbM11427.1HUMHBG3E Human gamma-globin
    mRNA, 3' end 289 1e-75
  • gi44887617gbAY534688.1 Homo sapiens A-gamma
    globin (HBG1) ge... 280 1e-72
  • gi31726embV00512.1HSGGL1 Human messenger RNA
    for gamma-globin 260 1e-66
  • gi38683401refNR_001589.1 Homo sapiens
    hemoglobin, beta pseud... 151 7e-34
  • gi18462073gbAF339400.1 Homo sapiens haplotype
    PB26 beta-glob... 149 3e-33
  • ALIGNMENTS
  • gtgi28380636refNG_000007.3 Homo sapiens beta
    globin region (HBB_at_) on chromosome 11
  • Length 81706
  • Score 149 bits (75), Expect 3e-33
  • Identities 183/219 (83)
  • Strand Plus / Plus

  • Query 267 ttgggagatgccacaaagcacctggatgatctcaagg
    gcacctttgcccagctgagtgaa 326


88
Timeline
  • 1970 Needleman-Wunsch global alignment algorithm
  • 1981 Smith-Waterman local alignment algorithm
  • 1985 FASTA
  • 1990 BLAST (basic local alignment search tool)
  • 2000s BLAST has become too slow in genome vs.
    genome comparisons - new faster algorithms
    evolve!
  • Pattern Hunter
  • BLAT

89
PatternHunter faster and even more sensitive
  • BLAST matches short consecutive sequences
    (consecutive seed)
  • Length k
  • Example (k 11)
  • 11111111111
  • Each 1 represents a match
  • PatternHunter matches short non-consecutive
    sequences (spaced seed)
  • Increases sensitivity by locating homologies that
    would otherwise be missed
  • Example (a spaced seed of length 18 w/ 11
    matches)
  • 111010010100110111
  • Each 0 represents a dont care, so there can be
    a match or a mismatch

90
Spaced seeds
  • Example of a hit using a spaced seed

How does this result in better sensitivity?
91
Why is PH better?
  • PatternHunter
  • BLAST redundant hits

This results in gt 1 hit and creates clusters of
redundant hits
This results in very few redundant hits
92
Why is PH better?
  • BLAST may also miss a hit
  • GAGTACTCAACACCAACATTAGTGGGCAATGGAAAAT
  • GAATACTCAACAGCAACATCAATGGGCAGCAGAAAAT

9 matches
In this example, despite a clear homology, there
is no sequence of continuous matches longer than
length 9. BLAST uses a length 11 and because of
this, BLAST does not recognize this as a
hit! Resolving this would require reducing the
seed length to 9, which would have a damaging
effect on speed
93
Advantage of Gapped Seeds
11 positions
11 positions
10 positions
94
Why is PH better?
  • Higher hit probability
  • Lower expected number of random hits

95
Use of Multiple Seeds
  • Basic Searching Algorithm
  • Select a group of spaced seed models
  • For each hit of each model, conduct extension to
    find a homology.

96
Another method BLAT
  • BLAT (BLAST-Like Alignment Tool)
  • Same idea as BLAST - locate short sequence hits
    and extend

97
BLAT vs. BLAST Differences
  • BLAT builds an index of the database and scans
    linearly through the query sequence, whereas
    BLAST builds an index of the query sequence and
    then scans linearly through the database
  • Index is stored in RAM which is memory intensive,
    but results in faster searches

98
BLAT Fast cDNA Alignments
  • Steps
  • 1. Break cDNA into 500 base chunks.
  • 2. Use an index to find regions in genome similar
    to each chunk of cDNA.
  • 3. Do a detailed alignment between genomic
    regions and cDNA chunk.
  • 4. Use dynamic programming to stitch together
    detailed alignments of chunks into detailed
    alignment of whole.

99
BLAT Indexing
  • An index is built that contains the positions of
    each k-mer in the genome
  • Each k-mer in the query sequence is compared to
    each k-mer in the index
  • A list of hits is generated - positions in cDNA
    and in genome that match for k bases

100
Indexing An Example
  • Here is an example with k 3
  • Genome cacaattatcacgaccgc
  • 3-mers (non-overlapping) cac aat tat cac gac cgc
  • Index aat 3 gac 12
  • cac 0,9 tat 6
  • cgc 15
  • cDNA (query sequence) aattctcac
  • 3-mers (overlapping) aat att ttc tct ctc tca cac
  • 0 1 2 3 4 5 6
  • Hits aat 0,3
  • cac 6,0
  • cac 6,9
  • clump cacAATtatCACgaccgc

Multiple instances map to single index
Position of 3-mer in query, genome
101
However
  • BLAT was designed to find sequences of 95 and
    greater similarity of length gt40 may miss more
    divergent or shorter sequence alignments

102
PatternHunter and BLAT vs. BLAST
  • PatternHunter is 5-100 times faster than Blastn,
    depending on data size, at the same sensitivity
  • BLAT is several times faster than BLAST, but best
    results are limited to closely related sequences

103
Resources
  • tandem.bu.edu/classes/ 2004/papers/pathunter_grp_p
    rsnt.ppt
  • http//www.jax.org/courses/archives/2004/gsa04_kin
    g_presentation.pdf
  • http//www.genomeblat.com/genomeblat/blatRapShow.p
    ps
Write a Comment
User Comments (0)
About PowerShow.com