Title: Sequence Alignment II
1Sequence Alignment II
 K-tuple methods
 Statistics of alignments
2Database searches
 What is the problem?
 There is a large number of sequences to search your query sequence against.
 Various indexing schemes and heuristics are used, one of which is BLAST.
 A heuristic is a technique for solving a problem that ignores whether the solution can be proven to be correct, but usually produces a good solution. Heuristics are intended to gain computational performance or conceptual simplicity, potentially at the cost of accuracy or precision.
http://en.wikipedia.org/wiki/Heuristics#Computer_science
3K-tuple methods
http://creativecommons.org/licenses/by-sa/2.0/
4Concepts of Sequence Similarity Searching
 The premise
 The sequence itself is not informative; it must be analyzed by comparative methods against existing databases to develop hypotheses concerning relatives and function.
5Important Terms for Sequence Similarity Searching
with very different meanings
 Similarity
 The extent to which nucleotide or protein sequences are related. In BLAST, similarity refers to a positive matrix score.
 Identity
 The extent to which two (nucleotide or amino acid) sequences are invariant.
 Homology
 Similarity attributed to descent from a common ancestor.
6Sequence Similarity Searching: The Approach
 Sequence similarity searching involves the use of a set of algorithms (such as the BLAST programs) to compare a query sequence to all the sequences in a specified database.
 Comparisons are made in a pairwise fashion. Each comparison is given a score reflecting the degree of similarity between the query and the sequence being compared.
7Blast
[Diagram: QUERY sequence(s) and a BLAST database are the inputs to a BLAST program, which produces the BLAST results]
8Topics
BLAST program
 There are different blast programs
 Understanding the BLAST algorithm
 Word size
 HSPs (High Scoring Pairs)
 Understanding BLAST statistics
 The alignment score (S)
 Scoring Matrices
 Dealing with gaps in an alignment
 The expectation value (E)
9The BLAST algorithm
 The BLAST programs (Basic Local Alignment Search Tools) are a set of sequence comparison algorithms, introduced in 1990, that search for optimal local alignments to a query.
 Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. J. Mol. Biol. 215:403-410.
 Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. NAR 25:3389-3402.
10http://www.ncbi.nlm.nih.gov/BLAST
blastn
11Other BLAST programs
 BLAST 2 Sequences (bl2seq)
 Aligns two sequences of your choice
 Gives dot-plot-like output
12More BLAST programs
 BLAST against genomes
 Many available
 BLAST parameters pre-optimized
 Handy for mapping query to genome
 Search for short exact matches
 BLAST parameters pre-optimized
 Great for checking probes and primers
13How Does BLAST Work?
 The BLAST programs improved the overall speed of searches while retaining good sensitivity (important as databases continue to grow) by breaking the query and database sequences into fragments ("words") and initially seeking matches between fragments.
 Word hits are then extended in either direction in an attempt to generate an alignment with a score exceeding the threshold "T".
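The word-matching step can be sketched in Python as follows. This is a minimal illustration rather than BLAST's actual implementation: it indexes every word of length w in the query and reports exact word matches in a subject sequence. The function names `index_words` and `find_word_hits` and the toy sequences are assumptions made for this example; for protein searches, real BLAST also considers inexact "neighborhood" words.

```python
from collections import defaultdict


def index_words(seq, w):
    """Map each length-w word in seq to the 0-based positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - w + 1):
        index[seq[i:i + w]].append(i)
    return index


def find_word_hits(query, subject, w):
    """Return (query_pos, subject_pos) pairs where an identical word occurs."""
    index = index_words(query, w)
    hits = []
    for j in range(len(subject) - w + 1):
        for i in index.get(subject[j:j + w], []):
            hits.append((i, j))
    return hits


# Hypothetical toy sequences, chosen only for illustration.
query = "ACGTTGCA"
subject = "TTACGTAG"
print(find_word_hits(query, subject, w=4))
```

Each hit is a seed; the extension step described above would start from these positions.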
14Picture used with permission from Chapter 11 of Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins
15Each BLAST hit generates an alignment that can
contain one or more high scoring pairs (HSPs)
16Each BLAST hit generates an alignment that can
contain one or more high scoring pairs (HSPs)
17Where does the score (S) come from?
 The quality of each pairwise alignment is represented as a score, and the scores are ranked.
 Scoring matrices are used to calculate the score of the alignment base by base (DNA) or amino acid by amino acid (protein).
 The alignment score is the sum of the scores for each position.
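As a small illustration of "the alignment score is the sum of the scores for each position", the sketch below looks up each aligned pair in a scoring matrix and adds the results. The +1/-1 DNA matrix and the names `make_dna_matrix` and `alignment_score` are assumptions for this example; BLAST itself uses matrices such as BLOSUM 62 for proteins.

```python
def make_dna_matrix(match=1, mismatch=-1):
    """Build a simple DNA scoring matrix as a dict keyed by (base, base)."""
    bases = "ACGT"
    return {(x, y): (match if x == y else mismatch) for x in bases for y in bases}


def alignment_score(a, b, matrix):
    """Sum the matrix score of every aligned pair in an ungapped alignment."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    return sum(matrix[(x, y)] for x, y in zip(a, b))


matrix = make_dna_matrix()
# 8 matching positions and 1 mismatch under the assumed +1/-1 scheme.
print(alignment_score("GTAAGGTCC", "GTTAGGTCC", matrix))
```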
18What's a scoring matrix?
 Substitution matrices are used for amino acid alignments. These are matrices in which each possible residue substitution is given a score reflecting the probability that it is related to the corresponding residue in the query.
19PAM vs. BLOSUM scoring matrices
 BLOSUM 62 is the default matrix in BLAST 2.0.
Though it is tailored for comparisons of
moderately distant proteins, it performs well in
detecting closer relationships. A search for
distant relatives may be more sensitive with a
different matrix.
20PAM vs BLOSUM scoring matrices
 The PAM family
 PAM matrices are based on global alignments of closely related proteins.
 PAM1 is the matrix calculated from comparisons of sequences with no more than 1% divergence.
 Other PAM matrices are extrapolated from PAM1.
 The BLOSUM family
 BLOSUM matrices are based on local alignments.
 BLOSUM 62 is a matrix calculated from comparisons of sequences that are no more than 62% identical.
 All BLOSUM matrices are based on observed alignments; they are not extrapolated from comparisons of closely related proteins.
21What happens if you have a gap in the alignment?
 A gap is a position in the alignment at which a letter is paired with a null.
 Gap scores are negative. Since a single mutational event may cause the insertion or deletion of more than one residue, the presence of a gap is frequently ascribed more significance than the length of the gap.
 Hence the opening of a gap is penalized heavily, whereas a lesser penalty is assigned to each subsequent residue in the gap.
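The gap-scoring rule above (a heavy penalty for the first gapped position, a lighter one for each subsequent residue) can be written as a one-line function. The penalty values 10 and 1 and the name `gap_score` are hypothetical, chosen only to show the shape of the rule; other tools parameterize affine gaps slightly differently.

```python
def gap_score(length, gap_open=10, gap_extend=1):
    """Score of one gap: heavy penalty for the first residue, lighter for the rest."""
    if length < 1:
        raise ValueError("gap length must be at least 1")
    return -(gap_open + gap_extend * (length - 1))


# One gap of length 3 costs far less than three separate gaps of length 1.
print(gap_score(3), 3 * gap_score(1))
```

Under this scheme a single three-residue indel is penalized much less than three independent one-residue indels, which matches the biological argument on the slide.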
22Percent Sequence Identity
 The extent to which two nucleotide or amino acid sequences are invariant
A C C T G A G A G A C G T G G C A G
[Figure labels: mismatch, indel; 70% identical]
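A minimal sketch of a percent-identity calculation is shown below. It assumes the denominator is the full alignment length including gap columns (the convention in BLAST's "Identities" line); other conventions exist. The function name `percent_identity` and the example alignment are hypothetical and are not the sequences from the figure.

```python
def percent_identity(a, b, gap="-"):
    """Percent of alignment columns where both sequences carry the same residue."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    matches = sum(1 for x, y in zip(a, b) if x == y and x != gap)
    return 100.0 * matches / len(a)


# Hypothetical alignment: one mismatch (C/G) and one indel (G/-) in 10 columns.
print(percent_identity("ACCTGAGAGA", "ACGTGA-AGA"))
```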
23BLAST algorithm
 Keyword search of all words of length w from the query (of length n) in a database of length m, keeping matches with score above a threshold
 Default w = 11 for nucleotide queries, 3 for proteins
 Do local alignment extension for each hit of the keyword search
 Extend the result until the longest match above threshold is achieved, and output
24BLAST algorithm (cont'd)
keyword
Query: KRHRKVLRDNIQGITKPAIRRLARRGGVKRISGLIYEETRGVLKIFLENVIRD
Neighborhood words: GVK 18, GAK 16, GIK 16, GGK 14, GLK 13, GNK 12, GRK 11, GEK 11, GDK 11
neighborhood score threshold (T = 13)
extension
Query  22 VLRDNIQGITKPAIRRLARRGGVKRISGLIYEETRGVLK 60
             DN  G     IR L    G K I  L  E  RG  K
Sbjct 226 IIKDNGRGFSGKQIRNLNYGIGLKVIADLV-EKHRGIIK 263
High-scoring Pair (HSP)
25Local alignment
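 Find the best local alignment between two strings v and w, over the recurrence
 $s_{i,j} = \max\{0,\ s_{i-1,j-1} + d(v_i, w_j),\ s_{i-1,j} + d(v_i, -),\ s_{i,j-1} + d(-, w_j)\}$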
26Local alignment (contd)
 Input: strings v and w and scoring matrix d
 Output: substrings of v and w whose global alignment, as defined by d, is maximal among all global alignments of all substrings of v and w
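The local alignment problem stated above is classically solved by dynamic programming (the Smith-Waterman approach). The sketch below computes only the best local score. It replaces the scoring matrix d with an assumed +1 match, -1 mismatch, and -1 per gapped position, and the name `local_alignment_score` is hypothetical.

```python
def local_alignment_score(v, w, match=1, mismatch=-1, gap=-1):
    """Best local alignment score of v and w under a simple linear gap penalty."""
    prev = [0] * (len(w) + 1)          # previous row of the DP table
    best = 0
    for i in range(1, len(v) + 1):
        cur = [0] * (len(w) + 1)
        for j in range(1, len(w) + 1):
            diag = prev[j - 1] + (match if v[i - 1] == w[j - 1] else mismatch)
            # The 0 lets an alignment restart anywhere, which makes it local.
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


print(local_alignment_score("TTACGTTT", "GGACGTGG"))  # shared core ACGT
```

Keeping only two rows of the table is enough for the score; recovering the aligned substrings themselves would require storing the full table or traceback pointers.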
27Original BLAST
 Dictionary
 All words of length w
 Alignment
 Ungapped extensions until score falls below statistical threshold T
 Output
 All local alignments with score > statistical threshold
28Original BLAST Example
[Dot-matrix figure. Axis sequences as extracted: A C G A A G T A A G G T C C A G T and C T G A T C C T G G A T T G C G A]
 w = 4, T = 4
 Exact keyword match of GGTC
 Extend diagonals with mismatches until score falls below a threshold
 Output result:
 GTAAGGTCC
 GTTAGGTCC
From lectures by Serafim Batzoglou (Stanford)
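The seed-and-extend step of the example above can be sketched as follows. This is an illustrative variant, not the original BLAST code: it extends an exact word hit in both directions without gaps and stops in a direction once the running score drops more than `drop` below the best score seen there. The scoring (+1/-1), the `drop` value, and the name `extend_ungapped` are assumptions. The subject string is the figure's second axis sequence read in reverse, which is also an assumption about how the figure was laid out.

```python
def extend_ungapped(query, subject, qpos, spos, w, match=1, mismatch=-1, drop=2):
    """Extend an exact word hit of length w in both directions without gaps.

    Returns (score, q_start, q_end, s_start); q_end is exclusive.
    """
    def extend(step, qi, si):
        best = run = best_len = length = 0
        while 0 <= qi < len(query) and 0 <= si < len(subject):
            run += match if query[qi] == subject[si] else mismatch
            length += 1
            if run > best:
                best, best_len = run, length
            elif best - run > drop:
                break                      # score fell too far below the best
            qi += step
            si += step
        return best, best_len

    right_score, right_len = extend(+1, qpos + w, spos + w)
    left_score, left_len = extend(-1, qpos - 1, spos - 1)
    score = w * match + left_score + right_score
    return score, qpos - left_len, qpos + w + right_len, spos - left_len


query = "ACGAAGTAAGGTCCAGT"
subject = "AGCGTTAGGTCCTAGTC"   # assumed orientation of the second axis sequence
qpos, spos = query.find("GGTC"), subject.find("GGTC")
score, q_start, q_end, s_start = extend_ungapped(query, subject, qpos, spos, w=4)
print(query[q_start:q_end])
print(subject[s_start:s_start + (q_end - q_start)])
```

With these assumptions the extension recovers the pair GTAAGGTCC / GTTAGGTCC shown on the slide.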
29Gapped BLAST Example
[Dot-matrix figure. Axis sequences as extracted: A C G A A G T A A G G T C C A G T and C T G A T C C T G G A T T G C G A]
 Original BLAST exact keyword search, THEN:
 Extend with gaps in a zone around the ends of the exact match
 Output result:
 GTAAGGTCCAGT
 GTTAGGTC-AGT
From lectures by Serafim Batzoglou (Stanford)
30Gapped BLAST Example (cont'd)
[Dot-matrix figure. Axis sequences as extracted: A C G A A G T A A G G T C C A G T and C T G A T C C T G G A T T G C G A]
 Original BLAST exact keyword search, THEN:
 Extend with gaps around the ends of the exact match until score < T, then merge nearby alignments
 Output result:
 GTAAGGTCCAGT
 GTTAGGTC-AGT
From lectures by Serafim Batzoglou (Stanford)
31Topics
BLAST databases
 The different BLAST databases provided by the NCBI
 Protein databases
 Nucleotide databases
 Genomic databases
 Considerations for choosing a BLAST database
 Custom databases for BLAST
32BLAST protein databases available through the blastp web interface at NCBI
33Considerations for choosing a BLAST database
 First consider your research question:
 Are you looking for an ortholog in a particular species?
 BLAST against the genome of that species.
 Are you looking for additional members of a protein family across all species?
 BLAST against nr; if you can't find hits, check wgs, htgs, and the trace archives.
 Are you looking to annotate genes in your species of interest?
 BLAST against known genes (RefSeq) and/or ESTs from a closely related species.
34When choosing a database for BLAST
 It is important to know your reagents.
 Changing your choice of database changes your search space completely
 Database size affects the BLAST statistics
 Record BLAST parameters, database choice, and database size in your bioinformatics lab book, just as you would for your wet-bench experiments.
 Databases change rapidly and are updated frequently
 It may be necessary to repeat your analyses
35Topics
BLAST results
 Choosing the right BLAST program
 Running a blastp search
 BLAST parameters and options to consider
 Viewing BLAST results
 Look at your alignments
 Using the BLAST taxonomy report
36BLAST parameters and options to consider
conserved domains
Entrez query
E-value cutoff
Word size
37More BLAST parameters and options to consider
filtering
gap penalties
matrix
38Run your BLAST search
BLAST
39The BLAST Queue
click for more info
Note your RID
40Formatting and Retrieving your BLAST results
Results
options
41A graphical view of your BLAST results
42The BLAST hit list
Score
E-value
GenBank
alignment
EntrezGene
43The BLAST pairwise alignments
Identity
Similarity
44Sample BLAST output
 BLAST of human beta globin protein against zebrafish

                                                                     Score   E
 Sequences producing significant alignments:                         (bits)  Value
 gi|18858329|ref|NP_571095.1| ba1 globin [Danio rerio] >gi|147757...  171    3e-44
 gi|18858331|ref|NP_571096.1| ba2 globin SI:dZ118J2.3 [Danio rer...   170    7e-44
 gi|37606100|emb|CAE48992.1| SI:bY187G17.6 (novel beta globin) [D...  170    7e-44
 gi|31419195|gb|AAH53176.1| Ba1 protein [Danio rerio]                 168    3e-43

 ALIGNMENTS
 >gi|18858329|ref|NP_571095.1| ba1 globin [Danio rerio]
 Length = 148
 Score = 171 bits (434), Expect = 3e-44
 Identities = 76/148 (51%), Positives = 106/148 (71%), Gaps = 1/148 (0%)

 Query 1 MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPK 60
         MV  T  E  A   LWGK N DE G  AL R L VYPWTQR F  FG LS P A MGNPK
 Sbjct 1 MVEWTDAERTAILGLWGKLNIDEIGPQALSRCLIVYPWTQRYFATFGNLSSPAAIMGNPK 60
45Sample BLAST output (cont'd)
 BLAST of human beta globin DNA against human DNA

                                                                     Score   E
 Sequences producing significant alignments:                         (bits)  Value
 gi|19849266|gb|AF487523.1| Homo sapiens gamma A hemoglobin (HBG1...  289    1e-75
 gi|183868|gb|M11427.1|HUMHBG3E Human gamma-globin mRNA, 3' end       289    1e-75
 gi|44887617|gb|AY534688.1| Homo sapiens A-gamma globin (HBG1) ge...  280    1e-72
 gi|31726|emb|V00512.1|HSGGL1 Human messenger RNA for gamma-globin    260    1e-66
 gi|38683401|ref|NR_001589.1| Homo sapiens hemoglobin, beta pseud...  151    7e-34
 gi|18462073|gb|AF339400.1| Homo sapiens haplotype PB26 beta-glob...  149    3e-33

 ALIGNMENTS
 >gi|28380636|ref|NG_000007.3| Homo sapiens beta globin region (HBB@) on chromosome 11
 Length = 81706
 Score = 149 bits (75), Expect = 3e-33
 Identities = 183/219 (83%)
 Strand = Plus / Plus

 Query 267 ttgggagatgccacaaagcacctggatgatctcaagggcacctttgcccagctgagtgaa 326
46What do the Score and the E-value really mean?
 The quality of the alignment is represented by the Score.
 Score (S)
 The score of an alignment is calculated as the sum of substitution and gap scores. Substitution scores are given by a lookup table (PAM, BLOSUM), whereas gap scores are assigned empirically.
 The significance of each alignment is computed as an E value.
 E value (E)
 Expectation value. The number of different alignments with scores equivalent to or better than S that are expected to occur in a database search by chance. The lower the E value, the more significant the score.
47E value
 E value (E)
 Expectation value. The number of different
alignments with scores equivalent to or better
than S expected to occur in a database search by
chance. The lower the E value, the more
significant the score.
48Assessing sequence homology
 Need to know how strong an alignment can be expected from chance alone
 "Chance" is the comparison of:
 Real but non-homologous sequences
 Real sequences that are shuffled to preserve compositional properties
 Sequences that are generated randomly based upon a DNA or protein sequence model (favored)
49High Scoring Pairs (HSPs)
 All segment pairs whose scores cannot be improved by extension or trimming
 Need to model a random sequence to analyze how high the score is in relation to chance
50Expected number of HSPs
 Expected number of HSPs with score ≥ S
 E-value E for the score S:
 $E = K m n e^{-\lambda S}$
 Given:
 Two sequences, of length n and m
 The statistics of HSP scores are characterized by two parameters, K and λ
 K: scale for the search space size
 λ: scale for the scoring system
51BLAST statistics to record in your bioinformatics lab book
Record the statistics that are found at the bottom of your BLAST results page
52Scoring matrices
 Amino acid substitution matrices
 PAM
 BLOSUM
53Bit Scores
 Normalized score, so that alignment scores from different scoring systems can be compared
 Bit score:
 $S' = \frac{\lambda S - \ln K}{\ln 2}$
 E-value of bit score:
 $E = m n 2^{-S'}$
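The relationship between the raw score, the bit score, and the E-value can be checked numerically. In the sketch below the values of K, λ, the raw score, and the sequence lengths are illustrative assumptions, not values taken from a particular search, and the function names are hypothetical.

```python
import math


def e_value(raw_score, m, n, K, lam):
    """E = K * m * n * exp(-lambda * S) for raw score S."""
    return K * m * n * math.exp(-lam * raw_score)


def bit_score(raw_score, K, lam):
    """S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def e_value_from_bits(bits, m, n):
    """E = m * n * 2^(-S')."""
    return m * n * 2.0 ** (-bits)


# Assumed illustrative parameters.
K, lam = 0.041, 0.267
m, n, S = 148, 1_000_000, 100

bits = bit_score(S, K, lam)
print(bits, e_value(S, m, n, K, lam), e_value_from_bits(bits, m, n))
```

Both routes give the same E-value, which is the point of the normalization: once a score is expressed in bits, K and λ are no longer needed to interpret it.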
54Assessing the significance of an alignment
 How do we assess the significance of an alignment when a protein of length m is compared to a database containing many different proteins of varying lengths?
 Calculate a "database search" E-value: multiply the pairwise-comparison E-value by N/n, where N is the total length of the database (in residues) and n is the length of the database sequence being compared.
55Homology Some Guidelines
 Similarity can be indicative of homology
 Generally, if two sequences are significantly similar over their entire length, they are likely homologous
 Low-complexity regions can be highly similar without being homologous
 Homologous sequences are not always highly similar
56Homology Some Guidelines
 Suggested BLAST cutoffs (source: Chapter 11, Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins)
 For nucleotide-based searches, one should look for hits with E-values of $10^{-6}$ or less and sequence identity of 70% or more
 For protein-based searches, one should look for hits with E-values of $10^{-3}$ or less and sequence identity of 25% or more
57Contributors
 Special thanks to David Wishart, Andy Baxevanis,
Stephanie Minnema, Sohrab Shah, and Francis
Ouellette for their contributions to these
materials
http://creativecommons.org/licenses/by-sa/2.0/
58FASTA
 A FASTA search begins by breaking the search sequence into words.
 For genomic sequences, a word size of 4 or 6 nucleotides is used; for polypeptide sequences, 1 or 2.
59FASTA
 Next, a table is constructed for the query sequence (word size is 1)
 E.g. FAMLGFIKYLPGCM
A  C  D  E  F  G  H  I  K  L  M  N  P  Q  R  S  T  V  W  Y
2
60FASTA
 Next, a table is constructed for the query sequence
 E.g. FAMLGFIKYLPGCM
A  C  D  E  F  G  H  I  K  L  M  N  P  Q  R  S  T  V  W  Y
2  13
61FASTA
 Next, a table is constructed for the query sequence
 E.g. FAMLGFIKYLPGCM
A  C  D  E  F  G  H  I  K  L  M  N  P  Q  R  S  T  V  W  Y
2  13       1
            6
62FASTA
 Next, a table is constructed for the query sequence
 E.g. FAMLGFIKYLPGCM
A  C  D  E  F  G  H  I  K  L  M  N  P  Q  R  S  T  V  W  Y
2  13       1  5
            6  12
63FASTA
 Next, a table is constructed for the query sequence
 E.g. FAMLGFIKYLPGCM
A  C  D  E  F  G  H  I  K  L  M  N  P  Q  R  S  T  V  W  Y
2  13       1  5     7
            6  12
64FASTA
 The table for the query sequence is complete
 E.g. FAMLGFIKYLPGCM
A  C  D  E  F  G  H  I  K  L  M  N  P  Q  R  S  T  V  W  Y
2  13       1  5     7  8  4  3     11                   9
            6  12          10 14
65FASTA
 Compare the query sequence table with the target sequence
 Query: FAMLGFIKYLPGCM
 Indices of G are 5 and 12
 Target: TGFIKYLPGACT
 Indices of G are 2 and 9
 Subtract 2 from 5 and 12, producing 3 and 10
 Subtract 9 from 5 and 12, producing -4 and 3
1  2  3  4  5  6  7  8  9  10 11 12
T  G  F  I  K  Y  L  P  G  A  C  T
   3                    -4
   10                   3
66FASTA
 Compare the query sequence table with the target sequence
 Query: FAMLGFIKYLPGCM
 Indices of F are 1 and 6
 Target: TGFIKYLPGACT
 Index of F is 3
 Subtract 3 from 1 and 6, producing -2 and 3
1  2  3  4  5  6  7  8  9  10 11 12
T  G  F  I  K  Y  L  P  G  A  C  T
   3  -2                -4
   10 3                 3
67FASTA
 Compare the query sequence table with the target sequence
 Query: FAMLGFIKYLPGCM
 Indices of F are 1 and 6
 Target: TGFIKYLPGACT
 Index of F is 3
 Subtract 3 from 1 and 6, producing -2 and 3
1  2  3  4  5  6  7  8  9  10 11 12
T  G  F  I  K  Y  L  P  G  A  C  T
   3  -2 3  3  3  -3 3  -4 -8 2
   10 3           3     3
68FASTA
 FAMLGFIKYLPGCM
     ||||||||
    TGFIKYLPGACT
 Offset by 3
1  2  3  4  5  6  7  8  9  10 11 12
T  G  F  I  K  Y  L  P  G  A  C  T
   3  -2 3  3  3  -3 3  -4 -8 2
   10 3           3     3
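The lookup-table and offset steps above can be sketched in Python for word size 1. The function names `build_lookup` and `offset_counts` are hypothetical, and this is only an illustration of the idea, not the FASTA program's implementation. Positions are 1-based and each offset is the query index minus the target index, as on the slides.

```python
from collections import Counter, defaultdict


def build_lookup(query):
    """Map each residue to the 1-based positions where it occurs in the query."""
    table = defaultdict(list)
    for pos, residue in enumerate(query, start=1):
        table[residue].append(pos)
    return table


def offset_counts(query, target):
    """Count how often each offset (query index - target index) occurs."""
    table = build_lookup(query)
    counts = Counter()
    for tpos, residue in enumerate(target, start=1):
        for qpos in table.get(residue, []):
            counts[qpos - tpos] += 1
    return counts


query = "FAMLGFIKYLPGCM"
target = "TGFIKYLPGACT"
counts = offset_counts(query, target)
best_offset, support = counts.most_common(1)[0]
print(best_offset, support)
```

The most frequent offset identifies the diagonal on which the two sequences share the most residues, which is where FASTA would concentrate its alignment effort.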
69Fasta (word size 2)
70Database searches
71Odds score in sequence alignment
 The chance of an aligned amino acid pair being
found in alignments of related sequences compared
to the chance of that pair being found in random
alignments of unrelated sequences.
72Statistical significance of an alignment
 The probability that random or unrelated sequences could be aligned to produce the same score.
 The smaller the probability, the better.
73Probability
 What is the probability that a coin toss will yield a head?
 What is the probability that the next pair of nucleotides will be a match or a mismatch?
74Bernoulli trials
 A series of n independent trials with the same outcome probabilities and number of choices (e.g., head or tail, or match (m) or mismatch (mi)).
 P(hhhhh)
 P(mmmmm)
75Head or tail: longest run of heads or tails
 What is the longest run of heads one would get in a random series of coin tosses?
 Fair coin: p = 0.5, so 1/p = 1/0.5 = 2
 Erdős and Rényi: longest run = $\log_{1/p}(n)$
 If n = 100, longest run = $\log_2(100) \approx 6.64$
76Alignment analogy
 You have two sequences, a and b, of equal length:
 $a_1 a_2 a_3 a_4 \ldots$
 $b_1 b_2 b_3 b_4 \ldots$
 If $a_n = b_n$, it is a head (match)
 If $a_n \neq b_n$, it is a tail (mismatch)
77Alignment Statistics
 For two sequences of length n and m, n × m comparisons are being made; thus the expected length of the longest match would be $\log_{1/p}(mn)$.
78Alignment Statistics
 The expectation value, or mean longest match, would be
 $E(M) = \log_{1/p}(Kmn)$, where K is a constant that depends on amino acid or base composition and p is the probability of a match.
 This is only true for ungapped local alignments.
79Distribution of alignment scores
 resembles Gumbel extreme value distribution.
80Extreme Value Distribution
81Extreme Value Distribution
 In this distribution, the probability of a score being at least x is given by
 $P(S \geq x) = 1 - \exp(-K m n e^{-\lambda x})$
 m and n are the lengths of the sequences compared
 K and λ can be calculated from the data in the matrix used and from the relative frequencies of the amino acids (or nucleotides)
82Alignment Statistics
 For two sequences of length n and m, n × m comparisons are being made; thus the expected length of the longest match would be $\log_{1/p}(mn)$.
 For a pair of random DNA sequences of length 100 and p = 0.25 (equal A, T, C, G), the longest expected run of matches would be
 $\log_{1/p}(mn) = 2 \times \log_{1/p}(n) = 2 \times \log_4(100) \approx 6.64$
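The two logarithms quoted above are easy to verify. The sketch below evaluates the longest-run estimate for the coin-toss case and for a pair of random DNA sequences; the function names are hypothetical, and the formulas are the ones given on the slides (without the constant K).

```python
import math


def expected_longest_run(n, p):
    """Longest expected run in n trials with success probability p: log base 1/p of n."""
    return math.log(n) / math.log(1.0 / p)


def expected_longest_match(m, n, p):
    """Longest expected match between sequences of length m and n: log base 1/p of m*n."""
    return math.log(m * n) / math.log(1.0 / p)


print(expected_longest_run(100, 0.5))          # fair coin, 100 tosses
print(expected_longest_match(100, 100, 0.25))  # two random DNA sequences of length 100
```

Both cases evaluate to about 6.64, because $\log_2(100)$ and $2 \times \log_4(100)$ are the same number.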
83Alignment Statistics
 $E(M) = \log_{1/p}(Kmn)$ means that the match length grows as the log of the product of the sequence lengths. Amino acid substitution matrices turn match lengths into alignment scores (S).
 More commonly, $\lambda = \ln(1/p)$ is used.
 The expected number of HSPs with score at least S is estimated as
 $E = K m n e^{-\lambda S}$
 How good a sequence score is gets evaluated based on how many HSPs (i.e., the E value) one would expect for that score.
84Alignment Statistics
 Two ways to get K and λ:
 For 10000 random amino acid sequences with various gap penalties, the K and λ parameters have been tabulated.
 Calculate the distribution for the two sequences being aligned by keeping one of them fixed and scrambling the other, thus preserving both the sequence length and the amino acid composition.
85Generate random sequences
 You may use the function randperm
 >> help randperm
 RANDPERM(n) is a random permutation of the integers from 1 to n.
 For example, RANDPERM(6) might be [2 4 5 6 1 3].


86Align a sequence with its randomly permuted state
 >> x = 'atagacagacca'
 >> l = length(x)
 l =
     12
 >> ind = randperm(12)
 ind =
   Columns 1 through 9
      9  4  5  7  3 11  2  8  6
   Columns 10 through 12
     10  1 12
 >> y = x(ind)
 y =
 agaaactgccaa
 >> align1
 atagacagacca
 agaaactgccaa
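For readers without MATLAB, the same shuffle can be sketched in Python. This is an assumed equivalent of the `randperm` indexing above, not part of the original material: it permutes the letters of a sequence so that length and composition are preserved. The function name `shuffle_sequence` is hypothetical, and the shuffled string will differ from the MATLAB output because the random permutation differs.

```python
import random
from collections import Counter


def shuffle_sequence(seq, rng=None):
    """Return a random permutation of seq, preserving length and composition."""
    rng = rng or random.Random()
    letters = list(seq)
    rng.shuffle(letters)
    return "".join(letters)


x = "atagacagacca"
y = shuffle_sequence(x, random.Random(0))   # fixed seed only for repeatability
print(x)
print(y)
print(Counter(x) == Counter(y))             # composition is unchanged
```

Aligning x against many such shuffled copies gives an empirical distribution of chance scores, which is the second route to K and λ mentioned earlier.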
87Alignment Statistics
88Alignment Statistics
89Alignment Statistics
90Alignment Statistics
91Probability Distributions Binomial Distribution
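 The probability that an event occurs x times in n trials is given by the binomial distribution:
 $P(x) = \binom{n}{x} p^{x} q^{n-x}$
 $\binom{n}{x} = \frac{n!}{x!\,(n-x)!}$ is the binomial coefficient
 p is the probability of event 1 and q is the probability of event 2
 n, p, and q are constant; x varies; n and x are discrete; p + q = 1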
92Binomial Distribution
 Only two outcomes are possible on each of n trials.
 The probability of success for each trial is constant (p and q do not change).
 All trials are independent of each other.
93Matlab binopdf function
 Y = binopdf(x,n,p)
 where x is the number of successes (outcome), n is the total number of trials, and p is the probability of success on each trial.
94Matlab binopdf function
 >> x = 0:10;               % x = 0, 1, 2, ..., 10 (number of successes)
 >> y = binopdf(x,10,0.5);  % calculate the pdf
 >> plot(x,y,'+')           % plot y against x using markers ('+' marker assumed)
95Binomial probability density function
96Applications
 Calculate the probability that exactly 2 of a couple's 10 children (mother AA and father AB genotype) have AB blood type.
 n = 10: total number of children
 x = 2: number of children with AB blood
 p = 0.5: probability of having the AB genotype
 q = 0.5: probability of having the AA genotype
97Matlab
 >> p = 0.5
 >> q = 1 - p
 >> n = 10
 >> x = 2
 >> fn = factorial(n)
 >> fx = factorial(x)
 >> fnminusx = factorial(n-x)
 >> binocoef = fn./(fx.*fnminusx)
 >> Pr = binocoef*(p^x)*(q^(n-x))
98Use parentheses to determine the order of calculations
 >> p = 0.5
 >> q = 1 - p
 >> n = 10
 >> x = 2
 >> fn = factorial(n)
 >> fx = factorial(x)
 >> fnminusx = factorial(n-x)
 >> binocoef = fn./fx.*fnminusx   % without parentheses this is evaluated left to right as (fn./fx).*fnminusx
 >> Pr = binocoef*(p^x)*(q^(n-x))
99Try this!
 >> n = 1:100
 >> y = binopdf(n,100,0.5)
 >> plot(n,y,'+')   % '+' marker assumed
100Binomial distribution
101Binomial Cumulative Distribution Function
 Adds the probability value of the previous case to the next.
 >> x = 0:10
 >> n = 10
 >> p = 0.5
 >> y = binocdf(x,n,p)
 >> plot(x,y,'r')
102Cumulative distribution
103Expected value mean value
 The mean or expected value of an outcome (e.g., getting an H from a coin toss) for n trials would be
 $E(H) = np$
 $p = E(H)/n$
 $\sigma^2 = np(1-p)$
104Null hypothesis in statistics
 States equality (or, in some cases, greater than or less than) between an observed and an expected value
 To test a null hypothesis:
 perform a statistical test
 calculate a p value
 reject or do not reject the null hypothesis using a threshold.
105Example
 If a baseball team plays 162 games in a season and has a 50-50 chance of winning any game (p = 0.5 for winning, q = 0.5 for losing), then the probability of that team winning more than 100 games in a season is
 >> 1 - binocdf(100,162,0.5)
 The result is 0.001 (i.e., 1 - 0.999).
 If a team wins more than 100 games in a season, this result suggests that the team's true probability of winning any game is likely greater than 0.5.
106Example
 In a population of Drosophila, the frequency of the AA genotype is p (0.5) and the frequency of the AB genotype is q (0.5).
 If you sample from this population, the number of AA or AB individuals in the sample will be a function of their relative frequencies and the sample size (n).
 If n individuals are selected and x AB individuals are found, is this number greater or less than what could be obtained by chance alone?
 >> binopdf(7,10,0.5)
 ans =
     0.1172
 >> binopdf(70,100,0.5)
 ans =
   2.3171e-005
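The binomial values quoted in the examples above can be cross-checked without the MATLAB statistics functions. The Python sketch below is an assumed equivalent written from the binomial formula; `binom_pmf` and `binom_cdf` are hypothetical helper names, not library functions, and `math.comb` requires Python 3.8 or later.

```python
from math import comb


def binom_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


def binom_cdf(x, n, p):
    """Probability of at most x successes in n independent trials."""
    return sum(binom_pmf(k, n, p) for k in range(x + 1))


print(binom_pmf(7, 10, 0.5))           # Drosophila sample of 10
print(binom_pmf(70, 100, 0.5))         # Drosophila sample of 100
print(1 - binom_cdf(100, 162, 0.5))    # more than 100 wins in 162 games
print(binom_pmf(2, 10, 0.5))           # 2 of 10 children
```

The same proportion (70%) is far less likely by chance in the larger sample, which is why sample size matters when judging a departure from expectation.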
107Normal Distribution
 A standard normal distribution will have a mean
of 0 and variance of 1.
108Normal Probability Distribution
 >> x = -5:0.05:5
 >> y = normpdf(x)
 >> plot(x,y)
109Plot(x,y)
110Normal cumulative distribution
 What is the probability that an observation from a standard normal distribution will fall in the interval [-1 1]?
 >> p = normcdf([-1 1])
 >> p(2) - p(1)
 ans =
     0.6827
111PAM2
112PAM250
113PAM250
114PAM250
115PAM250
116PAM250
117PAM250
118Multiple Sequence Alignment
119Multiple Sequence Alignment
120MegaBLAST
 megaBLAST
 For aligning sequences which differ slightly due to sequencing errors etc.
 Very efficient for long query sequences
 Uses big word (k-tuple) sizes to start search
 Very fast
 Accepts batch submissions of ESTs
 Can upload files of sequences as queries
 For more detailed info, see the megaBLAST pages
121Pvalues
 The probability of finding b HSPs with a score ≥ S is given by
 $\frac{e^{-E} E^{b}}{b!}$
 For b = 0, that chance is
 $e^{-E}$
 Thus the probability of finding at least one such HSP is
 $P = 1 - e^{-E}$
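The conversion from an E-value to a P-value follows directly from the Poisson formula above. The sketch below uses hypothetical function names and shows that for small E the two quantities are nearly identical, while for large E the P-value saturates at 1.

```python
import math


def prob_exactly_b_hsps(b, E):
    """Poisson probability of exactly b HSPs when E are expected."""
    return math.exp(-E) * E ** b / math.factorial(b)


def p_value(E):
    """Probability of at least one HSP: P = 1 - exp(-E)."""
    return -math.expm1(-E)   # numerically stable form of 1 - exp(-E)


for E in (1e-5, 0.01, 1.0, 10.0):
    print(E, p_value(E))
```

`math.expm1` is used so that very small E-values do not lose precision to rounding.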
122Alignment Statistics
123Alignment Statistics