Sequence Alignment II - PowerPoint PPT Presentation

About This Presentation
Title:

Sequence Alignment II

Slides: 124
Provided by: fenBilk

Transcript and Presenter's Notes

Title: Sequence Alignment II


1
Sequence Alignment II
  • K-tuple methods
  • Statistics of alignments

2
Database searches
  • What is the problem?
  • Large number of sequences to search your query
    sequence against.
  • Various indexing schemes and heuristics are used,
    one of which is BLAST.
  • A heuristic is a technique to solve a problem that
    ignores whether the solution can be proven to be
    correct, but usually produces a good solution.
    Heuristics are intended to gain computational
    performance or conceptual simplicity, potentially
    at the cost of accuracy or precision.

http://en.wikipedia.org/wiki/Heuristics#Computer_science
3
K-tuple methods
http://creativecommons.org/licenses/by-sa/2.0/
4
Concepts of Sequence Similarity Searching
  • The premise
  • The sequence itself is not informative; it must
    be analyzed by comparative methods against
    existing databases to develop hypotheses
    concerning relatives and function.

5
Important Terms for Sequence Similarity Searching
with very different meanings
  • Similarity
  • The extent to which nucleotide or protein
    sequences are related. In BLAST similarity refers
    to a positive matrix score.
  • Identity
  • The extent to which two (nucleotide or amino
    acid) sequences are invariant.
  • Homology
  • Similarity attributed to descent from a common
    ancestor.

6
Sequence Similarity Searching: The Approach
  • Sequence similarity searching involves the use of
    a set of algorithms (such as the BLAST programs)
    to compare a query sequence to all the sequences
    in a specified database.
  • Comparisons are made in a pairwise fashion. Each
    comparison is given a score reflecting the degree
    of similarity between the query and the sequence
    being compared.

7
Blast
QUERY sequence(s)
BLAST results
BLAST program
BLAST database
8
Topics
BLAST program
  • There are different blast programs
  • Understanding the BLAST algorithm
  • Word size
  • HSPs (High Scoring Pairs)
  • Understanding BLAST statistics
  • The alignment score (S)
  • Scoring Matrices
  • Dealing with gaps in an alignment
  • The expectation value (E)

9
The BLAST algorithm
  • The BLAST programs (Basic Local Alignment Search
    Tools) are a set of sequence comparison
    algorithms introduced in 1990 for optimal local
    alignments to a query.
  • Altschul SF, Gish W, Miller W, Myers EW, Lipman
    DJ (1990) Basic local alignment search tool. J.
    Mol. Biol. 215:403-410.
  • Altschul SF, Madden TL, Schaeffer AA, Zhang J,
    Zhang Z, Miller W, Lipman DJ (1997) Gapped BLAST
    and PSI-BLAST: a new generation of protein
    database search programs. NAR 25:3389-3402.

10
http://www.ncbi.nlm.nih.gov/BLAST
blastn
11
Other BLAST programs
  • BLAST 2 Sequences (bl2seq)
  • Aligns two sequences of your choice
  • Gives dot-plot like output

12
More BLAST programs
  • BLAST against genomes
  • Many available
  • BLAST parameters pre-optimized
  • Handy for mapping query to genome
  • Search for short exact matches
  • BLAST parameters pre-optimized
  • Great for checking probes and primers

13
How Does BLAST Work?
  • The BLAST programs improved the overall speed of
    searches while retaining good sensitivity
    (important as databases continue to grow) by
    breaking the query and database sequences into
    fragments ("words"), and initially seeking
    matches between fragments.
  • Word hits are then extended in either direction
    in an attempt to generate an alignment with a
    score exceeding the threshold of "T".

14
Picture used with permission from Chapter 11 of
Bioinformatics: A Practical Guide to the
Analysis of Genes and Proteins
15
Each BLAST hit generates an alignment that can
contain one or more high scoring pairs (HSPs)
16
Each BLAST hit generates an alignment that can
contain one or more high scoring pairs (HSPs)
17
Where does the score (S) come from?
  • The quality of each pair-wise alignment is
    represented as a score and the scores are ranked.
  • Scoring matrices are used to calculate the score
    of the alignment base by base (DNA) or amino acid
    by amino acid (protein).
  • The alignment score will be the sum of the scores
    for each position.

18
What's a scoring matrix?
  • Substitution matrices are used for amino acid
    alignments. These are matrices in which each
    possible residue substitution is given a score
    reflecting the probability that it is related to
    the corresponding residue in the query.

19
PAM vs. BLOSUM scoring matrices
  • BLOSUM 62 is the default matrix in BLAST 2.0.
    Though it is tailored for comparisons of
    moderately distant proteins, it performs well in
    detecting closer relationships. A search for
    distant relatives may be more sensitive with a
    different matrix.

20
PAM vs BLOSUM scoring matrices
  • The PAM Family
  • PAM matrices are based on global alignments of
    closely related proteins.
  • The PAM1 is the matrix calculated from
    comparisons of sequences with no more than 1%
    divergence.
  • Other PAM matrices are extrapolated from PAM1.
  • The BLOSUM family
  • BLOSUM matrices are based on local alignments.
  • BLOSUM 62 is a matrix calculated from comparisons
    of sequences sharing no more than 62% identity.
  • All BLOSUM matrices are based on observed
    alignments; they are not extrapolated from
    comparisons of closely related proteins.

21
What happens if you have a gap in the alignment?
  • A gap is a position in the alignment at which a
    letter is paired with a null
  • Gap scores are negative. Since a single
    mutational event may cause the insertion or
    deletion of more than one residue, the presence
    of a gap is frequently ascribed more significance
    than the length of the gap.
  • Hence the gap is penalized heavily, whereas a
    lesser penalty is assigned to each subsequent
    residue in the gap.
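
The gap scoring described above (a heavy penalty for opening a gap, a lighter one for each further residue) can be sketched as follows. The function name and the penalty values -12 and -1 are hypothetical choices for illustration, not BLAST defaults.

```python
def affine_gap_score(length, gap_open=-12, gap_extend=-1):
    """Score of a single gap spanning `length` residues.

    Assumption for this sketch: the first gapped residue costs `gap_open`
    and every subsequent residue in the same gap costs `gap_extend`.
    """
    if length <= 0:
        return 0
    return gap_open + gap_extend * (length - 1)


# One gap of three residues is penalized far less than three separate gaps.
one_long_gap = affine_gap_score(3)
three_short_gaps = 3 * affine_gap_score(1)
print(one_long_gap, three_short_gaps)
```

With these assumed values a single three-residue gap scores -14, while three isolated one-residue gaps score -36, which reflects the idea that the presence of a gap matters more than its length.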

22
Percent Sequence Identity
  • The extent to which two nucleotide or amino acid
    sequences are invariant

[Figure: two aligned nucleotide sequences with a mismatch and an indel marked; the pair is 70% identical]
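
As a rough illustration of percent identity, the sketch below counts identical columns in an already aligned pair. It assumes that "-" marks a gap and that the denominator is the alignment length (the convention visible in BLAST's "Identities" line); other denominators are also in use. The example rows are the gapped alignment shown later in these slides.

```python
def percent_identity(row_a, row_b, gap="-"):
    """Return (identical columns, alignment length, percent identity)."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    # A column counts as identical only if both residues match and neither is a gap.
    identical = sum(1 for x, y in zip(row_a, row_b) if x == y and x != gap)
    return identical, len(row_a), 100.0 * identical / len(row_a)


identical, length, pct = percent_identity("GTAAGGTCCAGT", "GTTAGGTC-AGT")
print(f"{identical}/{length} identical ({pct:.1f}%)")
```
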
23
BLAST algorithm
  • Keyword search of all words of length w from the
    query (length n) in a database of length m,
    keeping hits with score above threshold
  • w = 11 for nucleotide queries, 3 for proteins
  • Do local alignment extension for each hit of
    keyword search
  • Extend result until longest match above threshold
    is achieved and output

24
BLAST algorithm (cont'd)
Query: KRHRKVLRDNIQGITKPAIRRLARRGGVKRISGLIYEETRGVLKIFLENVIRD
keyword: GVK
Neighborhood words: GVK 18, GAK 16, GIK 16, GGK 14, GLK 13, GNK 12, GRK 11, GEK 11, GDK 11
neighborhood score threshold (T = 13)
extension
Query  22 VLRDNIQGITKPAIRRLARRGGVKRISGLIYEETRGVLK 60
          DN G IR L GK I L E RGK
Sbjct 226 IIKDNGRGFSGKQIRNLNYGIGLKVIADLV-EKHRGIIK 263
High-scoring Pair (HSP)
25
Local alignment
  • Find the best local alignment between two
    strings, over the recurrence

26
Local alignment (cont'd)
  • Input: strings v and w and scoring matrix d
  • Output: substrings of v and w whose global
    alignment, as defined by d, is maximal among all
    global alignments of all substrings of v and w
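
The local alignment problem stated above is classically solved by dynamic programming (the Smith-Waterman recurrence). The sketch below is a minimal version under simplifying assumptions that are not taken from the slides: match = +1, mismatch = -1, and a linear gap penalty of -2 stand in for the scoring matrix d.

```python
def smith_waterman(v, w, match=1, mismatch=-1, gap=-2):
    """Return (best score, aligned part of v, aligned part of w)."""
    n, m = len(v), len(w)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = match if v[i - 1] == w[j - 1] else mismatch
            # The 0 option lets an alignment start anywhere (local, not global).
            H[i][j] = max(0, H[i - 1][j - 1] + d, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until a zero cell is reached.
    i, j = best_pos
    a, b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        d = match if v[i - 1] == w[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + d:
            a.append(v[i - 1]); b.append(w[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            a.append(v[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(w[j - 1]); j -= 1
    return best, "".join(reversed(a)), "".join(reversed(b))


score, sub_v, sub_w = smith_waterman("TTTACGTTT", "GGGACGGGG")
print(score, sub_v, sub_w)
```

This exhaustive search costs time proportional to the product of the two lengths, which is why database tools such as BLAST restrict it to regions around word hits.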

27
Original BLAST
  • Dictionary
  • All words of length w
  • Alignment
  • Ungapped extensions until score falls below
    statistical threshold T
  • Output
  • All local alignments with score > statistical
    threshold

28
Original BLAST Example
[Figure: dot matrix of A C G A A G T A A G G T C C A G T against
C T G A T C C T G G A T T G C G A]
  • w = 4, T = 4
  • Exact keyword match of GGTC
  • Extend diagonals with mismatches until score
    falls below a threshold
  • Output result
  • GTAAGGTCC
  • GTTAGGTCC

From lectures by Serafim Batzoglou (Stanford)
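
The seed-and-extend idea of the original (ungapped) BLAST can be sketched as below. Several details are assumptions made for illustration: scoring is +1 per match and -1 per mismatch, extension stops once the running score falls more than `x_drop` below the best seen, and the database string is the second axis of the figure read in reverse.

```python
def extend(query, db, qi, di, step, x_drop):
    """Extend along one diagonal; return (best score gain, length kept)."""
    best = cur = 0
    best_len = length = 0
    while 0 <= qi < len(query) and 0 <= di < len(db):
        cur += 1 if query[qi] == db[di] else -1
        length += 1
        if cur > best:
            best, best_len = cur, length
        elif best - cur > x_drop:
            break  # score dropped too far below the best: stop extending
        qi += step
        di += step
    return best, best_len


def best_ungapped_hsp(query, db, w=4, x_drop=2):
    """Find exact word hits of length w, extend each, return the best HSP."""
    index = {}
    for i in range(len(query) - w + 1):
        index.setdefault(query[i:i + w], []).append(i)
    best = (0, "", "")
    for d in range(len(db) - w + 1):
        for q in index.get(db[d:d + w], []):
            left, left_len = extend(query, db, q - 1, d - 1, -1, x_drop)
            right, right_len = extend(query, db, q + w, d + w, +1, x_drop)
            score = w + left + right
            if score > best[0]:
                qs, qe = q - left_len, q + w + right_len
                ds = d - left_len
                best = (score, query[qs:qe], db[ds:ds + (qe - qs)])
    return best


hsp = best_ungapped_hsp("ACGAAGTAAGGTCCAGT", "AGCGTTAGGTCCTAGTC")
print(hsp)
```

Under these assumptions the word hits around GGTC all extend to the same segment pair, GTAAGGTCC against GTTAGGTCC, the result shown on the slide.
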
29
Gapped BLAST Example
[Figure: dot matrix of A C G A A G T A A G G T C C A G T against
C T G A T C C T G G A T T G C G A]
  • Original BLAST exact keyword search, THEN
  • Extend with gaps in a zone around ends of exact
    match
  • Output result
  • GTAAGGTCCAGT
  • GTTAGGTC-AGT

From lectures by Serafim Batzoglou (Stanford)
30
Gapped BLAST Example (cont'd)
[Figure: dot matrix of A C G A A G T A A G G T C C A G T against
C T G A T C C T G G A T T G C G A]
  • Original BLAST exact keyword search, THEN
  • Extend with gaps around ends of exact match until
    score < T, then merge nearby alignments
  • Output result
  • GTAAGGTCCAGT
  • GTTAGGTC-AGT

From lectures by Serafim Batzoglou (Stanford)
31
Topics
BLAST databases
  • The different blast databases provided by the
    NCBI
  • Protein databases
  • Nucleotide databases
  • Genomic databases
  • Considerations for choosing a BLAST database
  • Custom databases for BLAST

32
BLAST protein databases available through the
blastp web interface @ NCBI
33
Considerations for choosing a BLAST database
  • First consider your research question
  • Are you looking for an ortholog in a particular
    species?
  • BLAST against the genome of that species.
  • Are you looking for additional members of a
    protein family across all species?
  • BLAST against nr; if you can't find hits, check
    wgs, htgs, and the trace archives.
  • Are you looking to annotate genes in your species
    of interest?
  • BLAST against known genes (RefSeq) and/or ESTs
    from a closely related species.

34
When choosing a database for BLAST
  • It is important to know your reagents.
  • Changing your choice of database is changing your
    search space completely
  • Database size affects the BLAST statistics
  • Record BLAST parameters, database choice, and
    database size in your bioinformatics lab book,
    just as you would for your wet-bench experiments.
  • Databases change rapidly and are updated
    frequently
  • It may be necessary to repeat your analyses

35
Topics
BLAST results
  • Choosing the right BLAST program
  • Running a blastp search
  • BLAST parameters and options to consider
  • Viewing BLAST results
  • Look at your alignments
  • Using the BLAST taxonomy report

36
BLAST parameters and options to consider
conserved domains
Entrez query
E-value cutoff
Word size
37
More BLAST parameters and options to consider
filtering
gap penalties
matrix
38
Run your BLAST search
BLAST
39
The BLAST Queue
click for more info
Note your RID
40
Formatting and Retrieving your BLAST results
Results
options
41
A graphical view of your BLAST results
42
The BLAST hit list
Score
E-Value
GenBank
alignment
EntrezGene
43
The BLAST pairwise alignments
Identity
Similarity
44
Sample BLAST output
  • Blast of human beta globin protein against zebra
    fish
  • Sequences producing significant alignments, with
    Score (bits) and E Value
  • gi|18858329|ref|NP_571095.1| ba1 globin Danio
    rerio >gi|147757... 171 3e-44
  • gi|18858331|ref|NP_571096.1| ba2 globin
    SIdZ118J2.3 Danio rer... 170 7e-44
  • gi|37606100|emb|CAE48992.1| SIbY187G17.6 (novel
    beta globin) D... 170 7e-44
  • gi|31419195|gb|AAH53176.1| Ba1 protein Danio
    rerio 168 3e-43
  • ALIGNMENTS
  • >gi|18858329|ref|NP_571095.1| ba1 globin Danio
    rerio
  • Length = 148
  • Score = 171 bits (434), Expect = 3e-44
  • Identities = 76/148 (51%), Positives = 106/148
    (71%), Gaps = 1/148 (0%)
  • Query 1 MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPK 60
  • MV T EA LWGKNDEG AL R LVYPWTQRF FGLSP AMGNPK
  • Sbjct 1 MVEWTDAERTAILGLWGKLNIDEIGPQALSRCLIVYPWTQRYFATFGNLSSPAAIMGNPK 60

45
Sample BLAST output (cont'd)
  • Blast of human beta globin DNA against human DNA
  • Sequences producing significant alignments, with
    Score (bits) and E Value
  • gi|19849266|gb|AF487523.1| Homo sapiens gamma A
    hemoglobin (HBG1... 289 1e-75
  • gi|183868|gb|M11427.1|HUMHBG3E Human gamma-globin
    mRNA, 3' end 289 1e-75
  • gi|44887617|gb|AY534688.1| Homo sapiens A-gamma
    globin (HBG1) ge... 280 1e-72
  • gi|31726|emb|V00512.1|HSGGL1 Human messenger RNA
    for gamma-globin 260 1e-66
  • gi|38683401|ref|NR_001589.1| Homo sapiens
    hemoglobin, beta pseud... 151 7e-34
  • gi|18462073|gb|AF339400.1| Homo sapiens haplotype
    PB26 beta-glob... 149 3e-33
  • ALIGNMENTS
  • >gi|28380636|ref|NG_000007.3| Homo sapiens beta
    globin region (HBB@) on chromosome 11
  • Length = 81706
  • Score = 149 bits (75), Expect = 3e-33
  • Identities = 183/219 (83%)
  • Strand = Plus / Plus

  • Query 267 ttgggagatgccacaaagcacctggatgatctcaagggcacctttgcccagctgagtgaa 326


46
What do the Score and the e-value really mean?
  • The quality of the alignment is represented by
    the Score.
  • Score (S)
  • The score of an alignment is calculated as the
    sum of substitution and gap scores. Substitution
    scores are given by a look-up table (PAM, BLOSUM)
    whereas gap scores are assigned empirically.
  • The significance of each alignment is computed as
    an E value.
  • E value (E)
  • Expectation value. The number of different
    alignments with scores equivalent to or better
    than S that are expected to occur in a database
    search by chance. The lower the E value, the more
    significant the score.

47
E value
  • E value (E)
  • Expectation value. The number of different
    alignments with scores equivalent to or better
    than S expected to occur in a database search by
    chance. The lower the E value, the more
    significant the score.

48
Assessing sequence homology
  • Need to know how strong an alignment can be
    expected from chance alone
  • Chance is the comparison of
  • Real but non-homologous sequences
  • Real sequences that are shuffled to preserve
    compositional properties
  • Sequences that are generated randomly based upon
    a DNA or protein sequence model (favored)

49
High Scoring Pairs (HSPs)
  • All segment pairs whose scores can not be
    improved by extension or trimming
  • Need to model a random sequence to analyze how
    high the score is in relation to chance

50
Expected number of HSPs
  • Expected number of HSPs with score > S
  • E-value E for the score S
  • $E = K m n e^{-\lambda S}$
  • Given
  • Two sequences, length n and m
  • The statistics of HSP scores are characterized by
    two parameters, K and λ
  • K: scale for the search space size
  • λ: scale for the scoring system
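
A small numeric sketch of the E-value formula, E = K·m·n·exp(-λS), may help show how it behaves. The values K = 0.041 and λ = 0.267 are illustrative assumptions (parameters of this order are reported for gapped protein searches); the real values depend on the scoring matrix and gap penalties, and the lengths below are made up.

```python
import math


def e_value(raw_score, m, n, K, lam):
    """Expected number of chance HSPs with score at least raw_score."""
    return K * m * n * math.exp(-lam * raw_score)


K, LAM = 0.041, 0.267          # assumed, illustrative parameters
m, n = 148, 1_000_000          # hypothetical query and database lengths

e_small_db = e_value(60, m, n, K, LAM)
e_large_db = e_value(60, m, 2 * n, K, LAM)            # doubling the search space
e_higher_s = e_value(60 + math.log(2) / LAM, m, n, K, LAM)

print(e_small_db, e_large_db, e_higher_s)
```

Doubling the database length doubles E for the same score, while raising the score by ln(2)/λ halves it: E grows linearly with the search space and falls exponentially with S.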

51
BLAST statistics to record in your bioinformatics
labbook
Record the statistics that are found at bottom of
your BLAST results page
52
Scoring matrices
  • Amino acid substitution matrices
  • PAM
  • BLOSUM

53
Bit Scores
  • Normalized score to be able to compare sequences
  • Bit score
  • $S' = (\lambda S - \ln K) / \ln 2$
  • E-value of bit score
  • $E = m n 2^{-S'}$
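
The bit-score normalization can be checked numerically: converting a raw score S to S' = (λS - ln K)/ln 2 and then applying E = m·n·2^(-S') should give the same E as K·m·n·exp(-λS). In the sketch below, K, λ, and the lengths are again illustrative assumptions rather than values from the slides.

```python
import math


def bit_score(raw_score, K, lam):
    """Normalized score S' in bits."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def e_from_bits(bits, m, n):
    """E-value computed from a bit score; K and lambda are already absorbed."""
    return m * n * 2.0 ** (-bits)


K, LAM = 0.041, 0.267        # assumed parameters for illustration
m, n, raw = 148, 1_000_000, 100

bits = bit_score(raw, K, LAM)
e_bits = e_from_bits(bits, m, n)
e_raw = K * m * n * math.exp(-LAM * raw)
print(round(bits, 1), e_bits, e_raw)
```

Because K and λ are folded into S', bit scores from searches that use different scoring systems can be compared directly.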

54
Assessing the significance of an alignment
  • How do we assess the significance of an alignment
    found when comparing a protein of length m
    to a database containing many different proteins
    of varying lengths?
  • Calculate a "database search" E-value. Multiply
    the pairwise-comparison E-value by N/n, where N
    is the total length of the database (in residues)
    and n is the length of the database sequence

55
Homology: Some Guidelines
  • Similarity can be indicative of homology
  • Generally, if two sequences are significantly
    similar over entire length they are likely
    homologous
  • Low complexity regions can be highly similar
    without being homologous
  • Homologous sequences are not always highly similar

56
Homology: Some Guidelines
  • Suggested BLAST Cutoffs
  • (source: Chapter 11, Bioinformatics: A Practical
    Guide to the Analysis of Genes and Proteins)
  • For nucleotide based searches, one should look
    for hits with E-values of 10^-6 or less and
    sequence identity of 70% or more
  • For protein based searches, one should look for
    hits with E-values of 10^-3 or less and sequence
    identity of 25% or more

57
Contributors
  • Special thanks to David Wishart, Andy Baxevanis,
    Stephanie Minnema, Sohrab Shah, and Francis
    Ouellette for their contributions to these
    materials

http://creativecommons.org/licenses/by-sa/2.0/
58
FASTA
  • A FASTA search begins by breaking the search
    sequence into words.
  • For genomic sequences, a word size of 4 or 6
    nucleotides is used; for polypeptide sequences,
    a word size of 1 or 2.

59
FASTA
  • Next a table is constructed for the query
    sequence (word size is 1)
  • E.g. FAMLGFIKYLPGCM

A  C  D  E  F  G  H  I  K  L  M  N  P  Q  R  S  T  V  W  Y
2

60
FASTA
  • Next a table is constructed for the query
    sequence
  • E.g. FAMLGFIKYLPGCM

A  C  D  E  F  G  H  I  K  L  M  N  P  Q  R  S  T  V  W  Y
2  13

61
FASTA
  • Next a table is constructed for the query
    sequence
  • E.g. FAMLGFIKYLPGCM

A  C  D  E  F  G  H  I  K  L  M  N  P  Q  R  S  T  V  W  Y
2  13       1
            6
62
FASTA
  • Next a table is constructed for the query
    sequence
  • E.g. FAMLGFIKYLPGCM

A  C  D  E  F  G  H  I  K  L  M  N  P  Q  R  S  T  V  W  Y
2  13       1  5
            6  12
63
FASTA
  • Next a table is constructed for the query
    sequence
  • E.g. FAMLGFIKYLPGCM

A  C  D  E  F  G  H  I  K  L  M  N  P  Q  R  S  T  V  W  Y
2  13       1  5     7
            6  12
64
FASTA
  • The table for the query sequence is complete
  • E.g. FAMLGFIKYLPGCM

A  C  D  E  F  G  H  I  K  L  M  N  P  Q  R  S  T  V  W  Y
2  13       1  5     7  8  4  3     11                   9
            6  12          10 14
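
The query lookup table built over the preceding slides can be sketched with a dictionary that maps each word to its 1-based positions in the query. The function name `build_lookup` is a hypothetical helper, and word size 1 is used to match the slides.

```python
from collections import defaultdict


def build_lookup(query, k=1):
    """Map every word of length k to the list of 1-based positions where it starts."""
    table = defaultdict(list)
    for i in range(len(query) - k + 1):
        table[query[i:i + k]].append(i + 1)
    return dict(table)


lookup = build_lookup("FAMLGFIKYLPGCM")
for residue in "ACDEFGHIKLMNPQRSTVWY":
    # Residues absent from the query simply have no entry.
    print(residue, lookup.get(residue, []))
```
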
65
FASTA
  • Compare the query sequence table with the target
    sequence
  • Query: FAMLGFIKYLPGCM
  • Indices of G are 5 and 12
  • Target: TGFIKYLPGACT
  • Indices of G are 2 and 9
  • Subtract 2 from 5 and 12, producing 3 and 10
  • Subtract 9 from 5 and 12, producing -4 and 3

1   2   3   4   5   6   7   8   9   10  11  12
T   G   F   I   K   Y   L   P   G   A   C   T
    3                           -4
    10                          3
66
FASTA
  • Compare the query sequence table with the target
    sequence
  • Query: FAMLGFIKYLPGCM
  • Indices of F are 1 and 6
  • Target: TGFIKYLPGACT
  • Index of F is 3
  • Subtract 3 from 1 and 6, producing -2 and 3

1   2   3   4   5   6   7   8   9   10  11  12
T   G   F   I   K   Y   L   P   G   A   C   T
    3   -2                      -4
    10  3                       3
67
FASTA
  • Compare the query sequence table with the target
    sequence
  • Query: FAMLGFIKYLPGCM
  • Indices of F are 1 and 6
  • Target: TGFIKYLPGACT
  • Index of F is 3
  • Subtract 3 from 1 and 6, producing -2 and 3
  • Repeat for the remaining target residues to
    complete the table

1   2   3   4   5   6   7   8   9   10  11  12
T   G   F   I   K   Y   L   P   G   A   C   T
    3   -2  3   3   3   -3  3   -4  -8  2
    10  3               3       3
68
FASTA
  • FAMLGFIKYLPGCM
  • TGFIKYLPGACT
  • Offset by 3

1   2   3   4   5   6   7   8   9   10  11  12
T   G   F   I   K   Y   L   P   G   A   C   T
    3   -2  3   3   3   -3  3   -4  -8  2
    10  3               3       3
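
The offset calculation from the last few slides can be sketched by tallying (query position minus target position) for every matching word. The most frequent offset marks the diagonal on which the two sequences line up. `offset_counts` is a hypothetical helper name; positions are 1-based, as on the slides.

```python
from collections import Counter


def offset_counts(query, target, k=1):
    """Count how often each offset (query position - target position) occurs."""
    positions = {}
    for i in range(len(query) - k + 1):
        positions.setdefault(query[i:i + k], []).append(i + 1)
    counts = Counter()
    for j in range(len(target) - k + 1):
        for i in positions.get(target[j:j + k], []):
            counts[i - (j + 1)] += 1
    return counts


counts = offset_counts("FAMLGFIKYLPGCM", "TGFIKYLPGACT")
best_offset, hits = counts.most_common(1)[0]
print(best_offset, hits)
```

Here offset 3 collects eight hits, corresponding to the shared segment GFIKYLPG (query positions 5 to 12, target positions 2 to 9).
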
69
Fasta (word size 2)
70
Database searches
71
Odds score in sequence alignment
  • The chance of an aligned amino acid pair being
    found in alignments of related sequences compared
    to the chance of that pair being found in random
    alignments of unrelated sequences.

72
Statistical significance of an alignment
  • The probability that random or unrelated
    sequences could be aligned to produce the same
    score.
  • The smaller the probability, the better.

73
Probability
  • What is the probability that a coin toss will
    yield a head?
  • What is the probability that the next pair of
    nucleotides will be a match or mismatch?

74
Bernoulli trials
  • A series of n independent trials with
    the same outcome probabilities and number of
    choices (e.g., head or tail, or match (m) or
    mismatch (mi)).
  • P(hhhhh)
  • P(mmmmm)

75
Head or Tail: Longest run of heads or tails
  • Longest run of heads one would get in a random
    series of coin tosses?
  • Fair coin: p = 0.5, 1/p = 1/0.5 = 2
  • Erdős and Rényi: longest run = $\log_{1/p}(n)$
  • If n = 100, longest run = $\log_2 100 \approx 6.64$
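
The log_{1/p}(n) prediction can be compared with a quick simulation. The sketch below tosses a fair coin 100 times, records the longest run of heads, and averages over many repetitions; the seed and the number of repetitions are arbitrary choices. The logarithm gives the order of magnitude rather than an exact mean, so the simulated average should land near it, not on it.

```python
import math
import random


def longest_run(tosses, face="H"):
    """Length of the longest consecutive run of `face`."""
    best = cur = 0
    for t in tosses:
        cur = cur + 1 if t == face else 0
        best = max(best, cur)
    return best


def mean_longest_run(n, p=0.5, trials=2000, seed=0):
    """Average longest run of heads over repeated series of n tosses."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        tosses = ["H" if rng.random() < p else "T" for _ in range(n)]
        total += longest_run(tosses)
    return total / trials


n, p = 100, 0.5
predicted = math.log(n) / math.log(1 / p)   # log base 1/p of n
simulated = mean_longest_run(n, p)
print(round(predicted, 2), round(simulated, 2))
```
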

76
Alignment analogy
  • You have two sequences a and b of equal length
  • $a_1 a_2 a_3 a_4 \ldots$
  • $b_1 b_2 b_3 b_4 \ldots$
  • If $a_n = b_n$, then it is a head (match)
  • If $a_n \ne b_n$, then it is a tail
    (mismatch)

77
Alignment Statistics
  • For two sequences of length n and m, n times m
    comparisons are being made; thus the predicted
    length of the longest match would be
    $\log_{1/p}(mn)$.

78
Alignment Statistics
  • Expectation value or the mean longest match would
    be
  • $E(M) = \log_{1/p}(Kmn)$, where K is a constant that
    depends on amino acid or base composition and p
    is the probability of a match.
  • This is only true for ungapped local alignments.

79
Distribution of alignment scores
  • resembles Gumbel extreme value distribution.

80
Extreme Value Distribution
81
Extreme Value Distribution
  • In this distribution, the probability of a score
    being higher than x is given by
  • $P(S \ge x) = 1 - \exp(-K m n e^{-\lambda x})$
  • m and n are the lengths of the sequences compared
  • K and λ can be calculated from the data in the
    matrix used and from the relative frequencies of
    the amino acids (or nucleotides)

82
Alignment Statistics
  • For two sequences of length n and m, n times m
    comparisons are being made; thus the predicted
    length of the longest match would be
    $\log_{1/p}(mn)$.
  • For a pair of random DNA sequences of length 100
    and p = 0.25 (equal A, T, C, G), the longest
    expected run of matches would be
  • $2 \log_{1/p}(n) = 2 \log_4 100 \approx 6.64$
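
The same comparison can be made for two random DNA sequences: the sketch below measures the longest exact run of matches along any diagonal (the longest common substring) for pairs of random sequences of length 100 and sets the average against log_{1/p}(mn). The seed and the number of trials are arbitrary, and the logarithm is only the leading term of the expectation.

```python
import math
import random


def longest_common_run(a, b):
    """Longest run of consecutive matches along any diagonal of a versus b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0] * (len(b) + 1)
        for j, y in enumerate(b, 1):
            if x == y:
                cur[j] = prev[j - 1] + 1   # extend the run on this diagonal
                if cur[j] > best:
                    best = cur[j]
        prev = cur
    return best


def mean_longest_match(n, m, trials=100, seed=1):
    """Average longest match between random DNA sequences of lengths n and m."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        a = "".join(rng.choice("ACGT") for _ in range(n))
        b = "".join(rng.choice("ACGT") for _ in range(m))
        total += longest_common_run(a, b)
    return total / trials


predicted = math.log(100 * 100) / math.log(4)   # log base 1/p of (m n), p = 0.25
simulated = mean_longest_match(100, 100)
print(round(predicted, 2), round(simulated, 2))
```
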

83
Alignment Statistics
  • $E(M) = \log_{1/p}(Kmn)$ means that match length
    grows as the log of the product of the sequence
    lengths. Amino acid substitution matrices
    turn match lengths into alignment scores (S).
  • More commonly, $\lambda = \ln(1/p)$ is used.
  • The number of HSPs with score at least S is then
    estimated as
  • $E = K m n e^{-\lambda S}$
  • How good a score is gets evaluated based on
    how many HSPs (i.e., the E value) one would
    expect for that score.

84
Alignment Statistics
  • Two ways to get K and λ
  • For 10000 random amino acid sequences with
    various gap penalties, K and λ parameters
    have been tabulated.
  • Calculation of the distribution for two sequences
    being aligned by keeping one of them fixed and
    scrambling the other, thus preserving both the
    sequence length and amino acid composition.

85
Generate random sequences
  • You may use the function randperm
  • >> help randperm
  • RANDPERM(n) is a random permutation of the
    integers from 1 to n.
  • For example, RANDPERM(6) might be
  • [2 4 5 6 1 3].

86
Align a sequence with its randomly permuted state
  • >> x = 'atagacagacca'
  • >> l = length(x)
  • l =
  • 12
  • >> ind = randperm(12)
  • ind =
  • Columns 1 through 9
  • 9 4 5 7 3 11 2 8 6
  • Columns 10 through 12
  • 10 1 12
  • >> y = x(ind)
  • y =
  • agaaactgccaa
  • >> align1
  • atagacagacca
  • agaaactgccaa
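
For readers working outside MATLAB, the same scrambling step can be sketched in Python. Shuffling the letters keeps both the sequence length and the base composition, which is what makes the shuffled sequence a useful chance model. The function name and the seed are illustrative choices.

```python
import random


def shuffle_sequence(seq, seed=None):
    """Return a random permutation of seq (same length, same composition)."""
    rng = random.Random(seed)
    letters = list(seq)
    rng.shuffle(letters)
    return "".join(letters)


x = "atagacagacca"
y = shuffle_sequence(x, seed=42)
print(x)
print(y)
```

Aligning x against many such shuffled copies and recording the scores gives an empirical score distribution from which K and λ can be estimated.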

87
Alignment Statistics
88
Alignment Statistics
89
Alignment Statistics
90
Alignment Statistics
91
Probability Distributions Binomial Distribution
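  • The probability of observing an event x times in
    n trials is given by the binomial distribution
  • $P(x) = \binom{n}{x} p^{x} q^{n-x}$
  • $\binom{n}{x}$ is the binomial coefficient, p is
    the probability of the event, and q is the
    probability of the alternative event

n, p, and q are constant; x varies; n and x are
discrete; p + q = 1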
Binomial Distribution
  • Only two outcomes are possible on each of n
    trials.
  • The probability of success for each trial is
    constant (p, and q does not change).
  • All trials are independent of each other.

93
Matlab binopdf function
  • Y = binopdf(x,n,p)
  • Where x equals the number of successes (outcome),
    n is the total possible number of trials, p is
    the probability of one type of outcome.

94
Matlab binopdf function
  • >> x = 0:10 % x = 0, 1, 2, ..., 10 (number of
    successes)
  • >> y = binopdf(x,10,0.5) % calculate pdf
  • >> plot(x,y,'+') % plot y against x using + signs

95
Binomial probability density function
96
Applications
  • Calculate the probability that 2 of a couple's 10
    children have the AB genotype (mother AA, father
    AB genotype).
  • n = 10: total number of children
  • x = 2: number of children with the AB genotype
  • p = 0.5: probability of having AB genotype
  • q = 0.5: probability of having AA genotype

97
Matlab
  • >> p = 0.5
  • >> q = 1-p
  • >> n = 10
  • >> x = 2
  • >> fn = factorial(n)
  • >> fx = factorial(x)
  • >> fnminusx = factorial(n-x)
  • >> binocoef = fn./(fx.*fnminusx)
  • >> Pr = binocoef*p^x*q^(n-x)
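
The same calculation can be cross-checked in Python with the standard library. The sketch below evaluates the binomial probability for x = 2 children out of n = 10 with p = 0.5; `binom_pmf` is a hypothetical helper name.

```python
from math import comb


def binom_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    q = 1 - p
    return comb(n, x) * p ** x * q ** (n - x)


pr = binom_pmf(2, 10, 0.5)
print(comb(10, 2), pr)
```

The binomial coefficient is 45 and 0.5^10 = 1/1024, so the probability is 45/1024, roughly 0.044.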

98
Use parentheses in order to determine order in
calculations
  • >> p = 0.5
  • >> q = 1-p
  • >> n = 10
  • >> x = 2
  • >> fn = factorial(n)
  • >> fx = factorial(x)
  • >> fnminusx = factorial(n-x)
  • >> binocoef = fn./fx.*fnminusx % without
    parentheses this is (fn./fx).*fnminusx, not the
    binomial coefficient
  • >> Pr = binocoef*p^x*q^(n-x)

99
Try this!
  • >> n = 1:100
  • >> y = binopdf(n,100,0.5)
  • >> plot(n,y,'+')

100
Binomial distribution
101
Binomial Cumulative Distribution Function
  • Adds the probability value of the previous case
    to the next.
  • >> x = 0:10
  • >> n = 10
  • >> p = 0.5
  • >> y = binocdf(x,n,p)
  • >> plot(x,y,'r')

102
Cumulative distribution
103
Expected value = mean value
  • The mean or expected value of an outcome (e.g.,
    getting an H from a coin toss) for n trials would
    be
  • $E(H) = np$
  • $p = E(H)/n$
  • $\sigma^2 = np(1-p)$

104
Null hypothesis in statistics
  • States equality (or in cases greater than or less
    than) between observed and an expected value
  • To test a null hypothesis
  • perform a statistical test
  • calculate a p value
  • reject or do not reject the null hypothesis
    using a threshold.

105
Example
  • If a baseball team plays 162 games in a season
    and has a 50-50 chance of winning any game (p =
    P(winning) = 0.5, q = P(losing) = 0.5), then the
    probability of that team winning more than 100
    games in a season is
  • >> 1 - binocdf(100,162,0.5)
  • The result is 0.001 (i.e., 1-0.999).
  • If a team wins 100 or more games in a season,
    this result suggests that it is likely that the
    team's true probability of winning any game is
    greater than 0.5.
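
The tail probability in this example can be reproduced without MATLAB by summing the binomial probabilities for 101 through 162 wins. `binom_upper_tail` is a hypothetical helper; summing the upper tail directly, rather than computing 1 minus the cumulative probability, is a design choice that avoids subtracting two nearly equal numbers.

```python
from math import comb


def binom_upper_tail(k, n, p):
    """P(X > k) for X ~ Binomial(n, p), summed term by term."""
    q = 1 - p
    return sum(comb(n, i) * p ** i * q ** (n - i) for i in range(k + 1, n + 1))


p_more_than_100 = binom_upper_tail(100, 162, 0.5)
print(f"{p_more_than_100:.4f}")
```
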

106
Example
  • In a population of Drosophila, the frequency of
    AA genotype is p (0.5) and the frequency of AB
    genotype is q (0.5).
  • If you sample from this population the number of
    AA or AB individuals in the sampled population
    will be a function of their relative frequencies
    and the sample size (n).
  • If n individuals are selected and x number of AB
    individuals are found, is this number greater or
    less than what could be obtained by chance alone?
  • >> binopdf(7,10,0.5)
  • ans =
  • 0.1172
  • >> binopdf(70,100,0.5)
  • ans =
  • 2.3171e-005

107
Normal Distribution
  • A standard normal distribution will have a mean
    of 0 and variance of 1.

108
Normal Probability Distribution
  • >> x = -5:0.05:5
  • >> y = normpdf(x)
  • >> plot(x,y)

109
Plot(x,y)
110
Normal cumulative distribution
  • What is the probability that an observation from
    a standard normal distribution will fall on the
    interval [-1 1]?
  • >> p = normcdf([-1 1])
  • >> p(2) - p(1)
  • ans =
  • 0.6827
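
The same interval probability can be obtained from the error function in Python's standard library. `normal_cdf` below is a hypothetical helper that mirrors what `normcdf` computes for a normal distribution with mean mu and standard deviation sigma.

```python
import math


def normal_cdf(x, mu=0.0, sigma=1.0):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


prob = normal_cdf(1.0) - normal_cdf(-1.0)
print(round(prob, 4))
```
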

111
PAM-2
112
PAM-250
113
PAM-250
114
PAM-250
115
PAM-250
116
PAM-250
117
PAM-250
118
Multiple Sequence Alignment
119
Multiple Sequence Alignment
120
MegaBLAST
  • megaBLAST
  • For aligning sequences which differ slightly due
    to sequencing errors etc.
  • Very efficient for long query sequences
  • Uses big word (k-tuple) sizes to start search
  • Very fast
  • Accepts batch submissions of ESTs
  • Can upload files of sequences as queries
  • More detailed info see megaBLAST pages

121
P-values
  • The probability of finding b HSPs with a score
    > S is given by
  • $e^{-E} E^{b} / b!$
  • For b = 0, that chance is
  • $e^{-E}$
  • Thus the probability of finding at least one such
    HSP is
  • $P = 1 - e^{-E}$
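
The relation between E-values and P-values can be explored numerically. The sketch below implements the Poisson formula for exactly b HSPs and the resulting P = 1 - exp(-E); the function names are hypothetical, and `math.expm1` is used only to keep precision when E is very small.

```python
import math


def prob_exactly(b, e):
    """Poisson probability of finding exactly b HSPs when e are expected."""
    return math.exp(-e) * e ** b / math.factorial(b)


def p_value(e):
    """Probability of finding at least one HSP: 1 - exp(-e)."""
    return -math.expm1(-e)


for e in (0.001, 0.1, 1.0, 3.0):
    print(e, p_value(e))
```

For small E the two quantities are nearly equal (P ≈ E), which is why BLAST can report E alone; for larger E the P-value saturates toward 1.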

122
Alignment Statistics
123
Alignment Statistics