CS 5263 Bioinformatics - PowerPoint PPT Presentation

1 / 65
About This Presentation
Title:

CS 5263 Bioinformatics

Description:

CS 5263 Bioinformatics. Lecture 7: Heuristic Sequence ... Neurospora 4 107 Fugu fish 3.3. 108 Tetraodon 3 108 Mosquito 2.8. 108 Drosophila 1.2 108 Worm 1.0 ... – PowerPoint PPT presentation

Number of Views:43
Avg rating:3.0/5.0
Slides: 66
Provided by: jianhu
Learn more at: http://www.cs.utsa.edu
Category:

less

Transcript and Presenter's Notes

Title: CS 5263 Bioinformatics


1
CS 5263 Bioinformatics
  • Lecture 7 Heuristic Sequence Alignment
    Algorithms (BLAST)

2
Roadmap
  • Last lecture review
  • Sequence alignment statistics
  • Today
  • Gene finding by alignment
  • Heuristic alignment algorithms
  • Basic Local Alignment Search Tools

3
Sequence Alignment Statistics
  • Substitution matrices
  • How is the BLOSUM matrix made
  • How to make your own substitution matrix
  • Whats the meaning of an arbitrary substitution
    matrix
  • Significance of sequence alignments
  • P-value estimation for
  • Global alignment scores
  • Local alignment scores

4
What is a p-value?
  • The probability of observing an effect as strong
    or stronger than you observed, given the null
    hypothesis. i.e., How likely is this effect to
    occur by chance?
  • Pr(x gt S null)

5
What is a null-hypothesis?
  • A statisticians way of characterizing chance.
  • Generally, a mathematical model of randomness
    with respect to a particular set of observations.
  • The purpose of most statistical tests is to
    determine whether the observed data can be
    explained by the null hypothesis.

6
For sequence alignment
  • Your null hypothesis is the two sequences are
    unrelated
  • Your alternative hypothesis is the two sequences
    are related
  • You obtained a score S
  • how likely that you can obtain such a score if
    the null hypothesis is true?

7
How to test that null-hypothesis?
  • Randomly generate some pairs of unrelated
    sequences
  • See what alignment scores you may get for those
    unrelated sequences
  • Must keep other factors in mind
  • Your random sequences must be as close as
    possible to your true sequences
  • Except that they are unrelated (i.e., not from
    a common ancestor)

8
Possible ways to get unrelated sequences
  • Which is better?
  • Randomly pick some sequences from a database
  • Randomly pick some sequences from a database and
    truncate to the same length as your real
    sequences
  • Generate random sequences according to the
    frequency that each letter is used by your real
    sequences
  • Randomly shuffle your sequences

9
Possible ways to get random sequences
  • Which is better?
  • Randomly pick some sequences from a database
  • Randomly pick some sequences from a database and
    truncate to the same length as your real
    sequences
  • Generate random sequences according to the
    frequency that each letter is used by your real
    sequences
  • Randomly shuffle your sequences

10
  • Random shuffling is what we do to estimate
    p-values for global alignment

11
Mouse HEXA
Human HEXA
Score 732

12
732
Distribution of the alignment scores between
mouse HEXA and 200 randomly shuffled human HEXA
sequences P-value less than 1/200 0.005
13
  • Advantages
  • Easy to implement
  • You dont need to know a lot of theories to do
    this
  • Disadvantages
  • Slow
  • Cannot estimate small p-values
  • If we had repeated 1,000 times, would we get a
    score as high 732?
  • Based on what Ive already seen, I would guess
    probably no
  • What about 1,000,000 times?

14
When theory exists
  • It gets much better
  • You dont really need to go there to know whats
    there
  • (I know roughly how many times you can get a
    score as high as 732 if you repeat your
    experiments a billion times)
  • That is what happened for local alignment

15
  • For ungapped alignment between two sequences of
    lengths M and N
  • E(S) KMN exp-?S
  • (Expected value, E-value. )
  • K, ? depends on sequence composition and
    substitution matrix
  • Can be obtained either empirically or analytically

16
when P is small.
17
Distribution of alignment scores for 1000 random
sequence pairs
Extreme value distribution
Theory says my score distribution should have
this shape
My experiment shows me that the theory seems
correct
18
Example
  • You are aligning two sequences, each has 1000
    bases
  • m 1, s -1, d -inf (ungapped alignment)
  • You obtain a score 20
  • Is this score significant?

19
  • ? ln3 1.1
  • E(S) K MN exp- ?S
  • E(20) 0.1 1000 1000 3-20 3 x 10-5
  • P-value 3 x 10-5 ltlt 0.05
  • The alignment is significant

20
20
Distribution of 1000 random sequence pairs
21
Multiple-testing problem
  • You are searching a 1000-base sequence against a
    database of 106 sequences (average length 1000
    bases)
  • You get a score 20
  • You are essentially comparing 1000 bases with
    1000x106 109 bases (ignore edge effect)
  • E(20) 0.1 1000 109 3-20 30
  • By chance we would expect to see 30 matches
  • P-value 1 e-30 0.9999999999
  • Not significant at all

22
Interpretation of p-value
  • Your null hypothesis is the two sequences are
    not related
  • Your p-value is 0.999999999, so you cannot reject
    your null hypothesis
  • Which means you cannot exclude the possibility
    that the two sequence may be unrelated
  • But can you conclude that the two sequences are
    unrelated?
  • No! your did not test that!

23
A better way to understand p-value
  • Your p-value is 0.99999
  • You have very low confidence (lt0.00001) to say
    that the null hypothesis is wrong
  • Is the null hypothesis true then (i.e., the two
    sequences are unrelated)?
  • You dont know
  • Your test was not designed to tell you that

24
In practice
  • You search the sequence against a database of 106
    sequences
  • You get 35 matches
  • You expect to get about 30 by chance
  • It could be all 35 are real, or none, or some
  • You already reduced your target from 106
    sequences to 35 sequences
  • Take all 35 sequences with caution. Look for
    other evidences

25
Statistics for gapped local alignment
  • Theory not well developed
  • Extreme value distribution works well empirically
  • Need to estimate K and ? empirically

26
Exercising FSA
  • How do you make an FSA for the Needleman-Wunsch
    algorithm?

27
Exercising FSA
  • How do you make an FSA for the Needleman-Wunsch
    algorithm?

Ix
(-, yj)/d
(xi,yj) /?
(-, yj) / d
(xi,yj) /?
F
(xi,-)/d
(-, yj) / d
Iy
(xi,-) / d
(xi,-)/d
(xi,yj) /?
28
Simplify
(xi,yj) /?
(xi,-) / d
(xi,-) / d
F
I
(xi,yj) /?
(-, yj) / d
(-, yj) / d
29
Simplify more
(xi,yj) /?
F(i-1, j-1) ?(xi, yj) F(i,j) max F(i-1,
j) d F(i, j-1) d
F
(xi,-) / d
(-, yj) / d
30
A more difficult alignment problem
  • (A gene finder indeed!)
  • X is a genomic sequence (DNA)
  • X encodes a gene
  • May contain introns
  • Y is an ORF from another species
  • Contains only exons
  • We want to compare X against Y
  • Conservation is on the level of amino acids

31
DNA
intron
intron
Pre-mRNA
exon
exon
3 UTR
exon
5 UTR
Splice
Mature mRNA (mRNA)
Open reading frame (ORF)
Start codon
Stop codon
32
  • We have a predicted gene
  • We know the positions of the start codon and stop
    codon
  • But we dont know where are the splicing sites
  • Not even the number of introns

intron
intron
exon
exon
exon
Start codon
Stop codon
33
  • Most splicing sites start at GT and end at AG
  • But there are lots of GT and AG in the sequence
  • Aligning to a orthologous gene with known ORF may
    help us determine the splicing sites
  • Orthologous genes two genes evolved from the
    same ancestor
  • Coding region are likely conserved on amino acid
    level
  • UUA, UUG encode the same amino acid
  • So do UCA, UCU, UCG, UCC

GTAG
Mouse putative gene
human ORF
34
The Genetic Code
Third letter
35
If know where are the exons
  • Easy

Mouse putative gene
Remove introns
Mouse putative ORF
translate
Global alignment
translate
human ORF
36
Or directly align triplets
Mouse putative gene
Remove introns
Mouse putative ORF
Global alignment
human ORF
37
Codon substitution scores
AAA AAG AAU AAC UCU UCC
AAA 4 3 -1 -1 -1 -1
AAG 3 4 -1 -1 -1 -1
AAU -1 -1 4 3 1 1
AAC -1 -1 3 4 1 1



UCU -1 -1 1 1 4 3
UCC -1 -1 1 1 3 4
64 x 64 substitution matrix
38
FSA for aligning genomic DNA to ORF
(xi-2xi-1xi, yj-2yj-1yj) / ?
(xi-2xi-1xi, - ) or (-, yj-2yj-1yj) / e
(xi-2xi-1xi, yj-2yj-1yj) / ?
A
B
(xi-2xi-1xi, - ) or (-, yj-2yj-1yj) / d
Considering only exons
39
  • We dont know exactly where are the splicing
    sites
  • Length of introns may not be a multiple of 3
  • - If convert the whole seq into triplets, may
    result in ORF shift

17 bases?
Mouse putative gene
human ORF
40
Model introns
  • Most splicing sites start at GT and end at AG
  • For simplicity, assume length of exon is a
    multiple of 3
  • Not true in reality
  • Only a little more work without this assumption

GTAG
Mouse putative gene
126 nt 42 aa
human ORF
120 nt 40 aa
41
Aligning genomic DNA to ORF
Fixed cost to have an intron
Alignment with Affine gap penalty
42
FSA for aligning genomic DNA to ORF
(xi-2xi-1xi, yj-2yj-1yj) / ?
(xi-2xi-1xi, - ) or (-, yj-2yj-1yj) / e
(xi-2xi-1xi, yj-2yj-1yj) / ?
A
B
(xi-2xi-1xi, - ) or (-, yj-2yj-1yj) / d
Considering only exons
43
FSA for aligning genomic DNA to ORF
(xi-2xi-1xi, yj-2yj-1yj) / ?
(xi-2xi-1xi, - ) or (-, yj-2yj-1yj) / e
Start an intron
(xi-2xi-1xi, yj-2yj-1yj) / ?
A
C
B
(-, GT) / s
(xi-2xi-1xi, - ) or (-, yj-2yj-1yj) / d
44
FSA for aligning genomic DNA to ORF
Continue in intron
(xi-2xi-1xi, yj-2yj-1yj) / ?
(-, yi) / 0
(xi-2xi-1xi, - ) or (-, yj-2yj-1yj) / e
Start an intron
(xi-2xi-1xi, yj-2yj-1yj) / ?
A
C
B
(-, GT) / s
(xi-2xi-1xi, - ) or (-, yj-2yj-1yj) / d
45
FSA for aligning genomic DNA to ORF
Continue in intron
(xi-2xi-1xi, yj-2yj-1yj) / ?
(-, yi) / 0
(xi-2xi-1xi, - ) or (-, yj-2yj-1yj) / e
Start an intron
(xi-2xi-1xi, yj-2yj-1yj) / ?
A
C
B
(-, GT) / s
(xi-2xi-1xi, - ) or (-, yj-2yj-1yj) / d
(-, AG) / s
Close an intron
46
(xi-2xi-1xi, yj-2yj-1yj) / ?
(-, yj) / 0
(xi-2xi-1xi, - ) or (-, yj-2yj-1yj) / e
(xi-2xi-1xi, yj-2yj-1yj) / ?
A
C
B
(-, GT) / s
(xi-2xi-1xi, - ) or (-, yj-2yj-1yj) / d
(-, AG) / s
A(i-3,j-3) ? (xi-2xi-1xi,
yj-2yj-1yj) A(i, j) max B(i-3,j-3) ?
(xi-2xi-1xi, yj-2yj-1yj) C(i, j-2) s, if
yj-1yj AG
47
(xi-2xi-1xi, yj-2yj-1yj) / ?
(-, yj) / 0
(xi-2xi-1xi, - ) or (-, yj-2yj-1yj) / e
(xi-2xi-1xi, yj-2yj-1yj) / ?
A
C
B
(-, GT) / s
(xi-2xi-1xi, - ) or (-, yj-2yj-1yj) / d
(-, AG) / s
A(i, j-3) d A(i-3, j) d B(i, j)
max B(i, j-3) e B(i-3, j) e
48
(xi-2xi-1xi, yj-2yj-1yj) / ?
(-, yj) / 0
(xi-2xi-1xi, - ) or (-, yj-2yj-1yj) / e
(xi-2xi-1xi, yj-2yj-1yj) / ?
A
C
B
(-, GT) / s
(xi-2xi-1xi, - ) or (-, yj-2yj-1yj) / d
(-, AG) / s
B(i, j-2) s, if yj-1yj GT C(i, j) max
C(i, j-1)
49
ACGGATGCGATCAGTTGTACTACGAGCTGACGGTCCTCAGACTTGATTA
















































50
  • There is a close relationship between dynamic
    programming, FSA, regular expression, and regular
    grammar
  • Using FSA, you can design more complex alignment
    algorithms
  • If you can draw the state diagram for a problem,
    it can be easily formulated into a DP problem
  • In particular, Hidden Markov Models
  • Will discuss more in a few weeks

51
Heuristic Local AlignersBLAST and alike
52
State of biological databases
  • Sequenced Genomes
  • Human 3?109 Yeast 1.2?107
  • Mouse 2.7?109 Rat 2.6?109
  • Neurospora 4?107 Fugu fish 3.3?108
  • Tetraodon 3?108 Mosquito 2.8?108
  • Drosophila 1.2?108 Worm 1.0?108
  • Rice 1.0?109 Arabidopsis 1.2?108
  • sea squirts 1.6?108
  • Current rate of sequencing
  • 4 big labs ? 3 ?109 bp /year/lab
  • 10s small labs

53
State of biological databases
  • Number of genes in these genomes
  • Vertebrate 30,000
  • Insects 14,000
  • Worm 17,000
  • Fungi 6,000-10,000
  • Small organisms 100s-1,000s
  • Each known or predicted gene has an associated
    protein sequence
  • gt1,000,000 known / predicted protein sequences

54
Some useful applications of alignments
  • Given a newly discovered gene,
  • - Does it occur in other species?
  • - How fast does it evolve?
  • Assume we try Smith-Waterman

Our new gene
104
The entire genomic database
1010 - 1011
May take several weeks!
55
Some useful applications of alignments
  • Given a newly sequenced organism,
  • - Which subregions align with other organisms?
  • - Potential genes
  • - Other biological characteristics
  • Assume we try Smith-Waterman

Our newly sequenced mammal
3?109
The entire genomic database
1010 - 1011
gt 1000 years ???
56
BLAST
  • Basic Local Alignment Search Tool
  • Altschul, Gish, Miller, Myers, Lipman, J Mol Biol
    1990
  • The most widely used comp bio tool
  • The most cited paper
  • Which is better long mediocre match or a few
    nearby, short, strong matches with the same total
    score?
  • score-wise, exactly equivalent
  • biologically, later may be more interesting, is
    common
  • at least, if must miss some, rather miss the
    former
  • BLAST is a heuristic emphasizing the later
  • speed/sensitivity tradeoff BLAST may miss
    former, but gains greatly in speed

57
BLAST
  • Main idea
  • Construct a dictionary of all the words in the
    query
  • Initiate a local alignment for each word match
    between query and DB
  • Running Time O(MN)
  • However, orders of magnitude faster than
    Smith-Waterman

query
DB
58
BLAST ? Original Version
  • Dictionary
  • All words of length k (11 for DNA, 3 for
    proteins)
  • Alignment initiated between words of alignment
    score ? T (typically T k)
  • Alignment
  • Ungapped extensions until score
  • below statistical threshold
  • Output
  • All local alignments with score
  • gt statistical threshold


query

scan
DB
query
59
BLAST ? Original Version
A C G A A G T A A G G T C
C A G T
Example k 4, T 4 The matching word GGTC
initiates an alignment Extension to the left and
right with no gaps until alignment falls lt
50 Output GTAAGGTCC GTTAGGTCC

















C C C T T C C T G G A T T
G C G A
60
Gapped BLAST
A C G A A G T A A G G T C
C A G T
  • Added features
  • Pairs of words can initiate alignment
  • Extensions with gaps in a band around anchor
  • Output
  • GTAAGGTCCAGT
  • GTTAGGTC-AGT


















C T G A T C C T G G A T T
G C G A
61
Example
  • Query gattacaccccgattacaccccgattaca (29 letters)
    2 mins
  • Database All GenBankEMBLDDBJPDB sequences
    (but no EST, STS, GSS, or phase 0, 1 or 2 HTGS
    sequences) 1,726,556 sequences 8,074,398,388
    total letters
  • gtgi28570323gbAC108906.9 Oryza sativa
    chromosome 3 BAC OSJNBa0087C10 genomic sequence,
    complete sequence Length 144487 Score 34.2
    bits (17), Expect 4.5 Identities 20/21 (95)
    Strand Plus / Plus
  • Query 4 tacaccccgattacaccccga 24
  • Sbjct 125138 tacacccagattacaccccga 125158
  • Score 34.2 bits (17),
  • Expect 4.5 Identities 20/21 (95) Strand
    Plus / Plus
  • Query 4 tacaccccgattacaccccga 24
  • Sbjct 125104 tacacccagattacaccccga 125124
  • gtgi28173089gbAC104321.7 Oryza sativa
    chromosome 3 BAC OSJNBa0052F07 genomic sequence,
    complete sequence Length 139823 Score 34.2
    bits (17), Expect 4.5 Identities 20/21 (95)
    Strand Plus / Plus

62
Example
  • Query Human atoh enhancer, 179 letters 1.5
    min
  • Result 57 blast hits
  • gi7677270gbAF218259.1AF218259 Homo sapiens
    ATOH1 enhanc... 355 1e-95
  • gi22779500gbAC091158.11 Mus musculus Strain
    C57BL6/J ch... 264 4e-68
  • gi7677269gbAF218258.1AF218258 Mus musculus
    Atoh1 enhanc... 256 9e-66
  • gi28875397gbAF467292.1 Gallus gallus CATH1
    (CATH1) gene... 78 5e-12
  • gi27550980embAL807792.6 Zebrafish DNA
    sequence from clo... 54 7e-05
  • gi22002129gbAC092389.4 Oryza sativa
    chromosome 10 BAC O... 44 0.068
  • gi22094122refNM_013676.1 Mus musculus
    suppressor of Ty ... 42 0.27
  • gi13938031gbBC007132.1 Mus musculus, Similar
    to suppres... 42 0.27
  • gi7677269gbAF218258.1AF218258 Mus musculus
    Atoh1 enhancer sequence Length 1517
  • Score 256 bits (129), Expect 9e-66
    Identities 167/177 (94),
  • Gaps 2/177 (1) Strand Plus / Plus
  • Query 3 tgacaatagagggtctggcagaggctcctggccgcggt
    gcggagcgtctggagcggagca 62

  • Sbjct 1144 tgacaatagaggggctggcagaggctcctggccccggt
    gcggagcgtctggagcggagca 1203

63
Different types of BLAST
  • blastn search nucleic acid database
  • blastp search protein database
  • blastx you give a nucleic acid sequence, search
    protein database
  • Tblastn you give a protein sequence, search
    nucleic acid database
  • tblastx you give a nucleic database, search
    nucleic acid database, implicitly translate both
    into protein sequences

64
Variants of BLAST
  • MEGABLAST
  • Optimized to align very similar sequences
  • Linear gap penalty
  • NCBI-BLAST
  • WU-BLAST (Wash Univ BLAST)
  • Optimized, added features
  • BLAT Blast-Like Alignment Tool
  • BlastZ
  • Optimized for aligning two genomes
  • PSI-BLAST
  • BLAST produces many hits
  • Those are aligned, and a pattern is extracted
  • Pattern is used for next search above steps
    iterated
  • Sensitive for weak homologies
  • Slower

65
Pattern hunter
  • Instead of exact matches of consecutive matches
    of k-mer, we can
  • look for discontinuous matches
  • My query sequence looks like
  • ACGTAGACTAGCAGTTAAG
  • Search for sequences in database that match
  • AXGXAGXCTAXC
  • X stands for dont care
  • Seed 101011011101

66
Pattern hunter
  • A good seed may give you both a higher
    sensitivity and higher specificity
  • You may think 110110110110 is the best seed
  • Because mutation in the third position of a codon
    often doesnt change the amino acid
  • Best seed is actually
  • 110100110010101111
  • How to design such seed is an open problem
  • May combine multiple random seeds
Write a Comment
User Comments (0)
About PowerShow.com