CS 5263 Bioinformatics

About This Presentation

Title:

CS 5263 Bioinformatics

Description:

CS 5263 Bioinformatics. Lecture 7: Heuristic Sequence ... Neurospora 4 107 Fugu fish 3.3. 108 Tetraodon 3 108 Mosquito 2.8. 108 Drosophila 1.2 108 Worm 1.0 ... – PowerPoint PPT presentation

Number of Views:43

Avg rating:3.0/5.0

Slides: 66

Provided by: jianhu

Learn more at: http://www.cs.utsa.edu

Category:

more less

Transcript and Presenter's Notes

Title: CS 5263 Bioinformatics

1
CS 5263 Bioinformatics

Lecture 7 Heuristic Sequence Alignment
Algorithms (BLAST)

2
Roadmap

Last lecture review
Sequence alignment statistics
Today
Gene finding by alignment
Heuristic alignment algorithms
Basic Local Alignment Search Tools

3
Sequence Alignment Statistics

Substitution matrices
How is the BLOSUM matrix made
How to make your own substitution matrix
Whats the meaning of an arbitrary substitution
matrix
Significance of sequence alignments
P-value estimation for
Global alignment scores
Local alignment scores

4
What is a p-value?

The probability of observing an effect as strong
or stronger than you observed, given the null
hypothesis. i.e., How likely is this effect to
occur by chance?
Pr(x gt S null)

5
What is a null-hypothesis?

A statisticians way of characterizing chance.
Generally, a mathematical model of randomness
with respect to a particular set of observations.
The purpose of most statistical tests is to
determine whether the observed data can be
explained by the null hypothesis.

6
For sequence alignment

Your null hypothesis is the two sequences are
unrelated
Your alternative hypothesis is the two sequences
are related
You obtained a score S
how likely that you can obtain such a score if
the null hypothesis is true?

7
How to test that null-hypothesis?

Randomly generate some pairs of unrelated
sequences
See what alignment scores you may get for those
unrelated sequences
Must keep other factors in mind
Your random sequences must be as close as
possible to your true sequences
Except that they are unrelated (i.e., not from
a common ancestor)

8
Possible ways to get unrelated sequences

Which is better?
Randomly pick some sequences from a database
Randomly pick some sequences from a database and
truncate to the same length as your real
sequences
Generate random sequences according to the
frequency that each letter is used by your real
sequences
Randomly shuffle your sequences

9
Possible ways to get random sequences

Which is better?
Randomly pick some sequences from a database
Randomly pick some sequences from a database and
truncate to the same length as your real
sequences
Generate random sequences according to the
frequency that each letter is used by your real
sequences
Randomly shuffle your sequences

Random shuffling is what we do to estimate
p-values for global alignment

11
Mouse HEXA
Human HEXA
Score 732

12
732
Distribution of the alignment scores between
mouse HEXA and 200 randomly shuffled human HEXA
sequences P-value less than 1/200 0.005
13

Advantages
Easy to implement
You dont need to know a lot of theories to do
this
Disadvantages
Slow
Cannot estimate small p-values
If we had repeated 1,000 times, would we get a
score as high 732?
Based on what Ive already seen, I would guess
probably no
What about 1,000,000 times?

14
When theory exists

It gets much better
You dont really need to go there to know whats
there
(I know roughly how many times you can get a
score as high as 732 if you repeat your
experiments a billion times)
That is what happened for local alignment

For ungapped alignment between two sequences of
lengths M and N
E(S) KMN exp-?S
(Expected value, E-value. )
K, ? depends on sequence composition and
substitution matrix
Can be obtained either empirically or analytically

16
when P is small.
17
Distribution of alignment scores for 1000 random
sequence pairs
Extreme value distribution
Theory says my score distribution should have
this shape
My experiment shows me that the theory seems
correct
18
Example

You are aligning two sequences, each has 1000
bases
m 1, s -1, d -inf (ungapped alignment)
You obtain a score 20
Is this score significant?

? ln3 1.1
E(S) K MN exp- ?S
E(20) 0.1 1000 1000 3-20 3 x 10-5
P-value 3 x 10-5 ltlt 0.05
The alignment is significant

20
20
Distribution of 1000 random sequence pairs
21
Multiple-testing problem

You are searching a 1000-base sequence against a
database of 106 sequences (average length 1000
bases)
You get a score 20
You are essentially comparing 1000 bases with
1000x106 109 bases (ignore edge effect)
E(20) 0.1 1000 109 3-20 30
By chance we would expect to see 30 matches
P-value 1 e-30 0.9999999999
Not significant at all

22
Interpretation of p-value

Your null hypothesis is the two sequences are
not related
Your p-value is 0.999999999, so you cannot reject
your null hypothesis
Which means you cannot exclude the possibility
that the two sequence may be unrelated
But can you conclude that the two sequences are
unrelated?
No! your did not test that!

23
A better way to understand p-value

Your p-value is 0.99999
You have very low confidence (lt0.00001) to say
that the null hypothesis is wrong
Is the null hypothesis true then (i.e., the two
sequences are unrelated)?
You dont know
Your test was not designed to tell you that

24
In practice

You search the sequence against a database of 106
sequences
You get 35 matches
You expect to get about 30 by chance
It could be all 35 are real, or none, or some
You already reduced your target from 106
sequences to 35 sequences
Take all 35 sequences with caution. Look for
other evidences

25
Statistics for gapped local alignment

Theory not well developed
Extreme value distribution works well empirically
Need to estimate K and ? empirically

26
Exercising FSA

How do you make an FSA for the Needleman-Wunsch
algorithm?

27
Exercising FSA

How do you make an FSA for the Needleman-Wunsch
algorithm?

Ix
(-, yj)/d
(xi,yj) /?
(-, yj) / d
(xi,yj) /?
F
(xi,-)/d
(-, yj) / d
Iy
(xi,-) / d
(xi,-)/d
(xi,yj) /?
28
Simplify
(xi,yj) /?
(xi,-) / d
(xi,-) / d
F
I
(xi,yj) /?
(-, yj) / d
(-, yj) / d
29
Simplify more
(xi,yj) /?
F(i-1, j-1) ?(xi, yj) F(i,j) max F(i-1,
j) d F(i, j-1) d
F
(xi,-) / d
(-, yj) / d
30
A more difficult alignment problem

(A gene finder indeed!)
X is a genomic sequence (DNA)
X encodes a gene
May contain introns
Y is an ORF from another species
Contains only exons
We want to compare X against Y
Conservation is on the level of amino acids

31
DNA
intron
intron
Pre-mRNA
exon
exon
3 UTR
exon
5 UTR
Splice
Mature mRNA (mRNA)
Open reading frame (ORF)
Start codon
Stop codon
32

We have a predicted gene
We know the positions of the start codon and stop
codon
But we dont know where are the splicing sites
Not even the number of introns

intron
intron
exon
exon
exon
Start codon
Stop codon
33

Most splicing sites start at GT and end at AG
But there are lots of GT and AG in the sequence
Aligning to a orthologous gene with known ORF may
help us determine the splicing sites
Orthologous genes two genes evolved from the
same ancestor
Coding region are likely conserved on amino acid
level
UUA, UUG encode the same amino acid
So do UCA, UCU, UCG, UCC

GTAG
Mouse putative gene
human ORF
34
The Genetic Code
Third letter
35
If know where are the exons

Easy

Mouse putative gene
Remove introns
Mouse putative ORF
translate
Global alignment
translate
human ORF
36
Or directly align triplets
Mouse putative gene
Remove introns
Mouse putative ORF
Global alignment
human ORF
37
Codon substitution scores
AAA AAG AAU AAC UCU UCC
AAA 4 3 -1 -1 -1 -1
AAG 3 4 -1 -1 -1 -1
AAU -1 -1 4 3 1 1
AAC -1 -1 3 4 1 1

UCU -1 -1 1 1 4 3
UCC -1 -1 1 1 3 4
64 x 64 substitution matrix
38
FSA for aligning genomic DNA to ORF
(xi-2xi-1xi, yj-2yj-1yj) / ?
(xi-2xi-1xi, - ) or (-, yj-2yj-1yj) / e
(xi-2xi-1xi, yj-2yj-1yj) / ?
A
B
(xi-2xi-1xi, - ) or (-, yj-2yj-1yj) / d
Considering only exons
39

We dont know exactly where are the splicing
sites
Length of introns may not be a multiple of 3
- If convert the whole seq into triplets, may
result in ORF shift

17 bases?
Mouse putative gene
human ORF
40
Model introns

Most splicing sites start at GT and end at AG
For simplicity, assume length of exon is a
multiple of 3
Not true in reality
Only a little more work without this assumption

GTAG
Mouse putative gene
126 nt 42 aa
human ORF
120 nt 40 aa
41
Aligning genomic DNA to ORF
Fixed cost to have an intron
Alignment with Affine gap penalty
42
FSA for aligning genomic DNA to ORF
(xi-2xi-1xi, yj-2yj-1yj) / ?
(xi-2xi-1xi, - ) or (-, yj-2yj-1yj) / e
(xi-2xi-1xi, yj-2yj-1yj) / ?
A
B
(xi-2xi-1xi, - ) or (-, yj-2yj-1yj) / d
Considering only exons
43
FSA for aligning genomic DNA to ORF
(xi-2xi-1xi, yj-2yj-1yj) / ?
(xi-2xi-1xi, - ) or (-, yj-2yj-1yj) / e
Start an intron
(xi-2xi-1xi, yj-2yj-1yj) / ?
A
C
B
(-, GT) / s
(xi-2xi-1xi, - ) or (-, yj-2yj-1yj) / d
44
FSA for aligning genomic DNA to ORF
Continue in intron
(xi-2xi-1xi, yj-2yj-1yj) / ?
(-, yi) / 0
(xi-2xi-1xi, - ) or (-, yj-2yj-1yj) / e
Start an intron
(xi-2xi-1xi, yj-2yj-1yj) / ?
A
C
B
(-, GT) / s
(xi-2xi-1xi, - ) or (-, yj-2yj-1yj) / d
45
FSA for aligning genomic DNA to ORF
Continue in intron
(xi-2xi-1xi, yj-2yj-1yj) / ?
(-, yi) / 0
(xi-2xi-1xi, - ) or (-, yj-2yj-1yj) / e
Start an intron
(xi-2xi-1xi, yj-2yj-1yj) / ?
A
C
B
(-, GT) / s
(xi-2xi-1xi, - ) or (-, yj-2yj-1yj) / d
(-, AG) / s
Close an intron
46
(xi-2xi-1xi, yj-2yj-1yj) / ?
(-, yj) / 0
(xi-2xi-1xi, - ) or (-, yj-2yj-1yj) / e
(xi-2xi-1xi, yj-2yj-1yj) / ?
A
C
B
(-, GT) / s
(xi-2xi-1xi, - ) or (-, yj-2yj-1yj) / d
(-, AG) / s
A(i-3,j-3) ? (xi-2xi-1xi,
yj-2yj-1yj) A(i, j) max B(i-3,j-3) ?
(xi-2xi-1xi, yj-2yj-1yj) C(i, j-2) s, if
yj-1yj AG
47
(xi-2xi-1xi, yj-2yj-1yj) / ?
(-, yj) / 0
(xi-2xi-1xi, - ) or (-, yj-2yj-1yj) / e
(xi-2xi-1xi, yj-2yj-1yj) / ?
A
C
B
(-, GT) / s
(xi-2xi-1xi, - ) or (-, yj-2yj-1yj) / d
(-, AG) / s
A(i, j-3) d A(i-3, j) d B(i, j)
max B(i, j-3) e B(i-3, j) e
48
(xi-2xi-1xi, yj-2yj-1yj) / ?
(-, yj) / 0
(xi-2xi-1xi, - ) or (-, yj-2yj-1yj) / e
(xi-2xi-1xi, yj-2yj-1yj) / ?
A
C
B
(-, GT) / s
(xi-2xi-1xi, - ) or (-, yj-2yj-1yj) / d
(-, AG) / s
B(i, j-2) s, if yj-1yj GT C(i, j) max
C(i, j-1)
49
ACGGATGCGATCAGTTGTACTACGAGCTGACGGTCCTCAGACTTGATTA

50

There is a close relationship between dynamic
programming, FSA, regular expression, and regular
grammar
Using FSA, you can design more complex alignment
algorithms
If you can draw the state diagram for a problem,
it can be easily formulated into a DP problem
In particular, Hidden Markov Models
Will discuss more in a few weeks

51
Heuristic Local AlignersBLAST and alike
52
State of biological databases

Sequenced Genomes
Human 3?109 Yeast 1.2?107
Mouse 2.7?109 Rat 2.6?109
Neurospora 4?107 Fugu fish 3.3?108
Tetraodon 3?108 Mosquito 2.8?108
Drosophila 1.2?108 Worm 1.0?108
Rice 1.0?109 Arabidopsis 1.2?108
sea squirts 1.6?108
Current rate of sequencing
4 big labs ? 3 ?109 bp /year/lab
10s small labs

53
State of biological databases

Number of genes in these genomes
Vertebrate 30,000
Insects 14,000
Worm 17,000
Fungi 6,000-10,000
Small organisms 100s-1,000s
Each known or predicted gene has an associated
protein sequence
gt1,000,000 known / predicted protein sequences

54
Some useful applications of alignments

Given a newly discovered gene,
- Does it occur in other species?
- How fast does it evolve?
Assume we try Smith-Waterman

Our new gene
104
The entire genomic database
1010 - 1011
May take several weeks!
55
Some useful applications of alignments

Given a newly sequenced organism,
- Which subregions align with other organisms?
- Potential genes
- Other biological characteristics
Assume we try Smith-Waterman

Our newly sequenced mammal
3?109
The entire genomic database
1010 - 1011
gt 1000 years ???
56
BLAST

Basic Local Alignment Search Tool
Altschul, Gish, Miller, Myers, Lipman, J Mol Biol
1990
The most widely used comp bio tool
The most cited paper
Which is better long mediocre match or a few
nearby, short, strong matches with the same total
score?
score-wise, exactly equivalent
biologically, later may be more interesting, is
common
at least, if must miss some, rather miss the
former
BLAST is a heuristic emphasizing the later
speed/sensitivity tradeoff BLAST may miss
former, but gains greatly in speed

57
BLAST

Main idea
Construct a dictionary of all the words in the
query
Initiate a local alignment for each word match
between query and DB
Running Time O(MN)
However, orders of magnitude faster than
Smith-Waterman

query
DB
58
BLAST ? Original Version

Dictionary
All words of length k (11 for DNA, 3 for
proteins)
Alignment initiated between words of alignment
score ? T (typically T k)
Alignment
Ungapped extensions until score
below statistical threshold
Output
All local alignments with score
gt statistical threshold

query

scan
DB
query
59
BLAST ? Original Version
A C G A A G T A A G G T C
C A G T
Example k 4, T 4 The matching word GGTC
initiates an alignment Extension to the left and
right with no gaps until alignment falls lt
50 Output GTAAGGTCC GTTAGGTCC

C C C T T C C T G G A T T
G C G A
60
Gapped BLAST
A C G A A G T A A G G T C
C A G T

Added features
Pairs of words can initiate alignment
Extensions with gaps in a band around anchor
Output
GTAAGGTCCAGT
GTTAGGTC-AGT

C T G A T C C T G G A T T
G C G A
61
Example

Query gattacaccccgattacaccccgattaca (29 letters)
2 mins
Database All GenBankEMBLDDBJPDB sequences
(but no EST, STS, GSS, or phase 0, 1 or 2 HTGS
sequences) 1,726,556 sequences 8,074,398,388
total letters
gtgi28570323gbAC108906.9 Oryza sativa
chromosome 3 BAC OSJNBa0087C10 genomic sequence,
complete sequence Length 144487 Score 34.2
bits (17), Expect 4.5 Identities 20/21 (95)
Strand Plus / Plus
Query 4 tacaccccgattacaccccga 24
Sbjct 125138 tacacccagattacaccccga 125158
Score 34.2 bits (17),
Expect 4.5 Identities 20/21 (95) Strand
Plus / Plus
Query 4 tacaccccgattacaccccga 24
Sbjct 125104 tacacccagattacaccccga 125124
gtgi28173089gbAC104321.7 Oryza sativa
chromosome 3 BAC OSJNBa0052F07 genomic sequence,
complete sequence Length 139823 Score 34.2
bits (17), Expect 4.5 Identities 20/21 (95)
Strand Plus / Plus

62
Example

Query Human atoh enhancer, 179 letters 1.5
min
Result 57 blast hits
gi7677270gbAF218259.1AF218259 Homo sapiens
ATOH1 enhanc... 355 1e-95
gi22779500gbAC091158.11 Mus musculus Strain
C57BL6/J ch... 264 4e-68
gi7677269gbAF218258.1AF218258 Mus musculus
Atoh1 enhanc... 256 9e-66
gi28875397gbAF467292.1 Gallus gallus CATH1
(CATH1) gene... 78 5e-12
gi27550980embAL807792.6 Zebrafish DNA
sequence from clo... 54 7e-05
gi22002129gbAC092389.4 Oryza sativa
chromosome 10 BAC O... 44 0.068
gi22094122refNM_013676.1 Mus musculus
suppressor of Ty ... 42 0.27
gi13938031gbBC007132.1 Mus musculus, Similar
to suppres... 42 0.27
gi7677269gbAF218258.1AF218258 Mus musculus
Atoh1 enhancer sequence Length 1517
Score 256 bits (129), Expect 9e-66
Identities 167/177 (94),
Gaps 2/177 (1) Strand Plus / Plus
Query 3 tgacaatagagggtctggcagaggctcctggccgcggt
gcggagcgtctggagcggagca 62
Sbjct 1144 tgacaatagaggggctggcagaggctcctggccccggt
gcggagcgtctggagcggagca 1203

63
Different types of BLAST

blastn search nucleic acid database
blastp search protein database
blastx you give a nucleic acid sequence, search
protein database
Tblastn you give a protein sequence, search
nucleic acid database
tblastx you give a nucleic database, search
nucleic acid database, implicitly translate both
into protein sequences

64
Variants of BLAST

MEGABLAST
Optimized to align very similar sequences
Linear gap penalty
NCBI-BLAST
WU-BLAST (Wash Univ BLAST)
Optimized, added features
BLAT Blast-Like Alignment Tool
BlastZ
Optimized for aligning two genomes
PSI-BLAST
BLAST produces many hits
Those are aligned, and a pattern is extracted
Pattern is used for next search above steps
iterated
Sensitive for weak homologies
Slower

65
Pattern hunter

Instead of exact matches of consecutive matches
of k-mer, we can
look for discontinuous matches
My query sequence looks like
ACGTAGACTAGCAGTTAAG
Search for sequences in database that match
AXGXAGXCTAXC
X stands for dont care
Seed 101011011101

66
Pattern hunter

A good seed may give you both a higher
sensitivity and higher specificity
You may think 110110110110 is the best seed
Because mutation in the third position of a codon
often doesnt change the amino acid
Best seed is actually
110100110010101111
How to design such seed is an open problem
May combine multiple random seeds

Write a Comment

User Comments (0)