Bioinformatics Workshop 1 Sequences and Similarity Searches - PowerPoint PPT Presentation

1 / 84
About This Presentation
Title:

Bioinformatics Workshop 1 Sequences and Similarity Searches

Description:

It also contains links to the example sequence files used in the workshop, and ... There should be two sequences: surfeit1' for frog and fly. ... – PowerPoint PPT presentation

Number of Views:102
Avg rating:3.0/5.0
Slides: 85
Provided by: mikegil
Category:

less

Transcript and Presenter's Notes

Title: Bioinformatics Workshop 1 Sequences and Similarity Searches


1
Bioinformatics Workshop 1Sequences and
Similarity Searches
  • Open a web browser and type in the URL
  • informatics.gurdon.cam.ac.uk/online/workshops
  • Bookmark this page
  • Click on the link to the file
  • useful-websites.html
  • Bookmark this page too
  • It also contains links to the example sequence
    files used in the workshop, and the presentations
    themselves

2
The Basic Questions
Where, and how, do I find something?
How do I know its real?
Exercise 0 Write a concise definition of what a
gene is.
3
Part 1 Structural Genomics
DNA arranged in chromosomes
Vertebrate 109 base pairs
4
Chromosomes and Genes
Total of 30,000 genes on 20 chromosomes 1000
2000 genes per chromosome
5
Gene to Protein
gene
locus
genome
primary transcript
mRNA
protein
6
Sequence Signals
mRNA
CTACCATCCATGCTAACCATTCTACTAGCATAACTGGCTA
CTACCATCCATGCTAACCATTCTACTAGCATAACTGGCTA
M
L
T
I
L
A
L
7
Genomic Signals
8
Derivative Sequences
mRNA
5
3
capture by cloning into cDNA library
EST single pass sequence from each end of the
clone
cDNA sequence
cDNA multiple pass sequencing over whole length
of the clone
9
Gene Models
exons
gene model
10
Sequences and Genes(Accession Numbers and Names)
gene
11
Gene Symbols, Names, Etc.
Gene Symbol CCNB1 Gene Name cyclin B1 Homo
sapiens Description G2/mitotic-specific
cyclin B1 Aliases CCNB, CYCB1
12
A Gene-Centric View
Entrez Gene http//www.ncbi.nlm.nih.gov/
Cyclin B1
Exercise 1 Go to Entrez Gene and look for your
favourite gene or genes.
13
Sequences and Accession Numbers
NM_001015922.1 gi62860271 GATCGTTCGATTAGCTAGGGA
CACCACCGATCGATATGACCACAAAAA
NM_001015922.2 gi62860589 GACCGTTCGATTAGCTAGGGAC
ACCACCGATCGATATGACCACAAA
BC009638.1 gi16307106
GTTCGATTAGCTAGGGACACCACCGATCGATATGACCACAAAA
NP_001015922.1 protein translated from
mRNA XM_001102567.1 predicted mRNA XP_001089765.1
predicted protein translated from predicted mRNA
14
mRNA Splicing Signals
mRNA
CTACCATCCATGCTAACCATTCTACCATTTTATACTCATGCAACGGACCG
TAGCGTAGTCGCTTAGCATCCTTTATAACTGGCTA
gene model
splice sites
CTACCATCCATGCTAACCATTCTAC
CATTTTATACTCATGCAACGGACCGT
AGCGTAGTCGCTTAGCATCCTTTATAACTGGCTA
CTACCATCCATGCTAACCATTCTACGTAAGTCATCTATATCAATATTATT
TCAGCATTTTATACTCATGCAACGGACCGTGTCAGTATTACAGAGCGTAG
TCGCTTAGCATCCTTTATAACTGGCTA
GTAAG.donor
.TTTCAG acceptor
15
Gene Predictions
  • Given
  • coding sequence must run from ATG STOP codon
    in-frame
  • introns GT. . . . . . AG can be spliced out
  • Also take a statistical approach
  • coding and non-coding sequence are slightly
    different in composition
  • some possible splice sites are more likely
    than others

scan genomic sequence
. . .CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCAT
TTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA. .
. . .CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCAT
TTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA. .
. . .CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCAT
TTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA. .
. . .CGTCGTATGGCTTCGATTCGCTAGCTAGCCTAACGTATATAGCTA
GGTAAGACTA. . .
. . .CGTCGTATGGCTTCGATTCGCTAGCTAGCCTAACGTATATAGCTA
GGTAAGACTA. . .
. . .CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCAT
TTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA. .
. . .CGTCGTATGGCTTCGATGTAGTACATCGGATCGTCGCTAGCTAGC
CTAACGTATATAGCTAGGTAAGACTA. . .
. . .CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCAT
TTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA. .
. . .CGTCGTATGGCTTCGATGTAGTACATCGGATCGTCGCTAGCTAGC
CTAACGTATATAGCTAGGTAAGACTA. . .
most likely gene model
. . .CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCAT
TTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA. .
16
Supporting Evidence!
exons 1 2
3 4
gene model
genome
EST evidence
We note that in the absence of EST evidence it is
only really possible to predict coding sequence
with any confidence (and even then). So
predicted genes based on computational gene
models alone will usually lack UTR regions, which
has some important consequences.
17
Theoretical/Predicted Sequences
exons 1 2
3 4
predicted gene model
genome
predicted transcript
predicted protein
Weve now reversed the process of working out
exon structure from aligning cDNA sequences
against the genome sequence, but we shouldnt
lose sight of the fact that we dont really know
if these predicted proteins exists especially
where supporting EST evidence is weak, or
non-existent.
18
Sequences for a model organism
ESTs millions _at_ 10 each Cheap to sequence so
we get millions per organism But lots of
errors And incomplete gene sequences Can give us
relative expression levels cDNAs tens of
thousands _at_ 1000 each Expensive but only need
to do one (or a small number) per gene Few errors
with multipass sequencing Gives us protein
sequences Genomes one ! _at_ 30,000,000 Extremely
expensive But the only way to get the whole
picture Gives us gene regulation
19
So Whats in the Databases Now?
NCBI July 2005
15,000,000 ESTs
3,300,000 cDNAs
DNA
nr
RefSeq
2,700,000 proteins
950,000 proteins
Proteins
20
Part 2 Comparative Genomics
Evolution by sequence mutation
Gene sequence
ATGAAGGCTGCCTACGACTGCCGTG
ATGCAGGCTGCCTACGACTGCCGTG
ATGCAGGCTGCCAACGACTGCCGTG
Imagine one mutation gets fixed every 100,000
years in this gene sequence
ATGCATGCTGCCAACGACTGCCGTG
ATGCATGCTGCCAACGACTGCCCTG
ATGCATGCTGCCAACGGCTGCCCTG
ATGCATGCTGCCAACGGATGCCCTG
ATGCATGCCGCCAACGGATGCCCTG
ATGCATGCCGCCAACGGATGTCCTG
21
Speciation
Gene A
ATGAAGGCTGCCTACGACTGCCGTG
ATGAAGGCTGCCTACGACTGCCGTG
ATGAAGGCCGCCTACGACTGCCGTG
ATGCAGGCTGCCTACGACTGCCGTG
ATGAAGGCCGCCAACGACTGTCGTG
ATGCAGGCTGCCAACGACTGCCGTG
ATGAAAGCCGCCAACGACTGTCGTG
ATGCATGCTGCCAACGACTGCCGTG
ATGCATGCTGCCAACGACTGCCCTG
ATGAAAGCCGCCAACGACAGTCGTG
ATGCATGCTGCCAACGGCTGCCCTG
ATGAAAGCCGCCTACGACAGTCGTG
ATGCATGCTGCCAACGGATGCCCTG
ATGAAAGCCGCCTACGACAGTCCTG
ATGCATGCTGCCAACGGATGCCCTG

ATGAAAGCCGCCTACGACAGTCCTG
If the genetic difference means they can no
longer interbreed, with fertile offspring then
we have a new species
22
Residual Similarity
We can still easily detect residual similarity
between these sequences, this is what we call
homology detectable similarity because of
common evolutionary origin.
After longer periods of evolution, homology may
no longer be detectable in the DNA sequence
23
Computers Can Detect Homology
In fact computers are very good at this task
the two primary challenges are (a) performing
the search fast enough to look through millions
of sequence in a timescale compatible with a lab
scientists attention span (b) at low levels of
similarity, being able to distinguish between
biologically related sequences and chance matches
24
Orthologs
Gene duplication though speciation
The two copies of Gene A will now evolve
independently, but will continue to have the
same function
They are ORTHOLOGS
25
Paralogs
The two copies of Gene A will now evolve
independently, but will probably not continue to
have exactly the same function
Gene duplication though internal genome
duplication
They are PARALOGS
26
Other-logs
What about gene duplication after speciation ?
How can we describe the relationship(s) between
the various copies of gene A in the two frogs?
Bear in mind that understanding gene function is
more important than semantics The two copies of
A in the orange frog are sometimes called
IN-PARALOGS. If they were also present in the
green frog (and therefore were in the ancestor
species) they would be OUT-PARALOGS.
27
The Essential Paradigm
1. any group of modern species can be traced back
to some extinct common ancestor
2. in all likelihood they share orthologous genes
which have the same function in the modern animal
as in the extinct ancestor
3. If we can experimentally determine the
function of a gene in one of these organisms,
then there is a good chance the ORTHOLOGOUS gene
in another organism will have the same function
28
Function Conserved Longer than Detectable
Similarity
start from first self-replicating sequence
whole genome duplication
local duplication
same function
detectable similarity
29
Redundancy in the Genetic Code
GCA A alanine GCC A GCG A
GCT A TGC C cystine TGT C
GAC D aspartate GAT D GGA
G glycine GGC G GGG G GGT G
Synonymous or silent mutations in the third
position of the codon triplets have no effect on
the amino acid coded for so there is no
evolutionary pressure against this
30
Protein Similarity Persists Longer
CTATCACGAGAACCTGTG
CTATCCCGAGAACCTGTG
CTATCCCGAGAACCAGTG
CTATCCCGTGAACCAGTG
CTATCCCGTGAGCCAGTG
CTATCCCGTGAGCCAGTT
CTGTCCCGTGAGCCAGTT
100
67
80
44
31
Always Compare Protein Sequences
amino acid comparison
DNA comparison
ATGAATGCAGCCTATGATTGCCGAGCCAGAATGCTAAGG
MNAAYDCRARMLR
ATGAAGGCCGCATACGACTGTCGT
GCTAGAATCCTGAGA MKAAYDCRARILR
The DNA sequence can change while the amino acid
sequence stays the same, so always look for
similarities by comparing amino acid sequences.
32
Exercise 1nucleotide vs amino acid search
Go to the file example-sequences.html, and locate
the section for this exercise. There should be
two sequences surfeit1 for frog and fly. Go to
NCBI Blast home page, then Align two sequences
(bottom left special panel), paste one sequence
into each window and hit Align this will do a
direct DNA/DNA comparison. Now find the open
reading frames of the two genes, and translate
them into amino acid protein sequences, then
repeat the two sequences comparison. Go to NCBI
ORF Finder paste sequence hit OrfFind
identify longest ORF click on it next screen,
hit Accept change View to Fasta protein hit
View copy sequence to Blast2Seqs. Do the same
with the other sequence. Before you hit Align
change the Program (top left) to blastp
33
Answers Exercise 1
34
The Essential Task
experiment
data mining
what is its function?
gene sequence
Cyclin-A
FoxA1
database of proteins in other species
cdc25
alpha-tubulin
Predicted protein
Gravin-like
Gravin-like
Sprouty-2
calmodulin
KIAA10786568
we can only do this because of implied function
based on orthology
frizzled
Wint8
Troponin T3
35
Functional Orthologs ?
function known, annotation Gravin available
Human gene
sequence similarity orthologs
same function ?
But we know that function is largely determined
by shape
similar shape?
Which in general we cannot determine but it is
probably SHAPE not SEQUENCE that is conserved!
We make an assumption that the same gene function
is likely to be present in the two organisms, and
the ones that have this function are likely to be
the most similar in sequence
36
Finding Orthologs
So how do we find orthologs, and can we know when
we have? The simplest is Reciprocal Best BLAST,
but it implicitly relies on having all the
protein sequences of you own organism, and the
one you wish to find an ortholog in.
database of human proteins
database of frog proteins
best match human protein
frog protein
x
37
Using Synteny is Better
We know that large regions of (say) vertebrate
genomes have preserved their overall organisation
from one organism to another.
Mouse chromosome 10
Human chromosome 5
Mouse chromosome 2
And we find the same genes (i.e. orthologs!) in
more or less the same order in the syntenic
sections. These of course represent chromosomal
re-arrangements since these organisms diverged.
38
Metazome
Fortunately someone has done all the hard work
for us. Dan Rokhsar http//www.metazome.net/
39
Metazome Exercise
Go back to Entrez Gene and look for your
favourite gene again. Pick probable ortholog
vertebrate genes from common organisms (human,
mouse, rat, chicken, frog, fish) and paste their
protein sequences into a temporary space. Go to
Metazome (http//www.metazome.net/), find the
blast window, open two versions of it, and blast
your sequences against the Tetrapod or Jawed
vertebrate node. See if you get the same cluster
ID as best top hit and have a look at the
Metazome alignment(s)
40
Part 3 Finding Sequence Similarities
We want computer programs which will compare
sequences at all possible different alignments,
looking for a degree of similarity greater than
we would expect to find by chance. But first we
have to consider the implication of
gaps Insertions and deletions are other
possible forms of mutations and they can really
mess up our simple alignments
41
Gaps in Alignments
Consider these two obviously similar sequences
TTCCCAACTCTCCTCTTTCACCATGAAGCTCAAGGACAGATTCCACT
CGCCCCAAAATCAAGCTCACCCCGTCCAAGAA

TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGG
ACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCCAGAA In
fact we realise that the most probable alignment
(regarding biological origin) is with a small gap
in each sequence TTCCCAACTCTCCTCTTTCACCATGAAGCT
CAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA


TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCC
ACTCCCCCAAAATCAAGCGCACCCCGTCCCAGAA So in general
we allow ourselves to insert gaps, until we find
the optimal alignment. But where should this
process stop?
42
The Downside of Gaps
Take two random sequences, with no real
similarity GACACTAGGTCGATGCGTGGTGGCGAGA ACGCATCCG
GATGTGCACCGTGGAACTG And allow cost free
gaps GAC--ACT----AGGTCGATGC---GTGG---TGGCGAGA

ACGCA-TCCGGA--T-G-TGCACCGTGGAACTG Clearly,
although the alignment has no mismatches, it is
obviously not biologically meaningful! To
prevent this we assign a cost to adding gaps
which is offset against the benefit of finding
matches and this is the essence of finding
gapped alignments. We want to find the
alignment between the two (or more) sequences
which shows the greatest degree of similarity
while introducing the fewest gaps
43
BLAST
There are many programs used to find similarities
between sequences. They range from relatively
slow programs which find the exact best matching
alignment, through ones which take progressively
inexact shortcuts to speed things up. Of this
latter class, the best known, and easily most
widely used is BLAST, developed by Stephen
Altschul and others, and continuously refined
over the last 10-15 years. The essential idea is
to compare your query sequence against a
collection or database of target sequences,
looking for the one(s) that match the query
sequence the best.
database
gttarget1 AAAACAGGAATATTTACCGGGACCGGGTAATGATGCATCTC
GAGGTACACAATATACCTG GAGAACCGAATTATGAGTTGGCCACCTTAC
TTAACGAAACCAGCAGAGAAAATCCAACAT GGCAACACCCCTCTGACTA
CACTAGAAGGAACTACTATGTAAGAAAACAGCCTGTCCCTT
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGC
TTCCATGGTT TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCC
TTCCTACAGGGCCTTCCTGAA ACTATATATGTGCTTATTCTTGTTTGAT
TTGGCTTTGCAG gttarget2 CTCTTAATTTATTTCTCTTCCTGCAGCT
CCCTCGCTTTTTCCTTTCCCTGTTACATTCAT
CTGACTTGAAGAGTTGCAAATTTTCAGTGTTTCTGTTTTTGTTGCTGATA
TGTTGTAAAC TTTTTAATAAAATCTATTTCTATAG
gttarget3 GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGC
CATAGAGCGCTTCCATGGTT TACCTTGCACATTTCAGAGAAGTCCTATG
CCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAGCTAGGGTTTT
CACCTTTTCT GGAAAAAAAAATACTGGCTTCC
gttarget4 CTGCTATTAATGGGCAAAACAACTCAAATAAAGTCCCTCT
GCCACCCTCAGACACTGCCC CTGGCCCCCAGCTGCCCGCTGATCCTTGT
AGCCAGAGCAGTAAAGTTTTGAAAGTGGAGC
CCAAGGAGAATAAAGTTATTAAAGAAACTGGCTTTGAACAAGGTGAAAAG
TCTTGTGCAG CACCTCTAGATCATACTGTGAAGGAAAATCTTGGACAAA
CTTCTAAAGAACAGGTGGTAG
query
gtquery AGACGAACCTAGCACAAGCGCGTCTGGAAAGACCCGCCAGCTA
CGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCG
TAATGGCGTTCAGGAGTATTTGGACTGCAATATTGGCCCTCGTTCAAGGG
CGCCTACCATCACCCGACGGTCATGCCGGTCCCCAGCAGCTGCTAATAAC
TTCCTTCGCTACTCAAGTTACCACGCTAGCAAAACCCACGGCATACCGTT
TACCCTTTAAAATCAGCTTCAACCAGCAACGAA
COMPARE
LIST MATCHES
44
Flavours of BLAST
BLASTn
ACGATAGATCCCATCCATAAAT
ATGACGATAGATCCCATCAT CGATAGGACCACCACA
GATAGACCAGGATACATAGGATAATTA AGCTCGCTTGGCTCGATGGCT
FAST
BLASTp
MKJLSPWERSYTRGHYTWER MGHTVNBZY
MKLPWRHGDBKJGMNDFD MBKLRPIUHDFRTASGSLKWWR
TVBN
MQWCGYRWTYQGYRW
FAST
BLASTx
MKJLSPWERSYTRGHYTWER MGHTVNBZY
MKLPWRHGDBKJGMNDFD MBKLRPIUHDFRTASGSLKWWR
TVBN
ACGATAGATCCCATCCATAAAT
6 frame translation
SLOW
tBLASTn
ATGACGATAGATCCCATCAT CGATAGGACCACCACA
GATAGACCAGGATACATAGGATAATTA AGCTCGCTTGGCTCGATGGCT
MQWCGYRWTYQGYRW
SLOWER
tBLASTx
HORRIBLY SLOW!
ACGATAGATCCCATCCATAAAT
ATGACGATAGATCCCATCAT CGATAGGACCACCACA
GATAGACCAGGATACATAGGATAATTA AGCTCGCTTGGCTCGATGGCT
45
How does it work?
The main task of any sequence comparison program
is to test all possible mutual alignments of two
sequence and see how good the match is
query
CCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGG
CGTTC
CCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATG
GCGTTC





CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTT
CAGGAGT
CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTT
CAGGAGT
CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTT
CAGGAGT
1st database sequence
This would actually be a very slow search process
if implemented like this BLAST achieves its
speed through two strategies - it takes a WORD
based approach - it pre-INDEXES database sequences
46
BLAST WORDS and INDEXING
1 GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA 2
TAAGCAAATTTAATTTTGTTTACATTTTC 3
GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA
Database of sequences
Numbered list of all possible words
AAAAAAAA 00001 AAAAAAAC 00002 AAAAAAAG 00003
ACAAATCC 07967 ACAAATCC 07968 ACAAATCC
07979 GACAAATC 33568 GACAAATG 33569
TCCAAACC 64321 TCCAAACC 64322
Build a position index of all words in the
database
sequence position word 1 1 33658 1 2 07967
1 3 16210 3 15 33568 3
16 07967
47
Analyse the Query Sequence
gtquery AGACAAATCCAAACCCCTGAAGTTCTCCA
CCAGCAAAGCCA
QUERY SEQUENCE
Numbered list of all possible words
Analyse QUERY SEQUENCE
AAAAAAAA 00001 AAAAAAAC 00002 AAAAAAAG 00003
ACAAATCC 07967 ACAAATCC 07968 ACAAATCC
07979 GACAAATC 33568 GACAAATG 33569
TCCAAACC 64321 TCCAAACC 64322
position word 1 14236 2 33658 3
07967
Index of database
sequence position word 1 1 33658 1 2 07967
1 3 16210 3 15 33568 3
16 07967
48
Expand from Word Based Matches
We instantly know which sequences in the
database have at least a word length match with
our query sequence, and at what relative
position. Next, the potential alignments are
expanded, adding up a score for (total matches
mismatches gap penalties), to make the best
possible alignment. But this is usually for a
tiny proportion of the sequences in the database
so overall it is much quicker. The highest
scoring alignments are reported. But we can
potentially miss alignments with no word-size
bits in common, consider BLASTn with a default
word-size of 11 TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTT
GGAGGCCAGTGTTCTGGCTGAGC
TCGGAAGTGTAAGCTCAACC
TGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC Care is
sometimes needed
49
BLAST Typical Output
INPUT gtpartial cDNA sequence, Xenopus
tropicalis CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAAT
AGAGAAAGCAGACAGCCAGCGTTCCCACCTCTCCTCTTTCACCATGAAGC
TCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCAAGAAG
GGGAAGCCGGCCGACCTCACCGTCAAAACAGAAGAGAAACCCGTCAACAA
AACCTTAAGCCGCTTGGAGGAACAGGAGAAAGAAGTCGTTAATGCCTTGC
GTTACTTTAAGACAATTGTTGACAAGATGGCGGTGGACAAGATGGTGCTG
GTGATGCTGCCAGGGTCGGCGA
OUTPUT Query (311 letters) Database NCBI
Protein Reference Sequences 954,378 sequences
347,895,532 total letters gtgi41055060refNP_95
7420.1    similar to guanine
nucleotide-releasing factor 2 (specific for crk
proto-oncogene) Danio rerio Length691
Score 133 bits (335) Expect 6e-31
Identities 76/98 (77) Positives 82/98
(83) Gaps 4/98 (4) Frame 2 Query
26 MSGKIE-KADSQRSHLSSFTMKLKDKFHSPKIKRTPSKKGKPA--DL
TVKTEEKPVNKTL 196 MSGKIE K
SQSHLSSFTMKL KFHSPKIKRTPSKKGK VKT EKPVNK
Sbjct 1 MSGKIESKHESQKSHLSSFTMKLM-KFHSPKIKRTPS
KKGKQLQPEPAVKTPEKPVNKKV 59 Query 197
SRLEEQEKEVVNALRYFKTIVDKMAVDKMVLVMLPGSA 310
SRLEEQEKVVALRYFKTIVDKM VD VL MLPGSA Sbjct
60 SRLEEQEKDVVSALRYFKTIVDKMNVDTKVLQMLPGSA 97
50
When is a match significant?
Here is a typical weak alignment from BLASTp
RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIV
HPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV F K S
C N Y N C P W P
D I N M NFSWKKTSEKETNCQFDYPND
Y--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQI
FN------MCWLEVNSS
In fact the sequences were randomly generated, so
there is no biologically significant alignment
RFKISDCQHPCTYSHNQYMTNHMRECPYNGAATSIPSWHLIVHPSNGQSV
SFPQSDPCQIKMNQNLHLVQMMYDMQTTHV NFSWKKTSEKETNCQFDY
PNDYNEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQ
IFNMCWLEVNSS
51
E-values
The number of matches like the discovered match
that I would expect to find by chance. An
E-value of 0.0 implies that I would expect no
matches like this to arise by chance,
therefore An E-value of 1 implies I would
expect 1 match like this to arise by chance, so
if I have a match with such an E-value
Also expect value or expectation
52
E-values From First Principles
Notation 1.2e-35 1.2 x 10-35 4.8 x 106
4,800,000
Some database statistics (23rd July
2005) Database NCBI RefSeq mRNA
272,619 sequences 503,566,580 total letters
(5.0 x 108) Database NCBI nr
3,329,110 sequences 14,601,814,750 total
letters (1.4 x 1010)
We will consider first searching a nucleotide
sequence (ACGTAGACGT) against a nucleotide
database, e.g. the RefSeq mRNA above. Then we
will consider the more complex case of amino acid
sequence (protein) searches. Which is of course
what we mostly do.
53
Calculating an E-value
The RefSeq mRNA database has 5.0 x 108 letters
There are 4 possible nucleotides - ACGT How
many matches do we expect to find by chance?
Query A
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGAT
AGGCTAACCGTAATGGCG
A
A
A
A
A
A
A
A
A
A
A
A
A
A
Expected number of matches (5.0 x 108) / 4
1.2 x 108
Query AC
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGAT
AGGCTAACCGTAATGGCG
AC
AC
AC
AC
Expected number of matches (5.0 x 108) / (4x 4)
3.1 x 107
Query ACG
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGAT
AGGCTAACCGTAATGGCG
ACG
Expected number of matches (5.0 x 108) / (4 x 4
x 4) 8.1 x 106
Query ACGTCGA..CTGATTCG - 60-mer
Expected number of matches (5.0 x 108) / (4 x 4
x 4 x 4 60 times ) (5.0 x
108) / 1036 5.0 x
10-28 E-value 5.0 x 10-28
54
E-values In Practice
So if I take a 60 nt sequence gtsequence ACAGCTCGT
CCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGC
A and actually BLAST it against the RefSeq mRNA
database, I get BLAST OUTPUT gtgi27469838gbBC
041710.1 Homo sapiens Rap guanine nucleotide
exchange factor (GEF) 1, transcript variant 2,
mRNA (cDNA clone MGC49019 IMAGE6051007),
complete cds Length6060 Score 119 bits (60),
Expect 2e-26 Identities 60/60 (100), Gaps
0/60 (0) StrandPlus/Plus Query 1
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAA
CCGCCGTGCA 60

Sbjct 2977 ACAGCTCGTCCTCCTTCCGAGCCTACC
GGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA 3036 What do
I get if I BLAST it against the larger nr
database? BLAST OUTPUT gtgi27469838gbBC041710.1
Homo sapiens Rap guanine nucleotide
exchange factor (GEF) 1, transcript variant 2,
mRNA (cDNA clone MGC49019 IMAGE6051007),
complete cds Length6060 Score 119 bits (60),
Expect 6e-25 Identities 60/60 (100), Gaps
0/60 (0) StrandPlus/Plus Query 1
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAA
CCGCCGTGCA 60

Sbjct 2977 ACAGCTCGTCCTCCTTCCGAGCCTACC
GGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA 3036
theoretical value was 5.0e-28 - !?
55
E-value Exercise
Given a transcription factor binding
site ACCT/GTA How many would you expect to
find by chance in a 10k promoter sequence? How
would this differ if there was an optional
additional base between the 4th and 5th
positions? I.e. ACCT/GTA OR ACCT/G?TA
56
E-value Exercise Answer
ACCT/GTA Expect A every 4 nt Expect ACC
every 4x4x4 64 nt Expect T or G every 2nd
nt Expect ACCT/G every 64x2 nt 128
nt Expect TA every 4x4 16 nt Expect
ACCT/GTA every 128x16 nt 2048 nt
(4x4x4x2x4x4) We would expect 5 of these
promoter sites every 10k by chance If also
ACCT/G?TAA allowed? The two motifs
independently have the same E-value. To allow
either means we expect twice as many.
57
E-values Effect of Database Size
The nr mRNA database has 1.4 x 1010 letters
(was RefSeq and 5.0 x108) There are 4 possible
nucleotides - ACGT How many matches do we expect
to find by chance?
Query A
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGAT
AGGCTAACCGTAATGGCG
A
A
A
A
A
A
A
A
A
A
A
A
A
A
Expected number of matches (1.4 x 1010) / 4
1.2 x 108
Query AC
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGAT
AGGCTAACCGTAATGGCG
AC
AC
AC
AC
Expected number of matches (1.4 x 1010) / (4x
4) 3.1 x 107
Query ACG
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGAT
AGGCTAACCGTAATGGCG
ACG
Expected number of matches (1.4 x 1010) / (4 x
4 x 4) 8.1 x 106
Query ACGTCGA..CTGATTCG - 60-mer
Expected number of matches (1.4 x 1010) / (4 x
4 x 4 x 4 60 times ) (1.4 x
1010) / 1036 1.4 x
10-26 E-value 1.4 x 10-26 (was E-value 5.0
x 10-28)
58
E-values Effect of Database Size
1.4 x 1010 letters
5.0 x108 letters
nr
RefSeq
30 x bigger
E-value 5.0e-28
BLAST the same sequence against each
E-value 1.4e-26
The database was 30 times bigger and so the
E-value was 30 times bigger.
The E-value is simply dependent on database size.
59
Why were the values different?
Our calculated E-value for searching against the
RefSeq mRNA database was 5.0 x 10-28. But our
actual BLAST search at NCBI gave a value of 2.0
x 10-26 - about 40x larger - why is
this? Gapped alignments If we were expecting N
matches for a query sequence ACGTACGTACGT,
imagine what would happen to N if we allowed gaps
in our matches. ACGTAC?GTACGT This would now
give us additional possible alignments that would
meet our match criteria ACGTACGTACGT
ACGTACAGTACGT ACGTACCGTACGT etc.
ACGTACGTACGT
ACGTAC-GTACGT ACGTAC-GTACGT We will expect
many more matches in a given database, if we
allow our alignments to have gaps. The E-value
will be larger.
60
E-values Effect of Query Length
BLAST 500 nt sequence against a database
gtsequence ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGAT
AGACGATAGATAGTAAGTCGATCGATCGCGCATCGATCGTCTAGATCGAT
CGCTCGCTGTGTAGATAGATCGGCGATAGA
database
BLASTn
Get a full length match with sequence XYZ at an
E-value 5.0e-160
BLAST half of the same sequence against the same
database
gtsequence ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGAT
AGACGATAGATAGTAAGTCG
database
Get a match with sequence XYZ again, but at an
E-value 5.0e-80
BLASTn
Biologically its the same match! Does it mean we
are any less sure that this match didnt occur by
chance? The E-value is simply dependent on match
length.
61
Why not just use identity?
At some levels this a good question. But
consider two very different searches, both of
which give a 75 identity match Query1 was 60 nt
long CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGA
AACGATACAGCCAGCG
CGGAGCTCAGGCCTCACCGGCG
GACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG Which
would have an E-value 5.0 x 10-19 And, Query2
only 16 nt long ACGTACGTACGTACGT
ACGCACCTTCGTAGGT Which would have an E-value
30 And intuitively we feel we would expect to
see that sort of number of matches in the
database just by chance
62
So whats the real problem?
Basically you are usually trying to answer the
question Can I find the ortholog of my gene in
some other species, so that I can work out what
it might be doing in my organism?
Basically you are usually trying to answer the
question Can I find the ortholog of my gene in
some other species, so that I can work out what
it might be doing in my organism?
biological knowledge
nature of query sequence
phylogenetic relationship
match length, PI, size of database
Are there any useful guidelines though, at least
for biological meaningfulness?
63
Rules of Thumb
How good does an E-value have to be before we
might even think we have an ortholog?
? larger/worse
smaller/better ? E-values 10-5 10-10
10-40 10-100 0.0
fantasy

borderline

encouraging
pretty
good cant get better But
note that in some gene families with closely
related members you can get an E-value of 0.0 for
several different matches, and then identity
may be more sensitive. Also bear in mind, in
cases like this, that ideas of functional
orthology may break down, with more than one
locus producing identical proteins which share
the same function
64
Protein BLAST
Its (nearly) always better to make comparisons
at the amino acid level between protein sequences
than the DNA level, because the amino acid
sequence is more conserved than the underlying
DNA sequence. Does this cause us to treat
expected values any differently? If we follow
the argument as before then for an exact match of
a 20 amino acid sequence in the RefSeq protein
database, each additional amino acid will reduce
the E-value by 1/20th (there are 20 different
amino acids). And as there are 347,895,532
letters in that database, E-value 3.5 x 108 /
(20 x 20 x 20 20 times) 3.5 x 10-18. But
this is what we get if we run the blast at NCBI
Score 43.1 bits (100), Expect 8e-04
Identities 20/20 (100), Positives 20/20
(100), Gaps 0/20 (0) Frame 3 Query 3
SSSSFRAYRAALSEVEPPCI 62
SSSSFRAYRAALSEVEPPCI Sbjct 972
SSSSFRAYRAALSEVEPPCI 991
Really too big a discrepancy to easily explain
with hand waving
65
Amino Acid Substitutions
In fact we need to take into account both amino
acid substitutability, as well as, as before,
allowing gapped alignments. On average any
residue can be substituted for by about 2 others,
so each position has about 1/7th chance of
matching rather than 1/20th.
So now we get E-value 3.5 x 108 / (7 x 7 x 7
20 times) 4.4 x 10-9, which is much closer to
the actual BLAST value.
These substitutabilities are dealt with by the
BLOSUM and PAM matrices
66
Exercises
Go to the file random-DNA-sequences.html, select
one of the 20 randomly generated nucleotide
sequences, and do a BLASTx (translated
DNA-gtprotein) at NCBI against the nr protein
database. Did you find any significant
hits? Repeat with a second sequence. What
conclusions might you draw from this
exercise? Try the same sequence(s) against the nr
nucleotide database. Is there any general
difference?
67
Part 4 Tweaking BLAST
Although you normally see BLAST as a web page
with boxes to place data in and tick boxes, etc.,
it is actually a command line program that can be
run just by typing the appropriate command and
options, e.g. promptgt blastall p
ltblast typegt i ltinput sequencegt d
ltdatabasegt This is the simplest form where the
basic program blastall takes a number of
different options, or parameters, indicated by
the x and followed by its value. -p ltwhich
blast flavour to rungt -i ltfile with query
sequence ingt -d ltpre-indexed database namegt
68
Not All Parameters are
There are many other parameters, and if not
listed explicitly they will use a default value
most appropriate to the blast flavour requested.
E.g. for W ltword sizegt blastn uses W 11, where
blastx uses W 3. There are also some options
that appear on the web pages that are not really
parameters but manage the job in a similar way.
One of the most useful of these is on the NCBI
blast pages where you can use Entrez queries or
pick from an organism list to modify your search.
69
The Many Parameters of BLAST
There are almost literally hundreds of
parameters, but most are way too obscure even for
die-hard techies like me! Very few of them are
regularly useful in any but their default value,
but just occasionally they are very
necessary. Here are some of the ones that I have
used -e max expected value -m output
format (graphical or tabular/spreadsheet) -F
filter query sequence for low complexity (default
TRUE) -U use only upper case regions of query
(default FALSE) -G gap opening cost -E gap
extension cost -q nucleotide mismatch
penalty (BLASTx uses matrices) -r
nucleotide match reward -b number of
matching sequences to report -g allow
gaps (default TRUE) -W word size -z
effective database size (removes effect of
actual database size!) -S query strands
to search (default both directions) -l restrict
database sequences to given list of gi numbers
70
(No Transcript)
71
BLAST Parameters Exercises
1. BLASTn vs. BLASTx Open the file
example-sequences.html, copy the sequence
gtblastn-vs-blastx This is a Xenopus tropicalis
cDNA sequence. Go to the NCBI BLAST Home
Page/Nucleotide-nucleotide BLAST (blastn)
section. Paste your sequence into the box. Run
BLASTn against the nr nucleotide database using
all default options. Then hit format to wait
for the results in a new page. (hint if you paste
the sequence definition line gtname into the box
as well, your results will be labelled
accordingly, which can be useful) Now repeat but
go to the TRANSLATED BLAST section, and BLAST
against the nr protein database using
BLASTx. How might the different results help us
view the presence of this gene in other
vertebrates?
72
Results for Exercise 1.
BLASTn
BLASTx
73
BLAST Parameters Exercises
2. Low complexity filtering Open the file
example-sequences.html, copy the sequence
gtlow-complexity-filtering-A This sequence
contains a long AT tandem repeat. Go to the NCBI
BLAST Home Page/TRANSLATED BLAST section/BLASTx.
Paste your sequence into the box. Carefully
UNTICK the Choose filter  Low complexity
BOX in the second section. And then run BLASTx
against the nr database. What do you feel about
these alignments? Re-run, but leave the
low-complexity filter ON this time. Does this
change our view of the protein matches? Now
continue with gtlow-complexity-filtering-B and
C. C is an especially interesting case what
can we deduce about the cDNA sequence? Annotators
beware!
74
Results for Exercise 2A (OFF)
BLASTn low complexity filtering OFF
75
Results for Exercise 2A (ON)
BLASTn low complexity filtering ON
76
Results for Exercise 2B
ON
OFF
77
Results for Exercise 2C
ON
OFF
There is a sequence error, an extra G at position
117 in the sequence cDNA
(117) AGAAAAGAAGAAACATGGCAATGGATCAGAA
AGAAAAGAAGAAACAT-GCAATGGATC
AGAA Genomic sequence
78
BLAST Parameters Exercises
3. Limit by Entrez query Entrez queries can be
used in the NCBI BLAST web page to restrict the
search to more specific items. For instance to
find only matching sequences in fruit fly, enter
Drosophila melanogasterORGN in the Limit by
entrez query box in the second section (you can
also select the organism from the adjacent
drop-down list). To combine items use logical
AND, OR or NOT. Open the file example-sequences.h
tml. Copy the sequence gtcyclin-D1-Xt and go to
the NCBI BLAST Home Page/ TRANSLATED BLAST
section/BLASTx, and paste the sequence. Use an
Entrez query to find all rodent sequences (rat
and mouse) with a good match to cyclin-D1. At
what E-value do we expect we are no longer
looking at cyclins? Try running the search again
with that E-value as a limit
79
BLAST Parameters Exercises
4. BLASTn vs tBLASTx and nucleotide mismatch
penalties Open the file example-sequences.html.
Also open the NCBI BLAST Home Page/SPECIAL
Align two sequences section. There are several
Xenopus tropicalis cyclins in the examples
file. Copy the sequence gtcyclin-A1-Xt to the
Sequence 1 BLAST window Copy the sequence
gtcyclin-A2-Xt to the Sequence 2 BLAST window (i)
Run the default comparison, should be BLASTn.
Note the alignment. Now run again using tBLASTx
what does this do to our understanding of the
relationship between these two sequences? Are
they homologs, orthologs or paralogs or none of
these? (ii) Revert to BLASTn, and try varying
the values for mismatch penalties and gapping
start by reducing the mismatch penalty to -1.
Then try reducing the gap open and gap extension
penalties. What do we learn from this? (iii)
Now repeat the first parts of the exercise with
cyclin-D1 in place of cyclin-A2
80
Results for Exercise 4 (i)
BLASTn
tBLASTx
81
Results for Exercise 4 (ii)
Mismatch penalty -2 (default)
Mismatch penalty -1
82
BLAST Parameters Exercises
5. E-Value maximum for reporting Open the file
example-sequences.html. Copy the sequence
gtsumo-binding-motif and go to the NCBI BLAST Home
Page. Go to the PROTEIN BLAST section, BLASTp,
and paste the sequence. Run the search with the
default values. Now re-run the search, setting
the maximum E-value in the box Expect
100 What difference does this make?
83
BLAST Parameters Exercises
6. Word Size Open the file example-sequences.html
. Copy the sequence gtmorpholino and go to the
NCBI BLAST Home Page. Go to the NUCLEOTIDE BLAST
section, BLASTn, and paste the sequence. Check
OFF the low complexity filter, and then run the
search. Now re-run the search, setting the
following parameters Low complexity OFF Expect
100 Word Size 7 Other advanced -q-1 (mismatch
penalty -1 instead of default -3) What
difference does this make?
84
END
Write a Comment
User Comments (0)
About PowerShow.com