An Example

Slides: 35

Provided by: gregsc2

Transcript and Presenter's Notes

1
Computing in Molecular Biology
Hugues Sicotte
National Center for Biotechnology Information
sicotte_at_ncbi.nlm.nih.gov
2
Alignment methods
A sequence alignment can be represented as a dot plot. For a query of N letters against a subject sequence of M letters, the plot requires M x N comparisons.
[Dot plot axes: query sequence, subject sequence]
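A minimal sketch of the dot plot described above, comparing every query letter with every subject letter. The function name and the short example sequences are illustrative assumptions, not part of the original slides.

```python
def dot_plot(query, subject):
    """Return an M x N grid of booleans (True where letters match) and the comparison count."""
    comparisons = 0
    grid = []
    for s in subject:            # M rows, one per subject letter
        row = []
        for q in query:          # N columns, one per query letter
            row.append(q == s)
            comparisons += 1
        grid.append(row)
    return grid, comparisons

# Hypothetical example sequences.
query = "MLIIKR"
subject = "AMLIIKRD"
grid, comparisons = dot_plot(query, subject)
for row in grid:
    print("".join("*" if hit else "." for hit in row))
print(comparisons, "comparisons")
```

A shared region shows up as a diagonal run of matches, and the comparison count is always M x N whatever the sequences contain.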
3
HASHING METHODS
Hashing is a common method for accelerating database searches.
Compile a dictionary of words from the query sequence. Put each word in a look-up table that points to its original position in the sequence. Thus, given one word, you can tell whether it is in the query in a single operation.
All overlapping words of size 3:
MLI LII IIK IKR KRD RDE DEL ELV LVI VIS ISW SWA WAS ASH SHE HER ERE
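The dictionary of overlapping words can be sketched as follows. The query string is an inference, reconstructed by chaining the 3-letter words listed on the slide; the function name is hypothetical.

```python
from collections import defaultdict

def build_word_table(query, k=3):
    """Map each overlapping k-letter word to the 1-based positions where it starts."""
    table = defaultdict(list)
    for i in range(len(query) - k + 1):
        table[query[i:i + k]].append(i + 1)
    return table

# Query reconstructed from the overlapping words on the slide (an assumption).
QUERY = "MLIIKRDELVISWASHERE"
table = build_word_table(QUERY)
print(len(table))        # number of distinct words
print(table["MLI"])      # positions where MLI starts
print("XYZ" in table)    # membership is a single dictionary lookup
```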
4
Index lookup
Each word is assigned a unique integer, e.g. for a word of 3 letters made up of an alphabet of 20 letters:
1. Assign a code to each letter, Code(l) (0 to 19).
2. For a word of 3 letters L1 L2 L3, the code is index = Code(L1) x 20^2 + Code(L2) x 20^1 + Code(L3).
3. Have an array with a list of the positions that have that word.
[Figure: array indexed by word code (AAA, AAB, ..., MLI, MLJ, ...); each entry holds the position in the query sequence of that word, e.g. 1 for MLI]
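As a sketch of the index lookup, a 3-letter word can be read as a base-20 number and used directly as an array index. The letter ordering in ALPHABET is an assumption (the slides do not specify which code each letter receives), and the function names are hypothetical.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"   # 20 amino acids; this ordering is an assumption
CODE = {letter: i for i, letter in enumerate(ALPHABET)}

def word_index(word):
    """Read the word as a base-20 number: Code(L1)*20^2 + Code(L2)*20 + Code(L3)."""
    index = 0
    for letter in word:
        index = index * len(ALPHABET) + CODE[letter]
    return index

def build_index_table(query, k=3):
    """Array of 20^k slots; slot i lists the 1-based query positions of word i."""
    slots = [[] for _ in range(len(ALPHABET) ** k)]
    for pos in range(len(query) - k + 1):
        slots[word_index(query[pos:pos + k])].append(pos + 1)
    return slots

slots = build_index_table("MLIIKRDELVISWASHERE")
print(word_index("AAA"), word_index("MLI"))
print(slots[word_index("MLI")])
```

With this ordering the array has 20^3 = 8000 slots, most of them empty for a short query, which is the usual trade of memory for constant-time lookup.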
5
HASHING METHODS
Building the dictionary for the query sequence requires (N-2) operations.
The database contains (M-2) words, and it takes only one operation to see if a word was in the query.
All overlapping words of size 3:
MLI LII IIK IKR KRD RDE DEL ELV LVI VIS ISW SWA WAS ASH SHE HER ERE
6
HASHING METHODS
Scan the subject, looking up words in the dictionary.
Use word hits to determine where to search for alignments. This fills the dynamic programming matrix in (N-2)+(M-2) operations instead of M x N.
[Dot plot axes: query sequence, subject sequence]
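The scan can be sketched like this: (N-2) dictionary insertions for the query plus (M-2) lookups for the subject, counting one operation per word. The subject sequence below is a made-up example, and the function name is hypothetical.

```python
def find_word_hits(query, subject, k=3):
    """Return (query_pos, subject_pos) pairs (0-based) where a k-letter word matches."""
    table = {}
    for i in range(len(query) - k + 1):            # N - k + 1 dictionary insertions
        table.setdefault(query[i:i + k], []).append(i)
    hits = []
    for j in range(len(subject) - k + 1):          # M - k + 1 dictionary lookups
        for i in table.get(subject[j:j + k], ()):
            hits.append((i, j))
    return hits

query = "MLIIKRDELVISWASHERE"
subject = "GGMLIIKRDEAAWASHEREGG"                  # hypothetical subject sequence
hits = find_word_hits(query, subject)
operations = (len(query) - 2) + (len(subject) - 2)
print(hits)
print(operations, "word operations versus", len(query) * len(subject), "cell comparisons")
```

Recording the hits themselves costs extra work proportional to their number, which stays small when most words do not match.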
7
HASHING METHODS
Scan the database, looking up words in the dictionary.
Use word hits to determine where to search for alignments.
[Dot plot axes: query sequence, subject sequence]
FASTA searches in a band
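A simplified sketch of the band idea: group word hits by diagonal, keep the diagonal with the most hits, and restrict the dynamic programming to cells near it. FASTA's actual procedure scores and joins diagonal regions before a banded optimization; the hit list, band half-width, and function names here are illustrative assumptions.

```python
from collections import Counter

def best_diagonal(hits):
    """Pick the diagonal (subject_pos - query_pos) that carries the most word hits."""
    counts = Counter(j - i for i, j in hits)
    return counts.most_common(1)[0]

def cells_in_band(n, m, diagonal, half_width):
    """Count matrix cells (i, j) with |(j - i) - diagonal| <= half_width."""
    return sum(
        1
        for i in range(n)
        for j in range(m)
        if abs((j - i) - diagonal) <= half_width
    )

# Hypothetical word hits as (query_pos, subject_pos) pairs.
hits = [(0, 2), (1, 3), (2, 4), (3, 5), (12, 12), (13, 13), (7, 1)]
diagonal, count = best_diagonal(hits)
print("best diagonal:", diagonal, "with", count, "hits")
print(cells_in_band(19, 21, diagonal, half_width=3), "of", 19 * 21, "cells fall in the band")
```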
8
HASHING METHODS
Scan the database, looking up words in the dictionary.
Use word hits to determine where to search for alignments.
[Dot plot axes: query sequence, database sequence]
BLAST extends from word hits
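Extension from a word hit can be sketched as an ungapped walk in both directions that stops once the running score drops too far below its best value. This is a simplification under stated assumptions: it uses a flat match/mismatch score rather than a substitution matrix, and the parameter names (x_drop in particular) and the subject sequence are hypothetical.

```python
def extend_hit(query, subject, qpos, spos, k=3, match=1, mismatch=-1, x_drop=2):
    """Extend a k-letter word hit without gaps; stop once the score falls x_drop below the best."""
    def extend(qi, si, step):
        best = score = 0
        best_len = length = 0
        while 0 <= qi < len(query) and 0 <= si < len(subject):
            score += match if query[qi] == subject[si] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length
            elif best - score >= x_drop:
                break
            qi += step
            si += step
        return best, best_len

    right_score, right_len = extend(qpos + k, spos + k, +1)
    left_score, left_len = extend(qpos - 1, spos - 1, -1)
    total = k * match + left_score + right_score
    # Returns the score and the query start/end (end exclusive) of the extended segment.
    return total, qpos - left_len, qpos + k + right_len

query = "MLIIKRDELVISWASHERE"
subject = "GGMLIIKRDEAAWASHEREGG"      # hypothetical subject sequence
score, start, end = extend_hit(query, subject, qpos=2, spos=4)   # seed word: IIK
print(score, query[start:end])
```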
9
Database Search Space
The simplest database search can be seen as one large dynamic programming example, with all the database sequences concatenated one after another.
[Dot plot axes: query sequence, concatenated database sequence]
10
Database Search Space
Which alignment is more significant?
[Dot plot axes: query sequence, concatenated database sequence]
11
Database Search Space
A score can be used to judge alignments, but a score's absolute value is a function of the scoring parameters.
Match=1, Mismatch=-1, Gap_open=5, gap_extend=1
yields the same alignments as
Match=10, Mismatch=-10, Gap_open=50, gap_extend=10.
Scores are useful for relative ranking.
[Dot plot axes: query sequence, concatenated database sequence]
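To illustrate why only relative scores matter, the sketch below scores one fixed alignment under both parameter sets from the slide. The gap cost convention (gap_open charged once per gap, plus gap_extend per gapped column) and the example alignment are assumptions.

```python
def alignment_score(a, b, match, mismatch, gap_open, gap_extend):
    """Score two equal-length aligned strings; '-' marks a gap column."""
    score = 0
    in_gap = False
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            if not in_gap:
                score -= gap_open      # charged once when a gap starts
            score -= gap_extend        # charged for every gapped column
            in_gap = True
        else:
            score += match if x == y else mismatch
            in_gap = False
    return score

# Hypothetical alignment: 16 matches, 1 mismatch, one gap of length 2.
a = "MLIIKRDELVISWASHERE"
b = "MLI--RDQLVISWASHERE"
low = alignment_score(a, b, match=1, mismatch=-1, gap_open=5, gap_extend=1)
high = alignment_score(a, b, match=10, mismatch=-10, gap_open=50, gap_extend=10)
print(low, high)
```

Multiplying every parameter by 10 multiplies every alignment's score by 10, so the ranking of candidate alignments, and therefore the best alignment, is unchanged.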
12
Database Search Space
To judge the relevance of an alignment, we need to judge whether the match is significant.
E-value: Expect(S) is a function of the score, database size and composition, and query size. It is the number of alignments with scores > S expected if the query were random, given the database size and composition.
An Expect of 0.0 means a very good match, unlikely to be random.
[Dot plot axes: query sequence, concatenated database sequence]
13
DATABASE SEARCHING
Compare one query sequence against an entire database.
A typical search has four basic elements:
> fasta myquery swissprot -ktup 2
fasta: search program
myquery: query sequence
swissprot: sequence database
-ktup 2: optional parameters
14
DATABASE SEARCHING
With exponential database growth, searches keep taking more time.
> fasta myquery swissprot -ktup 2
searching ......
15
E-value
Hits can be sorted according to their E-value or their score. The E-value is better known as the EXPECT value and is a function of score, database size and query sequence length.
E-value: the number of alignments with a score > S that you expect to find if the database were a collection of random letters. E.g., a score of 1 requires only 1 match, so there should be an enormous number of such alignments. One expects to find fewer alignments with a score of 5, and so on. Eventually, when the score is big enough, one expects to find an insignificant number of alignments that could be due to chance.
E-values of less than 1e-6 (1 x 10^-6 in scientific notation) are usually very good, and for proteins E < 1e-2 is usually considered significant. It is still possible for a hit with E > 1 to be biologically meaningful, but more analysis is required to confirm that.
Even for VERY good hits, it is possible that the hit is due to a biological artifact (sequencing/cloning vector, repeats, low-complexity sequence).
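The dependence on score, query length and database size can be sketched with the Karlin-Altschul estimate used for BLAST-style statistics, E = K * m * n * exp(-lambda * S). The slides do not give this formula; it is supplied here as background, and the lambda and K values, query length, and database size below are illustrative assumptions.

```python
import math

def expect_value(score, query_len, db_len, lam=0.318, k=0.134):
    """Karlin-Altschul estimate E = K * m * n * exp(-lambda * S).

    lam and k depend on the scoring system; the defaults are illustrative only.
    """
    return k * query_len * db_len * math.exp(-lam * score)

# Hypothetical query of 150 letters against a database of 30 million letters.
for s in (20, 40, 60, 80):
    print(s, f"{expect_value(s, query_len=150, db_len=30_000_000):.2g}")
```

The expected number of chance hits falls exponentially as the score grows and rises linearly with database size, which is why the same score becomes less significant as databases grow.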
16
DATABASE SEARCHING
The hit list gives titles and scores for matched sequences.
> fasta myquery swissprot -ktup 2
The best scores are:                              initn init1  opt    z-sc  E(77110)
gi|1706794|sp|P49789|FHIT_HUMAN BIS(5'-ADENOSYL)-   996   996  996  1262.1  0
gi|1703339|sp|P49776|APH1_SCHPO BIS(5'-NUCLEOSYL)   412   382  395   507.6  1.4e-21
gi|1723425|sp|P49775|HNT2_YEAST HIT FAMILY PROTEI   238   133  316   407.4  5.4e-16
gi|3915958|sp|Q58276|Y866_METJA HYPOTHETICAL HIT-   153    98  190   253.1  2.1e-07
gi|3916020|sp|Q11066|YHIT_MYCTU HYPOTHETICAL 15.7   163   163  184   244.8  6.1e-07
gi|3023940|sp|O07513|HIT_BACSU HIT PROTEIN          164   164  170   227.2  5.8e-06
gi|2506515|sp|Q04344|HNT1_YEAST HIT FAMILY PROTEI   130    91  157   210.3  5.1e-05
gi|2495235|sp|P75504|YHIT_MYCPN HYPOTHETICAL 16.1   125   125  148   199.7  0.0002
gi|418447|sp|P32084|YHIT_SYNP7 HYPOTHETICAL 12.4     42    42  140   191.3  0.00058
gi|3025190|sp|P94252|YHIT_BORBU HYPOTHETICAL 15.9   128    73  139   188.7  0.00082
gi|1351828|sp|P47378|YHIT_MYCGE HYPOTHETICAL HIT-    76    76  133   181.0  0.0022
gi|418446|sp|P32083|YHIT_MYCHR HYPOTHETICAL 13.1     27    27  119   165.2  0.017
gi|1708543|sp|P49773|IPK1_HUMAN HINT PROTEIN (PRO    66    66  118   163.0  0.022
gi|2495231|sp|P70349|IPK1_MOUSE HINT PROTEIN (PRO    65    65  116   160.5  0.03
gi|1724020|sp|P49774|YHIT_MYCLE HYPOTHETICAL HIT-    52    52  117   160.3  0.031
gi|1170581|sp|P16436|IPK1_BOVIN HINT PROTEIN (PRO    66    66  115   159.3  0.035
gi|2495232|sp|P80912|IPK1_RABIT HINT PROTEIN (PRO    66    66  112   155.5  0.057
gi|1177047|sp|P42856|ZB14_MAIZE 14 KD ZINC-BINDIN    73    73  112   155.4  0.058
gi|1177046|sp|P42855|ZB14_BRAJU 14 KD ZINC-BINDIN    76    76  110   153.8  0.072
gi|1169825|sp|P31764|GAL7_HAEIN GALACTOSE-1-PHOSP    58    58  104   138.5  0.51
gi|113999|sp|P16550|APA1_YEAST 5',5'''-P-1,P-4-TE    47    47  103   137.8  0.56
gi|1351948|sp|P49348|APA2_KLULA 5',5'''-P-1,P-4-T    63    63   98   131.3  1.3
gi|123331|sp|P23228|HMCS_CHICK HYDROXYMETHYLGLUTA    58    58   99   129.4  1.6
gi|1170899|sp|P06994|MDH_ECOLI MALATE DEHYDROGENA    70    48   91   122.9  3.7
gi|3915666|sp|Q10798|DXR_MYCTU 1-DEOXY-D-XYLULOSE    75    50   92   121.9  4.3
gi|124341|sp|P05113|IL5_HUMAN INTERLEUKIN-5 PRECU    36    36   85   121.3  4.7
gi|1170538|sp|P46685|IL5_CERTO INTERLEUKIN-5 PREC    36    36   84   120.0  5.5
gi|121369|sp|P15124|GLNA_METCA GLUTAMINE SYNTHETA    45    45   90   118.9  6.3
gi|2506868|sp|P33937|NAPA_ECOLI PERIPLASMIC NITRA    48    48   92   117.4  7.6
gi|119377|sp|P10403|ENV1_DROME RETROVIRUS-RELATED    59    59   89   117.0  8
gi|1351041|sp|P48415|SC16_YEAST MULTIDOMAIN VESIC    48    48   97   117.0  8
gi|4033418|sp|O67501|IPYR_AQUAE INORGANIC PYROPHO    38    38   83   116.8  8.3
17
DATABASE SEARCHING
Detailed alignments are shown farther down in the output.
> fasta myquery swissprot -ktup 2

>>gi|1703339|sp|P49776|APH1_SCHPO BIS(5'-NUCLEOSYL)-TETR (182 aa)
initn: 412 init1: 382 opt: 395 z-score: 507.6 E(): 1.4e-21
Smith-Waterman score: 395; 52.3% identity in 109 aa overlap

gi|170 MSFRFGQHLIKPSVVFLKTELSFALVNRKPVVPGHVLVCPLRPVERFHDLRPDEVADLF
gi|170 MPKQLYFSKFPVGSQVFYRTKLSAAFVNLKPILPGHVLVIPQRAVPRLKDLTPSELTDLF

gi|170 QTTQRVGTVVEKHFHGTSLTFSMQDGPEAGQTVKHVHVHVLPRKAGDFHRNDSIYEELQK
gi|170 TSVRKVQQVIEKVFSASASNIGIQDGVDAGQTVPHVHVHIIPRKKADFSENDLVYSELEK

gi|170 HDKEDFPASWRSEEEMAAEAAALRVYFQ
gi|170 NEGNLASLYLTGNERYAGDERPPTSMRQAIPKDEDRKPRTLEEMEKEAQWLKGYFSEEQE

>>gi|1723425|sp|P49775|HNT2_YEAST HIT FAMILY PROTEIN 2 (217 aa)
initn: 238 init1: 133 opt: 316 z-score: 407.4 E(): 5.4e-16
Smith-Waterman score: 316; 37.4% identity in 131 aa overlap

gi|170 MSFRFGQHLIKPSVVFLKTELSFALVNRKPVVPGHVLVCPLRP-VER

[Alignment display: the position rulers and match-symbol lines between the paired sequences are not recoverable from the transcript]
18
Database Search Space
Some matches are not meaningful because they occur VERY often in the database, e.g.:
nucleotide AAA (from polyA)
biological repeated elements (retroposons, ALU)
low-complexity repeated patterns (CAGCAG, QQQ, KKK)
These elements should be FILTERED or MASKED to avoid generating false hits. It is OK to align through them if they are near meaningful diagonal hits.
[Dot plot axis: query sequence]
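One way to picture masking is a sliding window that replaces low-entropy stretches with X. This is only a sketch of the idea, not the SEG or DUST filters that BLAST uses; the window size and entropy threshold are arbitrary assumptions, and the test sequence is the GCF example used later in this presentation for the filtering exercise.

```python
import math
from collections import Counter

def shannon_entropy(window):
    """Entropy in bits of the letter composition of a window."""
    counts = Counter(window)
    total = len(window)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def mask_low_complexity(seq, window=12, threshold=2.2, mask_char="X"):
    """Replace every window whose entropy is below the threshold with mask_char."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < threshold:
            masked[start:start + window] = mask_char * window
    return "".join(masked)

seq = "MKKRVTNRERHWTHRRRRQRTRKKKKKKKRVLGRRALGPRPWLTGRKGLFGSARLIPATA"
print(seq)
print(mask_low_complexity(seq))
```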
19
Score and Statistics
Some amino acid mutations do not affect structure/function very much. Amino acids with similar physico-chemical and steric properties can often replace each other. Use a scoring system that does not heavily penalize mutations to similar amino acids.
PAM matrices: Point Accepted Mutations. Defined in terms of a divergence of 1 percent (1 PAM). For distant sequences use PAM250, while for closer sequences (like DNA) use PAM100. Some sites accumulate mutations and others don't, so use of the PAM100 matrix doesn't mean that the sequences compared were 100% mutated.
BLOSUM: BLOCKS substitution matrices. Started with the BLOCKS database of multiple alignments involving only distant sequences. BLOSUM62 means that the proteins compared were never closer than 62% identity. BLOSUM50 matrices involved alignments of more distant sequences.
BLOSUM matrices (BLOSUM62) are recommended for most protein alignments.
20
SCORING SYSTEMS
A  4
R -1  5
N -2  0  6
D -2 -2  1  6
C  0 -3 -3 -3  9
Q -1  1  0  0 -3  5
E -1  0  0  2 -4  2  5
G  0 -2  0 -1 -3 -2 -2  6
H -2  0  1 -1 -3  0  0 -2  8
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
Some amino acid substitutions are more common than others.
BLOSUM62
Substitution scores come from an odds ratio based on measured substitution rates.
Figure 7.8
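The odds-ratio idea can be sketched as a rounded log-odds score: how often a pair is observed in trusted alignments, divided by how often it would occur by chance. The frequencies below are hypothetical and chosen only for illustration; the half-bit scaling (scale=2) matches the units BLOSUM62 is reported in, and the function name is an assumption.

```python
import math

def log_odds_score(observed_pair_freq, freq_a, freq_b, scale=2.0):
    """Rounded log-odds score; scale=2 expresses the result in half-bit units."""
    expected = freq_a * freq_b          # chance of the pair if residues were independent
    return round(scale * math.log2(observed_pair_freq / expected))

# Hypothetical frequencies, for illustration only.
common = log_odds_score(0.004, 0.05, 0.05)     # pair seen more often than chance
neutral = log_odds_score(0.0025, 0.05, 0.05)   # pair seen about as often as chance
rare = log_odds_score(0.001, 0.05, 0.05)       # pair seen less often than chance
print(common, neutral, rare)
```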
21
SCORING SYSTEMS
[BLOSUM62 matrix, as on slide 20]
Identities get positive scores, but some are better than others.
Figure 7.8
22
SCORING SYSTEMS
[BLOSUM62 matrix, as on slide 20]
Some non-identities have positive scores, but most are negative.
Figure 7.8
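The observations on these slides can be checked directly against the matrix values shown on slide 20. The sketch below loads the lower triangle as printed there; the helper names are hypothetical, and the two 6-letter segments are taken from the FHIT_HUMAN and APH1_SCHPO alignment shown earlier.

```python
ORDER = "ARNDCQEGHILKMFPSTWYV"
LOWER_TRIANGLE = """
4
-1 5
-2 0 6
-2 -2 1 6
0 -3 -3 -3 9
-1 1 0 0 -3 5
-1 0 0 2 -4 2 5
0 -2 0 -1 -3 -2 -2 6
-2 0 1 -1 -3 0 0 -2 8
-1 -3 -3 -3 -1 -3 -3 -4 -3 4
-1 -2 -3 -4 -1 -2 -3 -4 -3 2 4
-1 2 0 -1 -3 1 1 -2 -1 -3 -2 5
-1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5
-2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6
-1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7
1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4
0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5
-3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11
-2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7
0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4
"""

def load_matrix(order, text):
    """Build a symmetric {(a, b): score} lookup from lower-triangle rows."""
    scores = {}
    rows = [line.split() for line in text.strip().splitlines()]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            scores[order[i], order[j]] = scores[order[j], order[i]] = int(value)
    return scores

BLOSUM62 = load_matrix(ORDER, LOWER_TRIANGLE)

def score_ungapped(a, b):
    """Sum of substitution scores over aligned positions (no gaps)."""
    return sum(BLOSUM62[x, y] for x, y in zip(a, b))

identities = {aa: BLOSUM62[aa, aa] for aa in ORDER}
positive_pairs = [(x, y) for i, x in enumerate(ORDER) for y in ORDER[:i] if BLOSUM62[x, y] > 0]
print("W-W:", identities["W"], " A-A:", identities["A"])
print(len(positive_pairs), "of 190 non-identical pairs score above zero")
print(score_ungapped("HVLVCP", "HVLVIP"))
```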
23
BLAST and BLAST2SEQUENCES
BLAST is a database search engine that uses hashing to accelerate the search.
blastn (nucleotide query against nucleotide database)
blastp (protein query against protein database)
blastx (nucleotide query against protein database) - translates a nucleotide query in all 6 reading frames and compares it to a protein database.
tblastn (protein query against nucleotide database) - compares a protein against a nucleotide database translated in all 6 reading frames.
tblastx (nucleotide query against nucleotide database) - compares a nucleotide sequence against a nucleotide database by translating the query and database in all 6 reading frames. Very slow!
A pairwise alignment implementation of this program is available at http://www.ncbi.nlm.nih.gov/gorf/bl2.html
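Translation in all 6 reading frames, as used by blastx, tblastn and tblastx, can be sketched as three forward frames plus three frames of the reverse complement. The sketch assumes the standard genetic code; the function names are hypothetical, and the test sequence is the first 27 bases of the OCRL example used later in this presentation.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built by enumerating codons in TCAG order.
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def translate(dna):
    """Translate complete codons from position 0; unknown codons become X."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))

def six_frames(dna):
    """Return the translations of the three forward and three reverse-complement frames."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    frames = {f"+{f + 1}": translate(dna[f:]) for f in range(3)}
    frames.update({f"-{f + 1}": translate(reverse[f:]) for f in range(3)})
    return frames

for frame, protein in six_frames("TTGAACATCATGAAACATGAGGTTGTC").items():
    print(frame, protein)
```

Searching six translations of the query against six translations of every database sequence is what makes tblastx so slow.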
24
Protein BLAST databases
nr: All non-redundant GenBank CDS translations + PDB + SwissProt + PIR + PRF
month: All new or revised GenBank CDS translation + PDB + SwissProt + PIR + PRF released in the last 30 days.
swissprot: Last major release of the SWISS-PROT protein sequence database (no updates)
Drosophila: Drosophila genome proteins provided by Celera and Berkeley Drosophila Genome Project (BDGP).
yeast: Yeast (Saccharomyces cerevisiae) genomic CDS translations
ecoli: Escherichia coli genomic CDS translations
pdb: Sequences derived from the 3-dimensional structure from Brookhaven Protein Data Bank
kabat kabatpro: Kabat's database of sequences of immunological interest
alu: Translations of select Alu repeats from REPBASE, suitable for masking Alu repeats from query sequences.
25
Nucleotide BLAST databases
nr: All GenBank + EMBL + DDBJ + PDB sequences (but no EST, STS, GSS, or phase 0, 1 or 2 HTGS sequences). No longer "non-redundant".
month: All new or revised GenBank + EMBL + DDBJ + PDB sequences released in the last 30 days.
Drosophila genome: Drosophila genome provided by Celera and Berkeley Drosophila Genome Project (BDGP).
dbest: Database of GenBank + EMBL + DDBJ sequences from EST Divisions
dbsts: Database of GenBank + EMBL + DDBJ sequences from STS Divisions
htgs: Unfinished High Throughput Genomic Sequences, phases 0, 1 and 2 (finished, phase 3 HTG sequences are in nr)
gss: Genome Survey Sequence, includes single-pass genomic data, exon-trapped sequences, and Alu PCR sequences.
yeast: Yeast (Saccharomyces cerevisiae) genomic nucleotide sequences
E. coli: Escherichia coli genomic nucleotide sequences
26
Nucleotide BLAST databases
pdb: Sequences derived from the 3-dimensional structure from Brookhaven Protein Data Bank
kabat kabatnuc: Kabat's database of sequences of immunological interest
vector: Vector subset of GenBank(R), NCBI, in ftp://ncbi.nlm.nih.gov/blast/db/
mito: Database of mitochondrial sequences
alu: Select Alu repeats from REPBASE, suitable for masking Alu repeats from query sequences.
epd: Eukaryotic Promoter Database, found on the web at http://www.genome.ad.jp/dbget-bin/www_bfind?epd
27
BLASTN SEARCH (M29204)
Search nucleotide sequence M29204 against nr.
http://www.ncbi.nlm.nih.gov/blast/blast.cgi?Jform=1
28
BLASTP and filtering.
Search using blastp against nr with filtering ON (default), then with filtering OFF.
>GCF
MKKRVTNRERHWTHRRRRQRTRKKKKKKKRVLGRRALGPRPWLTGRKGLFGSARLIPATA
29
BLASTN vs BLASTX
Search blastn against nr (nucleotide): U15595
Now search using blastx against nr (protein).
Now search blastx against ALU.
30
TBLASTX against dbEST
Search tblastx against dbEST. Picks up homologs based on protein homology of translations.
>OCRL-selected mRNA, partial sequence
TTGAACATCATGAAACATG
AGGTTGTCATTTGGTTGGGAGATTTGAATTATAGACTTTGCATGCCTGA
TGCCAATGAGGTGAAAAGTCTTATTAATAAGAAAGACCTTCAGAGACTCT
TGAAATTCGACCAGCTAAATATTCAGCGCACACAGAAAAAAGCTTTTGT
TGACTTCAATGAAGGGGAAATCAAGTTCATCCCCACTTATAAGTATGAC
TCTAA
31
Prosite search
Search Prosite for NP_000271 (Pax6a).
http://www.expasy.ch/prosite
32
PHI-Blast search
Search the Prosite db using NCBI's PHI-BLAST (Pattern-Hit-Initiated BLAST) with the pattern for Pax6a.
[LIVMFYG]-[ASLVR]-X(2)-[LIVMSTACN]-X-(4)-[LIV]-[RKNQESTAIY]-[LIVFSTNKH]-W
-e 2e-14
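To show how a PROSITE-style pattern constrains a match, the sketch below converts the basic syntax (bracketed alternatives, x for any residue, repeat counts in parentheses, braces for exclusions) into a regular expression. It does not handle anchors or other extensions, the converter name is hypothetical, and the short pattern and sequence are made-up examples rather than the Pax6a pattern.

```python
import re

def prosite_to_regex(pattern):
    """Translate a basic PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.strip(".").split("-"):
        element = element.replace("{", "[^").replace("}", "]")   # {P}: any residue except P
        element = element.replace("x", ".").replace("X", ".")    # x: any residue
        element = element.replace("(", "{").replace(")", "}")    # x(2): exactly two residues
        parts.append(element)
    return "".join(parts)

# Hypothetical pattern and sequence, for illustration only.
regex = prosite_to_regex("[LIVM]-[ASLVR]-x(2)-[LIV]-W")
match = re.search(regex, "GGLAQQIWKK")
print(regex)
print(match.start(), match.group())
```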
33
PSI-Blast search
Search AB026911 using PSI-BLAST (at NCBI).
Position-Specific Iterated BLAST: modifies the scoring matrix as a function of conserved or unconserved residues in alignments.
34
ONLINE tutorials
Details of BLAST methodology:
http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html
BLAST usage and tutorial:
http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/information3.html
Quick overview of terminology:
http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/similarity.html