Title: Getting The Most From Your Bioinformatics Toolbox
1Getting The Most From Your Bioinformatics Toolbox
- David Wishart
- Faculty of Pharmacy & Pharmaceutical Sciences
2June 26, 2000
- First Draft of Human Genome Announced
- Estimated to be 97% complete
- 3,100,000,000 bases in total
- Paper to appear in December (Science)
- 45,000 to 50,000 genes (some suggest it is as low as 28,000)
Now What?
3Key Tools in the Toolbox
- Gene Prediction
- Sequence Comparison (DotPlot)
- Alignments (BLAST, PSI-BLAST)
- Statistical Significance
- Motifs, Profiles and Domains
- Structure Prediction
- Threading & Structure Modeling
4GeneTool & PepTool
5The Web (BRONCO)
http://redpoll.pharmacy.ualberta.ca/aleung
6Gene Prediction
[Figure: gene structure (exon 1, intron 1, exon 2, intron 2) with the 5' site, branchpoint site and 3' site labelled; splice junction sequences shown: CAG/NT and AG/GT]
7Evaluation Statistics
[Figure: actual vs. predicted coding regions, with true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN) marked]
- Sensitivity - Fraction of actual coding regions that are correctly predicted as coding
- Specificity - Fraction of the prediction that is actually correct
- Correlation - Combined measure of sensitivity and specificity (-1 < CC < 1)
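The three statistics can be computed from per-nucleotide counts. The sketch below assumes the convention usually used when scoring gene predictors: sensitivity = TP/(TP+FN), specificity = TP/(TP+FP), and CC as the Matthews-style correlation coefficient. The function name and the example counts are made up for illustration.

```python
import math

def evaluation_stats(tp, fp, tn, fn):
    """Return (sensitivity, specificity, correlation) from per-nucleotide counts."""
    sensitivity = tp / (tp + fn)   # fraction of real coding bases that were found
    specificity = tp / (tp + fp)   # fraction of predicted coding bases that are real
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    cc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sensitivity, specificity, cc

# Hypothetical counts for a 10,000-base test sequence
sn, sp, cc = evaluation_stats(tp=900, fp=100, tn=8900, fn=100)
print(f"Sn={sn:.2f}  Sp={sp:.2f}  CC={cc:.2f}")
```

Note that "specificity" here follows the slide's definition (correct fraction of the prediction), which differs from the TN/(TN+FP) definition used in some other fields.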
8Gene Predictors
- GRAIL2 (http://cmpbio.ornl.gov)
- Neural Network Model, CC = 0.47
- HMMgene (http://genome.cbs.dtu.dk/services/HMMgene)
- Hidden Markov Model, CC = 0.91
- GENSCAN (http://CCR-081.mit.edu/GENSCAN.html)
- Probabilistic Model, CC = 0.91
- GRPL (GeneTool)
- Reference Point Logistics, CC = 0.94
9What Works Best?
- Expect only a single exon
- BLASTN vs. dbEST
- BLASTX vs. nr(protein)
- Fully sequenced data
- Run GENSCAN, HMMgene and GRPL
- Combine predictions (CC > 0.95)
- BLAST vs. nr(protein)
- Combine BLAST result with prediction
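The slide does not say how the predictions are combined. One simple possibility, shown here purely as an assumption, is a per-base majority vote over the calls of the three programs. The input strings are invented ('1' = coding, '0' = non-coding) and are not real output from GENSCAN, HMMgene or GRPL.

```python
def combine_predictions(predictions, min_votes=2):
    """Majority vote over per-base coding calls ('1' = coding, '0' = non-coding)."""
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all predictions must cover the same sequence length")
    consensus = []
    for column in zip(*predictions):
        votes = sum(call == "1" for call in column)
        consensus.append("1" if votes >= min_votes else "0")
    return "".join(consensus)

# Hypothetical per-base calls from three predictors on a 10-base stretch
genscan = "0011111000"
hmmgene = "0001111100"
grpl    = "0011111100"
print(combine_predictions([genscan, hmmgene, grpl]))
```

Bases on which at least two of the three programs agree are kept as coding; disagreements at exon edges are settled by the majority.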
10What Next?
- Get ready to do sequence comparisons
- Sequence comparisons lie at the heart of all bioinformatics
- Dot Plot (pairwise) comparisons
- Sequence database comparisons
- Multiple alignments
- Structure database comparisons
DNA or Protein?
11Sequence Complexity
MCDEFGHIKLAN. High Complexity
ACTGTCACTGAT. Mid Complexity
NNNNTTTTTNNN. Low Complexity
Translate those DNA sequences!!!
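The slide does not define how complexity is measured. A common proxy, assumed here, is the Shannon entropy of the residue composition: the more evenly a sequence uses many different letters, the higher the value. The sketch applies it to the three example strings from the slide.

```python
import math
from collections import Counter

def shannon_entropy(seq):
    """Shannon entropy (bits per residue) of the letter composition of seq."""
    counts = Counter(seq)
    total = len(seq)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

for s in ("MCDEFGHIKLAN", "ACTGTCACTGAT", "NNNNTTTTTNNN"):
    print(s, round(shannon_entropy(s), 2))
```

A 4-letter DNA alphabet can never exceed 2 bits per residue, while a 20-letter protein alphabet can reach about 4.3 bits, which is one way to see why translated comparisons carry more information per position.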
12Dot Plots
13Dot Plots
- Invented in 1970 by Gibbs & McIntyre
- Good for quick graphical overview
- Simplest method for sequence comparison
- Inter-sequence comparison
- Intra-sequence comparison
- Identifies internal repeats
- Identifies domains or modules
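A dot plot is simple enough to sketch in a few lines. The version below places a dot wherever a short window of one sequence matches a window of the other; the window size, the match threshold and the toy sequence are illustrative choices, not values taken from PepTool or any other program.

```python
def dot_plot(seq_a, seq_b, window=3, min_matches=3):
    """Return a boolean matrix; cell [i][j] is True when the windows starting
    at seq_a[i] and seq_b[j] share at least min_matches identical positions."""
    rows = len(seq_a) - window + 1
    cols = len(seq_b) - window + 1
    plot = []
    for i in range(rows):
        row = []
        for j in range(cols):
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            row.append(matches >= min_matches)
        plot.append(row)
    return plot

def render(plot):
    """Draw the matrix as text, '*' for a dot and '.' for an empty cell."""
    return "\n".join("".join("*" if cell else "." for cell in row) for row in plot)

# Intra-sequence comparison of a toy sequence containing an internal repeat
seq = "ACDEFACDEF"
print(render(dot_plot(seq, seq)))
```

Comparing a sequence against itself gives the main diagonal plus shorter off-diagonal lines wherever a segment is repeated, which is how internal repeats show up.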
14Dot Plots with PepTool
15Dot Plots with BLAST
16Sequence Similarity & Sequence Searching
17BLAST
- Developed in 1990 and 1997 (S. Altschul)
- Looks for clusters of nearby or locally dense similar or homologous k-tuples
- 1st to use statistics to predict significance of initial matches (saves on false leads)
- Uses larger word size than FASTA to accelerate the search process
- Looks for High Scoring Segment Pairs (HSPs)
18Different Flavours of BLAST
- BLASTP - protein query against protein DB
- BLASTN - DNA/RNA query against GenBank
- BLASTX - 6 frame DNA query against proDB
- TBLASTN - protein query against 6 frame GB
- TBLASTX - 6 frame DNA query to 6 frame GB
- PSI-BLAST - protein profile query in pDB
- PHI-BLAST - protein pattern against pDB
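BLASTX, TBLASTN and TBLASTX all depend on translating DNA in six reading frames (three on each strand). The sketch below shows that step using the standard genetic code; the function names are illustrative, and codons containing anything other than A, C, G or T are translated as "X" by assumption.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built in TCAG order ('*' = stop codon)
CODON_TABLE = {
    b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))

def six_frames(dna):
    """Return the six translations keyed '+1'..'+3' and '-1'..'-3'."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]   # reverse complement
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames

for name, protein in six_frames("ATGGCCTAA").items():
    print(name, protein)
```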
19Is This Alignment Significant?
Lysozyme versus Ribonuclease
20Chance and Significance in Sequence Alignment
21Gaussian Distribution
22Poisson Distribution
23Extreme Value Distribution
24MT0895
- MMKIQIYGTGCANCQMLEKNAREAVKELGIDAEFEKIKEMDQILEAGLTALPGLAVDGELKIMGRVASKEEIKKILS
25BLAST Output
26BLAST Output
27BLAST Output
28BLAST Parameters
- Identities - No. of exact residue matches
- Positives - No. of identical and similar residue matches
- Gaps - No. of gaps introduced (BLAST2)
- Score - Summed HSP score
- Expect - Expected no. of chance HSP alignments
- S - Alignment (HSP) score cutoff
- P - Probability of getting a score > X
- T - Minimum word or k-tuple score (Threshold)
29High-scoring Segment Pairs
Neighborhood words for the query word PQG, with scores (T marks the threshold): PQG 18, PEG 15, PRG 14, PKG 14, PNG 13, PDG 13, PQA 12, etc.
Query  325 LNKCKTPQGQRLVNQWIKQPLMDKN 350
           L    TP G R    W   P  D
Sbjct  290 LDCTVTPMGSRMLKRWLHMPVRDTR 315
30Extending HSPs
$E = kN e^{-\lambda S}$ = Number of HSPs found purely by chance
[Figure: cumulative score plotted against extension (# aa), with the thresholds X, S and T marked]
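The relationship on this slide, read here as E = kN exp(-λS), can be evaluated directly. In the sketch below N is taken to be the size of the search space, and the default values of k and λ are assumptions of roughly the size usually quoted for ungapped protein scoring; real values depend on the scoring matrix and gap penalties and are reported in the BLAST output itself.

```python
import math

def expected_chance_hsps(score, search_space, k=0.134, lam=0.318):
    """E = k * N * exp(-lambda * S): HSPs scoring >= S expected by chance.
    k and lam are assumed illustrative defaults, not authoritative values."""
    return k * search_space * math.exp(-lam * score)

# Hypothetical search: a 77-residue query against a 100-million-residue database
space = 77 * 100_000_000
for s in (40, 60, 80):
    print(s, f"{expected_chance_hsps(s, space):.2e}")
```

Because S sits in the exponent, each additional scoring unit lowers E by a constant factor, while doubling the database size only doubles E.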
31If All Else Fails...
- If two sequences are > 100 residues and > 25% identical, they are likely related
- If two sequences are 15-25% identical they may be related, but more tests are needed
- If two sequences are < 15% identical they are probably not related
- If you need more than 1 gap for every 20 residues the alignment is suspicious
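These rules of thumb translate directly into a small decision function. The sketch below is only a restatement of the four bullets; the function name is made up, and the handling of alignments that exceed 25% identity but are 100 residues or shorter (a case the rules do not cover) is an assumption.

```python
def relatedness_rule_of_thumb(percent_identity, aligned_length, gaps):
    """Apply the rules of thumb to one pairwise alignment."""
    if percent_identity > 25 and aligned_length > 100:
        verdict = "likely related"
    elif percent_identity >= 15:
        # Covers 15-25% identity, and (by assumption) short alignments above 25%
        verdict = "may be related, more tests needed"
    else:
        verdict = "probably not related"
    if gaps > aligned_length / 20:
        verdict += " (suspicious: more than 1 gap per 20 residues)"
    return verdict

print(relatedness_rule_of_thumb(32.0, 150, gaps=3))
print(relatedness_rule_of_thumb(18.0, 120, gaps=9))
```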
32Doolittle's Rules of Thumb
33Scraping the Bottom of the Barrel with Psi-BLAST
34PSI-BLAST
35PSI-BLAST
36PSI-BLAST
37Moving from Multiple Hits to Multiple Alignments
38Multiple Sequence Alignment
39Finding Sequence Patterns
40Rules of Thumb
- Sequence pattern-based motifs should be determined from no fewer than 5 multiply aligned sequences
- A good degree of sequence divergence is needed. If S is the similarity and N is the no. of sequences then $1 - S^N > 0.95$
- A good sequence pattern should have no fewer than 8 defined amino acid positions
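The divergence rule can be turned around to ask how many sequences are needed for a given similarity. The sketch assumes the criterion reads 1 - S^N > 0.95 with S expressed as a fraction between 0 and 1; the function name is illustrative.

```python
def min_sequences(similarity, threshold=0.95):
    """Smallest N for which 1 - similarity**N exceeds the threshold."""
    if not 0 < similarity < 1:
        raise ValueError("similarity must be a fraction between 0 and 1")
    n = 1
    while 1 - similarity ** n <= threshold:
        n += 1
    return n

print(min_sequences(0.5))   # 5
print(min_sequences(0.8))   # 14
```

With sequences that are 50% similar, five are enough, which agrees with the "no fewer than 5" rule; closely related sequences (80% similar) need many more before the alignment carries enough divergence.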
41Sequence Pattern Databases
- PROSITE - http://www.expasy.ch/
- BLOCKS - http://www.blocks.fhcrc.org/
- DOMO - http://www.infobiogen.fr/gracy/domo
- PFAM - http://pfam.wustl.edu
- PRINTS - http://www.biochem.ucl.ac.uk/bsm/dbrowser/PRINTS
- SEQSITE - PepTool
42Sequence Profiles
43Defining Sequence Profiles
44A Sample Sequence Profile
[Figure: sample profile built from the aligned sequences seq1, seq2, seq3, seq4, ...; each entry is a log-odds score of the form $\log_2(q_i/p_i)$]
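A profile of log2(q_i/p_i) scores can be built from a handful of aligned sequences. In the sketch below q_i is assumed to be the observed frequency of residue i in an alignment column and p_i its background frequency; the pseudocount, the uniform background and the four toy sequences are illustrative assumptions.

```python
import math
from collections import Counter

def log_odds_profile(aligned, background, pseudocount=1.0):
    """One dict per alignment column mapping residue -> log2(q/p)."""
    alphabet = sorted(background)
    profile = []
    for column in zip(*aligned):
        counts = Counter(column)
        total = len(column) + pseudocount * len(alphabet)
        scores = {}
        for aa in alphabet:
            q = (counts.get(aa, 0) + pseudocount) / total   # smoothed column frequency
            scores[aa] = math.log2(q / background[aa])
        profile.append(scores)
    return profile

def profile_score(profile, seq):
    """Sum the column scores for an ungapped sequence of the same length."""
    return sum(column[aa] for column, aa in zip(profile, seq))

# Uniform background over the 20 amino acids (an assumption for simplicity)
background = {aa: 0.05 for aa in "ACDEFGHIKLMNPQRSTVWY"}
profile = log_odds_profile(["ACDE", "ACDE", "ACEE", "GCDE"], background)
print(round(profile_score(profile, "ACDE"), 2), round(profile_score(profile, "WWWW"), 2))
```

Residues that are enriched in a column score above zero and residues never seen there score below zero, so a sequence resembling the family gets a high total.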
45Profiles & Motifs are Useful
- Helped identify active site of HIV protease
- Helped identify SH2/SH3 class of STPs
- Helped identify important GTP oncoproteins
- Helped identify hidden leucine zipper in HGA
- Used to scan for lectin binding domains
- Regularly used to predict T-cell epitopes
46Domains are More Useful
47Moving From Sequence To Structure
48Membrane Spanning Regions
49Predicting via Hydrophobicity
Bacteriorhodopsin, OmpA
50Predicting via Hydrophobicity
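Hydrophobicity-based prediction usually means averaging a hydropathy scale over a sliding window and flagging stretches above a cutoff. The slides do not name a scale; the sketch assumes the Kyte-Doolittle scale with a 19-residue window and a cutoff of 1.6, which are commonly quoted settings. The test sequence is invented.

```python
# Kyte-Doolittle hydropathy values
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
    "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
    "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(seq, window=19):
    """Mean hydropathy of each full window, indexed by window start."""
    values = [KYTE_DOOLITTLE[aa] for aa in seq]
    return [sum(values[i:i + window]) / window
            for i in range(len(seq) - window + 1)]

def candidate_tm_positions(seq, window=19, cutoff=1.6):
    """0-based centre positions of windows whose mean hydropathy >= cutoff."""
    half = window // 2
    return [start + half
            for start, mean in enumerate(hydropathy_profile(seq, window))
            if mean >= cutoff]

# Hypothetical sequence: a 20-residue hydrophobic stretch flanked by charged residues
toy = "DDDKKK" + "L" * 20 + "KKKDDD"
print(candidate_tm_positions(toy))
```

This works reasonably for helical membrane proteins such as bacteriorhodopsin; beta-barrel proteins such as OmpA have much shorter, alternating membrane strands that a long-window average tends to miss.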
51Predicting via Neural Nets and PSSMs
- PHDhtm http://dodo.cpmc.columbia.edu/predictprotein/
- TMAP http://www.mbb.ki.se/tmap/index.html
- TMPred http://www.ch.embnet.org/software/TMPRED_form.html
ACDEGF...
52Secondary Structure Prediction
- Statistical (Chou-Fasman, GOR)
- Homology or Nearest Neighbor (Levin)
- Physico-Chemical (Lim, Eisenberg)
- Pattern Matching (Cohen, Rooman)
- Neural Nets (Qian & Sejnowski, Karplus)
- Evolutionary Methods (Barton, Niemann)
- Combined Approaches (Rost, Levin, Argos)
53The PhD Approach
PRFILE...
54Best of the Best
- PredictProtein-PHD (72%)
- http://cubic.bioc.columbia.edu/predictprotein
- Jpred (73-75%)
- http://jura.ebi.ac.uk:8888/
- PREDATOR (75%)
- http://www.embl-heidelberg.de/cgi/predator_serv.pl
- PSIpred (77%)
- http://insulin.brunel.ac.uk/psipred
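The figures in parentheses are read here as per-residue accuracies over the three states helix (H), strand (E) and coil (C), a measure often called Q3; that reading is an assumption, since the slide does not name the measure. A minimal sketch of the calculation, using invented strings:

```python
def q3_accuracy(predicted, observed):
    """Percentage of residues whose predicted state (H, E or C) matches the observed one."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed strings must have equal length")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)

# Hypothetical 10-residue example with one wrong call
print(q3_accuracy("CCEEEEECCH", "CCEEEECCCH"))
```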
55Definition
- Threading - A protein fold recognition technique
that involves incrementally replacing the
sequence of a known protein structure with a
query sequence of unknown structure. The new
model structure is evaluated using a simple
heuristic measure of protein fold quality. The
process is repeated against all known 3D
structures until an optimal fit is found.
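The definition can be illustrated with a deliberately tiny model. Everything in the sketch below is an assumption made for illustration: each template is reduced to a string of buried (B) and exposed (E) positions, the query is slid along it without gaps, and the fold-quality heuristic simply rewards hydrophobic residues on buried sites and polar residues on exposed sites. Real threading programs use far richer energy terms and allow gaps.

```python
HYDROPHOBIC = set("AVLIMFWC")

def fit_score(query, environment):
    """+1 when a hydrophobic residue lands on a buried ('B') site or a polar
    residue on an exposed ('E') site, -1 otherwise."""
    score = 0
    for residue, site in zip(query, environment):
        buried = site == "B"
        score += 1 if (residue in HYDROPHOBIC) == buried else -1
    return score

def thread(query, templates):
    """Try every ungapped placement on every template; return the best
    (score, template_name, offset), or None if nothing fits."""
    best = None
    for name, environment in templates.items():
        for offset in range(len(environment) - len(query) + 1):
            s = fit_score(query, environment[offset:offset + len(query)])
            if best is None or s > best[0]:
                best = (s, name, offset)
    return best

# Hypothetical burial patterns standing in for a library of known 3D structures
templates = {
    "foldA": "EEBBEEBBEEBB",
    "foldB": "BBBBEEEEBBBB",
}
print(thread("KDLLKELV", templates))
```

The loop over templates and offsets mirrors the "repeat against all known 3D structures until an optimal fit is found" step of the definition.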
56Visualizing Threading
THREADINGSEQNCEECNQESGNI ERHTHREADINGSEQNCETHREAD
GSEQNCEQCQESGIDAERTHR...
57Visualizing Threading
THREADINGSEQNCEECNQESGNI ERHTHREADINGSEQNCETHREAD
GSEQNCEQCQESGIDAERTHR...
T
H
R
E
58Visualizing Threading
THREADINGSEQNCEECNQESGNI ERHTHREADINGSEQNCETHREAD
GSEQNCEQCQESGIDAERTHR...
T
H
59E-mail Address
Sequence
Name
60GenThreader
Conf    Prob   Epair  Esolv  AlnSc  Alen  DLen  Tlen  PDB_ID
HIGH    0.915  -81.3   -0.5   69.0    77   108    77  2trxA0
MEDIUM  0.894  -81.2   -2.5   63.0    76   105    77  1erv00
MEDIUM  0.795  -53.5   -3.3   54.0    72    82    77  3grx00
LOW     0.667  -79.7   -0.4   54.0    77   107    77  1a8l02
>>> Alignment with 2trxA0

                 10        20        30        40        50        60
         CCCEEECCCCCHHHHCCCCCCEEEEEEECCCCHHHHHHHHHHHHHHHHCCCCEEEEEEEC
2trxA0   SDKIIHLTDDSFDTDVLKADGAILVDFWAEWCGPCKMIAPILDEIADEYQGKLTVAKLNI
Query    ---------------------MMKIQIYGTGCANCQMLEKNAREAVKELGIDAEFEKI--
                                      10        20        30

                 70        80        90       100
         CCCCCHHHHCCCCCCCEEEEEECCEEEEEEECCCCHHHHHHHHHHHHC
2trxA0   DQNPGTAPKYGIRGIPTLLLFKNGEVAATKVGALSKGQLKEFLDANLA
Query    -KEMDQILEAGLTALPGLAVDGELKIMGRVA---SKEEIKKILS----
           40        50        60           70
61
MMKIQIYGTGCANCQMLEKNAREAVKELGIDAEFEKIKEM
CCEEEEECCCCCHHHHHHHHHHHHHCCCCEEEEECCCHHH
DQILEAGLTALPGLAVDGELKIMGRVASKEEIKKILS
HHHHHHCCCCCCEEEECCEEEEECCCCHHHHHHHHHC
62Your Bioinformatician
- Ms. Haiyan Zhang
- Rm. 2125 Dentistry/Pharmacy Bldng
- 492-4934
- hzhang_at_redpoll.pharmacy.ualberta.ca
Please feel free to contact her!