Getting The Most From Your Bioinformatics Toolbox

1
Getting The Most From Your Bioinformatics Toolbox
  • David Wishart
  • Faculty of Pharmacy & Pharmaceutical Sciences

2
June 26, 2000
  • First Draft of Human Genome Announced
  • Estimated to be 97% complete
  • 3,100,000,000 bases in total
  • Paper to appear in December (Science)
  • 45,000 to 50,000 genes (some suggest it is as low
    as 28,000)

Now What?
3
Key Tools in the Toolbox
  • Gene Prediction
  • Sequence Comparison (DotPlot)
  • Alignments (BLAST, PSI-BLAST)
  • Statistical Significance
  • Motifs, Profiles and Domains
  • Structure Prediction
  • Threading & Structure Modeling

4
GeneTool & PepTool

5
The Web (BRONCO)
http://redpoll.pharmacy.ualberta.ca/~aleung
6
Gene Prediction
Figure: gene structure (exon 1, intron 1, exon 2, intron 2) with the 5' site (AG/GT), the branchpoint site, and the 3' site (CAG/NT) marked
7
Evaluation Statistics
Figure: actual vs. predicted coding regions, scored as TP, FP, TN and FN
  • Sensitivity: Fraction of actual coding regions
    that are correctly predicted as coding
  • Specificity: Fraction of the prediction that is
    actually correct
  • Correlation: Combined measure of sensitivity and
    specificity (-1 < CC < 1)
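The three statistics above can be computed directly from the TP, FP, TN and FN counts. The sketch below is illustrative only: the slide does not give a formula for CC, so the standard correlation coefficient of a 2x2 table is assumed, and the counts in the usage lines are hypothetical.

```python
import math

def evaluate_prediction(tp, fp, tn, fn):
    """Return (sensitivity, specificity, correlation) from per-base counts."""
    # Sensitivity: fraction of actual coding bases that were predicted as coding.
    sensitivity = tp / (tp + fn)
    # Specificity as defined on the slide: fraction of the prediction that is correct.
    specificity = tp / (tp + fp)
    # Assumed formula: the standard correlation coefficient of a 2x2 table.
    denominator = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    correlation = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return sensitivity, specificity, correlation

# Hypothetical counts, for illustration only.
sn, sp, cc = evaluate_prediction(tp=90, fp=10, tn=880, fn=20)
print(f"Sn={sn:.2f} Sp={sp:.2f} CC={cc:.2f}")
```

Note that "specificity" here follows the gene-prediction convention on the slide (TP / (TP + FP)), not the TN-based definition used in other fields.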
8
Gene Predictors
  • GRAIL2 (http://cmpbio.ornl.gov)
  • Neural Network Model, CC = 0.47
  • HMMgene (http://genome.cbs.dtu.dk/services/HMMgene)
  • Hidden Markov Model, CC = 0.91
  • GENSCAN (http://CCR-081.mit.edu/GENSCAN.html)
  • Probabilistic Model, CC = 0.91
  • GRPL (GeneTool)
  • Reference Point Logistics, CC = 0.94

9
What Works Best?
  • Expect only a single exon
  • BLASTN vs. dbEST
  • BLASTX vs. nr(protein)
  • Fully sequenced data
  • Run GENSCAN, HMMgene and GRPL
  • Combine predictions (CC > 0.95)
  • BLAST vs. nr(protein)
  • Combine BLAST result with prediction
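The slide does not say how the predictions are combined. One simple possibility, shown here purely as an assumption, is a per-base majority vote across the predictors. The function name `consensus`, the 'C'/'N' encoding and the example strings are all hypothetical.

```python
def consensus(predictions, min_votes=2):
    """Combine per-base calls ('C' = coding, 'N' = non-coding) by voting.

    predictions: equal-length strings, one per gene predictor.
    min_votes:   how many predictors must call a base coding (assumed rule).
    """
    if len({len(p) for p in predictions}) != 1:
        raise ValueError("all predictions must cover the same bases")
    return "".join(
        "C" if sum(p[i] == "C" for p in predictions) >= min_votes else "N"
        for i in range(len(predictions[0]))
    )

# Three hypothetical predictor outputs over the same four bases.
print(consensus(["CCNN", "CNNN", "CCCN"]))
```

A BLAST hit against a protein database could then be used to confirm or extend the consensus exons, as the last bullet suggests.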

10
What Next?
  • Get ready to do sequence comparisons
  • Sequence comparisons lie at the heart of all of
    bioinformatics
  • Dot Plot (pairwise) comparisons
  • Sequence database comparisons
  • Multiple alignments
  • Structure database comparisons

DNA or Protein?
11
Sequence Complexity
MCDEFGHIKLAN. High Complexity
ACTGTCACTGAT. Mid Complexity
NNNNTTTTTNNN. Low Complexity
Translate those DNA sequences!!!
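One rough way to put a number on "complexity" is the Shannon entropy of the residue composition. This measure is an assumption chosen for illustration (the slide does not name one); it is applied here to the three example strings above.

```python
import math
from collections import Counter

def shannon_entropy(sequence):
    """Composition entropy in bits per residue (higher = more complex)."""
    counts = Counter(sequence)
    total = len(sequence)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# The three example strings from the slide.
for seq in ("MCDEFGHIKLAN", "ACTGTCACTGAT", "NNNNTTTTTNNN"):
    print(seq, round(shannon_entropy(seq), 2))
```

A 20-letter protein alphabet can reach a much higher entropy than a 4-letter DNA alphabet, which is one way to read the advice to translate DNA before comparing.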
12
Dot Plots
13
Dot Plots
  • Invented in 1970 by Gibbs & McIntyre
  • Good for quick graphical overview
  • Simplest method for sequence comparison
  • Inter-sequence comparison
  • Intra-sequence comparison
  • Identifies internal repeats
  • Identifies domains or modules

14
Dot Plots with PepTool
15
Dot Plots with BLAST
16
Sequence Similarity & Sequence Searching

17
BLAST
  • Developed in 1990 and 1997 (S. Altschul)
  • Looks for clusters of nearby or locally dense
    similar or homologous k-tuples
  • First to use statistics to predict significance of
    initial matches, which saves on false leads
  • Uses larger word size than FASTA to accelerate
    the search process
  • Looks for High-scoring Segment Pairs (HSPs)

18
Different Flavours of BLAST
  • BLASTP - protein query against protein DB
  • BLASTN - DNA/RNA query against GenBank
  • BLASTX - 6 frame DNA query against proDB
  • TBLASTN - protein query against 6 frame GB
  • TBLASTX - 6 frame DNA query to 6 frame GB
  • PSI-BLAST - protein profile query in pDB
  • PHI-BLAST - protein pattern against pDB

19
Is This Alignment Significant?
Lysozyme versus Ribonuclease
20
Chance and Significance in Sequence Alignment

21
Gaussian Distribution
22
Poisson Distribution
23
Extreme Value Distribution
24
MT0895
  • MMKIQIYGTGCANCQMLEKNAREAVKELGIDAEFEKIKEMDQILEAGLTA
    LPGLAVDGELKIMGRVASKEEIKKILS

25
BLAST Output
26
BLAST Output
27
BLAST Output
28
BLAST Parameters
  • Identities - No. of exact residue matches
  • Positives - No. of identical and similar matches
  • Gaps - No. of gaps introduced (BLAST2)
  • Score - Summed HSP score
  • Expect - Expected no. of chance HSP alignments
  • S - Alignment (HSP) score cutoff
  • P - Probability of getting a score > X
  • T - Minimum word or k-tuple score (Threshold)

29
High-scoring Segment Pairs
PGQ 18, PEG 15, PRG 14, PKG 14, PNG 13, PDG 13, PQG 12, etc.
T
Query 325 LNKCKTPGQQRLVNQWIKQPLMDKN 350
          L    TP G R    W   P  D
Sbjct 290 LDCTVTPMGSRMLKRWLHMPVRDTR 315
30
Extending HSPs
E = kN e^(-λS): number of HSPs found purely by chance
Figure: cumulative score plotted against extension (# aa), with X, S and T marked
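The expectation formula on this slide, E = kN e^(-λS), is easy to evaluate numerically. In the sketch below the parameter values are illustrative assumptions only: k and λ depend on the scoring matrix and gap penalties, and N is taken to be the size of the search space. The conversion P = 1 - e^(-E) assumes the number of chance HSPs is Poisson distributed.

```python
import math

def expect_value(score, search_space, k, lam):
    """E = k * N * exp(-lambda * S): HSPs expected purely by chance."""
    return k * search_space * math.exp(-lam * score)

def p_value(expect):
    """Probability of at least one chance HSP, assuming a Poisson count."""
    return -math.expm1(-expect)

# Hypothetical search: 250-residue query against 50 million residues.
e = expect_value(score=60, search_space=250 * 50_000_000, k=0.13, lam=0.32)
print(f"E = {e:.3g}, P = {p_value(e):.3g}")
```

For small E the two numbers are nearly identical, which is why E and P are often used interchangeably when judging strong hits.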
31
If All Else Fails...
  • If two sequences are > 100 residues and > 25%
    identical, they are likely related
  • If two sequences are 15-25% identical they may be
    related, but more tests are needed
  • If two sequences are < 15% identical they are
    probably not related
  • If you need more than 1 gap for every 20 residues
    the alignment is suspicious
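These rules of thumb can be written down as a small check on an existing pairwise alignment. The sketch below is one possible reading of them: the function names are hypothetical, gaps are written as '-', and the handling of alignments that are above 25% identical but 100 residues or shorter is an assumption, since the rules do not cover that case.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over aligned (non-gap) columns of two gapped strings."""
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b)
             if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(a == b for a, b in pairs) / len(pairs)

def rule_of_thumb(identity_pct, aligned_residues, gaps=0):
    """Return (verdict, gaps_suspicious) following the rules above."""
    if identity_pct > 25 and aligned_residues > 100:
        verdict = "likely related"
    elif identity_pct >= 15:
        # Assumption: short alignments above 25% are also treated as uncertain.
        verdict = "may be related; more tests needed"
    else:
        verdict = "probably not related"
    # More than 1 gap per 20 residues makes the alignment suspicious.
    gaps_suspicious = gaps * 20 > aligned_residues
    return verdict, gaps_suspicious

print(rule_of_thumb(identity_pct=30, aligned_residues=150, gaps=3))
```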

32
Doolittle's Rules of Thumb
33
Scraping the Bottom of the Barrel with PSI-BLAST
34
PSI-BLAST
35
PSI-BLAST
36
PSI-BLAST
37
Moving from Multiple Hits to Multiple Alignments
38
Multiple Sequence Alignment
39
Finding Sequence Patterns
40
Rules of Thumb
  • Sequence pattern-based motifs should be
    determined from no fewer than 5 multiply aligned
    sequences
  • A good degree of sequence divergence is needed.
    If S is the similarity and N is the no. of
    sequences then 1 - S^N > 0.95
  • A good sequence pattern should have no fewer than
    8 defined amino acid positions
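The divergence rule 1 - S^N > 0.95 can be turned around to ask how many sequences are needed at a given similarity. The sketch below assumes S is expressed as a fraction between 0 and 1; the function name is hypothetical.

```python
def min_sequences(similarity, target=0.95):
    """Smallest N for which 1 - similarity**N exceeds the target."""
    if not 0 <= similarity < 1:
        raise ValueError("similarity must be a fraction in [0, 1)")
    n = 1
    while 1 - similarity ** n <= target:
        n += 1
    return n

# At 50% similarity five sequences are enough; at 70% more are needed.
for s in (0.5, 0.7):
    print(s, min_sequences(s))
```

At S = 0.5 the answer is 5, which matches the first rule's minimum of 5 multiply aligned sequences.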

41
Sequence Pattern Databases
  • PROSITE - http://www.expasy.ch/
  • BLOCKS - http://www.blocks.fhcrc.org/
  • DOMO - http://www.infobiogen.fr/gracy/domo
  • PFAM - http://pfam.wustl.edu
  • PRINTS - http://www.biochem.ucl.ac.uk/bsm/dbrowser/PRINTS
  • SEQSITE - PepTool
  • SEQSITE - PepTool

42
Sequence Profiles
43
Defining Sequence Profiles
44
A Sample Sequence Profile
Figure: aligned sequences seq1, seq2, seq3, seq4, ... with per-position score log2(qi/pi)
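The log2(qi/pi) score on this slide compares the observed frequency qi of a residue in an alignment column with its background frequency pi. The sketch below builds such a profile from a toy alignment. The uniform background (1/20 per amino acid), the pseudocount of 1 and the four example sequences are assumptions for illustration, not values from the slide.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def log_odds_profile(aligned, background=None, pseudocount=1.0):
    """Return one {residue: log2(q/p)} dict per alignment column."""
    if background is None:
        # Assumed background: every amino acid equally likely.
        background = {aa: 1.0 / len(ALPHABET) for aa in ALPHABET}
    profile = []
    for column in zip(*aligned):
        counts = Counter(c for c in column if c in background)
        total = sum(counts.values()) + pseudocount * len(background)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background[aa])
            for aa in background
        })
    return profile

# Toy alignment: column 2 is fully conserved (C), so C scores highest there.
profile = log_odds_profile(["ACD", "ACE", "ACD", "GCD"])
print(round(profile[1]["C"], 2), round(profile[1]["W"], 2))
```

Positive scores mark residues seen more often than background at that position; negative scores mark residues that are under-represented.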
45
Profiles & Motifs are Useful
  • Helped identify active site of HIV protease
  • Helped identify SH2/SH3 class of STPs
  • Helped identify important GTP oncoproteins
  • Helped identify hidden leucine zipper in HGA
  • Used to scan for lectin binding domains
  • Regularly used to predict T-cell epitopes

46
Domains are More Useful
47
Moving From Sequence To Structure
48
Membrane Spanning Regions
49
Predicting via Hydrophobicity
Bacteriorhodopsin, OmpA
50
Predicting via Hydrophobicity
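A hydrophobicity-based prediction usually averages a per-residue scale over a sliding window and flags stretches that stay above a cutoff. The slides do not say which scale or settings were used, so the sketch below assumes the Kyte-Doolittle scale with a 19-residue window and a cutoff of 1.6. The function names and the test sequence are hypothetical.

```python
# Kyte-Doolittle hydropathy values (assumed scale; not named on the slide).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(sequence, window=19):
    """Mean hydropathy of every full window along the sequence."""
    scores = [KYTE_DOOLITTLE[aa] for aa in sequence]
    return [sum(scores[i:i + window]) / window
            for i in range(len(scores) - window + 1)]

def candidate_tm_windows(sequence, window=19, cutoff=1.6):
    """Start positions (0-based) of windows at or above the cutoff."""
    profile = hydropathy_profile(sequence, window)
    return [i for i, score in enumerate(profile) if score >= cutoff]

# Artificial example: a 19-residue leucine stretch flanked by aspartates.
print(candidate_tm_windows("D" * 10 + "L" * 19 + "D" * 10))
```

This approach suits helical membrane proteins such as bacteriorhodopsin better than beta-barrel proteins such as OmpA, whose membrane strands are shorter and alternate hydrophobic and polar residues.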
51
Predicting via Neural Nets and PSSMs
  • PHDhtm http://dodo.cpmc.columbia.edu/predictprotein/
  • TMAP http://www.mbb.ki.se/tmap/index.html
  • TMPred http://www.ch.embnet.org/software/TMPRED_form.html

ACDEGF...
52
Secondary Structure Prediction
  • Statistical (Chou-Fasman, GOR)
  • Homology or Nearest Neighbor (Levin)
  • Physico-Chemical (Lim, Eisenberg)
  • Pattern Matching (Cohen, Rooman)
  • Neural Nets (Qian & Sejnowski, Karplus)
  • Evolutionary Methods (Barton, Niemann)
  • Combined Approaches (Rost, Levin, Argos)

53
The PHD Approach
PRFILE...
54
Best of the Best
  • PredictProtein-PHD (72%)
  • http://cubic.bioc.columbia.edu/predictprotein
  • Jpred (73-75%)
  • http://jura.ebi.ac.uk:8888/
  • PREDATOR (75%)
  • http://www.embl-heidelberg.de/cgi/predator_serv.pl
  • PSIpred (77%)
  • http://insulin.brunel.ac.uk/psipred

55
Definition
  • Threading - A protein fold recognition technique
    that involves incrementally replacing the
    sequence of a known protein structure with a
    query sequence of unknown structure. The new
    model structure is evaluated using a simple
    heuristic measure of protein fold quality. The
    process is repeated against all known 3D
    structures until an optimal fit is found.
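The definition above can be illustrated with a deliberately tiny toy model. Everything in the sketch below is an assumption made for illustration: templates are reduced to strings of buried ('B') and exposed ('E') positions, the fold-quality heuristic simply rewards hydrophobic residues in buried positions and polar residues in exposed ones, and the query is slid along each template without gaps. Real threading programs use far richer structural descriptions and energy functions.

```python
HYDROPHOBIC = set("AVLIMFWC")  # assumed hydrophobic residue set

def fold_score(query_segment, burial):
    """+1 for each residue whose character suits its environment, else -1."""
    return sum(
        1 if (aa in HYDROPHOBIC) == (env == "B") else -1
        for aa, env in zip(query_segment, burial)
    )

def thread(query, templates):
    """Try the query at every gapless offset on every template.

    templates: {name: burial string}. Returns (score, name, offset) of the
    best fit, or None if no template is long enough.
    """
    best = None
    for name, burial in templates.items():
        for offset in range(len(burial) - len(query) + 1):
            score = fold_score(query, burial[offset:offset + len(query)])
            if best is None or score > best[0]:
                best = (score, name, offset)
    return best

# Hypothetical template library and query.
library = {"foldA": "BBEEBBEE", "foldB": "EEEEEEEE"}
print(thread("LLKK", library))
```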

56
Visualizing Threading
THREADINGSEQNCEECNQESGNI ERHTHREADINGSEQNCETHREADGSEQNCEQCQESGIDAERTHR...
57
Visualizing Threading
THREADINGSEQNCEECNQESGNI ERHTHREADINGSEQNCETHREADGSEQNCEQCQESGIDAERTHR...
T H R E
58
Visualizing Threading
THREADINGSEQNCEECNQESGNI ERHTHREADINGSEQNCETHREADGSEQNCEQCQESGIDAERTHR...
T H
59
E-mail Address
Sequence
Name
60
GenThreader
Conf    Prob   Epair  Esolv  AlnSc  Alen  DLen  Tlen  PDB_ID
HIGH    0.915  -81.3   -0.5   69.0    77   108    77  2trxA0
MEDIUM  0.894  -81.2   -2.5   63.0    76   105    77  1erv00
MEDIUM  0.795  -53.5   -3.3   54.0    72    82    77  3grx00
LOW     0.667  -79.7   -0.4   54.0    77   107    77  1a8l02

>>> Alignment with 2trxA0
               10        20        30        40        50        60
       CCCEEECCCCCHHHHCCCCCCEEEEEEECCCCHHHHHHHHHHHHHHHHCCCCEEEEEEEC
2trxA0 SDKIIHLTDDSFDTDVLKADGAILVDFWAEWCGPCKMIAPILDEIADEYQGKLTVAKLNI
       ---------------------MMKIQIYGTGCANCQMLEKNAREAVKELGIDAEFEKI--
                                    10        20        30

               70        80        90       100
       CCCCCHHHHCCCCCCCEEEEEECCEEEEEEECCCCHHHHHHHHHHHHC
2trxA0 DQNPGTAPKYGIRGIPTLLLFKNGEVAATKVGALSKGQLKEFLDANLA
       -KEMDQILEAGLTALPGLAVDGELKIMGRVA---SKEEIKKILS----
         40        50        60           70
61
MMKIQIYGTGCANCQMLEKNAREAVKELGIDAEFEKIKEM
CCEEEEECCCCCHHHHHHHHHHHHHCCCCEEEEECCCHHH
DQILEAGLTALPGLAVDGELKIMGRVASKEEIKKILS
HHHHHHCCCCCCEEEECCEEEEECCCCHHHHHHHHHC
62
Your Bioinformatician
  • Ms. Haiyan Zhang
  • Rm. 2125 Dentistry/Pharmacy Bldng
  • 492-4934
  • hzhang_at_redpoll.pharmacy.ualberta.ca

Please feel free to contact her!