Bioinformatics Resources and Tools on the Web: A Primer

Transcript and Presenter's Notes

1
Bioinformatics Resources and Tools on the Web: A Primer
  • Joel H. Graber
  • Center for Advanced Biotechnology
  • Boston University

2
Outline
  • Introduction: What is bioinformatics?
  • The basics
  • The five sites that all biologists should know
  • Some examples
  • Using the tools in a somewhat less-than-naïve
    manner
  • Questions/comments are welcome at all points
  • Much of this material comes from the Boston
    University course: BF527 Bioinformatic
    Applications (http://matrix.bu.edu/BF527/)

3
What is bioinformatics?

4
Examples of Bioinformatics
  • Database interfaces: GenBank/EMBL/DDBJ, Medline,
    SwissProt, PDB, ...
  • Sequence alignment: BLAST, FASTA
  • Multiple sequence alignment: Clustal, MultAlin,
    DiAlign
  • Gene finding: Genscan, GenomeScan, GeneMark, GRAIL
  • Protein domain analysis and identification: Pfam,
    BLOCKS, ProDom, ...
  • Pattern identification/characterization: Gibbs
    Sampler, AlignACE, MEME
  • Protein folding prediction: PredictProtein,
    SWISS-MODEL

5
Things to know and remember about using web
server-based tools
  • You are using someone else's computer
  • You are (probably) getting a reduced set of
    options or capacity
  • Servers are great for sporadic or
    proof-of-principle work, but for intensive work,
    the software should be obtained and run locally

6
Five websites that all biologists should know
  • NCBI (The National Center for Biotechnology
    Information): http://www.ncbi.nlm.nih.gov/
  • EBI (The European Bioinformatics Institute):
    http://www.ebi.ac.uk/
  • The Canadian Bioinformatics Resource:
    http://www.cbr.nrc.ca/
  • SwissProt/ExPASy (Swiss Bioinformatics Resource):
    http://expasy.cbr.nrc.ca/sprot/
  • PDB (The Protein Data Bank):
    http://www.rcsb.org/PDB/

7
NCBI (http://www.ncbi.nlm.nih.gov/)
  • Entrez interface to databases: Medline/OMIM,
    GenBank/GenPept/Structures
  • BLAST server(s): five-plus flavors of BLAST
  • Draft Human Genome
  • Much, much more

8
EBI (http://www.ebi.ac.uk/)
  • SRS database interface: EMBL, SwissProt, and many
    more
  • Many server-based tools: ClustalW, DALI, ...

9
SwissProt (http://expasy.cbr.nrc.ca/sprot/)
  • Curation!!! The error rate in the information is
    greatly reduced in comparison to most other
    databases.
  • Extensive cross-linking to other data sources
  • SwissProt is the gold standard by which other
    databases can be measured, and is the best place
    to start if you have a specific protein to
    investigate

10
A few more resources to be aware of
  • Human Genome Working Draft:
    http://genome.ucsc.edu/
  • TIGR (The Institute for Genomic Research):
    http://www.tigr.org/
  • Celera: http://www.celera.com/
  • (Model) organism-specific information
      • Yeast:
        http://genome-www.stanford.edu/Saccharomyces/
      • Arabidopsis: http://www.tair.org/
      • Mouse: http://www.jax.org/
      • Fruitfly: http://www.fruitfly.org/
      • Nematode: http://www.wormbase.org/
  • Nucleic Acids Research Database Issue:
    http://nar.oupjournals.org/ (first issue every
    year)

11
Example 1: Searching a new genome for a specific
protein
  • Specific problem: We want to find the closest
    match in C. elegans of D. melanogaster protein
    NTF1, a transcription factor
  • First: understanding the different forms of BLAST

12
The different versions of BLAST
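As a quick reference for this slide's topic, the standard BLAST programs differ in the molecule type of the query, the molecule type of the database, and the level at which the comparison is made. The sketch below records that in a small lookup table; the dictionary layout and the helper name programs_for are illustrative choices, not part of the original presentation.

```python
# Each entry: (query molecule, database molecule, level of comparison).
BLAST_PROGRAMS = {
    "blastn": ("nucleotide", "nucleotide", "nucleotide"),
    "blastp": ("protein", "protein", "protein"),
    "blastx": ("nucleotide", "protein", "protein"),    # query is translated
    "tblastn": ("protein", "nucleotide", "protein"),   # database is translated
    "tblastx": ("nucleotide", "nucleotide", "protein"),  # both are translated
}


def programs_for(query_molecule, database_molecule):
    """List the BLAST programs that accept this query/database pairing."""
    return [
        name
        for name, (query, database, _level) in BLAST_PROGRAMS.items()
        if query == query_molecule and database == database_molecule
    ]


if __name__ == "__main__":
    # A protein query (such as NTF1) against a nucleotide database.
    print(programs_for("protein", "nucleotide"))
```

In Example 1, the protein query against predicted proteins corresponds to blastp, and the same query against genomic nucleotide sequence corresponds to tblastn.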

13
1st Step: Search the proteins
  • blastp is used to search for C. elegans proteins
    that are similar to NTF1
  • Two reasonable hits are found, but the hits have
    suspicious characteristics (besides the fact
    that they weren't included in the complete
    genome!)

14
2nd Step: Search the nucleotides
  • tblastn is used to search for translations of C.
    elegans nucleotide sequences that are similar to
    NTF1
  • Now we have only one hit
  • How are they related?
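tblastn compares a protein query against the six-frame translations of each nucleotide sequence in the database, which is why it can find coding regions regardless of how (or whether) they were annotated. The following is a minimal sketch of six-frame translation, assuming the standard genetic code and plain A/C/G/T input; the function names are illustrative, and this is not how BLAST itself is implemented.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons from position 0; unknown codons become X."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(dna):
    """Return translations for frames +1..+3 and -1..-3 of a DNA string."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]  # reverse complement
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


if __name__ == "__main__":
    for frame, peptide in six_frame_translations("ATGGCCTAA").items():
        print(frame, peptide)
```

Stop codons appear as `*`, so a frame interrupted by many stops is unlikely to be coding.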

15
Conclusion: Incorrect gene prediction/annotation
  • The two predicted proteins have essentially
    identical annotation
  • The protein-protein alignments are disjoint and
    consecutive on the protein
  • The protein-nucleotide alignment includes both
    protein-protein alignments in the proper order
  • Why/how does this happen?
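The reasoning on this slide can be written as a simple interval check on query coordinates: the two protein hits cover disjoint, consecutive parts of the query, and the single nucleotide hit spans both. A minimal sketch follows; the function name, the slack parameter, and all coordinates are hypothetical, and real tblastn output would normally be a set of per-exon alignments rather than one span.

```python
def looks_like_split_gene(protein_hits, nucleotide_span, slack=10):
    """Check for the pattern of one gene mispredicted as several proteins.

    protein_hits: (start, end) positions on the query protein, one pair per
        predicted protein hit by blastp (1-based, inclusive).
    nucleotide_span: (start, end) of the query covered by the tblastn hit.
    slack: tolerance, in residues, for gaps and ragged alignment ends.
    """
    hits = sorted(protein_hits)
    if len(hits) < 2:
        return False
    # Disjoint and consecutive: each hit starts shortly after the previous ends.
    for (_, prev_end), (next_start, _) in zip(hits, hits[1:]):
        gap = next_start - prev_end - 1
        if gap < 0 or gap > slack:
            return False
    # The nucleotide-level alignment must cover all protein-level alignments.
    span_start, span_end = nucleotide_span
    return span_start <= hits[0][0] + slack and span_end >= hits[-1][1] - slack


if __name__ == "__main__":
    # Hypothetical coordinates, for illustration only.
    print(looks_like_split_gene([(1, 180), (185, 400)], (3, 398)))
```

A positive result is only a hint that the annotation deserves a closer look, for example with a gene predictor run on the genomic region.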

16
Final(?) Check: Gene prediction
  • Genscan is the best available ab initio gene
    predictor (http://genes.mit.edu/GENSCAN.html)
  • Genscan's prediction spans both protein-protein
    alignments, reinforcing our conclusion of a bad
    prediction

17
Ab initio vs. similarity vs. hybrid models for
gene finding
  • Ab initio: The gene looks like the average of
    many genes
      • Genscan, GeneMark, GRAIL
  • Similarity: The gene looks like a specific known
    gene
      • Procrustes, ...
  • Hybrid: A combination of both
      • GenomeScan (http://genes.mit.edu/genomescan/)

18
A similar example: Fruitfly homolog of mRNA
localization protein VERA
  • Same procedure as just described
  • tblastn search with BLOSUM45 produces an
    unexpected exon
  • Conclusion: Incomplete (as opposed to incorrect)
    annotation
  • We have verified the existence of the rare
    isoform through RT-PCR

19
Another example: Find all genes with PDZ domains
  • Multiple methods are possible
  • The best method will depend on many things
      • How much do you know about the domain?
      • Do you know the exact extent of the domain?
      • How many examples do you expect to find?

20
Some possible methods if the domain is a known
domain
  • SwissProt
      • text search capabilities
      • good annotation of known domains
      • crosslinks to other databases (domains)
  • Databases of known domains
      • BLOCKS (http://blocks.fhcrc.org/)
      • Pfam (http://pfam.wustl.edu/)
      • Others (ProDom, ProSite, DOMO, ...)

21
Determination of the nature of conservation in a
domain
  • For new domains, multiple alignment is your best
    option
      • Global: ClustalW
      • Local: DiAlign
      • Hidden Markov Model: HMMER
  • For known domains, this work has largely been
    done for you
      • BLOCKS
      • Pfam
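Once a multiple alignment is in hand, one simple way to look at the nature of conservation is to score each column. The sketch below computes, for every column, the fraction of sequences that share the most common residue; the function name and the toy alignment are illustrative assumptions, and real analyses typically use richer measures (for example, substitution-aware or entropy-based scores).

```python
from collections import Counter


def column_conservation(alignment, gap="-"):
    """Per-column fraction of sequences sharing the most common residue.

    alignment: list of equal-length aligned sequences. Gap characters never
    count as the conserved residue, but gapped sequences stay in the
    denominator, so heavily gapped columns score low.
    """
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("aligned sequences must have equal length")
    scores = []
    for col in range(length):
        residues = [row[col] for row in alignment if row[col] != gap]
        if not residues:
            scores.append(0.0)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(alignment))
    return scores


if __name__ == "__main__":
    # Toy alignment, for illustration only.
    toy = ["GLGF-K", "GLGFSK", "GIGY-R"]
    for position, score in enumerate(column_conservation(toy), start=1):
        print(position, round(score, 2))
```

Columns scoring near 1.0 are candidates for the conserved core of the domain; low-scoring or gapped columns are more likely loops or linker regions.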

22
If you have a protein and want to search it against
known domains
  • Search/analysis tools
      • Pfam
      • BLOCKS
      • PredictProtein
        (http://cubic.bioc.columbia.edu/predictprotein/predictprotein.html)

23
Different representations of conserved domains
  • BLOCKS
      • Gapless regions
      • Often multiple blocks for one domain
  • Pfam
      • Statistical model, based on HMM
      • Since gaps are allowed, most domains have only
        one Pfam model
24
Conclusions
  • We have only touched small parts of the elephant
  • Trial and error (intelligently) is often your
    best tool
  • Keep up with the main five sites, and you'll have
    a pretty good idea of what is happening and
    available