Bioinformatics Resources and Tools on the Web: A Primer

Provided by: joelg5

Learn more at: https://pga.mgh.harvard.edu

1
Bioinformatics Resources and Tools on the Web: A Primer

Joel H. Graber
Center for Advanced Biotechnology
Boston University

2
Outline

Introduction: What is bioinformatics?
The basics
The five sites that all biologists should know
Some examples
Using the tools in a somewhat less-than-naïve manner
Questions/comments are welcome at all points
Much of this material comes from the Boston University course BF527, Bioinformatic Applications (http://matrix.bu.edu/BF527/)

3
What is bioinformatics?

4
Examples of Bioinformatics

Database interfaces: Genbank/EMBL/DDBJ, Medline, SwissProt, PDB, ...
Sequence alignment: BLAST, FASTA
Multiple sequence alignment: Clustal, MultAlin, DiAlign
Gene finding: Genscan, GenomeScan, GeneMark, GRAIL
Protein domain analysis and identification: Pfam, BLOCKS, ProDom, ...
Pattern identification/characterization: Gibbs Sampler, AlignACE, MEME
Protein folding prediction: PredictProtein, SWISS-MODEL

5
Things to know and remember about using web server-based tools

You are using someone else's computer
You are (probably) getting a reduced set of options or capacity
Servers are great for sporadic or proof-of-principle work, but for intensive work, the software should be obtained and run locally

6
Five websites that all biologists should know

NCBI (The National Center for Biotechnology Information): http://www.ncbi.nlm.nih.gov/
EBI (The European Bioinformatics Institute): http://www.ebi.ac.uk/
The Canadian Bioinformatics Resource: http://www.cbr.nrc.ca/
SwissProt/ExPASy (Swiss Bioinformatics Resource): http://expasy.cbr.nrc.ca/sprot/
PDB (The Protein Data Bank): http://www.rcsb.org/PDB/

7
NCBI (http://www.ncbi.nlm.nih.gov/)

Entrez interface to databases: Medline/OMIM, Genbank/Genpept/Structures
BLAST server(s): five-plus flavors of BLAST
Draft Human Genome
Much, much more

8
EBI (http://www.ebi.ac.uk/)

SRS database interface: EMBL, SwissProt, and many more
Many server-based tools: ClustalW, DALI, ...

9
SwissProt (http://expasy.cbr.nrc.ca/sprot/)

Curation!!!
The error rate in the information is greatly reduced in comparison to most other databases.
Extensive cross-linking to other data sources
SwissProt is the gold standard by which other databases can be measured, and is the best place to start if you have a specific protein to investigate

10
A few more resources to be aware of

Human Genome Working Draft: http://genome.ucsc.edu/
TIGR (The Institute for Genomic Research): http://www.tigr.org/
Celera: http://www.celera.com/
(Model) organism-specific information:
Yeast: http://genome-www.stanford.edu/Saccharomyces/
Arabidopsis: http://www.tair.org/
Mouse: http://www.jax.org/
Fruitfly: http://www.fruitfly.org/
Nematode: http://www.wormbase.org/
Nucleic Acids Research Database Issue: http://nar.oupjournals.org/ (first issue every year)

11
Example 1: Searching a new genome for a specific protein

Specific problem: We want to find the closest match in C. elegans of the D. melanogaster protein NTF1, a transcription factor
First: understanding the different forms of BLAST

12
The different versions of BLAST
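
The body of this slide is not in the transcript. As a rough reminder, the five classic BLAST programs differ in the type of query and database sequence they compare. The sketch below encodes that mapping; the function name choose_blast and its arguments are illustrative assumptions, not part of any BLAST distribution.

```python
# Query type and database type for the five classic BLAST programs.
# "translated nucleotide" means the nucleotide sequence is translated
# in all six reading frames and compared at the protein level.
BLAST_PROGRAMS = {
    "blastp": ("protein", "protein"),
    "blastn": ("nucleotide", "nucleotide"),
    "blastx": ("translated nucleotide", "protein"),
    "tblastn": ("protein", "translated nucleotide"),
    "tblastx": ("translated nucleotide", "translated nucleotide"),
}


def choose_blast(query, database, translate=False):
    """Pick a program from the sequence types ('protein' or 'nucleotide').

    `translate` only matters for nucleotide vs. nucleotide searches:
    True compares the six-frame translations (tblastx) instead of the
    raw nucleotides (blastn). This helper is hypothetical.
    """
    if query == "protein" and database == "protein":
        return "blastp"
    if query == "protein" and database == "nucleotide":
        return "tblastn"
    if query == "nucleotide" and database == "protein":
        return "blastx"
    if query == "nucleotide" and database == "nucleotide":
        return "tblastx" if translate else "blastn"
    raise ValueError("sequence types must be 'protein' or 'nucleotide'")


if __name__ == "__main__":
    # Example 1 uses a protein query against proteins, then against nucleotides.
    print(choose_blast("protein", "protein"))
    print(choose_blast("protein", "nucleotide"))
```
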
13
1st Step: Search the proteins

blastp is used to search for C. elegans proteins that are similar to NTF1
Two reasonable hits are found, but the hits have suspicious characteristics
besides the fact that they weren't included in the complete genome!

14
2nd Step: Search the nucleotides

tblastn is used to search for translations of C. elegans nucleotide sequences that are similar to NTF1
Now we have only one hit
How are they related?
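
To make "translations of nucleotide sequences" concrete: a translated search compares the protein query against all six reading frames of each nucleotide sequence. The following is a minimal sketch of six-frame translation using the standard genetic code. It is not BLAST's implementation, and the function names are illustrative.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate from the first base; codons with unknown bases become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to protein strings."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in six_frame_translations("ATGGCCTAA").items():
        print(frame, protein)
```

A real gene is usually spread over several exons, so a single protein can match pieces of different frames along the same stretch of genomic sequence.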

15
Conclusion: Incorrect gene prediction/annotation

The two predicted proteins have essentially identical annotation
The protein-protein alignments are disjoint and consecutive on the protein
The protein-nucleotide alignment includes both protein-protein alignments in the proper order
Why/how does this happen?
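
The reasoning on this slide can be written down as a small check on alignment coordinates. This is only a sketch: the Hit record, the max_gap threshold, and all coordinates are hypothetical and are not taken from the actual NTF1 search results.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """One alignment, described by the query residues it covers (1-based, inclusive)."""
    subject: str
    query_start: int
    query_end: int


def looks_like_split_gene(protein_hits, nucleotide_hit, max_gap=50):
    """Return True if two protein hits look like pieces of one mispredicted gene.

    Conditions, following the slide:
      1. the two protein-protein alignments do not overlap on the query,
      2. they are consecutive (at most `max_gap` query residues between them),
      3. the protein-nucleotide alignment covers both, in the same order.
    `max_gap` is an arbitrary illustrative threshold.
    """
    first, second = sorted(protein_hits, key=lambda h: h.query_start)
    disjoint = first.query_end < second.query_start
    consecutive = disjoint and (second.query_start - first.query_end - 1) <= max_gap
    covered = (
        nucleotide_hit.query_start <= first.query_start
        and second.query_end <= nucleotide_hit.query_end
    )
    return disjoint and consecutive and covered


if __name__ == "__main__":
    # Hypothetical coordinates, for illustration only.
    blastp_hits = [
        Hit("predicted_protein_B", 860, 1030),
        Hit("predicted_protein_A", 700, 850),
    ]
    tblastn_hit = Hit("genomic_clone", 695, 1032)
    print(looks_like_split_gene(blastp_hits, tblastn_hit))
```
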

16
Final(?) Check: Gene prediction

Genscan is the best available ab initio gene predictor
http://genes.mit.edu/GENSCAN.html
Genscan's prediction spans both protein-protein alignments, reinforcing our conclusion of a bad prediction

17
Ab initio vs. similarity vs. hybrid models for gene finding

Ab initio: The gene looks like the average of many genes (Genscan, GeneMark, GRAIL)
Similarity: The gene looks like a specific known gene (Procrustes, ...)
Hybrid: A combination of both (GenomeScan, http://genes.mit.edu/genomescan/)

18
A similar example: Fruitfly homolog of mRNA localization protein VERA

Similar procedure as just described
A tblastn search with BLOSUM45 produces an unexpected exon
Conclusion: Incomplete (as opposed to incorrect) annotation
We have verified the existence of the rare isoform through RT-PCR

19
Another example: Find all genes with PDZ domains

Multiple methods are possible
The best method will depend on many things:
How much do you know about the domain?
Do you know the exact extent of the domain?
How many examples do you expect to find?

20
Some possible methods if the domain is a known domain

SwissProt: text search capabilities, good annotation of known domains, crosslinks to other databases (domains)
Databases of known domains: BLOCKS (http://blocks.fhcrc.org/), Pfam (http://pfam.wustl.edu/), others (ProDom, ProSite, DOMO, ...)

21
Determination of the nature of conservation in a domain

For new domains, multiple alignment is your best option
Global: ClustalW
Local: DiAlign
Hidden Markov Model: HMMER
For known domains, this work has largely been done for you
BLOCKS
Pfam
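
Once a multiple alignment is in hand, one simple way to look at the nature of conservation is to score each column. The sketch below uses Shannon entropy per column (0 bits means perfectly conserved). The toy sequences are made up for illustration, and this measure is an assumption of the example, not something ClustalW, DiAlign, or HMMER reports in this form.

```python
import math
from collections import Counter


def column_entropies(aligned):
    """Shannon entropy (bits) of each alignment column, ignoring gap characters.

    `aligned` is a list of equal-length strings from a multiple alignment,
    with '-' marking gaps. An all-gap column yields NaN.
    """
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("all aligned sequences must have the same length")

    entropies = []
    for column in zip(*aligned):
        residues = [r for r in column if r != "-"]
        if not residues:
            entropies.append(float("nan"))
            continue
        total = len(residues)
        counts = Counter(residues)
        entropies.append(
            sum((n / total) * math.log2(total / n) for n in counts.values())
        )
    return entropies


if __name__ == "__main__":
    # Toy alignment, for illustration only.
    toy_alignment = ["GLGF", "GLGF", "GIGY", "GL-F"]
    for position, h in enumerate(column_entropies(toy_alignment), start=1):
        print(position, round(h, 3))
```

Low-entropy columns are candidates for the conserved core of the domain; high-entropy columns are where a strict pattern search would miss real family members.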

22
If you have a protein and want to search it against known domains

Search/Analysis tools
Pfam
BLOCKS
PredictProtein (http://cubic.bioc.columbia.edu/predictprotein/predictprotein.html)

23
Different representations of conserved domains

BLOCKS: gapless regions; often multiple blocks for one domain
Pfam: statistical model, based on HMM
Since gaps are allowed, most domains have only
one pfam model
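
To illustrate the gapless representation, the sketch below builds a simple position-specific scoring matrix from a toy block and slides it along a sequence. Because no gaps are allowed, each window is scored column by column. The uniform background, the pseudocount, and the toy sequences are all assumptions made for this example; it is not the weighting scheme the BLOCKS database uses, and it does not attempt a Pfam-style HMM.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(block, pseudocount=1.0):
    """Log-odds scores (bits) per column of a gapless block.

    Uses a uniform background and an additive pseudocount, both of which
    are simplifying assumptions for illustration.
    """
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for column in zip(*block):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def best_window(pssm, sequence):
    """Return (start, score) of the best-scoring gapless window, or None.

    `start` is a 0-based index into `sequence`. Residues outside the
    20-letter alphabet contribute a score of 0.
    """
    width = len(pssm)
    best = None
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(col.get(aa, 0.0) for col, aa in zip(pssm, window))
        if best is None or score > best[1]:
            best = (start, score)
    return best


if __name__ == "__main__":
    # Toy block and sequence, for illustration only.
    toy_block = ["GLGF", "GLGF", "GIGF", "GLGY"]
    print(best_window(build_pssm(toy_block), "MKTAGLGFSIV"))
```

A domain with a variable-length loop in the middle cannot be covered by one such matrix, which is why it ends up as several blocks, whereas a single HMM can absorb the insertion.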

24
Conclusions

We have only touched small parts of the elephant
Trial and error (intelligently) is often your best tool
Keep up with the main five sites, and you'll have a pretty good idea of what is happening and available
