Title: Bioinformatics Resources and Tools on the Web: A Primer
1. Bioinformatics Resources and Tools on the Web: A Primer
- Joel H. Graber
- Center for Advanced Biotechnology
- Boston University
2. Outline
- Introduction: What is bioinformatics?
- The basics
- The five sites that all biologists should know
- Some examples
- Using the tools in a somewhat less-than-naïve manner
- Questions/comments are welcome at all points
- Much of this material comes from the Boston University course BF527, Bioinformatic Applications (http://matrix.bu.edu/BF527/)
3. What is bioinformatics?
4. Examples of Bioinformatics
- Database interfaces
  - Genbank/EMBL/DDBJ, Medline, SwissProt, PDB, ...
- Sequence alignment
  - BLAST, FASTA
- Multiple sequence alignment
  - Clustal, MultAlin, DiAlign
- Gene finding
  - Genscan, GenomeScan, GeneMark, GRAIL
- Protein domain analysis and identification
  - Pfam, BLOCKS, ProDom, ...
- Pattern identification/characterization
  - Gibbs Sampler, AlignACE, MEME
- Protein folding prediction
  - PredictProtein, SWISS-MODEL
5. Things to know and remember about using web server-based tools
- You are using someone else's computer
- You are (probably) getting a reduced set of options or capacity
- Servers are great for sporadic or proof-of-principle work, but for intensive work, the software should be obtained and run locally
6. Five websites that all biologists should know
- NCBI (The National Center for Biotechnology Information)
  - http://www.ncbi.nlm.nih.gov/
- EBI (The European Bioinformatics Institute)
  - http://www.ebi.ac.uk/
- The Canadian Bioinformatics Resource
  - http://www.cbr.nrc.ca/
- SwissProt/ExPASy (Swiss Bioinformatics Resource)
  - http://expasy.cbr.nrc.ca/sprot/
- PDB (The Protein Data Bank)
  - http://www.rcsb.org/PDB/
7. NCBI (http://www.ncbi.nlm.nih.gov/)
- Entrez interface to databases
  - Medline/OMIM
  - Genbank/Genpept/Structures
- BLAST server(s)
  - Five-plus flavors of BLAST
- Draft Human Genome
- Much, much more
8. EBI (http://www.ebi.ac.uk/)
- SRS database interface
  - EMBL, SwissProt, and many more
- Many server-based tools
  - ClustalW, DALI, ...
9. SwissProt (http://expasy.cbr.nrc.ca/sprot/)
- Curation!!!
- Error rate in the information is greatly reduced in comparison to most other databases.
- Extensive cross-linking to other data sources
- SwissProt is the gold standard by which other databases can be measured, and is the best place to start if you have a specific protein to investigate
10. A few more resources to be aware of
- Human Genome Working Draft
  - http://genome.ucsc.edu/
- TIGR (The Institute for Genomic Research)
  - http://www.tigr.org/
- Celera
  - http://www.celera.com/
- (Model) Organism specific information
  - Yeast: http://genome-www.stanford.edu/Saccharomyces/
  - Arabidopsis: http://www.tair.org/
  - Mouse: http://www.jax.org/
  - Fruitfly: http://www.fruitfly.org/
  - Nematode: http://www.wormbase.org/
- Nucleic Acids Research Database Issue
  - http://nar.oupjournals.org/ (First issue every year)
11. Example 1: Searching a new genome for a specific protein
- Specific problem: We want to find the closest match in C. elegans of D. melanogaster protein NTF1, a transcription factor
- First: understanding the different forms of BLAST
12. The different versions of BLAST
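The BLAST variants differ in the type of query they accept and the type of database they search. The following is a minimal sketch of that mapping, assuming the five classic programs (blastn, blastp, blastx, tblastn, tblastx); the helper `programs_for` is a hypothetical name used for illustration and is not part of any BLAST distribution.

```python
# Each entry: program name -> (query type, database type, what is actually compared).
BLAST_PROGRAMS = {
    "blastn": ("nucleotide", "nucleotide", "nucleotide vs. nucleotide"),
    "blastp": ("protein", "protein", "protein vs. protein"),
    "blastx": ("nucleotide", "protein", "translated query vs. protein"),
    "tblastn": ("protein", "nucleotide", "protein vs. translated database"),
    "tblastx": ("nucleotide", "nucleotide", "translated query vs. translated database"),
}


def programs_for(query_type, database_type):
    """Return the BLAST programs that accept this query/database combination."""
    return [
        name
        for name, (query, database, _) in BLAST_PROGRAMS.items()
        if query == query_type and database == database_type
    ]


if __name__ == "__main__":
    # A protein query against a protein database, then against genomic DNA.
    print(programs_for("protein", "protein"))      # ['blastp']
    print(programs_for("protein", "nucleotide"))   # ['tblastn']
```

In the example that follows, the protein query NTF1 is first searched against predicted proteins (blastp) and then against the nucleotide sequence (tblastn).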
13. 1st Step: Search the proteins
- blastp is used to search for C. elegans proteins that are similar to NTF1
- Two reasonable hits are found, but the hits have suspicious characteristics
  - besides the fact that they weren't included in the complete genome!
14. 2nd Step: Search the nucleotides
- tblastn is used to search for translations of C. elegans nucleotide sequences that are similar to NTF1
- Now we have only one hit
- How are they related?
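Conceptually, tblastn compares a protein query against the nucleotide database translated in all six reading frames, which is why it does not depend on existing gene predictions. The sketch below illustrates only that translation step, assuming the standard genetic code. It is not how BLAST itself is implemented, and the function names are illustrative.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(dna):
    """Return a dict of frame label ('+1'..'+3', '-1'..'-3') to translation."""
    dna = dna.upper()
    frames = {}
    for strand, sequence in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(sequence[offset:])
    return frames


if __name__ == "__main__":
    for label, protein in six_frame_translations("ATGGCCTAA").items():
        print(label, protein)
```

Because every frame is searched, a single genomic region can match the whole query even when the annotated gene models split it into separate predicted proteins.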
15. Conclusion: Incorrect gene prediction/annotation
- The two predicted proteins have essentially identical annotation
- The protein-protein alignments are disjoint and consecutive on the protein
- The protein-nucleotide alignment includes both protein-protein alignments in the proper order
- Why/how does this happen?
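The reasoning above can be expressed as a simple check on alignment coordinates along the query protein. This is a rough sketch under stated assumptions: the coordinates are invented for illustration (they are not the actual NTF1 results), and the `max_gap` and `slack` tolerances are arbitrary choices.

```python
def overlaps(a, b):
    """True if closed intervals a and b, given as (start, end), share a position."""
    return a[0] <= b[1] and b[0] <= a[1]


def disjoint_and_consecutive(first, second, max_gap=20):
    """True if two query intervals do not overlap and lie next to each other."""
    lower, upper = sorted([first, second])
    gap = upper[0] - lower[1] - 1
    return not overlaps(lower, upper) and gap <= max_gap


def spans_both(long_hit, first, second, slack=10):
    """True if long_hit covers both shorter intervals, allowing ragged ends."""
    start = min(first[0], second[0])
    end = max(first[1], second[1])
    return long_hit[0] <= start + slack and long_hit[1] >= end - slack


# Hypothetical query coordinates (amino acid positions), for illustration only.
BLASTP_HIT_A = (1, 480)
BLASTP_HIT_B = (495, 1060)
TBLASTN_HIT = (3, 1058)

if __name__ == "__main__":
    split_gene_suspected = (
        disjoint_and_consecutive(BLASTP_HIT_A, BLASTP_HIT_B)
        and spans_both(TBLASTN_HIT, BLASTP_HIT_A, BLASTP_HIT_B)
    )
    print(split_gene_suspected)  # True
```

When both conditions hold, the two predicted proteins are plausibly fragments of one gene that the annotation split in two.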
16. Final(?) Check: Gene prediction
- Genscan is the best available ab initio gene predictor
  - http://genes.mit.edu/GENSCAN.html
- Genscan's prediction spans both protein-protein alignments, reinforcing our conclusion of a bad prediction
17. Ab initio vs. similarity vs. hybrid models for gene finding
- Ab initio: The gene looks like the average of many genes
  - Genscan, GeneMark, GRAIL
- Similarity: The gene looks like a specific known gene
  - Procrustes, ...
- Hybrid: A combination of both
  - GenomeScan (http://genes.mit.edu/genomescan/)
18. A similar example: Fruitfly homolog of mRNA localization protein VERA
- Similar procedure as just described
- tblastn search with BLOSUM45 produces an unexpected exon
- Conclusion: Incomplete (as opposed to incorrect) annotation
- We have verified the existence of the rare isoform through RT-PCR
19. Another example: Find all genes with PDZ domains
- Multiple methods are possible
- The best method will depend on many things
  - How much do you know about the domain?
  - Do you know the exact extent of the domain?
  - How many examples do you expect to find?
20. Some possible methods if the domain is a known domain
- SwissProt
  - text search capabilities
  - good annotation of known domains
  - crosslinks to other databases (domains)
- Databases of known domains
  - BLOCKS (http://blocks.fhcrc.org/)
  - Pfam (http://pfam.wustl.edu/)
  - Others (ProDom, ProSite, DOMO, ...)
21. Determination of the nature of conservation in a domain
- For new domains, multiple alignment is your best option
  - Global: ClustalW
  - Local: DiAlign
  - Hidden Markov Model: HMMER
- For known domains, this work has largely been done for you
  - BLOCKS
  - Pfam
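Once a multiple alignment is in hand, one simple way to see where a domain is conserved is to score each column. The sketch below is a minimal illustration, assuming the alignment is already available as equal-length strings with "-" for gaps. The toy sequences are invented (loosely modeled on the GLGF motif found in PDZ domains), and the score used here, the fraction of rows sharing the most common residue, is only one of many possible conservation measures.

```python
from collections import Counter


def column_conservation(alignment):
    """Return, per column, the fraction of rows carrying the most common residue.

    Gap characters ('-') are never counted as the most common residue, but they
    do count toward the number of rows, so gappy columns score low.
    """
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")

    scores = []
    for column in zip(*alignment):
        residues = [residue for residue in column if residue != "-"]
        if not residues:
            scores.append(0.0)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores


# Toy alignment, invented for illustration.
TOY_ALIGNMENT = [
    "GLGF-SI",
    "GLGFNSI",
    "GIGF-TV",
    "GLGY-SI",
]

if __name__ == "__main__":
    print(column_conservation(TOY_ALIGNMENT))
    # [1.0, 0.75, 1.0, 0.75, 0.25, 0.75, 0.75]
```

Columns scoring near 1.0 are candidates for the conserved core of the domain, while low-scoring or gappy columns mark variable regions.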
22. If you have a protein and want to search it against known domains
- Search/Analysis tools
  - Pfam
  - BLOCKS
  - PredictProtein (http://cubic.bioc.columbia.edu/predictprotein/predictprotein.html)
23. Different representations of conserved domains
- BLOCKS
  - Gapless regions
  - Often multiple blocks for one domain
- Pfam
  - Statistical model, based on HMM
  - Since gaps are allowed, most domains have only one Pfam model
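The difference between the two representations can be seen on a small alignment: wherever a gap appears, a gapless representation has to start a new block. The following is a minimal sketch of that idea, assuming a toy alignment invented for illustration. It is not the procedure the BLOCKS database actually uses to build its blocks, and `min_length` is an arbitrary threshold.

```python
def gapless_blocks(alignment, min_length=3):
    """Return (start, end) column ranges, end exclusive, that contain no gaps."""
    blocks = []
    start = None
    for index, column in enumerate(zip(*alignment)):
        if "-" not in column:
            if start is None:
                start = index  # a new gapless run begins here
        else:
            if start is not None and index - start >= min_length:
                blocks.append((start, index))
            start = None
    # Close a run that extends to the end of the alignment.
    width = len(alignment[0])
    if start is not None and width - start >= min_length:
        blocks.append((start, width))
    return blocks


# Toy alignment, invented for illustration: one domain with an internal indel.
TOY_ALIGNMENT = [
    "GLGF--SIVGG",
    "GLGFN-SIRGG",
    "GIGF--TVKGG",
]

if __name__ == "__main__":
    print(gapless_blocks(TOY_ALIGNMENT))  # [(0, 4), (6, 11)]
```

Here a single domain yields two blocks, one on each side of the indel, whereas a model that allows gaps could describe the whole region at once.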
24. Conclusions
- We have only touched small parts of the elephant
- Trial and error (intelligently) is often your best tool
- Keep up with the main five sites, and you'll have a pretty good idea of what is happening and available