Title: Advanced bioinformatics tools for analyzing the Arabidopsis genome
1 Advanced bioinformatics tools for analyzing the Arabidopsis genome
- Proteins of Arabidopsis thaliana (PAT)
- Gene Ontology (GO)
Hongyu Zhang, Ph.D.
2 [Diagram: Sequence, Bioinformatics, Structure, Function]
3 PAT: Structure-aided function annotation
- PAT is a collaborative project between Ceres and the San Diego Supercomputer Center (http://pat.sdsc.edu)
- Importance of structure-aided function annotation
  - Structure carries more functional information than sequence, such as active sites and binding motifs.
  - Structure is more conserved than sequence during evolution, so proteins can share similar structures even without clearly detectable sequence similarity. Functional relationships are therefore more likely to be found through structure similarity, using advanced structure prediction methods such as PSI-BLAST and threading algorithms, than through sequence similarity alone.
  - Structure prediction programs can also predict structural features of proteins, such as transmembrane tendency, electrostatic potential distribution, or coiled-coil tendency. These features help biologists infer the possible functions of novel genes.
4 Fold recognition
- Frequently implies biochemical function
5 Highlights in PAT annotations
- Domain-based prediction
  - Structure domain
    - PDB, SCOP
  - Sequence domain
    - Pfam
- Predictions are strictly benchmarked
6 Reliability categories
Category   Reliability level   Benchmark
A          Certain             >99.9%
B          Reliable            >99%
C          Probable            >90%
D          Possible            >50%
E          Potential           >10%
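The table above amounts to a threshold lookup. A minimal sketch, assuming the benchmark values are percentages and that each cutoff is a strict "greater than"; the function name and the handling of scores at or below 10 are illustrative choices, not part of PAT:

```python
# Cutoffs copied from the reliability table; reading them as percentages
# with a strict ">" comparison is an assumption.
THRESHOLDS = [
    ("A", "Certain", 99.9),
    ("B", "Reliable", 99.0),
    ("C", "Probable", 90.0),
    ("D", "Possible", 50.0),
    ("E", "Potential", 10.0),
]


def reliability_category(benchmark_pct):
    """Return (category, level) for a benchmark percentage, or None if it is <= 10."""
    for category, level, cutoff in THRESHOLDS:
        if benchmark_pct > cutoff:
            return category, level
    return None


print(reliability_category(99.5))
```

Keeping the cutoffs in one ordered list means the first matching row is always the most reliable category the score qualifies for.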
7 Methods
- Programs
  - Protein sequences were analyzed using a spectrum of programs, including structure prediction, function prediction, and feature annotation methods.
- Database
  - All results were organized and stored in an Oracle relational database for ease of data access and processing.
- Interface
  - Web-based interface, convenient for both computational and non-computational biologists.
8 Programs used in the PAT pipeline
- Protein structure and function
  - Homology modeling
    - BLAST, PSI-BLAST search against a protein structure database
  - Threading
    - 123D search against a protein fold library
- Protein class and features
  - COILS, TMHMM, SignalP, PSIPRED, PSORT
9 [Flowchart: PAT annotation pipeline]
- Input: protein sequences
- Data sources: sequence info (NR, PFAM); structure info (SCOP, PDB)
- Prediction of:
  - signal peptides (SignalP, PSORT)
  - transmembrane regions (TMHMM, PSORT)
  - coiled coils (COILS)
  - low complexity regions (SEG)
- Building FOLDLIB:
  - PDB chains, SCOP domains, PDP domains, CE matches PDB vs. SCOP
  - 90% sequence non-identical
  - minimum size 25 aa
  - coverage (90%, gaps <30, ends <30)
- Create PSI-BLAST profiles for protein sequences
- Structural assignment of domains by PSI-BLAST on FOLDLIB
- Structural assignment of domains by 123D on FOLDLIB (only sequences without A-prediction)
- Functional assignment by PFAM, NR, PSIPRED assignments (only sequences without A-prediction)
- Domain location prediction by sequence
- Store assigned regions in the DB
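The "only sequences without A-prediction" steps describe a cascade: a sequence is passed to the next, more sensitive method only while it still lacks a category A assignment. A minimal sketch of that control flow; the function, the stage callables, and the toy lookups are hypothetical stand-ins for the real PSI-BLAST and 123D searches, not PAT code:

```python
def run_cascade(sequences, stages):
    """Run stages in order; a sequence moves on only while it lacks an A-level hit.

    stages is a list of (name, predict) pairs, where predict(seq) returns a
    reliability category letter ("A" to "E") or None when nothing is found.
    Returns the collected assignments and the sequences still lacking an A.
    """
    assignments = {}
    pending = list(sequences)
    for name, predict in stages:
        still_pending = []
        for seq in pending:
            category = predict(seq)
            if category is not None:
                assignments.setdefault(seq, []).append((name, category))
            if category != "A":
                still_pending.append(seq)
        pending = still_pending
    return assignments, pending


# Toy predictors (made up): dictionary lookups in place of real searches.
psi_blast = {"seq1": "A", "seq2": "C"}.get
threading_123d = {"seq2": "A", "seq3": "D"}.get

assignments, unresolved = run_cascade(
    ["seq1", "seq2", "seq3"],
    [("PSI-BLAST", psi_blast), ("123D", threading_123d)],
)
print(assignments)
print(unresolved)
```

In this sketch the sequences left in `unresolved` correspond to those that would continue to the functional assignment step.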
10 GUI: Top Level
11 Example: P450 family
- Sequence relatives detected by ordinary BLAST search
  - 313 hits when the E-value cutoff is 0.001
  - 324 hits when the E-value cutoff is 0.01
- Sequence relatives detected by PAT
  - 367 hits with confidence greater than or equal to 99%
12 Figure 2. SCOP results at the superfamily level. The figure plots the
number of true positive predictions against the number of false positive
predictions for the SCOP test set. Two proteins are considered to have the
same structure at the superfamily level if they share the first three
fields of their SCOP sccs ids, e.g., d.126.1.1 and d.126.1.2. The results
show that PSI-BLAST is superior to both NCBI-BLAST and WU-BLAST in picking
up true positives.
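The superfamily criterion in the caption can be written down directly. A minimal sketch, assuming sccs ids are dot-separated strings; the function names and the toy pairs are illustrative and do not come from the benchmark:

```python
def same_superfamily(sccs_a, sccs_b):
    """True if two SCOP sccs ids share their class, fold, and superfamily fields."""
    fields_a = sccs_a.split(".")
    fields_b = sccs_b.split(".")
    if len(fields_a) < 3 or len(fields_b) < 3:
        return False  # too short to name a superfamily
    return fields_a[:3] == fields_b[:3]


def count_true_false(pairs):
    """Count (true positives, false positives) over predicted (query, hit) pairs."""
    true_pos = sum(1 for a, b in pairs if same_superfamily(a, b))
    return true_pos, len(pairs) - true_pos


# Toy predicted pairs (made up for illustration).
predicted = [
    ("d.126.1.1", "d.126.1.2"),
    ("d.126.1.1", "d.126.2.1"),
    ("a.1.1.1", "a.1.1.3"),
]
tp, fp = count_true_false(predicted)
print(tp, fp)
```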
13 Acknowledgement
- Dr. Nickolai Alexandrov
- Dr. Philip E. Bourne
- Dr. Wilfred W. Li
- Dr. Greg B. Quinn
- Dr. Ilya E. Shindyalov
14 Gene Ontology (GO) project
- Gene Ontology Consortium (http://www.geneontology.org)
- Controlled vocabularies for the description of gene functions.
- Three dimensions
  - Molecular Function
    - the tasks performed by individual gene products; examples are transcription factor and DNA helicase
  - Biological Process
    - broad biological goals, such as purine metabolism or mitosis, that are accomplished by ordered assemblies of molecular functions
  - Cellular Component
    - subcellular structures, locations, and macromolecular complexes; examples include nucleus, telomere, and origin recognition complex
15 Three dimensions of GO
Biological process
Gene product
Molecular Function
Cellular Component
16 Hierarchical structure of GO term tree
.GO:0003673 Gene_Ontology
  .GO:0003674 molecular_function
    .GO:0005488 binding
      .GO:0003676 nucleic acid binding
        .GO:0003677 DNA binding
          .GO:0003700 transcription factor
    .GO:0030528 transcription regulator
      .GO:0003700 transcription factor
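Because transcription factor (GO:0003700) appears under two parents, the hierarchy is a directed acyclic graph rather than a strict tree. A minimal sketch of collecting all ancestors of a term, assuming the parent links are read off this slide with transcription regulator placed directly under molecular_function; the `PARENTS` table and `ancestors` function are illustrative, not a GO library API:

```python
# Child -> parents, as read from the term listing on this slide.
PARENTS = {
    "GO:0003674": ["GO:0003673"],                # molecular_function
    "GO:0005488": ["GO:0003674"],                # binding
    "GO:0003676": ["GO:0005488"],                # nucleic acid binding
    "GO:0003677": ["GO:0003676"],                # DNA binding
    "GO:0030528": ["GO:0003674"],                # transcription regulator
    "GO:0003700": ["GO:0003677", "GO:0030528"],  # transcription factor
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links upward."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


print(sorted(ancestors("GO:0003700")))
```

An annotation to a term implies annotation to all of its ancestors, which is why both branches are followed.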
17 The evidence codes used in GO
- IC inferred by curator
- IDA inferred from direct assay
- IEA inferred from electronic annotation
- IEP inferred from expression pattern
- IGI inferred from genetic interaction
- IMP inferred from mutant phenotype
- IPI inferred from physical interaction
- ISS inferred from sequence or structural similarity
- NAS non-traceable author statement
- ND no biological data available
- TAS traceable author statement
- NR not recorded
18 Process for annotating Ceres peptides
- Download GO annotations from the TAIR website (http://www.arabidopsis.org)
- Annotating methods
  - If the sequence of the Ceres peptide is the same as a GO database sequence based on locus name, copy all the annotations of the GO database sequence to the Ceres peptide.
  - Else, for each Ceres peptide, pick its best hit that has a TAIR annotation, and copy that hit's annotations to the Ceres peptide.
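The If/Else rule above can be sketched as a small function. This is an illustration under stated assumptions, not the actual Ceres pipeline: the data layout, the function name, the use of a "higher is better" similarity score to define the best hit, and the placeholder locus names are all hypothetical.

```python
def transfer_annotations(peptides, tair_annotations, hits):
    """Copy GO annotations to peptides by locus match, else from the best annotated hit.

    peptides:          peptide id -> locus name (or None if no locus match)
    tair_annotations:  locus name -> list of GO term ids
    hits:              peptide id -> list of (locus name, score); higher score is
                       assumed to mean a better hit
    """
    result = {}
    for pep_id, locus in peptides.items():
        if locus in tair_annotations:
            # Same sequence by locus name: copy all annotations.
            result[pep_id] = (list(tair_annotations[locus]), "locus match")
            continue
        # Otherwise keep only hits that carry a TAIR annotation, then take the best.
        annotated = [(hit, score) for hit, score in hits.get(pep_id, [])
                     if hit in tair_annotations]
        if annotated:
            best_locus, _ = max(annotated, key=lambda pair: pair[1])
            result[pep_id] = (list(tair_annotations[best_locus]), "best hit")
    return result


# Toy inputs (made up): locusC scores highest for p2 but has no annotation.
peptides = {"p1": "locusA", "p2": None, "p3": None}
tair_annotations = {"locusA": ["GO:0003700"], "locusB": ["GO:0003677"]}
hits = {"p2": [("locusC", 500.0), ("locusB", 300.0)], "p3": [("locusC", 100.0)]}

annotated_peptides = transfer_annotations(peptides, tair_annotations, hits)
print(annotated_peptides)
```

Filtering to annotated hits before choosing the best one reflects the wording "best hit that has a TAIR annotation": an unannotated top hit is skipped instead of leaving the peptide without annotation.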
19 Example: P450 family
- Sequence relatives detected by simple BLAST search
  - 313 hits when the E-value cutoff is 0.001
  - 324 hits when the E-value cutoff is 0.01
- Sequence relatives detected by PAT
  - 367 hits with confidence greater than or equal to 99%
- Sequence relatives annotated by GO
  - 365 hits
  - Number of hits by evidence code
    - 295 with ISS (inferred from sequence or structural similarity)
    - 67 with IEA (inferred from electronic annotation)
    - 2 with TAS (traceable author statement)
    - 1 with IDA (inferred from direct assay)
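As a quick consistency check, the evidence-code breakdown should add up to the 365 GO-annotated hits. A small sketch using only the counts reported on this slide; the variable names are illustrative:

```python
from collections import Counter

# Counts as reported in the evidence breakdown on this slide.
evidence_counts = Counter({"ISS": 295, "IEA": 67, "TAS": 2, "IDA": 1})

total = sum(evidence_counts.values())
shares = {code: count / total for code, count in evidence_counts.items()}

for code, count in evidence_counts.most_common():
    print(f"{code}: {count} ({shares[code]:.1%})")
print("total:", total)
```

The total comes to 365, and most of the annotations rest on similarity-based (ISS) or electronic (IEA) evidence, with few backed by author statements or direct assays.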
20 Acknowledgement
- Dr. Nickolai Alexandrov
- Mr. Eric Zetterbaum
- Dr. Richard Flavell
- etc.