Advanced bioinformatics tools for analyzing the Arabidopsis genome

Slides: 21

Provided by: hzh31

Learn more at: https://hongyu.org

Transcript and Presenter's Notes

Title: Advanced bioinformatics tools for analyzing the Arabidopsis genome

1
Advanced bioinformatics tools for analyzing the Arabidopsis genome
Proteins of Arabidopsis thaliana (PAT)
Gene Ontology (GO)
Hongyu Zhang, Ph.D.
2
Sequence
Bioinformatics
Structure
Function
3
PAT: Structure-aided function annotation

PAT is a collaborative project between Ceres and
the San Diego Supercomputer Center
http://pat.sdsc.edu
Importance of structure-aided function annotation
Structure carries more functional information than
sequence, such as active sites and binding motifs.
Structure is more conserved than sequence during
evolution, so proteins can have similar structures
even when no sequence similarity is clearly
detectable. We therefore have a better chance of
finding functional relationships through structure
similarity than through sequence similarity alone,
using sensitive methods such as PSI-BLAST searches
against structure databases and threading
algorithms.
Structure prediction programs can also predict
many structural features of proteins, such as
transmembrane tendency, electrostatic potential
distribution, or coiled-coil tendency. These
features also help biologists infer the possible
functions of novel genes.

4
Fold recognition

Frequently implies biochemical function

5
Highlights in PAT annotations

Domain-based prediction
Structure domain
PDB, SCOP
Sequence domain
Pfam
Predictions are strictly benchmarked

6
Reliability categories
Category  Reliability level  Benchmark
A         Certain            >99.9%
B         Reliable           >99%
C         Probable           >90%
D         Possible           >50%
E         Potential          >10%
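The category table can be read as a simple threshold lookup. The following is a minimal Python sketch, assuming the benchmark values are percentages of correct predictions and that each category covers scores strictly above its threshold. The name `reliability_category` is hypothetical and is not part of PAT.

```python
# Thresholds taken from the table above; reading them as percent
# accuracy is an assumption.
CATEGORIES = [
    ("A", "Certain", 99.9),
    ("B", "Reliable", 99.0),
    ("C", "Probable", 90.0),
    ("D", "Possible", 50.0),
    ("E", "Potential", 10.0),
]


def reliability_category(benchmark_accuracy):
    """Return (letter, label) for a benchmarked accuracy in percent, or None."""
    for letter, label, threshold in CATEGORIES:
        if benchmark_accuracy > threshold:
            return letter, label
    return None  # at or below every threshold in the table


print(reliability_category(99.95))  # ('A', 'Certain')
print(reliability_category(75.0))   # ('D', 'Possible')
```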
7
Methods

Programs
Protein sequences were analyzed with a range of
programs, including structure prediction,
function prediction and feature annotation
methods.
Database
All results were organized and stored in an
Oracle relational database for ease of data
access and processing.
Interface
A web-based interface convenient for both
computational and non-computational
biologists.

8
Programs used in PAT pipeline

Protein structure and function
Homology modeling
BLAST, PSI-BLAST search against protein structure
database
Threading
123D search against a protein fold library
Protein class and features
COILS, TMHMM, SignalP, PSIPRED, PSORT

9
Protein sequences
Sequence info: NR, PFAM
Structure info: SCOP, PDB
Prediction of: signal peptides (SignalP, PSORT); transmembrane regions (TMHMM, PSORT); coiled coils (COILS); low-complexity regions (SEG)
Building FOLDLIB: PDB chains; SCOP domains; PDP domains; CE matches PDB vs. SCOP; 90% sequence non-identical; minimum size 25 aa; coverage (90%, gaps <30, ends <30)

Create PSI-BLAST profiles for protein sequences
Structural assignment of domains by PSI-BLAST on FOLDLIB
Only sequences without A-prediction
Structural assignment of domains by 123D on FOLDLIB
Only sequences without A-prediction
Functional assignment by PFAM, NR, PSIPRED assignments
Domain location prediction by sequence
FOLDLIB
Store assigned regions in the DB
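
The control flow on this slide, in which a further method is applied only to sequences that still lack an A-level prediction, could be sketched as follows. This is an illustration under stated assumptions, not the PAT code: `assign_structure`, `run_psiblast` and `run_123d` are hypothetical names, and the two search steps are passed in as callables that return a list of `(fold_id, category)` pairs.

```python
def has_a_prediction(assignments):
    """True if any assignment falls in reliability category A."""
    return any(category == "A" for _fold, category in assignments)


def assign_structure(sequences, run_psiblast, run_123d):
    """Cascade: PSI-BLAST on FOLDLIB first, 123D only where no A-prediction exists.

    sequences: dict mapping sequence id -> amino-acid string.
    run_psiblast, run_123d: hypothetical callables, seq -> [(fold_id, category)].
    """
    results = {}
    for seq_id, seq in sequences.items():
        assignments = list(run_psiblast(seq))
        if not has_a_prediction(assignments):
            # Threading is attempted only for sequences without an A-prediction.
            assignments.extend(run_123d(seq))
        results[seq_id] = assignments
    return results
```

Passing the search steps in as functions keeps the sketch independent of how PSI-BLAST or 123D are actually invoked, which the slides do not specify.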
10
GUI: Top Level
11
Example: P450 family

Sequence relatives detected by an ordinary BLAST
search
313 hits when the E-value cutoff is 0.001
324 hits when the E-value cutoff is 0.01
Sequence relatives detected by PAT
367 hits with confidence greater than or equal to 99%
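
How the hit count depends on the E-value cutoff can be illustrated with a small filter. The hit list below is invented purely for illustration and does not reproduce the P450 numbers above; `count_hits` is a hypothetical helper.

```python
def count_hits(hits, evalue_cutoff):
    """Count hits whose E-value is at or below the cutoff."""
    return sum(1 for _name, evalue in hits if evalue <= evalue_cutoff)


# Made-up (hit name, E-value) pairs, for illustration only.
example_hits = [("hit1", 1e-50), ("hit2", 5e-4), ("hit3", 5e-3), ("hit4", 0.5)]

print(count_hits(example_hits, 0.001))  # 2
print(count_hits(example_hits, 0.01))   # 3
```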

12
Figure 2. SCOP results, superfamily level. The
figure shows the number of true-positive predictions
versus the number of false-positive predictions
for the SCOP test set. Here, if two proteins
share the first three fields of their SCOP sccs
ids, e.g., d.126.1.1 and d.126.1.2, they are
considered to have the same structure at the
superfamily level. The results show that
PSI-BLAST is superior to both NCBI-BLAST and
WU-BLAST at picking up true positives.
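
The superfamily criterion from the caption is easy to state in code. The following is a minimal sketch, assuming sccs ids are dot-separated strings of the form class.fold.superfamily.family; the function names are hypothetical, and `d.126.2.1` is used only as an illustrative non-matching id.

```python
def same_superfamily(sccs_a, sccs_b):
    """True if two SCOP sccs ids share their first three fields."""
    return sccs_a.split(".")[:3] == sccs_b.split(".")[:3]


def count_true_false_positives(query_sccs, hit_sccs_list):
    """Return (true positives, false positives) for one query's hits."""
    tp = sum(1 for hit in hit_sccs_list if same_superfamily(query_sccs, hit))
    return tp, len(hit_sccs_list) - tp


print(same_superfamily("d.126.1.1", "d.126.1.2"))  # True
print(same_superfamily("d.126.1.1", "d.126.2.1"))  # False
```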
13
Acknowledgement

Dr. Nickolai Alexandrov
Dr. Philip E. Bourne
Dr. Wilfred W. Li
Dr. Greg B. Quinn
Dr. Ilya E. Shindyalov

14
Gene Ontology (GO) project

Gene Ontology Consortium (http://www.geneontology.org)
Controlled vocabularies for describing gene
functions.
Three dimensions
Molecular Function
the tasks performed by individual gene products;
examples are transcription factor and DNA
helicase
Biological Process
broad biological goals, such as purine metabolism
or mitosis, that are accomplished by ordered
assemblies of molecular functions
Cellular Component
subcellular structures, locations, and
macromolecular complexes; examples include
nucleus, telomere, and origin recognition complex

15
Three dimensions of GO
Biological process
Gene product
Molecular Function
Cellular Component
16
Hierarchical structure of GO term tree

GO:0003673 Gene_Ontology
  GO:0003674 molecular_function
    GO:0005488 binding
      GO:0003676 nucleic acid binding
        GO:0003677 DNA binding
          GO:0003700 transcription factor
    GO:0030528 transcription regulator
      GO:0003700 transcription factor
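
Because a term such as transcription factor appears under more than one parent, the hierarchy is a graph rather than a strict tree. The sketch below stores child-to-parent links for the terms on this slide and collects all ancestors of a term. The identifiers are written with the standard `GO:` prefix, and the parent links are read off the slide's indentation, which is partly lost in the transcript; in particular, placing GO:0030528 directly under molecular_function is an assumption. The names `PARENTS` and `ancestors` are hypothetical.

```python
# Child -> parents, using only the terms shown on this slide.
# The exact nesting is an assumption based on the slide layout.
PARENTS = {
    "GO:0003674": ["GO:0003673"],                # molecular_function
    "GO:0005488": ["GO:0003674"],                # binding
    "GO:0003676": ["GO:0005488"],                # nucleic acid binding
    "GO:0003677": ["GO:0003676"],                # DNA binding
    "GO:0030528": ["GO:0003674"],                # transcription regulator
    "GO:0003700": ["GO:0003677", "GO:0030528"],  # transcription factor
}


def ancestors(term, parents=PARENTS):
    """Return the set of all terms reachable by following parent links."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


print(sorted(ancestors("GO:0003700")))
```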

17
The evidence codes used in GO

IC: inferred by curator
IDA: inferred from direct assay
IEA: inferred from electronic annotation
IEP: inferred from expression pattern
IGI: inferred from genetic interaction
IMP: inferred from mutant phenotype
IPI: inferred from physical interaction
ISS: inferred from sequence or structural similarity
NAS: non-traceable author statement
ND: no biological data available
TAS: traceable author statement
NR: not recorded

18
Process for annotating Ceres peptides

Download GO annotations from the TAIR website
(http://www.arabidopsis.org)
Annotation method
If
the Ceres peptide sequence is the same as a GO
database sequence, as judged by locus name, copy
all annotations of that GO database sequence
to the Ceres peptide.
Else
for each Ceres peptide, pick its best hit
that has a TAIR annotation, and copy that hit's
annotation to the Ceres peptide.
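
The If/Else rule above could be sketched as follows. This is a minimal illustration under stated assumptions rather than the actual Ceres procedure: `annotate_peptide` is a hypothetical name, the locus-name match stands in for the sequence comparison, and the ranked hit list is assumed to come from a similarity search that the slides do not describe.

```python
def annotate_peptide(locus, go_by_locus, ranked_hits):
    """Transfer GO annotations to one Ceres peptide.

    locus: locus name of the peptide (may be None).
    go_by_locus: dict mapping TAIR locus name -> list of GO annotations.
    ranked_hits: locus names of similarity hits, best first (hypothetical input).
    """
    # If: the peptide matches a GO database sequence by locus name.
    if locus in go_by_locus:
        return list(go_by_locus[locus])
    # Else: copy from the best hit that has a TAIR annotation.
    for hit in ranked_hits:
        if go_by_locus.get(hit):
            return list(go_by_locus[hit])
    return []  # no annotated hit found
```

Returning a copy of the annotation list keeps later edits to one peptide's annotations from changing the shared TAIR data.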

19
Example: P450 family