Advanced bioinformatics tools for analyzing the Arabidopsis genome

Slides: 21
Provided by: hzh31
Learn more at: https://hongyu.org
1
Advanced bioinformatics tools for analyzing the Arabidopsis genome
Proteins of Arabidopsis thaliana (PAT)
Gene Ontology (GO)
Hongyu Zhang, Ph.D.
2
Sequence
Bioinformatics
Structure
Function
3
PAT: Structure-aided function annotation
  • PAT is a collaborative project between Ceres and
    the San Diego Supercomputer Center
    (http://pat.sdsc.edu).
  • Importance of structure-aided function annotation
      • Structure contains more functional information
        than sequence, such as active sites and binding
        motifs.
      • Structure is more conserved than sequence during
        evolution, so proteins can have similar
        structures even without clearly detectable
        sequence similarity. With advanced structure
        prediction methods such as PSI-BLAST and
        threading, we therefore have a better chance of
        finding a functional relationship through
        structural similarity than through sequence
        similarity alone.
      • Structure prediction programs can also predict
        structural features of proteins, such as
        transmembrane tendency, electrostatic potential
        distribution, or coiled-coil tendency. These
        features help biologists infer the possible
        functions of novel genes.

4
Fold recognition
  • Frequently implies biochemical function

5
Highlights in PAT annotations
  • Domain-based prediction
      • Structure domains: PDB, SCOP
      • Sequence domains: Pfam
  • Predictions are strictly benchmarked

6
Reliability categories
Category   Reliability level   Benchmark
A          Certain             >99.9%
B          Reliable            >99%
C          Probable            >90%
D          Possible            >50%
E          Potential           >10%
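
The mapping from a benchmarked reliability to a category letter can be sketched as below. This is only an illustration: it assumes the benchmark values are percentages and that each cutoff is a strict "greater than"; the function name and the `THRESHOLDS` list are hypothetical, not part of PAT.

```python
# Cutoffs taken from the reliability table, read as percentages (an assumption).
THRESHOLDS = [("A", 99.9), ("B", 99.0), ("C", 90.0), ("D", 50.0), ("E", 10.0)]


def reliability_category(benchmark_pct):
    """Return the category letter for a benchmarked reliability, or None.

    Cutoffs are checked from strictest to loosest, so the first match
    is the best category the prediction qualifies for.
    """
    for letter, cutoff in THRESHOLDS:
        if benchmark_pct > cutoff:
            return letter
    return None  # at or below 10%: not reported in any category


if __name__ == "__main__":
    for value in (99.95, 99.5, 75.0, 5.0):
        print(value, reliability_category(value))
```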
7
Methods
  • Programs
      • Protein sequences were analyzed with a spectrum
        of programs, including structure prediction,
        function prediction, and feature annotation
        methods.
  • Database
      • All results were organized and stored in an
        Oracle relational database for ease of data
        access and processing.
  • Interface
      • A web-based interface convenient for both
        computational and non-computational biologists.

8
Programs used in PAT pipeline
  • Protein structure and function
      • Homology modeling: BLAST and PSI-BLAST searches
        against a protein structure database
      • Threading: 123D search against a protein fold
        library
  • Protein class and features
      • COILS, TMHMM, SignalP, PSIPRED, PSORT

9
Protein sequences
  • Sequence info
      • Prediction of signal peptides (SignalP, PSORT),
        transmembrane regions (TMHMM, PSORT), coiled
        coils (COILS), and low-complexity regions (SEG)
      • NR, PFAM
  • Structure info
      • SCOP, PDB
  • Building FOLDLIB
      • PDB chains, SCOP domains, PDP domains, CE
        matches (PDB vs. SCOP)
      • 90% sequence non-identical, minimum size 25 aa,
        coverage (90%, gaps <30, ends <30)

  • Create PSI-BLAST profiles for protein sequences
  • Structural assignment of domains by PSI-BLAST on
    FOLDLIB
  • Only sequences without an A prediction: structural
    assignment of domains by 123D on FOLDLIB
  • Only sequences without an A prediction: functional
    assignment by PFAM, NR, and PSIPRED assignments
  • Domain location prediction by sequence (FOLDLIB)
  • Store assigned regions in the DB
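
The cascade on this slide, where the slower 123D threading step runs only on sequences that PSI-BLAST did not already place in category A, can be sketched roughly as follows. Both predictors are hypothetical callables standing in for real PSI-BLAST and 123D runs against FOLDLIB, and reading "w/out A-prediction" as "no category-A hit" is an assumption.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A predictor takes a sequence and returns a reliability category
# ("A".."E"), or None when it finds no match. Hypothetical interface.
Predictor = Callable[[str], Optional[str]]


def assign_structures(
    sequences: Dict[str, str],
    psiblast: Predictor,
    threading_123d: Predictor,
) -> Dict[str, List[Tuple[str, str]]]:
    """Run PSI-BLAST on every sequence; fall back to 123D unless PSI-BLAST gave an A."""
    results: Dict[str, List[Tuple[str, str]]] = {}
    for name, seq in sequences.items():
        hits: List[Tuple[str, str]] = []
        category = psiblast(seq)
        if category is not None:
            hits.append(("PSI-BLAST", category))
        if category != "A":
            # No certain assignment yet, so try threading as well.
            fallback = threading_123d(seq)
            if fallback is not None:
                hits.append(("123D", fallback))
        results[name] = hits
    return results
```

In a real pipeline the returned hits would then be written to the database; here they are simply collected in a dictionary.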
10
GUI: Top Level
11
Example: P450 family
  • Sequence relatives detected by an ordinary BLAST
    search
      • 313 hits when the E-value cutoff is 0.001
      • 324 hits when the E-value cutoff is 0.01
  • Sequence relatives detected by PAT
      • 367 hits with confidence greater than or equal
        to 99%
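
Counting hits under different E-value cutoffs, as in the BLAST numbers above, amounts to a simple threshold filter. The sketch below uses made-up E-values purely for illustration; they are not the P450 results.

```python
def count_hits(evalues, cutoff):
    """Count hits whose E-value is at or below the cutoff."""
    return sum(1 for e in evalues if e <= cutoff)


# Toy E-values, invented for this example only.
toy_evalues = [1e-50, 3e-12, 5e-4, 2e-3, 8e-3, 0.5]

print(count_hits(toy_evalues, 0.001))  # 3
print(count_hits(toy_evalues, 0.01))   # 5
```

Loosening the cutoff can only add hits, which is why the 0.01 count is never smaller than the 0.001 count.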

12
Figure 2. SCOP results, super-family level. The
figure shows the number of true positive predictions
versus the number of false positive predictions
for the SCOP test set. Here, two proteins are
considered to have the same structure at the
super-family level if they share the first three
fields of their SCOP sccs ids, e.g., d.126.1.1 and
d.126.1.2. The results show that PSI-BLAST is
superior to both NCBI-BLAST and WU-BLAST at
picking up true positives.
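
The super-family criterion in the caption can be expressed as a comparison of the first three dot-separated fields of two sccs ids. The sketch below is illustrative only; the function names are hypothetical and the hit list in the usage is invented.

```python
def same_superfamily(sccs_a, sccs_b):
    """True if two SCOP sccs ids share class, fold, and super-family fields."""
    return sccs_a.split(".")[:3] == sccs_b.split(".")[:3]


def count_tp_fp(query_sccs, hit_sccs_list):
    """Return (true positives, false positives) for one query's hit list."""
    tp = sum(1 for hit in hit_sccs_list if same_superfamily(query_sccs, hit))
    return tp, len(hit_sccs_list) - tp


if __name__ == "__main__":
    print(same_superfamily("d.126.1.1", "d.126.1.2"))  # True
    print(same_superfamily("d.126.1.1", "d.126.2.1"))  # False
    print(count_tp_fp("d.126.1.1", ["d.126.1.2", "c.1.8.1", "d.126.1.1"]))  # (2, 1)
```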
13
Acknowledgement
  • Dr. Nickolai Alexandrov
  • Dr. Philip E. Bourne
  • Dr. Wilfred W. Li
  • Dr. Greg B. Quinn
  • Dr. Ilya E. Shindyalov

14
Gene Ontology (GO) project
  • Gene Ontology Consortium
    (http://www.geneontology.org)
  • Controlled vocabularies for the description of
    gene functions
  • Three dimensions
      • Molecular Function: the tasks performed by
        individual gene products; examples are
        transcription factor and DNA helicase
      • Biological Process: broad biological goals, such
        as purine metabolism or mitosis, that are
        accomplished by ordered assemblies of molecular
        functions
      • Cellular Component: subcellular structures,
        locations, and macromolecular complexes;
        examples include nucleus, telomere, and origin
        recognition complex

15
Three dimensions of GO
Biological process
Gene product
Molecular Function
Cellular Component
16
Hierarchical structure of GO term tree
  GO:0003673 Gene_Ontology
    GO:0003674 molecular_function
      GO:0005488 binding
        GO:0003676 nucleic acid binding
          GO:0003677 DNA binding
            GO:0003700 transcription factor
      GO:0030528 transcription regulator
        GO:0003700 transcription factor
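
Because a GO term can have more than one parent, the "tree" is really a directed acyclic graph. The sketch below encodes the terms from this slide, reading the flattened layout as a hierarchy in which transcription factor sits under both DNA binding and transcription regulator (an interpretation), with ids written in the standard GO:nnnnnnn form. The `PARENTS` table and function names are hypothetical.

```python
# child term -> list of parent terms, as read from the slide
PARENTS = {
    "GO:0003673": [],                            # Gene_Ontology (root)
    "GO:0003674": ["GO:0003673"],                # molecular_function
    "GO:0005488": ["GO:0003674"],                # binding
    "GO:0003676": ["GO:0005488"],                # nucleic acid binding
    "GO:0003677": ["GO:0003676"],                # DNA binding
    "GO:0030528": ["GO:0003674"],                # transcription regulator
    "GO:0003700": ["GO:0003677", "GO:0030528"],  # transcription factor
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links upward."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def paths_to_root(term, parents=PARENTS):
    """Return all root-to-term paths; more than one means multiple parents."""
    if not parents[term]:
        return [[term]]
    return [
        path + [term]
        for parent in parents[term]
        for path in paths_to_root(parent, parents)
    ]


if __name__ == "__main__":
    for path in paths_to_root("GO:0003700"):
        print(" > ".join(path))
```

An annotation to transcription factor therefore implies annotation to every ancestor on both paths, which is why the term appears twice in the printed tree.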

17
The evidence codes used in GO
  • IC: inferred by curator
  • IDA: inferred from direct assay
  • IEA: inferred from electronic annotation
  • IEP: inferred from expression pattern
  • IGI: inferred from genetic interaction
  • IMP: inferred from mutant phenotype
  • IPI: inferred from physical interaction
  • ISS: inferred from sequence or structural
    similarity
  • NAS: non-traceable author statement
  • ND: no biological data available
  • TAS: traceable author statement
  • NR: not recorded

18
Process for annotating Ceres peptides
  • Download GO annotations from the TAIR website
    (http://www.arabidopsis.org)
  • Annotation method
      • If the sequence of a Ceres peptide matches a GO
        database sequence by locus name, copy all
        annotations of that GO database sequence to the
        Ceres peptide.
      • Else, pick the best hit of the Ceres peptide
        that has a TAIR annotation, and copy its
        annotation to the Ceres peptide.
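
The if/else rule above can be sketched as a small function. This is a hedged illustration, not the actual Ceres procedure: the data structures, the function name, and the assumption that similarity hits arrive as a best-first list of locus names are all hypothetical.

```python
def transfer_annotations(ceres_peptides, tair_annotations, best_hits):
    """Copy GO annotations to peptides by locus match, else by best annotated hit.

    ceres_peptides:   peptide id -> locus name (or None if unknown)
    tair_annotations: locus name -> list of GO term ids
    best_hits:        peptide id -> locus names ordered best hit first
                      (hypothetical output of a similarity search)
    Returns peptide id -> (list of GO terms, how they were obtained).
    """
    result = {}
    for peptide_id, locus in ceres_peptides.items():
        if locus in tair_annotations:
            # Same sequence by locus name: copy all annotations.
            result[peptide_id] = (list(tair_annotations[locus]), "locus match")
            continue
        for hit in best_hits.get(peptide_id, []):
            if hit in tair_annotations:
                # First hit in best-first order that carries an annotation.
                result[peptide_id] = (list(tair_annotations[hit]), "best annotated hit")
                break
        else:
            result[peptide_id] = ([], "unannotated")
    return result
```

Recording how each annotation was obtained keeps directly copied annotations distinguishable from those transferred through a similarity hit.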

19
Example: P450 family
  • Sequence relatives detected by a simple BLAST
    search
      • 313 hits when the E-value cutoff is 0.001
      • 324 hits when the E-value cutoff is 0.01
  • Sequence relatives detected by PAT
      • 367 hits with confidence greater than or equal
        to 99%
  • Sequence relatives annotated by GO
      • 365 hits
  • Number of hits by evidence code
      • 295 with ISS (inferred from sequence or
        structural similarity)
      • 67 with IEA (inferred from electronic
        annotation)
      • 2 with TAS (traceable author statement)
      • 1 with IDA (inferred from direct assay)
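
As a quick consistency check, the per-evidence-code counts on this slide add up to the 365 GO-annotated hits. The short sketch below does that arithmetic and computes each code's share; the variable names are illustrative.

```python
# Counts copied from the slide.
EVIDENCE_COUNTS = {"ISS": 295, "IEA": 67, "TAS": 2, "IDA": 1}

total = sum(EVIDENCE_COUNTS.values())
shares = {code: round(100 * n / total, 1) for code, n in EVIDENCE_COUNTS.items()}

print(total)  # 365
for code, pct in shares.items():
    print(f"{code}: {pct}%")
```

Roughly four in five of these annotations rest on sequence or structural similarity (ISS), and only one comes from a direct assay (IDA).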

20
Acknowledgement
  • Dr. Nickolai Alexandrov
  • Mr. Eric Zetterbaum
  • Dr. Richard Flavell
  • etc.