Functional and structural genomics using PEDANT - PowerPoint PPT Presentation

Slides: 19
Provided by: CHL90

Transcript and Presenter's Notes

1
Functional and structural genomics using PEDANT

2
Introduction
  • With the growth of biological sequence data, genome
    analysis needs a system able to store and retrieve
    tens of gigabytes of data, a mature database
    management system, and good visualization tools
  • Shift from case-oriented sequence analysis work to
    automated large-scale genome annotation

3
Introduction-PEDANT
  • Existing genome analysis programs differ in:
  • protein-oriented vs. DNA-oriented analysis
  • interactive work vs. command-line operation
  • bioinformatics methods applied
  • user interface
  • convenience features: project management and data
    editors
  • fidelity of the results produced
  • Benchmarks may vary depending on the chosen balance
    between sensitivity and selectivity of the
    analyses
  • PEDANT (Protein Extraction, Description, and
    ANalysis Tool) first became available in mid-1997
    (then using FASTA for similarity searches)
  • a workhorse for general bioinformatics research
  • a common framework for a number of genome
    analysis projects
  • a complete database of automatically annotated
    genomes
  • a tool for routine analysis of large amounts of
    genomic contigs and ESTs

4
System Architecture
  • Overview
  • database module: storing, modifying and accessing
    data
  • processing module: bioinformatics computations
  • user interface: web-based communication

5
System Architecture-Cont.
  • Data access
  • primary tables: store raw data (e.g. DNA and protein
    sequences) and raw program results (e.g. BLAST output)
  • secondary tables: parsed program results
  • simplified schema
  • Operation in command-line mode
  • applying bioinformatics methods to sequences
  • parsing data tables
  • querying the resulting databases
  • Web interface
  • no static HTML pages required
  • DNA and protein viewers access the SQL tables
    directly
  • Implementation and system requirements
  • Perl 5, and C for the graphical viewers
  • Performance
  • parallel capabilities
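
The split between primary tables (raw program output) and secondary tables (parsed results) can be illustrated with a small sketch. This is not the PEDANT schema: the table names, the column names, the simplified tab-separated report format and the E-value cutoff are all hypothetical, and an in-memory SQLite database stands in for the SQL server used by the real system.

```python
import sqlite3

# In-memory SQLite as a stand-in database; every name below is hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE blast_raw (      -- primary table: unparsed program output
    orf_id TEXT PRIMARY KEY,
    output TEXT NOT NULL
);
CREATE TABLE blast_hits (     -- secondary table: parsed results
    orf_id TEXT NOT NULL,
    hit_id TEXT NOT NULL,
    evalue REAL NOT NULL
);
""")


def parse_hits(raw):
    """Parse a simplified report with one 'hit_id<TAB>evalue' pair per line."""
    hits = []
    for line in raw.splitlines():
        if line.strip():
            hit_id, evalue = line.split("\t")
            hits.append((hit_id, float(evalue)))
    return hits


def store_result(orf_id, raw):
    """Keep the raw output and fill the parsed table from it."""
    conn.execute("INSERT INTO blast_raw VALUES (?, ?)", (orf_id, raw))
    conn.executemany(
        "INSERT INTO blast_hits VALUES (?, ?, ?)",
        [(orf_id, hit_id, evalue) for hit_id, evalue in parse_hits(raw)],
    )


def significant_hits(orf_id, max_evalue=1e-5):
    """Query the secondary table only; the raw text is never re-parsed."""
    rows = conn.execute(
        "SELECT hit_id FROM blast_hits "
        "WHERE orf_id = ? AND evalue <= ? ORDER BY evalue",
        (orf_id, max_evalue),
    )
    return [row[0] for row in rows]


store_result("orf0001", "P12345\t1e-30\nQ99999\t0.5\n")
print(significant_hits("orf0001"))
```

Keeping the raw output means the secondary table can be rebuilt whenever the parser changes, without rerunning the search itself.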

6
Schema
7
Bioinformatics Method
  • Overview of the PEDANT processing pipeline
  • identification of coding regions and analysis of
    various genetic elements
  • homology search
  • detection of protein motifs, prediction of
    secondary structure and other protein features,
    and sensitive fold recognition
  • automatic assignment of gene products to
    pre-defined functional categories
  • Prediction of genes and other genetic elements
  • Table 1
  • choose one of 15 genetic codes
  • http://www.ncbi.nlm.nih.gov/htbin-post/Taxonomy/wprintgc?mode=c
  • Functional and structural categories
  • similarity search: PSI-BLAST (Position-Specific
    Iterated BLAST)
  • special datasets: MIPS, COG, PROSITE, PFAM and
    BLOCKS
  • for significant PIR matches: annotations,
    keywords, enzyme classification and superfamily
    information
  • for sequences significantly related to PDB
    entries: secondary structure information from
    STRIDE (upper case) and PREDATOR (lower case)
  • low-complexity regions, membrane regions, coiled
    coils and signal peptides
  • comparison against SCOP using IMPALA

functional
structural
8
Table 1
9
Bioinformatics Method-Cont.
  • Yeast biological role categories
  • first system of biological role categories:
    E. coli
  • MIPS: advanced hierarchical functional catalogue
    (yeast)
  • Multidimensionality: the protein/gene relation is M:M
  • automated assignment to MIPS categories is a first
    approximation, to be refined by manual
    annotation
  • Distribution of ORFs
  • Visualization
  • an integrated, hypertext-linked protein report
    with calculated parameters and sequences as a
    reference for further manual annotation
  • Protein report page

10
Distribution of ORFs
11
Protein report page
12
Bioinformatics Method-Cont.2
  • Automatic versus manual annotation
  • Problem of error propagation
  • erroneous annotations arise from human error and
    spurious similarity hits
  • mitigation by filtering algorithms and domain
    structure analysis?
  • quality improvement through manual review by human
    experts!
  • Manual annotation
  • Catalogue independent
  • Flexibility: a gene product is first placed in a
    higher category and later moved to finer categories
  • 528 categories: 20 main categories and 6 levels
  • confidence levels: reject, low, medium, high;
    the default is auto
  • Data release management
  • new release data can be intelligently merged with
    the existing data pool
  • manual annotations are transferred between
    subsequent data releases
  • manual field: yes or no; the default is no
    initially
  • example: a PFAM domain identified in an ORF of a
    new release has manual = no and conf = auto
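
The `manual` and `conf` fields and their role in merging releases can be sketched as follows. This is a guess at the behaviour described above, not the PEDANT implementation: the class, the function, the ORF identifiers and the dotted category codes are hypothetical, and the merge rule (manual annotations win, everything else is taken from the new release) is an assumption. It matches ORFs by identifier only and does not handle changed gene boundaries or fused contigs.

```python
from dataclasses import dataclass

CONFIDENCE_LEVELS = ("reject", "low", "medium", "high", "auto")


@dataclass
class Annotation:
    orf_id: str
    category: str         # hypothetical dotted code in a hierarchical catalogue
    manual: bool = False  # "manual" field: defaults to no
    conf: str = "auto"    # confidence level: defaults to auto

    def __post_init__(self):
        if self.conf not in CONFIDENCE_LEVELS:
            raise ValueError(f"unknown confidence level: {self.conf}")


def merge_release(existing, new_release):
    """Assumed rule: keep manual annotations, otherwise use the new release."""
    merged = {}
    for orf_id, new in new_release.items():
        old = existing.get(orf_id)
        merged[orf_id] = old if old is not None and old.manual else new
    return merged


existing = {
    "orf1": Annotation("orf1", "01.05", manual=True, conf="high"),
    "orf2": Annotation("orf2", "02"),
}
new_release = {
    "orf1": Annotation("orf1", "01"),
    "orf2": Annotation("orf2", "02.10"),
    "orf3": Annotation("orf3", "11"),
}
merged = merge_release(existing, new_release)
for orf_id in sorted(merged):
    print(merged[orf_id])
```

Here the curated `orf1` entry survives the update, while `orf2` and the newly predicted `orf3` carry automatic annotations with `manual=False` and `conf="auto"`.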

13
Manual annotation transfer
  • Two contigs fuse into one
  • Gene boundaries change
  • A new gene appears
  • Two genes fuse into one contig

14
The PEDANT Genome Database
  • Annotation of publicly available, completely
    sequenced and unfinished genomes
  • Genomes annotated by MIPS
  • Completely sequenced and published genomic
    sequences
  • Unfinished and/or unpublished genomic sequences
  • gene prediction by ORPHEUS, allowing large overlaps
    between ORFs
  • PEDANT as a structural genomics resource: 0.3M
    proteins
  • class-based approach, cost-saving
  • (i) non-redundant protein sequence database
  • (ii) PSI-BLAST search with SCOP against (i),
    saving the resulting profiles
  • (iii) construct a SCOP profile library using
    IMPALA
  • (iv) IMPALA search with each genomic sequence
    against the SCOP library
  • same procedure for the nr PDB sequence database
  • performance of IMPALA
  • Cross-genome comparison
  • treating each genome as an individual contig creates
    cross-genome datasets without any modification
  • 44 genomes

15
Performance of IMPALA
16
Applications
  • Arabidopsis thaliana chromosome IV
  • 3744 predicted protein-coding genes
  • roughly 30% are known proteins or strongly
    similar to known proteins
  • multicellular organisms have a higher proportion of
    all-alpha and a lower proportion of mixed alpha/beta
    structural domains than unicellular species
  • Assembled human transcripts
  • human UniGene was subjected to PEDANT analysis,
    comprising over 75000 contigs
  • the resulting MySQL database is close to 8 GB
  • acceptable query times show the suitability of
    PEDANT for large-scale EST sequencing projects
  • Analysis of the GroEL substrates
  • GroEL: a common E. coli chaperonin
  • structural motif common to 52 substrates relying
    on GroEL for folding in vivo: two or more
    alpha/beta domains with buried beta-sheets and
    large hydrophobic surfaces, which aggregate easily

17
Classification of predicted genes
  • Classification by the degree of homology to
    functionally characterized proteins based on
    BLAST scores

18
Summary and Outlook
  • PEDANT is a useful tool for genome annotation and
    bioinformatics research
  • It supports automated and manual assignment of gene
    products to functional and structural categories
  • extensive hyperlinked protein report and advanced
    viewers
  • Outlook
  • better decision rules need to be employed
  • manual annotation of predicted genetic elements
    (e.g. LTRs)
  • support for the Oracle RDBMS
  • automatic gene prediction pipeline for higher
    eukaryotes
  • interactive capabilities