1
In silico Gene Function Prediction Using
Ontology-based Pattern Identification
  • Paper Review
  • By Chunyan Meng
  • February 13, 2006

2
Paper Source
  • Authors: Zhou Y (1), Young JA (2), Santrosyan A (1),
    Chen K (1), Yan SF (1), Winzeler EA (1,2)
  • (1) Genomics Institute of the Novartis Research
    Foundation, San Diego, CA 92121, USA
  • (2) Department of Cell Biology, The Scripps
    Research Institute, La Jolla, CA 92037, USA
  • Source: Bioinformatics. 2005 Apr 1; 21(7): 1237-45

3
Related Words
  • In silico: in or by means of a computer
    simulation (http://www.worldwidewords.org/weirdwords/ww-ins1.htm)
  • OPI: Ontology-based Pattern Identification, a
    data-mining algorithm introduced in this paper
  • GBA: guilt by association, the observation that
    functionally related genes tend to share similar
    mRNA expression profiles
  • Malarial parasite Plasmodium falciparum: the most
    deadly of the four human malaria parasites

4
Microarray Data Analysis Open Questions
  • Is data transformation beneficial? Which
    transformation is better?
  • How to filter out the trivial expression
    patterns?
  • What is the best similarity metric for
    clustering?
  • Pearson correlation coefficient-based or
    Euclidean distance-based (DChip manual)?
  • Is the partition-based method better? What is the
    right k for the k-means clustering (Yeung et al.,
    2001)?
  • Is the hierarchical clustering method more
    flexible? What is the similarity threshold to
    identify the right sub-tree for functional
    analysis (Allocco et al., 2004)?
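
The Pearson versus Euclidean question above can be made concrete with a small sketch. The three profiles below are hypothetical values chosen for illustration, not data from the paper: two genes share a shape but differ in magnitude, while a third is close in magnitude but differs in shape.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    # Assumes neither profile is constant (non-zero variance).
    return cov / math.sqrt(var_a * var_b)

def euclidean(a, b):
    """Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical expression profiles (assumed values, for illustration only).
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [10.0, 20.0, 30.0, 40.0]  # same shape as gene_a, 10x the magnitude
gene_c = [1.5, 1.5, 3.5, 3.5]      # similar magnitude, different shape

for name, other in (("gene_b", gene_b), ("gene_c", gene_c)):
    print(name, "pearson:", pearson(gene_a, other),
          "euclidean:", euclidean(gene_a, other))
```

Under these assumed values, correlation ranks gene_b as the closer match to gene_a, while Euclidean distance ranks gene_c as closer, so the choice of metric changes which genes end up grouped together.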

5
Motivation of This Research
  • Traditional methods have had limited success in
    producing clusters of genes with high functional
    similarity, so new data mining algorithms are
    required to fully exploit the potential of
    high-throughput genomic approaches.
  • This research tries to address the open
    questions listed above.

6
Key Contribution of This Study
  • A tool based on the OPI algorithm that discovers
    the optimal analysis pipeline and its
    corresponding parameters, guided by existing
    biological knowledge.
  • The research applied OPI to a publicly available
    gene expression data set on the life cycle of the
    malarial parasite Plasmodium falciparum and
    systematically annotated genes for 320 functional
    categories based on current Gene Ontology
    annotations.

7
OPI Method Foundation
  • Minimize $P_D(D(M, X, G), G)$, $\forall M \in \mathbb{M}$,
    where $\mathbb{M}$ is the space of candidate methods
  • M: a method, i.e. a data analysis pipeline
    and its parameter set
  • X: a data set
  • G: a piece of existing knowledge
  • D: a discovery made by applying method M to the
    data set X with hints from knowledge G
  • $P_D$: the discovery score; lower scores mean more
    interesting discoveries

8
OPI Method Implementation
  • For gene expression-based functional annotation.
  • X: an n x m matrix of gene expression profiles,
    where n is the total number of genes and m is the
    total number of experiments.
  • P. falciparum life cycle gene expression data:
    n = 5159 genes, m = 16 expression profiles covering
    different parasite life cycle stages.

9
OPI Method Implementation
  • G: the list of $N_G$ known genes among the total
    of $N_T$ genes that survive the data-filtering
    process under method M.
  • Each vertex of the GO (Gene Ontology) graph
    carries a list of genes, representing a group of
    genes with known function. 2096 of the 5159 genes
    have some annotation. Manually confirmed
    annotations only.

10
OPI Method Implementation
  • Apply OPI to each functional group of GO
  • $\mathbb{M}$: the hyper-dimensional space of analysis
    methods, formed by combining the following
    dimensions:
  • {Linear, Log} x $P_{ANOVA}$ x {Unweighted, Weighted}
    x {$Q_{Single}$, $Q_{Average}$} x $S_0$
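
A minimal sketch of how such a method space could be enumerated and searched, assuming the five dimensions named on this slide. The numeric grids for the ANOVA p-value cutoff and for S0, and the `toy_score` function, are hypothetical placeholders; they are not taken from the paper and `toy_score` is not the OPI objective.

```python
from itertools import product

# Discrete dimensions named on the slide.
TRANSFORMS = ["Linear", "Log"]
WEIGHTINGS = ["Unweighted", "Weighted"]
QUERIES = ["QSingle", "QAverage"]

# Hypothetical grids for the two numeric dimensions (values are assumptions).
P_ANOVA_GRID = [0.01, 0.05]
S0_GRID = [0.5, 0.6, 0.7, 0.8, 0.9]

def method_grid():
    """Iterate over every (transform, p_anova, weighting, query, s0) tuple."""
    return product(TRANSFORMS, P_ANOVA_GRID, WEIGHTINGS, QUERIES, S0_GRID)

def best_method(score):
    """Return the combination with the lowest score (lower is better)."""
    return min(method_grid(), key=score)

def toy_score(method):
    """Placeholder score, NOT the OPI objective: favors Log and s0 near 0.7."""
    transform, p_anova, weighting, query, s0 = method
    return abs(s0 - 0.7) + (0.0 if transform == "Log" else 1.0)

# The three two-valued dimensions alone give 2 x 2 x 2 combinations.
discrete_methods = list(product(TRANSFORMS, WEIGHTINGS, QUERIES))
best = best_method(toy_score)
print(len(discrete_methods), best)
```

In a real run, `toy_score` would be replaced by a function that applies the pipeline to the data and returns the discovery score for one functional group.
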
11
Objective Function of the Problem
[Figure: diagram with regions labeled $N_T$, $N_G$, $N_C$, and $N_D$]
For a gene functional group, D is the set of $N_D$
genes that are predicted to have the same function
according to GBA.
The probability that at least $N_C$ genes are
correctly annotated in a random sample of $N_D$
genes out of the total collection of $N_T$ genes
follows a cumulative hypergeometric
distribution. The problem is to minimize the
probability of knowledge discovery by chance.
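
The cumulative hypergeometric probability described above can be sketched with the standard library alone. This is a minimal illustration of the textbook upper-tail formula; the function and argument names are hypothetical, and the numbers in the usage line are made up rather than taken from the paper.

```python
from math import comb

def chance_probability(n_total, n_known, n_drawn, n_correct):
    """Upper-tail hypergeometric probability.

    Probability that a random sample of n_drawn genes, taken from n_total
    genes of which n_known carry the annotation, contains at least
    n_correct annotated genes:

        sum_{i = n_correct}^{min(n_drawn, n_known)}
            C(n_known, i) * C(n_total - n_known, n_drawn - i) / C(n_total, n_drawn)
    """
    upper = min(n_drawn, n_known)
    favorable = sum(
        comb(n_known, i) * comb(n_total - n_known, n_drawn - i)
        for i in range(n_correct, upper + 1)
    )
    return favorable / comb(n_total, n_drawn)

# Hypothetical group: 20 known genes among 5159, 30 predicted, 10 of them known.
print(chance_probability(5159, 20, 30, 10))
```

A smaller value means the overlap between predicted and known genes is less likely to have arisen by chance, which is the quantity the slide says should be minimized.
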
12
Solving the Problem
  • The implemented program exhaustively searches all
    possible local minima in the local parameter
    space.

Example: for the cell-cell adhesion group, the search
runs along the $S_0$ dimension with the other 4
dimensions fixed.
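
A one-dimensional search of this kind might look like the sketch below, assuming the other four dimensions are already fixed. Everything here is hypothetical: the toy profiles, the set of known genes, the candidate thresholds, and the choice to use the mean profile of the known genes as the query pattern. It is an illustration of the idea, not the authors' implementation.

```python
from math import comb, sqrt

def pearson(a, b):
    """Pearson correlation between two equal-length, non-constant profiles."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / sqrt(va * vb)

def chance_probability(n_total, n_known, n_drawn, n_correct):
    """Upper-tail hypergeometric probability of >= n_correct hits by chance."""
    upper = min(n_drawn, n_known)
    favorable = sum(
        comb(n_known, i) * comb(n_total - n_known, n_drawn - i)
        for i in range(n_correct, upper + 1)
    )
    return favorable / comb(n_total, n_drawn)

def scan_threshold(profiles, known, query, thresholds):
    """Score every candidate threshold and return (score, s0, selected genes)."""
    results = []
    for s0 in thresholds:
        # Guilt by association: keep genes whose profile correlates with the query.
        selected = {g for g, p in profiles.items() if pearson(p, query) >= s0}
        if not selected:
            continue  # nothing passes this threshold
        score = chance_probability(
            len(profiles), len(known), len(selected), len(selected & known)
        )
        results.append((score, s0, selected))
    return min(results, key=lambda r: r[0])

# Hypothetical toy data: g1 and g2 are the "known" members of one GO group.
profiles = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.5],
    "g3": [0.5, 1.0, 1.4, 2.1],
    "g4": [4.0, 3.0, 2.0, 1.0],
    "g5": [2.0, 1.0, 2.0, 1.0],
    "g6": [3.0, 3.5, 1.0, 0.5],
    "g7": [1.0, 3.0, 2.0, 2.5],
}
known = {"g1", "g2"}
# Assumed query pattern: the mean profile of the known genes.
query = [sum(profiles[g][j] for g in known) / len(known) for j in range(4)]

best_score, best_s0, best_genes = scan_threshold(profiles, known, query, [0.3, 0.6, 0.9])
print(best_s0, sorted(best_genes), best_score)
```

With these toy values the loosest threshold admits a weakly correlated gene and scores worse than the stricter ones, which is the trade-off the threshold search is meant to resolve for each group separately.
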
13
Results
  • GNG-TSRI Malaria e-Annotation Database
  • Systematic view of coordination among functional
    categories
  • Validation of functional annotation for antigenic
    variation group
  • Pattern similarity and functional granularity
  • Comparison of analysis methods

14
GNG-TSRI Malaria e-Annotation Database
Generated 320 functional categories (clusters)
15
Systematic View of Coordination Among Functional
Categories
16
Systematic View of Coordination Among Functional
Categories
This is the first time that gene expression data
were clustered at the functional-group level
instead of by individual genes or samples
17
Validation of Functional Annotation
  • Use Antigenic Variation Group As Example
  • 67 genes were predicted in this group by OPI
    based on the 2003 GO database. 50 of the 67 had
    no indication in the GO database of belonging to
    this group. 12 of those 50 are now confirmed.
  • 246 genes were predicted in this group by OPI
    based on the 2004 GO database, with a better
    score.

18
Pattern Similarity
The size of each marker indicates the size of the
resulting gene list
  • Findings: there is no universal quantitative
    formula relating FDR to expression similarity.
    The threshold $S_0$ must be set specifically for
    each individual GO category.

19
Comparison of Analysis Methods
  • The number of times each of the 8 methods yielded
    a group with FDR < 50%, resulting in a total of
    104 groups

20
Negative Aspects
  • The mathematics is complex and computationally
    expensive, especially the search for the optimal
    solution.
  • Validation of the gene function annotation is
    based on one example and lacks systematic
    validation.
  • The similarity coefficient threshold $S_0$ plays
    a very important role in the clustering process,
    and it depends on the quality of the GO gene
    annotations.

21
Positive Aspects
  • OPI gives a richer and more precise biological
    interpretation of the same data than the k-means
    approach. Most OPI clusters have smaller
    p-values, which means higher statistical
    confidence.
  • It takes advantage of both supervised clustering
    (by using GO categories) and unsupervised
    clustering (by grouping based on gene expression
    similarities).
  • Compared to the k-means method, OPI allows more
    function-specific clusters owing to its
    hierarchical clustering feature.

22
References
  • Allocco, D.J. et al. (2004) Quantifying the
    relationship between co-expression, co-regulation
    and gene function. BMC Bioinformatics, 5, 18.
  • Yeung, K. et al. (2001) Validating clustering for
    gene expression data. Bioinformatics, 17,
    309-318.