Gene Classification: Issues and Challenges for Relational Learning (Claudia Perlich)

Description:

Gene Classification: Issues and Challenges for Relational Learning. Claudia Perlich & Srujana Merugu, IBM T.J. Watson Research Center. Presented by Srujana Merugu ...


Transcript and Presenter's Notes


1
Gene Classification: Issues and Challenges for Relational Learning
Claudia Perlich & Srujana Merugu
IBM T.J. Watson Research Center
Presented by Srujana Merugu
2
Outline

Objectives
Problem and Domain Description
Challenges for Relational Learning
Potential Solutions using ACORA
Experimental Results

3
Objectives

Demonstrate that statistical relational learning approaches are suitable for gene classification
Motivate the relational learning community to use this domain for benchmarking
Highlight the current limitations of relational learning techniques with regard to expressive power and motivate further developments

4
Problem Description (ILP Challenge 2005)

Task: functional annotation of the yeast genome
Training data:
Functional annotations (class labels) for a partial set of genes
Secondary protein structure of each gene
Intra-gene homology
Protein-gene homology and protein features

5
Functional Annotations (FunCat Hierarchy)

FunCat hierarchy: a forest of multiple class trees
More than 400 classes
Each gene can have multiple classes
4000 labeled examples and 674 unlabeled

6
Secondary Protein Structure (PROF Predictions)
Protein sequence: caaaaaaaaaaaaaaaaaaaacccbbbbbcccccccccccaaa
Legend: a = alpha helix, b = beta sheet, c = random coil

Each gene is associated with a single protein
Homogeneous components are encoded as multiple records
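The encoding of homogeneous components as multiple records can be illustrated with a run-length encoding of the PROF string. This is a minimal sketch, not the challenge's actual data loader: the function name `encode_structure` is hypothetical, and the (order, component, length) field layout is assumed from the feature names used on later slides.

```python
from itertools import groupby

def encode_structure(sequence):
    """Run-length encode a PROF string into (order, component, length) records.

    Hypothetical helper: the record layout is an assumption based on the
    order, component and length features named in this presentation.
    """
    records = []
    for order, (component, run) in enumerate(groupby(sequence), start=1):
        records.append((order, component, len(list(run))))
    return records

# The example sequence from the slide yields one record per homogeneous run.
records = encode_structure("caaaaaaaaaaaaaaaaaaaacccbbbbbcccccccccccaaa")
for record in records:
    print(record)
```

Each record depends on its neighbours through the `order` field, which is why these rows cannot be treated as an unordered bag.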

7
Intra-Gene Homology (BLAST Scores)

Scores: probability of a match being found by chance
Score < 1e-6: significant match
100000 gene homology pairs with a highly skewed degree distribution
Missing pairs correspond to unknown scores

8
Protein-Gene Homology and Protein Features

Yeast-SwissProt homology scores, similar to the intra-gene scores
45000 proteins and 3700000 gene-protein pairs with a highly skewed degree distribution

9

Summary of Data Components

Functional Annotations (FunCat)
Predicted Secondary Structure (PROF)
Intra-Gene homology (BLAST scores)
Protein homology (SwissProt database) and protein features

10
Suitability for Benchmarking

Highly structured information that is intrinsically relational
Important scientific problem with limited predictability so far
Real-life data that require handling noise and sparsity
Reasonable data set size requiring scalable techniques

11
Challenging Data Characteristics

Weighted links: homology scores do not correspond to the uncertainty in the relationship
Sparse data (missing links): missing pairs in the homology tables need to be treated as links with unknown weights
Ordered data: representing secondary protein structure using the order, component, and length features results in highly dependent features
Multiple class labels: the classification problem needs to be posed as multiple one-vs.-all problems; interpretation of the hierarchy is also important
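Posing the multi-label task as one-vs.-all problems can be sketched as below. The gene identifiers and annotations are made up for illustration, and the helper names are hypothetical. The sketch also assumes one possible interpretation of the hierarchy: class codes are dotted paths, and a gene annotated with a descendant class counts as a member of the ancestor class.

```python
def in_class(code, target):
    """True if `code` is the target class or one of its descendants.

    Assumes dotted hierarchical codes (e.g. "20.01" is a child of "20").
    """
    return code == target or code.startswith(target + ".")

def one_vs_all(gene_classes, target):
    """Map each gene to 1 if any of its annotations falls under `target`, else 0."""
    return {
        gene: int(any(in_class(code, target) for code in codes))
        for gene, codes in gene_classes.items()
    }

# Hypothetical annotations; each gene can carry several class codes.
gene_classes = {
    "g1": ["20.01", "01.05"],
    "g2": ["14"],
    "g3": ["20"],
    "g4": ["200.1"],  # different top-level class, must not match "20"
}
labels = one_vs_all(gene_classes, "20")
print(labels)
```

One such binary label vector would be built per class of interest, and a separate classifier trained on each.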

12
Propositionalization

Transformation
Exploration: finding related entities using joins
Aggregation: constructing features, e.g., mean, mode
Model Estimation
Logistic regression, decision trees, Naïve Bayes, SVM, etc.
Require each entity to be a feature vector (x1, x2, ..., xn, class)
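The two transformation steps can be sketched in a few lines: a join collects the bag of related rows for each target entity, and aggregates collapse each bag into a fixed-length feature vector. The table layouts and the function name `propositionalize` are assumptions made for this illustration, not part of any particular system.

```python
from collections import Counter, defaultdict
from statistics import mean

def propositionalize(target_rows, related_rows):
    """Turn (gene_id, class) rows plus related rows into feature vectors.

    target_rows:  list of (gene_id, class_label)
    related_rows: list of (gene_id, component, length)   # assumed layout
    Returns one (COUNT, MEAN(length), MODE(component), class) tuple per gene.
    """
    # Exploration: join related rows to the target table on gene_id.
    bags = defaultdict(list)
    for gene_id, component, length in related_rows:
        bags[gene_id].append((component, length))

    # Aggregation: collapse each bag into scalar features.
    vectors = []
    for gene_id, label in target_rows:
        bag = bags.get(gene_id, [])
        count = len(bag)
        mean_length = mean(length for _, length in bag) if bag else 0.0
        mode_component = Counter(c for c, _ in bag).most_common(1)[0][0] if bag else None
        vectors.append((count, mean_length, mode_component, label))
    return vectors

target_rows = [("g1", 1), ("g2", 0)]
related_rows = [("g1", "a", 4), ("g1", "c", 2), ("g1", "a", 6), ("g2", "b", 5)]
vectors = propositionalize(target_rows, related_rows)
print(vectors)
```

The resulting tuples can be passed to any propositional learner, such as logistic regression.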

13
ACORA: Automatic Construction of Relational Attributes

Explores using breadth-first search over all joins on identifier attributes, starting from the target table
Joins back to the target (class) table to create attributes based on the class labels of related objects
Aggregation
Traditional aggregates: mean, median, mode
Bayesian aggregates for categorical attributes: distances to class-conditional distributions, counts of discriminative values
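The breadth-first exploration over joins can be pictured as a traversal of a schema graph. The sketch below is not ACORA's implementation: the table names, the `joins` adjacency map, and the depth limit are all assumptions chosen to mirror the data components described earlier.

```python
from collections import deque

def explore_join_paths(joins, target, max_depth):
    """Enumerate join paths breadth-first, starting from the target table.

    joins: dict mapping a table name to the tables it can be joined with
           on an identifier attribute (hypothetical schema description).
    """
    paths = []
    queue = deque([[target]])
    while queue:
        path = queue.popleft()
        if len(path) - 1 >= max_depth:  # path already uses max_depth joins
            continue
        for neighbor in joins.get(path[-1], []):
            new_path = path + [neighbor]
            paths.append(new_path)
            queue.append(new_path)
    return paths

# Hypothetical schema for the yeast data.
joins = {
    "gene": ["structure", "gene_homology", "protein_homology"],
    "gene_homology": ["gene"],        # joins back to the target table
    "protein_homology": ["protein"],
}
paths = explore_join_paths(joins, "gene", max_depth=2)
for path in paths:
    print(" -> ".join(path))
```

The path `gene -> gene_homology -> gene` is the one that joins back to the target table, which is where class labels of related genes become available as attributes.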

14
Assumptions and Violations

Assumption: co-occurrence in tables implies the existence of a relationship
Violation: missing pairs in the homology table have unknown scores, and scores also do not reflect uncertainty in the relationship
Violation: further joins with protein features are also affected by homology scores
Assumption: bags of related objects are random samples (i.i.d.)
Violation: homogeneous components of a single protein sequence are not i.i.d.
Assumption: class-conditional independence between attributes
Violation: protein sequence representation (order, component, length) and homology information (gene/protein id, score) both involve dependent attributes

Most relational learners (including ACORA)
require the above assumptions
15
Potential Solutions

Parameterize the learner to allow the user to specify the search space (e.g., Declarative Language Bias)
Change assumptions of the relational learner to match domain properties (e.g., PolyFarm)
Change the domain representation to match the assumptions of the relational learner

16
Protein Structure Representation
Two choices:
1. Counts of sub-sequences, disregarding length
2. Counts of sub-sequences augmented with log length
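One way to read the two choices is sketched below. The slides do not define "sub-sequence" precisely, so several things here are assumptions: a sub-sequence is taken to be an n-gram of consecutive structure components, the log is base 2 and truncated to an integer, and the function names are hypothetical.

```python
import math
from collections import Counter
from itertools import groupby

def run_tokens(sequence, with_log_length=False):
    """Collapse a PROF string into one token per homogeneous run.

    With with_log_length=True each token is augmented with the integer part
    of log2(run length), e.g. a run of 4 alpha-helix symbols becomes "a2".
    """
    tokens = []
    for component, run in groupby(sequence):
        length = len(list(run))
        if with_log_length:
            tokens.append(f"{component}{int(math.log2(length))}")
        else:
            tokens.append(component)
    return tokens

def subsequence_counts(sequence, n=2, with_log_length=False):
    """Count n-grams of consecutive run tokens (assumed meaning of sub-sequence)."""
    tokens = run_tokens(sequence, with_log_length)
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

plain = subsequence_counts("caaaaccbb")
augmented = subsequence_counts("caaaaccbb", with_log_length=True)
print(plain)
print(augmented)
```

Both variants replace the dependent (order, component, length) rows with counts that a bag-based aggregator can consume.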
17
Homology Representation

Three choices:
Stochastic aggregation: use the score as a probability to weight evidence
Subset: keep only the n (n = 10-50) closest objects for each gene
Subset with cutoff: discard all links with a score above a threshold (1e-6)
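The two subset options can be sketched as simple filters over the homology links. The (gene, other, score) tuple layout and the function names are assumptions for illustration; "closest" is taken to mean the lowest score, since a lower score indicates a more significant match.

```python
def top_n_links(links, n):
    """Subset: keep the n lowest-scoring (closest) objects for each gene."""
    by_gene = {}
    for gene, other, score in links:
        by_gene.setdefault(gene, []).append((score, other))
    return {
        gene: [other for _, other in sorted(pairs)[:n]]
        for gene, pairs in by_gene.items()
    }

def cutoff_links(links, threshold=1e-6):
    """Subset with cutoff: discard links whose score is above the threshold."""
    return [(g, o, s) for g, o, s in links if s <= threshold]

# Hypothetical homology links.
links = [
    ("g1", "g2", 1e-20),
    ("g1", "g3", 1e-3),
    ("g1", "g4", 1e-9),
    ("g2", "g1", 1e-20),
]
closest = top_n_links(links, n=2)
significant = cutoff_links(links)
print(closest)
print(significant)
```

After either filter, the remaining links can be treated as unweighted, which brings the data closer to the co-occurrence assumption of the learner.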

18
Experimental Methodology

Classification problem: Class 20 (cellular transport) vs. rest; prior 0.767
Training on 3000 genes, testing on 1000 holdout genes
ACORA settings: all aggregation operators and logistic classification models
Evaluation metrics: accuracy and AUC
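For reference, the two evaluation metrics can be computed as follows. This is a generic sketch with made-up scores, not the evaluation code used in the experiments; AUC is computed as the fraction of positive/negative pairs ranked correctly, with ties counted as half.

```python
def accuracy(scores, labels, threshold=0.5):
    """Fraction of examples whose thresholded score matches the label."""
    correct = sum((s >= threshold) == bool(y) for s, y in zip(scores, labels))
    return correct / len(labels)

def auc(scores, labels):
    """Area under the ROC curve via pairwise ranking of positives vs. negatives."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

# Hypothetical model scores for four holdout genes.
scores = [0.9, 0.8, 0.3, 0.6]
labels = [1, 0, 0, 1]
print(accuracy(scores, labels), auc(scores, labels))
```

With a skewed class prior, accuracy can look high for a model that always predicts the majority class, whereas AUC depends only on how well the scores rank positives above negatives.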

19
Classification Performance
a) Naïve aggregation
b) Adjusted homology scores
c) Adjusted protein structure
20
Conclusion

The gene classification task highlights some important limitations of existing relational learning methods:
handling ordered sequences and weighted and missing links
these limitations are common to other domains, e.g., time series analysis
ACORA with a modified domain representation yields better predictions than naïve aggregation
The yeast genome is a good benchmark dataset for further exploration and empirical analysis

21
Thank You!
22
Performance on 10 Most Common Classes
Large variation in class priors and performance across classes
23
Joining Back to the Class

Make features from the known class labels of similar genes
Ratio of the binary class
Treat it as just another attribute (but ignore your own label)
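The class-ratio feature can be sketched as the fraction of positively labeled genes among a gene's homologs, skipping the gene itself so that its own label does not leak into its features. The data structures and the function name are assumptions for illustration.

```python
def class_ratio_feature(gene, neighbors, labels):
    """Ratio of positive labels among a gene's labeled homologs.

    neighbors: dict gene -> list of homologous gene ids (assumed layout)
    labels:    dict gene -> 0/1 for genes with a known class
    Returns None when no labeled homolog other than the gene itself exists.
    """
    known = [
        labels[other]
        for other in neighbors.get(gene, [])
        if other != gene and other in labels  # ignore own label and unlabeled genes
    ]
    return sum(known) / len(known) if known else None

# Hypothetical homology neighbourhood; g4 is unlabeled.
neighbors = {"g1": ["g1", "g2", "g3", "g4"], "g5": []}
labels = {"g1": 1, "g2": 1, "g3": 0}
ratio = class_ratio_feature("g1", neighbors, labels)
print(ratio)
```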

24
Exploration: Graph Traversal
25
Aggregation

ytq0045

Typical aggregation
COUNT = 7
MEAN(Length) = 572
MEAN(Weight) = 63806
MODE(Category) = chondrus

(7, 572, 63806, chondrus, class)
(x1, x2, ..., xn, class)
26
Aggregating Categorical Attributes

MODE is often not appropriate for aggregating categorical attributes
Large loss of information
Not defined for unique attributes such as identifiers
Known problem: text classification
Naïve Bayes approach
Class-conditional distributions

27
Bayesian aggregates
1. Class-conditional distributions
2. Case vectors
3. Cosine distances for P1: Cosine(G1, DClass 1) = 0.316, Cosine(G1, DClass 0) = 0.97
4. Extended feature vector
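The four steps can be sketched with plain counters. This is an illustration under assumptions, not ACORA's code: the bags and labels are invented (so the numbers differ from those on the slide), the function names are hypothetical, and cosine similarity is used as the comparison measure.

```python
import math
from collections import Counter

def class_conditional_distributions(bags, labels):
    """Step 1: pool the categorical values of all cases, per class."""
    distributions = {}
    for case, values in bags.items():
        distributions.setdefault(labels[case], Counter()).update(values)
    return distributions

def cosine(u, v):
    """Cosine between two sparse count vectors stored as Counters."""
    dot = sum(count * v.get(key, 0) for key, count in u.items())
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

# Hypothetical bags of categorical values (e.g. identifiers of related objects).
bags = {"G1": ["x", "y"], "G2": ["x", "x"], "G3": ["z"]}
labels = {"G1": 1, "G2": 1, "G3": 0}

distributions = class_conditional_distributions(bags, labels)
case_vector = Counter(bags["G1"])                    # step 2: case vector
cos_pos = cosine(case_vector, distributions[1])      # step 3: compare to each class
cos_neg = cosine(case_vector, distributions[0])
extended_features = (cos_pos, cos_neg)               # step 4: append to feature vector
print(extended_features)
```

For simplicity the sketch leaves each case's own values in its class distribution; a careful implementation would likely exclude them when scoring training cases.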
