Title: Gene Classification: Issues and Challenges for Relational Learning Claudia Perlich
1 Gene Classification: Issues and Challenges for Relational Learning
Claudia Perlich, Srujana Merugu
IBM T.J. Watson Research Center
Presented by Srujana Merugu
2 Outline
- Objectives
- Problem and Domain Description
- Challenges for Relational Learning
- Potential Solutions using ACORA
- Experimental Results
3 Objectives
- Demonstrate that statistical relational learning approaches are suitable for gene classification
- Motivate the relational learning community to use this domain for benchmarking
- Highlight the current limitations of relational learning techniques with regard to expressive power, and motivate further developments
4 Problem Description (ILP Challenge 2005)
- Task: functional annotation of the yeast genome
- Training data
  - Functional annotations (class labels) for a partial set of genes
  - Secondary protein structure of each gene
  - Intra-gene homology
  - Protein-gene homology and protein features
5 Functional Annotations (FunCat Hierarchy)
- FunCat hierarchy: a forest of multiple class trees
- More than 400 classes
- Each gene can have multiple classes
- 4000 labeled examples and 674 unlabeled
6 Secondary Protein Structure (PROF Predictions)
Protein Sequence
caaaaaaaaaaaaaaaaaaaacccbbbbbcccccccccccaaa
a = alpha helix, b = beta sheet, c = random coil
- Each gene is associated with a single protein
- Homogeneous components are encoded as multiple records
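The encoding of homogeneous components as records can be sketched as a run-length encoding. This is only an illustration: the function name `encode_components` and the short input string are made up, and the record layout (order, component, length) is assumed from the feature names the slides use for the structure representation.

```python
from itertools import groupby


def encode_components(structure):
    """Split a secondary-structure string into (order, component, length) records.

    Each maximal run of identical letters (a, b or c) becomes one record.
    Hypothetical helper, not part of any tool named in the slides.
    """
    records = []
    for order, (component, run) in enumerate(groupby(structure), start=1):
        records.append((order, component, len(list(run))))
    return records


# Short made-up sequence, not taken from the data set.
records = encode_components("caaacccbb")
print(records)
```

One protein therefore yields a variable-length bag of ordered records, which is what makes this table awkward for aggregation-based learners.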
7 Intra-Gene Homology (BLAST Scores)
- Scores: probability of the match being found by chance
- Score < 1e-6 implies a significant match
- 100000 gene homology pairs with a highly skewed degree distribution
- Missing pairs correspond to unknown scores
8 Protein-Gene Homology and Protein Features
- Yeast-SwissProt homology scores similar to the intra-gene scores
- 45000 proteins and 3700000 gene-protein pairs with a highly skewed degree distribution
9 Summary of Data Components
- Functional Annotations (FunCat)
- Predicted Secondary Structure (PROF)
- Intra-Gene homology (BLAST scores)
- Protein homology (SwissProt database) and protein features
10 Suitability for Benchmarking
- Highly structured information that is intrinsically relational
- Important scientific problem with limited predictability so far
- Real-life data that require handling noise and sparsity
- Reasonable data set size requiring scalable techniques
11 Challenging Data Characteristics
- Weighted links
  - Homology scores do not correspond to the uncertainty in relationship
- Sparse data (missing links)
  - Missing pairs in homology tables need to be treated as links with unknown weights
- Ordered data
  - Representing secondary protein structure using the order, component and length features results in highly dependent features
- Multiple class labels
  - Classification problem needs to be posed as multiple one-vs.-all problems; interpretation of the hierarchy is also important
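Posing the multi-label task as one-vs.-all problems can be sketched as follows. The gene identifiers and labels are made up, and the hierarchy rule used here (a label counts as positive if it equals the target class or lies below it in a dot-separated code such as `20.01`) is an assumption for illustration, not something the slides specify.

```python
def is_in_class(label, target):
    """True if label is the target class or one of its descendants.

    Assumes dot-separated hierarchical codes, e.g. "20.01" lies below "20".
    """
    return label == target or label.startswith(target + ".")


def one_vs_all_targets(annotations, target):
    """Map each gene to 1 if any of its labels falls under target, else 0."""
    return {
        gene: int(any(is_in_class(label, target) for label in labels))
        for gene, labels in annotations.items()
    }


# Hypothetical annotations: each gene may carry several class labels.
annotations = {
    "geneA": ["20.01", "01"],
    "geneB": ["02"],
    "geneC": ["20"],
    "geneD": ["200.1"],  # must not match class "20"
}
targets = one_vs_all_targets(annotations, "20")
print(targets)
```

Repeating the call once per class of interest gives one binary problem per class.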
12 Propositionalization
- Transformation
  - Exploration: finding related entities using joins
  - Aggregation: constructing features, e.g., mean, mode
- Model estimation
  - Logistic regression, decision trees, Naïve Bayes, SVM, etc.
  - Require each entity to be a feature vector (x1, x2, ..., xn, class)
13 ACORA: Automatic Construction of Relational Attributes
- Explores using breadth-first search over all joins on identifier attributes, starting from the target table
- Joins back to the target (class) table to create attributes based on class labels of related objects
- Aggregation
  - Traditional aggregates: mean, median, mode
  - Bayesian aggregates for categorical attributes: distances to class-conditional distributions, counts of discriminative values
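The breadth-first exploration of joins can be illustrated with a small sketch. This is not ACORA's implementation: the table names, the `schema` dictionary (which tables share an identifier attribute) and the `max_depth` parameter are all hypothetical.

```python
from collections import deque


def explore_joins(schema, target, max_depth):
    """Enumerate join paths from the target table in breadth-first order.

    schema maps a table name to the tables it can be joined with on an
    identifier attribute. Paths may return to the target table, which
    mirrors joining back to the class table.
    """
    paths = []
    queue = deque([[target]])
    while queue:
        path = queue.popleft()
        if len(path) - 1 >= max_depth:
            continue  # this path already uses max_depth joins
        for neighbor in schema.get(path[-1], []):
            new_path = path + [neighbor]
            paths.append(new_path)
            queue.append(new_path)
    return paths


# Hypothetical schema loosely modelled on the data components.
schema = {
    "gene_class": ["gene_homology", "structure", "gene_protein"],
    "gene_homology": ["gene_class"],
    "gene_protein": ["protein_features"],
}
paths = explore_joins(schema, "gene_class", max_depth=2)
for p in paths:
    print(" -> ".join(p))
```

Each enumerated path defines one bag of related records per gene, which is then handed to the aggregation step.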
14 Assumptions and Violations
- Assumption: co-occurrence in tables implies existence of a relationship
  - Missing pairs in the homology table are unknown scores, and scores also do not reflect uncertainty in the relationship
  - Further joins with protein features are also affected by homology scores
- Assumption: bags of related objects are random samples (i.i.d.)
  - Homogeneous components of a single protein sequence are not i.i.d.
- Assumption: class-conditional independence between attributes
  - Protein sequence representation (order, component, length) and homology information (gene/protein id, score) both involve dependent attributes
Most relational learners (including ACORA) require the above assumptions
15 Potential Solutions
- Parameterize the learner to allow the user to specify the search space (e.g., Declarative Language Bias)
- Change assumptions of the relational learner to match domain properties (e.g., PolyFarm)
- Change the domain representation to match the assumptions of the relational learner
16 Protein Structure Representation
Two choices:
1. Counts of sub-sequences disregarding length
2. Counts of sub-sequences augmented with log length
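A rough sketch of the two choices follows. It rests on a strong assumption: a "sub-sequence" is taken to mean one homogeneous component (a run of a, b or c), and "augmented with log length" is read as pairing the component type with its rounded natural-log length. The slides do not define either term precisely, and the records below are made up.

```python
import math
from collections import Counter


def component_counts(records):
    """Choice 1 (as assumed here): count components by type, ignoring length."""
    return Counter(component for _, component, _ in records)


def component_counts_with_log_length(records):
    """Choice 2 (as assumed here): count (type, rounded log length) tokens."""
    return Counter(
        (component, round(math.log(length)))
        for _, component, length in records
    )


# Made-up (order, component, length) records for one protein.
records = [(1, "c", 1), (2, "a", 20), (3, "c", 3), (4, "b", 5), (5, "a", 3)]

plain = component_counts(records)
with_length = component_counts_with_log_length(records)
print(plain)
print(with_length)
```

Both variants drop the order attribute and turn the sequence into a bag of categorical tokens, which suits aggregation operators better than the raw ordered records.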
17 Homology Representation
- Three choices
  - Stochastic aggregation: use score as probability to weight evidence
  - Subset: keep only the n (n = 10-50) closest objects for each gene
  - Subset with cutoff: discard all links with a score above a threshold (1e-6)
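The two subset options can be sketched as simple filters over one gene's homology links. The link data and function names are hypothetical, and "closest" is assumed to mean the smallest score, since a small score marks a significant match.

```python
def closest_n(links, n):
    """Subset: keep the n links with the smallest scores."""
    return sorted(links, key=lambda link: link[1])[:n]


def apply_cutoff(links, threshold=1e-6):
    """Subset with cutoff: discard links whose score is above the threshold."""
    return [link for link in links if link[1] <= threshold]


# Made-up (neighbor id, score) links for a single gene.
links = [("g2", 1e-30), ("g3", 0.5), ("g4", 1e-8), ("g5", 1e-3)]

top2 = closest_n(links, 2)
significant = apply_cutoff(links)
print(top2)
print(significant)
```

Either filter leaves only links for which co-occurrence in the homology table is more plausibly a real relationship, so the remaining bag fits the learner's assumptions better.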
18 Experimental Methodology
- Classification problem: Class 20 (cellular transport) vs. rest; prior 0.767
- Training using 3000 genes, testing on 1000 holdout genes
- ACORA settings: using all aggregation operators and logistic classification models
- Evaluation metrics: accuracy, AUC
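For reference, the two evaluation metrics can be computed as in the sketch below. The labels and scores are made up, and the 0.5 decision threshold is an assumption. AUC is computed from its pairwise definition: the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as one half.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of examples whose thresholded score matches the label."""
    predictions = [int(score >= threshold) for score in scores]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def auc(labels, scores):
    """Probability that a random positive outranks a random negative."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Made-up holdout labels and model scores.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]

acc = accuracy(labels, scores)
area = auc(labels, scores)
print(acc, area)
```

Because AUC depends only on the ranking of scores, it is less sensitive to a skewed class prior than accuracy is, which is one reason to report both.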
19 Classification Performance
[Figure with three panels]
a) Naïve aggregation
b) Adjusted homology scores
c) Adjusted protein structure
20 Conclusion
- Gene classification task highlights some important limitations of existing relational learning methods
  - handling ordered sequences, weighted and missing links
  - common to other domains, e.g., time series analysis
- ACORA with modified domain representation results in better predictions as compared to naïve aggregation
- Yeast genome is a good benchmark dataset for further exploration and empirical analysis
21 Thank You!
22 Performance on 10 Most Common Classes
Large variation in class priors and performance
across classes
23 Joining back to the class
- Make features for the known class labels of similar genes
- Ratio of binary class
- Treat it as just another attribute (but ignore your own label)
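A minimal sketch of such a class-ratio feature is given below. The data structures are hypothetical: `neighbors` maps a gene to its similar genes and `labels` holds the known binary class labels. The gene's own label is skipped, as the slide requires, and unlabeled neighbors are ignored.

```python
def neighbor_class_ratio(gene, neighbors, labels):
    """Fraction of labeled neighbors in the positive class.

    The gene itself is excluded so its own label never leaks into the
    feature. Returns None when no labeled neighbor is available.
    """
    known = [
        labels[other]
        for other in neighbors.get(gene, [])
        if other != gene and other in labels
    ]
    if not known:
        return None
    return sum(known) / len(known)


# Made-up homology neighbors and known labels (g4 is unlabeled).
neighbors = {"g1": ["g1", "g2", "g3", "g4"], "g4": ["g4"]}
labels = {"g1": 1, "g2": 1, "g3": 0}

ratio = neighbor_class_ratio("g1", neighbors, labels)
print(ratio)
```

Excluding the gene's own label matters during training: a self-link in the homology table would otherwise hand the model the answer.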
24 Exploration: Graph Traversal
25 Aggregation
- Typical aggregation
  - COUNT = 7
  - MEAN(Length) = 572
  - MEAN(Weight) = 63806
  - MODE(Category) = chondrus
(7, 572, 63806, chondrus, class)
(x1, x2, ..., xn, class)
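The typical aggregates can be sketched as a function that collapses a bag of related records into one fixed-length feature tuple. The record fields and values are made up and deliberately do not reproduce the numbers on the slide.

```python
from collections import Counter
from statistics import mean


def aggregate(bag):
    """Collapse a bag of related records into (count, mean, mean, mode)."""
    count = len(bag)
    mean_length = mean(record["length"] for record in bag)
    mean_weight = mean(record["weight"] for record in bag)
    mode_category = Counter(r["category"] for r in bag).most_common(1)[0][0]
    return (count, mean_length, mean_weight, mode_category)


# Made-up bag of records related to a single entity.
bag = [
    {"length": 100, "weight": 10, "category": "x"},
    {"length": 200, "weight": 20, "category": "y"},
    {"length": 300, "weight": 30, "category": "x"},
]
features = aggregate(bag)
print(features)
```

Appending the class label to such a tuple gives the (x1, x2, ..., xn, class) vector that a propositional learner expects.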
26 Aggregating Categorical Attributes
- MODE is often not appropriate for aggregation of categorical attributes
  - Large loss of information
  - Not defined for unique attributes such as identifiers
- Known problem: text classification
  - Naïve Bayes approach
  - Class-conditional distributions
27 Bayesian aggregates
1. Class-conditional distributions
2. Case vectors
3. Cosine distances for P1: Cosine(G1, DClass 1) = 0.316, Cosine(G1, DClass 0) = 0.97
4. Extended feature vector
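The four steps can be sketched as follows. This is an illustration under assumptions, not ACORA's code: the bags of categorical values (for example, identifiers of homologous proteins), the labels and all function names are made up, and the cosine is computed as a similarity between the case vector and each class-conditional distribution.

```python
import math
from collections import Counter


def class_conditional_distribution(bags, labels, target_label):
    """Step 1: relative frequency of each value over bags with target_label."""
    counts = Counter()
    for entity, values in bags.items():
        if labels.get(entity) == target_label:
            counts.update(values)
    total = sum(counts.values())
    return {value: c / total for value, c in counts.items()}


def cosine(u, v):
    """Step 3: cosine between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(key, 0.0) for key, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up training bags and binary labels.
bags = {"g1": ["p1", "p2"], "g2": ["p1", "p1", "p3"], "g3": ["p4"], "g4": ["p4", "p2"]}
labels = {"g1": 1, "g2": 1, "g3": 0, "g4": 0}

d_class1 = class_conditional_distribution(bags, labels, 1)
d_class0 = class_conditional_distribution(bags, labels, 0)

# Step 2: case vector (value counts) for a new, unlabeled gene.
case_vector = Counter(["p1", "p3"])

sim1 = cosine(case_vector, d_class1)
sim0 = cosine(case_vector, d_class0)

# Step 4: the two cosines extend whatever other features the gene has.
extended_features = (sim1, sim0)
print(extended_features)
```

Unlike MODE, this keeps information from every value in the bag and still works when the values are near-unique identifiers.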