Gene Classification: Issues and Challenges for Relational Learning Claudia Perlich - PowerPoint PPT Presentation

1
Gene Classification: Issues and Challenges for
Relational Learning
Claudia Perlich and Srujana Merugu
IBM T.J. Watson Research Center
Presented by Srujana Merugu
2
Outline
  • Objectives
  • Problem and Domain Description
  • Challenges for Relational Learning
  • Potential Solutions using ACORA
  • Experimental Results

3
Objectives
  • Demonstrate that statistical relational learning
    approaches are suitable for gene-classification
  • Motivate the relational learning community to use
    this domain for benchmarking
  • Highlight the current limitations of relational
    learning techniques with regard to expressive
    power and motivate further developments

4
Problem Description (ILP Challenge 2005)
  • Task: functional annotation of the yeast genome
  • Training data:
      • Functional annotations (class labels) for a
        partial set of genes
      • Secondary protein structure of each gene
      • Intra-gene homology
      • Protein-gene homology and protein features

5
Functional Annotations (FunCat Hierarchy)
  • FunCat hierarchy: a forest of multiple class
    trees
  • More than 400 classes
  • Each gene can have multiple classes
  • 4000 labeled examples and 674 unlabeled

6
Secondary Protein Structure (PROF Predictions)
Protein sequence:
caaaaaaaaaaaaaaaaaaaacccbbbbbcccccccccccaaa
Legend: a = alpha helix, b = beta sheet, c = random coil
  • Each gene is associated with a single protein
  • Homogeneous components are encoded as multiple
    records
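
The "multiple records" encoding can be sketched as a run-length encoding of the structure string. This is a minimal illustration, assuming each record is an (order, component, length) triple as named on the later "Challenging Data Characteristics" slide; the function name and the short input string are made up for the example.

```python
from itertools import groupby

def encode_structure(sequence):
    """Collapse a secondary-structure string into (order, component, length) records."""
    records = []
    # groupby yields one group per maximal run of identical characters.
    for order, (component, run) in enumerate(groupby(sequence), start=1):
        records.append((order, component, len(list(run))))
    return records

# Short made-up string over the slide's alphabet (a = helix, b = sheet, c = coil).
records = encode_structure("caaacccbb")
for record in records:
    print(record)
```

Each homogeneous run becomes one row, so a single gene maps to a variable-length bag of records.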

7
Intra-Gene Homology (BLAST Scores)
  • Scores: probability of the match being found by
    chance
  • Score < 1e-6 indicates a significant match
  • 100,000 gene homology pairs with a highly skewed
    degree distribution
  • Missing pairs correspond to unknown scores

8
Protein-Gene Homology and Protein Features
  • Yeast-SwissProt homology scores, similar to the
    intra-gene scores
  • 45,000 proteins and 3,700,000 gene-protein pairs
    with a highly skewed degree distribution

9

Summary of Data Components
  • Functional Annotations (FunCat)
  • Predicted Secondary Structure (PROF)
  • Intra-Gene homology (BLAST scores)
  • Protein homology (SwissProt database) and
    protein features

10
Suitability for Benchmarking
  • Highly structured information that is
    intrinsically relational
  • Important scientific problem with limited
    predictability so far
  • Real-life data that require handling noise and
    sparsity
  • Reasonable data set size requiring scalable
    techniques

11
Challenging Data Characteristics
  • Weighted links
      • Homology scores do not correspond to the
        uncertainty in the relationship
  • Sparse data (missing links)
      • Missing pairs in homology tables need to be
        treated as links with unknown weights
  • Ordered data
      • Representing secondary protein structure using
        the order, component and length features
        results in highly dependent features
  • Multiple class labels
      • The classification problem needs to be posed
        as multiple one-vs.-all problems
      • Interpretation of the hierarchy is also
        important
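
Posing a multi-label task as multiple one-vs.-all problems can be sketched as follows. The gene ids and class codes are hypothetical, and the sketch ignores the class hierarchy entirely; it only shows how one binary target per class is derived.

```python
def one_vs_all_targets(gene_classes, target_class):
    """Return {gene: 1 or 0} for a single 'target_class vs. rest' problem."""
    return {gene: int(target_class in classes)
            for gene, classes in gene_classes.items()}

# Hypothetical genes, each carrying a set of class labels.
gene_classes = {"g1": {"01", "20"}, "g2": {"10"}, "g3": {"20"}}

# One binary problem per class that occurs in the data.
all_classes = sorted(set().union(*gene_classes.values()))
problems = {c: one_vs_all_targets(gene_classes, c) for c in all_classes}
print(problems["20"])
```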

12
Propositionalization
  • Transformation
      • Exploration: finding related entities using
        joins
      • Aggregation: constructing features, e.g., mean,
        mode
  • Model estimation
      • Logistic regression, decision trees, Naïve
        Bayes, SVM, etc.
      • Requires each entity to be a feature vector
        (x1, x2, ..., xn, class)
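
The two transformation steps can be sketched in a few lines. This is an illustration only, assuming a single related table joined on a hypothetical `target_id` column; the row contents are invented and the choice of COUNT, MEAN and MODE as aggregates is just one example.

```python
from collections import Counter, defaultdict
from statistics import mean

def propositionalize(targets, related, numeric_field, categorical_field):
    """targets: {id: class}; related: list of dict rows keyed by 'target_id'."""
    # Exploration: join related rows to each target entity by identifier.
    bags = defaultdict(list)
    for row in related:
        bags[row["target_id"]].append(row)
    # Aggregation: collapse each bag into a fixed-length feature vector.
    vectors = {}
    for target_id, label in targets.items():
        bag = bags.get(target_id, [])
        if bag:
            avg = mean(row[numeric_field] for row in bag)
            mode = Counter(row[categorical_field] for row in bag).most_common(1)[0][0]
        else:
            avg = mode = None
        vectors[target_id] = (len(bag), avg, mode, label)
    return vectors

# Hypothetical data: two genes, one with three related records.
targets = {"g1": 1, "g2": 0}
related = [
    {"target_id": "g1", "length": 10, "component": "a"},
    {"target_id": "g1", "length": 20, "component": "a"},
    {"target_id": "g1", "length": 6, "component": "c"},
]
vectors = propositionalize(targets, related, "length", "component")
print(vectors)
```

The resulting tuples have the (x1, x2, ..., xn, class) shape that a standard learner such as logistic regression expects.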

13
ACORA: Automatic Construction of Relational
Attributes
  • Explores using breadth-first search over all
    joins on identifier attributes, starting from the
    target table
  • Joins back to the target (class) table to create
    attributes based on the class labels of related
    objects
  • Aggregation
      • Traditional aggregates: mean, median, mode
      • Bayesian aggregates for categorical
        attributes: distances to class-conditional
        distributions, counts of discriminative values
14
Assumptions and Violations
  • Assumption: co-occurrence in tables implies the
    existence of a relationship
      • Missing pairs in the homology table have
        unknown scores, and the scores do not reflect
        uncertainty in the relationship
      • Further joins with protein features are also
        affected by homology scores
  • Assumption: bags of related objects are random
    samples (i.i.d.)
      • Homogeneous components of a single protein
        sequence are not i.i.d.
  • Assumption: class-conditional independence
    between attributes
      • Protein sequence representation (order,
        component, length) and homology information
        (gene/protein id, score) both involve
        dependent attributes

Most relational learners (including ACORA)
rely on the above assumptions
15
Potential Solutions
  • Parameterize the learner to allow the user to
    specify the search space (e.g., Declarative
    Language Bias)
  • Change assumptions of relational learner to match
    domain properties (e.g., PolyFarm)
  • Change domain representation to match the
    assumptions of the relational learner

16
Protein Structure Representation
Two choices:
  1. Counts of sub-sequences, disregarding length
  2. Counts of sub-sequences augmented with log
     length
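
One possible reading of the two choices is sketched below. The slide does not define "sub-sequence" precisely; the sketch assumes it means a homogeneous run of one component, and the use of base-2 logarithms truncated to an integer bucket is likewise an assumption.

```python
import math
from collections import Counter
from itertools import groupby

def runs(sequence):
    """Split a structure string into (component, length) runs."""
    return [(component, len(list(group))) for component, group in groupby(sequence)]

def counts_ignoring_length(sequence):
    # Choice 1: count runs per component, disregarding how long each run is.
    return Counter(component for component, _ in runs(sequence))

def counts_with_log_length(sequence):
    # Choice 2: count runs per (component, log-length bucket).
    return Counter((component, int(math.log2(length)))
                   for component, length in runs(sequence))

example = "caaaaaaaacccbb"  # made-up string: runs c1, a8, c3, b2
print(counts_ignoring_length(example))
print(counts_with_log_length(example))
```

Both variants turn the ordered, mutually dependent records into unordered counts, which sits better with the i.i.d. assumption discussed on the previous slides.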
17
Homology Representation
  • Three choices
      • Stochastic aggregation: use the score as a
        probability to weight evidence
      • Subset: keep only the n (n = 10-50) closest
        objects for each gene
      • Subset with cutoff: discard all links with a
        score above a threshold (1e-6)
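
The three choices can be sketched as small functions over a gene's homology links. The link data and function names are hypothetical. In particular, weighting evidence by (1 - score) is an assumption about how a chance-match probability might be turned into a weight; the slides do not give the exact formula.

```python
def weighted_positive_ratio(links, labels):
    """Stochastic aggregation: weight each neighbor's label by (1 - score).

    links: list of (neighbor_id, score), where score is the probability
    that the match was found by chance (lower means a stronger match).
    """
    numerator = denominator = 0.0
    for neighbor, score in links:
        if neighbor in labels:
            weight = 1.0 - score  # assumed weighting, see lead-in
            numerator += weight * labels[neighbor]
            denominator += weight
    return numerator / denominator if denominator else None

def closest_n(links, n):
    """Subset: keep only the n links with the lowest scores."""
    return sorted(links, key=lambda link: link[1])[:n]

def below_cutoff(links, threshold=1e-6):
    """Subset with cutoff: discard links whose score is above the threshold."""
    return [link for link in links if link[1] <= threshold]

# Hypothetical links for one gene and known labels for two neighbors.
links = [("g2", 1e-30), ("g3", 0.5), ("g4", 1e-8), ("g5", 0.9)]
labels = {"g2": 1, "g3": 0}
print(weighted_positive_ratio(links, labels))
print(closest_n(links, 2))
print(below_cutoff(links))
```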

18
Experimental Methodology
  • Classification problem: Class 20 (cellular
    transport) vs. rest, prior 0.767
  • Training on 3,000 genes, testing on 1,000
    holdout genes
  • ACORA settings: all aggregation operators and
    logistic classification models
  • Evaluation metrics: accuracy and AUC
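
For reference, the two evaluation metrics can be computed as in the sketch below. The labels and scores are invented, the 0.5 decision threshold is an assumption, and AUC is computed through its pairwise-ranking definition rather than any particular library routine.

```python
def accuracy(y_true, scores, threshold=0.5):
    """Fraction of examples whose thresholded score matches the label."""
    predictions = [int(score >= threshold) for score in scores]
    return sum(p == y for p, y in zip(predictions, y_true)) / len(y_true)

def auc(y_true, scores):
    """Probability that a random positive is scored above a random negative."""
    positives = [s for s, y in zip(scores, y_true) if y == 1]
    negatives = [s for s, y in zip(scores, y_true) if y == 0]
    # Ties between a positive and a negative count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

# Invented holdout labels and model scores.
y_true = [1, 0, 1, 0]
scores = [0.9, 0.4, 0.35, 0.2]
print(accuracy(y_true, scores), auc(y_true, scores))
```

Accuracy depends on the class prior and the threshold, whereas AUC only depends on how the scores rank the examples, which is why reporting both is useful when priors are skewed.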

19
Classification Performance
[Figure: three panels]
a) Naïve aggregation
b) Adjusted homology scores
c) Adjusted protein structure
20
Conclusion
  • The gene classification task highlights some
    important limitations of existing relational
    learning methods
      • Handling ordered sequences, weighted and
        missing links
      • Common to other domains, e.g., time series
        analysis
  • ACORA with a modified domain representation
    yields better predictions than naïve aggregation
  • The yeast genome is a good benchmark dataset for
    further exploration and empirical analysis

21
Thank You!
22
Performance on 10 Most Common Classes
Large variation in class priors and performance
across classes
23
Joining back to the class
  • Make features for the known class labels of
    similar genes
  • Ratio of binary class
  • Treat it as just another attribute (but ignore
    your own label)
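
A minimal sketch of such a feature, assuming a binary class and a hypothetical neighbor list per gene: the ratio of positive labels among related genes with known labels, leaving out the gene's own label.

```python
def neighbor_class_ratio(gene, neighbors, labels):
    """Share of positively labeled neighbors, ignoring the gene itself."""
    known = [labels[n] for n in neighbors.get(gene, [])
             if n != gene and n in labels]
    return sum(known) / len(known) if known else None

# Hypothetical homology neighbors; g9 is unlabeled and g1 links to itself.
neighbors = {"g1": ["g1", "g2", "g3", "g9"]}
labels = {"g1": 1, "g2": 1, "g3": 0}
print(neighbor_class_ratio("g1", neighbors, labels))
```

Skipping the gene's own label matters because a self-link would otherwise leak the target into the feature.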

24
Exploration: Graph Traversal
25
Aggregation
  • ytq0045
  • Typical aggregation
      • COUNT = 7
      • MEAN(Length) = 572
      • MEAN(Weight) = 63806
      • MODE(Category) = chondrus

(7, 572, 63806, chondrus, class)
(x1, x2, ..., xn, class)
26
Aggregating Categorical Attributes
  • MODE is often not appropriate for aggregation of
    categorical attributes
      • Large loss of information
      • Not defined for unique attributes such as
        identifiers
  • Known problem: text classification
      • Naïve Bayes approach
      • Class-conditional distributions
27
Bayesian aggregates
1. Class-conditional distributions
2. Case vectors
3. Cosine distances for P1: Cosine(G1, D_Class1) = 0.316,
   Cosine(G1, D_Class0) = 0.97
4. Extended feature vector
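
The four steps can be sketched as follows. The data are invented and do not reproduce the numbers on the slide. The sketch computes cosine similarity (1 means identical direction), which is an assumption about what the slide's "cosine distance" values denote, and it does not exclude an entity's own bag when estimating its class distribution.

```python
import math
from collections import Counter

def class_conditional(bags, labels, cls):
    """Step 1: relative frequency of each categorical value within one class."""
    total = Counter()
    for entity, bag in bags.items():
        if labels.get(entity) == cls:
            total.update(bag)
    n = sum(total.values())
    return {value: count / n for value, count in total.items()} if n else {}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(key, 0.0) for key, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical bags of categorical values per entity, with binary labels.
bags = {"p1": ["x", "x", "y"], "p2": ["y", "z", "z"], "p3": ["x"]}
labels = {"p1": 1, "p2": 0, "p3": 1}

d1 = class_conditional(bags, labels, 1)
d0 = class_conditional(bags, labels, 0)
case = Counter(bags["p1"])                 # Step 2: case vector for p1
sim1, sim0 = cosine(case, d1), cosine(case, d0)  # Step 3
extended = (sim1, sim0, labels["p1"])      # Step 4: extended feature vector
print(extended)
```

Unlike MODE, these features use the whole bag of values and remain defined for high-cardinality attributes such as identifiers.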