Transcript and Presenter's Notes

Title: Application of Stacked Generalization to a Protein Localization Prediction Task


1
Application of Stacked Generalization to a
Protein Localization Prediction Task
  • Melissa K. Carroll, M.S. and Sung-Hyuk Cha, Ph.D.
  • Pace University, School of Computer Science and
    Information Systems
  • September 27, 2003

2
Overview
  • Introduction
  • Purpose
  • Methods
  • Algorithms
  • Results
  • Conclusions and Future Work

3
Introduction
4
Introduction: Data Mining
  • Application of machine learning algorithms to large databases
  • Often used to classify future data based on a training set
  • The target variable is the variable to be predicted
  • Theoretically, algorithms are context-independent

5
Introduction: Stacked Generalization
  • Method for combining models
  • Part of the training set is used to train the level-0, or base, models as usual
  • Level-1 data are built from the level-0 models' predictions on the remainder of the set
  • Level-1 generalizers are models trained on the level-1 data (see the sketch below)
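
A minimal Python sketch of this data flow, assuming scikit-learn-style estimators; the particular models, the 50/50 split, and the helper name stack are illustrative choices, not the configuration used in this project:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

def stack(X, y, seed=0):
    # Hold out part of the training set: level-0 models are fit on one part,
    # and their predictions on the held-out part become the level-1 data.
    X_fit, X_hold, y_fit, y_hold = train_test_split(
        X, y, test_size=0.5, random_state=seed)

    level0 = [DecisionTreeClassifier(random_state=seed),
              MLPClassifier(hidden_layer_sizes=(1,), max_iter=2000,
                            random_state=seed)]
    for model in level0:
        model.fit(X_fit, y_fit)

    # Level-1 data: one column of predictions per level-0 model
    # (assumes numeric class labels; encode them first otherwise).
    Z = np.column_stack([m.predict(X_hold) for m in level0])

    # Level-1 generalizer trained on the level-0 predictions.
    generalizer = GaussianNB().fit(Z, y_hold)
    return level0, generalizer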

6
Introduction: Bioinformatics and Protein Localization
  • Bioinformatics: the application of computing to molecular biology
  • Currently much interest in information about proteins
  • Expression of proteins is localized in a particular type or part of the cell (localization)
  • Knowledge of protein localization can shed light on a protein's function
  • Data mining employed to predict localization from a database of information about the encoding genes

7
Introduction: KDD Cup 2001 Task
  • KDD Cup: annual data mining competition sponsored by ACM SIGKDD
  • Participants use a training set to predict target variable values in a test dataset of different instances
  • Winner is the most accurate model (correct predictions / total instances in the test set)
  • 2001 task: predict the protein localization of genes
  • Anonymized genes were the instances; information about the genes supplied the attributes
  • Datasets (including the later-revealed target values) used in this project

8
Purpose
  • Use the Stacked Generalization approach on this task
  • Compare inter-algorithm performance for level-0 models and level-1 generalizers
  • Evaluate the strategy of equally distributing the target variable

9
Methods
10
Methods: Dataset Manipulations
  • Reduce number of input variables
  • Reduce number of potential target values to 3
  • Separate original training dataset into training
    and validation sets for stacking
  • Eliminate effectively unary variables in final
    training dataset

11
Table: Target Variable Distribution
12
Methods: Equally Distributed Approach
  • Second training set created by stratified sampling to ensure equally distributed localizations (see the sketch below)
  • Level-0 models trained on both the raw (unequally distributed) and the equally distributed training sets
  • Separate level-1 data and level-1 generalizers derived from this dataset
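
A minimal sketch of one way to build such an equally distributed set, by downsampling every localization class to the size of the rarest one; the pandas helper and the column name "localization" are assumptions, not the project's exact procedure:

import pandas as pd

def equalize(frame, target="localization", seed=0):
    # Downsample every class to the size of the rarest localization.
    n = frame[target].value_counts().min()
    return (frame.groupby(target, group_keys=False)
                 .apply(lambda g: g.sample(n=n, random_state=seed)))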

13
Algorithms
14
Algorithms: Level-0 Artificial Neural Network (ANN)
  • Fully connected feedforward network
  • Input variables → dummy variables → 186 input nodes
  • Target variable → dummy variables → 2 output nodes
  • 1 hidden node
  • Training based on change in misclassification rate (see the sketch below)
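
A minimal sketch of this setup, assuming a pandas/scikit-learn workflow; MLPClassifier handles the output layer internally rather than through explicit dummy-coded output nodes, and the helper and column names are illustrative:

import pandas as pd
from sklearn.neural_network import MLPClassifier

def train_level0_ann(frame, target="localization"):
    # Expand categorical gene attributes into dummy (one-hot) input columns.
    X = pd.get_dummies(frame.drop(columns=[target]))
    y = frame[target]
    # Small fully connected feedforward network with a single hidden node.
    ann = MLPClassifier(hidden_layer_sizes=(1,), max_iter=2000, random_state=0)
    return ann.fit(X, y)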

15
Algorithms: Level-0 Decision Tree
  • Used a CHAID-like algorithm
  • Chi-squared p-value splitting criterion: p < 0.2 (see the sketch below)
  • Model selection based on proportion of instances correctly classified
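
A minimal sketch of such a chi-squared splitting test, assuming pandas and SciPy; the helper and column names are illustrative, and a full CHAID-like tree would apply this test recursively while growing each branch:

import pandas as pd
from scipy.stats import chi2_contingency

def split_is_significant(frame, attribute, target="localization", alpha=0.2):
    # Contingency table of the candidate attribute's values vs. the target.
    table = pd.crosstab(frame[attribute], frame[target])
    _, p_value, _, _ = chi2_contingency(table)
    # Keep the split only if the chi-squared test gives p < 0.2.
    return p_value < alpha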

16
Algorithms: Level-0 Nearest Neighbor (NN)
  • Compare each test instance against every training instance
  • Count the number of matching attributes
  • Predict the target value of the training instance matching on the greatest number of attributes
  • Use relative frequency in the unequally distributed dataset to break ties (see the sketch below)
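
A minimal sketch of this matching-attribute nearest neighbor, assuming pandas data frames; the tie-breaking detail shown (preferring the overall most frequent localization among tied neighbors) is an illustrative reading of the rule above:

import pandas as pd

def predict_nn(train_X, train_y, test_X):
    class_freq = train_y.value_counts()          # for tie-breaking
    predictions = []
    for _, row in test_X.iterrows():
        # Count matching attributes against every training instance.
        matches = (train_X == row).sum(axis=1)
        tied = train_y.loc[matches[matches == matches.max()].index]
        # Among tied best matches, prefer the most frequent localization.
        predictions.append(max(tied, key=lambda label: class_freq[label]))
    return predictions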

17
Algorithms: Level-0 Hybrid Decision Tree/ANN
  • Difficult for an ANN to learn with too many variables
  • A Decision Tree can be used as a feature selector
  • Important variables are those used as branching criteria
  • A new ANN is trained using only the important variables as inputs (see the sketch below)
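
A minimal sketch of the hybrid idea, assuming dummy-coded NumPy inputs and scikit-learn estimators (whose CART-style tree stands in for the CHAID-like tree used here):

import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

def train_hybrid(X, y, seed=0):
    # Fit a decision tree and keep the variables it actually branches on.
    tree = DecisionTreeClassifier(random_state=seed).fit(X, y)
    used = np.unique(tree.tree_.feature[tree.tree_.feature >= 0])
    # Train a new, smaller ANN using only the selected columns as inputs.
    ann = MLPClassifier(hidden_layer_sizes=(1,), max_iter=2000,
                        random_state=seed).fit(X[:, used], y)
    return used, ann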

18
Algorithms: Level-1 Generalizers
  • ANN and Decision Tree
    • Designed and trained essentially the same as their level-0 counterparts
    • ANN had 8 input nodes
  • Naïve Bayesian Model
    • Calculated the likelihood of each target value using Bayes' rule
    • Predicted the value with the highest likelihood (see the sketch below)
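
A minimal sketch of such a naïve Bayesian generalizer over the level-1 data; the Laplace smoothing and the data layout (one tuple of level-0 predictions per instance) are illustrative assumptions:

from collections import Counter, defaultdict

def train_naive_bayes(level1_rows, labels):
    # level1_rows: tuples of level-0 predictions; labels: true localizations.
    prior = Counter(labels)
    counts = defaultdict(Counter)    # (model index, prediction) -> label counts
    for row, label in zip(level1_rows, labels):
        for i, pred in enumerate(row):
            counts[(i, pred)][label] += 1

    def predict(row):
        def likelihood(label):
            # Bayes' rule with Laplace smoothing over the level-0 predictions.
            score = prior[label] / len(labels)
            for i, pred in enumerate(row):
                score *= (counts[(i, pred)][label] + 1) / (prior[label] + len(prior))
            return score
        return max(prior, key=likelihood)

    return predict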

19
Results
20
Results: Accuracy Rates
21
Results: Evaluation of Accuracy Rates
  • Accuracy similar to the highest-performing KDD Cup models
  • However, predictions were drawn from a much smaller pool of potential localizations
  • Also not much better than always predicting the nucleus
  • Still, the models had fewer input variables with which to work

22
Level-1 Decision Tree Diagram
23
Results: Statistical Comparisons
  • No significant inter-algorithm differences for
    level-0 models
  • Hybrid offered some improvement over ANN alone
  • Equal distribution usually resulted in slightly
    worse performance
  • Stacked Generalization resulted in better
    performance, sometimes significantly so

24
Conclusions and Future Work
25
Conclusions and Future Work: Stratifying for Equal Distribution
  • Not worth it and perhaps harmful
  • Resulting small sample size may be to blame
  • Could sample from full training set
  • Other sampling approaches could be used
  • Weight variable not necessarily meaningful

26
Conclusions and Future Work: Specific Models
  • Algorithms performed comparably to each other
  • ANN may need more hidden nodes
  • Hybrid model improved the ANN's performance slightly, but not by much
  • NN may owe some of its performance to the tie-breaker implementation
  • Naïve Bayesian was not a standout, as might be expected
  • Could run an A Priori search first

27
Conclusions and Future Work: Stacked Generalization in General
  • Somewhat, though not drastically, better performance
  • Possible ways to improve performance:
    • Cross-validation could improve both performance and evaluation
    • Use posterior probabilities instead of actual predictions (see the sketch below)
    • Try different algorithms
    • Continue stacking on more levels (level-2, level-3, etc.)
  • Apply Stacked Generalization to the actual KDD Cup task
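
A minimal sketch of the cross-validated, posterior-probability variant suggested above, using scikit-learn utilities as assumed stand-ins for the original tools:

import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

def stacked_cv(X, y, seed=0):
    level0 = [DecisionTreeClassifier(random_state=seed),
              MLPClassifier(hidden_layer_sizes=(1,), max_iter=2000,
                            random_state=seed)]
    # Out-of-fold posterior probabilities replace the single train/validation
    # split and give the level-1 generalizer graded evidence, not hard labels.
    Z = np.hstack([cross_val_predict(m, X, y, cv=5, method="predict_proba")
                   for m in level0])
    level0 = [m.fit(X, y) for m in level0]   # refit level-0 models on all data
    generalizer = GaussianNB().fit(Z, y)
    return level0, generalizer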

28
References
  • Page, D. (2001). KDD Cup 2001. Website located at http://www.cs.wisc.edu/~dpage/kddcup2001/.
  • Ting, K.M., & Witten, I.H. (1997). Stacked generalization: when does it work? Proceedings of the International Joint Conference on Artificial Intelligence, Japan, 866-871.
  • Witten, I.H., & Frank, E. (2000). Data Mining. San Francisco: Morgan Kaufmann.
  • Wolpert, D.H. (1992). Stacked Generalization. Neural Networks, 5, 241-259.