Title: Application of Stacked Generalization to a Protein Localization Prediction Task
Slide 1: Application of Stacked Generalization to a Protein Localization Prediction Task
- Melissa K. Carroll, M.S. and Sung-Hyuk Cha, Ph.D.
- Pace University, School of Computer Science and Information Systems
- September 27, 2003
Slide 2: Overview
- Introduction
- Purpose
- Methods
- Algorithms
- Results
- Conclusions and Future Work
Slide 3: Introduction
Slide 4: Introduction - Data Mining
- Application of machine learning algorithms to large databases
- Often used to classify future data based on a training set
- The target variable is the variable to be predicted
- Theoretically, algorithms are context-independent
Slide 5: Introduction - Stacked Generalization
- Method for combining models
- Part of the training set is used to train the level-0, or base, models as usual
- Level-1 data are built from the level-0 models' predictions on the remainder of the set
- Level-1 generalizers are models trained on the level-1 data (see the sketch below)
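A minimal sketch of the stacking idea, using scikit-learn models and the iris data purely as stand-ins for the project's actual algorithms and gene data:

```python
# Level-0 (base) models are trained on one part of the training set; their
# predictions on the held-out remainder become the level-1 data, on which a
# level-1 generalizer is trained.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X_l0, X_l1, y_l0, y_l1 = train_test_split(X, y, test_size=0.5, random_state=0)

level0 = [DecisionTreeClassifier(random_state=0),
          MLPClassifier(max_iter=2000, random_state=0)]
for model in level0:
    model.fit(X_l0, y_l0)                      # train base models as usual

# Level-1 data: one column of predictions per level-0 model.
Z = np.column_stack([model.predict(X_l1) for model in level0])
generalizer = LogisticRegression(max_iter=1000).fit(Z, y_l1)

def stacked_predict(X_new):
    Z_new = np.column_stack([model.predict(X_new) for model in level0])
    return generalizer.predict(Z_new)
```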
Slide 6: Introduction - Bioinformatics and Protein Localization
- Bioinformatics: the application of computing to molecular biology
- Currently much interest in information about proteins
- Expression of proteins localized in a particular type or part of cell (localization)
- Knowledge of protein localization can shed light on a protein's function
- Data mining employed to predict localization from a database of information about the encoding genes
Slide 7: Introduction - KDD Cup 2001 Task
- KDD Cup: annual data mining competition sponsored by ACM SIGKDD
- Participants use a training set to predict target variable values in a test dataset of different instances
- Winner is the most accurate model (correct predictions / total instances in the test set)
- 2001 task: predict the protein localization of genes
- Anonymized genes were the instances; information about the genes formed the attributes
- Datasets (including the revealed target values) used in this project
Slide 8: Purpose
- Use the Stacked Generalization approach on this task
- Compare inter-algorithm performance using level-0 models and level-1 generalizers
- Evaluate the strategy of equally distributing the target variable
Slide 9: Methods
Slide 10: Methods - Dataset Manipulations
- Reduce the number of input variables
- Reduce the number of potential target values to 3
- Separate the original training dataset into training and validation sets for stacking
- Eliminate effectively unary variables in the final training dataset (see the sketch below)
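A rough illustration of the last two manipulations in Python; the file name, target column ("localization"), split proportion, and 99% threshold for "effectively unary" are all assumptions, not the project's actual settings:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

genes = pd.read_csv("genes_training.csv")      # hypothetical file name

# Split the original training data into training and validation sets for stacking.
train_df, valid_df = train_test_split(genes, test_size=0.3, random_state=0)

# Drop effectively unary variables: columns whose most common value covers
# almost every instance carry essentially no signal.
def effectively_unary(col, threshold=0.99):
    return col.value_counts(normalize=True).iloc[0] >= threshold

drop_cols = [c for c in train_df.columns
             if c != "localization" and effectively_unary(train_df[c])]
train_df = train_df.drop(columns=drop_cols)
valid_df = valid_df.drop(columns=drop_cols)
```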
Slide 11: Table - Target Variable Distribution
Slide 12: Methods - Equally Distributed Approach
- A second training set created by stratified sampling to ensure equally distributed localizations (sketched below)
- Level-0 models trained on both the raw (unequally distributed) and the equally distributed training sets
- Separate level-1 data and level-1 generalizers derived from this dataset
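One plausible way to build the equally distributed set is to downsample every localization class to the size of the rarest class; the slides do not specify the exact procedure, so this pandas sketch is only illustrative:

```python
import pandas as pd

def equally_distributed(df, target="localization", random_state=0):
    # Downsample every class to the size of the rarest class so that all
    # localizations appear equally often.
    n_min = df[target].value_counts().min()
    return (df.groupby(target, group_keys=False)
              .apply(lambda g: g.sample(n=n_min, random_state=random_state)))

# equal_df = equally_distributed(train_df)
```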
Slide 13: Algorithms
Slide 14: Algorithms - Level-0 Artificial Neural Network (ANN)
- Fully connected feedforward network
- Input variables → dummy variables → 186 input nodes
- Target variable → dummy variables → 2 output nodes
- 1 hidden node
- Training based on the change in misclassification rate (see the sketch below)
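A scikit-learn approximation of the described network shape (the original was presumably built with a different tool; only the dummy coding and the single hidden node come from the slide, the rest is assumed):

```python
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Categorical inputs are dummy-coded (186 input nodes in the project) and fed to a
# fully connected network with a single hidden node; early stopping stands in for
# "training based on change in misclassification rate".
ann = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    MLPClassifier(hidden_layer_sizes=(1,), early_stopping=True,
                  max_iter=2000, random_state=0),
)
# ann.fit(X_train, y_train); ann.predict(X_valid)
```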
Slide 15: Algorithms - Level-0 Decision Tree
- Used a CHAID-like algorithm
- Chi-squared p-value splitting criterion, p < 0.2 (sketched below)
- Model selection based on the proportion of instances correctly classified
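A simplified sketch of the chi-squared splitting criterion: score each attribute's association with the target and keep those with p < 0.2 as candidate splits (full CHAID also merges categories, which is omitted here):

```python
import pandas as pd
from scipy.stats import chi2_contingency

def candidate_splits(df, target="localization", alpha=0.2):
    # For each attribute, test its association with the target; attributes with
    # p < alpha are candidate splits, listed best (smallest p) first, CHAID-style.
    results = {}
    for col in df.columns:
        if col == target:
            continue
        table = pd.crosstab(df[col], df[target])
        if table.shape[0] < 2:                 # attribute has a single value
            continue
        _, p, _, _ = chi2_contingency(table)
        if p < alpha:
            results[col] = p
    return sorted(results.items(), key=lambda kv: kv[1])
```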
Slide 16: Algorithms - Level-0 Nearest Neighbor (NN)
- Compare each instance between the two datasets
- Count the number of matching attributes
- Predict the target value of the instance matching on the greatest number of attributes
- Use relative frequency in the unequally distributed dataset to break ties (see the sketch below)
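A small pandas sketch of this matching-attributes classifier with the frequency-based tie-breaker; the target column name is assumed:

```python
import pandas as pd

def nn_predict(train_df, query, target="localization"):
    attrs = [c for c in train_df.columns if c != target]
    freq = train_df[target].value_counts(normalize=True)      # tie-breaker frequencies
    matches = (train_df[attrs] == query[attrs]).sum(axis=1)   # matching attributes per instance
    best = train_df.loc[matches == matches.max(), target]
    # Among the best-matching instances, prefer the most frequent localization.
    return max(best.unique(), key=lambda v: freq[v])

# prediction = nn_predict(train_df, valid_df.iloc[0])
```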
Slide 17: Algorithms - Level-0 Hybrid Decision Tree/ANN
- Difficult for an ANN to learn with too many variables
- A decision tree can be used as a feature selector
- Important variables are those used as branching criteria
- A new ANN is trained using only the important variables as inputs (sketched below)
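A sketch of the hybrid using scikit-learn stand-ins: the tree's non-zero feature importances identify the variables used as branching criteria, and a fresh ANN is trained on just those columns:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

def hybrid_fit(X, y):
    tree = DecisionTreeClassifier(random_state=0).fit(X, y)
    # Variables actually used as branching criteria have non-zero importance.
    important = np.flatnonzero(tree.feature_importances_ > 0)
    ann = MLPClassifier(hidden_layer_sizes=(1,), max_iter=2000,
                        random_state=0).fit(X[:, important], y)
    return important, ann

# important, ann = hybrid_fit(X_train, y_train)
# predictions = ann.predict(X_valid[:, important])
```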
Slide 18: Algorithms - Level-1 Generalizers
- ANN and Decision Tree
  - Designed and trained essentially the same as their level-0 counterparts
  - The ANN had 8 input nodes
- Naïve Bayesian model
  - Calculated the likelihood of each target value based on Bayes' rule
  - Predicted the value with the highest likelihood (see the sketch below)
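A from-scratch sketch of the naive Bayesian generalizer over the level-1 data: with the usual independence assumption, each class is scored as its prior times the product of per-attribute conditional probabilities, and the highest-scoring class is predicted (the add-one smoothing is an assumption):

```python
from collections import Counter, defaultdict

def nb_fit(Z, y):
    # Z: level-1 rows (tuples of level-0 predictions); y: true localizations.
    prior = Counter(y)                               # class counts for P(class)
    cond = defaultdict(Counter)                      # (attribute index, class) -> value counts
    for row, label in zip(Z, y):
        for i, value in enumerate(row):
            cond[(i, label)][value] += 1
    return prior, cond, len(y)

def nb_predict(row, prior, cond, n):
    def score(label):
        s = prior[label] / n                         # prior probability
        for i, value in enumerate(row):
            counts = cond[(i, label)]
            # add-one smoothed estimate of P(value | label)
            s *= (counts[value] + 1) / (sum(counts.values()) + len(counts) + 1)
        return s
    return max(prior, key=score)                     # class with highest likelihood
```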
Slide 19: Results
Slide 20: Results - Accuracy Rates
Slide 21: Results - Evaluation of Accuracy Rates
- Similar to the highest-performing KDD Cup models
- However, predictions were drawn from a much smaller pool of potential localizations
- Also not much better than simply predicting nucleus every time
- Still, the models had fewer input variables to work with
Slide 22: Level-1 Decision Tree Diagram
Slide 23: Results - Statistical Comparisons
- No significant inter-algorithm differences among the level-0 models
- The hybrid offered some improvement over the ANN alone
- Equal distribution usually resulted in slightly worse performance
- Stacked Generalization resulted in better performance, sometimes significantly so
Slide 24: Conclusions and Future Work
Slide 25: Conclusions and Future Work - Stratifying for Equal Distribution
- Not worth it, and perhaps harmful
- The resulting small sample size may be to blame
- Could sample from the full training set
- Other sampling approaches could be used
- A weight variable is not necessarily meaningful
Slide 26: Conclusions and Future Work - Specific Models
- Algorithms performed comparably to each other
- The ANN may need more hidden nodes
- The hybrid model improved the ANN's performance slightly, but not by much
- The NN may owe some of its performance to the tie-breaker implementation
- Naïve Bayesian was not a standout, as might be expected
- Could run an A Priori search first
Slide 27: Conclusions and Future Work - Stacked Generalization in General
- Somewhat, though not drastically, better performance
- Possible ways to improve performance:
  - Cross-validation could improve both performance and evaluation
  - Use posterior probabilities instead of actual predictions (both sketched below)
  - Try different algorithms
  - Continue stacking on more levels (level-2, level-3, etc.)
- Apply Stacked Generalization to the actual KDD Cup task
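A sketch of the first two improvements combined, again with scikit-learn stand-ins: out-of-fold posterior probabilities from cross_val_predict form the level-1 data instead of hard predictions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

def cv_stack_fit(X, y, cv=5):
    level0 = [DecisionTreeClassifier(random_state=0),
              MLPClassifier(max_iter=2000, random_state=0)]
    # Out-of-fold class probabilities from each level-0 model form the level-1 data,
    # so every training instance contributes without leaking its own label.
    Z = np.hstack([cross_val_predict(m, X, y, cv=cv, method="predict_proba")
                   for m in level0])
    generalizer = LogisticRegression(max_iter=1000).fit(Z, y)
    for m in level0:                                  # refit base models on all data
        m.fit(X, y)
    return level0, generalizer

def cv_stack_predict(X_new, level0, generalizer):
    Z_new = np.hstack([m.predict_proba(X_new) for m in level0])
    return generalizer.predict(Z_new)
```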
Slide 28: References
- Page, D. (2001). KDD Cup 2001. Website located at http://www.cs.wisc.edu/dpage/kddcup2001/.
- Ting, K. M., & Witten, I. H. (1997). Stacked generalization: when does it work? Proc. International Joint Conference on Artificial Intelligence, Japan, 866-871.
- Witten, I. H., & Frank, E. (2000). Data Mining. Morgan Kaufmann, San Francisco.
- Wolpert, D. H. (1992). Stacked generalization. Neural Networks, 5, 241-259.