Title: Application of Stacked Generalization to a Protein Localization Prediction Task
Slide 1: Application of Stacked Generalization to a Protein Localization Prediction Task
- Melissa K. Carroll, M.S. and Sung-Hyuk Cha, Ph.D.
- Pace University, School of Computer Science and Information Systems
- September 27, 2003
Slide 2: Overview
- Introduction
- Purpose
- Methods
- Algorithms
- Results
- Conclusions and Future Work
Slide 3: Introduction
Slide 4: Introduction - Data Mining
- Application of machine learning algorithms to large databases
- Often used to classify future data based on a training set
- The target variable is the variable to be predicted
- Theoretically, algorithms are context-independent
Slide 5: Introduction - Stacked Generalization
- Method for combining models
- Part of the training set is used to train the level-0, or base, models as usual
- Level-1 data are built from the level-0 models' predictions on the remainder of the set
- Level-1 generalizers are models trained on the level-1 data (see the sketch below)
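A minimal sketch of the stacking idea, using scikit-learn models and the iris data purely as stand-ins for the project's actual algorithms and gene data:

```python
# Level-0 (base) models are trained on one part of the training set; their
# predictions on the held-out remainder become the level-1 data, on which a
# level-1 generalizer is trained.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X_l0, X_l1, y_l0, y_l1 = train_test_split(X, y, test_size=0.5, random_state=0)

level0 = [DecisionTreeClassifier(random_state=0),
          MLPClassifier(max_iter=2000, random_state=0)]
for model in level0:
    model.fit(X_l0, y_l0)                      # train base models as usual

# Level-1 data: one column of predictions per level-0 model.
Z = np.column_stack([model.predict(X_l1) for model in level0])
generalizer = LogisticRegression(max_iter=1000).fit(Z, y_l1)

def stacked_predict(X_new):
    Z_new = np.column_stack([model.predict(X_new) for model in level0])
    return generalizer.predict(Z_new)
```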
Slide 6: Introduction - Bioinformatics and Protein Localization
- Bioinformatics: the application of computing to molecular biology
- Currently much interest in information about proteins
- Expression of proteins localized in a particular type or part of cell (localization)
- Knowledge of protein localization can shed light on a protein's function
- Data mining employed to predict localization from a database of information about the encoding genes
Slide 7: Introduction - KDD Cup 2001 Task
- KDD Cup: annual data mining competition sponsored by ACM SIGKDD
- Participants use a training set to predict target variable values in a test dataset of different instances
- Winner is the most accurate model (correct predictions / total instances in the test set)
- 2001 task: predict the protein localization of genes
- Anonymized genes were the instances; information about the genes formed the attributes
- Datasets (including the revealed target values) used in this project
Slide 8: Purpose
- Use the Stacked Generalization approach on this task
- Compare inter-algorithm performance using level-0 models and level-1 generalizers
- Evaluate the strategy of equally distributing the target variable
Slide 9: Methods
Slide 10: Methods - Dataset Manipulations
- Reduce the number of input variables
- Reduce the number of potential target values to 3
- Separate the original training dataset into training and validation sets for stacking
- Eliminate effectively unary variables in the final training dataset (see the sketch below)
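A rough illustration of the last two manipulations in Python; the file name, target column ("localization"), split proportion, and 99% threshold for "effectively unary" are all assumptions, not the project's actual settings:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

genes = pd.read_csv("genes_training.csv")      # hypothetical file name

# Split the original training data into training and validation sets for stacking.
train_df, valid_df = train_test_split(genes, test_size=0.3, random_state=0)

# Drop effectively unary variables: columns whose most common value covers
# almost every instance carry essentially no signal.
def effectively_unary(col, threshold=0.99):
    return col.value_counts(normalize=True).iloc[0] >= threshold

drop_cols = [c for c in train_df.columns
             if c != "localization" and effectively_unary(train_df[c])]
train_df = train_df.drop(columns=drop_cols)
valid_df = valid_df.drop(columns=drop_cols)
```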
Slide 11: Table - Target Variable Distribution
Slide 12: Methods - Equally Distributed Approach
- A second training set created by stratified sampling to ensure equally distributed localizations (sketched below)
- Level-0 models trained on both the raw (unequally distributed) and the equally distributed training sets
- Separate level-1 data and level-1 generalizers derived from this dataset
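One plausible way to build the equally distributed set is to downsample every localization class to the size of the rarest class; the slides do not specify the exact procedure, so this pandas sketch is only illustrative:

```python
import pandas as pd

def equally_distributed(df, target="localization", random_state=0):
    # Downsample every class to the size of the rarest class so that all
    # localizations appear equally often.
    n_min = df[target].value_counts().min()
    return (df.groupby(target, group_keys=False)
              .apply(lambda g: g.sample(n=n_min, random_state=random_state)))

# equal_df = equally_distributed(train_df)
```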
Slide 13: Algorithms
Slide 14: Algorithms - Level-0 Artificial Neural Network (ANN)
- Fully connected feedforward network
- Input variables → dummy variables → 186 input nodes
- Target variable → dummy variables → 2 output nodes
- 1 hidden node
- Training based on the change in misclassification rate (see the sketch below)
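A scikit-learn approximation of the described network shape (the original was presumably built with a different tool; only the dummy coding and the single hidden node come from the slide, the rest is assumed):

```python
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Categorical inputs are dummy-coded (186 input nodes in the project) and fed to a
# fully connected network with a single hidden node; early stopping stands in for
# "training based on change in misclassification rate".
ann = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    MLPClassifier(hidden_layer_sizes=(1,), early_stopping=True,
                  max_iter=2000, random_state=0),
)
# ann.fit(X_train, y_train); ann.predict(X_valid)
```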
Slide 15: Algorithms - Level-0 Decision Tree
- Used a CHAID-like algorithm
- Chi-squared p-value splitting criterion, p < 0.2 (sketched below)
- Model selection based on the proportion of instances correctly classified
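A simplified sketch of the chi-squared splitting criterion: score each attribute's association with the target and keep those with p < 0.2 as candidate splits (full CHAID also merges categories, which is omitted here):

```python
import pandas as pd
from scipy.stats import chi2_contingency

def candidate_splits(df, target="localization", alpha=0.2):
    # For each attribute, test its association with the target; attributes with
    # p < alpha are candidate splits, listed best (smallest p) first, CHAID-style.
    results = {}
    for col in df.columns:
        if col == target:
            continue
        table = pd.crosstab(df[col], df[target])
        if table.shape[0] < 2:                 # attribute has a single value
            continue
        _, p, _, _ = chi2_contingency(table)
        if p < alpha:
            results[col] = p
    return sorted(results.items(), key=lambda kv: kv[1])
```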
Slide 16: Algorithms - Level-0 Nearest Neighbor (NN)
- Compare each instance between the two datasets
- Count the number of matching attributes
- Predict the target value of the instance matching on the greatest number of attributes
- Use relative frequency in the unequally distributed dataset to break ties (see the sketch below)
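A small pandas sketch of this matching-attributes classifier with the frequency-based tie-breaker; the target column name is assumed:

```python
import pandas as pd

def nn_predict(train_df, query, target="localization"):
    attrs = [c for c in train_df.columns if c != target]
    freq = train_df[target].value_counts(normalize=True)      # tie-breaker frequencies
    matches = (train_df[attrs] == query[attrs]).sum(axis=1)   # matching attributes per instance
    best = train_df.loc[matches == matches.max(), target]
    # Among the best-matching instances, prefer the most frequent localization.
    return max(best.unique(), key=lambda v: freq[v])

# prediction = nn_predict(train_df, valid_df.iloc[0])
```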
Slide 17: Algorithms - Level-0 Hybrid Decision Tree/ANN
- Difficult for an ANN to learn with too many variables
- A decision tree can be used as a feature selector
- Important variables are those used as branching criteria
- A new ANN is trained using only the important variables as inputs (sketched below)
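A sketch of the hybrid using scikit-learn stand-ins: the tree's non-zero feature importances identify the variables used as branching criteria, and a fresh ANN is trained on just those columns:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

def hybrid_fit(X, y):
    tree = DecisionTreeClassifier(random_state=0).fit(X, y)
    # Variables actually used as branching criteria have non-zero importance.
    important = np.flatnonzero(tree.feature_importances_ > 0)
    ann = MLPClassifier(hidden_layer_sizes=(1,), max_iter=2000,
                        random_state=0).fit(X[:, important], y)
    return important, ann

# important, ann = hybrid_fit(X_train, y_train)
# predictions = ann.predict(X_valid[:, important])
```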
Slide 18: Algorithms - Level-1 Generalizers
- ANN and Decision Tree
  - Designed and trained essentially the same as their level-0 counterparts
  - The ANN had 8 input nodes
- Naïve Bayesian model
  - Calculated the likelihood of each target value based on Bayes' rule
  - Predicted the value with the highest likelihood (see the sketch below)
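A from-scratch sketch of the naive Bayesian generalizer over the level-1 data: with the usual independence assumption, each class is scored as its prior times the product of per-attribute conditional probabilities, and the highest-scoring class is predicted (the add-one smoothing is an assumption):

```python
from collections import Counter, defaultdict

def nb_fit(Z, y):
    # Z: level-1 rows (tuples of level-0 predictions); y: true localizations.
    prior = Counter(y)                               # class counts for P(class)
    cond = defaultdict(Counter)                      # (attribute index, class) -> value counts
    for row, label in zip(Z, y):
        for i, value in enumerate(row):
            cond[(i, label)][value] += 1
    return prior, cond, len(y)

def nb_predict(row, prior, cond, n):
    def score(label):
        s = prior[label] / n                         # prior probability
        for i, value in enumerate(row):
            counts = cond[(i, label)]
            # add-one smoothed estimate of P(value | label)
            s *= (counts[value] + 1) / (sum(counts.values()) + len(counts) + 1)
        return s
    return max(prior, key=score)                     # class with highest likelihood
```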
Slide 19: Results
Slide 20: Results - Accuracy Rates
Slide 21: Results - Evaluation of Accuracy Rates
- Similar to the highest-performing KDD Cup models
- However, predictions were drawn from a much smaller pool of potential localizations
- Also not much better than simply predicting nucleus every time
- Still, the models had fewer input variables to work with
Slide 22: Level-1 Decision Tree Diagram
Slide 23: Results - Statistical Comparisons
- No significant inter-algorithm differences among the level-0 models
- The hybrid offered some improvement over the ANN alone
- Equal distribution usually resulted in slightly worse performance
- Stacked Generalization resulted in better performance, sometimes significantly so
Slide 24: Conclusions and Future Work
Slide 25: Conclusions and Future Work - Stratifying for Equal Distribution
- Not worth it, and perhaps harmful
- The resulting small sample size may be to blame
- Could sample from the full training set
- Other sampling approaches could be used
- A weight variable is not necessarily meaningful
Slide 26: Conclusions and Future Work - Specific Models
- Algorithms performed comparably to each other
- The ANN may need more hidden nodes
- The hybrid model improved the ANN's performance slightly, but not by much
- The NN may owe some of its performance to the tie-breaker implementation
- Naïve Bayesian was not a standout, as might be expected
- Could run an A Priori search first
Slide 27: Conclusions and Future Work - Stacked Generalization in General
- Somewhat, though not drastically, better performance
- Possible ways to improve performance:
  - Cross-validation could improve both performance and evaluation
  - Use posterior probabilities instead of actual predictions (both sketched below)
  - Try different algorithms
  - Continue stacking on more levels (level-2, level-3, etc.)
- Apply Stacked Generalization to the actual KDD Cup task
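A sketch of the first two improvements combined, again with scikit-learn stand-ins: out-of-fold posterior probabilities from cross_val_predict form the level-1 data instead of hard predictions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

def cv_stack_fit(X, y, cv=5):
    level0 = [DecisionTreeClassifier(random_state=0),
              MLPClassifier(max_iter=2000, random_state=0)]
    # Out-of-fold class probabilities from each level-0 model form the level-1 data,
    # so every training instance contributes without leaking its own label.
    Z = np.hstack([cross_val_predict(m, X, y, cv=cv, method="predict_proba")
                   for m in level0])
    generalizer = LogisticRegression(max_iter=1000).fit(Z, y)
    for m in level0:                                  # refit base models on all data
        m.fit(X, y)
    return level0, generalizer

def cv_stack_predict(X_new, level0, generalizer):
    Z_new = np.hstack([m.predict_proba(X_new) for m in level0])
    return generalizer.predict(Z_new)
```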
Slide 28: References
- Page, D. (2001). KDD Cup 2001. Website located at http://www.cs.wisc.edu/dpage/kddcup2001/.
- Ting, K. M., & Witten, I. H. (1997). Stacked generalization: when does it work? Proc. International Joint Conference on Artificial Intelligence, Japan, 866-871.
- Witten, I. H., & Frank, E. (2000). Data Mining. Morgan Kaufmann, San Francisco.
- Wolpert, D. H. (1992). Stacked generalization. Neural Networks, 5, 241-259.