A Comparative Evaluation of Different Machine Learning Methods on Microarray Gene Expression Data

1
A Comparative Evaluation of Different Machine
Learning Methods on Microarray Gene Expression
Data
CS573 Class Project

Song Li

2
What Is a Microarray?

We want to know how genes are expressed in an
individual (is this gene turned on?)
Microarray technology allows researchers to
monitor the expression levels of thousands of
genes simultaneously

3
Microarray Data
Each row represents a specific gene. Each column
represents an experiment. X_ij is the expression
intensity of gene i observed in sample j.
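
As a minimal sketch of this layout, the matrix can be held as a list of rows, one per gene. The gene names, sample names, and intensity values below are made up for illustration and do not come from the project's datasets.

```python
# Toy expression matrix: rows are genes, columns are samples (experiments).
# All names and values here are hypothetical, for illustration only.
genes = ["gene_1", "gene_2", "gene_3"]
samples = ["sample_1", "sample_2"]
X = [
    [0.8, 2.1],  # gene_1 in sample_1, sample_2
    [1.5, 0.3],  # gene_2
    [0.0, 4.2],  # gene_3
]


def expression(i, j):
    """Return X_ij: the intensity of gene i in sample j (0-based indices)."""
    return X[i][j]


def sample_profile(j):
    """Return one column: the expression of every gene in sample j."""
    return [row[j] for row in X]


print(expression(2, 1))
print(sample_profile(0))
```

A classifier sees one such column (a sample profile) per instance, so the number of features equals the number of genes.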
4
Classification of Microarray Data

A typical usage of microarray data is tumor
classification in cancer research.

[Diagram: labeled microarray data are used for
training (learning); the trained classifier then
takes the data of one microarray experiment and
answers: is it cancer? If yes, what type?]
5
Challenge

The "big p, small n" problem
Normally, thousands of genes are tested in a
single microarray experiment
The number of samples is relatively small (it is
still an expensive experiment)
Many learning algorithms cannot handle this well
Dimension reduction is needed

6
Methods

Support Vector Machine
Linear regression based methods
Huang and Pan [1] summarized several models as
linear regression models
They also proposed PLS (Partial Least Squares)
and PPLS (Penalized Partial Least Squares) based
linear regression models
Clustering + Bayesian classification
Ji et al. [2] proposed a two-stage model using
clustering and Bayesian classification
[1] X. Huang and W. Pan. Linear regression and
two-class classification with gene expression
data. Bioinformatics, 19:2072-2078, 2003.
[2] X. Ji, K. Tsui, and K. Kim. A novel means of
using gene clusters in a two-step empirical Bayes
method for predicting classes of samples.
Bioinformatics, 21:1055-1061, 2005.

7
Evaluation of Existing Models

These newly proposed models have been tested on
only a very limited number of datasets
Huang and Pan's paper has results on the Leukemia
data and colon data. However, according to the
paper, the colon data used is not publicly available
Ji et al.'s paper has results on an anonymous
dataset and the Leukemia data
Helman et al. studied the performance of a large
number of microarray data classifiers, but the
data is from the original papers [3]
How about using WEKA as the evaluation platform?
[3] P. Helman, R. Veroff, S. Atlas, and
C. Willman. A Bayesian network classification
methodology for gene expression data. J. of
Computational Biology, 11:581-615, 2004.

8
This Project

Collects microarray datasets and converts them
into WEKA's .arff files
Implements
the PLS and PPLS based linear regression models
described in Huang and Pan's paper
the two-stage clustering + Bayesian method
described in Ji et al.'s paper
Compares the results of these new methods with
WEKA's SVM (SMO) implementation (2-class
classification)
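
The conversion to .arff can be sketched as follows. This is an illustrative sketch, not the project's converter: the function name `to_arff`, the relation name, and the toy values are assumptions. It assumes each sample becomes one ARFF instance, each gene one numeric attribute, and the class label the last (nominal) attribute. Gene names are assumed to contain no spaces or commas, since such names would need quoting in ARFF.

```python
def to_arff(relation, gene_names, X, labels):
    """Build ARFF text from a genes-by-samples matrix X.

    Each sample (column of X) becomes one ARFF instance and each gene one
    numeric attribute; the class label is the last, nominal attribute.
    """
    classes = sorted(set(labels))
    lines = ["@relation " + relation, ""]
    for name in gene_names:
        lines.append("@attribute " + name + " numeric")
    lines.append("@attribute class {" + ",".join(classes) + "}")
    lines += ["", "@data"]
    for j, label in enumerate(labels):
        values = [str(row[j]) for row in X]  # column j = sample j
        lines.append(",".join(values + [label]))
    return "\n".join(lines) + "\n"


# Hypothetical toy input: 2 genes measured in 2 samples.
arff_text = to_arff(
    "toy_microarray",
    ["g1", "g2"],
    [[0.8, 2.1], [1.5, 0.3]],
    ["tumor", "normal"],
)
print(arff_text)
```

Note the transposition: the slides store genes as rows, while WEKA expects one row per instance, so the writer emits one line per sample.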

9
Datasets

Source: caGEDA at University of Pittsburgh
(http://bioinformatics.upmc.edu/GE2/GEDA.html)

10
Model 1: Linear regression

PLS
Define

11
Cont'd
12
Cont'd
13
Predictor
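
The formulas on the PLS slides did not survive in this transcript. As a rough illustration only, the sketch below fits a generic one-component PLS regression (the textbook PLS1 first component) and thresholds the fitted value to get a class. It is not Huang and Pan's exact formulation and does not cover PPLS. The function names, the 0/1 label coding, the 0.5 threshold, and the toy data are all assumptions.

```python
def fit_pls1(samples, y):
    """Fit a one-component PLS regression.

    samples: list of n samples, each a list of p gene expression values.
    y: list of n class labels coded as 0 or 1 (an assumed coding).
    """
    n, p = len(samples), len(samples[0])
    x_mean = [sum(s[k] for s in samples) / n for k in range(p)]
    y_mean = sum(y) / n
    Xc = [[s[k] - x_mean[k] for k in range(p)] for s in samples]
    yc = [v - y_mean for v in y]
    # Weight vector proportional to Xc^T yc: covariance of each gene with y.
    w = [sum(Xc[i][k] * yc[i] for i in range(n)) for k in range(p)]
    norm = sum(v * v for v in w) ** 0.5  # assumed non-zero for this sketch
    w = [v / norm for v in w]
    # Scores t = Xc w, then least-squares regression of yc on the score.
    t = [sum(Xc[i][k] * w[k] for k in range(p)) for i in range(n)]
    q = sum(t[i] * yc[i] for i in range(n)) / sum(v * v for v in t)
    return x_mean, y_mean, w, q


def predict_pls1(model, sample, threshold=0.5):
    """Return (fitted value, predicted class) for one new sample."""
    x_mean, y_mean, w, q = model
    score = sum((sample[k] - x_mean[k]) * w[k] for k in range(len(w)))
    y_hat = y_mean + q * score
    return y_hat, int(y_hat > threshold)


# Hypothetical toy data: n = 4 samples, p = 5 genes (p > n).
train = [
    [1.0, 0.2, 3.0, 0.5, 1.1],
    [1.2, 0.1, 2.8, 0.4, 0.9],
    [3.0, 0.3, 1.0, 0.6, 1.0],
    [3.2, 0.2, 0.9, 0.5, 1.2],
]
labels = [0, 0, 1, 1]
model = fit_pls1(train, labels)
print(predict_pls1(model, [2.9, 0.2, 1.1, 0.5, 1.0]))
```

The point of the sketch is the dimension reduction: the p gene values of a sample are collapsed to a single score before any regression is fitted, which is what makes the approach usable when p is larger than n.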
14
PPLS
15
Model 2: Clustering + Bayesian
Original data (p genes) -> Clustering ->
K meta-genes -> Bayesian model

K << p
My implementation uses K-means as the clustering
method
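
The clustering stage can be sketched as below. The slides only state that K-means is used; the deterministic initialization from the first K genes, the squared Euclidean distance, and the choice of the per-sample cluster mean as the meta-gene value are assumptions made for this illustration, as are the function name and the toy matrix.

```python
def kmeans_metagenes(X, k, n_iter=20):
    """Cluster the rows of X (genes) into k groups with plain k-means.

    X: genes-by-samples matrix (list of p rows, each of length n).
    Returns (assignments, metagenes), where metagenes[c][j] is the mean
    expression in sample j of the genes assigned to cluster c.
    """
    # Assumption: deterministic start from the first k genes.
    centers = [list(row) for row in X[:k]]
    assign = None
    for _ in range(n_iter):
        # Assignment step: nearest center by squared Euclidean distance.
        new_assign = []
        for row in X:
            d = [sum((a - b) ** 2 for a, b in zip(row, c)) for c in centers]
            new_assign.append(d.index(min(d)))
        # Update step: each center becomes the mean profile of its genes.
        for c in range(k):
            members = [X[i] for i in range(len(X)) if new_assign[i] == c]
            if members:  # keep the old center if a cluster goes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_assign == assign:
            break  # assignments stopped changing
        assign = new_assign
    return assign, centers


# Hypothetical toy matrix: 6 genes (rows) measured in 3 samples (columns).
X = [
    [1.0, 1.1, 0.9],
    [5.0, 5.2, 4.8],
    [1.2, 0.9, 1.0],
    [5.1, 4.9, 5.0],
    [0.9, 1.0, 1.1],
    [4.9, 5.0, 5.1],
]
assignments, metagenes = kmeans_metagenes(X, k=2)
print(assignments)
print(metagenes)
```

The returned meta-gene matrix has K rows instead of p, so the second-stage classifier works with K features per sample.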

16
Bayesian Model
17
Cont'd
18
Result
19
Discussion

These newly proposed methods offer performance
comparable to WEKA's SVM-SMO
Parameter selection sometimes matters a lot, e.g.
the K value in the Bayesian method and the q value
in the PPLS method
Clustering thousands of genes is computationally
intensive
Question: does the choice of clustering algorithm
have an impact on performance?

20
Project Web Site