A Comparative Evaluation of Different Machine Learning Methods on Microarray Gene Expression Data

1
A Comparative Evaluation of Different Machine
Learning Methods on Microarray Gene Expression
Data
CS573 Class Project
  • Song Li

2
What Is a Microarray?
  • We want to know how genes are expressed in an
    individual (is this gene turned on?)
  • Microarray technology allows researchers to
    monitor the expression levels of thousands of
    genes simultaneously

3
Microarray Data
Each row represents a specific gene. Each column
represents an experiment (sample). Xij is the expression
intensity of gene i observed in sample j.
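As a small illustration of this layout, the sketch below stores a toy matrix with genes as rows and samples as columns. The gene names, sample names, and values are made up for the example and do not come from any dataset in this project.

```python
# Hypothetical toy expression matrix: rows are genes, columns are samples.
genes = ["gene_a", "gene_b", "gene_c"]
samples = ["sample_1", "sample_2"]
X = [
    [2.1, 0.3],  # gene_a measured in sample_1 and sample_2
    [0.0, 5.4],  # gene_b
    [1.7, 1.9],  # gene_c
]


def expression(X, i, j):
    """Return Xij: the intensity of gene i in sample j (0-based indices)."""
    return X[i][j]


def samples_as_rows(X):
    """Transpose the matrix so that each row is one sample."""
    return [list(column) for column in zip(*X)]
```

Most classifiers expect one row per sample, so the transposed form is usually the one passed to a learner.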
4
Classification of Microarray Data
  • A typical usage of microarray data is tumor
    classification in cancer research.

[Diagram: labeled microarray data are used for training (learning); the learned classifier then takes the data of one microarray experiment and answers: is it cancer? If yes, what type?]
5
Challenge
  • The "big p, small n" problem
  • Normally thousands of genes are tested in a
    single microarray experiment
  • The number of samples is relatively small (it is
    still an expensive experiment)
  • Many learning algorithms cannot handle this well
  • Dimension reduction is needed

6
Methods
  • Support Vector Machine
  • Linear regression based methods
  • Huang et al. [1] summarized several models as
    linear regression models
  • They also proposed PLS (Partial Least Squares)
    and PPLS (Penalized Partial Least Squares) based
    linear regression models
  • Clustering + Bayesian classification
  • Ji et al. [2] proposed a two-stage model using
    clustering and Bayesian classification
  • [1] X. Huang and W. Pan. Linear regression and
    two-class classification with gene expression
    data. Bioinformatics, 19: 2072-2078, 2003.
  • [2] X. Ji, K. Tsui, and K. Kim. A novel means of
    using gene clusters in a two-step empirical Bayes
    method for predicting classes of samples.
    Bioinformatics, 21: 1055-1061, 2005.

7
Evaluation of Existing Models
  • Only a very limited number of datasets have been
    used to test these newly proposed models
  • Huang's paper has results on Leukemia data and
    colon data. However, according to the paper, the
    colon data used is not publicly available
  • Ji's paper has results on an anonymous dataset
    and on Leukemia data
  • Helman et al. [3] studied the performance of a
    large number of microarray data classifiers, but
    the data is from the original papers
  • How about using WEKA as the evaluation platform?
  • [3] P. Helman, R. Veroff, S. Atlas, and
    C. Willman. A Bayesian network classification
    methodology for gene expression data. Journal of
    Computational Biology, 11: 581-615, 2004.

8
This Project
  • Collects microarray datasets and converts them
    into WEKA's .arff files
  • Implements:
  • The PLS and PPLS based linear regression models
    described in Huang's paper
  • The two-stage clustering + Bayesian method
    described in Ji's paper
  • Compares the results of these new methods with
    WEKA's SVM (SMO) implementation (2-class
    classification)
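The conversion step could look roughly like the sketch below, which writes one numeric attribute per gene plus a nominal class attribute in ARFF syntax. The function name `to_arff`, the relation name, and the toy values are hypothetical, and the sketch assumes gene names and labels contain no spaces or commas (otherwise ARFF requires quoting). It is not the project's actual converter.

```python
def to_arff(relation, gene_names, rows, labels, class_values):
    """Build ARFF text with one numeric attribute per gene and a nominal class.

    rows[j] holds the expression values of sample j (samples as rows),
    and labels[j] is the class label of that sample.
    """
    lines = ["@relation " + relation, ""]
    for name in gene_names:
        lines.append("@attribute " + name + " numeric")
    lines.append("@attribute class {" + ",".join(class_values) + "}")
    lines += ["", "@data"]
    for values, label in zip(rows, labels):
        if len(values) != len(gene_names):
            raise ValueError("row length does not match number of genes")
        lines.append(",".join(str(v) for v in values) + "," + label)
    return "\n".join(lines) + "\n"


# Hypothetical two-gene, two-sample example.
arff_text = to_arff(
    "toy_microarray",
    ["gene_a", "gene_b"],
    [[2.1, 0.0], [0.3, 5.4]],
    ["tumor", "normal"],
    ["tumor", "normal"],
)
```

Writing `arff_text` to a file with the .arff extension gives something WEKA's loaders should accept for a 2-class problem.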

9
Datasets
  • Source: caGEDA at the University of Pittsburgh
    (http://bioinformatics.upmc.edu/GE2/GEDA.html)

10
Model 1: Linear Regression
  • PLS
  • Define

11
Cont'd
12
Cont'd
13
Predictor
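As a rough illustration only, a PLS-style predictor for two-class labels might be sketched as below. This is a textbook one-component PLS regression on 0/1 labels with a 0.5 cut-off, chosen here as an assumption. It is not necessarily the formulation in Huang's paper, and it does not include the PPLS penalty. The function names and toy data are hypothetical.

```python
def fit_pls1(rows, y):
    """Fit a one-component PLS regression.

    rows[j] is the expression vector of sample j; y[j] is its 0/1 label.
    """
    n, p = len(rows), len(rows[0])
    x_mean = [sum(r[g] for r in rows) / n for g in range(p)]
    y_mean = sum(y) / n
    Xc = [[r[g] - x_mean[g] for g in range(p)] for r in rows]
    yc = [v - y_mean for v in y]
    # Weight vector: direction of covariance between each gene and the label.
    w = [sum(Xc[j][g] * yc[j] for j in range(n)) for g in range(p)]
    norm = sum(v * v for v in w) ** 0.5
    if norm == 0:
        raise ValueError("labels are uncorrelated with every gene")
    w = [v / norm for v in w]
    # Scores: projection of each centered sample onto the weight vector.
    t = [sum(a * b for a, b in zip(row, w)) for row in Xc]
    # Regress the centered label on the single score.
    q = sum(a * b for a, b in zip(t, yc)) / sum(a * a for a in t)
    return x_mean, y_mean, w, q


def predict_pls1(model, row, threshold=0.5):
    """Predict class 1 if the fitted value reaches the threshold, else 0."""
    x_mean, y_mean, w, q = model
    score = sum((v - m) * wi for v, m, wi in zip(row, x_mean, w))
    return 1 if y_mean + q * score >= threshold else 0


# Hypothetical toy data: four samples, two genes.
train_rows = [[1.0, 5.0], [1.2, 4.8], [3.0, 1.0], [3.2, 0.9]]
train_y = [0, 0, 1, 1]
model = fit_pls1(train_rows, train_y)
predictions = [predict_pls1(model, r) for r in train_rows]
```

With thousands of genes and few samples, projecting onto a handful of such components is what makes the regression tractable.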
14
PPLS
15
Model 2: Clustering + Bayesian
[Diagram: original data (p genes) → clustering → K meta-genes → Bayesian model]
  • K << p
  • My implementation uses k-means as the clustering method
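The clustering stage can be sketched as follows: plain k-means groups the genes (rows), and each cluster is then collapsed into one meta-gene. Taking the meta-gene as the mean of its member genes is an assumption made for this sketch, as are the function names and toy data. The Bayesian stage is not shown.

```python
import random


def kmeans_genes(X, k, iters=20, seed=0):
    """Cluster the rows of X (genes) into k groups with plain k-means.

    Returns a list giving the cluster index of each gene.
    """
    rng = random.Random(seed)
    centers = [list(row) for row in rng.sample(X, k)]
    assign = [0] * len(X)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        for i, row in enumerate(X):
            dists = [sum((a - b) ** 2 for a, b in zip(row, c)) for c in centers]
            assign[i] = dists.index(min(dists))
        # Update step: each center becomes the mean of its member genes.
        for c in range(k):
            members = [X[i] for i in range(len(X)) if assign[i] == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def meta_genes(X, assign, k):
    """Average the member genes of each cluster into one meta-gene row."""
    n_samples = len(X[0])
    result = []
    for c in range(k):
        members = [X[i] for i in range(len(X)) if assign[i] == c]
        if members:
            result.append([sum(col) / len(members) for col in zip(*members)])
        else:
            result.append([0.0] * n_samples)
    return result


# Hypothetical toy data: four genes measured in three samples.
X_toy = [[1.0, 1.1, 0.9], [1.2, 1.0, 1.0], [9.0, 9.2, 8.8], [9.1, 8.9, 9.0]]
assign_toy = kmeans_genes(X_toy, k=2)
meta_toy = meta_genes(X_toy, assign_toy, k=2)
```

The resulting K x n matrix replaces the original p x n matrix as input to the second stage, which is where the reduction from p genes to K meta-genes happens.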

16
Bayesian Model
17
Cont'd
18
Result
19
Discussion
  • These newly proposed methods offer performance
    comparable to WEKA's SVM-SMO
  • Parameter selection sometimes matters a lot, e.g.
    the value of K in the Bayesian method and the
    value of q in the PPLS method
  • Clustering thousands of genes is computationally
    intensive
  • Question: does the choice of clustering algorithm
    have an impact on performance?

20
Project Web Site
  • http://www.cs.iastate.edu/lisong/cs573