Gene selection for sample classification based on gene expression data: study of sensitivity to choice of parameters of the GA/KNN method

1
Gene selection for sample classification based on
gene expression data: study of sensitivity to
choice of parameters of the GA/KNN method

L. Li, C.R. Weinberg, T.A. Darden,
and L.G. Pedersen
Bioinformatics, 17(12), 1131-42, 2001.
Summarized by Kyu-Baek Hwang.

2
Abstract

Motivation
A multivariate approach to selection of genes in
microarray for sample classification
Lymphoma and colon datasets
Sensitivity, reproducibility, and stability to
the choice of parameters
Methods
Genetic algorithms and k-nearest neighbor
Results
Can capture the correlated structure in the data.
Reproducibility of the suggested method.
Gene selection is less robust than classification.

3
DNA Microarrays

Monitor thousands of gene expression levels
simultaneously, in contrast to traditional one-gene
experiments.
Fabricated by high-speed robotics.

Known probes
4
Comparative Hybridization Experiments
5
Previous Work

Sample classification
Tumor vs. normal, ALL vs. AML, etc.
Dimensionality problem
# of genes in a microarray >> # of instances in a
dataset
For robust analysis, dimensionality reduction is
often required.
Selection of the attributes
Pearson correlation (and its variants), mutual
information, and so on.
The authors suggested a selection method based on
GA and kNN.

6
In the Paper

The following are examined empirically on the lymphoma
and colon microarray datasets.
Sensitivity of the method to the choice of
algorithm parameters.
E.g. chromosome size in GA
The patterns of gene selection and the
classification reliability of the selected genes.
Reproducibility of the algorithm (random
initializations of GA)
Sensitivity of the method to the assignment of
samples to the training set.

7
Methods
8
Datasets

Lymphoma
Germinal center B-like DLBCL (diffuse large
B-cell lymphoma) vs. activated B-like DLBCL.
4026 genes, 47 samples (24 germinal center, 23
activated)
Training set of 34 samples → normalized.
log transformed (base 2).
Colon
Tumor tissue vs. normal tissue.
2000 genes, 57 samples
Training set of 40 samples → normalized.
log transformed (base 2).

9
A Flowchart for Gene Selection
10
GA in the Paper

Chromosome consists of d distinct genes.
A population has 100 chromosomes.
Ten populations evolve in parallel.
The fitness (R²): the ability to classify the
training examples.
The best chromosome → next population.
The other 99 positions are selected probabilistically.
Mutation probability for each chromosome (1 to 5
genes)
0.531, 0.25, 0.125, 0.0625, and 0.03125 (random
mutation)
Stopping criterion
At least one chromosome achieves the targeted
fitness value.
Repeat until 10,000 high-R² chromosomes are obtained.
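
The two GA ingredients on this slide can be sketched as below. This is a minimal illustration, not the authors' code. It assumes the fitness is the leave-one-out fraction of training samples whose k = 3 nearest neighbours (Euclidean distance over the chromosome's genes) all share the sample's class, and that the five listed probabilities give the chance of mutating 1, 2, 3, 4, or 5 genes. The value 0.531 is taken to be a rounded 0.53125 so that the weights sum to 1. The names `knn_fitness`, `mutate`, and `MUTATION_WEIGHTS` are hypothetical.

```python
import random

# Assumed weights for mutating 1..5 genes; 0.53125 is our reading of the
# slide's 0.531 so that the five values sum to exactly 1.
MUTATION_WEIGHTS = [0.53125, 0.25, 0.125, 0.0625, 0.03125]


def knn_fitness(chromosome, X, y, k=3):
    """Leave-one-out fraction of samples whose k nearest neighbours
    (using only the genes in `chromosome`) all carry the sample's class."""
    correct = 0
    for i, xi in enumerate(X):
        dists = []
        for j, xj in enumerate(X):
            if j == i:
                continue  # leave the sample itself out
            d2 = sum((xi[g] - xj[g]) ** 2 for g in chromosome)
            dists.append((d2, y[j]))
        dists.sort(key=lambda t: t[0])
        neighbours = [label for _, label in dists[:k]]
        if all(label == y[i] for label in neighbours):
            correct += 1
    return correct / len(X)


def mutate(chromosome, n_genes, rng):
    """Replace 1 to 5 genes with genes not already in the chromosome,
    so the d genes stay distinct."""
    n_mut = rng.choices([1, 2, 3, 4, 5], weights=MUTATION_WEIGHTS)[0]
    n_mut = min(n_mut, len(chromosome))
    child = list(chromosome)
    used = set(child)
    unused = [g for g in range(n_genes) if g not in used]
    positions = rng.sample(range(len(child)), n_mut)
    for pos, new_gene in zip(positions, rng.sample(unused, n_mut)):
        child[pos] = new_gene
    return child
```

Here `X` is a list of per-sample expression vectors and `y` the class labels; `mutate` expects `n_genes` to exceed the chromosome size by at least five.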

11
For the Analysis

Choice of d
5, 10, 20, 30, 40, and 50
Two independent runs with different random seeds
Reassignments of training and test sets
Original assignment (first N samples)
Lymphoma: N = 34 of 47 (# of training samples)
Colon: N = 40 of 57 (# of training samples)
Random assignment (randomly chosen N samples)
Discrepant assignment (last N samples)
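
The three assignment schemes can be written down as index splits. This is a small sketch; the function name `split_indices` and the `seed` argument are hypothetical, and samples are assumed to be indexed in dataset order.

```python
import random


def split_indices(n_samples, n_train, scheme, seed=0):
    """Return (train, test) index lists for one assignment scheme."""
    order = list(range(n_samples))
    if scheme == "original":      # first N samples
        train = order[:n_train]
    elif scheme == "discrepant":  # last N samples
        train = order[-n_train:]
    elif scheme == "random":      # randomly chosen N samples
        train = sorted(random.Random(seed).sample(order, n_train))
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    in_train = set(train)
    test = [i for i in order if i not in in_train]
    return train, test
```

For the lymphoma data this would be called as `split_indices(47, 34, "original")`, leaving 13 test samples.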

12
Results
13
Gene Selection
14
Sensitivity of gene selection to the choice of d
Selections with d = 5 and 10 differ from those with d = 20, 30, 40, and 50.
15
Classification Accuracy
At most, 11 of 13 test samples are
correctly classified by the consensus rule. 100% are
classified correctly with the majority rule, using 52-55,
68-79, and 122-131 genes for d = 40.
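
The difference between the two voting rules can be sketched as follows. This assumes the consensus rule requires all k nearest neighbours to agree (otherwise the sample is left unclassified) and the majority rule takes the most frequent class among them. The function names are hypothetical.

```python
from collections import Counter


def consensus_vote(neighbour_labels):
    """Return the class only if every neighbour agrees, else None."""
    first = neighbour_labels[0]
    if all(label == first for label in neighbour_labels):
        return first
    return None


def majority_vote(neighbour_labels):
    """Return the most frequent class, or None on a tie."""
    counts = Counter(neighbour_labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]
```

Under these definitions the consensus rule is the stricter one, so it can leave test samples unclassified that the majority rule still assigns.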
16
Clustering of lymphoma data using 50 genes
17
Reproducibility
The proposed method is highly reproducible.
18
Stability
Approximately 25-37 of the top 50 genes obtained
using the training set from either the random
selection or the discrepant selection appeared in
the list of the original assignment.
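
The overlap figure above can be computed as sketched here. The sketch assumes genes are ranked by how often they appear in the collected high-R² chromosomes, which the slides do not state explicitly. `top_genes` and `overlap` are hypothetical names.

```python
from collections import Counter


def top_genes(chromosomes, n_top):
    """Rank genes by selection frequency (ties broken by gene index)."""
    counts = Counter(g for chrom in chromosomes for g in chrom)
    ranked = sorted(counts, key=lambda g: (-counts[g], g))
    return ranked[:n_top]


def overlap(genes_a, genes_b):
    """Number of genes shared by two top-gene lists."""
    return len(set(genes_a) & set(genes_b))
```

With `n_top = 50`, `overlap` applied to the lists from two training-set assignments gives the count reported on this slide.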
19
Classification of Lymphoma Test Samples
20
Classification of Colon Test Samples
21
Discussion
22
The Choice of d

In the colon and lymphoma datasets, d = 20 to 50
seems appropriate for classification.
Similar systematic studies of the choice of d
may be necessary on other datasets.

23
Choice of the termination R²

A slightly less stringent termination criterion
(e.g. R² = (M - 1)/M or (M - 2)/M, where M = # of
instances in the training data) is favorable.
Computational efficiency and classification
correctness.
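
As a worked example, the relaxed cutoffs can be evaluated for the two training-set sizes used here (M = 34 and M = 40). This assumes the garbled expressions read (M - 1)/M and (M - 2)/M, i.e. allowing one or two misclassified training samples. The function name is hypothetical.

```python
def termination_thresholds(m):
    """Target R² values for a training set of m samples."""
    return {
        "strict": 1.0,                    # every training sample correct
        "one_error": (m - 1) / m,         # one misclassification allowed
        "two_errors": (m - 2) / m,        # two misclassifications allowed
    }


for name, m in [("lymphoma", 34), ("colon", 40)]:
    t = termination_thresholds(m)
    print(name, round(t["one_error"], 4), round(t["two_errors"], 4))
```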

24
Relationships between Gene Selection and
t-Statistics