Title: Gene selection for sample classification based on gene expression data: study of sensitivity to choi
1Gene selection for sample classification based on gene expression data: study of sensitivity to choice of parameters of the GA/KNN method
- L. Li, C.R. Weinberg, T.A. Darden, and L.G. Pedersen
- Bioinformatics, 17(12), 1131-42, 2001.
- Summarized by Kyu-Baek Hwang.
2Abstract
- Motivation
- A multivariate approach to selection of genes in microarray data for sample classification
- Lymphoma and colon datasets
- Sensitivity, reproducibility, and stability with respect to the choice of parameters
- Methods
- Genetic algorithms and k-nearest neighbor
- Results
- Can capture the correlated structure in the data.
- Reproducibility of the suggested method.
- Gene selection is less robust than classification.
3DNA Microarrays
- Monitor thousands of gene expression levels simultaneously, in contrast to traditional one-gene experiments.
- Fabricated by high-speed robotics.
- Known probes
4Comparative Hybridization Experiments
5Previous Work
- Sample classification
- Tumor vs. normal, ALL vs. AML, etc.
- Dimensionality problem
- # of genes in a microarray >> # of instances in a dataset
- For robust analysis, reducing the dimensionality is often required.
- Selection of the attributes
- Pearson correlation (and its variants), mutual information, and so on.
- The authors suggested a selection method based on GA and kNN.
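As a point of comparison for the univariate selection methods listed above, the following is a minimal sketch of ranking genes by the absolute Pearson correlation between each gene's expression and a 0/1 class label. The function names `pearson` and `rank_genes` and the toy matrix are illustrative assumptions, not taken from the paper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    # A constant sequence has no defined correlation; treat it as 0 here.
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0

def rank_genes(X, y):
    """Return gene indices sorted by decreasing |correlation| with the labels."""
    n_genes = len(X[0])
    scores = [abs(pearson([row[g] for row in X], y)) for g in range(n_genes)]
    return sorted(range(n_genes), key=lambda g: -scores[g])

# Toy data (hypothetical): 4 samples (rows) x 2 genes (columns).
X = [[1.0, 5.0], [2.0, 3.0], [8.0, 4.0], [9.0, 4.0]]
y = [0, 0, 1, 1]
ranking = rank_genes(X, y)
```

Each gene is scored on its own here, which is exactly the limitation a multivariate search such as GA/kNN is meant to address.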
6In the Paper
- The following are examined empirically on the lymphoma and colon microarray datasets.
- Sensitivity of the method to the choice of algorithm parameters.
- E.g. chromosome size in GA
- The patterns of gene selection and the classification reliability of the selected genes.
- Reproducibility of the algorithm (random initializations of GA)
- Sensitivity of the method to the assignment of samples to the training set.
7Methods
8Datasets
- Lymphoma
- Germinal center B-like DLBCL (diffuse large B-cell lymphoma) vs. activated B-like DLBCL.
- 4026 genes, 47 samples (24 germinal center, 23 activated)
- Training set of 34 samples → normalized.
- log transformed (base 2).
- Colon
- Tumor tissue vs. normal tissue.
- 2000 genes, 57 samples
- Training set of 40 samples → normalized.
- log transformed (base 2).
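The preprocessing steps named above (log base 2 transform and normalization) could look roughly like the sketch below. The slides do not say which normalization was used, so per-gene z-scoring is an assumption made here for illustration; the helper names are hypothetical as well.

```python
import math
from statistics import mean, pstdev

def log2_transform(matrix):
    """Apply log base 2 to every (positive) expression value."""
    return [[math.log2(v) for v in row] for row in matrix]

def standardize_genes(matrix):
    """Scale each gene (column) to zero mean and unit variance.

    Assumption: the slides only say 'normalized'; per-gene z-scoring is
    one common choice, not necessarily the authors' procedure.
    """
    out_cols = []
    for col in zip(*matrix):
        mu, sd = mean(col), pstdev(col)
        # A gene with no variation carries no information; map it to 0.
        out_cols.append([(v - mu) / sd if sd > 0 else 0.0 for v in col])
    return [list(row) for row in zip(*out_cols)]

# Toy data (hypothetical): 3 samples (rows) x 2 genes (columns).
raw = [[1.0, 8.0], [2.0, 8.0], [4.0, 8.0]]
logged = log2_transform(raw)
normalized = standardize_genes(logged)
```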
9A Flowchart for Gene Selection
10GA in the Paper
- Chromosome consists of d distinct genes.
- A population has 100 chromosomes.
- Ten populations evolve in parallel.
- The fitness (R^2): the ability to classify the training examples.
- The best chromosome → next population.
- 99 positions are selected probabilistically.
- Mutation probability for each chromosome (1-5 genes)
- 0.531, 0.25, 0.125, 0.0625, and 0.03125 (random mutation)
- Stopping criterion
- At least one chromosome achieves the targeted fitness value.
- Repeat until 10000 high-R^2 chromosomes are obtained.
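The loop described on this slide can be sketched for a single population as follows. Several details are assumptions rather than facts from the slides: fitness is computed as a leave-one-out 3-NN count in which all three neighbours must agree, the 99 non-elite positions are drawn with probability proportional to fitness, and the data must contain more genes than the chromosome size d. The ten parallel populations and the collection of 10000 chromosomes are left out.

```python
import random

MUTATION_COUNTS = [1, 2, 3, 4, 5]
MUTATION_WEIGHTS = [0.531, 0.25, 0.125, 0.0625, 0.03125]  # values from the slide

def knn_loo_fitness(chrom, X, y, k=3):
    """Count training samples whose k nearest neighbours (leave-one-out,
    Euclidean distance on the genes in `chrom`) all share the sample's class."""
    correct = 0
    for i in range(len(X)):
        dists = []
        for j in range(len(X)):
            if j != i:
                d2 = sum((X[i][g] - X[j][g]) ** 2 for g in chrom)
                dists.append((d2, y[j]))
        dists.sort(key=lambda t: t[0])
        if all(label == y[i] for _, label in dists[:k]):
            correct += 1
    return correct

def mutate(chrom, n_genes, rng):
    """Replace 1-5 genes with genes not already in the chromosome."""
    chrom = list(chrom)
    n_mut = min(rng.choices(MUTATION_COUNTS, weights=MUTATION_WEIGHTS)[0], len(chrom))
    for pos in rng.sample(range(len(chrom)), n_mut):
        candidates = [g for g in range(n_genes) if g not in chrom]
        chrom[pos] = rng.choice(candidates)  # assumes n_genes > len(chrom)
    return chrom

def evolve(X, y, d, target, pop_size=100, max_gen=200, seed=0):
    """Evolve one population until a chromosome reaches `target` fitness."""
    rng = random.Random(seed)
    n_genes = len(X[0])
    pop = [rng.sample(range(n_genes), d) for _ in range(pop_size)]
    for _ in range(max_gen):
        scores = [knn_loo_fitness(c, X, y) for c in pop]
        best = max(range(pop_size), key=scores.__getitem__)
        if scores[best] >= target:
            return pop[best], scores[best]
        # Elitism: keep the best chromosome; fill the rest by fitness-weighted
        # sampling (assumed scheme) followed by mutation.
        weights = [s + 1e-9 for s in scores]
        parents = rng.choices(pop, weights=weights, k=pop_size - 1)
        pop = [list(pop[best])] + [mutate(p, n_genes, rng) for p in parents]
    scores = [knn_loo_fitness(c, X, y) for c in pop]
    best = max(range(pop_size), key=scores.__getitem__)
    return pop[best], scores[best]

# Toy data (hypothetical): 8 samples x 4 genes; only gene 0 separates the classes.
X = [[0.0, 1.0, 3.0, 2.0], [0.1, 4.0, 1.0, 0.0], [0.2, 2.0, 4.0, 3.0],
     [0.3, 3.0, 0.0, 1.0], [5.0, 1.5, 2.5, 0.5], [5.1, 3.5, 0.5, 2.5],
     [5.2, 0.5, 3.5, 1.5], [5.3, 2.5, 1.5, 3.5]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
chrom, score = evolve(X, y, d=2, target=8, pop_size=20, max_gen=50)
```

The listed mutation weights sum to slightly less than 1 because the first value is rounded; `random.choices` accepts unnormalized weights, so no rescaling is needed in this sketch.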
11For the Analysis
- Choice of d
- 5, 10, 20, 30, 40, and 50
- Two independent runs with different random seeds
- Reassignments of training and test sets
- Original assignment (first N samples)
- Lymphoma: N = 34 of 47 (# of training samples)
- Colon: N = 40 of 57 (# of training samples)
- Random assignment (randomly chosen N samples)
- Discrepant assignment (last N samples)
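The three assignment schemes listed above can be expressed as index splits. This is a small sketch; the function name `split_indices` and the scheme labels are illustrative assumptions.

```python
import random

def split_indices(n_samples, n_train, scheme, seed=0):
    """Return (train, test) index lists for one assignment scheme.

    'original'   -> first n_train samples form the training set
    'discrepant' -> last n_train samples form the training set
    'random'     -> n_train samples chosen at random
    """
    indices = list(range(n_samples))
    if scheme == "original":
        train = indices[:n_train]
    elif scheme == "discrepant":
        train = indices[-n_train:]
    elif scheme == "random":
        train = sorted(random.Random(seed).sample(indices, n_train))
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    train_set = set(train)
    test = [i for i in indices if i not in train_set]
    return train, test

# Lymphoma sizes from the slide: N = 34 training samples out of 47.
orig_train, orig_test = split_indices(47, 34, "original")
disc_train, disc_test = split_indices(47, 34, "discrepant")
rand_train, rand_test = split_indices(47, 34, "random", seed=1)
```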
12Results
13Gene Selection
14Sensitivity of gene selection to the choice of d
Results for d = 5 and 10 differ from those for d = 20, 30, 40, and 50.
15Classification Accuracy
At the maximum, 11 of 13 test samples are correctly classified by the consensus rule. 100% are classified correctly with the majority rule.
- 52-55, 68-79, and 122-131 genes for d = 40.
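The slide contrasts a consensus rule with a majority rule without defining them. A plausible reading, assumed here, is that the consensus rule assigns a class only when all k nearest neighbours agree, while the majority rule needs only more than half of them; a sample that fails the rule is left unclassified. The function `knn_vote` and the class labels are hypothetical.

```python
from collections import Counter

def knn_vote(neighbor_labels, rule):
    """Turn the labels of the k nearest neighbours into a class call.

    Returns the winning label, or None when the rule is not satisfied
    (the sample is then treated as unclassified).
    """
    label, votes = Counter(neighbor_labels).most_common(1)[0]
    k = len(neighbor_labels)
    if rule == "consensus":
        return label if votes == k else None
    if rule == "majority":
        return label if votes > k / 2 else None
    raise ValueError(f"unknown rule: {rule}")

# With k = 3, a 2-to-1 split is unclassified under consensus
# but classified under majority.
split_consensus = knn_vote(["GC", "GC", "activated"], "consensus")
split_majority = knn_vote(["GC", "GC", "activated"], "majority")
```

Under this reading, the majority rule can only classify at least as many samples as the consensus rule, which fits the higher accuracy reported for it.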
16Clustering of lymphoma data using 50 genes
17Reproducibility
The proposed method is highly reproducible.
18Stability
Approximately 25-37 of the top 50 genes obtained
using the training set from either the random
selection or the discrepant selection appeared in
the list of the original assignment.
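The overlap figure quoted above can be computed by comparing two top-gene lists. The sketch below assumes genes are ranked by how often they appear among the selected chromosomes, which the slides do not state explicitly; `top_genes`, `overlap`, and the toy chromosomes are illustrative.

```python
from collections import Counter

def top_genes(chromosomes, n_top):
    """Rank genes by selection frequency (ties broken by gene index)."""
    counts = Counter(g for chrom in chromosomes for g in chrom)
    ranked = sorted(counts, key=lambda g: (-counts[g], g))
    return ranked[:n_top]

def overlap(list_a, list_b):
    """Number of genes shared by two top-gene lists."""
    return len(set(list_a) & set(list_b))

# Toy chromosomes (hypothetical) from two different training assignments.
run_original = [[1, 2, 3], [1, 2, 4], [1, 5, 6]]
run_random = [[2, 7, 8], [2, 7, 1], [9, 7, 2]]
top_a = top_genes(run_original, 3)
top_b = top_genes(run_random, 3)
shared = overlap(top_a, top_b)
```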
19Classification of Lymphoma Test Samples
20Classification of Colon Test Samples
21Discussion
22The Choice of d
- In the colon and lymphoma datasets, d = 20-50 seems appropriate for classification.
- Similar systematic studies of the choice of d may be necessary on other datasets.
23Choice of the termination R^2
- A slightly less stringent termination criterion (e.g. R^2 = (M - 1)/M or (M - 2)/M, where M = # of instances in the training data) is favorable.
- Computational efficiency and classification correctness.
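To make the relaxed criterion concrete, the following sketch evaluates (M - 1)/M and (M - 2)/M for the two training-set sizes given in these slides (M = 34 and M = 40). The helper `termination_target` is a hypothetical name introduced for illustration.

```python
from fractions import Fraction

def termination_target(n_train, allowed_misses):
    """R^2 cutoff as a fraction of training samples: (M - allowed_misses) / M."""
    if not 0 <= allowed_misses <= n_train:
        raise ValueError("allowed_misses must lie between 0 and n_train")
    return Fraction(n_train - allowed_misses, n_train)

for name, m in [("lymphoma", 34), ("colon", 40)]:
    for misses in (1, 2):
        target = termination_target(m, misses)
        print(f"{name}: {m - misses}/{m} correct -> R^2 cutoff {float(target):.3f}")
```

Allowing one or two misclassified training samples lets the search stop earlier, which is the efficiency side of the trade-off named on the slide.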
24Relationships between Gene Selection and t-Statistics
- A gene which is not differentially expressed may
be important for sample distinction when
considered with other genes (explaining away).
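A synthetic, XOR-like toy example (not data from the paper) illustrates the point: each gene below has identical class means, so its t-statistic is zero, yet the two genes together separate the classes perfectly under leave-one-out 1-NN. The helper names and the use of Welch's t-statistic are assumptions made for the sketch.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's two-sample t-statistic for one gene."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se if se > 0 else 0.0

def loo_1nn_correct(X, y, genes):
    """Leave-one-out 1-NN: number of samples whose nearest neighbour
    (Euclidean distance on `genes`) has the same class."""
    correct = 0
    for i in range(len(X)):
        nearest = min(
            (j for j in range(len(X)) if j != i),
            key=lambda j: sum((X[i][g] - X[j][g]) ** 2 for g in genes),
        )
        correct += y[nearest] == y[i]
    return correct

# Class 0 sits near (0,0) and (1,1); class 1 sits near (0,1) and (1,0).
X = [[0.0, 0.0], [0.1, 0.1], [1.0, 1.0], [0.9, 0.9],
     [0.0, 1.0], [0.1, 0.9], [1.0, 0.0], [0.9, 0.1]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

t_stats = [
    welch_t([X[i][g] for i in range(8) if y[i] == 0],
            [X[i][g] for i in range(8) if y[i] == 1])
    for g in (0, 1)
]
single_gene = loo_1nn_correct(X, y, [0])
both_genes = loo_1nn_correct(X, y, [0, 1])
```

A univariate filter based on t-statistics would discard both genes here, whereas a search over gene subsets can still find the pair.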
25Conclusions
- Multivariate approach to selection of attributes for sample classification in sparse datasets.
- GA and kNN based approach.