Gene selection for sample classification based on gene expression data: study of sensitivity to choi - PowerPoint PPT Presentation

1
Gene selection for sample classification based on
gene expression data: study of sensitivity to
choice of parameters of the GA/KNN method
  • L. Li, C.R. Weinberg, T.A. Darden,
    and L.G. Pedersen
  • Bioinformatics, 17(12), 1131-42, 2001.
  • Summarized by Kyu-Baek Hwang.

2
Abstract
  • Motivation
  • A multivariate approach to selecting genes in
    microarray data for sample classification
  • Lymphoma and colon datasets
  • Sensitivity, reproducibility, and stability to
    the choice of parameters
  • Methods
  • Genetic algorithms and k-nearest neighbor
  • Results
  • The method can capture the correlated structure in the data.
  • The suggested method is reproducible.
  • Gene selection is less robust than classification.

3
DNA Microarrays
  • Monitor thousands of gene expression levels
    simultaneously, in contrast to traditional
    one-gene experiments.
  • Fabricated by high-speed robotics.

Known probes
4
Comparative Hybridization Experiments
5
Previous Work
  • Sample classification
  • Tumor vs. normal, ALL vs. AML, etc.
  • Dimensionality problem
  • # of genes in a microarray >> # of instances in a
    dataset
  • For robust analysis, dimensionality reduction
    is often required.
  • Selection of the attributes
  • Pearson correlation (and its variants), mutual
    information, and so on.
  • The authors suggested a selection method based on
    GA and kNN.
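For contrast with the multivariate GA/kNN approach, a univariate ranking of the kind listed above can be sketched as follows. This is a minimal illustration, assuming genes are scored by the absolute Pearson correlation between their expression values and a 0/1 class label; the function names and the toy data are hypothetical, not taken from the paper.

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    # A constant gene (or constant label) has no defined correlation; score it 0.
    return cov / (sx * sy) if sx and sy else 0.0

def rank_genes_by_correlation(X, labels):
    """Return gene (column) indices sorted by |correlation with the label|."""
    n_genes = len(X[0])
    scores = [abs(pearson([row[g] for row in X], labels)) for g in range(n_genes)]
    return sorted(range(n_genes), key=lambda g: -scores[g])

# Toy data: rows are samples, columns are genes (hypothetical values).
X_demo = [[1.0, 5.0], [2.0, 3.0], [8.0, 4.0], [9.0, 4.0]]
labels_demo = [0, 0, 1, 1]
ranking_demo = rank_genes_by_correlation(X_demo, labels_demo)
```

Each gene is scored on its own here, which is exactly the limitation a multivariate selection method is meant to address.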

6
In the Paper
  • The following are examined empirically on the lymphoma
    and colon microarray datasets:
  • Sensitivity of the method to the choice of
    algorithm parameters.
  • E.g. chromosome size in the GA
  • The patterns of gene selection and the
    classification reliability of the selected genes.
  • Reproducibility of the algorithm (random
    initializations of the GA)
  • Sensitivity of the method to the assignment of
    samples to the training set.

7
Methods
8
Datasets
  • Lymphoma
  • Germinal center B-like DLBCL (diffuse large
    B-cell lymphoma) vs. activated B-like DLBCL.
  • 4026 genes, 47 samples (24 germinal center, 23
    activated)
  • Training set of 34 samples → normalized.
  • log transformed (base 2).
  • Colon
  • Tumor tissue vs. normal tissue.
  • 2000 genes, 57 samples
  • Training set of 40 samples → normalized.
  • log transformed (base 2).
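The preprocessing steps named above can be sketched as follows. The slides do not say which normalization was used, so this example assumes per-gene standardization (zero mean, unit variance) with statistics estimated on the training samples only; the function names and values are hypothetical.

```python
import math
from statistics import mean, pstdev

def log2_transform(matrix):
    """Apply a base-2 logarithm to every (positive) expression value."""
    return [[math.log2(v) for v in row] for row in matrix]

def fit_gene_scaler(train):
    """Per-gene (mean, std) computed from training samples only."""
    columns = list(zip(*train))
    # Fall back to 1.0 for constant genes to avoid division by zero.
    return [(mean(c), pstdev(c) or 1.0) for c in columns]

def apply_gene_scaler(matrix, scaler):
    """Standardize each gene with the training-set statistics."""
    return [[(v - m) / s for v, (m, s) in zip(row, scaler)] for row in matrix]

# Toy data: rows are samples, columns are genes (hypothetical values).
raw = [[4.0, 16.0], [8.0, 64.0], [16.0, 256.0]]
logged = log2_transform(raw)
scaler = fit_gene_scaler(logged[:2])        # first two samples act as the training set
normalized = apply_gene_scaler(logged, scaler)
```

Fitting the scaler on the training samples and reusing it for the test samples keeps test information out of the preprocessing step.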

9
A Flowchart for Gene Selection
10
GA in the Paper
  • A chromosome consists of d distinct genes.
  • A population has 100 chromosomes.
  • Ten populations evolve in parallel.
  • The fitness (R2) is the ability to classify the
    training examples.
  • The best chromosome → next population.
  • The remaining 99 positions are selected probabilistically.
  • Mutation probabilities for each chromosome (1-5
    genes mutated)
  • 0.531, 0.25, 0.125, 0.0625, and 0.03125, respectively
    (random mutation)
  • Stopping criterion
  • At least one chromosome achieves the targeted
    fitness value.
  • Repeat until 10000 high-R2 chromosomes are obtained.
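The GA loop described on this slide can be sketched roughly as follows. This is not the authors' implementation: the leave-one-out kNN fitness with k = 3 and majority vote, the fitness-proportional sampling of the 99 non-elite slots, and all function names are assumptions made for illustration. Only the population structure, the elitism step, and the mutation probabilities come from the slide.

```python
import random

# Probabilities of mutating 1..5 genes, as listed on the slide. They are used
# as relative weights, so they need not sum to exactly 1.
MUTATION_COUNTS = [1, 2, 3, 4, 5]
MUTATION_WEIGHTS = [0.531, 0.25, 0.125, 0.0625, 0.03125]

def knn_loo_fitness(chromosome, X, y, k=3):
    """Leave-one-out kNN accuracy on the training set, using only the genes
    (column indices) in `chromosome`. Majority vote with k=3 is an assumption."""
    correct = 0
    for i, xi in enumerate(X):
        neighbors = sorted(
            (sum((xi[g] - xj[g]) ** 2 for g in chromosome), y[j])
            for j, xj in enumerate(X) if j != i
        )
        votes = [label for _, label in neighbors[:k]]
        if max(set(votes), key=votes.count) == y[i]:
            correct += 1
    return correct / len(X)

def mutate(chromosome, n_genes, rng):
    """Swap 1-5 genes for genes not yet on the chromosome (keeps genes distinct)."""
    chrom = list(chromosome)
    n_mut = rng.choices(MUTATION_COUNTS, weights=MUTATION_WEIGHTS)[0]
    for pos in rng.sample(range(len(chrom)), min(n_mut, len(chrom))):
        unused = [g for g in range(n_genes) if g not in chrom]
        if unused:
            chrom[pos] = rng.choice(unused)
    return chrom

def evolve(X, y, d, target, pop_size=100, max_generations=200, seed=0):
    """Evolve one population until some chromosome reaches `target` fitness."""
    rng = random.Random(seed)
    n_genes = len(X[0])
    population = [rng.sample(range(n_genes), d) for _ in range(pop_size)]
    for _ in range(max_generations):
        fitness = [knn_loo_fitness(c, X, y) for c in population]
        best = max(range(pop_size), key=fitness.__getitem__)
        if fitness[best] >= target:
            return population[best], fitness[best]
        # Elitism: the best chromosome is copied unchanged. The other
        # pop_size - 1 slots are drawn in proportion to fitness (assumed)
        # and then mutated.
        weights = [f + 1e-9 for f in fitness]
        parents = rng.choices(population, weights=weights, k=pop_size - 1)
        population = [list(population[best])] + [mutate(p, n_genes, rng) for p in parents]
    return None, None

# Toy data (hypothetical): gene 0 separates the classes, genes 1-5 are noise.
data_rng = random.Random(1)
y = [0, 0, 0, 0, 1, 1, 1, 1]
X = [[5.0 * label + 0.1 * i] + [data_rng.random() for _ in range(5)]
     for i, label in enumerate(y)]
chromosome, fitness = evolve(X, y, d=2, target=1.0, pop_size=30)
```

In the procedure on the slide, a run like this would be repeated (with ten populations evolving in parallel) until 10000 chromosomes reaching the target R2 have been collected.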

11
For the Analysis
  • Choice of d
  • 5, 10, 20, 30, 40, and 50
  • Two independent runs with different random seeds
  • Reassignments of training and test sets
  • Original assignment (first N samples)
  • Lymphoma: N = 34 of 47 (# of training samples)
  • Colon: N = 40 of 57 (# of training samples)
  • Random assignment (randomly chosen N samples)
  • Discrepant assignment (last N samples)

12
Results
13
Gene Selection
14
Sensitivity of gene selection to the choice of d
Gene selection with d = 5 and 10 differs from that with d = 20, 30, 40, and 50.
15
Classification Accuracy
At most 11 of 13 test samples are
correctly classified by the consensus rule. 100% are
classified correctly with the majority rule
(52-55, 68-79, and 122-131 genes for d = 40).
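The difference between the two decision rules can be sketched as follows. The slide does not define them, so this example assumes that the consensus rule requires all k nearest neighbors to agree and leaves the sample unclassified otherwise, while the majority rule needs only more than half of them. The function name and labels are hypothetical.

```python
from collections import Counter

def knn_vote(neighbor_labels, rule="majority"):
    """Combine the labels of the k nearest neighbors into one prediction.

    Returns None when the rule's agreement requirement is not met
    (the sample is then treated as unclassified).
    """
    label, count = Counter(neighbor_labels).most_common(1)[0]
    k = len(neighbor_labels)
    if rule == "consensus":
        return label if count == k else None
    if rule == "majority":
        return label if count > k / 2 else None
    raise ValueError(f"unknown rule: {rule}")

split_vote = ["GC", "GC", "activated"]
unanimous_vote = ["GC", "GC", "GC"]
```

Under these assumptions a 2-to-1 vote is classified by the majority rule but not by the consensus rule, which would explain why the consensus rule classifies fewer test samples correctly.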
16
Clustering of lymphoma data using 50 genes
17
Reproducibility
The proposed method is highly reproducible.
18
Stability
Approximately 25-37 of the top 50 genes obtained
using the training set from either the random
selection or the discrepant selection appeared in
the list of the original assignment.
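A stability comparison of this kind can be sketched as an overlap count between two ranked gene lists. The ranking step assumes that genes are ordered by how often they appear in the collected high-R2 chromosomes, which the slides do not state explicitly; the function names and toy chromosomes are hypothetical.

```python
from collections import Counter

def rank_genes_by_frequency(chromosomes):
    """Order genes by how often they occur across the selected chromosomes."""
    counts = Counter(g for chrom in chromosomes for g in chrom)
    # Ties are broken by gene index so the ranking is deterministic.
    return sorted(counts, key=lambda g: (-counts[g], g))

def top_gene_overlap(ranked_a, ranked_b, top_n=50):
    """Number of genes shared by the top_n entries of two rankings."""
    return len(set(ranked_a[:top_n]) & set(ranked_b[:top_n]))

# Toy chromosomes from two hypothetical training-set assignments.
run_original = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
run_random = [[3, 9], [3, 2], [9, 7]]
rank_original = rank_genes_by_frequency(run_original)
rank_random = rank_genes_by_frequency(run_random)
shared = top_gene_overlap(rank_original, rank_random, top_n=3)
```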
19
Classification of Lymphoma Test Samples
20
Classification of Colon Test Samples
21
Discussion
22
The Choice of d
  • In the colon and lymphoma datasets, d = 20-50
    seems appropriate for classification.
  • Similar systematic studies for the choices of d
    may be necessary on other data sets.

23
Choice of the termination R2
  • A slightly less stringent termination criterion
    (e.g. R2 = (M - 1)/M or (M - 2)/M, M = # of
    instances in training data) is favorable.
  • Computational efficiency and classification
    correctness.
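As a quick arithmetic check of the relaxed criterion, the thresholds can be computed for the training-set sizes used in the slides. This sketch assumes R2 is the fraction of training samples classified correctly, so (M - 1)/M tolerates one misclassified sample; the function name is hypothetical.

```python
def termination_threshold(n_train, allowed_errors=1):
    """Target R2 that tolerates `allowed_errors` misclassified training samples."""
    return (n_train - allowed_errors) / n_train

lymphoma_target = termination_threshold(34, allowed_errors=1)  # (M - 1)/M with M = 34
colon_target = termination_threshold(40, allowed_errors=2)     # (M - 2)/M with M = 40
```

For M = 34 the one-error threshold is about 0.97, and for M = 40 the two-error threshold is 0.95.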

24
Relationships between Gene Selection and
t-Statistics
  • A gene which is not differentially expressed may
    be important for sample distinction when
    considered with other genes (explaining away).
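A toy example makes this point concrete. The data below are an invented XOR-like pattern, not taken from the paper: each gene alone has a t-statistic of zero between the classes, yet the two genes together separate the classes perfectly under leave-one-out 1-nearest-neighbor classification. The function names are hypothetical.

```python
from statistics import mean, variance

def t_statistic(a, b):
    """Welch two-sample t-statistic for one gene across two classes."""
    return (mean(a) - mean(b)) / ((variance(a) / len(a) + variance(b) / len(b)) ** 0.5)

def loo_1nn_accuracy(X, y, genes):
    """Leave-one-out 1-NN accuracy using only the listed gene columns."""
    correct = 0
    for i, xi in enumerate(X):
        nearest = min(
            (j for j in range(len(X)) if j != i),
            key=lambda j: sum((xi[g] - X[j][g]) ** 2 for g in genes),
        )
        correct += y[nearest] == y[i]
    return correct / len(X)

# Class 0 clusters near (0, 0) and (1, 1); class 1 near (0, 1) and (1, 0).
X = [[0.0, 0.0], [0.1, 0.1], [1.0, 1.0], [0.9, 0.9],
     [0.0, 1.0], [0.1, 0.9], [1.0, 0.0], [0.9, 0.1]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

t_gene0 = t_statistic([r[0] for r in X[:4]], [r[0] for r in X[4:]])
t_gene1 = t_statistic([r[1] for r in X[:4]], [r[1] for r in X[4:]])
acc_gene0 = loo_1nn_accuracy(X, y, [0])
acc_gene1 = loo_1nn_accuracy(X, y, [1])
acc_both = loo_1nn_accuracy(X, y, [0, 1])
```

A univariate filter based on t-statistics would discard both genes here, while a method that evaluates genes jointly, as a GA/kNN search does, can still find the pair.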

25
Conclusions
  • Multivariate approach to selection of attributes
    for sample classification in sparse datasets.
  • GA and kNN based approach.