Title: Recursive Partitioning for Tumor Classification with Gene Expression Microarray Data
1 Recursive Partitioning for Tumor Classification
with Gene Expression Microarray Data
- Heping Zhang, Chang-Yung Yu,
- Burton Singer, Momiao Xiong
- Presented by Weihua Huang
2 Data used in the article
Expression profiles of 2,000 genes measured with an Affymetrix oligonucleotide array in 22 normal and 40 colon cancer tissues.
The response is binary, indicating normal or cancer tissue, and the predictor variables are the 2,000 genes.
3 Classification Tree Using Recursive Partitioning
Goal: To partition the feature space into disjoint regions by growing a tree so that the tissues in the same region are homogeneous in terms of response.
Algorithm: Start with a root node containing the study sample and split it into smaller and smaller nodes according to whether a particular selected predictor is above a chosen cutoff value. At each splitting step, the selected predictor and its corresponding cutoff are chosen to maximize the reduction in node impurity
$$ \Delta I = P(A)\,I(A) - P(A_L)\,I(A_L) - P(A_R)\,I(A_R), $$
where A is the node being split, A_L and A_R are its left and right daughter nodes, P(.) is the probability that a tissue falls into the node, and I(.) is the node impurity.
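The splitting step described above can be sketched in Python for a single predictor. This is a minimal illustration, not the authors' implementation: the function names, the use of midpoints between observed values as candidate cutoffs, and the toy data are all assumptions. Node probabilities are taken relative to the node being split, so P(A) = 1.

```python
import math

def entropy(p):
    """Entropy impurity of a node in which a fraction p of tissues is normal."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

def node_impurity(labels):
    """Impurity of a node given 0/1 labels (1 = normal, 0 = cancer)."""
    if not labels:
        return 0.0
    return entropy(sum(labels) / len(labels))

def best_split(values, labels):
    """Return (cutoff, impurity_reduction) for one predictor.

    Candidate cutoffs are midpoints between consecutive distinct values
    (an assumption made for this sketch).
    """
    n = len(values)
    parent = node_impurity(labels)
    best_cutoff, best_gain = None, 0.0
    distinct = sorted(set(values))
    for lo, hi in zip(distinct, distinct[1:]):
        cutoff = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= cutoff]
        right = [y for x, y in zip(values, labels) if x > cutoff]
        # Reduction in impurity: I(A) - P(A_L) I(A_L) - P(A_R) I(A_R)
        gain = (parent
                - len(left) / n * node_impurity(left)
                - len(right) / n * node_impurity(right))
        if gain > best_gain:
            best_cutoff, best_gain = cutoff, gain
    return best_cutoff, best_gain

# Hypothetical toy data: expression of one gene in six tissues.
expression = [10.0, 12.0, 15.0, 40.0, 42.0, 50.0]
tissue = [1, 1, 1, 0, 0, 0]  # 1 = normal, 0 = cancer
cutoff, gain = best_split(expression, tissue)
```

Growing a full tree would repeat this search over all 2,000 predictors at every node and keep the predictor and cutoff with the largest reduction.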
4 Classification Tree Using Recursive Partitioning
Node impurity: One example of a node impurity measure is the entropy function
$$ I(A) = -P \log(P) - (1-P) \log(1-P), $$
where P is the probability of a tissue being normal within the node.
- Minimum impurity (= 0): when all tissues are of the same type within the node (P = 0 or 1)
- Maximum impurity (= log 2): when half of the tissues within the node are normal and half are cancer (P = 0.5)
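As a quick numerical check of the entropy impurity, the sketch below evaluates it at the two extremes and at the root node of this data set (22 normal tissues out of 62). The function name is an assumption, and the natural logarithm is used so that the maximum equals log 2.

```python
import math

def entropy_impurity(p):
    """Entropy impurity -P log(P) - (1-P) log(1-P), with 0 log 0 taken as 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

pure = entropy_impurity(1.0)        # all tissues of one type
balanced = entropy_impurity(0.5)    # half normal, half cancer
root = entropy_impurity(22 / 62)    # root node: 22 normal, 40 cancer
```

The root node impurity comes out at roughly 0.65, a little below the maximum of log 2 (about 0.69), because the two tissue types are not perfectly balanced.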
5 Results From Classification Tree on the Data
Fig. 1. Classification tree for tissue types by using expression data from three genes (M26383, R15447, M28214)
6 Another Way to Visualize the Recursive Partitioning
Fig. 3. A scatterplot of expression data from R15447 and M28214 for a subset of tissues (node 3 in Fig. 1).
7 Results from Recursive Partitioning
- Quality of the tree-based classification
- Assessed by the localized 5-fold cross-validation error rate
- "Localized": the same genes are kept at the same nodes, and only the cutoff values are re-estimated
- Randomly divide the 40 cancer tissues into 5 subsamples of 8, and the 22 normal tissues into 5 subsamples of 4, 4, 4, 5, and 5. Four subsamples each from the cancer and normal tissues are used to choose the cutoff values for the three splits. The remaining samples are used to count the tissues misclassified under the new cutoff values.
- The error rate is between 6% and 8% from two runs of cross-validation, which is much better than that obtained by existing analyses.
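The cross-validation procedure above can be sketched as follows. This is a simplified illustration under several assumptions: it handles a single split on one fixed gene rather than the three splits of the tree, it chooses the cutoff by training misclassification count rather than by impurity, it assumes higher expression indicates cancer, and the expression values are synthetic. All function names are hypothetical.

```python
import random

def stratified_folds(n_cancer=40, n_normal=22, k=5, seed=0):
    """Split tissue indices into k folds, separately for each tissue type.

    Returns a list of k (cancer_indices, normal_indices) pairs.
    """
    rng = random.Random(seed)
    cancer = list(range(n_cancer))
    normal = list(range(n_normal))
    rng.shuffle(cancer)
    rng.shuffle(normal)
    # Striding gives 5 folds of 8 for 40 tissues and folds of 5,5,4,4,4 for 22.
    return [(cancer[i::k], normal[i::k]) for i in range(k)]

def choose_cutoff(values, labels):
    """Cutoff with the fewest training errors (normal = 1 if x <= cutoff)."""
    distinct = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    def training_errors(c):
        return sum((x > c) == (y == 1) for x, y in zip(values, labels))
    return min(candidates, key=training_errors)

def localized_cv_error(x_cancer, x_normal, k=5, seed=0):
    """Cross-validated error rate with the gene fixed and the cutoff refit."""
    folds = stratified_folds(len(x_cancer), len(x_normal), k, seed)
    errors = 0
    for held_out in range(k):
        train_x, train_y = [], []
        for j, (c_idx, n_idx) in enumerate(folds):
            if j == held_out:
                continue
            train_x += [x_cancer[i] for i in c_idx] + [x_normal[i] for i in n_idx]
            train_y += [0] * len(c_idx) + [1] * len(n_idx)
        cutoff = choose_cutoff(train_x, train_y)
        c_idx, n_idx = folds[held_out]
        errors += sum(x_cancer[i] <= cutoff for i in c_idx)  # cancer called normal
        errors += sum(x_normal[i] > cutoff for i in n_idx)   # normal called cancer
    return errors / (len(x_cancer) + len(x_normal))

# Synthetic, perfectly separable values used only to exercise the code.
x_cancer = [100.0 + i for i in range(40)]
x_normal = [float(i) for i in range(22)]
error_rate = localized_cv_error(x_cancer, x_normal)
```

Because the gene is fixed in advance, this estimate reflects only the uncertainty in the cutoff values, not the uncertainty in gene selection.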
8 Correlation Analysis on Genes
- Functional expressions from various genes are correlated.
- Examine the correlation patterns of the three selected genes in Fig. 1.
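One way to examine such correlation patterns is to rank the remaining genes by the strength of their correlation with a selected gene. The sketch below assumes the Pearson correlation coefficient, which the slides do not specify; the function names, gene labels, and expression values are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def most_correlated(target, others, top=2):
    """Rank genes in `others` (name -> expression list) by |r| with `target`."""
    scored = [(name, pearson(target, values)) for name, values in others.items()]
    return sorted(scored, key=lambda item: abs(item[1]), reverse=True)[:top]

# Hypothetical expression values across five tissues.
selected_gene = [1.0, 2.0, 3.0, 4.0, 5.0]
remaining = {
    "gene_a": [2.0, 4.0, 6.0, 8.0, 10.0],
    "gene_b": [5.0, 3.0, 4.0, 1.0, 2.0],
    "gene_c": [1.0, 3.0, 2.0, 3.0, 1.0],
}
ranked = most_correlated(selected_gene, remaining)
```

Ranking by absolute value keeps strongly negative correlations, which can be as informative as positive ones when looking for genes that could stand in for a selected predictor.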
9 Correlation Between the Three Selected Genes and
the Remaining Expression Data
10 Another Tree Based on a Different Set of Three Genes
Fig. 6. Classification tree for tissue types using expression data from three genes (R87126, T62947, X15183)
11 Correlation Matrix Among Genes in Fig. 1 and Fig. 6
12 Advantages of the Classification Tree
1. Efficient with a large number of genes
2. Automatically selects valuable and user-friendly genes as predictors
3. More precise than some other classification methods, such as support vector machines and linear discriminant analysis
13 Conclusions
1. It is likely that the information contained in a large number of genes can be captured by a small optimal set of genes without significant loss of information.
2. The precision of classification by recursive partitioning is important for clinical application.