Title: Microarray Data Analysis for Gene Selection and Cancer Classification
- De-Shuang Huang
- http://www.intelengine.cn/
- Intelligent Computing Lab,
- Institute of Intelligent Machines, CAS, China
- University of Science and Technology of China
- September 25, 2005
Contents
- 1. Microarray Data Analysis and Microarray Technology
- 2. Gene Selection Method Based on Support Vectors and Penalty Strategy
- 3. Gene Selection Method Based on Gene Regulation Probability (GRP)
- 4. ICA for Cancer Classification Using Gene Expression Data
- 5. An MSA-HLA Based RBF Classifier for Cancer Classification
- 6. Conclusions
1. Microarray Data Analysis and Microarray Technology
- Microarray technology was developed in 1993 and is used to simultaneously detect the expression levels of hundreds of thousands of genes in living organisms.
- It is a landmark technology that acts as a sensitive in vivo sensor for clinical diagnosis.
- The tremendous amount of data produced by microarray technology presents a challenge for data analysis.
- Microarray data analysis is under active development and has become an important topic in bioinformatics.
- Two important issues in applying microarray technology to cancer research: gene selection and cancer classification.
Two means of manufacturing microarrays
- In situ synthesized oligonucleotide arrays
- By Affymetrix Inc.
- Pre-synthesized cDNA arrays
- By Patrick Brown's Lab at Stanford University
- Acquisition of gene expression data by microarray technology
- An image sample scanned by a laser scanner with two channels
The image is digitized into gene expression values, where each colored point represents a gene.
- Image analysis and synthesis through special software
- 1.1 Structure of microarray data
Description: a matrix with one row for each gene and one column for each condition (replicate).
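As a minimal illustration of this layout, the sketch below stores a tiny expression matrix as a list of rows (genes) by columns (conditions). The gene names, condition labels, and values are made up for illustration and do not come from any dataset discussed here.

```python
# Toy expression matrix: one row per gene, one column per condition.
# All names and values are hypothetical, for illustration only.
genes = ["gene_a", "gene_b", "gene_c"]
conditions = ["tumor_1", "tumor_2", "normal_1", "normal_2"]
matrix = [
    [2.1, 1.9, 0.4, 0.5],   # gene_a
    [0.3, 0.2, 0.3, 0.4],   # gene_b
    [1.0, 1.2, 2.8, 3.1],   # gene_c
]


def expression_profile(gene):
    """Return the row of expression values for one gene."""
    return matrix[genes.index(gene)]


def sample_profile(condition):
    """Return the column of expression values for one condition."""
    col = conditions.index(condition)
    return [row[col] for row in matrix]
```

With thousands of genes and only tens of samples, real matrices are far taller than they are wide, which is what makes gene selection necessary.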
- 1.2 Aims and applications of microarray data analysis
- Identify biologically specific genes
- Make prognoses and diagnose diseases
- Explore relations between genes or other biological factors
- Discover the rules governing alternative-splicing expression of genes
- Help to understand disease pathology
- Assist in studying gene regulation networks
- 1.3 Difficulties and complications in microarray data analysis
- High dimensionality: generally 5,000-15,000 genes
- Inherently very noisy
- High degree of variability
- The vast majority of variables are hidden
- Complicated relations between genes
- Complicated relations between phenotypes
- Complicated relations between genes and phenotypes
2. Gene Selection Based on Support Vectors and Penalty Strategy
- In the algorithm, a cross-validation procedure is performed on the datasets.
- For each validation sub-procedure, support vector machines (SVMs) are trained and tested.
- For each SVM, its support vectors are weighted and combined into the initial gene correlation degree (CD) with the class distinction.
- A penalty strategy is proposed to penalize the initial CDs, yielding penalized CDs that are used to produce a criterion for gene selection.
- Applications to the leukemia dataset and the colon dataset show that our algorithm can identify the key genes related to the class distinction and that it is competitive with previous methods.
- Step 1. Compute the original gene correlation
degree with the class distinction
where is the number of the support vectors
in set.
- Step 2. Compute the penalized correlation degree
- where
Step 3. For each gene, compute the composite
correlation degree
where is the number of the cross validation
procedures
Step 4. Rank all genes
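Steps 3 and 4 can be sketched as follows, assuming the composite degree is simply the mean of the penalized degrees over the cross-validation procedures. That averaging rule, the function names, and the scores are assumptions of this sketch rather than the published definition.

```python
def composite_degrees(penalized_runs):
    """Average each gene's penalized correlation degree over all runs.

    penalized_runs: list of K lists, one per cross-validation procedure,
    each holding one score per gene. Averaging is an assumption here.
    """
    k = len(penalized_runs)
    n_genes = len(penalized_runs[0])
    return [sum(run[g] for run in penalized_runs) / k for g in range(n_genes)]


def rank_genes(scores):
    """Return gene indices ordered from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda g: scores[g], reverse=True)


# Hypothetical scores for 4 genes over 3 cross-validation procedures.
runs = [
    [0.9, 0.1, 0.5, 0.3],
    [0.8, 0.2, 0.6, 0.1],
    [1.0, 0.0, 0.4, 0.2],
]
ranking = rank_genes(composite_degrees(runs))
```

The top entries of `ranking` would then be taken as the selected genes.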
Biotechnology Letters, vol. 27, no. 8, pp. 597-603, 2005.
- Experimental results
- The trends of the weight changes of 50 genes over 50 random cross-validation runs (leukemia dataset)
- The convergence process of the correlation degree (colon dataset, leukemia dataset)
3. Gene Selection Method Based on Gene Regulation Probability (GRP)
- In the method, gene regulation is defined statistically.
- The probabilities of gene regulation are estimated using probabilistic and statistical methods.
- The method can extract gene regulation information and be used for gene selection.
- Applications to the leukemia dataset and the colon dataset suggest that our proposed method is effective, efficient, and competitive with previous methods.
- Established gene regulation model
Commonly, we can preset
, where the cutoff coefficient
.
Definition 3. The regulation matrix, B, has the same representation form as the microarray data matrix, A, but its generic elements are determined by Definitions 1 and 2 and record the regulation state of gene g in tissue sample s.
For the up-regulation matrix, B,
For the down-regulation matrix, B-,
Define
Next, to estimate the regulation probabilities using probabilistic and statistical methods, assume that the regulation event, E, occurs with some probability under the background context, C, that is,
where x can be obtained from the regulation matrices, B and B-, i.e.,
The marginal probability of x can be computed as follows
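A simple frequency-based reading of this estimate can be sketched as follows. It assumes a binary up-regulation matrix is already available and estimates, for each gene, the fraction of samples in each class where the gene is flagged as up-regulated. The relative-frequency estimator, the matrix values, and the class labels are all assumptions of this sketch; the estimator used in the talk may differ.

```python
def regulation_frequency(reg_row, labels, target):
    """Fraction of samples with the given class label in which the gene
    is flagged as regulated (entry 1 in its regulation matrix row)."""
    flags = [r for r, lab in zip(reg_row, labels) if lab == target]
    return sum(flags) / len(flags)


# Hypothetical up-regulation matrix: rows are genes, columns are samples;
# 1 means "up-regulated in this sample", 0 means "not up-regulated".
B_up = [
    [1, 1, 1, 0, 0, 0],
    [0, 1, 0, 1, 0, 1],
]
labels = ["class_1", "class_1", "class_1", "class_2", "class_2", "class_2"]

# Per-gene difference in estimated regulation frequency between classes.
differences = [
    abs(regulation_frequency(row, labels, "class_1")
        - regulation_frequency(row, labels, "class_2"))
    for row in B_up
]
```

A gene whose regulation frequency differs strongly between the two classes, like the first row here, is a natural candidate for selection.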
- Experimental Results
- Gene distribution over the up- or down-regulation event
Cell (submitted).
- Fitting distribution of GRP
- Illustration of the distinction in expression levels of some of the selected genes
- Selected genes and their biological descriptions
- Significance test of the gene regulation probability difference
- Genes regulated under different significance levels
- Comparison between actual GRP and permuted GRP I
- Comparison between actual GRP and permuted GRP II
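One common way to compare an actual statistic with its permuted counterpart is a label-permutation test. The sketch below illustrates that general idea on a difference in regulation frequency between two classes; the statistic, the number of permutations, and the data are assumptions of this sketch and not necessarily the test used in the talk.

```python
import random


def freq_difference(flags, labels):
    """Absolute difference in regulation frequency between the two classes."""
    a = [f for f, lab in zip(flags, labels) if lab == 0]
    b = [f for f, lab in zip(flags, labels) if lab == 1]
    return abs(sum(a) / len(a) - sum(b) / len(b))


def permutation_p_value(flags, labels, n_perm=1000, seed=0):
    """Share of label permutations whose difference is at least as large
    as the observed one (with the usual +1 correction)."""
    rng = random.Random(seed)
    observed = freq_difference(flags, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # class sizes stay fixed, only labels move
        if freq_difference(flags, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Hypothetical gene: up-regulated in every class-0 sample, never in class 1.
flags = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
p = permutation_p_value(flags, labels)
```

A small `p` indicates that the observed difference is rarely matched once the class labels are shuffled.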
4. ICA for Cancer Classification Using Gene
Expression Data
A. Independent Component Analysis
- Mixing model
- Demixing model
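For reference, in the standard ICA formulation (the notation here is an assumption and may differ from the talk's), the two models are usually written as:

```latex
\begin{align}
\mathbf{x} &= \mathbf{A}\,\mathbf{s}, \\
\mathbf{y} &= \mathbf{W}\,\mathbf{x},
\end{align}
```

Here $\mathbf{x}$ is the observed data vector, $\mathbf{s}$ is the vector of statistically independent sources, $\mathbf{A}$ is the unknown mixing matrix (not the data matrix A of Section 3), and $\mathbf{W}$ is the demixing matrix, estimated so that $\mathbf{y}$ recovers $\mathbf{s}$ up to permutation and scaling.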
Neurocomputing (accepted)
B. The Independent Basis Snapshot Representation
C. Classifiers
- Leave-one-out cross-validation (LOO-CV)
- Accuracy
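Leave-one-out cross-validation and the resulting accuracy can be sketched as below. A 1-nearest-neighbor classifier is used purely as a stand-in, since the classifiers used in the talk are not listed here; the data and labels are hypothetical.

```python
def nearest_neighbor_predict(train_x, train_y, x):
    """Predict the label of x as the label of its closest training sample
    (squared Euclidean distance)."""
    def dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    best = min(range(len(train_x)), key=lambda i: dist(train_x[i], x))
    return train_y[best]


def loo_cv_accuracy(samples, labels, predict=nearest_neighbor_predict):
    """Hold out each sample once, train on the rest, and return the
    fraction of held-out samples classified correctly."""
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if predict(train_x, train_y, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)


# Hypothetical two-feature samples (e.g. two extracted components).
samples = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
           [1.0, 0.9], [0.9, 1.1], [1.1, 1.0]]
labels = ["normal", "normal", "normal", "tumor", "tumor", "tumor"]
accuracy = loo_cv_accuracy(samples, labels)
```

LOO-CV suits microarray studies because the number of tissue samples is small, so holding out only one sample at a time keeps almost all of the data for training.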
Experimental Results
A. Datasets
B1. Experiments
B2. Experiments
C. Comparison of Experimental Results for Different Methods
D. Comparison of Experimental Results for Different Methods
E. The Effects of the Number of Genes Used in ICA
5. An MSA-HLA Based RBF Classifier for Cancer Classification
- A modified simulated annealing (MSA) algorithm is developed and combined with the linear least squares paradigm to optimize the structure of the radial basis function classifier (RBF classifier).
- The optimized RBF classifier is applied to cancer classification.
- Experimental results show that the optimized RBF classifier is not only parsimonious but also has better generalization performance.
- Methods
- The modified SA (MSA)
- Simulated annealing (SA) is modified to search for the optimal number of RBF centers, and the resulting MSA uses the MSE of RBF classifiers as the evolving environment. The MSA algorithm is stated as follows:
Step 1. Initialize the state and the temperature of the system, and set the annealing schedule.
Step 2. At each T(t), repeat a predetermined number of times:
(i) Randomly produce a new state, , and compute and .
(ii) is rejected if and accepted by otherwise.
Step 3. Update T(t) by the annealing schedule; stop the process if the lowest temperature has been reached, and go to Step 2 otherwise.
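The overall loop can be sketched with the standard Metropolis acceptance rule, which is not necessarily the modified rule of MSA. The geometric cooling schedule, the parameter values, and the toy cost function standing in for "MSE as a function of the number of RBF centers" are all assumptions of this sketch.

```python
import math
import random


def simulated_annealing(cost, neighbor, state, t_start=1.0, t_min=1e-3,
                        cooling=0.9, steps_per_temp=20, seed=0):
    """Generic SA loop with the standard Metropolis acceptance rule."""
    rng = random.Random(seed)
    current, current_cost = state, cost(state)
    best, best_cost = current, current_cost
    t = t_start
    while t > t_min:                      # Step 3: stop at lowest temperature
        for _ in range(steps_per_temp):   # Step 2: repeat at each T(t)
            candidate = neighbor(current, rng)
            cand_cost = cost(candidate)
            delta = cand_cost - current_cost
            # Always accept improvements; accept worse states with
            # probability exp(-delta / t).
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                current, current_cost = candidate, cand_cost
                if current_cost < best_cost:
                    best, best_cost = current, current_cost
        t *= cooling                      # geometric annealing schedule
    return best, best_cost


# Hypothetical cost that is minimal at 7 centers.
def toy_cost(n_centers):
    return (n_centers - 7) ** 2 / 10.0


def step(n_centers, rng):
    """Propose one center more or one fewer, never dropping below 1."""
    return max(1, n_centers + rng.choice([-1, 1]))


best_n, best_cost = simulated_annealing(toy_cost, step, state=20)
```

In the setting described above, `toy_cost` would be replaced by a function that trains an RBF classifier with the given number of centers and returns its MSE.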
- The hybrid learning algorithm (HLA)
- HLA is employed to further optimize the spreads and centers, as well as the weights, of RBF classifiers based on the results from MSA. The HLA algorithm is stated as follows:
- Step 1. Weights are obtained by LS
- Step 2. Compute the MSE
- Step 3. Update centers
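Steps 1 and 2 can be sketched for a Gaussian RBF network as follows. The Gaussian basis, the shared spread, the normal-equation solver with a small ridge term, and the data are assumptions of this sketch; the center update of Step 3 is left out because its rule is not stated here.

```python
import math


def rbf_design(xs, centers, spread):
    """Gaussian RBF activations: one row per sample, one column per center."""
    return [[math.exp(-sum((a - c) ** 2 for a, c in zip(x, ctr))
                      / (2 * spread ** 2))
             for ctr in centers] for x in xs]


def solve(a, b):
    """Solve a w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    w = [0.0] * n
    for i in range(n - 1, -1, -1):
        w[i] = (m[i][n] - sum(m[i][j] * w[j] for j in range(i + 1, n))) / m[i][i]
    return w


def ls_weights(phi, targets, ridge=1e-8):
    """Step 1: least-squares weights from the normal equations.
    The small ridge term is an addition of this sketch for stability."""
    k = len(phi[0])
    gram = [[sum(row[i] * row[j] for row in phi) + (ridge if i == j else 0.0)
             for j in range(k)] for i in range(k)]
    rhs = [sum(row[i] * t for row, t in zip(phi, targets)) for i in range(k)]
    return solve(gram, rhs)


def mse(phi, weights, targets):
    """Step 2: mean squared error of the linear output layer."""
    preds = [sum(p * w for p, w in zip(row, weights)) for row in phi]
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)


# Hypothetical 1-D data: class 0 near x = 0, class 1 near x = 1.
xs = [[0.0], [0.1], [0.2], [0.9], [1.0], [1.1]]
targets = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
centers = [[0.1], [1.0]]
phi = rbf_design(xs, centers, spread=0.3)
weights = ls_weights(phi, targets)
error = mse(phi, weights, targets)
```

Because the output layer is linear in the weights, Step 1 has a closed-form solution once the centers and spreads are fixed, which is why only the centers and spreads need an iterative update.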
IEE Electronics Letters, vol. 41, no. 11, pp. 630-632, 2005.
- Experimental results
- Evolving curves for the numbers of centers by MSA
- Comparison of cancer recognition rates among different algorithms
Conclusions
- 1. Our studies show that it is indeed feasible to identify or classify genes by analyzing microarray data.
- 2. Microarray data analysis can be used to identify genes and pathways, reveal new targets for therapy, and predict the prognosis of individual cancer subtypes.
- 3. When more expression signatures of larger tumor sets become available, it will become clear how microarray data analysis will improve monitoring of the stages in which tumors grow and spread, and therefore prognosis.
- 4. More and better methods for analyzing microarray data need to be developed.
- 5. Many pattern recognition tools and machine learning methods can potentially be applied to microarray data.
- 6. Due to the inherent noise and variation in microarray data, it is necessary to develop probabilistic methods for extracting useful information.
- 7. At present, the most commonly used computational approach for analyzing microarray data is clustering analysis, e.g., hierarchical clustering, k-means, SOM, etc.
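As an illustration of the clustering approach mentioned in the last point, here is a minimal k-means sketch applied to gene expression profiles. The profiles, the number of clusters, and the fixed iteration count are assumptions chosen for illustration.

```python
import random


def kmeans(points, k, n_iter=20, seed=0):
    """Plain k-means: alternate assignment and centroid update."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = [0] * len(points)
    for _ in range(n_iter):
        # Assign each point to its nearest centroid.
        for i, p in enumerate(points):
            assignment[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


# Hypothetical expression profiles for six genes over two conditions.
profiles = [[0.1, 0.2], [0.0, 0.1], [0.2, 0.0],
            [2.0, 2.1], [2.2, 1.9], [1.9, 2.0]]
assignment, centroids = kmeans(profiles, k=2)
```

Genes that end up in the same cluster have similar expression profiles across the conditions, which is the usual starting point for interpreting co-expressed groups.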