Title: An unsupervised conditional random fields approach for clustering gene expression time series
An unsupervised conditional random fields approach for clustering gene expression time series
- Chang-Tsun Li, Yinyin Yuan and Roland Wilson
- Bioinformatics, 2008
Outline
- Introduction
- Methods
- Results and Discussion
- Conclusions
Introduction
- Gene expression
- Gene expression time-series data
- Conditional random fields model
- Rand index
Gene expression
- Gene expression is the process by which heritable information from a gene (the DNA sequence) is made into a functional gene product (a protein or RNA).
- It is a temporal process, so we need to record time-course gene expression data.
Gene expression data
- For example, the Cho mitotic cell-cycle dataset: 17 time points of expression data for synchronized yeast cells.
- Expression levels are measured starting at time t = 10 minutes after release from cell-cycle arrest.
Gene expression data
- Co-expression indicates co-regulation or a relation in functional pathways.
- We need an efficient clustering method to identify co-expressed genes.
Conditional Random Fields
- A type of discriminative probabilistic model, most often used for labeling or parsing sequential data.
- Let X be a random variable over the observations and Y a random variable over the corresponding labels.
- A CRF models the conditional distribution of Y given X.
Rand index
- The Rand index is a measure of the similarity between two data clusterings.
- It takes a value between 0 and 1, with 0 indicating that the two clusterings do not agree on any pair of points and 1 indicating that the clusterings are exactly the same.
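The pair-agreement definition above can be written directly; this is a minimal sketch (note the paper actually reports the adjusted Rand index, which corrects the plain Rand index for chance agreement):

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of point pairs on which two clusterings agree.

    A pair agrees if both clusterings put it in the same cluster,
    or both put it in different clusters.
    """
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

# Identical partitions (up to label renaming) agree on every pair:
print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
```

Cluster labels only matter up to renaming, which is why the comparison is over pairs of points rather than over the labels themselves.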
Methods
- Two steps
- Data transformation
- Model formulation
- The proposed algorithm: the CRFs clustering algorithm
Data Transformation
- How are the gene expression time series transformed into a multi-dimensional feature space?
- Consider
- temporal correlation
- time intervals
Data Transformation
- Let n denote the number of time series, i.e. the number of genes.
- Each time series Ti of length t is mapped to a point xi, taking into account the temporal correlation and the length of each sampling interval.
- Each t-time-point time series is thus transformed into a (t-1)-dimensional feature vector.
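One plausible reading of this mapping, sketched below, takes the slope of the expression profile over each of the t-1 sampling intervals, so that both the direction of change and the interval lengths enter the features. The exact transform is given in the paper; the slope form here is an assumption of this sketch.

```python
def to_features(series, times):
    """Map a t-point time series to a (t-1)-dimensional feature
    vector of slopes between consecutive samples, so unevenly
    spaced time points are weighted by their interval lengths."""
    return [
        (series[j + 1] - series[j]) / (times[j + 1] - times[j])
        for j in range(len(series) - 1)
    ]

# A 3-point series sampled at 0, 10 and 30 minutes becomes a
# 2-dimensional feature vector:
print(to_features([1.0, 3.0, 2.0], [0, 10, 30]))  # [0.2, -0.05]
```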
Model formulation
- A set of n observed time series, X = {x1, x2, ..., xn}.
- A set of corresponding labels, Y = {y1, y2, ..., yn}.
- How do we assign an optimal class label yi to time series i conditioned on the observed data xi?
Model formulation
- For an unsupervised clustering method, we have two requirements:
- Do not use cluster centroids.
- Allow each time series to be a singleton cluster.
- Consider
- Voting pool
- Cost function
Voting pool
- When a time series i is visited, its voting pool Ni of size k is formed from:
- the s most similar (MS) time series,
- the 1 most different (MD) time series,
- and k - s - 1 time series selected at random.
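The pool composition above can be sketched as follows; Euclidean distance in feature space is an assumption of this sketch:

```python
import random

def euclid(a, b):
    """Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def voting_pool(i, X, k, s, rng=random):
    """Build the size-k voting pool for time series i: the s most
    similar (MS) series, the single most different (MD) series,
    and k - s - 1 series picked at random from the rest."""
    others = [j for j in range(len(X)) if j != i]
    by_dist = sorted(others, key=lambda j: euclid(X[i], X[j]))
    ms = by_dist[:s]                # most similar
    md = [by_dist[-1]]              # most different
    rest = [j for j in others if j not in ms and j not in md]
    return ms + md + rng.sample(rest, k - s - 1)
```

Re-sampling the random members each time a series is visited is what lets fresh label information flow between distant parts of the dataset.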
Voting pool
- Only the class labels currently assigned to the members of its voting pool are taken as candidates.
- The class label of time series i is decided by its voting pool.
Voting pool
- What are the purposes of MD, MS and the members selected at random?
- MD informs which label to avoid.
- MS informs which labels to choose.
- The randomly selected members introduce new information.
Cost function
- Encourage good labelings with low cost and penalize poor labelings with high cost.
- As in a CRF model, the conditional probability distribution of the labels is written in Gibbs form, with the prior encoded by the cost function.
Cost function
- The two cost functions depend on the same set of arguments, so they can be combined to obtain a new model.
Cost function
- How do we set the new cost function?
- We do not need to specify the number of clusters or any information about centroids.
- So we define an estimated threshold between intra- and inter-class distances: for each of m randomly picked time series, compute its minimum and maximum distance to the other series, then take the average of all the minimum distances and the average of all the maximum distances.
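A sketch of this threshold estimate, assuming Euclidean distance and placing the threshold midway between the two averages (the midpoint rule is an assumption of this sketch, not stated on the slide):

```python
import random

def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def estimate_threshold(X, m, rng=random):
    """Estimate a threshold between intra- and inter-class distances
    without knowing the clusters: the average minimum distance acts
    as an intra-class proxy, the average maximum distance as an
    inter-class proxy, over m randomly picked time series."""
    picked = rng.sample(range(len(X)), m)
    mins, maxs = [], []
    for i in picked:
        ds = [dist(X[i], X[j]) for j in range(len(X)) if j != i]
        mins.append(min(ds))
        maxs.append(max(ds))
    avg_min = sum(mins) / m
    avg_max = sum(maxs) / m
    return (avg_min + avg_max) / 2
```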
Cost function
- The cost function is defined in terms of the distances to the voting-pool members and the estimated threshold.
- The assignment of a label is determined by minimizing this cost over the candidate labels.
CRFs clustering algorithm
- The CRFs clustering algorithm combines three components:
Data transformation
Voting pool
Cost function
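The components can be combined into an iterative labelling loop. The sketch below is a simplified stand-in, not the paper's exact cost function: each visited series adopts the cheapest candidate label from its pool (average distance to the pool members carrying that label), where keeping or starting a singleton cluster costs the threshold T.

```python
def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def crf_cluster(X, T, pool_fn, n_sweeps=10):
    """Simplified sketch of the CRFs clustering loop.

    X       : list of feature vectors (after data transformation)
    T       : estimated intra/inter-class distance threshold
    pool_fn : pool_fn(i) returns the voting pool indices for series i
    """
    labels = list(range(len(X)))        # every series starts as a singleton
    for _ in range(n_sweeps):
        for i in range(len(X)):
            pool = pool_fn(i)
            candidates = {labels[j] for j in pool}
            best, best_cost = labels[i], T      # staying a singleton costs T
            for c in candidates:
                members = [j for j in pool if labels[j] == c]
                cost = sum(dist(X[i], X[j]) for j in members) / len(members)
                if cost < best_cost:
                    best, best_cost = c, cost
            labels[i] = best
    return labels
```

Because no centroids are used and any series may remain a singleton, the number of clusters emerges from the data rather than being specified in advance, matching the two requirements stated earlier.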
Results and Discussion
- Evaluation on toy cases
- Experiments on time-series datasets
- Simulated datasets
- Yeast galactose dataset
- Yeast cell-cycle dataset
- Evaluation of class prediction
Evaluation on toy cases
- How does the voting pool affect the performance?
- Experimental setting
- Three-dimensional feature patterns.
- Five clusters, each with 30 patterns.
- Voting pool sizes of 4, 6, 10, 18.
- One MS, one MD, (k-2) random samples.
Evaluation on toy cases
- The five clusters are randomly created with the five centroids fixed at
- µ1 = (40, 45, 100)
- µ2 = (85, 60, 100)
- µ3 = (65, 100, 80)
- µ4 = (55, 120, 120)
- µ5 = (120, 50, 120)
- All clusters share the same variance of 64.
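The toy setup above can be reproduced as a sketch, assuming independent Gaussian noise in each dimension with standard deviation 8 (so the variance is 64):

```python
import random

# Fixed centroids of the five toy clusters, from the slide above.
CENTROIDS = [(40, 45, 100), (85, 60, 100), (65, 100, 80),
             (55, 120, 120), (120, 50, 120)]

def make_toy_data(rng=random):
    """Generate 5 clusters of 30 three-dimensional patterns each,
    Gaussian-distributed around the fixed centroids with a shared
    per-dimension variance of 64 (std-dev 8)."""
    data, labels = [], []
    for c, mu in enumerate(CENTROIDS):
        for _ in range(30):
            data.append(tuple(rng.gauss(m, 8.0) for m in mu))
            labels.append(c)
    return data, labels
```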
Evaluation on toy cases
- Three cases
- MS and MD included
- MD excluded
- MS excluded
Evaluation on toy cases
- Without MD, the algorithm cannot know which labels are unsuitable.
- Without MS, the algorithm can only guess at the correct labels.
Simulated datasets
- We synthesized five classes of 60, 60, 60, 80 and 60 gene expression profiles.
- Gaussian noise is added to all data.
Simulated datasets
- Adjusted Rand index of 0.9903 at pool size k = 12.
- Time complexity tends to decline steadily.
- The clustering accuracy and efficiency of the proposed method depend on the choice of the pool size.
Simulated datasets
- Another experimental setting
- Three datasets of 320 gene profiles consisting of 4, 5 and 6 clusters.
- Three datasets of 400 gene profiles consisting of 4, 5 and 6 clusters.
Simulated datasets
- A pool size in the range of 2-4% of the dataset size is reasonable:
- for the 320-profile datasets, voting pool sizes of 6-12;
- for the 400-profile datasets, voting pool sizes of 8-16.
Yeast galactose dataset
- The yeast galactose dataset consists of gene expression measurements of galactose utilization in Saccharomyces cerevisiae.
- We compared the adjusted Rand index between our algorithm and CORE (Tjaden, 2006).
- The CORE algorithm is a supervised k-means clustering method.
Yeast galactose dataset
- Good performance with voting pool sizes of 5-11.
- The optimal adjusted Rand index is 0.9478, whereas the optimal value for CORE is 0.7.
Yeast cell-cycle dataset
- In the yeast cell-cycle (Y5) dataset, more than 6000 yeast genes are measured over two cell cycles at 17 time points.
- The genes are identified based on their peak times in the five phases of the cell cycle: early G1 (G1E), late G1 (G1L), S, G2 and M.
Yeast cell-cycle dataset
- In this range of pool sizes, the adjusted Rand index is better than 0.467, the best result of all the methods without labeled data, such as HMMs, k-means and splines.
- The optimal value is 0.4808.
Evaluation of class prediction
- We need to assign genes to known classes, i.e. class prediction.
- The datasets (Y5, yeast galactose, simulated) are divided into training and testing sets.
- The high prediction error rate is due to the high rate of misclassification of the training set.
Evaluation of class prediction
- The Y5 dataset shows a high prediction error rate.
Conclusion
- An efficient unsupervised CRFs model is presented for both gene class discovery and class prediction, requiring no a priori knowledge.