Title: An unsupervised conditional random fields approach for clustering gene expression time series
An unsupervised conditional random fields approach for clustering gene expression time series
- Chang-Tsun Li, Yinyin Yuan and Roland Wilson
- Bioinformatics, 2008
Outline
- Introduction
- Methods
- Results and Discussion
- Conclusions
Introduction
- Gene expression
- Gene expression time-series data
- Conditional random fields model
- Rand index
Gene expression
- Gene expression is the process by which heritable information from a gene (the DNA sequence) is made into a functional gene product (a protein or RNA).
- It is a temporal process, so we need to record time-course gene expression data.
Gene expression data
- For example, the Cho mitotic cell-cycle dataset: 17 time points of expression data for synchronized yeast cells.
- Expression levels are measured starting at time t = 10 minutes after release from cell-cycle arrest.
Gene expression data
- Co-expression indicates co-regulation or a relation in functional pathways.
- We need an efficient clustering method to identify co-expressed genes.
Conditional Random Fields
- A type of discriminative probabilistic model, most often used for labeling or parsing sequential data.
- Let X be a random variable over the observations and Y a random variable over the corresponding labels.
- A CRF models the conditional distribution of Y given X.
Rand index
- The Rand index is a measure of the similarity between two data clusterings.
- It takes a value between 0 and 1, with 0 indicating that the two clusterings do not agree on any pair of points and 1 indicating that the clusterings are exactly the same.
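The pair-agreement definition above can be written directly; this is a minimal sketch (note the paper actually reports the adjusted Rand index, which corrects the plain Rand index for chance agreement):

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of point pairs on which two clusterings agree.

    A pair agrees if both clusterings put it in the same cluster,
    or both put it in different clusters.
    """
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

# Identical partitions (up to label renaming) agree on every pair:
print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
```

Cluster labels only matter up to renaming, which is why the comparison is over pairs of points rather than over the labels themselves.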
Methods
- Two steps
- Data transformation
- Model formulation
- The proposed algorithm: the CRFs clustering algorithm
Data Transformation
- How are the gene expression time series transformed into a multi-dimensional feature space?
- Consider
- temporal correlation
- time intervals
Data Transformation
- Let n denote the number of time series, i.e. the number of genes.
- Each time series Ti of length t is mapped to a point xi, taking into account the temporal correlation and the length of each sampling interval.
- Each t-time-point time series is thus transformed into a (t-1)-dimensional feature vector.
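One plausible reading of this mapping, sketched below, takes the slope of the expression profile over each of the t-1 sampling intervals, so that both the direction of change and the interval lengths enter the features. The exact transform is given in the paper; the slope form here is an assumption of this sketch.

```python
def to_features(series, times):
    """Map a t-point time series to a (t-1)-dimensional feature
    vector of slopes between consecutive samples, so unevenly
    spaced time points are weighted by their interval lengths."""
    return [
        (series[j + 1] - series[j]) / (times[j + 1] - times[j])
        for j in range(len(series) - 1)
    ]

# A 3-point series sampled at 0, 10 and 30 minutes becomes a
# 2-dimensional feature vector:
print(to_features([1.0, 3.0, 2.0], [0, 10, 30]))  # [0.2, -0.05]
```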
Model formulation
- A set of n observed time series, X = {x1, x2, ..., xn}.
- A set of corresponding labels, Y = {y1, y2, ..., yn}.
- How do we assign an optimal class label yi to time series i conditioned on the observed data xi?
Model formulation
- For an unsupervised clustering method, we have two requirements:
- Do not use cluster centroids.
- Allow each time series to be a singleton cluster.
- Consider
- Voting pool
- Cost function
Voting pool
- When a time series i is visited, its voting pool Ni of size k is formed from:
- the s most similar (MS) time series,
- the 1 most different (MD) time series,
- and k - s - 1 time series selected at random.
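The pool composition above can be sketched as follows; Euclidean distance in feature space is an assumption of this sketch:

```python
import random

def euclid(a, b):
    """Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def voting_pool(i, X, k, s, rng=random):
    """Build the size-k voting pool for time series i: the s most
    similar (MS) series, the single most different (MD) series,
    and k - s - 1 series picked at random from the rest."""
    others = [j for j in range(len(X)) if j != i]
    by_dist = sorted(others, key=lambda j: euclid(X[i], X[j]))
    ms = by_dist[:s]                # most similar
    md = [by_dist[-1]]              # most different
    rest = [j for j in others if j not in ms and j not in md]
    return ms + md + rng.sample(rest, k - s - 1)
```

Re-sampling the random members each time a series is visited is what lets fresh label information flow between distant parts of the dataset.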
Voting pool
- Only the class labels currently assigned to the members of its voting pool are taken as candidates.
- The class label of time series i is decided by its voting pool.
Voting pool
- What are the purposes of MD, MS and the members selected at random?
- MD informs which label to avoid.
- MS informs which labels to choose.
- The randomly selected members introduce new information.
Cost function
- Encourage good labelings with low cost and penalize poor labelings with high cost.
- As in a CRF model, the conditional probability distribution of the labels is written in Gibbs form, with the prior encoded by the cost function.
Cost function
- The two cost functions depend on the same set of arguments, so they can be combined to obtain a new model.
Cost function
- How do we set the new cost function?
- We do not need to specify the number of clusters or any information about centroids.
- So we define an estimated threshold between intra- and inter-class distances: for each of m randomly picked time series, compute its minimum and maximum distance to the other series, then take the average of all the minimum distances and the average of all the maximum distances.
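A sketch of this threshold estimate, assuming Euclidean distance and placing the threshold midway between the two averages (the midpoint rule is an assumption of this sketch, not stated on the slide):

```python
import random

def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def estimate_threshold(X, m, rng=random):
    """Estimate a threshold between intra- and inter-class distances
    without knowing the clusters: the average minimum distance acts
    as an intra-class proxy, the average maximum distance as an
    inter-class proxy, over m randomly picked time series."""
    picked = rng.sample(range(len(X)), m)
    mins, maxs = [], []
    for i in picked:
        ds = [dist(X[i], X[j]) for j in range(len(X)) if j != i]
        mins.append(min(ds))
        maxs.append(max(ds))
    avg_min = sum(mins) / m
    avg_max = sum(maxs) / m
    return (avg_min + avg_max) / 2
```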
Cost function
- The cost function is defined in terms of the distances to the voting-pool members and the estimated threshold.
- The assignment of a label is determined by minimizing this cost over the candidate labels.
CRFs clustering algorithm
- The CRFs clustering algorithm combines three components:
Data transformation
Voting pool
Cost function
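The components can be combined into an iterative labelling loop. The sketch below is a simplified stand-in, not the paper's exact cost function: each visited series adopts the cheapest candidate label from its pool (average distance to the pool members carrying that label), where keeping or starting a singleton cluster costs the threshold T.

```python
def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def crf_cluster(X, T, pool_fn, n_sweeps=10):
    """Simplified sketch of the CRFs clustering loop.

    X       : list of feature vectors (after data transformation)
    T       : estimated intra/inter-class distance threshold
    pool_fn : pool_fn(i) returns the voting pool indices for series i
    """
    labels = list(range(len(X)))        # every series starts as a singleton
    for _ in range(n_sweeps):
        for i in range(len(X)):
            pool = pool_fn(i)
            candidates = {labels[j] for j in pool}
            best, best_cost = labels[i], T      # staying a singleton costs T
            for c in candidates:
                members = [j for j in pool if labels[j] == c]
                cost = sum(dist(X[i], X[j]) for j in members) / len(members)
                if cost < best_cost:
                    best, best_cost = c, cost
            labels[i] = best
    return labels
```

Because no centroids are used and any series may remain a singleton, the number of clusters emerges from the data rather than being specified in advance, matching the two requirements stated earlier.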
Results and Discussion
- Evaluation on toy cases
- Experiments on time-series datasets
- Simulated datasets
- Yeast galactose dataset
- Yeast cell-cycle dataset
- Evaluation of class prediction
Evaluation on toy cases
- How does the voting pool affect the performance?
- Experimental setting
- Three-dimensional feature patterns.
- Five clusters, each with 30 patterns.
- Voting pool sizes of 4, 6, 10, 18.
- One MS, one MD, (k-2) random samples.
Evaluation on toy cases
- The five clusters are randomly created with the five centroids fixed at
- µ1 = (40, 45, 100)
- µ2 = (85, 60, 100)
- µ3 = (65, 100, 80)
- µ4 = (55, 120, 120)
- µ5 = (120, 50, 120)
- All clusters share the same variance of 64.
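The toy setup above can be reproduced as a sketch, assuming independent Gaussian noise in each dimension with standard deviation 8 (so the variance is 64):

```python
import random

# Fixed centroids of the five toy clusters, from the slide above.
CENTROIDS = [(40, 45, 100), (85, 60, 100), (65, 100, 80),
             (55, 120, 120), (120, 50, 120)]

def make_toy_data(rng=random):
    """Generate 5 clusters of 30 three-dimensional patterns each,
    Gaussian-distributed around the fixed centroids with a shared
    per-dimension variance of 64 (std-dev 8)."""
    data, labels = [], []
    for c, mu in enumerate(CENTROIDS):
        for _ in range(30):
            data.append(tuple(rng.gauss(m, 8.0) for m in mu))
            labels.append(c)
    return data, labels
```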
Evaluation on toy cases
- Three cases
- MS and MD included
- MD excluded
- MS excluded
Evaluation on toy cases
- Without MD, the algorithm cannot know which labels are unsuitable.
- Without MS, the algorithm can only guess at the correct labels.
Simulated datasets
- We synthesized five classes of 60, 60, 60, 80 and 60 gene expression profiles.
- Gaussian noise is added to all data.
Simulated datasets
- Adjusted Rand index of 0.9903 at pool size k = 12.
- Time complexity tends to decline steadily.
- The clustering accuracy and efficiency of the proposed method depend on the choice of the pool size.
Simulated datasets
- Another experimental setting
- Three datasets of 320 gene profiles consisting of 4, 5 and 6 clusters.
- Three datasets of 400 gene profiles consisting of 4, 5 and 6 clusters.
Simulated datasets
- A pool size in the range of 2-4% of the dataset size is reasonable:
- for the 320-profile datasets, voting pool sizes of 6-12;
- for the 400-profile datasets, voting pool sizes of 8-16.
Yeast galactose dataset
- The yeast galactose dataset consists of gene expression measurements of galactose utilization in Saccharomyces cerevisiae.
- We compared the adjusted Rand index between our algorithm and CORE (Tjaden, 2006).
- The CORE algorithm is a supervised k-means clustering method.
Yeast galactose dataset
- Good performance with voting pool sizes of 5-11.
- The optimal adjusted Rand index is 0.9478, whereas the optimal value for CORE is 0.7.
Yeast cell-cycle dataset
- In the yeast cell-cycle (Y5) dataset, more than 6000 yeast genes are measured over two cell cycles at 17 time points.
- The genes are identified based on their peak times in the five phases of the cell cycle: early G1 (G1E), late G1 (G1L), S, G2 and M.
Yeast cell-cycle dataset
- In this range of pool sizes, the adjusted Rand index is better than 0.467, the best result of all the methods without labeled data, such as HMMs, k-means and splines.
- The optimal value is 0.4808.
Evaluation of class prediction
- We need to assign genes to known classes, i.e. class prediction.
- The datasets (Y5, yeast galactose, simulated) are divided into training and testing sets.
- The high prediction error rate is due to the high rate of misclassification of the training set.
Evaluation of class prediction
- The Y5 dataset shows a high prediction error rate.
Conclusion
- An efficient unsupervised CRFs model is presented for both gene class discovery and class prediction, requiring no a priori knowledge.