An unsupervised conditional random fields approach for clustering gene expression time series - PowerPoint PPT Presentation

1 / 37
About This Presentation
Title:

An unsupervised conditional random fields approach for clustering gene expression time series

Description:

Rand index is a measure of the similarity between two data clusters. ... Ones selected at random: introduce new information. 16. Cost function ... – PowerPoint PPT presentation

Number of Views:62
Avg rating:3.0/5.0
Slides: 38
Provided by: mauric98
Category:

less

Transcript and Presenter's Notes

Title: An unsupervised conditional random fields approach for clustering gene expression time series


1
An unsupervised conditional random fields
approach for clustering gene expression time
series
  • Chang-Tsun Li, Yinyin Yuan and Roland Wilson
  • Bioinformatics, 2008

2
Outline
  • Introduction
  • Methods
  • Results and Discussion
  • Conclusions

3
Introduction
  • Gene expression
  • Gene expression time-series data
  • Conditional random fields model
  • Rand index

4
Gene expression
  • Gene expression the process by which inheritable
    information from a gene(the DNA sequence) is made
    into a functional gene product(protein or RNA).
  • Its a temporal process, so we need to record
    time-course data of gene expression.

5
Gene expression data
  • For example, Cho mitotic cell cycle 17 time
    points of expression data for synchronized yeast
    cells.

Expression levels at time t10 minutes after
release from cell-cycle arrest.
6
Gene expression data
  • The co-expression indicates co-regulation or
    relation in functional pathways.
  • We need an efficient clustering method to
    identify genes.

7
Conditional Random Fields
  • A type of discriminative probabilistic model most
    often used for the labeling or parsing of
    sequential data.
  • Let X be a random variable over the observations
    and Y be a random variable over corresponding
    labels.
  • A CRF model can be expressed as a joint
    distribution of Y given X

8
Rand index
  • Rand index is a measure of the similarity between
    two data clusters.
  • The Rand index has a value between 0 and 1, with
    0 indicating that the two data clusters do not
    agree on any pair of points and 1 indicating that
    the data clusters are exactly the same.

9
Methods
  • Two steps
  • Data transformation
  • Model formulation
  • The proposed algorithm CRFs clustering algorithm

10
Data Transformation
  • How the gene expression time series are
    transformed into a multi-dimensional feature
    space?
  • Consider that
  • temporal correlation
  • time intervals

11
Data Transformation
  • Let n denote the number of time series, i.e. the
    number of genes.
  • Each time series Ti of length t is mapped to a
    point xi

Temporal correlation.
The length of interval.
Each t-time-point time series is transformed into
a t-1 dimension feature vector.
12
Model formulation
  • A set of n observed time series, X x1, x2, ,
    xn.
  • A set of the corresponding labels, Y y1, y2,
    , yn.
  • How to assign an optimal class label yi on tine
    series i conditioned on observed data xi?

13
Model formulation
  • Unsupervised clustering methods, we need two
    requirements
  • Not using cluster centroids.
  • Allow each time series to be a singleton cluster.
  • Consider that
  • Voting pool
  • Cost function

14
Voting pool
  • When a time series i is visited, its voting pool
    Ni is formed with k time series.

Voting pool k
Most similar(MS) time series s
Time series i
Most different(MD) time series 1
Time series selected at random k-s-1
15
Voting pool
  • Only the class labels currently assigned to the
    members of its voting pool are taken as
    candidates.
  • The class label of time series i is decided with
    its voting pool.

16
Voting pool
  • What purposes for MD, MS and ones selected at
    random?
  • MD inform what label to avoid.
  • MS inform what labels to choose.
  • Ones selected at random introduce new
    information.

17
Cost function
  • Encourage good labeling with low cost and
    penalize poor labeling with high cost.
  • Like a CRF model

Conditional probability distribution
The prior
Gibbs form
18
Cost function
  • The two cost functions
  • We obtain a new model as

They are dependent on the same set of arguments!
19
Cost function
  • How to set the new cost function?
  • We dont need to specify the number of clusters
    and the information about centroids.
  • So, we define

The estimated threshold between intra- and
inter-class distances.
average of all the minimum distances
average of all the maximum distances
time series i
maximum distance
minimum distance
randomly picked m
20
Cost function
  • The cost function is defined as
  • The assignment of a label is determined as

21
CRFs clustering algorithm
  • CRFs clustering algorithm

Data transformation
Voting pool
Cost function
22
Results and Discussion
  • Evaluation on toy cases
  • Experiments on time-series datasets
  • Simulated datasets
  • Yeast galactose dataset
  • Yeast cell-cycle dataset
  • Evaluation of class prediction

23
Evaluation on toy cases
  • How does the voting pool affect the performance?
  • Experimental setting
  • Three-dimensional features patterns.
  • Five clusters, each having 30 patterns.
  • Voting pool size 4, 6, 10, 18.
  • One MS, one MD, (k-2) random samples.

24
Evaluation on toy cases
  • The five clusters are randomly created with the
    five centroids fixed at
  • µ1 (40, 45, 100)
  • µ2 (85, 60, 100)
  • µ3 (65, 100, 80)
  • µ4 (55, 120, 120)
  • µ5 (120, 50, 120)
  • The same variance of 64.

25
Evaluation on toy cases
  • Three cases
  • MS and MD included
  • MD excluded
  • MS excluded

26
Evaluation on toy cases
  • The results

Without MD, the algorithm couldnt know which
label is not suitable.
Without MS, the algorithm tends to guess correct
labels.
27
Simulated datasets
  • We synthesized five classes of 60, 60, 60, 80, 60
    gene expression profiles
  • Gaussian noise is added to all data.

28
Simulated datasets
  • The results

0.9903 , k 12
Time complexity tends to decline steadily.
The clustering accuracy and efficiency of the
proposed method depend on the choice of the pool
size.
29
Simulated datasets
  • Another experimental setting
  • Three datasets of 320 gene profiles consisting 4,
    5 and 6 clusters.
  • Three datasets of 400 gene profiles consisting 4,
    5 and 6 clusters.

30
Simulated datasets
A pool size in the range of 24 of the dataset
size is reasonable.
  • The results

Voting pool size is 6-12
Voting pool size is 8-16
31
Yeast galactose dataset
  • The Yeast galactose dataset consists of gene
    expression measurements in galactose utilization
    in Saccharomyces cerevisiae.
  • We compared the adjusted Rand index between our
    algorithm and CORE(Tjaden, 2006).
  • CORE algorithm a supervised k-means clustering
    method.

32
Yeast galactose dataset
  • The results

Good performance with voting pool size is
5-11. However, the optimal value of CORE is 0.7.
Optimal value 0.9478
33
Yeast cell-cycle dataset
  • In the Yeast cell-cycle(Y5) dataset, there are
    more than 6000 yeast genes measured during two
    cell-cycles at 17 time points.

The genes are identified based on their peak
times of five phase of the cell-cycle early
G1(G1E), late G1(G1L), S, G2, M.
34
Yeast cell-cycle dataset
In the range, the adjusted Rand index are better
than the best result 0.467 of all the methods
without labeled data, like HMM, k-means and
splines, etc.
  • The results

Optimal value is 0.4808
35
Evaluation of class prediction
  • We need to assign genes to known classes, i.e.
    class prediction.
  • The datasets(Y5, Yeast galactose, Simulated) are
    divided into training and testing sets.

The high-prediction error rate is due to the high
rate of misclassification of the training set.
36
Evaluation of class prediction
  • The results

Y5 Dataset high prediction error rate.
37
Conclusion
  • An efficient unsupervised CRFs model for both
    gene class discovery and class prediction without
    a priori knowledge is presented.
Write a Comment
User Comments (0)
About PowerShow.com