Biclustering of Expression Data

Provided by: bob483
Learn more at: https://cs.gmu.edu


1
Biclustering of Expression Data
  • by Yizong Cheng and George M. Church
  • Presented by Bojun Yan
  • March 25, 2004

2
Outline
  • 1. Microarrays and related research
  • 1.1 Microarray gene expression data
  • 1.2 Main research on microarrays
  • 2. Why bicluster?
  • 2.1 Preceding research and its faults
  • 2.2 The concept of a bicluster
  • 2.3 Similarity measures
  • 3. The hardness of biclustering

3
  • 4. Methods proposed by this paper
  • 4.1 Related work and the paper's goal
  • 4.2 Definition of the mean squared residue
    score
  • 4.3 Scores of some special matrices
  • 4.4 Some theorems deduced by the authors
  • 4.5 Algorithms proposed by this paper
  • 5. Experiment
  • 5.1 Data preparation
  • 5.2 Determining Algorithm Parameters
  • 5.3 Final Algorithm
  • 5.4 Results and Display

4
1. Microarrays and related research
  • 1.1 Microarray gene expression data
  • Generated by DNA chips and other
    microarray techniques. Rows: genes.
    Columns: conditions or samples.
  • 1.2 Main research on microarrays
  • (1) Gene clustering: finding genes that
    have similar functions
  • (2) Condition clustering: helpful for case
    analysis
  • (3) Classification: tumor classification,
    cancer prediction
  • (4) Gene selection: finding the genes related
    to a disease
  • (5) Gene networks: exploring the regulatory
    interactions between genes
  • 1.3 Paper target: biclustering
5
2. Why Bicluster?
  • 2.1 Preceding research and its faults
  • Goal: discover regulatory patterns or
    condition similarities
  • Methods: based on Euclidean distance or the dot
    product between vectors (all dimensions equally weighted)
  • (1) Group genes (rows)
  • (2) Group conditions (columns)
  • Result: the genes or conditions are partitioned into
    mutually exclusive groups or hierarchies

6
  • Faults: discovering some similarity groups
    obscures other similarity groups
  • 2.2 The concept of a bicluster
  • Clustering the genes (rows) and
    conditions (columns) simultaneously, a form of
    subspace clustering
  • 2.3 Similarity measures
  • (1) Distance metrics, such as Minkowski
    distances
  • (2) Cosine measure

7
  • (3) Pearson correlation
  • (4) Extended Jaccard similarity
  • (5) Mean squared residue (proposed by this
    paper)
  • A measure of the coherence of the genes
    and conditions in the bicluster
  • A symmetric function of the genes and
    conditions
  • Groups genes and conditions simultaneously

8
3. Hardness of biclustering
  • The problem of finding a maximum bicluster with a
    score lower than a threshold includes the problem
    of finding a maximum biclique in a bipartite
    graph as a special case
  • Finding the largest constant square submatrix has
    been proven to be NP-hard (Johnson, 1987)
  • The problem of finding a minimum set of
    biclusters, either mutually exclusive or
    overlapping, that covers all the elements in a data
    matrix has been shown to be NP-hard (Orlin, 1977)

9
4. Methods proposed by this paper
  • 4.1 Related work and the paper's goal
  • (1) Related work
  • Divisive algorithms that partition data into sets
    with approximately constant values were proposed by
    Morgan and Sonquist (1963) and Hartigan (1972)
  • Hartigan mentioned that the criterion for
    partitioning may be a two-way analysis of
    variance model, similar to the mean squared
    residue scoring proposed in this article
  • Mirkin (1996) presents a node addition algorithm

10
  • The term "biclustering" was used by Mirkin (1996)
    to mean simultaneous clustering of both the row
    and column sets in a data matrix
  • The terms "direct clustering" (Hartigan, 1972) and
    "box clustering" (Mirkin, 1996) have the same
    meaning
  • (2) The paper's goal and criterion
  • Goal: find a set of genes showing
    strikingly similar up-regulation and
    down-regulation under a set of conditions
  • Criterion: a low mean squared residue score plus
    a large variation from the constant is used to
    identify these genes and conditions
  • Overlapping: biclusters should be allowed to
    overlap in expression data analysis

11
4.2 Definition of mean squared residue score
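The equations on this slide did not survive in the transcript. As a reference point, the definition given in the Cheng and Church paper can be written as below, where a_ij is an element of the expression matrix, I is the row (gene) subset and J is the column (condition) subset of the bicluster.

```latex
\begin{aligned}
a_{iJ} &= \frac{1}{|J|}\sum_{j \in J} a_{ij}, \qquad
a_{Ij} = \frac{1}{|I|}\sum_{i \in I} a_{ij}, \qquad
a_{IJ} = \frac{1}{|I|\,|J|}\sum_{i \in I,\, j \in J} a_{ij}, \\
H(I,J) &= \frac{1}{|I|\,|J|}\sum_{i \in I,\, j \in J}
          \left(a_{ij} - a_{iJ} - a_{Ij} + a_{IJ}\right)^{2}.
\end{aligned}
```

A submatrix is called a δ-bicluster when H(I,J) ≤ δ for a chosen threshold δ ≥ 0.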
12
  • The row variance
  • It is an accompanying score used to reject trivial
    or constant biclusters.
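A minimal sketch of why the row variance is a useful companion to the mean squared residue. The function name bicluster_scores is hypothetical, and averaging the row variance over all rows of the bicluster is an assumption about how the accompanying score is aggregated. A constant bicluster and an additive (coherent) bicluster both score zero residue, but only the constant one has zero row variance.

```python
def bicluster_scores(matrix, rows, cols):
    """Return (mean squared residue H, mean row variance) of a submatrix.

    The row variance is averaged over all rows, which is an assumption
    made for this sketch.
    """
    sub = [[matrix[i][j] for j in cols] for i in rows]
    n_rows, n_cols = len(sub), len(sub[0])
    row_mean = [sum(r) / n_cols for r in sub]
    col_mean = [sum(r[j] for r in sub) / n_rows for j in range(n_cols)]
    all_mean = sum(row_mean) / n_rows
    cells = [(i, j) for i in range(n_rows) for j in range(n_cols)]
    h = sum((sub[i][j] - row_mean[i] - col_mean[j] + all_mean) ** 2
            for i, j in cells) / len(cells)
    var = sum((sub[i][j] - row_mean[i]) ** 2 for i, j in cells) / len(cells)
    return h, var


# Constant bicluster: perfect score, but no fluctuation at all.
constant = [[5.0] * 4 for _ in range(3)]
# Additive bicluster a_ij = i + j: rows rise and fall in unison.
additive = [[float(i + j) for j in range(4)] for i in range(3)]

print(bicluster_scores(constant, range(3), range(4)))
print(bicluster_scores(additive, range(3), range(4)))
```

Both calls report a residue of zero; the second also reports a row variance of 1.25, which is what separates an interesting bicluster from a trivial one.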

13
4.3 Scores of some special matrices
  • A special case with a perfect score (a zero mean
    squared residue score) is a constant bicluster of
    elements of a single value
  • For the matrix a_ij = ij, i, j > 0, no submatrix of a
    size larger than a single cell has a score lower
    than 0.5
  • A K×K matrix of all 0s except one 1 has a score
    given by an equation that is not included in this transcript
  • A matrix with elements randomly and uniformly
    generated in the range [a, b] has an expected
    score of (b-a)^2/12. For example, if the range is
    [0, 800], the expected score is 53,333.
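The two numeric cases above can be checked with a short sketch. The closed form (K-1)^2/K^4 used below for the K×K matrix with a single 1 is derived here from the mean squared residue definition, not copied from the slide, so treat it as an assumption to be verified rather than the slide's own equation. The helper names msr and single_one are hypothetical.

```python
from fractions import Fraction


def msr(sub):
    """Mean squared residue of a full matrix given as a list of rows."""
    n, m = len(sub), len(sub[0])
    row_mean = [sum(row) / m for row in sub]
    col_mean = [sum(row[j] for row in sub) / n for j in range(m)]
    all_mean = sum(row_mean) / n
    return sum(
        (sub[i][j] - row_mean[i] - col_mean[j] + all_mean) ** 2
        for i in range(n) for j in range(m)
    ) / (n * m)


def single_one(k):
    """K x K matrix of zeros with a single 1 in the top-left cell."""
    mat = [[Fraction(0)] * k for _ in range(k)]
    mat[0][0] = Fraction(1)
    return mat


# Exact arithmetic with Fraction, so the comparison is not affected by rounding.
for k in (2, 3, 5, 10):
    assert msr(single_one(k)) == Fraction((k - 1) ** 2, k ** 4)

# Expected score of a uniform random matrix on [a, b], as stated on the slide.
a, b = 0, 800
expected_uniform = (b - a) ** 2 / 12
print(round(expected_uniform))  # 53333
```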

14
(No Transcript)
15
  • Some characteristics of the mean squared residue score
  • (1) Adding a constant to every element of the matrix
    does not affect the H(I,J) score
  • (2) Multiplying the matrix by a constant changes
    the score by the square of the constant
  • (3) Neither operation affects the ranking of the
    biclusters in a matrix
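These properties can be checked numerically. The sketch below uses a hypothetical helper msr and an arbitrary 3×3 matrix chosen only for illustration: shifting every element by 10 should leave the score unchanged, and scaling by 3 should multiply it by 9.

```python
def msr(sub):
    """Mean squared residue of a full matrix given as a list of rows."""
    n, m = len(sub), len(sub[0])
    row_mean = [sum(row) / m for row in sub]
    col_mean = [sum(row[j] for row in sub) / n for j in range(m)]
    all_mean = sum(row_mean) / n
    return sum(
        (sub[i][j] - row_mean[i] - col_mean[j] + all_mean) ** 2
        for i in range(n) for j in range(m)
    ) / (n * m)


base = [[1.0, 4.0, 2.0], [3.0, 3.0, 8.0], [6.0, 1.0, 5.0]]
shifted = [[x + 10.0 for x in row] for row in base]  # add a constant
scaled = [[3.0 * x for x in row] for row in base]    # multiply by a constant

h = msr(base)
print(abs(msr(shifted) - h) < 1e-9)       # shift leaves H unchanged
print(abs(msr(scaled) - 9.0 * h) < 1e-9)  # scale by c multiplies H by c**2
```

Because a shift leaves every score unchanged and a scale multiplies every score by the same positive factor, the ordering of biclusters by score is preserved.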

16
4.4 Theorems deduced by authors
17
(No Transcript)
18
Comments on Algorithm 0
  • Algorithm 0, although it runs in polynomial time, is
    not efficient enough for a quick analysis of
    most expression data matrices.
  • The complexity of Algorithm 0 is O((n+m)nm)

19
(No Transcript)
20
Comments on Algorithm 1
  • In each iteration, a complete recalculation of
    step 1 and step 2 is needed
  • The time complexity of Algorithm 1 is O(nm)
  • More efficient than Algorithm 0, but not the
    best.
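The algorithm slide itself has no transcript, so the following is only a sketch of single node deletion as it is usually described for the Cheng and Church method: recompute all means and scores, stop if H(I,J) ≤ δ, otherwise delete the single row or column with the largest mean squared residue. The function names and the toy matrix are hypothetical.

```python
def residue_stats(matrix, rows, cols):
    """Return H(I,J) plus the mean squared residue of each row and column."""
    row_mean = {i: sum(matrix[i][j] for j in cols) / len(cols) for i in rows}
    col_mean = {j: sum(matrix[i][j] for i in rows) / len(rows) for j in cols}
    all_mean = sum(row_mean.values()) / len(rows)
    sq = {(i, j): (matrix[i][j] - row_mean[i] - col_mean[j] + all_mean) ** 2
          for i in rows for j in cols}
    h = sum(sq.values()) / (len(rows) * len(cols))
    d_row = {i: sum(sq[i, j] for j in cols) / len(cols) for i in rows}
    d_col = {j: sum(sq[i, j] for i in rows) / len(rows) for j in cols}
    return h, d_row, d_col


def single_node_deletion(matrix, delta):
    rows = list(range(len(matrix)))
    cols = list(range(len(matrix[0])))
    while True:
        # Complete recalculation in every iteration, as noted on the slide.
        h, d_row, d_col = residue_stats(matrix, rows, cols)
        # A single row or column always has zero residue, so stop there too.
        if h <= delta or len(rows) == 1 or len(cols) == 1:
            return rows, cols, h
        worst_row = max(rows, key=d_row.get)
        worst_col = max(cols, key=d_col.get)
        if d_row[worst_row] >= d_col[worst_col]:
            rows.remove(worst_row)
        else:
            cols.remove(worst_col)


# Three coherent rows (a_ij = r_i + j) plus one unrelated row.
matrix = [
    [1, 2, 3, 4],
    [2, 3, 4, 5],
    [4, 5, 6, 7],
    [9, 0, 7, 1],
]
rows, cols, h = single_node_deletion(matrix, delta=0.1)
print(rows, cols)  # [0, 1, 2] [0, 1, 2, 3]
```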

21
(No Transcript)
22
Comments on Algorithm 2
  • The parameter α > 1 needs to be selected properly
  • The score is not updated after the removal of
    each node
  • The time complexity of Algorithm 2 is
    O(log n + log m)
  • It may miss some large δ-biclusters
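A hedged sketch of multiple node deletion, since the algorithm slide has no transcript. In each pass, every row whose mean squared residue exceeds α·H(I,J) is removed at once, the scores are recomputed, and the same is done for columns. Two simplifications are assumptions of this sketch: the column step is skipped once the score is already at or below δ, and when a pass removes nothing, a single worst node is deleted in place of a full switch to Algorithm 1. All names are hypothetical.

```python
def residue_stats(matrix, rows, cols):
    """Return H(I,J) plus the mean squared residue of each row and column."""
    row_mean = {i: sum(matrix[i][j] for j in cols) / len(cols) for i in rows}
    col_mean = {j: sum(matrix[i][j] for i in rows) / len(rows) for j in cols}
    all_mean = sum(row_mean.values()) / len(rows)
    sq = {(i, j): (matrix[i][j] - row_mean[i] - col_mean[j] + all_mean) ** 2
          for i in rows for j in cols}
    h = sum(sq.values()) / (len(rows) * len(cols))
    d_row = {i: sum(sq[i, j] for j in cols) / len(cols) for i in rows}
    d_col = {j: sum(sq[i, j] for i in rows) / len(rows) for j in cols}
    return h, d_row, d_col


def multiple_node_deletion(matrix, delta, alpha=1.2):
    rows = list(range(len(matrix)))
    cols = list(range(len(matrix[0])))
    h, d_row, d_col = residue_stats(matrix, rows, cols)
    while h > delta and len(rows) > 1 and len(cols) > 1:
        before = (len(rows), len(cols))
        # Remove all rows above the threshold without updating H in between.
        rows = [i for i in rows if d_row[i] <= alpha * h]
        h, d_row, d_col = residue_stats(matrix, rows, cols)
        if h > delta:
            cols = [j for j in cols if d_col[j] <= alpha * h]
            h, d_row, d_col = residue_stats(matrix, rows, cols)
        if (len(rows), len(cols)) == before:
            # Nothing removed: fall back to deleting the single worst node.
            worst_row = max(rows, key=d_row.get)
            worst_col = max(cols, key=d_col.get)
            if d_row[worst_row] >= d_col[worst_col]:
                rows.remove(worst_row)
            else:
                cols.remove(worst_col)
            h, d_row, d_col = residue_stats(matrix, rows, cols)
    return rows, cols, h


matrix = [
    [1, 2, 3, 4],
    [2, 3, 4, 5],
    [4, 5, 6, 7],
    [9, 0, 7, 1],
]
rows, cols, h = multiple_node_deletion(matrix, delta=0.1, alpha=1.2)
print(rows, cols)  # [0, 1, 2] [0, 1, 2, 3]
```

Since the average of the row scores equals H(I,J), at least one row always passes the α·H test, so a pass can never empty the bicluster.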

23
(No Transcript)
24
Comments on Algorithm 3
  • The time complexity is O(mn)
  • The resulting δ-bicluster may still not be
    maximal, for two reasons
  • (1) Lemma 3 only gives a sufficient
    condition for adding rows and columns
  • (2) By adding rows and columns, the score
    may decrease to the point where it is much smaller than δ
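A sketch of the node addition idea, under stated assumptions: a column (or row) outside the bicluster is added when its mean squared residue against the current bicluster is at most H(I,J), and the process repeats until nothing more can be added. The step that adds inverted rows is omitted here, and the function names and toy data are hypothetical. Fraction is used so that the comparison with H is exact; with floats a small tolerance would be needed.

```python
from fractions import Fraction


def means_and_score(matrix, rows, cols):
    """Row means, column means, overall mean and H(I,J) of a bicluster."""
    row_mean = {i: sum(matrix[i][j] for j in cols) / len(cols) for i in rows}
    col_mean = {j: sum(matrix[i][j] for i in rows) / len(rows) for j in cols}
    all_mean = sum(row_mean.values()) / len(rows)
    h = sum((matrix[i][j] - row_mean[i] - col_mean[j] + all_mean) ** 2
            for i in rows for j in cols) / (len(rows) * len(cols))
    return row_mean, col_mean, all_mean, h


def node_addition(matrix, rows, cols):
    rows, cols = list(rows), list(cols)
    while True:
        row_mean, _, all_mean, h = means_and_score(matrix, rows, cols)
        new_cols = []
        for j in range(len(matrix[0])):
            if j in cols:
                continue
            cm = sum(matrix[i][j] for i in rows) / len(rows)
            d = sum((matrix[i][j] - row_mean[i] - cm + all_mean) ** 2
                    for i in rows) / len(rows)
            if d <= h:
                new_cols.append(j)
        cols += new_cols
        # Recompute before testing rows, since the column set may have grown.
        _, col_mean, all_mean, h = means_and_score(matrix, rows, cols)
        new_rows = []
        for i in range(len(matrix)):
            if i in rows:
                continue
            rm = sum(matrix[i][j] for j in cols) / len(cols)
            d = sum((matrix[i][j] - rm - col_mean[j] + all_mean) ** 2
                    for j in cols) / len(cols)
            if d <= h:
                new_rows.append(i)
        rows += new_rows
        if not new_cols and not new_rows:
            return rows, cols


raw = [
    [1, 2, 3, 4],
    [2, 3, 4, 5],
    [4, 5, 6, 7],
    [9, 0, 7, 1],
]
matrix = [[Fraction(x) for x in row] for row in raw]
rows, cols = node_addition(matrix, rows=[0, 1], cols=[0, 1])
print(rows, cols)  # [0, 1, 2] [0, 1, 2, 3]
```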

25
5. Experiment 5.1 Data preparation
  • Datasets and Parameters
  • (1) Yeast data, δ = 300, n = 100
  • (2) Human data, δ = 1200, n = 100
  • Missing data replacement
  • Missing data are replaced with random
    numbers drawn from a uniform distribution
  • The biclusters are compared to the clustering results
    from
  • (1) Tavazoie et al. (1999)
  • (2) Alizadeh et al. (2000)

26
5.2 Determining Algorithm Parameters
  • Parameters were chosen with reference to the clusters in
    the papers by Tavazoie et al. (1999) and Alizadeh et al.
    (2000)
  • For yeast data, δ = 300, α = 1.2
  • For human gene data, δ = 1200, α = 1.2
  • The number of biclusters is n = 100
  • Masking discovered biclusters: each time a
    bicluster is discovered, its elements are
    replaced by random numbers, because the
    algorithms are deterministic
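A minimal sketch of the masking step. Because the deletion and addition algorithms are deterministic, running them again on an unchanged matrix would return the same bicluster, so the cells of each discovered bicluster are overwritten with uniform random values before the next run. The function name mask_bicluster and the range [0, 800] are assumptions made for illustration.

```python
import random


def mask_bicluster(matrix, rows, cols, low, high, rng):
    """Overwrite the cells of a discovered bicluster with uniform noise."""
    for i in rows:
        for j in cols:
            matrix[i][j] = rng.uniform(low, high)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
data = [[float(10 * i + j) for j in range(4)] for i in range(3)]

# Pretend a bicluster was found on rows {0, 1} and columns {1, 2}.
mask_bicluster(data, rows=[0, 1], cols=[1, 2], low=0.0, high=800.0, rng=rng)

for row in data:
    print(row)
```

Only the four masked cells change; every other cell, including all of row 2, keeps its original value, so later runs can still find overlapping biclusters that use the unmasked cells.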

5.3 Final Algorithm
27
(No Transcript)
28-31
Biclusters for yeast data
32-33
Biclusters for human data