1
MINING THE GENE EXPRESSION MATRIX: INFERRING GENE
RELATIONSHIPS FROM LARGE SCALE GENE EXPRESSION
DATA
  • Patrik D'haeseleer, Xiling Wen, Stefanie Fuhrman,
    and Roland Somogyi
  • Information Processing in Cells and Tissues, pp.
    203-212, 1998
  • Presented by Bin He

2
Motivations
  • it is necessary to determine large-scale temporal
    gene expression patterns
  • to decipher the logic of gene regulation, we
    should aim to be able to monitor the expression
    level of all genes simultaneously

3
Gene time series
  • assay the expression levels of large numbers of
    genes in a tissue at different time points
  • Gene time series
  • the relative amounts of mRNA produced at these
    time points provide a gene expression time series
    for each gene

4
Gene Expression Matrix
  • Wen, X., Fuhrman, S., Michaels, G.S., Carr, D.B.,
    Smith, S., Barker, J.L., and Somogyi, R., 1997,
    Large-scale temporal gene expression mapping of
    CNS development, Proc. Natl. Acad. Sci., in press

5
Previous Approach
  • Euclidean distance and information theoretic
    measures to cluster the genes into related
    expression time series
  • A significant problem with this approach is the
    variety of measures that can be used
  • Each measure produces a unique clustering of gene
    expression patterns

6
Contributions
  • determining significant relationships between
    individual genes, based on
  • linear correlation
  • rank correlation
  • information theory

7
Linear correlation
------positive correlation
  • positive linear correlation

8
Linear correlation
------negative correlation
  • negative linear correlation
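
A minimal sketch of positive and negative linear correlation, assuming the usual Pearson coefficient r. The three series and the helper name pearson_r are made up for illustration and are not data from the paper. It also shows that a perfectly anti-correlated pair can sit far apart in Euclidean distance.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson linear correlation coefficient of two equal-length series."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float(np.dot(xc, yc) / np.sqrt(np.dot(xc, xc) * np.dot(yc, yc)))

# Made-up expression levels at five time points (illustration only).
gene_a = [0.1, 0.3, 0.5, 0.7, 0.9]
gene_b = [0.2, 0.4, 0.6, 0.8, 1.0]   # rises together with gene_a
gene_c = [0.9, 0.7, 0.5, 0.3, 0.1]   # falls as gene_a rises

r_ab = pearson_r(gene_a, gene_b)     # positive linear correlation
r_ac = pearson_r(gene_a, gene_c)     # negative linear correlation

# Euclidean distance between the anti-correlated pair.
d_ac = float(np.linalg.norm(np.array(gene_a) - np.array(gene_c)))

print(r_ab, r_ac, d_ac)
```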

9
Linear correlation
------restriction
  • for 112 different genes, 112 x 111 / 2 = 6216 pairs
    of expression time series need to be examined
  • to restrict the number of relationships, we might
    want to test which correlations are significantly
    larger than a certain value
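
The pair count can be checked with a short sketch. The expression matrix here is random placeholder data, and the number of time points (n_timepoints) and the cut-off (min_abs_r) are assumptions chosen only so the example runs.

```python
import numpy as np
from itertools import combinations

n_genes = 112
n_pairs = n_genes * (n_genes - 1) // 2   # unordered pairs of distinct genes

# Placeholder matrix: one row per gene, one column per time point.
# n_timepoints is an assumption, not a value stated on the slides.
n_timepoints = 9
rng = np.random.default_rng(0)
expr = rng.random((n_genes, n_timepoints))

# Row-wise Pearson correlations: corr[i, j] is r between gene i and gene j.
corr = np.corrcoef(expr)

# Enumerate each unordered pair once and keep those above a chosen cut-off.
pairs = list(combinations(range(n_genes), 2))
min_abs_r = 0.9   # hypothetical cut-off for illustration
strong = [(i, j) for i, j in pairs if abs(corr[i, j]) > min_abs_r]

print(n_pairs, len(pairs), len(strong))
```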

10
Linear correlation
------restriction
  • For instance, to find those relationships in
    which at least 50% of the variance is explained
    by the correlation, i.e. rho^2 > 0.5, we need
    r > 0.96 to reject at the 1% significance level
    the null hypothesis that rho < 0.7071
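
One common way to obtain such a critical value is the Fisher z-transformation. The slides do not say which test or how many time points were used, so the sketch below is an assumption on both counts: the function name critical_r, the one-sided test, and n_points=9 are illustrative, and the result need not reproduce the 0.96 quoted above.

```python
from math import atanh, sqrt, tanh
from statistics import NormalDist

def critical_r(rho0, n_points, alpha=0.01):
    """Smallest sample r that rejects H0: rho <= rho0 at level alpha
    (one-sided), using the Fisher z approximation."""
    if n_points <= 3:
        raise ValueError("the Fisher z approximation needs more than 3 points")
    z0 = atanh(rho0)                            # Fisher transform of rho0
    se = 1.0 / sqrt(n_points - 3)               # approximate standard error of z
    z_crit = NormalDist().inv_cdf(1.0 - alpha)  # one-sided normal quantile
    return tanh(z0 + z_crit * se)               # back-transform to r

rho0 = sqrt(0.5)                         # rho^2 = 0.5  ->  rho = 0.7071...
r_needed = critical_r(rho0, n_points=9)  # n_points=9 is an assumed value
print(round(rho0, 4), r_needed)
```

With few time points the critical r sits well above rho0, which is why only very high sample correlations pass the test.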

11
Linear correlation
------visualization
  • residual variance based distance measurement
  • d = 1 - r^2
  • d = 0 if perfectly correlated, d = 1 if uncorrelated
  • multidimensional scaling
  • map time series into a two-dimensional plane
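
A minimal sketch of this pipeline, assuming classical (Torgerson) multidimensional scaling; the slides do not say which MDS variant was used. The input matrix is random placeholder data, and the function names are hypothetical.

```python
import numpy as np

def correlation_distance(expr):
    """d = 1 - r^2 for every pair of rows (time series) in expr."""
    r = np.corrcoef(expr)
    return 1.0 - r ** 2

def classical_mds(dist, n_dims=2):
    """Classical MDS: embed points so that Euclidean distances
    approximate the entries of dist."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering   # double centering
    eigvals, eigvecs = np.linalg.eigh(gram)             # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_dims]          # largest first
    # 1 - r^2 is not guaranteed to be Euclidean, so clip negative eigenvalues.
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale

# Placeholder data: 34 series of 9 points each (both sizes are assumptions).
rng = np.random.default_rng(1)
expr = rng.random((34, 9))
dist = correlation_distance(expr)
coords = classical_mds(dist)       # one (x, y) position per time series
print(coords.shape)
```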

12
Linear correlation
------visualization
  • Multidimensional scaling of 34 time series with
    high correlation

13
Nonlinear correlation
------Model
  • Spearman rank correlation, r_s
  • measurement for monotonic relationships
  • can be used for non-Gaussian distributions
  • 491 pairs of expression time series, involving 98
    genes, which have a significant r_s, ranging from
    -0.979 to 0.996

14
Nonlinear correlation ------Example
  • High rank correlation but low linear correlation
    between mGluR1 and GRa2
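
The gap between rank and linear correlation can be sketched on made-up data. The series below are not the mGluR1 or GRa2 measurements; they are an exponential curve chosen because it is monotonic but far from linear. The helper names are hypothetical.

```python
import numpy as np

def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks

def spearman_rs(x, y):
    """Spearman r_s: Pearson correlation of the ranks."""
    return float(np.corrcoef(average_ranks(x), average_ranks(y))[0, 1])

def pearson_r(x, y):
    return float(np.corrcoef(x, y)[0, 1])

# Made-up monotonic, strongly non-linear relationship (illustration only).
x = np.arange(1.0, 10.0)
y = np.exp(x)

rs = spearman_rs(x, y)   # ranks agree exactly
r = pearson_r(x, y)      # linear fit is much poorer
print(rs, r)
```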

15
Information Theory
------mutual information
  • if H(A) and H(B) are the entropies of sources A
    and B respectively, and H(A,B) the joint entropy
    of the sources, then M(A,B) = H(A) + H(B) -
    H(A,B)
  • discrete form is much easier to use
  • We need to discretize the time series by
    partitioning the expression levels into bins

16
Information Theory
------Bin size
  • The fewer bins we use to discretize the data, the
    more information about the original time series
    we ignore.
  • On the other hand, too fine a binning will leave
    us with too few points per bin to get a
    reasonable estimate of the frequency of each bin
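
The trade-off can be seen by counting points per bin for a short series. The nine values below are made up, and equal-width binning via np.histogram is an assumption.

```python
import numpy as np

def bin_counts(series, n_bins):
    """Number of points falling into each of n_bins equal-width bins."""
    counts, _ = np.histogram(series, bins=n_bins)
    return counts

# Made-up series of nine expression levels (illustration only).
series = np.array([0.05, 0.10, 0.20, 0.45, 0.50, 0.55, 0.80, 0.90, 1.00])

for n_bins in (2, 3, 9):
    counts = bin_counts(series, n_bins)
    # Few bins: many points per bin, but coarse. Many bins: sparse counts.
    print(n_bins, counts.tolist())
```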

17
Information Theory
------Mapping
  • Some time series map to the same discretized
    series
  • In total, from 112 unique continuous-valued time
    series we get 91 discretized time series

18
Information Theory
------Mapping

19
Information Theory
------Mapping
  • eliminate time series that map one-to-one onto
    each other under a permutation of the bin numbers
  • for such series, H(A) = H(B) = M(A,B)
  • row 3 and row 4
  • replace such time series by one single series,
    leaving us with a set of 77 unique,
    non-equivalent time series.
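
One way to find series that are equivalent up to a permutation of bin numbers is to relabel each series in order of first appearance and compare the results. This is a sketch of that idea, not necessarily the authors' procedure; the function names and the example series are hypothetical.

```python
def canonical_form(labels):
    """Relabel bins in order of first appearance,
    e.g. (2, 2, 0, 1) -> (0, 0, 1, 2)."""
    mapping = {}
    canon = []
    for label in labels:
        if label not in mapping:
            mapping[label] = len(mapping)
        canon.append(mapping[label])
    return tuple(canon)

def unique_up_to_relabeling(series_list):
    """Keep the first series from each group that shares a canonical form."""
    first_index = {}
    for idx, labels in enumerate(series_list):
        first_index.setdefault(canonical_form(labels), idx)
    return [series_list[i] for i in sorted(first_index.values())]

# Made-up discretized series (illustration only).
discretized = [
    (0, 0, 1, 2, 2),
    (2, 2, 1, 0, 0),   # first series with bins 0 and 2 swapped
    (0, 1, 1, 2, 0),
    (0, 0, 1, 2, 2),   # exact duplicate of the first series
]
kept = unique_up_to_relabeling(discretized)
print(kept)
```

Two series share a canonical form exactly when a one-to-one relabeling of bins turns one into the other, which is the case where H(A) = H(B) = M(A,B).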

20
Information Theory
------Measurement
  • symmetric measures
  • M(A,B)/max(H(A),H(B))
  • M(A,B)/H(A,B)
  • asymmetric measures
  • Relative mutual information
  • R(A,B) = M(A,B)/H(B)
  • R(A,B) = 1.0 means that all the information
    about time series B is contained in time series A
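
The measures above can be sketched on two made-up discretized series in which B is a function of A but not the other way round. The function names are hypothetical, and entropies are computed from observed frequencies in bits.

```python
from collections import Counter
from math import log2

def entropy(symbols):
    """Shannon entropy in bits of a sequence of hashable symbols."""
    n = len(symbols)
    return -sum((c / n) * log2(c / n) for c in Counter(symbols).values())

def mutual_information(a, b):
    """M(A,B) = H(A) + H(B) - H(A,B)."""
    return entropy(a) + entropy(b) - entropy(list(zip(a, b)))

def relative_mi(a, b):
    """Asymmetric measure R(A,B) = M(A,B) / H(B).
    Undefined when H(B) = 0 (a constant series B)."""
    return mutual_information(a, b) / entropy(b)

# Made-up discretized series: each bin of a fixes the bin of b.
a = [0, 0, 1, 1, 2, 2]
b = [0, 0, 0, 0, 1, 1]

m = mutual_information(a, b)
sym_max = m / max(entropy(a), entropy(b))    # M(A,B)/max(H(A),H(B))
sym_joint = m / entropy(list(zip(a, b)))     # M(A,B)/H(A,B)
r_ab = relative_mi(a, b)   # all information about b is contained in a
r_ba = relative_mi(b, a)   # b does not determine a
print(sym_max, sym_joint, r_ab, r_ba)
```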

21
Conclusion
  • Linear correlation can be used very effectively
    to detect linear relationships
  • detect relationships not captured by Euclidean
    distance, such as high negative correlations
  • Rank correlation can be used to detect non-linear
    relationships
  • much more robust with respect to the distribution
    of expression levels
  • Information theory can be used to detect genes
    whose (binned) expression patterns share
    information
  • It will detect any mapping from time series A to
    B