Improved Tumor Marker Validation Success Using Weighted Gene Co-expression Networks and Random Forest Clustering

1
Improved Tumor Marker Validation Success Using
Weighted Gene Co-expression Networks and
Random Forest Clustering
  • Steve Horvath
  • shorvath_at_mednet.ucla.edu
  • Human Genetics Biostatistics
  • University of California, Los Angeles

2
Contents
  • Describe pathway based tumor marker screening
    strategy
  • Speculate on the biological reasons why it could
    work.
  • Describe 2 empirical success stories for
    identifying tumor markers that validated in
    independent data sets
    • Brain cancer survival time
      • (Affy) gene expression microarray data
      • weighted gene co-expression networks
    • Prostate cancer time to PSA recurrence
      • tissue microarray data (immunohistochemical
        stainings)
      • random forest clustering

3
The Embarrassing Validation Problem
  • A tumor marker is found to be highly predictive
    of a clinical outcome in one data set but fails
    to be validated in an independent data set.
  • Bad (analysis) reasons include
    • data snooping
    • overfitting
    • ascertainment issues
  • Good (biological) reasons
    • Genetic heterogeneity. Little can be done
      about this.
    • Single markers don't capture the essence of
      the whole disease pathway. A lot can be done
      about this: novel statistical methods for
      extracting signal from the data.

4
Outline of standard strategy for screening for
markers
  • 1) Regress a clinical outcome y on the molecular
    markers (features) X.
  • 2) Identify the features that are most
    significant or most predictive of the outcome
    using standard statistical feature selection
    methods
  • Empirical finding: often poor validation success.

5
Pathway Based Strategy for Screening for
Markers
  • Find suitably defined clusters in the underlying
    high dimensional feature space X.
  • Relate the clusters to clinical outcomes of
    interest. This results in a few disease
    clusters (a.k.a. pathways or modules)
  • Use features (markers) that describe the states
    of the disease clusters as final predictors.
  • (Limited) empirical finding: improved validation
    success

6
Motivating why the pathway based screening
strategy may lead to better validation success
  • By first clustering the features, one reduces the
    number of multiple comparisons substantially
  • By looking at aggregates of features (clusters)
    the feature definition is much more robust and
    more likely to be platform independent.
  • Combining the features along pathways is the
    biologically meaningful thing to do.
  • Pathways are closer to the clinical phenotype
    than the individual constituents of these
    pathways.
  • The whole is more than the sum of its parts

7
Teaser: Validation success rate of gene
expressions in independent data
[Bar chart]
  • 300 most significant genes
    (Cox p-value < 1.3 × 10^-3): 26%
  • Network based screening (p < 0.05 and
    high intramodular connectivity): 67%
8
Weighted Gene Co-Expression Network Analysis.
9
  • Novel statistical approach for analyzing
    microarray data: weighted network analysis
  • Empirical evidence that it matters in practice
  • Identification of Brain Cancer Genes that can be
    validated in an independent data set

10
Background
  • Network based methods have been found useful in
    many domains:
    • protein interaction networks
    • the world wide web
    • social interaction networks
  • Our focus: gene co-expression networks

11
Does this map tell you which cities are important?
This one does!
The nodes with the largest number of links
(connections) are most important!
Slide courtesy of Paul Mischel and AL Barabasi
12
Scale free topology is a fundamental property of
such networks (Barabasi et al)
  • It entails the presence of hub nodes that are
    connected to a large number of other nodes
  • Such networks are robust with respect to the
    random deletion of nodes but are sensitive to the
    targeted attack on hub nodes
  • It has been demonstrated that metabolic networks
    exhibit a scale free topology

13
P(k) vs k in scale free networks
  • Scale free topology refers to the frequency
    distribution of the connectivities k
  • p(k) = proportion of nodes that have
    connectivity k

14
How to check Scale Free Topology?
Idea: log-transform p(k) and k and look at the
scatter plot.
The linear regression model fitting index R^2 can
be used to quantify goodness of fit.
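
The log-log check described on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' R code: the function name `scale_free_fit` and the toy connectivity list are hypothetical, every distinct connectivity value is treated as its own bin (real analyses often bin k), and the plain rather than the adjusted R^2 is returned.

```python
import math
from collections import Counter

def scale_free_fit(connectivities):
    """Fit log10 p(k) against log10 k by least squares.

    Returns (slope, r_squared). Nodes with k = 0 are skipped because
    log10(0) is undefined.
    """
    counts = Counter(k for k in connectivities if k > 0)
    n = sum(counts.values())
    xs = [math.log10(k) for k in counts]                # log10 k
    ys = [math.log10(c / n) for c in counts.values()]   # log10 p(k)
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, r_squared

# Toy network whose counts follow 64 * k^-2 exactly (hypothetical data).
degrees = [1] * 64 + [2] * 16 + [4] * 4 + [8]
slope, r2 = scale_free_fit(degrees)
print(round(slope, 3), round(r2, 3))
```

A strongly negative slope together with an R^2 close to 1 indicates approximate scale free topology.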
15
Gene Co-expression Networks
  • In gene co-expression networks, each gene
    corresponds to a node.
  • Two genes are connected by an edge if their
    expression values are highly correlated.
  • The definition of "high correlation" is somewhat
    tricky
  • We propose a criterion for picking the threshold
    parameter.

16
Steps for constructing a simple, unweighted
co-expression network
Overview: gene co-expression network analysis
  • A) Microarray gene expression data
  • B) Measure concordance of gene expression with a
    Pearson correlation
  • C) The Pearson correlation matrix is dichotomized
    to arrive at an adjacency matrix. Binary values
    in the adjacency matrix correspond to an
    unweighted network.
  • D) The adjacency matrix can be visualized by a
    graph.
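
Steps A) to C) can be sketched as follows. This is a minimal pure-Python illustration with hypothetical names and toy data. It assumes that the absolute value of the Pearson correlation is thresholded (as suggested later in the deck), that the cut-off tau = 0.8 is arbitrary, and that a gene is not connected to itself.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equally long expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def unweighted_adjacency(expr, tau):
    """Dichotomize |correlation| at tau; expr holds one profile per gene."""
    n = len(expr)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if abs(pearson(expr[i], expr[j])) > tau:
                adj[i][j] = adj[j][i] = 1   # symmetric, binary edge
    return adj

# Four hypothetical genes measured on four samples.
expr = [
    [1, 2, 3, 4],
    [2, 4, 6, 8],     # perfectly correlated with gene 0
    [4, 3, 2, 1],     # perfectly anti-correlated with gene 0
    [1, -1, 1, -1],   # only weakly correlated with the others
]
adj = unweighted_adjacency(expr, tau=0.8)
for row in adj:
    print(row)
```

Genes 0, 1 and 2 end up mutually connected, while gene 3 stays isolated because its absolute correlation with every other gene is about 0.45.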

17
Our holistic view
  • Weighted network view: all genes are connected;
    connection widths = connection strengths
  • Unweighted view: some genes are connected; all
    connections are equal

We find theoretical and empirical evidence that
the weighted network view is superior to the
simple network view.
18
A general framework for defining weighted gene
co-expression networks
Bin Zhang, Steve Horvath
Technical report and R code at
www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork
19
Beyond the standard approach
  • Dichotomization allows one to easily define
    network-based concepts but it eliminates some
    information regarding the strength of
    interaction.
  • To overcome the disadvantage of the
    dichotomization, we generalize the approach
  • Measure co-expression by a similarity s(i,j) with
    range [0,1], e.g. the absolute value of the
    Pearson correlation
  • Define an adjacency matrix A(i,j) = AF(s(i,j))
  • The adjacency function AF is a monotonic,
    non-negative function defined on [0,1] and
    depends on parameters. The choice of the
    parameters determines the properties of the
    network.
  • We consider 2 types of AFs
    • Step function: AF(s) = I(s > tau) with
      parameter tau
    • Power function: AF(s) = s^b with parameter b
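
The two adjacency functions can be compared with a small sketch. The function names, the parameter values tau = 0.8 and b = 6, and the zero diagonal are assumptions made for illustration. The point is that the step function treats similarities of 0.79 and 0.81 completely differently, while the power function keeps them close together.

```python
def step_adjacency(s, tau):
    """Hard thresholding: AF(s) = I(s > tau)."""
    return 1 if s > tau else 0

def power_adjacency(s, b):
    """Soft thresholding: AF(s) = s^b."""
    return s ** b

def adjacency_matrix(similarity, af, param):
    """Apply an adjacency function entry-wise; the diagonal is set to 0."""
    n = len(similarity)
    return [[0.0 if i == j else af(similarity[i][j], param)
             for j in range(n)] for i in range(n)]

# Two nearly identical similarities on either side of tau = 0.8.
hard = [step_adjacency(s, 0.8) for s in (0.79, 0.81)]
soft = [power_adjacency(s, 6) for s in (0.79, 0.81)]
print(hard)
print([round(v, 3) for v in soft])

sim = [[1.0, 0.5], [0.5, 1.0]]
weighted = adjacency_matrix(sim, power_adjacency, 2)
```

With the step function, a small change in the similarity flips the edge on or off; with the power function the connection strength changes only slightly.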

20
Comparing adjacency functions
21
How to estimate the parameter values of an
adjacency function?
  • We propose to use the following criteria
  • A) CONSIDER ONLY THOSE PARAMETER VALUES THAT
    RESULT IN APPROXIMATE SCALE FREE TOPOLOGY
  • B) SELECT THE PARAMETERS THAT RESULT IN THE
    HIGHEST MEAN NUMBER OF CONNECTIONS
  • Criterion A is motivated by the finding that most
    metabolic networks (including gene co-expression
    networks, protein-protein interaction networks
    and cellular networks) have been found to exhibit
    a scale free topology
  • Criterion B is motivated by our desire to have
    high sensitivity to detect modules (clusters of
    genes) and hub genes.

22
Criterion A is measured by the linear model
fitting index R^2
[Figure panels: step AF (x-axis tau); power AF
(x-axis b)]
23
Trade-off between criterion A (R^2) and criterion
B (mean no. of connections) when varying the
power b
AF(s) = s^b
[Figure panels: criterion A, SFT model fit R^2;
criterion B, mean connectivity]
24
Empirical insights for determining the adjacency
function
  • For criterion A: measure compliance with scale
    free topology by using the adjusted R^2 value for
    the linear regression fit between log(p(k)) and
    log(k)
  • Usually require R^2 > 0.8
  • For criterion B: aim to get a mean(k) = 50 when
    dealing with 2000 genes.

25
Trade-off between criterion A and B when varying
tau
Step function: I(s > tau)
[Figure panels: criterion A; criterion B]
26
Mathematical Definition of an Undirected Network
27
Network = Adjacency Matrix
  • A network can be represented by an adjacency
    matrix, A = [a_ij], that encodes whether/how a
    pair of nodes is connected.
  • A is a symmetric matrix with entries in [0,1].
  • For unweighted networks, entries are 1 or 0
    depending on whether or not 2 nodes are adjacent
    (connected).
  • For weighted networks, the adjacency matrix
    reports the connection strength between gene
    pairs.

28
Generalized Connectivity
  • Gene connectivity corresponds to the row sums of
    the adjacency matrix
  • For unweighted networks: number of direct
    neighbors
  • For weighted networks: sum of connection
    strengths to other nodes
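
As a small illustration, the same row-sum function covers both cases. The function name and the toy matrices are hypothetical, and the diagonal is skipped on the assumption that a node's connection to itself should not count.

```python
def connectivity(adj):
    """Row sums of the adjacency matrix, ignoring the diagonal."""
    return [sum(v for j, v in enumerate(row) if j != i)
            for i, row in enumerate(adj)]

# Unweighted network: connectivity = number of direct neighbors.
unweighted = [[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]]
k_unweighted = connectivity(unweighted)

# Weighted network: connectivity = sum of connection strengths.
weighted = [[0.0, 0.5, 0.2],
            [0.5, 0.0, 0.1],
            [0.2, 0.1, 0.0]]
k_weighted = connectivity(weighted)

print(k_unweighted)
print([round(k, 2) for k in k_weighted])
```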

29
Network Analysis Flow Chart
30
Define a Gene Co-expression Similarity
Define a Family of Adjacency Functions
Determine the AF Parameters
Define a Measure of Node Dissimilarity
  Identify Network Modules (Clustering)
Relate Network Concepts to Each Other
Relate the Network Concepts to External Gene or
Sample Information
31
Network Distance Measure: Topological Overlap
Matrix
32
How to measure distance in a network?
  • Mathematical answer: geodesics
    • length of the shortest path connecting 2 nodes
    • we have found no empirical evidence that this
      is a biologically meaningful concept in
      co-expression networks
  • Biological answer: look at shared neighbors
    • Intuition: if 2 people share the same friends
      they are close in a social network
    • Use the topological overlap measure based
      distance proposed by Ravasz et al (2002,
      Science)

33
Topological Overlap (Ravasz et al) leads to a
network distance measure
  • Generalized in Zhang and Horvath (2005) to the
    case of weighted networks
  • Generalized in Yip and Horvath (2005) to higher
    order interactions
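
The slide's formula did not survive in the transcript. The sketch below uses the weighted topological overlap as it is commonly stated for Zhang and Horvath (2005): w_ij = (l_ij + a_ij) / (min(k_i, k_j) + 1 - a_ij), where l_ij is the sum over u of a_iu * a_uj and k_i is the connectivity of node i. The function name and the toy network are hypothetical, and 1 - w_ij is used as the distance.

```python
def topological_overlap(adj):
    """Topological overlap matrix for a symmetric adjacency in [0, 1]."""
    n = len(adj)
    k = [sum(adj[i][u] for u in range(n) if u != i) for i in range(n)]
    tom = [[1.0] * n for _ in range(n)]   # a node fully overlaps itself
    for i in range(n):
        for j in range(i + 1, n):
            # Connection strength routed through shared neighbors u.
            shared = sum(adj[i][u] * adj[u][j]
                         for u in range(n) if u not in (i, j))
            w = (shared + adj[i][j]) / (min(k[i], k[j]) + 1 - adj[i][j])
            tom[i][j] = tom[j][i] = w
    return tom

# Toy unweighted network with edges 0-1, 0-2, 1-2 and 2-3.
adj = [[0, 1, 1, 0],
       [1, 0, 1, 0],
       [1, 1, 0, 1],
       [0, 0, 1, 0]]
tom = topological_overlap(adj)
dist = [[1.0 - w for w in row] for row in tom]
print(tom[0][1], tom[0][3])
```

Nodes 0 and 1 are directly connected and share neighbor 2, so their overlap is 1. Nodes 0 and 3 are not directly connected but share neighbor 2, which gives an overlap of 0.5.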

34
Using the TOM matrix to cluster genes
  • To group nodes with high topological overlap into
    modules (clusters), we typically use average
    linkage hierarchical clustering coupled with the
    TOM distance measure.
  • Once a dendrogram is obtained from a hierarchical
    clustering method, we choose a height cutoff to
    arrive at a clustering.
  • Here modules correspond to branches of the
    dendrogram

[Figure: TOM plot. Genes correspond to the rows
and columns of the TOM matrix; modules correspond
to branches of the hierarchical clustering
dendrogram.]
35
More traditional view of module
Columns = brain tissue samples
Rows = genes; the color band indicates module
membership
Message: characteristic vertical bands indicate
tight co-expression of module genes
36
Different Ways of Depicting Gene Modules
  • Topological overlap plot: 1) rows and columns
    correspond to genes, 2) red boxes along the
    diagonal are modules, 3) color bands = modules
  • Gene functions
  • Multidimensional scaling (we proposed). Idea:
    use the network distance in MDS
  • Traditional view
37
Hub Genes Predict Survival for Brain Cancer
Patients
Mischel PS, Zhang B, et al, Horvath S, Nelson SF.
38
Comparing the Module Structure in Cancer and
Normal tissues
[Figure panels: 55 brain tumors; validation data,
65 brain tumors; normal brain (adult, fetal);
normal non-CNS tissues]
Messages:
  1) Cancer modules can be independently validated.
  2) Modules in brain cancer tissue can also be found
     in normal, non-brain tissue.
--> Insights into the biology of cancer
39
Mean Prognostic Significance of Module Genes
Message: focus attention on the brown module
genes
40
Module hub genes predict cancer survival
  1. Cox model to regress survival on gene expression
    levels
  2. Defined prognostic significance as
    -log10(Cox p-value), a measure of the association
    between each gene's expression and glioblastoma
    patient survival
  3. A module-based measure of gene connectivity
    significantly and reproducibly identifies the
    genes that most strongly predict patient survival

Validation set (65 GBMs): r = 0.55, p < 2.2 x 10^-16
Test set (55 GBMs): r = 0.56, p < 2.2 x 10^-16
41
The fact that genes with high intramodular
connectivity are more likely to be prognostically
significant facilitates a novel screening
strategy for finding prognostic genes
  • Focus on those genes with significant Cox
    regression p-value AND high intramodular
    connectivity.
  • It is essential to take a module centric view:
    focus on the intramodular connectivity of the
    disease related module.
  • Validation success rate: proportion of genes with
    independent test set Cox regression p-value < 0.05.
  • Validation success rate of network based
    screening approach: 68%
  • Standard approach involving top 300 most
    significant genes: 26%
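
The screening rule and the validation success rate can be sketched as follows. Everything in the example is hypothetical: the gene names, the p-values, the connectivities and the connectivity cut-off of 20 are invented for illustration, and the p-values would come from Cox regressions that are not shown here.

```python
def network_screen(cox_p, intramodular_k, alpha=0.05, k_cutoff=20.0):
    """Keep genes that are significant AND highly connected in the module."""
    return [g for g in cox_p
            if cox_p[g] < alpha and intramodular_k[g] >= k_cutoff]

def validation_success_rate(genes, validation_p, alpha=0.05):
    """Proportion of screened genes with validation-set p-value < alpha."""
    if not genes:
        raise ValueError("no genes were selected")
    return sum(validation_p[g] < alpha for g in genes) / len(genes)

# Hypothetical training-set results for four genes.
training_p = {"g1": 0.001, "g2": 0.01, "g3": 0.04, "g4": 0.2}
module_k = {"g1": 30.0, "g2": 5.0, "g3": 25.0, "g4": 40.0}
# Hypothetical Cox p-values for the same genes in independent data.
validation_p = {"g1": 0.02, "g2": 0.6, "g3": 0.3, "g4": 0.5}

selected = network_screen(training_p, module_k)
rate = validation_success_rate(selected, validation_p)
print(selected, rate)
```

Gene g2 is significant but poorly connected, and g4 is well connected but not significant, so both are dropped; one of the two remaining genes validates.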

42
Validation success rate of gene expressions in
independent data
[Bar chart]
  • 300 most significant genes
    (Cox p-value < 1.3 × 10^-3): 26%
  • Network based screening (p < 0.05 and
    high intramodular connectivity): 67%
43
New Application: Tissue Microarray Data
44
Tissue Microarray; DNA Microarray
45
Tissue Array Section
700 Tissue Samples
0.6 mm 0.2mm
46
Ki-67 Expression in Kidney Cancer
High Grade
Low Grade
Message: brown staining is related to tumor grade
47
Multiple measurements per patient: several spots
per tumor sample and several scores per spot
  • Each patient (tumor sample) is usually
    represented by multiple spots
    • 3 tumor spots
    • 1 matched normal spot
  • Several scores per spot
    • Maximum intensity (Max)
    • Percent of cells staining (Pos)
    • Percent of cells staining with the
      maximum intensity (PosMax)
  • Spots have a spot grade: NL, 1, 2, ...
  • Indicator of missingness
48
Properties of TMA Data
  • Highly skewed, non-normal, semi-continuous.
  • Often a good idea to model as ordinal variables
    with many levels.
  • Staining scores of the same markers are highly
    correlated

49
Histogram of tumor marker expression scores POS
and MAX
[Figure: histograms of percent of cells staining
(POS) and maximum intensity (MAX) for the markers
EpCam, P53 and CA9]
50
Frequency plot of the same tumor marker in 2
independent data sets
[Figure panels: Data Set 1; Validation Data Set 2]
The cut-off corresponds roughly to the 66th
percentile. Thresholding this tumor marker allows
one to stratify the cancer patients into high
risk and low risk patients. Although the
distributions look very different, the percentile
threshold can be validated and is clinically
relevant.
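
A percentile-based cut-off can be sketched like this. The function names and the staining scores are hypothetical, and the nearest-rank definition of a percentile is one choice among several. The example shows why a percentile transfers between data sets more easily than a raw cut-off: the raw thresholds differ, but the share of patients above the threshold is the same.

```python
import math

def percentile_cutoff(scores, pct):
    """Nearest-rank percentile of a list of marker scores."""
    ordered = sorted(scores)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def dichotomize(scores, cutoff):
    """1 if the score lies above the cut-off, else 0."""
    return [1 if s > cutoff else 0 for s in scores]

# Two hypothetical data sets with very different score distributions.
data_set_1 = [0, 0, 10, 20, 30, 40, 50, 60, 90]
data_set_2 = [0, 5, 5, 5, 10, 10, 15, 20, 25]

cut_1 = percentile_cutoff(data_set_1, 66)
cut_2 = percentile_cutoff(data_set_2, 66)
high_1 = dichotomize(data_set_1, cut_1)
high_2 = dichotomize(data_set_2, cut_2)
print(cut_1, cut_2, sum(high_1), sum(high_2))
```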
51
Thresholding methods for tumor marker expressions
  • Since clinicians and pathologists prefer
    thresholding tumor marker expressions, it is
    natural to use statistical methods that are based
    on thresholding covariates, e.g. regression
    trees, survival trees, rpart, forest predictors
    etc.
  • Dichotomized marker expressions are often fitted
    in a Cox (or alternative) regression model
  • Danger: overfitting due to optimal cut-off
    selection.
  • Several thresholding methods and ways of
    adjusting for multiple comparisons are reviewed
    in Liu X, Minin V, Huang Y, Seligson DB,
    Horvath S (2004) Statistical Methods for
    Analyzing Tissue Microarray Data. J of
    Biopharmaceutical Statistics 14(3):671-685

52
Finding tumor markers for predicting clinical
outcomes on the basis of Tissue Microarray Data
53
Using the clustering based strategy for finding
tumor markers
  • 1) Find distinct patient clusters without regard
    to outcome
  • 2) Find whether patient clusters have distinct
    PSA recurrence profiles
  • 3) If so, find rules (classifiers) for predicting
    cluster membership
  • 4) Validate those rules in independent data.

55
Cluster Analysis of Low Gleason Score Prostate
Samples (UCLA data)
56
1) Construct a tumor marker rule for predicting
RF cluster membership.
2) Validate the rule predictions in an independent
data set.
[Figure panels: Threshold Rule; Validation]
57
Discussion: Prostate TMA Data
  • Very weak evidence that individual markers
    predict PSA recurrence
  • None of the markers validated individually
  • However, cluster membership was highly
    predictive, i.e., the rule could be validated in
    an independent data set.

58
How to cluster patients on the basis of Tissue
Microarray Data?
59
Questions
1) Can TMA data be used for tumor class
discovery, i.e., unsupervised learning?
2) If so, what are suitable unsupervised learning
methods?
60
Tumor Class Discovery using DNA Microarray Data
  • Tumor class discovery entails using an
    unsupervised learning algorithm (e.g.
    hierarchical, k-means, SOM clustering etc.) to
    automatically group tumor samples based on their
    gene expression pattern.

Bullinger et al. N Engl J Med. 2004
61
Clusters involving TMA data may have
unconventional shapes
Low risk prostate cancer patients are colored in
black.
  • Scatter plot involving 2 dependent tumor
    markers. The remaining, less dependent markers
    are not shown.
  • The low risk cluster can be described using the
    following rule:
    Marker H3K4 > 45 and H3K18 > 70.
  • The intuition is quite different from that of
    Euclidean distance based clusters.
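
Written as code, the rule is a rectangular region in marker space rather than a ball around a centroid. The cut-offs 45 and 70 come from the slide; the function name and the example patients are hypothetical.

```python
def is_low_risk(h3k4, h3k18, cut_h3k4=45, cut_h3k18=70):
    """Thresholding rule: both marker scores must exceed their cut-offs."""
    return h3k4 > cut_h3k4 and h3k18 > cut_h3k18

# Hypothetical patients as (H3K4 score, H3K18 score) pairs.
patients = [(60, 80), (60, 50), (45, 80), (90, 95)]
labels = [is_low_risk(h3k4, h3k18) for h3k4, h3k18 in patients]
print(labels)
```

The first and last patients lie far apart in Euclidean terms but fall into the same cluster, while the second patient is close to the first yet falls outside it.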

62
Unconventional shape of a clinically meaningful
patient cluster
  • 3 dimensional scatter plot along tumor markers
  • Low risk patients are colored in black

[Figure axes: MARKER 1, MARKER 2]
63
A dissimilarity measure is an essential input for
tumor class discovery
  • Dissimilarities between tumor samples are used in
    clustering and other unsupervised learning
    techniques
  • Commonly used dissimilarity measures include
    Euclidean distance, 1 - correlation

64
Challenge
  • Conventional dissimilarity measures that work for
    DNA microarray data may not be optimal for TMA
    data.
  • Dissimilarity measures that are based on the
    intuition of multivariate normal distributions
    (clusters have elliptical shapes) may not be
    optimal
  • For tumor marker data, one may want to use a
    different intuition: clusters are described using
    thresholding rules involving dependent markers.
  • It may be desirable to have a dissimilarity that
    is invariant under monotonic transformations of
    the tumor marker expressions.

65
We have found that a random forest (Breiman 2001)
dissimilarity can work well in the unsupervised
analysis of TMA data.
Shi et al 2004, Seligson et al 2005.
http://www.genetics.ucla.edu/labs/horvath/RFclustering/RFclustering.htm
66
Kidney cancer: Comparing PAM clusters that result
from using the RF dissimilarity vs the Euclidean
distance
Kaplan Meier plots for groups defined by cross
tabulating patients according to their RF and
Euclidean distance cluster memberships.
Message: In this application, RF clusters are
more meaningful regarding survival time
67
The RF dissimilarity is determined by dependent
tumor markers
[Figure axes: tumor markers; patients sorted by
cluster]
  • The RF dissimilarity focuses on the most
    dependent markers (1,2).
  • In some applications, it is good to focus on
    markers that are dependent since they may
    constitute a disease pathway.
  • The Euclidean distance focuses on the most
    varying marker (4)
68
The RF cluster can be described using a
thresholding rule involving the most dependent
markers
  • Low risk patient if marker1 > cut1 and
    marker2 > cut2
  • This kind of thresholding rule can be used to
    make predictions on independent data sets.
  • Validation on independent data set

69
Theoretical reasons for using an RF dissimilarity
for TMA data
  • Main reasons
    • natural way of weighing tumor marker
      contributions to the dissimilarity
    • The more related a tumor marker is to other
      tumor markers, the more it contributes to the
      definition of the dissimilarity
    • no need to transform the often highly skewed
      features
    • based on feature ranks
    • Chooses cut-off values automatically
    • resulting clusters can often be described using
      simple thresholding rules
  • Other reasons
    • elegant way to deal with missing covariates
    • intrinsic proximity matrix handles mixed
      variable types well
  • CAVEAT: The choice of the dissimilarity should be
    determined by the kind of patterns one hopes to
    find. There will be situations when other
    dissimilarities are preferable.

70
The random forest dissimilarity
L. Breiman: RF manual
Technical report: Shi and Horvath 2005
http://www.genetics.ucla.edu/labs/horvath/RFclustering/RFclustering.htm
71
Summary: Random forest clustering
  • Intrinsic variable selection focuses on dependent
    variables
  • Depending on the application, this can be
    attractive
  • Resulting clusters can often be described using
    thresholding rules, which is attractive for TMA
    data.
  • RF dissimilarity invariant to monotonic
    transformations of variables
  • In some cases, the RF dissimilarity can be
    approximated using a Euclidean distance of ranked
    and scaled features.
  • RF clustering was originally suggested by L.
    Breiman (RF manual). Theoretical properties are
    studied as part of the dissertation work of Tao
    Shi. Technical report/code can be found at
    www.genetics.ucla.edu/labs/horvath/RFclustering/RFclustering.htm
    www.genetics.ucla.edu/labs/horvath/kidneypaper/RCC.htm
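
The invariance property can be illustrated with the rank-based Euclidean distance mentioned above. Note that this sketch does not compute the RF dissimilarity itself, which requires fitting a random forest. The function names, the toy scores and the choice of scaling (average ranks mapped to [0, 1]) are assumptions.

```python
import math

def scaled_ranks(values):
    """Average ranks (ties share their mean rank), rescaled to [0, 1]."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    pos = 0
    while pos < n:
        end = pos
        while end + 1 < n and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for t in range(pos, end + 1):
            ranks[order[t]] = mean_rank
        pos = end + 1
    return [(r - 1) / (n - 1) for r in ranks]

def rank_distance_matrix(markers):
    """Euclidean distance between patients on ranked, scaled markers.

    markers holds one list of patient scores per tumor marker.
    """
    ranked = [scaled_ranks(col) for col in markers]
    n = len(markers[0])
    return [[math.sqrt(sum((col[i] - col[j]) ** 2 for col in ranked))
             for j in range(n)] for i in range(n)]

# Two hypothetical markers scored on four patients.
raw = [[0, 5, 20, 80], [10, 40, 30, 90]]
# Monotonic transformations: log(1 + x) and x squared.
transformed = [[math.log1p(v) for v in raw[0]], [v ** 2 for v in raw[1]]]

d_raw = rank_distance_matrix(raw)
d_transformed = rank_distance_matrix(transformed)
print(d_raw == d_transformed)
```

Because monotonic transformations leave the ranks unchanged, both distance matrices are identical, which is convenient for highly skewed staining scores.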

72
Conclusions
  • There is a need to identify/develop appropriate
    data mining methods for TMA data
  • highly skewed, semi-continuous, non-normal data
  • tree or forest based methods work well
  • ALTERNATIVES?

73
Acknowledgements
  • Former students & postdocs for TMA
  • Tao Shi PhD
  • Xueli Liu PhD
  • Yunda Huang PhD
  • Tuyen Hoang PhD
  • UCLA
  • Tissue Microarray Core
  • David Seligson, MD
  • Hyung Kim, MD
  • Arie Belldegrun, MD
  • Robert Figlin, MD
  • Siavash Kurdistani, MD

74
References: RF clustering
  • Unsupervised learning tasks in TMA data analysis
    • Review random forest predictors (introduced by
      L. Breiman)
    • Shi, T. and Horvath, S. (2005) Unsupervised
      learning using random forest predictors. Journal
      of Computational and Graphical Statistics
    • www.genetics.ucla.edu/labs/horvath/RFclustering/RFclustering.htm
  • Application to Tissue Array Data
    • Shi, T., Seligson, D., Belldegrun, A. S.,
      Palotie, A., Horvath, S. (2004) Tumor Profiling
      of Renal Cell Carcinoma Tissue Microarray Data
    • Seligson DB, Horvath S, Shi T, Yu H, Tze S,
      Grunstein M, Kurdistani S (2005) Global histone
      modification patterns predict risk of prostate
      cancer recurrence.