
Improved Tumor Marker Validation Success Using Weighted Gene Co-expression Networks and Random Forest Clustering

- Steve Horvath
- shorvath_at_mednet.ucla.edu
- Human Genetics Biostatistics
- University of California, Los Angeles

Contents

- Describe pathway based tumor marker screening strategy
- Speculate on the biological reasons why it could work.
- Describe 2 empirical success stories for identifying tumor markers that validated in independent data sets
- Brain cancer survival time
- (Affy) gene expression microarray data
- weighted gene co-expression networks
- Prostate cancer time to PSA recurrence
- tissue microarray data (immunohistochemical stainings)
- random forest clustering

The Embarrassing Validation Problem

- A tumor marker is found to be highly predictive of a clinical outcome in one data set but fails to be validated in an independent data set.
- Bad (analysis) reasons include
- data snooping
- overfitting
- ascertainment issues
- Good (biological) reasons
- genetic heterogeneity
- Little can be done about this.
- Single markers don't capture the essence of the whole disease pathway.
- A lot can be done about this: NOVEL STATISTICAL METHODS FOR EXTRACTING SIGNAL FROM THE DATA.

Outline of standard strategy for screening for markers

- 1) Regress a clinical outcome y on the molecular markers (features) X.
- 2) Identify the features that are most significant or most predictive of the outcome using standard statistical feature selection methods
- Empirical finding: often poor validation success.

Pathway Based Strategy for Screening for Markers

- Find suitably defined clusters in the underlying high dimensional feature space X.
- Relate the clusters to clinical outcomes of interest. This results in a few disease clusters (a.k.a. pathways or modules)
- Use features (markers) that describe the states of the disease clusters as final predictors.
- (Limited) Empirical finding: improved validation success

Motivating why the pathway based screening strategy may lead to better validation success

- By first clustering the features, one reduces the number of multiple comparisons substantially
- By looking at aggregates of features (clusters) the feature definition is much more robust and more likely to be platform independent.
- Combining the features along pathways is the biologically meaningful thing to do.
- Pathways are closer to the clinical phenotype than the individual constituents of these pathways.
- The whole is more than the sum of its parts

TEASER: Validation success rate of gene expressions in independent data

[Bar chart: 300 most significant genes (Cox p-value < 1.3 x 10^-3): 26%; network based screening (p < 0.05 and high intramodular connectivity): 67%]

Weighted Gene Co-Expression Network Analysis.

- Novel statistical approach for analyzing microarray data: weighted network analysis
- Empirical evidence that it matters in practice
- Identification of Brain Cancer Genes that can be validated in an independent data set

Background

- Network based methods have been found useful in many domains:
- protein interaction networks
- the world wide web
- social interaction networks
- OUR FOCUS: gene co-expression networks

Does this map tell you which cities are important?

This one does!

The nodes with the largest number of links (connections) are most important!

Slide courtesy of Paul Mischel and AL Barabasi

Scale free topology is a fundamental property of such networks (Barabasi et al)

- It entails the presence of hub nodes that are connected to a large number of other nodes
- Such networks are robust with respect to the random deletion of nodes but are sensitive to the targeted attack on hub nodes
- It has been demonstrated that metabolic networks exhibit a scale free topology

P(k) vs k in scale free networks

[Plot: P(k) against k]

- Scale Free Topology refers to the frequency distribution of the connectivities
- Connectivity: k
- p(k) = proportion of nodes that have connectivity k

How to check Scale Free Topology?

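Idea: log-transform p(k) and k and look at the scatter plot. In a scale free network the points fall approximately on a straight line.

The model fitting index R^2 of the linear regression of log(p(k)) on log(k) can be used to quantify goodness of fit.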
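The check described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' R code: the function name `scale_free_fit`, the choice of 10 equal-width bins, and the synthetic heavy-tailed connectivities are all assumptions, and the plain (unadjusted) R^2 is reported.

```python
import numpy as np

def scale_free_fit(k, n_bins=10):
    """Regress log10 p(k) on log10 k using binned connectivities.

    Returns (R^2, slope) of the straight-line fit.
    """
    k = np.asarray(k, dtype=float)
    counts, edges = np.histogram(k, bins=n_bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    keep = (counts > 0) & (centers > 0)          # the log is undefined otherwise
    log_k = np.log10(centers[keep])
    log_p = np.log10(counts[keep] / counts.sum())
    slope, intercept = np.polyfit(log_k, log_p, 1)
    ss_res = np.sum((log_p - (slope * log_k + intercept)) ** 2)
    ss_tot = np.sum((log_p - log_p.mean()) ** 2)
    return 1.0 - ss_res / ss_tot, slope

# Synthetic heavy-tailed connectivities, for illustration only.
rng = np.random.default_rng(0)
k = rng.pareto(2.0, size=2000) + 1.0
r2, slope = scale_free_fit(k)
```

A negative slope together with an R^2 close to 1 indicates approximate scale free topology.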

Gene Co-expression Networks

- In gene co-expression networks, each gene corresponds to a node.
- Two genes are connected by an edge if their expression values are highly correlated.
- Definition of high correlation is somewhat tricky
- we propose a criterion for picking the threshold parameter.

Steps for constructing a simple, unweighted co-expression network

Overview: gene co-expression network analysis

- A) Microarray gene expression data
- B) Measure concordance of gene expression with a Pearson correlation
- C) The Pearson correlation matrix is dichotomized to arrive at an adjacency matrix. Binary values in the adjacency matrix correspond to an unweighted network.
- D) The adjacency matrix can be visualized by a graph.
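Steps A) to C) can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `unweighted_adjacency`, the use of the absolute correlation, the threshold value 0.7, and the toy expression matrix are hypothetical and not taken from the talk.

```python
import numpy as np

def unweighted_adjacency(expr, tau=0.7):
    """expr: samples x genes array. Returns a symmetric 0/1 adjacency matrix."""
    corr = np.corrcoef(expr, rowvar=False)      # gene-by-gene Pearson correlation
    adj = (np.abs(corr) >= tau).astype(int)     # dichotomize at the threshold tau
    np.fill_diagonal(adj, 0)                    # no self-connections
    return adj

# Toy data: 5 samples (rows), 3 genes (columns).
# Gene 2 is a linear function of gene 1; gene 3 is uncorrelated with both.
expr = np.array([
    [1.0,  3.0, 2.0],
    [2.0,  5.0, 1.0],
    [3.0,  7.0, 3.0],
    [4.0,  9.0, 1.0],
    [5.0, 11.0, 2.0],
])
adj = unweighted_adjacency(expr)
```

Here only genes 1 and 2 end up connected by an edge.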

Our holistic view.

- Weighted network view: all genes are connected; connection widths = connection strengths
- Unweighted view: some genes are connected; all connections are equal

We find theoretical and empirical evidence that the weighted network view is superior to the simple network view.

A general framework for defining weighted gene co-expression networks

Bin Zhang, Steve Horvath

Technical report and R code at www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork

Beyond the standard approach

- Dichotomization allows one to easily define network-based concepts but it eliminates some information regarding the strength of interaction.
- To overcome the disadvantage of the dichotomization, we generalize the approach
- Measure co-expression by a similarity s(i,j) with range [0,1], e.g. the absolute value of the Pearson correlation
- Define an adjacency matrix A(i,j) = AF(s(i,j))
- The adjacency function AF is a monotonic, non-negative function defined on [0,1] and depends on parameters. The choice of the parameters determines the properties of the network.
- We consider 2 types of AFs
- Step function: AF(s) = I(s > tau) with parameter tau
- Power function: AF(s) = s^b with parameter b
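The two adjacency functions can be written down directly. The sketch below is illustrative only; the function names and the example values of tau and b are assumptions.

```python
import numpy as np

def step_adjacency(s, tau):
    """Step function AF(s) = I(s > tau): yields an unweighted (0/1) network."""
    return (np.asarray(s, dtype=float) > tau).astype(float)

def power_adjacency(s, b):
    """Power function AF(s) = s^b: yields a weighted network with values in [0, 1]."""
    return np.asarray(s, dtype=float) ** b

s = np.array([0.2, 0.5, 0.9])           # example similarities in [0, 1]
a_step = step_adjacency(s, tau=0.5)     # hard threshold
a_power = power_adjacency(s, b=6)       # soft threshold
```

The step function discards everything at or below tau, while the power function keeps all pairs but pushes weak similarities towards zero (0.5^6 is about 0.016, 0.9^6 about 0.53).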

Comparing adjacency functions

How to estimate the parameter values of an adjacency function?

- We propose to use the following criteria
- A) CONSIDER ONLY THOSE PARAMETER VALUES THAT RESULT IN APPROXIMATE SCALE FREE TOPOLOGY
- B) SELECT THE PARAMETERS THAT RESULT IN THE HIGHEST MEAN NUMBER OF CONNECTIONS
- Criterion A is motivated by the finding that most metabolic networks (including gene co-expression networks, protein-protein interaction networks and cellular networks) have been found to exhibit a scale free topology
- Criterion B is motivated by our desire to have high sensitivity to detect modules (clusters of genes) and hub genes.

Criterion A is measured by the linear model fitting index R^2

[Figure panels: Step AF (tau); Power AF (b)]

Trade-off between criterion A (R^2) and criterion B (mean no. of connections) when varying the power b

[Figure for AF(s) = s^b; panels: criterion A, SFT model fit R^2; criterion B, mean connectivity]

Empirical insights for determining the adjacency function

- For criterion A: measure compliance with scale free topology by using the adjusted R^2 value for the linear regression fit between log(p(k)) and log(k)
- Usually require R^2 > 0.8
- For criterion B: aim to get a mean(k) = 50 when dealing with 2000 genes.
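The two criteria can be combined into a simple scan over candidate powers. The following is a rough sketch, assuming a binned log-log fit with plain R^2 (the slide uses the adjusted R^2) and synthetic random data; the helper names `sft_r2` and `scan_powers` are hypothetical.

```python
import numpy as np

def sft_r2(k, n_bins=10):
    """R^2 of the straight-line fit of log10 p(k) on log10 k (binned)."""
    counts, edges = np.histogram(k, bins=n_bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    keep = (counts > 0) & (centers > 0)
    x = np.log10(centers[keep])
    if x.size < 3:                      # too few points for a meaningful fit
        return 0.0
    y = np.log10(counts[keep] / counts.sum())
    ss_tot = np.sum((y - y.mean()) ** 2)
    if ss_tot == 0:
        return 0.0
    slope, intercept = np.polyfit(x, y, 1)
    return 1.0 - np.sum((y - (slope * x + intercept)) ** 2) / ss_tot

def scan_powers(expr, powers, r2_min=0.8):
    """Return (b, R^2, mean connectivity) per power and the preferred row."""
    s = np.abs(np.corrcoef(expr, rowvar=False))   # similarity in [0, 1]
    np.fill_diagonal(s, 0.0)
    rows = []
    for b in powers:
        k = (s ** b).sum(axis=1)                  # weighted connectivity
        rows.append((b, sft_r2(k), k.mean()))
    admissible = [r for r in rows if r[1] >= r2_min]              # criterion A
    best = max(admissible, key=lambda r: r[2]) if admissible else None  # criterion B
    return rows, best

rng = np.random.default_rng(0)
expr = rng.normal(size=(30, 200))    # synthetic samples x genes, illustration only
rows, best = scan_powers(expr, powers=range(1, 11))
```

Mean connectivity can only decrease as b grows, which is why criterion B favours the smallest power that still satisfies criterion A. On random data such as this, no power may pass criterion A, in which case `best` is `None`.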

Trade-off between criterion A and B when varying tau

[Figure for the step function I(s > tau); panels: criterion A; criterion B]

Mathematical Definition of an Undirected Network

Network = Adjacency Matrix

- A network can be represented by an adjacency matrix, A = [a_ij], that encodes whether/how a pair of nodes is connected.
- A is a symmetric matrix with entries in [0,1].
- For unweighted networks, entries are 1 or 0 depending on whether or not 2 nodes are adjacent (connected).
- For weighted networks, the adjacency matrix reports the connection strength between gene pairs.

Generalized Connectivity

- Gene connectivity corresponds to the row sums of the adjacency matrix
- For unweighted networks = number of direct neighbors
- For weighted networks = sum of connection strengths to other nodes
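As a small illustration of the row-sum definition, the sketch below computes connectivities for one unweighted and one weighted toy network. The function name and the toy matrices are assumptions made for the example.

```python
import numpy as np

def connectivity(adj):
    """Row sums of the adjacency matrix, ignoring the diagonal."""
    a = np.asarray(adj, dtype=float).copy()
    np.fill_diagonal(a, 0.0)
    return a.sum(axis=1)

unweighted = np.array([[0, 1, 1],
                       [1, 0, 0],
                       [1, 0, 0]])
weighted = np.array([[0.0, 0.5, 0.2],
                     [0.5, 0.0, 0.1],
                     [0.2, 0.1, 0.0]])

k_unweighted = connectivity(unweighted)   # number of direct neighbors
k_weighted = connectivity(weighted)       # sum of connection strengths
```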

Network Analysis Flow Chart

Define a Gene Co-expression Similarity

Define a Family of Adjacency Functions

Determine the AF Parameters

Define a Measure of Node Dissimilarity

Identify Network Modules (Clustering)

Relate Network Concepts to Each Other

Relate the Network Concepts to External Gene or Sample Information

Network Distance Measure: Topological Overlap Matrix

How to measure distance in a network?

- Mathematical answer: geodesics
- length of shortest path connecting 2 nodes
- we have found no empirical evidence that this is a biologically meaningful concept in co-expression networks
- Biological answer: look at shared neighbors
- Intuition: if 2 people share the same friends they are close in a social network
- Use the topological overlap measure based distance proposed by Ravasz et al (2002, Science)

Topological Overlap (Ravasz et al) leads to a network distance measure

- Generalized in Zhang and Horvath (2005) to the case of weighted networks
- Generalized in Yip and Horvath (2005) to higher order interactions
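The slide does not spell out the formula. A common form of the topological overlap, as given for weighted networks in Zhang and Horvath (2005), is TOM(i,j) = (l_ij + a_ij) / (min(k_i, k_j) + 1 - a_ij), where l_ij = sum over u of a_iu * a_uj counts shared neighbors and k_i is the connectivity. The sketch below implements that form; the function name and the toy path graph are assumptions.

```python
import numpy as np

def topological_overlap(adj):
    """Topological overlap matrix for a symmetric adjacency with entries in [0, 1]."""
    a = np.asarray(adj, dtype=float).copy()
    np.fill_diagonal(a, 0.0)
    k = a.sum(axis=1)                         # connectivity of each node
    shared = a @ a                            # l_ij = sum_u a_iu * a_uj
    denom = np.minimum.outer(k, k) + 1.0 - a  # always >= 1 off the diagonal
    tom = (shared + a) / denom
    np.fill_diagonal(tom, 1.0)                # a node overlaps fully with itself
    return tom

# Toy path graph 0 - 1 - 2: nodes 0 and 2 are not adjacent but share neighbor 1.
path = np.array([[0, 1, 0],
                 [1, 0, 1],
                 [0, 1, 0]])
tom = topological_overlap(path)
dist = 1.0 - tom                              # TOM-based dissimilarity
```

In the toy graph, nodes 0 and 2 receive an overlap of 0.5 although they are not directly connected, which is the "shared friends" intuition.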

Using the TOM matrix to cluster genes

- To group nodes with high topological overlap into modules (clusters), we typically use average linkage hierarchical clustering coupled with the TOM distance measure.
- Once a dendrogram is obtained from a hierarchical clustering method, we choose a height cutoff to arrive at a clustering.
- Here modules correspond to branches of the dendrogram

TOM plot

[Figure: TOM matrix with the hierarchical clustering dendrogram. Genes correspond to rows and columns; modules correspond to branches.]

More traditional view of module

[Figure: heat map. Columns = brain tissue samples; rows = genes; color band indicates module membership.]

Message: characteristic vertical bands indicate tight co-expression of module genes

Different Ways of Depicting Gene Modules

- Topological Overlap Plot: 1) rows and columns correspond to genes, 2) red boxes along the diagonal are modules, 3) color bands = modules
- Gene Functions
- We proposed Multi Dimensional Scaling. Idea: use the network distance in MDS
- Traditional View

Hub Genes Predict Survival for Brain Cancer Patients

Mischel PS, Zhang B, et al, Horvath S, Nelson SF.

Comparing the Module Structure in Cancer and Normal tissues

[Figure panels: 55 Brain Tumors; VALIDATION DATA: 65 Brain Tumors; Normal brain (adult and fetal); Normal non-CNS tissues]

Messages: 1) Cancer modules can be independently validated. 2) Modules in brain cancer tissue can also be found in normal, non-brain tissue. --> Insights into the biology of cancer

Mean Prognostic Significance of Module Genes

Message: focus the attention on the brown module genes

Module hub genes predict cancer survival

- Cox model to regress survival on gene expression levels
- Defined prognostic significance as -log10(Cox p-value): the survival association between each gene and glioblastoma patient survival
- A module-based measure of gene connectivity significantly and reproducibly identifies the genes that most strongly predict patient survival

Validation set, 65 GBMs: r = 0.55, p < 2.2 x 10^-16

Test set, 55 GBMs: r = 0.56, p < 2.2 x 10^-16

The fact that genes with high intramodular connectivity are more likely to be prognostically significant facilitates a novel screening strategy for finding prognostic genes

- Focus on those genes with significant Cox regression p-value AND high intramodular connectivity.
- It is essential to take a module centric view: focus on the intramodular connectivity of the disease related module
- Validation success rate = proportion of genes with independent test set Cox regression p-value < 0.05.
- Validation success rate of the network based screening approach: 68%
- Standard approach involving the top 300 most significant genes: 26%
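The screening rule and the validation success rate can be sketched as follows. All numbers are made-up toy values for six hypothetical genes, not data from the study, and the connectivity cutoff of 0.6 is an arbitrary assumption.

```python
import numpy as np

def validation_success_rate(selected, p_test, alpha=0.05):
    """Proportion of selected genes whose independent test-set p-value is below alpha."""
    selected = np.asarray(selected, dtype=bool)
    return float(np.mean(np.asarray(p_test)[selected] < alpha))

# Toy values for six hypothetical genes.
p_train = np.array([0.001, 0.20, 0.01, 0.03, 0.0005, 0.04])   # training-set Cox p-values
k_within = np.array([0.90, 0.80, 0.70, 0.10, 0.20, 0.85])     # scaled intramodular connectivity
p_test = np.array([0.01, 0.50, 0.03, 0.40, 0.60, 0.20])       # test-set Cox p-values

# Network based screening: significant Cox p-value AND high intramodular connectivity.
network_based = (p_train < 0.05) & (k_within > 0.6)
rate = validation_success_rate(network_based, p_test)
```

In this toy example three genes pass the screen and two of them have a test-set p-value below 0.05.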

Validation success rate of gene expressions in independent data

[Bar chart: 300 most significant genes (Cox p-value < 1.3 x 10^-3): 26%; network based screening (p < 0.05 and high intramodular connectivity): 67%]

New Application: Tissue Microarray Data

[Figure: tissue microarray and DNA microarray; tissue array section with 700 tissue samples; size labels 0.6 mm and 0.2 mm]

Ki-67 Expression in Kidney Cancer

[Figure panels: High Grade; Low Grade]

Message: brown staining is related to tumor grade

Multiple measurements per patient: several spots per tumor sample and several scores per spot

- Each patient (tumor sample) is usually represented by multiple spots
- 3 tumor spots
- 1 matched normal spot

- Maximum intensity: Max
- Percent of cells staining: Pos
- Percent of cells staining with the maximum intensity: PosMax
- Spots have a spot grade: NL, 1, 2, ...
- Indicator of missingness

Properties of TMA Data

- Highly skewed, non-normal, semi-continuous.
- Often a good idea to model as ordinal variables with many levels.
- Staining scores of the same markers are highly correlated

Histogram of tumor marker expression scores POS and MAX

[Figure: histograms of Percent of Cells Staining (POS) and Maximum Intensity (MAX) for the markers EpCam, P53 and CA9]

Frequency plot of the same tumor marker in 2 independent data sets

[Figure panels: Data Set 1; Validation Data Set 2]

The cut-off corresponds roughly to the 66th percentile. Thresholding this tumor marker allows one to stratify the cancer patients into high risk and low risk patients. Although the distributions look very different, the percentile threshold can be validated and is clinically relevant.
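The point about percentile thresholds can be illustrated with a small sketch: a percentile cut-off assigns the same patients to the upper group even when the score distribution changes shape. The function name and the toy scores are assumptions; which side of the cut-off is "high risk" depends on the marker and is not encoded here.

```python
import numpy as np

def above_percentile(scores, pct=66):
    """Flag scores above the pct-th percentile of their own data set."""
    scores = np.asarray(scores, dtype=float)
    cutoff = np.percentile(scores, pct)
    return scores > cutoff, cutoff

set1 = np.arange(1, 101)       # toy scores in data set 1
set2 = set1 ** 2               # same ordering, very different distribution shape

high1, cut1 = above_percentile(set1)
high2, cut2 = above_percentile(set2)
```

The absolute cut-offs `cut1` and `cut2` differ widely, yet the same 34 observations lie above the 66th percentile in both sets.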

Thresholding methods for tumor marker expressions

- Since clinicians and pathologists prefer thresholding tumor marker expressions, it is natural to use statistical methods that are based on thresholding covariates, e.g. regression trees, survival trees, rpart, forest predictors etc.
- Dichotomized marker expressions are often fitted in a Cox (or alternative) regression model
- Danger: over-fitting due to optimal cut-off selection.
- Several thresholding methods and ways for adjusting for multiple comparisons are reviewed in
- Liu X, Minin V, Huang Y, Seligson DB, Horvath S (2004) Statistical Methods for Analyzing Tissue Microarray Data. J of Biopharmaceutical Statistics. Vol 14(3) 671-685

Finding tumor markers for predicting clinical outcomes on the basis of Tissue Microarray Data

Using the clustering based strategy for finding tumor markers

- 1) Find distinct patient clusters without regard to outcome
- 2) Find whether patient clusters have distinct PSA recurrence profiles
- 3) If so, find rules (classifiers) for predicting cluster membership
- 4) Validate those rules in independent data.


Cluster Analysis of Low Gleason Score Prostate Samples (UCLA data)

1) Construct a tumor marker rule for predicting RF cluster membership. 2) Validate the rule predictions in an independent data set

Threshold Rule Validation

Discussion Prostate TMA Data

- Very weak evidence that individual markers predict PSA recurrence
- None of the markers validated individually
- However, cluster membership was highly predictive, i.e. the rule could be validated in an independent data set.

How to cluster patients on the basis of Tissue Microarray Data?

Questions: 1) Can TMA data be used for tumor class discovery, i.e. unsupervised learning? 2) If so, what are suitable unsupervised learning methods?

Tumor Class Discovery using DNA Microarray Data

- Tumor class discovery entails using an unsupervised learning algorithm (e.g. hierarchical, k-means, SOM clustering etc.) to automatically group tumor samples based on their gene expression pattern.

Bullinger et al. N Engl J Med. 2004

Clusters involving TMA data may have unconventional shapes

Low risk prostate cancer patients are colored in black.

- Scatter plot involving 2 dependent tumor markers. The remaining, less dependent markers are not shown.
- Low risk cluster can be described using the following rule
- Marker H3K4 > 45 and H3K18 > 70.
- The intuition is quite different from that of Euclidean distance based clusters.
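The rule above translates directly into code. The sketch below is illustrative: the function name and the example patients are hypothetical, and the marker scores are assumed to be on the same scale as the cut-offs 45 and 70 quoted on the slide.

```python
def is_low_risk(h3k4, h3k18, cut_h3k4=45, cut_h3k18=70):
    """Threshold rule from the slide: low risk if H3K4 > 45 and H3K18 > 70."""
    return h3k4 > cut_h3k4 and h3k18 > cut_h3k18

# Hypothetical (H3K4, H3K18) staining scores for three patients.
patients = [(60, 85), (60, 50), (30, 90)]
labels = [is_low_risk(h3k4, h3k18) for h3k4, h3k18 in patients]
```

The resulting cluster is a rectangular corner of the marker plane rather than an elliptical cloud, which is why Euclidean distance based clustering is unlikely to recover it.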

Unconventional shape of a clinically meaningful patient cluster

- 3 dimensional scatter plot along tumor markers
- Low risk patients are colored in black

[Figure axes: MARKER 1; MARKER 2]

A dissimilarity measure is an essential input for tumor class discovery

- Dissimilarities between tumor samples are used in clustering and other unsupervised learning techniques
- Commonly used dissimilarity measures include Euclidean distance, 1 - correlation

Challenge

- Conventional dissimilarity measures that work for DNA microarray data may not be optimal for TMA data.
- Dissimilarity measures that are based on the intuition of multivariate normal distributions (clusters have elliptical shapes) may not be optimal
- For tumor marker data, one may want to use a different intuition: clusters are described using thresholding rules involving dependent markers.
- It may be desirable to have a dissimilarity that is invariant under monotonic transformations of the tumor marker expressions.

We have found that a random forest (Breiman 2001) dissimilarity can work well in the unsupervised analysis of TMA data.

Shi et al 2004, Seligson et al 2005.

http://www.genetics.ucla.edu/labs/horvath/RFclustering/RFclustering.htm

Kidney cancer: Comparing PAM clusters that result from using the RF dissimilarity vs the Euclidean distance

Kaplan Meier plots for groups defined by cross tabulating patients according to their RF and Euclidean distance cluster memberships.

Message: in this application, RF clusters are more meaningful regarding survival time

The RF dissimilarity is determined by dependent tumor markers

[Figure: heat map; columns = tumor markers, rows = patients sorted by cluster]

- The RF dissimilarity focuses on the most dependent markers (1,2).
- In some applications, it is good to focus on markers that are dependent since they may constitute a disease pathway.
- The Euclidean distance focuses on the most varying marker (4)

The RF cluster can be described using a thresholding rule involving the most dependent markers

- Low risk patient if marker1 > cut1 and marker2 > cut2
- This kind of thresholding rule can be used to make predictions on independent data sets.
- Validation on independent data set

Theoretical reasons for using an RF dissimilarity for TMA data

- Main reasons
- natural way of weighing tumor marker contributions to the dissimilarity
- The more related a tumor marker is to other tumor markers, the more it contributes to the definition of the dissimilarity
- no need to transform the often highly skewed features
- based on feature ranks
- Chooses cut-off values automatically
- resulting clusters can often be described using simple thresholding rules
- Other reasons
- elegant way to deal with missing covariates
- intrinsic proximity matrix handles mixed variable types well
- CAVEAT: The choice of the dissimilarity should be determined by the kind of patterns one hopes to find. There will be situations when other dissimilarities are preferable.

The random forest dissimilarity

L. Breiman: RF manual

Technical Report: Shi and Horvath 2005

http://www.genetics.ucla.edu/labs/horvath/RFclustering/RFclustering.htm

Summary: Random forest clustering

- Intrinsic variable selection focuses on dependent variables
- Depending on the application, this can be attractive
- Resulting clusters can often be described using thresholding rules, which is attractive for TMA data.
- RF dissimilarity is invariant to monotonic transformations of variables
- In some cases, the RF dissimilarity can be approximated using a Euclidean distance of ranked and scaled features.
- RF clustering was originally suggested by L. Breiman (RF manual). Theoretical properties are studied as part of the dissertation work of Tao Shi. Technical report/code can be found at www.genetics.ucla.edu/labs/horvath/RFclustering/RFclustering.htm and www.genetics.ucla.edu/labs/horvath/kidneypaper/RCC.htm
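The sketch below illustrates only the rank-based approximation mentioned in the summary, not the random forest dissimilarity itself: features are replaced by their ranks, scaled to [0, 1], and compared with a Euclidean distance. The function names and the synthetic marker data are assumptions; ties are not handled specially.

```python
import numpy as np

def rank_scale(x):
    """Replace each column by its ranks (0 .. n-1), scaled to [0, 1]."""
    x = np.asarray(x, dtype=float)
    ranks = np.argsort(np.argsort(x, axis=0), axis=0)
    return ranks / (x.shape[0] - 1)

def rank_euclidean(x):
    """Pairwise Euclidean distance between samples on ranked, scaled features."""
    z = rank_scale(x)
    diff = z[:, None, :] - z[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

rng = np.random.default_rng(1)
markers = rng.normal(size=(20, 3))            # synthetic patients x markers

d_raw = rank_euclidean(markers)
d_transformed = rank_euclidean(np.exp(markers))   # strictly increasing transformation
```

Because only ranks enter the distance, a strictly increasing transformation of the marker scores leaves the dissimilarity unchanged, which mirrors the invariance property listed above.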

Conclusions

- There is a need to identify/develop appropriate data mining methods for TMA data
- highly skewed, semi-continuous, non-normal data
- tree or forest based methods work well
- ALTERNATIVES?

Acknowledgements

- Former students & postdocs for TMA
- Tao Shi PhD
- Xueli Liu PhD
- Yunda Huang PhD
- Tuyen Hoang PhD

- UCLA
- Tissue Microarray Core
- David Seligson, MD
- Hyung Kim, MD
- Arie Belldegrun, MD
- Robert Figlin, MD
- Siavash Kurdistani, MD

References: RF clustering

- Unsupervised learning tasks in TMA data analysis
- Review random forest predictors (introduced by L. Breiman)
- Shi, T. and Horvath, S. (2005) Unsupervised learning using random forest predictors. Journal of Computational and Graphical Statistics
- www.genetics.ucla.edu/labs/horvath/RFclustering/RFclustering.htm
- Application to Tissue Array Data
- Shi, T., Seligson, D., Belldegrun, A. S., Palotie, A., Horvath, S. (2004) Tumor Profiling of Renal Cell Carcinoma Tissue Microarray Data
- Seligson DB, Horvath S, Shi T, Yu H, Tze S, Grunstein M, Kurdistani S (2005) Global histone modification patterns predict risk of prostate cancer recurrence.