Title: Liquid association for large scale gene expression and network studies
1Liquid association for large scale gene
expression and network studies
- Ker-Chau Li
- Institute of Statistical Science
- Academia Sinica
(presentation at Isaac Newton Institute for
Mathematical Sciences, Workshop under the
Program of Statistical Theory and Methods for
Complex, High-Dimensional Data, June 23-27, 2008)
2Abstract
- The fast-growing public repertoire of
microarray gene expression databases provides
individual investigators with unprecedented
opportunities to study transcriptional activities
for genes of their research interest at no
additional cost. Methods such as hierarchical
clustering, principal component analysis, gene
network and others, have been widely used. They
offer biologists valuable genome-wide portraits
of how genes are co-regulated in groups. Such
approaches have a limitation because it often
turns out that the majority of genes do not fall
into the detected gene clusters. If one has a
gene of primary interest in mind and cannot find
any nearby clusters, what additional analysis can
be conducted? In this talk, I will show how to
address this issue via the statistical notion of
liquid association. An online biodata mining
system is developed in my lab for aiding
biologists to distil information from a web of
aggregated genomic knowledgebase and data sources
at multi-levels, including gene ontology,
protein complexes, genetic markers, and drug
sensitivity. The computational issue of liquid
association and the challenges faced in the
context of high p low n problems will be
addressed.
3Change, change, and change
- Calculus is a subject about change
- Life Science is about change
- My entire talk is about change
4 Intuition of SIR
Dogma of regression teaches how output Y CHANGES
in response to CHANGES in input X
- Regression Models
- Parametric
- Multiple linear reg
- Nonlinear reg
- Wavelet
- Nonparametric
- Spline smoothing
- kernel smoothing
- Semiparametric
- Cox regression for survival analysis
- Input data variables
- X1 crime rate
- X2 room size
- X3 family income
- ...
- Xp air quality
- p = the total number of input variables
Fail when dimension is high
Output data variable Y = house price = $f(x_1, x_2, \ldots, x_p, \text{error})$
Dimension Reduction on X
Principal Component Analysis (PCA)
Critical issue: Danger of information loss
Information relevant to Y may not be contained in
the reduced variables because Y is not used in
the dimension reduction process
5 A reversal of the regression paradigm
Instead of asking how Y changes in response to
changes in X, ask how X CHANGES as Y CHANGES
- input data variables
- X1, X2, X3, ..., Xp
- p = the total number of input variables
output data variable $Y = f(b_1'X, b_2'X, \ldots, b_k'X, \text{error})$
sliced means $E(X_1 \mid Y), E(X_2 \mid Y), \ldots, E(X_p \mid Y)$
conduct dimension reduction on $E(X_1 \mid Y), \ldots, E(X_p \mid Y)$ to k projection variables, $k \ll p$
Regression on k projection variables
A fundamental theory for resolving information
loss: a theorem in Li (1991, JASA) shows
that (i) dimension reduction after inverse
regression recovers the effective projection
variables $b_1'X, b_2'X, \ldots, b_k'X$; (ii) there is no need to
specify the nonlinear function f.
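The inverse-regression steps on this slide (slice on Y, average X within slices, reduce the slice means to k directions) can be sketched roughly as below. This is only a minimal illustration of sliced inverse regression, not the implementation behind the talk: it assumes numpy is available, and the function name `sir_directions`, the number of slices, and the simulated data are all hypothetical choices made for the example.

```python
import numpy as np

def sir_directions(X, y, n_slices=10, k=1):
    """Estimate k projection directions by sliced inverse regression (sketch)."""
    n, p = X.shape
    # Standardize X so that its sample covariance becomes the identity.
    mean = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)
    inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    Z = (X - mean) @ inv_sqrt
    # Slice the observations by the order of y and average Z within each slice.
    order = np.argsort(y)
    M = np.zeros((p, p))
    for idx in np.array_split(order, n_slices):
        m = Z[idx].mean(axis=0)
        M += (len(idx) / n) * np.outer(m, m)
    # Leading eigenvectors of the weighted covariance of the slice means.
    w, v = np.linalg.eigh(M)
    top = v[:, np.argsort(w)[::-1][:k]]
    # Map back to the original X scale and normalize each direction.
    B = inv_sqrt @ top
    return B / np.linalg.norm(B, axis=0)

# Hypothetical simulated data: Y depends on X only through one projection b'X.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
b = np.array([1.0, 1.0, 0.0, 0.0, 0.0]) / np.sqrt(2)
y = np.exp(X @ b) + 0.1 * rng.normal(size=2000)  # f is never passed to SIR

b_hat = sir_directions(X, y)[:, 0]
alignment = abs(b_hat @ b)  # |cosine| between estimated and true direction
```

Note that the nonlinear function (here an exponential, chosen arbitrarily) is used only to simulate Y; the estimation step sees X and Y alone, which is the point of item (ii) above.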
6Regression is about change
- Sir Francis Galton (1822-1911), half-cousin of Charles Darwin, was an English Victorian polymath, anthropologist, eugenicist, tropical explorer, geographer, inventor, meteorologist, proto-geneticist, psychometrician, and statistician.
- He was knighted in 1909.
- Galton invented the use of the regression line
(Bulmer 2003, p. 184), and was the first to
describe and explain the common phenomenon of
regression toward the mean, which he first
observed in his experiments on the size of the
seeds of successive generations of sweet peas.
Bivariate normal
Regression slope equals correlation after
variable standardization
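The statement that the regression slope equals the correlation after standardization can be checked numerically. The sketch below uses simulated, hypothetical parent/offspring values (not Galton's data); the helper names `standardize`, `ls_slope` and `pearson` are introduced only for this illustration.

```python
import random
from statistics import fmean, pstdev

def standardize(v):
    """Center to mean 0 and scale to standard deviation 1."""
    m, s = fmean(v), pstdev(v)
    return [(a - m) / s for a in v]

def ls_slope(x, y):
    """Least-squares slope of y regressed on x (with intercept)."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def pearson(x, y):
    """Pearson correlation as the mean product of standardized values."""
    return fmean([a * b for a, b in zip(standardize(x), standardize(y))])

random.seed(0)
# Simulated, hypothetical measurements used only to illustrate the identity.
parent = [random.gauss(0, 2.0) for _ in range(500)]
child = [0.5 * p + random.gauss(0, 1.5) for p in parent]

raw_slope = ls_slope(parent, child)                            # depends on units
std_slope = ls_slope(standardize(parent), standardize(child))  # unit-free
r = pearson(parent, child)
print(f"raw slope {raw_slope:.3f}, standardized slope {std_slope:.3f}, corr {r:.3f}")
```

The raw slope changes with the measurement scale, while the standardized slope coincides with the correlation up to floating-point rounding.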
7Correlation and Changes Inside Correlation
Correlation coefficient has been used by Gauss,
Bravais, Edgeworth. Sweeping impact in data
analysis is due to Galton (1822-1911), "Typical
laws of heredity in man". Karl Pearson modifies
and popularizes its use. A building block in
multivariate analysis, of which clustering,
classification, dimension reduction are recurrent
themes.
Liquid association is about the change of
correlation pattern inside the scatter diagram
of a pair of variables
8Liquid association (LA): a new bioinformatics
tool for exploring gene expression data and
much beyond
Pearson correlation(X,Y)
Basis for clustering genes in microarray: two
genes X, Y are likely to be functionally
associated if sharing similar expression
profiles measured by correlation coefficient.
The converse statement is not true: many
functionally associated genes are often
uncorrelated in expression owing to complexity of
gene regulation, such as multiple functional roles
for most genes, role-changing as hidden cellular
state variables vary, etc.
LA
How does LA work? Instead of two genes,
three genes are considered at a time.
LA measures the CHANGE in
correlation between two genes X, Y as mediated
by a third gene Z.
Li (2002, PNAS) introduced a novel statistical
notion termed liquid association. "Liquid", as
opposed to "solid", is a metaphor for change.
9How to alleviate the computational burden of computing
a genome-wide total of (N choose 3) triplets of
genes? N = 6,000 (yeast): 36 billion;
N = 50,000 (human): 20 trillion
An enabling algorithm is derived from an elegant
theorem that offers a simple formula for
measuring LA under the setting of continuous
cellular state changes. "Genome-wide
co-expression dynamics: theory and application",
K.C. Li (2002, PNAS)
- On-line LA system developed for aiding integrative cancer biology study
- Biomarkers and disease candidate genes finding
- Gene/drug correlation
- eQTL
- gene signature for clinical survival prediction
- MicroRNA expression
- Array CGH DNA copy number
Examples of LA application
LA helps elucidate gene regulation in metabolic pathways (Li 2002): urea cycle
LA helps find disease candidate genes (Li et al. 2007): multiple sclerosis
http://kiefer.stat2.sinica.edu.tw/LAP3
10gene-expression data
          cond1   cond2   ...   condp
gene 1    x11     x12     ...   x1p
gene 2    x21     x22     ...   x2p
...
gene n
12Why does clustering make sense biologically?
The rationale is
Genes with high degree of expression similarity
are likely to be functionally related:
- may form structural complex
- may participate in common pathways
- may be co-regulated by common upstream regulatory elements
Simply put,
Profile similarity implies functional association
13 Protein rarely works as a single unit
Homo-dimer, hetero-dimer, protein complex
Mitochondrial ATP Synthase; E. coli ATP
Synthase. These images depicting models of
ATP Synthase subunit structure were provided by
John Walker. Some equivalent subunits from
different organisms have different names.
14Example: SCATTERPLOT MATRIX of MCM1, MCM2,
MCM3, MCM4, MCM5, MCM6, MCM7
15The tighter association among the six genes,
MCM2, ..., MCM7 is in sharp contrast to the
association between each of them and MCM1. It
turns out that the gene products of MCM2, ..., MCM7
form a hexameric complex that binds
chromatin. It is a part of the pre-replicative
complex, an assembly of proteins that forms at
origins of DNA replication between late M phase
and the G1/S transition and includes other
proteins believed to act in DNA replication
initiation.
16MCM1 is a transcription factor that helps activate
gene expression
17However, the converse is not true
- The expression profiles of the majority of
functionally associated genes are indeed
uncorrelated
- Microarray is too noisy
- Biology is complex
18Why no correlation?
- Protein rarely works alone
- Protein has multiple functions
- Different biological processes or pathways have to be synchronized
- Competing use of finite resources: metabolites, hormones, ...
- Protein modification: phosphorylation, proteolysis, shuttle, ...
- Transcription factors serving both as activators and repressors
19Transcription factors: proteins that bind to DNA
Activator: X = TF, Y = target gene; correlation is positive
Repressor: X = TF, Y = target gene; correlation is negative
20Some transcription factors can act as both
activators and repressors
Thyroid hormone receptors can be changed from
repressors to activators, dependent on the
absence/presence of thyroid hormone
X = THR, Y = target gene: Corr may cancel out if
hormone level fluctuates
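A small simulation can make the cancellation concrete. The sketch below assumes a deliberately simplified toy mechanism (the receptor acts as a pure activator when hormone is present and a pure repressor when it is absent, each half of the time); the variable names and noise levels are hypothetical and are not taken from any data in the talk.

```python
import random
from statistics import fmean, pstdev

def corr(x, y):
    """Pearson correlation of two equal-length lists."""
    mx, my, sx, sy = fmean(x), fmean(y), pstdev(x), pstdev(y)
    return fmean([(a - mx) * (b - my) for a, b in zip(x, y)]) / (sx * sy)

random.seed(1)
n = 4000
hormone = [random.random() < 0.5 for _ in range(n)]  # True = hormone present
x = [random.gauss(0, 1) for _ in range(n)]           # receptor (TF) expression
# Toy assumption: target follows the TF when hormone is present, opposes it otherwise.
y = [(a if h else -a) + random.gauss(0, 0.5) for a, h in zip(x, hormone)]

overall = corr(x, y)
with_hormone = corr([a for a, h in zip(x, hormone) if h],
                    [b for b, h in zip(y, hormone) if h])
without_hormone = corr([a for a, h in zip(x, hormone) if not h],
                       [b for b, h in zip(y, hormone) if not h])
print(f"overall {overall:.2f}, hormone present {with_hormone:.2f}, "
      f"hormone absent {without_hormone:.2f}")
```

Within each hormone state the two profiles are strongly correlated, with opposite signs, yet the pooled correlation is close to zero; this is the situation a correlation-based clustering would miss.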
21Going subtle: Protein modification
Histone inhibits transcription. To activate transcription,
the lysine side chain must be acetylated.
Weaver (2001)
22Transcription factor can switch between
activator and repressor, dependent on the
abundance level of thyroid hormone.
23Math. Modeling: a nightmare
[Figure: diagram with labels Current, Next, mRNA, Observed, hidden, FITNESS, FUNCTION; protein kinase; ATP, GTP, cAMP, etc.; localization (cytoplasm, nucleus, mitochondria, vacuolar); DNA methylation, chromatin structure; nutrients (carbon, nitrogen sources), temperature, water]
Statistical methods become useful
24What is LA? PLA?
25Schematic illustration of LA
26A Challenge
- What genes behave like that ?
- Can we identify all of them ?
- N = 5878 ORFs
- N choose 3 = 33.8 billion triplets to inspect
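The triplet counts quoted in the talk follow directly from the binomial coefficient. A quick check, using the genome sizes given on the slides (the dictionary keys are just labels chosen for this example):

```python
from math import comb

# Genome sizes as quoted in the talk; the keys are labels for this example only.
genome_sizes = {"yeast_orfs": 5878, "yeast": 6000, "human": 50000}

# Number of unordered gene triplets (X, Y, Z) for each genome size.
triplets = {name: comb(n, 3) for name, n in genome_sizes.items()}

for name, count in triplets.items():
    print(f"{name}: N = {genome_sizes[name]:,} gives {count:,} triplets")
```

Fitting a nonparametric curve for every one of these triplets would be prohibitive, which is why a closed-form LA score per triplet matters.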
27Statistical theory for LA
- X, Y, Z: random variables with mean 0 and variance 1
- $\mathrm{Corr}(X,Y) = E(XY) = E(E(XY \mid Z)) = E\,g(Z)$
- $g(z) = E(XY \mid Z = z)$: an ideal summary of association pattern between X and Y when $Z = z$
- $g'(z)$ = derivative of $g(z)$
- Definition. The LA of X and Y with respect to Z is $LA(X,Y \mid Z) = E\,g'(Z)$
28- One way to go about estimating LA is to apply nonparametric regression for g(z)
- But this is probably going to eat up too much computing time and also face the issue of regularization, such as whether to apply a common smoothing parameter to all curves or not
- An idea popped up because of my early work on SURE and cross validation.
29Applications of Stein's Lemma
Decision Theory
- Nonparametric regression with Stein estimates
- Connection of Stein's unbiased risk estimate with generalized cross validation (Li 1984, Ann. Stat.)
30Lemma 1: $E\,h'(X) = h(1) - h(0)$, $X \sim \mathrm{Uniform}[0,1]$
- h is differentiable
- Fundamental theorem of calculus
- Sir Isaac Newton
- (1643-1727)
- Gottfried Leibniz
- (1646-1716)
- from Wikipedia
31Lemma: $E\,h'(X) = E[X h(X)]$, $X \sim \mathrm{Normal}(0,1)$
- Stein's Lemma
- Charles Stein
- Integration by parts
- Proof
- Start from the right side
- Write down the density of X
- Integration by parts
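The three proof steps listed above can be written out as follows. This is the standard textbook derivation, under the added assumptions that h is differentiable and that $h(x)\phi(x)$ vanishes as $|x| \to \infty$; the symbol $\phi$ for the standard normal density is notation introduced here.

```latex
\begin{aligned}
E[X h(X)] &= \int_{-\infty}^{\infty} x\, h(x)\, \phi(x)\, dx,
  \qquad \phi(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}, \quad \phi'(x) = -x\,\phi(x) \\
&= -\int_{-\infty}^{\infty} h(x)\, \phi'(x)\, dx \\
&= -\Big[ h(x)\,\phi(x) \Big]_{-\infty}^{\infty}
   + \int_{-\infty}^{\infty} h'(x)\, \phi(x)\, dx \\
&= E\, h'(X).
\end{aligned}
```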
32Statistical theory-LA
- Theorem. If Z is standard normal, then $LA(X,Y \mid Z) = E(XYZ)$
- Proof. By Stein's Lemma, $E\,g'(Z) = E[g(Z) Z] = E[E(XY \mid Z)\, Z] = E(XYZ)$
- Additional math. properties
- bounded by third moment
- 0, if jointly normal
- transformation
33Normality ?
- Convert each gene expression profile by taking normal score transformation
- LA(X,Y|Z) = average of triplet product of three gene profiles
- $(x_1 y_1 z_1 + x_2 y_2 z_2 + \cdots + x_n y_n z_n) / n$
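The two steps on this slide (normal score transformation, then the average triple product) can be sketched as below. This is a minimal illustration rather than the LAP system's code: the rank-to-quantile convention rank/(n+1) is one common choice and is an assumption here, ties are not handled, and the function names and simulated profiles are hypothetical.

```python
import random
from statistics import NormalDist, fmean

_STD_NORMAL = NormalDist()

def normal_scores(values):
    """Replace each value by the normal quantile of its rank (assumes no ties)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        scores[i] = _STD_NORMAL.inv_cdf(rank / (n + 1))  # assumed convention
    return scores

def la_score(x, y, z):
    """Average triple product of the normal-score transformed profiles."""
    xs, ys, zs = normal_scores(x), normal_scores(y), normal_scores(z)
    return fmean([a * b * c for a, b, c in zip(xs, ys, zs)])

random.seed(0)
n = 2000
z = [random.gauss(0, 1) for _ in range(n)]  # mediator profile
x = [random.gauss(0, 1) for _ in range(n)]
# Toy data: the X-Y relationship goes from negative to positive as z increases.
y = [c * a + random.gauss(0, 0.5) for a, c in zip(x, z)]
w = [random.gauss(0, 1) for _ in range(n)]  # unrelated profile

la_mediated = la_score(x, y, z)   # clearly positive
la_unrelated = la_score(x, y, w)  # close to zero
print(f"LA(X,Y|Z) = {la_mediated:.2f}, LA(X,Y|W) = {la_unrelated:.2f}")
```

Each score costs one pass over the n conditions, with no smoothing parameter to tune, which is what makes a genome-wide search feasible.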
34How does LA work in yeast?
- Urea cycle/arginine biosynthesis
35Yeast Cell Cycle(adapted from Molecular Cell
Biology, Darnell et al)
Most visible event
36ARG1
Glutamate
ARG2
37[Figure, adapted from KEGG: pathway diagram with labels ARG1, X, Y, Head, Backdoor, 8th place negative]
Compute LA(X,Y|Z) for all Z
Rank and find leading genes
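The search described on this slide (fix a pair X, Y, compute LA(X,Y|Z) for every candidate Z, then rank) might look roughly like the sketch below. It reuses the triple-product estimator with a rank/(n+1) normal score convention, which is an assumption; the gene labels such as `gene_0` and `mediator` and the simulated profiles are hypothetical.

```python
import random
from statistics import NormalDist, fmean

_STD_NORMAL = NormalDist()

def normal_scores(values):
    """Replace each value by the normal quantile of its rank (assumes no ties)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        scores[i] = _STD_NORMAL.inv_cdf(rank / (n + 1))
    return scores

def la_scan(x, y, candidates):
    """Return (name, LA score) pairs sorted from most negative to most positive."""
    xs, ys = normal_scores(x), normal_scores(y)
    xy = [a * b for a, b in zip(xs, ys)]  # computed once, reused for every Z
    scores = {name: fmean([p * c for p, c in zip(xy, normal_scores(z))])
              for name, z in candidates.items()}
    return sorted(scores.items(), key=lambda item: item[1])

random.seed(0)
n = 500
x = [random.gauss(0, 1) for _ in range(n)]
mediator = [random.gauss(0, 1) for _ in range(n)]
# Toy data: X and Y become negatively related as the mediator goes up.
y = [-m * a + random.gauss(0, 0.5) for a, m in zip(x, mediator)]

candidates = {f"gene_{i}": [random.gauss(0, 1) for _ in range(n)] for i in range(50)}
candidates["mediator"] = mediator

ranked = la_scan(x, y, candidates)
most_negative = ranked[:3]   # leading genes at the negative end
most_positive = ranked[-3:]  # leading genes at the positive end
print(most_negative[0])
```

Both ends of the ranked list are of interest: the sign indicates whether the X-Y correlation rises or falls as the mediator increases.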
38Why negative LA? High CPA2: signal for
arginine demand. Up-regulation of ARG2
concomitant with down-regulation of CAR2
prevents ornithine from leaving the urea
cycle. When the demand is relieved, CPA2 is
lowered, CAR2 is up-regulated, opening up the
channel for ornithine to leave the urea cycle.
39Other examples (see Li 2002)
- X = GLN3 (transcription factor), Y = CAR1, Z = ARG4 (8th place negative end)
- Electron transport: X = CYT1 (cytochrome c1), gives ATP1 (11 times), ATP5 (subunits of ATPase)
- Calmodulin: CMD1, NUF1 (binding target of CMD1), CMK1 (calmodulin-regulated kinase), YGL149W
- Glycolysis genes: PFK1, PFK2 (6-phospho-fructokinase)
- CYR1 (adenylate cyclase), GSY1 (glycogen synthase), GLC2 (glucan branching), SCH9 (serine/threonine protein kinase, longevity)
40Liquid association
A method for exploiting lack of correlation
between variables
41LA related References
- Li, K.C. (2002) Genome-wide co-expression dynamics: theory and application. Proceedings of the National Academy of Sciences 99, 16875-16880.
- Li, K.C., and Yuan, S. (2004) A functional genomic study on NCI's anticancer drug screen. The Pharmacogenomics Journal 4, 127-135.
- Li, K.C., Liu, C.-T., Sun, W., Yuan, S., and Yu, T. (2004) A system for enhancing genome-wide co-expression dynamics study. Proceedings of the National Academy of Sciences 101, 15561-15566.
- Yu, T., Sun, W., Yuan, S., and Li, K.C. (2005) Study of coordinative gene expression at the biological process level. Bioinformatics 21, 3651-3657.
- Yu, T., and Li, K.C. (2005) Inference of transcriptional regulatory network by two-stage constrained space factor analysis. Bioinformatics 21, 4033-4038.
- Sun, W., Yu, T., and Li, K.C. (2007) Detection of eQTL modules mediated by activity levels of transcription factors. Bioinformatics, doi:10.1093/bioinformatics/btm327 (correspondence author: Li).
- Yuan, S., and Li, K.C. (2007) Context-dependent clustering for dynamic cellular state modeling of microarray gene expression. Bioinformatics, doi:10.1093/bioinformatics/btm457 (correspondence author: Li).
- Li, K.C., Palotie, A., Yuan, S., Bronnikov, D., Chen, D., Wei, X., Choi, O., Saarela, J., and Peltonen, L. (2007) Finding candidate disease genes by liquid association. Genome Biology (in press).
42The human examples
43Gene expression profile for NCI's 60 cell lines
- For each cell line, the relative mRNA concentrations are measured by cDNA glass array.
- Cell lines used in the microarray experiment are without drug administration.
- Ross D.T. et al. Systematic variation in gene expression patterns in human cancer cell lines. Nat. Genet. 24, 227-235 (2000)
44NCI 60 Cell lines
OVARIAN (6): IGROV1, OVCAR-3, OVCAR-4, OVCAR-5, OVCAR-8, SK-OV-3
PROSTATE (2): DU-145, PC-3
LEUKEMIA (6): CCRF-CEM, HL-60, K-562, MOLT-4, RPMI-8226, SR
MELANOMA (8): LOXIMVI, M14, MALME-3M, SK-MEL-2, SK-MEL-28, SK-MEL-5, UACC-257, UACC-62
BREAST (8): BT-549, HS578T, MCF7, MCF7/ADF-RES, MDA-MB-231/ATCC, MDA-MB-435, MDA-N, T-47D
LUNG (9): A549/ATCC, EKVX, HOP-62, HOP-92, NCI-H226, NCI-H23, NCI-H322M, NCI-H460, NCI-H522
CNS (6): SF-268, SF-295, SF-539, SNB-19, SNB-75, U251
COLON (7): COLO205, HCC-2998, HCT-116, HCT-15, HT29, KM12, SW-620
RENAL (8): 786-0, A498, ACHN, CAKI-1, RXF-393, SN12C, TK-10, UO-31
45How does LA work in cell-lines?
- Alzheimer's Disease hallmark gene
- Amyloid-beta precursor protein (APP)
46Alzheimer's disease
The brain tissue shows "neurofibrillary tangles"
(twisted fragments of protein within nerve cells
that clog up the cell), "neuritic plaques"
(abnormal clusters of dead and dying nerve cells,
other brain cells, and protein), and "senile
plaques" (areas where products of dying nerve
cells have accumulated around protein). Although
these changes occur to some extent in all brains
with age, there are many more of them in the
brains of people with AD. The destruction of
nerve cells (neurons) leads to a decrease in
neurotransmitters (substances secreted by a
neuron to send a message to another neuron). The
correct balance of neurotransmitters is critical
to the brain.
47Amyloid beta peptide is the predominant component
of senile plaques in brains of AD patients. It
is derived from amyloid-beta precursor protein
(APP) by consecutive proteolytic cleavage
by beta-secretase and gamma-secretase.
49What is the physiological role of APP?
- Cao X, Sudhof TC.
- A transcriptionally active complex of APP with Fe65 and histone acetyltransferase Tip60.
- Science. 2001 Jul 6; 293(5527): 115-20.
50Abstract of Cao and Sudhof
- Amyloid-beta precursor protein (APP), a widely
expressed cell-surface protein, is cleaved in the
transmembrane region by gamma-secretase.
gamma-Cleavage of APP produces the extracellular
amyloid beta-peptide of Alzheimer's disease and
releases an intracellular tail fragment of
unknown physiological function. We now
demonstrate that the cytoplasmic tail of APP
forms a multimeric complex with the nuclear
adaptor protein Fe65 and the histone
acetyltransferase Tip60. This complex potently
stimulates transcription via heterologous Gal4-
or LexA-DNA binding domains, suggesting that
release of the cytoplasmic tail of APP by
gamma-cleavage may function in gene expression.
51Take X = APP, Y = APBP1
- APBP1 encodes FE65
- Find BACE2 from our short list of LA score leaders.
- BACE2 encodes a beta-site APP-cleaving enzyme
52Take X = APP, Y = HTATIP; HTATIP encodes Tip60
- Finds PSEN1 (second place positive LA score leader),
- which encodes presenilin 1,
- a major component of
- gamma-secretase
54 Finding disease candidate genes by liquid
association Ker-Chau Li , Aarno
Palotie, Shinsheng Yuan, Denis Bronnikov, Daniel
Chen, Xuelian Wei, Oi-Wa Choi, Janna Saarela
and Leena Peltonen
55Multiple Sclerosis
56Multiple sclerosis
- MS is a chronic neurological disease,
characterized by multicentric inflammation,
demyelination and axonal damage, resulting in
heterogeneous clinical features, including
pareses, sensory symptoms and ataxia. The
classical clinical features include disturbances
in sensation and mobility. The typical age of
onset is between years 20 and 40, making MS one
of the most common neurological diseases of young
adults. Four genome-wide scans (US, UK, Canada,
and Finland) have revealed several putative
susceptibility loci, of which the loci on
chromosomes 6p, 5p, 17q and 19q have been
replicated in multiple study samples. More
recently, Professors Aarno Palotie and Leena
Peltonen's teams have generated a fine map on
17q22-q24 (Saarela et al 2002). They are now
interested in the functional aspect of the genes
in this region using microarray technology.
57Application finding candidate genes for Multiple
sclerosis
60- glutamate-induced excitotoxicity
- SLC1A3 is highly expressed in various brain
regions including cerebellum, frontal cortex,
basal ganglia and hippocampus. It encodes a
sodium-dependent glutamate/aspartate transporter
1 (GLAST). Glutamate and aspartate are excitatory
neurotransmitters that have been implicated in a
number of pathologic states of the nervous
system. Glutamate concentration in cerebrospinal
fluid rises in acute MS patients whilst glutamate
antagonist amantadine reduces MS relapse rate. In
EAE, the levels of GLAST and GLT-1 (SLC1A2) are
found down-regulated in spinal cord at the peak
of disease symptoms and no recovery was observed
after remission. We consider it highly encouraging
that several lines of evidence including both
genetic association and gene expression
association, would be consistent with the
glutamate-induced excitotoxicity hypothesis of
the mechanisms resulting in demyelination and
axonal damage in MS.
65International MS whole genome association study (2007)
- Affymetrix 500K to screen common genetic variants of 931 family trios.
- Using the on-line supplementary information provided, we found two SNPs, rs4869676 (chr5: 36641766) and rs4869675 (chr5: 36636676), with TDT p-values 0.0221 and 0.00399 respectively, in the upstream regulatory region of the SLC1A3 gene.
- In fact, within the 1Mb region of rs4869675, there are a total of 206 SNPs in the Affymetrix 500K chip. No other SNPs have a p-value less than that of rs4869675.
- The next most significant SNPs in this region are rs1343692 (chr5: 35860930) and rs6897932 (chr5: 35910332, the identified MS susceptibility SNP in the IL7R exon).
- The MS marker we identified, rs2562582 (chr5: 36641117), is less than 5K apart from rs4869675, but was not in the Affymetrix chip.
66A little bit late
- IL7R was found by LA a long time ago! See the attached e-mail, sent more than two years ago, in 2005.
- Begin forwarded message:
From: Ker Chau Li (local) <kcli_at_stat.ucla.edu>
Date: March 28, 2005 10:17:51 AM PST
To: Robert Yuan <syuan_at_stat.ucla.edu>, Aarno Palotie <APalotie_at_mednet.ucla.edu>, Daniel Chen <pharmacogenomics_at_yahoo.com>, Denis Bronnikov <denis_at_ucla.edu>, Palotie Leena <leena.peltonen_at_ktl.fi>
Cc: Ker Chau Li (local) <kcli_at_stat.ucla.edu>
- Subject: IL7R
- (I thought this e-mail should have been sent out already but it has not.) I take X = SLC1A3, Y = MBP, Z = any gene, using 2002 Atlas data. Two genes are from the short list of genes with highest LA scores: IL7R (interleukin 7 receptor) and HLA-A.
- IL7R is at 5p13. Interesting coincidence??
- Other interesting findings include: GFAP (glial fibrillary acidic protein) on 17q21 (Alexander disease); GRM3 (glutamate receptor, metabotropic); CDR1 (cerebellar degeneration-related protein 1); Ighg3 (immunoglobulin heavy constant gamma 3); Iglj3 (immunoglobulin lambda joining 3)
67A2M
- The output of a short list of 25 gene pairs with the best LA scores each from the positive and the negative ends is given in Additional data file 1 (Table S1). The statistical significance of the results of this gene search procedure is discussed in Additional data file 2 (Supplementary Text 3). We find that the gene A2M (encoding alpha-2-macroglobulin, a cytokine transporter and protease inhibitor) appears many times. We further find an interesting biologic functional association between A2M and MBP from some literature about the pathogenesis of MS. Following demyelination in human MS and rodent EAE, immunogenic MBP peptides are released into cerebrospinal fluid and serum (see Oksenberg and coworkers [2] for references) and A2M represents the major MBP-binding protein in human plasma [17]. A significant increase in alpha-2-macroglobulin is found in plasma of MS patients [18]. Analogously, in rodent EAE, infusion of alpha-2-macroglobulin significantly reduces disease symptoms [19].
68Genome-wide LA scores, X = MBP
69P-values by randomization. Each dot represents a
case of simulated X: highest corr vs. 20th
highest LA
70LAP website
71Basic workflow is simple
72User interface for browsing computation output
73Acknowledgement
- Mathematics in Biology (MIB), Institute of Statistical Science, Academia Sinica
- Web-based Liquid association development team
- Team leader: Dr. Shin-sheng Yuan
- IT specialists
- Guan-I Wu, Hung-Wei Tseng, Shi-Hsien Yang, Yi-Wei Chen, Chang-Dao Chen, Ying-Fu Ho
- Arabidopsis Gene Expression Analysis
- Dr. Ai-Ling Hour
74Acknowledgment
- Biodata refining group, UCLA Statistics
- http://kiefer.stat.ucla.edu/lap
- Shinsheng Yuan (chief architect for website development, gene-drug)
- Wei Sun (yeast segregation)
- Ching-Ti Liu (yeast protein complex)
- Tianwei Yu (Stress, gene ontology)
- Xuelian Wei (graphics, cancer, disease page), Tun-Hsian Yang (disease page)
- Yijing Shen (clustering), Tongtong Wu (Stress)
- Jack Li (graph)
75Lung Cancer project
- National Taiwan University Hospital
- Pan-Chyr Yang
- Sung-Liang Yu
- Hsuan-Yu Chen
76Causal analysis
- X, Y, Z
- X -> Y, X -> Z
- $Y = a + bX + \text{error}_Y$
- $Z = a' + b'X + \text{error}_Z$
- Partial correlation = $\mathrm{corr}(\text{error}_Y, \text{error}_Z)$
- X causes Y and Z if partial correlation = 0
- (X = Coke sale, Y = eye disease incidence rate, Z = season)
- Start with a pair of correlated genes Y, Z; find X to minimize partial correlation
- This is very different from LA.
77A limited goal: remove the trend
- Universal trend (affects all genes): could be artificial (due to chip technology imperfection)
- Localized trend (affects a limited number of highly expressed genes): likely to be biologically real
- Partial correlation can be used to detrend
- X = one gene, Y = one gene, Z = trend
- X* = residual after regressing X on Z
- Y* = residual after regressing Y on Z
- Find correlation between X* and Y*
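The detrending recipe above (regress each gene on the trend, then correlate the residuals) can be sketched as follows. The simulated profiles, noise levels, and helper names are hypothetical; the example only illustrates how a shared trend can inflate the raw correlation between two otherwise unrelated genes.

```python
import random
from statistics import fmean

def corr(x, y):
    """Pearson correlation of two equal-length lists."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def residuals(v, z):
    """Residuals of the least-squares regression of v on z (with intercept)."""
    mv, mz = fmean(v), fmean(z)
    slope = (sum((a - mv) * (c - mz) for a, c in zip(v, z))
             / sum((c - mz) ** 2 for c in z))
    return [(a - mv) - slope * (c - mz) for a, c in zip(v, z)]

def partial_corr(x, y, z):
    """Correlation between X and Y after removing the linear effect of Z."""
    return corr(residuals(x, z), residuals(y, z))

random.seed(0)
n = 1000
trend = [random.gauss(0, 1) for _ in range(n)]  # Z: shared trend
x = [t + random.gauss(0, 0.5) for t in trend]   # gene X = trend + own noise
y = [t + random.gauss(0, 0.5) for t in trend]   # gene Y = trend + own noise

raw = corr(x, y)                        # high, driven by the common trend
detrended = partial_corr(x, y, trend)   # near zero once the trend is removed
print(f"raw corr {raw:.2f}, partial corr {detrended:.2f}")
```

The residual-based value agrees with the usual closed-form expression for the partial correlation in terms of the three pairwise correlations.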
78- Maximizing the absolute value of partial corr, given a pair of variables.
- Partial corr = cosine(angle between two planes, the X,Z plane and the Y,Z plane)
- Consider the prediction of Z from X, Y.
- Fixing the error variance, the optimal Z should have the highest possible correlation with (X+Y) (if X, Y are positively correlated) or with (X-Y) (otherwise).