Title: The Less, The Better: another trend in genomic and genetic analysis
1 The Less, The Better: another trend in genomic and genetic analysis
- Wentian Li, Ph.D.
- The Robert S. Boas Center for Genomics and Human Genetics
- Feinstein Institute for Medical Research
- North Shore-LIJ Health System
- Manhasset, NY
CSHL, Feb 15, 2006 (modified from MSSM, May 11, 2005)
2 The More, The Better - genes
- 3,000,000,000 bases of human DNA sequence
- 25,000 human gene loci
- If half of the 25,000 genes have alternative splicing with 3 variants, that leads to 50,000 gene products
- Affymetrix Human Genome U133 Plus 2.0 array: 54,000 probe sets on 47,000 gene transcripts
- Illumina Human-6 Expression BeadChip: 48,000 transcripts per sample, 6 samples simultaneously
3 The More, The Better - polymorphisms
- 1,000,000 markers of genetic variation typed in the HapMap project
- An extra 4,600,000 SNPs will be tested in Phase II of HapMap
- Affymetrix 500K array set with 250,000 SNPs each
- Illumina's HumanHap300 BeadChip has 317,000 SNPs
4 The More, The Better - the future
- Biotechnology developments will deliver even more expression and polymorphism data, more quickly and less expensively
- The "$1000 genome" in 10-20 years (predicted in Nature, 2003)
- 30 million SNPs would ensure 1 SNP per 100 bp
5 The Reason for Wanting More Data
- When the search for the gene or genes responsible for a human disease has not yet led to any concrete result, people tend to think that they have not searched everywhere. The desire for more data and more information is to make sure the search space is exhausted.
6 The Reason for NOT Wanting More Data
- The expansion of gene lists and polymorphism data will inevitably include more and more information unrelated to the human disease or phenotype under study. The irrelevant genes and markers have to be discarded when studying a particular human disease or phenotype.
7 Lessons from other fields
- Philosophy: Occam's (Ockham's) razor (c. 1300), the principle of parsimony
- Engineering: signal-to-noise ratio
- Physics: simplicity of fundamental theories
- Animal psychology: Morgan's canon ("In no case is an animal activity to be interpreted in terms of higher psychological processes if it can be fairly interpreted in terms of processes which stand lower in the scale of psychological evolution and development")
- Computer science: the minimum description length (MDL) principle by Jorma Rissanen (1978); data compression; feature selection; the curse of dimensionality
- Statistical modeling: variable and model selection
8 "Plurality should not be assumed without necessity" (Pluralitas non est ponenda sine necessitate)
- William of Ockham (1287-1347)
9 "Simplicity is the ultimate sophistication"
- Leonardo da Vinci (1452-1519)
10 Two Implications of "The Less, The Better" for genomic data analysis
- Genes unrelated to the disease/phenotype should be excluded from an analysis of that disease/phenotype
- If one gene or one environmental factor can explain a disease or phenotype, it is not necessary to use two genes or environmental factors in the model (to be re-evaluated later)
11 Modeling data with statistical models
- Each gene appears as a variable in the model
- A variable's contribution is weighted by a coefficient
- A coefficient is a parameter in the (statistical) model
- Simple models have fewer parameters (fewer variables)
- Complicated models have more parameters (more variables)
- For any model, data-fitting performance is measured by the maximum likelihood (parameters are adjusted to achieve the maximum of the likelihood)
12 Comparing models of different complexity (different numbers of parameters/variables)
- The max-likelihood for the model with one parameter/variable
- The max-likelihood for the model with two parameters/variables
- If the two-parameter model attains the higher maximum likelihood on the same dataset (as it always will, since the one-parameter model is nested within it), does that mean model 2 is better than model 1?
13 Statistical model selection: compare the following quantity instead
- log L-hat - (a/2) p, where
- L-hat is the maximum likelihood
- p is the number of free/fitted parameters used in the statistical model
- a = 2 for the Akaike information criterion; a = log(N), the logarithm of the sample size, for the Bayesian information criterion
- Maximize the above expression (written with a minus sign, -2 log L-hat + a p, it is called AIC or BIC and is minimized)
15 Comparing BIC and AIC
- If the sample size N is 8 or more, BIC penalizes model complexity more than AIC, since log(8) is about 2.08, which exceeds AIC's penalty coefficient of 2
19 Child asthma data
- N = 7318 children
- Mothers completed questionnaires during pregnancy (6 months) and post-natally (42 months)
- Information on many potential risk factors; here p = 24
- y = 1 for infant wheeze: 1316 affected, 6002 unaffected
- W Li, A Sherriff, X Liu (2000), "Assessing risk factors of human complex diseases by Akaike and Bayesian information criteria (AIC and BIC)", American Journal of Human Genetics, s67, page 222
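For a binary outcome and a binary risk factor, the single-factor logistic model's maximum likelihood has a closed form from the 2x2 contingency table, so BIC screening of candidate risk factors can be sketched in a few lines. The counts below are invented for illustration (only the overall 1316/6002 affected/unaffected split echoes the study); this is not the actual analysis from the paper:

```python
import math

def loglik(cases, controls):
    """Maximized Bernoulli log-likelihood for one group of subjects."""
    n = cases + controls
    ll = 0.0
    for k in (cases, controls):
        if k > 0:
            ll += k * math.log(k / n)
    return ll

def bic_gain(exposed, unexposed, n_total):
    """BIC improvement of the one-factor logistic model over the null.

    exposed / unexposed are (cases, controls) pairs; a positive gain
    means the risk factor is worth keeping under BIC."""
    cases = exposed[0] + unexposed[0]
    controls = exposed[1] + unexposed[1]
    ll_null = loglik(cases, controls)                    # 1 parameter
    ll_factor = loglik(*exposed) + loglik(*unexposed)    # 2 parameters
    # BIC(null) - BIC(factor) = 2*(ll_factor - ll_null) - log(N) * (2 - 1)
    return 2.0 * (ll_factor - ll_null) - math.log(n_total)

# Invented counts; only the overall 1316 / 6002 split echoes the study.
print(bic_gain((400, 1000), (916, 5002), 7318))   # real risk factor: positive
print(bic_gain((660, 3000), (656, 3002), 7318))   # irrelevant factor: negative
```

Screening each of the p = 24 factors this way, and keeping only those with positive BIC gain, is one concrete form of "the less, the better."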
24 Leukemia expression data
- From Golub et al., Science 286:531-537 (1999)
- Expression levels from two types of leukemia patients: acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL)
- N = 38 samples used for training (an extra 34 samples for validation/testing)
- Number of genes: p = 7129
- From an old Affymetrix chip (negative values are possible); not log-transformed (the logistic regression model is not sensitive to value transformations)
26 Three more things examined in the leukemia data
- Is there a clear separation between relevant ("good") genes and irrelevant ("bad") genes?
- How does performance on the validation set compare to that on the training set?
- What is the effect of averaging many single-gene logistic regressions? (model averaging)
- Wentian Li and Yaning Yang (2002), "How many genes are needed for a discriminant microarray data analysis", in Methods of Microarray Data Analysis, eds. SM Lin and KF Johnson (Kluwer Academic), pp. 137-150
30 Why an extreme-value distribution?
- The max-likelihood of a single gene (i)
- The max-likelihood of the top-ranking gene is the maximum of these maxima: an extreme value
- For null data (no differentially expressed gene exists), this extreme value follows an extreme-value distribution
- The mean of the extreme value of chi-square variables grows only logarithmically with the number of genes
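A quick simulation of this point, under the standard assumption that each gene's doubled log-likelihood gain is chi-square(1) distributed under the null: the best score among p null genes grows only like the logarithm of p, far more slowly than p itself. The replication count and the p values are arbitrary choices for the demo:

```python
import math
import random

random.seed(1)

def max_chisq1(p):
    """Maximum of p independent chi-square(1) variables
    (each is the square of a standard normal deviate)."""
    return max(random.gauss(0.0, 1.0) ** 2 for _ in range(p))

# The top-ranking gene's statistic under the null is the max of p
# chi-square(1) values: an extreme-value problem, not an average.
for p in (100, 1000, 10000):
    mean_max = sum(max_chisq1(p) for _ in range(300)) / 300
    print(p, round(mean_max, 1), round(2 * math.log(p), 1))
```

This is why a top-ranking gene's likelihood gain must be judged against an extreme-value (not an ordinary chi-square) null distribution.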
34 More information on the leukemia dataset
- Considered an easy dataset
- ALL samples can be further separated into two sub-types (B-cell vs. T-cell origin): inhomogeneity
- Of the two misclassified samples (validation set), one could even be a labeling error
37 Potential problems with BIC/AIC
- In the BIC/AIC approach, all parameters are considered equal. However, for some statistical models, such as recursive partitioning/CRT, the variable used for the first split is more important than the variables in subsequent splits.
- In the BIC/AIC approach, there is no difference between the apparent number of parameters and the effective number of parameters. However, weights/masks can be used to reduce the contribution from some variables, models, or parameters. (Bayesian variable selection? Model averaging? Support vector machines?)
- In the BIC/AIC approach, parameter values are assumed to be continuous. Sometimes choices in the model are discrete (e.g. DNA segmentation); maximization over a parameter value is then replaced by an extreme-value problem.
- The coefficients of principal components, which are admittedly parameters, can easily be removed by redefining the variables first.
39 Data modeling vs. algorithmic modeling
Data modeling:
- 98% of all statisticians
- academic statisticians
- traditional statistics
- goodness-of-fit, residual estimation
- low prediction rate
- well-defined model
- high interpretability
- computer is not crucial
- example: (logistic) regression
- fewer variables
Algorithmic modeling:
- 2% of statisticians, plus others
- industrial statisticians
- machine learning, data mining
- predictive accuracy
- high prediction rate
- unknown model, a black box
- low interpretability
- computationally intensive
- example: random forests, SVM
- more variables
40 "Newer methods in machine learning thrive on variables: the more the better"
- Leo Breiman, in "Statistical Modeling: The Two Cultures" (2001)
41 "In the early literature on support vector machines, there were claims that the support vector machine allows one to finesse the curse of dimensionality. Neither of these claims is true."
- Trevor Hastie, Robert Tibshirani, Jerome Friedman, in The Elements of Statistical Learning (2001)
42 Criticisms (from those who prefer "the less, the better")
- D.R. Cox:
  - prediction is not everything (e.g. in political science)
  - weakness of the data quality is to blame, not the analysis
  - predictive success is not the primary basis for model choice; natural scientists are more interested in establishing causal relations
  - the perfect separation in higher dimensions in SVM may be unstable after all
- Brad Efron:
  - a consequence of low variance is large bias
  - complicated methods are harder to criticize
  - the prediction culture contains more than 2% (of statisticians)
  - identifying causal factors is the ultimate goal, not prediction
  - the whole point of science is to open black boxes and build better ones
43 Revisiting the debate: "the less, the better" vs. "the more, the better"
- Data modeling still has its appeal because it offers a possible biological explanation of the data
- Quite often, the model with a better (if not the best) prediction rate is the one with fewer parameters
- Algorithmic modeling points to the possibility of multiple equivalent/similar genes; one should keep them all
- Algorithmic modeling is appealing when genetic and environmental factors interplay, and a phenotype is truly caused by a large number of factors
44 More biological thinking on this issue
- "More" has its limit: 50k genes, a few million markers
- "Less" also has its bound: for complex diseases, one shouldn't look for only 1 gene; there must be a few
- A linear combination of gene expressions is hard to explain in biological terms; trees and stratifications are much easier to explain
- Data quality is a real issue (heterogeneity), so simulation cannot completely recreate the real situation; this data heterogeneity may mislead on predictive performance
46 Marchini et al., Nature Genetics (April 2005)
- Assume a two-locus interaction (epistasis) exists and the marginal effects are low
- In genetic case-control analysis, should we carry out (1) locus-by-locus analysis, (2) a search over all two-locus combinations, or (3) a compromise two-stage analysis?
- Measure success by whether one locus is found, or both loci are found
- The conclusion depends on the interaction model, the minor allele frequency, and the linkage disequilibrium between the marker and the disease-causing mutation
- In many situations, strategy (2) is more successful than (1)
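A toy illustration of this setting, with made-up penetrances (not Marchini et al.'s models): an XOR-like two-locus interaction in which each locus alone has (almost) no marginal effect, so a locus-by-locus chi-square test sees nothing while the joint two-locus test is highly significant:

```python
import random

random.seed(2)

def chi_sq(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = rows[i] * cols[j] / n
            stat += (obs - exp) ** 2 / exp
    return stat

# XOR-like epistasis with invented penetrances: risk is elevated only
# when exactly one of the two loci carries the variant.
n = 4000
data = []
for _ in range(n):
    a = random.random() < 0.5          # carrier status at locus A
    b = random.random() < 0.5          # carrier status at locus B
    d = random.random() < (0.3 if a != b else 0.1)
    data.append((a, b, d))

single = [[0, 0], [0, 0]]              # locus A alone vs. disease (df = 1)
joint = [[0, 0] for _ in range(4)]     # all 4 two-locus combos (df = 3)
for a, b, d in data:
    single[int(a)][int(d)] += 1
    joint[2 * int(a) + int(b)][int(d)] += 1

print("single-locus chi-square:", round(chi_sq(single), 1))  # no marginal signal
print("two-locus chi-square:  ", round(chi_sq(joint), 1))    # strong joint signal
```

The exhaustive pairwise scan over p SNPs runs roughly p/2 times as many tests as the single-locus scan, so its Bonferroni threshold is stiffer; a joint signal of this size can still overcome that price, which is the point of the next slide.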
48 A small victory for "the more, the better": the two-locus model beats the single-locus approach when the data are simulated from a two-locus interaction model, even after paying the multiple-testing price!
- Conclusion: both sides have some truth in them. Ultimately, the number of variables/parameters used should match the number of true causal factors, whether that number is small (Mendelian) or large (complex).
49 Data Sources/Collaborators
- Andrea Sherriff (epidemiology)
- Dale Nyholt (linkage analysis)
- Yaning Yang (microarray, Zipf's law)
- Fengzhu Sun (extreme-value distribution)
- Ivo Grosse (extreme-value distribution)
- Rajeev Azad (DNA segmentation)
- Pedro Bernaola-Galvan (DNA segmentation)