Title: The Less, The Better: another trend in genomic and genetic analysis
1 The Less, The Better: another trend in genomic and genetic analysis
- Wentian Li, Ph.D.
- The Robert S. Boas Center for Genomics and Human Genetics
- Feinstein Institute for Medical Research
- North Shore-LIJ Health System
- Manhasset, NY
CSHL, Feb 15, 2006 (modified from MSSM, May 11, 2005)
2 The More, The Better - genes
- 3,000,000,000 bases of human DNA sequence
- 25,000 human gene loci
- If half of the 25,000 genes have alternative splicing with 3 variants, that leads to 50,000 gene products
- Affymetrix Human Genome U133 Plus 2.0 array: 54,000 probe sets on 47,000 gene transcripts
- Illumina Human-6 Expression BeadChip: 48,000 transcripts per sample, 6 samples simultaneously
3 The More, The Better - polymorphisms
- 1,000,000 markers of genetic variation typed in the HapMap project
- An extra 4,600,000 SNPs will be tested in Phase II of HapMap
- Affymetrix 500K array set with 250,000 SNPs each
- Illumina's HumanHap300 BeadChip has 317,000 SNPs
4 The More, The Better - the future
- Biotechnology developments will deliver even more expression and polymorphism data, more quickly and less expensively
- The "$1000 genome" in 10-20 years (predicted in Nature, 2003)
- 30 million SNPs would ensure 1 SNP per 100 bp
5 The Reason for Wanting More Data
- When the search for the gene or genes responsible for a human disease has not yet led to any concrete result, people tend to think that they have not searched everywhere. The desire for more data and more information is to make sure the search space is exhausted.
6 The Reason for NOT Wanting More Data
- The expansion of gene lists and polymorphism data will inevitably include more and more information unrelated to the human disease or phenotype under study. The irrelevant genes and markers have to be discarded when studying a particular human disease or phenotype.
7 Lessons from other fields
- Philosophy: Occam's (Ockham's) razor (c. 1300), the principle of parsimony
- Engineering: signal-to-noise ratio
- Physics: simplicity of fundamental theories
- Animal psychology: Morgan's canon ("In no case is an animal activity to be interpreted in terms of higher psychological processes if it can be fairly interpreted in terms of processes which stand lower in the scale of psychological evolution and development")
- Computer science: the minimum description length (MDL) principle by Jorma Rissanen (1978); data compression; feature selection; the curse of dimensionality
- Statistical modeling: variable and model selection
8 "Plurality should not be assumed without necessity" (Pluralitas non est ponenda sine necessitate)
- William of Ockham (1287-1347)
9 "Simplicity is the ultimate sophistication"
- Leonardo da Vinci (1452-1519)
10 Two Implications of "The Less, The Better" for genomic data analysis
- Genes unrelated to the disease/phenotype should be excluded from an analysis of that disease/phenotype
- If one gene or one environmental factor can explain a disease or phenotype, it is not necessary to use two genes or environmental factors in the model (to be re-evaluated later)
11 Modeling data with statistical models
- Each gene appears as a variable in the model
- A variable's contribution is weighted by a coefficient
- A coefficient is a parameter in the (statistical) model
- Simple models have fewer parameters (fewer variables)
- Complicated models have more parameters (more variables)
- For any model, data-fitting performance is measured by the maximum likelihood (parameters are adjusted to achieve the maximum of the likelihood)
12 Comparing models of different complexity (different numbers of parameters/variables)
- The max-likelihood for the model with one parameter/variable
- The max-likelihood for the model with two parameters/variables
- If the two-parameter model attains the higher maximum likelihood on the same dataset (as it always will, since the one-parameter model is nested within it), does that mean model 2 is better than model 1?
13 Statistical model selection: compare the following quantity instead
- log L-hat - (a/2) p, where
- L-hat is the maximum likelihood
- p is the number of free/fitted parameters used in the statistical model
- a = 2 for the Akaike information criterion; a = log(N), the logarithm of the sample size, for the Bayesian information criterion
- Maximize the above expression (written with a minus sign, -2 log L-hat + a p, it is called AIC or BIC and is minimized)
15 Comparing BIC and AIC
- If the sample size N is 8 or more, BIC penalizes model complexity more than AIC, since log(8) is about 2.08, which exceeds AIC's penalty coefficient of 2
19 Child asthma data
- N = 7318 children
- Mothers completed questionnaires during pregnancy (6 months) and post-natally (42 months)
- Information on many potential risk factors; here p = 24
- y = 1 for infant wheeze: 1316 affected, 6002 unaffected
- W Li, A Sherriff, X Liu (2000), "Assessing risk factors of human complex diseases by Akaike and Bayesian information criteria (AIC and BIC)", American Journal of Human Genetics, s67, page 222
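For a binary outcome and a binary risk factor, the single-factor logistic model's maximum likelihood has a closed form from the 2x2 contingency table, so BIC screening of candidate risk factors can be sketched in a few lines. The counts below are invented for illustration (only the overall 1316/6002 affected/unaffected split echoes the study); this is not the actual analysis from the paper:

```python
import math

def loglik(cases, controls):
    """Maximized Bernoulli log-likelihood for one group of subjects."""
    n = cases + controls
    ll = 0.0
    for k in (cases, controls):
        if k > 0:
            ll += k * math.log(k / n)
    return ll

def bic_gain(exposed, unexposed, n_total):
    """BIC improvement of the one-factor logistic model over the null.

    exposed / unexposed are (cases, controls) pairs; a positive gain
    means the risk factor is worth keeping under BIC."""
    cases = exposed[0] + unexposed[0]
    controls = exposed[1] + unexposed[1]
    ll_null = loglik(cases, controls)                    # 1 parameter
    ll_factor = loglik(*exposed) + loglik(*unexposed)    # 2 parameters
    # BIC(null) - BIC(factor) = 2*(ll_factor - ll_null) - log(N) * (2 - 1)
    return 2.0 * (ll_factor - ll_null) - math.log(n_total)

# Invented counts; only the overall 1316 / 6002 split echoes the study.
print(bic_gain((400, 1000), (916, 5002), 7318))   # real risk factor: positive
print(bic_gain((660, 3000), (656, 3002), 7318))   # irrelevant factor: negative
```

Screening each of the p = 24 factors this way, and keeping only those with positive BIC gain, is one concrete form of "the less, the better."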
24 Leukemia expression data
- From Golub et al., Science 286:531-537 (1999)
- Expression levels from two types of leukemia patients: acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL)
- N = 38 samples used for training (an extra 34 samples for validation/testing)
- Number of genes: p = 7129
- From an old Affymetrix chip (negative values are possible); not log-transformed (the logistic regression model is not sensitive to value transformations)
26 Three more things examined in the leukemia data
- Is there a clear separation between relevant ("good") genes and irrelevant ("bad") genes?
- How does performance on the validation set compare to that on the training set?
- What is the effect of averaging many single-gene logistic regressions? (model averaging)
- Wentian Li and Yaning Yang (2002), "How many genes are needed for a discriminant microarray data analysis", in Methods of Microarray Data Analysis, eds. SM Lin and KF Johnson (Kluwer Academic), pp. 137-150
30 Why an extreme-value distribution?
- The max-likelihood of a single gene (i)
- The max-likelihood of the top-ranking gene is the maximum of these maxima: an extreme value
- For null data (no differentially expressed gene exists), this extreme value follows an extreme-value distribution
- The mean of the extreme value of chi-square variables grows only logarithmically with the number of genes
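A quick simulation of this point, under the standard assumption that each gene's doubled log-likelihood gain is chi-square(1) distributed under the null: the best score among p null genes grows only like the logarithm of p, far more slowly than p itself. The replication count and the p values are arbitrary choices for the demo:

```python
import math
import random

random.seed(1)

def max_chisq1(p):
    """Maximum of p independent chi-square(1) variables
    (each is the square of a standard normal deviate)."""
    return max(random.gauss(0.0, 1.0) ** 2 for _ in range(p))

# The top-ranking gene's statistic under the null is the max of p
# chi-square(1) values: an extreme-value problem, not an average.
for p in (100, 1000, 10000):
    mean_max = sum(max_chisq1(p) for _ in range(300)) / 300
    print(p, round(mean_max, 1), round(2 * math.log(p), 1))
```

This is why a top-ranking gene's likelihood gain must be judged against an extreme-value (not an ordinary chi-square) null distribution.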
34 More information on the leukemia dataset
- Considered an easy dataset
- ALL samples can be further separated into two sub-types (B-cell vs. T-cell origin): inhomogeneity
- Of the two misclassified samples (validation set), one could even be a labeling error
37 Potential problems with BIC/AIC
- In the BIC/AIC approach, all parameters are considered equal. However, for some statistical models, such as recursive partitioning/CRT, the variable used for the first split is more important than the variables in subsequent splits.
- In the BIC/AIC approach, there is no difference between the apparent number of parameters and the effective number of parameters. However, weights/masks can be used to reduce the contribution from some variables, models, or parameters. (Bayesian variable selection? Model averaging? Support vector machines?)
- In the BIC/AIC approach, parameter values are assumed to be continuous. Sometimes choices in the model are discrete (e.g. DNA segmentation); maximization over a parameter value is then replaced by an extreme-value problem.
- The coefficients of principal components, which are admittedly parameters, can easily be removed by redefining the variables first.
39 Data modeling vs. algorithmic modeling
Data modeling:
- 98% of all statisticians
- academic statisticians
- traditional statistics
- goodness-of-fit, residual estimation
- low prediction rate
- well-defined model
- high interpretability
- computer is not crucial
- example: (logistic) regression
- fewer variables
Algorithmic modeling:
- 2% of statisticians, plus others
- industrial statisticians
- machine learning, data mining
- predictive accuracy
- high prediction rate
- unknown model, a black box
- low interpretability
- computationally intensive
- example: random forests, SVM
- more variables
40 "Newer methods in machine learning thrive on variables: the more the better"
- Leo Breiman, in "Statistical Modeling: The Two Cultures" (2001)
41 "In the early literature on support vector machines, there were claims that the support vector machine allows one to finesse the curse of dimensionality. Neither of these claims is true."
- Trevor Hastie, Robert Tibshirani, Jerome Friedman, in The Elements of Statistical Learning (2001)
42 Criticisms (from those who prefer "the less, the better")
- D.R. Cox:
  - prediction is not everything (e.g. in political science)
  - weakness of the data quality is to blame, not the analysis
  - predictive success is not the primary basis for model choice; natural scientists are more interested in establishing causal relations
  - the perfect separation in higher dimensions in SVM may be unstable after all
- Brad Efron:
  - a consequence of low variance is large bias
  - complicated methods are harder to criticize
  - the prediction culture contains more than 2% (of statisticians)
  - identifying causal factors is the ultimate goal, not prediction
  - the whole point of science is to open black boxes and build better ones
43 Revisiting the debate: "the less, the better" vs. "the more, the better"
- Data modeling still has its appeal because it offers a possible biological explanation of the data
- Quite often, the model with a better (if not the best) prediction rate is the one with fewer parameters
- Algorithmic modeling points to the possibility of multiple equivalent/similar genes; one should keep them all
- Algorithmic modeling is appealing when genetic and environmental factors interplay, and a phenotype is truly caused by a large number of factors
44 More biological thinking on this issue
- "More" has its limit: 50k genes, a few million markers
- "Less" also has its bound: for complex diseases, one shouldn't look for only 1 gene; there must be a few
- A linear combination of gene expressions is hard to explain in biological terms; trees and stratifications are much easier to explain
- Data quality is a real issue (heterogeneity), so simulation cannot completely recreate the real situation; this data heterogeneity may mislead on predictive performance
46 Marchini et al., Nature Genetics (April 2005)
- Assume a two-locus interaction (epistasis) exists and the marginal effects are low
- In genetic case-control analysis, should we carry out (1) locus-by-locus analysis, (2) a search over all two-locus combinations, or (3) a compromise two-stage analysis?
- Measure success by whether one locus is found, or both loci are found
- The conclusion depends on the interaction model, the minor allele frequency, and the linkage disequilibrium between the marker and the disease-causing mutation
- In many situations, strategy (2) is more successful than (1)
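A toy illustration of this setting, with made-up penetrances (not Marchini et al.'s models): an XOR-like two-locus interaction in which each locus alone has (almost) no marginal effect, so a locus-by-locus chi-square test sees nothing while the joint two-locus test is highly significant:

```python
import random

random.seed(2)

def chi_sq(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = rows[i] * cols[j] / n
            stat += (obs - exp) ** 2 / exp
    return stat

# XOR-like epistasis with invented penetrances: risk is elevated only
# when exactly one of the two loci carries the variant.
n = 4000
data = []
for _ in range(n):
    a = random.random() < 0.5          # carrier status at locus A
    b = random.random() < 0.5          # carrier status at locus B
    d = random.random() < (0.3 if a != b else 0.1)
    data.append((a, b, d))

single = [[0, 0], [0, 0]]              # locus A alone vs. disease (df = 1)
joint = [[0, 0] for _ in range(4)]     # all 4 two-locus combos (df = 3)
for a, b, d in data:
    single[int(a)][int(d)] += 1
    joint[2 * int(a) + int(b)][int(d)] += 1

print("single-locus chi-square:", round(chi_sq(single), 1))  # no marginal signal
print("two-locus chi-square:  ", round(chi_sq(joint), 1))    # strong joint signal
```

The exhaustive pairwise scan over p SNPs runs roughly p/2 times as many tests as the single-locus scan, so its Bonferroni threshold is stiffer; a joint signal of this size can still overcome that price, which is the point of the next slide.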
48 A small victory for "the more, the better": the two-locus model beats the single-locus approach when the data are simulated from a two-locus interaction model, even after paying the multiple-testing price!
- Conclusion: both sides have some truth in them. Ultimately, the number of variables/parameters used should match the number of true causal factors, whether that number is small (Mendelian) or large (complex).
49 Data Sources/Collaborators
- Andrea Sherriff (epidemiology)
- Dale Nyholt (linkage analysis)
- Yaning Yang (microarray, Zipf's law)
- Fengzhu Sun (extreme-value distribution)
- Ivo Grosse (extreme-value distribution)
- Rajeev Azad (DNA segmentation)
- Pedro Bernaola-Galvan (DNA segmentation)