Transcript and Presenter's Notes

Title: The Less, The Better: other trends in genomic and genetic analysis


1
The Less, The Better: other trends in
genomic and genetic analysis
  • Wentian Li, Ph.D.
  • The Robert S. Boas Center for Genomics and Human
    Genetics
  • Feinstein Institute for Medical Research
  • North Shore-LIJ Health System
  • Manhasset, NY

CSHL, Feb 15, 2006 (modified from MSSM, May 11,
2005)
2
The More, The Better - genes
  • 3,000,000,000 bases of human DNA sequence
  • 25,000 human gene loci
  • if half of the 25,000 genes undergo alternative
    splicing with 3 variants each, that leads to
    50,000 gene products
  • Affymetrix Human Genome U133 Plus 2.0 array:
    54,000 probe sets covering 47,000 gene
    transcripts
  • Illumina Human-6 Expression BeadChip: 48,000
    transcripts per sample, six samples simultaneously

3
The More, The Better - polymorphisms
  • 1,000,000 markers of genetic variation typed in
    the HapMap project
  • an extra 4,600,000 SNPs will be tested in Phase II
    of HapMap
  • Affymetrix 500K array set with 250,000 SNPs each
  • Illumina's HumanHap300 BeadChip has 317,000 SNPs

4
The More, The Better - the future
  • Biotechnology developments will deliver even more
    expression data and polymorphism data, faster and
    at lower cost.
  • the $1,000 genome in 10-20 years (predicted in
    Nature, 2003)
  • 30 million SNPs would ensure 1 SNP per 100 bp

5
The Reason for Wanting More Data
  • When the search for the gene or genes responsible
    for a human disease has not yet led to any
    concrete result, people tend to think that they
    have not searched everywhere. The desire for more
    data and more information is to make sure the
    search space is exhausted.

6
The Reason for NOT Wanting More Data
  • The expanding lists of genes and polymorphisms
    will inevitably include more and more information
    unrelated to the human disease or phenotype under
    study. The irrelevant genes and markers have to
    be discarded when studying a particular disease
    or phenotype.

7
Lessons from other fields
  • Philosophy: Occam's (Ockham's) razor (c. 1300),
    the principle of parsimony
  • Engineering: signal-to-noise ratio
  • Physics: simplicity of fundamental theory
  • Animal psychology: Morgan's canon ("In no case is
    an animal activity to be interpreted in terms of
    higher psychological processes, if it can be
    fairly interpreted in terms of processes which
    stand lower in the scale of psychological
    evolution and development")
  • Computer science: the minimum description length
    (MDL) principle of Jorma Rissanen (1978); data
    compression, feature selection, curse of
    dimensionality
  • Statistical modeling: variable and model selection

8
"Plurality should not be assumed without
necessity" (Pluralitas non est ponenda sine
necessitate)
  • - William of Ockham (1287-1347)

9
"Simplicity is the ultimate sophistication"
  • - Leonardo da Vinci (1452-1519)

10
Two Implications of "The Less, The Better" in
genomic data analysis
  • Genes unrelated to the disease/phenotype should
    be excluded from an analysis of that
    disease/phenotype
  • If one gene or one environmental factor can
    explain a disease or phenotype, it is not
    necessary to use two genes or environmental
    factors in the model
  • (to be reevaluated later)

11
Modeling data with statistical models
  • Each gene appears as a variable in the model
  • A variable's contribution is weighted by a
    coefficient
  • A coefficient is a parameter of the (statistical)
    model
  • Simple models have fewer parameters (fewer
    variables)
  • Complicated models have more parameters (more
    variables)
  • For any model, its data-fitting performance is
    measured by the maximum likelihood (parameters
    are adjusted to achieve the maximum of the
    likelihood)
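The fitting step on this slide can be sketched for a one-variable logistic regression. This is a minimal pure-Python illustration (gradient ascent on the log-likelihood, with made-up toy data), not the software used in the talk:

```python
import math

def log_likelihood(xs, ys, b0, b1):
    """Log-likelihood of a one-variable logistic regression model."""
    ll = 0.0
    for x, y in zip(xs, ys):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
        ll += y * math.log(p) + (1 - y) * math.log(1 - p)
    return ll

def fit_logistic(xs, ys, lr=0.1, steps=2000):
    """Adjust the parameters to maximize the likelihood (gradient ascent)."""
    b0 = b1 = 0.0
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            g0 += y - p          # gradient w.r.t. the intercept
            g1 += (y - p) * x    # gradient w.r.t. the slope
        b0 += lr * g0 / len(xs)
        b1 += lr * g1 / len(xs)
    return b0, b1

# toy data: expression level x of one gene, phenotype y (0/1)
xs = [0.1, 0.4, 0.5, 0.9, 1.2, 1.5, 1.8, 2.0]
ys = [0,   0,   0,   1,   0,   1,   1,   1]
b0, b1 = fit_logistic(xs, ys)
print(b1 > 0)   # higher expression associated with higher risk
print(log_likelihood(xs, ys, b0, b1) > log_likelihood(xs, ys, 0.0, 0.0))
```

The fitted log-likelihood is the quantity that the later slides plug into AIC/BIC.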

12
Comparing models with different complexity
(different numbers of parameters/variables)
  • L̂1: the maximum likelihood for the model
    with one parameter/variable
  • L̂2: the maximum likelihood for the model
    with two parameters/variables
  • If L̂2 > L̂1 on the same dataset, does it mean
    model-2 is better than model-1? (No: adding a
    parameter can never decrease the maximum
    likelihood.)

13
Statistical Model Selection: compare the
following quantity instead
  • 2 log L̂ - a·p
  • L̂: the maximum likelihood
  • p: the number of free/fitting parameters in the
    statistical model
  • a: for the Akaike information criterion, a = 2;
    for the Bayesian information criterion, a is the
    logarithm of the sample size N
  • maximize the above expression (with the sign
    flipped it is called AIC or BIC, to be
    minimized)
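The criterion on this slide can be written out as a small helper. A sketch with illustrative likelihood values (not numbers from the talk), using the standard forms AIC = -2 log L̂ + 2p and BIC = -2 log L̂ + p log N, both minimized:

```python
import math

def aic(loglik, p):
    """Akaike information criterion: -2*logL + 2*p (smaller is better)."""
    return -2.0 * loglik + 2.0 * p

def bic(loglik, p, n):
    """Bayesian information criterion: -2*logL + p*log(N)."""
    return -2.0 * loglik + p * math.log(n)

# model 1: one variable; model 2: two variables on the same data (N = 100).
# Model 2 fits slightly better, but the gain of 0.5 in log-likelihood
# does not pay for the extra parameter under either criterion.
ll1, ll2 = -70.0, -69.5
print(aic(ll1, 1), aic(ll2, 2))             # 142.0 143.0 -> AIC picks model 1
print(bic(ll1, 1, 100) < bic(ll2, 2, 100))  # True -> BIC also picks model 1
```

Since log N > 2 for N ≥ 8, BIC charges more per parameter than AIC, which is exactly the comparison made on the next slide.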

14
(No Transcript)
15
Comparing BIC and AIC
  • If the sample size N is at least 8, BIC penalizes
    model complexity more than AIC (log 8 ≈ 2.08 > 2)

16
(No Transcript)
17
(No Transcript)
18
(No Transcript)
19
Child asthma data
  • N = 7,318 children
  • Mothers completed questionnaires during pregnancy
    (6 months) and post-natally (42 months)
  • Information on many potential risk factors; here
    p = 24
  • y = 1 for infant wheeze: 1,316 affected, 6,002
    unaffected
  • W Li, A Sherriff, X Liu (2000) "Assessing risk
    factors of human complex diseases by Akaike and
    Bayesian information criteria (AIC and BIC)",
    American Journal of Human Genetics, s67, page
    222.

20
(No Transcript)
21
(No Transcript)
22
(No Transcript)
23
(No Transcript)
24
Leukemia expression data
  • From Golub et al., Science 286:531-537 (1999)
  • Expression levels from two types of leukemia
    patients: acute myeloid leukemia (AML) and acute
    lymphoblastic leukemia (ALL)
  • N = 38 samples are used for training (an extra 34
    samples for validation/testing)
  • Number of genes: p = 7,129
  • From an old Affymetrix chip (negative values are
    possible). Not log transformed (the logistic
    regression model is not sensitive to value
    transformations).

25
(No Transcript)
26
3 more things examined in the leukemia data
  • Is there a clear separation between relevant
    (good) genes and irrelevant (bad) genes?
  • How does the performance on the validation set
    compare to that on the training set?
  • What is the effect of averaging many single-gene
    logistic regressions? (model averaging)
  • Wentian Li and Yaning Yang (2002), "How
    many genes are needed for a discriminant
    microarray data analysis", in Methods of
    Microarray Data Analysis, eds SM Lin and KF
    Johnson (Kluwer Academic), pp. 137-150
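The model-averaging question above can be illustrated in a few lines: fit one logistic model per gene, then average their predicted probabilities for a sample. The coefficients below are made up for illustration; this is not the paper's code:

```python
import math

def predict(x, b0, b1):
    """Predicted probability from a single-gene logistic model."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

# pretend three single-gene models were fitted separately
# (hypothetical intercept/slope pairs)
models = [(-1.0, 2.0), (-0.5, 1.5), (-2.0, 3.0)]
sample = [0.8, 1.1, 0.9]   # this sample's expression on the three genes

probs = [predict(x, b0, b1) for x, (b0, b1) in zip(sample, models)]
averaged = sum(probs) / len(probs)
print(0.0 < averaged < 1.0)   # True: the average is still a probability
```

Averaging many weak single-gene classifiers is one way to use more genes without fitting one large multi-gene model.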

27
(No Transcript)
28
(No Transcript)
29
(No Transcript)
30
Why an extreme value distribution?
  • The max-likelihood of a single gene (i): under
    the null, twice the log-likelihood ratio is
    approximately a chi-square variable
  • The max-likelihood of the top-ranking gene is a
    maximum of maxima (hence extreme-value)
  • For null data (no differentially expressed gene
    exists), the top-ranking statistic follows an
    extreme value distribution
  • The mean of the extreme value (maximum) of p
    chi-square variables grows roughly like 2 log p
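The growth of the maximum can be checked by simulation. This sketch (illustrative, not from the talk) draws p independent chi-square(1) values as squared standard normals and compares the mean of their maximum to 2 log p:

```python
import math
import random

random.seed(0)

def mean_max_chisq1(p, reps=200):
    """Mean of the maximum of p independent chi-square(1) variables,
    estimated over `reps` simulated null datasets."""
    total = 0.0
    for _ in range(reps):
        # a chi-square(1) variable is a squared standard normal
        total += max(random.gauss(0.0, 1.0) ** 2 for _ in range(p))
    return total / reps

for p in (10, 100, 1000):
    m = mean_max_chisq1(p)
    print(p, round(m, 2), round(2 * math.log(p), 2))
```

The simulated means track 2 log p: the best statistic among p null genes keeps growing with p, so the top gene must beat this extreme-value baseline, not the single-gene chi-square threshold.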

31
(No Transcript)
32
(No Transcript)
33
(No Transcript)
34
More information on the leukemia dataset
  • Considered to be an easy dataset
  • ALL samples can be further separated into two
    sub-sub-types (B-cell derived vs T-cell
    derived): inhomogeneity
  • Of the two misclassified samples (validation
    set), one could even be a labeling error

35
(No Transcript)
36
(No Transcript)
37
Potential problems with BIC/AIC
  • In the BIC/AIC approach, all parameters are
    considered equal. However, for some statistical
    models, such as recursive partitioning/CART, the
    variable used for the first split is more
    important than the variables in subsequent
    splits.
  • In the BIC/AIC approach, there is no distinction
    between the apparent number of parameters and the
    effective number of parameters. However, it is
    possible that weights/masks can be used to reduce
    the contribution of some variables, models, or
    parameters. (Bayesian variable selection? Model
    averaging?? Support vector machines???)
  • In the BIC/AIC approach, parameter values are
    assumed to be continuous. Sometimes there are
    choices in the model that are discrete (e.g. DNA
    segmentation); maximization over a parameter
    value is replaced by an extreme value problem.
  • The coefficients of principal components, which
    are admittedly parameters, can be removed easily
    by redefining the variables first.

38
(No Transcript)
39
Data modeling vs algorithmic modeling
  Data modeling:
  • 98% of all statisticians
  • academic statisticians
  • traditional statistics
  • goodness-of-fit, residual estimation
  • low prediction rate
  • well-defined model
  • high interpretability
  • computer is not crucial
  • example: (logistic) regression
  • fewer variables
  Algorithmic modeling:
  • 2% of statisticians, plus others
  • industrial statisticians
  • machine learning, data mining
  • predictive accuracy
  • high prediction rate
  • unknown model, black box
  • low interpretability
  • computationally intensive
  • example: random forest, SVM
  • more variables

40
"Newer methods in machine learning thrive on
variables - the more the better"
  • - Leo Breiman, in "Statistical Modeling: The Two
    Cultures" (2001)

41
"In the early literature on the support vector
machine, there are claims that the support vector
machine allows one to finesse the curse of
dimensionality. Neither of these claims is true."
  • - Trevor Hastie, Robert Tibshirani, Jerome
    Friedman, in The Elements of Statistical Learning
    (2001)

42
Criticisms (from those who prefer "the less, the
better")
  • D.R. Cox:
  • prediction is not everything (e.g. political
    science)
  • weakness of the data quality is to blame, not
    the analysis
  • predictive success is not the primary basis
    for model choice; natural scientists are more
    interested in establishing causal relations
  • the perfect separation in higher dimensions in
    SVM may be unstable after all
  • Brad Efron:
  • the consequence of low variance is large bias
  • complicated methods are harder to criticize
  • the prediction culture contains more than 2%
    of statisticians
  • identifying causal factors is the ultimate
    goal, not prediction
  • the whole point of science is to open black
    boxes and build better ones

43
Revisiting the debate: "the less, the better"
vs "the more, the better"
  • data modeling still has its appeal because it
    offers a possible biological explanation of the
    data
  • quite often, the model with the better (if not
    the best) prediction rate is the one with fewer
    parameters
  • algorithmic modeling points to the possibility
    of multiple equivalent/similar genes: one should
    keep them all
  • algorithmic modeling is appealing when genetic
    and environmental factors interplay, and a
    phenotype is truly caused by a large number of
    factors

44
More biological thinking on this issue
  • "more" has its limit: ~50k genes, a few million
    markers
  • "less" also has its bound: for complex diseases,
    one shouldn't look for only 1 gene; it must be a
    few
  • a linear combination of gene expressions is hard
    to explain in biological terms; trees and
    stratifications are much easier to explain
  • data quality is a real issue (heterogeneity), so
    simulation cannot completely recreate the real
    situation; this data heterogeneity may mislead
    on the predictive performance

45
(No Transcript)
46
Marchini et al., Nature Genetics (April 2005)
  • Assume a two-locus interaction (epistasis) exists
    and the marginal effect is low.
  • In a genetic case-control analysis, should we
    carry out (1) a locus-by-locus analysis, (2) a
    search over all two-locus combinations, or (3) a
    compromise two-stage analysis?
  • Measure success by whether one locus is found, or
    both loci are found.
  • The conclusion depends on the interaction model,
    the minor allele frequency, and the linkage
    disequilibrium between the marker and the
    disease-causing mutation
  • In many situations, strategy (2) is more
    successful than (1)

47
(No Transcript)
48
A small victory for "the more, the better": the
two-locus model is better than the single-locus
approach when the data is simulated from the
two-locus interaction model, even after paying
the multiple-testing price!
  • Conclusion: both sides have some truth in them.
    Ultimately,
  • the number of variables/parameters used should
    match the number of true causal factors, whether
    it is small (Mendelian) or large (complex)
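The multiple-testing price mentioned above is easy to quantify. A sketch (with an illustrative marker count, not the numbers from the paper) of Bonferroni-corrected per-test thresholds for a single-locus scan versus an exhaustive two-locus scan:

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level after Bonferroni correction."""
    return alpha / n_tests

loci = 300_000                     # illustrative marker count
alpha = 0.05
single = loci                      # one test per locus
pairs = loci * (loci - 1) // 2     # all two-locus combinations

print(pairs)                                # 44999850000 pairwise tests
print(bonferroni_threshold(alpha, single))  # ~1.67e-07 per single-locus test
print(bonferroni_threshold(alpha, pairs) < bonferroni_threshold(alpha, single))
```

The pairwise scan pays a threshold roughly 150,000 times stricter, which is the price that the simulated two-locus interaction signal still manages to overcome.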

49
Data Sources/Collaborators
  • Andrea Sherriff (epidemiology)
  • Dale Nyholt (linkage analysis)
  • Yaning Yang (microarray, Zipf's law)
  • Fengzhu Sun (extreme-value distribution)
  • Ivo Grosse (extreme-value distribution)
  • Rajeev Azad (DNA segmentation)
  • Pedro Bernaola-Galvan (DNA segmentation)