The Causes of Variation - PowerPoint PPT Presentation

Title: PowerPoint Presentation Author: Preferred Customer Last modified by: Lindon Eaves Created Date: 12/28/1999 11:02:15 PM Document presentation format – PowerPoint PPT presentation

1
The Causes of Variation
  • Lindon Eaves and Tim York
  • Boulder, CO
  • March 2001

2
One Issue (Among Many!)
  • Identifying genes that cause complex diseases and
    genes that contribute to variation in
    quantitative traits

3
Quantitative Trait Locus (QTL)
  • Any gene whose contribution to variation in a
    quantitative trait is large enough to stand out
    against the background noise of other genetic and
    environmental factors

4
Quantitative Trait
  • A continuously variable trait (in which variation
    may be caused by multiple genetic and/or
    environmental factors) any categorical trait in
    which differences between categories may be
    mapped onto variation in a continuous trait
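One common way to map a categorical trait onto a continuous one is a liability-threshold model. The sketch below assumes a standard normal liability scale (an assumption for illustration, not something stated on the slide) and finds the threshold that yields a given prevalence. The function names and the example prevalence are hypothetical.

```python
from statistics import NormalDist


def liability_threshold(prevalence: float) -> float:
    """Threshold on a standard normal liability scale above which
    a fraction `prevalence` of the population is affected."""
    if not 0.0 < prevalence < 1.0:
        raise ValueError("prevalence must be strictly between 0 and 1")
    return NormalDist().inv_cdf(1.0 - prevalence)


def is_affected(liability: float, prevalence: float) -> bool:
    """Dichotomous outcome: affected when liability exceeds the threshold."""
    return liability > liability_threshold(prevalence)


# Hypothetical example: a disease affecting one quarter of the population.
threshold_25pct = liability_threshold(0.25)
```

The categorical outcome is then a function of the underlying continuous variable, which is the sense in which such traits can be treated as quantitative.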

5
Common diseases
  • Estimated lifetime risk c. 60%
  • Substantial genetic component
  • Non-Mendelian inheritance
  • Non-genetic risk factors
  • Multiple interacting pathways
  • Most genes still not mapped

6
Examples
  • Ischaemic heart disease (30-50%, F-M)
  • Breast cancer (12%, F)
  • Colorectal cancer (5%)
  • Recurrent major depression (10%)
  • ADHD (5%)
  • Non-insulin dependent diabetes (5%)
  • Essential hypertension (10-25%)

7
Even for simple diseases, the number of alleles is large (Wright et al., 1999)
  • Ischaemic heart disease (LDLR) >190
  • Breast cancer (BRCA1) >300
  • Colorectal cancer (MLH1) >140

8
Definitions
  • Locus: one of c. 30,000-40,000 genes
  • Allele: one of several variants of a specific
    gene
  • Gene: a sequence of DNA that codes for a specific
    function
  • Base pair: a chemical "letter" of the genome (a
    gene has many 1000s of base pairs)
  • Genome: all the genes considered together

9
Finding QTLs
  • Linkage
  • Association

10
Linkage
  • Finds QTLs by correlating phenotypic similarity
    with genetic similarity (IBD) in specific parts
    of genome
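One classical way to correlate phenotypic with genetic similarity is a Haseman-Elston style regression. This is a named stand-in for illustration, not necessarily the method the presenters used. The sketch simulates sib pairs, assuming that the variance of the sib difference falls as the proportion of alleles shared IBD at the locus rises, and regresses the squared trait difference on that proportion. All parameter values are hypothetical.

```python
import random


def simulate_sib_pairs(n_pairs, qtl_var, unshared_var, seed=1):
    """Return (ibd_proportion, squared_trait_difference) for each sib pair."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_pairs):
        # Sibs share 0, 1 or 2 alleles IBD with probability 1/4, 1/2, 1/4.
        pi = rng.choice([0.0, 0.5, 0.5, 1.0])
        # Assumed model: Var(y1 - y2) = 2 * (unshared_var + qtl_var * (1 - pi)).
        diff_var = 2.0 * (unshared_var + qtl_var * (1.0 - pi))
        diff = rng.gauss(0.0, diff_var ** 0.5)
        pairs.append((pi, diff * diff))
    return pairs


def ols_slope(points):
    """Slope of the simple least-squares regression of y on x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx


pairs = simulate_sib_pairs(n_pairs=5000, qtl_var=0.5, unshared_var=0.5)
# Under the assumed model the expected slope is -2 * qtl_var.
slope = ols_slope(pairs)
```

A clearly negative slope means that pairs sharing more of the region IBD are more alike, which is the linkage signal.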

11
Linkage
  • Doesn't depend on guessing gene
  • Works over broad regions (good for getting in
    right ball-park) and whole genome (genome scan)
  • Only detects large effects (>10%)
  • Requires large samples (10,000s?)
  • Can't guarantee close to gene

12
Association
  • Looks for correlation between specific alleles
    and phenotype (trait value, disease risk)

13
Association
  • More sensitive to small effects
  • Need to guess gene/alleles (candidate gene)
    or be close enough for linkage disequilibrium
    with nearby loci
  • May get spurious association (stratification);
    need to have genetic controls to be convinced
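A small worked example may help show how stratification produces a spurious association. The counts below are hypothetical. Within each subpopulation the allele is unrelated to disease (odds ratio 1), yet pooling the two gives an apparent association because allele frequency and disease risk both differ between subpopulations.

```python
def odds_ratio(carrier_cases, carrier_controls,
               noncarrier_cases, noncarrier_controls):
    """Odds ratio for disease in allele carriers versus non-carriers."""
    return (carrier_cases * noncarrier_controls) / (
        carrier_controls * noncarrier_cases)


# Hypothetical counts, ordered as:
# (carrier cases, carrier controls, non-carrier cases, non-carrier controls)
subpop_a = (80, 320, 20, 80)   # 80% carriers, 20% disease risk in both groups
subpop_b = (5, 95, 45, 855)    # 10% carriers, 5% disease risk in both groups

# Pool the two subpopulations as if they were one homogeneous sample.
pooled = tuple(a + b for a, b in zip(subpop_a, subpop_b))

or_a = odds_ratio(*subpop_a)
or_b = odds_ratio(*subpop_b)
or_pooled = odds_ratio(*pooled)
```

Analysing within strata, or using family-based controls, removes the artefact in this toy case.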

14
Reality: For complex disorders and quantitative
traits
  • Large number of alleles at large number of genes

15
Defining the Haystack
  • 3 x 10^9 base pairs
  • Markers every 6-10 kb for association in
    populations with no recent bottleneck history
  • 1 SNP per 721 bp (Wang et al., 1998)
  • c. 14 SNPs per 10 kb, so 1000s of haplotypes/alleles
  • O(10^4-10^5) genes
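The figures on this slide can be checked with a few lines of arithmetic. The sketch uses only the numbers quoted above; the variable names are illustrative.

```python
GENOME_BP = 3e9     # base pairs in the genome, from the slide
BP_PER_SNP = 721    # one SNP per 721 bp (Wang et al., 1998), from the slide

# Implied number of SNPs genome-wide and per 10 kb window.
snps_in_genome = GENOME_BP / BP_PER_SNP
snps_per_10kb = 10_000 / BP_PER_SNP

# Markers needed for an association scan at each quoted spacing.
markers_at_6kb = GENOME_BP / 6_000
markers_at_10kb = GENOME_BP / 10_000
```

The per-window count rounds to the "c. 14 SNPs per 10 kb" quoted on the slide, and the marker spacing implies several hundred thousand markers for a genome-wide association scan.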

16
Problems
  • Large number of loci and alleles/haplotypes
  • Possible interactions between genes
  • Possible interactions between genes and
    environment
  • Relatively low frequencies of individual risk
    factors
  • Functional form of genotype-phenotype relations
    not known
  • Sorting out signal from noise: minimizing errors
    within budget
  • Scaling of phenotype (continuous, discontinuous)
  • Spurious association (stratification)

17
Prepare for the worst
  • Need statistical approaches that can screen
    enormous numbers of loci and alleles to identify
    reliably those that have impact on risk to disease

18
System Chosen for Study
  • 100 loci
  • 20 loci affect outcome, 80 nuisance genes
  • 257 alleles/locus
  • Allele frequencies c. 20%-0.1%
  • Disease genes each explain 2.5% of variance in risk
    (c. 2-fold risk increase)
  • 40 rarest alleles increase risk
  • 50% of variance non-genetic

20
It's a Mess!
  • Don't know which genes might have clues
  • Don't know which alleles: unordered categories
  • >250 x 100 locus/allele combinations
  • More predictor combinations than people (curse
    of dimensionality)
  • Reality worse
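Because the alleles are unordered categories, one way to present them to a regression method is as one column per allele. The sketch below is an assumed coding for illustration, not necessarily the one the presenters used. It counts the predictor columns implied by the study system and codes a single diploid genotype at a hypothetical four-allele locus.

```python
N_LOCI = 100
ALLELES_PER_LOCUS = 257   # from the "System Chosen for Study" slide

# One predictor column per locus/allele combination.
n_indicator_columns = N_LOCI * ALLELES_PER_LOCUS


def allele_indicators(genotype, alleles):
    """Code one diploid genotype at one locus as allele counts (0, 1 or 2),
    one column per allele, with no ordering assumed between alleles."""
    return [sum(1 for allele in genotype if allele == a) for a in alleles]


# Hypothetical locus with four unordered alleles.
alleles = ["A1", "A2", "A3", "A4"]
heterozygote_row = allele_indicators(("A3", "A1"), alleles)
homozygote_row = allele_indicators(("A2", "A2"), alleles)
```

With tens of thousands of such columns and far fewer subjects, the predictor space is larger than the sample, which is the dimensionality problem the slide describes.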

21
Problems
  • Informatics: large volume of data
  • Computational: large number of combinations
  • Statistical: large number of chance associations
  • Genetic-epidemiological: secondary associations
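The statistical problem can be made concrete with a rough calculation. The sketch assumes one independent test per locus/allele combination in the study system (100 loci x 257 alleles) and a nominal 5% significance level. The independence assumption and the Bonferroni threshold are standard simplifications brought in for illustration; they are not taken from the slides.

```python
N_TESTS = 100 * 257   # one test per locus/allele combination (assumed)
ALPHA = 0.05          # nominal per-test significance level (assumed)

# Chance associations expected if no predictor has any real effect.
expected_false_positives = N_TESTS * ALPHA

# Bonferroni-corrected per-test threshold (a standard correction,
# used here only as an illustration).
bonferroni_threshold = ALPHA / N_TESTS


def prob_at_least_one_false_positive(n_tests, alpha):
    """Family-wise error rate for independent tests when no effect is real."""
    return 1.0 - (1.0 - alpha) ** n_tests


familywise_error = prob_at_least_one_false_positive(N_TESTS, ALPHA)
```

Under these assumptions well over a thousand predictors would pass a naive 5% test by chance alone.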

26
How are we going to figure it out?
27
Data Mining (Steinberg and Cartel)
  • Attempt to discover possibly very complex
    structure in huge databases (large number of
    records and large number of variables)
  • Problems include classification, regression,
    clustering, association (market analysis)
  • Need tools to partially or fully automate the
    discovery process
  • Large databases support search for rare but
    important patterns and interactions (epistasis,
    GxE)

28
Some Approaches to DM
  • Logistic regression
  • Neural networks
  • CART (Breiman et al. 1984)
  • MARS (Friedman, 1991)

29
MARS
  • Multivariate
  • Adaptive
  • Regression
  • Splines

30
Key references
  • Friedman, J.H. (1991) Multivariate Adaptive
    Regression Splines (with discussion). Annals of
    Statistics, 19: 1-141.
  • Steinberg, D., Bernstein, B., Colla, P., Martin,
    K., Friedman, J.H. (1999) MARS User Guide. San
    Diego, CA: Salford Systems.

31
The MARS Advantage
  • Allows large number of predictors
    (loci/alleles/environments) to be screened
  • Non-parametric
  • Continuous and discontinuous outcomes
  • Systematic search for detailed interactions
  • Testing and cross-validation
  • Continuous and categorical predictors
  • Decides best form of relationship

32
Example Regression Spline: Impact of Non-Retail
Business on Median Boston House Prices
[Figure: median house price plotted against industrial business, with the knot marked]
33
Fitting functions with Splines
  • Piece-wise linear regression.
  • Simplest form: allows the regression to bend.
  • Knots define where the function changes
    behavior.
  • Local fit vs. Global fit.

[Figure: actual data overlaid with a spline with 3 knots]
34
One predictor example
  • True knots at 20 and 45 (left)
  • Best single knot at about 35 (right)
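The search for a best single knot can be sketched as a grid search over candidate knot locations. The data on the slide are not available, so the target function below is a hypothetical one with true knots at 20 and 45 (flat, rising, then flat again). The best knot found here therefore need not match the "about 35" shown on the slide.

```python
def hinge(x, knot):
    """Basic spline building block: zero up to the knot, linear after it."""
    return max(0.0, x - knot)


def true_function(x):
    # Hypothetical target with knots at 20 and 45.
    return hinge(x, 20) - hinge(x, 45)


def sse_for_single_knot(xs, ys, knot):
    """SSE of the least-squares fit y = a + b * max(0, x - knot)."""
    h = [hinge(x, knot) for x in xs]
    n = len(xs)
    mean_h, mean_y = sum(h) / n, sum(ys) / n
    shh = sum((hi - mean_h) ** 2 for hi in h)
    shy = sum((hi - mean_h) * (yi - mean_y) for hi, yi in zip(h, ys))
    b = shy / shh
    a = mean_y - b * mean_h
    return sum((yi - (a + b * hi)) ** 2 for hi, yi in zip(h, ys))


xs = [float(x) for x in range(10, 61)]
ys = [true_function(x) for x in xs]

# Try every interior integer location as the single knot.
candidate_knots = range(11, 60)
sse_by_knot = {k: sse_for_single_knot(xs, ys, k) for k in candidate_knots}
best_knot = min(sse_by_knot, key=sse_by_knot.get)
```

A single knot cannot reproduce a function with two bends, so the best single-knot fit still leaves residual error. This is why the number of knots matters as well as their placement.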

[Figure: two panels of Y against X, X axis from 10 to 60]
35
[Figure: four panels, X axis from 10 to 60]
36
Re-express variables as basis functions
  • Done to generalize the search for knots.
    Difficult to illustrate splines in more than one
    dimension.
  • Core building block of the MARS model:
    max(0, X - c)
  • Example: BF1 = max(0, ENV - 5)
  • BF2 = max(0, ENV - 8)
  • Slope of the fitted function with respect to ENV:
    0 for ENV < 5
  • β1 for 5 < ENV < 8
  • β1 + β2 for ENV > 8
  • Weighted sum of basis functions used to
    approximate the global function.
  • i.e. y = constant + β1*BF1 + β2*BF2 + error
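The two basis functions on this slide can be written out directly. The sketch uses the knots from the slide (ENV = 5 and ENV = 8); the constant and the two weights are hypothetical values chosen only to show how the slope changes at each knot.

```python
def bf1(env):
    """First basis function from the slide: max(0, ENV - 5)."""
    return max(0.0, env - 5.0)


def bf2(env):
    """Second basis function from the slide: max(0, ENV - 8)."""
    return max(0.0, env - 8.0)


def predict(env, constant, w1, w2):
    """Weighted sum of the basis functions plus a constant."""
    return constant + w1 * bf1(env) + w2 * bf2(env)


def local_slope(env, constant, w1, w2, step=0.5):
    """Finite-difference slope of the fitted function just above `env`."""
    rise = predict(env + step, constant, w1, w2) - predict(env, constant, w1, w2)
    return rise / step


# Hypothetical coefficients, for illustration only.
CONSTANT, W1, W2 = 10.0, 2.0, -3.0

slope_below_5 = local_slope(2.0, CONSTANT, W1, W2)   # no basis function active
slope_5_to_8 = local_slope(6.0, CONSTANT, W1, W2)    # only BF1 active
slope_above_8 = local_slope(9.0, CONSTANT, W1, W2)   # BF1 and BF2 active
```

The slope is zero below the first knot, equals the first weight between the knots, and equals the sum of the two weights above the second knot.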

37
Adaptive Spline
  • Optimal placement of knots
  • Optimal selection of predictors and interactions

38
Adaptive splines
  • Problem
  • What is the optimal location of knots?
  • How many knots do you need?
  • Best to test all variable / knot locations, but
    computationally burdensome.
  • MARS solution
  • Develop an overfit model with too many knots.
  • Remove all knots that contribute little to model
    quality.
  • The final model should have approximately correct
    knot locations.

39
Optimal
  • Explains salient features of data
  • Ignores irrelevant features
  • Stands up to replication
  • Several ways to operationalize mathematically

40
MARS 2-step model building
  • Step 1. Growing phase
  • Begins with only a constant in the model.
  • Serially adds basis functions up to a user-defined
    limit, testing each for improvement when added to
    the model.
  • Basis functions are added until an overly large
    model is found (theoretically, the true model is
    captured within it).
  • Step 2. Pruning phase
  • Deletes the basis function that contributes least
    to model fit.
  • Refits the model and deletes the next term; repeat.
  • The most parsimonious model is selected.
  • GCV criterion is used to select the optimal model
    (Craven 1979).
  • A MARS option uses 10-fold cross-validation to
    estimate DF.
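The two phases can be sketched in a few dozen lines. This is a simplified illustration, not the Salford MARS implementation: it uses additive hinge terms only (no interactions), knots at observed data values, a small hand-written least-squares solver, and a simplified GCV with an assumed charge of 3 DF per basis function. The data are simulated with one relevant predictor and one nuisance predictor, and all names are hypothetical.

```python
import random


def solve(a, b):
    """Gaussian elimination with partial pivoting; None if near-singular."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-9:
            return None
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def sse(columns, y):
    """Residual sum of squares of the least-squares fit of y on columns."""
    k = len(columns)
    xtx = [[sum(p * q for p, q in zip(columns[i], columns[j]))
            for j in range(k)] for i in range(k)]
    xty = [sum(p * yi for p, yi in zip(columns[i], y)) for i in range(k)]
    beta = solve(xtx, xty)
    if beta is None:
        return float("inf")
    fitted = [sum(beta[j] * columns[j][i] for j in range(k))
              for i in range(len(y))]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))


def gcv(rss, n_obs, n_terms, df_per_term=3.0):
    """Simplified GCV: mean squared error inflated by the DF charged."""
    effective = 1.0 + df_per_term * n_terms
    return (rss / n_obs) / (1.0 - effective / n_obs) ** 2


def hinge_column(data, var, knot, sign):
    """Column of max(0, sign * (x_var - knot)) over all rows."""
    return [max(0.0, sign * (row[var] - knot)) for row in data]


def grow(data, y, max_terms):
    """Growing phase: greedily add the hinge that lowers the SSE most."""
    n_vars = len(data[0])
    candidates = [(v, row[v], s) for row in data
                  for v in range(n_vars) for s in (1.0, -1.0)]
    intercept = [1.0] * len(y)
    terms = []
    while len(terms) < max_terms:
        cols = [intercept] + [hinge_column(data, *t) for t in terms]
        best = min((c for c in candidates if c not in terms),
                   key=lambda c: sse(cols + [hinge_column(data, *c)], y))
        terms.append(best)
    return terms


def prune(data, y, terms):
    """Pruning phase: drop one term per round, keep the best-GCV model."""
    intercept = [1.0] * len(y)

    def score(ts):
        cols = [intercept] + [hinge_column(data, *t) for t in ts]
        return gcv(sse(cols, y), len(y), len(ts))

    current = list(terms)
    best_terms, best_score = list(current), score(current)
    while current:
        # Remove whichever term leaves the best-scoring smaller model.
        current = min(([t for t in current if t != d] for d in current),
                      key=score)
        s = score(current)
        if s <= best_score:
            best_terms, best_score = list(current), s
    return best_terms


rng = random.Random(0)
data = [(rng.uniform(0, 10), rng.uniform(0, 10)) for _ in range(60)]
# Only the first predictor matters; the second is a nuisance variable.
y = [2.0 * max(0.0, x1 - 5.0) + rng.gauss(0.0, 0.5) for x1, _ in data]

grown = grow(data, y, max_terms=5)   # deliberately overfit model
pruned = prune(data, y, grown)       # terms kept after backward deletion
```

Each term is a tuple of (predictor index, knot, sign). The grown model is deliberately too large, and pruning trades fit against the DF charged for each basis function.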

41
Cross-validation
  • Protects against overfitting the data.
  • Develops a model on subset of data. Tests fit on
    remaining set.
  • Systematically assesses how many DF to charge
    each variable entered into model.
  • Adding a basis function will always lower MSE.
  • This reduction is penalized by DF charged.
  • Only backwards deletion step is penalized.
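The trade-off between a falling MSE and the DF charged can be shown with made-up numbers. The sketch assumes a simplified GCV of the form MSE / (1 - C/N)^2, where C is one parameter for the constant plus a fixed DF charge per basis function. The SSE values and the charge of 3 DF are hypothetical.

```python
def gcv(sse, n_obs, n_basis, df_per_basis):
    """Simplified generalized cross-validation score."""
    effective_params = 1.0 + df_per_basis * n_basis
    if effective_params >= n_obs:
        raise ValueError("model charges more DF than there are observations")
    return (sse / n_obs) / (1.0 - effective_params / n_obs) ** 2


N_OBS = 100
# Hypothetical training SSE after 0..5 basis functions: it never goes up.
sse_by_size = [400.0, 250.0, 180.0, 176.0, 174.0, 173.5]

mse_by_size = [s / N_OBS for s in sse_by_size]
gcv_by_size = [gcv(s, N_OBS, k, df_per_basis=3.0)
               for k, s in enumerate(sse_by_size)]

# Model size preferred once the DF charge is applied.
best_size = min(range(len(gcv_by_size)), key=gcv_by_size.__getitem__)
```

Training MSE is lowest for the largest model, but once each basis function is charged for its DF the criterion prefers a smaller model in this made-up example.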

42
Genetic Example: Regression spline for
multi-allelic locus
46
So Far
  • Does quite well for largish random samples and
    continuous outcomes.
  • What about disease (dichotomous) outcomes?
  • What about selected (extreme) samples?

50
So?
  • Can detect signal due to relatively large numbers
    of relatively rare unordered alleles of
    relatively small effect at relatively many loci
    amid the noise of still more loci and
    environmental effects
  • MARS may provide elements for analyzing such
    data in this and similar contexts (microarrays,
    SNPs, expression arrays?)
  • Works with continuous data on random samples and
    dichotomous outcomes on selected samples

51
GAW12 Simulated data
  • Provided for two populations
  • Large general population
  • Population isolate founded 20 generations ago by
    100 individuals
  • Limited migration between them
  • Common disease
  • Prevalence of 25%; increases with age
  • Middle-age disease, some early onset
  • More common in females than males

52
  • General population
  • 7 genes simulated
  • 13 to 20 kb
  • 12 to 40 diallelic sites at start of simulation
  • passed through 120K to 200K generations of random
    mating
  • mutation, intragenic recombination, gene
    conversion allowed at diff. rates for diff.
    genes
  • each gene contains a 500 bp recombination hotspot
    (15% to 65% of intragenic recombinations)
  • 8 to 13 mutational hotspots per gene (mutation
    rate increased 6- to 300-fold)
  • 25 of genes isolated for 35 to 85K generations.

53
                                GENE1        GENE5
Length (kb)                     20           17
SNPs at start                   40           20
Random mating (generations)     150K         165K
Rec. rate                       .01          .002
Mutation rate                   4x10^-8      6x10^-9
Gene conv.                      .01          .002
Mean length conv.               1000         1600
Start of rec. hotspot / % in    10349 / 50   4197 / 65
Mutat. hotspots                 13           8
Incr. mut. rate                 200          20
54
  • Isolate population
  • loosely modeled after pop. history of Old Order
    Amish in Lancaster Co., PA
  • Founders: 200 chromosomes sampled from general pop.
  • 20,000 chromosomes sampled from general pop. to
    create an outside pop.
  • Isolate: children <12, mean 4. Outside: children
    <12, mean 1
  • Migration allowed between populations at each
    generation
  • Rate: migrants 5% of current isolate size
  • evolution progressed for 20 generations with
    recombination (no mutations, no intragenic rec.)
  • founders were then sampled to create the isolate
    pop.

55
  • 23 extended pedigrees with 1,497 individuals from
    each population. (1,000 living)
  • Pedigrees include the proband, spouse, and all
    first, second, and third degree relatives of
    each.
  • For living individuals, the following are provided:
  • affected status, father ID (fid), mother ID (mid),
    sex
  • age at last exam
  • age of onset if affected
  • 5 quantitative risk factors
  • 2 environmental risk factors (binary and
    quantitative)
  • marker genotype for 1 cM whole genome screen.
    2,855 total markers with an average of 9.1
    alleles
  • sequence data for 7 candidate genes: 1,176
    sequence variants
  • 50 replicates provided for each pop.

57
Sequence data
  • Isolate and General population
  • Intron and Exon sequence from 7 candidate genes.
  • Kept only those individuals with sequence data.
    Each set contains 7,000 individuals (64 MB MARS
    limit).
  • 5 sets of 7 randomly selected replicates (used 35
    of 50 replicates provided)
  • 5 associated quantitative risk factors.
  • Covariates included E1, E2, Age, Sex, Age of
    onset.

58
  • Affected status binary.
  • Exon sequence coded for each individual as having
    0, 1, or 2 ancestral variants.
  • If intron variant present (whether 1 or 2 copies)
    given a value of 1. Coded in binary form as
    haplotypes of length four.
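The first two coding rules can be sketched as follows. The allele symbols and example genotypes are hypothetical. The final step described on the slide (binary haplotypes of length four) is not reproduced here because the slide does not give enough detail to reconstruct it.

```python
def count_ancestral(genotype, ancestral_allele):
    """Exon coding: number of ancestral copies (0, 1 or 2) in a genotype."""
    return sum(1 for allele in genotype if allele == ancestral_allele)


def variant_present(genotype, variant_allele):
    """Intron coding: 1 if the variant is carried in one or two copies."""
    return int(variant_allele in genotype)


# Hypothetical exon site whose ancestral allele is "A".
exon_genotypes = [("A", "A"), ("A", "G"), ("G", "G")]
exon_codes = [count_ancestral(g, "A") for g in exon_genotypes]

# Hypothetical intron site whose variant allele is "T".
intron_genotypes = [("C", "C"), ("C", "T"), ("T", "T")]
intron_codes = [variant_present(g, "T") for g in intron_genotypes]
```

Exon sites therefore enter the model as a three-level count, while intron sites enter as a presence/absence indicator.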

59
[Figure: diagram with nodes Aff Status, Age of onset, Liability, E1, E2, Age, Q1-Q5, MG1-MG6, CG1, CG2, CG6]
60
Outcome | True model | Isolate pop. | General pop.
AFF | E1, Q1-Q5, MG6 557 | E1, Q1-Q5, MG6 (435 547 548 557) 5244 5268 6912 7281 | E1, Q1-Q5, MG6 (27 57 76 110)(435 547 548 557)
Q1 | E1, MG1 5782 | MG1 5007 | MG1 5782
Q2 | E1, MG1 5782 | E1, MG1 5007 | E1, MG1 5782
Q3 | E1, E2 | E1, E2 | E1, E2
Q4 | E1, AGE | E1, AGE | E1, AGE
Q5 | E1, MG5 multi-allelic | E1, MG5 1289 3745 8657 8817 | E1, MG5 1289 3745 8657 8817
ONSET | MG6 557 | MG6 15625 | none
61
Conclusions
  • MARS works well to capture functional form of
    disease etiology in simulated data with
    dichotomous outcome.
  • In most cases, the selected variant was within
    1 kb of the functional variant.
  • Generated a predictive model that was replicable
    in at least 4 of 5 data sets.
  • Highly interpretable output in the form of basis
    functions and Importance values.
  • MARS may have problems with highly correlated
    variables.
  • Pattern-recognition tools can be useful to narrow
    down search for genes.

62
Comparison of MARS and ANN
Both are non-parametric estimation schemes, allow for a high number of input predictors, allow for interactions, non-linear mappings.
MARS | ANN
Maximum allowable basis functions and degree of interactions need to be specified. | Type of network architecture needs to be specified.
Models are developed fast. | Models are trained more slowly (DeVeaux et al. 1993).
Backwards elimination stage to remove unnecessary basis functions. | Problem of overfitting the data, especially with small data sets.
Easily interpretable basis functions. Local interpretation of the function. | Black box: weights have little meaning. Difficult to interpret predictor contribution.
Penalizes model complexity. Tries to develop a low-order, interpretable model. | Non-linear transformations and high connectivity allow for greater complexity.
63
But the Haystack is Very Large
  • Reality worse than simulations
  • More alleles at more loci
  • Phenotypes more complex (multivariate)
  • More irrelevant loci (?1000s)
  • Interactions with environment and between loci
  • Spurious associations

64
It Needs Collaboration
  • Clinical
  • Statistical
  • Molecular
  • Epidemiological
  • Physiological
  • Developmental
  • Informational
  • Evolutionary