1
Epistasis and a Flexible Framework for Detecting Epistasis
  • Yixuan Chen
  • 11/9/2007

2
  • Jason H. Moore. A global view of epistasis. Nature
    Genetics 37, 13-14 (2005).
  • Jason H. Moore et al. A flexible computational
    framework for detecting, characterizing, and
    interpreting statistical patterns of epistasis in
    genetic studies of human disease susceptibility.
    Journal of Theoretical Biology 241, 252-261
    (2006).

3
Outline
  • Epistasis
  • The flexible framework
  • Experiment Design
  • Results
  • Discussions
  • In the Future

4
Epistasis
  • Epistasis is a phenomenon whereby the effects of
    a given gene on a biological trait are masked or
    enhanced by one or more other genes.
  • For complex traits such as diabetes, asthma, and
    hypertension, the presence of epistasis is a
    particular concern.

5
Biological and Statistical Epistasis
  • The physical interactions among proteins and
    other biomolecules and their impact on phenotype
    constitute biological epistasis.
  • Deviations from additivity in a linear
    statistical model define statistical epistasis.
  • The relationship between biological and
    statistical epistasis is important and needs
    further research efforts.

6
Why is There Epistasis?
  • From an evolutionary biology perspective, for a
    phenotype to be buffered against the effects of
    mutations, it must have an underlying genetic
    architecture composed of networks of genes that
    are redundant and robust.
  • This creates dependencies among the genes in the
    network and is realized as epistasis.

7
A Simple Example
8
Disease Model
  • Penetrance: Pr(affected | genotype)
  • One-locus Dominant Model
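  As a minimal illustration (not in the transcript): a fully
  penetrant one-locus dominant model with disease allele A has
  the penetrance values
      Pr(affected | AA) = Pr(affected | Aa) = 1,  Pr(affected | aa) = 0.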

9
Two-locus Model
10
Enumeration of Two-locus Models
  • The enumeration allows only 0 and 1 as penetrance
    values (fully penetrant models), giving 2^9 = 512
    possible two-locus models.
  • Because of symmetries in the data, only 50 of
    these models are unique.

11
(No Transcript)
12
A Flexible Framework
  • The framework contains four steps to detect,
    characterize, and interpret epistasis:
  • Select interesting combinations of SNPs
  • Construct new attributes from those selected
  • Develop and evaluate a classification model using
    the newly constructed attribute(s)
  • Interpret the final epistasis model using visual
    methods

13
Step 1: Attribute selection
  • Use entropy-based measures of information gain
    (IG) and interaction
  • Evaluate the gain in information about a class
    variable (e.g. case-control status) obtained by
    merging two attributes together
  • This measure of IG allows us to gauge the benefit
    of considering two (or more) attributes as one
    unit

14
Information
  • Consider a discrete random variable X that can
    take on possible values x1, ..., xn
  • The information of observing xi is
    I(xi) = -log2 Pr(xi)
  • Example
  • An English letter carries -log2(1/26) ≈ 4.7 bits
    of information
  • A common Chinese character carries
    -log2(1/2500) ≈ 11.3 bits of information
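  As a quick check of the two numbers above, assuming each of the
  26 letters (or each of 2500 common characters) is equally likely,
  a short Python snippet:

    import math

    def information(p):
        # Information content (in bits) of an outcome with probability p.
        return -math.log2(p)

    print(information(1 / 26))    # about 4.70 bits for one of 26 equally likely letters
    print(information(1 / 2500))  # about 11.29 bits for one of 2500 equally likely characters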

15
Shannon Entropy
  • The information entropy
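  The slide's equation is not reproduced in the transcript; in the
  notation above, the standard Shannon entropy of X is
      H(X) = - Σi Pr(xi) log2 Pr(xi).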

16
Information Gain
  • Consider two attributes, A and B, and a class
    label C.
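  The slide's equation is not reproduced in the transcript. One
  standard way to write this interaction information gain,
  consistent with the thresholds on the next slide, is
      IG(A;B;C) = I(A,B;C) - I(A;C) - I(B;C),
  where I(X;C) = H(C) - H(C|X) is the mutual information between an
  attribute X and the class C; equivalently,
  IG(A;B;C) = I(A;B|C) - I(A;B).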

17
Information Gain (cont.)
  • If IG(A;B;C) > 0
  • Evidence for an attribute interaction that cannot
    be linearly decomposed
  • If IG(A;B;C) < 0
  • The information between A and B is redundant
  • If IG(A;B;C) = 0
  • Evidence of conditional independence or a mixture
    of synergy and redundancy

18
Attribute Selection based on Entropy
  • Entropy-based IG is estimated for each individual
    attribute (i.e. main effects) and each pairwise
    combination of attributes (i.e. interaction
    effects).
  • Pairs of attributes are sorted and those with the
    highest IG, or percentage of entropy in the class
    removed, are selected for further consideration.
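  A minimal Python sketch of this selection measure (hypothetical
  helper names; genotypes and case/control status assumed to be
  coded as small integers):

    from collections import Counter
    import math

    def entropy(labels):
        # Shannon entropy (in bits) of a sequence of discrete labels.
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def info_gain(attr, cls):
        # Information gain of one attribute about the class: H(C) - H(C | attr).
        n = len(cls)
        h_cond = 0.0
        for value in set(attr):
            subset = [c for a, c in zip(attr, cls) if a == value]
            h_cond += (len(subset) / n) * entropy(subset)
        return entropy(cls) - h_cond

    def interaction_gain(a, b, cls):
        # IG(A;B;C): gain of the joint attribute (A,B) beyond the two main effects.
        return info_gain(list(zip(a, b)), cls) - info_gain(a, cls) - info_gain(b, cls)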

19
Step 2: Constructive induction
  • Use Multifactor Dimensionality Reduction (MDR)
  • A multilocus genotype combination is considered
    high-risk if its ratio of cases to controls
    exceeds a given threshold T; otherwise it is
    considered low-risk
  • Genotype combinations considered high-risk are
    labeled G1, while those considered low-risk are
    labeled G0
  • This process constructs a new one-dimensional
    attribute with levels G0 and G1
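  A minimal Python sketch of this constructive-induction step (not
  the authors' implementation; combos is a list of multilocus
  genotype tuples, cls the case(1)/control(0) labels):

    from collections import defaultdict

    def mdr_attribute(combos, cls, threshold=1.0):
        # Collapse multilocus genotype combinations into one attribute:
        # 1 (high-risk, "G1") if cases/controls meets the threshold, else 0 ("G0").
        cases, controls = defaultdict(int), defaultdict(int)
        for combo, label in zip(combos, cls):
            (cases if label == 1 else controls)[combo] += 1
        # Compare counts directly so cells with zero controls do not divide by zero.
        high_risk = {c for c in set(cases) | set(controls)
                     if cases[c] >= threshold * controls[c]}
        return [1 if c in high_risk else 0 for c in combos]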

20
MDR Steps 1 and 2
  • Step 1: partition the data into some number of
    equal parts for cross-validation
  • Step 2: a set of N genetic and/or discrete
    environmental factors is selected from the list
    of all factors

21
MDR Step 3
  • The N factors and their multifactor classes or
    cells are represented in N-dimensional space
  • The ratio of the number of cases to the number of
    controls is evaluated within each multifactor cell

22
MDR Step 4
  • Each multifactor cell in N-dimensional space is
    labeled high-risk if the ratio meets or exceeds
    some threshold T (e.g. T = 1.0) and low-risk
    otherwise
  • The cells labeled high-risk form one group and
    those labeled low-risk form another, which
    reduces the N-dimensional model to one dimension

23
MDR Steps 5 and 6
  • Step 5: all possible combinations of N factors
    are evaluated sequentially for their ability to
    classify affected and unaffected individuals in
    the training data, and the best N-factor model is
    selected.
  • Step 6: the independent test data from the
    cross-validation is used to estimate the
    prediction error of the best model selected.

24
(No Transcript)
25
MDR Final
  • Steps 1 through 6 are repeated for each
    cross-validation interval
  • The final step determines which multifactor
    levels (e.g. genotypes) are high-risk and which
    are low-risk, using the entire dataset
  • The final ratio threshold is the number of
    affected individuals in the dataset divided by
    the number of unaffected individuals

26
Step 3: Classification and machine learning
  • A naïve Bayes classifier is adopted, where vj is
    one of a set of V classes and ai is one of n
    attributes describing an event or data element
  • The class assigned to a specific attribute list
    is the one that maximizes the product of the
    prior probability of the class and the
    probabilities of each attribute value given that
    class
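  The decision rule described in words above can be written
  (reconstructed here, since the transcript omits the slide's
  equation) as the usual naïve Bayes rule
      v = argmax over vj in V of  Pr(vj) · Πi Pr(ai | vj).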

27
Step 4: Interpretation with interaction graphs
  • Comprised of a node for each attribute with
    pairwise connections between them
  • Each node is labeled with the percentage of
    entropy removed (i.e. IG) by that attribute
  • Each connection is labeled with the percentage of
    entropy removed by each pairwise Cartesian
    product of attributes

28
Step 4: Interpretation with dendrograms
  • Hierarchical clustering is used to build a
    dendrogram that places strongly interacting
    attributes close together at the leaves of the
    tree.
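  A rough Python sketch using SciPy's hierarchical clustering; using
  1 minus the pairwise IG as the distance is an assumption here, not
  necessarily the paper's exact transform (plotting requires
  matplotlib):

    import numpy as np
    from scipy.cluster.hierarchy import dendrogram, linkage
    from scipy.spatial.distance import squareform

    def interaction_dendrogram(pairwise_ig, labels):
        # pairwise_ig[i][j]: interaction IG between attributes i and j,
        # expressed as a fraction of the class entropy.
        dist = 1.0 - np.asarray(pairwise_ig, dtype=float)  # strong interaction -> small distance
        np.fill_diagonal(dist, 0.0)                        # squareform expects a zero diagonal
        tree = linkage(squareform(dist, checks=False), method='average')
        return dendrogram(tree, labels=labels)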

29
Simulation Dataset
  • One-locus model: M63
  • Two-locus models: M27 and M170

30
Simulation Dataset (cont.)
  • Three-locus Model
  • Combination of M170 and M63
  • In the form Pc = a·P1 + b·P2
  • Generate 200 cases and 200 controls
  • For each dataset, two new datasets were created
  • The functional SNPs plus the class variable
  • A single attribute constructed using MDR, in
    addition to the class variable.

31
Atrial Fibrillation Dataset
  • 250 patients with AF and 250 matched controls
  • The ACE gene insertion/deletion (I/D)
    polymorphism, the T174M, M235T, G-6A, A-20C,
    G-152A, and G-217A polymorphisms of the
    angiotensinogen gene, and the A1166C polymorphism
    of the angiotensin II type I receptor gene were
    studied.

32
Experiment Design
  • 10-fold cross-validation is used to compare
  • Accuracy = (TP + TN) / (TP + TN + FP + FN)
  • Sensitivity = TP / (TP + FN)
  • Specificity = TN / (TN + FP)
  • Precision = TP / (TP + FP)
  • (T = true, F = false, P = positive, N = negative)
  • Means were compared using a corrected resampled
    t-test and were considered statistically
    different when the p-value < 0.05
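  For reference, a small Python helper (hypothetical name) computing
  the four measures from confusion-matrix counts:

    def confusion_metrics(tp, tn, fp, fn):
        # The four evaluation measures listed above.
        return {
            "accuracy":    (tp + tn) / (tp + tn + fp + fn),
            "sensitivity": tp / (tp + fn),
            "specificity": tn / (tn + fp),
            "precision":   tp / (tp + fp),
        }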

33
Results
34
(No Transcript)
35
Summary
  • Interesting combinations of polymorphisms are
    selected based on their interaction information,
    estimated using entropy-based measures
  • Interesting subsets of attributes can be reduced
    to a single attribute using constructive
    induction methods such as MDR
  • The single attribute is then modeled using
    machine learning and classification methods
  • The interaction graphs and interaction
    dendrograms constructed using these measures are
    useful for model interpretation.

36
Discussions
  • The flexibility of this framework lies in the
    ability to plug and play
  • Different attribute selection methods, other
    than the entropy-based measures
  • Different constructive induction algorithms,
    other than MDR
  • Different machine learning strategies, other
    than a naïve Bayes classifier

37
Challenges
  • The challenge is making inferences about
    biological epistasis from statistical epistasis

38
In the Future
  • Whole-genome association studies with thousands
    of measured genetic variations
  • Discover high-order epistasis among genes
  • Merge knowledge from genetic studies in human
    populations with detailed descriptions of
    transcriptional networks, biochemical pathways
    and physiological systems in individuals.

39
THANKS!