Disease Association Search and Susceptibility Prediction Algorithms

1
Disease Association Search and Susceptibility
Prediction Algorithms
  • Irina Astrovskaya

2
Outline
  • Introduction
  • SNPs, Haplotypes, Genotypes
  • Genetic Epidemiology
  • Case/Control Study
  • Risk/Resistance factors
  • Significance of Risk/Resistance Factors
  • Multiple-Testing Adjustment
  • Disease Association Search
  • Disease Susceptibility Prediction

3
SNPs, Haplotypes, Genotypes
  • Human genome: all genetic material in the
    chromosomes (3 × 10^9 base pairs).
  • Differences between any two people
    occur in about 0.1% of the genome.
  • SNP: single nucleotide polymorphism, a site where
    two or more different nucleotides occur in a
    large percentage of the population (≈ 3 × 10^6 SNPs)
  • SNPs are mostly biallelic.
  • Diploid: two different copies of each chromosome
  • Haplotype: description of a single copy
    (expensive)
  • (notation: 0 is for the major allele, and 1 is for the minor
    allele)
  • Genotype: entire genetic identity of an
    individual
  • mixture of two haplotypes
  • (notation: 0 and 1 are for
    homozygotes, 2 is for heterozygotes)
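
The 0/1/2 notation above can be made concrete with a small sketch. It assumes haplotypes are given as equal-length strings over '0' and '1'; the function name `genotype_from_haplotypes` is hypothetical and does not come from the presentation.

```python
def genotype_from_haplotypes(hap_a, hap_b):
    """Combine two haplotypes into one genotype string.

    A site where both copies agree keeps that allele (0 or 1, homozygote);
    a site where they differ is written as 2 (heterozygote).
    """
    if len(hap_a) != len(hap_b):
        raise ValueError("haplotypes must cover the same SNPs")
    return "".join(a if a == b else "2" for a, b in zip(hap_a, hap_b))


# Two copies that agree on the first two SNPs and differ on the last two.
print(genotype_from_haplotypes("0110", "0101"))
```

The reverse direction is ambiguous: a genotype with several 2s can be explained by more than one haplotype pair.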

4
Genetic Epidemiology
  • Genetic epidemiology searches for genetic risk
    factors of diseases.
  • Monogenic disease
  • A mutated gene is entirely responsible for the
    disease.
  • Typically rare in the population: < 0.1%.
  • Practically all cases are already reported
  • Complex disease
  • interaction of multiple non-linked genes
  • 2-SNP analysis vs. one-by-one SNP analysis
  • Multiple independent causes
  • Each cause can be the result of an interaction of
    several genes
  • Each cause explains < 10-20% of cases
  • Common diseases (> 0.1% of the population) are mostly
    complex diseases.

5
Case/Control study
Given: a population of n genotypes, each containing
the values of m SNPs and a disease status.

SNPs               Disease status
0101201020102210   0
0220110210120021   0
0200120012221110   0
0020011002212101   0
1101202020100110   1
0120120010100011   1
0210220002021112   1
0021011000212120   1

The disease status splits the rows into case genotypes and
control genotypes.
Disease association analysis searches for
risk (resistance) factors of a disease.
6
Risk/Resistance factors
  • one SNP with fixed allele value

0 1 1 0 1 2 1 0 2   case
0 1 1 1 0 2 0 0 1   case
0 0 1 0 0 0 0 2 1   case
0 1 1 1 1 2 0 0 1   case
0 0 1 0 1 1 1 0 2   control
0 1 0 0 1 1 0 0 2   control
0 1 1 0 1 2 0 0 2   control

The third SNP with fixed allele value 1 is a risk
factor: it is present in 4 cases and 2 controls.

  • multi-SNP combination (MSC): a subset of SNPs with
    fixed values

0 1 1 0 1 2 1 0 2   case
0 1 1 1 0 2 0 0 1   case
0 0 1 0 0 0 0 2 1   case
0 1 1 1 1 2 0 0 1   case
0 0 1 0 1 1 1 0 2   control
0 1 0 0 1 1 0 0 2   control
0 1 1 0 1 2 0 0 2   control
x x 1 x x 2 x x x   MSC

This MSC is present in 3 cases and 1 control.

Cluster(C): the subset of genotypes that share the
same MSC C; d(C) = number of cases in
cluster(C), h(C) = number of controls in cluster(C)
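
As an illustration of d(C) and h(C), the sketch below counts cases and controls for an MSC on the seven genotypes of this slide. The dictionary representation of an MSC (0-based SNP position mapped to its fixed value) and the names `SAMPLE` and `cluster_counts` are assumptions made for this example.

```python
# The seven genotypes from the slide, as (genotype, status) pairs.
SAMPLE = [
    ("011012102", "case"),
    ("011102001", "case"),
    ("001000021", "case"),
    ("011112001", "case"),
    ("001011102", "control"),
    ("010011002", "control"),
    ("011012002", "control"),
]


def cluster_counts(msc, sample):
    """Return (d, h): cases and controls whose genotype matches the MSC.

    msc maps a 0-based SNP position to the required value, so the
    pattern "x x 1 x x 2 x x x" becomes {2: "1", 5: "2"}.
    """
    d = h = 0
    for genotype, status in sample:
        if all(genotype[pos] == value for pos, value in msc.items()):
            if status == "case":
                d += 1
            else:
                h += 1
    return d, h


print(cluster_counts({2: "1"}, SAMPLE))          # third SNP fixed to 1
print(cluster_counts({2: "1", 5: "2"}, SAMPLE))  # the MSC shown above
```

The two calls reproduce the counts stated on the slide: 4 cases and 2 controls for the single SNP, 3 cases and 1 control for the MSC.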
7
Significance of Risk/Resistance Factors
  • Measured by
  • Relative risk (RR): the ratio of the probability of an
    event occurring in the cases versus the controls
  • Odds ratio (OR): the ratio of the odds of an event in
    two groups; OR = 1 means the event is equally likely in both
  • P-value: the probability of obtaining a case/control
    distribution among those exposed to the risk factor that is
    at least as extreme as the observed one, assuming the null
    hypothesis (the distribution happened by chance)
  • Unadjusted p-value (computed from the binomial
    distribution)
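
The binomial computation is not spelled out in this transcript, so the following is only a sketch under a stated assumption: under the null hypothesis, each individual exposed to the factor is a case with probability equal to the overall fraction of cases in the sample, and the p-value is the upper tail of that binomial distribution. The function name and argument names are hypothetical.

```python
from math import comb


def binomial_p_value(d_c, h_c, d_s, h_s):
    """Upper-tail binomial p-value for a cluster with d_c cases and h_c controls.

    Assumption (not taken from the slides): each of the d_c + h_c exposed
    individuals is a case with probability q = d_s / (d_s + h_s), the
    fraction of cases in the whole sample.
    """
    n = d_c + h_c
    q = d_s / (d_s + h_s)
    # Probability of seeing at least d_c cases among n exposed individuals.
    return sum(comb(n, k) * q**k * (1 - q) ** (n - k) for k in range(d_c, n + 1))


# A cluster with 3 cases and 1 control in a sample of 4 cases and 3 controls.
print(binomial_p_value(3, 1, 4, 3))
```

With so few genotypes the p-value is far from significant, which is expected for a toy sample.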

8
Multiple-testing adjustment
  • Bonferroni
  • easy to compute
  • overly conservative
  • if the reported SNP is one of 100 tests => p-value × 100
  • Randomization
  • Randomly permute the disease status => 10,000
    samples
  • Apply the searching methods to each sample and get
    MSCs
  • Count the number of MSCs whose unadjusted p-value < the
    observed p-value
  • If this number < 500 => the MSC is significant
    (500/10,000 = 0.05)
  • computationally expensive
  • more accurate
  • statistically significant: adjusted p-value <
    0.05
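
Both adjustments can be sketched in a few lines. This is a simplified reading of the randomization steps, not the presenter's implementation: `best_p_for` is a hypothetical callback standing in for the searching method, assumed to return the smallest unadjusted p-value it finds for a given (permuted) list of disease statuses.

```python
import random


def bonferroni(p_value, n_tests):
    """Bonferroni adjustment: multiply by the number of tests, cap at 1."""
    return min(1.0, p_value * n_tests)


def permutation_adjusted_p(statuses, observed_p, best_p_for,
                           n_permutations=10000, seed=0):
    """Fraction of permuted samples whose best p-value beats the observed one.

    statuses     : list of disease statuses, one per genotype
    observed_p   : unadjusted p-value of the MSC found on the real data
    best_p_for   : hypothetical callback, permuted statuses -> best p-value
    """
    rng = random.Random(seed)
    shuffled = list(statuses)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break any real genotype/status association
        if best_p_for(shuffled) < observed_p:
            hits += 1
    return hits / n_permutations


print(bonferroni(0.0004, 100))
```

With 10,000 permutations, fewer than 500 hits corresponds to an adjusted p-value below 0.05, matching the threshold on the slide.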

9
Outline
  • Introduction
  • Disease Association Search
  • Disease Association Problem
  • Exhaustive and Combinatorial Searches
  • Maximum Control-Free Cluster
  • Complimentary Greedy Algorithm
  • Complimentary Greedy Search
  • CGS Results
  • Future
  • Disease Susceptibility Prediction

10
Disease Association Search
Problem
  • Given: case/control study data consisting of n genotypes
    (haplotypes), each containing the values of m SNPs
    and a disease status (case or control)
  • Find: (all) risk/resistance factors (MSCs) with a
    multiple-testing-adjusted p-value below 0.05
11
Exhaustive and Combinatorial Searches
  • Exhaustive search (ES)
  • complete (infeasible)
  • a sample with n genotypes and m SNPs requires
    O(n 3^m) time
  • Combinatorial search (CS)
  • The case-closure of an MSC C is an MSC C' (with the
    maximum number of SNPs with fixed values) whose cluster
    consists of the same set of cases as cluster(C) and the
    minimum number of control individuals from cluster(C).
  • Efficient way to find the case-closure:
    extend the MSC with those SNPs that have a common
    value in all of its cases.
  • Searches only among closed clusters
  • Closure of cluster(C): cluster(C')
  • d(C') = d(C) and h(C') is minimized
  • Avoids checking trivial MSCs
  • faster than ES, but still too slow for large data
  • Tagging (indexing): multiple regression method
  • ES and CS find more statistically significant
    MSCs on indexed data.
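
The case-closure step (extend the MSC with every SNP whose value is common to all of its cases) can be sketched as follows. The sample is the seven-genotype example from the risk/resistance factors slide; the dictionary representation of an MSC and the function names are assumptions made for illustration.

```python
SAMPLE = [
    ("011012102", "case"),
    ("011102001", "case"),
    ("001000021", "case"),
    ("011112001", "case"),
    ("001011102", "control"),
    ("010011002", "control"),
    ("011012002", "control"),
]


def matches(genotype, msc):
    """True if the genotype has the fixed value at every SNP of the MSC."""
    return all(genotype[pos] == value for pos, value in msc.items())


def cluster_counts(msc, sample):
    """Return (d, h): number of cases and controls in cluster(msc)."""
    d = sum(1 for g, s in sample if s == "case" and matches(g, msc))
    h = sum(1 for g, s in sample if s == "control" and matches(g, msc))
    return d, h


def case_closure(msc, sample):
    """Fix every SNP on which all cases of cluster(msc) agree."""
    cases = [g for g, s in sample if s == "case" and matches(g, msc)]
    closed = dict(msc)
    if not cases:
        return closed
    for pos in range(len(cases[0])):
        values = {g[pos] for g in cases}
        if len(values) == 1:  # all cases share this value
            closed[pos] = values.pop()
    return closed


closed = case_closure({2: "1", 5: "2"}, SAMPLE)
print(closed, cluster_counts(closed, SAMPLE))
```

On this toy sample the closure fixes three more SNPs but cannot drop the remaining control, because that control agrees with the cases on every fixed position.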

12
Maximum Control-Free Cluster
  • Maximum Control-Free Cluster Problem
  • Given: a case/control study
  • Find: a cluster(C) that does not contain
    controls and has the maximum number of cases.
  • This is the maximum control-free cluster.
  • Maximal control-free cluster: risk factor
  • Maximal disease-free cluster: resistance factor
  • Complexity
  • Includes the maximum independent set problem
  • NP-complete
  • However, the sample S is not arbitrary

13
Complimentary Greedy Algorithm
  • Algorithm
  • Start with C <- S
  • Repeat while h(C) > 0 (until C is control-free)
  • For each SNP s with value i find
  • Δh = h(C) - h(C ∩ s)
  • Δd = d(C) - d(C ∩ s)
  • Find the SNP (s, i) minimizing Δd/Δh
  • Add s to the MSC
  • Analogy with minimum vertex cover: pick and remove vertices
    of maximum degree until no edges are left
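
A minimal sketch of the greedy loop above, run on the seven-genotype example from the risk/resistance factors slide. Several details are assumptions rather than part of the slides: candidates that remove no control are skipped (their ratio is undefined), ties keep the first candidate found, and the loop stops early if no candidate can remove a control.

```python
SAMPLE = [
    ("011012102", "case"),
    ("011102001", "case"),
    ("001000021", "case"),
    ("011112001", "case"),
    ("001011102", "control"),
    ("010011002", "control"),
    ("011012002", "control"),
]


def complimentary_greedy(sample):
    """Greedily fix SNP values until the cluster contains no controls.

    Returns (msc, cluster): msc maps 0-based SNP position to its fixed
    value, cluster lists the indices of the genotypes that remain.
    """
    def counts(indices):
        d = sum(sample[g][1] == "case" for g in indices)
        return d, len(indices) - d

    cluster = list(range(len(sample)))  # C <- S
    msc = {}
    n_snps = len(sample[0][0])
    while counts(cluster)[1] > 0:       # while h(C) > 0
        d, h = counts(cluster)
        best = None
        for s in range(n_snps):
            if s in msc:
                continue
            for value in "012":
                kept = [g for g in cluster if sample[g][0][s] == value]
                d_kept, h_kept = counts(kept)
                delta_d, delta_h = d - d_kept, h - h_kept
                if delta_h == 0:        # removes no control: skip
                    continue
                ratio = delta_d / delta_h
                if best is None or ratio < best[0]:
                    best = (ratio, s, value, kept)
        if best is None:                # controls cannot be separated
            break
        _, s, value, cluster = best
        msc[s] = value
    return msc, cluster


print(complimentary_greedy(SAMPLE))
```

On this sample the first step fixes the third SNP to 1 (no case lost, one control removed) and the second fixes the ninth SNP to 1 (one case lost, two controls removed), leaving a control-free cluster of three cases.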

14
Complimentary Greedy Search (CGS)
  • Algorithm (covering)
  • Start with the empty MSC, which is present in all
    genotypes
  • Find the SNP with an allele value that defines a set of
    genotypes with the highest ratio of controls over
    cases (max(controls/cases))
  • Remove this set of genotypes
  • Add the SNP to the resulting MSC
  • Repeat the selection and removal steps until all controls
    are removed
  • Output the resulting MSC
  • Adjust the p-value of the resulting MSC for multiple
    testing

15
CGS Results
  • CGS finds MSCs with non-trivially high
    association on real data
  • CGS finds more significant MSCs on the full dataset
    than CS finds on the indexed dataset, in a reasonable
    amount of time

16
Future
  • Clustering algorithm
  • Instead of removing the cluster found for the
    maximum control-free cluster problem, redefine the
    controls in the cluster as cases (the redefinition is
    visible only for controls in the sample).

17
Outline
  • Introduction
  • Disease Association Search
  • Disease Susceptibility Prediction
  • Disease Susceptibility Prediction Problem
  • Cross-Validation Tests
  • Quality Measures of Prediction
  • Prediction Methods
  • Optimum Disease Clustering Problem
  • From Clustering to Prediction
  • Leave-One-Out Results
  • CDC Algorithm
  • Experiments
  • Plans

18
Disease Susceptibility Prediction
Problem
  • Given: a case/control study and the genotype of a
    testing individual t
  • Find: the disease status of the testing
    individual

SNPs               Disease status
0101201020102210   0
0220110210120021   0
0200120012221110   0
0020011002212101   0
1101202020100110   1
0120120010100011   1
0210220002021112   1
0021011000212120   1
0110211101211201   ?   (testing genotype t)

The disease status splits the known rows into case genotypes and
control genotypes.
19
Cross-validation tests
  • Leave-one-out test
  • The disease status of each genotype in the data
    set is predicted while the rest of the data is
    regarded as the training set

Genotype           Real disease status   Predicted disease status
0101201020102210   0                     0
0220110210120021   0                     0
0200120012221110   0                     1
0020011002212101   1                     1
0020011002212101   1                     1

Accuracy: 80%
  • Leave-many-out test
  • Repeatedly pick a random 2/3 of the population as the
    training set and predict the remaining 1/3
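
The leave-one-out loop can be sketched independently of the prediction method. In the example below, `nearest_neighbor_predict` is only a stand-in predictor chosen for brevity (an assumption, not one of the methods in this presentation), and the sample reuses the five genotypes from the table above with their real statuses.

```python
SAMPLE = [
    ("0101201020102210", 0),
    ("0220110210120021", 0),
    ("0200120012221110", 0),
    ("0020011002212101", 1),
    ("0020011002212101", 1),
]


def hamming(a, b):
    """Number of SNP positions where two genotypes differ."""
    return sum(x != y for x, y in zip(a, b))


def nearest_neighbor_predict(training, genotype):
    """Stand-in predictor: status of the closest training genotype."""
    return min(training, key=lambda item: hamming(item[0], genotype))[1]


def leave_one_out_accuracy(sample, predict):
    """Predict each genotype from all the others and return the accuracy."""
    correct = 0
    for i, (genotype, status) in enumerate(sample):
        training = sample[:i] + sample[i + 1:]  # everything except item i
        if predict(training, genotype) == status:
            correct += 1
    return correct / len(sample)


print(leave_one_out_accuracy(SAMPLE, nearest_neighbor_predict))
```

Any predictor with the same `(training, genotype)` signature can be passed in, which keeps the cross-validation loop separate from the method being evaluated.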

20
Quality Measures of Prediction
  • Sensitivity: the ability to correctly detect
    cases.
  • Sensitivity =
    TP/(TP + FN)
  • Specificity: the ability to avoid calling a control
    a case.
  • Specificity =
    TN/(FP + TN)
  • Accuracy = (TP + TN)/(TP + FP + FN + TN)
  • Risk rate: measurements for risk factors
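
As a worked example of these three measures, the sketch below uses the five real/predicted status pairs from the leave-one-out table, assuming status 1 denotes a case and 0 a control. The function name `quality_measures` is hypothetical.

```python
def quality_measures(real, predicted):
    """Sensitivity, specificity and accuracy, with 1 = case and 0 = control."""
    pairs = list(zip(real, predicted))
    tp = sum(r == 1 and p == 1 for r, p in pairs)  # cases called cases
    tn = sum(r == 0 and p == 0 for r, p in pairs)  # controls called controls
    fp = sum(r == 0 and p == 1 for r, p in pairs)  # controls called cases
    fn = sum(r == 1 and p == 0 for r, p in pairs)  # cases called controls
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (fp + tn),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }


real = [0, 0, 0, 1, 1]
predicted = [0, 0, 1, 1, 1]
print(quality_measures(real, predicted))
```

Here TP = 2, TN = 2, FP = 1 and FN = 0, so sensitivity is 1, specificity is 2/3 and accuracy is 0.8, the 80% reported in the table.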

21
Prediction Methods
  • Support Vector Machine
  • Random Forest
  • LP-based prediction
  • Drawback of the prediction problem formulation
  • need for cross-validation => no optimization

22
Optimum Disease Clustering Problem
  • Given: a case/control study S
  • Find: a partition P of S into clusters, S =
    S1 ∪ ... ∪ Sk, with disease status 0 or 1 assigned to
    each cluster Si, minimizing entropy(P) for a
    given bound on the number of individuals who are
    assigned an incorrect status in the clusters of
    partition P

23
From Clustering to Prediction
  • Intuition
  • If the tested genotype is predicted correctly, then
    the optimum clustering will have smaller entropy
  • Model-Fitting Prediction Algorithm
  • Set the status of the testing genotype t to case
  • Find the optimum clustering P0 of the dataset S ∪ {t}
  • Set the status of the testing genotype t to control
  • Find the optimum clustering P1 of the dataset S ∪ {t}
  • Find the clustering that fits the model better
    (has smaller entropy), and predict the status
    accordingly

24
Leave-One-Out Results
  • Leave-one-out cross validation results of four
    prediction methods for three real data sets.
    Results of combinatorial search-based prediction
    (CSP) and complimentary greedy search-based
    prediction (CGSP) are given when 20, 30, or all
    SNPs are chosen as informative SNPs.

25
CDC Algorithm (B.N. Goertzel, "Combinations of
SNPs in neuroendocrine effector and receptor
genes predict chronic fatigue syndrome",
Pharmacogenomics (2006), 7(3))
  • Find the best pattern strength classifier for
    training sample S:
  • - all subsets of SNPs with cardinality less
    than k are potential rules
  • - for each potential rule:
  • - evaluate each genotype g from S:
  • for each SNP in the potential rule,
  • if this SNP has value 0 or 1 in g => add 2 to the sum
  • if the value is 2 => add 1
  • - set the threshold so that
  • (number of cases with sum < threshold) + (number of
    controls with sum > threshold) -> min
  • - compute the accuracy
  • - pattern strength classifier = potential
    rule with the maximum accuracy
  • Predict the status of a tested individual:
  • - compute the sum for the tested individual:
  • for each SNP in the pattern strength
    classifier,
  • if this SNP has value 0 or 1 => add 2 to the sum
  • if the value is 2 => add 1
  • - if the sum is less than the threshold =>
    control,
  • otherwise
    => case
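
The steps above can be sketched as a small exhaustive search. This is an illustration under assumptions, not the published CDC implementation: ties between rules keep the first one found, candidate thresholds are the observed sums plus one value above the maximum, and a control whose sum equals the threshold is counted as an error so that training matches the prediction rule (sum below threshold => control). The sample is the seven-genotype example from the risk/resistance factors slide.

```python
from itertools import combinations

SAMPLE = [
    ("011012102", "case"), ("011102001", "case"),
    ("001000021", "case"), ("011112001", "case"),
    ("001011102", "control"), ("010011002", "control"),
    ("011012002", "control"),
]


def pattern_sum(genotype, rule):
    """Add 2 for each homozygous SNP (0 or 1) in the rule, 1 for each 2."""
    return sum(2 if genotype[pos] in "01" else 1 for pos in rule)


def predict(genotype, rule, threshold):
    """Sum below the threshold => control, otherwise => case."""
    return "control" if pattern_sum(genotype, rule) < threshold else "case"


def best_threshold(sample, rule):
    """Threshold with the fewest misclassified training genotypes."""
    sums = [pattern_sum(g, rule) for g, _ in sample]
    candidates = sorted(set(sums) | {max(sums) + 1})

    def errors(t):
        return sum(predict(g, rule, t) != s for g, s in sample)

    t = min(candidates, key=errors)
    return t, errors(t)


def train_pattern_strength(sample, k):
    """Try every SNP subset of cardinality < k; keep the most accurate rule."""
    best = None
    for size in range(1, k):
        for rule in combinations(range(len(sample[0][0])), size):
            t, err = best_threshold(sample, rule)
            accuracy = 1 - err / len(sample)
            if best is None or accuracy > best[0]:
                best = (accuracy, rule, t)
    return best


accuracy, rule, threshold = train_pattern_strength(SAMPLE, 3)
print(accuracy, rule, threshold, predict("011012102", rule, threshold))
```

The nested loops make the cost of the exhaustive search visible: the number of potential rules grows combinatorially with k and with the number of SNPs.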

26
Experiments
  • Different masks for the sum in evaluation.
  • Experiments with swapping a SNP when the SNP is more
    associated with controls.

27
Plans
  • Finish the leave-one-out tests for all sum masks.
  • The CDC method is slow
  • The CDC method exhaustively searches for the best pattern
    strength classifier
  • Is there a way to do this search greedily?