Exhaustive Search (ES)


  • If the reported SNP is found among 100 SNPs, then
    the probability that the SNP is associated with a
    disease by mere chance becomes 100 times larger
    (Bonferroni).
  • Bonferroni is too crude (e.g., for 3-SNP
    combinations among 100 SNPs, p < 0.05 × 10^-6)
  • We adjust the resulting p-values via randomization
  • Unadjusted p-value: probability of the case/control
    distribution in the set defined by a multi-SNP
    combination (MSC), computed from the binomial
    distribution
  • Multiple-testing adjusted p-value: computed by
    randomization
  • Randomly permute the disease status of the
    population to generate 1000 instances.
  • Apply the searching methods to each instance to get
    MSCs.
  • Compute the fraction of instances whose MSCs have
    a more significant (that is, no larger) unadjusted
    p-value than the observed p-value.
  • In our search we report only MSCs with adjusted
    p-value < 0.05
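
The randomization procedure above can be sketched in Python. This is a minimal illustration, not the authors' implementation: the function names, the upper-tail binomial probability used as the unadjusted p-value, and the single-SNP stand-in search are all assumptions made for the example. The adjusted p-value is taken to be the fraction of permuted instances whose best unadjusted p-value is no larger than the observed one.

```python
import random
from math import comb


def binomial_tail(d, n, p):
    """P(X >= d) for X ~ Binomial(n, p); assumed form of the unadjusted p-value."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(d, n + 1))


def best_single_snp_pvalue(genotypes, status):
    """Hypothetical stand-in search: best p-value over all (SNP, value) pairs."""
    p_disease = sum(status) / len(status)
    best = 1.0
    for j in range(len(genotypes[0])):
        for v in (0, 1, 2):
            # Disease statuses (1 = disease, 0 = nondisease) of matching individuals.
            members = [s for g, s in zip(genotypes, status) if g[j] == v]
            if members:
                pval = binomial_tail(sum(members), len(members), p_disease)
                best = min(best, pval)
    return best


def adjusted_pvalue(genotypes, status, search=best_single_snp_pvalue,
                    n_perm=1000, seed=0):
    """Return (observed p-value, permutation-adjusted p-value)."""
    rng = random.Random(seed)
    observed = search(genotypes, status)
    shuffled = list(status)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # permute disease status, keep genotypes fixed
        if search(genotypes, shuffled) <= observed:
            hits += 1
    return observed, hits / n_perm
```

With real data one would pass the actual search method as `search` and keep the 1000 permutations mentioned in the slides; the cost is one full search per permutation.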

Human Genome and SNP
  • Length of the human genome ≈ 3 × 10^9 base pairs
  • Difference between any two people ≈ 0.1% of the
    genome
  • Total number of single nucleotide polymorphisms
    (SNPs) ≈ 3 × 10^6
  • SNP: a single nucleotide site where two or more
    different nucleotides occur in a large percentage
    of the population
  • 0: wild type/major (frequency) allele
  • 1: mutation/minor (frequency) allele
  • International HapMap project
  • SNP maps are constructed across the human
    genome with a density of about one SNP per
    thousand nucleotides.
  • HapMap tries to identify 1 million tag SNPs
    providing almost as much mapping information
    as the entire 10 million SNPs
  • Unfortunately, not as much is known about SNP
    combinations
  • HapMap's initial budget was 100 million dollars
  • To date, around 1.5 million SNPs have been typed

Disease association analysis
  • Analysis of variation in suspected genes in
    disease and nondisease individuals is aimed at
    identifying SNPs with considerably higher
    frequencies among the disease individuals than
    among the nondisease individuals
  • Most searches are done on a SNP-by-SNP basis
  • Recently, two-SNP analysis has shown promising
    results (Marchini et al., 2005)
  • Multi-SNP analyses are expected to find even
    stronger disease associations
  • Common diseases can be caused by combinations of
    variations in several unlinked genes (SNPs)
  • We address the computational challenge of
    searching for such multi-gene causal combinations
  • The number of multi-SNP combinations is
    infeasibly high (3^100 for 100 SNPs).
  • How can associated multi-SNP combinations be found
    without checking all of them?
  • Disease association analysis searches for SNPs
    or multi-SNP combinations whose frequency among
    disease individuals is considerably higher than
    among nondisease individuals.
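
To get a feel for these numbers, the following sketch counts candidate combinations. The helper name and the counting convention (choose k SNPs, then one of the three values 0, 1, 2 for each) are assumptions made for illustration, not taken from the slides.

```python
from math import comb


def count_mscs_at_level(m, k, values=3):
    """Ways to pick k of m SNPs and fix each to one of `values` genotype values."""
    return comb(m, k) * values**k


m = 100
full_assignments = 3**m  # every SNP fixed to 0, 1 or 2

for k in range(1, 6):
    print(f"{k}-SNP combinations among {m} SNPs: {count_mscs_at_level(m, k)}")
print(f"3^{m} has {len(str(full_assignments))} decimal digits")
```

Even restricted to a handful of SNPs the count grows quickly with k, which is why a complete scan over all m SNPs is out of reach.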

Our contributions
  • A novel combinatorial method for finding disease-
    associated multi-SNP combinations was
    developed.
  • Multi-SNP combinations significantly associated
    with diseases were found.
  • For Crohn's disease data (Daly et al., 2001), a
    few associated multi-SNP combinations with
    multiple-testing-adjusted p < 0.05 were found,
    while no single SNP or pair of SNPs showed
    significant association.
  • For a dataset for an autoimmune disorder (Ueda
    et al., 2003), a few previously unknown
    associated multi-SNP combinations were found.
  • For tick-borne encephalitis virus-induced
    disease, a multi-SNP combination within a group
    of genes showing a high degree of linkage
    disequilibrium was found to be significantly
    associated with the severity of the disease.

MSC
Example: the MSC x x 1 x x 2 x x x (x = any value)
fixes SNP 3 to 1 and SNP 6 to 2.
  0 1 1 0 1 2 1 0 2  sick
  0 1 1 1 0 2 0 0 1  sick
  0 0 1 0 0 0 0 2 1  sick
  0 1 1 1 1 2 0 0 1  sick
  0 0 1 0 1 2 1 0 2  sick
  0 1 0 0 1 1 0 0 2  healthy
  0 1 1 0 1 2 0 0 2  healthy
Genotypes matching the MSC: 4 sick, 1 healthy; check
significance.

Genetic epidemiology
  • Searching for genetic risk factors for diseases
  • Monogenic diseases
  • A mutated gene is entirely responsible for the
    disease
  • Typically rare in the population: < 0.1%
  • Practically all cases are already reported
  • Complex diseases
  • Affected by the interaction of multiple genes
  • Significance of a risk factor is usually measured
    by Risk Rate or Odds Ratio
  • We measure significance by the p-value of the
    set of genotypes defined by the risk factor

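
The MSC example from the slide (seven genotypes, MSC fixing SNP 3 to 1 and SNP 6 to 2) can be checked mechanically. The sketch below is illustrative only: the data layout, the function names, and the 0-based SNP indexing are assumptions, and the odds ratio is the standard 2x2-table definition rather than anything specific to the slides.

```python
# (genotype, status) rows as they appear in the slide's example.
ROWS = [
    ([0, 1, 1, 0, 1, 2, 1, 0, 2], "sick"),
    ([0, 1, 1, 1, 0, 2, 0, 0, 1], "sick"),
    ([0, 0, 1, 0, 0, 0, 0, 2, 1], "sick"),
    ([0, 1, 1, 1, 1, 2, 0, 0, 1], "sick"),
    ([0, 0, 1, 0, 1, 2, 1, 0, 2], "sick"),
    ([0, 1, 0, 0, 1, 1, 0, 0, 2], "healthy"),
    ([0, 1, 1, 0, 1, 2, 0, 0, 2], "healthy"),
]

# MSC "x x 1 x x 2 x x x" as {0-based SNP index: required value}.
MSC = {2: 1, 5: 2}


def matches(genotype, msc):
    """True if the genotype has the required value at every fixed SNP."""
    return all(genotype[i] == v for i, v in msc.items())


def cluster_counts(rows, msc):
    """Return (sick, healthy) counts among genotypes matching the MSC."""
    sick = sum(1 for g, s in rows if s == "sick" and matches(g, msc))
    healthy = sum(1 for g, s in rows if s == "healthy" and matches(g, msc))
    return sick, healthy


def odds_ratio(rows, msc):
    """Odds ratio of the 2x2 table (matching vs. not, sick vs. healthy)."""
    a, b = cluster_counts(rows, msc)                   # matching: sick, healthy
    c = sum(1 for _, s in rows if s == "sick") - a     # non-matching sick
    d = sum(1 for _, s in rows if s == "healthy") - b  # non-matching healthy
    return float("inf") if b * c == 0 else (a * d) / (b * c)


print(cluster_counts(ROWS, MSC))
print(odds_ratio(ROWS, MSC))
```

The counts reproduce the "4 sick, 1 healthy" annotation on the slide; whether that split is significant is a separate question answered by the p-value.
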
Statistical significance
  • A multi-SNP combination (MSC) defines a set of
    disease and nondisease individuals
  • An MSC is considered statistically significant if
    the distribution of disease and nondisease
    individuals in its set has p-value < 0.05
  • Many reported findings are not reproducible on
    different populations. It is believed that this
    happens because the p-values are not adjusted
    for multiple testing

Disease-Associated Multi-SNP Combinations Search
  • Given a population of n genotypes (or
    haplotypes), each containing values of m SNPs from
    {0, 1, 2} and a disease status (disease or
    nondisease)
  • Find all multi-SNP combinations with a
    multiple-testing-adjusted p-value of the frequency
    distribution below 0.05
  • Combinatorial Search (CS)
  • Disease-closure allows statistically significant
    MSCs to be found at an earlier stage of the
    search.
  • Trivial MSCs and MSCs which coincide after
    disease-closure are avoided, which significantly
    speeds up the search.
  • Faster than ES
  • Finds more significant associations at the early
    stages of the search
  • Still slow for genome-wide studies
  • Searching level: number of SNPs which define the
    MSC before disease-closure
  • Indexed Combinatorial Search (ICS)
  • Combinatorial search on the indexed datasets
    obtained by extracting k index SNPs with the
    MLR-based tagging method.
  • Can perform a complete search on larger
    datasets
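
The slides do not define disease-closure, so the following is only one plausible reading, stated as an assumption: the closure of an MSC keeps the same matching disease individuals but fixes every SNP on which all of them agree, which can only shrink the set of matching nondisease individuals. All names are hypothetical and the data are the seven genotypes from the MSC example slide.

```python
GENOTYPES = [
    [0, 1, 1, 0, 1, 2, 1, 0, 2],
    [0, 1, 1, 1, 0, 2, 0, 0, 1],
    [0, 0, 1, 0, 0, 0, 0, 2, 1],
    [0, 1, 1, 1, 1, 2, 0, 0, 1],
    [0, 0, 1, 0, 1, 2, 1, 0, 2],
    [0, 1, 0, 0, 1, 1, 0, 0, 2],
    [0, 1, 1, 0, 1, 2, 0, 0, 2],
]
STATUS = [1, 1, 1, 1, 1, 0, 0]  # 1 = disease, 0 = nondisease


def matches(genotype, msc):
    """True if the genotype agrees with the MSC on every fixed SNP."""
    return all(genotype[i] == v for i, v in msc.items())


def disease_closure(genotypes, status, msc):
    """Assumed definition: fix every SNP shared by all matching disease genotypes."""
    diseased = [g for g, s in zip(genotypes, status) if s == 1 and matches(g, msc)]
    if not diseased:
        return dict(msc)  # nothing to close over
    first = diseased[0]
    return {
        j: first[j]
        for j in range(len(first))
        if all(g[j] == first[j] for g in diseased)
    }


msc = {2: 1, 5: 2}  # 0-based indices: SNP 3 = 1, SNP 6 = 2
closed = disease_closure(GENOTYPES, STATUS, msc)
print(closed)
```

Under this reading, two different starting MSCs with the same matching disease individuals close to the same MSC, which is consistent with the remark that coinciding MSCs can be skipped.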

Proposed searching methods
  • Exhaustive Search (ES)
  • To find a multi-SNP combination with a p-value
    of the frequency distribution below 0.05, it
    checks all one-SNP, two-SNP, ..., m-SNP
    combinations.
  • Runtime is O(n 3^m), making a complete search
    infeasible even for small numbers of SNPs m
  • We restrict the search to 1, 2, 3, 4, or 5 SNPs
  • Searching level: number of SNPs which
    participate in the MSC
  • Indexed Exhaustive Search (IES)
  • Exhaustive search on the indexed datasets
    obtained by extracting k index SNPs with the
    MLR-based tagging method.
  • MLR: multiple linear regression based tagging
    method (He and Zelikovsky, 2006).
  • The tradeoff between the number of chosen
    index SNPs and the quality of reconstruction
    requires choosing the maximum number of index
    SNPs that can be handled by ES in a reasonable
    computational time.
  • Can perform a complete search on larger
    datasets
  • For a genome-wide study, the number of tags cannot
    be reduced to 5-10 tags. Therefore, IES will not
    be able to perform a complete search

Results/comparison of searching methods
  • The relative qualities of the searching methods
    are compared using the number of statistically
    significant multi-SNP combinations found.
  • The statistical significance was adjusted for
    multiple testing, and the adjusted 0.05 threshold
    is shown (third column).
  • In the 4th, 5th and 6th columns, we give the
    frequencies of the best multi-SNP combination
    among disease and nondisease populations and the
    unadjusted p-value, respectively.
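
The Exhaustive Search (ES) described under the proposed searching methods can be sketched as a level-limited enumeration. This is a minimal illustration under stated assumptions: the function names, the 0/1 status encoding, and the upper-tail binomial probability as the unadjusted p-value are choices made for the example, and no multiple-testing adjustment is applied here.

```python
from itertools import combinations, product
from math import comb


def binomial_tail(d, n, p):
    """P(X >= d) for X ~ Binomial(n, p); assumed form of the unadjusted p-value."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(d, n + 1))


def exhaustive_search(genotypes, status, max_level=2, alpha=0.05):
    """Check every MSC with up to `max_level` fixed SNPs; return significant ones.

    Each result is (unadjusted p-value, {SNP index: value}), sorted by p-value.
    """
    m = len(genotypes[0])
    p_disease = sum(status) / len(status)
    found = []
    for k in range(1, max_level + 1):               # searching level
        for snps in combinations(range(m), k):      # which SNPs are fixed
            for values in product((0, 1, 2), repeat=k):  # to which values
                members = [
                    s for g, s in zip(genotypes, status)
                    if all(g[j] == v for j, v in zip(snps, values))
                ]
                if not members:
                    continue
                pval = binomial_tail(sum(members), len(members), p_disease)
                if pval < alpha:
                    found.append((pval, dict(zip(snps, values))))
    return sorted(found, key=lambda item: item[0])


# Toy data: SNP 0 equals 1 exactly for the 10 disease individuals.
toy_genotypes = [[1, 0]] * 10 + [[0, 0]] * 10
toy_status = [1] * 10 + [0] * 10
results = exhaustive_search(toy_genotypes, toy_status, max_level=1)
print(results)
```

Each extra level multiplies the number of candidates, so in practice the level cap (5 SNPs in the slides) is what keeps the search tractable.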

Data Sets
  • Crohn's disease: 387 genotypes with 103 SNPs
    derived from the 616 KB region of human
    Chromosome 5q31; 144 disease genotypes and 243
    nondisease genotypes (Daly et al., 2001).
  • Autoimmune disorder: 1024 genotypes with 108
    SNPs containing genes CD28, CTLA4 and ICOS; 378
    disease genotypes and 646 nondisease genotypes
    (Ueda et al., 2003).
  • Tick-borne encephalitis: 75 genotypes with 41
    SNPs containing genes TLR3, PKR, OAS1, OAS2, and
    OAS3; 21 disease genotypes and 54 nondisease
    genotypes (Barkash et al., 2006).

Discussion
  • Comparing indexed counterparts with ES and CS
    shows that indexing is quite successful. Indeed,
    the indexed searches found the same multi-SNP
    combinations as the non-indexed searches but were
    much faster and the multiple-testing adjusted
    0.05-threshold was higher and easier to meet.
  • The comparison of the CS with its ES counterparts
    favors the former. Indeed, for the
    Crohn's disease data (Daly et al., 2001), the ES
    on the first and second search levels is
    unsuccessful while the CS finds several
    statistically significant multi-SNP combinations.
    Similarly, for the tick-borne encephalitis
    virus-induced disease data, the CS and ICS(20)
    found a significant association on the first
    level while no association was found by the ES or
    IES(20). For the autoimmune disorder data
    (Ueda et al., 2003), the CS found many more
    statistically significant multi-SNP combinations
    than the ES.
  • We conclude that the proposed indexing approach
    and the combinatorial search method are very
    promising techniques for finding statistically
    significant disease-associated multi-SNP
    combinations and for disease susceptibility
    prediction.