Tagging SNPs - PowerPoint PPT Presentation

1
Tagging SNPs
  • Presentation by Eric Ruggieri
  • December 20, 2007

2
Outline
  • Brief background to SNP selection
  • A block-free tag SNP selection algorithm that
    maximizes prediction accuracy
  • Halperin et al 2005
  • A block-free tag SNP selection algorithm that
    maximizes informativeness
  • Halldorsson et al 2004

3
What does it mean to tag SNPs?
  • SNP: Single Nucleotide Polymorphism
  • Caused by a mutation at a single position in the
    human genome, passed along through heredity
  • Characterizes much of the genetic difference
    between humans
  • Most SNPs are bi-allelic
  • An estimated several million common SNPs exist
    (minor allele frequency > 5%)
  • To tag: select a subset of SNPs to work with

4
Why do we tag SNPs?
  • Disease Association Studies
  • Goal: find genetic factors correlated with
    disease
  • Look for discrepancies in haplotype structure
  • Statistical power: determined by sample size
  • Cost: determined by the overall number of SNPs typed
  • So, to keep cost down, reduce the number
    of SNPs typed
  • Choose a subset of SNPs (tag SNPs) that can
    predict the other SNPs in the region with a small
    probability of error
  • Remove redundant information

5
What do we know?
  • SNPs physically close to one another tend to be
    inherited together
  • This means that long stretches of the genome
    (absent mutational events) would be perfectly
    correlated if not for recombination
  • Recombination breaks apart haplotypes and slowly
    erodes correlation between neighboring alleles
  • It tends to blur the boundaries of LD blocks
  • Since SNPs are bi-allelic, each SNP defines a
    partition on the population sample.
  • If you are able to reconstruct this partition by
    using other SNPs, there would be no need to type
    this SNP
  • For any single SNP, this reconstruction is not
    difficult

6
Complications
  • But finding the global minimum number of
    tag SNPs needed is NP-hard
  • The predictions made will not be perfect
  • Correlation between neighboring tag SNPs is not as
    strong as correlation between neighboring (not
    necessarily tagged) SNPs
  • Haplotype information is usually not available
    for technical reasons
  • Hence the need for phasing

7
  • Tagging SNPs can be partitioned into the
    following three steps:
  • Determining neighborhoods of LD: which SNPs can
    infer each other
  • Tagging quality assessment: defining a quality
    measure that specifies how well a set of tag SNPs
    captures the variance observed
  • Optimization: minimizing the number of tag SNPs

8
Two classes of tag SNP algorithms, based on how
neighborhoods of LD are determined
  • Block-Based
  • Define blocks of SNPs that are in strong LD with each
    other, but not with SNPs in neighboring blocks
  • Requires inference on exact location of haplotype
    blocks
  • Recombination between the blocks but not within
    the blocks
  • Within each block, choose a subset of SNPs
    sufficiently rich to be able to reconstruct
    diversity of the block
  • Many algorithms exist for creating blocks; few
    select the same boundaries!
  • The most prominent algorithm is due to Zhang et al.
    (several papers)

9
How do we create Haplotype Blocks?
  • Recombination-based block building algorithm
  • Infinite sites assumption: each site mutates at
    most once
  • Assume no recombination within a block
  • Implies each block should satisfy the four-gamete
    condition for every pair of sites (see Hudson and
    Kaplan)
  • Diversity-based test: a region is a block if at
    least 80% of the sequences occur in more than one
    chromosome
  • This test does not scale well to large sample sizes
    (see Patil et al. (2001))
  • To generalize this notion, one could look for
    sequences within a region accounting for 80% of
    the sampled population that each occur in at
    least 10% of the sample
  • LD-based test
  • The D' value of every pair of SNPs within the block
    shows significant LD given the individual SNP
    frequencies, with a P-value of 0.001
  • Two SNPs are considered to have a useful level of
    correlation if they occur in the same haplotype
    block, i.e., they are physically close with little
    evidence of recombination. The set of SNPs that
    can be used to predict SNP s can be found by
    taking the union of all putative haplotype blocks
    that contain SNP s.
  • Many overlapping block decompositions may satisfy
    the rules of a rule-based algorithm for finding
    haplotype blocks
  • Metric LD maps, as described by Maniatis et al.
    (2002)
  • Only SNPs within a distance of < 1 LD unit of
    each other are considered to be significantly
    correlated
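
The first two block tests above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: haplotypes are tuples of 0/1 alleles, the function names are invented for this example, and the diversity test is read as "at least 80% of the sampled chromosomes carry a haplotype that is seen more than once in the region".

```python
from collections import Counter
from itertools import combinations


def passes_four_gamete(haplotypes, a, b):
    """True if sites a and b show fewer than four distinct gametes."""
    gametes = {(h[a], h[b]) for h in haplotypes}
    return len(gametes) < 4


def is_recombination_free(haplotypes, start, end):
    """True if every pair of sites in [start, end) passes the
    four-gamete condition."""
    return all(passes_four_gamete(haplotypes, a, b)
               for a, b in combinations(range(start, end), 2))


def passes_diversity_test(haplotypes, start, end, threshold=0.8):
    """True if the share of chromosomes whose haplotype over
    [start, end) occurs more than once reaches `threshold`
    (assumed reading of the 80% rule)."""
    counts = Counter(tuple(h[start:end]) for h in haplotypes)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(haplotypes) >= threshold


# Toy sample of four phased chromosomes over three sites.
sample = [(0, 0, 0), (0, 1, 0), (1, 0, 1), (1, 1, 1)]
```

In the toy sample, sites 0 and 2 are perfectly correlated and pass the four-gamete condition, while sites 0 and 1 show all four gametes, so the three sites together would not form a block under the recombination-based rule.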

10
  • Entropy-based or block-free
  • Avoids construction of blocks
  • Entropy is a measure of randomness
  • Seek to capture the most information across a
    region without rigid boundaries of a block
  • Both papers presented today use this method
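
To make "entropy is a measure of randomness" concrete, the sketch below computes the Shannon entropy (in bits) of the haplotype patterns observed over a chosen set of SNPs. It is a generic illustration, not the quality measure of either paper; the function name `haplotype_entropy` and the 0/1 encoding are assumptions for this example.

```python
import math
from collections import Counter


def haplotype_entropy(haplotypes, snps):
    """Shannon entropy, in bits, of the empirical distribution of
    allele patterns at the SNP indices in `snps`."""
    counts = Counter(tuple(h[s] for s in snps) for h in haplotypes)
    n = len(haplotypes)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Toy sample: SNP 2 is the XOR of SNPs 0 and 1.
sample = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]
```

Here SNPs 0 and 1 together already carry the full 2 bits of the sample, so adding SNP 2 contributes no further information.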

11
Tag SNP Selection in Genotype Data for Maximizing
SNP Prediction Accuracy (Halperin et al. 2005)

12
Problem Formulation
  • Notation: side board
  • Definition of the prediction algorithm, f, and
    the restriction function, Z
  • Goal is to find a minimum-size set of tag SNPs
    and a prediction algorithm such that the
    prediction error is minimized
  • Statistical note about 0-1 loss functions and
    maximum likelihood estimates
  • But the frequencies of genotypes in the population
    are unknown, so taking the expected value is difficult
  • Instead, use a training dataset to estimate the
    distribution of the genotypes (bootstrap method,
    non-parametric)
  • Minimize the probability expression for a randomly
    chosen genotype in the training set
  • Alternatively, we can seek to minimize the actual
    number of prediction errors, the un-normalized form
    of the probability expression

13
The Prediction Algorithm
  • Of critical importance in the search for tag SNPs
    is the definition of an adequate measure of the
    prediction quality
  • Different measures will lead to different
    optimal tag SNPs
  • Many current tag SNP selection tools need to
    first partition the region of interest into LD
    blocks before making predictions
  • The current prediction algorithm is based upon the
    following assumption:
  • Correlation between SNPs tends to decay as
    physical distance between them increases

14
  • This translates to:
  • Given the genotype values of two SNPs, the
    probabilities of the values at any intermediate
    SNP do not change by knowing the values of
    additional distal ones
  • Prediction function makes its prediction based
    only upon the two nearest SNPs
  • Assumption does not hold for all data sets or for
    all SNPs, but is a good approximation

15
The Prediction Algorithm, cont.
  • Predict: predicts the value of SNP i given the
    values of the tag SNPs
  • Aims to maximize the expected accuracy of
    predicting untyped SNPs, given the unphased
    (genotype) information of the tag SNPs
  • Uses a majority vote in order to make a
    prediction (maximum likelihood prediction)
  • In order to use the phased information available
    from the training set, two majority votes are
    actually calculated, although they coincide if
    the genotype takes the value 0 or 1
  • Two votes are necessary only if we have a
    heterozygous genotype at a tag SNP
  • All of the tag SNPs except for the closest two
    are ignored
  • If there is not a tag SNP on one side of SNP i,
    the two closest tag SNPs on the other side are
    selected, whether they be the first two tag SNPs
    or the last two tag SNPs.
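
The sketch below illustrates the two ideas on this slide: choosing the two nearest tag SNPs (including the edge case) and predicting by majority vote. It is a deliberate simplification and not the Predict procedure itself: it votes directly on unphased training genotypes with a single vote, rather than using phased training data and two votes. The function names, the genotype encoding as integers, and the fallback when no training individual matches are all assumptions for this example.

```python
from collections import Counter


def nearest_two_tags(i, tags):
    """Return the closest tag on each side of SNP i. If one side has no
    tag, return the two closest tags on the other side.
    Assumes `tags` is sorted, has at least two entries, and excludes i."""
    left = [t for t in tags if t < i]
    right = [t for t in tags if t > i]
    if left and right:
        return left[-1], right[0]
    if not left:
        return right[0], right[1]
    return left[-2], left[-1]


def predict(train, tags, i, tag_values):
    """Majority vote for the genotype at SNP i.

    train: list of genotype tuples (one per training individual).
    tag_values: dict mapping tag index -> observed genotype.
    """
    a, b = nearest_two_tags(i, sorted(tags))
    votes = Counter(g[i] for g in train
                    if g[a] == tag_values[a] and g[b] == tag_values[b])
    if not votes:
        # Assumed fallback: overall majority at SNP i.
        votes = Counter(g[i] for g in train)
    return votes.most_common(1)[0][0]


train = [(0, 0, 0), (0, 0, 0), (0, 1, 0), (2, 2, 2), (2, 2, 2)]
```

With tags at positions 0 and 2, an individual with genotype 0 at both tags is predicted to have genotype 0 at SNP 1, because two of the three matching training individuals do.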

16
An Exact Algorithm for Tag SNP Selection
  • STAMPA (Selection of tag SNPs for Maximizing
    Prediction Accuracy)
  • Dynamic Programming
  • Recall, we are trying to minimize X_T
  • Define an indicator function
  • Three auxiliary score functions: score(m1,m2),
    score1(m1,m2), score2(m1,m2)
  • score: gives the total number of prediction
    errors in SNPs m1, ..., m2-1, given that m1 and m2 are
    tag SNPs and that there are no tag SNPs in
    between
  • score1 and score2 work similarly
  • Since Predict uses only the nearest two tag SNPs to
    make a prediction, all variables are local and sums
    can be readily computed

17
Building the Recursion
  • For l < t, define f(m,l) to be the minimum number
    of prediction errors in SNPs 1, 2, ..., m given that
    the l-th tag SNP is in position m
  • For l = t, f(m,t) represents the minimum number of
    prediction errors in all SNPs given that the
    final tag SNP is in position m
  • Recurrence relation
  • The minimum value of X_T over all possible tag SNP
    sets of size t is simply the minimum of f(m,t)
    over all possible values of m
  • Use back pointers to recover the entire set of tag SNPs
  • Complexity: time O(m^3 n)
  • However, by placing a cap c on the distance between
    adjacent tag SNPs: O(mc(cn + t))
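
The recurrence itself did not survive in this transcript, so the Python sketch below shows only the general shape of such a dynamic program. It assumes a recurrence of the form f(m,l) = min over m' < m of f(m',l-1) + score(m',m), which is consistent with the definitions on this slide but is not confirmed by it. The edge costs are simplified to hypothetical one-argument functions `score_first` and `score_last`, and positions are 0-based; the cost functions are supplied by the caller.

```python
def select_tags(m, t, score, score_first, score_last):
    """Choose t tag positions among SNPs 0..m-1 minimizing total cost.

    score(a, b):    cost of the SNPs between adjacent tags a < b
    score_first(a): cost of the SNPs before the first tag a (assumed form)
    score_last(b):  cost of the SNPs after the last tag b (assumed form)
    Returns (minimum total cost, sorted list of tag positions).
    """
    inf = float("inf")
    # f[pos][l]: best cost up to pos when the l-th tag sits at pos.
    f = [[inf] * (t + 1) for _ in range(m)]
    back = [[None] * (t + 1) for _ in range(m)]
    for pos in range(m):
        f[pos][1] = score_first(pos)
    for l in range(2, t + 1):
        for pos in range(l - 1, m):
            for prev in range(l - 2, pos):
                cand = f[prev][l - 1] + score(prev, pos)
                if cand < f[pos][l]:
                    f[pos][l] = cand
                    back[pos][l] = prev
    # Close the recursion by adding the cost after the final tag.
    best = min(range(t - 1, m), key=lambda p: f[p][t] + score_last(p))
    total = f[best][t] + score_last(best)
    # Follow the back pointers to recover the whole tag set.
    tags, pos, l = [], best, t
    while pos is not None:
        tags.append(pos)
        pos, l = back[pos][l], l - 1
    return total, tags[::-1]


# Toy cost: squared number of untyped SNPs in each gap (illustrative only).
M = 5
result = select_tags(M, 2,
                     score=lambda a, b: (b - a - 1) ** 2,
                     score_first=lambda a: a ** 2,
                     score_last=lambda b: (M - 1 - b) ** 2)
```

With the toy cost, the best two tags among five SNPs are positions 1 and 3, which split the untyped SNPs into three gaps of one SNP each.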

18
An Alternate Method: Random Sampling
  • Gives up predictive power for speed and
    efficiency
  • Randomly generate 100 sets of tag SNPs by using
    the uniform distribution on the set of all
    available SNPs
  • Select any t of the m SNPs available
  • Compute X_{T_i} for each SNP set T_i, then choose the
    SNP set that minimizes X_{T_i}
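
The random-sampling alternative is simple enough to sketch directly. In the Python example below, `error_count` is a hypothetical caller-supplied function standing in for the prediction-error count of a tag set; the function name `random_tag_search`, the fixed seed, and the toy error function are assumptions for this example.

```python
import random


def random_tag_search(m, t, error_count, trials=100, seed=0):
    """Draw `trials` uniformly random tag sets of size t from SNPs
    0..m-1 and return (best tag set, its error count)."""
    rng = random.Random(seed)
    best_tags, best_err = None, float("inf")
    for _ in range(trials):
        tags = sorted(rng.sample(range(m), t))
        err = error_count(tags)
        if err < best_err:
            best_tags, best_err = tags, err
    return best_tags, best_err


def toy_error(tags):
    """Illustrative stand-in: total distance from each of 10 SNPs
    to its nearest tag."""
    return sum(min(abs(s - g) for g in tags) for s in range(10))


tags, err = random_tag_search(10, 3, toy_error)
```

The search is not guaranteed to find the optimum; it only returns the best of the sampled sets.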

19
Advantages to the Method
  • Uses genotype information and so does not require
    phasing
  • In practice, only genotype data is available
  • Does not rely on a specific block partition
  • Side note: the algorithm has the feel of a
    k-nearest-neighbor classifier

20
Optimal Haplotype Block-Free Selection of Tagging
SNPs for Genome-Wide Association Studies
  • Halldorsson et al (2004)
  • including Prof. Istrail

21
  • Tagging SNPs can be partitioned into the
    following three steps:
  • Determining neighborhoods of LD: which SNPs can
    infer each other
  • Tagging quality assessment: defining a quality
    measure that specifies how well a set of tag SNPs
    captures the variance observed
  • Optimization: minimizing the number of tag SNPs

22
Finding Neighborhoods
  • Goal is to select SNPs in the sample that
    characterize regions of common recent ancestry
    that will contain conserved haplotypes
  • Recent common ancestry means that there has been
    little time for recombination to break apart
    haplotypes
  • Constructing fixed size neighborhoods in which to
    look for SNPs is not desirable because of the
    variability of recombination rates and historical
    LD across the genome
  • In fact, the size of informative neighborhoods is
    highly variable precisely because of variable
    recombination rates and SNP density
  • The authors avoid block building by recursively
    creating neighborhoods with the help of an
    informativeness measure

23
Defining Informativeness
  • A measure of tagging quality assessment
  • Assume all SNPs are bi-allelic
  • Notation
  • I(s,t): informativeness of a SNP s with respect
    to a SNP t
  • i, j are two haplotypes drawn at random from the
    uniform distribution on the set of distinct
    haplotype pairs
  • Note: I(s,t) = 1 implies complete predictability;
    I(s,t) = 0 when t is monomorphic in the population
  • I(s,t) is easily estimated through the use of the
    bipartite clique that defines each SNP
  • We can write I(s,t) in terms of an edge set
  • The definition of I is easily extended to a set of SNPs
    S by taking the union of edge sets
  • Assumes the availability of haplotype phases
  • New measure avoids some of the difficulties
    traditional LD measures have experienced when
    applied to tagging SNP selection
  • The concept of pairwise LD fails to reliably
    capture the higher-order dependencies implied by
    haplotype structure
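
The formula for I(s,t) is missing from this transcript. The Python sketch below therefore rests on an assumed reading that matches the bullets above: the edge set of a SNP is the set of haplotype pairs that carry different alleles at it, and I(S,t) is the fraction of t's edges that are also edges of at least one SNP in S, with I defined as 0 when t is monomorphic. The function names and 0/1 encoding are invented for this example.

```python
from itertools import combinations


def edge_set(haplotypes, s):
    """Pairs (i, j) of haplotypes that differ at SNP s. These are the
    edges of the bipartite clique that SNP s defines on the sample."""
    return {(i, j)
            for i, j in combinations(range(len(haplotypes)), 2)
            if haplotypes[i][s] != haplotypes[j][s]}


def informativeness(haplotypes, tag_set, t):
    """Assumed form of I(S, t): share of the pairs separated by t
    that are also separated by at least one SNP in tag_set."""
    target = edge_set(haplotypes, t)
    if not target:
        return 0.0  # t is monomorphic in the sample
    covered = set().union(*(edge_set(haplotypes, s) for s in tag_set))
    return len(covered & target) / len(target)


# Toy sample: SNP 2 is the XOR of SNPs 0 and 1.
sample = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]
```

In this sample SNP 0 alone separates half of the pairs that SNP 2 separates, while SNPs 0 and 1 together separate all of them. This is the kind of higher-order dependency that a pairwise measure would miss.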

24
Bounded-Width Algorithm: k Most Informative SNPs
(k-MIS)
  • Input: a set of n SNPs S
  • Output: a subset S' of k SNPs from S such that
    I(S',S) is maximal
  • In its most general form, k-MIS is NP-hard by
    reduction of the set cover problem to MIS
  • The algorithm optimizes informativeness, although it
    is easily adapted to other measures
  • Define the distance between two SNPs as the number of
    SNPs in between them
  • k-MIS can be solved efficiently as long as the distance
    between adjacent tag SNPs is not too large

25
  • Define:
  • Assignment A_s[i]
  • S(A_s)
  • Recursion function I_w(s, l, S(A)): score of the
    most informative subset of l SNPs chosen from
    SNPs 1 through s such that A_s describes the
    assignment for SNP s
  • Pseudocode
  • Complexity: O(nk2^w) in time and O(k2^w) in space,
    assuming maximal window w

26
Evaluation
  • Algorithm evaluated by leave-one-out
    cross-validation
  • Accumulated accuracy over all haplotypes gives a
    global measure of the accuracy for the given data
    set
  • SNPs not typed were predicted by a majority vote
    among all haplotypes in the training set that
    were identical to the one being inferred
  • If no such haplotypes existed, the majority vote
    is taken among all training haplotypes that have
    the same allele call on all but one of the typed
    SNPs
  • etc.
  • Compared with the block-based method of Zhang et al.:
  • Presumably, the advantage is due to the cost
    imposed by artificially restricting the range of
    influence of the few SNPs chosen by block
    boundaries
  • Informativeness was shown to be a good
    measure
  • It aligned well with the leave-one-out
    cross-validation results
  • It was extremely close to the results of optimizing
    for haplotype r^2
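
The evaluation procedure described on this slide can be sketched as follows. This is an illustration only: the function names and the 0/1 encoding are assumptions, and the "etc." step is read here as allowing one more mismatch on the typed SNPs each time no training haplotype qualifies.

```python
from collections import Counter


def predict_allele(train, typed, target, hap):
    """Majority vote at `target` among training haplotypes that match
    `hap` on the typed SNPs, relaxing to 1, 2, ... mismatches when no
    haplotype qualifies (assumed reading of the fallback rule)."""
    for allowed in range(len(typed) + 1):
        votes = Counter(h[target] for h in train
                        if sum(h[s] != hap[s] for s in typed) <= allowed)
        if votes:
            return votes.most_common(1)[0][0]
    return None  # empty training set


def loo_accuracy(haplotypes, typed):
    """Leave-one-out accuracy over all untyped SNPs and haplotypes.
    Assumes at least one SNP is left untyped."""
    untyped = [s for s in range(len(haplotypes[0])) if s not in typed]
    correct = total = 0
    for k, hap in enumerate(haplotypes):
        train = haplotypes[:k] + haplotypes[k + 1:]
        for s in untyped:
            correct += predict_allele(train, typed, s, hap) == hap[s]
            total += 1
    return correct / total


sample = [(0, 0), (0, 0), (1, 1), (1, 1)]
```

In the toy sample, typing SNP 0 predicts SNP 1 perfectly, so the leave-one-out accuracy is 1.0.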