Cross/Parallel Likelihood between any 2 SNPs

Provided by: tae55

Transcript and Presenter's Notes

Family Trios Phasing and Missing Data Recovery
Dumitru Brinza, Jingwu He, Weidong Mao and Alexander Zelikovsky
Computer Science Department

New Phasing Method 2-SNP Statistics (2SNP)

Greedy method for Trio Phasing
  • Proposed by Halperin et al. in "Perfect phylogeny
    and haplotype assignment" (2004)
  • For each trio we introduce four partial
    haplotypes with SNP values 0, 1 and ?
  • The algorithm iteratively finds the complete
    haplotype that covers the maximum possible
    number of partial haplotypes, removes this set of
    resolved partial haplotypes, and continues in that
    manner
  • The drawback of this method is that it introduces
    errors with respect to the trio constraint

Diploid organisms, haplotypes, genotypes and SNPs
  • Diploid - two haplotypes (different copies of
    each chromosome)
  • SNP - single nucleotide site where two or more
    different nucleotides occur in a large percentage
    of the population
  • 0 - wild type/major (frequency) allele
  • 1 - mutation/minor (frequency) allele
  • Haplotype - description of a single copy
  • Example: 00110101 (0 is for major, 1 is for minor
    allele)
  • Genotype - description of the mixed two copies
  • Example: 01122110 (0 = 00, 1 = 11, 2 = 01)
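
The genotype encoding above (0 for two major alleles, 1 for two minor alleles, 2 for a mixed site) can be illustrated with a minimal sketch. The function name `genotype_from_haplotypes` and the two example haplotypes are illustrative assumptions, chosen so that they combine into the genotype 01122110 from the example.

```python
def genotype_from_haplotypes(h1: str, h2: str) -> str:
    """Combine two 0/1 haplotypes into a 0/1/2 genotype string."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have the same length")
    # Equal alleles keep their symbol (0 or 1); differing alleles become 2.
    return "".join(a if a == b else "2" for a, b in zip(h1, h2))


# Hypothetical pair of haplotypes that mix into the genotype 01122110.
print(genotype_from_haplotypes("01101110", "01110110"))
```

Phasing is the inverse task: given only the genotype, decide which of the two haplotypes carries the 1 at each site marked 2.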

Phasing Genotype Graph Coloring
  • Genotype graph for genotype g is a complete
    graph G(g) where
  • Vertices = heterozygous SNPs in g
  • (i,j)-edge weight w(i,j) = cross/parallel
    likelihood of phasing
  • Phasing of 2 heterozygous SNPs
  • Parallel edge: 22 = 00 + 11
  • Cross edge: 22 = 01 + 10
  • Graph coloring
  • Color all vertices in two colors such that any 2
    vertices connected with a parallel edge have the
    same color, and any 2 vertices connected with a
    cross edge have opposite colors

Integer Linear Program for PPTP
  • For each trio we introduce four template
    haplotypes with SNP values 0, 1, 2, ?
  • 0, 1 correspond to fully resolved haplotypes, 2
    comes in SNPs corresponding to the genotype
    2s, ? marks unconstrained SNPs
  • Variables
  • for each possible haplotype i, x_i ∈ {0,1}
  • for each heterozygous SNP j in each template,
    y_j ∈ {0,1}

Cross/Parallel Phasing Likelihood
  • Cross/Parallel Likelihood between any 2 SNPs
  • Fcross = total number of 01 and 10 haplotypes in
    SNPs i and j
  • Fparallel = total number of 00 and 11 haplotypes
    in SNPs i and j
  • Fexp_cross = expected number of 01 and 10
    haplotypes in SNPs i and j based on single SNP
    frequencies
  • Fexp_parallel = expected number of 00 and 11
    haplotypes in SNPs i and j based on single SNP
    frequencies
  • Adjusting of Fcross and Fparallel, assuming that
    any two pairs of haplotypes form a genotype with
    the same probability
  • Positive value → cross, negative value → parallel
  • Larger absolute value gives more confidence
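
The four counts defined above can be sketched as follows. This is only an illustration under stated assumptions: the input is taken to be a list of already known 0/1 haplotypes without missing values, the expected counts assume that SNPs i and j are independent, and the name `pair_statistics` is hypothetical. The sketch stops at the four counts and does not reproduce the adjustment step or the final edge weight.

```python
def pair_statistics(haplotypes, i, j):
    """Return (F_cross, F_parallel, F_exp_cross, F_exp_parallel) for SNPs i, j.

    Assumption: `haplotypes` is a non-empty list of 0/1 strings with no '?'.
    """
    n = len(haplotypes)
    # Observed counts: 01/10 are "cross", 00/11 are "parallel".
    f_cross = sum(1 for h in haplotypes if h[i] != h[j])
    f_parallel = n - f_cross
    # Single-SNP frequencies of allele 1.
    p_i = sum(h[i] == "1" for h in haplotypes) / n
    p_j = sum(h[j] == "1" for h in haplotypes) / n
    # Expected counts if the two SNPs were independent (an assumption).
    f_exp_parallel = n * (p_i * p_j + (1 - p_i) * (1 - p_j))
    f_exp_cross = n * (p_i * (1 - p_j) + (1 - p_i) * p_j)
    return f_cross, f_parallel, f_exp_cross, f_exp_parallel


print(pair_statistics(["00", "00", "11", "01"], 0, 1))
```

Comparing observed with expected counts shows whether the pair of SNPs leans toward the cross or the parallel phasing more strongly than single-SNP frequencies alone would suggest.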

Inferring Haplotypes (Phasing) from Population
Genotypes
  • Haplotypes increase the power of an association
    between marker loci and phenotypic traits
  • Genotype phasing is the resolution of a genotype
    into the two haplotypes
  • Physical phasing is too expensive
  • Soon a single (Affy) chip will allow finding all
    (≈10^7) SNPs of a single human
  • Computational phasing (inferring) is much cheaper
  • Statistical methods (Phase, Phamily, PL,
    HAPLOTYPER, SNPHAP, GERBIL)
  • Combinatorial methods (Parsimony, HAPINF,
    Perfect Phylogeny, HAP)

Family Trio Phasing
  • Frequently genotype data consist of family trios:
    two parents and one offspring
  • Trio data allow correct recovery of offspring
    haplotypes in almost all cases
  • Example of unique phasing
  • Unique example of ambiguous phasing

Family Trio Phasing without Recombinations
  • Parental haplotypes may recombine → impossible to
    recover parental haplotypes
  • ASSUMPTION: No recombinations in parents
[Figure labels: phasings of trio data as unrelated individuals (GERBIL, PHASE); projections (closest phasings w/o recombinations); trio-phasings w/o recombinations; pure parsimonious trio-phasing w/o recombinations]

Missing Data Recovery Problem
  • Real data often miss some SNPs
  • Daly et al. data (Crohn's disease): 10-16%
  • Gabriel et al. data (HapMap): 7-10%
  • How to reconstruct missing values, and how to
    verify the reconstruction method?
  • Scramble an extra 10% and reconstruct them
  • Karp-Halperin (2004) have error rate 2.8%

2-SNP Statistics (2SNP)
  • Collect statistics on haplotype/genotype
    frequencies for any 2 SNPs
  • For each 2 SNPs compute weights reflecting
    likelihood of cross/parallel
  • For each genotype g
  • Find Maximum Spanning Tree (MST) for genotype
    graph G(g)
  • Color G(g) vertices and phase based on colors
  • Missing data recovery
  • Recover each missing site (?s) based on the
    closest haplotype with the phased site (Zhang et
    al.)
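
The spanning-tree and coloring steps of the 2SNP method listed above (pairwise weights, maximum spanning tree of the genotype graph, two-coloring of its vertices) can be sketched roughly as below. Several things here are assumptions rather than the authors' implementation: the signed weights are taken as given input (positive for cross, negative for parallel), the tree is grown with Prim's algorithm on the absolute weight, a zero weight is treated as parallel, and the name `phase_by_spanning_tree` is hypothetical.

```python
def phase_by_spanning_tree(genotype, weights):
    """Phase one 0/1/2 genotype string into two haplotype strings.

    weights: dict mapping (i, j) with i < j to a signed number for every
    pair of heterozygous sites; > 0 means cross, otherwise parallel.
    """
    het = [i for i, c in enumerate(genotype) if c == "2"]
    color = {}
    if het:
        color[het[0]] = 0  # arbitrary color for the first heterozygous site
    # Prim's algorithm: repeatedly attach the uncolored vertex reachable by
    # the edge with the largest absolute weight (most confident edge).
    while len(color) < len(het):
        best = None
        for u in color:
            for v in het:
                if v in color:
                    continue
                w = weights[(min(u, v), max(u, v))]
                if best is None or abs(w) > abs(best[2]):
                    best = (u, v, w)
        u, v, w = best
        # Cross edge: opposite colors. Parallel edge: same color.
        color[v] = 1 - color[u] if w > 0 else color[u]
    h1 = "".join(str(color[i]) if c == "2" else c for i, c in enumerate(genotype))
    h2 = "".join(str(1 - color[i]) if c == "2" else c for i, c in enumerate(genotype))
    return h1, h2


# Hypothetical weights for a genotype with three heterozygous sites.
print(phase_by_spanning_tree("222", {(0, 1): 3.0, (1, 2): -2.0, (0, 2): 0.5}))
```

Because a tree has no cycles, the coloring is always consistent: each vertex receives its color from exactly one tree edge, so weak edges outside the tree cannot contradict the stronger ones.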

[Flowchart: Input: unrelated genotypes → Statistics collection → Genotype Graph MST → Graph coloring → Phasing → Missing data recovery]

Family Trio Phasing Validation
  • A phasing method can be validated on simulated
    data (haplotypes are known)
  • Validation on real data is usually performed
    on trio data
  • Trio validation cannot be applied, since a trio
    phasing method may rely on both offspring and
    parents' genotypes
  • Validation by erasing randomly chosen SNPs
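
Validation by erasing can be sketched as a small harness: hide a fraction of the known SNP values, run any recovery method, and count how many hidden values come back wrong. The function names, the fixed random seed and the example data are illustrative assumptions; the recovery method itself is left out.

```python
import random


def erase_random_sites(genotypes, fraction, seed=0):
    """Replace a random fraction of the known SNP values with '?'.

    Returns the masked genotype strings and the list of erased (row, column)
    positions so the recovered values can be checked later.
    """
    rng = random.Random(seed)
    known = [(r, c) for r, g in enumerate(genotypes)
             for c, v in enumerate(g) if v != "?"]
    erased = rng.sample(known, int(round(fraction * len(known))))
    rows = [list(g) for g in genotypes]
    for r, c in erased:
        rows[r][c] = "?"
    return ["".join(row) for row in rows], erased


def recovery_error(original, recovered, erased):
    """Fraction of erased positions whose recovered value differs."""
    if not erased:
        return 0.0
    wrong = sum(original[r][c] != recovered[r][c] for r, c in erased)
    return wrong / len(erased)


# Hypothetical genotypes; one value is already missing in the input.
masked, erased = erase_random_sites(["0120", "2?10"], 0.3)
print(masked, erased)
```

Only positions that were known before erasing are scored, so values already missing in the real data do not distort the error rate.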

Our contributions
  • Formulating the Pure-Parsimony Trio Phasing
    Problem (PPTP) and the Trio Missing Data
    Recovery Problem (TMDRP)
  • Two new greedy and integer linear programming
    (ILP) based methods solving PPTP and TMDRP
  • New 2-SNP Statistics (2SNP) phasing method for
    unrelated individuals
  • Extensive experimental validation of the proposed
    methods and comparison with previously known
    methods

[Figure: example sequences 0?210022002, 0?010010000, 0?110001001, 00110101000, 00010010000, 00110001001 with labels s1, s2, s3, s4]

Data Sets
  • Daly et al. data (Crohn's disease), derived from
    the 616 kilobase region of human Chromosome 5q31
  • 129 family trios of genotypes with 103 SNPs
  • Gabriel et al. data (HapMap), consisting of SNPs
    from 62 regions of the human genome
  • 29 family trios of genotypes with ≈ 3000 SNPs
  • Using the MS simulator we simulate Daly et al.
    data by generating 258 populations, each
    population with 100 individuals and each
    haplotype with 103 SNPs, then randomly choosing
    one haplotype from each population. We only
    simulate parents' haplotypes, then we obtain
    family trio haplotypes and genotypes by randomly
    matching the parental haplotypes.

Unrelated Individuals Phasing Validation
  • A phasing method can be validated on simulated
    data (haplotypes are known)
  • Validation on real data is usually performed
    on trio data
  • Offspring haplotypes are mostly known (inferred
    from parents' haplotypes)
  • Error types
  • Single-Site error
  • Number of SNPs in offspring phased haplotypes
    which differ from SNPs inferred from trio data,
    divided by (total number of SNPs) x (total number
    of haplotypes)
  • Individual error
  • Number of offspring genotypes phased with at
    least one Single-Site error, divided by the total
    number of genotypes
  • Switching error
  • Minimum number of switches which should be done
    in the pair of haplotypes of an offspring phased
    genotype such that both haplotypes coincide
    with the haplotypes inferred from trio data,
    divided by the total number of heterozygous
    positions in offspring genotypes
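
The three error types can be made concrete with a short sketch. It is an illustration under assumptions, not the evaluation code behind the reported results: each individual is a pair of equal-length 0/1 strings, the predicted pair is matched to the reference pair in whichever order gives fewer mismatches, heterozygous sites are read off the reference pair, and all function names are hypothetical.

```python
def site_mismatches(pred, true):
    """Mismatching sites between a predicted and a reference haplotype pair."""
    def diff(a, b):
        return sum(x != y for x, y in zip(a, b))
    straight = diff(pred[0], true[0]) + diff(pred[1], true[1])
    swapped = diff(pred[0], true[1]) + diff(pred[1], true[0])
    return min(straight, swapped)  # haplotype order is arbitrary (assumption)


def switch_count(pred, true):
    """Return (number of switches, number of heterozygous sites)."""
    het = [k for k in range(len(true[0])) if true[0][k] != true[1][k]]
    # At each heterozygous site: does predicted copy 1 follow reference copy 1?
    follows = [pred[0][k] == true[0][k] for k in het]
    switches = sum(a != b for a, b in zip(follows, follows[1:]))
    return switches, len(het)


def error_rates(pred_pairs, true_pairs):
    """Single-site, individual and switching error over all individuals."""
    n, m = len(true_pairs), len(true_pairs[0][0])
    sites = wrong = switches = hets = 0
    for pred, true in zip(pred_pairs, true_pairs):
        mism = site_mismatches(pred, true)
        sites += mism
        wrong += mism > 0
        s, h = switch_count(pred, true)
        switches += s
        hets += h
    return {
        "single_site": sites / (m * 2 * n),
        "individual": wrong / n,
        "switching": switches / hets if hets else 0.0,
    }


# Hypothetical data: first individual has one switch, second is correct.
true_pairs = [("0011", "0100"), ("0101", "0011")]
pred_pairs = [("0000", "0111"), ("0011", "0101")]
print(error_rates(pred_pairs, true_pairs))
```

A single switch can produce many single-site errors, which is why the switching error is usually the more forgiving of the two measures on long haplotypes.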

Previous work
  • PHASE: Bayesian statistical method (Stephens et
    al., 2001, 2003)
  • HAPLOTYPER: a Monte Carlo approach (Niu
    et al., 2002)
  • Phamily: phases the trio families based on PHASE
    (Acherman et al., 2003)
  • Greedy method for phasing and missing data
    recovery (Halperin and Karp, 2004)
  • GERBIL: statistical method using maximum
    likelihood (ML), MST and expectation-maximization
    (EM) (Kimmel and Shamir, 2005)
  • SNPHAP: uses ML/EM assuming Hardy-Weinberg
    equilibrium (Clayton et al., 2004)

Experimental Results
  • Phasing in Family Trios
  • The results for five phasing methods on the real
    data sets of Daly et al. and Gabriel et al. and
    on simulated data. The second column corresponds
    to the ratio of erased data. C corresponds to
    the error of offspring, P to the error of
    parents, and T to the total error.

2SNP Results, Comparison with other Phasing
Methods
  • Phasing methods
  • PHASE, GERBIL, HAPLOTYPER, 2SNP
  • Reported errors
  • Single-site error, individual error, switching
    error
  • Errors are reported in percent
  • Running time
  • Time is reported in hours (h), minutes (m),
    seconds (s)
  • Running time is not stable (average is
    reported)
  • Data Sets
  • Daly et al. offspring data, 129 genotypes with
    103 SNPs
  • Gabriel et al. offspring data, 29 genotypes in 61
    blocks with 50 SNPs on average
  • Forton et al., 128 genotypes with 89 SNPs obtained
    by random matching of 256 randomly chosen real
    haplotypes
  • MS simulated data with recombination ratio
    (0, 4, 16), 100 genotypes and 103 SNPs

  • Missing data recovery in Family Trios
  • The results for missing data recovery on the real
    and simulated data sets with five methods. The
    second column corresponds to the ratio of erased
    data. C corresponds to the error of offspring,
    P to the error of parents, and T to the total
    error.

Problem Formulation: Family Trio Phasing w/o
Recombinations (TPP)
  • Given a set of family trios of genotypes, each
    with m sites corresponding to m SNPs
  • 0 - homozygote with major allele, 1 - homozygote
    with minor allele, 2 - heterozygote, ? -
    missing SNP value
  • Find for each trio four haplotypes h1, h2, h3, h4,
    each with m 0-1-sites, such that
  • h1 and h2 explain the father's genotype, h3 and h4
    explain the mother's genotype, h1 and h3 explain
    the offspring's genotype
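
The trio constraint of the TPP formulation (h1 and h2 explain the father's genotype, h3 and h4 the mother's, h1 and h3 the offspring's) can be checked with a few lines. This is a minimal sketch: the function names and the example trio are illustrative assumptions, and a missing value ? is treated as compatible with any pair of alleles.

```python
def explains(h_a, h_b, genotype):
    """True if the 0/1 haplotypes h_a and h_b are consistent with genotype."""
    if not (len(h_a) == len(h_b) == len(genotype)):
        raise ValueError("all sequences must have the same length")
    for a, b, g in zip(h_a, h_b, genotype):
        if g == "?":          # missing value: no constraint at this site
            continue
        if g == "2":          # heterozygote: the two alleles must differ
            if a == b:
                return False
        elif not (a == g and b == g):  # homozygote: both alleles equal g
            return False
    return True


def is_trio_phasing(h1, h2, h3, h4, father, mother, offspring):
    """Check the trio constraint without recombinations."""
    return (explains(h1, h2, father)
            and explains(h3, h4, mother)
            and explains(h1, h3, offspring))


# Hypothetical two-SNP trio: father 02, mother 21, offspring 22.
print(is_trio_phasing("00", "01", "11", "01", "02", "21", "22"))
```

Swapping h1 and h2 in the example still explains the father but no longer explains the offspring, which shows how the offspring's genotype pins down which parental haplotype was transmitted.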

Pure-Parsimony Trio Phasing (PPTP)
  • Easy to find a feasible solution to TPP
    (exponential number of feasible solutions)
  • We pursue a parsimonious objective, i.e.,
    minimization of the total number of haplotypes
  • Drawback of PP: when the number of SNPs (as well
    as the number of recombinations) becomes large,
    the quality of pure parsimony phasing diminishes
  • Partition the genotypes into blocks
  • In the case of trio data we do not have the
    problem of joining blocks
  • Pure-Parsimony Trio Phasing (PPTP): given 3n
    genotypes corresponding to n family trios, find
    the minimum number of distinct haplotypes
    explaining all trios