Combinatorial methods in Bioinformatics: the haplotyping problem

1
Combinatorial methods in Bioinformatics: the
haplotyping problem
  • Paola Bonizzoni
  • DISCo
  • Università di Milano-Bicocca

2
Content
  • Motivation and biological terms
  • Combinatorial methods in haplotyping
  • Haplotyping via perfect phylogeny: the PPH
    problem
  • Inference of incomplete perfect phylogeny:
    algorithms
  • Incomplete pph and missing data
  • Other models and open problems

3
Biological terms
  • Diploid organism
  • Biallelic site i: the set of values at site i is a
    subset of {A, C, G, T} of size 2
  • Figure labels: heterozygous site, homozygous site,
    haplotype
4
Motivations
  • Human genetic variations are related to diseases
    (cancers, diabetes, osteoporosis); the most common
    variation is the Single Nucleotide Polymorphism
    (SNP) on haplotypes in chromosomes
  • The human genome project produces genotype
    sequences of humans
  • Computational methods to derive haplotypes from
    genotype data are in demand
  • The ongoing international HapMap project aims to find
    haplotype differences in large-scale
    population data

Combinatorial methods: graphs, set-cover problems,
optimization problems
5
Haplotyping: the formal model
  • Haplotype: m-vector h = <0, 1, ..., 0> over
    {0, 1}^m
  • Genotype: m-sequence g = <0, 1, *, 0, 0,
    1, 1> over {0, 1, *}^m

g = <*, *, 0, *, 1>
Def.
Haplotypes <h, k> solve genotype g iff
g(i) = * implies h(i) ≠ k(i)
h(i) = k(i) = g(i) otherwise
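
The definition above can be sketched as a short check. This is a minimal illustration, not code from the presentation: the function name `solves` and the use of the string "*" as the marker for a heterozygous site are assumptions made for this sketch.

```python
HET = "*"  # assumed marker for a heterozygous site


def solves(h, k, g):
    """Return True if the haplotype pair <h, k> solves genotype g."""
    if not (len(h) == len(k) == len(g)):
        return False
    for hi, ki, gi in zip(h, k, g):
        if gi == HET:
            # Heterozygous site: the two haplotypes must differ.
            if hi == ki:
                return False
        elif not (hi == ki == gi):
            # Homozygous site: both haplotypes must equal g(i).
            return False
    return True


g = [0, HET, 1]
print(solves([0, 1, 1], [0, 0, 1], g))  # True
print(solves([0, 1, 1], [0, 1, 1], g))  # False: no difference at the * site
```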
6
Examples
  • g = <0,*,1,*,0,1,1>
  • k = <0,0,1,1,0,1,1>
  • h = <0,1,1,0,0,1,1>
  • g is solved by <k, h>

Clark inference rule: given g and a known haplotype k,
derive h so that <k, h> solves g
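
A single application of Clark's inference rule can be sketched as follows: given a genotype and one known compatible haplotype, the second haplotype is forced. The name `clark_infer` and the "*" marker for heterozygous sites are assumptions of this sketch, and only one step of the rule is shown, not the full iterative procedure.

```python
HET = "*"  # assumed marker for a heterozygous site


def clark_infer(g, k):
    """Return the haplotype h such that <k, h> solves g, or None if k conflicts with g."""
    if len(g) != len(k):
        return None
    h = []
    for gi, ki in zip(g, k):
        if gi == HET:
            h.append(1 - ki)  # heterozygous site: h takes the other allele
        elif gi == ki:
            h.append(gi)      # homozygous site: h copies g(i)
        else:
            return None       # k disagrees with g at a homozygous site
    return h


g = [0, HET, 1, HET, 0, 1, 1]
k = [0, 0, 1, 1, 0, 1, 1]
print(clark_infer(g, k))  # [0, 1, 1, 0, 0, 1, 1]
```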
7
Haplotype inference: the general problem
  • Problem HI
  • Instance: a set G = {g1, ..., gm} of genotypes and a
    set H = {h1, ..., hn} of haplotypes
  • Solution: a set H' of haplotypes that solves
    each genotype g in G, s.t. H ⊆ H'

H' derives from an inference RULE
8
Types of inference rules
  • Clark's rule: haplotypes solve g by an iterative
    rule
  • Gusfield coalescent model: haplotypes are related
    to genotypes by a tree model
  • Pedigree data: haplotypes are related to
    genotypes by a directed graph

9
Mendelian law and Recombination
Figure labels: Mother, Father, Parent (A C / B D),
Child (A C, B D, A D, B C)
10
Pedigree
  • Pedigree, nuclear family, founder

11
Pedigree
Figure labels: Mother, Father, ID number, Founders,
Genotypes, Family trio, Loop, Children, Nuclear family
12
Haplotyping from genotypes: the problem and methods
  • Problem
    • Input: genotype data (possibly with missing values).
    • Output: haplotypes.
  • Input data
    • Data with pedigree (dependent).
    • Data without pedigree info (independent).
  • Statistical methods
    • Find the most likely haplotypes based on genotype
      data.
    • Advantage: solid theoretical basis
    • Disadvantage: computationally intensive
  • Rule-based methods
    • Define rules based on some plausible assumptions
      and find those haplotypes consistent with these
      rules.
    • Advantage: usually simple, thus very fast
    • Disadvantage: no numerical assessment of the
      reliability of the results

13
HI by the perfect phylogeny model
  • IDEA: genotypes are the mating of haplotypes in a tree
  • Given G, find H and T that explain G!

Figure: a tree rooted at 00000
G: g1 = <0,1,*,*,1>, g2 = <*,0,0,0,1>
H: <1,0,0,0,1>, <0,0,0,0,1>, <0,1,1,0,1>
14
Perfect Phylogeny models
  • Input data: a 0-1 matrix A (characters, species)
  • Output data: a phylogeny for A

1 1 0 0 0
0 0 1 0 0
1 1 0 0 1
0 0 1 1 0
Figure labels: root R; path c3 c4
15
Perfect phylogeny
Def.
A pp for a 0-1 matrix A is a tree T such that
  • each row si labels exactly one leaf of T
  • each column cj labels exactly one edge of T
  • each internal edge is labelled by at least one
    column cj
  • row si gives the 0-1 vector of the path from the
    root to si

Example: the path c3 c4 corresponds to the row 0 0 1 1
16
pp model: another view
For a node x of T, L(x) is the cluster of x:
the set of leaves of the subtree T_x
A pp is associated with a tree-family (S, C) with
S = {s1, ..., sn}, C = {S' ⊆ S : S' is a cluster}, s.t.
∀ X, Y in C, if X ∩ Y ≠ ∅ then X ⊆ Y or Y ⊆ X.
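
The tree-family condition is a pairwise test on clusters: any two clusters that intersect must be nested. A minimal sketch of that test, where the function name `is_tree_family` and the representation of clusters as Python sets are assumptions of this illustration:

```python
from itertools import combinations


def is_tree_family(clusters):
    """True if every two intersecting clusters are nested one inside the other."""
    sets = [frozenset(c) for c in clusters]
    for x, y in combinations(sets, 2):
        if x & y and not (x <= y or y <= x):
            return False
    return True


print(is_tree_family([{"s1", "s2", "s3"}, {"s1", "s2"}, {"s4"}]))  # True
print(is_tree_family([{"s1", "s2"}, {"s2", "s3"}]))                # False
```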
17
pp: another view
A tree-family (S, C) is represented by a 0-1
matrix B (rows sj, columns ci)
  • column ci represents a set S' in C: sj ∈ S' iff
    b_ji = 1
  • for each set in C there is at least one column

0 1 0 0 0
0 0 1 0 0
1 1 0 0 1
0 0 1 1 0

Lemma: A 0-1 matrix has a pp iff it represents a
tree-family
18
Haplotyping by the pp
  • A 0-1 matrix B represents the phylogenetic tree
    for a set H of haplotypes
  • rows si: haplotypes
  • columns ci: SNPs

Each SNP site i makes a 0-1 switch only once in the tree!
Example: the edge from 00000 to 01000 switches site 2.
19
Haplotyping and the pp: observations
  • The root of T may not be the all-zero haplotype
  • 0-1 switch or 1-0 switch (directed case)

Figure: example trees with nodes labelled by haplotypes
(e.g. 00000, 00011, 01000, 11000) and edges marked as
0-1 switch or 1-0 switch
20
HI problem in the pp model
  • Input data: a 0-1-* matrix B (n × m) of genotypes
    G
  • Output data: a 0-1 matrix B' (2n × m) of
    haplotypes s.t.
  • (1) each g ∈ G is solved by a pair of
    rows <h, k> in B'
  • (2) B' has a pp (tree family)

DECISION problem
21
An example
Genotypes      Haplotypes
a  * *         a  1 0    a  0 1
b  0 *         b  0 1    b  0 0
c  1 0         c  1 0    c  1 0
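
On tiny instances such as this one, the PPH decision problem can be explored by exhaustive search. The sketch below is exponential and is not any of the published PPH algorithms. It assumes "*" marks a heterozygous site and uses the standard four-gamete condition (no pair of columns shows all of 00, 01, 10, 11) as the test for an undirected perfect phylogeny. All function names are hypothetical.

```python
from itertools import combinations, product

HET = "*"  # assumed marker for a heterozygous site


def resolutions(g):
    """Yield every unordered haplotype pair (h, k) that solves genotype g."""
    het = [i for i, x in enumerate(g) if x == HET]
    if not het:
        yield tuple(g), tuple(g)
        return
    # Fix h to 0 at the first heterozygous site so mirrored pairs are skipped.
    for bits in product((0, 1), repeat=len(het) - 1):
        h, k = list(g), list(g)
        for i, b in zip(het, (0,) + bits):
            h[i], k[i] = b, 1 - b
        yield tuple(h), tuple(k)


def four_gamete_ok(rows):
    """True if no pair of columns contains all four combinations 00, 01, 10, 11."""
    for a, b in combinations(range(len(rows[0])), 2):
        if len({(r[a], r[b]) for r in rows}) == 4:
            return False
    return True


def pph_brute_force(genotypes):
    """Return a list of 2n haplotype rows admitting a pp, or None if none exists."""
    for choice in product(*(list(resolutions(g)) for g in genotypes)):
        rows = [hap for pair in choice for hap in pair]
        if four_gamete_ok(rows):
            return rows
    return None


G = [[HET, HET], [0, HET], [1, 0]]  # genotypes a, b, c
print(pph_brute_force(G))
# [(0, 1), (1, 0), (0, 0), (0, 1), (1, 0), (1, 0)]
```

Resolving a as 00 and 11 is rejected because, together with 01 and 10, the two columns would show all four gametes.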
22
The pph problem: solutions
  • An undirected algorithm: Gusfield, RECOMB 2002
  • An O(nm^2) algorithm: Karp et al., RECOMB 2003
  • A linear-time O(nm) algorithm?
  • Optimal algorithm
  • A related problem: the incomplete directed pp
    (IDP)
  • Inferring a pp from a 0-1-? matrix
  • O(nm + k log^2(n + m)) algorithm: I. Pe'er, T. Pupko, R.
    Shamir, R. Sharan, SIAM 2004

23
IDP problem
  • Instance: a 0-1-? matrix A
  • Solution: resolve each ? into 0 or 1 to obtain a
    matrix A' and a pp for A', or say that no pp exists
  • OPEN PROBLEM: find an optimal algorithm
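
To make the IDP instance and solution concrete, here is an exhaustive-search sketch. It tries every completion of the ? entries, so it is exponential in the number of missing entries and is not the algorithm of Pe'er et al. It assumes the directed case with an all-zero root, where a 0-1 matrix has a pp exactly when the 1-sets of its columns are pairwise disjoint or nested. Function names are hypothetical.

```python
from itertools import combinations, product


def has_directed_pp(rows):
    """True if the 1-sets of the columns are pairwise disjoint or nested."""
    cols = [frozenset(i for i, r in enumerate(rows) if r[j] == 1)
            for j in range(len(rows[0]))]
    return all(not (x & y) or x <= y or y <= x
               for x, y in combinations(cols, 2))


def idp_brute_force(matrix):
    """Return a completion of the '?' entries that has a directed pp, or None."""
    holes = [(i, j) for i, r in enumerate(matrix)
             for j, v in enumerate(r) if v == "?"]
    for bits in product((0, 1), repeat=len(holes)):
        filled = [list(r) for r in matrix]
        for (i, j), b in zip(holes, bits):
            filled[i][j] = b
        if has_directed_pp(filled):
            return filled
    return None


A = [[1, 1], [1, "?"], [0, 1]]
print(idp_brute_force(A))  # [[1, 1], [1, 1], [0, 1]]
```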

24
Decision algorithms for incomplete pp
  • Based on a characterization of a 0-1 matrix A that
    has a pp
  • Tree family
  • Forbidden submatrix: gives a "no" certificate

Bipartite graph G(A) = (S, C, E) with E = {(si, cj) :
b_ij = 1}
Forbidden subgraph: two columns X and Y with rows
X Y
1 0
1 1
0 1
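
The forbidden-submatrix view yields a "no" certificate: two columns together with three rows showing 10, 11 and 01. A minimal sketch of a search for such a certificate follows. It uses a plain pairwise scan over columns, chosen here for clarity rather than efficiency, and the name `find_forbidden_submatrix` is an assumption of this illustration.

```python
from itertools import combinations


def find_forbidden_submatrix(rows):
    """Return ((col_x, col_y), (row_10, row_11, row_01)) or None if no certificate exists."""
    for a, b in combinations(range(len(rows[0])), 2):
        witness = {}
        for i, r in enumerate(rows):
            # Remember the first row showing each pattern in columns a, b.
            witness.setdefault((r[a], r[b]), i)
        if all(p in witness for p in ((1, 0), (1, 1), (0, 1))):
            return (a, b), (witness[(1, 0)], witness[(1, 1)], witness[(0, 1)])
    return None


print(find_forbidden_submatrix([[1, 0], [1, 1], [0, 1]]))  # ((0, 1), (0, 1, 2))
print(find_forbidden_submatrix([[1, 1], [1, 0], [0, 0]]))  # None
```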
25
Test: does a 0-1 matrix A have a pp?
  • O(nm) algorithm (Gusfield 1991)
  • Steps
  • Given A, order c1, ..., cm as (decreasing) binary
    numbers to obtain A'
  • Let L(i,j) = k, where k = max{l < j : A'[i,l] = 1}
  • Let index(j) = max{L(i,j) : i}
  • Then apply the theorem

TH. A' has a pp iff L(i,j) = index(j) for
each (i,j) s.t. A'[i,j] = 1
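
The steps above can be sketched in Python. Several details are choices of this sketch rather than statements from the slide: duplicate columns are dropped before sorting, L(i,j) is set to 0 when row i has no earlier 1, and the generic sort makes this version slower than the O(nm) bound, which relies on radix sorting. The function name is hypothetical.

```python
def has_pp_gusfield(rows):
    """Test a 0-1 matrix for a directed perfect phylogeny (all-zero root)."""
    n, m = len(rows), len(rows[0])
    # Read each column top to bottom as a binary number; sort in decreasing order.
    cols = sorted({tuple(r[j] for r in rows) for j in range(m)}, reverse=True)
    index = [0] * (len(cols) + 1)  # index[j] = max over i of L(i, j); columns are 1-based
    L = {}
    for i in range(n):
        last = 0  # position of the previous 1 in row i, 0 if none
        for j, col in enumerate(cols, start=1):
            if col[i] == 1:
                L[(i, j)] = last
                index[j] = max(index[j], last)
                last = j
    # Theorem: a pp exists iff L(i, j) == index(j) for every 1-entry (i, j).
    return all(value == index[j] for (_, j), value in L.items())


A = [
    [1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 1, 1, 0],
]
print(has_pp_gusfield(A))                         # True
print(has_pp_gusfield([[1, 0], [1, 1], [0, 1]]))  # False
```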
26
Idea
27
The IDP algorithm
Figure labels: C, c, s1, s2, s3
28
Other HI problems via the pp model
  • Incomplete 0-1-*-? matrix because of missing
    data
  • Incomplete haplotype pp (Ihpp): haplotype rows
  • Incomplete genotype pp (Igpp): genotype rows
  • Algorithms
  • Ihpp = IDP given a row as a root (polynomial
    time)
  • NP-complete otherwise
  • Igpp has a polynomial solution under the rich data
    hypothesis (Karp et al., RECOMB 2004, ICALP 2004)
  • NP-complete otherwise

29
HI problem and other models
  • Haplotype inference in pedigree data under the
    recombination model

30
Pedigree graph
Figure labels: father, mother, child
31
Haplotype inference in pedigree
00 01 10
10 11 00
01 11 01
32
Problems
  • MPT-MRHI (Pedigree tree multi-mating minimum
    recombination HI)
  • SPT-MRHI (Pedigree tree single-mating minimum
    recombination HI)

NP-complete even if the graph is acyclic, but with an
unbounded number of children
OPEN
33
Conclusions
34
References