Greedy Approximation Algorithms for Covering Problems in Computational Biology


1
Greedy Approximation Algorithms for Covering
Problems in Computational Biology
  • Ion Mandoiu
  • Computer Science Engineering Department
  • University of Connecticut

2
Why Approximation Algorithms?
  • Most practical optimization problems are NP-hard
  • Approximation algorithms offer the next best
    thing to an efficient exact algorithm
  • Polynomial time
  • Solutions guaranteed to be close to optimum
  • α-approximation algorithm: solution cost within a
    multiplicative factor of α of the optimum cost
  • Practical relevance: insights needed to establish
    an approximation guarantee often lead to fast,
    highly effective practical implementations

3
Why Computational Biology?
  • Exploding multidisciplinary field at the
    intersection of computer science, biology,
    discrete mathematics, statistics, optimization,
    chemistry, physics, ...
  • Source of a fast growing number of combinatorial
    optimization applications
  • TSP and Euler paths in DNA sequencing
  • Dynamic Programming in sequence alignment
  • Integer Programming in Haplotype inference
  • This talk: two covering problems in
    computational biology (primer set selection and
    string barcoding)

4
Overview
  • Potential function greedy algorithm
    - The set cover problem and the greedy algorithm
    - Potential function generalization
  • Primer Set Selection for Multiplex PCR
  • The String Barcoding Problem
  • Conclusions

5
The Set Cover Problem
  • Given
  • Universal set U with n elements
  • Family of sets (S_x, x ∈ X) covering all elements
    of U
  • Find
  • Minimum size subset X' of X s.t. (S_x, x ∈ X')
    covers all elements of U
  • Greedy Algorithm
  • Start with empty X', and repeatedly add x such
    that S_x contains the most uncovered elements
    until U is covered
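The greedy rule above can be sketched in a few lines of Python. This is only an illustration: the function name `greedy_set_cover` and the toy family `S` are assumptions made for the example, not part of the talk.

```python
def greedy_set_cover(universe, family):
    """Return keys of `family` whose sets together cover `universe`.

    `family` maps an index x to the set S_x (hypothetical toy representation).
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Pick the set containing the most still-uncovered elements.
        best = max(family, key=lambda x: len(family[x] & uncovered))
        if not family[best] & uncovered:
            raise ValueError("family does not cover the universe")
        chosen.append(best)
        uncovered -= family[best]
    return chosen


# Toy instance (made up for illustration).
U = range(1, 8)
S = {"a": {1, 2, 3, 4}, "b": {4, 5, 6}, "c": {6, 7}, "d": {1, 5, 7}}
cover = greedy_set_cover(U, S)
print(cover)
```

Ties are broken by dictionary order here; any tie-breaking rule is compatible with the greedy algorithm as stated.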

6
Approximation Guarantee
  • Classical result (Johnson 74, Lovász 75,
    Chvátal 79): the greedy setcover algorithm has an
    approximation factor of H(n) = 1 + 1/2 + 1/3 + ... + 1/n <
    1 + ln(n)
  • The approximation factor is tight
  • Cannot be approximated within a factor of
    (1-ε) ln(n) unless NP ⊆ DTIME(n^(log log n))
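As a quick numeric check of the bound H(n) < 1 + ln(n), the following sketch compares the two quantities for a few values of n. The helper name `harmonic` is chosen for this example; the inequality is strict for n ≥ 2 and holds with equality at n = 1.

```python
import math


def harmonic(n):
    """H(n) = 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / i for i in range(1, n + 1))


for n in (10, 100, 1000):
    # Print n, H(n), and the upper bound 1 + ln(n) side by side.
    print(n, round(harmonic(n), 3), round(1 + math.log(n), 3))
```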

7
General setting
  • Potential function Φ(X) ≥ 0
  • Φ(∅) = Φmax
  • Φ(X) = 0 for all feasible solutions
  • X ⊆ X' ⇒ Φ(X) ≥ Φ(X')
  • If Φ(X) > 0, then there exists x s.t. Φ(X ∪ {x}) <
    Φ(X)
  • X ⊆ X' ⇒ Δ(x,X) ≥ Δ(x,X') for every x, where
  • Δ(x,X) = Φ(X) - Φ(X ∪ {x})
  • Problem: find minimum size set X with Φ(X) = 0

8
Generic Greedy Algorithm
  • X ← ∅
  • While Φ(X) > 0
  • Find x with maximum Δ(x,X)
  • X ← X ∪ {x}
  • Theorem (Konwar et al. 05): The generic greedy
    algorithm has an approximation factor of 1 + ln Δmax,
    where Δmax is the largest gain Δ(x,X)
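A minimal sketch of the generic greedy loop, assuming the potential is supplied as a Python function `phi` on frozensets. The names `potential_greedy` and `uncovered_count` and the toy instance are illustrative assumptions. Set cover is recovered by taking the potential to be the number of uncovered elements.

```python
def potential_greedy(candidates, phi):
    """Greedy for: find a small X with phi(X) == 0.

    `phi` maps a frozenset of candidates to a non-negative number and is
    assumed to satisfy the conditions of the general setting.
    """
    X = frozenset()
    order = []
    while phi(X) > 0:
        # Gain of x with respect to X: Delta(x, X) = phi(X) - phi(X + x).
        best = max(candidates, key=lambda x: phi(X) - phi(X | {x}))
        if phi(X | {best}) >= phi(X):
            raise ValueError("no candidate decreases the potential")
        X = X | {best}
        order.append(best)
    return order


# Set cover as a special case: phi(X) = number of uncovered elements.
S = {"a": {1, 2, 3, 4}, "b": {4, 5, 6}, "c": {6, 7}, "d": {1, 5, 7}}
U = set().union(*S.values())


def uncovered_count(X):
    return len(U - set().union(*(S[x] for x in X)))


picked = potential_greedy(list(S), uncovered_count)
print(picked)
```

This version re-evaluates the potential for every candidate in every round, which is simple but slow; it is meant to mirror the pseudocode, not to be efficient.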

9
Proof Idea
Let x1, x2, ..., xg be the elements selected by
greedy, in the order in which they are chosen,
and x1*, x2*, ..., xk* be the elements of an optimum
solution.
Charging scheme: xi charges to xj* a cost of
[formula missing in source]
where Δij = Δ(xi, {x1, ..., xi-1} ∪ {x1*, ..., xj*})
Fact 1: Each xj* gets charged a total cost of at most
1 + ln Δmax

10
Proof of claim 2
Fact 2: Each xi charges at least 1 unit of cost

11
Overview
  • Potential function greedy algorithm
  • Primer set selection for multiplex PCR
  • Motivation and problem formulation
  • Greedy applied to primer set selection
  • Experimental results
  • The String Barcoding Problem
  • Conclusions

12
DNA Structure
  • Four nucleotide types: A, C, T, G
  • Normally double stranded
  • A's paired with T's
  • C's paired with G's
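The pairing rule determines one strand from the other. A small sketch, assuming sequences are plain strings over A, C, G, T (the function name is chosen for this example); the result is reversed because the two strands are read in opposite directions.

```python
# Pairing rule: A <-> T and C <-> G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the opposite strand, read in its own 5'-to-3' direction."""
    return seq.upper().translate(COMPLEMENT)[::-1]


print(reverse_complement("CAGTGC"))
```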

13
The Polymerase Chain Reaction
14
Primer Pair Selection Problem
  • Given
  • Genomic sequence around amplification locus
  • Primer length k
  • Amplification upper bound L
  • Find: Forward and reverse primers of length k
    that hybridize within a distance of L of each
    other and optimize amplification efficiency
    (melting temperature, secondary structure,
    mis-priming, etc.)

15
Multiplex PCR
  • Multiplex PCR (MP-PCR)
  • Multiple DNA fragments amplified simultaneously
  • Each amplified fragment still defined by two
    primers
  • A primer may participate in amplification of
    multiple targets
  • Primer set selection
  • Typically done by time-consuming trial and error
  • An important objective is to minimize number of
    primers
  • Reduced assay cost
  • Higher effective concentration of primers ⇒
    higher amplification efficiency
  • Reduced unintended amplification

16
Primer Set Selection Problem
  • Given
  • Genomic sequences around n amplification loci
  • Primer length k
  • Amplification upper bound L
  • Find
  • Minimum size set S of primers of length k such
    that, for each amplification locus, there are two
    primers in S hybridizing with the forward and
    reverse genomic sequences within a distance of L
    of each other

17
Applications
  • Single Nucleotide Polymorphism (SNP) genotyping
  • Up to thousands of SNPs genotyped simultaneously
  • Selective PCR amplification required for improved
    accuracy
  • Spotted microarray synthesis [Fernandes & Skiena 02]
  • Primers can be used multiple times
  • For each target, need a pair of primers
    amplifying that target and only that target
    (amplification uniqueness constraint)
  • Can still reduce # of primers from 2n to O(n^(1/2))

18
Previous Work on Primer Selection
  • Well-studied problem [Pearson et al. 96,
    Linhart & Shamir 02, Souvenir et al. 03, etc.]
  • Almost all problem formulations decouple
    selection of forward and reverse primers
  • To enforce bound of L on amplification length,
    select only primers that hybridize within L/2
    bases of desired target
  • In worst case, this method can increase the
    number of primers by a factor of O(n) compared to
    the optimum
  • [Pearson et al. 96]: Greedy set cover algorithm
    gives O(ln n) approximation factor for the
    decoupled formulation

19
Previous Work (contd.)
  • [Fernandes & Skiena 02] study primer set selection
    with uniqueness constraints
  • Minimum Multi-Colored Subgraph Problem
  • Vertices correspond to candidate primers
  • Edge colored by color i between u and v iff
    corresponding primers hybridize within a distance
    of L of each other around i-th amplification
    locus
  • Goal is to find minimum size set of vertices
    inducing edges of all colors
  • Can capture length amplification constraints too

20
Integer Program Formulation
  • 0/1 variable x_u for every vertex u
  • 0/1 variable y_e for every edge e
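The program itself is not preserved in this transcript. One natural formulation consistent with the two variable families, written under the assumption that E_i denotes the set of edges of color i (a name introduced here), would be:

```latex
\begin{align*}
\min\ & \sum_{u \in V} x_u \\
\text{s.t.}\ & y_e \le x_u,\quad y_e \le x_v && \text{for every edge } e = (u,v) \\
& \sum_{e \in E_i} y_e \ge 1 && \text{for every color } i \\
& x_u,\ y_e \in \{0,1\}
\end{align*}
```

The first family of constraints says an edge counts only if both endpoints are selected; the second requires at least one counted edge of every color.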

21
LP-Rounding Algorithm
(1) Solve linear programming relaxation
(2) Select node u with probability x_u
(3) Repeat step 2 O(ln(n)) times and return selected nodes
  • Theorem [Konwar et al. 04]: The LP-rounding
    algorithm finds a feasible solution at most
    O(m^(1/2) ln n) times larger than the optimum, where m
    is the maximum color class size, and n is the
    number of nodes
  • For primer selection, m ≤ L^2 ⇒ approximation
    factor is O(L ln n)
  • Better approximation?
  • Unlikely for minimum multi-colored subgraph
    problem
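Steps (2) and (3) of the rounding procedure can be sketched as follows. The LP solve of step (1) is deliberately left out: `x_frac` is a made-up fractional solution, and all names here are assumptions for illustration.

```python
import random


def round_lp_solution(x_frac, rounds, rng):
    """Select node u with probability x_frac[u]; repeat `rounds` times."""
    selected = set()
    for _ in range(rounds):
        for u, p in x_frac.items():
            if rng.random() < p:
                selected.add(u)
    return selected


def induces_all_colors(selected, colored_edges):
    """Check that the selected nodes induce at least one edge of every color.

    `colored_edges` maps a color to a list of (u, v) edges.
    """
    return all(
        any(u in selected and v in selected for u, v in edges)
        for edges in colored_edges.values()
    )


# Hypothetical fractional solution and colored edges.
x_frac = {"p1": 1.0, "p2": 1.0, "p3": 0.0}
colored_edges = {1: [("p1", "p2")], 2: [("p2", "p1"), ("p2", "p3")]}
nodes = round_lp_solution(x_frac, rounds=3, rng=random.Random(0))
print(nodes, induces_all_colors(nodes, colored_edges))
```

In a real run the number of rounds would be chosen on the order of ln(n), and the output would be checked for feasibility as done here.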

22
Selection w/o Uniqueness Constraints
  • Can be seen as a simultaneous set covering
    problem
  • The ground set is partitioned into n disjoint
    sets Si (one for each target), each with 2L
    elements
  • The goal is to select a minimum number of sets
    (i.e., primers) that cover at least half of the
    elements in each partition

[Figure: SNP i flanked by two windows of length L]

23
Greedy Algorithm
  • Potential function Φ = minimum number of
    elements that must still be covered = Σi max{0, L -
    # covered elements in Si}
  • Initially, Φ = nL
  • For feasible solutions, Φ = 0
  • Δ(·) ≤ nL (much smaller in practice)
  • Theorem [Konwar et al. 05]: The number of primers
    selected by the greedy algorithm is at most
    1 + ln(nL) times larger than the optimum
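A sketch of this potential in the simultaneous set cover view, where each target i has a ground set S_i of 2L elements and at least L of them must be covered. The data layout (`covers[p][i]` is the set of elements of S_i covered by primer p) and the toy instance are assumptions made for the example.

```python
def primer_potential(chosen, covers, targets, L):
    """Phi = sum over targets i of max(0, L - number of covered elements of S_i)."""
    total = 0
    for i in targets:
        covered = set()
        for p in chosen:
            covered |= covers[p].get(i, set())
        total += max(0, L - len(covered))
    return total


def greedy_primers(covers, targets, L):
    """Repeatedly add the primer that lowers the potential the most."""
    chosen = []
    phi = primer_potential(chosen, covers, targets, L)
    while phi > 0:
        best = min(
            covers,
            key=lambda p: primer_potential(chosen + [p], covers, targets, L),
        )
        new_phi = primer_potential(chosen + [best], covers, targets, L)
        if new_phi >= phi:
            raise ValueError("remaining targets cannot be covered")
        chosen.append(best)
        phi = new_phi
    return chosen


# Toy instance with n = 2 targets and L = 2 (each S_i has 4 elements).
covers = {
    "p1": {0: {"f1", "f2"}, 1: {"f1"}},
    "p2": {0: {"r1"}, 1: {"r1"}},
    "p3": {1: {"f2", "r2"}},
}
primers = greedy_primers(covers, targets=[0, 1], L=2)
print(primers)
```

With no primers selected the potential equals nL = 4, and it drops to 0 exactly when every target has at least L covered elements.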

24
Experimental Setting
  • Datasets extracted from NCBI databases, L = 1000
  • Dell PowerEdge 2.8GHz Xeon
  • Compared algorithms
  • G-FIX: greedy primer cover algorithm [Pearson et
    al.]
  • MIPS-PT: iterative beam-search heuristic
    [Souvenir et al.]
  • Restrict primers to L/2 bases around
    amplification locus
  • G-VAR: naïve modification of G-FIX
  • First selected primer can be up to L bases away
  • Opposite sequence truncated after selecting first
    primer
  • G-POT: potential function driven greedy algorithm

25
Experimental Results, NCBI tests
G-FIX: Pearson et al.; G-VAR: G-FIX with dynamic truncation; MIPS-PT: Souvenir et al.; G-POT: potential-function greedy

            | G-FIX            | G-VAR            | MIPS-PT          | G-POT
Targets   k | Primers  CPU sec | Primers  CPU sec | Primers  CPU sec | Primers  CPU sec
     20   8 |       7     0.04 |       7     0.08 |       8       10 |       6     0.10
     20  10 |       9     0.03 |      10     0.08 |      13       15 |       9     0.08
     20  12 |      14     0.04 |      13     0.08 |      18       26 |      13     0.11
     50   8 |      13     0.13 |      15     0.30 |      21       48 |      10     0.32
     50  10 |      23     0.22 |      24     0.36 |      30      150 |      18     0.33
     50  12 |      31     0.14 |      32     0.30 |      41      246 |      29     0.28
    100   8 |      17     0.49 |      20     0.89 |      32      226 |      14     0.58
    100  10 |      37     0.37 |      37     0.72 |      50      844 |      31     0.75
    100  12 |      53     0.59 |      48     0.84 |      75     2601 |      42     0.61

26
#primers, as percentage of 2n (l = 8)
[Plot; x-axis: n]

27
#primers, as percentage of 2n (l = 10)
[Plot; x-axis: n]

28
#primers, as percentage of 2n (l = 12)
[Plot; x-axis: n]

29
CPU Seconds (l = 10)
[Plot; x-axis: n]

30
Overview
  • Potential function greedy algorithm
  • Primer Set Selection for Multiplex PCR
  • The String Barcoding Problem
    - Problem Formulation
    - Integer programming and greedy algorithms
    - Experimental results
  • Conclusions

31
Motivation
  • Rapid pathogen detection
  • Given
  • Pathogen with unknown identity
  • Database of known pathogens
  • Problem
  • Identify unknown pathogen quickly
  • Ideal solution: determine DNA sequence of unknown
    pathogen

32
Real World
  • Not possible to quickly sequence an unknown
    pathogen
  • Only have sequence for pathogens in database
  • Can quickly test for presence of short substrings
    in unknown virus (substring tests) using
    hybridization
  • String barcoding [Borneman et al. 01,
    Rash & Gusfield 02]
  • Use substring tests that uniquely identify each
    pathogen in the database

33
String Barcoding Problem
Given: Genomic sequences g1, ..., gn
Find: Minimum number of distinguisher strings t1, ..., tk
Such that: For every gi ≠ gj, there exists a
string tl which is a substring of gi or gj, but
not of both
  • At least log2(n) distinguishers needed
  • Fingerprints ⇒ n distinguishers

34
Example
  • Given sequences
  • 1. cagtgc
  • 2. cagttc
  • 3. catgga
  • Feasible set of distinguishers: tg, atgga

          tg  atgga
cagtgc     1      0
cagttc     0      0
catgga     1      1
Row vectors: unique barcodes for each pathogen
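The barcode table can be recomputed directly from the definition: entry (i, l) is 1 exactly when distinguisher l is a substring of sequence i. The function names below are chosen for this sketch.

```python
from itertools import combinations


def barcodes(sequences, distinguishers):
    """One 0/1 row per sequence: 1 if the distinguisher is a substring."""
    return [tuple(int(t in g) for t in distinguishers) for g in sequences]


def is_feasible(sequences, distinguishers):
    """A distinguisher set is feasible when all barcodes are different."""
    codes = barcodes(sequences, distinguishers)
    return all(a != b for a, b in combinations(codes, 2))


seqs = ["cagtgc", "cagttc", "catgga"]
codes = barcodes(seqs, ["tg", "atgga"])
print(codes)
print(is_feasible(seqs, ["tg", "atgga"]), is_feasible(seqs, ["tg"]))
```

Using "tg" alone is not feasible, since the first and third sequences both contain it.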
35
Computational Complexity
  • [Berman et al. 04]: Cannot be approximated within
    a factor of (1-ε) ln(n) unless NP ⊆ DTIME(n^(log log n))

36
Setcover Greedy Algorithm
  • Distinguisher selection as setcover problem
  • Elements to be covered are the pairs of sequences
  • Each candidate distinguisher defines a set of
    pairs that it separates
  • Another view: covering all edges of a complete
    graph with n vertices by the minimum number of
    given cuts
  • For n sequences, largest set can have O(n^2)
    elements
  • ⇒ The setcover greedy guarantees ln(n^2) = 2 ln n
    approximation
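A sketch of the setcover view: the elements are pairs of sequences, and each candidate substring covers the pairs it separates. Candidates are enumerated naively as all substrings up to a length bound `max_len`; that bound and the function names are assumptions of this example.

```python
from itertools import combinations


def substrings(s, max_len):
    """All substrings of s with length between 1 and max_len."""
    return {s[i:i + k] for k in range(1, max_len + 1)
            for i in range(len(s) - k + 1)}


def greedy_distinguishers(sequences, max_len):
    candidates = sorted(set().union(*(substrings(g, max_len) for g in sequences)))
    pairs = list(combinations(range(len(sequences)), 2))
    # Candidate t separates pair (i, j) if it occurs in exactly one of them.
    separates = {
        t: {(i, j) for i, j in pairs
            if (t in sequences[i]) != (t in sequences[j])}
        for t in candidates
    }
    uncovered = set(pairs)
    chosen = []
    while uncovered:
        best = max(candidates, key=lambda t: len(separates[t] & uncovered))
        if not separates[best] & uncovered:
            raise ValueError("some sequences cannot be told apart")
        chosen.append(best)
        uncovered -= separates[best]
    return chosen


seqs = ["cagtgc", "cagttc", "catgga"]
picked = greedy_distinguishers(seqs, max_len=3)
print(picked)
```

For three sequences any single distinguisher separates at most two of the three pairs, so two distinguishers are needed and the greedy finds two.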

37
Integer Program Formulation
  • 0/1 variable for each candidate distinguisher
  • 1 ⇒ candidate is selected
  • 0 ⇒ candidate is not selected
  • For each pair of sequences, at least one
    candidate separating them is selected
  • Objective Function
  • Minimize # selected candidates
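Written out, the description above corresponds to the following program. The symbols T (the set of candidate distinguishers) and v_t (the 0/1 variable of candidate t) are names introduced here for the sketch.

```latex
\begin{align*}
\min\ & \sum_{t \in T} v_t \\
\text{s.t.}\ & \sum_{t \in T:\ t \text{ separates } g_i,\, g_j} v_t \ge 1
  && \text{for all } 1 \le i < j \le n \\
& v_t \in \{0,1\} && \text{for all } t \in T
\end{align*}
```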

38
Practical Issues
  • Quadratic # of constraints, huge # of variables
  • Genome sizes range from thousands of bases for
    phage and viruses to millions for bacteria to
    billions for higher organisms
  • Many variables can be removed
  • Candidates that appear in all sequences
  • Sufficient to keep a single candidate among those
    that appear in the same set of sequences
  • How to efficiently remove useless variables?
  • [Rash & Gusfield] use suffix trees
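The two pruning rules can be illustrated without suffix trees by grouping candidates according to the set of sequences they occur in. This brute-force sketch is not the suffix-tree method; the function name and the candidate list are made up for the example.

```python
def prune_candidates(sequences, candidates):
    """Keep one candidate per occurrence pattern; drop useless ones."""
    kept = {}
    # Visit shorter candidates first so the shortest representative is kept.
    for t in sorted(candidates, key=lambda c: (len(c), c)):
        signature = tuple(t in g for g in sequences)
        if all(signature) or not any(signature):
            continue  # occurs in every sequence or in none: separates no pair
        kept.setdefault(signature, t)
    return sorted(kept.values())


seqs = ["cagtgc", "cagttc", "catgga"]
pruned = prune_candidates(seqs, ["c", "ca", "tg", "gt", "atgga", "tgg", "xyz"])
print(pruned)
```

Here "tgg" and "atgga" occur in the same set of sequences, so only one of them needs a variable.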

39
Suffix Tree Example
  • Strings
  • 1. cagtgc
  • 2. cagttc
  • 3. catgga

40
Integer Program
Minimize
  V18 + V22 + V11 + V17 + V8      (objective function)
Such that
  V18 + V17 + V8 >= 1             (constraint to cover pair 1,2)
  V22 + V11 + V8 >= 1             (constraint to cover pair 1,3)
  V18 + V22 + V11 + V17 >= 1      (constraint to cover pair 2,3)
Binaries                          (all variables are 0/1)
  V18 V22 V11 V17 V8
End

          tg (V18)  atgga (V22)
cagtgc       1          0
cagttc       0          0
catgga       1          1
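For an instance this small the program can be solved by enumeration. The sketch below assumes each constraint is read as "the sum of the listed variables is at least 1"; only V18 (tg) and V22 (atgga) are identified with strings in the slide, so the other variables are treated as opaque names.

```python
from itertools import product

variables = ["V18", "V22", "V11", "V17", "V8"]
constraints = [
    {"V18", "V17", "V8"},          # cover pair 1,2
    {"V22", "V11", "V8"},          # cover pair 1,3
    {"V18", "V22", "V11", "V17"},  # cover pair 2,3
]

best = None
for bits in product((0, 1), repeat=len(variables)):
    selected = {v for v, b in zip(variables, bits) if b}
    # Feasible if every constraint contains at least one selected variable.
    if all(selected & c for c in constraints):
        if best is None or len(selected) < len(best):
            best = selected

print(len(best), sorted(best))
```

No single variable appears in all three constraints, so the optimum uses two distinguishers; {V18, V22}, that is tg and atgga, is one such solution.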
41
Limitations of Integer Program Method
  • Works only for small instances
  • 50-150 sequences
  • Average length 1000 characters
  • Over 4 hours needed to come within 20% of
    optimum!
  • Scalable Heuristics?

42
Distinguisher Induced Partition
  • Key idea [Berman et al. 04]: Keep track of the
    partition defined by distinguishers selected so
    far

43
Information Content Heuristic
  • Φ = partition entropy = log2(# permutations
    compatible with current partition)
  • Initial partition entropy = log2(n!) ≈ n log2(n)
  • For feasible distinguisher sets, partition
    entropy = 0
  • Δ(·) ≤ n
  • log2(n!) - log2(k!(n-k)!) < log2(2^n) = n
  • Information content heuristic (ICH): greedy
    driven by partition entropy
  • Theorem [Berman et al. 04]: ICH has an
    approximation factor of 1 + ln(n)
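A sketch of the partition entropy and the greedy it drives. It assumes the number of permutations compatible with a partition is the product of the factorials of the class sizes, which matches the initial value log2(n!) and the final value 0; the function names and the candidate list are illustrative.

```python
import math


def partition_entropy(sequences, distinguishers):
    """log2 of the number of permutations compatible with the partition."""
    classes = {}
    for g in sequences:
        code = tuple(t in g for t in distinguishers)
        classes[code] = classes.get(code, 0) + 1
    return sum(math.log2(math.factorial(size)) for size in classes.values())


def ich_greedy(sequences, candidates):
    """Repeatedly add the candidate that lowers the entropy the most."""
    chosen = []
    current = partition_entropy(sequences, chosen)
    while current > 0:
        best = min(
            candidates,
            key=lambda t: partition_entropy(sequences, chosen + [t]),
        )
        new_entropy = partition_entropy(sequences, chosen + [best])
        if new_entropy >= current:
            raise ValueError("candidates cannot separate all sequences")
        chosen.append(best)
        current = new_entropy
    return chosen


seqs = ["cagtgc", "cagttc", "catgga"]
print(partition_entropy(seqs, []))
print(ich_greedy(seqs, ["tg", "atgga", "c"]))
```

The candidate "c" occurs in all three sequences, so it never reduces the entropy and is never picked.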

44
ICH Limitations
  • Real genomic data has degenerate nucleotides
  • Ambiguous sequencing
  • Single nucleotide polymorphisms
  • For sequences with degenerate nucleotides there
    are three possibilities for distinguisher
    hybridization
  • Sure hybridization
  • Sure mismatch
  • Uncertain hybridization
  • ⇒ No partition to work with!

45
Practical Implementation
  • ICH and setcover greedy give nearly identical
    results on data w/o degenerate bases
  • Setcover greedy can also be extended to handle
  • degenerate bases in the sequences
  • redundancy requirements (each pair of sequences
    must be separated r times)
  • Two main steps for both algorithms
  • Candidate generation
  • Greedy selection

46
Candidate Generation
  • Can be done using suffix trees
  • We use a simpler yet efficient incremental
    approach
  • Candidates that match all or only one sequence
    are removed from consideration
  • Solution quality is similar even when candidates
    are generated from a single sequence
  • Equivalent to considering only distinguisher sets
    that assign a barcode of (1,1,...,1) to the source
    sequence

47
Candidate Selection
  • Evaluate Δ(·) for all candidates and choose best
  • Speed-up techniques
  • Efficient gain computation using partition
    data-structure
  • Lazy gain update: if old Δ(·) is lower than best
    so far, do not recompute
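The lazy update works because gains never increase as more candidates are selected, so an old gain is an upper bound on the current one. One common way to organize this is a priority queue keyed by the last known gain; the sketch below uses that variant, which is an assumption of this example and not necessarily how the talk's implementation is structured.

```python
import heapq


def lazy_greedy(candidates, phi):
    """Greedy selection that recomputes a gain only when it could still win."""
    X = []
    current = phi(X)
    # Heap entries are (-gain, candidate); gains start out exact for X = [].
    heap = [(-(current - phi([x])), x) for x in candidates]
    heapq.heapify(heap)
    while current > 0:
        if not heap:
            raise ValueError("candidates exhausted before reaching potential 0")
        _, x = heapq.heappop(heap)
        new_phi = phi(X + [x])
        gain = current - new_phi  # refreshed gain of the top candidate only
        if gain <= 0:
            continue  # x cannot help later either, since gains do not increase
        if not heap or gain >= -heap[0][0]:
            # Stale gains are upper bounds, so no other candidate can beat x.
            X.append(x)
            current = new_phi
        else:
            heapq.heappush(heap, (-gain, x))  # keep x with its fresh gain
    return X


# Toy set cover instance: phi(X) = number of uncovered elements.
S = {"a": {1, 2, 3, 4}, "b": {4, 5, 6}, "c": {6, 7}, "d": {1, 5, 7}}
U = set().union(*S.values())


def phi(X):
    return len(U - set().union(*(S[x] for x in X)))


solution = lazy_greedy(list(S), phi)
print(solution)
```

Candidates whose stale gain is already below the best refreshed gain are never re-evaluated in that round, which is where the savings come from.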

48
Experimental Results
     n      mat   mat lazy     part   part lazy   #dist
   100     35.4       22.1      2.2         1.4     8.0
   200    221.6      125.2      8.8         4.6    10.0
   500   2168.8     1144.4     53.0        18.7    12.3
  1000   5600.4     2756.4    113.6        31.7    14.1
  • Averages over 10 testcases, sequence length
    10,000
  • Barcodes for 100 sequences of length 1,000,000
    computed in less than 10 minutes

49
Overview
  • Potential function greedy algorithm
  • Primer Set Selection for Multiplex PCR
  • The String Barcoding Problem
  • Conclusions

50
Conclusions
  • General potential function framework for
    designing and analyzing greedy covering
    algorithms
  • Improved approximation guarantees and practical
    performance for two important optimization
    problems in computational biology: primer set
    selection for multiplex PCR, and distinguisher
    selection for string barcoding

51
Ongoing Work
  • Primer Set Selection
  • Improved hybridization models
  • Degenerate primers
  • Partitioning into multiple multiplexed PCR
    reactions
  • Close approximation gap for minimum multicolored
    sub-graph
  • String Barcoding
  • Probe mixtures as distinguishers
  • Beyond redundancy: error correcting
  • Simultaneous detection of multiple pathogens

52
Acknowledgments
  • B. DasGupta, K. Konwar, A. Russell, A. Shvartsman
  • UCONN Research Foundation