Greedy Approximation Algorithms for Covering Problems in Computational Biology

Slides: 53

Provided by: IonMa8

Learn more at: https://dna.engr.uconn.edu

1
Greedy Approximation Algorithms for Covering
Problems in Computational Biology

Ion Mandoiu
Computer Science Engineering Department
University of Connecticut

2
Why Approximation Algorithms?

Most practical optimization problems are NP-hard
Approximation algorithms offer the next best
thing to an efficient exact algorithm
Polynomial time
Solutions guaranteed to be close to optimum
α-approximation algorithm: solution cost within a
multiplicative factor of α of optimum cost
Practical relevance: insights needed to establish
approximation guarantee often lead to fast,
highly effective practical implementations

3
Why Computational Biology?

Exploding multidisciplinary field at the
intersection of computer science, biology,
discrete mathematics, statistics, optimization,
chemistry, physics, ...
Source of a fast growing number of combinatorial
optimization applications
TSP and Euler paths in DNA sequencing
Dynamic Programming in sequence alignment
Integer Programming in Haplotype inference
This talk: two covering problems in
computational biology (primer set selection and
string barcoding)

4
Overview

Potential function greedy algorithm
- The set cover problem and the greedy algorithm
- Potential function generalization
Primer Set Selection for Multiplex PCR
The String Barcoding Problem
Conclusions

5
The Set Cover Problem

Given
Universal set U with n elements
Family of sets $(S_x, x \in \mathcal{X})$ covering all elements
of U
Find
Minimum size subset $X \subseteq \mathcal{X}$ s.t. $(S_x, x \in X)$
covers all elements of U
Greedy Algorithm
- Start with empty X, and repeatedly add x such
that $S_x$ contains the most uncovered elements
until U is covered
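
The greedy rule above can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk; the function name `greedy_set_cover` and the toy instance are assumptions made for the example.

```python
def greedy_set_cover(universe, sets):
    """Greedy set cover. `sets` maps a name x to the set S_x (hypothetical input format)."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Pick the set that contains the most still-uncovered elements.
        best = max(sets, key=lambda x: len(sets[x] & uncovered))
        if not sets[best] & uncovered:
            raise ValueError("the family does not cover the universe")
        chosen.append(best)
        uncovered -= sets[best]
    return chosen


# Toy instance (made up for illustration).
universe = range(1, 7)
sets = {"a": {1, 2, 3, 4}, "b": {1, 2, 5}, "c": {3, 4, 6}, "d": {5, 6}}
cover = greedy_set_cover(universe, sets)
print(cover)
```

On this instance greedy first takes "a" (four new elements) and then "d" (the two remaining ones).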

6
Approximation Guarantee

Classical result [Johnson 74, Lovasz 75,
Chvatal 79]: the greedy setcover algorithm has an
approximation factor of
$H(n) = 1 + 1/2 + 1/3 + \dots + 1/n < 1 + \ln n$
The approximation factor is tight
Cannot be approximated within a factor of
$(1-\varepsilon)\ln n$ unless $NP \subseteq DTIME(n^{\log\log n})$

7
General setting

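Potential function $\Phi(X) \ge 0$
$\Phi(\emptyset) = \Phi_{\max}$
$\Phi(X) = 0$ for all feasible solutions
$X \subseteq X' \Rightarrow \Phi(X) \ge \Phi(X')$
If $\Phi(X) > 0$, then there exists x s.t. $\Phi(X \cup \{x\}) < \Phi(X)$
$X \subseteq X' \Rightarrow \Delta(x, X) \ge \Delta(x, X')$ for every x, where
$\Delta(x, X) = \Phi(X) - \Phi(X \cup \{x\})$
Problem: find minimum size set X with $\Phi(X) = 0$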

8
Generic Greedy Algorithm

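$X \leftarrow \emptyset$
While $\Phi(X) > 0$
  Find x with maximum $\Delta(x, X)$
  $X \leftarrow X \cup \{x\}$

Theorem [Konwar et al. 05]: The generic greedy
algorithm has an approximation factor of $1 + \ln \Delta_{\max}$,
where $\Delta_{\max}$ is the largest possible gain $\Delta(x, X)$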
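
A minimal Python sketch of the generic greedy loop follows. It assumes the potential function is passed in as a callable `phi` on frozensets of candidates; the names `potential_greedy` and `uncovered_count` and the toy instance are hypothetical. Set cover is recovered by taking the potential to be the number of uncovered elements.

```python
def potential_greedy(candidates, phi):
    """Generic greedy: repeatedly add the candidate with the largest potential decrease."""
    chosen = frozenset()
    order = []
    while phi(chosen) > 0:
        current = phi(chosen)
        # Gain of x with respect to the current set: phi(X) - phi(X + x).
        gains = {x: current - phi(chosen | {x}) for x in candidates if x not in chosen}
        best = max(gains, key=gains.get, default=None)
        if best is None or gains[best] <= 0:
            raise ValueError("no candidate decreases the potential")
        chosen = chosen | {best}
        order.append(best)
    return order


# Set cover as an instance: potential = number of uncovered elements (toy data).
universe = {1, 2, 3, 4, 5, 6}
sets = {"a": {1, 2, 3, 4}, "b": {1, 2, 5}, "c": {3, 4, 6}, "d": {5, 6}}


def uncovered_count(chosen):
    covered = set().union(*(sets[x] for x in chosen))
    return len(universe - covered)


order = potential_greedy(list(sets), uncovered_count)
print(order)
```

Other covering problems plug in by swapping `uncovered_count` for a different potential, as long as it satisfies the conditions of the general setting.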

9
Proof Idea
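Let $x_1, x_2, \dots, x_g$ be the elements selected by
greedy, in the order in which they are chosen,
and $x^*_1, x^*_2, \dots, x^*_k$ be the elements of an optimum
solution.
Charging scheme: $x_i$ charges to $x^*_j$ a cost of
[formula not preserved in the transcript]
where $\Delta_{ij} = \Delta(x_i, \{x_1, \dots, x_{i-1}\} \cup \{x^*_1, \dots, x^*_j\})$
Fact 1: Each $x^*_j$ gets charged a total cost of at most
$1 + \ln \Delta_{\max}$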
10
Proof of claim 2
Fact 2: Each $x_i$ charges at least 1 unit of cost
11
Overview

Potential function greedy algorithm
Primer set selection for multiplex PCR
Motivation and problem formulation
Greedy applied to primer set selection
Experimental results
The String Barcoding Problem
Conclusions

12
DNA Structure

Four nucleotide types: A, C, T, G
Normally double stranded
A's paired with T's
C's paired with G's

13
The Polymerase Chain Reaction
14
Primer Pair Selection Problem

Given
Genomic sequence around amplification locus
Primer length k
Amplification upper bound L
Find
Forward and reverse primers of length k
that hybridize within a distance of L of each
other and optimize amplification efficiency
(melting temperature, secondary structure,
mis-priming, etc.)

15
Multiplex PCR

Multiplex PCR (MP-PCR)
Multiple DNA fragments amplified simultaneously
Each amplified fragment still defined by two
primers
A primer may participate in amplification of
multiple targets
Primer set selection
Typically done by time-consuming trial and error
An important objective is to minimize number of
primers
Reduced assay cost
Higher effective concentration of primers ⇒
higher amplification efficiency
Reduced unintended amplification

16
Primer Set Selection Problem

Given
Genomic sequences around n amplification loci
Primer length k
Amplification upper bound L
Find
Minimum size set S of primers of length k such
that, for each amplification locus, there are two
primers in S hybridizing with the forward and
reverse genomic sequences within a distance of L
of each other

17
Applications

Single Nucleotide Polymorphism (SNP) genotyping
Up to thousands of SNPs genotyped simultaneously
Selective PCR amplification required for improved
accuracy
Spotted microarray synthesis [Fernandes & Skiena 02]
Primers can be used multiple times
For each target, need a pair of primers
amplifying that target and only that target
(amplification uniqueness constraint)
Can still reduce the number of primers from 2n to $O(n^{1/2})$

18
Previous Work on Primer Selection

Well-studied problem [Pearson et al. 96,
Linhart & Shamir 02, Souvenir et al. 03, etc.]
Almost all problem formulations decouple
selection of forward and reverse primers
To enforce bound of L on amplification length,
select only primers that hybridize within L/2
bases of desired target
In worst case, this method can increase the
number of primers by a factor of O(n) compared to
the optimum
[Pearson et al. 96]: greedy set cover algorithm
gives $O(\ln n)$ approximation factor for the
decoupled formulation

19
Previous Work (contd.)

[Fernandes & Skiena 02] study primer set selection
with uniqueness constraints
Minimum Multi-Colored Subgraph Problem
Vertices correspond to candidate primers
Edge colored by color i between u and v iff
corresponding primers hybridize within a distance
of L of each other around i-th amplification
locus
Goal is to find minimum size set of vertices
inducing edges of all colors
Can capture amplification length constraints too

20
Integer Program Formulation

0/1 variable $x_u$ for every vertex u
0/1 variable $y_e$ for every edge e
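
The integer program itself is not preserved in the transcript. One natural way to write it with the two families of variables, offered only as an assumed reconstruction (here $E_i$ denotes the edges of color i, a notation introduced for this sketch), is:

```latex
\begin{align*}
\min\ & \sum_{u \in V} x_u \\
\text{s.t.}\ & y_e \le x_u,\quad y_e \le x_v && \text{for every edge } e = (u,v) \\
& \sum_{e \in E_i} y_e \ge 1 && \text{for every color } i = 1, \dots, n \\
& x_u,\, y_e \in \{0, 1\}
\end{align*}
```

The first constraint lets an edge count only when both endpoints are selected; the second asks for at least one such edge of every color.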

21
LP-Rounding Algorithm
(1) Solve linear programming relaxation
(2) Select node u with probability $x_u$
(3) Repeat step 2 $O(\ln n)$ times and return selected nodes

Theorem [Konwar et al. 04]: The LP-rounding
algorithm finds a feasible solution at most
$O(m^{1/2} \ln n)$ times larger than the optimum, where m
is the maximum color class size, and n is the
number of nodes
For primer selection, $m \le L^2 \Rightarrow$ approximation
factor is $O(L \ln n)$
Better approximation?
Unlikely for minimum multi-colored subgraph
problem

22
Selection w/o Uniqueness Constraints

Can be seen as a simultaneous set covering
problem
- The ground set is partitioned into n disjoint
sets Si (one for each target), each with 2L
elements
The goal is to select a minimum number of sets
(i.e., primers) that cover at least half of the
elements in each partition

[Figure: target SNP i with a region of length L marked on each side]
23
Greedy Algorithm

Potential function $\Phi$ = minimum number of
elements that must still be covered:
$\Phi = \sum_i \max\{0,\ L - \#\{\text{covered elements in } S_i\}\}$
Initially, $\Phi = nL$
For feasible solutions, $\Phi = 0$
$\Delta(\cdot) \le nL$ (much smaller in practice)
Theorem [Konwar et al. 05]: The number of primers
selected by the greedy algorithm is at most
$1 + \ln(nL)$ times larger than the optimum

24
Experimental Setting

Datasets extracted from NCBI databases, L = 1000
Dell PowerEdge 2.8GHz Xeon
Compared algorithms
G-FIX: greedy primer cover algorithm [Pearson et
al.]
MIPS-PT: iterative beam-search heuristic
[Souvenir et al.]
Restrict primers to L/2 bases around
amplification locus
G-VAR: naïve modification of G-FIX
First selected primer can be up to L bases away
Opposite sequence truncated after selecting first
primer
G-POT: potential function driven greedy algorithm

25
Experimental Results, NCBI tests
G-FIX: Pearson et al.; G-VAR: G-FIX with dynamic truncation;
MIPS-PT: Souvenir et al.; G-POT: potential-function greedy

Targets   k | G-FIX            | G-VAR            | MIPS-PT          | G-POT
            | Primers  CPU sec | Primers  CPU sec | Primers  CPU sec | Primers  CPU sec
     20   8 |       7     0.04 |       7     0.08 |       8       10 |       6     0.10
     20  10 |       9     0.03 |      10     0.08 |      13       15 |       9     0.08
     20  12 |      14     0.04 |      13     0.08 |      18       26 |      13     0.11
     50   8 |      13     0.13 |      15     0.30 |      21       48 |      10     0.32
     50  10 |      23     0.22 |      24     0.36 |      30      150 |      18     0.33
     50  12 |      31     0.14 |      32     0.30 |      41      246 |      29     0.28
    100   8 |      17     0.49 |      20     0.89 |      32      226 |      14     0.58
    100  10 |      37     0.37 |      37     0.72 |      50      844 |      31     0.75
    100  12 |      53     0.59 |      48     0.84 |      75     2601 |      42     0.61
26
[Figure: number of primers, as percentage of 2n, plotted against n (l = 8)]
27
[Figure: number of primers, as percentage of 2n, plotted against n (l = 10)]
28
[Figure: number of primers, as percentage of 2n, plotted against n (l = 12)]
29
[Figure: CPU seconds plotted against n (l = 10)]
30
Overview

Potential function greedy algorithm
Primer Set Selection for Multiplex PCR
The String Barcoding Problem
- Problem Formulation
- Integer programming and greedy algorithms
- Experimental results
Conclusions

31
Motivation

Rapid pathogen detection
Given
Pathogen with unknown identity
Database of known pathogens
Problem
Identify unknown pathogen quickly
Ideal solution determine DNA sequence of unknown
pathogen

32
Real World

Not possible to quickly sequence an unknown
pathogen
Only have sequence for pathogens in database
Can quickly test for presence of short substrings
in unknown virus (substring tests) using
hybridization
String barcoding [Borneman et al. 01,
Rash & Gusfield 02]
Use substring tests that uniquely identify each
pathogen in the database

33
String Barcoding Problem
Given: genomic sequences $g_1, \dots, g_n$
Find: minimum number of distinguisher strings $t_1, \dots, t_k$
Such that: for every $g_i \ne g_j$, there exists a
string $t_l$ which is a substring of $g_i$ or $g_j$, but
not of both

At least $\log_2 n$ distinguishers needed
Fingerprints ⇒ n distinguishers

34
Example

Given sequences
1. cagtgc
2. cagttc
3. catgga
Feasible set of distinguishers: {tg, atgga}

            tg   atgga
cagtgc      1    0
cagttc      0    0
catgga      1    1
Row vectors: unique barcodes for each pathogen
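
The barcode table for this example can be recomputed with a short Python sketch. The helper names `barcodes` and `is_feasible` are made up for illustration; a distinguisher set is feasible exactly when all row vectors are distinct.

```python
def barcodes(sequences, distinguishers):
    """One 0/1 row per sequence: entry is 1 iff the distinguisher is a substring."""
    return [tuple(int(t in g) for t in distinguishers) for g in sequences]


def is_feasible(sequences, distinguishers):
    """Feasible iff every sequence receives a different barcode."""
    codes = barcodes(sequences, distinguishers)
    return len(set(codes)) == len(codes)


seqs = ["cagtgc", "cagttc", "catgga"]
codes = barcodes(seqs, ["tg", "atgga"])
for g, code in zip(seqs, codes):
    print(g, code)
```

With "tg" alone the first and third sequences would share the barcode (1), so the second distinguisher is needed.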
35
Computational Complexity

[Berman et al. 04]: Cannot be approximated within
a factor of $(1-\varepsilon)\ln n$ unless $NP \subseteq DTIME(n^{\log\log n})$

36
Setcover Greedy Algorithm

Distinguisher selection as setcover problem
Elements to be covered are the pairs of sequences
Each candidate distinguisher defines a set of
pairs that it separates
Another view covering all edges of a complete
graph with n vertices by the minimum number of
given cuts
For n sequences, largest set can have $O(n^2)$
elements
⇒ The setcover greedy guarantees a $\ln(n^2) = 2 \ln n$
approximation
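
A small Python sketch of this setcover view: the elements are pairs of sequence indices, and each candidate covers the pairs it separates. The function names and the candidate list are assumptions for the example, not taken from the talk.

```python
from itertools import combinations


def separated_pairs(t, sequences):
    """Pairs (i, j) such that t is a substring of exactly one of the two sequences."""
    hits = [t in g for g in sequences]
    pairs = combinations(range(len(sequences)), 2)
    return {(i, j) for i, j in pairs if hits[i] != hits[j]}


def greedy_distinguishers(sequences, candidates):
    """Setcover greedy over sequence pairs; candidates is a list of strings."""
    uncovered = set(combinations(range(len(sequences)), 2))
    cover_of = {t: separated_pairs(t, sequences) for t in candidates}
    chosen = []
    while uncovered:
        # Candidate separating the most not-yet-separated pairs.
        best = max(candidates, key=lambda t: len(cover_of[t] & uncovered))
        if not cover_of[best] & uncovered:
            raise ValueError("candidates cannot separate all pairs")
        chosen.append(best)
        uncovered -= cover_of[best]
    return chosen


seqs = ["cagtgc", "cagttc", "catgga"]
chosen = greedy_distinguishers(seqs, ["tg", "atgga", "cag", "gc"])
print(chosen)
```

Ties are broken by list order here, which is an arbitrary choice; on the three example sequences every candidate separates two pairs in the first round.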

37
Integer Program Formulation

0/1 variable for each candidate distinguisher
1: candidate is selected
0: candidate is not selected
For each pair of sequences, at least one
candidate separating them is selected
Objective Function
Minimize number of selected candidates
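
Written out, the formulation described above reads as follows. The symbols $C$ (set of candidate distinguishers) and $v_c$ (0/1 variable of candidate c) are notation assumed for this sketch.

```latex
\begin{align*}
\min\ & \sum_{c \in C} v_c \\
\text{s.t.}\ & \sum_{\substack{c \in C:\ c \text{ is a substring of} \\ \text{exactly one of } g_i,\, g_j}} v_c \ \ge\ 1 && \text{for all } 1 \le i < j \le n \\
& v_c \in \{0, 1\} && \text{for all } c \in C
\end{align*}
```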

38
Practical Issues

Quadratic number of constraints, huge number of variables
Genome sizes range from thousands of bases for
phage and viruses to millions for bacteria to
billions for higher organisms
Many variables can be removed
Candidates that appear in all sequences
Sufficient to keep a single candidate among those
that appear in the same set of sequences
How to efficiently remove useless variables?
Rash & Gusfield use suffix trees
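
The two pruning rules can be illustrated with a brute-force Python sketch. It enumerates substrings directly instead of using a suffix tree, so it is only meant for tiny inputs; the function name `generate_candidates` and the length cap `max_len` are assumptions for the example.

```python
def generate_candidates(sequences, max_len):
    """Map each occurrence pattern to one representative substring.

    Brute force (no suffix tree): drops substrings that occur in all sequences
    and keeps a single substring per set of sequences it occurs in.
    """
    representative = {}
    for g in sequences:
        for length in range(1, max_len + 1):
            for start in range(len(g) - length + 1):
                t = g[start:start + length]
                pattern = tuple(t in s for s in sequences)
                if all(pattern):
                    continue  # occurs everywhere: separates no pair
                representative.setdefault(pattern, t)
    return representative


seqs = ["cagtgc", "cagttc", "catgga"]
reps = generate_candidates(seqs, max_len=2)
for pattern, t in reps.items():
    print(t, pattern)
```

For n sequences at most 2^n - 2 patterns can survive, no matter how many substrings there are.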

39
Suffix Tree Example

Strings
1. cagtgc
2. cagttc
3. catgga

40
Integer Program
Minimize
 V18 + V22 + V11 + V17 + V8       (objective function)
Such that
 V18 + V17 + V8 >= 1              (constraint to cover pair 1,2)
 V22 + V11 + V8 >= 1              (constraint to cover pair 1,3)
 V18 + V22 + V11 + V17 >= 1       (constraint to cover pair 2,3)
Binaries                          (all variables are 0/1)
 V18 V22 V11 V17 V8
End

            tg (V18)   atgga (V22)
cagtgc      1          0
cagttc      0          0
catgga      1          1
41
Limitations of Integer Program Method

Works only for small instances
50-150 sequences
Average length 1000 characters
Over 4 hours needed to come within 20% of
optimum!
Scalable Heuristics?

42
Distinguisher Induced Partition

Key idea [Berman et al. 04]: keep track of the
partition defined by distinguishers selected so
far

43
Information Content Heuristic

$\Phi$ = partition entropy = $\log_2$(number of permutations
compatible with current partition)
Initial partition entropy = $\log_2(n!) \approx n \log_2 n$
For feasible distinguisher sets, partition
entropy = 0
$\Delta(\cdot) \le n$:
$\log_2(n!) - \log_2(k!\,(n-k)!) < \log_2(2^n) = n$
Information content heuristic (ICH): greedy
driven by partition entropy
Theorem [Berman et al. 04]: ICH has an
approximation factor of $1 + \ln n$
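
A minimal Python sketch of the partition entropy, assuming the current partition is given by the barcodes assigned so far (sequences with equal barcodes are in the same class). The number of compatible permutations is taken to be the product of the factorials of the class sizes, which matches the initial value log2(n!) and the value 0 for feasible sets. The function name is hypothetical.

```python
from collections import Counter
from math import factorial, log2


def partition_entropy(barcodes):
    """log2 of the number of permutations that keep every sequence in its class."""
    class_sizes = Counter(barcodes).values()
    return sum(log2(factorial(size)) for size in class_sizes)


# Three-sequence example: no distinguisher, then "tg", then "tg" and "atgga".
start = partition_entropy([(), (), ()])
after_tg = partition_entropy([(1,), (0,), (1,)])
final = partition_entropy([(1, 0), (0, 0), (1, 1)])
print(start, after_tg, final)
```

The drop from `start` to `after_tg` is the gain that the information content heuristic would credit to "tg" as a first pick.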

44
ICH Limitations

Real genomic data has degenerate nucleotides
Ambiguous sequencing
Single nucleotide polymorphisms
For sequences with degenerate nucleotides there
are three possibilities for distinguisher
hybridization
Sure hybridization
Sure mismatch
Uncertain hybridization
⇒ No partition to work with!

45
Practical Implementation

ICH and setcover greedy give nearly identical
results on data w/o degenerate bases
Setcover greedy can also be extended to handle
- degenerate bases in the sequences
- redundancy requirements (each pair of sequences
must be separated r times)
Two main steps for both algorithms
Candidate generation
Greedy selection

46
Candidate Generation

Can be done using suffix trees
We use a simpler yet efficient incremental
approach
Candidates that match all or only one sequence
are removed from consideration
Solution quality is similar even when candidates
are generated from a single sequence
Equivalent to considering only distinguisher sets
that assign a barcode of (1, 1, ..., 1) to the source
sequence

47
Candidate Selection

Evaluate $\Delta(\cdot)$ for all candidates and choose best
Speed-up techniques
Efficient gain computation using partition
data-structure
Lazy gain update: if old $\Delta(\cdot)$ is lower than best
so far, do not recompute
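
The lazy update works because gains can only shrink as more candidates are selected, so an old gain is an upper bound on the current one. Below is a sketch of one common way to exploit this, using a priority queue of possibly stale gains; it is a variant of the idea, not necessarily the implementation used in the talk, and the names `lazy_greedy` and `uncovered_count` are hypothetical.

```python
import heapq


def lazy_greedy(candidates, phi):
    """Greedy with lazy gain updates; assumes gains never increase as the set grows."""
    chosen = frozenset()
    order = []
    current = phi(chosen)
    # Max-heap via negated gains; the index i breaks ties and keeps tuples comparable.
    heap = [(-(current - phi(chosen | {x})), i, x) for i, x in enumerate(candidates)]
    heapq.heapify(heap)
    while current > 0 and heap:
        _, i, x = heapq.heappop(heap)
        gain = current - phi(chosen | {x})  # recompute only for the popped candidate
        if gain <= 0:
            continue  # can never become useful again
        if heap and gain < -heap[0][0]:
            # Another candidate's old gain (an upper bound) is larger: re-queue x.
            heapq.heappush(heap, (-gain, i, x))
            continue
        chosen = chosen | {x}
        order.append(x)
        current -= gain
    if current > 0:
        raise ValueError("candidates cannot reach potential 0")
    return order


# Toy set cover instance; potential = number of uncovered elements.
universe = {1, 2, 3, 4, 5, 6}
sets = {"a": {1, 2, 3, 4}, "b": {1, 2, 5}, "c": {3, 4, 6}, "d": {5, 6}}


def uncovered_count(chosen):
    covered = set().union(*(sets[x] for x in chosen))
    return len(universe - covered)


order = lazy_greedy(list(sets), uncovered_count)
print(order)
```

Candidates whose old gain is already below the best fresh gain stay in the queue untouched, which is where the savings come from.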

48
Experimental Results

n       mat       mat+lazy   part     part+lazy   # dist
100     35.4      22.1       2.2      1.4         8.0
200     221.6     125.2      8.8      4.6         10.0
500     2168.8    1144.4     53.0     18.7        12.3
1000    5600.4    2756.4     113.6    31.7        14.1
Averages over 10 testcases, sequence length
10,000
Barcodes for 100 sequences of length 1,000,000
computed in less than 10 minutes

49
Overview

Potential function greedy algorithm
Primer Set Selection for Multiplex PCR
The String Barcoding Problem
Conclusions

50
Conclusions

General potential function framework for
designing and analyzing greedy covering
algorithms
Improved approximation guarantees and practical
performance for two important optimization
problems in computational biology: primer set
selection for multiplex PCR, and distinguisher
selection for string barcoding

51
Ongoing Work