Title: Optimal kmer superstrings for protein identification and DNA assay design'
1Optimal k-mer superstrings for protein
identification and DNA assay design.
- Nathan Edwards
- Center for Bioinformatics and Computational
Biology - University of Maryland, College Park
2k-mer (Sub-)Problems
- Enumerate
- For all (distinct) k-mers, do
- Existence
- ...with respect to exact ( inexact) count x
- Uniqueness
- ...with respect to exact inexact match
- Near-neighbors
- ...with respect to inexact match
- Representation
- Represent (distinct) k-mers for other tools
- Fast annotation of k-mer counts on original
sequences
3Applications of k-mer sets
- Peptide Identification
- Represent all amino-acid 30-mers
- ...that occur at least twice in human dbEST
- PCR Primer Design
- Test DNA 20-mer primers for uniqueness
- What does it mean to be unique?
- DNA sequencing error / repeat detection
- Eliminate mers that are too rare or too frequent
- Pathogen signatures
- Near-neighbors imply potential false-positives
4k-mer Superstring Problem
- Given
- A set of sequences S S1, ..., Sn
- Sequence database
- Word size k
- Find
- A new set of sequences T T1, ..., Tm
- Such that
- Total length of T is minimized, and
- T is complete and correct w.r.t. k-mers of S
5k-mer Superstring Problem
- Completeness
- All of the k-mers of S are represented
- Correctness
- No additional k-mers are present
- Minimize the total representation length
- Correlates with running time
6Shortest (common) superstring problem
- General strings (arbitrary length)
- Single output string
- Completeness for input sequences only
- Classical NP-hard problem
- Garey and Johnson
- Approximate within 2.5OPT
- Max-SNP hard
- One of the first algorithmic approaches to genome
assembly
7de Bruijn Sequences
- de Bruijn sequences represent all words of length
k from some alphabet A. - A 0,1, k 3 s 0001110100
- A 0,1, k 4 s 0000111101011001000
8de Bruijn Graph A 0,1, k 4
1
1
0
1
0
1
1
0
1
0
1
0
1
0
0
0
9de Bruijn Sequences Graphs
- de Bruijn graphs (k,A)
- Edges represent length k words from A
- Each node has
- in degree A
- out degree A
- Eulerian tour constructs de Bruijn sequence.
10Sequencing-by-Hybridization-graph
ACDEFGI, ACDEFACG, DEFGEFGI
11Compressed SBH-graph
ACDEFGI, ACDEFACG, DEFGEFGI
12Sequence Databases CSBH-graphs
- Original sequences correspond to paths
ACDEFGI, ACDEFACG, DEFGEFGI
13C3 Enumeration
- Complete
- All k-mers are present
- Correct
- No other k-mers are present
- Compact
- No k-mer is present more than once
14Correct, Complete, Compact (C3) Enumeration
- Set of paths that use each edge exactly once
ACDEFGEFGI, DEFACG
15Correct, Complete (C2) Enumeration
- Set of paths that use each edge at least once
ACDEFGEFGI, DEFACG
16Patching the CSBH-graph
- Use artificial edges to fix unbalanced nodes
17Patching the CSBH-graph
- Use matching-style formulations to choose
artificial edges - Optimal C2/C3 enumeration in polynomial time.
- Chinese Postman Problem
- Edmonds and Johnson, 73
- l-tuple DNA sequencing
- Pevzner, 89
- Shortest (Common) Superstring
- MAX-SNP-hard, 2.5 approx algorithm
18Related work
- Chinese Postman Problem
- Undirected graph, weighted edges
- Shortest path that uses all the edges
- Solvable in polynomial time
- Construct minimum weighted matching between nodes
of odd-degree - Add matching to graph and find Eulerian path
- Minimize weight of extra edges used
19C2 Enumeration
- Chinese postman problem, except
- Directed graph
- Add edges from nodes with surplus in-degree to
nodes with surplus out-degree - Fixed cost teleportation option
- Can always start a new sequence
- Find optimal set of additional edges
- Transportation problem / min cost flow instance
20C3 Enumeration
in-out
in-out
Cost k
21Reusing Edges
- ACDEHAC, ACDFHAC, ACDGHACD
22Reusing Edges
- C3 ACDEHACDFHAC, ACDGHACD
23Reusing Edges
24C2 Enumeration
in-out
in-out
4
10
Shortcut paths
7
25C3 Enumeration
in-out
in-out
Cost 0
Cost 0
Cost k
26Sample Preparation for Peptide Identification
27Single Stage MS
MS
m/z
28Tandem Mass Spectrometry(MS/MS)
m/z
Precursor selection
m/z
29Tandem Mass Spectrometry(MS/MS)
Precursor selection collision induced
dissociation (CID)
m/z
MS/MS
m/z
30Peptide Identification
- For each (likely) peptide sequence
- 1. Compute fragment masses
- 2. Compare with spectrum
- 3. Retain those that match well
- Peptide sequences from protein sequence databases
- Swiss-Prot, IPI, NCBIs nr, ...
- Automated, high-throughput peptide identification
in complex mixtures
31Novel Splice Isoform
- Human Jurkat leukemia cell-line
- Lipid-raft extraction protocol, targeting T cells
- von Haller, et al. MCP 2003.
- LIME1 gene
- LCK interacting transmembrane adaptor 1
- LCK gene
- Leukocyte-specific protein tyrosine kinase
- Proto-oncogene
- Chromosomal aberration involving LCK in
leukemias. - Multiple significant peptide identifications
32Novel Splice Isoform
33Novel Splice Isoform
34Novel Mutation
- HUPO Plasma Proteome Project
- Pooled samples from 10 male 10 female healthy
Chinese subjects - Plasma/EDTA sample protocol
- Li, et al. Proteomics 2005. (Lab 29)
- TTR gene
- Transthyretin (pre-albumin)
- Defects in TTR are a cause of amyloidosis.
- Familial amyloidotic polyneuropathy
- late-onset, dominant inheritance
35Novel Mutation
Ala2?Pro associated with familial amyloid
polyneuropathy
36Novel Mutation
37Compressed EST Peptide Sequence Database
- For all ESTs mapped to a UniGene gene
- Six-frame translation
- Eliminate ORFs lt 30 amino-acids
- Eliminate amino-acid 30-mers observed once
- Compress to C2 FASTA database
- Complete, Correct for amino-acid 30-mers
- Inclusive gene-centric peptide sequence database
- Size lt 3 of naïve enumeration, 20774 FASTA
entries - Running time 1 of naïve enumeration search
- E-values 2 of naïve enumeration search results
38Compressed EST Peptide Sequence Database
- For all ESTs mapped to a UniGene gene
- Six-frame translation
- Eliminate ORFs lt 30 amino-acids
- Eliminate amino-acid 30-mers observed once
- Compress to C2 FASTA database
- Complete, Correct for amino-acid 30-mers
- Gene-centric peptide sequence database
- Size lt 3 of naïve enumeration, 20774 FASTA
entries - Running time 1 of naïve enumeration search
- E-values 2 of naïve enumeration search results
39Sequence Databases CSBH-graphs
- All k-mers represented by an edge have the same
count
1
2
2
1
2
40CSBH-graph subgraphs
- Quickly determine those that occur twice
2
2
1
2
41k-mer (Sub-)Problems
- Enumerate
- For all (distinct) k-mers, do
- Existence
- ...with exact ( inexact) count x
- Uniqueness
- ...exact inexact match
- Near-neighbors
- ...inexact match
- Representation
- Represent (distinct) k-mers for other tools
- Fast annotation of k-mer counts on original
sequences
42Large scale instances!
- CSBH-graph instances
- Partition set of all k-mers, determine
non-trivial nodes - Days on condor grid (250 CPUs) to construct
- 100,000,000 nodes and edges (sparse dense)
- Min-cost flow instances
- 500,000 nodes and edges
- Algorithms must be linear in problem size
- Out-of-core Eulerian path algorithm?
- Currently testing out-of-core connected-components
43Grid computing
- Heterogeneous machines
- Varying disk/memory/MHz/cores capabilities
- Centralized scheduler
- Jobs started asynchronously
- Other jobs may preempt current job
- Input files may need to be staged
- 250 simultaneous requests for a 3Gb file?
- How to guarantee integrity of input files?
- Problem decomposition may be non-trivial
- Jobs sizes need to fit the least capable machine
- Sometimes need to game the scheduler
- Need to ensure the integrity of job output
44Uniqueness Oracles
- Oracle for uniqueness of 20-mers in the Human
genome (size 3Gb) - Count occurrences in the genome 0,1,2
- Construct 20-mer superstring for 20-mers with
count 1 - Construct 20-mer superstring for 20-mers with
count gt 1 - Easy(-ish) for exact sequence match O(n)
- Fast automata, hash tables, suffix trees.
45Polymerase Chain Reaction
46Polymerase Chain Reaction
47Inexact sequence match
- Inexact sequence matching O(nmk)
- Errors/Mismatches (k) 1,2,3
- distinct 20-mers (m) O(n)
- Achieve expected linear time using a hybrid
approach (blastn) - Exact search for short chunks of primers
- Expensive alignment only where chunks match
- Large chunks ) Fast, but miss occurrences
- Small chunks ) Slow, find all matches
48Inexact sequence match
- Baeza-Yates Perleberg
- Correct and O(n) for small k
- At least 1 chunk is observed with no error.
- Small k ? Large chunks ? Fast and correct
- Form of locality sensitive hashing
g
? ?
q
49Locality Sensitive Hashing
- For each primer
- store a (set of) hash(es) in hash-table
- At each position in the genome
- look-up a (set of) hash(es) in hash-table
- if any hash is found, do more expensive check
- Need to weigh
- sensitivity (false negatives) vs
- specificity (false positives)
- Our application requires speed and no false
negatives!
50Random Projection
- Choose T templates of l random care positions
g
q
51Random Projection
- Choose T templates of l random care positions
g
t1
t1 0 1 1 0 1 0 1 0 0 0 1 0 1 0 1 0
0
52Random Projection
- Choose T templates of l random positions
g
t1
t2
t1 0 1 1 0 1 0 1 0 0 0 1 0 1 0 1 0
0 t2 1 0 1 0 0 1 0 1 0 0 0 0 0 1 0
0 1
53Random Projection
- Choose T templates of l random positions
g
t1
t2
t1 0 1 1 0 1 0 1 0 0 0 1 0 1 0 1 0
0 t2 1 0 1 0 0 1 0 1 0 0 0 0 0 1 0
0 1
54Random Projection
- Choose T templates of l random positions
g
t1
t2
t1 0 1 1 0 1 0 1 0 0 0 1 0 1 0 1 0
0 t2 1 0 1 0 0 1 0 1 0 0 0 0 0 1 0
0 1
55Gapped seed-set design problem
- Given
- mer-size m ( 20 )
- errors k ( 1,2,3)
- cares l ( 10,12,14 )
- Find the smallest set of templates with no false
negatives. - Minimize running time.
56Gapped seed set design formulation (for k 2)
- Cover the edges of Km with copies of Km-l
- How many triangles to cover K6?(m 6, k 2, l
3) - Some instances of (m,2,m-3) cover each edge
exactly once - Steiner triple systems
57How many triangles cover K6?
- 15 edges total
- Is 5 triangles possible?
58How many triangles cover K6?
- 15 edges total
- Is 5 triangles possible? NO!
59How many triangles cover K6?
- 15 edges total
- Is 5 triangles possible? NO!
- Each node requires 3 triangles
- Triangles must account for at least 18 edges
60How many triangles cover K6?
- 15 edges total
- Is 5 triangles possible? NO!
- Each node requires 3 triangles
- Triangles must account for at least 18 edges
61Gapped seed set design formulation 2
- Set cover instance
- Ground set all possible placements of the k
errors (alignments) - Covering sets all possible placements of the l
care positions - For (m20,k2,l10),
- 190 elements, 184,756 sets!
- Greedy approximation algorithm works
62Gapped seed set design formulation 3
Templates
Positions (m)
Remove any kposition nodes, at least 1
templatemust have degree l.
l
63Gapped seed set design formulation 3
- Polynomial size in terms of number of templates
- Select T in advance and test whether sufficient.
- Greedily add 1,2,3,... templates.
- Apply iteratively to achieve feasible solution
64Solution for (20,2,10)
- .................... Positions
- t1
- t2
- t3
- t4
- t5
- t6
- Need gt 4 templates, 6 is optimal
65Remember the application!
- We are checking some templates twice!
- We compute hash(es) at each position in the
genome - Any template that is a shift of another will be
computed at some nearby genomic position!
66Solution for (20,2,10)
- .................... Positions
- t1
- t2
- t3
- t4
- t5
- t6
- Need at most 3 templates...can we do better?
67Solution for (20,2,10) w/ shift
- .................... Positions
- t1
- t2
- Optimal is 2 templates...
68Gapped seed set designSolution strategies
- Randomized algorithms
- Greedy algorithm
- Directly to set cover instance
- Indirectly to bipartite instance
- Integer programming
- On set cover and bipartite instances
- Solution of greedy algorithm subproblem
- ...in parallel, using COIN-OR SYMPHONY
- Branch-and-bound enumeration
- Solution of greedy algorithm subproblem
- ...in parallel, using COIN-OR ALPS library
69What about edit-distance?
- Formulations can be generalized
- Similar solution strategies can be applied
- (All) symmetry lost!
- This may actually be helpful
- Much harder to solve
- Is greedy still good?
- Solutions typically require more templates
70Uniqueness Oracles
- Integrated with CSBH-graph construction algorithm
- Ensure edge-count property is preserved
- Sequence database of unique / non-unique 20-mers
for small genomes - D. melanogaster, up to edit-distance 2
- Currently working to scale to human...
71Other Projects / Interests
- HMMs for Peptide Spectrum Matching
- with UMd, CS
- Rapid Microorganism Identification Database
- www.RMIDb.org
- Pathogen detection using Spectral Matching
- with USDA
- Locality sensitive hashing
- spectra, peptide sequence
- Statistical techniques
- statistical significance
- importance sampling
- CSBH-graph applications
- genome assembly
- Grid computing
- Web-applications
- Relational databases
72Future Research Directions
- Extend k-mer superstring algorithms
- Range of word sizes, variable length words
- Other sequence properties (Tryptic peptides, Tm)
- Identification of protein isoforms
- Optimize proteomics workflow for isoform
detection - Identify splice variants in cancer cell-lines
(MCF-7) and clinical brain tumor samples - Aggressive peptide sequence enumeration
- dbPep for genomic annotation
- Open, flexible informatics infrastructure for
peptide identification
73Future Research Directions
- Proteomics for Microorganism Identification
- Specificity of tandem mass spectra
- Revamp RMIDb prototype
- Incorporate spectral matching
- Primer design
- Uniqueness oracle for inexact match in human
- Integration with Primer3
- Tiling, multiplexing, pooling, tag arrays
74Acknowledgements
- Chau-Wen Tseng, Xue Wu
- UMCP Computer Science
- Catherine Fenselau, Steve Swatkoski
- UMCP Biochemistry
- Calibrant Biosystems
- PeptideAtlas, HUPO PPP, X!Tandem
- Funding National Cancer Institute