
1
Optimal k-mer superstrings for protein
identification and DNA assay design.

Nathan Edwards
Center for Bioinformatics and Computational
Biology
University of Maryland, College Park

2
k-mer (Sub-)Problems

Enumerate
For all (distinct) k-mers, do
Existence
...with respect to exact (and inexact) count x
Uniqueness
...with respect to exact and inexact match
Near-neighbors
...with respect to inexact match
Representation
Represent (distinct) k-mers for other tools
Fast annotation of k-mer counts on original
sequences

3
Applications of k-mer sets

Peptide Identification
Represent all amino-acid 30-mers
...that occur at least twice in human dbEST
PCR Primer Design
Test DNA 20-mer primers for uniqueness
What does it mean to be unique?
DNA sequencing error / repeat detection
Eliminate mers that are too rare or too frequent
Pathogen signatures
Near-neighbors imply potential false-positives

4
k-mer Superstring Problem

Given
A set of sequences S = {S1, ..., Sn}
Sequence database
Word size k
Find
A new set of sequences T = {T1, ..., Tm}
Such that
Total length of T is minimized, and
T is complete and correct w.r.t. k-mers of S

5
k-mer Superstring Problem

Completeness
All of the k-mers of S are represented
Correctness
No additional k-mers are present
Minimize the total representation length
Correlates with running time

6
Shortest (common) superstring problem

General strings (arbitrary length)
Single output string
Completeness for input sequences only
Classical NP-hard problem
Garey and Johnson
Approximable within 2.5 × OPT
Max-SNP hard
One of the first algorithmic approaches to genome
assembly

7
de Bruijn Sequences

de Bruijn sequences represent all words of length
k from some alphabet A.
A = {0,1}, k = 3: s = 0001110100
A = {0,1}, k = 4: s = 0000111101011001000

8
de Bruijn Graph: A = {0,1}, k = 4
[Figure: de Bruijn graph; each edge is labeled 0 or 1]
9
de Bruijn Sequences and Graphs

de Bruijn graphs (k,A)
Edges represent length k words from A
Each node has
in-degree |A|
out-degree |A|
Eulerian tour constructs de Bruijn sequence.
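
The Eulerian-tour construction can be sketched as follows. This is a minimal illustration, not the presenter's code: the function name `de_bruijn_linear` is hypothetical, nodes are taken to be the (k-1)-mers, and the tour is found with Hierholzer's algorithm. It returns a linear (non-cyclic) string, so its length is |A|^k + k - 1, matching the 10- and 19-character examples on the earlier slide.

```python
from itertools import product

def de_bruijn_linear(alphabet, k):
    """Return a linear string containing every length-k word exactly once."""
    if k == 1:
        return "".join(alphabet)
    # Nodes are (k-1)-mers; each node has one unused outgoing edge per symbol.
    unused = {"".join(p): list(alphabet) for p in product(alphabet, repeat=k - 1)}
    start = alphabet[0] * (k - 1)
    stack, circuit = [start], []
    # Hierholzer's algorithm: walk until stuck, then back up and record nodes.
    while stack:
        node = stack[-1]
        if unused[node]:
            symbol = unused[node].pop()
            stack.append((node + symbol)[1:])  # follow edge node -> next (k-1)-mer
        else:
            circuit.append(stack.pop())
    circuit.reverse()  # node sequence of an Eulerian circuit
    # Spell the tour: the first node, then the last symbol of each later node.
    return circuit[0] + "".join(node[-1] for node in circuit[1:])

s = de_bruijn_linear("01", 4)
print(len(s), s)
```

The particular sequence printed depends on the order in which edges are tried, so it need not equal the string shown on the slide; any Eulerian tour gives a valid de Bruijn sequence.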

10
Sequencing-by-Hybridization-graph
ACDEFGI, ACDEFACG, DEFGEFGI
11
Compressed SBH-graph
ACDEFGI, ACDEFACG, DEFGEFGI
12
Sequence Databases and CSBH-graphs

Original sequences correspond to paths

ACDEFGI, ACDEFACG, DEFGEFGI
13
C3 Enumeration

Complete
All k-mers are present
Correct
No other k-mers are present
Compact
No k-mer is present more than once

14
Correct, Complete, Compact (C3) Enumeration

Set of paths that use each edge exactly once

ACDEFGEFGI, DEFACG
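
The three C3 properties can be checked mechanically. The sketch below is an illustration only: the helper names `kmers` and `c3_report` are hypothetical, and the word size k = 4 is an assumption (the slides do not state k for this example, but k = 4 is consistent with the strings shown).

```python
def kmers(sequences, k):
    """List every k-mer occurrence (with repeats) in a set of sequences."""
    return [s[i:i + k] for s in sequences for i in range(len(s) - k + 1)]

def c3_report(source, target, k):
    """Check whether target is complete, correct, and compact w.r.t. source."""
    wanted = set(kmers(source, k))
    found = kmers(target, k)
    return {
        "complete": wanted <= set(found),          # every source k-mer is present
        "correct": set(found) <= wanted,           # no additional k-mers appear
        "compact": len(found) == len(set(found)),  # no k-mer appears twice
    }

S = ["ACDEFGI", "ACDEFACG", "DEFGEFGI"]   # input sequences from the slides
T = ["ACDEFGEFGI", "DEFACG"]              # enumeration shown on this slide
report = c3_report(S, T, k=4)
print(report, sum(map(len, S)), sum(map(len, T)))
```

Under this assumption the input holds 23 characters and the enumeration 16, while representing the same ten distinct 4-mers.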
15
Correct, Complete (C2) Enumeration

Set of paths that use each edge at least once

ACDEFGEFGI, DEFACG
16
Patching the CSBH-graph

Use artificial edges to fix unbalanced nodes
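
A small sketch of what "unbalanced" means here, assuming nodes are (k-1)-mers and each distinct k-mer is one edge. The function name `c3_path_count` is hypothetical, k = 4 is an assumption for the running example, and the graph is assumed to be weakly connected. The code only counts how many paths an exactly-once enumeration needs; it does not choose the artificial edges themselves.

```python
from collections import Counter

def c3_path_count(sequences, k):
    """Return (#distinct k-mers, #paths, total length) for an exactly-once
    enumeration, assuming the k-mer graph is weakly connected."""
    edges = {s[i:i + k] for s in sequences for i in range(len(s) - k + 1)}
    surplus = Counter()  # out-degree minus in-degree for each (k-1)-mer node
    for kmer in edges:
        surplus[kmer[:-1]] += 1  # edge leaves its prefix node
        surplus[kmer[1:]] -= 1   # edge enters its suffix node
    # Each unit of surplus out-degree forces one path to start at that node.
    paths = max(1, sum(v for v in surplus.values() if v > 0))
    # A path of e edges is spelled with e + (k - 1) characters.
    return len(edges), paths, len(edges) + paths * (k - 1)

print(c3_path_count(["ACDEFGI", "ACDEFACG", "DEFGEFGI"], k=4))
```

For the running example this reports ten distinct 4-mers and two paths, for a total of 16 characters, which agrees with the two strings ACDEFGEFGI and DEFACG.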

17
Patching the CSBH-graph

Use matching-style formulations to choose
artificial edges
Optimal C2/C3 enumeration in polynomial time.
Chinese Postman Problem
Edmonds and Johnson, 73
l-tuple DNA sequencing
Pevzner, 89
Shortest (Common) Superstring
MAX-SNP-hard, 2.5 approx algorithm

18
Related work

Chinese Postman Problem
Undirected graph, weighted edges
Shortest path that uses all the edges
Solvable in polynomial time
Construct minimum weighted matching between nodes
of odd-degree
Add matching to graph and find Eulerian path
Minimize weight of extra edges used

19
C2 Enumeration

Chinese postman problem, except
Directed graph
Add edges from nodes with surplus in-degree to
nodes with surplus out-degree
Fixed cost teleportation option
Can always start a new sequence
Find optimal set of additional edges
Transportation problem / min cost flow instance

20
C3 Enumeration
[Figure: nodes labeled by in-out degree imbalance, joined by edges of cost k]
21
Reusing Edges

ACDEHAC, ACDFHAC, ACDGHACD

22
Reusing Edges

C3: ACDEHACDFHAC, ACDGHACD

23
Reusing Edges

C2: ACDEHACDFHACDGHAC

24
C2 Enumeration
[Figure: nodes labeled by in-out degree imbalance; shortcut paths with costs 4, 10, and 7]
25
C3 Enumeration
[Figure: nodes labeled by in-out degree imbalance; edges with cost 0, cost 0, and cost k]
26
Sample Preparation for Peptide Identification
27
Single Stage MS
MS
m/z
28
Tandem Mass Spectrometry (MS/MS)
m/z
Precursor selection
m/z
29
Tandem Mass Spectrometry (MS/MS)
Precursor selection and collision-induced
dissociation (CID)
m/z
MS/MS
m/z
30
Peptide Identification

For each (likely) peptide sequence
1. Compute fragment masses
2. Compare with spectrum
3. Retain those that match well
Peptide sequences from protein sequence databases
Swiss-Prot, IPI, NCBI's nr, ...
Automated, high-throughput peptide identification
in complex mixtures
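
The three-step loop above can be sketched as follows. This is a simplified illustration, not the scoring used by any particular search engine: it computes singly charged b- and y-ion m/z values from standard monoisotopic residue masses and counts shared peaks. The function names, the 0.5 tolerance, and the `min_shared` threshold are all assumptions made for the example.

```python
# Monoisotopic residue masses in daltons (standard values).
RESIDUE = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER, PROTON = 18.01056, 1.00728

def fragment_masses(peptide):
    """Step 1: singly charged b- and y-ion m/z for every cleavage site."""
    masses = [RESIDUE[aa] for aa in peptide]
    b = [sum(masses[:i]) + PROTON for i in range(1, len(peptide))]
    y = [sum(masses[i:]) + WATER + PROTON for i in range(1, len(peptide))]
    return b + y

def shared_peaks(peptide, spectrum, tol=0.5):
    """Step 2: count predicted fragments that lie within tol of some peak."""
    return sum(any(abs(f - peak) <= tol for peak in spectrum)
               for f in fragment_masses(peptide))

def identify(candidates, spectrum, min_shared=3, tol=0.5):
    """Step 3: keep candidates that match well, best score first."""
    scored = [(shared_peaks(p, spectrum, tol), p) for p in candidates]
    return sorted(((s, p) for s, p in scored if s >= min_shared), reverse=True)

spectrum = fragment_masses("PEPTIDE")  # idealized, noise-free spectrum
print(identify(["PEPTIDE", "GAVLK"], spectrum))
```

The candidate list is where the sequence database matters: every peptide it contains is scored against every spectrum, which is why a shorter database translates into a faster search.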

31
Novel Splice Isoform

Human Jurkat leukemia cell-line
Lipid-raft extraction protocol, targeting T cells
von Haller, et al. MCP 2003.
LIME1 gene
LCK interacting transmembrane adaptor 1
LCK gene
Leukocyte-specific protein tyrosine kinase
Proto-oncogene
Chromosomal aberration involving LCK in
leukemias.
Multiple significant peptide identifications

32
Novel Splice Isoform
33
Novel Splice Isoform
34
Novel Mutation

HUPO Plasma Proteome Project
Pooled samples from 10 male and 10 female healthy
Chinese subjects
Plasma/EDTA sample protocol
Li, et al. Proteomics 2005. (Lab 29)
TTR gene
Transthyretin (pre-albumin)
Defects in TTR are a cause of amyloidosis.
Familial amyloidotic polyneuropathy
late-onset, dominant inheritance

35
Novel Mutation
Ala2→Pro associated with familial amyloid
polyneuropathy
36
Novel Mutation
37
Compressed EST Peptide Sequence Database

For all ESTs mapped to a UniGene gene
Six-frame translation
Eliminate ORFs < 30 amino-acids
Eliminate amino-acid 30-mers observed once
Compress to C2 FASTA database
Complete, Correct for amino-acid 30-mers
Inclusive gene-centric peptide sequence database
Size < 3% of naïve enumeration, 20774 FASTA
entries
Running time: 1% of naïve enumeration search
E-values: 2% of naïve enumeration search results

38
Compressed EST Peptide Sequence Database

For all ESTs mapped to a UniGene gene
Six-frame translation
Eliminate ORFs < 30 amino-acids
Eliminate amino-acid 30-mers observed once
Compress to C2 FASTA database
Complete, Correct for amino-acid 30-mers
Gene-centric peptide sequence database
Size < 3% of naïve enumeration, 20774 FASTA
entries
Running time: 1% of naïve enumeration search
E-values: 2% of naïve enumeration search results

39
Sequence Databases and CSBH-graphs

All k-mers represented by an edge have the same
count

[Figure: CSBH-graph with edge counts 1, 2, 2, 1, 2]
40
CSBH-graph subgraphs

Quickly determine those that occur twice

[Figure: CSBH-subgraph with edge counts 2, 2, 1, 2]
41
k-mer (Sub-)Problems

Enumerate
For all (distinct) k-mers, do
Existence
...with exact (and inexact) count x
Uniqueness
...exact and inexact match
Near-neighbors
...inexact match
Representation
Represent (distinct) k-mers for other tools
Fast annotation of k-mer counts on original
sequences

42
Large scale instances!

CSBH-graph instances
Partition set of all k-mers, determine
non-trivial nodes
Days on condor grid (250 CPUs) to construct
100,000,000 nodes and edges (sparse dense)
Min-cost flow instances
500,000 nodes and edges
Algorithms must be linear in problem size
Out-of-core Eulerian path algorithm?
Currently testing out-of-core connected-components

43
Grid computing

Heterogeneous machines
Varying disk/memory/MHz/cores capabilities
Centralized scheduler
Jobs started asynchronously
Other jobs may preempt current job
Input files may need to be staged
250 simultaneous requests for a 3Gb file?
How to guarantee integrity of input files?
Problem decomposition may be non-trivial
Job sizes need to fit the least capable machine
Sometimes need to game the scheduler
Need to ensure the integrity of job output

44
Uniqueness Oracles

Oracle for uniqueness of 20-mers in the Human
genome (size 3Gb)
Count occurrences in the genome 0,1,2
Construct 20-mer superstring for 20-mers with
count 1
Construct 20-mer superstring for 20-mers with
count > 1
Easy(-ish) for exact sequence match: O(n)
Fast automata, hash tables, suffix trees.
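
For the exact-match case, a hash-table oracle can be sketched in a few lines. This is an illustration under stated assumptions: the name `uniqueness_oracle` is hypothetical, plain Python sets stand in for the two k-mer superstrings described above, only one strand is scanned, and a return value of 2 means "more than once".

```python
from collections import Counter

def uniqueness_oracle(genome, k):
    """Build an oracle returning 0, 1, or 2 (meaning > 1) for any k-mer."""
    counts = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
    unique = {word for word, c in counts.items() if c == 1}
    repeated = {word for word, c in counts.items() if c > 1}

    def oracle(word):
        if word in unique:
            return 1
        return 2 if word in repeated else 0

    return oracle

oracle = uniqueness_oracle("ACGTACGA", k=3)
print(oracle("ACG"), oracle("CGT"), oracle("TTT"))
```

Storing every distinct 20-mer of a 3 Gb genome this way would need far more memory than a small example suggests, which is the motivation for the compact superstring representation.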

45
Polymerase Chain Reaction
46
Polymerase Chain Reaction
47
Inexact sequence match

Inexact sequence matching: O(nmk)
# errors/mismatches (k): 1, 2, 3
# distinct 20-mers (m): O(n)
Achieve expected linear time using a hybrid
approach (blastn)
Exact search for short chunks of primers
Expensive alignment only where chunks match
Large chunks ⇒ Fast, but miss occurrences
Small chunks ⇒ Slow, find all matches

48
Inexact sequence match

Baeza-Yates and Perleberg
Correct and O(n) for small k
At least 1 chunk is observed with no error.
Small k ⇒ Large chunks ⇒ Fast and correct
Form of locality sensitive hashing
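
The pigeonhole argument behind this filter: if a primer is cut into k + 1 chunks and an occurrence has at most k mismatches, at least one chunk must occur with no error. A minimal sketch for mismatches only (no insertions or deletions); the function names are hypothetical and `str.find` stands in for a faster exact-match index.

```python
def chunks(primer, k):
    """Split primer into k + 1 nearly equal chunks, returned with offsets."""
    if len(primer) < k + 1:
        raise ValueError("primer too short for k + 1 non-empty chunks")
    base, extra = divmod(len(primer), k + 1)
    out, pos = [], 0
    for i in range(k + 1):
        size = base + (1 if i < extra else 0)
        out.append((pos, primer[pos:pos + size]))
        pos += size
    return out

def find_with_mismatches(genome, primer, k):
    """All genome positions where primer occurs with at most k mismatches."""
    m, hits = len(primer), set()
    for offset, chunk in chunks(primer, k):
        start = genome.find(chunk)           # cheap exact search for a chunk
        while start != -1:
            pos = start - offset             # implied start of the full primer
            if 0 <= pos <= len(genome) - m:
                window = genome[pos:pos + m]
                # Expensive check, run only where a chunk matched exactly.
                if sum(a != b for a, b in zip(window, primer)) <= k:
                    hits.add(pos)
            start = genome.find(chunk, start + 1)
    return sorted(hits)

print(find_with_mismatches("GGACGTTCGTGG", "ACGTACGT", k=1))
```

With k = 1 the 8-mer is cut into two chunks of length 4; raising k shortens the chunks, so exact chunk hits become more frequent and more verifications are triggered.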

[Figure: alignment of g against q]
49
Locality Sensitive Hashing

For each primer
store a (set of) hash(es) in hash-table
At each position in the genome
look-up a (set of) hash(es) in hash-table
if any hash is found, do more expensive check
Need to weigh
sensitivity (false negatives) vs
specificity (false positives)
Our application requires speed and no false
negatives!

50
Random Projection

Choose T templates of l random care positions
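
A sketch of the hash-then-verify scheme with randomly chosen templates. All names here are hypothetical, a template is represented as a sorted tuple of care positions, and mismatches (Hamming distance) are the only errors considered. Random templates can miss true near-matches, which is the false-negative problem that the seed-set design slides address.

```python
import random

def random_templates(m, l, T, seed=0):
    """Choose T templates, each a sorted tuple of l care positions out of m."""
    rng = random.Random(seed)
    return [tuple(sorted(rng.sample(range(m), l))) for _ in range(T)]

def project(word, template):
    """Hash key: the characters of word at the template's care positions."""
    return "".join(word[i] for i in template)

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def scan(genome, primers, templates, k):
    """Return (position, primer) pairs that match with at most k mismatches."""
    m = len(primers[0])
    index = {}
    for primer in primers:                       # store hashes per primer
        for t, template in enumerate(templates):
            index.setdefault((t, project(primer, template)), set()).add(primer)
    hits = set()
    for pos in range(len(genome) - m + 1):       # look up hashes per position
        window = genome[pos:pos + m]
        for t, template in enumerate(templates):
            for primer in index.get((t, project(window, template)), ()):
                if hamming(window, primer) <= k:  # the more expensive check
                    hits.add((pos, primer))
    return hits

primers = ["ACGTACGTAC", "TTTTGGGGCC"]
genome = "GG" + primers[0] + "AA" + primers[1] + "G"
templates = random_templates(m=10, l=4, T=3)
hits = scan(genome, primers, templates, k=1)
print(sorted(hits))
```

Exact occurrences are always reported because every template's projection agrees; an occurrence with errors is reported only if some template avoids all of the error positions.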

g
q
51
Random Projection

Choose T templates of l random care positions

[Figure: template t1 applied to g]
t1 = 0 1 1 0 1 0 1 0 0 0 1 0 1 0 1 0 0
52
Random Projection

Choose T templates of l random positions

[Figure: templates t1 and t2 applied to g]
t1 = 0 1 1 0 1 0 1 0 0 0 1 0 1 0 1 0 0
t2 = 1 0 1 0 0 1 0 1 0 0 0 0 0 1 0 0 1
53
Random Projection

Choose T templates of l random positions

[Figure: templates t1 and t2 applied to g]
t1 = 0 1 1 0 1 0 1 0 0 0 1 0 1 0 1 0 0
t2 = 1 0 1 0 0 1 0 1 0 0 0 0 0 1 0 0 1
54
Random Projection

Choose T templates of l random positions

[Figure: templates t1 and t2 applied to g]
t1 = 0 1 1 0 1 0 1 0 0 0 1 0 1 0 1 0 0
t2 = 1 0 1 0 0 1 0 1 0 0 0 0 0 1 0 0 1
55
Gapped seed-set design problem

Given
mer-size: m (= 20)
# errors: k (= 1, 2, 3)
# cares: l (= 10, 12, 14)
Find the smallest set of templates with no false
negatives.
Minimize running time.

56
Gapped seed set design formulation (for k = 2)

Cover the edges of K_m with copies of K_(m-l)
How many triangles to cover K6? (m = 6, k = 2, l = 3)
Some instances of (m,2,m-3) cover each edge
exactly once
Steiner triple systems

57
How many triangles cover K6?

15 edges total
Is 5 triangles possible?

58
How many triangles cover K6?

15 edges total
Is 5 triangles possible? NO!

59
How many triangles cover K6?

15 edges total
Is 5 triangles possible? NO!
Each node requires 3 triangles
Triangles must account for at least 18 edges

60
How many triangles cover K6?

15 edges total
Is 5 triangles possible? NO!
Each node requires 3 triangles
Triangles must account for at least 18 edges
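
The counting argument can be confirmed by exhaustive search on this small instance. Each node of K6 has degree 5 and a triangle covers two of its edges, so each node needs 3 triangles; 6 nodes × 3 triangles ÷ 3 nodes per triangle gives at least 6 triangles, or 18 edge slots for 15 edges. The sketch below uses a hypothetical function name and is brute force, practical only for very small m.

```python
from itertools import combinations

def min_clique_cover(m, size):
    """Smallest number of size-vertex cliques that cover every edge of K_m."""
    edges = set(combinations(range(m), 2))
    # Represent each candidate clique by the set of edges it covers.
    cliques = [set(combinations(c, 2)) for c in combinations(range(m), size)]
    for count in range(1, len(cliques) + 1):
        for chosen in combinations(cliques, count):
            if set().union(*chosen) == edges:
                return count

print(min_clique_cover(6, 3))
```

One cover of size 6 is the set of triangles 012, 013, 045, 145, 234, 235: every edge is covered, and the three edges 01, 23, 45 are covered twice.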

61
Gapped seed set design formulation 2

Set cover instance
Ground set: all possible placements of the k
errors (alignments)
Covering sets: all possible placements of the l
care positions
For (m = 20, k = 2, l = 10):
190 elements, 184,756 sets!
Greedy approximation algorithm works
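
A sketch of the greedy heuristic on this set cover instance. The function name is hypothetical, and the code assumes that a template covers an error placement when none of its care positions is hit by an error. It evaluates every candidate template in each round, so it is meant for small m; on the (20, 2, 10) instance each round scans all 184,756 candidates.

```python
import math
from itertools import combinations

def greedy_templates(m, k, l):
    """Greedy set cover: add templates until every placement of k errors
    is avoided by the care positions of at least one chosen template."""
    if m - l < k:
        raise ValueError("no template can avoid k errors")
    uncovered = set(combinations(range(m), k))    # ground set
    candidates = list(combinations(range(m), l))  # covering sets
    chosen = []
    while uncovered:
        def gain(template):
            cares = set(template)
            return sum(1 for errs in uncovered if cares.isdisjoint(errs))
        best = max(candidates, key=gain)          # covers the most placements
        cares = set(best)
        uncovered = {errs for errs in uncovered if not cares.isdisjoint(errs)}
        chosen.append(best)
    return chosen

# Instance size quoted on the slide for (m = 20, k = 2, l = 10).
print(math.comb(20, 2), math.comb(20, 10))
templates = greedy_templates(m=8, k=2, l=4)
print(len(templates), templates)
```

Greedy gives a feasible template set with no false negatives, but not necessarily a smallest one.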

62
Gapped seed set design formulation 3
Templates
Positions (m)
Remove any k position nodes: at least 1
template must have degree l.
63
Gapped seed set design formulation 3

Polynomial size in terms of number of templates
Select T in advance and test whether sufficient.
Greedily add 1,2,3,... templates.
Apply iteratively to achieve feasible solution

64
Solution for (20,2,10)

.................... Positions
t1
t2
t3
t4
t5
t6

Need > 4 templates; 6 is optimal

65
Remember the application!

We are checking some templates twice!
We compute hash(es) at each position in the
genome
Any template that is a shift of another will be
computed at some nearby genomic position!

66
Solution for (20,2,10)

.................... Positions
t1
t2
t3
t4
t5
t6

Need at most 3 templates...can we do better?

67
Solution for (20,2,10) w/ shift

.................... Positions
t1
t2

Optimal is 2 templates...

68
Gapped seed set design: Solution strategies

Randomized algorithms
Greedy algorithm
Directly to set cover instance
Indirectly to bipartite instance
Integer programming
On set cover and bipartite instances
Solution of greedy algorithm subproblem
...in parallel, using COIN-OR SYMPHONY
Branch-and-bound enumeration
Solution of greedy algorithm subproblem
...in parallel, using COIN-OR ALPS library

69
What about edit-distance?

Formulations can be generalized
Similar solution strategies can be applied
(All) symmetry lost!
This may actually be helpful
Much harder to solve
Is greedy still good?
Solutions typically require more templates

70
Uniqueness Oracles

Integrated with CSBH-graph construction algorithm
Ensure edge-count property is preserved
Sequence database of unique / non-unique 20-mers
for small genomes
D. melanogaster, up to edit-distance 2
Currently working to scale to human...

71
Other Projects / Interests

HMMs for Peptide Spectrum Matching
with UMd, CS
Rapid Microorganism Identification Database
www.RMIDb.org
Pathogen detection using Spectral Matching
with USDA

Locality sensitive hashing
spectra, peptide sequence
Statistical techniques
statistical significance
importance sampling
CSBH-graph applications
genome assembly
Grid computing
Web-applications
Relational databases

72
Future Research Directions

Extend k-mer superstring algorithms
Range of word sizes, variable length words
Other sequence properties (Tryptic peptides, Tm)
Identification of protein isoforms
Optimize proteomics workflow for isoform
detection
Identify splice variants in cancer cell-lines
(MCF-7) and clinical brain tumor samples
Aggressive peptide sequence enumeration
dbPep for genomic annotation
Open, flexible informatics infrastructure for
peptide identification

73
Future Research Directions