Optimal kmer superstrings for protein identification and DNA assay design' - PowerPoint PPT Presentation

1 / 74
About This Presentation
Title:

Optimal kmer superstrings for protein identification and DNA assay design'

Description:

Can always 'start' a new sequence. Find optimal set of additional edges ... Gapped seed-set design problem. Given: mer-size: m ( = 20 ) # errors: k ( = 1,2,3) ... – PowerPoint PPT presentation

Number of Views:82
Avg rating:3.0/5.0
Slides: 75
Provided by: cbcb6
Category:

less

Transcript and Presenter's Notes

Title: Optimal kmer superstrings for protein identification and DNA assay design'


1
Optimal k-mer superstrings for protein
identification and DNA assay design.
  • Nathan Edwards
  • Center for Bioinformatics and Computational
    Biology
  • University of Maryland, College Park

2
k-mer (Sub-)Problems
  • Enumerate
  • For all (distinct) k-mers, do
  • Existence
  • ...with respect to exact ( inexact) count x
  • Uniqueness
  • ...with respect to exact inexact match
  • Near-neighbors
  • ...with respect to inexact match
  • Representation
  • Represent (distinct) k-mers for other tools
  • Fast annotation of k-mer counts on original
    sequences

3
Applications of k-mer sets
  • Peptide Identification
  • Represent all amino-acid 30-mers
  • ...that occur at least twice in human dbEST
  • PCR Primer Design
  • Test DNA 20-mer primers for uniqueness
  • What does it mean to be unique?
  • DNA sequencing error / repeat detection
  • Eliminate mers that are too rare or too frequent
  • Pathogen signatures
  • Near-neighbors imply potential false-positives

4
k-mer Superstring Problem
  • Given
  • A set of sequences S S1, ..., Sn
  • Sequence database
  • Word size k
  • Find
  • A new set of sequences T T1, ..., Tm
  • Such that
  • Total length of T is minimized, and
  • T is complete and correct w.r.t. k-mers of S

5
k-mer Superstring Problem
  • Completeness
  • All of the k-mers of S are represented
  • Correctness
  • No additional k-mers are present
  • Minimize the total representation length
  • Correlates with running time

6
Shortest (common) superstring problem
  • General strings (arbitrary length)
  • Single output string
  • Completeness for input sequences only
  • Classical NP-hard problem
  • Garey and Johnson
  • Approximate within 2.5OPT
  • Max-SNP hard
  • One of the first algorithmic approaches to genome
    assembly

7
de Bruijn Sequences
  • de Bruijn sequences represent all words of length
    k from some alphabet A.
  • A 0,1, k 3 s 0001110100
  • A 0,1, k 4 s 0000111101011001000

8
de Bruijn Graph A 0,1, k 4
1
1
0
1
0
1
1
0
1
0
1
0
1
0
0
0
9
de Bruijn Sequences Graphs
  • de Bruijn graphs (k,A)
  • Edges represent length k words from A
  • Each node has
  • in degree A
  • out degree A
  • Eulerian tour constructs de Bruijn sequence.

10
Sequencing-by-Hybridization-graph
ACDEFGI, ACDEFACG, DEFGEFGI
11
Compressed SBH-graph
ACDEFGI, ACDEFACG, DEFGEFGI
12
Sequence Databases CSBH-graphs
  • Original sequences correspond to paths

ACDEFGI, ACDEFACG, DEFGEFGI
13
C3 Enumeration
  • Complete
  • All k-mers are present
  • Correct
  • No other k-mers are present
  • Compact
  • No k-mer is present more than once

14
Correct, Complete, Compact (C3) Enumeration
  • Set of paths that use each edge exactly once

ACDEFGEFGI, DEFACG
15
Correct, Complete (C2) Enumeration
  • Set of paths that use each edge at least once

ACDEFGEFGI, DEFACG
16
Patching the CSBH-graph
  • Use artificial edges to fix unbalanced nodes

17
Patching the CSBH-graph
  • Use matching-style formulations to choose
    artificial edges
  • Optimal C2/C3 enumeration in polynomial time.
  • Chinese Postman Problem
  • Edmonds and Johnson, 73
  • l-tuple DNA sequencing
  • Pevzner, 89
  • Shortest (Common) Superstring
  • MAX-SNP-hard, 2.5 approx algorithm

18
Related work
  • Chinese Postman Problem
  • Undirected graph, weighted edges
  • Shortest path that uses all the edges
  • Solvable in polynomial time
  • Construct minimum weighted matching between nodes
    of odd-degree
  • Add matching to graph and find Eulerian path
  • Minimize weight of extra edges used

19
C2 Enumeration
  • Chinese postman problem, except
  • Directed graph
  • Add edges from nodes with surplus in-degree to
    nodes with surplus out-degree
  • Fixed cost teleportation option
  • Can always start a new sequence
  • Find optimal set of additional edges
  • Transportation problem / min cost flow instance

20
C3 Enumeration
in-out
in-out
Cost k
21
Reusing Edges
  • ACDEHAC, ACDFHAC, ACDGHACD

22
Reusing Edges
  • C3 ACDEHACDFHAC, ACDGHACD

23
Reusing Edges
  • C2 ACDEHACDFHACDGHAC

24
C2 Enumeration
in-out
in-out
4
10
Shortcut paths
7
25
C3 Enumeration
in-out
in-out
Cost 0
Cost 0
Cost k
26
Sample Preparation for Peptide Identification
27
Single Stage MS
MS
m/z
28
Tandem Mass Spectrometry(MS/MS)
m/z
Precursor selection
m/z
29
Tandem Mass Spectrometry(MS/MS)
Precursor selection collision induced
dissociation (CID)
m/z
MS/MS
m/z
30
Peptide Identification
  • For each (likely) peptide sequence
  • 1. Compute fragment masses
  • 2. Compare with spectrum
  • 3. Retain those that match well
  • Peptide sequences from protein sequence databases
  • Swiss-Prot, IPI, NCBIs nr, ...
  • Automated, high-throughput peptide identification
    in complex mixtures

31
Novel Splice Isoform
  • Human Jurkat leukemia cell-line
  • Lipid-raft extraction protocol, targeting T cells
  • von Haller, et al. MCP 2003.
  • LIME1 gene
  • LCK interacting transmembrane adaptor 1
  • LCK gene
  • Leukocyte-specific protein tyrosine kinase
  • Proto-oncogene
  • Chromosomal aberration involving LCK in
    leukemias.
  • Multiple significant peptide identifications

32
Novel Splice Isoform
33
Novel Splice Isoform
34
Novel Mutation
  • HUPO Plasma Proteome Project
  • Pooled samples from 10 male 10 female healthy
    Chinese subjects
  • Plasma/EDTA sample protocol
  • Li, et al. Proteomics 2005. (Lab 29)
  • TTR gene
  • Transthyretin (pre-albumin)
  • Defects in TTR are a cause of amyloidosis.
  • Familial amyloidotic polyneuropathy
  • late-onset, dominant inheritance

35
Novel Mutation
Ala2?Pro associated with familial amyloid
polyneuropathy
36
Novel Mutation
37
Compressed EST Peptide Sequence Database
  • For all ESTs mapped to a UniGene gene
  • Six-frame translation
  • Eliminate ORFs lt 30 amino-acids
  • Eliminate amino-acid 30-mers observed once
  • Compress to C2 FASTA database
  • Complete, Correct for amino-acid 30-mers
  • Inclusive gene-centric peptide sequence database
  • Size lt 3 of naïve enumeration, 20774 FASTA
    entries
  • Running time 1 of naïve enumeration search
  • E-values 2 of naïve enumeration search results

38
Compressed EST Peptide Sequence Database
  • For all ESTs mapped to a UniGene gene
  • Six-frame translation
  • Eliminate ORFs lt 30 amino-acids
  • Eliminate amino-acid 30-mers observed once
  • Compress to C2 FASTA database
  • Complete, Correct for amino-acid 30-mers
  • Gene-centric peptide sequence database
  • Size lt 3 of naïve enumeration, 20774 FASTA
    entries
  • Running time 1 of naïve enumeration search
  • E-values 2 of naïve enumeration search results

39
Sequence Databases CSBH-graphs
  • All k-mers represented by an edge have the same
    count

1
2
2
1
2
40
CSBH-graph subgraphs
  • Quickly determine those that occur twice

2
2
1
2
41
k-mer (Sub-)Problems
  • Enumerate
  • For all (distinct) k-mers, do
  • Existence
  • ...with exact ( inexact) count x
  • Uniqueness
  • ...exact inexact match
  • Near-neighbors
  • ...inexact match
  • Representation
  • Represent (distinct) k-mers for other tools
  • Fast annotation of k-mer counts on original
    sequences

42
Large scale instances!
  • CSBH-graph instances
  • Partition set of all k-mers, determine
    non-trivial nodes
  • Days on condor grid (250 CPUs) to construct
  • 100,000,000 nodes and edges (sparse dense)
  • Min-cost flow instances
  • 500,000 nodes and edges
  • Algorithms must be linear in problem size
  • Out-of-core Eulerian path algorithm?
  • Currently testing out-of-core connected-components

43
Grid computing
  • Heterogeneous machines
  • Varying disk/memory/MHz/cores capabilities
  • Centralized scheduler
  • Jobs started asynchronously
  • Other jobs may preempt current job
  • Input files may need to be staged
  • 250 simultaneous requests for a 3Gb file?
  • How to guarantee integrity of input files?
  • Problem decomposition may be non-trivial
  • Jobs sizes need to fit the least capable machine
  • Sometimes need to game the scheduler
  • Need to ensure the integrity of job output

44
Uniqueness Oracles
  • Oracle for uniqueness of 20-mers in the Human
    genome (size 3Gb)
  • Count occurrences in the genome 0,1,2
  • Construct 20-mer superstring for 20-mers with
    count 1
  • Construct 20-mer superstring for 20-mers with
    count gt 1
  • Easy(-ish) for exact sequence match O(n)
  • Fast automata, hash tables, suffix trees.

45
Polymerase Chain Reaction
46
Polymerase Chain Reaction
47
Inexact sequence match
  • Inexact sequence matching O(nmk)
  • Errors/Mismatches (k) 1,2,3
  • distinct 20-mers (m) O(n)
  • Achieve expected linear time using a hybrid
    approach (blastn)
  • Exact search for short chunks of primers
  • Expensive alignment only where chunks match
  • Large chunks ) Fast, but miss occurrences
  • Small chunks ) Slow, find all matches

48
Inexact sequence match
  • Baeza-Yates Perleberg
  • Correct and O(n) for small k
  • At least 1 chunk is observed with no error.
  • Small k ? Large chunks ? Fast and correct
  • Form of locality sensitive hashing

g
? ?
q
49
Locality Sensitive Hashing
  • For each primer
  • store a (set of) hash(es) in hash-table
  • At each position in the genome
  • look-up a (set of) hash(es) in hash-table
  • if any hash is found, do more expensive check
  • Need to weigh
  • sensitivity (false negatives) vs
  • specificity (false positives)
  • Our application requires speed and no false
    negatives!

50
Random Projection
  • Choose T templates of l random care positions

g
q
51
Random Projection
  • Choose T templates of l random care positions

g
t1
t1 0 1 1 0 1 0 1 0 0 0 1 0 1 0 1 0
0
52
Random Projection
  • Choose T templates of l random positions

g
t1
t2
t1 0 1 1 0 1 0 1 0 0 0 1 0 1 0 1 0
0 t2 1 0 1 0 0 1 0 1 0 0 0 0 0 1 0
0 1
53
Random Projection
  • Choose T templates of l random positions

g
t1
t2
t1 0 1 1 0 1 0 1 0 0 0 1 0 1 0 1 0
0 t2 1 0 1 0 0 1 0 1 0 0 0 0 0 1 0
0 1
54
Random Projection
  • Choose T templates of l random positions

g
t1
t2
t1 0 1 1 0 1 0 1 0 0 0 1 0 1 0 1 0
0 t2 1 0 1 0 0 1 0 1 0 0 0 0 0 1 0
0 1
55
Gapped seed-set design problem
  • Given
  • mer-size m ( 20 )
  • errors k ( 1,2,3)
  • cares l ( 10,12,14 )
  • Find the smallest set of templates with no false
    negatives.
  • Minimize running time.

56
Gapped seed set design formulation (for k 2)
  • Cover the edges of Km with copies of Km-l
  • How many triangles to cover K6?(m 6, k 2, l
    3)
  • Some instances of (m,2,m-3) cover each edge
    exactly once
  • Steiner triple systems

57
How many triangles cover K6?
  • 15 edges total
  • Is 5 triangles possible?

58
How many triangles cover K6?
  • 15 edges total
  • Is 5 triangles possible? NO!

59
How many triangles cover K6?
  • 15 edges total
  • Is 5 triangles possible? NO!
  • Each node requires 3 triangles
  • Triangles must account for at least 18 edges

60
How many triangles cover K6?
  • 15 edges total
  • Is 5 triangles possible? NO!
  • Each node requires 3 triangles
  • Triangles must account for at least 18 edges

61
Gapped seed set design formulation 2
  • Set cover instance
  • Ground set all possible placements of the k
    errors (alignments)
  • Covering sets all possible placements of the l
    care positions
  • For (m20,k2,l10),
  • 190 elements, 184,756 sets!
  • Greedy approximation algorithm works

62
Gapped seed set design formulation 3
Templates
Positions (m)
Remove any kposition nodes, at least 1
templatemust have degree l.
l
63
Gapped seed set design formulation 3
  • Polynomial size in terms of number of templates
  • Select T in advance and test whether sufficient.
  • Greedily add 1,2,3,... templates.
  • Apply iteratively to achieve feasible solution

64
Solution for (20,2,10)
  • .................... Positions
  • t1
  • t2
  • t3
  • t4
  • t5
  • t6
  • Need gt 4 templates, 6 is optimal

65
Remember the application!
  • We are checking some templates twice!
  • We compute hash(es) at each position in the
    genome
  • Any template that is a shift of another will be
    computed at some nearby genomic position!

66
Solution for (20,2,10)
  • .................... Positions
  • t1
  • t2
  • t3
  • t4
  • t5
  • t6
  • Need at most 3 templates...can we do better?

67
Solution for (20,2,10) w/ shift
  • .................... Positions
  • t1
  • t2
  • Optimal is 2 templates...

68
Gapped seed set designSolution strategies
  • Randomized algorithms
  • Greedy algorithm
  • Directly to set cover instance
  • Indirectly to bipartite instance
  • Integer programming
  • On set cover and bipartite instances
  • Solution of greedy algorithm subproblem
  • ...in parallel, using COIN-OR SYMPHONY
  • Branch-and-bound enumeration
  • Solution of greedy algorithm subproblem
  • ...in parallel, using COIN-OR ALPS library

69
What about edit-distance?
  • Formulations can be generalized
  • Similar solution strategies can be applied
  • (All) symmetry lost!
  • This may actually be helpful
  • Much harder to solve
  • Is greedy still good?
  • Solutions typically require more templates

70
Uniqueness Oracles
  • Integrated with CSBH-graph construction algorithm
  • Ensure edge-count property is preserved
  • Sequence database of unique / non-unique 20-mers
    for small genomes
  • D. melanogaster, up to edit-distance 2
  • Currently working to scale to human...

71
Other Projects / Interests
  • HMMs for Peptide Spectrum Matching
  • with UMd, CS
  • Rapid Microorganism Identification Database
  • www.RMIDb.org
  • Pathogen detection using Spectral Matching
  • with USDA
  • Locality sensitive hashing
  • spectra, peptide sequence
  • Statistical techniques
  • statistical significance
  • importance sampling
  • CSBH-graph applications
  • genome assembly
  • Grid computing
  • Web-applications
  • Relational databases

72
Future Research Directions
  • Extend k-mer superstring algorithms
  • Range of word sizes, variable length words
  • Other sequence properties (Tryptic peptides, Tm)
  • Identification of protein isoforms
  • Optimize proteomics workflow for isoform
    detection
  • Identify splice variants in cancer cell-lines
    (MCF-7) and clinical brain tumor samples
  • Aggressive peptide sequence enumeration
  • dbPep for genomic annotation
  • Open, flexible informatics infrastructure for
    peptide identification

73
Future Research Directions
  • Proteomics for Microorganism Identification
  • Specificity of tandem mass spectra
  • Revamp RMIDb prototype
  • Incorporate spectral matching
  • Primer design
  • Uniqueness oracle for inexact match in human
  • Integration with Primer3
  • Tiling, multiplexing, pooling, tag arrays

74
Acknowledgements
  • Chau-Wen Tseng, Xue Wu
  • UMCP Computer Science
  • Catherine Fenselau, Steve Swatkoski
  • UMCP Biochemistry
  • Calibrant Biosystems
  • PeptideAtlas, HUPO PPP, X!Tandem
  • Funding National Cancer Institute
Write a Comment
User Comments (0)
About PowerShow.com