Title: A Combinatorial Toolbox for Protein Sequence Design and Landscape Analysis in the Grand Canonical Mo
1A Combinatorial Toolbox for Protein Sequence
Design and Landscape Analysis in the Grand
Canonical Model
Ming-Yang Kao
Department of Computer Science
Northwestern University
Evanston, Illinois, U.S.A.
2Acknowledgments
- This talk is based on joint work with colleagues and students at Yale University.
- Computer Science
- Jim Aspnes
- Gauri Shah
- Biology
- Julia Hartling
- Junhyong Kim
3Dual Purposes of This Talk
- Discuss protein folding problems.
- Emphasize the point that as bioinformatics grows,
advanced algorithmic techniques will become
useful and crucial.
4Importance of Protein Folding
The 3D structure of a protein largely determines its function.
5Two Complementary Problems for Protein Folding
- Protein Folding Prediction: given a protein sequence, determine the 3D folding of the sequence.
- Protein Sequence Design: given a 3D structure, determine the fittest protein sequence for the structure, i.e., one that has the smallest energy among all possible sequences when folded into the structure.
6Complexity for Protein Folding Problems
- Protein Folding Prediction: given a protein sequence, determine the 3D folding of the sequence.
- NP-hard under various models.
- Protein Sequence Design: given a 3D structure, determine the fittest protein sequence for the structure, i.e., one that has the smallest energy among all possible sequences when folded into the structure.
- Solvable in polynomial time under the Grand Canonical model.
7History of Protein Sequence Design
- Protein Sequence Design: given a 3D structure, determine the fittest protein sequence for the structure, i.e., one that has the smallest energy among all possible sequences when folded into the structure.
- Sun et al., 1995: heuristic search without optimality guarantee.
- Hart, 1997: open question on the computational tractability.
- Kleinberg, 1999: polynomial-time algorithms.
- Aspnes, Hartling, Kao, Kim, Shah, 2001 (this talk): improved algorithms and generalized problems.
8Outline of Technical Discussions
- The Grand Canonical Model
- Two Basic Computational Problems
- Experimental Results
- Combinatorial Tools
- (1a) Linear Programming
- (1b) Network Flow
- (1c) Compact Representation of All Min Cuts
- (1d) others
- Further Algorithmic and Computational Hardness Results
- Conclusions
9Outline of Technical Discussions (1)
- The Grand Canonical Model
- Two Basic Computational Problems
- Experimental Results
- Combinatorial Tools
- (1a) Linear Programming
- (1b) Network Flow
- (1c) Compact Representation of All Min Cuts
- (1d) others
- Further Algorithmic and Computational Hardness Results
- Conclusions
10Grand Canonical Model (Sun et al, 1995)
- Each amino acid is classified as Hydrophobic (H) or Polar (P).
- Each amino acid sequence is then considered as a binary sequence of H and P. (For mathematical convenience, set H = 1 and P = 0.)
- Hydrophobic (H): A, C, F, I, L, M, V, W, Y.
- Polar (P): the other amino acids.
- Sun, Brem, Chan, Dill. Designing amino acid sequences to fold with good hydrophobic cores. Protein Engineering, 1995.
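The H/P classification above can be sketched in a few lines of Python. This is only an illustration: the function names `to_hp` and `to_binary` and the sample peptide are hypothetical, while the hydrophobic set is the one listed on the slide.

```python
# Hydrophobic residues as listed on the slide; every other amino acid is polar.
HYDROPHOBIC = set("ACFILMVWY")

def to_hp(sequence: str) -> str:
    """Map a one-letter amino acid sequence to a string of H and P."""
    return "".join("H" if aa in HYDROPHOBIC else "P" for aa in sequence.upper())

def to_binary(hp: str) -> list:
    """Encode H as 1 and P as 0, the convention used for the mathematics."""
    return [1 if c == "H" else 0 for c in hp]

hp = to_hp("MKTAYIAKQR")   # made-up peptide, for illustration only
print(hp)                  # HPPHHHHPPP
print(to_binary(hp))
```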
11Representation of a 3D structure (Sun et al, 1995)
- A 3D folding structure S of an n-amino-acid sequence:
- the coordinates of each atom in S.
- the pairwise distances between the centers of amino acid residues in S.
- the solvent-accessible areas of the amino acid residues in S.
12Goal of Protein Sequence Design (Sun et al, 1995)
- Input: a 3D structure S and a sequence length n.
- Output: a sequence X of n amino acids that, when folded into S, has the following properties:
- The H-residues in X are as close to each other as possible.
- The solvent-accessible areas of the H-residues of X are as small as possible.
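13Fitness of a Sequence (Sun et al, 1995)
[Equation: the fitness function of a sequence]
14Fitness of a Sequence (Sun et al, 1995)
[Equation annotations: one term measures closeness among H-residues; the other measures small surface area]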
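A minimal sketch of how such a fitness value might be evaluated. It assumes the form suggested by the arc labels -alpha*g(dij) and beta*si used later in the talk: alpha times the sum of g(dij) over pairs of H-residues, plus beta times the sum of si over H-residues, with alpha < 0 and beta > 0, and smaller values being fitter. The function name `fitness` and the data layout are hypothetical. The numbers are the example values given on the "3D -> Network" slides, and pairs not listed there are assumed to contribute nothing.

```python
from fractions import Fraction

def fitness(x, g, s, alpha, beta):
    """x: list of 0/1 (1 = H); g: {(i, j): g(d_ij)} with 1-based residue indices;
    s: solvent-accessible areas. Assumed form: alpha*contacts + beta*surface."""
    contact = sum(w for (i, j), w in g.items() if x[i - 1] and x[j - 1])
    surface = sum(si for xi, si in zip(x, s) if xi)
    return alpha * contact + beta * surface

# Example values from the talk's nine-residue structure.
s = [3, 18, 6, 9, 3, 9, 6, 24, 9]
g = {(1, 6): Fraction(1, 2), (2, 5): Fraction(3, 4),
     (5, 8): Fraction(9, 10), (4, 9): Fraction(3, 4)}
alpha, beta = -8, Fraction(1, 3)

x = [1 if c == "H" else 0 for c in "HPPPHHPHP"]
print(fitness(x, g, s, alpha, beta))        # 9/5
print(fitness([0] * 9, g, s, alpha, beta))  # 0 for the all-P sequence
```

Exact fractions are used so that ties between sequences are not blurred by floating-point rounding.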
15Outline of Technical Discussions (2)
- The Grand Canonical Model
- Two Basic Computational Problems
- Experimental Results
- Combinatorial Tools
- (1a) Linear Programming
- (1b) Network Flow
- (1c) Compact Representation of All Min Cuts
- (1d) others
- Further Algorithmic and Computational Hardness Results
- Conclusions
16Problem 1
- Input:
- the parameters alpha and beta,
- a protein sequence Y,
- Y's 3D structure,
- the sequence length n of Y.
- Output:
- a fittest sequence X for the 3D structure with respect to the given alpha and beta.
- Applications of this problem: design the best sequences for novel structures, because we don't really need Y.
17Problem 2
- Input:
- the parameters alpha and beta,
- a protein sequence Y,
- Y's 3D structure,
- the sequence length n of Y.
- Output:
- a fittest sequence X for the 3D structure that is the most similar to Y over all possible alpha and beta.
- Applications of this problem: tune the alpha and beta of the Grand Canonical model.
18Basic Computational Scheme (1)
3D structure -> network -> a min cut -> a fittest sequence (e.g., HPPPHHPHP)
19Problem 1
- Input:
- the parameters alpha and beta,
- a protein sequence Y,
- Y's 3D structure,
- the sequence length n of Y.
- Output:
- a fittest sequence X for the 3D structure with respect to the given alpha and beta.
- Applications of this problem: design the best sequences for novel structures, because we don't really need Y.
- Computational Complexity: 1 network flow.
20Problem 2
- Input:
- the parameters alpha and beta,
- a protein sequence Y,
- Y's 3D structure,
- the sequence length n of Y.
- Output:
- a fittest sequence X for the 3D structure that is the most similar to Y over all possible alpha and beta.
- Applications of this problem: tune the alpha and beta of the Grand Canonical model.
- Computational Complexity: O(n) network flows.
21Outline of Technical Discussions (3)
- The Grand Canonical Model
- Two Basic Computational Problems
- Experimental Results
- Combinatorial Tools
- (1a) Linear Programming
- (1b) Network Flow
- (1c) Compact Representation of All Min Cuts
- (1d) others
- Further Algorithmic and Computational Hardness Results
- Conclusions
22Empirical Study Predictive Ability
- Computed Fittest Sequence versus Native Sequences (% similarity)
- Our % Similarity versus Kleinberg's
- % Similarity versus Protein Family Size.
23 % similarity --- computed versus native
- % similarity = the percentage of the H/Ps in the computed fittest sequence that are identical to those in the native sequence.
- The average percentage of hydrophobic residues is 42% in the native sequences that were studied.
- The best sequence picked without domain knowledge (for example, the all-P sequence) would therefore have a 58% similarity on average.
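The similarity measure defined above is a position-by-position match rate between two H/P strings. A small sketch, with a hypothetical function name and made-up sequences, also shows why an all-P guess scores 100% minus the fraction of H-residues:

```python
def percent_similarity(computed: str, native: str) -> float:
    """Percentage of positions at which two H/P strings agree."""
    if len(computed) != len(native):
        raise ValueError("sequences must have the same length")
    matches = sum(a == b for a, b in zip(computed, native))
    return 100.0 * matches / len(native)

native = "HPPHHPPPHP"     # made-up native pattern with 40% H
computed = "HPPPHHPPHP"   # made-up computed fittest sequence

print(percent_similarity(computed, native))            # 80.0
print(percent_similarity("P" * len(native), native))   # 60.0, the all-P baseline
```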
24 % similarity --- computed versus native (1)
25 % similarity --- computed versus native (2)
Our results versus Kleinberg's
26 % similarity --- computed versus native (3)
27 % similarity versus PFAM family size (1)
- % similarity = the percentage of the H/Ps in the computed fittest sequence that are identical to those in the native sequence.
- PFAM family size of a protein = the number of proteins in the PFAM database that are related to the given protein.
- The relatedness is computed via HMM models.
- pfam.wustl.edu
- A measure of success of a protein in Nature.
28 % similarity versus PFAM family size (2)
- % similarity = the percentage of the H/Ps in the computed fittest sequence that are identical to those in the native sequence.
- PFAM family size of a protein = the number of proteins in the PFAM database that are related to the given protein.
- Intuition/Conjecture:
- (3A) the more diverse a protein family is,
- (3B) the more its 3D structures vary,
- (3C) the smaller the % similarity will be.
29 % similarity versus PFAM family size (3)
30 % similarity versus PFAM family size (4)
31Outline of Technical Discussions (4)
- The Grand Canonical Model
- Two Basic Computational Problems
- Experimental Results
- Combinatorial Tools
- (1a) Linear Programming
- (1b) Network Flow
- (1c) Compact Representation of All Min Cuts
- (1d) others
- Further Algorithmic and Computational Hardness Results
- Conclusions
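32Tool 1 Linear Programming
Goal: find a fittest sequence X of n amino acids.
- Direct formulation: find a binary sequence x that minimizes the fitness function. [Equation] The objective is quadratic (clueless!).
- Linear programming formulation: find x and y that minimize a linear objective. [Equation]
- Linear
- Totally unimodular
- Integer solution
- Useful for proving theorems
- Still too inefficient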
33Tool 2 Network Flow (1)
- analogy: a network of oil pipes
- source s (origin of oil)
- sink t (destination of oil)
- other nodes (midway stations)
- arcs (pipes)
- arc capacity (pipe capacity)
- flow (amount of oil through a pipe)
- goal: deliver max amount of oil from source to sink
- computational goal: a max flow
- computational complexity: O(VE log(V^2/E))
[Figure: example network from source s to sink t, with a capacity on each arc.]
34Tool 2 Network Flow (2)
- example of max flow
- source (origin of oil)
- sink (destination of oil)
- other nodes (midway stations)
- arcs (pipes)
- arc capacity (pipe capacity)
- flow (amount of oil through a pipe)
- goal: deliver max amount of oil from source to sink
- computational goal: a max flow
- computational complexity: O(VE log(V^2/E))
[Figure: the example network with a max flow; each arc is labeled capacity (flow).]
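A minimal max-flow sketch for the oil-pipe analogy. It uses the Edmonds-Karp method (shortest augmenting paths found by breadth-first search) because it is short to write down. It is not the algorithm behind the O(VE log(V^2/E)) bound quoted on the slide, and its worst-case running time is higher. The small network `pipes` is hypothetical, not the one drawn in the figure.

```python
from collections import deque

def max_flow(cap, s, t):
    """Edmonds-Karp max flow. cap maps (u, v) -> capacity of arc u -> v."""
    res = dict(cap)                        # residual capacities
    adj = {}
    for (u, v) in cap:
        res.setdefault((v, u), 0)          # reverse arc, used to undo flow
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    total = 0
    while True:
        # Breadth-first search for a shortest augmenting path from s to t.
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v in adj.get(u, ()):
                if v not in parent and res[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return total                   # no augmenting path is left
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(res[e] for e in path)   # bottleneck along the path
        for (u, v) in path:
            res[(u, v)] -= push
            res[(v, u)] += push
        total += push

# Hypothetical pipe network: source "s", sink "t", midway stations "a" and "b".
pipes = {("s", "a"): 3, ("s", "b"): 2, ("a", "b"): 1, ("a", "t"): 2, ("b", "t"): 3}
print(max_flow(pipes, "s", "t"))           # 5
```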
35Tool 2 Network Flow (3)
- max flow versus min cut
- min cut: the bottleneck
- a partition (S,T) of nodes with s in S and t in T.
- total capacity of arcs from S to T = max flow.
[Figure: the example network with a max flow; each arc is labeled capacity (flow).]
36Tool 2 Network Flow (4)
- max flow versus min cut
- min cut: the bottleneck
- a partition (S,T) of nodes with s in S and t in T.
- total capacity of arcs from S to T = max flow.
- computational complexity: O(VE log(V^2/E))
[Figure: the example network with a max flow; each arc is labeled capacity (flow).]
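The relation between max flow and min cut can be checked on a small case. In the sketch below, once no augmenting path remains, the nodes still reachable from s in the residual network form the source side S of a min cut, and the total capacity of the arcs from S to T equals the flow value. The max-flow routine is a plain Edmonds-Karp implementation chosen for brevity, and the network `pipes` is hypothetical.

```python
from collections import deque

def max_flow_min_cut(cap, s, t):
    """Return (max-flow value, source side S of a min cut).
    cap maps (u, v) -> capacity of arc u -> v."""
    res = dict(cap)
    adj = {}
    for (u, v) in cap:
        res.setdefault((v, u), 0)
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    total = 0
    while True:
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v in adj.get(u, ()):
                if v not in parent and res[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            # The search exhausted everything reachable from s in the
            # residual network; that set is the source side of a min cut.
            return total, set(parent)
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(res[e] for e in path)
        for (u, v) in path:
            res[(u, v)] -= push
            res[(v, u)] += push
        total += push

def cut_capacity(cap, S):
    """Total capacity of arcs leaving S."""
    return sum(c for (u, v), c in cap.items() if u in S and v not in S)

pipes = {("s", "a"): 4, ("s", "b"): 3, ("a", "c"): 2,
         ("b", "c"): 2, ("c", "t"): 3, ("a", "t"): 1}
value, S = max_flow_min_cut(pipes, "s", "t")
print(value, sorted(S), cut_capacity(pipes, S))
```

In this network the bottleneck consists of the two arcs entering t, so the cut capacity and the flow value are both 4.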
37Basic Computational Scheme (1)
3D structure -> network -> a min cut -> a fittest sequence (e.g., HPPPHHPHP)
38Tool 2 3D -> Network (1)
[Figure: a 3D structure with residues numbered 1 to 9.]
S1 = 3, S2 = 18, S3 = 6, S4 = 9, S5 = 3, S6 = 9, S7 = 6, S8 = 24, S9 = 9
g(d16) = 0.5, g(d25) = 0.75, g(d58) = 0.9, g(d49) = 0.75
alpha = -8, beta = 1/3
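39Tool 2 3D -> Network (2)
[Figure: the 3D structure and its network, with one node per residue pair (1,6), (2,5), (5,8), (4,9) and one node per residue 1 to 9; arcs are labeled with capacities beta*si and -alpha*g(dij), e.g. 7.2 for the pair (5,8).]
S1 = 3, S2 = 18, S3 = 6, S4 = 9, S5 = 3, S6 = 9, S7 = 6, S8 = 24, S9 = 9
g(d16) = 0.5, g(d25) = 0.75, g(d58) = 0.9, g(d49) = 0.75
alpha = -8, beta = 1/3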
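40Tool 2 3D -> Network (3)
[Figure: the 3D structure and its network, with one node per residue pair (1,6), (2,5), (5,8), (4,9) and one node per residue 1 to 9; arcs are labeled with capacities beta*si and -alpha*g(dij), e.g. 7.2 for the pair (5,8).]
S1 = 3, S2 = 18, S3 = 6, S4 = 9, S5 = 3, S6 = 9, S7 = 6, S8 = 24, S9 = 9
g(d16) = 0.5, g(d25) = 0.75, g(d58) = 0.9, g(d49) = 0.75
alpha = -8, beta = 1/3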
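41Tool 2 3D -> Network (4)
[Figure: the 3D structure and its network, with one node per residue pair (1,6), (2,5), (5,8), (4,9) and one node per residue 1 to 9; arcs are labeled with capacities beta*si and -alpha*g(dij), e.g. 7.2 for the pair (5,8).]
Theorem (Kleinberg, 1999): The amino acids that are with the source in a min cut are H's.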
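The reduction from structure to network can be tried on the nine-residue example. The arc orientation cannot be read off the extracted slide text, so the layout below is an assumption: the source feeds one node per listed residue pair with capacity -alpha*g(dij), each pair node feeds its two residues with effectively unbounded capacity, and each residue feeds the sink with capacity beta*si. The fitness form (alpha times H-H contacts plus beta times H surface) is likewise assumed from the arc labels. Under these assumptions the min-cut value should equal a constant plus the smallest fitness value, which the sketch checks against brute force. All function names are hypothetical.

```python
from collections import deque
from fractions import Fraction
from itertools import product

def max_flow(cap, s, t):
    """Edmonds-Karp max flow on {(u, v): capacity}; returns the flow value."""
    res = dict(cap)
    adj = {}
    for (u, v) in cap:
        res.setdefault((v, u), 0)
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    total = 0
    while True:
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v in adj.get(u, ()):
                if v not in parent and res[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return total
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(res[e] for e in path)
        for (u, v) in path:
            res[(u, v)] -= push
            res[(v, u)] += push
        total += push

# Example values from the slides, kept as exact fractions.
S_AREA = [3, 18, 6, 9, 3, 9, 6, 24, 9]
G = {(1, 6): Fraction(1, 2), (2, 5): Fraction(3, 4),
     (5, 8): Fraction(9, 10), (4, 9): Fraction(3, 4)}
ALPHA, BETA = -8, Fraction(1, 3)

def build_network():
    """Assumed layout: source -> pair node -> its two residues -> sink."""
    big = sum(-ALPHA * w for w in G.values()) + 1   # exceeds any finite cut
    cap = {}
    for (i, j), w in G.items():
        cap[("s", (i, j))] = -ALPHA * w
        cap[((i, j), i)] = big
        cap[((i, j), j)] = big
    for i, area in enumerate(S_AREA, start=1):
        cap[(i, "t")] = BETA * area
    return cap

def fitness(x):
    """Assumed fitness: alpha * (H-H contacts) + beta * (H surface area)."""
    contact = sum(w for (i, j), w in G.items() if x[i - 1] and x[j - 1])
    surface = sum(a for xi, a in zip(x, S_AREA) if xi)
    return ALPHA * contact + BETA * surface

const = sum(-ALPHA * w for w in G.values())   # cut value when every residue is P
min_cut = max_flow(build_network(), "s", "t")
best = min(fitness(x) for x in product((0, 1), repeat=len(S_AREA)))
print(min_cut - const, best)                  # the two values agree
```

The brute-force loop over all 2^9 sequences is there only to check the reduction; it is what the single network flow replaces.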
42Basic Computational Scheme (1)
3D structure -> network -> a min cut -> a fittest sequence (e.g., HPPPHHPHP)
43Problem 1
- Input:
- the parameters alpha and beta,
- a protein sequence Y,
- Y's 3D structure,
- the sequence length n of Y.
- Output:
- a fittest sequence X for the 3D structure with respect to the given alpha and beta.
- Applications of this problem: design the best sequences for novel structures, because we don't really need Y.
44Tool 3 Linear Size Representation of All Min Cuts (1)
- Step 1: Compute a max flow of G.
- Step 2: Compute the residual network G'.
- Step 3: Contract every strongly connected component into a super node. Call the new graph G''.
Def: A node subset U of G'' is a closed set if for every node x in U, every descendant of x is also in U.
Theorem (Picard and Queyranne, 1980): Every closed set not including the sink forms a min cut, and vice versa.
[Figure: the example network on nodes s, v1 to v7, t; each arc is labeled capacity (flow).]
45Tool 3 Linear Size Representation of All Min Cuts (2)
[Figure: Residual Network of the example, on nodes s, v1 to v7, t, with a residual capacity on each arc.]
46Tool 3 Linear Size Representation of All Min Cuts (3)
[Figure: Picard-Queyranne Representation of the example, on nodes s, v1 to v7, t.]
47Tool 3 Linear Size Representation of All Min Cuts (4)
[Figure: Picard-Queyranne Representation of the example, on nodes s, v1 to v7, t.]
- Applications:
- Obtain all fittest sequences.
- Study the landscape of the fittest sequences.
- Compute fittest sequences with additional optimization objectives.
48Basic Computational Scheme (2)
3D structure -> network -> a max flow/min cut -> Picard-Queyranne Representation -> the space of all fittest sequences (e.g., HPPPHHPHP)
49Outline of Technical Discussions (5)
- The Grand Canonical Model
- Two Basic Computational Problems
- Experimental Results
- Combinatorial Tools
- (1a) Linear Programming
- (1b) Network Flow
- (1c) Compact Representation of All Min Cuts
- (1d) others
- Further Algorithmic and Computational Hardness Results
- Conclusions
50Problem 3
- Input: a 3D structure.
- Output: all its fittest protein sequences.
- Computational Complexity:
- (A) A linear size representation can be computed with 1 network flow.
- (B) Each individual fittest protein sequence can be generated from this representation in O(n) time.
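For a structure as small as the talk's nine-residue example, the set of all fittest sequences can simply be listed by brute force. The sketch below does that. It is a stand-in for illustration, not the linear-size representation, and it takes time exponential in n. The fitness form (alpha times H-H contacts plus beta times H surface) is assumed from the arc labels -alpha*g(dij) and beta*si, and the function names are hypothetical.

```python
from fractions import Fraction
from itertools import product

# Example values from the talk's nine-residue structure, as exact fractions.
S_AREA = [3, 18, 6, 9, 3, 9, 6, 24, 9]
G = {(1, 6): Fraction(1, 2), (2, 5): Fraction(3, 4),
     (5, 8): Fraction(9, 10), (4, 9): Fraction(3, 4)}
ALPHA, BETA = -8, Fraction(1, 3)

def fitness(x):
    """Assumed fitness: alpha * (H-H contacts) + beta * (H surface area)."""
    contact = sum(w for (i, j), w in G.items() if x[i - 1] and x[j - 1])
    surface = sum(a for xi, a in zip(x, S_AREA) if xi)
    return ALPHA * contact + BETA * surface

def all_fittest():
    """Enumerate every H/P sequence attaining the minimum fitness value."""
    candidates = list(product((0, 1), repeat=len(S_AREA)))
    best = min(fitness(x) for x in candidates)
    return ["".join("H" if b else "P" for b in x)
            for x in candidates if fitness(x) == best]

for seq in all_fittest():
    print(seq)
```

With these example values the minimum is attained by four sequences: all P, H at residues 1 and 6, H at residues 4 and 9, and H at all four of those residues.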
51Problem 4
- Input: f 3D structures.
- Output: the set of all protein sequences that are the fittest simultaneously for all these 3D structures.
- Computational Complexity: f network flows.
52Problem 5
- Input: a protein sequence Y and its native 3D structure.
- Output: the set of all fittest protein sequences that are also the most (or least) similar to Y in terms of unweighted (or weighted) Hamming distances.
- Computational Complexity: 1 network flow.
53Problem 6
- Input: a 3D structure.
- Output: count the number of protein sequences in the solution to each of Problems 3, 4, and 5.
- Computational Complexity: #P-complete.
54Problem 7
- Input: a 3D structure and a bound e.
- Output: enumerate the protein sequences whose fitness function values are within an additive factor e of that of the fittest protein sequences.
- Computational Complexity: polynomial time to generate each desired protein sequence.
55Problem 8
- Input: a 3D structure.
- Output: the largest possible unweighted (or weighted) Hamming distance between any two fittest protein sequences.
- Computational Complexity: 1 network flow.
56Problem 9
- Input: a protein sequence Y and its native 3D structure.
- Output: the average unweighted (or weighted) Hamming distance between Y and the fittest protein sequences for the 3D structure.
- Computational Complexity: #P-complete.
57Problem 10
- Input: a protein sequence Y, its native 3D structure, and two unweighted Hamming distances d1 and d2.
- Output: a fittest protein sequence whose distance from Y is also between d1 and d2.
- Computational Complexity: NP-hard.
58Problem 11
- Input: a protein sequence Y, its native 3D structure, and an unweighted Hamming distance d.
- Output: the fittest among the protein sequences which are at distance d from Y.
- Computational Complexity: NP-hard. We have a polynomial-time approximation algorithm.
59Problem 12
- Input: a protein sequence Y and its native 3D structure.
- Output: all the ratios between the scaling factors alpha and beta in the GC model such that the smallest possible unweighted (or weighted) Hamming distance between Y and any fittest protein sequence is minimized over all possible alpha and beta.
- Computational Complexity: O(n) network flows.
60Problem 13
- Input: a 3D structure.
- Output: determine whether the fittest protein sequences are connected, i.e., whether they can mutate into each other through allowable mutations, such as point mutations, while the intermediate protein sequences all remain the fittest.
- Computational Complexity: 1 network flow.
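The notion of connectedness can be illustrated on an explicit list of fittest sequences. The sketch below takes point mutations (flipping one position between H and P) as the allowable mutations and searches breadth-first through the given set. It is a brute-force illustration over an enumerated set, not the single-network-flow method claimed on the slide, and the function name and example sets are hypothetical.

```python
from collections import deque

def connected_by_point_mutations(fittest):
    """True if every sequence in `fittest` can reach every other one by
    single H/P flips whose intermediate sequences all stay in `fittest`."""
    fittest = set(fittest)
    if not fittest:
        return True
    start = next(iter(fittest))
    seen = {start}
    queue = deque([start])
    while queue:
        cur = queue.popleft()
        for i, c in enumerate(cur):
            # Flip position i and keep the neighbor only if it is also fittest.
            neighbor = cur[:i] + ("P" if c == "H" else "H") + cur[i + 1:]
            if neighbor in fittest and neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen == fittest

# Any two of these differ in at least two positions, so they are not connected.
print(connected_by_point_mutations(
    ["PPPPPPPPP", "HPPPPHPPP", "PPPHPPPPH", "HPPHPHPPH"]))   # False
# A made-up set in which each step is a single flip.
print(connected_by_point_mutations(["PP", "HP", "HH"]))        # True
```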
61Problem 14
- Input: a 3D structure and two fittest protein sequences.
- Output: determine whether the two sequences are connected.
- Computational Complexity: 1 network flow.
62Problem 15
- Input: a 3D structure.
- Output: the smallest set of allowable mutations with respect to which the fittest protein sequences (or two given fittest protein sequences) for the structure are connected.
- Computational Complexity: 1 network flow.
63Outline of Technical Discussions (6)
- The Grand Canonical Model
- Two Basic Computational Problems
- Experimental Results
- Combinatorial Tools
- (1a) Linear Programming
- (1b) Network Flow
- (1c) Compact Representation of All Min Cuts
- (1d) others
- Further Algorithmic and Computational Hardness Results
- Conclusions
64Further Research for Protein Sequence Design
- More sophisticated models (biology).
- Algorithms and complexity for such models (computer science).
- Wet lab validation (biology).
65Further Algorithmic Research for Bioinformatics
- Current State of Bioinformatics:
- Biology: mostly very simple heuristics
- Algorithms: mostly very simple techniques
- Conjectures:
- Biology: Nature is not so simple. Most of the biological information is very complicated.
- Algorithms: very sophisticated, novel, and fundamental techniques will be needed to unlock Nature's secrets.