1
A Combinatorial Toolbox for Protein Sequence
Design and Landscape Analysis in the Grand
Canonical Model
Ming-Yang Kao
Department of Computer Science
Northwestern University
Evanston, Illinois, U.S.A.
2
Acknowledgments

This talk is based on joint work with colleagues and students at Yale University.
Computer Science: Jim Aspnes, Gauri Shah
Biology: Julia Hartling, Junhyong Kim

3
Dual Purposes of This Talk

Discuss protein folding problems.
Emphasize the point that as bioinformatics grows,
advanced algorithmic techniques will become
useful and crucial.

4
Importance of Protein Folding
The 3D structure significantly determines the
function.
5
Two Complementary Problems for Protein Folding

Protein Folding Prediction ---
Given a protein sequence,
determine the 3D folding of the sequence.
Protein Sequence Design ---
Given a 3D structure, determine the fittest
protein sequence for the structure, i.e., one
that has the smallest energy among all possible
sequences when folded into the structure.

6
Complexity for Protein Folding Problems

Protein Folding Prediction ---
Given a protein sequence,
determine the 3D folding of the sequence.
NP-hard under various models.
Protein Sequence Design ---
Given a 3D structure, determine the fittest
protein sequence for the structure, i.e., one
that has the smallest energy among all possible
sequences when folded into the structure.
Solvable in polynomial time under the Grand
Canonical model.

7
History of Protein Sequence Design

Protein Sequence Design ---
Given a 3D structure, determine the fittest
protein sequence for the structure, i.e., one
that has the smallest energy among all possible
sequences when folded into the structure.
Sun et al., 1995: Heuristic search without optimality guarantee.
Hart, 1997: Open question on the computational tractability.
Kleinberg, 1999: Polynomial-time algorithms.
Aspnes, Hartling, Kao, Kim, Shah, 2001 (this talk): Improved algorithms and generalized problems.

8
Outline of Technical Discussions

The Grand Canonical Model
Two Basic Computational Problems
Experimental Results
Combinatorial Tools
(1a) Linear Programming
(1b) Network Flow
(1c) Compact Representation of All Min Cuts
(1d) others
Further Algorithmic and Computational Hardness Results
Conclusions

9
Outline of Technical Discussions (1)

The Grand Canonical Model

10
Grand Canonical Model (Sun et al, 1995)

Each amino acid is classified as either hydrophobic (H) or polar (P).
Each amino acid sequence is then considered as a binary sequence of H and P. (For mathematical convenience, set H = 1 and P = 0.)
Hydrophobic (H): A, C, F, I, L, M, V, W, Y.
Polar (P): the other amino acids.
Reference: Sun, Brem, Chan, Dill. Designing amino acid sequences to fold with good hydrophobic cores. Protein Engineering, 1995.
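
The H/P classification above can be sketched in a few lines of Python. This is only an illustration: the function names and the sample peptide are hypothetical, and the hydrophobic set is copied from the slide.

```python
# Hydrophobic residues as listed on the slide; every other residue counts as polar.
HYDROPHOBIC = set("ACFILMVWY")

def to_hp(sequence):
    """Map a one-letter amino acid sequence to its H/P string."""
    return "".join("H" if aa in HYDROPHOBIC else "P" for aa in sequence.upper())

def to_binary(hp):
    """Encode an H/P string as bits, with H = 1 and P = 0."""
    return [1 if c == "H" else 0 for c in hp]

example = "MKVLAT"  # hypothetical peptide, used only for illustration
print(to_hp(example))             # HPHHHP
print(to_binary(to_hp(example)))  # [1, 0, 1, 1, 1, 0]
```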

11
Representation of a 3D structure (Sun et al,
1995)

A 3D folding structure S of a sequence of n amino acids is given by the coordinates of each atom in S.
From these coordinates the model uses:
the pairwise distances between the centers of amino acid residues in S,
the solvent-accessible areas of the amino acid residues in S.

12
Goal of Protein Sequence Design (Sun et al, 1995)

Input: a 3D structure S and a sequence length n.
Output: a sequence X of n amino acids that, when folded into S, has the following properties:
The H-residues in X are as close to each other as possible.
The solvent-accessible areas of the H-residues of X are as small as possible.

13
Fitness of a Sequence (Sun et al, 1995)
14
Fitness of a Sequence (Sun et al, 1995)
closeness among H-residues
small surface area
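
The formula on these two slides did not survive in the transcript. A plausible reconstruction is sketched below, assuming the form suggested by the arc labels beta s_i and -alpha g(d_ij) on the later network slides and by the encoding H = 1, P = 0. Here x_i is the H/P bit of residue i, d_ij is the distance between residues i and j, s_i is the solvent-accessible area of residue i, and g is a function that is larger for closer residues. The published model may restrict which pairs (i, j) enter the first sum; this sketch does not try to capture that.

```latex
\Phi(x) \;=\; \alpha \sum_{i<j} g(d_{ij})\, x_i x_j \;+\; \beta \sum_{i=1}^{n} s_i\, x_i,
\qquad \alpha < 0,\ \beta > 0,\ x_i \in \{0,1\}.
```

With alpha negative, the first term rewards closeness among H-residues; with beta positive, the second term penalizes exposed surface area. A fittest sequence is one that minimizes this value.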
15
Outline of Technical Discussions (2)

Two Basic Computational Problems

16
Problem 1

Input:
the parameters alpha and beta,
a protein sequence Y,
Y's 3D structure,
the sequence length n of Y.
Output:
a fittest sequence X for the 3D structure with respect to the given alpha and beta.
Applications of this problem: design the best sequences for novel structures, because we don't really need Y.

17
Problem 2

Input:
the parameters alpha and beta,
a protein sequence Y,
Y's 3D structure,
the sequence length n of Y.
Output:
a fittest sequence X for the 3D structure that is the most similar to Y over all possible alpha and beta.
Applications of this problem: tune the alpha and beta of the Grand Canonical model.

18
Basic Computational Scheme (1)
3D structure -> network -> a min cut -> a fittest sequence (e.g., HPPPHHPHP)
19
Problem 1

Input:
the parameters alpha and beta,
a protein sequence Y,
Y's 3D structure,
the sequence length n of Y.
Output:
a fittest sequence X for the 3D structure with respect to the given alpha and beta.
Applications of this problem: design the best sequences for novel structures, because we don't really need Y.
Computational Complexity: 1 network flow.

20
Problem 2

Input:
the parameters alpha and beta,
a protein sequence Y,
Y's 3D structure,
the sequence length n of Y.
Output:
a fittest sequence X for the 3D structure that is the most similar to Y over all possible alpha and beta.
Applications of this problem: tune the alpha and beta of the Grand Canonical model.
Computational Complexity: O(n) network flows.

21
Outline of Technical Discussions (3)

Experimental Results

22
Empirical Study: Predictive Ability

Computed fittest sequence versus native sequences (% similarity).
Our % similarity versus Kleinberg's.
% similarity versus protein family size.

23
% similarity --- computed versus native

% similarity = the percentage of the H/Ps in the computed fittest sequence that are identical to those in the native sequence.
The average percentage of hydrophobic residues is 42% in the native sequences that were studied.
The best sequence picked without domain knowledge would have a 58% similarity on average.
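
A minimal sketch of the similarity measure follows. It also illustrates one reading of the 58% baseline: a guess of all P agrees with every polar position, so it scores 58% on a sequence with 42% H. The function name and the sample sequence are hypothetical.

```python
def percent_similarity(computed, native):
    """Percentage of positions at which two equal-length H/P strings agree."""
    if len(computed) != len(native):
        raise ValueError("sequences must have the same length")
    matches = sum(a == b for a, b in zip(computed, native))
    return 100.0 * matches / len(native)

# Hypothetical native H/P string with 42% H (21 of 50 positions).
native = "H" * 21 + "P" * 29
all_polar = "P" * len(native)
print(percent_similarity(all_polar, native))  # 58.0
```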

24
% similarity --- computed versus native (1)
25
% similarity --- computed versus native (2)
Our results versus Kleinberg's
26
% similarity --- computed versus native (3)
27
% similarity versus PFAM family size (1)

% similarity = the percentage of the H/Ps in the computed fittest sequence that are identical to those in the native sequence.
PFAM family size of a protein = the number of proteins in the PFAM database that are related to the given protein.
The relatedness is computed via HMM models (pfam.wustl.edu).
PFAM family size is used as a measure of success of a protein in Nature.

28
% similarity versus PFAM family size (2)

% similarity = the percentage of the H/Ps in the computed fittest sequence that are identical to those in the native sequence.
PFAM family size of a protein = the number of proteins in the PFAM database that are related to the given protein.
Intuition/Conjecture:
(3A) the more diverse a protein family is,
(3B) the more its 3D structures vary,
(3C) the smaller the % similarity will be.

29
% similarity versus PFAM family size (3)
30
% similarity versus PFAM family size (4)
31
Outline of Technical Discussions (4)

Combinatorial Tools
(1a) Linear Programming
(1b) Network Flow
(1c) Compact Representation of All Min Cuts
(1d) others

32
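Tool 1: Linear Programming
Goal: find a fittest sequence X of n amino acids.
Quadratic form: find a binary sequence x that minimizes the fitness function [formula not preserved in this transcript]. Slide annotation: clueless!
Linear form: find x and y that minimize a linearized objective [formula not preserved in this transcript].

Linear
Totally unimodular
Integer solution
Useful for proving theorems
Still too inefficient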
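
The two programs on this slide were lost in extraction. The block below is a sketch of one standard way to linearize a quadratic H/P objective, and it may differ from what the slide showed. It assumes a fitness of the form alpha times the sum of g(d_ij) x_i x_j plus beta times the sum of s_i x_i, with alpha < 0, beta > 0 and g nonnegative; the variable y_ij is introduced here to stand for the product x_i x_j.

```latex
\begin{aligned}
\text{Quadratic:}\quad & \min_{x \in \{0,1\}^n} \;
   \alpha \sum_{i<j} g(d_{ij})\, x_i x_j + \beta \sum_{i} s_i x_i \\[4pt]
\text{Linear:}\quad & \min_{x,\,y} \;
   \alpha \sum_{i<j} g(d_{ij})\, y_{ij} + \beta \sum_{i} s_i x_i \\
 & \text{subject to}\quad y_{ij} \le x_i,\quad y_{ij} \le x_j,\quad
   0 \le x_i \le 1,\quad 0 \le y_{ij} \le 1 .
\end{aligned}
```

Because alpha is negative and g is nonnegative, the linear objective pushes each y_ij up to min(x_i, x_j), which equals x_i x_j when x is binary. Each constraint row y_ij - x_i <= 0 has exactly one +1 and one -1, which is the structure behind the "totally unimodular" and "integer solution" remarks.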
33
Tool 2: Network Flow (1)

Analogy: a network of oil pipes
source s (origin of oil)
sink t (destination of oil)
other nodes (midway stations)
arcs (pipes)
arc capacity (pipe capacity)
flow (amount of oil through a pipe)
Goal: deliver the max amount of oil from source to sink.
Computational goal: a max flow.
Computational complexity: O(VE log(V^2/E)).

[Figure: example network from source s to sink t, with each arc labeled by its capacity.]
34
Tool 2: Network Flow (2)

Example of max flow
source (origin of oil)
sink (destination of oil)
other nodes (midway stations)
arcs (pipes)
arc capacity (pipe capacity)
flow (amount of oil through a pipe)
Goal: deliver the max amount of oil from source to sink.
Computational goal: a max flow.
Computational complexity: O(VE log(V^2/E)).

[Figure: the example network with each arc labeled capacity (flow), showing a max flow.]
35
Tool 2: Network Flow (3)

Max flow versus min cut
A min cut is the bottleneck of the network.
A cut is a partition (S,T) of nodes with s in S and t in T.
The total capacity of arcs from S to T in a min cut equals the max flow.

[Figure: the example network with each arc labeled capacity (flow).]
36
Tool 2: Network Flow (4)

Max flow versus min cut
A min cut is the bottleneck of the network.
A cut is a partition (S,T) of nodes with s in S and t in T.
The total capacity of arcs from S to T in a min cut equals the max flow.
Computational complexity: O(VE log(V^2/E)).

[Figure: the example network with each arc labeled capacity (flow).]
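
The max flow and min cut relationship can be checked on a small example. The sketch below uses the Edmonds-Karp algorithm (shortest augmenting paths), which is simpler but slower than the bound quoted on the slide. The network is hypothetical, since the slide's figure is not recoverable, and the function name is an assumption of this example.

```python
from collections import deque

def max_flow_min_cut(capacity, s, t):
    """Edmonds-Karp. capacity maps u -> {v: capacity of arc u->v}.
    Returns (max flow value, source side S of a min cut)."""
    res = {s: {}, t: {}}  # residual capacities, with reverse arcs starting at 0
    for u, nbrs in capacity.items():
        for v, c in nbrs.items():
            res.setdefault(u, {})
            res.setdefault(v, {})
            res[u][v] = res[u].get(v, 0) + c
            res[v].setdefault(u, 0)
    value = 0
    while True:
        # Breadth-first search over arcs with remaining residual capacity.
        parent = {s: None}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v, c in res[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            # No augmenting path left: the reachable nodes form S.
            return value, set(parent)
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(res[u][v] for u, v in path)
        for u, v in path:
            res[u][v] -= push
            res[v][u] += push
        value += push

# Hypothetical network, not the one drawn on the slides.
capacity = {"s": {"a": 4, "b": 2}, "a": {"b": 1, "t": 2}, "b": {"t": 3}}
value, side = max_flow_min_cut(capacity, "s", "t")
cut = sum(c for u in side for v, c in capacity.get(u, {}).items() if v not in side)
print(value, sorted(side), cut)  # 5 ['a', 's'] 5
```

The last line compares the flow value with the total capacity of the arcs leaving S, which is the equality stated on the slide.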
37
Basic Computational Scheme (1)
3D structure -> network -> a min cut -> a fittest sequence (e.g., HPPPHHPHP)
38
Tool 2: 3D -> Network (1)

[Figure: example 3D structure with residues numbered 1 to 9.]

s1 = 3, s2 = 18, s3 = 6, s4 = 9, s5 = 3, s6 = 9, s7 = 6, s8 = 24, s9 = 9
g(d16) = 0.5, g(d25) = 0.75, g(d58) = 0.9, g(d49) = 0.75
alpha = -8, beta = 1/3
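
With only nine residues, the fittest sequences for this example can be found by brute force, which gives a reference point for the network construction. The sketch assumes the fitness is alpha times the sum of g(d_ij) over H-H pairs plus beta times the sum of s_i over H residues, and that pairs not listed on the slide contribute nothing; both are assumptions of this example. Exact fractions are used so that ties are detected reliably.

```python
from fractions import Fraction
from itertools import product

ALPHA = Fraction(-8)
BETA = Fraction(1, 3)
AREA = {1: 3, 2: 18, 3: 6, 4: 9, 5: 3, 6: 9, 7: 6, 8: 24, 9: 9}   # s_i from the slide
CONTACT = {(1, 6): Fraction(1, 2), (2, 5): Fraction(3, 4),
           (5, 8): Fraction(9, 10), (4, 9): Fraction(3, 4)}         # g(d_ij) from the slide

def fitness(hp):
    """Assumed fitness: alpha * sum g(d_ij) over H-H pairs + beta * sum s_i over H."""
    is_h = {i: hp[i - 1] == "H" for i in AREA}
    pair_term = sum(g for (i, j), g in CONTACT.items() if is_h[i] and is_h[j])
    area_term = sum(s for i, s in AREA.items() if is_h[i])
    return ALPHA * pair_term + BETA * area_term

# The arc labels beta*s_i and -alpha*g(d_ij) that appear in the network figure.
print([str(BETA * s) for s in AREA.values()])       # 1, 6, 2, 3, 1, 3, 2, 8, 3
print([str(-ALPHA * g) for g in CONTACT.values()])  # 4, 6, 36/5, 6

sequences = ["".join(bits) for bits in product("HP", repeat=len(AREA))]
best = min(fitness(x) for x in sequences)
fittest = sorted(x for x in sequences if fitness(x) == best)
print(best, fittest)
```

Under these assumptions the minimum is 0 and four sequences tie for it, which shows why a representation of all min cuts is useful later in the talk.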
39
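Tool 2: 3D -> Network (2)

[Figure: the network built from the example structure. It has one node for each residue 1 to 9 and one node for each contacting pair (1,6), (2,5), (5,8), (4,9). Arcs at residue nodes are labeled beta s_i; arcs at pair nodes are labeled -alpha g(d_ij), e.g., 7.2 for the pair (5,8).]

s1 = 3, s2 = 18, s3 = 6, s4 = 9, s5 = 3, s6 = 9, s7 = 6, s8 = 24, s9 = 9
g(d16) = 0.5, g(d25) = 0.75, g(d58) = 0.9, g(d49) = 0.75
alpha = -8, beta = 1/3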
40
Tool 2: 3D -> Network (3)

[Figure: the same network, with residue nodes 1 to 9, pair nodes (1,6), (2,5), (5,8), (4,9), and arc labels beta s_i and -alpha g(d_ij).]

s1 = 3, s2 = 18, s3 = 6, s4 = 9, s5 = 3, s6 = 9, s7 = 6, s8 = 24, s9 = 9
g(d16) = 0.5, g(d25) = 0.75, g(d58) = 0.9, g(d49) = 0.75
alpha = -8, beta = 1/3
41
Tool 2: 3D -> Network (4)

[Figure: the same network, with residue nodes 1 to 9, pair nodes (1,6), (2,5), (5,8), (4,9), and arc labels beta s_i and -alpha g(d_ij).]

Theorem (Kleinberg, 1999): The amino acids that are with the source in a min cut are the H's.
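
The theorem can be tried on the nine-residue example with a small max-flow computation. The arc layout below (source to pair node with capacity -alpha g(d_ij), pair node to its two residues with effectively infinite capacity, residue to sink with capacity beta s_i) is an assumed reading of the figure labels, and Edmonds-Karp stands in for whatever flow algorithm the authors used.

```python
from collections import deque
from fractions import Fraction

ALPHA, BETA = Fraction(-8), Fraction(1, 3)
AREA = {1: 3, 2: 18, 3: 6, 4: 9, 5: 3, 6: 9, 7: 6, 8: 24, 9: 9}   # s_i
CONTACT = {(1, 6): Fraction(1, 2), (2, 5): Fraction(3, 4),
           (5, 8): Fraction(9, 10), (4, 9): Fraction(3, 4)}         # g(d_ij)

def build_network():
    """Assumed layout: s -> pair node -> its two residues -> t."""
    pair_total = sum(-ALPHA * g for g in CONTACT.values())
    big = pair_total + 1  # stands in for an infinite capacity
    cap = {"s": {}, "t": {}}
    for i, area in AREA.items():
        cap[i] = {"t": BETA * area}
    for (i, j), g in CONTACT.items():
        cap["s"][(i, j)] = -ALPHA * g
        cap[(i, j)] = {i: big, j: big}
    return cap, pair_total

def min_cut_source_side(cap, s="s", t="t"):
    """Edmonds-Karp; returns (max flow, nodes reachable from s at the end)."""
    res = {u: dict(nbrs) for u, nbrs in cap.items()}
    for u, nbrs in cap.items():
        for v in nbrs:
            res[v].setdefault(u, 0)  # reverse arcs
    flow = 0
    while True:
        parent = {s: None}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v, c in res[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return flow, set(parent)
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(res[u][v] for u, v in path)
        for u, v in path:
            res[u][v] -= push
            res[v][u] += push
        flow += push

cap, pair_total = build_network()
flow, side = min_cut_source_side(cap)
hp = "".join("H" if i in side else "P" for i in sorted(AREA))
min_fitness = flow - pair_total  # cut value = fitness + sum of pair capacities
print(hp, min_fitness)
```

In this layout a cut pays beta s_i for every residue kept with the source and -alpha g(d_ij) for every pair that is not, so the cut value is the fitness plus a constant. The set reachable from the source is the smallest source side among all min cuts, so for this toy data the sketch reports the all-P sequence with fitness 0 even though other sequences may tie.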
42
Basic Computational Scheme (1)
3D structure -> network -> a min cut -> a fittest sequence (e.g., HPPPHHPHP)
43
Problem 1

Input:
the parameters alpha and beta,
a protein sequence Y,
Y's 3D structure,
the sequence length n of Y.
Output:
a fittest sequence X for the 3D structure with respect to the given alpha and beta.
Applications of this problem: design the best sequences for novel structures, because we don't really need Y.

44
Tool 3: Linear-Size Representation of All Min Cuts (1)

Step 1: Compute a max flow of G.
Step 2: Compute the residual network G'.
Step 3: Contract every strongly connected component of G' into a super node. Call the new graph G''.

Def: A node subset U of G'' is a closed set if for every node x in U, every descendant of x is also in U.

Theorem (Picard and Queyranne, 1980): Every closed set not including the sink forms a min cut, and vice versa.

[Figure: example network G with nodes s, v1 to v7, and t; each arc is labeled capacity (flow).]
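
The closed-set definition can be made concrete on a tiny contracted graph. The DAG below is hypothetical and is not derived from the slide's figure. The sketch enumerates subsets by brute force, which is exponential and only meant to illustrate the definition, and it additionally requires the source to be in each set so that every set it returns is the source side of a cut.

```python
from itertools import combinations

def descendants(dag, node):
    """All nodes reachable from node by following arcs of the DAG."""
    seen, stack = set(), [node]
    while stack:
        for child in dag.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

def is_closed(dag, subset):
    """True if every descendant of every member is also a member."""
    return all(descendants(dag, x) <= subset for x in subset)

def closed_sets(dag, nodes, source, sink):
    """Closed sets that contain the source and exclude the sink."""
    others = [v for v in nodes if v not in (source, sink)]
    found = []
    for r in range(len(others) + 1):
        for extra in combinations(others, r):
            candidate = {source, *extra}
            if is_closed(dag, candidate):
                found.append(candidate)
    return found

# Hypothetical contracted graph: an arc u -> v means "if u is chosen, v must be too".
dag = {"t": ["a", "b"], "b": ["a"], "a": ["s"]}
result = closed_sets(dag, ["s", "a", "b", "t"], "s", "t")
print([sorted(u) for u in result])  # [['s'], ['a', 's'], ['a', 'b', 's']]
```

The set {s, b} is rejected because b has the descendant a outside the set. In the talk's setting each accepted set would correspond to one min cut and hence one fittest sequence.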
45
Tool 3: Linear-Size Representation of All Min Cuts (2)

Residual Network

[Figure: the residual network of the example, on nodes s, v1 to v7, and t, with each arc labeled by its residual capacity.]
46
Tool 3: Linear-Size Representation of All Min Cuts (3)

Picard-Queyranne Representation

[Figure: the contracted graph for the example, on nodes s, v1 to v7, and t.]
47
Tool 3: Linear-Size Representation of All Min Cuts (4)

Picard-Queyranne Representation

[Figure: the contracted graph for the example, on nodes s, v1 to v7, and t.]

Applications:
Obtain all fittest sequences.
Study the landscape of the fittest sequences.
Compute fittest sequences with additional optimization objectives.
48
Basic Computational Scheme (2)
3D structure -> network -> a max flow/min cut -> Picard-Queyranne representation -> the space of all fittest sequences (e.g., HPPPHHPHP)
49
Outline of Technical Discussions (5)

Further Algorithmic and Computational Hardness Results

50
Problem 3

Input: a 3D structure.
Output: all its fittest protein sequences.
Computational Complexity:
(A) A linear-size representation can be computed with 1 network flow.
(B) Each individual fittest protein sequence can be generated from this representation in O(n) time.

51
Problem 4

Input: f 3D structures.
Output: the set of all protein sequences that are the fittest simultaneously for all these 3D structures.
Computational Complexity: f network flows.

52
Problem 5

Input: a protein sequence Y and its native 3D structure.
Output: the set of all fittest protein sequences that are also the most (or least) similar to Y in terms of unweighted (or weighted) Hamming distances.
Computational Complexity: 1 network flow.
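
The objective in this problem can be illustrated with a direct Hamming-distance computation. This is not the single-network-flow algorithm from the talk; it simply ranks an explicit list of candidates. The candidate list, the choice of Y, and the function name are hypothetical.

```python
def hamming(x, y, weights=None):
    """Unweighted Hamming distance, or weighted if per-position weights are given."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    if weights is None:
        weights = [1] * len(x)
    return sum(w for a, b, w in zip(x, y, weights) if a != b)

Y = "HPPPHHPHP"  # hypothetical native H/P sequence
candidates = ["HPPHPHPPH", "HPPPPHPPP", "PPPHPPPPH", "PPPPPPPPP"]  # hypothetical fittest set

distances = {c: hamming(c, Y) for c in candidates}
most_similar = min(candidates, key=distances.get)
least_similar = max(candidates, key=distances.get)
print(distances)
print(most_similar, least_similar)  # HPPPPHPPP PPPHPPPPH
```

A weighted variant only needs a weight per position, for example `hamming(x, Y, weights=[2, 1, 1, 1, 1, 1, 1, 1, 1])` to count a mismatch at the first residue twice.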

53
Problem 6

Input: a 3D structure.
Output: the number of protein sequences in the solution to each of Problems 3, 4, and 5.
Computational Complexity: #P-complete.

54
Problem 7

Input: a 3D structure and a bound e.
Output: an enumeration of the protein sequences whose fitness function values are within an additive factor e of that of the fittest protein sequences.
Computational Complexity: polynomial time to generate each desired protein sequence.

55
Problem 8

Input: a 3D structure.
Output: the largest possible unweighted (or weighted) Hamming distance between any two fittest protein sequences.
Computational Complexity: 1 network flow.

56
Problem 9

Input: a protein sequence Y and its native 3D structure.
Output: the average unweighted (or weighted) Hamming distance between Y and the fittest protein sequences for the 3D structure.
Computational Complexity: #P-complete.

57
Problem 10

Input: a protein sequence Y, its native 3D structure, and two unweighted Hamming distances d1 and d2.
Output: a fittest protein sequence whose distance from Y is also between d1 and d2.
Computational Complexity: NP-hard.

58
Problem 11

Input: a protein sequence Y, its native 3D structure, and an unweighted Hamming distance d.
Output: the fittest among the protein sequences that are at distance d from Y.
Computational Complexity: NP-hard. We have a polynomial-time approximation algorithm.

59
Problem 12

Input: a protein sequence Y and its native 3D structure.
Output: all the ratios between the scaling factors alpha and beta in the GC model such that the smallest possible unweighted (or weighted) Hamming distance between Y and any fittest protein sequence is minimized over all possible alpha and beta.
Computational Complexity: O(n) network flows.

60
Problem 13

Input: a 3D structure.
Output: whether the fittest protein sequences are connected, i.e., whether they can mutate into each other through allowable mutations, such as point mutations, while the intermediate protein sequences all remain the fittest.
Computational Complexity: 1 network flow.
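
The notion of connectivity can be illustrated on an explicit set of sequences. The sketch below is not the single-network-flow method from the talk: it assumes the fittest sequences are already listed and that the only allowable mutation is a single-position H/P change, then runs a breadth-first search. The sample sets are hypothetical.

```python
from collections import deque

def is_point_mutation(x, y):
    """True if two equal-length sequences differ in exactly one position."""
    return len(x) == len(y) and sum(a != b for a, b in zip(x, y)) == 1

def is_connected(sequences):
    """True if every sequence can reach every other through point mutations,
    using only sequences from the given set as intermediates."""
    seqs = list(dict.fromkeys(sequences))  # drop duplicates, keep order
    if not seqs:
        return True
    seen = {seqs[0]}
    queue = deque([seqs[0]])
    while queue:
        current = queue.popleft()
        for other in seqs:
            if other not in seen and is_point_mutation(current, other):
                seen.add(other)
                queue.append(other)
    return len(seen) == len(seqs)

print(is_connected(["PPP", "HPP", "HHP"]))  # True: PPP -> HPP -> HHP
print(is_connected(["PPPP", "HHPP"]))       # False: the two differ in two positions
```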

61
Problem 14

Input: a 3D structure and two fittest protein sequences.
Output: whether the two sequences are connected.
Computational Complexity: 1 network flow.

62
Problem 15

Input: a 3D structure.
Output: the smallest set of allowable mutations with respect to which the fittest protein sequences (or two given fittest protein sequences) for the structure are connected.
Computational Complexity: 1 network flow.

63
Outline of Technical Discussions (6)

Conclusions

64
Further Research for Protein Sequence Design

More sophisticated models (biology).
Algorithms and complexity for such models
(computer science).
Wet lab validation (biology).

65
Further Algorithmic Research for Bioinformatics

Current State of Bioinformatics
Biology: mostly very simple heuristics.
Algorithms: mostly very simple techniques.
Conjectures
Biology: Nature is not so simple. Most of the biological information is very complicated.
Algorithms: Very sophisticated, novel, and fundamental techniques will be needed to unlock Nature's secrets.
