1
A Combinatorial Toolbox for Protein Sequence
Design and Landscape Analysis in the Grand
Canonical Model
Ming-Yang Kao
Department of Computer Science
Northwestern University
Evanston, Illinois, U.S.A.
2
Acknowledgments
  • This talk is based on joint work with colleagues and
    students at Yale University
  • Computer Science: Jim Aspnes, Gauri Shah
  • Biology: Julia Hartling, Junhyong Kim

3
Dual Purposes of This Talk
  • Discuss protein folding problems.
  • Emphasize the point that as bioinformatics grows,
    advanced algorithmic techniques will become
    useful and crucial.

4
Importance of Protein Folding
The 3D structure significantly determines the
function.
5
Two Complementary Problems for Protein Folding
  • Protein Folding Prediction:
    Given a protein sequence,
    determine the 3D folding of the sequence.
  • Protein Sequence Design:
    Given a 3D structure, determine the fittest
    protein sequence for the structure, i.e., one
    that has the smallest energy among all possible
    sequences when folded into the structure.

6
Complexity for Protein Folding Problems
  • Protein Folding Prediction:
    Given a protein sequence,
    determine the 3D folding of the sequence.
    NP-hard under various models.
  • Protein Sequence Design:
    Given a 3D structure, determine the fittest
    protein sequence for the structure, i.e., one
    that has the smallest energy among all possible
    sequences when folded into the structure.
    Solvable in polynomial time under the Grand
    Canonical model.

7
History of Protein Sequence Design
  • Protein Sequence Design:
    Given a 3D structure, determine the fittest
    protein sequence for the structure, i.e., one
    that has the smallest energy among all possible
    sequences when folded into the structure.
  • Sun et al., 1995: Heuristic search without
    optimality guarantee.
  • Hart, 1997: Open question on the computational
    tractability.
  • Kleinberg, 1999: Polynomial-time algorithms.
  • Aspnes, Hartling, Kao, Kim, Shah, 2001: Improved
    algorithms and generalized problems (this talk).

8
Outline of Technical Discussions
  • The Grand Canonical Model
  • Two Basic Computational Problems
  • Experimental Results
  • Combinatorial Tools
  • (1a) Linear Programming
  • (1b) Network Flow
  • (1c) Compact Representation of All Min Cuts
  • (1d) Others
  • Further Algorithmic and Computational Hardness
    Results
  • Conclusions

10
Grand Canonical Model (Sun et al., 1995)
  • Each amino acid is classified as either Hydrophobic (H)
    or Polar (P).
  • Each amino acid sequence is then treated as a
    binary sequence of H and P. (For mathematical
    convenience, set H = 1 and P = 0.)
  • Hydrophobic (H): A, C, F, I, L, M, V, W, Y.
  • Polar (P): the other amino acids.
  • Sun, Brem, Chan, Dill. Designing amino acid
    sequences to fold with good hydrophobic cores.
    Protein Engineering, 1995.
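
The H/P classification above can be sketched in a few lines of Python. The function names `to_hp` and `to_binary` are illustrative assumptions; only the list of hydrophobic residues comes from the slide.

```python
# Hydrophobic residues as listed on the slide (one-letter codes).
HYDROPHOBIC = set("ACFILMVWY")

def to_hp(sequence):
    """Map a one-letter amino acid sequence to its H/P string."""
    return "".join("H" if aa in HYDROPHOBIC else "P" for aa in sequence.upper())

def to_binary(sequence):
    """Same classification with H = 1 and P = 0."""
    return [1 if aa in HYDROPHOBIC else 0 for aa in sequence.upper()]

print(to_hp("MKVLA"))
print(to_binary("GAS"))
```

Every letter outside the hydrophobic list is treated as polar, which matches "the other amino acids" on the slide but also means that unknown symbols are silently mapped to P.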

11
Representation of a 3D structure (Sun et al.,
1995)
  • A 3D folding structure S of a sequence of n amino
    acids:
  • the coordinates of each atom in S.
  • the pairwise distances between the centers of
    amino acid residues in S.
  • the solvent-accessible areas of the amino acid
    residues in S.

12
Goal of Protein Sequence Design (Sun et al., 1995)
  • Input: a 3D structure S and a sequence length n.
  • Output: a sequence X of n amino acids that, when
    folded into S, has the following properties:
  • The H-residues in X are as close to each other as
    possible.
  • The solvent-accessible areas of the H-residues of
    X are as small as possible.

13
Fitness of a Sequence (Sun et al., 1995)
14
Fitness of a Sequence (Sun et al., 1995)
[Formula annotations: closeness among H-residues; small surface area]
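
The fitness formula itself is not in the transcript. The sketch below assumes the form alpha * (sum of g(dij) over pairs of H-residues) + beta * (sum of si over H-residues), with alpha < 0 and beta > 0, so that lower values are fitter. This form is an assumption chosen to agree with the arc labels "beta si" and "-alpha g(dij)" on the later network slides. The function name `fitness` is hypothetical, and the numbers are the example values from the "3D to Network" slides.

```python
from fractions import Fraction

def fitness(hp, s, g, alpha, beta):
    """Assumed fitness of an H/P string; residues are numbered from 1.

    s: list of solvent-accessible areas, s[i - 1] for residue i.
    g: {(i, j): g(d_ij)} for residue pairs that are close in the structure.
    """
    def is_h(i):
        return hp[i - 1] == "H"
    # First term: rewards H-residues that are close to each other (alpha < 0).
    contact = sum(w for (i, j), w in g.items() if is_h(i) and is_h(j))
    # Second term: penalizes exposed surface area of H-residues (beta > 0).
    exposure = sum(area for i, area in enumerate(s, start=1) if is_h(i))
    return alpha * contact + beta * exposure

# Example values taken from the later slides; exact fractions avoid rounding.
S = [3, 18, 6, 9, 3, 9, 6, 24, 9]
G = {(1, 6): Fraction(1, 2), (2, 5): Fraction(3, 4),
     (5, 8): Fraction(9, 10), (4, 9): Fraction(3, 4)}
ALPHA, BETA = -8, Fraction(1, 3)

print(fitness("HPPPPHPPP", S, G, ALPHA, BETA))
print(fitness("PPPPHPPHP", S, G, ALPHA, BETA))
```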
16
Problem 1
  • Input:
  • the parameters alpha and beta,
  • a protein sequence Y,
  • Y's 3D structure,
  • the sequence length n of Y.
  • Output:
  • a fittest sequence X for the 3D structure with
    respect to the given alpha and beta.
  • Application of this problem: design the best
    sequences for novel structures, because we don't
    really need Y.

17
Problem 2
  • Input:
  • the parameters alpha and beta,
  • a protein sequence Y,
  • Y's 3D structure,
  • the sequence length n of Y.
  • Output:
  • a fittest sequence X for the 3D structure that is
    the most similar to Y over all possible alpha and
    beta.
  • Application of this problem: tune the alpha and
    beta of the Grand Canonical model.

18
Basic Computational Scheme (1)
3D structure → network → a min cut → a fittest sequence (e.g., HPPPHHPHP)
19
Problem 1
  • Input:
  • the parameters alpha and beta,
  • a protein sequence Y,
  • Y's 3D structure,
  • the sequence length n of Y.
  • Output:
  • a fittest sequence X for the 3D structure with
    respect to the given alpha and beta.
  • Application of this problem: design the best
    sequences for novel structures, because we don't
    really need Y.
  • Computational Complexity: 1 network flow.

20
Problem 2
  • Input:
  • the parameters alpha and beta,
  • a protein sequence Y,
  • Y's 3D structure,
  • the sequence length n of Y.
  • Output:
  • a fittest sequence X for the 3D structure that is
    the most similar to Y over all possible alpha and
    beta.
  • Application of this problem: tune the alpha and
    beta of the Grand Canonical model.
  • Computational Complexity: O(n) network flows.

22
Empirical Study: Predictive Ability
  • Computed Fittest Sequence versus Native Sequences
    (% similarity)
  • Our % Similarity versus Kleinberg's
  • % Similarity versus Protein Family Size.

23
% similarity --- computed versus native
  • % similarity: the percentage of the H/Ps in
    the computed fittest sequence that are identical
    to those in the native sequence.
  • The average percentage of hydrophobic
    residues is 42% in the native sequences that were
    studied.
  • The best sequence picked without domain
    knowledge (all P) would therefore have a 58%
    similarity on average.
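
A minimal sketch of the similarity measure defined above, together with the no-knowledge baseline. The function name `percent_similarity` and the toy 100-residue "native" string are illustrative assumptions, not data from the study.

```python
def percent_similarity(computed, native):
    """Percentage of positions where two H/P strings agree."""
    if len(computed) != len(native) or not native:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(computed, native))
    return 100.0 * matches / len(native)

# Toy native sequence with 42% hydrophobic residues, as in the reported average.
native = "H" * 42 + "P" * 58

# Guessing P everywhere matches exactly the polar positions.
print(percent_similarity("P" * 100, native))
print(percent_similarity("HPPH", "HPHH"))
```

The all-P guess scores 100 minus the hydrophobic percentage, which is where the baseline figure on the slide comes from.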

24
% similarity --- computed versus native (1)
25
% similarity --- computed versus native (2)
Our results versus Kleinberg's
26
% similarity --- computed versus native (3)
27
% similarity versus PFAM family size (1)
  • % similarity: the percentage of the H/Ps in
    the computed fittest sequence that are identical
    to those in the native sequence.
  • PFAM family size of a protein: the number of proteins in
    the PFAM database that are related to the given
    protein.
  • The relatedness is computed via hidden Markov models (HMMs).
  • pfam.wustl.edu
  • A measure of the success of a protein in Nature.

28
% similarity versus PFAM family size (2)
  • % similarity: the percentage of the H/Ps in
    the computed fittest sequence that are identical
    to those in the native sequence.
  • PFAM family size of a protein: the number of proteins in
    the PFAM database that are related to the given
    protein.
  • Intuition/Conjecture:
    (3A) the more diverse a protein family is,
    (3B) the more its 3D structures vary, and
    (3C) the smaller the % similarity will be.

29
% similarity versus PFAM family size (3)
30
% similarity versus PFAM family size (4)
32
Tool 1: Linear Programming
Goal: find a fittest sequence X of n amino
acids.
  • First formulation: find a binary sequence x that minimizes
    the fitness function. The objective is quadratic (clueless!).
  • Second formulation: find x and y that solve a program
    with these properties:
  • Linear
  • Totally unimodular
  • Integer solution
  • Useful for proving theorems
  • Still too inefficient
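
The two formulations are not preserved in the transcript. One standard way to write them is sketched below. It assumes the fitness function alpha * sum g(dij) xi xj + beta * sum si xi with alpha < 0 and beta > 0, which matches the arc labels on the later network slides, and it introduces one auxiliary variable y_ij per residue pair. Treat it as a reconstruction under those assumptions, not as the slide's exact notation.

```latex
% Quadratic 0/1 formulation (x_i = 1 means residue i is H)
\min_{x \in \{0,1\}^n} \;
  \alpha \sum_{i<j} g(d_{ij})\, x_i x_j \;+\; \beta \sum_{i} s_i\, x_i ,
  \qquad \alpha < 0,\ \beta > 0

% Linearized formulation with one auxiliary variable y_{ij} per pair
\min_{x,\,y} \;
  \alpha \sum_{i<j} g(d_{ij})\, y_{ij} \;+\; \beta \sum_{i} s_i\, x_i
\quad \text{s.t.} \quad
  y_{ij} \le x_i,\;\; y_{ij} \le x_j,\;\;
  0 \le x_i \le 1,\;\; 0 \le y_{ij} \le 1
```

Because alpha is negative and g is nonnegative, an optimal solution pushes each y_ij up to min(x_i, x_j), which equals x_i x_j when x is binary. Each constraint row has a single +1 and a single -1, so the constraint matrix is totally unimodular and the linear program has an integral optimal solution.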
33
Tool 2: Network Flow (1)
  • Analogy: a network of oil pipes
  • source s (origin of oil)
  • sink t (destination of oil)
  • other nodes (midway stations)
  • arcs (pipes)
  • arc capacity (pipe capacity)
  • flow (amount of oil through a pipe)
  • Goal: deliver the max amount of oil from source
    to sink
  • Computational goal: a max flow
  • Computational complexity:
    O(VE log(V^2/E))
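
A max-flow computation can be sketched as follows. This is a minimal Edmonds-Karp implementation (breadth-first augmenting paths), chosen for brevity. It runs in O(VE^2) time and is not the algorithm behind the bound quoted on the slide. The function name `max_flow` and the five-arc example are illustrative assumptions, not the network drawn on the slide.

```python
from collections import deque

def max_flow(capacity, source, sink):
    """Return the max-flow value for arcs given as {(u, v): capacity}."""
    residual, neighbors = {}, {}
    for (u, v), c in capacity.items():
        residual[(u, v)] = residual.get((u, v), 0) + c
        residual.setdefault((v, u), 0)          # reverse arc for undoing flow
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    total = 0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v in neighbors.get(u, ()):
                if v not in parent and residual[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return total                        # no augmenting path is left
        # Walk back from the sink to collect the path and its bottleneck.
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(residual[arc] for arc in path)
        for u, v in path:
            residual[(u, v)] -= push
            residual[(v, u)] += push
        total += push

# Hypothetical network of pipes with their capacities.
pipes = {("s", "a"): 3, ("s", "b"): 2, ("a", "b"): 1,
         ("a", "t"): 2, ("b", "t"): 3}
print(max_flow(pipes, "s", "t"))
```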

[Figure: example network with source s, sink t, and arc capacities 14, 1, 8, 4, 9, 14, 5, 4, 20, 5, 5, 10]
34
Tool 2: Network Flow (2)
  • Example of a max flow
  • source (origin of oil)
  • sink (destination of oil)
  • other nodes (midway stations)
  • arcs (pipes)
  • arc capacity (pipe capacity)
  • flow (amount of oil through a pipe)
  • Goal: deliver the max amount of oil from source
    to sink
  • Computational goal: a max flow
  • Computational complexity:
    O(VE log(V^2/E))

[Figure: the example network with arcs labeled capacity (flow): 14 (1), 1 (1), 8 (5), 14 (14), 4 (4), 9 (9), 5 (4), 4 (4), 20, 5 (5), 5, 10 (5)]
35
Tool 2: Network Flow (3)
  • Max flow versus min cut
  • Min cut: the bottleneck
  • A partition (S,T) of nodes with s in S and t in
    T.
  • Total capacity of arcs from S to T
    = max flow.

[Figure: the example network with arcs labeled capacity (flow): 14 (1), 1 (1), 8 (5), 14 (14), 4 (4), 9 (9), 5 (4), 4 (4), 20, 5 (5), 5, 10 (5)]
36
Tool 2: Network Flow (4)
  • Max flow versus min cut
  • Min cut: the bottleneck
  • A partition (S,T) of nodes with s in S and t in
    T.
  • Total capacity of arcs from S to T
    = max flow.
  • Computational complexity:
    O(VE log(V^2/E))

[Figure: the example network with arcs labeled capacity (flow): 14 (1), 1 (1), 8 (5), 14 (14), 4 (4), 9 (9), 5 (4), 4 (4), 20, 5 (5), 5, 10 (5)]
38
Tool 2: 3D → Network (1)
[Figure: a 3D structure with residues numbered 1 to 9]
s1 = 3, s2 = 18, s3 = 6, s4 = 9, s5 = 3, s6 = 9, s7 = 6, s8 = 24, s9 = 9
g(d16) = 0.5, g(d25) = 0.75, g(d58) = 0.9, g(d49) = 0.75
alpha = -8, beta = 1/3
39
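Tool 2: 3D → Network (2)
[Figure: the structure with residues 1 to 9 and the network built from it. The network has one node per residue, one node per close pair (1,6), (2,5), (5,8), (4,9), arcs of capacity beta si (1, 6, 2, 3, 1, 3, 2, 8, 3 for residues 1 to 9), and arcs of capacity -alpha g(dij) (4, 6, 7.2, 6 for the four pairs).]
s1 = 3, s2 = 18, s3 = 6, s4 = 9, s5 = 3, s6 = 9, s7 = 6, s8 = 24, s9 = 9
g(d16) = 0.5, g(d25) = 0.75, g(d58) = 0.9, g(d49) = 0.75
alpha = -8, beta = 1/3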
41
Tool 2: 3D → Network (4)
[Figure: the same network, with arcs of capacity beta si (1, 6, 2, 3, 1, 3, 2, 8, 3) and -alpha g(dij) (4, 6, 7.2, 6)]
Theorem (Kleinberg, 1999): The amino acids that
are with the source in a min cut are Hs.
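
The construction and the theorem can be tried out with a short sketch. It assumes the following arc directions, which fit the theorem but cannot be read off the transcript: source to each pair node with capacity -alpha g(dij), each pair node to its two residues with an effectively infinite capacity, and each residue to the sink with capacity beta si. All function names are hypothetical, and the max flow is a plain Edmonds-Karp rather than a faster algorithm.

```python
from collections import deque
from fractions import Fraction

def reachable(res, adj, start):
    """BFS tree over arcs with positive residual capacity."""
    parent, queue = {start: None}, deque([start])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in parent and res[(u, v)] > 0:
                parent[v] = u
                queue.append(v)
    return parent

def max_flow_residual(cap, src, snk):
    """Edmonds-Karp; returns the residual capacities after a max flow."""
    res, adj = {}, {}
    for (u, v), c in cap.items():
        res[(u, v)] = res.get((u, v), 0) + c
        res.setdefault((v, u), 0)
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    while True:
        parent = reachable(res, adj, src)
        if snk not in parent:
            return res, adj
        path, v = [], snk
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(res[arc] for arc in path)
        for u, v in path:
            res[(u, v)] -= push
            res[(v, u)] += push

def fittest_sequence(s, g, alpha, beta):
    """s: areas (residue i is s[i - 1]); g: {(i, j): g(d_ij)}, 1-based."""
    cap = {}
    # Larger than any finite cut, so these arcs are never cut.
    big = sum(-alpha * w for w in g.values()) + 1
    for (i, j), w in g.items():
        pair = ("pair", i, j)
        cap[("src", pair)] = -alpha * w
        cap[(pair, i)] = big
        cap[(pair, j)] = big
    for i, area in enumerate(s, start=1):
        cap[(i, "snk")] = beta * area
    res, adj = max_flow_residual(cap, "src", "snk")
    source_side = reachable(res, adj, "src")   # source side of a min cut
    return "".join("H" if i in source_side else "P"
                   for i in range(1, len(s) + 1))

# Example values from the slides, as exact fractions.
S_AREAS = [3, 18, 6, 9, 3, 9, 6, 24, 9]
G_CONTACT = {(1, 6): Fraction(1, 2), (2, 5): Fraction(3, 4),
             (5, 8): Fraction(9, 10), (4, 9): Fraction(3, 4)}
print(fittest_sequence(S_AREAS, G_CONTACT, -8, Fraction(1, 3)))
print(fittest_sequence(S_AREAS, G_CONTACT, -16, Fraction(1, 3)))
```

Under these assumptions the slide's values give exact ties: for pairs (1,6) and (4,9), the contact gain equals the surface penalty. Several sequences are therefore equally fit, and the sketch returns only the one with the fewest H-residues.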
44
Tool 3: Linear-Size Representation of All Min
Cuts (1)
[Figure: the max-flow example on nodes s, v1, ..., v7, t, with arcs labeled capacity (flow): 14 (1), 1 (1), 8 (5), 14 (14), 4 (4), 9 (9), 5 (4), 4 (4), 20, 5 (5), 5, 10 (5)]
  • Step 1: Compute a max flow of G.
  • Step 2: Compute the residual network G'.
  • Step 3: Contract every strongly connected component
    into a super node. Call the new graph G''.
Def: A node subset U of G'' is a closed set if for
every node x in U, every descendant of x is also
in U.
Theorem (Picard and Queyranne, 1980): Every
closed set not including the sink forms a min
cut, and vice versa.
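
The closed-set idea behind the theorem can be illustrated by brute force on a tiny network. The sketch uses the following reading: a node set containing the source but not the sink is a min cut exactly when no arc with positive residual capacity leaves it. It skips the contraction of strongly connected components, which only serves to make the representation compact, and it enumerates subsets exhaustively, so it is suitable for toy inputs only. The function names and the example network are hypothetical.

```python
from collections import deque
from itertools import combinations

def residual_after_max_flow(cap, src, snk):
    """Edmonds-Karp; returns residual capacities {(u, v): r} after a max flow."""
    res, adj = {}, {}
    for (u, v), c in cap.items():
        res[(u, v)] = res.get((u, v), 0) + c
        res.setdefault((v, u), 0)
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    while True:
        parent, queue = {src: None}, deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in parent and res[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        if snk not in parent:
            return res
        path, v = [], snk
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(res[arc] for arc in path)
        for u, v in path:
            res[(u, v)] -= push
            res[(v, u)] += push

def all_min_cuts(cap, src, snk):
    """List the source sides of all min cuts (exhaustive, toy sizes only)."""
    res = residual_after_max_flow(cap, src, snk)
    inner = sorted({node for arc in cap for node in arc} - {src, snk})
    cuts = []
    for k in range(len(inner) + 1):
        for extra in combinations(inner, k):
            side = {src, *extra}
            # Closed: every arc with residual capacity stays inside the set.
            if all(r == 0 or u not in side or v in side
                   for (u, v), r in res.items()):
                cuts.append(side)
    return cuts

# Hypothetical network: a saturated chain s-a-b-t plus a side path through c.
toy = {("s", "a"): 2, ("a", "b"): 2, ("b", "t"): 2,
       ("s", "c"): 1, ("c", "t"): 5}
for side in all_min_cuts(toy, "s", "t"):
    print(sorted(side))
```

Each returned set is the source side S of one min cut. In the sequence-design setting, the residues in S would be the H-residues of one fittest sequence.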
45
Tool 3: Linear-Size Representation of All Min
Cuts (2)
[Figure: Residual Network of the example, on nodes s, v1, ..., v7, t]
46
Tool 3: Linear-Size Representation of All Min
Cuts (3)
[Figure: Picard-Queyranne Representation of the example, on nodes s, v1, ..., v7, t]
47
Tool 3: Linear-Size Representation of All Min
Cuts (4)
[Figure: Picard-Queyranne Representation of the example, on nodes s, v1, ..., v7, t]
  • Applications:
  • Obtain all fittest sequences.
  • Study the landscape of the fittest sequences.
  • Compute fittest sequences with additional
    optimization objectives.
48
Basic Computational Scheme (2)
3D structure → network → a max flow/min cut → Picard-Queyranne Representation → the space of all fittest sequences (e.g., HPPPHHPHP)
50
Problem 3
  • Input: a 3D structure.
  • Output: all its fittest protein sequences.
  • Computational Complexity:
    (A) A linear-size
    representation can be computed with 1 network
    flow.
    (B) Each individual fittest protein
    sequence can be generated from this
    representation in O(n) time.

51
Problem 4
  • Input: f 3D structures.
  • Output: the set of all protein sequences that are
    the fittest simultaneously for all these 3D
    structures.
  • Computational Complexity: f network flows.
52
Problem 5
  • Input: a protein sequence Y and its native 3D
    structure.
  • Output: the set of all fittest protein sequences
    that are also the most (or least) similar to Y in
    terms of unweighted (or weighted) Hamming
    distances.
  • Computational Complexity: 1 network flow.

53
Problem 6
  • Input: a 3D structure.
  • Output: the number of protein sequences in
    the solution to each of Problems 3, 4, and 5.
  • Computational Complexity: #P-complete.

54
Problem 7
  • Input: a 3D structure and a bound e.
  • Output: enumerate the protein sequences whose
    fitness function values are within an additive
    factor e of that of the fittest protein
    sequences.
  • Computational Complexity: polynomial time to
    generate each desired protein sequence.

55
Problem 8
  • Input: a 3D structure.
  • Output: the largest possible unweighted (or
    weighted) Hamming distance between any two
    fittest protein sequences.
  • Computational Complexity: 1 network flow.

56
Problem 9
  • Input: a protein sequence Y and its native 3D
    structure.
  • Output: the average unweighted (or weighted)
    Hamming distance between Y and the fittest
    protein sequences for the 3D structure.
  • Computational Complexity: #P-complete.

57
Problem 10
  • Input: a protein sequence Y, its native 3D
    structure, and two unweighted Hamming distances
    d1 and d2.
  • Output: a fittest protein sequence whose distance
    from Y is also between d1 and d2.
  • Computational Complexity: NP-hard.

58
Problem 11
  • Input: a protein sequence Y, its native 3D
    structure, and an unweighted Hamming distance d.
  • Output: the fittest among the protein sequences
    that are at distance d from Y.
  • Computational Complexity: NP-hard. We have a
    polynomial-time approximation algorithm.

59
Problem 12
  • Input: a protein sequence Y and its native 3D
    structure.
  • Output: all the ratios between the scaling
    factors alpha and beta in the GC model such that
    the smallest possible unweighted (or weighted)
    Hamming distance between Y and any fittest
    protein sequence is minimized over all possible
    alpha and beta.
  • Computational Complexity: O(n) network flows.

60
Problem 13
  • Input: a 3D structure.
  • Output: determine whether the fittest protein
    sequences are connected, i.e., whether they can
    mutate into each other through allowable
    mutations, such as point mutations, while the
    intermediate protein sequences all remain the
    fittest.
  • Computational Complexity: 1 network flow.
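
The notion of connectedness can be made concrete with a brute-force check on an explicit list of sequences. This is only an illustration of the definition, assuming point mutations (one H/P flip at a time) as the allowable mutations. It is not the single-network-flow algorithm referred to on the slide, and the function names are hypothetical.

```python
from collections import deque

def hamming(a, b):
    """Unweighted Hamming distance between two equal-length H/P strings."""
    return sum(x != y for x, y in zip(a, b))

def connected_by_point_mutations(sequences):
    """True if every sequence can reach every other one through single
    flips, passing only through sequences in the given set."""
    seqs = list(dict.fromkeys(sequences))      # drop duplicates, keep order
    if not seqs:
        return True
    seen, queue = {seqs[0]}, deque([seqs[0]])
    while queue:
        current = queue.popleft()
        for other in seqs:
            if other not in seen and hamming(current, other) == 1:
                seen.add(other)
                queue.append(other)
    return len(seen) == len(seqs)

print(connected_by_point_mutations(["HPP", "HHP", "HHH"]))
print(connected_by_point_mutations(["PPP", "HHP"]))
```

The second call shows a disconnected case: the two sequences differ in two positions and no intermediate sequence is in the set.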

61
Problem 14
  • Input: a 3D structure and two fittest protein
    sequences.
  • Output: determine whether the two sequences are
    connected.
  • Computational Complexity: 1 network flow.

62
Problem 15
  • Input: a 3D structure.
  • Output: the smallest set of allowable mutations
    with respect to which the fittest protein
    sequences (or two given fittest protein
    sequences) for the structure are connected.
  • Computational Complexity: 1 network flow.

64
Further Research for Protein Sequence Design
  • More sophisticated models (biology).
  • Algorithms and complexity for such models
    (computer science).
  • Wet lab validation (biology).

65
Further Algorithmic Research for Bioinformatics
  • Current State of Bioinformatics:
  • Biology: mostly very simple heuristics
  • Algorithms: mostly very simple techniques
  • Conjectures:
  • Biology: Nature is not so simple. Most of the
    biological information is very complicated.
  • Algorithms: Very sophisticated, novel, and
    fundamental techniques will be needed to unlock
    Nature's secrets.