Title: Combinatorial optimisation in protein structure prediction and recognition: Background, review, and research direction
1. Combinatorial optimisation in protein structure prediction and recognition: Background, review, and research direction
2. What's in this talk?
- What is protein structure prediction and recognition?
- Who has done what before?
- What's interesting and hasn't been done?
- Being critical about others' work is easy.
- Doing something brilliant is difficult.
- This talk addresses the easy problem.
3. Combining two amino acids
[Figure: two amino acids, before and after being combined]
4. Protein polypeptide chain
[Figure: polypeptide chain running from the N-terminal to the C-terminal]
A polypeptide chain is a chain of amino acids linked together by peptide bonds. Each amino acid is the same except for its side chain (the R group). There are 20 such amino acids. Different combinations of these 20 amino acids make different proteins. A protein sequence can contain from tens to thousands of amino acids.
5. An example
Primary structure: individual amino acids.
Quaternary structure: green + blue chains.
Secondary structure: α-helix and β-sheet.
The green chain defines a tertiary structure. So does the blue chain.
6. Motivation
- Notice: it is the 3-D structures of the proteins that are important (2 different sequences can have exactly the same structure!)
- Need to know the shape of a protein, so as to develop antibodies that bind that shape: fold prediction.
- Antibodies produced against one protein may also work for another protein that looks similar: structure recognition.
7. Structure prediction
8. HP models (ab initio prediction)
- Given a sequence of amino acids, determine the structure from scratch.
- Hydrophobic-hydrophilic (HP) model
- Proposed by Dill (1985)
- Two groups of amino acids
- Hydrophobic acids (H)
- Hydrophilic acids (P)
- Self-avoiding walks on lattices
- Objective: minimise global free energy
- Meaning, it's good to put as many hydrophobic acids as close together as possible.
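The objective above can be made concrete with a small sketch. It assumes the 2-D square lattice and the usual unit credit of -1 per H-H contact, where a contact is a pair of H acids on adjacent lattice points that are not consecutive in the chain. The function names `hh_contacts` and `hp_energy` are illustrative only and do not come from the talk.

```python
from typing import List, Tuple

Coord = Tuple[int, int]


def hh_contacts(sequence: str, coords: List[Coord]) -> int:
    """Count hydrophobic (H-H) contacts of a fold on the 2-D square lattice.

    sequence is a string over {'H', 'P'}; coords[i] is the lattice point of
    acid i. A contact is a pair of H acids on adjacent lattice points that
    are not consecutive in the chain.
    """
    if len(sequence) != len(coords):
        raise ValueError("one lattice point is needed per amino acid")
    if len(set(coords)) != len(coords):
        raise ValueError("walk is not self-avoiding")
    for (x1, y1), (x2, y2) in zip(coords, coords[1:]):
        if abs(x1 - x2) + abs(y1 - y2) != 1:
            raise ValueError("consecutive acids must be lattice neighbours")

    index_at = {point: i for i, point in enumerate(coords)}
    contacts = 0
    for i, (x, y) in enumerate(coords):
        if sequence[i] != "H":
            continue
        # Look only right and up so that each adjacent pair is counted once.
        for neighbour in ((x + 1, y), (x, y + 1)):
            j = index_at.get(neighbour)
            if j is not None and sequence[j] == "H" and abs(i - j) > 1:
                contacts += 1
    return contacts


def hp_energy(sequence: str, coords: List[Coord]) -> int:
    """Energy under unit credit: every H-H contact contributes -1."""
    return -hh_contacts(sequence, coords)
```

For example, folding the four-acid sequence HPPH around a unit square brings its two H acids together, while leaving it as a straight line does not.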
9. HP model on lattices: a 2-dimensional example
[Figure: 2-D lattice; legend: hydrophobic acids, hydrophilic acids]
10. HP model on lattices: a 2-dimensional example
[Figure: 2-D lattice; legend: hydrophobic acids, hydrophilic acids]
Fold with 5 hydrophobic contacts
11. Previous work on HP models
- Most previous work involves complete enumeration of self-avoiding random walks on various lattices (e.g. Lau and Dill (1989), Irback and Troein (2002))
- Irback and Troein (2002) managed sequences with up to 25 amino acids
- Unger and Moult (1993): hybrid Genetic Algorithm and Simulated Annealing (2-D)
- Size 20-64. Optimal for sizes 36, 48, 60 (Optimal?! How do they know?)
- Shakhnovich et al. (1991) tried SA on 30 27-acid problems. (Only 1 found the global minimum. Inappropriate local search is to blame.)
- Backofen (2001): constraint programming approach
- Tested problems of size 27-36, time 20 min to 1 hr 38 min (optimal)
- IP models proposed recently in Greenberg, Hart and Lancia (2002). No numerical results reported as yet.
- (See pages 1-4 of pdf file)
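To illustrate what complete enumeration means here, the following is a minimal sketch for the 2-D square lattice with unit credit per H-H contact. It is not the procedure of any of the cited papers. The name `best_fold` and the choice to fix the first step (to skip rotated copies of the same fold) are assumptions made for this example. The number of walks grows exponentially with chain length, so it is only usable for very short sequences.

```python
from typing import List, Tuple

Coord = Tuple[int, int]
MOVES = ((1, 0), (-1, 0), (0, 1), (0, -1))


def count_hh(sequence: str, path: List[Coord]) -> int:
    """H-H contacts between lattice neighbours that are not chain neighbours."""
    index_at = {point: i for i, point in enumerate(path)}
    total = 0
    for i, (x, y) in enumerate(path):
        for neighbour in ((x + 1, y), (x, y + 1)):  # each pair is seen once
            j = index_at.get(neighbour)
            if j is not None and abs(i - j) > 1 and sequence[i] == sequence[j] == "H":
                total += 1
    return total


def best_fold(sequence: str) -> Tuple[int, List[Coord]]:
    """Enumerate all self-avoiding walks; return (max contacts, one best walk)."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    n = len(sequence)
    best_score, best_path = -1, []
    path, occupied = [(0, 0)], {(0, 0)}

    def extend() -> None:
        nonlocal best_score, best_path
        if len(path) == n:
            score = count_hh(sequence, path)
            if score > best_score:
                best_score, best_path = score, list(path)
            return
        x, y = path[-1]
        # Fixing the first step removes rotated copies of the same fold.
        moves = ((1, 0),) if len(path) == 1 else MOVES
        for dx, dy in moves:
            nxt = (x + dx, y + dy)
            if nxt in occupied:
                continue
            path.append(nxt)
            occupied.add(nxt)
            extend()  # depth-first extension of the partial walk
            path.pop()
            occupied.remove(nxt)

    extend()
    return best_score, best_path
```

Calling `best_fold("HPPH")` returns a score together with one walk that attains it. Mirror images are still enumerated twice, which hints at the symmetry issue that exact models also have to deal with.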
12. Problems with IP models
- Dealing with symmetry
- Methods are suggested in Greenberg, Hart and Lancia (2002) and in Backofen's PhD thesis.
- What about other lattices?
- Number of lattice points unnecessarily large.
- Lau and Dill (1989) proposed maximal compact chain conformations: lattice walks in which every point is occupied by exactly one amino acid.
- E.g. 3x3x3 cubic lattice for a 27-amino-acid chain
- Maybe not that tight, but definitely not n^2.
- Maybe a union of some of those maximal compact chain conformations.
13. Let's be critical
- Cubic lattices probably not good enough. But it's a good start anyway.
- Faulon, Rintoul and Young (2002) tried 2-D honeycomb, 2-D square, 3-D diamond and 3-D cubic lattices. Agarwala et al.: triangular lattice (constrained SAW, no optimisation involved).
- Use an energy matrix rather than simple unit credit for each HH interaction? (Different hydrophobicity)
- The energy released by putting different pairs of H-acids together differs, and depends on how far apart they are in the sequence!
- Dill's HP model is too simplified.
- Besides, interactions between H-acids should be defined differently from the Domain and Neighbourhood.
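14. Under old definitions, suppose [figure: marked acids] are hydrophobic acids; then [figure: alternative folds] are all the same.
15. But surely [figure: one fold] looks better than [figure: another fold].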
16. Research opportunities
- Exact algorithms
- Alternative ILP formulations (with tight LP relaxation bounds)
- Difference between lattice neighbourhood and hydrophobic interaction neighbourhood (use Euclidean distance for the latter).
- Development of solution methodologies
- Modify Dill's model to deal with reality
- Alternative lattices (apply optimisation techniques as opposed to complete or simple constrained enumeration).
- More complicated hydrophobicity (Atkins and Hart (1999) discussed a fixed energy matrix and proved NP-hardness).
- Previous methods use either constraint programming or integer linear programming. Why not a hybrid CP and ILP approach?
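Two of the ideas above, an energy matrix in place of unit credit and a Euclidean interaction neighbourhood that differs from the lattice neighbourhood, can be sketched together. This is only an illustration under stated assumptions: the function name `contact_energy`, the `cutoff` parameter and the energy values in the usage note are hypothetical and are not taken from Atkins and Hart (1999) or any other cited work.

```python
import math
from itertools import combinations
from typing import Dict, Sequence, Tuple


def contact_energy(
    sequence: str,
    coords: Sequence[Tuple[float, ...]],
    pair_energy: Dict[Tuple[str, str], float],
    cutoff: float,
) -> float:
    """Sum pairwise energies over acids within Euclidean distance `cutoff`.

    pair_energy maps an alphabetically sorted pair of residue symbols to an
    energy (hypothetical values supplied by the caller); missing pairs count
    as 0. Chain neighbours are skipped because they are always adjacent.
    """
    if len(sequence) != len(coords):
        raise ValueError("one coordinate is needed per amino acid")
    total = 0.0
    for i, j in combinations(range(len(sequence)), 2):
        if j - i < 2:
            continue  # consecutive in the chain, not a fold-dependent contact
        if math.dist(coords[i], coords[j]) <= cutoff:
            key = tuple(sorted((sequence[i], sequence[j])))
            total += pair_energy.get(key, 0.0)
    return total
```

With a made-up matrix such as `{("H", "H"): -1.0}`, the three-acid sequence HPH bent into an L shape scores 0 at cutoff 1.0, because its two H acids are only diagonal neighbours, but scores -1.0 at cutoff 1.5. The lattice neighbourhood alone would never register that interaction.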
17. Research opportunities
- No methods so far can manage a sequence with >100 amino acids
- Heuristics
- Meta-heuristics: still room for research, try different neighbourhood schemes
- Tailor-made search techniques that consider folding patterns
- Development of problem-specific heuristic or greedy heuristic
- At least that will provide quick initial bounds for exact methods.
18. Structure recognition
19.
- Sequence alignment
- Comparing a sequence of amino acids with known sequences in the Protein Data Bank (PDB) at the primary structure level.
- Does this sequence look like that sequence?
- Methods well developed, e.g. BLAST.
- Fold recognition
- Comparing the structure of an unknown protein with known protein structures in PDB.
- Contact Map Optimisation (primary-structure comparisons)
- Arthur Lesk's model (secondary-structure comparisons)
- Ip et al.'s model (secondary-structure comparisons)
20. Contact Map Optimisation
- Comparing 3-D structures of two sequences of amino acids, e.g. s = (s1..sm) and t = (t1..tn). (Assuming you already know what each of them looks like, and you now want to know how much they look like each other.)
- Construct an undirected graph for each of s and t, with amino acids as vertices.
- For each sequence, two amino acids that are within a certain Euclidean distance of each other are connected by an edge.
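The graph construction above can be sketched in a few lines. The sketch assumes one 3-D coordinate per amino acid. The function name `contact_map` and the `threshold` parameter are illustrative, and the actual distance cut-off is a modelling choice that the slides do not fix.

```python
import math
from itertools import combinations
from typing import Sequence, Set, Tuple


def contact_map(
    coords: Sequence[Tuple[float, float, float]], threshold: float
) -> Set[Tuple[int, int]]:
    """Return the contact-map edges (i, j), i < j, of one folded sequence.

    Vertices are amino-acid indices; two acids are joined by an edge when
    their Euclidean distance is at most `threshold`.
    """
    edges = set()
    for i, j in combinations(range(len(coords)), 2):
        if math.dist(coords[i], coords[j]) <= threshold:
            edges.add((i, j))
    return edges
```

Chain neighbours are kept as edges here because the definition above does not exclude them. Filtering out pairs with `j - i == 1` would be a one-line change if only fold-dependent contacts are wanted.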
21. Contact Map Optimisation
[Figure: contact-map graphs of s (vertices s1, s2, ..., sm) and t (vertices t1, t2, ..., tn)]
22. Contact Map Optimisation
One way of mapping. 4 pairs of edges mapped.
23. Contact Map Optimisation
Another way of mapping. 5 edges mapped.
24. Wait a minute...
- Remember from the HP models, amino acids are divided into two groups. What is the point of mapping a hydrophobic amino acid in one graph to a hydrophilic amino acid in another or vice versa???
- Adding constraints that only amino acids of the same group are supposed to be matched might be helpful!!!
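As a sketch of how one candidate mapping could be scored, the function below counts the edges of s that land on edges of t under a non-crossing alignment, and optionally enforces the same-group idea from this slide. It only evaluates a given alignment and does not optimise over alignments. The name `shared_contacts` and the `groups_s` / `groups_t` arguments (strings over H and P) are assumptions for this example, not part of any published CMO model.

```python
from typing import Iterable, Optional, Set, Tuple

Edge = Tuple[int, int]


def shared_contacts(
    edges_s: Set[Edge],
    edges_t: Set[Edge],
    alignment: Iterable[Tuple[int, int]],
    groups_s: Optional[str] = None,
    groups_t: Optional[str] = None,
) -> int:
    """Count edges of s mapped onto edges of t by a non-crossing alignment.

    alignment holds pairs (i, j): vertex i of s is matched to vertex j of t.
    If both group strings are given, matched vertices must share a group.
    """
    pairs = sorted(alignment)
    for (i1, j1), (i2, j2) in zip(pairs, pairs[1:]):
        # After sorting by i, a valid alignment has strictly increasing j.
        if i1 == i2 or j1 >= j2:
            raise ValueError("alignment must be one-to-one and non-crossing")
    if groups_s is not None and groups_t is not None:
        for i, j in pairs:
            if groups_s[i] != groups_t[j]:
                raise ValueError(f"vertex {i} of s and {j} of t differ in group")

    mapping = dict(pairs)
    undirected_t = {tuple(sorted(edge)) for edge in edges_t}
    count = 0
    for a, b in edges_s:
        if a in mapping and b in mapping:
            if tuple(sorted((mapping[a], mapping[b]))) in undirected_t:
                count += 1
    return count
```

In an ILP the same-group restriction would amount to removing the matching variables for H-P pairs, which also shrinks the model.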
25. Who has done what?
- No one noticed the HP issue, so models aren't 100% cool.
- Lancia et al. (2001): ILP model (see pages 5-6 of pdf file)
- LP relaxation of no-crossing constraints typically weak, hence clique constraints (exponentially many) are introduced.
- Problem can be converted to a maximum independent set problem, for which clique inequalities are facet-defining.
- O(n^2) time separation for cliques.
- Root-node LP relaxation (from 1 min to 2 hours for 62-74 acids and 80-140 contacts. The more alike the two proteins, the faster the LP relaxation can be solved!)
26. Who has done what?
- Heuristic approaches
- Lancia et al. (2001)
- Genetic algorithm (GA)
- Steepest ascent local search
- Results of Lancia et al.
- Exact algorithm
- Gaps 0 to >5% (mostly >5%: exactly how much??)
- Heuristics
- Same story as above. GA much better than LS.
- Work on similar topics can also be found in Havel et al. (1979), Martin et al. (1992) and so on.
27. Let's be critical...
- Even just the LP relaxation of the IP formulation without no-crossing constraints takes a long time to solve when comparing pairs of real protein sequences with 100-200 amino acids.
- Tried comparing two sequences with 120 amino acids, took more than 10 hours!!!
- Really should consider the HP issue, and maybe even aggregating certain amino acids!
28. Let's be critical...
- A big problem with the model: a 3-D example
Consider the following sequence:
1 2 3 4 5 6 7
[Figure: two different 3-D structures of the sequence, vertices labelled 1 to 7]
Two different structures giving the same objective value by the ILP formulation of Lancia et al., assuming acids within a Euclidean distance of 3^(1/3) are connected by an edge.
29. Research opportunities
- Exact methods
- New ILP formulation.
- Alternative solution methodologies for solving the ILPs, now that we know the ILP models are huge and solving them is hard.
- Heuristics
- Problem-specific heuristic.
- Different neighbourhood search for meta-heuristics.
30. Arthur Lesk's model
- Compare structures of two protein sequences by inspecting relations between secondary structures
Does the blue protein look like the green protein?
33. [Figure: protein sequence 1 and protein sequence 2]
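34. Similar to CMO...
[Figure: graph comparison with vertices labelled B, C, D and indexed secondary-structure elements (Greek labels lost in extraction)]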
35. Useful papers and websites
- Greenberg, H.J., Hart, W.E., Lancia, G. Opportunities for Combinatorial Optimization in Computational Biology.
- http://www.dkfz-heidelberg.de/tbi/bioinfo/ProteinStructure/
- Christian Lemmen and Thomas Lengauer. Computational methods for the structural alignment of molecules, Journal of Computer-Aided Molecular Design, 14: 215-232, 2000.