Computational Molecular Biology

About This Presentation

Title:

Computational Molecular Biology

Description:

Computational Molecular Biology Protein Structure: Introduction and Prediction – PowerPoint PPT presentation

Slides: 169

Provided by: MYTRATHAI

Transcript and Presenter's Notes

1
Computational Molecular Biology

Protein Structure: Introduction and Prediction

2
Protein Folding

One of the most important problems in molecular
biology
Given the one-dimensional amino-acid sequence
that specifies the protein, what is the protein's
fold in three dimensions?

3
Overview

Understand protein structures
Primary, secondary, tertiary
Why study protein folding
Structure can reveal functional information
which we cannot find from the sequence
Misfolded proteins can cause diseases, e.g., mad
cow disease
Use in drug design

4
Overview of Protein Structure

Proteins make up about 50% of the dry mass of the
average human
Play a vital role in keeping our bodies
functioning properly
Biopolymers made up of amino acids
The order of the amino acids in a protein and the
properties of their side chains determine the
three dimensional structure and function of the
protein

5
Amino Acid

Building blocks of proteins
Consist of
An amino group (-NH2)
Carboxyl group (-COOH)
Hydrogen (-H)
A side chain group (-R) attached to the central
α-carbon
There are 20 standard amino acids
Primary protein structure is the sequence of
amino acids in the chain

[Figure: amino acid with the side chain, amino group, and carboxyl group labeled]
6
Side chains (Amino Acids)

20 amino acids have side chains that vary in
structure, size, hydrogen bonding ability, and
charge.
R gives the amino acid its identity
R can be as simple as a hydrogen atom (glycine) or
more complex, such as an aromatic ring system
(tryptophan)

7
Chemical Structure of Amino Acids
8
How Amino Acids Become Proteins
Peptide bonds
9
Polypeptide

More than fifty amino acids in a chain are called
a polypeptide.
A protein is usually composed of 50 to 400 amino
acids.
We call the units of a protein amino acid
residues.

[Figure labels: amide nitrogen, carbonyl carbon]
10
Side chain properties

Carbon does not readily form hydrogen bonds with
water: hydrophobic.
These water-fearing side chains tend to
sequester themselves in the interior of the
protein
O and N are generally more likely than C to
hydrogen-bond to water: hydrophilic
These side chains tend to turn outward to the
exterior of the protein

11
(No Transcript)
12
Primary Structure
Primary structure: linear string of amino acids
Side-chain
Backbone
... ALA PHE LEU ILE LEU ARG ...
Each amino acid within a protein is referred to
as a residue. Each different protein has a unique
sequence of amino acid residues; this is its
primary structure.
13
Secondary Structure

Refers to the spatial arrangement of contiguous
amino acid residues
Regularly repeating local structures stabilized
by hydrogen bonds
A hydrogen bond forms when a hydrogen atom
attached to a relatively electronegative atom is
attracted to a second electronegative atom
Examples of secondary structure are the α-helix
and the β-pleated sheet

14
Alpha-Helix

Amino acids adopt the form of a right-handed
spiral
The polypeptide backbone forms the inner part of
the spiral
The side chains project outward
Every backbone N-H group donates a hydrogen bond
to the backbone C=O group of the residue four
positions earlier in the chain

15
Beta-Pleated-Sheet

Consists of long polypeptide chains called
beta-strands, aligned adjacent to each other in
parallel or anti-parallel orientation
Hydrogen bonding between the strands keeps them
together, forming the sheet
Hydrogen bonding occurs between backbone N-H and
C=O groups of different strands

16
Parallel Beta Sheets
17
Anti-Parallel Beta Sheets
18
Mixed Beta Sheets
19
Tertiary Structure

The full three-dimensional structure, describing
the overall shape of a protein
Also known as its fold

20
Quaternary Structure

Some proteins are made up of multiple polypeptide
chains, each called a subunit
The spatial arrangement of these subunits is
referred to as the quaternary structure
Sometimes distinct proteins must combine together
in order to form the correct 3-dimensional
structure for a particular protein to function
properly.
Example: the protein hemoglobin, which carries
oxygen in blood. Hemoglobin is made of four
similar polypeptide chains that combine to form
its quaternary structure.

21
Other Units of Structure

Motifs (super-secondary structure)
Frequently occurring combinations of secondary
structure units
A pattern of alpha-helices and beta-strands
Domains: a protein chain often consists of
different regions, or domains
Domains within a protein often perform different
functions
Can have completely different structures and
folds
Typically 100 to 400 residues long

22
What Determines Structure

What causes a protein to fold in a particular
way?
At a fundamental level, chemical interactions
between all the amino acids in the sequence
contribute to a protein's final conformation
There are four fundamental chemical forces
Hydrogen bonds
Hydrophobic effect
Van der Waals forces
Electrostatic forces

23
Hydrogen Bonds

Occur when a pair of electronegative atoms such
as oxygen and nitrogen share a hydrogen between
them
The pattern of hydrogen bonding is essential in
stabilizing basic secondary structures

24
Van der Waals Forces

Interactions between immediately adjacent atoms
Result from the attraction between an atom's
nucleus and its neighbor's electrons

25
Electrostatic Forces

Oppositely charged side chains can form
salt bridges, which pull chains together

26
Experimental Determination

Centralized database (to deposit protein
structures) called the Protein Data Bank (PDB),
accessible at http://www.rcsb.org/pdb/index.html
Two main techniques are used to determine/verify
the structure of a given protein
X-ray crystallography
Nuclear Magnetic Resonance (NMR)
Both are slow, labor-intensive, and expensive
(sometimes taking longer than a year!)

27
X-ray Crystallography

A technique that can reveal the precise three
dimensional positions of most of the atoms in a
protein molecule
The protein is first isolated to yield a high
concentration solution of the protein
This solution is then used to grow crystals
The resulting crystal is then exposed to an X-ray
beam

28
Disadvantages

Not all proteins can be crystallized
The crystalline structure of a protein may be
different from its structure in solution
Multiple maps may be needed to get a consensus

29
NMR

The spinning of certain atomic nuclei generates a
magnetic moment
NMR measures the energy levels of such magnetic
nuclei (radio frequency)
These levels are sensitive to the environment of
the atom
What they are bonded to, which atoms they are
close to spatially, what distances are between
different atoms
Thus, by careful measurement, the structure of
the protein can be reconstructed

30
Disadvantages

Constraint on the size of the protein: an upper
bound is 200 residues
Protein structure is very sensitive to pH.

31
Computational Methods

Given the long and painful experimental methods,
we need computational approaches to predict a
protein's structure from its sequence.

32
Functional Region Prediction
33
Protein Secondary Structure
34
Tertiary Structure Prediction
35
More Details on X-ray Crystallography
36
Overview
37
Overview
38
Crystal

A crystal can be defined as an arrangement of
building blocks which is periodic in three
dimensions

39
Crystallize a Protein

Have to find the right combination of all the
different influences to get the protein to
crystallize
This can take a couple of hundred or even
thousands of experiments
Most popular way to conduct these experiments:
the hanging-drop method

40
Hanging drop method

The reservoir contains a precipitant
concentration twice as high as the protein
solution
The protein solution is made up of 50% stock
protein solution and 50% reservoir solution
Over time, water will diffuse from the protein
drop into the reservoir
Both the protein concentration and precipitant
concentration will increase
Crystals will appear after days, weeks, months

41
Properties of protein crystal

Very soft
Mechanically fragile
Large solvent areas (30-70%)

42
A Schematic Diffraction Experiment
43
Why do we need Crystals

A single molecule could never be oriented and
handled properly for a diffraction experiment
In a crystal, we have about 10^15 molecules in
the same orientation so that we get a tremendous
amplification of the diffraction
Crystals produce much simpler diffraction
patterns than single molecules

44
Why do we need X-rays

X-rays are electromagnetic waves with a
wavelength close to the distance of atoms in the
protein molecules
To get information about where the atoms are, we
need to resolve them; thus we need radiation of
comparable wavelength

45
A Diffraction Pattern
46
(No Transcript)
47
Resolution

The primary measure of crystal order/quality of
the model
Ranges of resolution
Low resolution (> 3-5 Å): difficult to see the
side chains, only the overall structural fold
Medium resolution (2.5-3 Å)
High resolution (2.0 Å)

48
Some Crystallographic Terms

h, k, l: Miller indices (like a name of the
reflection)
I(h,k,l): intensity
2θ: angle between the incident X-ray beam and the
reflected beam

49
Diffraction by a Molecule in a Crystal

The electric vector of the X-ray wave forces the
electrons in our sample to oscillate with the
same wavelength as the incoming wave

50
Description of Waves
51
Structure Factor Equation

f_j: proportional to the number of electrons
that atom j has
One of the fundamental equations in X-ray
Crystallography
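
For reference, the structure factor equation in its standard textbook form is sketched below. The symbols x_j, y_j, z_j (fractional coordinates of atom j in the unit cell) and N (number of atoms in the unit cell) are assumptions that do not appear in the slide text; f_j and (h, k, l) are as defined on this and the preceding slides.

```latex
F(hkl) = \sum_{j=1}^{N} f_j \, \exp\!\left[ 2\pi i \,( h x_j + k y_j + l z_j ) \right],
\qquad
I(hkl) \propto \left| F(hkl) \right|^{2}
```

The second relation states that the measured intensity is proportional to the squared amplitude of the structure factor, so the phase of F(hkl) is not recorded.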

52
The Phase

From the measurement, we can only obtain the
intensity I(hkl) of any given reflection (hkl)
The phase α(hkl) cannot be measured

53
How to Determine the Phase

Small changes are introduced into the crystal of
the protein of interest
E.g., soaking the crystal in a solution
containing a heavy-atom compound

A second diffraction data set needs to be
collected
Comparing the two data sets determines the phases
(and also localizes the heavy atoms)

54
Other Phase Determination Methods
55
Electron Density Map

Once we know the complete diffraction pattern
(amplitudes and phases), we need to calculate an
image of the structure
The electron density equation (a Fourier
synthesis over all reflections) returns the
electron density, so we get a map of where the
electrons are and their concentration
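
For reference, the electron density is usually written as the Fourier synthesis sketched below. The unit-cell volume V and the fractional coordinates (x, y, z) are assumptions not named in the slide text; |F(hkl)| is the amplitude and α(hkl) the phase of each reflection.

```latex
\rho(x, y, z) = \frac{1}{V} \sum_{h} \sum_{k} \sum_{l}
  \left| F(hkl) \right| \,
  \exp\!\left[ -2\pi i \,( h x + k y + l z ) + i\,\alpha(hkl) \right]
```

Both the amplitudes and the phases enter the sum, which is why the phases must be determined before a map can be computed.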

56
Interpretation of Electron Density

Now, the electron density has to be interpreted
in terms of atom identities and positions.
(1) packing of the whole molecules is shown in
the crystal
(2) a chain of seven amino acids is shown with
the resulting structure superimposed
(3) the electron density of a tryptophan side
chain is shown

57
Refinement and the R-Factor
58
Nuclear Magnetic Resonance

Concentrated protein solution (very purified)
Magnetic field
Effect of radio frequencies on the resonance of
different atoms is measured.

59
(No Transcript)
60
NMR

Behavior of any atom is influenced by neighboring
atoms
more closely spaced residues are more perturbed
than distant residues
can calculate distances based on perturbation

61
NMR spectrum of a protein
62
Computational Molecular Biology

Protein Structure: Secondary Structure Prediction

63
Primary Structure Symbolic Definition

A = {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y}:
set of symbols denoting all amino acids
A* = set of all finite sequences formed out of
elements of A, called protein sequences
Elements of A* are denoted by x, y, z, ..., i.e., we
write x ∈ A*, y ∈ A*, z ∈ A*, etc.
PROTEIN PRIMARY STRUCTURE: any x ∈ A* is also
called a protein sequence or protein sub-unit

64
Protein Secondary Structure (PSS)

Secondary structure: the arrangement of the
peptide backbone in space. It is produced by
hydrogen bonding between amino acids
PROTEIN SECONDARY STRUCTURE consists of a protein
sequence and its hydrogen bonding patterns,
called SS categories

65
Protein Secondary Structure

Databases for protein sequences are expanding
rapidly
The number of determined protein structures (PSS:
protein secondary structures) is still limited
compared with the number of known protein
sequences
PSSP (Protein Secondary Structure Prediction)
research is trying to bridge this gap.

66
Protein Secondary Structure

The most commonly observed conformations in
secondary structure are
Alpha Helix
Beta Sheets/Strands
Loops/Turns

67
Turns and Loops

Secondary structure elements are connected by
regions of turns and loops
Turns: short regions of non-α, non-β
conformation
Loops: larger stretches with no secondary
structure.

68
Three secondary structure states

Prediction methods are normally assessed for 3
states
H (helix)
E (strands)
L (others (loop or turn))

69
Secondary Structure

8 different categories
H: α-helix
G: 3₁₀ helix
I: π-helix (extremely rare)
E: β-strand
B: β-bridge
T: β-turn
S: bend
L: the rest

70
Three SS states Reduction methods

Method 1, used by DSSP program
H (helix) = G (3₁₀ helix), H (α-helix)
E (strands) = E (β-strand), B (β-bridge)
L = all the rest
In short: E, B → E; G, H → H; rest → L (also written C)
Method 2, used by STRIDE program
H as in Method 1
E = E (β-strand), b (isolated β-bridge)
L = all the rest

71
Three SS states Reduction methods

Method 3, used by DEFINE program
H (helix) as in Method 1
E (strands) = E (β-strand)
L = all the rest
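
The reduction methods above amount to a lookup table from eight symbols to three. The following is a minimal sketch under that reading; the names `METHOD_1`, `METHOD_3`, and `reduce_to_three_states` are hypothetical and are not taken from DSSP, STRIDE, or DEFINE.

```python
# Method 1 from the slides: G and H become helix, E and B become strand.
METHOD_1 = {"H": "H", "G": "H", "E": "E", "B": "E"}

# Method 3 from the slides: only E counts as strand; B falls into "the rest".
METHOD_3 = {"H": "H", "G": "H", "E": "E"}


def reduce_to_three_states(ss8, mapping=METHOD_1):
    """Map each 8-state symbol to H, E, or L; any unmapped symbol becomes L."""
    return "".join(mapping.get(symbol, "L") for symbol in ss8)


if __name__ == "__main__":
    print(reduce_to_three_states("HGIEBTSL"))            # Method 1
    print(reduce_to_three_states("HGIEBTSL", METHOD_3))  # Method 3
```

Keeping the mapping as data rather than as branching code makes it easy to compare how much the choice of reduction method changes the reported accuracy.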

72
Example of typical PSS Data

Example
Sequence
KELVLALYDYQEKSPREVTHKKGDILTLLNSTNKDWWKYEYNDRQGFVP
Observed SS
HHHHHLLLLEEEHHHLLLEEEEEELLLHHHHHHHHLLLEEEEEELLLHHH

73
PSS Symbolic Definition

Given A = {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V,
W, Y}, the set of symbols denoting amino acids, and a
protein sequence x ∈ A*
Let S = {H, E, L} be the set of symbols of 3
states: H (helix), E (strands) and L (loop), and
S* be the set of all finite sequences of elements
of S.
We denote elements of S* by e, e ∈ S*

74
PSS Symbolic Definition

Any one-to-one function
f: A* → S*, i.e., f ⊆ A* × S*
is called a protein secondary structure (PSS)
identification function
An element (x, e) ∈ f is called a protein
secondary structure (of the protein sequence x)
The element e ∈ S* (of (x, e) ∈ f) is called the
secondary structure.

75
PSSP

If a protein sequence shows clear similarity to a
protein of known three dimensional structure
then the most accurate method of predicting the
secondary structure is to align the sequences by
standard dynamic programming algorithms
Why?
homology modelling is much more accurate than
secondary structure prediction for high levels of
sequence identity.

76
PSSP

Secondary structure prediction methods are of
most use when sequence similarity to a protein of
known structure is undetectable.
It is important that there is no detectable
sequence similarity between sequences used to
train and test secondary structure prediction
methods.

77
Classification and Classifiers

Given a database table DB with a special
attribute C, called a class attribute (or decision
attribute). The values c1, c2, ..., cn of the
class attribute are called class labels.
Example

A1 A2 A3 A4 C
1 1 m g c1
0 1 v g c2
1 0 m b c1
78
Classification and Classifiers

The attribute C partitions the records in the DB,
i.e., divides the records into disjoint subsets
defined by the values of attribute C; it
CLASSIFIES the records.
It means we use the attribute C and its values
to divide the set R of records of DB into n
disjoint classes
C1 = {r ∈ DB : C = c1}, ..., Cn = {r ∈ DB : C = cn}
Example (from our table)
C1 = {(1,1,m,g), (1,0,m,b)} = {r1, r3}
C2 = {(0,1,v,g)} = {r2}
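
The partition in this example can be reproduced with a few lines of Python. This is only an illustrative sketch: the function name `partition_by_class` and the convention that the class label is the last field of each record are assumptions.

```python
from collections import defaultdict


def partition_by_class(records):
    """Group records by their class label (assumed to be the last field)."""
    classes = defaultdict(list)
    for *attributes, label in records:
        classes[label].append(tuple(attributes))
    return dict(classes)


# The three records of the example table: (A1, A2, A3, A4, C).
db = [
    (1, 1, "m", "g", "c1"),
    (0, 1, "v", "g", "c2"),
    (1, 0, "m", "b", "c1"),
]

if __name__ == "__main__":
    for label, members in partition_by_class(db).items():
        print(label, members)
```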

79
Classification and Classifiers

An algorithm is called a classification algorithm
if it uses the data and its classification to
build a set of patterns.
Those patterns are structured in such a way that
we can use them to classify unknown sets of
objects- unknown records.
For that reason (because of the goal) the
classification algorithm is often called, for
short, a classifier.
The name classifier implies more than just a
classification algorithm. A classifier is the
final product of a data set and a classification
algorithm.

80
Classification and Classifiers

Building a classifier consists of two phases:
training and testing.
In both phases we use data (a training data set
and a test data set disjoint from it) for which
the class labels are known for ALL of the records.
We use the training data set to create patterns
We evaluate the created patterns using the test
data, whose classification is known.
The measure of a trained classifier's accuracy is
called predictive accuracy.
The classifier is built, i.e., we terminate the
process, once it has been trained and tested and
its predictive accuracy is at an acceptable level.

81
Classifiers Predictive Accuracy

PREDICTIVE ACCURACY of a classifier is the
percentage of correctly classified data in the
testing data set.
Predictive accuracy depends heavily on the choice
of the test and training data.
There are many methods of choosing test and
training sets and hence evaluating the predictive
accuracy. This is a separate field of research.

82
Accuracy Evaluation

Use training data to adjust parameters of method
until it gives the best agreement between its
predictions and the known classes
Use the testing data to evaluate how well the
method works (without adjusting parameters!)
How do we report the performance?
Average accuracy fraction of all test examples
that were classified correctly

83
Accuracy Evaluation

Multiple cross-validation tests have to be
performed to exclude a potential dependency of
the evaluated accuracy on the particular test set
chosen
Jack-Knife
Use 129 chains for setting up the tool (training
set)
1 for estimating the performance (testing)
This has to be repeated 130 times until each
protein has been used once for testing
The average over all 130 tests gives an estimate
of the prediction accuracy
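
The jack-knife procedure can be sketched as a leave-one-out loop. Everything named here is hypothetical: `jackknife_accuracy` takes the training and evaluation steps as functions, and the "majority state" classifier and the three toy chains exist only to make the sketch runnable. They stand in for a real predictor and a real dataset such as the 130 chains mentioned above.

```python
from collections import Counter


def jackknife_accuracy(chains, train, evaluate):
    """Train on all chains but one, test on the held-out chain, and average."""
    scores = []
    for i, held_out in enumerate(chains):
        training_set = chains[:i] + chains[i + 1:]
        model = train(training_set)
        scores.append(evaluate(model, held_out))
    return sum(scores) / len(scores)


def train_majority(training_chains):
    """Toy 'classifier': remember the most frequent state in the training set."""
    counts = Counter(s for _, structure in training_chains for s in structure)
    return counts.most_common(1)[0][0]


def fraction_correct(model_state, chain):
    """Fraction of residues in the held-out chain that match the model's state."""
    _, structure = chain
    return sum(s == model_state for s in structure) / len(structure)


# Toy (sequence, observed structure) pairs, invented for illustration only.
toy_chains = [("AAAA", "HHHH"), ("CCCC", "HHHL"), ("GGGG", "HHLL")]

if __name__ == "__main__":
    print(jackknife_accuracy(toy_chains, train_majority, fraction_correct))
```

Passing `train` and `evaluate` as arguments keeps the loop independent of any particular prediction method.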

84
PSSP Datasets

Historic RS126 dataset. Contains 126 sub-units
with known secondary structure selected by Rost
and Sander. It is no longer used today
CB513 dataset. Contains 513 sub-units with known
secondary structure selected by Cuff and Barton
in 1999. Used quite frequently in PSSP research
HS1771 dataset. Created by Hobohm and Scharf.
In March 2002 it contained 1771 sub-units
Many authors have their own, unpublished
datasets

85
Measures for PSSP accuracy

http://cubic.bioc.columbia.edu/eva/doc/measure_sec.html
(for more information)
Q3: three-state prediction accuracy (percentage of
correctly classified residues)
Qi obs: how many of the observed residues were
correctly predicted?
Qi prd: how many of the predicted residues were
correctly predicted?

86
Measures for PSSP Accuracy

A_ij: number of residues predicted to be in
structure type j and observed to be in type i
Number of residues predicted to be in structure
i
Number of residues observed to be in structure i

87
Measures for SSP Accuracy

The percentage of residues correctly predicted to
be in class i relative to those observed to be in
class i
The percentages of residues correctly predicted
to be in class i from all residues predicted to
be in i
Overall 3-state accuracy
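
The three measures described above can be computed from the matrix A_ij as sketched below. The function names are hypothetical, and the sketch assumes that every state occurs at least once among the observed and predicted labels (otherwise the per-class ratios would divide by zero).

```python
STATES = "HEL"


def confusion_matrix(observed, predicted):
    """A[i][j] = number of residues observed in state i and predicted in state j."""
    A = {i: {j: 0 for j in STATES} for i in STATES}
    for obs, prd in zip(observed, predicted):
        A[obs][prd] += 1
    return A


def q3(A):
    """Overall three-state accuracy in percent."""
    total = sum(A[i][j] for i in STATES for j in STATES)
    return 100.0 * sum(A[i][i] for i in STATES) / total


def q_obs(A, i):
    """Correctly predicted residues of class i relative to those observed in i."""
    return 100.0 * A[i][i] / sum(A[i][j] for j in STATES)


def q_prd(A, i):
    """Correctly predicted residues of class i relative to those predicted in i."""
    return 100.0 * A[i][i] / sum(A[j][i] for j in STATES)


if __name__ == "__main__":
    A = confusion_matrix("HHHHEEELLL", "HHHEEEELLH")
    print(q3(A), q_obs(A, "H"), q_prd(A, "E"))
```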

88
PSSP Algorithms

There are three generations in PSSP algorithms
First Generation: based on statistical
information of single amino acids (1960s and
1970s)
Second Generation: based on windows (segments) of
amino acids. Typically a window contains 11-21
amino acids (dominating the field until the early
1990s)
Third Generation: based on the use of windows on
evolutionary information

89
PSSP First Generation

First generation PSSP systems are based on
statistical information on a single amino acid
The most relevant algorithms
Chou-Fasman, 1974
GOR, 1978
Both algorithms claimed 74-78% predictive
accuracy, but when tested with better constructed
datasets were shown to have a predictive
accuracy of 50% (Nishikawa, 1983)

90
Chou-Fasman method

Uses table of conformational parameters
determined primarily from measurements of the
known structure (from experimental methods)
Table consists of one likelihood for each
structure for each amino acid
Based on frequencies of residues in α-helices,
β-sheets and turns
Notation: P(H): propensity to form alpha helices
f(i): probability of being in position 1 (of a
turn)

91
Chou-Fasman Pij-values
92
Chou-Fasman

A prediction is made for each type of structure
for each amino acid
Can result in ambiguity if a region has high
propensities for both helix and sheet (higher
value usually chosen)

93
Chou-Fasman

How it works
1. Assign all of the residues the appropriate set
of parameters
2. Identify α-helix and β-sheet regions. Extend
the regions in both directions.
3. If structures overlap compare average values
for P(H) and P(E) and assign secondary structure
based on best scores.
4. Turns are calculated using 2 different
probability values.

94
Assign Pij values
1. Assign all of the residues the appropriate
set of parameters
95
Scan peptide for α-helix regions
2. Identify regions where 4 out of 6 residues have
P(H) > 100: alpha-helix nucleus
96
Extend α-helix nucleus
3. Extend helix in both directions until a set of
four consecutive residues with P(H) < 100.
Find sum of P(H) and sum of P(E) in the extended
region. If the region is long enough (> 5 letters)
and sum P(H) > sum P(E), then declare the extended
region an alpha helix
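
The nucleation scan and the final helix test can be sketched as follows. This is an illustration of the rules as stated on the slides, not the published Chou-Fasman program: the function names are hypothetical, the extension step is omitted, and the propensity values in the usage example are toy numbers on the same scale (100 = neutral), not entries from the real Chou-Fasman table.

```python
def helix_nuclei(p_helix, window=6, min_hits=4, threshold=100):
    """Start indices of windows where at least min_hits residues have P(H) > threshold."""
    starts = []
    for start in range(len(p_helix) - window + 1):
        hits = sum(1 for p in p_helix[start:start + window] if p > threshold)
        if hits >= min_hits:
            starts.append(start)
    return starts


def is_helix_region(p_helix, p_sheet, start, end):
    """Slide rule: region longer than 5 residues and sum P(H) > sum P(E)."""
    long_enough = (end - start) > 5
    return long_enough and sum(p_helix[start:end]) > sum(p_sheet[start:end])


if __name__ == "__main__":
    # Toy per-residue propensities (assumed values, for illustration only).
    toy_p_helix = [57, 142, 151, 121, 57, 145, 70, 57, 57, 60]
    toy_p_sheet = [75, 83, 37, 130, 75, 105, 119, 75, 75, 55]
    print(helix_nuclei(toy_p_helix))
    print(is_helix_region(toy_p_helix, toy_p_sheet, 0, 6))
```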
97
Scan peptide for β-sheet regions
4. Identify regions where 3 out of 5 residues have
P(E) > 100: β-sheet nucleus
5. Extend β-sheet until 4 contiguous residues with
an average P(E) < 100
6. If region average > 100 and the average P(E) >
average P(H), then β-sheet
98
Overlapping

Resolving overlapping alpha helix and beta sheet
Compute sum of P(H) and sum of P(E) in the
overlap.
If sum P(H) > sum P(E) → alpha helix
If sum P(E) > sum P(H) → beta sheet

99
Turn Prediction

An amino acid is predicted as turn if all of the
following holds
f(i) · f(i+1) · f(i+2) · f(i+3) > 0.000075
Avg(P(t) at i+k) > 100, for k = 0, 1, 2, 3
Sum(P(t)) > Sum(P(H)) and Sum(P(E)) for i+k,
(k = 0, 1, 2, 3)
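
The three turn conditions translate directly into code. In this sketch the data layout is an assumption: `f[j]` is taken to be a tuple of four bend frequencies for residue j, one for each of the four turn positions, so that the product uses position 1 for residue i, position 2 for residue i+1, and so on. The function name and the toy numbers are hypothetical.

```python
from math import prod


def is_turn(f, p_turn, p_helix, p_sheet, i, cutoff=0.000075):
    """Apply the three turn conditions to the tetrapeptide starting at index i."""
    bend_product = prod(f[i + pos][pos] for pos in range(4))
    turn = p_turn[i:i + 4]
    helix = p_helix[i:i + 4]
    sheet = p_sheet[i:i + 4]
    return (
        bend_product > cutoff
        and sum(turn) / 4 > 100
        and sum(turn) > sum(helix)
        and sum(turn) > sum(sheet)
    )


if __name__ == "__main__":
    # Toy values for a four-residue peptide (assumed, for illustration only).
    toy_f = [(0.1, 0.1, 0.1, 0.1)] * 4
    print(is_turn(toy_f, [120] * 4, [90] * 4, [80] * 4, 0))
```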

100
PSSP Second Generation

Based on the information contained in a window of
amino acids (11-21 aa.)
Most systems use algorithms based on
Statistical information
Physico-chemical properties
Sequence patterns
Graph-theory
Multivariate statistics
Expert rules
Nearest-neighbour algorithms

101
PSSP First and Second Generation

Main problems
Prediction accuracy < 70%
SS assignments differ even between crystals of
the same protein
SS formation is partially determined by
long-range interactions, i.e., by contacts
between residues that are not visible by any
method based on windows of 11-21 adjacent residues

102
PSSP First and Second Generation

Main problems
Prediction accuracy for β-strand 28-48%, only
slightly better than random
beta-sheet formation is determined by more
nonlocal contacts than in alpha-helix formation
Predicted helices and strands are usually too
short
Overlooked by most developers

103
Example of Second Generation

Example for typical secondary structure
prediction of the 2nd generation.
The protein sequence (SEQ) given was the SH3
structure.
The observed secondary structure (OBS) was
assigned by DSSP (H = helix; E = strand; blank =
non-regular structure; the dashes indicate the
continuation).
The typical prediction of too-short segments
(TYP) poses the following problems in practice.
(i) Are the residues predicted to be strand in
segments 1, 5, and 6 errors, or should the
helices be elongated?
(ii) Should the 2nd and 3rd strand be joined, or
should one of them be ignored, or does the
prediction indicate two strands here? Note: the
three-state per-residue accuracy is 60% for the
prediction given.

104
PSSP Third Generation

PHD: first algorithm in this generation (1994)
Evolutionary information improves the prediction
accuracy to 72%
Use of evolutionary information
1. Scan a database with known sequences with
alignment methods for finding similar sequences
2. Filter the previous list with a threshold to
identify the most significant sequences
3. Build amino acid exchange profiles based on
the probable homologs (most significant
sequences)
4. The profiles are used in the prediction,
i.e. in building the classifier

105
PSSP Third Generation

Many of the second generation algorithms have
been updated to the third generation

106
PSSP Third Generation

Due to the improvement of protein information in
databases, i.e., better evolutionary information,
today's predictive accuracy is 80%
It is believed that the maximum reachable accuracy
is 88%. Why such a conjecture?

107
Why 88%?

SS assignments may vary for two versions of the
same structure
Dynamic objects with some regions being more
mobile than others
Assignments differ by 5-15% between different
X-ray (NMR) versions of the same protein
Assignments differ by about 12% between
structural homologues
B. Rost, C. Sander, and R. Schneider, Redefining
the goals of protein secondary structure
prediction, J. Mol. Biol.

108
PSSP Data Preparation

Public Protein Data Sets used in PSSP research
contain protein secondary structure sequences. In
order to use classification algorithms we must
transform secondary structure sequences into
classification data tables.
Records in the classification data tables are
called (learning) instances in the PSSP literature.
The mechanism used in this transformation process
is called a window.
A window algorithm has a secondary structure as
input and returns a classification table: a set of
instances for the classification algorithm.

109
Window

Consider a secondary structure (x, e),
where (x, e) = (x1 x2 ... xn, e1 e2 ... en)
A window of length w chooses a subsequence of
length w of x1 x2 ... xn, and an element ei from
e1 e2 ... en, corresponding to a special position
in the window, usually the middle
The window moves along the sequences
x = x1 x2 ... xn and e = e1 e2 ... en
simultaneously, starting at the beginning and
moving to the right one letter at a time at each
step of the process.

110
Window Sequence to Structure

Such a window is called a sequence-to-structure
window. We will call it a window for short.
The process terminates when the window or its
middle position reaches the end of the sequence
x.
The pair (subsequence, element of e), often
written in the form
subsequence → H, E or L,
is called an instance, or a rule.

111
Example Window

Consider a secondary structure (x, e) and the
window of length 5 with the special position in
the middle (bold letters)
First position of the window is
x = A R N S T V V S T A A ...
e = H H H H L L L E E E
Window returns instance
A R N S T → H

112
Example Window

Second position of the window is
x = A R N S T V V S T A A ...
e = H H H H L L L E E E
Window returns instance
R N S T V → H
Next instances are
N S T V V → L
S T V V S → L
T V V S T → L
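
The window example can be reproduced with a short function. This is a sketch under two assumptions: the function name `window_instances` is hypothetical, and x is truncated to its first ten letters so that it has the same length as the ten-letter e shown on the slides.

```python
def window_instances(x, e, w=5, p=None):
    """Slide a window of length w over x; label each window with e at position p.

    p is the 1-based special position inside the window (default: the middle).
    """
    if len(x) != len(e):
        raise ValueError("sequence and structure must have the same length")
    if p is None:
        p = (w + 1) // 2
    instances = []
    for start in range(len(x) - w + 1):
        instances.append((x[start:start + w], e[start + p - 1]))
    return instances


if __name__ == "__main__":
    for subsequence, state in window_instances("ARNSTVVSTA", "HHHHLLLEEE"):
        print(" ".join(subsequence), "->", state)
```

Each returned pair is one instance (rule); collecting them for every chain in a dataset yields the classification table described on the following slides.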

113
Symbolic Notation

Let f be a protein secondary structure (PSS)
identification function
f: A* → S*, i.e., f ⊆ A* × S*
Let x = x1 x2 ... xn, e = e1 e2 ... en, f(x) = e; we define
f(x1 x2 ... xn)|xi = ei, i.e., f(x)|xi = ei

114
Example: Semantics of Instances

Let
x = A R N S T V V S T A A ...
e = H H H H L L L E E E
And assume that the window returns an instance
A R N S T → H
Semantics of the instance is
f(x)|N = H,
where f is the identification function and N is
preceded by A R and followed by S T and the
window has the length 5

115
Classification Data Base (Table)

We build the classification table with attributes
being the positions p1, p2, p3, p4, p5 .. pw
in the window, where w is length of the
window.
The corresponding values of attributes are
elements of the subsequence at the given
position.
The classification attribute is S, with values in
the set {H, E, L} assigned by the window operation
(instance, rule).
The classification table for our example (first
few records) is the following.

116
Classification Table (Example)

x = A R N S T V V S T A A ...
e = H H H H L L L E E E

p1 p2 p3 p4 p5 S
A  R  N  S  T  H
R  N  S  T  V  H
N  S  T  V  V  L
S  T  V  V  S  L
Semantics of record r = r(p1, p2, p3, p4, p5, S) is
f(x)|V_p3 = V_S, where V_a denotes the value of
the attribute a.
117
Size of classification datasets (tables)

The window mechanism produces very large datasets
For example window of size 13 applied to the
CB513 dataset of 513 protein subunits produces
about
70,000 records (instances)

118
Window

Window has the following parameters
PARAMETER 1: i ∈ N, the starting point of the
window as it moves along the sequence x = x1 x2
... xn. The value i = 1 means that the window
starts at x1; i = 5 means that it starts at x5
PARAMETER 2: w ∈ N denotes the size (length)
of the window.
For example, the PHD system of Rost and Sander
(1994) uses two window sizes: 13 and 17.

119
Window

PARAMETER 3: p ∈ {1, 2, ..., w},
where p is a special position of the window
that returns the classification attribute values
from S = {H, E, L} and w is the size (length) of
the window
PSSP PROBLEM:
find the optimal size w and the optimal special
position p for the best prediction accuracy

120
Window Symbolic Definition

Window arguments: window parameters and secondary
structure (x, e)
Window value: (subsequence of x, element of e)
OPERATION (sequence-to-structure window)
W is a partial function
W: N × N × {1, ..., k} × (A* × S*) → A* × S
W(i, k, p, (x, e)) = (x_i x_(i+1) ... x_(i+k-1),
f(x)|x_(i+p-1)), where (x, e) = (x1 x2 ... xn, e1 e2 ... en)

121
Neural network models

machine learning approach
provide training sets of structures (e.g.
α-helices, non-α-helices)
are trained to recognize patterns in known
secondary structures
provide test set (proteins with known structures)
accuracy 70-75%

122
Reasons for improved accuracy

Align sequence with other related proteins of the
same protein family
Find members that have a known structure
If there are significant matches between structure
and sequence, assign secondary structures to
corresponding residues

123
3 State Neural Network
124
Neural Network
125
Input Layer

Most approaches set w = 17. Why?
Based on evidence of statistical correlation with
secondary structure as far as 8 residues on
either side of the prediction point
The input layer consists of
17 blocks, each representing a position of the
window
Each block has 21 units
The first 20 units represent the 20 aa
One unit provides a null input, used when the
moving window overlaps the amino- or
carboxyl-terminal end of the protein

126
Binary Encoding Scheme

Example
Let w = 5, and let us say we have the sequence
A E G K Q.
Then the input layer is
   A C D E F G ... Y null
A  1 0 0 0 0 0 ... 0 0
E  0 0 0 1 0 0 ... 0 0
G  0 0 0 0 0 1 ... 0 0
...
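
The binary encoding can be sketched as follows. The column order `AMINO_ACIDS`, the use of "-" as the padding symbol, and the function name are assumptions; only the 21-units-per-position layout comes from the slides.

```python
# Assumed column order: the 20 standard amino acids in alphabetical one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
NULL_UNIT = len(AMINO_ACIDS)  # index 20: the extra "null" unit


def encode_window(window):
    """Encode each window position as 21 binary units (20 amino acids + null)."""
    blocks = []
    for residue in window:
        block = [0] * (len(AMINO_ACIDS) + 1)
        index = AMINO_ACIDS.find(residue)
        # Anything that is not one of the 20 letters (e.g. "-" padding beyond
        # a chain terminus) switches on the null unit instead.
        block[index if index >= 0 and residue else NULL_UNIT] = 1
        blocks.append(block)
    return blocks


if __name__ == "__main__":
    for residue, block in zip("AEGKQ", encode_window("AEGKQ")):
        print(residue, block)
```

For w = 17 the same function yields 17 blocks of 21 units, i.e., 357 input units in total.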

127
Hidden Layer

Represent the structure of the central aa
Encoding scheme
Can use two units to represent
(1,0) = H, (0,1) = E, (0,0) = L
Some use three units
(1,0,0) = H, (0,1,0) = E, (0,0,1) = L
For each connection, we can assign some weight
value.
This weight value can be adjusted to best fit the
data (training)

128
Output Level

Based on the hidden level and some function f,
calculate the output.
Helix is assigned to any group of 4 or more
contiguous residues
having helix output values greater than sheet
outputs and greater than some threshold t
Strand (E) is assigned to any group of two or
more contiguous residues having sheet output
values greater than helix outputs and greater
than t
Otherwise, assigned to L
Note that t can be adjusted as well (training)
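
The assignment rule above can be sketched as a two-pass procedure: label each residue from its outputs, then keep only helix runs of at least four residues and strand runs of at least two. The function name, the default threshold t = 0.5, and the toy output values are assumptions for illustration.

```python
def assign_states(helix_out, sheet_out, t=0.5, min_helix=4, min_strand=2):
    """Turn per-residue helix/sheet outputs into a string over H, E, L."""
    # Pass 1: tentative per-residue label from the two output values.
    raw = []
    for h, s in zip(helix_out, sheet_out):
        if h > s and h > t:
            raw.append("H")
        elif s > h and s > t:
            raw.append("E")
        else:
            raw.append("L")

    # Pass 2: keep a run only if it meets the minimum length for its state.
    result = ["L"] * len(raw)
    start = 0
    while start < len(raw):
        end = start
        while end < len(raw) and raw[end] == raw[start]:
            end += 1
        state, length = raw[start], end - start
        if (state == "H" and length >= min_helix) or (state == "E" and length >= min_strand):
            result[start:end] = [state] * length
        start = end
    return "".join(result)


if __name__ == "__main__":
    helix = [0.9, 0.8, 0.9, 0.7, 0.1, 0.8, 0.1, 0.1, 0.2, 0.1]
    sheet = [0.1, 0.1, 0.0, 0.2, 0.2, 0.1, 0.9, 0.8, 0.1, 0.7]
    print(assign_states(helix, sheet))
```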

129
How PHD works

Step 1. BLAST search with input sequence
Step 2. Perform multiple seq. alignment and
calculate aa frequencies for each position
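
Step 2 can be illustrated with a small profile calculation. This is a sketch, not the PHD implementation: the function name is hypothetical, gaps are assumed to be written as "-" and are excluded from the counts, and the five-row alignment is a toy example.

```python
from collections import Counter


def column_frequencies(alignment):
    """Per-column amino acid frequencies of a multiple sequence alignment."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for column in zip(*alignment):
        counts = Counter(residue for residue in column if residue != "-")
        total = sum(counts.values())
        profile.append({aa: n / total for aa, n in counts.items()} if total else {})
    return profile


if __name__ == "__main__":
    toy_alignment = ["NKL", "SKL", "SRL", "ARL", "A-L"]
    for position, freqs in enumerate(column_frequencies(toy_alignment), start=1):
        print(position, freqs)
```

Each profile column can then replace the single binary block for that window position as the network input.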

130
How PHD works

Step 3. First level: sequence-to-structure net
Input: alignment profile. Output: units for H,
E, L
Calculate occurrences of any of the residues
to be present in either an α-helix, β-strand, or
loop.

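[Figure: alignment profile over window positions 1-7 (e.g., N = 0.2, S = 0.4, A = 0.4 at one position); first-level outputs H = 0.05, E = 0.18, L = 0.67]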
131
How PHD works

Step 3. Second level: structure-to-structure
net
Input: first-level values. Output: units for H,
E, L
Window size: 17

H 0.59 E 0.09 L 0.31
E0.18
Step 4. Decision level
132
Prepare Data for PHD Neural Nets

Starting from a sequence of unknown structure
(SEQUENCE ) the following steps are required to
finally feed evolutionary information into the
PHD neural networks
a database search for homologues (method: BLAST),
a refined profile-based dynamic-programming
alignment of the most likely homologues (method:
MaxHom),
a decision for which proteins will be considered
as homologues (length-dependent cut-off for
pairwise sequence identity),
a final refinement, and extraction of the
resulting multiple alignment. Numbers 1-3
indicate the points where users of the
PredictProtein service can intervene to improve
prediction accuracy without changes made to the
final prediction method PHD.
http://cubic.bioc.columbia.edu/papers/2000_rev_humana/paper.html

133
PHD Neural Network
134
Prediction Accuracy
135
Where can I learn more?

Protein Structure Prediction Center
Biology and Biotechnology Research
Program, Lawrence Livermore National Laboratory,
Livermore, CA
http://predictioncenter.llnl.gov/Center.html

DSSP: database of secondary structure
assignments http://www.sander.ebi.ac.uk/dssp/
136
Computational Molecular Biology

Protein Structure: Tertiary Prediction via
Threading

137
Objective

Study the problem of predicting the tertiary
structure of a given protein sequence

138
A Few Examples
[Figure: pairs of actual and predicted structures]
139
Two Comparative Modeling Approaches

Homology modeling: identify homologous
proteins through sequence alignment, then predict
the structure by placing residues into the
corresponding positions of the homologous
structure models
Protein threading: predict the structure by
identifying a good sequence-structure fit
We will focus on protein threading.

140
Why it Works?

Observations
Many protein structures in the PDB are very
similar
E.g., many 4-helical bundles and globins in the
set of solved structures
Conjecture
There are only a limited number of unique
protein folds in nature

141
Threading Method

General Idea
Try to determine the structure of a new sequence
by finding its best fit to some fold in a library
of structures
Sequence-Structure Alignment Problem
Given a solved structure T for a sequence t1 t2 ... tn
and a new sequence S = s1 s2 ... sm, we need to find
the best match between S and T

142
What to Consider

How to evaluate (score) a given alignment of S
with a structure T?
How to efficiently search over all possible
alignments?

143
Three Main Approaches

Protein Sequence Alignment
3D Profile Method
Contact Potentials

144
Protein Sequence Alignment Method

Align two sequences S and T
If in the alignment, si aligns with tj, assign si
to the position pj in the structure
Advantages
Simple
Disadvantages
Similar structures can have highly variable
sequences, so sequence alignment alone may not be
very helpful
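
As a minimal sketch of the assignment step, the function below reads a finished pairwise alignment and records which template position each residue of S lands on. The name positions_from_alignment, the "-" gap character, and the 0-based indices are assumptions made for the example.

```python
def positions_from_alignment(aligned_s, aligned_t):
    """Map each residue index of S to the template position it aligns with.

    Returns {i: j}, meaning s_i is placed at structure position p_j (0-based).
    Residues of S aligned to a gap in T get no position.
    """
    if len(aligned_s) != len(aligned_t):
        raise ValueError("aligned strings must have equal length")
    mapping = {}
    i = j = 0
    for a, b in zip(aligned_s, aligned_t):
        if a != "-" and b != "-":
            mapping[i] = j
        if a != "-":
            i += 1
        if b != "-":
            j += 1
    return mapping


print(positions_from_alignment("AC-GT", "A-DGT"))  # {0: 0, 2: 2, 3: 3}
```

Residue 1 of S (the C) faces a gap in T, so it receives no coordinates, which is one reason pure sequence alignment leaves parts of the model undefined.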

145
3D Profile Method

Actually uses structural information
Main idea
Reduce the 3D structure to a 1D string describing
the environment of each position in the protein
(called the 3D profile of the fold)
To determine if a new sequence S belongs to a
given fold T, we align the sequence with the
fold's 3D profile
First question: how to create the 3D profile?

146
Create the 3D Profile

For a given fold, do
1. For each residue, determine
How buried is it?
What fraction of its surrounding environment is polar?
What secondary structure is it in (alpha-helix,
beta-sheet, or neither)?

147
Create the 3D profile

2. Assign an environment class to each position
Six classes describe the burial and polarity
criteria (exposed, partially buried, very buried,
different fractions of polar environment)

148
Create the 3D Profile

These environment classes depend on the number of
surrounding polar residues and how buried the
position is.
Each of the six classes is combined with one of 3
secondary structure states, giving 18
environment classes
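
The count of 18 is simply 6 burial/polarity classes times 3 secondary structure states. The sketch below enumerates them. The class labels (B1, B2, B3, P1, P2, E) are assumed names for three buried, two partially buried, and one exposed class; the slides give only the number of classes, not their names or thresholds.

```python
from itertools import product

# Assumed labels; the slides state only that there are six such classes.
BURIAL_POLARITY = ["B1", "B2", "B3", "P1", "P2", "E"]
SECONDARY = ["helix", "sheet", "other"]

# Every combination of a burial/polarity class with a secondary structure state.
ENVIRONMENT_CLASSES = [f"{bp}/{ss}" for bp, ss in product(BURIAL_POLARITY, SECONDARY)]


def environment_string(per_residue):
    """Turn a list of (burial_polarity, secondary) pairs into class labels."""
    labels = [f"{bp}/{ss}" for bp, ss in per_residue]
    unknown = [lab for lab in labels if lab not in ENVIRONMENT_CLASSES]
    if unknown:
        raise ValueError(f"unknown environment classes: {unknown}")
    return labels


print(len(ENVIRONMENT_CLASSES))  # 18
print(environment_string([("B1", "helix"), ("E", "other")]))
```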

149
Create the 3D Profile

3. Convert the known structure T to a string E of
environment descriptors
4. Align the new sequence S with E using dynamic
programming

150
Scores for Alignment

Need scores for aligning individual residues with
environments.
Key idea: different amino acids prefer different
environments, so determine the scores from
statistical data

151
Scores for Alignment

Choose a database of known structures
Tabulate the number of times we see a particular
residue in a particular environment class, then
compute the score for each pair of environment
class and amino acid
Choose gap penalties, e.g., charge more for
gaps in alpha and beta environments
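
One common way to turn such counts into scores is a log-odds ratio, sketched below. The slides do not state the formula, so the log-odds form, the pseudocount, and the function name environment_scores are all assumptions for illustration.

```python
import math
from collections import Counter


def environment_scores(observations, pseudocount=1.0):
    """Log-odds score for each (environment class, residue) pair.

    observations: iterable of (env_class, residue) pairs from known structures.
    score = ln( P(residue | env) / P(residue) )   (assumed form)
    """
    obs = list(observations)
    pair_counts = Counter(obs)
    env_counts = Counter(env for env, _ in obs)
    aa_counts = Counter(aa for _, aa in obs)
    residues = sorted(aa_counts)
    total = len(obs)
    scores = {}
    for env in env_counts:
        # Pseudocounts keep unseen pairs from producing log(0).
        denom = env_counts[env] + pseudocount * len(residues)
        for aa in residues:
            p_aa_given_env = (pair_counts[(env, aa)] + pseudocount) / denom
            p_aa = aa_counts[aa] / total
            scores[(env, aa)] = math.log(p_aa_given_env / p_aa)
    return scores


# Toy data: leucine (L) mostly buried, lysine (K) mostly exposed.
data = ([("buried", "L")] * 8 + [("buried", "K")] * 2
        + [("exposed", "L")] * 2 + [("exposed", "K")] * 8)
scores = environment_scores(data)
print(scores[("buried", "L")] > 0 > scores[("buried", "K")])  # True
```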

152
Alignment

This gives us a table of scores for aligning an
aa sequence with an environment string
Using this scoring and Dynamic Programming, we
can find an optimal alignment and score for each
fold in our library
The fold with the highest score is the best fold
for the new sequence
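
A minimal sketch of this step, assuming a global (Needleman-Wunsch style) alignment with a single linear gap penalty. The slides suggest environment-dependent gap penalties, which are left out here; the names align_score and best_fold and the toy score table are hypothetical.

```python
def align_score(seq, env_string, score, gap=-1.0):
    """Best global alignment score of a sequence against an environment string.

    score[(env, aa)] is the residue/environment score; gap is a linear penalty.
    """
    n, m = len(seq), len(env_string)
    # dp[i][j] = best score aligning seq[:i] with env_string[:j]
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = dp[i - 1][j - 1] + score[(env_string[j - 1], seq[i - 1])]
            dp[i][j] = max(match, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
    return dp[n][m]


def best_fold(seq, library, score, gap=-1.0):
    """Return (fold_name, score) for the highest-scoring fold in the library."""
    scored = ((name, align_score(seq, env, score, gap)) for name, env in library.items())
    return max(scored, key=lambda item: item[1])


# Toy table: 'b' = buried, 'e' = exposed; L prefers buried, K prefers exposed.
score = {("b", "L"): 1.0, ("b", "K"): -1.0, ("e", "L"): -1.0, ("e", "K"): 1.0}
library = {"foldA": "bbee", "foldB": "eebb"}
print(best_fold("LLKK", library, score))  # ('foldA', 4.0)
```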

153
Contact Potentials Method

Take 3D structure into account more carefully
Include information about how residues interact
with each other
Consider pairwise interactions between the
positions pi, pj in the fold
For a given alignment, produce a score which is
the sum over these interactions

154
Problem

Have a sequence from the database T = t1 ... tn with
known positions p1 ... pn, and a new sequence S =
s1 ... sm.
Find 1 <= r1 < r2 < ... < rn <= m which maximize
[equation: total score, summed over pairwise
interactions between positions pi, pj]
where ri is the index of the amino acid in S which
occupies position pi
This problem is NP-complete for pairwise
interactions
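
To make the search problem concrete, the sketch below scores one placement r and then tries every increasing placement by brute force. It only illustrates why the search space explodes and is not a practical threading algorithm. The function names, the 0-based indices, and the toy potential table are assumptions.

```python
from itertools import combinations


def threading_score(seq, placement, contacts, potential):
    """Sum pairwise scores over contacting positions for one placement.

    placement[i] is the index in seq of the residue occupying position p_i.
    contacts is a list of position pairs (i, j) that interact in the fold.
    potential is keyed by alphabetically sorted residue pairs.
    """
    total = 0.0
    for i, j in contacts:
        pair = tuple(sorted((seq[placement[i]], seq[placement[j]])))
        total += potential[pair]
    return total


def best_threading(seq, n_positions, contacts, potential):
    """Exhaustively try every increasing placement r_1 < ... < r_n."""
    return max(combinations(range(len(seq)), n_positions),
               key=lambda r: threading_score(seq, r, contacts, potential))


# Toy fold with two positions in contact; L-L contacts are favourable.
potential = {("L", "L"): 2.0, ("K", "L"): 0.0, ("K", "K"): -1.0}
best = best_threading("KLKL", 2, [(0, 1)], potential)
print(best, threading_score("KLKL", best, [(0, 1)], potential))  # (1, 3) 2.0
```

The number of placements is "m choose n", which is why exhaustive search is feasible only for toy inputs.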

155
How to Define that Score?

Use so-called knowledge-based potentials, which
come from databases of observed interactions.
The general form: [equation]

156
How to Define the Score

General Idea
Define a cutoff parameter for contact (e.g., up to
6 Angstroms)
Use the PDB to count the number of times amino
acids i and j are in contact
Several methods for normalization, e.g.,
normalization by hypothetical random
frequencies
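
A rough sketch of this counting and normalization, assuming one representative coordinate per residue and a random-mixing reference state. The log-odds form, the choice of reference state, and the function name contact_potential are assumptions; real potentials typically also skip residues that are adjacent in sequence.

```python
import math
from collections import Counter
from itertools import combinations


def contact_potential(structures, cutoff=6.0):
    """Log-odds contact scores from a set of structures.

    structures: list of structures, each a list of (residue_type, (x, y, z)).
    Returns {(a, b): ln(observed / expected)} with a <= b alphabetically.
    Positive values mean the pair is seen more often than random mixing predicts.
    """
    pair_counts = Counter()
    type_counts = Counter()  # how often each residue type takes part in a contact
    for residues in structures:
        for (a, pa), (b, pb) in combinations(residues, 2):
            if math.dist(pa, pb) <= cutoff:
                pair_counts[tuple(sorted((a, b)))] += 1
                type_counts[a] += 1
                type_counts[b] += 1
    total_pairs = sum(pair_counts.values())
    total_ends = sum(type_counts.values())
    scores = {}
    for (a, b), observed in pair_counts.items():
        fa = type_counts[a] / total_ends
        fb = type_counts[b] / total_ends
        # Unordered pairs: a mixed pair (a != b) can arise in two orders.
        expected = total_pairs * fa * fb * (1 if a == b else 2)
        scores[(a, b)] = math.log(observed / expected)
    return scores


# Toy structure with three isolated contacts: L-L, K-K, and L-K.
toy = [("L", (0, 0, 0)), ("L", (3, 0, 0)), ("K", (20, 0, 0)),
       ("K", (23, 0, 0)), ("L", (40, 0, 0)), ("K", (43, 0, 0))]
scores = contact_potential([toy])
print(scores[("L", "L")] > 0 > scores[("K", "L")])  # True
```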

157
Other Variations

Many other variations in defining the potentials
In addition to pairwise potentials, consider
single residue potentials
Distance-dependent intervals: count pairwise
contacts separately for each distance interval
(within 1 Angstrom, between 1 and 2 Angstroms,
and so on)

158
Threading via Tree-Decomposition
159
Contact Graph

Each residue as a vertex
One edge between two residues if their spatial
distance is within given cutoff.
Cores are the most conserved segments in the
template
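
Building such a graph from coordinates is straightforward. The sketch below assumes one representative point per residue (for example, the alpha-carbon) and uses the hypothetical name contact_graph; it does not model the core segments.

```python
import math
from itertools import combinations


def contact_graph(coords, cutoff):
    """Adjacency sets: residues i and j are linked if within the cutoff."""
    graph = {i: set() for i in range(len(coords))}
    for i, j in combinations(range(len(coords)), 2):
        if math.dist(coords[i], coords[j]) <= cutoff:
            graph[i].add(j)
            graph[j].add(i)
    return graph


coords = [(0, 0, 0), (4, 0, 0), (8, 0, 0), (0, 5, 0)]
print(contact_graph(coords, cutoff=6.0))
# {0: {1, 3}, 1: {0, 2}, 2: {1}, 3: {0}}
```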

160
Simplified Contact Graph
161
Alignment Example
162
Alignment Example
163
Calculation of Alignment Score

164
Graph Labeling Problem

Each core as a vertex
Two cores interact if there is an interaction
between any two residues, each in one core
Add one edge between two cores that interact.

[Figure: example graph with vertices a, b, c, d,
e, f, h, i, j, k, l, m, s]
Each possible sequence alignment position for a
single core can be treated as a possible label
assignment to a vertex in G. Let Di be the set of
all possible label assignments to vertex i. Then
for each label assignment A(i) in Di, we have
[equation]
165
Tree Decomposition
166
Tree Decomposition (Robertson & Seymour, 1986)
Greedy minimum degree heuristic

Choose the vertex with minimum degree
The chosen vertex and its neighbors form a
component
Add an edge between every pair of neighbors of
the chosen vertex
Remove the chosen vertex
Repeat the above steps until the graph is empty
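
The heuristic above can be sketched as follows. The function name min_degree_decomposition and the alphabetical tie-breaking are assumptions, and the sketch returns only the components, without building the tree edges that connect them.

```python
def min_degree_decomposition(graph):
    """Greedy minimum-degree elimination.

    graph: {vertex: set of neighbours}. Returns the list of components,
    one per eliminated vertex, in elimination order.
    """
    g = {v: set(nbrs) for v, nbrs in graph.items()}  # work on a copy
    bags = []
    while g:
        # Choose the vertex with minimum degree (ties broken by vertex order).
        v = min(g, key=lambda u: (len(g[u]), u))
        nbrs = g[v]
        # The chosen vertex and its neighbours form a component.
        bags.append({v} | nbrs)
        # Connect every pair of neighbours of the chosen vertex.
        for a in nbrs:
            g[a] |= nbrs - {a}
        # Remove the chosen vertex.
        for a in nbrs:
            g[a].discard(v)
        del g[v]
    return bags


# A 4-cycle a-b-c-d-a: the largest component has 3 vertices (width 2).
cycle = {"a": {"b", "d"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"a", "c"}}
bags = min_degree_decomposition(cycle)
print(max(len(bag) for bag in bags) - 1)  # 2
```

The size of the largest component minus one is the width of the resulting decomposition, which governs the cost of the dynamic programming that follows.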

167
Tree Decomposition (Cont'd)
168
Tree Decomposition-Based Algorithms

1. Bottom-to-top: calculate the minimal F function
2. Top-to-bottom: extract the optimal assignment

[Figure: a tree decomposition rooted at Xr,
annotated with the score of component Xi and the
scores of the subtrees rooted at Xi, Xj, and Xl]
