Title: Prof Shoba Ranganathan Dept. of Chemistry and Biomolecular Sciences, Macquarie University, Sydney, Australia
1Prof Shoba Ranganathan
- Dept. of Chemistry and Biomolecular Sciences, Macquarie University, Sydney, Australia
- Dept. of Biochemistry, Yong Loo Lin School of Medicine, National University of Singapore (shoba_at_bic.nus.edu.sg)
- Biomolecular Modeling: building a 3D protein structure from its sequence
2Why protein structure?
- In the factory of the living cell, proteins are
the workers, performing a variety of tasks
- Each protein adopts a particular folding pattern that determines its function
- The 3D structure of a protein brings into close proximity residues that are far apart in the amino acid sequence
3How does a protein fold?
- Most newly synthesized proteins fold without assistance!
- Ribonuclease A: the denatured protein could refold and recover its activity (C. Anfinsen, 1966)
- Structure implies function
- The amino acid sequence encodes the protein's structural information
4- Understanding Protein Structure
- A Quick Overview of Sequence Analysis
- Finding a Structural Homologue
- Template Selection
- Aligning the Query Sequence to Template Structure(s)
- Building the Model
5The basics
- Proteins are linear heteropolymers: one or more polypeptide chains
- Repeat units: 20 amino acid residues
- Range from a few tens to thousands of residues
- Three-dimensional shapes (folds) adopted vary enormously
- Experimental methods: X-ray crystallography, electron microscopy and NMR (nuclear magnetic resonance)
6The (L-)amino acid
[Figure: an L-amino acid, labelled with side chain R (H, CH3, ...), backbone, amino group, Cα and carboxylate]
7The peptide bond
8Coplanar atoms
9Levels of protein structure
- Zeroth: amino acid composition
- Primary
- This is simply the order of covalent linkages along the polypeptide chain, i.e. the sequence itself
10Levels of protein structure
- Secondary
- Local organization of the protein backbone: α-helix, β-strand (which assemble into β-sheets), turn and interconnecting loop
11Ramachandran / phi-psi plot
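[Figure: Ramachandran plot of ψ (vertical axis) against φ (horizontal axis), with regions labelled β-sheet, α-helix (right handed) and α-helix (left handed)]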
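Each point on a Ramachandran plot is a pair of backbone dihedral angles: φ is measured over the atoms C(i-1), N, Cα, C and ψ over N, Cα, C, N(i+1). The sketch below shows one common way to compute a signed dihedral from four points. The function name `dihedral` and the helper names are illustrative, not taken from any package mentioned in these slides.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle (degrees) about the p1-p2 bond.

    For phi pass (C of residue i-1, N, CA, C); for psi pass
    (N, CA, C, N of residue i+1). Points are (x, y, z) tuples in Å.
    """
    b0 = _sub(p1, p0)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    n1 = _cross(b0, b1)          # normal to the plane p0-p1-p2
    n2 = _cross(b1, b2)          # normal to the plane p1-p2-p3
    x = _dot(n1, n2)
    y = _dot(_cross(n1, n2), b1) / math.sqrt(_dot(b1, b1))
    return math.degrees(math.atan2(y, x))

# Toy geometry: a cis arrangement gives 0 degrees, a trans one gives 180
print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1)))
```

Applied to every residue of a structure, the resulting (φ, ψ) pairs can be plotted directly; a right-handed α-helix typically falls near φ ≈ -60°, ψ ≈ -45°.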
12Levels of protein structure
- Tertiary
- packing of secondary structure elements into a compact spatial unit
- Fold or domain: this is the level to which structure prediction is currently possible
13Levels of protein structure
- Quaternary
- Assembly of homo- or heteromeric protein chains
- Usually the functional unit of a protein,
especially for enzymes
14Structural classes
All-α (helical)
All-β (sheet)
15Structural classes
α/β (parallel β-sheet)
α+β (antiparallel β-sheet)
16Structural information
- Protein Data Bank: maintained by the Research Collaboratory for Structural Bioinformatics
- http://www.rcsb.org/pdb
- > 45,744 structures of proteins
- Also contains structures of DNA, carbohydrates, protein-DNA complexes and numerous small ligand molecules.
17The PDB data
- Text files
- Each entry is identified by a unique 4-character code, say 1emg
- 1emg entry
- Header information
- Atomic coordinates in Å (1 Ångstrom = 1.0e-10 m)
18PDB Header details
- identifies the molecule, any modifications, date of release of PDB entry
- organism, keywords, method
- Authors, reference, resolution if X-ray structure
- Sequence, x-reference to sequence databases
HEADER    GREEN FLUORESCENT PROTEIN               12-NOV-98   1EMG
TITLE     GREEN FLUORESCENT PROTEIN (65-67 REPLACED BY CRO, S65T
TITLE    2 SUBSTITUTION, Q80R)
COMPND    MOL_ID: 1
COMPND   2 MOLECULE: GREEN FLUORESCENT PROTEIN
COMPND   3 CHAIN: A
COMPND   4 ENGINEERED: YES
COMPND   5 MUTATION: 65 - 67 REPLACED BY CRO, S65T SUBSTITUTION, Q80R
COMPND   6 SUBSTITUTION
COMPND   7 BIOLOGICAL_UNIT: MONOMER
19The data itself
- Coordinates for each heavy (non-hydrogen) atom from the first residue to the last
- Any ligands (starting with HETATM) follow the biomacromolecule
- O of water molecules (also HETATM) at the end
ATOM      1  N   SER A   2      29.089   9.397  51.904  1.00 81.75
ATOM      2  CA  SER A   2      27.883  10.162  52.185  1.00 79.71
ATOM      3  C   SER A   2      26.659   9.634  51.463  1.00 82.64
ATOM      4  O   SER A   2      26.718   8.686  50.686  1.00 81.02
ATOM      5  CB  SER A   2      28.039  11.660  51.932  1.00 75.59
ATOM      6  OG  SER A   2      27.582  12.038  50.639  1.00 43.28
-------
ATOM   1737  CD1 ILE A 229      39.535  21.584  52.346  1.00 41.62
TER    1738      ILE A 229
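PDB coordinate records are fixed-width text, so each field can be read by column position. The following is a minimal sketch of reading ATOM/HETATM lines, assuming the standard PDB column layout; the `Atom` tuple and function names are illustrative, and the two sample lines are the first two atoms of the 1EMG excerpt.

```python
from typing import List, NamedTuple

class Atom(NamedTuple):
    serial: int
    name: str
    res_name: str
    chain: str
    res_seq: int
    x: float
    y: float
    z: float
    b_factor: float

def parse_atom_line(line: str) -> Atom:
    """Slice one ATOM/HETATM record by its fixed columns (0-based slices)."""
    return Atom(
        serial=int(line[6:11]),
        name=line[12:16].strip(),
        res_name=line[17:20].strip(),
        chain=line[21],
        res_seq=int(line[22:26]),
        x=float(line[30:38]),
        y=float(line[38:46]),
        z=float(line[46:54]),
        b_factor=float(line[60:66]),
    )

def parse_atoms(text: str) -> List[Atom]:
    """Keep coordinate records only; HEADER, REMARK, TER etc. are skipped."""
    return [parse_atom_line(l) for l in text.splitlines()
            if l.startswith(("ATOM  ", "HETATM"))]

SAMPLE = """\
ATOM      1  N   SER A   2      29.089   9.397  51.904  1.00 81.75
ATOM      2  CA  SER A   2      27.883  10.162  52.185  1.00 79.71
TER    1738      ILE A 229
"""

for atom in parse_atoms(SAMPLE):
    print(atom.name, atom.res_name, atom.res_seq, atom.x, atom.y, atom.z)
```

Splitting on whitespace looks simpler but fails when adjacent fields touch, for example large coordinates or four-character atom names, which is why column slicing is used here.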
20Structural Families
- SCOP: Structural Classification Of Proteins
- http://scop.mrc-lmb.cam.ac.uk/scop
- FSSP: Family of Structurally Similar Proteins
- http://www.ebi.ac.uk/dali/fssp/
- CATH: Class, Architecture, Topology, Homology
- http://www.biochem.ucl.ac.uk/bsm/cath
21Structure comparison facts
- Proteins adopt a limited number of topologies.
- Homologous sequences show very similar structures, with strong conservation in secondary structural elements and variations in non-conserved regions.
- In the absence of sequence homology, some folds are preferred by vastly different sequences.
22Structure comparison facts
- The active site (a collection of functionally critical residues) is remarkably conserved, even when the protein fold is different.
- Structural models (especially those based on homology) provide insights into possible function for new proteins.
- Implications for
- protein engineering,
- ligand/drug design,
- function assignment of genomic data.
23Visualizing PDB information
- RASMOL: most popular, available for all platforms
- (Sayle et al, 2005)
- http://www.bernstein-plus-sons.com/software/rasmol
- DeepView: Swiss-PDBViewer from Swiss-Prot
- (Guex & Peitsch, 1997)
- http://tw.expasy.org/spdbv/
- Chemscape Chime: plug-in for PC and Mac
- http://www.mdli.com/products/framework/chemscape
- PyMOL: very good, available for all platforms
- (DeLano, W.L. The PyMOL Molecular Graphics System, 2002)
- http://pymol.sourceforge.net
24RASMOL views - SH2 domain
All-atom model
Space-filling model
Atom colors N O C S
25RASMOL views 1sha
Cα trace
Ribbon
Rainbow coloring N to C
Coloring by structural units
26Homologous folds
- Hemoglobin and erythrocruorin: 31% sequence identity
27Analogous folds
- Hemoglobin and phycocyanin: 9% sequence identity
28Surface Properties
- Cro repressor DNA complex
- Basic residues in blue
- Acidic residues in red
29Mapping Functional Regions
- Immunoglobulin λ light chain dimer
- Hydrophobic residues in magenta
- Hydrophilic and charged residues in cyan
30- Understanding Protein Structure
- A Quick Overview of Sequence Analysis
- Finding a Structural Homologue
- Template Selection
- Aligning the Query Sequence to Template Structure(s)
- Building the Model
31Siblings and Cousins
- Siblings or homologues: sequences with at least 30% sequence identity over an alignment length of at least 125 residues and conservation of function.
- Cousins or paralogues: < 30% identity but with conservation of function
- Both show structural conservation
- Homologues located using a database search tool such as BLAST (free webserver) http://www.ncbi.nlm.nih.gov/BLAST
- Paralogues require a more sensitive method such as PSI-BLAST
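The sibling/cousin rule above can be written down as a small check on a pairwise alignment. This is only a sketch: the function names are illustrative, and identity is computed here over columns where both sequences have a residue, which is one of several conventions in use (an assumption, not something the slides specify).

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percent identity over aligned columns; '-' marks a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    aligned = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not aligned:
        return 0.0
    matches = sum(1 for a, b in aligned if a == b)
    return 100.0 * matches / len(aligned)

def classify(identity: float, aligned_length: int, same_function: bool) -> str:
    """Apply the thresholds from the slide: 30% identity, 125 residues."""
    if not same_function:
        return "unclassified"
    if identity >= 30.0 and aligned_length >= 125:
        return "sibling"
    if identity < 30.0:
        return "cousin"
    return "unclassified"   # e.g. high identity over a short alignment

# Toy alignment: 4 identical residues in 5 aligned columns
print(percent_identity("ACDE-G", "ACDFHG"))
print(classify(35.0, 150, same_function=True))
```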
32Multiple Sequence Alignment
- Finding the best way to match the residues of related sequences
- Identical residues must be lined up
- The rest should be arranged, based on
- observed substitution in protein families
- chemical similarity
- charge similarity
- Where it is impossible to get the residues to line up, the biological concept of insertion/deletion is invoked: the gap in alignments
33MSA Methods
- CLUSTALW / CLUSTALX (Thompson et al, 1997): freely available for all platforms and one of the best alignment programs
- http://www-igbmc.u-strasbg.fr/BioInfo/ClustalX/Top.html
- MAXHOM (Sander & Schneider, 1991): alignment based on maximum homology, available via the PredictProtein webserver, free for academics
- http://cubic.bioc.columbia.edu/predictprotein/
- MALIGN (Johnson et al, 1994): freely available UNIX program, based on the structural alignment of protein families
- http://www.abo.fi/fak/mnf/bkf/research/johnson/software.html
34Alignment Checks
- Conservation of functionally important residues, e.g. the catalytic triad (Asp-Ser-His) that is essential for serine proteinase activity
- Line up of structurally important residues, e.g. cysteines forming disulfide bonds
- Overall, maximizing the alignment of like residues
- Completely conserved residues usually indicate some conserved structural or functional role, especially buried charges
35Sequence Motifs & Patterns
- From the analysis of the alignment of protein families
- Conserved sequence features, usually associated with a specific function
- PROSITE (Hulo et al, 2006): database for protein signature patterns http://www.expasy.ch/prosite
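Signature patterns of the PROSITE kind can be scanned against a sequence by translating them into regular expressions. The converter below is a simplified sketch covering only the common syntax elements (`x`, `[...]`, `{...}`, repeat counts, and `<`/`>` anchors at the ends); the function name is illustrative. The C2H2 zinc finger pattern used as the example is written from memory of the PROSITE entry and should be checked against the database before use.

```python
import re

_ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")

def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        at_start = element.startswith("<")   # N-terminal anchor
        at_end = element.endswith(">")       # C-terminal anchor
        m = _ELEMENT.fullmatch(element.strip("<>"))
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = m.groups()
        if core == "x":
            core = "."                        # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"    # any residue except these
        if low:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(("^" if at_start else "") + core + ("$" if at_end else ""))
    return "".join(parts)

# Assumed example pattern (C2H2-type zinc finger); verify against PROSITE
C2H2 = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H."
regex = prosite_to_regex(C2H2)
print(regex)

# Hypothetical test sequence built to satisfy the pattern
toy = "MK" + "C" + "AA" + "C" + "AAA" + "L" + "A" * 8 + "H" + "AAA" + "H" + "GG"
print(bool(re.search(regex, toy)))
```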
36Aligned Sequence Families
- From alignments of homologous sequences
- PRINTS
- PRODOM http://www.toulouse.inra.fr/prodom.html
- From Hidden Markov Model based methods
- PFAM http://www.sanger.ac.uk/Pfam
37Protein Domains
- Most proteins are composed of structural subunits called domains
- A domain is a compact unit of protein structure, usually associated with a function.
- It is usually a fold, in the case of monomeric soluble proteins.
- A domain normally comprises only one protein chain; rare examples involving 2 chains are known.
- Domains can be shared between different proteins, like a LEGO block
38Protein Architectures
- Beads-on-a-string (sequential location): tyrosine-protein kinase receptor TIE-1 (immunoglobulin, EGF, fibronectin type-3 and protein kinase).
- Domain insertions (plugged-in): pyruvate kinase (1pyk)
- SMART: smart.embl-heidelberg.de
- Simple Modular Architecture Retrieval Tool
39 Dissection into Domains
- A sequence, usually > 125 residues, should be routinely checked to see how many domains are present.
- Conserved Domain Architecture Retrieval Tool (CDART) uses information in Pfam and SMART to assign domains along a sequence
- E.g. NP_002917 shows similarity to G-protein regulators
40- Understanding Protein Structure
- A Quick Overview of Sequence Analysis
- Finding a Structural Homologue
- Template Selection
- Aligning the Query Sequence to Template Structure(s)
- Building the Model
41Structural Homologues
- BLASTP vs. PDB database or PSI-BLAST: look for 4-character PDB ID
- E < 0.005
- Domain coverage: at least 60% coverage is recommended
- Gaps: we don't want them. Choose between
- few gaps and reasonable similarity scores or
- lots of gaps and high similarity scores?
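The screening criteria above (E-value below 0.005, at least 60% coverage, few gaps) can be applied mechanically once the search hits are in tabular form. The sketch below assumes the hits have already been parsed into simple records; the `Hit` fields, the PDB IDs and the decision to rank by gap count first are all illustrative assumptions, since the slide leaves the gaps-versus-score trade-off open.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Hit:
    pdb_id: str        # 4-character PDB ID of the matched structure
    evalue: float
    query_start: int   # 1-based, inclusive
    query_end: int     # 1-based, inclusive
    gaps: int          # gap positions in the pairwise alignment

def coverage(hit: Hit, query_length: int) -> float:
    """Percentage of the query covered by the aligned region."""
    return 100.0 * (hit.query_end - hit.query_start + 1) / query_length

def select_templates(hits: List[Hit], query_length: int,
                     max_evalue: float = 0.005,
                     min_coverage: float = 60.0) -> List[Hit]:
    """Keep hits passing both thresholds, fewest gaps first."""
    kept = [h for h in hits
            if h.evalue < max_evalue and coverage(h, query_length) >= min_coverage]
    return sorted(kept, key=lambda h: (h.gaps, h.evalue))

# Hypothetical hits for a 200-residue query
hits = [
    Hit("1aaa", 1e-30, 1, 200, 12),
    Hit("2bbb", 1e-10, 10, 180, 2),
    Hit("3ccc", 0.5, 1, 200, 0),     # fails the E-value threshold
    Hit("4ddd", 1e-20, 1, 50, 0),    # fails the coverage threshold
]
for h in select_templates(hits, query_length=200):
    print(h.pdb_id, h.evalue, round(coverage(h, 200), 1), h.gaps)
```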
42Small Proteins Disulfide bonds
- BLAST-type methods may not locate homologues if Conserved Domain search is not turned on.
- Are the Cys residues conserved?
- Gaps: where are they on the structure?
- gnl|Pfam|pfam00095, wap, WAP-type (Whey Acidic Protein) 'four-disulfide core'.
- CD-Length = 46 residues, 100.0% aligned
- Score = 43.9 bits (102), Expect = 1e-06
Q: 49 KAGFCPWNLLQMISSTGPCPMKIECSSDRECSGNMKCCNVDCVMTCTPP 97
D:  1 KPGVCPWVSISE---AGQCLELNPCQSDEECPGNKKCCPGSCGMSCLTP 46
43Metal-binding domains
- C2H2 Zinc Finger
- 2 Cys 2 His binding to Zinc
- Not detected even by CD-search in BLAST
- Detected by Pfam & SMART
- Sequence Pattern
- -X-C-X(1-5)-C-X3--X5--X2-H-X(3-6)-H/C
44Structure Prediction Methods
- Secondary Structure Prediction: identify local structural elements such as helices, strands and loops.
- > 75% accuracy achievable
- PredictProtein or PHD
- http://cubic.bioc.columbia.edu/pp/
- PSIPRED
- http://bioinf.cs.ucl.ac.uk/psipred/
- SSPro
- http://promoter.ics.uci.edu/BRNN-PRED/
45Folds from Secondary Structure Predictions
- Assembling SSEs into folds is a combinatorial problem
- Current methods depend on available structural data for mapping predictions
- FORESST http://abs.cit.nih.gov/foresst/foresst.html
- TOPITS from the PHD server http://cubic.bioc.columbia.edu/pp
46Tertiary Structure Prediction
- Fold recognition/Threading: < 20% identity typically
- Best results obtained by combining several database search and knowledge-based tools
- 3D-PSSM http://www.sbg.bio.ic.ac.uk/3dpssm/
- FUGUE http://www-cryst.bioc.cam.ac.uk/fugue/
47- Understanding Protein Structure
- A Quick Overview of Sequence Analysis
- Finding a Structural Homologue
- Template Selection
- Aligning the Query Sequence to Template Structure(s)
- Building the Model
48One or many templates?
- Sequence similarity: extract template sequences and align with query; select the most similar structure
- Completeness: missing data?
REMARK 465 MISSING RESIDUES
REMARK 465 THE FOLLOWING RESIDUES WERE NOT LOCATED IN THE
REMARK 465 EXPERIMENT. (M=MODEL NUMBER; RES=RESIDUE NAME; C=CHAIN
REMARK 465 IDENTIFIER; SSSEQ=SEQUENCE NUMBER; I=INSERTION CODE.)
REMARK 465
REMARK 465   M RES C SSSEQI
REMARK 465     MET A     1
REMARK 465     THR A   230
REMARK 470   M RES CSSEQI  ATOMS
REMARK 470     GLU A   5    OE2
REMARK 470     GLU A   6    CG   CD   OE1  OE2
REMARK 470     GLU A  17    OE1
49One or many templates?
- X-ray or NMR?
- Highest-resolution X-ray structure (lowest resolution value in Å)
- Prefer X-ray, then NMR
- NMR: average over the ensemble of models
- One or many?
- Structure alignment of Cα atoms
- If 2 templates are very close, keep only one
- Keep templates that provide new information
50Many templates
- Sequence alignment from structure comparison of templates (SSA) can be different from a simple sequence alignment (SA).
- For model building,
- align templates structurally
- extract the corresponding SSA
51- Understanding Protein Structure
- A Quick Overview of Sequence Analysis
- Finding a Structural Homologue
- Template Selection
- Aligning the Query Sequence to the Template Structure(s)
- Building the Model
52Query - Template Alignment
- > 40% identity: any alignment method is OK
- Below this, checks are essential.
- Collect close sequence homologues (about 10) and align to query to get MSA (multiple sequence alignment)
- Collect several structural templates (at least 5) and align them using structure comparison methods; extract the SSA (structural sequence alignment)
- Align MSA to SSA using profile alignment
- Extract query and selected template(s) from the final alignment: QTA.
53QTA Checks
- Residue conservation checks
- Functional regions
- Patterns/motifs conserved?
- Indels
- Combine gaps separated by few residues
- Editing the alignment
- Move gaps from secondary structures to loops
- Within loops, move gaps to loop ends, i.e.
turnaround point of backbone
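The indel check above (gaps belong in loops, not inside helices or strands) can be automated as a first pass before editing the alignment by hand. The sketch below assumes the template's secondary structure is available as a string with one character per template residue (`H` helix, `E` strand, `-` loop); the function name and this encoding are illustrative assumptions.

```python
from typing import List, Tuple

def gaps_in_sse(query_aln: str, template_aln: str,
                template_ss: str) -> List[Tuple[int, str]]:
    """Return (alignment column, kind) for indels that fall inside an SSE.

    query_aln / template_aln: aligned sequences with '-' for gaps.
    template_ss: secondary structure of the ungapped template.
    """
    if len(query_aln) != len(template_aln):
        raise ValueError("aligned sequences must have equal length")
    n_res = sum(1 for c in template_aln if c != "-")
    if len(template_ss) != n_res:
        raise ValueError("template_ss must have one character per template residue")
    flagged = []
    t = 0  # index of the next template residue
    for col, (q, s) in enumerate(zip(query_aln, template_aln)):
        if s != "-":
            # Deletion in the query opposite a helix or strand residue
            if q == "-" and template_ss[t] in "HE":
                flagged.append((col, "deletion"))
            t += 1
        elif q != "-" and 0 < t < n_res:
            # Insertion in the query that splits a single helix or strand
            if template_ss[t - 1] in "HE" and template_ss[t] == template_ss[t - 1]:
                flagged.append((col, "insertion"))
    return flagged

# Toy example: residues 2-5 of the template form a helix
print(gaps_in_sse("AC--FGHK", "ACDEFGHK", "-HHHH---"))
```

Flagged columns are candidates for moving the gap to the nearest loop, after which the alignment should be re-inspected on the structure.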
55Visual Inspection of Indels
- 2-residue deletion from sequence alignment
- End-of-loop 2-residue deletion
56- Understanding Protein Structure
- A Quick Overview of Sequence Analysis
- Finding a Structural Homologue
- Template Selection
- Aligning the Query Sequence to Template Structure(s)
- Building the Model
57Input for Model Building
- Query sequence
- Template structure
- Template sequence
- Query-template sequence alignment
58Methods Available
- WHATIF (Vriend G, 1990)
- High quality models where template is available
- Indels not modelled
- Side chain rotamers
- In silico mutations
- In silico disulfide bond creation
59Methods Available
- SWISS-MODEL (Schwede et al, 2003)
- Automatic modeling mode with multiple templates
- Query template input
- High Homology situations
- DeepView for input file creation
60Methods Available
- MODELLER (Sali & Blundell, 1993)
- High quality models
- Sequence alignment
- Structure analysis/alignment
- Multiple templates
- Multiple chains
- Ligand/cofactor present
- ESyPred3D (uses MODELLER)
- QTAs from several methods neural networks
- http://www.fundp.ac.be/urbm/bioinfo/esypred/
61Methods Available
- ICM (Abagyan et al, 1994)
- High quality models
- Loop modelling
- Multiple templates not possible
- Sequence/Structure alignment/analysis
- Ab initio peptide modeling
- Secondary structure prediction
- Geno3D (Combet et al, 2002)
- Automated modelling
- Distance geometry used for loops
- http://geno
62Methods Available
- 3D-JIGSAW (Bates et al, 2001)
- Automatic modeling mode
- Interactive user mode to select templates
- Multiple templates
- Multidomain protein modeling
63Methods Available
- CPH-MODELS (Lund et al, 1997)
- Fully automated
- FASTA search for templates
- Not validated
64Automatic or Manual Mode?
- Automatic: high homology
- Manual
- Medium/Low homology
- Template from structure prediction
- Multiple templates
- Multiple chains
- Ligand present
65How good is the model?
- Structural Quality Analysis
- PROCHECK (Laskowski et al, 1993)
- WHATIF (Vriend G, 1990)
- ERRAT (Colovos & Yeates, 1993)
66Improving ill-defined regions
- Iterative model building
- Rebuild or anneal bad regions
- Check/edit alignment and rebuild
- Molecular dynamics and/or Monte Carlo simulations
- Compute intensive
- Input files need to be set up
- Optional
67Molecular Modeling Protocol
- Resources required
- The query sequence
- Personal computer with internet connectivity
- RASMOL/DeepView for PDB structure visualization
- CLUSTALX sequence alignment software
- Access to a UNIX workstation
- MODELLER/ICM UNIX software
- WHATIF UNIX/PC software
- PROCHECK UNIX or Windows software
68MM Protocol Input Files
- Minimum requirement
- Query sequence
- Template structure
- Template sequence
- Query-Template alignment
69Ex 1. High Homology Case
- Human SOX9 WT: homologous to
- SRY (PDB 1HRY), 49% identity
- S9WT ..AGAACAATGG.. highest
- SOXCORE ..GCAACAATCT.. least
- Mutants (campomelic dysplasia)
- F12L: no DNA binding
- H65Y: minimal binding
- P70R: altered specificity, no SOXCORE
- A19V: near WT but normal binding
701. SOX9 Models
- WT & P70R models built
- Cα overlay
- WT-SRY: 0.72 Å
- J. Biol. Chem. 274 (1999) 24023
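The Cα overlay figure quoted above is a root-mean-square deviation over equivalent Cα atoms. The sketch below computes only the RMSD itself and assumes the two coordinate sets have already been superposed and paired residue by residue; the optimal superposition step (for example a Kabsch fit) is not shown, and the function name is illustrative.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]

def rmsd(coords_a: Sequence[Point], coords_b: Sequence[Point]) -> float:
    """RMSD in Å between paired, already superposed coordinates."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))

# Toy coordinates: every atom displaced by exactly 1 Å along z
model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
template = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (2.0, 0.0, 1.0)]
print(rmsd(model, template))
```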
71Ex 1. SOX9 DNA Models
SOX9-WT
SRY
SOX9-P70R
- Observed disease-linked mutations mapped
- Other residues in DNA-binding groove determined
72Ex 2. Low Homology Situation
- Pigments from reef-building corals similar to Pocilloporin
- fluoresce under UV and visible radiation
- similar to the Green Fluorescent Protein, GFP (19.6% identity)
- contain QYG instead of SYG in GFP, as proposed fluorophore
732. Alignment of POC4 & GFP
742. POC4 Model
- Barrel ends open
- C-ter not included
- β-sheet OK
- QYG fits the site!
- 26 residues within 5Å of QYG (only 19 in GFP)
- Increased thermal stability
- UV protection
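A count such as "residues within 5 Å of QYG" comes from a simple distance search between the fluorophore atoms and all other atoms in the model. A minimal sketch follows, assuming atoms are supplied as (residue identifier, coordinates) pairs; the function name, the data layout and the toy coordinates are illustrative and not taken from the POC4 model.

```python
import math
from typing import Iterable, Sequence, Set, Tuple

Point = Tuple[float, float, float]
ResidueId = Tuple[str, int]   # (chain, residue number)

def residues_within(atoms: Iterable[Tuple[ResidueId, Point]],
                    ligand_atoms: Sequence[Point],
                    cutoff: float = 5.0) -> Set[ResidueId]:
    """Residues with at least one atom within `cutoff` Å of any ligand atom."""
    near = set()
    for res_id, xyz in atoms:
        if any(math.dist(xyz, lig) <= cutoff for lig in ligand_atoms):
            near.add(res_id)
    return near

# Hypothetical coordinates, for illustration only
protein_atoms = [
    (("A", 10), (0.0, 0.0, 0.0)),
    (("A", 10), (20.0, 0.0, 0.0)),
    (("A", 11), (4.0, 3.0, 0.0)),   # exactly 5 Å from the ligand atom
    (("A", 12), (6.0, 0.0, 0.0)),   # 6 Å away, excluded
]
fluorophore = [(0.0, 0.0, 0.0)]
print(sorted(residues_within(protein_atoms, fluorophore)))
```

The fluorophore's own residues should be left out of `atoms` before counting, otherwise they are always reported as neighbours.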
75Ex 3. Small Disulfide-bonded Protein: Complement Factor H
- 20 tandem homologous units: SCRs (short consensus repeat) or sushi regions
- Each SCR is 60 aa
- conserved Y, P, G
- 2 disulfide bridges: 1-3 & 2-4
- Linkers of 3-8 aa
- Heparin binding: SCRs 7 (high affinity) & 20
- Previous SCRs required for activity: minimum constructs are fH67 and fH18-20
763. Sequence Alignment of Close Functional
Homologues
Site A Site d Site B Site c
773. Templates for fH SCRs 6-7
- hfH SCRs 15 & 16 (fH1516; PDB ID 1HFH)
- Vaccinia virus complement control protein domains 3 & 4 (vcp34; PDB ID 1VVC)
- Orientations differ considerably
- vcp34 is 28% identical to hfH67, compared to 25% for hfH1516!
783. Query-Templates alignment
79 3. hfH67 model
Sialic acid
Heparin disaccharide repeat
hfH67 hfH1516
803. Locating residues for mutation from model
SCRs 6 & 7
SCRs 15 & 16
Pacific Symposium on Biocomputing 2000, 5:155
81Ex 4 Protein Engineering
- Thermolysin-like protease: unstable at high temperatures (> 40 ºC, unlike trypsin)
- Homology Model built
- G8 & N60 suited for disulfide bond
- Double Mutant functional at 92.5 ºC
J. Mansfeld et al. Extreme Stabilization of a Thermolysin-like Protease by an Engineered Disulfide Bond. J. Biol. Chem. 1997, 272: 11152-11156.
82 Ex 5. Multiple chains: Human Hand, Foot & Mouth Disease Virus capsid
- 2000 outbreak of HFMD in Singapore: thousands of children affected, 4 deaths (The Lancet, 2000, 356, 1338)
- Major etiological agent: EV71 (enterovirus group)
- Neurological complications
835. EV71 genome structure
[Figure: EV71 genome map from 5′ (VPg, AUG) to 3′ (UAG, PolyA). Capsid region: VP4, VP2, VP3, VP1. Replication region: 2A, 2B, 2C, 3A,B, 3C, 3D. Pan-enterovirus and EV71-specific primer sites are marked.]
- 95/94 (RNA) homology coxsackievirus A16/B3
- Only 1 difference between neurovirulent and non-neurovirulent isolates
- Most variations in non-capsid regions
- Within capsid regions, VP1 shows maximum variability relative to other EVs
- Differences in capsid region 1 VP1 VP2, 2 VP3
845. Picornaviridae
Icosahedral Capsid
855. Template hunt
- BLASTP against PDB sequences
- VP1 3 templates
- 1BEV: 38.7% (bovine enterovirus)
- 1EAH: 36.5% (poliovirus type 2 strain Lansing)
- 1FPN: 38.0% (human rhinovirus serotype 2)
- VP2: 1BEV, 56.7%
- VP3: 1BEV, 54.9%
- VP4: 1BEV, 50.0%
865. Fixing the VP1 Alignment
- Structural alignment of templates using VAST (Gibrat, Madej, Bryant, 1996)
- Extract corresponding sequence alignment
- Match HFMDV VP1 to aligned templates using profile alignment in CLUSTALW
875. VP1 alignment to templates
β-strands
α-helices
3,10-helices
Pocket-factor binding residues
885. Model building steps
- Build all 4 capsid proteins (VP1-VP4) together to ensure 3D fit
- Use 1BEV alone for VP2-VP4
- For VP1 use aligned 1BEV, 1EAH, 1FPN
- Check model
895. Round 1 VP1,VP2,VP3,VP4
- Clip hanging ends
- Re-position problem loops
- adjust gaps in alignment
- Build again
905. Round 2 Pentamer Check
- Loops look OK
- Build pentamer
- Publish.
- Oops clash in pentamer assembly. Go back
915. Close encounters of the 3rd Kind
- Build only VP3 pentamer
- N-terminus of each VP3 hydrogen-bonded
- Also, in BEV, Asp-Lys ion pair
- First 25 aa overlay very well
925. Fourth foray build with 5 VP3s
- Only first 50aa of the other 4 VP3s included
- Model resulted in knots due to insufficient refinement cycles
- However, VP3 pentameric region OK
935. Fifth and final attempt
945. Canyon Pit and Antigenic Sites
Poliovirus sites
Neurovirulent Polio (mouse)
Cardiovirus
955. Putative antigenic sites
VP1
VP2
965. HFMDV Conclusions
- Unique surface loops identified for
- Immunodiagnostic assays
- Vaccine design
- Antibodies being generated
- Canyon pit depth is similar to BEV
- Mapping the antigenic regions of other related enteroviruses on the HFMDV surface: specific VP1 and VP2 sites buried
- Sunita Singh, Vincent T. K. Chow, C. L. Poh, M. C. Phoon: Dept. of Microbiology, NUS
- Applied Bioinformatics, Vol 1, issue 1, 43-52 (invited research article)