Title: Bioinformatics and Machine Learning: the Prediction of Protein Structures on a Genomic Scale Pierre Baldi Dept. Information and Computer Science Institute for Genomics and Bioinformatics University of California, Irvine
1Bioinformatics and Machine Learning: the Prediction of Protein Structures on a Genomic Scale
Pierre Baldi
Dept. Information and Computer Science
Institute for Genomics and Bioinformatics
University of California, Irvine
2- tggaagggctaattcactcccaacgaagacaagatatccttgatctgtgg
atctaccacacacaaggctacttccctgattagcagaactacacaccagg
gccagggatcagatatccactgacctttggatggtgctacaagctagtac
cagttgagccagagaagttagaagaagccaacaaaggagagaacaccagc
ttgttacaccctgtgagcctgcatggaatggatgacccggagagagaagt
gttagagtggaggtttgacagccgcctagcatttcatcacatggcccgag
agctgcatccggagtacttcaagaactgctgacatcgagcttgctacaag
ggactttccgctggggactttccagggaggcgtggcctgggcgggactgg
ggagtggcgagccctcagatcctgcatataagcagctgctttttgcctgt
actgggtctctctggttagaccagatctgagcctgggagctctctggcta
actagggaacccactgcttaagcctcaataaagcttgccttgagtgcttc
aagtagtgtgtgcccgtctgttgtgtgactctggtaactagagatccctc
agacccttttagtcagtgtggaaaatctctagcagtggcgcccgaacagg
gacctgaaagcgaaagggaaaccagaggagctctctcgacgcaggactcg
gcttgctgaagcgcgcacggcaagaggcgaggggcggcgactggtgagta
cgccaaaaattttgactagcggaggctagaaggagagagatgggtgcgag
agcgtcagtattaagcgggggagaattagatcgatgggaaaaaattcggt
taaggccagggggaaagaaaaaatataaattaaaacatatagtatgggca
agcagggagctagaacgattcgcagttaatcctggcctgttagaaacatc
agaaggctgtagacaaatactgggacagctacaaccatcccttcagacag
gatcagaagaacttagatcattatataatacagtagcaaccctctattgt
gtgcatcaaaggatagagataaaagacaccaaggaagctttagacaagat
agaggaagagcaaaacaaaagtaagaaaaaagcacagcaagcagcagctg
acacaggacacagcaatcaggtcagccaaaattaccctatagtgcagaac
atccaggggcaaatggtacatcaggccatatcacctagaactttaaatgc
atgggtaaaagtagtagaagagaaggctttcagcccagaagtgataccca
tgttttcagcattatcagaaggagccaccccacaagatttaaacaccatg
ctaaacacagtggggggacatcaagcagccatgcaaatgttaaaagagac
catcaatgaggaagctgcagaatgggatagagtgcatccagtgcatgcag
ggcctattgcaccaggccagatgagagaaccaaggggaagtgacatagca
ggaactactagtacccttcaggaacaaataggatggatgacaaataatcc
acctatcccagtaggagaaatttataaaagatggataatcctgggattaa
ataaaatagtaagaatgtatagccctaccagcattctggacataagacaa
ggaccaaaggaaccctttagagactatgtagaccggttctataaaactct
aagagccgagcaagcttcacaggaggtaaaaaattggatgacagaaacct
tgttggtccaaaatgcgaacccagattgtaagactattttaaaagcattg
ggaccagcggctacactagaagaaatgatgacagcatgtcagggagtagg
aggacccggccataaggcaagagttttggctgaagcaatgagccaagtaa
caaattcagctaccataatgatgcagagaggcaattttaggaaccaaaga
aagattgttaagtgtttcaattgtggcaaagaagggcacacagccagaaa
ttgcagggcccctaggaaaaagggctgttggaaatgtggaaaggaaggac
accaaatgaaagattgtactgagagacaggctaattttttagggaagatc
tggccttcctacaagggaaggccagggaattttcttcagagcagaccaga
gccaacagccccaccagaagagagcttcaggtctggggtagagacaacaa
ctccccctcagaagcaggagccgatagacaaggaactgtatcctttaact
tccctcaggtcactctttggcaacgacccctcgtcacaataaagataggg
gggcaactaaaggaagctctattagatacaggagcagatgatacagtatt
agaagaaatgagtttgccaggaagatggaaaccaaaaatgatagggggaa
ttggaggttttatcaaagtaagacagtatgatcagatactcatagaaatc
tgtggacataaagctataggtacagtattagtaggacctacacctgtcaa
cataattggaagaaatctgttgactcagattggttgcactttaaattttc
ccattagccctattgagactgtaccagtaaaattaaagccaggaatggat
ggcccaaaagttaaacaatggccattgacagaagaaaaaataaaagcatt
agtagaaatttgtacagagatggaaaaggaagggaaaatttcaaaaattg
ggcctgaaaatccatacaatactccagtatttgccataaagaaaaaagac
agtactaaatggagaaaattagtagatttcagagaacttaataagagaac
tcaagacttctgggaagttcaattaggaataccacatcccgcagggttaa
aaaagaaaaaatcagtaacagtactggatgtgggtgatgcatatttttca
gttcccttagatgaagacttcaggaagtatactgcatttaccatacctag
tataaacaatgagacaccagggattagatatcagtacaatgtgcttccac
agggatggaaaggatcaccagcaatattccaaagtagcatgacaaaaatc
ttagagccttttagaaaacaaaatccagacatagttatctatcaatacat
ggatgatttgtatgtaggatctgacttagaaatagggcagcatagaacaa
aaatagaggagctgagacaacatctgttgaggtggggacttaccacacca
gacaaaaaacatcagaaagaacctccattcctttggatgggttatgaact
ccatcctgataaatggacagtacagcctatagtgctgccagaaaaagaca
gctggactgtcaatgacatacagaagttagtggggaaattgaattgggca
agtcagatttacccagggattaaagtaaggcaattatgtaaactccttag
aggaaccaaagcactaacagaagtaataccactaacagaagaagcagagc
tagaactggcagaaaacagagagattctaaaagaaccagtacatggagtg
tattatgacccatcaaaagacttaatagcagaaatacagaagcaggggca
aggccaatggacatatcaaatttatcaagagccatttaaaaatctgaaaa
caggaaaatatgcaagaatgaggggtgcccacactaatgatgtaaaacaa
ttaacagaggcagtgcaaaaaataaccacagaaagcatagtaatatgggg
aaagactcctaaatttaaactgcccatacaaaaggaaacatgggaaacat
ggtggacagagtattggcaagccacctggattcctgagtgggagtttgtt
aatacccctcccttagtgaaattatggtaccagttagagaaagaacccat
agtaggagcagaaaccttctatgtagatggggcagctaacagggagacta
aattaggaaaagcaggatatgttactaatagaggaagacaaaaagttgtc
accctaactgacacaacaaatcagaagactgagttacaagcaatttatct
agctttgcaggattcgggattagaagtaaacatagtaacagactcacaat
atgcattaggaatcattcaagcacaaccagatcaaagtgaatcagagtta
gtcaatcaaataatagagcagttaataaaaaaggaaaaggtctatctggc
atgggtaccagcacacaaaggaattggaggaaatgaacaagtagataaat
tagtcagtgctggaatcaggaaagtactatttttagatggaatagataag
gcccaagatgaacatgagaaatatcacagtaattggagagcaatggctag
tgattttaacctgccacctgtagtagcaaaagaaatagtagccagctgtg
ataaatgtcagctaaaaggagaagccatgcatggacaagtagactgtagt
ccaggaatatggcaactagattgtacacatttagaaggaaaagttatcct
ggtagcagttcatgtagccagtggatatatagaagcagaagttattccag
cagaaacagggcaggaaacagcatattttcttttaaaattagcaggaaga
tggccagtaaaaacaatacatactgacaatggcagcaatttcaccggtgc
tacggttagggccgcctgttggtgggcgggaatcaagcaggaatttggaa
ttccctacaatccccaaagtcaaggagtagtagaatctatgaataaagaa
ttaaagaaaattataggacaggtaagagatcaggctgaacatcttaagac
agcagtacaaatggcagtattcatccacaattttaaaagaaaagggggga
ttggggggtacagtgcaggggaaagaatagtagacataatagcaacagac
atacaaactaaagaattacaaaaacaaattacaaaaattcaaaattttcg
ggtttattacagggacagcagaaatccactttggaaaggaccagcaaagc
tcctctggaaaggtgaaggggcagtagtaatacaagataatagtgacata
aaagtagtgccaagaagaaaagcaaagatcattagggattatggaaaaca
gatggcaggtgatgattgtgtggcaagtagacaggatgaggattagaaca
tggaaaagtttagtaaaacaccatatgtatgtttcagggaaagctagggg
atggttttatagacatcactatgaaagccctcatccaagaataagttcag
aagtacacatcccactaggggatgctagattggtaataacaacatattgg
ggtctgcatacaggagaaagagactggcatttgggtcagggagtctccat
agaatggaggaaaaagagatatagcacacaagtagaccctgaactagcag
accaactaattcatctgtattactttgactgtttttcagactctgctata
agaaaggccttattaggacacatagttagccctaggtgtgaatatcaagc
aggacataacaaggtaggatctctacaatacttggcactagcagcattaa
taacaccaaaaaagataaagccacctttgcctagtgttacgaaactgaca
gaggatagatggaacaagccccagaagaccaagggccacagagggagcca
cacaatgaatggacactagagcttttagaggagcttaagaatgaagctgt
tagacattttcctaggatttggctccatggcttagggcaacatatctatg
aaacttatggggatacttgggcaggagtggaagccataataagaattctg
caacaactgctgtttatccattttcagaattgggtgtcgacatagcagaa
taggcgttactcgacagaggagagcaagaaatggagccagtagatcctag
actagagccctggaagcatccaggaagtcagcctaaaactgcttgtacca
attgctattgtaaaaagtgttgctttcattgccaagtttgtttcataaca
aaagccttaggcatctcctatggcaggaagaagcggagacagcgacgaag
agctcatcagaacagtcagactcatcaagcttctctatcaaagcagtaag
tagtacatgtaacgcaacctataccaatagtagcaatagtagcattagta
gtagcaataataatagcaatagttgtgtggtccatagtaatcatagaata
taggaaaatattaagacaaagaaaaatagacaggttaattgatagactaa
tagaaagagcagaagacagtggcaatgagagtgaaggagaaatatcagca
cttgtggagatgggggtggagatggggcaccatgctccttgggatgttga
tgatctgtagtgctacagaaaaattgtgggtcacagtctattatggggta
cctgtgtggaaggaagcaaccaccactctattttgtgcatcagatgctaa
agcatatgatacagaggtacataatgtttgggccacacatgcctgtgtac
ccacagaccccaacccacaagaagtagtattggtaaatgtgacagaaaat
tttaacatgtggaaaaatgacatggtagaacagatgcatgaggatataat
cagtttatgggatcaaagcctaaagccatgtgtaaaattaaccccactct
gtgttagtttaaagtgcactgatttgaagaatgatactaataccaatagt
agtagcgggagaatgataatggagaaaggagagataaaaaactgctcttt
caatatcagcacaagcataagaggtaaggtgcagaaagaatatgcatttt
tttataaacttgatataataccaatagataatgatactaccagctataag
ttgacaagttgtaacacctcagtcattacacaggcctgtccaaaggtatc
ctttgagccaattcccatacattattgtgccccggctggttttgcgattc
taaaatgtaataataagacgttcaatggaacaggaccatgtacaaatgtc
agcacagtacaatgtacacatggaattaggccagtagtatcaactcaact
gctgttaaatggcagtctagcagaagaagaggtagtaattagatctgtca
atttcacggacaatgctaaaaccataatagtacagctgaacacatctgta
gaaattaattgtacaagacccaacaacaatacaagaaaaagaatccgtat
ccagagaggaccagggagagcatttgttacaataggaaaaataggaaata
tgagacaagcacattgtaacattagtagagcaaaatggaataacacttta
aaacagatagctagcaaattaagagaacaatttggaaataataaaacaat
aatctttaagcaatcctcaggaggggacccagaaattgtaacgcacagtt
ttaattgtggaggggaatttttctactgtaattcaacacaactgtttaat
agtacttggtttaatagtacttggagtactgaagggtcaaataacactga
aggaagtgacacaatcaccctcccatgcagaataaaacaaattataaaca
tgtggcagaaagtaggaaaagcaatgtatgcccctcccatcagtggacaa
attagatgttcatcaaatattacagggctgctattaacaagagatggtgg
taatagcaacaatgagtccgagatcttcagacctggaggaggagatatga
gggacaattggagaagtgaattatataaatataaagtagtaaaaattgaa
ccattaggagtagcacccaccaaggcaaagagaagagtggtgcagagaga
aaaaagagcagtgggaataggagctttgttccttgggttcttgggagcag
caggaagcactatgggcgcagcctcaatgacgctgacggtacaggccaga
caattattgtctggtatagtgcagcagcagaacaatttgctgagggctat
tgaggcgcaacagcatctgttgcaactcacagtctggggcatcaagcagc
tccaggcaagaatcctggctgtggaaagatacctaaaggatcaacagctc
ctggggatttggggttgctctggaaaactcatttgcaccactgctgtgcc
ttggaatgctagttggagtaataaatctctggaacagatttggaatcaca
cgacctggatggagtgggacagagaaattaacaattacacaagcttaata
cactccttaattgaagaatcgcaaaaccagcaagaaaagaatgaacaaga
attattggaattagataaatgggcaagtttgtggaattggtttaacataa
caaattggctgtggtatataaaattattcataatgatagtaggaggcttg
gtaggtttaagaatagtttttgctgtactttctatagtgaatagagttag
gcagggatattcaccattatcgtttcagacccacctcccaaccccgaggg
gacccgacaggcccgaaggaatagaagaagaaggtggagagagagacaga
gacagatccattcgattagtgaacggatccttggcacttatctgggacga
tctgcggagcctgtgcctcttcagctaccaccgcttgagagacttactct
tgattgtaacgaggattgtggaacttctgggacgcagggggtgggaagcc
ctcaaatattggtggaatctcctacagtattggagtcaggaactaaagaa
tagtgctgttagcttgctcaatgccacagccatagcagtagctgagggga
cagatagggttatagaagtagtacaaggagcttgtagagctattcgccac
atacctagaagaataagacagggcttggaaaggattttgctataagatgg
gtggcaagtggtcaaaaagtagtgtgattggatggcctactgtaagggaa
agaatgagacgagctgagccagcagcagatagggtgggagcagcatctcg
agacctggaaaaacatggagcaatcacaagtagcaatacagcagctacca
atgctgcttgtgcctggctagaagcacaagaggaggaggaggtgggtttt
ccagtcacacctcaggtacctttaagaccaatgacttacaaggcagctgt
agatcttagccactttttaaaagaaaaggggggactggaagggctaattc
actcccaaagaagacaagatatccttgatctgtggatctaccacacacaa
ggctacttccctgattagcagaactacacaccagggccaggggtcagata
tccactgacctttggatggtgctacaagctagtaccagttgagccagata
agatagaagaggccaataaaggagagaacaccagcttgttacaccctgtg
agcctgcatgggatggatgacccggagagagaagtgttagagtggaggtt
tgacagccgcctagcatttcatcacgtggcccgagagctgcatccggagt
acttcaagaactgctgacatcgagcttgctacaagggactttccgctggg
gactttccagggaggcgtggcctgggcgggactggggagtggcgagccct
cagatcctgcatataagcagctgctttttgcctgtactgggtctctctgg
ttagaccagatctgagcctgggagctctctggctaactagggaacccact
gcttaagcctcaataaagcttgccttgagtgcttcaagtagtgtgtgccc
gtctgttgtgtgactctggtaactagagatccctcagacccttttagtca
gtgtggaaaatctctagca
7SCALES
organisms      genome        genes
virus          10-100,000    10
bacteria       5 Mb          3,000
single cell    15 Mb         6,000
simple animal  100 Mb        15,000
man            3,000 Mb      30-40,000
12Examples of Computational Problems
- Physical and Genetic Maps
- Genome assembly
- Pairwise and Multiple Alignments
- Motif Detection/Discrimination/Classification
- Data Base Searches and Mining
- Phylogenetic Tree Reconstruction
- Gene Finding and Gene Parsing
- Protein Secondary Structure Prediction
- Protein Tertiary Structure Prediction
- Protein Function Prediction
- Comparative genomics
- DNA microarray analysis
- Gene regulation/regulatory networks
13Machine Learning
- Extract information from the data automatically (inference) via a process of model fitting (learning from examples).
- Model Selection: Neural Networks, Hidden Markov Models, Stochastic Grammars, Bayesian Networks, Graphical Models, Kernel Methods
- Model Fitting: Gradient Methods, Monte Carlo Methods, ...
- Machine learning approaches are most useful in areas where there is a lot of data but little theory.
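As a minimal sketch of model fitting by a gradient method (an illustration, not taken from the slides), the code below fits a two-parameter logistic model to toy data by gradient descent. The data, the learning rate, the step count, and the function names are all assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.5, steps=2000):
    """Fit p(y=1|x) = sigmoid(w*x + b) by gradient descent
    on the mean negative log-likelihood."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y  # derivative of the loss w.r.t. the logit
            gw += err * x
            gb += err
        w -= lr * gw / n
        b -= lr * gb / n
    return w, b

# Toy, made-up training set: the label is 1 for positive x
xs = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
w, b = fit_logistic(xs, ys)
print(w, b, sigmoid(w * 1.0 + b))
```

The same loop structure (compute a gradient of the fit criterion, take a small step) carries over to larger models such as neural networks, where the gradient is obtained by backpropagation.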
14Three Key Factors for Expansion
- Data Mining/Machine Learning expansion is fueled by:
- Progress in sensors, data storage, and data management.
- Computing power.
- Theoretical framework.
17Utility of Structural Information
(Baker and Sali, 2001)
18CAVEAT
19REMARKS
- Structure/Folding
- Backbone/Full Atom
- Homology Modeling
- Protein Threading
- Ab Initio (Physical Potentials, Statistical Mechanics/Lattice Models)
- Lego Approach
- Statistical/Machine Learning (Training Sets, SS prediction)
- Mixtures: ab initio with statistical potentials, machine learning with profiles, etc.
20β-sheet
β-ladders
β-strands
21PROTEIN STRUCTURE PREDICTION
22Intestinal Fatty Acid Binding Protein (beta
barrel)
25Helices
- 1GRJ (GreA Transcript Cleavage Factor from Escherichia coli)
26Antiparallel β-sheets
- 1MSC (Bacteriophage MS2 Unassembled Coat Protein Dimer)
27Parallel β-sheets
28Contact map
29Secondary structure prediction
30SPARSE ENCODING
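A minimal sketch of what a sparse (orthogonal, one-hot) encoding of an amino acid sequence typically looks like. The slide gives only the title, so the 20-letter alphabet ordering, the all-zero vector for unknown residues, and the function name are assumptions made for illustration.

```python
# Assumed alphabet: the 20 standard amino acids in alphabetical one-letter order
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def sparse_encode(sequence, alphabet=AMINO_ACIDS):
    """Map each residue to a 0/1 vector with a single 1 at its alphabet index.
    Residues outside the alphabet become all-zero vectors (an assumption)."""
    index = {aa: k for k, aa in enumerate(alphabet)}
    vectors = []
    for aa in sequence:
        v = [0] * len(alphabet)
        if aa in index:
            v[index[aa]] = 1
        vectors.append(v)
    return vectors

encoded = sparse_encode("ACX")
print(encoded[0])
```

Each position contributes one vector of length 20, so a window of k residues becomes an input of 20 x k mostly-zero values, hence the name.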
32GRAPHICAL MODELS
- Bayesian statistics and modeling lead to very high-dimensional distributions P(D,H,M), which are typically intractable.
- Need for factorization into independent clusters of variables that reflect the local (Markovian) dependencies of the world and the data.
- Hence the general theory of graphical models.
- Directed models reflect temporal and causality relationships: NNs, HMMs, Bayesian networks, etc.
- Directed models are used for instance in expert systems.
- Undirected models reflect correlations: Markov Random Fields, Boltzmann machines, etc.
- Undirected models are used for instance in image modeling problems.
- Mixed directed/undirected and other models are possible.
33GLOBAL FACTORIZATION FOR BAYESIAN NETWORKS
- $X_1, \ldots, X_n$: random variables associated with the vertices of a DAG (Directed Acyclic Graph).
- The local conditional distributions $P(X_i \mid X_j : j \text{ parent of } i)$ are the parameters of the model. They can be represented by look-up tables (costly) or other more compact parameterizations (Sigmoidal Belief Networks, XOR, etc.).
- The global distribution is the product of the local characteristics: $P(X_1, \ldots, X_n) = \prod_i P(X_i \mid X_j : j \text{ parent of } i)$
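A small sketch of the global factorization, assuming a hypothetical three-node DAG over binary variables with look-up tables for the local conditionals. The graph, the probabilities, and the names are invented for illustration.

```python
from itertools import product

# Hypothetical DAG over binary variables: A -> C <- B
parents = {"A": [], "B": [], "C": ["A", "B"]}

# Look-up tables: cpt[node][parent values] = P(node = 1 | parents)
cpt = {
    "A": {(): 0.3},
    "B": {(): 0.6},
    "C": {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.7, (1, 1): 0.9},
}

def joint(assignment):
    """P(X1, ..., Xn) as the product of the local conditionals P(Xi | parents of i)."""
    p = 1.0
    for node, pa in parents.items():
        p_one = cpt[node][tuple(assignment[q] for q in pa)]
        p *= p_one if assignment[node] == 1 else 1.0 - p_one
    return p

# Summing the product over all 2^3 assignments should give 1
total = sum(
    joint(dict(zip(parents, values)))
    for values in product([0, 1], repeat=len(parents))
)
print(joint({"A": 1, "B": 0, "C": 1}), total)
```

The look-up table for C already needs one entry per combination of parent values, which is the cost that the more compact parameterizations are meant to avoid.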
39DATA PREPARATION
- Starting point: PDB database.
- Remove sequences not determined by X-ray diffraction.
- Remove sequences where DSSP crashes.
- Remove proteins with physical chain breaks (neighboring AA having distances exceeding 4 Angstroms).
- Remove sequences with resolution worse than 2.5 Angstroms.
- Remove chains with fewer than 30 AA.
- Remove redundancy (Hobohm's algorithm, Smith-Waterman, PAM 120, etc.).
- Build multiple alignments (BLAST, PSI-BLAST, etc.).
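The per-chain filters above can be sketched as a single predicate. This is only an illustration: the record fields (`method`, `dssp_ok`, `max_neighbor_dist`, `resolution`, `length`) and the sample records are hypothetical, and the redundancy-removal and alignment steps are not covered.

```python
def keep_chain(chain, max_resolution=2.5, min_length=30, max_gap=4.0):
    """Return True when a chain record passes the per-chain filters."""
    return (
        chain["method"] == "X-RAY DIFFRACTION"     # drop non-crystallographic entries
        and chain["dssp_ok"]                       # drop chains on which DSSP failed
        and chain["max_neighbor_dist"] <= max_gap  # drop physical chain breaks
        and chain["resolution"] <= max_resolution  # drop resolution worse than 2.5 A
        and chain["length"] >= min_length          # drop chains shorter than 30 AA
    )

# Made-up records, for illustration only
chains = [
    {"id": "a", "method": "X-RAY DIFFRACTION", "dssp_ok": True,
     "max_neighbor_dist": 3.8, "resolution": 1.9, "length": 120},
    {"id": "b", "method": "SOLUTION NMR", "dssp_ok": True,
     "max_neighbor_dist": 3.8, "resolution": None, "length": 95},
    {"id": "c", "method": "X-RAY DIFFRACTION", "dssp_ok": True,
     "max_neighbor_dist": 3.8, "resolution": 3.1, "length": 210},
    {"id": "d", "method": "X-RAY DIFFRACTION", "dssp_ok": True,
     "max_neighbor_dist": 3.8, "resolution": 2.0, "length": 22},
    {"id": "e", "method": "X-RAY DIFFRACTION", "dssp_ok": True,
     "max_neighbor_dist": 6.2, "resolution": 2.0, "length": 150},
]

kept = [c["id"] for c in chains if keep_chain(c)]
print(kept)
```

The method check comes first so that records without a resolution value (such as the NMR entry here) are rejected before the numeric comparisons run.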
40SECONDARY STRUCTURE PROGRAMS
- DSSP (Kabsch and Sander, 1983): works by assigning potential backbone hydrogen bonds (based on the 3D coordinates of the backbone atoms) and subsequently by identifying repetitive bonding patterns.
- STRIDE (Frishman and Argos, 1995): in addition to hydrogen bonds, it also uses dihedral angles.
- DEFINE (Richards and Kundrot, 1988): uses difference distance matrices for evaluating the match of interatomic distances in the protein to those from idealized SS.
41SECONDARY STRUCTURE ASSIGNMENTS
- DSSP classes
- H: alpha helix
- E: sheet
- G: 3-10 helix
- S: kind of turn
- T: beta turn
- B: beta bridge
- I: pi-helix (very rare)
- C: the rest
- CASP (harder) assignment
- α: H and G
- β: E and B
- coil: the rest
- Alternative assignment
- α: H
- β: B
- coil: the rest
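A minimal sketch of the CASP reduction from the eight DSSP classes to three, assuming the three output classes are written as H (helix), E (strand), and C (everything else). The function name and the example string are made up.

```python
# CASP assignment from the slide: H and G -> helix, E and B -> strand, rest -> coil
CASP_MAP = {"H": "H", "G": "H", "E": "E", "B": "E"}

def reduce_dssp(ss8, mapping=CASP_MAP, rest="C"):
    """Collapse a DSSP 8-class string into a 3-class string."""
    return "".join(mapping.get(s, rest) for s in ss8)

example = "CCHHHHGGGTTEEEEBSC"   # invented DSSP string
reduced = reduce_dssp(example)
print(reduced)
```

Passing a different `mapping` dictionary gives other reductions, such as the alternative assignment listed on the slide.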
42ENSEMBLES
47FUNDAMENTAL LIMITATIONS
- 100% CORRECT RECOGNITION IS PROBABLY IMPOSSIBLE FOR SEVERAL REASONS
- SOME PROTEINS DO NOT FOLD SPONTANEOUSLY OR MAY NEED CHAPERONES
- QUATERNARY STRUCTURE: BETA-STRAND PARTNERS MAY BE ON A DIFFERENT CHAIN
- STRUCTURE MAY DEPEND ON OTHER VARIABLES: ENVIRONMENT, PH
- DYNAMICAL ASPECTS
- FUZZINESS OF DEFINITIONS AND ERRORS IN DATABASES
50BB-RNNs
51 2D RNNs
52 2D INPUTS
- AA at positions i and j
- Profiles at positions i and j
- Correlated profiles at positions i and j
- Secondary Structure, Accessibility, etc.
53PERFORMANCE (%)
               6Å     8Å     10Å    12Å
non-contacts   99.9   99.8   99.2   98.9
contacts       71.2   65.3   52.2   46.6
all            98.5   97.1   93.2   88.5
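To make the three rows concrete, here is a hedged sketch of how a contact map at a given distance threshold and the per-class percentages could be computed. The slides do not say which atoms or which residue pairs were counted, so the use of one point per residue, the restriction to pairs i < j, and the toy coordinates are all assumptions.

```python
import math

def contact_map(coords, threshold):
    """Binary map: 1 when residues i and j lie within `threshold` Angstroms."""
    n = len(coords)
    return [[1 if math.dist(coords[i], coords[j]) <= threshold else 0
             for j in range(n)] for i in range(n)]

def class_accuracy(true_map, pred_map):
    """Percent correct among non-contacts, among contacts, and overall (pairs i < j)."""
    hits = {0: 0, 1: 0}
    totals = {0: 0, 1: 0}
    n = len(true_map)
    for i in range(n):
        for j in range(i + 1, n):
            t = true_map[i][j]
            totals[t] += 1
            hits[t] += int(pred_map[i][j] == t)

    def pct(h, t):
        return 100.0 * h / t if t else float("nan")

    return {
        "non-contacts": pct(hits[0], totals[0]),
        "contacts": pct(hits[1], totals[1]),
        "all": pct(hits[0] + hits[1], totals[0] + totals[1]),
    }

# Toy chain: four points on a line, 3.8 Angstroms apart (invented coordinates)
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
true_map = contact_map(coords, 8.0)
pred_map = contact_map(coords, 6.0)   # stand-in for a predictor's output
acc = class_accuracy(true_map, pred_map)
print(acc)
```

In real proteins non-contacts far outnumber contacts, which is why the "all" row in the table stays close to the "non-contacts" row even when the "contacts" row is much lower.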
57COARSE MAPS
58COARSE ANGLE PREDICTION
59 1b0ya (6Å coarse)
60STRUCTURAL PROTEOMICS SUITE
- SSpro: secondary structure
- SSpro8: secondary structure (8 classes)
- ACCpro: accessibility
- CONpro: contact number
- DI-pro: disulphide bridges
- BETA-pro: beta partners
- CMAP-pro: contact map
- CCMAP-pro: coarse contact map
- CON23D-pro: contact map to 3D
- 3D-pro: 3D structure
62- SISQQTVWNQMATVRTPLNFDSSKQSFCQFSVDLLGGGISVDKTGDWITLVQNSPISNLL
- CCCECCCCCCEEEECCCCCCCCCCCCEEEEEEECCCCEEEECCCCCCEEEEECCHHHHHH
- CCCEEEEECEEEEECCCCCCCTCCCCEEEEEEEETCSEEEECTTTTEEEEEECCHHHHHH
- eeeeee---e--e-e-eee-ee-eee---------e-e--eeeeee--------------
- RVAAWKKGCLMVKVVMSGNAAVKRSDWASLVQVFLTNSNSTEHFDACRWTKSEPHSWELI
- HHHHHHCCCEEEEEEEEEECCEEECCCCCEEEEEEEECCCCCCCCCEEEEEECCCCCCCC
- HHHHHHTTCEEEEEEEEEEEEEEECCCCCEEEEEEEECCCTTCCCEEEEEEECCTCCEEE
- -----ee---e-------e-e-ee-e-e-e-----e--eeee--e-------e-e-ee-e
63Advantage of Machine Learning
- Machine learning systems take time to train (weeks).
- Once trained, however, they can predict structures almost faster than proteins can fold.
- Predict or search protein structures on a genomic or bioengineering scale.
64INSIGHTS INTO NEURAL SYSTEMS
- Architectures with 6 layers
- Propagation of output results from the center to the periphery
- Role of lateral computations
66DAG-RNNs APPROACH
- Two steps
- 1. Build relevant DAG to connect inputs, outputs, and hidden variables
- 2. Use a deterministic (neural network) parameterization together with appropriate stationarity assumptions/weight sharing
- Process structured data of variable size, topology, and dimensions efficiently
- Sequences, trees, d-lattices, graphs, etc.
- Other applications
69Structural Databases
- PPDB: Poxvirus Proteomic Database
- ICBS: Inter Chain Beta Sheet Database
70Poxvirus Family
- Great medical relevance
- Reasonable size of the genomes
- Potential threat as bioterrorism weapons
- Need for poxvirus vaccines and drugs
71PSPDB
- Poxvirus Structural Proteomic DataBase
- Vaccinia/Variola
- 260 genes or so per family member
72Smallpox Background
- Smallpox virus: variola major and variola minor viruses
- Smallpox has caused more deaths in human history than any other known pathogen
- Case-fatality rate of 30%
- Higher fatality rate for virgin soil populations
- Americas and Hawaii
- Biological Weapon
- Used by British soldiers in French and Indian Wars and American Revolution
- Smallpox vaccine using vaccinia virus
- Widely used beginning in 19th century
- Smallpox eradicated in 1980
- Other poxviruses can evolve into deadly human pathogens
- Monkeypox
- Potential manipulation by humans
- World population is becoming virgin soil
74Screenshot Details B13R
76β-sheet
β-ladders
β-strands
77Interchain β-sheet
Interchain β-ladders
78 1DFN Defensin
79Why a database of ICBS interactions?
- Role at multiple structural levels
- Dimerization, oligomerization
- Protein-protein interaction
- Protein and peptide aggregation
- Protein-nucleic acid interaction
- Multiple functions / pathways
- Involved in diseases
- Artificial β-sheet mimics exist (Nowick)
- There existed no systematic method to identify
and characterize them within (or between) known
structures.
83Source PDB structures
1a72
Protein Data Bank
PDB Asymmetric Crystallographic Unit
84Sources PDB and PQS structures
1a72
Protein Data Bank
Protein Quaternary Structure server
- Translate and rotate copies
- Assess (crystal packing vs. biological molecule)
PDB: Asymmetric unit
PQS: 1 or more likely biological macromolecule
85Strength of ICBS interactions and ICBS index
Image provided by James Nowick
86Ranking ICBS interactions: the ICBS index
87Update and Query of the ICBS Database
PUBLIC DATABASES
- PDB
- PQS
UPDATE OF THE ICBS DATABASE
- Get PDB and PQS structures
- Determine secondary structure (DSSP)
- Find inter-chain beta-ladders
- Count H-bonds in ICBS
- Count heavy atom contacts
- Compute ICBS index
- Extract miscellaneous data
- Update the database
ICBS USER INTERFACE
- ICBS database / Web server
- Query form
- Results pages
90Prevalence of ICBS interactions (August 11, 2002)
- 4,869 ICBS entries for 22,513 structures scanned
- 2,536 PDB entries (14.6%) with β-sheet interactions (ladders of length > 1), out of 17,313 entries with peptide or protein chains
- identical chains: 59.9%; different: 40.1%
- antiparallel interchain ladders: 68.3%; parallel: 8.5%; mixed: 23.2%
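A quick arithmetic check of the figures on this slide, using only the numbers quoted above. The variable names are illustrative.

```python
pdb_with_icbs = 2536       # PDB entries with interchain beta-sheet interactions
pdb_with_chains = 17313    # PDB entries with peptide or protein chains

share = 100.0 * pdb_with_icbs / pdb_with_chains
chain_split = 59.9 + 40.1            # identical + different chains
ladder_split = 68.3 + 8.5 + 23.2     # antiparallel + parallel + mixed

print(round(share, 1), round(chain_split, 1), round(ladder_split, 1))
```

The ratio rounds to the 14.6% quoted on the slide, and both breakdowns add up to 100%.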
91Conclusion
- The database is used to
- Gain insight on ICBS interactions
- Find potential drug targets
- Guide the development of artificial molecules
- Extensions
- Contents
- Protein-protein interactions (between structures)
- Annotation
- Sequence analysis
- Curated, non-redundant set
- Statistical characterization (interface context)
- Interchain vs. intrachain comparison
- Prediction
- Issues: specificity, binding strength
92PROTEIN DOCKING
- Problem of protein flexibility
93- Inclusion of B-factor (a measure of positional uncertainty and flexibility) into energy calculations
- B-factor incorporation into structure prediction/assessment
- Reverse drug design: search an active site database using a small molecule as a query
- Automated active site detection
- New comparison algorithm
- ADME/Toxicity Prediction?
94ACKNOWLEDGMENTS
- UCI
- Gianluca Pollastri, Steve Hampson
- Arlo Randall, Pierre-Francois Baisnee, S. Josh Swamidass, Jianlin Chen, Yimeng Dou, Yann Pecout, Alessandro Vullo, Lin Wu
- DTU: Soren Brunak
- Columbia: Burkhard Rost
- U of Florence: Paolo Frasconi
- U of Bologna: Rita Casadio, Piero Fariselli
- www.igb.uci.edu/tools.htm
- www.ics.uci.edu/pfbaldi