Title: Rapid Discrimination of Decoy Protein Structures Using Multiple Energy Functions
1. Rapid Discrimination of Decoy Protein Structures Using Multiple Energy Functions
- Beating back bogus biomolecules since 2004.
2. Machine Learning vs. Bioinformatics vs. Structural Biology in a battle to the finish!
3. Designing a Potential
- Preserve hydrophobic character
  - Use a standard threading function (Bryant & Lawrence)
- Preserve hydrogen bonding
  - Calculate Hbond geometry and strength
- Preserve van der Waals character
  - Count bad contacts/distances
- Preserve good stereochemistry
  - Ramachandran allowed/disallowed regions
- Preserve globular shape/size
  - Limit radius of gyration
- Preserve secondary structure
  - Compare predicted 2° structure to observed 2° structure
4. Energy Functions
- Threading Energy
  - Bryant & Lawrence
- Ramachandran Score
- Secondary Structure Score
S_i = score for residue i according to the Ramachandran plot: core = 0, allowed = 1, generally allowed = 2, disallowed = 3
5. Energy Functions
D_ij = Cα distance between residue i and centroid
P_radgyr: radius-of-gyration term in n^0.33 and 2.4
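The radius-of-gyration term relies on distances from each Cα atom to the centroid. The slide's exact formula is not recoverable from the text, so what follows is only a minimal sketch of the standard definition (root-mean-square Cα distance from the centroid). The function name and the toy coordinates are assumptions made for illustration.

```python
import math

def radius_of_gyration(ca_coords):
    """Root-mean-square distance of C-alpha atoms from their centroid.

    ca_coords: list of (x, y, z) tuples, one per residue (assumed input format).
    """
    n = len(ca_coords)
    if n == 0:
        raise ValueError("need at least one coordinate")
    # Centroid of all C-alpha positions
    centroid = tuple(sum(c[k] for c in ca_coords) / n for k in range(3))
    # Squared distance from each C-alpha to the centroid
    sq_dist = [sum((c[k] - centroid[k]) ** 2 for k in range(3)) for c in ca_coords]
    return math.sqrt(sum(sq_dist) / n)

# Hypothetical toy coordinates: four points on a 2 x 2 square
coords = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 2.0, 0.0), (2.0, 2.0, 0.0)]
rg = radius_of_gyration(coords)
print(round(rg, 4))
```

The difference between a value like this and the theoretical term gives the "(Actual - Theoretical) Radius of Gyration" score used later in the talk.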
6. Energy Functions
E_i = hydrogen bond energy for residue i with a hydrogen bond
N_h = number of residues involved in hydrogen and disulfide bonds
7. Energy Functions
8. Other Scores
- Average Hbond Energy
- Hbonds (and various functions thereof)
- Hbonds/SS
- (Actual - Theoretical) Radius of Gyration (and absolute value)
- Secondary Structure Errors
  - Beta, Helix, Coil
- Statistical chi functions
9. Other Inputs
- Secondary structure as calculated by PSIPRED based only on the protein sequence
- Actual secondary structure as computed by VADAR
10. Development Data Sources
- Real structures: Richardson lab Top500 set, as assessed by their All-Atom Contact method
  - http://kinemage.biochem.duke.edu/databases/top500.php
- Decoy structures: 76,200 decoys of 41 structures derived from the Rosetta process
  - (Baker Lab, from Tsai et al., 2001)
11. Upcoming Data Sources
- PDB Select 25 set
- Trent's NMR structures
- Structures generated by Homodeller for BacMap
  - Homodeller forced to do bad threading
12. Initial Idea
- Try to predict RMSD from the real structure based on some combination of the scoring functions (using MINER and other methods)
- Modest success: R = 0.7, Spearman = 0.7
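The two figures quoted for the RMSD predictions are a Pearson correlation (R) and a Spearman rank correlation. As a rough sketch of how the two differ, both can be computed by hand as below. The helper names and the sample numbers are hypothetical and are not the talk's data.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical predicted vs. actual RMSD values (in Angstroms)
predicted = [1.0, 2.5, 3.0, 6.0, 9.0]
actual = [1.2, 2.0, 4.5, 5.0, 12.0]
print(pearson(predicted, actual), spearman(predicted, actual))
```

Spearman only rewards getting the ordering right, so it is the more relevant number when the goal is to rank candidate structures instead of estimating RMSD exactly.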
13. Simpler Task
- Just distinguish real from decoy structures.
- Unclear which scores are important, so use a data-driven method.
14. C4.5 (Quinlan)
- Induces a decision tree from data
- Splits the data set at each branch based on information-theoretic criteria
- Very general
- Handles discrete and numeric attributes equally well.
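To give a feel for how a numeric attribute can be split on an information-theoretic criterion, here is a simplified sketch that scans candidate thresholds and keeps the one with the highest information gain. It is not Quinlan's implementation: C4.5 itself uses gain ratio and picks thresholds slightly differently, and the attribute values and labels below are made up for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, gain) for the best binary split 'value <= threshold'.

    Uses plain information gain on midpoints between adjacent sorted values;
    this is a simplification of what C4.5 does.
    """
    base = entropy(labels)
    pairs = sorted(zip(values, labels))
    best_t, best_gain = None, 0.0
    for i in range(1, len(pairs)):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no threshold can separate equal values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        remainder = (len(left) * entropy(left)
                     + len(right) * entropy(right)) / len(pairs)
        gain = base - remainder
        if gain > best_gain:
            best_t, best_gain = (lo + hi) / 2, gain
    return best_t, best_gain

# Hypothetical values of one score and the matching real/decoy labels
rofg = [0.1, 0.2, 0.3, 0.6, 0.7, 0.9]
label = ["no", "no", "no", "yes", "yes", "yes"]
threshold, gain = best_threshold(rofg, label)
print(threshold, gain)
```

Discrete attributes are handled the same way, except that the groups come from the attribute's values and no threshold search is needed.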
15. Sample Decision Tree
- chi4n < 0.502986
  - beta_perc < 0.1875
    - hbond_erg < -1.301656
      - helix_perc < 0.675
        - rofg < 0.455691: no (10.0)
        - rofg > 0.455691: yes (16.0/1.0)
      - helix_perc > 0.675
        - bump < 0.000773
          - hbond_erg < -1.448668
            - ss < 0.740741: yes (13.0/2.0)
            - ss > 0.740741: no (2.0)
16. Should I Go To This Talk?
Information = Σ (prob) × −log2(prob) = 0.811 bits
17. Split on Interesting
Info = 0.721 bits
Info = 0.918 bits
Total Info = 6/8 × 0.721 + 2/8 × 0.918 = 0.771
18. Split on Food
Info = 0 bits
Info = 0.918 bits
Total Info = 3/8 × 0.918 = 0.344
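The arithmetic on these three slides can be checked with a few lines of code. The example table itself did not survive in the text, so the class counts below are assumptions inferred from the quoted figures: 8 examples split 6 to 2 overall (which gives 0.811 bits), and a Food split into a pure group of 5 and a mixed group of 3 (which gives 0 and 0.918 bits). The function names are hypothetical.

```python
import math

def entropy(counts):
    """Shannon entropy in bits for a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def split_info(groups):
    """Size-weighted average entropy of the groups produced by a split."""
    total = sum(sum(g) for g in groups)
    return sum(sum(g) / total * entropy(g) for g in groups)

# Assumed counts: 6 examples of one class, 2 of the other
before = entropy([6, 2])                 # about 0.811 bits

# Assumed Food split: one pure group of 5, one group of 3 split 1 to 2
food = split_info([[5, 0], [1, 2]])      # about 0.344 bits

# The Interesting split, using the weights and entropies as quoted on the slide
interesting = 6 / 8 * 0.721 + 2 / 8 * 0.918   # about 0.77 bits

print(round(before, 3), round(food, 3), round(interesting, 3))
print("gain from Food:", round(before - food, 3))
```

The split that leaves the least remaining information is preferred, so Food (about 0.344 bits left) beats Interesting (about 0.77 bits left).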
19. Real/Decoy Decision Tree
- 127 branches, 64 leaves (equivalently 64 rules)
- Real structures (446): 384 right, 62 wrong
- Decoys (63546): 63537 right, 9 wrong
- Given a real structure, the tree will identify it correctly 86% of the time.
- Given a decoy, there's an extremely good chance (99.986%) it will be identified as such.
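The two percentages follow directly from the counts on this slide. A short check, with variable names chosen here for illustration, also yields two figures the slide does not state: the fraction of "real" calls that are actually real, and the overall accuracy.

```python
# Counts as given on the slide
real_right, real_wrong = 384, 62
decoy_right, decoy_wrong = 63537, 9

# Fraction of real structures identified as real
real_recall = real_right / (real_right + real_wrong)

# Fraction of decoys identified as decoys
decoy_recall = decoy_right / (decoy_right + decoy_wrong)

# Of everything the tree calls "real", the fraction that truly is real
# (a decoy classified wrongly is one that was called real)
real_precision = real_right / (real_right + decoy_wrong)

# Fraction of all structures classified correctly
total = real_right + real_wrong + decoy_right + decoy_wrong
accuracy = (real_right + decoy_right) / total

print(f"real recall     {real_recall:.1%}")
print(f"decoy recall    {decoy_recall:.3%}")
print(f"real precision  {real_precision:.1%}")
print(f"accuracy        {accuracy:.2%}")
```

Because decoys outnumber real structures by more than 100 to 1, overall accuracy is dominated by the decoy class, which is why the per-class rates are the more informative numbers.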
20. The Most Effective Rule
- If abs(actual radius of gyration - theoretical radius of gyration) < 2.698442, AND
- (Fraction of Residues Predicted As Beta and Evaluated as Helix) = 0, AND
- Chi Score4 > 0.502986, AND
- (Fraction of Residues Predicted As Beta and Evaluated as Beta) < 0.468750, THEN
- It's a decoy
- Correct in 49862 cases, wrong in 36.
- A real finding, or a bias in how Rosetta generates decoys?
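Written out as a predicate, the rule is a conjunction of four threshold tests. The sketch below uses the thresholds from the slide; the `Scores` container and its field names are hypothetical, as are the example values, and the second condition is read as "equals zero".

```python
from dataclasses import dataclass

@dataclass
class Scores:
    """Hypothetical container for the scores the rule needs."""
    rofg_actual: float          # actual radius of gyration
    rofg_theoretical: float     # theoretical radius of gyration
    beta_pred_helix_obs: float  # fraction predicted beta, evaluated as helix
    beta_pred_beta_obs: float   # fraction predicted beta, evaluated as beta
    chi4: float                 # Chi Score4

def most_effective_rule(s: Scores) -> bool:
    """Return True when the rule fires, i.e. the structure is called a decoy."""
    return (
        abs(s.rofg_actual - s.rofg_theoretical) < 2.698442
        and s.beta_pred_helix_obs == 0
        and s.chi4 > 0.502986
        and s.beta_pred_beta_obs < 0.468750
    )

# Made-up example values, not taken from the data set
fires = most_effective_rule(Scores(11.0, 10.2, 0.0, 0.10, 0.60))
does_not_fire = most_effective_rule(Scores(11.0, 10.2, 0.0, 0.10, 0.40))
print(fires, does_not_fire)
```

A structure for which the rule does not fire is not thereby classified as real; it simply falls through to the other rules in the tree.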
21. Chi Scoring Functions
- Another decision tree
- Inputs
  - Amino acid type
  - Phi
  - Psi
  - Next AA
  - Next Chi
  - Preceding AA
  - Preceding Chi
  - Coil or Not
  - (Chi2 angle)
22. Chi Results
- 1296 branches, 917 leaves
- w/ chi2, correctly predicts 80% of chi angles
- w/o chi2, correctly predicts 73% of chi angles
23. Going Forward
- Use a lot more datasets and different combinations thereof.
- Try predicting RMSD within classes (0-2 Å, 2-4 Å, etc.)
- Examine incorrectly classified structures more carefully
- Test with contrived structures (e.g. a real structure with a polar residue in the core)
- Different machine learning methods.
24. Acknowledgements
- Joey Cruz
- The Proteome Analysts (Roman Eisner, Alona Fyshe, Brandon Pearcey, Brett Poulin)
- Warren Ward
- Dr. Wishart
- Haiyan Zhang