6. Machine Learning and Other Predictive Methods - PowerPoint PPT Presentation

Slides: 37
Provided by: Sho57
Learn more at: https://ics.uci.edu

1
6. Machine Learning and Other Predictive Methods
2
Chemical Space
            Stars        Small Mol.
Existing    $10^{22}$    $10^7$
Virtual     0            $10^{60}$ (?)
Mode        Real         Virtual
Access      Difficult    Easy
3
Predictive Methods
  • Predict physical, chemical, and biological
    properties
  • For example: 3D structure, NMR and mass spectra,
    boiling point, melting point, solubility (log P),
    toxicity, reaction rates, binding affinities,
    QSAR
  • Dock PDB to PubChem

4
Methods
  • Spectrum of methods
  • Schrodinger Equation
  • Molecular Dynamics
  • Machine Learning (e.g. SS prediction)

5
Chemical Informatics
  • Informatics must be able to deal with
    variable-size structured data
  • Graphical Models
  • (Recursive) Neural Networks
  • ILP
  • GA
  • SGs
  • Kernels

6
Neural Networks
  • Feedforward applied to fingerprints (1D)
  • Recursive applied to bond graph (2D)
  • Directed Acyclic Graph
  • State vectors
  • Weight sharing

7
Chemo/Bio Informatics
  • Two Key Ingredients
  • 1. Data
  • 2. Similarity Measures
  • Bioinformatics analogy and differences
  • Data (GenBank, Swissprot, PDB)
  • Similarity (BLAST)

8
Fundamental Importance of Similarity Measures
  • Rapid Search of Large Databases
  • Protein Receptor (Docking)
  • Small Molecule/Ligand (Similarity)
  • Predictive Methods (Kernel Methods)

9
Classification
  • Learning to Classify
  • Limited number of training examples (molecules,
    patients, sequences, etc.)
  • Learning algorithm (how to build the classifier?)
  • Generalization should correctly classify test
    data.
  • Formalization
  • X is the input space
  • Y (e.g. toxic/non toxic, or 1,-1) is the target
    class
  • $f: X \to Y$ is the classifier.

10
Linear Classifiers
11
Classification
  • Fundamental Point
  • f is entirely determined
  • by the dot products $\langle x_i, x_j \rangle$
  • measuring similarity between pairs of data points
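
As a minimal sketch of this point (not taken from the slides), a perceptron can be written in dual form, where training and prediction touch the data only through dot products. The function names and the toy data below are illustrative assumptions.

```python
def dot(u, v):
    """Plain dot product <u, v>."""
    return sum(a * b for a, b in zip(u, v))


def train_dual_perceptron(xs, ys, epochs=100):
    """Return dual coefficients alpha and bias b; labels must be +1 or -1."""
    n = len(xs)
    # Gram matrix of pairwise dot products: the only view of the data used.
    gram = [[dot(xs[i], xs[j]) for j in range(n)] for i in range(n)]
    alpha = [0] * n
    b = 0
    for _ in range(epochs):
        mistakes = 0
        for i in range(n):
            score = sum(alpha[j] * ys[j] * gram[j][i] for j in range(n)) + b
            if ys[i] * score <= 0:  # misclassified (or on the boundary)
                alpha[i] += 1
                b += ys[i]
                mistakes += 1
        if mistakes == 0:  # every training point is classified correctly
            break
    return alpha, b


def predict(x, xs, ys, alpha, b):
    """Classify x using only dot products with the training points."""
    score = sum(a * y * dot(xi, x) for a, y, xi in zip(alpha, ys, xs)) + b
    return 1 if score > 0 else -1


# Toy, linearly separable data (hypothetical values).
xs = [(2.0, 1.0), (1.0, 3.0), (-1.0, -2.0), (-3.0, -1.0)]
ys = [1, 1, -1, -1]
alpha, b = train_dual_perceptron(xs, ys)
```

Because `dot` is the only place where the inputs are inspected, swapping it for another similarity function changes the classifier without changing the algorithm.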

12
Nonlinear Classification (Kernel Methods)
  • We can transform a nonlinear problem into a
    linear one using a kernel.

13
Nonlinear Classification (Kernel Methods)
  • We can transform a nonlinear problem into a
    linear one using a kernel K.
  • Fundamental property: the linear decision surface
    depends on
  • $K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$.
  • All we need is the Gram similarity matrix K. K
    defines the local metric of the embedding space.
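
To make the identity between a kernel and a dot product in an embedding space concrete, here is a small sketch using the degree-2 polynomial kernel on 2D inputs, for which the feature map is known in closed form. The kernel choice, the function names, and the sample points are assumptions made for illustration; the slides do not specify a particular kernel here.

```python
import math


def dot(u, v):
    """Plain dot product <u, v>."""
    return sum(a * b for a, b in zip(u, v))


def poly2_kernel(x, y):
    """Homogeneous degree-2 polynomial kernel K(x, y) = <x, y>^2."""
    return dot(x, y) ** 2


def phi(x):
    """Explicit feature map for poly2_kernel on 2D inputs."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


def gram_matrix(points, kernel):
    """Matrix of kernel values for every pair of points."""
    return [[kernel(p, q) for q in points] for p in points]


points = [(1.0, 2.0), (3.0, -1.0), (0.5, 0.5)]

# Kernel evaluated directly in the input space.
K = gram_matrix(points, poly2_kernel)

# Same quantity via dot products in the 3D embedding space.
K_explicit = gram_matrix(points, lambda p, q: dot(phi(p), phi(q)))
```

The two matrices agree up to floating-point rounding, so a learning algorithm that only reads the Gram matrix never needs to compute `phi` explicitly.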

14
Finding a Good Kernel
  • Given: two molecules.
  • Task: systematically compute relevant similarity
    while being storage/time efficient.
  • Motivation: enable efficient application of
    search and kernel algorithms.

15
Similarity Data Representations
NC(O)C(O)O
16
1D SMILES Kernel
17
2D Molecule Graph Kernel
  • For chemical compounds
  • Atom/node labels: A = {C, N, O, H, ...}
  • Bond/edge labels: B = {s, d, t, ar, ...}
  • Count labeled paths
  • Fingerprints

(CsNsCdO)
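
The path-counting idea can be sketched as a depth-first enumeration over a labeled graph. This is an illustrative sketch, not the authors' implementation: the four-atom, hydrogen-suppressed fragment is hypothetical and chosen so that the example path above appears, and "depth" is assumed to mean the maximum number of bonds in a path.

```python
from collections import Counter


def labeled_paths(atoms, bonds, max_bonds):
    """Count labeled paths with at most max_bonds bonds.

    No atom is visited twice within a path. Each path is written as
    alternating atom and bond labels, and is enumerated once per
    traversal direction.
    """
    neighbors = {i: [] for i in range(len(atoms))}
    for i, j, label in bonds:
        neighbors[i].append((j, label))
        neighbors[j].append((i, label))

    counts = Counter()

    def extend(atom, visited, text, depth):
        counts[text] += 1
        if depth == max_bonds:
            return
        for nxt, label in neighbors[atom]:
            if nxt not in visited:
                extend(nxt, visited | {nxt}, text + label + atoms[nxt], depth + 1)

    for start in range(len(atoms)):
        extend(start, {start}, atoms[start], 0)
    return counts


# Hypothetical fragment: C -s- N -s- C =d= O (s = single, d = double).
atoms = ["C", "N", "C", "O"]
bonds = [(0, 1, "s"), (1, 2, "s"), (2, 3, "d")]

paths = labeled_paths(atoms, bonds, max_bonds=3)

# A binary fingerprint keeps only which paths occur, not how often.
fingerprint = set(paths)
```

Practical fingerprinting schemes usually hash such path strings into a fixed-length bit vector; that step is omitted here.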
18
Similarity for Binary Fingerprints
  • Tally features
  • Unique (a, b)
  • In common (c)
  • Similarity Formula
  • $\mathrm{Tanimoto} = c / (a + b + c)$
  • $\mathrm{Tversky}(\alpha, \beta) = c / (\alpha a + \beta b + c)$
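
The two formulas can be sketched directly on fingerprints stored as sets of "on" feature indices. The set representation, the function names, and the convention of returning 0.0 when the denominator is zero are assumptions made for this example.

```python
def tally(fp_a, fp_b):
    """Return (a, b, c): features unique to A, unique to B, and in common."""
    a = len(fp_a - fp_b)
    b = len(fp_b - fp_a)
    c = len(fp_a & fp_b)
    return a, b, c


def tanimoto(fp_a, fp_b):
    """Tanimoto similarity c / (a + b + c)."""
    a, b, c = tally(fp_a, fp_b)
    total = a + b + c
    return c / total if total else 0.0  # assumed convention for empty inputs


def tversky(fp_a, fp_b, alpha, beta):
    """Tversky similarity c / (alpha * a + beta * b + c)."""
    a, b, c = tally(fp_a, fp_b)
    denom = alpha * a + beta * b + c
    return c / denom if denom else 0.0  # assumed convention for zero denominator


# Hypothetical fingerprints: a = 2, b = 1, c = 2.
fp_a = {1, 2, 3, 4}
fp_b = {3, 4, 5}

sim_tanimoto = tanimoto(fp_a, fp_b)          # 2 / 5
sim_symmetric = tversky(fp_a, fp_b, 1, 1)    # same as Tanimoto
sim_dice = tversky(fp_a, fp_b, 0.5, 0.5)     # 2 / 3.5
sim_asymmetric = tversky(fp_a, fp_b, 1, 0)   # ignores features unique to B
```

Setting both weights to 1 recovers Tanimoto, while unequal weights give an asymmetric measure that can be useful for substructure-style queries.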

19
Similarity Measures
20
3D Coordinate Kernel
21
Datasets
22
Examples of Results: Mutag and PTC
23
Results
24
Example of Results (NCI)
25
Example of Results: NCI
Accuracy/ROC
26
Comparison of Kernels (NCI)
27
Regression: Aqueous Solubility
30-fold cross-validation, Delaney dataset, 1440 examples
Kernel                            R²     RMSE   MAE
1D AS                             0.88   0.75   0.55
1D VAS, weight factor 1           0.89   0.71   0.53
2D MinMax, depth 2, no cycle      0.92   0.61   0.44
2D Tanimoto, depth 10, no cycle   0.86   0.79   0.56
2.5D Tanimoto, depth 4            0.77   1.02   0.72
3D CH, bin width 0.1              0.83   0.87   0.67
Published results (train-test) 0.69 0.75
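
For readers unfamiliar with the three column headings, the sketch below shows one way to split a dataset into folds and to compute R², RMSE, and MAE from pooled predictions. It is an assumption-laden illustration: the slides do not say how folds were assigned or whether R² is the coefficient of determination (used here) or a squared correlation, and the numbers in the example are made up rather than taken from the Delaney dataset.

```python
import math


def k_fold_indices(n, k):
    """Yield (train, test) index lists; fold f tests every k-th example."""
    for f in range(k):
        test = [i for i in range(n) if i % k == f]
        train = [i for i in range(n) if i % k != f]
        yield train, test


def regression_metrics(y_true, y_pred):
    """Return (R^2, RMSE, MAE) for paired true and predicted values."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot  # coefficient of determination (assumed definition)
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(e) for e in errors) / n
    return r2, rmse, mae


# Made-up values for illustration only.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]
r2, rmse, mae = regression_metrics(y_true, y_pred)

folds = list(k_fold_indices(10, 5))
```

In a cross-validated run, each example is predicted by the model trained on the folds that exclude it, and the metrics are then computed over all held-out predictions.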
28
XLogP
40-fold cross-validation, dataset size 1991
Kernel                            R²     RMSE   MAE
1D AS                             0.91   0.47   0.32
1D VAS, weight factor 1           0.91   0.46   0.33
2D MinMax, depth 5, no cycle      0.94   0.39   0.25
2D Tanimoto, depth 10, cycles     0.88   0.54   0.35
3D CH, bin width 0.05             0.67   0.88   0.68
S. J. Swamidass, J. Chen, P. Phung, J. Bruand, L.
Ralaivola, and P. Baldi. Kernels for Small
Molecules and the Prediction of Mutagenicity,
Toxicity, and Anti-Cancer Activity. Proceedings
of the 2005 Conference on Intelligent Systems for
Molecular Biology, ISMB 05. Bioinformatics, 21,
Supplement 1, i359-368, (2005).
29
Additional Representations
  • 1D: SMILES string, e.g. NC(CO)C(O)O
  • 2D: Atomic connection table
  • 3D: XYZ coordinates of labeled points
  • 2.5D: 2D surface in 3D space
  • Multiple conformers
  • 4D: Bag of conformers as XYZ coordinates of
    labeled points
  • 3.5D: Bag of conformers as 2D surfaces in 3D space
30
2.5D Surface Kernel
  • Build a graph G (V = atoms) which approximates
    the surface (convex hull).
  • Use spectral graph kernels on G.

31
2.5D Surface Kernel
  • Compute regular/Delaunay tessellation
    (tetrahedrization) of the convex hull of the
    atoms in the molecule
  • Use alpha-shape algorithm to detect surface
    triangles at relevant scale (keep interior and
    regular edges, remove singular edges, r on the
    order of water carbon radius)
  • This yields a triangulated graph that
    approximates the surface (average degree 6).
  • Use spectral kernel with paths ($l = 3, 4$) on the
    triangulated surface graph.

32
Alpha Shape
  • The shape formed by a set of points.
  • Closely related to the solvent-accessible surface.
  • Calculated in O(n log n) using CGAL

http://www.cgal.org/Manual/doc_html/cgal_manual/Alpha_shapes_3/Chapter_main.html
33
The Conformer Problem
  • Atoms connected by proximity
  • Different conformers have different graphs and
    features.
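
A small sketch of why this is a problem: if atoms are connected whenever they lie within a distance cutoff, two conformers of the same molecule can yield different edge sets. The coordinates, the cutoff, and the function name are hypothetical, and the simple cutoff rule stands in for whatever proximity criterion is actually used.

```python
import math
from itertools import combinations


def proximity_graph(coords, cutoff):
    """Return the set of atom index pairs (i, j), i < j, within the cutoff."""
    edges = set()
    for i, j in combinations(range(len(coords)), 2):
        if math.dist(coords[i], coords[j]) <= cutoff:
            edges.add((i, j))
    return edges


# Two hypothetical conformers of the same four-atom chain (arbitrary units).
extended = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0), (4.5, 0.0, 0.0)]
folded = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0), (0.0, 1.5, 0.0)]

edges_extended = proximity_graph(extended, cutoff=2.0)
edges_folded = proximity_graph(folded, cutoff=2.0)

# Edges present in one conformer but not the other.
extra_in_folded = edges_folded - edges_extended
```

Folding the chain brings its two ends within the cutoff, so the folded conformer gains an edge that the extended one lacks, and any graph-derived features change with it.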

34
2.5D + Conformers = 3.5D
Molecule A
Molecule B
35
Molecular Representations and Kernels
  • 1D SMILES strings
  • 2D Graph of bonds
  • 2D Surfaces
  • 2.5D Conformers
  • 3D Atomic coordinates
  • (Pharmacophores, Epitopes)
  • 3.5D Conformers
  • 4D Temporal evolution
  • 4D Isomers

36
Summary
  • ChemDB and other resources
  • Variety of kernels for small molecules
  • State-of-the-art performance on several benchmark
    datasets
  • For now, 2D kernels slightly better than 1D and
    3D kernels
  • Many possible extensions: 2.5D, 3D, 3.5D, 4D
    kernels
  • Need for larger data sets and new models of
    cooperation in the chemistry community
  • Many open (ML) questions (e.g. clustering and
    visualizing $10^7$ compounds, intelligent
    recognition of useful molecules/reactions,
    retrosynthesis, prediction of reaction rates,
    information retrieval from literature, docking,
    matching table of all proteins against all known
    compounds, origin of life, etc.)