Bioinformatics and Machine Learning: the Prediction of Protein Structures on a Genomic Scale. Pierre Baldi, Dept. of Information and Computer Science, Institute for Genomics and Bioinformatics, University of California, Irvine


1
Bioinformatics and Machine Learning: the
Prediction of Protein Structures on a Genomic
Scale
Pierre Baldi
Dept. of Information and Computer Science
Institute for Genomics and Bioinformatics
University of California, Irvine
2
  • tggaagggctaattcactcccaacgaagacaagatatccttgatctgtgg
    atctaccacacacaaggctacttccctgattagcagaactacacaccagg
    gccagggatcagatatccactgacctttggatggtgctacaagctagtac
    cagttgagccagagaagttagaagaagccaacaaaggagagaacaccagc
    ttgttacaccctgtgagcctgcatggaatggatgacccggagagagaagt
    gttagagtggaggtttgacagccgcctagcatttcatcacatggcccgag
    agctgcatccggagtacttcaagaactgctgacatcgagcttgctacaag
    ggactttccgctggggactttccagggaggcgtggcctgggcgggactgg
    ggagtggcgagccctcagatcctgcatataagcagctgctttttgcctgt
    actgggtctctctggttagaccagatctgagcctgggagctctctggcta
    actagggaacccactgcttaagcctcaataaagcttgccttgagtgcttc
    aagtagtgtgtgcccgtctgttgtgtgactctggtaactagagatccctc
    agacccttttagtcagtgtggaaaatctctagcagtggcgcccgaacagg
    gacctgaaagcgaaagggaaaccagaggagctctctcgacgcaggactcg
    gcttgctgaagcgcgcacggcaagaggcgaggggcggcgactggtgagta
    cgccaaaaattttgactagcggaggctagaaggagagagatgggtgcgag
    agcgtcagtattaagcgggggagaattagatcgatgggaaaaaattcggt
    taaggccagggggaaagaaaaaatataaattaaaacatatagtatgggca
    agcagggagctagaacgattcgcagttaatcctggcctgttagaaacatc
    agaaggctgtagacaaatactgggacagctacaaccatcccttcagacag
    gatcagaagaacttagatcattatataatacagtagcaaccctctattgt
    gtgcatcaaaggatagagataaaagacaccaaggaagctttagacaagat
    agaggaagagcaaaacaaaagtaagaaaaaagcacagcaagcagcagctg
    acacaggacacagcaatcaggtcagccaaaattaccctatagtgcagaac
    atccaggggcaaatggtacatcaggccatatcacctagaactttaaatgc
    atgggtaaaagtagtagaagagaaggctttcagcccagaagtgataccca
    tgttttcagcattatcagaaggagccaccccacaagatttaaacaccatg
    ctaaacacagtggggggacatcaagcagccatgcaaatgttaaaagagac
    catcaatgaggaagctgcagaatgggatagagtgcatccagtgcatgcag
    ggcctattgcaccaggccagatgagagaaccaaggggaagtgacatagca
    ggaactactagtacccttcaggaacaaataggatggatgacaaataatcc
    acctatcccagtaggagaaatttataaaagatggataatcctgggattaa
    ataaaatagtaagaatgtatagccctaccagcattctggacataagacaa
    ggaccaaaggaaccctttagagactatgtagaccggttctataaaactct
    aagagccgagcaagcttcacaggaggtaaaaaattggatgacagaaacct
    tgttggtccaaaatgcgaacccagattgtaagactattttaaaagcattg
    ggaccagcggctacactagaagaaatgatgacagcatgtcagggagtagg
    aggacccggccataaggcaagagttttggctgaagcaatgagccaagtaa
    caaattcagctaccataatgatgcagagaggcaattttaggaaccaaaga
    aagattgttaagtgtttcaattgtggcaaagaagggcacacagccagaaa
    ttgcagggcccctaggaaaaagggctgttggaaatgtggaaaggaaggac
    accaaatgaaagattgtactgagagacaggctaattttttagggaagatc
    tggccttcctacaagggaaggccagggaattttcttcagagcagaccaga
    gccaacagccccaccagaagagagcttcaggtctggggtagagacaacaa
    ctccccctcagaagcaggagccgatagacaaggaactgtatcctttaact
    tccctcaggtcactctttggcaacgacccctcgtcacaataaagataggg
    gggcaactaaaggaagctctattagatacaggagcagatgatacagtatt
    agaagaaatgagtttgccaggaagatggaaaccaaaaatgatagggggaa
    ttggaggttttatcaaagtaagacagtatgatcagatactcatagaaatc
    tgtggacataaagctataggtacagtattagtaggacctacacctgtcaa
    cataattggaagaaatctgttgactcagattggttgcactttaaattttc
    ccattagccctattgagactgtaccagtaaaattaaagccaggaatggat
    ggcccaaaagttaaacaatggccattgacagaagaaaaaataaaagcatt
    agtagaaatttgtacagagatggaaaaggaagggaaaatttcaaaaattg
    ggcctgaaaatccatacaatactccagtatttgccataaagaaaaaagac
    agtactaaatggagaaaattagtagatttcagagaacttaataagagaac
    tcaagacttctgggaagttcaattaggaataccacatcccgcagggttaa
    aaaagaaaaaatcagtaacagtactggatgtgggtgatgcatatttttca
    gttcccttagatgaagacttcaggaagtatactgcatttaccatacctag
    tataaacaatgagacaccagggattagatatcagtacaatgtgcttccac
    agggatggaaaggatcaccagcaatattccaaagtagcatgacaaaaatc
    ttagagccttttagaaaacaaaatccagacatagttatctatcaatacat
    ggatgatttgtatgtaggatctgacttagaaatagggcagcatagaacaa
    aaatagaggagctgagacaacatctgttgaggtggggacttaccacacca
    gacaaaaaacatcagaaagaacctccattcctttggatgggttatgaact
    ccatcctgataaatggacagtacagcctatagtgctgccagaaaaagaca
    gctggactgtcaatgacatacagaagttagtggggaaattgaattgggca
    agtcagatttacccagggattaaagtaaggcaattatgtaaactccttag
    aggaaccaaagcactaacagaagtaataccactaacagaagaagcagagc
    tagaactggcagaaaacagagagattctaaaagaaccagtacatggagtg
    tattatgacccatcaaaagacttaatagcagaaatacagaagcaggggca
    aggccaatggacatatcaaatttatcaagagccatttaaaaatctgaaaa
    caggaaaatatgcaagaatgaggggtgcccacactaatgatgtaaaacaa
    ttaacagaggcagtgcaaaaaataaccacagaaagcatagtaatatgggg
    aaagactcctaaatttaaactgcccatacaaaaggaaacatgggaaacat
    ggtggacagagtattggcaagccacctggattcctgagtgggagtttgtt
    aatacccctcccttagtgaaattatggtaccagttagagaaagaacccat
    agtaggagcagaaaccttctatgtagatggggcagctaacagggagacta
    aattaggaaaagcaggatatgttactaatagaggaagacaaaaagttgtc
    accctaactgacacaacaaatcagaagactgagttacaagcaatttatct
    agctttgcaggattcgggattagaagtaaacatagtaacagactcacaat
    atgcattaggaatcattcaagcacaaccagatcaaagtgaatcagagtta
    gtcaatcaaataatagagcagttaataaaaaaggaaaaggtctatctggc
    atgggtaccagcacacaaaggaattggaggaaatgaacaagtagataaat
    tagtcagtgctggaatcaggaaagtactatttttagatggaatagataag
    gcccaagatgaacatgagaaatatcacagtaattggagagcaatggctag
    tgattttaacctgccacctgtagtagcaaaagaaatagtagccagctgtg
    ataaatgtcagctaaaaggagaagccatgcatggacaagtagactgtagt
    ccaggaatatggcaactagattgtacacatttagaaggaaaagttatcct
    ggtagcagttcatgtagccagtggatatatagaagcagaagttattccag
    cagaaacagggcaggaaacagcatattttcttttaaaattagcaggaaga
    tggccagtaaaaacaatacatactgacaatggcagcaatttcaccggtgc
    tacggttagggccgcctgttggtgggcgggaatcaagcaggaatttggaa
    ttccctacaatccccaaagtcaaggagtagtagaatctatgaataaagaa
    ttaaagaaaattataggacaggtaagagatcaggctgaacatcttaagac
    agcagtacaaatggcagtattcatccacaattttaaaagaaaagggggga
    ttggggggtacagtgcaggggaaagaatagtagacataatagcaacagac
    atacaaactaaagaattacaaaaacaaattacaaaaattcaaaattttcg
    ggtttattacagggacagcagaaatccactttggaaaggaccagcaaagc
    tcctctggaaaggtgaaggggcagtagtaatacaagataatagtgacata
    aaagtagtgccaagaagaaaagcaaagatcattagggattatggaaaaca
    gatggcaggtgatgattgtgtggcaagtagacaggatgaggattagaaca
    tggaaaagtttagtaaaacaccatatgtatgtttcagggaaagctagggg
    atggttttatagacatcactatgaaagccctcatccaagaataagttcag
    aagtacacatcccactaggggatgctagattggtaataacaacatattgg
    ggtctgcatacaggagaaagagactggcatttgggtcagggagtctccat
    agaatggaggaaaaagagatatagcacacaagtagaccctgaactagcag
    accaactaattcatctgtattactttgactgtttttcagactctgctata
    agaaaggccttattaggacacatagttagccctaggtgtgaatatcaagc
    aggacataacaaggtaggatctctacaatacttggcactagcagcattaa
    taacaccaaaaaagataaagccacctttgcctagtgttacgaaactgaca
    gaggatagatggaacaagccccagaagaccaagggccacagagggagcca
    cacaatgaatggacactagagcttttagaggagcttaagaatgaagctgt
    tagacattttcctaggatttggctccatggcttagggcaacatatctatg
    aaacttatggggatacttgggcaggagtggaagccataataagaattctg
    caacaactgctgtttatccattttcagaattgggtgtcgacatagcagaa
    taggcgttactcgacagaggagagcaagaaatggagccagtagatcctag
    actagagccctggaagcatccaggaagtcagcctaaaactgcttgtacca
    attgctattgtaaaaagtgttgctttcattgccaagtttgtttcataaca
    aaagccttaggcatctcctatggcaggaagaagcggagacagcgacgaag
    agctcatcagaacagtcagactcatcaagcttctctatcaaagcagtaag
    tagtacatgtaacgcaacctataccaatagtagcaatagtagcattagta
    gtagcaataataatagcaatagttgtgtggtccatagtaatcatagaata
    taggaaaatattaagacaaagaaaaatagacaggttaattgatagactaa
    tagaaagagcagaagacagtggcaatgagagtgaaggagaaatatcagca
    cttgtggagatgggggtggagatggggcaccatgctccttgggatgttga
    tgatctgtagtgctacagaaaaattgtgggtcacagtctattatggggta
    cctgtgtggaaggaagcaaccaccactctattttgtgcatcagatgctaa
    agcatatgatacagaggtacataatgtttgggccacacatgcctgtgtac
    ccacagaccccaacccacaagaagtagtattggtaaatgtgacagaaaat
    tttaacatgtggaaaaatgacatggtagaacagatgcatgaggatataat
    cagtttatgggatcaaagcctaaagccatgtgtaaaattaaccccactct
    gtgttagtttaaagtgcactgatttgaagaatgatactaataccaatagt
    agtagcgggagaatgataatggagaaaggagagataaaaaactgctcttt
    caatatcagcacaagcataagaggtaaggtgcagaaagaatatgcatttt
    tttataaacttgatataataccaatagataatgatactaccagctataag
    ttgacaagttgtaacacctcagtcattacacaggcctgtccaaaggtatc
    ctttgagccaattcccatacattattgtgccccggctggttttgcgattc
    taaaatgtaataataagacgttcaatggaacaggaccatgtacaaatgtc
    agcacagtacaatgtacacatggaattaggccagtagtatcaactcaact
    gctgttaaatggcagtctagcagaagaagaggtagtaattagatctgtca
    atttcacggacaatgctaaaaccataatagtacagctgaacacatctgta
    gaaattaattgtacaagacccaacaacaatacaagaaaaagaatccgtat
    ccagagaggaccagggagagcatttgttacaataggaaaaataggaaata
    tgagacaagcacattgtaacattagtagagcaaaatggaataacacttta
    aaacagatagctagcaaattaagagaacaatttggaaataataaaacaat
    aatctttaagcaatcctcaggaggggacccagaaattgtaacgcacagtt
    ttaattgtggaggggaatttttctactgtaattcaacacaactgtttaat
    agtacttggtttaatagtacttggagtactgaagggtcaaataacactga
    aggaagtgacacaatcaccctcccatgcagaataaaacaaattataaaca
    tgtggcagaaagtaggaaaagcaatgtatgcccctcccatcagtggacaa
    attagatgttcatcaaatattacagggctgctattaacaagagatggtgg
    taatagcaacaatgagtccgagatcttcagacctggaggaggagatatga
    gggacaattggagaagtgaattatataaatataaagtagtaaaaattgaa
    ccattaggagtagcacccaccaaggcaaagagaagagtggtgcagagaga
    aaaaagagcagtgggaataggagctttgttccttgggttcttgggagcag
    caggaagcactatgggcgcagcctcaatgacgctgacggtacaggccaga
    caattattgtctggtatagtgcagcagcagaacaatttgctgagggctat
    tgaggcgcaacagcatctgttgcaactcacagtctggggcatcaagcagc
    tccaggcaagaatcctggctgtggaaagatacctaaaggatcaacagctc
    ctggggatttggggttgctctggaaaactcatttgcaccactgctgtgcc
    ttggaatgctagttggagtaataaatctctggaacagatttggaatcaca
    cgacctggatggagtgggacagagaaattaacaattacacaagcttaata
    cactccttaattgaagaatcgcaaaaccagcaagaaaagaatgaacaaga
    attattggaattagataaatgggcaagtttgtggaattggtttaacataa
    caaattggctgtggtatataaaattattcataatgatagtaggaggcttg
    gtaggtttaagaatagtttttgctgtactttctatagtgaatagagttag
    gcagggatattcaccattatcgtttcagacccacctcccaaccccgaggg
    gacccgacaggcccgaaggaatagaagaagaaggtggagagagagacaga
    gacagatccattcgattagtgaacggatccttggcacttatctgggacga
    tctgcggagcctgtgcctcttcagctaccaccgcttgagagacttactct
    tgattgtaacgaggattgtggaacttctgggacgcagggggtgggaagcc
    ctcaaatattggtggaatctcctacagtattggagtcaggaactaaagaa
    tagtgctgttagcttgctcaatgccacagccatagcagtagctgagggga
    cagatagggttatagaagtagtacaaggagcttgtagagctattcgccac
    atacctagaagaataagacagggcttggaaaggattttgctataagatgg
    gtggcaagtggtcaaaaagtagtgtgattggatggcctactgtaagggaa
    agaatgagacgagctgagccagcagcagatagggtgggagcagcatctcg
    agacctggaaaaacatggagcaatcacaagtagcaatacagcagctacca
    atgctgcttgtgcctggctagaagcacaagaggaggaggaggtgggtttt
    ccagtcacacctcaggtacctttaagaccaatgacttacaaggcagctgt
    agatcttagccactttttaaaagaaaaggggggactggaagggctaattc
    actcccaaagaagacaagatatccttgatctgtggatctaccacacacaa
    ggctacttccctgattagcagaactacacaccagggccaggggtcagata
    tccactgacctttggatggtgctacaagctagtaccagttgagccagata
    agatagaagaggccaataaaggagagaacaccagcttgttacaccctgtg
    agcctgcatgggatggatgacccggagagagaagtgttagagtggaggtt
    tgacagccgcctagcatttcatcacgtggcccgagagctgcatccggagt
    acttcaagaactgctgacatcgagcttgctacaagggactttccgctggg
    gactttccagggaggcgtggcctgggcgggactggggagtggcgagccct
    cagatcctgcatataagcagctgctttttgcctgtactgggtctctctgg
    ttagaccagatctgagcctgggagctctctggctaactagggaacccact
    gcttaagcctcaataaagcttgccttgagtgcttcaagtagtgtgtgccc
    gtctgttgtgtgactctggtaactagagatccctcagacccttttagtca
    gtgtggaaaatctctagca

3
(No Transcript)
4
(No Transcript)
5
(No Transcript)
6
(No Transcript)
7
SCALES
organisms       genome        genes
virus           10-100,000    10
bacteria        5 Mb          3,000
single cell     15 Mb         6,000
simple animal   100 Mb        15,000
man             3,000 Mb      30-40,000

8
(No Transcript)
9
(No Transcript)
10
(No Transcript)
11
(No Transcript)
12
Examples of Computational Problems
  • Physical and Genetic Maps
  • Genome assembly
  • Pairwise and Multiple Alignments
  • Motif Detection/Discrimination/Classification
  • Data Base Searches and Mining
  • Phylogenetic Tree Reconstruction
  • Gene Finding and Gene Parsing
  • Protein Secondary Structure Prediction
  • Protein Tertiary Structure Prediction
  • Protein Function Prediction
  • Comparative genomics
  • DNA microarray analysis
  • Gene regulation/regulatory networks

13
Machine Learning
  • Extract information from the data automatically
    (inference) via a process of model fitting
    (learning from examples).
  • Model Selection: Neural Networks, Hidden Markov
    Models, Stochastic Grammars, Bayesian Networks,
    Graphical Models, Kernel Methods
  • Model Fitting: Gradient Methods, Monte Carlo
    Methods, ...
  • Machine learning approaches are most useful in
    areas where there is a lot of data but little
    theory.

14
Three Key Factors for Expansion
  • Data Mining/Machine Learning expansion is fueled
    by:
  • Progress in sensors, data storage, and data
    management.
  • Computing power.
  • Theoretical framework.

15
(No Transcript)
16
(No Transcript)
17
Utility of Structural Information
(Baker and Sali, 2001)
18
CAVEAT
19
REMARKS
  • Structure/Folding
  • Backbone/Full Atom
  • Homology Modeling
  • Protein Threading
  • Ab Initio (Physical Potentials, Statistical
    Mechanics/Lattice Models)
  • Lego Approach
  • Statistical/Machine Learning (Training Sets, SS
    prediction)
  • Mixtures: ab initio with statistical potentials,
    machine learning with profiles, etc.

20
β-sheet
β-ladders
β-strands
21
PROTEIN STRUCTURE PREDICTION
22
Intestinal Fatty Acid Binding Protein (beta
barrel)
23
(No Transcript)
24
(No Transcript)
25
Helices
  • 1GRJ (GreA Transcript Cleavage Factor from
    Escherichia coli)

26
Antiparallel β-sheets
  • 1MSC (Bacteriophage MS2 Unassembled Coat Protein
    Dimer)

27
Parallel β-sheets
  • 1FUE (Flavodoxin)

28
Contact map
29
Secondary structure prediction
30
SPARSE ENCODING
  • PROTEIN SEQUENCE: MAPVC
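
As a rough illustration of sparse encoding, the sketch below maps each residue of the example sequence MAPVC to a 20-component vector with a single 1. It assumes that "sparse encoding" here means a one-hot (orthogonal) code over the 20 standard amino acids; the alphabet ordering and the function name `sparse_encode` are illustrative choices, not taken from the slides.

```python
# Assumed alphabet: the 20 standard amino acids in alphabetical one-letter order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def sparse_encode(sequence):
    """Return one 20-component 0/1 vector per residue (one-hot code)."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    encoded = []
    for residue in sequence:
        vector = [0] * len(AMINO_ACIDS)
        vector[index[residue]] = 1  # exactly one component is set per residue
        encoded.append(vector)
    return encoded


encoded = sparse_encode("MAPVC")
for residue, vector in zip("MAPVC", encoded):
    print(residue, "".join(str(bit) for bit in vector))
```

Each residue contributes a vector that is orthogonal to the vectors of all other residue types, so no artificial similarity between amino acids is built into the input.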

 
 
31
(No Transcript)
32
GRAPHICAL MODELS
  • Bayesian statistics and modeling lead to very
    high-dimensional distributions P(D,H,M) which are
    typically intractable.
  • Need for factorization into independent clusters
    of variables that reflect the local (Markovian)
    dependencies of the world and the data.
  • Hence the general theory of graphical models.
  • Directed models reflect temporal and causal
    relationships (NNs, HMMs, Bayesian networks, etc.).
  • Directed models are used for instance in expert
    systems.
  • Undirected models reflect correlations (Markov
    Random Fields, Boltzmann machines, etc.).
  • Undirected models are used for instance in image
    modeling problems.
  • Mixed directed/undirected and other models are
    possible.

33
GLOBAL FACTORIZATION FOR BAYESIAN NETWORKS
  • $X_1, \ldots, X_n$: random variables associated with the
    vertices of a DAG (Directed Acyclic Graph).
  • The local conditional distributions
    $P(X_i \mid X_j : j \text{ parent of } i)$ are the parameters of the model.
    They can be represented by look-up tables
    (costly) or other more compact parameterizations
    (Sigmoidal Belief Networks, XOR, etc.).
  • The global distribution is the product of the
    local characteristics:
    $P(X_1, \ldots, X_n) = \prod_i P(X_i \mid X_j : j \text{ parent of } i)$
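
The factorization can be checked numerically on a toy network. The sketch below is only an illustration: the three binary variables A, B, C, the graph A → B, (A, B) → C, and all table entries are made-up values, and the local conditionals are stored as plain look-up tables.

```python
from itertools import product

# Hypothetical DAG: each node lists its parents (A -> B, A -> C, B -> C).
parents = {"A": (), "B": ("A",), "C": ("A", "B")}

# Look-up tables: cpt[node][parent values] = P(node = 1 | parents).
# All numbers are invented for illustration.
cpt = {
    "A": {(): 0.3},
    "B": {(0,): 0.2, (1,): 0.7},
    "C": {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.6, (1, 1): 0.9},
}


def joint_probability(assignment):
    """Product over nodes of P(X_i | parents of X_i) for one full assignment."""
    p = 1.0
    for node, node_parents in parents.items():
        parent_values = tuple(assignment[q] for q in node_parents)
        p_one = cpt[node][parent_values]
        p *= p_one if assignment[node] == 1 else 1.0 - p_one
    return p


# Summing the product over all 2^3 assignments should give 1.
total = sum(
    joint_probability(dict(zip(parents, values)))
    for values in product((0, 1), repeat=len(parents))
)
print(joint_probability({"A": 1, "B": 1, "C": 1}), total)
```

The look-up table for C already has four entries for two parents; the table size doubles with every additional binary parent, which is why the slide calls this representation costly.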

34
(No Transcript)
35
(No Transcript)
36
(No Transcript)
37
(No Transcript)
38
(No Transcript)
39
DATA PREPARATION
  • Starting point: PDB database.
  • Remove sequences not determined by X-ray
    diffraction.
  • Remove sequences where DSSP crashes.
  • Remove proteins with physical chain
    breaks (neighboring AA having
    distances exceeding 4 Angstroms).
  • Remove sequences with resolution worse
    than 2.5 Angstroms.
  • Remove chains with fewer than 30 AA.
  • Remove redundancy (Hobohm's algorithm,
    Smith-Waterman, PAM 120, etc.).
  • Build multiple alignments (BLAST,
    PSI-BLAST, etc.).
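
The per-chain filters above can be sketched as a single predicate. This is a minimal illustration, not the actual pipeline: the `Chain` record and its field names are hypothetical, the distance between neighboring residues is assumed to be precomputed, and redundancy removal and alignment building are left out. Only the thresholds (2.5 Angstroms, 30 AA, 4 Angstroms) come from the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Chain:
    """Hypothetical per-chain record; field names are illustrative only."""
    chain_id: str
    method: str        # experimental method, e.g. "X-RAY DIFFRACTION"
    resolution: float  # in Angstroms (lower is better)
    sequence: str
    neighbor_distances: list = field(default_factory=list)  # consecutive residues
    dssp_ok: bool = True  # False if DSSP failed on this chain


def passes_filters(chain, max_resolution=2.5, min_length=30, max_gap=4.0):
    """Apply the slide's filters in order; True means the chain is kept."""
    if chain.method != "X-RAY DIFFRACTION":
        return False
    if not chain.dssp_ok:
        return False
    if any(d > max_gap for d in chain.neighbor_distances):
        return False  # physical chain break
    if chain.resolution > max_resolution:
        return False
    if len(chain.sequence) < min_length:
        return False
    return True


chains = [
    Chain("good", "X-RAY DIFFRACTION", 1.8, "A" * 120, [3.8] * 119),
    Chain("nmr", "SOLUTION NMR", 0.0, "A" * 120, [3.8] * 119),
    Chain("blurry", "X-RAY DIFFRACTION", 3.1, "A" * 120, [3.8] * 119),
    Chain("short", "X-RAY DIFFRACTION", 1.5, "A" * 12, [3.8] * 11),
    Chain("broken", "X-RAY DIFFRACTION", 2.0, "A" * 40,
          [3.8] * 20 + [9.5] + [3.8] * 18),
]
kept = [c.chain_id for c in chains if passes_filters(c)]
print(kept)
```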

40
SECONDARY STRUCTURE PROGRAMS
  • DSSP (Kabsch and Sander, 1983): works by
    assigning potential backbone hydrogen bonds
    (based on the 3D coordinates of the backbone
    atoms) and subsequently identifying repetitive
    bonding patterns.
  • STRIDE (Frishman and Argos, 1995): in addition
    to hydrogen bonds, it also uses dihedral angles.
  • DEFINE (Richards and Kundrot, 1988): uses
    difference distance matrices to evaluate the
    match of interatomic distances in the protein to
    those from idealized SS.

41
SECONDARY STRUCTURE ASSIGNMENTS
  • DSSP classes:
  • H: alpha helix
  • E: sheet
  • G: 3-10 helix
  • S: kind of turn
  • T: beta turn
  • B: beta bridge
  • I: pi-helix (very rare)
  • C: the rest
  • CASP (harder) assignment:
  • α: H and G
  • β: E and B
  • γ: the rest
  • Alternative assignment:
  • α: H
  • β: B
  • γ: the rest
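
The two reductions from eight DSSP classes to three can be written as small look-up tables. In the sketch below the three output letters H, E and C (helix, strand, rest) are an assumed notation, and the alternative mapping follows the slide exactly as listed; the function name `reduce_classes` is illustrative.

```python
# CASP assignment from the slide: helix = H and G, strand = E and B, rest = everything else.
CASP_MAP = {"H": "H", "G": "H", "E": "E", "B": "E"}

# Alternative assignment as listed on the slide: helix = H, strand = B, rest = everything else.
ALTERNATIVE_MAP = {"H": "H", "B": "E"}


def reduce_classes(dssp_string, mapping, rest="C"):
    """Map a string of 8-class DSSP symbols to three classes (assumed letters H/E/C)."""
    return "".join(mapping.get(symbol, rest) for symbol in dssp_string)


dssp = "HGEBSTIC"  # one symbol of each DSSP class
print(reduce_classes(dssp, CASP_MAP))
print(reduce_classes(dssp, ALTERNATIVE_MAP))
```

Because the choice of reduction changes which residues count as helix or strand, accuracies reported under different assignments are not directly comparable.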

42
ENSEMBLES
43
(No Transcript)
44
(No Transcript)
45
(No Transcript)
46
(No Transcript)
47
FUNDAMENTAL LIMITATIONS
  • 100% CORRECT RECOGNITION IS PROBABLY IMPOSSIBLE
    FOR SEVERAL REASONS:
  • SOME PROTEINS DO NOT FOLD SPONTANEOUSLY OR MAY
    NEED CHAPERONES
  • QUATERNARY STRUCTURE: BETA-STRAND PARTNERS MAY BE
    ON A DIFFERENT CHAIN
  • STRUCTURE MAY DEPEND ON OTHER VARIABLES:
    ENVIRONMENT, pH
  • DYNAMICAL ASPECTS
  • FUZZINESS OF DEFINITIONS AND ERRORS IN DATABASES

48
(No Transcript)
49
(No Transcript)
50
BB-RNNs
51
2D RNNs
52
2D INPUTS
  • AA at positions i and j
  • Profiles at positions i and j
  • Correlated profiles at positions i and j
  • Secondary Structure, Accessibility, etc.

53
PERFORMANCE (%)
                6Å     8Å     10Å    12Å
non-contacts    99.9   99.8   99.2   98.9
contacts        71.2   65.3   52.2   46.6
all             98.5   97.1   93.2   88.5
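
To make the three rows of the table concrete, the sketch below builds a contact map from 3D coordinates at a given distance threshold and scores a predicted map separately on non-contacts, contacts, and all pairs. It is an illustration under assumptions: one coordinate per residue, all pairs i < j counted, and toy coordinates on a straight line; the slides do not say which atoms or which pairs were used.

```python
import math


def contact_map(coords, threshold):
    """Boolean matrix: True where two residues are closer than threshold (Angstroms)."""
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) < threshold for j in range(n)]
            for i in range(n)]


def percent_correct(true_map, predicted_map):
    """Return (non-contacts %, contacts %, all %) over residue pairs i < j."""
    correct = {True: 0, False: 0}
    total = {True: 0, False: 0}
    n = len(true_map)
    for i in range(n):
        for j in range(i + 1, n):
            label = true_map[i][j]
            total[label] += 1
            if predicted_map[i][j] == label:
                correct[label] += 1

    def pct(hits, count):
        return 100.0 * hits / count if count else float("nan")

    return (pct(correct[False], total[False]),
            pct(correct[True], total[True]),
            pct(correct[True] + correct[False], total[True] + total[False]))


# Toy example: four residues spaced 3.8 Angstroms apart on a line.
coords = [(3.8 * k, 0.0, 0.0) for k in range(4)]
true_map = contact_map(coords, 8.0)
predicted_map = contact_map(coords, 6.0)  # stand-in for a model's prediction
non_contacts, contacts, overall = percent_correct(true_map, predicted_map)
print(non_contacts, contacts, overall)
```

In real proteins non-contacts far outnumber contacts, so the "all" figure is dominated by the non-contact row and the contact row is the more informative one.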
54
(No Transcript)
55
(No Transcript)
56
(No Transcript)
57
COARSE MAPS
58
COARSE ANGLE PREDICTION
59
1b0ya (6Å coarse)
60
STRUCTURAL PROTEOMICS SUITE
  • SSpro: secondary structure
  • SSpro8: secondary structure (8 classes)
  • ACCpro: accessibility
  • CONpro: contact number
  • DI-pro: disulphide bridges
  • BETA-pro: beta partners
  • CMAP-pro: contact map
  • CCMAP-pro: coarse contact map
  • CON23D-pro: contact map to 3D
  • 3D-pro: 3D structure

61
(No Transcript)
62
  • SISQQTVWNQMATVRTPLNFDSSKQSFCQFSVDLLGGGISVDKTGDWITL
    VQNSPISNLL
  • CCCECCCCCCEEEECCCCCCCCCCCCEEEEEEECCCCEEEECCCCCCEEE
    EECCHHHHHH
  • CCCEEEEECEEEEECCCCCCCTCCCCEEEEEEEETCSEEEECTTTTEEEE
    EECCHHHHHH
  • -----------------------------------
    ----
  • --------------------------
  • -------------------------------
  • --------------------------
  • eeeeee---e--e-e-eee-ee-eee---------e-e--eeeeee----
    ----------
  • RVAAWKKGCLMVKVVMSGNAAVKRSDWASLVQVFLTNSNSTEHFDACRWT
    KSEPHSWELI
  • HHHHHHCCCEEEEEEEEEECCEEECCCCCEEEEEEEECCCCCCCCCEEEE
    EECCCCCCCC
  • HHHHHHTTCEEEEEEEEEEEEEEECCCCCEEEEEEEECCCTTCCCEEEEE
    EECCTCCEEE
  • -----------------------
    ----------
  • --------------------
    ----
  • -----------------
    ----
  • ------------------
    ----
  • -----ee---e-------e-e-ee-e-e-e-----e--eeee--e-----
    --e-e-ee-e

63
Advantage of Machine Learning
  • Machine learning systems take time to train
    (weeks).
  • Once trained, however, they can predict structures
    almost faster than proteins can fold.
  • Predict or search protein structures on a genomic
    or bioengineering scale.

64
INSIGHTS INTO NEURAL SYSTEMS
  • Architectures with 6 layers
  • Propagation of output results from the center to
    the periphery
  • Role of lateral computations

65
(No Transcript)
66
DAG-RNNs APPROACH
  • Two steps
  • 1. Build relevant DAG to connect inputs, outputs,
    and hidden variables
  • 2. Use a deterministic (neural network)
    parameterization together with appropriate
    stationarity assumptions/weight sharing
  • Process structured data of variable size,
    topology, and dimensions efficiently
  • Sequences, trees, d-lattices, graphs, etc.
  • Other applications

67
(No Transcript)
68
(No Transcript)
69
Structural Databases
  • PPDB: Poxvirus Proteomic Database
  • ICBS: Inter Chain Beta Sheet Database

70
Poxvirus Family
  • Great medical relevance
  • Reasonable size of the genomes
  • Potential threat as bioterrorism weapons
  • Need for poxvirus vaccines and drugs

71
PSPDB
  • Poxvirus Structural Proteomic DataBase
  • Vaccinia/Variola
  • 260 genes or so per family member

72
Smallpox Background
  • Smallpox virus: variola major and variola minor
    viruses
  • Smallpox has caused more deaths in human history
    than any other known pathogen
  • Case-fatality rate of 30%
  • Higher fatality rate for virgin soil
    populations
  • Americas and Hawaii
  • Biological Weapon
  • Used by British soldiers in French and Indian
    Wars and American Revolution
  • Smallpox vaccine using vaccinia virus
  • Widely used beginning in 19th century
  • Smallpox eradicated in 1980
  • Other poxviruses can evolve into deadly human
    pathogens
  • Monkeypox
  • Potential manipulation by humans
  • World population is becoming virgin soil

73
(No Transcript)
74
Screenshot Details B13R
75
(No Transcript)
76
β-sheet
β-ladders
β-strands
77
Interchain β-sheet
Interchain β-ladders
78
1DFN Defensin
79
Why a database of ICBS interactions?
  • Role at multiple structural levels
  • Dimerization, oligomerization
  • Protein-protein interaction
  • Protein and peptide aggregation
  • Protein-nucleic acid interaction
  • Multiple functions / pathways
  • Involved in diseases
  • Artificial β-sheet mimics exist (Nowick)
  • There existed no systematic method to identify
    and characterize them within (or between) known
    structures.

80
(No Transcript)
81
(No Transcript)
82
(No Transcript)
83
Source PDB structures
1a72
Protein Data Bank
PDB Asymmetric Crystallographic Unit
84
Sources PDB and PQS structures
1a72
Protein Data Bank
Protein Quaternary Structure server
  • Copy
  • Translate and rotate copies
  • Assess (crystal packing vs. biological molecule)

PDB: asymmetric unit
PQS: one or more likely biological macromolecules
85
Strength of ICBS interactions and ICBS index
Image provided by James Nowick
86
Ranking ICBS interactions the ICBS index
87
Update and Query of the ICBS Database
PUBLIC DATABASES
PDB
PQS
UPDATE OF THE ICBS DATABASE
  • Get PDB and PQS structures
  • Determine secondary structure (DSSP)
  • Find inter-chain beta-ladders
  • Count H-bonds in ICBS
  • Count heavy atom contacts
  • Compute ICBS index
  • Extract miscellaneous data
  • Update the database
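
The "find inter-chain beta-ladders" step can be pictured with a deliberately simplified sketch. It is not the ICBS procedure: the input is assumed to be a list of already-detected bridge partners as hypothetical (chain, residue, chain, residue) tuples, a ladder is taken to be a run of bridges on consecutive residues with partners one residue apart, and bulges or other irregularities are ignored.

```python
def interchain_ladders(bridges):
    """Group inter-chain bridge pairs into ladders of length > 1.

    bridges: iterable of (chain_a, res_a, chain_b, res_b) tuples (hypothetical format).
    Returns a list of ((chain_a, chain_b), [(res_a, res_b), ...]).
    """
    by_chain_pair = {}
    for chain_a, res_a, chain_b, res_b in bridges:
        if chain_a == chain_b:
            continue  # intrachain bridge, not of interest here
        if chain_a > chain_b:  # store each chain pair in one canonical order
            chain_a, res_a, chain_b, res_b = chain_b, res_b, chain_a, res_a
        by_chain_pair.setdefault((chain_a, chain_b), set()).add((res_a, res_b))

    ladders = []
    for chains, pairs in by_chain_pair.items():
        current = []
        for res_a, res_b in sorted(pairs):
            extends = (current
                       and res_a == current[-1][0] + 1
                       and abs(res_b - current[-1][1]) == 1)
            if extends:
                current.append((res_a, res_b))
            else:
                if len(current) > 1:
                    ladders.append((chains, current))
                current = [(res_a, res_b)]
        if len(current) > 1:
            ladders.append((chains, current))
    return ladders


bridges = [
    ("A", 10, "B", 50), ("A", 11, "B", 49), ("A", 12, "B", 48),  # antiparallel run
    ("A", 20, "A", 30),                                          # intrachain, skipped
    ("B", 60, "A", 40),                                          # isolated bridge
]
print(interchain_ladders(bridges))
```

Counting hydrogen bonds and heavy atom contacts would then be done per ladder before computing an index; those steps need atomic coordinates and are not sketched here.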
ICBS USER INTERFACE
ICBS database / Web server
Query form
Results pages
88
(No Transcript)
89
(No Transcript)
90
Prevalence of ICBS interactions (August 11, 2002)
  • 4,869 ICBS entries for 22,513 structures scanned
  • 2,536 PDB entries (14.6%) with β-sheet
    interactions (ladders of length > 1), out of 17,313
    entries with peptide or protein chains
  • identical chains: 59.9%; different chains: 40.1%
  • antiparallel interchain ladders: 68.3%; parallel:
    8.5%; mixed: 23.2%

91
Conclusion
  • The database is used to
  • Gain insight on ICBS interactions
  • Find potential drug targets
  • Guide the development of artificial molecules
  • Extensions
  • Contents
  • Protein-protein interactions (between structures)
  • Annotation
  • Sequence analysis
  • Curated, non-redundant set
  • Statistical characterization (interface
    context)
  • Interchain vs. intrachain comparison
  • Prediction
  • Issues: specificity, binding strength

92
PROTEIN DOCKING
  • Problem of protein flexibility

93
  • Inclusion of B-factor (a measure of positional
    uncertainty and flexibility) into energy
    calculations
  • B-factor incorporation into structure
    Prediction/Assessment
  • Reverse drug design: search an active site
    database using a small molecule as a query
  • Automated active site detection
  • New comparison algorithm
  • ADME/Toxicity Prediction?

94
ACKNOWLEDGMENTS
  • UCI
  • Gianluca Pollastri, Steve Hampson
  • Arlo Randall, Pierre-Francois Baisnee, S. Josh
    Swamidass, Jianlin Chen, Yimeng Dou, Yann
    Pecout, Alessandro Vullo, Lin Wu
  • DTU: Soren Brunak
  • Columbia: Burkhard Rost
  • U of Florence: Paolo Frasconi
  • U of Bologna: Rita Casadio, Piero Fariselli
  • www.igb.uci.edu/tools.htm
  • www.ics.uci.edu/pfbaldi