Discriminative Graphical Models for Structured Data Prediction - PowerPoint PPT Presentation

About This Presentation
Title:

Discriminative Graphical Models for Structured Data Prediction

Description:

Discriminative Graphical Models for Structured Data Prediction. Yan Liu, Language Technologies Institute, School of Computer Science, Carnegie Mellon University

Date added: 11 September 2019
Slides: 53
Provided by: Eric4165
Learn more at: http://www.cs.cmu.edu
Transcript and Presenter's Notes

Title: Discriminative Graphical Models for Structured Data Prediction


1
Discriminative Graphical Models for Structured
Data Prediction
  • Yan Liu
  • Language Technologies Institute
  • School of Computer Science
  • Carnegie Mellon University
  • Mar 30, 2006

2
Structured Data Prediction
  • Data in many applications have inherent structures

Example
Input                                      Structures
Protein sequence (SEQUENCEXSWGIKQLQARILX)  Protein structures (Gp 41 core protein of HIV virus)
3-D image                                  Segmented objects
Text sequence ("John ate the cat.")        Parsing tree
  • Fundamental importance in many areas
  • Potential for significant theoretical and
    practical advances

3
Conditional Random Fields [Lafferty et al., 2001]
  • Global normalization over undirected graphical
    models
  • Model p(label y | observation x), not the joint
    distribution
  • Allow arbitrary dependencies in observation (e.g.
    long range, overlapping)
  • Adaptive to different loss functions and
    regularizers
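
To make "global normalization" concrete, here is a minimal sketch for a linear-chain CRF. It assumes the potentials have already been evaluated on an observation x and are given as toy log-scores (`node` and `trans` are hypothetical names and values, not from the slides). The forward recursion sums over all label sequences to get log Z(x), so the conditional probabilities of all sequences add up to one.

```python
import math
from itertools import product

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def sequence_score(node, trans, labels):
    """Unnormalized log-score of one label sequence."""
    score = sum(node[t][y] for t, y in enumerate(labels))
    score += sum(trans[a][b] for a, b in zip(labels, labels[1:]))
    return score

def log_partition(node, trans):
    """log Z(x) via the forward algorithm (sum over all label sequences)."""
    n_labels = len(trans)
    alpha = list(node[0])
    for t in range(1, len(node)):
        alpha = [
            node[t][b]
            + log_sum_exp([alpha[a] + trans[a][b] for a in range(n_labels)])
            for b in range(n_labels)
        ]
    return log_sum_exp(alpha)

def conditional_prob(node, trans, labels):
    """p(y | x) = exp(score(y, x)) / Z(x): one global normalizer per sequence."""
    return math.exp(sequence_score(node, trans, labels) - log_partition(node, trans))

# Toy log-potentials for 3 positions and 2 labels (illustrative values only).
node = [[0.5, 0.1], [0.2, 0.9], [1.0, 0.3]]
trans = [[0.4, -0.2], [-0.1, 0.6]]
total = sum(conditional_prob(node, trans, y) for y in product(range(2), repeat=3))
```

Because Z(x) is computed once for the whole sequence, no per-position normalization is applied, which is what distinguishes this from locally normalized models such as MEMMs.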

5
Conditional Random Fields (II)
  • Promising results in
  • Tagging (Collins, 2002) and parsing (Sha and
    Pereira, 2003)
  • Information extraction (Pinto et al., 2003)
  • Image processing (Kumar and Hebert, 2004)
  • Recent developments
  • Alternative estimation algorithms (Collins, 2002,
    Dietterich et al, 2004)
  • Alternative loss functions, use of kernels
    (Taskar et al., 2003, Altun et al, 2003,
    Tsochantaridis et al, 2004)
  • Bayesian formulation (Qi and Minka, 2005) and
    semi-Markov version (Sarawagi and Cohen, 2004)

6
Complex Structures with Long-range Dependencies
  • Long-range dependencies
  • The data points are dependent although they are
    distant in the observed order
  • A property beyond the Markov assumptions

Computational challenge to the current Markov
models
Example sequence: ACEDLPGAEDWCEGGIKQLQARIL
Examples: protein structure prediction,
co-reference resolution, stock market change
7
Outline
  • Brief introduction to protein structures
  • Conditional graphical models
  • Segmentation CRF
  • Chain graph model
  • Dynamic segmentation CRF
  • Conclusion and discussion

8
Biology on One Slide
Nobelprize.org
Predict protein structures from sequences
10
Protein Structure Hierarchy
  • Focus on predicting the topology of the
    structures from sequences

APAFSVSPASGACGPECA
11
Previous Work
  • General approaches
  • Sequence similarity searches, e.g. PSI-BLAST
    [Altschul et al., 1997]
  • Profile HMM, e.g. HMMER [Durbin et al., 1998] and
    SAM [Karplus et al., 1998]
  • Homology modeling or threading, e.g. Threader
    [Jones, 1998]
  • Window-based methods, e.g. PSIPRED [Jones, 2001]
  • Methods of careful design for specific
    structures
  • Example: αα- and ββ-hairpins, β-turn and β-helix
    [Efimov, 1991; Wilmot and Thornton, 1990; Bradley
    et al., 2001]

Structural similarity with less sequence
similarity (under 25%)
Long-range interactions
Hard to generalize
12
Conditional Graphical Models
  • Protein structural graph
  • Undirected graphs
  • Nodes: the states of structural units or
    segments (of fixed or variable lengths)
  • Edges: interactions between units (local or
    long-range dependencies)
  • Segmentation definition: Y = (M, W_i)
  • M: number of segments
  • W_i = (S_i, D_i); S_i: state of the i-th segment;
    D_i: starting position of the i-th segment

Long-range interactions
Local interactions
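
As an illustration of the segmentation definition, the sketch below stores each segment as a (state, start) pair and expands a segmentation into per-residue labels. The class and function names are hypothetical, start positions are assumed to be 0-based, and each segment is assumed to run until the next segment starts.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class Segment:
    state: str  # S_i: state of the i-th segment
    start: int  # D_i: starting position of the i-th segment (0-based here)

def expand(segments: List[Segment], seq_len: int) -> str:
    """Turn a segmentation into one label per residue."""
    parts = []
    for i, seg in enumerate(segments):
        # A segment ends where the next one starts, or at the sequence end.
        end = segments[i + 1].start if i + 1 < len(segments) else seq_len
        parts.append(seg.state * (end - seg.start))
    return "".join(parts)

# M = 5 segments over an 18-residue sequence (toy example).
segs = [Segment("C", 0), Segment("E", 2), Segment("C", 7),
        Segment("H", 12), Segment("C", 15)]
labels = expand(segs, 18)
```

Here M is simply `len(segs)`, so the pair (M, W_i) is fully captured by the list of segments.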
13
Conditional Graphical Models (II)
  • Similar to CRFs, the conditional probability of a
    possible segmentation Y given the observed
    sequence x is defined as
  • Structural topology recognition is reduced to the
    segmentation and labeling problem
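
The formula on this slide did not survive in the transcript. A CRF-style form that is consistent with the segmentation Y = (M, W_i) defined earlier is sketched below; the clique set C_G, the feature functions f_k and the weights λ_k are notation assumed here, not taken from the slide.

```latex
P(Y \mid x) \;=\; \frac{1}{Z(x)} \prod_{c \in C_G} \exp\!\Big( \sum_{k} \lambda_k \, f_k(x, Y_c) \Big),
\qquad
Z(x) \;=\; \sum_{Y'} \prod_{c \in C_G} \exp\!\Big( \sum_{k} \lambda_k \, f_k(x, Y'_c) \Big)
```

Here Y_c denotes the segments that belong to clique c of the protein structural graph, and Z(x) sums over all possible segmentations Y' of x.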

14
Conditional Graphical Models
1. How to define the structural units?
2. How to get the long-range dependencies in the
graph?
15
Training and Testing Phase
  • Training phase: learn the model parameters
  • Minimizing regularized log loss
  • Iterative search algorithms that seek the
    direction in which the empirical feature values
    agree with their expectations
  • Testing phase: search for the segmentation that
    maximizes P(W | x)

3. How to make efficient inferences? (will not
cover in detail)
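
The training criterion can be written out as follows. This is a sketch that assumes a Gaussian (L2) regularizer with variance σ² and N training pairs (x^n, Y^n); the slide only says "regularized log loss", so the exact regularizer is an assumption.

```latex
L(\lambda) \;=\; -\sum_{n=1}^{N} \log P(Y^{n} \mid x^{n}) \;+\; \frac{\lVert \lambda \rVert^{2}}{2\sigma^{2}},
\qquad
\frac{\partial L}{\partial \lambda_k} \;=\; -\sum_{n=1}^{N} \Big( f_k(x^{n}, Y^{n}) \;-\; \mathbb{E}_{P(Y \mid x^{n})}\big[ f_k(x^{n}, Y) \big] \Big) \;+\; \frac{\lambda_k}{\sigma^{2}}
```

Setting the gradient to zero shows why the search looks for parameters at which the empirical feature values match their expectations under the model, up to the regularization term.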
16
Conditional Graphical Models for Protein
Structure Prediction
18
Protein Secondary Structure Prediction
  • Given a protein sequence, predict its secondary
    structure assignments
  • Three classes: helix (H), sheets (E) and coil (C)

Sequence:  APAFSVSPASGACGPECA
Structure: CCEEEEECCCCCHHHCCC
19
CGM on Secondary Structure Prediction [Liu et al.,
Bioinformatics 2004; Liu et al., BLC 2003;
Lafferty et al., ICML 2004]
  • Structural unit
  • Individual residue
  • Segmentation definition
  • Y = (n, S_i); n: number of residues; S_i ∈ {H, E, C}
  • Models
  • Conditional random fields (CRFs)
  • Kernel conditional random fields (kCRFs)
  • where

20
Experiment Results on Prediction Combination
  • Comparison methods
  • Window-based label combination [Rost and Sander,
    1993]
  • Window-based score combination [Jones, 1999]
  • Maximum entropy Markov model (MEMM) [McCallum et
    al., 2000]
  • Higher-order MEMMs (H-MEMM), pseudo state
    duration MEMMs (PSMEMM)

Graphical models are generally better than the
window-based approaches.
CRFs perform the best among the four graphical
models.
21
Conditional Graphical Models for Protein
Structure Prediction
22
Structural motif recognition
  • Structural motif
  • Identifiable regular arrangement of secondary
    structural elements
  • Super-secondary structure, or protein fold

Training sequences
Testing sequence
..APAFSVSPASGACGPECA..
..NNEEEEECCCCCHHHCCC..
Contains the structural motif? Yes
23
CGM for Structural Motif Recognition
  • Structural unit
  • Secondary structure elements
  • Protein structural graph
  • Nodes for the states of secondary structural
    elements of variable lengths
  • Edges for interactions between nodes in 3-D
  • Example: β-α-β motif

25
CGM for Structural Motif Recognition [Liu,
Carbonell, Weigele and Gopalakrishnan, RECOMB
2005]
  • Segmentation definition
  • M: number of segments
  • S_i: state of segment i
  • D_i: starting position of segment i
  • Segmentation conditional random fields (SCRFs)
  • For any graph, we have
  • For a simplified graph, we have
  • Efficient inferences, similar to the forward and
    backward algorithm, can be derived for simplified
    graph

26
Structural Motif Recognition: Complex Folds with
Structural Repeats
  • Prevalent in proteins and important in functions
  • Each rung consists of structural motifs and
    insertions of variable lengths
  • Challenge
  • Low sequence similarity in structural motifs
  • Long-range interactions due to insertions

27
CGM for Structural Motif Recognition [Liu, Xing
and Carbonell, ICML 2005]
  • Chain graph
  • A graph consisting of both directed and
    undirected edges
  • Given a variable set V that forms multiple
    subgraphs U, we have
  • Two-layer segmentation: W = (M, ?i, T)
  • Level 1: envelope, i.e. one rung (repeat) with
    motifs and insertions
  • Level 2: motifs/insertions
  • M: number of envelopes
  • ?i: the segmentation of envelopes
  • T_i: the state of the envelope (repeat, non-repeat)

28
CGM for Structural Motif Recognition [Liu, Xing
and Carbonell, ICML 2005]
  • Chain graph model
  • SCRF subunits
  • Zi only needs to be locally normalized
  • Reduce the computational complexity dramatically

SCRFs
Motif profile model
29
Experiments: Structural Motif Recognition
  • Right-handed β-helix fold
  • A helix-like structure with
  • Three parallel β-strands (B1, B2, B3 strands)
  • T2 turn: a conserved two-residue turn
  • Low sequence identity (under 25%)
  • Associated with bacterial pathogens (infection of
    plants, the cause of whooping cough)
  • Leucine-rich repeats (LLR)
  • Solenoid-like regular arrangement of beta-strands
    and an alpha-helix, connected by coils
  • Relatively high sequence identity (many Leucines)
  • A structural framework for protein-protein
    interaction

30
Experiment Results on SCRF for beta-helix
  • Cross-family validation for classifying β-helix
    proteins
  • SCRFs can score all known β-helices higher than
    non-β-helices

31
Experiment Results on SCRF for beta-helix
  • Predicted segmentation for known β-helices in
    cross-family validation

32
Verification on Proteins with Recently
Crystallized Structures
  • Successfully identifies proteins from different
    organisms
  • 1YP2: Potato Tuber ADP-Glucose Pyrophosphorylase
  • score 10.47
  • 1PXZ: Jun A 1, the Major Allergen from Cedar
    Pollen
  • score 32.35
  • GP14 of Shigella bacteriophage identified as a
    β-helix protein, with score 15.63

33
Experiment Results on Chain Graph Model for
β-helix
  • Cross-family validation for classifying β-helix
    proteins

The chain graph model reduces the real running time
of the SCRF model by a factor of about 50
34
Experiment Results on Chain Graph Model for LLR
  • Cross-family validation for classifying LLR
    proteins
  • Chain graph model can score all known LLR higher
    than non-LLR

35
Experiment Results on Chain Graph Model
  • Predicted segmentation for known β-helices and
    LLRs
  • (A) SCRFs (B) chain graph model

36
Conditional Graphical Models for Protein
Structure Prediction
37
Quaternary Structure Prediction
  • Quaternary structures
  • Multiple chains associated together through
    noncovalent bonds or disulfide bonds
  • Very limited research work to date
  • Complex structures
  • Few positive training data

..APAFSVSPASGACGPECA..
..NNEEEEECCCCCHHHCCC..
Contains the quaternary structures? Yes
38
Dynamic Segmentation CRF [Liu, Carbonell,
Weigele and Gopalakrishnan, submitted 2006]
  • Structural building blocks
  • Secondary structure elements
  • Super-secondary structure elements
  • Segmentation definition
  • For each sequence, we define
  • Y_j = (M_j, W_{j,i})
  • M_j: number of segments in the j-th sequence
  • W_{j,i}: a set of labels determining the i-th
    segment
  • Inter-chain and intra-chain interactions

Van Raaij et al. in Nature (1999)
39
Dynamic Segmentation CRF [Liu, Carbonell,
Weigele and Gopalakrishnan, submitted 2006]
  • Dynamic segmentation CRF
  • Exact inference is computationally infeasible
  • Training: reversible jump MCMC combined with
    contrastive divergence
  • Testing: reversible jump MCMC with simulated
    annealing
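
The testing procedure can be illustrated with a heavily simplified sketch: simulated annealing over the set of segment boundaries, where each move adds or removes one boundary and therefore changes the number of segments. This is not the authors' sampler. The scoring function, the cooling schedule and all names are assumptions. Because the toggle proposal is symmetric, the plain Metropolis acceptance rule is enough here; general reversible jump moves need an extra proposal-ratio term.

```python
import math
import random

def segment_bounds(boundaries, n):
    """(start, end) pairs for the segments induced by internal boundaries."""
    cuts = [0] + sorted(boundaries) + [n]
    return list(zip(cuts, cuts[1:]))

def anneal(n, score, steps=2000, t_start=2.0, t_end=0.05, seed=0):
    """Search for a high-scoring boundary set by simulated annealing."""
    rng = random.Random(seed)
    current = frozenset()            # start with a single segment
    cur_score = score(current)
    best, best_score = current, cur_score
    for step in range(steps):
        # Geometric cooling from t_start down to t_end.
        temp = t_start * (t_end / t_start) ** (step / (steps - 1))
        pos = rng.randrange(1, n)    # internal position 1..n-1
        proposal = current ^ {pos}   # add the boundary, or remove it if present
        delta = score(proposal) - cur_score
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current, cur_score = proposal, cur_score + delta
            if cur_score > best_score:
                best, best_score = current, cur_score
    return sorted(best), best_score

# Toy score: each segment earns |#H - #E|, each boundary costs 0.5.
x = "HHHEEHH"

def toy_score(boundaries):
    total = 0.0
    for start, end in segment_bounds(boundaries, len(x)):
        h = x[start:end].count("H")
        total += abs(2 * h - (end - start))
    return total - 0.5 * len(boundaries)

boundaries, value = anneal(len(x), toy_score)
```

Lowering the temperature makes the sampler increasingly greedy, so late iterations concentrate on high-scoring segmentations.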

40
Experiment: Quaternary structure prediction
  • Triple beta-spirals
  • Described by van Raaij et al. in Nature (1999)
  • Found in DNA viruses and RNA viruses
  • Three proteins with crystallized structures and
    about 20 without structure annotation
  • Characterized by unusual stability to heat,
    protease, and detergent

41
Experiment: Quaternary structure prediction
Cross-family validation
Segmentation results
42
Conditional Graphical Models for Protein
Structure Prediction
43
Model Roadmap
  • Conditional random fields, with kernels: Kernel
    CRFs
  • Conditional random fields, with segmentation:
    Segmentation CRFs
  • Segmentation CRFs, locally normalized: Chain
    graph model
  • Complex structures involving multiple chains:
    Dynamic segmentation CRFs
44
Model Roadmap
Generalized as conditional graphical models
Answers the questions:
1. How to define the structural building blocks?
2. How to get the long-range dependencies in the
graph?
45
Discussion
  • Conditional graphical models for structured data
    prediction with the long-range dependencies
  • Long-range dependencies are common in many
    applications
  • These dependencies can be effectively handled by
    CGM given the guidance of expert knowledge
  • How about unsupervised learning or supervised
    learning without guidance?
  • On-going work
  • Bootstrap features using active learning
  • Application to text mining from clinical reports
  • Efficient inferences robust to large-scale
    applications

46
Other Projects
  • Ensemble approaches for text classification and
    genre classification
  • Boosting algorithm for noisy and imbalanced data
    [Liu et al., 2002; Jin et al., 2003]
  • Input-dependent ensemble algorithms [Liu et al.,
    2003; Yan et al., 2004; Liu et al., 2004]
  • Information extraction of hidden attributes from
    product description
  • Harmonium graphical model for video data analysis
  • Graph-based semi-supervised learning
  • Protein-protein interaction prediction

47
Acknowledgement
  • Carnegie Mellon University
  • Jaime Carbonell, Eric Xing, John Lafferty
  • Yiming Yang, Chris Langmead, Roni Rosenfeld
  • Rong Yan, Luo Si and other students in LTI
  • Univ. of Pittsburgh
  • Vanathi Gopalakrishnan, Judith Klein-Seetharaman,
    Ivet Bahar
  • Massachusetts Institute of Technology
  • Jonathan King, Peter Weigele, Bonnie Berger
  • Michigan State Univ.
  • Rong Jin

49
Features
Structural Motif Recognition
  • Node features
  • Regular expression template, HMM profiles
  • Secondary structure prediction scores
  • Segment length
  • Inter-node features
  • β-strand side-chain alignment scores
  • Preferences for parallel alignment scores
  • Distance between adjacent B23 segments
  • Features are general and easy to extend

50
Protein Structural Graph for Beta-helix
51
Evaluation Measure
Confusion table (P = predicted, T = true):
        P+   P-
  T+    p    u
  T-    o    n
  • Q3 (accuracy)
  • Precision, Recall
  • Segment Overlap quantity (SOV)
  • Matthews Correlation coefficients
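
Two of these measures are easy to state in code. The sketch below computes Q3 and a per-class Matthews correlation coefficient from two label strings. The function names and the toy prediction are illustrative, and the counts follow a common convention: p (correctly predicted), n (correctly rejected), u (under-predicted) and o (over-predicted).

```python
import math

def q3(true, pred):
    """Fraction of residues whose predicted class (H, E or C) is correct."""
    return sum(t == q for t, q in zip(true, pred)) / len(true)

def matthews(true, pred, cls):
    """Matthews correlation coefficient for one class."""
    pairs = list(zip(true, pred))
    p = sum(t == cls and q == cls for t, q in pairs)  # true positives
    n = sum(t != cls and q != cls for t, q in pairs)  # true negatives
    u = sum(t == cls and q != cls for t, q in pairs)  # under-predictions
    o = sum(t != cls and q == cls for t, q in pairs)  # over-predictions
    denom = math.sqrt((p + u) * (p + o) * (n + u) * (n + o))
    return (p * n - u * o) / denom if denom else 0.0

true = "CCEEEEECCCCCHHHCCC"
pred = "CCEEEECCCCCCHHHHCC"  # toy prediction with two residue-level errors
accuracy = q3(true, pred)
mcc_helix = matthews(true, pred, "H")
```

Q3 treats all residues equally, while the Matthews coefficient also accounts for class imbalance, which is why both are usually reported together.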

52
Local Information
  • PSI-blast profile
  • Position-specific scoring matrices (PSSM)
  • Linear transformation [Kim and Park, 2003]
  • SVM classifier with RBF kernel
  • Feature_1(S_i): prediction score for each residue
    R_i