Mining frequent patterns in protein structures: A study of protease families

Slides: 25
Provided by: bhavend

1
Mining frequent patterns in protein structures: A study of protease families
  • Dr. Charles Yan
  • CS6890 (Section 001) ST Bioinformatics
  • The Machine Learning Approach
  • Presented By Bhavendra Matta

2
Presentation Structure
  • Problem
  • Introduction
  • Method Proposed
  • Results
  • Findings
  • About Authors
  • Questions

3
Problem
  • Mining frequent patterns in protein structures
  • Analysis of protein sequence and structure
    databases usually reveals frequent patterns (FP)
    associated with biological function.
  • Data mining techniques generally consider the
    physicochemical and structural properties of
    amino acids and their microenvironment in the
    folded structures.

4
Important Terminology
  • Frequent Patterns in Protein Structures
  • The primary structure of a protein is the
    sequence of amino acids in the polypeptide chain.
    FP here refers to the frequent patterns found for
    each type of amino acid.
  • Conserved Residue
  • Conserved residues are residues that stay the
    same across the sequences of a multiple sequence
    alignment. They are used to determine structural
    relationships between those sequences.
  • VHAVOYJBIO
  • BHAVJOYBIO
  • OYJVHAVBIO
  • Here BIO is conserved: it is identical in all
    three sequences.
  • Protease
  • Protease refers to a group of enzymes whose
    catalytic function is to break down the peptide
    bonds of proteins.
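
The toy alignment above can be checked column by column. The following is a minimal sketch, assuming the three strings are already aligned and of equal length; the function name `conserved_columns` is made up for this illustration. A strict column check also flags the fourth column (V), which happens to be identical in all three toy sequences, in addition to the BIO segment.

```python
def conserved_columns(sequences):
    """Return (index, residue) for every column where all sequences agree."""
    if len({len(s) for s in sequences}) != 1:
        raise ValueError("aligned sequences must have equal length")
    return [
        (i, column[0])
        for i, column in enumerate(zip(*sequences))
        if len(set(column)) == 1  # one distinct letter means fully conserved
    ]


# The three toy sequences from the slide.
alignment = ["VHAVOYJBIO", "BHAVJOYBIO", "OYJVHAVBIO"]
conserved = conserved_columns(alignment)
print("".join(residue for _, residue in conserved))  # prints VBIO
```

Indices are 0-based, so the BIO segment corresponds to columns 7 to 9.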

5
Important Terminology (continued)
  • Catalytic triad
  • It refers to three amino acid residues found
    inside the active site of certain proteases. In
    chymotrypsin these are Asp 102, His 57, and Ser
    195.
  • Unsupervised Learning
  • It is a method of machine learning in which a
    model is fit to observations without any labeled
    output. Here the unsupervised learning is of the
    cluster-forming type.
  • Microenvironment refers to the local structure
    assumed by residues close in space, but not
    necessarily contiguous along the sequence.
  • There are strong correlations between function
    and microenvironment.

6
Introduction
  • The paper presents a novel unsupervised learning
    approach to discover frequent patterns in
    protein families.
  • FP calculations are based on three types of
    features (with no prior knowledge of functional
    motifs)
  • 1. Biochemical Features
  • 2. Geometric Features
  • 3. Dynamic Features
  • FPs are identified for each amino acid type in
    three protease subfamilies
  • Chymotrypsin subfamily of serine proteases
  • Subtilisin subfamily of serine proteases
  • Papain subfamily of cysteine proteases
  • The catalytic triad residues are distinguished by
    their strong spatial coupling (high
    interconnectivity) to other conserved residues.

7
Introduction (continued)
  • Protein function is associated with particular
    sequence or structure motifs.
  • A few databases and tools for catalytic residues
    are
  • PDB (Protein Data Bank)
  • PROCAT: geometric hashing
  • WebFEATURE: Bayesian network
  • PINTS
  • TRILOGY

8
Method
  • Training Dataset
  • Feature Extraction
  • FP Discovery
  • Conserved Residue Identification.
  • Rank of Conserved Residue.

9
Dataset
  • A set of proteins belonging to a given family is
    selected as the training dataset. Features are
    extracted from all the amino acids in this
    dataset.
  • Two classes of enzymes, serine proteases and
    cysteine proteases are analyzed here.
  • These proteases typically have a catalytic triad
    at the active site.
  • These enzymes are classified into evolutionary
    subfamilies
  • S1: chymotrypsin subfamily of serine proteases
  • S8: subtilisin subfamily of serine proteases
  • C1: papain subfamily of cysteine proteases

10
Feature Extraction
  • Each amino acid is characterized in terms of the
    following features of the residues in its
    microenvironment
  • Dynamic features
  • Biochemical features
  • Geometric features

11
Dynamic features
  • The Gaussian network model (GNM), an elastic
    network model for describing the equilibrium
    dynamics of proteins, is used to characterize
    the dynamic features.
  • In the GNM, the α-carbons (Cα) form the network
    nodes, and nodes located within an interaction
    cut-off distance of 7.0 Å are connected via
    uniform elastic springs.
  • Another structural property with a strong impact
    on equilibrium dynamics is the coordination
    number (CN), defined as the number of amino acids
    (or α-carbons) that coordinate the central amino
    acid within a first interaction shell of 7.0 Å.
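
The network construction described above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' code: the function names and the toy coordinates are made up, and pairs exactly at the cut-off are treated as connected. The connectivity (Kirchhoff) matrix has -1 for every pair of Cα atoms within 7.0 Å and the number of contacts on the diagonal, so the diagonal doubles as the coordination number. The GNM mode shapes would come from an eigen-decomposition of this matrix, which is not shown here.

```python
import math

CUTOFF = 7.0  # Å, the interaction cut-off quoted on the slide


def kirchhoff(ca_coords, cutoff=CUTOFF):
    """Build the GNM connectivity matrix from a list of Cα coordinates."""
    n = len(ca_coords)
    gamma = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(ca_coords[i], ca_coords[j]) <= cutoff:
                gamma[i][j] = gamma[j][i] = -1  # one spring per contact
    for i in range(n):
        gamma[i][i] = -sum(gamma[i])  # diagonal = number of contacts
    return gamma


def coordination_numbers(ca_coords, cutoff=CUTOFF):
    """CN of each residue: neighbours within the first interaction shell."""
    gamma = kirchhoff(ca_coords, cutoff)
    return [gamma[i][i] for i in range(len(gamma))]


# Hypothetical coordinates: four Cα atoms on a line, 3.8 Å apart.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
print(coordination_numbers(coords))  # prints [1, 2, 2, 1]
```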

12
Biochemical features
  • These describe the amino acid type and its
    properties.
  • The classification is based on both the specific
    amino acid identity and its chemical features or
    functional groups.
  • Mining multiple-level association rules.

13
Geometric features
  • It uses a 3D reference frame to define each
    residue, using the three backbone atoms N, Cα and
    C (carbonyl C).
  • It uniquely defines the position and orientation
    of the residue in the 3D space.
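
One common way to build such a frame from the three backbone atoms is sketched below. The axis convention (origin at Cα, x along Cα→C, z normal to the N-Cα-C plane) is an assumption chosen for illustration and may differ from the one used in the paper; all function names and coordinates are hypothetical.

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def unit(v):
    norm = math.sqrt(dot(v, v))
    return tuple(x / norm for x in v)


def residue_frame(n, ca, c):
    """Return (origin, (x, y, z)): a right-handed orthonormal frame at Cα."""
    x = unit(sub(c, ca))               # along the Cα→C bond
    z = unit(cross(x, sub(n, ca)))     # normal to the N-Cα-C plane
    y = cross(z, x)                    # completes the right-handed frame
    return ca, (x, y, z)


def to_local(point, frame):
    """Express a point (e.g. a neighbouring Cα) in the residue's own frame."""
    origin, axes = frame
    d = sub(point, origin)
    return tuple(dot(d, axis) for axis in axes)


# Made-up backbone coordinates in Å.
frame = residue_frame(n=(-0.5, 1.4, 0.0), ca=(0.0, 0.0, 0.0), c=(1.5, 0.0, 0.0))
neighbour_direction = to_local((3.0, 2.0, 1.0), frame)
```

Expressing neighbours in this local frame makes their directions comparable between residues regardless of how the whole protein is positioned in space.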

14
FP Discovery
  • It uses the Apriori algorithm.
  • Algorithm
  • Calculate the occurrence and support of each
    feature to build the initial (length-1) FPs.
  • Discard FPs with support smaller than the
    predefined minimum support.
  • Join the FPs to generate augmented FPs: if the
    length of an FP is x, the augmented FP has
    length x + 1.
  • The minimum support is chosen according to how
    frequent an FP must be in order to be considered.
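
The three steps above can be sketched as a compact Apriori loop. This is a minimal illustration, assuming each residue is represented as a set of discrete feature labels; the labels (`polar`, `cn_high`, `hinge`, and so on) and the 0.5 threshold are invented for the example and do not come from the paper.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {pattern: support} for every pattern meeting min_support."""
    transactions = [frozenset(t) for t in transactions]
    total = len(transactions)

    def support(pattern):
        return sum(pattern <= t for t in transactions) / total

    # Step 1: start from all length-1 patterns.
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    length = 1
    while candidates:
        # Step 2: discard candidates below the minimum support.
        kept = {}
        for pattern in candidates:
            s = support(pattern)
            if s >= min_support:
                kept[pattern] = s
        frequent.update(kept)
        # Step 3: join length-x patterns into length-(x + 1) candidates,
        # keeping only those whose length-x subsets are all frequent.
        length += 1
        joined = {a | b for a, b in combinations(kept, 2) if len(a | b) == length}
        candidates = {
            p for p in joined
            if all(frozenset(s) in kept for s in combinations(p, length - 1))
        }
    return frequent


# Hypothetical feature sets for four residues of one amino acid type.
residues = [
    {"polar", "cn_high", "hinge"},
    {"polar", "cn_high", "hinge"},
    {"polar", "cn_high"},
    {"charged", "cn_low"},
]
frequent = apriori(residues, min_support=0.5)
longest = max(frequent, key=len)
```

With these toy inputs the longest frequent pattern is the three-feature set, supported by two of the four residues.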

15
FP Discovery (continued)

16
Identification of Conserved Residue
  • Applying the Apriori algorithm to the proteins
    reveals the FPs of maximum length.
  • An FP that occurs at least once in every examined
    protein of the subfamily is considered a
    conserved FP.
  • Next, the conserved residues are removed from the
    original dataset, and the Apriori algorithm is
    applied again to the modified dataset.
  • All the conserved patterns of the 20 types of
    amino acids are identified by this iterative
    search for each family.
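
The iterative search can be sketched as a loop that finds the longest conserved pattern, removes the residues that carry it, and repeats. Two assumptions are made here and should be read as such: "conserved" is taken to mean that the pattern occurs in at least one residue of every protein, and a brute-force intersection search stands in for the Apriori call to keep the example short. Protein names, residue labels and feature labels are hypothetical.

```python
from itertools import product


def longest_conserved_pattern(residues):
    """Longest feature set shared by at least one residue in every protein."""
    by_protein = {}
    for protein, _label, features in residues:
        by_protein.setdefault(protein, []).append(frozenset(features))
    best = frozenset()
    # Try one residue per protein and keep the largest common feature set.
    for combo in product(*by_protein.values()):
        shared = frozenset.intersection(*combo)
        if len(shared) > len(best):
            best = shared
    return best


def iterative_conserved_search(residues):
    """Repeat: find the longest conserved pattern, remove its residues."""
    proteins = {protein for protein, _, _ in residues}
    remaining = list(residues)
    found = []
    while {protein for protein, _, _ in remaining} == proteins:
        pattern = longest_conserved_pattern(remaining)
        if not pattern:
            break
        matched = [r[:2] for r in remaining if pattern <= set(r[2])]
        found.append((pattern, matched))
        remaining = [r for r in remaining if not pattern <= set(r[2])]
    return found


# Hypothetical serine residues from two proteins of one subfamily.
data = [
    ("P1", "Ser_a", {"polar", "cn_high", "hinge"}),
    ("P1", "Ser_b", {"polar", "cn_low"}),
    ("P2", "Ser_a", {"polar", "cn_high", "hinge"}),
    ("P2", "Ser_b", {"polar", "cn_low", "surface"}),
]
found = iterative_conserved_search(data)
```

Each pass removes at least one residue from every protein, so the loop always terminates.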

17
Rank of Conserved Residue
  • Once the conserved residues are identified by the
    Apriori algorithm, a ranking method is needed to
    distinguish the catalytic residues.
  • It is assumed that the catalytic residues are
    optimally coupled with other conserved residues
    to achieve the highest cooperativity.
  • The amino acids that show the lowest
    interconnectivity (smallest number of connected
    neighbors) are removed from the list of
    considered residues.
  • The core residues are assigned the score zero,
    and the others are scored according to the number
    of iterations required to reach the core
    residues.
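
The ranking step reads like an iterative peeling of a contact graph, which can be sketched as follows. The details are assumptions made for illustration: two conserved residues count as connected if they are spatial neighbours, the core is taken to be the set left when all remaining residues are equally connected, and a residue's score is the number of peeling rounds between its removal and the core. The residue names are placeholders.

```python
def rank_by_peeling(adjacency):
    """Score residues by how many peeling rounds separate them from the core."""
    remaining = set(adjacency)
    removed_in_round = {}
    rounds = 0
    while remaining:
        # Connectivity counted only among residues still under consideration.
        degree = {r: len(adjacency[r] & remaining) for r in remaining}
        lowest = min(degree.values())
        if lowest == max(degree.values()):
            break  # everyone is equally connected: this is the core
        rounds += 1
        for r in [r for r in remaining if degree[r] == lowest]:
            removed_in_round[r] = rounds
            remaining.discard(r)
    scores = {r: 0 for r in remaining}  # core residues score zero
    for r, rnd in removed_in_round.items():
        scores[r] = rounds - rnd + 1    # rounds still needed to reach the core
    return scores


# Hypothetical contact graph among six conserved residues.
contacts = {
    "A": {"B", "C", "D"},
    "B": {"A", "C", "D"},
    "C": {"A", "B", "D"},
    "D": {"A", "B", "C", "E"},
    "E": {"D", "F"},
    "F": {"E"},
}
scores = rank_by_peeling(contacts)
print(scores["A"], scores["E"], scores["F"])  # prints 0 1 2
```

Under this reading, the lowest scores mark the most tightly coupled residues, which are the candidates for catalytic residues.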

18
Results
  • Consider the serine residues in the serine
    protease family.
  • Information for a set of 111 serine residues is
    extracted from the 5 proteins in S1, and for a
    set of 250 serine residues from the 7 proteins in
    S8.
  • This is consistent with the fact that the
    conservation of the microenvironment and global
    dynamics is a more restrictive (and
    discriminative) feature than sequence
    conservation.
  • Another observation is that amino acids that
    sequentially neighbor the catalytic residues tend
    to be conserved.
  • The present unsupervised learning algorithm
    identified 22, 22 and 26 conserved residues in
    the S1, S8 and C1 subfamilies, respectively.

19
Results (continued)

20
Results (continued)

21
Conclusion
  • A novel unsupervised learning approach to
    discover biologically meaningful FPs in protein
    structures
  • The approach incorporates features associated
    with collective dynamics (GNM slow mode shapes)
    as well as the biochemical (amino acid types and
    physicochemical properties) and geometric (3D
    coordination directions) features in the
    microenvironment.
  • This approach can be used to discover and
    annotate all frequent patterns in the protein
    structure database.
  • It can help to predict structure and function of
    uncharacterized proteins, and identify the
    important amino acids or structural regions.

22
About Authors
  • Ivet Bahar
  • She is currently Chair and Professor of the
    Department of Computational Biology, University
    of Pittsburgh, Pittsburgh.
  • She has more than 21 years of research
    experience.
  • Current Research Areas
  • Characterization of Protein Structural Classes
  • Characterization of Anti-Cancer Agents
  • Conformational Dynamics of Proteins
  • Protein Folding Kinetics

23
About Author
  • Shann-Ching Chen
  • Carnegie Mellon University, Pittsburgh
  • Main focus: machine learning.
  • Current Project Areas
  • Retrieval of 3D Protein and Nucleic Acid
    Structures
  • Multimodal Biometrics

24
Questions???
  • Thank You