Applications of knowledge discovery to molecular biology: Identifying structural regularities in pro - PowerPoint PPT Presentation

1
Applications of knowledge discovery to molecular biology: Identifying structural regularities in proteins
  • Shaobing Su
  • Supervisor: Dr. Lawrence B. Holder
  • Committee: Dr. Diane J. Cook, Dr. Edward Bellion

2
Outline
  • Motivation and goal of the research
  • SUBDUE knowledge discovery system
  • Proteins and PDB
  • Methods and results
  • Discussion and conclusion
  • Future research

3
Motivation and Goal
  • An explosive amount of molecular biology information needs
    to be analyzed to help understand the
    underlying structure-function relationships in
    proteins and other macromolecules.
  • Apply SUBDUE to the Brookhaven Protein
    Data Bank (PDB) to identify biologically
    meaningful patterns

4
SUBDUE knowledge discovery system
  • SUBDUE discovers patterns (substructures)
    in structural data sets
  • SUBDUE represents data as a labeled graph
  • Inputs: vertices and edges
  • Outputs: discovered patterns and instances

5
Example
  • Vertices: objects or attributes
  • Edges: relationships
[Figure: labeled graph with object vertices, shape edges to triangle and square, and an on edge between the objects; 4 instances of the pattern]
6
SUBDUE's search algorithm
  • Minimum Description Length (MDL) principle:
    the best theory to describe a set of data is the
    one that minimizes the description length (DL) of the entire data set
  • DL of the graph: the number of bits necessary
    to completely describe the graph
  • Search for the substructure that results in
    the maximum compression
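The compression idea above can be sketched in a few lines. This is only a rough illustration: the bit-count formula below is a crude stand-in and not SUBDUE's actual graph encoding, and the function names and the example graph sizes are hypothetical.

```python
from math import log2

def description_length(n_vertices, n_edges, n_vertex_labels, n_edge_labels):
    """Crude bit count for a labeled graph (an assumption, not SUBDUE's encoding)."""
    # Bits needed to name one label or one vertex; at least 1 bit each.
    vertex_label_bits = log2(max(n_vertex_labels, 2))
    edge_label_bits = log2(max(n_edge_labels, 2))
    vertex_id_bits = log2(max(n_vertices, 2))
    # Each vertex stores a label; each edge stores two endpoints and a label.
    return (n_vertices * vertex_label_bits
            + n_edges * (2 * vertex_id_bits + edge_label_bits))

def compression_value(dl_graph, dl_substructure, dl_graph_given_substructure):
    """Ratio of the original DL to the DL after compressing with the substructure."""
    return dl_graph / (dl_substructure + dl_graph_given_substructure)

# Hypothetical graph: 4 instances of a 4-vertex, 3-edge pattern plus 3 linking edges.
dl_g = description_length(16, 15, 3, 2)
dl_s = description_length(4, 3, 3, 2)
# After compression each instance collapses to a single vertex.
dl_g_given_s = description_length(4, 3, 1, 1)
value = compression_value(dl_g, dl_s, dl_g_given_s)
```

A value above 1 means the substructure plus the compressed graph takes fewer bits than the original graph, so a search would prefer substructures with higher values.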

7
Inexact graph match approach
  • Find instances with slight distortion:
    insertion, deletion, and substitution of
    edges/vertices.
  • Threshold parameter specifies the amount of distortion
    allowed.
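For intuition, the distortion idea can be illustrated on simple label sequences. The sketch below uses a plain edit distance over lists of labels, which is a simplification and not SUBDUE's inexact graph match routine; the function names, the labels, and the way the threshold is scaled by the larger sequence are assumptions made for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance between two label sequences."""
    prev = list(range(len(b) + 1))
    for i, label_a in enumerate(a, start=1):
        curr = [i]
        for j, label_b in enumerate(b, start=1):
            cost = 0 if label_a == label_b else 1
            curr.append(min(prev[j] + 1,          # delete label_a
                            curr[j - 1] + 1,      # insert label_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def matches(pattern, candidate, threshold):
    """True if the distortion is within threshold * size of the larger sequence."""
    limit = threshold * max(len(pattern), len(candidate))
    return edit_distance(pattern, candidate) <= limit

pattern = ["A", "A", "B", "B", "C"]
candidate = ["D", "A", "B", "C"]   # one substitution and one deletion away
```

With these inputs the distance is 2, so the candidate counts as an instance at a threshold of 0.5 but not at 0.2.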

8
Overview of proteins
  • most important biomolecule
  • composed of 20 amino acids
  • structural hierarchy
  • very diverse structure and function

9
Structural hierarchy in proteins
  • Primary structure (sequence of protein)
  • Secondary structure (helix, sheet, random)
  • Tertiary structure (3-D)

10
Primary Structure of proteins
  • On average 100-150 residues (a.a.) linked head to
    tail
  • N-terminus and C-terminus
  • Peptide bond, alpha-carbon

[Figure: dipeptide backbone from N-terminus to C-terminus: H3N - Cα1(R1) - C(=O) - N(H) - Cα2(R2) - C(=O) - O; the peptide bond is the C - N link between the first a.a. and the second a.a.]
11
Secondary structure elements
  • Ordered backbone arrangement: helix and sheet
  • Helix (0 to 90; average 11 a.a.; several
    types)
  • Sheet (2 to 15 strands per sheet; parallel and
    anti-parallel; average 6 a.a.
    per strand)

12
Tertiary Structure of protein
  • Highly complicated 3-D arrangement
  • Folding of its secondary structure elements

13
Brookhaven Protein Data Bank (PDB)
  • Brookhaven National Laboratory
  • Over 6000 experimentally determined
    3-D structures of biomolecules
  • The majority are protein structures

14
Contents of PDB
  • SEQRES: sequence of a.a. (three-letter code)
  • HELIX: start, end, and type
  • SHEET: start, end, and sense
  • ATOM: (x, y, z) coordinates for each atom
    in the protein

15
Applications of SUBDUE to PDB - Methods and Results
  • July 1997 PDBTM release (6000 PDB)
  • Global data set (4000 PDB)
  • Category data sets: hemoglobin, myoglobin, ribonuclease A

16
Flowchart of Research
[Figure: flowchart with boxes Brookhaven PDB, Preprocessing, Inputs to SUBDUE, Application, Patterns in Category, Patterns in Global / others, Instance mapping, Graphic representation]
17
Preprocessing
  • compile PDB list for each category
  • model.c: extract first model
  • seq.c: extract sequence info and
    convert to graphic format
  • secondary.c: extract secondary structure info
    and convert to graphic format
  • coor.c: extract 3D coordinates and
    convert to graphic format

18
Primary structure and its representation
  • Sample PDB lines
    SEQRES 1 150 ALA ASN LYS THR 1ASH 139
    SEQRES 2 150 LYS SER LEU GLU 1ASH 140
  • Sequence (N-terminus to C-terminus):
    ALA ASN LYS THR
    LYS SER LEU GLU
  • SUBDUE graphic input (ALA ASN)
    v 1 ALA - - - ALA residue
    v 2 ASN - - - ASN residue
    e 1 2 bond - - - a peptide bond between ALA and ASN
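The conversion on this slide can be sketched as follows. This is not the author's seq.c; it is a minimal Python illustration that assumes the simplified whitespace-separated SEQRES layout shown in the sample lines, and the function name is hypothetical.

```python
def seqres_to_subdue(lines):
    """Turn simplified SEQRES lines into SUBDUE-style vertex and edge lines."""
    residues = []
    for line in lines:
        tokens = line.split()
        if not tokens or tokens[0] != "SEQRES":
            continue
        # Skip the record name, serial number, and residue count, then keep
        # only three-letter uppercase codes (drops trailing ID fields).
        residues += [t for t in tokens[3:]
                     if len(t) == 3 and t.isalpha() and t.isupper()]
    # One vertex per residue, numbered from the N-terminus.
    out = [f"v {i} {name}" for i, name in enumerate(residues, start=1)]
    # One "bond" edge per peptide bond between neighbouring residues.
    out += [f"e {i} {i + 1} bond" for i in range(1, len(residues))]
    return out

sample = [
    "SEQRES 1 150 ALA ASN LYS THR 1ASH 139",
    "SEQRES 2 150 LYS SER LEU GLU 1ASH 140",
]
graph_lines = seqres_to_subdue(sample)
```

For the two sample lines this yields 8 vertices and 7 bond edges, starting with `v 1 ALA` and `v 2 ASN`.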

19
Secondary structure and its representation - HELIX
  • Sample PDB lines (starting, ending, type)
    HELIX 1 ASN 1 HIS 13 1
    HELIX 2 ASN 20 ASN 36 1
  • vertex: h_type_length
  • Helix length:
    Hlength = SeqNum(last a.a.) - SeqNum(first a.a.)
  • SUBDUE graphic input
    v 1 h_1_12 - - - helix 1, type 1, length 12
    v 2 h_1_16 - - - helix 2, type 1, length 16

20
Secondary structure and its representation - SHEET
  • Sample PDB lines (sense, length)
    SHEET 1 TYR 284 ILE 286 0
    SHEET 2 HIS 292 THR 294 -1
  • vertex: s_sense_length
  • SUBDUE graphic input
    v 1 s_0_2 - - - strand 1, sense 0, length 2
    v 2 s_-1_2 - - - strand 2, sense -1, length 2

21
Overall secondary structure representation
  • PDB lines
    HELIX 1 THR 3 MET 13 1
    HELIX 2 ASN 24 ASN 34 1
    HELIX 3 SER 50 GLN 60 1
    SHEET 1 LYS 41 HIS 48 0
    SHEET 2 MET 79 THR 87 -1
  • SUBDUE graphic input (vertices ordered by position in the sequence)
    v 1 h_1_10
    v 2 h_1_10   e 1 2 sh
    v 3 s_0_7    e 2 3 sh
    v 4 h_1_10   e 3 4 sh
    v 5 s_-1_8   e 4 5 sh
  • sequential relationship is represented as edge
    sh
  • Visualization
[Figure: chain of secondary structure elements drawn from N-terminus to C-terminus]
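A minimal sketch of this step, assuming the simplified whitespace-separated HELIX and SHEET lines shown on the slide (real PDB records are fixed-column and carry more fields). It is not the author's secondary.c, and the function name is hypothetical. Elements are sorted by their starting sequence number so that the sh edges follow the chain.

```python
def secondary_to_subdue(lines):
    """Turn simplified HELIX/SHEET lines into SUBDUE-style vertex and edge lines."""
    elements = []
    for line in lines:
        t = line.split()
        if not t or t[0] not in ("HELIX", "SHEET"):
            continue
        # Assumed token order: record, serial, first a.a., first SeqNum,
        # last a.a., last SeqNum, then helix type or strand sense.
        start, end, kind = int(t[3]), int(t[5]), t[6]
        prefix = "h" if t[0] == "HELIX" else "s"
        # Length = SeqNum(last a.a.) - SeqNum(first a.a.), as defined in the deck.
        elements.append((start, f"{prefix}_{kind}_{end - start}"))
    elements.sort(key=lambda item: item[0])  # order along the chain
    out = [f"v {i} {label}" for i, (_, label) in enumerate(elements, start=1)]
    out += [f"e {i} {i + 1} sh" for i in range(1, len(elements))]
    return out

sample = [
    "HELIX 1 THR 3 MET 13 1",
    "HELIX 2 ASN 24 ASN 34 1",
    "HELIX 3 SER 50 GLN 60 1",
    "SHEET 1 LYS 41 HIS 48 0",
    "SHEET 2 MET 79 THR 87 -1",
]
graph_lines = secondary_to_subdue(sample)
```

Sorting by start position is what places the strand at residues 41 to 48 (s_0_7) between the second and third helices, matching the vertex order on the slide.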
22
Tertiary structure and its representation
  • Sample PDB lines (last three columns: X, Y, Z)
    ATOM CA ALA 1 10.369 0.997 10.519
    ATOM CA ASN 2 6.691 0.239 9.830
  • vertex: backbone carbon;
    edge: distance (vs, s)
  • Distance (Å)
    $\text{distance} = \left((x_2-x_1)^2 + (y_2-y_1)^2 + (z_2-z_1)^2\right)^{1/2}$
  • v 1 CA_ALA
    v 2 CA_ASN
    e 1 2 vs - - - very short distance
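As a worked check of the distance rule, the sketch below computes the Euclidean distance between the two sample alpha-carbons and maps it to an edge label. The cutoffs (vs up to 4 Å, s up to 6 Å) follow the deck's later tertiary structure slide; reading the garbled comparison symbols there as "less than or equal" and adding no edge beyond 6 Å are assumptions, as is the function name.

```python
from math import dist

def distance_label(p, q, very_short=4.0, short=6.0):
    """Map the distance between two (x, y, z) points to an edge label."""
    d = dist(p, q)  # Euclidean distance in Å
    if d <= very_short:
        return "vs"   # very short distance
    if d <= short:
        return "s"    # short distance
    return None       # assumed: no edge for longer distances

ca_ala = (10.369, 0.997, 10.519)
ca_asn = (6.691, 0.239, 9.830)
d = dist(ca_ala, ca_asn)
label = distance_label(ca_ala, ca_asn)
```

The two sample atoms are about 3.82 Å apart, which is why the slide connects them with a vs edge.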

23
Rationale for representation choice - Criteria
  • Patterns identified by SUBDUE must be
    representative of each category
  • Patterns discovered by SUBDUE should discriminate
    one category from others

24
Primary sequence
  • vertex - a.a. residue name
  • edge - peptide bond

[Figure: chain ARG - bond - GLU - bond - ALA]
v 1 ARG
v 2 GLU
v 3 ALA
e 1 2 bond
e 2 3 bond
25
Secondary structure elements
  • Type of the helix
  • starting and ending points (a.a. name and seq
    number)

[Figure: Helix 1 with type 1 and length 12, starting at ASN and ending at HIS, drawn from N-terminus to C-terminus]
26
Other ways of representing helix
  • Separate type and length
  • combine type and length

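[Figure: separate representation, vertex Helix 1 with attributes type 1 and length 12; combined representation, single vertex Helix_1_12]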
27
Tertiary structure
  • (x, y, z) coordinates vary with different origin
    choice
  • avoid numeric values; use vs (≤ 4 Å), s (4 Å <
    dist ≤ 6 Å)

[Figure: two backbone carbons, C1 at (x, y, z) = (10.4, 1.0, 10.5) and C2 at (6.7, 0.2, 9.8), joined by a vs edge]
28
Results - Primary structure patterns
Hemo_seq (63/65)
Hemo_sequence: THR LYS THR TYR PHE PRO HIS PHE
ASP LEU SER HIS GLY SER ALA GLN VAL LYS GLY HIS
GLY LYS LYS VAL ALA ASP ALA LEU THR ASN ALA VAL
ALA HIS VAL ASP ASP MET PRO ASN ALA LEU SER ALA
LEU SER THR LEU ALA ALA HIS LEU PRO ALA GLU PHE
THR PRO ALA VAL HIS ALA SER LEU ASP LYS PHE LEU
ALA SER VAL SER THR VAL LEU THR SER LYS TYR
Myo_seq (67/103)
Myoglo_sequence: VAL LEU SER GLU GLY GLU TRP GLN
LEU VAL LEU HIS VAL TRP ALA LYS VAL GLU ALA ASP
VAL ALA GLY HIS GLY GLN ASP ILE LEU ILE ARG LEU
PHE LYS SER HIS PRO GLU THR LEU GLU LYS PHE ASP
ARG
Ribo_A (59/68)
Ribonuclease_A_sequence: GLY GLN THR ASN CYS TYR
GLN SER TYR SER THR MET SER ILE THR ASP CYS ARG
GLU THR GLY SER SER LYS TYR PRO ASN CYS ALA TYR
LYS THR THR GLN ALA ASN LYS HIS ILE ILE VAL ALA
CYS GLU GLY ASN PRO TYR VAL PRO VAL HIS PHE ASP
ALA SER VAL
29
Primary structure patterns
  • Unique to each sample category
  • hemoglobin and myoglobin proteins
    share little sequence similarity

30
Results - Hemo secondary structure patterns
1 h_1_14 -> h_1_15 -> h_1_6 -> h_1_6 -> h_1_19
-> h_1_8 -> h_1_18 -> h_1_20
7 h_1_15 -> h_1_15 -> h_1_6 -> h_1_1 -> h_1_19
-> h_1_8 -> h_1_18 -> h_1_20
31
Results - Myo secondary structure patterns
1 h_1_15 -> h_1_15 -> h_1_6 -> h_1_6 -> h_1_19
-> h_1_9 -> h_1_18 -> h_1_25
32
Results - Ribo_A secondary structure patterns
1 h_1_10 -> h_1_10 -> s_0_7 -> s_0_7 -> h_1_10
-> s_0_3 -> s_0_3 -> s_-1_4 -> s_-1_4 -> s_-1_8
-> s_-1_1 -> s_-1_10 -> s_-1_10 -> s_-1_8 ->
s_-1_8 -> s_-1_5 -> s_-1_3
10 h_1_10 -> h_1_10 -> s_0_7 -> h_1_10 -> s_0_3
-> s_-1_4 -> s_-1_8 -> s_-1_8 -> s_-1_6
33
Results - Tertiary structural patterns
  • SUBDUE finds small patterns (2 or 3 a.a.)
  • not unique for each category of proteins
  • not biologically meaningful

34
Visualization of secondary structure patterns - hemoglobin
[Figure: complete hemoglobin structure beside 2 instances of the pattern, with N-terminus and C-terminus labeled]
35
Visualization of secondary structure patterns - myoglobin
[Figure: complete myoglobin structure beside 1 instance of the pattern, with N-terminus and C-terminus labeled]
36
Visualization of secondary structure patterns - ribonuclease_A
[Figure: complete ribonuclease_A structure beside 1 instance of the pattern, with N-terminus and C-terminus labeled]
37
Discussion - Hemoglobin
  • Hemoglobin: A, B, C, D chains
  • Two types of patterns identified by SUBDUE: one
    for A, C chains, the other for B, D chains
  • Patterns exist in a majority of hemoglobin
    proteins
  • No instances of the best hemoglobin pattern found
    in other proteins in the global data set

38
Occurrence of hemo patterns
39
Occurrence of hemo patterns -continued
40
Discussion - Myoglobin
  • Myoglobin: one chain
  • One dominant pattern identified by SUBDUE
  • Patterns exist in most myoglobin proteins
  • No instances of the best myoglobin pattern
    found in other proteins in the global data set

41
Discussion - Hemoglobin and Myoglobin
  • Similar secondary structure patterns

Hemoglobin B, D chains (from N- to C-terminus):
h_1_14 -> h_1_15 -> h_1_6 -> h_1_6 ->
h_1_19 -> h_1_8 -> h_1_18 -> h_1_20
Myoglobin chain (from N- to C-terminus):
h_1_15 -> h_1_15 -> h_1_6 -> h_1_6 -> h_1_19 -> h_1_9 ->
h_1_18 -> h_1_25
Hemoglobin A, C chains (from N- to C-terminus):
h_1_15 -> h_1_15 -> h_1_6 -> h_1_1
-> h_1_19 -> h_1_8 -> h_1_18 -> h_1_20
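To make the similarity concrete, the three patterns on this slide can be compared helix by helix. The sketch below is a hypothetical helper, not part of SUBDUE; it assumes the h_type_length label format used in the deck and reports the positions (counted from the N-terminus) where helix lengths differ.

```python
hemo_bd = ["h_1_14", "h_1_15", "h_1_6", "h_1_6", "h_1_19", "h_1_8", "h_1_18", "h_1_20"]
hemo_ac = ["h_1_15", "h_1_15", "h_1_6", "h_1_1", "h_1_19", "h_1_8", "h_1_18", "h_1_20"]
myo     = ["h_1_15", "h_1_15", "h_1_6", "h_1_6", "h_1_19", "h_1_9", "h_1_18", "h_1_25"]

def helix_length(label):
    """Extract the length from an h_type_length vertex label."""
    return int(label.rsplit("_", 1)[1])

def length_differences(a, b):
    """Return (position, length_a, length_b) wherever helix lengths differ."""
    diffs = []
    for position, (x, y) in enumerate(zip(a, b), start=1):
        len_x, len_y = helix_length(x), helix_length(y)
        if len_x != len_y:
            diffs.append((position, len_x, len_y))
    return diffs

bd_vs_myo = length_differences(hemo_bd, myo)
ac_vs_myo = length_differences(hemo_ac, myo)
```

Both hemoglobin patterns differ from the myoglobin pattern in only three of eight helices, including a shorter last helix (20 versus 25); the A, C pattern also has a fourth helix of length 1 where myoglobin has length 6.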
42
Discussion-Hemoglobin and Myoglobin
  • Consistent with the genetic studies
  • Hemoglobin and myoglobin share one ancestral gene
  • Divergence occurred in the course of evolution.
    One copy of gene for myoglobin, four copies for
    hemoglobin.
  • The last helix of hemoglobin is shorter. One
    of the helices in the hemoglobin A, C chains almost
    disappears, which allows conformational change


43
Discussion - ribonuclease A proteins
  • All patterns have three helices of the same size
  • Several strands appear twice, indicating
    participation in the formation of two sheets.
  • Ribonuclease S protein (S-protein fragment) also
    has the pattern.

44
Conclusion of the results
  • Secondary structure patterns discovered by SUBDUE
    are representative of each category
  • Secondary structure patterns discovered by SUBDUE
    are distinct for each category
  • SUBDUE has the ability to discover biologically
    interesting patterns from PDB and other similar
    molecular biology (MB) databases

45
Comparison with other related studies
  • Different graphic representation
  • predefined patterns with exact or inexact graph
    match
  • Not applied systematically to PDB or other DB
  • SUBDUE would perform a similar task if the inexact
    graph match routine is incorporated

46
Conclusions of the study
  • Abstraction over 3D structure to its secondary
    structural elements is suitable for discovery
  • The secondary structure patterns SUBDUE discovered
    for each category can be used as a signature
    for its class
  • Inexact graph match is useful for finding similar
    patterns
  • SUBDUE is suitable for knowledge discovery in MB
    structural DB

47
Future Research
  • More consistent and detailed description of
    secondary structure
  • Add relative positions of the secondary
    structural elements to represent spatial
    relationship
  • Investigate alternative representations: a more
    suitable 3D coordinate representation, weighting
    on different edges
  • Inexact graph match in predefined substructure
  • More collaboration with domain scientists