Applications of knowledge discovery to molecular biology: Identifying structural regularities in proteins

1
Applications of knowledge discovery to molecular
biology: Identifying structural regularities in
proteins

Shaobing Su
Supervisor: Dr. Lawrence B. Holder
Committee: Dr. Diane J. Cook, Dr. Edward Bellion

2
Outline

Motivation and goal of the research
SUBDUE knowledge discovery system
Proteins and PDB
Methods and results
Discussion and conclusion
Future research

3
Motivation and Goal

The rapidly growing amount of molecular biology
information needs to be analyzed to help understand
the underlying structure-function relationships in
proteins and other macromolecules.
Apply SUBDUE to the Brookhaven Protein
Data Bank (PDB) to identify biologically
meaningful patterns

4
SUBDUE knowledge discovery system

SUBDUE discovers patterns (substructures)
in structural data sets
SUBDUE represents data as a labeled graph
Inputs: vertices and edges
Outputs: discovered patterns and instances

5
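Example

Vertices: objects or attributes
Edges: relationships

[Figure: example graph with two object vertices, one with a shape edge to triangle and one with a shape edge to square, linked by an on edge; the slide highlights 4 instances of this substructure]
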
6
SUBDUE's search algorithm

Minimum Description Length (MDL) principle:
the best theory to describe a set of data is the
one that minimizes the description length (DL) of the entire data set
DL of the graph: the number of bits necessary
to completely describe the graph
Search for the substructure that results in
the maximum compression
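
As a rough illustration of the compression idea, the sketch below scores a substructure by graph size (vertices plus edges) rather than by bits. This is a simplification assumed for the example, not SUBDUE's actual MDL encoding. The function names and the sample numbers are hypothetical, and instances are assumed not to overlap.

```python
def graph_size(num_vertices: int, num_edges: int) -> int:
    """Crude stand-in for description length: count vertices and edges."""
    return num_vertices + num_edges


def compression_value(graph_v: int, graph_e: int,
                      sub_v: int, sub_e: int,
                      instances: int) -> float:
    """Return size(G) / (size(S) + size(G compressed by S)).

    Assumption: each non-overlapping instance of the substructure is
    replaced by a single placeholder vertex. Larger values mean the
    substructure compresses the graph better.
    """
    g = graph_size(graph_v, graph_e)
    s = graph_size(sub_v, sub_e)
    compressed = g - instances * s + instances
    return g / (s + compressed)


# Hypothetical input graph: 20 vertices, 19 edges.
# Candidate A: 3 vertices, 2 edges, 4 instances.
# Candidate B: 2 vertices, 1 edge, 4 instances.
value_a = compression_value(20, 19, 3, 2, 4)
value_b = compression_value(20, 19, 2, 1, 4)
best = "A" if value_a > value_b else "B"
print(best, round(value_a, 3), round(value_b, 3))
```

Under these assumptions, the larger substructure wins when both candidates occur equally often, because each instance removes more of the graph.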

7
Inexact graph match approach

Find instances with slight distortions:
insertion, deletion, and substitution of
edges/vertices.
The threshold parameter specifies the amount of distortion
allowed.
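
For graphs that are simple chains of labeled vertices, the distortion idea can be approximated with a sequence edit distance. The sketch below is only an analogy, not SUBDUE's inexact graph match routine: it counts insertions, deletions, and substitutions of labels, and it assumes the threshold is a fraction of the larger chain's length. The two label chains are the hemoglobin patterns reported in the results slides of this presentation; the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two label sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            substitution = prev[j - 1] + (0 if x == y else 1)
            curr.append(min(prev[j] + 1,      # deletion
                            curr[j - 1] + 1,  # insertion
                            substitution))
        prev = curr
    return prev[-1]


def inexact_match(pattern, candidate, threshold):
    """Accept the candidate if its distortion is within the allowance.

    Assumption: the allowance is threshold * size of the larger chain.
    """
    allowed = threshold * max(len(pattern), len(candidate))
    return edit_distance(pattern, candidate) <= allowed


hemo_bd = ["h_1_14", "h_1_15", "h_1_6", "h_1_6",
           "h_1_19", "h_1_8", "h_1_18", "h_1_20"]
hemo_ac = ["h_1_15", "h_1_15", "h_1_6", "h_1_1",
           "h_1_19", "h_1_8", "h_1_18", "h_1_20"]

print(edit_distance(hemo_bd, hemo_ac))        # two substituted labels
print(inexact_match(hemo_bd, hemo_ac, 0.25))  # allowance of 2 edits
print(inexact_match(hemo_bd, hemo_ac, 0.10))  # allowance below 1 edit
```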

8
Overview of proteins

most important biomolecule
composed of 20 amino acids
structural hierarchy
very diverse structure and function

9
Structural hierarchy in proteins

Primary structure (sequence of protein)
Secondary structure (helix, sheet, random)
Tertiary structure (3-D)

10
Primary Structure of proteins

Average 100-150 residues (a.a.) linked head to
tail
N-terminus and C-terminus
Peptide bond, alpha-carbon

[Figure: dipeptide backbone drawn from N-terminus to C-terminus, H3N - Cα1(R1) - C(=O) - N(H) - Cα2(R2) - C(=O) - O(-), with the peptide bond joining the first a.a. and the second a.a.]

11
Secondary structure elements

Ordered backbone arrangement: helix and sheet
Helix (0 to 90 a.a.; average 11 a.a.; several
types)
Sheet (2 to 15 strands per sheet; parallel and
anti-parallel; average 6 a.a.
per strand)

12
Tertiary Structure of protein

Highly complicated 3-D arrangement
Folding of its secondary structure elements

13
Brookhaven Protein Data Bank (PDB)

Brookhaven National Laboratory
Over 6000 experimentally determined
3-D structures of biomolecules
The majority are protein structures

14
Contents of PDB

SEQRES: sequence of a.a. (three-letter code)
HELIX: start, end, and type
SHEET: start, end, and sense
ATOM: (x, y, z) coordinates for each atom
in the protein

15
Applications of SUBDUE to PDB: Methods and
Results

July 1997 PDB release (6000 PDB entries)
Global data set (4000 PDB entries)
Category data sets:
Hemoglobin
Myoglobin
Ribonuclease A

16
Flowchart of Research

[Flowchart. Preprocessing: Brookhaven PDB, graphic representation, inputs to SUBDUE. Application: patterns in category, instance mapping, patterns in global and other data sets.]

17
Preprocessing

compile PDB list for each category
model.c: extract first model
seq.c: extract sequence info and
convert to graphic format
secondary.c: extract secondary structure info
and convert to graphic format
coor.c: extract 3D coordinates and
convert to graphic format

18
Primary structure and its representation

Sample PDB lines

SEQRES 1 150 ALA ASN LYS THR 1ASH 139
SEQRES 2 150 LYS SER LEU GLU 1ASH 140

Sequence (N-terminus to C-terminus):
ALA ASN LYS THR LYS SER LEU GLU

SUBDUE graphic input (ALA ASN)

v 1 ALA - - - ALA residue
v 2 ASN - - - ASN residue
e 1 2 bond - - - a peptide bond between ALA and ASN
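
The conversion from SEQRES lines to vertex and edge lines can be sketched in a few lines of Python. This is not the author's seq.c. It assumes the simplified, whitespace-separated SEQRES lines shown on this slide (real PDB files use fixed columns), keeps only tokens that are standard three-letter residue codes, and uses hypothetical function names.

```python
STANDARD_RESIDUES = {
    "ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
    "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL",
}


def residues_from_seqres(lines):
    """Collect residue names, in order, from simplified SEQRES lines."""
    residues = []
    for line in lines:
        tokens = line.split()
        if not tokens or tokens[0] != "SEQRES":
            continue
        # Serial numbers, counts, and the trailing entry ID are skipped
        # because they are not standard residue codes.
        residues.extend(t for t in tokens[1:] if t in STANDARD_RESIDUES)
    return residues


def chain_to_graph_lines(labels, edge_label):
    """Emit 'v' lines for each label and 'e' lines linking neighbors."""
    lines = [f"v {i} {label}" for i, label in enumerate(labels, start=1)]
    lines += [f"e {i} {i + 1} {edge_label}" for i in range(1, len(labels))]
    return lines


sample = [
    "SEQRES 1 150 ALA ASN LYS THR 1ASH 139",
    "SEQRES 2 150 LYS SER LEU GLU 1ASH 140",
]
sequence = residues_from_seqres(sample)
for out_line in chain_to_graph_lines(sequence[:2], "bond"):
    print(out_line)
```

Filtering by residue code is a convenience for the simplified lines here; a parser for real PDB files would read fixed column ranges instead.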

19
Secondary structure and its representation: HELIX

Sample PDB lines (starting, ending, type)

HELIX 1 ASN 1 HIS 13 1
HELIX 2 ASN 20 ASN 36 1

vertex: h_type_length
Helix length: Hlength = SeqNum(last a.a.) - SeqNum(first a.a.)

SUBDUE graphic input
v 1 h_1_12 - - - helix 1, type 1, length 12
v 2 h_1_16 - - - helix 2, type 1, length 16
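
A worked version of the helix label rule, as a minimal sketch. It assumes the simplified HELIX line layout shown on this slide (serial number, first residue name and sequence number, last residue name and sequence number, type); the function name is hypothetical.

```python
def helix_label(line):
    """Build the h_type_length vertex label from a simplified HELIX line."""
    record, _serial, _first_res, first_seq, _last_res, last_seq, helix_type = line.split()
    if record != "HELIX":
        raise ValueError("expected a HELIX line")
    # Hlength = SeqNum(last a.a.) - SeqNum(first a.a.)
    length = int(last_seq) - int(first_seq)
    return f"h_{helix_type}_{length}"


print(helix_label("HELIX 1 ASN 1 HIS 13 1"))    # 13 - 1 = 12
print(helix_label("HELIX 2 ASN 20 ASN 36 1"))   # 36 - 20 = 16
```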

20
Secondary structure and its representation: SHEET

Sample PDB lines (sense, length)

SHEET 1 TYR 284 ILE 286 0
SHEET 2 HIS 292 THR 294 -1

vertex: s_sense_length

SUBDUE graphic input
v 1 s_0_2 - - - strand 1, sense 0, length 2
v 2 s_-1_2 - - - strand 2, sense -1, length 2

21
Overall secondary structure representation

PDB lines

HELIX 1 THR 3 MET 13 1
HELIX 2 ASN 24 ASN 34 1
HELIX 3 SER 50 GLN 60 1
SHEET 1 LYS 41 HIS 48 0
SHEET 2 MET 79 THR 87 -1

SUBDUE graphic input (elements in sequence order)

v 1 h_1_10
v 2 h_1_10
v 3 s_0_7
v 4 h_1_10
v 5 s_-1_8
e 1 2 sh
e 2 3 sh
e 3 4 sh
e 4 5 sh

The sequential relationship is represented as edge sh.

[Visualization: chain of the five elements drawn from N-terminus to C-terminus]
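
The ordering step on this slide, where helices and strands are sorted by position and linked with sh edges, can be sketched as follows. This is an illustration rather than the author's secondary.c: it assumes the simplified HELIX and SHEET lines shown here, orders elements by the sequence number of their first residue, and uses hypothetical function names.

```python
PREFIX = {"HELIX": "h", "SHEET": "s"}


def element(line):
    """Return (start position, vertex label) for a HELIX or SHEET line."""
    record, _serial, _first_res, first_seq, _last_res, last_seq, kind = line.split()
    start, end = int(first_seq), int(last_seq)
    # kind is the helix type for HELIX lines and the sense for SHEET lines.
    return start, f"{PREFIX[record]}_{kind}_{end - start}"


def secondary_structure_graph(lines):
    """Emit vertex lines in sequence order plus 'sh' edges between neighbors."""
    labels = [label for _, label in sorted(element(line) for line in lines)]
    out = [f"v {i} {label}" for i, label in enumerate(labels, start=1)]
    out += [f"e {i} {i + 1} sh" for i in range(1, len(labels))]
    return out


pdb_lines = [
    "HELIX 1 THR 3 MET 13 1",
    "HELIX 2 ASN 24 ASN 34 1",
    "HELIX 3 SER 50 GLN 60 1",
    "SHEET 1 LYS 41 HIS 48 0",
    "SHEET 2 MET 79 THR 87 -1",
]
graph_lines = secondary_structure_graph(pdb_lines)
print("\n".join(graph_lines))
```

Sorting by start position is what moves the first strand (residues 41 to 48) ahead of the third helix (residues 50 to 60), even though HELIX records come first in the file.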
22
Tertiary structure and its representation

Sample PDB lines (x, y, z coordinates in the last three columns)

ATOM CA ALA 1 10.369 0.997 10.519
ATOM CA ASN 2 6.691 0.239 9.830

vertex: backbone carbon
edge: distance (vs, s)
Distance (Å): $d = \sqrt{(x_2-x_1)^2 + (y_2-y_1)^2 + (z_2-z_1)^2}$

v 1 CA_ALA
v 2 CA_ASN
e 1 2 vs - - - very short distance
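
The distance calculation and labeling can be checked with a short sketch. The 4 Å and 6 Å cutoffs are taken from the later "Tertiary structure" slide of this presentation. Treating both cutoffs as inclusive upper bounds, and emitting no edge beyond 6 Å, are assumptions made for the example; the function names are hypothetical.

```python
import math


def ca_distance(p, q):
    """Euclidean distance between two (x, y, z) points, in angstroms."""
    return math.sqrt(sum((b - a) ** 2 for a, b in zip(p, q)))


def distance_label(d):
    """Map a distance to an edge label: vs (very short) or s (short)."""
    if d <= 4.0:
        return "vs"
    if d <= 6.0:
        return "s"
    return None  # assumption: longer distances get no edge


ca_ala = (10.369, 0.997, 10.519)
ca_asn = (6.691, 0.239, 9.830)

d = ca_distance(ca_ala, ca_asn)
print(f"{d:.2f}", distance_label(d))
```

For the two sample atoms the distance is about 3.8 Å, which falls in the vs bin and agrees with the edge on the slide.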

23
Rationale for representation choice: Criteria

Patterns identified by SUBDUE must be
representative of each category
Patterns discovered by SUBDUE should discriminate
one category from the others

24
Primary sequence

vertex - a.a. residue name
edge - peptide bond

[Figure: ARG - bond - GLU - bond - ALA]
v 1 ARG
v 2 GLU
v 3 ALA
e 1 2 bond
e 2 3 bond

25
Secondary structure elements

Type of the helix
Starting and ending points (a.a. name and seq
number)

[Figure: vertex Helix 1 with attributes type = 1, length = 12, starts = ASN, ends = HIS, drawn from N-terminus to C-terminus]

26
Other ways of representing helix

Separate type and length: vertex Helix 1 with type = 1 and length = 12
Combined type and length: single vertex Helix_1_12

27
Tertiary structure

(x, y, z) coordinates vary with the choice of
origin
Avoid numeric values; use vs (≤ 4 Å) and s (4 Å <
dist ≤ 6 Å)

[Figure: C1 (x 10.4, y 1.0, z 10.5) and C2 (x 6.7, y 0.2, z 9.8) joined by a vs edge]

28
Results: Primary structure patterns
Hemo_seq (63/65)
Hemo_sequence THR LYS THR TYR PHE PRO HIS PHE
ASP LEU SER HIS GLY SER ALA GLN VAL LYS GLY HIS
GLY LYS LYS VAL ALA ASP ALA LEU THR ASN ALA VAL
ALA HIS VAL ASP ASP MET PRO ASN ALA LEU SER ALA
LEU SER THR LEU ALA ALA HIS LEU PRO LAL GLU PHE
THR PRO ALA VAL HIS ALA SER LEU ASP LYS PHE LEU
ALA SER VAL SER THR VAL LEU THR SER LYS TYR
Myo_seq (67/103)
Myoglo_sequence VAL LEU SER GLU GLY GLU TRP GLN
LEU VAL LEU HIS VAL TRP ALA LYS VAL GLU ALA ASP
VAL ALA GLY HIS GLY GLN ASP ILE LEU ILE ARG LEU
PHE LYS SER HIS PRO GLU THR LEU GLU LYS PHE ASP
ARG
Ribo_A (59/68)
Ribonuclease_A_sequence GLY GLN THR ASN CYS TYR
GLN SER TYR SER THR MET SER ILE THR ASP CYS ARG
GLU THR GLY SER SER LYS TYR PRO ASN CYS ALA TYR
LYS THR THR GLN ALA ASN LYS HIS ILE ILE VAL ALA
CYS GLU GLY ASN PRO TYR VAL PRO VAL HIS PHE ASP
ALA SER VAL
29
Primary structure patterns

Unique to each sample category
hemoglobin and myoglobin proteins
share little sequence similarity

30
Results: Hemo secondary structure patterns
1 h_1_14 -> h_1_15 -> h_1_6 -> h_1_6 -> h_1_19
-> h_1_8 -> h_1_18 -> h_1_20
7 h_1_15 -> h_1_15 -> h_1_6 -> h_1_1 -> h_1_19
-> h_1_8 -> h_1_18 -> h_1_20
31
Results: Myo secondary structure patterns
1 h_1_15 -> h_1_15 -> h_1_6 -> h_1_6 -> h_1_19
-> h_1_9 -> h_1_18 -> h_1_25
32
Results: Ribo_A secondary structure patterns
1 h_1_10 -> h_1_10 -> s_0_7 -> s_0_7 -> h_1_10
-> s_0_3 -> s_0_3 -> s_-1_4 -> s_-1_4 -> s_-1_8
-> s_-1_1 -> s_-1_10 -> s_-1_10 -> s_-1_8 ->
s_-1_8 -> s_-1_5 -> s_-1_3
10 h_1_10 -> h_1_10 -> s_0_7 -> h_1_10 -> s_0_3
-> s_-1_4 -> s_-1_8 -> s_-1_8 -> s_-1_6
33
Results: Tertiary structural patterns

SUBDUE finds small patterns (2 or 3 a.a.)
not unique for each category of proteins
not biologically meaningful

34
Visualization of secondary structure patterns:
hemoglobin
[Figure: complete hemoglobin structure and 2 instances of the pattern, drawn from N-terminus to C-terminus]

35
Visualization of secondary structure patterns:
myoglobin
[Figure: complete myoglobin structure and 1 instance of the pattern, drawn from N-terminus to C-terminus]

36
Visualization of secondary structure patterns:
ribonuclease A
[Figure: complete ribonuclease A structure and 1 instance of the pattern, drawn from N-terminus to C-terminus]

37
Discussion: Hemoglobin

Hemoglobin has A, B, C, and D chains
Two types of patterns identified by SUBDUE: one
for the A, C chains, the other for the B, D chains
The patterns exist in a majority of hemoglobin
proteins
No instances of the best hemoglobin pattern were found
in other proteins in the global data set

38
Occurrence of hemo patterns
39
Occurrence of hemo patterns -continued
40
Discussion: Myoglobin

Myoglobin has one chain
One dominant pattern identified by SUBDUE
The pattern exists in most myoglobin proteins
No instances of the best myoglobin pattern were
found in other proteins in the global data set

41
Discussion: Hemoglobin and Myoglobin

Similar secondary structure patterns

Hemoglobin B, D chains (from N- to C-terminus):
h_1_14 -> h_1_15 -> h_1_6 -> h_1_6 -> h_1_19 -> h_1_8 -> h_1_18 -> h_1_20
Myoglobin chain (from N- to C-terminus):
h_1_15 -> h_1_15 -> h_1_6 -> h_1_6 -> h_1_19 -> h_1_9 -> h_1_18 -> h_1_25
Hemoglobin A, C chains (from N- to C-terminus):
h_1_15 -> h_1_15 -> h_1_6 -> h_1_1 -> h_1_19 -> h_1_8 -> h_1_18 -> h_1_20

42
Discussion: Hemoglobin and Myoglobin

Consistent with genetic studies
Hemoglobin and myoglobin share one ancestral gene
Divergence occurred in the course of evolution.
One copy of the gene for myoglobin, four copies for
hemoglobin.
The last helix of hemoglobin is shorter. One
of the helices in the hemoglobin A, C chains almost
disappears, allowing conformational change

43
Discussion: Ribonuclease A proteins

All patterns have three helices of the same size
Several strands appear twice, indicating
participation in the formation of two sheets.
Ribonuclease S protein (S-protein fragment) also
has the pattern.

44
Conclusion of the results

Secondary structure patterns discovered by SUBDUE
are representative of each category
Secondary structure patterns discovered by SUBDUE
are distinct for each category
SUBDUE has the ability to discover biologically
interesting patterns from PDB and other similar
molecular biology (MB) databases

45
Comparison with other related studies

Different graphic representation
predefined patterns with exact or inexact graph
match
Not applied systematically to PDB or other DB
SUBDUE would perform a similar task if the inexact
graph match routine is incorporated

46
Conclusions of the study

Abstraction from 3D structure to its secondary
structural elements is suitable for discovery
The secondary structure patterns SUBDUE discovered
for each category can be used as a signature
for its class
Inexact graph match is useful for finding similar
patterns
SUBDUE is suitable for knowledge discovery in MB
structural databases

47
Future Research

More consistent and detailed description of
secondary structure
Add relative positions of the secondary
structural elements to represent spatial
relationships
Investigate alternative representations: a more
suitable 3D coordinate representation, weighting
on different edges
Inexact graph match for predefined substructures
More collaboration with domain scientists