Title: Applications of knowledge discovery to molecular biology: Identifying structural regularities in pro
1Applications of knowledge discovery to molecular biology: Identifying structural regularities in proteins
- Shaobing Su
- Supervisor: Dr. Lawrence B. Holder
- Committee: Dr. Diane J. Cook, Dr. Edward Bellion
2Outline
- Motivation and goal of the research
- SUBDUE knowledge discovery system
- Proteins and PDB
- Methods and results
- Discussion and conclusion
- Future research
3Motivation and Goal
- An explosive amount of molecular biology information needs to be analyzed to help understand the underlying structure-function relationships in proteins and other macromolecules
- Apply SUBDUE to the Brookhaven Protein Data Bank (PDB) to identify biologically meaningful patterns
4SUBDUE knowledge discovery system
- SUBDUE discovers patterns (substructures) in structural data sets
- SUBDUE represents data as a labeled graph
- Inputs: vertices and edges
- Outputs: discovered patterns and instances
5Example
- Vertices: objects or attributes
- Edges: relationships
[Figure: example graph and discovered substructure. Vertex labels: object, triangle, square. Edge labels: shape, on. 4 instances of the substructure.]
6SUBDUE's search algorithm
- Minimum Description Length (MDL) principle: the best theory to describe a set of data is the one that minimizes the description length (DL) of the entire data set
- DL of the graph: the number of bits necessary to completely describe the graph
- Search for the substructure that results in the maximum compression
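The MDL idea above can be illustrated with a rough bit count. This is only a sketch under simplifying assumptions: the encoding below (label bits per vertex, two vertex indices plus label bits per edge) is not SUBDUE's actual encoding, and the function names and the example graph sizes are hypothetical.

```python
import math


def description_length(n_vertices, n_edges, n_vertex_labels, n_edge_labels):
    """Approximate bits needed to write down a labeled graph.

    Simplified assumption: each vertex costs log2(#vertex labels) bits and
    each edge costs two vertex indices plus log2(#edge labels) bits. Counts
    are floored at 2 so that every item costs at least one bit.
    """
    vertex_bits = n_vertices * math.log2(max(n_vertex_labels, 2))
    bits_per_edge = (2 * math.log2(max(n_vertices, 2))
                     + math.log2(max(n_edge_labels, 2)))
    return vertex_bits + n_edges * bits_per_edge


def compressed_fraction(graph, sub, n_instances):
    """(DL(S) + DL(G|S)) / DL(G); smaller means better compression.

    graph and sub are (n_vertices, n_edges, n_vertex_labels, n_edge_labels).
    G|S is the graph after each non-overlapping instance of S is collapsed
    into a single vertex carrying one new label.
    """
    gv, ge, gvl, gel = graph
    sv, se, _, _ = sub
    collapsed = (gv - n_instances * sv + n_instances,
                 ge - n_instances * se,
                 gvl + 1,
                 gel)
    return ((description_length(*sub) + description_length(*collapsed))
            / description_length(*graph))


# Hypothetical chain graph: 20 vertices, 19 edges, 4 vertex labels, 1 edge label.
graph = (20, 19, 4, 1)
# Hypothetical substructure: a 4-vertex chain drawn from the same labels.
sub = (4, 3, 4, 1)
few = compressed_fraction(graph, sub, 1)   # substructure occurs once
many = compressed_fraction(graph, sub, 5)  # substructure occurs five times
```

Under this toy encoding a substructure with more instances yields a smaller fraction, which is the sense in which the search prefers the substructure giving maximum compression.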
7Inexact graph match approach
- Find instances with a slight distortion: insertion, deletion, and substitution of edges/vertices
- A threshold parameter specifies the amount of distortion allowed
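A thresholded inexact match can be sketched for the special case of linear chains of labels. This is not SUBDUE's graph-match routine: it swaps in a plain Levenshtein edit distance on label sequences, and it assumes the threshold is a fraction of the larger chain's size. The labels are illustrative helix labels.

```python
def edit_distance(a, b):
    """Levenshtein distance between two label sequences (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # delete x
                            curr[j - 1] + 1,           # insert y
                            prev[j - 1] + (x != y)))   # substitute or keep
        prev = curr
    return prev[-1]


def inexact_match(a, b, threshold):
    """True when the distortion is within threshold * size of the larger chain.

    Assumption: threshold is a fraction between 0 and 1.
    """
    return edit_distance(a, b) <= threshold * max(len(a), len(b))


# Two illustrative chains of helix labels that differ in two positions.
chain_one = ["h_1_14", "h_1_15", "h_1_6", "h_1_6",
             "h_1_19", "h_1_8", "h_1_18", "h_1_20"]
chain_two = ["h_1_15", "h_1_15", "h_1_6", "h_1_1",
             "h_1_19", "h_1_8", "h_1_18", "h_1_20"]

loose = inexact_match(chain_one, chain_two, 0.25)   # allows 2 edits out of 8
strict = inexact_match(chain_one, chain_two, 0.10)  # allows fewer than 1 edit
```

With a threshold of 0 the check reduces to an exact match.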
8Overview of proteins
- most important biomolecule
- composed of 20 amino acids
- structural hierarchy
- very diverse structure and function
9Structural hierarchy in proteins
- Primary structure (sequence of protein)
- Secondary structure (helix, sheet, random)
- Tertiary structure (3-D)
10Primary Structure of proteins
- Average 100-150 residues (a.a.) linked head to tail
- N-terminus and C-terminus
- Peptide bond, alpha-carbon
[Figure: dipeptide backbone from N-terminus to C-terminus, H3N - Cα1(R1) - C(=O) - N(H) - Cα2(R2) - C(=O) - O, with the peptide bond joining the first a.a. and the second a.a.]
11Secondary structure elements
- Ordered backbone arrangement: helix and sheet
- Helix (0 to 90; average 11 a.a.; several types)
- Sheet (2 to 15 strands per sheet; parallel and anti-parallel; average 6 a.a. per strand)
12Tertiary Structure of protein
- Highly complicated 3-D arrangement
- Folding of its secondary structure elements
13Brookhaven Protein Data Bank (PDB)
- Brookhaven National Laboratory
- Over 6000 experimentally determined 3-D structures of biomolecules
- The majority are protein structures
14Contents of PDB
- SEQRES: sequence of a.a. (three-letter code)
- HELIX: starting, ending, and type
- SHEET: starts, ends, sense
- ATOM: (x, y, z) coordinates for each atom in the protein
15Applications of SUBDUE to PDB - Methods and Results
- July 1997 PDBTM release (6000 PDB)
- Global data set (4000 PDB)
- Category data sets: hemoglobin, myoglobin, ribonuclease A
16Flowchart of Research
[Flowchart. Stages: Preprocessing, Application. Boxes: Brookhaven PDB; Inputs to SUBDUE; Patterns in Category; Patterns in Global, others; Graphic representation; Instance mapping.]
17Preprocessing
- compile PDB list for each category
- model.c: extract first model
- seq.c: extract sequence info, convert to graphic format
- secondary.c: extract secondary structure info and convert to graphic format
- coor.c: extract 3D coordinates, convert to graphic format
18Primary structure and its representation
- Sample PDB lines
SEQRES 1 150 ALA ASN LYS THR 1ASH 139
SEQRES 2 150 LYS SER LEU GLU 1ASH 140
- Sequence (N-terminus to C-terminus)
ALA ASN LYS THR LYS SER LEU GLU
- SUBDUE graphic input (ALA ASN)
v 1 ALA - - - ALA residue
v 2 ASN - - - ASN residue
e 1 2 bond - - - a peptide bond between ALA and ASN
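The conversion from SEQRES lines to vertex and edge lines might look like the sketch below. It is not the original seq.c: the function names are hypothetical, and the parser assumes whitespace-separated lines shaped like the sample above rather than the fixed-column PDB format.

```python
def seqres_residues(lines):
    """Collect three-letter residue names from simplified SEQRES lines.

    Assumption: fields are whitespace-separated, the first three fields are
    the record name, serial number and residue count, and anything that is
    not a three-letter alphabetic token (ID code, line number) is ignored.
    """
    residues = []
    for line in lines:
        fields = line.split()
        if not fields or fields[0] != "SEQRES":
            continue
        residues.extend(tok for tok in fields[3:]
                        if len(tok) == 3 and tok.isalpha())
    return residues


def residues_to_subdue(residues):
    """One vertex per residue, one 'bond' edge between sequence neighbours."""
    lines = [f"v {i} {name}" for i, name in enumerate(residues, start=1)]
    lines += [f"e {i} {i + 1} bond" for i in range(1, len(residues))]
    return lines


sample = [
    "SEQRES 1 150 ALA ASN LYS THR 1ASH 139",
    "SEQRES 2 150 LYS SER LEU GLU 1ASH 140",
]
sequence = seqres_residues(sample)
graph_lines = residues_to_subdue(sequence[:2])
```

For the first two residues this reproduces the three input lines shown on the slide (two vertices and one bond edge).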
19Secondary structure and its representation - HELIX
- Sample PDB lines (starting, ending, type)
HELIX 1 ASN 1 HIS 13 1
HELIX 2 ASN 20 ASN 36 1
- vertex: h_type_length
- Helix length: Hlength = SeqNum(last a.a.) - SeqNum(first a.a.)
- SUBDUE graphic input
v 1 h_1_12 - - - helix 1, type 1, length 12
v 2 h_1_16 - - - helix 2, type 1, length 16
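The helix labeling rule can be written out as a short sketch. The function name is hypothetical, and the parser assumes the simplified seven-field lines shown on the slide, not the fixed-column HELIX record of a real PDB file.

```python
def helix_vertex_label(line):
    """Build an h_type_length label from a simplified HELIX line.

    Assumed field order: record name, helix number, first residue name,
    first sequence number, last residue name, last sequence number, type.
    Length follows the slide: SeqNum(last a.a.) - SeqNum(first a.a.).
    """
    _, _, _, first_seq, _, last_seq, helix_type = line.split()
    return f"h_{helix_type}_{int(last_seq) - int(first_seq)}"


sample = [
    "HELIX 1 ASN 1 HIS 13 1",
    "HELIX 2 ASN 20 ASN 36 1",
]
vertex_lines = [f"v {i} {helix_vertex_label(line)}"
                for i, line in enumerate(sample, start=1)]
```

Note that the length defined this way is one less than the number of residues spanned by the helix.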
20Secondary structure and its representation - SHEET
- Sample PDB lines (sense, length)
SHEET 1 TYR 284 ILE 286 0
SHEET 2 HIS 292 THR 294 -1
- vertex: s_sense_length
- SUBDUE graphic input
v 1 s_0_2 - - - strand 1, sense 0, length 2
v 2 s_-1_2 - - - strand 2, sense -1, length 2
21Overall secondary structure representation
- PDB lines
HELIX 1 THR 3 MET 13 1
HELIX 2 ASN 24 ASN 34 1
HELIX 3 SER 50 GLN 60 1
SHEET 1 LYS 41 HIS 48 0
SHEET 2 MET 79 THR 87 -1
- SUBDUE graphic input
v 1 h_1_10
v 2 h_1_10
e 1 2 sh
v 3 s_0_7
e 2 3 sh
v 4 h_1_10
e 3 4 sh
v 5 s_-1_8
e 4 5 sh
- The sequential relationship is represented as edge sh
- Visualization
[Figure: secondary structure elements drawn in order from N-terminus to C-terminus]
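The overall representation can be sketched as follows. This is not the original secondary.c. It assumes that elements are ordered by their starting sequence number, which is inferred from the slide's example (sheet strand 1 starts at residue 41 and is placed before helix 3, which starts at residue 50). The `Element` type and function name are hypothetical.

```python
from typing import NamedTuple


class Element(NamedTuple):
    kind: str    # "h" for helix, "s" for sheet strand
    code: int    # helix type or strand sense
    start: int   # sequence number of the first a.a.
    end: int     # sequence number of the last a.a.


def secondary_structure_graph(elements):
    """Order elements along the chain and link neighbours with 'sh' edges."""
    ordered = sorted(elements, key=lambda el: el.start)
    lines = []
    for i, el in enumerate(ordered, start=1):
        lines.append(f"v {i} {el.kind}_{el.code}_{el.end - el.start}")
        if i > 1:
            lines.append(f"e {i - 1} {i} sh")
    return lines


# The five records from the slide, in the order they appear in the PDB file.
records = [
    Element("h", 1, 3, 13),
    Element("h", 1, 24, 34),
    Element("h", 1, 50, 60),
    Element("s", 0, 41, 48),
    Element("s", -1, 79, 87),
]
graph_lines = secondary_structure_graph(records)
```

Sorting by start position turns the file order (all helices, then all strands) into the N-terminus to C-terminus order used in the graph.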
22Tertiary structure and its representation
- Sample PDB lines (X, Y, Z)
ATOM CA ALA 1 10.369 0.997 10.519
ATOM CA ASN 2 6.691 0.239 9.830
- vertex: backbone carbon; edge: distance (vs, s)
- Distance (Å): $\text{distance} = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2 + (z_2 - z_1)^2}$
- SUBDUE graphic input
v 1 CA_ALA
v 2 CA_ASN
e 1 2 vs - - - very short distance
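The distance edge can be sketched as below. It assumes cut-offs of 4 Å for vs and 6 Å for s, and that pairs farther apart than 6 Å get no edge; the latter is an assumption, since the slides do not say how longer distances are handled. Constant and function names are hypothetical.

```python
import math

VERY_SHORT_MAX = 4.0  # Å, assumed upper bound for "vs"
SHORT_MAX = 6.0       # Å, assumed upper bound for "s"


def distance_label(a, b):
    """Return 'vs', 's', or None for two (x, y, z) coordinates in Å."""
    d = math.dist(a, b)  # Euclidean distance, Python 3.8+
    if d <= VERY_SHORT_MAX:
        return "vs"
    if d <= SHORT_MAX:
        return "s"
    return None  # assumption: no edge beyond the "s" range


ca_ala = (10.369, 0.997, 10.519)
ca_asn = (6.691, 0.239, 9.830)
label = distance_label(ca_ala, ca_asn)
edge_line = f"e 1 2 {label}"
```

The two sample alpha-carbons are about 3.8 Å apart, so the edge between them is labeled vs, as on the slide.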
23Rationale for representation choice - Criteria
- Patterns identified by SUBDUE must be representative for each category
- Patterns discovered by SUBDUE should discriminate one category from others
24Primary sequence
- vertex: a.a. residue name
- edge: peptide bond
[Figure: chain ARG - GLU - ALA with both edges labeled bond]
v 1 ARG
v 2 GLU
v 3 ALA
e 1 2 bond
e 2 3 bond
25Secondary structure elements
- Type of the helix
- Starting and ending points (a.a. name and seq number)
[Figure: Helix 1 drawn from N-terminus to C-terminus with attributes type = 1, length = 12, starts = ASN, ends = HIS]
26Other ways of representing helix
- Separate type and length: vertex Helix 1 with attributes type = 1 and length = 12
- Combine type and length: single vertex Helix_1_12
27Tertiary structure
- (x, y, z) coordinates vary with the choice of origin
- Avoid numeric values; use vs (≤ 4 Å), s (4 Å < dist ≤ 6 Å)
[Figure: two carbons, C1 at (x, y, z) = (10.4, 1.0, 10.5) and C2 at (6.7, 0.2, 9.8), joined by an edge labeled vs]
28Results: Primary structure patterns
Hemo_seq (63/65)
Hemo_sequence THR LYS THR TYR PHE PRO HIS PHE
ASP LEU SER HIS GLY SER ALA GLN VAL LYS GLY HIS
GLY LYS LYS VAL ALA ASP ALA LEU THR ASN ALA VAL
ALA HIS VAL ASP ASP MET PRO ASN ALA LEU SER ALA
LEU SER THR LEU ALA ALA HIS LEU PRO ALA GLU PHE
THR PRO ALA VAL HIS ALA SER LEU ASP LYS PHE LEU
ALA SER VAL SER THR VAL LEU THR SER LYS TYR
Myo_seq (67/103)
Myoglo_sequence VAL LEU SER GLU GLY GLU TRP GLN
LEU VAL LEU HIS VAL TRP ALA LYS VAL GLU ALA ASP
VAL ALA GLY HIS GLY GLN ASP ILE LEU ILE ARG LEU
PHE LYS SER HIS PRO GLU THR LEU GLU LYS PHE ASP
ARG
Ribo_A (59/68)
Ribonuclease_A_sequence GLY GLN THR ASN CYS TYR
GLN SER TYR SER THR MET SER ILE THR ASP CYS ARG
GLU THR GLY SER SER LYS TYR PRO ASN CYS ALA TYR
LYS THR THR GLN ALA ASN LYS HIS ILE ILE VAL ALA
CYS GLU GLY ASN PRO TYR VAL PRO VAL HIS PHE ASP
ALA SER VAL
29Primary structure patterns
- Unique to each sample category
- hemoglobin and myoglobin proteins
share little sequence similarity
30Results: Hemo secondary structure patterns
1 h_1_14 -> h_1_15 -> h_1_6 -> h_1_6 -> h_1_19 -> h_1_8 -> h_1_18 -> h_1_20
7 h_1_15 -> h_1_15 -> h_1_6 -> h_1_1 -> h_1_19 -> h_1_8 -> h_1_18 -> h_1_20
31Results: Myo secondary structure patterns
1 h_1_15 -> h_1_15 -> h_1_6 -> h_1_6 -> h_1_19 -> h_1_9 -> h_1_18 -> h_1_25
32Results: Ribo_A secondary structure patterns
1 h_1_10 -> h_1_10 -> s_0_7 -> s_0_7 -> h_1_10 -> s_0_3 -> s_0_3 -> s_-1_4 -> s_-1_4 -> s_-1_8 -> s_-1_1 -> s_-1_10 -> s_-1_10 -> s_-1_8 -> s_-1_8 -> s_-1_5 -> s_-1_3
10 h_1_10 -> h_1_10 -> s_0_7 -> h_1_10 -> s_0_3 -> s_-1_4 -> s_-1_8 -> s_-1_8 -> s_-1_6
33Results: Tertiary structural patterns
- SUBDUE finds small patterns (2 or 3 a.a.)
- not unique for each category of proteins
- not biologically meaningful
34Visualization of secondary structure patterns - hemoglobin
[Figure: complete hemoglobin structure and 2 instances of the pattern, each drawn from N-terminus to C-terminus]
35Visualization of secondary structure patterns - myoglobin
[Figure: complete myoglobin structure and 1 instance of the pattern, each drawn from N-terminus to C-terminus]
36Visualization of secondary structure patterns - ribonuclease_A
[Figure: complete ribonuclease_A structure and 1 instance of the pattern, each drawn from N-terminus to C-terminus]
37Discussion-Hemoglobin
- Hemoglobin: A, B, C, D chains
- Two types of patterns identified by SUBDUE: one for A, C chains, the other for B, D chains
- Patterns exist in a majority of hemoglobin proteins
- No instances of the best hemoglobin pattern found in other proteins in the global data set
38Occurrence of hemo patterns
39Occurrence of hemo patterns -continued
40Discussion-Myoglobin
- Myoglobin: one chain
- One dominant pattern identified by SUBDUE
- Patterns exist in most myoglobin proteins
- No instances of the best myoglobin pattern found in other proteins in the global data set
41Discussion-Hemoglobin and Myoglobin
- Similar secondary structure patterns
Hemoglobin B, D chains (from N- to C-terminus): h_1_14 -> h_1_15 -> h_1_6 -> h_1_6 -> h_1_19 -> h_1_8 -> h_1_18 -> h_1_20
Myoglobin chain (from N- to C-terminus): h_1_15 -> h_1_15 -> h_1_6 -> h_1_6 -> h_1_19 -> h_1_9 -> h_1_18 -> h_1_25
Hemoglobin A, C chains (from N- to C-terminus): h_1_15 -> h_1_15 -> h_1_6 -> h_1_1 -> h_1_19 -> h_1_8 -> h_1_18 -> h_1_20
42Discussion-Hemoglobin and Myoglobin
- Consistent with the genetic studies
- Hemoglobin and myoglobin share one ancestral gene
- Divergence occurred in the course of evolution: one copy of the gene for myoglobin, four copies for hemoglobin
- The last helix of hemoglobin is shorter
- One of the helices in the hemoglobin A, C chains almost disappears, allowing conformational change
43Discussion-ribonuclease A proteins
- All patterns have three helices of the same size
- Several strands appear twice, indicating participation in the formation of two sheets
- Ribonuclease S protein (S-protein fragment) also has the pattern
44Conclusion of the results
- Secondary structure patterns discovered by SUBDUE are representative of each category
- Secondary structure patterns discovered by SUBDUE are distinct for each category
- SUBDUE has the ability to discover biologically interesting patterns from PDB and other similar MB databases
45Comparison with other related studies
- Different graphic representation
- Predefined patterns with exact or inexact graph match
- Not applied systematically to PDB or other DBs
- SUBDUE would perform a similar task if the inexact graph match routine is incorporated
46Conclusions of the study
- Abstraction of 3D structure to its secondary structural elements is suitable for discovery
- The secondary structure patterns SUBDUE discovered for each category can be used as a signature for its class
- Inexact graph match is useful for finding similar patterns
- SUBDUE is suitable for knowledge discovery in MB structural DBs
47Future Research
- More consistent and detailed description of secondary structure
- Add relative positions of the secondary structural elements to represent spatial relationships
- Investigate alternative representations: a more suitable 3D coordinate representation, weighting on different edges
- Inexact graph match in predefined substructure
- More collaboration with domain scientists