Title: Applications of Data Mining and Machine Learning in Bioinformatics
1. Applications of Data Mining and Machine Learning
in Bioinformatics
- Yen-Jen Oyang
- Dept. of Computer Science and Information
Engineering
2. Basics of Protein Structures
- A typical protein consists of hundreds to
thousands of amino acids.
- There are 20 basic amino acids, each of which is
denoted by a single letter.
3. 20 Amino Acids - 1
Name           Symbol   Mass (-H2O)  Side Chain             Occurrence (%)
Alanine        A, Ala   71.079       CH3-                   7.49
Arginine       R, Arg   156.188      HN=C(NH2)-NH-(CH2)3-   5.22
Asparagine     N, Asn   114.104      H2N-CO-CH2-            4.53
Aspartic acid  D, Asp   115.089      HOOC-CH2-              5.22
Cysteine       C, Cys   103.145      HS-CH2-                1.82
Glutamine      Q, Gln   128.131      H2N-CO-(CH2)2-         4.11
Glutamic acid  E, Glu   129.116      HOOC-(CH2)2-           6.26
Source: http://prowl.rockefeller.edu/aainfo/struct.htm
4. 20 Amino Acids - 2
Name           Symbol   Mass (-H2O)  Side Chain                 Occurrence (%)
Glycine        G, Gly   57.052       H-                         7.10
Histidine      H, His   137.141      N=CH-NH-CH=C-CH2- (ring)   2.23
Isoleucine     I, Ile   113.160      CH3-CH2-CH(CH3)-           5.45
Leucine        L, Leu   113.160      (CH3)2-CH-CH2-             9.06
Lysine         K, Lys   128.17       H2N-(CH2)4-                5.82
Methionine     M, Met   131.199      CH3-S-(CH2)2-              2.27
Phenylalanine  F, Phe   147.177      Phenyl-CH2-                3.91
Proline        P, Pro   97.117       -N-(CH2)3-CH- (ring)       5.12
Source: http://prowl.rockefeller.edu/aainfo/struct.htm
5. 20 Amino Acids - 3
Name           Symbol   Mass (-H2O)  Side Chain                  Occurrence (%)
Serine         S, Ser   87.078       HO-CH2-                     7.34
Threonine      T, Thr   101.105      CH3-CH(OH)-                 5.96
Tryptophan     W, Trp   186.213      Phenyl-NH-CH=C-CH2- (ring)  1.32
Tyrosine       Y, Tyr   163.176      4-OH-Phenyl-CH2-            3.25
Valine         V, Val   99.133       CH3-CH(CH3)-                6.48
Source: http://prowl.rockefeller.edu/aainfo/struct.htm
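The residue masses in the tables above already have one water removed per residue, so the average mass of a peptide can be estimated by summing them and adding one water back for the free termini. A minimal sketch, assuming the table values and an average water mass of 18.015 Da; the function names are illustrative and are not from the lecture:

```python
from collections import Counter

# Average residue masses (amino acid minus H2O), copied from the tables above.
RESIDUE_MASS = {
    "A": 71.079, "R": 156.188, "N": 114.104, "D": 115.089, "C": 103.145,
    "Q": 128.131, "E": 129.116, "G": 57.052, "H": 137.141, "I": 113.160,
    "L": 113.160, "K": 128.17, "M": 131.199, "F": 147.177, "P": 97.117,
    "S": 87.078, "T": 101.105, "W": 186.213, "Y": 163.176, "V": 99.133,
}
WATER_MASS = 18.015  # assumption: average mass of H2O in daltons


def peptide_mass(sequence):
    """Sum the residue masses and add one water for the terminal H and OH."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER_MASS


def composition(sequence):
    """Return the percentage of each amino acid present in the sequence."""
    counts = Counter(sequence)
    return {aa: 100.0 * n / len(sequence) for aa, n in counts.items()}


print(round(peptide_mass("GAS"), 3))
print(composition("MAAEEMHWPV"))
```

The composition of a real sequence can be compared with the Occurrence (%) column to see which residues are over- or under-represented.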
6. Three-dimensional Structure of Myoglobin
Source: Lectures of BioInfo by yukijuan
7. Prediction of Protein Functions
- Given a protein sequence, biochemists are
interested in its functions and its tertiary
structure.
8. Protein Classification Based on the Homology Model
- Modern protein databases are growing rapidly.
- To speed up the identification of protein
functions, it is desirable to classify the protein
of interest before biochemistry experiments are
conducted.
9.
- One widely used approach to classifying proteins
is based on the homology model, i.e. proteins are
classified by the similarity of their amino acid
sequences.
- BLAST and FASTA are two of the most widely used
software utilities for computing the similarity
between two sequences.
- The proteins in an existing protein database can
be clustered in advance, as the next slide
exemplifies.
10. An Example of Similar Protein Sequences
- 3BP2_HUMAN MAAEEMHWPVPMKAIGAQNLLTMPGGVAKAGYLHKKGGT
QLQLLKWPLRFVIIHKRCVYYFKSSTSASPQGAFSLSGYNRVMRAAEETT
SNNVFPFKIIHISKKHRTWFFSASSEEERKSWMALLRREIGHFHEKKDLP
LDTSDSSSDTDSFYGAVERPVDISLSPYPTDNEDYEHDDEDDSYLEPDSP
EPGRLEDALMHPPAYPPPPVPTPRKPAFSDMPRAHSFTSKGPGPLLPPPP
PKHGLPDVGLAAEDSKRDPLCPRRAEPCPRVPATPRRMSDPPLSTMPTAP
GLRKPPCFRESASPSPEPWTPGHGACSTSSAAIMATATSRNCDKLKSFHL
SPRGPPTSEPPPVPANKPKFLKIAEEDPPREAAMPGLFVPPVAPRPPALK
LPVPEAMARPAVLPRPEKPQLPHLQRSPPDGQSFRSFSFEKPRQPSQADT
GGDDSDEDYEKVPLPNSVFVNTTESCEVERLFKATSPRGEPQDGLYCIRN
SSTKSGKVLVVWDETSNKVRNYRIFEKDSKFYLEGEVLFVSVGSMVEHYH
THVLPSHQSLLLRHPYGYTGPR
- 3BP2_MOUSE MAAEEMQWPVPMKAIGAQNLLTMPGGVAKAGYLHKKGGT
QLQLLKWPLRFVIIHKRCIYYFKSSTSASPQGAFSLSGYNRVMRAAEETT
SNNVFPFKIIHISKKHRTWFFSASSEDERKSWMAFVRREIGHFHEKKELP
LDTSDSSSDTDSFYGAVERPIDISLSSYPMDNEDYEHEDEDDSYLEPDSP
GPMKLEDALTYPPAYPPPPVPVPRKPAFSDLPRAHSFTSKSPSPLLPPPP
PKRGLPDTGSAPEDAKDALGLRRVEPGLRVPATPRRMSDPPMSNVPTVPN
LRKHPCFRDSVNPGLEPWTPGHGTSSVSSSTTMAVATSRNCDKLKSFHLS
SRGPPTSEPPPVPANKPKFLKIAEEPSPREAAKFAPVPPVAPRPPVQKMP
MPEATVRPAVLPRPENTPLPHLQRSPPDGQSFRGFSFEKARQPSQADTGE
EDSDEDYEKVPLPNSVFVNTTESCEVERLFKATDPRGEPQDGLYCIRNSS
TKSGKVLVVWDESSNKVRNYRIFEKDSKFYLEGEVLFASVGSMVEHYHTH
VLPSHQSLLLRHPYGYAGPR
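The human and mouse sequences above differ only at scattered positions. A similarity score of this kind can be sketched with a plain Smith-Waterman local alignment. This is a simplified stand-in, not how BLAST or FASTA work internally: those tools use heuristics and substitution matrices. The scoring values (match 2, mismatch -1, gap -2) and the function name are assumptions chosen for illustration, and only the first 21 residues of each sequence are used.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the Smith-Waterman score of the best local alignment of a and b."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Extend diagonally (match or mismatch) or open a gap in either sequence.
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# First 21 residues of 3BP2_HUMAN and 3BP2_MOUSE from the slide above.
human = "MAAEEMHWPVPMKAIGAQNLL"
mouse = "MAAEEMQWPVPMKAIGAQNLL"

print(local_alignment_score(human, mouse))  # one mismatch among 21 positions
print(local_alignment_score(human, human))  # self-score, the upper bound
```

Dividing the pairwise score by the self-score gives a rough normalized similarity between 0 and 1.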
11.
- When a protein with unknown functions is
submitted, the classification software identifies
the protein clusters that contain the most similar
proteins.
- Biochemists can then predict the functions of
the protein based on the output of the
classification software.
- Clustering the proteins in advance expedites the
search process.
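The lookup described above can be sketched as follows. This is a minimal illustration, assuming a k-mer Jaccard similarity as a simple stand-in for a BLAST or FASTA score. The cluster names, the cluster_B sequence, the query, and the function names are all hypothetical; the cluster_A members are the first 21 residues of the two 3BP2 sequences shown earlier.

```python
def kmers(seq, k=3):
    """Return the set of overlapping substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b, k=3):
    """Shared k-mers divided by all distinct k-mers of the two sequences."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    return len(ka & kb) / len(union) if union else 0.0


def best_cluster(query, clusters):
    """Score each precomputed cluster by its most similar member."""
    scores = {
        name: max(jaccard(query, member) for member in members)
        for name, members in clusters.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical clusters built in advance from a protein database.
clusters = {
    "cluster_A": ["MAAEEMHWPVPMKAIGAQNLL", "MAAEEMQWPVPMKAIGAQNLL"],
    "cluster_B": ["MVLSEGEWQLVLHVWAKVEAD"],  # illustrative toy sequence
}

query = "MAAEEMHWPVPMKAIGTQNLL"  # hypothetical protein with unknown function
name, scores = best_cluster(query, clusters)
print(name, scores)
```

Because the clusters are built beforehand, a query only needs to be compared against cluster members or representatives rather than re-clustering the whole database.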
12. Applications of Data Classification in Microarray
Data Analysis
- In microarray data analysis, data classification
is employed to predict the class of a new sample
based on existing samples whose classes are known.
13.
- For example, the Leukemia data set contains
72 samples and 7129 genes:
  - 25 Acute Myeloid Leukemia (AML) samples.
  - 38 B-cell Acute Lymphoblastic Leukemia (ALL) samples.
  - 9 T-cell Acute Lymphoblastic Leukemia (ALL) samples.
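Predicting the class of a new sample from labeled samples can be sketched with a k-nearest-neighbor classifier. The lecture does not say which classifier is used, so k-NN is an assumption here. The expression values below are made up for three hypothetical genes and are not taken from the Leukemia data set, which has 7129 genes per sample.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Majority vote among the k training samples closest to the query.

    train is a list of (expression_vector, label) pairs.
    """
    nearest = sorted(train, key=lambda s: math.dist(s[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy, made-up expression levels; each vector is one sample.
train = [
    ([5.1, 0.9, 2.0], "AML"),
    ([4.8, 1.2, 2.2], "AML"),
    ([5.3, 1.0, 1.8], "AML"),
    ([1.0, 4.9, 2.1], "ALL"),
    ([1.3, 5.2, 1.9], "ALL"),
    ([0.8, 4.7, 2.3], "ALL"),
]

print(knn_predict(train, [4.9, 1.1, 2.1]))
print(knn_predict(train, [1.1, 5.0, 2.0]))
```

With thousands of genes and only 72 samples, a practical pipeline would usually select a subset of informative genes before classifying.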
14. Model of Microarray Data Sets
15. Applications of Data Clustering in Microarray
Data Analysis
- Data clustering has been employed in microarray
data analysis for:
  - identifying genes with similar expression patterns
  - identifying subtypes of samples.
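Grouping genes with similar expression can be sketched by linking genes whose profiles are highly correlated. This is one simple choice (single-linkage grouping on Pearson correlation), not necessarily the method used in the lecture. The gene names, expression values, and the 0.9 threshold are made up for illustration, and the sketch assumes no profile is constant across samples.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def cluster_genes(profiles, threshold=0.9):
    """Join a gene to every cluster containing a gene correlated above threshold."""
    clusters = []
    for gene in profiles:
        linked = [
            c for c in clusters
            if any(pearson(profiles[gene], profiles[g]) >= threshold for g in c)
        ]
        merged = {gene}.union(*linked)  # merge all linked clusters with the gene
        clusters = [c for c in clusters if c not in linked] + [merged]
    return clusters


# Toy expression profiles: one value per sample for each hypothetical gene.
profiles = {
    "gene_a": [1.0, 2.0, 3.0, 4.0],
    "gene_b": [2.1, 3.9, 6.2, 8.0],
    "gene_c": [4.0, 3.0, 2.0, 1.0],
    "gene_d": [8.1, 6.0, 3.9, 2.2],
}

print(cluster_genes(profiles))
```

Applying the same function to sample profiles instead of gene profiles gives a rough way to look for sample subtypes.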