Applications of Data Mining and Machine Learning in Bioinformatics

1
Applications of Data Mining and Machine Learning
in Bioinformatics

Yen-Jen Oyang
Dept. of Computer Science and Information
Engineering

2
Basics of Protein Structures

A typical protein consists of hundreds to
thousands of amino acids.
There are 20 standard amino acids, each of which is
denoted by a single letter of the English alphabet.
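
As a small illustration of the one-letter code, the sketch below checks that a sequence uses only the 20 standard letters and counts how often each residue occurs. The names `AMINO_ACIDS` and `composition` are hypothetical and chosen for this example only.

```python
from collections import Counter

# The 20 standard one-letter amino acid codes.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def composition(sequence):
    """Count residues in a sequence; reject letters outside the 20 standard codes."""
    seq = sequence.upper()
    unknown = set(seq) - AMINO_ACIDS
    if unknown:
        raise ValueError(f"non-standard residue codes: {sorted(unknown)}")
    return Counter(seq)


if __name__ == "__main__":
    # A short fragment used purely as sample input.
    counts = composition("MAAEEMHWPVPMKAIGAQNLLTMPGGVAKAGYLHKKGGT")
    for residue, n in sorted(counts.items()):
        print(residue, n)
```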

3
20 Amino Acids - 1
Name            Symbol   Mass (-H2O)   Side Chain             Occurrence (%)
Alanine         A, Ala   71.079        CH3-                   7.49
Arginine        R, Arg   156.188       HN=C(NH2)-NH-(CH2)3-   5.22
Asparagine      N, Asn   114.104       H2N-CO-CH2-            4.53
Aspartic acid   D, Asp   115.089       HOOC-CH2-              5.22
Cysteine        C, Cys   103.145       HS-CH2-                1.82
Glutamine       Q, Gln   128.131       H2N-CO-(CH2)2-         4.11
Glutamic acid   E, Glu   129.116       HOOC-(CH2)2-           6.26
Source: http://prowl.rockefeller.edu/aainfo/struct.htm

4
20 Amino Acids - 2
Name            Symbol   Mass (-H2O)   Side Chain                  Occurrence (%)
Glycine         G, Gly   57.052        H-                          7.10
Histidine       H, His   137.141       N=CH-NH-CH=C-CH2- (ring)    2.23
Isoleucine      I, Ile   113.160       CH3-CH2-CH(CH3)-            5.45
Leucine         L, Leu   113.160       (CH3)2-CH-CH2-              9.06
Lysine          K, Lys   128.17        H2N-(CH2)4-                 5.82
Methionine      M, Met   131.199       CH3-S-(CH2)2-               2.27
Phenylalanine   F, Phe   147.177       Phenyl-CH2-                 3.91
Proline         P, Pro   97.117        -N-(CH2)3-CH- (ring)        5.12
Source: http://prowl.rockefeller.edu/aainfo/struct.htm
5
20 Amino Acids - 3
Name            Symbol   Mass (-H2O)   Side Chain                  Occurrence (%)
Serine          S, Ser   87.078        HO-CH2-                     7.34
Threonine       T, Thr   101.105       CH3-CH(OH)-                 5.96
Tryptophan      W, Trp   186.213       Phenyl-NH-CH=C-CH2- (ring)  1.32
Tyrosine        Y, Tyr   163.176       4-OH-Phenyl-CH2-            3.25
Valine          V, Val   99.133        CH3-CH(CH3)-                6.48
Source: http://prowl.rockefeller.edu/aainfo/struct.htm
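
The "Mass (-H2O)" column gives the mass of each amino acid after it loses one water molecule when a peptide bond forms. A minimal sketch of how these values combine: the mass of a linear chain is roughly the sum of its residue masses plus one water molecule for the free ends. The residue masses are copied from the three tables. The water mass of 18.015 and the function name `peptide_mass` are assumptions that do not appear in the slides.

```python
# Residue masses (amino acid minus H2O), copied from the three amino acid tables.
RESIDUE_MASS = {
    "A": 71.079, "R": 156.188, "N": 114.104, "D": 115.089, "C": 103.145,
    "Q": 128.131, "E": 129.116, "G": 57.052, "H": 137.141, "I": 113.160,
    "L": 113.160, "K": 128.17, "M": 131.199, "F": 147.177, "P": 97.117,
    "S": 87.078, "T": 101.105, "W": 186.213, "Y": 163.176, "V": 99.133,
}

# Assumption: average mass of one water molecule (not given in the slides).
WATER_MASS = 18.015


def peptide_mass(sequence):
    """Approximate average mass of a linear peptide: residue masses plus one H2O."""
    return sum(RESIDUE_MASS[residue] for residue in sequence) + WATER_MASS


if __name__ == "__main__":
    print(round(peptide_mass("MAAEEMHWPVPMK"), 3))
```

Note that leucine and isoleucine have identical masses in the table, so a mass computed this way cannot tell the two apart.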
6
Three-dimensional Structure of Myoglobin
Source: Lectures of BioInfo by yukijuan
7
Prediction of Protein Functions

Given a protein sequence, biochemists are
interested in its functions and its tertiary
structure.

8
Protein Classification Based on the Homology Model

Modern protein databases are growing rapidly.
To speed up the identification of protein
functions, it is desirable to classify the
protein of interest before biochemical
experiments are conducted.

One widely used approach to classifying proteins is
based on the homology model, i.e., proteins are
classified according to the similarity of their
amino acid sequences.
BLAST and FASTA are the two most widely used
software utilities for computing the similarity
between two sequences.
The proteins in an existing protein database can
be clustered in advance, as the next slide
exemplifies.
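
To make "similarity between two sequences" concrete, here is a minimal sketch of a global alignment score computed by dynamic programming (the Needleman-Wunsch recurrence). It is not how BLAST or FASTA work internally: those tools use heuristics and substitution matrices to search large databases quickly. The function name and the match, mismatch, and gap values are assumptions chosen only for illustration.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of a and b under a simple scoring scheme."""
    # prev[j] holds the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] with an empty prefix of b costs i gaps
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # a[i-1] aligned to a gap
            left = curr[j - 1] + gap  # b[j-1] aligned to a gap
            curr.append(max(diagonal, up, left))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(global_alignment_score("MAAEEMH", "MAAEEMQ"))
    print(global_alignment_score("ACD", "AD"))
```

A higher score means the two sequences can be lined up with more matching residues and fewer gaps. In practice, protein comparisons replace the flat match and mismatch values with a substitution matrix.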

10
An Example of Similar Protein Sequences

3BP2_HUMAN MAAEEMHWPVPMKAIGAQNLLTMPGGVAKAGYLHKKGGT
QLQLLKWPLRFVIIHKRCVYYFKSSTSASPQGAFSLSGYNRVMRAAEETT
SNNVFPFKIIHISKKHRTWFFSASSEEERKSWMALLRREIGHFHEKKDLP
LDTSDSSSDTDSFYGAVERPVDISLSPYPTDNEDYEHDDEDDSYLEPDSP
EPGRLEDALMHPPAYPPPPVPTPRKPAFSDMPRAHSFTSKGPGPLLPPPP
PKHGLPDVGLAAEDSKRDPLCPRRAEPCPRVPATPRRMSDPPLSTMPTAP
GLRKPPCFRESASPSPEPWTPGHGACSTSSAAIMATATSRNCDKLKSFHL
SPRGPPTSEPPPVPANKPKFLKIAEEDPPREAAMPGLFVPPVAPRPPALK
LPVPEAMARPAVLPRPEKPQLPHLQRSPPDGQSFRSFSFEKPRQPSQADT
GGDDSDEDYEKVPLPNSVFVNTTESCEVERLFKATSPRGEPQDGLYCIRN
SSTKSGKVLVVWDETSNKVRNYRIFEKDSKFYLEGEVLFVSVGSMVEHYH
THVLPSHQSLLLRHPYGYTGPR
3BP2_MOUSE MAAEEMQWPVPMKAIGAQNLLTMPGGVAKAGYLHKKGGT
QLQLLKWPLRFVIIHKRCIYYFKSSTSASPQGAFSLSGYNRVMRAAEETT
SNNVFPFKIIHISKKHRTWFFSASSEDERKSWMAFVRREIGHFHEKKELP
LDTSDSSSDTDSFYGAVERPIDISLSSYPMDNEDYEHEDEDDSYLEPDSP
GPMKLEDALTYPPAYPPPPVPVPRKPAFSDLPRAHSFTSKSPSPLLPPPP
PKRGLPDTGSAPEDAKDALGLRRVEPGLRVPATPRRMSDPPMSNVPTVPN
LRKHPCFRDSVNPGLEPWTPGHGTSSVSSSTTMAVATSRNCDKLKSFHLS
SRGPPTSEPPPVPANKPKFLKIAEEPSPREAAKFAPVPPVAPRPPVQKMP
MPEATVRPAVLPRPENTPLPHLQRSPPDGQSFRGFSFEKARQPSQADTGE
EDSDEDYEKVPLPNSVFVNTTESCEVERLFKATDPRGEPQDGLYCIRNSS
TKSGKVLVVWDESSNKVRNYRIFEKDSKFYLEGEVLFASVGSMVEHYHTH
VLPSHQSLLLRHPYGYAGPR
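
The similarity between the two entries can be checked directly on the first 39 residues of each, which line up without gaps. The sketch below lists the positions where the human and mouse fragments differ and computes the percent identity. The full entries appear to differ in length, so comparing them in full would need a gapped alignment. The helper names are hypothetical.

```python
# First 39 residues of each entry, copied from the slide.
HUMAN = "MAAEEMHWPVPMKAIGAQNLLTMPGGVAKAGYLHKKGGT"
MOUSE = "MAAEEMQWPVPMKAIGAQNLLTMPGGVAKAGYLHKKGGT"


def mismatches(a, b):
    """Return (position, residue_a, residue_b) for each differing position (1-based)."""
    if len(a) != len(b):
        raise ValueError("ungapped comparison needs sequences of equal length")
    return [(i, x, y) for i, (x, y) in enumerate(zip(a, b), start=1) if x != y]


def percent_identity(a, b):
    """Percentage of positions at which the two equal-length sequences agree."""
    return 100.0 * (len(a) - len(mismatches(a, b))) / len(a)


if __name__ == "__main__":
    print(mismatches(HUMAN, MOUSE))
    print(round(percent_identity(HUMAN, MOUSE), 1))
```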

When a protein with unknown functions is
submitted, the classification software identifies
the protein clusters that contain the most similar
proteins.
Biochemists can then predict the functions of
the protein based on the output of the
classification software.
Because the proteins were clustered in advance,
the search is faster.
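
A minimal sketch of why clustering in advance helps: the query is first compared against one representative per cluster, and only the members of the best-matching cluster are then ranked. Everything here is an assumption made for illustration: the toy cluster contents (short fragments of the sequences on the previous slide), the names `cluster_A`, `prot_A1`, and so on, and the use of k-mer Jaccard similarity as a cheap stand-in for an alignment-based score.

```python
def kmers(seq, k=3):
    """Set of all overlapping substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(a, b, k=3):
    """Jaccard similarity of the k-mer sets (stand-in for an alignment score)."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    return len(ka & kb) / len(union) if union else 0.0


# Toy, hypothetical database that has already been clustered.
CLUSTERS = {
    "cluster_A": {
        "representative": "MAAEEMHWPVPMKAIGAQ",
        "members": {
            "prot_A1": "MAAEEMHWPVPMKAIGAQ",
            "prot_A2": "MAAEEMQWPVPMKAIGAQ",
        },
    },
    "cluster_B": {
        "representative": "GLRKPPCFRESASPSPEP",
        "members": {
            "prot_B1": "GLRKPPCFRESASPSPEP",
            "prot_B2": "NLRKHPCFRDSVNPGLEP",
        },
    },
}


def classify(query, clusters):
    """Pick the cluster whose representative is most similar, then rank its members."""
    best = max(clusters, key=lambda name: similarity(query, clusters[name]["representative"]))
    members = clusters[best]["members"]
    ranked = sorted(members, key=lambda m: similarity(query, members[m]), reverse=True)
    return best, ranked


if __name__ == "__main__":
    print(classify("MAAEEMQWPVPMKAIG", CLUSTERS))
```

With this layout the query is scored against each cluster representative and then against the members of a single cluster, instead of against every protein in the database.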

12
Applications of Data Classification in Microarray
Data Analysis