Applications of Data Mining and Machine Learning in Bioinformatics

1
Applications of Data Mining and Machine Learning
in Bioinformatics
  • Yen-Jen Oyang
  • Dept. of Computer Science and Information
    Engineering

2
Basics of Protein Structures
  • A typical protein consists of hundreds to
    thousands of amino acids.
  • There are 20 standard amino acids, each of which
    is denoted by a single letter.

3
20 Amino Acids - 1
Name           Symbol   Mass (-H2O)   Side Chain             Occurrence (%)
Alanine        A, Ala   71.079        CH3-                   7.49
Arginine       R, Arg   156.188       HN=C(NH2)-NH-(CH2)3-   5.22
Asparagine     N, Asn   114.104       H2N-CO-CH2-            4.53
Aspartic acid  D, Asp   115.089       HOOC-CH2-              5.22
Cysteine       C, Cys   103.145       HS-CH2-                1.82
Glutamine      Q, Gln   128.131       H2N-CO-(CH2)2-         4.11
Glutamic acid  E, Glu   129.116       HOOC-(CH2)2-           6.26
Source: http://prowl.rockefeller.edu/aainfo/struct.htm

4
20 Amino Acids - 2
Name           Symbol   Mass (-H2O)   Side Chain             Occurrence (%)
Glycine        G, Gly   57.052        H-                     7.10
Histidine      H, His   137.141       N=CH-NH-CH=C-CH2-      2.23
Isoleucine     I, Ile   113.160       CH3-CH2-CH(CH3)-       5.45
Leucine        L, Leu   113.160       (CH3)2-CH-CH2-         9.06
Lysine         K, Lys   128.17        H2N-(CH2)4-            5.82
Methionine     M, Met   131.199       CH3-S-(CH2)2-          2.27
Phenylalanine  F, Phe   147.177       Phenyl-CH2-            3.91
Proline        P, Pro   97.117        -N-(CH2)3-CH-          5.12
Source: http://prowl.rockefeller.edu/aainfo/struct.htm
5
20 Amino Acids - 3
Name           Symbol   Mass (-H2O)   Side Chain             Occurrence (%)
Serine         S, Ser   87.078        HO-CH2-                7.34
Threonine      T, Thr   101.105       CH3-CH(OH)-            5.96
Tryptophan     W, Trp   186.213       Phenyl-NH-CH=C-CH2-    1.32
Tyrosine       Y, Tyr   163.176       4-OH-Phenyl-CH2-       3.25
Valine         V, Val   99.133        CH3-CH(CH3)-           6.48
Source: http://prowl.rockefeller.edu/aainfo/struct.htm
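
The "Mass (-H2O)" column lists residue masses, so the mass of a whole chain can be estimated by summing the residues and adding back one water molecule. The following is a minimal sketch of that arithmetic; the function name `peptide_mass` and the water mass of 18.015 are assumptions introduced here, not part of the slides.

```python
# Residue masses (amino acid minus H2O) as listed in the three tables.
RESIDUE_MASS = {
    "A": 71.079, "R": 156.188, "N": 114.104, "D": 115.089, "C": 103.145,
    "Q": 128.131, "E": 129.116, "G": 57.052, "H": 137.141, "I": 113.160,
    "L": 113.160, "K": 128.17, "M": 131.199, "F": 147.177, "P": 97.117,
    "S": 87.078, "T": 101.105, "W": 186.213, "Y": 163.176, "V": 99.133,
}

# Assumption: average mass of one water molecule (not given in the slides).
WATER_MASS = 18.015


def peptide_mass(sequence):
    """Sum the residue masses of a one-letter sequence and add one water."""
    unknown = set(sequence) - set(RESIDUE_MASS)
    if unknown:
        raise ValueError(f"not a standard amino acid symbol: {sorted(unknown)}")
    return sum(RESIDUE_MASS[residue] for residue in sequence) + WATER_MASS


if __name__ == "__main__":
    # Glycine followed by alanine: 57.052 + 71.079 + 18.015
    print(round(peptide_mass("GA"), 3))
```

The lookup table doubles as a check that a sequence uses only the 20 one-letter symbols.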
6
Three-dimensional Structure of Myoglobin
Source: Lectures of BioInfo by yukijuan
7
Prediction of Protein Functions
  • Given a protein sequence, biochemists are
    interested in its functions and its tertiary
    structure.

8
Protein Classification Based on the Homology Model
  • Modern protein databases are growing rapidly.
  • To expedite the identification of protein
    functions, it is desirable to classify the
    protein of interest before biochemistry
    experiments are conducted.

9
  • One widely used approach to classifying proteins
    is based on the homology model, i.e., proteins
    are classified by the similarity of their amino
    acid sequences.
  • BLAST and FASTA are two of the most widely used
    software utilities for computing the similarity
    between two sequences.
  • We can cluster the proteins in an existing
    protein database in advance as the next slide
    exemplifies.
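
To make "similarity between two sequences" concrete, here is a minimal sketch of a global alignment score computed by dynamic programming (the Needleman-Wunsch recurrence). This is not how BLAST or FASTA work internally: both use faster heuristic searches and substitution matrices. The scoring values (+1 match, -1 mismatch, -1 gap) and the function name are assumptions chosen for illustration.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score of sequences a and b.

    Keeps only one row of the dynamic-programming table at a time.
    """
    # Row 0: aligning the empty prefix of a against prefixes of b costs gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: prefix of a against the empty prefix of b
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            gap_in_b = prev[j] + gap      # a[i-1] aligned against a gap
            gap_in_a = curr[j - 1] + gap  # b[j-1] aligned against a gap
            curr.append(max(diagonal, gap_in_b, gap_in_a))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(global_alignment_score("ACD", "ACD"))
    print(global_alignment_score("ACD", "AD"))
```

A higher score means the two sequences can be lined up with fewer mismatches and gaps, which is the sense of similarity the homology model relies on.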

10
An Example of Similar Protein Sequences
  • 3BP2_HUMAN MAAEEMHWPVPMKAIGAQNLLTMPGGVAKAGYLHKKGGT
    QLQLLKWPLRFVIIHKRCVYYFKSSTSASPQGAFSLSGYNRVMRAAEETT
    SNNVFPFKIIHISKKHRTWFFSASSEEERKSWMALLRREIGHFHEKKDLP
    LDTSDSSSDTDSFYGAVERPVDISLSPYPTDNEDYEHDDEDDSYLEPDSP
    EPGRLEDALMHPPAYPPPPVPTPRKPAFSDMPRAHSFTSKGPGPLLPPPP
    PKHGLPDVGLAAEDSKRDPLCPRRAEPCPRVPATPRRMSDPPLSTMPTAP
    GLRKPPCFRESASPSPEPWTPGHGACSTSSAAIMATATSRNCDKLKSFHL
    SPRGPPTSEPPPVPANKPKFLKIAEEDPPREAAMPGLFVPPVAPRPPALK
    LPVPEAMARPAVLPRPEKPQLPHLQRSPPDGQSFRSFSFEKPRQPSQADT
    GGDDSDEDYEKVPLPNSVFVNTTESCEVERLFKATSPRGEPQDGLYCIRN
    SSTKSGKVLVVWDETSNKVRNYRIFEKDSKFYLEGEVLFVSVGSMVEHYH
    THVLPSHQSLLLRHPYGYTGPR
  • 3BP2_MOUSE MAAEEMQWPVPMKAIGAQNLLTMPGGVAKAGYLHKKGGT
    QLQLLKWPLRFVIIHKRCIYYFKSSTSASPQGAFSLSGYNRVMRAAEETT
    SNNVFPFKIIHISKKHRTWFFSASSEDERKSWMAFVRREIGHFHEKKELP
    LDTSDSSSDTDSFYGAVERPIDISLSSYPMDNEDYEHEDEDDSYLEPDSP
    GPMKLEDALTYPPAYPPPPVPVPRKPAFSDLPRAHSFTSKSPSPLLPPPP
    PKRGLPDTGSAPEDAKDALGLRRVEPGLRVPATPRRMSDPPMSNVPTVPN
    LRKHPCFRDSVNPGLEPWTPGHGTSSVSSSTTMAVATSRNCDKLKSFHLS
    SRGPPTSEPPPVPANKPKFLKIAEEPSPREAAKFAPVPPVAPRPPVQKMP
    MPEATVRPAVLPRPENTPLPHLQRSPPDGQSFRGFSFEKARQPSQADTGE
    EDSDEDYEKVPLPNSVFVNTTESCEVERLFKATDPRGEPQDGLYCIRNSS
    TKSGKVLVVWDESSNKVRNYRIFEKDSKFYLEGEVLFASVGSMVEHYHTH
    VLPSHQSLLLRHPYGYAGPR
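
How similar the two sequences are can be illustrated with a position-by-position comparison. The sketch below uses only the opening residues of each entry (the first line of each listing above), because comparing the full sequences would first require an alignment if their lengths differ. The function names are assumptions introduced here.

```python
# Opening residues of the two entries, copied from the listings above.
HUMAN_PREFIX = "MAAEEMHWPVPMKAIGAQNLLTMPGGVAKAGYLHKKGGT"
MOUSE_PREFIX = "MAAEEMQWPVPMKAIGAQNLLTMPGGVAKAGYLHKKGGT"


def mismatch_positions(a, b):
    """1-based positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length; align them first")
    return [i for i, (x, y) in enumerate(zip(a, b), start=1) if x != y]


def fraction_identical(a, b):
    """Fraction of positions that carry the same residue."""
    return 1.0 - len(mismatch_positions(a, b)) / len(a)


if __name__ == "__main__":
    print(mismatch_positions(HUMAN_PREFIX, MOUSE_PREFIX))
    print(fraction_identical(HUMAN_PREFIX, MOUSE_PREFIX))
```

Over this stretch the two prefixes differ at a single position (H in the human sequence, Q in the mouse sequence).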

11
  • When a protein with unknown functions is
    submitted, the classification software identifies
    the protein clusters that contain the most
    similar proteins.
  • Biochemists can then predict the functions of
    the protein based on the output of the
    classification software.
  • The protein clustering conducted in advance
    expedites the search process.
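
The idea that clustering in advance speeds up the search can be sketched as a two-stage lookup: compare the query against one representative per cluster, then search only inside the best-matching cluster. Everything below is an illustrative assumption rather than the presenter's software: the k-mer Jaccard similarity stands in for a real sequence comparison, the first member of each cluster is treated as its representative, and the sequences are short toy fragments.

```python
def kmers(sequence, k=3):
    """Set of all overlapping substrings of length k."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def jaccard(a, b, k=3):
    """Shared k-mers divided by all k-mers; a crude similarity in [0, 1]."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    return len(ka & kb) / len(union) if union else 0.0


def best_cluster(query, clusters, k=3):
    """Name of the cluster whose representative (first member) is most similar."""
    return max(clusters, key=lambda name: jaccard(query, clusters[name][0], k))


def most_similar_member(query, members, k=3):
    """Most similar sequence within one cluster."""
    return max(members, key=lambda member: jaccard(query, member, k))


if __name__ == "__main__":
    # Hypothetical pre-computed clusters of toy fragments.
    clusters = {
        "cluster_a": ["MAAEEMHWPVPMKAIG", "MAAEEMQWPVPMKAIG"],
        "cluster_b": ["GLSDGEWQLVLNVWGK", "GLSEGEWQLVLHVWAK"],
    }
    query = "MAAEEMHWPVPLKAIG"
    name = best_cluster(query, clusters)
    print(name, most_similar_member(query, clusters[name]))
```

With many clusters, the first stage costs one comparison per cluster instead of one per database entry, which is where the saving comes from.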

12
Applications of Data Classification in Microarray
Data Analysis
  • In microarray data analysis, data classification
    is employed to predict the class of a new sample
    based on existing samples whose classes are known.

13
  • For example, the Leukemia data set contains
    72 samples and 7129 genes. The samples are:
  • 25 Acute Myeloid Leukemia (AML) samples.
  • 38 B-cell Acute Lymphoblastic Leukemia samples.
  • 9 T-cell Acute Lymphoblastic Leukemia samples.
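
Predicting the class of a new sample from labelled samples can be sketched with a nearest-centroid classifier: average the expression vectors of each class, then assign a new sample to the class with the closest average. This is only one of many possible classifiers and the slides do not say which one was used. The expression values below are synthetic and have 3 genes rather than 7129; the class labels and function names are assumptions for illustration.

```python
from math import dist


def class_centroids(samples, labels):
    """Mean expression vector per class."""
    sums, counts = {}, {}
    for vector, label in zip(samples, labels):
        total = sums.setdefault(label, [0.0] * len(vector))
        for i, value in enumerate(vector):
            total[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in total] for label, total in sums.items()}


def predict(centroids, sample):
    """Label of the centroid closest to the sample (Euclidean distance)."""
    return min(centroids, key=lambda label: dist(centroids[label], sample))


if __name__ == "__main__":
    # Synthetic training data: one row per sample, one column per gene.
    samples = [
        [9.0, 1.0, 1.0], [8.0, 2.0, 1.0],  # labelled AML
        [1.0, 9.0, 1.0], [2.0, 8.0, 2.0],  # labelled B-ALL
        [1.0, 1.0, 9.0], [2.0, 1.0, 8.0],  # labelled T-ALL
    ]
    labels = ["AML", "AML", "B-ALL", "B-ALL", "T-ALL", "T-ALL"]
    centroids = class_centroids(samples, labels)
    print(predict(centroids, [8.5, 1.0, 2.0]))
```

With real microarray data, each sample vector would have one entry per gene, and some form of gene selection would usually precede classification because genes far outnumber samples.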

14
Model of Microarray Data Sets
15
Applications of Data Clustering in Microarray
Data Analysis
  • Data clustering has been employed in microarray
    data analysis for two purposes:
  • identifying genes with similar expression
    patterns;
  • identifying the subtypes of samples.
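
Grouping genes with similar expression can be sketched by linking any two genes whose expression profiles are strongly correlated and reporting the linked groups. This is a simple single-linkage scheme chosen for illustration; the slides do not name a clustering algorithm. The threshold of 0.9, the gene names, and the expression values are all hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if one is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


def cluster_genes(profiles, threshold=0.9):
    """Group genes so that each gene correlates >= threshold with some group member."""
    clusters = []
    for gene, profile in profiles.items():
        # Existing clusters that contain at least one gene similar to this one.
        linked = [c for c in clusters
                  if any(pearson(profile, profiles[g]) >= threshold for g in c)]
        merged = [gene]
        for cluster in linked:
            merged.extend(cluster)
            clusters.remove(cluster)
        clusters.append(merged)
    return sorted(sorted(c) for c in clusters)


if __name__ == "__main__":
    # Synthetic profiles: expression of each gene across four samples.
    profiles = {
        "gene_a": [1.0, 2.0, 3.0, 4.0],
        "gene_b": [2.0, 4.0, 6.0, 8.5],
        "gene_c": [4.0, 3.0, 2.0, 1.0],
        "gene_d": [8.0, 6.0, 4.5, 2.0],
    }
    print(cluster_genes(profiles))
```

Applying the same function to sample profiles instead of gene profiles (one vector per sample, one entry per gene) gives a comparable sketch for finding sample subtypes.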