SVM-Prot: web-based support vector machine software for functional classification from its primary sequence

1
SVM-Prot: web-based support vector machine
software for functional classification from its
primary sequence
  • Nucleic Acids Research, 2003
  • Speaker: yfhuang_at_CSIE.NTU

2
Outline
  • Introduction
  • Method
  • Results and Discussion
  • References
  • Extended Reading

3
Introduction
  • Brief introduction
  • Currently developed approaches/methods
  • Problem
  • Goal

4
Brief Introduction
  • Protein function prediction as functional family
    prediction
  • Based on physicochemical properties of the
    protein primary sequence
  • Using an SVM to predict the functional family

5
Various methods
  • Sequence similarity
  • Evolutionary analysis
  • Structure-based approach
  • Protein/gene fusion
  • Protein interaction
  • Family classification by sequence clustering

6
Problems
  • Not all homologous proteins have analogous
    functions
  • Many proteins share promiscuous domains

7
Goal
  • Classify distantly related proteins
  • Classify closely related proteins

8
SVMProt
  • SVMProt classification system
  • Web-based software
  • Goal: use SVM classification to assign a protein
    to a functional family from its primary
    sequence
  • Uses representative proteins of a number of
    functional families and seed proteins of Pfam
    curated protein families
  • URL: http://jing.cz3.nus.edu.sg/cgi-bin/svmprot.cgi

9
Method
  • Database
  • SVMProt processing
  • Example of feature vector
  • Dataset
  • Scoring function and quality measurement

10
Database
  • Data
  • 46 enzyme families from BRENDA
  • G-protein coupled receptors from GPCRDB
  • Nuclear receptors from NucleaRDB
  • 5 families of channels and 1 family of
    transporters from TCDB and LGICdb
  • DNA- and RNA-binding proteins derived from
    SWISS-PROT

11
Processing (1)
  • Feature vector properties
  • Amino acid composition
  • Hydrophobicity
  • Normalized Van der Waals volume
  • Polarity
  • Polarizability
  • Charge
  • Surface tension
  • Secondary structure
  • Solvent accessibility

12
(No Transcript)

13
Dimension of feature vector
From: "Protein function classification via support
vector machine approach", Mathematical
Biosciences 185 (2003) 111-122

14
Processing (2)
  • 3 descriptors (21 elements in total)
  • Composition (C): 3 elements
  • C is the number of amino acids of a particular
    property divided by the total number of amino
    acids
  • Transition (T): 3 elements
  • T characterizes the percent frequency with which
    amino acids of a particular property are followed
    by amino acids of a different property
  • Distribution (D): 15 elements
  • D is the chain length within which the first,
    25%, 50%, 75% and 100% of the amino acids of a
    particular property are located, respectively
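
The three descriptors can be sketched in a few lines of Python. This is a minimal illustration, not the SVMProt implementation: the three-class hydrophobicity grouping, the function name ctd and the choice to round fractional ranks down are all assumptions that the slides do not state.

```python
from itertools import combinations
from math import floor

# Assumed three-class hydrophobicity grouping (illustrative, not from the slides).
HYDROPHOBICITY_CLASSES = {
    1: "RKEDQN",   # polar
    2: "GASTPHY",  # neutral
    3: "CLVIMFW",  # hydrophobic
}

def ctd(sequence, classes=HYDROPHOBICITY_CLASSES):
    """Return (C, T, D) for one property: 3 + 3 + 15 = 21 values for three classes.

    Expects at least two residues, all of them covered by `classes`.
    """
    class_of = {aa: k for k, members in classes.items() for aa in members}
    encoded = [class_of[aa] for aa in sequence]
    n = len(encoded)
    labels = sorted(classes)

    # Composition: percent of residues that fall in each class.
    comp = [100.0 * encoded.count(k) / n for k in labels]

    # Transition: percent of adjacent residue pairs that switch between two classes.
    trans = []
    for a, b in combinations(labels, 2):
        switches = sum(1 for x, y in zip(encoded, encoded[1:]) if {x, y} == {a, b})
        trans.append(100.0 * switches / (n - 1))

    # Distribution: chain position (percent of length) of the first, 25%, 50%,
    # 75% and 100% residue of each class. Fractional ranks are rounded down (assumed).
    dist = []
    for k in labels:
        positions = [i + 1 for i, x in enumerate(encoded) if x == k]
        for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
            if not positions:
                dist.append(0.0)
                continue
            rank = max(1, floor(frac * len(positions)))
            dist.append(100.0 * positions[rank - 1] / n)
    return comp, trans, dist
```

Calling ctd on a sequence returns three lists whose lengths add up to the 21 elements listed above.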

15
Example of feature vector
  • AA: amino acid; sequence length (SL) = 30
  • 16 alanines (A) → n1 = 16; 14 glutamic acids (E)
    → n2 = 14
  • C: n1 × 100 / (n1 + n2) = 53.33,
    n2 × 100 / (n1 + n2) = 46.67
  • T: (A → E, E → A) = 15 transitions
    → (15 / 29) × 100 = 51.72
  • D: (Index(AA) / SL) × 100
  • As → 3.33, 16.67, 40, 66.67, 96.67
  • Es → 6.67, 26.67, 60, 76.67, 100
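
The numbers in this example can be checked with a short script. The 30-residue sequence below is an assumption: the transcript does not give it, and it was picked because it has 16 alanines, 14 glutamic acids and 15 transitions. Rounding fractional ranks down is also an assumption, chosen because it matches the D values quoted above.

```python
from math import floor

# Assumed sequence (not in the transcript): 16 A, 14 E, 15 A/E transitions.
SEQ = "AEAAAEAEEAAAAAEAEEEAAEEAEEEAAE"

def composition(seq, aa):
    """Percent of residues equal to `aa`."""
    return round(100.0 * seq.count(aa) / len(seq), 2)

def transition(seq):
    """Percent of adjacent pairs whose two residues differ."""
    changes = sum(1 for x, y in zip(seq, seq[1:]) if x != y)
    return round(100.0 * changes / (len(seq) - 1), 2)

def distribution(seq, aa):
    """Positions (percent of length) of the first, 25%, 50%, 75%, 100% of `aa`."""
    positions = [i + 1 for i, x in enumerate(seq) if x == aa]
    ranks = [max(1, floor(f * len(positions))) for f in (0.0, 0.25, 0.5, 0.75, 1.0)]
    return [round(100.0 * positions[r - 1] / len(seq), 2) for r in ranks]

c_a, c_e = composition(SEQ, "A"), composition(SEQ, "E")
t_ae = transition(SEQ)
d_a, d_e = distribution(SEQ, "A"), distribution(SEQ, "E")
print(c_a, c_e, t_ae)
print(d_a)
print(d_e)
```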

16
Dataset
  • Training set
  • Positive: all distinct protein members in each
    family
  • Negative: seed proteins of the curated protein
    families in the Pfam database, excluding those
    that belong to the family under study
  • Testing set
  • Positive: all the remaining distinct proteins in
    each functional family
  • Negative: all the remaining representative seed
    proteins in Pfam curated families
  • Independent evaluation set
  • Both positive and negative samples
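
The positive/negative selection for one family can be sketched as follows. The dictionary layout, the function name build_sets and the protein identifiers in the usage lines are hypothetical; the slides do not describe how the data were stored or how the training/testing split was drawn.

```python
def build_sets(family_members, pfam_seeds, target_family):
    """Pick positive and negative samples for one functional family.

    family_members: dict mapping a functional family to its protein ids (hypothetical layout)
    pfam_seeds:     dict mapping a Pfam family to its seed protein ids (hypothetical layout)
    """
    # Positives: distinct members of the family under study.
    positives = sorted(set(family_members[target_family]))
    positive_ids = set(positives)

    # Negatives: Pfam seed proteins that do not belong to the family under study.
    negatives = sorted(
        {p for seeds in pfam_seeds.values() for p in seeds if p not in positive_ids}
    )
    return positives, negatives

# Usage with made-up identifiers:
members = {"EC1": ["P1", "P2", "P2"]}
seeds = {"PF1": ["P1", "P3"], "PF2": ["P4"]}
pos, neg = build_sets(members, seeds, "EC1")
```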

17
Scoring Function Quality Measurement
  • Scoring for SVMProt
  • Reliability index
  • R-value
  • P-value
  • Quality measurement for experiment
  • TP (true positive)
  • TN (true negative)
  • FP (false positive)
  • FN (false negative)
  • Q (overall accuracy)

d: the distance between the position of the
vector of the classified protein and the optimal
separating hyperplane in the hyperspace
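
The overall accuracy Q is taken here as the share of correctly classified samples, (TP + TN) / (TP + TN + FP + FN), expressed as a percentage. A minimal sketch, with a hypothetical function name and made-up counts:

```python
def overall_accuracy(tp, tn, fp, fn):
    """Q = (TP + TN) / (TP + TN + FP + FN), returned as a percentage."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("at least one sample is required")
    return 100.0 * (tp + tn) / total

# Made-up counts for illustration only.
q = overall_accuracy(tp=90, tn=880, fp=20, fn=10)
```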

18
P-value: probability of correct classification
9,932 positive samples, 45,999 negative samples

19
Results and Discussion
  • Q of protein classification ranges from 69.1% to
    99.6% → provide more comprehensive sampling of
    proteins not in a functional class
  • Prediction of distantly related proteins → test
    on 24 randomly selected distantly related
    proteins in 7 families → 14 proteins are
    correctly classified (58.3%)