Title: SVMProt: web-based support vector machine software for functional classification of a protein from its primary sequence
1 SVM-Prot: web-based support vector machine software for functional classification of a protein from its primary sequence
- Nucleic Acids Research, 2003
- Speaker: yfhuang_at_CSIE.NTU
2 Outline
- Introduction
- Method
- Result & Discussion
- Reference
- Extension Reading
3 Introduction
- Brief introduction
- Currently developed approaches/methods
- Problem
- Goal
4 Brief Introduction
- Protein function prediction via functional family prediction
- Based on protein primary sequence and physicochemical properties
- Using SVM to predict the functional family
5 Various methods
- Sequence similarity
- Evolutionary analysis
- Structure-based approach
- Protein/gene fusion
- Protein interaction
- Family classification by sequence clustering
6 Problems
- Not all homologous proteins have analogous functions
- Many proteins share promiscuous domains
7 Goal
- Distantly-related proteins
- Closely-related proteins
8 SVMProt
- SVMProt classification system
- Web-based software
- Target: use SVM classification to classify a protein into a functional family from its primary sequence
- Trained from representative proteins of a number of functional families and seed proteins of Pfam curated protein families
- URL: http://jing.cz3.nus.edu.sg/cgi-bin/svmprot.cgi
9 Method
- Database
- SVMProt processing
- Example of feature vector
- Dataset
- Scoring function & quality measurement
10 Database
- Data
- 46 enzyme families from BRENDA
- G-protein coupled receptors from GPCRDB
- Nuclear receptors from NucleaRDB
- 5 families of channels and 1 family of transporters from TCDB and LGICdb
- DNA- and RNA-binding proteins derived from SWISS-PROT
11 Processing (1)
- Feature vector properties
- Amino acid composition
- Hydrophobicity
- Normalized Van der Waals volume
- Polarity
- Polarizability
- Charge
- Surface tension
- Secondary structure
- Solvent accessibility
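As a rough illustration of how one of the properties above can turn a sequence into a string of class labels, the sketch below encodes residues by hydrophobicity. The three-group split (polar, neutral, hydrophobic) is the one commonly used with composition/transition/distribution descriptors and is an assumption here, since the slides do not list the groups. The names `HYDROPHOBICITY_GROUPS` and `encode_by_property` are hypothetical.

```python
# Assumed three-group hydrophobicity split (not given on the slides):
# 1 = polar, 2 = neutral, 3 = hydrophobic.
HYDROPHOBICITY_GROUPS = {
    1: "RKEDQN",
    2: "GASTPHY",
    3: "CLVIMFW",
}

# Invert the table: residue letter -> class label.
RESIDUE_TO_CLASS = {
    residue: label
    for label, residues in HYDROPHOBICITY_GROUPS.items()
    for residue in residues
}


def encode_by_property(sequence):
    """Map each residue of a protein sequence to its property class (1, 2 or 3).

    Residues outside the 20 standard amino acids raise a ValueError.
    """
    labels = []
    for residue in sequence.upper():
        if residue not in RESIDUE_TO_CLASS:
            raise ValueError(f"unknown residue: {residue!r}")
        labels.append(RESIDUE_TO_CLASS[residue])
    return labels


# Hypothetical 10-residue fragment, used only to show the call.
print(encode_by_property("MKTAYIAKQR"))
```

The same lookup-table pattern would apply to the other properties, each with its own grouping of the 20 amino acids.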
13 Dimension of feature vector
From: Protein function classification via support vector machine approach, Mathematical Biosciences 185 (2003) 111-122
14 Processing (2)
- 3 descriptors (21 elements)
- Composition (C) - 3
- C is the number of amino acids of a particular property divided by the total number of amino acids
- Transition (T) - 3
- T characterizes the percent frequency with which amino acids of a particular property are followed by amino acids of a different property
- Distribution (D) - 15
- D measures the chain length within which the first, 25%, 50%, 75% and 100% of the amino acids of a particular property are located, respectively
15 Example of feature vector
AA: amino acid; sequence length (SL) = 30
16 alanines → n1 = 16; 14 glutamic acids → n2 = 14
C: n1 × 100 / (n1 + n2) = 53.33, n2 × 100 / (n1 + n2) = 46.67
T: transitions A → E or E → A = 15 → (15 / 29) × 100 = 51.72
D: (Index(AA) / SL) × 100
As → 3.33, 16.67, 40, 66.67, 96.67
Es → 6.67, 26.67, 60, 76.67, 100
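The arithmetic in this example can be checked with a short sketch. The 30-residue sequence below is an assumption: the slide gives only the counts and the resulting numbers, so the string was chosen to have 16 A and 14 E and to be consistent with them. Rounding fractional ranks down in the distribution descriptor is also an assumption, and the function names are illustrative.

```python
from itertools import combinations


def composition(labels, classes):
    """C: percentage of positions that carry each class."""
    return [100.0 * labels.count(c) / len(labels) for c in classes]


def transition(labels, classes):
    """T: percentage of adjacent pairs that switch between two classes (either direction)."""
    pairs = list(zip(labels, labels[1:]))
    result = []
    for a, b in combinations(classes, 2):
        switches = sum(1 for x, y in pairs if {x, y} == {a, b})
        result.append(100.0 * switches / len(pairs))
    return result


def distribution(labels, classes):
    """D: chain position (% of length) of the first, 25%, 50%, 75% and 100% residue of each class."""
    result = []
    for c in classes:
        positions = [i + 1 for i, x in enumerate(labels) if x == c]  # 1-based indices
        for quarter in range(5):
            if not positions:
                result.append(0.0)
                continue
            # Rank of the residue to report; fractional ranks are rounded down
            # (assumption), and the 0% entry reports the first residue.
            rank = max(1, len(positions) * quarter // 4)
            result.append(100.0 * positions[rank - 1] / len(labels))
    return result


# Assumed sequence: 16 alanines (A) and 14 glutamic acids (E), length 30.
seq = "AEAAAEAEEAAAAAEAEEEAAEEAEEEAAE"

C = [round(v, 2) for v in composition(seq, "AE")]
T = [round(v, 2) for v in transition(seq, "AE")]
D = [round(v, 2) for v in distribution(seq, "AE")]
print(C)  # [53.33, 46.67]
print(T)  # [51.72]
print(D)  # [3.33, 16.67, 40.0, 66.67, 96.67, 6.67, 26.67, 60.0, 76.67, 100.0]
```

With three property classes instead of two letters, the same functions would return 3 composition values, 3 transition values and 15 distribution values, i.e. the 21 elements per property.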
16 Dataset
- Training set
- Positive: all distinct protein members in each family
- Negative: from seed proteins of the curated protein families in the Pfam database, excluding those that belong to the family under study
- Testing set
- Positive: all the remaining distinct proteins in each functional family
- Negative: all the remaining representative seed proteins in Pfam curated families
- Independent evaluation set (used to evaluate the classifier)
- Both positive and negative samples
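The set logic behind this split can be sketched as follows. This is only an illustration under stated assumptions: proteins are identified by plain string IDs, exclusion is done per protein, and the choice of which proteins go into training is supplied by the caller, because the slides do not say how that selection is made. The function name and all IDs are hypothetical.

```python
def split_family_dataset(family_proteins, pfam_seed_proteins,
                         train_positive, train_negative):
    """Return (training, testing) lists of (protein_id, label) pairs for one family.

    Labels: +1 for members of the family under study, -1 for Pfam seed proteins.
    """
    positives = set(family_proteins)
    # Negatives: Pfam seed proteins that do not belong to the family under study.
    negatives = set(pfam_seed_proteins) - positives

    train_pos = positives & set(train_positive)
    train_neg = negatives & set(train_negative)

    training = ([(p, +1) for p in sorted(train_pos)]
                + [(p, -1) for p in sorted(train_neg)])
    # Testing set: everything that remains after the training selection.
    testing = ([(p, +1) for p in sorted(positives - train_pos)]
               + [(p, -1) for p in sorted(negatives - train_neg)])
    return training, testing


# Hypothetical IDs, only to show the call.
family = ["F1", "F2", "F3"]
pfam_seeds = ["S1", "S2", "S3", "F2"]  # "F2" is also a family member, so it is excluded
training, testing = split_family_dataset(
    family, pfam_seeds, train_positive=["F1", "F2"], train_negative=["S1"]
)
print(training)
print(testing)
```

An independent evaluation set would be held out in the same way before this split, so that it shares no proteins with either the training or the testing set.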
17 Scoring Function & Quality Measurement
- Scoring for SVMProt
- Reliability index
- R-value
- P-value
- Quality measurement for experiments
- TP (true positive)
- TN (true negative)
- FP (false positive)
- FN (false negative)
- Q (overall accuracy)
d: the distance between the position of the vector of the classified protein and the optimal separating hyperplane in the hyperspace
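The quality measures listed above can be computed from predicted and true labels as in the sketch below. It assumes labels of +1 (in the family) and -1 (not in the family) and takes Q to be the usual overall accuracy, (TP + TN) / (TP + TN + FP + FN), expressed as a percentage. The function names are illustrative.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP and FN for labels +1 (positive) and -1 (negative)."""
    tp = tn = fp = fn = 0
    for truth, guess in zip(y_true, y_pred):
        if truth == 1 and guess == 1:
            tp += 1
        elif truth == -1 and guess == -1:
            tn += 1
        elif truth == -1 and guess == 1:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def overall_accuracy(tp, tn, fp, fn):
    """Q: percentage of all samples that were classified correctly."""
    return 100.0 * (tp + tn) / (tp + tn + fp + fn)


# Hypothetical labels, only to show the call.
y_true = [1, 1, 1, -1, -1]
y_pred = [1, 1, -1, -1, 1]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(tp, tn, fp, fn)                    # 2 1 1 1
print(overall_accuracy(tp, tn, fp, fn))  # 60.0
```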
18 P-value: probability of correct classification
9,932 positive samples and 45,999 negative samples
19 Result & Discussion
- Q of protein classification ranges from 69.1% to 99.6% → provides more comprehensive sampling of proteins not in a functional class
- Prediction of distantly related proteins → tested on 24 randomly selected distantly related proteins in 7 families → 14 proteins are correctly classified (58.3%)