Title: Prediction model building and feature selection with SVM in breast cancer diagnosis
1 Prediction model building and feature selection
with SVM in breast cancer diagnosis
- Cheng-Lung Huang, Hung-Chang Liao, Mu-Chen Chen
Expert Systems with Applications 2008
2 Introduction
- Breast cancer is a serious problem for the young
women of Taiwan.
- Almost 64.1% of women with breast cancer are
diagnosed before the age of 50, and 29.3% of women
with breast cancer are diagnosed before the age
of 40.
- However, the causes are still unknown.
3 Introduction
- A study (Ziegler et al., 1993) shows that
fibroadenoma shares some risk factors with breast
cancer.
- HSV-1 (herpes simplex virus type 1)
- EBV (Epstein-Barr virus)
- CMV (cytomegalovirus)
- HPV (human papillomavirus)
- HHV-8 (human herpesvirus-8)
4 Introduction
- DNA viruses are closely related to human cancers
and count among the high-risk factors.
- To examine the relationship between DNA viruses
and breast tumors, this paper uses support vector
machines (SVM) to find the pertinent bioinformatics.
5 Two Important Challenges
- When using SVM, two problems are confronted:
- How to choose the optimal input feature subset
for SVM.
- How to set the best kernel parameters.
- These two problems are crucial because the
feature subset choice influences the appropriate
kernel parameters and vice versa.
6 Feature Selection
- Feature selection is an important issue in
building classification systems.
- It is advantageous to limit the number of input
features in a classifier in order to have a good
predictive and less computationally intensive
model.
- This study used F-score calculation to select
input features.
7 F-Score
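The formula on this slide did not survive extraction. As an assumption, the F-score is written here in its commonly used form (as given by Chen and Lin), where for feature i the symbols \bar{x}_i, \bar{x}_i^{(+)} and \bar{x}_i^{(-)} are the means over the whole, positive and negative data sets, n_+ and n_- are the numbers of positive and negative samples, and x_{k,i}^{(\pm)} is feature i of the k-th positive or negative sample. The notation is illustrative and may differ from the original slide.

```latex
F(i) = \frac{\left(\bar{x}_i^{(+)} - \bar{x}_i\right)^2 + \left(\bar{x}_i^{(-)} - \bar{x}_i\right)^2}
            {\frac{1}{n_+ - 1}\sum_{k=1}^{n_+}\left(x_{k,i}^{(+)} - \bar{x}_i^{(+)}\right)^2
           + \frac{1}{n_- - 1}\sum_{k=1}^{n_-}\left(x_{k,i}^{(-)} - \bar{x}_i^{(-)}\right)^2}
```

A larger F(i) suggests that feature i separates the two classes better when it is considered on its own.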
8 F-Score Algorithm
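The algorithm on this slide was lost in extraction. The following is a minimal sketch of F-score ranking, assuming the standard definition (between-class separation of the means divided by the sum of the within-class sample variances). The function names, the feature names `feat_a` and `feat_b`, and the toy data are hypothetical and are not taken from the study.

```python
from statistics import mean, variance

def f_score(pos, neg):
    """F-score of one feature, given its values in the positive and negative class."""
    overall = mean(pos + neg)
    between = (mean(pos) - overall) ** 2 + (mean(neg) - overall) ** 2
    within = variance(pos) + variance(neg)  # sample variances, 1/(n-1) scaling
    if within == 0:
        # Assumption: a feature that is constant inside both classes is
        # scored 0 when the class means coincide, otherwise infinity.
        return 0.0 if between == 0 else float("inf")
    return between / within

def rank_features(X, y, names):
    """Return (name, F-score) pairs sorted from most to least discriminative."""
    scores = []
    for j, name in enumerate(names):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append((name, f_score(pos, neg)))
    return sorted(scores, key=lambda item: item[1], reverse=True)

# Toy data (hypothetical): rows are samples, columns are binary features.
X = [[1, 0], [1, 1], [0, 0], [1, 1], [0, 1], [0, 0], [1, 0], [0, 1]]
y = [1, 1, 1, 1, 0, 0, 0, 0]
ranking = rank_features(X, y, ["feat_a", "feat_b"])
print(ranking)
```

Feature subsets of increasing size can then be formed by taking the top-ranked features first, which is one plausible reading of how the subsets on the later slides were built.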
9 Parameter Optimization
- To design an SVM, one must choose a kernel
function, set the kernel parameters, and determine
a soft margin constant C.
- The grid algorithm is one approach to finding
the best C and gamma when using the RBF kernel
function.
- This study used grid search to find the best SVM
model parameters.
10 Grid-Search Algorithm
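The algorithm on this slide was lost in extraction. The following is a minimal sketch of a grid search over C and gamma, assuming exponentially spaced candidate values (powers of two). The ranges, the function names, and the stand-in scoring function `toy_accuracy` are assumptions for illustration. In a real run the scoring function would train an RBF-kernel SVM and return its cross-validation accuracy.

```python
from itertools import product
from math import log2

def grid_search(evaluate, log2_c_values, log2_gamma_values):
    """Try every (C, gamma) pair on the grid and keep the best-scoring one.

    `evaluate(C, gamma)` is supplied by the caller and should return a
    score to maximise, for example cross-validation accuracy.
    """
    best = None
    for log2_c, log2_g in product(log2_c_values, log2_gamma_values):
        C, gamma = 2.0 ** log2_c, 2.0 ** log2_g
        score = evaluate(C, gamma)
        if best is None or score > best[0]:
            best = (score, C, gamma)
    return best

def toy_accuracy(C, gamma):
    """Hypothetical stand-in for SVM cross-validation accuracy.

    It peaks at C = 2**3 and gamma = 2**-5 so the search has a known answer.
    """
    return 1.0 - 0.01 * ((log2(C) - 3) ** 2 + (log2(gamma) + 5) ** 2)

# Assumed grid: C in 2**-5 ... 2**15 and gamma in 2**-15 ... 2**3, step 2**2.
best_score, best_C, best_gamma = grid_search(
    toy_accuracy, range(-5, 16, 2), range(-15, 4, 2)
)
print(best_score, best_C, best_gamma)
```

Because every pair is evaluated independently, the loop is easy to parallelise, and a coarse grid can be followed by a finer one around the best pair.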
11 Data collection
- The source of 80 data points (tissue samples):
- 52 specimens of non-familial invasive ductal
breast cancer.
- 28 mammary fibroadenomas.
- (From Chung-Shan Medical University Hospital)
12 Data partition
- The data set is further randomly partitioned into
training and independent testing sets via
stratified 5-fold cross validation.
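A minimal sketch of a stratified 5-fold partition follows, using the class sizes reported for this data set (52 cancer specimens and 28 fibroadenomas). The function name, the label strings, and the random seed are assumptions. The study's actual partitioning procedure is not shown on the slides.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=5, seed=0):
    """Split sample indices into k folds that keep the class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold gets its share.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds

labels = ["cancer"] * 52 + ["fibroadenoma"] * 28
folds = stratified_folds(labels, k=5)

# Each fold serves once as the independent test set; the rest is training data.
for test_idx in folds:
    held_out = set(test_idx)
    train_idx = [i for i in range(len(labels)) if i not in held_out]
    n_cancer = sum(labels[i] == "cancer" for i in test_idx)
    print(len(train_idx), len(test_idx), n_cancer)
```

With 52 and 28 samples, each test fold holds 10 or 11 cancer specimens and 5 or 6 fibroadenomas, so the class ratio in every fold stays close to that of the full data set.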
13 SVM-based parameter optimization and feature
selection
14 The relative feature importance with F-score
15 The relative importance of DNA virus based on the
F-score
16 The five feature subsets based on the F-score
17 Overall training and testing accuracy for each
feature subset
18 Type I and type II errors
- Type I error (the "false positive"): the error
of rejecting the null hypothesis given that it is
actually true.
- Type II error (the "false negative"): the error
of failing to reject the null hypothesis given
that the alternative hypothesis is actually true.
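For a classifier, the two definitions above can be computed from predicted and actual labels. The sketch below assumes that "cancer" is the positive class, so a type I error is a benign sample predicted as cancer and a type II error is a cancer sample predicted as benign. The function name and the toy labels are hypothetical and are not results from the study.

```python
def error_rates(actual, predicted, positive="cancer"):
    """Return type I rate, type II rate and overall accuracy for two classes."""
    pairs = list(zip(actual, predicted))
    false_pos = sum(1 for a, p in pairs if a != positive and p == positive)
    false_neg = sum(1 for a, p in pairs if a == positive and p != positive)
    n_neg = sum(1 for a, _ in pairs if a != positive)
    n_pos = sum(1 for a, _ in pairs if a == positive)
    correct = sum(1 for a, p in pairs if a == p)
    return {
        # Guard against an empty class to avoid dividing by zero.
        "type_I": false_pos / n_neg if n_neg else 0.0,
        "type_II": false_neg / n_pos if n_pos else 0.0,
        "accuracy": correct / len(pairs),
    }

# Toy labels (hypothetical).
actual = ["cancer", "cancer", "cancer", "benign", "benign"]
predicted = ["cancer", "benign", "cancer", "cancer", "benign"]
rates = error_rates(actual, predicted)
print(rates)
```

In a diagnostic setting the two rates usually carry different costs, which is why they are reported separately rather than folded into a single accuracy figure.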
19 Detailed testing accuracy for feature subsets of
size 2 and 3
20 Linear discriminant analysis (LDA)
- Originally developed in 1936 by R.A. Fisher,
discriminant analysis is a classic method of
classification.
- Discriminant analysis can be used only for
classification.
- Linear discriminant analysis finds a linear
transformation ("discriminant function") of the
two predictors, X and Y, that yields a new set of
transformed values that provides more accurate
discrimination than either predictor alone.
- Transformed Target = C1*X + C2*Y
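The coefficients C1 and C2 can be illustrated with Fisher's two-class discriminant, sketched below under the assumption that the direction is the inverse of the pooled within-class scatter matrix applied to the difference of the class means. The function name and the toy points are hypothetical, and the scale of (C1, C2) is arbitrary.

```python
def fisher_lda_2d(class_a, class_b):
    """Return (c1, c2) so that c1*X + c2*Y separates two classes of (X, Y) points."""
    def mean2(points):
        n = len(points)
        return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

    def scatter(points, m):
        sxx = sum((x - m[0]) ** 2 for x, _ in points)
        sxy = sum((x - m[0]) * (y - m[1]) for x, y in points)
        syy = sum((y - m[1]) ** 2 for _, y in points)
        return sxx, sxy, syy

    mean_a, mean_b = mean2(class_a), mean2(class_b)
    scat_a, scat_b = scatter(class_a, mean_a), scatter(class_b, mean_b)
    # Pooled within-class scatter matrix [[sxx, sxy], [sxy, syy]].
    sxx, sxy, syy = (scat_a[i] + scat_b[i] for i in range(3))
    det = sxx * syy - sxy * sxy  # assumed non-zero (predictors not collinear)
    dx, dy = mean_a[0] - mean_b[0], mean_a[1] - mean_b[1]
    # Multiply the 2x2 inverse by the difference of the class means.
    c1 = (syy * dx - sxy * dy) / det
    c2 = (-sxy * dx + sxx * dy) / det
    return c1, c2

# Toy points (hypothetical).
class_a = [(2, 3), (3, 3), (3, 4), (4, 5)]
class_b = [(0, 1), (1, 0), (1, 2), (2, 1)]
c1, c2 = fisher_lda_2d(class_a, class_b)
proj_a = [c1 * x + c2 * y for x, y in class_a]
proj_b = [c1 * x + c2 * y for x, y in class_b]
print(c1, c2, min(proj_a), max(proj_b))
```

On this toy data the transformed values of the two classes do not overlap, so a single threshold on the transformed target is enough to separate them.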
21 The P-level of each attribute for LDA
Selection criterion: P-level value < 0.05
22 Training and testing accuracy for LDA
23 Comparison summary between SVM and LDA
24 Conclusion
- To find the correlation of DNA viruses with
breast tumors and to achieve high classification
accuracy, the F-score is adopted to find the
important features.
- A grid search approach is used to search for the
optimal SVM parameters.
- The results revealed that the SVM-based model has
good performance in diagnosing breast cancer
according to our data set.
25 Conclusion
- The present study's results also show that the
attribute subsets {HSV-1, HHV-8} and {HSV-1, HHV-8, CMV}
can achieve identical high accuracy, with an
average overall hit rate of 86%.
- This study suggests that simultaneously considering
HSV-1 and HHV-8 is feasible; however, considering
only HHV-8 or only HSV-1 is less accurate.
26 Future Work
- The practical obstacle of the SVM-based (as well
as neural network) classification model is its
black-box nature.
- A possible solution for this issue is the use of
SVM rule extraction techniques or the use of
a hybrid SVM model combined with other more
interpretable models.