Fuzzy Machine Learning Methods for Biomedical Data Analysis

1
Fuzzy Machine Learning Methods for Biomedical
Data Analysis
Yanqing Zhang
Department of Computer Science
Georgia State University
Atlanta, GA 30302-5060
yzhang_at_gsu.edu
2
Outline
  • Background
  • Fuzzy Association Rule Mining for Decision
    Support (FARM-DS)
  • FARM-DS on Medical Data
  • FARM-DS on Microarray Expression Data
  • Fuzzy-Granular Gene Selection on Microarray
    Expression Data
  • Conclusion and Future Work

3
Background
  • Theory
  • Computational Intelligence, Granular Computing,
    Fuzzy Sets
  • Knowledge Discovery and Data Mining (KDD)
  • Decision Support System (DS)
  • Rule-Based Reasoning (RBR), Association Rule
    Mining
  • Application
  • Bioinformatics, Medical Informatics, etc.
  • Concern
  • Accuracy
  • Interpretability

4
Outline
  • Background
  • Fuzzy Association Rule Mining for Decision
    Support (FARM-DS)
  • FARM-DS on Medical Data
  • FARM-DS on Microarray Expression Data
  • Fuzzy-Granular Gene Selection on Microarray
    Expression Data
  • Conclusion and Future Work

5
Motivation: dealing with numeric data
  • Traditional association rule mining algorithms
  • If X, then Y
  • Conf = Pr(Y | X), Supp = Pr(X and Y)
  • do not work on numeric data
  • Fuzzy logic
  • Feature transform
  • Fuzzy AR mining
  • (Zadeh, 1965)
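
The support and confidence of a rule "If X, then Y" can be estimated by counting transactions. The sketch below is only an illustration: the toy transactions and the function names are hypothetical and do not come from the slides.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate Pr(Y | X) as Supp(X and Y) / Supp(X).

    Assumes the antecedent occurs in at least one transaction.
    """
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical toy transactions (not taken from the presentation).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

supp = support(transactions, {"bread", "milk"})       # Supp(bread and milk)
conf = confidence(transactions, {"bread"}, {"milk"})  # Pr(milk | bread)
```

Items here are discrete symbols, which is why numeric features need to be transformed (for example into fuzzy intervals) before this kind of counting applies.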

6
Motivation: decision support
  • FARs for classification
  • Accuracy vs. Interpretability
  • Very few works
  • Hu et al. 2002
  • Combinatorial rule explosion
  • Chatterjee et al. 2004
  • Human intervention

7
FARM-DS
  • Target
  • Numeric data
  • Binary classification
  • Effectiveness
  • Accuracy
  • Interpretability
  • Modeling process
  • Training
  • Testing

8
Step 1: Fuzzy Interval Partition
  • 1-in-1-out 0-order TSK model
  • ANFIS for model optimization and parameter
    selection (Jang, 1993)
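
A one-input, one-output zero-order TSK model outputs a weighted average of constant rule consequents, where each weight is the membership of the input in one fuzzy interval. The sketch below assumes Gaussian membership functions and hand-picked parameters; the slides do not state the membership shape, and in FARM-DS the parameters would be tuned by ANFIS rather than fixed by hand.

```python
import math


def gaussian(x, center, sigma):
    """Gaussian membership degree of x in a fuzzy interval (assumed shape)."""
    return math.exp(-0.5 * ((x - center) / sigma) ** 2)


def tsk0_predict(x, rules):
    """Zero-order TSK output for a single input.

    `rules` is a list of (center, sigma, constant) triples, one per fuzzy
    interval. The output is the membership-weighted average of the constants.
    """
    weights = [gaussian(x, c, s) for c, s, _ in rules]
    total = sum(weights)
    return sum(w * k for w, (_, _, k) in zip(weights, rules)) / total


# Hypothetical partition of one feature into "low" and "high" intervals
# with constant consequents -1 and +1.
rules = [(0.0, 1.0, -1.0), (4.0, 1.0, 1.0)]

y_low = tsk0_predict(0.0, rules)   # close to -1
y_mid = tsk0_predict(2.0, rules)   # halfway between the two intervals
y_high = tsk0_predict(4.0, rules)  # close to +1
```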

9
Step 2: Data Abstraction
  • Clustering
  • K-Means
  • Fuzzy C-Means
  • Validation
  • Number of clusters
  • Optimal cluster
  • Silhouette value

[Figure: positive cluster and negative cluster]
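
The silhouette value used for cluster validation can be computed directly from pairwise distances. This is a minimal sketch assuming Euclidean distance and at least two clusters; the toy points and both labelings are hypothetical. A higher mean silhouette indicates a better clustering, which is one way to choose the number of clusters.

```python
import math


def silhouette_values(points, labels):
    """Silhouette s(i) = (b - a) / max(a, b) for every point.

    a: mean distance to the other members of the point's own cluster.
    b: smallest mean distance to the members of any other cluster.
    Points in singleton clusters get 0 by convention.
    """
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    values = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            values.append(0.0)
            continue
        # The point itself contributes distance 0, so divide by len - 1.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for key, other in clusters.items()
            if key != label
        )
        values.append((b - a) / max(a, b))
    return values


# Hypothetical 2-D samples forming two well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
good = silhouette_values(points, [0, 0, 1, 1])
bad = silhouette_values(points, [0, 1, 0, 1])
mean_good = sum(good) / len(good)
mean_bad = sum(bad) / len(bad)
```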
10
Step 3: Generating Fuzzy Discrete Transactions
  • Project the center of each cluster on each
    feature
  • Create transactions
  • For a positive cluster, the label 1 is inserted
  • For a negative cluster, the label -1 is inserted
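
One possible reading of Step 3 is sketched below: each cluster center is projected on every feature, the feature value is replaced by the fuzzy interval in which it has the highest membership, and the class label is appended. The interval parameters, the Gaussian membership shape, the max-membership choice, and the item naming are all assumptions made for illustration.

```python
import math

# Hypothetical fuzzy intervals per feature: label -> (center, sigma).
# In FARM-DS these would come from the fuzzy interval partition of Step 1.
INTERVALS = {
    "f1": {"low": (0.0, 1.0), "high": (4.0, 1.0)},
    "f2": {"low": (10.0, 2.0), "high": (20.0, 2.0)},
}


def membership(x, center, sigma):
    """Gaussian membership degree (assumed shape)."""
    return math.exp(-0.5 * ((x - center) / sigma) ** 2)


def to_transaction(center, class_label):
    """Turn one cluster center into a discrete transaction."""
    items = []
    for feature, value in center.items():
        intervals = INTERVALS[feature]
        best = max(intervals, key=lambda name: membership(value, *intervals[name]))
        items.append(f"{feature}_{best}")
    items.append(f"y={class_label}")
    return items


# Hypothetical cluster centers: one positive cluster, two negative clusters.
positive_centers = [{"f1": 3.6, "f2": 11.0}]
negative_centers = [{"f1": 0.4, "f2": 19.0}, {"f1": 0.9, "f2": 12.5}]

transactions = (
    [to_transaction(c, 1) for c in positive_centers]
    + [to_transaction(c, -1) for c in negative_centers]
)
```

With three clusters there are exactly three transactions, regardless of how many raw samples the clusters summarize.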

11
Step 3 - example
  • 5-2 3 transactions
  • 1 f1_1
  • 1 f1_1
  • 1 f1_1

[Figure: axes f1 and f2]
  • Avoids combinatorial rule explosion
  • The number of distinct transactions is determined by
    the number of clusters

12
Step 4: Association Rule Mining
  • Association rule mining on fuzzy discrete
    transactions
  • Traditional Apriori algorithm (Agrawal and
    Srikant 1994)
  • If f1 is low, f2 is high, ..., fh is low, then
    y = 1/-1
  • Rule pruning
  • For a pair of rules A and B, if B is more
    specific than A (that is, the conditions of A are
    included in those of B) and B has the same support
    value as A, then A is eliminated.
  • A: If f1 is low, then y = 1, sup = 50
  • B: If f1 is low and f2 is high, then y = 1, sup = 50
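
The pruning criterion can be sketched as a filter over mined rules. The rule representation (antecedent set, consequent, support) is hypothetical, and requiring the same consequent is an assumption based on the A/B example, where both rules conclude the same class.

```python
def prune(rules):
    """Drop every rule A for which a more specific rule B exists with the
    same consequent and the same support.

    Each rule is a tuple (antecedent, consequent, support), where the
    antecedent is a frozenset of items such as "f1_low".
    """
    kept = []
    for a in rules:
        dominated = any(
            a[0] < b[0] and a[1] == b[1] and a[2] == b[2]  # a[0] < b[0]: proper subset
            for b in rules
        )
        if not dominated:
            kept.append(a)
    return kept


# Rules A and B mirror the slide example; rule C is a hypothetical extra.
rule_a = (frozenset({"f1_low"}), 1, 50)
rule_b = (frozenset({"f1_low", "f2_high"}), 1, 50)
rule_c = (frozenset({"f2_low"}), -1, 30)

pruned = prune([rule_a, rule_b, rule_c])
```

Rule A is removed because B covers the same share of transactions while stating more conditions, so A carries no extra information.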

13
Testing Phase
14
Adaptive FARM-DS
  • Train
  • Fuzzy intervals partition
  • Data abstraction
  • Generate fuzzy discrete transactions
  • AR mining
  • Test

He, et al. 2006a, IJDMB
15
Outline
  • Background
  • Fuzzy Association Rule Mining for Decision
    Support (FARM-DS)
  • FARM-DS on Medical Data
  • FARM-DS on Microarray Expression Data
  • Fuzzy-Granular Gene Selection on Microarray
    Expression Data
  • Conclusion and Future Work

16
Empirical Studies
  • Classification algorithms
  • C4.5 decision trees (Quinlan, 1993)
  • Support vector machines (Vapnik, 1995)
  • FARM-DS (He, et al. 2006a, IJDMB)
  • Accuracy Estimation
  • 5-fold cross-validation
  • Interpretability

17
Evaluation metrics
  • Accuracy
  • Classification Error
  • Area under ROC curve (future work)
  • Interpretability
  • Rule numbers
  • Average rule lengths

Bradley, 1997
18
Datasets
Merz, et al. UCI repository of machine learning
databases, 1998
19
Result analysis on Accuracy
  • FARM-DS, SVM > C4.5
  • SVM2 and C4.5 results from (Bennett et al. 1997)

20
Result analysis on Interpretability
  • SVM, high accuracy, hard to interpret
  • C4.5, low accuracy, easy to interpret
  • FARM-DS, high accuracy, easy to interpret

21
Interpretability (1)
  • FARs extracted by FARM-DS are short and compact,
    and hence, easy to understand.
  • 22 positive rules and 8 negative rules are
    extracted.
  • On average,
  • the length of a positive rule is 2.6,
  • the length of a negative rule is 4.3,
  • and every sample activates
  • 3.3 positive rules and
  • 5.6 negative rules.

22
Interpretability (2)
  • FARs may help human experts to correct the
    wrongly classified samples.

23
Interpretability (3)
  • The larger support of the negative rules may help
    human experts to make final correct decisions and
    find inherent disease-resulting mechanisms.

24
Interpretability (4)
  • FARs are helpful to select important features.
  • Higher activation frequency means more important
    feature

25
Outline
  • Background
  • Fuzzy Association Rule Mining for Decision
    Support (FARM-DS)
  • FARM-DS on Medical Data
  • FARM-DS on Microarray Expression Data
  • Fuzzy-Granular Gene Selection on Microarray
    Expression Data
  • Conclusion and Future Work

26
Microarray Expression Data
  • Extremely high dimensionality
  • Gene selection
  • Cancer classification
  • Rule-based reasoning

27
Empirical Studies
  • Rule-Based Reasoning/Classification
  • CART for decision tree modeling (Breiman, et al.
    1984)
  • ANFIS for fuzzy neural network modeling (Jang,
    1993)
  • FARM-DS (He, et al. 2006a, IJDMB)

28
Evaluation metrics
  • Accuracy
  • Classification Error
  • Area under ROC curve
  • Accuracy Estimation
  • Leave-one-out cross validation
  • Interpretability
  • Rule numbers
  • Average rule lengths

Bradley, 1997
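
The area under the ROC curve can be computed without drawing the curve, using its equivalence to the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one. The sketch below assumes labels in {1, -1} as used elsewhere in the slides; the scores are hypothetical.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Counts the fraction of (positive, negative) pairs in which the positive
    sample has the higher score; ties count as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == -1]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical classifier scores for five samples.
scores = [0.9, 0.8, 0.4, 0.35, 0.1]
labels = [1, 1, -1, 1, -1]
area = auc(scores, labels)
```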
29
AML/ALL leukemia dataset
 
Tang, et al. 2006
30
Result analysis: AML/ALL leukemia dataset
  • Higher accuracy than CART
  • Easier to interpret than ANFIS

31
Rules extracted by FARM-DS: AML/ALL leukemia
dataset
  • IF
  • gene2 (Y12670),
  • gene3 (D14659) and
  • gene5 (M80254) are down-regulated,
  • THEN the tissue is ALL (-1)

32
Prostate cancer dataset
 
Tang, et al. 2006
33
Result analysis: prostate cancer dataset
  • Higher accuracy than CART
  • Easier to interpret than ANFIS

34
Rules extracted by FARM-DS: prostate cancer
dataset
35
Outline
  • Background
  • Fuzzy Association Rule Mining for Decision
    Support (FARM-DS)
  • FARM-DS on Medical Data
  • FARM-DS on Microarray Expression Data
  • Fuzzy-Granular Gene Selection on Microarray
    Expression Data
  • Conclusion and Future Work

36
Gene Selection and Cancer Classification on
Microarray Expression Data
  • Extremely high dimensionality
  • AML/ALL leukemia dataset: 72 samples × 7129 genes
  • no more than 10 relevant genes (Golub, et al.
    1999)
  • Gene selection
  • accurate classification
  • helpful for cancer study

37
Gene Categorization and Gene Ranking
  • Informative genes
  • Redundant genes
  • Irrelevant genes
  • Noisy genes

38
Information Loss
  • Noise
  • Overfitting themselves
  • Complementary to redundant/irrelevant genes
  • Conflict with informative genes
  • Imbalanced gene selection
  • Inflexibility

How to decrease information loss?
Granulation!
39
Coarse Granulation with Relevance Indexes
  • Target: remove irrelevant genes

[Figure: three panels labeled imbalance, imbalance, balance]
  • Target: tune thresholds to select genes in balance

40
Fine Granulation with Fuzzy C-Means Clustering
  • clustering in the training samples space
  • genes with similar expression patterns have
    similar functions
  • a gene may have multiple functions (Fuzzy works
    here!)
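
The reason fuzzy clustering lets a gene belong to several clusters is visible in the standard Fuzzy C-Means membership formula, which assigns each point a degree of membership in every cluster. The sketch below shows only that membership step, not the full iterative algorithm; the gene profile, the cluster centers, and the fuzzifier value m = 2 are hypothetical.

```python
import math


def fcm_memberships(point, centers, m=2.0):
    """Fuzzy C-Means membership of one point in each cluster.

    u_i = 1 / sum_j (d_i / d_j) ** (2 / (m - 1)), where d_i is the Euclidean
    distance to center i and m > 1 is the fuzzifier.
    """
    dists = [math.dist(point, c) for c in centers]
    if 0.0 in dists:
        # The point coincides with a center: full membership there.
        hits = dists.count(0.0)
        return [1.0 / hits if d == 0.0 else 0.0 for d in dists]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((di / dj) ** power for dj in dists) for di in dists]


# A hypothetical gene described by its expression in three training samples,
# and two hypothetical cluster centers in the same sample space.
gene = (1.0, 2.0, 1.0)
centers = [(0.0, 2.0, 1.0), (3.0, 2.0, 1.0)]
u = fcm_memberships(gene, centers)
```

The memberships sum to 1, so a gene sitting between two expression patterns contributes to both clusters instead of being forced into one.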

41
Conquer with correlation-based Ranking
  • Lower-ranked genes are removed as redundant genes

42
Aggregation with Data Fusion
  • Pick up genes from different clusters in balance
  • An informative gene is more likely to survive
  • (due to fuzzy clustering)

43
[Flowchart]
Original Gene Set
  -> Relevance index-based pre-filtering
  -> Relevant Gene Set
  -> Fuzzy C-Means Clustering
  -> Gene Cluster 1, Gene Cluster 2, ..., Gene Cluster K
  -> Correlation-based Gene Ranking 1, 2, ..., K (one per cluster)
  -> Final Gene Set
44
Empirical Study
  • Comparison
  • Signal to Noise (S2N) (Furey, et al. 2000)
  • Fuzzy-Granular S2N
  • Fisher Criterion (FC) (Pavlidis, et al. 2001)
  • Fuzzy-Granular FC
  • T-Statistics (TS) (Duan, et al. 2004)
  • Fuzzy-Granular TS
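
The three baseline ranking criteria are commonly defined per gene from the class-wise mean and standard deviation of its expression values. The sketch below uses those common textbook forms with sample standard deviations; the slides do not spell out the formulas, so the exact variants in the cited papers may differ. The expression values are hypothetical.

```python
import statistics


def class_stats(values, labels):
    """Mean, sample standard deviation and size for each class (1 and -1)."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == -1]
    return (
        statistics.mean(pos), statistics.stdev(pos), len(pos),
        statistics.mean(neg), statistics.stdev(neg), len(neg),
    )


def s2n(values, labels):
    """Signal to noise: (mean+ - mean-) / (std+ + std-)."""
    mp, sp, _, mn, sn, _ = class_stats(values, labels)
    return (mp - mn) / (sp + sn)


def fisher_criterion(values, labels):
    """Fisher criterion: (mean+ - mean-)^2 / (std+^2 + std-^2)."""
    mp, sp, _, mn, sn, _ = class_stats(values, labels)
    return (mp - mn) ** 2 / (sp ** 2 + sn ** 2)


def t_statistic(values, labels):
    """Two-sample t-statistic with unequal variances."""
    mp, sp, n_pos, mn, sn, n_neg = class_stats(values, labels)
    return (mp - mn) / (sp ** 2 / n_pos + sn ** 2 / n_neg) ** 0.5


# Hypothetical expression values of one gene across six samples.
expression = [5.0, 7.0, 6.0, 1.0, 3.0, 2.0]
labels = [1, 1, 1, -1, -1, -1]

score_s2n = s2n(expression, labels)
score_fc = fisher_criterion(expression, labels)
score_ts = t_statistic(expression, labels)
```

Genes would then be ranked by the absolute value of the chosen score, either over the whole gene set (the baselines) or within each cluster (the fuzzy-granular variants).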

45
Evaluation Methods
  • Metrics
  • Accuracy
  • Sensitivity
  • Specificity
  • Area under ROC curve
  • Estimation
  • Leave-1-out CV
  • .632 bootstrapping
  • .632 perf = 0.368 × training perf + 0.632 ×
    testing perf
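
Sensitivity, specificity and the .632 bootstrap estimate can be sketched as follows. Labels in {1, -1} follow the convention used elsewhere in the slides; the predictions and performance figures are hypothetical. In bootstrapping, the training set is drawn with replacement and the testing performance is measured on the samples that were not drawn.

```python
def sensitivity_specificity(predicted, actual):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    pairs = list(zip(predicted, actual))
    tp = sum(p == 1 and a == 1 for p, a in pairs)
    fn = sum(p == -1 and a == 1 for p, a in pairs)
    tn = sum(p == -1 and a == -1 for p, a in pairs)
    fp = sum(p == 1 and a == -1 for p, a in pairs)
    return tp / (tp + fn), tn / (tn + fp)


def bootstrap_632(training_perf, testing_perf):
    """.632 bootstrap estimate: 0.368 * training + 0.632 * testing."""
    return 0.368 * training_perf + 0.632 * testing_perf


# Hypothetical predictions for six samples.
predicted = [1, 1, -1, -1, 1, -1]
actual = [1, 1, 1, -1, -1, -1]
sens, spec = sensitivity_specificity(predicted, actual)

# Hypothetical accuracies on the bootstrap training and testing samples.
estimate = bootstrap_632(1.0, 0.8)
```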

46
prostate cancer dataset
 
47
Result analysis: prostate cancer dataset
48
Colon cancer dataset
 
49
Result analysis: colon cancer dataset
50
Conclusion
  • High-level data abstraction
  • data clustering techniques
  • Quantitative data transformed to fuzzy discrete
    transactions
  • Fuzzy interval partition
  • Apriori algorithm for AR mining
  • Strong decision support for biomedical study
  • High accuracy and easy to interpret
  • More accurate cancer classification
  • Eliminate irrelevant/redundant genes to decrease
    noise
  • Select informative genes in balance

51
Future Work
  • Applying FARM-DS to other biomedical applications
  • Integrating more intelligent data analysis
    techniques
  • Cloud-computing-based fuzzy data mining
    algorithms for big data mining
  • GPU-based fuzzy data mining algorithms for big
    data mining

52
References
  • [1] Y.C. He, Y.C. Tang, Y.-Q. Zhang and R.
    Sunderraman, "Mining Fuzzy Association Rules from
    Microarray Gene Expression Data for Leukemia
    Classification," Proc. of International
    Conference on Granular Computing (GrC-IEEE 2006),
    Atlanta, pp. 461-465, May 10-12, 2006.
  • [2] Y.C. He, Y.C. Tang, Y.-Q. Zhang and R.
    Sunderraman, "Adaptive Fuzzy Association Rule
    Mining for Effective Decision Support in
    Biomedical Applications," International Journal
    of Data Mining and Bioinformatics, Vol. 1, No. 1,
    pp. 3-18, 2006.
  • [3] Y.C. He, Y.C. Tang, Y.-Q. Zhang and R.
    Sunderraman, "Fuzzy-Granular Gene Selection from
    Microarray Expression Data," Proc. of DMB2006 in
    conjunction with IEEE-ICDM2006, Hong Kong, Dec.
    18, 2006 (accepted).
  • [4] Y.C. He, Y.C. Tang, Y.-Q. Zhang and R.
    Sunderraman, "Fuzzy-Granular Methods for
    Identifying Marker Genes from Microarray
    Expression Data," in Computational Intelligence for
    Bioinformatics, Gary B. Fogel, David Corne, and
    Yi Pan (eds.), IEEE Press, 2007.

53
Acknowledgments
  • Thanks go to
  • Dr. Yuchun Tang
  • Dr. Yuanchen He
  • for their hard work on this research project.

54
Questions? Comments?