Understanding of data using Computational Intelligence methods - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Understanding of data using Computational Intelligence methods

1
Understanding of data using Computational
Intelligence methods

Wlodzislaw Duch
Dept. of Informatics, Nicholas Copernicus
University, Torun, Poland
http://www.phys.uni.torun.pl/duch
IEA/AIE Cairns, 17-20.06.2002

2
What am I going to say

Data and CI
What we hope for.
Forms of understanding.
Visualization.
Prototypes.
Logical rules.
Some knowledge discovered.
Expert system for psychometry.
Conclusions, or why am I saying this?

3
Types of Data

Data was precious! Now it is overwhelming ...
Statistical data: clean, numerical, controlled
experiments, vector space model.
Relational data: marketing, finances.
Textual data: Web, NLP, search.
Complex structures: chemistry, economics.
Sequence data: bioinformatics.
Multimedia data: images, video.
Signals: dynamic data, biosignals.
AI data: logical problems, games, behavior

4
Computational Intelligence
[Diagram: Soft computing; Computational Intelligence (Data => Knowledge); Artificial Intelligence]
5
CI and AI definition

Computational Intelligence is concerned with
effectively solving non-algorithmic problems.
This corresponds to all cognitive
processes, including low-level ones (perception).
Artificial Intelligence is a part of CI concerned
with effectively solving non-algorithmic problems
that require systematic reasoning and symbolic
knowledge representation. Roughly, this
corresponds to high-level cognitive processes.

6
Turning data into knowledge

What should CI methods do?
Provide descriptive and predictive non-parametric
models of data.
Classify, approximate, associate,
correlate, and complete patterns.
Discover new categories and interesting
patterns.
Help to visualize multi-dimensional relationships
among data samples.
Make it possible to understand the data in some way.
Facilitate creation of expert systems (ES) and reasoning.

7
Forms of useful knowledge

AI/Machine Learning camp:
Neural nets are black boxes.
Unacceptable! Symbolic rules forever.

But ... knowledge accessible to humans is in:
symbols,
similarity to prototypes,
images, visual representations.
What type of explanation is satisfactory?
Interesting question for cognitive scientists.
Different answers in different fields.

8
Data understanding

Humans remember examples of each category and
refer to such examples as similarity-based or
nearest-neighbors methods do.
Humans create prototypes out of many examples
as Gaussian classifiers, RBF networks, neurofuzzy
systems do.
Logical rules are the highest form of
summarization of knowledge.

Types of explanation:
visualization-based: maps, diagrams, relations
...
exemplar-based: prototypes and similarity
logic-based: symbols and rules.

9
Visualization: dendrograms

All projections (cuboids) on 2D subspaces are
identical, dendrograms do not show the structure.

Normal and malignant lymphocytes.
10
Visualization: 2D projections

All projections (cuboids) on 2D subspaces are
identical, dendrograms do not show the structure.

3-bit parity all 5-bit combinations.
11
Visualization: MDS mapping

Results of pure MDS mapping, with centers of
hierarchical clusters connected.

3-bit parity all 5-bit combinations.
12
Visualization: 3D projections

Only age is continuous, other values are binary

Fine Needle Aspirate of Breast Lesions,
red = malignant, green = benign.
A.J. Walker, S.S. Cross, R.F. Harrison, Lancet 1999, 354, 1518-1521
13
Visualization: MDS mappings

Try to preserve all distances in a 2D nonlinear
mapping.

MDS on large sets using LVQ and relative mapping.
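
The goal of preserving all distances can be made concrete with a stress function that a mapping tries to minimize. The sketch below is only an illustration: it assumes the raw (unnormalized) stress, since the slides do not say which variant was used, and the names `euclid` and `raw_stress` are hypothetical.

```python
import math
from itertools import combinations

def euclid(a, b):
    """Euclidean distance between two points given as sequences."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def raw_stress(high, low):
    """Sum of squared differences between original and mapped pairwise distances."""
    return sum(
        (euclid(high[i], high[j]) - euclid(low[i], low[j])) ** 2
        for i, j in combinations(range(len(high)), 2)
    )

# Three 3D points lying in the z = 0 plane: dropping z preserves every distance.
flat = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 4.0, 0.0)]
perfect = [(x, y) for x, y, _ in flat]
# Collapsing everything onto the x axis distorts two of the three distances.
squashed = [(x, 0.0) for x, _, _ in flat]

print(raw_stress(flat, perfect), raw_stress(flat, squashed))
```

An MDS procedure would move the 2D points to reduce this value; the optimizer itself is left out here.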
14
Prototype-based rules
C-rules (crisp rules) are a special case of F-rules
(fuzzy rules). F-rules are a special case of
P-rules (prototype rules). P-rules have
the form

IF $P = \arg\min_R D(X,R)$ THEN Class(X) = Class(P)

D(X,R) is a dissimilarity (distance) function
that determines decision borders around prototype P.
P-rules are easy to interpret!
IF X = You are most similar to P = Superman THEN You are in the Super-league.
IF X = You are most similar to P = Weakling THEN You are in the Failed-league.
"Similar" may involve different features or a different D(X,P).
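
A P-rule of this form is a nearest-prototype decision. The sketch below assumes Euclidean distance as the default and uses made-up two-feature prototypes; the function `p_rule` and the feature values are hypothetical, chosen only to mirror the Superman/Weakling example.

```python
import math

def euclidean(x, r):
    """Euclidean dissimilarity D(X, R)."""
    return math.sqrt(sum((xi - ri) ** 2 for xi, ri in zip(x, r)))

def p_rule(x, prototypes, dist=euclidean):
    """Return the class of the prototype closest to x.

    prototypes: list of (vector, class_label) pairs.
    """
    _, label = min(prototypes, key=lambda pr: dist(x, pr[0]))
    return label

# Hypothetical prototypes with two illustrative features (strength, speed).
prototypes = [((9.0, 9.0), "Super-league"), ((1.0, 2.0), "Failed-league")]

print(p_rule((8.0, 7.5), prototypes))
print(p_rule((2.0, 1.0), prototypes))
```

Passing a different `dist` function changes the shape of the decision borders without changing the rule itself.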
15
P-rules
Euclidean distance leads to Gaussian fuzzy
membership functions with product as T-norm.
Manhattan function => $\mu(X;P) = \exp\{-|X-P|\}$
Various distance functions lead to different MF.
Ex. data-dependent distance functions for symbolic
data.
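
The link between a distance function and a product of one-dimensional membership functions can be written out explicitly. The per-feature weights $W_i$ below are an assumption (they appear on the next slide for the crisp case); with all $W_i = 1$ the second line reduces to the Manhattan formula above.

```latex
% Squared Euclidean distance: product of Gaussian membership functions
\mu(X;P) = \exp\Big\{-\sum_i W_i (X_i - P_i)^2\Big\}
         = \prod_i \exp\{-W_i (X_i - P_i)^2\}

% Manhattan distance: product of exponential membership functions
\mu(X;P) = \exp\Big\{-\sum_i W_i |X_i - P_i|\Big\}
         = \prod_i \exp\{-W_i |X_i - P_i|\}
```

In both cases the exponential of a sum factorizes, which is why the product plays the role of the T-norm.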
16
Crisp P-rules
New distance functions from information theory =>
interesting MF. Membership functions => new
distance functions, with local D(X,R) for each
cluster.
Crisp logic rules: use the $L_\infty$ norm
$D_\infty(X,P) = \|X-P\|_\infty = \max_i W_i |X_i - P_i|$
$D_\infty(X,P) = \mathrm{const}$ => rectangular contours.
$L_\infty$ (Chebyshev) distance with thresholds $\theta_P$:
IF $D_\infty(X,P) \le \theta_P$ THEN $C(X) = C(P)$
is equivalent to a conjunctive crisp rule
IF $X_1 \in [P_1 - \theta_P/W_1,\ P_1 + \theta_P/W_1] \wedge \dots \wedge X_N \in [P_N - \theta_P/W_N,\ P_N + \theta_P/W_N]$ THEN $C(X) = C(P)$
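
The equivalence between the thresholded Chebyshev distance and a conjunction of interval tests can be checked numerically. This is a sketch under stated assumptions: the prototype, weights, and threshold are invented, and the grid uses values that are exact in binary floating point so that boundary cases compare cleanly.

```python
def chebyshev(x, p, w):
    """Weighted L-infinity distance: max_i W_i * |X_i - P_i|."""
    return max(wi * abs(xi - pi) for xi, pi, wi in zip(x, p, w))

def distance_rule(x, p, w, theta):
    """Prototype form: the rule fires when the distance is within the threshold."""
    return chebyshev(x, p, w) <= theta

def interval_rule(x, p, w, theta):
    """Crisp form: every feature lies in [P_i - theta/W_i, P_i + theta/W_i]."""
    return all(pi - theta / wi <= xi <= pi + theta / wi
               for xi, pi, wi in zip(x, p, w))

# Hypothetical prototype, weights and threshold.
P, W, THETA = (1.0, -0.5), (2.0, 0.5), 1.0

# Compare both forms on a grid of points, including points exactly on the border.
grid = [k * 0.25 for k in range(-12, 13)]
agree = all(
    distance_rule((a, b), P, W, THETA) == interval_rule((a, b), P, W, THETA)
    for a in grid for b in grid
)
print(agree)
```

Here the rule covers the rectangle [0.5, 1.5] x [-2.5, 1.5], which is the rectangular contour mentioned on the slide.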
17
Decision borders
Contours D(P,X) = const and decision borders D(P,X) = D(Q,X).
Euclidean distance from 3 prototypes, one per
class.
Minkowski ($\alpha = 20$) distance from 3 prototypes.
18
P-rules for Wine
$L_\infty$ distance (crisp rules): 15 prototypes kept, 5
errors, f2, f8, f10 removed.
Euclidean distance: 11 prototypes kept, 7 errors.

Manhattan distance:
prototypes kept, 4 errors, f2 removed.
Many other solutions.
Prototypes SV clusters.

19
Complex objects
The vector space concept is not sufficient for
complex objects. A common set of features is
meaningless.
AI: complex objects, states, subproblems.
General approach: it is sufficient to evaluate similarity
D(Oi,Oj).
Compare Oi and Oj: define a transformation T that takes Oi into Oj.
Elementary operators Wk, e.g. substring
substitutions.
Many transformations T connecting a pair of
objects Oi and Oj exist.
Cost of a transformation = sum of the Wk costs.
Similarity: lowest transformation cost.
Bioinformatics: sophisticated similarity functions for
sequences. Dynamic programming finds similarities
in reasonable time.
Use adaptive costs and a
general framework for SBM (similarity-based) methods.
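
For strings, the lowest transformation cost can be found with the classic edit-distance dynamic program. The sketch below is a generic illustration rather than the method used in the talk: the operator set (substitute, insert, delete) and the fixed costs `sub_cost` and `indel_cost` are assumptions, and an adaptive version would tune those costs from data.

```python
def transformation_cost(a, b, sub_cost=1.0, indel_cost=1.0):
    """Lowest total cost of turning string a into string b.

    Allowed elementary operators: substitute one symbol (sub_cost),
    insert or delete one symbol (indel_cost). Matching symbols cost nothing.
    """
    # prev[j] = cost of turning the processed prefix of a into b[:j]
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * indel_cost]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + indel_cost,                           # delete ca
                curr[j - 1] + indel_cost,                       # insert cb
                prev[j - 1] + (0.0 if ca == cb else sub_cost),  # match or substitute
            ))
        prev = curr
    return prev[-1]

print(transformation_cost("tactag", "tagtag"))
print(transformation_cost("tactag", "tagtag", sub_cost=3.0))
```

With an expensive substitution, the cheapest path switches to one deletion plus one insertion, which shows how the operator costs shape the resulting similarity.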
20
Promoters
DNA strings, 57 nucleotides, 53 positive (+) and 53 negative (-) samples
tactagcaatacgcttgcgttcggtggttaagtatgtataatgcgcgggcttgtcgt
Euclidean distance: symbolic s = a, c, t, g
replaced by x = 1, 2, 3, 4
PDF distance: symbolic s = a, c, t, g replaced by
p(s)
21
Connection of CI with AI
AI/CI division is harmful for science! GOFAI:
operators, state transformations and search
techniques are basic tools in AI for solving problems
requiring systematic reasoning. CI methods may
provide useful heuristics for AI and define
metric relations between states, problems or
complex objects.
Example: combinatorial productivity in AI systems
and FSM. Later: decision tree for complex
structures.
22
Electric circuit example
Answering questions in complex domains requires
reasoning. Qualitative behavior of an electric
circuit: 7 variables, but Ohm's law V = I·R, or
Kirchhoff's law Vt = V1 + V2.
Train a NeuroFuzzy system on Ohm's and Kirchhoff's
laws. Without solving equations, answer questions
of the type: If R2 grows, and R1 and Vt are constant,
what will happen with the current I and voltages
V1, V2? (taken from the PDP book, McClelland,
Rumelhart, Hinton)
23
Electric circuit search
AI: create search tree; CI: provide guiding
intuition. Any law of the form A = B·C or A = B + C,
e.g. V = I·R, has 13 true facts and 14 false facts, and
may be learned by an NF system.
Geometrical representation: + increasing, -
decreasing, 0 constant. Find combinations of Vt,
Rt, I, V1, V2, R1, R2 for which all 5
constraints are fulfilled. This holds for 111 cases out of
$3^7 = 2187$.
Search and check if X can be +, 0, -; laws are
not satisfied if F(Vt=0, Rt, I, V1, V2, R1=0,
R2=+) = 0.
24
Heuristic search
If R2 grows, and R1 and Vt are constant, what will
happen with the current I and voltages V1, V2?
We know that R2 = +, R1 = 0, Vt = 0; V1 = ?, V2 = ?,
Rt = ?, I = ?
Take V1 = + and check if F(Vt=0,
Rt=?, I=?, V1=+, V2=?, R1=0, R2=+) > 0.
Since for all V1 = +, 0 and - the function is F() > 0, take
a variable that leads to a unique answer, Rt = +.
A single search path solves the problem. Useful
also in approximate reasoning where only some
conditions are fulfilled.
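
The counts quoted for the circuit (13 true facts per law, 111 consistent states out of 2187) can be reproduced by brute-force enumeration of qualitative changes. This sketch replaces the neurofuzzy function F with an explicit crisp check, which is a swapped-in technique; it also assumes that the five laws are Vt = V1 + V2, Rt = R1 + R2, V1 = I·R1, V2 = I·R2 and Vt = I·Rt for a two-resistor series circuit with positive quantities.

```python
from itertools import product

SIGNS = (-1, 0, 1)  # decreasing, constant, increasing

def consistent(a, b, c):
    """Qualitative check of A = B*C or A = B+C for positive quantities:
    can A change as `a` when B and C change as `b` and `c`?"""
    if b * c < 0:
        return True          # opposite trends: A may go either way or stay
    return a == (b or c)     # otherwise A follows the nonzero trend (or stays)

def circuit_ok(vt, rt, i, v1, v2, r1, r2):
    """All five assumed laws of the two-resistor series circuit."""
    return (consistent(vt, v1, v2) and consistent(rt, r1, r2) and
            consistent(v1, i, r1) and consistent(v2, i, r2) and
            consistent(vt, i, rt))

true_facts = sum(consistent(*s) for s in product(SIGNS, repeat=3))
valid_states = [s for s in product(SIGNS, repeat=7) if circuit_ok(*s)]

def answers(r2, r1, vt):
    """All consistent (rt, i, v1, v2) changes for the given known changes."""
    return [(rt, i, v1, v2)
            for (vt_, rt, i, v1, v2, r1_, r2_) in valid_states
            if (vt_, r1_, r2_) == (vt, r1, r2)]

print(true_facts, len(valid_states))
print(answers(r2=1, r1=0, vt=0))
```

For the question on the slide the enumeration leaves a single state: Rt increases, I decreases, V1 decreases and V2 increases. A guided search reaches the same answer without visiting all 2187 combinations.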
25
Logical rules

Crisp logic rules for continuous x use
linguistic variables (predicate functions).

$s_k(x) \equiv \mathrm{True}\,[X_k \le x \le X'_k]$, for example:
small(x) = True{x | x < 1}
medium(x) = True{x | x $\in$ [1,2]}
large(x) = True{x | x > 2}
Linguistic variables are used in crisp
(propositional, Boolean) logic rules:
IF small-height(X) AND has-hat(X) AND has-beard(X)
THEN (X is a Brownie) ELSE IF ... ELSE ...
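
Linguistic variables of this kind are simply predicates over intervals, and a crisp rule is their conjunction. A minimal sketch, using the thresholds 1 and 2 from the example above; the function names and the sample heights are illustrative only.

```python
def small(x):
    """True for x < 1."""
    return x < 1

def medium(x):
    """True for x in [1, 2]."""
    return 1 <= x <= 2

def large(x):
    """True for x > 2."""
    return x > 2

def is_brownie(height, has_hat, has_beard):
    """IF small-height(X) AND has-hat(X) AND has-beard(X) THEN X is a Brownie."""
    return small(height) and has_hat and has_beard

print(is_brownie(0.6, True, True))
print(is_brownie(1.5, True, True))
```

Because the three intervals partition the axis, exactly one linguistic value is true for any x, which is what produces the sharp, axis-parallel decision borders.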
26
Crisp logic decisions

Crisp logic is based on rectangular membership
functions

True/False values jump from 0 to 1. Step
functions are used for partitioning of the
feature space.
Very simple hyper-rectangular decision borders.
Severe limitation on the expressive power of
crisp logical rules!
27
DT decision borders

Decision trees lead to specific decision borders.
SSV tree on Wine data: proline and flavanoids
content.

Decision tree forests: many decision trees of
similar accuracy, but different selectivity and
specificity.
28
Logical rules - advantages

Logical rules, if simple enough, are preferable.

Rules may expose limitations of black box
solutions.
Only relevant features are used in rules.
Rules may sometimes be more accurate than NN and
other CI methods.
Overfitting is easy to control: rules usually
have a small number of parameters.
Rules forever!? A logical rule about logical
rules is:

29
Logical rules - limitations

Logical rules are preferred but ...

Only one class is predicted, $p(C_i|X,M) = 0$ or $1$;
a black-and-white picture may be inappropriate in
many applications.
A discontinuous cost function allows only
non-gradient optimization.
Sets of rules are unstable: a small change in the
dataset leads to a large change in the structure of
complex sets of rules.
Reliable crisp rules may reject some cases as
unclassified.
Interpretation of crisp rules may be misleading.
Fuzzy rules are not so comprehensible.

30
Rules - choices

Simplicity vs. accuracy.
Confidence vs. rejection rate.

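$p_{++}$ is a hit; $p_{-+}$ a false alarm; $p_{+-}$ is a miss.
Accuracy (overall): $A(M) = p_{++} + p_{--}$
Error rate: $L(M) = p_{-+} + p_{+-}$
Rejection rate: $R(M) = p_{+r} + p_{-r} = 1 - L(M) - A(M)$
Sensitivity: $S_+(M) = p_{+|+} = p_{++}/p_{+}$
Specificity: $S_-(M) = p_{-|-} = p_{--}/p_{-}$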
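
These quantities follow directly from a two-class confusion table with an extra "rejected" column. In the sketch below, `n_xy` is the number of cases of true class x assigned to y (`p` for +, `m` for -, `r` for rejected); the counts in the example are invented for illustration.

```python
def rule_metrics(n_pp, n_pm, n_pr, n_mp, n_mm, n_mr):
    """Accuracy, error, rejection, sensitivity and specificity from counts."""
    n = n_pp + n_pm + n_pr + n_mp + n_mm + n_mr
    p_pp, p_pm, p_pr = n_pp / n, n_pm / n, n_pr / n
    p_mp, p_mm, p_mr = n_mp / n, n_mm / n, n_mr / n
    p_plus = p_pp + p_pm + p_pr     # fraction of truly positive cases
    p_minus = p_mp + p_mm + p_mr    # fraction of truly negative cases
    return {
        "accuracy": p_pp + p_mm,
        "error": p_mp + p_pm,
        "rejection": p_pr + p_mr,
        "sensitivity": p_pp / p_plus,
        "specificity": p_mm / p_minus,
    }

# Hypothetical counts: 100 cases, 10 of them left unclassified.
m = rule_metrics(n_pp=40, n_pm=5, n_pr=5, n_mp=10, n_mm=35, n_mr=5)
print(m)
```

Accuracy, error and rejection always sum to one, so raising the rejection rate is one way to trade coverage for confidence.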
31
Neural networks and rules
[Figure: neural network estimating the probability of Myocardial Infarction, p(MI|X) = 0.7, with input weights and output weights; inputs: Sex, Age, Smoking, ECG ST Elevation, Pain Duration]
32
Knowledge from networks

Simplify networks: force most weights to 0,
quantize remaining parameters, be constructive!

Regularization: a mathematical technique
improving predictive abilities of the network.
Result: MLP2LN neural networks that are
equivalent to logical rules.

33
MLP2LN

Converts MLP neural networks into a network
performing logical operations (LN).

Input layer
Output: one node per class.
Aggregation: better features
Rule units: threshold logic
Linguistic units: windows, filters
34
Learning dynamics
Decision regions shown every 200 training epochs
in x3, x4 coordinates; borders are optimally
placed with wide margins.
35
Neurofuzzy systems
Fuzzy: $\mu(x) = 0, 1$ (no/yes) is replaced by a degree
$\mu(x) \in [0,1]$. Triangular, trapezoidal, Gaussian ...
MF.
MFs in many dimensions.

Feature Space Mapping (FSM) neurofuzzy system.
Neural adaptation, estimation of probability
density distribution (PDF) using a single hidden
layer network (RBF-like) with nodes realizing
separable functions.
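
The membership function shapes named above, and a separable node built as a product of one-dimensional functions, can be sketched as follows. The parameterizations (corner points for the triangle and trapezoid, `exp(-((x - center) / width)**2)` for the Gaussian) are assumptions; the slides do not give the exact forms used in FSM.

```python
import math

def triangular(x, a, b, c):
    """Rises linearly from a to the peak at b, falls to c (requires a < b < c)."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def trapezoidal(x, a, b, c, d):
    """Rises on [a, b], equals 1 on [b, c], falls on [c, d] (requires a < b <= c < d)."""
    if x <= a or x >= d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    if x <= c:
        return 1.0
    return (d - x) / (d - c)

def gaussian(x, center, width):
    """Bell-shaped degree of membership, equal to 1 at the center."""
    return math.exp(-((x - center) / width) ** 2)

def separable(xs, centers, widths):
    """Multi-dimensional node: product of one-dimensional Gaussian MFs."""
    return math.prod(gaussian(x, c, w) for x, c, w in zip(xs, centers, widths))

print(triangular(0.5, 0, 1, 2), trapezoidal(3.5, 0, 1, 3, 4))
print(separable((1.0, 0.0), (0.0, 0.0), (1.0, 2.0)))
```

A separable node can be read feature by feature, which is what makes it convertible into a fuzzy rule with one condition per input.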

36
Heterogeneous systems

Homogeneous systems: one type of building
blocks, same type of decision borders.
Ex: neural networks, SVMs, decision trees, kNNs.
Committees combine many models together, but lead
to complex models that are difficult to
understand.

Discovering the simplest class structures, with their
inductive bias, requires heterogeneous adaptive
systems (HAS). Ockham's razor: simpler systems are
better. HAS examples: NN with many types of
neuron transfer functions. k-NN with different
distance functions. DT with different types of
test criteria.
37
GhostMiner Philosophy

GhostMiner, data mining tools from our lab.
http://www.fqspl.com.pl/ghostminer/
Separate the process of model building and
knowledge discovery from model use =>
GhostMiner Developer and GhostMiner Analyzer.

There is no free lunch: provide different types
of tools for knowledge discovery: decision tree,
neural, neurofuzzy, similarity-based, committees.
Provide tools for visualization of data.
Support the process of knowledge discovery/model
building and evaluating, organizing it into
projects.

38
Recurrence of breast cancer

Data from Institute of Oncology, University
Medical Center, Ljubljana, Yugoslavia.

286 cases, 201 no recurrence (70.3%), 85
recurrence cases (29.7%).
Example record: no-recurrence-events,
40-49, premeno, 25-29, 0-2, ?, 2, left,
right_low, yes
9 nominal features: age (9
bins), menopause, tumor-size (12 bins), nodes
involved (13 bins), node-caps, degree-malignant
(1,2,3), breast, breast quad, radiation.
39
Recurrence of breast cancer

Data from Institute of Oncology, University
Medical Center, Ljubljana, Yugoslavia.

Many systems used, 65-78% accuracy reported.
Single rule:
IF (nodes-involved $\notin [0,2]$ $\wedge$ degree-malignant $= 3$) THEN recurrence, ELSE no-recurrence
76.2% accuracy, only trivial
knowledge in the data: highly malignant breast
cancer involving many nodes is likely to strike
back.
40
Recurrence - comparison.
Method                  10xCV accuracy (%)
MLP2LN, 1 rule          76.2
SSV DT, stable rules    75.7 ± 1.0
k-NN, k=10, Canberra    74.1 ± 1.2
MLP + backprop.         73.5 ± 9.4 (Zarndt)
CART DT                 71.4 ± 5.0 (Zarndt)
FSM, Gaussian nodes     71.7 ± 6.8
Naive Bayes             69.3 ± 10.0 (Zarndt)
Other decision trees    < 70.0
41
Breast cancer diagnosis.

Data from University of Wisconsin Hospital,
Madison, collected by dr. W.H. Wolberg.

699 cases, 9 cell features quantized from 1 to
10: clump thickness, uniformity of cell size,
uniformity of cell shape, marginal adhesion,
single epithelial cell size, bare nuclei, bland
chromatin, normal nucleoli, mitoses.
Task: distinguish benign from malignant cases.
42
Breast cancer rules.

Data from University of Wisconsin Hospital,
Madison, collected by dr. W.H. Wolberg.

Simplest rule from MLP2LN, large regularization:
IF uniformity of cell size < 3 THEN
benign ELSE malignant
Sensitivity = 0.97, Specificity = 0.85
More complex solutions (3 rules) give in 10xCV:
Sensitivity = 0.95, Specificity = 0.96, Accuracy = 0.96
43
Breast cancer comparison.
Method                       10xCV accuracy (%)
k-NN, k=3, Manh              97.0 ± 2.1 (GM)
FSM, neurofuzzy              96.9 ± 1.4 (GM)
Fisher LDA                   96.8
MLP + backprop.              96.7 (Ster, Dobnikar)
LVQ                          96.6 (Ster, Dobnikar)
IncNet (neural)              96.4 ± 2.1 (GM)
Naive Bayes                  96.4
SSV DT, 3 crisp rules        96.0 ± 2.9 (GM)
LDA (linear discriminant)    96.0
Various decision trees       93.5-95.6
44
SSV HAS Wisconsin

Heterogeneous decision tree that searches not
only for logical rules but also for
prototype-based rules.

A single P-rule gives the simplest known description of
this data: IF $\|X - R_{303}\| < 20.27$ THEN
malignant ELSE benign; 18
errors, 97.4% accuracy. Good prototype for
malignant! Simple thresholds, that's what MDs
like the most!
Best L1O accuracy: 98.3% (FSM); best 10xCV
around 97.5% (Naïve Bayes + kernel, SVM).
C4.5 gives 94.7 ± 2.0%.
SSV without distances: 96.4 ± 2.1%.
Several simple rules of similar
accuracy in CV tests exist.
45
Melanoma skin cancer

Collected in the Outpatient Center of Dermatology
in Rzeszów, Poland.
Four types of Melanoma: benign, blue, suspicious,
or malignant.

250 cases, with almost equal class distribution.
Each record in the database has 13 attributes:
asymmetry, border, color (6), diversity (5).
TDS (Total Dermatoscopy Score): a single index.
Goal: hardware scanner for preliminary diagnosis.

46
Melanoma results
Method                       Rules   Training (%)   Test (%)
MLP2LN, crisp rules          4       98.0 all       100
SSV Tree, crisp rules        4       97.5 ± 0.3     100
FSM, rectangular f.          7       95.5 ± 1.0     100
knn + prototype selection    13      97.5 ± 0.0     100
FSM, Gaussian f.             15      93.7 ± 1.0     95 ± 3.6
knn k=1, Manh, 2 features    --      97.4 ± 0.3     100
LERS, rough rules            21      --             96.2
47
Antibiotic activity of pyrimidine compounds.
Pyrimidines: which compound has stronger
antibiotic activity?
Common template, substitutions added at 3
positions, R3, R4 and R5.
27 features taken into account: polarity, size,
hydrogen-bond donor or acceptor, pi-donor or
acceptor, polarizability, sigma effect. Pairs of
chemicals (54 features) are compared: which one
has higher activity? 2788 cases, 5-fold
crossvalidation tests.
48
Antibiotic activity - results.
Pyrimidines: which compound has stronger
antibiotic activity?
Mean Spearman's rank correlation coefficient
used: $-1 < r_s < 1$

Method                    Rank correlation
FSM, 41 Gaussian rules    0.77 ± 0.03
Golem (ILP)               0.68
Linear regression         0.65
CART (decision tree)      0.50
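
Spearman's rank correlation is the Pearson correlation computed on ranks. The sketch below is a plain implementation with average ranks for ties; the sample activity values are invented and are not taken from the pyrimidine data.

```python
import math

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman's r_s: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical true vs. predicted activity orderings for five compounds.
print(spearman([1, 2, 3, 4, 5], [2, 1, 4, 3, 5]))
```

Two swapped neighbouring pairs out of five already lower the coefficient to 0.8, which gives a feel for the scale of the values in the table.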
49
Thyroid screening.

Garavan Institute, Sydney, Australia
15 binary, 6 continuous features.
Training: 93 + 191 + 3488; Validate: 73 + 177 + 3178.
Determine important clinical factors
Calculate prob. of each diagnosis.

50
Thyroid some results.
Accuracy of diagnoses obtained with different
systems.
Method                     Rules/Features   Training (%)   Test (%)
MLP2LN optimized           4/6              99.9           99.36
CART/SSV Decision Trees    3/5              99.8           99.33
Best Backprop MLP          -/21             100            98.5
Naïve Bayes                -/-              97.0           96.1
k-nearest neighbors        -/-              -              93.8
51
Psychometry

MMPI (Minnesota Multiphasic Personality
Inventory) psychometric test.
Printed forms are scanned or computerized version
of the test is used.

Raw data: 550 questions, e.g. "I am getting tired
quickly": Yes - Don't know - No
Results are combined into 10 clinical scales and
4 validity scales using fixed coefficients.
Each scale measures tendencies towards
hypochondria, schizophrenia, psychopathic
deviations, depression, hysteria, paranoia etc.

52
Scanned form
53
Computer input
54
Scales
55
Psychometry

There is no simple correlation between single
values and final diagnosis.
Results are displayed in form of a histogram,
called a psychogram. Interpretation depends on
the experience and skill of an expert, takes into
account correlations between peaks.

Goal: an expert system providing evaluation and
interpretation of MMPI tests at an expert level.
Problem: experts agree only 70% of
the time; alternative diagnoses and personality
changes over time are important.
56
Psychogram
57
Psychometric data

1600 cases for women, same number for men.
27 classes: norm, psychopathic, schizophrenia,
paranoia, neurosis, mania, simulation,
alcoholism, drug addiction, criminal tendencies,
abnormal behavior due to ...

Extraction of logical rules: 14 scales =
features. Define linguistic variables and use
FSM, MLP2LN, SSV, giving about 2-3 rules/class.
58
Psychometric data
Method   Data   N. rules   Accuracy   Gx
C4.5     ?      55         93.0       93.7
         ?      61         92.5       93.1
FSM      ?      69         95.4       97.6
         ?      98         95.9       96.9
10-CV for FSM is 82-85%, for C4.5 is 79-84%.
Input uncertainty Gx around 1.5 (best ROC)
improves FSM results to 90-92%.
59
Psychometric Expert

Probabilities for different classes. For greater
uncertainties more classes are predicted.
Fitting the rules to the conditions:
typically 3-5 conditions per rule; Gaussian
distributions around measured values that fall
into the rule interval are shown in green.
Verbal interpretation of each case, rule and
scale dependent.
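
The effect of Gaussian input uncertainty on a crisp rule can be sketched as the probability mass of the Gaussian that falls inside each rule interval. Everything specific here is an assumption: the independence of the scales (so that condition probabilities multiply), the use of a single `sigma` for all scales, and the example values; the slides do not state how the expert system combines the conditions.

```python
import math

def normal_cdf(z):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def condition_probability(x, a, b, sigma):
    """Probability that a measurement x with Gaussian uncertainty sigma
    truly lies in the rule interval [a, b]."""
    return normal_cdf((b - x) / sigma) - normal_cdf((a - x) / sigma)

def rule_probability(values, intervals, sigma):
    """Probability that all conditions hold, assuming independent scales."""
    p = 1.0
    for x, (a, b) in zip(values, intervals):
        p *= condition_probability(x, a, b, sigma)
    return p

# Hypothetical scale values and rule intervals.
print(condition_probability(50.0, 40.0, 60.0, sigma=2.0))
print(rule_probability([40.0, 50.0], [(40.0, 1000.0), (0.0, 100.0)], sigma=2.0))
```

A value sitting exactly on an interval border satisfies the condition with probability one half, so with larger uncertainty several rules, and hence several classes, receive non-negligible probability.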

60
MMPI probabilities
61
MMPI rules
62
MMPI verbal comments
63
Visualization

Probability of classes versus input uncertainty.
Detailed input probabilities around the measured
values vs. change in a single scale; changes
over time define the patient's trajectory.
Interactive multidimensional scaling: zooming on
the new case to inspect its similarity to other
cases.

64
Class probability/uncertainty
65
Class probability/feature
66
MDS visualization
67
Conclusions

Data understanding is a challenging problem.
Classification rules are frequently only the
first step and may not be the best solution.
Visualization is always helpful.
P-rules may be competitive if complex decision
borders are required, providing different types
of rules.
Understanding of complex objects is possible,
although difficult, using adaptive costs and
distance as least expensive transformations
(action principles in physics).
Great applications are coming!

68
Challenges