Understanding of data using Computational Intelligence methods

- Wlodzislaw Duch
- Dept. of Informatics, Nicholas Copernicus University, Torun, Poland - http://www.phys.uni.torun.pl/duch
- IEA/AIE Cairns, 17-20.06.2002

What am I going to say

- Data and CI
- What we hope for.
- Forms of understanding.
- Visualization.
- Prototypes.
- Logical rules.
- Some knowledge discovered.
- Expert system for psychometry.
- Conclusions, or why am I saying this?

Types of Data

- Data was precious! Now it is overwhelming ...
- Statistical data: clean, numerical, controlled experiments, vector space model.
- Relational data: marketing, finances.
- Textual data: Web, NLP, search.
- Complex structures: chemistry, economics.
- Sequence data: bioinformatics.
- Multimedia data: images, video.
- Signals: dynamic data, biosignals.
- AI data: logical problems, games, behavior.

Computational Intelligence

[Diagram: Soft computing; Computational Intelligence: Data => Knowledge; Artificial Intelligence]

CI AI definition

- Computational Intelligence is concerned with solving effectively non-algorithmic problems. This corresponds to all cognitive processes, including low-level ones (perception).
- Artificial Intelligence is a part of CI concerned with solving effectively non-algorithmic problems requiring systematic reasoning and symbolic knowledge representation. Roughly this corresponds to high-level cognitive processes.

Turning data into knowledge

- What should CI methods do?
- Provide descriptive and predictive non-parametric models of data.
- Allow to classify, approximate, associate, correlate, complete patterns.
- Allow to discover new categories and interesting patterns.
- Help to visualize multi-dimensional relationships among data samples.
- Allow to understand the data in some way.
- Facilitate creation of expert systems (ES) and reasoning.

Forms of useful knowledge

- AI/Machine Learning camp:
- Neural nets are black boxes.
- Unacceptable! Symbolic rules forever.

- But ... knowledge accessible to humans is in:
- symbols,
- similarity to prototypes,
- images, visual representations.
- What type of explanation is satisfactory?
- Interesting question for cognitive scientists.
- Different answers in different fields.

Data understanding

- Humans remember examples of each category and refer to such examples, as similarity-based or nearest-neighbors methods do.
- Humans create prototypes out of many examples, as Gaussian classifiers, RBF networks, neurofuzzy systems do.
- Logical rules are the highest form of summarization of knowledge.

- Types of explanation:
- visualization-based: maps, diagrams, relations ...
- exemplar-based: prototypes and similarity
- logic-based: symbols and rules.

Visualization: dendrograms

- All projections (cuboids) on 2D subspaces are identical, dendrograms do not show the structure.

Normal and malignant lymphocytes.

Visualization: 2D projections

- All projections (cuboids) on 2D subspaces are identical, dendrograms do not show the structure.

3-bit parity + all 5-bit combinations.

Visualization: MDS mapping

- Results of pure MDS mapping + centers of hierarchical clusters connected.

3-bit parity + all 5-bit combinations.

Visualization: 3D projections

- Only age is continuous, other values are binary.

Fine Needle Aspirate of Breast Lesions, red = malignant, green = benign. A.J. Walker, S.S. Cross, R.F. Harrison, Lancet 1999, 354, 1518-1521

Visualization: MDS mappings

- Try to preserve all distances in 2D nonlinear mapping.

MDS large sets using LVQ relative mapping.

Prototype-based rules

C-rules (Crisp) are a special case of F-rules (fuzzy rules). F-rules are a special case of P-rules (Prototype). P-rules have the form:

- IF $P = \arg\min_R D(X,R)$ THEN Class(X) = Class(P)

D(X,R) is a dissimilarity (distance) function, determining decision borders around prototype P.

P-rules are easy to interpret!

- IF X = You are most similar to P = Superman THEN You are in the Super-league.
- IF X = You are most similar to P = Weakling THEN You are in the Failed-league.

"Similar" may involve different features or D(X,P).
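
A P-rule of this form amounts to a nearest-prototype classifier. The sketch below is only an illustration: the prototype vectors, the two features and the `euclidean` helper are assumptions, not taken from the talk, and any other dissimilarity function could be passed in place of `euclidean`.

```python
import math

def euclidean(x, r):
    """Euclidean dissimilarity D(X, R) between two feature vectors."""
    return math.sqrt(sum((xi - ri) ** 2 for xi, ri in zip(x, r)))

def p_rule(x, prototypes, distance=euclidean):
    """IF P = argmin_R D(X, R) THEN Class(X) = Class(P).

    `prototypes` is a list of (vector, class_label) pairs.
    """
    _, label = min(prototypes, key=lambda p: distance(x, p[0]))
    return label

# Hypothetical two-feature prototypes, one per class.
prototypes = [((9.0, 9.0), "Super-league"), ((1.0, 1.0), "Failed-league")]
print(p_rule((8.0, 7.5), prototypes))
```

Changing `distance` changes the shape of the decision borders while the rule itself stays the same.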

P-rules

Euclidean distance leads to Gaussian fuzzy membership functions + product as T-norm.

Manhattan function => $\mu(X;P) = \exp\{-|X-P|\}$

Various distance functions lead to different MF. Ex.: data-dependent distance functions, for symbolic data.
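
The link between the Euclidean distance and Gaussian membership functions can be written out explicitly. The dispersions $\sigma_i$ and the weights $W_i$ below are assumptions introduced for this illustration (the slide does not give the normalization); with them, the product T-norm of one-dimensional Gaussian membership functions becomes an exponential of a weighted squared Euclidean distance:

```latex
\mu(X;P) = \prod_{i=1}^{N} \exp\!\left(-\frac{(X_i-P_i)^2}{2\sigma_i^2}\right)
         = \exp\!\left(-\sum_{i=1}^{N} \frac{(X_i-P_i)^2}{2\sigma_i^2}\right)
         = \exp\!\left(-D_2^2(X,P)\right),
\qquad
D_2^2(X,P) = \sum_{i=1}^{N} W_i\,(X_i-P_i)^2,
\quad W_i = \frac{1}{2\sigma_i^2}.
```

The same argument with $\exp(-|X_i-P_i|)$ factors gives the Manhattan case.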

Crisp P-rules

New distance functions from info theory => interesting MF. Membership functions => new distance function, with local D(X,R) for each cluster.

Crisp logic rules: use the $L_\infty$ norm:

$D_\infty(X,P) = \|X-P\|_\infty = \max_i W_i |X_i - P_i|$

$D_\infty(X,P) = \mathrm{const}$ => rectangular contours.

$L_\infty$ (Chebyshev) distance with thresholds $\theta_P$:

IF $D_\infty(X,P) \le \theta_P$ THEN C(X) = C(P)

is equivalent to a conjunctive crisp rule:

IF $X_1 \in [P_1 - \theta_P/W_1,\ P_1 + \theta_P/W_1] \wedge \dots \wedge X_N \in [P_N - \theta_P/W_N,\ P_N + \theta_P/W_N]$ THEN C(X) = C(P)
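
The equivalence between the Chebyshev-distance rule and the conjunctive interval rule can be checked numerically. This is a minimal sketch; the prototype `P`, weights `W` and threshold `THETA` are made-up values chosen only for the demonstration.

```python
def chebyshev(x, p, w):
    """Weighted L-infinity distance: max_i W_i * |X_i - P_i|."""
    return max(wi * abs(xi - pi) for xi, pi, wi in zip(x, p, w))

def p_rule_fires(x, p, w, theta):
    """Prototype form: D_inf(X, P) <= theta_P."""
    return chebyshev(x, p, w) <= theta

def crisp_rule_fires(x, p, w, theta):
    """Conjunctive form: every X_i lies in [P_i - theta/W_i, P_i + theta/W_i]."""
    return all(pi - theta / wi <= xi <= pi + theta / wi
               for xi, pi, wi in zip(x, p, w))

# Hypothetical prototype, feature weights and threshold.
P, W, THETA = (2.0, 5.0), (1.0, 0.5), 1.0
for x in [(2.5, 6.0), (3.5, 5.0), (1.0, 7.0), (2.0, 7.5)]:
    # Both formulations must give the same answer for every point.
    assert p_rule_fires(x, P, W, THETA) == crisp_rule_fires(x, P, W, THETA)
```

Each weight rescales the width of the corresponding interval, which is why the contours are rectangles rather than squares.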

Decision borders

D(P,X) = const and decision borders D(P,X) = D(Q,X).

Euclidean distance from 3 prototypes, one per class.

Minkowski $\alpha = 20$ distance from 3 prototypes.

P-rules for Wine

- $L_\infty$ distance (crisp rules): 15 prototypes kept, 5 errors, f2, f8, f10 removed
- Euclidean distance: 11 prototypes kept, 7 errors

- Manhattan distance
- prototypes kept, 4 errors, f2 removed
- Many other solutions.
- Prototypes SV clusters.

Complex objects

Vector space concept is not sufficient for complex objects. A common set of features is meaningless.

- AI: complex objects, states, subproblems.
- General approach: sufficient to evaluate similarity D(Oi,Oj).
- Compare Oi, Oj: define transformation T.
- Elementary operators Wk, e.g. substring substitutions.
- Many T connecting a pair of objects Oi and Oj exist.
- Cost of transformation = sum of Wk costs.
- Similarity: lowest transformation costs.
- Bioinformatics: sophisticated similarity functions for sequences.
- Dynamic programming finds similarities in reasonable time.
- Use adaptive costs and general framework for SBM methods.
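
For strings, the idea of "similarity as the lowest transformation cost" can be illustrated with the classic edit-distance recurrence. This is a sketch under assumptions: the three elementary operators (substitute, insert, delete) and their unit costs are chosen here for illustration, not specified in the talk; the cost parameters are the place where adaptive costs would enter.

```python
def transformation_cost(a, b, sub_cost=1.0, indel_cost=1.0):
    """Lowest total cost of turning string a into string b.

    Elementary operators: substitute one symbol, insert one, delete one.
    Dynamic programming over prefixes, keeping only the previous row.
    """
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * indel_cost]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + indel_cost,                           # delete ca
                curr[j - 1] + indel_cost,                       # insert cb
                prev[j - 1] + (0.0 if ca == cb else sub_cost),  # keep or substitute
            ))
        prev = curr
    return prev[-1]

print(transformation_cost("tactag", "tacgag"))  # a single substitution
```

The table has len(a) × len(b) cells, which is what makes the search over all transformations feasible in reasonable time.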

Promoters

DNA strings, 57 nucleotides, 53 + and 53 - samples

tactagcaatacgcttgcgttcggtggttaagtatgtataatgcgcgggcttgtcgt

Euclidean distance, symbolic s = a, c, t, g replaced by x = 1, 2, 3, 4

PDF distance, symbolic s = a, c, t, g replaced by p(s)

Connection of CI with AI

AI/CI division is harmful for science!

GOFAI: operators, state transformations and search techniques are basic tools in AI solving problems requiring systematic reasoning.

CI methods may provide useful heuristics for AI and define metric relations between states, problems or complex objects.

Example: combinatorial productivity in AI systems and FSM.

Later: decision tree for complex structures.

Electric circuit example

Answering questions in complex domains requires reasoning. Qualitative behavior of an electric circuit: 7 variables, but Ohm's law V = I×R, or Kirchhoff's law Vt = V1 + V2.

Train a NeuroFuzzy system on Ohm's and Kirchhoff's laws. Without solving equations answer questions of the type: If R2 grows, R1 and Vt are constant, what will happen with the current I and voltages V1, V2? (taken from the PDP book, McClelland, Rumelhart, Hinton)

Electric circuit search

AI: create search tree, CI: provide guiding intuition. Any law of the form A = B×C or A = B+C, ex: V = I×R, has 13 true facts, 14 false facts and may be learned by an NF system.

Geometrical representation: + increasing, - decreasing, 0 constant.

Find combinations of Vt, Rt, I, V1, V2, R1, R2 for which all 5 constraints are fulfilled. This holds for 111 cases out of $3^7 = 2187$.

Search and check if X can be +, 0, -; laws are not satisfied if F(Vt=0, Rt, I, V1, V2, R1=0, R2=+) = 0

Heuristic search

If R2 grows, R1 and Vt are constant, what will happen with the current I and voltages V1, V2?

We know that R2 = +, R1 = 0, Vt = 0, V1 = ?, V2 = ?, Rt = ?, I = ?

Take V1 = + and check if F(Vt=0, Rt=?, I=?, V1=+, V2=?, R1=0, R2=+) > 0.

Since for all V1 = +, 0 and - the function is F() > 0, take a variable that leads to a unique answer, Rt.

Single search path solves the problem. Useful also in approximate reasoning where only some conditions are fulfilled.
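
The size of this search space can be reproduced by brute force. The sketch below rests on two assumptions: that the five constraints are Vt = V1 + V2, Rt = R1 + R2, Vt = I×Rt, V1 = I×R1 and V2 = I×R2 (the slides name only Ohm's and Kirchhoff's laws), and that an exact table of qualitative facts may stand in for the neurofuzzy function F. It therefore illustrates the constraint set, not the NF system or its heuristic search.

```python
from itertools import product

SIGNS = ("+", "0", "-")
NAMES = ("Vt", "Rt", "I", "V1", "V2", "R1", "R2")

def allowed(b, c):
    """Trends of A compatible with A = B + C or A = B * C (positive quantities)."""
    if b == c:
        return {b}
    if b == "0":
        return {c}
    if c == "0":
        return {b}
    return set(SIGNS)  # opposite trends: the outcome is undetermined

def circuit_ok(s):
    """Check the five (assumed) constraints for one assignment of trends."""
    return (s["Vt"] in allowed(s["V1"], s["V2"])      # Vt = V1 + V2
            and s["Rt"] in allowed(s["R1"], s["R2"])  # Rt = R1 + R2
            and s["Vt"] in allowed(s["I"], s["Rt"])   # Vt = I * Rt
            and s["V1"] in allowed(s["I"], s["R1"])   # V1 = I * R1
            and s["V2"] in allowed(s["I"], s["R2"]))  # V2 = I * R2

def consistent_states(**known):
    """All trend assignments that satisfy the laws and agree with `known`."""
    states = (dict(zip(NAMES, values))
              for values in product(SIGNS, repeat=len(NAMES)))
    return [s for s in states
            if circuit_ok(s) and all(s[k] == v for k, v in known.items())]

true_facts = sum(a in allowed(b, c) for a, b, c in product(SIGNS, repeat=3))
print(true_facts, 27 - true_facts)   # 13 14
print(len(consistent_states()))      # 111
print(consistent_states(R2="+", R1="0", Vt="0"))
```

Under these assumptions the question "R2 grows, R1 and Vt constant" leaves a single consistent state: Rt increases, I decreases, V1 decreases and V2 increases.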

Logical rules

- Crisp logic rules: for continuous x use linguistic variables (predicate functions).

$s_k(x) \equiv \mathrm{True}\,[X_k \le x \le X'_k]$, for example:

small(x) = True{x | x < 1}
medium(x) = True{x | x ∈ [1,2]}
large(x) = True{x | x > 2}

Linguistic variables are used in crisp (propositional, Boolean) logic rules:

IF small-height(X) AND has-hat(X) AND has-beard(X) THEN (X is a Brownie) ELSE IF ... ELSE ...
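
Linguistic variables of this kind are simply Boolean predicates over a continuous value. A minimal sketch, assuming a dictionary record with hypothetical keys `height`, `has_hat` and `has_beard`, and a placeholder class for the elided ELSE branches:

```python
def small(x):
    return x < 1

def medium(x):
    return 1 <= x <= 2

def large(x):
    return x > 2

def classify(creature):
    """IF small-height AND has-hat AND has-beard THEN Brownie ELSE ..."""
    if small(creature["height"]) and creature["has_hat"] and creature["has_beard"]:
        return "Brownie"
    return "unknown"  # placeholder for the elided ELSE IF branches

print(classify({"height": 0.6, "has_hat": True, "has_beard": True}))
```

Exactly one of `small`, `medium`, `large` is true for any x, so the predicates partition the axis into three intervals.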

Crisp logic decisions

- Crisp logic is based on rectangular membership functions:

True/False values jump from 0 to 1. Step functions are used for partitioning of the feature space.

Very simple hyper-rectangular decision borders.

Severe limitation on the expressive power of crisp logical rules!

DT decision borders

- Decision trees lead to specific decision borders.
- SSV tree on Wine data, proline + flavanoids content.

Decision tree forests: many decision trees of similar accuracy, but different selectivity and specificity.

Logical rules - advantages

- Logical rules, if simple enough, are preferable.

- Rules may expose limitations of black box solutions.
- Only relevant features are used in rules.
- Rules may sometimes be more accurate than NN and other CI methods.
- Overfitting is easy to control, rules usually have a small number of parameters.
- Rules forever !? A logical rule about logical rules is:

Logical rules - limitations

- Logical rules are preferred but ...

- Only one class is predicted, $p(C_i|X,M) = 0$ or 1; a black-and-white picture may be inappropriate in many applications.
- Discontinuous cost function allows only non-gradient optimization.
- Sets of rules are unstable: a small change in the dataset leads to a large change in the structure of complex sets of rules.
- Reliable crisp rules may reject some cases as unclassified.
- Interpretation of crisp rules may be misleading.
- Fuzzy rules are not so comprehensible.

Rules - choices

- Simplicity vs. accuracy.
- Confidence vs. rejection rate.

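$p_{++}$ is a hit; $p_{-+}$ false alarm; $p_{+-}$ is a miss.

Accuracy (overall): $A(M) = p_{++} + p_{--}$

Error rate: $L(M) = p_{-+} + p_{+-}$

Rejection rate: $R(M) = p_{+r} + p_{-r} = 1 - L(M) - A(M)$

Sensitivity: $S_+(M) = p_{+|+} = p_{++}/p_+$

Specificity: $S_-(M) = p_{-|-} = p_{--}/p_-$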
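
These quantities can be computed directly from case counts. In the sketch below the first index is taken to be the true class and the second the predicted one, with `r` meaning "rejected"; this index convention and the example counts are assumptions made for the illustration.

```python
def rule_quality(n_pp, n_pm, n_pr, n_mp, n_mm, n_mr):
    """Quality measures from counts n_{true,predicted}; 'r' means rejected."""
    n = n_pp + n_pm + n_pr + n_mp + n_mm + n_mr
    p_plus = (n_pp + n_pm + n_pr) / n    # fraction of truly positive cases
    p_minus = (n_mp + n_mm + n_mr) / n   # fraction of truly negative cases
    return {
        "accuracy": (n_pp + n_mm) / n,         # A(M) = p++ + p--
        "error": (n_pm + n_mp) / n,            # L(M) = p+- + p-+
        "rejection": (n_pr + n_mr) / n,        # R(M) = p+r + p-r
        "sensitivity": (n_pp / n) / p_plus,    # S+(M) = p++ / p+
        "specificity": (n_mm / n) / p_minus,   # S-(M) = p-- / p-
    }

# Hypothetical counts for 100 cases.
q = rule_quality(n_pp=40, n_pm=5, n_pr=5, n_mp=10, n_mm=35, n_mr=5)
print(q)
```

Raising the confidence required before a rule fires moves cases from the error cells into the rejected cells, which is the confidence versus rejection-rate trade-off.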

Neural networks and rules

[Figure: network estimating p(MI|X) for Myocardial Infarction, shown output 0.7; labels: Output weights, Input weights, Sex, Age, Smoking, Elevation, Pain, ECG ST, Duration.]

Knowledge from networks

- Simplify networks: force most weights to 0, quantize remaining parameters, be constructive!

- Regularization: mathematical technique improving predictive abilities of the network.
- Result: MLP2LN neural networks that are equivalent to logical rules.

MLP2LN

- Converts MLP neural networks into a network performing logical operations (LN).

- Input layer
- Output: one node per class.
- Aggregation: better features
- Rule units: threshold logic
- Linguistic units: windows, filters

Learning dynamics

Decision regions shown every 200 training epochs in x3, x4 coordinates; borders are optimally placed with wide margins.

Neurofuzzy systems

Fuzzy: $\mu(x) = 0, 1$ (no/yes) replaced by a degree $\mu(x) \in [0,1]$. Triangular, trapezoidal, Gaussian ... MF.
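
The membership function shapes named here can be sketched as plain functions. The parameter names (`a`, `b`, `c`, `d`, `center`, `sigma`) are assumptions for the illustration; the talk does not fix a parameterization.

```python
import math

def crisp(x, a, b):
    """Rectangular membership: 1 inside [a, b], 0 outside."""
    return 1.0 if a <= x <= b else 0.0

def triangular(x, a, b, c):
    """Rises linearly from a to the peak at b, then falls to c."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def trapezoidal(x, a, b, c, d):
    """Rises from a to b, stays at 1 on [b, c], falls from c to d."""
    if x <= a or x >= d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    return 1.0 if x <= c else (d - x) / (d - c)

def gaussian(x, center, sigma):
    """Smooth bell-shaped degree of membership, equal to 1 at the center."""
    return math.exp(-((x - center) ** 2) / (2 * sigma ** 2))

print(crisp(1.5, 1, 2), triangular(1.5, 1, 2, 3), trapezoidal(3.5, 1, 2, 3, 4))
```

The crisp function is the limiting case in which the sloped sides of the trapezoid become vertical.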

M.f-s in many dimensions

- Feature Space Mapping (FSM) neurofuzzy system.
- Neural adaptation, estimation of probability density distribution (PDF) using single hidden layer network (RBF-like) with nodes realizing separable functions.

Heterogeneous systems

- Homogeneous systems: one type of building blocks, same type of decision borders.
- Ex: neural networks, SVMs, decision trees, kNNs.
- Committees combine many models together, but lead to complex models that are difficult to understand.

Discovering simplest class structures, its inductive bias, requires heterogeneous adaptive systems (HAS).

Ockham razor: simpler systems are better.

HAS examples:

- NN with many types of neuron transfer functions.
- k-NN with different distance functions.
- DT with different types of test criteria.

GhostMiner Philosophy

- GhostMiner, data mining tools from our lab: http://www.fqspl.com.pl/ghostminer/
- Separate the process of model building and knowledge discovery from model use => GhostMiner Developer and GhostMiner Analyzer.

- There is no free lunch: provide different types of tools for knowledge discovery. Decision tree, neural, neurofuzzy, similarity-based, committees.
- Provide tools for visualization of data.
- Support the process of knowledge discovery/model building and evaluating, organizing it into projects.

Recurrence of breast cancer

- Data from Institute of Oncology, University Medical Center, Ljubljana, Yugoslavia.

286 cases, 201 no recurrence (70.3%), 85 recurrence cases (29.7%).

no-recurrence-events, 40-49, premeno, 25-29, 0-2, ?, 2, left, right_low, yes

9 nominal features: age (9 bins), menopause, tumor-size (12 bins), nodes involved (13 bins), node-caps, degree-malignant (1,2,3), breast, breast quad, radiation.

Recurrence of breast cancer


Many systems used, 65-78% accuracy reported.

Single rule: IF (nodes-involved ∉ [0,2]) ∧ (degree-malignant = 3) THEN recurrence, ELSE no-recurrence

76.2% accuracy, only trivial knowledge in the data: highly malignant breast cancer involving many nodes is likely to strike back.

Recurrence - comparison.

Method                    10xCV accuracy (%)
MLP2LN, 1 rule            76.2
SSV DT, stable rules      75.7 ± 1.0
k-NN, k=10, Canberra      74.1 ± 1.2
MLP+backprop.             73.5 ± 9.4 (Zarndt)
CART DT                   71.4 ± 5.0 (Zarndt)
FSM, Gaussian nodes       71.7 ± 6.8
Naive Bayes               69.3 ± 10.0 (Zarndt)
Other decision trees      < 70.0

Breast cancer diagnosis.

- Data from University of Wisconsin Hospital, Madison, collected by Dr. W.H. Wolberg.

699 cases, 9 cell features quantized from 1 to 10: clump thickness, uniformity of cell size, uniformity of cell shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli, mitoses.

Task: distinguish benign from malignant cases.

Breast cancer rules.


Simplest rule from MLP2LN, large regularization:

IF uniformity of cell size < 3 THEN benign ELSE malignant

Sensitivity = 0.97, Specificity = 0.85

More complex solutions (3 rules) give in 10CV: Sensitivity = 0.95, Specificity = 0.96, Accuracy = 0.96

Breast cancer comparison.

Method                       10xCV accuracy (%)
k-NN, k=3, Manh              97.0 ± 2.1 (GM)
FSM, neurofuzzy              96.9 ± 1.4 (GM)
Fisher LDA                   96.8
MLP+backprop.                96.7 (Ster, Dobnikar)
LVQ                          96.6 (Ster, Dobnikar)
IncNet (neural)              96.4 ± 2.1 (GM)
Naive Bayes                  96.4
SSV DT, 3 crisp rules        96.0 ± 2.9 (GM)
LDA (linear discriminant)    96.0
Various decision trees       93.5-95.6

SSV HAS Wisconsin

- Heterogeneous decision tree that searches not only for logical rules but also for prototype-based rules.

Single P-rule gives simplest known description of this data:

IF ||X-R303|| < 20.27 THEN malignant ELSE benign

18 errors, 97.4% accuracy. Good prototype for malignant! Simple thresholds, that's what MDs like the most!

Best L1O result 98.3% (FSM), best 10CV around 97.5% (Naïve Bayes kernel, SVM).

C 4.5 gives 94.7 ± 2.0%, SSV without distances 96.4 ± 2.1%.

Several simple rules of similar accuracy in CV tests exist.

Melanoma skin cancer

- Collected in the Outpatient Center of Dermatology in Rzeszów, Poland.
- Four types of Melanoma: benign, blue, suspicious, or malignant.

- 250 cases, with almost equal class distribution.
- Each record in the database has 13 attributes: asymmetry, border, color (6), diversity (5).
- TDS (Total Dermatoscopy Score) - single index
- Goal: hardware scanner for preliminary diagnosis.

Melanoma results

Method                        Rules   Training (%)   Test (%)
MLP2LN, crisp rules           4       98.0 all       100
SSV Tree, crisp rules         4       97.5 ± 0.3     100
FSM, rectangular f.           7       95.5 ± 1.0     100
knn + prototype selection     13      97.5 ± 0.0     100
FSM, Gaussian f.              15      93.7 ± 1.0     95 ± 3.6
knn k=1, Manh, 2 features     --      97.4 ± 0.3     100
LERS, rough rules             21      --             96.2

Antibiotic activity of pyrimidine compounds.

Pyrimidines: which compound has stronger antibiotic activity?

Common template, substitutions added at 3 positions, R3, R4 and R5.

27 features taken into account: polarity, size, hydrogen-bond donor or acceptor, pi-donor or acceptor, polarizability, sigma effect.

Pairs of chemicals, 54 features, are compared: which one has higher activity?

2788 cases, 5-fold crossvalidation tests.

Antibiotic activity - results.


Mean Spearman's rank correlation coefficient used: $-1 < r_s < +1$

Method                    Rank correlation
FSM, 41 Gaussian rules    0.77 ± 0.03
Golem (ILP)               0.68
Linear regression         0.65
CART (decision tree)      0.50
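
For reference, Spearman's coefficient is the Pearson correlation computed on ranks. The sketch below is a generic stdlib implementation with average ranks for ties; it is not the evaluation code behind the table, and the example vectors are made up.

```python
def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman's r_s: Pearson correlation of the two rank vectors."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical predicted vs. true activity ordering for five compounds.
print(spearman([1, 2, 3, 4, 5], [2, 1, 4, 3, 5]))
```

Because only the ordering matters, the measure suits the pairwise "which compound is more active" formulation of the task.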

Thyroid screening.

- Garavan Institute, Sydney, Australia
- 15 binary, 6 continuous features
- Training: 93 + 191 + 3488; Validate: 73 + 177 + 3178
- Determine important clinical factors
- Calculate prob. of each diagnosis.

Thyroid some results.

Accuracy of diagnoses obtained with different systems.

Method                     Rules/Features   Training (%)   Test (%)
MLP2LN optimized           4/6              99.9           99.36
CART/SSV Decision Trees    3/5              99.8           99.33
Best Backprop MLP          -/21             100            98.5
Naïve Bayes                -/-              97.0           96.1
k-nearest neighbors        -/-              -              93.8

Psychometry

- MMPI (Minnesota Multiphasic Personality Inventory) psychometric test.
- Printed forms are scanned or a computerized version of the test is used.

- Raw data: 550 questions, e.g. "I am getting tired quickly": Yes - Don't know - No
- Results are combined into 10 clinical scales and 4 validity scales using fixed coefficients.
- Each scale measures tendencies towards hypochondria, schizophrenia, psychopathic deviations, depression, hysteria, paranoia etc.

[Figures: scanned form; computer input; scales]

Psychometry

- There is no simple correlation between single values and final diagnosis.
- Results are displayed in form of a histogram, called a psychogram. Interpretation depends on the experience and skill of an expert, and takes into account correlations between peaks.

Goal: an expert system providing evaluation and interpretation of MMPI tests at an expert level.

Problem: agreement between experts only 70% of the time; alternative diagnosis and personality changes over time are important.

Psychogram

Psychometric data

- 1600 cases for women, same number for men.
- 27 classes: norm, psychopathic, schizophrenia, paranoia, neurosis, mania, simulation, alcoholism, drug addiction, criminal tendencies, abnormal behavior due to ...

Extraction of logical rules: 14 scales = features. Define linguistic variables and use FSM, MLP2LN, SSV - giving about 2-3 rules/class.

Psychometric data

Method   Data   N. rules   Accuracy (%)   Gx (%)
C 4.5    ?      55         93.0           93.7
         ?      61         92.5           93.1
FSM      ?      69         95.4           97.6
         ?      98         95.9           96.9

10-CV for FSM is 82-85%, for C4.5 is 79-84%. Input uncertainty Gx around 1.5 (best ROC) improves FSM results to 90-92%.

Psychometric Expert

- Probabilities for different classes. For greater uncertainties more classes are predicted.
- Fitting the rules to the conditions: typically 3-5 conditions per rule, Gaussian distributions around measured values that fall into the rule interval are shown in green.
- Verbal interpretation of each case, rule and scale dependent.
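
The effect of input uncertainty on a crisp rule can be sketched by placing a Gaussian around each measured value and asking how much of it falls inside the rule interval. The scale names, interval bounds and the independence assumption used to multiply the conditions are all illustrative assumptions, not details taken from the expert system.

```python
import math

def prob_in_interval(x, sigma, low, high):
    """P(low <= X <= high) for X ~ N(x, sigma^2), via the error function."""
    def cdf(t):
        return 0.5 * (1.0 + math.erf((t - x) / (sigma * math.sqrt(2.0))))
    return cdf(high) - cdf(low)

def rule_probability(measured, sigma, conditions):
    """Product of per-condition probabilities, assuming independent scales."""
    p = 1.0
    for scale, (low, high) in conditions.items():
        p *= prob_in_interval(measured[scale], sigma, low, high)
    return p

# Hypothetical rule with two interval conditions on two made-up scales.
rule = {"scale_a": (70.0, float("inf")), "scale_b": (50.0, 65.0)}
case = {"scale_a": 72.0, "scale_b": 64.0}
for sigma in (0.5, 2.0, 5.0):
    print(sigma, round(rule_probability(case, sigma, rule), 3))
```

With a small `sigma` the rule is close to crisp; as `sigma` grows, the probability of this rule drops and neighbouring rules gain weight, which is why more classes are predicted for greater uncertainties.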

MMPI probabilities

MMPI rules

MMPI verbal comments

Visualization

- Probability of classes versus input uncertainty.
- Detailed input probabilities around the measured values vs. change in the single scale; changes over time define the patient's trajectory.
- Interactive multidimensional scaling: zooming on the new case to inspect its similarity to other cases.

Class probability/uncertainty

Class probability/feature

MDS visualization

Conclusions

- Data understanding is a challenging problem.
- Classification rules are frequently only the first step and may not be the best solution.
- Visualization is always helpful.
- P-rules may be competitive if complex decision borders are required, providing different types of rules.
- Understanding of complex objects is possible, although difficult, using adaptive costs and distance as least expensive transformations (action principles in physics).
- Great applications are coming!

Challenges

- Fully automatic universal data analysis systems: press the button and wait for the truth ...

- Discovery of theories rather than data models
- Integration with image/signal analysis
- Integration with reasoning in complex domains
- Combining expert systems with neural networks
- .

We are slowly getting there. More and more computational intelligence tools (including our own) are available.