Machine Learning Methods for Decision Support and

Discovery

- Constantin Aliferis M.D., Ph.D., Ioannis

Tsamardinos Ph.D. - Discovery Systems Laboratory,
- Department of Biomedical Informatics,
- Vanderbilt University
- 2004 MEDINFO Tutorial
- 7 September 2004

Acknowledgments

- Alexander Statnikov for code and for putting

together the Resource Web page and CD - Doug Hardin, Pierre Massion, Yindalon

Aphinyanaphongs, Laura E. Brown, and Nafeh

Fananapazir for access to data and results for

case studies

Goal

- The purpose of this tutorial is
- To help participants develop a solid

understanding of some of the most useful machine

learning methods. - To give several examples of how these methods can

be applied in practice, and - To provide resources for expanding the knowledge

gained in the tutorial.

Outline

- Part I Overview and Foundations
- 1. Tutorial Overview and goals
- 2. Importance of Machine Learning for discovery

and decision-support system construction - 3. A framework for inductive Machine Learning
- 4. Generalization and Over-fitting
- 5. Quick review of data preparation and model

evaluation - 6. Families of methods
- a. Bayesian classifiers
- break
- b. Neural Networks
- c. Support Vector Machines
- break
- 7. Quick Review of Additional families
- a. K-Nearest Neighborhs,
- b. Clustering,
- c. Decision Tree Induction,
- d. Genetic Algorithms

Outline (contd)

- Part II.More Advanced Methods and Case Studies
- 1. More Advanced Methods
- a. Causal Discovery methods using Causal

Probabilistic Networks - b. Feature selection
- break
- 2 Case Studies
- Building a diagnostic model from gene expression

data - Building a diagnostic model from mass

spectrometry data - Categorizing text into content categories
- break
- d. Discovery of causal structure using Causal

Probabilistic Network induction (demo) - 3. Conclusions and wrap-up
- a. Resources for machine learning
- b. Questions feedback

Definitions Importance of Machine Learning

A (Simplified) Motivating Example

- Assume we wish to create a decision support

system capable of diagnosing patients according

to two categories Lung Cancer and Normal. - The input to the system will be of array gene

expression measurements from tissue biopsies.

A (Simplified) Motivating Example

- Little is currently known about how gene

expression values differentiate human lung cancer

tissue from normal tissue. - Thus we will use an automated approach in which a

computer system will examine patients array gene

expression measurements and the correct diagnosis

(provided by a pathologist).

A (Simplified) Motivating Example

- The system will produce a program that implements

a function that assigns the correct diagnosis to

any pattern of array gene expression data to the

correct diagnostic label (and not just the

input-output patterns of the training data). - Thus the system will learn (i.e., generalize)

from training data the general input-output

function for our diagnosis problem.

A (Simplified) Motivating Example

- What are the principles and specific methods that

enable the creation of such learning systems? - What flavors of learning systems currently exist?
- What are their capabilities and limitations?
- These are some of the questions we will be

addressing in this tutorial

What is Machine Learning (ML)? How is it

different than Statistics and Data Mining?

- Machine Learning is the branch of Computer

Science (Artificial Intelligence in particular)

that studies systems that learn. - Systems that learn systems that improve their

performance with experience.

What is Machine Learning (ML)? How is it

different than Statistics and Data Mining?

- Typical tasks
- image recognition,
- Diagnosis,
- elicitation of possible causal structure of

problem domain, - game playing,
- solving optimization problems,
- prediction of structure or function of

biomolecules, - text categorization,
- identification of relevant variables, etc.

Indicative Example applications of ML in

Biomedicine

- Bioinformatics
- Prediction of Protein Secondary Structure
- Prediction of Signal Peptides
- Gene Finding and Intron/Exon Splice Site

Prediction - Diagnosis using cDNA and oligonucleotide array

gene expression data - Identification of molecular subtypes of patients

with various forms of cancer - Clinical problem areas
- Survival after Pneumonia (CAP)
- Survival after Syncope
- Diagnosis of Acute M.I.
- Diagnosis of Prostate Cancer
- Diagnosis of Breast Cancer
- Prescription and monitoring in hemodialysis
- Prediction of renal transplant graft failure

Importance of ML Task Types

- Diagnosis (what is the most likely disease given

a set of clinical findings?), - Prognosis (what will be the outcome after a

certain treatment has been given to a patient?), - Treatment selection (what treatment to give to a

specific patient?), - Prevention (what is the likelihood that a

specific patient will develop disease X if

preventable risk factor Y is present?). - ML has practically replaced Knowledge Acquisition

for building Decision Support (Expert) Systems.

Importance of ML Task Types (contd)

- Discovery
- Feature selection (e.g., what is a minimal set of

laboratory values needed for pneumonia

diagnosis?) - Concept formation (e.g., what are patterns of

genomic instability as measured by array CGH that

constitute molecular subtypes of lung cancer

capable of guiding development of new

treatments?) - Feature construction (e.g., how can

mass-spectrometry signals be decomposed into

individual variables that are highly predictive

for detection of cancer and can be traced back to

individual proteins that may play important roles

in carcinogensis?) information retrieval query

construction (e.g., what are PubMed Mesh terms

that predict with high sensitivity and

specificity whether medical journals talk about

treatment?) - Questions about function, interactions, and

structure (e.g., how do genes and proteins

regulate each other in the cells of lower and

higher organisms? what is the most likely

function of a protein given the sequence of its

aminoacids?), etc.

What is Machine Learning (ML)? How is it

different than Statistics and Data Mining?

- Broadly speaking ML, DM, and Statistics have

similar goals (modeling for classification and

hypothesis generation or testing). - Statistics has traditionally emphasized models

that can be solved analytically (for example

various versions of the Generalized Linear Model

GLM). To achieve this both restrictions in the

expressive power of models and their parametric

distributions are heavily used. - Data Mining emphasizes very large-scale data

storage, integration, retrieval and analysis

(typically the last one as a secondary focus). - Machine Learning seeks to use computationally

powerful approaches to learn very complex non- or

quasi-parametric models of the data. Some of

these models are closer to human representations

of the problem domain per se (or of problem

solving in the domain)

Importance of ML Data Types and Volume

- Overwhelming production of data
- Bioinformatics (mass-throughput assays for gene

expression, protein abundance, SNPs) - Clinical Systems (EPR, POE)
- Bibliographic collections
- The Web web pages, transaction records,

Importance of ML Reliance on Hard data and

evidence

- Machine learning has become critical for Decision

Support System Construction given extensive

cognitive biases and the corresponding need to

base MDSSs on hard scientific evidence and

high-quality data

Supplementary Cognitive Biases

- Main thesis
- human cognitive abilities are tailored to support

instinctive, reflexive, life-preserving reactions

traced back in our evolution as species. They are

not designed for rational, rigorous reasoning

such as the reasoning needed in science and

engineering. - In other words, there is a disconnect between our

innate cognitive ability and the complexity of

reasoning tasks required by the explosive

advances in science and technology in the last

few hundred years.

Supplementary But is the Cognitive Biases Thesis

Correct?

- Psychology of Judgment and Decision Making

(Plous) - Tversky and Kahneman (Judgment under uncertainty

Heuristics and Biases) - Methods of Influence (Cialdini)
- And highly-recommended supplementary information

can be found in - Professional Judgment (Elstein)
- Institute of Medicines Report in Medical Errors

(1999)

Supplementary Tversky and Kahneman Judgment

under uncertainty Heuristics and Biases

- This work (a constellation of psychological

studies converging to a description of human

decision making under uncertainty) is very highly

regarded and influential It was recently (2002)

awarded the Nobel Prize of Economics. - Main points
- People use a few simple heuristics when making

judgments under uncertainty - These heuristics sometimes are useful and other

times lead to severe and systematic errors - These heuristics are representativeness,

availability and anchoring

- Supplementary Representativeness
- E.g., the probability P that patient X has

disease D given that she has findings F is

assessed by the similarity of X to a prototypical

description of D (found in a textbook, or

recalled from earlier practice and training). - Why is this wrong?
- Reason 1 similarity ignores base-rate of D
- Reason 2 similarity ignores sample size
- Reason 3 similarity ignores predictability
- Reason 4 similarity is affected by redundant

features

- Supplementary Availability
- E.g., the probability P that patient X with

disease D given that she is given treatment T

will become healthy is assessed by recalling such

occurrences in ones prior experience - Why is this wrong?
- Reason 1 availability is influenced by

familiarity - Reason 2 availability is influenced by salience
- Reason 3 availability is influenced by elapsed

time - Reason 4 availability is influenced by rate of

abstract terms - Reason 5 availability is influenced by

imaginability - Reason 6 availability is influenced by

perceived association

- Supplementary Adjustment and Anchoring
- E.g., the probability P that patient X has

disease D given that she has findings F is

assessed by making an initial estimate P1 for

findings F1 and updating it when new evidence F2,

F3, , and so on, becomes available. - What goes wrong?
- Problem 1 initial estimate over-influences the

final estimate - Problem 2 initial estimate is often based on

quick and then extrapolated calculations - Problem 3 people overestimate the probability

of conjunctive events - Problem 4 according to initial anchor,

peoples predictions are calibrated differently

Supplementary Additional

- Methods of Influence (Cialdini, 1993)
- Reciprocation
- Commitment Consistency
- Social Proof
- Liking
- Authority
- Scarcity
- Professional Judgment (Dowie and Elstein 1988)
- Institute of Medicines Report in Medical Errors

(1999)

Supplementary Putting MDSSs and Machine Learning

in Historical Context

- 40s
- Foundations of Formal Decision-Making Theory by

VonNeuman and Morgerstern - 50s
- Ledley and Lusted lay out how logic and

probabilistic reasoning can help in diagnosis and

treatment selection in medicine - 60s
- Applications of Bayes theorem for diagnosis and

treatment selection pioneered by Warner and

DeDombal - Medline (NLM)
- Early 70s
- Ad-hoc systems (Myers et al Pauker et al)
- Study of Cognitive Biases (Kahneman, Tversky)
- Late 70s
- Rule-based systems (Buchanan Shortliffe)

Supplementary Milestones in MDSSs

- 80s
- Analysis of ad-hoc and RBSs (Heckerman et al.)
- Bayesian Networks (Pearl, Cooper, Heckerman et

al.) - Medical Decision Making as discipline (Pauker)
- Literature-driven decision support (Renels

Shortliffe) - Early 90s
- Web-enabled decision support wide-spread

information retrieval - Computational Causal Discovery (Pearl, Spirtes et

al. Cooper et al.) - Sound re-formulation of very large ad-hoc systems

(Shwe et al) - Analysis of Bayesian systems (Domingos et al,

Henrion et al.) - Proliferation of focused Statistics and Machine

Learning MDDSs - First-order Logics that combine classical FOL

with probabilistic reasoning, causation and

planning (Haddaway)

Supplementary Milestones in MDSSs

- Late 90s
- Efficient Inference for very large probabilistic

systems (Jordan et al) - Kernel-based methods for sample-efficient

learning (Vapnik) - Evidence-Based Medicine (Haynes et al)
- 21st Century
- Diagnosis, Prognosis and Treatment selection

(a.k.a. Personalized medicine or

Pharmacogenomics) based on molecular

information (proteomic spectra, gene expression

arrays, SNPs) collected via mass-throughput

assaying technology, and modeleld using machine

learning methods - Provide-order entry delivery of advanced decision

support - Advanced representation, storage, retrieval and

application of EBM information (guidelines,

journals, meta-analyses, clinical bioinformatics

models)

Importance of ML

- How often ML techniques are being used? Articles

in Medline (in parentheses last 2 years) - Artificial Intelligence 12,441 (2,358)
- Expert systems 2,271 (121)
- Neural Networks 5,403 (1,158)
- Support Vector Machines 163 (121)
- Clustering 17,937 (4,080)
- Genetic Algorithms 2,798 (969)
- Decision Trees 4,958 (752)
- Bayesian (Belief) Networks 1,627 (585)
- Bayes (Bayesian Statistics Nets) 4,369 (561)
- Compare to
- Regression 164,305 (28,134)
- Knowledge acquisition 310 (56)
- Knowledge representation 227 (27)
- 4 major Symbolic DSS 145 (10)
- (Internist-I, QMR, ILIAD, DxPlain)
- Rule-based systems 802 (151)

Importance of ML

- Importance of ML becomes very evident in cases

where - data analysis is too time consuming (e.g.,

classify web pages or medline documents into

content or quality categories) - There is little or no domain theory

What is the diagnosis?

Is this an early cancer?

A Framework for Inductive ML Related

Introductory Concepts

What is the difference between supervised and

unsupervised ML methods?

- Supervised learning
- Give to the learning algorithm several instances

of input-output pairs the algorithm learns to

predict the correct output that corresponds to

some inputs (not only previously seen but also

previously unseen ones (generalization)). - In our original example show to learning

algorithm array gene expression measurements from

several patient cases as well as normal subjects

then the learning algorithm induces a classifier

that can classify a previously unseen subject to

the correct diagnostic category given the gene

expression values observed in that subject

Classification

A

CLASSIFIER- INDUCTIVE ALGORITHM

TRAIN INSTANCES

B

C

CLASSIFIER

D

E

APPLICATION INSTANCES

A1, B1, C1, D1, E1 A2, B2, C2, D2, E2 An, Bn,

Cn, Dn, En

CLASSIFICATION PERFORMANCE

What is the difference between supervised and

unsupervised ML methods?

Unsupervised learning - Discover the

categories (or other structural properties of the

domain) - Example give the

learning algorithm gene expression

measurements of patients with Lung Cancer as well

as normal subjects the algorithm finds

sub-types (molecular profiles) of patients that

are very similar to each other, and different to

the rest of the types. Or another algorithm may

discover how various genes interact among

themselves to determine development of cancer.

Discovery

A

STRUCTURE- INDUCTION ALGORITHM

A

TRAIN INSTANCES

B

C

B

C

E

D

E

D

A1, B1, C1, D1, E1 A2, B2, C2, D2, E2 An, Bn,

Cn, Dn, En

PERFORMANCE

A first concrete attempt at solving our

hypothetical diagnosis problem using a particular

type of learning approach (decision tree

induction)

Decision Tree Induction

- An example decision tree to solve the problem of

how to classify subjects into lung cancer vs

normal

Gene139

Over-expressed

Normally- expressed

Under-expressed

Gene202

Gene8766

Lung cancer

Over-expressed

Normally- expressed

Over-expressed

Normally- expressed

Lung cancer

Normal

Lung cancer

Normal

How Can I Learn Such A Decision Tree

Automatically?

- A basic induction procedure is very simple in

principle - Start with an empty tree
- Put at the root of the tree the variable that

best classifies the training examples - Create branches under the variable corresponding

to its values - Under each branch repeat the process with the

remaining variables - Until we run out of variables or sample

How Can We Generalize From This Example?

A General Description of supervised Inductive ML

- Inductive Machine Learning algorithms can be

designed and analyzed using the following

framework - A language L in which we express models. The set

of all possible models expressible in L

constitutes our hypothesis space H - A scoring metric M tells us how good is a

particular model - A search procedure S helps us identify the best

model in H

Space of all possible models Models in H

x x

x x x x

x x x x

x x x

x

A General Description of supervised Inductive ML

- In our decision tree example
- A language L in which we express models

decision trees - The hypothesis space H space of all decision

trees that can be constructed with genes 1 to n - A scoring metric M telling us how good is a

particular model min (classification error

model complexity) - A search procedure S greedy search

How can ML methods fail?

- Wrong language Bias best model is not in H
- Example we look for models expressible as

discrete decision trees but the domain is

continuous

Space of all possible models Models in H

x

x x

x x x x

x x x x

x x x

How can ML methods fail?

- Search Failure best model is in H but search

fails to examine it - Example greedy search fails to capture a strong

gene-gene interaction effect

Space of all possible models Models in H

x x

x

x x x x

x x x x

x x x

Generalization Over-fitting

Generalization Over-fitting

- It was mentioned previously that a good learning

program learns something about the data beyond

the specific cases that have been presented to

it. - Indeed, it is trivial to just store and retrieve

the cases that have been seen in the past (rote

learning implemented as a lookup table). This

does not address the problem of how to handle new

cases, however.

Generalization Over-fitting

- In supervised learning we typically seek to

minimize i.i.d. error, that is error over

future cases (not used in training). Such cases

contain both previously encountered as well as

new cases. - i.i.d. independently sampled and identically

distributed problem instances. - In other words, the training and application

samples come from the same population

(distribution) with identical probability to be

selected for inclusion and this

population/distribution is time-invariant. - (Note if not time invariant then by

incorporating time as independent variable or by

other appropriate transformations we restore the

i.i.d. condition)

Supplementary Generalization Over-fitting

- Consider now the following simplified diagnostic

classification problem classify patients into

cancer (red/vertical pattern) versus normal

(green/no pettern) on the basis of the values of

two gene values (gene1, gene2)

Gene1

Gene2

Supplementary Generalization Over-fitting

- The diagonal line represents a perfect classifier

for this problem (do not worry for the time being

how to mathematically represent or

computationally implement the line we will see

how to do so in the Neural Network and Support

Vector Machine segments)

Gene1

Gene2

Supplementary Generalization Over-fitting

- Lets solve the same problem from a small sample

one such possible small sample is

Gene1

Gene2

Supplementary Generalization Over-fitting

- We may be tempted to solve the problem with a

fairly complicated line

Gene1

Gene2

Supplementary Generalization Over-fitting

- In which case we get several errors

Gene1

Gene2

Supplementary Generalization Over-fitting

- whereas with a simpler line

Gene1

Gene2

Supplementary Generalization Over-fitting

- a much smaller error

Gene1

Gene2

Generalization Over-fitting

- In general, over-fitting a model to the data

means that instead of general properties of the

population from which the data is sampled we

learn idiosyncracies (i.e., non-representative

properties) of the sample data. - Over-fitting and poor generalization (i.e., the

error in the overall population (true error) is

large) are synonymous as long as we have learned

the training data well (i.e., small apparent

error). - Over-fitting is not only affected by the

simplicity of the classifier (e.g., straight vs

wiggly line) but also by - the size of the sample,
- the complexity of the function we wish to learn

from data, - the amount of noise, and
- the number and nature (continuous discrete,

ordered, distribution, etc.) of the variables.

Generalization Over-fitting

- We wish to particularly emphasize the danger of

grossly over-fitting the learning when the number

of predictive variables is large relative to the

available sample. Consider for example the

following situation - Assume we have 5 binary predictors and two

samples, and wish to classify instances into two

classes - The predictors can encode 2532 possible

distinct patterns. Assume all patterns are

equally probable. - Hence the chances of the two cases having

different predictive patterns are 31/3297. - Thus in 97 of our samples of size two, the five

variables are sufficient to identify perfectly

the case. - Combine this with a powerful enough learning

algorithm (i.e., one that can effectively

associate any pattern with the desired outcome)

and it follows that in 97 of samples, one gets

optimal apparent error even when there is no

relationship between the target variable and the

predictive variables!

Generalization Over-fitting

- This situation is particularly relevant in

bioinformatics in which we routinely have gt10,000

continuous variables, noise, and lt500 samples.

Under these conditions every training instance

has almost always a unique value set of the

predictive variables - thus if one is not careful, the learning

algorithm can simply learn what amounts to a

lookup table (i.e., by associating the unique

predictor signature with the outcome of that case

for every case).

Generalization Over-fitting

- So how does one avoid over-fitting?
- Via a variety of approaches
- Use learning algorithms that intrinsically (by

design) generalize well - Pursue simple (highly biased) classifiers for

small samples - Choose unbiased and low-variance statistical

estimators of the true error and employ them

sparingly - Very important rule
- Estimate the performance (true error) of a model

with data you did not use to construct the model

Generalization Over-fitting

- Avoiding over-fitting will be a primary concern

of ours in this tutorial - We will outline here some specific

cross-validation procedures and use them to build

models in the case studies segment

Generalization Over-fitting

- Hold-out cross-validation method
- Split data in Train and Test data
- Learn with Train and estimate true error with Test

test

data

train

Generalization Over-fitting

- N-fold Cross-validation
- Split data in Train and Test data n times such

that union of test sets is full data set - Learn with Train and estimate true error with

Test in each split separately - Average test performance

train

train

train

test

train

train

train

train

test

test

test

test

test

Generalization Over-fitting

- Leave-one-out
- n-fold C.V. where n is equal to the number of

data instances

Supplementary Generalization Over-fitting

- Stratified (balanced) Cross-validation
- An n-fold C.V. in which (by design) the target

class has the same distribution as in the full

dataset

Supplementary Generalization Over-fitting

- Nested Cross-validation
- Assume we wish to apply cross-validation to find

the best parameter values for parameter C for a

classifier from parameter value set 1,..,100. - One way to use C.V. to select the best values for

C is to apply the holdout method 100 times, one

for each value of C and select the value that

gives the best error in the test sample. - The problem with this approach is that the true

error estimate is not reliable since it is

produced by running the best model on a test set

that was used to derive the best model. - In other words, our data used to estimate the

true error can no longer be used to produce

unbiased estimates since it also guided the

selection of the model.

Supplementary Generalization Over-fitting

- solution
- Split the Train data into two (Traintrain and

Validation), - Use the validation set to find the best

parameters, - Use the test set to estimate the true error

test

data

Validation

Traintrain

Supplementary Generalization Over-fitting

- If the sample is small, the nesting can be

repeated with different assignment of the test

set (i.e., nested n-fold C.V.) - One can also nest LOO with n-fold C.V. or LOO

with LOO

Te

data

TT

TT

V

TT

TT

TT

V

TT

V

V

V

V

Te

Supplementary Generalization Over-fitting

- Important notes
- Estimating the true error of the best model is a

separate procedure than generating the best

model the former requires an additional layer of

nesting our cross-validation - When there are several types of parameters to be

selected (e.g., normalization, discretization,

classifier parameters) one can - do one n-fold cross-validation using the

cartesian product of all parameters which uses

more sample but yields more conservative true

error estimates, or one can - nest the cross-validation to as many nesting

levels as the number of distinct parameters that

need optimization, which yields more unbiased

true error estimates but uses less sample

Quick Notes On Data Preparation

Data preparation

- Non-specific
- Is the data lawfully in our disposal?
- Are there issues that deal with patient privacy

and confidentiality as well as intellectual

property issues that need be resolved? - How were the data produced, by whom, when, with

what purpose in mind? - Any known or plausible biases present?
- References in the literature?
- Is there a codebook with clear definitions of

variables location, date of creation, method of

creation, value list, value meaning, missing

value codes and meanings, history of the database

and its elements?

Data preparation

- Data specific
- Valid values?
- Variable distributions?
- Descriptive statistics?
- Mechanisms of missingness imputation

Data preparation

- Learner specific
- De-noising
- Scaling/Normalization
- Discretization
- Transforming variable distributions
- Co-linearities
- Homoskedasticity
- Outliers
- Feature selection

Data preparation

- Task specific
- Reconstruct hidden or distorted signals from

observed ones - Infer presence of hidden variables, determine

their cardinality and values - Stem, normalize, extract terms
- Weight or Project variables

Basic Evaluation Metrics

Evaluation Metrics

- T test
- D disease
- Accuracy (0/1 loss)
- Number of correct classifications ad
- Number of total classifications

abcd

D D-

T a b ab

T- c d cd

ac bd abcd

Evaluation Metrics

- T test
- D disease
- Sensitivity proportion of true positives

identified by test - a
- ac

D D-

T a b ab

T- c d cd

ac bd abcd

Evaluation Metrics

- T test
- D disease
- Specificity proportion of true negatives

identified by test - d
- bd

D D-

T a b ab

T- c d cd

ac bd abcd

Evaluation Metrics

- T test
- D disease
- Positive predictive value (PPV) proportion of

true positives over test positives - a
- ab

D D-

T a b ab

T- c d cd

ac bd abcd

Evaluation Metrics

- T test
- D disease
- Negative predictive value (NPV) proportion of

true negatives over test negatives - d
- cd

D D-

T a b ab

T- c d cd

ac bd abcd

Evaluation Metrics

- Mean squared error (MSE) (Quadratic loss)
- 1/D S (predicted_value(i)-true_value(i))2
- D cardinality of test dataset
- Suitable for continuous outputs

D

i

Evaluation Metrics

- ROC area

1

Sensitivity

0

1

1-Specificity

Evaluation Metrics

- In information retrieval
- Precision is the name for PPV and
- Recall is the name for Sensitivity

Evaluation Metrics

- Recall-precision curve (and area under it)

100

Recall

0

100

Precision

Bayesian Classifiers

- Note we will be discussing Bayesian classifiers

using the diagnostic context, (which in terms of

applications and historical development of the

related ideas is representative). However the

ideas discussed readily translate to any type of

learning for classification and concomitant

decision support function.

Bayesian Classifiers

- Bayes Theorem (or formula) says that
- P (D) P(F D)
- P (D F)
- P(F)
- Where
- P(D) is the probability of some disease D in the

general population (i.e., before obtaining some

evidence F), a.k.a. as disease prior

probability - P(F) is the probability of some evidence in the

form of findings such as lab tests, physical

examination findings etc. - P(F D) is the probability of the same findings

given that someone has disease D - P(D F) is the probability of disease D given

that someone has the findings F (i.e., after

obtaining some evidence F), a.k.a. as disease

posterior probability

Bayesian Classifiers

- Since the most likely diagnosis is the one with

the maximum a posteriori probability, Bayes

formula allows one to solve the differential

diagnosis problem, as well as as any

classification learning problem that can be cast

as supervised learning - Indeed, in the sample limit, there cannot be a

better way to infer the most likely diagnosis

than Bayes theorem and thus it serves as the

theoretical gold standard against which

statistical and machine learning classifiers are

measured in terms of true error. - In that context it is referenced as the Bayes

Optimal Classifier

Bayesian Classifiers

- Note that Bayes formula can be applied to

diagnosis of multiple possibly inter-depended

diseases and non-independent findings since where

there is F one can place a vector of findings

(e.g., F1, F2-, F3-,,Fn) and where there is

D one can put a vector of diseases (e.g., D1-,

D2-, D3,,Dm) .

Bayesian Classifiers

- Further, the intuitive interpretation of Bayes

rule is that of updating belief about the

patients true state before seeing F we have

some prior belief (measured as probability) that

the patient has disease(s) D. After seeing F we

update the prior belief (diagnosis) to reflect

(incorporate) the new evidence F the new belief

is the posterior produced by Bayess rule

Bayesian Classifiers

- Unfortunately there is a significant drawback

with straightforward Bayes rule we need number

of probabilities, storage and computational time

that is exponential to the number of findings

(i.e., F) and the number of diseases (i.e.,

D). - This means that for any diagnostic or other

classification problem of non-trivial size

(measured in terms of F and D) straight Bayes

is not feasible

Bayesian Classifiers

- This has led to a simplified version in which we

disallow multiple diseases (i.e., require that

the patient may have only one disease at a time)

and we require that findings are independent

conditioned on the disease states (note this

does not mean that the findings are independent

in general, but rather, that they are

conditionally independent). - The combination of these two assumptions yields

required number of probabilities, storage and

computational time that is linear to the number

of findings and the number of diseases.

Simple (a.k.a. Naive) Bayes

- Application of Bayes rule with the Mutual

Disease Exclusivity assumption (MEE) and the

Conditional Independence assumption (FCID) is

known as Simple Bayes Rule, Naïve Bayes, or,

rather non-tastefully, as Idiots Bayes. - Simple Bayes can be implemented by plugging in

the main formula - P(F D) P P(Fi Dj)
- and P( F) S P(Fi, Dj) S P(Fi Dj) P(Dj)
- where Fi is the ith (singular) finding and Dj the

jth (singular) disease. - Several other (mathematically) equivalent

formulations exist using sensitivities and

specificities, likelihood ratios or other

convenient building blocks

i,j

i,j

i,j

Simple (a.k.a. Naive) Bayes

- Simple Bayes was applied very early (from the

early 60s and on) in Medical Informatics for

diagnosis and optimal treatment selection as well

as sequential testing. - See for example the classic papers by Warner et

al (1961), DeDombal (1972), Leaper (1972), Gorry

and Barnett (1968)

Variants of Simple Bayes

- Since the MEE and FCID assumptions clearly are

violated in many medical contexts, researchers

early on sought to relax them and created

modified Bayesian classifiers that approximated

P(F D) (Fryback 1978) or assumed independent

diseases and multiple diagnoses

(Multi-membership model of Ben-Basat, 1980). - These models (and many others not mentioned here)

have primarily historical significance currently,

because - (a) It was shown (1997, Domingos and Pazzani)

that the MEE and FCID assumptions are not

necessary but sufficient conditions for a wide

variety of target functions under 0/1 loss - (b) Bayesian Networks were invented and as we

will see next they allow flexible representation

of dependencies so that parsimony and

tractability is maintained without compromising

soundness - (c) several other restricted Bayesian classifiers

have been shown to perform well in a variety of

practical settings

Bayesian Networks

Supplementary Bayesian Networks Overview

- A Note On Terminology
- Brief Historical Perspective
- The Bayesian Network Model and Its Uses
- Learning BNs
- Reference Resources

Supplementary Bayesian Networks A Note On

Terminology

- Bayesian Networks (or Nets) generic name
- Belief Networks subjective probability-based,

non-causal - Causal Probabilistic Networks frequentist

probability-based, causal

Supplementary Bayesian Networks A Note On

Terminology

- Various other names for special model classes
- Influence Diagrams (Howard and Mathesson)

incorporate decision and utility nodes. Used for

decision analyses - Dynamic Bayesian Networks (Dagum et al.)

temporal semantics. Used as alternatives to

multivariate time series models and dynamic

control - Markov Decision Processes (Dean et al.) for

decision policy formulation in temporally-evolving

domains - Modifiable Temporal Belief Networks (Aliferis et

al.) for well-structured and very large problem

models that involve time and causation and cannot

be stored explicitly

Supplementary Bayesian Networks Historical

Perspective

- Naïve Bayesian Model (mutually exclusive

diseases, findings independent given diseases)

predominant model for medical decision support

systems in the 60s and early 70s because it

requires linear number of parameters and

computational steps (to total findings and

diseases) - Theorem 1 (Minsky, Peot) Naïve Bayes heuristic

usefulness (expected classification performance)

over all domains gets exponentially worse as

number of variables increases - Theorem 2 (see Mitchell) Full Bayesian

classifierperfect classifier - However FBC impractical and serves as analytical

tool only

Supplementary Bayesian Networks Historical

Perspective

- In the late 70s and up to mid-80s this led to

Production Systems (i.e., rule-based systems,

that is simplifications of first-order logic).

The most influential version of PSs (Shortliffe,

Buchanan) handled uncertainty through a modular

account of subjective belief (the Certainty

Factor Calculus) - Theorem 3 (Heckerman) The CFC is inconsistent

with probability theory unless rule-space search

graph is a tree. Consequently, forward and

backward reasoning cannot be combined in a CFC PS

and still produce valid results

Supplementary Bayesian Networks Historical

Perspective

- That led to research (late 80s) in Bayesian

Networks which can vary expressiveness between

the full dependency (or even the full Bayesian

classifier) and the Naïve Bayes model (Pearl,

Cooper)

Variables Conditionally Independent Given

Categories Categories Mutually Exclusive

Variables Conditionally Dependent

Bayesian Networks

Supplementary Bayesian Networks Historical

Perspective

- In the early 90s researchers developed the first

algorithms for learning BNs from data

(Herskovits, Cooper, Heckerman) - In the mid 90s researchers (Spirtes, Glymour,

Sheines, Pearl, Verma) discovered methods to

learn CPNs from observational data(!). We will

cover the foundations of this in the causal

discovery segment. - Overall BNs is the brain child of computer

scientists, medical informaticians, artificial

intelligence researchers, and industrial

engineers and is considered to be the

representation language of choice for most

biomedical Decision Support Systems today

Bayesian Networks The Bayesian Network Model and

Its Uses

- BNGraph (Variables (nodes), dependencies (arcs))

Joint Probability Distribution Markov

Property - Graph has to be DAG (directed acyclic) in the

standard BN model

JPD P(A, B, C)0.006 P(A, B, C-)0.014 P(A,

B-, C)0.054 P(A, B-, C-)0.126 P(A-, B,

C)0.240 P(A-, B, C-)0.160 P(A-, B-,

C)0.240 P(A-, B-, C-)0.160

A

B

C

- Theorem 4 (Neapolitan) any JPD can be

represented in BN form

Bayesian Networks The Bayesian Network Model and

Its Uses

- Markov Property the probability distribution of

any node N given its parents P is independent of

any subset of the non-descendent nodes W of N

A

e.g., D B,C,E,F,G A F

A,D,E,F,G,H,I,J B, C

C

D

B

F

G

E

H

J

I

Bayesian Networks The Bayesian Network Model and

Its Uses

- Theorem 5 (Pearl) the Markov property enables us

to decompose (factor) the joint probability

distribution into a product of prior and

conditional probability distributions

The original JPD P(A, B, C)0.006 P(A, B,

C-)0.014 P(A, B-, C)0.054 P(A, B-,

C-)0.126 P(A-, B, C)0.240 P(A-, B,

C-)0.160 P(A-, B-, C)0.240 P(A-, B-,

C-)0.160 Becomes P(A)0.8 P(B A)0.1 P(B

A-)0.5 P(C A)0.3 P(C A-)0.6

A

Up to Exponential Saving in Number of Parameters!

B

C

P(V) P p(ViPa(Vi))

i

Bayesian Networks The Bayesian Network Model and

Its Uses

- As we will see in the causal discovery segment,

BNs are a useful language for automated causal

discovery because the Markov property captures

causality thus - Revealing confounders
- Modeling explaining away
- Modeling/understanding selection bias
- Modeling causal pathways
- Modeling manipulation in the presence of

confounders - Modeling manipulation in the presence of

selection bias - Identifying targets for manipulation in causal

chains

Bayesian Networks The Bayesian Network Model and

Its Uses

- Once we have a BN model of some domain we can ask

questions

A

- Forward P(D,I- A)?
- Backward P(A C, D)?
- Forward Backward
- P(D,C- I, E)?
- Arbitrary abstraction/Arbitrary
- predictors/predicted variables

C

D

B

F

G

E

H

J

I

Bayesian Networks The Bayesian Network Model and

Its Uses

- The Markov property tells us which variables are

important to predict a variable (Markov Blanket),

thus providing a principled way to reduce

variable dimensionality

A

C

D

B

F

G

E

H

J

I

Bayesian Networks Demonstration of Flexible

Representation

- A BN in which FCID holds

D1

D2

D3

F3

F2

F1

F4

Bayesian Networks Demonstration of Flexible

Representation

- A BN in which MEE holds

D1

D2

D3

F3

F2

F1

F4

Bayesian Networks Demonstration of Flexible

Representation

- A BN in which MEE and FCID hold

D

F3

F2

F1

F4

Bayesian Networks Demonstration of Flexible

Representation

- Hybrid assumptions

D1

D2

D3

D2

D3

F3

F8

F2

F1

F4

F7

F9

F5

F6

Inference Algorithms

- Exact
- Lauritzen Spigelhalter
- Cooper Recursive decomposition
- Stochastic-approximate
- Likelihood weighting
- Dagum and Luby
- Variational (approximate but not stochastic)
- Jordan et al. (1998) solves queries in QMR-DT in

seconds - Reference
- An Introduction to Variational Methods for

Graphical Methods (1998) Michael I. Jordan,

Zoubin Ghahramani, Tommi S. Jaakkola, Lawrence K.

Saul. Machine Learning - An Optimal Approximation Algorithm For Bayesian

Inference (1997) Paul Dagum, Michael Luby.

Artificial Intelligence - Probabilistic Reasoning in Expert Systems Theory

and Algorithms by Richard E. Neapolitan. Kohn

Wiley 1990

Theoretical Complexity

- Inference is NP-hard (Cooper (exact, 1990) Dagum

and Luby (stochastic, 1993)) - Learning is NP-hard (Chickering, 1994, Bukaert,

1995) - However Many widely-applicable algorithms are

very efficient (allowing up to thousands of

variables for inference and up to gt100,000

variables for focused learning)

Automatic Construction of Bayesian Networks from

Data

- For causal discovery
- Perl, Verma (1988)
- Spirtes, Glymour, Scheines, (1991)
- For classification/automatic DSS construction
- Herskovits, Cooper (1991) Kutato (entropy-based)
- Cooper, Herskovits (1992) K2 (Bayesian)
- to be discussed at length in second part
- Reference
- Computation, Causation, and Discovery by Clark

Glymour (Preface), Gregory F. Cooper (Editor),

2000, AAAI Press

Supplementary Bayesian Networks Sparse

Candidate Algorithm

- Repeat
- Select candidate parents Ci for each variable Xi
- Set new best NW B to be Gn s.t. Gn maximizes
- a Bayesian score Score(GD) where G is a member
- of class of BNs for which PaG(Xi) Í PaBprev(Xi)

" Xi - Until Convergence
- Return B

Restriction Step Maximization Step

Supplementary Bayesian Networks Sparse

Candidate Algorithm

- SCA proceeds by selecting up to k candidate

parents for each variable on the basis of

pair-wise association - Then search is performed for a best network

within the space defined by the union of all

potential parents identified in the previous step - The procedure is iterated by feeding the parents

in the currently best network to the restriction

step - Theorem 6 (Friedman) SCA monotonically improves

the quality of the examined networks - Convergence criterion no gain in score, and

maximum number of cycles with no improvement in

score

Supplementary Bayesian Networks Learning

Partial Models

- Partial model feature (Friedman et al.)
- Examples
- Order relation (is X an ascendant or descendent

of Y?) - Markov Blanket membership (Is A in the MB of B?)
- We want
- And we approximate it by

P(f(GD) S (f(G) p(GD))

G

m

Conf(f) S f(Gi)

1

m

i1

Supplementary BN Applications

Supplementary Pathfinder

- Heckerman et al early 90s
- Diagnosis and test selection of lymph-node

pathology - Assumes MEE but not FICD
- Similarity networks (special enhancement to BNs)

allow more efficient Knowledge acquisition - Myopic test selection strategy (similar to Gorry

and Barnett) combined monetary cost/expected

survival utility measure - Led to Intellipath commercial product
- Reference
- Heckerman DE, Horvitz EJ, Nathwani BN. Toward

normative expert systems Part I. The Pathfinder

project. Methods Inf Med. 1992 Jun31(2)90-105. - Heckerman DE, Nathwani BN. Toward normative

expert systems Part II. Probability-based

representations for efficient knowledge

acquisition and inference.Methods Inf Med. 1992

Jun31(2)106-16. - Heckerman DE, Nathwani BN. An evaluation of the

diagnostic accuracy of Pathfinder. Comput Biomed

Res. 1992 Feb25(1)56-74.

Supplementary QMR-DT

- Stanford, late 80s, early 90s
- Probabilistic formulation of QMR KB (subsequent

version of INTERNIST I) - Full-scope of INTERNIST/QMR
- Uses
- two-layered BN representation,
- No MEE or FICD assumptions
- Stochastic inference
- Reference
- Shwe M, Cooper G. An empirical analysis of

likelihood-weighting simulation on a large,

multiply connected medical belief network. Comput

Biomed Res. 1991 Oct24(5)453-75.

Supplementary Analysis of sensitivity of BNs to

errors in probability specification

- Henrion et al. 1996
- System CPCS (subset of QMR)
- Results average probabilities assigned to the

actual diseases showed small sensitivity even to

large amounts of noise. - Explanation One reason is that the criterion for

performance is average probability of the true

hypotheses, which is insensitive to symmetric

noise distributions. But, even asymmetric,

logodds-normal noise has modest effects. A second

reason is that the gold-standard posterior

probabilities are often near zero or one, and are

little disturbed by noise. - Reference
- Max Henrion, Malcolm Pradhan, Brendan Del Favero,

Kurt Huang, Gregory Provan and Paul O'Rorke. Why

is Diagnosis Using Belief Networks Insensitive to

Imprecision in Probabilities? UAI, 1996

Supplementary Temporal Causal and Spatial

Reasoning with Probabilistic methods

- Haddaway 1995 temporal, causal and probabilistic

FOL - Aliferis (97, 98) temporal and causal Bayesian

Networks with clear causal-temporal semantics - Spatio-temporal BNs for GI endoscopy
- Reference
- Aliferis CF, Cooper GF. Temporal representation

design principles an assessment in the domain of

liver transplantation. Proc AMIA Symp.

1998170-4. - Ngo L, Haddawy P, Krieger RA, Helwig J. Efficient

temporal probabilistic reasoning via

context-sensitive model construction. Comput Biol

Med. 1997 Sep27(5)453-76.

Supplementary Dynamic construction of BNs from

Knowledge Bases to solve problem instances of

interest (KBMC)

- Haddaway 1995 probabilistic FOL KB?BN
- Aliferis Modifiable Temporal BNs temporal and

causal Bayesian Networks with adjustable hybrid

granularities, variable time horizon, and

interacting subnetworks (contexts) 1996-8 - Koller et al. object-oriented BNs, 1997
- Reference
- Generating Bayesian Networks from Probability

Logic Knowledge Bases P. Haddawy , Proceedings

of the Tenth Conference on Uncertainty in

Artificial Intelligence, July, 1994. - Daphne Koller and Avi Pfeffer. Object-Oriented

Bayesian Networks , UAI, 1997 - Aliferis C.F. A Temporal representation and

Reasoning Model for Medical Decision Support

Systems, Doctoral Thesis, 1998

Supplementary Other applications

- Parsing of natural language with BNs
- Haug et al 1999
- Charniak et al.
- Extensive applications for classification and

discovery Margaritis et al (1999), Aliferis

Tsamardinos et al. (2001,2,3) - focused causal discovery (parents-children or

Markov Blankets) - feature selection
- to be discussed at length in second part
- Reference
- Fiszman M, Chapman WW, Evans SR, Haug PJ.

Automatic identification of pneumonia related

concepts on chest x-ray reports.Proc AMIA Symp.

199967-71. - Charniak E. Bayesian Networks Without Tears. AI

Magazine 1991

Simple Bayes Revisited

- Domingos and pazzani 1997
- Naïve Bayes assumptions are sufficient for

accurate probability estimates in the sample

limit but not necessary for a wide variety of

learning problems when accuracy is the evaluation

metric - In small samples even if the assumptions are

violated SB can do better than more expressive

representations due to the bias-variance

decomposition of the error - The best way to correct (extend) SB is not to

join highly-associated findings - These results explain the excellent performance

of SB in text categorization with thousands of

variables (words) and many other

learning/inference tasks even against more

expressive representations - Reference
- On the Optimality of the Simple Bayesian

Classifier under Zero-One Loss (1997) Pedro

Domingos and Michael Pazzani. Machine Learning

Other Restricted Bayesian Classifiers

- The TAN classifier augments Naïve Bayes with

augmenting edges among findings such that the

resulting network among the findings is a tree

D

F3

F2

F1

F4

Other Restricted Bayesian Classifiers

- The TAN multinet classifier uses a different TAN

for each value of D and then chooses the

predicted class to be