Machine Learning Methods for Decision Support and Discovery - PowerPoint PPT Presentation
Provided by: Constantin87 | Slides: 131 | Learn more at: http://www.dsl-lab.org

Transcript and Presenter's Notes

1
Machine Learning Methods for Decision Support and
Discovery
  • Constantin Aliferis M.D., Ph.D., Ioannis
    Tsamardinos Ph.D.
  • Discovery Systems Laboratory,
  • Department of Biomedical Informatics,
  • Vanderbilt University
  • 2004 MEDINFO Tutorial
  • 7 September 2004

2
Acknowledgments
  • Alexander Statnikov for code and for putting
    together the Resource Web page and CD
  • Doug Hardin, Pierre Massion, Yindalon
    Aphinyanaphongs, Laura E. Brown, and Nafeh
    Fananapazir for access to data and results for
    case studies

3
Goal
  • The purpose of this tutorial is
  • To help participants develop a solid
    understanding of some of the most useful machine
    learning methods.
  • To give several examples of how these methods can
    be applied in practice, and
  • To provide resources for expanding the knowledge
    gained in the tutorial.

4
Outline
  • Part I Overview and Foundations
  • 1. Tutorial Overview and goals
  • 2. Importance of Machine Learning for discovery
    and decision-support system construction
  • 3. A framework for inductive Machine Learning
  • 4. Generalization and Over-fitting
  • 5. Quick review of data preparation and model
    evaluation
  • 6. Families of methods
  • a. Bayesian classifiers
  • break
  • b. Neural Networks
  • c. Support Vector Machines
  • break
  • 7. Quick Review of Additional families
  • a. K-Nearest Neighbors,
  • b. Clustering,
  • c. Decision Tree Induction,
  • d. Genetic Algorithms

5
Outline (contd)
  • Part II. More Advanced Methods and Case Studies
  • 1. More Advanced Methods
  • a. Causal Discovery methods using Causal
    Probabilistic Networks
  • b. Feature selection
  • break
  • 2. Case Studies
  • a. Building a diagnostic model from gene
    expression data
  • b. Building a diagnostic model from mass
    spectrometry data
  • c. Categorizing text into content categories
  • break
  • d. Discovery of causal structure using Causal
    Probabilistic Network induction (demo)
  • 3. Conclusions and wrap-up
  • a. Resources for machine learning
  • b. Questions & feedback

6
Definitions & Importance of Machine Learning
7
A (Simplified) Motivating Example
  • Assume we wish to create a decision support
    system capable of diagnosing patients according
    to two categories: Lung Cancer and Normal.
  • The input to the system will be array gene
    expression measurements from tissue biopsies.

8
A (Simplified) Motivating Example
  • Little is currently known about how gene
    expression values differentiate human lung cancer
    tissue from normal tissue.
  • Thus we will use an automated approach in which a
    computer system will examine patients' array gene
    expression measurements and the correct diagnosis
    (provided by a pathologist).

9
A (Simplified) Motivating Example
  • The system will produce a program that implements
    a function assigning the correct diagnostic label
    to any pattern of array gene expression data (and
    not just the input-output patterns of the
    training data).
  • Thus the system will learn (i.e., generalize)
    from training data the general input-output
    function for our diagnosis problem.

10
A (Simplified) Motivating Example
  • What are the principles and specific methods that
    enable the creation of such learning systems?
  • What flavors of learning systems currently exist?
  • What are their capabilities and limitations?
  • These are some of the questions we will be
    addressing in this tutorial.

11
What is Machine Learning (ML)? How is it
different from Statistics and Data Mining?
  • Machine Learning is the branch of Computer
    Science (Artificial Intelligence in particular)
    that studies systems that learn.
  • Systems that learn = systems that improve their
    performance with experience.

12
What is Machine Learning (ML)? How is it
different from Statistics and Data Mining?
  • Typical tasks:
  • image recognition,
  • diagnosis,
  • elicitation of possible causal structure of
    problem domain,
  • game playing,
  • solving optimization problems,
  • prediction of structure or function of
    biomolecules,
  • text categorization,
  • identification of relevant variables, etc.

13
Indicative Example applications of ML in
Biomedicine
  • Bioinformatics:
  • Prediction of Protein Secondary Structure
  • Prediction of Signal Peptides
  • Gene Finding and Intron/Exon Splice Site
    Prediction
  • Diagnosis using cDNA and oligonucleotide array
    gene expression data
  • Identification of molecular subtypes of patients
    with various forms of cancer
  • Clinical problem areas:
  • Survival after Pneumonia (CAP)
  • Survival after Syncope
  • Diagnosis of Acute M.I.
  • Diagnosis of Prostate Cancer
  • Diagnosis of Breast Cancer
  • Prescription and monitoring in hemodialysis
  • Prediction of renal transplant graft failure

14
Importance of ML: Task Types
  • Diagnosis (what is the most likely disease given
    a set of clinical findings?),
  • Prognosis (what will be the outcome after a
    certain treatment has been given to a patient?),
  • Treatment selection (what treatment to give to a
    specific patient?),
  • Prevention (what is the likelihood that a
    specific patient will develop disease X if
    preventable risk factor Y is present?).
  • ML has practically replaced Knowledge Acquisition
    for building Decision Support (Expert) Systems.

15
Importance of ML: Task Types (cont'd)
  • Discovery:
  • Feature selection (e.g., what is a minimal set of
    laboratory values needed for pneumonia
    diagnosis?)
  • Concept formation (e.g., what are patterns of
    genomic instability as measured by array CGH that
    constitute molecular subtypes of lung cancer
    capable of guiding development of new
    treatments?)
  • Feature construction (e.g., how can
    mass-spectrometry signals be decomposed into
    individual variables that are highly predictive
    for detection of cancer and can be traced back to
    individual proteins that may play important roles
    in carcinogenesis?)
  • Information retrieval query construction (e.g.,
    what are PubMed MeSH terms that predict with high
    sensitivity and specificity whether medical
    journal articles talk about treatment?)
  • Questions about function, interactions, and
    structure (e.g., how do genes and proteins
    regulate each other in the cells of lower and
    higher organisms? what is the most likely
    function of a protein given the sequence of its
    amino acids?), etc.

16
What is Machine Learning (ML)? How is it
different from Statistics and Data Mining?
  • Broadly speaking, ML, DM, and Statistics have
    similar goals (modeling for classification and
    hypothesis generation or testing).
  • Statistics has traditionally emphasized models
    that can be solved analytically (for example,
    various versions of the Generalized Linear Model,
    GLM). To achieve this, restrictions on the
    expressive power of models and parametric
    distributional assumptions are heavily used.
  • Data Mining emphasizes very large-scale data
    storage, integration, retrieval and analysis
    (typically the last one as a secondary focus).
  • Machine Learning seeks to use computationally
    powerful approaches to learn very complex non- or
    quasi-parametric models of the data. Some of
    these models are closer to human representations
    of the problem domain per se (or of problem
    solving in the domain).

17
Importance of ML: Data Types and Volume
  • Overwhelming production of data:
  • Bioinformatics (mass-throughput assays for gene
    expression, protein abundance, SNPs)
  • Clinical Systems (EPR, POE)
  • Bibliographic collections
  • The Web: web pages, transaction records, ...

18
Importance of ML: Reliance on Hard Data and
Evidence
  • Machine learning has become critical for Decision
    Support System construction, given extensive
    cognitive biases and the corresponding need to
    base MDSSs on hard scientific evidence and
    high-quality data.

19
Supplementary: Cognitive Biases
  • Main thesis:
  • human cognitive abilities are tailored to support
    instinctive, reflexive, life-preserving reactions
    traced back to our evolution as a species. They
    are not designed for rational, rigorous reasoning
    such as the reasoning needed in science and
    engineering.
  • In other words, there is a disconnect between our
    innate cognitive ability and the complexity of
    reasoning tasks required by the explosive
    advances in science and technology in the last
    few hundred years.

20
Supplementary: But is the Cognitive Biases Thesis
Correct?
  • Psychology of Judgment and Decision Making
    (Plous)
  • Tversky and Kahneman (Judgment under Uncertainty:
    Heuristics and Biases)
  • Methods of Influence (Cialdini)
  • And highly-recommended supplementary information
    can be found in:
  • Professional Judgment (Elstein)
  • Institute of Medicine's Report on Medical Errors
    (1999)

21
Supplementary: Tversky and Kahneman, Judgment
under Uncertainty: Heuristics and Biases
  • This work (a constellation of psychological
    studies converging to a description of human
    decision making under uncertainty) is very highly
    regarded and influential; it was awarded the 2002
    Nobel Prize in Economics.
  • Main points:
  • People use a few simple heuristics when making
    judgments under uncertainty
  • These heuristics sometimes are useful and other
    times lead to severe and systematic errors
  • These heuristics are representativeness,
    availability and anchoring

22
  • Supplementary: Representativeness
  • E.g., the probability P that patient X has
    disease D given that she has findings F is
    assessed by the similarity of X to a prototypical
    description of D (found in a textbook, or
    recalled from earlier practice and training).
  • Why is this wrong?
  • Reason 1: similarity ignores the base rate of D
  • Reason 2: similarity ignores sample size
  • Reason 3: similarity ignores predictability
  • Reason 4: similarity is affected by redundant
    features

23
  • Supplementary: Availability
  • E.g., the probability P that patient X with
    disease D given that she is given treatment T
    will become healthy is assessed by recalling such
    occurrences in one's prior experience
  • Why is this wrong?
  • Reason 1: availability is influenced by
    familiarity
  • Reason 2: availability is influenced by salience
  • Reason 3: availability is influenced by elapsed
    time
  • Reason 4: availability is influenced by rate of
    abstract terms
  • Reason 5: availability is influenced by
    imaginability
  • Reason 6: availability is influenced by
    perceived association

24
  • Supplementary: Adjustment and Anchoring
  • E.g., the probability P that patient X has
    disease D given that she has findings F is
    assessed by making an initial estimate P1 for
    findings F1 and updating it as new evidence F2,
    F3, and so on, becomes available.
  • What goes wrong?
  • Problem 1: the initial estimate over-influences
    the final estimate
  • Problem 2: the initial estimate is often based on
    quick calculations that are then extrapolated
  • Problem 3: people overestimate the probability
    of conjunctive events
  • Problem 4: depending on the initial anchor,
    people's predictions are calibrated differently

25
Supplementary: Additional
  • Methods of Influence (Cialdini, 1993)
  • Reciprocation
  • Commitment & Consistency
  • Social Proof
  • Liking
  • Authority
  • Scarcity
  • Professional Judgment (Dowie and Elstein, 1988)
  • Institute of Medicine's Report on Medical Errors
    (1999)

26
Supplementary: Putting MDSSs and Machine Learning
in Historical Context
  • 40s
  • Foundations of formal decision-making theory by
    Von Neumann and Morgenstern
  • 50s
  • Ledley and Lusted lay out how logic and
    probabilistic reasoning can help in diagnosis and
    treatment selection in medicine
  • 60s
  • Applications of Bayes' theorem for diagnosis and
    treatment selection pioneered by Warner and
    DeDombal
  • Medline (NLM)
  • Early 70s
  • Ad-hoc systems (Myers et al., Pauker et al.)
  • Study of cognitive biases (Kahneman, Tversky)
  • Late 70s
  • Rule-based systems (Buchanan & Shortliffe)

27
Supplementary: Milestones in MDSSs
  • 80s
  • Analysis of ad-hoc systems and RBSs (Heckerman et
    al.)
  • Bayesian Networks (Pearl, Cooper, Heckerman et
    al.)
  • Medical Decision Making as a discipline (Pauker)
  • Literature-driven decision support (Rennels &
    Shortliffe)
  • Early 90s
  • Web-enabled decision support; widespread
    information retrieval
  • Computational causal discovery (Pearl, Spirtes et
    al., Cooper et al.)
  • Sound re-formulation of very large ad-hoc systems
    (Shwe et al.)
  • Analysis of Bayesian systems (Domingos et al.,
    Henrion et al.)
  • Proliferation of focused Statistics and Machine
    Learning MDSSs
  • First-order logics that combine classical FOL
    with probabilistic reasoning, causation and
    planning (Haddawy)

28
Supplementary: Milestones in MDSSs (cont'd)
  • Late 90s
  • Efficient inference for very large probabilistic
    systems (Jordan et al.)
  • Kernel-based methods for sample-efficient
    learning (Vapnik)
  • Evidence-Based Medicine (Haynes et al.)
  • 21st Century
  • Diagnosis, prognosis and treatment selection
    (a.k.a. personalized medicine or
    pharmacogenomics) based on molecular
    information (proteomic spectra, gene expression
    arrays, SNPs) collected via mass-throughput
    assaying technology and modeled using machine
    learning methods
  • Provider-order-entry delivery of advanced
    decision support
  • Advanced representation, storage, retrieval and
    application of EBM information (guidelines,
    journals, meta-analyses, clinical bioinformatics
    models)

29
Importance of ML
  • How often are ML techniques used? Articles in
    Medline (in parentheses: last 2 years):
  • Artificial Intelligence 12,441 (2,358)
  • Expert systems 2,271 (121)
  • Neural Networks 5,403 (1,158)
  • Support Vector Machines 163 (121)
  • Clustering 17,937 (4,080)
  • Genetic Algorithms 2,798 (969)
  • Decision Trees 4,958 (752)
  • Bayesian (Belief) Networks 1,627 (585)
  • Bayes (Bayesian Statistics Nets) 4,369 (561)
  • Compare to
  • Regression 164,305 (28,134)
  • Knowledge acquisition 310 (56)
  • Knowledge representation 227 (27)
  • 4 major Symbolic DSS 145 (10)
  • (Internist-I, QMR, ILIAD, DxPlain)
  • Rule-based systems 802 (151)

30
Importance of ML
  • Importance of ML becomes very evident in cases
    where
  • data analysis is too time consuming (e.g.,
    classify web pages or Medline documents into
    content or quality categories)
  • there is little or no domain theory

What is the diagnosis?
Is this an early cancer?
31
A Framework for Inductive ML & Related
Introductory Concepts
32
What is the difference between supervised and
unsupervised ML methods?
  • Supervised learning:
  • Give the learning algorithm several instances of
    input-output pairs; the algorithm learns to
    predict the correct output that corresponds to an
    input (not only previously seen inputs but also
    previously unseen ones (generalization)).
  • In our original example: show the learning
    algorithm array gene expression measurements from
    several patient cases as well as normal subjects;
    the learning algorithm then induces a classifier
    that can classify a previously unseen subject to
    the correct diagnostic category, given the gene
    expression values observed in that subject.

33
Classification

[Diagram: train instances (variables A-E, with rows
A1, B1, C1, D1, E1 ... An, Bn, Cn, Dn, En) are fed to a
classifier-inductive algorithm, which outputs a
classifier; the classifier is then applied to
application instances and scored for classification
performance.]
34
What is the difference between supervised and
unsupervised ML methods?

  • Unsupervised learning:
  • Discover the categories (or other structural
    properties of the domain)
  • Example: give the learning algorithm gene
    expression measurements of patients with Lung
    Cancer as well as normal subjects; the algorithm
    finds sub-types (molecular profiles) of patients
    that are very similar to each other, and
    different from the rest of the types. Or another
    algorithm may discover how various genes interact
    among themselves to determine development of
    cancer.
35
Discovery

[Diagram: train instances (variables A-E, with rows
A1, B1, C1, D1, E1 ... An, Bn, Cn, Dn, En) are fed to a
structure-induction algorithm, which outputs a graph
structure over A, B, C, D, E that is then scored for
performance.]
36
A first concrete attempt at solving our
hypothetical diagnosis problem using a particular
type of learning approach (decision tree
induction)
37
Decision Tree Induction
  • An example decision tree to solve the problem of
    how to classify subjects into lung cancer vs
    normal

[Diagram: decision tree rooted at Gene139. If Gene139
is under-expressed, predict Lung cancer. If
over-expressed, test Gene202: over-expressed predicts
Lung cancer, normally-expressed predicts Normal. If
normally-expressed, test Gene8766: over-expressed
predicts Lung cancer, normally-expressed predicts
Normal.]
38
How Can I Learn Such A Decision Tree
Automatically?
  • A basic induction procedure is very simple in
    principle
  • Start with an empty tree
  • Put at the root of the tree the variable that
    best classifies the training examples
  • Create branches under the variable corresponding
    to its values
  • Under each branch repeat the process with the
    remaining variables
  • Until we run out of variables or sample
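The procedure above can be sketched in Python. This is a minimal illustration, not the tutorial's implementation: the split criterion used here (fewest misclassified training examples) and the toy gene data are assumptions made for brevity.

```python
from collections import Counter

def best_variable(rows, variables, target):
    # Choose the variable whose split leaves the fewest
    # misclassified training examples (a simple purity criterion).
    def split_errors(var):
        total = 0
        for value in set(r[var] for r in rows):
            labels = [r[target] for r in rows if r[var] == value]
            total += len(labels) - Counter(labels).most_common(1)[0][1]
        return total
    return min(variables, key=split_errors)

def induce_tree(rows, variables, target):
    labels = [r[target] for r in rows]
    majority = Counter(labels).most_common(1)[0][0]
    # Stop when the sample is pure or we run out of variables.
    if len(set(labels)) == 1 or not variables:
        return majority
    var = best_variable(rows, variables, target)
    rest = [v for v in variables if v != var]
    branches = {value: induce_tree([r for r in rows if r[var] == value],
                                   rest, target)
                for value in set(r[var] for r in rows)}
    return (var, branches, majority)

def classify(tree, row):
    # Walk down the tree; fall back to the local majority label
    # for values never seen under this branch.
    while isinstance(tree, tuple):
        var, branches, majority = tree
        tree = branches.get(row.get(var), majority)
    return tree

# Hypothetical training data: discretized expression of two genes.
data = [
    {"Gene139": "over",  "Gene202": "over",   "label": "cancer"},
    {"Gene139": "over",  "Gene202": "normal", "label": "normal"},
    {"Gene139": "under", "Gene202": "over",   "label": "cancer"},
    {"Gene139": "under", "Gene202": "normal", "label": "cancer"},
]
tree = induce_tree(data, ["Gene139", "Gene202"], "label")
```

On the four training rows the induced tree reproduces the labels; unseen value combinations fall back to the majority label of the nearest branch.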

39
How Can We Generalize From This Example?
40
A General Description of Supervised Inductive ML
  • Inductive Machine Learning algorithms can be
    designed and analyzed using the following
    framework
  • A language L in which we express models. The set
    of all possible models expressible in L
    constitutes our hypothesis space H
  • A scoring metric M tells us how good is a
    particular model
  • A search procedure S helps us identify the best
    model in H

[Diagram: the space of all possible models (x's), with
the models in H as a subset.]
41
A General Description of Supervised Inductive ML
  • In our decision tree example:
  • A language L in which we express models:
    decision trees
  • The hypothesis space H: space of all decision
    trees that can be constructed with genes 1 to n
  • A scoring metric M telling us how good a
    particular model is: min (classification error +
    model complexity)
  • A search procedure S: greedy search

42
How can ML methods fail?

  • Wrong Language Bias: the best model is not in H
  • Example: we look for models expressible as
    discrete decision trees but the domain is
    continuous

[Diagram: the best model (x) lies in the space of all
possible models but outside H.]
43
How can ML methods fail?

  • Search Failure: the best model is in H but the
    search fails to examine it
  • Example: greedy search fails to capture a strong
    gene-gene interaction effect

[Diagram: the best model (x) lies inside H but is
missed by the search.]
44
Generalization & Over-fitting
45
Generalization & Over-fitting
  • It was mentioned previously that a good learning
    program learns something about the data beyond
    the specific cases that have been presented to
    it.
  • Indeed, it is trivial to just store and retrieve
    the cases that have been seen in the past (rote
    learning implemented as a lookup table). This
    does not address the problem of how to handle new
    cases, however.

46
Generalization & Over-fitting
  • In supervised learning we typically seek to
    minimize the i.i.d. error, that is, the error
    over future cases (not used in training). Such
    cases include both previously encountered and new
    cases.
  • i.i.d. = independently sampled and identically
    distributed problem instances.
  • In other words, the training and application
    samples come from the same population
    (distribution) with identical probability of
    being selected for inclusion, and this
    population/distribution is time-invariant.
  • (Note: if not time-invariant, then by
    incorporating time as an independent variable or
    by other appropriate transformations we restore
    the i.i.d. condition.)

47
Supplementary: Generalization & Over-fitting
  • Consider now the following simplified diagnostic
    classification problem: classify patients into
    cancer (red/vertical pattern) versus normal
    (green/no pattern) on the basis of the values of
    two genes (gene1, gene2)

[Figure: scatter plot of Gene1 vs. Gene2]
48
Supplementary: Generalization & Over-fitting
  • The diagonal line represents a perfect classifier
    for this problem (do not worry for the time being
    about how to mathematically represent or
    computationally implement the line; we will see
    how to do so in the Neural Network and Support
    Vector Machine segments)

[Figure: Gene1 vs. Gene2 with a diagonal separating
line]
49
Supplementary: Generalization & Over-fitting
  • Let's solve the same problem from a small sample;
    one such possible small sample is:

[Figure: scatter plot of a small sample, Gene1 vs.
Gene2]
50
Supplementary: Generalization & Over-fitting
  • We may be tempted to solve the problem with a
    fairly complicated line

[Figure: Gene1 vs. Gene2 with a complicated decision
boundary]
51
Supplementary: Generalization & Over-fitting
  • In which case we get several errors

[Figure: Gene1 vs. Gene2; the complicated boundary
misclassifies several points]
52
Supplementary: Generalization & Over-fitting
  • whereas with a simpler line ...

[Figure: Gene1 vs. Gene2 with a simple (straight)
boundary]
53
Supplementary: Generalization & Over-fitting
  • ... we get a much smaller error

[Figure: Gene1 vs. Gene2; the simple boundary makes
fewer errors]
54
Generalization & Over-fitting
  • In general, over-fitting a model to the data
    means that instead of general properties of the
    population from which the data is sampled, we
    learn idiosyncrasies (i.e., non-representative
    properties) of the sample data.
  • Over-fitting and poor generalization (i.e., the
    error in the overall population (true error) is
    large) are synonymous as long as we have learned
    the training data well (i.e., small apparent
    error).
  • Over-fitting is not only affected by the
    simplicity of the classifier (e.g., straight vs.
    wiggly line) but also by:
  • the size of the sample,
  • the complexity of the function we wish to learn
    from data,
  • the amount of noise, and
  • the number and nature (continuous/discrete,
    ordered, distribution, etc.) of the variables.

55
Generalization & Over-fitting
  • We wish to particularly emphasize the danger of
    grossly over-fitting when the number of
    predictive variables is large relative to the
    available sample. Consider for example the
    following situation:
  • Assume we have 5 binary predictors and two
    samples, and wish to classify instances into two
    classes.
  • The predictors can encode 2^5 = 32 possible
    distinct patterns. Assume all patterns are
    equally probable.
  • Hence the chance of the two cases having
    different predictive patterns is 31/32, or about
    97%.
  • Thus in 97% of our samples of size two, the five
    variables are sufficient to identify perfectly
    the case.
  • Combine this with a powerful enough learning
    algorithm (i.e., one that can effectively
    associate any pattern with the desired outcome)
    and it follows that in 97% of samples, one gets
    optimal apparent error even when there is no
    relationship between the target variable and the
    predictive variables!
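A short sketch that checks this arithmetic, with a Monte Carlo confirmation (the trial count and seed are arbitrary choices):

```python
from random import Random

n_predictors = 5
patterns = 2 ** n_predictors            # 2^5 = 32 distinct patterns
p_differ = (patterns - 1) / patterns    # 31/32: chance two cases differ
print(f"analytic : {p_differ:.4f}")     # ~0.97

# Monte Carlo check: draw pairs of random 5-bit patterns.
rng = Random(0)
trials = 100_000
differ = sum(
    rng.getrandbits(n_predictors) != rng.getrandbits(n_predictors)
    for _ in range(trials)
)
print(f"simulated: {differ / trials:.4f}")
```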

56
Generalization & Over-fitting
  • This situation is particularly relevant in
    bioinformatics, in which we routinely have
    >10,000 continuous variables, noise, and <500
    samples. Under these conditions every training
    instance almost always has a unique value set of
    the predictive variables;
  • thus if one is not careful, the learning
    algorithm can simply learn what amounts to a
    lookup table (i.e., by associating the unique
    predictor signature with the outcome of that case
    for every case).

57
Generalization & Over-fitting
  • So how does one avoid over-fitting?
  • Via a variety of approaches:
  • Use learning algorithms that intrinsically (by
    design) generalize well
  • Pursue simple (highly biased) classifiers for
    small samples
  • Choose unbiased and low-variance statistical
    estimators of the true error and employ them
    sparingly
  • Very important rule:
  • Estimate the performance (true error) of a model
    with data you did not use to construct the model

58
Generalization & Over-fitting
  • Avoiding over-fitting will be a primary concern
    of ours in this tutorial.
  • We will outline here some specific
    cross-validation procedures and use them to build
    models in the case studies segment.

59
Generalization & Over-fitting
  • Hold-out cross-validation method:
  • Split the data into Train and Test sets
  • Learn with Train and estimate the true error with
    Test

[Diagram: the data split into a train portion and a
test portion]
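The hold-out method can be sketched as follows; the 70/30 split fraction and the toy data are arbitrary illustrative choices.

```python
import random

def holdout_split(data, test_fraction=0.3, seed=0):
    """Shuffle once and split the data into (train, test)."""
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

train, test = holdout_split(range(100))
print(len(train), len(test))   # 70 30
```

Learning happens only on `train`; the error measured on `test` is the estimate of the true error.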
60
Generalization & Over-fitting
  • N-fold Cross-validation:
  • Split the data into Train and Test sets n times,
    such that the union of the test sets is the full
    data set
  • Learn with Train and estimate the true error with
    Test in each split separately
  • Average the test performance

[Diagram: n splits of the data; in each split a
different fold serves as the test set and the remaining
folds form the train set]
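N-fold cross-validation can be sketched as below; the toy "learner" (predict the training mean) and the squared-error score are stand-ins for whatever model and metric one actually uses.

```python
def n_fold_splits(data, n):
    """Yield (train, test) pairs whose test folds partition the data."""
    folds = [data[i::n] for i in range(n)]
    for i in range(n):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, folds[i]

def cross_validate(data, n, fit, error):
    """Fit on each train split, score on the matching test split, average."""
    scores = [error(fit(train), test) for train, test in n_fold_splits(data, n)]
    return sum(scores) / n

# Toy example: the 'model' is just the training mean, scored by squared error.
data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
mse = cross_validate(
    data, n=5,
    fit=lambda train: sum(train) / len(train),
    error=lambda m, test: sum((x - m) ** 2 for x in test) / len(test),
)
```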
61
Generalization & Over-fitting
  • Leave-one-out:
  • n-fold C.V. where n is equal to the number of
    data instances

62
Supplementary: Generalization & Over-fitting
  • Stratified (balanced) Cross-validation:
  • An n-fold C.V. in which (by design) the target
    class has the same distribution in each fold as
    in the full dataset

63
Supplementary: Generalization & Over-fitting
  • Nested Cross-validation:
  • Assume we wish to apply cross-validation to find
    the best value of a classifier parameter C from
    the parameter value set 1,..,100.
  • One way to use C.V. to select the best value for
    C is to apply the holdout method 100 times, once
    for each value of C, and select the value that
    gives the best error in the test sample.
  • The problem with this approach is that the true
    error estimate is not reliable, since it is
    produced by running the best model on a test set
    that was used to derive the best model.
  • In other words, the data used to estimate the
    true error can no longer be used to produce
    unbiased estimates, since it also guided the
    selection of the model.

64
Supplementary: Generalization & Over-fitting
  • Solution:
  • Split the Train data into two parts (Train-train
    and Validation),
  • Use the validation set to find the best
    parameters,
  • Use the test set to estimate the true error

[Diagram: the data split into train-train, validation,
and test portions]
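The train-train / validation / test protocol can be sketched as follows. The threshold "classifier", the candidate parameter set, and the 25/25/50 split are hypothetical; the point is only that the test set never influences which parameter is chosen.

```python
import random

def select_and_estimate(data, params, fit, error, seed=0):
    """Fit one model per candidate parameter on the train-train set,
    pick the parameter with the lowest validation error, and report
    that model's error on the untouched test set."""
    rng = random.Random(seed)
    rows = list(data)
    rng.shuffle(rows)
    n = len(rows)
    test = rows[: n // 4]
    validation = rows[n // 4 : n // 2]
    traintrain = rows[n // 2 :]
    best = min(params, key=lambda p: error(fit(traintrain, p), validation))
    return best, error(fit(traintrain, best), test)

# Hypothetical task: pick a threshold for (value, label) pairs.
data = [(x, x >= 5) for x in range(20)]
fit = lambda rows, t: t                                   # the 'model' is the threshold
error = lambda t, rows: sum((x >= t) != y for x, y in rows) / len(rows)
best, test_err = select_and_estimate(data, [1, 3, 5, 7, 9], fit, error)
```

Only `test_err`, computed on data untouched during selection, is reported as the true-error estimate.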
65
Supplementary: Generalization & Over-fitting
  • If the sample is small, the nesting can be
    repeated with different assignments of the test
    set (i.e., nested n-fold C.V.)
  • One can also nest LOO with n-fold C.V., or LOO
    with LOO

[Diagram: nested n-fold C.V.; in each outer split the
test set (Te) is held out while the remaining data is
rotated through train-train (TT) and validation (V)
folds]
66
Supplementary: Generalization & Over-fitting
  • Important notes:
  • Estimating the true error of the best model is a
    separate procedure from generating the best
    model; the former requires an additional layer of
    nesting in our cross-validation
  • When there are several types of parameters to be
    selected (e.g., normalization, discretization,
    classifier parameters) one can:
  • do one n-fold cross-validation using the
    Cartesian product of all parameters, which uses
    more sample but yields more conservative true
    error estimates, or one can
  • nest the cross-validation to as many nesting
    levels as the number of distinct parameters that
    need optimization, which yields less biased
    true error estimates but uses less sample

67
Quick Notes On Data Preparation
68
Data preparation
  • Non-specific:
  • Is the data lawfully at our disposal?
  • Are there issues of patient privacy and
    confidentiality, or intellectual property issues,
    that need to be resolved?
  • How were the data produced, by whom, when, and
    with what purpose in mind?
  • Any known or plausible biases present?
  • References in the literature?
  • Is there a codebook with clear definitions of
    variables: location, date of creation, method of
    creation, value list, value meaning, missing
    value codes and meanings, history of the database
    and its elements?

69
Data preparation
  • Data-specific:
  • Valid values?
  • Variable distributions?
  • Descriptive statistics?
  • Mechanisms of missingness & imputation

70
Data preparation
  • Learner-specific:
  • De-noising
  • Scaling/Normalization
  • Discretization
  • Transforming variable distributions
  • Co-linearities
  • Homoskedasticity
  • Outliers
  • Feature selection

71
Data preparation
  • Task-specific:
  • Reconstruct hidden or distorted signals from
    observed ones
  • Infer the presence of hidden variables, determine
    their cardinality and values
  • Stem, normalize, extract terms
  • Weight or project variables

72
Basic Evaluation Metrics
73
Evaluation Metrics
  • T = test, D = disease
  • Accuracy (0/1 loss) =
    (a + d) / (a + b + c + d)
    (number of correct classifications over the
    number of total classifications)

         D+       D-
  T+     a        b        a+b
  T-     c        d        c+d
         a+c      b+d      a+b+c+d
74
Evaluation Metrics
  • T = test, D = disease
  • Sensitivity: proportion of true positives
    identified by the test = a / (a + c)
75
Evaluation Metrics
  • T = test, D = disease
  • Specificity: proportion of true negatives
    identified by the test = d / (b + d)
76
Evaluation Metrics
  • T = test, D = disease
  • Positive predictive value (PPV): proportion of
    true positives over test positives = a / (a + b)
77
Evaluation Metrics
  • T = test, D = disease
  • Negative predictive value (NPV): proportion of
    true negatives over test negatives = d / (c + d)
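The metrics on the last few slides are simple ratios over the 2x2 table; a sketch (the cell counts in the example call are invented):

```python
def metrics(a, b, c, d):
    """a=TP, b=FP, c=FN, d=TN, following the slides' 2x2 table."""
    total = a + b + c + d
    return {
        "accuracy":    (a + d) / total,
        "sensitivity": a / (a + c),   # a.k.a. recall
        "specificity": d / (b + d),
        "ppv":         a / (a + b),   # a.k.a. precision
        "npv":         d / (c + d),
    }

m = metrics(40, 10, 5, 45)   # hypothetical counts
```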
78
Evaluation Metrics
  • Mean squared error (MSE) (quadratic loss):
  • MSE = (1/|D|) * SUM over i in D of
    (predicted_value(i) - true_value(i))^2
  • |D| = cardinality of the test dataset D
  • Suitable for continuous outputs
79
Evaluation Metrics
  • ROC area

[Figure: ROC curve; Sensitivity (0 to 1) on the y-axis
vs. 1 - Specificity (0 to 1) on the x-axis]
80
Evaluation Metrics
  • In information retrieval:
  • Precision is the name for PPV and
  • Recall is the name for Sensitivity

81
Evaluation Metrics
  • Recall-precision curve (and the area under it)

[Figure: recall-precision curve; Recall (0 to 100%) vs.
Precision (0 to 100%)]
82
Bayesian Classifiers
  • Note: we will be discussing Bayesian classifiers
    in the diagnostic context (which in terms of
    applications and historical development of the
    related ideas is representative). However, the
    ideas discussed readily translate to any type of
    learning for classification and concomitant
    decision support function.

83
Bayesian Classifiers
  • Bayes' Theorem (or formula) says that:

    P(D | F) = P(D) * P(F | D) / P(F)

  • Where:
  • P(D) is the probability of some disease D in the
    general population (i.e., before obtaining some
    evidence F), a.k.a. the disease prior
    probability
  • P(F) is the probability of some evidence in the
    form of findings such as lab tests, physical
    examination findings, etc.
  • P(F | D) is the probability of the same findings
    given that someone has disease D
  • P(D | F) is the probability of disease D given
    that someone has the findings F (i.e., after
    obtaining the evidence F), a.k.a. the disease
    posterior probability
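For a single binary disease and a single finding, the formula can be applied directly. A sketch, expanding P(F) over the two disease states; the prior, sensitivity, and false-positive rate in the example are invented numbers:

```python
def posterior(prior, p_f_given_d, p_f_given_not_d):
    """P(D|F) = P(D) * P(F|D) / P(F), with
    P(F) = P(D) * P(F|D) + P(not D) * P(F|not D)."""
    p_f = prior * p_f_given_d + (1 - prior) * p_f_given_not_d
    return prior * p_f_given_d / p_f

# A rare disease (prior 1%) with a sensitive but imperfect finding:
p = posterior(prior=0.01, p_f_given_d=0.95, p_f_given_not_d=0.05)
```

With a 1% prior, even this strong finding yields a posterior of only about 16%: the base-rate effect noted in the representativeness slides.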

84
Bayesian Classifiers
  • Since the most likely diagnosis is the one with
    the maximum a posteriori probability, Bayes'
    formula allows one to solve the differential
    diagnosis problem, as well as any classification
    learning problem that can be cast as supervised
    learning.
  • Indeed, in the sample limit, there cannot be a
    better way to infer the most likely diagnosis
    than Bayes' theorem, and thus it serves as the
    theoretical gold standard against which
    statistical and machine learning classifiers are
    measured in terms of true error.
  • In that context it is referenced as the Bayes
    Optimal Classifier.

85
Bayesian Classifiers
  • Note that Bayes' formula can be applied to the
    diagnosis of multiple, possibly inter-dependent
    diseases and non-independent findings, since
    where there is F one can place a vector of
    findings (e.g., F1+, F2-, F3-, ..., Fn+) and
    where there is D one can put a vector of diseases
    (e.g., D1-, D2-, D3+, ..., Dm+).

86
Bayesian Classifiers
  • Further, the intuitive interpretation of Bayes'
    rule is that of updating belief about the
    patient's true state: before seeing F we have
    some prior belief (measured as probability) that
    the patient has disease(s) D. After seeing F we
    update the prior belief (diagnosis) to reflect
    (incorporate) the new evidence F; the new belief
    is the posterior produced by Bayes' rule

87
Bayesian Classifiers
  • Unfortunately there is a significant drawback
    to straightforward Bayes' rule: we need a number
    of probabilities, and an amount of storage and
    computational time, that is exponential in the
    number of findings (i.e., |F|) and the number of
    diseases (i.e., |D|).
  • This means that for any diagnostic or other
    classification problem of non-trivial size
    (measured in terms of |F| and |D|), straight
    Bayes is not feasible

88
Bayesian Classifiers
  • This has led to a simplified version in which we
    disallow multiple diseases (i.e., require that
    the patient may have only one disease at a time)
    and we require that findings are independent
    conditioned on the disease states (note: this
    does not mean that the findings are independent
    in general, but rather that they are
    conditionally independent).
  • The combination of these two assumptions yields
    a required number of probabilities, storage, and
    computational time that is linear in the number
    of findings and the number of diseases.

89
Simple (a.k.a. Naive) Bayes
  • Application of Bayes' rule with the Mutually
    Exclusive (and Exhaustive) diseases assumption
    (MEE) and the Findings Conditional Independence
    assumption (FCID) is known as Simple Bayes Rule,
    Naïve Bayes, or, rather non-tastefully, as
    Idiot's Bayes.
  • Simple Bayes can be implemented by plugging into
    the main formula

    P(F | Dj) = Π_i P(Fi | Dj)
    and P(F) = Σ_j P(F, Dj) = Σ_j P(F | Dj) P(Dj)

    where Fi is the ith (singular) finding and Dj the
    jth (singular) disease.
  • Several other (mathematically) equivalent
    formulations exist, using sensitivities and
    specificities, likelihood ratios, or other
    convenient building blocks
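A minimal Simple Bayes sketch: one mutually exclusive disease variable, findings assumed conditionally independent given the disease. All names and probabilities below are hypothetical:

```python
# Hypothetical priors and conditionals
priors = {"flu": 0.3, "cold": 0.7}
cond = {  # P(finding present | disease)
    "flu":  {"fever": 0.9, "cough": 0.8},
    "cold": {"fever": 0.2, "cough": 0.7},
}

def posterior(findings):
    """P(Dj | findings) = prod_i P(Fi | Dj) * P(Dj), normalized by P(F)."""
    unnorm = {}
    for d, p in priors.items():
        for f, present in findings.items():
            pf = cond[d][f]
            p *= pf if present else (1 - pf)
        unnorm[d] = p
    z = sum(unnorm.values())  # = P(F), by total probability
    return {d: p / z for d, p in unnorm.items()}

post = posterior({"fever": True, "cough": True})
print(post)
```

The per-disease products (0.3 * 0.9 * 0.8 and 0.7 * 0.2 * 0.7) only need a linear number of stored probabilities, which is exactly the saving the two assumptions buy.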
90
Simple (a.k.a. Naive) Bayes
  • Simple Bayes was applied very early (from the
    early 60s and on) in Medical Informatics for
    diagnosis and optimal treatment selection as well
    as sequential testing.
  • See, for example, the classic papers by Warner et
    al. (1961), de Dombal et al. (1972), Leaper et
    al. (1972), and Gorry and Barnett (1968)

91
Variants of Simple Bayes
  • Since the MEE and FCID assumptions are clearly
    violated in many medical contexts, researchers
    early on sought to relax them and created
    modified Bayesian classifiers that approximated
    P(F | D) (Fryback 1978) or assumed independent
    diseases and multiple diagnoses (the
    multi-membership model of Ben-Bassat, 1980).
  • These models (and many others not mentioned here)
    have primarily historical significance currently,
    because:
  • (a) It was shown (Domingos and Pazzani, 1997)
    that the MEE and FCID assumptions are sufficient
    but not necessary conditions for a wide variety
    of target functions under 0/1 loss
  • (b) Bayesian Networks were invented, and as we
    will see next they allow flexible representation
    of dependencies so that parsimony and
    tractability are maintained without compromising
    soundness
  • (c) Several other restricted Bayesian classifiers
    have been shown to perform well in a variety of
    practical settings

92
Bayesian Networks
93
Supplementary Bayesian Networks Overview
  • A Note On Terminology
  • Brief Historical Perspective
  • The Bayesian Network Model and Its Uses
  • Learning BNs
  • Reference Resources

94
Supplementary Bayesian Networks A Note On
Terminology
  • Bayesian Networks (or Nets): generic name
  • Belief Networks: subjective probability-based,
    non-causal
  • Causal Probabilistic Networks: frequentist
    probability-based, causal

95
Supplementary Bayesian Networks A Note On
Terminology
  • Various other names for special model classes:
  • Influence Diagrams (Howard and Matheson):
    incorporate decision and utility nodes. Used for
    decision analyses
  • Dynamic Bayesian Networks (Dagum et al.):
    temporal semantics. Used as alternatives to
    multivariate time series models and for dynamic
    control
  • Markov Decision Processes (Dean et al.): for
    decision policy formulation in temporally
    evolving domains
  • Modifiable Temporal Belief Networks (Aliferis et
    al.): for well-structured and very large problem
    models that involve time and causation and cannot
    be stored explicitly

96
Supplementary Bayesian Networks Historical
Perspective
  • Naïve Bayesian Model (mutually exclusive
    diseases, findings independent given diseases):
    the predominant model for medical decision
    support systems in the 60s and early 70s because
    it requires a linear number of parameters and
    computational steps (in the total number of
    findings and diseases)
  • Theorem 1 (Minsky, Peot): Naïve Bayes heuristic
    usefulness (expected classification performance)
    over all domains gets exponentially worse as the
    number of variables increases
  • Theorem 2 (see Mitchell): Full Bayesian
    classifier = perfect classifier
  • However, the FBC is impractical and serves as an
    analytical tool only

97
Supplementary Bayesian Networks Historical
Perspective
  • In the late 70s and up to the mid-80s this led to
    Production Systems (i.e., rule-based systems,
    that is, simplifications of first-order logic).
    The most influential version of PSs (Shortliffe,
    Buchanan) handled uncertainty through a modular
    account of subjective belief (the Certainty
    Factor Calculus)
  • Theorem 3 (Heckerman): The CFC is inconsistent
    with probability theory unless the rule-space
    search graph is a tree. Consequently, forward and
    backward reasoning cannot be combined in a CFC PS
    and still produce valid results

98
Supplementary Bayesian Networks Historical
Perspective
  • That led to research (late 80s) in Bayesian
    Networks which can vary expressiveness between
    the full dependency (or even the full Bayesian
    classifier) and the Naïve Bayes model (Pearl,
    Cooper)

[Figure: expressiveness spectrum, from Naïve Bayes (variables
conditionally independent given categories; categories mutually
exclusive) to full dependency (variables conditionally dependent);
Bayesian Networks can vary anywhere along this range]
99
Supplementary Bayesian Networks Historical
Perspective
  • In the early 90s researchers developed the first
    algorithms for learning BNs from data
    (Herskovits, Cooper, Heckerman)
  • In the mid 90s researchers (Spirtes, Glymour,
    Scheines, Pearl, Verma) discovered methods to
    learn CPNs from observational data(!). We will
    cover the foundations of this in the causal
    discovery segment.
  • Overall, BNs are the brainchild of computer
    scientists, medical informaticians, artificial
    intelligence researchers, and industrial
    engineers, and are considered the representation
    language of choice for most biomedical Decision
    Support Systems today

100
Bayesian Networks The Bayesian Network Model and
Its Uses
  • BN = Graph (variables as nodes, dependencies as
    arcs) + Joint Probability Distribution + Markov
    Property
  • The graph has to be a DAG (directed acyclic
    graph) in the standard BN model

JPD: P(A+, B+, C+)=0.006; P(A+, B+, C-)=0.014;
P(A+, B-, C+)=0.054; P(A+, B-, C-)=0.126;
P(A-, B+, C+)=0.240; P(A-, B+, C-)=0.160;
P(A-, B-, C+)=0.240; P(A-, B-, C-)=0.160

[Graph: A -> B, A -> C]

  • Theorem 4 (Neapolitan): any JPD can be
    represented in BN form

101
Bayesian Networks The Bayesian Network Model and
Its Uses
  • Markov Property: the probability distribution of
    any node N given its parents P is independent of
    any subset of the non-descendant nodes W of N

[Figure: example DAG over nodes A through J; e.g., D is independent
of B, C, E, F, G given A, F, and another node is independent of
A, D, E, F, G, H, I, J given its parents B, C]
102
Bayesian Networks The Bayesian Network Model and
Its Uses
  • Theorem 5 (Pearl): the Markov property enables us
    to decompose (factor) the joint probability
    distribution into a product of prior and
    conditional probability distributions:

    P(V) = Π_i P(Vi | Pa(Vi))

The original JPD
P(A+, B+, C+)=0.006; P(A+, B+, C-)=0.014;
P(A+, B-, C+)=0.054; P(A+, B-, C-)=0.126;
P(A-, B+, C+)=0.240; P(A-, B+, C-)=0.160;
P(A-, B-, C+)=0.240; P(A-, B-, C-)=0.160
becomes
P(A)=0.8; P(B | A)=0.1; P(B | A-)=0.5;
P(C | A)=0.3; P(C | A-)=0.6

[Graph: A -> B, A -> C]

Up to exponential saving in the number of parameters!
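The factorization can be checked numerically. Note the slide's P(A)=0.8 matches the listed joint only if read as P(A-); taking P(A+)=0.2, the five factors reproduce all eight JPD entries (five parameters instead of seven for the full joint). A minimal sketch under that assumption:

```python
# Factors for the graph A -> B, A -> C (P(A+)=0.2 assumed; see lead-in)
p_a = {True: 0.2, False: 0.8}
p_b_given_a = {True: 0.1, False: 0.5}   # P(B+ | A=a)
p_c_given_a = {True: 0.3, False: 0.6}   # P(C+ | A=a)

def joint(a, b, c):
    """P(A=a, B=b, C=c) = P(A) * P(B|A) * P(C|A) per the Markov property."""
    pb = p_b_given_a[a] if b else 1 - p_b_given_a[a]
    pc = p_c_given_a[a] if c else 1 - p_c_given_a[a]
    return p_a[a] * pb * pc

print(joint(True, True, True))   # P(A+, B+, C+)
```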
103
Bayesian Networks The Bayesian Network Model and
Its Uses
  • As we will see in the causal discovery segment,
    BNs are a useful language for automated causal
    discovery because the Markov property captures
    causality, thus:
  • Revealing confounders
  • Modeling explaining away
  • Modeling/understanding selection bias
  • Modeling causal pathways
  • Modeling manipulation in the presence of
    confounders
  • Modeling manipulation in the presence of
    selection bias
  • Identifying targets for manipulation in causal
    chains

104
Bayesian Networks The Bayesian Network Model and
Its Uses
  • Once we have a BN model of some domain, we can
    ask questions:
  • Forward: P(D, I- | A)?
  • Backward: P(A | C, D)?
  • Forward + Backward: P(D, C- | I, E)?
  • Arbitrary abstraction / arbitrary choice of
    predictor and predicted variables

[Figure: the example DAG over nodes A through J]
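Forward and backward queries can both be answered by brute-force enumeration over the joint. The sketch below uses the small A -> B, A -> C network from the factorization slide, with P(A+)=0.2 assumed (the value consistent with the joint listed there):

```python
from itertools import product

p_a = {True: 0.2, False: 0.8}
p_b = {True: 0.1, False: 0.5}   # P(B+ | A=a)
p_c = {True: 0.3, False: 0.6}   # P(C+ | A=a)

def joint(a, b, c):
    return (p_a[a]
            * (p_b[a] if b else 1 - p_b[a])
            * (p_c[a] if c else 1 - p_c[a]))

def query(target, given):
    """P(target | given); target/given map variable index (0=A,1=B,2=C)
    to a boolean value. Sums the joint over all consistent states."""
    num = den = 0.0
    for state in product((True, False), repeat=3):
        p = joint(*state)
        if all(state[i] == v for i, v in given.items()):
            den += p
            if all(state[i] == v for i, v in target.items()):
                num += p
    return num / den

forward = query({2: True}, {0: True})    # P(C+ | A+), causal direction
backward = query({0: True}, {2: True})   # P(A+ | C+), diagnostic direction
print(forward, backward)
```

Enumeration is exponential in the number of variables, which is why the exact and stochastic inference algorithms on the following slides matter.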
105
Bayesian Networks The Bayesian Network Model and
Its Uses
  • The Markov property tells us which variables are
    important to predict a variable (its Markov
    Blanket: the variable's parents, children, and
    the children's other parents), thus providing a
    principled way to reduce variable dimensionality

[Figure: the example DAG over nodes A through J, with one node's
Markov blanket highlighted]
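The blanket computation (parents, children, and the children's other parents, a.k.a. spouses) is mechanical once the DAG is known. The slide's exact graph is not recoverable from the transcript, so the DAG below is hypothetical:

```python
# Hypothetical DAG, given as node -> set of its parents
parents = {
    "A": set(), "C": {"A"}, "D": {"A", "C"},
    "E": {"D"}, "F": {"C"},
}

def markov_blanket(node):
    """Parents | children | children's other parents of `node`."""
    pa = parents[node]
    ch = {v for v, ps in parents.items() if node in ps}
    spouses = set().union(*(parents[c] for c in ch)) - {node} if ch else set()
    return pa | ch | spouses

print(sorted(markov_blanket("C")))
```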
106
Bayesian Networks Demonstration of Flexible
Representation
  • A BN in which FCID holds

[Figure: BN with disease nodes D1-D3 and finding nodes F1-F4;
arcs run from diseases to findings, with no arcs among the findings]
107
Bayesian Networks Demonstration of Flexible
Representation
  • A BN in which MEE holds

[Figure: BN over disease nodes D1-D3 and finding nodes F1-F4,
structured so that the diseases are mutually exclusive]
108
Bayesian Networks Demonstration of Flexible
Representation
  • A BN in which MEE and FCID hold

[Figure: single disease node D with arcs to findings F1-F4 and no
other arcs: the Naïve Bayes structure]
109
Bayesian Networks Demonstration of Flexible
Representation
  • Hybrid assumptions

[Figure: hybrid BN over disease nodes D1-D3 and finding nodes F1-F9,
mixing the MEE and FCID assumptions in different parts of the network]
110
Inference Algorithms
  • Exact:
  • Lauritzen and Spiegelhalter
  • Cooper: recursive decomposition
  • Stochastic-approximate:
  • Likelihood weighting
  • Dagum and Luby
  • Variational (approximate but not stochastic):
  • Jordan et al. (1998): solves queries in QMR-DT in
    seconds
  • References:
  • An Introduction to Variational Methods for
    Graphical Models (1998). Michael I. Jordan,
    Zoubin Ghahramani, Tommi S. Jaakkola, Lawrence K.
    Saul. Machine Learning
  • An Optimal Approximation Algorithm for Bayesian
    Inference (1997). Paul Dagum, Michael Luby.
    Artificial Intelligence
  • Probabilistic Reasoning in Expert Systems: Theory
    and Algorithms, by Richard E. Neapolitan. John
    Wiley, 1990

111
Theoretical Complexity
  • Inference is NP-hard (Cooper, 1990 (exact);
    Dagum and Luby, 1993 (stochastic))
  • Learning is NP-hard (Chickering, 1994; Bouckaert,
    1995)
  • However: many widely-applicable algorithms are
    very efficient (allowing up to thousands of
    variables for inference and >100,000 variables
    for focused learning)

112
Automatic Construction of Bayesian Networks from
Data
  • For causal discovery:
  • Pearl, Verma (1988)
  • Spirtes, Glymour, Scheines (1991)
  • For classification/automatic DSS construction:
  • Herskovits, Cooper (1991): Kutato (entropy-based)
  • Cooper, Herskovits (1992): K2 (Bayesian)
  • (to be discussed at length in the second part)
  • Reference:
  • Computation, Causation, and Discovery. Clark
    Glymour and Gregory F. Cooper (Editors), 2000,
    AAAI Press

113
Supplementary Bayesian Networks Sparse
Candidate Algorithm
  • Repeat:
  •   (Restriction step) Select candidate parents Ci
      for each variable Xi
  •   (Maximization step) Set the new best network B
      to the graph Gn that maximizes a Bayesian score
      Score(G | D), where G ranges over the class of
      BNs for which Pa_G(Xi) ⊆ Ci for all Xi
  • Until convergence
  • Return B
114
Supplementary Bayesian Networks Sparse
Candidate Algorithm
  • SCA proceeds by selecting up to k candidate
    parents for each variable on the basis of
    pair-wise association
  • Then search is performed for a best network
    within the space defined by the union of all
    potential parents identified in the previous step
  • The procedure is iterated by feeding the parents
    in the currently best network to the restriction
    step
  • Theorem 6 (Friedman): SCA monotonically improves
    the quality of the examined networks
  • Convergence criteria: no gain in score, and a
    maximum number of cycles with no improvement in
    score
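The restriction step is the heart of SCA's scalability: it prunes the parent search space using pairwise association. A hedged sketch of that step only (the score-based maximization step is not shown; the association measure here is absolute Pearson correlation on toy data, not Friedman's original choice):

```python
def corr(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5 if vx and vy else 0.0

def candidate_parents(data, k):
    """Restriction step: up to k candidate parents per variable,
    ranked by |pairwise association|. data: variable -> list of values."""
    out = {}
    for xi, col in data.items():
        scored = [(abs(corr(col, data[xj])), xj)
                  for xj in data if xj != xi]
        scored.sort(reverse=True)
        out[xi] = [xj for _, xj in scored[:k]]
    return out

data = {  # hypothetical binary data; Y copies X, so they associate strongly
    "X": [0, 0, 1, 1, 1, 0],
    "Y": [0, 0, 1, 1, 1, 0],
    "Z": [1, 0, 0, 1, 0, 1],
}
print(candidate_parents(data, 1))
```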

115
Supplementary Bayesian Networks Learning
Partial Models
  • Partial model feature f (Friedman et al.)
  • Examples:
  • Order relation (is X an ascendant or descendant
    of Y?)
  • Markov Blanket membership (is A in the MB of B?)
  • We want: P(f | D) = Σ_G f(G) * P(G | D)
  • And we approximate it by:
    Conf(f) = (1/m) Σ_{i=1..m} f(Gi)
116
Supplementary BN Applications
117
Supplementary Pathfinder
  • Heckerman et al., early 90s
  • Diagnosis and test selection for lymph-node
    pathology
  • Assumes MEE but not FCID
  • Similarity networks (a special enhancement to
    BNs) allow more efficient knowledge acquisition
  • Myopic test selection strategy (similar to Gorry
    and Barnett); combined monetary cost/expected
    survival utility measure
  • Led to the Intellipath commercial product
  • References:
  • Heckerman DE, Horvitz EJ, Nathwani BN. Toward
    normative expert systems: Part I. The Pathfinder
    project. Methods Inf Med. 1992 Jun;31(2):90-105.
  • Heckerman DE, Nathwani BN. Toward normative
    expert systems: Part II. Probability-based
    representations for efficient knowledge
    acquisition and inference. Methods Inf Med. 1992
    Jun;31(2):106-16.
  • Heckerman DE, Nathwani BN. An evaluation of the
    diagnostic accuracy of Pathfinder. Comput Biomed
    Res. 1992 Feb;25(1):56-74.

118
Supplementary QMR-DT
  • Stanford, late 80s/early 90s
  • Probabilistic formulation of the QMR KB (a
    subsequent version of INTERNIST-I)
  • Full scope of INTERNIST/QMR
  • Uses:
  • a two-layered BN representation,
  • no MEE or FCID assumptions,
  • stochastic inference
  • Reference:
  • Shwe M, Cooper G. An empirical analysis of
    likelihood-weighting simulation on a large,
    multiply connected medical belief network. Comput
    Biomed Res. 1991 Oct;24(5):453-75.

119
Supplementary Analysis of sensitivity of BNs to
errors in probability specification
  • Henrion et al., 1996
  • System: CPCS (subset of QMR)
  • Results: average probabilities assigned to the
    actual diseases showed small sensitivity even to
    large amounts of noise.
  • Explanation: one reason is that the criterion for
    performance is the average probability of the
    true hypotheses, which is insensitive to
    symmetric noise distributions. But even
    asymmetric, log-odds normal noise has modest
    effects. A second reason is that the
    gold-standard posterior probabilities are often
    near zero or one, and are little disturbed by
    noise.
  • Reference:
  • Max Henrion, Malcolm Pradhan, Brendan Del Favero,
    Kurt Huang, Gregory Provan and Paul O'Rorke. Why
    is Diagnosis Using Belief Networks Insensitive to
    Imprecision in Probabilities? UAI, 1996

120
Supplementary Temporal Causal and Spatial
Reasoning with Probabilistic methods
  • Haddawy 1995: temporal, causal and probabilistic
    FOL
  • Aliferis (97, 98): temporal and causal Bayesian
    Networks with clear causal-temporal semantics
  • Spatio-temporal BNs for GI endoscopy
  • References:
  • Aliferis CF, Cooper GF. Temporal representation
    design principles: an assessment in the domain of
    liver transplantation. Proc AMIA Symp.
    1998:170-4.
  • Ngo L, Haddawy P, Krieger RA, Helwig J. Efficient
    temporal probabilistic reasoning via
    context-sensitive model construction. Comput Biol
    Med. 1997 Sep;27(5):453-76.

121
Supplementary Dynamic construction of BNs from
Knowledge Bases to solve problem instances of
interest (KBMC)
  • Haddawy 1995: probabilistic FOL KB -> BN
  • Aliferis: Modifiable Temporal BNs: temporal and
    causal Bayesian Networks with adjustable hybrid
    granularities, variable time horizon, and
    interacting subnetworks (contexts), 1996-8
  • Koller et al.: object-oriented BNs, 1997
  • References:
  • Generating Bayesian Networks from Probability
    Logic Knowledge Bases. P. Haddawy. Proceedings
    of the Tenth Conference on Uncertainty in
    Artificial Intelligence, July 1994.
  • Daphne Koller and Avi Pfeffer. Object-Oriented
    Bayesian Networks. UAI, 1997
  • Aliferis CF. A Temporal Representation and
    Reasoning Model for Medical Decision Support
    Systems. Doctoral Thesis, 1998

122
Supplementary Other applications
  • Parsing of natural language with BNs:
  • Haug et al., 1999
  • Charniak et al.
  • Extensive applications for classification and
    discovery: Margaritis et al. (1999), Aliferis,
    Tsamardinos et al. (2001-2003):
  • focused causal discovery (parents-children or
    Markov Blankets)
  • feature selection
  • (to be discussed at length in the second part)
  • References:
  • Fiszman M, Chapman WW, Evans SR, Haug PJ.
    Automatic identification of pneumonia related
    concepts on chest x-ray reports. Proc AMIA Symp.
    1999:67-71.
  • Charniak E. Bayesian Networks Without Tears. AI
    Magazine, 1991

123
Simple Bayes Revisited
  • Domingos and Pazzani, 1997:
  • The Naïve Bayes assumptions are sufficient for
    accurate probability estimates in the sample
    limit, but not necessary for a wide variety of
    learning problems when accuracy is the evaluation
    metric
  • In small samples, even if the assumptions are
    violated, SB can do better than more expressive
    representations, due to the bias-variance
    decomposition of the error
  • The best way to correct (extend) SB is not to
    join highly-associated findings
  • These results explain the excellent performance
    of SB in text categorization with thousands of
    variables (words) and in many other
    learning/inference tasks, even against more
    expressive representations
  • Reference:
  • On the Optimality of the Simple Bayesian
    Classifier under Zero-One Loss (1997). Pedro
    Domingos and Michael Pazzani. Machine Learning

124
Other Restricted Bayesian Classifiers
  • The TAN classifier augments Naïve Bayes with
    augmenting edges among findings such that the
    resulting network among the findings is a tree

[Figure: TAN structure: class node D with arcs to findings F1-F4,
plus tree-structured augmenting arcs among the findings]
125
Other Restricted Bayesian Classifiers
  • The TAN multinet classifier uses a different TAN
    for each value of D, and then chooses the
    predicted class to be the value of D whose TAN
    assigns the highest posterior probability to the
    observed findings