Bayesian Network Anomaly Pattern Detection for Disease Outbreaks - PowerPoint PPT Presentation

Slides: 39
Provided by: me6100

1
Bayesian Network Anomaly Pattern Detection for
Disease Outbreaks
  • Weng-Keen Wong (Carnegie Mellon University)
  • Andrew Moore (Carnegie Mellon University)
  • Gregory Cooper (University of Pittsburgh)
  • Michael Wagner (University of Pittsburgh)

2
Motivation
Suppose we have real-time access to Emergency
Department data from hospitals around a city
(with patient confidentiality preserved)
Primary Key | Date | Time | Hospital | ICD9 | Prodrome | Gender | Age | Home Location | Work Location | Many more
100 | 6/1/03 | 912 | 1 | 781 | Fever | M | 20s | NE | ?
101 | 6/1/03 | 1045 | 1 | 787 | Diarrhea | F | 40s | NE | NE
102 | 6/1/03 | 1103 | 1 | 786 | Respiratory | F | 60s | NE | N
103 | 6/1/03 | 1107 | 2 | 787 | Diarrhea | M | 60s | E | ?
104 | 6/1/03 | 1215 | 1 | 717 | Respiratory | M | 60s | E | NE
105 | 6/1/03 | 1301 | 3 | 780 | Viral | F | 50s | ? | NW
106 | 6/1/03 | 1305 | 3 | 487 | Respiratory | F | 40s | SW | SW
107 | 6/1/03 | 1357 | 2 | 786 | Unmapped | M | 50s | SE | SW
108 | 6/1/03 | 1422 | 1 | 780 | Viral | M | 40s | ? | ?

4
The Problem
From this data, can we detect if a disease
outbreak is happening?
We're talking about non-specific disease
detection
6
The Problem
From this data, can we detect if a disease
outbreak is happening? How early can we detect
it?
The question we're really asking:
What's strange about recent events?
7
Traditional Approaches
  • What about using traditional anomaly detection?
  • Typically assume data is generated by a model
  • Finds individual data points that have low
    probability with respect to this model
  • These outliers have rare attributes or
    combinations of attributes
  • Need to identify anomalous patterns, not isolated
    data points

8
Traditional Approaches
What about monitoring aggregate daily counts of
certain attributes?
  • We've now turned multivariate data into
    univariate data
  • Lots of algorithms have been developed for
    monitoring univariate data
      • Time series algorithms
      • Regression techniques
      • Statistical Quality Control methods
  • Need to know a priori which attributes to form
    daily aggregates for!
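
As a rough illustration of what one such univariate detector might look like, here is a minimal control-chart style sketch. The function name, the threshold multiplier `k = 3`, and the sample counts are assumptions made for this example and do not come from the slides.

```python
from statistics import mean, stdev

def control_chart_alert(history, today_count, k=3.0):
    """Flag today's count if it exceeds mean + k * stdev of past daily counts.

    history: past daily counts for ONE pre-chosen attribute (hypothetical data).
    k: width of the upper control limit; 3.0 is an assumed, conventional choice.
    """
    upper_limit = mean(history) + k * stdev(history)
    return today_count > upper_limit

# Made-up daily counts of, say, respiratory cases over the past week.
past_counts = [20, 22, 19, 21, 23, 20, 18]
print(control_chart_alert(past_counts, 35))
print(control_chart_alert(past_counts, 22))
```

Note that the detector only watches the single aggregate it was given, which is exactly the limitation the last bullet points out.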

9
Traditional Approaches
  • What if we don't know what attributes to monitor?
  • What if we want to exploit the spatial, temporal
    and/or demographic characteristics of the
    epidemic to detect the outbreak as early as
    possible?

11
Traditional Approaches
  • We need to build a univariate detector to monitor
    each interesting combination of attributes:
      • Diarrhea cases among children
      • Number of cases involving people working in
        southern part of the city
      • Respiratory syndrome cases among females
      • Number of cases involving teenage girls living
        in the western part of the city
      • Viral syndrome cases involving senior citizens
        from eastern part of city
      • Botulinic syndrome cases
      • Number of children from downtown hospital
      • And so on
  • You'll need hundreds of univariate detectors! We
    would like to identify the groups with the
    strangest behavior in recent events.
13
One Possible Approach
Primary Key | Date | Time | Gender | Age | Hospital | Many more
100 | 8/24/03 | 912 | M | 20s | 1
101 | 8/24/03 | 1045 | F | 40s | 1

2243 | 8/17/03 | 1107 | M | 60s | 2
2244 | 8/17/03 | 1215 | M | 60s | 1

12567 | 8/24/02 | 1305 | F | 40s | 3
12568 | 8/24/02 | 1357 | M | 50s | 2

Record groups labeled on the slide: Today's Records, Yesterday's Records, Last Year's Records
Idea: Can use association rules to find patterns
in today's records that weren't there in past data
14
One Possible Approach
Recent records (from today)
Primary Key | Date | Time | Gender | Age
100 | 8/24/03 | 912 | M | Child
101 | 8/24/03 | 1045 | M | Senior

Primary Key | Date | Time | Source
100 | 8/24/03 | 912 | Recent
101 | 8/24/03 | 1045 | Recent

2164 | 8/17/03 | 1305 | Baseline
2165 | 8/17/03 | 1357 | Baseline

Baseline records (from 7 days ago)
Primary Key | Date | Time | Gender | Age
2164 | 8/17/03 | 1305 | F | Senior
2165 | 8/17/03 | 1357 | F | Senior

Find which rules predict unusually high
proportions in recent records when compared to
the baseline, e.g.
  • 52/200 records from recent have Gender = Male AND Age = Senior
  • 90/180 records from baseline have Gender = Male AND Age = Senior
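
To make the idea of a rule concrete, the following sketch enumerates conjunctions of up to two attribute-value pairs and measures the proportion of records each one matches. The helper names (`matches`, `candidate_rules`, `proportion`) and the dictionary record layout are assumptions for illustration; the four toy records reuse the Gender and Age values shown in the tables above.

```python
from itertools import combinations

def matches(record, rule):
    """True if the record satisfies every (attribute, value) pair in the rule."""
    return all(record.get(attr) == value for attr, value in rule)

def candidate_rules(records, max_components=2):
    """All conjunctions of up to max_components pairs on distinct attributes."""
    pairs = sorted({(a, v) for rec in records for a, v in rec.items()})
    rules = []
    for size in range(1, max_components + 1):
        for combo in combinations(pairs, size):
            # Skip combos that constrain the same attribute twice.
            if len({attr for attr, _ in combo}) == size:
                rules.append(combo)
    return rules

def proportion(records, rule):
    """Fraction of records matched by the rule."""
    return sum(matches(rec, rule) for rec in records) / len(records)

recent = [{"Gender": "M", "Age": "Child"}, {"Gender": "M", "Age": "Senior"}]
baseline = [{"Gender": "F", "Age": "Senior"}, {"Gender": "F", "Age": "Senior"}]

rule = (("Age", "Senior"), ("Gender", "M"))
print(proportion(recent, rule), proportion(baseline, rule))
print(len(candidate_rules(recent + baseline)))
```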
15
Which rules do we report?
  • Search over all rules with at most 2 components
  • For each rule, form a 2x2 contingency table, e.g.

                       | CountRecent | CountBaseline
    Home Location = NW | 48          | 45
    Home Location ≠ NW | 86          | 220

  • Perform Fisher's Exact Test to get a p-value for
    each rule (call this the score)
  • Report the rule with the lowest score
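
The scoring step can be sketched with a standard-library implementation of Fisher's Exact Test applied to the table from this slide. The slides do not say whether the test is one-sided or two-sided; the one-sided form below (testing for an unusually high count in the recent data) is an assumption, as is the function name.

```python
from math import comb

def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's Exact Test for the 2x2 table [[a, b], [c, d]].

    a: records matching the rule in recent data
    b: records matching the rule in baseline data
    c: records NOT matching the rule in recent data
    d: records NOT matching the rule in baseline data
    Returns P(X >= a) under the hypergeometric null with all margins fixed.
    """
    row1 = a + b          # total records matching the rule
    col1 = a + c          # total recent records
    n = a + b + c + d     # all records
    tail = sum(comb(row1, k) * comb(n - row1, col1 - k)
               for k in range(a, min(row1, col1) + 1))
    return tail / comb(n, col1)

# Table from the slide: Home Location = NW vs. Home Location != NW.
score = fisher_exact_greater(48, 45, 86, 220)
print(score)
```

A small score means the recent proportion (48 of 134) would be surprising if recent and baseline records came from the same distribution.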
16
Problems with the Approach
  1. Multiple Hypothesis Testing
  2. A Changing Baseline

17
Problem 1: Multiple Hypothesis Testing
  • Can't interpret the rule scores as p-values
  • Suppose we reject the null hypothesis when score < α,
    where α = 0.05
  • For a single hypothesis test, the probability of
    making a false discovery = α
  • Suppose we do 1000 tests, one for each possible
    rule
  • Probability(false discovery) could be as bad as
    $1 - (1 - 0.05)^{1000} \gg 0.05$
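
The arithmetic behind that bound is easy to check. The sketch below assumes the tests are independent, which is the worst case the "could be as bad as" wording refers to; the function name is made up for this example.

```python
def family_wise_error_rate(alpha, num_tests):
    """Probability of at least one false discovery among independent tests."""
    return 1.0 - (1.0 - alpha) ** num_tests

for m in (1, 14, 1000):
    print(m, family_wise_error_rate(0.05, m))
```

Even a modest number of tests pushes the chance of a false discovery well above the per-test level of 0.05.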

18
Randomization Test
Record | Original date | Date after shuffling
C2  | Aug 16, 2003 | Aug 16, 2003
C3  | Aug 17, 2003 | Aug 17, 2003
C4  | Aug 17, 2003 | Aug 24, 2003
C5  | Aug 17, 2003 | Aug 17, 2003
C6  | Aug 17, 2003 | Aug 24, 2003
C7  | Aug 17, 2003 | Aug 17, 2003
C8  | Aug 21, 2003 | Aug 21, 2003
C9  | Aug 21, 2003 | Aug 21, 2003
C10 | Aug 22, 2003 | Aug 22, 2003
C11 | Aug 22, 2003 | Aug 22, 2003
C12 | Aug 23, 2003 | Aug 23, 2003
C13 | Aug 23, 2003 | Aug 23, 2003
C14 | Aug 24, 2003 | Aug 17, 2003
C15 | Aug 24, 2003 | Aug 17, 2003
  • Take the recent cases and the baseline cases.
    Shuffle the date field to produce a randomized
    dataset called DBRand
  • Find the rule with the best score on DBRand.

19
Randomization Test
Repeat the procedure on the previous slide for
1000 iterations. Determine how many scores from
the 1000 iterations are better than the original
score.
If the original score places in the top 1% of the
1000 scores from the randomization test, we would
be impressed and an alert should be raised.
Corrected p-value of the rule = (# better scores) / (# iterations)
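
The randomization test can be sketched as follows. This is a simplified illustration, not the authors' code: it searches only single-component rules, uses a one-sided Fisher score, and represents the date field as a recent/baseline flag per record. All function names and the toy data are assumptions.

```python
import random
from math import comb

def fisher_score(a, b, c, d):
    """One-sided Fisher's Exact Test p-value for the table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d
    tail = sum(comb(row1, k) * comb(n - row1, col1 - k)
               for k in range(a, min(row1, col1) + 1))
    return tail / comb(n, col1)

def best_score(is_recent, records):
    """Lowest score over all single-component rules (attribute == value)."""
    n_recent = sum(is_recent)
    n_base = len(is_recent) - n_recent
    rules = {(attr, val) for rec in records for attr, val in rec.items()}
    best = 1.0
    for attr, val in rules:
        hits = [rec.get(attr) == val for rec in records]
        a = sum(h and r for h, r in zip(hits, is_recent))  # matches in recent
        b = sum(hits) - a                                   # matches in baseline
        best = min(best, fisher_score(a, b, n_recent - a, n_base - b))
    return best

def corrected_p_value(is_recent, records, iterations=1000, seed=0):
    """Fraction of shuffled datasets whose best score beats the original."""
    rng = random.Random(seed)
    original = best_score(is_recent, records)
    shuffled = list(is_recent)
    better = 0
    for _ in range(iterations):
        rng.shuffle(shuffled)  # shuffling the date field -> one DBRand
        if best_score(shuffled, records) < original:
            better += 1
    return better / iterations

# Toy data: every recent record is a senior, every baseline record is not.
records = [{"Age": "Senior"}] * 20 + [{"Age": "Adult"}] * 80
is_recent = [True] * 20 + [False] * 80
print(corrected_p_value(is_recent, records, iterations=200))
```

Because the best score is recomputed on every shuffled dataset, the corrected p-value accounts for the search over many rules rather than for a single test.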
20
Problem 2: A Changing Baseline
From: Goldenberg, A., Shmueli, G., Caruana, R.
A., and Fienberg, S. E. (2002). Early
statistical detection of anthrax outbreaks by
tracking over-the-counter medication sales.
Proceedings of the National Academy of Sciences
(pp. 5237-5249)
21
Problem 2: A Changing Baseline
  • Baseline is affected by temporal trends in health
    care data
      • Seasonal effects in temperature and weather
      • Day of Week effects
      • Holidays
      • Etc.
  • Choosing the wrong baseline distribution can
    affect the detection time and false positive rate

23
Generating the Baseline
  • Taking into account that today is a public
    holiday
  • Taking into account that this is Spring
  • Taking into account recent heatwave
  • Taking into account recent flu levels
  • Taking into account that there's a known natural
    food-borne outbreak in progress

Use a Bayes net to model the joint probability
distribution of the attributes
24
Obtaining Baseline Data
  1. Learn a Bayesian network from all historical data
    using Optimal Reinsertion [Moore and Wong 2003]
  2. Generate the baseline given today's environment
25
Environmental Attributes
  • Divide the data into two types of attributes
      • Environmental attributes: attributes that cause
        trends in the data, e.g. day of week, season,
        weather, flu levels
      • Response attributes: all other non-environmental
        attributes

26
Environmental Attributes
  • When learning the Bayesian network structure, do
    not allow environmental attributes to have
    parents.
  • Why?
      • We are not interested in predicting their
        distributions
      • Instead, we use them to predict the distributions
        of the response attributes
  • Side benefit: We can speed up the structure
    search by avoiding DAGs that assign parents to
    the environmental attributes

Environmental attributes shown in the diagram: Season, Day of Week, Weather, Flu Level
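
One simple way to picture the structure constraint is as a filter on the directed edges a structure search is allowed to consider. The sketch below is only an illustration of that idea and is not the Optimal Reinsertion algorithm; the function name and the attribute lists are assumptions.

```python
from itertools import permutations

def allowed_edges(attributes, environmental):
    """Directed (parent, child) edges that never give an environmental
    attribute a parent."""
    env = set(environmental)
    return [(parent, child)
            for parent, child in permutations(attributes, 2)
            if child not in env]

attributes = ["Season", "Weather", "Prodrome", "Age"]
edges = allowed_edges(attributes, environmental=["Season", "Weather"])
print(len(edges), "of", len(attributes) * (len(attributes) - 1), "edges remain")
```

Shrinking the candidate edge set in this way is where the speed-up mentioned in the last bullet would come from.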
28
Generate Baseline Given Today's Environment
Suppose we know the following for today:

      | Season | Day of Week | Weather | Flu Level
Today | Winter | Monday      | Snow    | High

We fill in these values for the environmental
attributes in the learned Bayesian network
(Day of Week = Monday, Flu Level = High,
Season = Winter, Weather = Snow).
Sampling is easy because environmental attributes
are at the top of the Bayes Net.
We sample 10000 records from the Bayesian network
and make this data set the baseline.
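
The sampling step can be sketched on a toy network. Everything in this example is hypothetical: the two response attributes, the parent sets, and all probabilities are made up for illustration and are not the learned network from the talk. Because the environmental attributes have no parents, they are simply fixed to today's values and the response attributes are sampled in topological order.

```python
import random

# Hypothetical structure: response attribute -> tuple of its parents.
PARENTS = {"Age": (), "Prodrome": ("Season", "FluLevel")}
ORDER = ["Age", "Prodrome"]  # a topological order of the response attributes

# Hypothetical conditional probability tables, keyed by parent values.
CPT = {
    "Age": {(): {"Child": 0.2, "Adult": 0.5, "Senior": 0.3}},
    "Prodrome": {
        ("Winter", "High"): {"Respiratory": 0.5, "Fever": 0.3, "Diarrhea": 0.2},
        ("Summer", "Low"): {"Respiratory": 0.2, "Fever": 0.2, "Diarrhea": 0.6},
    },
}

def sample_baseline(environment, n, seed=0):
    """Draw n records with environmental attributes clamped to today's values."""
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        record = dict(environment)  # environmental values are given, not sampled
        for attr in ORDER:
            parent_values = tuple(record[p] for p in PARENTS[attr])
            dist = CPT[attr][parent_values]
            record[attr] = rng.choices(list(dist), weights=list(dist.values()))[0]
        records.append(record)
    return records

baseline = sample_baseline({"Season": "Winter", "FluLevel": "High"}, 10000)
print(baseline[0])
```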
29
Generate Baseline Given Today's Environment
As on the previous slide, we fill in today's values
for the environmental attributes in the learned
Bayesian network.
An alternate possible technique for obtaining the
baseline is to use inference.
30
What's Strange About Recent Events (WSARE) 3.0
  1. Obtain Recent and Baseline datasets (from all data)
  2. Search for rule with best score
  3. Determine p-value of best scoring rule
  4. If p-value is less than threshold, signal alert
31
Simulator
32
Simulation
  • 100 different data sets
  • Each data set consisted of a two year period
  • Anthrax release occurred at a random point during
    the second year
  • Algorithms allowed to train on data from the
    current day back to the first day in the
    simulation
  • Any alerts before actual anthrax release are
    considered a false positive
  • Detection time calculated as first alert after
    anthrax release. If no alerts raised, cap
    detection time at 14 days
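
The two evaluation measures can be sketched as below. The function name and day-index representation are assumptions; so are two details the slide leaves open: an alert on the release day itself is counted as a detection, and a first alert arriving more than 14 days after the release is also capped at 14.

```python
def evaluate_alerts(alert_days, release_day, cap_days=14):
    """Return (false_positive_count, detection_time_in_days).

    alert_days: day indices on which an algorithm raised an alert.
    release_day: day index of the simulated anthrax release.
    """
    false_positives = sum(1 for day in alert_days if day < release_day)
    after_release = [day for day in alert_days if day >= release_day]
    if after_release:
        detection_time = min(min(after_release) - release_day, cap_days)
    else:
        detection_time = cap_days  # no alert raised after the release
    return false_positives, detection_time

print(evaluate_alerts([100, 400, 410], release_day=405))
print(evaluate_alerts([10], release_day=405))
```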

33
Other Algorithms used in Simulation
  1. Standard algorithm
  2. WSARE 2.0: Create baseline using historical
    data from 7, 14, 21 and 28 days ago
  3. WSARE 2.5: Use all past data but condition on
    environmental attributes
34
Results on Simulation
35
Results on Simulation
36
Results on Actual ED Data from 2001
  1. Sat 2001-02-13 SCORE = -0.00000004 PVALUE = 0.00000000
      • 14.80% (74/500) of today's cases have Viral
        Syndrome = True and Encephalitic Prodrome = False
      • 7.42% (742/10000) of baseline have Viral
        Syndrome = True and Encephalitic Syndrome = False
  2. Sat 2001-03-13 SCORE = -0.00000464 PVALUE = 0.00000000
      • 12.42% (58/467) of today's cases have
        Respiratory Syndrome = True
      • 6.53% (653/10000) of baseline have
        Respiratory Syndrome = True
  3. Wed 2001-06-30 SCORE = -0.00000013 PVALUE = 0.00000000
      • 1.44% (9/625) of today's cases have
        100 < Age < 110
      • 0.08% (8/10000) of baseline have
        100 < Age < 110
  4. Sun 2001-08-08 SCORE = -0.00000007 PVALUE = 0.00000000
      • 83.80% (481/574) of today's cases have
        Unknown Syndrome = False
      • 74.29% (7430/10001) of baseline have
        Unknown Syndrome = False
  5. Thu 2001-12-02 SCORE = -0.00000087 PVALUE = 0.00000000
      • 14.71% (70/476) of today's cases have Viral
        Syndrome = True and Encephalitic Syndrome = False
      • 7.89% (789/9999) of baseline have Viral
        Syndrome = True and Encephalitic Syndrome = False

37
Related Work
  • Deviations between models induced by two datasets:
    Ganti, Gehrke and Ramakrishnan
  • Emerging Patterns: Dong and Li
  • Mining Surprising Patterns using Temporal
    Description Length: Chakrabarti, Sarawagi and
    Dom
  • Contrast sets: Bay and Pazzani
  • Association Rules and Data Mining in Hospital
    Infection Control and Public Health Surveillance:
    Brossette et al.
  • Spatial Scan Statistic: Kulldorff

38
Conclusion
  • One approach to biosurveillance: one algorithm
    monitoring millions of signals derived from
    multivariate data, instead of hundreds of
    univariate detectors
  • WSARE is best used as a general purpose safety
    net in combination with other detectors
  • Careful evaluation of statistical significance
  • Modeling historical data with Bayesian Networks
    to allow conditioning on unique features of today

Software: http://www.autonlab.org/