1
What's Strange About Recent Events (WSARE)
  • Weng-Keen Wong (Carnegie Mellon University)
  • Andrew Moore (Carnegie Mellon University)
  • Gregory Cooper (University of Pittsburgh)
  • Michael Wagner (University of Pittsburgh)

DIMACS Tutorial on Statistical and Other Analytic
Health Surveillance Methods
2
Motivation
Suppose we have access to Emergency Department
data from hospitals around a city (with patient
confidentiality preserved)
Primary Key  Date    Time  Hospital  ICD9  Prodrome     Gender  Age  Home Location  Work Location  Many more
100          6/1/03  912   1         781   Fever        M       20s  NE             ?
101          6/1/03  1045  1         787   Diarrhea     F       40s  NE             NE
102          6/1/03  1103  1         786   Respiratory  F       60s  NE             N
103          6/1/03  1107  2         787   Diarrhea     M       60s  E              ?
104          6/1/03  1215  1         717   Respiratory  M       60s  E              NE
105          6/1/03  1301  3         780   Viral        F       50s  ?              NW
106          6/1/03  1305  3         487   Respiratory  F       40s  SW             SW
107          6/1/03  1357  2         786   Unmapped     M       50s  SE             SW
108          6/1/03  1422  1         780   Viral        M       40s  ?              ?

4
The Problem
From this data, can we detect if a disease
outbreak is happening?
We're talking about non-specific disease
detection
6
The Problem
From this data, can we detect if a disease
outbreak is happening? How early can we detect
it?
The question we're really asking: in
the last n hours, has anything strange happened?
7
Traditional Approaches
  • What about using traditional anomaly detection?
  • Typically assume data is generated by a model
  • Finds individual data points that have low
    probability with respect to this model
  • These outliers have rare attributes or
    combinations of attributes
  • Need to identify anomalous patterns, not isolated
    data points

8
Traditional Approaches
What about monitoring aggregate daily counts of
certain attributes?
  • We've now turned multivariate data into
    univariate data
  • Lots of algorithms have been developed for
    monitoring univariate data
  • Time series algorithms
  • Regression techniques
  • Statistical Quality Control methods
  • Need to know a priori which attributes to form
    daily aggregates for!

10
Traditional Approaches
  • What if we don't know what attributes to monitor?
  • What if we want to exploit the spatial, temporal
    and/or demographic characteristics of the
    epidemic to detect the outbreak as early as
    possible?

12
Traditional Approaches
  • We need to build a univariate detector to monitor
    each interesting combination of attributes

Diarrhea cases among children
Number of cases involving people working in
southern part of the city
Respiratory syndrome cases among females
Number of cases involving teenage girls living
in the western part of the city
Viral syndrome cases involving senior citizens
from eastern part of city
Botulinic syndrome cases
Number of children from downtown hospital
And so on

You'll need hundreds of univariate detectors! We
would like to identify the groups with the
strangest behavior in recent events.
13
Our Approach
  • We use Rule-Based Anomaly Pattern Detection
  • Association rules are used to characterize anomalous
    patterns. For example, a two-component rule
    would be:
  • Gender = Male AND 40 ≤ Age < 50
  • Related work:
  • Market basket analysis [Agrawal et al., Brin et
    al.]
  • Contrast sets [Bay and Pazzani]
  • Spatial Scan Statistic [Kulldorff]
  • Association Rules and Data Mining in Hospital
    Infection Control and Public Health Surveillance
    [Brossette et al.]

14
WSARE v2.0
  • Inputs

1. Multivariate date/time-indexed
biosurveillance-relevant data stream (here: Emergency Department data)
2. Time window length (here: last 24 hours)
3. Which attributes to use? (here: ignore the primary key)

Primary Key  Date    Time  Hospital  ICD9  Prodrome     Gender  Age  Home Location  Work Location  Many more
100          6/1/03  912   1         781   Fever        M       20s  NE             ?
101          6/1/03  1045  1         787   Diarrhea     F       40s  NE             NE
102          6/1/03  1103  1         786   Respiratory  F       60s  NE             N

15
WSARE v2.0
  • Inputs

1. Multivariate date/time-indexed
biosurveillance-relevant data stream
2. Time window length
3. Which attributes to use?
  • Outputs

1. Here are the records that most surprise me
2. Here's why
3. And here's how seriously you should take it

Primary Key  Date    Time  Hospital  ICD9  Prodrome     Gender  Age  Home Location  Work Location  Many more
100          6/1/03  912   1         781   Fever        M       20s  NE             ?
101          6/1/03  1045  1         787   Diarrhea     F       40s  NE             NE
102          6/1/03  1103  1         786   Respiratory  F       60s  NE             N

16
WSARE v2.0 Overview
  1. Obtain Recent and Baseline datasets
  2. Search for rule with best score
  3. Determine p-value of best scoring rule through
     randomization test
  4. If p-value is less than threshold, signal alert

(Diagram labels: All Data, Recent Data, Baseline)
17
Step 1: Obtain Recent and Baseline Data
Recent Data: data from the last 24 hours.
Baseline: baseline data is assumed to capture non-outbreak
behavior. We use data from 35, 42, 49 and 56
days prior to the current day.
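
The baseline selection above can be sketched in a few lines of Python. This is only an illustration: the `(date, payload)` record layout and the function names are assumptions, and "recent" is simplified to mean records dated on the current day. Since 35, 42, 49 and 56 are all multiples of 7, every baseline day falls on the same day of the week as the current day.

```python
from datetime import date, timedelta

# Offsets (in days) quoted on the slide for the WSARE 2.0 baseline.
BASELINE_OFFSETS = (35, 42, 49, 56)


def baseline_dates(current_day, offsets=BASELINE_OFFSETS):
    """Return the calendar days whose records form the baseline."""
    return [current_day - timedelta(days=k) for k in offsets]


def split_recent_and_baseline(records, current_day):
    """Split (day, payload) pairs into recent and baseline lists.

    `records` is a hypothetical list of (date, payload) tuples; "recent"
    is simplified here to mean records dated on the current day.
    """
    wanted = set(baseline_dates(current_day))
    recent = [r for d, r in records if d == current_day]
    baseline = [r for d, r in records if d in wanted]
    return recent, baseline
```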
18
Step 2: Search for Best Scoring Rule
  • For each rule, form a 2x2 contingency table, e.g.

                  CountRecent  CountBaseline
Age Decile = 3    48           45
Age Decile ≠ 3    86           220

  • Perform Fisher's Exact Test to get a p-value for
    each rule => call this p-value the score
  • Take the rule with the lowest score. Call this
    rule RBEST.
  • This score is not the true p-value of RBEST
    because we are performing multiple hypothesis
    tests on each day to find the rule with the best
    score
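
A minimal sketch of the scoring step, assuming a two-sided Fisher's exact test (the slide does not say whether the test is one- or two-sided). It is written in plain Python so it has no dependencies; the function name is illustrative.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability that the top-left cell equals x,
        # with all row and column totals held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    p_obs = prob(a)
    # Sum the probabilities of every table no more likely than the observed one.
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-9))
```

For the example table on this slide, the score of the rule would be `fisher_exact_two_sided(48, 45, 86, 220)`.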
19
The Multiple Hypothesis Testing Problem
  • Suppose we reject the null hypothesis when score < α,
    where α = 0.05
  • For a single hypothesis test, the probability of
    making a false discovery = α
  • Suppose we do 1000 tests, one for each possible
    rule
  • Probability(false discovery) could be as bad as
    $1 - (1 - 0.05)^{1000} \gg 0.05$
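
The worst-case figure can be checked numerically. The sketch below assumes the tests are independent and that every null hypothesis is true, which is the setting behind the "could be as bad as" bound; the function name is illustrative.

```python
def familywise_error(alpha, num_tests):
    """Probability of at least one false discovery among independent tests."""
    return 1.0 - (1.0 - alpha) ** num_tests
```

With `alpha = 0.05`, a single test gives 0.05, while 1000 tests give a value indistinguishable from 1.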

20
Step 3: Randomization Test

Record  Original date   Shuffled date
C2      June 4, 2002    June 4, 2002
C3      June 5, 2002    June 12, 2002
C4      June 12, 2002   July 31, 2002
C5      June 19, 2002   June 26, 2002
C6      June 26, 2002   July 31, 2002
C7      June 26, 2002   June 5, 2002
C8      July 2, 2002    July 2, 2002
C9      July 3, 2002    July 3, 2002
C10     July 10, 2002   July 10, 2002
C11     July 17, 2002   July 17, 2002
C12     July 24, 2002   July 24, 2002
C13     July 30, 2002   July 30, 2002
C14     July 31, 2002   June 19, 2002
C15     July 31, 2002   June 26, 2002

  • Take the recent cases and the baseline cases.
    Shuffle the date field to produce a randomized
    dataset called DBRand
  • Find the rule with the best score on DBRand.

21
Step 3: Randomization Test
Repeat the procedure on the previous slide for
1000 iterations. Determine how many scores from
the 1000 iterations are better than the original
score.
If the original score placed in the top 1% of the
1000 scores from the randomization test, we would
be impressed and an alert should be raised.
Estimated p-value of the rule =
(number of better scores) / (number of iterations)
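
A hedged sketch of the randomization loop. The `best_score` callable stands in for the rule search of Step 2 and is an assumption of this example, as are the parameter names. Reassigning which pooled records count as "recent" has the same effect on the recent-versus-baseline split as shuffling the date field.

```python
import random


def randomization_p_value(recent, baseline, best_score, iterations=1000, seed=0):
    """Estimate the p-value of the best-scoring rule by shuffling dates.

    `best_score(recent, baseline)` is a caller-supplied (hypothetical)
    function returning the lowest rule score; lower means more surprising.
    """
    rng = random.Random(seed)
    original = best_score(recent, baseline)
    pooled = list(recent) + list(baseline)
    n_recent = len(recent)
    better = 0
    for _ in range(iterations):
        # Randomly reassign records to "recent" and "baseline", keeping
        # the number of recent records unchanged.
        rng.shuffle(pooled)
        rand_recent, rand_baseline = pooled[:n_recent], pooled[n_recent:]
        if best_score(rand_recent, rand_baseline) < original:
            better += 1
    return better / iterations
```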
22
Two Kinds of Analysis
  • Day by Day
  • If we want to run WSARE just for the current day,
    then we end here.
  • Historical Analysis
  • If we want to review all previous days and their
    p-values for several years and control for some
    percentage of false positives,
    then we'll once again run into overfitting
    problems
  • We need to compensate for multiple hypothesis
    testing because we perform a hypothesis test on
    each day in the history

23
We only need to do this for historical analysis!
  • False Discovery Rate [Benjamini and Hochberg]
  • Can determine which of these p-values are
    significant
  • Specifically, given an α_FDR, FDR guarantees that the
    expected fraction of false positives among the
    p-values declared significant is at most α_FDR
  • Given an α_FDR, FDR produces a threshold below
    which any p-values in the history are considered
    significant
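
The thresholding step can be sketched with the standard Benjamini-Hochberg step-up procedure. The function and argument names are illustrative, and the sketch assumes the daily p-values are simply collected in a list.

```python
def fdr_threshold(p_values, alpha_fdr):
    """Benjamini-Hochberg: return the largest p-value deemed significant.

    Returns None when no p-value passes. Every p-value at or below the
    returned threshold is considered significant.
    """
    m = len(p_values)
    threshold = None
    for rank, p in enumerate(sorted(p_values), start=1):
        # Compare the rank-th smallest p-value against rank * alpha / m.
        if p <= rank * alpha_fdr / m:
            threshold = p
    return threshold
```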

24
WSARE v3.0
25
WSARE v2.0 Review
  1. Obtain Recent and Baseline datasets
  2. Search for rule with best score
  3. Determine p-value of best scoring rule through
     randomization test
  4. If p-value is less than threshold, signal alert

(Diagram labels: All Data, Recent Data, Baseline)
27
Obtaining the Baseline
Baseline
Recall that the baseline was assumed to be
captured by data that was from 35, 42, 49, and 56
days prior to the current day.
What if this assumption isn't true? What if data
from 7, 14, 21 and 28 days prior is better?
We would like to determine the baseline
automatically!
28
Temporal Trends
  • But health care data has many different trends
    due to
  • Seasonal effects in temperature and weather
  • Day of Week effects
  • Holidays
  • Etc.
  • Allowing the baseline to be affected by these
    trends may dramatically alter the detection time
    and false positives of the detection algorithm

29
Temporal Trends
From Goldenberg, A., Shmueli, G., Caruana, R.
A., and Fienberg, S. E. (2002). Early
statistical detection of anthrax outbreaks by
tracking over-the-counter medication sales.
Proceedings of the National Academy of Sciences
(pp. 5237-5249)
30
WSARE v3.0
  • Generate the baseline
  • Taking into account recent flu levels
  • Taking into account that today is a public
    holiday
  • Taking into account that this is Spring
  • Taking into account recent heatwave
  • Taking into account that there's a known natural
    food-borne outbreak in progress

Bonus: more efficient use of historical data
31
Conditioning on observed environment: well
understood for univariate time series
  • Example Signals
  • Number of ED visits today
  • Number of ED visits this hour
  • Number of Respiratory Cases Today
  • School absenteeism today
  • Nyquil Sales today

32
An easy case
(Figure: signal plot with Mean and Upper Safe Range marked)
  • Dealt with by Statistical Quality Control
  • Record the mean and standard deviation up to the
    current time.
  • Signal an alarm if we go outside 3 sigmas
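
As a small sketch of this control-chart rule, assuming the history is a plain list of past signal values (the function name and the use of the sample standard deviation are choices made for this example):

```python
from statistics import mean, stdev


def three_sigma_alarm(history, today, num_sigmas=3.0):
    """Return True if today's signal is outside mean +/- num_sigmas * sd."""
    mu = mean(history)
    sigma = stdev(history)  # sample standard deviation; needs >= 2 points
    return abs(today - mu) > num_sigmas * sigma
```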

34
Conditioning on Seasonal Effects
Fit a periodic function (e.g. sine wave) to
previous data. Predict today's signal and 3-sigma
confidence intervals. Signal an alarm if we're
off. Reduces false alarms from natural
outbreaks. Different times of year deserve
different thresholds.
35
Example: [Tsui et al.]
Weekly counts of PI from week 1/98 to 48/00
From "Value of ICD-9-Coded Chief Complaints for
Detection of Epidemics", Fu-Chiang Tsui, Michael
M. Wagner, Virginia Dato, Chung-Chou Ho Chang,
AMIA 2000
37
Seasonal Effects with Long-Term Trend
Called the Serfling Method [Serfling, 1963]
Weekly counts of IS from week 1/98 to 48/00.
Fit a periodic function (e.g. sine wave) plus a
linear trend: $E[\mathrm{Signal}] = a + bt + c \sin(d + t/365)$
Good if there's a long-term trend in the
disease or the population.
From "Value of ICD-9-Coded Chief Complaints for
Detection of Epidemics", Fu-Chiang Tsui, Michael
M. Wagner, Virginia Dato, Chung-Chou Ho Chang,
AMIA 2000
39
Day-of-week effects
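Another simple form of ANOVA
Fit a day-of-week component: $E[\mathrm{Signal}] = a + \delta_{\mathrm{day}}$
E.g. $\delta_{\mathrm{mon}} = +5.42$, $\delta_{\mathrm{tue}} = +2.20$,
$\delta_{\mathrm{wed}} = +3.33$, $\delta_{\mathrm{thu}} = +3.10$,
$\delta_{\mathrm{fri}} = +4.02$, $\delta_{\mathrm{sat}} = -12.2$,
$\delta_{\mathrm{sun}} = -23.42$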
From Goldenberg, A., Shmueli, G., Caruana, R.
A., and Fienberg, S. E. (2002). Early
statistical detection of anthrax outbreaks by
tracking over-the-counter medication sales.
Proceedings of the National Academy of Sciences
(pp. 5237-5249)
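
The day-of-week model on this slide can be fitted with simple averages. The sketch below uses one common convention, in which `a` is the average of the per-day means so that the deltas sum to zero. That convention is an assumption of this example and need not be the one behind the numbers quoted above.

```python
from collections import defaultdict
from statistics import mean


def fit_day_of_week(observations):
    """Fit E[Signal] = a + delta[day] from (day, value) pairs.

    `a` is the average of the per-day means, so the deltas sum to zero.
    """
    by_day = defaultdict(list)
    for day, value in observations:
        by_day[day].append(value)
    day_means = {day: mean(vals) for day, vals in by_day.items()}
    a = mean(list(day_means.values()))
    delta = {day: m - a for day, m in day_means.items()}
    return a, delta
```

The expected signal for a given day is then `a + delta[day]`, and the residuals can be monitored with the 3-sigma rule.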
40
Analysis of variance (ANOVA)
  • Good news
  • If you're tracking a daily aggregate (univariate
    data), then ANOVA can take care of many of these
    effects.
  • But
  • What if you're tracking a whole joint
    distribution of events?

41
Idea: Bayesian Networks
Bayesian Network: a graphical model representing
the joint probability distribution of a set of
random variables
  • "On cold Tuesday mornings, the folks coming in
    from the north part of the city are more likely
    to have respiratory problems"
  • "Patients from West Park Hospital are less likely
    to be young"
  • "On the day after a major holiday, expect a boost
    in the morning followed by a lull in the
    afternoon"
  • "The Viral prodrome is more likely to co-occur
    with a Rash prodrome than Botulinic"
42
WSARE Overview
  1. Obtain Recent and Baseline datasets
  2. Search for rule with best score
  3. Determine p-value of best scoring rule through
     randomization test
  4. If p-value is less than threshold, signal alert

(Diagram labels: All Data, Recent Data, Baseline)
44
Obtaining Baseline Data
Inputs: All Historical Data and Today's Environment
  1. Learn Bayesian Network
  2. Generate baseline given today's environment

Baseline: what should be happening today given today's
environment
46
Step 1: Learning the Bayes Net Structure
Involves searching over DAGs for the structure
that maximizes a scoring function. The most common
algorithm is hillclimbing. Starting from an initial
structure, there are 3 possible operations:
  • Add an arc
  • Delete an arc
  • Reverse an arc
But hillclimbing is too slow, and single-link
modifications may not find the correct structure
(Xiang, Wong and Cercone 1997). We use Optimal
Reinsertion (Moore and Wong 2002).
47
Optimal Reinsertion
1. Select target node T in current graph
2. Remove all arcs connected to T
48
Optimal Reinsertion
3. Efficiently find new in/out arcs for T
4. Choose best new way to connect T
51
The Outer Loop
  • For NumJolts:
    • Begin with a randomly corrupted version of the best
      DAG so far
    • Until no change in current DAG:
      • Generate random ordering of nodes
      • For each node in the ordering, do Optimal
        Reinsertion

Conventional hill-climbing without maxParams
restriction
52
How is Optimal Reinsertion done efficiently?
Scoring functions can be decomposed
(Figure: nodes P1, P2, P3 and T)
Efficiency Tricks
  1. Create an efficient cache of NodeScore(PS -> T)
    values using ADSearch [Moore and Schneider 2002]
  2. Restrict PS -> T combinations to those with CPTs
    with maxParams or fewer parameters
  3. Additional Branch and Bound is used to restrict
    space an additional order of magnitude

53
Environmental Attributes
  • Divide the data into two types of attributes
  • Environmental attributes: attributes that cause
    trends in the data, e.g. day of week, season,
    weather, flu levels
  • Response attributes: all other non-environmental
    attributes

54
Environmental Attributes
  • When learning the Bayesian network structure, do
    not allow environmental attributes to have
    parents.
  • Why?
  • We are not interested in predicting their
    distributions
  • Instead, we use them to predict the distributions
    of the response attributes
  • Side benefit: we can speed up the structure
    search by avoiding DAGs that assign parents to
    the environmental attributes

(Figure: environmental attribute nodes Season, Day of Week, Weather, Flu Level)
56
Step 2: Generate Baseline Given Today's
Environment
Suppose we know the following for today:

       Season  Day of Week  Weather  Flu Level
Today  Winter  Monday       Snow     High

We fill in these values for the environmental
attributes in the learned Bayesian network
(Day of Week = Monday, Flu Level = High,
Season = Winter, Weather = Snow).
Sampling is easy because environmental attributes
are at the top of the Bayes Net.
We sample 10000 records from the Bayesian network
and make this data set the baseline.
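
A minimal sketch of this sampling step, assuming the learned network is available as conditional probability tables in a plain dictionary. The data layout, variable names and function name are illustrative and not taken from WSARE. Because the environmental attributes have no parents, fixing their values needs no inference: each response attribute is drawn in topological order given its already-filled parents.

```python
import random


def sample_baseline(cpts, order, environment, n_samples=10000, seed=0):
    """Sample records from a Bayes net with environmental values clamped.

    cpts[var] = (parents, table), where table maps a tuple of parent
    values to a {value: probability} dict. `order` lists the response
    variables in topological order. All names here are illustrative.
    """
    rng = random.Random(seed)
    records = []
    for _ in range(n_samples):
        record = dict(environment)  # environmental attributes are fixed
        for var in order:
            parents, table = cpts[var]
            dist = table[tuple(record[p] for p in parents)]
            values = list(dist)
            weights = [dist[v] for v in values]
            record[var] = rng.choices(values, weights=weights)[0]
        records.append(record)
    return records
```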
58
Why not use inference?
  • With sampling, we create the baseline data and
    then use it to obtain the p-value of the rule for
    the randomization test
  • If we used inference, we would not be able to
    perform the same randomization test and we would need
    to find some other way to correct for the
    multiple hypothesis testing
  • Sampling was chosen for its simplicity

But there may be clever things to do with
inference which may help us. File this under
future work.
59
Simulation
City with 9 regions and a different population in
each region:

NW 100   N 400   NE 500
W  100   C 200   E  300
SW 200   S 200   SE 600

For each day, sample the city's environment from
the following Bayesian Network (figure; nodes listed):
Date, Season, Day of Week, Weather, Flu Level,
Region Anthrax Concentration, Region Food Condition,
Previous Weather, Previous Flu Level,
Previous Region Anthrax Concentration,
Previous Region Food Condition
60
Simulation
For each person in a region, sample their profile
(figure: Bayesian network with the following nodes):
DAY OF WEEK, FLU LEVEL, SEASON, WEATHER, DATE, REGION,
AGE, GENDER, Region Anthrax Concentration,
Region Food Condition, Region Grassiness,
Outside Activity, Immune System, Heart Health,
Has Anthrax, Has Flu, Has Sunburn, Has Cold, Has Allergy,
Has Heart Attack, Has Food Poisoning, Disease,
Actual Symptom, REPORTED SYMPTOM, ACTION, DRUG
61
Visible Environmental Attributes
(Figure: the person-profile Bayesian network from the previous slide)
62
Simulation
(Figure: the person-profile Bayesian network from the previous slides)
Diseases: allergy, cold, sunburn, flu, food
poisoning, heart problems, anthrax (in order of
precedence)
63
Simulation
(Figure: the person-profile Bayesian network from the previous slides)
Actions: None, Purchase Medication, ED visit,
Absent. If Action is not None, output record to
dataset.
65
Simulation Plot
Anthrax release (not highest peak)
66
Simulation
  • 100 different data sets
  • Each data set consisted of a two-year period
  • Anthrax release occurred at a random point during
    the second year
  • Algorithms allowed to train on data from the
    current day back to the first day in the
    simulation
  • Any alert before the actual anthrax release is
    considered a false positive
  • Detection time calculated as first alert after
    anthrax release. If no alerts raised, cap
    detection time at 14 days
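
The evaluation rules above can be written down as a short helper. This is a sketch with some assumptions made explicit: days are integer indices, an alert on the release day itself counts as a detection, and a first alert more than 14 days after the release is also capped at 14.

```python
def evaluate_alerts(alert_days, release_day, cap_days=14):
    """Return (false_positives, detection_time) for one simulated data set.

    `alert_days` are day indices on which the algorithm raised an alert.
    Alerts strictly before `release_day` count as false positives; treating
    an alert on the release day itself as a detection is an assumption.
    """
    false_positives = sum(1 for d in alert_days if d < release_day)
    after = [d for d in alert_days if d >= release_day]
    if not after:
        # No alert after the release: detection time is capped.
        return false_positives, cap_days
    return false_positives, min(min(after) - release_day, cap_days)
```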

67
Other Algorithms used in Simulation
  1. Standard algorithm
  2. WSARE 2.0
  3. WSARE 2.5
     Use all past data but condition on
     environmental attributes

68
Results on Simulation
69
Conclusion
  • One approach to biosurveillance: one algorithm
    monitoring millions of signals derived from
    multivariate data, instead of hundreds of
    univariate detectors
  • WSARE is best used as a general purpose safety
    net in combination with other detectors
  • Modeling historical data with Bayesian Networks
    to allow conditioning on unique features of today
  • Computationally intense unless we use clever
    algorithms

70
Conclusion
  • WSARE 2.0 deployed during the past year
  • WSARE 3.0 about to go online
  • WSARE now being extended to additionally exploit
    over the counter medicine sales

71
For more information
  • References
  • Wong, W. K., Moore, A. W., Cooper, G., and
    Wagner, M. (2002). Rule-based Anomaly Pattern
    Detection for Detecting Disease Outbreaks.
    Proceedings of AAAI-02 (pp. 217-223). MIT Press.
  • Wong, W. K., Moore, A. W., Cooper, G., and
    Wagner, M. (2003). Bayesian Network Anomaly
    Pattern Detection for Disease Outbreaks.
    Proceedings of ICML 2003.
  • Moore, A., and Wong, W. K. (2003). Optimal
    Reinsertion: A New Search Operator for
    Accelerated and More Accurate Bayesian Network
    Structure Learning. Proceedings of ICML 2003.
  • AUTON lab website: http://www.autonlab.org/wsare
  • Email: wkw@cs.cmu.edu