PROGRAMS IN HOMELAND SECURITY AT DIMACS

Slides: 131

Provided by: dimacsR

Learn more at: http://archive.dimacs.rutgers.edu

Transcript and Presenter's Notes

1
PROGRAMS IN HOMELAND SECURITY AT DIMACS
Fred S. Roberts DIMACS Director
2
THE FOUNDING OF DIMACS: THE NSF SCIENCE AND
TECHNOLOGY CENTERS PROGRAM

The STC program was launched by the White House
and the National Academy of Sciences in 1988 in
order to increase the economic competitiveness of
the U.S.
NSF ran a nationwide competition. The rules:
cutting-edge research
education and knowledge transfer
university-industry partnerships

3
THE FOUNDING OF DIMACS

Because of the increasing importance of discrete
mathematics and theoretical computer science,
especially in the fields of telecommunications
and computing, four institutions, Rutgers and
Princeton Universities and AT&T Bell Labs and
Bell Communications Research (Bellcore), each
developed strong research groups in these fields.
Under the leadership of Rutgers, they came
together to found DIMACS and entered the STC
competition.
There were more than 800 preproposals and more than
300 proposals, in all fields of science, and 11
winners.

4
The DIMACS Partners Today

Rutgers University
Princeton University
AT&T Labs
Bell Labs (Lucent Technologies)
NEC Laboratories America
Telcordia Technologies
Affiliates:
Avaya Labs
HP Labs
IBM Research
Microsoft Research
Stevens Institute of Technology
5
WHO IS DIMACS?

There are about 250 scientists affiliated with
DIMACS and called permanent members.
Most are from the partner and affiliated
organizations.
They include many of the world's leaders in
discrete mathematics and theoretical computer
science and their applications.
They also include statisticians, biologists,
psychologists, chemists, epidemiologists, and
engineers.
None are paid by DIMACS, but they join in DIMACS
projects.

6
Outline: A Selection of DIMACS Projects

Bioterrorism Sensor Location
Port of Entry Inspection Algorithms
Monitoring Message Streams
Author Identification
Computational and Mathematical Epidemiology
Adverse Event/Disease Reporting/Surveillance/Analysis
Bioterrorism Working Group
Modeling Social Responses to Bioterrorism
Predicting Disease Outbreaks from Remote Sensing
and Media Data
Communication Security and Information Privacy

7
The Bioterrorism Sensor Location Problem
8

Early warning is critical in defense against
terrorism.
This is a crucial factor underlying the
government's plans to place networks of
sensors/detectors to warn of a bioterrorist
attack.

The BASIS System: Salt Lake City
9
Locating Sensors is not Easy

Sensors are expensive
How do we select them and where do we place them
to maximize coverage, expedite an alarm, and
keep the cost down?
Approaches that improve upon existing, ad hoc
location methods could save countless lives in
the case of an attack and also money in capital
and operational costs.

10
Two Fundamental Problems

Sensor Location Problem (SLP):
Choose an appropriate mix of sensors
Decide where to locate them for best protection
and early warning

11
Two Fundamental Problems

Pattern Interpretation Problem (PIP): When sensors set
off an alarm, help public health decision makers
decide:
Has an attack taken place?
What additional monitoring is needed?
What was its extent and location?
What is an appropriate response?

12
The SLP: What is a Measure of Success of a
Solution?

A modeling problem.
Needs to be made precise.
Many possible formulations.

13
The SLP: What is a Measure of Success of a
Solution?

Identify and ameliorate false alarms.
Defending against a worst case attack or an
average case attack.
Minimize time to first alarm? (Worst case?
Average case?)
Maximize coverage of the area.
Minimize geographical area not covered
Minimize size of population not covered
Minimize probability of missing an attack

14
The SLP: What is a Measure of Success of a
Solution?

Cost: Given a mix of available sensors and a
fixed budget, what mix will best accomplish our
other goals?

15
The SLP: What is a Measure of Success of a
Solution?

It's hard to separate the goals.
Even a small number of sensors might detect an
attack if there is no constraint on time to
alarm.
Without budgetary restrictions, a lot more can be
accomplished.

16
The Sensor Location Problem

Approach is to develop new algorithmic methods.
We are building on approaches to other modeling
problems, seeing if they can be modified in the
sensor location context.
This is a multi-criteria modeling problem and it
seems hopeless to try to find optimal solutions.
We will be happy with efficient algorithms that
find good solutions.

17
Algorithmic Approaches I: Greedy Algorithms
18
Greedy Algorithms

Find the most important location first and locate
a sensor there.
Find second-most important location.
Etc.
Builds on earlier mathematical work at Institute
for Defense Analyses (Grotte, Platt)
Steepest ascent approach.
No guarantee of optimal or best solution.
In practice, gets pretty close to optimal
solution.
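
The greedy steps above can be sketched as follows. This is only an illustration under stated assumptions: "importance" of a location is taken to be the number of not-yet-covered city blocks a sensor there would cover, and the site names and coverage sets are hypothetical. It is not the Grotte-Platt method itself.

```python
def greedy_placement(coverage, budget):
    """Pick up to `budget` sites, each time adding the site that covers
    the most not-yet-covered blocks (steepest ascent)."""
    chosen, covered = [], set()
    for _ in range(budget):
        # Marginal gain of each unused site given what is already covered.
        best = max(
            (s for s in coverage if s not in chosen),
            key=lambda s: len(coverage[s] - covered),
            default=None,
        )
        if best is None or not coverage[best] - covered:
            break  # nothing left to gain
        chosen.append(best)
        covered |= coverage[best]
    return chosen, covered

# Hypothetical candidate sites and the city blocks each sensor would cover.
coverage = {
    "A": {1, 2, 3, 4},
    "B": {3, 4, 5},
    "C": {5, 6},
    "D": {1, 6},
}
chosen, covered = greedy_placement(coverage, budget=2)
print(chosen, sorted(covered))
```

Each step only looks at the marginal gain, which is why the method is fast but carries no guarantee of reaching the best overall placement.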

19
Algorithmic Approaches II: Variants of Classic
Location and Clustering Methods
20
Algorithmic Approaches II: Variants of Classic
Location and Clustering Methods

Location theory: locate facilities (sensors) to
be used by users located in a region.
Cluster analysis: Given points in a metric space,
partition them into groups or clusters so points
within clusters are relatively close.
Clusters correspond to points covered by a
facility (sensor).

21
Variants of Classic Location and Clustering
Methods

k-median clustering: Given k sensors, place them
so each point in the city is within x feet of a
sensor.
Complications: More dimensions, since location affects
sensitivity, wind strength enters, sensors have
different characteristics, etc.
This higher-dimensional k-median clustering
problem is hard! Best-known algorithms are due to
Rafail Ostrovsky.
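
To make the objective concrete, here is a tiny exhaustive sketch, not one of the algorithms cited above. It assumes a finite list of candidate sites and Manhattan distance, both of which are illustrative choices. It reports two numbers for the best placement: the total distance from each point to its nearest sensor (the usual k-median objective in the clustering literature) and the largest such distance (the "within x feet" radius mentioned on the slide).

```python
from itertools import combinations

def k_median_brute_force(points, sites, k):
    """Try every k-subset of candidate sites; return the subset that
    minimizes the total distance from each point to its nearest site."""
    def dist(p, q):
        return abs(p[0] - q[0]) + abs(p[1] - q[1])  # Manhattan distance

    best = None
    for subset in combinations(sites, k):
        nearest = [min(dist(p, s) for s in subset) for p in points]
        total, radius = sum(nearest), max(nearest)
        if best is None or total < best[0]:
            best = (total, radius, subset)
    return best

# Hypothetical city points and candidate sensor sites.
points = [(0, 0), (1, 0), (0, 1), (9, 9), (10, 9), (9, 10)]
sites = [(0, 0), (5, 5), (9, 9)]
total, radius, subset = k_median_brute_force(points, sites, k=2)
print(total, radius, subset)
```

Enumeration is only feasible for toy instances; the number of subsets grows combinatorially with the number of sites, which is one reason approximation algorithms are of interest.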

22
Variants of Classic Location and Clustering
Methods

Further complications make this even more
challenging:
Different costs of different sensors
Restrictions on where we can place different
sensors
Is it better to have every point within x feet of
some sensor or every point within y feet of at
least three sensors (y > x)?

Approximation methods due to Chuzhoy,
Ostrovsky, and Rabani and to Guha, Tardos, and
Shmoys are relevant.

23
Algorithmic Approaches III: Variants of Highway
Sensor Network Algorithms
24
Variants of Highway Sensor Network Algorithms

Sensors located along highways and nearby
pathways measure atmospheric and road conditions.
Muthukrishnan, et al. have developed very
efficient algorithms for sensor location.
Based on bichromatic clustering and
bichromatic facility location (color nodes
corresponding to sensors red, nodes corresponding
to sensor messages blue)

25
Variants of Highway Sensor Network Algorithms

These algorithms apply to situations with many
more sensors than the bioterrorism sensor
location problem.
As BT sensor technology changes, we can envision
a myriad of miniature sensors distributed around
a city, making this work all the more relevant.

26
Algorithmic Approaches IV: Building on Equipment
Placing Algorithms
27
Building on Equipment Placing Algorithms

The Node Placement Problem is the problem of
determining locations or nodes to install certain
types of networking equipment.
Coverage and cost are major considerations.
Researchers at Telcordia Technologies have
studied variations of this problem arising from
broadband access technologies.

28
The Broadband Access Node Placement Problem

There are inherent range limitations that drive
placement.
E.g., a customer for DSL service must be within xx
feet of an assigned multiplexer.
Multiplexer = sensor.
Problem solved using dynamic programming
algorithms.
(Tamra Carpenter, Martin Eiger, David Shallcross,
Paul Seymour)
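
A minimal dynamic programming sketch of a range-limited placement problem follows. It is a one-dimensional toy, assuming customers and candidate sites lie on a line and every site has a known cost; the positions, costs, and the function name `min_cost_placement` are hypothetical. The cited work addresses richer network settings, so this only illustrates the style of recurrence.

```python
import math

def min_cost_placement(customers, sites, reach):
    """customers: positions on a line; sites: list of (position, cost).
    Returns the minimum total cost of sites such that every customer is
    within `reach` of a chosen site (math.inf if impossible)."""
    xs = sorted(customers)
    n = len(xs)
    # Each site serves a contiguous run of the sorted customers.
    runs = []
    for pos, cost in sites:
        served = [i for i, x in enumerate(xs) if abs(x - pos) <= reach]
        if served:
            runs.append((served[0], served[-1], cost))
    # best[i] = cheapest way to serve at least the first i customers.
    best = [0.0] + [math.inf] * n
    for i in range(n):
        if best[i] == math.inf:
            continue
        for lo, hi, cost in runs:
            if lo <= i <= hi:  # this site serves the first unserved customer
                for j in range(i + 1, hi + 2):
                    best[j] = min(best[j], best[i] + cost)
    return best[n]

customers = [1, 2, 6, 7, 12]
sites = [(2, 5.0), (4, 8.0), (7, 4.0), (10, 6.0), (12, 3.0)]
cost = min_cost_placement(customers, sites, reach=3)
print(cost)
```

In this instance the cheapest plan uses the sites at positions 4 and 12; the table `best` is filled left to right, so each prefix of customers is solved once.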

29
The Broadband Access Node Placement Problem:
Complications

Restrictions on types of equipment that can be
placed at a given node.
Constraints on how far a signal from a given
piece of equipment can travel.
Cost and profit maximization considerations.
Relevance of work on general integer programming,
the knapsack cover problem, and local access
network expansion problems.

30
The Pattern Interpretation Problem
31
The Pattern Interpretation Problem

It will be up to the Decision Maker to decide how
to respond to an alarm from the sensor network.

32
The Pattern Interpretation Problem

Little has been done to develop analytical models
for rapid evaluation of a positive alarm or
pattern of alarms from a sensor network.
How can this pattern be used to minimize false
alarms?
Given an alarm, what other surveillance measures
can be used to confirm an attack, locate areas of
major threat, and guide public health
interventions?

33
The Pattern Interpretation Problem (PIP)

Close connection to the SLP.
How we interpret a pattern of alarms will affect
how we place the sensors.
The same simulation models used to place the
sensors can help us in tracing back from an alarm
to a triggering attack.

34
Approaching the PIP: Minimizing False Alarms
35
Approaching the PIP: Minimizing False Alarms

One approach: Redundancy. Require two or more
sensors to make a detection before an alarm is
considered confirmed.

36
Approaching the PIP: Minimizing False Alarms

Portal Shield requires two positives for the
same agent during a specific time period.
Redundancy II: Place two or more sensors at or
near the same location. Require two proximate
sensors to give off an alarm before we consider
it confirmed.
Redundancy drawbacks: cost, delay in confirming
an alarm.
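
The trade-off behind redundancy can be worked through numerically. The sketch below assumes sensors fire independently (a simplification that does not hold for real BT sensor networks) and uses made-up per-sensor rates; it is an arithmetic illustration, not a model of Portal Shield.

```python
from math import comb

def prob_at_least(k, n, p):
    """P(at least k of n independent sensors fire), each with probability p."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))

p_false = 0.01   # assumed per-sensor false-positive rate per time window
p_detect = 0.90  # assumed per-sensor detection rate during a real release

single_false = prob_at_least(1, 1, p_false)
paired_false = prob_at_least(2, 2, p_false)
single_detect = prob_at_least(1, 1, p_detect)
paired_detect = prob_at_least(2, 2, p_detect)
print(single_false, paired_false, single_detect, paired_detect)
```

Under these assumptions, requiring two positives cuts the false-alarm probability by a factor of 100, but it also lowers the detection probability from 0.90 to 0.81, which is one face of the cost of redundancy.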

37
Approaching the PIP: Using Decision Rules

Existing sensors come with a sensitivity level
specified and sound an alarm when the number of
particles collected is sufficiently high above
threshold.

38
Approaching the PIP: Using Decision Rules

Alternative decision rule: alarm if two sensors
reach 90% of threshold, three reach 75% of
threshold, etc.
One approach: use clustering algorithms for
sounding an alarm based on a given distribution
of clusters of sensors reaching a percentage of
threshold.
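
The alternative decision rule can be written down directly. In this sketch the rule table is an assumption: the pairs (2, 90%) and (3, 75%) come from the slide, while the (1, 100%) entry stands in for the ordinary single-sensor alarm, and the readings are invented.

```python
def should_alarm(readings, threshold, rules=((1, 1.00), (2, 0.90), (3, 0.75))):
    """Alarm if, for any (count, fraction) rule, at least `count` sensors
    read at or above `fraction` of the threshold."""
    for count, fraction in rules:
        hits = sum(1 for r in readings if r >= fraction * threshold)
        if hits >= count:
            return True
    return False

print(should_alarm([95, 92, 10], threshold=100))     # two sensors above 90%
print(should_alarm([80, 78, 76, 5], threshold=100))  # three sensors above 75%
print(should_alarm([95, 50, 40], threshold=100))     # no rule satisfied
```

The rule ignores where the sensors are; a clustering-based variant would additionally require the near-threshold sensors to be close to one another.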

39
Approaching the PIP: Using Decision Rules

When sensors are to be used jointly, the rules
for tuning each sensor should be optimized to
take advantage of the fact that each is part of a
network.
The optimal tuning depends on the decision rule
applied to reach an overall decision given the
sensor inputs.

40
Approaching the PIP: Using Decision Rules

Prior work along these lines in missile detection
(Cherikh and Kantor)

41
Approaching the PIP: Using Decision Rules

Most work has concentrated on the case of
stochastic independence of information available
at two sensors, which is clearly violated in BT sensor
location problems.
Even with stochastic independence, finding
optimal decision rules is nontrivial.
Recent promising approaches of Paul Kantor study
fusion of multiple methods for monitoring message
streams.

42
Approaching the PIP: Spatio-Temporal Mining of
Sensor Data
43
Approaching the PIP: Spatio-Temporal Mining of
Sensor Data

Sensors provide observations of the state of the
world localized in space and time.
Finding trends in data from individual sensors:
time series data mining.
PIP: detecting general correlations in multiple
time series of observations.
This has been studied in statistics, database
theory, knowledge discovery, data mining.
Complications: proximity relationships based on
geography; complex chronological effects.

44
Approaching the PIP: Spatio-Temporal Mining of
Sensor Data

Sensor technology is evolving rapidly.
It makes sense to consider idealized settings
where data are collected continuously and
communicated instantly.
Then, modern methods of spatio-temporal data
mining due to Muthukrishnan and others are
relevant.

45
Approaching the PIP: Triggering Other Methods of
Surveillance

One type of BT surveillance cannot be considered
in isolation.
Question: How can the pattern of sensor warnings
guide other biosurveillance methods?
Increased syndromic surveillance?
Change threshold for alarm in syndromic
surveillance?
Increased attention to E.R. visits in a certain
region?

46
Approaching the PIP: Triggering Other Methods of
Surveillance

Decreased threshold for alarm from subway worker
absenteeism levels?

47
Approaching the PIP: Triggering Other Methods of
Surveillance

If there is an initial alarm, each sensor may be
read more often.
How do we pick the sensors to read more
frequently?
This is adaptive biosensor engagement.
Methods of bichromatic combinatorial optimization
may be relevant.
As for the SLP, sensors get one color, sensor
messages another.
Relevance of work of Muthukrishnan.

48
Outline

Bioterrorism Sensor Location
Port of Entry Inspection Algorithms
Monitoring Message Streams
Author Identification
Computational and Mathematical Epidemiology
Adverse Event/Disease Reporting/Surveillance/Analysis
Bioterrorism Working Group
Modeling Social Responses to Bioterrorism
Predicting Disease Outbreaks from Remote Sensing
and Media Data
Communication Security and Information Privacy

49
Port of Entry Inspection Algorithms
In collaboration with Los Alamos National
Laboratory
50
Port of Entry Inspection Algorithms

Goal: Find ways to intercept illicit nuclear
materials and weapons destined for the U.S. via
the maritime transportation system.
Aim: Develop decision support algorithms that
will help us to optimally intercept illicit
materials and weapons.
Find inspection schemes that minimize total
cost, including cost of false positives and
false negatives.

51
Sequential Decision Making Problem

Stream of entities arrives at a port
Decision Maker needs to decide which to inspect,
which to subject to increasingly stringent
inspection based on outcomes of previous
inspections
Our approach: decision logics and combinatorial
optimization methods
Builds on approach of Stroud
and Saeger and large literature
in sequential decision making.

52
Sequential Decision Making Problem

Entities arriving to be classified into
categories.
Simple case: 0 = ok, 1 = suspicious
Observations are made.
Inspection scheme specifies which observations
are to be made based on previous observations
Entities have attributes a0, a1, ..., an, each in a
number of states
Sample attributes:
Does ship's manifest set off an alarm?
Does container give off neutron or gamma emission
above threshold?
Does a radiograph image come up positive?
Does an induced fission test come up positive?

53
Sequential Decision Making Problem

Simplest Case: Attributes are in state 0 or 1
Then: Entity is a binary string like 011001
Then: Classification is a decision function F
that assigns each binary string to a category.
If there are two categories, 0 and 1, F is a
boolean function.
Example: F(000) = F(111) = 1, F(abc) = 0 otherwise
This classifies an entity as positive iff it has
none of the attributes or all of them.
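
As a small illustration, the example decision function with F(000) = F(111) = 1 can be coded and tabulated over all eight binary strings. The names `F`, `table`, and `positives` are just illustrative.

```python
from itertools import product

def F(bits):
    """Positive iff no attribute is present or every attribute is present."""
    return 1 if all(b == 0 for b in bits) or all(b == 1 for b in bits) else 0

# Truth table keyed by the entity's binary string.
table = {"".join(map(str, bits)): F(bits) for bits in product((0, 1), repeat=3)}
positives = sorted(s for s, v in table.items() if v == 1)
print(positives)
```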

54
Sequential Decision Making Problem

Different problems depending on whether or not F
is known. Assume first that F is known.
Given an entity, test its attributes until we know
enough to calculate the value of F.
An inspection scheme tells us in which order to
test the attributes to minimize cost.
Even this simplified problem is hard
computationally.

55
Binary Decision Tree Approach

We assume we have sensors to measure presence or
absence of attributes.
Build a tree:
Nodes are sensors or categories (0 or 1)
Label each node with the attribute its sensor
measures or with the number of its category
Category nodes are leaves of the tree (nodes
with only one neighbor)
Two arcs exit from each sensor node, labeled left
and right.
Take the right arc when sensor says the attribute
is present, left arc otherwise

56
Binary Decision Tree Approach

We reach category 1 from the root only through
the path a0 to a1 to 1.
Thus, an entity is classified in category 1 iff
it has both attributes.
The binary decision tree corresponds to the
boolean function F(11) = 1, F(10) = F(01) = F(00)
= 0.

Figure 1
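
A sketch of how such a tree can be represented and evaluated follows. The figure itself is not reproduced in this transcript, so the tree shape is an assumption reconstructed from the text: a0 at the root, its right arc leading to a1, and every other arc leading to category 0.

```python
from itertools import product

# A sensor node is (attribute index, left subtree, right subtree);
# a category node (leaf) is the integer 0 or 1.
tree_fig1 = (0, 0, (1, 0, 1))

def classify(tree, entity):
    """Follow right arcs when the attribute is present, left arcs otherwise."""
    while not isinstance(tree, int):
        attr, left, right = tree
        tree = right if entity[attr] == 1 else left
    return tree

truth = {bits: classify(tree_fig1, bits) for bits in product((0, 1), repeat=2)}
print(truth)
```

Only the entity with both attributes present ends in category 1, matching the boolean function stated on the slide.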
57
Binary Decision Tree Approach

We reach category 1 from the root by
a0 L to a1 R to a2 R to 1, or
a0 R to a2 R to 1
An entity is classified in category 1 iff it has
a1 and a2 and not a0, or
a0 and a2 and possibly a1.
Corresponding boolean function: F(111) = F(101)
= F(011) = 1, F(abc) = 0 otherwise.

Figure 2
58
Binary Decision Tree Approach

This binary decision tree corresponds to the same
boolean function
F(111) = F(101) = F(011) = 1, F(abc) = 0
otherwise.
However, it has one fewer observation node. So, it
is more efficient if all observations are equally
costly and equally likely.

Figure 3
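
The comparison between two trees for the same function can be checked by counting. In the sketch below, `tree_fig2` follows the paths described in the text for the four-node tree; `tree_alt` is an assumed three-node tree that tests a2 first, since the actual Figure 3 is not reproduced in this transcript. Entities are assumed equally likely and every observation is assumed to cost the same.

```python
from itertools import product

# A sensor node is (attribute index, left subtree, right subtree); leaves are 0 or 1.
tree_fig2 = (0, (1, 0, (2, 0, 1)), (2, 0, 1))   # a0 at the root, as described
tree_alt = (2, 0, (0, (1, 0, 1), 1))            # assumed shape with 3 sensor nodes

def run(tree, entity):
    """Return (category, number of sensors consulted) for one entity."""
    used = 0
    while not isinstance(tree, int):
        attr, left, right = tree
        used += 1
        tree = right if entity[attr] == 1 else left
    return tree, used

def sensor_nodes(tree):
    """Count the observation (non-leaf) nodes of a tree."""
    return 0 if isinstance(tree, int) else 1 + sensor_nodes(tree[1]) + sensor_nodes(tree[2])

entities = list(product((0, 1), repeat=3))
same_function = all(run(tree_fig2, e)[0] == run(tree_alt, e)[0] for e in entities)
avg_fig2 = sum(run(tree_fig2, e)[1] for e in entities) / len(entities)
avg_alt = sum(run(tree_alt, e)[1] for e in entities) / len(entities)
print(same_function, sensor_nodes(tree_fig2), sensor_nodes(tree_alt), avg_fig2, avg_alt)
```

Both trees compute the same function, but the assumed three-node tree consults 1.75 sensors on average instead of 2.25.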
59
Binary Decision Tree Approach

Even if the boolean function F is fixed, the
problem of finding the optimal binary decision
tree for it is NP-complete.
For small n, can try to solve it by brute force
enumeration.
But even for n = 4, not practical. (n = 4 at Port
of Long Beach-Los Angeles)
Seeking heuristic algorithms, approximations to
optimal.
Making special assumptions about the boolean
function F.
Example: For so-called monotone boolean
functions, integer programming formulations give
promising heuristics.
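
For very small n, the exhaustive approach can be sketched as a recursion over partial observations. The assumptions here are simplifying ones: every test costs one unit, sensors are error-free, and entities are uniformly distributed, so the objective is the expected number of tests. The function name `min_expected_tests` and the example function are illustrative.

```python
from functools import lru_cache
from itertools import product

def min_expected_tests(f, n):
    """Exhaustively search adaptive test orders for the decision tree with
    the fewest expected tests, entities uniformly distributed."""
    @lru_cache(maxsize=None)
    def solve(fixed):
        # `fixed` holds 0/1 for tested attributes and None for untested ones.
        free = [i for i, v in enumerate(fixed) if v is None]
        outcomes = set()
        for fill in product((0, 1), repeat=len(free)):
            bits = list(fixed)
            for i, v in zip(free, fill):
                bits[i] = v
            outcomes.add(f(tuple(bits)))
        if len(outcomes) == 1:
            return 0.0  # F is already determined: stop testing
        best = float("inf")
        for i in free:
            cost = 1.0  # one more test, then recurse on both outcomes
            for v in (0, 1):
                child = fixed[:i] + (v,) + fixed[i + 1:]
                cost += 0.5 * solve(child)
            best = min(best, cost)
        return best
    return solve((None,) * n)

def F(bits):
    # F(111) = F(101) = F(011) = 1, and 0 otherwise.
    a0, a1, a2 = bits
    return int(a2 == 1 and (a0 == 1 or a1 == 1))

best = min_expected_tests(F, 3)
print(best)
```

The recursion visits up to 3^n partial observations and re-enumerates completions at each one, so it is only a toy; it says nothing about how a realistic instance with costs, errors, and non-uniform entities should be solved.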

60
Cost Functions

Above analysis: only uses number of sensors
Using a sensor has a cost:
Unit cost of inspecting one item with it
Fixed cost of purchasing and deploying it
Delay cost from queuing up at the sensor station
How many nodes of the decision tree are actually
visited during average inspection? Depends on
distribution of entities.

61
Cost Functions

Cost of false positive: cost of additional tests.
If it means opening the container, it's very
expensive.
Cost of false negative: complex issue.

62
Complications

Sensor errors: probabilistic approach
More than two values of an attribute (present,
absent, present with 75% probability, ...)
Partially defined boolean functions (inferring
the boolean function from observations)
In this case, machine learning approaches are
promising
Bayesian binary regression
Splitting strategies
Pruning learned decision trees

63
Outline

Bioterrorism Sensor Location
Port of Entry Inspection Algorithms
Monitoring Message Streams
Author Identification
Computational and Mathematical Epidemiology
Adverse Event/Disease Reporting/Surveillance/Analysis
Bioterrorism Working Group
Modeling Social Responses to Bioterrorism
Predicting Disease Outbreaks from Remote Sensing
and Media Data
Communication Security and Information Privacy

64
Monitoring Message Streams: Algorithmic Methods
for Automatic Processing of Messages
65
OBJECTIVE
Monitor huge communication streams, in
particular, streams of textualized communication
to automatically detect pattern changes and
"significant" events
Motivation: monitoring email traffic, news,
communiques, faxes, voice intercepts (with speech
recognition)
66
TECHNICAL APPROACHES

Given stream of text in any language.
Decide whether "new events" are present in the
flow of messages.
Event: new topic or topic with unusual level of
activity.
Initial Problem: Retrospective or Supervised
Event Identification. Classification into
pre-existing classes. Given example messages on
events/topics of interest, algorithm detects
instances in the stream.

67
TECHNICAL APPROACHES: SUPERVISED FILTERING

Batch filtering: Given examples of relevant
documents up front.
Adaptive filtering: Examples accumulate; need to
decide whether to bother the analyst for guidance
(pay for information about relevance) as the process
moves along.

68
MORE COMPLEX PROBLEM: PROSPECTIVE DETECTION OR
UNSUPERVISED FILTERING

Classes change: new classes or change of meaning
A difficult problem in statistics
Recent new C.S. approaches
Semi-supervised Learning:
Algorithm suggests a possible new event/topic
Human analyst labels it and determines its
significance

69
COMPONENTS OF AUTOMATIC MESSAGE PROCESSING

(1). Compression of Text: increase speed, reduce
memory/disk use
(2). Representation of Text: convert text to
form amenable to computation and statistical
analysis
(3). Matching Scheme: compute similarity between
texts
(4). Learning Method: create profiles of
events/topics from known examples.
(5). Fusion Scheme: combine multiple filtering
techniques to increase accuracy.

70
COMPONENTS OF AUTOMATIC MESSAGE PROCESSING - II

These distinctions are somewhat arbitrary.
Many approaches to message processing overlap
several of these components of automatic message
processing; our techniques usually address more
than one component.
Project Premise: Existing methods don't exploit
the full power of the 5 components, synergies
among them, and/or an understanding of how to
apply them to text data.

71
COMPONENTS OF AUTOMATIC MESSAGE PROCESSING - III

Our approach is to develop/explore methods for
each component and then to combine them.
In the first phase of the project, we did over
5000 complete experiments with different
combinations of methods.

72
Nearest Neighbor (kNN) Classifiers

Route message by:
Finding k most similar training messages
(neighbors)
Assigning it to classes that are most common among
neighbors (using weighting by distance)
kNN classifiers studied since 1958, for text
since early 90s
Moderately effective for text, but has been
considered inefficient: finding neighbors is slow
But finding neighbors only needs to be done once,
no matter how many classes (even if huge)
So for a large number of topics, may be more
efficient than one-classifier-per-topic approaches
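
A minimal kNN routing sketch is given below. It assumes bag-of-words vectors, cosine similarity as the matching scheme, and votes weighted by similarity (standing in for the distance weighting mentioned above); the training messages and class names are invented for illustration and this is not the project's implementation.

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two sparse term-count dicts."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    na = sqrt(sum(w * w for w in a.values()))
    nb = sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_classes(message, training, k=3):
    """training: list of (term-count dict, set of class labels).
    Returns classes ranked by similarity-weighted votes of the k neighbors."""
    vec = Counter(message.lower().split())
    neighbors = sorted(training, key=lambda ex: cosine(vec, ex[0]), reverse=True)[:k]
    votes = Counter()
    for terms, labels in neighbors:
        sim = cosine(vec, terms)
        for label in labels:
            votes[label] += sim
    return [label for label, score in votes.most_common() if score > 0]

training = [
    (Counter("port container inspection radiation".split()), {"ports"}),
    (Counter("container ship manifest".split()), {"ports"}),
    (Counter("sensor alarm anthrax release".split()), {"biosensors"}),
    (Counter("vaccine stockpile smallpox".split()), {"epidemiology"}),
]
ranked = knn_classes("radiation alarm at container port", training, k=3)
print(ranked)
```

The neighbor search happens once per message, and all class scores fall out of the same neighbor list, which is the point made above about large numbers of topics.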

73
Speeding up kNN

Can finding neighbors be made fast enough to make
kNN practical?
Worked on fast implementation
Store text and classes sparsely (Representation)
Store class labels sparsely
Arrange computations to do work proportional only
to number of class labels in neighbors, not total
number of classes
Search engine heuristics: use the in-memory
inverted file (Matching)
Use inverted file (group by word, not by
document)
Retain only high impact terms within each
document, or within each inverted list
Compute similarities using only inverted lists
for the few words occurring in test document
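
The inverted-file heuristics above can be sketched in a few lines. The sketch assumes documents are already represented as term-weight dictionaries, keeps only the top-weighted terms of each document, and scores with a plain dot product; the weights, document ids, and the `keep_top` parameter are hypothetical.

```python
from collections import defaultdict

def build_inverted_index(docs, keep_top=2):
    """docs: {doc_id: {term: weight}}. Keep only the `keep_top` highest-weight
    terms of each document, then group postings by term."""
    index = defaultdict(list)
    for doc_id, terms in docs.items():
        strongest = sorted(terms.items(), key=lambda tw: tw[1], reverse=True)[:keep_top]
        for term, weight in strongest:
            index[term].append((doc_id, weight))
    return index

def score(query_terms, index):
    """Accumulate dot-product scores touching only the query's inverted lists."""
    scores = defaultdict(float)
    for term, q_weight in query_terms.items():
        for doc_id, d_weight in index.get(term, ()):
            scores[doc_id] += q_weight * d_weight
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

docs = {
    "d1": {"container": 0.8, "port": 0.5, "the": 0.1},
    "d2": {"sensor": 0.9, "alarm": 0.4, "the": 0.1},
    "d3": {"port": 0.7, "alarm": 0.6, "of": 0.1},
}
index = build_inverted_index(docs, keep_top=2)
ranking = score({"port": 1.0, "alarm": 1.0}, index)
print([doc_id for doc_id, _ in ranking])
```

Pruning low-impact terms shrinks the index and shortens the lists that must be scanned, at the price of approximate rather than exact similarities.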

74
kNN Results

Great reduction in size of inverted index and
great gain in speed of classification
Slight additional cost in effectiveness
Effectiveness slightly below our best methods
(Bayesian probit and logistic classifiers)
Compressed index 90% smaller than original index
with only 7-12% loss in effectiveness (macro-F1)
Approximate matching is 10 to 100 times faster with
only 2-10% loss in effectiveness (macro-F1)
Ours are first large scale experiments on search
engine heuristic for neighbor lookup in kNN
Partnership between theoreticians and
practitioners.

75
Bayesian Methods

Bayesian statistical methods place prior
probability distributions on all unknowns, and
then compute posterior distribution for the
unknowns conditional on the knowns.

Thomas Bayes
76
Bayesian Methods

Zhang and Oles (2001) developed an efficient
optimization algorithm for logistic regression
(10,000 dimensions) and achieved excellent
predictive performance.
The Bayesian approach explicitly incorporates
prior knowledge about model complexity
(regularization)
We extended the Bayesian approach to incorporate
a prior requirement for sparsity.
Logistic regression has one parameter per
dimension; our sparse model sets many of these to
zero and handles hundreds of thousands of parameters
efficiently.
Resulting sparse models produce outstanding
accuracy and ultra-fast predictions with no
ad-hoc feature selection
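
One common way to encode a prior preference for sparsity is a Laplace prior on the weights, whose MAP estimate is L1-penalized logistic regression. The sketch below takes that form as an assumption and fits it by proximal gradient descent on a toy data set; it is an illustration of how such a prior drives weights to exactly zero, not a description of the released software or its optimizer.

```python
from math import exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def soft_threshold(v, t):
    """Shrink v toward zero by t; values within t of zero become exactly 0."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def fit_sparse_logistic(X, y, lam=0.1, lr=0.5, steps=2000):
    """Proximal gradient descent for L1-penalized logistic regression
    (no intercept, to keep the sketch short)."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(steps):
        grad = [0.0] * d
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j in range(d):
                grad[j] += err * xi[j] / n
        # Gradient step on the likelihood, then shrinkage from the penalty.
        w = [soft_threshold(wj - lr * gj, lr * lam) for wj, gj in zip(w, grad)]
    return w

# Feature 0 determines the label; feature 1 is uninformative by construction.
X = [(1, 1), (1, -1), (-1, 1), (-1, -1)]
y = [1, 1, 0, 0]
w = fit_sparse_logistic(X, y)
print(w)
```

The uninformative feature ends with a weight of exactly zero, so it can be dropped at prediction time without any separate feature-selection step.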

77
Bayesian Methods: Sample Results

We have implemented several efficient variants,
e.g., probit, informative priors.
Publicly released software: over 1000 downloads
Compared to Zhang and Oles, our implementation:
Eliminates ad hoc feature selection
Often uses less than 1% of the features at
prediction time
Is publicly available
Accuracy as good as the best results ever
published.
In sum, we have a sparseness-inducing Bayesian
approach that produces dramatically simpler
models with no loss in accuracy

78
Streaming Data Analysis

Motivated by need to make decisions about data
during an initial scan as data stream by
Recent development of theoretical CS algorithms
Algorithms motivated by intrusion detection,
transaction applications, time series
transactions

79
Streaming Text Data: Historic Data Analysis

The accumulation of text messages is massive over
time
A lot of streaming research is focused on
on-going or current analyses
It is a great challenge to use only summarized
historic data and see if a currently emerging
phenomenon had precursors occurring in the past
We are working on a novel architecture for
historic and posterior analyses via small
summaries (sketches)

80
Streaming Analysis Tool: CM Sketch

Theoretical: We have developed the CM Sketch that
uses (1/ε) log(1/δ) space to approximate data
distribution with error at most ε, and
probability of success at least 1-δ.
All other previously known sample or sketch
methods use space at least (1/ε^2).
CM Sketch is an order of magnitude better.
Practical: A few tens of KB give an accurate
summary of large data. Create summaries of data
that allow historic queries to find:
Heavy Hitters (Most Frequent Items)
Quantiles of a Distribution (Median, Percentiles
etc.)
Finding items with large changes
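
A compact Count-Min sketch is shown below as an illustration. The table dimensions follow the usual choice of width ceil(e/ε) and depth ceil(ln(1/δ)); using salted SHA-256 digests as the row hash functions is an assumption made for a short, deterministic example, in place of the pairwise-independent hash families used in the analysis. The stream contents are invented.

```python
import hashlib
from math import ceil, e, log

class CountMinSketch:
    """depth x width array of counters; each row has its own hash function."""
    def __init__(self, eps, delta):
        self.width = ceil(e / eps)
        self.depth = ceil(log(1 / delta))
        self.rows = [[0] * self.width for _ in range(self.depth)]

    def _columns(self, item):
        # One column per row, derived from a row-salted digest of the item.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row, col in self._columns(item):
            self.rows[row][col] += count

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum is the tightest bound.
        return min(self.rows[row][col] for row, col in self._columns(item))

stream = ["10.0.0.1"] * 500 + [f"10.0.1.{i}" for i in range(200)]
sketch = CountMinSketch(eps=0.01, delta=0.01)
for item in stream:
    sketch.add(item)
heavy = sketch.estimate("10.0.0.1")
print(sketch.width, sketch.depth, heavy)
```

Estimates never undercount, so thresholding them gives a candidate list of heavy hitters that cannot miss a truly frequent item.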

81
Outline

Bioterrorism Sensor Location
Port of Entry Inspection Algorithms
Monitoring Message Streams
Author Identification
Computational and Mathematical Epidemiology
Adverse Event/Disease Reporting/Surveillance/Analysis
Bioterrorism Working Group
Modeling Social Responses to Bioterrorism
Predicting Disease Outbreaks from Remote Sensing
and Media Data
Communication Security and Information Privacy

82
Large-scale Automated Author Identification
83
Statistical Analysis of Text

Statistical text analysis has a long history in
literary analysis and in solving disputed
authorship problems
First (?) is Thomas C. Mendenhall in 1887

84
Hamilton versus Madison: the Federalist Papers
Mosteller and Wallace (1963) used Naïve Bayes
with a Poisson and Negative Binomial model
Good predictive performance
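
The flavor of a Poisson naive Bayes comparison can be sketched as a log-likelihood ratio over word counts. Everything numeric here is made up for illustration: the per-1000-word rates for authors "A" and "B" are not the 1963 estimates, and the negative binomial variant is not shown.

```python
from math import lgamma, log

def poisson_log_pmf(count, mean):
    """Log of the Poisson probability of observing `count` with the given mean."""
    return count * log(mean) - mean - lgamma(count + 1)

def log_odds(counts, length, rates_a, rates_b):
    """Log-likelihood ratio (author A vs. B) for word counts in a text of
    `length` words, treating each word's count as an independent Poisson."""
    total = 0.0
    for word, count in counts.items():
        mean_a = rates_a[word] * length / 1000
        mean_b = rates_b[word] * length / 1000
        total += poisson_log_pmf(count, mean_a) - poisson_log_pmf(count, mean_b)
    return total

# Made-up rates per 1000 words, for illustration only.
rates_a = {"upon": 3.0, "whilst": 0.1, "on": 3.0}
rates_b = {"upon": 0.2, "whilst": 0.5, "on": 7.0}
disputed = {"upon": 0, "whilst": 1, "on": 14}
score = log_odds(disputed, length=2000, rates_a=rates_a, rates_b=rates_b)
print(round(score, 2))
```

A negative score favors author B under these invented rates; each word contributes additively, which is the "naive" independence assumption.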

85
Some Background

Identification technologies important for
homeland security and in the legal system
Author attribution for textual artifacts using
topic independent stylometric features has a
long history
Historical focus on small numbers of authors and
low-dimensional representations via function words

86
Author ID Project Objectives

Application of state-of-the-art statistical and
computing technologies to authorship attribution
Work with very high-dimensional document
representations
Focus on providing working solutions to
particular problems

87
Author ID Project Focus

Goal: Identification of Authors From Large
Collection of Objects
traditional disputed authorship (choose among k
known authors)
clustering of putative authors (e.g., internet
handles: termin8r, heyr, KaMaKaZie)
document pair analysis: Were two documents
written by the same author?
odd-man-out: Were these documents written by one
of this set of authors or by someone else?

88
Representation

Long tradition in stylometry that seeks a small
number of textual characteristics that
distinguish the texts of authors from one another
(Burrows, Holmes, Binongo, Hoover, Mosteller and
Wallace, McMenamin, Tweedie, etc.)
Typically use function words (a, with, as,
were, all, would, etc.) followed by PCA cluster
analysis
Function words aim to be topic-independent
Hoover (2003) shows that using all high-frequency
words does a better job than function words alone
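
A function-word representation can be sketched as a vector of relative frequencies. The vocabulary below is just the handful of examples listed above, the tokenizer is a simple regular expression, and the per-1000-token scaling is an illustrative choice; a real stylometric study would use a much longer word list.

```python
import re
from collections import Counter

FUNCTION_WORDS = ("a", "with", "as", "were", "all", "would")  # the examples above

def function_word_profile(text, vocabulary=FUNCTION_WORDS):
    """Relative frequency (per 1000 tokens) of each function word in `text`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1
    return [1000 * counts[w] / total for w in vocabulary]

profile = function_word_profile("All were as one, as all would be with a plan.")
print([round(v, 1) for v in profile])
```

Vectors like this one are what a dimension-reduction or clustering step would then compare across documents.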

89
Idiosyncratic Usage

Idiosyncratic usage is less formalized in the
literature (misspellings, repeated neologisms,
etc.) but apparently useful. For example,
Foster's unmasking of Klein as the author of
Primary Colors:
Klein and Anonymous loved unusual adjectives
ending in -y and -inous: cartoony, chunky,
crackly, dorky, snarly, ..., slimetudinous,
vertiginous, ...
Both Klein and Anonymous added letters to their
interjections: ahh, aww, naww.
Both Klein and Anonymous loved to coin words
beginning in hyper-, mega-, post-, quasi-, and
semi-, more than all others put together
Klein and Anonymous use "riffle" to mean "rifle"
or "rustle", a usage for which the OED provides no
instance in the past thousand years

90
Odd-Man Out

Were these documents written by one of this set
of authors or by someone else?
Training data contains documents by given set of
authors
Test data contains documents by some set of
authors including some not in original set
Bayesian hierarchical model incorporates prior
knowledge that model parameters for different
authors differ from each other
Initial success on small-scale simulated examples
Generalizations for more than one new author

91
Some Results

Created largest-ever (?) feature set including
function words, suffixes, POS tags, lengths,
spelling errors, common English errors,
grammatical errors, phrases, idiosyncratic usage,
ngrams, etc.
Extensive experiments for 1-of-K and
odd-man-out
New 1.2 million message Listserv corpus, 82,000
authors

92
Some Results - II

Developed general purpose feature
extraction software for author attribution
Bayesian Multinomial Regression Software extends
our highly scalable, sparse, BBR software (MMS
Project) to the multi-class case

93
Outline

Bioterrorism Sensor Location
Port of Entry Inspection Algorithms
Monitoring Message Streams
Author Identification
Computational and Mathematical Epidemiology
Adverse Event/Disease Reporting/Surveillance/Analysis
Bioterrorism Working Group
Modeling Social Responses to Bioterrorism
Predicting Disease Outbreaks from Remote Sensing
and Media Data
Communication Security and Information Privacy

94
Special Focus on Computational and Mathematical
Epidemiology
smallpox
95
Components of a Special Focus

Working Groups
Tutorials
Workshops
Visitor Programs
Graduate Student Programs
Postdoc Programs
Dissemination

96
A Sampling of Working Groups

WGs on Large Data Sets
Adverse Event/Disease Reporting, Surveillance, and
Analysis
Data Mining and Epidemiology
WGs on Analogies between Computers and Humans
Analogies between Computer Viruses/Immune Systems
and Human Viruses/Immune Systems
Distributed Computing, Social Networks, and
Disease Spread Processes

97
WGs on Methods/Tools of Theoretical CS

Phylogenetic Trees and Rapidly Evolving Diseases
Order-Theoretic Aspects of Epidemiology
WGs on Computational Methods for Analyzing Large
Models for Spread/Control of Disease
Spatio-temporal and Network Modeling of Diseases
Methodologies for Comparing Vaccination
Strategies

98
WGs on Mathematical Sciences Methodologies

Mathematical Models and Defense Against
Bioterrorism
Predictive Methodologies for Infectious Diseases
Statistical, Mathematical, and Modeling Issues in
the Analysis of Marine Diseases

99
Workshops on Modeling of Infectious Diseases
A Sampling of Workshops

The Pathogenesis of Infectious Diseases
Models/Methodological Problems of Botanical
Epidemiology
WS on Modeling of Non-Infectious Diseases
Disease Clusters

100
Workshops on Evolution and Epidemiology

Genetics and Evolution of Pathogens
The Epidemiology and Evolution of Influenza
The Evolution and Control of Drug Resistance
Models of Co-Evolution of Hosts and Pathogens

101
Workshops on Methodological Issues

Capture-recapture Models in Epidemiology
Spatial Epidemiology and Geographic Information
Systems
Ecologic Inference
Combinatorial Group Testing

102
Outline

Bioterrorism Sensor Location
Port of Entry Inspection Algorithms
Monitoring Message Streams
Author Identification
Computational and Mathematical Epidemiology
Adverse Event/Disease Reporting/Surveillance/Analysis
Bioterrorism Working Group
Modeling Social Responses to Bioterrorism
Predicting Disease Outbreaks from Remote Sensing
and Media Data
Communication Security and Information Privacy

103
The DIMACS Working Group on Adverse Event/Disease
Reporting, Surveillance, and Analysis
104
Working Group on Adverse Event/Disease Reporting,
Surveillance, and Analysis

Health surveillance a core activity in public
health
Concerns about bioterrorism have attracted
attention to new surveillance methods
OTC drug sales
Subway worker absenteeism
Ambulance dispatches
Spawns need for novel statistical methods for
surveillance of multiple data streams.
WG coordinated closely with National Syndromic
Surveillance Conferences

105
New Data Types for Public Health Surveillance

Managed care patient encounter data
Pre-diagnostic/chief complaint (text data)
Over-the-counter sales transactions
Drug store
Grocery store
911-emergency calls
Ambulance dispatch data
Absenteeism data
ED discharge summaries
Prescription/pharmaceuticals
Adverse event reports

106
Farzad Mostashari
107
New Analytic Methods and Approaches

Spatial-temporal scan statistics
Statistical process control (SPC)
Bayesian applications
Market-basket association analysis
Text mining
Rule-based surveillance
Change-point techniques
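
As one illustration of statistical process control applied to a single surveillance stream, a one-sided CUSUM chart on daily counts could be sketched as follows. The baseline mean, the allowance k, the threshold h, and the sample counts are hypothetical values chosen for illustration; they are not taken from the working group's data.

```python
def cusum_alarms(counts, baseline_mean, k=1.0, h=4.0):
    """Return the indices of days on which the upper CUSUM statistic exceeds h.

    k is the allowance (slack) subtracted from each deviation and h is the
    alarm threshold. Both are assumed tuning values.
    """
    s = 0.0
    alarms = []
    for day, x in enumerate(counts):
        # Accumulate only the excess above baseline_mean + k; never go below 0.
        s = max(0.0, s + (x - baseline_mean - k))
        if s > h:
            alarms.append(day)
            s = 0.0  # restart the chart after signalling
    return alarms


# Hypothetical daily over-the-counter sales counts with a rise on days 5 to 7.
daily_otc_sales = [10, 11, 9, 10, 12, 15, 16, 17, 10, 9]
print(cusum_alarms(daily_otc_sales, baseline_mean=10.0))
```

Monitoring several streams at once (sales, absenteeism, dispatches) would need an additional step to combine or correct across charts, which this sketch does not attempt.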

108
Subgroup on Privacy and Confidentiality of Health
Data

Privacy concerns are a major stumbling block to
public health surveillance, in particular
bioterrorism surveillance.
Challenge: produce anonymous data specific enough
for research.
Exploring ways to remove identifiers (Social Security
number, telephone number, zip code) from data sets.
Exploring ways to aggregate, remove information
from data sets.
Partnerships with cryptographers
Exploring methods of combinatorial optimization
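
A minimal sketch of the two ideas above (removing identifiers, then aggregating and suppressing) might look like this. The field names, the three-digit zip generalization, and the minimum group size k are assumptions made for illustration; the subgroup's actual methods are not described in the slides.

```python
from collections import Counter

# Hypothetical names for fields that identify a person directly.
DIRECT_IDENTIFIERS = {"ssn", "phone"}


def deidentify(records, k=2):
    """Drop direct identifiers, coarsen zip codes, and suppress small groups."""
    generalized = []
    for rec in records:
        out = {f: v for f, v in rec.items() if f not in DIRECT_IDENTIFIERS}
        # Aggregate: keep only the first three digits of the zip code.
        out["zip"] = rec["zip"][:3] + "**"
        generalized.append(out)

    # Suppress: remove records whose (zip, syndrome) group has fewer than k members.
    sizes = Counter((r["zip"], r["syndrome"]) for r in generalized)
    return [r for r in generalized if sizes[(r["zip"], r["syndrome"])] >= k]


# Made-up records for illustration only.
records = [
    {"ssn": "000-00-0001", "phone": "555-0100", "zip": "08854", "syndrome": "respiratory"},
    {"ssn": "000-00-0002", "phone": "555-0101", "zip": "08855", "syndrome": "respiratory"},
    {"ssn": "000-00-0003", "phone": "555-0102", "zip": "10001", "syndrome": "rash"},
]
for row in deidentify(records):
    print(row)
```

The tension named in the challenge shows up directly: a larger k or a coarser zip code gives more anonymity but leaves less detail for research.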

109
Outline

Bioterrorism Sensor Location
Port of Entry Inspection Algorithms
Monitoring Message Streams
Author Identification
Computational and Mathematical Epidemiology
Adverse Event/Disease Reporting/Surveillance/Analysis
Bioterrorism Working Group
Modeling Social Responses to Bioterrorism
Predicting Disease Outbreaks from Remote Sensing
and Media Data
Communication Security and Information Privacy

110
Bioterrorism Working Group
anthrax
111
Bioterrorism Working Group

Biosurveillance
Evolution
Modeling Bioterror Response Logistics
Computer Science Challenges
Agroterrorism

112
Modeling Bioterror Response Logistics

Exploring Discrete Optimization/Queueing
size of stockpiles of vaccines
allocation of medications
analysis of bottlenecks in treatment facilities
transportation schedules

1947 smallpox vaccination queue, NYC
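
To make the bottleneck analysis on this slide concrete, one textbook starting point is the M/M/c queue: Poisson arrivals, exponential service times, and c identical vaccination stations. The sketch below computes the mean time a person waits in line using the Erlang C formula. The choice of model and all rates are assumptions for illustration, not the working group's model.

```python
from math import factorial


def mmc_mean_wait(arrival_rate, service_rate, servers):
    """Mean time spent waiting in line for an M/M/c queue.

    arrival_rate: people arriving per hour
    service_rate: people one station can serve per hour
    servers: number of stations
    Returns float('inf') when the facility cannot keep up with arrivals.
    """
    offered_load = arrival_rate / service_rate
    utilization = offered_load / servers
    if utilization >= 1.0:
        return float("inf")
    # Erlang C: probability that an arriving person has to wait.
    tail = offered_load**servers / (factorial(servers) * (1.0 - utilization))
    head = sum(offered_load**n / factorial(n) for n in range(servers))
    p_wait = tail / (head + tail)
    return p_wait / (servers * service_rate - arrival_rate)


# Hypothetical clinic: 100 arrivals per hour, each station serves 12 per hour.
for stations in (8, 9, 10, 12):
    print(stations, mmc_mean_wait(100.0, 12.0, stations))
```

With these made-up rates, eight stations are overloaded (the function returns infinity), which is the kind of bottleneck the analysis is meant to expose.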
113
Agroterrorism

Subgroup just starting
Interest in plant diseases
Partnership with the National Plant Diagnostic
Network
Emphasis on Data Mining and Epidemiology

114
Outline

Bioterrorism Sensor Location
Port of Entry Inspection Algorithms
Monitoring Message Streams
Author Identification
Computational and Mathematical Epidemiology
Adverse Event/Disease Reporting/Surveillance/Analysis
Bioterrorism Working Group
Modeling Social Responses to Bioterrorism
Predicting Disease Outbreaks from Remote Sensing
and Media Data
Communication Security and Information Privacy

115
Working Group on Modeling Social Responses to
Bioterrorism

Models of the spread of infectious disease
commonly assume passive bystanders and rational
actors who will comply with health authorities.
It is not clear how well this assumption applies
to situations like a bioterrorist attack using
smallpox or plague.

116
Working Group on Modeling Social Responses to
Bioterrorism

Interdisciplinary group is discussing
incorporating social behavior into models,
building models of public health decisionmaking,
risk communication.
Some Issues
Movement
Compliance
Rumor
Subcultural differences
Indirect economic effects
Social stigmata
Panic

How do you measure the indirect cost of an
attack?
117
Outline

Bioterrorism Sensor Location
Port of Entry Inspection Algorithms
Monitoring Message Streams
Author Identification
Computational and Mathematical Epidemiology
Adverse Event/Disease Reporting/Surveillance/Analysis
Bioterrorism Working Group
Modeling Social Responses to Bioterrorism
Predicting Disease Outbreaks from Remote Sensing
and Media Data
Communication Security and Information Privacy

118
Predicting Disease Outbreaks from Remote Sensing
and Media Data
Outbreaks of disease in other parts of the world
have the capacity to affect the security of the
US
Joint project with Imaging Science and
Information Systems Center at Georgetown University
Medical School (ISIS Center)
119
Predicting Disease Outbreaks from Remote Sensing
and Media Data

Recent work has shown that it is possible to
predict disease outbreaks in distant parts of the
world using remotely sensed satellite data.
SARS and heightened avian flu in the Pacific Rim
appeared following temperature anomalies in
China.
Could we have anticipated this
given enviro-climatic information?

120
Predicting Disease Outbreaks from Remote Sensing
and Media Data

Rift Valley Fever epidemic in 1997/8 in East
Africa occurred following heavy flooding related
to El Niño
Flooding in Venezuela in 1995 resulted in a
multi-pathogen outbreak.

121
Predicting Disease Outbreaks from Remote Sensing
and Media Data

Indications and warnings can alert US responders
to bioevents in faraway places.
Disease outbreaks that can cause social disruption
can be detected in open-source media reports even
when there is no official reporting.

122
Predicting Disease Outbreaks from Remote Sensing
and Media Data

A model developed at the ISIS Center at
Georgetown predicts social disruptions due to
disease based on keyword hit counts from
text-based sources (media reports).
DIMACS project goal: use the media model to develop
ways to predict disease-related social disruptions
from remotely sensed enviro-climatic data.
We will be using remote sensing data indicating
increased Normalized Difference Vegetation Index
(NDVI).
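
For reference, NDVI is computed per pixel from near-infrared (NIR) and red reflectance as (NIR - Red) / (NIR + Red). A small sketch of that calculation, together with a simple comparison against a baseline, is given below. The reflectance values, the baseline, and the zero-denominator convention are hypothetical choices for illustration; the project's actual data handling is not described in the slides.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index for one pixel.

    nir and red are surface reflectances in the near-infrared and red bands.
    Returns 0.0 when both bands are zero (an assumed convention).
    """
    total = nir + red
    if total == 0:
        return 0.0
    return (nir - red) / total


def ndvi_anomaly(current, baseline):
    """Difference between the current NDVI and a long-term baseline NDVI."""
    return current - baseline


# Hypothetical reflectances for one pixel and a made-up seasonal baseline.
value = ndvi(nir=0.5, red=0.1)
print(value, ndvi_anomaly(value, baseline=0.45))
```

A positive anomaly means greener-than-usual vegetation, which is what "increased NDVI" refers to.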

123
Predicting Disease Outbreaks from Remote Sensing
and Media Data

Project premise: we can use enviro-climatic
indices such as NDVI coupled with disease-related
social disruption predictors from media data
delayed by several months to validate the
enviro-climatic indicators as predictors.
Approach: machine learning
Project waiting to get started

124
Predicting Disease Outbreaks from Remote Sensing
and Media Data

The approach is similar to ones used by members
of the DIMACS team to estimate probability of a
match between remotely sensed signals and a
signature that has been observed before. This
work has been applied to face recognition and
explosive detection.

125
Outline

Bioterrorism Sensor Location
Port of Entry Inspection Algorithms
Monitoring Message Streams
Author Identification
Computational and Mathematical Epidemiology
Adverse Event/Disease Reporting/Surveillance/Analysis
Bioterrorism Working Group
Modeling Social Responses to Bioterrorism
Predicting Disease Outbreaks from Remote Sensing
and Media Data
Communication Security and Information Privacy

126
Special Focus on Communication Security and
Information Privacy
127
Special Focus on Communication Security and
Information Privacy

Working Groups
Privacy-Preserving Data Mining
Usable Privacy and Security Software
Data De-Identification, Combinatorial
Optimization, Graph Theory, and the Stat-OR
Interface
Intrusion Detection and Network Security
Management Systems

128
Special Focus on Communication Security and
Information Privacy

A Selection of Workshops
Software Security
Applied Cryptography and Network Security
Large-scale Internet Attacks
Mobile and Wireless Security
Security of Web Services and E-Commerce
Database Security: Query Authorization and
Information Inference

129
Working Group on Analogies between Computer
Viruses and Biological Viruses

Can ideas for defending against biological
viruses lead to ideas for defending against
computer viruses?
Concern about large gap between initial time of
attack and implementation of defensive strategies
Public health approach: once a virus has
infected a machine, it tries to connect to as
many computers as possible, as fast as possible.
A throttle limits the rate at which a computer can
connect to new computers.
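
The throttle idea can be sketched in a few lines: connections to recently contacted hosts pass immediately, while connections to new hosts are queued and released at a fixed rate. The class name, the working-set size, and the one-release-per-tick rate are assumptions for illustration, not a description of any specific deployed system.

```python
from collections import deque


class ConnectionThrottle:
    """Rate-limit connections to hosts this machine has not contacted recently."""

    def __init__(self, working_set_size=4):
        # Recently contacted hosts; the oldest entry is evicted when full.
        self.recent = deque(maxlen=working_set_size)
        # New hosts waiting for permission to be contacted.
        self.delayed = deque()

    def request(self, host):
        """Return True if the connection may proceed now, False if it is queued."""
        if host in self.recent:
            return True
        if host not in self.delayed:
            self.delayed.append(host)
        return False

    def tick(self):
        """Call once per time step; releases at most one queued host."""
        if not self.delayed:
            return None
        host = self.delayed.popleft()
        self.recent.append(host)
        return host


# Normal traffic revisits a few hosts; a worm-like burst targets many new ones.
throttle = ConnectionThrottle()
for target in ["a", "b", "c", "d", "e", "f"]:
    throttle.request(target)
print(throttle.tick(), len(throttle.delayed))
```

Under normal use the queue stays short, so the delay is barely noticeable. A rapidly growing queue is itself a sign of infection, and the limit slows the spread while defenses are put in place.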
