Title: The BioPortal Informatics Infrastructure for Syndromic Surveillance and Biodefense
1The BioPortal Informatics Infrastructure for
Syndromic Surveillance and Biodefense
- Hsinchun Chen, Ph.D.
- Artificial Intelligence Lab, U. of Arizona
- NSF BioPortal Center
2Medical Informatics: The computational,
algorithmic, database, and information-centric
approach to the study of medical and health
care problems.
From Medical Informatics to Infectious Disease Informatics
3Syndromic Surveillance
- A syndrome is a set of symptoms or conditions
that occur together and suggest the presence of a
certain disease or an increased chance of
developing the disease (from NIH/NLM)
- Syndromic surveillance is based on health-related
data that precede diagnosis and signal a
sufficient probability of a case or an outbreak
to warrant further public health response (from
CDC)
- Targeting investigation of potential cases
- Detecting outbreaks associated with bioterrorism
4Syndromic Surveillance Data Sources in Different
Stages of Developing a Disease -> Reaching
Situational Awareness
Reproduced from Mandl et al. (2004)
5Syndromic Surveillance Systems
- Generation 1, paper-based: paper, fax, telephone,
telephone directory, etc.
- Generation 2, email-based: email, Word/Access,
pager, cell phone, etc.
- Generation 3, database-driven: database,
standards, messaging, tabulation, GIS, graphs,
text, etc.
- Generation 4, search engine-based: real-time,
interactive, web services, visualized, GIS,
graphs, texts, sequences, etc.
6Syndromic Surveillance System Survey
7Sample Systems and Data Sources Utilized
8- BioPortal Overview, WNV, BOT
- (real-time information collection, sharing,
access, visualization, and analysis of Epi data)
- Architecture, Information Sharing, Information
Retrieval, Standards, Policy, Privacy, Data
Mining, Visualization, HCI
9Project Background
- In September 2002, representatives of 18
different agencies, including DOD, DOE, DOJ, DHS,
NIH/NLM, CDC, CIA, NSF, and NASA, were convened to
discuss disease surveillance
- AI Lab was chosen to be the technical integrator,
working with the states of New York and California
to develop a prototype system targeting West Nile
Virus and Botulism
10BioPortal Project Goals
- Demonstrate and assess the technical feasibility
and scalability of an infectious disease
information sharing (across species and
jurisdictions), alerting, and analysis framework.
- Develop and assess advanced data mining and
visualization techniques for infectious disease
data analysis and predictive modeling.
- Identify important technical and policy-related
challenges in developing a national infectious
disease information infrastructure.
11Information Sharing Infrastructure Design
[Diagram labels: Portal Data Store (MS SQL 2000);
Data Ingest Control Module (Cleansing /
Normalization); Info-Sharing Infrastructure;
Adaptors; SSL/RSA; XML/HL7 Network; PHINMS Network;
sources: NYSDOH, CADHS, New]
12Data Access Infrastructure Design
13Communications/Messaging
- Messaging Infrastructure installed and tested
- NYSDOH-UA: PHIN MS
- CADHS-UA: Regional message broker
- NWHC-UA: PHIN MS
- XML generation/conversion tested
- NY_DeadBird, NY_Alerts, NY_BotHuman, NY_WNVHuman,
NY_CaptiveAnimal, NY_Mosquito
- CA_BotHuman, CA_WNVHuman, CA_DeadBird,
CA_Chicken, CA_Mosquito
- USGS_Epizoo
- A scalable, flexible, light-weight messaging
framework!
- Easy to include new diseases, new jurisdictions,
and new techniques!
14Data Sharing Agreements - MOUs
- Agreement reached on data sharing principles
between partner entities
- Consortium agreement will allow new partners to
join without full renegotiation
- Current data sharing agreements and MOUs focus on
institutional level data access
- NYS template adapted to guide and regulate
individual user data access/security practice
- Two levels of data access: aggregate and detail
- A scalable, bottom-up, policy-guided information
sharing framework!
15Spatial-Temporal Visualization
- Integrates four visualization techniques
- GIS View
- Periodic Pattern View
- Timeline View
- Central Time Slider
- Visualizes the events in multiple dimensions to
identify hidden patterns
- Spatial
- Temporal
- Hotspot analysis
- Phylogenetic (planned)
16BioPortal Prototype Systems
17Outbreak Detection Hotspot Analysis
- A hotspot is a condition indicating some form of
clustering in a spatial and temporal distribution
(Rogerson & Sun, 2001; Theophilides et al., 2003;
Patil & Taillie, 2004; Zeng et al., 2004; Chang et
al., 2005)
- For WNV, localized clusters of dead birds
typically identify high-risk disease areas
(Gotham et al., 2001); automatic detection of
dead bird clusters can help predict disease
outbreaks and allocate prevention/control
resources effectively
18Retrospective Hotspot Analysis Problem Statement
19Risk-Adjusted Support Vector Clustering (RSVC)
[Diagram labels: Feature space; Minimum sphere;
Split into several clusters; High baseline density
makes two points far apart in feature space;
Estimate baseline density]
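The idea of adjusting for baseline density can be illustrated with a much simpler stand-in than RSVC: bin baseline records and new cases into grid cells, then flag cells where new cases are high relative to the baseline. The sketch below is not RSVC, SaTScan, or CrimeStat. The cell size, smoothing constant, thresholds, function names, and coordinates are all assumptions made for illustration.

```python
from collections import Counter

def grid_cell(point, cell_size):
    """Map an (x, y) coordinate to an integer grid cell."""
    x, y = point
    return (int(x // cell_size), int(y // cell_size))

def flag_hotspots(baseline, new_cases, cell_size=1.0, min_ratio=2.0, min_new=3):
    """Return grid cells whose new-case count is high relative to baseline.

    The ratio uses add-one smoothing so that cells with no baseline
    records do not cause a division by zero.
    """
    base_counts = Counter(grid_cell(p, cell_size) for p in baseline)
    new_counts = Counter(grid_cell(p, cell_size) for p in new_cases)
    hotspots = {}
    for cell, n_new in new_counts.items():
        ratio = n_new / (base_counts.get(cell, 0) + 1)
        if n_new >= min_new and ratio >= min_ratio:
            hotspots[cell] = ratio
    return hotspots

# Hypothetical coordinates (not taken from any surveillance dataset).
baseline = [(0.2, 0.3), (0.5, 0.5), (0.8, 0.1), (2.1, 2.2)]
new_cases = [(0.4, 0.4), (0.6, 0.2), (0.1, 0.9),
             (2.2, 2.3), (2.4, 2.1), (2.6, 2.8), (2.9, 2.5)]
print(flag_hotspots(baseline, new_cases))
```

In this toy input, cell (0, 0) has as many new cases as baseline records and is not flagged, while cell (2, 2) has four new cases against one baseline record and is flagged.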
20Study II NY WNV
- On May 26, 2002, the first dead bird with WNV was
found in NY
- Based on NY's test dataset
[Timeline figure: March 5, May 26, July 2; periods
labeled baseline and new cases; 140 records and
224 records]
21Dead Bird Hotspots Identified
25BioPortal HotSpot Analysis: RSVC, SaTScan, and
CrimeStat Integrated (first visual, real-time
hotspot analysis system for disease surveillance)
- West Nile virus in California
26Hotspot Analysis-Enabled STV
27- BioPortal Livestock
- (syndromic category)
- Information Sharing and Information Retrieval
28Kansas RSVP-A System
29Rapid Syndrome Validation Project Animal
(RSVP-A) System
- URL: http://clh.vet.ksu.edu
- Main function: allows vets to enter syndromic
observations and retrieve statistics bar charts
- A complete system with administrative functions
such as profile editing
- Provides 2 web-based interfaces
- Regular browser
- Mobile devices (WAP)
- Current users: 17 vets in 29 counties
- Projected users: 200 (Kansas), 10K (nationwide)
30BioPortal Integration Livestock
31Data Characteristics
- Time Period
- 7/16/2003 to 10/17/2005
- Covers 2 states / 29 counties in Kansas and New
Mexico
32Imported Attributes
- RSVP-A monitors 6 syndromes
- Non-neonatal diarrhea
- Neurologic / recumbent
- Unexpected deaths
- Weight loss/feed refusal
- Abortion/birth defect
- Erosive lesions
33Records
34- BioPortal FMD
- (phylogenetic tree and news)
- Information Extraction, Information Sharing, and
Data/Text Mining
35FMD global surveillance lessons identified
- Must understand risks, and the nature of changing
risks, in order to develop strategies for
prevention and mitigation on a global scale
- Must understand the global situation in order to
prepare locally
- United Kingdom FMD outbreak, 2001: 12B, 50-60
of 4M farm animals (cows, pigs, sheep)
slaughtered
36International FMD BioPortal
- Real-time web-based situational awareness of FMD
outbreaks worldwide through the establishment of
an international information technology system.
- FMDv characterization at the genomic level
integrated with associated epidemiological
information and modeling tools to forecast
national, regional and/or international spread
and the prospect of importation into the US and
the rest of North America.
- Web-based crisis management of resources:
facilities, personnel, diagnostics, and therapeutics.
37Preliminary Global FMD Dataset
- Provider: UC Davis FMD Lab
- Information sources: reference labs and OIE
- Coverage: 28 countries globally
- Time span: May 1905 to March 2005
- Dataset size: 30,000 records, of which 6,789
records are complete
- Host species: Cattle, Caprine, Ovine, Bovine,
Swine, NK, Elephant, Buffalo, Sheep, Camelidae,
Goat
38Global FMD Coverage in BioPortal
39FMD Migration Visualization using BioPortal
(cases in South Asia)
FMD Cases travel back and forth between countries
40BioPortal-Afghanistan
41International FMD News
- Provider: UC Davis FMD Lab
- Information sources: Google, Yahoo, and open
Internet sources
- Time span: Oct 4, 2004 to present (real-time
messaging under development)
- Data size: 460 events (6/21/05)
- Coverage: 51 countries
(Africa: 11, Asia: 16, Europe: 12, Americas: 12)
42Searching FMD News
- http://fmd.ucdavis.edu/
- Searchable by
- Date range
- Country
- Keyword
43Visualizing FMD News on BioPortal
44FMD Genetic Visualization
- Goal: Extend STV to incorporate a 3rd dimension,
phylogenetic distance
- Include a phylogenetic tree.
- Identify phylogenetic groups and color-code the
isolate points on the map.
- Leverage available NCBI tools such as BLAST.
- Proof of concept: SAT 2 & 3 analysis
- Data: 54 partial DNA sequence records in South
Africa received from UC Davis FMD Lab
(Bastos, A.D. et al. 2000, 2003)
- Date range: 1978-1998
- Countries covered: South Africa, Zimbabwe,
Zambia, Namibia, Botswana
45Sample FMD Sequence Records
Color-coded View (MEGA3)
Textual View of Gene Sequence
46Phylogenetic Tree of Sample FMD Data
Identify 6 groups within 2 major families (MEGA3
based on sequence similarity)
[Tree labels: Group1, Group2, Group3, Group4,
Group5, Group6]
47Genetic, Spatial, and Temporal Visualization of
FMD Data
Phylogenetic tree color coded
Isolates locations color coded
Isolates appearances in time
48FMD Time Sequence Analysis
First family: cases appeared throughout the period
Second family: cases existed before 1993 and
reappeared after 1997
49FMD Periodic Pattern Analysis
2nd family concentrated in Feb. while 1st family
spread evenly
50Locations of Family 1 records
Selected only groups 1, 2, and 3 and found a
spatial cluster
51Locations of Family 2 records
Sparse isolate locations
Selected only groups 4, 5, and 6
52FMD BioPortal activity
- Launched January 5, 2007
- 65 users from >15 countries
- Belgium, Brazil, Canada, France, Germany,
Italy, India, Iran, Netherlands, Pakistan,
Paraguay, South Africa, Sweden, U.S., U.K.
- Research institutes, diagnostic labs,
government and international agencies and
organizations, universities
53- BioPortal Arizona
- (chief complaint syndromic surveillance)
- Text Mining, Ontology
54Chief Complaints As a Data Source
- Chief complaints (CCs) are short free-text
phrases entered by triage practitioners
describing the reasons for patients' ER visits
- Examples: "lt foot pain" = left foot pain; "cp" =
chest pain; "sob" = shortness of breath; "so" =
should be "sob"; "poss uti" = possibly urinary
tract infection
- Advantages of using CCs for surveillance purposes
- Timeliness: diagnosis results are on average 6
hours slower than CCs
- Availability and low cost: most hospitals have
free-text CCs available in electronic form
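The chief complaint examples suggest a simple normalization step before any classification: expand known abbreviations token by token. The sketch below is a minimal illustration only. The lookup table holds just the abbreviations from the examples, and the function name `expand_cc` is hypothetical rather than part of BioPortal.

```python
# Abbreviations taken from the examples on this slide; a real table
# would be far larger and would also need to handle misspellings.
ABBREVIATIONS = {
    "lt": "left",
    "cp": "chest pain",
    "sob": "shortness of breath",
    "poss": "possibly",
    "uti": "urinary tract infection",
}

def expand_cc(chief_complaint):
    """Lower-case a chief complaint and expand known abbreviations."""
    tokens = chief_complaint.lower().split()
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in tokens)

for cc in ["lt foot pain", "CP", "sob", "poss uti"]:
    print(cc, "->", expand_cc(cc))
```

A plain dictionary lookup cannot repair typos such as "so" for "sob"; that would need a separate spelling-correction step.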
55Existing CC Classification Methods
56Syndromic Categories in Different Systems
57Overall System Design
Chief Complaints
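58A Stage 2 Example: CC Concepts -> Symptom Group
Concepts
[Diagram: CC concepts linked through the UMLS to
symptom group concepts; links labeled 4, 5, 6]
- coagulopathy, purpura, ecchymosis -> bleeding:
1/4 + 1/5 + 1/6 = 0.62
- Blood in urine, ureteral stone -> other: 1/5 = 0.2
- coma -> coma: 1/5 = 0.2; dead: 1/5 = 0.2
- pass out -> altered_mental_status: 1/5 = 0.2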
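The arithmetic in this example (1/4 + 1/5 + 1/6 = 0.62) suggests that each CC concept contributes the reciprocal of the number on its link to a symptom group, and that contributions are summed per group. The sketch below reproduces that arithmetic. It assumes the numbers are distances in the UMLS, and the data layout and the function name `symptom_group_scores` are hypothetical.

```python
from collections import defaultdict

def symptom_group_scores(links):
    """Sum 1/distance over all (cc_concept, symptom_group, distance) links."""
    scores = defaultdict(float)
    for _cc_concept, group, distance in links:
        scores[group] += 1.0 / distance
    return {group: round(score, 2) for group, score in scores.items()}

# Links read off the slide; distances are assumed to be UMLS path lengths.
links = [
    ("coagulopathy", "bleeding", 4),
    ("purpura", "bleeding", 5),
    ("ecchymosis", "bleeding", 6),
    ("coma", "coma", 5),
    ("coma", "dead", 5),
    ("pass out", "altered_mental_status", 5),
]
print(symptom_group_scores(links))
```

Closer concepts contribute more, so several nearby CC concepts pointing at the same symptom group reinforce each other.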
59System Benchmarks
- Both RODS (Tsui et al., 2003) and EARS (CDC,
2006; Hutwagner et al., 2003) serve as the
benchmarks
- RODS uses a supervised learning method
- EARS uses a rule-based method
- Both systems are available for testing
- Performance criteria are calculated by comparing
system outputs with the gold standard
60Syndromic Categories in Different Systems
61Expert Agreement by Syndromic Category
- Syndromic categories with kappa lower than 0.7
and the "Other" category were excluded from the
evaluation
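The slide does not say which kappa statistic was used or how many experts were involved. As a minimal sketch, the code below assumes Cohen's kappa for two raters assigning a binary label per record. The ratings are hypothetical.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters corrected for chance."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1.0 - expected)

# Hypothetical ratings for one syndromic category (1 = belongs, 0 = does not).
expert_1 = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
expert_2 = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
kappa = cohens_kappa(expert_1, expert_2)
print(round(kappa, 3), "keep" if kappa >= 0.7 else "exclude")
```

Here the raters agree on 8 of 10 records, but chance agreement is 0.52, so kappa is about 0.58 and the category would fall below the 0.7 cutoff.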
62Performance Criteria
- Sensitivity (recall) = TP / (TP + FN)
- Specificity (negative recall) = TN / (FP + TN)
- Precision = TP / (TP + FP)
- F-measure = 2 × Precision × Recall / (Precision
+ Recall)
- In the context of syndromic surveillance,
sensitivity is more important than precision and
specificity (Chapman, 2005). Thus, the F2-measure
is used
- F2-measure weights recall twice as much as
precision.
- F2-measure = (1 + 2^2) × Precision × Recall /
(2^2 × Precision + Recall)
- Note: TP = True Positive, TN = True Negative,
FP = False Positive, FN = False Negative
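The performance criteria can be computed directly from the four confusion-matrix counts. The sketch below assumes the standard F-beta definition, with beta = 1 for the F-measure and beta = 2 for the F2-measure. The counts are hypothetical and the function names are illustrative.

```python
def f_beta(precision, recall, beta):
    """F-beta score; beta > 1 weights recall more heavily than precision."""
    b2 = beta * beta
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom else 0.0

def classification_metrics(tp, fp, tn, fn):
    """Return the performance criteria for one syndromic category."""
    sensitivity = tp / (tp + fn)   # recall
    specificity = tn / (fp + tn)   # negative recall
    precision = tp / (tp + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f_measure": f_beta(precision, sensitivity, beta=1),
        "f2_measure": f_beta(precision, sensitivity, beta=2),
    }

# Hypothetical counts for one category.
metrics = classification_metrics(tp=80, fp=40, tn=860, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With recall (0.80) higher than precision (0.67), the F2-measure comes out above the F-measure, which reflects the extra weight placed on sensitivity.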
63Comparing BioPortal to RODS
p-value < 0.1; p-value < 0.05; p-value < 0.01.
Statistical test is based on 2,500 bootstrap
samples.
64Comparing BioPortal to EARS
p-value < 0.1; p-value < 0.05; p-value < 0.01.
Statistical test is based on 2,500 bootstrap
samples.
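The slides do not describe the exact bootstrap procedure behind these p-values. One plausible reading is a paired bootstrap: resample records with replacement 2,500 times, score both systems on each resample, and count how often one system fails to beat the other. The sketch below follows that assumption and uses the F2-measure as the score. The labels, seed, and function names are hypothetical.

```python
import random

def f2_score(gold, pred):
    """F2 written in counts: 5*TP / (5*TP + 4*FN + FP)."""
    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)
    denom = 5 * tp + 4 * fn + fp
    return 5 * tp / denom if denom else 0.0

def bootstrap_p_value(gold, pred_a, pred_b, n_boot=2500, seed=0):
    """One-sided p-value: share of resamples where A does not beat B."""
    rng = random.Random(seed)
    n = len(gold)
    not_better = 0
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        g = [gold[i] for i in idx]
        a = f2_score(g, [pred_a[i] for i in idx])
        b = f2_score(g, [pred_b[i] for i in idx])
        if a <= b:
            not_better += 1
    return not_better / n_boot

# Hypothetical gold standard and system outputs (1 = in category).
gold = [1] * 10 + [0] * 10
system_a = gold[:]                 # finds every positive record
system_b = [1] * 3 + [0] * 17      # misses most positive records
p_value = bootstrap_p_value(gold, system_a, system_b)
print(p_value)
```

Resampling the same record indices for both systems keeps the comparison paired, so differences in the resampled scores reflect the systems rather than the sample.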
65- BioPortal Taiwan (international CC syndromic
surveillance)
- Data/Text Mining, Information Retrieval and
Visualization
66Multi-lingual Chief Complaints: Chinese Example
- Data Characteristics
- Mixed expressions in both Chinese and English
- ????FEVER???????????(?)
- ??,?????A/W,????,????
- 18% of CC records from NTU Med. Center contain
Chinese expressions.
- Some hospitals have 100% of CC records in Chinese
(for example, ??????)
- Misspellings and typographic errors are not
serious
67Prevalence of Chinese Chief Complaints
- Medical Center: ?????? (100%), ???? (18%), ??????
(8%)
- Regional Hospital: ???? (99%), ??????? (87%),
?????? (72%), ?????? (50%), etc.
- Local Hospital: ?????? (100%), ???? (93%), ??????
(88%), etc.
68The Role of Chinese Chief Complaints in Syndromic
Surveillance Systems
- The most important role of Chinese words/phrases
is for describing symptom-related information
- Example: ?????? ???????? ????? ??
- Chinese Punctuation
- Named Entity
- Example: Diarrhea SINCE THIS MORNING. Group
poisoning. Having dinner at ??? restaurant.
69Chinese CC Preprocessing System Design
[Flow diagram]
- Input: Raw Chinese CCs (Chinese Chief Complaints)
- Stage 0.1: Separate Chinese and English
Expressions (outputs: English Expressions, Chinese
Expressions)
- Stage 0.2: Chinese Phrase Segmentation (output:
Segmented Chinese Phrases)
- Stage 0.3: Chinese Phrase Translation (output:
Translated Chinese Phrases)
- Resources: Mutual Info., Chinese Medical Phrases,
Common Chinese Phrases, Chinese to English
Dictionary
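The first preprocessing step, separating Chinese from English expressions in a mixed chief complaint, can be sketched with a regular expression over a Unicode range. This is a minimal illustration, not the project's implementation. It assumes the CJK Unified Ideographs block (U+4E00 to U+9FFF) is enough to identify Chinese text, and the sample complaint is hypothetical.

```python
import re

# Runs of characters in the CJK Unified Ideographs block. This is an
# assumption; full-width punctuation and rarer characters fall outside it.
CHINESE_RUN = re.compile(r"[\u4e00-\u9fff]+")

def separate_expressions(chief_complaint):
    """Split a mixed chief complaint into English text and Chinese phrases."""
    chinese = CHINESE_RUN.findall(chief_complaint)
    english = " ".join(CHINESE_RUN.sub(" ", chief_complaint).split())
    return english, chinese

# Hypothetical mixed-language chief complaint.
english, chinese = separate_expressions("FEVER 發燒 since this morning 腹瀉")
print(english)
print(chinese)
```

The English part can go to the existing English pipeline unchanged, while the Chinese phrases move on to segmentation and translation.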
70Chinese Phrases Segmentation
- Technology Used
- MI (Mutual Information)
- Test bed
- 1978 records from hospital A
- 18% of records have Chinese expressions
- Results
- 726 phrases extracted
- 370 (51%) are medical related
- Example
- Input ????, ???????,???
- Output ?-?-?? , ?-??-???-? , ???
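The slide names mutual information (MI) as the segmentation technique but gives no formula. As a simplified stand-in, the sketch below uses pointwise mutual information between adjacent characters: characters that co-occur more often than chance are kept together, and a boundary is placed where the score falls below a threshold. The toy corpus, the threshold, and the function names are assumptions, and the MI variant actually used may differ.

```python
import math
from collections import Counter

def train_pmi(corpus):
    """Pointwise mutual information for every adjacent character pair."""
    chars = Counter()
    bigrams = Counter()
    for text in corpus:
        chars.update(text)
        bigrams.update(text[i:i + 2] for i in range(len(text) - 1))
    n_chars = sum(chars.values())
    n_bigrams = sum(bigrams.values())
    pmi = {}
    for bigram, count in bigrams.items():
        p_ab = count / n_bigrams
        p_a = chars[bigram[0]] / n_chars
        p_b = chars[bigram[1]] / n_chars
        pmi[bigram] = math.log(p_ab / (p_a * p_b))
    return pmi

def segment(text, pmi, threshold=1.0):
    """Split text wherever adjacent characters have PMI below threshold."""
    if not text:
        return []
    phrases = [text[0]]
    for prev, cur in zip(text, text[1:]):
        if pmi.get(prev + cur, float("-inf")) >= threshold:
            phrases[-1] += cur      # strongly associated: same phrase
        else:
            phrases.append(cur)     # weakly associated: new phrase
    return phrases

# Hypothetical toy corpus: "headache", "fever", and both written together.
corpus = ["頭痛", "頭痛", "發燒", "發燒", "頭痛發燒"]
pmi = train_pmi(corpus)
print(segment("頭痛發燒", pmi))
```

Pairs that were never seen in the corpus are treated as boundaries, which is why a larger corpus of raw chief complaints matters for this approach.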
71Chinese Phrases Translation
- Recruited 3 physicians to help translate the 370
extracted Chinese terms
- 280 (76%) terms have consistent translations
- Example
- Input
- ?-?-?? , ?-??-???-? , ???
- Intermediate output
- N/A-N/A-fighting , N/A-N/A-head injury-N/A ,
epistaxis
- Final result
- fighting , head injury , epistaxis
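The translation example shows a two-step pattern: each segmented phrase is looked up in a dictionary, with untranslated segments marked N/A, and the N/A entries are then dropped to form the final result. A minimal sketch of that pattern follows. The Chinese dictionary entries and segments are hypothetical stand-ins, since the original characters are not legible in this transcript.

```python
# Hypothetical dictionary entries; the project's Chinese-to-English
# dictionary is not reproduced in the slides.
DICTIONARY = {
    "打架": "fighting",
    "頭部外傷": "head injury",
    "流鼻血": "epistaxis",
}

def translate_segments(segments):
    """Look up each segment; segments without a translation become 'N/A'."""
    return [DICTIONARY.get(seg, "N/A") for seg in segments]

def final_translation(segments):
    """Drop untranslated segments and join what remains."""
    return " ".join(t for t in translate_segments(segments) if t != "N/A")

# Hypothetical segmented phrase: two untranslated segments, one known term.
segments = ["與", "人", "打架"]
print("-".join(translate_segments(segments)))  # intermediate output
print(final_translation(segments))             # final result
```

Keeping the intermediate N/A markers makes it easy to count which records received a complete, partial, or no translation.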
72Result Self Validation
- Test the 280 translations against 1978 chief
complaints from hospital A
- 1610 (82%) records are in English
- 368 (18%) records contain Chinese
- 36% contain trivial info.
- E.g. r/o septic shock ????
- 64% contain non-trivial info.
- E.g. poor intake and ????
- 67 has complete translation
- 2 has partial translation
- 20 does not have translation
73Group by Hospital
74Group by Syndrome Classification
75- BioPortal Taiwan
- (SARS, social network visualization)
- Information Extraction, Data Mining, and
Visualization
76The Taiwan SARS Data
- Collected by the Graduate Institute of
Epidemiology at National Taiwan University. The
data contain the interview records of 1582
suspected cases, including 479 confirmed SARS
cases, of which the sources of infection for 89
cases are still unknown.
- Three kinds of information
- Symptom information
- Symptoms
- The onset dates of the symptoms
- Contact information
- Family members
- Roommates
- Classmates/Colleagues
- The two-day contact history before the onset of
symptoms
- Visiting information
- Foreign countries
- Hospital/High risk areas
- Transportation
77SARS Spreading Network Based On Family Members
The Family of Nurse ?
78SARS Spreading Network Based On Family Members
and Visiting Info of High-Risk Areas
The Family of Nurse ?
The Bridges Formed from the Visit in Peace Hosp.
at 4/21
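The effect described here, where visits to a high-risk area form bridges between otherwise separate family clusters, can be illustrated with a small contact graph. The sketch below links cases that share a family or a visit and counts connected clusters before and after adding the visit. The case identifiers and groupings are hypothetical and are not taken from the Taiwan SARS data.

```python
from collections import defaultdict
from itertools import combinations

def build_network(groups):
    """Link every pair of cases that share a group (family, visit, ...)."""
    graph = defaultdict(set)
    for members in groups:
        for member in members:
            graph[member]  # register the case even if it has no links
        for a, b in combinations(members, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph

def count_components(graph):
    """Count connected clusters with an iterative depth-first search."""
    seen = set()
    components = 0
    for start in graph:
        if start in seen:
            continue
        components += 1
        stack = [start]
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            stack.extend(graph[node] - seen)
    return components

# Hypothetical cases: three families, plus one shared hospital visit.
families = [["case1", "case2", "case3"], ["case4", "case5"], ["case6"]]
hospital_visits = [["case2", "case4"]]

family_only = count_components(build_network(families))
with_visits = count_components(build_network(families + hospital_visits))
print(family_only, with_visits)
```

In this toy input the shared visit merges two family clusters into one, which is the kind of bridge the visualization is meant to expose.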
79- BioPortal
- Towards building integrated, real-time situation
awareness for syndromic surveillance and
biodefense
80BioPortal Information
- Hsinchun Chen, hchen@eller.arizona.edu
- AI Lab, http://ai.arizona.edu
- BioPortal Demo
- http://bioportal.org