Title: Knowledge Discovery for Cancer Informatics and Public Health Informatics: Techniques, Case Studies, and Lessons Learned Hsinchun Chen Director, Artificial Intelligence Lab, University of Arizona Acknowledgements: NIH, NSF, NCI, ACC, NTU
1Knowledge Discovery for Cancer Informatics and Public Health Informatics: Techniques, Case Studies, and Lessons Learned
Hsinchun Chen
Director, Artificial Intelligence Lab, University of Arizona
Acknowledgements: NIH, NSF, NCI, ACC, NTU
2Artificial Intelligence Lab Research
- UA MIS Department (4th ranked)
  - 30 research scientists
  - $25M funding since 1990
  - Chen: IEEE and AAAS Fellow
- Intelligence and Security Informatics research
  - NSF, DOJ, CIA
  - COPLINK system deployed in 1600 agencies
  - Dark Web for countering terrorism
- Biomedical Informatics research
  - NLM, NCI
  - Chen: NLM Scientific Counselor
  - HelpfulMed, GeneScene system, and BioPortal
3A Little Promotion
4GeneScene Cancer Pathway Knowledge Extraction
And Visualization
5GeneScene Team
Dr. Gondy Leroy (Claremont), Byron Marshall (Oregon SU), Dan McDonald (Utah SU), Zan Huang (Penn SU), Jiexun Li (Drexel U), Chun-Ju Tseng, Shauna Eggers, Dr. Jesse Martinez, Dr. George Watts, Dr. Bernie Futscher (AZCC), Dr. Hua Su, Dr. Karin Quinones
- Text Mining and Knowledge Integration
- Data Mining
- Visualization and System Development
- Domain Experts
- User Studies and Evaluation
6Outline
- GeneScene overview
- Research directions
- Text mining
- Knowledge integration
- Data mining
- GeneScene Visualizer
7Knowledge Explosion: PubMed
- Average number of new citations appearing in PubMed
  - In 1980: 746/day
  - In 2004: 1,640/day
9GeneScene Overview
- Motivation
  - Relieving information overload in biomedical research
  - Automating processes of knowledge extraction and data analysis
- Focus: genetic regulatory pathways
  - Dissection of regulatory networks is crucial for a thorough understanding of biological processes
  - Complexity of biological networks raises challenges for computational research
- Research goals
  - To develop novel Natural Language Processing (NLP) techniques to support information extraction
  - To develop machine learning and data mining techniques to support high-throughput data analysis
  - To create an integrated framework for pathway-related knowledge representation and visualization
  - Ultimately, to provide biomedical researchers with a pathway-related knowledge discovery and integration platform
- Funding: NIH/NLM, 1 R33 LM07299-01 (May 2002 to April 2006)
10GeneScene Components and Scope
- Ontology-enhanced Knowledge Integration
  - To aggregate and consolidate pathway relations extracted from literature and to integrate them with existing knowledge sources using biomedical ontologies
- Text Mining of Biomedical Literature
  - To automatically extract regulatory relations between biological entities from free text
- Data Mining for Genomic Studies
  - To extract regulatory pathway information based on genomic data and other resources
- Visualization of Regulatory Pathways
  - To facilitate accessing, understanding, and analysis of extracted pathway knowledge
11Text Mining
- Extract all pathway-relevant relations from text
  - Relations with gene or protein names on either end of the relation are extracted
- Two types of relations in GeneScene
  - Co-occurrence Relations (Concept Space): relations between terms that often co-occur in a set of abstracts
    - → HelpfulMed (Cancer Space)
  - Linguistic Relations: precise, semantically rich relations from each abstract
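The co-occurrence idea above can be sketched in a few lines. This is a minimal illustration that uses raw pair counts only; it is not the weighting scheme used by Concept Space, and the term lists, the function name `cooccurrence_counts`, and the threshold of 2 are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(abstract_terms):
    """Count how many abstracts each unordered term pair appears in together."""
    counts = Counter()
    for terms in abstract_terms:
        # Deduplicate within an abstract so each abstract counts a pair once.
        for pair in combinations(sorted(set(terms)), 2):
            counts[pair] += 1
    return counts

# Hypothetical term lists, one per abstract (illustrative only).
abstracts = [
    ["p53", "apoptosis", "MDM2"],
    ["p53", "apoptosis"],
    ["E2F1", "p53", "apoptosis"],
]
counts = cooccurrence_counts(abstracts)

# Keep pairs that co-occur in at least 2 abstracts (arbitrary cut-off).
frequent = {pair: n for pair, n in counts.items() if n >= 2}
print(frequent)
```

Sorting the terms before pairing gives each pair one canonical key, so ("apoptosis", "p53") and ("p53", "apoptosis") are never counted separately.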
14HelpfulMED Search of Medical Websites
15HelpfulMED search of Evidence-based Databases
16Consulting HelpfulMED Cancer Space (Thesaurus)
17Browsing HelpfulMED Cancer Map
19Chinese Medical Intelligence (CMI)
- Goal
  - Providing medical and health information services to both researchers and the public.
- Content
  - 350,000 high-quality medical-related webpages collected from mainland China, Hong Kong, and Taiwan.
  - Meta-search of 3 large general Chinese search engines.
- Key Features
  - Built-in Simplified/Traditional Chinese encoding conversion
  - Dynamic summarization for both Simplified and Traditional Chinese
  - Automatic categorization
  - Visualization using SOM
20Chinese folder display
Simplified Chinese summary
Chinese visualization with SOM
Traditional Chinese summary
21GeneScene Full Parser: Arizona Relation Parser (ARP)
- Syntax and semantics are combined in a hybrid parsing grammar, as opposed to the pipelined approach
- Introduces over 150 new word classes, while retaining many of the original syntactic word classes (e.g., noun, verb)
- The new word classes have semantic and lexical properties
- Semantic and syntactic properties of the new tags are not explicitly detailed in the dictionary, but rather determined by the parsing rules that define them
- Rules that apply to the tags reveal the syntactic and semantic roles of the tags
22ARP Architecture
Architecture diagram for the parser, consisting of three main stages: tagging, parsing, and relation extraction
23Problem: Gene Pathway
- Title: Key roles for E2F1 in signaling p53-dependent apoptosis and in cell division within developing tumors.
- Abstract: Apoptosis induced by the p53 tumor suppressor can attenuate cancer growth in preclinical animal models. Inactivation of the pRb proteins in mouse brain epithelium by the T121 oncogene induces aberrant proliferation and p53-dependent apoptosis. p53 inactivation causes aggressive tumor growth due to an 85% reduction in apoptosis. Here, we show that E2F1 signals p53-dependent apoptosis since E2F1 deficiency causes an 80% apoptosis reduction. E2F1 acts upstream of p53 since transcriptional activation of p53 target genes is also impaired. Yet, E2F1 deficiency does not accelerate tumor growth. Unlike normal cells, tumor cell proliferation is impaired without E2F1, counterbalancing the effect of apoptosis reduction. These studies may explain the apparent paradox that E2F1 can act as both an oncogene and a tumor suppressor in experimental systems.
Expert errs and corrects
Final graph
24Prepositions OF/BY/IN
25Example Map (one abstract)
26Arizona Relation Parser Output
(Entity 1, Negation, Connector, and Entity 2 together form the Resulting Relation.)
Original Sentence | Entity 1 | Negation | Connector | Entity 2
(1) wild-type p53 tumor suppressor protein, which induces apoptosis | wild-type p53 tumor suppressor protein | False | induces | apoptosis
(2) Wt p53 also induced significant apoptosis | Wt p53 | False | also induced | significant apoptosis
(3) oncogene mutant p53 suppresses apoptosis | oncogene mutant p53 | False | suppresses | apoptosis
(4) mutant p53 blocked E1A-induced apoptosis | Mutant p53 | False | blocked | E1A-induced apoptosis
(4) mutant p53 blocked E1A-induced apoptosis | E1A | False | induced | apoptosis
(5) mutant p53 does not induce apoptosis | mutant p53 | True | does not induce | apoptosis
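The four-field relation shown in this output (Entity 1, Negation, Connector, Entity 2) can be illustrated with a toy extractor. This is only a sketch: it uses a single regular expression with a hypothetical verb list, whereas ARP itself relies on a hybrid parsing grammar with many word classes. The names `Relation`, `VERBS`, and `extract` are invented for the example.

```python
import re
from typing import NamedTuple, Optional

class Relation(NamedTuple):
    entity1: str
    negation: bool
    connector: str
    entity2: str

# Hypothetical connector verbs; ARP's real grammar is far richer.
VERBS = r"(?:induces?|induced|suppresses|blocked)"
PATTERN = re.compile(
    rf"^(?P<e1>.+?)\s+(?P<neg>does not\s+)?(?P<verb>{VERBS})\s+(?P<e2>.+)$"
)

def extract(sentence: str) -> Optional[Relation]:
    """Return one relation for a simple 'X [does not] VERB Y' snippet, else None."""
    m = PATTERN.match(sentence)
    if m is None:
        return None
    negated = m.group("neg") is not None
    connector = ("does not " if negated else "") + m.group("verb")
    return Relation(m.group("e1"), negated, connector, m.group("e2"))

print(extract("oncogene mutant p53 suppresses apoptosis"))
print(extract("mutant p53 does not induce apoptosis"))
```

A pattern this simple cannot recover nested relations such as the second row for sentence (4), which is one reason a full parser is used.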
27Text Mining Statistics (Jan. 2005)
Collection | P53 | AP1 | Yeast
Number of Abstracts | 205,820 | 400,487 | 68,025
Number of Abstracts w/ Relation Extracted | 87,903 | 90,773 | 28,971
Linguistic relations (full parser) | 182,499 | 172,116 | 54,805
Concept Space Relations | 2,724,099 | 3,265,524 | 6,535,737
28Knowledge Integration: Organizing Relations
- Relations are more useful when well organized
  - Multiple name strings of the same biological entities or processes are aggregated
  - Important contextual information is captured
  - Entities are cross-referenced to outside resources
- Well-organized relations should help with domain-appropriate analysis tasks
29An Example: Context and Term Variation
- 4 somewhat contradictory PubMed abstract snippets
  - (1) wild-type p53 tumor suppressor protein, which induces apoptosis
  - (2) Wt p53 also induced significant apoptosis
  - (3) oncogene mutant p53 suppresses apoptosis
  - (4) mutant p53 blocked E1A-induced apoptosis
- (1) and (2) say that p53 induces apoptosis
- (3) and (4) say that p53 inhibits apoptosis
From PubMed documents 10594026, 8643473, 11809683, and 11375269
30An Example: Context and Term Variation
- Analyzing the context more closely
  - Wild-type (1), wt (2) p53 are non-mutated
  - Mutant (3) (4) p53 are mutated
  - P53 protein (1) is a protein
  - Oncogene p53 (3) is a gene
- Identifying context is important in organizing extracted information. Words near p53 suggest that while normal p53 induces apoptosis, mutated p53 inhibits it
From PubMed documents 10594026, 8643473, 11809683, and 11375269
31Biological Entity Recognition and Identification
- To aggregate relations we need to recognize and identify biological entities
- Recognition finds substance references in text; identification matches those references to external resources (Tuason et al. 2004)
- Three key object recognition difficulties (Fukuda et al., Palakal et al.)
  - Compound words
  - Ambiguous expressions
  - New or unknown words
32Aggregation System Design
33A Decompositional Approach to Biomedical Concept Matching
- BioAggregate tagger decomposes name strings found in a relation by left-to-right, longest-first pattern matching using domain-appropriate lexicons of feature-signaling terms
- Lexicons built from existing resources and human-generated lists
  - Substance names in LocusLink, RefSeq, HUGO, SGD, etc.
  - Biological processes in Gene Ontology
  - Lexicons of other features
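Left-to-right, longest-first matching can be sketched as a greedy lookup against a phrase lexicon. The code below is an illustration of the general technique, not the BioAggregate tagger itself; the mini-lexicon, the normalized value "TP53", the "Unknown" fallback, and the `max_len` limit are all assumptions made for the example.

```python
# Hypothetical mini-lexicon: phrase -> (feature, normalized value).
LEXICON = {
    "wild-type": ("Mutation", "not mutated"),
    "wt": ("Mutation", "not mutated"),
    "mutant": ("Mutation", "mutated"),
    "p53": ("Aggregatable Substance", "TP53"),
    "tp53": ("Aggregatable Substance", "TP53"),
    "tumor suppressor protein": ("Substance Type", "protein"),
    "protein": ("Substance Type", "protein"),
    "oncogene": ("Substance Type", "gene"),
}

def decompose(name, lexicon=LEXICON, max_len=3):
    """Tag a name string by greedy left-to-right, longest-first lexicon lookup."""
    tokens = name.lower().split()
    features = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase starting at position i first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in lexicon:
                features.append(lexicon[phrase])
                i += n
                break
        else:
            # No lexicon entry starts here; keep the token as unknown.
            features.append(("Unknown", tokens[i]))
            i += 1
    return features

print(decompose("wild-type p53 tumor suppressor protein"))
print(decompose("Wt p53"))
```

Trying the longest phrase first means "tumor suppressor protein" is consumed as one unit instead of leaving "tumor" and "suppressor" unmatched before "protein".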
34Features For Decompositional Matching
Feature Lexicon | Explanation
Aggregatable Substance | A gene and its product(s). All references to a particular gene and its product(s) share the same Aggregatable Substance value. E.g., p53, tp53, and trp53.
Mutation | Indicates the status of an aggregatable substance. Only has two values: mutated or not mutated (wild-type).
Substance Type | "Type" of aggregatable substance. Currently there are three recognized types: gene, protein, and mRNA.
Associator | Essentially verbs. This feature attempts to resolve verbs that occur in multiple forms but have the same stem. E.g., inhibit, inhibits, inhibited, and inhibiting all share the Connector Associator value "inhibit."
Function | A biological process, such as apoptosis or angiogenesis (as in the biological_process list of Gene Ontology), or an action performed on an aggregatable substance, such as phosphorylation or inhibition.
Species | The species/organism information associated with an entity or relation.
Cellular Component | The sub-cellular component or location of an entity or relation.
Stopword | Common words judged to meet the standard: ignoring this word will not mischaracterize pathway relations.
35P53 Testbed
- ARP extracted 182,499 relations from 87,903 PubMed abstracts related to the gene p53
- As extracted, the relations display very little overlap, with 142,974 distinct entity names and 127,397 distinct relational pairs
36Five Levels of Aggregation Granularity
Levels are listed from more detailed (top) to more abstract (bottom).
Aggregation Level | Possible Applications
Baseline (string match entities and connector) | basis of comparison
Feature Match (feature synonym resolution) | detailed pathway analysis
Typed Substance (distinguish genes and proteins) | pathway analysis; granularity is comparable to some human-curated databases
Aggregatable Substance | explore the function of a gene and its gene products
Simple Pathway (substance/function, 4 categories for connectors) | high-level overviews and input to some machine learning algorithms
37Network Consolidation Results
- The numbers of distinct entities and relations are sharply reduced over the various levels of aggregation
- When fewer relations are disjoint, the knowledge network encompasses more information
- Network density increased 20-fold
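A small worked example shows why merging name variants raises network density. The slides do not state which density formula was used, so the directed-graph definition below (distinct edges divided by n(n-1)) is an assumption, as are the synonym map `CANONICAL` and the sample relations.

```python
def density(relations):
    """Directed density: distinct edges / (n * (n - 1)) over distinct nodes."""
    edges = set(relations)
    nodes = {node for edge in edges for node in edge}
    n = len(nodes)
    return len(edges) / (n * (n - 1)) if n > 1 else 0.0

# Hypothetical synonym map standing in for one aggregation level.
CANONICAL = {
    "wt p53": "TP53", "p53": "TP53", "tp53": "TP53",
    "mdm-2": "MDM2", "mdm2": "MDM2",
}

def aggregate(relations):
    """Rewrite both ends of every relation to a canonical entity name."""
    def canon(name):
        return CANONICAL.get(name.lower(), name)
    return [(canon(a), canon(b)) for a, b in relations]

raw = [("Wt p53", "apoptosis"), ("p53", "mdm2"), ("TP53", "MDM-2"), ("tp53", "apoptosis")]
print(density(raw))             # 4 edges over 7 distinct names
print(density(aggregate(raw)))  # 2 edges over 3 canonical entities
```

In this toy case the same four extracted relations collapse from seven names to three entities, so the density rises from 4/42 to 2/6.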
38Genomic Data Mining in AI Lab
- Joint learning of genetic networks from microarray data and existing knowledge
- Gene selection for cancer diagnosis
39Gene Selection for Cancer Diagnosis
- Gene array data have been widely used for cancer classification/prediction (Golub et al., 1999; Ben-Dor et al., 2001)
- The major problems of gene array data (Model et al., 2001; Lu & Han, 2003)
  - High dimensionality (hundreds or thousands of genes)
  - Small number of available samples
  - Most genes are irrelevant to cancer distinction
  - Genes interact with each other
- It is important to identify the marker genes for cancer diagnosis
40Experiment: Ovarian Cancer Diagnosis
- Ovarian cancer
  - 25,580 projected cases in 2004
  - 16,090 deaths estimated in 2004
  - 53% overall 5-year survival
  - 31% 5-year survival in those with distant metastases at diagnosis
  - 75% of cases diagnosed in late stages (III and IV)
- Predict survival of ovarian cancer: alive or dead?
  - Clinical measurements
    - Two attributes: stage and grade
  - Gene methylation level
    - Differentially methylated between normal and cancer tissues
41Ovarian Cancer Methylation Array
- University of Iowa Gynecologic Oncology tumor bank (provided by Dr. Bernie Futscher at AZCC)
- 114 DNA samples
  - 11 Normal ovary
  - 19 Stage I
  - 18 Stage II
  - 24 Stage III
  - 17 Stage IV
  - 25 Low malignant potential (LMP)
- 6560 genes
  - The top 1000 genes with the highest standard deviation across all samples are regarded as potentially relevant
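The variance filter described above (keep the genes whose values vary most across samples) can be sketched as follows. The gene names and values are invented, and population standard deviation is used for simplicity; with the same number of samples per gene, sample standard deviation gives the same ranking.

```python
import statistics

def top_variable_genes(matrix, k):
    """Return the k gene names with the highest standard deviation.

    matrix maps gene name -> list of measured values, one per sample.
    """
    ranked = sorted(matrix, key=lambda g: statistics.pstdev(matrix[g]), reverse=True)
    return ranked[:k]

# Hypothetical methylation levels for three genes over four samples.
levels = {
    "geneA": [0.1, 0.9, 0.2, 0.8],   # varies strongly across samples
    "geneB": [0.5, 0.5, 0.5, 0.5],   # constant, carries no signal
    "geneC": [0.4, 0.6, 0.5, 0.5],   # varies slightly
}
selected = top_variable_genes(levels, 2)
print(selected)
```

In the setting of the slide, the same call with k=1000 over 6560 genes would produce the reduced candidate set.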
42Gene Selection Techniques
- Individual gene ranking
  - F-statistics (Mendenhall & Sincich, 1995)
- Gene subset selection
  - Optimal search algorithms
    - Genetic algorithm (GA) (Holland, 1975)
    - Tabu search (TS) (Glover, 1986)
  - Evaluation criteria
    - Maximum relevance minimum redundancy (MRMR) (Ding & Peng, 2003)
    - Support vector machine (SVM) (Vapnik, 1995)
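Individual gene ranking by F-statistic can be illustrated with the standard one-way ANOVA formula: between-group mean square divided by within-group mean square. The sketch below assumes that each gene's values are split by outcome class; the data and gene names are invented, and a gene with zero within-group variance would need special handling that is omitted here.

```python
def f_statistic(groups):
    """One-way ANOVA F statistic for a list of groups of numeric values."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical values per gene, grouped as [alive, dead].
genes = {
    "gene1": [[1, 2, 3], [7, 8, 9]],   # class means differ strongly
    "gene2": [[1, 5, 9], [2, 5, 8]],   # class means are identical
}
ranking = sorted(genes, key=lambda g: f_statistic(genes[g]), reverse=True)
print(ranking)
```

A high F value means the gene separates the classes well on its own; unlike subset selection with GA or TS, this ranking ignores interactions between genes.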
43Marker Genes for Survival Prediction
- Q1: Which genes can be used to predict the survival of ovarian cancer based on their methylation level?
Individual 95% CIs for mean, based on pooled StDev (pooled StDev = 1.847)
Level | F | N | Mean | StDev
Full set | 1000 | 30 | 67.690 | 2.100
F-stat | 100 | 30 | 77.398 | 1.414
GA/MRMR | 57 | 30 | 76.199 | 2.018
GA/SVM | 39 | 30 | 80.263 | 1.592
TS/MRMR | 24 | 30 | 80.702 | 1.613
TS/SVM | 96 | 30 | 82.807 | 2.205
- Conclusion
  - TS/SVM selected 96 out of 1000 genes, which achieved the highest prediction accuracy (82.807%)
44Methylation vs. Clinical Diagnosis
- Q2: Will methylation-based methods perform better than clinical diagnosis in predicting survival of ovarian cancer?
Individual 95% CIs for mean, based on pooled StDev (pooled StDev = 1.790)
Level | F | N | Mean | StDev
Clinical | 2 | 30 | 75.281 | 1.770
Full set | 1000 | 30 | 57.566 | 2.747
F-stat | 70 | 30 | 75.581 | 1.413
GA/MRMR | 40 | 30 | 75.506 | 1.756
GA/SVM | 46 | 30 | 79.813 | 0.205
TS/MRMR | 24 | 30 | 77.715 | 1.868
TS/SVM | 30 | 30 | 81.948 | 1.769
- Conclusions: prediction accuracy
  - Full set < Clinical < Marker genes (selected by TS/SVM)
45GeneScene Visualizer
- To provide graphical presentation of large-scale regulatory networks comprised of pathway relations extracted through text mining technologies
- Three testing collections
  - p53 (87,903 abstracts)
  - AP1 (90,773 abstracts)
  - Yeast (28,971 abstracts)
- Currently loading and parsing entire PubMed for Cancer pathway
  - 7 million abstracts
46GeneScene Visualizer Functionality
- Searching by specific elements, e.g., diseases or genes
- Network-based exploration and navigation
- Accessing the underlying PubMed abstract
- Saving and loading search and visualization results
- Various manipulations on the table and network view of the retrieved relations: filter, sort, zoom, highlight, isolate, expand, print, etc.
47GeneScene Visualizer V1.5
48GeneScene Visualizer V1.5
49Effect of Aggregation
- Same relations, before and after aggregation
Baseline level
Simple Pathway level
50Effect of Aggregation: Mutation Feature
When mutant and non-mutant are combined, an apparent conflict arises: TP53 is both activating and inhibiting MDM2.
When the Mutation feature is selected, non-mutant TP53 is shown to activate and mutant TP53 to inhibit MDM2.
51User Feedback: General Comments
- Interviewees were generally impressed with the features and usefulness of the system
  - "In my head I've been trying to do what this is doing for you."
  - "It took me a few weeks just to find that Sin3 interacts with p53, where when you type this in to Genescene it's right there."
  - "Just playing around with the system I am seeing things that I didn't know before."
  - "If this is the entire Medline, I would probably use it every time I search."
  - "I think a lot of people would get a lot of use out of this, as long as it doesn't scare them off in the beginning."
52Lessons Learned
- Biomedical information is precise, but terminologies are fluid
- Biomedical professionals need search and analysis help
- Biomedical linguistic parsing and ontologies are promising for biomedical text mining
- There is a need to integrate biomedical data (gene microarray) with text mining (literature)
53Ongoing Research
- Combining bottom-up data mining (MicroArray/Methylation) with top-down text mining results
- Creating CancerPath testbed for cancer genomic network visualization
- Biological networks: topological analysis (growth, preferential attachment)
- Other biomedical applications: plant science pathway (Arabidopsis, Galbraith Lab); infectious disease surveillance
54BioPortal: Infectious Disease Information Sharing, Surveillance, Analysis, and Visualization
55Research Partners and Supports
- University of Arizona
- University of California, Davis
- Kansas State University
- University of Utah
- Arizona Department of Public Health
- New York State Department of Health/HRI
- California Department of Health Services/PHFE
- U.S. Geological Survey
- The SIMI Group
- National Taiwan University
- NSF
- CIA/ITIC
- DHS
- DOD/AFMIC
- CDC
- AZDPS
56UA Team Members
- Dr. Hsinchun Chen
- Dr. Daniel Zeng
- Lu Tseng
- Cathy Larson
- Kira Joslin
- Wei Chang
- James Ma
- Hsinmin Lu
- Ping Yan
- Aaron Sun
- Keith Alcock
- Sapna Brahmanandam
- Milind Chabbi
- Yuan Wang
57Outline
- Project Background
- BioPortal Achievements
- System Architecture
- System Functionalities
- BioPortal Collaboration Framework
- New Developments
- International Foot-and-mouth Disease Monitoring
- Syndromic Surveillance
- Disease Contact Tracing
58BioPortal Project Goals
- Demonstrate and assess the technical feasibility and scalability of an infectious disease information sharing (across species and jurisdictions), alerting, and analysis framework.
- Develop and assess advanced data mining and visualization techniques for infectious disease data analysis and predictive modeling.
- Identify important technical and policy-related challenges in developing a national infectious disease information infrastructure.
59Information Sharing Infrastructure Design
[Architecture diagram. Components shown: data providers NYSDOH, CADHS, and new sources; PHINMS Network and XML/HL7 Network connections secured with SSL/RSA; adaptors; the Info-Sharing Infrastructure; a Data Ingest Control Module (cleansing / normalization); and the Portal Data Store (MS SQL 2000).]
60Data Access Infrastructure Design
61Datasets Integrated: WNV, BOT
Index | Dataset | Test Data Available | Data Duration (MM/YY) | Data Size | Spatial Granularity | Temporal Granularity
1 | NY WNV Human | Yes | Test Data | 574 | Zip | Date
2 | NY Dead Bird | Yes | Test Data | 942 | Lat/Long shifted |
3 | NY Mosquito | Yes | Test Data | 815 | County | Date
4 | NY WNV Captive Animal | Yes | Test Data | 39 | Zip | Date
5 | NY Botulism Human | Yes | Test Data | 10 | Zip | Date
6 | CA WNV Human | Yes | 09/03-10/03 | 186 | County | Date
7 | CA Dead Bird | Yes | 01/03-10/03 | 3032 | City/zip | Minutes
8 | CA Chicken Sera | Yes | 04/03-10/03 | 18887 | Site | Date
9 | CA Mosquito Pool | Yes | 01/98-10/03 | 3518 | Site | Date
10 | CA Botulism | Yes | 01/01-12/02 | 53 | Zip | Date
11 | USGS EPIZOO - Preliminary | Yes | 07/99-09/03 | 46 events | County | Date
12 | USGS EPIZOO WNV | Yes | 08/1999-07/2004 | 113 events | County | Date
13 | USGS EPIZOO - Botulism | Yes | 12/1989-12/2004 | 702 events | County | Date
14 | UC Davis FMD | Yes | 1996-2003 | 3288 | Site/Province | Date/Month
15 | International FMD | Yes | 01/1982-03/2005 | 6789 | Province | Non-temporal
16 | BioWatch | Yes | 1/10-1/17 2004 | 480 | Exact Site | Date
17 | CA Mosquito Treatment | Yes | 1/14-11/30 2004 | 6194 | Exact Site | Date
18 | National Infant Botulism | Yes | 1/16-11/25 2004 | 15 | Zip | Date
62Communications/Messaging
- Scalable, flexible, light-weight, and extendible. Easy to include
  - New diseases
  - New jurisdictions
  - New techniques!
- Messaging infrastructure installed and tested
  - NYSDOH-UA: PHIN MS
  - CADHS-UA: Regional message broker
  - NWHC-UA: PHIN MS
- XML generation/conversion
  - NY_DeadBird, NY_Alerts, NY_BotHuman, NY_WNVHuman, NY_CaptiveAnimal, NY_Mosquito
  - CA_BotHuman, CA_WNVHuman, CA_DeadBird, CA_Chicken, CA_Mosquito
  - USGS_Epizoo
63Spatio-Temporal Data Mining: Hotspot Analysis
- A hotspot is a condition indicating some form of clustering in a spatial and temporal distribution (Rogerson & Sun 2001; Theophilides et al. 2003; Patil & Tailie 2004).
- For WNV, localized clusters of dead birds typically identify high-risk disease areas (Gotham et al. 2001).
- Automatic detection of dead bird clusters using hotspot analysis can help predict disease outbreaks and aid in effective allocation of prevention/control resources.
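To make the notion of a hotspot concrete, the sketch below flags grid cells where recent sightings clearly exceed a baseline period. It is a deliberately simple stand-in and not the RSVC, SaTScan, or CrimeStat methods named later in the deck; the cell size, thresholds, function names, and coordinates are all assumptions.

```python
from collections import Counter

def grid_cell(lat, lon, size=0.5):
    """Map a coordinate to the integer index of its size-by-size degree cell."""
    return (int(lat // size), int(lon // size))

def hotspots(baseline, recent, size=0.5, min_cases=3, ratio=2.0):
    """Return cells whose recent count is at least `ratio` times the baseline."""
    base = Counter(grid_cell(lat, lon, size) for lat, lon in baseline)
    new = Counter(grid_cell(lat, lon, size) for lat, lon in recent)
    flagged = []
    for cell, count in new.items():
        # Add-one smoothing keeps cells with no baseline sightings comparable.
        if count >= min_cases and count / (base[cell] + 1) >= ratio:
            flagged.append(cell)
    return sorted(flagged)

# Hypothetical dead-bird sightings as (latitude, longitude) pairs.
baseline = [(40.70, -74.20), (42.60, -73.80)]
recent = [(40.71, -74.21), (40.72, -74.22), (40.73, -74.23), (40.74, -74.24),
          (42.60, -73.80)]
flagged = hotspots(baseline, recent)
print(flagged)
```

A fixed grid is sensitive to where cell boundaries fall, which is one reason scan-statistic and clustering methods are preferred in practice.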
64Case Study (NY WNV)
- On May 26, 2002, the first dead bird with WNV was found in NY
- Based on NY's test dataset
[Timeline figure: March 5, May 26, and July 2 mark a baseline period and a new-cases period; the record counts shown are 140 and 224.]
66Dead Bird Hotspots Identified
67WNV/BOT BioPortal
Acknowledgment NSF, ITIC, NYSDH, CDHS,
USGS (Drs. Kvach and Ascher)
71BioPortal HotSpot Analysis RSVC, SaTScan, and
CrimeStat Integrated (first visual, real-time
hotspot analysis system for disease surveillance)
- West Nile virus in California
72Hotspot Analysis-Enabled STV
73International FMD BioPortal
Acknowledgment DHS, DOD, UC Davis (Drs. Thurmond
and Lynch)
74International FMD BioPortal Goals
- Real-time, web-based situational awareness of FMD outbreaks worldwide through the establishment of an international information sharing and analysis system
- FMDv characterization at the genomic level, integrated with associated epidemiological information and modeling tools to forecast national, regional, and/or international spread and the prospect of import into the U.S. and the rest of North America
- Web-based crisis management of resources: facilities, personnel, diagnostics, and therapeutics
75Research Plans
- Global FMD epidemiological data
  - (Near) real-time data collection
  - Web-based information sharing and analysis
- International FMD news
  - Indexed collection of global FMD news
  - Search and visualization of the FMD news via the web
- FMD genetic/sequence data
  - Predictive model using phylogenetic, spatial, and temporal information to stop FMD at the border
  - Visualization for FMD events in time, space, and genetic space
76Preliminary Global FMD Dataset
- Provider: UC Davis FMD Lab
- Information sources: reference labs and OIE
- Coverage: 28 countries globally
- Time span: May 1905 to March 2005
- Dataset size: 30,000 records, of which 6789 records are complete
- Host species: Cattle, Caprine, Ovine, Bovine, Swine, NK, Elephant, Buffalo, Sheep, Camelidae, Goat
77Global FMD Coverage in BioPortal
78FMD Migration Visualization using BioPortal
(cases in South Asia)
FMD Cases travel back and forth between countries
79International FMD News
- Provider: UC Davis FMD Lab
- Information sources: Google, Yahoo, and open Internet sources
- Time span: Oct 4, 2004 to present (real-time messaging under development)
- Data size: 460 events (6/21/05)
- Coverage: 51 countries (Africa: 11, Asia: 16, Europe: 12, Americas: 12)
80Searching FMD News
- http://fmd.ucdavis.edu/
- Searchable by
- Date range
- Country
- Keyword
81Visualizing FMD News on BioPortal
82FMD Genetic Information Analysis
- Genome clustering analysis
  - Phylogenetic clustering
  - Spatial clustering
  - Temporal clustering
- Hotspot detection among gene sequences
  - Create a tree structure based on semantic distance between gene sequences.
  - Automatically detect the dense portion of the tree.
  - Identify the connection between the semantic cluster and the geographic pattern of gene sequences.
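Grouping sequences by pairwise distance can be sketched as follows. The slides do not define the distance measure or the tree-building method, so this example assumes a simple p-distance (proportion of differing sites between aligned sequences) and single-linkage grouping under a threshold; the sequences, ids, and the 0.2 threshold are invented.

```python
def p_distance(a, b):
    """Proportion of differing sites between two equal-length aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def single_linkage_groups(seqs, threshold):
    """Group sequence ids connected by chains of pairwise distances <= threshold."""
    ids = list(seqs)
    parent = {i: i for i in ids}

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for idx, a in enumerate(ids):
        for b in ids[idx + 1:]:
            if p_distance(seqs[a], seqs[b]) <= threshold:
                parent[find(a)] = find(b)
    groups = {}
    for i in ids:
        groups.setdefault(find(i), []).append(i)
    return sorted(sorted(g) for g in groups.values())

# Hypothetical aligned partial sequences.
seqs = {
    "s1": "ACGTACGTAC",
    "s2": "ACGTACGTAA",
    "s3": "TTGTCCGGAC",
    "s4": "TTGTCCGGAA",
}
groups = single_linkage_groups(seqs, threshold=0.2)
print(groups)
```

Each resulting group could then be color-coded on a map to look for a geographic pattern, which is the connection the last bullet describes.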
83FMD Genetic Visualization
- Goal: Extend STV to incorporate a 3rd dimension, phylogenetic distance
  - Include a phylogenetic tree.
  - Identify phylogenetic groups and color-code the isolate points on the map.
  - Leverage available NCBI tools such as BLAST.
- Proof of concept: SAT 2 and 3 analysis
  - Data: 54 partial DNA sequence records in South Africa received from UC Davis FMD Lab (Bastos, A.D. et al. 2000, 2003)
  - Date range: 1978-1998
  - Countries covered: South Africa, Zimbabwe, Zambia, Namibia, Botswana
84Sample FMD Sequence Records
Color-coded View (MEGA3)
Textual View of Gene Sequence
85Phylogenetic Tree of Sample FMD Data
Identify 6 groups within 2 major families (MEGA3
based on sequence similarity)
Group6
Group1
Group2
Group5
Group4
Group3
86Genetic, Spatial, and Temporal Visualization of
FMD Data
Phylogenetic tree color coded
Isolates locations color coded
Isolates appearances in time
87FMD Time Sequence Analysis
First family cases appeared throughout the period
Second family cases existed before 1993 and reappeared later, after 1997
88FMD Periodic Pattern Analysis
2nd family concentrated in Feb. while 1st family
spread evenly
89Locations of Family 1 records
Selected only groups 1, 2, and 3 and found a
spatial cluster
90Locations of Family 2 records
Sparse isolate locations
Selected only groups 4, 5, and 6
92Chief Complaints As a Data Source
- Chief complaints (CCs) are short free-text phrases entered by triage practitioners describing the reasons for patients' ER visits
- Examples: "lt foot pain" → "left foot pain"; "cp" → "chest pain"; "sob" → "shortness of breath"; "so" → should be "sob"; "poss uti" → "possibly urinary tract infection"
- Advantages of using CCs for surveillance purposes
  - Timeliness: diagnosis results are on average 6 hours slower than CCs
  - Availability and low cost: most hospitals have free-text CCs available in electronic form
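The examples above suggest a normalization step that expands abbreviations and repairs common typos before classification. The sketch below is one minimal way to do that with lookup tables; the tables contain only the examples from the slide, and the names `ABBREVIATIONS`, `CORRECTIONS`, and `normalize_cc` are hypothetical.

```python
import re

# Hypothetical abbreviation table built from the examples on the slide.
ABBREVIATIONS = {
    "lt": "left",
    "cp": "chest pain",
    "sob": "shortness of breath",
    "poss": "possibly",
    "uti": "urinary tract infection",
}
# Hypothetical typo corrections applied before expansion. Blindly mapping
# "so" to "sob" would be unsafe on general text; it is shown only because
# the slide lists it as an example.
CORRECTIONS = {"so": "sob"}

def normalize_cc(text):
    """Lowercase a chief complaint, fix known typos, and expand abbreviations."""
    tokens = re.findall(r"[a-z0-9/]+", text.lower())
    tokens = [CORRECTIONS.get(t, t) for t in tokens]
    return " ".join(ABBREVIATIONS.get(t, t) for t in tokens)

print(normalize_cc("Lt foot pain"))
print(normalize_cc("poss UTI"))
print(normalize_cc("SO"))
```

Token-level lookup keeps the step transparent and easy to audit, at the cost of missing abbreviations whose meaning depends on context.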
93Existing CC Classification Methods
Classification Method | Systems | Authors
Keyword Match, Synonym List, Mapping Rules | DOHMH (NY City), EARS | Mikosz et al. (2004)
Weighted Keyword Match (Vector Cosine Method), Mapping Rules | ESSENCE | Sniegoski (2004)
Naïve Bayesian | RODS | Olszewski (2003), Ivanov et al. (2002)
Bayesian Network | N/A | Chapman et al. (2004)
94Overall System Design
Chief Complaints
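95A Stage 2 Example: CC Concepts → Symptom Group Concepts
[Figure: CC concepts are mapped through UMLS to symptom group concepts; the links carry the numbers 4, 5, and 6, and each symptom group receives a weight.]
- coagulopathy, purpura, ecchymosis: bleeding 1/4 + 1/5 + 1/6 = 0.62
- blood in urine, ureteral stone: other 1/5 = 0.2
- coma: coma 1/5 = 0.2; dead 1/5 = 0.2
- pass out: altered_mental_status 1/5 = 0.2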
96Summary of Stage 2 Performance
- 1835 CC records from Stage 1 contain 417 unique concepts
- 3001 concepts
97BioPortal Taiwan Syndromic Surveillance
98Multi-lingual Chief Complaints: Chinese Example
- Data Characteristics
  - Mixed expressions in both Chinese and English
    - ????FEVER???????????(?)
    - ??,?????A/W,????,????
  - 18% of CC records from NTU Med. Center contain Chinese expressions.
  - Some hospitals have 100% of CC records in Chinese (for example, ??????)
  - Misspellings and typographic errors are not serious
99Chinese CC Preprocessing System Design
[Flow diagram, listed here by its labels]
- Input: Chinese Chief Complaints
- Stage 0.1: Separate Chinese and English Expressions (Chinese Expressions, English Expressions)
- Stage 0.2: Chinese Phrase Segmentation (Segmented Chinese Phrases)
- Stage 0.3: Chinese Phrase Translation (Translated Chinese Phrases)
- Resources shown: Chinese to English Dictionary, Chinese Medical Phrases, Common Chinese Phrases, Raw Chinese CCs, Mutual Info.
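The first preprocessing step, separating Chinese and English expressions, can be sketched with a Unicode range test. This is an illustration only: it assumes that the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) is enough to recognize Chinese text, and the sample complaint is invented because the Chinese examples in this transcript were lost in extraction.

```python
import re

# Basic CJK Unified Ideographs block; an assumption that covers common
# characters only, not extension blocks or CJK punctuation.
CHINESE_RUN = re.compile(r"[\u4e00-\u9fff]+")

def split_expressions(cc):
    """Split a mixed chief complaint into Chinese runs and English fragments."""
    chinese = CHINESE_RUN.findall(cc)
    english = [part.strip(" ,;()") for part in CHINESE_RUN.split(cc)]
    english = [part for part in english if part]
    return chinese, english

# Hypothetical mixed-language complaint (fever, cough).
chinese, english = split_expressions("發燒 FEVER, 咳嗽")
print(chinese)
print(english)
```

The Chinese runs would then go on to segmentation and translation, while the English fragments can be passed along unchanged.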
100Result: Self Validation
- Use the 280 translations against 1978 chief complaints from hospital A
- 1610 (82%) records are in English
- 368 (18%) records contain Chinese
- 36% contain trivial info.
- E.g., r/o septic shock ????
- 64% contain non-trivial info.
- E.g., poor intake and ????
- 67% have complete translation
- 2% have partial translation
- 20% do not have translation
101General Grouping
- Taiwan surveillance data visualization 2.2M
scrubbed chief complaints records
102Group by Hospital
103Group by Syndrome Classification
104Disease Outbreak Detection Using Chief Complaints
105Data Source
- Emergency Department free-text Chief Complaints (CCs)
- Medical practitioners use both Chinese and English to record CCs
- 368,151 records; 23.77% contain Chinese characters
- Time period: 2000-6-30 to 2003-4-27
- Use BioPortal Multilingual CC Classifier to classify CC records into syndromes
106Syndrome Prevalence
- Botulism-Like (1.4%)
- Constitutional (25.4%)
- Gastrointestinal (26.4%)
- Hemorrhagic (6.4%)
- Neurological (14.1%)
- Rash (2.4%)
- Respiratory (17.8%)
- Upper Respiratory (7.3%)
- Lower Respiratory (12.7%)
- Fever (18%)
- Other (34.9%)
- Choose Resp. and GI syndromes for further analysis
  - Two syndromes with high prevalence
  - Can be extended to other syndromes
107GI Syndrome Time Series
Gastrointestinal Syndrome Time Series
Autocorrelation Function
108GI Syndrome Time Series (cont'd)
- Strong day-of-week effect
- Seasonal effect is less strong
- Sporadic jumps
- Seems to have 1 to 2 peaks per year
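A day-of-week effect shows up in the autocorrelation function as a peak at lag 7. The sketch below computes the sample autocorrelation for a synthetic daily series; the counts are invented and only illustrate the diagnostic, not the Taiwan data.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of a series at the given positive lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    cov = sum((series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag))
    return cov / var

# Hypothetical daily counts with a weekend bump, repeated for 8 weeks.
week = [10, 10, 10, 10, 10, 15, 18]
counts = week * 8

acf7 = autocorrelation(counts, 7)
acf3 = autocorrelation(counts, 3)
print(acf7, acf3)
```

For this perfectly periodic series the lag-7 value is high and the lag-3 value is negative; real counts would show a weaker but still visible weekly peak.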
109Estimation Results (GI Syndrome)
GI Time Series
Jumps
Outbreak State
110Estimation Results (GI Syndrome) (cont'd)
- Jumps appear during Chinese New Year
- The Markov switching model identified 4 high GI-count periods
  - 2000-12-23 to 2001-1-28 (Jan. 23: New Year's Eve)
  - 2002-1-29 to 2002-3-15 (Feb. 11: New Year's Eve)
  - 2002-5-9 to 2002-10-14
  - 2002-12-13 to 2003-2-18 (Jan. 30: New Year's Eve)
111Taiwan SARS Contact Tracing
112Social Network Analysis in Epidemiology
- Conceptualizing a population as a set of individuals linked together to form a large social network provides a fruitful perspective for better understanding the spread of some infectious diseases. (Klovdahl, 1985)
- Social Network Analysis in epidemiology has two major activities
  - Network Construction
    - Link the whole set of persons in a particular population with relationships or types of contacts
  - Network Analysis
    - Measure and make inferences about structural properties of the social networks through which infectious agents spread
113A Taxonomy of Network Construction
Disease | Linking Relationship | Examples
Sexually Transmitted Disease (STD) | Sexual Contact | AIDS (Klovdahl, 1985); Gonorrhea (Ghani et al., 1997); Syphilis (Rothenberg et al., 1998)
Sexually Transmitted Disease (STD) | Drug Use | AIDS (Klovdahl et al., 1994); AIDS (Rothenberg et al., 1998)
Sexually Transmitted Disease (STD) | Needle Sharing | AIDS (Klovdahl et al., 1994); AIDS (Rothenberg et al., 1998)
Sexually Transmitted Disease (STD) | Social Contact | AIDS (Klovdahl et al., 1994); AIDS (Rothenberg et al., 1998)
Tuberculosis (TB) | Personal Contact | (Klovdahl et al., 2001); (McElroy et al., 2003)
Tuberculosis (TB) | Geographical Contact | (Klovdahl et al., 2001); (McElroy et al., 2003)
Severe Acute Respiratory Syndrome (SARS) | The Source of Infection | (CDC, 2003); (Shen et al., 2004)
Severe Acute Respiratory Syndrome (SARS) | Personal Contact | (Meyers et al., 2005)
CDC: Centers for Disease Control and Prevention
114A Taxonomy of Network Analysis
Levels of Analysis | Description | Examples
Network Visualization | Show the spread of an infectious agent transmitted from one person to another | AIDS (Klovdahl, 1985); Syphilis (Rothenberg et al., 1998); SARS (CDC, 2003); SARS (Shen et al., 2004)
Network Measurement | Study the structure of a population through which an infectious agent is transmitted during close personal contact; develop disease containment strategies or programs | Syphilis (Rothenberg et al., 1998); AIDS (Klovdahl et al., 1994); AIDS (Rothenberg et al., 1998)
Network Simulation | Evaluate the spread of an infectious agent within a population with different network parameters | Gonorrhea (Ghani et al., 1997); SARS (Meyers et al., 2005)
CDC: Centers for Disease Control and Prevention
115Network Visualization
- Utilize social networks to visualize the transmission of an infectious agent from one person to another within a particular population
- Focus on the identification of
  - Subgroups within the population
  - Characteristics of each subgroup
  - Bridges between subgroups which transmit a disease from one subgroup to another
116Epidemic Phases and Social Networks
- Potterat et al. (2001) proposed that the structure of sexual networks is a more reliable indicator of STD epidemic phase.
  - Two sexual networks in Colorado Springs, U.S. were compared
    - Bacterial STD from 1990 to 1991 (an STD outbreak)
    - Chlamydia from 1996 to 1999 (stable or declining phase)
  - The sexual network in the stable or declining phase was relatively
    - Fragmented
    - Dendritic
    - Lacking in cyclic structures
- Cunningham et al. (2004) further examined the relationship between network characteristics and epidemic phases.
  - After epidemic
    - Macro-level structure
      - Average distance declined.
      - Density increased.
    - Micro-level structure
      - Numbers of n-cliques and k-plexes declined.
117Research Test Bed
- We use Taiwan SARS data as our research test bed.
- SARS (Severe Acute Respiratory Syndrome) is a novel infectious disease which emerged in 2002.
  - The first human case was identified in Guangdong Province, China on November 16, 2002 (Donnelly et al., 2004).
  - A 65-year-old doctor from Guangdong Province stayed at a hotel in Hong Kong in February 2003 and infected at least 17 other guests and visitors at the hotel, some of whom later traveled to other countries and initiated local transmission of SARS (Peiris et al., 2006).
  - 26 countries, including Vietnam, Singapore, Canada, and Taiwan, reported SARS cases.
  - Financial impact: 50B
118SARS in Taiwan
- The first SARS case in Taiwan was a Taiwanese businessman who traveled to Guangdong Province via Hong Kong in early February 2003.
  - Had onset of symptoms on February 26, 2003
  - Infected two family members and one healthcare worker
- Eighty percent of probable SARS cases were infected in hospital settings.
  - The first outbreak began at a municipal hospital on April 23, 2003.
  - A total of seven hospital outbreaks were reported.
  - Hospital shopping and transfer were suspected to trigger such sequential hospital outbreaks.
119Taiwan SARS Data
- Taiwan SARS data was collected by the Graduate Institute of Epidemiology at National Taiwan University during the SARS period.
- In this dataset, there are 961 patients, including 638 suspected SARS patients and 323 confirmed SARS patients.
- The contact-tracing data of patients in this dataset has two main categories, personal and geographical contacts, and nine types of contacts.
  - Personal contacts: family member, roommate, colleague/classmate, and close contact
  - Geographical contacts: foreign-country travel, hospital visit, high-risk area visit, hospital admission history, and workplace
120Taiwan SARS Data (Cont.)
- Hospital admission history is the contact type with the largest number of records (43% of all records).
- Personal contacts are primarily comprised of family member records.
Category | Type of Contact | Records | Suspected Patients | Confirmed Patients
Personal | Family Member | 177 | 48 | 63
Personal | Roommate | 18 | 11 | 15
Personal | Colleague/Classmate | 40 | 26 | 23
Personal | Close Contact | 11 | 10 | 12
Geographical | Foreign-Country Travel | 162 | 100 | 27
Geographical | Hospital Visit | 215 | 110 | 79
Geographical | High Risk Area Visit | 38 | 30 | 7
Geographical | Hospital Admission History | 622 | 401 | 153
Geographical | Workplace | 142 | 22 | 120
Total | | 1425 | 638 | 323
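The two observations on this slide can be checked directly from the Records column. The sketch below is only an arithmetic illustration: the record counts are copied from the table, while the variable names are ours and not part of the original study.

```python
# Record counts per contact type, copied from the table on this slide.
records = {
    "Family Member": 177,
    "Roommate": 18,
    "Colleague/Classmate": 40,
    "Close Contact": 11,
    "Foreign-Country Travel": 162,
    "Hospital Visit": 215,
    "High Risk Area Visit": 38,
    "Hospital Admission History": 622,
    "Workplace": 142,
}
personal_types = ["Family Member", "Roommate", "Colleague/Classmate", "Close Contact"]

total = sum(records.values())
# Share of all records held by each contact type, in percent.
shares = {name: 100 * count / total for name, count in records.items()}
largest = max(records, key=records.get)

# Share of family member records among personal contact records only.
personal_total = sum(records[name] for name in personal_types)
family_share = 100 * records["Family Member"] / personal_total

print(f"{largest}: {shares[largest]:.1f}% of {total} records")
print(f"Family Member: {family_share:.1f}% of {personal_total} personal records")
```

Hospital admission history accounts for 622 of the 1425 records, and family member records make up 177 of the 246 personal contact records.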
121Research Design
122Phase Analysis (Cont.)
- Network Partition
  - We partition each contact network on a weekly basis with linkage accumulation.
  - From 2/24 to 5/4, there are 10 weeks in total.
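A minimal sketch of weekly partitioning with linkage accumulation, assuming each contact record carries a single date and that the study period is 2/24 to 5/4 of 2003. The slides do not say which date is attached to a link, so the `records` list and the function names below are hypothetical.

```python
from datetime import date

START = date(2003, 2, 24)  # first day of Week 1 (year assumed from context)
END = date(2003, 5, 4)     # last day of Week 10

def week_index(day, start=START):
    """Return the 1-based week number of `day`, counted from `start`."""
    return (day - start).days // 7 + 1

def cumulative_weekly_edges(dated_edges, start=START, end=END):
    """Map each week to the set of all links observed up to the end of that week."""
    by_week = {}
    for u, v, day in dated_edges:
        if start <= day <= end:
            by_week.setdefault(week_index(day, start), set()).add((u, v))
    snapshots = {}
    accumulated = set()
    for week in range(1, week_index(end, start) + 1):
        accumulated |= by_week.get(week, set())  # links are never dropped
        snapshots[week] = set(accumulated)
    return snapshots

# Hypothetical contact records: (node, node, date of contact).
records = [
    ("p1", "p2", date(2003, 2, 26)),
    ("p2", "p3", date(2003, 3, 18)),
    ("p3", "hospital_A", date(2003, 4, 23)),
]
snapshots = cumulative_weekly_edges(records)
for week, edges in snapshots.items():
    print(week, len(edges))
```

Because links accumulate, the network for a given week contains every link from earlier weeks, so edge counts can only stay flat or grow from one week to the next.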
123Phase Analysis (Cont.)
- Network Measurement
  - We investigate two factors that contribute to the transmission of disease in macro-structure
    - Density: the degree of intensity to which people are linked together
      - Density
      - Average degree of nodes
    - Transferability: the degree to which people can infect others
      - Betweenness
      - Number of components
(Figure: example networks contrasting higher and lower density, and lower and higher transferability)
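The four measures named on this slide can be sketched with standard graph definitions. This is an illustration under our own assumptions: the network is treated as undirected and unweighted, betweenness is computed per node with Brandes' algorithm and left unnormalized, and the slides do not state how node-level betweenness was aggregated for a whole network. All function names are ours.

```python
from collections import deque

def build_adjacency(nodes, edges):
    """Undirected adjacency sets; self-loops are ignored."""
    adj = {n: set() for n in nodes}
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    return adj

def density(adj):
    """Fraction of possible links that are present: 2E / (N(N-1))."""
    n = len(adj)
    m = sum(len(nb) for nb in adj.values()) / 2
    return 0.0 if n < 2 else 2 * m / (n * (n - 1))

def average_degree(adj):
    return sum(len(nb) for nb in adj.values()) / len(adj) if adj else 0.0

def count_components(adj):
    """Number of connected components, found by breadth-first search."""
    seen, count = set(), 0
    for start in adj:
        if start in seen:
            continue
        count += 1
        seen.add(start)
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
    return count

def betweenness(adj):
    """Unnormalized node betweenness (Brandes' algorithm)."""
    score = {n: 0.0 for n in adj}
    for s in adj:
        order, preds = [], {n: [] for n in adj}
        sigma = {n: 0 for n in adj}   # number of shortest paths from s
        dist = {n: -1 for n in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in adj[u]:
                if dist[v] < 0:
                    dist[v] = dist[u] + 1
                    queue.append(v)
                if dist[v] == dist[u] + 1:
                    sigma[v] += sigma[u]
                    preds[v].append(u)
        delta = {n: 0.0 for n in adj}
        for w in reversed(order):
            for u in preds[w]:
                delta[u] += sigma[u] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each unordered pair was counted from both endpoints.
    return {n: c / 2 for n, c in score.items()}

# Toy network: a path A-B-C plus an isolated node D.
adj = build_adjacency(["A", "B", "C", "D"], [("A", "B"), ("B", "C")])
bc = betweenness(adj)
print(density(adj), average_degree(adj), count_components(adj), bc["B"])
```

In the toy network, node B lies on the only shortest path between A and C, so it is the only node with nonzero betweenness, and the isolated node D accounts for the second component.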
124Connectivity Analysis
- Geographical contacts provide much higher connectivity than personal contacts in the network construction.
  - Decrease the number of components from 961 to 82
  - Increase the average degree from 0.31 to 108.62
Applied Contacts in the Network Construction | Average Degree (Patient Nodes) | Maximum Degree (Patient Nodes) | Number of Components
Personal Contacts | 0.31 | 4 | 847
Geographical Contacts | 108.62 | 474 | 82
Personal + Geographical Contacts | 108.85 | 474 | 10
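One way to see why geographical contacts raise connectivity so sharply is to link every pair of patients who share a location. The slides do not spell out how geographical contacts were converted into links between patient nodes, so the shared-location projection below is an assumption, and the patients, hospitals, and function names are made up for illustration.

```python
from itertools import combinations

def project_to_patients(visits):
    """Link every pair of patients recorded at the same location."""
    by_place = {}
    for patient, place in visits:
        by_place.setdefault(place, set()).add(patient)
    edges = set()
    for members in by_place.values():
        edges.update(combinations(sorted(members), 2))
    return edges

def components(nodes, edges):
    """Number of connected components, using union-find."""
    parent = {n: n for n in nodes}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for u, v in edges:
        parent[find(u)] = find(v)
    return len({find(n) for n in nodes})

def average_degree(nodes, edges):
    return 2 * len(edges) / len(nodes)

# Hypothetical data: six patients, one family link, two hospitals.
patients = ["p1", "p2", "p3", "p4", "p5", "p6"]
personal_edges = {("p1", "p2")}
visits = [
    ("p1", "hospital_A"), ("p3", "hospital_A"),
    ("p4", "hospital_A"), ("p5", "hospital_A"),
    ("p5", "hospital_B"), ("p6", "hospital_B"),
]
geo_edges = project_to_patients(visits)

for label, edges in [
    ("personal", personal_edges),
    ("geographical", geo_edges),
    ("personal + geographical", personal_edges | geo_edges),
]:
    print(label, round(average_degree(patients, edges), 2), components(patients, edges))
```

A single shared location with k patients contributes k(k-1)/2 links, which is consistent with the jump in average degree reported in the table once hospital-based contacts are included.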
125Connectivity Analysis (Cont.)
- Hospital admission history provides the highest connectivity of nodes in the network construction.
- Hospital visits provide the second highest connectivity.
- This result is consistent with the fact that most patients were infected in hospital outbreaks during the SARS period.
Applied Contacts (Category) | Applied Contacts (Type) | Average Degree | Maximum Degree | Number of Components
Personal Contacts | Family Member | 0.204 | 4 | 893
Personal Contacts | Roommate | 0.031 | 2 | 946
Personal Contacts | Colleague/Classmate | 0.06 | 3 | 934
Personal Contacts | Close Contact | 0.023 | 1 | 949
Geographical Contacts | Foreign-Country Travel | 2.727 | 41 | 848
Geographical Contacts | Hospital Visit | 10.077 | 105 | 753
Geographical Contacts | High Risk Area Visit | 1.388 | 36 | 924
Geographical Contacts | Hospital Admission History | 50.479 | 289 | 409
Geographical Contacts | Workplace | 4.694 | 61 | 823
126One-Mode Network with Only Patient Nodes
127Contact Network with Geographical Nodes
128Potential Bridges Among Geographical Nodes
- Including geographical nodes helps to reveal people who may act as bridges, transferring disease from one subgroup to another.
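As a rough illustration of bridge detection, a patient can be flagged when removing that patient splits the contact network into more components. This cut-vertex criterion is our assumption and not necessarily how bridges were identified in the study; the node names and functions are hypothetical.

```python
def count_components(nodes, edges):
    """Number of connected components among `nodes`, using union-find."""
    parent = {n: n for n in nodes}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for u, v in edges:
        if u in parent and v in parent:  # skip links to removed nodes
            parent[find(u)] = find(v)
    return len({find(n) for n in parent})

def potential_bridges(patients, edges):
    """Patients whose removal increases the number of components."""
    nodes = set(patients)
    for u, v in edges:
        nodes.update((u, v))
    baseline = count_components(nodes, edges)
    return sorted(
        p for p in patients
        if count_components(nodes - {p}, edges) > baseline
    )

# Hypothetical network with patient nodes and two geographical nodes.
patients = ["p1", "p2", "p3", "p4", "p5"]
edges = [
    ("p1", "hospital_A"), ("p2", "hospital_A"), ("p3", "hospital_A"),
    ("p3", "hospital_B"), ("p4", "hospital_B"), ("p5", "hospital_B"),
]
bridges = potential_bridges(patients, edges)
print(bridges)
```

Here p3 is the only patient recorded at both hospitals, so removing p3 separates the two hospital subgroups.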
129Network Visualization (Cont.)
- For a hospital outbreak, including geographical nodes and contacts in the network is also useful for seeing the possible disease transmission scenario within the hospital.
- Background of the example:
  - Mr. L, a laundry worker in H Hospital, had a fever on 2003/4/16 and was reported as a suspected SARS patient.
  - Nurse C took care of Mr. L on 4/16 and 4/17.
  - Nurse C and Ms. N, another laundry worker in H Hospital, began to have symptoms on 4/21.
  - H Hospital was reported to have a SARS outbreak on 4/24.
  - Nurse C's daughter had a fever on 5/1.
130Phase Analysis: Density
- Normalized density and average degree show similar patterns.
  - In the importation phase, the foreign-country contact network increases dramatically in Week 4 (3/17-3/23), followed by the personal contact network.
  - In the hospital outbreak phase, both the personal and hospital networks increase dramatically. In Week 10, however, the personal network still increases while the hospital network decreases.
(Figure: two panels, Density and Average Degree)
131Phase Analysis: Transferability
- From betweenness, we can see that the personal network does not have enough transferability until Week 9.
  - The personal network forms only several small fragments, without big groups, in the importation phase.
- From the number of components, the hospital network is the only one that can consistently link patients together.
(Figure: two panels, Betweenness and Number of Components, each marking the Importation and Hospital Outbreak phases)
132Ongoing Research
- Worldwide infectious disease breaking news collection, monitoring, and analysis
- Markov-switching model based disease surveillance
- Infectious disease social network analysis and contact tracing
- Other public health concerns and infectious disease applications
133Building Research Partnership
- Emerging critical medical and public health concerns
- Willing and engaging international domain (biomedical) partners and funding sources
- Data, data, and more data
- From academic research to scalable solutions/systems and lasting impacts
134For more information
BioPortal web site: http://www.bioportal.org
AI Lab web site: http://ai.arizona.edu
hchen_at_eller.arizona.edu