Knowledge Discovery for Cancer Informatics and Public Health Informatics: Techniques, Case Studies, and Lessons Learned Hsinchun Chen Director, Artificial Intelligence Lab, University of Arizona Acknowledgements: NIH, NSF, NCI, ACC, NTU

About This Presentation

Slides: 135

Transcript and Presenter's Notes

1
Knowledge Discovery for Cancer Informatics and Public Health Informatics: Techniques, Case Studies, and Lessons Learned
Hsinchun Chen
Director, Artificial Intelligence Lab, University of Arizona
Acknowledgements: NIH, NSF, NCI, ACC, NTU
2
Artificial Intelligence Lab Research

UA MIS Department (4th ranked): 30 research scientists, $25M funding since 1990; Chen is an IEEE and AAAS Fellow
Intelligence and Security Informatics research (NSF, DOJ, CIA): COPLINK system deployed in 1600 agencies; Dark Web for countering terrorism
Biomedical Informatics research (NLM, NCI): Chen is an NLM Scientific Counselor; HelpfulMed, GeneScene system, and BioPortal

3
A Little Promotion
4
GeneScene: Cancer Pathway Knowledge Extraction and Visualization
5
GeneScene Team
Team members
Dr. Gondy Leroy (Claremont)
Byron Marshall (Oregon SU)
Dan McDonald (Utah SU)
Zan Huang (Penn SU)
Jiexun Li (Drexel U)
Chun-Ju Tseng
Shauna Eggers
Dr. Jesse Martinez
Dr. George Watts
Dr. Bernie Futscher (AZCC)
Dr. Hua Su
Dr. Karin Quinones

Roles
Text Mining and Knowledge Integration
Data Mining
Visualization and System Development
Domain Experts
User Studies and Evaluation

6
Outline

GeneScene overview
Research directions
Text mining
Knowledge integration
Data mining
GeneScene Visualizer

7
Knowledge Explosion: PubMed

Average number of new citations appearing in PubMed
In 1980: 746/day
In 2004: 1,640/day

9
GeneScene Overview

Motivation
Relieving information overload in biomedical
research
Automating processes of knowledge extraction and
data analysis
Focus: genetic regulatory pathways
Dissection of regulatory networks is crucial for
a thorough understanding of biological processes
Complexity of biological networks raises
challenges for computational research
Research goals
To develop novel Natural Language Processing
(NLP) techniques to support information
extraction
To develop machine learning and data mining
techniques to support high-throughput data
analysis
To create an integrated framework for
pathway-related knowledge representation and
visualization
Ultimately, to provide biomedical researchers
with a pathway-related knowledge discovery and
integration platform
Funding: NIH/NLM, 1 R33 LM07299-01 (May 2002 to April 2006)

10
GeneScene Components and Scope
Ontology-enhanced Knowledge Integration: To aggregate and consolidate pathway relations extracted from literature and to integrate them with existing knowledge sources using biomedical ontologies
Text Mining of Biomedical Literature: To automatically extract regulatory relations between biological entities from free text
Data Mining for Genomic Studies: To extract regulatory pathway information based on genomic data and other resources
Visualization of Regulatory Pathways: To facilitate access to, understanding of, and analysis of extracted pathway knowledge
11
Text Mining

Extract all pathway-relevant relations from text
Relations with gene or protein names on either
end of the relation are extracted
Two types of relations in GeneScene:
Co-occurrence Relations (Concept Space): relations between terms that often co-occur in a set of abstracts
-> HelpfulMed (Cancer Space)
Linguistic Relations: precise, semantically rich relations from each abstract
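
As a rough illustration of the co-occurrence idea above, the sketch below counts how many abstracts each pair of terms shares. It is plain pair counting, not the Concept Space weighting, and the term lists and the threshold of 2 are invented for the example.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(abstract_terms):
    """Count how many abstracts each unordered term pair appears in together."""
    counts = Counter()
    for terms in abstract_terms:
        # Deduplicate within an abstract, then sort so each pair has one canonical order.
        for pair in combinations(sorted(set(terms)), 2):
            counts[pair] += 1
    return counts

# Hypothetical, hand-made term lists (not GeneScene data).
abstracts = [
    ["p53", "apoptosis", "MDM2"],
    ["p53", "apoptosis"],
    ["E2F1", "p53"],
]
pairs = cooccurrence_counts(abstracts)
# Keep only pairs seen in at least two abstracts (an arbitrary cutoff).
frequent = {pair: n for pair, n in pairs.items() if n >= 2}
print(frequent)
```

A real system would likely weight pairs by term and document frequency rather than use a raw count cutoff.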

14
HelpfulMED Search of Medical Websites
15
HelpfulMED search of Evidence-based Databases
16
Consulting HelpfulMED Cancer Space (Thesaurus)
17
Browsing HelpfulMED Cancer Map
19

Chinese Medical Intelligence (CMI)
Goal
Providing medical and health information services
to both researchers and public.
Content
350,000 high quality medical-related webpages
collected from mainland China, Hong Kong and
Taiwan.
Meta-search 3 large general Chinese search
engines.
Key Features
Built-in Simplified/Traditional Chinese encoding
conversion
Dynamic summarization for both Simplified and
Traditional Chinese
Automatic categorization
Visualization using SOM

20
Chinese folder display
Simplified Chinese summary
Chinese visualization with SOM
Traditional Chinese summary
21
GeneScene Full Parser: Arizona Relation Parser (ARP)

Syntax and semantics are combined together in a
hybrid parsing grammar as opposed to the
pipelined approach
Introducing over 150 new word classes, while
retaining many of the original syntax word
classes (e.g., noun, verb)
The new word classes have semantic and lexical
properties
Semantic and syntactic properties of the new tags
are not explicitly detailed in the dictionary,
but rather determined by the parsing rules that
define them
Rules that apply to the tags reveal the syntactic
and semantic roles of the tags

22
ARP Architecture
Architecture diagram for the parser, consisting
of three main stages: tagging, parsing, and
relation extraction
23
Problem: Gene Pathway

Title: Key roles for E2F1 in signaling
p53-dependent apoptosis and in cell division
within developing tumors.
Abstract: Apoptosis induced by the p53 tumor
suppressor can attenuate cancer growth in
preclinical animal models. Inactivation of the
pRb proteins in mouse brain epithelium by the
T121 oncogene induces aberrant proliferation and
p53-dependent apoptosis. p53 inactivation causes
aggressive tumor growth due to an 85% reduction
in apoptosis. Here, we show that E2F1 signals
p53-dependent apoptosis since E2F1 deficiency
causes an 80% apoptosis reduction. E2F1 acts
upstream of p53 since transcriptional activation
of p53 target genes is also impaired. Yet, E2F1
deficiency does not accelerate tumor growth.
Unlike normal cells, tumor cell proliferation is
impaired without E2F1, counterbalancing the
effect of apoptosis reduction. These studies may
explain the apparent paradox that E2F1 can act as
both an oncogene and a tumor suppressor in
experimental systems

Expert errs and corrects
Final graph
24
Prepositions OF/BY/IN
25
Example Map (one abstract)
26
Arizona Relation Parser Output
Original Sentence | Entity 1 | Negation | Connector | Entity 2
(1) wild-type p53 tumor suppressor protein, which induces apoptosis | wild-type p53 tumor suppressor protein | False | induces | apoptosis
(2) Wt p53 also induced significant apoptosis | Wt p53 | False | also induced | significant apoptosis
(3) oncogene mutant p53 suppresses apoptosis | oncogene mutant p53 | False | suppresses | apoptosis
(4) mutant p53 blocked E1A-induced apoptosis | Mutant p53 | False | blocked | E1A-induced apoptosis
(4) mutant p53 blocked E1A-induced apoptosis | E1A | False | induced | apoptosis
(5) mutant p53 does not induce apoptosis | mutant p53 | True | does not induce | apoptosis
27
Text Mining Statistics (Jan. 2005)
Collection | P53 | AP1 | Yeast
Number of Abstracts | 205,820 | 400,487 | 68,025
Number of Abstracts w/ Relation Extracted | 87,903 | 90,773 | 28,971
Linguistic relations (full parser) | 182,499 | 172,116 | 54,805
Concept Space Relations | 2,724,099 | 3,265,524 | 6,535,737
28
Knowledge Integration Organizing Relations

Relations are more useful when well organized
Multiple name strings of the same biological
entities or processes are aggregated
Important contextual information is captured
Entities are cross-referenced to outside
resources
Well organized relations should help with domain
appropriate analysis tasks

29
An Example Context and Term Variation

4 somewhat contradictory PubMed abstract
snippets
(1) wild-type p53 tumor suppressor protein, which
induces apoptosis
(2) Wt p53 also induced significant apoptosis
(3) oncogene mutant p53 suppresses apoptosis
(4) mutant p53 blocked E1A-induced apoptosis
(1) and (2) say that p53 induces apoptosis
(3) and (4) say that p53 inhibits apoptosis

From PubMed documents 10594026, 8643473,
11809683, and 11375269
30
An Example Context and Term Variation

Analyzing the context more closely
Wild-type (1), wt (2) p53 are non-mutated
Mutant (3) (4) p53 are mutated
P53 protein (1) is a protein
Oncogene p53 (3) is a gene
Identifying context is important in organizing
extracted information. Words near p53 suggest
that while normal p53 induces apoptosis, mutated
p53 inhibits it

From PubMed documents 10594026, 8643473,
11809683, and 11375269
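
The point about context can be sketched in a few lines: once the words near p53 are used to split relations by mutation status, the apparent contradiction goes away. The cue lists and the connector-to-effect mapping below are assumptions made for this illustration, not GeneScene's lexicons.

```python
from collections import defaultdict

# The four snippets from the slide, hand-coded as (entity text, connector) pairs.
relations = [
    ("wild-type p53 tumor suppressor protein", "induces"),
    ("Wt p53", "induced"),
    ("oncogene mutant p53", "suppresses"),
    ("mutant p53", "blocked"),
]

# Hypothetical cue lists and effect mapping, for this illustration only.
WILD_TYPE_CUES = {"wild-type", "wt"}
MUTANT_CUES = {"mutant", "mutated"}
EFFECT = {"induces": "induces", "induced": "induces",
          "suppresses": "inhibits", "blocked": "inhibits"}

def mutation_status(entity):
    """Classify an entity string by the context words it contains."""
    words = set(entity.lower().split())
    if words & MUTANT_CUES:
        return "mutated"
    if words & WILD_TYPE_CUES:
        return "wild-type"
    return "unknown"

effects_by_status = defaultdict(set)
for entity, connector in relations:
    effects_by_status[mutation_status(entity)].add(EFFECT[connector])

print(dict(effects_by_status))
```

Grouped this way, each status maps to a single effect on apoptosis instead of two conflicting ones.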
31
Biological Entity Recognition and Identification

To aggregate relations we need to recognize and
identify biological entities
Recognition finds substance references in text,
identification matches those references to
external resources (Tuason et al. 2004)
Three key object recognition difficulties (Fukuda
et al., Palakal et al.)
Compound words
Ambiguous expressions
New or unknown words

32
Aggregation System Design
33
A Decompositional Approach to Biomedical Concept
Matching

BioAggregate tagger decomposes name strings found
in a relation by left-to-right longest-first
pattern matching using domain appropriate
lexicons of feature-signaling terms
Lexicons built from existing resources and human
generated lists
Substance names in LocusLink, RefSeq, HUGO, and
SGD, etc.
Biological processes in Gene Ontology
Lexicons of other features

34
Features For Decompositional Matching
Feature Lexicon | Explanation
Aggregatable Substance | A gene and its product(s). All references to a particular gene and its product(s) share the same Aggregatable Substance value. E.g., p53, tp53, and trp53.
Mutation | Indicating the status of an aggregatable substance. Only has two values, mutated or not mutated (wild-type).
Substance Type | "Type" of aggregatable substance. Currently there are three recognized types: gene, protein, and mRNA.
Associator | Essentially verbs. This feature attempts to resolve verbs that occur in multiple forms, but have the same stem. E.g., inhibit, inhibits, inhibited, and inhibiting all share the Connector Associator value "inhibit."
Function | A biological process, such as apoptosis or angiogenesis (as in biological_process list of Gene Ontology), or an action performed on an aggregatable substance, such as phosphorylation or inhibition.
Species | The species/organism information associated with an entity or relation.
Cellular Component | The sub-cellular component or location of an entity or relation.
Stopword | Common words judged to meet the standard: ignoring this word will not mischaracterize pathway relations.
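
The left-to-right, longest-first matching described for the BioAggregate tagger might look roughly like the sketch below. The mini-lexicon, the `decompose` function, and the "Unmatched" label are hypothetical stand-ins; the actual tagger's lexicons and output format are not shown in the slides.

```python
# Hypothetical mini-lexicon: lower-cased token tuple -> (feature, value).
LEXICON = {
    ("wild-type",): ("Mutation", "not mutated"),
    ("wt",): ("Mutation", "not mutated"),
    ("mutant",): ("Mutation", "mutated"),
    ("p53",): ("Aggregatable Substance", "p53"),
    ("tp53",): ("Aggregatable Substance", "p53"),
    ("tumor", "suppressor", "protein"): ("Substance Type", "protein"),
    ("protein",): ("Substance Type", "protein"),
    ("oncogene",): ("Substance Type", "gene"),
}
MAX_LEN = max(len(key) for key in LEXICON)

def decompose(name):
    """Scan left to right, always trying the longest lexicon entry first."""
    tokens = name.lower().split()
    features, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in LEXICON:
                features.append(LEXICON[key])
                i += n
                break
        else:
            # No lexicon entry starts here; keep the token and move on.
            features.append(("Unmatched", tokens[i]))
            i += 1
    return features

print(decompose("wild-type p53 tumor suppressor protein"))
print(decompose("oncogene mutant p53"))
```

Trying the longest span first is what lets "tumor suppressor protein" match as one unit instead of stopping at a shorter entry.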
35
P53 Testbed

ARP extracted 182,499 relations from 87,903
PubMed abstracts related to the gene p53
As extracted, the relations display very little
overlap, with 142,974 distinct entity names and
127,397 distinct relational pairs

36
5 Levels of Aggregation Granularity
(levels are listed from more detailed at the top to more abstract at the bottom)
Aggregation Level | Possible Applications
Baseline (string match entities and connector) | basis of comparison
Feature Match (feature synonym resolution) | detailed pathway analysis
Typed Substance (distinguish genes and proteins) | pathway analysis; granularity is comparable to some human-curated databases
Aggregatable Substance | explore the function of a gene and its gene products
Simple Pathway (substance/function; 4 categories for connectors) | high level overviews and input to some machine learning algorithms
37
Network Consolidation Results

The numbers of distinct entities and relations are
sharply reduced over various levels of
aggregation
When fewer relations are disjoint, the knowledge
network encompasses more information
Network density increased 20-fold
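
A toy example of why aggregation raises density: merging name variants shrinks the node set faster than the edge set. The slides do not say which density formula was used, so the directed definition below is an assumption, as are the `CANONICAL` map and the sample edges.

```python
def density(edges):
    """Directed density: distinct edges / (n * (n - 1)) over the nodes that appear."""
    edges = set(edges)
    nodes = {node for edge in edges for node in edge}
    n = len(nodes)
    return len(edges) / (n * (n - 1)) if n > 1 else 0.0

# Hypothetical name-to-concept map; real aggregation relies on the feature lexicons.
CANONICAL = {"wt p53": "p53", "tp53": "p53", "p53 protein": "p53",
             "programmed cell death": "apoptosis"}

def aggregate(edges):
    """Rewrite both ends of every relation to its canonical concept name."""
    return [(CANONICAL.get(a, a), CANONICAL.get(b, b)) for a, b in edges]

raw = [("wt p53", "apoptosis"), ("tp53", "programmed cell death"),
       ("p53 protein", "mdm2"), ("mdm2", "tp53")]
print(density(raw), density(aggregate(raw)))
```

In this toy case six name strings collapse to three concepts, and density rises from about 0.13 to 0.5.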

38
Genomic Data Mining in AI Lab

Joint learning of genetic networks from
microarray data and existing knowledge
Gene selection for cancer diagnosis

39
Gene Selection for Cancer Diagnosis

Gene array data have been widely used for cancer
classification/prediction (Golub et al., 1999;
Ben-Dor et al., 2001)
The major problems of gene array data (Model et
al., 2001; Lu & Han, 2003)
High dimensionality (hundreds or thousands of
genes)
Small number of available samples
Most genes are irrelevant to cancer distinction
Genes interact with each other
It is important to identify the marker genes for
cancer diagnosis

40
Experiment: Ovarian Cancer Diagnosis

Ovarian cancer
25,580 projected cases in 2004
16,090 deaths estimated in 2004
53% overall 5-year survival
31% 5-year survival in those with distant
metastases at diagnosis
75% of cases diagnosed in late stages (III and IV)
Predict survival of ovarian cancer: alive or
dead?
Clinical measurements
Two attributes: stage and grade
Gene methylation level
Differentially methylated between normal and
cancer tissues

41
Ovarian Cancer Methylation Array

University of Iowa Gynecologic Oncology tumor
bank (provided by Dr. Bernie Futscher at AZCC)
114 DNA samples
11 Normal ovary
19 Stage I
18 Stage II
24 Stage III
17 Stage IV
25 Low malignant potential (LMP)
6560 genes
Top 1000 genes with highest standard deviation
across all samples are regarded as potentially
relevant

42
Gene Selection Techniques

Individual gene ranking
F-statistics (Mendenhall & Sincich, 1995)
Gene subset selection
Optimal search algorithms
Genetic algorithm (GA) (Holland, 1975)
Tabu search (TS) (Glover, 1986)
Evaluation criteria
Maximum relevance minimum redundancy (MRMR)
(Ding & Peng, 2003)
Support vector machine (SVM) (Vapnik, 1995)
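
Of the techniques listed, individual gene ranking is the simplest to sketch: score each gene with a one-way ANOVA F-statistic across the outcome classes and keep the top scorers. The code below is a minimal illustration under that reading. The gene names and methylation values are made up, and it does not cover the GA, tabu search, MRMR, or SVM steps.

```python
from statistics import mean

def f_statistic(values, labels):
    """One-way ANOVA F: between-class variance over within-class variance."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    grand = mean(values)
    k, n = len(groups), len(values)
    between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values()) / (k - 1)
    within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g) / (n - k)
    return between / within if within > 0 else float("inf")

def rank_genes(matrix, labels, top=2):
    """matrix maps gene name -> per-sample values; returns the top-scoring gene names."""
    scores = {gene: f_statistic(vals, labels) for gene, vals in matrix.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top]

# Made-up methylation levels for four samples (not the study data).
labels = ["alive", "alive", "dead", "dead"]
matrix = {
    "geneA": [0.1, 0.2, 0.8, 0.9],   # separates the classes well
    "geneB": [0.5, 0.4, 0.5, 0.6],   # weak separation
    "geneC": [0.3, 0.7, 0.3, 0.7],   # no separation
}
print(rank_genes(matrix, labels))
```

Ranking genes one at a time ignores interactions between genes, which is the motivation for the subset-selection methods on the same slide.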

43
Marker Genes for Survival Prediction

Q1: Which genes can be used to predict the
survival of ovarian cancer based on their
methylation level?

Individual 95% CIs for mean, based on pooled StDev
(interval plot not reproduced; axis ticks at 70.0, 75.0, 80.0, 85.0)
Level | F (features) | N | Mean | StDev
Full set | 1000 | 30 | 67.690 | 2.100
F-stat | 100 | 30 | 77.398 | 1.414
GA/MRMR | 57 | 30 | 76.199 | 2.018
GA/SVM | 39 | 30 | 80.263 | 1.592
TS/MRMR | 24 | 30 | 80.702 | 1.613
TS/SVM | 96 | 30 | 82.807 | 2.205
Pooled StDev = 1.847

Conclusion
TS/SVM selected 96 out of 1000 genes, which
achieved the highest prediction accuracy (82.807)

44
Methylation vs. Clinical Diagnosis

Q2: Will methylation-based methods perform better
than clinical diagnosis in predicting survival of
ovarian cancer?

Individual 95% CIs for mean, based on pooled StDev
(interval plot not reproduced; axis ticks at 63.0, 70.0, 77.0)
Level | F (features) | N | Mean | StDev
Clinical | 2 | 30 | 75.281 | 1.770
Full set | 1000 | 30 | 57.566 | 2.747
F-stat | 70 | 30 | 75.581 | 1.413
GA/MRMR | 40 | 30 | 75.506 | 1.756
GA/SVM | 46 | 30 | 79.813 | 0.205
TS/MRMR | 24 | 30 | 77.715 | 1.868
TS/SVM | 30 | 30 | 81.948 | 1.769
Pooled StDev = 1.790

Conclusion (prediction accuracy):
Full set < Clinical < Marker genes (selected by
TS/SVM)

45
GeneScene Visualizer

To provide graphical presentation of large-scale
regulatory networks comprised of pathway
relations extracted through text mining
technologies
Three testing collections
p53 (87,903 abstracts)
AP1 (90,773 abstracts)
Yeast (28,971 abstracts)
Currently loading and parsing entire PubMed for
Cancer pathway
7 million abstracts

46
GeneScene Visualizer Functionality

Searching by specific elements, e.g., diseases
or genes
Network-based exploration and navigation
Accessing the underlying PubMed abstract
Saving and loading search and visualization
results
Various manipulations on the table and network
view of the retrieved relations: filter, sort,
zoom, highlight, isolate, expand, print, etc.

47
GeneScene Visualizer V1.5
48
GeneScene Visualizer V1.5
49
Effect of Aggregation

Same relations, before and after aggregation

Baseline level
Simple Pathway level
50
Effect of Aggregation: Mutation Feature
When mutant and non-mutant are combined, an
apparent conflict arises: TP53 is both activating
and inhibiting MDM2.
When the Mutation feature is selected, non-mutant
TP53 is shown to activate and mutant TP53 to
inhibit MDM2.
51
User Feedback: General Comments

Interviewees were generally impressed with the
features and usefulness of the system
"In my head I've been trying to do what this is
doing for you."
"It took me a few weeks just to find that Sin3
interacts with p53, where when you type this in
to Genescene it's right there."
"Just playing around with the system I am
seeing things that I didn't know before."
"If this is the entire Medline, I would probably
use it every time I search."
"I think a lot of people would get a lot of use
out of this, as long as it doesn't scare them off
in the beginning."

52
Lessons Learned
Biomedical information is precise but
terminologies are fluid
Biomedical professionals need search and analysis
help
Biomedical linguistic parsing and ontologies are
promising for biomedical text mining
The need for integrated biomedical data (gene
microarray) and text mining (literature)

53
Ongoing Research

Combining bottom-up data mining
(MicroArray/Methylation) with top-down text
mining results
Creating CancerPath testbed for cancer genomic
network visualization
Biological network topological analysis (growth,
preferential attachment)
Other biomedical applications: plant science
pathway (Arabidopsis, Galbraith Lab); infectious
disease surveillance

54
BioPortal: Infectious Disease Information
Sharing, Surveillance, Analysis, and Visualization
55
Research Partners and Supports

University of Arizona
University of California, Davis
Kansas State University
University of Utah
Arizona Department of Public Health
New York State Department of Health/HRI
California Department of Health Services/PHFE
U.S. Geological Survey
The SIMI Group
National Taiwan University

NSF
CIA/ITIC
DHS
DOD/AFMIC
CDC
AZDPS

56
UA Team Members

Dr. Hsinchun Chen
Dr. Daniel Zeng
Lu Tseng
Cathy Larson
Kira Joslin
Wei Chang
James Ma
Hsinmin Lu
Ping Yan
Aaron Sun

Keith Alcock
Sapna Brahmanandam
Milind Chabbi
Yuan Wang

57
Outline

Project Background
BioPortal Achievements
System Architecture
System Functionalities
BioPortal Collaboration Framework
New Developments
International Foot-and-mouth Disease Monitoring
Syndromic Surveillance
Disease Contact Tracing

58
BioPortal Project Goals

Demonstrate and assess the technical feasibility
and scalability of an infectious disease
information sharing (across species and
jurisdictions), alerting, and analysis framework.
Develop and assess advanced data mining and
visualization techniques for infectious disease
data analysis and predictive modeling.
Identify important technical and policy-related
challenges in developing a national infectious
disease information infrastructure.

59
Information Sharing Infrastructure Design
[Diagram. Labels: Portal Data Store (MS SQL 2000); Data Ingest Control Module (Cleansing / Normalization); Info-Sharing Infrastructure; Adaptors; SSL/RSA; XML/HL7 Network; PHINMS Network; sources NYSDOH, CADHS, and New]
60
Data Access Infrastructure Design
61
Datasets Integrated: WNV, BOT
Index | Dataset | Test Data Available | Data Duration (MM/YY) | Data Size | Spatial Granularity | Temporal Granularity
1 | NY WNV Human | Yes | Test Data | 574 | Zip | Date
2 | NY Dead Bird | Yes | Test Data | 942 | Lat/Long shifted |
3 | NY Mosquito | Yes | Test Data | 815 | County | Date
4 | NY WNV Captive Animal | Yes | Test Data | 39 | Zip | Date
5 | NY Botulism Human | Yes | Test Data | 10 | Zip | Date
6 | CA WNV Human | Yes | 09/03-10/03 | 186 | County | Date
7 | CA Dead Bird | Yes | 01/03-10/03 | 3032 | City/zip | Minutes
8 | CA Chicken Sera | Yes | 04/03-10/03 | 18887 | Site | Date
9 | CA Mosquito Pool | Yes | 01/98-10/03 | 3518 | Site | Date
10 | CA Botulism | Yes | 01/01-12/02 | 53 | Zip | Date
11 | USGS EPIZOO - Preliminary | Yes | 07/99-09/03 | 46 events | County | Date
12 | USGS EPIZOO WNV | Yes | 08/1999-07/2004 | 113 events | County | Date
13 | USGS EPIZOO - Botulism | Yes | 12/1989-12/2004 | 702 events | County | Date
14 | UC Davis FMD | Yes | 1996-2003 | 3288 | Site/Province | Date/Month
15 | International FMD | Yes | 01/1982-03/2005 | 6789 | Province | Non-temporal
16 | BioWatch | Yes | 1/10-1/17 2004 | 480 | Exact Site | Date
17 | CA Mosquito Treatment | Yes | 1/14-11/30 2004 | 6194 | Exact Site | Date
18 | National Infant Botulism | Yes | 1/16-11/25 2004 | 15 | Zip | Date
62
Communications/Messaging

Scalable, flexible, light-weight, and extendible.
Easy to include
New diseases
New jurisdictions
New techniques!
Messaging infrastructure installed and tested
NYSDOH-UA: PHIN MS
CADHS-UA: Regional message broker
NWHC-UA: PHIN MS
XML generation/conversion
NY_DeadBird, NY_Alerts, NY_BotHuman, NY_WNVHuman,
NY_CaptiveAnimal, NY_Mosquito
CA_BotHuman, CA_WNVHuman, CA_DeadBird,
CA_Chicken, CA_Mosquito
USGS_Epizoo

63
Spatio-Temporal Data Mining: Hotspot Analysis

A hotspot is a condition indicating some form of
clustering in a spatial and temporal distribution
(Rogerson & Sun 2001; Theophilides et al. 2003;
Patil & Tailie 2004).
For WNV, localized clusters of dead birds
typically identify high-risk disease areas
(Gotham et al. 2001).
Automatic detection of dead bird clusters using
hotspot analysis can help predict disease
outbreaks and aid in effective allocation of
prevention/control resources.
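
To make the hotspot idea concrete, the sketch below bins sightings into grid cells and flags cells whose recent count is well above baseline. This is a deliberately simple stand-in, not RSVC, SaTScan, or CrimeStat. The cell size, ratio, minimum count, and coordinates are all invented for illustration.

```python
from collections import Counter

def grid_cell(lat, lon, size=0.5):
    """Snap a coordinate to a square grid cell of `size` degrees."""
    return (int(lat // size), int(lon // size))

def hotspots(baseline, recent, ratio=2.0, min_cases=3):
    """Flag cells whose recent count is at least `ratio` times the baseline count."""
    base = Counter(grid_cell(*p) for p in baseline)
    new = Counter(grid_cell(*p) for p in recent)
    return sorted(cell for cell, n in new.items()
                  if n >= min_cases and n >= ratio * max(base[cell], 1))

# Invented dead-bird sighting coordinates (lat, lon); not the NY dataset.
baseline = [(42.1, -75.2), (40.7, -73.9)]
recent = [(40.71, -73.95), (40.72, -73.90), (40.74, -73.99),
          (40.70, -73.92), (42.1, -75.2)]
print(hotspots(baseline, recent))
```

Scan-statistic tools test many window shapes and sizes and attach significance levels, which this fixed-grid sketch does not attempt.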

64
Case Study (NY WNV)

On May 26, 2002, the first dead bird with WNV was
found in NY
Based on NY's test dataset

[Timeline figure: baseline and new-cases periods, marked at March 5, May 26, and July 2; counts shown: 140 records and 224 records]
66
Dead Bird Hotspots Identified
67
WNV/BOT BioPortal
Acknowledgment: NSF, ITIC, NYSDH, CDHS,
USGS (Drs. Kvach and Ascher)
71
BioPortal HotSpot Analysis: RSVC, SaTScan, and
CrimeStat integrated (first visual, real-time
hotspot analysis system for disease surveillance)

West Nile virus in California

72
Hotspot Analysis-Enabled STV
73
International FMD BioPortal
Acknowledgment: DHS, DOD, UC Davis (Drs. Thurmond
and Lynch)
74
International FMD BioPortal Goals

Real-time, web-based situational awareness of FMD
outbreaks worldwide through the establishment of
an international information sharing and analysis
system
FMDv characterization at the genomic level
integrated with associated epidemiological
information and modeling tools to forecast
national, regional, and/or international spread
and the prospect of import into the U.S. and the
rest of North America
Web-based crisis management of resources:
facilities, personnel, diagnostics, and therapeutics

75
Research Plans

Global FMD epidemiological data
(Near) real-time data collection
Web-based information sharing and analysis
International FMD news
Indexed collection of global FMD news
Search and visualization of the FMD news via the
web
FMD genetic/sequence data
Predictive model using phylogenetic, spatial, and
temporal information to stop FMD at the border
Visualization of FMD events in time, space, and
genetic space

76
Preliminary Global FMD Dataset

Provider: UC Davis FMD Lab
Information sources: reference labs and OIE
Coverage: 28 countries globally
Time span: May 1905 to March 2005
Dataset size: 30,000 records, of which 6789
records are complete
Host species: Cattle, Caprine, Ovine, Bovine,
Swine, NK, Elephant, Buffalo, Sheep, Camelidae,
Goat

77
Global FMD Coverage in BioPortal
78
FMD Migration Visualization using BioPortal
(cases in South Asia)
FMD Cases travel back and forth between countries
79
International FMD News

Provider: UC Davis FMD Lab
Information sources: Google, Yahoo, and open
Internet sources
Time span: Oct 4, 2004 to present (real-time
messaging under development)
Data size: 460 events (6/21/05)
Coverage: 51 countries
(Africa: 11, Asia: 16,
Europe: 12, Americas: 12)

80
Searching FMD News

http://fmd.ucdavis.edu/
Searchable by
Date range
Country
Keyword

81
Visualizing FMD News on BioPortal
82
FMD Genetic Information Analysis

Genome clustering analysis
Phylogenetic clustering
Spatial clustering
Temporal clustering
Hotspot detection among gene sequences
Create a tree structure based on semantic
distance between gene sequences.
Automatically detect the dense portion of the
tree.
Identify the connection between the semantic
cluster and the geographic pattern of gene
sequences.
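
One way to picture "distance between gene sequences" and "dense portions" is the sketch below. It uses the p-distance (share of differing sites) and groups isolates that fall within a threshold of each other. The slides build a tree (MEGA3 is used later in the deck), so this threshold grouping is a swapped-in simplification. The sequences and the 0.2 cutoff are invented.

```python
def p_distance(a, b):
    """Proportion of differing sites between two equal-length aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def single_linkage_groups(seqs, threshold):
    """Merge isolates whose p-distance is <= threshold (transitively)."""
    names = list(seqs)
    group = {name: name for name in names}   # each isolate starts as its own group

    def find(name):
        while group[name] != name:
            name = group[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if p_distance(seqs[a], seqs[b]) <= threshold:
                group[find(b)] = find(a)
    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(clusters.values())

# Toy aligned fragments, invented for illustration (not FMD data).
seqs = {"iso1": "ACGTACGTAC", "iso2": "ACGTACGTAT",
        "iso3": "TTGTACCTGG", "iso4": "TTGTACCTGA"}
print(single_linkage_groups(seqs, threshold=0.2))
```

Each resulting group could then be compared against isolate locations to look for the geographic pattern the slide mentions.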

83
FMD Genetic Visualization

Goal: Extend STV to incorporate a 3rd dimension,
phylogenetic distance
Include a phylogenetic tree.
Identify phylogenetic groups and color-code the
isolate points on the map.
Leverage available NCBI tools such as BLAST.
Proof of concept: SAT 2 and 3 analysis
Data: 54 partial DNA sequence records in South
Africa received from UC Davis FMD Lab
(Bastos, A.D. et al. 2000, 2003)
Date range: 1978-1998
Countries covered: South Africa, Zimbabwe,
Zambia, Namibia, Botswana

84
Sample FMD Sequence Records
Color-coded View (MEGA3)
Textual View of Gene Sequence
85
Phylogenetic Tree of Sample FMD Data
Identify 6 groups within 2 major families (MEGA3,
based on sequence similarity)
[Tree figure with branches labeled Group1 through Group6]
86
Genetic, Spatial, and Temporal Visualization of
FMD Data
Phylogenetic tree color coded
Isolates locations color coded
Isolates appearances in time
87
FMD Time Sequence Analysis
First family: cases appeared throughout the period
Second family: cases existed before 1993 and
reappeared after 1997
88
FMD Periodic Pattern Analysis
2nd family concentrated in Feb. while 1st family
spread evenly
89
Locations of Family 1 records
Selected only groups 1, 2, and 3 and found a
spatial cluster
90
Locations of Family 2 records
Sparse isolate locations
Selected only groups 4, 5, and 6
91

Syndromic Surveillance

92
Chief Complaints As a Data Source

Chief complaints (CCs) are short free-text
phrases entered by triage practitioners
describing reasons for patients' ER visits
Examples: "lt foot pain" = left foot pain; "cp" =
chest pain; "sob" = shortness of breath; "so"
(should be "sob"); "poss uti" = possibly urinary
tract infection
Advantages of using CCs for surveillance purposes
Timeliness: diagnosis results are on average 6
hours slower than CCs
Availability and low cost: most hospitals have
free-text CCs available in electronic form
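
The CC examples above suggest a normalization step before any classification. A minimal sketch, assuming a small hand-built abbreviation table and typo list drawn from those examples, could look like this. The tables and the `normalize_cc` name are hypothetical, not part of BioPortal.

```python
import re

# Hypothetical abbreviation table built from the slide's examples.
ABBREVIATIONS = {"lt": "left", "cp": "chest pain", "sob": "shortness of breath",
                 "poss": "possibly", "uti": "urinary tract infection"}
# Hypothetical typo corrections applied before expansion. Mapping every "so"
# to "sob" is crude and would need context checks in practice.
CORRECTIONS = {"so": "sob"}

def normalize_cc(text):
    """Lower-case, tokenize, fix known typos, then expand abbreviations."""
    tokens = re.findall(r"[a-z0-9/]+", text.lower())
    tokens = [CORRECTIONS.get(t, t) for t in tokens]
    return " ".join(ABBREVIATIONS.get(t, t) for t in tokens)

for cc in ["LT foot pain", "cp", "so", "poss UTI"]:
    print(cc, "->", normalize_cc(cc))
```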

93
Existing CC Classification Methods
Classification Method | Systems | Authors
Keyword Match, Synonym List, Mapping Rules | DOHMH (NY City), EARS | Mikosz et al. (2004)
Weighted Keyword Match (Vector Cosine Method), Mapping Rules | ESSENCE | Sniegoski (2004)
Naïve Bayesian | RODS | Olszewski (2003), Ivanov et al. (2002)
Bayesian Network | N/A | Chapman et al. (2004)
94
Overall System Design
Chief Complaints
95
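A Stage 2 Example: CC Concepts -> Symptom Group Concepts
[Diagram: CC concepts (coagulopathy, purpura, ecchymosis, blood in urine, ureteral stone, coma, pass out) are linked through UMLS to symptom group concepts at path lengths 4, 5, and 6. Weights shown: bleeding 1/4 + 1/5 + 1/6 = 0.62; other 1/5 = 0.2; coma 1/5 = 0.2; dead 1/5 = 0.2; altered_mental_status 1/5 = 0.2]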
96
Summary of Stage 2 Performance
3001 concepts
1835 CC records from Stage 1 contain 417 unique concepts
97

BioPortal Taiwan Syndromic Surveillance

98
Multi-lingual Chief Complaints: Chinese Example

Data Characteristics
Mixed expressions in both Chinese and English
????FEVER???????????(?)
??,?????A/W,????,????
18% of CC records from NTU Med. Center contain
Chinese expressions.
Some hospitals have 100% of CC records in Chinese
(For example, ??????)
Misspellings and typographic errors are not
serious

99
Chinese CC Preprocessing System Design
[Pipeline diagram] Input: Raw Chinese CCs (Chinese Chief Complaints)
Stage 0.1: Separate Chinese and English Expressions (outputs: Chinese Expressions, English Expressions)
Stage 0.2: Chinese Phrase Segmentation (output: Segmented Chinese Phrases)
Stage 0.3: Chinese Phrase Translation (output: Translated Chinese Phrases)
Resources shown: Chinese to English Dictionary, Chinese Medical Phrases, Common Chinese Phrases, Mutual Info.
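
The first preprocessing step, separating Chinese from English expressions, can be approximated with Unicode ranges. The sketch below is one possible approach and not the BioPortal implementation. The regular expressions cover only the basic CJK block and simple Latin tokens, and the sample string is made up.

```python
import re

# CJK Unified Ideographs (basic block); an assumption that covers common characters only.
CHINESE_RUN = re.compile(r"[\u4e00-\u9fff]+")
# Latin tokens, allowing the slashes, periods, and apostrophes seen in clinical shorthand.
ENGLISH_RUN = re.compile(r"[A-Za-z][A-Za-z/.']*")

def split_expressions(cc):
    """Return (chinese_runs, english_runs) found in one chief-complaint string."""
    return CHINESE_RUN.findall(cc), ENGLISH_RUN.findall(cc)

# Invented mixed-language record; the two escaped characters mean "fever".
chinese, english = split_expressions("r/o septic shock \u767c\u71d2")
print(chinese, english)
```

The Chinese runs would then go on to segmentation and dictionary translation, while the English runs skip straight to the English classifier.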
100
Result: Self Validation

Use the 280 translations against 1978 chief
complaints from hospital A

1610 (82%) records are in English
368 (18%) records contain Chinese

36% contain trivial info.
E.g., r/o septic shock ????
64% contain non-trivial info.
E.g., poor intake and ????

67 has complete translation
2 has partial translation
20 does not have translation

101
General Grouping

Taiwan surveillance data visualization 2.2M
scrubbed chief complaints records

102
Group by Hospital
103
Group by Syndrome Classification
104
Disease Outbreak Detection Using Chief Complaints

Markov Switching Model

105
Data Source

Emergency Department Free-text Chief Complaints
(CCs)
Medical practitioners use both Chinese and
English to record CCs
368,151 records; 23.77% contain Chinese
characters
Time period: 2000-6-30 to 2003-4-27
Use BioPortal Multilingual CC Classifier to
classify CC records into syndromes

106
Syndrome Prevalence

Botulism-Like (1.4%)
Constitutional (25.4%)
Gastrointestinal (26.4%)
Hemorrhagic (6.4%)
Neurological (14.1%)

Rash (2.4%)
Respiratory (17.8%)
Upper Respiratory (7.3%)
Lower Respiratory (12.7%)
Fever (18%)
Other (34.9%)

Choose Resp. and GI syndrome for further analysis
Two syndromes with high prevalence
Can be extended to other syndromes

107
GI Syndrome Time Series
Gastrointestinal Syndrome Time Series
Autocorrelation Function
108
GI Syndrome Time Series (contd)

Strong day-of-week effect
Seasonal effect is less strong
Sporadic jumps
Seems to have 1 to 2 peaks per year
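
A day-of-week effect of the kind noted above shows up as a spike in the autocorrelation function at lag 7. The sketch below computes the sample autocorrelation on synthetic daily counts with a weekend bump. The numbers are invented and are not the Taiwan series.

```python
from statistics import mean

def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at the given positive lag."""
    m = mean(series)
    denom = sum((x - m) ** 2 for x in series)
    numer = sum((series[t] - m) * (series[t + lag] - m)
                for t in range(len(series) - lag))
    return numer / denom

# Synthetic daily counts with a weekend bump (invented data), repeated for 8 weeks.
week = [20, 21, 19, 20, 22, 35, 38]
counts = week * 8

for lag in (1, 3, 7):
    print(lag, round(autocorrelation(counts, lag), 3))
```

On real counts the lag-7 peak would be weaker and mixed with seasonal structure, so this only shows what the pattern looks like in the clean case.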

109
Estimation Results (GI Syndrome)
GI Time Series
Jumps
Outbreak State
110
Estimation Results (GI Syndrome) (contd)

Jumps appear during Chinese New Year
The Markov switching model identified 4 high
GI-count periods
2000-12-23 to 2001-1-28 (Jan. 23 New Year Eve)
2002-1-29 to 2002-3-15 (Feb. 11 New Year Eve)
2002-5-9 to 2002-10-14
2002-12-13 to 2003-2-18 (Jan. 30 New Year Eve)

111

Taiwan SARS Contact Tracing

112
Social Network Analysis in Epidemiology

Conceptualizing a population as a set of
individuals linked together to form a large
social network provides a fruitful perspective
for better understanding the spread of some
infectious diseases. (Klovdahl, 1985)
Social Network Analysis in epidemiology has two
major activities
Network Construction
Link the whole set of persons in a particular
population with relationships or types of
contacts
Network Analysis
Measure and make inferences about structural
properties of the social networks through which
infectious agents spread

113
A Taxonomy of Network Construction
Network Construction
Disease | Linking Relationship | Examples
Sexually Transmitted Disease (STD) | Sexual Contact | AIDS (Klovdahl, 1985); Gonorrhea (Ghani et al., 1997); Syphilis (Rothenberg et al., 1998)
Sexually Transmitted Disease (STD) | Drug Use | AIDS (Klovdahl et al., 1994); AIDS (Rothenberg et al., 1998)
Sexually Transmitted Disease (STD) | Needle Sharing | AIDS (Klovdahl et al., 1994); AIDS (Rothenberg et al., 1998)
Sexually Transmitted Disease (STD) | Social Contact | AIDS (Klovdahl et al., 1994); AIDS (Rothenberg et al., 1998)
Tuberculosis (TB) | Personal Contact | (Klovdahl et al., 2001); (McElroy et al., 2003)
Tuberculosis (TB) | Geographical Contact | (Klovdahl et al., 2001); (McElroy et al., 2003)
Severe Acute Respiratory Syndrome (SARS) | The Source of Infection | (CDC, 2003); (Shen et al., 2004)
Severe Acute Respiratory Syndrome (SARS) | Personal Contact | (Meyers et al., 2005)
CDC: Centers for Disease Control and Prevention
114
A Taxonomy of Network Analysis
Network Analysis
Levels of Analysis | Description | Examples
Network Visualization | Show the spread of an infectious agent transmitted from one person to another | AIDS (Klovdahl, 1985); Syphilis (Rothenberg et al., 1998); SARS (CDC, 2003); SARS (Shen et al., 2004)
Network Measurement | Study the structure of a population through which an infectious agent is transmitted during close personal contact; develop disease containment strategies or programs | Syphilis (Rothenberg et al., 1998); AIDS (Klovdahl et al., 1994); AIDS (Rothenberg et al., 1998)
Network Simulation | Evaluate the spread of an infectious agent within a population with different network parameters | Gonorrhea (Ghani et al., 1997); SARS (Meyers et al., 2005)
CDC: Centers for Disease Control and Prevention
115
Network Visualization

Use social networks to visualize the
transmission of an infectious agent from one
person to another within a particular population
Focus on the identification of:
Subgroups within the population
Characteristics of each subgroup
Bridges between subgroups, which transmit a
disease from one subgroup to another

116
Epidemic Phases and Social Networks

Potterat et al. (2001) proposed that the structure
of sexual networks is a more reliable indicator of
STD epidemic phase.
Two sexual networks in Colorado Springs, U.S.
were compared:
Bacterial STD from 1990 to 1991 (an STD outbreak)
Chlamydia from 1996 to 1999 (stable or declining
phase)
The sexual network in the stable or declining
phase was relatively:
Fragmented
Dendritic
Lacking in cyclic structures
Cunningham et al. (2004) further examined the
relationship between network characteristics and
epidemic phases.
After the epidemic:
Macro-level structure
Average distance declined.
Density increased.
Micro-level structure
Numbers of n-cliques and k-plexes declined.
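
The structural terms above (fragmented, dendritic, lacking cyclic structures) can be checked numerically. The following is a minimal sketch, not the method used by Potterat et al. or Cunningham et al.: it counts connected components as a rough fragmentation measure and computes the cyclomatic number E - V + C, which is zero when a network has no cycles at all. The toy networks and function names are hypothetical, and each edge list is assumed to hold unique undirected links.

```python
from collections import defaultdict

def count_components(nodes, edges):
    """Count connected components with an iterative depth-first search."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen = set()
    components = 0
    for start in nodes:
        if start in seen:
            continue
        components += 1
        stack = [start]
        seen.add(start)
        while stack:
            node = stack.pop()
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
    return components

def independent_cycles(nodes, edges):
    """Cyclomatic number E - V + C; zero means the network is a forest."""
    return len(edges) - len(nodes) + count_components(nodes, edges)

# Hypothetical toy networks, not data from either study.
dendritic_nodes = ["a", "b", "c", "d", "e", "f"]
dendritic_edges = [("a", "b"), ("b", "c"), ("d", "e")]  # "f" is isolated
cyclic_nodes = ["a", "b", "c", "d"]
cyclic_edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]

print(count_components(dendritic_nodes, dendritic_edges),
      independent_cycles(dendritic_nodes, dendritic_edges))
print(count_components(cyclic_nodes, cyclic_edges),
      independent_cycles(cyclic_nodes, cyclic_edges))
```

More components with no independent cycles is the fragmented, dendritic pattern; a single component with at least one cycle is the denser pattern.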

117
Research Test Bed

We use Taiwan SARS data as our research test bed.
SARS (Severe Acute Respiratory Syndrome) is a
novel infectious disease that emerged in 2002.
The first human case was identified in Guangdong
Province, China on November 16, 2002. (Donnelly
et al., 2004)
A 65-year-old doctor from Guangdong Province
stayed at a hotel in Hong Kong in February 2003
and infected at least 17 other guests and
visitors at the hotel, some of whom later traveled
to other countries and initiated local
transmission of SARS. (Peiris et al., 2006)
26 countries, including Vietnam, Singapore,
Canada, and Taiwan, reported SARS cases.
Financial impact: 50B

118
SARS in Taiwan

The first SARS case in Taiwan was a Taiwanese
businessman who traveled to Guangdong Province
via Hong Kong in early February 2003.
Had onset of symptoms on February 26, 2003
Infected two family members and one healthcare
worker
Eighty percent of probable SARS cases were
infected in hospital settings.
The first outbreak began at a municipal hospital
on April 23, 2003.
A total of seven hospital outbreaks were reported.
Hospital shopping and transfer were suspected to
have triggered these sequential hospital outbreaks.

119
Taiwan SARS Data

Taiwan SARS data was collected by the Graduate
Institute of Epidemiology at National Taiwan
University during the SARS period.
In this dataset, there are 961 patients,
including 638 suspected SARS patients and 323
confirmed SARS patients.
The contact-tracing data of patients in this
dataset has two main categories, personal and
geographical contacts, and nine types of
contacts.
Personal contacts: family member, roommate,
colleague/classmate, and close contact
Geographical contacts: foreign-country travel,
hospital visit, high risk area visit, hospital
admission history, and workplace

120
Taiwan SARS Data (Cont.)

Hospital admission history is the contact type
with the largest number of records (43%).
Personal contacts consist primarily of
family member records.

Category | Type of Contacts | Records | Suspected Patients | Confirmed Patients
Personal | Family Member | 177 | 48 | 63
Personal | Roommate | 18 | 11 | 15
Personal | Colleague/Classmate | 40 | 26 | 23
Personal | Close Contact | 11 | 10 | 12
Geographical | Foreign-Country Travel | 162 | 100 | 27
Geographical | Hospital Visit | 215 | 110 | 79
Geographical | High Risk Area Visit | 38 | 30 | 7
Geographical | Hospital Admission History | 622 | 401 | 153
Geographical | Workplace | 142 | 22 | 120
Total | | 1425 | 638 | 323
121
Research Design
122
Phase Analysis (Cont.)

Network Partition
We partition each contact network on a weekly
basis with linkage accumulation.
From 2/24 to 5/4, there are 10 weeks in total.
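
Weekly partitioning with linkage accumulation can be sketched as follows. This is an illustrative assumption about the procedure, not the authors' code: each weekly snapshot keeps every link dated up to the end of that week, so the network can only grow. The function name and the contact records are hypothetical.

```python
from datetime import date, timedelta

def cumulative_weekly_networks(dated_edges, start, weeks):
    """Return one edge list per week; week k holds every link dated before the end of week k."""
    snapshots = []
    for k in range(1, weeks + 1):
        cutoff = start + timedelta(days=7 * k)  # first day after week k
        snapshots.append([(u, v) for u, v, d in dated_edges if d < cutoff])
    return snapshots

# Hypothetical contact records: (patient, patient or place, date of contact).
contacts = [
    ("p1", "p2", date(2003, 2, 26)),
    ("p2", "p3", date(2003, 3, 18)),
    ("p3", "hospital_A", date(2003, 4, 23)),
]

# Ten weeks starting 2/24, so the last snapshot ends on 5/4.
weekly = cumulative_weekly_networks(contacts, start=date(2003, 2, 24), weeks=10)
print([len(edges) for edges in weekly])
```

With this start date, Week 4 covers 3/17 to 3/23, which agrees with the week labels used in the phase analysis slides.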

123
Phase Analysis (Cont.)

Network Measurement
We investigate two factors that contribute to the
transmission of disease in the macro-structure:
Density: the degree of intensity to which people
are linked together
Density
Average degree of nodes
Transferability: the degree to which people can
infect others
Betweenness
Number of components

[Figure: example networks contrasting higher and lower density, and lower and higher transferability]
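
Three of the four measures listed on this slide can be computed directly from an edge list. The sketch below is a minimal illustration under stated assumptions: the graph is undirected with no duplicate edges or self-loops, density is taken as 2m / (n(n-1)), and the toy network is hypothetical. Betweenness is omitted here for brevity.

```python
def network_measures(nodes, edges):
    """Density, average degree, and number of components of an undirected simple graph."""
    n = len(nodes)
    m = len(edges)
    density = 2 * m / (n * (n - 1)) if n > 1 else 0.0
    average_degree = 2 * m / n if n else 0.0

    # Union-find to count connected components.
    parent = {node: node for node in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        parent[find(u)] = find(v)
    components = len({find(node) for node in nodes})
    return density, average_degree, components

# Hypothetical toy network: a chain of three patients and a separate pair.
nodes = ["p1", "p2", "p3", "p4", "p5"]
edges = [("p1", "p2"), ("p2", "p3"), ("p4", "p5")]
density, average_degree, components = network_measures(nodes, edges)
print(density, average_degree, components)
```

Fewer components means more patients can reach one another through some chain of contacts, which is the sense of transferability used here.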
124
Connectivity Analysis

Geographical contacts provide much higher
connectivity than personal contacts in the
network construction.
Decrease the number of components from 961 to 82
Increase the average degree from 0.31 to 108.62

Applied Contacts in the Network Construction | Average Degree (Patient Nodes) | Maximum Degree (Patient Nodes) | Number of Components
Personal Contacts | 0.31 | 4 | 847
Geographical Contacts | 108.62 | 474 | 82
Personal + Geographical Contacts | 108.85 | 474 | 10
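
The comparison in this table can be reproduced in miniature. The sketch below is not the authors' construction; it assumes that two patients are linked whenever they share a geographical node (a one-mode projection), and all patient names, places, and function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def build_adjacency(patients, personal_links=(), place_visits=()):
    """Link patients by personal contact and, as an assumption, by sharing a place."""
    adj = {p: set() for p in patients}
    for u, v in personal_links:
        adj[u].add(v)
        adj[v].add(u)
    visitors = defaultdict(set)
    for patient, place in place_visits:
        visitors[place].add(patient)
    for group in visitors.values():
        for u, v in combinations(sorted(group), 2):
            adj[u].add(v)
            adj[v].add(u)
    return adj

def summarize(adj):
    """Return (average degree, maximum degree, number of components)."""
    degrees = [len(nbrs) for nbrs in adj.values()]
    seen, components = set(), 0
    for start in adj:
        if start in seen:
            continue
        components += 1
        stack = [start]
        seen.add(start)
        while stack:
            for nb in adj[stack.pop()]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
    return sum(degrees) / len(degrees), max(degrees), components

# Hypothetical toy data, not the Taiwan SARS records.
patients = ["p1", "p2", "p3", "p4", "p5", "p6"]
personal = [("p1", "p2")]
visits = [("p2", "hospital_A"), ("p3", "hospital_A"), ("p4", "hospital_A"),
          ("p5", "hospital_B"), ("p6", "hospital_B")]

personal_only = summarize(build_adjacency(patients, personal_links=personal))
geographical_only = summarize(build_adjacency(patients, place_visits=visits))
combined = summarize(build_adjacency(patients, personal, visits))
print(personal_only, geographical_only, combined)
```

Even in this small example, shared places merge far more patients into common components than the single personal link does, which mirrors the pattern in the table.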
125
Connectivity Analysis (Cont.)

The hospital admission history provides the
highest connectivity of nodes in the network
construction.
The hospital visit provides the second highest
connectivity.
This result is consistent with the fact that most
patients were infected in the hospital
outbreaks during the SARS period.

Applied Contacts in the Network Construction
Category | Type of Contacts | Average Degree | Maximum Degree | Number of Components
Personal Contacts | Family Member | 0.204 | 4 | 893
Personal Contacts | Roommate | 0.031 | 2 | 946
Personal Contacts | Colleague/Classmate | 0.06 | 3 | 934
Personal Contacts | Close Contact | 0.023 | 1 | 949
Geographical Contacts | Foreign-Country Travel | 2.727 | 41 | 848
Geographical Contacts | Hospital Visits | 10.077 | 105 | 753
Geographical Contacts | High Risk Area Visit | 1.388 | 36 | 924
Geographical Contacts | Hospital Admission History | 50.479 | 289 | 409
Geographical Contacts | Workplace | 4.694 | 61 | 823
126
One-Mode Network with Only Patient Nodes
127
Contact Network with Geographical Nodes
128
Potential Bridges Among Geographical Nodes

Including geographical nodes helps to reveal
people who may act as bridges, transferring
disease from one subgroup to another.
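
One simple way to surface such candidates in a two-mode network is to list patients linked to more than one geographical node. This is a hedged sketch of that idea, not the authors' procedure; the visit records and the function name are hypothetical.

```python
from collections import defaultdict

def potential_bridges(place_visits):
    """Map each patient linked to two or more places to the sorted list of those places."""
    places_by_patient = defaultdict(set)
    for patient, place in place_visits:
        places_by_patient[patient].add(place)
    return {patient: sorted(places)
            for patient, places in places_by_patient.items()
            if len(places) >= 2}

# Hypothetical two-mode links (patient, geographical node).
visits = [
    ("p1", "hospital_A"), ("p2", "hospital_A"),
    ("p2", "hospital_B"), ("p3", "hospital_B"),
    ("p3", "hospital_B"),  # duplicate records collapse into one link
]
print(potential_bridges(visits))
```

A patient flagged this way is only a candidate; whether transmission actually passed through that person still has to be judged from onset dates and case investigation.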

129
Network Visualization (Cont.)

For a hospital outbreak, including geographical
nodes and contacts in the network is also useful
for seeing the possible disease transmission
scenario within the hospital.
Background of the Example
Mr. L, a laundry worker in H Hospital, had a
fever on 2003/4/16 and was reported as a
suspected SARS patient.
Nurse C took care of Mr. L on 4/16 and 4/17.
Nurse C and Ms. N, another laundry worker in H
Hospital, began to have symptoms on 4/21.
H Hospital was reported to have a SARS outbreak
on 4/24.
Nurse C's daughter had a fever on 5/1.
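
The example can be encoded as a small contact network to show how a possible scenario is read from it. The sketch below uses only the onset dates and relationships stated on the slide; the contact-type labels and the function name are assumptions. For each case it lists linked cases with an earlier onset, which suggests a plausible ordering but does not prove transmission.

```python
from datetime import date

# Onset dates as stated on the slide (year 2003).
onset = {
    "Mr. L": date(2003, 4, 16),
    "Nurse C": date(2003, 4, 21),
    "Ms. N": date(2003, 4, 21),
    "Nurse C's daughter": date(2003, 5, 1),
}

# Links read off the slide text; the contact-type labels are assumptions.
links = [
    ("Mr. L", "Nurse C", "close contact (patient care)"),
    ("Mr. L", "Ms. N", "shared workplace (H Hospital laundry)"),
    ("Nurse C", "Nurse C's daughter", "family member"),
]

def candidate_sources(onset, links):
    """For each case, list linked cases whose symptoms began earlier."""
    sources = {person: [] for person in onset}
    for a, b, kind in links:
        if onset[a] < onset[b]:
            sources[b].append((a, kind))
        elif onset[b] < onset[a]:
            sources[a].append((b, kind))
    return sources

sources = candidate_sources(onset, links)
for person, earlier in sources.items():
    print(person, "<-", earlier)
```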

130
Phase Analysis Density

Normalized density and average degree show
similar patterns.
In the importation phase, the foreign-country
contact network increases dramatically in Week 4
(3/17-3/23), followed by the personal contact
network.
In the hospital outbreak phase, both the personal
and hospital networks increase dramatically. But
in Week 10, the personal network still increases
while the hospital network decreases.

[Figure: two panels, Density and Average Degree]
131
Phase Analysis Transferability

From betweenness, we can see that the personal
network does not have enough transferability
until Week 9.
The personal network forms only several small
fragments, without big groups, in the importation
phase.
From the number of components, the hospital
network is the only one that can consistently
link patients together.

[Figure: two panels, Betweenness and Number of Components, each marked with the Importation and Hospital Outbreak phases]
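
Betweenness, used above as a transferability indicator, counts how often a node lies on shortest paths between other nodes. The sketch below computes unnormalized betweenness with Brandes' algorithm for an undirected, unweighted graph. It is an illustration only: the slides do not say which implementation or normalization was used, and the toy network is hypothetical.

```python
from collections import deque

def betweenness(adj):
    """Unnormalized betweenness centrality for an undirected, unweighted graph."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        stack = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[s] = 1
        dist[s] = 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies in order of decreasing distance from s.
        delta = {v: 0.0 for v in adj}
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each undirected pair was counted from both endpoints.
    return {v: b / 2 for v, b in score.items()}

# Hypothetical path network p1 - p2 - p3 - p4 plus an isolated patient p5.
adj = {
    "p1": {"p2"},
    "p2": {"p1", "p3"},
    "p3": {"p2", "p4"},
    "p4": {"p3"},
    "p5": set(),
}
scores = betweenness(adj)
print(scores)
```

In a network made of small fragments, almost every node scores zero, which is one way to read the low betweenness of the personal network in the importation phase.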
132
Ongoing Research

Worldwide infectious disease breaking news
collection, monitoring, and analysis
Markov-switching model based disease surveillance
Infectious disease social network analysis and
contact tracing
Other public health concerns and infectious
disease applications

133
Building Research Partnership

Emerging critical medical and public health
concerns
Willing and engaging international domain
(biomedical) partners and funding sources
Data, data, and more data
From academic research to scalable
solutions/systems and lasting impacts

134
For more information
BioPortal web site: http://www.bioportal.org
AI Lab web site: http://ai.arizona.edu
hchen_at_eller.arizona.edu
