Knowledge Discovery for Cancer Informatics and Public Health Informatics: Techniques, Case Studies, and Lessons Learned
Hsinchun Chen, Director, Artificial Intelligence Lab, University of Arizona
Acknowledgements: NIH, NSF, NCI, ACC, NTU


Artificial Intelligence Lab Research
  • UA MIS Department (4th ranked): 30 research
    scientists; $25M funding since 1990; Chen, IEEE
    and AAAS Fellow
  • Intelligence & Security Informatics research:
    NSF, DOJ, CIA; COPLINK system deployed in 1,600
    agencies; Dark Web for countering terrorism
  • Biomedical Informatics research: NLM, NCI; Chen,
    NLM Scientific Counselor; HelpfulMed, GeneScene
    system, and BioPortal

A Little Promotion
GeneScene Cancer Pathway Knowledge Extraction
And Visualization
GeneScene Team
Dr. Gondy Leroy (Claremont), Byron Marshall
(Oregon SU), Dan McDonald (Utah SU), Zan Huang
(Penn SU), Jiexun Li (Drexel U), Chun-Ju
Tseng, Shauna Eggers, Dr. Jesse Martinez, Dr.
George Watts, Dr. Bernie Futscher (AZCC), Dr. Hua
Su, Dr. Karin Quinones
  • Text Mining & Knowledge Integration
  • Data Mining
  • Visualization & System Development
  • Domain Experts
  • User Studies & Evaluation

  • GeneScene overview
  • Research directions
  • Text mining
  • Knowledge integration
  • Data mining
  • GeneScene Visualizer

Knowledge Explosion: PubMed
  • Average number of new citations appearing in PubMed
  • In 1980: 746/day
  • In 2004: 1,640/day

GeneScene Overview
  • Motivation
  • Relieving information overload in biomedical
  • Automating processes of knowledge extraction and
    data analysis
  • Focus genetic regulatory pathways
  • Dissection of regulatory networks is crucial for
    a thorough understanding of biological processes
  • Complexity of biological networks raises
    challenges for computational research
  • Research goals
  • To develop novel Natural Language Processing
    (NLP) techniques to support information
  • To develop machine learning and data mining
    techniques to support high-throughput data
  • To create an integrated framework for
    pathway-related knowledge representation and
  • Ultimately, to provide biomedical researchers
    with a pathway-related knowledge discovery and
    integration platform
  • Funding: NIH/NLM, 1 R33 LM07299-01 (May 2002 to
    April 2006)

GeneScene Components and Scope
  • Ontology-enhanced Knowledge Integration: To
    aggregate and consolidate pathway relations
    extracted from literature and to integrate them
    with existing knowledge sources using biomedical
    ontologies
  • Text Mining of Biomedical Literature: To
    automatically extract regulatory relations
    between biological entities from free text
  • Data Mining for Genomic Studies: To extract
    regulatory pathway information based on genomic
    data and other resources
  • Visualization of Regulatory Pathways: To
    facilitate accessing, understanding, and analysis
    of extracted pathway knowledge

Text Mining
  • Extract all pathway-relevant relations from text
  • Relations with gene or protein names on either
    end of the relation are extracted
  • Two types of relations in GeneScene
  • Co-occurrence Relations (Concept Space):
    relations between terms that often co-occur in a
    set of abstracts
  • → HelpfulMed (Cancer Space)
  • Linguistic Relations: precise, semantically rich
    relations from each abstract
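
To make the idea of a co-occurrence relation concrete, here is a minimal sketch that counts how many abstracts each pair of terms shares. It is plain pair counting only; any weighting used by the actual Concept Space is not shown in these slides, and the toy term lists and the threshold of 2 are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(abstract_terms):
    """Count how many abstracts each unordered term pair appears in together."""
    counts = Counter()
    for terms in abstract_terms:
        # Deduplicate and sort so each pair is counted once per abstract.
        for a, b in combinations(sorted(set(terms)), 2):
            counts[(a, b)] += 1
    return counts


# Hypothetical input: terms already recognized in three abstracts.
abstracts = [
    ["p53", "MDM2", "apoptosis"],
    ["p53", "apoptosis"],
    ["E2F1", "p53"],
]
pairs = cooccurrence_counts(abstracts)

# Keep only pairs that co-occur in at least 2 abstracts (assumed threshold).
frequent = {pair: n for pair, n in pairs.items() if n >= 2}
print(frequent)
```

Linguistic relations, by contrast, come from parsing individual sentences, so they carry a connector and a direction that simple counts like these do not.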

HelpfulMED Search of Medical Websites
HelpfulMED search of Evidence-based Databases
Consulting HelpfulMED Cancer Space (Thesaurus)
Browsing HelpfulMED Cancer Map
  • Chinese Medical Intelligence (CMI)
  • Goal
  • Providing medical and health information services
    to both researchers and public.
  • Content
  • 350,000 high quality medical-related webpages
    collected from mainland China, Hong Kong and
  • Meta-search 3 large general Chinese search
  • Key Features
  • Built-in Simplified/Traditional Chinese encoding
  • Dynamic summarization for both Simplified and
    Traditional Chinese
  • Automatic categorization
  • Visualization using SOM

Chinese folder display
Simplified Chinese summary
Chinese visualization with SOM
Traditional Chinese summary
GeneScene Full Parser: Arizona Relation Parser
  • Syntax and semantics are combined together in a
    hybrid parsing grammar as opposed to the
    pipelined approach
  • Introducing over 150 new word classes, while
    retaining many of the original syntax word
    classes (e.g., noun, verb)
  • The new word classes have semantic and lexical
  • Semantic and syntactic properties of the new tags
    are not explicitly detailed in the dictionary,
    but rather determined by the parsing rules that
    define them
  • Rules that apply to the tags reveal the syntactic
    and semantic roles of the tags

ARP Architecture
Architecture diagram for the parser, consisting
of three main stages: tagging, parsing, and
relation extraction
Problem: Gene Pathway
  • Title: Key roles for E2F1 in signaling
    p53-dependent apoptosis and in cell division
    within developing tumors.
  • Abstract: Apoptosis induced by the p53 tumor
    suppressor can attenuate cancer growth in
    preclinical animal models. Inactivation of the
    pRb proteins in mouse brain epithelium by the
    T121 oncogene induces aberrant proliferation and
    p53-dependent apoptosis. p53 inactivation causes
    aggressive tumor growth due to an 85% reduction
    in apoptosis. Here, we show that E2F1 signals
    p53-dependent apoptosis since E2F1 deficiency
    causes an 80% apoptosis reduction. E2F1 acts
    upstream of p53 since transcriptional activation
    of p53 target genes is also impaired. Yet, E2F1
    deficiency does not accelerate tumor growth.
    Unlike normal cells, tumor cell proliferation is
    impaired without E2F1, counterbalancing the
    effect of apoptosis reduction. These studies may
    explain the apparent paradox that E2F1 can act as
    both an oncogene and a tumor suppressor in
    experimental systems.

Expert errs and corrects
Final graph
Prepositions OF/BY/IN
Example Map (one abstract)
Arizona Relation Parser Output
Original Sentence | Entity 1 | Negation | Connector | Entity 2
--- | --- | --- | --- | ---
(1) wild-type p53 tumor suppressor protein, which induces apoptosis | wild-type p53 tumor suppressor protein | False | induces | apoptosis
(2) Wt p53 also induced significant apoptosis | Wt p53 | False | also induced | significant apoptosis
(3) oncogene mutant p53 suppresses apoptosis | oncogene mutant p53 | False | suppresses | apoptosis
(4) mutant p53 blocked E1A-induced apoptosis | Mutant p53 | False | blocked | E1A-induced apoptosis
(4) mutant p53 blocked E1A-induced apoptosis | E1A | False | induced | apoptosis
(5) mutant p53 does not induce apoptosis | mutant p53 | True | does not induce | apoptosis
Text Mining Statistics (Jan. 2005)
Collection | P53 | AP1 | Yeast
--- | --- | --- | ---
Number of Abstracts | 205,820 | 400,487 | 68,025
Number of Abstracts w/ Relation Extracted | 87,903 | 90,773 | 28,971
Linguistic relations (full parser) | 182,499 | 172,116 | 54,805
Concept Space Relations | 2,724,099 | 3,265,524 | 6,535,737
Knowledge Integration: Organizing Relations
  • Relations are more useful when well organized
  • Multiple name strings of the same biological
    entities or processes are aggregated
  • Important contextual information is captured
  • Entities are cross-referenced to outside
  • Well organized relations should help with domain
    appropriate analysis tasks

An Example: Context and Term Variation
  • 4 somewhat contradictory PubMed abstracts
  • (1) wild-type p53 tumor suppressor protein, which
    induces apoptosis
  • (2) Wt p53 also induced significant apoptosis
  • (3) oncogene mutant p53 suppresses apoptosis
  • (4) mutant p53 blocked E1A-induced apoptosis
  • (1) and (2) say that p53 induces apoptosis
  • (3) and (4) say that p53 inhibits apoptosis

From PubMed documents 10594026, 8643473,
11809683, and 11375269
An Example: Context and Term Variation (cont'd)
  • Analyzing the context more closely
  • Wild-type (1), wt (2) p53 are non-mutated
  • Mutant (3) (4) p53 are mutated
  • P53 protein (1) is a protein
  • Oncogene p53 (3) is a gene
  • Identifying context is important in organizing
    extracted information. Words near p53 suggest
    that while normal p53 induces apoptosis, mutated
    p53 inhibits it

From PubMed documents 10594026, 8643473,
11809683, and 11375269
Biological Entity Recognition and Identification
  • To aggregate relations we need to recognize and
    identify biological entities
  • Recognition finds substance references in text,
    identification matches those references to
    external resources (Tuason et al. 2004)
  • Three key object recognition difficulties (Fukuda
    et al., Palakal et al.)
  • Compound words
  • Ambiguous expressions
  • New or unknown words

Aggregation System Design
A Decompositional Approach to Biomedical Concept
  • BioAggregate tagger decomposes name strings found
    in a relation by left-to-right longest-first
    pattern matching using domain appropriate
    lexicons of feature-signaling terms
  • Lexicons built from existing resources and human
    generated lists
  • Substance names in LocusLink, RefSeq, HUGO, and
    SGD, etc.
  • Biological processes in Gene Ontology
  • Lexicons of other features
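
The left-to-right, longest-first matching described above can be sketched as follows. This is an illustration only, not the BioAggregate tagger itself: the tiny `LEXICON`, the canonical value "TP53", and the "Unknown" fallback label are hypothetical stand-ins for lexicons that would be built from resources such as LocusLink, HUGO, and Gene Ontology.

```python
# Hypothetical mini-lexicon: phrase -> (feature, normalized value).
LEXICON = {
    "wild-type": ("Mutation", "not mutated"),
    "wt": ("Mutation", "not mutated"),
    "mutant": ("Mutation", "mutated"),
    "p53": ("Aggregatable Substance", "TP53"),
    "tp53": ("Aggregatable Substance", "TP53"),
    "tumor suppressor protein": ("Substance Type", "protein"),
    "protein": ("Substance Type", "protein"),
    "oncogene": ("Substance Type", "gene"),
    "apoptosis": ("Function", "apoptosis"),
}
MAX_WORDS = max(len(phrase.split()) for phrase in LEXICON)


def decompose(name):
    """Split a name string into features, trying the longest phrase first."""
    tokens = name.lower().split()
    features, i = [], 0
    while i < len(tokens):
        # Try the longest candidate phrase starting at position i, then shorter ones.
        for n in range(min(MAX_WORDS, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in LEXICON:
                features.append(LEXICON[phrase])
                i += n
                break
        else:
            # No lexicon entry starts here; keep the token as an unknown word.
            features.append(("Unknown", tokens[i]))
            i += 1
    return features


print(decompose("wild-type p53 tumor suppressor protein"))
print(decompose("oncogene mutant p53"))
```

Trying longer phrases first means "tumor suppressor protein" is matched as one unit rather than leaving "tumor" and "suppressor" as unknown words before "protein".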

Features For Decompositional Matching
Feature Lexicon | Explanation
--- | ---
Aggregatable Substance | A gene and its product(s). All references to a particular gene and its product(s) share the same Aggregatable Substance value. E.g., p53, tp53, and trp53.
Mutation | Indicating the status of an aggregatable substance. Only has two values, mutated or not mutated (wild-type).
Substance Type | "Type" of aggregatable substance. Currently there are three recognized types: gene, protein, and mRNA.
Associator | Essentially verbs. This feature attempts to resolve verbs that occur in multiple forms, but have the same stem. E.g., inhibit, inhibits, inhibited, and inhibiting all share the Connector Associator value "inhibit."
Function | A biological process, such as apoptosis or angiogenesis (as in biological_process list of Gene Ontology), or an action performed on an aggregatable substance, such as phosphorylation or inhibition.
Species | The species/organism information associated with an entity or relation.
Cellular Component | The sub-cellular component or location of an entity or relation.
Stopword | Common words judged to meet the standard: ignoring this word will not mischaracterize pathway relations.
P53 Testbed
  • ARP extracted 182,499 relations from 87,903
    PubMed abstracts related to the gene p53
  • As extracted, the relations display very little
    overlap with 142,974 distinct entity names and
    127,397 distinct relational pairs

5 Levels of Aggregation Granularity
Aggregation Level | Possible Applications
--- | ---
Baseline (string match entities and connector) | basis of comparison
Feature Match (feature synonym resolution) | detailed pathway analysis
Typed Substance (distinguish genes and proteins) | pathway analysis; granularity is comparable to some human-curated databases
Aggregatable Substance | explore the function of a gene and its gene products
Simple Pathway (substance/function; 4 categories for connectors) | high level overviews and input to some machine learning algorithms
(Levels run from more detailed at the top to more abstract at the bottom.)
Network Consolidation Results
  • The number of distinct entities and relations are
    sharply reduced over various levels of
  • When fewer relations are disjoint, the knowledge
    network encompasses more information
  • Network density increased 20-fold
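
A toy illustration of why consolidation raises density: once name variants collapse to one node, the same relations connect far fewer distinct entities. The slides do not state which density formula was used, so the directed-graph definition below (distinct edges over possible ordered pairs) is an assumption, as are the `CANONICAL` mapping and the sample relations.

```python
def density(relations):
    """Directed density: distinct edges / possible ordered node pairs (assumed definition)."""
    edges = set(relations)
    nodes = {name for edge in edges for name in edge}
    possible = len(nodes) * (len(nodes) - 1)
    return len(edges) / possible if possible else 0.0


# Hypothetical synonym table mapping name strings to one aggregated node.
CANONICAL = {
    "wild-type p53": "TP53",
    "wt p53": "TP53",
    "mutant p53": "TP53",
    "mdm2": "MDM2",
    "mdm-2": "MDM2",
}


def aggregate(relations):
    """Replace each entity name with its canonical form when one is known."""
    return [
        (CANONICAL.get(a.lower(), a), CANONICAL.get(b.lower(), b))
        for a, b in relations
    ]


raw = [
    ("wild-type p53", "MDM2"),
    ("Wt p53", "mdm-2"),
    ("mutant p53", "MDM2"),
    ("MDM2", "wild-type p53"),
]
print(density(raw))             # 4 edges among 5 name strings
print(density(aggregate(raw)))  # 2 edges among 2 aggregated nodes
```

Note that this coarse mapping also merges mutant and wild-type p53, which is exactly the trade-off the more detailed aggregation levels are meant to avoid.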

Genomic Data Mining in AI Lab
  • Joint learning of genetic networks from
    microarray data and existing knowledge
  • Gene selection for cancer diagnosis

Gene Selection for Cancer Diagnosis
  • Gene array data have been widely used for cancer
    classification/prediction (Golub et al., 1999;
    Ben-Dor et al., 2001)
  • The major problems of gene array data (Model et
    al., 2001; Lu & Han, 2003)
  • High dimensionality (hundreds or thousands of
  • Small number of available samples
  • Most genes are irrelevant to cancer distinction
  • Genes are interacting with each other
  • It is important to identify the marker genes for
    cancer diagnosis

Experiment: Ovarian Cancer Diagnosis
  • Ovarian cancer
  • 25,580 projected cases in 2004
  • 16,090 deaths estimated in 2004
  • 53% overall 5-year survival
  • 31% 5-year survival in those with distant
    metastases at diagnosis
  • 75% of cases diagnosed in late stages (III & IV)
  • Predict survival of ovarian cancer alive or
  • Clinical measurements
  • Two attributes: stage and grade
  • Gene methylation level
  • Differentially methylated between normal and
    cancer tissues

Ovarian Cancer Methylation Array
  • University of Iowa Gynecologic Oncology tumor
    bank (provided by Dr. Bernie Futscher at AZCC)
  • 114 DNA samples
  • 11 Normal ovary
  • 19 Stage I
  • 18 Stage II
  • 24 Stage III
  • 17 Stage IV
  • 25 Low malignant potential (LMP)
  • 6560 genes
  • Top 1000 genes with highest standard deviation
    across all samples are regarded potentially
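
The pre-filtering step above (keep the genes with the highest standard deviation across samples) might look like the following sketch. The gene names and values are made up, and population standard deviation is an arbitrary choice here; the slides do not say which variant was used.

```python
from statistics import pstdev


def top_variable_genes(levels, k):
    """Return the k gene names whose values vary most across samples.

    levels maps a gene name to its list of per-sample methylation values.
    """
    ranked = sorted(levels, key=lambda gene: pstdev(levels[gene]), reverse=True)
    return ranked[:k]


# Hypothetical toy data: 3 genes measured on 4 samples.
toy_levels = {
    "geneA": [0.1, 0.9, 0.2, 0.8],
    "geneB": [0.5, 0.5, 0.5, 0.5],
    "geneC": [0.4, 0.6, 0.5, 0.5],
}
print(top_variable_genes(toy_levels, 2))
```

In the experiment described here the same idea would be applied with k = 1000 over the 6560 genes.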

Gene Selection Techniques
  • Individual gene ranking
  • F-statistics (Mendenhall & Sincich, 1995)
  • Gene subset selection
  • Optimal search algorithms
  • Genetic algorithm (GA) (Holland, 1975)
  • Tabu search (TS) (Glover, 1986)
  • Evaluation criteria
  • Maximum relevance minimum redundancy (MRMR)
    (Ding & Peng, 2003)
  • Support vector machine (SVM) (Vapnik, 1995)
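
As an illustration of individual gene ranking, the sketch below scores each gene with a one-way ANOVA F-statistic (between-class variance over within-class variance) and keeps the top scorers. The class labels "alive" and "deceased", the gene names, and the values are hypothetical; the subset-selection methods (GA, TS with MRMR or SVM criteria) are not sketched here.

```python
from statistics import mean


def f_statistic(groups):
    """One-way ANOVA F: between-group mean square over within-group mean square."""
    all_values = [v for g in groups for v in g]
    grand = mean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    if ss_between == 0:
        return 0.0
    if ss_within == 0:
        return float("inf")  # classes are perfectly separated by this gene
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def rank_genes(data, top):
    """data maps gene -> {class label -> list of values}; return the top genes by F."""
    scores = {gene: f_statistic(list(by_class.values())) for gene, by_class in data.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top]


# Hypothetical toy data for two survival classes.
toy = {
    "geneA": {"alive": [1, 2, 3], "deceased": [4, 5, 6]},
    "geneB": {"alive": [1, 5, 3], "deceased": [2, 4, 3]},
}
print(rank_genes(toy, 1))
```

Ranking genes one at a time ignores interactions between genes, which is the motivation for the subset-selection techniques listed above.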

Marker Genes for Survival Prediction
  • Q1 Which genes can be used to predict the
    survival of ovarian cancer based on their
    methylation level?

Level | F | N | Mean | StDev
--- | --- | --- | --- | ---
Full set | 1000 | 30 | 67.690 | 2.100
F-stat | 100 | 30 | 77.398 | 1.414
GA/MRMR | 57 | 30 | 76.199 | 2.018
GA/SVM | 39 | 30 | 80.263 | 1.592
TS/MRMR | 24 | 30 | 80.702 | 1.613
TS/SVM | 96 | 30 | 82.807 | 2.205
Pooled StDev: 1.847
(Individual 95% CIs for mean, based on pooled StDev; CI plot not reproduced.)
  • Conclusion
  • TS/SVM selected 96 out of 1000 genes, which
    achieved the highest prediction accuracy (82.807)

Methylation vs. Clinical Diagnosis
  • Q2 Will methylation-based methods perform better
    than clinical diagnosis in predicting survival of
    ovarian cancer?

Level | F | N | Mean | StDev
--- | --- | --- | --- | ---
Clinical | 2 | 30 | 75.281 | 1.770
Full set | 1000 | 30 | 57.566 | 2.747
F-stat | 70 | 30 | 75.581 | 1.413
GA/MRMR | 40 | 30 | 75.506 | 1.756
GA/SVM | 46 | 30 | 79.813 | 0.205
TS/MRMR | 24 | 30 | 77.715 | 1.868
TS/SVM | 30 | 30 | 81.948 | 1.769
Pooled StDev: 1.790
(Individual 95% CIs for mean, based on pooled StDev; CI plot not reproduced.)
  • Conclusions: prediction accuracy
  • Full set < Clinical < Marker genes (selected by

GeneScene Visualizer
  • To provide graphical presentation of large-scale
    regulatory networks comprised of pathway
    relations extracted through text mining
  • Three testing collections
  • p53 (87,903 abstracts)
  • AP1 (90,773 abstracts)
  • Yeast (28,971 abstracts)
  • Currently loading and parsing entire PubMed for
    Cancer pathway
  • 7 million abstracts

GeneScene Visualizer Functionality
  • Searching by specific elements, e.g., diseases
    or genes
  • Network-based exploration and navigation
  • Accessing the underlying PubMed abstract
  • Saving and loading search and visualization
  • Various manipulations on the table and network
    view of the retrieved relations filter, sort,
    zoom, highlight, isolate, expand, print, etc.

GeneScene Visualizer V1.5
GeneScene Visualizer V1.5
Effect of Aggregation
  • Same relations, before and after aggregation

Baseline level
Simple Pathway level
Effect of Aggregation: Mutation Feature
When mutant and non-mutant are combined, an
apparent conflict arises: TP53 is both activating
and inhibiting MDM2.
When the Mutation feature is selected, non-mutant
TP53 is shown to activate and mutant TP53 to
inhibit MDM2.
User Feedback: General Comments
  • Interviewees were generally impressed with the
    features and usefulness of the system
  • "In my head I've been trying to do what this is
    doing for you."
  • "It took me a few weeks just to find that Sin3
    interacts with p53, where when you type this in
    to Genescene it's right there."
  • "Just playing around with the system I am
    seeing things that I didn't know before."
  • "If this is the entire Medline, I would probably
    use it every time I search."
  • "I think a lot of people would get a lot of use
    out of this, as long as it doesn't scare them off
    in the beginning."

  • Lessons Learned
  • Biomedical information is precise but
    terminologies are fluid
  • Biomedical professionals need search and analysis
  • Biomedical linguistic parsing and ontologies are
    promising for biomedical text mining
  • The need for integrated biomedical data (gene
    microarray) and text mining (literature)

Ongoing Research
  • Combining bottom-up data mining
    (MicroArray/Methylation) with top-down text
    mining results
  • Creating CancerPath testbed for cancer genomic
    network visualization
  • Biological networks topological analysis (growth,
    preferential attachment)
  • Other biomedical applications: plant science
    pathways (Arabidopsis, Galbraith Lab); infectious
    disease surveillance

BioPortal: Infectious Disease Information
Sharing, Surveillance, Analysis, and Visualization
Research Partners and Supports
  • University of Arizona
  • University of California, Davis
  • Kansas State University
  • University of Utah
  • Arizona Department of Public Health
  • New York State Department of Health/HRI
  • California Department of Health Services/PHFE
  • U.S. Geological Survey
  • The SIMI Group
  • National Taiwan University
  • NSF
  • DHS
  • CDC

UA Team Members
  • Dr. Hsinchun Chen
  • Dr. Daniel Zeng
  • Lu Tseng
  • Cathy Larson
  • Kira Joslin
  • Wei Chang
  • James Ma
  • Hsinmin Lu
  • Ping Yan
  • Aaron Sun
  • Keith Alcock
  • Sapna Brahmanandam
  • Milind Chabbi
  • Yuan Wang

  • Project Background
  • BioPortal Achievements
  • System Architecture
  • System Functionalities
  • BioPortal Collaboration Framework
  • New Developments
  • International Foot-and-mouth Disease Monitoring
  • Syndromic Surveillance
  • Disease Contact Tracing

BioPortal Project Goals
  • Demonstrate and assess the technical feasibility
    and scalability of an infectious disease
    information sharing (across species and
    jurisdictions), alerting, and analysis framework.
  • Develop and assess advanced data mining and
    visualization techniques for infectious disease
    data analysis and predictive modeling.
  • Identify important technical and policy-related
    challenges in developing a national infectious
    disease information infrastructure.

Information Sharing Infrastructure Design
[Diagram labels: Portal Data Store (MS SQL 2000); Data Ingest Control Module (Cleansing / ...); Info-Sharing Infrastructure; XML/HL7 Network; PHINMS Network]
Data Access Infrastructure Design
Datasets Integrated: WNV, BOT
Index | Dataset | Test Data Available | Data Duration (MM/YY) | Data Size | Spatial Granularity | Temporal Granularity
--- | --- | --- | --- | --- | --- | ---
1 | NY WNV Human | Yes | Test Data | 574 | Zip | Date
2 | NY Dead Bird | Yes | Test Data | 942 | Lat/Long shifted |
3 | NY Mosquito | Yes | Test Data | 815 | County | Date
4 | NY WNV Captive Animal | Yes | Test Data | 39 | Zip | Date
5 | NY Botulism Human | Yes | Test Data | 10 | Zip | Date
6 | CA WNV Human | Yes | 09/03-10/03 | 186 | County | Date
7 | CA Dead Bird | Yes | 01/03-10/03 | 3032 | City/zip | Minutes
8 | CA Chicken Sera | Yes | 04/03-10/03 | 18887 | Site | Date
9 | CA Mosquito Pool | Yes | 01/98-10/03 | 3518 | Site | Date
10 | CA Botulism | Yes | 01/01-12/02 | 53 | Zip | Date
11 | USGS EPIZOO - Preliminary | Yes | 07/99-09/03 | 46 events | County | Date
12 | USGS EPIZOO WNV | Yes | 08/1999-07/2004 | 113 events | County | Date
13 | USGS EPIZOO - Botulism | Yes | 12/1989-12/2004 | 702 events | County | Date
14 | UC Davis FMD | Yes | 1996-2003 | 3288 | Site/Province | Date/Month
15 | International FMD | Yes | 01/1982-03/2005 | 6789 | Province | Non-temporal
16 | BioWatch | Yes | 1/10-1/17 2004 | 480 | Exact Site | Date
17 | CA Mosquito Treatment | Yes | 1/14-11/30 2004 | 6194 | Exact Site | Date
18 | National Infant Botulism | Yes | 1/16-11/25 2004 | 15 | Zip | Date
  • Scalable, flexible, light-weight, and extendible.
    Easy to include
  • New diseases
  • New jurisdictions
  • New techniques!
  • Messaging infrastructure installed and tested
  • CADHS-UA Regional message broker
  • XML generation/conversion
  • NY_DeadBird, NY_Alerts, NY_BotHuman, NY_WNVHuman,
    NY_CaptiveAnimal, NY_Mosquito
  • CA_BotHuman, CA_WNVHuman, CA_DeadBird,
    CA_Chicken, CA_Mosquito
  • USGS_Epizoo

Spatio-Temporal Data Mining: Hotspot Analysis
  • A hotspot is a condition indicating some form of
    clustering in a spatial and temporal distribution
    (Rogerson & Sun 2001; Theophilides et al. 2003;
    Patil & Taillie 2004).
  • For WNV, localized clusters of dead birds
    typically identify high-risk disease areas
    (Gotham et al. 2001).
  • Automatic detection of dead bird clusters using
    hotspot analysis can help predict disease
    outbreaks and aid in effective allocation of
    prevention/control resources.
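
To show the general shape of such detection, here is a deliberately simple grid-based sketch: it bins events into square cells and flags cells whose count of new cases is high relative to a baseline period. It is not RSVC, SaTScan, or CrimeStat; the function name, the `ratio` and `min_count` thresholds, and the toy coordinates are all assumptions for illustration.

```python
from collections import Counter


def grid_hotspots(events, cell_size, baseline_end, ratio=2.0, min_count=3):
    """Flag grid cells where new-case counts clearly exceed baseline counts.

    events is a list of (x, y, day) tuples; days up to baseline_end form
    the baseline period and later days count as new cases.
    """
    before, after = Counter(), Counter()
    for x, y, day in events:
        cell = (int(x // cell_size), int(y // cell_size))
        (before if day <= baseline_end else after)[cell] += 1
    return sorted(
        cell
        for cell, n in after.items()
        # Treat an empty baseline as 1 to avoid flagging on a zero denominator.
        if n >= min_count and n >= ratio * max(before[cell], 1)
    )


# Hypothetical dead-bird sightings: (x, y, day number).
sightings = [
    (0.5, 0.5, 1), (0.2, 0.1, 10), (0.3, 0.3, 11), (0.4, 0.9, 12),
    (5.1, 5.2, 1), (5.3, 5.4, 2), (5.5, 5.5, 3), (5.2, 5.1, 10),
]
print(grid_hotspots(sightings, cell_size=1.0, baseline_end=5))
```

Fixed grid cells are a crude stand-in for the scan windows and support-vector clustering used by real hotspot tools, but the baseline-versus-new-cases comparison is the same basic idea.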

Case Study (NY WNV)
  • On May 26, 2002, the first dead bird with WNV was
    found in NY
  • Based on NY's test dataset

[Timeline figure labels: March 5; May 26; July 2; 140 records; 224 records; new cases]
Dead Bird Hotspots Identified
WNV/BOT BioPortal
Acknowledgment: NSF, ITIC, NYSDH, CDHS,
USGS (Drs. Kvach and Ascher)
BioPortal HotSpot Analysis: RSVC, SaTScan, and
CrimeStat Integrated (first visual, real-time
hotspot analysis system for disease surveillance)
  • West Nile virus in California

Hotspot Analysis-Enabled STV
International FMD BioPortal
Acknowledgment: DHS, DOD, UC Davis (Drs. Thurmond
and Lynch)
International FMD BioPortal Goals
  • Real-time, web-based situational awareness of FMD
    outbreaks worldwide through the establishment of
    an international information sharing and analysis
  • FMDv characterization at the genomic level
    integrated with associated epidemiological
    information and modeling tools to forecast
    national, regional, and/or international spread
    and the prospect of import into the U.S. and the
    rest of North America
  • Web-based crisis management of resources:
    facilities, personnel, diagnostics, and therapeutics

Research Plans
  • Global FMD epidemiological data
  • (Near) real-time data collection
  • Web-based information sharing and analysis
  • International FMD news
  • Indexed collection of global FMD news
  • Search and visualization of the FMD news via the
  • FMD genetic/sequence data
  • Predictive model using phylogenetic, spatial, and
    temporal information to stop FMD at the border
  • Visualization for FMD events in time, space, and
    genetic space

Preliminary Global FMD Dataset
  • Provider: UC Davis FMD Lab
  • Information sources: reference labs and OIE
  • Coverage: 28 countries globally
  • Time span: May 1905 to March 2005
  • Dataset size: 30,000 records, of which 6,789
    records are complete
  • Host species Cattle, Caprine, Ovine, Bovine,
    Swine, NK, Elephant, Buffalo, Sheep, Camelidae,

Global FMD Coverage in BioPortal
FMD Migration Visualization using BioPortal
(cases in South Asia)
FMD Cases travel back and forth between countries
International FMD News
  • Provider: UC Davis FMD Lab
  • Information sources: Google, Yahoo, and open
    Internet sources
  • Time span: Oct 4, 2004 to present (real-time
    messaging under development)
  • Data size: 460 events (6/21/05)
  • Coverage: 51 countries
  • (Africa: 11, Asia: 16, Europe: 12, Americas: 12)

Searching FMD News
  • http//
  • Searchable by
  • Date range
  • Country
  • Keyword

Visualizing FMD News on BioPortal
FMD Genetic Information Analysis
  • Genome clustering analysis
  • Phylogenetic clustering
  • Spatial clustering
  • Temporal clustering
  • Hotspot detection among gene sequences
  • Create a tree structure based on semantic
    distance between gene sequences.
  • Automatically detect the dense portion of the
  • Identify the connection between the semantic
    cluster and the geographic pattern of gene

FMD Genetic Visualization
  • Goal: Extend STV to incorporate a 3rd dimension,
    phylogenetic distance
  • Include a phylogenetic tree.
  • Identify phylogenetic groups and color-code the
    isolate points on the map.
  • Leverage available NCBI tools such as BLAST.
  • Proof of concept: SAT 2 & 3 analysis
  • Data: 54 partial DNA sequence records in South
    Africa received from UC Davis FMD Lab
    (Bastos, A.D. et al. 2000, 2003)
  • Date range: 1978-1998
  • Countries covered: South Africa, Zimbabwe,
    Zambia, Namibia, Botswana

Sample FMD Sequence Records
Color-coded View (MEGA3)
Textual View of Gene Sequence
Phylogenetic Tree of Sample FMD Data
Identify 6 groups within 2 major families (MEGA3
based on sequence similarity)
Genetic, Spatial, and Temporal Visualization of
FMD Data
Phylogenetic tree color coded
Isolates locations color coded
Isolates appearances in time
FMD Time Sequence Analysis
First family cases appeared throughout the period
Second family cases existed before 1993 and
reappeared later after 1997
FMD Periodic Pattern Analysis
2nd family concentrated in Feb. while 1st family
spread evenly
Locations of Family 1 records
Selected only groups 1, 2, and 3 and found a
spatial cluster
Locations of Family 2 records
Sparse isolate locations
Selected only groups 4, 5, and 6
  • Syndromic Surveillance

Chief Complaints As a Data Source
  • Chief complaints (CCs) are short free-text
    phrases entered by triage practitioners
    describing the reasons for patients' ER visits
  • Examples: lt foot pain → left foot pain; cp →
    chest pain; sob → shortness of breath; so →
    should be sob; poss uti → possibly urinary
    tract infection
  • Advantages of using CCs for surveillance purposes
  • Timeliness: Diagnosis results are on average 6
    hours slower than CCs
  • Availability and low cost: Most hospitals have
    free-text CCs available in electronic form
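
The abbreviation examples above suggest a simple normalization step before classification. The sketch below expands a handful of abbreviations token by token; the table only contains the examples quoted on this slide, and the function name and lowercase handling are assumptions rather than the BioPortal implementation. Context-dependent fixes such as "so" meaning "sob" are deliberately left out, since a plain lookup cannot tell them from the ordinary word.

```python
# Abbreviations taken from the slide's examples; a real table would be far larger.
ABBREVIATIONS = {
    "lt": "left",
    "cp": "chest pain",
    "sob": "shortness of breath",
    "poss": "possibly",
    "uti": "urinary tract infection",
}


def normalize_cc(text):
    """Lowercase a chief complaint and expand known abbreviations token by token."""
    return " ".join(ABBREVIATIONS.get(token, token) for token in text.lower().split())


print(normalize_cc("Lt foot pain"))
print(normalize_cc("poss UTI"))
print(normalize_cc("SOB"))
```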

Existing CC Classification Methods
Classification Method | Systems | Authors
--- | --- | ---
Keyword Match, Synonym List, Mapping Rules | DOHMH (NY City), EARS | Mikosz et al. (2004)
Weighted Keyword Match (Vector Cosine Method), Mapping Rules | ESSENCE | Sniegoski (2004)
Naïve Bayesian | RODS | Olszewski (2003), Ivanov et al. (2002)
Bayesian Network | N/A | Chapman et al. (2004)
Overall System Design
Chief Complaints
A Stage 2 Example: CC Concepts → Symptom Group
[Diagram labels: CC concepts "blood in urine", "ureteral stone", "pass out"; symptom group weights: bleeding 1/4 + 1/5 + 1/6 = 0.62; coma 1/5 = 0.2; dead 1/5 = 0.2; altered_mental_status 1/5 = 0.2]
Summary of Stage 2 Performance
3001 concepts
1835 CC records from Stage 1
417 unique concepts
  • BioPortal Taiwan Syndromic Surveillance

Multi-lingual Chief Complaints: Chinese Example
  • Data Characteristics
  • Mixed expressions in both Chinese and English
  • ????FEVER???????????(?)
  • ??,?????A/W,????,????
  • 18% of CC records from NTU Med. Center contain
    Chinese expressions.
  • Some hospitals have 100% of CC records in Chinese
    (For example, ??????)
  • Misspellings and typographic errors are not

Chinese CC Preprocessing System Design
[Diagram: raw Chinese CCs pass through Stages 0.1 to 0.3: separate Chinese and English expressions, Chinese phrase segmentation, and Chinese phrase translation. Resources shown: Chinese medical phrases, common Chinese phrases, mutual information, and a Chinese-to-English dictionary. Outputs shown: English expressions, Chinese expressions, segmented Chinese phrases, and translated Chinese phrases.]
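
The first preprocessing step, separating Chinese from English expressions, can be approximated with a Unicode range test. The sketch below is an assumption about how such a split could be done, not the system's actual code; it only covers the basic CJK Unified Ideographs block, and the sample complaint is made up.

```python
import re

# Basic CJK Unified Ideographs block; extension blocks are ignored in this sketch.
CJK_RUN = re.compile(r"[\u4e00-\u9fff]+")


def separate(cc):
    """Split a mixed chief complaint into Chinese runs and the remaining English text."""
    chinese = CJK_RUN.findall(cc)
    english = " ".join(CJK_RUN.sub(" ", cc).split())
    return chinese, english


# Hypothetical mixed-language complaint (the Chinese word means "diarrhea").
print(separate("poor intake and 腹瀉"))
```

The Chinese runs would then go on to segmentation and dictionary translation, while the English part can be handled by the existing English CC pipeline.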
Result Self Validation
  • Use the 280 translations against 1978 chief
    complaints from hospital A
  • 1610 (82%) records are in English
  • 368 (18%) records contain Chinese
  • 36% contain trivial info.
  • E.g., r/o septic shock ????
  • 64% contain non-trivial info.
  • E.g., poor intake and ????
  • 67 have complete translation
  • 2 have partial translation
  • 20 do not have translation

General Grouping
  • Taiwan surveillance data visualization: 2.2M
    scrubbed chief complaint records

Group by Hospital
Group by Syndrome Classification
Disease Outbreak Detection Using Chief Complaints
  • Markov Switching Model

Data Source
  • Emergency Department Free-text Chief Complaints
  • Medical practitioners use both Chinese and
    English to record CCs
  • 368,151 records; 23.77% contain Chinese
  • Time period: 2000-6-30 to 2003-4-27
  • Use BioPortal Multilingual CC Classifier to
    classify CC records into syndromes

Syndrome Prevalence
  • Botulism-Like (1.4%)
  • Constitutional (25.4%)
  • Gastrointestinal (26.4%)
  • Hemorrhagic (6.4%)
  • Neurological (14.1%)
  • Rash (2.4%)
  • Respiratory (17.8%)
  • Upper Respiratory (7.3%)
  • Lower Respiratory (12.7%)
  • Fever (18%)
  • Other (34.9%)
  • Choose Resp. and GI syndrome for further analysis
  • Two syndromes with high prevalence
  • Can be extended to other syndromes

GI Syndrome Time Series
Gastrointestinal Syndrome Time Series
Autocorrelation Function
GI Syndrome Time Series (cont'd)
  • Strong day-of-week effect
  • Seasonal effect is less strong
  • Sporadic jumps
  • Seems to have 1 to 2 peaks per year
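
A day-of-week effect like the one noted above shows up as a spike in the sample autocorrelation at lag 7. The following sketch computes the standard sample autocorrelation on a made-up daily count series with higher weekend counts; the series is illustrative and is not the Taiwan data.

```python
from statistics import mean


def autocorrelation(series, lag):
    """Sample autocorrelation of a series at the given lag."""
    m = mean(series)
    denom = sum((x - m) ** 2 for x in series)
    num = sum((series[t] - m) * (series[t + lag] - m) for t in range(len(series) - lag))
    return num / denom


# Hypothetical daily counts: 5 quieter days then 2 busier days, repeated for 8 weeks.
daily_counts = [10, 10, 10, 10, 10, 20, 20] * 8

print(round(autocorrelation(daily_counts, 7), 3))  # strong weekly repetition
print(round(autocorrelation(daily_counts, 3), 3))  # out of phase with the week
```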

Estimation Results (GI Syndrome)
GI Time Series
Outbreak State
Estimation Results (GI Syndrome) (cont'd)
  • Jumps appear during Chinese New Year periods
  • The Markov switching model identified 4 high
    GI-count periods
  • 2000-12-23 to 2001-1-28 (Jan. 23: New Year's Eve)
  • 2002-1-29 to 2002-3-15 (Feb. 11: New Year's Eve)
  • 2002-5-9 to 2002-10-14
  • 2002-12-13 to 2003-2-18 (Jan. 30: New Year's Eve)

  • Taiwan SARS Contact Tracing

Social Network Analysis in Epidemiology
  • Conceptualizing a population as a set of
    individuals linked together to form a large
    social network provides a fruitful perspective
    for better understanding the spread of some
    infectious diseases. (Klovdahl, 1985)
  • Social Network Analysis in epidemiology has two
    major activities:
  • Network Construction
  • Link the whole set of persons in a particular
    population with relationships or types of
  • Network Analysis
  • Measure and make inferences about structural
    properties of the social networks through which
    infectious agents spread

A Taxonomy of Network Construction
Disease | Linking Relationship | Examples
--- | --- | ---
Sexually Transmitted Disease (STD) | Sexual Contact | AIDS (Klovdahl, 1985); Gonorrhea (Ghani et al., 1997); Syphilis (Rothenberg et al., 1998)
Sexually Transmitted Disease (STD) | Drug Use | AIDS (Klovdahl et al., 1994); AIDS (Rothenberg et al., 1998)
Sexually Transmitted Disease (STD) | Needle Sharing | AIDS (Klovdahl et al., 1994); AIDS (Rothenberg et al., 1998)
Sexually Transmitted Disease (STD) | Social Contact | AIDS (Klovdahl et al., 1994); AIDS (Rothenberg et al., 1998)
Tuberculosis (TB) | Personal Contact | (Klovdahl et al., 2001); (McElroy et al., 2003)
Tuberculosis (TB) | Geographical Contact | (Klovdahl et al., 2001); (McElroy et al., 2003)
Severe Acute Respiratory Syndrome (SARS) | The Source of Infection | (CDC, 2003); (Shen et al., 2004)
Severe Acute Respiratory Syndrome (SARS) | Personal Contact | (Meyers et al., 2005)
CDC: Centers for Disease Control and Prevention
A Taxonomy of Network Analysis
Network Analysis
Levels of Analysis | Description | Examples
Network Visualization | Show the spread of an infectious agent transmitted from one person to another | AIDS (Klovdahl, 1985); Syphilis (Rothenberg et al., 1998); SARS (CDC, 2003); SARS (Shen et al., 2004)
Network Measurement | Study the structure of a population through which an infectious agent is transmitted during close personal contact; develop disease containment strategies or programs | Syphilis (Rothenberg et al., 1998); AIDS (Klovdahl et al., 1994); AIDS (Rothenberg et al., 1998)
Network Simulation | Evaluate the spread of an infectious agent within a population with different network parameters | Gonorrhea (Ghani et al., 1997); SARS (Meyers et al., 2005)
CDC: Centers for Disease Control and Prevention
Network Visualization
  • Use social networks to visualize the
    transmission of an infectious agent from one
    person to another within a particular population
  • Focus on the identification of
  • Subgroups within the population
  • Characteristics of each subgroup
  • Bridges between subgroups, which transmit a
    disease from one subgroup to another

Epidemic Phases and Social Networks
  • Potterat et al. (2001) proposed that the structure
    of sexual networks is a more reliable indicator of
    STD epidemic phase.
  • Two sexual networks in Colorado Springs, U.S.
    were compared
  • Bacterial STD from 1990 to 1991 (an STD outbreak)
  • Chlamydia from 1996 to 1999 (stable or declining
    phase)
  • The sexual network in the stable or declining
    phase was
  • Fragmented
  • Dendritic
  • Lacking cyclic structures
  • Cunningham et al. (2004) further examined the
    relationship between network characteristics and
    epidemic phases.
  • After epidemic
  • Macro-level structure
  • Average distance declined.
  • Density increased.
  • Micro-level structure
  • Numbers of n-cliques and k-plexes declined.

Research Test Bed
  • We use Taiwan SARS data as our research test bed.
  • SARS (Severe Acute Respiratory Syndrome) is a
    novel infectious disease which emerged in 2002.
  • The first human case was identified in Guangdong
    Province, China on November 16, 2002. (Donnelly
    et al., 2004)
  • A 65-year-old doctor from Guangdong Province
    stayed at a hotel in Hong Kong in February 2003
    and infected at least 17 other guests and
    visitors at the hotel, some of whom later traveled
    to other countries and initiated local
    transmission of SARS. (Peiris et al., 2006)
  • 26 countries, including Vietnam, Singapore,
    Canada, and Taiwan, reported SARS cases.
  • Financial impact: 50B

SARS in Taiwan
  • The first SARS case in Taiwan was a Taiwanese
    businessman who traveled to Guangdong Province
    via Hong Kong in early February 2003.
  • Had onset of symptoms on February 26, 2003
  • Infected two family members and one healthcare
    worker
  • Eighty percent of probable SARS cases were
    infected in a hospital setting.
  • The first outbreak began at a municipal hospital
    on April 23, 2003.
  • A total of seven hospital outbreaks were reported.
  • Hospital shopping and transfer were suspected of
    triggering these sequential hospital outbreaks.

Taiwan SARS Data
  • Taiwan SARS data was collected by the Graduate
    Institute of Epidemiology at National Taiwan
    University during the SARS period.
  • In this dataset, there are 961 patients,
    including 638 suspected SARS patients and 323
    confirmed SARS patients.
  • The contact-tracing data of patients in this
    dataset has two main categories, personal and
    geographical contacts, and nine types of contacts
  • Personal contacts: family member, roommate,
    colleague/classmate, and close contact
  • Geographical contacts: foreign-country travel,
    hospital visit, high risk area visit, hospital
    admission history, and workplace

Taiwan SARS Data (Cont.)
  • Hospital admission history is the contact type
    with the largest number of records (43%).
  • Personal contacts consist primarily of
    family member records.

Category | Type of Contact | Records | Suspected Patients | Confirmed Patients
Personal | Family Member | 177 | 48 | 63
Personal | Roommate | 18 | 11 | 15
Personal | Colleague/Classmate | 40 | 26 | 23
Personal | Close Contact | 11 | 10 | 12
Geographical | Foreign-Country Travel | 162 | 100 | 27
Geographical | Hospital Visit | 215 | 110 | 79
Geographical | High Risk Area Visit | 38 | 30 | 7
Geographical | Hospital Admission History | 622 | 401 | 153
Geographical | Workplace | 142 | 22 | 120
Total | | 1425 | 638 | 323
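The two observations about the table can be checked directly from the record counts. The sketch below only redoes that arithmetic; the dictionary layout and variable names are illustrative choices, not part of the original dataset.

```python
# Record counts per contact type, copied from the table above.
records = {
    ("Personal", "Family Member"): 177,
    ("Personal", "Roommate"): 18,
    ("Personal", "Colleague/Classmate"): 40,
    ("Personal", "Close Contact"): 11,
    ("Geographical", "Foreign-Country Travel"): 162,
    ("Geographical", "Hospital Visit"): 215,
    ("Geographical", "High Risk Area Visit"): 38,
    ("Geographical", "Hospital Admission History"): 622,
    ("Geographical", "Workplace"): 142,
}

total = sum(records.values())
largest = max(records, key=records.get)
largest_share = records[largest] / total

personal_total = sum(n for (category, _), n in records.items()
                     if category == "Personal")
family_share = records[("Personal", "Family Member")] / personal_total

print(total, largest[1], f"{largest_share:.1%}", f"{family_share:.1%}")
```

Hospital admission history accounts for 622 of the 1425 records, and family member records make up 177 of the 246 personal-contact records.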
Research Design
Phase Analysis (Cont.)
  • Network Partition
  • We partition each contact network on a weekly
    basis with linkage accumulation.
  • From 2/24 to 5/4, there are 10 weeks in total.
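A weekly partition with linkage accumulation can be read as a series of cumulative snapshots: the network for week k contains every link observed up to the end of week k. The sketch below assumes that reading, assumes the dates fall in 2003, and uses hypothetical patient IDs and link dates.

```python
from datetime import date, timedelta


def weekly_cumulative_edges(dated_edges, start, weeks):
    """Return one cumulative edge list per week.

    dated_edges: iterable of (node_a, node_b, link_date).
    Snapshot k (1-based) keeps every link dated before the end of week k.
    """
    dated_edges = list(dated_edges)
    snapshots = []
    for k in range(1, weeks + 1):
        cutoff = start + timedelta(days=7 * k)  # first day after week k
        snapshots.append([(a, b) for a, b, d in dated_edges if d < cutoff])
    return snapshots


start = date(2003, 2, 24)          # Week 1 begins on 2/24
last_day = start + timedelta(days=7 * 10 - 1)

# Hypothetical links between hypothetical patients.
links = [
    ("P1", "P2", date(2003, 2, 26)),
    ("P2", "P3", date(2003, 3, 18)),
    ("P3", "P4", date(2003, 4, 23)),
]
snapshots = weekly_cumulative_edges(links, start, weeks=10)
sizes = [len(s) for s in snapshots]
print(last_day, sizes)
```

Because links only accumulate, edge counts never decrease from one week to the next, so week-to-week changes in the measurements reflect newly added contacts.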

Phase Analysis (Cont.)
  • Network Measurement
  • We investigate two factors that contribute to the
    transmission of disease in the macro-structure
  • Density: the degree of intensity to which people
    are linked together
  • Density
  • Average degree of nodes
  • Transferability: the degree to which people can
    infect others
  • Betweenness
  • Number of components
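The four measurements listed above have standard graph-theoretic definitions. The following is a minimal, dependency-free sketch for an undirected, unweighted network; betweenness uses Brandes' algorithm and is left unnormalized. The slides do not say which normalization or software was used, so this is illustrative only, and the example network is hypothetical.

```python
from collections import deque


def build_adjacency(nodes, edges):
    adj = {n: set() for n in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def density(adj):
    n = len(adj)
    m = sum(len(nbrs) for nbrs in adj.values()) / 2
    return 2 * m / (n * (n - 1)) if n > 1 else 0.0


def average_degree(adj):
    return sum(len(nbrs) for nbrs in adj.values()) / len(adj)


def count_components(adj):
    seen, count = set(), 0
    for start in adj:
        if start in seen:
            continue
        count += 1
        seen.add(start)
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
    return count


def betweenness(adj):
    """Unnormalized shortest-path betweenness (Brandes' algorithm)."""
    score = {n: 0.0 for n in adj}
    for s in adj:
        order, dist = [], {s: 0}
        preds = {n: [] for n in adj}
        paths = {n: 0 for n in adj}
        paths[s] = 1
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
                if dist[v] == dist[u] + 1:
                    paths[v] += paths[u]
                    preds[v].append(u)
        dep = {n: 0.0 for n in adj}
        for w in reversed(order):
            for u in preds[w]:
                dep[u] += paths[u] / paths[w] * (1 + dep[w])
            if w != s:
                score[w] += dep[w]
    # Each unordered pair was counted from both endpoints.
    return {n: b / 2 for n, b in score.items()}


# Hypothetical network: a chain A-B-C plus an unlinked patient D.
adj = build_adjacency(["A", "B", "C", "D"], [("A", "B"), ("B", "C")])
print(density(adj), average_degree(adj), count_components(adj))
print(betweenness(adj))
```

In the example, B is the only node lying on a shortest path between two others, so it alone has nonzero betweenness, and the isolated node D adds a second component.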

[Figure: example networks with higher vs. lower density and lower vs. higher transferability]
Connectivity Analysis
  • Geographical contacts provide much higher
    connectivity than personal contacts in the
    network construction.
  • Decrease the number of components from 961 to 82
  • Increase the average degree from 0.31 to 108.62

Applied Contacts in the Network Construction | Average Degree (Patient Nodes) | Maximum Degree (Patient Nodes) | Number of Components
Personal Contacts | 0.31 | 4 | 847
Geographical Contacts | 108.62 | 474 | 82
Personal + Geographical Contacts | 108.85 | 474 | 10
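One plausible reason geographical contacts raise connectivity so sharply is that every pair of patients sharing a place becomes linked, so a single hospital with many patients yields many links. The sketch below assumes that linking rule, which the slides do not spell out, and uses hypothetical patients and places.

```python
from itertools import combinations


def links_from_shared_places(visits):
    """visits: dict patient -> set of places. Link patients sharing a place."""
    by_place = {}
    for patient, places in visits.items():
        for place in places:
            by_place.setdefault(place, set()).add(patient)
    links = set()
    for patients in by_place.values():
        links.update(frozenset(pair)
                     for pair in combinations(sorted(patients), 2))
    return links


def summarize(patients, links):
    """Return (average degree, number of components) using union-find."""
    parent = {p: p for p in patients}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for link in links:
        a, b = tuple(link)
        parent[find(a)] = find(b)
    components = len({find(p) for p in patients})
    return 2 * len(links) / len(patients), components


patients = ["P1", "P2", "P3", "P4", "P5", "P6"]
personal_links = {frozenset(("P1", "P2"))}
visits = {p: {"Hospital H"} for p in ("P1", "P3", "P4", "P5")}
geo_links = links_from_shared_places(visits)

personal = summarize(patients, personal_links)
geographical = summarize(patients, geo_links)
combined = summarize(patients, personal_links | geo_links)
print(personal, geographical, combined)
```

Even in this six-patient toy case, one shared hospital produces six links against a single personal link, and combining both contact types merges components that neither connects alone, mirroring the pattern in the table.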
Connectivity Analysis (Cont.)
  • Hospital admission history provides the
    highest connectivity of nodes in the network
  • Hospital visits provide the second highest
  • This result is consistent with the fact that most
    patients were infected in the hospital
    outbreaks during the SARS period.

Applied Contacts in the Network Construction
Category | Type of Contact | Average Degree | Maximum Degree | Number of Components
Personal Contacts | Family Member | 0.204 | 4 | 893
Personal Contacts | Roommate | 0.031 | 2 | 946
Personal Contacts | Colleague/Classmate | 0.06 | 3 | 934
Personal Contacts | Close Contact | 0.023 | 1 | 949
Geographical Contacts | Foreign-Country Travel | 2.727 | 41 | 848
Geographical Contacts | Hospital Visits | 10.077 | 105 | 753
Geographical Contacts | High Risk Area Visit | 1.388 | 36 | 924
Geographical Contacts | Hospital Admission History | 50.479 | 289 | 409
Geographical Contacts | Workplace | 4.694 | 61 | 823
One-Mode Network with Only Patient Nodes
Contact Network with Geographical Nodes
Potential Bridges Among Geographical Nodes
  • Including geographical nodes helps reveal people
    who may act as bridges, transferring disease
    from one subgroup to another.
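In a network that keeps geographical nodes, one simple way to shortlist potential bridges is to look for patients attached to two or more geographical nodes. This criterion is an assumption made for illustration, not necessarily the one used in the study, and the contacts below are hypothetical.

```python
def potential_bridges(geo_contacts):
    """Return patients linked to two or more geographical nodes.

    geo_contacts: iterable of (patient, place) pairs.
    The result maps each such patient to a sorted list of places.
    """
    places_by_patient = {}
    for patient, place in geo_contacts:
        places_by_patient.setdefault(patient, set()).add(place)
    return {patient: sorted(places)
            for patient, places in places_by_patient.items()
            if len(places) >= 2}


# Hypothetical contacts: P2 appears at both hospitals.
contacts = [
    ("P1", "Hospital A"),
    ("P2", "Hospital A"),
    ("P2", "Hospital B"),
    ("P3", "Hospital B"),
]
bridges = potential_bridges(contacts)
print(bridges)
```

A patient flagged this way is only a candidate: confirming an actual transmission route would still require onset dates and epidemiological investigation.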

Network Visualization (Cont.)
  • For a hospital outbreak, including geographical
    nodes and contacts in the network is also useful
    to see the possible disease transmission scenario
    within the hospital.
  • Background of the Example
  • Mr. L, a laundry worker in H Hospital, had a
    fever on 2003/4/16 and was reported as a
    suspected SARS patient.
  • Nurse C took care of Mr. L on 4/16 and 4/17.
  • Nurse C and Ms. N, another laundry worker in H
    Hospital, began to have symptoms on 4/21.
  • H Hospital was reported to have a SARS outbreak
    on 4/24.
  • Nurse C's daughter had a fever on 5/1.

Phase Analysis: Density
  • Normalized density and average degree show
    similar patterns
  • In the importation phase, the foreign-country
    contact network increases dramatically in Week 4
    (3/17-3/23), followed by the personal contact
    network
  • In the hospital outbreak phase, both the personal
    and hospital networks increase dramatically. But
    in Week 10, the personal network still increases
    while the hospital network decreases.

[Figure: average degree]
Phase Analysis: Transferability
  • The betweenness results show that the personal
    network does not have enough transferability
    until Week 9.
  • The personal network forms only several small
    fragments, without big groups, in the importation
    phase.
  • By the number of components, the hospital network
    is the only one that consistently links
    patients together.

[Figure: number of components, with the hospital outbreak phase marked]
Ongoing Research
  • Worldwide infectious disease breaking news
    collection, monitoring, and analysis
  • Markov-switching model based disease surveillance
  • Infectious disease social network analysis and
    contact tracing
  • Other public health concerns and infectious
    disease applications

Building Research Partnership
  • Emerging critical medical and public health
  • Willing and engaging international domain
    (biomedical) partners and funding sources
  • Data, data, and more data
  • From academic research to scalable
    solutions/systems and lasting impacts

For more information
BioPortal web site: http://www.bioportal.org
AI Lab web site