Title: Medical Digital Library to Support Scenario Specific Information Retrieval
1Medical Digital Library to Support Scenario
Specific Information Retrieval
- Wesley W. Chu
- wwc_at_cs.ucla.edu
- Computer Science Department
- University of California
- Los Angeles, California
2A Project of the NIH Grant at UCLA
- A Digital File Room for Patient Care, Education,
and Research
3Background
Background Hypothesis Specific Aims
Significance Approach and Innovations
Research Progress
- Current file rooms managing patient records have
limited functionality
- Main goal of mapping patient ID to patient
records
- PACS implementations are an electronic version of
the traditional file room
4Background
Lack of structure makes...
- Finding relevant information for a particular
user is time consuming and labor intensive
- Poorly structured and incomplete results, which
may affect patient management
- Current search tools are limited to general use and
not tailored to specific users or tasks
5Digital File Room Requirements
- A navigable information space providing
- Relevant and reputable information
- Access to similar patient records
- Content-based cross referencing
- Dynamically updated data repository
- Tailored access for specific users and devices
6Hypotheses
- A digital file room (digital library) that
delivers relevant and structured answers to
a specific query can be developed from existing
medical databases
- Such a digital file room will increase user
satisfaction and improve patient management
7Specific Aims
- SA1: Develop a system that identifies and
provides access to reputable information sources
- SA2: Provide users with greater query capability
(e.g. similar-to, approximate)
- SA3: Extract knowledge from patient data, medical
literature and radiology teaching files to
support content-based cross-referencing
- SA4: Provide access to dynamically updated
collections based on patient data
- SA5: Adapt information retrieval to user and
device characteristics
8Significance
- Extend patient record to provide tailored and
timely access to a broader array of reputable
medical information
9Approach and Innovations
- Intelligent information registration
- Provide access to multiple, related data sources
through a single access point
- Content-based navigation and matching
- Develop similarity matching based on medical
concepts and patterns
- Content correlation
- User and device modeling
- Adaptive information retrieval based on user and
device models
- Scenario-based information web (proxies)
- Develop information web linking clustered data
sources for a given set of related tasks (i.e.,
scenario)
10Intelligent Information Registration
- Registers multiple information sources to provide
transparent access through a single point (proxy
object).
- Information requests are routed to appropriate
data sources based on query characteristics
- Data sources are hierarchically clustered
according to a four-layer data model
11Content-Based Navigation and Matching
- Two types of navigation
- Navigation of the information space using proxies
and content correlation
- Pattern/similarity navigation using type
abstraction hierarchies (TAHs)
12Pattern-Based Type Abstraction Hierarchies
- Scalable, hierarchical knowledge structures that
facilitate similarity matching
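As a rough illustration of how a hierarchy of this kind could support similarity matching, the sketch below builds a small TAH over one numeric attribute and relaxes an exact value to the range of its enclosing node. The attribute (a size in mm), the ranges, and the names `TahNode`, `path_to_leaf`, and `similar_range` are hypothetical and are not taken from the project.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class TahNode:
    """One node of a toy type abstraction hierarchy over a numeric attribute."""
    label: str
    low: float    # inclusive lower bound
    high: float   # exclusive upper bound
    children: List["TahNode"] = field(default_factory=list)

    def covers(self, value: float) -> bool:
        return self.low <= value < self.high


def path_to_leaf(root: TahNode, value: float) -> List[TahNode]:
    """Nodes from the root down to the most specific node covering value."""
    if not root.covers(value):
        return []
    path = [root]
    node = root
    while True:
        child = next((c for c in node.children if c.covers(value)), None)
        if child is None:
            return path
        path.append(child)
        node = child


def similar_range(root: TahNode, value: float,
                  relax: int = 0) -> Optional[Tuple[float, float]]:
    """Range of values treated as similar to value.

    relax=0 uses the most specific node; each extra level moves one
    step toward the root and so widens the match.
    """
    path = path_to_leaf(root, value)
    if not path:
        return None
    node = path[max(0, len(path) - 1 - relax)]
    return (node.low, node.high)


# Hypothetical hierarchy: sizes in mm, ranges chosen only for the example.
tah = TahNode("any size", 0, 100, [
    TahNode("small", 0, 20, [TahNode("tiny", 0, 5),
                             TahNode("smallish", 5, 20)]),
    TahNode("large", 20, 100),
])
```

With this hierarchy, a query for a 7 mm value is matched against the node covering 5 to 20 mm, and relaxing one level widens it to 0 to 20 mm. Moving up the tree trades precision for recall, which is the kind of "similar-to" behavior the slide describes.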
13Adaptive Information Retrieval
- Tailors query processing and query results
according to
- Particular user
- Characteristics of their device
- Examples
- Doctors prefer JAMA or Lancet while patients
prefer Time or CNN.
- High resolution workstations support large,
detailed imaging studies while portable devices
need lower-bandwidth data.
- Allows the system to retrieve appropriate data
for a particular query, user, and device
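A minimal sketch of this idea, assuming user and device models can be reduced to simple lookup tables: results are filtered by the sources a user role prefers and image sizes are capped by what the device can display. The source lists come from the examples on this slide; the pixel limits, the `Result` record, and the `adapt` function are hypothetical.

```python
from dataclasses import dataclass
from typing import List

# Source preferences taken from the slide's examples.
PREFERRED_SOURCES = {
    "doctor": ["JAMA", "Lancet"],
    "patient": ["Time", "CNN"],
}

# Hypothetical display limits in pixels, chosen only for illustration.
MAX_IMAGE_WIDTH = {"workstation": 2048, "portable": 320}


@dataclass
class Result:
    source: str
    title: str
    image_width: int


def adapt(results: List[Result], user_role: str, device: str) -> List[Result]:
    """Keep results from preferred sources and cap image width for the device."""
    preferred = PREFERRED_SOURCES.get(user_role, [])
    limit = MAX_IMAGE_WIDTH.get(device, 640)  # assumed default for unknown devices
    return [
        Result(r.source, r.title, min(r.image_width, limit))
        for r in results
        if r.source in preferred
    ]


hits = [
    Result("JAMA", "Aspirin trial", 1600),
    Result("CNN", "Aspirin in the news", 1200),
]
```

Calling `adapt(hits, "patient", "portable")` keeps only the CNN item and reduces its image width to 320, while `adapt(hits, "doctor", "workstation")` keeps the JAMA item at full size.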
14Scenario-Based Proxy
- A framework that defines, for a particular domain
and set of tasks, the access methods to and the
relationships between information sources.
- intelligent information registration
- pattern-based similarity matching
- adaptive information retrieval
15Scenario-Based Information Web
- A directed graph that defines access paths for
navigation among proxy objects
[Figure: directed graph over the proxy objects Patient, Literature, and Teaching File, with edges labeled similar-to and correlated-to]
16Scenario-Based Information Web
- Similar-to links relate objects based on their
similarity
- patients similar by age, sex, and disease
- Correlated-to links relate objects based on
related content
- disease can be correlated to relevant literature
Extends the scope of the digital file room into a
digital medical library
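The information web can be pictured as a small labeled directed graph. The sketch below is one possible representation, assuming proxy objects are identified by name and that only the two link types named on these slides exist. The class `InformationWeb`, its methods, and the particular edges added in the example are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

LINK_TYPES = ("similar-to", "correlated-to")


class InformationWeb:
    """Directed graph whose edges are typed access paths between proxy objects."""

    def __init__(self) -> None:
        # source proxy -> list of (link type, target proxy)
        self._edges: Dict[str, List[Tuple[str, str]]] = defaultdict(list)

    def link(self, source: str, link_type: str, target: str) -> None:
        if link_type not in LINK_TYPES:
            raise ValueError(f"unknown link type: {link_type}")
        self._edges[source].append((link_type, target))

    def follow(self, source: str, link_type: str) -> List[str]:
        """Proxies reachable from source over one link of the given type."""
        return [t for lt, t in self._edges.get(source, []) if lt == link_type]


# Example edges, assumed for illustration only.
web = InformationWeb()
web.link("Patient", "similar-to", "Patient")
web.link("Patient", "correlated-to", "Literature")
web.link("Patient", "correlated-to", "Teaching File")
```

Here `web.follow("Patient", "correlated-to")` returns the Literature and Teaching File proxies, so a navigation step amounts to choosing a link type and following it from the current proxy.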
17Research Progress
- Phrase Indexing
- Phrases are generated from n-word combinations in a
sentence.
- Domain Specific Retrieval
- Document Summarization
- Content Correlation
- Linking of relevant documents via patterns
18Domain Specific Retrieval
- Documents are grouped into domain-specific
collections
- Medical patient reports
- Web sites are often tailored to specific subject
areas
- Phrases can capture content better than single
words, thus improving retrieval performance
19Problem With Longer Phrases
Large combinatorial problem
To process longer phrases it is necessary to
partition documents into smaller segments
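To make the combinatorial growth concrete, assume a phrase is any unordered combination of k words drawn from a segment of n distinct words (n and k are symbols introduced here for illustration). The number of candidate k-word phrases in one segment is then

$$
\binom{n}{k} = \frac{n!}{k!\,(n-k)!}
$$

For a segment of n = 30 words this gives 435 two-word, 4060 three-word, and 27405 four-word combinations, 31900 in total. For a segment of n = 300 words, the four-word combinations alone number about 3.3 × 10^8. Keeping segments short, for example one sentence at a time, keeps n small and the candidate set manageable.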
20Phrase Analysis
- A phrase is defined as any 2, 3 or 4 words
co-occurring in a sentence (word combination)
- Very large number of possible phrases
- Use a stoplist to remove useless words
- Normalize words to a common stem
Example (from the slide figure):
sentence: The right upper lobe mass is seen again.
case normalization: the right upper lobe mass is seen again
stop word removal: right upper lobe mass seen again
stemming: right upp lob mass seen again
sorting: the six stems (again, lob, mass, right, seen, upp) placed in a fixed order
candidate 2-word combinations (15): again lob, again mass, again right, again seen, again upp, lob mass, lob right, lob seen, lob upp, mass right, mass seen, mass upp, right seen, right upp, seen upp
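The phrase-analysis steps on this slide (case normalization, stop word removal, stemming, sorting, and n-word combination) can be sketched as follows. This is a toy version under stated assumptions: the stoplist is a small made-up set, and `toy_stem` is a hypothetical suffix stripper chosen only so that "upper" and "lobe" reduce to "upp" and "lob" as in the slide example. The project's actual stoplist and stemmer are not given in the source.

```python
import re
from itertools import combinations
from typing import Dict, List, Tuple

# Assumed stoplist; the real one is not specified in the slides.
STOPLIST = {"the", "is", "a", "an", "of", "in", "and"}


def toy_stem(word: str) -> str:
    """Hypothetical stemmer: strip a trailing 'er' or 'e', keeping 3+ letters."""
    for suffix in ("er", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def sentence_phrases(sentence: str,
                     sizes: Tuple[int, ...] = (2, 3, 4)
                     ) -> Dict[int, List[Tuple[str, ...]]]:
    """Candidate n-word phrases for one sentence, keyed by phrase length."""
    words = re.findall(r"[a-z]+", sentence.lower())      # case normalization
    kept = [w for w in words if w not in STOPLIST]        # stop word removal
    stems = sorted({toy_stem(w) for w in kept})           # stemming, then sorting
    # Sorting first means each word combination has one canonical form.
    return {n: list(combinations(stems, n)) for n in sizes}


phrases = sentence_phrases("The right upper lobe mass is seen again.")
```

For the example sentence, `phrases[2]` holds the 15 two-word combinations of the six stems, and `phrases[3]` and `phrases[4]` hold 20 and 15 combinations respectively.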
21Document Retrieval Evaluation
- Preliminary evaluation
- A domain specific collection of documents
- Can phrase analysis limited to sentences improve
retrieval effectiveness?
- SMART system (single word terms) used as baseline
- Data
- Thoracic radiology patient reports
- Dictated reports
- Describe anatomy and abnormal findings such as
enlarged lymph nodes and cancer masses
22Domain Specific Document Retrieval
- Query: right upper lobe mass
23Automatic Text Summarization
- Salton Method
- Given a text file with n paragraphs
- A paragraph can be represented by $D_i = (d_{i1}, d_{i2}, \ldots, d_{im})$
- $d_{ik}$ is the weight representing the importance
of term $T_k$ (word or phrase)
- The pair-wise similarity of two paragraphs
- $\mathrm{Sim}(D_i, D_j) = \sum_{k=1}^{m} d_{ik}\, d_{jk}$
- Text relationship map
- Nodes: paragraphs
- Links: pair-wise similarity of the connected
nodes
- Links are created if $\mathrm{Sim}(D_i, D_j) >$ threshold
Bushiness of a node = number of links of the node.
Text summarization is derived from the bushy nodes.
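The bushiness computation can be sketched in a few lines. This is a minimal illustration, not the project's implementation: it assumes term weights are length-normalized term frequencies over whitespace-separated words (the slide does not say how the weights d_ik are computed), and the function names are hypothetical.

```python
from collections import Counter
from math import sqrt
from typing import Dict, List


def paragraph_vector(text: str) -> Dict[str, float]:
    """Assumed weighting: term frequency scaled to unit vector length."""
    counts = Counter(text.lower().split())
    norm = sqrt(sum(c * c for c in counts.values()))
    if norm == 0:
        return {}
    return {term: c / norm for term, c in counts.items()}


def similarity(di: Dict[str, float], dj: Dict[str, float]) -> float:
    """Sim(Di, Dj): sum over shared terms of d_ik * d_jk."""
    return sum(w * dj[term] for term, w in di.items() if term in dj)


def bushiness(paragraphs: List[str], threshold: float) -> List[int]:
    """Number of links per paragraph in the text relationship map."""
    vectors = [paragraph_vector(p) for p in paragraphs]
    degree = [0] * len(vectors)
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            # A link exists only when similarity exceeds the threshold.
            if similarity(vectors[i], vectors[j]) > threshold:
                degree[i] += 1
                degree[j] += 1
    return degree


def rank_by_bushiness(paragraphs: List[str], threshold: float) -> List[int]:
    """Paragraph indices, bushiest first; ties keep document order."""
    degree = bushiness(paragraphs, threshold)
    return sorted(range(len(paragraphs)), key=lambda i: -degree[i])


docs = ["aspirin reduces pain", "aspirin reduces fever", "weather is sunny"]
```

With a threshold of 0.3, the first two paragraphs of `docs` are linked to each other and the third is isolated, so a summary would be drawn from the first two. Replacing single words with phrases only changes how `paragraph_vector` tokenizes the text.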
24Performance Comparison of Salton's Summarization
Method Based on Phrase and Single Word
Aspirin.txt: paragraph ranking based on bushiness (entries are paragraph numbers)

            words            2W phrases       3W phrases
Threshold   0.1  0.2  0.3    0.1  0.2  0.3    0.1  0.2  0.3
No.1        4    6    8      2    2    2      2    2    2
No.2        6    8    2      3    3    3      3    3    3
No.3        8    3    3      6    6    6      8    8    8
No.4        1    4    4      1    4    4      4    4    4
No.5        5    5    5      8    5    5      6    6    6
No.6        2    1    6      4    1    1      5    5    5
No.7        3    2    1      5    8    8      7    7    7
No.8        9    9    9      7    7    7      1    1    1
No.9        7    7    7      9    9    9      9    9    9
Summarization based on phrases is less sensitive
to the threshold setting than summarization based on single words.
25 N-words Distribution
26 Number Distinct Freq Words
27 Number of Valid Sentences
28Performance Comparison
29 Comparison (cont)
30Content Correlation
- Given a document in one collection, content
correlation links relevant documents in another
document collection
31Document Cluster By Pattern
- A pattern is a set of unique terms that
characterize some features in the data set
- Patterns can be found in a collection of
documents by data mining
- Documents are grouped into clusters based on
patterns via a clustering technique
32Cluster Signature
- Every cluster can be classified according to the
occurrence frequency of the patterns
- Looking to answer
- Which set of patterns summarizes a given cluster?
- How are the patterns related among the clusters?
[Figure labels: Literature, Patient Records]
33Deriving Cluster Signature
- Metrics
- Local Cluster Certainty (LCC) measures the
coverage of a pattern in a given cluster
(Popularity)
- The Global Cluster Certainty (GCC) measures the
coverage of a pattern among clusters
(Exclusiveness)
- The Cluster Signature is the set of those
patterns that have both high LCC and GCC
- Documents from one collection (source) can be
linked to relevant clusters in another collection
(target)
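The slides describe LCC and GCC only qualitatively, so the sketch below uses stand-in formulas that are assumptions made here, not the project's definitions: LCC is taken as the fraction of a cluster's documents that contain a pattern, and GCC as the share of all documents containing the pattern that fall in that cluster. Documents are modeled as sets of stemmed terms, and all function names and thresholds are hypothetical.

```python
from typing import Dict, FrozenSet, List, Set

Cluster = List[Set[str]]          # a cluster is a list of documents (term sets)
Pattern = FrozenSet[str]


def lcc(pattern: Pattern, cluster: Cluster) -> float:
    """Assumed popularity measure: share of the cluster's documents with the pattern."""
    if not cluster:
        return 0.0
    return sum(pattern <= doc for doc in cluster) / len(cluster)


def gcc(pattern: Pattern, name: str, clusters: Dict[str, Cluster]) -> float:
    """Assumed exclusiveness measure: share of all pattern hits found in this cluster."""
    hits_here = sum(pattern <= doc for doc in clusters[name])
    hits_all = sum(pattern <= doc for c in clusters.values() for doc in c)
    return hits_here / hits_all if hits_all else 0.0


def cluster_signature(name: str, clusters: Dict[str, Cluster],
                      patterns: List[Pattern],
                      min_lcc: float = 0.8, min_gcc: float = 0.8) -> List[Pattern]:
    """Patterns that are both popular in and exclusive to the named cluster."""
    return [p for p in patterns
            if lcc(p, clusters[name]) >= min_lcc
            and gcc(p, name, clusters) >= min_gcc]


# Toy data, made up for the example.
clusters = {
    "A": [{"laparoscop", "pediatr"}, {"laparoscop", "perform"}],
    "B": [{"urolog", "pediatr"}, {"urolog"}],
}
patterns = [frozenset({"laparoscop"}), frozenset({"pediatr"})]
```

In this toy data, "laparoscop" appears in every document of cluster A and nowhere else, so it enters A's signature, while "pediatr" is split between the two clusters and is excluded. A source document could then be linked to the target clusters whose signature patterns it contains.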
34Preliminary Results
- A collection of 69 pediatric urology literature
abstracts taken from Medline was clustered using
the complete-link clustering algorithm
- 3 large clusters, each with 2 or more
sub-clusters
- GCC and LCC were calculated for patterns found in
several sub-clusters
- Data from one sub-cluster is reported here
35GCC
LCC
Term/Phrase                    Cl
Pediatr                        1.0
Result                         1.0
Patient                        1.0
Perform                        1.0
Compl                          1.0
Laparoscop                     1.0
Urolog                         0.34
Laparoscop pediatr             1.0
Laparoscop perform             1.0
Diagnost laparoscop            0.35
Laparoscop operat              0.35
Compl rate                     0.35
Laparoscop patient             0.35
Laparoscop operat perform      0.0817
Laparoscop patient perform     0.0817

GCC
Term/Phrase                    Cg
Laparoscop                     0.1887
Compl                          0.0817
Child Laparoscop               1.0
Laparoscop patient             1.0
Compl Laparoscop               1.0
Comple techn                   1.0
<MEAS> compl                   1.0
Laparoscop perform             0.6088
Compl rate                     0.4564
Laparoscop patient perform     1.0
Laparoscop perform procedur    1.0
<MEAS> compl rate              1.0
Laparoscop pediatr perform     1.0
Compl laparoscop techn         1.0
36Project Summary
- A system that provides
- relevant and reputable information,
- access to similar patient records,
- content-based cross referencing,
- a dynamically updated data repository, and
- tailored access for specific users and devices
- will
- augment the patient record to provide tailored
and timely access to a broader array of reputable
information and
- extend the digital file room into a digital
medical library.
37Research Results
- Phrase Indexing
- Developed an efficient algorithm for extracting
n-word features from textual documents
- Phrase indexes provide better results than single
word indexes in document retrieval and
summarization
- Content Correlation via Cluster Signature (LCC
and GCC)
- Preliminary results reveal the feasibility of using
cluster signatures for linking relevant documents
- Work begun on proxy for information navigation
38Future Work
- Develop Ontology for Intelligent Information
Registration
- User Model for Information Retrieval