Title: Infrastructure for Semantic Expansion and Curation of the RadLex Ontology
1 Infrastructure for Semantic Expansion and Curation of the RadLex Ontology
- Rebecca Hazen, Alexander van Esbroeck
- Northwestern University
- Dr. David Channin, Mentor
2 Background
- RadLex - Radiology Lexicon
- Reduce variation and improve clarity in radiology reports
- 11,962 terms over 12 categories
3 Establishing the need
- Missing many terms
- Imaging Observations
- Imaging Observation Characteristics
- Committee dependent development process
- Manual, time consuming, expensive
- Larger lexicons are harder to manage
- Difficult to sustain
4 Proposed Solution
- Develop an automatic term extraction system
- Focusing on Imaging Observations and Imaging Observation Characteristics
- Accelerate the expansion of RadLex
- Decrease the demands on committees
- Propose lists of strong candidates for inclusion
- Reduce development costs
5 Processing System Description
- Collect free full-text articles from medical journals
- Identify new terms using LexEVS and NLP techniques
- Create ranked lists of imaging observations and characteristics
6 Processing System Overview
[Diagram. Components: Article Finder, Article Text, Candidate Term Identification, LexEVS (Concepts/Relationships), Data/Annotations, Context Processing, Ranked Lists of Imaging Observations and Characteristics]
7 LexEVS
- LexEVS was developed by NCI, NIH, caBIG, and Mayo Clinic
- Designed to fulfill a community need for standards in storing, accessing, managing, and distributing controlled vocabularies
- Combination of LexBIG, LexGrid, EVS
- Programmable interfaces for accessing and distributing controlled vocabularies
- Provides a common API
8 UIMA Architecture
- Framework for processing large collections of documents
- Processing modules can be connected into pipelines
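The idea of connecting processing modules into a pipeline can be illustrated with a minimal Python sketch. This is not UIMA code (UIMA is a Java framework); the `Document` class, the two stage functions, and `run_pipeline` are hypothetical names chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Document:
    """Hypothetical container: raw text plus annotations added by stages."""
    text: str
    annotations: dict = field(default_factory=dict)


Stage = Callable[[Document], Document]


def tokenize(doc: Document) -> Document:
    # Naive whitespace tokenizer, used only to show data flowing between stages.
    doc.annotations["tokens"] = doc.text.split()
    return doc


def count_tokens(doc: Document) -> Document:
    # Depends on the annotation written by the previous stage.
    doc.annotations["n_tokens"] = len(doc.annotations["tokens"])
    return doc


def run_pipeline(docs: List[Document], stages: List[Stage]) -> List[Document]:
    """Pass every document through each stage, in order."""
    for stage in stages:
        docs = [stage(d) for d in docs]
    return docs


docs = run_pipeline([Document("increased renal enhancement")],
                    [tokenize, count_tokens])
print(docs[0].annotations)
```

Each stage reads what earlier stages wrote and adds its own annotations, so stages can be reordered or swapped without changing the driver.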
9 Article Finder
- Locates and retrieves scientific articles
- Searches PubMed
- Returns free full-text, English, HTML articles.
- Removes tags and extracts the article text
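The tag-removal step can be sketched with Python's standard `html.parser` module. This is an illustrative stand-in, not the Article Finder itself; the class name `TextExtractor`, the helper `extract_text`, and the sample HTML are assumptions, and the PubMed search step is not shown.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and skips the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self) -> str:
        # Collapse the runs of whitespace left behind by removed tags.
        return " ".join(" ".join(self._chunks).split())


def extract_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = ("<html><head><style>p{}</style></head><body><h1>Findings</h1>"
          "<p>A <b>luminal</b> mass was seen.</p></body></html>")
print(extract_text(sample))
```

A real article page would also need navigation, reference lists, and figure captions filtered out; that logic depends on each journal's markup and is omitted here.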
10 Articles Processed
- 1,128 documents
- Search terms: Imaging, CT, MR, PET, X-ray, US, angiography, tomography, findings [Title]
11 Candidate Phrase Identification
- Identifies a list of candidate phrases from the articles
- Tokenizer
- Part-of-speech Tagger
- Linguistic Filter
- Extracts sequences of words matching a specific pattern
- Example: "Increased renal enhancement" (-ed verb, adjective, noun)
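A linguistic filter of this kind can be sketched as a scan over part-of-speech tagged tokens. The sketch below assumes Penn Treebank tags, hand-tagged input, and a pattern of zero or more modifiers (`VBN`, `JJ`) followed by one or more nouns, keeping only multi-word matches. The exact pattern used in the project is not given in the slides, so the pattern and the name `candidate_phrases` are assumptions.

```python
from typing import List, Tuple

MODIFIER_TAGS = {"VBN", "JJ"}   # -ed verb (past participle), adjective
NOUN_TAGS = {"NN", "NNS"}       # singular and plural common nouns


def candidate_phrases(tagged: List[Tuple[str, str]]) -> List[str]:
    """Return maximal runs of modifiers followed by at least one noun.

    Only runs of two or more tokens are kept (an assumption of this sketch).
    """
    phrases = []
    i, n = 0, len(tagged)
    while i < n:
        j = i
        while j < n and tagged[j][1] in MODIFIER_TAGS:
            j += 1                      # consume leading modifiers
        k = j
        while k < n and tagged[k][1] in NOUN_TAGS:
            k += 1                      # consume the noun head(s)
        if k > j:                       # at least one noun was found
            if k - i >= 2:
                phrases.append(" ".join(word for word, _ in tagged[i:k]))
            i = k
        else:                           # no noun follows: skip ahead
            i = max(j, i + 1)
    return phrases


sample = [("Increased", "VBN"), ("renal", "JJ"), ("enhancement", "NN"),
          ("was", "VBD"), ("seen", "VBN"), ("in", "IN"), ("the", "DT"),
          ("left", "JJ"), ("kidney", "NN"), (".", ".")]
print(candidate_phrases(sample))
```

In practice the tags would come from the part-of-speech tagger stage rather than being written by hand.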
12 LexEVS Annotator
- Use LexEVS to access vocabularies
- RadLex 2.0, NCI Thesaurus, HL7, CTCAE
- Determine if phrases exist in RadLex as a single concept
- Retrieve vocabulary metadata
- What is that?
- Annotate the document
- Build database of annotations
- Develop inclusion/exclusion criteria
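The annotate-and-filter logic can be sketched without the real service. The in-memory `VOCABULARY` dictionary below is a hypothetical stand-in for a LexEVS lookup, and the concept codes are made-up placeholders, not real RadLex identifiers. No actual LexEVS API calls are shown.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical stand-in for a vocabulary service; the codes are placeholders.
VOCABULARY: Dict[str, Dict[str, str]] = {
    "fibrosis": {"coding_scheme": "RadLex", "concept_code": "EXAMPLE-1"},
    "liver": {"coding_scheme": "RadLex", "concept_code": "EXAMPLE-2"},
}


def lookup(phrase: str) -> Optional[Dict[str, str]]:
    """Exact match on the lower-cased, whitespace-normalized phrase."""
    return VOCABULARY.get(" ".join(phrase.lower().split()))


def annotate(phrases: List[str]) -> Tuple[List[dict], List[str]]:
    """Split phrases into known concepts (with metadata) and new candidates."""
    known, candidates = [], []
    for phrase in phrases:
        meta = lookup(phrase)
        if meta is not None:
            known.append({"phrase": phrase, **meta})
        else:
            candidates.append(phrase)
    return known, candidates


known, candidates = annotate(["Fibrosis", "interlobular septal thickening"])
print(known)
print(candidates)
```

Phrases that match an existing concept are kept as annotations for later context processing, while unmatched phrases remain candidates for inclusion.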
13 LexEVS Annotator
14 Context Processing
- Find indicator words that are associated with existing RadLex terms
- Assign weights to those words as a function of the number of RadLex terms with which they are associated
Example: "Focal confluent fibrosis can occur in the cirrhotic liver as a hepatic mass in approximately 14% of cases. This fibrosis is accompanied by atrophy of the affected liver parenchyma and retraction of the overlying liver capsule (Figure 9)."
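One way to weight indicator words is sketched below. It assumes single-token known terms, a fixed token window, and a weight equal to the fraction of observed known terms that a word appears near. The slides do not specify the actual weighting function, so the formula, the window size, and the name `indicator_weights` are all assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def indicator_weights(sentences: Iterable[List[str]],
                      known_terms: Set[str],
                      window: int = 2) -> Dict[str, float]:
    """weight(w) = distinct known terms seen within `window` tokens of w,
    divided by the number of distinct known terms seen in the corpus."""
    terms_for_word = defaultdict(set)
    seen_terms = set()
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok not in known_terms:
                continue
            seen_terms.add(tok)
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                word = tokens[j]
                if j != i and word not in known_terms:
                    terms_for_word[word].add(tok)
    if not seen_terms:
        return {}
    return {w: len(ts) / len(seen_terms) for w, ts in terms_for_word.items()}


sentences = [
    ["fibrosis", "is", "accompanied", "by", "atrophy"],
    ["atrophy", "is", "seen"],
    ["mass", "was", "seen"],
]
weights = indicator_weights(sentences, {"fibrosis", "atrophy", "mass"})
print(sorted(weights.items(), key=lambda kv: -kv[1]))
```

Function words such as "is" score highly in this toy corpus, which is one reason a stop list is usually applied before weighting.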
15 Context Processing
- Use those indicator words to identify new phrases
- Score new phrases as a function of the strength of their association with the indicator words
Example: "Less extensive findings included interlobular septal thickening. Interlobular septal thickening was seen in 32 patients (89%). A luminal mass was considered to be present if there was a soft-tissue mass in the lumen that arose from the bowel wall."
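Scoring a new phrase by its indicator words can be sketched as a sum of weights over a window around each occurrence. The window size, the additive scoring, the function name `context_score`, and the weights in the example are illustrative assumptions, not values from the project.

```python
from typing import Dict, List, Sequence


def context_score(tokens: List[str],
                  phrase: Sequence[str],
                  weights: Dict[str, float],
                  window: int = 3) -> float:
    """Sum the weights of indicator words found within `window` tokens
    before or after each occurrence of `phrase` in `tokens`."""
    n, m = len(tokens), len(phrase)
    target = list(phrase)
    score = 0.0
    for start in range(n - m + 1):
        if tokens[start:start + m] != target:
            continue
        before = tokens[max(0, start - window):start]
        after = tokens[start + m:start + m + window]
        score += sum(weights.get(w, 0.0) for w in before + after)
    return score


tokens = "interlobular septal thickening was seen in 32 patients".split()
weights = {"seen": 0.5, "was": 0.25}  # illustrative weights only
print(context_score(tokens, ("interlobular", "septal", "thickening"), weights))
```

Phrases that repeatedly appear next to strong indicator words accumulate higher scores, which then feed into the ranking step.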
16 Phrase Ranking
- Calculate a termhood¹ value for each phrase
- Termhood is based on a combination of:
- Nesting
- Context Scores
- Length
- Orthography
- Stop List
¹ Termhood refers to the likelihood that a candidate is a real term [2].
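The nesting and length components can be sketched with the C-value measure described in reference 2 (Frantzi et al.). This is a simplified reading of that measure. The slides do not state whether the project used exactly this formula, and context scores, orthography, and the stop list are not modeled. The frequencies in the example are invented.

```python
import math
from typing import Dict, Tuple

Phrase = Tuple[str, ...]


def is_nested(short: Phrase, long: Phrase) -> bool:
    """True if `short` is a contiguous sub-sequence of the longer `long`."""
    m = len(short)
    return m < len(long) and any(
        long[i:i + m] == short for i in range(len(long) - m + 1)
    )


def c_value(freq: Dict[Phrase, int]) -> Dict[Phrase, float]:
    """C-value: log2(length) * frequency, with the frequency discounted by
    the mean frequency of the longer candidates that contain the phrase."""
    scores = {}
    for a, f_a in freq.items():
        containers = [b for b in freq if is_nested(a, b)]
        if containers:
            nested_mean = sum(freq[b] for b in containers) / len(containers)
            scores[a] = math.log2(len(a)) * (f_a - nested_mean)
        else:
            scores[a] = math.log2(len(a)) * f_a
    return scores


freq = {  # invented counts, for illustration only
    ("soft", "tissue"): 10,
    ("soft", "tissue", "mass"): 4,
    ("septal", "thickening"): 6,
}
scores = c_value(freq)
for phrase, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(" ".join(phrase), round(score, 2))
```

Because of the `log2(length)` factor, single-word candidates score zero, which matches the measure's focus on multi-word terms.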
17 Term Splitting
- Phrases typically consist of an observation accompanied by one or more characteristics of that observation
- Term splitting splits phrases into component characteristics and observations
- Based on frequency ratios
- Makes two new ranked lists
Candidate term: mediastinal soft tissue infiltration
- mediastinal
- soft tissue
- infiltration
18 Results
- imaging observations
- imaging observation characteristics
- precision
- Precision is defined as .
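Precision is conventionally the fraction of proposed candidates that reviewers judge to be correct. The sketch below assumes that standard definition applies here; the function names and the example data are hypothetical and do not reproduce the project's results.

```python
from typing import Iterable, Sequence, Set


def precision(proposed: Iterable[str], accepted: Set[str]) -> float:
    """Fraction of proposed terms that are in the accepted set."""
    proposed = list(proposed)
    if not proposed:
        return 0.0
    hits = sum(1 for term in proposed if term in accepted)
    return hits / len(proposed)


def precision_at_k(ranked: Sequence[str], accepted: Set[str], k: int) -> float:
    """Precision over the top `k` entries of a ranked list."""
    return precision(ranked[:k], accepted)


# Hypothetical ranked candidates and reviewer judgments.
ranked = ["septal thickening", "soft tissue", "left side", "luminal mass"]
accepted = {"septal thickening", "luminal mass"}
print(precision(ranked, accepted))
print(precision_at_k(ranked, accepted, 1))
```

Reporting precision at several cutoffs shows whether the ranking concentrates valid terms near the top of each list.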
19 Conclusions
- LexEVS is a powerful tool for exploiting a variety of controlled vocabularies
- Automatic term extraction can identify new imaging observations and observation characteristics
- Adjusting context and processing can lead to other kinds of terms
- Broader searches for articles will lead to larger collections of terms
20 Future Work
- Use syntactic structure to improve extraction
- Automatic identification of relationships
- Infrastructure for distributed editing
- Semantic Wiki
21 Selected References
- 1. Langlotz CP. RadLex: a new method for indexing online educational materials. Radiographics. 2006 Nov-Dec;26(6):1595-7.
- 2. Frantzi K, Ananiadou S, Mima H. Automatic recognition of multi-word terms: the C-value/NC-value method. International Journal on Digital Libraries. 2000;3(2):115-130.
- 3. Baneyx A, Charlet J, Jaulent M. Building an ontology of pulmonary diseases with natural language processing tools using textual corpora. International Journal of Medical Informatics. 2007;76(2-3):208-215.
- 4. Zhou L, Tao Y, Cimino J, Chen E, Liu H, Lussier Y, Hripcsak G, Friedman C. Terminology model discovery using natural language processing and visualization techniques. Journal of Biomedical Informatics. 2006;39(6):626-636.
- 5. Church K, Hanks P. Word association norms, mutual information, and lexicography. Computational Linguistics. 1990;16(1):22-29.
- 6. Snow R, Jurafsky D, Ng A. Learning syntactic patterns for automatic hypernym discovery. Advances in Neural Information Processing Systems. 2005;17:1297-1304.
22 Example Query
- NCI MetaThesaurus
- Cronkhite-Canada Syndrome
- exactMatch
23 Structure
24 References
- Retrieve Candidate Terms
- Query LexEVS
- Selecting Scheme, Search Algorithm, Restrictions
- Add Tags
- Coding Scheme, Concept Code, etc
- Store in database
25 UIMA Pipeline Lexicons
- LexEVS Annotator
- Marks existing observations and characteristics, as well as anatomic parts, treatments, etc.
- Can be used for context information later on.
- Filters out existing terms.
[Figure: Mock-up of Annotation Results]
26 Context Detection
- Looks for the identified context words before/after candidate terms.
- Calculates a context value for each term based on the frequency of context terms and their weights.
[Figure: Sample Context Terms]
27 Term Ranking Results
28 Room for Improvement
- Improve precision of candidate term selection
- Use context-term groups as classifiers.
- Segment the article and identify term-rich areas
- Normalize the context and inverse document frequency distributions.
- Context phrases
- Improve linguistic filter and use more POS information
- Stop list additions