Title: Knowledge Engineering and Semi-Automatic Population of Medical Ontologies Using NLP Methodologies
1. Knowledge Engineering and Semi-Automatic Population of Medical Ontologies Using NLP Methodologies
- Munich 11.06.2007
- Pinar Oezden Wennerberg
- pinar.oezden_at_jrc.it
2. Agenda
- Knowledge Engineering and Ontology
  - Definitions, methodologies, guidelines
- Medical Terminology and Natural Language Processing (NLP)
  - The problem of medical terminology
  - The context: users, tasks, types of information in the medical domain
  - The role of NLP and knowledge engineering
- Motivation for Semi-Automatic Ontology Population
  - The knowledge acquisition bottleneck
  - Vast amount of knowledge available in (un-/semi-)structured text, WWW, databases etc.
- One example approach
  - Ontology population via Supervised Machine Learning (ML)
- Challenges
3. Knowledge Engineering and Ontologies
- Some Definitions
  - Humans and software agents need knowledge about the world in order to reach good decisions
  - Such knowledge is typically stored in knowledge bases
  - Knowledge engineering is the process of building a knowledge base
  - A knowledge engineer is someone who
    - investigates a particular domain,
    - determines what concepts and relations are important in that domain,
    - and creates a formal representation of objects and relations in that domain.
  - (Russell & Norvig, 1995)
4. Knowledge Engineering and Ontologies
- An ontology specifies a finite, controlled, extensible and machine-processable vocabulary for a given knowledge base
  - Consists of concepts, properties, relations, axioms
- Knowledge engineering guidelines
  - Decide what to talk about and decide on the vocabulary
  - Encode general knowledge and a specific problem case
  - Execute queries and verify inference
  - (Russell & Norvig, 1995)
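The three guideline steps above can be illustrated with a toy example. This is only a minimal sketch, not a real ontology language: the `Ontology` class, its method names and the `MedicalEntity` top concept are assumptions made for illustration, and "inference" is reduced to following is-a links.

```python
from dataclasses import dataclass, field


@dataclass
class Ontology:
    """Toy ontology: concepts, an is-a hierarchy, relations and instances."""
    concepts: set = field(default_factory=set)
    parents: dict = field(default_factory=dict)    # concept -> parent concept (is-a)
    relations: dict = field(default_factory=dict)  # relation name -> (domain, range)
    instances: dict = field(default_factory=dict)  # instance -> concept

    def add_concept(self, name, parent=None):
        self.concepts.add(name)
        if parent is not None:
            self.parents[name] = parent

    def add_relation(self, name, domain, range_):
        self.relations[name] = (domain, range_)

    def add_instance(self, name, concept):
        if concept not in self.concepts:
            raise ValueError(f"unknown concept: {concept}")
        self.instances[name] = concept

    def is_a(self, concept, ancestor):
        """Very simple inference: follow is-a links upwards."""
        while concept is not None:
            if concept == ancestor:
                return True
            concept = self.parents.get(concept)
        return False

    def instances_of(self, concept):
        """Query: instances whose concept is `concept` or one of its subconcepts."""
        return sorted(i for i, c in self.instances.items() if self.is_a(c, concept))


# 1. Decide on the vocabulary (MedicalEntity is a hypothetical top concept)
onto = Ontology()
onto.add_concept("MedicalEntity")
onto.add_concept("Disease", parent="MedicalEntity")
onto.add_concept("DiseaseCause", parent="MedicalEntity")
onto.add_relation("causes", "DiseaseCause", "Disease")

# 2. Encode a specific problem case
onto.add_instance("cancer", "Disease")
onto.add_instance("smoking", "DiseaseCause")

# 3. Execute queries and verify inference
print(onto.instances_of("Disease"))
print(onto.instances_of("MedicalEntity"))
```

A production system would more likely use an established ontology language and reasoner; the sketch only shows how vocabulary, facts and queries relate to each other.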
5. Medical Terminologies and Natural Language Processing (NLP)
- Problem statement
  - Numerous heterogeneous medical terminologies and coding schemes exist that need to interoperate
  - e.g. Systematized Nomenclature of Medicine (SNOMED) for coding patient notes, ICD (International Classification of Diseases), ICD-9-CM for billing purposes, RIZIV, IDEWE, ICPC-2, ATC etc.
  - Existing efforts: UMLS, Galen, MeSH, etc.
6. Medical Terminologies and Natural Language Processing (NLP)
- Definition of context
  - Information types to be collected are about
    - Individuals (e.g. medical records)
    - Groups (e.g. data about epidemiology, public health)
    - Institutions (e.g. planning, management in hospitals, clinics)
    - Domain-specific knowledge (e.g. state-of-the-art publications, proceedings)
  - Domain-relevant tasks
    - Data entry, query and retrieval about patients
    - Information sharing and integration from different applications and medical records
7. Medical Terminologies and Natural Language Processing (NLP)
[Diagram of related fields, adapted from Jena University, www.julielab.de]
- Information Extraction
- Knowledge Representation and Reasoning
- Natural Language Processing
- Machine Learning
- Information Retrieval
- Knowledge Discovery, Text Mining
- Ontology Engineering
8. Motivation for Semi-Automatic Ontology Population
- The knowledge acquisition bottleneck
  - Ideally the knowledge engineer interviews the knowledge expert to get educated about the domain, i.e. to acquire knowledge
    - → expensive in time and resources
    - → domain experts not always available
- Availability of vast amounts of knowledge
  - In resources such as medical databases, journals, publications, conference proceedings, medical reports etc.
  - World Wide Web
9. Ontology Population via Supervised Machine Learning
- Problem statement
  - Identify and extract relevant knowledge (terms, phrases, relations, facts) in text, e.g.
    - Terms: health disorder, malfunction, sickness, illness, maladie, Krankheit → Disease
    - Smoking causes cancer → <Smoking, Cancer>
- Goal
  - Assign them to the appropriate concepts of the ontology as instances
    - Concept: Disease
    - Relation: causes
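The two examples in the problem statement can be sketched in a few lines. This is a hand-written illustration under stated assumptions, not the learned approach described on the following slides: the synonym table `TERM_TO_CONCEPT`, the regular expression `CAUSES_PATTERN` and both function names are hypothetical.

```python
import re

# Hypothetical synonym table: surface term (lower-cased) -> ontology concept
TERM_TO_CONCEPT = {
    "health disorder": "Disease",
    "malfunction": "Disease",
    "sickness": "Disease",
    "illness": "Disease",
    "maladie": "Disease",
    "krankheit": "Disease",
}

# Hypothetical pattern for sentences of the form "X causes Y"
CAUSES_PATTERN = re.compile(
    r"^(?P<cause>\w[\w ]*?)\s+causes\s+(?P<effect>\w[\w ]*?)\.?$",
    re.IGNORECASE,
)


def concept_for(term):
    """Return the concept a term is assigned to, or None if the term is unknown."""
    return TERM_TO_CONCEPT.get(term.strip().lower())


def extract_causal_pair(sentence):
    """Return a <Cause, Effect> pair for an 'X causes Y' sentence, else None."""
    match = CAUSES_PATTERN.match(sentence.strip())
    if match is None:
        return None
    return (match.group("cause").capitalize(), match.group("effect").capitalize())


print(concept_for("Krankheit"))
print(extract_causal_pair("Smoking causes cancer"))
```

A fixed table and a single pattern do not scale to real text, which is the motivation for learning extraction rules from annotated examples.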
10. Ontology Population via Supervised Machine Learning
- Processes
  - Annotate (i.e. supervised)
    - <CAU>Smoking</CAU> <CAU-R>causes</CAU-R> <DIS>cancer</DIS>
    - CAU: DiseaseCause, CAU-R: causalRelation, DIS: Disease
  - Learn and extract from a training set (i.e. ideal world)
  - Extract from the test set (i.e. unknown world)
    - Apply the learned rules on new documents to discover and extract new knowledge
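As a small illustration of the annotation step, the tagged sentence can be read back into (concept, text) pairs. This is a sketch that assumes well-formed, non-nested tags; `TAG_TO_CONCEPT`, `ANNOTATION` and `read_annotations` are names invented for this example.

```python
import re

# Tag abbreviations as listed on the slide
TAG_TO_CONCEPT = {
    "CAU": "DiseaseCause",
    "CAU-R": "causalRelation",
    "DIS": "Disease",
}

# Matches <TAG>text</TAG>; assumes tags are well-formed and not nested
ANNOTATION = re.compile(r"<(?P<tag>[A-Z-]+)>(?P<text>.*?)</(?P=tag)>")


def read_annotations(annotated):
    """Return (concept, text) pairs for every known tag in an annotated sentence."""
    pairs = []
    for match in ANNOTATION.finditer(annotated):
        concept = TAG_TO_CONCEPT.get(match.group("tag"))
        if concept is not None:  # silently skip unknown tags
            pairs.append((concept, match.group("text")))
    return pairs


sentence = "<CAU>Smoking</CAU> <CAU-R>causes</CAU-R> <DIS>cancer</DIS>"
print(read_annotations(sentence))
```

Pairs like these would form the training set from which extraction rules are learned.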
11. Ontology Population via Supervised Machine Learning
- Learn and extract from a training set (i.e. ideal world)
  - Recognize syntactic constructs such as NPs, VPs, PPs
  - Generate extraction rules
    - Rule for concept Disease
      - Disease :- <NP smoking><VP causes><NP DIS>
    - Rule for concept DiseaseCause
      - DiseaseCause :- <NP CAU><VP causes><NP cancer>
    - Rule for relation causalRelation
      - causalRelation :- <NP smoking><VP CAU-R><NP cancer>
  - Classify
    - Disease: cancer
    - DiseaseCause: smoking
    - causalRelation: causes
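The rules above can be applied to a chunked sentence roughly as follows. This is a minimal sketch under several assumptions: the chunker output is given as a list of (chunk type, text) pairs, the rules are written out by hand rather than learned, and the names `SLOTS`, `RULES`, `apply_rule` and `classify` as well as the second test sentence are illustrative only.

```python
# Slot markers used inside rules -> concept or relation they fill
SLOTS = {"DIS": "Disease", "CAU": "DiseaseCause", "CAU-R": "causalRelation"}

# Each rule is a sequence of (chunk type, word); a word found in SLOTS marks
# the position whose text is extracted, every other word must match literally.
RULES = [
    [("NP", "smoking"), ("VP", "causes"), ("NP", "DIS")],    # Disease
    [("NP", "CAU"), ("VP", "causes"), ("NP", "cancer")],     # DiseaseCause
    [("NP", "smoking"), ("VP", "CAU-R"), ("NP", "cancer")],  # causalRelation
]


def apply_rule(rule, chunks):
    """Return (concept, text) if the rule matches the chunk sequence, else None."""
    if len(rule) != len(chunks):
        return None
    filled = None
    for (rule_type, rule_word), (chunk_type, chunk_text) in zip(rule, chunks):
        if rule_type != chunk_type:
            return None
        if rule_word in SLOTS:
            filled = (SLOTS[rule_word], chunk_text)
        elif rule_word != chunk_text.lower():
            return None
    return filled


def classify(chunks):
    """Apply every rule and collect all successful extractions."""
    results = []
    for rule in RULES:
        extraction = apply_rule(rule, chunks)
        if extraction is not None:
            results.append(extraction)
    return results


training_sentence = [("NP", "Smoking"), ("VP", "causes"), ("NP", "cancer")]
new_sentence = [("NP", "smoking"), ("VP", "causes"), ("NP", "emphysema")]
print(classify(training_sentence))
print(classify(new_sentence))
```

On the unseen sentence only the Disease rule fires, because the other two rules require the literal word "cancer". Rules this specific generalize poorly, so a real learner would have to abstract over the literal context words.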
12. Ontology Population via Supervised Machine Learning
- Possible problems
  - More than one value was extracted for a given relation
  - Entities from different classes were extracted (multiple concept assignment, i.e. ambiguity)
  - Nothing was extracted
- Possible solutions
  - Present the user with all possible values and let the user decide
  - Assist the user with the decision process by assigning confidence scores to possible values
    - i.e. how strongly the system believes that what it suggests is relevant/true
  - Provide context information via text highlighting to justify the system's confidence
  - Provide empty data entry slots for users to enter their knowledge
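One way to picture the proposed solutions is a ranked list of candidate values with an empty slot as fallback. The slides do not say how confidence is computed; this sketch simply assumes the share of extractions that proposed a value, and the function names and the example values are hypothetical.

```python
from collections import Counter


def rank_candidates(extractions):
    """Rank candidate values by confidence.

    Confidence is assumed here to be the fraction of extractions that
    proposed the value; a real system could use any other scoring scheme.
    """
    if not extractions:
        return []
    counts = Counter(extractions)
    total = sum(counts.values())
    ranked = [(value, count / total) for value, count in counts.items()]
    # Highest confidence first; ties are broken alphabetically
    ranked.sort(key=lambda pair: (-pair[1], pair[0]))
    return ranked


def present(slot, extractions):
    """Build the line shown to the user for one slot of the ontology."""
    ranked = rank_candidates(extractions)
    if not ranked:
        # Nothing was extracted: offer an empty data entry slot
        return f"{slot}: <empty slot, please enter a value>"
    options = ", ".join(f"{value} ({confidence:.2f})" for value, confidence in ranked)
    return f"{slot}: {options}"


print(present("Disease", ["cancer", "cancer", "bronchitis", "cancer"]))
print(present("DiseaseCause", []))
```

The final decision stays with the user in both cases; the scores only order the suggestions.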
13. Challenges
- General challenges
  - It is difficult to eliminate the knowledge acquisition problem entirely
  - Due to the sensitivity of the domain (human health), the knowledge experts cannot be completely avoided
  - Computer scientists need to work together with domain experts to a certain extent
  - Systems should be usable by non-technicians
- Multilinguality
  - Healthcare workers, patients and administrators should be able to access information in their own language
14. Challenges
- Knowledge/ontology engineering specific challenges
  - Implicit information (typical for natural language), i.e. not explicit and therefore not machine-processable
  - Different levels of detail (granularity) are required to meet different expectations
    - i.e. provide sufficient detail but abstract away irrelevancies
  - Poly-hierarchies to support multiple views
    - may lead to ambiguities, contradictions
  - Adaptability, extensibility for changing user demands and for standards
  - Expressivity vs. computational tractability
  - Achieving consensus between practitioners
15. Questions?
- Evaluation
  - How do we know if we have a good system?
  - Should practitioners evaluate the efficiency and reliability of the developed systems?