Title: Mining for Lexons: Applying Unsupervised Learning Methods to create Ontology Bases
1 Mining for Lexons: Applying Unsupervised Learning Methods to create Ontology Bases
- Marie-Laure Reinberger, Peter Spyns, Walter Daelemans, Robert Meersman
- CNTS - University of Antwerp
- STARLab - Vrije Universiteit Brussel
2 Project OntoBasis
- Elaboration and adaptation of Text Analysis tools for building domain-specific Ontologies
- The role of Ontologies in Database and Web Semantics
3 Outline
- DOGMA
- Text Mining
  - Syntactic analysis
  - Clustering
  - Evaluation
  - Results
- Conclusion
4 A DOGMA-inspired ontology
- DOGMA: Developing Ontology-Guided Mediation for Agents, an ontology engineering approach
- Ontology Base consisting of binary relations, or lexons, extracted automatically
- Layer of Ontological Commitments (rules)
5 First step: Ontology Base
- Unsupervised extraction of relevant terms
- Grouping or clustering of those terms
- Medical domain
- 4M corpus - Medline abstracts
6 Outline
- DOGMA
- Text Mining
  - Syntactic analysis
  - Clustering
  - Evaluation
  - Results
- Conclusion
7 Syntactic analysis
- NP1Subject The/DT patients/NNS NP1Subject
  VP1 followed/VBD VP1
  NP1Object a/DT '/'' healthy/JJ '/'' diet/NN NP1Object
  and/CC
  NP2Subject 20/CD /NN NP2Subject
  VP2 took/VBD VP2
  NP1Object a/DT high/JJ level/NN NP3Object
  PNP P of/IN P NP physical/JJ exercise/NN NP PNP
  ./.
8 Data selection
- Subject-Verb-Object structure: selectional restriction, functional relation
- Selection of structures according to frequency
- Building of initial classes
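
The selection step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the triples, the `min_freq` cutoff, and the choice to group objects by verb are all assumptions made for the example.

```python
from collections import Counter, defaultdict

# Hypothetical (subject, verb, object) triples, as a shallow parser might yield them.
triples = [
    ("patients", "follow", "diet"),
    ("patients", "take", "exercise"),
    ("patients", "receive", "vaccine"),
    ("children", "receive", "vaccine"),
    ("infants", "receive", "treatment"),
    ("patients", "receive", "treatment"),
    ("patients", "follow", "diet"),
]

def build_initial_classes(triples, min_freq=2):
    """Keep verb-object pairs seen at least min_freq times; group objects by verb.

    min_freq is an assumed cutoff; the slides only say that structures
    are selected according to frequency.
    """
    pair_counts = Counter((verb, obj) for _, verb, obj in triples)
    classes = defaultdict(set)
    for (verb, obj), count in pair_counts.items():
        if count >= min_freq:
            classes[verb].add(obj)
    return dict(classes)

classes = build_initial_classes(triples)
print(classes)
```

With this toy input, the pair (take, exercise) occurs only once and is dropped, so only the classes for "follow" and "receive" remain.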
9 Clustering algorithms
- Soft clustering
- Hard clustering
- Cluster merging
10 Soft clustering
- each verb is associated with its class of co-occurring nouns
- nouns are clustered according to the similarity between classes of nouns
- the algorithm runs as long as a cluster is modified
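
A rough sketch of the loop described above, under stated assumptions: the slides do not name the similarity measure or the merge rule, so Jaccard overlap, the 0.5 threshold, and merging by set union are illustrative choices, and the verb classes are invented.

```python
def jaccard(a, b):
    """Overlap between two sets (assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def soft_cluster(verb_classes, threshold=0.5):
    """Merge noun classes whose overlap reaches the threshold.

    Classes below the threshold stay separate, so a noun can remain
    in several clusters (soft clustering). The loop runs as long as
    a cluster is modified.
    """
    clusters = list(dict.fromkeys(frozenset(n) for n in verb_classes.values()))
    changed = True
    while changed:
        changed = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if jaccard(clusters[i], clusters[j]) >= threshold:
                    merged = clusters[i] | clusters[j]
                    clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
                    if merged not in clusters:
                        clusters.append(merged)
                    changed = True
                    break
            if changed:
                break
    return clusters

# Hypothetical verb -> co-occurring nouns classes.
verb_classes = {
    "treat": {"hepatitis", "infection", "disease"},
    "cause": {"infection", "disease", "syndrome"},
    "receive": {"vaccine", "treatment", "chemotherapy"},
    "administer": {"vaccine", "injection", "drug"},
}
soft = soft_cluster(verb_classes)
for cluster in soft:
    print(sorted(cluster))
```

In this toy run the classes of "treat" and "cause" are merged, while "vaccine" stays in two separate clusters, which is the behaviour that distinguishes soft from hard clustering.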
11 Soft clusters
- hepatitis infection disease cases syndrome
- liver cirrhosis disease carcinoma HCC HBV virus method model
- liver transplantation chemotherapy treatment
- chemotherapy treatment vaccine injection drug
- immunization vaccine vaccination
12 Hard clustering
- each noun is associated with its class of co-occurring verbs
- nouns are clustered according to the similarity between classes of verbs
- at each step, the measure is lowered
- cut according to the percentage of nouns clustered
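
One way to read the four steps above is sketched below. The details are assumptions, not taken from the slides: Jaccard overlap as the measure, the list of decreasing thresholds, merging verb classes by union, and the 60% cut are all hypothetical, as is the input data.

```python
def jaccard(a, b):
    """Overlap between two sets (assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def hard_cluster(noun_verbs, thresholds=(0.8, 0.6, 0.4, 0.2), target=0.6):
    """Each noun ends up in at most one cluster (hard clustering).

    The threshold is lowered step by step; clustering stops once the
    share of nouns sitting in a multi-noun cluster reaches `target`.
    """
    clusters = [({noun}, set(verbs)) for noun, verbs in noun_verbs.items()]
    total = len(clusters)
    for threshold in thresholds:
        merged = True
        while merged:
            merged = False
            for i in range(len(clusters)):
                for j in range(i + 1, len(clusters)):
                    if jaccard(clusters[i][1], clusters[j][1]) >= threshold:
                        nouns = clusters[i][0] | clusters[j][0]
                        verbs = clusters[i][1] | clusters[j][1]
                        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
                        clusters.append((nouns, verbs))
                        merged = True
                        break
                if merged:
                    break
        clustered = sum(len(nouns) for nouns, _ in clusters if len(nouns) > 1)
        if clustered / total >= target:
            break  # cut: enough nouns have been clustered
    return [nouns for nouns, _ in clusters if len(nouns) > 1]

# Hypothetical noun -> co-occurring verbs classes.
noun_verbs = {
    "month": {"last", "follow"},
    "year": {"last", "follow"},
    "children": {"vaccinate", "treat"},
    "infant": {"vaccinate", "treat", "feed"},
    "therapy": {"receive", "start"},
    "dose": {"inject"},
}
hard = hard_cluster(noun_verbs)
print([sorted(c) for c in hard])
```

Here "month" and "year" merge at the first threshold, "children" and "infant" at the second, and the cut is then reached because four of the six nouns are clustered.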
13 Hard clusters
- month year
- children infant
- concentration number incidence use prevalence level rate
- course therapy transplantation treatment immunization
14 Evaluation
- Use of WordNet
- Building of a set of WordNet pairs of nouns
- Considering the set of WordNet pairs:
  - recall = WN pairs in clusters / WN pairs
  - minimum precision (mP) = WN pairs in clusters / pairs in clusters composed of WN words
  - extrapolated precision (eP) = ext. WN pairs in clusters / ext. pairs in clusters composed of WN words
15 Comparison raw text vs parsed text: soft clustering
- Parsed1: clusters > 15 elements dismissed
- Parsed2: clusters > 15 and clusters < 3 elements dismissed
16 Comparison raw text vs parsed text: hard clustering
17 Raw text or parsed text?
- Hard: comparable results
- Soft: better for parsed text
- N-grams: gain of processing time
- Parsed text: possible labeling of the relation
18 Soft clustering vs Hard
- Soft1: clusters > 15 elements dismissed
- Soft2: clusters > 15 and clusters < 3 elements dismissed
19 Soft or Hard?
- Soft: different semantic dimensions considered
  - Problem: too many clusters, too many associations
- Hard
  - Problem: only one semantic dimension of the word is considered
20 Merging hard and soft clustering: on disease
- Hard clustering: 1 cluster
  - disease transmission
- Soft clustering: 6 clusters
  - antigen hepatitis virus protein disease
  - prevalence infection correlation disease
- Combination: 2 clusters
  - hepatitis infection disease cases syndrome
  - liver cirrhosis disease carcinoma vaccine HCC HBV virus history method model
21 Merging hard and soft clustering: on chemotherapy
- Hard clustering
  - (none) transplantation treatment
- Soft clustering
  - hepatitis blood factor HBV doses chemotherapy treatment vaccine vaccines vaccination injection drug immunization
  - liver chemotherapy treatment transplantation
- Combination
  - liver transplantation chemotherapy treatment
22 Evaluation results (150-200 words)
23 Turning to terms
- Catch terminology items we have missed so far
- Hard clustering on terms would allow a noun or part of a term to appear in different clusters
  - face mask, mask, glove, protective eyewear
  - immunoadsorbent, immunoassay, immunospot
24 Prospects
- Focus on terms and on the verb-object dependency
- Filter the terminology
- Use prepositional structures to establish links
and build a network
25 Example
- Cluster 1: buy pay inc
- Cluster 2: month year president
- Words found in WordNet: buy pay month year president
- WordNet pairs: (buy, pay), (month, year)
- Clustered pairs (WN words): (buy, pay), (month, year), (year, president), (month, president)
- Recall: 2/2 = 100%
- Precision: 2/4 = 50%
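
The figures in this example can be reproduced with a short script. It is only a sketch: the WordNet word list and the WordNet pairs are typed in by hand from the slide rather than looked up in WordNet, and the helper name `cluster_pairs` is invented for the illustration. The precision computed here corresponds to the minimum precision (mP) of slide 14.

```python
from itertools import combinations

def cluster_pairs(clusters, vocabulary):
    """All word pairs that share a cluster, restricted to words in `vocabulary`."""
    pairs = set()
    for cluster in clusters:
        known = sorted(word for word in cluster if word in vocabulary)
        pairs.update(combinations(known, 2))  # pairs come out alphabetically ordered
    return pairs

# Data copied from the slide.
clusters = [{"buy", "pay", "inc"}, {"month", "year", "president"}]
wordnet_words = {"buy", "pay", "month", "year", "president"}
wordnet_pairs = {tuple(sorted(p)) for p in [("buy", "pay"), ("month", "year")]}

clustered = cluster_pairs(clusters, wordnet_words)
hits = clustered & wordnet_pairs

recall = len(hits) / len(wordnet_pairs)   # WN pairs in clusters / WN pairs
precision = len(hits) / len(clustered)    # WN pairs in clusters / pairs of WN words in clusters

print(f"recall = {recall:.0%}, precision = {precision:.0%}")
```

The word "inc" is not in WordNet, so it contributes no pairs; the second cluster yields three pairs, of which only (month, year) is a WordNet pair.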