ASWC-08, Bangkok - PowerPoint PPT Presentation

1
ASWC-08, Bangkok
Extracting Semantic Frames from Thai
Medical-Symptom Phrases with Unknown Boundaries
Peerasak Intarapaiboon, Ekawit Nantajeewarawat, Thanaruk Theeramunkong
School of ICT, Sirindhorn International Institute of Technology,
Thammasat University, Thailand
2
Background Thai Medical KB Construction Project
Funded by NECTEC
Web Page Collection
Internet
Data Collection
Information Extraction
Selected Keywords
Keyword Extraction
KB Construction
Dictionary
Link Construction
KB
3
Background Thai Medical KB Construction Project
Thai Medical Textual information On the Web
Data Collection
Information Extraction
Keyword Extraction
KB Construction
  • Disease characteristics
  • Causes
  • Treatment
  • Drug information
  • etc.
  • Currently,
  • 3,594 keywords
  • 22,122 information entries

Link Construction
Structured Data
4
Search in Context
In-Database Search
Disease List
Keyword Link
Relation Graph
5
  • The graph shows
  • Occurrences of keywords in disease
  • descriptions
  • Description types, e.g., treatment, cause.

6
Fine-grained semantic relation
Objective of this paper
7
Objective
Generate
Semantic Representation
Semantic Frame
Text
A framework for
  • Extracting semantic information from symptom
    descriptions in the Thai language
  • Representing the extracted information in an
    ontology-based machine-processable form

8
Objective
Generate
Semantic Representation
Semantic Frame
Text
Example
?????????????????????????
9
Objective
Pattern-Based Information Extraction Rules
Semantic Representation
Semantic Frame
Text
Example
?????????????????????????
10
Pattern-Based Information Extraction: An Example
Rule
Pattern: (sym)(org)???(ptime)
Output template: OBS 1, LOC 2, PER 3
Tagged text: ??sym ??????????????????org ?????????????????????ptime 6-12 ???
Gloss: experience a [sym pain] in [org chest] which lasts [ptime 6-12 days]
12
Pattern-Based Information Extraction: An Example
Rule
Pattern: (sym)(org)???(ptime)
Output template: OBS 1, LOC 2, PER 3
Tagged text: ??sym ??????????????????org ?????????????????????ptime 6-12 ???
Extracted frame: OBS sym ?????????, LOC org ??????, PER ptime 6-12 ???
Semantic frame
Symptom (Type)
LOC: Organ (Type: ?????? = Chest)
PER: Type: 6-12 ??? = 6-12 days
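The rule on this slide can be read as: find the tagged slots in order, let anything in between be skipped, then fill the output template. Below is a minimal Python sketch of that reading. The token representation, the function name `apply_rule`, and the English glosses standing in for the Thai words are assumptions for illustration; the literal Thai word in the pattern is ignored here, and this is not the authors' implementation.

```python
def apply_rule(tokens, slot_tags, slot_names):
    """Match slot_tags left to right, skipping any tokens in between.

    tokens: list of (tag, text) pairs; tag is None for untagged words.
    Returns {slot_name: text}, or None if some slot tag is never found.
    """
    frame = {}
    pos = 0
    for tag, name in zip(slot_tags, slot_names):
        # Skip tokens until the next required tag appears.
        while pos < len(tokens) and tokens[pos][0] != tag:
            pos += 1
        if pos == len(tokens):
            return None
        frame[name] = tokens[pos][1]
        pos += 1
    return frame


# English gloss of the slide's example sentence (hypothetical tokenization).
tokens = [(None, "experience"), ("sym", "pain"), (None, "in"),
          ("org", "chest"), (None, "which"), (None, "lasts"),
          ("ptime", "6-12 days")]

frame = apply_rule(tokens, ["sym", "org", "ptime"], ["OBS", "LOC", "PER"])
print(frame)
```

The frame keys follow the output template (OBS, LOC, PER); a sentence lacking any of the three tags yields no frame.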
13
Thai IE: Difficulty
A rule-based IE framework
Training corpus: Free Text → Preprocessing → Shallow-Parsed Text → Rule Extraction → IE Rules
Test corpus: Free Text → Preprocessing → Shallow-Parsed Text → Extraction (with IE Rules) → Output
Preprocessing: Word Segmentation; POS-tagging, Word Sense; Shallow Parsing
14
Thai IE: Difficulty
A rule-based IE framework
Training corpus: Free Text → Preprocessing → Shallow-Parsed Text → Rule Extraction → IE Rules
Test corpus: Free Text → Preprocessing → Shallow-Parsed Text → Extraction (with IE Rules) → Output
Preprocessing: Word Segmentation; POS-tagging, Word Sense; Shallow Parsing
Supplementary techniques are necessary
15
Proposed Framework
16
Proposed Framework
17
Proposed Framework
Use the WHISK algorithm
18
Proposed Framework
2
1
3
19
Rule Application using Sliding Window (RAW)
Rule
Pattern: (sym)(org)???(ptime)
Output template: OBS 1, LOC 2, PER 3
20
Rule Application using Sliding Window (RAW)
Rule (predefined window size is 10)
Pattern: (sym)(org)???(ptime)
Output template: OBS 1, LOC 2, PER 3
[1, 10]-portion
21
Rule Application using Sliding Window (RAW)
Rule (predefined window size is 10)
Pattern: (sym)(org)???(ptime)
Output template: OBS 1, LOC 2, PER 3
[2, 11]-portion
22
Rule Application using Sliding Window (RAW)
Rule (predefined window size is 10)
Pattern: (sym)(org)???(ptime)
Output template: OBS 1, LOC 2, PER 3
[3, 12]-portion
23
Rule Application using Sliding Window (RAW)
Rule (predefined window size is 10)
Pattern: (sym)(org)???(ptime)
Output template: OBS 1, LOC 2, PER 3
Text, positions 18-30: col ?????????sym ??????????????????org ?????????????????????ptime 6-12???
Text, positions 32-44: ??sym ?????????????sym ????????????org ??????????????ptime 3-4??????????
Portions shown: [21, 30], [33, 42], [34, 43]
Portion  | Extraction frame                                     | Correctness
[21, 30] | OBS sym ?????????, LOC org ??????, PER ptime 6-12??? | Correct
[33, 42] | OBS sym ???????, LOC org ???????, PER ptime 3-4???   | Incorrect
[34, 43] | OBS sym ?????????, LOC org ???????, PER ptime 3-4??? | Correct
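A rough Python sketch of rule application with a sliding window (RAW) as shown on these slides: the rule is tried on every portion of a fixed size, and each portion whose tokens match yields a candidate frame. The helper names, the tag-only token representation, and the choice to drop repeated identical matches are assumptions for illustration; a window of 7 is used only to keep the toy input short.

```python
def match_slots(tags, slot_tags):
    """Return 0-based positions of slot_tags matched left to right, or None."""
    positions, pos = [], 0
    for tag in slot_tags:
        while pos < len(tags) and tags[pos] != tag:
            pos += 1
        if pos == len(tags):
            return None
        positions.append(pos)
        pos += 1
    return positions


def raw(tags, slot_tags, size=10):
    """Slide a window of `size` tokens over `tags`.

    Returns {(start, end): slot positions}, with 1-based inclusive portions
    as on the slides. Identical matches from later windows are skipped.
    """
    results, seen = {}, set()
    for start in range(0, max(len(tags) - size, 0) + 1):
        hit = match_slots(tags[start:start + size], slot_tags)
        if hit is None:
            continue
        absolute = tuple(start + p + 1 for p in hit)
        if absolute not in seen:
            seen.add(absolute)
            results[(start + 1, min(start + size, len(tags)))] = absolute
    return results


# Toy tag sequence: two sym tokens precede a single org and ptime.
tags = ["sym", "w", "sym", "w", "org", "w", "ptime", "w"]
print(raw(tags, ["sym", "org", "ptime"], size=7))
# {(1, 7): (1, 5, 7), (2, 8): (3, 5, 7)}
```

The two candidate frames share their org and ptime fillers and differ only in the sym slot, which mirrors the correct and incorrect extractions in the table above.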
24
Proposed Framework
25
Wildcard Instantiation
Rule (window size is 10)
Pattern: (sym)(org)???(ptime)
Output template: OBS 1, LOC 2, PER 3
Text, positions 32-44: ??sym ?????????????sym ????????????org ??????????????ptime 3-4??????????
[33, 42]-portion
Wildcard instantiated across a phrase boundary → incorrect extraction
26
Classifier Learning
(sym)(org)???(ptime)
Training corpus
27
Classifier Learning
(sym)(org)???(ptime)
The 1st internal wildcard instantiation
Training corpus
classes spaces words
???? ??? ??????? sym org ??????
28
Classifier Learning
(sym)(org)???(ptime)
The 1st internal wildcard instantiation
Training corpus
classes spaces words
???? ??? ??????? sym org ??????
Feature selection by information gain
Selected features: spaces, words, sym, ???
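For reference, information gain for one candidate feature can be computed as the class entropy minus the entropy remaining after splitting on that feature. The sketch below uses the standard textbook definition; the function names and the toy labels are assumptions, and the slides do not say which variant or threshold was used.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of labels minus the weighted entropy after splitting on the feature."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature_values):
        subset = [l for f, l in zip(feature_values, labels) if f == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


labels = [1, 1, -1, -1]          # correct / incorrect instantiations (toy data)
informative = [1, 1, 0, 0]       # feature that separates the classes perfectly
uninformative = [1, 0, 1, 0]     # feature unrelated to the class

print(information_gain(informative, labels))
print(information_gain(uninformative, labels))
```

Features would then be ranked by this score and the top-scoring ones kept.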
29
Classifier Learning
(sym)(org)???(ptime)
??sym ?????????????sym
????????????org ??????????????ptime
3-4??????????
32 33 34 35
36 37
38 39 40 41
42 43 44
33, 42-portion
Instantiation: ??????sym ????????????
spaces = 0, words = 3, sym = 1, ??? = 1
Feature: [0, 3, 1, 1]
Wildcard-instantiation feature vector: [0, 3, 1, 1]
30
Classifier Learning
(sym)(org)???(ptime)
??sym ?????????????sym
????????????org ??????????????ptime
3-4??????????
32 33 34 35
36 37
38 39 40 41
42 43 44
33, 42-portion
Instantiation: ????
spaces = 0, words = 1, ??? = 0
Feature: [0, 1, 0]
Wildcard-instantiation feature vector: [0, 3, 1, 1, 0, 1, 0]
31
Classifier Learning
(sym)(org)???(ptime)
??sym ?????????????sym
????????????org ??????????????ptime
3-4??????????
32 33 34 35
36 37
38 39 40 41
42 43 44
33, 42-portion
Instantiation columns: spaces = 0, words = 0, space = 1
Feature: [0, 0, 1]
Wildcard-instantiation feature vector: [0, 3, 1, 1, 0, 1, 0, 0, 0, 1]
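The per-wildcard features on these slides can be sketched as follows: for the tokens that instantiate one wildcard, count the spaces, count the remaining words, and count each selected feature; the vectors of successive wildcards are then concatenated. The function name, the placeholder tokens (`w1`, `w2`, ...), and the use of counts rather than presence flags are assumptions for illustration.

```python
def wildcard_features(tokens, selected):
    """Features for one wildcard instantiation.

    tokens: the tokens the wildcard matched; a space is the token " ",
            and a tagged token is represented by its tag name (e.g. "sym").
    selected: feature tokens chosen beforehand (e.g. by information gain).
    """
    n_spaces = tokens.count(" ")
    n_words = len(tokens) - n_spaces
    return [n_spaces, n_words] + [tokens.count(f) for f in selected]


# First internal wildcard: three tokens, one tagged sym, one selected word.
first = wildcard_features(["w1", "sym", "w2"], selected=["sym", "w2"])
# Second internal wildcard: one word, the selected word does not occur.
second = wildcard_features(["w3"], selected=["w9"])

vector = first + second
print(first, second, vector)
# [0, 3, 1, 1] [0, 1, 0] [0, 3, 1, 1, 0, 1, 0]
```

With these toy inputs the result reproduces the vectors shown for the first two internal wildcards of the [33, 42]-portion.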
32
Proposed Framework
33
Overlapping Frame Filtering (OFF)
Annotated text → RAW → WIF → OFF
Frames: SYM S1, LOC L1; SYM S2, LOC L2; SYM S3, LOC L3; ...; SYM Sn, LOC Ln
Overlapping frames: SYM S2, LOC L2 and SYM S3, LOC L2
Some slot fillers are from the same text position.
Use classifier score
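One simple way to realize overlapping frame filtering is a greedy pass: take frames in order of decreasing classifier score and keep a frame only if none of its slot-filler positions is already used by a kept frame. This is a sketch under that assumption; the slides only state that the classifier score is used, and the function name and scores below are hypothetical.

```python
def filter_overlapping(frames):
    """Keep the best-scoring frame among frames that share a filler position.

    frames: list of (score, positions), where positions is a tuple of the
            text positions of the frame's slot fillers.
    """
    kept, used = [], set()
    for score, positions in sorted(frames, key=lambda f: f[0], reverse=True):
        if used.isdisjoint(positions):
            kept.append((score, positions))
            used.update(positions)
    # Report surviving frames in text order.
    return sorted(kept, key=lambda f: min(f[1]))


frames = [
    (0.2, (1, 5, 7)),     # shares positions 5 and 7 with the next frame
    (0.9, (3, 5, 7)),
    (0.6, (12, 14, 16)),  # no overlap with the others
]
print(filter_overlapping(frames))
# [(0.9, (3, 5, 7)), (0.6, (12, 14, 16))]
```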
34
Experiments Output Templates
Type-MD1 template: abnormal characteristics of
some observable entities
Symptom

Type
Type
Type
ATTR
OBS
Observed Entity
Attribute
PER
Period of time
35
Experiments Output Templates
Type-MD1 template: abnormal characteristics of
some observable entities
Symptom
Secretion
Color

Type
Type
Type
?????? NasalMucus
??????? Green
ATTR
OBS
Observed Entity
Attribute
PER
6-10 ??? 6-10 days
Period of time
36
Experiments Output Templates
Type-MD2 template: human-body locations at which
primitive symptoms appear
Symptom
Type
Primitive symptom
Organ
Type
Type
PER
LOC
Period of time
Human body
37
Experiments Output Templates
Type-MD2 template: human-body locations at which
primitive symptoms appear
Symptom
Type
Primitive symptom
Organ
Type
Type
??????? Rib
6-10 ??? 6-10 days
PER
LOC
Period of time
Human body
38
Data Characteristics
Data sets grouped by disease groups
  • D1 (Training set)
    • The circulatory system,
    • The urology system,
    • The reproductive system,
    • The eye system, and
    • The ear system
  • D2 (Test set)
    • The skin/dermal system,
    • The skeletal system,
    • The endocrine system,
    • The nervous system,
    • Parasitic system, and
    • Venereal system
  • D3 (Test set)
    • The respiratory system,
    • The gastrointestinal tract system,
    • Infectious diseases, and
    • Accidental diseases

41
Example of Rules
42
Experimental Results
Use SVM as a classifier
RAW: high recall, low precision
WIF: preserves recall, improves precision
OFF: preserves recall, improves precision
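Precision and recall here follow the usual definitions: the share of extracted frames that are correct, and the share of gold-standard frames that were extracted. A small sketch with made-up frames (identified by their filler positions) shows why filtering can raise precision while leaving recall unchanged; the function name and data are illustrative assumptions, not the reported results.

```python
def precision_recall(extracted, gold):
    """Return (precision, recall) for a set of extracted frames against gold frames."""
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


gold = {(3, 5, 7)}
raw_output = {(1, 5, 7), (3, 5, 7)}   # sliding window alone: one spurious frame
filtered_output = {(3, 5, 7)}         # after filtering: spurious frame removed

print(precision_recall(raw_output, gold))       # (0.5, 1.0)
print(precision_recall(filtered_output, gold))  # (1.0, 1.0)
```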
43
Classifier Comparisons SVM, kNN, NB, DT
44
Recall Improvement by Rule Generalization
  • Improve rule performance through rule generalization (RG)

Overfitting rules
Pattern: (org)(ch); Output template: OBS 1, ATTR 2
Pattern: (org)(col); Output template: OBS 1, ATTR 2
Generalize to
Pattern: (org)(gq); Output template: OBS 1, ATTR 2
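The generalization step on this slide can be sketched as replacing specific slot tags by a more general tag and merging rules that become identical. The `PARENT` mapping below is read off the slide's example (ch and col both generalize to gq); how the authors actually choose the general tag is not stated, so the mapping and the function name are assumptions.

```python
# Assumed tag hierarchy, taken from the slide's example only.
PARENT = {"ch": "gq", "col": "gq"}


def generalize(rules):
    """Replace slot tags by their parent tag and drop duplicate rules.

    rules: list of (pattern, output_template), where pattern is a tuple of tags.
    """
    generalized = []
    for pattern, template in rules:
        rule = (tuple(PARENT.get(tag, tag) for tag in pattern), template)
        if rule not in generalized:
            generalized.append(rule)
    return generalized


rules = [
    (("org", "ch"), "OBS 1 ATTR 2"),
    (("org", "col"), "OBS 1 ATTR 2"),
]
print(generalize(rules))
# [(('org', 'gq'), 'OBS 1 ATTR 2')]
```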
45
Recall Improvement by Doubling Window Size and
Rule Generalization
2W: window size doubling; RG: rule generalization
RAW: high recall, low precision
WIF: preserves recall, improves precision
OFF: preserves recall, improves precision
46
Classifier Comparisons SVM, kNN, NB, DT
47
Compared with Known-Boundary Test
Manually locate target phrases
Apply rules to located target phrases
Test Corpus
48
Compared with Known-Boundary Test
Manually locate target phrases
Apply rules to located target phrases
Test Corpus
IE Rules
49
Compared with Known-Boundary Test
Manually locate target phrases
Apply rules to located target phrases
Known boundary extraction
Our framework
Insignificant differences
50
Experimental Results: Other Domains
  • The proposed framework is also applied to other
    domains:
  • Soccer match reports (SR),
  • Soccer player transfers (ST),
  • Stock market (SM), and
  • Dividend yield (DY)

51
Domain Characteristics
Long target-phrases
A few target-phrases
52
Experimental Results: Other Domains
RAW: high recall, low precision
WIF: preserves recall, improves precision
OFF: preserves recall, improves precision
53
Classifier Comparisons SVM, kNN, NB, DT
54
Compared with Known-Boundary Test
Known boundary extraction
Our framework
Insignificant differences
55
Conclusions
  • In this work
    • We apply IE rules using the sliding-window
      technique.
    • We use WIF and OFF to classify extracted frames.
    • We apply the framework to medical-symptom
      descriptions, soccer match reports, soccer player
      transfers, stock market, and dividend yield.
  • Further work
    • How the semantic representations of symptom
      descriptions can facilitate automated reasoning,
      e.g., medical diagnosis reasoning.

56
Thank You
57
Conclusion
58
RAW An Example
Rule window size is 10
Pattern
Output template
???




(sym)
(org)
(ptime)
OBS 1 LOC 2PER 3
col ?????????sym ??????????????????o
rg ?????????????????????ptime 6-12???
18 19 20
21 22 23
24 25 26 27 28
29 30
??sym ?????????????sym
????????????org ??????????????ptime
3-4??????????
32 33 34 35
36 37
38 39 40 41
42 43 44
33, 42-portion
Portion  | Extraction frame                                     | Correctness
[21, 30] | OBS sym ?????????, LOC org ??????, PER ptime 6-12??? | Correct
[33, 42] | OBS sym ???????, LOC org ???????, PER ptime 3-4???   | Incorrect
59
RAW An Example
Rule window size is 10
Pattern
Output template
???




(sym)
(org)
(ptime)
OBS 1 LOC 2PER 3
col ?????????sym ??????????????????o
rg ?????????????????????ptime 6-12???
18 19 20
21 22 23
24 25 26 27 28
29 30
??sym ?????????????sym
????????????org ??????????????ptime
3-4??????????
32 33 34 35
36 37
38 39 40 41
42 43 44
Incorrect extractions may be produced
60
Wildcard Instantiation
Rule window size is 10
Pattern
Output template
???




(sym)
(org)
(ptime)
OBS 1 LOC 2PER 3
Internal wildcards
61
RAW An Example
Rule window size is 10
Pattern
Output template
???




(sym)
(org)
(ptime)
OBS 1 LOC 2PER 3
21, 30-portion
col ?????????sym ??????????????????o
rg ?????????????????????ptime 6-12???
18 19 20
21 22 23
24 25 26 27 28
29 30
Extraction frame
Correctness
Portion
62
RAW An Example
Rule predefined window size is 10
Pattern
Output template
???




(sym)
(org)
(ptime)
OBS 1 LOC 2PER 3
21, 30-portion
col ?????????sym ??????????????????o
rg ?????????????????????ptime 6-12???
18 19 20
21 22 23
24 25 26 27 28
29 30
Extraction frame
Correctness
Portion
63
RAW An Example
Rule window size is 10
Pattern
Output template
???




(sym)
(org)
(ptime)
OBS 1 LOC 2PER 3
21, 30-portion
col ?????????sym ??????????????????o
rg ?????????????????????ptime 6-12???
18 19 20
21 22 23
24 25 26 27 28
29 30
Portion  | Extraction frame                                     | Correctness
[21, 30] | OBS sym ?????????, LOC org ??????, PER ptime 6-12??? | Correct
64
RAW An Example
Rule window size is 10
Pattern
Output template
???




(sym)
(org)
(ptime)
OBS 1 LOC 2PER 3
col ?????????sym ??????????????????o
rg ?????????????????????ptime 6-12???
18 19 20
21 22 23
24 25 26 27 28
29 30
??sym ?????????????sym
????????????org ??????????????ptime
3-4??????????
32 33 34 35
36 37
38 39 40 41
42 43 44
33, 42-portion
Portion  | Extraction frame                                     | Correctness
[21, 30] | OBS sym ?????????, LOC org ??????, PER ptime 6-12??? | Correct
66
Components of IE Systems
Word segmentation
POS-tagging, Word Sense
Full parsing, Shallow parsing
67
Wildcard Instantiation
Rule window size is 10
Pattern
Output template
???




(sym)
(org)
(ptime)
OBS 1 LOC 2PER 3
??sym ?????????????sym
????????????org ??????????????ptime
3-4??????????
32 33 34 35
36 37
38 39 40 41
42 43 44
34, 43-portion
An internal wildcard is instantiated across a
boundary → unrelated slots are extracted
An external wildcard is instantiated across a
boundary → unrelated slots are extracted
68
Classifier Learning
(sym)(org)???(ptime)
??sym ?????????????sym
????????????org ??????????????ptime
3-4??????????
32 33 34 35
36 37
38 39 40 41
42 43 44
33, 42-portion
Portion Instantiation
spaces words sym ???
Feature-1 33, 42
69
Classifier Learning
(sym)(org)???(ptime)
??sym ?????????????sym
????????????org ??????????????ptime
3-4??????????
32 33 34 35
36 37
38 39 40 41
42 43 44
34, 43-portion
Portion  | Instantiation          | spaces | words | sym | ??? | Feature-1
[33, 42] | ??????sym ???????????? | 0      | 3     | 1   | 1   | [0, 3, 1, 1]
[34, 43] | ???                    | 0      | 1     | 0   | 1   | [0, 1, 0, 1]
70
Classifier Learning
(sym)(org)???(ptime)
??sym ?????????????sym
????????????org ??????????????ptime
3-4??????????
32 33 34 35
36 37
38 39 40 41
42 43 44
34, 43-portion
Portion  | Feature-1
[33, 42] | [0, 3, 1, 1]
[34, 43] | [0, 1, 0, 1]
71
Classifier Learning
(sym)(org)???(ptime)
??sym ?????????????sym
????????????org ??????????????ptime
3-4??????????
32 33 34 35
36 37
38 39 40 41
42 43 44
34, 43-portion
Portion  | Feature-1    | Feature-2       | Feature-3
[33, 42] | [0, 3, 1, 1] | [1, 1, 0, 0, 1] | [1, 2, 1]
[34, 43] | [0, 1, 0, 1] | [2, 1, 0, 1, 0] | [0, 1, 1]
72
Classifier Learning
(sym)(org)???(ptime)
??sym ?????????????sym
????????????org ??????????????ptime
3-4??????????
32 33 34 35
36 37
38 39 40 41
42 43 44
34, 43-portion
Portion  | Feature-1    | Feature-2       | Feature-3 | Feature vector                       | Label
[33, 42] | [0, 3, 1, 1] | [1, 1, 0, 0, 1] | [1, 2, 1] | [0, 3, 1, 1, 1, 1, 0, 0, 1, 1, 2, 1] | -1
[34, 43] | [0, 1, 0, 1] | [2, 1, 0, 1, 0] | [0, 1, 1] | [0, 1, 0, 1, 2, 1, 0, 1, 0, 0, 1, 1] | 1
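The table on this slide builds each training example by concatenating the three per-wildcard feature groups and attaching a label (-1 for an incorrect frame, 1 for a correct one). A short sketch of that assembly, using the numbers from the slide; the variable and function names are illustrative, and the classifier itself is left out.

```python
from itertools import chain

# Per-portion feature groups and labels, copied from the slide.
rows = {
    (33, 42): ([0, 3, 1, 1], [1, 1, 0, 0, 1], [1, 2, 1], -1),
    (34, 43): ([0, 1, 0, 1], [2, 1, 0, 1, 0], [0, 1, 1], 1),
}


def to_example(feature_1, feature_2, feature_3, label):
    """Concatenate the three feature groups into one vector, paired with its label."""
    return list(chain(feature_1, feature_2, feature_3)), label


examples = [to_example(*row) for row in rows.values()]
for vector, label in examples:
    print(vector, label)
# [0, 3, 1, 1, 1, 1, 0, 0, 1, 1, 2, 1] -1
# [0, 1, 0, 1, 2, 1, 0, 1, 0, 0, 1, 1] 1
```

Vectors in this form could then be passed to any of the classifiers compared in the talk (SVM, kNN, NB, DT).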
73
Experimental Results: Recall Improvement
  • Improve rule performance through rule
    generalization (RG)
  • Improve rule performance by doubling the
    window size (2W)




75
Components of IE Systems
Word segmentation
POS-tagging, Word Sense
Supplementary techniques are necessary
Full parsing, Shallow parsing