Title: Detecting Web Rumours with a Multilingual Ontology Supported Text Classification System
1Detecting Web Rumours with a Multilingual
Ontology Supported Text Classification System
- Nigel Collier1, Ai Kawazoe1, Son Doan1, Mika
Shigematsu2, Kiyosu Taniguchi2, Lihua Jin1, John
McCrae1, Hutchatai Chanlekha1, Dinh Dien3, Quoc
Hung3, Van Chi Nam3, Koichi Takeuchi4, Asanee
Kawtrakul5
- 1 National Institute of Informatics, Japan
- 2 National Institute of Infectious Diseases, Japan
- 3 Vietnam National University (HCM), Vietnam
- 4 Okayama University, Japan
- 5 Kasetsart University, Thailand
2The challenges of web-based surveillance
- Objectives
- Identify public health threats
- Assign a priority to each threat
- Track each event as it progresses with time
- Automate what is automatable
- Challenges
- The Web: its scale, multilinguality and ambiguity
- Knowing what is normal
- Activities
- Semantics-based search and retrieval for disease outbreak rumours
- Focus on Asia-Pacific languages
- Expected impacts
- More timely access to information
- Fewer false alarms
- Promote research activity with open resources.
3The tower of Babel
Language Distribution for 2.946 Billion Web Pages, 2003 (source: Frederic Gey, SIGIR tutorial 2004)
World's Top 20 Languages by Population (source: Frederic Gey, SIGIR tutorial 2004)
→ Increasing multilinguality on the Web
4From text to facts: the discovery pipeline
5TECHNOLOGY
6BioCaster system overview
[Figure: BioCaster system architecture]
- Layers: content and metadata storage and tools; analysis tools; exploitation tools
- Components: source preprocessing (content), entity analysis, topic classification, grounding (entity time/location), event extraction, relevancy ranking II, translation
- Storage: RDF repository, retrieval and reasoning (document links, annotations, ontology, rules); knowledge objects; markup (OWL)
- Ontology: BioCaster ontology (OWL), ontology authoring (Protégé)
- Relevancy scale: 0, 30, 80, 90, 100, with outcomes reject, publish, check, alert
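The overview figure shows a scale marked 0, 30, 80, 90, 100 next to the outcomes reject, publish, check and alert. A minimal sketch of how such a scale could drive triage follows. It assumes that 30, 80 and 90 are the lower bounds of publish, check and alert respectively; the slide does not state this, and the function name `triage` is hypothetical.

```python
from bisect import bisect_right

# ASSUMPTION: the tick marks 30, 80 and 90 are read as the lower bounds of
# "publish", "check" and "alert". The slide only shows the numbers and labels.
BOUNDARIES = [30, 80, 90]
LABELS = ["reject", "publish", "check", "alert"]


def triage(score: float) -> str:
    """Map a relevancy score in [0, 100] to one of the four outcomes."""
    if not 0 <= score <= 100:
        raise ValueError("score must lie in [0, 100]")
    # bisect_right counts how many boundaries are <= score,
    # which is exactly the index of the matching label.
    return LABELS[bisect_right(BOUNDARIES, score)]
```

For example, under these assumed boundaries `triage(85)` falls in the check band, so the report would go to a human reviewer rather than trigger a notification directly.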
7The BioCaster ontology2
2 Nigel Collier, Ai Kawazoe, Lihua Jin, Mika Shigematsu, Dinh Dien, Roberto Barrero, Koichi Takeuchi, Asanee Kawtrakul (2007), "A multilingual ontology for infectious disease outbreak surveillance: rationale, design and challenges", Journal of Language Resources and Evaluation, Springer, DOI 10.1007/s10579-007-9019-7
8A closer look
9Semantics driven search
- What is it? Where is it? When did it happen?
- Find me
- All human cases of respiratory syndrome reported in northeastern Vietnam between September 6th and September 9th
- What is necessary here? We need to know
- that A(H5N1) influenza has symptoms associated with respiratory syndrome
- about synonyms of A(H5N1) influenza (at least 12)
- that the northeastern region contains Cao Bang, Lao Cai, Yen Bai, Phu Tho, Ha Giang, Tuyen Quang, ...
- that A(H5N1) influenza is caused by the Influenza A virus subtype H5N1; maybe the disease is not mentioned by name
- that "early September" and "this Thursday" are valid within the date range
- that cases admitted to hospital are humans!
- what all of these terms are called in Vietnamese, Chinese, ...; maybe there is no report in the English news media yet
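The knowledge requirements listed above can be sketched as a lookup-driven filter. This is an illustrative toy, not the BioCaster implementation: the dictionaries stand in for the ontology, the two extra disease names are placeholder synonyms, and the `Report` fields assume that location, date and human/non-human status have already been grounded upstream. The year in the usage note is hypothetical, since the slide gives none.

```python
from dataclasses import dataclass
from datetime import date

# Toy knowledge base. Entries marked "slide" come from the example above;
# everything else is an illustrative assumption, not BioCaster ontology content.
SYNDROME_TO_DISEASES = {"respiratory syndrome": {"A(H5N1) influenza"}}  # slide
DISEASE_TO_NAMES = {
    # The slide mentions at least 12 synonyms; only stand-ins are listed here.
    "A(H5N1) influenza": {"a(h5n1) influenza", "avian influenza", "bird flu"},
}
DISEASE_TO_AGENTS = {  # slide
    "A(H5N1) influenza": {"influenza a virus subtype h5n1"},
}
REGION_TO_PROVINCES = {  # slide (the list there is open-ended)
    "northeastern Vietnam": {
        "Cao Bang", "Lao Cai", "Yen Bai", "Phu Tho", "Ha Giang", "Tuyen Quang",
    },
}


@dataclass
class Report:
    text: str
    province: str      # assumed to be grounded by an earlier step
    event_date: date   # assumed to be resolved from phrases like "this Thursday"
    human_case: bool   # assumed to be decided by entity/role analysis


def expand_terms(syndrome: str) -> set:
    """Collect disease names and causal agents linked to a syndrome."""
    terms = set()
    for disease in SYNDROME_TO_DISEASES.get(syndrome, set()):
        terms |= DISEASE_TO_NAMES.get(disease, set())
        terms |= DISEASE_TO_AGENTS.get(disease, set())
    return terms


def matches(report: Report, syndrome: str, region: str, start: date, end: date) -> bool:
    """True if the report is a human case of the syndrome in region and range."""
    text = report.text.lower()
    return (
        report.human_case
        and any(term in text for term in expand_terms(syndrome))
        and report.province in REGION_TO_PROVINCES.get(region, set())
        and start <= report.event_date <= end
    )
```

A report that names only the virus, for instance "tested positive for Influenza A virus subtype H5N1" in Lao Cai on a hypothetical `date(2007, 9, 7)`, still matches the query for respiratory syndrome, because the agent is reached through the disease entry. Multilingual search would need the same tables keyed by Vietnamese or Chinese terms.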
10EXPERIMENTS
11Relevancy criteria
Relevant
- Alert: News reports with the highest relevancy, which should be proactively notified to public health experts for follow-up.
- Ex. outbreaks of newly emerging diseases, bioterrorism, change of transmission modes (animal-to-human to human-to-human), the spread of diseases across international borders [3]
- Check: The borderline between alert and publish, which requires human intervention in order to decide its final status.
- Ex. outbreaks of unidentified diseases
- Publish: Broad class of documents which pertain to infectious disease related matters, but are not sufficiently urgent to notify directly to public health experts.
Irrelevant
- Reject: Too little pertinent information to be of interest to public health experts.
- Ex. articles about chronic diseases (cancer, heart diseases, diabetes), general news (politics, society, business, sport)
3 Based on the Annex 2 decision instrument for risk assessment of the WHO
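The four outcomes and their grouping into relevant and irrelevant can be written down as a small data structure. The sketch below is one reading of this slide, assuming the three relevant classes are alert, check and publish and the irrelevant class is reject; the names `Relevancy`, `is_relevant` and `next_step` are hypothetical.

```python
from enum import Enum


class Relevancy(Enum):
    ALERT = "alert"      # notify a public health expert proactively
    CHECK = "check"      # borderline: a human decides the final status
    PUBLISH = "publish"  # pertinent to infectious disease, but not urgent
    REJECT = "reject"    # too little pertinent information


def is_relevant(label: Relevancy) -> bool:
    """Collapse the four-way label to the binary relevant/irrelevant split."""
    return label is not Relevancy.REJECT


def next_step(label: Relevancy) -> str:
    """Describe the handling each label implies, per the criteria above."""
    return {
        Relevancy.ALERT: "notify expert",
        Relevancy.CHECK: "human review",
        Relevancy.PUBLISH: "publish without notification",
        Relevancy.REJECT: "discard",
    }[label]
```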
12Challenges in topic classification
Confusion
- No gold standard for infectious disease domain
- A small number of open annotation schemes
- Some recent work on incorporating XML structure [4, 5]
- → Use important concepts to improve performance
- 4 Kudo T. and Matsumoto Y., 2004, "A boosting algorithm for classification of semi-structured text", Proceedings of EMNLP 2004, pp. 301-308.
- 5 Zaki M. and Aggarwal C., 2003, "XRules: An effective structural classifier for XML data", Proceedings of the 9th ACM SIGKDD 2003, pp. 316-325.
13Annotating entities in text
Annotation
- <PERSON case="true">A boy</PERSON> was infected with the virus
- Victims contracted the virus from infected <ORGANISM transmission="true">birds</ORGANISM>
Ontology
- Type: DISEASE, VIRUS, BACTERIA, PRODUCT (biological products), OUTBREAK, PERSON (in general), ORGANISM (animals), LOCATION, TIME, ...
- Role (attribute): CASE (diseased person), TRANSMISSION (source of infection), THERAPEUTIC
Types are Named Entities (NEs) and Roles are attributes of NEs
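Inline annotations of this kind can be read back into (type, text, roles) triples with a standard XML parser. The sketch below assumes the markup is well-formed XML with quoted attribute values such as `case="true"`; the function name `extract_entities` and the choice to wrap each sentence in a temporary `<s>` element are illustrative, not part of the BioCaster scheme.

```python
import xml.etree.ElementTree as ET

# Role attributes named on the slide, written in lower case as in the examples.
ROLE_ATTRIBUTES = ("case", "transmission", "therapeutic")


def extract_entities(sentence: str) -> list:
    """Return (entity type, surface text, active roles) for each annotation."""
    # Wrap the sentence so that mixed text and tags form a single XML document.
    root = ET.fromstring(f"<s>{sentence}</s>")
    entities = []
    for element in root:
        roles = [r for r in ROLE_ATTRIBUTES if element.get(r) == "true"]
        entities.append((element.tag, element.text or "", roles))
    return entities


if __name__ == "__main__":
    print(extract_entities(
        '<PERSON case="true">A boy</PERSON> was infected with the virus'
    ))
```

Triples like these can then be turned into classifier features, for example the entity type alone or the type combined with its role.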
14Method
- Gold standard corpus
- Range of sources, 77 (en), 23 (vi)
- Covering Health, Business, Politics, Society, Sport, Science, Technology, Entertainment.
- 350 relevant, 650 irrelevant (en)
- 300 relevant, 300 irrelevant (vi)
- The English corpus was annotated for entity type and role
- Evaluation
- 10-fold cross validation
- Analysed models
- Naïve Bayes, SVM, CRF
- Only the best two are presented here: Naïve Bayes (en), SVM (vi)
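The evaluation protocol above, 10-fold cross validation of a Naïve Bayes classifier, can be sketched in plain Python. This is a generic multinomial Naïve Bayes with add-one smoothing and interleaved, unstratified folds; the slides do not say which Naïve Bayes variant, smoothing or fold assignment was used, so all three are assumptions, as are the function names.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """Count documents per class and tokens per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict_nb(model, tokens):
    """Pick the class with the highest smoothed log-probability."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)
        for tok in tokens:
            if tok in vocab:  # tokens unseen in training are ignored
                count = word_counts[label][tok]
                score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def cross_validate(docs, labels, k=10):
    """Mean accuracy over k folds; fold i holds out items i, i+k, i+2k, ..."""
    if len(docs) < k:
        raise ValueError("need at least k documents")
    accuracies = []
    for fold in range(k):
        held_out = set(range(fold, len(docs), k))
        train_docs = [d for i, d in enumerate(docs) if i not in held_out]
        train_labels = [l for i, l in enumerate(labels) if i not in held_out]
        model = train_nb(train_docs, train_labels)
        correct = sum(predict_nb(model, docs[i]) == labels[i] for i in held_out)
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / k
```

Each document is a list of tokens, which may mix plain words with entity and role features. With a corpus ordered by class, shuffling or stratifying before the split would be advisable; the interleaved folds here are only the simplest option.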
15(EN) Results by entity type6
6 Son Doan, Ai Kawazoe and Nigel Collier (2007), "The role of roles in classifying annotated biomedical texts", Proceedings of BioNLP 2007, Prague, Czech Republic, June 29th.
16(EN) Results by thematic structure
All experiments use text entities case roles
17(VN) Results
Optimal method: term ranking cut-off at 375 terms on an SVM model using TF-IDF weighting
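A term ranking cut-off with TF-IDF weighting can be illustrated as follows. The slide does not say which ranking criterion was used, so this sketch assumes terms are ranked by their summed TF-IDF weight over the corpus; the SVM itself is omitted, and the function names are hypothetical. The default cut-off of 375 is the value reported on the slide.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Weight each term by relative term frequency times log inverse document frequency."""
    n_docs = len(docs)
    doc_freq = Counter(term for tokens in docs for term in set(tokens))
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in tf.items()
        })
    return vectors


def top_terms(vectors, cutoff=375):
    """Rank terms by summed TF-IDF weight (ASSUMED criterion) and keep the best."""
    totals = Counter()
    for vec in vectors:
        totals.update(vec)  # adds the float weights term by term
    ranked = sorted(totals, key=lambda term: (-totals[term], term))
    return ranked[:cutoff]


def prune(vectors, keep):
    """Drop every term outside the selected vocabulary."""
    keep = set(keep)
    return [{t: w for t, w in vec.items() if t in keep} for vec in vectors]
```

The pruned vectors would then be passed to the classifier. A term that occurs in every document receives a weight of zero under this weighting and therefore drops to the bottom of the ranking.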
18Highlighted difficulties
- Mixed topic news reports
- Weakness in data cleansing (e.g. links to other stories headlines)
- Grounding time: historic vs. ongoing outbreaks vs. hypothetical outbreaks
India, Nepal Fight Outbreak of Japanese Encephalitis
Japanese encephalitis is caused by a virus that infects the central nervous system
A recent outbreak of Japanese encephalitis has killed more than six hundred people in the state of Uttar Pradesh
The World Health Organization says about fifty thousand cases of Japanese encephalitis are reported each year
United States health officials say major outbreaks in the past have hit China, Japan, South Korea, Taiwan, Thailand and other areas
19Still to do
- A lot
- Inter-annotator agreement
- Expand the gold standard corpus
- Evaluate the system on live data
20CONCLUSION
21Final remarks
- BioCaster is operational
- 1400 feeds
- 5000 news articles/day analysed, 40.6 found to be relevant
- BioCaster ontology is open source and freely available
- http://biocaster.nii.ac.jp
- Still many research issues
- Disambiguating temporal and geographic information
- Adapting alerting metrics
- Need for shareable resources
- Ontologies, annotated texts (e.g. ProMed), benchmarks, plug and play tools are the keys to progress
22Many thanks to
- The Japan Society for the Promotion of Science
and ROIS for funding support