Detecting Web Rumours with a Multilingual Ontology Supported Text Classification System

1
Detecting Web Rumours with a Multilingual
Ontology Supported Text Classification System
  • Nigel Collier¹, Ai Kawazoe¹, Son Doan¹, Mika
    Shigematsu², Kiyosu Taniguchi², Lihua Jin¹, John
    McCrae¹, Hutchatai Chanlekha¹, Dinh Dien³, Quoc
    Hung³, Van Chi Nam³, Koichi Takeuchi⁴, Asanee
    Kawtrakul⁵
  • ¹ National Institute of Informatics, Japan
  • ² National Institute of Infectious Diseases,
    Japan
  • ³ Vietnam National University (HCM), Vietnam
  • ⁴ Okayama University, Japan
  • ⁵ Kasetsart University, Thailand

2
The challenges of web-based surveillance
  • Objectives
      • Identify public health threats
      • Assign a priority to each threat
      • Track each event as it progresses with time
      • Automate what is automatable
  • Challenges
      • The Web: its scale, multilinguality and ambiguity
      • Knowing what is normal
  • Activities
      • Semantics-based search and retrieval for disease
        outbreak rumours
      • Focus on Asia-Pacific languages
  • Expected impacts
      • More timely access to information
      • Fewer false alarms
      • Promote research activity with open resources

3
The tower of Babel
Language distribution for 2.946 billion Web pages, 2003
(source: Frederic Gey, SIGIR tutorial 2004)
World's top 20 languages by population
(source: Frederic Gey, SIGIR tutorial 2004)
→ Increasing multilinguality on the Web
4
From text to facts: the discovery pipeline
5
TECHNOLOGY

6
BioCaster system overview
[Diagram: system architecture in three groups: content and metadata storage and tools, analysis tools, and exploitation tools]
Labelled analysis components: Source Preprocessing, Translation, Entity Analysis, Topic Classification, Grounding (entity time/location), Event extraction, Relevancy ranking II
Labelled knowledge resources: BioCaster ontology (OWL), Ontology authoring (Protégé), Markup (OWL), Knowledge objects, RDF repository with retrieval and reasoning (document links, annotations, ontology, rules)
Content status labels: reject, publish, check, alert (shown against a scale marked 0, 30, 80, 90, 100)
7
The BioCaster ontology²
² Nigel Collier, Ai Kawazoe, Lihua Jin, Mika
Shigematsu, Dinh Dien, Roberto Barrero, Koichi
Takeuchi, Asanee Kawtrakul (2007), "A
multilingual ontology for infectious
disease outbreak surveillance: rationale, design
and challenges", Journal of Language
Resources and Evaluation, Springer, DOI
10.1007/s10579-007-9019-7
8
A closer look
9
Semantics driven search
  • What is it? Where is it? When did it happen?
  • Find me:
      • All human cases of respiratory syndrome reported
        in northeastern Vietnam between September 6th and
        September 9th
  • What is necessary here? We need to know:
      • that A(H5N1) influenza has symptoms associated
        with respiratory syndrome
      • about synonyms of A(H5N1) influenza (at least
        12)
      • that the northeastern region contains Cao Bang,
        Lao Cai, Yen Bai, Phu Tho, Ha Giang, Tuyen
        Quang, ...
      • that A(H5N1) influenza is caused by the Influenza
        A virus subtype H5N1 (the disease may not be
        mentioned by name)
      • that "early September" and "this Thursday" are
        valid within the date range
      • that cases admitted to hospital are humans!
      • what all of these terms are called in Vietnamese,
        Chinese, ... (there may be no report in the
        English news media yet)
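
The knowledge requirements listed above can be sketched as a small query-expansion step. This is only an illustration under stated assumptions: the dictionaries are hypothetical, hand-written stand-ins for ontology lookups (not the BioCaster ontology's actual contents or API), the synonym list is abbreviated, and the year 2007 is an arbitrary choice because the slide gives none.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical stand-ins for ontology lookups (illustrative values only).
SYNDROME_TO_DISEASES = {"respiratory syndrome": ["A(H5N1) influenza"]}
DISEASE_SYNONYMS = {"A(H5N1) influenza": ["avian influenza", "bird flu"]}
DISEASE_TO_AGENT = {"A(H5N1) influenza": "Influenza A virus subtype H5N1"}
REGION_TO_PROVINCES = {
    "northeastern Vietnam": [
        "Cao Bang", "Lao Cai", "Yen Bai", "Phu Tho", "Ha Giang", "Tuyen Quang",
    ],
}


@dataclass
class ExpandedQuery:
    terms: set
    places: set
    start: date
    end: date

    def matches(self, text, place, when):
        """True if the report mentions any term, in a covered place, in range."""
        lowered = text.lower()
        has_term = any(term.lower() in lowered for term in self.terms)
        return has_term and place in self.places and self.start <= when <= self.end


def expand(syndrome, region, start, end):
    """Expand a syndrome/region query with diseases, synonyms, agents, provinces."""
    terms = {syndrome}
    for disease in SYNDROME_TO_DISEASES.get(syndrome, []):
        terms.add(disease)
        terms.update(DISEASE_SYNONYMS.get(disease, []))
        agent = DISEASE_TO_AGENT.get(disease)
        if agent:
            terms.add(agent)
    places = {region, *REGION_TO_PROVINCES.get(region, [])}
    return ExpandedQuery(terms, places, start, end)


query = expand(
    "respiratory syndrome", "northeastern Vietnam",
    date(2007, 9, 6), date(2007, 9, 9),  # year is an assumption
)
print(query.matches("Two bird flu patients admitted to hospital",
                    "Cao Bang", date(2007, 9, 7)))
```

Resolving relative expressions such as "this Thursday", and mapping terms across languages, would need further lookups that are not shown here.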

10
EXPERIMENTS

11
Relevancy criteria
Relevant
  • Alert: news reports with the highest relevancy, which
    should be proactively notified to public health
    experts for follow up.
    Ex. outbreak of newly emerging diseases,
    bioterrorism, change of transmission modes
    (animal-to-human to human-to-human), the spread
    of diseases across international borders³
  • Check: the border line between alert and publish,
    which requires human intervention in order to
    decide its final status.
    Ex. outbreak of unidentified diseases
  • Publish: broad class of documents which pertain to
    infectious disease related matters, but are not
    sufficiently urgent to notify directly to public
    health experts.

Irrelevant
  • Reject: too little pertinent information to be of
    interest to public health experts.
    Ex. articles about chronic diseases (cancer, heart
    diseases, diabetes), general news (politics,
    society, business, sport)

³ Based on the Annex 2 decision instrument for
risk assessment of the WHO
12
Challenges in topic classification
Confusion
  • No gold standard for infectious disease domain
  • A small number of open annotation schemes
  • Some recent work on incorporating XML
    structure⁴⁻⁵
  • → Use important concepts to improve performance
  • ⁴ Kudo T. and Matsumoto Y., 2004, "A boosting
    algorithm for classification of semi-structured
    text", Proceedings of EMNLP 2004, pp. 301-308.
  • ⁵ Zaki M. and Aggarwal C., 2003, "XRules: An
    effective structural classifier for XML data",
    Proceedings of the 9th ACM SIGKDD 2003, pp.
    316-325.

13
Annotating entities in text
Ontology
Type: DISEASE, VIRUS, BACTERIA, PRODUCT (biological products), OUTBREAK, PERSON (in general), ORGANISM (animals), LOCATION, TIME, ...
Role (attribute): CASE (diseased person), TRANSMISSION (source of infection), THERAPEUTIC
Annotation
<PERSON case="true">A boy</PERSON> was infected with the virus
Victims contracted the virus from infected <ORGANISM transmission="true">birds</ORGANISM>
Types are Named Entities (NEs) and Roles are
attributes of NEs
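
A minimal sketch of how inline annotations of this kind could be read back into (type, text, roles) triples. It assumes the annotations are well-formed XML-style tags such as `<PERSON case="true">...</PERSON>`, with lowercase role attributes; the function name and the attribute spelling are illustrative assumptions, not the project's actual markup tools.

```python
import xml.etree.ElementTree as ET

# Role attributes named on the slide; lowercase spelling is an assumption.
ROLE_ATTRIBUTES = ("case", "transmission", "therapeutic")


def extract_entities(sentence):
    """Return (entity_type, text, roles) for each annotated span in a sentence.

    The sentence is wrapped in a dummy <s> root so that ElementTree can parse
    it; this only works if the sentence contains no stray '<' or '&'.
    """
    root = ET.fromstring(f"<s>{sentence}</s>")
    entities = []
    for element in root:
        roles = [role for role in ROLE_ATTRIBUTES if element.get(role) == "true"]
        entities.append((element.tag, (element.text or "").strip(), roles))
    return entities


examples = [
    '<PERSON case="true">A boy</PERSON> was infected with the virus',
    'Victims contracted the virus from infected '
    '<ORGANISM transmission="true">birds</ORGANISM>',
]
for sentence in examples:
    print(extract_entities(sentence))
```

Keeping roles as a list per entity lets a classifier use the entity type alone, or the type combined with its roles, as separate features.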
14
Method
  • Gold standard corpus
      • Range of sources: 77 (en), 23 (vi)
      • Covering Health, Business, Politics, Society,
        Sport, Science, Technology, Entertainment
      • 350 relevant, 650 irrelevant (en)
      • 300 relevant, 300 irrelevant (vi)
      • The English corpus was annotated for entity type
        and role
  • Evaluation
      • 10-fold cross validation
  • Analysed models
      • Naïve Bayes, SVM, CRF
      • Only the best two are presented here: Naïve Bayes
        (en), SVM (vi)
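
The evaluation set-up (10-fold cross validation of a Naïve Bayes classifier over relevant/irrelevant labels) can be sketched in plain Python as follows. This is an illustrative sketch, not the implementation used in the experiments: the multinomial model with add-one smoothing, the fold assignment, the function names and the toy documents are all assumptions.

```python
import math
import random
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """Fit a multinomial Naive Bayes model (counts only; smoothing at predict time)."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict_nb(model, tokens):
    """Return the label with the highest log-posterior, using add-one smoothing."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)
        for token in tokens:
            if token in vocab:  # ignore words never seen in training
                score += math.log(
                    (word_counts[label][token] + 1) / (total_words + len(vocab))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def cross_validate(docs, labels, k=10, seed=0):
    """Return mean accuracy over k folds (requires len(docs) >= k)."""
    indices = list(range(len(docs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    accuracies = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        model = train_nb([docs[i] for i in train], [labels[i] for i in train])
        correct = sum(predict_nb(model, docs[i]) == labels[i] for i in held_out)
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / k


# Toy data, invented for illustration only.
docs = [["outbreak", "virus", "cases"]] * 10 + [["election", "football", "market"]] * 10
labels = ["relevant"] * 10 + ["irrelevant"] * 10
print(cross_validate(docs, labels))
```

With real data, the tokens would be the article text, optionally extended with entity-type and role features.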

15
(EN) Results by entity type⁶
⁶ Son Doan, Ai Kawazoe and Nigel Collier
(2007), "The role of roles in classifying annotated
biomedical texts", Proceedings of BioNLP 2007,
Prague, Czech Republic, June 29th.
16
(EN) Results by thematic structure
All experiments use text, entities and case roles
17
(VN) Results
Optimal method: term ranking cut-off at 375
terms on an SVM model using TF-IDF weighting
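
A minimal sketch of TF-IDF weighting with a term ranking cut-off, to show what a 375-term feature space could look like before it is passed to an SVM. The slide does not say how terms were ranked, so ranking by total TF-IDF weight across the corpus is an assumption here, as are the function name and the toy documents; SVM training itself is not shown.

```python
import math
from collections import Counter


def tfidf_vectors(docs, max_terms=375):
    """Weight tokens by TF-IDF and keep only the top-ranked terms."""
    n_docs = len(docs)
    # Document frequency: number of documents each term occurs in.
    doc_freq = Counter(term for tokens in docs for term in set(tokens))
    idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}

    # Assumed ranking criterion: total TF-IDF weight over the whole corpus.
    totals = Counter()
    for tokens in docs:
        for term, tf in Counter(tokens).items():
            totals[term] += tf * idf[term]
    kept = [term for term, _ in totals.most_common(max_terms)]
    index = {term: i for i, term in enumerate(kept)}

    # One dense vector per document, restricted to the kept terms.
    vectors = []
    for tokens in docs:
        vector = [0.0] * len(kept)
        for term, tf in Counter(tokens).items():
            if term in index:
                vector[index[term]] = tf * idf[term]
        vectors.append(vector)
    return kept, vectors


# Toy documents, invented for illustration only.
docs = [
    ["h5n1", "outbreak", "reported"],
    ["h5n1", "cases", "reported"],
    ["football", "match", "reported"],
]
kept, vectors = tfidf_vectors(docs, max_terms=2)
print(kept)
```

A term that occurs in every document (here "reported") gets an IDF of zero, so it ranks last and is the first to be dropped by the cut-off.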
18
Highlighted difficulties
  • Mixed topic news reports
  • Weakness in data cleansing (e.g. links to other
    stories' headlines)
  • Grounding time: historic vs. ongoing vs.
    hypothetical outbreaks

India, Nepal Fight Outbreak of Japanese Encephalitis
Japanese encephalitis is caused by a virus that infects the central nervous system.
A recent outbreak of Japanese encephalitis has killed more than six hundred people in the state of Uttar Pradesh.
The World Health Organization says about fifty thousand cases of Japanese encephalitis are reported each year.
United States health officials say major outbreaks in the past have hit China, Japan, South Korea, Taiwan, Thailand and other areas.
19
Still to do
  • A lot
  • Inter-annotator agreement
  • Expand the gold standard corpus
  • Evaluate the system on live data

20
CONCLUSION

21
Final remarks
  • BioCaster is operational
      • 1400 feeds
      • 5000 news articles/day analysed, 40.6 found to be
        relevant
  • BioCaster ontology is open source and freely
    available
      • http://biocaster.nii.ac.jp
  • Still many research issues
      • Disambiguating temporal and geographic
        information
      • Adapting alerting metrics
  • Need for shareable resources
      • Ontologies, annotated texts (e.g. ProMed),
        benchmarks, plug and play tools are the keys to
        progress

22
Many thanks to
  • The Japan Society for the Promotion of Science
    and ROIS for funding support