1 The Medical Information System - MedISys
eHealth 2009, Second International ICST Conference on Electronic Healthcare for the 21st Century, September 23-25, 2009, Istanbul, Turkey
Erik van der Goot, the OPTIMA team (OPensource Text Information Mining and Analysis)
European Commission, Joint Research Centre (JRC), Institute for the Protection and Security of the Citizen (IPSC)
erik.van-der-goot_at_jrc.ec.europa.eu
2 JRC - who
3 JRC - where
4 MedISys - Overview
- Objective
  - Provide open source data collection and analysis for surveillance and epidemiology
  - Replace manual scanning of multiple newspapers and web portals
  - Support national and international Public Health (PH) organisations in monitoring issues of Public Health concern (e.g. CBRN)
- Functionality
  - Gather, filter, classify, extract and aggregate health-related information
  - Monitor trends, detect breaking news
  - Visualise analysis results
  - Alert users
  - Allow customised views
  - In combination with the RNS tool, allow manual moderation
5 Background - History
- Based on the JRC's Europe Media Monitor (EMM) technology (EMM live since 2002, http://emm.newsbrief.eu).
- On request / initiative of the EC's Directorate General for Health and Consumer Protection (DG SANCO).
- Password-protected service for Public Health bodies since 2005.
- Public service since early 2007 (http://medusa.jrc.it/, restricted functionality).
6 Background - Media Monitoring
- EU Commission Media Monitoring (until 2001/2002)
  - Traditional cut and paste for printed press only
  - Monitoring of incoming news wires (e.g. Reuters, AFP)
  - Simple keyword-based filtering of wires
  - Manual selection of printed press items
  - Human classification of items
- Potential problems
  - Not real-time for mainstream media: printed press typically appears once a day
  - Limited coverage: not all media is printed
  - Inaccurate and incomplete classification: subjective, with a limited number of categories
  - Labour intensive and expensive: limited number of articles per reviewer per day; requires topical knowledge and language knowledge
7 EMM History
- New challenges (as seen in 2002)
  - Enlargement (10 countries): more media, more languages
  - More use of electronic publishing (media)
  - Electronic distribution of results (web/mobile)
  - Automatic alerting functions
- New approach: EMM, a one-stop shop for Media Monitoring
  - Facilitate (not replace) human Media Monitoring activities
  - Extend monitoring beyond the traditional news wires (Internet).
  - Improve coverage, number of languages, analysis.
  - Apply automatic categorization and analysis to all sources
  - Provide new services like automatic e-mail, SMS, mobile editions etc.
  - Provide an editorial system to manage the information and produce newsletters etc.
- Important: EMM is not Yet Another Internet Search Engine
8 EMM System Features
- Automatic language recognition
  - Based on continuously updated language-specific frequency tables
- Automated information/entity extraction
  - 400,000 persons and organizations, based on a continuously updated list of entities with many language-specific synonyms.
- Geotagging
  - Based on a homegrown, harmonised multilingual geo-data set of about 600,000 place name variants in most languages covered by EMM, mostly national, regional and provincial capitals.
- Improved Categorization Engine
  - Boolean combinations, proximity, wildcards
  - Support for Arabic and similar (automatic noun-prefix processing)
  - Support for Chinese and similar (no whitespace)
- Tonality/Sentiment
  - Simple bag-of-words approach; range from very negative to very positive; corrected for long-term source bias; interesting for following reporting trends per category
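The tonality bullet can be illustrated with a minimal sketch. It assumes a tiny hand-made word list (`LEXICON`) and treats the running mean of a source's past scores as its long-term bias. Neither the word weights nor the correction formula come from the slides, and the names `raw_tonality`, `SourceBias` and `corrected_tonality` are hypothetical.

```python
from collections import defaultdict

# Hypothetical word weights; the actual EMM word lists are not given in the slides.
LEXICON = {"outbreak": -2, "deaths": -3, "crisis": -2,
           "recovery": 2, "cure": 3, "safe": 1}


def raw_tonality(text):
    """Bag-of-words score: sum of word weights divided by the token count."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(LEXICON.get(token, 0) for token in tokens) / len(tokens)


class SourceBias:
    """Running mean of raw tonality per source (assumed long-term baseline)."""

    def __init__(self):
        self.total = defaultdict(float)
        self.count = defaultdict(int)

    def mean(self, source):
        n = self.count[source]
        return self.total[source] / n if n else 0.0

    def update(self, source, score):
        self.total[source] += score
        self.count[source] += 1


def corrected_tonality(text, source, bias):
    """Raw score minus the source's historical mean, then update the history."""
    score = raw_tonality(text)
    corrected = score - bias.mean(source)
    bias.update(source, score)
    return corrected
```

With this correction, a source that always writes in negative terms drifts towards a corrected score of zero, so only departures from its usual tone stand out.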
9 more features
- Duplicate detection
- Metadata categorization
  - Allows selection of articles based on any previously assigned metadata.
- Automated information linking
  - Incremental topic-based clustering and story tracking, geolocation.
  - 10-minute-interval incremental clustering on the last 4 hours' worth of news (Top Stories on front page).
- Automatic detection of breaking news
  - Cluster growth rate
  - Flux of articles per category
- Indexing
  - Index full text and most metadata.
- Statistics/Trend analysis
  - Quantitative analysis of reporting. Maintain simple count statistics.
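As a rough sketch of how a 4-hour window and a cluster growth rate could be combined to flag breaking news: the 4-hour window and 10-minute interval come from the slide, while the article representation (a dict with a `published` timestamp), the function names and the `min_new` threshold are assumptions made for illustration only.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(hours=4)        # clustering window, from the slide
INTERVAL = timedelta(minutes=10)   # clustering interval, from the slide


def window_articles(articles, now):
    """Keep only articles published within the last 4 hours."""
    return [a for a in articles if now - WINDOW < a["published"] <= now]


def growth_rate(cluster, now):
    """Number of articles that joined the cluster in the latest interval."""
    return sum(1 for a in cluster if now - INTERVAL < a["published"] <= now)


def is_breaking(cluster, now, min_new=3):
    """Flag a cluster as breaking when it grew by at least min_new articles.

    min_new is an assumed threshold, not a value from the slides.
    """
    return growth_rate(cluster, now) >= min_new
```

A production system would presumably compare the growth against the cluster's own history rather than a fixed threshold; the fixed value keeps the sketch short.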
10 and more features
- Event extraction
  - Language-independent event grammars are used to parse clusters, with language-dependent resources filling the grammar slots.
  - Currently for 5 languages (en, fr, it, pt, ru): violent events, humanitarian events
11 Development time line
[Timeline diagram. Labels: EMM/RNS; domain-specific application; MedISys; continuous development; new features; NewsExplorer; first version 2005; EMM system redesign; redesign based on EMM; RNS redesign.]
12 MedISys System Overview
[Architecture diagram. Components: MedISys NewsBrief; NewsDesk Service (a.k.a. RNS) editorial interface; EMM open source monitoring engine.]
13 Problems to solve
- Find relevant information
  - Millions of new articles/blogs/items/tweets published on the Internet each day
- Deliver the information to the right user
  - Allow for many (possibly overlapping) categories to meet specific needs
- Timely
  - Right now if possible
- In short: deliver targeted information to the right user in a timely manner
14 Approach
- Wide coverage
  - Many sources
  - Local, regional, national and international coverage
- Many languages
  - Multilinguality: cross-lingual information access
- Fast coverage
  - High-frequency monitoring of sites, some sites every 5 minutes
- Overcome the information overflow
  - Categorization, aggregation, duplicate identification, clustering
  - Customisability of MedISys NewsBrief
  - Search functions
  - RNS tool for manual moderation and targeted dissemination
15 Input data
- 2200 sources (world-wide, but primary focus on Europe)
- 4,000 HTML web pages/RSS feeds
- 100 specialist medical sites
- 20 commercial newswires
- Specialist pay-for sources (LexisMed)
- 24/7, near-continuous monitoring
- 80,000 new articles/items per day
- Converts dirty HTML with adverts, menus, HTML tags, related stories, etc. into clean and standardised Unicode-encoded RSS format
- Use RSS when available
- Perform full content analysis
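The step from dirty HTML to clean text can be sketched with Python's standard `html.parser`. This is only an illustration of the idea: the set of tags treated as boilerplate (`SKIP_TAGS`) and the names `TextExtractor` and `extract_text` are assumptions, and the slides do not describe how EMM actually identifies adverts, menus or related stories.

```python
from html.parser import HTMLParser

# Assumed containers for non-article content; real pages need better heuristics.
SKIP_TAGS = {"script", "style", "nav", "aside", "footer"}


class TextExtractor(HTMLParser):
    """Collect text that lies outside the tags listed in SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return the visible article text as a single Unicode string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

The extracted string could then be wrapped in an RSS item; that serialisation step is left out here.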
16 MedISys Screenshots
17 MedISys: current subscribers and users include
- Supranational organisations
  - Directorate General Health and Consumer Protection (SANCO)
  - European Centre for Disease Control, Stockholm (ECDC)
  - European Food Safety Authority (EFSA)
  - World Health Organisation (WHO)
- National Public Health organisations
  - Swiss Federal Office of Public Health
  - Icelandic Ministry of Health
  - Spanish Ministry of Sanitation / Ministry of Health and Consumer Protection
  - Institut de Veille Sanitaire (France)
  - Global Public Health Intelligence Network (Canada)
  - Danish Emergency Management Agency
  - Italian Ministry of Health and Ministry of Defence
  - Dutch Institute of Public Health, Food and Consumer Product Safety Authority
- The (general?) public
  - Currently 1,000 visitors and 37,000 hits per day on the public system
18 Locations mentioned in MedISys medical articles across languages
Importance of multilingual information gathering
Italian - German
19 Multilingual and cross-lingual analysis (1)
Name variants of one entity across languages: Barack Obama (eu, yo), Barak Obama (az, wo), Baraque Obama (pt); [variants in non-Latin scripts for ba, uk, ar, fa, ru, ja, th, hy, dv, yi, he, zh and ur not recoverable]
- Data processing layer
  - Detect known entities across languages using a large multilingual set of name variants (updated daily)
  - Geo-locate the articles using a large multilingual geo-database
  - Apply content-based categorization using multilingual category definitions
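Entity detection from a list of name variants amounts to a lookup that maps every spelling to one canonical entity. The sketch below assumes a small in-memory table (`NAME_VARIANTS`) holding only the Latin-script spellings shown on the slide, and a hypothetical `find_entities` function that uses plain substring matching; the real system works from a much larger, daily updated variant set.

```python
# Variant spelling (lower case) -> canonical entity name.
# Illustrative table only; limited to the Latin-script variants from the slide.
NAME_VARIANTS = {
    "barack obama": "Barack Obama",
    "barak obama": "Barack Obama",
    "baraque obama": "Barack Obama",
}


def find_entities(text):
    """Return the canonical names of all known entities mentioned in text."""
    lowered = text.casefold()
    found = {canonical
             for variant, canonical in NAME_VARIANTS.items()
             if variant in lowered}
    return sorted(found)
```

Because every variant resolves to the same canonical name, articles in different languages end up tagged with the same entity.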
20 Multilingual and cross-lingual analysis (2)
- Data presentation layer
  - Convenience links to external Machine Translation programs, where available.
  - Display of other MedISys categories, and of persons and organisations found in the text.
  - Display of on-line English translation of Chinese and Arabic
21 Aggregation of multilingual information
- Documents from all languages get classified according to the same countries and categories.
- An increase in the number of media reports on any country-category combination is detected independently of the reporting language.
- Graphs and alerts may show events not yet reported in your own language.
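Language-independent aggregation can be pictured as counting articles per (country, category) pair while ignoring the language field. In this minimal sketch the article structure (a dict with `lang`, `country` and `categories` keys) and the function name `aggregate` are assumptions for illustration.

```python
from collections import Counter


def aggregate(articles):
    """Count articles per (country, category), whatever their language."""
    counts = Counter()
    for article in articles:
        for category in article["categories"]:
            counts[(article["country"], category)] += 1
    return counts
```

An Italian and a German article about influenza in the same country both increment the same counter, which is what lets a reader see activity not yet reported in their own language.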
22 Detection using statistics
- Detect abnormal flux of reporting for a particular country/category combination
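The slide does not say which statistic is used. One common way to detect an abnormal flux, shown here purely as an assumed stand-in, is to compare today's article count for a country/category pair with the mean and standard deviation of its recent history (a z-score test). The function name `is_abnormal` and the threshold of 3 are hypothetical.

```python
from statistics import mean, stdev


def is_abnormal(history, today, z_threshold=3.0):
    """Return True when today's count is far above the historical level.

    history: past daily counts for one country/category combination.
    z_threshold: assumed cut-off, not a value from the slides.
    """
    if len(history) < 2:
        return False  # not enough data to estimate the spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return today > mu  # flat history: any increase is unusual
    return (today - mu) / sigma > z_threshold
```

For example, `is_abnormal([4, 5, 6, 5, 4, 6, 5], 30)` flags the jump, while a count of 6 on the same history does not.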
23 Recent case
24 News Clusters mostly about Category: Sat. 02-05-2009, Influenza A
25 Categorized and Clustered News: Sat. 02-05-2009, Influenza A
26 PULS Event detection
Results from Helsinki University
27 Category definitions: example haemorrhagic fever
- Terms (single or multi-word)
- Cumulative weights with threshold
- Case forcing
  - Upper-case characters in a pattern only match upper case in the text (useful for acronyms etc.)
- Wild cards
  - Single letters (_)
  - Zero, one or more letters ()
  - Adjacent words ()
- Boolean combinations of term lists
  - And, or, not
  - Using proximity operator (within X words)
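The weighted-term part of a category definition can be sketched as follows. The `_` wildcard for a single letter is taken from the slide; the symbol for "zero, one or more letters" did not survive in the slide text, so `%` is used here as an assumption. The weights, the threshold, the rule that each term counts once per article, and the function names are likewise hypothetical. Boolean combinations and the proximity operator are not covered.

```python
import re


def pattern_to_regex(pattern, multi="%"):
    """Translate a term pattern into a compiled regular expression.

    '_' matches one word character (slide syntax); `multi` matches zero or
    more word characters (assumed symbol). Upper-case letters match only
    upper case (case forcing); lower-case letters match either case.
    """
    parts = []
    for ch in pattern:
        if ch == "_":
            parts.append(r"\w")
        elif ch == multi:
            parts.append(r"\w*")
        elif ch.isupper():
            parts.append(re.escape(ch))
        elif ch.isalpha():
            parts.append("[" + ch + ch.upper() + "]")
        else:
            parts.append(re.escape(ch))
    return re.compile(r"\b" + "".join(parts) + r"\b")


def score(text, weighted_terms):
    """Sum the weights of all terms found in text (each term counted once)."""
    return sum(weight
               for pattern, weight in weighted_terms
               if pattern_to_regex(pattern).search(text))


def matches_category(text, weighted_terms, threshold):
    """An article belongs to the category when its score reaches the threshold."""
    return score(text, weighted_terms) >= threshold
```

With the illustrative terms `[("ebola", 30), ("h%morrhagic fever", 40), ("WHO", 10)]` and a threshold of 50, an article mentioning both Ebola and haemorrhagic fever qualifies, while a lower-case "who" contributes nothing because of case forcing.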
28 Customisability of MedISys
- Add more news sources or new categories, e.g.
  - Events: Cricket World Cup, Rugby World Cup, UEFA Euro 2008
  - New diseases
  - Other classes, e.g. deliberate release of chemicals
  - (on request of recognised users/partners)
- Output formats: web pages, email alerts, or RSS feed to integrate into your environment.
- Email alerts
  - Daily vs. breaking news only
  - For daily notification: specify hour
  - For breaking news: level-dependent
  - User-selected languages only
29 Customisability: filter by language/news source/category
30 Rapid News Service - RNS (restricted to subscribed users)
- Allows MedISys users to further customise their view of the news
  - Selection of specific languages and feeds
- Allows human moderation
  - Manual selection of news items
  - Drag and drop compilation of newsletters
- Allows moderators to forward news items to user groups
  - Allows user management
  - Via SMS alerts, emails or newsletters
- Shows overview of relative activity of each category over time
31 RNS moderation: editing interface for newsletter
Manual selection of news items, drag and drop compilation of newsletters.
32 RNS moderation: alert overview page
Time line shows overview of relative activity of each category over time.
33 MedISys - Summary
- High coverage helps monitor a large number of multilingual media reports.
- Includes tools to help beat the information overflow
  - via clustering, duplicate detection
  - categorization, information aggregation, visualisation, mapping
  - further means are being implemented, e.g. multilingual medical event extraction
- Special features of MedISys
  - Fully automatic (moderation possible)
  - Real time (10-minute updates), 24/7
  - High multilinguality (43 languages)
  - Multilingual information aggregation
- Part of the EMM family of applications; active team, much new functionality to come.