JRC Corporate Image Toolkit - PowerPoint PPT Presentation

Provided by: Erikvan9

Transcript and Presenter's Notes


1
The Medical Information System - MedISys
eHealth 2009: Second International ICST Conference on Electronic Healthcare
for the 21st century, September 23-25, 2009 - Istanbul, Turkey
Erik van der Goot and the OPTIMA team (OPensource Text Information Mining and
Analysis)
European Commission Joint Research Centre (JRC)
Institute for the Protection and Security of the Citizen (IPSC)
erik.van-der-goot_at_jrc.ec.europa.eu
2
JRC - who
3
JRC - where
4
MedISys - Overview
  • Objective
  • Provide open source data collection and analysis
    for surveillance and epidemiology
  • Replace manual scanning of multiple newspapers
    and web portals
  • Support national and international Public Health
    (PH) organisations to monitor issues of Public
    Health concern (e.g. CBRN)
  • Functionality
  • Gather, filter, classify, extract and aggregate
    health-related information
  • Monitor trends, detect breaking news
  • Visualise analysis results
  • Alert users
  • Allow customised views
  • In combination with the RNS tool, allow manual
    moderation

5
Background - History
  • Based on JRC's Europe Media Monitor (EMM)
    technology (EMM live since 2002,
    http://emm.newsbrief.eu).
  • On request / initiative of the EC's Directorate
    General for Health and Consumer Protection (DG
    SANCO).
  • Password-protected service for Public Health
    bodies since 2005.
  • Public service since early 2007
    (http://medusa.jrc.it/, restricted
    functionality).

6
Background - Media Monitoring
  • EU Commission Media Monitoring (until 2001/2002)
  • Traditional cut and paste for printed press only
  • Monitoring of incoming news wires (e.g. Reuters,
    AFP)
  • Simple keyword based filtering of wires
  • Manual selection of printed press items
  • Human classification of items
  • Potential problems
  • Not real-time for mainstream media: printed
    press typically once a day
  • Limited coverage: not all media is printed
  • Inaccurate and incomplete classification:
    subjective and limited number of categories
  • Labour intensive and expensive: limited number of
    articles per reviewer per day, requires topical
    knowledge and language knowledge

7
EMM History
  • New Challenges (as seen in 2002)
  • Enlargement (10 countries): more media, more
    languages
  • More use of electronic publishing (media)
  • Electronic distribution of results (web, mobile)
  • Automatic alerting functions
  • New approach: EMM - a one-stop shop for Media
    Monitoring
  • Facilitate (not replace) human Media Monitoring
    activities
  • Extend monitoring beyond the traditional news
    wires (Internet).
  • Improve coverage, number of languages, analysis.
  • Apply automatic categorization and analysis to
    all sources
  • Provide new services like automatic e-mail, sms,
    mobile editions etc.
  • Provide editorial system to manage the
    information and produce newsletters etc.
  • Important: EMM is not Yet Another Internet Search
    Engine

8
EMM System Features
  • Automatic language recognition
  • Based on continuously updated language specific
    frequency tables
  • Automated information/entity extraction
  • 400,000 persons and organizations, based on a
    continuously updated list of entities with many
    language-specific synonyms.
  • Geotagging
  • Based on a homegrown, harmonised multilingual
    geo-data set: about 600,000 place name variants
    in most languages covered by EMM, mostly
    national, regional and provincial capitals.
  • Improved Categorization Engine
  • Boolean combinations, proximity, wildcards
  • Support for Arabic and similar (automatic
    noun-prefix processing)
  • Support for Chinese and similar (no whitespace)
  • Tonality/Sentiment
  • Simple bag of words approach, range from very
    negative to very positive, corrected for long
    term source bias, interesting for following
    reporting trends per category
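
The tonality bullet above can be illustrated with a small Python sketch. It is not EMM's implementation: the word lists, the score formula and the class name are all assumptions made for illustration. It only shows the general idea of a bag-of-words score that is corrected by each source's long-term average.

```python
from collections import defaultdict

# Hypothetical mini-lexicon; real word lists would be far larger and per language.
POSITIVE = {"recovery", "success", "improve", "safe"}
NEGATIVE = {"outbreak", "death", "crisis", "fear"}


def raw_tonality(text: str) -> float:
    """Bag-of-words score in [-1, 1]: (positive - negative) / matched words."""
    words = [w.strip(".,;:!?()\"'").lower() for w in text.split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    matched = pos + neg
    return 0.0 if matched == 0 else (pos - neg) / matched


class SourceBiasCorrector:
    """Tracks each source's running mean score and subtracts it (assumed scheme)."""

    def __init__(self) -> None:
        self._total = defaultdict(float)
        self._count = defaultdict(int)

    def corrected(self, source: str, text: str) -> float:
        score = raw_tonality(text)
        # Bias is the mean of this source's earlier articles (0.0 if none yet).
        n = self._count[source]
        bias = self._total[source] / n if n else 0.0
        self._total[source] += score
        self._count[source] += 1
        return score - bias
```

With this scheme a source that is habitually negative only stands out when an article is more negative than its own average, which is what makes the corrected value usable for following reporting trends per category.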

9
more features
  • Duplicate detection
  • Metadata categorization
  • Allows selection of articles based on any
    previously assigned meta-data.
  • Automated information linking
  • Incremental topic-based clustering and story
    tracking, geolocation.
  • Incremental clustering at 10-minute intervals on
    the last 4 hours' worth of news (Top Stories on
    front page).
  • Automatic detection of breaking news
  • Cluster growth rate
  • Flux of articles per category
  • Indexing
  • Index full text and most metadata.
  • Statistics/Trend analysis
  • Quantitative analysis of reporting. Maintain
    simple count statistics.
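
As a rough illustration of the "cluster growth rate" signal listed above, the following Python sketch compares cluster sizes between two consecutive clustering runs. The function names and both thresholds (`min_size`, `min_growth`) are assumptions chosen for the example, not values used by EMM.

```python
def growth_rate(previous_size: int, current_size: int) -> float:
    """Relative growth of a cluster between two consecutive clustering runs."""
    if previous_size <= 0:
        # A brand-new cluster has no baseline; report its size as the growth.
        return float(current_size)
    return (current_size - previous_size) / previous_size


def breaking_clusters(previous: dict[str, int], current: dict[str, int],
                      min_size: int = 5, min_growth: float = 0.5) -> list[str]:
    """Return ids of clusters that are big enough and growing fast enough."""
    flagged = []
    for cluster_id, size in current.items():
        rate = growth_rate(previous.get(cluster_id, 0), size)
        if size >= min_size and rate >= min_growth:
            flagged.append(cluster_id)
    return sorted(flagged)
```

Requiring a minimum size as well as a minimum growth keeps very small clusters from being flagged just because they doubled from one article to two.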

10
and more features
  • Event extraction
  • Language independent event grammars used to
    parse clusters using language dependent resources
    to fill the grammar slots.
  • Currently for 5 languages (en, fr, it, pt, ru),
    covering violent events and humanitarian events

11
Development time line
[Timeline figure; labels: EMM/RNS; Domain specific application; MedISys;
Continuous development, New features, NewsExplorer; First version 2005; EMM
System redesign; Redesign based on EMM; RNS redesign]
12
MedISys System Overview
MedISys NewsBrief
NewsDesk Service (a.k.a. RNS) Editorial Interface
EMM Open Source Monitoring Engine
13
Problems to solve
  • Find relevant information
  • Millions of new articles/blogs/items/tweets
    published on Internet each day
  • Deliver the information to the right user
  • Allow for many (possibly overlapping) categories
    to meet specific needs
  • Timely
  • Right now if possible
  • In short: deliver targeted information to the
    right user in a timely manner

14
Approach
  • Wide coverage
  • Many sources
  • Local, Regional, National and International
    coverage
  • Many languages
  • Multilinguality: cross-lingual information
    access
  • Fast coverage
  • High frequency monitoring of sites, some sites
    every 5 minutes
  • Overcome the information overflow
  • Categorization, aggregation, duplicate
    identification, clustering
  • Customisability of MedISys NewsBrief
  • Search functions
  • RNS tool for manual moderation and targeted
    dissemination

15
Input data
  • 2,200 sources (world-wide, but primary focus on
    Europe)
  • 4,000 HTML web pages and RSS feeds
  • 100 specialist medical sites
  • 20 commercial newswires
  • Specialist pay-for sources (LexisMed)
  • 24/7, near continuous monitoring
  • 80,000 new articles/items per day
  • Converts dirty html with adverts, menus, html
    tags, related stories, etc. into clean and
    standardised Unicode-encoded RSS format
  • Use RSS when available
  • Perform full content analysis
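
The HTML clean-up step mentioned above can be sketched with Python's standard `html.parser` module. This is only an illustration of the idea of dropping page furniture and keeping article text. The set of skipped elements is an assumption, and the sketch stops at plain text rather than producing the standardised RSS format the slide refers to.

```python
from html.parser import HTMLParser

# Elements whose content is treated as page furniture (an assumption, not EMM's rules).
SKIPPED = {"script", "style", "nav", "aside", "header", "footer"}


class TextExtractor(HTMLParser):
    """Collects visible text while ignoring everything inside SKIPPED elements."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self) -> str:
        return " ".join(self._chunks)


def clean_html(html: str) -> str:
    """Return the text content of an HTML page without tags or skipped blocks."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

A production scraper would need site-specific rules for adverts and related-story boxes, which generic tag names cannot identify.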

16
MedISys Screenshots
17
MedISys: Current subscribers and users include
  • Supranational organisations
  • Directorate General Health and Consumer
    Protection (SANCO)
  • European Centre for Disease Prevention and
    Control (ECDC), Stockholm
  • European Food Safety Authority (EFSA)
  • World Health Organisation (WHO)
  • National Public Health organisations
  • Swiss Federal Office of Public Health
  • Icelandic Ministry of Health
  • Spanish Ministry of Sanitation Ministry of
    Health and Consumer Protection
  • Institut de Veille Sanitaire (France)
  • Global Public Health Intelligence Network
    (Canada)
  • Danish Emergency Management Agency
  • Italian Ministry of Health and Ministry of
    Defence
  • Dutch Institute of Public Health; Food and
    Consumer Product Safety Authority
  • The (general?) public
  • Currently 1,000 visitors and 37,000 hits per day
    on the public system

18
Locations mentioned in MedISys medical articles
across languages
Importance of multilingual information gathering
Italian - German
19
Multilingual and cross lingual analysis (1)
Name variants of one person across languages: Barack Obama (eu, yo), Barak
Obama (az, wo), Baraque Obama (pt), and non-Latin-script variants for ba, uk,
ar, fa, ru, ja, th, hy, dv, yi, he, zh, ur.
  • Data processing layer
  • Detect known entities across languages using
    large multilingual set of name variants (updated
    daily)
  • Geo-locate the articles using large multilingual
    geo-database
  • Apply content based categorization using
    multilingual category definitions
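
A minimal Python sketch of the first step, detecting known entities through a table of name variants, is given below. The three Obama spellings come from the slide. The entity ids, the WHO entries and the plain substring matching are assumptions for illustration only.

```python
# Hypothetical variant table: surface form -> canonical entity id.
NAME_VARIANTS = {
    "barack obama": "person:barack_obama",
    "barak obama": "person:barack_obama",
    "baraque obama": "person:barack_obama",
    "world health organisation": "org:who",
    "organisation mondiale de la santé": "org:who",
}


def find_entities(text: str, variants: dict[str, str] = NAME_VARIANTS) -> set[str]:
    """Return canonical ids of all known name variants occurring in the text."""
    lowered = text.casefold()
    return {entity for name, entity in variants.items() if name.casefold() in lowered}
```

Because every variant maps to the same canonical id, articles in different languages end up tagged with the same entity. Substring matching ignores word boundaries, so a real system would need tokenisation that also works for scripts without whitespace.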

20
Multilingual and cross lingual analysis (2)
  • Data presentation layer
  • Convenience links to external Machine
    Translation programs, where available.
  • Display of other MedISys categories, of persons
    and organisations found in text.
  • Display on-line English translation of Chinese
    and Arabic

21
Aggregation of multilingual information
  • Documents from all languages get classified
    according to the same countries and categories.
  • An increase in the number of media reports on any
    country-category combination is detected,
    independently of the reporting language.
  • Graphs and alerts may show events not yet
    reported in your own language.
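
The aggregation described above amounts to counting articles per (country, category) pair while ignoring the language field. The Python sketch below shows this under assumed data structures. The `Article` type and its field names are hypothetical.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class Article(NamedTuple):
    language: str
    countries: tuple[str, ...]
    categories: tuple[str, ...]


def count_by_country_category(articles: Iterable[Article]) -> Counter:
    """Count articles per (country, category) pair, ignoring language."""
    counts: Counter = Counter()
    for article in articles:
        for country in article.countries:
            for category in article.categories:
                counts[(country, category)] += 1
    return counts
```

An Italian and a German article about influenza in Italy both increment the same counter, so a rise becomes visible even to a user who reads neither language.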

22
Detection using statistics
  • Detect abnormal flux of reporting for a
    particular country/category combination
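
One simple way to detect an abnormal flux is to compare today's article count for a country/category pair with its recent history. The slide does not say which statistic MedISys uses, so the z-score test and the threshold of 3 in this Python sketch are assumptions.

```python
from statistics import mean, pstdev


def is_abnormal(history: list[int], today: int, threshold: float = 3.0) -> bool:
    """Flag today's count if it exceeds the historical mean by `threshold` std devs."""
    if len(history) < 2:
        return False  # not enough data for a baseline
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        return today > mu  # flat history: any increase stands out
    return (today - mu) / sigma >= threshold
```

For example, with a history of around five articles per day, a day with twelve articles would be flagged while a day with six would not.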

23
Recent case
24
News Clusters mostly about Category: Sat.
02-05-2009, Influenza A
25
Categorized and Clustered News: Sat. 02-05-2009,
Influenza A
26
PULS Event detection
Results from Helsinki University
27
Category definitions - Example: haemorrhagic fever
  • Terms (single or multi-word)
  • Cumulative weights with threshold
  • Case forcing
  • Upper case characters in pattern only match
    uppercase in text (useful for acronyms etc.)
  • Wild cards
  • Single letters (_)
  • Zero, one or more letters ()
  • Adjacent words ()
  • Boolean combinations of term lists
  • And, or, not
  • Using proximity operator (within X words)
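
The matching rules listed above can be sketched in Python as follows. This is an illustration, not the EMM category engine. Only `_` for a single letter is taken from the slide. The symbols `%` (zero or more letters) and `+` (adjacent words), the weights and the threshold are assumptions, and Boolean combinations and the proximity operator are left out.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate one term pattern into a compiled regular expression.

    Assumed syntax: '_' = one word character, '%' = zero or more word
    characters, '+' = whitespace between adjacent words.
    """
    parts = []
    for ch in pattern:
        if ch == "_":
            parts.append(r"\w")
        elif ch == "%":
            parts.append(r"\w*")
        elif ch == "+":
            parts.append(r"\s+")
        elif ch.isupper():
            parts.append(re.escape(ch))  # case forcing: must be upper case in text
        elif ch.isalpha():
            parts.append(f"[{ch}{ch.upper()}]")  # lower case matches either case
        else:
            parts.append(re.escape(ch))
    return re.compile(r"\b" + "".join(parts) + r"\b")


def category_score(text: str, weighted_terms: dict[str, int]) -> int:
    """Sum the weight of every term pattern that occurs at least once."""
    return sum(weight for pattern, weight in weighted_terms.items()
               if pattern_to_regex(pattern).search(text))


def matches_category(text: str, weighted_terms: dict[str, int], threshold: int) -> bool:
    """An article belongs to the category if its cumulative weight reaches the threshold."""
    return category_score(text, weighted_terms) >= threshold


# Hypothetical definition for the haemorrhagic fever example.
HAEMORRHAGIC_FEVER = {"h%emorrhagic+fever": 50, "ebola%": 30, "WHO": 10, "outbreak%": 10}
```

With a threshold of 50, the text "New Ebola outbreak reported; WHO sends team." reaches exactly 50 and matches, while "Who knows about the outbreak?" scores only 10 because the upper-case pattern `WHO` does not match "Who".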

28
Customisability of MedISys
  • Add more news sources or new categories, e.g.
  • Events: Cricket World Cup, Rugby World Cup, UEFA
    Euro 2008
  • New diseases
  • Other classes, e.g. deliberate release of
    chemicals
  • (on request of recognised users/partners)
  • Output formats: web pages, email alerts, or RSS
    feed to integrate into your environment.
  • Email alerts
  • daily vs. breaking news only
  • for daily notification: specify hour
  • for breaking news: level-dependent
  • User-selected languages only

29
Customisability: Filter by language/news
source/category
30
Rapid News Service - RNS (restricted to
subscribed users)
  • Allows MedISys users to further customise their
    view of the news
  • Selection of specific languages and feeds
  • Allows human moderation
  • Manual selection of news items
  • Drag and drop compilation of newsletters
  • Allows moderators to forward news items to user
    groups
  • Allows user management
  • Via SMS alerts, emails or newsletters
  • Shows overview of relative activity of each
    category over time

31
RNS moderation: Editing interface for newsletter
Manual selection of news items, drag and drop
compilation of newsletters.
32
RNS moderation: Alert overview page
Time line shows overview of relative activity of
each category over time.
33
MedISys - Summary
  • High coverage helps monitor a large number of
    multilingual media reports.
  • Includes tools to help beat the information
    overflow
  • via clustering, duplicate detection
  • categorization, information aggregation,
    visualisation, mapping
  • further means are being implemented, e.g.
    multilingual medical event extraction
  • Special features of MedISys
  • Fully automatic (moderation possible)
  • Real time (10-minute updates), 24/7
  • High multilinguality (43 languages)
  • Multilingual information aggregation
  • Part of EMM family of applications, active team:
    much new functionality to come.