Quality Labelling of Web Content

1
Quality Labelling of Web Content
Software & Knowledge Engineering Laboratory
Institute of Informatics & Telecommunications
NCSR 'Demokritos', Athens, Greece
Vangelis Karkaletsis
3rd IFIP Conference on Artificial Intelligence
Applications & Innovations (AIAI 2006), Athens,
7-9 June 2006
2
Contents
  • Quality labels / trustmark schemes
  • Existing labeling processes
  • Needs for new technologies
  • On-going projects and initiatives:
    QUATRO, MedIEQ, W3C WCL-XG
  • Concluding Remarks

3
Quality labels / trustmark schemes - I
  • Quality labels / trustmark schemes have been
    established in many parts of the world:
  • some are online versions of existing schemes,
  • others have been developed specifically for the
    web.

4
Quality labels / trustmark schemes - II
  • Inform the user about the quality of the data and
    services provided:
  • whether these are of a certain type, fulfil
    certain criteria or meet given requirements
  • for example, a label may include an assertion
    that the labeled web site has a suitable privacy
    policy, that the publisher is clearly identified,
    and that it meets legal practice in one or more
    identified countries.
  • Two notable areas of interest for quality label
    / trustmark schemes are:
  • those designed to give consumers confidence in
    eCommerce operations, and
  • those that indicate that health related content
    has been peer reviewed

5
Quality labels / trustmark schemes - III: an
example, the WMA label for health related web
content
  • Identification
  • Content
  • Confidentiality
  • Advertising and Sponsoring
  • Virtual Consultation
  • Non compliance

6
Quality labels / trustmark schemes - IV: some
facts about health related web content
  • The number of health web sites and online
    services is increasing day by day
  • 70-80% of Internet users seek health information
    for themselves or for their relatives
  • More than 4 out of 10 health information seekers
    say the material they find affects their health
    decisions

7
Quality labels / trustmark schemes - V: some facts
about health related web content
  • Quality of health related web content is
    extremely variable:
  • from evidence-based healthcare to widespread
    practice of fraud and potentially dangerous
    claims
  • Increase in consumer knowledge changes how
    patients, professionals and providers interact
  • Patients are becoming more proactive in their
    care management
  • Effects on public health
  • An example: vaccines

10
Existing labeling processes - I
  • Organisations around the world are working on
    establishing quality standards
  • For example, concerning health sites:
  • European Commission:
    eEurope 2002 Quality criteria for health related
    web sites
  • American Medical Association:
    Guidelines for medical and health information
    sites on the Internet
  • Internet Healthcare Coalition:
    eHealth Code of Ethics
  • ...

11
Existing labeling processes - II
  • Quality standards initiatives are not enough
  • Self-adherence to codes of conduct or ethics is
    nothing more than a claim with little
    enforceability
  • Labeling mechanisms need to be established:
  • by third party accreditation
  • by creating portals where web sites are organized
    and characterized against certain labeling
    criteria
  • Such initiatives already exist

12
Existing labeling processes - III
  • Codes of Conduct: sets of quality criteria that
    provide a list of recommendations for the
    development and content of websites
  • Quality Label (logo): displayed on screen, it
    represents a commitment by a provider to
    implement or adhere to a code of conduct
  • User Guidance: enables users to check whether a
    site complies with certain standards by accessing
    a series of questions from a displayed logo
  • Filtering Tools: applied manually or
    automatically, they accept or reject web
    resources; resources are selected for their
    quality and relevance to a particular audience
  • Third Party Certification: quality and
    accreditation labels are awarded by a third party
    to inform consumers that a site provides
    information meeting current standards for content
    and form
13
Existing Labeling Processes - IV
  • (A.) A web site issues a request for a label to
    a third party (the labeling operator)
  • The site is checked and, if OK, a label is
    generated
  • The label is either stored locally on the site's
    server or stored in the operator's database (a
    link to the label is added to the web site)
  • The site's content is examined periodically and,
    if an unacceptable change occurs, the label is
    either removed or replaced with a relevant message

14
Existing Labeling Processes - V
  • (B.) Location of unlabeled web sites in specific
    thematic areas
  • Characterization of the located sites, with
    respect to certain criteria
  • Filtering of some of the web sites based on their
    characterization
  • Organizing the rest of the web sites into web
    directories to facilitate access by information
    consumers
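
The steps of process (B.) can be sketched as a small pipeline. This is an illustration only: the `Site` record, the criteria names, the example URLs and the `min_criteria` threshold are all hypothetical, and the location and characterization steps are assumed to have already produced the input list.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Site:
    url: str
    topic: str                                      # thematic area of the site
    criteria_met: set = field(default_factory=set)  # result of characterization


def filter_sites(sites, required, min_criteria):
    """Keep only sites meeting at least `min_criteria` of the required criteria."""
    return [s for s in sites if len(s.criteria_met & required) >= min_criteria]


def build_directory(sites):
    """Group the remaining sites by thematic area (a very small web directory)."""
    directory = defaultdict(list)
    for site in sites:
        directory[site.topic].append(site.url)
    return dict(directory)


# Hypothetical sites that have already been located and characterized.
located = [
    Site("http://a.example", "vaccines", {"identification", "privacy_policy"}),
    Site("http://b.example", "vaccines", {"identification"}),
    Site("http://c.example", "diabetes",
         {"identification", "privacy_policy", "sponsoring_disclosed"}),
]
required = {"identification", "privacy_policy"}

directory = build_directory(filter_sites(located, required, min_criteria=2))
print(directory)
```

In this toy run, `http://b.example` is filtered out because it meets only one of the two required criteria; the other two sites end up under their thematic areas.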

15
Need for new labeling technologies - I
  • Problems of existing labeling processes are due to:
  • High costs to offer the service
  • Huge amount of information to assess (too many
    sites)
  • Content changes rapidly
  • Broken links to accredited websites
  • Rating criteria are not standardised
  • Dishonest use of the label

16
Need for new labeling technologies - II
  • Most of the work in labeling processes is
    currently performed manually
  • A site may have hundreds of pages (static/dynamic
    ones) or other resources (.doc, .rtf, .pdf,
    images, ...)
  • Probably all or most of the resources have to be
    checked
  • A single label may be used for the whole site, or
    different labels for the site's resources

17
Need for new labeling technologies - III
  • Access of the end-users to labeled resources must
    be improved
  • If labels could be recognised by web browsers and
    search engines this would motivate content
    providers to label their resources

18
Need for new labeling technologies - IV
  • Need for technologies that enable the automation
    of the labeling operator's work. This involves:
  • Technology for creating machine processable
    labels
  • establishing common schemas and vocabularies
    exploiting semantic web technologies (RDF, OWL),
  • developing label generators with user-friendly
    interfaces
  • Technology for maintaining the labels
  • monitoring the label with respect to its issue
    and expiration dates and its integrity (when
    stored by the content provider)
  • monitoring the label with respect to the labeled
    content, exploiting content analysis technologies
  • Technology for locating unlabeled web resources
  • necessary for domain specific portals collecting
    web resources meeting certain quality criteria

19
Need for new labeling technologies - V
  • Need for technologies that enable the access of
    the end-user to the label and its content. This
    involves:
  • Technology for locating the label inside a web
    resource, reading and validating the label's
    content
  • Technology for presenting the label's content and
    validation results to the end user

20
Technology for creating machine processable
labels - I
  • Advantages of establishing common schemas and
    vocabularies
  • A label that is machine readable and uses common
    descriptors will be interpreted more easily by
    semantic web tools than one that uses purely
    proprietary elements
  • For instance, if a user agent is configured to
    look for Label A but finds a site that is
    accredited by Label B, at least the common
    elements will be recognised, even if those
    specific to Label B are not.
  • The incentive for content providers to gain
    accreditation for their material is therefore
    enhanced if the labeling scheme they adopt uses
    at least some of the common descriptors
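
The Label A / Label B scenario above can be illustrated with a few lines of code. This is a minimal sketch under stated assumptions: the descriptor names (`privacy_statement`, `labelB:fair_marketing`, and so on) are invented for the example and are not taken from any real labeling vocabulary.

```python
# Descriptors the user agent understands: the common ones plus Label A's own.
# All names below are hypothetical.
COMMON = {"privacy_statement", "contact_point", "issued", "last_reviewed"}
LABEL_A_ONLY = {"labelA:peer_reviewed"}
KNOWN_TO_AGENT = COMMON | LABEL_A_ONLY


def interpret(label, known=KNOWN_TO_AGENT):
    """Split a label into the descriptors the agent recognises and the rest."""
    recognised = {k: v for k, v in label.items() if k in known}
    ignored = sorted(k for k in label if k not in known)
    return recognised, ignored


# A label issued under Label B: two common descriptors plus a proprietary one.
label_b = {
    "privacy_statement": True,
    "contact_point": True,
    "labelB:fair_marketing": True,
}

recognised, ignored = interpret(label_b)
print(recognised)  # the common descriptors are still understood
print(ignored)     # the Label B specific descriptor is skipped
```

The agent was configured for Label A, yet it still extracts the two common descriptors from the Label B label, which is the benefit the slide argues for.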

21
Technology for creating machine processable
labels - II
  • Advantages of establishing common schemas and
    vocabularies (cont.)
  • A common set of elements facilitates the
    application of web content analysis techniques to
    ensure that an accredited site continues to meet
    at least some of the common labeling criteria.
  • For instance, a content analyser cannot tell
    whether an e-mail sent to an eCommerce operator
    will be responded to within a given time, but it
    can detect that a contact route is still provided
    3 months after the site was last reviewed by a
    human, even if the nature of the contact route
    changes
  • The use of a common vocabulary offers commercial
    advantages to labeling authorities by increasing
    the value of the labels for content providers and
    end-users
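
The contact route example can be made concrete with a very small content analyser. The sketch below assumes that a contact route shows up either as a `mailto:` link or as an HTML form; real analysers would need far richer heuristics, and the two example pages are made up.

```python
from html.parser import HTMLParser


class ContactRouteFinder(HTMLParser):
    """Collects evidence of a contact route: mailto links or forms."""

    def __init__(self):
        super().__init__()
        self.routes = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        href = attrs.get("href") or ""
        if tag == "a" and href.lower().startswith("mailto:"):
            self.routes.append(("email", href[len("mailto:"):]))
        elif tag == "form":
            self.routes.append(("form", attrs.get("action") or ""))


def contact_routes(html):
    """Return the list of contact routes found in an HTML page."""
    finder = ContactRouteFinder()
    finder.feed(html)
    return finder.routes


# Hypothetical snapshots of the same page at review time and 3 months later.
page_at_review = '<p><a href="mailto:info@shop.example">Contact us</a></p>'
page_later = '<form action="/contact" method="post"><input name="msg"></form>'

print(contact_routes(page_at_review))
print(contact_routes(page_later))
```

Both snapshots yield a non-empty result, so the criterion "a contact route is provided" still holds even though the nature of the route changed from e-mail to a form.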

22
Technology for creating machine processable
labels - III
  • Establishing common schemas and vocabularies
    exploiting semantic web technologies (RDF, OWL)
  • Some basics on Resource Description Framework
    (RDF)
  • Established by W3C
  • Can be considered as providing our file format
  • We'll use it for sharing Web content labels
    (import/export/publish for our tools)
  • Queried using an SQL-like language, SPARQL
  • RDF/XML files make simple statements, for example:
  • Advertising is present here
  • There is a service of virtual consultation for
    professionals
  • The intended audience is health professionals
  • It's in the MeSH 'quality of health care' category

23
Technology for creating machine processable
labels - IV
  • Some basics on RDF (cont.)
  • Enables sharing the work by mixing schemas:
  • QUATRO for talking about 'advertisements'
  • <quatro:ac>1</quatro:ac>
  • WMA for 'virtual consultations'
  • <wma:virtconsprof>0</wma:virtconsprof>
  • Dublin Core for 'intended audience'
  • <dc:audience>Children</dc:audience>
  • And unlimited others...
  • Re-use of existing RDF vocabularies means:
  • saving time from re-defining existing concepts
  • making re-use of both software and data more
    likely
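
To show how properties from several vocabularies can sit side by side in one label, here is a minimal sketch that reads a small RDF/XML document with Python's standard XML parser. The namespace URIs for QUATRO and WMA are placeholders (assumptions), the labeled resource is invented, and the hand-written reader only handles flat `rdf:Description` blocks with literal values; it is not a full RDF parser.

```python
import xml.etree.ElementTree as ET

# The QUATRO and WMA namespace URIs are placeholders; the rdf and
# Dublin Core terms namespaces are the published ones.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "quatro": "http://example.org/quatro#",
    "wma": "http://example.org/wma#",
    "dc": "http://purl.org/dc/terms/",
}

LABEL = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:quatro="http://example.org/quatro#"
    xmlns:wma="http://example.org/wma#"
    xmlns:dc="http://purl.org/dc/terms/">
  <rdf:Description rdf:about="http://www.example.org/">
    <quatro:ac>1</quatro:ac>
    <wma:virtconsprof>0</wma:virtconsprof>
    <dc:audience>Children</dc:audience>
  </rdf:Description>
</rdf:RDF>"""


def read_label(rdf_xml):
    """Return {resource: {prefixed property: literal value}} for flat labels."""
    prefix_of = {uri: prefix for prefix, uri in NS.items()}
    labels = {}
    root = ET.fromstring(rdf_xml)
    for desc in root.findall("rdf:Description", NS):
        about = desc.get("{%s}about" % NS["rdf"])
        props = {}
        for child in desc:
            # ElementTree tags look like "{namespace-uri}localname".
            uri, local = child.tag[1:].split("}")
            props["%s:%s" % (prefix_of[uri], local)] = child.text
        labels[about] = props
    return labels


print(read_label(LABEL))
```

Each property keeps the prefix of the vocabulary it came from, so a tool that only knows Dublin Core can still pick out `dc:audience` and skip the rest.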

24
Technology for creating machine processable
labels - V
  • Some basics on the Web Ontology Language (OWL)
  • W3C recommendation
  • Can be used to explicitly represent the meaning
    of terms in vocabularies and the relationships
    between those terms.
  • OWL has more facilities for expressing meaning
    and semantics than XML, RDF, and RDF-S, and thus
    OWL goes beyond these languages in its ability to
    represent machine interpretable content on the
    Web.
  • OWL is a revision of the DAML+OIL web ontology
    language
  • OWL has three increasingly expressive
    sublanguages: OWL Lite, OWL DL, and OWL Full.

25
Technology for creating machine processable
labels - VI
  • Having the languages for representing the labels'
    data is not enough to attract content providers
    to create labels and add them to their content
  • Developing label generators with user-friendly
    interfaces for the content providers
  • An example: the ICRA label generator

26
Technology for maintaining the labels - I
  • Monitoring the label's content
  • When a label is generated, the following data may
    be stored in the labeling authority's database:
  • issue date, expiration date
  • the label's content hash
  • the label itself
  • If the label is reviewed at some point
  • the labeling authority updates the dates
  • the hash may also be updated if, after the review,
    the authority changes the label (the new one is
    sent to the content provider)
  • it is also possible that the provider asks for
    changes in the label
  • The data stored about the label can be used by a
    tool to examine
  • the label against the expiration date (if the date
    has passed, alert the authority)
  • the label's integrity (when the label is stored by
    the content provider, he/she may change it)
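
The two checks listed above (expiration and integrity) can be sketched as follows. The record layout, the choice of SHA-256 as the hash function and the alert names are assumptions made for the example; the slides do not prescribe any of them.

```python
import hashlib
from dataclasses import dataclass
from datetime import date


def label_hash(label_text):
    """Hash of the label's content (SHA-256 is an assumption)."""
    return hashlib.sha256(label_text.encode("utf-8")).hexdigest()


@dataclass
class LabelRecord:
    """What the labeling authority keeps about an issued label."""
    issued: date
    expires: date
    content_hash: str
    label_text: str


def check_label(record, published_label, today):
    """Return a list of alerts; an empty list means no problem was found."""
    alerts = []
    if today > record.expires:
        alerts.append("expired")
    if label_hash(published_label) != record.content_hash:
        alerts.append("modified")
    return alerts


# Hypothetical label text and dates.
issued_label = "<rdf:RDF>...label as issued...</rdf:RDF>"
record = LabelRecord(date(2006, 1, 1), date(2007, 1, 1),
                     label_hash(issued_label), issued_label)

print(check_label(record, issued_label, date(2006, 6, 7)))
print(check_label(record, issued_label + " ", date(2007, 6, 7)))
```

The first call checks an untouched label within its validity period; the second simulates a label that was altered by the provider and checked after its expiration date.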

27
Technology for maintaining the labels - II
  • Monitoring the label's content against the
    content of the labeled resource using content
    analysis technologies
  • spidering technology that enables navigating the
    monitored site to locate resources (pages,
    documents) related to the labeling criteria
  • information extraction technology to extract from
    the located resources the data corresponding to
    the labeling criteria

28
Technology for maintaining the labels - III
  • Spidering involves techniques and tools for:
  • Site navigation: traverses a Web site,
    collecting information from each resource visited
    and forwarding it to the resource
    classification and link scoring modules
  • Resource classification: decides whether a
    resource is an interesting one
    and should be stored or not, exploiting
  • machine learning techniques to train a
    classifier,
  • techniques for natural language processing, image
    analysis, page layout analysis that will provide
    the necessary features for the classifier's
    training
  • domain specific resources (terminologies,
    vocabularies, ontologies)
  • Link-scoring validates the links to be followed.
    Only links with a score above a certain threshold
    are followed.
  • machine learning techniques, heuristics, domain
    specific resources can be used
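
The interplay of site navigation, resource classification and link scoring can be sketched on a toy, in-memory site. Everything here is hypothetical: the pages, the keyword list, the scores and the threshold; the keyword test merely stands in for a trained classifier.

```python
# Toy site: page -> (text, outgoing links). All content is made up.
SITE = {
    "/": ("Welcome to our clinic", ["/about", "/news", "/contact"]),
    "/about": ("About us: editorial board and funding", ["/contact"]),
    "/news": ("Latest offers", ["/shop"]),
    "/contact": ("Contact: address and e-mail", []),
    "/shop": ("Buy now", []),
}
KEYWORDS = ("about", "contact", "privacy", "editorial", "funding")


def score_link(url):
    """Heuristic link score: 1.0 if the URL mentions a keyword, else 0.1."""
    return 1.0 if any(k in url for k in KEYWORDS) else 0.1


def is_interesting(text):
    """Stand-in for the trained resource classifier."""
    return any(k in text.lower() for k in KEYWORDS)


def spider(start, threshold=0.5):
    """Traverse the site, following only links scored above the threshold."""
    to_visit, seen, stored = [start], {start}, []
    while to_visit:
        url = to_visit.pop(0)
        text, links = SITE[url]
        if is_interesting(text):
            stored.append(url)
        for link in links:
            if link not in seen and score_link(link) > threshold:
                seen.add(link)
                to_visit.append(link)
    return stored


print(spider("/"))
```

With the default threshold, the spider never visits `/news` or `/shop`, because their links score below 0.5; only the pages relevant to the labeling criteria are fetched and stored.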

29
Technology for maintaining the labels - IV
  • Information extraction may involve
  • wrappers for different types of resources
  • techniques and tools from the areas of machine
    learning, natural language processing, image
    analysis (in case of image resources) for
  • term extraction
  • named entity recognition and classification
  • relations extraction
  • use of domain specific resources (terminologies,
    vocabularies, ontologies)
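
As an illustration of term extraction driven by a domain specific resource, the following sketch matches a page's text against a tiny terminology. The terminology entries and the sample sentence are invented; a real system would combine such lookups with machine learning and linguistic processing.

```python
import re

# A tiny, made-up terminology: surface form -> concept.
TERMINOLOGY = {
    "vaccine": "Vaccines",
    "vaccines": "Vaccines",
    "measles": "Measles",
    "virtual consultation": "Virtual consultation",
}


def extract_terms(text, terminology=TERMINOLOGY):
    """Return (surface form, concept) pairs in order of appearance."""
    # Longer surface forms come first so that they win over shorter ones.
    forms = sorted(terminology, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(f) for f in forms) + r")\b",
        re.IGNORECASE,
    )
    return [(m.group(1), terminology[m.group(1).lower()])
            for m in pattern.finditer(text)]


sample = "Measles vaccines are offered; a virtual consultation is available."
print(extract_terms(sample))
```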

30
Technology for locating unlabeled web resources
  • Use of focused crawling technology to locate
    unlabeled domain specific web resources
    exploiting
  • existing search engines
  • machine learning techniques
  • domain specific resources

31
Technologies for improving accessibility to the
labels - I
  • Locating the label of a web resource, reading and
    validating its content
  • parse the resource's content to locate an RDF
    label
  • if such a label exists
  • identify the labeling authority
  • calculate the label's hash
  • get for the specific resource the data stored in
    the authority's database
  • process the resource's content with the content
    analyser
  • validate the label against the data in the
    authority's database
  • provide the label's data and the validation
    results to the tools responsible for presenting
    them to the end-user
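
A reduced version of these steps can be sketched as follows. The sketch assumes that the label is embedded in the page as an `<rdf:RDF>` block, that the authority's database can be modelled as a dictionary keyed by URL, and that SHA-256 is the hash in use; none of this is specified in the slides. Identifying the labeling authority and running the content analyser are left out.

```python
import hashlib
import re

# Hypothetical authority database: resource URL -> hash of the issued label.
AUTHORITY_DB = {}

LABEL_RE = re.compile(r"<rdf:RDF\b.*?</rdf:RDF>", re.DOTALL)


def find_label(resource_content):
    """Return the first embedded RDF block, or None if there is no label."""
    match = LABEL_RE.search(resource_content)
    return match.group(0) if match else None


def validate(url, resource_content, db=AUTHORITY_DB):
    """Compare the hash of the label found in the resource with the stored one."""
    label = find_label(resource_content)
    if label is None:
        return "no label"
    stored_hash = db.get(url)
    if stored_hash is None:
        return "unknown to authority"
    found_hash = hashlib.sha256(label.encode("utf-8")).hexdigest()
    return "valid" if found_hash == stored_hash else "modified"


# Made-up label and page.
label = '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"></rdf:RDF>'
page = "<html><head>" + label + "</head><body>Health advice</body></html>"
AUTHORITY_DB["http://site.example/"] = hashlib.sha256(label.encode("utf-8")).hexdigest()

print(validate("http://site.example/", page))
print(validate("http://site.example/", page.replace("xmlns:rdf", "xmlns:rdf ")))
print(validate("http://site.example/", "<html></html>"))
```

The returned string is what a presentation tool would receive; a fuller version would also carry the label's data and the content analyser's findings.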

32
Technologies for improving accessibility to the
labels - II
  • Presenting the label's content and validation
    results to the end user
  • enabling existing web browsers and search engines
  • to communicate with web services that are able to
    locate and validate labels in the retrieved
    resources, as well as
  • to present the label's data and validation
    results in a format understandable by the
    end-user
  • natural language generation technology can be
    exploited to present the label's content in the
    end-user's language according to his/her
    knowledge and needs

33
QUATRO project - I
  • QUATRO (Quality Assurance and Content
    Description): an EC-funded project (Safer Internet
    Programme)
  • Provides a common vocabulary and machine
    processable RDF schema for quality labeling
  • Known as RDF-CL, it allows a small amount of
    metadata to be applied to anything from a single
    resource such as a web page, to millions of items
    on any number of web sites.
  • A default label can be set for a whole web site
    or set of web sites, with overrides coming into
    play as required.
  • Labels can be stored on the labeled site or in a
    database operated by the Labeling Authority.
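
The idea of a site-wide default label with overrides can be illustrated with a simple prefix-based lookup. This is not RDF-CL's actual syntax or matching logic; the rule format, property names and URLs are assumptions chosen to keep the example short.

```python
# Hypothetical rules: (URL prefix, label properties).
# The site-wide rule acts as the default.
RULES = [
    ("http://www.example.org/", {"advertising": False, "audience": "general"}),
    ("http://www.example.org/pro/", {"audience": "health professionals"}),
    ("http://www.example.org/shop/", {"advertising": True}),
]


def label_for(url, rules=RULES):
    """Merge every matching rule, shortest prefix first, so that more
    specific rules override the default."""
    label = {}
    for prefix, properties in sorted(rules, key=lambda rule: len(rule[0])):
        if url.startswith(prefix):
            label.update(properties)
    return label or None


print(label_for("http://www.example.org/index.html"))
print(label_for("http://www.example.org/pro/guidelines.html"))
print(label_for("http://other.example/"))
```

Three short rules are enough to label every page of the example site, which is the economy the slide describes: a small amount of metadata covering anything from one page to a whole site.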

34
QUATRO project - II
  • The QUATRO vocabulary is divided into four
    categories:
  • General criteria, such as whether the labeled
    site includes a privacy statement, contact point
    etc.
  • Criteria for labeling to ensure accuracy of
    information, such as the content provider's
    credentials and appropriate disclosure of
    funding.
  • Criteria for labeling to ensure compliance with
    rules and legislation for e-business, such as fair
    marketing practices and measures to protect
    children
  • Terms used in operating the labeling scheme
    itself, such as the date the label was issued,
    when it was last reviewed and by whom.
  • Three different domain specific vocabularies
    developed by QUATRO partners:
  • ICRA: nudity, sexual content, violence, ...
  • IQUA: integrity, responsibility,
    confidentiality, protection of intellectual and
    industrial property rights, ...
  • WMA: health related content and services

35
QUATRO project - III
  • QUATRO develops tools to support the exploitation
    of its labels
  • QUATRO proxy server (QUAPRO)
  • Takes as input a URL, either from a search engine
    or a browser, and examines whether there are
    labels inside
  • Parses the label (only for QUATRO-based labels)
    in order to check its validity in terms of
  • the label itself, or
  • the URL's content (in QUATRO, this is restricted
    only to pages with pornographic content)
  • Returns a result on the label's validity (valid,
    invalid, unknown)
  • A browser extension, the Metadata Visualizer
    (ViQ)
  • A search engine wrapper which is a web interface
    displaying annotated search results that link to
    the corresponding labels, the Label Display
    Interface (LADI)

36
QUATRO project - IV: Architecture
[Architecture diagram. Components: ViQ (the browser plug-in), LADI (the search engine data wrapper), QUAPRO (the QUATRO proxy), FilterX (the content analyser), the Web, and the Labelling Authorities' databases, accessed through DAcc (Data ACCess interface). Components communicate over SOAP.]
37
QUATRO project: LADI
38
QUATRO project: ViQ
39
QUATRO project - V
  • QUATRO addresses the needs of both the labeling
    operator and the end-user
  • QUATRO's technology for creating machine
    processable labels represents the first step
    towards a platform that will support the work of
    the labeling operator
  • Technology for automated web content analysis is
    still required

40
MedIEQ Project - I
  • MedIEQ: Quality Labeling of Medical Web content
    using Multilingual Information Extraction
  • EC-funded project
  • DG SANCO (Health & Consumer Protection),
    Directorate C: Public Health and Risk Assessment
  • Public Health Programme, Priority Area 1:
    Health Information, Action 1.5: eHealth
  • Duration: 01/01/2006-01/01/2009

41
MedIEQ Project - II: Project Objectives
  • develop a scheme for the quality labelling of
    health related web content and provide the tools
    supporting the creation, maintenance and access
    of labelling data according to this scheme
  • specify a methodology for the content analysis of
    medical web sites according to the MedIEQ scheme
    and develop the tools that will implement it
  • integrate these technologies into a prototype
    labelling system
  • demonstrate the resulting prototype in 7
    different languages (Spanish, Catalan, German,
    English, Greek, Czech, and Finnish) and two
    labelling applications (third party
    accreditation, classification)

42
MedIEQ Project - III
43
MedIEQ Project - IV: Indicators for measuring
the achievement of objectives
  • Reduction of the manual labelling time
  • Labeling unlabeled sites, monitoring labeled
    sites, ...
  • Effective extraction from large collections of
    medical web content
  • Processing time, precision of extracted data, ...
  • Effort required to customize the system to new
    languages
  • 7 languages to be supported
  • Implementation of an open architecture
  • Effort required to integrate new techniques and
    tools, ...
44
W3C Content Label Incubator Group
  • WCL-XG aims to foster ideas for how content
    providers can inform search engines, aggregators
    and other data systems that
  • their content is of a certain type,
  • fulfils certain criteria or meets given
    requirements
  • Content labels will need to be applicable to a
    resource or a group of resources
  • It should be possible to build systems that in
    some way show the labels to be trustworthy
  • WCL-XG output may be used as input to a working
    group leading to a full W3C Recommendation
  • WCL-XG started its work in February 2006 and is
    scheduled to complete in June

45
Concluding Remarks - I
  • Quality labeling is an application area for
    content analysis, knowledge management, and
    intelligent interfaces
  • Strong need for the development of tools
    assisting the work of labeling authorities
  • Browsers and search engines will have to be
    enriched with functionalities that enable the
    recognition of machine processable labels
    (RDF-CL) and the presentation of their content to
    the end-user

46
Concluding Remarks - II
  • Establishment of quality labels in practice
    cannot be enforced by measures
  • If content providers realize that
  • content labels can be created and added easily
    to their content
  • labeling authorities are equipped with technology
    that facilitates the monitoring of the provided
    web content against the labeling criteria
  • search engines and browsers can inform users on
    the existence of quality labels and their
    features
  • they will adopt machine readable content labeling
    technology
  • leading to the increase of labeled sites
  • improving in turn the quality of the knowledge
    disseminated through the Web
  • What do you think? Is this the right way to
    proceed?

47
Useful Links
  • QUATRO site
  • http://www.quatro-project.org
  • MedIEQ site
  • http://www.medieq.org
  • WCL-XG
  • http://www.w3.org/2005/Incubator/wcl

48
Quality Labelling of Web Content
Vangelis Karkaletsis
Thank you!
3rd IFIP Conference on Artificial Intelligence
Applications & Innovations (AIAI 2006), Athens,
7-9 June 2006