Towards a technical infrastructure for research image data publication


1
Towards a technical infrastructure for research
image data publication
  • Graham Klyne
  • Image Bioinformatics Research Group
  • Zoology Department, University of Oxford

2
Overview
  • Part 1: analysis - data in the scholarly life
    cycle
  • Key concepts
  • Repository survey summary
  • Dealing with data
  • Part 2: synthesis - an architectural framework
  • Publication infrastructure for documents
  • Infrastructure enhancements for data
  • Part 3: development proposals - components
  • Supporting pieces
  • Data web core elements

3
Part 1: Analysis
  • Incorporating images and other data into the life
    cycle of published research

4
Documents, images and data
  • Documents
  • interpreted by humans
  • structure visible but content generally opaque to
    computer software
  • Data
  • processed by computer software
  • not generally for human consumption
  • Images
  • interpretable by humans, sometimes only partially
  • inaccessible to most computer software
  • typically handled as data
  • additional information needed for discovery and
    interpretation

5
Documents, data and metadata
  • Metadata: data about data
  • has machine-processable structure
  • Metadata is clearly distinct from documents and
    images
  • Metadata is not always clearly distinct from
    other data
  • For images, metadata may be particularly
    important to aid discovery, interpretation and
    effective display

6
Types of metadata
  • Generic metadata (sometimes called descriptive
    metadata)
  • common to all kinds of publication, typically as
    Dublin Core
  • Structural metadata
  • file format, image format, etc
  • standards exist
  • Preservation metadata
  • needed for access to content
  • Domain-specific metadata
  • used for discovery or interpretation of content
  • standards not universally available

7
Summary of repository survey
  • Not many image collections in institutional
    repositories
  • One exemplary image collection (SERPENT) was in a
    specialized (non-institutional) repository:
  • http://imageweb.zoo.ox.ac.uk/wiki/index.php/DefiningImageAccess/Repository/SERPENT
  • There is very little domain-specific metadata
    associated with image collections.
  • SERPENT again being an honourable exception
  • Where available, metadata quality is variable
  • May need external annotation to make collections
    useful
  • Even generic metadata is not always well served
  • Deployed institutional repositories seem to be
    primarily set up to deal with papers and theses

8
OAI-PMH and domain-specific metadata
  • We believe domain-specific metadata is important
    for discovery and interpretation of research
    image data
  • OAI-PMH is not a panacea for accessing domain
    metadata
  • administratively difficult to deploy, where possible at all
  • cannot perform discovery based on metadata values
    (the repository community model seems to be to
    use a separate service like OAIster for such
    operations)
  • Domain-specific metadata can reasonably be
    handled as part of the data with modest
    repository support
  • e.g. in ePrints, add a couple of structural
    metadata fields; in Fedora, define an appropriate
    content model
  • external tool support based on OAI-PMH for basic
    access

9
Looking forward: publishing research data
  • Building on repositories for papers and theses,
    we outline a technical vision for data
    publication
  • Examine the framework within which papers are
    published
  • Suggest implications of additionally publishing
    data
  • Consider how the publication framework might be
    adjusted to accommodate these implications
  • Propose a technical architecture for publishing
    papers and data that accommodates these
    adjustments
  • Identify specific free-standing evolutionary
    developments that work toward supporting the
    proposed technical architecture
  • A generic approach to data publication can also
    handle image data, but with greater emphasis on
    metadata.

10
Elements of published-paper-based research
  • Primary observations: new data derived from
    laboratory or field studies.
  • Selection, analysis and argumentation: manual
    interpretation and organization for rhetorical
    effect, which accordingly limits the volume of
    material.
  • Published papers: written materials published via
    journals and the web, all the result of some
    manual interpretation.
  • Preservation: maintain copies of material that
    can be recovered for subsequent analysis or
    publication.
  • Optional for all forms of material, though
    generally an automatic side effect for anything
    published via an institutional repository.

11
Implications of publishing data
  • Greater volume
  • there are no constraints, or radically fewer
    constraints, on the amount of material that can
    be published. Hence less need to select primary
    observations, in contrast to selection for
    inclusion in papers
  • Requirements for automatic processing
  • Computer handling requires that software is
    tailored to the particular data sources, or the
    sources have been aligned with respect to data
    structures and labelling
  • Tool support
  • better computational tools are needed to explore,
    summarize and search volumes of published data

12
Elements of published-data-based research
  • Data publication: expose primary observation data
    to public access
  • Data alignment: adapt published data to some
    common form so that data from different sources
    can be combined
  • Explore, summarize, search: additional tools to
    aid researchers and software applications in
    finding relevant published data

13
Further implications of publishing data
  • Contextual data for interpretation
  • Domain-specific (meta)data may be needed for
    interpretation, especially for images and
    similar data; e.g. bare images are not enough
  • Data and metadata
  • The distinction between data and metadata is
    blurred; some metadata need to be available as
    first-class data objects
  • Primary observations and interpretations
  • A distinction may be seen in published material
    between primary observations and contestable
    interpretations
  • Secondary research from existing primary
    observations
  • Possible multiple interpretations of the same
    primary data suggest a role for provenance
    tracking metadata

14
Part 2: Synthesis
  • Creating a technical architecture for data
    publication

15
Technical requirements for data publication
  • Data publication tools
  • exposing both generic and arbitrary
    domain-specific metadata
  • common access mechanisms for metadata
  • resource location and extraction based on
    metadata
  • Data alignment tools
  • query, combine and align metadata to deliver a
    coordinated view across heterogeneous
    repositories.
  • Data access tools
  • to explore, summarize and search published data
    and metadata
  • External annotation tools
  • allowing multiple interpretations of primary data
    to be captured
  • further tools to support provenance,
    personalization and trust

16
Infrastructure to support data publication:
technical requirements
  • The following slides illustrate how the
    technical elements may be built upon each other,
    followed by more detailed descriptions of those
    elements
  • 1. Present framework elements for online paper
    publication
  • 2. Present framework elements for online data
    publication
  • 3. Proposed framework for publishing data via
    separate repositories
  • 4. Proposed framework for publishing data via
    combined repositories
  • 5. Data web details

17
Present framework for online publication and
access to papers
18
Present framework for online publication and
access to data
19
Proposed framework for publishing data via
separate repositories
20
Proposed framework for publishing data via
combined repositories
21
Data web: more details
22
Framework component descriptions ...
23
Research group data
24
Research group data
  • Research data may be collected and stored locally
    on computers, but is not generally accessible
    outside the group
  • Corresponds to use of computers for data
    collection and analysis
  • Stored locally, not publicly accessible

25
Institutional repository and publisher
repositories
26
Institutional repository
  • Preservation and/or web publication of papers,
    theses, images and/or data, facilities typically
    provided by research institutions
  • Current deployments focus on publishing papers
    and theses
  • Some institutions are also looking to publish
    images and other data sets.

27
Publisher repositories
  • Publishers may publish papers to the web.
  • Publishers may also publish auxiliary data and/or
    other commentary to the web
  • e.g. Nature Publishing Group provides Connotea,
    which directly supports limited third-party
    annotation (tagging) of published articles, and
    can be used by other web software to hold richer
    annotations.
  • Publishers we have spoken to are particularly
    interested in linking journal papers to
    supporting primary observations.

28
Public discovery services
29
Public discovery services
  • An obvious public discovery service is Google
  • Also include services such as Citeseer, ROAR,
    professional society indexing and abstracting
    services, etc.
  • http://imageweb.zoo.ox.ac.uk/wiki/index.php/DefiningImageAccess/Resource/ROAR
  • Also includes JISC services such as OpenDOAR,
    Intute repository search
  • http://imageweb.zoo.ox.ac.uk/wiki/index.php/DefiningImageAccess/Resource/OpenDOAR
  • http://imageweb.zoo.ox.ac.uk/wiki/index.php/DefiningImageAccess/Project/Intute
  • And any third party service that can be used to
    search for online publications or data sets

30
Research group repository and specialized
applications
31
Research group repository
  • A research group repository, or database, is a
    system for publishing research group data (and
    possibly papers, presentations, etc.) to the web.
  • It is provisioned by a research group rather than
    institutionally, and provides timely rather than
    long-lived access to research data.
  • These repositories may be bespoke databases
    and/or web sites.
  • Alternatively, common repository software could
    be used, which might be expected to ease the
    migration of content to an institutional
    repository for preservation and longer term
    publication.

32
Specialized applications
  • For automated access to research group data and
    possibly to public databases, specialized
    applications may be used
  • These are often developed to handle particular
    data for a particular research project, and may
    not be readily repurposed to work with other data.

33
Public databases
34
Public databases
  • A number of public databases are being created.
    For example, in the life science area, there are
    public databases for gene sequences, proteomics,
    model species data, gene ontology, etc. These
    provide for online access to pooled results from
    many research groups in the domains covered,
    which is increasingly important in life science
    research.
  • Similar public databases exist for areas of
    humanities and/or classics research (e.g. Beazley
    archive, LIMC), though these are not always tuned
    for programmatic access to data.
  • These systems typically employ expert curation
    and incur ongoing maintenance.

35
Repository and database adapters
36
Repository and database adapters
  • Access to repository data and metadata is
    currently provided by web browser interfaces or
    harvesting systems with web front-ends (OAIster,
    Intute repository search), which lack a common
    mechanism for programmatic access.
  • We propose that a uniform interface to various
    data sources may be provided by implementing
    adapters that present data mapped to RDF and
    RSS/ATOM data models, and provide query access
    via a SPARQL (semantic web query) endpoint
    service.
  • A secondary function of the repository and
    database adapters may be to serve metadata about
    the data source itself; for example, indicating
    the URI of a resource containing configuration
    data for a browse or search service.
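  • As a sketch of the uniform programmatic access such
    adapters would provide, the following Java fragment
    (using the Apache Jena API; the endpoint URL is a
    hypothetical placeholder, not a deployed adapter)
    queries an adapter's SPARQL endpoint for Dublin Core
    titles:

    import org.apache.jena.query.*;

    public class AdapterQueryDemo {
        public static void main(String[] args) {
            // Hypothetical SPARQL endpoint exposed by a repository adapter.
            String endpoint = "http://repository.example.org/sparql";
            String q =
                "PREFIX dc: <http://purl.org/dc/elements/1.1/>\n" +
                "SELECT ?item ?title WHERE { ?item dc:title ?title } LIMIT 10";
            try (QueryExecution qe = QueryExecutionFactory.sparqlService(
                     endpoint, QueryFactory.create(q))) {
                ResultSet results = qe.execSelect();
                while (results.hasNext()) {
                    QuerySolution row = results.next();
                    System.out.println(
                        row.getResource("item") + "  " + row.getLiteral("title"));
                }
            }
        }
    }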

37
Browse and search services
38
Browse service
  • With uniform programmatic access to multiple data
    sources, a common browser service can provide for
    exploration of diverse data sources.
  • Configuration data, at the level of an RDF schema
    with additional annotations, might be used to
    provide a browsing experience tailored to the
    content of each data source (e.g. mSpace,
    Fresnel).
  • Note en passant: Karger and Schraefel's 'pathetic
    fallacy'
  • This will not replace the need for
    domain-specific user interfaces for viewing and
    interacting with data sources, but may provide a
    starting point for learning about new data
    sources.
  • It may be that generic user interface tools based
    on Semantic Lens ideas can provide
    application-specific user interfaces, but a
    generic browser will likely be needed to support
    configuration of such tools for new data sources.

39
Search service
  • Complementing the generic browse service, this
    service allows unstructured, semi-structured or
    fully-structured search for data in an arbitrary
    data source.
  • A simple search may locate keywords in
    designated parts of some structured data,
    applying purely syntactic criteria.
  • A more advanced search service may use taxonomic
    and other semantic information to expand or
    restrict a search (e.g. a search for images of
    'rodents' can return images described as being of
    'mice').
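  • A minimal sketch of such taxonomic expansion (Java
    with Apache Jena; the vocabulary and data are
    invented, and the SPARQL 1.1 property path used here
    postdates this work, where RDFS inference would have
    served instead): a query for images of rodents
    matches an image described as depicting a mouse.

    import org.apache.jena.query.*;
    import org.apache.jena.rdf.model.*;
    import org.apache.jena.vocabulary.RDFS;

    public class TaxonomicSearchDemo {
        public static void main(String[] args) {
            String EX = "http://example.org/terms#";
            Model m = ModelFactory.createDefaultModel();
            // Toy taxonomy and image description: a mouse is a rodent.
            m.add(m.createResource(EX + "Mouse"), RDFS.subClassOf,
                  m.createResource(EX + "Rodent"));
            m.add(m.createResource(EX + "img1"),
                  m.createProperty(EX, "depicts"),
                  m.createResource(EX + "Mouse"));
            // Search for images of rodents, expanded down the taxonomy.
            String q =
                "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n" +
                "PREFIX ex: <http://example.org/terms#>\n" +
                "SELECT ?img WHERE { ?taxon rdfs:subClassOf* ex:Rodent .\n" +
                "                    ?img ex:depicts ?taxon }";
            try (QueryExecution qe =
                     QueryExecutionFactory.create(QueryFactory.create(q), m)) {
                qe.execSelect().forEachRemaining(
                    row -> System.out.println(row.getResource("img"))); // ex:img1
            }
        }
    }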

40
Annotation and user registration services
41
Annotation services
  • With publication of primary observations comes
    the possibility for different researchers to
    analyze them and draw different conclusions.
    Annotation services provide mechanisms for such
    alternative analyses to be published and made
    available in a common framework.
  • Annotations may consist of simple tagging (e.g.
    Flickr, del.icio.us, Connotea), or more complex
    semantic descriptions (e.g. Annotea).
  • If the published information is viewed as an
    RDF-style graph (nodes denoting entities linked
    by edges denoting relationships between them),
    then annotations provide a means to describe or
    deduce new links between the nodes in a graph, or
    between nodes in separately published graphs.
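  • A minimal sketch of an annotation as new graph links
    (Java with Jena, loosely modelled on the Annotea
    vocabulary; the target URI and comment text are
    invented for illustration):

    import org.apache.jena.rdf.model.*;
    import org.apache.jena.vocabulary.DC_11;

    public class AnnotationDemo {
        public static void main(String[] args) {
            String ANN = "http://www.w3.org/2000/10/annotation-ns#";
            Model m = ModelFactory.createDefaultModel();
            Resource target = m.createResource(
                "http://repository.example.org/image/42"); // hypothetical image
            // The annotation is itself a node, linked to what it describes.
            m.createResource()
             .addProperty(m.createProperty(ANN, "annotates"), target)
             .addProperty(m.createProperty(ANN, "body"),
                          "Specimen appears to be juvenile, not adult.")
             .addProperty(DC_11.creator, "researcher-001");
            m.write(System.out, "TURTLE"); // publishable annotation graph
        }
    }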

42
User registration, personalization, provenance
and trust services
  • Roughly: metadata about users
  • The notion of user services is motivated
    particularly by third party annotations
  • The provenance of an annotation will
    substantially affect its credibility for other
    researchers. The system must be able to record
    disagreement, and provide principled ways to
    resolve conflicts in a way that is appropriate to
    each user's needs and preferences.
  • Provenance may extend to personalized trust
    analysis, avoiding the need for determination of
    a universal truth in the face of differing
    opinions and conflicting data.
  • Personalization services may also be accessed by
    browse, search and other services, allowing
    different users to apply different options and
    views when accessing data.
  • Should link to existing authentication and
    authorization systems.

43
Secondary and domain-specific services
44
Secondary and domain-specific services
  • A goal of this technical architecture is to
    provide a uniform platform for deploying
    secondary, application- and domain-specific
    services, with common programmatic access and
    data formats.
  • Domain-specific services fill the role previously
    served by specialized applications for research data.
  • Secondary and domain-specific services work with
    standardized data formats, and hence are more
    readily applied to new data sources with similar
    processing requirements.
  • Application-specific data analysis, data
    processing workflows, visualization services and
    mashups are possible secondary services.

45
Data web: alignment and query handler
46
Data web: alignment and query handler
  • Although the various data sources are presented
    in a common format using adapters, no attempt is
    made to enforce common schema or labels for data
    on the various sources.
  • Different sources may be describing the same
    objects or relationships using different labels.
  • If common schema and naming standards happen to
    be used, then linkages between different sources
    may be beneficially exploited without a data web.
    But we do not assume general use of such common
    standards.
  • The data handling services described previously
    would typically be used separately with each
    source, with the user left to determine linkages
    between information.
  • A data web is a domain- or community-specific
    aggregation of data sources, providing a
    coordinated view with common core schema and
    instance identifiers applied across all
    participating data sources.

47
Data web: more details
48
Data web overview
  • Data sources are subscribed to a data web by
    submission of schema mapping and identifier
    mapping details to the schema alignment service
    and coreference service respectively.
  • Data alignment builds upon schema and ontology
    alignment research (e.g. Halevy, Doerr), but is
    applied to an open-world web environment.
  • The query handler component may be stateless,
    allowing replication for arbitrary scaling if
    complex alignment computations are needed.

49
Schema alignment registry service
  • A registry of information about various data
    sources, describing how schema elements from each
    are mapped to a data web core schema.
  • Adopts a modular, rule-based approach to schema
    mapping (e.g. drawing upon schema alignment
    language work by Martin Doerr).
  • Schema mapping may be performed as a service, or
    by a calling application using information
    provided by this service - this is a design area
    needing further investigation, particularly with
    respect to its impact on scalability.
  • The schema alignment service also serves as a
    registry of metadata schemas supported by the
    various data sources, information that may be
    used in planning distributed queries.
  • Schema use and alignment information is provided
    by subscribing a data source to a data web. In
    principle, any party can perform such a
    subscription, allowing for open-ended data webs.
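  • One plausible encoding of such a modular mapping rule
    is a SPARQL CONSTRUCT query that rewrites a source
    schema term into the data web core schema. The sketch
    below (Java with Jena; both vocabularies are invented)
    maps a source property src:species onto a hypothetical
    core property core:taxon:

    import org.apache.jena.query.*;
    import org.apache.jena.rdf.model.*;

    public class SchemaMappingDemo {
        public static void main(String[] args) {
            Model source = ModelFactory.createDefaultModel();
            source.add(
                source.createResource("http://source.example.org/rec/1"),
                source.createProperty("http://source.example.org/schema#",
                                      "species"),
                "Drosophila melanogaster");
            // A mapping rule of the kind lodged with the registry,
            // expressed as a SPARQL CONSTRUCT rewrite.
            String rule =
                "PREFIX src:  <http://source.example.org/schema#>\n" +
                "PREFIX core: <http://dataweb.example.org/core#>\n" +
                "CONSTRUCT { ?s core:taxon ?o } WHERE { ?s src:species ?o }";
            Model aligned = QueryExecutionFactory
                .create(QueryFactory.create(rule), source).execConstruct();
            aligned.write(System.out, "TURTLE"); // data in core schema terms
        }
    }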

50
Coreference service
  • Separates instance-level alignment from
    schema-level data alignment services.
  • Holds a registry of information about instance
    identifiers used by various subscribed data
    sources, describing how identifiers from each are
    mapped to and from a common identifier form.
  • Initial implementations will probably use a
    simple identifier-translation approach, with
    possibilities to adopt richer inference-based
    approaches. Details of the identifier mapping
    are abstracted by the service interface.
  • Identifier mapping may be performed as a service,
    or by a calling application using information
    provided by this service - detailed designs are
    yet to be researched.
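  • A minimal sketch of the simple identifier-translation
    approach (plain Java; the identifier schemes and
    mappings are invented for illustration):

    import java.util.Map;

    public class CoreferenceDemo {
        // Source-local identifiers mapped to common data web URIs; in a
        // real service this registry is populated at subscription time.
        private static final Map<String, String> TO_CORE = Map.of(
            "http://flybase.example.org/gene/dpp-1",
            "http://dataweb.example.org/id/gene/dpp",
            "http://flymine.example.org/object/77141",
            "http://dataweb.example.org/id/gene/dpp");

        // Translate a source identifier to its common form, if registered.
        public static String toCore(String sourceId) {
            return TO_CORE.getOrDefault(sourceId, sourceId);
        }

        public static void main(String[] args) {
            System.out.println(toCore("http://flymine.example.org/object/77141"));
        }
    }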

51
Distributed query handler
  • Accepts queries expressed using a data web core
    schema and instance identifiers
  • Translates submitted queries into subsidiary
    queries across known data sources, using
    information provided by the schema alignment and
    coreference services.
  • Results are translated back to data web core
    schema and instance identifiers.
  • In principle, query distribution and schema
    mapping functions can be implemented and tested
    separately; interactions will need to be
    analyzed when they are deployed together.
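  • A deliberately naive sketch of the distribution step
    (Java with Jena; the endpoint URLs are invented): the
    core-schema query is sent to each known source and
    the result bindings are pooled. A real handler would
    first rewrite the query per source and plan
    subqueries, as DARQ does.

    import java.util.*;
    import org.apache.jena.query.*;

    public class DistributedQueryDemo {
        public static void main(String[] args) {
            List<String> endpoints = List.of( // hypothetical adapter endpoints
                "http://flybase-adapter.example.org/sparql",
                "http://eprints-adapter.example.org/sparql");
            String coreQuery =
                "PREFIX core: <http://dataweb.example.org/core#>\n" +
                "SELECT ?s ?taxon WHERE { ?s core:taxon ?taxon }";
            List<QuerySolution> pooled = new ArrayList<>();
            for (String ep : endpoints) {
                // In a full data web the query would be rewritten into each
                // source's own schema via the alignment registry first.
                try (QueryExecution qe = QueryExecutionFactory.sparqlService(
                         ep, QueryFactory.create(coreQuery))) {
                    qe.execSelect().forEachRemaining(pooled::add);
                }
            }
            pooled.forEach(System.out::println); // merged result set
        }
    }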

52
Role of technical architecture: design proposals
  • The foregoing architecture description is not of
    itself a proposed development, but describes an
    idealized framework upon which developments may
    be based
  • In the next part, some possible elements of a
    development plan are suggested, based on the
    idealized framework described here

53
Part 3: Development plan
  • Identifying independently deployable components
    that make up the proposed technical architecture
    - a plan for incremental deployment

54
Components for publishing and accessing data
  • From the idealized technical infrastructure we
    can identify a number of potential free-standing
    developments that are separately deployable and
    can provide immediate benefits independently of
    the rest of the framework
  • Planning development around loosely coupled,
    free-standing components greatly reduces overall
    project development risk as, in most cases,
    failure to effectively realize a single component
    does not frustrate the function or utility of
    other components.
  • Many components have well understood designs.
    For most of the more speculative elements,
    restricted initial designs are envisaged.
  • By identifying free-standing tools that can be
    linked together in a web environment, we make it
    easier to exploit existing, independently-developed
    tools from diverse research and development
    communities.

55
Publishing research group data
56
Using repository software to publish research
group data
  • Domain-specific metadata packaging (e.g. metadata
    terms for metadata streams, or MPEG-21 style
    encapsulation).
  • Domain-specific metadata access, assuming that
    this will be packaged as a distinguished data
    stream.
  • Image and metadata presentation tools. XSLT
    style sheets used for the SERPENT repository are
    a possible starting point for this.
  • We are currently undertaking a proof-of-concept
    deployment for Drosophila in situ gene expression
    images with associated annotations.
  • This work builds upon existing repository
    software, but we are not aware of any available
    tools to capture specific domain-specific data
    and metadata into a repository.
  • Encouraging a move away from ad hoc publishing
    tools towards a common framework for meeting
    common requirements.

57
Repository ingress tools
  • Generic and domain-specific tools for collecting
    primary observations into a research group
    repository.
  • Our survey and other observers have noted that
    one of the greatest hurdles to data sharing is
    acquisition of adequate source metadata.
  • Web forms are an obvious mode of data submission
    that should be provided, but we have observed
    (cf. FlyTed, NERC workshop) that the limitations
    of browser style interfaces mean that web forms
    are awkward for bulk data entry.
  • Many researchers use spreadsheets to record and
    organize primary observations, and are very
    comfortable with the user interface of popular
    spreadsheet software. Tools for generating
    spreadsheet templates from a repository, and
    subsequently importing spreadsheet data via web
    upload are an option for gathering primary
    observations.
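  • A minimal sketch of the spreadsheet-import path (Java
    with Jena; the column layout and vocabulary are
    invented, and a real importer would use a proper CSV
    or spreadsheet parsing library):

    import java.io.IOException;
    import java.nio.file.*;
    import org.apache.jena.rdf.model.*;
    import org.apache.jena.vocabulary.DC_11;

    public class SpreadsheetIngestDemo {
        public static void main(String[] args) throws IOException {
            // Assumed template columns: image-id, title, species
            Model m = ModelFactory.createDefaultModel();
            Property species =
                m.createProperty("http://example.org/terms#", "species");
            for (String line : Files.readAllLines(Path.of("observations.csv"))) {
                String[] f = line.split(","); // naive split: no quoted fields
                if (f.length < 3 || f[0].equals("image-id")) continue; // header
                m.createResource("http://repo.example.org/image/" + f[0].trim())
                 .addProperty(DC_11.title, f[1].trim())
                 .addProperty(species, f[2].trim());
            }
            m.write(System.out, "TURTLE"); // ready for repository upload
        }
    }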

58
Migration from research group to institutional
repository
  • Tools for migrating images and metadata from a
    research group repository to an institutional
    repository.
  • If the same repository software system is used
    for each, the migration should be fairly
    trivial, the main requirement being to provide
    for addition of metadata required to satisfy
    institutional repository policies.
  • If different repository systems are used, some
    repackaging of data and metadata may be required.
    Future adoption of emerging ORE standards across
    different repository systems will hopefully ease
    this migration.

59
Repository and database adapters
60
Repository and database adapters
  • Our goal is SPARQL access to research data
    repositories. This is something of an open-ended
    development, as any new data source may need a
    new adapter, but initial targets would include:
  • an OAI-PMH repository adapter
  • an ePrints software repository adapter
  • public database adapters for FlyBase, FlyMine,
    Gene Ontology and proteomics data
  • For relational database data sources, D2R Server
    may be used.
  • For XML data sources, a SPARQL-to-XQuery rewriter
    could be an option. We are not currently aware
    of any such tool, but it seems possible that one
    may be in development somewhere.
  • Generation of RSS/ATOM data feeds is another
    goal.
  • SPARQL-to-RSS/ATOM and RSS/ATOM-to-SPARQL
    adapters?
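  • As a sketch of the SPARQL-to-ATOM direction (Java with
    Jena; the endpoint is hypothetical, and required Atom
    elements such as author and updated, plus XML escaping,
    are omitted), an adapter could run a query and emit
    each result binding as a feed entry:

    import org.apache.jena.query.*;

    public class SparqlToAtomDemo {
        public static void main(String[] args) {
            String endpoint = "http://repository.example.org/sparql";
            String q =
                "PREFIX dc: <http://purl.org/dc/elements/1.1/>\n" +
                "SELECT ?item ?title WHERE { ?item dc:title ?title } LIMIT 10";
            StringBuilder feed = new StringBuilder(
                "<feed xmlns=\"http://www.w3.org/2005/Atom\">\n" +
                "  <title>Recent repository items</title>\n");
            try (QueryExecution qe = QueryExecutionFactory.sparqlService(
                     endpoint, QueryFactory.create(q))) {
                qe.execSelect().forEachRemaining(row -> feed
                    .append("  <entry>\n    <id>")
                    .append(row.getResource("item"))
                    .append("</id>\n    <title>")
                    .append(row.getLiteral("title").getString())
                    .append("</title>\n  </entry>\n"));
            }
            System.out.println(feed.append("</feed>"));
        }
    }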

61
OAI-PMH repository adapter
  • A generic local OAI-PMH metadata harvester and
    indexer coupled to a SPARQL endpoint allowing
    queries over metadata values.
  • Additional logic to access domain-specific
    metadata.
  • Possibly based on the Joseki system with custom
    software to harvest metadata into a local Jena
    model.
  • This could be used with local and institutional
    repositories.
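  • A minimal harvesting sketch (Java with Jena and the
    JDK XML parser; the repository base URL is
    hypothetical, and resumption-token paging and error
    handling are omitted) that pulls oai_dc records into
    a local Jena model of the kind a Joseki/SPARQL
    endpoint could then serve:

    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.*;
    import org.apache.jena.rdf.model.*;
    import org.apache.jena.vocabulary.DC_11;

    public class OaiHarvestDemo {
        public static void main(String[] args) throws Exception {
            String url = "http://repository.example.org/oai"
                       + "?verb=ListRecords&metadataPrefix=oai_dc";
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            Document doc = dbf.newDocumentBuilder().parse(url);
            String OAI = "http://www.openarchives.org/OAI/2.0/";
            String DC  = "http://purl.org/dc/elements/1.1/";
            Model m = ModelFactory.createDefaultModel();
            NodeList records = doc.getElementsByTagNameNS(OAI, "record");
            for (int i = 0; i < records.getLength(); i++) {
                Element rec = (Element) records.item(i);
                String id = rec.getElementsByTagNameNS(OAI, "identifier")
                               .item(0).getTextContent(); // e.g. oai:repo:42
                Resource item = m.createResource(id);
                NodeList titles = rec.getElementsByTagNameNS(DC, "title");
                for (int j = 0; j < titles.getLength(); j++)
                    item.addProperty(DC_11.title,
                                     titles.item(j).getTextContent());
            }
            m.write(System.out, "TURTLE");
        }
    }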

62
ePrints software repository adapter
  • Southampton ECS have expressed interest in
    cooperating in development of a SPARQL adapter
    for their ePrints software.
  • A direct API to the underlying data and
    metadata store may avoid the need for OAI-PMH
    harvesting.
  • Further exploration needed.

63
Browse and search services
64
Browse service
  • A generic facility to browse SPARQL endpoint data
  • Based on Southampton mSpace software, with
    possible additional tools to generate mSpace
    browse configuration.
  • Collaborate with Southampton team to refine
    browse facilities for bioinformatics data.

65
Search service
  • Metadata harvesting and indexing from selected
    sources to provide search facilities
  • Simple keyword search (Google-like)
  • Semi-structured keyword search: find keyword
    within indicated ontological structure
  • Ontologically guided search expansion and/or
    contraction
  • We are not aware of tools for this - further
    survey work may be needed

66
User and annotation services
67
User registration, personalization and provenance
services
  • Initially, a simple user registration service
    returning a token that anonymously identifies a
    session with a user.
  • Add queryable attributes that services may use
    for personalization, with a user interface to
    view/edit attribute values.
  • Add service to create provenance token that binds
    a user identity with other queryable contextual
    information.
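  • A minimal sketch of the first step (plain Java; the
    attribute names are invented): issue an opaque token
    per session and attach queryable attributes to it.

    import java.util.*;

    public class UserRegistrationDemo {
        // Token -> attributes; services query these for personalization.
        private static final Map<String, Map<String, String>> sessions =
            new HashMap<>();

        // Register a session: the returned token identifies it anonymously.
        public static String register(String displayName) {
            String token = UUID.randomUUID().toString();
            Map<String, String> attrs = new HashMap<>();
            attrs.put("displayName", displayName); // example attribute
            sessions.put(token, attrs);
            return token;
        }

        public static String getAttribute(String token, String key) {
            Map<String, String> attrs = sessions.get(token);
            return attrs == null ? null : attrs.get(key);
        }

        public static void main(String[] args) {
            String t = register("anon-researcher");
            System.out.println(t + " -> " + getAttribute(t, "displayName"));
        }
    }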

68
Annotation service
  • Provision for third-party annotations to be
    attached to any existing data.
  • Annotations may be simple tagging,
    semi-structured or fully structured. Existing
    work includes:
  • Connotea - simple tagging
  • Rich Tags project (Southampton) - semi-structured
    tagging, with possibilities to identify emergent
    ontologies.
  • Annotea - fully structured (?)
  • Annotations are associated with provenance data
    so that users can decide which annotations to
    include in their combined data.

69
Trust services
  • Extension of provenance and personalization
    framework allowing trust decisions (primarily
    relating to annotations, but also for other data
    sources) to be based on a combination of personal
    preferences and community recommendation and
    reputation values.
  • Possibly using elements of SECURE project trust
    evaluation core services?

70
Data web services
71
Data web
  • Schema alignment registry service
  • Coreference service
  • Distributed query handler
  • Query segmentation and distribution
  • Query rewriting

72
Data web schema alignment registry service
  • One registry service per data web, seeded with
    core ontology for a data web.
  • Data sources are subscribed, lodging
    information about supported schemas and rules
    for mapping these to a common core.
  • The form of mapping rules is to be determined,
    but will draw upon a body of existing research
    into database schema alignment (Doerr, Halevy,
    etc.)
  • Mapping may be performed by the schema alignment
    service, or by the query handler service, or
    both. There are engineering trade-offs to be
    explored here.

73
Data web coreference service
  • Maybe one coreference service per data web, or
    maybe one service can support multiple data webs
    - to be determined.
  • Instance identifier schemes are provided when a
    data source is subscribed to a data web, and
    rules for transformation to other identifier
    schemes for the same class of entities.
  • Mapping may involve manipulation of URI strings,
    translation table lookup or other mechanisms to
    be determined.
  • Early thoughts are that identifier mapping will
    be handled internally by the coreference service
    in response to specific identifier queries.

74
Data web distributed query
  • Part of the data web core aggregator service.
  • Separates incoming queries into queries of
    different data sources, based on schema usage
    information lodged with the schema alignment
    registry service. Combines results from these
    multiple queries into a single result set.
  • May be implemented separately from addressing
    schema alignment.
  • We plan to use existing work on DARQ, which is
    itself based on the ARQ query engine in the
    Jena/Joseki framework. The ARQ framework is
    well-suited to query analysis and reformulation,
    being designed in part to support query
    optimization.

75
Data web query rewriting
  • Part of the data web core aggregator service,
    building upon the distributed query handler.
  • Maps incoming queries framed using core schema
    elements to use elements from individual data
    sources, based on mapping information lodged with
    the schema alignment registry service. Maps
    query results back to the core schema.
  • The design of this element will be highly
    experimental, and will drive aspects of the
    schema alignment service. There is a body of
    schema alignment research upon which we can draw.

76
Conclusions
77
Conclusions
  • Publishing raw observations alongside paper-based
    argumentation adds a new dimension to the
    information publication landscape, increasing
    scope for independent interpretation and
    secondary research.
  • Better computational tools are needed to explore,
    summarize and search large volumes of published
    data.
  • Semantic web standards provide a common
    structural framework for data publication for
    which generic tools can be developed and
    deployed.
  • Diverse research groups are not expected to use
    common terms and schemas for their data, so to
    usefully combine such data will require some
    element of alignment.

78
Conclusions
  • The proposed technical architecture has been
    developed through consideration of the preceding
    points.
  • The proposed technical architecture is not itself
    offered as an appropriate development plan, as
    the design of certain elements is currently the
    subject of ongoing research, and draws upon
    skills difficult to find in a single group.
  • Using the web as a platform for composing
    services, we propose a number of free-standing
    components that may be separately developed and
    tested, and separately evolved and improved even
    as they are being used in concert.

79
End