Title: Towards a technical infrastructure for research image data publication
1. Towards a technical infrastructure for research image data publication
- Graham Klyne
- Image Bioinformatics Research Group
- Zoology Department, University of Oxford
2. Overview
- Part 1: analysis - data in the scholarly life cycle
  - Key concepts
  - Repository survey summary
  - Dealing with data
- Part 2: synthesis - an architectural framework
  - Publication infrastructure for documents
  - Infrastructure enhancements for data
- Part 3: development proposals - components
  - Supporting pieces
  - Data web core elements
3. Part 1: Analysis
- Incorporating images and other data into the life cycle of published research
4. Documents, images and data
- Documents
  - interpreted by humans
  - structure visible, but content generally opaque to computer software
- Data
  - processed by computer software
  - not generally for human consumption
- Images
  - interpretable by humans, sometimes only partially
  - inaccessible to most computer software
  - typically handled as data
  - additional information needed for discovery and interpretation
5. Documents, data and metadata
- Metadata: data about data
  - has machine-processable structure
- Metadata is clearly distinct from documents and images
- Metadata is not always clearly distinct from other data
- For images, metadata may be particularly important to aid discovery, interpretation and effective display
6. Types of metadata
- Generic metadata (sometimes called descriptive metadata)
  - common to all kinds of publication, typically expressed as Dublin Core (see the sketch below)
- Structural metadata
  - file format, image format, etc.
  - standards exist
- Preservation metadata
  - needed for access to content
- Domain-specific metadata
  - used for discovery or interpretation of content
  - standards not universally available
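As an illustration of how generic and domain-specific metadata can coexist on one image record (a minimal sketch in Python with rdflib; the `http://example.org/fly/` vocabulary and property names are hypothetical, not from the slides):

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DC

# Hypothetical domain-specific vocabulary for gene expression images
FLY = Namespace("http://example.org/fly/")

g = Graph()
img = URIRef("http://example.org/images/insitu-0001")

# Generic (descriptive) metadata, using Dublin Core
g.add((img, DC.title, Literal("In situ expression image 0001")))
g.add((img, DC.creator, Literal("Image Bioinformatics Research Group")))

# Domain-specific metadata, using a project vocabulary
g.add((img, FLY.gene, Literal("CG12345")))
g.add((img, FLY.developmentalStage, Literal("embryo stage 11")))

print(g.serialize(format="turtle"))
```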
7. Summary of repository survey
- Not many image collections in institutional repositories
- One exemplary image collection (SERPENT) was in a specialized (non-institutional) repository (ref: SERPENT page)
  - http://imageweb.zoo.ox.ac.uk/wiki/index.php/DefiningImageAccess/Repository/SERPENT
- There is very little domain-specific metadata associated with image collections
  - SERPENT again being an honourable exception
- Where available, metadata quality is variable
  - may need external annotation to make collections useful
- Even generic metadata is not always well served
- Deployed institutional repositories seem to be primarily set up to deal with papers and theses
8. OAI-PMH and domain-specific metadata
- We believe domain-specific metadata is important for discovery and interpretation of research image data
- OAI-PMH is not a panacea for accessing domain metadata
  - administratively difficult to deploy, where deployment is possible at all
  - cannot perform discovery based on metadata values (the repository community model seems to be to use a separate service like OAIster for such operations)
- Domain-specific metadata can reasonably be handled as part of the data, with modest repository support
  - e.g. ePrints adds a couple of structural metadata fields; Fedora defines an appropriate content model
  - external tool support based on OAI-PMH for basic access
9. Looking forward: publishing research data
- Building on repositories for papers and theses, we outline a technical vision for data publication:
  - examine the framework within which papers are published
  - suggest implications of additionally publishing data
  - consider how the publication framework might be adjusted to accommodate these implications
  - propose a technical architecture for publishing papers and data that accommodates these adjustments
  - identify specific free-standing evolutionary developments that work toward supporting the proposed technical architecture
- A generic approach to data publication can also handle image data, but with greater emphasis on metadata.
10. Elements of published-paper-based research
- Primary observations: new data derived from laboratory or field studies.
- Selection, analysis and argumentation: manual interpretation and organization for rhetorical effect, which accordingly limits the volume of material.
- Published papers: written materials published via journals and the web, all the result of some manual interpretation.
- Preservation: maintaining copies of material that can be recovered for subsequent analysis or publication.
  - Optional for all forms of material, though generally an automatic side effect for anything published via an institutional repository.
11. Implications of publishing data
- Greater volume
  - there are no constraints, or radically fewer constraints, on the amount of material that can be published; hence less need to select primary observations, in contrast to selection for inclusion in papers
- Requirements for automatic processing
  - computer handling requires that software is tailored to the particular data sources, or that the sources have been aligned with respect to data structures and labelling
- Tool support
  - better computational tools are needed to explore, summarize and search large volumes of published data
12. Elements of published-data-based research
- Data publication: expose primary observation data to public access
- Data alignment: adapt published data to some common form so that data from different sources can be combined
- Explore, summarize, search: additional tools to aid researchers and software applications in finding relevant published data
13. Further implications of publishing data
- Contextual data for interpretation
  - domain-specific (meta)data may be needed for interpretation, especially for images and similar material: bare images are not enough
- Data and metadata
  - the distinction between data and metadata is blurred: some metadata needs to be available as first-class data objects
- Primary observations and interpretations
  - a distinction may be seen in published material between primary observations and contestable interpretations
  - secondary research from existing primary observations
  - possible multiple interpretations of the same primary data suggest a role for provenance-tracking metadata
14. Part 2: Synthesis
- Creating a technical architecture for data publication
15. Technical requirements for data publication
- Data publication tools
  - exposing both generic and arbitrary domain-specific metadata
  - common access mechanisms for metadata
  - resource location and extraction based on metadata
- Data alignment tools
  - query, combine and align metadata to deliver a coordinated view across heterogeneous repositories
- Data access tools
  - to explore, summarize and search published data and metadata
- External annotation tools
  - allowing multiple interpretations of primary data to be captured
  - further tools to support provenance, personalization and trust
16. Infrastructure to support data publication: technical requirements
- The following slides illustrate how the technical elements may be built upon each other, followed by more detailed descriptions of those elements:
  1. Present framework elements for online paper publication
  2. Present framework elements for online data publication
  3. Proposed framework for publishing data via separate repositories
  4. Proposed framework for publishing data via combined repositories
  5. Data web details
17. Present framework for online publication and access to papers
18. Present framework for online publication and access to data
19. Proposed framework for publishing data via separate repositories
20. Proposed framework for publishing data via combined repositories
21. Data web: more details
22. Framework component descriptions ...
23. Research group data
24. Research group data
- Research data may be collected and stored locally on computers, but is not generally accessible outside the group
- Corresponds to use of computers for data collection and analysis
- Stored locally, not publicly accessible
25. Institutional repository and publisher repositories
26. Institutional repository
- Preservation and/or web publication of papers, theses, images and/or data; facilities typically provided by research institutions
- Current deployments focus on publishing papers and theses
- Some institutions are also looking to publish images and other data sets.
27. Publisher repositories
- Publishers may publish papers to the web.
- Publishers may also publish auxiliary data and/or other commentary to the web
  - e.g. Nature Publishing Group provides Connotea, which directly supports limited third-party annotation (tagging) of published articles, and can be used by other web software to hold richer annotations.
- Publishers we have spoken to are particularly interested in linking journal papers to supporting primary observations.
28. Public discovery services
29. Public discovery services
- An obvious public discovery service is Google
- Also included are services such as Citeseer, ROAR, professional society indexing and abstracting services, etc.
  - http://imageweb.zoo.ox.ac.uk/wiki/index.php/DefiningImageAccess/Resource/ROAR
- Also included are JISC services such as OpenDOAR and Intute repository search
  - http://imageweb.zoo.ox.ac.uk/wiki/index.php/DefiningImageAccess/Resource/OpenDOAR
  - http://imageweb.zoo.ox.ac.uk/wiki/index.php/DefiningImageAccess/Project/Intute
- And any third-party service that can be used to search for online publications or data sets
30. Research group repository and specialized applications
31. Research group repository
- A research group repository, or database, is a system for publishing research group data (and possibly papers, presentations, etc.) to the web.
- It is provisioned by a research group rather than institutionally, and provides timely rather than long-lived access to research data.
- These repositories may be bespoke databases and/or web sites.
- Alternatively, common repository software could be used, which might be expected to ease the migration of content to an institutional repository for preservation and longer-term publication.
32. Specialized applications
- For automated access to research group data, and possibly to public databases, specialized applications may be used.
- These are often developed to handle particular data for a particular research project, and may not be readily repurposed to work with other data.
33. Public databases
34. Public databases
- A number of public databases are being created. For example, in the life science area there are public databases for gene sequences, proteomics, model species data, the gene ontology, etc. These provide online access to pooled results from many research groups in the domains covered, which is increasingly important in life science research.
- Similar public databases exist for areas of humanities and/or classics research (e.g. the Beazley Archive, LIMC), though these are not always tuned for programmatic access to data.
- These systems typically employ expert curation and incur ongoing maintenance costs.
35. Repository and database adapters
36. Repository and database adapters
- Access to repository data and metadata is currently provided by web browser interfaces, or by harvesting systems with web front-ends (OAIster, Intute repository search); these lack a common mechanism for programmatic access.
- We propose that a uniform interface to various data sources may be provided by implementing adapters that present data mapped to RDF and RSS/Atom data models, and provide query access via a SPARQL (semantic web query) endpoint service (see the query sketch below).
- A secondary function of the repository and database adapters may be to serve metadata about the data source itself, for example indicating the URI of a resource containing configuration data for a browse or search service.
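To make the proposed programmatic access concrete, here is a minimal sketch of a client querying an adapter's SPARQL endpoint (the endpoint URL is hypothetical; SPARQLWrapper is one commonly used client library):

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Hypothetical SPARQL endpoint exposed by a repository adapter
ENDPOINT = "http://example.org/repository/sparql"

sparql = SPARQLWrapper(ENDPOINT)
sparql.setQuery("""
    PREFIX dc: <http://purl.org/dc/elements/1.1/>
    SELECT ?image ?title WHERE {
        ?image dc:title ?title .
    } LIMIT 10
""")
sparql.setReturnFormat(JSON)

results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["image"]["value"], "-", row["title"]["value"])
```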
37. Browse and search services
38. Browse service
- With uniform programmatic access to multiple data sources, a common browser service can provide for exploration of diverse data sources.
- Configuration data, at the level of an RDF schema with additional annotations, might be used to provide a browsing experience tailored to the content of each data source (e.g. mSpace, Fresnel).
  - Note en passant: Karger and Schraefel's "pathetic fallacy"
- This will not replace the need for domain-specific user interfaces for viewing and interacting with data sources, but may provide a starting point for learning about new data sources.
- It may be that generic user interface tools based on Semantic Lens ideas can provide application-specific user interfaces, but a generic browser will likely be needed to support configuration of such tools for new data sources.
39. Search service
- Complementing the generic browse service, this service allows unstructured, semi-structured or fully-structured search for data in an arbitrary data source.
- A simple search may simply locate keywords in designated parts of some structured data, applying purely syntactic criteria.
- A more advanced search service may use taxonomic and other semantic information to expand or restrict a search (e.g. a search for images of rodents can return images described as being of mice; see the sketch below).
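A minimal sketch of ontology-guided expansion, using an in-memory rdflib graph (the taxonomy, image annotations and `ex:` vocabulary are hypothetical): the `rdfs:subClassOf*` property path lets a query for rodent images also match images annotated with the subclass mouse:

```python
from rdflib import Graph

g = Graph()
# Hypothetical taxonomy and image annotations, inline for illustration
g.parse(data="""
    @prefix ex:   <http://example.org/> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

    ex:Mouse  rdfs:subClassOf ex:Rodent .
    ex:Rodent rdfs:subClassOf ex:Mammal .

    ex:img1 ex:depicts ex:Mouse .
    ex:img2 ex:depicts ex:Elephant .
""", format="turtle")

# The rdfs:subClassOf* property path expands the search down the taxonomy
results = g.query("""
    PREFIX ex:   <http://example.org/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?image WHERE {
        ?image ex:depicts ?cls .
        ?cls rdfs:subClassOf* ex:Rodent .
    }
""")
for row in results:
    print(row.image)   # only ex:img1 matches
```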
40. Annotation and user registration services
41. Annotation services
- With publication of primary observations comes the possibility for different researchers to analyze them and draw different conclusions. Annotation services provide mechanisms for such alternative analyses to be published and made available in a common framework.
- Annotations may consist of simple tagging (e.g. Flickr, del.icio.us, Connotea) or more complex semantic descriptions (e.g. Annotea).
- If the published information is viewed as an RDF-style graph (nodes denoting entities, linked by edges denoting relationships between them), then annotations provide a means to describe or deduce new links between the nodes in a graph, or between nodes in separately published graphs (see the sketch below).
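A minimal sketch of that graph view (hypothetical URIs and annotation vocabulary): a third-party annotation introduces a new link between two independently published nodes, and carries its own provenance statement:

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DC

EX = Namespace("http://example.org/")   # hypothetical annotation vocabulary

# Two nodes from separately published graphs
image = URIRef("http://repo-a.example.org/images/42")
gene  = URIRef("http://db-b.example.org/genes/CG12345")

annotations = Graph()
ann = EX.annotation1
annotations.add((ann, EX.target, image))      # the node being annotated
annotations.add((ann, EX.linksTo, gene))      # the newly asserted link
annotations.add((ann, DC.creator, Literal("researcher-123")))  # provenance

print(annotations.serialize(format="turtle"))
```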
42. User registration, personalization, provenance and trust services
- Roughly: metadata about users
- The notion of user services is motivated particularly by third-party annotations
- The provenance of an annotation will substantially affect its credibility for other researchers. The system must be able to record disagreement, and provide principled ways to resolve conflicts in a way that is appropriate to each user's needs and preferences.
- Provenance may extend to personalized trust analysis, avoiding the need to determine a universal truth in the face of differing opinions and conflicting data.
- Personalization services may also be accessed by browse, search and other services, allowing different users to apply different options and views when accessing data.
- These services should link to existing authentication and authorization systems.
43. Secondary and domain-specific services
44. Secondary and domain-specific services
- A goal of this technical architecture is to provide a uniform platform for deploying secondary, application- and domain-specific services, with common programmatic access and data formats.
- Domain-specific services fill the role previously served by specialized applications for research data.
- Secondary and domain-specific services work with standardized data formats, and hence are more readily applied to new data sources with similar processing requirements.
- Application-specific data analysis, data processing workflows, visualization services and mashups are possible secondary services.
45. Data web: alignment and query handler
46. Data web: alignment and query handler
- Although the various data sources are presented in a common format using adapters, no attempt is made to enforce common schemas or labels for data in the various sources.
- Different sources may describe the same objects or relationships using different labels.
- If common schema and naming standards happen to be used, then linkages between different sources may be beneficially exploited without a data web. But we do not assume general use of such common standards.
- The data handling services described previously would typically be used separately with each source, with the user left to determine linkages between information.
- A data web is a domain- or community-specific aggregation of data sources, providing a coordinated view with a common core schema and instance identifiers applied across all participating data sources.
47. Data web: more details
48. Data web overview
- Data sources are subscribed to a data web by submission of schema mapping and identifier mapping details to the schema alignment service and coreference service respectively.
- Data alignment builds upon schema and ontology alignment research (e.g. Halevy, Doerr), but is applied in an open-world web environment.
- The query handler component may be stateless, allowing replication for arbitrary scaling if complex alignment computations are needed.
49. Schema alignment registry service
- A registry of information about the various data sources, describing how schema elements from each are mapped to a data web core schema.
- Adopts a modular, rule-based approach to schema mapping (e.g. drawing upon schema alignment language work by Martin Doerr).
- Schema mapping may be performed as a service, or by a calling application using information provided by this service; this is a design area needing further investigation, particularly with respect to its impact on scalability.
- The schema alignment service also serves as a registry of the metadata schemas supported by the various data sources, information that may be used in planning distributed queries.
- Schema use and alignment information is provided by subscribing a data source to a data web. In principle, any party can perform such a subscription, allowing for open-ended data webs.
50. Coreference service
- Separates instance-level alignment from the schema-level data alignment services.
- Holds a registry of information about instance identifiers used by the various subscribed data sources, describing how identifiers from each are mapped to and from a common identifier form.
- Initial implementations will probably use a simple identifier-translation approach (see the sketch below), with possibilities to adopt richer inference-based approaches. Details of the identifier mapping are abstracted by the service interface.
- Identifier mapping may be performed as a service, or by a calling application using information provided by this service; detailed designs are yet to be researched.
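A minimal sketch of the simple identifier-translation approach (all prefixes and mapping entries are hypothetical): a translation table maps source-local identifiers to and from a common data web form:

```python
# Hypothetical translation table: source-local URI prefix -> common prefix
PREFIX_MAP = {
    "http://repo-a.example.org/genes/": "http://dataweb.example.org/id/gene/",
    "http://db-b.example.org/gene?id=": "http://dataweb.example.org/id/gene/",
}

def to_common(source_uri: str) -> str:
    """Map a source-local identifier to its common (data web) form."""
    for local, common in PREFIX_MAP.items():
        if source_uri.startswith(local):
            return common + source_uri[len(local):]
    return source_uri  # unmapped identifiers pass through unchanged

def to_source(common_uri: str, source_prefix: str) -> str:
    """Map a common identifier back to a given source's local form."""
    for local, common in PREFIX_MAP.items():
        if local.startswith(source_prefix) and common_uri.startswith(common):
            return local + common_uri[len(common):]
    return common_uri

print(to_common("http://repo-a.example.org/genes/CG12345"))
```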
51. Distributed query handler
- Accepts queries expressed using the data web core schema and instance identifiers.
- Translates submitted queries into subsidiary queries across known data sources, using information provided by the schema alignment and coreference services (see the fan-out sketch below).
- Results are translated back to the data web core schema and instance identifiers.
- In principle, the query distribution and schema mapping functions can be implemented and tested separately; interactions will need to be analyzed when they are deployed together.
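A minimal sketch of the fan-out step (hypothetical endpoints; the per-source schema and identifier rewriting described above is elided into comments): the handler sends a query to each known source and merges the bindings into a single result set:

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Hypothetical SPARQL endpoints of subscribed data sources
ENDPOINTS = [
    "http://repo-a.example.org/sparql",
    "http://db-b.example.org/sparql",
]

QUERY = """
    PREFIX core: <http://dataweb.example.org/core/>
    SELECT ?gene ?stage WHERE { ?gene core:expressedAt ?stage . }
"""

def distributed_query(query: str) -> list[dict]:
    """Send the query to every endpoint and merge the result bindings."""
    merged = []
    for endpoint in ENDPOINTS:
        sparql = SPARQLWrapper(endpoint)
        sparql.setQuery(query)        # a real handler would first rewrite the
        sparql.setReturnFormat(JSON)  # query into each source's own schema
        results = sparql.query().convert()
        merged.extend(results["results"]["bindings"])
    return merged
```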
52. Role of technical architecture: design proposals
- The foregoing architecture description is not of itself a proposed development, but describes an idealized framework upon which developments may be based.
- In the next part, some possible elements of a development plan are suggested, based on the idealized framework described here.
53. Part 3: Development plan
- Identifying independently deployable components that make up the proposed technical architecture - a plan for incremental deployment
54. Components for publishing and accessing data
- From the idealized technical infrastructure we can identify a number of potential free-standing developments that are separately deployable and can provide immediate benefits independently of the rest of the framework.
- Planning development around loosely coupled, free-standing components greatly reduces overall project development risk as, in most cases, failure to effectively realize a single component does not frustrate the function or utility of other components.
- Many components have well-understood designs. For most of the more speculative elements, restricted initial designs are envisaged.
- By identifying free-standing tools that can be linked together in a web environment, we make it easier to exploit existing, independently-developed tools from diverse research and development communities.
55. Publishing research group data
56. Using repository software to publish research group data
- Domain-specific metadata packaging (e.g. metadata terms for metadata streams, or MPEG-21 style encapsulation).
- Domain-specific metadata access, assuming that this will be packaged as a distinguished data stream.
- Image and metadata presentation tools. The XSLT style sheets used for the SERPENT repository are a possible starting point for this.
- We are currently undertaking a proof-of-concept deployment for Drosophila in situ gene expression images with associated annotations.
- This work builds upon existing repository software, but we are not aware of any available tools to capture domain-specific data and metadata into a repository.
- Encouraging a move away from ad hoc publishing tools towards a common framework for meeting common requirements.
57. Repository ingress tools
- Generic and domain-specific tools for collecting primary observations into a research group repository.
- Our survey and other observers have noted that one of the greatest hurdles to data sharing is acquisition of adequate source metadata.
- Web forms are an obvious mode of data submission that should be provided, but we have observed (cf. FlyTed, NERC workshop) that the limitations of browser-style interfaces make web forms awkward for bulk data entry.
- Many researchers use spreadsheets to record and organize primary observations, and are very comfortable with the user interface of popular spreadsheet software. Tools for generating spreadsheet templates from a repository, and for subsequently importing spreadsheet data via web upload, are an option for gathering primary observations (see the sketch below).
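A minimal sketch of the spreadsheet import step (hypothetical column names and vocabulary; the spreadsheet is assumed to be exported as CSV): each row becomes a set of metadata statements about one image:

```python
import csv
from rdflib import Graph, Literal, Namespace, URIRef

FLY = Namespace("http://example.org/fly/")   # hypothetical vocabulary

def import_spreadsheet(csv_path: str) -> Graph:
    """Turn one spreadsheet row per image into RDF metadata statements."""
    g = Graph()
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            img = URIRef(row["image_uri"])
            g.add((img, FLY.gene, Literal(row["gene"])))
            g.add((img, FLY.developmentalStage, Literal(row["stage"])))
    return g

# The resulting graph can then be uploaded to the repository
g = import_spreadsheet("observations.csv")
print(g.serialize(format="turtle"))
```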
58. Migration from research group to institutional repository
- Tools for migrating images and metadata from a research group repository to an institutional repository.
- If the same repository software system is used for each, the migration should be fairly trivial, the main requirement being to provide for addition of the metadata required to satisfy institutional repository policies.
- If different repository systems are used, some repackaging of data and metadata may be required. Future adoption of emerging ORE standards across different repository systems will hopefully ease this migration.
59. Repository and database adapters
60. Repository and database adapters
- Our goal is SPARQL access to research data repositories. This is something of an open-ended development, as any new data source may need a new adapter, but initial targets would include:
  - an OAI-PMH repository adapter
  - an ePrints software repository adapter
  - public database adapters for FlyBase, FlyMine, Gene Ontology and proteomics data
- For relational database data sources, D2R Server may be used.
- For XML data sources, a SPARQL-to-XQuery rewriter could be an option. We are not currently aware of any such tool, but it seems possible that one may be in development somewhere.
- Generation of RSS/Atom data feeds is another goal (see the sketch below).
  - SPARQL-to-RSS/Atom and RSS/Atom-to-SPARQL adapters?
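A minimal sketch of the SPARQL-to-Atom direction (hypothetical endpoint and query; the feed is built with the standard library and omits elements such as `updated` that a complete Atom feed requires): SELECT results become feed entries:

```python
import xml.etree.ElementTree as ET
from SPARQLWrapper import SPARQLWrapper, JSON

ATOM = "http://www.w3.org/2005/Atom"
ET.register_namespace("", ATOM)

def results_to_atom(bindings: list[dict]) -> ET.Element:
    """Wrap SPARQL SELECT bindings (?item ?title) as an Atom feed."""
    feed = ET.Element(f"{{{ATOM}}}feed")
    ET.SubElement(feed, f"{{{ATOM}}}title").text = "Repository updates"
    for b in bindings:
        entry = ET.SubElement(feed, f"{{{ATOM}}}entry")
        ET.SubElement(entry, f"{{{ATOM}}}id").text = b["item"]["value"]
        ET.SubElement(entry, f"{{{ATOM}}}title").text = b["title"]["value"]
    return feed

sparql = SPARQLWrapper("http://example.org/repository/sparql")  # hypothetical
sparql.setQuery("""
    PREFIX dc: <http://purl.org/dc/elements/1.1/>
    SELECT ?item ?title WHERE { ?item dc:title ?title . } LIMIT 20
""")
sparql.setReturnFormat(JSON)
feed = results_to_atom(sparql.query().convert()["results"]["bindings"])
print(ET.tostring(feed, encoding="unicode"))
```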
61. OAI-PMH repository adapter
- A generic local OAI-PMH metadata harvester and indexer, coupled to a SPARQL endpoint allowing queries over metadata values (see the harvesting sketch below).
- Additional logic to access domain-specific metadata.
- Possibly based on the Joseki system, with custom software to harvest metadata into a local Jena model.
- This could be used with local and institutional repositories.
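A minimal sketch of the harvesting half (hypothetical repository base URL; the `ListRecords` verb and `oai_dc` metadata format are standard OAI-PMH): Dublin Core titles are harvested into a local RDF graph that could then back a SPARQL endpoint:

```python
import requests
import xml.etree.ElementTree as ET
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DC

OAI  = "{http://www.openarchives.org/OAI/2.0/}"
DCNS = "{http://purl.org/dc/elements/1.1/}"

BASE_URL = "http://example.org/repository/oai"   # hypothetical repository

resp = requests.get(BASE_URL, params={"verb": "ListRecords",
                                      "metadataPrefix": "oai_dc"})
root = ET.fromstring(resp.content)

g = Graph()
for record in root.iter(OAI + "record"):
    ident = record.find(OAI + "header/" + OAI + "identifier").text
    subj = URIRef(ident)            # OAI identifiers are themselves URIs
    for title in record.iter(DCNS + "title"):
        g.add((subj, DC.title, Literal(title.text)))

print(len(g), "triples harvested")
```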
62. ePrints software repository adapter
- Southampton ECS have expressed interest in cooperating in the development of a SPARQL adapter for their ePrints software.
- A direct API to access the underlying data and metadata store may avoid the need for OAI-PMH harvesting.
- Further exploration is needed.
63. Browse and search services
64. Browse service
- A generic facility to browse SPARQL endpoint data.
- Based on Southampton's mSpace software, with possible additional tools to generate mSpace browse configurations.
- Collaborate with the Southampton team to refine browse facilities for bioinformatics data.
65. Search service
- Metadata harvesting and indexing from selected sources to provide search facilities:
  - simple keyword search (Google-like)
  - semi-structured keyword search: find a keyword within an indicated ontological structure
  - ontologically guided search expansion and/or contraction
- We are not aware of existing tools for this; further survey work may be needed.
66. User and annotation services
67. User registration, personalization and provenance services
- Initially, a simple user registration service returning a token that anonymously identifies a session with a user (see the sketch below).
- Add queryable attributes that services may use for personalization, with a user interface to view/edit attribute values.
- Add a service to create a provenance token that binds a user identity with other queryable contextual information.
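A minimal sketch of the initial registration step (purely illustrative; a real service would persist sessions and link to existing authentication systems, as noted under trust services): the returned token identifies a session without exposing the user identity:

```python
import uuid

SESSIONS: dict[str, str] = {}   # token -> internal user id (kept server-side)

def register(user_id: str) -> str:
    """Return an opaque token that anonymously identifies a user session."""
    token = uuid.uuid4().hex
    SESSIONS[token] = user_id
    return token

def provenance_token(session_token: str, context: str) -> dict:
    """Bind a session (hence a user identity) to contextual information."""
    return {"session": session_token, "context": context}

token = register("researcher-123")
print(provenance_token(token, "annotation of image img1"))
```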
68. Annotation service
- Provision for third-party annotations to be attached to any existing data.
- Annotations may be simple tagging, semi-structured or fully structured. Existing work includes:
  - Connotea - simple tagging
  - Rich Tags project (Southampton) - semi-structured tagging, with possibilities to identify emergent ontologies
  - Annotea - fully structured (?)
- Annotations are associated with provenance data so that users can decide which annotations to include in their combined data.
69. Trust services
- Extension of the provenance and personalization framework, allowing trust decisions (primarily relating to annotations, but also for other data sources) to be based on a combination of personal preferences and community recommendation and reputation values.
- Possibly using elements of the SECURE project's trust evaluation core services?
70. Data web services
71. Data web
- Schema alignment registry service
- Coreference service
- Distributed query handler
  - query segmentation and distribution
  - query rewriting
72. Data web schema alignment registry service
- One registry service per data web, seeded with the core ontology for that data web.
- Data sources are subscribed by lodging information about the schemas they support, together with rules for mapping these to the common core (see the sketch below).
- The form of the mapping rules is to be determined, but will draw upon a body of existing research into database schema alignment (Doerr, Halevy, etc.).
- Mapping may be performed by the schema alignment service, or by the query handler service, or both. There are engineering trade-offs to be explored here.
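A minimal sketch of rule-based schema mapping (hypothetical vocabularies; real mapping rules, e.g. in Doerr's alignment language, are considerably richer than a predicate table): source-schema predicates are rewritten into the data web core schema:

```python
from rdflib import Graph, Namespace

SRC  = Namespace("http://repo-a.example.org/schema/")   # source schema
CORE = Namespace("http://dataweb.example.org/core/")    # data web core schema

# Hypothetical mapping rules lodged when the source is subscribed
PREDICATE_RULES = {
    SRC.geneName: CORE.gene,
    SRC.devStage: CORE.developmentalStage,
}

def align(source_graph: Graph) -> Graph:
    """Rewrite source-schema predicates into the core schema."""
    aligned = Graph()
    for s, p, o in source_graph:
        aligned.add((s, PREDICATE_RULES.get(p, p), o))
    return aligned
```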
73. Data web coreference service
- Maybe one coreference service per data web, or maybe one service can support multiple data webs - to be determined.
- Instance identifier schemes are provided when a data source is subscribed to a data web, along with rules for transformation to other identifier schemes for the same class of entities.
- Mapping may involve manipulation of URI strings, translation table lookup, or other mechanisms to be determined.
- Early thoughts are that identifier mapping will be handled internally by the coreference service in response to specific identifier queries.
74. Data web distributed query
- Part of the data web core aggregator service.
- Separates incoming queries into queries of different data sources, based on schema usage information lodged with the schema alignment registry service. Combines results from these multiple queries into a single result set.
- May be implemented separately from schema alignment.
- We plan to use existing work on DARQ, which is itself based on the ARQ query engine in the Jena/Joseki framework. The ARQ framework is well suited to query analysis and reformulation, being designed in part to support query optimization.
75. Data web query rewriting
- Part of the data web core aggregator service, building upon the distributed query handler.
- Maps incoming queries framed using core schema elements to use elements from individual data sources, based on mapping information lodged with the schema alignment registry service. Maps query results back to the core schema (see the sketch below).
- The design of this element will be highly experimental, and will drive aspects of the schema alignment service. There is a body of schema alignment research upon which we can draw.
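A minimal sketch of the rewriting direction (hypothetical schemas; naive textual substitution stands in for the query-algebra rewriting that a framework like ARQ supports): a core-schema query is translated into one source's own terms before dispatch:

```python
# Hypothetical term mapping for one data source (inverse of its alignment rules)
CORE_TO_SOURCE = {
    "core:gene": "src:geneName",
    "core:developmentalStage": "src:devStage",
}

def rewrite_for_source(core_query: str) -> str:
    """Replace core-schema terms with this source's own terms."""
    q = core_query
    for core_term, src_term in CORE_TO_SOURCE.items():
        q = q.replace(core_term, src_term)
    return q

core_query = """
    PREFIX core: <http://dataweb.example.org/core/>
    PREFIX src:  <http://repo-a.example.org/schema/>
    SELECT ?img WHERE { ?img core:gene "CG12345" . }
"""
print(rewrite_for_source(core_query))
```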
76. Conclusions
77. Conclusions
- Publishing raw observations alongside paper-based argumentation adds a new dimension to the information publication landscape, increasing the scope for independent interpretation and secondary research.
- Better computational tools are needed to explore, summarize and search large volumes of published data.
- Semantic web standards provide a common structural framework for data publication, for which generic tools can be developed and deployed.
- Diverse research groups are not expected to use common terms and schemas for their data, so usefully combining such data will require some element of alignment.
78. Conclusions
- The proposed technical architecture has been developed through consideration of the preceding points.
- The proposed technical architecture is not itself offered as an appropriate development plan, as the design of certain elements is currently the subject of ongoing research, and draws upon skills difficult to find in a single group.
- Using the web as a platform for composing services, we propose a number of free-standing components that may be separately developed and tested, and separately evolved and improved even as they are being used in concert.
79. End