Title: Towards a technical infrastructure for research image data publication
1. Towards a technical infrastructure for research image data publication
- Graham Klyne
- Image Bioinformatics Research Group
- Zoology Department, University of Oxford
2. Overview
- Part 1: analysis - data in the scholarly life cycle
  - Key concepts
  - Repository survey summary
  - Dealing with data
- Part 2: synthesis - an architectural framework
  - Publication infrastructure for documents
  - Infrastructure enhancements for data
- Part 3: development proposals - components
  - Supporting pieces
  - Data web core elements
3. Part 1: Analysis
- Incorporating images and other data into the life cycle of published research
4. Documents, images and data
- Documents
  - interpreted by humans
  - structure visible, but content generally opaque to computer software
- Data
  - processed by computer software
  - not generally for human consumption
- Images
  - interpretable by humans, sometimes only partially
  - inaccessible to most computer software
  - typically handled as data
  - additional information needed for discovery and interpretation
5. Documents, data and metadata
- Metadata: data about data
  - has machine-processable structure
- Metadata is clearly distinct from documents and images
- Metadata is not always clearly distinct from other data
- For images, metadata may be particularly important to aid discovery, interpretation and effective display
6. Types of metadata
- Generic metadata (sometimes called descriptive metadata)
  - common to all kinds of publication, typically expressed as Dublin Core (see the sketch below)
- Structural metadata
  - file format, image format, etc.
  - standards exist
- Preservation metadata
  - needed for access to content
- Domain-specific metadata
  - used for discovery or interpretation of content
  - standards not universally available
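As an illustration of how generic and domain-specific metadata can coexist on one image record (a minimal sketch in Python with rdflib; the `http://example.org/fly/` vocabulary and property names are hypothetical, not from the slides):

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DC

# Hypothetical domain-specific vocabulary for gene expression images
FLY = Namespace("http://example.org/fly/")

g = Graph()
img = URIRef("http://example.org/images/insitu-0001")

# Generic (descriptive) metadata, using Dublin Core
g.add((img, DC.title, Literal("In situ expression image 0001")))
g.add((img, DC.creator, Literal("Image Bioinformatics Research Group")))

# Domain-specific metadata, using a project vocabulary
g.add((img, FLY.gene, Literal("CG12345")))
g.add((img, FLY.developmentalStage, Literal("embryo stage 11")))

print(g.serialize(format="turtle"))
```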
7. Summary of repository survey
- Not many image collections in institutional repositories
- One exemplary image collection (SERPENT) was in a specialized (non-institutional) repository (ref: SERPENT page)
  - http://imageweb.zoo.ox.ac.uk/wiki/index.php/DefiningImageAccess/Repository/SERPENT
- There is very little domain-specific metadata associated with image collections
  - SERPENT again being an honourable exception
- Where available, metadata quality is variable
  - may need external annotation to make collections useful
- Even generic metadata is not always well served
- Deployed institutional repositories seem to be primarily set up to deal with papers and theses
8. OAI-PMH and domain-specific metadata
- We believe domain-specific metadata is important for discovery and interpretation of research image data
- OAI-PMH is not a panacea for accessing domain metadata
  - administratively difficult to deploy, where deployment is possible at all
  - cannot perform discovery based on metadata values (the repository community model seems to be to use a separate service like OAIster for such operations)
- Domain-specific metadata can reasonably be handled as part of the data, with modest repository support
  - e.g. ePrints adds a couple of structural metadata fields; Fedora defines an appropriate content model
  - external tool support based on OAI-PMH for basic access
9. Looking forward: publishing research data
- Building on repositories for papers and theses, we outline a technical vision for data publication:
  - examine the framework within which papers are published
  - suggest implications of additionally publishing data
  - consider how the publication framework might be adjusted to accommodate these implications
  - propose a technical architecture for publishing papers and data that accommodates these adjustments
  - identify specific free-standing evolutionary developments that work toward supporting the proposed technical architecture
- A generic approach to data publication can also handle image data, but with greater emphasis on metadata.
10. Elements of published-paper-based research
- Primary observations: new data derived from laboratory or field studies.
- Selection, analysis and argumentation: manual interpretation and organization for rhetorical effect, which accordingly limits the volume of material.
- Published papers: written materials published via journals and the web, all the result of some manual interpretation.
- Preservation: maintaining copies of material that can be recovered for subsequent analysis or publication.
  - Optional for all forms of material, though generally an automatic side effect for anything published via an institutional repository.
11. Implications of publishing data
- Greater volume
  - there are no constraints, or radically fewer constraints, on the amount of material that can be published; hence less need to select primary observations, in contrast to selection for inclusion in papers
- Requirements for automatic processing
  - computer handling requires that software is tailored to the particular data sources, or that the sources have been aligned with respect to data structures and labelling
- Tool support
  - better computational tools are needed to explore, summarize and search large volumes of published data
12. Elements of published-data-based research
- Data publication: expose primary observation data to public access
- Data alignment: adapt published data to some common form so that data from different sources can be combined
- Explore, summarize, search: additional tools to aid researchers and software applications in finding relevant published data
13. Further implications of publishing data
- Contextual data for interpretation
  - domain-specific (meta)data may be needed for interpretation, especially for images and similar material: bare images are not enough
- Data and metadata
  - the distinction between data and metadata is blurred: some metadata needs to be available as first-class data objects
- Primary observations and interpretations
  - a distinction may be seen in published material between primary observations and contestable interpretations
  - secondary research from existing primary observations
  - possible multiple interpretations of the same primary data suggest a role for provenance-tracking metadata
14. Part 2: Synthesis
- Creating a technical architecture for data publication
15. Technical requirements for data publication
- Data publication tools
  - exposing both generic and arbitrary domain-specific metadata
  - common access mechanisms for metadata
  - resource location and extraction based on metadata
- Data alignment tools
  - query, combine and align metadata to deliver a coordinated view across heterogeneous repositories
- Data access tools
  - to explore, summarize and search published data and metadata
- External annotation tools
  - allowing multiple interpretations of primary data to be captured
  - further tools to support provenance, personalization and trust
16. Infrastructure to support data publication: technical requirements
- The following slides illustrate how the technical elements may be built upon each other, followed by more detailed descriptions of those elements:
  1. Present framework elements for online paper publication
  2. Present framework elements for online data publication
  3. Proposed framework for publishing data via separate repositories
  4. Proposed framework for publishing data via combined repositories
  5. Data web details
17. Present framework for online publication and access to papers
18. Present framework for online publication and access to data
19. Proposed framework for publishing data via separate repositories
20. Proposed framework for publishing data via combined repositories
21. Data web: more details
22. Framework component descriptions ...
23. Research group data
24. Research group data
- Research data may be collected and stored locally on computers, but is not generally accessible outside the group
- Corresponds to use of computers for data collection and analysis
- Stored locally, not publicly accessible
25. Institutional repository and publisher repositories
26. Institutional repository
- Preservation and/or web publication of papers, theses, images and/or data; facilities typically provided by research institutions
- Current deployments focus on publishing papers and theses
- Some institutions are also looking to publish images and other data sets.
27. Publisher repositories
- Publishers may publish papers to the web.
- Publishers may also publish auxiliary data and/or other commentary to the web
  - e.g. Nature Publishing Group provides Connotea, which directly supports limited third-party annotation (tagging) of published articles, and can be used by other web software to hold richer annotations.
- Publishers we have spoken to are particularly interested in linking journal papers to supporting primary observations.
28. Public discovery services
29. Public discovery services
- An obvious public discovery service is Google
- Also included are services such as Citeseer, ROAR, professional society indexing and abstracting services, etc.
  - http://imageweb.zoo.ox.ac.uk/wiki/index.php/DefiningImageAccess/Resource/ROAR
- Also included are JISC services such as OpenDOAR and Intute repository search
  - http://imageweb.zoo.ox.ac.uk/wiki/index.php/DefiningImageAccess/Resource/OpenDOAR
  - http://imageweb.zoo.ox.ac.uk/wiki/index.php/DefiningImageAccess/Project/Intute
- And any third-party service that can be used to search for online publications or data sets
30. Research group repository and specialized applications
31. Research group repository
- A research group repository, or database, is a system for publishing research group data (and possibly papers, presentations, etc.) to the web.
- It is provisioned by a research group rather than institutionally, and provides timely rather than long-lived access to research data.
- These repositories may be bespoke databases and/or web sites.
- Alternatively, common repository software could be used, which might be expected to ease the migration of content to an institutional repository for preservation and longer-term publication.
32. Specialized applications
- For automated access to research group data, and possibly to public databases, specialized applications may be used.
- These are often developed to handle particular data for a particular research project, and may not be readily repurposed to work with other data.
33. Public databases
34. Public databases
- A number of public databases are being created. For example, in the life science area there are public databases for gene sequences, proteomics, model species data, the gene ontology, etc. These provide online access to pooled results from many research groups in the domains covered, which is increasingly important in life science research.
- Similar public databases exist for areas of humanities and/or classics research (e.g. the Beazley Archive, LIMC), though these are not always tuned for programmatic access to data.
- These systems typically employ expert curation and incur ongoing maintenance costs.
35. Repository and database adapters
36. Repository and database adapters
- Access to repository data and metadata is currently provided by web browser interfaces, or by harvesting systems with web front-ends (OAIster, Intute repository search); these lack a common mechanism for programmatic access.
- We propose that a uniform interface to various data sources may be provided by implementing adapters that present data mapped to RDF and RSS/Atom data models, and provide query access via a SPARQL (semantic web query) endpoint service (see the query sketch below).
- A secondary function of the repository and database adapters may be to serve metadata about the data source itself, for example indicating the URI of a resource containing configuration data for a browse or search service.
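To make the proposed programmatic access concrete, here is a minimal sketch of a client querying an adapter's SPARQL endpoint (the endpoint URL is hypothetical; SPARQLWrapper is one commonly used client library):

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Hypothetical SPARQL endpoint exposed by a repository adapter
ENDPOINT = "http://example.org/repository/sparql"

sparql = SPARQLWrapper(ENDPOINT)
sparql.setQuery("""
    PREFIX dc: <http://purl.org/dc/elements/1.1/>
    SELECT ?image ?title WHERE {
        ?image dc:title ?title .
    } LIMIT 10
""")
sparql.setReturnFormat(JSON)

results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["image"]["value"], "-", row["title"]["value"])
```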
37. Browse and search services
38. Browse service
- With uniform programmatic access to multiple data sources, a common browser service can provide for exploration of diverse data sources.
- Configuration data, at the level of an RDF schema with additional annotations, might be used to provide a browsing experience tailored to the content of each data source (e.g. mSpace, Fresnel).
  - Note en passant: Karger and Schraefel's "pathetic fallacy"
- This will not replace the need for domain-specific user interfaces for viewing and interacting with data sources, but may provide a starting point for learning about new data sources.
- It may be that generic user interface tools based on Semantic Lens ideas can provide application-specific user interfaces, but a generic browser will likely be needed to support configuration of such tools for new data sources.
39. Search service
- Complementing the generic browse service, this service allows unstructured, semi-structured or fully-structured search for data in an arbitrary data source.
- A simple search may simply locate keywords in designated parts of some structured data, applying purely syntactic criteria.
- A more advanced search service may use taxonomic and other semantic information to expand or restrict a search (e.g. a search for images of rodents can return images described as being of mice; see the sketch below).
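A minimal sketch of ontology-guided expansion, using an in-memory rdflib graph (the taxonomy, image annotations and `ex:` vocabulary are hypothetical): the `rdfs:subClassOf*` property path lets a query for rodent images also match images annotated with the subclass mouse:

```python
from rdflib import Graph

g = Graph()
# Hypothetical taxonomy and image annotations, inline for illustration
g.parse(data="""
    @prefix ex:   <http://example.org/> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

    ex:Mouse  rdfs:subClassOf ex:Rodent .
    ex:Rodent rdfs:subClassOf ex:Mammal .

    ex:img1 ex:depicts ex:Mouse .
    ex:img2 ex:depicts ex:Elephant .
""", format="turtle")

# The rdfs:subClassOf* property path expands the search down the taxonomy
results = g.query("""
    PREFIX ex:   <http://example.org/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?image WHERE {
        ?image ex:depicts ?cls .
        ?cls rdfs:subClassOf* ex:Rodent .
    }
""")
for row in results:
    print(row.image)   # only ex:img1 matches
```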
40. Annotation and user registration services
41. Annotation services
- With publication of primary observations comes the possibility for different researchers to analyze them and draw different conclusions. Annotation services provide mechanisms for such alternative analyses to be published and made available in a common framework.
- Annotations may consist of simple tagging (e.g. Flickr, del.icio.us, Connotea) or more complex semantic descriptions (e.g. Annotea).
- If the published information is viewed as an RDF-style graph (nodes denoting entities, linked by edges denoting relationships between them), then annotations provide a means to describe or deduce new links between the nodes in a graph, or between nodes in separately published graphs (see the sketch below).
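A minimal sketch of that graph view (hypothetical URIs and annotation vocabulary): a third-party annotation introduces a new link between two independently published nodes, and carries its own provenance statement:

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DC

EX = Namespace("http://example.org/")   # hypothetical annotation vocabulary

# Two nodes from separately published graphs
image = URIRef("http://repo-a.example.org/images/42")
gene  = URIRef("http://db-b.example.org/genes/CG12345")

annotations = Graph()
ann = EX.annotation1
annotations.add((ann, EX.target, image))      # the node being annotated
annotations.add((ann, EX.linksTo, gene))      # the newly asserted link
annotations.add((ann, DC.creator, Literal("researcher-123")))  # provenance

print(annotations.serialize(format="turtle"))
```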
42. User registration, personalization, provenance and trust services
- Roughly: metadata about users
- The notion of user services is motivated particularly by third-party annotations
- The provenance of an annotation will substantially affect its credibility for other researchers. The system must be able to record disagreement, and provide principled ways to resolve conflicts in a way that is appropriate to each user's needs and preferences.
- Provenance may extend to personalized trust analysis, avoiding the need to determine a universal truth in the face of differing opinions and conflicting data.
- Personalization services may also be accessed by browse, search and other services, allowing different users to apply different options and views when accessing data.
- These services should link to existing authentication and authorization systems.
43. Secondary and domain-specific services
44. Secondary and domain-specific services
- A goal of this technical architecture is to provide a uniform platform for deploying secondary, application- and domain-specific services, with common programmatic access and data formats.
- Domain-specific services fill the role previously served by specialized applications for research data.
- Secondary and domain-specific services work with standardized data formats, and hence are more readily applied to new data sources with similar processing requirements.
- Application-specific data analysis, data processing workflows, visualization services and mashups are possible secondary services.
45. Data web: alignment and query handler
46. Data web: alignment and query handler
- Although the various data sources are presented in a common format using adapters, no attempt is made to enforce common schemas or labels for data in the various sources.
- Different sources may describe the same objects or relationships using different labels.
- If common schema and naming standards happen to be used, then linkages between different sources may be beneficially exploited without a data web. But we do not assume general use of such common standards.
- The data handling services described previously would typically be used separately with each source, with the user left to determine linkages between information.
- A data web is a domain- or community-specific aggregation of data sources, providing a coordinated view with a common core schema and instance identifiers applied across all participating data sources.
47. Data web: more details
48. Data web overview
- Data sources are subscribed to a data web by submission of schema mapping and identifier mapping details to the schema alignment service and coreference service respectively.
- Data alignment builds upon schema and ontology alignment research (e.g. Halevy, Doerr), but is applied in an open-world web environment.
- The query handler component may be stateless, allowing replication for arbitrary scaling if complex alignment computations are needed.
49. Schema alignment registry service
- A registry of information about the various data sources, describing how schema elements from each are mapped to a data web core schema.
- Adopts a modular, rule-based approach to schema mapping (e.g. drawing upon schema alignment language work by Martin Doerr).
- Schema mapping may be performed as a service, or by a calling application using information provided by this service; this is a design area needing further investigation, particularly with respect to its impact on scalability.
- The schema alignment service also serves as a registry of the metadata schemas supported by the various data sources, information that may be used in planning distributed queries.
- Schema use and alignment information is provided by subscribing a data source to a data web. In principle, any party can perform such a subscription, allowing for open-ended data webs.
50. Coreference service
- Separates instance-level alignment from the schema-level data alignment services.
- Holds a registry of information about instance identifiers used by the various subscribed data sources, describing how identifiers from each are mapped to and from a common identifier form.
- Initial implementations will probably use a simple identifier-translation approach (see the sketch below), with possibilities to adopt richer inference-based approaches. Details of the identifier mapping are abstracted by the service interface.
- Identifier mapping may be performed as a service, or by a calling application using information provided by this service; detailed designs are yet to be researched.
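A minimal sketch of the simple identifier-translation approach (all prefixes and mapping entries are hypothetical): a translation table maps source-local identifiers to and from a common data web form:

```python
# Hypothetical translation table: source-local URI prefix -> common prefix
PREFIX_MAP = {
    "http://repo-a.example.org/genes/": "http://dataweb.example.org/id/gene/",
    "http://db-b.example.org/gene?id=": "http://dataweb.example.org/id/gene/",
}

def to_common(source_uri: str) -> str:
    """Map a source-local identifier to its common (data web) form."""
    for local, common in PREFIX_MAP.items():
        if source_uri.startswith(local):
            return common + source_uri[len(local):]
    return source_uri  # unmapped identifiers pass through unchanged

def to_source(common_uri: str, source_prefix: str) -> str:
    """Map a common identifier back to a given source's local form."""
    for local, common in PREFIX_MAP.items():
        if local.startswith(source_prefix) and common_uri.startswith(common):
            return local + common_uri[len(common):]
    return common_uri

print(to_common("http://repo-a.example.org/genes/CG12345"))
```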
51. Distributed query handler
- Accepts queries expressed using the data web core schema and instance identifiers.
- Translates submitted queries into subsidiary queries across known data sources, using information provided by the schema alignment and coreference services (see the fan-out sketch below).
- Results are translated back to the data web core schema and instance identifiers.
- In principle, the query distribution and schema mapping functions can be implemented and tested separately; interactions will need to be analyzed when they are deployed together.
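A minimal sketch of the fan-out step (hypothetical endpoints; the per-source schema and identifier rewriting described above is elided into comments): the handler sends a query to each known source and merges the bindings into a single result set:

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Hypothetical SPARQL endpoints of subscribed data sources
ENDPOINTS = [
    "http://repo-a.example.org/sparql",
    "http://db-b.example.org/sparql",
]

QUERY = """
    PREFIX core: <http://dataweb.example.org/core/>
    SELECT ?gene ?stage WHERE { ?gene core:expressedAt ?stage . }
"""

def distributed_query(query: str) -> list[dict]:
    """Send the query to every endpoint and merge the result bindings."""
    merged = []
    for endpoint in ENDPOINTS:
        sparql = SPARQLWrapper(endpoint)
        sparql.setQuery(query)        # a real handler would first rewrite the
        sparql.setReturnFormat(JSON)  # query into each source's own schema
        results = sparql.query().convert()
        merged.extend(results["results"]["bindings"])
    return merged
```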
52. Role of technical architecture: design proposals
- The foregoing architecture description is not of itself a proposed development, but describes an idealized framework upon which developments may be based.
- In the next part, some possible elements of a development plan are suggested, based on the idealized framework described here.
53. Part 3: Development plan
- Identifying independently deployable components that make up the proposed technical architecture - a plan for incremental deployment
54. Components for publishing and accessing data
- From the idealized technical infrastructure we can identify a number of potential free-standing developments that are separately deployable and can provide immediate benefits independently of the rest of the framework.
- Planning development around loosely coupled, free-standing components greatly reduces overall project development risk as, in most cases, failure to effectively realize a single component does not frustrate the function or utility of other components.
- Many components have well-understood designs. For most of the more speculative elements, restricted initial designs are envisaged.
- By identifying free-standing tools that can be linked together in a web environment, we make it easier to exploit existing, independently-developed tools from diverse research and development communities.
55. Publishing research group data
56. Using repository software to publish research group data
- Domain-specific metadata packaging (e.g. metadata terms for metadata streams, or MPEG-21 style encapsulation).
- Domain-specific metadata access, assuming that this will be packaged as a distinguished data stream.
- Image and metadata presentation tools. The XSLT style sheets used for the SERPENT repository are a possible starting point for this.
- We are currently undertaking a proof-of-concept deployment for Drosophila in situ gene expression images with associated annotations.
- This work builds upon existing repository software, but we are not aware of any available tools to capture domain-specific data and metadata into a repository.
- Encouraging a move away from ad hoc publishing tools towards a common framework for meeting common requirements.
57. Repository ingress tools
- Generic and domain-specific tools for collecting primary observations into a research group repository.
- Our survey and other observers have noted that one of the greatest hurdles to data sharing is acquisition of adequate source metadata.
- Web forms are an obvious mode of data submission that should be provided, but we have observed (cf. FlyTed, NERC workshop) that the limitations of browser-style interfaces make web forms awkward for bulk data entry.
- Many researchers use spreadsheets to record and organize primary observations, and are very comfortable with the user interface of popular spreadsheet software. Tools for generating spreadsheet templates from a repository, and for subsequently importing spreadsheet data via web upload, are an option for gathering primary observations (see the sketch below).
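A minimal sketch of the spreadsheet import step (hypothetical column names and vocabulary; the spreadsheet is assumed to be exported as CSV): each row becomes a set of metadata statements about one image:

```python
import csv
from rdflib import Graph, Literal, Namespace, URIRef

FLY = Namespace("http://example.org/fly/")   # hypothetical vocabulary

def import_spreadsheet(csv_path: str) -> Graph:
    """Turn one spreadsheet row per image into RDF metadata statements."""
    g = Graph()
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            img = URIRef(row["image_uri"])
            g.add((img, FLY.gene, Literal(row["gene"])))
            g.add((img, FLY.developmentalStage, Literal(row["stage"])))
    return g

# The resulting graph can then be uploaded to the repository
g = import_spreadsheet("observations.csv")
print(g.serialize(format="turtle"))
```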
58. Migration from research group to institutional repository
- Tools for migrating images and metadata from a research group repository to an institutional repository.
- If the same repository software system is used for each, the migration should be fairly trivial, the main requirement being to provide for addition of the metadata required to satisfy institutional repository policies.
- If different repository systems are used, some repackaging of data and metadata may be required. Future adoption of emerging ORE standards across different repository systems will hopefully ease this migration.
59. Repository and database adapters
60. Repository and database adapters
- Our goal is SPARQL access to research data repositories. This is something of an open-ended development, as any new data source may need a new adapter, but initial targets would include:
  - an OAI-PMH repository adapter
  - an ePrints software repository adapter
  - public database adapters for FlyBase, FlyMine, Gene Ontology and proteomics data
- For relational database data sources, D2R Server may be used.
- For XML data sources, a SPARQL-to-XQuery rewriter could be an option. We are not currently aware of any such tool, but it seems possible that one may be in development somewhere.
- Generation of RSS/Atom data feeds is another goal (see the sketch below).
  - SPARQL-to-RSS/Atom and RSS/Atom-to-SPARQL adapters?
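A minimal sketch of the SPARQL-to-Atom direction (hypothetical endpoint and query; the feed is built with the standard library and omits elements such as `updated` that a complete Atom feed requires): SELECT results become feed entries:

```python
import xml.etree.ElementTree as ET
from SPARQLWrapper import SPARQLWrapper, JSON

ATOM = "http://www.w3.org/2005/Atom"
ET.register_namespace("", ATOM)

def results_to_atom(bindings: list[dict]) -> ET.Element:
    """Wrap SPARQL SELECT bindings (?item ?title) as an Atom feed."""
    feed = ET.Element(f"{{{ATOM}}}feed")
    ET.SubElement(feed, f"{{{ATOM}}}title").text = "Repository updates"
    for b in bindings:
        entry = ET.SubElement(feed, f"{{{ATOM}}}entry")
        ET.SubElement(entry, f"{{{ATOM}}}id").text = b["item"]["value"]
        ET.SubElement(entry, f"{{{ATOM}}}title").text = b["title"]["value"]
    return feed

sparql = SPARQLWrapper("http://example.org/repository/sparql")  # hypothetical
sparql.setQuery("""
    PREFIX dc: <http://purl.org/dc/elements/1.1/>
    SELECT ?item ?title WHERE { ?item dc:title ?title . } LIMIT 20
""")
sparql.setReturnFormat(JSON)
feed = results_to_atom(sparql.query().convert()["results"]["bindings"])
print(ET.tostring(feed, encoding="unicode"))
```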
61. OAI-PMH repository adapter
- A generic local OAI-PMH metadata harvester and indexer, coupled to a SPARQL endpoint allowing queries over metadata values (see the harvesting sketch below).
- Additional logic to access domain-specific metadata.
- Possibly based on the Joseki system, with custom software to harvest metadata into a local Jena model.
- This could be used with local and institutional repositories.
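A minimal sketch of the harvesting half (hypothetical repository base URL; the `ListRecords` verb and `oai_dc` metadata format are standard OAI-PMH): Dublin Core titles are harvested into a local RDF graph that could then back a SPARQL endpoint:

```python
import requests
import xml.etree.ElementTree as ET
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DC

OAI  = "{http://www.openarchives.org/OAI/2.0/}"
DCNS = "{http://purl.org/dc/elements/1.1/}"

BASE_URL = "http://example.org/repository/oai"   # hypothetical repository

resp = requests.get(BASE_URL, params={"verb": "ListRecords",
                                      "metadataPrefix": "oai_dc"})
root = ET.fromstring(resp.content)

g = Graph()
for record in root.iter(OAI + "record"):
    ident = record.find(OAI + "header/" + OAI + "identifier").text
    subj = URIRef(ident)            # OAI identifiers are themselves URIs
    for title in record.iter(DCNS + "title"):
        g.add((subj, DC.title, Literal(title.text)))

print(len(g), "triples harvested")
```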
62. ePrints software repository adapter
- Southampton ECS have expressed interest in cooperating in the development of a SPARQL adapter for their ePrints software.
- A direct API to access the underlying data and metadata store may avoid the need for OAI-PMH harvesting.
- Further exploration is needed.
63. Browse and search services
64. Browse service
- A generic facility to browse SPARQL endpoint data.
- Based on Southampton's mSpace software, with possible additional tools to generate mSpace browse configurations.
- Collaborate with the Southampton team to refine browse facilities for bioinformatics data.
65. Search service
- Metadata harvesting and indexing from selected sources to provide search facilities:
  - simple keyword search (Google-like)
  - semi-structured keyword search: find a keyword within an indicated ontological structure
  - ontologically guided search expansion and/or contraction
- We are not aware of existing tools for this; further survey work may be needed.
66. User and annotation services
67. User registration, personalization and provenance services
- Initially, a simple user registration service returning a token that anonymously identifies a session with a user (see the sketch below).
- Add queryable attributes that services may use for personalization, with a user interface to view/edit attribute values.
- Add a service to create a provenance token that binds a user identity with other queryable contextual information.
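A minimal sketch of the initial registration step (purely illustrative; a real service would persist sessions and link to existing authentication systems, as noted under trust services): the returned token identifies a session without exposing the user identity:

```python
import uuid

SESSIONS: dict[str, str] = {}   # token -> internal user id (kept server-side)

def register(user_id: str) -> str:
    """Return an opaque token that anonymously identifies a user session."""
    token = uuid.uuid4().hex
    SESSIONS[token] = user_id
    return token

def provenance_token(session_token: str, context: str) -> dict:
    """Bind a session (hence a user identity) to contextual information."""
    return {"session": session_token, "context": context}

token = register("researcher-123")
print(provenance_token(token, "annotation of image img1"))
```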
68. Annotation service
- Provision for third-party annotations to be attached to any existing data.
- Annotations may be simple tagging, semi-structured or fully structured. Existing work includes:
  - Connotea - simple tagging
  - Rich Tags project (Southampton) - semi-structured tagging, with possibilities to identify emergent ontologies
  - Annotea - fully structured (?)
- Annotations are associated with provenance data so that users can decide which annotations to include in their combined data.
69. Trust services
- Extension of the provenance and personalization framework, allowing trust decisions (primarily relating to annotations, but also for other data sources) to be based on a combination of personal preferences and community recommendation and reputation values.
- Possibly using elements of the SECURE project's trust evaluation core services?
70. Data web services
71. Data web
- Schema alignment registry service
- Coreference service
- Distributed query handler
  - query segmentation and distribution
  - query rewriting
72. Data web schema alignment registry service
- One registry service per data web, seeded with the core ontology for that data web.
- Data sources are subscribed by lodging information about the schemas they support, together with rules for mapping these to the common core (see the sketch below).
- The form of the mapping rules is to be determined, but will draw upon a body of existing research into database schema alignment (Doerr, Halevy, etc.).
- Mapping may be performed by the schema alignment service, or by the query handler service, or both. There are engineering trade-offs to be explored here.
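A minimal sketch of rule-based schema mapping (hypothetical vocabularies; real mapping rules, e.g. in Doerr's alignment language, are considerably richer than a predicate table): source-schema predicates are rewritten into the data web core schema:

```python
from rdflib import Graph, Namespace

SRC  = Namespace("http://repo-a.example.org/schema/")   # source schema
CORE = Namespace("http://dataweb.example.org/core/")    # data web core schema

# Hypothetical mapping rules lodged when the source is subscribed
PREDICATE_RULES = {
    SRC.geneName: CORE.gene,
    SRC.devStage: CORE.developmentalStage,
}

def align(source_graph: Graph) -> Graph:
    """Rewrite source-schema predicates into the core schema."""
    aligned = Graph()
    for s, p, o in source_graph:
        aligned.add((s, PREDICATE_RULES.get(p, p), o))
    return aligned
```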
73. Data web coreference service
- Maybe one coreference service per data web, or maybe one service can support multiple data webs - to be determined.
- Instance identifier schemes are provided when a data source is subscribed to a data web, along with rules for transformation to other identifier schemes for the same class of entities.
- Mapping may involve manipulation of URI strings, translation table lookup, or other mechanisms to be determined.
- Early thoughts are that identifier mapping will be handled internally by the coreference service in response to specific identifier queries.
74. Data web distributed query
- Part of the data web core aggregator service.
- Separates incoming queries into queries of different data sources, based on schema usage information lodged with the schema alignment registry service. Combines results from these multiple queries into a single result set.
- May be implemented separately from schema alignment.
- We plan to use existing work on DARQ, which is itself based on the ARQ query engine in the Jena/Joseki framework. The ARQ framework is well suited to query analysis and reformulation, being designed in part to support query optimization.
75. Data web query rewriting
- Part of the data web core aggregator service, building upon the distributed query handler.
- Maps incoming queries framed using core schema elements to use elements from individual data sources, based on mapping information lodged with the schema alignment registry service. Maps query results back to the core schema (see the sketch below).
- The design of this element will be highly experimental, and will drive aspects of the schema alignment service. There is a body of schema alignment research upon which we can draw.
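A minimal sketch of the rewriting direction (hypothetical schemas; naive textual substitution stands in for the query-algebra rewriting that a framework like ARQ supports): a core-schema query is translated into one source's own terms before dispatch:

```python
# Hypothetical term mapping for one data source (inverse of its alignment rules)
CORE_TO_SOURCE = {
    "core:gene": "src:geneName",
    "core:developmentalStage": "src:devStage",
}

def rewrite_for_source(core_query: str) -> str:
    """Replace core-schema terms with this source's own terms."""
    q = core_query
    for core_term, src_term in CORE_TO_SOURCE.items():
        q = q.replace(core_term, src_term)
    return q

core_query = """
    PREFIX core: <http://dataweb.example.org/core/>
    PREFIX src:  <http://repo-a.example.org/schema/>
    SELECT ?img WHERE { ?img core:gene "CG12345" . }
"""
print(rewrite_for_source(core_query))
```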
76. Conclusions
77. Conclusions
- Publishing raw observations alongside paper-based argumentation adds a new dimension to the information publication landscape, increasing the scope for independent interpretation and secondary research.
- Better computational tools are needed to explore, summarize and search large volumes of published data.
- Semantic web standards provide a common structural framework for data publication, for which generic tools can be developed and deployed.
- Diverse research groups are not expected to use common terms and schemas for their data, so usefully combining such data will require some element of alignment.
78. Conclusions
- The proposed technical architecture has been developed through consideration of the preceding points.
- The proposed technical architecture is not itself offered as an appropriate development plan, as the design of certain elements is currently the subject of ongoing research, and draws upon skills difficult to find in a single group.
- Using the web as a platform for composing services, we propose a number of free-standing components that may be separately developed and tested, and separately evolved and improved even as they are being used in concert.
79. End