Title: A vision involving raw data archiving via local archives as a supplement to the existing processed data archives (PDB, CSD, ICDD etc)
1. A vision involving raw data archiving via local archives as a supplement to the existing processed data archives (PDB, CSD, ICDD etc)
John R. Helliwell, Brian McMahon, Tom Terwilliger
john.helliwell_at_manchester.ac.uk, bm_at_iucr.org, terwilliger_at_lanl.gov
2. Options
- Do nothing to ensure raw data archiving.
- Do what we can, e.g. raw data archiving at centralised facilities along with universities' own data archives, both as supplements to the processed data archiving at the CSD, PDB etc., or at the very least via personal web page links.
- Seek a blue-skies solution in which all raw data are compulsorily archived at centralised repositories.
3. During the last year detailed options were sketched out: firstly
- At the launch meeting of the DDD WG in Madrid it was suggested that a pilot project involving digital object identifier (DOI) registrations of a test group of data sets could be established; this would be led by an SR facility that is keeping a raw data archive in any case.
- This was enthusiastically supported, and DLS agreed to take this forward with 100 MX data sets.
- JRH in parallel continued to investigate the local university reprint repository archive option, which (at U. Manchester) accepts data for small data sets; this led to finding out that U. Manchester was in any case setting up a data archive for its researchers so as to satisfy funding bodies' requirements of its grant holders (launch expected September 2012).
- The local University Data Archive would be the vehicle for locally measured diffractometer data sets, and also perhaps for those from SR and neutron facilities that made it into publications by academics at that university.
4. During the last year detailed options were sketched out: secondly
- A draft proposal was also written by JRH exploring the possibility of Acta Crystallographica Section E: Structure Reports Online hosting raw data (the set of diffraction data images) for each structure.
- Preliminary analysis, in discussions with IUCr Journals, Chester, identified the major bottleneck as network bandwidth (Chester has 2 x 2 Mbps, but there were also concerns about bandwidth limits on international pipes, especially to individual laboratories).
- Building costs would also be involved to upgrade a server room for higher-capacity storage, although preliminary estimates suggested that the per-article storage overhead could be sustainable within the journal's open-access charging model.
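To make the bandwidth concern concrete, the following is a minimal back-of-envelope sketch in Python. Only the "2 x 2 Mbps" figure comes from the slide; the data set size (`ASSUMED_DATASET_GB`) is a hypothetical placeholder, and the calculation assumes that both lines can be used together and ignores protocol overhead and competing traffic, so real transfers would be slower.

```python
def transfer_time_seconds(dataset_gigabytes: float, link_mbps: float) -> float:
    """Ideal transfer time in seconds, ignoring overhead and contention."""
    bits = dataset_gigabytes * 1e9 * 8      # decimal gigabytes -> bits
    return bits / (link_mbps * 1e6)         # Mbps -> bits per second


# From the slide: Chester has 2 x 2 Mbps.
# Assumption: the two lines can be aggregated into one 4 Mbps pipe.
CHESTER_LINK_MBPS = 2 * 2.0

# Hypothetical size of one set of diffraction images; not given in the source.
ASSUMED_DATASET_GB = 1.0

if __name__ == "__main__":
    seconds = transfer_time_seconds(ASSUMED_DATASET_GB, CHESTER_LINK_MBPS)
    print(f"{ASSUMED_DATASET_GB} GB over {CHESTER_LINK_MBPS} Mbps: "
          f"{seconds:.0f} s ({seconds / 3600:.2f} h)")
```

Under these assumptions a single 1 GB data set already occupies the whole link for 2000 seconds (about half an hour), which illustrates why bandwidth rather than disk space was identified as the bottleneck.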
5. JRH with L K-B write an article with links to raw data sets
- Tanley, S. W. M., Schreurs, A. M. M., Helliwell, J. R. and Kroon-Batenburg, L. M. J. (2012). Experience with exchange and archiving of raw data: comparison of data from two diffractometers and four software packages on a series of lysozyme crystals. J. Appl. Cryst. Submitted.
- Explores comparative metadata associated with different instruments, emphasising the benefit of standard ontologies (e.g. imgCIF).
- Demonstrates the scientific usefulness of detailed data reanalysis.
6. New reports appear from learned bodies
- In addition to ICSU's Strategic Committee on Data Report:
- The Royal Society (June 2012) enthusiastically endorses the importance of access to data; their Committee gives its own definition of data and states: "For example, the annual cost of managing the world's data on protein structures in the world wide Protein Data Bank is less than 1% of the cost of generating that data."
- Their data definitions unfortunately seem to miss the distinction between processed data and raw data.
7. Is a blue-skies option still out of the question?
- Might one or more centralised global repositories take on the raw data archiving?
- The PDB has given a careful and detailed analysis at this Workshop.
8. Is the option of localised repositories (near to where data are measured) secure yet?
- CSynR has started a survey of SR facilities (8 reported so far), suggesting that this is promising as an option, but each SR facility emphasised that it is not to be regarded as an archive. Neither of the following could be guaranteed:
  - instantaneous delivery of data;
  - provision of data sets certified to be 100% free of data corruption.
- The universities' data archive experience, even at the most advanced in their planning (e.g. University of Manchester), is yet to be seen in practice, e.g. with respect to the two issues mentioned in point 1 above.
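One common way to at least detect corruption after the fact, even if no repository can certify its absence, is to store a checksum manifest alongside the raw images. The sketch below is an illustration only, not something described in the slides: the file names, the choice of SHA-256 and the helper names (`build_manifest`, `find_damaged`) are all assumptions.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so that large image files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(directory: Path) -> dict[str, str]:
    """Map each file (path relative to the data set root) to its SHA-256 digest."""
    return {
        str(p.relative_to(directory)): sha256_of_file(p)
        for p in sorted(directory.rglob("*"))
        if p.is_file()
    }


def find_damaged(directory: Path, manifest: dict[str, str]) -> list[str]:
    """Return the files that are missing or whose digest no longer matches."""
    damaged = []
    for name, expected in manifest.items():
        path = directory / name
        if not path.is_file() or sha256_of_file(path) != expected:
            damaged.append(name)
    return damaged


def demo() -> list[str]:
    """Create two hypothetical image files, corrupt one, and report it."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "image_0001.img").write_bytes(b"frame one")
        (root / "image_0002.img").write_bytes(b"frame two")
        manifest = build_manifest(root)
        # Simulate silent corruption of the second image after archiving.
        (root / "image_0002.img").write_bytes(b"frame twp")
        return find_damaged(root, manifest)


if __name__ == "__main__":
    print(demo())
```

A manifest of this kind only shows that files are unchanged since the manifest was written; it says nothing about whether the data were intact when first recorded.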
9. Possibilities for SR facility temporary repositories
- Most synchrotron facilities already maintain simple archives of users' data.
- Perhaps 99% access and availability is plenty (and better than nothing).
- A simple approach:
  - Save raw data at the SR facility, tagged with identifier(s). Optimally, tag metadata also. (Perhaps one DOI per dataset, generated at this time, provided to the user and stored in image headers.)
  - Processing programs keep track of identifiers so that processed data are linked to raw data.
  - On PDB deposition, the DOI is deposited. On publication it is listed.
  - The PDB notifies the SR facility; the flagged data are copied to a long-term storage location.
  - Perhaps some day the PDB pulls these data in.
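The simple approach above can be sketched as a small data model. This is a hedged illustration under stated assumptions, not an existing facility or PDB interface: the class and function names, the storage paths and the DOI prefix `10.0000/example-sr` are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class RawDataset:
    doi: str                  # identifier assigned when the data are saved
    location: str             # where the images sit in the facility archive
    long_term: bool = False   # set once a deposition refers to this data set


@dataclass
class ProcessedData:
    description: str
    raw_dois: list[str]       # carried along by the processing programs


@dataclass
class FacilityArchive:
    doi_prefix: str           # hypothetical prefix, for illustration only
    datasets: dict[str, RawDataset] = field(default_factory=dict)

    def save_raw(self, location: str) -> RawDataset:
        """Save a raw data set and tag it with a newly generated identifier."""
        doi = f"{self.doi_prefix}/raw.{len(self.datasets) + 1:04d}"
        dataset = RawDataset(doi=doi, location=location)
        self.datasets[doi] = dataset
        return dataset

    def flag_for_long_term(self, doi: str) -> None:
        """Called on notification from the PDB; marks data for long-term copy."""
        self.datasets[doi].long_term = True


def process(raw: RawDataset, description: str) -> ProcessedData:
    """Processing keeps the raw identifier attached to its output."""
    return ProcessedData(description=description, raw_dois=[raw.doi])


def deposit(processed: ProcessedData, facility: FacilityArchive) -> list[str]:
    """Deposit the DOIs with the structure and notify the facility."""
    for doi in processed.raw_dois:
        facility.flag_for_long_term(doi)
    return list(processed.raw_dois)


def demo() -> tuple[list[str], bool, bool]:
    facility = FacilityArchive(doi_prefix="10.0000/example-sr")
    used = facility.save_raw("/archive/visit-1/crystal-a")
    unused = facility.save_raw("/archive/visit-1/crystal-b")
    deposited = deposit(process(used, "merged intensities"), facility)
    return deposited, used.long_term, unused.long_term


if __name__ == "__main__":
    print(demo())
```

In this sketch only data sets that reach a deposition are flagged for long-term storage, so the facility's temporary archive does not need to keep everything indefinitely.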
10. Might we still need additional fallback positions?
- Corresponding authors set up web links to the data sets that underpin their publications.
- These may or may not be DOI-linked; such a requirement would be difficult to enforce, although journals could strongly recommend it.
- How would such a method for data archiving and access by readers be kept up to date, e.g. in the event of an author retiring (or what to do after their death)?
11. Conclusions
- There is enthusiasm and encouragement to archive more than derived or processed data in many areas of science besides our own.
- The crystallographic community prides itself on making its processed data accompany its publications; indeed it has been obligatory these last 10 years or so.
- We have three practical options in the near future to extend these principles to our raw data:
  - via the local Data Archive;
  - via synchrotron data storage;
  - or via the corresponding author setting up a link on their personal website to the datasets underpinning publications.
12. So, we suggest a proposal
- We suggest that we adopt the above three practical options to make feasible a recommendation to the IUCr Executive Committee that:
  - Authors should provide a permanent and prominent link from an article to the raw data sets underpinning a journal publication,
  - with a view to making this a formal requirement on authors at such time as the community has adopted raw data deposition as a routine procedure.