Issues in Access to Reprocessed Datasets - PowerPoint PPT Presentation

1 / 1
About This Presentation
Title:

Issues in Access to Reprocessed Datasets

Description:

... physical costs of storage have become relatively insignificant over the past decade. ... archives have chosen very different models of how access to the original ... – PowerPoint PPT presentation

Number of Views:11
Avg rating:3.0/5.0
Slides: 2
Provided by: ThomasM54
Category:

less

Transcript and Presenter's Notes

Title: Issues in Access to Reprocessed Datasets


1
Data Preservation and Data Reuse in Archive
Policy and Implementation Thomas
McGlynn NASA/Goddard Space Flight Center High
Energy Astrophysics Science Archive Research
Center
Archive Policy and experience Access to static low-level data? Can get old data? Multiple versions visible to public?
Hubble Space Telescope Archive Telemetry data is preserved, but data products use on-demand processing to create science data sets. Higher level data products are not stored statically while an instrument is active. No facilities exist to enable users asking for a given observation at different times to get the same results. Users may be able to process low level data using custom processing to reproduce earlier data products. Yes. No. No.
Chandra Science Center Chandra Data Archive All releases of all data products are preserved and users can ask for an earlier version of a released data product. Most data has been reprocessed at least once. No science-based request for an earlier version of data has ever been received, though occasional requests have been made by software developers. Yes. Yes but not by default. No.
High Energy Astrophysics Science Archive Research Center The details vary slightly from mission to mission, but generally revised data products completely replace earlier data products. In principle earlier versions of datasets could be restored by restoring earlier versions of the archive, but this would be a massive undertaking comparable to recovering from a fire. Few, if any, science based requests for earlier version data products have been received. Yes. No. No.
Infrared Science Archive Details vary on a mission by mission basis. For most missions later data version replace earlier versions. For 2MASS each of the sequential releases is supported separately. Yes. Sometimes. Sometimes.
Sloan Digital Sky Survey The SDSS has made a series of releases of increasingly large sets of data. All releases are supported independently. Yes. Yes. Yes.
SkyView Virtual Observatory SkyView resamples survey datasets to any specified geometry. Source data is not generally accessible to users, only resampled imagery. SkyView is a secondary archive for most of the large datasets it serves. No. No. No.
Why do we have archives? The traditional role of
archives has been preserve artifacts and
information, but this role has significantly
expanded with development of on-line digital
archives. For many of our science archives the
primary goal is now to promote the use and re-use
of the data in the archive. There can be some
tension between these two goals.   This poster
explores how this tension can be seen in in the
design and operation of modern archives looking
at some of the major NASA-supported astronomy
archives. Since software and calibrations of
science data normally improve with time most
higher level data products go through several
iterations. Different astronomy archives have
chosen very different models of how access to the
original data will be maintained in light of
revisions to their processing pipelines.
Currently most archives preserve some low level
data products which are largely unaffected by the
processing pipeline, but some systems no longer
provide users with easy access to such data. In
addition to the data retention policies, I
discuss the response of the community to the
choices that have been made
Issues in Access to Reprocessed Datasets Two
potential reasons are usually given for enabling
scientists access to older versions of science
datasets comparison with earlier results
obtained from the obsolete data, and the ability
to recover from errors introduced into later
versions of the pipeline. While it is certainly
true that errors can be introduced into
pipelines, in practice the archive center is
likely to correct the data relatively quickly so
that action by individual users in not necessary.
The experience of the Chandra and HEASARC
archives, where access to earlier versions of
data is either impossible or is noted, suggests
that users rarely need earlier versions of
datasets. After many years of operations there
have been essentially no requests for these
data. Three major concerns to keeping old
datasets are the costs of storage and the
additional complexity that managing multiple
versions requires in the archive system or
presents to the user, and the possibility that
users may overlook more recent data sets. The
importance of the first issue has dropped
significantly as the physical costs of storage
have become relatively insignificant over the
past decade. With a terabyte of storage costing
less than 1,000, the actual cost of storage is
now an issue only for the very largest systems,
e.g., LSST. The
complexity multiple versions engender is real.
E.g., I have seen users confused by the large
numbers of SDSS resources available in DataScope
(a Virtual Observatory data location service).
Services which maintain multiple versions of data
should probably provide only the latest version
by default and return earlier versions only on
specific request similar to the policy of the
Chandra Science Center. On the server side
tracking versions does add to the complexity to
the archive system, but this should be
manageable. When updated data products are given
new names and addresses, there is a significant
fraction of the community will not use the
updated data. E.g., the various SDSS data
releases are served from distinct URLs. Each end
user must change the addresses they use to get
the most up-to-date information. Users are not
informed that more up-to-date information exists.
Providing the best data at a fixed address may
be a better strategy. Release specific addresses
could still be provided for users who wished to
ensure they were getting a specified dataset.
Catalogs have traditionally included a version
number in their name, e.g., the RC3 and RC4, and
this seems to be the basis the choices of the
SDSS and 2MASS teams. The projects are
simultaneously massive data archives and very
large catalogs. Large, long-lived archive/catalog
projects may wish to consider including
versioning at the row level so that users can
query only a single catalog but still find which
data was available in a given catalog generation.
Data retention policies of selected astronomy
archives This table illustrates the varied
archive policies of prominent astronomy archives
with respect to access to earlier versions of
data.
Secondary Archives and Access to Low Level
Information The SkyView virtual observatory
represents a further loosening of the ties
between traditional and modern archives. In
SkyView users can only retrieve data after it has
been resampled so that the underlying datasets
are not directly available. Despite, or perhaps
because of, this the system is very popular and
is able to provide a simple and consistent
interface to a very broad range of data. As the
amount of processing needed between observation
and scientifically usable data increases,
end-users may be more and more drawn to
interfaces that concentrate on more science-ready
data products. Generally this is appropriate
when users are not pushing an observatory to its
limits. Secondary archives, like SkyView, or
secondary interfaces to existing archives, like
the interfaces being developed by the Virtual
Observatory, can concentrate on higher level
products without compromising the ability of the
community to get to lower level data products
through the primary archives when
needed. Discussion. Current astronomy data
archives vary considerably in the degree to which
they provide access to earlier versions of data.
SDSS and 2MASS make it very easy to access
earlier versions, while IRSA and the HEASARC can
make it virtually impossible to get earlier
versions of reprocessed datasets. In practice
scientists do not seem to have a strong interest
in earlier data. Regardless of the extent to
which archives provide access to earlier
versions, the facilitation of current use of the
archive data is the primary mission of all of the
astronomy archives. Although science archives
see the science community as their clients, there
may be some dangers to this viewpoint. We have
only had on-line archives for a decade or two and
in the longer term we find additional roles for
them in the historical and social analysis of
science. When science historians wish to find
the data that was used in some discovery it may
be difficult to reproduce given the policies like
those in place at the HEASARC or IRSA. E.g.,
the first observational evidence of General
Relativistic frame dragging used data from the
ASCA observatory archived at the HEASARC.
Historians wishing to reconstruct the discovery
may no longer find the same data available that
was used at the time of discovery.
A detail in the Virtual Observatory DataScope
illustrating the many SDSS services that are
visible to the public.
Write a Comment
User Comments (0)
About PowerShow.com