Data provenance in astronomy

1
Data provenance in astronomy
  • Bob Mann
  • Wide-Field Astronomy Unit, University of
    Edinburgh (rgm_at_roe.ac.uk)

2
Outline
  • Data and databases in astronomy
  • Case Study: UK Infrared Deep Sky Survey
  • Conclusions

3
Outline
  • Data and databases in astronomy
  • Case Study: UK Infrared Deep Sky Survey
  • Conclusions

4
Astronomers observe across the whole
electromagnetic spectrum
  • Galaxy images look different across the spectrum,
    due to:
    • Inherent angular resolution of the telescope
    • Different emission processes

5
Astronomical data: original form
  • Different detector technologies are used across
    the spectrum, yielding different types of data,
    e.g.
    • Ultraviolet/optical/infrared: image (array of
      pixel values)
    • X-ray: event list (positions, arrival times and
      energies of all detected photons)
    • Radio: interferometric visibilities (sparse
      Fourier transform of a region of the sky)

6
Astronomical data: final form
  • Most research done using catalogue data
    • i.e. tables of attributes of detected sources,
      mainly discrete sources (stars, galaxies, etc.)
  • Data compression
    • Catalogue is a few % of image data volume
  • Amenable to representation in relational DB
    • Natural indexing by location in sky
  • ...but original data products (images, spectra,
    event lists) sometimes needed
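
As a minimal sketch of the last few points, a source catalogue maps naturally onto a relational table, and an index on position supports queries by location on the sky. The table name, column names and values below are hypothetical, and a plain (dec, ra) index stands in for the dedicated spatial indexing schemes that a real archive would more likely use.

```python
import sqlite3

def build_catalogue(rows):
    """Load (source_id, ra_deg, dec_deg, j_flux) rows into an in-memory table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE source ("
        " source_id INTEGER PRIMARY KEY,"
        " ra_deg REAL, dec_deg REAL, j_flux REAL)"
    )
    # Index by sky position so that box searches can avoid a full table scan.
    db.execute("CREATE INDEX idx_source_pos ON source (dec_deg, ra_deg)")
    db.executemany("INSERT INTO source VALUES (?, ?, ?, ?)", rows)
    return db

def box_search(db, ra_min, ra_max, dec_min, dec_max):
    """Return IDs of sources inside a simple RA/Dec box (no RA wrap-around)."""
    cur = db.execute(
        "SELECT source_id FROM source"
        " WHERE dec_deg BETWEEN ? AND ? AND ra_deg BETWEEN ? AND ?"
        " ORDER BY source_id",
        (dec_min, dec_max, ra_min, ra_max),
    )
    return [row[0] for row in cur]

# Hypothetical sources, for illustration only.
ROWS = [
    (1, 150.10, 2.20, 12.5),
    (2, 150.12, 2.21, 3.1),
    (3, 210.00, -5.00, 7.7),
]
```

Calling `box_search(build_catalogue(ROWS), 150.0, 150.2, 2.0, 2.5)` picks out the two neighbouring sources and ignores the third.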

7
Astronomical databases
  • Telescope archives
    • Heterogeneous collections of raw data files
      from all observations taken
    • Download data for reduction and analysis
  • Sky survey archives
    • Homogeneous data and pipeline reduction
    • Science Archive: do science on DB
  • Bibliographic archives: scans of journals

8
Astronomical data processing
  • Data reduction
    • Remove instrumental signatures from raw data
      and produce science-ready data
    • Software packages written for specific
      instruments
  • Data analysis
    • Derive scientific results from science-ready
      data products, e.g. statistical analyses
    • Some astro-specific packages/environments, e.g.
      IRAF
    • Some use of programming languages: Fortran,
      C/C++, Python, Java
    • Some use of commercial packages, e.g.
      Interactive Data Language (IDL)

9
Outline
  • Data and databases in astronomy
  • Case Study: UKIDSS
  • Introduction to UKIDSS
  • Data life-cycle in UKIDSS
  • Provenance in UKIDSS
  • Conclusions

10
UK Infrared Deep Sky Survey
  • Set of five infrared sky surveys
  • Covering 1/6 of the sky
  • From large/shallow to very small/very deep
  • See www.ukidss.org
  • Observations 2005-2012 using Wide Field Camera
    (WFCAM) on UK Infrared Telescope (UKIRT) in
    Hawaii

11
UKIDSS data life-cycle (1)
  • Summit of Mauna Kea
    • Data acquired from 4 WFCAM detectors
    • Summit pipeline: instrument health
    • Data written to LTO tape in NDF format
    • Tapes couriered to Cambridge weekly
  • Cambridge
    • Raw data converted from NDF to FITS
    • Data reduction pipeline run on nightly basis:
      100Gb/night
    • Remove instrumental signatures, combine images,
      detect and classify objects, calibrate
      positions and fluxes

12
UKIDSS data life-cycle (2)
  • Edinburgh
    • Ingest data from Cambridge: catalogues into
      RDBMS; image metadata into RDBMS; images on
      disk
    • Combine data from multiple nights: generate new
      catalogues from stacked images
    • Prepare release databases for WFCAM Science
      Archive (WSA): see http://surveys.roe.ac.uk/wsa
  • Users worldwide
    • Extract raw images from Cambridge
    • Extract images and catalogues in FITS files
      from Edinburgh
    • Run queries on catalogues and image metadata in
      WSA

13
Provenance in UKIDSS
  • Why is provenance important in UKIDSS?
  • What provenance information is recorded?
  • How will this be used... and by whom?
  • ...and is this adequate?

14
Importance of provenance
  • Much UKIDSS science is rare object search

Objects with these colours would be very unusual
and possibly very interesting. Are they real? Need
ability to trace back to the reduced image within
which the object was detected, and maybe back to
the raw image.

[Figure: plot with axes "Ratio of fluxes in H and K
bands" and "Ratio of fluxes in J and H bands"]
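
The colour selection described above can be sketched as follows. A colour is the magnitude difference between two bands, i.e. -2.5 log10 of the flux ratio (zero-point offsets are ignored here). The function names, the sample fluxes and the selection thresholds are hypothetical and chosen only for illustration; they are not the cuts used in UKIDSS.

```python
import math

def colour(flux_a, flux_b):
    """Colour (magnitude difference a - b) from the ratio of two fluxes."""
    return -2.5 * math.log10(flux_a / flux_b)

def select_unusual(sources, jh_max, hk_max):
    """Return IDs of sources below both (hypothetical) colour cuts."""
    picked = []
    for src in sources:
        j_h = colour(src["j"], src["h"])
        h_k = colour(src["h"], src["k"])
        if j_h < jh_max and h_k < hk_max:
            picked.append(src["id"])
    return picked

# Hypothetical fluxes in arbitrary units.
SOURCES = [
    {"id": "a", "j": 1.0, "h": 2.0, "k": 4.0},  # J-H > 0 and H-K > 0
    {"id": "b", "j": 4.0, "h": 2.0, "k": 1.0},  # J-H < 0 and H-K < 0
]
```

Any candidate returned by such a cut is only as trustworthy as the images behind it, which is why the trace back to the reduced and raw images matters.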

15
Structure of a FITS file

Header composed of 80-character ASCII records
Data units can be images or tables
Extensions

16
FITS header records
  • Almost all records of the form
    KEYWORD = value / COMMENT
  • Some standard keywords defined, but considerable
    freedom to define new ones
    • Relevant metadata for particular instruments
  • Amongst standard set is HISTORY
    • Format: HISTORY free text
    • Provenance information can be stored in a
      series of HISTORY records
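
To make the record layout concrete, here is a minimal sketch that splits a header block into 80-character records and separates keyword, value and comment. It is a simplification for illustration (for example, it does not handle a "/" inside a quoted string value) and is not a substitute for a real FITS library. The function names and the sample records are hypothetical.

```python
CARD_LEN = 80

def split_cards(header_text):
    """Split a header block into fixed-width 80-character records."""
    return [header_text[i:i + CARD_LEN]
            for i in range(0, len(header_text), CARD_LEN)]

def parse_card(card):
    """Return (keyword, value, comment) for one record."""
    keyword = card[:8].strip()
    if keyword in ("HISTORY", "COMMENT"):
        # Commentary records carry free text instead of '= value'.
        return keyword, card[8:].strip(), ""
    if card[8:10] != "= ":
        # No value indicator, e.g. the END record.
        return keyword, "", ""
    value, _, comment = card[10:].partition("/")
    return keyword, value.strip(), comment.strip()

def history_text(cards):
    """Collect the free text of all HISTORY records, in order."""
    return [parse_card(c)[1] for c in cards if c.startswith("HISTORY")]

# Hypothetical sample records, each padded to 80 characters.
SAMPLE = "".join(line.ljust(CARD_LEN) for line in [
    "NAXIS   =                    2 / number of data axes",
    "HISTORY reduction step one",
    "HISTORY reduction step two",
    "END",
])
```

Because HISTORY records are free text, any structure inside them (timestamps, module names) is a convention of the software that wrote them and has to be parsed separately.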

17
UKIDSS FITS files (1)
  • Raw image files
    • Primary header: telescope/instrument set-up,
      observing conditions, target, observational
      parameters
    • Primary data array: empty
    • Extensions: (header, data) pairs for each of
      four detectors; header has detector-specific
      metadata; data is compressed image
    • Header keywords defined in Interface Control
      Document between Hawaii and Cambridge

18
UKIDSS FITS files (2)
  • Reduced image files
    • Primary header and data array: metadata
      propagated from raw data file
    • Headers of extensions include HISTORY records
      for data reduction steps run at Cambridge, e.g.
      • HISTORY 20060615 17:30:02
      • HISTORY $Id: cir_stage1.c,v 1.11 2005/12/15
        14:44:04 jim Exp $
      • HISTORY 20060615 17:31:04
      • HISTORY $Id: cir_qblkmed.c,v 1.9 2005/08/12
        14:35:19 jim Exp $
      • HISTORY 20060615 17:32:36
      • HISTORY $Id: cir_xtalk.c,v 1.5 2005/10/17
        14:58:50 jim Exp $
      • HISTORY 20060615 20:01:58
      • HISTORY $Id: cir_arith.c,v 1.8 2005/02/25
        10:14:55 jim Exp $

[Slide annotations on the HISTORY records: What,
When, Who]
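
The HISTORY records shown above come in pairs, which suggests a simple way to recover the what/when/who of each reduction step. The sketch below is an assumption-laden illustration: the punctuation of the sample strings (colons in the times, `$Id: ... $` delimiters) is assumed rather than confirmed, so the patterns accept the records with or without it, and the function and field names are hypothetical.

```python
import re
from datetime import datetime

# Timestamp record: YYYYMMDD then a time, with or without colons.
STAMP = re.compile(r"^(\d{8}) (\d{2}):?(\d{2}):?(\d{2})$")
# Version-control Id record: file name, revision, date, time, user, state.
RCS_ID = re.compile(r"^\$?Id:? (\S+),v (\S+) \S+ \S+ (\S+) \S+")

def parse_history(lines):
    """Pair each timestamp record with the Id record that follows it."""
    steps = []
    when = None
    for text in lines:
        text = text.strip()
        stamp = STAMP.match(text)
        if stamp:
            when = datetime.strptime("".join(stamp.groups()), "%Y%m%d%H%M%S")
            continue
        ident = RCS_ID.match(text)
        if ident and when is not None:
            module, revision, who = ident.groups()
            steps.append({"when": when, "what": module,
                          "revision": revision, "who": who})
            when = None
    return steps

# Free text of HISTORY records, following the example on the slide.
RECORDS = [
    "20060615 17:30:02",
    "$Id: cir_stage1.c,v 1.11 2005/12/15 14:44:04 jim Exp $",
    "20060615 17:31:04",
    "$Id: cir_qblkmed.c,v 1.9 2005/08/12 14:35:19 jim Exp $",
]
```

Here `parse_history(RECORDS)` yields one entry per reduction step, each with the run time, the source file and revision, and the user name taken from the Id string.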

19
UKIDSS FITS files (3)
  • Catalogue files
    • Primary header: metadata propagated from raw
      image
    • Primary data array: empty
    • Headers of extensions include metadata for the
      catalogue generation process: invocations of
      software modules in HISTORY records, with
      parameter values in separate records
    • Header keywords for both reduced images and
      catalogues are defined in an Interface Control
      Document between Cambridge and Edinburgh

20
User access to provenance info
  • All header records from all FITS files ingested
    into WSA, except HISTORY records
  • So, users can track provenance through queries
    against WSA, and can get HISTORY records by
    downloading files
  • Hopefully enough to determine whether an unusual
    object is real, but is this good enough?
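
The kind of provenance query described above might look like the following sketch, which traces a catalogue source back to the reduced and raw image files it came from. The schema is invented for illustration: the table names, column names and file names are hypothetical and are not those of the WSA.

```python
import sqlite3

# Hypothetical three-table schema linking sources to images.
SCHEMA = """
CREATE TABLE raw_image (raw_id INTEGER PRIMARY KEY, file_name TEXT);
CREATE TABLE reduced_image (
    image_id INTEGER PRIMARY KEY, file_name TEXT,
    raw_id INTEGER REFERENCES raw_image (raw_id));
CREATE TABLE source (
    source_id INTEGER PRIMARY KEY,
    image_id INTEGER REFERENCES reduced_image (image_id));
"""

def make_db():
    """Build a tiny in-memory database with one source and its images."""
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    db.execute("INSERT INTO raw_image VALUES (1, 'raw_0001.fit')")
    db.execute("INSERT INTO reduced_image VALUES (10, 'red_0001.fit', 1)")
    db.execute("INSERT INTO source VALUES (100, 10)")
    return db

def trace_source(db, source_id):
    """Return (reduced file, raw file) for a source, or None if unknown."""
    return db.execute(
        "SELECT rd.file_name, rw.file_name"
        " FROM source AS s"
        " JOIN reduced_image AS rd ON rd.image_id = s.image_id"
        " JOIN raw_image AS rw ON rw.raw_id = rd.raw_id"
        " WHERE s.source_id = ?",
        (source_id,),
    ).fetchone()
```

With the file names in hand, a user could then download the files and inspect their HISTORY records, which are not held in the database.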

21
Recap: Astronomical data processing
  • Data reduction
    • Remove instrumental signatures from raw data
      and produce science-ready data
    • Software packages written for specific
      instruments
  • Data analysis
    • Derive scientific results from science-ready
      data products, e.g. statistical analyses
    • Some astro-specific packages/environments, e.g.
      IRAF
    • Some use of programming languages: Fortran,
      C/C++, Python, Java
    • Some use of commercial packages, e.g.
      Interactive Data Language (IDL)

?

22
Provenance in data analysis: Two main problems
  • Less controlled software environment
    • Little bits of code written for a specific
      analysis, not tried and tested pipeline modules
  • Use of data from many sources
    • UKIDSS/WSA is state-of-the-art for provenance
    • Many (esp. older) data resources not so good
    • Provenance of combined dataset only as good as
      provenance of worst constituent dataset?
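
As an illustration of what recording provenance in an ad hoc analysis script could involve, the sketch below wraps an analysis step so that each call logs the step name, its parameters, a checksum of the input and the run time. This is not something described in the talk; the decorator, the log structure and the toy analysis function are all hypothetical.

```python
import functools
import hashlib
from datetime import datetime, timezone

# In-memory provenance log; a real script might write this to a file.
PROVENANCE_LOG = []

def record_provenance(func):
    """Decorator: log what ran, on which input, with which parameters."""
    @functools.wraps(func)
    def wrapper(data, **params):
        result = func(data, **params)
        PROVENANCE_LOG.append({
            "step": func.__name__,
            "params": dict(params),
            "input_sha256": hashlib.sha256(data).hexdigest(),
            "run_at": datetime.now(timezone.utc).isoformat(),
        })
        return result
    return wrapper

@record_provenance
def count_bright(data, threshold=128):
    """Toy analysis step: count byte values above a threshold."""
    return sum(1 for value in data if value > threshold)
```

Even a log this small captures the what and when of a step; whether researchers would accept the overhead is the open question the slides raise.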

23
Does this matter?
  • Provenance information for data analysis is
    recorded in the journal paper (sort of)
  • Improving links between online literature and
    data sources
  • Increasing importance of large sky surveys with
    well controlled environments
  • Moving more of the data analysis from the user's
    desktop to the data centre

24
Conclusions
  • Modern sky survey systems record and publish
    extensive provenance for data reduction
  • Very little provenance recorded from data
    analysis except description in journal paper
  • More could surely be done but would researchers
    support overhead of doing so?
  • Improvements as more analysis in data centre
  • Could/should we be doing more?