Where the Rubber Meets the Sky: Giving Access to Science Data

1
Where the Rubber Meets the Sky: Giving Access to
Science Data
  • Talk at
  • National Institute of Informatics, Tokyo, Japan
  • October 2005
  • Jim Gray
  • Microsoft Research
  • Gray@Microsoft.com
  • http://research.Microsoft.com/Gray
  • Alex Szalay, Johns Hopkins University
  • Szalay@JHU.edu

2
  • Abstract
  • I have been working with some astronomers
  • for the last 6 years, trying to apply DB
    technology to science problems.
  • These are some lessons I learned.
  • Paper:
  • Where the Rubber Meets the Sky: Bridging the Gap
    between Databases and Science, Jim Gray,
    Alexander S. Szalay, MSR-TR-2004-110, October
    2004

3
New Science Paradigms
  • A thousand years ago: science was empirical
  • describing natural phenomena
  • Last few hundred years: a theoretical branch
  • using models, generalizations
  • Last few decades: a computational branch
  • simulating complex phenomena
  • Today: data exploration (eScience)
  • unify theory, experiment, and simulation
  • using data management and statistics
  • Data captured by instruments or generated by
    simulator
  • Processed by software
  • Scientist analyzes database / files

4
The Big Picture
The Big Problems
  • Data ingest
  • Managing a petabyte
  • Common schema
  • How to organize it?
  • How to reorganize it?
  • How to coexist with others?
  • Data Query and Visualization tools
  • Support/training
  • Performance
  • Execute queries in a minute
  • Batch (big) query scheduling

5
Experiment Budgets: ¼ to ½ Software
  • Millions of lines of code
  • Repeated for experiment after experiment
  • Not much sharing or learning
  • Let's work to change this
  • Identify generic tools
  • Workflow schedulers
  • Databases and libraries
  • Analysis packages
  • Visualizers
  • Software for
  • Instrument scheduling
  • Instrument control
  • Data gathering
  • Data reduction
  • Database
  • Analysis
  • Visualization

6
Data Lifecycle
  • Raw data → primary data → derived data
  • Data has bugs
  • Instrument bugs
  • Pipeline bugs
  • Data comes in versions
  • later versions fix known bugs
  • Just like software (indeed data is software)
  • Can't un-publish bad data.

7
Data Inflation: Data Pyramid
  • Level 2: Derived data products ~10x smaller, but
    there are many. L2 ≈ L1
  • Publish new edition each year
  • Fixes bugs in data.
  • Must preserve old editions
  • Creates data pyramid
  • Store each edition
  • 1, 2, 3, 4, …, N: ~N² bytes
  • Net Data Inflation: L2 ≥ L1
  • Level 1A: Grows X TB/year, ~0.4X TB/year
    compressed (level 1A in NASA terms)
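
The "store each edition" arithmetic can be sketched in a few lines. This is only an illustration: the function name and the assumption that edition k holds k years of data (so its size is k times the yearly growth) are not from the slides. Under that assumption the total is N(N+1)/2 units, which grows as N², the quadratic "data pyramid" growth described above.

```python
def pyramid_storage(editions: int, yearly_growth_tb: float) -> float:
    """Total TB needed to keep every yearly edition online.

    Assumption (illustrative): edition k contains all data gathered
    through year k, so its size is k * yearly_growth_tb.
    """
    return sum(k * yearly_growth_tb for k in range(1, editions + 1))


if __name__ == "__main__":
    # Compare keeping only the latest edition with keeping all of them.
    for n in (1, 2, 5, 10):
        latest_only = n * 1.0
        print(n, latest_only, pyramid_storage(n, 1.0))
```

After ten editions at 1 TB/year the latest edition alone is 10 TB, while the full pyramid is 55 TB.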

8
The Year 5 Problem
  • Data arrives at R bytes/year
  • New Storage & Processing
  • Need to buy R units in year N
  • Data inflation means ~N²R
  • Need to buy NR units
  • Depreciate over 3 years
  • After year 3 need to buy N²R + (N-3)²R
  • Moore's law: 60%/year price decline
  • Capital expense peaks at year 5
  • See 6x Over-Power slide next
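
A rough model shows where a year-5 peak can come from. Everything here is an assumption made for illustration: the purchase in year N is taken as N²R, plus (N-3)²R of replacements once three-year-old gear is retired, and the unit price each year is taken as 0.6 times the previous year's price. The function names are hypothetical.

```python
def capital_expense(year: int, price_factor: float, rate: float = 1.0) -> float:
    """Relative cost of hardware bought in `year` (year 1 is the first year)."""
    new_units = year ** 2 * rate                   # inflated data volume, ~N^2 R
    replaced_units = max(0, year - 3) ** 2 * rate  # gear retired after 3 years
    unit_price = price_factor ** (year - 1)        # price relative to year 1
    return (new_units + replaced_units) * unit_price


def peak_year(price_factor: float, horizon: int = 15) -> int:
    """Year in 1..horizon with the largest capital expense."""
    return max(range(1, horizon + 1),
               key=lambda y: capital_expense(y, price_factor))


if __name__ == "__main__":
    for year in range(1, 9):
        print(year, round(capital_expense(year, 0.6), 3))
    print("peak year:", peak_year(0.6))
```

The result depends on how the price decline is read. A factor of 0.6 (or 1/1.6, if capacity per dollar grows 60% a year) puts the peak at year 5; a factor of 0.4 (a literal 60% drop each year) moves the peak to year 2 in this model.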

9
6x Over-Power Ratio
  • If you think you need X raw capacity, then you
    probably need 6X
  • Reprocessing
  • Backup copies
  • Versions
  • Hardware is cheap; your time is precious.

10
Data Loading
  • Data from outside
  • Is full of bugs
  • Is not in your format
  • Advice
  • Get it in a Universal Format (e.g. Unicode
    CSV)
  • Create Blood-Brain barrier: Quarantine in a
    load database
  • Scrub the data
  • Cross check everything you can
  • Check data statistics for sanity
  • Reject or repair bad data
  • Generate detailed bug reports (needed to send
    rejection upstream)
  • Expect to reload many times. Automate everything!
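
The quarantine-and-scrub advice can be sketched as a small loader. This is a minimal sketch, not the SDSS loader: the column names (obj_id, ra, dec), the range checks, and the report structure are all hypothetical, and it assumes one CSV row per line.

```python
import csv
import io
from dataclasses import dataclass, field

SAMPLE_CSV = "obj_id,ra,dec\n1,10.5,-3.2\n2,400.0,12.0\n3,abc,5.0\n"


@dataclass
class LoadReport:
    accepted: list = field(default_factory=list)  # clean rows, ready to publish
    rejected: list = field(default_factory=list)  # (line number, reason, raw row)


def scrub(csv_text: str) -> LoadReport:
    """Validate quarantined rows; only accepted rows would move on."""
    report = LoadReport()
    reader = csv.DictReader(io.StringIO(csv_text))
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        try:
            obj_id = row["obj_id"]
            ra = float(row["ra"])
            dec = float(row["dec"])
        except (KeyError, TypeError, ValueError):
            report.rejected.append(
                (line_no, "missing or non-numeric field", dict(row)))
            continue
        # Sanity check: coordinates must be physically possible.
        if not (0.0 <= ra < 360.0 and -90.0 <= dec <= 90.0):
            report.rejected.append((line_no, "ra/dec out of range", dict(row)))
            continue
        report.accepted.append({"obj_id": obj_id, "ra": ra, "dec": dec})
    return report


if __name__ == "__main__":
    result = scrub(SAMPLE_CSV)
    print(len(result.accepted), "accepted")
    for line_no, reason, row in result.rejected:
        print(f"line {line_no}: {reason}: {row}")  # bug report to send upstream
```

Because the function takes text and returns a report without side effects, it can be rerun on every reload.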

11
Performance Prediction & Regression
  • Database grows exponentially
  • Set up response-time requirements
  • For load
  • For access
  • Define a workload to measure each
  • Run it regularly to detect anomalies
  • SDSS uses
  • one week to reload
  • 20 queries with response of 10 sec to 10 min.
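
A regression workload of this kind can be sketched as a table of named queries with time limits. The harness below is hypothetical: the queries are stand-in Python callables, not real SDSS queries, and the limits are arbitrary.

```python
import time
from typing import Callable, Dict, Tuple

# name -> (query to run, response-time limit in seconds)
Workload = Dict[str, Tuple[Callable[[], object], float]]


def run_workload(workload: Workload) -> Dict[str, float]:
    """Run every query; return the elapsed time of each one over its limit."""
    violations = {}
    for name, (query, limit_s) in workload.items():
        start = time.perf_counter()
        query()
        elapsed = time.perf_counter() - start
        if elapsed > limit_s:
            violations[name] = elapsed
    return violations


if __name__ == "__main__":
    demo: Workload = {
        "fast_lookup": (lambda: sum(range(1000)), 1.0),
        "slow_scan": (lambda: time.sleep(0.05), 0.01),  # deliberately too slow
    }
    for name, elapsed in run_workload(demo).items():
        print(f"ANOMALY {name}: took {elapsed:.3f}s")
```

Running the same workload on a schedule and logging the elapsed times gives a baseline against which anomalies stand out as the database grows.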

12
Data Subsets For Science and Development
  • Offer 1GB, 10GB, …, Full subsets
  • Wonderful tool for you
  • Design & Debug
  • Good tool for scientists
  • Experiment on subset
  • Not for needle in haystack, but good for global
    stats
  • Challenge: How to make statistically valid subsets?
  • Seems domain specific
  • Seems problem specific
  • But, must be some general concepts.
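
One generic starting point, offered only as an assumption and not as the method used for SDSS, is deterministic hash-based sampling of object identifiers. It gives a uniform random sample that is reproducible and nested (the 1% subset is contained in the 10% subset). It does not by itself answer the statistical-validity challenge for spatially clustered or problem-specific data.

```python
import hashlib


def in_subset(object_id: str, fraction: float) -> bool:
    """Deterministically keep roughly `fraction` of all object ids."""
    digest = hashlib.sha256(object_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2 ** 64  # uniform in [0, 1)
    return bucket < fraction


if __name__ == "__main__":
    ids = [str(i) for i in range(100_000)]
    for fraction in (0.001, 0.01, 0.1):
        kept = sum(in_subset(i, fraction) for i in ids)
        print(f"{fraction:>6}: kept {kept} of {len(ids)}")
```

Because membership depends only on the id, every table keyed by that id can be subset consistently, so joins inside the subset still work.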

13
Data Curation: Problem Statement
  • Once published, scientific data needs to be
    available forever, so that the science can be
    reproduced/extended.
  • What does that mean?
  • Data can be characterized as
  • Primary Data: could not be reproduced
  • Derived data: could be derived from primary data.
  • Meta-data: how the data was collected/derived is
    primary
  • Must be preserved
  • Includes design docs, software, email, pubs,
    personal notes, teleconferences, …

NASA level 0
14
Schema (aka metadata)
  • Everyone starts with the same schema
    <stuff/>. Then they start arguing about semantics.
  • Virtual Observatory: http://www.ivoa.net/
  • Metadata based on Dublin Core:
    http://www.ivoa.net/Documents/latest/RM.html
  • Universal Content Descriptors (UCD):
    http://vizier.u-strasbg.fr/doc/UCD.htx
    Captures quantitative concepts and their units.
    Reduced from 100,000 tables in literature to 1,000
    terms
  • VOTable: a schema for answers to
    questions, http://www.us-vo.org/VOTable/
  • Common Queries: Cone Search and Simple Image
    Access Protocol, SQL
  • Registry: http://www.ivoa.net/Documents/latest/RMExp.html
    (still a work in progress).
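
As an illustration of the Cone Search query mentioned above: the request is a position on the sky plus a search radius, all in decimal degrees. The sketch below builds such a request URL using the RA, DEC and SR parameter names from the IVOA Simple Cone Search convention, and checks locally whether a point falls inside the cone. The service address is a placeholder, and the helper names are hypothetical.

```python
import math
from urllib.parse import urlencode


def cone_search_url(base_url: str, ra_deg: float, dec_deg: float,
                    radius_deg: float) -> str:
    """Build a cone search request (RA, DEC, SR in decimal degrees)."""
    query = urlencode({"RA": ra_deg, "DEC": dec_deg, "SR": radius_deg})
    separator = "&" if "?" in base_url else "?"
    return f"{base_url}{separator}{query}"


def angular_separation_deg(ra1: float, dec1: float,
                           ra2: float, dec2: float) -> float:
    """Great-circle separation via the haversine formula (degrees in and out)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(math.sqrt(a)))


def in_cone(ra: float, dec: float, center_ra: float, center_dec: float,
            radius_deg: float) -> bool:
    """True if (ra, dec) lies within radius_deg of the cone center."""
    return angular_separation_deg(ra, dec, center_ra, center_dec) <= radius_deg


if __name__ == "__main__":
    # example.org is a placeholder, not a real cone search service.
    print(cone_search_url("http://example.org/cone", 180.0, -0.5, 0.1))
    print(in_cone(180.05, -0.5, 180.0, -0.5, 0.1))
```

A real service would answer such a request with a VOTable listing the objects inside the cone.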

15
Archive Challenges
  • Cost of administering storage
  • Presently 10x to 100x the hardware cost.
  • Resist attack: geographic diversity
  • At 1GBps it takes 12 days to move a PB
  • Store it in two (or more) places online (on
    disk). A geo-plex
  • Scrub it continuously (look for errors)
  • On failure,
  • use other copy until failure repaired,
  • refresh lost copy from safe copy.
  • Can organize the copies differently (e.g.
    one by time, one by space)
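
Two of the points above can be made concrete. The first function checks the transfer arithmetic, assuming 1 PB = 10^15 bytes and reading "1GBps" as 10^9 bytes per second. The second sketches scrubbing a two-copy geo-plex with checksums and refreshing a damaged block from the good copy. The in-memory dictionaries stand in for real storage and all names are hypothetical.

```python
import hashlib


def transfer_days(num_bytes: float, bytes_per_second: float) -> float:
    """Days needed to move num_bytes at a sustained rate."""
    return num_bytes / bytes_per_second / 86_400


def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def scrub_and_repair(copy_a: dict, copy_b: dict, expected: dict) -> list:
    """Verify every block in both copies; refresh a bad block from the good copy.

    Returns a list of (block_id, repaired_copy) pairs.
    """
    repaired = []
    for block_id, digest in expected.items():
        ok_a = checksum(copy_a.get(block_id, b"")) == digest
        ok_b = checksum(copy_b.get(block_id, b"")) == digest
        if ok_a and not ok_b:
            copy_b[block_id] = copy_a[block_id]
            repaired.append((block_id, "b"))
        elif ok_b and not ok_a:
            copy_a[block_id] = copy_b[block_id]
            repaired.append((block_id, "a"))
        elif not ok_a and not ok_b:
            raise RuntimeError(f"block {block_id!r} is damaged in both copies")
    return repaired


if __name__ == "__main__":
    print(f"{transfer_days(1e15, 1e9):.1f} days to move 1 PB at 1 GB/s")
    site_a = {"b1": b"sky", "b2": b"data"}
    expected = {k: checksum(v) for k, v in site_a.items()}
    site_b = dict(site_a, b2=b"dxta")  # simulate silent corruption at site B
    print(scrub_and_repair(site_a, site_b, expected))
```

The arithmetic gives about 11.6 days, in line with the 12-day figure on the slide.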

16
References
  • http://SkyServer.SDSS.org/
  • http://research.microsoft.com/pubs/
  • http://research.microsoft.com/Gray/SDSS/ (download personal SkyServer)
  • Extending the SDSS Batch Query System to the
    National Virtual Observatory Grid, M. A.
    Nieto-Santisteban, W. O'Mullane, J. Gray, N. Li,
    T. Budavari, A. S. Szalay, A. R. Thakar,
    MSR-TR-2004-12, Feb. 2004
  • Scientific Data Federation, J. Gray, A. S.
    Szalay, The Grid 2: Blueprint for a New Computing
    Infrastructure, I. Foster, C. Kesselman, eds,
    Morgan Kaufmann, 2003, pp 95-108.
  • Data Mining the SDSS SkyServer Database, J.
    Gray, A. S. Szalay, A. Thakar, P. Kunszt, C.
    Stoughton, D. Slutz, J. vandenBerg, Distributed
    Data Structures 4: Records of the 4th
    International Meeting, pp 189-210, W. Litwin, G.
    Levy (eds), Carleton Scientific 2003, ISBN
    1-894145-13-5, also MSR-TR-2002-01, Jan. 2002
  • Petabyte Scale Data Mining: Dream or Reality?,
    Alexander S. Szalay, Jim Gray, Jan vandenBerg,
    SPIE Astronomy Telescopes and Instruments, 22-28
    August 2002, Waikoloa, Hawaii, MSR-TR-2002-84
  • Online Scientific Data Curation, Publication,
    and Archiving, J. Gray, A. S. Szalay, A. R.
    Thakar, C. Stoughton, J. vandenBerg, SPIE
    Astronomy Telescopes and Instruments, 22-28
    August 2002, Waikoloa, Hawaii, MSR-TR-2002-74
  • The World Wide Telescope: An Archetype for Online
    Science, J. Gray, A. Szalay, CACM, Vol. 45, No.
    11, pp 50-54, Nov. 2002, MSR-TR-2002-75
  • The SDSS SkyServer: Public Access To The Sloan
    Digital Sky Server Data, A. S. Szalay, J. Gray,
    A. Thakar, P. Z. Kunszt, T. Malik, J. Raddick, C.
    Stoughton, J. vandenBerg, ACM SIGMOD 2002,
    pp 570-581, MSR-TR-2001-104.
  • The World Wide Telescope, A. S. Szalay, J.
    Gray, Science, V.293, pp. 2037-2038, 14 Sept 2001,
    MSR-TR-2001-77
  • Designing & Mining Multi-Terabyte Astronomy
    Archives: Sloan Digital Sky Survey, A. Szalay,
    P. Kunszt, A. Thakar, J. Gray, D. Slutz, P.
    Kuntz, June 1999, ACM SIGMOD 2000, MSR-TR-99-30