Where the Rubber Meets the Sky: Giving Access to Science Data

Transcript and Presenter's Notes

1
Where the Rubber Meets the Sky: Giving Access to
Science Data
  • Jim Gray
  • Microsoft Research
  • Gray@Microsoft.com
  • http://research.Microsoft.com/Gray
  • Alex Szalay, Johns Hopkins University
  • Szalay@JHU.edu

2
New Science Paradigms
  • A thousand years ago: science was empirical
  • describing natural phenomena
  • Last few hundred years: a theoretical branch
  • using models, generalizations
  • Last few decades: a computational branch
  • simulating complex phenomena
  • Today: data exploration (eScience)
  • unify theory, experiment, and simulation
  • using data management and statistics
  • Data captured by instruments or generated by
    simulator
  • Processed by software
  • Scientist analyzes database / files

3
The Virtual Observatory
  • Premise: most data is (or could be) online
  • The Internet is the world's best telescope:
  • It has data on every part of the sky
  • In every measured spectral band: optical, x-ray,
    radio...
  • As deep as the best instruments (2 years ago).
  • It is up when you are up
  • The seeing is always great
  • It's a smart telescope: links objects and
    data to literature
  • Software is the capital expense
  • Share, standardize, reuse...

4
The Big Picture
The Big Problems
  • Data ingest
  • Managing a petabyte
  • Common schema
  • How to organize it?
  • How to reorganize it?
  • How to coexist with others?
  • Data Query and Visualization tools
  • Support/training
  • Performance
  • Execute queries in a minute
  • Batch (big) query scheduling

5
What X-info Needs from Us (CS) (not drawn to scale)

6
Data Access Hitting a Wall
  • Current science practice is based on data download
    (FTP/GREP). It will not scale to the datasets of
    tomorrow
  • You can GREP 1 MB in a second
  • You can GREP 1 GB in a minute
  • You can GREP 1 TB in 2 days
  • You can GREP 1 PB in 3 years
  • Oh, and 1 PB is about 5,000 disks
  • At some point you need indices to limit
    search, and parallel data search and analysis
  • This is where databases can help
  • You can FTP 1 MB in 1 sec
  • You can FTP 1 GB / min (about $1)
  • 1 TB: 2 days and $1K
  • 1 PB: 3 years and $1M
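
The GREP figures on this slide imply a sequential scan rate of only a few MB/s, which is why indices and parallel search matter. A rough back-of-envelope sketch follows. It assumes decimal units (1 PB = 10^15 bytes), a 365-day year, and that the 5,000 disks can be scanned perfectly in parallel; that last point is an idealization and is not claimed on the slide. The function names are illustrative only.

```python
SECONDS_PER_DAY = 86_400

# (dataset size in bytes, scan time in seconds) as quoted on the slide
SLIDE_FIGURES = {
    "1 MB": (1e6, 1),
    "1 GB": (1e9, 60),
    "1 TB": (1e12, 2 * SECONDS_PER_DAY),
    "1 PB": (1e15, 3 * 365 * SECONDS_PER_DAY),
}


def implied_rate_mb_per_s(size_bytes, seconds):
    """Throughput in MB/s implied by scanning size_bytes in the given time."""
    return size_bytes / seconds / 1e6


def parallel_scan_hours(seconds_sequential, n_disks):
    """Idealized scan time in hours if the work splits evenly over n_disks."""
    return seconds_sequential / n_disks / 3600


for label, (size, secs) in SLIDE_FIGURES.items():
    print(f"{label}: {implied_rate_mb_per_s(size, secs):.1f} MB/s")

pb_seconds = SLIDE_FIGURES["1 PB"][1]
print(f"1 PB over 5,000 disks: {parallel_scan_hours(pb_seconds, 5000):.1f} hours")
```

Under these assumptions the three-year sequential scan of a petabyte shrinks to a matter of hours when every disk is read at once, which is the case for parallel search made on the slide.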

7
Next-Generation Data Analysis
  • Looking for:
  • Needles in haystacks: the Higgs particle
  • Haystacks: dark matter, dark energy, turbulence,
    ecosystem dynamics
  • Needles are easier than haystacks
  • Global statistics have poor scaling
  • Correlation functions are N^2, likelihood
    techniques N^3
  • As data and computers grow at Moore's Law, we
    can only keep up with N log N
  • A way out?
  • Relax the notion of optimal (data is fuzzy, answers
    are approximate)
  • Don't assume infinite computational resources or
    memory
  • Requires a combination of statistics and computer
    science
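
The scaling argument can be made concrete with a small calculation. The sketch below assumes a simplified model that is not stated on the slide: data volume and compute speed both double every period, and runtime is cost(N) divided by compute speed. The starting size of one million objects and the ten doublings are arbitrary illustrative choices.

```python
import math


def relative_runtime(cost, n0, doublings):
    """Runtime after `doublings` periods, relative to today, when both the
    data size and the compute speed double every period."""
    growth = 2 ** doublings
    return cost(n0 * growth) / growth / cost(n0)


COSTS = {
    "N log N": lambda n: n * math.log2(n),
    "N^2": lambda n: n ** 2,
    "N^3": lambda n: n ** 3,
}

for name, cost in COSTS.items():
    ratio = relative_runtime(cost, n0=10 ** 6, doublings=10)
    print(f"{name}: runtime grows by a factor of {ratio:.2f}")
```

In this model an N log N algorithm slows down only by the slowly growing log factor, while an N^2 algorithm gets twice as slow every period and an N^3 algorithm four times as slow, even though the hardware keeps improving.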

8
Smart Data: Unifying DB and Analysis
  • There is too much data to move around: do data
    manipulations at the database
  • Build custom procedures and functions into DB
  • Unify data access and analysis
  • Examples:
  • Statistical sampling and analysis
  • Temporal and spatial indexing
  • Pixel processing
  • Automatic parallelism
  • Auto (re)organize
  • Scalable to Petabyte datasets

Move Mohamed to the mountain, not the mountain to
Mohamed.
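
As a toy illustration of building a custom function into the database, the sketch below uses Python's built-in sqlite3 module. The table `photo`, its columns, and the `color_cut` function are hypothetical and chosen only for this example; the slides do not name a database engine or schema here, and a production archive would run such functions inside a server rather than in an in-process library.

```python
import sqlite3


def color_cut(u, g):
    """Hypothetical selection function: 1 if the object is red in u-g."""
    return 1 if (u - g) > 1.0 else 0


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE photo (obj_id INTEGER PRIMARY KEY, u REAL, g REAL)")
conn.executemany(
    "INSERT INTO photo (obj_id, u, g) VALUES (?, ?, ?)",
    [(1, 20.0, 18.5), (2, 19.0, 18.8), (3, 21.2, 19.9), (4, 18.0, 18.1)],
)

# Register the Python function so SQL statements can call it by name.
conn.create_function("color_cut", 2, color_cut)

# The filter runs where the rows live; only one number comes back.
(n_red,) = conn.execute(
    "SELECT COUNT(*) FROM photo WHERE color_cut(u, g) = 1"
).fetchone()
print(n_red)
```

The design point is that the client receives a single count instead of downloading the table and filtering it locally.
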
9
Experiment Budgets: ¼ to ½ Software
  • Millions of lines of code
  • Repeated for experiment after experiment
  • Not much sharing or learning
  • Let's work to change this
  • Identify generic tools:
  • Workflow schedulers
  • Databases and libraries
  • Analysis packages
  • Visualizers
  • Software for:
  • Instrument scheduling
  • Instrument control
  • Data gathering
  • Data reduction
  • Database
  • Analysis
  • Visualization

Simulations (computational science) are > ½
software

10
How to Help?
  • Can't learn the discipline before you
    start (takes 4 years)
  • Can't go native: you are a CS person, not a
    bio person
  • Have to learn how to communicate: have to learn
    the language
  • Have to form a working relationship with domain
    expert(s)
  • Have to find problems that leverage your skills

11
Working Cross-Culture: A Way to Engage With
Domain Scientists
  • Find someone who is desperate for help
  • Communicate in terms of scenarios
  • Work on a problem that gives 100x benefit
  • Weeks/task vs hours/task
  • Solve 20% of the problem
  • The other 80% will take decades
  • Prototype
  • Go from working-to-working; always have:
  • Something to show
  • Clear next steps
  • Clear goal
  • Avoid death-by-collaboration-meetings.

12
Working Cross-Culture -- 20 Questions: A Way to
Engage With Domain Scientists
  • Astronomers proposed 20 questions
  • Typical of things they want to do
  • Each would require a week or more the old way
    (programming in Tcl / C / FTP)
  • Goal: make it easy to answer questions
  • This goal motivates DB and tools design

13
The 20 Queries
  • Q1: Find all galaxies without unsaturated pixels
    within 1' of a given point of ra=75.327,
    dec=21.023.
  • Q2: Find all galaxies with blue surface
    brightness between 23 and 25 mag per square
    arcsecond, and -10<supergalactic latitude (sgb)<10,
    and declination less than zero.
  • Q3: Find all galaxies brighter than magnitude 22,
    where the local extinction is >0.75.
  • Q4: Find galaxies with an isophotal surface
    brightness (SB) larger than 24 in the red band,
    with an ellipticity>0.5, and with the major axis
    of the ellipse having a declination of between
    30 and 60 arcseconds.
  • Q5: Find all galaxies with a de Vaucouleurs
    profile (r^¼ falloff of intensity on disk) and the
    photometric colors consistent with an elliptical
    galaxy. The de Vaucouleurs profile
  • Q6: Find galaxies that are blended with a star,
    output the deblended galaxy magnitudes.
  • Q7: Provide a list of star-like objects that are
    1% rare.
  • Q8: Find all objects with unclassified spectra.
  • Q9: Find quasars with a line width >2000 km/s and
    2.5<redshift<2.7.
  • Q10: Find galaxies with spectra that have an
    equivalent width in Hα >40Å (Hα is the main
    hydrogen spectral line.)
  • Q11: Find all elliptical galaxies with spectra
    that have an anomalous emission line.
  • Q12: Create a gridded count of galaxies with u-g>1
    and r<21.5 over 60<declination<70, and 200<right
    ascension<210, on a grid of 2', and create a map
    of masks over the same grid.
  • Q13: Create a count of galaxies for each of the
    HTM triangles which satisfy a certain color cut,
    like 0.7u-0.5g-0.2i<1.25 and r<21.75, output it in
    a form adequate for visualization.
  • Q14: Find stars with multiple measurements that
    have magnitude variations >0.1. Scan for stars
    that have a secondary object (observed at a
    different time) and compare their magnitudes.
  • Q15: Provide a list of moving objects consistent
    with an asteroid.
  • Q16: Find all objects similar to the colors of a
    quasar at 5.5<redshift<6.5.
  • Q17: Find binary stars where at least one of them
    has the colors of a white dwarf.
  • Q18: Find all objects within 30 arcseconds of one
    another that have very similar colors: that is,
    where the color ratios u-g, g-r, r-I are less
    than 0.05m.
  • Q19: Find quasars with a broad absorption line in
    their spectra and at least one galaxy within 10
    arcseconds. Return both the quasars and the
    galaxies.
  • Q20: For each galaxy in the BCG data set
    (brightest color galaxy), in 160<right
    ascension<170, -25<declination<35, count the
    galaxies within 30" of it that have a photoz
    within 0.05 of that galaxy.

Also some good queries at
http://www.sdss.jhu.edu/ScienceArchive/sxqt/sxQT/Example_Queries.html
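
To show how one of these questions might turn into a database query, here is a minimal sketch of the gridded count in Q12, run against a toy in-memory SQLite table. The table `galaxy`, its column names, and the sample rows are invented for illustration and are not the SDSS schema. The sketch assumes a 2-arcminute cell (1/30 of a degree), bins right ascension and declination directly, ignores the cos(declination) shrinkage of RA cells, and omits the mask map.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE galaxy (ra REAL, decl REAL, u REAL, g REAL, r REAL)")
conn.executemany(
    "INSERT INTO galaxy VALUES (?, ?, ?, ?, ?)",
    [
        (200.01, 60.01, 20.0, 18.5, 20.0),  # passes both cuts
        (200.02, 60.02, 20.0, 18.0, 21.0),  # passes both cuts
        (200.05, 60.01, 19.5, 18.0, 21.0),  # passes, different RA cell
        (200.01, 60.01, 18.2, 18.0, 20.0),  # fails u-g > 1
        (200.01, 60.01, 20.0, 18.0, 22.0),  # fails r < 21.5
        (150.00, 60.01, 20.0, 18.0, 20.0),  # outside the RA range
    ],
)

CELLS_PER_DEGREE = 30  # assumed cell size: 2 arcminutes = 1/30 degree

grid = conn.execute(
    """
    SELECT CAST((ra - 200) * :k AS INTEGER)  AS ra_cell,
           CAST((decl - 60) * :k AS INTEGER) AS dec_cell,
           COUNT(*)                          AS n
    FROM galaxy
    WHERE u - g > 1 AND r < 21.5
      AND decl > 60 AND decl < 70
      AND ra > 200 AND ra < 210
    GROUP BY ra_cell, dec_cell
    ORDER BY ra_cell, dec_cell
    """,
    {"k": CELLS_PER_DEGREE},
).fetchall()
print(grid)
```

Each output row is (RA cell index, declination cell index, galaxy count), which is already in a shape a plotting tool could consume.
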
14
SkyQuery (http://skyquery.net/)
  • Distributed query tool using a set of web
    services
  • Many astronomy archives from Pasadena, Chicago,
    Baltimore, Cambridge (England)
  • Has grown from 4 to 15 archives, now becoming
    an international standard
  • Allows queries like:

SELECT o.objId, o.r, o.type, t.objId
FROM SDSS:PhotoPrimary o, TWOMASS:PhotoPrimary t
WHERE XMATCH(o,t) < 3.5
  AND AREA(181.3, -0.76, 6.5)
  AND o.type = 3 AND (o.I - t.m_j) > 2
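
The core idea behind a cross-match is pairing objects from two catalogs that lie close together on the sky. The sketch below is a deliberately naive stand-in, not SkyQuery's algorithm: the slide does not define how XMATCH scores a pair or what units its threshold uses, so this example simply assumes a plain angular-distance cut in arcseconds and a brute-force comparison. All catalog entries and function names are made up.

```python
import math


def angular_sep_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation in arcseconds (haversine); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(math.sqrt(a))) * 3600


def cross_match(cat_a, cat_b, max_sep_arcsec):
    """Brute-force match of (id, ra, dec) tuples; cost is len(a) * len(b)."""
    return [
        (id_a, id_b)
        for id_a, ra_a, dec_a in cat_a
        for id_b, ra_b, dec_b in cat_b
        if angular_sep_arcsec(ra_a, dec_a, ra_b, dec_b) < max_sep_arcsec
    ]


# Hypothetical entries near the AREA centre used in the query above.
sdss = [("s1", 181.3, -0.76), ("s2", 181.4, -0.70)]
twomass = [("t1", 181.3, -0.76 + 1 / 3600), ("t2", 182.0, -0.50)]

pairs = cross_match(sdss, twomass, max_sep_arcsec=3.5)
print(pairs)
```

The nested loop is quadratic in catalog size, so a real service would need spatial indexing to avoid comparing every pair.
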
15
SkyQuery Structure
  • Portal:
  • Plans query (2 phase)
  • Integrates answers
  • Is itself a web service
  • Each SkyNode publishes:
  • Schema Web Service
  • Database Web Service
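
The division of labour between portal and nodes can be sketched in a few lines. Everything here is hypothetical: the class and method names are invented, plain Python objects stand in for web services, and the two phases are one plausible reading of "2 phase" (first ask every node for a cheap match count, then query the nodes in increasing order of that count). The slide does not spell out the actual protocol.

```python
from dataclasses import dataclass, field


@dataclass
class SkyNode:
    """Hypothetical stand-in for one archive's pair of web services."""
    name: str
    columns: list
    rows: list = field(default_factory=list)

    def schema(self):
        # Stands in for the Schema Web Service.
        return {"archive": self.name, "columns": self.columns}

    def count(self, predicate):
        # Phase 1: cheap estimate of how many rows would match.
        return sum(1 for row in self.rows if predicate(row))

    def query(self, predicate):
        # Phase 2: stands in for the Database Web Service.
        return [row for row in self.rows if predicate(row)]


class Portal:
    def __init__(self, nodes):
        self.nodes = nodes

    def plan(self, predicate):
        # Visit the node with the fewest matching rows first.
        return sorted(self.nodes, key=lambda node: node.count(predicate))

    def run(self, predicate):
        # Integrate the per-node answers into one result list.
        return [
            (node.name, row)
            for node in self.plan(predicate)
            for row in node.query(predicate)
        ]


node_a = SkyNode("A", ["id", "r"], [{"id": 1, "r": 19.0}, {"id": 2, "r": 19.5}, {"id": 3, "r": 21.0}])
node_b = SkyNode("B", ["id", "r"], [{"id": 7, "r": 18.0}])
portal = Portal([node_a, node_b])


def is_bright(row):
    return row["r"] < 20


result = portal.run(is_bright)
print(result)
```

Because the portal exposes the same kind of interface it consumes, it could itself be wrapped as a node, which mirrors the point that the portal is itself a web service.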

16
MyDB: eScience Workbench
  • Prototype of bringing analysis to the data
  • Everybody gets a workspace (database)
  • Executes analysis at the data
  • Store intermediate results there
  • Long queries run in batch
  • Results shared within groups
  • Only fetch the final results
  • Extremely successful: matches work patterns
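
The workspace pattern can be imitated on a small scale with SQLite's ATTACH DATABASE, as in the sketch below. This is an analogy under stated assumptions, not how MyDB is implemented: the main in-memory database plays the shared archive, a second attached database named `mydb` plays the user's workspace, and the `photo` table and its rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # plays the role of the shared archive

# A second, separate database attached as the user's private workspace.
conn.execute("ATTACH DATABASE ':memory:' AS mydb")

conn.execute("CREATE TABLE photo (obj_id INTEGER, r REAL, kind TEXT)")
conn.executemany(
    "INSERT INTO photo VALUES (?, ?, ?)",
    [(1, 19.2, "galaxy"), (2, 22.4, "galaxy"), (3, 18.7, "star"), (4, 20.9, "galaxy")],
)
conn.commit()

# The intermediate result is stored next to the data, not downloaded.
conn.execute(
    "CREATE TABLE mydb.bright_galaxies AS "
    "SELECT obj_id, r FROM main.photo WHERE kind = 'galaxy' AND r < 21"
)

# Only the final, small answer is fetched by the client.
count, mean_r = conn.execute(
    "SELECT COUNT(*), AVG(r) FROM mydb.bright_galaxies"
).fetchone()
print(count, mean_r)
```

Later queries can keep refining `mydb.bright_galaxies` in place, so the large table is read once and the client only ever sees summary rows.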