Title: Where The Rubber Meets the Sky: Giving Access to Science Data
1Where The Rubber Meets the Sky: Giving Access to Science Data
- Jim Gray
- Microsoft Research
- Gray_at_Microsoft.com
- http://research.Microsoft.com/Gray
- Alex Szalay
- Johns Hopkins University
- Szalay_at_JHU.edu
2New Science Paradigms
- Thousand years ago: science was empirical
- describing natural phenomena
- Last few hundred years: theoretical branch
- using models, generalizations
- Last few decades: a computational branch
- simulating complex phenomena
- Today: data exploration (eScience)
- unify theory, experiment, and simulation
- using data management and statistics
- Data captured by instruments or generated by simulator
- Processed by software
- Scientist analyzes database / files
3The Virtual Observatory
- Premise: most data is (or could be) online
- The Internet is the world's best telescope
- It has data on every part of the sky
- In every measured spectral band: optical, x-ray, radio...
- As deep as the best instruments (2 years ago)
- It is up when you are up
- The seeing is always great
- It's a smart telescope: links objects and data to literature
- Software is the capital expense
- Share, standardize, reuse...
4The Big Picture
The Big Problems
- Data ingest
- Managing a petabyte
- Common schema
- How to organize it?
- How to reorganize it?
- How to coexist with others?
- Data Query and Visualization tools
- Support/training
- Performance
- Execute queries in a minute
- Batch (big) query scheduling
5What X-info Needs from Us (CS) (not drawn to scale)
6Data Access: Hitting a Wall
- Current science practice is based on data download (FTP/GREP)
- Will not scale to the datasets of tomorrow
- You can GREP 1 MB in a second
- You can GREP 1 GB in a minute
- You can GREP 1 TB in 2 days
- You can GREP 1 PB in 3 years
- Oh!, and 1 PB is about 5,000 disks
- At some point you need indices to limit search, plus parallel data search and analysis
- This is where databases can help
- You can FTP 1 MB in 1 sec
- You can FTP 1 GB / min (about $1)
- 1 TB takes 2 days and $1K
- 1 PB takes 3 years and $1M
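The GREP figures above are order-of-magnitude estimates, and the arithmetic behind them can be sketched in a few lines. The read rate below is an assumption chosen for illustration (the slide's own numbers imply rates between roughly 1 and 17 MB/s); the point is only that full-scan time grows linearly with data size.

```python
# Back-of-the-envelope sequential scan times.
# ASSUMPTION: one sustained read rate of 10 MB/s for every dataset size.
SCAN_RATE_BYTES_PER_SEC = 10e6

SIZES = {"1 MB": 1e6, "1 GB": 1e9, "1 TB": 1e12, "1 PB": 1e15}


def scan_seconds(size_bytes, rate=SCAN_RATE_BYTES_PER_SEC):
    """Time to read every byte once at a fixed rate."""
    return size_bytes / rate


def humanize(seconds):
    """Render a duration in the largest unit that fits."""
    for unit, length in (("years", 365 * 86400), ("days", 86400), ("minutes", 60)):
        if seconds >= length:
            return f"{seconds / length:.1f} {unit}"
    return f"{seconds:.1f} seconds"


for label, size in SIZES.items():
    print(f"{label}: {humanize(scan_seconds(size))}")
```

At this assumed rate a petabyte scan lands at about three years, which is why the slide argues for indices and parallel search rather than faster GREP.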
7Next-Generation Data Analysis
- Looking for
- Needles in haystacks: the Higgs particle
- Haystacks: dark matter, dark energy, turbulence, ecosystem dynamics
- Needles are easier than haystacks
- Global statistics have poor scaling
- Correlation functions are N^2, likelihood techniques N^3
- As data and computers grow at Moore's Law, we can only keep up with N log N
- A way out?
- Relax the notion of optimal (data is fuzzy, answers are approximate)
- Don't assume infinite computational resources or memory
- Requires a combination of statistics and computer science
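The scaling gap can be illustrated with a toy pair count, the core operation of a two-point correlation function. This is a one-dimensional sketch with made-up integer positions, not an astronomy-grade estimator: it only shows the same answer computed by an all-pairs loop and by sorting plus binary search.

```python
import bisect
import random


def pairs_naive(xs, r):
    """Count pairs with |xi - xj| <= r by testing every pair: O(N^2)."""
    n = len(xs)
    return sum(1 for i in range(n) for j in range(i + 1, n)
               if abs(xs[i] - xs[j]) <= r)


def pairs_sorted(xs, r):
    """Same count via one sort plus a binary search per point: O(N log N)."""
    s = sorted(xs)
    total = 0
    for i, x in enumerate(s):
        # Index just past the last value <= x + r; every element between
        # i + 1 and that index is a partner of x, counted exactly once.
        hi = bisect.bisect_right(s, x + r)
        total += hi - (i + 1)
    return total


random.seed(0)
# Hypothetical data: 500 integer positions on a line.
positions = [random.randrange(10_000) for _ in range(500)]
assert pairs_naive(positions, 25) == pairs_sorted(positions, 25)
```

In two or three dimensions the sort is typically replaced by a spatial index such as a grid or tree, which is where the spatial indexing mentioned later in the talk comes in.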
8Smart Data Unifying DB and Analysis
- There is too much data to move around
- Do data manipulations at the database
- Build custom procedures and functions into the DB
- Unify data access and analysis
- Examples
- Statistical sampling and analysis
- Temporal and spatial indexing
- Pixel processing
- Automatic parallelism
- Auto (re)organize
- Scalable to Petabyte datasets
Move Mohamed to the mountain, not the mountain to Mohamed.
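A minimal sketch of "custom functions in the DB", using Python's built-in sqlite3 module. The table, its columns, and the `is_red` color cut are all hypothetical, invented for illustration. SQLite is embedded in the client process, so this shows only the pattern; with a server database the function would run next to the data and only the aggregate would cross the network.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical photometry table with three magnitudes per object.
conn.execute("CREATE TABLE photo (obj_id INTEGER PRIMARY KEY, u REAL, g REAL, r REAL)")
conn.executemany("INSERT INTO photo VALUES (?, ?, ?, ?)", [
    (1, 19.5, 18.2, 17.6),
    (2, 21.0, 20.6, 20.4),
    (3, 18.9, 17.5, 16.9),
    (4, 22.3, 22.1, 21.9),
])


def is_red(u, g):
    """Hypothetical color cut: 1 if u - g > 1, else 0."""
    return 1 if (u - g) > 1.0 else 0


# Register the Python function so SQL statements can call it by name.
conn.create_function("is_red", 2, is_red)

# The filter and the aggregate both run inside the query; the caller
# receives one small row instead of the whole table.
count, mean_r = conn.execute(
    "SELECT COUNT(*), AVG(r) FROM photo WHERE is_red(u, g) = 1"
).fetchone()
print(count, mean_r)
```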
9Experiment Budgets: ¼ to ½ Software
- Millions of lines of code
- Repeated for experiment after experiment
- Not much sharing or learning
- Let's work to change this
- Identify generic tools
- Workflow schedulers
- Databases and libraries
- Analysis packages
- Visualizers
- Software for
- Instrument scheduling
- Instrument control
- Data gathering
- Data reduction
- Database
- Analysis
- Visualization
Simulation (computational science) is > ½ software
10How to Help?
- Can't learn the discipline before you start (takes 4 years)
- Can't go native: you are a CS person, not a bio person
- Have to learn how to communicate
- Have to learn the language
- Have to form a working relationship with domain expert(s)
- Have to find problems that leverage your skills
11Working Cross-Culture: A Way to Engage With Domain Scientists
- Find someone who is desperate for help
- Communicate in terms of scenarios
- Work on a problem that gives 100x benefit
- Weeks/task vs hours/task
- Solve 20% of the problem
- The other 80% will take decades
- Prototype
- Go from working-to-working; always have:
- Something to show
- Clear next steps
- Clear goal
- Avoid death-by-collaboration-meetings.
12Working Cross-Culture -- 20 Questions: A Way to Engage With Domain Scientists
- Astronomers proposed 20 questions
- Typical of things they want to do
- Each would require a week or more the old way (programming in tcl / C / FTP)
- Goal: make it easy to answer questions
- This goal motivates DB and tools design
13The 20 Queries
- Q1: Find all galaxies without unsaturated pixels within 1' of a given point of ra=75.327, dec=21.023.
- Q2: Find all galaxies with blue surface brightness between 23 and 25 mag per square arcsecond, and -10 < super galactic latitude (sgb) < 10, and declination less than zero.
- Q3: Find all galaxies brighter than magnitude 22, where the local extinction is > 0.75.
- Q4: Find galaxies with an isophotal surface brightness (SB) larger than 24 in the red band, with an ellipticity > 0.5, and with the major axis of the ellipse having a declination of between 30 and 60 arc seconds.
- Q5: Find all galaxies with a de Vaucouleurs profile (r^(1/4) falloff of intensity on disk) and the photometric colors consistent with an elliptical galaxy.
- Q6: Find galaxies that are blended with a star, output the deblended galaxy magnitudes.
- Q7: Provide a list of star-like objects that are 1% rare.
- Q8: Find all objects with unclassified spectra.
- Q9: Find quasars with a line width > 2000 km/s and 2.5 < redshift < 2.7.
- Q10: Find galaxies with spectra that have an equivalent width in Hα > 40 Å (Hα is the main hydrogen spectral line).
- Q11: Find all elliptical galaxies with spectra that have an anomalous emission line.
- Q12: Create a gridded count of galaxies with u-g > 1 and r < 21.5 over 60 < declination < 70, and 200 < right ascension < 210, on a grid of 2', and create a map of masks over the same grid.
- Q13: Create a count of galaxies for each of the HTM triangles which satisfy a certain color cut, like 0.7u-0.5g-0.2i < 1.25 and r < 21.75, output it in a form adequate for visualization.
- Q14: Find stars with multiple measurements that have magnitude variations > 0.1. Scan for stars that have a secondary object (observed at a different time) and compare their magnitudes.
- Q15: Provide a list of moving objects consistent with an asteroid.
- Q16: Find all objects similar to the colors of a quasar at 5.5 < redshift < 6.5.
- Q17: Find binary stars where at least one of them has the colors of a white dwarf.
- Q18: Find all objects within 30 arcseconds of one another that have very similar colors: that is, where the color ratios u-g, g-r, r-I are less than 0.05m.
- Q19: Find quasars with a broad absorption line in their spectra and at least one galaxy within 10 arcseconds. Return both the quasars and the galaxies.
- Q20: For each galaxy in the BCG data set (brightest color galaxy), in 160 < right ascension < 170, -25 < declination < 35, count the galaxies within 30" of it that have a photoz within 0.05 of that galaxy.
Also some good queries at http://www.sdss.jhu.edu/ScienceArchive/sxqt/sxQT/Example_Queries.html
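To show how one of these questions becomes a short database query instead of a week of programming, here is a sketch of the gridded count in Q12 against a toy table. The `galaxy` schema, column names, and sample rows are hypothetical, and the grid is laid out in flat (ra, dec) degrees, ignoring the cos(dec) shrinkage of right ascension cells that a real sky grid would need.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: position in degrees plus u, g, r magnitudes.
conn.execute(
    "CREATE TABLE galaxy (obj_id INTEGER, ra REAL, decl REAL, u REAL, g REAL, r REAL)")
conn.executemany("INSERT INTO galaxy VALUES (?, ?, ?, ?, ?, ?)", [
    (1, 200.01, 60.01, 20.5, 19.0, 18.5),  # passes, first cell
    (2, 200.02, 60.02, 21.0, 19.5, 19.0),  # passes, same cell
    (3, 205.51, 65.51, 20.0, 18.5, 18.0),  # passes, another cell
    (4, 205.51, 65.51, 19.0, 18.8, 18.0),  # fails u - g > 1
    (5, 150.00, 65.00, 20.5, 19.0, 18.5),  # outside the RA range
    (6, 205.00, 65.00, 24.0, 22.5, 22.0),  # fails r < 21.5
])

CELL_DEG = 2.0 / 60.0  # grid spacing of 2 arcminutes, in degrees

query = """
SELECT CAST((ra - 200.0) / :cell AS INTEGER)  AS ra_cell,
       CAST((decl - 60.0) / :cell AS INTEGER) AS dec_cell,
       COUNT(*) AS n
FROM galaxy
WHERE u - g > 1 AND r < 21.5
  AND decl > 60 AND decl < 70
  AND ra > 200 AND ra < 210
GROUP BY ra_cell, dec_cell
ORDER BY ra_cell, dec_cell
"""
grid = conn.execute(query, {"cell": CELL_DEG}).fetchall()
for ra_cell, dec_cell, n in grid:
    print(ra_cell, dec_cell, n)
```

The color cut, the sky window, and the binning are all expressed declaratively, so the database is free to use whatever indices it has on position and magnitude.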
14SkyQuery (http://skyquery.net/)
- Distributed query tool using a set of web services
- Many astronomy archives from Pasadena, Chicago, Baltimore, Cambridge (England)
- Has grown from 4 to 15 archives, now becoming an international standard
- Allows queries like
SELECT o.objId, o.r, o.type, t.objId
FROM SDSS:PhotoPrimary o, TWOMASS:PhotoPrimary t
WHERE XMATCH(o,t) < 3.5 AND AREA(181.3,-0.76,6.5)
AND o.type = 3 AND (o.I - t.m_j) > 2
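The idea of matching objects across two archives can be sketched as follows. The slide does not define XMATCH, so this example assumes the simplest possible stand-in, a plain angular-separation threshold in arcseconds, which may differ from what SkyQuery actually computes. The catalog contents and identifiers are made up.

```python
import math


def angular_sep_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation in arcseconds (haversine); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(math.sqrt(h))) * 3600


def cross_match(cat_a, cat_b, max_sep_arcsec):
    """Return (id_a, id_b) pairs closer than the threshold.

    Brute force over all pairs; a real service would use a spatial index.
    """
    return [(id_a, id_b)
            for id_a, ra_a, dec_a in cat_a
            for id_b, ra_b, dec_b in cat_b
            if angular_sep_arcsec(ra_a, dec_a, ra_b, dec_b) < max_sep_arcsec]


# Hypothetical catalogs: (identifier, ra in degrees, dec in degrees).
sdss = [("s1", 181.3000, -0.7600), ("s2", 181.3500, -0.7000)]
twomass = [("t1", 181.3005, -0.7602), ("t2", 181.4000, -0.7000)]

matches = cross_match(sdss, twomass, 3.5)
print(matches)
```

In a distributed setting each catalog lives at a different site, so the interesting design question is which side ships its candidate objects to the other; the brute-force loop here sidesteps that entirely.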
15SkyQuery Structure
- Portal is
- Plans Query (2 phase)
- Integrates answers
- Is itself a web service
- Each SkyNode publishes
- Schema Web Service
- Database Web Service
16MyDB: eScience Workbench
- Prototype of bringing analysis to the data
- Everybody gets a workspace (database)
- Executes analysis at the data
- Store intermediate results there
- Long queries run in batch
- Results shared within groups
- Only fetch the final results
- Extremely successful: matches work patterns
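The workspace pattern can be sketched with two SQLite databases on one connection: `main` stands in for the shared archive and an attached database named `mydb` stands in for the user's private workspace. The table names, columns, and the use of `type = 3` as a galaxy flag are assumptions made for illustration, not the actual MyDB implementation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")                   # stand-in for the shared archive
conn.execute("ATTACH DATABASE ':memory:' AS mydb")   # the user's private workspace

# Hypothetical archive table.
conn.execute("CREATE TABLE main.photo (obj_id INTEGER, type INTEGER, r REAL)")
conn.executemany("INSERT INTO main.photo VALUES (?, ?, ?)",
                 [(1, 3, 17.2), (2, 3, 21.9), (3, 6, 16.0), (4, 3, 18.4)])

# Step 1: the intermediate result is stored in the workspace, next to the data.
conn.execute("""
    CREATE TABLE mydb.bright_galaxies AS
    SELECT obj_id, r FROM main.photo WHERE type = 3 AND r < 20
""")

# Step 2: later queries build on the stored table, and only the small
# final answer is fetched by the client.
n, brightest = conn.execute(
    "SELECT COUNT(*), MIN(r) FROM mydb.bright_galaxies").fetchone()
print(n, brightest)
```

Because the intermediate table persists in the workspace, a long first step can run in batch while follow-up queries stay interactive.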