Mining the Sky: The World-Wide Telescope

1
Mining the Sky: The World-Wide Telescope
  • Jim Gray
  • Microsoft Research
  • Collaborating with
  • Alex Szalay, Peter Kunszt, Ani Thakar @ JHU
  • Robert Brunner, Roy Williams @ Caltech
  • George Djorgovski, Julian Bunn @ Caltech

2
Outline
  • The revolution in Computational Science
  • The Virtual Observatory Concept
  • World-Wide Telescope
  • The Sloan Digital Sky Survey DB technology

3
Computational Science: The Third Science Branch
Is Evolving
  • In the beginning science was empirical.
  • Then theoretical branches evolved.
  • Now, we have computational branches.
  • Has primarily been simulation
  • Growth area: data analysis/visualization of
    peta-scale instrument data.
  • Analysis & visualization tools
  • Help both simulation and instruments.
  • Are primitive today.

4
Computational Science
  • Traditional Empirical Science
  • Scientist gathers data by direct observation
  • Scientist analyzes data
  • Computational Science
  • Data captured by instruments, or data generated by
    simulator
  • Processed by software
  • Placed in a database / files
  • Scientist analyzes database / files

5
Exploring Parameter Space: Manual or Automatic
Data Mining
  • There is LOTS of data;
  • people cannot examine most of it.
  • Need computers to do analysis.
  • Manual or Automatic Exploration
  • Manual: person suggests hypothesis, computer
    checks hypothesis
  • Automatic: computer suggests hypothesis, person
    evaluates significance
  • Given an arbitrary parameter space
  • Data Clusters
  • Points between Data Clusters
  • Isolated Data Clusters
  • Isolated Data Groups
  • Holes in Data Clusters
  • Isolated Points

Nichol et al. 2001. Slide courtesy of and adapted
from Robert Brunner @ Caltech.
6
Challenge to Data Miners: Rediscover Astronomy
  • Astronomy needs deep understanding of physics.
  • But, some was discovered as variable correlation
    then explained with physics.
  • Famous example: Hertzsprung-Russell Diagram, star
    luminosity vs color (temperature)
  • Challenge 1 (the student test): How much of
    astronomy can data mining discover?
  • Challenge 2 (the Turing test): Can data mining
    discover NEW correlations?

7
What's needed? (not drawn to scale)
8
Data Mining: Science vs Commerce
  • Data in files: FTP a local copy / subset. ASCII or
    Binary.
  • Each scientist builds own analysis toolkit
  • Analysis is tcl script of toolkit on local data.
  • Some simple visualization tools: x vs y
  • Data in a database
  • Standard reports for standard things.
  • Report writers for non-standard things
  • GUI tools to explore data.
  • Decision trees
  • Clustering
  • Anomaly finders

9
But some science is hitting a wall: FTP and GREP
are not adequate
  • You can GREP 1 MB in a second
  • You can GREP 1 GB in a minute
  • You can GREP 1 TB in 2 days
  • You can GREP 1 PB in 3 years.
  • Oh!, and 1 PB is about 10,000 disks
  • At some point you need indices to limit
    search, plus parallel data search and analysis
  • This is where databases can help
  • You can FTP 1 MB in 1 sec
  • You can FTP 1 GB / min (about 1 $/GB)
  • ... 2 days and 1K$
  • ... 3 years and 1M$
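The scaling argument above is just bytes divided by bandwidth. A minimal sketch, assuming the "1 GB in a minute" rate holds at every size; the function names are illustrative, and since the slide's figures are round order-of-magnitude numbers, the sketch lands in the same ballpark rather than on identical values.

```python
# Sketch: how long a sequential scan takes at a fixed bandwidth.
# Assumption: the slide's "1 GB in a minute" rate applies at every scale.
GB = 10**9
RATE_BYTES_PER_SEC = GB / 60  # roughly 17 MB/s


def scan_seconds(n_bytes, rate=RATE_BYTES_PER_SEC):
    """Seconds needed to read n_bytes sequentially at `rate` bytes/second."""
    return n_bytes / rate


def human(seconds):
    """Render a duration in the largest convenient unit."""
    for unit, size in (("years", 365 * 86400), ("days", 86400),
                       ("hours", 3600), ("minutes", 60)):
        if seconds >= size:
            return f"{seconds / size:.1f} {unit}"
    return f"{seconds:.1f} seconds"


for label, n in (("1 GB", GB), ("1 TB", 10**12), ("1 PB", 10**15)):
    print(label, "->", human(scan_seconds(n)))
```

Every factor of 1000 in data size costs a factor of 1000 in scan time, which is why an index or a parallel scan is the only way out.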

10
Why is Science Behind?
  • Inertia
  • Science started earlier (Fortran, ...)
  • Science culture works (no big incentive to
    change)
  • Energy
  • Commerce is about profit: better answers
    translate to better profits
  • So companies build tools.
  • Impedance Mismatch
  • Databases don't accommodate analysis packages
  • Scientists' analysis needs to be inside the DBMS.

11
Goal: Easy Data Publication & Access
  • Augment FTP with data query: Return
    intelligent data subsets
  • Make it easy to
  • Publish: Record structured data
  • Find:
  • Find data anywhere in the network
  • Get the subset you need
  • Explore datasets interactively
  • Realistic goal:
  • Make it as easy as publishing/reading web sites
    today.

12
Web Services: The Key?
  • Web SERVER
  • Given a URL + parameters
  • Returns a web page (often dynamic)
  • Web SERVICE
  • Given an XML document (SOAP msg)
  • Returns an XML document
  • Tools make this look like an RPC.
  • F(x,y,z) returns (u, v, w)
  • Distributed objects for the web.
  • + naming, discovery, security, ..
  • Internet-scale distributed computing

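[Diagram: your program talks to a Web Server over http and gets back a web page; your program talks to a Web Service over soap and gets back an object in XML, as data in your address space.]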
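The "XML document in, XML document out, made to look like an RPC" idea can be sketched in a few lines. This is a toy illustration only: the `call`, `param`, `reply` and `value` element names are invented here, the envelope is not real SOAP, and the "server" is an in-process function rather than a network endpoint.

```python
import xml.etree.ElementTree as ET


def encode_call(method, **params):
    """Serialize a method call as a small XML document (toy format, not SOAP)."""
    root = ET.Element("call", {"method": method})
    for name, value in params.items():
        ET.SubElement(root, "param", {"name": name}).text = repr(value)
    return ET.tostring(root, encoding="unicode")


def decode_call(xml_text):
    """Parse the XML request back into a method name and float parameters."""
    root = ET.fromstring(xml_text)
    params = {p.get("name"): float(p.text) for p in root.findall("param")}
    return root.get("method"), params


def serve(xml_text, registry):
    """Server side: dispatch the decoded call and return an XML reply."""
    method, params = decode_call(xml_text)
    reply = ET.Element("reply")
    for value in registry[method](**params):
        ET.SubElement(reply, "value").text = repr(value)
    return ET.tostring(reply, encoding="unicode")


def call(method, registry, **params):
    """Client stub: hides the XML so the call looks like F(x, y, z) -> (u, v, w)."""
    reply = ET.fromstring(serve(encode_call(method, **params), registry))
    return tuple(float(v.text) for v in reply.findall("value"))


# Hypothetical service exposing one method F.
registry = {"F": lambda x, y, z: (x + y, y + z, z + x)}
u, v, w = call("F", registry, x=1.0, y=2.0, z=3.0)
```

The caller never touches XML; generated stubs from a service description play the role of `call` in a real toolkit.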
13
Data Federations of Web Services
  • Massive datasets live near their owners
  • Near the instruments software pipeline
  • Near the applications
  • Near data knowledge and curation
  • Super Computer centers become Super Data Centers
  • Each Archive publishes a web service
  • Schema documents the data
  • Methods on objects (queries)
  • Scientists get personalized extracts
  • Uniform access to multiple Archives
  • A common global schema

Federation
14
Grid and Web Services Synergy
  • I believe the Grid will have many web services
  • IETF standards provide
  • Naming
  • Authorization / Security / Privacy
  • Distributed Objects
  • Discovery, Definition, Invocation, Object Model
  • Higher level services: workflow, transactions,
    DB, ..
  • Synergy: commercial Internet & Grid tools

15
Outline
  • The revolution in Computational Science
  • The Virtual Observatory Concept
  • World-Wide Telescope
  • The Sloan Digital Sky Survey DB technology

16
Why Astronomy Data?
  • It has no commercial value
  • No privacy concerns
  • Can freely share results with others
  • Great for experimenting with algorithms
  • It is real and well documented
  • High-dimensional data (with confidence intervals)
  • Spatial data
  • Temporal data
  • Many different instruments from many different
    places and many different times
  • Federation is a goal
  • The questions are interesting
  • How did the universe form?
  • There is a lot of it (petabytes)

17
Time and Spectral Dimensions: The Multiwavelength
Crab Nebula
Crab star 1054 AD
X-ray, optical, infrared, and radio views of
the nearby Crab Nebula, which is now in a state
of chaotic expansion after a supernova explosion
first sighted in 1054 A.D. by Chinese astronomers.
Slide courtesy of Robert Brunner @ Caltech.
18
Even in optical, images are very different
Optical & Near-Infrared Galaxy Image Mosaics
BJ RF IN J H K
One object in 6 different color bands
Slide courtesy of Robert Brunner @ Caltech.
19
Astronomy Data Growth
  • In the old days astronomers took photos.
  • Starting in the 1960s they began to digitize.
  • New instruments are digital (100s of GB/nite)
  • Detectors are following Moore's law.
  • Data avalanche: doubles every 2 years

[Plot: total area of 3m telescopes in the world in m2,
and total number of CCD pixels in megapixels, as a
function of time. Growth over 25 years is a
factor of 30 in glass, 3000 in pixels.]
Courtesy of Alex Szalay
20
Universal Access to Astronomy Data
  • Astronomers have a few Petabytes now.
  • 1 pixel (byte) / sq arc second ~ 4TB
  • Multi-spectral, temporal, ... → 1PB
  • They mine it looking for new (kinds of) objects
    or more of interesting ones (quasars),
    density variations in 400-D space, correlations
    in 400-D space
  • Data doubles every 2 years.
  • Data is public after 2 years.
  • So, 50% of the data is public.
  • Some have private access to 5% more data.
  • So: 50% vs 55% access for everyone
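The "half the data is public" step follows directly from the two rates. A minimal sketch, assuming the total archive grows exponentially with a fixed doubling time; the function name and parameters are illustrative.

```python
def public_fraction(doubling_years=2.0, embargo_years=2.0):
    """Fraction of all data ever taken that is already past its embargo.

    If the archive doubles every `doubling_years`, then the archive as it
    stood `embargo_years` ago was 2 ** (-embargo_years / doubling_years)
    of today's size, and that older portion is exactly what is public now.
    """
    return 2.0 ** (-embargo_years / doubling_years)


# Doubling every 2 years with a 2-year proprietary period.
print(public_fraction())
```

With equal doubling and embargo times the result is one half regardless of how large the archive gets; a shorter embargo raises the public share.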

21
The Age of Mega-Surveys
  • Large number of new surveys
  • multi-TB in size, 100 million objects or more
  • Data publication an integral part of the survey
  • Software bill a major cost in the survey
  • The next generation mega-surveys are different
  • top-down design
  • large sky coverage
  • sound statistical plans
  • well controlled/documented data processing
  • Each survey has a publication plan
  • Federating these archives
  • → Virtual Observatory

MACHO 2MASS DENIS SDSS PRIME DPOSS GSC-II COBE
MAP NVSS FIRST GALEX ROSAT OGLE ...
Slide courtesy of Alex Szalay, modified by Jim
22
Data Publishing and Access
  • But..
  • How do I get at that 50% of the data?
  • Astronomers have culture of publishing.
  • FITS files and many tools.
    http://fits.gsfc.nasa.gov/fits_home.html
  • Encouraged by NASA.
  • FTP what you need.
  • But, data details are hard to document.
    Astronomers want to do it but it is VERY
    hard. (What programs were used? What were the
    processing steps? How were errors treated?)
  • And by the way, few astronomers have a spare
    petabyte of storage in their pocket.
  • THESIS: Challenging problems are publishing
    data and providing good query & visualization tools

23
Virtual Observatory
http://www.astro.caltech.edu/nvoconf/
http://www.voforum.org/
  • Premise: Most data is (or could be) online
  • So, the Internet is the world's best telescope
  • It has data on every part of the sky
  • In every measured spectral band: optical, x-ray,
    radio..
  • As deep as the best instruments (2 years ago).
  • It is up when you are up. The seeing is always
    great (no working at night, no clouds, no moons,
    no..).
  • It's a smart telescope: links objects and
    data to literature on them.

24
Demo of VirtualSky
  • Roy Williams @ Caltech: Palomar Data with links to
    NED.
  • Shows multiple themes, shows link to other sites
    (NED, VizieR, SIMBAD, ...)
  • http://virtualsky.org/servlet/Page?T3S21P1X0Y0W4F1
  • And
  • NED @ http://nedwww.ipac.caltech.edu/index.html

25
Demo of Sky Server
  • Alex Szalay of Johns Hopkins built SkyServer
    (based on TerraServer design).
  • http://skyserver.sdss.org/

26
Virtual Observatory Challenges
  • Size: multi-Petabyte
  • 40,000 square degrees is 2 Trillion pixels
  • One band (at 1 sq arcsec): 4 Terabytes
  • Multi-wavelength: 10-100 Terabytes
  • Time dimension: >> 10 Petabytes
  • Need auto parallelism tools
  • Unsolved MetaData problem
  • Hard to publish data & programs
  • How to federate Archives
  • Hard to find/understand data & programs
  • Current tools inadequate
  • new analysis & visualization tools
  • Data Federation is problematic
  • Transition to the new astronomy
  • Sociological issues
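The size figures above can be sanity-checked with a short calculation. This is a sketch under stated assumptions: square pixels, and a pixel scale and byte depth that are parameters rather than facts from the slide. The 0.5-arcsecond, 2-bytes-per-pixel case is included only because it happens to reproduce the slide's 2 trillion pixels and 4 TB; the slide itself does not state those values.

```python
ARCSEC2_PER_DEG2 = 3600 ** 2  # 1 degree = 3600 arcseconds


def survey_pixels(area_deg2, pixel_arcsec):
    """Number of square pixels of side `pixel_arcsec` covering `area_deg2`."""
    return area_deg2 * ARCSEC2_PER_DEG2 / pixel_arcsec ** 2


def survey_bytes(area_deg2, pixel_arcsec, bytes_per_pixel):
    """Raw size of one band, ignoring overlaps, headers and compression."""
    return survey_pixels(area_deg2, pixel_arcsec) * bytes_per_pixel


# Assumed pixel scales; only the 40,000 sq degree area comes from the slide.
for scale in (1.0, 0.5):
    print(scale, "arcsec pixels:",
          survey_pixels(40_000, scale) / 1e12, "trillion pixels,",
          survey_bytes(40_000, scale, 2) / 1e12, "TB at 2 bytes/pixel")
```

Multiplying by the number of bands and the number of epochs gives the multi-wavelength and time-domain totals.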

27
Steps to Virtual Observatory Prototype
  • Get SDSS and Palomar data online
  • Alex Szalay, Jan Vandenberg, Ani Thakar.
  • Roy Williams, Robert Brunner, Julian Bunn.
  • Do local queries and crossID matches to expose
  • Schema, Units,
  • Dataset problems
  • Typical use scenarios.
  • Define a set of Astronomy Objects and methods.
  • Based on UDDI, WSDL, SOAP.
  • Started this with TerraService http://TerraService.net/
    ideas.
  • Working with Caltech (Brunner, Williams,
    Djorgovski, Bunn) and JHU (Szalay et al) on this
  • Each archive is a web service
  • Move crossID app to web-service base

28
Virtual Observatory and Education
  • The Virtual Observatory can be used to
  • Teach astronomy: make it interactive,
    demonstrate ideas and phenomena
  • Teach computational science skills

29
Outline
  • The revolution in Computational Science
  • The Virtual Observatory Concept
  • World-Wide Telescope
  • The Sloan Digital Sky Survey DB technology

30
Sloan Digital Sky Survey http://www.sdss.org/
  • For the last 12 years a group of astronomers has
    been building a telescope (with funding from
    Sloan Foundation, NSF, and a dozen
    universities). 90M$.
  • Y2000: engineer, calibrate, commission: now
    public data.
  • 5% of the survey, 600 sq degrees, 15 M objects,
    60GB, ½ TB raw.
  • This data includes most of the known high z
    quasars.
  • It has a lot of science left in it but....
  • New: the data is arriving
  • 250GB/nite (20 nights per year) = 5TB/y.
  • 100 M stars, 100 M galaxies, 1 M spectra.
  • http://www.sdss.org/ and http://www.sdss.jhu.edu/

31
Two kinds of SDSS data in an SQL DB (objects and
images all in DB)
  • 15M Photo Objects ~ 400 attributes
  • 50K Spectra with ~ 30 lines/spectrum
32
Spatial Data Access: SQL extension (Szalay,
Kunszt, Brunner) http://www.sdss.jhu.edu/htm
  • Added Hierarchical Triangular Mesh (HTM)
    table-valued function for spatial joins.
  • Every object has a 20-deep Mesh ID.
  • Given a spatial definition: Routine returns up to
    10 covering triangles.
  • Spatial query is then up to 10 range queries.
  • Very fast: 10,000 triangles / second / cpu.
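Why a covering triangle turns into a range query: each HTM level appends two bits (four children) to a triangle's ID, so all 20-deep descendants of a coarse triangle form one contiguous ID interval. A minimal sketch of that arithmetic; the function names and the toy IDs are illustrative, and no real cover computation is attempted.

```python
def descendant_range(trixel_id, depth, target_depth=20):
    """Inclusive range of target-depth IDs that lie inside a coarser triangle.

    Each extra level multiplies the ID by 4 (two more bits), so the
    descendants of `trixel_id` run from id * 4**k to (id + 1) * 4**k - 1,
    where k is the number of levels between the two depths.
    """
    shift = 4 ** (target_depth - depth)
    return trixel_id * shift, (trixel_id + 1) * shift - 1


def in_cover(htm_id, cover, target_depth=20):
    """True if a target-depth ID falls in any (trixel_id, depth) of the cover."""
    return any(lo <= htm_id <= hi
               for lo, hi in (descendant_range(t, d, target_depth)
                              for t, d in cover))


# Toy example with shallow depths so the numbers stay readable.
print(descendant_range(8, 0, target_depth=2))
```

Because the object table can be indexed on its 20-deep ID, each interval is an ordinary index range scan, followed by an exact geometric test to discard false positives near the edge of the region.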

33
Data Loading
  • JavaScript of DB loader (DTS)
  • Web ops interface workflow system
  • Data ingest and scrubbing is major effort
  • Test data quality
  • Chase down bugs / inconsistencies
  • Other major task is data documentation
  • Explain the data
  • Explain the schema and functions.
  • If we supported users,

34
Scenario Design
  • Astronomers proposed 20 questions
  • Typical of things they want to do
  • Each would require a week of programming in tcl /
    C / FTP
  • Goal: make it easy to answer questions
  • DB and tools design motivated by this goal
  • Implemented utility procedures
  • JHU built GUI for Linux clients

35
The 20 Queries
  • Q1: Find all galaxies without unsaturated pixels
    within 1' of a given point of ra=75.327,
    dec=21.023
  • Q2: Find all galaxies with blue surface
    brightness between 23 and 25 mag per square
    arcsecond, and -10 < super galactic latitude (sgb)
    < 10, and declination less than zero.
  • Q3: Find all galaxies brighter than magnitude 22,
    where the local extinction is > 0.75.
  • Q4: Find galaxies with an isophotal surface
    brightness (SB) larger than 24 in the red band,
    with an ellipticity > 0.5, and with the major axis
    of the ellipse having a declination of between
    30" and 60" arc seconds.
  • Q5: Find all galaxies with a de Vaucouleurs
    profile (r^¼ falloff of intensity on disk) and the
    photometric colors consistent with an elliptical
    galaxy. The de Vaucouleurs profile
  • Q6: Find galaxies that are blended with a star,
    output the deblended galaxy magnitudes.
  • Q7: Provide a list of star-like objects that are
    1% rare.
  • Q8: Find all objects with unclassified spectra.
  • Q9: Find quasars with a line width > 2000 km/s and
    2.5 < redshift < 2.7.
  • Q10: Find galaxies with spectra that have an
    equivalent width in Hα > 40Å (Hα is the main
    hydrogen spectral line.)
  • Q11: Find all elliptical galaxies with spectra
    that have an anomalous emission line.
  • Q12: Create a gridded count of galaxies with u-g > 1
    and r < 21.5 over 60 < declination < 70, and
    200 < right ascension < 210, on a grid of 2, and
    create a map of masks over the same grid.
  • Q13: Create a count of galaxies for each of the
    HTM triangles which satisfy a certain color cut,
    like 0.7u-0.5g-0.2i < 1.25 and r < 21.75, output it
    in a form adequate for visualization.
  • Q14: Find stars with multiple measurements and
    have magnitude variations > 0.1. Scan for stars
    that have a secondary object (observed at a
    different time) and compare their magnitudes.
  • Q15: Provide a list of moving objects consistent
    with an asteroid.
  • Q16: Find all objects similar to the colors of a
    quasar at 5.5 < redshift < 6.5.
  • Q17: Find binary stars where at least one of them
    has the colors of a white dwarf.
  • Q18: Find all objects within 30 arcseconds of one
    another that have very similar colors: that is
    where the color ratios u-g, g-r, r-i are less
    than 0.05m.
  • Q19: Find quasars with a broad absorption line in
    their spectra and at least one galaxy within 10
    arcseconds. Return both the quasars and the
    galaxies.
  • Q20: For each galaxy in the BCG data set
    (brightest color galaxy), in 160 < right
    ascension < 170, -25 < declination < 35, count of
    galaxies within 30" of it that have a photoz
    within 0.05 of that galaxy.

Also some good queries at
http://www.sdss.jhu.edu/ScienceArchive/sxqt/sxQT/Example_Queries.html
36
An easy one: Q7: Provide a list of star-like
objects that are 1% rare.
  • Found 14,681 buckets; first 140 buckets have
    99%; time 62 seconds
  • CPU bound: 226 k records/second (2 cpu),
    250 KB/s.

select cast((u-g) as int) as ug,
       cast((g-r) as int) as gr,
       cast((r-i) as int) as ri,
       cast((i-z) as int) as iz,
       count(*) as Population
from stars
group by cast((u-g) as int), cast((g-r) as int),
         cast((r-i) as int), cast((i-z) as int)
order by count(*)
37
An Easy One: Q15: Provide a list of moving objects
consistent with an asteroid.
  • Sounds hard but there are 5 pictures of the
    object at 5 different times (color filters) and
    so can see velocity.
  • Image pipeline computes velocity.
  • Computing it from the 5 color x,y would also be
    fast
  • Finds 1,303 objects in 3 minutes,
    140MBps. (could go 2x faster with more disks)

select objId,
       dbo.fGetUrlEq(ra,dec) as url,          -- return object ID & url
       sqrt(power(rowv,2)+power(colv,2)) as velocity
from photoObj                                 -- check each object.
where (power(rowv,2) + power(colv,2))         -- square of velocity
      between 50 and 1000                     -- huge values = error
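The selection is a band-pass on the squared velocity: too slow is a star, too fast is treated as a measurement error. A minimal sketch of the same test on plain numbers; the function names are illustrative, and the units of `rowv` and `colv` are whatever the image pipeline reports.

```python
import math


def velocity(rowv, colv):
    """Speed from the row and column velocity components."""
    return math.hypot(rowv, colv)


def looks_like_asteroid(rowv, colv, lo=50.0, hi=1000.0):
    """Same cut as the query: squared velocity inside [lo, hi], bounds inclusive."""
    v2 = rowv ** 2 + colv ** 2
    return lo <= v2 <= hi


# Made-up (rowv, colv) pairs: stationary, plausible mover, implausibly fast.
candidates = [(0.1, 0.2), (10.0, 12.0), (40.0, 40.0)]
movers = [c for c in candidates if looks_like_asteroid(*c)]
```

Comparing the squared value avoids a square root per row; the root is only taken for the objects that are returned.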
38
Q15: Fast Moving Objects
  • Find near earth asteroids

SELECT r.objID as rId, g.objId as gId,
       dbo.fGetUrlEq(g.ra, g.dec) as url
FROM PhotoObj r, PhotoObj g
WHERE r.run = g.run and r.camcol = g.camcol
  and abs(g.field-r.field) < 2                -- nearby
  -- the red selection criteria
  and ((power(r.q_r,2) + power(r.u_r,2)) > 0.111111)
  and r.fiberMag_r between 6 and 22
  and r.fiberMag_r < r.fiberMag_g
  and r.fiberMag_r < r.fiberMag_i
  and r.parentID = 0
  and r.fiberMag_r < r.fiberMag_u
  and r.fiberMag_r < r.fiberMag_z
  and r.isoA_r/r.isoB_r > 1.5
  and r.isoA_r > 2.0
  -- the green selection criteria
  and ((power(g.q_g,2) + power(g.u_g,2)) > 0.111111)
  and g.fiberMag_g between 6 and 22
  and g.fiberMag_g < g.fiberMag_r
  and g.fiberMag_g < g.fiberMag_i
  and g.fiberMag_g < g.fiberMag_u
  and g.fiberMag_g < g.fiberMag_z
  and g.parentID = 0
  and g.isoA_g/g.isoB_g > 1.5
  and g.isoA_g > 2.0
  -- the matchup of the pair
  and sqrt(power(r.cx-g.cx,2) + power(r.cy-g.cy,2)
           + power(r.cz-g.cz,2)) * (10800/PI()) < 4.0
  and abs(r.fiberMag_r-g.fiberMag_g) < 2.0
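The pair matchup measures the straight-line distance between two unit vectors on the sky and scales it by 10800/π, which converts radians to arcminutes (180 × 60 arcminutes per π radians). A minimal sketch, assuming cx, cy, cz are the Cartesian components of the unit vector for (ra, dec); the helper names are illustrative.

```python
import math


def unit_vector(ra_deg, dec_deg):
    """Cartesian unit vector (cx, cy, cz) for a sky position given in degrees."""
    ra, dec = math.radians(ra_deg), math.radians(dec_deg)
    return (math.cos(dec) * math.cos(ra),
            math.cos(dec) * math.sin(ra),
            math.sin(dec))


def separation_arcmin(a, b):
    """Chord length between two unit vectors, expressed in arcminutes.

    For small separations the chord is very nearly equal to the angle in
    radians, so multiplying by 10800 / pi gives arcminutes.
    """
    return math.dist(a, b) * 10800 / math.pi


# Two made-up positions one arcminute apart in declination.
red = unit_vector(185.0, 0.0)
green = unit_vector(185.0, 1.0 / 60.0)
print(separation_arcmin(red, green))
```

Working with unit vectors avoids trigonometry in the inner loop and has no special cases at the poles or at ra = 0.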
39
(No Transcript)
40
(No Transcript)
41
(No Transcript)
42
Performance (on current SDSS data)
  • Run times on 15k$ COMPAQ Server (2 cpu, 1 GB,
    8 disk)
  • Some take 10 minutes
  • Some take 1 minute
  • Median 22 sec.
  • GHz processors are fast!
  • (10 mips/IO, 200 ins/byte)
  • 2.5 m rec/s/cpu

~1000 IO/cpu sec, ~70 MB IO/cpu sec
43
Summary of Queries
  • All have fairly short SQL programs -- a
    substantial advance over (tcl, C)
  • Many are sequential one-pass and two-pass over
    data
  • Covering indices make scans run fast
  • Table valued functions are wonderful but
    limitations are painful.
  • Counting, Binning, Histograms VERY common
  • Spatial indices helpful,
  • Materialized view (Neighbors) helpful.

44
Call to Action
  • If you do data visualization: we need you (and we
    know it).
  • If you do databases: here is some data you can
    practice on.
  • If you do distributed systems: here is a
    federation you can practice on.
  • If you do data mining: here are datasets to test
    your algorithms.
  • If you do astronomy educational outreach: here is
    a tool for you.
  • The astronomers are very good, and very smart,
    and a pleasure to work with, and the questions
    are cosmic, so ...

45

46
HTM and SQL
  • Spatial spec in http://www.sdss.jhu.edu/htm/
  • List of triangles out (about 10-20 range queries)
  • Table valued function, then geometry rejects
    false positives

Use SkyServerV3
GO
-- show an HTM ID
select dbo.fHTM_To_String(dbo.fHTM_Lookup('J2000 20 185 0'))
GO
-- show triangles covering a circle
select dbo.fHTM_To_String(HTMIDstart) as start,
       dbo.fHTM_To_String(HTMIDend) as stop
from dbo.fHTM_Cover('CIRCLE J2000 12 185 0 5 ')
GO
-- Show the spatial join
declare @shift real
set @shift = CONVERT(int,POWER(4.,20-12))  -- 4 = 2^2, and 2 bits per htm level
select ObjID
from PhotoObj as P,
     dbo.fHTM_Cover('CIRCLE J2000 12 185 0 1 ') as C
where P.htmID between C.HTMIDstart*@shift and C.HTMIDend*@shift
GO
-- show a user-level function.
select ObjID from dbo.fGetNearbyObjEq(185,0,1)
47
A Hard One: Q14: Find stars with multiple
measurements that have magnitude variations
> 0.1.
  • This should work, but SQL Server does not allow
    table values to be piped to table-valued
    functions.

48
A Hard One, Second Try: Q14: Find stars with
multiple measurements that have magnitude
variations > 0.1.
  • Write a program with a cursor, ran for 2 days

-------------------------------------------------------------------------------
-- Table-valued function that returns the binary stars within a certain radius
-- of another (in arc-minutes) (typically 5 arc seconds).
-- Returns the ID pairs and the distance between them (in arcseconds).
create function BinaryStars(@MaxDistanceArcMins float)
returns @BinaryCandidatesTable table(
    S1_object_ID bigint not null,      -- Star 1
    S2_object_ID bigint not null,      -- Star 2
    distance_arcSec float)             -- distance between them
as begin
  declare @star_ID bigint, @binary_ID bigint  -- Star's ID and binary ID
  declare @ra float, @dec float               -- Star's position
  declare @u float, @g float, @r float,
          @i float, @z float                  -- Star's colors
  ---------------- Open a cursor over stars and get position and colors
  declare star_cursor cursor
    for select object_ID, ra, dec, u, g, r, i, z from Stars
  open star_cursor
  while (1=1)                                 -- for each star
  begin                                       -- get its attributes
    fetch next from star_cursor
      into @star_ID, @ra, @dec, @u, @g, @r, @i, @z
    if (@@fetch_status = -1) break            -- end if no more stars
    insert into @BinaryCandidatesTable        -- insert its binaries
    select @star_ID, S1.object_ID,            -- return stars pairs
           sqrt(N.DotProd)/PI()*10800         -- and distance in arc-seconds
    from getNearbyObjEq(@ra, @dec,            -- Find objects nearby S.
                        @MaxDistanceArcMins) as N,  -- call them N.
         Stars as S1                          -- S1 gets N's color values
    where @star_ID < N.Object_ID              -- S1 different from S
      and N.objType = dbo.PhotoType('Star')   -- S1 is a star
      and N.object_ID = S1.object_ID          -- join stars to get colors of S1 = N
      and ( abs(@u-S1.u) > 0.1                -- one of the colors is different.
         or abs(@g-S1.g) > 0.1
         or abs(@r-S1.r) > 0.1
         or abs(@i-S1.i) > 0.1
         or abs(@z-S1.z) > 0.1 )
  end                                         -- end of loop over all stars
  -------------- Looped over all stars, close cursor and exit.
  close star_cursor
  deallocate star_cursor
  return                                      -- return table
end                                           -- end of BinaryStars
GO
select * from dbo.BinaryStars(.05)
49
A Hard One, Third Try: Q14: Find stars with
multiple measurements that have magnitude
variations > 0.1.
  • Use pre-computed neighbors table.
  • Ran in 2 minutes, found 48k pairs.

-- Plan 2: Use the precomputed neighbors table
select top 100 S.object_ID, S1.object_ID,   -- return star pairs and distance
       str(N.Distance_mins * 60,6,1) as DistArcSec
from Star S,                                -- S is a star
     Neighbors N,                           -- N within 3 arcsec (10 pixels) of S.
     Star S1                                -- S1 = N has the color attributes
where S.Object_ID = N.Object_ID             -- connect S and N.
  and S.Object_ID < N.Neighbor_Object_ID    -- S1 different from S
  and N.Neighbor_objType = dbo.fPhotoType('Star')  -- S1 is a star (an optimization)
  and N.Distance_mins < .05                 -- the 3 arcsecond test
  and N.Neighbor_object_ID = S1.Object_ID   -- N = S1
  and ( abs(S.u-S1.u) > 0.1                 -- one of the colors is different.
     or abs(S.g-S1.g) > 0.1
     or abs(S.r-S1.r) > 0.1
     or abs(S.i-S1.i) > 0.1
     or abs(S.z-S1.z) > 0.1 )
-- Found 48,425 pairs (out of 4.4 m stars) in 121 sec.
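The neighbors-table plan can be tried on a toy dataset. A minimal sketch using Python's built-in `sqlite3` with an in-memory database; the three-star catalog, the simplified two-table schema and the lower-case column names are made up for illustration and are not the SkyServer schema.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
create table Star(object_id integer primary key,
                  u real, g real, r real, i real, z real);
create table Neighbors(object_id integer, neighbor_object_id integer,
                       distance_mins real);
""")
con.executemany("insert into Star values (?,?,?,?,?,?)", [
    (1, 18.0, 17.0, 16.5, 16.3, 16.2),
    (2, 18.5, 17.0, 16.5, 16.3, 16.2),  # differs from star 1 by 0.5 in u
    (3, 18.0, 17.0, 16.5, 16.3, 16.2),  # same colors as star 1
])
# Precomputed once at load time: every close pair, stored in both directions.
con.executemany("insert into Neighbors values (?,?,?)", [
    (1, 2, 0.02), (2, 1, 0.02),  # close pair with different colors
    (1, 3, 0.03), (3, 1, 0.03),  # close pair with identical colors
    (2, 3, 0.20), (3, 2, 0.20),  # too far apart to pass the distance test
])

pairs = con.execute("""
select S.object_id, S1.object_id, N.distance_mins * 60 as dist_arcsec
from Star S
join Neighbors N on S.object_id = N.object_id
join Star S1 on N.neighbor_object_id = S1.object_id
where S.object_id < N.neighbor_object_id      -- report each pair once
  and N.distance_mins < 0.05                  -- the 3 arcsecond test
  and (abs(S.u - S1.u) > 0.1 or abs(S.g - S1.g) > 0.1
       or abs(S.r - S1.r) > 0.1 or abs(S.i - S1.i) > 0.1
       or abs(S.z - S1.z) > 0.1)
""").fetchall()
```

The expensive spatial search is paid once when the neighbors table is built; the query itself is then two ordinary joins that the optimizer can scan and parallelize.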
50
The Pain of Going Outside SQL (it's fortunate that
all the queries are single statements)
  • Use a cursor
  • No cpu parallelism
  • CPU bound
  • 6 MBps, 2.7 k rps
  • 5,450 seconds (10x slower)
  • Count parent objects
  • 503 seconds for 14.7 M objects in 33.3 GB
  • 66 MBps
  • IO bound (30% of one cpu)
  • 100 k records/cpu sec

declare @count int
declare @sum int
set @sum = 0
declare PhotoCursor cursor for select nChild from sxPhotoObj
open PhotoCursor
while (1=1)
begin
  fetch next from PhotoCursor into @count
  if (@@fetch_status = -1) break
  set @sum = @sum + @count
end
close PhotoCursor
deallocate PhotoCursor
print 'Sum is ' + cast(@sum as varchar(12))

select count(*) from sxPhotoObj where nChild > 0
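The contrast between row-at-a-time and set-at-a-time processing can be reproduced in miniature. A minimal sketch using Python's built-in `sqlite3`; the table and its 1,000 made-up rows are illustrative, and no timing claims are made for a dataset this small.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table PhotoObj(objID integer primary key, nChild integer)")
con.executemany("insert into PhotoObj(nChild) values (?)",
                [(n % 3,) for n in range(1000)])

# Row-at-a-time: the client pulls every row across the interface and adds it
# up itself, which is what the cursor loop does.
cursor_total = 0
for (n_child,) in con.execute("select nChild from PhotoObj"):
    cursor_total += n_child

# Set-at-a-time: one statement, and the engine does the scan and the sum.
(set_total,) = con.execute("select sum(nChild) from PhotoObj").fetchone()
(parents,) = con.execute(
    "select count(*) from PhotoObj where nChild > 0").fetchone()
```

Both routes give the same answer; the difference is that the single statement leaves the engine free to choose the scan strategy and to use every disk and processor it has.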
51
Reflections on the 20 Queries
  • Data loading/scrubbing is labor intensive &
    tedious
  • AUTOMATE!!!
  • This is 5% of the data, and some queries take 10
    minutes.
  • But this is not tuned (disk bound).
  • All queries benefit from parallelism (both disk
    and cpu) (if you can state the query inside SQL).
  • Parallel database machines will do well on this
  • Hash machines
  • Data pumps
  • See paper in word or pdf on my web site.
  • Conclusion: SQL answered the questions. Once you
    get the answers, you need visualization

52
Astronomy Data Characteristics
  • Lots of it (petabytes)
  • Hundreds of dimensions per object
  • Cross-correlation is challenging because
  • Multi-resolution
  • Time varying
  • Data is dirty (cosmic rays, airplanes)

53
SkyServer as a WebServer: WSDL+SOAP, just add
details
  • Archive ss = new VOService(SkyServer)
  • Attributes A = ss.GetObjects(ra,dec,radius)
  • ?? What are the objects (attributes)?
  • ?? What are the methods (GetObjects()...)?
  • ?? Is the query language SQL or Xquery or what?

54
SDSS: what I have been doing
  • Work with Alex Szalay, Don Slutz, and others to
    define 20 canonical queries and 10 visualization
    tasks.
  • Working with Alex Szalay on building Sky Server
    and making the data public (send out 80GB
    SQL DBs)

55
What Next? (after the data online, after the web
servers)
  • How to federate the Archives to make a VO?
  • Send XML: a non-answer, equivalent to "send
    Unicode"
  • Bytes is the wrong abstraction: Publish Methods
    on Objects.

56
Survey Cross-Identification
  • Billions of Sources
  • High Source Densities
  • Multi-Wavelength: Radio to γ-Ray
  • All Sky - Thousands of Sq. Degrees
  • Computational Challenge
  • Probabilistic Associations
  • Optimized Likelihood Ratios
  • A Priori Astrophysical Knowledge Important
  • Secondary Parameters
  • Temporal Variability
  • Dynamic & Static Associations
  • User-Defined Cross-Identification Algorithms

Optical-Infrared-Radio Quasar-Environment Survey
Radio Survey Cross-Identification Steep Spectrum
Sources
Optical-Infrared-X-Ray Serendipitous Chandra
Identification
Slide courtesy of Robert Brunner @ Caltech.
57
Data Federation: A Computational Challenge
  • 2MASS vs. DPOSS Cross-identification
  • 2MASS J < 15
  • DPOSS IN < 18
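The simplest form of cross-identification is a positional match: for each source in one catalog, take the nearest source in the other within a search radius. A minimal brute-force sketch on made-up coordinates; it deliberately uses plain nearest-neighbor matching rather than the probabilistic associations and likelihood ratios named on the slides, and a real survey match would use a spatial index instead of the double loop.

```python
import math


def angular_sep_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation in arcseconds (haversine formula), inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(math.sqrt(h))) * 3600


def cross_match(cat_a, cat_b, radius_arcsec):
    """Map each cat_a id to (nearest cat_b id, separation) if within the radius.

    Catalog entries are (id, ra_deg, dec_deg). Brute force: O(len(a) * len(b)).
    """
    matches = {}
    for id_a, ra_a, dec_a in cat_a:
        best = None
        for id_b, ra_b, dec_b in cat_b:
            sep = angular_sep_arcsec(ra_a, dec_a, ra_b, dec_b)
            if sep <= radius_arcsec and (best is None or sep < best[1]):
                best = (id_b, sep)
        if best is not None:
            matches[id_a] = best
    return matches


# Made-up catalogs: a1 has two candidates nearby, a2 has none.
cat_a = [("a1", 185.0, 0.0), ("a2", 10.0, 10.0)]
cat_b = [("b1", 185.0, 1.0 / 3600), ("b2", 185.0, 3.0 / 3600)]
matches = cross_match(cat_a, cat_b, radius_arcsec=2.0)
```

Magnitude cuts like the ones listed above would be applied to each catalog before matching, to keep source densities manageable.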