Store Everything Online In A Database

Store Everything Online In A Database
  • Jim Gray
  • Microsoft Research
  • http://

  • Store Everything
  • Online (Disk not Tape)
  • In a Database
  • A Federated DB
  • Two Examples

How Much is Everything?
[Figure: the powers-of-ten storage ladder, Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta]
Everything! Recorded
  • Soon everything can be recorded and indexed
  • Most bytes will never be seen by humans.
  • Data summarization, trend detection, and anomaly detection are key technologies
  • See Mike Lesk: "How much information is there?"
  • See Lyman & Varian: "How much information?" http://

[Figure labels: A Book; A Photo; All LoC books (words); All Books MultiMedia. SI prefixes: 10^-24 yocto, 10^-21 zepto, 10^-18 atto, 10^-15 femto, 10^-12 pico, 10^-9 nano, 10^-6 micro, 10^-3 milli]
Storage capacity is beating Moore's law
  • $2.3k/TB today (raw disk)
  • $1k/TB by end of 2002

  • Store Everything
  • Online (Disk not Tape)
  • In a Database
  • A Federated DB
  • Two Examples

Online Data
  • Can build 1PB of NAS disk for $5M today
  • Can SCAN (read or write) the entire PB in 3 hours.
  • Operate it as a data pump: continuous sequential I/O
  • Can deliver 1PB for $1M over the Internet
  • Access charge is $300/Mbps bulk rate
  • Need to geoplex data (store it in two places).
  • Need to filter/process data near the source,
  • to minimize network costs.
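A back-of-envelope check of the scan-time claim above. The spindle count and per-disk bandwidth are my assumptions (not from the slide), chosen as plausible for a 1PB farm of commodity drives; the point is only that scan time = capacity / aggregate bandwidth.

```python
# Scan time = capacity / aggregate bandwidth. The spindle count and per-disk
# sequential rate below are assumed, not from the slide.
PB = 10**15                   # bytes in a petabyte (decimal)
n_disks = 2500                # assumed: ~2500 spindles in the 1 PB farm
per_disk_bw = 35 * 10**6      # assumed: 35 MB/s sequential per disk

aggregate_bw = n_disks * per_disk_bw       # ~87.5 GB/s across the farm
scan_hours = PB / aggregate_bw / 3600      # ~3.2 hours, matching "3 hours"
```

With every spindle streaming in parallel, the farm behaves as a data pump: the whole petabyte passes under the arms every few hours.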

The Absurd Disk
  • 2.5 hr scan time (poor sequential access)
  • 1 access per second per 5 GB (VERY cold data)
  • It's a tape!

[Figure labels: 1 TB, 100 MB/s, 200 Kaps]
Disk vs Tape
  • Tape
  • 40 GB
  • 10 MBps
  • 10 sec pick time
  • 30-120 second seek time
  • $2/GB for media, $8/GB for drive+library
  • 10 TB/rack
  • 1 week scan
  • Disk
  • 80 GB
  • 35 MBps
  • 5 ms seek time
  • 3 ms rotate latency
  • $2/GB for drive, $2/GB for ctlrs/cabinet
  • 15 TB/rack
  • 1 hour scan

[Figure labels: Guesstimates; CERN 200 TB, 3480 tapes; 50GB; Rack; 1 TB; 12 drives]
The price advantage of disk is growing, and the performance advantage of disk is huge! At $10K/TB, disk is competitive with nearline tape.
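The scan-time rows above can be sanity-checked with a few lines of Python. The per-device figures come from the slide; the one-drive-per-tape-rack assumption is mine.

```python
GB, MBps = 10**9, 10**6

# Per-device sequential scan times (figures from the slide).
tape_scan_sec = 40 * GB / (10 * MBps)    # one 40 GB tape at 10 MB/s = 4000 s
disk_scan_sec = 80 * GB / (35 * MBps)    # one 80 GB disk at 35 MB/s ~ 2286 s

# Rack scan: a tape rack drains through a single drive (assumption),
# while a disk rack reads all of its spindles in parallel.
tape_rack_days = 10 * 1000 * GB / (10 * MBps) / 86400   # ~11.6 days
disk_rack_hours = disk_scan_sec / 3600                   # ~0.6 hours
```

Per device the two are comparable; the orders-of-magnitude gap (the slide's "1 week" vs "1 hour") comes almost entirely from disk's parallelism.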
Building a Petabyte Disk Store
  • Cadillac: $500k/TB = $500M/PB, plus FC switches, plus … = $800M/PB
  • TPC-C SANs (Brand PC, 18GB/…): $60M/PB
  • Brand PC, local SCSI: $15M/PB
  • Do-it-yourself ATA

Cheap Storage and/or Balanced System
  • Low-cost storage (2 x $3k servers): 2 x (800 MHz, 256MB, 8x80GB disks, 100MbE), raid5; costs $6K/TB
  • Balanced server ($5k / 0.64 TB):
  • 2 x 1GHz cpus ($2k)
  • 512 MB
  • 8 x 80 GB drives ($2K)
  • Gbps Ethernet switch ($300/port)
  • $9k/TB, $18K per mirrored TB
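The $/TB bullet can be checked by arithmetic. A sketch only: the slide's $9k/TB presumably folds in the network port and packaging, which are not itemized here.

```python
capacity_tb = 8 * 80 / 1000        # 8 x 80 GB drives = 0.64 TB
server_cost = 5000                 # dollars, per the slide
cost_per_tb = server_cost / capacity_tb     # ~ $7.8k/TB raw server cost
mirrored_per_tb = 2 * cost_per_tb           # two copies ~ $15.6k per mirrored TB
```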

Next step in the Evolution
  • Disks become supercomputers:
  • the controller will have 1 bips of cpu, 1 GB of ram, a 1 GBps net,
  • and a disk arm.
  • Disks will run the full-blown app/web/db/OS stack
  • Distributed computing:
  • processors migrate to the transducers.

It's Hard to Archive a Petabyte: it takes a LONG time to restore it.
  • At 1 GBps it takes 12 days!
  • Store it in two (or more) places online (on disk?): a geoplex.
  • Scrub it continuously (look for errors).
  • On failure:
  • use the other copy until the failure is repaired,
  • refresh the lost copy from the safe copy.
  • Can organize the two copies differently
    (e.g. one by time, one by space)
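The "12 days" figure is simple division:

```python
PB = 10**15                  # bytes in a petabyte
restore_rate = 10**9         # a 1 GBps pipe
restore_days = PB / restore_rate / 86400    # ~11.6 days, the slide's "12 days"
```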

  • Store Everything
  • Online (Disk not Tape)
  • In a Database
  • A Federated DB
  • Two Examples

Why Not file + object + GREP?
  • It works if you have thousands of objects (and you know them all)
  • But it is hard to search millions/billions/trillions with GREP
  • Hard to put all attributes in the file name.
  • Minimal metadata
  • Hard to do chunking right.
  • Hard to pivot on space/time/version/attributes.

The Reality: it's build vs buy
  • If you use a file system, you will eventually build a database system:
  • metadata,
  • query,
  • parallel ops,
  • security,
  • reorganize,
  • recovery,
  • distributed,
  • replication, …

OK, so I'll put lots of objects in a file: a Do-It-Yourself Database
  • Good news:
  • Your implementation will be 10x faster (at …)
  • easier to understand and use
  • Bad news:
  • It will cost 10x more to build and maintain
  • Someday you will get bored maintaining/evolving it
  • It will lack some killer features:
  • Parallel search
  • Self-describing via metadata
  • SQL, XML, …
  • Replication
  • Online update & reorganization
  • Chunking is problematic (what granularity, how to …)

Top 10 reasons to put Everything in a DB
  1. Someone else writes the million lines of code
  2. Captures data and metadata
  3. Standard interfaces give tools and quick learning
  4. Allows schema evolution without breaking old apps
  5. Index and pivot on multiple attributes
  6. Parallel terabyte searches in seconds or minutes
  7. Moves processing and search close to the disk
     arm (moves fewer bytes: questons return datons)
  8. Chunking is easier (can aggregate chunks at …)
  9. Automatic geo-replication
  10. Online update and reorganization.
  11. Security
  12. If you pick the right vendor, ten years from now
      there will be software that can read the data.

DB-Centric Examples
  • TerraServer
  • All images and all data in the database (chunked as small tiles)
  • http://
  • SkyServer / Virtual Sky
  • Both image and semantic data in a relational DB
  • Parallel search & non-procedural access are key
  • http://
  • http://
  • http://

OK, why don't they use our stuff?
  • Wrong metaphor: HDF with hyper-slabs is better
  • Impedance mismatch: getting stuff into/out of the DB
    is too hard
  • We sold them OODBs and they did not work
    (unreliable, poor performance, no tools).

So, why will the future be different?
  • They have MUCH more data (10^9 files?)
  • Java / C# eases the impedance mismatch: rowsets ==
    ragged arrays of objects; iterators, exceptions, …
    built into the language
  • Tools are better
  • Optimizers are better
  • CPU and disk parallelism actually works now
  • Statistical packages are better.

  • Store Everything
  • Online (Disk not Tape)
  • In a Database
  • A Federated DB
  • Two Examples


Distributed shared databases have failed, even
on their home turf. Blocks, files, and tables are the
wrong abstraction for networks (too low
level). Objects are the right abstraction. So,
UDDI / WSDL / SOAP is the solution (not
SQL). This replays the "NAS is better than SAN"
argument: methods > sql > file > disk. XML is the
wire format, XLANG is the workflow protocol, and
Query will be in there somewhere.
DDB technology is GREAT in a Cluster (Beowulf)
  • Uniform architecture
  • Trust among nodes
  • High-bandwidth, low-latency communication
  • Programs have a single system image
  • Queries run in parallel
  • A global optimizer does query decomposition

But in a Distributed System (a Grid vs a Beowulf)
  • Higher-level abstractions give modularity and
    minimize round trips
  • Change is constant: need modularity.
  • Heterogeneous architecture makes query planning
    much harder
  • No trust
  • Communication is slow and expensive (minimize it)

Federate Databases
  • Each database exposes services:
  • Self-describing
  • Discoverable
  • Easy for programs to use/understand
  • Built on standards (W3C, IETF, …)
  • Client-side or server-side apps
  • Integrate these services:
  • Combine information to produce answers

DDB: the Trust Issue
  • Customers serve themselves
  • Follow the rules posted on the door
  • No overhead, no staff!
  • Clerks serve customers
  • Take order, fill order, fill out invoice, collect payment
  • Overhead: staff, training, rules, …
Client/Server Groceries
DDB Grocery
  • Store Everything
  • Online (Disk not Tape)
  • In a Database
  • A Federated DB
  • Two Examples

How do we find information today?
  • Human searches web (with an index)
  • Human browses pages

How do we find information tomorrow?
Digital Dashboard
  • Agents gather and digest it for us.
  • Q: How?
  • A: W3C
  • Discovery: UDDI, DISCO, WSDL
  • Use: SOAP

My Agents
Web Services
How do we publish information?
  • Get the data.
  • Conceptualize the data schema
  • Provide methods that return data subsets.
  • Challenge: how much processing on your server?
  • Publish the schema and methods.
  • We (you and I) are exploring these issues.

TerraServer Example
  • TerraServer
  • 3TB Internet map DB, available since June 1998
  • USGS photo and topo maps of the US
  • Integrated with Home Advisor
  • Shows off SQL Server availability & scalability
  • Designed for thin clients and voice network
  • TerraService
  • A .NET web service
  • Makes TerraServer data available to other apps

Demo (http://): show photo, topo, gazetteer, demographics
TerraServer Experience
  • Successful Web Site
  • Top-1000 web site; continues to be popular
  • Met goals: interesting, big, real, public, fast,
    easy, accessible, and free
  • High availability: Windows Data Center + Compaq
    SAN technology
  • New feature requests:
  • Programmable access to meta-data
  • User-selectable image sizes, i.e. a map server
  • Permission to use TerraServer data within server …

What is a Web Service?
A programmable application component accessible
via standard Web protocols
TerraService Architecture
Standard Browsers
Smart Clients
Terra Services
  • Geo-coded places, e.g. Schools, Golf Courses,
    Hospitals, etc.
  • Place Polygons e.g. Zip Codes, Cities, etc.
  • Query Gazetteer
  • Retrieve imagery meta-data
  • Retrieve imagery
  • Simple Projection conversions

Terra-Tile-Service allows applications to overlay information; clients can present TerraServer imagery in new ways.
Terra Services
  • Tile
  • GetAreaFromPt
  • GetAreaFromRect
  • GetAreaFromTileId
  • GetTileMetaFromLonLat
  • GetTileMetaFromTileId
  • GetTile (Image)
  • Landmark
  • GetLandmarkTypes
  • CountOfLandmarkPointsByRect
  • GetLandmarkPointsByRect
  • CountOfLandmarkShapesByRect
  • GetLandmarkShapesByRect
  • Place Search
  • GetPlaceFacts
  • GetPlaceList
  • GetPlaceListInRect
  • CountPlacesInRect
  • Projection
  • ConvertLonLatToUtm
  • ConvertUtmToLonLat
  • ConvertLonLatToNearestPlace
  • GetTheme
  • GetLatLonMetrics

Soil Viewer Uses TerraService
Custom End Product
Web Soil Data Viewer
XML Soil Report
Soil Interpretation Map
What You Saw
  • Converted a Web Server:
  • HTML get/post,
  • server returns pictures to people
  • to a Web Service:
  • SOAP service,
  • returns XML self-describing data
  • Application integrates data (Agriculture and Geo…)

Rosetta Stone
  Dot Net                                                    Distributed computing basic services
  UDDI (Universal Description, Discovery, and Integration)   Yellow Pages
  Schema, XLANG                                              ?
  SOAP (simple object access protocol)                       RPC (remote procedure call), CORBA, DCOM, RMI
  WSDL (web services definition language)                    IDL (interface definition language)
  XML (eXtensible Markup Language)                           XDR (eXternal Data Representation)

  • Collaborating with
  • Alex Szalay, Peter Kunszt, Ani Thakar @ JHU
  • Robert Brunner, Roy Williams @ Caltech
  • George Djorgovski, Julian Bunn @ Caltech
  • FermiLab operates Sky Server
  • Compaq donated hardware
  • Microsoft donated software and money

Sky Server
  • Like TerraServer: pictures of the sky.
  • But also LOTS of data on each object, so a
    data-mining web service:
  • Luminosity (multi-spectral), morphology, spectrum
  • So, it is a data mining application
  • Cross-correlation is challenging because:
  • Multi-resolution
  • Data is dirty/fuzzy (error bars, cosmic rays, …)
  • Time varying

Astronomy Data
  • In the old days astronomers took photos.
  • Starting in the 1960s they began to digitize.
  • New instruments are digital (100s of GB/night)
  • Detectors are following Moore's law.
  • Data avalanche: doubles every year
[Figure: total area of 3m telescopes in the world in m², and total number of CCD pixels in megapixels, as a function of time. Growth over 25 years is a factor of 30 in glass, 3000 in pixels. Courtesy of Alex Szalay]
Astronomy Data
  • Astronomers have a few petabytes now.
  • 1 pixel (byte) per sq arc second = 4TB
  • Multi-spectral, temporal, … = 1PB
  • They mine it looking for new (kinds of) objects,
    or more of interesting ones (quasars);
    density variations in 400-D space; correlations
    in 400-D space
  • Data doubles every year.
  • Data is public after a year.
  • So, 50% of the data is public.
  • Some have private access to 5% more data.
  • So it is 50% vs 55% access for everyone
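The "50% public" bullet follows directly from the doubling: if the total doubles every year, this year's (still-private) data equals everything that came before it. A tiny sketch:

```python
# If data doubles every year, each year's new data equals all prior data
# combined, so the year-old (public) portion is always half the total.
total = 1.0
yearly = [1.0]              # data produced in each year
for _ in range(10):         # ten doublings
    yearly.append(total)    # new data equals all prior data combined
    total += yearly[-1]

public_fraction = (total - yearly[-1]) / total   # everything older than a year
```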

Astronomy Data
  • But…
  • How do I get at that 50% of the data?
  • Astronomers have a culture of publishing.
  • FITS files and many tools: http://fits.gsfc.nasa.gov
  • Encouraged by NASA.
  • Publishing data details is difficult.
    Astronomers want to do it, but it is VERY
    hard. (What programs were used? What were the
    processing steps? How were errors treated?)
  • The file is the wrong abstraction.

Virtual Observatory http://
  • Premise: most data is (or could be) online
  • So, the Internet is the world's best telescope:
  • It has data on every part of the sky
  • In every measured spectral band: optical, x-ray, …
  • As deep as the best instruments (of 1 year ago).
  • It is up when you are up. The seeing is always
    great (no working at night, no clouds, no moons, …)
  • It's a smart telescope: it links objects and
    data to the literature on them.

Virtual Observatory Golden Age of Mega-Surveys
  • Many new surveys
  • multi-TB in size, 100 million objects or more
  • individual archives planned, or under way
  • Data publication is an integral part of the survey
  • The software bill is a major cost of the survey
  • Multi-wavelength view of the sky
  • more than 13 wavelength coverage in 5 years
  • Impressive early discoveries
  • finding exotic objects by unusual colors
  • L,T dwarfs, high-z quasars
  • finding objects by time variability
  • gravitational micro-lensing

Slide courtesy of Alex Szalay, modified by jim
Virtual Observatory Federating the Archives
  • The next generation mega-surveys are different
  • top-down design
  • large sky coverage
  • sound statistical plans
  • well controlled/documented data processing
  • Each survey has a publication plan
  • Data mining will lead to stunning new discoveries
  • Federating these archives
  • => the Virtual Observatory

Slide courtesy of Alex Szalay
The Multiwavelength Crab Nebula
[Figure: the Crab Nebula in X-ray, optical, infrared, and radio. The nova was first sighted in 1054 A.D. by Chinese astronomers. Slide courtesy of Robert Brunner @ CalTech.]
Exploring Parameter Space
  • Given an arbitrary parameter space
  • Data Clusters
  • Points between Data Clusters
  • Isolated Data Clusters
  • Isolated Data Groups
  • Holes in Data Clusters
  • Isolated Points

Nichol et al. 2001
Slide courtesy of Robert Brunner @ CalTech.
Virtual Observatory and Education
  • In the beginning, science was empirical.
  • Then theoretical branches evolved.
  • Now we have computational branches.
  • The computational branch has been simulation;
  • it is becoming data analysis/visualization.
  • The Virtual Observatory can be used to:
  • Teach astronomy: make it interactive,
    demonstrate ideas and phenomena
  • Teach computational science skills and the
    process of scientific discovery

Sloan Digital Sky Survey http://
  • A group of astronomers has been building a
    telescope for the last 12 years (with $90M from the
    Sloan Foundation, NSF, and a dozen universities).
  • Now data is arriving:
  • 250GB/night (20 nights per year).
  • 100 M stars, 100 M galaxies, 1 M spectra.
  • Public data at http://
  • 5% of the survey: 600 sq degrees, 15 M objects
  • This data includes most of the known high-z quasars.
  • It has a lot of science left in it, but that is
    just the start.

Demo of Sky Server
  • Alex built SkyServer (based on TerraServer …)
  • http://

Demo: famous places, navigator, data shopping cart, spectrum, SQL
Virtual Observatory Challenges
  • Size: multi-petabyte
  • 40,000 square degrees is 2 trillion pixels
  • One band (at 1 sq arcsec): 4 terabytes
  • Multi-wavelength: 10-100 terabytes
  • Time dimension: >> 10 petabytes
  • Need auto-parallelism tools
  • Unsolved meta-data problem:
  • Hard to publish data & programs
  • Hard to find/understand data & programs
  • Current tools inadequate:
  • need new analysis & visualization tools
  • Transition to the new astronomy
  • Sociological issues

The Challenges
  • How to federate the archives to make a VO?
  • The hope: XML is the answer.
  • The reality: XML is syntax and tools. FITS
    on XML will be good, but explaining the data
    will still be very difficult.
  • Define Astronomy Objects and Methods.
  • Based on UDDI, WSDL, SOAP.
  • Each archive is a service
  • http:// shows the idea.
  • Working with Caltech (Brunner, Williams,
    Djorgovski, Bunn)
  • But, how does data mining work?

Three Steps to a VO 0.01
  • Get SDSS and Palomar online
  • Alex Szalay, Jan Vandenberg, Ani Thakar.
  • Roy Williams, Robert Brunner, Julian Bunn
  • Do queries and crossID matches with CalTech and
    SDSS to expose:
  • schema, units, …
  • dataset problems,
  • the typical use scenarios.
  • Implement WebServices at CalTech and SDSS

  • All information at your fingertips.
  • How do we publish information so that our
    agents can digest it?
  • Example: TerraServer -> TerraService
  • The Virtual Observatory concept:
  • The Internet is the world's best telescope
  • For astronomy
  • For teaching astronomy, and
  • For teaching computational science

  • Store Everything
  • Online (Disk not Tape)
  • In a Database
  • A Federated DB
  • Two Examples

No time for what follows
SDSS: what I have been doing
  • Working with Alex Szalay, Don Slutz, and others to
    define 20 canonical queries and 10 visualizations.
  • Don Slutz did a first cut of the queries; I'm
    continuing that work.
  • Working with Alex Szalay on building Sky Server
    and making its data public (sending out 80GB
    SQL DBs)

Two kinds of SDSS data
  • 15M Photo Objects with 400 attributes
  • 20K Spectra with 10 lines per spectrum
Spatial Data Access (Szalay, Kunszt, Brunner)
http:// (look at the HTM)
  • Implemented the Hierarchical Triangular Mesh (HTM) as a
    table-valued function for spatial joins.
  • Every object has a 20-deep mesh ID.
  • Given a spatial definition, the routine returns up to
    500 covering triangles.
  • A spatial query is then up to 500 range queries.
  • Very fast: 1,000s of triangles per second.
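The key trick is that a covering triangle owns a contiguous block of leaf-level mesh IDs, so each triangle becomes one index range scan. A minimal sketch of that ID arithmetic (illustrative only; the real HTM geometry, the cover computation, and the 500-triangle limit are not reproduced here):

```python
import bisect

DEPTH = 20  # leaf mesh depth; each level splits a triangle into 4 children

def triangle_to_range(tri_id, depth):
    # A triangle at `depth` owns a contiguous block of leaf IDs because
    # its children are numbered 4*parent .. 4*parent + 3 at each level.
    shift = 2 * (DEPTH - depth)
    return tri_id << shift, ((tri_id + 1) << shift) - 1

def objects_in_cover(sorted_leaf_ids, cover):
    # cover: list of (tri_id, depth) covering triangles. Each triangle
    # becomes one range scan (here, two binary searches) on the index.
    hits = []
    for tri_id, depth in cover:
        lo, hi = triangle_to_range(tri_id, depth)
        i = bisect.bisect_left(sorted_leaf_ids, lo)
        j = bisect.bisect_right(sorted_leaf_ids, hi)
        hits.extend(sorted_leaf_ids[i:j])
    return hits
```

So "up to 500 covering triangles" translates into at most 500 B-tree range scans, which is why the spatial join runs fast inside the database.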

The 20 Queries
  • Q1: Find all galaxies without unsaturated pixels
    within 1' of a given point (ra=75.327, …).
  • Q2: Find all galaxies with blue surface
    brightness between 23 and 25 mag per square
    arcsecond, -10<super galactic latitude (sgb)<10,
    and declination less than zero.
  • Q3: Find all galaxies brighter than magnitude 22,
    where the local extinction is >0.75.
  • Q4: Find galaxies with an isophotal surface
    brightness (SB) larger than 24 in the red band,
    with an ellipticity>0.5, and with the major axis
    of the ellipse having a declination of between
    30 and 60 arc seconds.
  • Q5: Find all galaxies with a deVaucouleurs
    profile (r^1/4 falloff of intensity on disk) and
    photometric colors consistent with an elliptical
    galaxy.
  • Q6: Find galaxies that are blended with a star;
    output the deblended galaxy magnitudes.
  • Q7: Provide a list of star-like objects that are
    1% rare.
  • Q8: Find all objects with unclassified spectra.
  • Q9: Find quasars with a line width >2000 km/s and …
  • Q10: Find galaxies with spectra that have an
    equivalent width in Ha >40Å (Ha is the main
    hydrogen spectral line).
  • Q11: Find all elliptical galaxies with spectra
    that have an anomalous emission line.
  • Q12: Create a gridded count of galaxies with u-g>1
    and r<21.5 over 60<declination<70 and 200<right
    ascension<210, on a grid of 2', and create a map
    of masks over the same grid.
  • Q13: Create a count of galaxies for each of the
    HTM triangles which satisfy a certain color cut,
    like 0.7u-0.5g-0.2i<1.25 and r<21.75; output it in
    a form adequate for visualization.
  • Q14: Find stars with multiple measurements that
    have magnitude variations >0.1. Scan for stars
    that have a secondary object (observed at a
    different time) and compare their magnitudes.
  • Q15: Provide a list of moving objects consistent
    with an asteroid.
  • Q16: Find all objects similar to the colors of a
    quasar at 5.5<redshift<6.5.
  • Q17: Find binary stars where at least one of them
    has the colors of a white dwarf.
  • Q18: Find all objects within 30 arcseconds of one
    another that have very similar colors: that is,
    where the color ratios u-g, g-r, r-i are less
    than 0.05m.
  • Q19: Find quasars with a broad absorption line in
    their spectra and at least one galaxy within 10
    arcseconds. Return both the quasars and the galaxies.
  • Q20: For each galaxy in the BCG data set
    (brightest cluster galaxy), in 160<right
    ascension<170 and -25<declination<35, count the
    galaxies within 30" of it that have a photoz
    within 0.05 of that galaxy.

Also some good queries at http://
An easy one: Q7 Provide a list of star-like
objects that are 1% rare.
  • Found 14,681 buckets; the first 140 buckets have
    99%. Time: 104 seconds.
  • Disk bound, reads 3 disks at 68 MBps.

select cast((u-g) as int) as ug,
       cast((g-r) as int) as gr,
       cast((r-i) as int) as ri,
       cast((i-z) as int) as iz,
       count(*) as Population
from stars
where (u + g + r + i + z) < 150
group by cast((u-g) as int), cast((g-r) as int),
         cast((r-i) as int), cast((i-z) as int)
order by count(*)
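The same binning idea can be replayed on a toy table with Python's stdlib sqlite3 (illustration only: SQLite stands in for SQL Server, and the rows are invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table stars(u real, g real, r real, i real, z real)")
con.executemany("insert into stars values (?,?,?,?,?)", [
    (20.1, 19.0, 18.5, 18.2, 18.0),   # falls in color bucket (1,0,0,0)
    (20.2, 19.1, 18.6, 18.3, 18.1),   # same bucket
    (25.0, 21.0, 18.0, 17.0, 16.5),   # rarer colors: bucket (4,3,1,0)
])

# Bucket stars by integer color differences and count each bucket's
# population; "rare" objects live in the low-population buckets.
buckets = con.execute("""
    select cast(u-g as int) as ug, cast(g-r as int) as gr,
           cast(r-i as int) as ri, cast(i-z as int) as iz,
           count(*) as Population
    from stars
    where u + g + r + i + z < 150
    group by ug, gr, ri, iz
    order by Population desc
""").fetchall()
```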
Another easy one: Q15 Provide a list of moving
objects consistent with an asteroid.
  • Looks hard, but there are 5 pictures of the object
    at 5 different times (colors), so velocity can be computed.
  • The image pipeline computes velocity.
  • Computing it from the 5-color x,y positions would also be possible.
  • Finds 2167 objects in 7 minutes, 70MBps.

select object_id,                              -- return object ID
       sqrt(power(rowv,2) + power(colv,2)) as velocity
from sxPhotObj                                 -- check each object
where (power(rowv,2) + power(colv,2)) > 50     -- square of velocity
  and rowv > 0 and colv > 0                    -- negative values indicate error
A Hard One: Q14 Find stars with multiple
measurements that have magnitude variations >0.1.
  • This should work, but SQL Server does not allow
    table values to be piped to table-valued functions.

A Hard One, Second Try: Q14 Find stars with
multiple measurements that have magnitude
variations >0.1.
  • Wrote a program with a cursor; it ran for 2 days.

-----------------------------
-- Table-valued function that returns the binary stars within a
-- certain radius of another (in arc-minutes)
-- (typically 5 arc seconds).
-- Returns the ID pairs and the distance between them (in arcseconds).
create function BinaryStars(@MaxDistanceArcMins float)
returns @BinaryCandidatesTable table(
    S1_object_ID bigint not null,   -- Star 1
    S2_object_ID bigint not null,   -- Star 2
    distance_arcSec float)          -- distance between them
as
begin
  declare @star_ID bigint, @binary_ID bigint   -- Star's ID and binary ID
  declare @ra float, @dec float                -- Star's position
  declare @u float, @g float, @r float, @i float, @z float  -- Star's colors
  ---------------- Open a cursor over stars and get position and colors
  declare star_cursor cursor for
      select object_ID, ra, dec, u, g, r, i, z from Stars
  open star_cursor
  while (1=1)      -- for each star
  begin            -- get its attributes
    fetch next from star_cursor into
        @star_ID, @ra, @dec, @u, @g, @r, @i, @z
    if (@@fetch_status = -1) break              -- end if no more stars
    insert into @BinaryCandidatesTable          -- insert its binaries
    select @star_ID, S1.object_ID,              -- return star pairs
           sqrt(N.DotProd)/PI()*10800           -- and distance in arc-seconds
    from getNearbyObjEq(@ra, @dec,              -- find objects nearby S,
                        @MaxDistanceArcMins) as N,  -- call them N
         Stars as S1                            -- S1 gets N's color values
    where @star_ID < N.Object_ID                -- S1 different from S
      and N.objType = dbo.PhotoType('Star')     -- S1 is a star
      and N.object_ID = S1.object_ID            -- join stars to get colors of S1=N
      and (   abs(@u-S1.u) > 0.1                -- one of the colors is different
           or abs(@g-S1.g) > 0.1
           or abs(@r-S1.r) > 0.1
           or abs(@i-S1.i) > 0.1
           or abs(@z-S1.z) > 0.1)
  end              -- end of loop over all stars
  -------------- Looped over all stars, close cursor and exit.
  close star_cursor
  -- deallocate star_cursor
  return           -- return table
end                -- end of BinaryStars
GO
select * from dbo.BinaryStars(.05)
A Hard One, Third Try: Q14 Find stars with
multiple measurements that have magnitude
variations >0.1.
  • Use the pre-computed neighbors table.
  • Ran in 17 minutes, found 31k pairs.

-- Plan 2: use the precomputed neighbors table
select top 100
       S.object_ID, S1.object_ID,                -- return star pairs
       str(N.Distance_mins * 60, 6, 1) as DistArcSec  -- and distance
from Stars S,            -- S is a star
     Neighbors N,        -- N within 3 arcsec (10 pixels) of S
     Stars S1            -- S1 = N has the color attributes
where S.Object_ID = N.Object_ID                  -- connect S and N
  and S.Object_ID < N.Neighbor_Object_ID         -- S1 different from S
  and N.Neighbor_objType = dbo.PhotoType('Star') -- S1 is a star (an optimization)
  and N.Distance_mins < .05                      -- the 3 arcsecond test
  and N.Neighbor_object_ID = S1.Object_ID        -- N = S1
  and (   abs(S.u-S1.u) > 0.1                    -- one of the colors is different
       or abs(S.g-S1.g) > 0.1
       or abs(S.r-S1.r) > 0.1
       or abs(S.i-S1.i) > 0.1
       or abs(S.z-S1.z) > 0.1)
-- Found 31,355 pairs (out of 4.4 M stars) in 17 min 14 sec.
The Pain of Going Outside SQL (it's fortunate that
all the queries are single statements)
  • Count parent objects.
  • As a single statement:
  • 503 seconds for 14.7 M objects in 33.3 GB
  • 66 MBps
  • IO bound (30% of one cpu)
  • 100 k records/cpu sec
  • Use a cursor:
  • No cpu parallelism
  • CPU bound
  • 6 MBps, 2.7 k rps
  • 5,450 seconds (10x slower)

declare @count int
declare @sum int
set @sum = 0
declare PhotoCursor cursor for
    select nChild from sxPhotoObj
open PhotoCursor
while (1=1)
begin
    fetch next from PhotoCursor into @count
    if (@@fetch_status = -1) break
    set @sum = @sum + @count
end
close PhotoCursor
deallocate PhotoCursor
print 'Sum is ' + cast(@sum as varchar(12))

select count(*) from sxPhotoObj where nChild > 0
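The contrast is easy to reproduce in miniature with stdlib sqlite3 (table and column names follow the slide; the data is invented): the row-at-a-time loop and the single aggregate statement compute the same sum, but only the latter leaves the work inside the engine where it can be parallelized and streamed.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table sxPhotoObj(nChild int)")
con.executemany("insert into sxPhotoObj values (?)",
                [(n,) for n in (0, 2, 1, 0, 3)])

# Cursor style: drag every row out of the engine, add them up client-side.
total = 0
for (n,) in con.execute("select nChild from sxPhotoObj"):
    total += n

# Set-oriented: one statement, aggregation stays inside the engine.
(sum_sql,) = con.execute("select sum(nChild) from sxPhotoObj").fetchone()
(parents,) = con.execute(
    "select count(*) from sxPhotoObj where nChild > 0").fetchone()
```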
Summary of Current Status
  • All 20 queries run (still checking the science)
  • Also 15 user queries
  • Run times on a $3k PC (2 cpu, 4 disk, 256MB):

[Table: roughly 100 IO per cpu-second, 5MB per cpu-second]
Summary of Current Status
  • 16 of the queries are simple;
  • 2 are iterative, 2 complex.
  • Many are sequential one-pass and two-pass over the data.
  • Covering indices make scans run fast.
  • Table-valued functions are wonderful, but
    limitations on parameters are a pain.
  • Counting is VERY common.
  • Binning (grouping by some set of attributes) is common.
  • Did not request cube, but that may be cultural.

Reflections on the 20 Queries
  • Data loading/scrubbing is labor intensive.
  • This is 5% of the data, and some queries take an hour.
  • But this is not tuned (disk bound).
  • All queries benefit from parallelism (both disk
    and cpu), if you can state the query right (e.g.
    inside SQL).
  • Parallel database machines will do great on this:
  • Hash machines
  • Data pumps
  • See the paper (in Word or PDF) on my web site.
  • SQL looks good. The answers need checking.

Call to Action
  • If you do data visualization: we need you (and we
    know it).
  • If you do databases: here is some data you can
    practice on.
  • If you do distributed systems: here is a
    federation you can practice on.
  • If you do astronomy educational outreach: here is
    a tool for you.
  • The astronomy folks are very good, and very
    smart, and a pleasure to work with, and the
    questions are cosmic, so …