1
Large-Scale Cross-Matching with Open SkyQuery
  • María Nieto-Santisteban
  • Ani Thakar
  • Alex Szalay, et al.
  • The Johns Hopkins University

AISRP 2008 @ College Park, University of Maryland
2
Goals
  • Implement an access and cross-matching engine
    that facilitates access to large digital archives
    and enables new scientific discoveries by
    cross-correlating multi-wavelength datasets

3
Handling the Large-Scale
  • The 20 Spatial Queries
  • Partitioning & Parallelization
  • Asynchronous Data Access
  • Efficient Cross-Match
  • Workflow Management
  • Cluster Management
  • Data transport

4
The 20 Spatial Queries
  • Single/Multi Catalog Regions
  • Cone search: Find objects within a circle
  • Find objects within a circle satisfying highly
    multi-dimensional constraints
  • Find the closest neighbor
  • Find objects within a region
  • Find objects in/outside masked regions
  • Find objects near the edges of a region
  • Compute the area of a region
  • Find surveys covering a given region
  • Find the intersection between several surveys
  • Count objects from a list of regions
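
The cone search and closest-neighbor queries listed above can be sketched in a few lines of Python. This is only an illustration, assuming a catalog held in memory as a list of (ra, dec) pairs in degrees; the function names are hypothetical and are not part of Open SkyQuery.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle distance in degrees between two sky positions (haversine)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a))))


def cone_search(catalog, ra0, dec0, radius_deg):
    """Return the catalog rows within radius_deg of (ra0, dec0)."""
    return [(ra, dec) for ra, dec in catalog
            if angular_separation_deg(ra0, dec0, ra, dec) <= radius_deg]


def closest_neighbor(catalog, ra0, dec0):
    """Return the catalog row nearest to (ra0, dec0)."""
    return min(catalog,
               key=lambda row: angular_separation_deg(ra0, dec0, row[0], row[1]))
```

Both functions scan the whole catalog, so they show what the queries mean rather than how to run them at scale.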

5
The 20 Spatial Queries
  • Find these 1k - 100k objects in these catalogs
  • For all catalogs, extract a random sample of
    existing objects within a given region
  • Cross-match 2 catalogs within a given region
  • Cross-match n catalogs, n > 2, within a given
    region
  • Find objects which are in A, B, and C but not in
    D
  • Given a sparse grid, find the closest grid point
    for all objects in the catalog
  • Find multiple detections of the same object with
    given magnitude variations
  • Find all quasars within a region and compute
    their distance to surrounding galaxies
  • More ... open to discussion

6
Partitioning & Parallelization
  • Zones (spatial partitioning and indexing
    algorithm)
  • Partition and bin the data in declination zones
  • ZoneID = floor((dec + 90.0) / zoneHeight)
  • Some tricks required to handle spherical geometry
  • Place the data close on disk
  • Clustered Index on ZoneID and RA
  • Fully implemented in SQL
  • Efficient
  • Cone searches
  • Cross-Match (especially)
  • Enables Parallelization
  • Execute the query on a data partition
  • Partition the query and execute it on the full
    dataset
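
A minimal Python sketch of the zone idea on this slide, for illustration only. The slide states that the real implementation is in SQL; here a sorted list per zone stands in for the clustered index on ZoneID and RA. All function names are hypothetical. The RA window is widened with a simple 1/cos(dec) bound, which is an assumption that holds away from the poles, and the sketch does not handle RA wrap-around at 0/360 degrees (two of the spherical-geometry "tricks" the slide mentions).

```python
import math
from bisect import bisect_left, bisect_right
from collections import defaultdict


def zone_id(dec, zone_height):
    """ZoneID = floor((dec + 90.0) / zoneHeight), with dec in degrees."""
    return int(math.floor((dec + 90.0) / zone_height))


def build_zones(catalog, zone_height):
    """Bin (ra, dec) rows by zone and sort each zone by RA."""
    zones = defaultdict(list)
    for ra, dec in catalog:
        zones[zone_id(dec, zone_height)].append((ra, dec))
    for rows in zones.values():
        rows.sort()  # stands in for the clustered index on (ZoneID, RA)
    return zones


def separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle distance in degrees (haversine)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a))))


def zone_cone_search(zones, zone_height, ra0, dec0, radius):
    """Find rows within `radius` degrees of (ra0, dec0) using the zone bins."""
    lo_zone = zone_id(max(dec0 - radius, -90.0), zone_height)
    hi_zone = zone_id(min(dec0 + radius, 90.0), zone_height)
    # Assumed simplification: widen the RA window by 1/cos(dec); no RA wrap.
    widest_dec = min(abs(dec0) + radius, 89.9)
    ra_half = radius / math.cos(math.radians(widest_dec))
    hits = []
    for z in range(lo_zone, hi_zone + 1):
        rows = zones.get(z, [])
        start = bisect_left(rows, (ra0 - ra_half, -91.0))
        stop = bisect_right(rows, (ra0 + ra_half, 91.0))
        for ra, dec in rows[start:stop]:
            # Exact distance test on the few candidates left in the window.
            if separation_deg(ra0, dec0, ra, dec) <= radius:
                hits.append((ra, dec))
    return hits


def cross_match(cat_a, zones_b, zone_height, radius):
    """Pair every row of A with the rows of B that lie within `radius`."""
    return [(a, b) for a in cat_a
            for b in zone_cone_search(zones_b, zone_height, a[0], a[1], radius)]
```

Only the zones that overlap the declination band of the search are visited, and within each zone only the RA window is scanned, which is why the same layout serves both cone searches and cross-matching.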

7
CasJobs, Asynchronous Data Access
  • Solution to the increasing size of, and demand
    for, SDSS data
  • Astronomer's workbench
  • Unlimited queries against the large SDSS
    databases
  • Minimize data movement
  • Personal database, MyDB, under the user's full
    control
  • Full power to create tables, stored procedures,
    functions, load personal data, etc.
  • Collaborative environment
  • Easy access to prior data releases
  • Job tracking system
  • Accessed through a Graphical User Interface
  • Accessed through a web service (WS) interface
  • Not exclusive to SDSS or astronomy!

8
Efficient Cross-Matching
  • Matching stars between 2MASS and SDSS DR5: 74 M
    x 54 M rows, 4.5 h instead of 2 days
  • 1 degree match between SDSS DR5 and a sparse
    grid: 350 M x 50 k rows, 7 h instead of 1 year
  • LSST simulations for alert detection: 6 M x
    125 k rows, 40 s
  • Pan-STARRS on-the-fly association: 1.3 billion
    objects x 120 million detections, 1.5 h

9
Graywulf
  • Date: Tue, 2 Nov 2004 14:26:37 -0800
  • From: Jim Gray <gray@microsoft.com>
  • To: Maria A. Nieto-Santisteban
    <nieto@skysrv.pha.jhu.edu>
  • Subject: RE: Scaleout
  • I think your cluster finding work, the loader,
    the sector stuff, the match stuff, ... all are
    examples of map-reduce.
  • I would like to build a system to describe these
    parallel workflows and run them on a replicated
    database, then take the outputs and glue them
    together (map-reduce).
  • Make sense?
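
The map-reduce pattern described in the email can be sketched as follows. This is an assumption-laden illustration, not Graywulf itself: a thread pool stands in for the database cluster, the catalogs are in-memory lists of (ra, dec) pairs in degrees, and every function name is hypothetical. Catalog A is split into declination bands, each band is matched independently (map), and the partial results are concatenated (reduce). Each band receives the rows of B within `radius` of its edges so that matches across band boundaries are not lost.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle distance in degrees (haversine)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a))))


def match_partition(task):
    """Map step: brute-force match of one band of A against its slice of B."""
    part_a, part_b, radius = task
    return [(a, b) for a in part_a for b in part_b
            if separation_deg(a[0], a[1], b[0], b[1]) <= radius]


def partition_tasks(cat_a, cat_b, radius, band_height):
    """Split A into declination bands and attach the B rows each band needs."""
    bands = {}
    for row in cat_a:
        bands.setdefault(int((row[1] + 90.0) // band_height), []).append(row)
    tasks = []
    for band, part_a in sorted(bands.items()):
        # Pad the band by `radius` so edge objects still find their partners.
        lo = -90.0 + band * band_height - radius
        hi = -90.0 + (band + 1) * band_height + radius
        part_b = [row for row in cat_b if lo <= row[1] <= hi]
        tasks.append((part_a, part_b, radius))
    return tasks


def parallel_cross_match(cat_a, cat_b, radius, band_height=10.0, workers=4):
    """Run the map step on every band, then glue the outputs together."""
    tasks = partition_tasks(cat_a, cat_b, radius, band_height)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_results = list(pool.map(match_partition, tasks))  # map
    return [pair for chunk in partial_results for pair in chunk]  # reduce
```

Because every row of A belongs to exactly one band, the concatenated output contains each matching pair once.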

10
Graywulf
Diagram components: User, SkyQuery, Graywulf, HP cluster, DB cluster
11
Architecture
Diagram components: Cluster Manager (CLM), Workflow Manager (WFM), Performance Monitor, Application, Query Manager (QM), Web Based Interface (WBI)
12
Open SkyQuery Next Generation
Diagram components: Cluster Manager (CLM), Workflow Manager (WFM), Performance Monitor, Linked Servers (2MASS, SDSS, myDB, VOSpace), MatchDB1 ... MatchDBn, OSQ, Query Manager (QM), DRL, Web Based Interface (WBI)
13
Pan-STARRS
Diagram components: Cluster Manager (CLM), Workflow Manager (WFM), Performance Monitor, Linked Servers, partitions P1 ... Pm (each with Objects_p, Detections_p, Meta), PS1 database (Objects, Detections, Meta), Query Manager (QM), DRL, Web Based Interface (WBI)
Legend: database, full table, partitioned table, partitioned view
14
Pan-STARRS Prototype in Context
15
Pan-STARRS Prototypes
SDSS includes a mirror of 11.3 < ? < 30
objects to ? < 0
  • Total CSV data loaded: 300 GB
  • CSV bulk insert load: 8 MB/s
  • Binary bulk insert: 18-20 MB/s
  • Creation started: October 15th 2007
  • Finished: October 29th 2007
  • Includes:
  • 10 epochs of single image detections (2 x 5
    filters)
  • 5 epochs of stack detections (1 x 5
    filters)
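
For scale, the bulk-insert rates quoted on this slide can be turned into raw transfer times for the 300 GB of CSV data. This is a back-of-the-envelope sketch that assumes 1 GB = 1000 MB and a single sustained stream; it covers only the raw load, not the rest of the database creation, and the function name is hypothetical.

```python
def load_hours(total_gb, rate_mb_per_s, mb_per_gb=1000):
    """Hours needed to move total_gb at a sustained rate in MB/s."""
    return total_gb * mb_per_gb / rate_mb_per_s / 3600.0


csv_hours = load_hours(300, 8)            # CSV bulk insert at 8 MB/s
binary_hours_slow = load_hours(300, 18)   # binary bulk insert, lower bound
binary_hours_fast = load_hours(300, 20)   # binary bulk insert, upper bound

print(f"CSV:    {csv_hours:.1f} h")
print(f"Binary: {binary_hours_fast:.1f} to {binary_hours_slow:.1f} h")
```

Under these assumptions the binary path is a little over twice as fast as the CSV path for the same volume.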

16
Size of PS1 Prototype Database
Table sizes are in GB
9.6 TB of data in a distributed database
17
Well-Balanced Partitions
18
VO Space @ JHU
  • - C# 2.0 implementation based on the new Windows
    Communication Foundation (WCF)
  • - Self-contained SQL Server 2005 backend
  • VOPipe Architecture
  • - Higher level services for data/work flows
  • - Basis for next generation VO services such as
    Open SkyQuery

19
Education and Public Outreach
  • Visualization tool for Open SkyQuery
  • Lesson plan for high school astronomy
  • Hubble Diagram with SDSS and GALEX

20
Summary
  • The 20 Spatial Queries
  • Partitioning & Parallelization
  • 10 TB distributed DB with well-balanced partitions
  • Asynchronous Data Access
  • CasJobs
  • Efficient Cross-Match
  • 1.3 billion x 120 million in 1.5 h
  • Work in progress
  • Workflow Management
  • Cluster Management
  • Data transport
  • 20 Spatial Queries benchmark

21
Related Publications
  • 20 Spatial Queries for an Astronomer's
    Bench(mark), M. Nieto-Santisteban, T. Scholl, A.
    Szalay, A. Kemper, in Proceedings of Astronomical
    Data Analysis Software and Systems XVII, London,
    UK, 23rd - 26th September 2007.
  • Probabilistic Cross-Identification of
    Astronomical Sources, T. Budavári, A. Szalay, and
    M. Nieto-Santisteban, in Proceedings of
    Astronomical Data Analysis Software and Systems
    XVII, London, UK, 23rd - 26th September 2007.
  • The Pan-STARRS Object Data Manager Database, J.
    Heasley, M. Nieto-Santisteban, A. Szalay, A.
    Thakar, 210th AAS Meeting, Honolulu, HI, USA,
    5th - 10th May 2007.
  • LSST, the Spatial Cross-Match Challenge, María A.
    Nieto-Santisteban, Alexander S. Szalay, Aniruddha
    R. Thakar, Jim Gray, in Proceedings of
    Astronomical Data Analysis Software and Systems
    XVI, Tucson, AZ, USA, 15th - 18th October 2006.
  • When Database Systems Meet the Grid, María A.
    Nieto-Santisteban, Jim Gray, Alexander Szalay,
    James Annis, Aniruddha R. Thakar, William J.
    O'Mullane, in Proceedings of ACM CIDR 2005,
    Asilomar, CA, January 2005.

22
Related Publications
  • Cross-matching Multiple Spatial Observations and
    Dealing with Missing Data, J. Gray, A. Szalay, T.
    Budavári, R. Lupton, M. Nieto-Santisteban, A.
    Thakar, Microsoft Technical Report
    MSR-TR-2006-175, December 2006.
  • The Zones Algorithm for Finding
    Points-Near-a-Point or Cross-Matching Spatial
    Datasets, Jim Gray, María A. Nieto-Santisteban,
    Alexander S. Szalay, Microsoft Technical Report
    MSR-TR-2006-52, April 2006.
  • Batch is back: CasJobs, serving multi-TB data on
    the Web, William O'Mullane, Nolan Li, María A.
    Nieto-Santisteban, Ani Thakar, Alexander S.
    Szalay, Jim Gray, in Proceedings of the
    2005 IEEE International Conference on Web
    Services (ICWS 2005), Orlando, FL, July 2005.
  • Large-Scale Query and XMatch, Entering the
    Parallel Zone, María A. Nieto-Santisteban,
    Aniruddha R. Thakar, Alexander S. Szalay, Jim
    Gray, in Astronomical Data Analysis Software and
    Systems XV, ASP Conference Series, Vol. 351,
    Proceedings of the Conference Held 2-5 October
    2005 in San Lorenzo de El Escorial, Spain. Edited
    by Carlos Gabriel, Christophe Arviset, Daniel
    Ponz, and Enrique Solano. San Francisco:
    Astronomical Society of the Pacific, 2006, p. 493.