1
The Pan-STARRS Data Challenge
  • Jim Heasley
  • Institute for Astronomy
  • University of Hawaii

2
What is Pan-STARRS?
  • Pan-STARRS - a new telescope facility
  • 4 smallish (1.8m) telescopes, but with extremely
    wide field of view
  • Can scan the sky rapidly and repeatedly, and can
    detect very faint objects
  • Unique time-resolution capability
  • Project led by the IfA with help from the Air
    Force, the Maui High Performance Computing
    Center, and MIT's Lincoln Lab
  • The prototype, PS1, will be operated by an
    international consortium

3
Pan-STARRS Overview
  • Pan-STARRS observatory specifications
  • Four 1.8m R-C telescopes with wide-field
    correctors
  • 7 square degree FOV, 1.4 Gpixel cameras
  • Sited in Hawaii
  • Etendue AΩ ≈ 50 m² deg²
  • R ≈ 24 in a 30 s integration
  • > 7000 square deg/night
  • All sky deep field surveys in g,r,i,z,y
  • Time domain astronomy
  • Transient objects
  • Moving objects
  • Variable objects
  • Static sky science
  • Enabled by stacking repeated scans to form a
    collection of ultra-deep static sky images (depth
    gain worked out below)
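
  A rough worked example of the depth gain from stacking (assuming
  background-limited noise; the visit count is illustrative): co-adding
  N exposures improves point-source depth by about

    2.5 log10(√N) = 1.25 log10(N) mag,

  so a stack of 60 of the 30 s visits reaches roughly 2.2 mag deeper
  than a single visit.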

4
The Published Science Products Subsystem (PSPS)
5
(No Transcript)
6
(No Transcript)
7
Front of the Wave
  • Pan-STARRS is only the first of a new generation
    of astronomical data programs that will generate
    such large volumes of data
  • SkyMapper, southern hemisphere optical
  • VISTA, southern hemisphere IR survey
  • LSST, an all sky survey like Pan-STARRS
  • Eventually, these data sets will be useful for
    data mining.

8
(No Transcript)
9
PS1 Data Products
  • Detections: measurements obtained directly from
    processed image frames
  • Detection catalogs
  • Stacks of the sky images: source catalogs
  • Difference catalogs
  • High significance (> 5σ) transient events
  • Low significance (transients between 3 and 5σ)
  • Other image stacks (Medium Deep Survey)
  • Objects: aggregates derived from detections

10
What's the Challenge?
  • At first blush, this looks pretty much like the
    Sloan Digital Sky Survey
  • BUT
  • Size: over its 3-year mission, PS1 will record
    over 150 billion detections for approximately 5.5
    billion sources (rates worked out below)
  • Dynamic Nature: new data will always be coming
    into the database system, both for objects we've
    seen before and for new discoveries
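
  To make those totals concrete as rates (assuming uniform arrival
  over the mission):

    150×10⁹ detections / (3 yr × 3.15×10⁷ s/yr) ≈ 1,600 detections/s
    150×10⁹ detections / 5.5×10⁹ sources ≈ 27 detections per source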

11
How to Approach This Challenge
  • There are many possible approaches to deal with
    this data challenge.
  • Shared what?
  • Memory
  • Disk
  • Nothing
  • Not all of these approaches are created equal in
    cost or performance (DeWitt & Gray 1992,
    "Parallel Database Systems: The Future of High
    Performance Database Processing")

12
Conversation with the Pan-STARRS Project Manager
  • Jim: "Tom, what are we going to do if the
    solution proposed by TBJD is more than you can
    afford?"
  • Tom: "Jim, I'm sure you'll think of something!"
  • Not long after that, TBJD did give us a
    hardware/software plan we couldn't afford.
    Shortly thereafter, Tom resigned from the project
    to pursue other activities
  • The Pan-STARRS project teamed up with Alex and
    his database team at JHU

13
Building upon the SDSS Heritage
  • In teaming up with the group at JHU, we hoped to
    build upon the experience and software developed
    for the SDSS.
  • A key question was how we could scale the system
    to deal with the volume of data expected from PS1
    (> 10× SDSS in the first year alone).
  • The second key question was whether the system
    could keep up with the data flow.
  • The heritage is more one of philosophy than
    recycled software, as to deal with the challenges
    posed by PS1 we've had to generate a great deal
    of new code.

14
The Object Data Manager
  • The Object Data Manager (ODM) was considered to
    be the long pole in the development of the PS1
    PSPS.
  • Parallel database systems can provide both data
    redundancy and a way to spread very large tables
    that can't fit on a single machine across
    multiple storage volumes.
  • For PS1 (and beyond) we need both.

15
Distributed Architecture
  • The bigger tables will be spatially partitioned
    across servers called Slices
  • Using slices improves system scalability
  • Tables are sliced into ranges of ObjectID, which
    correspond to broad declination ranges
  • ObjectID boundaries are selected so that each
    slice has a similar number of objects
  • Distributed Partitioned Views glue the data
    together (see the sketch below)
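
  A minimal sketch of the idea in SQL Server's T-SQL (all server,
  database, table, and range names here are hypothetical, not the real
  PSPS schema): each slice holds one ObjectID range, enforced by a
  CHECK constraint, and the head node's view unions the member tables
  across linked servers. The optimizer uses the CHECK constraints to
  route an objID predicate to the one slice that can satisfy it.

    -- On each slice server: a member table covering one objID range
    CREATE TABLE dbo.Objects_Slice01 (
        objID  bigint NOT NULL PRIMARY KEY
               CHECK (objID BETWEEN 0 AND 49999999999),
        ra     float  NOT NULL,
        dec    float  NOT NULL
    );

    -- On the head node: the distributed partitioned view
    CREATE VIEW dbo.Objects
    AS
    SELECT objID, ra, dec FROM Slice01.PS1.dbo.Objects_Slice01
    UNION ALL
    SELECT objID, ra, dec FROM Slice02.PS1.dbo.Objects_Slice02;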

17
Design Decisions: ObjID
  • Objects have their positional information encoded
    in their objID
  • fGetPanObjID (ra, dec, zoneH)
  • ZoneID is the most significant part of the ID
  • objID is the Primary Key
  • Objects are organized (clustered index) so that
    objects nearby on the sky are also stored nearby
    on disk
  • This gives good search performance, spatial
    functionality, and scalability (a sketch of the
    encoding follows below)
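
  The slides don't give the exact bit layout of fGetPanObjID, so the
  following T-SQL is only a hypothetical sketch of the scheme they
  describe: the declination zone goes into the most significant digits,
  with an RA-ordered offset below it, so the clustered primary key on
  objID keeps each zone contiguous on disk.

    CREATE FUNCTION dbo.fGetPanObjID (@ra float, @dec float, @zoneH float)
    RETURNS bigint
    AS
    BEGIN
        -- zone index from declination (zoneH = zone height in degrees);
        -- this becomes the most significant part of the ID
        DECLARE @zoneID bigint = FLOOR((@dec + 90.0) / @zoneH);
        -- pack an RA-ordered offset below the zone index
        RETURN @zoneID * 1000000000 + FLOOR(@ra * 1000000.0);
    END;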

18
Pan-STARRS Data Flow
19
Pan-STARRS Data Layout
[Diagram: CSV batches of L1 and L2 data flow from the Image Pipeline
into six Load-Merge nodes. The loaded data populate sixteen slice
databases (S1-S16), held as COLD, HOT, and WARM copies distributed
across eight Slice nodes, while two Head nodes hold the Main databases
and expose the Distributed View over the slices.]
20
The ODM Infrastructure
  • Much of our software development has gone into
    extending the ingest pipeline developed for SDSS.
  • Unlike SDSS, we don't have campaign loads but a
    steady flow of data from the telescope through
    the Image Processing Pipeline to the ODM.
  • We have constructed data workflows to deal with
    both the regular data flow into the ODM and
    anticipated failure modes (loss of a disk, a RAID
    volume, or various server nodes); a sample load
    step is sketched below.
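
  One hypothetical illustration of a single load-merge step in this
  style of pipeline (staging a CSV batch, then merging it into a
  slice); the table, column, path, and server names are made up, and
  the production workflow adds validation and failure handling:

    -- Stage one CSV batch from the Image Processing Pipeline
    BULK INSERT LoadDB.dbo.Detections_Stage
    FROM '\\ipp\outgoing\batch_0001.csv'
    WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', TABLOCK);

    -- After validation, merge the batch into its destination slice
    -- over the linked server
    INSERT INTO Slice01.PS1.dbo.Detections (detectID, objID, obsTime, mag)
    SELECT detectID, objID, obsTime, mag
    FROM LoadDB.dbo.Detections_Stage;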

21
Pan-STARRS Object Data Manager Subsystem
22
What Next?
  • Will this approach scale to our needs?
  • PS1: yes. But we already see the need for better
    parallel-processing query plans.
  • PS4: unclear! Even though I'm not from Missouri,
    show me! One year of PS4 produces a greater data
    volume than the entire 3-year PS1 mission!
  • Cloud computing?
  • How can we test issues like scalability without
    actually building the system?
  • Does each project really need its own data
    center?
  • Having these databases in the cloud may greatly
    facilitate data sharing/mining.

23
Finally, Thanks
  • To Alex for stepping in, hosting the development
    system at JHU, and building up his core team to
    construct the ODM, especially
  • Maria Nieto-Santisteban
  • Richard Wilton
  • Susan Werner
  • And at Microsoft to
  • Michael Thomassy
  • Yogesh Simmhan
  • Catharine van Ingen