Data Explosion: Science with Terabytes

Transcript and Presenter's Notes
1
Data Explosion Science with Terabytes
  • Alex Szalay, JHU and Jim Gray, Microsoft Research

2
Living in an Exponential World
  • Astronomers have a few hundred TB now
  • 1 pixel (byte) / sq arc second → 4TB
  • Multi-spectral, temporal → 1PB
  • They mine it looking for new (kinds of) objects
    or more of interesting ones (quasars),
    density variations in 400-D space, correlations
    in 400-D space
  • Data doubles every year
  • Data is public after 1 year
  • So, 50% of the data is public (quick check below)
  • Same access for everyone
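A quick back-of-the-envelope check of the 50% figure, assuming the archive volume doubles exactly once per year (a sketch, not from the slides):

```python
# If the cumulative data volume doubles every year, then everything collected
# up to one year ago (i.e. the already-public part) is half of today's total.
total_today = 2.0                    # arbitrary units
total_year_ago = total_today / 2.0   # one doubling earlier
print(total_year_ago / total_today)  # 0.5 -> about 50% of the data is public
```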

3
The Challenges
  • Data Collection: exponential data growth, distributed
    collections, soon Petabytes
  • Discovery and Analysis: new analysis paradigm, data
    federations, move analysis to the data
  • Publishing: new publishing paradigm, scientists are
    publishers and curators
4
New Science Data Exploration
  • Data growing exponentially in many different
    areas
  • Publishing so much data requires a new model
  • Multiple challenges for different communities
  • publishing, data mining, data visualization,
    digital library, educational, web services
    poster-child,
  • Information at your fingertips
  • Students see the same data as professional
    astronomers
  • More data coming: Petabytes/year by 2010
  • We need scalable solutions
  • Move analysis to the data!
  • Same thing happening in all sciences
  • High energy physics, genomics, cancer
    research, medical imaging, oceanography, remote
    sensing, ...
  • Data Exploration: an emerging new branch of
    science
  • Currently has no owner

5
Advances at JHU
  • Designed and built the science archive for the
    SDSS
  • Currently 2 Terabytes, soon to reach 3 TB
  • Built fast spatial search library
  • Created novel pipeline for data loading
  • Built the SkyServer, a public access website for
    SDSS, with over 45M web hits, millions of
    free-form SQL queries
  • Built the first web-services used in science
  • SkyQuery, ImgCutout, various visualization tools
  • Leading the Virtual Observatory effort
  • Heavy involvement in Grid Computing
  • Exploring other areas

6
Collaborative Projects
  • Sloan Digital Sky Survey (11 inst)
  • National Virtual Observatory (17 inst)
  • International Virtual Observatory Alliance (14
    countries)
  • Grid For Physics Networks (10 inst)
  • Wireless sensors for Soil Biodiversity (BES,
    Intel, UCB)
  • Digital Libraries (JHU, Cornell, Harvard,
    Edinburgh)
  • Hydrodynamic Turbulence (JHU Engineering)
  • Informal exchanges with NCBI

7
Directions
  • We understand how to mine a few terabytes
  • We built an environment; now our tools allow new
    breakthroughs in astrophysics
  • Open collaborations beyond astrophysics
    (turbulence, sensor-driven biodiversity,
    bioinformatics, digital libraries, education, ...)
  • Attack problems on 100 Terabyte scale, prepare
    for the Petabytes of tomorrow

8
The JHU Core Group
  • Faculty
  • Alex Szalay
  • Ethan Vishniac
  • Charles Meneveau
  • Graduate Students
  • Tanu Malik
  • Adrian Pope
  • Postdoctoral Fellows
  • Tamas Budavari
  • Research Staff
  • George Fekete
  • Vivek Haridas
  • Nolan Li
  • Will O'Mullane
  • Maria Nieto-Santisteban
  • Jordan Raddick
  • Anirudha Thakar
  • Jan Vandenberg

9
Examples
  • Astrophysics inside the database
  • Technology sharing in other areas
  • Beyond Terabytes

10
I. Astrophysics in the DB
  • Studies of galaxy clustering
  • Budavari, Pope, Szapudi
  • Spectro Service: publishing spectral data
  • Budavari, Dobos
  • Cluster finding with a parallel DB-oriented
    workflow system
  • Nieto-Santisteban, Malik, Thakar, Annis, Sekhri
  • Complex spatial computations inside the DB
  • Fekete, Gray, Szalay
  • Visual tools with the DB
  • ImgCutout (Nieto), Geometry viewer (Szalay),
    MirageSQL (Carlisle)

11
The SDSS Photo-z Sample
All: 50M
mr < 21: 15M
10 stripes: 10M
0.1 < z < 0.3, -20 > Mr: 2.2M
-20 > Mr > -21: 1182k
-21 > Mr > -23: 931k
-21 > Mr > -22: 662k
-22 > Mr > -23: 269k
12
The Analysis
  • eSpICE: I. Szapudi, S. Colombi and S. Prunet
  • Integrated with the database by T. Budavari
  • Extremely fast processing
  • 1 stripe with about 1 million galaxies is
    processed in 3 mins
  • Usual figure was 10 min for 10,000 galaxies,
    i.e. > 70 days
  • Each stripe processed separately for each cut
  • 2D angular correlation function computed (see
    the pair-count sketch below)
  • w(θ) average with rejection of pixels along the
    scan
  • Correlations due to flat field vector
  • Unavoidable for drift scan
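eSpICE itself is a fast FFT-based estimator; purely to illustrate what a 2D angular correlation function measurement computes, here is a minimal brute-force pair-count (Landy-Szalay style) sketch in Python, with numpy/scipy assumed and no claim to match the actual pipeline:

```python
import numpy as np
from scipy.spatial import cKDTree

def angular_w(ra_d, dec_d, ra_r, dec_r, theta_edges_deg):
    """Brute-force Landy-Szalay estimate of w(theta) from data (d) and
    random (r) points given in degrees; theta_edges_deg are bin edges."""
    def to_xyz(ra, dec):
        ra, dec = np.radians(ra), np.radians(dec)
        return np.c_[np.cos(dec) * np.cos(ra), np.cos(dec) * np.sin(ra), np.sin(dec)]

    def pair_counts(a, b, edges_deg):
        # chord length on the unit sphere corresponding to an angular separation
        chord = 2.0 * np.sin(np.radians(edges_deg) / 2.0)
        ta, tb = cKDTree(a), cKDTree(b)
        cumulative = np.array([ta.count_neighbors(tb, r) for r in chord])
        return np.diff(cumulative).astype(float)

    d, r = to_xyz(ra_d, dec_d), to_xyz(ra_r, dec_r)
    dd = pair_counts(d, d, theta_edges_deg)
    dr = pair_counts(d, r, theta_edges_deg)
    rr = pair_counts(r, r, theta_edges_deg)
    nd, nr = len(d), len(r)
    dd /= nd * (nd - 1)   # normalise (ordered) pair counts
    dr /= nd * nr
    rr /= nr * (nr - 1)
    return (dd - 2.0 * dr + rr) / rr   # Landy-Szalay estimator of w(theta)
```

A brute-force estimator like this scales as N² in the number of galaxies, which is why a much faster approach such as eSpICE is needed at the million-galaxy scale.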

13
Angular Power Spectrum
  • Use photometric redshifts for LRGs
  • Create thin redshift slices and analyze angular
    clustering
  • From characteristic features (baryon bumps, etc.)
    we obtain angular diameter vs distance → Dark
    Energy
  • Healpix pixelization in the database
  • Each redshift slice is generated in 2 minutes
  • Using SpICE over 160,000 pixels in N^1.7 time
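The Healpix pixelization mentioned above is done inside the database; as a rough outside-the-database illustration of turning a thin photo-z slice into a pixelized map, here is a short sketch using the healpy package (nside, column names and the overdensity convention are assumptions):

```python
import numpy as np
import healpy as hp

def slice_overdensity_map(ra_deg, dec_deg, z_phot, z_lo, z_hi, nside=512):
    """Bin galaxies of a thin photo-z slice [z_lo, z_hi) onto a HEALPix grid
    and return a crude overdensity map delta = n/<n> - 1 (mask handling omitted)."""
    sel = (z_phot >= z_lo) & (z_phot < z_hi)
    pix = hp.ang2pix(nside, ra_deg[sel], dec_deg[sel], lonlat=True)
    counts = np.bincount(pix, minlength=hp.nside2npix(nside)).astype(float)
    mean = counts[counts > 0].mean()   # mean over occupied pixels only
    return counts / mean - 1.0

# Consecutive thin slices, e.g. dz = 0.05 between z = 0.1 and 0.3:
# maps = [slice_overdensity_map(ra, dec, z, a, a + 0.05) for a in (0.10, 0.15, 0.20, 0.25)]
```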

14
Large Scale Power Spectrum
  • Goal measure cosmological parameters
  • Cosmological constant or Dark Energy?
  • Karhunen-Loeve technique
  • Subdivide slices into about 5K-15K cells
  • Compute correlation matrix of galaxy counts among
    cells from fiducial P(k) + noise model
  • Diagonalize matrix
  • Expand data over KL basis
  • Iterate over parameter values
  • Compute new correlation matrix
  • Invert, then compute log likelihood

Vogeley and Szalay (1996)
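For concreteness, a heavily simplified numpy sketch of the KL steps listed above (diagonalize the fiducial correlation matrix, expand the data over the leading KL modes, then evaluate a Gaussian log-likelihood for a trial model matrix); the matrices themselves would come from the fiducial P(k) + noise model and are assumed as inputs here:

```python
import numpy as np

def kl_loglike(counts, C_fiducial, C_model, n_modes=500):
    """Karhunen-Loeve compression of cell counts: diagonalise the fiducial
    signal+noise correlation matrix, keep the leading n_modes eigenmodes,
    and evaluate a Gaussian log-likelihood under a trial model matrix."""
    # 1. diagonalise the fiducial correlation matrix
    evals, evecs = np.linalg.eigh(C_fiducial)
    B = evecs[:, np.argsort(evals)[::-1][:n_modes]]   # KL basis (cells x modes)

    # 2. expand the data (vector of cell counts) over the KL basis
    x = B.T @ counts

    # 3. correlation matrix of the compressed modes for the trial parameters
    C = B.T @ C_model @ B

    # 4. invert and compute the Gaussian log-likelihood (up to a constant)
    sign, logdet = np.linalg.slogdet(C)
    return -0.5 * (x @ np.linalg.solve(C, x) + logdet)

# Parameter estimation then iterates this over a grid of trial C_model matrices.
```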
15
[Figure: likelihood contours in the Ωm h vs Ωb/Ωm plane]
SDSS only: Ωm h = 0.26 +/- 0.04, Ωb/Ωm = 0.29 +/- 0.07
SDSS: Pope et al. (2004); WMAP: Verde et al. (2003), Spergel et al. (2003)
16
Numerical Effort
  • Most of the time spent in data manipulation
  • Fast spatial searches over data and MC (SQL)
  • Diagonalization of 20Kx20K matrices
  • Inversions of a few 100K 5Kx5K matrices
  • Has the potential to constrain the Dark Energy
  • Accuracy enabled by large data set sizes
  • But new kinds of problems
  • Errors driven by the systematics, not by sample
    size
  • Scaling of analysis algorithms critical!
  • Monte Carlo realizations with a few 100M points in
    SQL

17
Cluster Finding
  • Five main steps (Annis et al. 2002)
  • Get Galaxy List
  • fieldPrep Extracts from the main data set the
    measurements of interest.
  • Filter
  • brgSearch Calculates the unweighted BCG
    likelihood for each galaxy (unweighted by galaxy
    count) and discards unlikely galaxies.
  • Check Neighbors
  • bcgSearch Weights BCG likelihood with the number
    of neighbors.
  • Pick Most Likely
  • bcgCoalesce Determines whether a galaxy is the
    most likely galaxy in the neighborhood to be the
    center of the cluster.
  • Discard Bogus
  • getCatalog Removes suspicious results and
    produces and stores the final cluster catalog.
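As a reading aid, a minimal Python skeleton chaining the five steps named above; the bodies are placeholders, not the Annis et al. (2002) implementation:

```python
# Skeleton of the five-step cluster-finding flow (placeholder bodies only).

def field_prep(catalog):
    """1. Get galaxy list: extract the measurements of interest from the main data set."""
    ...

def brg_search(galaxies):
    """2. Filter: unweighted BCG likelihood per galaxy; discard unlikely ones."""
    ...

def bcg_search(candidates):
    """3. Check neighbours: weight the BCG likelihood by the number of neighbours."""
    ...

def bcg_coalesce(weighted):
    """4. Pick most likely: keep galaxies that are the best candidate in their neighbourhood."""
    ...

def get_catalog(centers):
    """5. Discard bogus: remove suspicious results, store the final cluster catalog."""
    ...

def find_clusters(catalog):
    """Chain the five steps end to end."""
    return get_catalog(bcg_coalesce(bcg_search(brg_search(field_prep(catalog)))))
```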

18
SQL Server Cluster
  • Applying a zone strategy, P gets partitioned
    homogeneously among 3 servers (a routing sketch
    follows below).
  • S1 provides 1 deg buffer on top
  • S2 provides 1 deg buffer on top and bottom
  • S3 provides 1 deg buffer on bottom

[Diagram: region P split into partitions P1, P2, P3, native to Servers 1, 2 and 3]
Total duplicated data: 4 x 13 deg2
Total duplicated work (one object processed more than
once): 2 x 11 deg2
Maximum time spent by the thickest partition: 2h 15m
(other 2 servers: 1h 50m)
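A minimal sketch of the buffered zone assignment described above: each object goes to its native declination partition plus any neighbouring partition whose 1-degree buffer it falls into (the three-server split and the declination range are illustrative assumptions):

```python
def assign_partitions(dec, dec_min, dec_max, n_servers=3, buffer_deg=1.0):
    """Return the servers (1..n_servers) that should hold an object at
    declination `dec`: its native partition plus any adjacent partition
    whose buffer zone it falls into."""
    height = (dec_max - dec_min) / n_servers
    servers = []
    for s in range(n_servers):
        lo = dec_min + s * height - (buffer_deg if s > 0 else 0.0)
        hi = dec_min + (s + 1) * height + (buffer_deg if s < n_servers - 1 else 0.0)
        if lo <= dec < hi:
            servers.append(s + 1)
    return servers

# An object 0.5 deg below the boundary between partitions 1 and 2 is stored
# natively on server 1 and duplicated into server 2's buffer:
# assign_partitions(dec=9.5, dec_min=0.0, dec_max=30.0)  ->  [1, 2]
```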
19
SQL Server vs Files
  • SQL Server
  • Resolving a target of 66 deg2 requires:
  • Step A: Find Candidates
  • - Input data: 108 MB covering 104 deg2 (72
    byte/row, 1,574,656 rows)
  • - Time: 6 h on a dual 2.6 GHz
  • - Output data: 1.5 MB covering 84 deg2 (40
    byte/row, 40,123 rows)
  • Step B: Find Clusters
  • - Input data: 1.5 MB
  • - Time: 20 minutes
  • - Output: 0.43 MB covering 66 deg2 (40 byte/row,
    11,249 rows)
  • Total time: 6h 20m
  • Some extra space is required for indexes and some
    other auxiliary tables.
  • Scales linearly with the number of servers

FILES: Resolving a target of 66 deg2 requires:
- Input data: 66 x 4 x 16 MB ≈ 4 GB
- Output data: 66 x 4 x 6 KB ≈ 1.5 MB
- Time: 73 hours (using 10 nodes: 7.3 hours)
Notes (Files vs SQL):
- Buffer: 0.25 deg vs 0.5 deg
- brgSearch z(0..1) in steps of: 0.01 vs 0.001
FILES would require 20-60 times longer to solve this
problem for a buffer of 0.5 deg with steps of 0.001.
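A quick check of the file-based arithmetic above, with the per-square-degree factors read off the slide (treated here as assumptions):

```python
# Back-of-the-envelope check of the file-based estimate.
area_deg2      = 66      # target area to resolve
files_per_deg2 = 4       # assumed: four input files per square degree
input_mb_each  = 16      # assumed: ~16 MB per input file
output_kb_each = 6       # assumed: ~6 KB of output per file

input_gb  = area_deg2 * files_per_deg2 * input_mb_each / 1024
output_mb = area_deg2 * files_per_deg2 * output_kb_each / 1024
print(f"input ~ {input_gb:.1f} GB, output ~ {output_mb:.1f} MB")
# -> input ~ 4.1 GB, output ~ 1.5 MB, matching the ~4 GB / ~1.5 MB on the slide
```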
20
II. Technology Sharing
  • Virtual Observatory
  • SkyServer database/website templates
  • Edinburgh, STScI, Caltech, Cambridge, Cornell
  • OpenSkyQuery/OpenSkyNodes
  • International standard for federating astro
    archives
  • Interoperable SOAP implementations working
  • NVO Registry Web Service (O'Mullane, Greene)
  • Distributed logging and harvesting (Thakar, Gray)
  • MyDB workbench for science (O'Mullane, Li)
  • Publish your own data
  • A la the Spectro Service, but for images and
    databases
  • SkyServer → Soil Biodiversity

21
National Virtual Observatory
  • NSF ITR project, "Building the Framework for the
    National Virtual Observatory", is a collaboration
    of 17 funded and 3 unfunded organizations
  • Astronomy data centers
  • National observatories
  • Supercomputer centers
  • University departments
  • Computer science/information technology
    specialists
  • PIs Alex Szalay (JHU), Roy Williams (Caltech)
  • Connect the disjoint pieces of data in the world
  • Bridge the technology gap for astronomers
  • Based on interoperable Web Services

22
International Collaboration
  • Similar efforts now in 14 countries
  • USA, Canada, UK, France, Germany, Italy, Holland,
    Japan, Australia, India, China, Russia, Hungary,
    South Korea, ESO
  • Total awarded funding world-wide is over $60M
  • Active collaboration among projects
  • Standards, common demos
  • International VO roadmap being developed
  • Regular telecons over 10 timezones
  • Formal collaboration
  • International Virtual Observatory Alliance (IVOA)
  • Aiming to have production services by Jan 2005

23
Boundary Conditions
  • Standards driven by evolving new technologies
  • Exchange of rich and structured data (XML)
  • DB connectivity, Web Services, Grid computing
  • Application to astronomy domain
  • Data dictionaries (UCDs)
  • Data models
  • Protocols
  • Registries and resource/service discovery
  • Provenance, data quality

Boundary conditions
  • Dealing with the astronomy legacy
  • FITS data format
  • Software systems

24
Main VO Challenges
  • How to avoid trying to be everything for
    everybody?
  • Database connectivity is essential
  • Bring the analysis to the data
  • Core web services
  • Higher level applications built on top
  • Use the 90-10 rule
  • Define the standards and interfaces
  • Build the framework
  • Build the 10% of services that are used by 90%
  • Let the users build the rest from the components

25
Core Services
  • Metadata: information about resources
  • Waveband
  • Sky coverage
  • Translation of names to universal dictionary
    (UCD)
  • Registry
  • Simple search patterns on the resources
  • Spatial Search
  • Image mosaic
  • Unit conversions
  • Simple filtering, counting, histograms

26
Higher Level Services
  • Built on Core Services
  • Perform more complex tasks
  • Examples
  • Automated resource discovery
  • Cross-identifications
  • Photometric redshifts
  • Image segmentation
  • Outlier detections
  • Visualization facilities
  • Expectation
  • Build custom portals in a matter of days from
    existing building blocks (like today in IRAF or
    IDL)

27
Web Services in Progress
  • Registry
  • Harvesting and querying
  • Data Delivery
  • Query-driven queue management
  • Spectro service
  • Logging services
  • Graphics and visualization
  • Query driven vs interactive
  • Show spatial objects (Chart/Navi/List)
  • Footprint/intersect
  • It is a fractal
  • Cross-matching
  • SkyQuery and SkyNode
  • Ferris-wheel
  • Distributed vs parallel
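At its core, the SkyQuery/SkyNode cross-identification is a nearest-neighbour match on the sky; a minimal stand-in using astropy (the one-arcsecond tolerance is an assumption, and this is not the SkyNode protocol itself):

```python
import numpy as np
from astropy.coordinates import SkyCoord
import astropy.units as u

def cross_match(ra1, dec1, ra2, dec2, tol_arcsec=1.0):
    """Match every object of catalog 1 to its nearest neighbour in catalog 2
    and keep pairs closer than the tolerance (a toy stand-in for SkyQuery)."""
    c1 = SkyCoord(ra=ra1 * u.deg, dec=dec1 * u.deg)
    c2 = SkyCoord(ra=ra2 * u.deg, dec=dec2 * u.deg)
    idx, sep2d, _ = c1.match_to_catalog_sky(c2)
    good = sep2d < tol_arcsec * u.arcsec
    # indices into catalog 1, matched indices into catalog 2, angular separations
    return np.nonzero(good)[0], idx[good], sep2d[good]
```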

28
MyDB eScience Workbench
  • Prototype of bringing analysis to the data
  • Everybody gets a workspace (database)
  • Executes analysis at the data
  • Store intermediate results there
  • Long queries run in batch
  • Results shared within groups
  • Only fetch the final results
  • Extremely successful: matches the pattern of
    work
  • Next steps: multiple locations, single
    authentication
  • Farther down the road: parallel workflow system

29
eEducation Prototype
  • SkyServer Educational Projects, aimed at
    advanced high school students, but covering
    middle school
  • Teach how to analyze data and discover patterns,
    not just astronomy
  • 3.7 million project hits, 1.25 million page
    views of educational content
  • More than 4000 textbooks
  • On the whole web site: 44 million web hits
  • Largely a volunteer effort by many individuals
  • Matches the 2020 curriculum

30
Soil Biodiversity
  • How does soil biodiversity affect ecosystem
    functions, especially decomposition and nutrient
    cycling in urban areas?
  • JHU is part of the Baltimore Ecosystem Study, one
    of the NSF LTER monitoring sites
  • High resolution monitoring will capture
  • Spatial heterogeneity of environment
  • Change over time

31
Sensor Monitoring
  • Plan: use 400 wireless (Intel) sensors,
    monitoring
  • Air temperature, moisture
  • Soil temperature, moisture, at least at two
    depths (5 cm, 20 cm)
  • Light (intensity, composition)
  • Gases (O2, CO2, CH4, ...)
  • Long-term continuous data
  • Small (hidden) and affordable (many)
  • Less disturbance
  • 200 million measurements/year
  • Collaboration with Intel and UCB (PI: Szlavecz,
    JHU)
  • Complex database of sensor data and samples
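A sanity check of the quoted data rate, assuming roughly one reading per sensor per minute (the actual sampling plan is not given on the slide):

```python
sensors            = 400
samples_per_minute = 1                    # assumed cadence: one reading per sensor per minute
per_year = sensors * samples_per_minute * 60 * 24 * 365
print(f"{per_year:,} measurements/year")  # 210,240,000 -- about 200 million
```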

32
III. Beyond Terabytes
  • Numerical simulations of turbulence
  • 100TB on multiple SQL Servers
  • Storing each timestep, enabling backtracking to
    initial conditions
  • Also a fundamental problem in cosmological
    simulations of galaxy mergers
  • Will teach us how to do scientific analysis of
    100TBs
  • By the end of the decade several PB / year
  • One needs to demonstrate fault tolerance and
    fast enough loading speeds

33
Exploration of Turbulence
  • For the first time, we can now put it all
    together
  • Large scale range, scale-ratio O(1,000)
  • Three-dimensional in space
  • Time-evolution and Lagrangian approach (follow
    the flow)
  • Unique turbulence database
  • We will create a database of O(2,000)
    consecutive snapshots of a 1,024^3 simulation of
    turbulence: close to 100 Terabytes (rough size
    estimate below)
  • Analysis cluster on top of DB
  • Treat it as a physics experiment, change
    configurations every 2 months
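A rough size estimate for the snapshot archive, assuming each grid point stores three velocity components plus pressure in double precision (the actual storage layout is not given on the slide):

```python
cells      = 1024 ** 3   # 1,024^3 grid points per snapshot
fields     = 4           # assumed: u, v, w velocity components plus pressure
bytes_each = 8           # assumed: double precision
snapshots  = 2000        # O(2,000) consecutive snapshots
total_tib = cells * fields * bytes_each * snapshots / 1024 ** 4
print(f"~{total_tib:.0f} TiB raw")  # ~62 TiB raw; database overhead and indexes add more
```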

34
LSST
  • Large Synoptic Survey Telescope (2012)
  • Few PB/yr data rate
  • Repeat SDSS in 4 nights
  • Main issue is with data management
  • Data volume similar to high energy physics, but
    need object granularity
  • Very high resolution time series, moving objects
  • Need to build 100TB scale prototypes today
  • Hierarchical organization of data products

35
The Big Picture
[Diagram: Experiments and Instruments, Other Archives, Literature and
Simulations all feed facts into a common archive; questions go in,
answers come out: new SCIENCE!]
The Big Problems
  • Data ingest
  • Managing a petabyte
  • Common schema
  • How to organize it?
  • How to reorganize it?
  • How to coexist with others?
  • Query and Vis tools
  • Support/training
  • Performance
  • Execute queries in a minute
  • Batch query scheduling
