Addressing the Data Deluge: the Structuring, Sharing, and Preserving of Scientific Experiment Data

1
Addressing the Data Deluge: the Structuring,
Sharing, and Preserving of Scientific Experiment
Data
  • Beth Plale
  • Sangmi Lee
  • Scott Jensen
  • Yiming Sun
  • Computer Science Dept.
  • Indiana University

2
The Data Deluge
  • Computational science is increasingly data
    intensive and getting more so. Why?
  • More complex computations
  • Nested model runs
  • Linked models
  • Finer resolution
  • More sources of data products
  • Observational data products
  • Streaming continuously from hundreds of sensor
    and network sources, scaling to thousands
  • Large archives
  • Annotations
  • Model configuration parameters
  • Output results
  • Model data
  • Statistical data (e.g., data mining)

3
Problem
  • Computational scientists are reaching the limit
    of their ability to manage the data products
    associated with their investigations
  • A scientist can touch hundreds to thousands of
    data products in a single investigation

4
The Experiment as a Day's Work
6 hr run followed by
3 hr run followed by
1 hr run
5
Why not just put up a metadata database and let
them come?
  • The King's solution.
  • Burdens users (people or programs) with
  • Knowing where the database is located
  • Knowing the schema of the database
  • Initiating all the communication with the database
  • Generating all metadata
  • Knowing precisely how to write the queries.
  • We can't afford the King's solution - we have to
    be more aggressive if our solution is to be
    widely used.

6
Who are our users? (psst... scientists)
  • Users don't want to write precise SQL
  • That is, learn the nuances of a relational schema
  • Users won't hand-code metadata
  • Scientists don't want to have to think about
    hierarchies of files, versions, or replicas.
    They want to run experiments and do their
    science.
  • Scientists use Google - they know searching can
    be fast and flexible - far more flexible than

find . -name 0305200513002530.nc -print
7
myLEAD: an active metadata catalog
  • If we're going to have half a chance of being
    widely used, it is going to be us that reaches
    3/4 of the way across the gulf. Our users
    reach the other 1/4.
  • Easy query writing
  • Automated metadata generation
  • Transparent structure management
  • Transparent versioning management
  • Expressive query writing

14
Conventional Numerical Weather Prediction
OBSERVATIONS
  • Radar Data
  • Mobile Mesonets
  • Surface Observations
  • Upper-Air Balloons
  • Commercial Aircraft
  • Geostationary and Polar Orbiting Satellite
  • Wind Profilers
  • GPS Satellites

Analysis/Assimilation
  • Quality Control
  • Retrieval of Unobserved Quantities
  • Creation of Gridded Fields

Prediction: PCs to Teraflop Systems

Product Generation, Display, Dissemination

End Users
  • NWS
  • Private Companies
  • Students

The process is entirely serial and pre-scheduled:
no response to weather!

15
The LEAD Vision: No Longer Serial or Static
OBSERVATIONS
  • Radar Data
  • Mobile Mesonets
  • Surface Observations
  • Upper-Air Balloons
  • Commercial Aircraft
  • Geostationary and Polar Orbiting Satellite
  • Wind Profilers
  • GPS Satellites

Analysis/Assimilation
  • Quality Control
  • Retrieval of Unobserved Quantities
  • Creation of Gridded Fields

Prediction: PCs to Teraflop Systems

Product Generation, Display, Dissemination

End Users
  • NWS
  • Private Companies
  • Students

17
Architecture Part I: Distribution scheme of
metadata catalogs
  • Satellite catalogs at each of 5 sites
  • IU
  • UA Huntsville
  • Okla Univ
  • Millersville
  • UCAR Unidata
  • NCSA Illinois
  • Each satellite replicates its contents to the
    master catalog
  • Master catalog
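
The satellite-to-master replication described on this slide can be sketched in a few lines of Python. This is only an illustrative model, not the myLEAD implementation: the class names `Catalog` and `SatelliteCatalog`, the `pending` queue, and the choice to prefix replicated ids with the site name are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Catalog:
    """A toy metadata catalog: maps an object id to its metadata record."""
    name: str
    records: dict = field(default_factory=dict)


@dataclass
class SatelliteCatalog(Catalog):
    """A site-local catalog that tracks which records still need replicating."""
    pending: set = field(default_factory=set)

    def register(self, object_id: str, metadata: dict) -> None:
        # Writes land locally first; the id is queued for replication.
        self.records[object_id] = metadata
        self.pending.add(object_id)

    def replicate_to(self, master: Catalog) -> int:
        # Push every pending record to the master. Ids are prefixed with the
        # site name so ids from different sites cannot collide (an assumption).
        pushed = 0
        for object_id in sorted(self.pending):
            master.records[f"{self.name}/{object_id}"] = dict(self.records[object_id])
            pushed += 1
        self.pending.clear()
        return pushed


# Usage: one satellite (named after a site on the slide) pushes to the master.
master = Catalog("master")
iu = SatelliteCatalog("IU")
iu.register("forecast-001", {"type": "Eta"})  # hypothetical object id
pushed = iu.replicate_to(master)
print(pushed, sorted(master.records))
```

Keeping a local pending set lets each site keep working when the master is unreachable and replicate later, which is one plausible reason to split the catalogs this way.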
18
Architecture Part II: single catalog
19
Providing higher level functionality
  • Structure, sharing, preservation, querying

20
Axes of Functionality
  • Sharing: increasing levels of access
  • Structure: increasing levels of transparency
  • Preservation: versioning through time
21
Higher-level functionality: transparent structure
  • Structure -- creating structure in the metadata
    catalog that is transparent to the user, based on
    knowledge of control flow
  • Why? Want to hide structure so users don't
    need to learn it and abide by it, but
  • Structure gives the user more attributes to query on

22
Capturing process in the structure
23
Example: Query contains structure, but only
vaguely
LeadQuery SELECT TARGET collection
WHERE collection.date February 20, 2005
WITHIN experiment.name mytest1
and CONTAINS (file.type GOES or file.type Eta)
and file.geoProperty precipitation
RECURSIVE
ResultSet TARGET_ONLY
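
One way to read the example query is as a search over a tree of experiments, collections, and files. The Python sketch below evaluates it against an in-memory tree under several assumptions that the slide does not confirm: each attribute/value pair is an equality test, CONTAINS with RECURSIVE means "has a matching file at any depth", a single file must satisfy both the type and the geoProperty conditions, and TARGET_ONLY returns the collection itself rather than its contents. `Node`, `descendants`, and `matching_collections` are hypothetical names, and the date is written in ISO form for convenience.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str                      # "experiment", "collection", or "file"
    attrs: dict
    children: list = field(default_factory=list)


def descendants(node):
    """Yield every node below `node`, at any depth (the assumed RECURSIVE part)."""
    for child in node.children:
        yield child
        yield from descendants(child)


def matching_collections(experiments, exp_name, date, file_types, geo_property):
    """Return the collections that satisfy the query under the assumed semantics."""
    hits = []
    for exp in experiments:
        # WITHIN experiment.name ...
        if exp.kind != "experiment" or exp.attrs.get("name") != exp_name:
            continue
        for node in descendants(exp):
            # WHERE collection.date ...
            if node.kind != "collection" or node.attrs.get("date") != date:
                continue
            # CONTAINS (file.type ... or file.type ...) and file.geoProperty ...
            has_file = any(
                d.kind == "file"
                and d.attrs.get("type") in file_types
                and d.attrs.get("geoProperty") == geo_property
                for d in descendants(node)
            )
            if has_file:
                hits.append(node)  # assumed TARGET_ONLY: the collection, not its contents
    return hits


# Usage: the matching file sits two levels below the dated collection.
goes = Node("file", {"type": "GOES", "geoProperty": "precipitation"})
inner = Node("collection", {"date": "2005-02-21"}, [goes])
outer = Node("collection", {"date": "2005-02-20"}, [inner])
exp = Node("experiment", {"name": "mytest1"}, [outer])
hits = matching_collections([exp], "mytest1", "2005-02-20", {"GOES", "Eta"}, "precipitation")
```

The user names the experiment and the kind of file wanted but never spells out the path between them, which is the sense in which the query contains structure "only vaguely".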
24
Creating structure in database that mirrors
structure of experiment
12 hrs
Gather data products
Run 12 hour forecast (6 hrs to complete)
Analyze results
Based on analysis, gather other products
Analyze results
Run 6 hour forecast (3 hrs to complete)
workflow
workflow
Notif service
Decoder service
myLEAD agent
myLEAD server
Product requests, product registers, notification msgs
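
The idea of building catalog structure from the running workflow, so the user never has to create it by hand, can be illustrated with a small Python sketch. Everything here is hypothetical: the `CatalogAgent` class, the message kinds `step_started`, `product_registered`, and `step_finished`, and the product file names are invented for the example and are not the actual myLEAD agent or notification format.

```python
class CatalogAgent:
    """Toy agent that turns workflow notifications into nested catalog structure."""

    def __init__(self, experiment_name):
        self.root = {"name": experiment_name, "collections": [], "files": []}
        self._stack = [self.root]  # the innermost open collection is last

    def handle(self, message):
        kind = message["kind"]
        if kind == "step_started":
            # Open a new collection nested under whatever is currently running.
            collection = {"name": message["step"], "collections": [], "files": []}
            self._stack[-1]["collections"].append(collection)
            self._stack.append(collection)
        elif kind == "product_registered":
            # File the product under the step that is currently running.
            self._stack[-1]["files"].append(message["product"])
        elif kind == "step_finished":
            # Close the current collection, but never pop the experiment root.
            if len(self._stack) > 1:
                self._stack.pop()
        else:
            raise ValueError(f"unknown notification kind: {kind!r}")


# Usage: two workflow steps from the slide, each registering one product.
agent = CatalogAgent("mytest1")
for msg in [
    {"kind": "step_started", "step": "Run 12 hour forecast"},
    {"kind": "product_registered", "product": "forecast.nc"},
    {"kind": "step_finished"},
    {"kind": "step_started", "step": "Analyze results"},
    {"kind": "product_registered", "product": "analysis.png"},
    {"kind": "step_finished"},
]:
    agent.handle(msg)
```

Because the collections are named after workflow steps, the resulting structure gives the user extra attributes to query on without any manual metadata entry.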
25
Higher-level functionality: sharing
  • Depth-0: participant (P) is unaware that
    experiment data (E) owned by user (U) exists
  • Depth-1: P is aware that E exists
  • Depth-2: P can search E
  • Depth-3: P can browse the content of E
  • Depth-4: P can access E and its contents
  • Depth-5: P can remove and write E
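
The six sharing depths map naturally onto an ordered enumeration. The sketch below is a minimal Python illustration and assumes that the levels are cumulative, so that a participant granted a given depth also holds every lower one; the slide lists the levels but does not state this explicitly. The names `Depth` and `allowed` are invented for the example.

```python
from enum import IntEnum


class Depth(IntEnum):
    UNAWARE = 0   # P is unaware that E exists
    AWARE = 1     # P is aware that E exists
    SEARCH = 2    # P can search E
    BROWSE = 3    # P can browse the content of E
    ACCESS = 4    # P can access E and its contents
    WRITE = 5     # P can remove and write E


def allowed(granted: Depth, required: Depth) -> bool:
    """Assume each depth includes every capability of the depths below it."""
    return granted >= required


# Usage: a participant granted BROWSE may search but may not write.
can_search = allowed(Depth.BROWSE, Depth.SEARCH)
can_write = allowed(Depth.BROWSE, Depth.WRITE)
```

An integer-backed enum keeps the check to a single comparison and makes it easy to store the granted depth as one column per participant and experiment.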

27
Experimental evaluation
28
Experiment environment
  • myLEAD client: dual-processor Dell PowerEdge
    6400 Xeon server (700 MHz Pentium III), 2 GB RAM,
    100 GB RAID 5, RedHat 7.2, JDK 1.4.2
  • myLEAD server: dual-processor 2.0 GHz Opterons,
    16 GB RAM, Gentoo Linux, OGSA-DAI 3.0, Globus MCS
    3.1, MySQL 5.0.
  • LAN: 1 Gbps switched Ethernet

29
Workload used in experimental evaluation
Characterizing "simple" and "hard"

Create               Simple    Hard
Objects created      1-11      203-500
Attributes created   2-5       512-1012
Depth of tree        1-3       7-9

Query                Simple    Hard
Tables joined        11-13     36-42
Number attributes    0-2       10
Size of result set   2K        0.4-0.6M
30
Response time for querying a single object having
an increasing
34
Related Work
  • myGrid (Intelligent Systems for Molecular Biology, 2003)
  • mySpace (UK e-Science All Hands Meeting, 2003)
  • NEESgrid metadata catalog (NEESgrid technical report, 2004)
  • Roma personal metadata service (Mobile Networks and Applications, 2002)
  • Presto Document System (User Interface Software and Technology, 1999)
  • Semantic File Systems (SOSP, 1991)

36
The end
37
Seeds of solution in Internet?
  • The Internet has proven the utility of a
    user-oriented view towards information space
    management
  • Search, tag browser, bookmarks
  • Publish blogs, web page tools
  • But the web is not completely appropriate. The web is
  • Single-writer, multiple-reader, and
  • Search-and-download.
  • Apply the concept of a user-oriented view to
    managing the data space
  • Want the ability to work locally.
  • myLEAD: a tool to help an investigator make sense
    of, and operate in, the vast information space
    that is computational science (e.g., mesoscale
    meteorology).