Title: Addressing the Data Deluge: the Structuring, Sharing, and Preserving of Scientific Experiment Data
1Addressing the Data Deluge: the Structuring,
Sharing, and Preserving of Scientific Experiment
Data
- Beth Plale
- Sangmi Lee
- Scott Jensen
- Yiming Sun
- Computer Science Dept.
- Indiana University
2The Data Deluge
- Computational science is increasingly data
intensive and getting more so. Why?
- More complex computations
- Nested model runs
- Linked models
- Finer resolution
- More sources of data products
- Observational data products
- Streaming continuously from hundreds of sensor
and network sources, scaling to thousands
- Large archives
- Annotations
- Model configuration parameters
- Output results
- Model data
- Statistical data (e.g., data mining)
3Problem
- Computational scientists are reaching the limit
of their ability to manage the data products
associated with their investigations
- A scientist can touch hundreds to thousands of
data products in a single investigation
4The Experiment as a Day's Work
6 hr run followed by
3 hr run followed by
1 hr run
5Why not just put up a metadata database and let
them come?
- The King's solution.
- Burdens users (people or programs) with
- Knowing where database is located
- Knowing the schema of the database
- Initiating all the communication with database
- Generating all metadata
- Knowing precisely how to write the queries.
- We can't afford the King's solution: we have to
be more aggressive if our solution is to be
widely used.
6Who are our users? (psst: scientists)
- Users don't want to write precise SQL
- That is, learn the nuances of a relational schema
- Users won't hand-code metadata
- Scientists don't want to have to think about
hierarchies of files, versions, or replicas.
They want to run experiments and do their
science.
- Scientists use Google: they know searching can
be fast and flexible, far more flexible than
find . -name 0305200513002530.nc -print
7myLEAD: an active metadata catalog
- If we're going to have half a chance of being
widely used, it is going to be us that reaches
3/4 of the way across the gulf. Our users
reach the other 1/4.
- Easy query writing
- Automated metadata generation
- Transparent structure management
- Transparent versioning management
- Expressive query writing
14Conventional Numerical Weather Prediction
- OBSERVATIONS
- Radar Data
- Mobile Mesonets
- Surface Observations
- Upper-Air Balloons
- Commercial Aircraft
- Geostationary and Polar Orbiting Satellite
- Wind Profilers
- GPS Satellites
- Analysis/Assimilation
- Quality Control
- Retrieval of Unobserved Quantities
- Creation of Gridded Fields
- Prediction: PCs to Teraflop Systems
- Product Generation, Display, Dissemination
- End Users
- NWS
- Private Companies
- Students
The process is entirely serial and pre-scheduled:
no response to weather!
15The LEAD Vision: No Longer Serial or Static
- OBSERVATIONS
- Radar Data
- Mobile Mesonets
- Surface Observations
- Upper-Air Balloons
- Commercial Aircraft
- Geostationary and Polar Orbiting Satellite
- Wind Profilers
- GPS Satellites
- Analysis/Assimilation
- Quality Control
- Retrieval of Unobserved Quantities
- Creation of Gridded Fields
- Prediction: PCs to Teraflop Systems
- Product Generation, Display, Dissemination
- End Users
- NWS
- Private Companies
- Students
17Architecture Part I: Distribution scheme of
metadata catalogs
- Satellite catalogs at each of 5 sites
- IU
- UA Huntsville
- Okla Univ
- Millersville
- UCAR Unidata
- NCSA Illinois
- Each satellite replicates its contents to the
master catalog
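The satellite-to-master replication described on this slide could be sketched roughly as follows. This is only an illustration, assuming each catalog is an in-memory dictionary and that replication pushes records added since the last push; the names `MasterCatalog`, `SatelliteCatalog`, `register`, and `replicate_to` are hypothetical and are not taken from myLEAD.

```python
from dataclasses import dataclass, field


@dataclass
class MasterCatalog:
    """Hypothetical master catalog: the union of all satellite contents."""
    records: dict = field(default_factory=dict)

    def merge(self, site: str, new_records: dict) -> None:
        # Key by (site, object id) so entries from different sites never collide.
        for object_id, metadata in new_records.items():
            self.records[(site, object_id)] = metadata


@dataclass
class SatelliteCatalog:
    """Hypothetical per-site catalog that tracks what it has not yet replicated."""
    site: str
    records: dict = field(default_factory=dict)
    _pending: set = field(default_factory=set)

    def register(self, object_id: str, metadata: dict) -> None:
        # Store locally first; the scientist works against the local satellite.
        self.records[object_id] = metadata
        self._pending.add(object_id)

    def replicate_to(self, master: MasterCatalog) -> int:
        # Push only the records added or changed since the last replication.
        batch = {oid: self.records[oid] for oid in self._pending}
        master.merge(self.site, batch)
        self._pending.clear()
        return len(batch)
```

A satellite at, say, IU would call `register(...)` as products arrive and `replicate_to(master)` periodically; a second call with nothing new pushes zero records.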
18Architecture Part II: single catalog
19Providing higher-level functionality
- Structure, sharing, preservation, querying
20Axes of Functionality
- Sharing: increasing levels of access
- Structure: increasing levels of transparency
- Preservation: versioning through time
21Higher-level functionality: transparent structure
- Structure: creating structure in the metadata
catalog transparently to the user, based on
knowledge of control flow
- Why? We want to hide structure so users don't
need to learn it and abide by it, but
- Structure gives the user more attributes to query on
22Capturing process in the structure
23Example: query contains structure, but only
vaguely
LeadQuery
SELECT TARGET collection
WHERE collection.date February 20, 2005
WITHIN experiment.name mytest1
and CONTAINS (file.type GOES or file.type Eta)
and file.geoProperty precipitation
RECURSIVE
ResultSet TARGET_ONLY
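To make the intent of the example query concrete, here is a small sketch of how such a query might be evaluated over an in-memory tree. It rests on one reading of the query that the slide does not spell out: return collections with the given date, inside the named experiment, that contain at any depth a file of type GOES or Eta whose geoProperty is precipitation. The `Node` type and the `run_query` function are hypothetical and do not reflect the actual myLEAD query engine.

```python
from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class Node:
    kind: str    # "experiment", "collection", or "file" (assumed vocabulary)
    attrs: dict  # metadata attributes attached to the object
    children: list = field(default_factory=list)


def descendants(node: Node) -> Iterator[Node]:
    # Depth-first walk; this models the RECURSIVE keyword.
    for child in node.children:
        yield child
        yield from descendants(child)


def file_matches(node: Node) -> bool:
    # CONTAINS (file.type GOES or file.type Eta) and file.geoProperty precipitation
    return (node.kind == "file"
            and node.attrs.get("type") in {"GOES", "Eta"}
            and node.attrs.get("geoProperty") == "precipitation")


def run_query(roots: list, experiment_name: str, date: str) -> list:
    """Return only the matching collections (TARGET_ONLY), not their contents."""
    results = []
    for root in roots:
        # WITHIN experiment.name: assume experiments are the top-level objects.
        if root.kind != "experiment" or root.attrs.get("name") != experiment_name:
            continue
        for node in descendants(root):
            if node.kind != "collection" or node.attrs.get("date") != date:
                continue
            if any(file_matches(d) for d in descendants(node)):
                results.append(node)
    return results
```

The user names only a date, an experiment, and a few file properties; the walk over the hierarchy stays hidden, which is the sense in which the query contains structure only vaguely.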
24Creating structure in database that mirrors
structure of experiment
12 hrs
- Gather data products
- Run 12 hour forecast (6 hrs to complete)
- Analyze results
- Based on analysis, gather other products
- Analyze results
- Run 6 hour forecast (3 hrs to complete)
- Components: workflow, notification service,
decoder service, myLEAD agent, myLEAD server
- Messages: product requests, product registers,
notification msgs
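One way the catalog structure could come to mirror the experiment is for an agent to listen to workflow notifications and open or close collections as steps start and finish. The sketch below assumes a hypothetical event vocabulary (`step_started`, `product_registered`, `step_finished`) and a hypothetical `CatalogAgent` class; neither is taken from the myLEAD implementation.

```python
class CatalogAgent:
    """Hypothetical agent that builds a collection tree from workflow events."""

    def __init__(self, experiment: str):
        self.root = {"name": experiment, "children": [], "files": []}
        # Stack of currently open collections; the innermost is last.
        self._open = [self.root]

    def on_event(self, kind: str, name: str = "") -> None:
        current = self._open[-1]
        if kind == "step_started":
            # A new workflow step becomes a new collection under the current one.
            collection = {"name": name, "children": [], "files": []}
            current["children"].append(collection)
            self._open.append(collection)
        elif kind == "product_registered":
            # Products are filed under whichever step is running right now.
            current["files"].append(name)
        elif kind == "step_finished":
            # Never pop the experiment root itself.
            if len(self._open) > 1:
                self._open.pop()
        else:
            raise ValueError(f"unknown event kind: {kind}")
```

Because the agent derives the hierarchy from control flow, the scientist never has to create or name collections by hand, yet the step names become attributes that later queries can use.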
25Higher-level functionality: sharing
- Depth-0: participant (P) is unaware that
experiment data (E) owned by user (U) exists
- Depth-1: P is aware that E exists
- Depth-2: P can search E
- Depth-3: P can browse the content of E
- Depth-4: P can access E and its contents
- Depth-5: P can remove and write E
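The depth levels above lend themselves to an ordered enumeration. The sketch below assumes the levels are cumulative (a participant granted Depth-4 can also do everything allowed at Depth-1 through Depth-3), which the slide suggests but does not state outright. The names `ShareDepth` and `SharingPolicy` are hypothetical.

```python
from enum import IntEnum


class ShareDepth(IntEnum):
    UNAWARE = 0  # P does not know E exists
    AWARE = 1    # P is aware that E exists
    SEARCH = 2   # P can search E
    BROWSE = 3   # P can browse the content of E
    ACCESS = 4   # P can access E and its contents
    MODIFY = 5   # P can remove and write E


class SharingPolicy:
    """Hypothetical per-experiment grants, defaulting to Depth-0."""

    def __init__(self):
        self._grants = {}

    def grant(self, participant: str, experiment: str, depth: ShareDepth) -> None:
        self._grants[(participant, experiment)] = depth

    def depth_for(self, participant: str, experiment: str) -> ShareDepth:
        # Anyone without an explicit grant is unaware the experiment exists.
        return self._grants.get((participant, experiment), ShareDepth.UNAWARE)

    def allowed(self, participant: str, experiment: str, required: ShareDepth) -> bool:
        # Cumulative levels: a higher grant implies every lower capability.
        return self.depth_for(participant, experiment) >= required
```

For example, after `policy.grant("P", "E", ShareDepth.BROWSE)`, a check for `ShareDepth.SEARCH` succeeds while a check for `ShareDepth.ACCESS` fails.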
27Experimental evaluation
28Experiment environment
- myLEAD client: dual processor Dell PowerEdge
6400 Xeon server (700 MHz Pentium III), 2 GB RAM,
100 GB RAID 5, RedHat 7.2, JDK 1.4.2
- myLEAD server: dual processor 2.0 GHz Opterons,
16 GB RAM, Gentoo Linux, OGSA-DAI 3.0, Globus MCS
3.1, MySQL 5.0
- LAN: 1 Gbps switched Ethernet
29Workload used in experimental evaluation
Characterizing simple and hard
Create                  Simple     Hard
Objects created         1-11       203-500
Attributes created      2-5        512-1012
Depth of tree           1-3        7-9
Query                   Simple     Hard
Tables joined           11-13      36-42
Number attributes       0-2        10
Size of result set      2K         0.4-0.6M
30Response time for querying a single object having
an increasing
34Related Work
- myGrid (Intelligent Systems for Molecular Biology 2003)
- mySpace (UK e-Science All Hands Meeting 2003)
- NEESgrid metadata catalog (NEESgrid technical report 2004)
- Roma personal metadata service (Mobile Networks and Applications 2002)
- Presto Document System (User Interface Software and Technology 1999)
- Semantic File Systems (SOSP 1991)
36The end
37Seeds of a solution in the Internet?
- The Internet has proven the utility of a
user-oriented view towards information space
management
- Search, tag browser, bookmarks
- Publish blogs, web page tools
- But the web is not completely appropriate. The web is
- Single-writer, multiple reader, and
- Search-and-download.
- Apply the concept of a user-oriented view to
managing data space
- Want the ability to work locally.
- myLEAD: a tool to help an investigator make sense
of, and operate in, the vast information space
that is computational science (e.g., mesoscale
meteorology).