Title: Shifting the Burden from the User to the Data Provider
1Shifting the Burden from the User to the Data
Provider
- Peter Fox
- High Altitude Observatory,
- NCAR
- With thanks to eGY and various NSF, DoE and NASA
projects
2Outline
- Background, definitions
- Informatics -> e-Science
- Data has lots of uses
- Virtual Observatories use cases
- Data Framework Examples
- Data ingest, integration, mining and
- Discussion
3Background
- Scientists should be able to access a global,
distributed knowledge base of scientific data
that:
- appears to be integrated
- appears to be locally available
- But data is obtained by multiple instruments,
using various protocols, in differing
vocabularies, using (sometimes unstated)
assumptions, with inconsistent (or non-existent)
metadata. It may be inconsistent, incomplete,
evolving, and distributed
- And there exist(ed) significant levels of
semantic heterogeneity, large-scale data, complex
data types, legacy systems, inflexible and
unsustainable implementation technology
4But data has Lots of Audiences
Information products have
Information
More Strategic
Less Strategic
From Why EPO?, a NASA internal report on
science education, 2005
SCIENTISTS TOO
5The Information Era: Interoperability
Modern information and communications
technologies are creating an interoperable
information era in which ready access to data and
information can be truly universal. Open access
to data and services enables us to meet the new
challenges of understanding the Earth and its space
environment as a complex system
- managing and accessing large data sets
- higher space/time resolution capabilities
- rapid response requirements
- data assimilation into models
- crossing disciplinary boundaries.
6Shifting the Burden from the User to the Provider
7 Modern capabilities
8Mind the Gap!
- As a result of finding out who is doing what,
sharing experience/expertise, and substantial
coordination
- There is/was still a gap between science and the
underlying infrastructure and technology that is
available
- Informatics - information science includes the
science of (data and) information, the practice
of information processing, and the engineering of
information systems. Informatics studies the
structure, behavior, and interactions of natural
and artificial systems that store, process and
communicate (data and) information. It also
develops its own conceptual and theoretical
foundations. Since computers, individuals and
organizations all process information,
informatics has computational, cognitive and
social aspects, including study of the social
impact of information technologies. (Source: Wikipedia)
- Cyberinfrastructure is the new research
environment(s) that support advanced data
acquisition, data storage, data management, data
integration, data mining, data visualization and
other computing and information processing
services over the Internet.
9Progression after progression
[Diagram labels: IT | Cyber Infrastructure | Cyber Informatics | Core Informatics | Science Informatics, aka Xinformatics | Science, SBAs]
10Virtual Observatories
- Conceptual examples
- In-situ Virtual measurements
- Related measurements
- Remote sensing Virtual, integrative measurements
- Data integration
- Managing virtual data products/ sets
11Virtual Observatories
- Make data and tools quickly and easily accessible
to a wide audience.
- Operationally, virtual observatories need to find
the right balance of data/model holdings, portals
and client software that researchers can use
without effort or interference, as if all the
materials were available on their local
computer using the user's preferred language,
i.e. appear to be local and integrated
- Likely to provide controlled vocabularies that
may be used for interoperation in appropriate
domains, along with database interfaces for access
and storage and smart tools for evolution and
maintenance.
12Early days of discipline specific VOs
[Diagram: VO1, VO2, VO3 and databases DB1, DB2, DB3, ..., DBn]
13The Astronomy approach: data-types as a service
Limited interoperability
- VOTable
- Simple Image Access Protocol
- Simple Spectrum Access Protocol
- Simple Time Access Protocol
Open Geospatial Consortium Web {Feature,
Coverage, Mapping} Service and Sensor Web
Enablement Sensor {Observation, Planning,
Analysis} Service use the same approach
[Diagram: VO App1, VO App2, VO App3; VO layer; databases DB1, DB2, DB3, ..., DBn]
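A minimal sketch of the data-type-as-a-service idea: a client that understands one shared table format can read results from any provider that returns it. The sample document below is hand-written for illustration; real VOTables declare an XML namespace and carry more metadata, both omitted here. The function name read_votable is an assumption, not part of any VO library.

```python
import xml.etree.ElementTree as ET

# Hand-written VOTable-style sample (illustrative only; namespace omitted).
SAMPLE = """
<VOTABLE version="1.1">
  <RESOURCE>
    <TABLE name="results">
      <FIELD name="RA" datatype="double" unit="deg"/>
      <FIELD name="Dec" datatype="double" unit="deg"/>
      <DATA>
        <TABLEDATA>
          <TR><TD>10.68</TD><TD>41.27</TD></TR>
          <TR><TD>83.82</TD><TD>-5.39</TD></TR>
        </TABLEDATA>
      </DATA>
    </TABLE>
  </RESOURCE>
</VOTABLE>
"""


def read_votable(text):
    """Return the first table as a list of {field name: cell text} dicts."""
    root = ET.fromstring(text.strip())
    table = root.find("RESOURCE/TABLE")
    # FIELD elements describe the columns, in order.
    names = [f.get("name") for f in table.findall("FIELD")]
    rows = []
    for tr in table.findall("DATA/TABLEDATA/TR"):
        cells = [td.text for td in tr.findall("TD")]
        rows.append(dict(zip(names, cells)))
    return rows


rows = read_votable(SAMPLE)
print(rows[0])
```

The client never needs to know which database produced the table, which is the limited form of interoperability this slide describes.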
14VO API
Web Serv.
VO Portal
Query, access and use of data
- Mediation Layer
- Ontology - capturing concepts of Parameters,
Instruments, Date/Time, Data Product (and
associated classes, properties) and Service
Classes
- Maps queries to underlying data
- Generates access requests for metadata, data
- Allows queries, reasoning, analysis, new
hypothesis generation, testing, explanation,
etc.
Semantic mediation layer - VSTO - low level
Metadata, schema, data
[Diagram: databases DB1, DB2, DB3, ..., DBn]
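A minimal sketch of what "maps queries to underlying data" and "generates access requests" could look like. This is not the VSTO implementation: the ontology is reduced to a plain dictionary, and every concept, instrument, database and table name below is hypothetical.

```python
from dataclasses import dataclass

# Hypothetical ontology fragment: parameter concept -> instrument -> holding.
ONTOLOGY = {
    "NeutralTemperature": {
        "FabryPerotInterferometer": ("DB1", "fpi_temperature"),
        "Lidar": ("DB3", "lidar_temperature"),
    },
    "ElectronDensity": {
        "IncoherentScatterRadar": ("DB2", "isr_density"),
    },
}


@dataclass(frozen=True)
class AccessRequest:
    """One concrete request against one underlying database."""
    database: str
    table: str
    parameter: str
    start: str
    end: str


def mediate(parameter, start, end, instrument=None):
    """Translate a concept-level query into per-database access requests."""
    holdings = ONTOLOGY.get(parameter, {})
    requests = []
    for name in sorted(holdings):
        # An optional instrument constraint narrows the set of holdings.
        if instrument is not None and name != instrument:
            continue
        database, table = holdings[name]
        requests.append(AccessRequest(database, table, parameter, start, end))
    return requests


for request in mediate("NeutralTemperature", "2005-01-26", "2005-01-27"):
    print(request)
```

The user asks in terms of concepts (a parameter, a time range); only the mediation layer knows which databases and tables hold the answer.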
15Content: Coupling, Energetics and Dynamics of
Atmospheric Regions WEB
Community data archive for observations and
models of Earth's upper atmosphere and
geophysical indices and parameters needed to
interpret them. Includes browsing capabilities
by periods, > 310 instruments, models, > 820
parameters
16Content: Mauna Loa Solar Observatory
Near real-time data products from Hawaii from a
variety of solar instruments. Source for space
weather, solar variability, and basic solar
physics.
Other content used too - Center for
Integrated Space Weather Modeling
17Semantic Web Methodology and Technology
Development Process
- Establish and improve a well-defined methodology
vision for Semantic Technology based application
development
- Leverage controlled vocabularies, etc.
[Diagram: development process with stages labeled Use Case; Small Team, mixed skills; Analysis; Develop model/ontology; Use Tools; Science/Expert Review and Iteration; Rapid Prototype; Leverage Technology Infrastructure; Adopt Technology Approach; Open World: Evolve, Iterate, Redesign, Redeploy]
18Science and technical use cases
- Find data which represents the state of the
neutral atmosphere anywhere above 100 km and
toward the Arctic Circle (above 45N) at any time
of high geomagnetic activity.
- Extract information from the use-case - encode
knowledge
- Translate this into a complete query for data -
inference and integration of data from
instruments, indices and models
- Provide semantically-enabled, smart data query
services via a SOAP web service for the Virtual
Ionosphere-Thermosphere-Mesosphere Observatory
that retrieve data, filtered by constraints on
Instrument, Date-Time, and Parameter in any order
and with constraints included in any combination.
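A minimal sketch of "constraints in any order and in any combination". It assumes a query is an immutable set of optional constraints, so the order in which they are added cannot change the result. The class names, instrument abbreviations and sample records are hypothetical and are not the VITMO service interface.

```python
from dataclasses import dataclass, replace
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Record:
    instrument: str
    parameter: str
    time: datetime


@dataclass(frozen=True)
class Query:
    """Every constraint is optional; None means 'not constrained'."""
    instrument: Optional[str] = None
    parameter: Optional[str] = None
    start: Optional[datetime] = None
    end: Optional[datetime] = None

    def with_instrument(self, name):
        return replace(self, instrument=name)

    def with_parameter(self, name):
        return replace(self, parameter=name)

    def with_date_range(self, start, end):
        return replace(self, start=start, end=end)

    def matches(self, rec):
        if self.instrument is not None and rec.instrument != self.instrument:
            return False
        if self.parameter is not None and rec.parameter != self.parameter:
            return False
        if self.start is not None and rec.time < self.start:
            return False
        if self.end is not None and rec.time > self.end:
            return False
        return True

    def run(self, records):
        return [r for r in records if self.matches(r)]


# Hypothetical sample holdings.
RECORDS = [
    Record("FPI", "NeutralTemperature", datetime(2005, 1, 26, 21, 0)),
    Record("FPI", "NeutralWind", datetime(2005, 1, 27, 3, 0)),
    Record("ISR", "ElectronDensity", datetime(2005, 1, 26, 22, 0)),
]

day = (datetime(2005, 1, 26), datetime(2005, 1, 27))
q1 = Query().with_instrument("FPI").with_date_range(*day)
q2 = Query().with_date_range(*day).with_instrument("FPI")
print(q1 == q2, q1.run(RECORDS))
```

Because each step returns a new Query, a portal can let the user start from Instrument, Date-Time or Parameter and still arrive at the same request.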
19VSTO - semantics and ontologies in an operational
environment vsto.hao.ucar.edu, www.vsto.org
24Semantic Web Benefits
- Unified/abstracted query workflow: Parameters,
Instruments, Date-Time
- Decreased input requirements for query: in one
case reducing the number of selections from eight
to three
- Generates only syntactically correct queries,
which could not always be ensured in previous
implementations without semantics
- Semantic query support: by using background
ontologies and a reasoner, our application has
the opportunity to expose only coherent queries
(portal and services)
- Semantic integration: in the past users had to
remember (and maintain codes) to account for
numerous different ways to combine and plot the
data, whereas now semantic mediation provides the
level of sensible data integration required, now
exposed as smart web services
- understanding of coordinate systems,
relationships, data synthesis, transformations,
etc.
- returns independent variables and related
parameters
- A broader range of potential users (PhD
scientists, students, professional research
associates and those from outside the fields)
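A minimal sketch of "expose only coherent queries": once the user has made one selection, the interface offers only the remaining choices that are consistent with it. A plain set lookup stands in for the ontology and reasoner here, and the instrument/parameter relation below is hypothetical.

```python
# Hypothetical relation: instrument -> parameters it can provide.
MEASURES = {
    "FPI": {"NeutralTemperature", "NeutralWind"},
    "ISR": {"ElectronDensity", "IonTemperature"},
    "Lidar": {"NeutralTemperature"},
}


def coherent_parameters(instrument=None):
    """Parameters that can still be chosen given the instrument selection."""
    if instrument is None:
        return sorted(set().union(*MEASURES.values()))
    return sorted(MEASURES.get(instrument, set()))


def coherent_instruments(parameter=None):
    """Instruments that can still be chosen given the parameter selection."""
    if parameter is None:
        return sorted(MEASURES)
    return sorted(name for name, params in MEASURES.items() if parameter in params)


print(coherent_instruments("NeutralTemperature"))
print(coherent_parameters("ISR"))
```

A user can never assemble a request for a parameter that the chosen instrument does not provide, so every query that reaches the services is well-formed by construction.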
25What is a Non-Specialist Use Case?
Someone should be able to query a virtual
observatory without having specialist knowledge.
A teacher accesses the internet, goes to an
Educational Virtual Observatory and enters a
search for "Aurora".
26What should the User Receive?
Teacher receives four groupings of search
results:
1) Educational materials:
http://www.meted.ucar.edu/topics_spacewx.php and
http://www.meted.ucar.edu/hao/aurora/
2) Research, data and tools via VSTO, VSPO and
VITMO; knows to search for brightness, or
green/red line emission
3) Did you know? Aurora is a phenomenon of the
upper terrestrial atmosphere (ionosphere), also
known as Northern Lights
4) Did you mean? Aurora Borealis or
Aurora Australis, etc.
27Semantic Information Integration: Concept map for
educational use of science data in a lesson plan
29Issues for Virtual Observatories
- Scaling to large numbers of data providers and
redefining the role(s)/relations with them
- Crossing discipline boundaries
- Security, access to resources, policies
- Branding and attribution (where did this data
come from and who gets the credit, is it the
correct version, is this an authoritative
source?)
- Provenance/derivation (propagating key
information as it passes through a variety of
services, copies of processing algorithms, ...)
- Data quality, preservation, stewardship
These are currently burden areas for users
30Problem definition
- Data is coming in faster, in greater volumes and
outstripping our ability to perform adequate
quality control
- Data is being used in new ways and we frequently
do not have sufficient information on what
happened to the data along the processing stages
to determine if it is suitable for a use we did
not envision
- We often fail to capture, represent and propagate
manually generated information that needs to go
with the data flows
- Each time we develop a new instrument, we develop
a new data ingest procedure and collect different
metadata and organize it differently. It is then
hard to use with previous projects
- The task of event determination and feature
classification is onerous and we don't do it
until after we get the data
31Use cases
- Determine which flat field calibration was
applied to the image taken on January 26, 2005
around 21:00 UT by the ACOS Mark IV polarimeter.
- Which flat-field algorithm was applied to the set
of images taken during the period November 1,
2004 to February 28, 2005?
- How many different data product types can be
generated from the ACOS CHIP instrument?
- What images comprised the flat field calibration
image used on January 26, 2007 for all ACOS CHIP
images?
- What processing steps were completed to obtain
the ACOS PICS limb image of the day for January
26, 2005?
- Who (person or program) added the comments to the
science data file for the best vignetted,
rectangular polarization brightness image from
January 26, 2005 18:49:09 UT taken by the ACOS
Mark IV polarimeter?
- What were the cloud cover and atmospheric seeing
conditions during the local morning of January
26, 2005 at MLSO?
- Find all good images on March 21, 2008.
- Why are the quick look images from March 21,
2008, 19:00 UT missing?
- Why does this image look bad?
32Provenance
- Origin or source from which something comes,
intention for use, who/what generated for, manner
of manufacture, history of subsequent owners,
sense of place and time of manufacture,
production or discovery, documented in detail
sufficient to allow reproducibility
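A minimal sketch of how provenance captured at ingest time could answer questions such as "what processing steps were completed?" or "who added the comments?". The record structure, step names, agents and the product identifier are all hypothetical; this is an illustration of the idea, not the SPCDIS data model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Step:
    name: str    # what was done, e.g. "flat_field"
    agent: str   # person or program that did it
    detail: str  # free-text note, e.g. which calibration file was used


@dataclass
class Product:
    product_id: str
    history: List[Step] = field(default_factory=list)

    def record(self, name, agent, detail=""):
        """Append one processing step, in the order it happened."""
        self.history.append(Step(name, agent, detail))

    def steps(self):
        return [s.name for s in self.history]

    def agents_for(self, name):
        return [s.agent for s in self.history if s.name == name]

    def details_for(self, name):
        return [s.detail for s in self.history if s.name == name]


# Hypothetical processing history for one image.
image = Product("example-image-001")
image.record("ingest", "ingest_program", "raw frame received")
image.record("flat_field", "calibration_program", "flat field file flat-A")
image.record("comment", "observer", "thin cloud during exposure")

print(image.steps())
print(image.agents_for("comment"))
print(image.details_for("flat_field"))
```

If each stage appends to the history as the data flows through, the provider rather than the user carries the burden of remembering what happened to a product.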
36Visual browse
39Discussion (1)
- Taken together, an emerging set of collected
experience manifests an emerging informatics core
capability that is starting to take data
intensive science into a new realm of
realizability and, potentially, sustainability
- Use cases (i.e. real users)
- X-informatics
- Core Informatics
- Cyber Informatics
- There are implications for data models
40Progression after progression
[Diagram labels: IT | Cyber Infrastructure | Cyber Informatics | Core Informatics | Science Informatics | Science, SBAs]
- Example
- CI: OPeNDAP server running over HTTP/HTTPS
- Cyberinformatics: Data (product) and service
ontologies, triple store
- Core informatics: Reasoning engine (Pellet), OWL
- Science (X) informatics: Use cases, science
domain terms, concepts in an ontology
41Discussion (2)
- Data and information science is becoming the
fourth column (along with theory, experiment
and computation)
- Semantics (of the data) are a key ingredient
-> may imply richer data models
42Summary
- Informatics is playing a key role in filling the
gap between science (and the spectrum of
non-expert) use and generation and the underlying
cyberinfrastructure, i.e. in shifting the burden
- This is evident due to the emergence of
Xinformatics (world-wide)
- Our experience is implementing informatics as
semantics in Virtual Observatories (as a working
paradigm) and Grid environments
- VSTO is only one example of success
- Data mining, data integration, smart search,
provenance are close behind
- Informatics is a profession and a community
activity and requires efforts in all 3 sub-areas
(science, core, cyber) and must be synergistic
43More Information
- Virtual Solar Terrestrial Observatory (VSTO)
http://vsto.hao.ucar.edu, http://www.vsto.org
- Semantically-Enabled Science Data Integration
(SESDI) http://sesdi.hao.ucar.edu
- Semantic Provenance Capture in Data Ingest
Systems (SPCDIS) http://spcdis.hao.ucar.edu
- Semantic Knowledge Integration Framework
(SKIF/SAM) http://skif.hao.ucar.edu
- Semantic Web for Earth and Environmental
Terminology (SWEET) http://sweet.jpl.nasa.gov
- Conferences: AGU 2008, EGU 2009, ISWC 2008, CIKM
2008, ...
- Peter Fox pfox_at_ucar.edu