Title: Community cyberinfrastructure and X-informatics - Assessment of convergence and innovation based on project experience
1Community cyberinfrastructure and X-informatics -
Assessment of convergence and innovation based on
project experience
- Peter Fox
- High Altitude Observatory,
- NCAR
- Work performed in part with Deborah McGuinness
(RPI), Rob Raskin (JPL), Krishna Sinha (VT), Luca
Cinquini (NCAR), Patrick West (NCAR), Stephan
Zednik (NCAR), Paulo Pinheiro da Silva (UTEP), Li
Ding (RPI) and others
2Outline
- Background and inevitabilities
- Informatics -> e-Science
- Informatics methodology, e.g. Semantic Web as an
approach and a technology
- Virtual Observatories: use cases, some examples,
and non-specialist use
- Data ingest, integration, mining and where we are
heading
- Discussion
3Background
- Scientists should be able to access a global,
distributed knowledge base of scientific data
that
- appears to be integrated
- appears to be locally available
- But data is obtained by multiple instruments,
using various protocols, in differing
vocabularies, using (sometimes unstated)
assumptions, with inconsistent (or non-existent)
meta-data. It may be inconsistent, incomplete,
evolving, and distributed
- And there exist(ed) significant levels of
semantic heterogeneity, large-scale data, complex
data types, legacy systems, inflexible and
unsustainable implementation technology
4But data has Lots of Audiences
[Figure: audiences for information products, from
More Strategic to Less Strategic, SCIENTISTS TOO.
From "Why EPO?", a NASA internal report on
science education, 2005]
5Shifting the Burden from the User to the Provider
6The Astronomy approach: data-types as a service
Limited interoperability
- VOTable
- Simple Image Access Protocol
- Simple Spectrum Access Protocol
- Simple Time Access Protocol
VO App2
VO App3
VO App1
Open Geospatial Consortium Web Feature, Coverage
and Mapping Services, and Sensor Web Enablement
Sensor Observation, Planning and Analysis
Services, use the same approach
VO layer
DBn
DB2
DB3
DB1
7Mind the Gap!
- As a result of finding out who is doing what,
sharing experience/expertise, and substantial
coordination
- There is/was still a gap between science and the
underlying infrastructure and technology that is
available
- Informatics - information science includes the
science of (data and) information, the practice
of information processing, and the engineering of
information systems. Informatics studies the
structure, behavior, and interactions of natural
and artificial systems that store, process and
communicate (data and) information. It also
develops its own conceptual and theoretical
foundations. Since computers, individuals and
organizations all process information,
informatics has computational, cognitive and
social aspects, including study of the social
impact of information technologies (Wikipedia).
- Cyberinfrastructure is the new research
environment(s) that support advanced data
acquisition, data storage, data management, data
integration, data mining, data visualization and
other computing and information processing
services over the Internet.
8Progression after progression
IT -> Cyber Infrastructure -> Cyber Informatics -> Core Informatics -> Science Informatics, aka Xinformatics -> Science, SBAs
9Virtual Observatories
- Make data and tools quickly and easily accessible
to a wide audience.
- Operationally, virtual observatories need to find
the right balance of data/model holdings, portals
and client software that researchers can use
without effort or interference, as if all the
materials were available on his/her local
computer, using the user's preferred language,
i.e. appear to be local and integrated
- Likely to provide controlled vocabularies that
may be used for interoperation in appropriate
domains along with database interfaces for access
and storage -> thus part IT, part CI, part
Informatics
10VO API
Web Serv.
VO Portal
Query, access and use of data
- Mediation Layer
- Ontology - capturing concepts of Parameters,
Instruments, Date/Time, Data Product (and
associated classes, properties) and Service
Classes
- Maps queries to underlying data
- Generates access requests for metadata, data
- Allows queries, reasoning, analysis, new
hypothesis generation, testing, explanation, etc.
Semantic mediation layer - VSTO - low level
Metadata, schema, data
DBn
DB2
DB3
DB1
11Semantic Web Methodology and Technology
Development Process
- Establish and improve a well-defined methodology
vision for Semantic Technology based application
development
- Leverage controlled vocabularies, etc.
Adopt Technology Approach
Leverage Technology Infrastructure
Science/Expert Review Iteration
Rapid Prototype
Open World Evolve, Iterate, Redesign, Redeploy
Use Tools
Analysis
Use Case
Develop model/ ontology
Small Team, mixed skills
12Science and technical use cases
- Find data which represents the state of the
neutral atmosphere anywhere above 100 km and
toward the Arctic Circle (above 45N) at any time
of high geomagnetic activity.
- Extract information from the use-case - encode
knowledge
- Translate this into a complete query for data -
inference and integration of data from
instruments, indices and models
- Provide semantically-enabled, smart data query
services via a SOAP web service for the Virtual
Ionosphere-Thermosphere-Mesosphere Observatory
that retrieve data, filtered by constraints on
Instrument, Date-Time, and Parameter in any order
and with constraints included in any combination.
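A query service that accepts constraints on Instrument, Date-Time and Parameter in any order and combination can be sketched as follows. This is a minimal illustration, not the VSTO implementation: the `Record` type, the `query` function and the sample catalog are hypothetical names and made-up data introduced only for this sketch.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Record:
    """One hypothetical catalog entry; the field names are illustrative."""
    instrument: str
    parameter: str
    time: datetime


def query(records: Iterable[Record],
          instrument: Optional[str] = None,
          parameter: Optional[str] = None,
          time_range: Optional[Tuple[datetime, datetime]] = None) -> List[Record]:
    """Filter records by any combination of the three constraints.

    A constraint left as None is not applied, so callers can supply
    the constraints in any order and in any combination.
    """
    result = []
    for rec in records:
        if instrument is not None and rec.instrument != instrument:
            continue
        if parameter is not None and rec.parameter != parameter:
            continue
        if time_range is not None and not (time_range[0] <= rec.time <= time_range[1]):
            continue
        result.append(rec)
    return result


# Made-up sample catalog, for illustration only.
CATALOG = [
    Record("Fabry-Perot interferometer", "neutral temperature", datetime(2005, 1, 26, 18, 0)),
    Record("Fabry-Perot interferometer", "neutral wind", datetime(2005, 1, 27, 6, 0)),
    Record("lidar", "neutral temperature", datetime(2005, 1, 26, 20, 0)),
]

by_parameter = query(CATALOG, parameter="neutral temperature")
by_both = query(CATALOG, parameter="neutral temperature", instrument="lidar")
```

Using keyword arguments that default to None keeps the three constraints independent, which is one simple way to meet the "any order, any combination" requirement.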
14But data has Lots of Audiences
[Figure: audiences for information products, from
More Strategic to Less Strategic.
From "Why EPO?", a NASA internal report on
science education, 2005]
15What is a Non-Specialist Use Case?
Someone should be able to query a virtual
observatory without having specialist knowledge.
A teacher accesses the internet, goes to an
Educational Virtual Observatory and enters a
search for "Aurora".
16What should the User Receive?
Teacher receives four groupings of search
results:
1) Educational materials:
http://www.meted.ucar.edu/topics_spacewx.php and
http://www.meted.ucar.edu/hao/aurora/
2) Research, data and tools: via research VOs, but
the search for brightness, or green/red line
emission, is mediated for them
3) Did you know? Aurora is a phenomenon of the
upper terrestrial atmosphere (ionosphere), also
known as Northern Lights
4) Did you mean? Aurora Borealis or
Aurora Australis, etc.
17Semantic Information Integration Concept map for
educational use of science data in a lesson plan
19Informatics issues for Virtual Observatories
- Scaling to large numbers of data providers and
redefining the roles/relations among them
- Branding and attribution (where did this data
come from and who gets the credit, is it the
correct version, is this an authoritative
source?)
- Provenance/derivation (propagating key
information as it passes through a variety of
services, copies of processing algorithms, ...)
- Crossing discipline boundaries
- Data quality, preservation, stewardship
- Security, access to resources, policies
20Provenance
- Origin or source from which something comes, its
intention for use, whom or what it was generated
for, the manner of manufacture, history of
subsequent owners, sense of place and time of
manufacture, production or discovery documented
in detail sufficient to allow reproducibility
21Use cases
- Who (person or program) added the comments to the
science data file for the best vignetted,
rectangular polarization brightness image from
January 26, 2005 18:49:09 UT taken by the ACOS
Mark IV polarimeter?
- What were the cloud cover and atmospheric seeing
conditions during the local morning of January
26, 2005 at MLSO?
- Find all good images on March 21, 2008.
- Why are the quick look images from March 21,
2008, 19:00 UT missing?
- Why does this image look bad?
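Questions such as "who added the comments?" and "why are the images missing?" become simple lookups once provenance is captured as structured events. The sketch below is only an illustration of that idea: the `ProvenanceEvent` type, the action names and the sample events are hypothetical and are not taken from any real ingest system.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ProvenanceEvent:
    """A hypothetical provenance record attached to a data product."""
    product: str    # identifier of the data product (illustrative)
    action: str     # what was done, e.g. "acquired", "commented"
    agent: str      # person or program responsible
    note: str = ""  # free-text explanation captured at the time


def who_did(events: List[ProvenanceEvent], product: str, action: str) -> List[str]:
    """Return the agents that performed `action` on `product`."""
    return [e.agent for e in events if e.product == product and e.action == action]


def why_missing(events: List[ProvenanceEvent], product: str) -> Optional[str]:
    """Return the recorded explanation for a missing product, if any."""
    for e in events:
        if e.product == product and e.action == "not_produced":
            return e.note
    return None


# Made-up events, for illustration only.
EVENTS = [
    ProvenanceEvent("image-001", "acquired", "ingest-program"),
    ProvenanceEvent("image-001", "commented", "observer-A", "thin cloud at start"),
    ProvenanceEvent("quicklook-002", "not_produced", "ingest-program", "computer crash"),
]

commenters = who_did(EVENTS, "image-001", "commented")
reason = why_missing(EVENTS, "quicklook-002")
```

The point of the sketch is that the answer to "why" exists only if the explanation was recorded when the event happened, which is the capture problem the use cases highlight.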
25Quick look browse
Yasukawa Computer crash
27Visual browse
30Search
32A Better Way to Access Data
The Problem
Scientists only use data from a single instrument
because it is difficult to access, process, and
understand data from multiple instruments. Typical
data queries might be:
- "Give me the temperature, pressure, and water
vapor from the AIRS instrument from Jan 2005 to
Jan 2008"
- "Search for MLS/Aura Level 2, SO2 Slant Column
Density from 2/1/2007"
A Solution
Using a simple process, SESDI allows data from
various sources to be registered in an ontology
so that it can be easily accessed and understood.
Scientists can use only the ontology components
that relate to their data. An SESDI query might
look like:
- "Show all areas in California where sulfur
dioxide (SO2) levels were above normal between
Jan 2000 and Jan 2007"
This query will pull data from all available
sources registered in the ontology and allow
seamless data fusion. Because the query is
measurement related, scientists do not need to
understand the details of the instruments and
data types.
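The difference between an instrument-level and a measurement-level query can be illustrated with a small registry that maps a concept to every source registered against it. This is a sketch under stated assumptions, not SESDI code: the concept names, source names and variable names are all invented for the example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical registry: ontology concept -> list of (source, variable name).
Registry = Dict[str, List[Tuple[str, str]]]


def register(registry: Registry, concept: str, source: str, variable: str) -> None:
    """Record that `variable` in `source` measures the ontology `concept`."""
    registry[concept].append((source, variable))


def sources_for(registry: Registry, concept: str) -> List[Tuple[str, str]]:
    """Resolve a measurement-level query to every registered source."""
    return list(registry.get(concept, []))


registry: Registry = defaultdict(list)
# The source and variable names below are invented for illustration.
register(registry, "SulfurDioxideConcentration", "satellite_product", "so2_slant_column")
register(registry, "SulfurDioxideConcentration", "ground_survey", "SO2_t_per_day")
register(registry, "AirTemperature", "satellite_product", "temp_k")

so2_sources = sources_for(registry, "SulfurDioxideConcentration")
```

A user asking about sulfur dioxide receives both sources without naming either instrument, which is the behavior the slide describes.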
33Determine the statistical signatures of volcanic
forcings on the height of the tropopause
34Detection and attribution relations
37Leveraged VSTO semantic framework indicating how
volcano and atmospheric parameters and databases
can immediately be plugged in to the semantic
data framework to enable data integration.
38Data Registration Framework
Data Discovery
Data Integration
Level 1: Data Registration at the Discovery
Level, e.g. volcano location and activity
Level 2: Data Registration at the Inventory
Level, e.g. list of datasets by types, times,
products
Level 3: Data Registration at the Item
Detail Level, e.g. access to individual quantities
Earth Sciences Virtual Database: A Data Warehouse
where the Schema heterogeneity problem is Solved
(schema based integration)
Ontology based Data Integration
A.K.Sinha, Virginia Tech, 2006
39How to find the data?
- Think about it the way the data providers do
40SEDRE Semantically Enabled Data Registration
Engine
- SEDRE: an application that enables scientists to
semantically register data sets for optimal
querying and semantic integration
- SEDRE enables mapping of heterogeneous data to
concepts in domain ontologies
A. K. Sinha, A. Rezgui, Virginia Tech
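Mapping heterogeneous data to ontology concepts can be pictured as renaming dataset-specific columns to shared concept names. The sketch below is an assumption-laden illustration and not SEDRE itself: the mappings, concept names and row values are hypothetical.

```python
from typing import Dict

# Hypothetical mappings from dataset-specific column names to concept names.
VOLCANO_MAPPING = {"SO2 (t/d)": "so2_emission_rate", "WS": "wind_speed"}
SATELLITE_MAPPING = {"SCD": "so2_slant_column_density"}


def to_concepts(row: Dict[str, float], mapping: Dict[str, str]) -> Dict[str, float]:
    """Rename the columns of one row to the concepts they are registered to.

    Columns without a registered concept are dropped, since nothing is
    known about what they mean.
    """
    return {mapping[col]: value for col, value in row.items() if col in mapping}


# Made-up rows, for illustration only.
volcano_row = to_concepts({"SO2 (t/d)": 100.0, "WS": 5.0, "N": 3.0}, VOLCANO_MAPPING)
satellite_row = to_concepts({"SCD": 1.5}, SATELLITE_MAPPING)
```

Once both rows use concept names, downstream code can integrate them without knowing the original column headers.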
41Registering Atmospheric Data (2)
42Discussion (1)
- Taken together, an emerging set of collected
experience manifests an emerging informatics core
capability that is starting to take
data-intensive science into a new realm of
realizability and, potentially, sustainability
- Use cases
- X-informatics
- Core Informatics
- Cyber Informatics
- Evolvable technical infrastructure
43Progression after progression
IT -> Cyber Infrastructure -> Cyber Informatics -> Core Informatics -> Science Informatics -> Science, Societal Benefit Areas, Edu
- One example
- CI: OPeNDAP server running over HTTP/HTTPS
- Cyberinformatics: Data (product) and service
ontologies, triple store
- Core informatics: Reasoning engine (Pellet),
OWL, CMAP, ...
- Science (X) informatics: Use cases, science
domain terms, concepts in an ontology
44Discussion (2)
- The data and information challenges are (almost)
being identified as increasingly common
- Data and information science is becoming the
fourth column (along with theory, experiment
and computation)
- Semantics are a key ingredient for progress
in informatics
- A sustained involvement of key inter-disciplinary
team members is very important -> leads to
incentives, rewards, etc. and a balance of
research and production
45Summary
- Informatics is playing a key role in filling the
gap between science (and the spectrum of
non-expert) use and generation and the underlying
cyberinfrastructure
- This is evident due to the emergence of
Xinformatics (world-wide)
- Our experience is implementing informatics as
semantics in Virtual Observatories (as a working
paradigm) and Grid environments
- VSTO is only one example of success
- Data mining, data integration, smart search,
provenance
- Informatics is a profession and a community
activity and requires efforts in all 3 sub-areas
(science, core, cyber) and must be synergistic
46More Information
- Virtual Solar Terrestrial Observatory (VSTO):
http://vsto.hao.ucar.edu, http://www.vsto.org
- Semantically-Enabled Science Data Integration
(SESDI): http://sesdi.hao.ucar.edu
- Semantic Provenance Capture in Data Ingest
Systems (SPCDIS): http://spcdis.hao.ucar.edu
- SAM/Semantic Knowledge Integration Framework
(SKIF): http://skif.hao.ucar.edu
- Conferences: numerous
- Journals: Earth Science Informatics
- Texts: <empty>, a few are in progress
- Courses
- Semantic e-Science, fall 2008 course at RPI
- Geoinformatics, at Purdue
- Contact: Peter Fox, pfox_at_ucar.edu
47Spare room
48 Translating the Use-Case - non-monotonic?
GeoMagneticActivity has ProxyRepresentation
GeophysicalIndex is a ProxyRepresentation (in Realm of
Neutral Atmosphere)
Kp is a GeophysicalIndex
hasTemporalDomain daily
hasHighThreshold xsd_number 8
Date/time when Kp > 8
Specification needed for query to CEDARWEB:
Instrument, Parameter(s), Operating Mode,
Observatory, Date/time, Return-type: data
- Input
- Physical properties: State of neutral atmosphere
- Spatial
- Above 100 km
- Toward Arctic Circle (above 45N)
- Conditions
- High geomagnetic activity
- Action: Return Data
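The translation above, from "high geomagnetic activity" to dates on which the Kp proxy exceeds its threshold, can be sketched as follows. This is a minimal illustration, not the VSTO or CEDARWEB query code: the function names, the dictionary keys and the sample Kp values are hypothetical, and only the threshold of 8 and the spatial limits come from the slide.

```python
from datetime import date
from typing import Dict, List

# Threshold taken from the slide: hasHighThreshold 8, applied as Kp > 8.
KP_HIGH_THRESHOLD = 8


def high_activity_dates(daily_kp: Dict[date, float],
                        threshold: float = KP_HIGH_THRESHOLD) -> List[date]:
    """Return the dates on which the Kp proxy exceeds the threshold."""
    return sorted(d for d, kp in daily_kp.items() if kp > threshold)


def build_query(dates: List[date]) -> dict:
    """Assemble a query specification from the use-case terms.

    The keys are hypothetical; instrument and parameter are left as None
    because choosing them is the inference step the slide describes.
    """
    return {
        "physical_property": "state of neutral atmosphere",
        "min_altitude_km": 100,
        "min_latitude_deg_north": 45,
        "dates": dates,
        "instrument": None,
        "parameter": None,
        "return_type": "data",
    }


# Made-up Kp values, for illustration only.
SAMPLE_KP = {date(2005, 1, 20): 3.0, date(2005, 1, 21): 8.3, date(2005, 1, 22): 8.0}
spec = build_query(high_activity_dates(SAMPLE_KP))
```

Note that a value of exactly 8 is excluded, because the slide states the condition as Kp > 8 rather than Kp >= 8.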
49VSTO - semantics and ontologies in an operational
environment vsto.hao.ucar.edu, www.vsto.org
52Semantic Web Services
53Semantic Web Services
OWL document returned using VSTO ontology - can
be used either syntactically or semantically
54Semantic Web Services
55Semantic Web Services
56VSTO achievements
- Conceptual model and architecture developed by
combined team: KR experts, domain experts, and
software engineers
- Semantic framework developed and built with a
small, cohesive, carefully chosen team in a
relatively short time (deployments in 1st year)
- Production portal released, includes security,
etc., with community migration (and so far
endorsement)
- VSTO ontology version 1.2 (vsto.owl) in
production, 2.0 in preparation
- Web Services encapsulation of semantic interfaces
in use
- Solar Terrestrial use-cases are driving the
completion of the ontologies (e.g. instruments)
- Using ontologies and the overall framework in
other applications (volcanoes, climate, oceans,
water, ...)
57Semantic Web Basics
- The triple: subject-predicate-object
- Interferometer is-a optical instrument
- Optical instrument has focal length
- An ontology is a representation of this knowledge
- W3C is the primary (but not sole) governing
organization for languages, specifications, best
practices, etc.
- RDF - Resource Description Framework
- OWL 1.0 - Web Ontology Language (OWL 1.1 on the
way)
- Encode the knowledge in triples, in a
triple-store; software is built to traverse the
semantic network, which can be queried or
reasoned upon
- Put semantics between/in your interfaces, i.e.
between layers and components in your
architecture, i.e. between users and
information, to mediate the exchange
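The triple idea can be shown with a toy in-memory store. This is a plain-Python sketch and deliberately does not use RDF or OWL tooling; the `match`, `superclasses` and `properties` helpers are hypothetical, and the third triple is an assumption added so that the is-a chain has more than one step.

```python
from typing import Optional, Set, Tuple

Triple = Tuple[str, str, str]

# The first two triples come from the slide; the third is an added assumption.
TRIPLES: Set[Triple] = {
    ("Interferometer", "is-a", "OpticalInstrument"),
    ("OpticalInstrument", "has", "FocalLength"),
    ("OpticalInstrument", "is-a", "Instrument"),
}


def match(triples: Set[Triple], s: Optional[str] = None, p: Optional[str] = None,
          o: Optional[str] = None) -> Set[Triple]:
    """Return the triples matching a pattern; None acts as a wildcard."""
    return {t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)}


def superclasses(triples: Set[Triple], cls: str) -> Set[str]:
    """Follow is-a links transitively, a very small form of reasoning."""
    found: Set[str] = set()
    frontier = [cls]
    while frontier:
        current = frontier.pop()
        for _, _, parent in match(triples, s=current, p="is-a"):
            if parent not in found:
                found.add(parent)
                frontier.append(parent)
    return found


def properties(triples: Set[Triple], cls: str) -> Set[str]:
    """Properties of a class, including those inherited through is-a."""
    classes = {cls} | superclasses(triples, cls)
    return {o for c in classes for _, _, o in match(triples, s=c, p="has")}


inherited = properties(TRIPLES, "Interferometer")
```

Nothing in the store says directly that an interferometer has a focal length; the answer follows from traversing the is-a link, which is the kind of inference a reasoner performs at much larger scale.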
58Semantic Web Benefits
- Unified/abstracted query workflow: Parameters,
Instruments, Date-Time
- Decreased input requirements for query: in one
case reducing the number of selections from eight
to three
- Generates only syntactically correct queries,
which could not always be ensured in previous
implementations without semantics
- Semantic query support: by using background
ontologies and a reasoner, our application has
the opportunity to only expose coherent queries
(portal and services)
- Semantic integration: in the past users had to
remember (and maintain codes) to account for
numerous different ways to combine and plot the
data, whereas now semantic mediation provides the
level of sensible data integration required, now
exposed as smart web services
- understanding of coordinate systems,
relationships, data synthesis, transformations,
etc.
- returns independent variables and related
parameters
- A broader range of potential users (PhD
scientists, students, professional research
associates and those from outside the fields)
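Exposing only coherent queries amounts to using background knowledge to restrict what a user can select next. The following is a minimal sketch of that idea, assuming a small hand-written table in place of an ontology and reasoner; the `MEASURES` table and the helper names are hypothetical.

```python
from typing import Dict, Set

# Hypothetical background knowledge: which parameters each instrument measures.
MEASURES: Dict[str, Set[str]] = {
    "Fabry-Perot interferometer": {"neutral temperature", "neutral wind"},
    "incoherent scatter radar": {"electron density", "ion temperature"},
}


def parameters_for(instrument: str) -> Set[str]:
    """Parameters that may be offered once an instrument is chosen."""
    return set(MEASURES.get(instrument, set()))


def instruments_for(parameter: str) -> Set[str]:
    """Instruments that may be offered once a parameter is chosen."""
    return {inst for inst, params in MEASURES.items() if parameter in params}


def is_coherent(instrument: str, parameter: str) -> bool:
    """True when the combination is known to make sense."""
    return parameter in MEASURES.get(instrument, set())


choices = parameters_for("Fabry-Perot interferometer")
providers = instruments_for("electron density")
```

Because the interface offers only the entries returned by these helpers, a user cannot assemble a query that pairs an instrument with a parameter it does not measure.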
59Example 1 Registration of Volcanic Data
- Location Codes
- U - Above the 180° turn at Holei Pali (upper
Chain of Craters Road)
- L - Below Holei Pali (lower Chain of Craters
Road)
- UL - Individual traverses were made both above
and below the 180° turn at Holei Pali
- H - Highway 11
SO2 Emission from Kilauea east rift zone -
vehicle-based (Source: HVO)
Abbreviations: t/d = metric tonne (1000 kg)/day,
SD = standard deviation, WS = wind speed, WD = wind
direction east of true north, N = number of
traverses
60Registering Volcanic Data (1)
61Registering Volcanic Data (2)
- No explicit lat/long data
- Volcano identified by name
- Volcano ontology framework will link name to
location
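Linking a volcano name to a location can be sketched as a lookup that enriches each record. This is an illustration only: a small dictionary stands in for the volcano ontology, the coordinates are approximate, and the emission value is made up.

```python
from typing import Dict, Tuple

# Hypothetical gazetteer standing in for the volcano ontology.
# Coordinates are approximate (degrees north, degrees east).
VOLCANO_LOCATIONS: Dict[str, Tuple[float, float]] = {
    "Kilauea": (19.42, -155.29),
}


def add_location(record: dict) -> dict:
    """Return a copy of the record with lat/lon resolved from the volcano name.

    Records whose volcano is unknown get lat/lon set to None rather
    than a guessed position.
    """
    location = VOLCANO_LOCATIONS.get(record["volcano"])
    lat, lon = location if location is not None else (None, None)
    return {**record, "lat": lat, "lon": lon}


# The emission value below is made up for illustration.
located = add_location({"volcano": "Kilauea", "so2_t_per_day": 100.0})
```

With the location attached, the name-only volcanic record can be compared spatially with satellite data that carries explicit coordinates.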
62Example 2 Registration of Atmospheric Data
Satellite data for SO2 emissions
Abbreviation: SCD = Slant Column Density (in
Dobson Units (DU))
63Registering Atmospheric Data (1)
64SAM Project Objectives (S. Graves, R. Ramachandran)
- To create a prototype Semantic Analysis and
Mining framework (SAM) comprising
- Data mining and knowledge extraction web services
- Linked ontologies describing the mining services,
data and the problem domain
- Web-based client
- To allow users to discover and explore existing
data and services, compose workflows for mining
and invoke these workflows.
- Semantic search
- Automated web service invocation
- Automated web service composition
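Automated service composition can be approximated by chaining services whose output type matches the next service's input type. The sketch below assumes this simple type-matching model, which is not necessarily how SAM works; the service names and data types are invented for the example.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple

# Hypothetical service descriptions: name -> (input type, output type).
SERVICES: Dict[str, Tuple[str, str]] = {
    "subset": ("RawImage", "ImageSubset"),
    "extract_features": ("ImageSubset", "FeatureVector"),
    "classify": ("FeatureVector", "EventLabel"),
}


def compose(start: str, goal: str,
            services: Dict[str, Tuple[str, str]] = SERVICES) -> Optional[List[str]]:
    """Find a shortest chain of services turning `start` data into `goal` data.

    Breadth-first search over data types; returns None when no chain exists.
    """
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        data_type, path = queue.popleft()
        if data_type == goal:
            return path
        for name, (needs, gives) in services.items():
            if needs == data_type and gives not in seen:
                seen.add(gives)
                queue.append((gives, path + [name]))
    return None


workflow = compose("RawImage", "EventLabel")
```

In a semantic setting the input and output types would be ontology classes, so a service accepting a general class could also be matched to more specific data.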
65Data Mining Ontology Design
Courtesy R. Ramachandran
66Data Mining Ontology Snapshot
Courtesy R. Ramachandran
67The Information Era Interoperability
Modern information and communications
technologies are creating an interoperable
information era in which ready access to data and
information can be truly universal. Open access
to data and services enables us to meet the new
challenges of understanding the Earth and its
space environment as a complex system:
- managing and accessing large data sets
- higher space/time resolution capabilities
- rapid response requirements
- data assimilation into models
- crossing disciplinary boundaries.
68Virtual Observatories
- Conceptual examples
- In-situ Virtual measurements
- Related measurements
- Remote sensing Virtual, integrative measurements
- Data integration
- Managing virtual data products/ sets
69Virtual Solar Terrestrial Observatory
- A distributed, scalable education and research
environment for searching, integrating, and
analyzing observational, experimental, and model
databases.
- Subject matter covers the fields of solar,
solar-terrestrial and space physics
- Provides virtual access to specific data, model,
tool and material archives containing items from
a variety of space- and ground-based instruments
and experiments, as well as individual and
community modeling and software efforts bridging
research and educational use
- 3 year NSF-funded (OCI/SCI) project - completed
- Several follow-on projects
70Problem definition
- Data is coming in faster, in greater volumes and
outstripping our ability to perform adequate
quality control
- Data is being used in new ways and we frequently
do not have sufficient information on what
happened to the data along the processing stages
to determine if it is suitable for a use we did
not envision
- We often fail to capture, represent and propagate
manually generated information that needs to go
with the data flows
- Each time we develop a new instrument, we develop
a new data ingest procedure and collect different
metadata and organize it differently. It is then
hard to use with previous projects
- The task of event determination and feature
classification is onerous and we don't do it
until after we get the data
71Building blocks
- Data formats and metadata: IAU standard FITS,
with SoHO keyword convention, JPEG, GIF
- Ontologies: OWL-DL and RDF
- The proof markup language (PML) provides an
interlingua for capturing the information agents
need to understand results and to justify why
they should believe the results.
- The Inference Web toolkit provides a suite of
tools for manipulating, presenting, summarizing,
analyzing, and searching PML in efforts to
provide a set of tools that will let end users
understand information and its derivation,
thereby facilitating trust in and reuse of
information.
- Capturing semantics of data quality, event, and
feature detection within suitable community
ontology packages (SWEET, VSTO)