Title: Designing and building web-based infrastructures for data sharing and collaboration
1 Designing and building web-based infrastructures
for data sharing and collaboration
- Presentation at
- New Approaches to Software for Statistical
Processing - RSS / ASC Joint Event
- London, 21st January 2004
- Jostein Ryssevik
- Managing Director, Nesstar Ltd
2 What if...
- statistical data could be published as easily on
the Web as documents and pictures?
- users of statistical data could search for
relevant sources across the Web more or less in
the same way as they are googling for relevant
documents?
- users of statistical data had a "give me more
data like this" function at hand, allowing them to
locate data from disparate sources in order to
create a time series or do a comparison?
- published statistical data could be described in
such a way that human users as well as software
clients would know exactly what they mean and how
they can be used?
- agreed languages were at hand, allowing one
statistical software system to talk to and
exchange information with another?
- we had a Data Web allowing all of this to happen?
3 Essentials
- Open standards
- Shared protocols
- Partial agreements
4 Starting from the human end of the equation
- The majority of users of statistical data have
not been engaged in the creation of a dataset.
- Statistical data will frequently be used for
other research purposes than intended by the
creators (secondary analysis).
- Statistical data will frequently be used many
years after they were created.
- Users of statistical data are often comparing and
combining data from a broad range of sources
(across time and space).
- In sum: statistical data are travelling. The
distance between the end users of statistical
data and the production process is normally long.
5 Metadata: data about.....
6 The functions of (human readable) metadata
Finding
Understanding
Assessing
7 Towards the Semantic Web
- From documents to data
- From brainware to software
- From machine readable to machine understandable
information
- Metadata: the glue of the Semantic Web
- A framework for knowledge representation: RDF
- The introduction of namespaces (allowing
different systems of terms and concepts to
cohabit in a single information system)
- Partial understanding/agreement
- The vision: the creation of a dynamic framework
facilitating cooperation/interoperability across
domains and communities, gradually expanding the
web of understanding.
"The Semantic Web is an extension of the current
Web in which information is given well-defined
meaning, better enabling computers and people to
work in cooperation." (Tim Berners-Lee)
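The role of namespaces and partial understanding can be illustrated with a minimal sketch. It is not taken from the presentation: the `STAT` vocabulary, the survey URL and the `understood` helper are all hypothetical, and only the Dublin Core element namespace is a real, published one. Statements are modelled as plain RDF-style (subject, predicate, object) tuples rather than with an RDF library.

```python
# Dublin Core element namespace (a real, published vocabulary).
DC = "http://purl.org/dc/elements/1.1/"
# Hypothetical domain vocabulary, invented for this illustration.
STAT = "http://example.org/stat-terms#"

SURVEY = "http://example.org/data/survey-1"  # hypothetical resource URL

# RDF-style statements: (subject, predicate, object).
# Terms from two vocabularies describe the same resource side by side.
triples = [
    (SURVEY, DC + "title", "Labour force survey 2003"),
    (SURVEY, DC + "creator", "Example Statistics Office"),
    (SURVEY, STAT + "unitOfAnalysis", "household"),
]


def understood(statements, known_namespaces):
    """Return the statements whose predicate belongs to a known namespace."""
    return [s for s in statements
            if any(s[1].startswith(ns) for ns in known_namespaces)]


# A generic client that only knows Dublin Core still gets a partial picture.
generic_view = understood(triples, [DC])
# A domain client that also knows the statistical terms sees everything.
domain_view = understood(triples, [DC, STAT])
print(len(generic_view), len(domain_view))
```

The generic client simply ignores the terms it does not know, which is one way to read "partial understanding/agreement": agreement on a shared core is enough to cooperate.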
8 What are Web services adding to the party?
- A simple, down-to-earth architecture for
distributed computing and cross-system
interoperability based on XML messaging and HTTP.
- A mechanism (directory service) allowing clients
to dynamically locate relevant services on the
Web (Universal Description, Discovery and
Integration, UDDI)
- A way of describing (the interface of) remote
objects that will allow clients to recognise and
talk to them (Web Services Description Language,
WSDL)
- A communication protocol supporting calls to
remote objects (SOAP).
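As a rough illustration of "XML messaging", the sketch below builds a SOAP 1.1 request envelope with the Python standard library. The envelope namespace is the standard SOAP 1.1 one; the service namespace `SVC`, the operation name `GetVariable` and the parameter `studyId` are hypothetical and do not describe any actual Nesstar or third-party service.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SVC = "http://example.org/stat-service"                 # hypothetical service namespace

# Use a readable prefix for the envelope namespace when serialising.
ET.register_namespace("soap", SOAP_ENV)


def build_request(operation, params):
    """Wrap a remote call (operation name plus parameters) in a SOAP envelope."""
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    call = ET.SubElement(body, f"{{{SVC}}}{operation}")
    for name, value in params.items():
        ET.SubElement(call, f"{{{SVC}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="unicode")


# The resulting XML text would be sent as the body of an HTTP POST.
print(build_request("GetVariable", {"studyId": 42}))
```

In a full Web services stack the operation and parameter names would come from the service's WSDL description, and the service endpoint could be looked up in a UDDI registry.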
9 Requirements for metadata standards for global
information systems
- Modularity: the Lego block principle
- Extensibility: allowing domains or application
providers to add metadata elements without
compromising the interoperability offered by the
base schema
- Refinement: allowing domains or application
providers to refine the use of a universal
standard (making elements obligatory, restricting
value domains, requiring the use of specific
controlled vocabularies etc.)
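Extensibility and refinement can be sketched as two ways of relating a domain profile to a base schema. Everything in the example is hypothetical (the element names, the one-element base schema, the `validate` function); it only illustrates the principle and does not describe any of the standards discussed in this presentation.

```python
BASE_REQUIRED = frozenset({"title"})  # hypothetical base schema: one obligatory element


def validate(record, required=BASE_REQUIRED, allowed_values=None):
    """Return a list of problems; an empty list means the record is valid."""
    problems = [f"missing element: {name}"
                for name in sorted(required) if name not in record]
    rules = allowed_values or {}
    for name in sorted(rules):
        if name in record and record[name] not in rules[name]:
            problems.append(f"value not allowed for {name}: {record[name]!r}")
    return problems


# Extensibility: the record carries an extra, domain-specific element.
# The base schema ignores it, so base-level interoperability is preserved.
record = {
    "title": "Household survey",
    "language": "xx",
    "x-samplingFrame": "postal register",  # hypothetical extension element
}

# Refinement: a domain profile makes "language" obligatory and restricts
# its value domain to a controlled vocabulary.
PROFILE_REQUIRED = BASE_REQUIRED | {"language"}
PROFILE_VALUES = {"language": {"en", "fr", "no"}}

base_problems = validate(record)
profile_problems = validate(record, PROFILE_REQUIRED, PROFILE_VALUES)
print(base_problems)
print(profile_problems)
```

The same record passes the base schema but fails the stricter profile, so a refinement narrows what is accepted while every profile-valid record remains valid against the base.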
10 The tension between the makers of standards and
the implementers
- The main concern of standards makers is
interoperability across domains, communities and
systems (creating "empires of understanding")
- The main concern of implementers is efficiency
and relevance within an application
- ...even in situations where interoperability is
high on the agenda there will be plenty of
reasons for breaking the standards
- Therefore metadata standards aspiring to
universal acceptance cannot insist on regulating
every little detail: the wider the acceptance,
the thinner the standard.
11 Metadata for travelling statistics: where are we?
- The Common Warehouse Metamodel (CWM) from OMG: a
model and syntax for the exchange of metadata
for data warehousing and business intelligence
- ISO 11179: a universal standard for describing
data elements in a metadata repository
- GESMES (and GESMES CB): a metadata model for the
exchange of multidimensional data and
time series
- IQML, AskXML and Triple-S: metadata for the
exchange of questionnaire data
- SPSS MR data model
- The Data Documentation Initiative (DDI): a
general metadata standard for statistical data
(micro as well as aggregated)
12 The Data Documentation Initiative (DDI)
- Established in 1995 to create a universally
supported metadata standard for the social
science community
- Initiated and organised by the
Inter-university Consortium for Political and
Social Research (ICPSR), Michigan, USA
- Members coming from social science data archives
and libraries in USA, Canada and Europe and from
major producers of statistical data
- First version of the standard expressed as an
SGML DTD
- Translated to XML in 1997
- DDI 1.0 published spring 2000
- Extended to cover multidimensional data
(cubes/tables) in 2001
- Architectural reform process initiated 2003
- Fast take-up in the core community and beyond
13 Characteristics
- End-user perspective
  - provide the end-user with the information needed
  to locate relevant data sources and to use
  data sources in a sound way
- Initial emphasis on survey data
  - developed to describe independent surveys on
  study, file and variable level (rudimentary
  support for other types of data)
- Emphasis on codebooks (survey-data dictionaries)
  - metadata seen as a complete book or document
- Library orientation
  - strong on catalogue information
  - mapping to Dublin Core
14 Achievements
- Acceptance
  - fast take-up in the community of data archives
  and data libraries world-wide
- Community building
  - revitalised the co-operation and sharing of
  know-how and technologies among the archives and
  libraries
  - strengthening of the ties between the data
  archiving and data producing communities
- Software development
15 Nesstar - vision
To develop a truly distributed platform for
electronic publishing of statistical data,
building on object technology, open (metadata-)
standards and lightweight Internet protocols.
...or simply
To bring the models, technologies and collective
energy of the Web to the world of statistics.
16 Nesstar - an overview
- An architecture for a totally distributed virtual
data library
- The ability to locate multiple data sources
across national boundaries
- The ability to browse detailed information about
these data sources
- ...and to do data analysis and visualisation over
the net
- ...or to download the appropriate subset of data
in one of a number of formats
- Supporting standard micro-data as well as
aggregated tables/cubes
- Allowing the user to bookmark/hyperlink resources
in the data and metadata repositories
  - searches
  - datasets
  - analysis (tables, models etc.)
- ...and to hyperlink these resources from external
Web objects (like texts)
- A system for imposing a variety of access control
policies, including statistical disclosure
control (SDC)
- Powerful data preparation tools, including a
system for remote publishing of data to Nesstar
servers
17 Nesstar - a fully distributed Data Web
18 Ongoing development
- Integrated multilingual thesaurus support
- Trend Wizard: locating and harmonizing
potentially comparable variables from disparate
sources/sites in order to create trends and
comparisons
- Geo-referencing of statistical data, facilitating
geo-oriented resource location
- Directory service (web service registry) allowing
clients to automatically discover Nesstar servers
and get a description of the services that they
are providing.
19 Examples of use: The World Bank
- Using Nesstar to provide access to large amounts
of survey data collected by the World Bank in
various developing countries
- Data to be used by internal researchers to
evaluate the effects of the Bank's investments
- A customised version of Nesstar fully integrated
with the Bank's intranet
Demo
20 Examples of use: Black Country regional
observatory
- Using Nesstar to build a regional observatory
providing access to local data and knowledge
- One of a series of observatories set up to serve
the community, local industry and the general
public.
- Fully based on the UK e-government standards
(e.g. e-GMS)
Demo
21 Characteristics of the architecture
- Fully distributed architecture with no "central
server", creating a resilient and scalable system
with no single point of failure.
- Well-defined namespace. Every resource or
operation has a corresponding URL. Operations can
consequently be "bookmarked" and reapplied at a
later time. The Nesstar URLs can be embedded in
normal Web pages to provide Nesstar/Web
integration.
- Programming-language-independent protocol.
Nesstar is implemented in Java and C/C++ but the
protocol is XML and RDF based and fully language
independent.
- Integration by hyperlinking. The Nesstar objects
can link to each other across server boundaries.
- Object orientation. The Nesstar system is built
according to an object-oriented principle,
consisting of an extensible set of
self-describing components.
22 NEOOM: Nesstar Object Oriented Middleware
- All statistical objects live at a URL
- Objects are self-describing: when a client
accesses the URL of the object, the object returns
a description of its current state (and its
available methods) in RDF (using RDF as an
Interface Description Language)
- Remote object-oriented calls are performed by a
simple protocol running on top of HTTP. The calls
can be stored as a URL, specifying the location
of the relevant object as well as the method
parameters.
- This allows for client-side storage of
statistical operations that can easily be rerun
at a later stage, thereby creating a simple batch
language for operations on remote statistical
objects.
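The idea of storing a remote call as a URL, and a list of such URLs as a simple batch script, can be sketched as follows. The URL layout is an assumption made for this illustration only: the slides do not give the actual NEOOM URL syntax, so the `method` query parameter, the example server names, the method names `tabulate` and `download`, and both helper functions are hypothetical.

```python
from urllib.parse import urlencode, urlsplit, parse_qsl


def call_url(object_url, method, **params):
    """Encode a call on a remote object as one URL (hypothetical layout)."""
    query = urlencode([("method", method)] + sorted(params.items()))
    return f"{object_url}?{query}"


def parse_call(url):
    """Split a stored call URL back into object location, method and parameters."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query)
    method = dict(pairs)["method"]
    params = {key: value for key, value in pairs if key != "method"}
    return f"{parts.scheme}://{parts.netloc}{parts.path}", method, params


# A client-side "batch script": a plain list of bookmarked operations,
# possibly addressing objects on different servers.
batch = [
    call_url("http://server-a.example.org/obj/study/42", "tabulate",
             rows="sex", cols="region"),
    call_url("http://server-b.example.org/obj/cube/7", "download", format="csv"),
]

for url in batch:
    # A real client would issue an HTTP request for each URL here;
    # this sketch only decodes the stored call.
    print(parse_call(url))
```

Because each stored call is an ordinary URL, it can be bookmarked, embedded in a Web page, or rerun later without any client-side state beyond the URL itself.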
23 Software architecture
Nesstar end-user client: runs on any PC/workstation
with any operating system that can run a modern web
browser with standard JavaScript support.
Recommended minimum hardware configuration:
RAM 256 MB, processor 800 MHz.
Nesstar server: dedicated server running under MS
Windows 2000 or XP. Recommended minimum hardware
configuration: RAM 2 GB, processor 2 x 2.0 GHz,
hard disk 60 GB. The Nesstar Web engine can run on
top of the Nesstar server or reside on another
server machine with the same recommended minimum
configuration.
Components:
- End-user client: standard JavaScript-enabled Web
browser (MS Internet Explorer 5.0 or
Netscape/Mozilla 5.0)
- Nesstar Web engine (using Cocoon 2.1 and
Velocity 1.3)
- Web Client Application
- Object cache
- Proxy objects
- HTTP Interface Servlet
- RDF Class Interface Definitions
- Web Server/Container: Tomcat 4.1.24
- BridgeRemote Bean
- LocalBean
- Persistence manager
- J2EE compliant EJB Container: JBoss 3.2.1
- MVCSoft 1.1
- Metadata database: Oracle/MySQL/MS SQL-Server