Title: Information Architecture of the Global Biodiversity Information Facility
1. Information Architecture of the Global Biodiversity Information Facility
- Open Forum for Metadata Registries
- Santa Fe, New Mexico
- 20-24 January 2003
- Hannu Saarenmaa
- hsaarenmaa_at_gbif.org
- www.gbif.net
2. Outline of presentation
- What is biodiversity informatics
- What is GBIF
- Architectural principles
- Central registries
- Data nodes
- Participant nodes
- Role of portals
- Conclusion
3. Biodiversity informatics
- The science of organisation, sharing, dissemination and use of data, information, and knowledge on biological diversity
- Builds on distributed computing, data management, knowledge management, multimedia, GIS, e-business technologies, Grid...
- Biodiversity information is extremely distributed
- Biodiversity informatics is new and only became truly feasible with the web
4. SCOPE OF BIODIVERSITY INFORMATICS WITHIN BIOLOGICAL INFORMATICS
[Diagram: biodiversity informatics sits between environmental informatics and bioinformatics, spanning entities such as Scientific Name, Publication, Population/Culture, Habitat, Landscape, Genomics, Ecological Relationships, Species/Taxon, Ecosystem, Individual/Specimen, Site, Molecular Biology, Biosphere, Field Observation, Spatial Data, and Collection.]
5. BIODIVERSITY IS AN INFORMATION MANAGEMENT CHALLENGE
- Total number of species: about 10 million
- 1.7 million species have been described and named
- Total number of specimens in museum collections: 1-3 billion
- Collections also hide a large number of not yet described species
- 18 000 new species described each year
- This rate has not improved during the past 40 years
- 1 000 to 10 000 species lost each year to extinction
- This rate is 1 000 times faster than the natural rate
6. WHAT IS GBIF?
- GBIF is an international scientific co-operative project based on a multilateral agreement (MoU) between countries, economies and international organisations, dedicated to
- establishing an interoperable, distributed network of databases containing scientific biodiversity information, in order to
- make the world's scientific biodiversity data freely and universally available to all,
- with initial focus on species- and specimen-level data,
- with links to molecular, genetic and ecosystem levels
7. THE STORY OF GBIF
- 1996: Planning of GBIF starts
- January 1999: Working group of the MegaScience Forum of the OECD recommends establishing GBIF
- March 2001: GBIF formally established
- June 2001: Denmark chosen to host the GBIF Secretariat
- November 2001: Executive Secretary James L. Edwards moves to Copenhagen and initiates the Secretariat
- October 2002: First work programme approved
- 2004: Three-year review and necessary reorientation
- 2006: Initial 5-year commitment of participants over; the future of GBIF will be reconsidered
8. GBIF WORK PROGRAMMES
- Data Access and Database Interoperability
- Electronic Catalogue of Names of Known Organisms
- Digitisation of Natural History Collections
- Outreach and Capacity Building
- Species Bank
- Digital Biodiversity Literature Resources
9. GBIF VOTING PARTICIPANTS (as of 1 January 2003)
- 22 Voting Participants
- Australia, Belgium, Canada, Costa Rica, Denmark, Finland, France, Germany, Iceland, Japan, Korea, Mexico, Netherlands, New Zealand, Nicaragua, Peru, Portugal, Slovenia, Spain, Sweden, UK, USA
- Convention on Biological Diversity is also an ex officio (non-voting) member of the Governing Board
10. 30 ASSOCIATE PARTICIPANTS (as of 1 January 2003)
- Argentina
- Austria
- Bulgaria
- Czech Republic
- European Commission
- Ghana
- Pakistan
- Poland
- Slovak Republic
- South Africa
- Switzerland
- Taiwan
- Tanzania
- ALL Species Foundation
- ASEANET
- BioNET
- BIOSIS
- Commonwealth Agricultural Bureau (CABI)
- EASIANET
- Expert Centre for Taxonomic Identification
- Inter-American Biodiversity Information Network
- Integrated Taxonomic Information System
- NatureServe
- Ocean Biogeographic Information System
- Species 2000
- Taxonomic Databases Working Group
- UNESCO--Man and the Biosphere Program
- UNEP--World Conservation Monitoring Centre
- The World Federation for Culture Collections
11. PARTICIPANTS AGREE TO...
- Share biodiversity data
- Set up a node or nodes for sharing the data
- Formulate and implement the GBIF work programme for their part
- Voting Participants (countries and economies) make a yearly contribution based on GDP
- GBIF central budget is 3M
- Associate Participants (countries, economies, international organisations) cannot vote, but otherwise participate fully in GBIF activities and decisions
- Make additional investments in biodiversity information and the necessary infrastructure
- 90% of investment in GBIF happens within Participants, only 10% centrally for providing the linking mechanism
12. GBIF node responsibilities
[Diagram summarising responsibilities of the node types:]
- GBIF Metadata Registry and Portal: Network, Standards, Tools, Consolidated Data
- Participant Node: Identify (local) Data Nodes; Forward registration metadata from Data Nodes; National language interfaces; Encourage participation; Manage registration of Data Nodes
- Data Node
13. GBIF IPR Principles
- GBIF will seek to ensure that data in GBIF-affiliated databases is in the public domain
- In particular data enabling linking with other data
- GBIF will seek to ensure that the source of data is acknowledged by all users
- Cf. Open Source licenses, commons
- Maintenance and control of data remain in the hands of database owners
- There will be no central data banks (except caches)
- Database owners can block access to sensitive data
- Countries have sovereignty over their biological resources
- It follows that GBIF services will mainly be integrative metadata services, and standards
14. GBIF information architecture and central registries
15. Information model (under development)
[Diagram of the draft information model. Entities shown include: Institutions (Rights, Services); Data sources (Source URL, Protocol, Format, Data exchange, Description); Datasets (Subject/Taxa coverage, Spatial coverage, Temporal coverage, Description); Units/Records; Objects (Rights, Format, Encodings); Knowledge bases; Unstructured information. Example dataset types: taxonomies in Global Species Databases, checklists and red lists, observation data, specimen data, species knowledge.]
16. COMPONENTS IN THE GBIF INFORMATION ARCHITECTURE
[Diagram grouping components into Participant Services, Central Services and Internal Services. Components shown: Participant Node Portal, GBIF portal, Group Collaboration Services, Multiple Data Nodes, Data Base, Data Repository, GBIFS Services, Registry, Taxonomic Name Service, Standard Validation Tool, Standards Repository, Digitization Tools.]
17. "You don't get very far with web services unless you have a registry." -- Tom Gaskins, uddi.org
The registry
- One global marketplace of shared biodiversity data
- A central services registry
- Directory of Participants and data providers
- Data sources and datasets offered
- Services of the providers
- The services registry will then be used to generate a metadata registry of the available data
- The registry retrieves metadata from the registered datasets
- Indices over key elements in datasets (Dublin Core registry): subject (taxon), coverage (spatial, temporal), ... (a sketch of such a record follows below)
- Open interfaces for portals and specialised search engines
- Anybody can write their own portal/search tool that uses the registry and the index
- Will be written by GBIFS using open source components
- Examples available in Biomoby, EIONET Content Registry, NBN
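To make the "Dublin Core registry" idea concrete, here is a minimal Python sketch of the kind of Dublin Core style dataset description the central metadata registry could harvest from a provider. The element names, field values and endpoint URL are illustrative assumptions, not the actual GBIF registry schema.

# A minimal sketch (not the actual GBIF registry schema) of a Dublin Core
# style dataset description harvested by the central metadata registry.
# Element choices, values and the endpoint URL are assumptions.
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

def dataset_description(title, publisher, subject_taxa, spatial, temporal, access_url):
    """Return a Dublin Core style <metadata> element for one dataset."""
    record = ET.Element("metadata")
    for name, value in [
        ("title", title),            # dataset name as advertised by the provider
        ("publisher", publisher),    # owning institution
        ("subject", subject_taxa),   # taxonomic coverage
        ("coverage", f"{spatial}; {temporal}"),  # spatial and temporal coverage
        ("identifier", access_url),  # where the data service can be reached
    ]:
        ET.SubElement(record, f"{{{DC_NS}}}{name}").text = value
    return record

if __name__ == "__main__":
    rec = dataset_description(
        title="Example herbarium specimen dataset",
        publisher="Example Natural History Museum",
        subject_taxa="Pinaceae",
        spatial="Scandinavia",
        temporal="1850-2002",
        access_url="http://data.example.org/digir",  # hypothetical endpoint
    )
    print(ET.tostring(rec, encoding="unicode"))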
18. The registry consists of a Services registry and a Content registry
[Diagram:]
- Services registry: Providers; Data sources/sets; Services of the above
- Content registry: Names and concepts; Federated key data; Indexes of content
- Communications Portal: Syndication, Collaboration, User directories
- Clients of the registry include the Species Bank, specialised portals, web applications and search engines
- Data sources at institutions feed the registry
19. Taxonomic name service
- Dynamic linking mechanism derived from the contents of the Catalogue of Life
- Closely paired with the registry
- Many possible approaches
- Semantic web and RDF
- Taxonomic object service: a global namespace of URIs
- Provide programmatic access to the current state of taxonomic knowledge
- Provide a single name service that encapsulates existing services such as ITIS and SP2000 (sketched below)
- GBIF is working with other groups such as the Consortium on the Catalogue of Life, TDWG and Octopus to define this service
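As an illustration of the "single name service encapsulating existing services" idea, the following Python sketch resolves a scientific name against stand-in source indexes and wraps the hit in a hypothetical global URI namespace. The indexes, identifiers and URI scheme are invented for illustration; a real service would call ITIS, Species 2000 or the Catalogue of Life.

# A minimal sketch of a single name service in front of several sources.
# The lookup tables stand in for remote services; identifiers and the URI
# namespace are invented for illustration only.
from typing import Optional

ITIS_INDEX = {"Pinus ponderosa": "itis:tsn-0000000"}       # placeholder identifier
SP2000_INDEX = {"Pinus ponderosa": "sp2000:example-12345"}  # placeholder identifier

def resolve_name(scientific_name: str) -> Optional[str]:
    """Return one stable identifier for a name, trying each source in turn."""
    for index in (ITIS_INDEX, SP2000_INDEX):
        if scientific_name in index:
            # Wrap the source identifier in a (hypothetical) global URI namespace,
            # echoing the "taxonomic object service / global namespace of URIs" idea.
            return f"urn:example:taxon:{index[scientific_name]}"
    return None

if __name__ == "__main__":
    print(resolve_name("Pinus ponderosa"))
    print(resolve_name("Unknownus nonexistens"))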
20. Integration by name service (ECAT)
[Diagram; ECAT elements are coloured orange. Name Lists are lists of names for a specific purpose (e.g. Red List, regional checklist). Components shown: Portal; XML Data Access; HTML Data Access; GBIF Data Access (Servlets); Registry; Species 2000; Name Usage Index; Name Service Interface (ECAT); indexing of usage feeds the Name Usage Index.]
21. Data exchange standards are key
- XML Schema must be agreed for
- Name
- Taxon
- Specimen
- Collection
- Person in various roles
- Publication
- Site
- Observation
- Standards process must be open and consistent
- Discussion
- Documentation
- Support for format validation
- Support for quality assurance
- Leading standards
- TDWG ABCD
- BioCASE
- Dublin Core
- Darwin Core
- DiGIR
- SOAP
- Grid OGSA
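To show what XML data encoding means in practice for a specimen, here is a minimal Python sketch that serialises one record to XML. The element names only follow the general flavour of Darwin Core/ABCD; a real document would use the agreed TDWG schemas, namespaces and required fields.

# A minimal sketch of serialising one specimen record to XML.  This is not a
# validated Darwin Core or ABCD document; element names and values are
# illustrative assumptions.
import xml.etree.ElementTree as ET

def specimen_to_xml(record: dict) -> str:
    root = ET.Element("SpecimenRecord")  # placeholder root element
    for field, value in record.items():
        ET.SubElement(root, field).text = str(value)
    return ET.tostring(root, encoding="unicode")

if __name__ == "__main__":
    example = {
        "InstitutionCode": "EXMPL",          # hypothetical collection holder
        "CollectionCode": "HERB",
        "CatalogNumber": "000123",
        "ScientificName": "Pinus ponderosa",
        "Country": "United States",
        "CollectingDate": "1998-07-14",
    }
    print(specimen_to_xml(example))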
22. Data nodes
23. Data node
[Diagram of a data node. Services exposed: WSDL service descriptions; specimen index data (3-5 fields); specimen summary data (20-30 fields); specimen detail (full data); HTML presentation. Internal components: Node Data Services, Presentation Service, Metadata Services, Data Repository, and Collection Database Adaptors in front of the Collection Databases.]
24. Services of data nodes
- Export the shared data into
- a data warehouse in SQL
- a data repository (locally owned) in document format
- Advertise the provider, its services and available datasets to the central registry
- WSDL, possibly UDDI/ebXML
- Participant node to coordinate
- Dublin Core description of published datasets
- Enable the central metadata registry to index the datasets
- SOAP/DiGIR protocol for queries and responses
- TDWG/ABCD standard for data encoding in XML
- Respond to queries from data users (a simplified sketch follows below)
- SOAP/DiGIR protocol for queries and responses
- TDWG/ABCD standard for data encoding in XML
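The sketch below illustrates, in simplified form, the query/response role of a data node: accept a filter, search the locally exported dataset, and return the matches as XML. It is not the real DiGIR/SOAP wire format; field names and the response wrapper are assumptions.

# A schematic sketch of a data node answering queries.  NOT the real
# DiGIR/SOAP wire format; it only shows filtering the locally exported
# records and returning matches as XML.
import xml.etree.ElementTree as ET

# Stand-in for the locally owned, exported warehouse of shared records.
EXPORTED_RECORDS = [
    {"CatalogNumber": "000123", "ScientificName": "Pinus ponderosa", "Country": "US"},
    {"CatalogNumber": "000124", "ScientificName": "Picea abies", "Country": "FI"},
]

def answer_query(field: str, value: str) -> str:
    """Return matching records as an XML document (simplified response)."""
    results = ET.Element("Results")
    for rec in EXPORTED_RECORDS:
        if rec.get(field) == value:
            unit = ET.SubElement(results, "Unit")
            for k, v in rec.items():
                ET.SubElement(unit, k).text = v
    return ET.tostring(results, encoding="unicode")

if __name__ == "__main__":
    print(answer_query("ScientificName", "Pinus ponderosa"))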
25. Some issues with online databases
- Writing middleware ("wrappers", "resource agents") is a complicated task
- Toolkits, guidelines, and assistance (roaming "wrapper writers") are needed
- High availability of databases when queries come in must be arranged, or caching used
- Data warehousing is recommended (see the sketch below)
- Putting original operational data on the web is risky
- Operational databases are optimised for input and storage, not always for querying
- Quality assurance and possible approval are easier to do on an exported dataset than on an entire database
- For these reasons international reporting systems have traditionally used document-based data exchange
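A minimal sketch of the data warehousing recommendation: copy the shared fields out of the operational collection database into a flat, query-optimised table (SQLite here for simplicity) that can be served or cached independently of the operational system. Table and column names are illustrative assumptions.

# A minimal sketch of exporting shared fields into a flat warehouse table
# so that queries hit the export, not the operational database.  Table and
# column names are assumptions.
import sqlite3

def build_warehouse(records, path=":memory:"):
    """Load exported records into a flat SQLite table and return the connection."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE specimen_warehouse ("
        "catalog_number TEXT, scientific_name TEXT, country TEXT)"
    )
    conn.executemany(
        "INSERT INTO specimen_warehouse VALUES (?, ?, ?)",
        [(r["CatalogNumber"], r["ScientificName"], r["Country"]) for r in records],
    )
    conn.commit()
    return conn

if __name__ == "__main__":
    exported = [
        {"CatalogNumber": "000123", "ScientificName": "Pinus ponderosa", "Country": "US"},
        {"CatalogNumber": "000124", "ScientificName": "Picea abies", "Country": "FI"},
    ]
    conn = build_warehouse(exported)
    # Queries now run against the flat export.
    print(conn.execute("SELECT COUNT(*) FROM specimen_warehouse").fetchone()[0])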
26. Data repositories
- Export the shared dataset into a locally owned repository
- The exported dataset is held in document format
- or in a data warehouse with a flat structure
- Data is stored and served in XML format using the GBIF/TDWG standards
- i.e., Darwin Core / ABCD
- The repository acts as a wrapper and allows the central metadata registry to index its datasets
- SOAP/DiGIR, WSDL, Dublin Core, Darwin Core/ABCD
- But... a repository does not respond to dynamic queries
- It only returns the files as they were uploaded (sketched below)
- Enables data warehousing elsewhere
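The following sketch illustrates the document-repository idea: each exported record is written once as an XML file, and the repository later returns those files exactly as uploaded, with no dynamic querying. Directory layout and file naming are invented for illustration.

# A minimal sketch of a document-style repository: write each exported
# record as an XML file, then serve the files verbatim.  Layout and naming
# are assumptions.
import pathlib
import xml.etree.ElementTree as ET

def upload_to_repository(records, repo_dir="repository"):
    """Write one XML document per record into the repository directory."""
    repo = pathlib.Path(repo_dir)
    repo.mkdir(exist_ok=True)
    for rec in records:
        doc = ET.Element("Unit")
        for field, value in rec.items():
            ET.SubElement(doc, field).text = str(value)
        (repo / f"{rec['CatalogNumber']}.xml").write_text(
            ET.tostring(doc, encoding="unicode")
        )

def fetch_document(catalog_number, repo_dir="repository"):
    """Return the stored file verbatim -- the repository does not query it."""
    return (pathlib.Path(repo_dir) / f"{catalog_number}.xml").read_text()

if __name__ == "__main__":
    upload_to_repository([{"CatalogNumber": "000123", "ScientificName": "Pinus ponderosa"}])
    print(fetch_document("000123"))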
27. Participant nodes
28. Participant node
[Diagram of a participant node. Services exposed: WSDL service descriptions; specimen data; name data; general resource data; HTML presentation. Internal components: Portal Services, Presentation Service, Registry Management, UDDI Service Registry. Inputs: specimen data from collection data nodes and data services from the GBIF portal, both described by WSDL service descriptions.]
29. Node relationships for registration
[Diagram: Collection nodes send service metadata to their Participant node; Participant nodes forward service metadata to the GBIF Registry and Portal, which harvests indexing metadata from the collection nodes.]
30. Possible services of the Participant nodes
- Promotion and help with inclusion of new data providers and datasets
- Quality assurance and compliance
- Institutional level: coordinate the Participant's part of the network (who can play?)
- Dataset level: compliance with national regulations and IPR
- Data element level: scientific quality control of the correctness of advertised data
- Technical level: is the format right?
- Host data from willing data nodes
- National language support as needed
- Use of the central registry to provide access to domain-relevant data
- Thematic portal and search facilities to find special data of the Participant
31. What tools a Participant node needs
- Necessities
- Register institutions and data sources (nodes)
- Local directory server or UDDI database, linking with the central registry
- Register the services and datasets of nodes
- Local UDDI or other registry, linking with the central registry (a registration sketch follows below)
- Good to have
- Tools for quality assurance
- Portal server for a domain-specific website
- PTK for communication, repository tool for hosting
- Directory of people and communication tools
32. GBIF central portal
33. Role of portals
- Communication/coordination needs
- Portals are integrative tools and gateways to information that go beyond single websites
- Portals and related directory services can be used to coordinate network activities
- Data access needs
- Much of the content on the portals can be built automatically out of the contents of the Registry using metadata (a sketch follows below)
- The GBIF central portal is only one of many portals and search engines making use of the central metadata registry and related indices through their open interfaces
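To illustrate how portal content can be generated automatically from registry metadata, here is a small Python sketch that renders a dataset listing grouped by provider from harvested metadata records. The metadata fields and the bare HTML output are illustrative assumptions.

# A minimal sketch of building portal content from harvested registry
# metadata: list datasets grouped by provider.  Field names and the HTML
# fragment are assumptions.
from collections import defaultdict

REGISTRY_METADATA = [
    {"publisher": "Example Natural History Museum", "title": "Example herbarium specimens"},
    {"publisher": "Example Natural History Museum", "title": "Example bird observations"},
    {"publisher": "Another Institute", "title": "Example fish collection"},
]

def render_dataset_listing(metadata) -> str:
    """Return an HTML fragment listing datasets grouped by provider."""
    by_provider = defaultdict(list)
    for record in metadata:
        by_provider[record["publisher"]].append(record["title"])
    parts = []
    for provider, titles in sorted(by_provider.items()):
        parts.append(f"<h2>{provider}</h2><ul>")
        parts.extend(f"<li>{t}</li>" for t in titles)
        parts.append("</ul>")
    return "\n".join(parts)

if __name__ == "__main__":
    print(render_dataset_listing(REGISTRY_METADATA))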
34. Services of the GBIF portal
- Version 1 released at the end of 2002, and as a toolkit for nodes: http://beta.gbif.org/
- News syndication and electronic newspaper with discussion
- Events, calendar of calendars, projects
- Articles, documents, images, audio and video content
- Search within the site, across the GBIF network
- Download area
- Getting-started service and how to become a node
- About GBIF
- CIRCA-based group collaboration services
- Directory services (CIRCA-based open LDAP)
- Suggestions and feedback from users
- Prototype data repository
- Version 2 at the end of 2003, demonstration earlier in the year
- XML standards repository and registry links
- Links to Participant nodes and their content
- Access to biological content derived from the registry
35. Test version of the central GBIF communications portal
36. GBIF Portal Toolkit (PTK)
- Model implementation of the functions needed for dissemination and interaction between portals and the registry
- Interoperability and automatic content syndication between collaborative portals, e.g. CHM
- Shows how to use the registry to create a vertical portal and specialised search engine
- Knowledge management based on GBIF data sources
- User interface for data mining, knowledge discovery, knowledge contributions, ...
- Packages the tools in one turn-key solution to reduce the time needed for a node to get online and include a new data source in gbif.net
- Open source, based on Zope (www.zope.org)
- Available now as a beta
37. Conclusion
38. GBIF VISION (a technical update)
[Diagram contrasting content-area responsibilities of GBIF (Registry of Shared Biodiversity Data, specimen and observation data, Electronic Catalog of Names, SpeciesBank, search engines and portals) with existing responsibilities of other groups (GenBank et al. for sequence data (RNA, protein, etc.), geospatial data, climate data, ecosystem data, ecological data).]
39. Species portals?
- When all is done, new kinds of services can be created semi-automatically from the contents of the registry, e.g. the SpeciesBank
- Scoping of the SpeciesBank is needed
- http://ponderosa.pinus.plantae.bio
40. GRID AND GBIF
- Grid is the emerging global distributed network for universally available high-performance computing and networking
- Potentially very relevant for GBIF
- Architectures
- Web Services or Open Grid Service Architecture?
- Possible areas of activity
- The Semantic Grid might fit the taxonomic name service
- Production of global distribution maps under multiple global change scenarios could require computational capacity from the Grid
- An advanced collaborative environment (ACE is a Grid Research Group) is needed for accelerating species discovery and distributed authoring of the Species Bank
41. HOW TO PLAY?
- After all, 90% of investment in GBIF should be within Participants, not centrally
- Share your data. Anyone can apply to become a data node; the Participant nodes will coordinate
- Use the data. Provide value-added services for data archiving, mining and analysis that build on the upcoming wealth of data
- Vertical portals and specialised search engines
- Contribute new data and knowledge
- Calls for proposals for seed money for important digitisation and other projects
- GBIF builds on open source
- Lots of room to provide tools
- Contribute to standards refinement
42. SUMMARY
- GBIF network to be up and running by the end of 2003
- New generation of simple data exchange standards
- Central registry and marketplace of distributed data
- Anyone can build their own vertical portals or specialised search engines on top of that
- Participant nodes: major role in quality control, coordination and dissemination
- Data nodes: register your datasets, provide online access to a database or repository
- Data remains under the control of providers