IZA Data Service Center DDI/SDMX Workshop, Wiesbaden, Germany, June 18th 2008: The Data Documentation Initiative (DDI)

About This Presentation
Title:

IZA Data Service Center DDI/SDMX Workshop Wiesbaden, Germany, June 18th 2008 The Data Documentation Initiative (DDI)

Description:

IZA Data Service Center DDI/SDMX Workshop Wiesbaden, Germany, June 18th 2008 The Data Documentation Initiative (DDI) Arofan Gregory / Pascal Heus – PowerPoint PPT presentation

Slides: 139
Provided by: Pasca60
Learn more at: http://odaf.org
Transcript and Presenter's Notes

Title: IZA Data Service Center DDI/SDMX Workshop Wiesbaden, Germany, June 18th 2008 The Data Documentation Initiative (DDI)


1
IZA Data Service Center DDI/SDMX Workshop
Wiesbaden, Germany, June 18th 2008
The Data Documentation Initiative (DDI)
  • Arofan Gregory / Pascal Heus
  • agregory_at_opendatafoundation.org /
    pheus_at_opendatafoundation.org
  • Open Data Foundation

2
Content
  • Background on metadata and XML
  • Metadata and Microdata
  • XML and Microdata: the DDI
  • DDI 2.0
  • DDI 3.0
  • DDI 2.0 vs 3.0
  • Major stakeholders / initiatives

3
Metadata / XML
4
What is metadata?
  • Common definition: data about data

5
What is XML?
  • Today's universal language on the web
  • Purpose is to facilitate sharing of structured
    information across information systems
  • XML stands for eXtensible Markup Language
  • eXtensible: can be customized
  • Markup: tags, marks, attach attributes to things
  • Language: syntax (grammatical rules)
  • HTML (HyperText Markup Language) is a markup
    language but not extensible! It is also concerned
    with presentation, not content.
  • XML is a text format (not a binary black box)
  • XML is also a collection of technologies (built
    on the XML language)
  • It is platform independent and is understood by
    modern programming languages (C, Java, .NET,
    PHP, Perl, etc.)
  • It is both machine and human readable

6
Simple XML example
<catalog>
  <book isbn="0385504209">
    <title>Da Vinci Code</title>
    <author>Dan Brown</author>
  </book>
  <book isbn="0553294385" pages="352">
    <title>I, robot</title>
    <author>Isaac Asimov</author>
    <language>English</language>
  </book>
</catalog>
Callouts on the slide: attributes, elements, opening and closing tags, text content
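The parts called out on this slide (attributes, elements, text content) can be read back by any XML parser. The following is a minimal sketch using Python's standard library; the function name summarize_books is an illustrative assumption, and the catalog is the example from the slide.

```python
import xml.etree.ElementTree as ET

# The catalog from the slide, with its markup restored.
CATALOG_XML = """
<catalog>
  <book isbn="0385504209">
    <title>Da Vinci Code</title>
    <author>Dan Brown</author>
  </book>
  <book isbn="0553294385" pages="352">
    <title>I, robot</title>
    <author>Isaac Asimov</author>
    <language>English</language>
  </book>
</catalog>
"""


def summarize_books(xml_text):
    """Return one dict per <book>, merging attributes and child-element text."""
    root = ET.fromstring(xml_text)
    books = []
    for book in root.findall("book"):
        record = dict(book.attrib)          # attributes, e.g. isbn and pages
        for child in book:                  # child elements, e.g. title and author
            record[child.tag] = child.text  # text content between the tags
        books.append(record)
    return books


if __name__ == "__main__":
    for record in summarize_books(CATALOG_XML):
        print(record)
```

Because XML is plain text with explicit structure, the same few lines work on any platform without knowing how the file was produced.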
7
XML Technology overview
Document Type Definition (DTD) and XML Schema are
used to validate an XML document by defining
namespaces, elements, rules
Specialized software and database systems can be
used to create and edit XML documents. In the
future the XForms standard will be used
XML separates the metadata storage from its
presentation. XML documents can be transformed
into something else (HTML, PDF, XML, other)
through the use of the eXtensible Stylesheet
Language: XSL Transformations (XSLT) and XSL
Formatting Objects (XSL-FO)
Very much like a database system, XML documents
can be searched and queried through the use of
XPath or XQuery. There is no need to create
tables, indexes or define relationships
XML metadata or data can be published in smart
catalogs, often referred to as registries, that can
be used for discovery of information.
XML documents can be sent like regular files but
are typically exchanged between applications
through Web Services using SOAP and other
protocols
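As a rough illustration of querying XML without tables or indexes, the sketch below runs two XPath-style queries over a small book catalog. It uses the limited XPath subset supported by Python's xml.etree.ElementTree; the function names are assumptions made for this example, and a full XPath or XQuery engine would offer much more.

```python
import xml.etree.ElementTree as ET

# A small catalog, parsed once and then queried in place.
CATALOG = ET.fromstring(
    "<catalog>"
    '<book isbn="0385504209"><title>Da Vinci Code</title>'
    "<author>Dan Brown</author></book>"
    '<book isbn="0553294385" pages="352"><title>I, robot</title>'
    "<author>Isaac Asimov</author></book>"
    "</catalog>"
)


def titles_by_author(root, author):
    """Titles of books whose <author> child has exactly this text.

    The author string is interpolated into the path, so this sketch
    does not handle names that contain a single quote.
    """
    path = f".//book[author='{author}']/title"
    return [title.text for title in root.findall(path)]


def isbns_with_page_count(root):
    """ISBNs of books that carry a 'pages' attribute."""
    return [book.get("isbn") for book in root.findall(".//book[@pages]")]


if __name__ == "__main__":
    print(titles_by_author(CATALOG, "Isaac Asimov"))
    print(isbns_with_page_count(CATALOG))
```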
8
What is an XML Schema?
  • Exchange / sharing / harmonization implies
    agreement on structure
  • We need a specification that describes the
    structure and rules: a schema
  • A schema is a set of rules to which an XML
    document must conform in order to be considered
    'valid'
  • XML Schema was also designed with the intent that
    determination of a document's validity would
    produce a collection of information adhering to
    specific data types
  • Similar to a relational database's structural
    definition
  • Many schemas exist for different purposes
  • Examples
  • DDI, SDMX, Dublin Core, RSS, XHTML, etc.
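The idea of a schema as "a set of rules to which a document must conform" can be sketched with a toy checker. This is not XML Schema (XSD) and not how DDI or SDMX validation works; the RULES table and the function name are hypothetical, chosen only to show what checking a document against agreed rules looks like.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule set for a <catalog> document; NOT a real XML Schema.
RULES = {
    "catalog": {"required_attrs": set(), "required_children": {"book"}},
    "book": {"required_attrs": {"isbn"}, "required_children": {"title", "author"}},
}


def validation_errors(xml_text):
    """Return a list of rule violations; an empty list means valid under RULES."""
    errors = []
    root = ET.fromstring(xml_text)
    for element in root.iter():
        rule = RULES.get(element.tag)
        if rule is None:
            continue  # elements without a rule are not checked in this sketch
        for attr in sorted(rule["required_attrs"] - set(element.attrib)):
            errors.append(f"<{element.tag}> is missing attribute '{attr}'")
        child_tags = {child.tag for child in element}
        for tag in sorted(rule["required_children"] - child_tags):
            errors.append(f"<{element.tag}> is missing child <{tag}>")
    return errors


if __name__ == "__main__":
    print(validation_errors("<catalog><book><title>T</title></book></catalog>"))
```

A real schema language additionally constrains data types, ordering and cardinality, which is what makes validated documents safe to process by software.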

9
Metadata, XML and Microdata
10
What is a survey?
  • More than just data.
  • A complex process to produce data for the purpose
    of statistical analysis
  • Beyond this, a tool to support evidence based
    policy making and results monitoring
  • The data is surrounded by a large body of
    documentation
  • Survey data often come with limited documentation
  • Note that microdata is intended for experts
  • Statisticians / researchers
  • Represents a single point in time and space
  • Needs to be aggregated to produce meaningful
    results
  • It is the beginning of the story

11
What is survey metadata?
  • Survey documentation can be broken down into
    structured metadata and documents
  • Structured metadata can be captured using XML
  • Documents can be described in structured metadata
  • Example of metadata
  • Survey level: title, country, year, abstract,
    sampling, agencies, access policy, etc.
  • Variable level: filename, label, code, questions,
    instructions, derivation, etc.
  • Related materials: report, questionnaire, papers,
    manuals, scripts/programs, photos
  • Cross-surveys: catalogs, longitudinal, concepts,
    comparability, etc.

12
Importance of survey metadata
  • Data Quality
  • Usefulness, accessibility, coherence,
    completeness, relevance, timeliness
  • Undocumented data is useless
  • Partially documented data is risky (misuse)
  • Data discovery and access
  • Preservation
  • Replication standard (Gary King)
  • Information exchange
  • Reduce need to access sensitive data
  • Maintain coherence / linkages across the complete
    life cycle (from respondent to policy maker)
  • Reuse

13
The Data Documentation Initiative
  • The Data Documentation Initiative is an XML
    specification to capture structured metadata
    about microdata (broad sense)
  • First generation: DDI 1.0 to 2.1 (2000-2008)
  • focus on single archived instance
  • Second generation: DDI 3.0 (2008)
  • focus on life cycle
  • go beyond the single survey concept
  • multi-purpose

14
DDI Timeline / Status
  • Pre-DDI 1.0
  • 70s / 80s OSIRIS Codebook
  • 1993 IASSIST Codebook Action Group
  • 1996 SGML DTD
  • 1997 DDI XML
  • 1999 Draft DDI DTD
  • 2000 DDI 1.0
  • Simple survey
  • Archival data formats
  • Microdata only
  • 2003 DDI 2.0
  • Aggregate data (based on matrix structure)
  • Added geographic material to aid geographic
    search systems and GIS users
  • 2003 - Establishment of DDI Alliance
  • 2004 Acceptance of a new DDI paradigm
  • Lifecycle model
  • Shift from the codebook centric / variable
    centric model to capturing the lifecycle of data
  • Agreement on expanded areas of coverage
  • 2005
  • Presentation of schema structure
  • Focus on points of metadata creation and reuse
  • 2006
  • Presentation of first complete 3.0 model
  • Internal and public review
  • 2007
  • Vote to move to Candidate Version (CR)
  • Establishment of a set of use cases to test
    application and implementation
  • October 3.0 CR2
  • 2008
  • February 3.0 CR3
  • March 3.0 CR3 update
  • April 3.0 CR3 final
  • April 28th 3.0 Approved by DDI Alliance
  • May 21st DDI 3.0 Officially announced
  • Initial presentations at IASSIST 2008
  • 2009
  • DDI 3.1 and beyond

15
DDI 1/2.x
16
The archive perspective
  • Focus on preservation of a survey
  • Often see survey as collection of data files
    accompanied by documentation
  • Code book centric
  • report, questionnaire, methodologies, scripts,
    etc.
  • Results in a static event: the archive
  • Maintained by a single agency
  • Is typically documentation after the fact
  • This is the initial DDI perspective (DDI 2.0)

17
DDI 2.0 Technical Overview
  • Based on a single structure (DTD)
  • 1 codeBook, 5 sections
  • docDscr: describes the DDI document
  • The preparation of the metadata
  • stdyDscr: describes the study
  • Title, abstract, methodologies, agencies, access
    policy
  • fileDscr: describes each file in the dataset
  • dataDscr: describes the data in the files
  • Variables (name, code, ...)
  • Variable groups
  • Cubes
  • othMat: other related materials
  • Basic document citation
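The "1 codeBook, 5 sections" layout can be sketched as a document skeleton. The section names (docDscr, stdyDscr, fileDscr, dataDscr, othMat) come from the slide; the child elements "title" and "variable" are placeholders invented for this illustration and are not the element names defined by the DDI DTD, so the output is not a valid DDI instance.

```python
import xml.etree.ElementTree as ET

# The five sections named on the slide, in the order listed there.
SECTIONS = ["docDscr", "stdyDscr", "fileDscr", "dataDscr", "othMat"]


def build_codebook_skeleton(study_title, variable_names):
    """Build a codeBook element holding the five top-level sections."""
    codebook = ET.Element("codeBook")
    sections = {name: ET.SubElement(codebook, name) for name in SECTIONS}
    # Placeholder children (hypothetical names, not taken from the DDI DTD).
    ET.SubElement(sections["stdyDscr"], "title").text = study_title
    for name in variable_names:
        ET.SubElement(sections["dataDscr"], "variable", {"name": name})
    return codebook


if __name__ == "__main__":
    skeleton = build_codebook_skeleton("Example Survey 2008", ["age", "sex"])
    print(ET.tostring(skeleton, encoding="unicode"))
```

Everything about the study sits inside one document, which is the single-instance, codebook-centric view that DDI 3.0 later moves away from.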

18
Characteristics of DDI 1.0/2.0
  • Focuses on the static object of a codebook
  • Designed for limited uses
  • End user data discovery via the variable or high
    level study identification (bibliographic)
  • Only heavily structured content relates to
    information used to drive statistical analysis
  • Coverage is focused on single study, single data
    file, simple survey and aggregate data files
  • Variable contains majority of information
    (question, categories, data typing, physical
    storage information, statistics)

19
Impact of these limitations
  • Treated as an add on to the data collection
    process
  • Focus is on the data end product and end users
    (static)
  • Limited tools for creation or exploitation
  • The Variable must exist before metadata can be
    created
  • Producers hesitant to take up DDI creation
    because it is a cost and does not support their
    development or collection process

20
DDI 1/2.x Tools
  • Nesstar
  • Nesstar Publisher, Nesstar Server
  • IHSN
  • Microdata Management Toolkit
  • NADA (online catalog for national data archive)
  • Archivist / Reviewer Guidelines
  • Other tools
  • SDA, Harvard/MIT Virtual Data Center (Dataverse)
  • UKDA DExT, ODaF DeXtris
  • http://tools.ddialliance.org

21
DDI 2.0 perspective
[Figure: seven separate boxes, each labeled "DDI 2 Survey"]
22
DDI 3.0
  • The life cycle

23
When to capture metadata?
  • Metadata must be captured at the time the event
    occurs!
  • Documenting after the fact leads to considerable
    loss of information
  • Multiple contributors are typically involved in
    this process (not only the archivist)
  • This is true for producers and researchers

24
DDI 3.0 and the Survey Life Cycle
  • A survey is not a static process: it dynamically
    evolves across time and involves many
    agencies/individuals
  • DDI 2.x is about archiving; DDI 3.0 covers the
    entire life cycle
  • 3.0 focuses on metadata reuse (minimizes
    redundancies/discrepancies, supports comparison)
  • Also supports multilingual, grouping, geography,
    and others
  • 3.0 is extensible

25
Requirements for 3.0
  • Improve and expand the machine-actionable aspects
    of the DDI to support programming and software
    systems
  • Support CAI instruments through expanded
    description of the questionnaire (content and
    question flow)
  • Support the description of data series
    (longitudinal surveys, panel studies, recurring
    waves, etc.)
  • Support comparison, in particular comparison by
    design but also comparison after the fact
    (harmonization)
  • Improve support for describing complex data files
    (record and file linkages)
  • Provide improved support for geographic content
    to facilitate linking to geographic files (shape
    files, boundary files, etc.)

26
Approach
  • Shift from the codebook centric model of early
    versions of DDI to a lifecycle model, providing
    metadata support from data study conception
    through analysis and repurposing of data
  • Shift from an XML Document Type Definition (DTD) to
    an XML Schema model to support the lifecycle
    model, reuse of content and increased controls to
    support programming needs
  • Redefine a single DDI instance to include a
    simple instance similar to DDI 1/2 which
    covered a single study and complex instances
    covering groups of related studies. Allow a
    single study description to contain multiple data
    products (for example, a microdata file and
    aggregate products created from the same data
    collection).
  • Incorporate the requested functionality in the
    first published edition

27
Designing to support registries
  • Resource package
  • structure to publish non-study-specific materials
    for reuse
  • Extracting specified types of information into
    schemes
  • Universe, Concept, Category, Code, Question,
    Instrument, Variable, etc.
  • Allowing for either internal or external
    references
  • Can include other schemes by reference and select
    only desired items
  • Providing Comparison Mapping
  • Target can be external harmonized structure

28
Technical Overview
  • DDI 3 is composed of several schemas
  • Use only what you need!
  • Schemas represent modules, sub-modules
    (substitutions), reusable, external schemas
  • archive
  • comparative
  • conceptualcomponent
  • datacollection
  • dataset
  • dcelements
  • DDIprofile
  • ddi-xhtml11
  • ddi-xhtml11-model-1
  • ddi-xhtml11-modules-1
  • group
  • inline_ncube_recordlayout
  • instance
  • logicalproduct
  • ncube_recordlayout
  • physicaldataproduct
  • physicalinstance
  • proprietary_record_layout (beta)
  • reusable
  • simpledc20021212
  • studyunit
  • tabular_ncube_recordlayout
  • xml
  • set of xml schemas to support xhtml

29
Technical Overview
  • Any element that can be referenced is globally
    uniquely identified
  • Maintainable (by an agency)
  • Versionable (can change across time)
  • Identifiable (within a maintainable scheme)
  • Modules
  • Reflect closely related sets of information
    similar to the sections of the DDI 1/2 DTD
  • Modules can be held as separate XML instances and
    be included in a large instance by either
    inclusion or reference
  • All modules are maintainable (but not all
    maintainables are modules)
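One way to picture "maintainable, versionable, identifiable" is as a composite reference: an agency maintains a scheme, an item is identified within that scheme, and a version records change over time. The class, field names and colon-joined key below are hypothetical and are not the identifier syntax defined by DDI 3.0.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ItemReference:
    """Hypothetical reference to a reusable metadata item."""

    agency: str     # maintainable: the agency responsible for the scheme
    scheme_id: str  # the maintainable scheme that contains the item
    item_id: str    # identifiable: unique within the scheme
    version: str    # versionable: changes across time

    def key(self):
        # Illustrative globally unique key; NOT the DDI URN format.
        return ":".join([self.agency, self.scheme_id, self.item_id, self.version])


def revise(reference, new_version):
    """Return a reference to a new version of the same item."""
    return replace(reference, version=new_version)


if __name__ == "__main__":
    question = ItemReference("example.org", "QuestionScheme1", "Q1", "1.0")
    print(question.key())
    print(revise(question, "1.1").key())
```

Keeping the version inside the reference lets one study keep pointing at the question as it was asked, while a later wave points at the revised wording.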

30
Technical Overview Maintainable Schemes
(that's with an "e", not an "a")
  • Category Scheme
  • Code Scheme
  • Concept Scheme
  • Control Construct Scheme
  • GeographicStructureScheme
  • GeographicLocationScheme
  • InterviewerInstructionScheme
  • Question Scheme
  • NCubeScheme
  • Organization Scheme
  • Physical Structure Scheme
  • Record Layout Scheme
  • Universe Scheme
  • Variable Scheme
  • Packages of reusable metadata maintained by a
    single agency

31
DDI 3.0 Use Cases
  • Study design/survey instrumentation
  • Questionnaire generation/data collection and
    processing
  • Data recoding, aggregation and other processing
  • Data dissemination/discovery
  • Archival ingestion/metadata value-add
  • Question/concept/variable banks
  • DDI for use within a research project
  • Capture of metadata regarding data use
  • Metadata mining for comparison, etc.
  • Generating instruction packages/presentations

32
Study Design/Survey Instrumentation
  • This use case concerns how DDI 3.0 can support
    the design of studies and survey instrumentation
  • Without benefit of a question or concept bank

33
  • Types of Metadata
  • Concepts (conceptual module)
  • Universe (conceptual module)
  • Questions (datacollection module)
  • Flow Logic (datacollection module)

<DDI 3.0> Concepts, Universes
<DDI 3.0> Concepts, Universes
Final
Drafting / Review / Revision

<DDI 3.0> Questions, Flow Logic
<DDI 3.0> Concepts, Universes, Questions, Flow Logic
As the survey instrument is tested, all
revisions and history can be tracked and
preserved. This would include question
translation and internationalization.
Final
Drafting / Testing / Revision
34
Questionnaire Generation, Data Collection, and
Processing
  • This use case concerns how DDI 3.0 can support
    the creation of various types of
    questionnaires/CAI, and the collection and
    processing of raw data into microdata.

35
  • Types of Metadata
  • Concepts (conceptual module)
  • Universe (conceptual module)
  • Questions (datacollection module)
  • Flow Logic (datacollection module)
  • Variables (logicalproduct module)
  • Categories/Codes (logicalproduct module)
  • Coding (datacollection module)

Paper Questionnaire
<DDI 3.0> Concepts, Universes, Questions, Flow Logic
Online Survey Instrument
CAI Instrument
Final
Raw Data
Microdata
DDI captures the content; XML allows each
application to do its own presentation
<DDI 3.0> Concepts, Universes, Questions, Flow Logic
<DDI 3.0> Variables, Coding
<DDI 3.0> Categories, Codes, Physical Data
Product, Physical Data Instance


36
Data Recoding, Aggregation, etc.
  • This use case concerns how DDI 3.0 can describe
    recodes, aggregation, and similar types of data
    processing.

37
  • Initial microdata has
  • Concepts (conceptual module)
  • Universes (conceptual module)
  • Questions (datacollection module)
  • Flow Logic (datacollection module)
  • Variables (logicalproduct module)
  • Coding (datacollection module)
  • Categories (logicalproduct module)
  • Codes (logicalproduct module)
  • Physical Data Product
  • Physical Data Instance
  • Recode adds
  • More codings (datacollection module)
  • New variables
  • New categories
  • New codes
  • NCubes (for aggregation)

Could be a recode, an aggregation, or other
process.
Microdata/ Aggregates
Microdata
<DDI 3.0> Conceptual, Datacollection, Variables,
Categories, Codes
<DDI 3.0> Codings, Variables (new), Categories
(new), Codes (new), NCubes

38
Data Dissemination/Data Discovery
  • This use case concerns how DDI 3.0 can support
    the discovery and dissemination of data.

39
<DDI 3.0> Can add archival events metadata
Rich metadata supports auto-generation of
websites and other delivery formats
Codebooks
<DDI 3.0> Full metadata set
Websites

Databases, repositories
Research Data Centers
Microdata/ Aggregates
Data-Specific Info Access Systems
Registries Catalogues Question/Concept/ Variable
Banks
40
Archival Ingestion and Metadata Value-Add
  • This use case concerns how DDI 3.0 can support
    the ingest and migration functions of data
    archives and data libraries.

41
Supports automation of processing if good DDI
metadata is captured upstream
Provides a neutral format for data migration as
analysis packages are versioned
<DDI 3.0> Full metadata set (?)
Data Archive Data Library
Ingest Processing

Microdata/ Aggregates
<DDI 3.0> Full or additional metadata, archival
events
Provides good format foundation for
value-added metadata by archive
42
Question/Concept/Variable Banks
  • This use case describes how DDI 3.0 can support
    question, concept, and variable banks. These are
    often termed "registries" or "metadata
    repositories" because they contain only metadata;
    links to the data are optional, but provide
    implied comparability. The focus is metadata
    reuse.

43
Because DDI has links, each type of bank
functions in a modular, complementary way.
Question Bank
<DDI 3.0> Questions, Flow Logic, Codings
<DDI 3.0> Questions, Flow Logic, Codings
Users and Applications
Variable Bank
<DDI 3.0> Variables, Categories, Codes
<DDI 3.0> Variables, Categories, Codes
Users and Applications
<DDI 3.0> Concepts
<DDI 3.0> Concepts
Users and Applications
Concept Bank
Supports but does not require ISO 11179
44
DDI For Use within a Research Project
  • This use case concerns how DDI 3.0 can support
    various functions within a research project, from
    the conception of the study through collection
    and publication of the resulting data.

45
Principal Investigator
Research Staff
Collaborators
<DDI 3.0> Variables, Physical Stores
<DDI 3.0> Questions, Instrument

<DDI 3.0> Concepts, Universe, Methods, Purpose,
People/Orgs
<DDI 3.0> Funding, Revisions

<DDI 3.0> Data Collection, Data Processing

Data
Archive/ Repository
Submitted Proposal
Publication
Presentations

46
Capture of Metadata Regarding Data Use
  • This use case concerns how DDI 3.0 can capture
    information about how researchers use data, which
    can then be added to the overall metadata set
    about the data sources they have accessed.

47
  • Types of Metadata
  • Recodes (datacollection module)
  • Record subsets (physicalinstance module)
  • Variable subsets (logicalproduct module)
  • Comparison (comparative module)

Data Sets
<DDI 3.0> StudyUnit, DataCollection, LogicalProduct,
PhysicalDataProduct, PhysicalInstance
  • <DDI 3.0>
  • Recodes
  • Case Selection
  • Variable Selection
  • Comparison to original study
  • Resulting physical file descriptions

Data

Data Analysis
48
Metadata Mining for Comparison, etc.
  • This use case concerns how collections of DDI 3.0
    metadata can act as a resource to be explored,
    providing further insight into the comparability
    and other features of a collection of data.

49
  • Types of Metadata
  • Universe (comparative module)
  • Concept (comparative module)
  • Question (datacollection module)
  • Variable (logicalproduct module)

Questions
Variable
Concepts
Metadata Repositories/ Registries
Universe
<DDI 3.0> Instances
  • <DDI 3.0>
  • Comparison
  • Questions
  • Categories
  • Codes
  • Variables
  • Universe
  • Concepts
  • Recodes
  • Harmonizations

?
Data Sets
50
Generating Instruction Packages/Presentations
  • This use case concerns how DDI 3.0 can support
    automation around the instruction of students and
    others.

51
  • Types of Metadata
  • Individual studies (studyunit module)
  • Grouping purpose (group module)
  • Linking information (comparative module)
  • Processing assistance (group module)

<DDI 3.0> StudyUnit 1
<DDI 3.0> StudyUnit 2
<DDI 3.0> StudyUnit 1, StudyUnit 2, StudyUnit
3, StudyUnit 4, Comparative, OtherMaterials
<DDI 3.0> StudyUnit 3
<DDI 3.0> StudyUnit 4
<DDI 3.0> StudyUnit 1, StudyUnit 2, StudyUnit
3, StudyUnit 4
  • Topically related studies selected
  • Group is made with description of the intended
    use for the group
  • Comparative information is added indicating
    matching fields for linking and mapping between
    similar variables
  • Other materials such as SAS/SPSS recode command
    are referenced from the group

Instructional Package
52
DDI 3.0 Tools
  • Under development
  • DDI Foundation Tools Program
  • Road Map
  • XML Beans, validation,
  • DDI DExT, DDI2StatsProgs
  • Other tools
  • R SPSS Export, Algenta SurveyViz, others
    presented at IASSIST
  • DDI Editing Suite
  • Proposed as extension of DDI-FTP
  • Plan for generic editor in 6-9 months
  • DDI 3.0 related projects / initiatives
  • RDC Canada, Germany RDC / EURASI, DANS MIXED, NORC

53
DDI 3 Relationship to Other Standards
  • SDMX (from microdata to indicators / time series)
  • Completely mapping to and from DDI NCubes
  • Dublin Core (surveys and documents get cited)
  • Mapping of citation elements
  • Option for DC namespace basic entry
  • ISO 19115 Geography (microdata gets mapped)
  • Search requirements
  • Support for GIS users
  • METS
  • Designed to support profile development
  • OAIS (alignment of archiving standards)
  • Reference model for the archival lifecycle
  • ISO/IEC 11179 (metadata mining through concepts)
  • Variable linking representation to concept and
    universe
  • Optional data element construct in
    ConceptualComponent that allows for complete
    ISO/IEC 11179 structure as a maintained item

54
DDI 3.0 perspective
55
DDI 2.0 and DDI 3.0
56
DDI 2 / DDI 3
  • DDI 2
  • Single survey
  • Focus on the archive
  • Non-reusable metadata
  • Maintained by single agency
  • Loose validation
  • DTD based
  • Sparse documentation
  • Designed by archivists
  • Some tools are available
  • DDI 3
  • Multiple surveys
  • Focus on life cycle
  • Highly reusable metadata
  • Maintained by many agencies
  • Tight validation
  • Schema based
  • Extensive guide
  • Designed by expert groups
  • Tools are beginning to emerge

57
What 3.0 can do for you
  • Manage multi-surveys
  • Support multiple contributors
  • Support many different perspectives
  • Support many different use cases
  • Maintain metadata integrity across the life cycle
  • Connect to other metadata spaces
  • Metadata reuse
  • Publication in registries
  • Backward compatibility with 2.0

58
DDI Community
59
DDI Organizations/ Agencies
  • DDI Alliance (http://www.ddialliance.org)
  • Inter-university Consortium for Political and
    Social Research (ICPSR) (http://icpsr.umich.edu)
  • International Association for Social Science
    Information Service and Technology (IASSIST)
    (http://www.iassistdata.org)
  • International Household Survey Network (IHSN)
    (http://www.surveynetwork.org)
  • Open Data Foundation (ODaF)
    (http://www.opendatafoundation.org)
  • National Opinion Research Center Data Enclave
    (NORC) (http://dataenclave.norc.org)
  • Metadata Technology
    (http://www.metadatatechnology.com)

60
IZA Data Service Center DDI/SDMX Workshop
Wiesbaden, Germany, June 18th 2008
The Statistical Data and Metadata Exchange Standard
(SDMX): An Introduction
  • Arofan Gregory / Pascal Heus
  • agregory_at_opendatafoundation.org /
    pheus_at_opendatafoundation.org
  • Open Data Foundation

61
Overview of the Session
  • SDMX Background and Goals
  • SDMX and Data
  • SDMX and Metadata
  • SDMX and Best Practices: The Content-Oriented
    Guidelines
  • The SDMX Information Model
  • SDMX and Web Services
  • The SDMX Registry
  • SDMX Data Services
  • Tools and Resources

62
SDMX Background and Goals
63
What is SDMX?
  • The problem space
  • Statistical collection, processing, and exchange
    is time-consuming and resource-intensive
  • Focus on aggregate data (esp. time series)
  • Various international and national organisations
    have individual approaches for their
    constituencies
  • Uncertainties about how to proceed with new
    technologies (XML, web services, ...)

64
What is SDMX?
  • The Statistical Data and Metadata Exchange
    (SDMX) initiative is taking steps to address
    these challenges and opportunities that have just
    been mentioned
  • By focusing on business practices in the field
    of statistical information
  • By identifying more efficient processes for
    exchange and sharing of data and metadata using
    modern technology and open standards

65
Who is SDMX?
  • SDMX is an initiative made up of seven
    international organizations
  • Bank for International Settlements
  • European Central Bank
  • Eurostat
  • International Monetary Fund
  • Organisation for Economic Cooperation and
    Development
  • United Nations
  • World Bank
  • The initiative was launched in 2002

66

[Figure labels: www.x.org, www.y.org, www.z.org, www.hub.org; 180 Countries; Internet, Search, Navigation]
67
SDMX Products
  • Technical standards for the formatting and
    exchange of aggregate statistics
  • SDMX Technical Specifications version 1.0 (now
    ISO/TS 17369 SDMX TC 154 WG2)
  • SDMX Technical Specifications version 2.0 (soon
    to be submitted to ISO TC 154 WG2)
  • Content-Oriented Guidelines (in draft)
  • Common Metadata Vocabulary
  • Cross-Domain Statistical Concepts
  • Statistical Subject-Matter Domains

68
Major Features of SDMX
  • Structure and formats (XML, EDIFACT) for
    aggregate data
  • Structure and formats (XML) for metadata
  • Formal information model (UML) for managing
    statistical exchange and sourcing
  • Web-services guidelines and registry services
    specification for use of modern technologies
  • Content-oriented guidelines to recommend best
    practices

69
Recent Events
  • Jan 2007 Launch meeting at the World Bank for
    SDMX 2.0 Technical Specifications
  • February 2007: Endorsement of SDMX by the EU's
    Statistical Programme Committee
  • March 2008 SDMX becomes the preferred standard
    for data and metadata of the UN Statistical
    Commission
  • Other standards were mentioned: DDI and XBRL
    specifically

70
Adopters/Interest
  • The following are known adopters (or planning to
    adopt)
  • US Federal Reserve Board and Bank of New York
  • European Central Bank
  • Joint External Debt Hub (WB, IMF, OECD, BIS)
  • UN/TRADECOM at UN Statistical Division
  • NAAWE (National Accounts from OECD/Eurostat)
  • SODI (Eurostat and European Governments)
  • Mexican Federal System
  • Vietnamese Ministry of Planning and Investment
  • Qatar Information Exchange
  • IMF (BOP, SNA, SDDS/GDDS)
  • Food and Agriculture Organization
  • Millennium Development Goals (UN System, others)
  • International Labor Organization
  • Bank for International Settlements
  • OECD
  • World Bank
  • Macaronesia Islands (Spanish/Portuguese
    Statistical Region)
  • UNESCO (Education)

71
Rate of Adoption
  • Between January 2007 and January 2008, adoption
    has doubled
  • We anticipate a similar rate of growth for the
    coming year
  • Tools are becoming available
  • UNSC recommendation makes it a safe course to
    follow for risk-averse institutions
  • Training courses are in increasing demand
    (Eurostat, Metadata Technology)
  • Standard data and metadata structures for many
    domains are being developed

72
SDMX and Data
73
SDMX and Data Formats
  • SDMX provides a format for describing the
    structure of data (structural metadata)
  • EDIFACT (was GESMES/TS, now SDMX-EDI)
  • XML (SDMX-ML)
  • SDMX provides formats for transmission and
    processing of data
  • EDIFACT (1 message)
  • XML (4 different equivalent flavors for different
    functions)
  • Data is tabulated, aggregate data (e.g.,
    multi-dimensional/OLAP cubes)
  • Can be any aggregate data!
  • Most data formats are derived from the structural
    metadata (e.g., XML schemas are generated for each
    type of structure according to the business
    rules)

74
Data Set Structure
75
First Identify the Concepts
  • A statistical concept is a characteristic of a
    time series or an observation (MCV)
  • A concept is a unit of knowledge created by a
    unique combination of characteristics (SDMX
    Information Model)
  • Whatever the definition, statistical concepts are
    the DNA of the key family
  • Their usage (type, structure, sequence) define
    the structure of the data

76
Data Set Structure: Concepts
  • Computers need structure of data
  • Concepts
  • Code lists
  • Data values
  • How these fit together

77
Data Set Structure: Code Lists
Code Lists
Concepts
78
Data Makes Sense
Q,ZA,B,1,1999-06-30=16547
Quarterly, South Africa, Bank Loans, Stocks, for
30 June 1999
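The way concepts and code lists turn a terse record into something readable can be sketched as follows. The codes and their meanings (Q, ZA, B, 1) are taken from the slide; the concept names (FREQ, REF_AREA, TOPIC, STOCK_FLOW), the "=" separator before the observation value, and the function name are assumptions made for this illustration.

```python
# Hypothetical concept names, in the order the codes appear in the record.
CONCEPTS = ["FREQ", "REF_AREA", "TOPIC", "STOCK_FLOW"]

# One code list per concept; codes and labels are those shown on the slide.
CODE_LISTS = {
    "FREQ": {"Q": "Quarterly"},
    "REF_AREA": {"ZA": "South Africa"},
    "TOPIC": {"B": "Bank Loans"},
    "STOCK_FLOW": {"1": "Stocks"},
}


def decode(record):
    """Decode 'code,code,...,period=value' using the code lists above."""
    key_part, value = record.split("=", 1)  # '=' separator is an assumption
    *codes, period = key_part.split(",")
    if len(codes) != len(CONCEPTS):
        raise ValueError(f"expected {len(CONCEPTS)} codes, got {len(codes)}")
    decoded = {
        concept: CODE_LISTS[concept][code]
        for concept, code in zip(CONCEPTS, codes)
    }
    decoded["TIME_PERIOD"] = period
    decoded["OBS_VALUE"] = float(value)
    return decoded


if __name__ == "__main__":
    print(decode("Q,ZA,B,1,1999-06-30=16547"))
```

Without the concept order and the code lists, the record is just a string of letters and digits; the structure definition is what makes it data.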
79
Data Set Structure: Defining Multi-Dimensional
Structures
  • Comprises
  • Concepts that identify the observation value
    (Dimensions)
  • Concepts that add additional metadata about the
    observation value (Attributes)
  • Concept that is the observation value (Measure)
  • Any of these may be (Representation)
  • coded
  • text
  • date/time
  • number
  • etc.
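The three roles a concept can play (dimension, attribute, measure) can be sketched as a small structure definition that splits an observation into its identifying key, its value, and its extra metadata. The concept names and the function name below are hypothetical; SDMX itself expresses such definitions in SDMX-ML or SDMX-EDI rather than in Python.

```python
# Hypothetical structure definition: which concept plays which role.
STRUCTURE = {
    "dimensions": ["FREQ", "REF_AREA", "TIME_PERIOD"],  # identify the observation
    "attributes": ["UNIT", "OBS_STATUS"],               # add metadata about it
    "measure": "OBS_VALUE",                             # the observation value
}


def split_observation(observation, structure=STRUCTURE):
    """Split a flat observation dict into (key, value, attributes)."""
    required = structure["dimensions"] + [structure["measure"]]
    missing = [name for name in required if name not in observation]
    if missing:
        raise ValueError(f"observation is missing: {', '.join(missing)}")
    key = tuple(observation[name] for name in structure["dimensions"])
    value = observation[structure["measure"]]
    # Attributes are treated as optional in this sketch.
    attributes = {
        name: observation[name]
        for name in structure["attributes"]
        if name in observation
    }
    return key, value, attributes


if __name__ == "__main__":
    obs = {
        "FREQ": "Q",
        "REF_AREA": "ZA",
        "TIME_PERIOD": "1999-06-30",
        "OBS_VALUE": 16547,
        "OBS_STATUS": "A",
    }
    print(split_observation(obs))
```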
80
Data Set Structure: Concept Usage
[Figure: roles of the concepts, in the order shown: Dimension, Dimension, Attribute, Attribute, Dimension, Dimension, Dimension, Measure]
81
SDMX and Metadata
82
SDMX and Metadata
  • SDMX provides for several types of metadata
  • Structural (describes structures of data sets and
    metadata sets and related items)
  • Provisioning (describes the sourcing of data
    between departments and organizations)
  • Reference metadata: all other types of
    metadata (footnotes, methodology, quality, etc.
    Can be specified by the user!)
  • Reference metadata is the most important one: it
    is what we typically think of as metadata

83
SDMX Metadata Sets
  • Version 2.0 of the SDMX Technical Specifications
    provides XML formats for metadata sets (SDMX-ML)
  • To describe their structure
  • To exchange metadata in XML
  • This is based on concepts (similar to the data
    formats)
  • SDMX supports any metadata concepts the user
    wishes to report/exchange/process
  • May be flat lists or hierarchical
  • Definitions provided by users, but
    recommendations exist for many common concepts
  • Metadata sets are attached to a formal object in
    the information model (an organization, a data
    set, a codelist, etc.)

84
SDMX and Metadata
  • This is a very powerful feature of SDMX
  • It can be used to integrate/mimic other metadata
    standards!
  • Provides very good support for standard exchange
    of metadata which cannot be anticipated by the
    designers of systems/standards
  • Must be based on common agreements about the
    meaning of metadata concepts
  • Often, concepts are taken from other metadata
    models/standards such as DDI, Dublin Core, etc.

85
The SDMX Information Model
86
The SDMX Information Model
  • A formal, documented conceptual model of
    statistical exchange, management, and sourcing
  • Expressed as a UML model
  • Used as the basis of all SDMX implementation
  • XML
  • EDIFACT
  • Any other programming language/platform
  • Provides consistency between implementations
  • Based on analysis of many statistical processing
    systems
  • Describes existing business practices in a
    generic way

87
Information Model: High-Level Schematic
[Diagram: high-level schematic of the SDMX Information Model]
  • Objects shown: Structure Maps, Data or Metadata Structure
    Definition, Category Scheme, Category, Data or Metadata Flow,
    Provision Agreement, Data Provider, Data or Metadata Set,
    Registration of Data or Metadata Set
  • Relationship labels: structure and code list maps; comprises
    subject or reporting categories; can have child categories;
    uses specific data/metadata structure; can be linked to
    categories in multiple category schemes; conforms to business
    rules of the data/metadata flow; can get data/metadata from
    multiple data/metadata providers; can provide data/metadata
    for many data/metadata flows using agreed data/metadata
    structure; publishes/reports data/metadata sets; registers
    existence of data and metadata (URL, registration date, etc.)
88
SDMX and Best Practices: The Content-Oriented
Guidelines
89
SDMX Content-Oriented Guidelines
  • There is a long history of discussion about what
    is best practice in the collection of statistics
  • SDMX decided to define the technical basis for
    statistical exchange, and then engage in this
    debate
  • It makes reaching agreements between
    organizations easier!
  • These documents build on many years of work
    defining statistical concepts, terms, and
    classifications
  • Although described as statistical, much of what
    is here also applies to social science (and
    other) research

90
SDMX Content-Oriented Guidelines
  • Four main documents
  • Overview
  • Metadata Common Vocabulary (annex)
  • Cross-Domain Concepts (2 annexes)
  • Statistical Subject-Matter Domains (annex)
  • These will not become ISO specifications, but
    will evolve as publications of the SDMX
    Initiative
  • They are now available in their first official
    release at www.sdmx.org

91
Metadata Common Vocabulary
  • A set of terms and definitions for the different
    parts of the SDMX technical standards, and many
    common concepts used in data and metadata
    structures
  • Does not replace other major vocabularies in this
    space (such as the OECD glossary) but references
    these other works

92
Cross-Domain Concepts
  • Includes concepts which are common across many
    statistical domains
  • Names and definitions
  • Representations
  • Approximately 130 concepts, some with recommended
    representations (codelists)
  • These are concepts which support both data and
    metadata structures
  • Emphasis on quality frameworks for reference
    metadata concepts

93
Statistical Subject-Matter Domains
  • Based on the UN/ECE classification of statistical
    activities
  • Provides a classification system for use in
    exchanging statistics across domain boundaries
  • Provides a breakdown of the various domains
    within official statistics

94
SDMX and Web Services
95
Web-Services Components of SDMX
  • Web-Services Guidelines
  • Part of the Technical Specifications package
  • SDMX Query message
  • Part of SDMX-ML
  • SDMX Registry Services
  • Part of version 2.0 Technical Specifications
  • Interfaces are in SDMX-ML
  • Document describes implementation rules

96
Web Services Guidelines
  • Recommends use of WSS 1.1 for web services which
    use SOAP, WSDL
  • Provides standard function names for many typical
    web-services functions
  • Querying for data
  • Querying for metadata
  • Querying for structural information

97
SDMX Query Message
  • An XML Query to support two-way web-services
    calls using XML messages
  • Designed to support
  • Queries for structural information from online
    databases/repositories
  • Queries for data from online databases
  • Queries for metadata from online databases
  • Part of SDMX-ML
  • Very similar to the SQL query language supported
    by all database packages
  • Specific to SDMX objects
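The comparison with SQL can be made concrete with a conceptual sketch. This is not SDMX-ML query syntax: the observations are invented, and the query is simply a set of dimension constraints, much like a WHERE clause.

```python
# Invented observations, each described by dimension values.
OBSERVATIONS = [
    {"FREQ": "Q", "REF_AREA": "ZA", "TIME_PERIOD": "1999-Q2", "OBS_VALUE": 100},
    {"FREQ": "Q", "REF_AREA": "ZA", "TIME_PERIOD": "1999-Q3", "OBS_VALUE": 110},
    {"FREQ": "A", "REF_AREA": "ZA", "TIME_PERIOD": "1999", "OBS_VALUE": 420},
]


def run_query(where: dict) -> list:
    """Return observations whose dimensions match every constraint."""
    return [
        obs for obs in OBSERVATIONS
        if all(obs.get(dim) == value for dim, value in where.items())
    ]


matches = run_query({"FREQ": "Q", "REF_AREA": "ZA"})
print(len(matches))
```

The point of a standard query message is that the constraints are expressed in terms of SDMX objects (dimensions, codes, data flows) rather than the table and column names of any one database.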

98
SDMX Registry Services
  • A registry is a common type of technology
  • Every Windows machine has a Windows registry to
    let applications know what other applications are
    on that machine, and where they are located
  • Web services registries do the same thing on a
    network
  • Functions like a card catalogue in a print
    library: you can look up resources and find out
    how to obtain them
  • A registry provides a single place on the
    Internet where everyone can discover the data,
    metadata, and structures that other organizations
    use/publish
  • It does not contain the data and metadata; it
    just indexes them and links to them
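The card-catalogue idea can be sketched in a few lines of Python. This toy class is an assumption-laden illustration, not the SDMX Registry interface: the class, method names, provider names, and URLs are all hypothetical. What matters is that the registry stores only descriptions and locations, never the data.

```python
class Registry:
    """Toy registry: stores descriptions and URLs, never the data itself."""

    def __init__(self):
        self._entries = []  # one dict per registered resource

    def register(self, provider: str, dataflow: str, url: str) -> None:
        """Record that a provider offers data for a dataflow at a URL."""
        self._entries.append(
            {"provider": provider, "dataflow": dataflow, "url": url}
        )

    def query(self, dataflow: str) -> list:
        """Return the URLs where data for a dataflow can be obtained."""
        return [e["url"] for e in self._entries if e["dataflow"] == dataflow]


registry = Registry()
registry.register("PROVIDER_A", "EXTERNAL_DEBT", "https://example.org/a/debt.xml")
registry.register("PROVIDER_B", "EXTERNAL_DEBT", "https://example.org/b/debt.xml")
registry.register("PROVIDER_A", "TRADE", "https://example.org/a/trade.xml")

urls = registry.query("EXTERNAL_DEBT")
print(urls)
```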

99
SDMX Registry Services (cont.)
  • SDMX Registry Services are based on generic,
    standard web-services registry technology
  • ISO 15000 ebXML Registry/Repository
  • OASIS UDDI Registry (part of .NET, etc.)
  • SDMX Registry Services are not generic
  • They are specific to SDMX exchanges of data and
    metadata, etc.
  • There is not one central SDMX Registry
  • Each domain will have its own registry for its
    members
  • The registries can be linked (federated)

100
SDMX Registry/Repository
SDMX Registry Interfaces
  • REGISTRY (Data Set/Metadata Set): indexes data and
    metadata; interfaces: Register, Query
  • REPOSITORY (Provisioning Metadata): describes data and
    metadata sources and reporting processes; interfaces:
    Submit, Query
  • REPOSITORY (Structural Metadata): describes data and
    metadata structures; interfaces: Submit, Query
101
SDMX Registry/Repository
SDMX Registry Interfaces
  • REGISTRY (Data Set/Metadata Set): indexes data and
    metadata; interfaces: Register, Query
  • Subscription/Notification: applications can
    subscribe to notification of new or changed
    objects
  • REPOSITORY (Provisioning Metadata): interfaces:
    Submit, Query
  • REPOSITORY (Structural Metadata): describes data and
    metadata structures; interfaces: Submit, Query
102
The Old JEDH Site
[Diagram: BIS, IMF, OECD, and World Bank each supply data in various formats to the website, on a 3-month production cycle]
103
JEDH with SDMX
[Diagram: data flow for the JEDH site using SDMX]
  • BIS, IMF, OECD, and World Bank each provide SDMX-ML
  • Info about data is registered in the SDMX Registry
  • The SDMX Agent discovers data and URLs, and retrieves data
    from the sites
  • SDMX-ML is loaded into the JEDH DB
  • Data is provided in real time to the JEDH site
  • Other label shown: (Debtor database)
104
Recent and On-Going Developments
  • Many organizations using SDMX have been
    implementing web services
  • There is growing interest in forming a working
    group to further extend the specification for use
    with web-services technology
  • Standard error messages
  • Expanded function calls
  • Standard WSDLs
  • If you are interested in this, please tell me!

105
Tools and Resources
106
SDMX Tools
  • There are now several sources for SDMX tools
  • All are free or open-source
  • Eurostat: complete package of tools for data,
    metadata, and registry services
  • Metadata Technology Ltd: similar package of
    tools
  • Data editors are usually based on Excel
  • Some other tools
  • Open Data Foundation: SDMX Browser for data
    visualization
  • OECD, ECB, and UN/Statistical Division provide
    some other tools for specific applications
  • Integration with PC-Axis has been prototyped, to
    be available this summer
  • DevInfo has SDMX support
  • FAME is developing SDMX support
  • Commercial vendors provide good support through
    web-services functionality
  • E.g., Oracle 11, .NET, etc.

107
Resources
  • The SDMX Initiative Site: http://www.sdmx.org
  • The SDMX Toolkit and Forums
  • http://www.metadatatechnology.com
  • Various papers and (soon) open-source tools
  • http://www.opendatafoundation.org

108
IZA Data Service Center DDI/SDMX Workshop
Wiesbaden, Germany, June 18th 2008
SDMX, DDI, and Other Standards
  • Arofan Gregory / Pascal Heus
  • agregory_at_opendatafoundation.org /
    pheus_at_opendatafoundation.org
  • Open Data Foundation

109
Overview of the Session
  • DDI/SDMX Philosophy and Timing of Standards
    Development
  • DDI/SDMX Points of Functional Overlap
  • DDI/SDMX Direct Mappings
  • DDI/SDMX Integration Approaches
  • Other Related Standards and On-Going Work

110
DDI/SDMX Philosophy and Timing of Development
111
Development Philosophies/Timing
  • Unlike many standards bodies, both the SDMX
    Initiative and the DDI Alliance have attempted to
    create standards which do not duplicate existing
    efforts
  • There is an awareness that users need to deal
    with several different standards
  • DDI (3.0) and SDMX were both intentionally
    aligned with other, related standards
  • DDI 1.x/2.x existed before SDMX
  • It was largely self-contained
  • SDMX was created before DDI 3.0 existed
  • Created with an awareness of DDI 1.x/2.x
  • DDI 3.0 benefited from having SDMX as a published
    specification
  • Actively aligned with SDMX and many other
    standards

112
SDMX Design
  • SDMX was intentionally designed to accommodate
    integration of standards which are used with the
    inputs to aggregate data
  • This included DDI and XBRL
  • Mechanism for integration is generic
  • The key point for this integration is the SDMX
    Registry
  • It provides links between aggregate (SDMX) data
    sets, and also to source data and metadata

113
DDI/SDMX Points of Functional Overlap
114
SDMX and DDI as Complementary
  • DDI is designed to document micro-data
  • 1.x/2.x versions were archival, after-the-fact
    documentation
  • 3.0 version covers the entire life cycle, but still
    has an after-the-fact function
  • SDMX is designed as a standard for processing and
    automation
  • It is not documentary, but is aimed at automation
    of statistical systems and exchanges
  • These purposes are related, but not duplicative
  • SDMX and DDI can both do useful things within a
    single system

115
Examples
  • DDI could be used to document SDMX-based
    aggregates more completely for archival purposes
  • DDI could be used to document the micro-data on
    which aggregates are based
  • As soon as tabulation occurs, SDMX can be used to
    describe and format the data
  • SDMX can describe micro-data, but it is not very
    useful for that purpose
  • DDI can be used to automate processing of
    multi-dimensional data cubes, but it is more
    difficult than with SDMX
  • SDMX can be used to link DDI instances with other
    types of standard data and metadata (including
    both SDMX and DDI)

116
DDI and SDMX
  • SDMX: aggregated data; indicators, time series;
    across time; across geography; open access; easy to use
  • DDI: microdata; low-level observations; single time
    period; single geography; controlled access; expert
    audience
  • Microdata is an important source of
    aggregated data
  • Crucial overlaps and mappings exist between both
    worlds (but are commonly undocumented)
  • Interoperability provides users with a full
    picture of the production process

117
Generic Process Example
[Diagram: generic production process, with DDI and SDMX marked against the stages they cover]
  • Stages: Survey/Register; Raw Data Set; Micro-Data Set/
    Public Use Files; Aggregate Data Set (Lower Level);
    Aggregate Data Set (Higher Level); Indicators
  • Processing steps between stages: anonymization, cleaning,
    recoding, etc.; tabulation, processing, case selection,
    etc.; aggregation, harmonization (shown twice)
118
DDI + SDMX?
  • When you have data which has been
    tabulated/aggregated, it may be useful to have
    both SDMX and DDI
  • SDMX for processing and exchanging the data
  • DDI for documenting these processes, in case they
    are of interest to researchers
  • DDI has a much richer descriptive capability for
    addressing the exact processes used in
    statistical packages
  • SDMX is easier to process

119
DDI/SDMX Direct Mappings
120
Direct Mappings: DDI and SDMX
  • IDs and referencing use the same approach
    (identifiable, versionable, maintainable;
    structured URN syntax)
  • Both are organized around schemes
  • Reusable packages of data, similar to relational
    tables in databases
  • Both describe multi-dimensional data
  • A clean cube in DDI maps directly to/from SDMX
  • Both have concepts and codelists
  • DDI has much less emphasis on concepts
  • SDMX emphasizes concepts because they are needed
    for comparison
  • Both contain mappings (comparison) for codes
    and concepts

121
Formal Mapping
  • There is on-going work to describe a formal
    mapping between SDMX and DDI
  • It will cover these direct correspondences
  • They are quite obvious: a code maps to a code, a
    concept to a concept, etc.
  • There are currently no tools, because generic
    tools such as XSLT will work for this
    transformation
  • Drafts of this work are expected this summer, as
    part of the SDMX submission to ISO for the
    version 2.0 Technical Specifications
  • The direct mappings are the easy part!
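To show how direct a code-to-code mapping is, here is a minimal Python sketch. Both structures are heavily simplified, hypothetical stand-ins (real DDI and SDMX-ML are XML, and a real transformation might well be written in XSLT, as noted above); the "CL_" prefix is a naming convention assumed for the example.

```python
# Hypothetical, heavily simplified stand-in for a DDI code scheme.
ddi_code_scheme = {
    "id": "SEX",
    "codes": [
        {"value": "1", "category": "Male"},
        {"value": "2", "category": "Female"},
    ],
}


def to_sdmx_codelist(scheme: dict) -> dict:
    """Map each DDI code/category pair to an SDMX-style code/description."""
    return {
        "id": "CL_" + scheme["id"],  # assumed naming convention
        "codes": {c["value"]: c["category"] for c in scheme["codes"]},
    }


codelist = to_sdmx_codelist(ddi_code_scheme)
print(codelist)
```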

122
Issues with Direct Mapping
  • It is possible to describe everything in the DDI
    as an SDMX Metadata Set
  • This is probably not the best way to use SDMX
    with DDI!
  • It is usually better to select the important
    fields, and keep the rest in native DDI format
  • When you map from DDI to SDMX, you typically will
    not carry much of the descriptive metadata,
    question text, etc.
  • Mostly structural (codelists, dimensions,
    attributes, concepts)
  • You must have concepts for SDMX which are not
    always present in DDI
  • Going from SDMX to DDI, it is not always possible
    to map all the data
  • Especially for SDMX Metadata Sets, which may have
    user-configured concepts that don't always exist
    in DDI
  • Note that SDMX-DDI mappings refer to all versions
    of DDI

123
DDI/SDMX Integration Approaches
124
Integration Use Cases
  • The most important aspect of DDI/SDMX
    integration is understanding what the use cases
    are
  • This defines what mapping/transformation is
    needed
  • It also defines what links need to be stored
    between data and metadata files
  • There are some common use cases
  • DDI used to describe and link microdata inputs to
    SDMX aggregates
  • DDI used to more fully document SDMX aggregates
    for dissemination to users
  • Using the SDMX Registry as a lifecycle management
    tool for DDI, SDMX, etc.

125
Linking Source Data and Aggregates
  • DDI provides a wealth of information about the
    micro-data which serves as an input to SDMX
    aggregates
  • It is possible to capture these links in SDMX, at
    the cell level or higher, to provide automated
    access to source data
  • An SDMX registry can be used to provide easy
    access to these links
  • The user/collector of aggregate data can access
    the rich DDI metadata, and possibly the data (if
    they have access rights)
  • It is possible to automatically generate SDMX
    output from the DDI metadata describing
    tabulation of micro-data
  • This may not be useful if the desired SDMX target
    is a standard cube structure described by another
    organization
  • It may make transformation to the standard cube
    easier, however
  • The SDMX Registry provides a good tool for
    managing links
  • Links between SDMX and DDI files are stored as
    Metadata Reports
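Linking "at the cell level or higher" can be sketched as a lookup that falls back from the most specific level to the most general. Everything here is hypothetical: the key layout, the data set and series identifiers, and the URLs are invented, and a real implementation would store the links as Metadata Reports in a registry.

```python
# Hypothetical link store keyed by (data set, series key, period).
# None means the link applies to everything at that level.
LINKS = {
    ("DEBT", None, None): "https://example.org/ddi/study.xml",
    ("DEBT", "Q.ZA.B.1", None): "https://example.org/ddi/za-survey.xml",
    ("DEBT", "Q.ZA.B.1", "1999-Q2"): "https://example.org/ddi/za-1999q2.xml",
}


def source_for(dataset, series, period):
    """Return the most specific DDI link: cell, then series, then data set."""
    candidates = (
        (dataset, series, period),
        (dataset, series, None),
        (dataset, None, None),
    )
    for key in candidates:
        if key in LINKS:
            return LINKS[key]
    return None


print(source_for("DEBT", "Q.ZA.B.1", "1999-Q2"))
```

A user looking at a single cell gets the documentation closest to that cell, while cells without their own link still resolve to series-level or study-level documentation.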

126
Demo: SDMX-DDI Source Links
127
DDI + SDMX for Dissemination
  • Typically, the full DDI documentation is not
    provided on web-sites which publish
    aggregates/indicators
  • SDMX is becoming a popular dissemination format
    for these data
  • It has been shown to increase the use of data on
    the Web
  • If the DDI documentation is available, this could
    also be delivered as additional documentation
  • Especially useful at study level
  • Links could be directly embedded in SDMX data
    files as attributes or stored in an SDMX
    Registry, or both

128
The SDMX Registry for Lifecycle Management
  • The SDMX Registry provides a tool for tracking
    the sources of data for aggregates
  • It can also track the transformation of versions
    of DDI as the data moves through the lifecycle
  • There is an SDMX model for processes
  • This can be used to describe the DDI lifecycle
    model
  • SDMX Metadata Reports can be used to link DDI
    metadata to specific stages of the DDI lifecycle,
    and to each other
  • Applications could query the SDMX Registry to
    discover all of the DDI metadata produced
    upstream, as micro-data is collected and processed
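The upstream discovery described in the last point amounts to walking a chain of links between metadata files. A minimal sketch follows, assuming the links have already been retrieved from a registry into a plain dictionary; the file names and the "derived from" relation are invented for the example.

```python
# Hypothetical "derived from" links between DDI instances, one per stage.
DERIVED_FROM = {
    "indicators.xml": ["aggregate.xml"],
    "aggregate.xml": ["public-use.xml"],
    "public-use.xml": ["raw.xml"],
    "raw.xml": [],
}


def upstream(instance: str) -> list:
    """Collect every instance produced upstream, nearest first."""
    found = []
    queue = list(DERIVED_FROM.get(instance, []))
    while queue:
        current = queue.pop(0)
        if current not in found:
            found.append(current)
            queue.extend(DERIVED_FROM.get(current, []))
    return found


print(upstream("indicators.xml"))
```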

129
Demos
  • SDMX Metadata Report used to express DDI metadata
  • SDMX Metadata Report used to link DDI instances

130
Other Related Standards and On-Going Work
131
Many Related Standards
  • DDI
  • SDMX
  • ISO/IEC 11179: concept management and semantic
    modelling
  • ISO 19115: geographical metadata
  • METS: packaging/archiving of digital objects
  • PREMIS: archival lifecycle metadata
  • XBRL: business reporting
  • Dublin Core: citation metadata
  • Standard mappings are being defined by people
    from many different organizations (see
    presentation from METIS 2008 in Luxembourg)

132
ISO/IEC 11179
  • ISO/IEC 11179 is used to describe the meanings
    and representations of terms and concepts
  • Both SDMX and DDI are aligned with ISO/IEC 11179
  • SDMX and DDI concepts can be defined using the
    ISO/IEC 11179 attributes
  • Codelists and categories can be directly mapped
    (and other representations)
  • ISO/IEC 11179 can be implemented with DDI
    (directly, for concepts) and/or with SDMX (as a
    Metadata Report)
  • ISO/IEC 11179 has no standard expression in XML;
    it is just a model

133
ISO 19115: Geographical Metadata
  • ISO 19115 describes geographies (bounding boxes
    for countries, etc.)
  • DDI uses the ISO 19115 model in its own XML
  • It does not use the standard ISO 19115 XML
    format, but there is a 1-to-1 mapping
  • SDMX could model ISO 19115 if desired
  • Linking to DDI or ISO 19115 XML is probably more
    useful, using the standard SDMX mechanism
  • Most geographies in SDMX aggregate data sets are
    coded, not directly described

134
METS
  • METS is used to package a set of files which work
    together as a digital object
  • Both DDI and SDMX metadata could be placed inside
    a METS wrapper
  • They would be metadata sections
  • The primary use case would be for archiving of a
    set of related data and metadata files, possibly
    with other related materials such as research
    publications

135
PREMIS
  • PREMIS allows for the capture of administrative
    metadata as a collection is placed and managed
    within the archive
  • DDI and SDMX files would be treated like any
    other files forming part of the collection
  • Both may contain metadata which can be extracted
    and used to populate PREMIS instances (access
    levels, confidentiality, etc.)

136
XBRL
  • XBRL is used by business to report required
    information to national supervisory bodies
  • This includes banking supervision and other
    economic data
  • XBRL is a source format for some aggregate
    statistics
  • XBRL International and the SDMX Sponsors are
    working together to define a cross-walk between
    the two standards

137
Dublin Core
  • Dublin Core is used to capture citation-type
    metadata for resources on the Internet and
    els