Title: Inter-university Consortium for Political and Social Research (ICPSR)
1DDI Across the Life Cycle: One Data Model, Many Products
- Inter-university Consortium for Political and Social Research (ICPSR)
- Survey Research Operations (SRO)
IASSIST Meeting, Tampere, Finland, May 29, 2009
2Presenters
- Mary Vardigan, Assistant Director, ICPSR
- Sue Ellen Hansen, Director, SRO Technical Systems Group
- Peter Granda, Archivist, ICPSR
- Sanda Ionescu, Documentation Specialist, ICPSR
- Felicia LeClere, Associate Research Scientist, ICPSR
3The Collaborators
- Both are units of the Institute for Social Research, University of Michigan
- ICPSR is a large social science data archive
- SRO is a data collection center
4Past Collaborations
- Working together on the National Survey of Family Growth, sponsored by NCHS, to create data and an interactive codebook
- Partnered on the Collaborative Psychiatric Epidemiology Surveys, sponsored by NIMH
- This involved a harmonization of three datasets and interactive documentation featuring question comparison and five languages
www.icpsr.umich.edu/CPES
6Rationale for Collaboration
- We share a need for rich, high-quality metadata
- We want to comply with metadata standards, in particular the Data Documentation Initiative (DDI)
- DDI 3 enables a life cycle perspective
- We need to pass data easily from SRO to ICPSR without information loss
7SRO-ICPSR Joint Project
- Shared DDI-compliant data model and database design for survey metadata
- Challenges
- Different computing platforms
- Different end products
- Different staff orientations
9Products and Benefits
- SRO
- Tools to enhance MQDS, which produces XML documentation from Blaise instruments
- Tool to permit external users to add metadata for NSFG
- ICPSR
- Variable-level database that permits users to search across the ICPSR collection, compare variables, and create new datasets and questionnaires
- Internal variable search for harmonization
10Data Life Cycle Coverage
11Michigan Questionnaire Documentation System (MQDS)
- Sue Ellen Hansen
- Nicole Kirgis
12What Does MQDS Do?
- Facilitates automated documentation and harmonization of Blaise survey instruments and datasets
- Extracts survey question metadata
- Standardized format
13Survey Question Metadata
- Question universe
- Variable name and label
- Question text
- Question variable text (fills)
- Data type
- Code values and code text
- Skip instructions
- etc.
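As a rough illustration of how the items listed above might sit together in one record, here is a minimal Python sketch. The class name, field names, and example values are all hypothetical; they are not taken from MQDS, Blaise, or the DDI schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class QuestionMetadata:
    """Hypothetical container for one survey question's metadata."""
    variable_name: str
    variable_label: str
    question_text: str
    data_type: str
    universe: Optional[str] = None                        # who is asked the question
    fills: List[str] = field(default_factory=list)        # question variable text
    codes: Dict[int, str] = field(default_factory=dict)   # code value -> code text
    skip_instructions: Optional[str] = None


# Invented example values, for illustration only.
example = QuestionMetadata(
    variable_name="EVERMARR",
    variable_label="Ever married",
    question_text="Have you ever been married?",
    data_type="enumerated",
    universe="All respondents aged 18 or older",
    codes={1: "Yes", 5: "No"},
    skip_instructions="If No, skip the marriage history section",
)
```

Keeping the code values and code text in one mapping makes it easy to emit either a codebook entry or a value-label statement from the same record.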
14Data Documentation Initiative (DDI)
- Standard specification for technical documentation of social science data
- eXtensible Markup Language (XML)
- Widely used
- Facilitates sharing of data
- Initial focus on standard dataset codebook
- Ongoing development
http://www.ddialliance.org/
15MQDS Version 1
- Extracted metadata from Blaise data model as XML tagged data
- Provided user interface for selection of
- Blaise files
- Instrument questions and sections
- Types of metadata to extract
- Languages to display
- Style sheet for generation of instrument
documentation or codebook
16Using MQDS V1 XML Codebook in Five Languages
National Latino and Asian American Study
www.icpsr.umich.edu/CPES
17MQDS Version 1
- Limitations
- XML not DDI-compliant
- DDI Version 2 did not have XML tags for all metadata provided by Blaise
- Did not provide easy means of adding XML tags without becoming noncompliant
- XML files for complex surveys can be very large (text files)
- Entire files had to be processed in computer memory
- Limited ability to fully automate documentation
18DDI Version 3
- Released April 2008
- Focus on the complete data life cycle, going beyond the codebook
19DDI Version 3
- Included extensions proposed by DDI working group
on instrument design
Persistent Content of Question
- Question text: static; dynamic or variable
- Multiple-part question
- Response domain: open; set categories; special types (date, time, etc.)
- Definitional text
Use of Question in Instrument
- Order and routing: sequence / skip patterns; loops
- Universe
- Analysis unit
- Instructions
20MQDS Version 3
- Joint SRC and ICPSR venture
- Goals
- Address version 2 limitations
- Process Blaise instrument of any size
- Exploit new elements and validate to the recently released DDI version 3 standard
- Move from processing XML metadata in memory to streaming metadata to a relational database
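The last goal, streaming metadata into a relational database rather than holding a whole XML file in memory, can be sketched roughly as follows. This is only an illustration under stated assumptions: the element names `variable`, `name`, and `label` are placeholders rather than DDI 3 element names, the one-table schema is invented, and SQLite stands in for whatever database the project actually used.

```python
import io
import sqlite3
import xml.etree.ElementTree as ET


def stream_variables_to_db(xml_source, conn):
    """Insert one row per <variable> element without building the full tree."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS variable (name TEXT PRIMARY KEY, label TEXT)"
    )
    count = 0
    # iterparse yields each element as soon as its closing tag has been read.
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag == "variable":  # placeholder tag, not a DDI 3 element name
            conn.execute(
                "INSERT INTO variable (name, label) VALUES (?, ?)",
                (elem.findtext("name"), elem.findtext("label")),
            )
            count += 1
            elem.clear()  # drop the element's children and text once stored
    conn.commit()
    return count


# Tiny invented sample document.
SAMPLE = b"""<instrument>
  <variable><name>AGE</name><label>Age at interview</label></variable>
  <variable><name>SEX</name><label>Sex of respondent</label></variable>
</instrument>"""

conn = sqlite3.connect(":memory:")
loaded = stream_variables_to_db(io.BytesIO(SAMPLE), conn)
```

Because each element is written to the database and then cleared, memory use depends on the size of one variable's metadata rather than on the size of the whole instrument file.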
21MQDS Version 3: Relational Database Import, Export, Transform
XML (DDI 3)
User specifies output files (location, language/locale, XML output options, etc.)
Questionnaire
Codebook
User specifies stylesheet selection criteria, type of output desired (HTML, RTF, PDF), etc.
22MQDS Version 3
- Relational database
- DDI compliant standardized tables
- Flexibility for SRC and ICPSR to add extensions that meet their specific organizational needs
- Allows
- Automated documentation of any Blaise survey instrument
- Importing and documenting data produced by other software
- Lower cost development of other tools that facilitate editing and disseminating data
23MQDS V3 Prototype Exporting Language XML
24MQDS Development
- Expect to release Summer 2009
- Working out a distribution plan for Blaise users
25Data Life Cycle Coverage
26Applications: Customized Editing Tool
27MQDS Version 3
- Relational database
- DDI compliant standardized tables
- Flexibility for SRC and ICPSR to add extensions that meet their specific organizational needs
- Allows development of new tools to deal with the practical problems involved in transforming data and documentation derived from Blaise instruments into public-use products
28Features of the Tool
- Loads MQDS output into database tables
- Web interface to permit quick viewing
- Application that permits both internal and external clients to access and edit variable-level information
- Ability to include disposition codes to designate which variables to include in public-use files
- Maintain permanent record of decisions made throughout the editing process
33SELECT VARIABLE TO EDIT FROM DATABASE POPULATED
WITH METADATA FROM MQDS WITH POSSIBLE REVISIONS
FROM SUBSEQUENT DATA PROCESSING STEPS
Variable Name
Variable Label
Universe Statements
Value Labels
Question Text
List of Standard Formats
- VARIABLE DISPOSITION
- Place in public-use file
- Place in restricted-use file
- Leave in original file created by the data
producer
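One way to picture the disposition codes together with a permanent record of editing decisions is the short Python sketch below. The enum, function names, variable names, and log layout are hypothetical and do not describe the actual editing tool.

```python
from enum import Enum


class Disposition(Enum):
    PUBLIC_USE = "Place in public-use file"
    RESTRICTED_USE = "Place in restricted-use file"
    ORIGINAL_ONLY = "Leave in original file created by the data producer"


def record_decision(dispositions, audit_log, variable, disposition, editor):
    """Set a variable's current disposition and append the decision to the log."""
    dispositions[variable] = disposition
    # The log is append-only, so earlier decisions are never overwritten.
    audit_log.append(
        {"variable": variable, "disposition": disposition.name, "editor": editor}
    )


def public_use_variables(dispositions):
    """Names of the variables currently marked for the public-use file."""
    return sorted(v for v, d in dispositions.items() if d is Disposition.PUBLIC_USE)


dispositions, audit_log = {}, []
record_decision(dispositions, audit_log, "AGE", Disposition.PUBLIC_USE, "editor1")
record_decision(dispositions, audit_log, "ZIPCODE", Disposition.RESTRICTED_USE, "editor1")
record_decision(dispositions, audit_log, "ZIPCODE", Disposition.ORIGINAL_ONLY, "editor2")
```

Here `dispositions` holds only the latest choice per variable, while `audit_log` keeps every decision in the order it was made.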
34Data Life Cycle Coverage
35Social Science Variables Database: The Public Search
36SSVD The Public Search
- ICPSR variables search
- Internal (staff, other authorized users)
- External (public)
37SSVD The Public Search
- Enables ICPSR users to search variables across datasets
- Assists in data discovery, comparison, harvesting, and analysis
- Useful in question mining for designing new research
38SSVD The Public Search
- Concept first tested in a pilot project completed in 2005
- Good functionality
- Demonstrated benefits of using DDI markup: easy import; complex, granular searches; user-friendly display
- Limited number of data sets (69 ICPSR studies included)
39SSVD The Public Search
- Expand the project to ultimately include most of ICPSR's holdings
- Generate DDI documentation for most ICPSR studies
- Need for automated production
- Build a solid, state-of-the-art, DDI-compliant database
- Handle large number of files
- Support multiple applications
40SSVD The Public Search
- The Hermes batch processing system
ASCII data file
SPSS system / portable file (Mandatory)
Statistical setups: SPSS, SAS, Stata
Ready-to-go data files: SAS transport, SPSS portable, Stata system
Question text file in fixed format (Optional)
DDI 2.1 variable-level documentation with frequencies and question text (Optional)
PDF Codebook
(Part of )
This is a simplified diagram
41SSVD The Public Search
- Hermes
- Consistent, reliable source of variable descriptions in DDI
- DDI documentation limited to content of input files
- Labels may be truncated or may contain abbreviations
- Question text may be missing although available in original documentation
42SSVD The Public Search
- Additional quality standards necessary for DDI documentation, to maximize effectiveness of Public Search
- Presence of question text, whenever available
- Increased readability of variable/value labels, especially if question text is not present
43SSVD The Public Search
- Not all ICPSR studies qualify for variable-level searches
- Criteria for excluding studies:
- Aggregate/statistical data (e.g., Census data, Data Books, Roll Call records, etc.)
- Poor documentation
- Some restricted data
44SSVD The Public Search
- Pre-SSVD upload
- Review of DDI output from Hermes to apply content quality standards and study selection criteria
- Additional work to upgrade DDI where necessary (and feasible)
- Add question text
- Complete truncated text
- Improve readability of labels
- Add frequencies
45SSVD The Public Search
- Preparing studies for SSVD
- Started end of 2006
- Included DDI produced for previous projects
- Reviewed all variable-level DDI created at ICPSR,
November 2006 to present (new releases and
updates)
46SSVD The Public Search
- New database finalized Fall 2008
- Built to match DDI 3.0 data model
- Both DDI 2.x and DDI 3.0 compliant
- Designed to accept both DDI 2.x and 3.0 input and produce output in both versions
- ICPSR version currently uploads DDI 2.1 and generates DDI 3.0 individual variable descriptions
47SSVD The Public Search
- First batch of variable-level description files uploaded into SSVD
- Approx. 3,500 DDI files (one file per dataset), representing
- Approx. 1,300 ICPSR studies (approx. 18.5 percent of total ICPSR holdings, excluding US Census; approx. 30 percent of holdings with data and setups)
- Over 1,000,000 individual variable descriptions; 23,000,000 categories
48SSVD The Public Search
- Currently in Beta-testing phase.
- Email bugs to ssvd-testing_at_icpsr.umich.edu
- Uses Oracle Text.
- http://www.icpsr.umich.edu/ICPSR/ssvd/index.html
49 SSVD The Public Search Moving forward
- Fall 2009: switch to Solr searches (based on Lucene)
- Faster
- More sophisticated results filtered by multiple relevant parameters
- Enable side-by-side/same page display of selected variables for comparison
- Enable variable search from individual study page (search within study)
50 SSVD The Public Search Moving forward
- Adding content
- Second batch of DDI files ready to upload
- 900 DDI files, representing 500-600 studies (will bring total close to 45 percent of ICPSR studies with data and setups)
- Initiate retrofit project to examine older studies that were not covered in the first conversion phase
51 SSVD The Public Search Moving forward
- Transition to automated DDI upload
- DDI uploaded at the time of study publication
- First quality check performed by study processing staff
- Acceptable DDI immediately released for public view
- Problematic DDI suppressed from public view for further review, and upgrade as appropriate
52Data Life Cycle Coverage
53Applications: Internal Variable Search and Documentation
54The Integrated Fertility Survey Series
- 5-year grant from NICHD to harmonize data from 10 large surveys of marriage, fertility, and child-bearing in the United States
- 10 surveys spanning 1955 through 2002
55Problem of Harmonization
- To make decisions about harmonizing across all files, we need:
- Question text
- Value labels and categories
- Be able to find and export metadata from all 10 files at the variable level
- Be able to document each variable, recode, and variable choice
56Tools from Variables Database
- Need to be able to do nested searches that are documented
- Need to be able to search all fields individually and in sequence
- Need to be able to download results and document what search terms were used
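The requirements above (nested searches, field-by-field searching in sequence, and a record of the terms used) can be illustrated with a small Python sketch. The function, record layout, survey names, and variable names are invented for illustration and are not the ICPSR internal search.

```python
def search(records, field, term, history):
    """Case-insensitive substring search on one field; logs the field and term."""
    history.append({"field": field, "term": term})
    needle = term.lower()
    return [r for r in records if needle in r.get(field, "").lower()]


# Invented variable-level records standing in for database rows.
VARIABLES = [
    {"survey": "Survey A", "name": "BIRTHYR", "label": "Year of birth",
     "question_text": "In what year were you born?"},
    {"survey": "Survey B", "name": "DOB_Y", "label": "Birth year of respondent",
     "question_text": "What is your date of birth?"},
    {"survey": "Survey B", "name": "MARYR", "label": "Year of first marriage",
     "question_text": "In what year were you first married?"},
]

history = []
step1 = search(VARIABLES, "label", "year", history)
# Nested search: the second query runs only on the first query's results.
step2 = search(step1, "question_text", "born", history)
```

After both steps, `history` lists each field and term in the order used, so the search path can be exported alongside the downloaded results.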
57ICPSR SSVD Internal Search
- All 10 data sets were loaded in ICPSR's version of the shared database
- Designed to capture all of the relevant fields that were marked up in DDI
58Entry screen for internal search
59Search results screen
60Excel download from search
Can also download value labels and codes
61Search Utilities
- Downloaded search fields serve to
- 1. Identify variables to be harmonized
- 2. Provide metadata for translation tables, which are used to harmonize files
62Harmonization steps
- Use search results to populate two intermediate steps toward reforming the data set
- Exploratory comparative tables
- Use this comparative table to make decisions about harmonization by examining universes, question texts, and response categories
- Translation tables
- These tables are designed to provide instructions on recoding the underlying items from the 10 surveys to a single harmonized item. The table provides instructions to an automated SAS program that recodes items from 10 surveys.
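To make the idea of a translation table concrete, here is a minimal sketch of a table-driven recode. It is written in Python purely for illustration; the project itself used an automated SAS program, and the survey names, variable names, and codes below are invented.

```python
# Hypothetical translation table:
# (survey, source variable, source code) -> harmonized code
TRANSLATION_TABLE = {
    ("SURVEY_A", "BPLACE", 1): 1,   # born in the United States
    ("SURVEY_A", "BPLACE", 2): 2,   # born elsewhere
    ("SURVEY_B", "BORNUS", 1): 1,   # "yes" maps to United States
    ("SURVEY_B", "BORNUS", 5): 2,   # "no" maps to elsewhere
}


def harmonize(survey, source_variable, value, table=TRANSLATION_TABLE):
    """Return the harmonized code, or None when the table has no rule."""
    return table.get((survey, source_variable, value))


def recode_file(survey, source_variable, rows, target="H_BPLACE"):
    """Return copies of the rows with the harmonized variable added."""
    return [
        dict(row, **{target: harmonize(survey, source_variable, row[source_variable])})
        for row in rows
    ]


recoded = recode_file("SURVEY_B", "BORNUS", [{"BORNUS": 5}, {"BORNUS": 9}])
```

Codes with no entry in the table come back as `None`, which makes unmapped categories easy to spot before a harmonized file is released.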
63Comparative table: date of birth
64Translation table: place of birth
65Harmonization steps
- After the translation table is complete, the recode instructions for all 10 files are built into the SAS file and a new data file is created.
- The underlying metadata provided by the database allow us to (1) search all 10 files, (2) explore comparability, and (3) recode to new variables