Title: NORC Data Enclave: Providing Secure Remote Access to Sensitive Microdata

1. NORC Data Enclave: Providing Secure Remote Access to Sensitive Microdata
Julia Lane, Principal Investigator, National Science Foundation
Chet Bowie, Senior Vice President, Director, Economics, NORC
Fritz Scheuren, Vice President, Senior Fellow, Statistics Methodology, NORC
Timothy Mulcahy, Data Enclave Program Director, NORC
- http://www.norc.org/dataenclave
2. Overview
- Background
- Mission
- Portfolio Approach
- Security
- Enclave Walkthrough
- Data and Metadata Best Practices
- Outreach / Dissemination
- Next Steps
3. Background: the Data Continuum
- Conceptualization and operationalization
- Data collection
- Data maintenance and management
- Data processing and analysis
- Data quality and security
- Data archiving and access
- Production and dissemination of new knowledge to
inform policy and practice
4. Background: Data in the 21st Century
- Remarkable growth in the amount of data available to us
- Growing demand for better management of the quality and value of data
- Dramatic advances in the technology available to best use those data to inform policy and practice
5. Background: the Challenge
- To be responsive to the dramatic and fast-paced technological, social, and cultural changes taking place in the data continuum
- To be resourceful enough to take advantage of them to best inform policymaking, program development, and performance management
6. Mission
- Promote access to sensitive microdata
- Protect confidentiality (portfolio approach)
- Archive, index, and curate microdata
- Encourage researcher collaboration / virtual collaboratory
7. Emergence of the Remote Data Enclave
- Enclave began in July 2006
- Sponsors
- US Department of Commerce (NIST-TIP)
- US Department of Agriculture (ERS/NASS)
- Ewing Marion Kauffman Foundation
- US Department of Energy (EIA) (pilot project)
- National Science Foundation (2009)
8. What is the Enclave?
- Secure environment for accessing sensitive data
- Access: remote desktop, encryption, audit logs
- Security: controlled information flow, group isolation
- Facility: software, tools, collaborative space, technical support
9. Ideal System
- Secure
- Flexible
- Low cost
- Meets the replication standard:
- The only way to understand and evaluate an empirical analysis fully is to know the exact process by which the data were generated
- A replication dataset includes all information necessary to replicate empirical results
- Metadata are crucial to meeting the standard:
- Composed of documentation and structured metadata
- Undocumented data are useless
- Create a foundation for metadata documentation and extend the data lifecycle
10. Principles of Providing Access
- Valid research purpose
- Statistical purpose
- Must benefit the sponsor agency / data producer
- Trusted researchers
- Limits on data use
- Remote access to secure results
- Disclosure control of results / output
Safe projects
Safe people
Safe setting
Safe outputs
Safe use
11. Safe People and Disciplinary Actions
- All new researchers undergo background checks and extensive training (as per producer)
- Once competence has been demonstrated, researchers are permitted to access the enclave
- Appropriate conduct also allows researchers to have longer-term contracts with data producers/sponsors
12. Safe People and Disciplinary Actions
- Users may not attempt to identify individuals or organizations from the data
- Any attempt to remove data (successful or not) is regarded as a deliberate action
- Any breach of trust is dealt with severely
- Possible actions include:
- Immediate removal of access for the researcher (and colleagues?)
- Formal complaint by the data producer to the researcher's institution
- Potential legal action
13. Safe People are Vital
- Not possible to remove data electronically from the enclave except by the NORC data custodian
- All users trained in access to the enclave
- However, the risk of users abusing data can't be reduced to zero (portfolio approach)
- Researchers may only access the dataset they applied for and were authorized to use
- Use must relate to the originally proposed project
14. Data Enclave Today
- Approximately 50 active researchers
- More than 40 research projects
- Approximately 500 people on the enclave listserv
- Increasingly accepted as a model for providing secure remote access to sensitive data: UK Data Archive, Council of European Social Science Data Archives, University of Pennsylvania, Chapin Hall Center for Children
- Emphasis on building and sustaining virtual organizations or collaboratories
15. Available Datasets
- Department of Commerce, National Institute of Standards and Technology:
- ATP Survey of Joint Ventures
- ATP Survey of Applicants
- Business Reporting Survey Series
- Department of Agriculture, Economic Research Service / National Agricultural Statistics Service:
- Agricultural Resource Management Survey (ARMS)
- Kauffman Foundation:
- Kauffman Firm Survey
- Department of Energy, Energy Information Administration (pilot)
- National Science Foundation (2009)
16. Examples of Researcher Topic Areas
- Entrepreneurship
- Knowledge innovation
- Joint ventures
- New businesses/startups
- Strategic alliances
- Agricultural economics
17. Researcher Institutions
18. Portfolio Approach to Secure Data Access
- Portfolio approach:
- Legal
- Educational / Training
- Statistical
- Data Protection / Operational / Technological
- (Customized per data producer / dataset)
19. Portfolio Approach
20. Legal Protection: Data Enclave Specific
- On an annual basis:
- Approved researchers sign Data User Agreements (legally binding the individual and institution)
- Researchers and NORC staff sign Non-disclosure Agreements specific to each dataset
- Researchers and NORC staff complete confidentiality training
21. Educational / Researcher Training
- Locations:
- Onsite
- Remote / Web-based / SFTP
- Researcher locations (academic institutions; conferences: AAEA, JSM, AOM, ASA, ASSA, NBER Summer Institute)
- Note: the training is designed to go above and beyond current practice in terms of both frequency and coverage
22. Educational / Training: Example Agenda
- Day 1
- Data enclave navigation (NORC)
- Metadata documentation (NORC)
- Confidentiality and data disclosure (NORC)
- Survey overview (Data Producer)
- Confidentiality agreement signing (NORC / Data Producer)
- Day 2
- Data files and documentation (Data Producer)
- Sampling and weights (Data Producer)
- Item quality control and treatments for non-response (Data Producer)
- Statistical testing (Data Producer)
23. Statistical Protection
- Remove obvious identifiers and replace with unique identifiers
- Statistical techniques chosen by the agency (recognizing data quality issues)
- Noise added?
- Full disclosure review of all exported data, coordinated between NORC and the data producer
- Note: at the discretion of the producer; can go above and beyond the minimum level of protection
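The "noise added?" bullet above refers to perturbation techniques applied before data enter the enclave. Below is a minimal sketch of one such technique, multiplicative noise, with an illustrative 5% spread; the actual method and parameters are chosen by the sponsoring agency and are not specified in these slides.

```python
# Illustrative sketch of multiplicative noise for a sensitive numeric
# variable. The 5% spread and uniform distribution are assumptions for
# this example, not any agency's actual parameters.
import random

def add_multiplicative_noise(values, spread=0.05, seed=None):
    """Perturb each value by a random factor drawn from [1-spread, 1+spread]."""
    rng = random.Random(seed)
    return [v * rng.uniform(1 - spread, 1 + spread) for v in values]

# Hypothetical firm revenues (not real microdata).
revenues = [120_000.0, 45_500.0, 980_000.0]
noisy = add_multiplicative_noise(revenues, spread=0.05, seed=42)
# Each noisy value stays within 5% of the original, preserving broad
# patterns while masking exact reported amounts.
```

Noise of this kind trades a small loss of accuracy for protection against exact matching on reported values; the disclosure review described on the next slide still applies to any output.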
24. Disclosure Review Process
- Getting disclosure-proofed results:
- Drop the table/file and related files in a special folder in the shared area
- Submit a request for disclosure review through tech support
- Complete the checklist (researcher to-do list)
- Include details about the number of observations in each cell and previous releases
- Summary: disclosure-proofed output is made available for download on a public server
25. Disclosure Review Process
Disclosure Review
26. Disclosure Review Guidelines
- Provide a brief description of your work
- Identify which table(s) / file(s) you are requesting to be reviewed, and their location
- Specify the dataset(s) and variables from which the output derives
- Identify the underlying cell sizes for each variable, including regression coefficients based on discrete variables
27. Helpful Hints on Disclosure Review
- Threshold rule: no cells with fewer than 10 units (individuals/enterprises). Local-unit analysis must show the enterprise count (even when there is no information associated with each cell)
- Avoid / be careful with / remember:
- Tabulating raw data (threshold rule)
- Using lumpy variables, such as investment
- Researching small geographical areas (dominance rule)
- Graphs are simply tables in another form (always display frequencies)
- Treat quantiles as tables (always display frequencies)
- Avoid minimum and maximum values
- Regressions generally only present disclosure issues when run only on dummies (a table) or on public explanatory variables
- Potentially disclosive situations arise when differencing; hiding coefficients makes linear and non-linear estimation completely non-disclosive (note: panel models are inherently safe)
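The threshold rule above can be checked mechanically before output is submitted for review. The sketch below is a hypothetical self-check, with illustrative function and table names; it is not part of NORC's actual review tooling.

```python
# Hypothetical self-check for the threshold rule described above: no
# published cell may be based on fewer than 10 units. Names here are
# illustrative, not any NORC tool.

def violates_threshold(cell_counts, minimum=10):
    """Return the cells whose underlying unit count falls below the
    disclosure threshold (default: 10 individuals/enterprises)."""
    return [cell for cell, n in cell_counts.items() if n < minimum]

# Example: firm counts in a 2x2 table by region and size class.
table = {
    ("North", "small"): 42,
    ("North", "large"): 7,    # below threshold: must be suppressed
    ("South", "small"): 130,
    ("South", "large"): 15,
}

flagged = violates_threshold(table)
print(flagged)  # [('North', 'large')]
```

Running such a check before dropping files in the review folder catches the most common rejection reason early; the dominance rule and graph/quantile caveats above still require manual judgment.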
28. Data Protection / Operational
- Encrypted connection with the data enclave using virtual private network (VPN) technology. VPN technology enables the data enclave to prevent an outsider from reading the data transmitted between the researcher's computer and NORC's network.
- Users access the data enclave from a static or pre-defined narrow range of IP addresses.
- Citrix's Web-based technology:
- All applications and data run on the server at the data enclave.
- The data enclave can prevent the user from transferring any data from the data enclave to a local computer.
- Data files cannot be downloaded from the remote server to the user's local PC.
- The user cannot use the cut-and-paste feature in Windows to move data out of the Citrix session.
- The user is prevented from printing the data on a local computer.
- Audit logs and audit trails
29. Data Protection / Operational
- NORC already collects data for multiple statistical agencies (BLS, Federal Reserve (IRS data), EIA, NSF/SRS, etc.) and so has existing safeguards in place
- The Data Enclave is fully compliant with DOC IT Security Program Policy, Section 6.5.2, the Federal Information Security Management Act, provisions of mandatory Federal Information Processing Standards (FIPS), and all other applicable NIST data IT system and physical security requirements, e.g.:
- Employee security
- Rules of behavior
- Nondisclosure agreements
- NIST-approved IT security / certification and accreditation
- Applicable laws and regulations
- Network connectivity
- Remote access
- Physical access
30. High-level Access Perspective
31. Restrictions in the Enclave Environment
- Access only to authorized applications
- Most system menus have been disabled
- Some control-key combinations and right-click functions are also disabled
- Closed environment: no open ports, no access to the Internet or email
- No output (tables, files) may be exported and no datasets imported without being reviewed
- File explorer is on default settings
32. Accessing the Enclave
https://enclave.norc.org (don't forget it's https://, not http://)
The message center will inform you of browser-related technical issues. Note that you will first need to install the Citrix client on your system (a download link will be provided).
Enter your user name and password. The first time you access the enclave, you will need to change your password. Passwords need at least one number and a mix of upper- and lower-case characters, and must be changed every 3 months.
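The password rule above (at least one number plus a mix of upper and lower case) can be expressed as a simple check. This sketch captures only what the slide states; the real enclave policy may enforce additional requirements such as minimum length.

```python
# Minimal sketch of the password rule stated above: at least one digit
# and a mix of upper- and lower-case letters. Only what the slide
# describes; the actual enclave policy may be stricter.
import re

def meets_enclave_policy(password):
    """True if the password has a digit, a lower-case letter, and an
    upper-case letter."""
    return (re.search(r"\d", password) is not None
            and re.search(r"[a-z]", password) is not None
            and re.search(r"[A-Z]", password) is not None)

print(meets_enclave_policy("Enclave2009"))  # True
print(meets_enclave_policy("enclave"))      # False: no digit, no upper case
```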
33. Tools Available in the Enclave
- Stata/SE 10.0
- StatTransfer 9 (selected users)
- SAS v9.2
- R Project for Statistical Computing
- MATLAB
- LimDep / NLogit
- Microsoft Office 2007
- Adobe PDF Reader
- IHSN Microdata Management Toolkit / Nesstar Publisher (upon request, selected users only)
34. Collaboration Tools
Producer portal:
- General information: background info, announcements, calendar of events, about
- Knowledge sharing: topic of the week, discussion groups, wiki, shared libraries (metadata / reports, scripts, research papers)
- Support: frequently asked questions, technical support, DE usage, data usage, quality
Content is fully editable by producers and researchers using a simple web-based interface. Private research-group portals with similar functionality are configured for each research project.
35. Contributing to the Wiki
Wiki pages can be changed by clicking the Edit button.
All changes are tracked and can be seen using the history page.
Links to other wiki pages are shown as hyperlinks; pages that have not yet been created are underlined.
36. Data Documentation: Shared Code Libraries
Click on documents and folders to open them or navigate the structure.
Use the menu to create folders or upload documents.
37. What are Metadata?
- Common definition: "data about data"
38. Metadata You Use Every Day
- The Internet is built on metadata and XML technologies
39. Managing Social Science Metadata is Challenging!
"We are in charge of the data. We support our users but also need to protect our respondents!"
"We have an information management problem."
"We want easy access to high-quality and well-documented data!"
"We need to collect the information from the producers, preserve it, and provide access to our users!"
40. Metadata Needs in Social Science
- The data food chain:
- Data take a very long path from the respondent to the policy maker
- The process should be properly documented at each step
- Different needs and perspectives across the life cycle
- But it's all about the data/knowledge being transformed and repurposed
- Linkages / information should be maintained:
- Drill from results to source
- Needed to understand how to use the data and interpret the results
- Information should be reusable:
- Common metadata
- Shared across projects
- Dynamic integration of knowledge
41. Importance of Metadata
- Data quality
- Usefulness: accessibility, coherence, completeness, relevance, timeliness
- Undocumented data are useless
- Partially documented data are risky (misuse)
- Data discovery and access
- Preservation
- Replication standard (Gary King)
- Information / knowledge exchange
- Reduce the need to access sensitive data
- Maintain coherence / linkages across the complete life cycle (from respondent to policy maker)
- Reuse
42. Metadata Issues
- Without producer / archive metadata:
- We do not know what the data are about
- We lose information about the production processes
- Information can't be properly preserved
- Researchers can't discover data or perform efficient analysis
- Without researcher metadata:
- The research process is not documented and cannot be reproduced (Gary King: replication standard!)
- Other researchers are not aware of what has been done (duplication / lack of visibility)
- Producers don't know about data usage and quality issues
- Without standards:
- Such information can't be properly managed and exchanged between actors or with the public
- Without tools:
- We can't capture, preserve, or share knowledge
43. What is a Survey?
- More than just data:
- A complex process to produce data for the purpose of statistical analysis
- Beyond this, a tool to support evidence-based policy making and results monitoring in order to improve living conditions
- Represents a single point in time and space:
- Needs to be aggregated to produce meaningful results
- It is the beginning of the story:
- The microdata are surrounded by a large body of knowledge
- But survey data often come with limited documentation
- Survey documentation can be broken down into structured metadata and documents:
- Structured metadata (can be captured using XML)
- Documents (can be described in structured metadata)
44. Microdata Metadata Examples
- Survey level:
- Data dictionary (variable labels, names, formats, ...)
- Questionnaires: questions, instructions, flow, universe
- Dataset structure: files, structure/relationships, ...
- Survey and processes: concepts, description, sampling, stakeholders, access conditions, time and spatial coverage, data collection and processing, ...
- Documentation: reports, manuals, guides, methodologies, administration, multimedia, maps, ...
- Across surveys:
- Groups: series, longitudinal, panel, ...
- Comparability: by design, after the fact
- Common metadata: concepts, classifications, universes, geography
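A data dictionary like the one listed above can be captured as structured XML rather than a free-text codebook. The sketch below uses Python's standard library to build a minimal example; the element names are simplified illustrations for this deck, not the actual DDI schema.

```python
# Hedged sketch: representing variable-level metadata (the "data
# dictionary" bullet above) as structured XML. Element and variable
# names are illustrative, not the real DDI vocabulary.
import xml.etree.ElementTree as ET

# Hypothetical variables from a business survey.
variables = [
    {"name": "emp_total", "label": "Total employment", "format": "numeric"},
    {"name": "naics",     "label": "Industry code",    "format": "string"},
]

root = ET.Element("dataDictionary")
for v in variables:
    var = ET.SubElement(root, "variable", name=v["name"], format=v["format"])
    ET.SubElement(var, "label").text = v["label"]

xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

Once metadata live in a structure like this, they can be validated, transformed, and searched by software, which is what makes the cross-survey reuse on this slide practical.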
45. Questionnaire Example
(Annotated questionnaire showing: instructions, universe, module/concepts, questions, classifications (some reusable), and value-level skip instructions)
46. Information Technology and Metadata
The eXtensible Markup Language (XML) and related technologies are used to manage metadata and data.
Document Type Definitions (DTD) and XML Schema are used to validate an XML document by defining namespaces, elements, and rules.
Specialized software and database systems can be used to create and edit XML documents. In the future, the XForms standard will be used.
XML separates metadata storage from presentation. XML documents can be transformed into something else (HTML, PDF, other XML, etc.) through the eXtensible Stylesheet Language: XSL Transformations (XSLT) and XSL Formatting Objects (XSL-FO).
Very much like a database system, XML documents can be searched and queried through XPath or XQuery. There is no need to create tables or indexes, or to define relationships.
XML metadata or data can be published in smart catalogs, often referred to as registries, that can be used for discovery of information.
XML documents can be sent like regular files but are typically exchanged between applications through Web Services using SOAP and other protocols.
47. Metadata Specifications for Social Sciences
- We need generic structures to capture the metadata
- A single specification is not enough; we need a set of metadata structures:
- That can map to each other (to maintain linkages)
- That will be around for a long time (global adoption, strong community support)
- Based on technology standards (XML)
- Suggested set:
- Data Documentation Initiative (DDI): survey / administrative microdata
- Statistical Data and Metadata Exchange standard (SDMX): aggregated data / time series
- ISO/IEC 11179: concept management and semantic modeling
- ISO 19115: geographical metadata
- METS: packaging/archiving of digital objects
- PREMIS: archival lifecycle metadata
- XBRL: business reporting
- Dublin Core: citation metadata
48. The Data Documentation Initiative
- The Data Documentation Initiative (DDI) is an XML specification to capture structured metadata about microdata (in the broad sense)
- First generation: DDI 1.0-2.1 (2000-2008)
- Focus on a single archived instance
- Second generation: DDI 3.0 (2008)
- Focus on the life cycle
- Goes beyond the single-survey concept
- Multi-purpose
- Governance: DDI Alliance
- Membership-based organization (35 members)
- Data archives, producers, research data centers, academia
- http://www.ddialliance.org/org/index.html
49. DDI Timeline / Status
- Pre-DDI 1.0:
- 1970s/80s: OSIRIS Codebook
- 1993: IASSIST Codebook Action Group
- 1996: SGML DTD
- 1997: DDI XML
- 1999: Draft DDI DTD
- 2000: DDI 1.0
- Simple survey
- Archival data formats
- Microdata only
- 2003: DDI 2.0
- Aggregate data (based on matrix structure)
- Added geographic material to aid geographic search systems and GIS users
- 2003: Establishment of the DDI Alliance
- 2004: Acceptance of a new DDI paradigm
- Lifecycle model
- Shift from the codebook-centric / variable-centric model to capturing the lifecycle of data
- Agreement on expanded areas of coverage
- 2005:
- Presentation of schema structure
- Focus on points of metadata creation and reuse
- 2006:
- Presentation of first complete 3.0 model
- Internal and public review
- 2007:
- Vote to move to Candidate Version (CR)
- Establishment of a set of use cases to test application and implementation
- October: 3.0 CR2
- 2008:
- February: 3.0 CR3
- March: 3.0 CR3 update
- April: 3.0 CR3 final
- April 28: 3.0 approved by the DDI Alliance
- May 21: DDI 3.0 officially announced
- Initial presentations at IASSIST 2008
- 2009:
- DDI 3.1 and beyond
50. DDI 2.0 Perspective
(Diagram: multiple independent "DDI 2 Survey" instances, each documented separately)
51. DDI 2 Characteristics and Limitations
- Characteristics:
- Focuses on the static object of a codebook
- Designed for limited uses
- Coverage is focused on a single study, single data file, simple survey, and aggregate data files
- The variable contains the majority of information (question, categories, data typing, physical storage information, statistics)
- Limitations:
- Treated as an add-on to the data collection process
- Focus is on the data end-product and end users (static)
- Limited tools for creation or exploitation
- The variable must exist before metadata can be created
- Producers hesitant to take up DDI creation because it is a cost and does not support their development or collection process
52. DDI 3.0 and the Survey Life Cycle
- A survey is not a static process: it dynamically evolves across time and involves many agencies/individuals
- DDI 2.x is about archiving; DDI 3.0 spans the entire life cycle
- 3.0 focuses on metadata reuse (minimizes redundancies/discrepancies, supports comparison)
- Also supports multilingual content, grouping, geography, and more
- 3.0 is extensible
53. DDI 3.0 Perspective
54. DDI 3.0 Use Cases
- DDI 3 is composed of several schemas/modules
- You only use what you need!
- DDI 3.0 provides the common metadata language to maintain links and consistency across the entire life cycle
- Some examples:
- Study design / survey instrumentation
- Questionnaire generation / data collection and processing
- Data recoding, aggregation, and other processing
- Data dissemination / discovery
- Archival ingestion / metadata value-add
- Question / concept / variable banks
- DDI for use within a research project
- Capture of metadata regarding data use
- Metadata mining for comparison, etc.
- Generating instruction packages/presentations
- Data sourced from registers
- The same specification is used across the lifecycle by different actors, which maintains consistency and linkages
55. Benefits (1)
- Improvement in data quality (usefulness)
- Discovery, accessibility
- Coherence, integrity
- Timeliness
- Agencies speak the same language and share a common/compatible metadata structure
- Across the entire life cycle, from respondent to policy maker
- Improved services:
- Publication: institutional (internal/public), community, regional, global
- User-customized products (on the fly)
- Documentation (subset, profiled for user)
- Code generators (statistical packages, databases, etc.)
- Notification services
- User feedback / dialog (quality, usage)
- Foster community space (build knowledge, collaboration)
- Reuse of tools / software / best practices
56. Benefits (2)
- Harmonization of common metadata
- Concepts, classifications, terminology (or documented mappings)
- Improved search capabilities
- Producer, time, geography, concepts
- Comparability (by design, after the fact)
- Metadata mining, exploration, and visualization
- Understanding of data usage
- Reduced burden and cost of ownership
- Preservation
- Return on investment:
- Preparation and maintenance costs offset by reduction in production, dissemination, and support costs
- Built on industry-standard technology
57. Metadata and the NORC Data Enclave
- Advocate and foster the use of metadata standards and best practices
- Datasets coming into the enclave are documented using DDI
- In collaboration with the data producer
- Currently using DDI 2
- Supporting community initiatives towards the development of new tools, in particular for DDI 3.0
- Frontier research in knowledge capture:
- Source code scraping / tagging
- Better understanding of data usage
- Collaborative efforts with the University of Chicago Computation Institute
- Leveraging Web 2.0 technologies in social science
58. Outreach / Dissemination
- Public web site
- Newsletter
- Sponsor / participate in workshops/conferences
- Multiple channels of coordinated outreach
efforts are most advantageous
59. Public Website (www.norc.org/dataenclave)
60. Enclave Quarterly Newsletter
61. Publications
- Norman Bradburn, Randy Horton, Julia Lane, and Michael Tilkin, "Developing a Data Enclave for Sensitive Microdata," Proceedings of the International E-Social Science Conference, 2006
- Julia Lane, "Optimizing Access to Microdata," Journal of Official Statistics, September 2007
- Julia Lane and Stephanie Shipp, "Using a Remote Access Data Enclave for Data Dissemination," International Journal of Digital Curation, 2007
- Julia Lane, Pascal Heus, and Tim Mulcahy, "Data Access in a Cyber World: Making Use of Cyberinfrastructure," Transactions on Data Privacy, 2008
- Stephanie Shipp, Stephen Campbell, Tim Mulcahy, and Ted Allen, "Informing Public Policy on Science and Innovation: The Advanced Technology Program's Experience," Journal of Technology Transfer, 2008
62. Next Steps
- Continue applying practical lessons learned / feedback for continuous improvement
- Continue to innovate:
- Develop researcher scholarship and collaboration incentive programs
- Identify, implement, and test new collaboratory functionality
- Continue to pilot-test the remote training platform
- Develop an Executive Council comprised of sponsor agencies to help steer the system from a producer and user perspective
63. Instant Messaging, Audio, Video, Webcast
64. Q &amp; A