NORC Data Enclave: Providing Secure Remote Access to Sensitive Microdata (PowerPoint presentation transcript)

Transcript and Presenter's Notes

1
NORC Data Enclave: Providing Secure Remote
Access to Sensitive Microdata
Julia Lane, Principal Investigator, National Science Foundation
Chet Bowie, Senior Vice President, Director, Economics, NORC
Fritz Scheuren, Vice President, Senior Fellow, Statistics and Methodology, NORC
Timothy Mulcahy, Data Enclave Program Director, NORC
  • http://www.norc.org/dataenclave

2
Overview
  • Background
  • Mission
  • Portfolio Approach
  • Security
  • Enclave Walkthrough
  • Data and Metadata Best Practices
  • Outreach / Dissemination
  • Next Steps

3
Background: the Data Continuum
  • Conceptualization and operationalization
  • Data collection
  • Data maintenance and management
  • Data processing and analysis
  • Data quality and security
  • Data archiving and access
  • Production and dissemination of new knowledge to
    inform policy and practice

4
Background: Data in the 21st Century
  • Remarkable growth in the amount of data available
    to us
  • Growing demand for better management of the
    quality and value of data
  • Dramatic advances in the technology available to
    best use those data to inform policy and practice

5
Background: the Challenge
  • To be responsive to the dramatic and fast paced
    technological, social, and cultural changes
    taking place in the data continuum
  • To be resourceful enough to take advantage of
    them to best inform policymaking, program
    development, and performance management

6
Mission
  • Promote access to sensitive microdata
  • Protect confidentiality (portfolio approach)
  • Archive, index and curate microdata
  • Encourage researcher collaboration / virtual
    collaboratory

7
Emergence of the Remote Data Enclave
  • Enclave began in July 2006
  • Sponsors
  • US Department of Commerce (NIST-TIP)
  • US Department of Agriculture (ERS/NASS)
  • Ewing Marion Kauffman Foundation
  • US Department of Energy (EIA) (pilot project)
  • National Science Foundation (2009)

8
What is the Enclave?
  • Secure environment for accessing sensitive data
  • Access: remote desktop, encryption, audit logs
  • Security: controlled information flow, group
    isolation
  • Facility: software, tools, collaborative space,
    technical support

9
Ideal System
  • Secure
  • Flexible
  • Low Cost
  • Meets the replication standard
  • The only way to understand and evaluate an
    empirical analysis fully is to know the exact
    process by which the data were generated
  • A replication dataset includes all information
    necessary to replicate empirical results
  • Metadata are crucial to meeting the standard
  • Composed of documentation and structured metadata
  • Undocumented data are useless
  • Create a foundation for metadata documentation and
    extend the data lifecycle (a minimal manifest
    sketch follows below)
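To make the replication standard concrete, the following is a minimal sketch of the kind of manifest a researcher might keep alongside an analysis. The file names, field names, and notes are illustrative assumptions, not an enclave requirement.

```python
import hashlib
import json
import platform
from datetime import datetime, timezone
from pathlib import Path

def file_checksum(path: str) -> str:
    """SHA-256 of a file, so the exact inputs behind a result can be verified later."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

# Hypothetical paths; a real project would list its actual data extracts and scripts.
inputs = ["data/arms_extract.csv"]
code_files = ["analysis/model.do"]

manifest = {
    "project": "example-enclave-analysis",
    "created": datetime.now(timezone.utc).isoformat(),
    "platform": platform.platform(),
    "inputs": {p: file_checksum(p) for p in inputs if Path(p).exists()},
    "code": {p: file_checksum(p) for p in code_files if Path(p).exists()},
    "notes": "Weights, recodes and variable definitions documented in the codebook metadata.",
}

Path("replication_manifest.json").write_text(json.dumps(manifest, indent=2))
```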

10
Principles of Providing Access
  • Valid research purpose
  • statistical purpose
  • needs to benefit the sponsor agency / data producer
  • Trusted researchers
  • Limits on data use
  • Remote access to secure results
  • Disclosure control of results / output

safe projects
safe people
safe setting
safe outputs
→ safe use
11
Safe People and Disciplinary Actions
  • All new researchers undergo background checks and
    extensive training (as per producer)
  • Once competence has been demonstrated,
    researchers are permitted to access the enclave
  • Appropriate conduct also allows researchers to
    have longer term contracts with data
    producers/sponsors

12
Safe People and Disciplinary Actions
  • Users may not attempt to identify individuals or
    organizations from the data
  • Any attempt to remove data (successful or not) is
    regarded as a deliberate action
  • Any breach of trust dealt with severely
  • Possible actions include
  • Immediate removal of access for researcher (and
    colleagues?)
  • Formal complaint by Data Producer to institution
  • Potential legal action

13
Safe People are Vital
  • Not possible to remove data electronically from
    enclave except by NORC data custodian.
  • All users trained in access to enclave
  • However, the risk of users abusing data can't be
    reduced to zero (portfolio approach)
  • Researchers
  • May only access the dataset they applied for and
    were authorized to use
  • use must relate to the originally proposed
    project

14
Data Enclave Today
  • Approximately 50 active researchers
  • More than 40 research projects
  • Approximately 500 people on enclave listserv
  • Increasingly accepted as a model for providing
    secure remote access to sensitive data
  • UK Data Archive, Council of European Social
    Science Data Archives, University of Pennsylvania,
    Chapin Hall Center for Children
  • Emphasis on building and sustaining virtual
    organizations or collaboratories

15
Available Datasets
  • Department of Commerce, National Institute of
    Standards and Technology
  • ATP Survey of Joint Ventures
  • ATP Survey of Applicants
  • Business Reporting Survey Series
  • Department of Agriculture, Economic Research
    Service, National Agricultural Statistics
    Service
  • Agricultural Resource Management Survey (ARMS)
  • Kauffman Foundation
  • Kauffman Firm Survey
  • Department of Energy, Energy Information
    Administration (pilot)
  • National Science Foundation (2009)

16
Examples of Researcher Topic Areas
  • Entrepreneurship
  • Knowledge innovation
  • Joint ventures
  • New businesses/startups
  • Strategic alliances
  • Agricultural economics

17
Researcher Institutions
18
Portfolio Approach to Secure Data Access
  • Portfolio Approach
  • Legal
  • Educational / Training
  • Statistical
  • Data Protection / Operational / Technological
  • (Customized per data producer / dataset)

19
Portfolio Approach
20
Legal Protection: Data Enclave Specific
  • On an annual basis
  • Approved researchers sign Data User Agreements
    (legally binding the individual and institution)
  • Researchers and NORC staff sign Non-disclosure
    Agreements specific to each dataset
  • Researchers and NORC staff complete
    confidentiality training

21
Educational / Researcher Training
  • Locations
  • Onsite
  • Remote / Web-based / SFTP
  • Researcher locations (academic institutions,
    conferences: AAEA, JSM, AOM, ASA, ASSA, NBER
    Summer Institute)
  • Note: The training is designed to go above and
    beyond current practice in terms of both
    frequency and coverage

22
Educational/Training Example Agenda
  • Day 1
  • Data enclave navigation (NORC)
  • Metadata documentation (NORC)
  • Confidentiality and data disclosure (NORC)
  • Survey overview (Data Producer)
  • Confidentiality agreement signing (NORC / Data
    Producer)
  • Day 2
  • Data files and documentation (Data Producer)
  • Sampling and weights (Data Producer)
  • Item quality control and treatments for
    non-response (Data Producer)
  • Statistical testing (Data Producer)

23
Statistical Protection
  • Remove obvious identifiers and replace with
    unique identifiers
  • Statistical techniques chosen by agency
    (recognizing data quality issues)
  • Noise added?
  • Full disclosure review of all data exported
    coordinated between NORC and Data Producer
  • Note: At the discretion of the producer; can go
    above and beyond the minimum level of protection

24
Disclosure Review Process
  • Getting disclosure-proofed results
  • Drop table/file and related files in special
    folder in shared area
  • Submit request for disclosure review through tech
    support
  • Complete the checklist (researcher to-do list)
  • Include details about the number of observations in
    each cell and any previous releases
  • Summary: disclosure-proofed output is made available
    for download on a public server

25
Disclosure Review Process
Disclosure Review
26
Disclosure Review Guidelines
  • Provide a brief description of your work
  • Identify which table(s) / file(s) you are
    requesting to be reviewed and their location
  • Specify the dataset(s) and variables from which
    the output derives
  • Identify the underlying cell sizes for each
    variable, including regression coefficients based
    on discrete variables (a sketch of such a request
    follows below)
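As a minimal sketch of the information these guidelines ask for, a review request could be assembled as a simple structured record before it is submitted through tech support. The field names, file paths, and values below are illustrative assumptions, not an official NORC form.

```python
# Illustrative disclosure-review request mirroring the four guideline items above.
review_request = {
    "description": "Descriptive tables of employment by two-digit industry",
    "outputs": [
        {"file": "shared/review/emp_by_industry.xlsx", "location": "shared review folder"},
    ],
    "sources": {
        "dataset": "Kauffman Firm Survey",      # dataset(s) the output derives from
        "variables": ["naics2", "emp_total"],   # variables used
    },
    # Underlying cell sizes, including counts behind coefficients on discrete variables.
    "cell_sizes": {"naics2=31": 42, "naics2=44": 87, "naics2=54": 133},
}

for cell, n in review_request["cell_sizes"].items():
    print(f"{cell}: {n} observations")
```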

27
Helpful Hints on Disclosure Review
  • Threshold Rule
  • No cells with fewer than 10 units
    (individuals/enterprises). Local unit analysis
    must show the enterprise count (even when there is
    no information associated with each cell). A
    mechanical check of this rule is sketched after
    this list.
  • Avoid / be careful when / remember to
  • Tabulating raw data (threshold rule)
  • Using lumpy variables, such as investment
  • Researching small geographical areas (dominance
    rule)
  • Graphs are simply tables in another form (always
    display frequencies)
  • Treat quantiles as tables (always display
    frequencies)
  • Avoid minimum and maximum values
  • Regressions generally only present disclosure
    issues when run
  • only on dummies (in effect, a table)
  • on public explanatory variables
  • Potentially disclosive situations arise when
    differencing; hiding coefficients makes linear and
    non-linear estimation completely non-disclosive
    (note: panel models are inherently safe)
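The threshold rule can be checked mechanically before a table is submitted for review. Below is a minimal sketch assuming pandas and a small made-up data frame; the enclave's own disclosure review remains the authority.

```python
import pandas as pd

# Made-up microdata for illustration only; real enclave data never leaves the secure environment.
df = pd.DataFrame({
    "region": ["A"] * 16 + ["B"] * 4,
    "sector": ["mfg"] * 12 + ["svc"] * 4 + ["mfg"] * 2 + ["svc"] * 2,
})

# Count the units behind every cell of the intended table.
cell_counts = df.groupby(["region", "sector"]).size()

THRESHOLD = 10  # "No cells with fewer than 10 units"
flagged = cell_counts[cell_counts < THRESHOLD]

if flagged.empty:
    print("All cells meet the threshold rule.")
else:
    print("Cells below the threshold:")
    print(flagged)
```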

28
Data Protection / Operational
  • Encrypted connection with the data enclave using
    virtual private network (VPN) technology. VPN
    technology enables the data enclave to prevent an
    outsider from reading the data transmitted
    between the researcher's computer and NORC's
    network.
  • Users access the data enclave from a static or
    pre-defined narrow range of IP addresses (this
    restriction is illustrated in the sketch after
    this list).
  • Citrix's Web-based technology.
  • All applications and data run on the server at
    the data enclave.
  • Data enclave can prevent the user from
    transferring any data from data enclave to a
    local computer.
  • Data files cannot be downloaded from the remote
    server to the user's local PC.
  • User cannot use the cut and paste feature in
    Windows to move data from the Citrix session.
  • User is prevented from printing the data on a
    local computer.
  • Audit logs and audit trails
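As an illustration of the IP-address restriction listed above (not NORC's actual implementation), a connection gate can be expressed as an allow-list of pre-registered network ranges. The ranges below are documentation-only example addresses.

```python
import ipaddress

# Example allow-list; real ranges would be registered per researcher institution.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/28"),      # e.g., a university department subnet
    ipaddress.ip_network("198.51.100.17/32"),  # e.g., a single static office address
]

def connection_allowed(client_ip: str) -> bool:
    """Return True only if the connecting address falls inside a pre-registered range."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in ALLOWED_NETWORKS)

print(connection_allowed("192.0.2.5"))    # True: inside the registered /28
print(connection_allowed("203.0.113.9"))  # False: not a registered address
```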

29
Data Protection / Operational
  • NORC already collects data for multiple
    statistical agencies (BLS, Federal Reserve (IRS
    data), EIA, NSF/SRS, etc.) and therefore has
    existing safeguards in place
  • The Data Enclave is fully compliant with DOC IT
    Security Program Policy, Section 6.5.2, the
    Federal Information Security Management Act,
    provisions of mandatory Federal Information
    Processing Standards (FIPS) and all other
    applicable NIST Data IT system and physical
    security requirements, e.g.,
  • - Employee security
  • - Rules of behavior
  • - Nondisclosure agreements
  • - NIST approved IT security / certification
    accreditation
  • - Applicable laws and regulations
  • - Network connectivity
  • - Remote access
  • - Physical access

30
High-level Access Perspective
31
Restrictions in the Enclave Environment
  • Access only to authorized applications
  • Most system menus have been disabled
  • Some control key combinations or right click
    functions are also disabled on keyboard
  • Closed environment: no open ports, no access to
    the Internet or email
  • No output (tables, files) may be exported and no
    datasets imported without being reviewed
  • File explorer is on default settings

32
Accessing the Enclave
https//enclave.norc.org dont forget its
https//, not http//
The message center will inform you on browser
related technical issues. Note that you will
first need to install the Citrix Client on you
system (a download link will be provided)
Enter your user name and password. The first time
you access, you will need to change your
password. Need at least one number and a mix
upper/lower case characters. Password must be
changed every 3-months
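A minimal sketch of the stated password rules (at least one number plus mixed case). This only illustrates the policy described on the slide, not the enclave's actual validation code; length and expiry are enforced by the system itself.

```python
import re

def meets_enclave_password_rules(password: str) -> bool:
    """Check the rules stated above: at least one digit and a mix of
    upper- and lower-case letters."""
    return bool(
        re.search(r"\d", password)
        and re.search(r"[a-z]", password)
        and re.search(r"[A-Z]", password)
    )

print(meets_enclave_password_rules("Enclave2009"))  # True
print(meets_enclave_password_rules("enclave"))      # False: no digit, no upper case
```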
33
Tools Available in the Enclave
  • Stata/SE 10.0
  • StatTransfer 9 (selected users)
  • SAS v9.2
  • R Project for Statistical Computing
  • MATLAB
  • LimDep / NLogit
  • Microsoft Office 2007
  • Adobe PDF Reader
  • IHSN Microdata Management Toolkit / Nesstar
    Publisher (upon request, selected users only)

34
Collaboration Tools
PRODUCER PORTAL
GENERAL INFORMATION
KNOWLEDGE SHARING
SUPPORT
  • Background info
  • Announcements
  • Calendar of events
  • About
  • Topic of the week
  • Discussion groups
  • Wiki
  • Shared libraries
  • Metadata / Report
  • Scripts
  • Research papers
  • Frequently Asked Questions
  • Technical Support
  • DE usage
  • Data usage
  • Quality

Content is fully editable by producers and
researchers using a simple web-based interface.
Private research group portals with similar
functionalities are configured for each research
project
35
Contributing to the Wiki
Wiki pages can be changed by clicking the Edit
button
All changes are tracked and can be seen using the
history page.
Links to other wiki pages are shown as hyperlinks
Pages that have not yet been created are
underlined
36
Data Documentation and Shared Code Libraries
Click on documents and folders to open or
navigate in the structure
Use the menu to create folders or upload documents
37
What are Metadata?
  • Common definition: data about data

38
Metadata You Use Everyday
  • The Internet is built on metadata and XML
    technologies

39
Managing Social Science Metadata is Challenging!
We are in charge of the data. We support our
users but also need to protect our respondents!
We have an information management problem
We want easy access to high quality and well
documented data!
We need to collect the information from the
producers, preserve it, and provide access to our
users!
40
Metadata Needs in Social Science
  • The data food chain
  • Data takes a very long path from the respondent
    to the policy maker
  • Process should be properly documented at each
    step
  • Different needs and perspectives across life
    cycle
  • But it's all about the data/knowledge being
    transformed and repurposed
  • Linkages / information should be maintained
  • Drill from results to source
  • Needed to understand how to use the data and
    interpret the results
  • Information should be reusable
  • Common metadata
  • Shared across projects
  • Dynamic integration of knowledge

41
Importance of Metadata
  • Data Quality
  • Usefulness: accessibility, coherence,
    completeness, relevance, timeliness
  • Undocumented data is useless
  • Partially documented data is risky (misuse)
  • Data discovery and access
  • Preservation
  • Replication standard (Gary King)
  • Information / knowledge exchange
  • Reduce need to access sensitive data
  • Maintain coherence / linkages across the complete
    life cycle (from respondent to policy maker)
  • Reuse

42
Metadata Issues
  • Without producer / archive metadata
  • We do not know what the data is about
  • We lose information about the production
    processes
  • Information can't be properly preserved
  • Researchers can't work, discover data, or perform
    efficient analysis
  • Without researcher metadata
  • Research process is not documented and cannot be
    reproduced (Gary King → replication standard!)
  • Other researchers are not aware of what has been
    done (duplication / lack of visibility)
  • Producers don't know about data usage and quality
    issues
  • Without standards
  • Such information can't be properly managed and
    exchanged between actors or with the public
  • Without tools
  • We can't capture, preserve or share knowledge

43
What is a Survey?
  • More than just data.
  • A complex process to produce data for the purpose
    of statistical analysis
  • Beyond this, a tool to support evidence based
    policy making and results monitoring in order to
    improve living conditions
  • Represents a single point in time and space
  • Need to be aggregated to produce meaningful
    results
  • It is the beginning of the story
  • The microdata is surrounded by a large body of
    knowledge
  • But survey data often come with limited
    documentation
  • Survey documentation can be broken down into
    structured metadata and documents
  • Structured metadata (can be captured using XML)
  • Documents (can be described in structured
    metadata)

44
Microdata Metadata Examples
  • Survey level
  • Data dictionary (variable labels, names,
    formats, ...); see the sketch after this list
  • Questionnaires: questions, instructions, flow,
    universe
  • Dataset structure: files, structure/relationships, ...
  • Survey and processes: concepts, description,
    sampling, stakeholders, access conditions, time
    and spatial coverage, data collection,
    processing, ...
  • Documentation: reports, manuals, guides,
    methodologies, administration, multimedia, maps, ...
  • Across surveys
  • Groups: series, longitudinal, panel, ...
  • Comparability: by design, after the fact
  • Common metadata: concepts, classifications,
    universes, geography
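The sketch below shows the kind of structured record a survey-level data-dictionary entry carries (name, label, format, question text, universe). The field names and example values are illustrative and only loosely follow DDI variable descriptions.

```python
from dataclasses import dataclass, field

@dataclass
class VariableEntry:
    """One illustrative data-dictionary record at the survey level."""
    name: str
    label: str
    var_format: str
    question_text: str = ""
    universe: str = ""
    categories: dict = field(default_factory=dict)

emp = VariableEntry(
    name="EMP_TOTAL",
    label="Total employment at end of reference year",
    var_format="numeric(8)",
    question_text="How many people did this business employ on December 31?",
    universe="All responding establishments",
)
print(f"{emp.name}: {emp.label} [{emp.var_format}]")
```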

45
Questionnaire Example
[Figure: annotated questionnaire excerpt showing instructions, the universe, modules/concepts, questions, classifications (some reusable), and value-level instructions (skips).]
46
Information Technology and Metadata
The eXtensible Markup Language (XML) and related
technologies are used to manage metadata or data.
Document Type Definitions (DTD) and XML Schema are
used to validate an XML document by defining
namespaces, elements, and rules.
Specialized software and database systems can be
used to create and edit XML documents. In the
future the XForms standard will be used.
XML separates the metadata storage from its
presentation. XML documents can be transformed
into something else (HTML, PDF, other XML, etc.)
through the use of the eXtensible Stylesheet
Language: XSL Transformations (XSLT) and XSL
Formatting Objects (XSL-FO).
Very much like a database system, XML documents
can be searched and queried through the use of
XPath or XQuery (a minimal example follows below).
There is no need to create tables, indexes or
define relationships.
XML metadata or data can be published in smart
catalogs, often referred to as registries, that can
be used for discovery of information.
XML documents can be sent like regular files but
are typically exchanged between applications
through Web Services using SOAP and other
protocols.
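As a minimal, hedged example of these ideas, the sketch below embeds a tiny hand-written codebook fragment in the spirit of DDI (the element names are simplified and do not follow the official DDI schema exactly) and queries it with the XPath subset supported by Python's standard library.

```python
import xml.etree.ElementTree as ET

# A tiny, hand-written codebook fragment for illustration only.
codebook_xml = """
<codeBook>
  <dataDscr>
    <var name="EMP_TOTAL">
      <labl>Total employment at end of reference year</labl>
      <varFormat type="numeric"/>
    </var>
    <var name="NAICS2">
      <labl>Two-digit industry code</labl>
      <varFormat type="character"/>
    </var>
  </dataDscr>
</codeBook>
"""

root = ET.fromstring(codebook_xml)

# ElementTree supports a subset of XPath; full XPath/XQuery needs a library such as lxml.
for var in root.findall(".//var"):
    print(var.get("name"), "-", var.findtext("labl"))
```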
47
Metadata specifications for social sciences
  • We need generic structures to capture the
    metadata
  • A single specification is not enough
  • We need a set of metadata structures
  • That can map to each other (to maintain linkages)
  • Will be around for a long time (global adoption,
    strong community support)
  • Based on technology standards (XML)
  • Suggested set
  • Data Documentation Initiative (DDI): survey /
    administrative microdata
  • Statistical Data and Metadata Exchange standard
    (SDMX): aggregated data / time series
  • ISO/IEC 11179: concept management and semantic
    modeling
  • ISO 19115: geographical metadata
  • METS: packaging/archiving of digital objects
  • PREMIS: archival lifecycle metadata
  • XBRL: business reporting
  • Dublin Core: citation metadata

48
The Data Documentation Initiative
  • The Data Documentation Initiative is an XML
    specification to capture structured metadata
    about microdata (broad sense)
  • First generation: DDI 1.0-2.1 (2000-2008)
  • Focus on a single archived instance
  • Second generation: DDI 3.0 (2008)
  • Focus on the life cycle
  • Goes beyond the single-survey concept
  • Multi-purpose
  • Governance: DDI Alliance
  • Membership-based organization (35 members)
  • Data archives, producers, research data centers,
    academics
  • http://www.ddialliance.org/org/index.html

49
DDI Timeline / Status
  • Pre-DDI 1.0
  • 1970s/80s: OSIRIS Codebook
  • 1993: IASSIST Codebook Action Group
  • 1996: SGML DTD
  • 1997: DDI XML
  • 1999: Draft DDI DTD
  • 2000: DDI 1.0
  • Simple survey
  • Archival data formats
  • Microdata only
  • 2003: DDI 2.0
  • Aggregate data (based on matrix structure)
  • Added geographic material to aid geographic
    search systems and GIS users
  • 2003: Establishment of DDI Alliance
  • 2004: Acceptance of a new DDI paradigm
  • Lifecycle model
  • Shift from the codebook-centric / variable-centric
    model to capturing the lifecycle of data
  • Agreement on expanded areas of coverage
  • 2005
  • Presentation of schema structure
  • Focus on points of metadata creation and reuse
  • 2006
  • Presentation of first complete 3.0 model
  • Internal and public review
  • 2007
  • Vote to move to Candidate Version (CR)
  • Establishment of a set of use cases to test
    application and implementation
  • October: 3.0 CR2
  • 2008
  • February: 3.0 CR3
  • March: 3.0 CR3 update
  • April: 3.0 CR3 final
  • April 28th: 3.0 approved by the DDI Alliance
  • May 21st: DDI 3.0 officially announced
  • Initial presentations at IASSIST 2008
  • 2009
  • DDI 3.1 and beyond

50
DDI 2.0 Perspective
[Figure: multiple independent DDI 2 survey instances, each documented separately.]
51
DDI 2 Characteristics and Limitations
  • Characteristics
  • Focuses on the static object of a codebook
  • Designed for limited uses
  • Coverage is focused on single study, single data
    file, simple survey and aggregate data files
  • Variable contains majority of information
    (question, categories, data typing, physical
    storage information, statistics)
  • Limitations
  • Treated as an add-on to the data collection
    process
  • Focus is on the data end product and end users
    (static)
  • Limited tools for creation or exploitation
  • The Variable must exist before metadata can be
    created
  • Producers hesitant to take up DDI creation
    because it is a cost and does not support their
    development or collection process

52
DDI 3.0 and the Survey Life Cycle
  • A survey is not a static process: it dynamically
    evolves across time and involves many
    agencies/individuals
  • DDI 2.x is about archiving; DDI 3.0 spans the
    entire life cycle
  • 3.0 focuses on metadata reuse (minimizes
    redundancies/discrepancies, supports comparison)
  • Also supports multilingual, grouping, geography,
    and others
  • 3.0 is extensible

53
DDI 3.0 Perspective
54
DDI 3.0 Use Cases
  • DDI 3 is composed of several schemas/modules
  • You only use what you need!
  • DDI 3.0 provides the common metadata language to
    maintain links and consistency across the entire
    life cycle
  • Some examples
  • Study design/survey instrumentation
  • Questionnaire generation/data collection and
    processing
  • Data recoding, aggregation and other processing
  • Data dissemination/discovery
  • Archival ingestion/metadata value-add
  • Question / concept / variable banks
  • DDI for use within a research project
  • Capture of metadata regarding data use
  • Metadata mining for comparison, etc.
  • Generating instruction packages/presentations
  • Data sourced from registers
  • The same specification is used across the
    lifecycle by different actors → maintains
    consistency and linkages

55
Benefits (1)
  • Improvement in data quality (usefulness)
  • Discovery, accessibility
  • Coherence, integrity
  • Timeliness
  • Agencies speak the same language and share
    common/compatible metadata structure
  • Across the entire life cycle, from respondent to
    policy maker
  • Improve services
  • Publication: institutional (inside / public),
    community, regional, global
  • User customized products (on the fly)
  • Documentation (subset, profiled for user)
  • Code generator (statistical packages, database,
    etc.)
  • Notification services
  • User feedback / dialog (quality, usage)
  • Foster community space (build knowledge,
    collaboration)
  • Reuse of tools / software / best practices

56
Benefits (2)
  • Harmonization of common metadata
  • Concepts, classifications, terminology (or
    documented mappings)
  • Improved search capabilities
  • Producer, time, geography, concepts
  • Comparability (by design, after the fact)
  • Metadata mining, exploration and visualization
  • Understanding of data usage
  • Reduced burden and cost of ownership
  • Preservation
  • Return on investment
  • Preparation and maintenance costs offset by
    reduction in production, dissemination and
    support costs
  • Build on industry standard technology

57
Metadata and the NORC Data Enclave
  • Advocate and foster use of metadata standards and
    best practices
  • Datasets coming in the enclave are documented
    using DDI
  • In collaboration with data producer
  • Currently using DDI 2
  • Supporting community initiative towards the
    development of new tools, in particular for DDI
    3.0
  • Frontier research in knowledge capture
  • Source code scraping / tagging
  • Better understanding of data usage
  • Collaborative efforts with the University of Chicago
    Computation Institute
  • Leveraging Web 2.0 technologies in social science

58
Outreach / Dissemination
  • Public web site
  • Newsletter
  • Sponsor / participate in workshops/conferences
  • Multiple channels of coordinated outreach
    efforts are most advantageous

59
Public Website (www.norc.org/dataenclave)
60
Enclave Quarterly Newsletter
61
Publications
  • Norman Bradburn, Randy Horton, Julia Lane, and
    Michael Tilkin, "Developing a Data Enclave for
    Sensitive Microdata," Proceedings of the
    International E-Social Science Conference, 2006
  • Julia Lane, "Optimizing Access to Microdata,"
    Journal of Official Statistics, September 2007
  • Julia Lane and Stephanie Shipp, "Using a Remote
    Access Data Enclave for Data Dissemination,"
    International Journal of Digital Curation, 2007
  • Julia Lane, Pascal Heus and Tim Mulcahy, "Data
    Access in a Cyber World: Making Use of
    Cyberinfrastructure," Transactions on Data
    Privacy, 2008
  • Stephanie Shipp, Stephen Campbell, Tim Mulcahy,
    and Ted Allen, "Informing Public Policy on Science
    and Innovation: The Advanced Technology Program's
    Experience," Journal of Technology Transfer, 2008

62
Next Steps
  • Continue applying practical lessons learned /
    feedback for continuous improvement
  • Continue to innovate
  • Develop researcher scholarship collaboration
    incentive programs
  • Identify, implement, and test new collaboratory
    functionality
  • Continue to pilot test remote training platform
  • Develop Executive Council comprised of sponsor
    agencies to help steer the system from a producer
    and user perspective

63
Instant Messaging, Audio, Video, Webcast
64
Q & A
  • QUESTIONS & ANSWERS