Towards a Data Network for Integrated Social Science Research

1
Towards a Data Network for Integrated Social
Science Research
  • Micah Altman
  • Harvard University
  • Archival Director, Henry A. Murray Research
    Archive
  • Associate Director, Harvard-MIT Data Center
  • Senior Research Scientist, Institute for
    Quantitative Social Sciences
  • E: micah_altman_at_harvard.edu
  • W: http://maltman.hmdc.harvard.edu/
  • Presented at the DLF Meeting 2008

2
This Talk
  • Why is Access to Social Science Data Important?
  • What are Challenges to Integrated Access?
  • Social Science and Cyberinfrastructure
  • Google (--?)
  • Dataverse Network (DVN): Virtual Archiving
  • Data Preservation Alliance for Social Sciences
    (Data-PASS): Replicated Institutional
    Preservation
  • The Social Science Research Computing Environment
    (RCE): Social Science Research Workflows
  • Conclusions

3
Related Work
  • Articles
  • M. Altman and G. King. A Proposed Standard for
    the Scholarly Citation of Quantitative Data,
    D-Lib, 13, 3/4 (March/April). 2007.
  • M. Altman, et al., Data Preservation Alliance
    for the Social Sciences: A Model for
    Collaboration, Proceedings of DigCCurr 2007,
    Chapel Hill. April 2007.
  • G. King, An Introduction to the Dataverse
    Network as an Infrastructure for Data Sharing,
    Sociological Methods and Research, 32, 2
    (November, 2007) 173-199.
  • M. Altman, "A Fingerprint Method for
    Verification of Scientific Data," in Advances in
    Systems, Computing Sciences and Software
    Engineering (Proceedings of the International
    Conference on Systems, Computing Sciences and
    Software Engineering 2007), Springer Verlag.
    Forthcoming 2008.
  • Collaborators & Co-conspirators
  • Margaret Adams, Ken Bollen, Cavan Capps, Jonathan
    Crabtree, Darrell Donakowski, Myron Gutmann, Gary
    King, Lois Timms-Ferrarra, Marc Maynard, Amy
    Pienta
  • Research Support
  • Thanks to the Library of Congress (PANDP03-1),
    the National Institutes of Aging (P01
    AG17625-01), the National Science Foundation
    (SES-0318275, IIS-9874747), the Harvard
    University Library, the Institute for
    Quantitative Social Science, the Harvard-MIT Data
    Center, and the Murray Research Archive.

4
What is Digital Social-Science Data?
  • DIGITAL
  • Optical: DVD, CD
  • Magnetic: tapes, floppies
  • Paper: cards, tapes
  • SOCIAL SCIENCE
  • Social: class, crime, social movements, culture,
    folklore, family
  • Economic: wealth, prosperity, labor, business,
    equity
  • Psychology: cognition, attitudes, stereotypes
  • Politics: justice, democracy, public policy,
    public administration, international conflict
  • DATA
  • Raw measurements
  • Numeric tables
  • Administrative records (email)
  • Video and audio interviews, transcripts
    (blogs)
  • Digital objects (web sites, interactive databases)

5
Data Access is the Key to Science
  • Science is not (only) about being scientific
  • Scientific progress requires community:
    competition and collaboration in the pursuit of
    common goals
  • Without access to the same materials no
    community exists; data is the nucleus of
    collaboration.
  • The value of an article that can't be
    replicated?
  • Scholarly articles are summaries, not the actual
    research results
  • But: data access is spotty by field, and finding
    the data is still hard
  • Hard for journal editors to verify. If you find
    it, how do you know it's the same?
  • Replication projects show most published
    articles in social science cannot be
    replicated; data is necessary for replication
    and verification
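
One way to approach "how do you know it's the same?" is to fingerprint the data values rather than the file bytes, so a copy that has been through a format conversion still matches. The sketch below is only an illustration of that idea; it is not the Universal Numeric Fingerprint algorithm referred to elsewhere in this talk. The function names, the 7-significant-digit rounding, and the choice of SHA-256 are all assumptions made for the example.

```python
import hashlib


def canonical_value(x, digits=7):
    """Render a number in scientific notation with a fixed number of
    significant digits, so tiny storage differences disappear."""
    return format(float(x), f".{digits - 1}e")


def data_fingerprint(rows, digits=7):
    """Hash the canonical text form of every value, row by row."""
    h = hashlib.sha256()
    for row in rows:
        line = ",".join(canonical_value(v, digits) for v in row)
        h.update(line.encode("utf-8") + b"\n")
    return h.hexdigest()


# Hypothetical data: the same table before and after a format round-trip,
# plus a copy in which one value was changed.
original = [[1.0, 2.5], [3.0, 4.25]]
reloaded = [[1, 2.5000000001], [3.0, 4.25]]
altered = [[1.0, 2.5], [3.0, 4.26]]

same = data_fingerprint(original) == data_fingerprint(reloaded)
changed = data_fingerprint(original) != data_fingerprint(altered)
```

A journal editor holding only the fingerprint printed in a citation could recompute it on a downloaded copy and compare, without needing the author's original file.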

6
Data Access is a Key to Democracy
  • Statistics: "state-istics"
  • The state tax authority: counting people,
    estimating wealth
  • Reformers use data to assess the performance of
    the state
  • Science informs public policy continually
  • In modern democracy the public needs a direct
    source of information

7
How Data Is Lost
Research by
  • Data Intentionally Discarded
  • "It was just too long ago, I generally keep data
    for something like 10 years beyond the last time
    I do something with them."
  • "Destroyed, in accord with APA 5-year
    post-publication rule."
  • Unintentional Hardware Problems
  • "Some data were collected, but the data file was
    lost in a technical malfunction."
  • Destroyed for Confidentiality Reasons
  • "The material ... was considered sensitive data.
    Institutional review boards ... required us to
    promise to destroy the data after a certain
    period of time ..."
  • Acts of Nature
  • "The data from the studies were on punched cards
    that were destroyed in a flood in the department
    in the early 80s."
  • Discarded or Lost in a Move
  • "As I retired ... Unfortunately, I simply didn't
    have the room to store these data sets at my
    house."
  • Obsolescence
  • "Speech recordings stored on a LISP Machine, an
    experimental computer which is long obsolete."
  • Simply Lost
  • "For all I know, they are on a University
    server, but it has been literally years and years
    since the research was done, and my files are
    long gone."

8
Challenges to Research and Policy
  • Legal Challenges
  • Technical Privacy Challenges
  • Data Deluge
  • New Forms of Research

9
Legal Requirements
10
Technical Privacy Challenges
  • Some challenging findings
  • Large, sparse datasets can leak private
    information when correlated with external data,
    even when significantly sub-sampled, perturbed,
    etc. [Narayanan and Shmatikov 2008]
  • Repeated release of perturbation-masked
    geospatial point data leaks increasing amounts of
    information; combining with aggregation masking
    does not help [Zimmerman and Pavlik 2008]
  • Possible to identify other relationships in a
    network if you can generate seemingly innocuous
    relationships in the same network [Backstrom et
    al. 2007]
  • Pseudonymous communication can be linked through
    textual analysis [Tomkins et al. 2004]
  • K-anonymized data are still vulnerable if
    homogeneous, or if the attacker has enough
    background knowledge; L-diversity offered as a
    replacement [Machanavajjhala et al. 2007]
  • Additional anonymization challenges for
    geospatial data
  • Very fine-grained location versus multi-state
    aggregation mask required by HIPAA, and large
    social science surveys
  • Background knowledge very likely
  • Easy to integrate with other datasets; some data
    points may be directly observable
  • Sequences of locations even more challenging
  • May cross aggregation units; repetitive,
    temporally correlated; induces unique networks
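
The k-anonymity and l-diversity point above can be made concrete with a small check. The sketch below is illustrative only: the records and column names are invented, and it measures "distinct" l-diversity (the number of different sensitive values per group), which is the simplest variant of the idea.

```python
from collections import defaultdict


def group_by_quasi_identifiers(records, quasi_ids):
    """Group records that share the same quasi-identifier values."""
    groups = defaultdict(list)
    for record in records:
        groups[tuple(record[q] for q in quasi_ids)].append(record)
    return groups


def k_anonymity(records, quasi_ids):
    """Size of the smallest group: every person hides among at least k."""
    groups = group_by_quasi_identifiers(records, quasi_ids)
    return min(len(g) for g in groups.values())


def l_diversity(records, quasi_ids, sensitive):
    """Smallest number of distinct sensitive values found in any group."""
    groups = group_by_quasi_identifiers(records, quasi_ids)
    return min(len({r[sensitive] for r in g}) for g in groups.values())


# Hypothetical generalized records (not from any real study).
records = [
    {"zip": "021**", "age": "30-39", "diagnosis": "flu"},
    {"zip": "021**", "age": "30-39", "diagnosis": "flu"},
    {"zip": "021**", "age": "40-49", "diagnosis": "flu"},
    {"zip": "021**", "age": "40-49", "diagnosis": "asthma"},
]

k = k_anonymity(records, ["zip", "age"])
l = l_diversity(records, ["zip", "age"], "diagnosis")
```

Here the table is 2-anonymous, yet the first group is homogeneous: anyone known to be in the 30-39 group is revealed to have the same diagnosis, which is the weakness the slide describes.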

11
Management of Legal Risks
  • Embedding all sensitive data access in a digital
    library can greatly improve subject privacy
  • Authentication, vetting, and access control
  • Standardized license terms governing analysis
    (derived from metadata and data characteristics)
  • Models can be run on-line without access to raw
    data
  • Monitoring and auditing of data use
  • Limit sequence of analyses by a user, in some
    cases (for promising results, see Dwork et al.
    2006)
  • Licensing and Intellectual Property Protections
  • Standard license terms and metadata
  • Click-through agreements, vetting workflows
  • Authentication, auditing, logging
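
As a rough illustration of "models run on-line without access to raw data" combined with "limit sequence of analyses by a user", the sketch below answers counting queries with random noise and refuses further queries once a per-user budget is spent. It follows the general noise-addition idea associated with Dwork et al. 2006; whether this is the specific result the slide has in mind is an assumption, and the class name, budget values, and data are hypothetical.

```python
import random


class BudgetExceeded(Exception):
    """Raised when a user has used up the allowed privacy budget."""


class PrivateCounter:
    def __init__(self, data, total_epsilon, seed=None):
        self._data = list(data)          # raw data never leaves this object
        self.remaining = total_epsilon   # budget left for this user
        self._rng = random.Random(seed)

    def noisy_count(self, predicate, epsilon):
        """Return a count plus Laplace noise with scale 1/epsilon."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if epsilon > self.remaining + 1e-12:
            raise BudgetExceeded("no privacy budget left for this query")
        self.remaining -= epsilon
        true_count = sum(1 for x in self._data if predicate(x))
        # The difference of two exponential draws with rate epsilon is
        # Laplace-distributed with scale 1/epsilon.
        noise = self._rng.expovariate(epsilon) - self._rng.expovariate(epsilon)
        return true_count + noise


# Hypothetical ages; each query spends part of a total budget of 1.0.
ages = [23, 35, 41, 58, 62, 67, 70]
counter = PrivateCounter(ages, total_epsilon=1.0, seed=42)
over_60 = counter.noisy_count(lambda a: a > 60, epsilon=0.5)
```

Smaller epsilon values add more noise per answer but allow more queries before the budget runs out; the archive, not the analyst, decides that trade-off.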

12
Long-Term Social Science Data Needs
  • Social science: human activities and
    perceptions
  • Computational capacity of human brain: 10^14 to
    10^19?
  • Future storage of a human history
  • 10^30 bytes/person?
  • Compare to 10^10 bytes for a long high-res fMRI
    session

Or, what are you thinking?
13
Social Science Data Deluge
  • Collective holdings of all U.S. numeric social
    science data in all major data archives and
    government repositories: estimated 10s of TB
  • Ambient data increasingly becoming subject of
    social science research.
  • Annual data deluge (2002 figures)
  • Web (surface) 167 TB
  • Radio 3,500 TB
  • Television 69,000 TB
  • Web (deep) 92,000 TB
  • Email (originals) 441,000 TB
  • Telephone 18,000,000 TB
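
For a sense of scale, the 2002 figures listed above can be totaled directly. The short sketch below only adds up the numbers on the slide and computes each channel's share; the variable names are made up for the example.

```python
# Annual volumes in terabytes, copied from the list above.
annual_tb_2002 = {
    "Web (surface)": 167,
    "Radio": 3_500,
    "Television": 69_000,
    "Web (deep)": 92_000,
    "Email (originals)": 441_000,
    "Telephone": 18_000_000,
}

total_tb = sum(annual_tb_2002.values())
share = {name: tb / total_tb for name, tb in annual_tb_2002.items()}

for name, tb in annual_tb_2002.items():
    print(f"{name:<18} {tb:>12,} TB  {share[name]:7.3%}")
print(f"{'Total':<18} {total_tb:>12,} TB")
```

Telephone traffic dominates the total, and even the smallest item on the list is larger than the "10s of TB" held in archives.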

Or, what are you thinking?
14
Research Infrastructure Challenges
  • Social science challenges
  • Few definitive answers
  • Complex conceptual primitives
  • Complex theories of behavior
  • Reliance on observational data
  • Specification uncertainty
  • Changing evidence base (blogs, video,
    continuously recorded behavioral data)
  • Some trends
  • Compute-intensive inferential statistics
  • Specification searches
  • Sensitivity analyses
  • Curse of dimensionality
  • Data explosion
  • Changing evidence base
  • Agent-based models

15
Why Infrastructure for Data?
  • Accessibility
  • Most large data sets in public archives
  • Most data in published articles not accessible,
    results not replicable without the original
    author
  • Most data sets from federal grants not publicly
    available
  • Problems even with professional archives
  • Data in different archives have different
    identifiers
  • Archives change identifiers, links
  • Changes to data are made; identifiers are reused
    or removed; old data are lost
  • Data sets are not like books
  • Static data files (even if on the web)
    unreadable after a few years
  • When storage methods change, some data sets are
    lost; others have altered content!
  • Why not a single centralized infrastructure?
  • Single point of failure
  • Data is heterogeneous in format, origin, size,
    effort needed to collect or analyze, IRB access
    rules, etc.
  • Data producers want credit, control, and
    visibility
  • Requirements
  • Recognition, for data producers, distributors,
    related publishers
  • Rule-based Public Distribution
  • Authorization: fulfill requirements the author
    originally met

16
Emerging Technologies Social Science Data
  • Google
  • Virtual-Hosted archives
  • Workflow systems
  • Data networks

Plus Ça Change, Plus C'est Fou
17
Google (--?)
[Figure labels: Privacy? Law? Preservation? Analysis?]

Can you count how many ?s are in this picture?
18
Virtual Archiving: The Dataverse Network
  • An Open-Source, Federated, Web 2.0 Data Network
  • Gateway to over 20,000 social science studies
    (world's largest catalog)
  • Web Virtual Hosting 2.0 Service
  • Federated access to other networks
  • Unified access to major U.S. research data
    archives, government data
  • Open service: endowed hosting
  • Open source: GPL-Affero-3
  • Discovery Services
  • Simple fielded search
  • Virtual collection browsing
  • Management
  • Ingest
  • Curation review
  • Virtual Hosting and administration
  • Metadata delivery
  • Descriptive and structural
  • Provenance (chain-of-custody metadata)
  • Human and OAI interfaces
  • Preservation
  • Standards based
  • Reformatting
  • Universal Numeric Fingerprints
  • Enhanced Delivery
  • Replication
  • Layered analysis services
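
The "OAI interfaces" item refers to harvesting metadata over OAI-PMH. As a sketch of what a harvester does, the code below builds a standard ListRecords request and pulls record identifiers out of a response. The base URL, set name, and sample response are placeholders invented for the example, not actual DVN endpoints or records, and no network call is made.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespace used by OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


def record_identifiers(xml_text):
    """Extract the identifier from each record header in a response."""
    root = ET.fromstring(xml_text)
    return [e.text for e in root.findall(".//oai:header/oai:identifier", OAI_NS)]


# Hypothetical, heavily trimmed response used only for illustration.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>example:study-1</identifier></header></record>
    <record><header><identifier>example:study-2</identifier></header></record>
  </ListRecords>
</OAI-PMH>"""

url = list_records_url("https://archive.example.org/oai", set_spec="demo")
ids = record_identifiers(SAMPLE_RESPONSE)
```

A real harvester would also follow resumption tokens and request a richer metadata format where the archive offers one.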

To date: 132 Dataverses, 23,058 Studies,
576,387 Files (April 28, 2008)
19
DVN Screenshots
http://dvn.iq.harvard.edu/
20
Some Dataverse Uses
  • Future researchers: discovery, linking, forward
    citation, verification, analysis
  • Journals, for replication
  • Authors, for their own data
  • Teachers, in depth analysis
  • Sections of scholarly organizations, to organize
    existing data
  • Granting agencies
  • Research centers
  • Archives
  • Major Research Projects
  • Academic departments, universities, centers,
    libraries

21
DVN Data Citations
  • Citations are a traditional formal mechanism to
    link together intellectual works
  • Citations glue together Regulations,
    Publications, and Evidence
  • But, lack of rules for citing numeric data
  • No consistency in practice
  • No fixed rules for copyeditors
  • Sometimes in the list of references; sometimes a
    casual mention in the text
  • Sometimes the archive is noted
  • Sometimes a version number exists
  • Sometimes the version number is listed (if it
    exists)
  • Archive numbers are sometimes given, if they
    exist
  • Sometimes the author is noted
  • Date of creation is sometimes given
  • URLs often given, rarely persist
  • Dates of access protect the researcher, do not
    help find the data
  • The data may not be available publicly
  • The data may no longer exist

22
A Unified Citation Standard for Quantitative Data
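
The transcript carries no body text for this slide. As a rough sketch of what a consistent data citation could look like, the code below assembles the elements proposed in the Altman and King article listed under Related Work (author, date, title, a persistent identifier, and a Universal Numeric Fingerprint). The exact punctuation, the function name, and every value in the example are assumptions for illustration; the identifier and fingerprint strings are placeholders, not real ones.

```python
def format_data_citation(authors, year, title, identifier, fingerprint=None):
    """Join citation elements in a fixed order, refusing incomplete input."""
    if not authors or not title or not identifier:
        raise ValueError("author, title, and identifier are all required")
    fields = ["; ".join(authors), str(year), f'"{title}"', identifier]
    if fingerprint:
        fields.append(fingerprint)
    return ", ".join(fields)


# Placeholder values only.
citation = format_data_citation(
    authors=["A. Researcher", "B. Analyst"],
    year=2008,
    title="Example Survey",
    identifier="hdl:EXAMPLE/0000",
    fingerprint="UNF:EXAMPLE",
)
```

Requiring the same fields in the same order every time is what removes the "sometimes ... sometimes" inconsistency described on the previous slide.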
23
DVN What's New
  • Timeline
  • Version 1.0 (release) Dec. 2007
  • Version 1.1 March 2008
  • Version 1.2 April 2008
  • New Stuff
  • OAI enhancements: export custom sets (1.2);
    import DC, FGDC (1.1) as well as DDI
  • Data services: zip delivery of remote files
    (1.2); plain-text and tab-delimited exports (1.2)
  • Java 6 Support (1.2)
  • Workflow Support Enhancements
  • Terms of use on login, upload, and download,
    configurable at network, dataverse, and study
    level (1.1, 1.2)
  • Enhanced workflows for account requests, password
    recovery, non-privileged (drop box)
    submissions, submissions review (1.1, 1.2)
  • Network Admin UI Enhancements
  • JHOVE validation of individual studies (1.2)
  • Batch ingest (1.2)
  • Numerous other performance, end-user, curator,
    and network UI enhancements
  • Future: 2.0 (summer)
  • Data Services: save analyses to R, additional
    formats
  • GUI for assigning geographic bounding box for
    study
  • Support harvesting of DVN through LOCKSS

24
Data-PASS
25
Collaboration for Preservation
  • Joint Not-bad practices
  • Identification & selection
  • Metadata
  • Security
  • Confidentiality
  • Shared Catalog
  • Unified Discovery
  • Content exchange
  • Layered Services
  • Partnership Agreements
  • Agreement to establish good practice
  • Preservation copies of data collected
  • Transfer Protocol in case of archival failure
  • Cooperating Operations
  • Central database of leads for acquisition
  • Development of shared procedures
  • Review of acquisitions

"Nothing new that is really interesting comes
without collaboration" -- James Watson
26
Data Rescued: Examples
  • U.S. Information Agency Surveys
  • Directly informed U.S. foreign policy through
    surveys of foreign public opinion
  • Previously, only surveys from 1970-1990 were held
    in the National Archives
  • Collaboration between NARA and Roper to create a
    much more complete series spanning 1950-1990
  • Surveys conducted in Europe, Latin America, and
    Asian countries; subjects include nuclear arms
    control
  • Recent subjects include US-Soviet relations, the
    US strike on Libya, the Soviet invasion of
    Afghanistan, economic matters, terrorism,
    economic summits, arms control, drug trafficking,
    democratization, and conflicts in El Salvador and
    Nicaragua.
  • Longitudinal Study of Personality Development.
  • By Jack and Jeanne Humphrey Block
  • The most intensive study of human personality
    development in existence.
  • Thirty-year longitudinal study.
  • Mixed methods: quantitative, audio, video.
  • More than 100 instruments, and 1000s of measures
    (variables)
  • Resulted in more than 100 publications.
  • (Also shows how whiny kids are more likely to
    grow up to be conservatives.)
  • National Network of State Polls
  • Diverse membership of 50 members in 38 states
  • Covers a tremendous range of local and national
    issues
  • Data imminently at risk

27
Selected Topics & Sponsors
  • Political activity, political activism, voting
    behavior, protest activity, voter registration,
    fundraising, political alienation, relationship
    to the Black community, feminism, racial
    identity, attitudes toward abortion, attitudes
    toward federal programs; television viewing
    habits, effects of having children on the
    marriage, giving too much/little independence,
    discipline, overscheduling, overprotecting,
    measuring levels of success in teaching values,
    self-control, good citizenship, good money
    habits, religion, worries that parents have of
    the future facing their children; problems facing
    parents and children, from drugs, sex, and
    violence to the lack of various family and
    religious values; daycare, mothers working,
    childrearing, taxes, government spending, morals,
    children's issues, economy, jobs, education,
    crime, health care, social security, local school
    administration, standardized testing, impact of
    poor scores on teachers, higher academic
    standards needed, too much/little homework,
    summer school, teachers, administrators, quality
    of academics, discipline matters, class size,
    level of science and math skills taught,
    Shakespeare, life skills, athletics, citizenship;
    role of the US in the world and assessing US
    performance, terrorism, war in Iraq,
    respondent-identified level of understanding of
    foreign affairs, US and foreign aid, assisting
    emerging democracies, enhancing national
    security, image of the US abroad; seriousness of
    welfare problems (abuse, fraud, generational,
    etc.); assessing list of remedies (limit
    duration, require job training, provide day care,
    unannounced visits, business tax breaks for
    hiring recipients, penalize recipients who have
    more children, etc.); profiling welfare
    recipients (e.g. more likely to be better/worse
    parents, lazy or hardworking, from troubled
    families); defining the American Ideal, how to
    teach kids what it means to be American, national
    identity, appreciation of freedoms in the US,
    importance of voting, ashamed of nation's history
    of racism, job US does in teaching immigrant
    children, bi-lingualism, fly an American flag;
    most about the meaning of the rights the
    Constitution guarantees, assessing the level of
    appreciation of those rights in the US and how it
    is perceived by the international community;
    aging; Money Managers on union organizations,
    employers, and labor market institutions; tort
    law reforms; crime and urbanization; law and
    social control; natural disasters; awareness of
    self
  • NSF, NIH, The Danforth Foundation, The Ford
    Foundation, The David and Lucile Packard
    Foundation, Ewing Marion Kauffman Foundation,
    State Farm Insurance, Ronald McDonald House
    Charities, Advertising Council, American
    Federation of Teachers, the Annenberg Institute,
    the George Gund Foundation, the National School
    Boards Association, U.S. Department of Education,
    GE Foundation, Nellie Mae Education Foundation,
    Wallace Foundation, Bill & Melinda Gates
    Foundation, Pew Charitable Trusts, National
    Constitution Center, Alliance for Aging Research,
    American Federation for Aging Research, the
    MacArthur Foundation, NIMH

28
(No Transcript)
29
(No Transcript)
30
Replication as Institutional Insurance
Data-PASS Syndicated Storage Project
  • External Causes of Preservation Failure
  • Third party attacks
  • Institutional funding
  • Change in legal regimes
  • Quis custodiet ipsos custodes?
  • Unintentional curatorial modification
  • Loss of institutional knowledge & skills
  • Intentional removal
  • Change in institutional mission
  • Schema driven: capture inter-archival
    preservation commitments
  • Asymmetric: resource commitments proportional
    to holdings
  • Versioned: versioned data and citations
  • Integration: LOCKSS, Archival Replication
    Schema, DVN technology, archival workflows
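
To illustrate why replication works as insurance, the sketch below audits copies of one file held at several archives by comparing content digests and flagging any copy that disagrees with the majority. This is a deliberately simplified stand-in, not the LOCKSS polling protocol or the Data-PASS replication schema; the archive names and file contents are invented.

```python
import hashlib
from collections import Counter


def digest(content: bytes) -> str:
    """SHA-256 digest of a replica's bytes."""
    return hashlib.sha256(content).hexdigest()


def audit_replicas(replicas):
    """Return (majority digest, sorted list of sites whose copy differs).

    `replicas` maps a site name to the bytes that site currently holds.
    Raises ValueError when no digest is held by more than half the sites.
    """
    digests = {site: digest(content) for site, content in replicas.items()}
    consensus, votes = Counter(digests.values()).most_common(1)[0]
    if votes <= len(replicas) / 2:
        raise ValueError("no majority among replicas; manual review needed")
    damaged = sorted(site for site, d in digests.items() if d != consensus)
    return consensus, damaged


# Hypothetical holdings: one archive's copy has been silently altered.
replicas = {
    "archive_a": b"id,score\n1,42\n",
    "archive_b": b"id,score\n1,42\n",
    "archive_c": b"id,score\n1,24\n",
}
consensus, damaged = audit_replicas(replicas)
```

An unintentional curatorial modification or third-party attack at one institution then shows up as a disagreement, and the damaged copy can be repaired from the others.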

31
Workflow Systems
  • Emerging tools for integration of research
    process in natural sciences
  • Orchestrate Data Collection, Transformation,
    Analysis
  • Examples: Taverna, Kepler, GenePattern, VisTrails
  • Most are science and grid-oriented
  • Address different parts of scholarly work
    lifecycle
  • Not focused on social science tasks
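
The core idea of these systems, orchestrating collection, transformation, and analysis as named steps with a record of what ran, can be sketched in a few lines. The code below is a toy illustration and does not use or imitate the API of Taverna, Kepler, GenePattern, or VisTrails; the step functions and data are invented.

```python
def run_workflow(steps, data=None):
    """Run (name, function) steps in order, recording each step's output."""
    provenance = []
    for name, func in steps:
        data = func(data)
        provenance.append({"step": name, "output": repr(data)})
    return data, provenance


def collect(_):
    # Stand-in for data collection: raw strings, one of them unusable.
    return [" 3", "5 ", "x", "8"]


def transform(raw):
    # Keep only entries that are digits once whitespace is stripped.
    return [int(s) for s in raw if s.strip().isdigit()]


def analyze(values):
    # A trivial analysis: the mean of the cleaned values.
    return sum(values) / len(values)


result, provenance = run_workflow(
    [("collect", collect), ("transform", transform), ("analyze", analyze)]
)
```

The provenance list is what makes the analysis reproducible: anyone can see which steps ran, in what order, and what each one produced.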

Or life on the grid
32
Intersection of DL and Workflows
  • GenePattern
  • Genomics workflow system
  • Supports construction of complex reproducible
    data analysis pipelines
  • Targeted to local operations, but can make use of
    some job queueing systems (LSF, SGE)
  • http://www.broad.mit.edu/cancer/software/genepattern/
  • Integration project
  • Extends coverage of total research lifecycle
  • DVN will store GenePattern analyses as they
    evolve
  • When analyses are published, dissemination,
    preservation and reuse should be seamless
  • Funded project in early planning stage

33
New Social Science
  • From Social Science Research Computing
    Environment Project
  • Assess need for high performance computing among
    social scientists at Harvard
  • Prototype interfaces to make grid computing
    usable by social scientists
  • Examples
  • Harvesting and analysis of blogs for virtual
    political opinion surveys
  • Continuous collection of CSPAN, real-time subject
    coding, continuous dissemination
  • Cell phone data: movement, proximity to others,
    social network analysis
  • Participative goals-based redistricting
  • Agent-based models of emerging institutions
  • fMRI analyses of reaction to political and social
    scenarios
  • Modal Features
  • Analyses emerge through exploration and
    interactions
  • Data collection from non-experimental,
    non-instrumental sources
  • Increasing scale of data
  • Compute limited
  • Data confidentiality
  • High-level analysis tools
  • Remote collaboration is part of projects

Meta-features of social science: messy data,
an abundance of plausible models
34
Mind the Gaps
  • No tool covers entire scholarly research
    lifecycle
  • Most tools immature
  • Poor integration across most tools
  • Many tools for hard science do not meet social
    science needs for non-experimental messy data
    (strange sensors), confidentiality, complex
    inferential methods
  • Decoupling of dissemination, formal publication,
    citation, peer-review
  • No tools integrate comprehensive, standard,
    flexible control over privacy, intellectual
    property

35
For More Information
Dataverse Network Project: http://TheData.Org
Data-PASS Alliance: http://www.icpsr.umich.edu/DATAPASS/
Contact me: http://maltman.hmdc.harvard.edu/