Title: Towards a Data Network for Integrated Social Science Research
1 Towards a Data Network for Integrated Social Science Research
- Micah Altman
- Harvard University
- Archival Director, Henry A. Murray Research Archive
- Associate Director, Harvard-MIT Data Center
- Senior Research Scientist, Institute for Quantitative Social Sciences
- E: micah_altman_at_harvard.edu, W: http://maltman.hmdc.harvard.edu/
- Presented at the DLF Meeting 2008
2 This Talk
- Why is Access to Social Science Data Important?
- What are Challenges to Integrated Access?
- Social Science and Cyberinfrastructure
  - Google (--?)
  - Dataverse Network (DVN): Virtual Archiving
  - Data Preservation Alliance for Social Sciences (Data-PASS): Replicated Institutional Preservation
  - The Social Science Research Computing Environment (RCE): Social Science Research Workflows
- Conclusions
3 Related Work
- Articles
  - M. Altman and G. King, "A Proposed Standard for the Scholarly Citation of Quantitative Data," D-Lib, 13, 3/4 (March/April), 2007.
  - M. Altman, et al., "Data Preservation Alliance for the Social Sciences: A Model for Collaboration," Proceedings of DigCCurr 2007, Chapel Hill, April 2007.
  - G. King, "An Introduction to the Dataverse Network as an Infrastructure for Data Sharing," Sociological Methods and Research, 32, 2 (November, 2007): 173-199.
  - M. Altman, "A Fingerprint Method for Verification of Scientific Data," in Advances in Systems, Computing Sciences and Software Engineering (Proceedings of the International Conference on Systems, Computing Sciences and Software Engineering 2007), Springer Verlag. Forthcoming 2008.
- Collaborators and Co-conspirators
  - Margaret Adams, Ken Bollen, Cavan Capps, Jonathan Crabtree, Darrell Donakowski, Myron Gutmann, Gary King, Lois Timms-Ferrarra, Marc Maynard, Amy Pienta
- Research Support
  - Thanks to the Library of Congress (PANDP03-1), the National Institutes of Aging (P01 AG17625-01), the National Science Foundation (SES-0318275, IIS-9874747), the Harvard University Library, the Institute for Quantitative Social Science, the Harvard-MIT Data Center, and the Murray Research Archive.
4 What is Digital Social-Science Data?
- DIGITAL
  - Optical: DVD, CD
  - Magnetic: tapes, floppies
  - Paper: cards, tapes
- SOCIAL SCIENCE
  - Social: class, crime, social movements, culture, folklore, family
  - Economic: wealth, prosperity, labor, business, equity
  - Psychology: cognition, attitudes, stereotypes
  - Politics: justice, democracy, public policy, public administration, international conflict
- DATA
  - Raw measurements
  - Numeric tables
  - Administrative records (email)
  - Video and audio interviews, transcripts (blogs)
  - Digital objects (web sites, interactive databases)
5 Data Access is the Key to Science
- Science is not (only) about being scientific
- Scientific progress requires community: competition and collaboration in the pursuit of common goals
- Without access to the same materials no community exists; data is the nucleus of collaboration.
- The value of an article that can't be replicated?
- Scholarly articles are summaries, not the actual research results
- But: data access is spotty by field; finding the data is still hard
- Hard for journal editors to verify. If you find it, how do you know it's the same?
- Replication projects show most published articles in social science cannot be replicated; data is necessary for replication and verification
6 Data Access is a Key to Democracy
- Statistics: state-istics
- The state: tax authority, counting people, estimating wealth
- Reformers use data to assess the performance of the state
- Science informs public policy continually
- In modern democracy the public needs a direct source of information
7 How Data Is Lost
Research by
- Data Intentionally Discarded
  - "It was just too long ago; I generally keep data for something like 10 years beyond the last time I do something with them."
  - "Destroyed, in accord with APA 5-year post-publication rule."
- Unintentional Hardware Problems
  - "Some data were collected, but the data file was lost in a technical malfunction."
- Destroyed for Confidentiality Reasons
  - "The material was considered sensitive data. Institutional review boards... required us to promise to destroy the data after a certain period of time..."
- Acts of Nature
  - "The data from the studies were on punched cards that were destroyed in a flood in the department in the early 80s."
- Discarded or Lost in a Move
  - "As I retired... Unfortunately, I simply didn't have the room to store these data sets at my house."
- Obsolescence
  - "Speech recordings stored on a LISP Machine, an experimental computer which is long obsolete."
- Simply Lost
  - "For all I know, they are on a University server, but it has been literally years and years since the research was done, and my files are long gone."
8 Challenges to Research and Policy
- Legal Challenges
- Technical Privacy Challenges
- Data Deluge
- New Forms of Research
9 Legal Requirements
10 Technical Privacy Challenges
- Some challenging findings
  - Large, sparse datasets can leak private information when correlated with external data, even when significantly sub-sampled, perturbed, etc. (Narayanan and Shmatikov 2008)
  - Repeated release of perturbation-masked geospatial point data leaks increasing amounts of information; combining it with aggregation masking does not help (Zimmerman and Pavlik 2008)
  - It is possible to identify other relationships in a network if you can generate seemingly innocuous relationships in the same network (Backstrom, et al. 2007)
  - Pseudonymous communication can be linked through textual analysis (Tomkins et al. 2004)
  - K-anonymized data are still vulnerable if homogeneous, or if the attacker has enough background knowledge; L-diversity offered as replacement (Machanavajjhala, et al. 2007)
- Additional anonymization challenges for geospatial data
  - Very fine grained location versus multi-state aggregation mask required by HIPAA, and large social science surveys
  - Background knowledge very likely
    - Easy to integrate with other datasets
    - Some data points may be directly observable
  - Sequences of locations even more challenging
    - May cross aggregation units
    - Repetitive, temporally correlated
    - Induces unique networks
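
The k-anonymity and l-diversity finding above can be made concrete with a small check. The sketch below is illustrative only: the record layout, the column names, and the helper names `k_anonymity` and `l_diversity` are assumptions introduced here, and it uses the simplest ("distinct values") reading of l-diversity.

```python
from collections import defaultdict

def group_by_quasi_identifiers(records, quasi_ids):
    """Group records that share the same quasi-identifier values."""
    groups = defaultdict(list)
    for rec in records:
        key = tuple(rec[q] for q in quasi_ids)
        groups[key].append(rec)
    return groups

def k_anonymity(records, quasi_ids):
    """Size of the smallest equivalence class (the k actually achieved)."""
    groups = group_by_quasi_identifiers(records, quasi_ids)
    return min(len(g) for g in groups.values())

def l_diversity(records, quasi_ids, sensitive):
    """Fewest distinct sensitive values found in any equivalence class."""
    groups = group_by_quasi_identifiers(records, quasi_ids)
    return min(len({r[sensitive] for r in g}) for g in groups.values())

# Hypothetical, already-generalized records (zip prefix and age band).
records = [
    {"zip": "021**", "age": "30-39", "diagnosis": "flu"},
    {"zip": "021**", "age": "30-39", "diagnosis": "flu"},
    {"zip": "021**", "age": "40-49", "diagnosis": "asthma"},
    {"zip": "021**", "age": "40-49", "diagnosis": "diabetes"},
]

k = k_anonymity(records, ["zip", "age"])
l = l_diversity(records, ["zip", "age"], "diagnosis")
print(k, l)
```

In this toy table every group holds two records, so k is 2, yet the first group contains a single diagnosis, so l is 1: anyone known to fall in that group has their diagnosis revealed despite the k-anonymization.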
11 Management of Legal Risks
- Embedding all sensitive data access in a digital library can greatly improve subject privacy
  - Authentication, vetting, and access control
  - Standardized license terms governing analysis (derived from metadata and data characteristics)
  - Models can be run on-line without access to raw data
  - Monitoring and auditing of data use
  - Limit sequence of analyses by a user, in some cases (for promising results, see Dwork, et al. 2006)
- Licensing and Intellectual Property Protections
  - Standard license terms and metadata
  - Click-through agreements, vetting workflows
  - Authentication, auditing, logging
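
One way to read "limit sequence of analyses by a user" is as a per-user budget on noisy queries, in the spirit of the noise-calibration work cited above. The following is a minimal sketch under that assumption, not a description of any deployed system: the class name `BudgetedCounter`, the budget values, and the sample data are all hypothetical.

```python
import random

class BudgetedCounter:
    """Answers counting queries with Laplace noise until a budget is spent."""

    def __init__(self, data, total_epsilon, seed=None):
        self._data = list(data)
        self._remaining = total_epsilon
        self._rng = random.Random(seed)

    def _laplace(self, scale):
        # The difference of two exponentials with mean `scale` is Laplace(0, scale).
        return self._rng.expovariate(1 / scale) - self._rng.expovariate(1 / scale)

    def count(self, predicate, epsilon):
        """Noisy count of matching records; refuses once the budget is exhausted."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if epsilon > self._remaining:
            raise PermissionError("privacy budget exhausted for this user")
        self._remaining -= epsilon
        true_count = sum(1 for row in self._data if predicate(row))
        # A count changes by at most 1 per person, so the noise scale is 1/epsilon.
        return true_count + self._laplace(1 / epsilon)

# Hypothetical usage: two queries fit in the budget, a third would be refused.
ages = [23, 35, 41, 52, 67]
server = BudgetedCounter(ages, total_epsilon=1.0, seed=0)
first = server.count(lambda a: a >= 40, epsilon=0.5)
second = server.count(lambda a: a < 30, epsilon=0.5)
```

The analyst never sees the raw rows, each answer is perturbed, and the server stops answering once the cumulative budget is used up, which is the "monitoring and auditing" role the digital library plays in the list above.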
12 Long-Term Social Science Data Needs
- Social science: human activities and perceptions
- Computational capacity of human brain: 10^14 to 10^19?
- Future storage of a human history
  - 10^30 bytes/person?
  - Compare to 10^10 bytes for a long high-res FMRI session
Or, what are you thinking?
13 Social Science Data Deluge
- Collective holdings of all U.S. numeric social science data in all major data archives and government repositories: estimated 10s of TB
- Ambient data increasingly becoming subject of social science research.
- Data deluge (2002 annual figures)
  - Web (surface): 167 TB
  - Radio: 3,500 TB
  - Television: 69,000 TB
  - Web (deep): 92,000 TB
  - Email (originals): 441,000 TB
  - Telephone: 18,000,000 TB
Or, what are you thinking?
14 Research Infrastructure Challenges
- Social science challenges
  - Few definitive answers
  - Complex conceptual primitives
  - Complex theories of behavior
  - Reliance on observational data
  - Specification uncertainty
  - Changing evidence base (blogs, video, continuously recorded behavioral data)
- Some trends
  - Compute-intensive inferential statistics
  - Specification searches
  - Sensitivity analyses
  - Curse of dimensionality
  - Data explosion
  - Changing evidence base
  - Agent-based models
15 Why Infrastructure for Data?
- Accessibility
  - Most large data sets in public archives
  - Most data in published articles not accessible; results not replicable without the original author
  - Most data sets from federal grants not publicly available
- Problems even with professional archives
  - Data in different archives have different identifiers
  - Archives change identifiers, links
  - Changes to data are made; identifiers are reused or removed; old data are lost
- Data sets are not like books
  - Static data files (even if on the web) unreadable after a few years
  - When storage methods change some data sets are lost; others have altered content!
- Why not a single centralized infrastructure?
  - Single point of failure
  - Data is heterogeneous in format, origin, size, effort needed to collect or analyze, IRB access rules, etc.
  - Data producers want credit, control, and visibility
- Requirements
  - Recognition, for data producers, distributors, related publishers
  - Rule-based public distribution
  - Authorization: fulfill requirements the author originally met
16 Emerging Technologies and Social Science Data
- Google
- Virtual-Hosted archives
- Workflow systems
- Data networks
Plus Ça Change, Plus C'est Fou
17 Google (--?)
Privacy?
Law?
?
Preservation?
Analysis?
Can you count how many ?s are in this picture?
18 Virtual Archiving: The Dataverse Network
- An Open-Source, Federated, Web 2.0 Data Network
  - Gateway to over 20,000 social science studies (world's largest catalog)
  - Web Virtual Hosting 2.0 Service
  - Federated access to other networks
  - Unified access to major U.S. research data archives, government data
  - Open service: endowed hosting
  - Open source: GPL-Affero-3
- Discovery Services
  - Simple fielded search
  - Virtual collection browsing
- Management
  - Ingest
  - Curation review
  - Virtual hosting and administration
- Metadata delivery
  - Descriptive and structural
  - Provenance (chain-of-custody metadata)
  - Human and OAI interfaces
- Preservation
  - Standards based
  - Reformatting
  - Universal Numeric Fingerprints
- Enhanced Delivery
  - Replication
  - Layered analysis services
To date: 132 Dataverses, 23,058 Studies, 576,387 Files (April 28, 2008)
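
The idea behind the Universal Numeric Fingerprints listed under Preservation is that a fingerprint computed from the data values, rather than from the bytes of a file, survives reformatting. The sketch below illustrates only that idea and is not the UNF algorithm itself, which has its own normalization and encoding rules; the function names, the choice of SHA-256, and the 7-digit precision are assumptions made here.

```python
import base64
import hashlib

def canonical(value, digits=7):
    """Render a number with a fixed count of significant digits."""
    # Exponential notation with digits-1 decimals yields `digits` significant
    # digits regardless of how the value was originally stored.
    return format(float(value), f"+.{digits - 1}e")

def fingerprint(values, digits=7):
    """Hash the canonical form of each value, in order."""
    h = hashlib.sha256()
    for v in values:
        h.update(canonical(v, digits).encode("ascii"))
        h.update(b"\n")  # separator so adjacent values cannot run together
    return base64.b64encode(h.digest()).decode("ascii")

original = [1.5, 2.25, 3.0]
reloaded = ["1.50", "2.250000000001", "3"]  # same data after a format migration
altered = [1.5, 2.26, 3.0]                  # one value changed

same_after_migration = fingerprint(original) == fingerprint(reloaded)
detects_change = fingerprint(original) != fingerprint(altered)
```

Because the hash is taken over normalized values, the migrated copy matches the original while the altered copy does not, which is what lets a citation carry a fingerprint that remains verifiable after an archive reformats the file.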
19 DVN Screenshots
http://dvn.iq.harvard.edu/
20 Some Dataverse Uses
- Future researchers: discovery, linking, forward citation, verification, analysis
- Journals, for replication
- Authors, for their own data
- Teachers, in depth analysis
- Sections of scholarly organizations, to organize existing data
- Granting agencies
- Research centers
- Archives
- Major research projects
- Academic departments, universities, centers, libraries
21 DVN Data Citations
- Citations are a traditional formal mechanism to link together intellectual works
- Citations glue together Regulations, Publications, and Evidence
- But, lack of rules for citing numeric data
  - No consistency in practice
  - No fixed rules for copyeditors
  - Sometimes in the list of references; sometimes a casual mention in the text
  - Sometimes the archive is noted
  - Sometimes a version number exists
  - Sometimes the version number is listed (if it exists)
  - Archive numbers are sometimes given, if they exist
  - Sometimes the author is noted
  - Date of creation is sometimes given
  - URLs often given, rarely persist
  - Dates of access protect the researcher, do not help find the data
  - The data may not be available publicly
  - The data may no longer exist
22 A Unified Citation Standard for Quantitative Data
23 DVN: What's New
- Timeline
  - Version 1.0 (release): Dec. 2007
  - Version 1.1: March 2008
  - Version 1.2: April 2008
- New Stuff
  - OAI enhancements: export custom sets (1.2); import DC, FGDC (1.1) as well as DDI
  - Data services: zip delivery of remote files (1.2); plain-text and tab-delimited exports (1.2)
  - Java 6 support (1.2)
- Workflow Support Enhancements
  - Terms of use on login, upload, and download, configurable at network, dataverse, and study level (1.1, 1.2)
  - Enhanced workflows for account requests, password recovery, non-privileged (drop box) submissions, submissions review (1.1, 1.2)
- Network Admin UI Enhancements
  - JHOVE validation of individual studies (1.2)
  - Batch ingest (1.2)
- Numerous other performance, end-user, curator, and network UI enhancements
- Future: 2.0 (summer)
  - Data services: save analyses to R, additional formats
  - GUI for assigning geographic bounding box for study
  - Support harvesting of DVN through LOCKSS
24 Data-PASS
25 Collaboration for Preservation
- Joint Not-bad practices
  - Identification and selection
  - Metadata
  - Security
  - Confidentiality
- Shared Catalog
  - Unified Discovery
  - Content exchange
  - Layered Services
- Partnership Agreements
  - Agreement to establish good practice
  - Preservation copies of data collected
  - Transfer Protocol in case of archival failure
- Cooperating Operations
  - Central database of leads for acquisition
  - Development of shared procedures
  - Review of acquisitions
"Nothing new that is really interesting comes without collaboration" -- James Watson
26 Data Rescued: Examples
- U.S. Information Agency Surveys
  - Directly informed U.S. foreign policy through surveys of foreign public opinion
  - Previously, only surveys from 1970-1990 were held in the national archives
  - Collaboration between NARA and Roper to create a much more complete series spanning 1950-1990
  - Surveys conducted in Europe, Latin America, and Asian countries include nuclear arms control,
  - Recent subjects include US-Soviet relations, US strike on Libya, Soviet Union invasion of Afghanistan, and economic matters, terrorism, economic summits, arms control, and the Soviet actions in Afghanistan, drug trafficking, democratization, and conflicts in El Salvador and Nicaragua.
- Longitudinal Study of Personality Development
  - By Jack and Jeanne Humphrey Block
  - The most intensive study of human personality development in existence.
  - Thirty year longitudinal study.
  - Mixed methods: quantitative, audio, video.
  - More than 100 instruments, and 1000s of measures (variables)
  - Resulted in more than 100 publications.
  - (Also shows how whiny kids are more likely to grow up to be conservatives.)
- National Network of State Polls
  - Diverse membership of 50 members in 38 states
  - Covers a tremendous range of local and national issues
  - Data imminently at risk
27 Selected Topics and Sponsors
- Political activity, political activism, voting behavior, protest activity, voter registration, fundraising, political alienation, relationship to the Black community, feminism, racial identity, attitudes toward abortion, attitudes toward federal programs; television viewing habits; effects of having children on the marriage, giving too much/little independence, discipline, overscheduling, overprotecting, measuring levels of success in teaching values, self-control, good citizenship, good money habits, religion; worries that parents have of the future facing their children; problems facing parents and children, from drugs, sex, violence to the lack of various family and religious values; daycare, mothers working, childrearing; taxes, government spending, morals, children's issues, economy, jobs, education, crime, health care, social security; local school administration, standardized testing, impact of poor scores on teachers, higher academic standards needed, too much/little homework, summer school; teachers, administrators, quality of academics, discipline matters, class size, level of science and math skills taught, Shakespeare, life skills, athletics, citizenship; role of the US in the world and assessing US performance, terrorism, war in Iraq, respondent-identified level of understanding of foreign affairs, US and foreign aid, assisting emerging democracies, enhancing national security, image of the US abroad; seriousness of welfare problems (abuse, fraud, generational, etc.); assessing list of remedies (limit duration, require job training, provide day care, unannounced visits, business tax breaks for hiring recipients, penalize recipients who have more children, etc.); profiling welfare recipients (e.g. more likely to be better/worse parents, lazy or hardworking, from troubled families); defining the American Ideal, how to teach kids what it means to be American, national identity, appreciation of freedoms in the US, importance of voting, ashamed of nation's history of racism, job US does in teaching immigrant children, bi-lingualism, fly an American flag; most about the meaning of the rights the Constitution guarantees, assessing the level of appreciation of those rights in the US and how it is perceived to the international community; aging; Money Managers on union organizations, employers, and labor market institutions; tort law reforms; crime and urbanization; law and social control; natural disasters; awareness of self
- NSF, NIH, The Danforth Foundation, The Ford Foundation, The David and Lucile Packard Foundation, the Ewing Marion Kauffman Foundation, State Farm Insurance, Ronald McDonald House Charities, Advertising Council, American Federation of Teachers, the Annenberg Institute, the George Gund Foundation, the National School Boards Association, U.S. Department of Education, GE Foundation, Nellie Mae Education Foundation, Wallace Foundation, Bill & Melinda Gates Foundation, Pew Charitable Trust, National Constitution Center, Alliance for Aging Research, American Federation for Aging Research, the MacArthur Foundation, NIMH
30 Replication as Institutional Insurance
Data-PASS Syndicated Storage Project
- External Causes of Preservation Failure
  - Third party attacks
  - Institutional funding
  - Change in legal regimes
- Quis custodiet ipsos custodes?
  - Unintentional curatorial modification
  - Loss of institutional knowledge and skills
  - Intentional removal
  - Change in institutional mission
- Schema driven: capture inter-archival preservation commitments
- Asymmetric: resource commitments proportional to holdings
- Versioned: versioned data and citations
- Integration: LOCKSS, Archival Replication Schema, DVN technology, archival workflows
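
The "asymmetric" and "schema driven" points above can be illustrated with a small calculation and audit. This is a sketch under stated assumptions, not the Data-PASS replication schema: the archive and collection names, the sizes, and the functions `commitments` and `audit` are hypothetical, and "proportional to holdings" is read here as each partner pledging storage in proportion to what it holds.

```python
def commitments(holdings_tb, copies):
    """Storage each archive pledges: its share of the total replicated volume."""
    total = sum(holdings_tb.values())
    needed = total * copies
    return {name: needed * size / total for name, size in holdings_tb.items()}

def audit(placement, pledged_tb, sizes_tb, copies):
    """List collections with too few replicas and archives over their pledge."""
    problems = []
    used = {name: 0.0 for name in pledged_tb}
    for coll, hosts in placement.items():
        distinct = set(hosts)
        if len(distinct) < copies:
            problems.append(f"{coll}: only {len(distinct)} of {copies} copies")
        for host in distinct:
            used[host] += sizes_tb[coll]
    for name, amount in used.items():
        if amount > pledged_tb[name]:
            problems.append(f"{name}: {amount} TB stored, {pledged_tb[name]} TB pledged")
    return problems

# Hypothetical partners and collections.
holdings_tb = {"archive_a": 6.0, "archive_b": 3.0, "archive_c": 1.0}
sizes_tb = {"coll_1": 6.0, "coll_2": 3.0, "coll_3": 1.0}
placement = {
    "coll_1": ["archive_a", "archive_b"],
    "coll_2": ["archive_b", "archive_a"],
    "coll_3": ["archive_c"],
}

pledged = commitments(holdings_tb, copies=2)
problems = audit(placement, pledged, sizes_tb, copies=2)
```

Recording the commitments as data makes them checkable: in this toy plan the audit flags `coll_3` for having a single copy and `archive_b` for hosting more than it pledged, the kind of drift that an unwritten inter-archival agreement would leave unnoticed.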
31 Workflow Systems
- Emerging tools for integration of research process in natural sciences
- Orchestrate: data collection, transformation, analysis
- Examples: Taverna, Kepler, GenePattern, VisTrails
- Most are science and grid-oriented
- Addresses different parts of scholarly work lifecycle
- Not focused on social science tasks
Or life on the grid
32 Intersection of DL and Workflows
- GenePattern
  - Genomics workflow system
  - Supports construction of complex reproducible data analysis pipelines
  - Targeted to local operations, but can make use of some job queueing systems (LSF, SGE)
  - http://www.broad.mit.edu/cancer/software/genepattern/
- Integration project
  - Extends coverage of total research lifecycle
  - DVN will store GenePattern analyses as they evolve
  - When analyses are published, dissemination, preservation and reuse should be seamless
  - Funded project in early planning stage
33 New Social Science
- From Social Science Research Computing Environment Project
  - Assess need for high performance computing among social scientists at Harvard
  - Prototype interfaces to make grid computing usable by social scientists
- Examples
  - Harvesting and analysis of blogs for virtual political opinion surveys
  - Continuous collection of CSPAN, real-time subject coding, continuous dissemination
  - Cell phone data: movement, proximity to others, social network analysis
  - Participative goals-based redistricting
  - Agent-based models of emerging institutions
  - FMRI analyses of reaction to political and social scenarios
- Modal Features
  - Analyses emerge through exploration and interactions
  - Data collection from non-experimental, non-instrumental sources
  - Increasing scale of data
  - Compute limited
  - Data confidentiality
  - High-level analysis tools
  - Remote collaboration is part of projects
Meta-features of social science: messy data, an abundance of plausible models
34 Mind the Gaps
- No tool covers entire scholarly research lifecycle
- Most tools immature
- Poor integration across most tools
- Many tools for hard science do not meet social science needs for non-experimental messy data (strange sensors), confidentiality, complex inferential methods
- Decoupling of dissemination, formal publication, citation, peer-review
- No tools integrate comprehensive, standard, flexible control over privacy, intellectual property
35 For More Information
Dataverse Network Project: http://TheData.Org
Data-PASS Alliance: http://www.icpsr.umich.edu/DATAPASS/
Contact me: http://maltman.hmdc.harvard.edu/