Title: A brief history of markup of social science data: from punched cards to the life cycle approach
1. A brief history of markup of social science data: from punched cards to the life cycle approach
- Presentation to the International symposium on XML for the long haul (Balisage 2010 pre-conference)
- By Laine G.M. Ruus, Librarian emeritus, University of Toronto
- 2010-08-02
- http://www.chass.utoronto.ca/laine/misc/balisage2010.ppt
2. Overview
- What are data (that is, quantitative social science data)?
- History of social science quantitative data and metadata
- Lessons learned
3. What are data?
4. Data are
- Representations of selected characteristics of a population of entities, e.g. individuals, companies, periods of time, etc.
- Characteristics are grouped, and variations of a characteristic are assigned (normally) numeric values
- Assigning numeric values to variations of a characteristic allows their manipulation by mathematical/statistical procedures
5. wisdom, knowledge, information (statistics), data
6. Data and statistics are not equals
- Statistics are of two kinds
- Descriptive statistics: summaries of common characteristics of the raw data units (one-way tables, two-way tables, multi-way tables)
- Inferential statistics: measure the strength and direction of relationships among characteristics of raw data units
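As a minimal sketch of the descriptive tables mentioned above, the following builds a one-way and a two-way table from numerically coded microdata. The codes, labels and records are hypothetical and chosen only for illustration; they are not taken from the presentation.

```python
from collections import Counter

# Hypothetical value labels: numeric codes assigned to variations of a characteristic.
GENDER = {1: "male", 2: "female"}
PROVINCE = {24: "Quebec", 35: "Ontario"}

# Hypothetical microdata: one (province, gender) pair per respondent.
records = [(35, 1), (35, 2), (24, 2), (35, 2), (24, 1), (24, 2)]

def one_way(rows, column):
    """Frequency of each code in one column (a one-way table)."""
    return Counter(row[column] for row in rows)

def two_way(rows, row_col, col_col):
    """Joint frequency of two columns (a two-way table, or cross-tabulation)."""
    return Counter((row[row_col], row[col_col]) for row in rows)

gender_table = one_way(records, 1)
cross_table = two_way(records, 0, 1)

for code, count in sorted(gender_table.items()):
    print(f"{GENDER[code]:<8}{count}")
for (prov, sex), count in sorted(cross_table.items()):
    print(f"{PROVINCE[prov]:<10}{GENDER[sex]:<8}{count}")
```

The tables are summaries: the individual records cannot be recovered from them, which is why the slides treat data and statistics as different things.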
7. Of course, statistics (descriptive or inferential) can become data in their turn, and be used in other statistical procedures.
9. Data and statistics are not equals (cont'd)
- I.e., data are
- the raw materials from which statistics are generated
- ideally, available at the level at which the data were originally collected (microdata)
- need to be manipulated with statistical software in order to be comprehensible
10. Data
11. raw data
12. Metadata
13. record layout
14. variable description (aka data dictionary)
15. province, gender
16. syntax file for SPSS
17. Metadata are
- Instructions to explain the content and coding of a data set (whether numeric, alphabetic, or other), and aid in its correct interpretation
- Can be intended for human or computer consumption, but are ideally both
18. Raw data plus a syntax file, processed through a statistical software package, result in a system file (average shelf life: less than 10 years)
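The relationship between raw data, record layout and data dictionary can be sketched as follows. This is an illustration only: the layout, variable names, codes and data lines are hypothetical, and plain Python stands in for a statistical package's syntax file.

```python
# Hypothetical record layout: variable name -> (start column, end column), 1-based inclusive.
LAYOUT = {"id": (1, 4), "province": (5, 6), "gender": (7, 7), "age": (8, 9)}

# Hypothetical data dictionary: value labels and missing-data codes per variable.
VALUE_LABELS = {
    "province": {24: "Quebec", 35: "Ontario"},
    "gender": {1: "male", 2: "female"},
}
MISSING = {"age": {99}}

# Raw data: fixed-width lines that are unreadable without the metadata above.
RAW_DATA = """\
000135142
000224299
000335267
"""

def read_record(line):
    """Cut one raw line into variables using the record layout."""
    record = {}
    for name, (start, end) in LAYOUT.items():
        value = int(line[start - 1:end])
        if value in MISSING.get(name, set()):
            value = None  # declared missing-data code
        record[name] = value
    return record

def label(name, value):
    """Translate a numeric code into its value label, if one is defined."""
    return VALUE_LABELS.get(name, {}).get(value, value)

records = [read_record(line) for line in RAW_DATA.splitlines()]
for rec in records:
    print({name: label(name, value) for name, value in rec.items()})
```

If the layout and dictionary are lost, the string `000224299` can no longer be interpreted, even though every byte of the data file survives.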
19. The beginnings
- Hollerith cards first used to process the 1890 US census of population
- By the 1930s, public opinion polling was being used to, e.g., predict electoral outcomes
- 1936: Literary Digest poll predicted defeat of Roosevelt in the US presidential election
- Data-gathering make-work projects in the 1930s in the US, such as economic censuses, surveys on unemployment, crop production, etc.
20. By the 1940s
- Polling and survey taking matured
- Beginnings of improved sampling methods, such as Gallup's quota samples
- 1948: polls chose Dewey over Truman in the US presidential election, leading to the formation of a committee to determine the cause of the error
- The Roper Center was created, the first data archive (1946)
- Data stored on punched cards, and analyzed using card sorters and similar equipment
- And metadata usually looked like this
21. The metadata for the May 1945 Canadian Gallup Poll...
22. The 1950s
- UNIVAC 1, the first alphanumeric computer
- UNIVAC 1 correctly predicted the Eisenhower sweep in the 1952 US presidential election
- MIT began working on keyboard entry
- Development of the COBOL compiler and Fortran
- Magnetic tapes, at 200 bpi, could store the contents of 70,000 punched cards, i.e. about 5.6 megabytes of data
- Lucci and Rokkan promoted the idea of data management by libraries
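The 5.6 megabyte figure can be checked with a short calculation. The sketch assumes the standard 80-column card, one byte per card column, and one character per tape frame. The tape length is a rough estimate that ignores inter-record gaps.

```python
COLUMNS_PER_CARD = 80   # assumption: standard 80-column punched card
CARDS = 70_000          # figure quoted on the slide
DENSITY_BPI = 200       # tape density quoted on the slide, read as characters per inch

characters = CARDS * COLUMNS_PER_CARD
megabytes = characters / 1_000_000          # decimal megabytes, one byte per character
tape_feet = characters / DENSITY_BPI / 12   # assumption: no inter-record gaps

print(f"{characters:,} characters = {megabytes} MB on roughly {tape_feet:,.0f} ft of tape")
```

Under these assumptions the cards fill a little over 2,300 feet of tape, which is close to one full reel if a 2,400-foot reel is assumed.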
23. But the metadata for the August 1958 Canadian Gallup poll still looked like this
24. 1960s
- Development of BASIC, the Unix operating system, and ASCII, which allowed interchange of data among different computers
- Statistical software packages: DATA-TEXT, SPSS, P-STAT, BIOMED, NUCROS, SAS
- Magnetic tapes moved from 556 to 800 bpi
- Most social scientists were still writing their own local software, or using card-sorters and calculators to produce cross-tabulations and compute chi-squares
25. 1970s: a watershed decade
- Microprocessors, and 8-inch and later 5-1/4-inch diskettes
- Wang word processor, Ataris, Apple 1 and the Commodore PET
- dBASE, VISICALC and WORD STAR
- ARPANET and expansion of time-sharing and online systems
- Online bibliographic services such as Dialog, BRS, and Orbit
26. 1970s (cont'd)
- David Nasatir wrote the first manual on data management under the aegis of UNESCO (1972)
- Mid-decade saw the creation of IASSIST, and the first training at ICPSR for data librarians
- US census of population 1970 partly disseminated on computer tapes instead of print, forcing libraries to consider this new medium
27. 1970s (cont'd)
- OSIRIS software developed at the University of Michigan; included statistical capabilities as well as outstanding data and metadata management
- NSF funded the National Conference on Cataloging and Information Services for Machine-Readable Data Files at Airlie House in Virginia
- US Department of Justice funded the project which resulted in Roistacher's Style manual for machine-readable data files, covering bibliographic identity, methodology, and data dictionary
28. An OSIRIS codebook generally followed the Roistacher recommendations. The record layout and data dictionary portion looked like this
29. 1980s
- Supercomputers and NSFNET changed the face of large-scale computing, and PCs and Macs did the same for small-scale computing
- BITNET, followed by the Internet, provided e-mail, listservs and remote login
- Tape cartridges held the equivalent of 8 million cards, or four times that of a 6250 bpi tape. Five-megabyte hard drives became available for microcomputers
- IBM brought microcomputing to the academic sector
- CD-ROMs, and the Cuadra directory of databases
30. 1980s (cont'd)
- Sue Dodd's Cataloging machine-readable data files: an interpretive manual, 1982
- Social forces: one of the first journals to include guidelines on citing machine-readable data files
- Population index: the first bibliographic journal to cite data files
- A draft revision of AACR2 chapter 9 (renamed Computer Files) was published in 1987: bibliographic control for data files
31. 1990s
- Migration from IBM mainframes (EBCDIC) to Unix (ASCII)
- Demise of tapes for storage, in favour of widespread use of CD-ROM
- Statistics Canada made the electronic products from the census the primary product
- Gopher, developed in 1991, was replaced by the WWW and HTML, and by 1996 there were about 100,000 web servers
- Beginning of the DDI (Data Documentation Initiative) project in 1995, which published its first DTD in 1996
32. Three major developments led up to DDI
- OSIRIS metadata management capability
- Roistacher's outline of machine-readable data file documentation (1980)
- Dodd's cataloguing manual (1982)
33. OSIRIS metadata
- OSIRIS dictionary provided structural information: location, size, missing data, a variable name and a variable label (brief)
- OSIRIS codebook provided a tagged format:
- Introduction (unstructured)
- Full question text
- Variable values and value labels
- Variable-level comments
- North American institutions standardized on the OSIRIS type-1 and type-4 codebooks, Europe on the type-3 format codebook
34. Roistacher's style manual
- Provided an outline of the information that should be contained in the full metadata (aka codebook), including
- Bibliographic identity
- Project history
- File processing summary
- Data dictionary contents
- Recommended appendices
35. Sue Dodd's cataloguing manual
- Further refined the bibliographic identity component of the metadata
- Provided a cross-walk to AACR cataloguing rules
- Provided the foundation for the development of a MARC record
- Dodd also defined the components of a bibliographic citation
37. Many kinds of metadata for many purposes
- Data collection
- Data interpretation
- Data preservation
- Data discovery
- Coding standardization
38. Based on the NISO metadata classification
- Categories: descriptive, structural, administrative
- MARC records
- RAD records
- Thesauri
- Concordances
- Syntax files for e.g. SAS or SPSS
- Programming syntax
- Record layouts
- Data dictionaries
- Missing data specifications
- Definitions of derived variables
- Project conception, implementation and funding
- Methodology reports, sampling frames, etc.
- Questionnaires and data collection protocols
- Interviewer instructions
- Post-processing, weighting, etc.
- Access and dissemination restrictions
- Question banks
39. DDI provides a format
- From which other subtypes of metadata (bibliographic records, syntax files, question banks, etc.) can be generated
- Describes not just microdata, but also provides an intelligent means of describing aggregate statistics as data
- Can incorporate all documentation from original project conception to edition management and post-processing
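To illustrate how a syntax file might be generated from structured metadata, the sketch below reads a small XML fragment and writes SPSS-style commands. The fragment is a simplified assumption modelled loosely on DDI Codebook element names (`var`, `location`, `labl`, `catgry`, `catValu`), without namespaces or most required elements. The function `spss_syntax` and the file name `survey.dat` are hypothetical, and the output has not been checked against SPSS itself.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical metadata fragment (not a complete or validated DDI instance).
DDI_FRAGMENT = """\
<codeBook>
  <dataDscr>
    <var name="province">
      <location StartPos="5" EndPos="6"/>
      <labl>Province of residence</labl>
      <catgry><catValu>24</catValu><labl>Quebec</labl></catgry>
      <catgry><catValu>35</catValu><labl>Ontario</labl></catgry>
    </var>
    <var name="gender">
      <location StartPos="7" EndPos="7"/>
      <labl>Gender of respondent</labl>
      <catgry><catValu>1</catValu><labl>Male</labl></catgry>
      <catgry><catValu>2</catValu><labl>Female</labl></catgry>
    </var>
  </dataDscr>
</codeBook>
"""

def spss_syntax(ddi_xml, data_file):
    """Build SPSS-style DATA LIST, VARIABLE LABELS and VALUE LABELS commands."""
    root = ET.fromstring(ddi_xml)
    columns, var_labels, value_labels = [], [], []
    for var in root.iter("var"):
        name = var.get("name")
        loc = var.find("location")
        start, end = loc.get("StartPos"), loc.get("EndPos")
        columns.append(f"{name} {start}" if start == end else f"{name} {start}-{end}")
        # findtext matches direct children only, so this is the variable's own label.
        var_labels.append(f"{name} '{var.findtext('labl')}'")
        codes = " ".join(
            f"{cat.findtext('catValu')} '{cat.findtext('labl')}'"
            for cat in var.findall("catgry")
        )
        value_labels.append(f"{name} {codes}")
    return "\n".join([
        f"DATA LIST FILE='{data_file}' FIXED",
        "  /" + " ".join(columns) + ".",
        "VARIABLE LABELS",
        "  " + "\n  /".join(var_labels) + ".",
        "VALUE LABELS",
        "  " + "\n  /".join(value_labels) + ".",
    ])

syntax = spss_syntax(DDI_FRAGMENT, "survey.dat")
print(syntax)
```

The same parsed structure could equally be written out as a SAS program or a bibliographic record, which is the sense in which a single metadata format can feed several outputs.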
40. DDI provides a format (cont'd)
- 3rd generation data access tools (Nesstar, DDI, and Dataverse (VDC)) all support DDI 2.0 at present and offer a useful way to provide on-line remote distributed access to data discovery and data
- Leads to proliferation of new applications of metadata and realization of initiatives from earlier decades
41. Lessons learned
- Three killers of data
- Software dependence
- Lost metadata
- Physical medium on which data are stored
- No solution as yet combines data, full metadata and statistical capability in a non-software-dependent format