1
A brief history of markup of social science data: from punched cards to the life cycle approach
  • Presentation to the International symposium on XML for the long haul (Balisage 2010 pre-conference)
  • By Laine G.M. Ruus, Librarian emeritus, University of Toronto
  • 2010-08-02
  • http://www.chass.utoronto.ca/laine/misc/balisage2010.ppt

2
Overview
  • What are data (that is, quantitative social
    science data)?
  • History of social science quantitative data and
    metadata
  • Lessons learned

3
What are data?
4
Data are
  • Representations of selected characteristics of a population of entities, e.g. individuals, companies, periods of time, etc.
  • Characteristics are grouped, and variations of a characteristic are assigned (normally) numeric values
  • Assigning numeric values to variations of a characteristic allows their manipulation by mathematical/statistical procedures (a minimal coded sketch follows below)
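A minimal sketch (an illustration, not from the original slides) of numeric coding in Python: two hypothetical characteristics, province and gender, are mapped to numeric values so that the resulting records can be tabulated and analysed.

```python
# Hypothetical coding scheme: each variation of a characteristic
# (here province and gender) is assigned a numeric value.
PROVINCE_CODES = {"Ontario": 35, "Quebec": 24, "Alberta": 48}
GENDER_CODES = {"male": 1, "female": 2}

# One record per entity (here, individual respondents),
# stored as numeric codes rather than as text.
respondents = [
    {"province": PROVINCE_CODES["Ontario"], "gender": GENDER_CODES["female"]},
    {"province": PROVINCE_CODES["Quebec"],  "gender": GENDER_CODES["male"]},
]

# Numeric codes can be counted, tabulated, correlated, etc.
ontario_count = sum(1 for r in respondents if r["province"] == PROVINCE_CODES["Ontario"])
print(ontario_count)  # -> 1
```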

5
wisdom ← knowledge ← information (statistics) ← data
6
Data and statistics are not the same
  • Statistics are of two kinds:
  • Descriptive statistics: summaries of common characteristics of the raw data units (one-way tables, two-way tables, multi-way tables)
  • Inferential statistics: measures of the strength and direction of relationships among characteristics of raw data units (see the sketch below)
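A minimal sketch of the distinction (hypothetical data, not from the slides), using pandas and SciPy: a two-way table as a descriptive statistic, and a chi-square test as an inferential one.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# A few hypothetical coded records (gender: 1 = male, 2 = female;
# employed: 0 = no, 1 = yes).
data = pd.DataFrame({
    "gender":   [1, 1, 2, 2, 1, 2, 2, 1],
    "employed": [1, 0, 1, 1, 0, 1, 0, 1],
})

# Descriptive statistic: a two-way (cross-tabulation) table.
table = pd.crosstab(data["gender"], data["employed"])
print(table)

# Inferential statistic: a chi-square test of the strength of the
# association between the two characteristics.
chi2, p_value, dof, expected = chi2_contingency(table)
print(chi2, p_value)
```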

7
Of course, statistics (descriptive or inferential) can become data in their turn, and be used in other statistical procedures.
8
(No Transcript)
9
Data and statistics are not the same (cont'd)
  • I.e., data are:
  • the raw materials from which statistics are generated
  • ideally, available at the level at which the data were originally collected (microdata)
  • in need of manipulation with statistical software in order to be comprehensible

10
Data
11
raw data
12
Metadata
13
record layout
14
variable description (aka data dictionary)
15
province
gender
16
syntax file for SPSS
17
Metadata are
  • Instructions that explain the content and coding of a data set (whether numeric, alphabetic, or other) and aid in its correct interpretation
  • Can be intended for human or computer consumption, but ideally serve both (see the sketch below)
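A minimal sketch (assumed for illustration, not taken from the slides) of metadata in a form that both a person and a program can use: a small Python data dictionary for two hypothetical variables, recording their location in each record, labels, value labels, and missing-data codes.

```python
# Hypothetical data dictionary: one entry per variable, readable by a
# person and usable directly by a program.
DATA_DICTIONARY = {
    "prov": {
        "columns": (0, 2),          # character positions in each raw record
        "label": "Province of residence",
        "values": {35: "Ontario", 24: "Quebec", 48: "Alberta"},
        "missing": [99],
    },
    "sex": {
        "columns": (2, 3),
        "label": "Gender of respondent",
        "values": {1: "male", 2: "female"},
        "missing": [9],
    },
}

def describe(varname: str) -> str:
    """Render one variable's metadata for human consumption."""
    v = DATA_DICTIONARY[varname]
    return f"{varname}: {v['label']}, columns {v['columns']}, values {v['values']}"

print(describe("prov"))
```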

18
Raw data + a syntax file, processed through a statistical software package, results in a system file (average shelf life: less than 10 years). A rough analogue of that pipeline is sketched below.
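A rough Python analogue (an assumption for illustration, not the author's actual workflow): the record layout drives the reading of raw fixed-column data, value labels are applied, and the result is saved as a binary "system file" that only compatible software versions can read back, which is where the short shelf life comes from.

```python
import io
import pandas as pd

# Raw fixed-column data: columns 1-2 = province code, column 3 = gender code.
raw = io.StringIO("351\n242\n359\n")

# The record layout, expressed as half-open column spans.
colspecs = [(0, 2), (2, 3)]
names = ["prov", "sex"]
df = pd.read_fwf(raw, colspecs=colspecs, names=names)

# Apply value labels and a missing-data code from the data dictionary.
df["sex"] = df["sex"].map({1: "male", 2: "female", 9: None})

# Write a binary "system file".  Like an SPSS or OSIRIS system file, this is
# convenient but software- and version-dependent, hence the short shelf life.
df.to_pickle("survey_system_file.pkl")
```
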
19
The beginnings
  • Hollerith cards were first used to process the 1890 US census of population
  • By the 1930s, public opinion polling was being used to, for example, predict electoral outcomes
  • The 1936 Literary Digest poll (incorrectly) predicted the defeat of Roosevelt in the US presidential election
  • Data-gathering make-work projects in the 1930s in the US, such as economic censuses and surveys on unemployment, crop production, etc.

20
By the 1940s
  • Polling and survey taking matured
  • Beginnings of improved sampling methods, such as Gallup's quota samples
  • The 1948 polls chose Dewey over Truman in the US presidential election, leading to the formation of a committee to determine the cause of the error
  • The Roper Center, the first data archive, was created (1946)
  • Data were stored on punched cards, and analyzed using card sorters and similar equipment
  • And metadata usually looked like this...

21
The metadata for the May 1945 Canadian Gallup
Poll..
22
The 1950s
  • UNIVAC I, the first alphanumeric computer
  • A UNIVAC I correctly predicted the Eisenhower sweep in the 1952 US presidential election
  • MIT began working on keyboard entry
  • Development of the COBOL compiler and Fortran
  • Magnetic tapes, at 200 bpi, could store the contents of 70,000 punched cards, i.e. about 5.6 megabytes of data
  • Lucci and Rokkan promoted the idea of data management by libraries

23
But the metadata for the August 1958 Canadian
Gallup poll still looked like this
24
1960s
  • Development of Basic, the Unix operating system, and ASCII, which allowed interchange of data among different computers
  • Statistical software packages: DATA-TEXT, SPSS, P-STAT, BIOMED, NUCROS, SAS
  • Magnetic tapes moved from 556 to 800 bpi
  • Most social scientists were still writing their own local software, or using card-sorters and calculators to produce cross-tabulations and compute chi-squares

25
1970s: a watershed decade
  • Microprocessors, and 8-inch and later 5¼-inch diskettes
  • The Wang word processor, Ataris, the Apple I and the Commodore PET
  • dBASE, VisiCalc and WordStar
  • ARPANET and the expansion of time-sharing and online systems
  • Online bibliographic services such as Dialog, BRS, and Orbit

26
1970s (cont'd)
  • David Nasatir wrote the first manual on data management, under the aegis of UNESCO (1972)
  • Mid-decade saw the creation of IASSIST, and the first training at ICPSR for data librarians
  • The 1970 US census of population was partly disseminated on computer tape instead of in print, forcing libraries to consider this new medium

27
1970s (cont'd)
  • The OSIRIS software, developed at the University of Michigan, included statistical capabilities as well as outstanding data and metadata management
  • NSF funded the National Conference on Cataloging and Information Services for Machine-Readable Data Files at Airlie House in Virginia
  • The US Department of Justice funded the project which resulted in Roistacher's Style manual for machine-readable data files: bibliographic identity, methodology, and data dictionary

28
An OSIRIS codebook generally followed the
Roistacher recommendations. The record layout
and data dictionary portion looked like this
29
1980s
  • Supercomputers and NSFNET changed the face of large-scale computing, and PCs and Macs did the same for small-scale computing
  • BITNET, followed by the Internet, provided e-mail, listservs and remote login
  • Tape cartridges held the equivalent of 8 million cards, or four times that of a 6250 bpi tape; five-megabyte hard drives became available for microcomputers
  • IBM brought microcomputing to the academic sector
  • CD-ROMs, and the Cuadra directory of databases

30
1980s (cont'd)
  • Sue Dodd's Cataloging machine-readable data files: an interpretive manual, 1982
  • Social Forces was one of the first journals to include guidelines on citing machine-readable data files
  • Population Index was the first bibliographic journal to cite data files
  • A draft revision of AACR2 chapter 9 (renamed Computer Files) was published in 1987, providing bibliographic control for data files

31
1990s
  • Migration from IBM mainframes (EBCDIC) to Unix (ASCII)
  • Demise of tapes for storage, in favour of widespread use of CD-ROM
  • Statistics Canada made the electronic products from the census the primary product
  • Gopher, developed in 1991, was replaced by the WWW and HTML, and by 1996 there were about 100,000 web servers
  • The DDI (Data Documentation Initiative) project began in 1995, and published its first DTD in 1996

32
Three major developments led up to DDI
  • OSIRIS metadata management capability
  • Roistacher's outline of machine-readable data file documentation (1980)
  • Dodd's cataloguing manual (1982)

33
OSIRIS metadata
  • The OSIRIS dictionary provided structural information: location, size, missing data, a variable name and a (brief) variable label (a sketch of such an entry follows this list)
  • The OSIRIS codebook provided a tagged format:
  • Introduction (unstructured)
  • Full question text
  • Variable values and value labels
  • Variable-level comments
  • North American institutions standardized on the OSIRIS type-1 and type-4 codebooks, Europe on the type-3 format codebook
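Purely illustrative (this is not the actual OSIRIS file format): the structural facts listed above for a dictionary entry, expressed as a Python dataclass with hypothetical field names.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DictionaryEntry:
    """The structural information an OSIRIS-style dictionary records per
    variable; field names mirror the slide, not the real OSIRIS layout."""
    name: str                  # variable name
    label: str                 # brief variable label
    location: int              # starting column in the raw record
    size: int                  # field width in characters
    missing_values: List[int] = field(default_factory=list)

entry = DictionaryEntry(name="V102", label="Province", location=12, size=2,
                        missing_values=[99])
print(entry)
```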

34
Roistacher's style manual
  • Provided an outline of the information that should be contained in the full metadata (aka codebook), including:
  • Bibliographic identity
  • Project history
  • File processing summary
  • Data dictionary contents
  • Recommended appendices

35
Sue Dodd's cataloguing manual
  • Further refined the bibliographic identity
    component of the metadata
  • Provided a cross-walk to AACR cataloguing rules
  • Provided the foundation for the development of a
    MARC record
  • Dodd also defined the components of a
    bibliographic citation

36
(No Transcript)
37
Many kinds of metadata for many purposes
  • Data collection
  • Data interpretation
  • Data preservation
  • Data discovery
  • Coding standardization

38
Based on the NISO metadata classification:
  • Descriptive: MARC records; RAD records; thesauri; concordances
  • Structural: syntax files for e.g. SAS or SPSS; programming syntax; record layouts; data dictionaries; missing data specifications; definitions of derived variables
  • Administrative: project conception, implementation and funding; methodology reports, sampling frames, etc.; questionnaires and data collection protocols; interviewer instructions; post-processing, weighting, etc.; access and dissemination restrictions; question banks
39
DDI provides a format
  • From which other subtypes of metadata (bibliographic records, syntax files, question banks, etc.) can be generated (see the sketch below)
  • That describes not just microdata, but also provides an intelligent means of describing aggregate statistics as data
  • That can incorporate all documentation, from original project conception to edition management and post-processing
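A minimal sketch of that idea, assuming DDI 2-style element names (codeBook, dataDscr, var, labl, catgry); the Python below generates a fragment of variable-level DDI markup from a small, hypothetical data dictionary, the same kind of transformation by which syntax files or bibliographic records could be derived.

```python
import xml.etree.ElementTree as ET

# A tiny, hypothetical data dictionary.
variables = {
    "sex": {"label": "Gender of respondent", "values": {1: "male", 2: "female"}},
}

# Build a DDI 2-style fragment (element names follow the DDI Codebook
# convention; namespaces and required study-level sections are omitted).
codebook = ET.Element("codeBook")
data_dscr = ET.SubElement(codebook, "dataDscr")
for name, meta in variables.items():
    var = ET.SubElement(data_dscr, "var", name=name)
    ET.SubElement(var, "labl").text = meta["label"]
    for value, value_label in meta["values"].items():
        cat = ET.SubElement(var, "catgry")
        ET.SubElement(cat, "catValu").text = str(value)
        ET.SubElement(cat, "labl").text = value_label

print(ET.tostring(codebook, encoding="unicode"))
```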

40
DDI provides a format (cont'd)
  • Third-generation data access tools (Nesstar, DDI, and Dataverse (VDC)) all support DDI 2.0 at present, and provide a useful means of on-line, remote, distributed data discovery and access
  • This has led to a proliferation of new applications of metadata, and to the realization of initiatives from earlier decades

41
Lessons learned
  • Three killers of data:
  • Software dependence
  • Lost metadata
  • The physical medium on which data are stored
  • No solution as yet combines data, full metadata and statistical capability in a non-software-dependent format