Title: Data curation standards and the messy world of social science occupational information resources
1. Data curation standards and the messy world of social science occupational information resources
- Paper presented to the 2nd International Digital Curation Conference, 21-22 November 2006, Glasgow.
2. GEODE (www.geode.stir.ac.uk)
- Grid Enabled Occupational Data Environment
- Operate as a portal
- User-friendly access to occupational data
- High volume use
- Support a community of occupational data providers
- Depository of occupational information resources
- Limited volume use
- Experiment with / promote e-Social Science
3. (Part 1) Occupational analyses in the social sciences
"A man's work is as good a clue as any to the course of his life and to his social being and identity" (Hughes, 1958)
"Nothing stamps a man as much as his occupation. Daily work determines the mode of life... It constrains our ideas, feelings and tastes" (Goblot, 1961)
"The backbone of the class structure, and indeed of the entire reward system of modern Western society, is the occupational order" (Parkin, 1972)
- (Quotes as reproduced in Coxon and Jones 1978; Crompton 1998)
4. Why is occupational research messy?
- Two-stage process
- Collect and preserve source occupational data
- Summary / translation of source data
- This model is a scientific approach
- Published documentation (at both stages)
- Replicable
- Validation exercises
- But social researchers have not been good at using it (Bechhofer 1969; Marsh 1986; Rose and Pevalin 2003)
5. Stage 1 - Collecting occupational data: examples
6. Stage 1 - Collecting occupational data: summary
- All methods lead eventually to coding to an occupational index scheme
- Occupational Unit Groups
- Standardised Industrial Classifications
- Standardised employment status classifications
- Occupational index schemes are the point of departure for GEODE
7. Stage 2 - Summary / translation of source occupational data
- Published occupational information resources used to link source data, via an index scheme, with substantively meaningful measures
- Social class schemes
- Stratification scales
- Gender segregation statistics
- Labour process statistics
- Coding by fiat
- (Allocation by expert social scientist)
- Lack of documentation / replicability / consistency
- Unscientific
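The linkage described on this slide, from source records through an index scheme to a substantive measure, can be sketched as a simple table lookup. This is a minimal illustration only: the occupational codes, the class labels and the names `source_records`, `translation_table` and `translate` are all invented here and do not come from any published resource.

```python
from typing import Optional

# All codes and labels below are invented for illustration only.

# Stage 1 output: survey records already coded to an occupational index scheme.
source_records = [
    {"person_id": 1, "occ_code": "1234"},
    {"person_id": 2, "occ_code": "5678"},
    {"person_id": 3, "occ_code": "9999"},  # code absent from the resource
]

# Stage 2 input: a published occupational information resource, represented
# here as a translation table from index code to a hypothetical class label.
translation_table = {
    "1234": "Class A",
    "5678": "Class B",
}


def translate(occ_code: str, table: dict) -> Optional[str]:
    """Return the measure for an index code, or None when the code is absent."""
    return table.get(occ_code)


linked = [
    {**rec, "social_class": translate(rec["occ_code"], translation_table)}
    for rec in source_records
]

# Unmatched codes are reported rather than silently dropped, so that the
# translation step stays documented and replicable.
unmatched = [rec["person_id"] for rec in linked if rec["social_class"] is None]
print(unmatched)
```

Because the table is published and the lookup is mechanical, the same source data always yields the same measure, in contrast to coding by fiat.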
8. What's the problem?
- But
- Low uptake of existing occupational information resources
- Strict security constraints on users' micro-social survey data
- Problems in the formatting / distribution of occupational information resources (Part 2)
9. Handling Occupational Information
- Messy because
- Large volume of occupational information resources
- Limited coordination between resources
- Inconsistencies in access and exploitation processes
Occupational information resources are used to interpret occupational records
10. Some illustrative occupational information resources
11. Occupational information resources
- Large volumes of occupational information resources
- Coverage across countries and time periods
- Different research fields / topics
- Dynamic updates to occupational information resources
- Internet-based distributions lead to duplication and expansion, e.g. ISEI - ISCO translation files at
- PISA webpages (Ganzeboom)
- IDEAS/Repec webpages (Hendrickx)
- CAMSIS occupational data webpage
- Some maths
- 100 alternative index schemes (OUGs and others) x 500 alternative output measures (class schemes, etc.)
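Taking the two figures above as multiplied together, the arithmetic gives an upper bound on the number of distinct translation files. Treating the figures as round estimates, and assuming every index scheme could in principle be paired with every output measure, are assumptions made here for illustration.

```python
# Figures from the slide, treated here as round estimates (an assumption).
n_index_schemes = 100    # alternative index schemes (OUGs and others)
n_output_measures = 500  # alternative output measures (class schemes, etc.)

# Upper bound: one translation file per (index scheme, output measure) pair.
max_translation_files = n_index_schemes * n_output_measures
print(max_translation_files)  # 50000
```

In practice far fewer pairings will exist, but the bound shows why uncoordinated distribution becomes hard to track.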
12. Occupational information resources
- Limited coordination
- Varying metadata practices
- Coordinated structure, e.g. ISEI at IDEAS/Repec: rare
- Natural language, e.g. CAMSIS: common
- No documentation
- Varying data file formats
- SPSS, Stata, plain text
- One-way distribution
- Internet download and text publications
- Gaps between NSIs and academic researchers
- NSIs make regular changes to favoured resources
13. Occupational information resources
- Limited coordination (ctd)
- Varying translation rules
- One file for all occupations (universal)
- Multiple files for different contexts (specific)
- Different occupational index requirements
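The difference between universal and specific translation files can be sketched as follows. The contexts, codes and scores are invented, and the rule of falling back from a context-specific file to the universal one is an assumption made for this illustration, not a documented practice of any particular resource.

```python
# All table contents are invented; only the universal / specific distinction
# comes from the slide.
UNIVERSAL = "universal"

translation_files = {
    UNIVERSAL: {"1234": 50.0, "5678": 30.0},  # one file for all occupations
    "women": {"1234": 55.0},                  # a hypothetical context-specific file
}


def lookup(occ_code, context=UNIVERSAL):
    """Prefer a context-specific score, falling back to the universal file."""
    specific = translation_files.get(context, {})
    if occ_code in specific:
        return specific[occ_code]
    return translation_files[UNIVERSAL].get(occ_code)


print(lookup("1234", "women"))  # context-specific value
print(lookup("5678", "women"))  # falls back to the universal value
```

A user of such resources has to know which rule a given resource expects, which is one reason metadata about context matters.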
14. Occupational information resources
- Inconsistencies in access / exploitation
- Occupational Unit Group scheme variants
- Decennial updates / international variations
- Localised adaptations, e.g. HESA / survey variations, e.g. GHS
- Numeric or string format preservation
- Hierarchical organisations
- E.g. ISCO-88
- 1234 -> 123 -> 12 -> 1
- 110 0110 -> 11 -> 1 -> 0
- Focus for application of occupational data
- Individual level measures
- Household / career contexts
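The format-preservation problem in the ISCO-88 example can be shown in a few lines. The sketch below assumes four-digit unit group codes whose parent groups are obtained by dropping trailing digits; the function name `isco88_ancestors` is hypothetical.

```python
def isco88_ancestors(code, width=4):
    """Return the code followed by its 3-, 2- and 1-digit parent groups.

    The code is handled as a zero-padded string so that a leading zero
    (major group 0) survives; `width` is the assumed unit group length.
    """
    text = str(code).zfill(width)
    if len(text) != width or not text.isdigit():
        raise ValueError(f"not a {width}-digit occupational code: {code!r}")
    return [text[:n] for n in range(width, 0, -1)]


print(isco88_ancestors("1234"))  # ['1234', '123', '12', '1']
print(isco88_ancestors(110))     # ['0110', '011', '01', '0']

# Naive truncation of the numeric value drops the leading zero and
# places the code in major group 1 instead of major group 0.
naive_major_group = str(110)[:1]
print(naive_major_group)  # 1
```

Storing codes as strings, or re-padding them before truncation, avoids this misassignment when a file has been saved in numeric format.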
15. Returning to the occupational research model
- Two-stage process
- (1) Collection and preservation of source occupational data
- (2) Summary / translation of source data via occupational information resources
- Critically, stage (2) places responsibility for reviewing occupational information resources with the social scientist
- The volume of variants / inconsistencies isn't huge, but is enough to impede easy application
16. (Part 2) Curating Occupational Data
- GEODE: Grid Enabled Occupational Data Environment
- Core provision: support the management of and access to occupational information resources
- Occupational information depository
- Easy access to occupational data (portal for occupational data)
17. Metadata - Occupational information depository
- How to facilitate searching, registering, accessing index service?
- Establish a GEODE-M metadata subset (.xml)
- Founded on Michigan Data Documentation Initiative
- Semantic curation of occupational information
18. Benefits of DDI-XML curation
- XML suits
- OGSA-DAI
- (data access and integration, www.ogsadai.org.uk)
- Supports data indexing / preservation / management
- Supports secure data matching programme
- Could facilitate analytical queries
- Gridsphere search programmes
- Data curation standards
- DDI widely deployed in social science resources
- XML accessibility / transferability
- Repeatability of tags very helpful
- E.g. data files, index measures, contexts, authors
19. Implementing GEODE-M metadata
- Critical entries
- Context of data: country, time period
- Index scheme
- <StdCatgry>: GEODE database of known index schemes
- Source URI for resource
- 2-stage curation process (?)
- Web proforma for supply of occupational data
- Author, context, index units
- Gridsphere portlet
- Manual updating of XML resource by depositor / GEODE members
- Gridsphere portlet
20. Example issues
- <StdCatgry>: variant implementations <-> indexed translation files
- <context>: cross-country resources
- <producer role="formatting">: caters to multiple author roles
- <fileDscr id="dkcherisco88.sav">: caters to multiple files
- <abstract>
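To make the role of repeatable tags concrete, the sketch below builds a small metadata record using the tag names listed above. It is an illustration only: the root element name `geodeM`, the nesting, the `role="author"` value and all text values are hypothetical, the index scheme value is a guess from the example file name, and the result is not a validated DDI or GEODE-M instance.

```python
import xml.etree.ElementTree as ET

# Tag names follow the slide; the root element, the nesting and all values
# are hypothetical, and this is not a validated DDI instance.
record = ET.Element("geodeM")

abstract = ET.SubElement(record, "abstract")
abstract.text = "Placeholder description of an occupational information resource."

# Repeatable <context> tags let a single resource cover several countries.
for country in ("Country A", "Country B"):
    ET.SubElement(record, "context").text = country

# Repeatable <producer> tags, distinguished by a role attribute.
ET.SubElement(record, "producer", role="author").text = "Placeholder author"
ET.SubElement(record, "producer", role="formatting").text = "Placeholder formatter"

# One <fileDscr> per data file; the id is the example shown on the slide.
ET.SubElement(record, "fileDscr", id="dkcherisco88.sav")

# <StdCatgry> names the index scheme the file is keyed on (value assumed).
ET.SubElement(record, "StdCatgry").text = "ISCO-88"

xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

Because each tag can simply be repeated, adding a further country, author role or data file does not require any change to the record structure.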
21. Management of GEODE-M curation
- Metadata considerations
- GEODE-M as flexible recommended components of DDI
- GEODE-M templates
- Webpages at GEODE
- Other facilities?
- Data considerations
- Stored at GEODE vs linkage to external data
- Proprietary software (plain text / SPSS / Stata)
- At present
- Stage 1: automated curation (allows external linkage, any file format)
- Stage 2: extended manual curation (requires GEODE server copy of data, translation to plain text rectangular format)
- Premised upon small commitment from depositors and GEODE
22. GEODE user uptake
- High potential demand
- Numerous queries on occupational data management
- Numerous researchers wishing to distribute occupational data
- Prototype GEODE services not yet user-friendly
- Carrots
- High demand for easier access and review
- Sticks
- Poor standards in much previous research, which neglects good review of occupational information
- Hurdles
- Change research cultures in social science disciplines (?)
23. Conclusions
- Occupational data curation and the Grid
- Grid facilitates management / access via XML formats (OGSA-DAI)
- Current models require moderate specialist input (manual curation)
- Grid offers new level of service not previously available
- Dynamic coordinated file storage
- File matching security
- Occupational data as case study for focused DDI XML curation
- Complex but finite range of occupational information resources
- High user demand
- Uptake will require combination of motivation and instigation
24. App 1: e-Social Science
- The Grid and e-Science
- Online coordination of electronic resources and collaborations
- (Distributed computing)
- Large scale
- Collaborative
- Heterogeneous
- Standard protocols / information management systems
- UK e-Social Science
- Investment in assessing / implementing technology
- Computationally demanding data analysis
- Qualitative and quantitative data collection technologies
- Data sharing, processing and access
25. App 2: GEODE architecture
26. App 3: Collecting occupational data
- Follow a recommended process
- ONS good practice
- www.statistics.gov.uk/methods_quality/ns_sec/questions.asp
- Industry description / occupation description / size of organisation / employment status / supervisory status
- Occupation descriptions -> standardised numeric index
- Text coding tools, e.g. CASCOT - www2.warwick.ac.uk/fac/soc/ier/publications/software/cascot/
- Do your own thing
- European Social Survey parental occupational questions
- Free text description of parental occupations
27. App 4: Summary data - what is the best class scheme?
- Published occupational information resources link source data, via index scheme, with substantively meaningful measures
- Occupation-based social classifications
- Social class schemes
- Registrar General's Social Class Scheme (1907-2001): skill / prestige
- National Statistics Socio-Economic Classification (2002-): employment relations
- Goldthorpe / CASMIN / EGP (employment relations)
- Wright: ownership and authority
- W.E.S.: female occupational groupings
- Stratification scales
- SIOPS: prestige
- ISEI: socio-economic status (education and income average)
- CAMSIS: social interaction
- CAMSIS is the best