1
Reconstructing historical populations from
genealogical data
  • An overview of methods used for aggregating data
    from GEDCOM files

Corry Gellatly
Department of History and Art History, Utrecht University
Workshop on Population Reconstruction, IISH
Amsterdam, 19-21 February 2014
2
1. Overview
  • Why build a large genealogical database by
    aggregating hundreds of genealogical data
    (GEDCOM) files?
  • Research increasingly requires big data, to
    • Understand large-scale population dynamics
      • between regions
      • over time
      • social, biological, cultural and economic
        aspects
    • Detect rare or small effects
      • epidemiology (disease and intervention)
      • inheritance (genetics)
      • comparative life histories

3
2. GEDCOM files
  • Why use GEDCOM files for population
    reconstruction?
  • Pros
  • a standard file structure for representing
    information about familial relationships and life
    events
  • most popular format for storage and exchange of
    genealogical data
  • used internationally and widely available online
  • Cons
  • it is a highly flexible format that allows users
    to enter wildly incorrect information (if they
    wish to)
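  • To make the file structure concrete, the sketch below embeds a
    minimal, invented GEDCOM fragment (two spouses and one family
    record) in a few lines of Python that split each line into its
    level, optional cross-reference ID, tag and value

# Minimal, invented GEDCOM fragment: two spouses and one family record
# (the child record @I3@ is omitted for brevity).
SAMPLE = """\
0 @I1@ INDI
1 NAME Jan /Jansen/
1 SEX M
1 BIRT
2 DATE 12 MAR 1851
1 FAMS @F1@
0 @I2@ INDI
1 NAME Anna /de Vries/
1 SEX F
1 FAMS @F1@
0 @F1@ FAM
1 HUSB @I1@
1 WIFE @I2@
1 CHIL @I3@
"""

def parse_lines(text):
    """Yield (level, xref_id, tag, value) tuples from raw GEDCOM lines."""
    for line in text.splitlines():
        parts = line.split(" ", 2)
        level = int(parts[0])
        if len(parts) > 1 and parts[1].startswith("@"):
            xref = parts[1]
            rest = parts[2] if len(parts) > 2 else ""
            tag, _, value = rest.partition(" ")
        else:
            xref = None
            tag = parts[1] if len(parts) > 1 else ""
            value = parts[2] if len(parts) > 2 else ""
        yield level, xref, tag, value

for record in parse_lines(SAMPLE):
    print(record)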

4
3. Data aggregation
  • Single GEDCOM files typically contain only a few
    hundred individuals, so we import hundreds of
    files into a single genealogical database
  • There are broadly three steps between the import of
    files and the output of usable research datasets
  • Screening (to reject poor quality files)
  • Data cleaning
  • Linkage / de-duplication
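  • A minimal sketch of how these three steps might be chained; the
    parser and the three steps are passed in as placeholder callables,
    so this only illustrates the order of operations, not the actual
    implementation

from pathlib import Path

def build_database(gedcom_dir, import_gedcom, passes_screening, clean, link):
    """Sketch of the three-step aggregation pipeline described above.

    The four callables are placeholders for a GEDCOM parser and the
    screening, cleaning and linkage/de-duplication steps; they are
    passed in so this sketch stays generic rather than standing in for
    the project's actual code.
    """
    accepted = []
    for path in sorted(Path(gedcom_dir).glob("*.ged")):
        genealogy = import_gedcom(path)          # parse one GEDCOM file
        if passes_screening(genealogy):          # 1. reject poor-quality files
            accepted.append(clean(genealogy))    # 2. fix or flag errors
    return link(accepted)                        # 3. linkage / de-duplication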

5
4. Screening
  • Screening is carried out for various errors, e.g.
  • low mean number of offspring per family
  • individuals younger than 0 or very old (>110)
  • impossible relationships (due to age difference
    between individuals)
  • individuals occurring as different sexes
  • missing individuals
  • If errors are detected, then the file is either
  • removed (in the case of obvious errors)
  • retained for further checking (in the case of
    ambiguous errors), e.g. where individuals have
    more than two parents, which can be due to
    adoption or to incorrect family links between
    individuals
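  • A minimal sketch of such screening checks, assuming each person is
    a dict with birth_year, death_year and a set of observed sex codes,
    plus a separate parent mapping (all field names are assumptions)

def screen_genealogy(individuals, parents_of):
    """Collect screening errors for one imported genealogy.

    `individuals` maps person id -> dict with optional 'birth_year',
    'death_year' and 'sexes' (the set of sex codes recorded for that
    person); `parents_of` maps person id -> set of parent ids.  The
    field names are assumptions made for this illustration.
    """
    errors = []
    for pid, person in individuals.items():
        birth, death = person.get("birth_year"), person.get("death_year")
        if birth is not None and death is not None:
            age = death - birth
            if age < 0 or age > 110:             # younger than 0 or very old (>110)
                errors.append((pid, f"implausible age {age}"))
        if len(person.get("sexes", set())) > 1:  # occurs as different sexes
            errors.append((pid, "conflicting sex codes"))
        if len(parents_of.get(pid, set())) > 2:
            # ambiguous rather than obviously wrong: could be adoption or an
            # incorrect family link, so the file is retained for checking
            errors.append((pid, "more than two recorded parents"))
    return errors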

6
5. Cleaning
  • Example date errors
  • If DOB is 1857
    • Born to a 10-year-old mother?
    • Wife 17 years older?
    • First of 5 children born at the age of 39?
  • If DOB is actually 1875
    • Born to a 28-year-old mother?
    • Wife 1 year younger?
    • First of 5 children born at the age of 21?
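  • A minimal sketch of this kind of date check, assuming a mother's
    birth year of 1847 (inferred from the ages on this slide) and an
    assumed 15-50 plausibility window for the mother's age; it tests
    whether a digit transposition (1857 -> 1875) makes the record
    plausible

def plausible_mother_age(child_year, mother_year, low=15, high=50):
    """Crude plausibility window for the mother's age at a child's birth.

    The 15-50 window is an assumption for this sketch, not a rule taken
    from the presentation.
    """
    age = child_year - mother_year
    return low <= age <= high

def digit_swap_candidates(year):
    """Alternative readings of a year produced by swapping adjacent digits,
    e.g. 1857 -> 1587, 1875; transposed digits are a common typing error."""
    s = str(year)
    candidates = set()
    for i in range(len(s) - 1):
        swapped = s[:i] + s[i + 1] + s[i] + s[i + 2:]
        if swapped != s:
            candidates.add(int(swapped))
    return sorted(candidates)

# Slide example: a DOB of 1857 implies a 10-year-old mother (born 1847),
# while the transposition 1875 implies a 28-year-old mother.
mother_birth_year = 1847
for candidate in [1857] + digit_swap_candidates(1857):
    print(candidate, plausible_mother_age(candidate, mother_birth_year))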

7
6. Dataset extraction
  • Definition of datasets is driven by research
    questions
  • which timespan?
  • which region?
  • do we need complete families?
  • do we need dates of birth, death, marriage?
  • The identification of links between genealogies
    (or removal of duplicate individuals) is done
    during the process of dataset extraction
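  • A minimal sketch of such a selection step, assuming each person is
    a dict with birth_year, death_year and region fields; the field
    names and criteria are illustrative

def extract_dataset(individuals, start_year, end_year, region=None,
                    require_birth=True, require_death=False):
    """Select records for a research dataset from the aggregated database.

    `individuals` is assumed to be an iterable of dicts with
    'birth_year', 'death_year' and 'region' keys; the criteria mirror
    the questions above (timespan, region, which dates must be present).
    """
    selected = []
    for person in individuals:
        birth = person.get("birth_year")
        if require_birth and birth is None:
            continue
        if require_death and person.get("death_year") is None:
            continue
        if birth is not None and not (start_year <= birth <= end_year):
            continue
        if region is not None and person.get("region") != region:
            continue
        selected.append(person)
    return selected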

8
7. Linkage, de-duplication
  • Linkage fields
  • Day of birth, marriage or death (DOB, DOM, DOD)
  • Year of birth, marriage or death (YOB, YOM, YOD)
  • Surname
  • Given names
  • Sex
  • Problems
  • YOB, YOM, YOD more common than DOB, DOM, DOD
    (particularly in older data) but less unique to
    each individual
  • High inconsistency in recording of given names
  • Middle names included or excluded
  • Middle names used instead of first names
  • Abbreviated names
  • Nicknames (sometimes in brackets)
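  • A minimal sketch of how given names might be normalised before
    comparison, with illustrative rules (not the project's actual
    normalisation)

import re

def normalise_given_names(raw):
    """Reduce a raw given-name string to a comparable form.

    Handles the inconsistencies listed above (bracketed nicknames,
    abbreviation punctuation, extra middle names) with illustrative
    rules only.
    """
    name = raw.lower().strip()
    name = re.sub(r"\(.*?\)", "", name)   # drop nicknames in brackets
    name = re.sub(r"[.,]", "", name)      # drop abbreviation punctuation
    parts = name.split()
    return parts[0] if parts else ""      # keep only the first given name

for raw in ["Johannes (Jan)", "J. Hendrik", "Anna Maria", "  Pieter "]:
    print(repr(raw), "->", repr(normalise_given_names(raw)))

  • A bare initial such as 'j' would still need special handling in
    the comparison step; the point here is only that raw name strings
    must be brought to a common form before any comparison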

9
8. Linkage, de-duplication
  • Trade-off between data coverage and quality
  • Surname, given name, DOB
  • Low risk of false linkages, but high risk of
    missing linkages (due to problems with given
    names) and low data coverage
  • Surname, DOB
  • Low risk of false linkages, but low data
    coverage
  • Surname, YOB
  • High risk of false linkages, but high data
    coverage
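  • A minimal sketch of the three key variants, assuming person dicts
    with surname, given, dob and yob fields; a key is produced only
    when all of its fields are present, which is what drives the
    coverage differences above

def linkage_keys(person):
    """Build the three candidate linkage keys discussed above.

    `person` is assumed to be a dict with 'surname', 'given', 'dob'
    (full date) and 'yob' keys; a key is returned as None when one of
    its fields is missing, which is what drives the coverage
    differences between the variants.
    """
    surname = (person.get("surname") or "").lower()
    given = (person.get("given") or "").lower()
    dob, yob = person.get("dob"), person.get("yob")
    return {
        # low risk of false links, but lowest coverage and sensitive to
        # given-name inconsistencies
        "surname_given_dob": (surname, given, dob) if surname and given and dob else None,
        # low risk of false links, low coverage
        "surname_dob": (surname, dob) if surname and dob else None,
        # high coverage, but higher risk of false links
        "surname_yob": (surname, yob) if surname and yob else None,
    }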

10
9. Group-linking method
  • Individuals are identifiable by those they are
    related to
  • This principle is being applied to the problem of
    genealogical data, in which many records have a
    YOB but not a DOB, and given names are somewhat
    unreliable for linking
  • Group-linking string
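  • The exact format of the group-linking string is not reproduced in
    this transcript; the sketch below is a hypothetical reconstruction
    that encodes an individual's sex with the sexes and birth years of
    siblings, spouses and children, each group sorted so that the same
    person yields the same string in different files

def group_link_string(person, siblings, spouses, children):
    """Hypothetical group-linking string for one individual.

    The format is an assumption for this sketch: the individual's sex,
    followed by the sexes and birth years of siblings, spouses and
    children, each group sorted so that the same person described in
    two different GEDCOM files produces the same string even if the
    relatives are stored in a different order.
    """
    def encode(group):
        return ",".join(sorted(f"{p.get('sex', '?')}{p.get('yob', '????')}"
                               for p in group))

    return "|".join([person.get("sex", "?"),
                     encode(siblings), encode(spouses), encode(children)])

# Example: prints 'M|F1842,M1845|F1841|M1869'
siblings = [{"sex": "F", "yob": 1842}, {"sex": "M", "yob": 1845}]
print(group_link_string({"sex": "M"}, siblings,
                        [{"sex": "F", "yob": 1841}],
                        [{"sex": "M", "yob": 1869}]))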

11
10. Group-link test
  • Test with a single GEDCOM file containing no
    duplicates:
    • 2,082 individuals
    • 971 marriages
    • 681 conceptive relationships
    • 1,913 conceptions

12
11. Group-link test
  • Percentage data coverage × percentage of unique
    records within that data (/ 100) gives an
    estimate of linkage power
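  • Worked example (illustrative figures, not results from this
    test): 60% coverage × 90% uniqueness / 100 = 54% linkage power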

13
12. Missing data
  • What about missing information?
  • The information on the siblings of these
    individuals is probably missing. Why? Because
    these individuals first appear in the data at
    marriage
  • This data is left-censored, because these
    individuals appeared in the data after the event
    we are measuring (i.e. the number and sex of
    siblings)

14
13. Missing data
  • Depending on what type of links we are trying to
    find, we may want to break up the string
  • String to link individuals based on their
    siblings
  • String to link individuals based on their
    marriages and children
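  • Continuing the hypothetical string format sketched earlier (the
    real format is not shown in this transcript), the two partial
    strings might be built like this

def sibling_link_string(person, siblings):
    """Partial string for links based on sex and siblings only
    (cf. 'Group-link string 2' on the de-duplication slide below)."""
    sibs = ",".join(sorted(f"{s.get('sex', '?')}{s.get('yob', '????')}"
                           for s in siblings))
    return f"{person.get('sex', '?')}|{sibs}"

def family_link_string(person, spouses, children):
    """Partial string for links based on sex, marriages and offspring
    (cf. 'Group-link string 3' on the de-duplication slide below)."""
    def encode(group):
        return ",".join(sorted(f"{p.get('sex', '?')}{p.get('yob', '????')}"
                               for p in group))
    return "|".join([person.get("sex", "?"), encode(spouses), encode(children)])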

15
14. Record de-duplication (1600-1699)
  • De-duplication of 17th century records from the
    genealogical database
  • Febrl program (Freely Extensible Biomedical
    Record Linkage)
  • 17,488 records with Surname and YOB
  • Indexes
  • Surname > YOB
  • Surname > Group-link string 2 (sex, siblings)
  • Surname > Group-link string 3 (sex, marriages,
    offspring)
  • Comparison function
  • Winkler
  • Classifier
  • KMeans
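  • The sketch below illustrates the same blocking / comparison /
    classification pattern in plain Python; it does not use Febrl's
    actual API, SequenceMatcher stands in for the Winkler comparator,
    and a small one-dimensional two-means split stands in for the
    KMeans classifier (record field names are assumptions)

from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

def deduplicate(records):
    """Generic blocking / comparison / classification sketch (not Febrl's API).

    Records are dicts with 'surname', 'yob' and 'given' keys (assumed
    field names).  Blocking on surname + YOB mirrors the first index
    above; SequenceMatcher stands in for the Winkler string comparator
    and a one-dimensional two-means split stands in for the KMeans
    classifier.
    """
    # 1. Blocking: only compare records that share surname and YOB.
    blocks = defaultdict(list)
    for rec in records:
        blocks[(rec["surname"].lower(), rec["yob"])].append(rec)

    # 2. Comparison: score each candidate pair on its given names.
    pairs = []
    for block in blocks.values():
        for a, b in combinations(block, 2):
            score = SequenceMatcher(None, a["given"].lower(),
                                    b["given"].lower()).ratio()
            pairs.append((score, a, b))
    if not pairs:
        return []

    # 3. Classification: split the scores into a low (non-match) and a
    #    high (match) cluster with a tiny two-means iteration.
    lo = min(s for s, _, _ in pairs)
    hi = max(s for s, _, _ in pairs)
    for _ in range(20):
        mid = (lo + hi) / 2
        low = [s for s, _, _ in pairs if s < mid] or [lo]
        high = [s for s, _, _ in pairs if s >= mid] or [hi]
        lo, hi = sum(low) / len(low), sum(high) / len(high)
    threshold = (lo + hi) / 2
    return [(a, b, s) for s, a, b in pairs if s >= threshold]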

16
15. Record de-duplication (1600-1699)
  • Results

17
16. Record de-duplication (1600-1699)
  • Results
  • Examples of matches in highest weight category
    (1,914 matches)

18
17. Record de-duplication (1600-1699)
  • Results
  • Examples of matches in lower weight category
    (10,434 matches)

19
18. Further work
  • Record linkage
  • Refine a method of probabilistic data matching
    that can identify linkages
  • where typographical errors or name variations occur
  • where possible date typos exist
  • where there are missing persons in the family
    structure
  • Group-linking algorithm
  • Use the group-linking string as a starting point,
    then check for birth, marriage and death dates of
    relatives (where these exist) and perform matches
    on those variables
  • Inherently based on probabilistic matching

20
19. Acknowledgements
  • Netherlands Organisation for Scientific Research
    (NWO)
  • Project number 276-53-008: 'Nature or nurture? A
    search for the institutional and biological
    determinants of life expectancy in Europe during
    the early modern period'
  • Colleagues at Utrecht University
  • Tine De Moor
  • Institutions for Collective Action team