1
Get your hands dirty cleaning data.
  • 2008 European EMu Users Meeting, 3rd June.
  • - Elizabeth Bruton, Museum of the History of
    Science, Oxford
  • elizabeth.bruton_at_mhs.ox.ac.uk

2
Outline
  • Data Migration
  • Problem -> Solution approach
  • Tools
  • Manual Data Cleaning
  • Examples
  • Current and Future Practices (Documentation,
    Policing, Review)

3
Data Migration
  • First step towards better, cleaner data
  • Steps
  • Prepare and analyse legacy system
  • Data mapping
  • KE EMu system design
  • Data migration

4
Legacy System Analysis
  • Prepare and analyse previous (legacy) system
  • Data structure and relationships - tables and
    fields.
  • Primary
  • Secondary
  • Cross-reference
  • Documentation and usage
  • Redundant data

5
Legacy Data analysis

6
Data Mapping

7
KE EMu system design
  • Default and additional fields across different
    modules
  • Field titles
  • Screen Designer
  • e.g. Summary tab for ecatalogue module
  • Finally data migration

8
Data cleaning overview
  • Problem -> solution approach
  • Input data
  • Operations
  • Output data
  • Manual or automated operations or both?
  • Which tools to use for automated operations?
  • KE EMu tools: many powerful built-in tools
    within EMu
  • Non-KE EMu tools: scripts to run on data
    exported from EMu, then reimport back into EMu
  • Both

9
KE EMu Tools: Texql
  • KE Texpress Texql queries
  • Similar syntax to MySQL or SQL
  • Uses
  • Analysing data and data structure
  • Analysing search queries
  • Advanced search queries

10
KE EMu Tools: Global Replace
  • Very useful, powerful but also potentially
    dangerous tool
  • Can use in combination with search query or list
    options within EMu
  • Can use regular expressions and/or wildcard
    searches
  • Powerful tool for single field or Field A -> Field
    B operations

11
KE EMu Tools: Record Merge
  • Does what it says on the tin
  • Merge one or more duplicate record(s) into single
    record
  • Only attachments to other modules are merged
    into the record, not data
  • Ditto tool can be used for easily copying data
    from one record to another
  • Attachments to original duplicate record(s) are
    removed so records can be deleted

12
KE EMu Tools: Reports
  • Tool to present information in assorted ways
  • Can be used to produce reports but can also be
    used as data export tool
  • Microsoft Excel or CSV format appropriate for
    more advanced data operations

13
Non-KE EMu Tools: Scripting
  • Personally use PHP and MySQL
  • Perl is also useful scripting tool used by KE
  • Have written CSV to MySQL file checker and
    converter in PHP
  • Then run more advanced operations on data using
    PHP scripts
  • phpMyAdmin can export data in many formats
    including CSV
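
A minimal sketch of the kind of CSV check mentioned above. It is written in Python as a stand-in for the PHP the author used, and it is not the author's script: the function name `check_csv` and the sample columns (`irn`, `maker`, `place`) are hypothetical. It flags rows whose field count differs from the header's before the data is loaded anywhere else.

```python
import csv
import io


def check_csv(text):
    """Return (header, good_rows, problems) for CSV text.

    A row is flagged when its field count differs from the header's.
    """
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    good_rows, problems = [], []
    for row in reader:
        if not row:
            continue  # skip completely blank lines
        if len(row) != len(header):
            # reader.line_num is the physical line the row ended on
            problems.append((reader.line_num, len(row)))
        else:
            good_rows.append(dict(zip(header, row)))
    return header, good_rows, problems


# Hypothetical export: the third line is missing a field.
sample = 'irn,maker,place\n1,Smith,London\n2,Jones\n3,"Dollond, P.",London\n'
header, good_rows, problems = check_csv(sample)
print(problems)
```

Rows that pass the check come back as dictionaries keyed by the header, so they could be handed to whatever loads the database table next.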

14
Non-KE EMu Tools: Scripting
  • Systematic Approach
  • Keep copy of original data
  • Produce data mapping or data cleaning document
  • Perform operations using PHP file on MySQL table
  • Check data produced (manual or automatic) and
    output logs
  • Validate data in EMu and then import
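
The steps above can be sketched roughly as follows. This is an illustration in Python, not the author's PHP/MySQL workflow; the helper `clean_with_log` and the record layout (a list of dictionaries with an `irn` key) are assumptions made for the example. The point is that the original data is never modified and every change is written to a log that can be checked before import.

```python
import copy


def clean_with_log(records, field, operation):
    """Apply `operation` to one field of each record, keeping the originals.

    Returns (cleaned_records, log); the input list is not modified.
    """
    cleaned = copy.deepcopy(records)  # keep a copy of the original data
    log = []
    for original, record in zip(records, cleaned):
        new_value = operation(original[field])
        if new_value != original[field]:
            record[field] = new_value
            # log entry: (record id, field, old value, new value)
            log.append((original["irn"], field, original[field], new_value))
    return cleaned, log


# Hypothetical records; the cleaning operation here just trims spaces.
records = [
    {"irn": "1", "place": " London "},
    {"irn": "2", "place": "Oxford"},
]
cleaned, log = clean_with_log(records, "place", str.strip)
for entry in log:
    print(entry)
```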

15
Manual Data Cleaning
  • Some problems cannot be solved automatically,
    either partially or entirely
  • Need to be 'eyeballed' by a person, preferably
    someone familiar with the museum's collections

16
Example: Parties Records
  • Legacy system used two systems of noting object
    makers
  • Freetext Maker field with no centralised system
    (1:1 ratio) used for applicable records
  • Assigned makers with centralised system only
    used for first 3,000 or so records
  • Freetext data imported into EMu resulted in
    approximately 5,500 Parties records

17
Example: Parties Records
  • Good example of mapping freetext field to more
    structured data field with 1:Many ratio
  • KE ran script which detected maker type and
    formatted accordingly, i.e. Maker Type etc
  • But still much cleaning up to be done
  • Two approaches: automatic, then manual

18
Example: Parties Records
  • Problem: Creation-related data within legacy
    system were all free-text fields
  • The museum wanted to keep this data in some
    format as it contained valuable information, such
    as ambiguities or uncertainties
  • e.g. 'Italy or France', 'Attributed to Smith
    Jones', 'possibly last quarter of 19th century' etc

19
Example: Parties Records
  • This data did not fit neatly into defined,
    structured fields such as Parties, Places or
    Creation Date
  • Also wanted to clean Parties records
  • Solution: Automatic batch process then manual
    cleaning

20
Example: Parties Records - Automatic Approach
  • Exported Creation data (Parties, Place, Creation
    Date) from EMu
  • Ran script which checked for and removed
    duplicates in Parties and Place
  • Note: The above operation deleted rather than
    manipulated data but is still an integral part of
    the data cleaning operation
  • Copied cleaned Parties, Place, Creation Date into
    single free-text field, Creation Notes
  • Re-imported data into EMu using Import Tool
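
A rough sketch of the duplicate-removal and copy steps above, in Python rather than the author's scripting language. The slides do not say how the script defined a duplicate or how the values were joined, so the case-insensitive comparison, the separators and the function names `dedupe` and `creation_notes` are all assumptions for illustration.

```python
def dedupe(values):
    """Drop repeated values, ignoring case and surrounding spaces, keeping order."""
    seen = set()
    unique = []
    for value in values:
        key = value.strip().casefold()
        if key and key not in seen:
            seen.add(key)
            unique.append(value.strip())
    return unique


def creation_notes(parties, places, creation_date):
    """Combine the cleaned values into one free-text note."""
    parts = [
        "; ".join(dedupe(parties)),
        "; ".join(dedupe(places)),
        creation_date.strip(),
    ]
    return ", ".join(part for part in parts if part)


# Hypothetical values for one catalogue record.
note = creation_notes(
    ["Attributed to Smith", "attributed to smith ", "Jones"],
    ["Italy or France", "Italy or France"],
    "possibly last quarter of 19th century",
)
print(note)
```

Keeping the first spelling of each value, rather than normalising it, leaves wording such as 'Attributed to' intact in the note, which matches the museum's wish to preserve ambiguities.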

21
Example: Parties Records - Automatic Approach
  • Began data cleaning by running Global Replace
    operation within EMu eparties module, removing
    'Signed by', 'Attributed to', or 'Made by' from
    the relevant parties records
  • Next: Manual Approach
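
The Global Replace operation was run inside EMu. The Python below only illustrates the same pattern logic outside EMu and does not reproduce EMu's own syntax; the case-insensitive matching and the function name `strip_prefix` are assumptions.

```python
import re

# Phrases named on the slide; matching is case-insensitive here (an assumption).
PREFIX = re.compile(r"^\s*(?:signed by|attributed to|made by)\s+", re.IGNORECASE)


def strip_prefix(name):
    """Remove one leading attribution phrase from a party name."""
    return PREFIX.sub("", name).strip()


# Hypothetical party names.
for raw in ["Attributed to Smith", "Made by  Jones", "Dollond"]:
    print(repr(strip_prefix(raw)))
```

Anchoring the pattern at the start of the value means a phrase appearing later in a name is left alone, which is the cautious choice for a tool described earlier as powerful but potentially dangerous.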

22
Example: Parties Records - Manual Approach
  • Cleaned records: checked Parties Type (Person or
    Organisation) and edited records (Surname,
    Forename, Organisation etc)
  • Merged and deleted duplicate records
  • Checked and deleted unattached parties records
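
The merging and deleting above was done by hand, but a script can shortlist records for a person to review. The sketch below is a hypothetical Python helper, not part of the author's process: the data layout (a dictionary of party record numbers to names, and a dictionary of catalogue record numbers to attached party numbers) and the name normalisation are assumptions.

```python
from collections import defaultdict


def candidate_duplicates(parties):
    """Group party record numbers whose names match after simple normalisation."""
    groups = defaultdict(list)
    for irn, name in parties.items():
        # ignore case, commas and repeated spaces when comparing names
        key = " ".join(name.replace(",", " ").casefold().split())
        groups[key].append(irn)
    return [sorted(irns) for irns in groups.values() if len(irns) > 1]


def unattached(parties, catalogue_links):
    """Return record numbers of parties that no catalogue record refers to."""
    used = {irn for links in catalogue_links.values() for irn in links}
    return sorted(set(parties) - used)


# Hypothetical data for illustration only.
parties = {10: "Smith, John", 11: "smith  john", 12: "Jones", 13: "Dollond"}
catalogue_links = {500: [10, 12], 501: [11]}
print(candidate_duplicates(parties))
print(unattached(parties, catalogue_links))
```

Both functions only produce lists for review; deciding whether two records really describe the same maker still needs someone familiar with the collections.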

23
Example: Parties Records - End Result
  • Currently have 3,300 cleaner Parties records

24
Current and Future Practices
  • Current
  • Systematic approach to data cleaning
    incorporated into monthly museum EMu Users'
    Meeting
  • Review
  • In Progress
  • Documentation
  • Future
  • Policing

25
Conclusion
  • Data cleaning and policing is an ongoing process
    for an institution of any size
  • Data standards must be set and adhered to
  • Needs to be approached and done in a systematic
    way
  • Any questions?