Computer Science Research for Family History and Genealogy

Slides: 47
Provided by: Davi208
Learn more at: https://deg.byu.edu
Transcript and Presenter's Notes

1
Computer Science Research for Family History and Genealogy
Computer Graphics, Vision, & Image Processing Laboratory
Neural Networks and Machine Learning Laboratory
Data Extraction and Integration Laboratory
Laboratory for Information, Collaboration, & Interaction Environments
Performance Evaluation Laboratory
Data and Software Engineering Laboratory
www.cs.byu.edu/familyhistory
David W. Embley, Heath Nielson, Mike Rimer, Luke Hutchison, Ken Tubbs,
Doug Kennard, Tom Finnigan, William A. Barrett
2
The Problem
  • 2.5 million rolls of microfilm
  • Assuming 1000 images per roll
  • 2.5 billion images

Is there a way to automatically extract this
information?
3
A (Possible) Solution
Let a computer do the extraction work.
  • Input: Images of Microfilmed Records
  • Table Recognition (Heath Nielson)
  • Old-Text Recognition (Mike Rimer)
  • Handwriting Recognition (Luke Hutchison)
  • Record Extraction & Organization (Ken Tubbs)
  • Just-in-Time Browsing (Doug Kennard)
  • Visualization (Tom Finnigan)
  • Output: Organized Genealogical Information

4
Zoning: General Overview
  • Find the lines in the document using the
    horizontal and vertical profiles of the image.
  • Apply a matched filter to the profiles to
    identify the line signatures.
  • Recursively divide the document into separate
    pieces, analyzing each piece for lines.
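The first two steps above can be sketched as follows. This is a minimal illustration, not the presenters' implementation: it assumes a binarized image (1 = ink, 0 = background), a hypothetical three-tap kernel modeling a one-pixel-thick ruled line, and a hypothetical `min_fraction` threshold. The recursive step would rerun the same analysis on each sub-rectangle bounded by the detected lines and is left out here.

```python
def profiles(image):
    """Return (row_profile, col_profile): ink-pixel counts per row and per column.

    `image` is a list of equal-length rows; each pixel is 1 (ink) or 0.
    """
    row_profile = [sum(row) for row in image]
    col_profile = [sum(col) for col in zip(*image)]
    return row_profile, col_profile


def matched_filter(profile, kernel):
    """Correlate a profile with a kernel shaped like the expected line signature."""
    half = len(kernel) // 2
    response = []
    for i in range(len(profile)):
        total = 0.0
        for k, weight in enumerate(kernel):
            j = i + k - half
            if 0 <= j < len(profile):  # samples outside the profile count as 0
                total += weight * profile[j]
        response.append(total)
    return response


def find_lines(profile, length, kernel=(-0.5, 1.0, -0.5), min_fraction=0.6):
    """Return positions whose filter response is a local peak above threshold.

    `length` is the extent of a full line in pixels (image width for the row
    profile, image height for the column profile). Kernel and threshold are
    illustrative assumptions.
    """
    response = matched_filter(profile, kernel)
    threshold = min_fraction * length
    hits = []
    for i, r in enumerate(response):
        left = response[i - 1] if i > 0 else float("-inf")
        right = response[i + 1] if i + 1 < len(response) else float("-inf")
        if r >= threshold and r >= left and r > right:
            hits.append(i)
    return hits
```

For a page image `img`, `rows, cols = profiles(img)` followed by `find_lines(rows, len(img[0]))` and `find_lines(cols, len(img))` would give candidate horizontal and vertical rule positions.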

5
(No Transcript)
6
(No Transcript)
7
Zone Classification: Machine vs. Handwriting
  • Machine printed text is consistent/regular.
  • Handwriting is irregular.
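One way to turn the regular/irregular distinction into a test is to measure how much simple geometric features vary within a zone. The sketch below is an assumption-laden illustration: the slides do not say which features or thresholds were used, so the choice of component heights and gaps, the coefficient of variation as the measure, and the `threshold` value are all hypothetical.

```python
from statistics import mean, pstdev


def variation(values):
    """Coefficient of variation (std / mean) of a non-empty list of measurements."""
    m = mean(values)
    return pstdev(values) / m if m else 0.0


def classify_zone(component_heights, component_gaps, threshold=0.25):
    """Label a zone by how regular its connected components are.

    `component_heights` and `component_gaps` are pixel measurements of the
    ink components in the zone and the spaces between them. A low variation
    suggests machine print; a high one suggests handwriting.
    """
    score = max(variation(component_heights), variation(component_gaps))
    return "machine" if score < threshold else "handwriting"
```

A real system would likely combine more features; this only shows the shape of the decision.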

8
Document Templates
  • Images are not ideal, which results in incorrect
    zoning and classification.
  • Form layout is the same across documents.
  • Features missed in one image are found in
    another.
  • Build a template of the document's form by using
    several documents.
  • The template provides robustness and increases
    accuracy.
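The idea of pooling several documents into one template can be sketched as a simple vote. This is a hypothetical illustration, not the method from the talk: it assumes the images are already aligned, that each document contributes a list of detected line positions along one axis, and that `tolerance` and `min_support` are tunable guesses.

```python
def build_template(detections, tolerance=2, min_support=0.5):
    """Merge per-document line positions into a consensus template.

    `detections` holds one list of line positions (in pixels) per document.
    Positions within `tolerance` pixels of their neighbor are grouped, and a
    group is kept when at least `min_support` of the documents contain it.
    """
    votes = sorted(
        (pos, doc) for doc, lines in enumerate(detections) for pos in lines
    )
    clusters = []
    for pos, doc in votes:
        if clusters and pos - clusters[-1]["positions"][-1] <= tolerance:
            clusters[-1]["positions"].append(pos)
            clusters[-1]["docs"].add(doc)
        else:
            clusters.append({"positions": [pos], "docs": {doc}})
    needed = min_support * len(detections)
    return [
        round(sum(c["positions"]) / len(c["positions"]))
        for c in clusters
        if len(c["docs"]) >= needed
    ]
```

A line that one image missed still enters the template if enough other images contain it, and a spurious line seen in a single image is voted out.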

9
Document Templates
10
Zoned Image
11
Automated Text Recognition
12
Word Segmentation
13
Letter Segmentation
14
Optical Character Recognition
15
Handwriting Recognition
16
Handwriting Recognition
  • The Task
  • Online handwriting recognition
      • The writer's pen movements are captured
      • Velocity, acceleration, stroke order are
        available
  • Offline handwriting recognition
      • Page was previously written and scanned
      • Only pixel color information available
  • Genealogical records are all offline
  • Offline is harder, but doable

17
Handwriting Recognition
  • Can we just convert offline data into (simulated)
    online data?
      • Yes, although difficult to do reliably
      • What order were the strokes written in?
      • Doubled-up line segments? Ink blobs? Spurious
        joins between letters? Missing joins?
  • Inferring online data (e.g. stroke ordering)
    could be crucial to success
      • Demonstrated to be solvable with reasonable
        reliability

18
Handwriting Recognition
  • An example of some steps in the analysis process
      • Contour extraction
      • Midline determination
      • Stroke ordering

19
Handwriting Recognition
  • An example of some steps in the recognition
    process
      • Handwriting style clustering
      • Letter recognition
      • Approximate string matching

Smith Smythe
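The Smith/Smythe pair illustrates approximate string matching: a recognized word is compared against known names and the nearest one wins. The slides do not name a specific measure, so the sketch below swaps in plain Levenshtein edit distance as an assumption; the `closest_name` helper and its lexicon are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ca
                    current[j - 1] + 1,      # insert cb
                    previous[j - 1] + cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def closest_name(word, lexicon):
    """Return the lexicon entry with the smallest edit distance to `word`."""
    return min(lexicon, key=lambda name: edit_distance(word, name))
```

With this measure "Smith" and "Smythe" are two edits apart (one substitution, one insertion), so a noisy reading of either can still be matched to the other.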
20
Automatic Record Extraction
21
Extraction Algorithm
  • Identify the Geometric Structure
  • Identify the Type of Information
  • Identify the Attribute-Value pairs
  • Identify the Record Boundaries

22
Column-Row Recognition
23
Genealogical Ontology
24
Match Labels
ROAD, STREET, &c., and No. or NAME of HOUSE
Location
25
Match Labels
NAME and Surname of each Person
Full Name
Location
26
Match Labels
RELATION to Head of Family
Relationship
Location
Full Name
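The three label-matching slides pair a printed column heading with a concept from the genealogical ontology. A minimal sketch of that pairing, assuming a keyword-overlap score, is shown below. The keyword sets are hypothetical and chosen only to cover the three headings shown; the actual ontology and matching rules are not given in the slides.

```python
import re

# Hypothetical keyword sets for three ontology concepts.
ONTOLOGY_KEYWORDS = {
    "Location": {"road", "street", "house"},
    "Full Name": {"name", "surname", "person"},
    "Relationship": {"relation", "head", "family"},
}


def match_label(label, ontology=ONTOLOGY_KEYWORDS):
    """Return the concept sharing the most keywords with the label, or None."""
    words = set(re.findall(r"[a-z]+", label.lower()))
    best, best_hits = None, 0
    for concept, keywords in ontology.items():
        hits = len(words & keywords)
        if hits > best_hits:
            best, best_hits = concept, hits
    return best
```

Counting hits rather than taking the first match matters here: the address heading contains the word "NAME" but still scores higher for Location than for Full Name.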
27
Extract Records
Collafer
28
Extract Records
Collafer
29
Extract Records
Collafer
30
Extract Records
Collafer
31
Web Query
John
Eyres
32
Search Results
33
Online Digital Microfilm Problem
  • Many of the images we are interested in are quite
    large.

6048 x 4287 pixels
34
What is Just-In-Time Browsing?
A method of quickly browsing digital images over
the Internet that capitalizes on:
  • Progressive Image Transmission
  • Hierarchical Spatial Resolution
  • Progressive Bitplane Encoding
  • JBIG Compressed Bitplanes
  • Prioritized Regions of Interest
  • User Interaction

35
Hierarchical PIT
Sequential Transmission
(Progressive Image Transmission)
36
PIT Using Bitplane Method
1 BitPlane (2 levels of gray)
2 BitPlanes (4 levels of gray)
3 BitPlanes (8 levels of gray)
4 BitPlanes (16 levels of gray)
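The relationship between bitplanes received and gray levels shown can be sketched as follows. This is an illustrative assumption about the encoding: 8-bit grayscale pixels, planes sent most significant bit first, and no compression (the JBIG step mentioned earlier is not modeled).

```python
def bitplanes(pixels, bits=8):
    """Split gray values into bitplanes, most significant plane first."""
    return [[(p >> b) & 1 for p in pixels] for b in range(bits - 1, -1, -1)]


def reconstruct(planes, bits=8):
    """Rebuild an approximation from the planes received so far.

    Planes not yet received are treated as all zeros, so n planes can
    produce at most 2**n distinct gray levels.
    """
    values = [0] * len(planes[0])
    for k, plane in enumerate(planes):
        shift = bits - 1 - k  # plane k carries this bit position
        values = [v | (bit << shift) for v, bit in zip(values, plane)]
    return values
```

Calling `reconstruct(planes[:n])` after each plane arrives gives the viewer a coarse image immediately and refines it as more planes come in; with all 8 planes the original values are recovered exactly.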
37
Digital Microfilm Browser
38
PAF 5 Generation Pedigree
39
PAF 5 Generation Pedigree
40
Gena: A 3D Genealogy Visualizer
41
Concluding Remarks
Workshop: April 4, 2002 at BYU
www.cs.byu.edu/familyhistory
42
Appendix
Categorized List of BYU Faculty Interests
in Computer Science Research Topics that Support
Technology for Family History and Genealogy
43
Extraction from Digitized Images
  • Scanning (Flanagan)
  • Segmentation & Table Recognition
    (Barrett, Martinez)
  • OCR for Old Type-Set Text (Martinez)
  • Element Classification & Record Construction
    (Embley, Barrett, Martinez)
  • Handwriting Recognition (Sederberg)
  • Recognition of Hand-printed Text
    (Olsen, Barrett, Martinez)

44
Extraction from Digital Data Sources
  • Automatic Extraction from Semi-structured and
    Unstructured Sources (Embley, Martinez)
  • Mappings from Heterogeneous Structured Source
    Views to Target Views (Embley)
  • Individualized Source Views (Woodfield)

45
Information Integration
  • Definition of Ontological Expectations
    (Embley, Woodfield)
  • Value Normalization (Woodfield)
  • Object Identity & Data Merging
    (Embley, Sederberg)
  • Managing Uncertainty
    (Embley, Woodfield, Martinez)

46
Systems for Family History and Genealogy
  • Storage of Large Volumes of Data (Flanagan)
  • Distributed Storage (Woodfield)
  • Indexing Original Documents (Martinez, Embley)
  • Human-Computer Interaction (Olsen)
  • Just-in-Time Browsing (Barrett, Olsen)
  • Workflow for Directing Genealogical Work
    (Woodfield, Martinez, Embley)
  • Notification Systems (Woodfield)
  • Visualization (Sederberg)