Title: Computer Science Research for Family History and Genealogy
1Computer Science Research for Family History and
Genealogy
Computer Graphics, Vision, Image Processing Laboratory
Neural Networks and Machine Learning Laboratory
Data Extraction and Integration Laboratory
Laboratory for Information, Collaboration, Interaction Environments
Performance Evaluation Laboratory
Data and Software Engineering Laboratory
www.cs.byu.edu/familyhistory
David W. Embley, Heath Nielson, Mike Rimer,
Luke Hutchison, Ken Tubbs, Doug Kennard,
Tom Finnigan, William A. Barrett
2The Problem
- 2.5 million rolls of microfilm
- Assuming 1000 images per roll
- 2.5 billion images
Is there a way to automatically extract this
information?
3A (Possible) Solution
Let a computer do the extraction work.
- Input: Images of Microfilmed Records
- Table Recognition (Heath Nielson)
- Old-Text Recognition (Mike Rimer)
- Handwriting Recognition (Luke Hutchison)
- Record Extraction and Organization (Ken Tubbs)
- Just-in-Time Browsing (Doug Kennard)
- Visualization (Tom Finnigan)
- Output: Organized Genealogical Information
4Zoning: General Overview
- Find the lines in the document using the
horizontal and vertical profiles of the image.
- Apply a matched filter to the profiles to
identify the line signatures.
- Recursively divide the document into separate
pieces, analyzing each piece for lines.
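The three zoning steps above can be illustrated with a minimal sketch. It assumes a binary image stored as a list of rows (1 = ink), a simple three-tap kernel standing in for the matched filter, and a fixed coverage threshold. The names `profiles`, `find_lines`, `zones`, and `LINE_KERNEL` are hypothetical; the slides do not give the actual filter or stopping rule.

```python
# Hypothetical three-tap kernel: rewards a thin dark line with lighter neighbours.
LINE_KERNEL = (-1, 2, -1)

def profiles(img):
    """Horizontal and vertical projection profiles (ink count per row / column)."""
    rows = [sum(r) for r in img]
    cols = [sum(c) for c in zip(*img)]
    return rows, cols

def matched_filter(profile, kernel=LINE_KERNEL):
    """Correlate a profile with the kernel, treating out-of-range samples as 0."""
    half = len(kernel) // 2
    out = []
    for i in range(len(profile)):
        acc = 0
        for k, w in enumerate(kernel):
            j = i + k - half
            if 0 <= j < len(profile):
                acc += w * profile[j]
        out.append(acc)
    return out

def find_lines(profile, length, min_fraction=0.8):
    """Indices whose response suggests a line covering most of `length` pixels."""
    threshold = max(LINE_KERNEL) * min_fraction * length
    return [i for i, r in enumerate(matched_filter(profile)) if r >= threshold]

def zones(img, top=0, left=0, bottom=None, right=None):
    """Recursively split the region at detected lines; return leaf boxes
    as (top, left, bottom, right) with exclusive bottom/right."""
    bottom = len(img) if bottom is None else bottom
    right = len(img[0]) if right is None else right
    h, w = bottom - top, right - left
    if h <= 0 or w <= 0:
        return []
    sub = [row[left:right] for row in img[top:bottom]]
    rows, cols = profiles(sub)
    h_lines = find_lines(rows, w)
    if h_lines:
        r = top + h_lines[0]
        return zones(img, top, left, r, right) + zones(img, r + 1, left, bottom, right)
    v_lines = find_lines(cols, h)
    if v_lines:
        c = left + v_lines[0]
        return zones(img, top, left, bottom, c) + zones(img, top, c + 1, bottom, right)
    return [(top, left, bottom, right)]

# Toy 7x7 form: one horizontal rule at row 3, one vertical rule at column 2.
form = [[1 if (r == 3 or c == 2) else 0 for c in range(7)] for r in range(7)]
print(zones(form))
```

On real microfilm images the profiles are noisy, so the kernel and threshold would need tuning.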
5(No Transcript)
6(No Transcript)
7Zone Classification: Machine vs. Handwriting
- Machine printed text is consistent/regular.
- Handwriting is irregular.
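One way to turn the regular-versus-irregular observation into a test is to measure how much simple features vary within a zone. The sketch below is an assumption, not the classifier used in the project: it takes character heights and inter-character gaps (already measured, in pixels) and compares their coefficient of variation to a hypothetical threshold.

```python
from statistics import mean, pstdev

def variation(values):
    """Coefficient of variation (std / mean); 0.0 for empty or all-zero input."""
    if not values:
        return 0.0
    m = mean(values)
    return pstdev(values) / m if m else 0.0

def classify_zone(char_heights, char_gaps, threshold=0.25):
    """Label a zone 'machine' when both features are regular, else 'handwriting'.
    The 0.25 threshold is an illustrative guess."""
    score = max(variation(char_heights), variation(char_gaps))
    return "machine" if score < threshold else "handwriting"

print(classify_zone([10, 10, 11, 10], [3, 3, 3, 3]))
print(classify_zone([8, 14, 10, 19], [2, 7, 1, 5]))
```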
8Document templates
- Images are not ideal.
- Results in incorrect zoning and classification.
- Form layout is the same across documents.
- Features missed in one image are found in
another.
- Build a template of the document's form by using
several documents.
- Provides robustness and increases accuracy.
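The template idea can be sketched as a vote over line positions found in several images of the same form. This is a minimal illustration under stated assumptions: line positions are given as pixel coordinates per document, nearby positions (within a hypothetical `tolerance`) are treated as the same line, and a line enters the template when a majority of documents contain it. The function name `build_template` is made up for the example.

```python
def build_template(detections, tolerance=2, min_votes=None):
    """Merge per-document line positions into one template.

    detections: list of lists, one list of line positions per document.
    A cluster of nearby positions is kept if enough distinct documents vote for it.
    """
    if min_votes is None:
        min_votes = len(detections) // 2 + 1  # simple majority
    tagged = sorted((pos, doc) for doc, lines in enumerate(detections) for pos in lines)
    clusters, current = [], []
    for pos, doc in tagged:
        # Start a new cluster when the gap to the previous position is too large.
        if current and pos - current[-1][0] > tolerance:
            clusters.append(current)
            current = []
        current.append((pos, doc))
    if current:
        clusters.append(current)
    template = []
    for cluster in clusters:
        voters = {doc for _, doc in cluster}
        if len(voters) >= min_votes:
            template.append(round(sum(p for p, _ in cluster) / len(cluster)))
    return template

# Three scans of the same form; the second scan missed the line near 50,
# and the third picked up a spurious line at 130.
print(build_template([[10, 50, 90], [11, 90], [10, 52, 89, 130]]))
```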
9Document Templates
10Zoned Image
11Automated Text Recognition
12Word Segmentation
13Letter Segmentation
14Optical Character Recognition
15Handwriting Recognition
16Handwriting Recognition
- The Task
- Online handwriting recognition
- The writer's pen movements are captured
- Velocity, acceleration, stroke order are
available
- Offline handwriting recognition
- Page was previously written and scanned
- Only pixel color information available
- Genealogical records are all offline
- Offline is harder, but doable
17Handwriting Recognition
- Can we just convert offline data into (simulated)
online data?
- Yes, although difficult to do reliably
- What order were the strokes written in?
- Doubled-up line segments? Ink blobs? Spurious
joins between letters? Missing joins?
- Inferring online data (e.g. stroke ordering)
could be crucial to success
- Demonstrated to be solvable with reasonable
reliability
18Handwriting Recognition
- An example of some steps in the analysis process
- Contour extraction
- Midline determination
- Stroke ordering
19Handwriting Recognition
- An example of some steps in the recognition
process
- Handwriting style clustering
- Letter recognition
- Approximate string matching
Smith Smythe
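The Smith/Smythe pairing can be illustrated with Levenshtein edit distance, one common measure for approximate string matching. The slides do not say which measure the project uses, so this is a sketch under that assumption; `closest_names` and the sample lexicon are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def closest_names(query, lexicon, max_distance=2):
    """Lexicon entries within max_distance of the query, nearest first."""
    scored = sorted((edit_distance(query.lower(), n.lower()), n) for n in lexicon)
    return [n for d, n in scored if d <= max_distance]

print(edit_distance("Smith", "Smythe"))
print(closest_names("Smith", ["Smythe", "Jones", "Smith"]))
```

Matching a recognized word against a name lexicon this way lets a noisy letter-level result still land on a plausible surname.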
20Automatic Record Extraction
21Extraction Algorithm
- Identify the Geometric Structure
- Identify the Type of Information
- Identify the Attribute-Value pairs
- Identify the Record Boundaries
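The last two steps can be sketched on a table whose columns have already been typed. The example assumes that each non-empty table row is one record, which is a simplification; the slides do not state how record boundaries are found. The function name and the sample rows are made up for illustration.

```python
def extract_records(column_concepts, rows):
    """Turn table rows into records of attribute-value pairs.

    column_concepts: one concept name (or None for unmatched columns) per column.
    rows: list of rows, each a list of cell strings.
    Assumption: every row with at least one filled, typed cell is one record.
    """
    records = []
    for row in rows:
        record = {concept: cell.strip()
                  for concept, cell in zip(column_concepts, row)
                  if concept is not None and cell.strip()}
        if record:
            records.append(record)
    return records

# Illustrative data only, not taken from a real census page.
concepts = ["Location", "Full Name", "Relationship"]
rows = [["12 High St", "John Eyres", "Head"],
        ["", "Mary Eyres", "Wife"],
        ["", "", ""]]
for rec in extract_records(concepts, rows):
    print(rec)
```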
22 Column-Row Recognition
23Genealogical Ontology
24Match Labels
ROAD, STREET, &c., and No. or NAME of HOUSE
Location
25Match Labels
NAME and Surname of each Person
Full Name
Location
26Match Labels
RELATION to Head of Family
Relationship
Location
Full Name
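The label matching shown on these slides can be approximated with keyword overlap between a column heading and each ontology concept. The keyword sets below are assumptions chosen to fit the three headings shown; the real ontology is richer than this sketch, and `match_label` is a hypothetical name.

```python
import re

# Hypothetical keyword sets per ontology concept.
CONCEPT_KEYWORDS = {
    "Location": {"road", "street", "house"},
    "Full Name": {"name", "surname"},
    "Relationship": {"relation"},
}

def match_label(label):
    """Return the concept sharing the most keywords with the label, or None."""
    words = set(re.findall(r"[a-z]+", label.lower()))
    scores = {c: len(words & kws) for c, kws in CONCEPT_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

for heading in ["ROAD, STREET, &c., and No. or NAME of HOUSE",
                "NAME and Surname of each Person",
                "RELATION to Head of Family"]:
    print(heading, "->", match_label(heading))
```

Scoring by overlap rather than first match matters here: the first heading contains "NAME" but still resolves to Location because three location keywords outvote one name keyword.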
27Extract Records
Collafer
28Extract Records
Collafer
29Extract Records
Collafer
30Extract Records
Collafer
31Web Query
John
Eyres
32Search Results
33Online Digital Microfilm Problem
- Many of the images we are interested in are quite
large.
6048 x 4287 pixels
34What is Just-In-Time Browsing?
A method of quickly browsing digital images over
the Internet which capitalizes on
- Progressive Image Transmission
- Hierarchical Spatial Resolution
- Progressive Bitplane Encoding
- JBIG Compressed Bitplanes
- Prioritized Regions of Interest
- User Interaction
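Hierarchical spatial resolution can be pictured as an image pyramid sent coarsest level first. The sketch below assumes a grayscale image as a list of rows and halves the resolution by averaging 2x2 blocks; the slides do not specify the actual decomposition, and the function names are hypothetical.

```python
def downsample(img):
    """Halve width and height by averaging each 2x2 block (integer average)."""
    h, w = len(img) // 2, len(img[0]) // 2
    return [[(img[2 * r][2 * c] + img[2 * r][2 * c + 1]
              + img[2 * r + 1][2 * c] + img[2 * r + 1][2 * c + 1]) // 4
             for c in range(w)]
            for r in range(h)]

def pyramid(img, min_size=1):
    """Resolution levels ordered coarsest first, ending with the full image."""
    levels = [img]
    while len(levels[-1]) // 2 >= min_size and len(levels[-1][0]) // 2 >= min_size:
        levels.append(downsample(levels[-1]))
    return levels[::-1]

image = [[0, 0, 8, 8],
         [0, 0, 8, 8],
         [4, 4, 12, 12],
         [4, 4, 12, 12]]
for level in pyramid(image):
    print(len(level), "x", len(level[0]), level)
```

A browser built this way can show a small preview almost immediately and fetch finer levels only for the region the user zooms into.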
35Hierarchical PIT
Sequential Transmission
(Progressive Image Transmission)
36PIT Using Bitplane Method
1 BitPlane (2 levels of gray)
2 BitPlanes (4 levels of gray)
3 BitPlanes (8 levels of gray)
4 BitPlanes (16 levels of gray)
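The relationship between bitplanes and gray levels can be shown with a short sketch: sending the most significant bitplane first gives 2 levels, and each further plane doubles the count. It assumes 8-bit grayscale pixels in a flat list and leaves out the JBIG compression of each plane; the function names are hypothetical.

```python
def bitplanes(pixels, depth=8):
    """Yield bitplanes most significant first; each plane is a list of 0/1."""
    for bit in range(depth - 1, -1, -1):
        yield [(p >> bit) & 1 for p in pixels]

def progressive_reconstruct(planes, depth=8):
    """Yield (planes_received, image) as each bitplane arrives."""
    image = None
    for n, plane in enumerate(planes, start=1):
        if image is None:
            image = [0] * len(plane)
        shift = depth - n
        image = [v | (b << shift) for v, b in zip(image, plane)]
        yield n, list(image)

pixels = [0, 37, 128, 200, 255]
for n, approx in progressive_reconstruct(bitplanes(pixels)):
    # With n planes there are at most 2**n distinct gray levels.
    print(n, approx)
```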
37Digital Microfilm Browser
38PAF 5 Generation Pedigree
39PAF 5 Generation Pedigree
40Gena: A 3D Genealogy Visualizer
41Concluding Remarks
Workshop: April 4, 2002 at BYU
www.cs.byu.edu/familyhistory
42Appendix
Categorized List of BYU Faculty Interests
in Computer Science Research Topics that Support
Technology for Family History and Genealogy
43Extraction from Digitized Images
- Scanning (Flanagan)
- Segmentation and Table Recognition
(Barrett, Martinez)
- OCR for Old Type-Set Text (Martinez)
- Element Classification and Record Construction
(Embley, Barrett, Martinez)
- Handwriting Recognition (Sederberg)
- Recognition of Hand-printed Text
(Olson, Barrett, Martinez)
44Extraction from Digital Data Sources
- Automatic Extraction from Semi-structured and
Unstructured Sources (Embley, Martinez)
- Mappings from Heterogeneous Structured Source
Views to Target Views (Embley)
- Individualized Source Views (Woodfield)
45Information Integration
- Definition of Ontological Expectations
(Embley, Woodfield)
- Value Normalization (Woodfield)
- Object Identity and Data Merging
(Embley, Sederberg)
- Managing Uncertainty
(Embley, Woodfield, Martinez)
46Systems for Family History and Genealogy
- Storage of Large Volumes of Data (Flanagan)
- Distributed Storage (Woodfield)
- Indexing Original Documents (Martinez, Embley)
- Human-Computer Interaction (Olsen)
- Just-in-Time Browsing (Barrett, Olsen)
- Workflow for Directing Genealogical Work
(Woodfield, Martinez, Embley)
- Notification Systems (Woodfield)
- Visualization (Sederberg)