Automatically Identifying Records from the Extracted Data Fields of Genealogical Microfilm - PowerPoint PPT Presentation

Slides: 25
Provided by: drdavid59
Learn more at: https://www.deg.byu.edu
Transcript and Presenter's Notes

Title: Automatically Identifying Records from the Extracted Data Fields of Genealogical Microfilm


1
Automatically Identifying Records from the
Extracted Data Fields of Genealogical Microfilm
  • Kenneth Tubbs

2
Microfilm Image
3
Input
Table Zones
  • The coordinates of each table cell
  • The printed text in ASCII for each cell, if
    any.
  • Whether or not the cell is empty.
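The three inputs listed for each table zone can be pictured as a small record type. The sketch below is only an illustration: the class name `TableCell`, the pixel-coordinate convention, and the sample values are assumptions, not part of the original slides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TableCell:
    """One table zone (hypothetical layout; field names are assumptions)."""
    # Coordinates of the cell, assumed here to be pixel offsets in the image.
    left: int
    top: int
    right: int
    bottom: int
    # Printed text in ASCII, or None when the cell has no printed text.
    printed_text: Optional[str] = None
    # Whether or not the cell is empty.
    is_empty: bool = True

    @property
    def width(self) -> int:
        return self.right - self.left

    @property
    def height(self) -> int:
        return self.bottom - self.top


# Made-up example: a printed label cell with a handwritten value cell below it.
label = TableCell(100, 50, 300, 90, printed_text="Name", is_empty=False)
value = TableCell(100, 90, 300, 130, printed_text=None, is_empty=False)
```

Keeping width and height as derived properties makes it straightforward to compare a label cell with the value cell next to it, which is what the column and row primitives appear to rely on.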

4
Algorithm
5
Identify Structure
  1. Identify Table Primitives
  2. Aggregate Table Primitives
  3. Sort Candidates

6
Identify Structure
  1. Identify Table Primitives

Column: table_label width, table_value width, below
7
Identify Structure
  1. Identify Table Primitives

Row: table_label height, table_value height, left
8
Identify Structure
  1. Identify Table Primitives

Printed Text
Hand-written Text
9
Identify Structure
1. Identify Table Primitives
  • Probabilistic rules are associated with each
    primitive type.
  • Examples
    • Column primitives should be factored left to
      right. (.9)
    • Row primitives factor the Column primitives below
      them. (.7)

10
Identify Structure
2. Aggregate Table Primitives
11
Identify Structure
2. Aggregate Table Primitives
G H I J K L or G H I J K L or K G H I J L or G H I J KL or Others
12
Identify Structure
3. Sort Candidates
  • The candidates are evaluated based on:
    • The confidence of the table primitive matches.
    • The probability that the rules used are correct.
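One way to picture this ranking is to give each candidate a single score built from both criteria. The slides do not say how the two are combined; the sketch below assumes a simple product of the match confidences and the rule probabilities, and the class name `Candidate` and the sample confidences are hypothetical. The rule probabilities .9 and .7 are taken from the example rules earlier in the deck.

```python
from dataclasses import dataclass
from math import prod
from typing import List


@dataclass
class Candidate:
    """A candidate table structure (hypothetical representation)."""
    name: str
    match_confidences: List[float]   # confidence of each table primitive match
    rule_probabilities: List[float]  # probability of each rule that was applied

    def score(self) -> float:
        # Assumption: the criteria are combined by multiplication.
        return prod(self.match_confidences) * prod(self.rule_probabilities)


def sort_candidates(candidates: List[Candidate]) -> List[Candidate]:
    """Return the candidates ordered from highest to lowest score."""
    return sorted(candidates, key=lambda c: c.score(), reverse=True)


# Made-up candidates for illustration only.
candidates = [
    Candidate("A", [0.95, 0.90], [0.9, 0.7]),
    Candidate("B", [0.95, 0.90], [0.9, 0.9]),
    Candidate("C", [0.60, 0.90], [0.9, 0.7]),
]
ranked = sort_candidates(candidates)
```

With these made-up numbers, candidate B ranks first because it uses only the higher-probability rule, and C ranks last because one of its primitive matches is weak.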

13
Identify Structure
3. Sort Candidates
  1. G H I J K L
  2. G H I J K L
  3. G H I J KL
  4. K G H I J L
  5. Others

14
Match Attributes
  1. Identify Possible Mappings
  2. Sort Candidates

15
Match Attributes
  1. Identify Possible Mappings

Printed Text
Mapping types
  • Identical Matches
  • Synonym Matches
  • Composite Matches
  • Human-Aided Matches
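The four mapping types can be illustrated with a small classifier that maps a printed table label to ontology attributes. This is a sketch under stated assumptions: the attribute list, the synonym table, the rule that a composite label is one joined by "and", "&", or a comma, and the function name `find_mapping` are all hypothetical and are not taken from the slides.

```python
import re
from enum import IntEnum
from typing import List, Optional, Tuple


class MatchType(IntEnum):
    # Ordered from most to least trusted (this ordering is an assumption).
    IDENTICAL = 0
    SYNONYM = 1
    COMPOSITE = 2
    HUMAN_AIDED = 3


# Hypothetical ontology attributes and synonym table.
ONTOLOGY_ATTRIBUTES = ["Name", "Age", "Gender", "Address"]
SYNONYMS = {"sex": "Gender", "residence": "Address"}
_BY_LOWER = {a.lower(): a for a in ONTOLOGY_ATTRIBUTES}


def _resolve(key: str) -> Optional[str]:
    """Resolve one lowercase term to an ontology attribute, if possible."""
    return _BY_LOWER.get(key) or SYNONYMS.get(key)


def find_mapping(label: str) -> Tuple[MatchType, List[str]]:
    """Classify a printed label and return the attributes it maps to."""
    key = label.strip().lower()
    if key in _BY_LOWER:
        return MatchType.IDENTICAL, [_BY_LOWER[key]]
    if key in SYNONYMS:
        return MatchType.SYNONYM, [SYNONYMS[key]]
    # Assumption: a composite label joins several terms with "and", "&", or ",".
    parts = [p for p in re.split(r"\s*(?:\band\b|&|,)\s*", key) if p]
    targets = [_resolve(p) for p in parts]
    if len(parts) > 1 and all(targets):
        return MatchType.COMPOSITE, targets
    # Nothing matched automatically, so a person would have to supply the mapping.
    return MatchType.HUMAN_AIDED, []
```

For example, `find_mapping("Sex")` falls through the identical check and is resolved by the synonym table, while a label such as "Remarks" matches nothing and is left for a human to map.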

16
Match Attributes
2. Sort Candidates
  • The candidates are evaluated based on:
    • The type of the match.
    • The confidence of the match.

17
Check Constraints
  1. Identify the individual records
  2. Evaluate the records against the genealogical
    ontology.

18
Check Constraints
Table(Address, Age) = 4.1
[Figure: diagram of the table attributes Address, Gender, Name, and Age, labeled with the values 1, 1, 1, 4.1, 3.9, and 4.2]
19
Check Constraints
Ontology(Address, Age) = 1.5 × 4.3 × .9 = 5.805
[Figure: ontology diagram of Family, Address, Person, Age, Gender, and Name, labeled with the values 5, 10, 1.1, 1.1, .9, 1.1, 1.3, 1.5, 4.3, and 1.3]
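The three values shown for Ontology(Address, Age) multiply to 5.805 (1.5 × 4.3 × .9), which suggests that the ontology value for a pair of attributes is the product of the weights along the path that connects them. The sketch below works through that reading. Which edge carries which weight cannot be recovered from the transcript, so the assignment in `EDGES` is an assumption made only for illustration.

```python
from collections import deque
from math import prod

# Hypothetical edge weights; only the three values themselves come from the slide.
EDGES = {
    ("Address", "Family"): 1.5,
    ("Family", "Person"): 4.3,
    ("Person", "Age"): 0.9,
}


def build_graph(edges):
    """Turn an edge dictionary into an undirected adjacency map."""
    graph = {}
    for (a, b), weight in edges.items():
        graph.setdefault(a, {})[b] = weight
        graph.setdefault(b, {})[a] = weight
    return graph


def path_weight(graph, start, goal):
    """Multiply the edge weights on a shortest path from start to goal.

    Returns None when the two attributes are not connected.
    """
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, weights = queue.popleft()
        if node == goal:
            return prod(weights)
        for neighbor, weight in graph.get(node, {}).items():
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, weights + [weight]))
    return None


GRAPH = build_graph(EDGES)
ontology_address_age = path_weight(GRAPH, "Address", "Age")  # approximately 5.805
```

A breadth-first search is used here only because it finds the unique path in a small tree-shaped ontology; the slides do not say how the path is chosen when several exist.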
20
Check Constraints
\[
\text{Constraint\_Score} = \frac{1}{2^{\frac{1}{2n}\sum_{i,j}\left(\text{Ontology}(i,j) - \text{Table}(i,j)\right)^{2}}}
\]
  • The variables i and j are attributes.
  • The sum is over all combinations of i and j.
  • The variable n is the number of attributes.
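The formula on this slide did not survive extraction cleanly, so the sketch below rests on one possible reading of it: the score is 1 divided by 2 raised to (1/(2n)) times the sum, over all attribute pairs, of the squared difference between Ontology(i, j) and Table(i, j). That reading, the function name `constraint_score`, and the dictionary layout are assumptions. The sample values 5.805 and 4.1 are the Ontology(Address, Age) and Table(Address, Age) values shown on the preceding slides.

```python
from itertools import combinations
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def constraint_score(
    ontology: Dict[Pair, float],
    table: Dict[Pair, float],
    attributes: List[str],
) -> float:
    """Score how closely the table values agree with the ontology values.

    Assumed form: 1 / 2 ** ((1 / (2n)) * sum((Ontology(i, j) - Table(i, j)) ** 2)).
    """
    n = len(attributes)
    total = sum(
        (ontology[pair] - table[pair]) ** 2
        for pair in combinations(attributes, 2)  # all combinations of i and j
    )
    return 1 / 2 ** (total / (2 * n))


attributes = ["Address", "Age"]
ontology = {("Address", "Age"): 5.805}
table = {("Address", "Age"): 4.1}
score = constraint_score(ontology, table, attributes)
```

Under this reading the score is 1 when the table agrees exactly with the ontology and falls toward 0 as the disagreement grows, which fits the later statement that candidates are sorted by their constraint score.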

21
Check Constraints
The algorithm sorts the candidates by their
constraint score.
The algorithm creates rules to prevent the
factoring of attributes that receive low
constraint scores.
22
Algorithm
23
Final Remarks
  • The algorithm produces
    1. Record patterns
      • Attributes for each record
      • Geometry for each record
    2. Attribute mappings from the table to the
      ontology.

24
Final Remarks
  • Given extracted values for the information
    written by hand, the process can extract the
    records into an XML file.
  • Individuals can then query the XML files and
    index back into the original microfilm images.
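A minimal sketch of that last step, using Python's standard `xml.etree.ElementTree` module, is shown below. The slides do not give the XML schema, so the element names (`records`, `record`, `field`), the `box` attribute that stores each cell's geometry, the image file name, and the sample values are all hypothetical.

```python
import xml.etree.ElementTree as ET


def add_record(root, record_id, image, fields):
    """Append one record; fields maps attribute name -> (value, bounding box)."""
    record = ET.SubElement(root, "record", {"id": str(record_id), "image": image})
    for name, (value, box) in fields.items():
        # Keep the cell geometry so a query can index back into the image.
        attrs = {"name": name, "box": ",".join(str(v) for v in box)}
        field = ET.SubElement(record, "field", attrs)
        field.text = value
    return record


def find_field(root, name, value):
    """Return (image, box) for every field with the given name and value."""
    hits = []
    for record in root.iter("record"):
        for field in record.findall(f"field[@name='{name}']"):
            if field.text == value:
                hits.append((record.get("image"), field.get("box")))
    return hits


# Made-up record for illustration only.
root = ET.Element("records")
add_record(
    root,
    1,
    "film_0001.tif",
    {
        "Name": ("John Smith", (100, 90, 300, 130)),
        "Age": ("34", (300, 90, 360, 130)),
    },
)
xml_text = ET.tostring(root, encoding="unicode")
hits = find_field(ET.fromstring(xml_text), "Name", "John Smith")
```

Each hit pairs an image name with the bounding box of the matching cell, which is the information needed to display the corresponding region of the original microfilm image.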