Title: Automatically Identifying Records from the Extracted Data Fields of Genealogical Microfilm
2Microfilm Image
3Input
Table Zones
- The coordinates of each table cell
- The printed text in ASCII for each cell, if any.
- Whether or not the cell is empty.
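The three inputs listed above could be held in a small record type. This is a minimal sketch only: the class name `TableZone`, its field names, and the pixel coordinates in the sample are all hypothetical and not taken from the original system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableZone:
    """One table cell as handed to the algorithm (hypothetical field names)."""
    left: int
    top: int
    right: int
    bottom: int
    printed_text: str = ""  # printed text in ASCII, if any
    is_empty: bool = True   # whether or not the cell is empty

    @property
    def width(self) -> int:
        return self.right - self.left

    @property
    def height(self) -> int:
        return self.bottom - self.top


# Illustrative cells: a printed label, a filled-in cell, and an empty cell.
zones = [
    TableZone(0, 0, 100, 20, printed_text="Name", is_empty=False),
    TableZone(0, 20, 100, 40, is_empty=False),  # filled in, but no printed text
    TableZone(0, 40, 100, 60),                  # empty cell
]
```

Keeping `is_empty` separate from `printed_text` lets a cell be non-empty without carrying any printed text, which is how a hand-written value would look in this input.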
4Algorithm
5Identify Structure
- Identify Table Primitives
- Aggregate Table Primitives
- Sort Candidates
Identify Structure
6Identify Structure
- Identify Table Primitives
Column: the table_label width matches the table_value width, and the table_value lies below the table_label.
7Identify Structure
- Identify Table Primitives
Row: the table_label height matches the table_value height, and the table_label lies to the left of the table_value.
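One possible reading of the column and row primitives is a pair of geometric tests on a label cell and a value cell. The sketch below assumes that reading, and also assumes that labels are printed and values are hand-written. The `Cell` type, the helper names, and the pixel tolerance are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    left: float
    top: float
    right: float
    bottom: float
    printed: bool  # True for a printed label, False for a hand-written value


def close(a: float, b: float, tol: float = 2.0) -> bool:
    """True when two coordinates agree within a small tolerance (assumed)."""
    return abs(a - b) <= tol


def is_column_primitive(label: Cell, value: Cell, tol: float = 2.0) -> bool:
    """Label and value share a width; the value sits directly below the label."""
    return (label.printed and not value.printed
            and close(label.left, value.left, tol)
            and close(label.right, value.right, tol)
            and close(label.bottom, value.top, tol))


def is_row_primitive(label: Cell, value: Cell, tol: float = 2.0) -> bool:
    """Label and value share a height; the label sits directly left of the value."""
    return (label.printed and not value.printed
            and close(label.top, value.top, tol)
            and close(label.bottom, value.bottom, tol)
            and close(label.right, value.left, tol))


# Illustrative cells only.
name_label = Cell(0, 0, 100, 20, printed=True)
name_value = Cell(0, 20, 100, 200, printed=False)
age_label = Cell(0, 300, 50, 320, printed=True)
age_value = Cell(50, 300, 150, 320, printed=False)
```

The tolerance absorbs small misalignments in the extracted cell coordinates; a real system would likely return a confidence rather than a yes/no answer.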
8Identify Structure
- Identify Table Primitives
Printed Text
Hand-written Text
9Identify Structure
1. Identify Table Primitives
- Probabilistic rules are associated with each primitive type.
- Examples
  - Column primitives should be factored left to right. (.9)
  - Row primitives factor the Column primitives below them. (.7)
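The two example rules and their probabilities can be kept in a simple lookup keyed by primitive type. This is a sketch of one possible representation; the names `FactoringRule`, `RULES`, and `rules_for` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FactoringRule:
    description: str
    probability: float  # probability that the rule is correct


# The two example rules from the slide, keyed by primitive type.
RULES = {
    "column": [
        FactoringRule("Column primitives are factored left to right", 0.9),
    ],
    "row": [
        FactoringRule("Row primitives factor the column primitives below them", 0.7),
    ],
}


def rules_for(primitive_type: str) -> list:
    """Return the rules associated with a primitive type (empty if none)."""
    return RULES.get(primitive_type, [])
```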
10Identify Structure
2. Aggregate Table Primitives
11Identify Structure
2. Aggregate Table Primitives
G H I J K L or G H I J K L or K G H I J L or G H I J KL or Others
12Identify Structure
3. Sort Candidates
- The candidates are evaluated based on
  - The confidence of the table primitive matches.
  - The probability that the rules used are correct.
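The slides do not say how the two factors are combined. As an illustration only, the sketch below assumes a candidate's score is the product of its primitive-match confidences and the probabilities of the rules it used. The `Candidate` type and the sample numbers are hypothetical.

```python
from dataclasses import dataclass
from math import prod


@dataclass
class Candidate:
    name: str
    primitive_confidences: list  # confidence of each table primitive match
    rule_probabilities: list     # probability of each rule that was applied

    def score(self) -> float:
        # Assumed combination: a plain product of all factors.
        return prod(self.primitive_confidences) * prod(self.rule_probabilities)


# Made-up candidates for illustration.
candidates = [
    Candidate("A", [0.8, 0.9], [0.7]),
    Candidate("B", [0.8, 0.9], [0.9]),
    Candidate("C", [0.5, 0.9], [0.9]),
]

# Highest-scoring candidate first.
ranked = sorted(candidates, key=Candidate.score, reverse=True)
```

With these inputs the order is B, A, C: B and A share the same primitive confidences, so the higher rule probability decides between them.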
13Identify Structure
3. Sort Candidates
- G H I J K L
- G H I J K L
- G H I J KL
- K G H I J L
- Others
14Match Attributes
- Identify Possible Mappings
- Sort Candidates
Match Attributes
15Match Attributes
- Identify Possible Mappings
Printed Text
Mapping types
- Identical Matches
- Synonym Matches
- Composite Matches
- Human-Aided Matches
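The four mapping types can be illustrated with a small classifier that tries each type in turn. This is a sketch under stated assumptions: the synonym table is invented, a composite match is taken to mean a label that names several attributes joined by "and" or "or", and anything unresolved falls through to a human. None of the names below come from the original system.

```python
import re

# Attribute names as they appear in the ontology slides.
ONTOLOGY_ATTRIBUTES = {"Name", "Age", "Gender", "Address"}

# Hypothetical synonym table, for illustration only.
SYNONYMS = {"sex": "Gender", "residence": "Address"}


def normalize(text: str) -> str:
    """Lower-case the label and drop anything except letters and spaces."""
    return re.sub(r"[^a-z ]", "", text.lower()).strip()


def classify_mapping(label: str):
    """Return (match_type, matched_attributes) for one printed label."""
    by_name = {a.lower(): a for a in ONTOLOGY_ATTRIBUTES}
    key = normalize(label)
    if key in by_name:
        return "identical", [by_name[key]]
    if key in SYNONYMS:
        return "synonym", [SYNONYMS[key]]
    # Assumed meaning of "composite": the label names several attributes.
    parts = [p for p in re.split(r"\s+(?:and|or)\s+", key) if p]
    resolved = [by_name.get(p) or SYNONYMS.get(p) for p in parts]
    if len(parts) > 1 and all(resolved):
        return "composite", resolved
    # Nothing matched automatically, so defer to a person.
    return "human-aided", []
```

For example, `classify_mapping("Sex")` yields a synonym match to `Gender`, while an unrecognized label such as "Occupation" is left for a human to resolve.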
16Match Attributes
2. Sort Candidates
- The candidates are evaluated based on
- The type of the match.
- The confidence of the match.
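Sorting on the two criteria above can be done with a composite sort key. The sketch assumes the match types are preferred in the order they are listed (identical, synonym, composite, human-aided) and that confidence breaks ties within a type; both the ordering and the sample mappings are assumptions.

```python
# Assumed preference order: lower rank is preferred.
TYPE_RANK = {"identical": 0, "synonym": 1, "composite": 2, "human-aided": 3}

# Made-up candidate mappings for illustration.
mappings = [
    {"label": "Sex", "attribute": "Gender", "type": "synonym", "confidence": 0.80},
    {"label": "Age", "attribute": "Age", "type": "identical", "confidence": 0.95},
    {"label": "Abode", "attribute": "Address", "type": "synonym", "confidence": 0.60},
    {"label": "Names", "attribute": "Name", "type": "identical", "confidence": 0.70},
]

# Sort by match type first, then by descending confidence within a type.
ranked = sorted(mappings, key=lambda m: (TYPE_RANK[m["type"]], -m["confidence"]))
```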
17Check Constraints
- Identify the individual records
- Evaluate the records with the Genealogical
Ontology.
Check Constraints
18Check Constraints
Table(Address, Age) = 4.1
[Figure: table graph relating Address, Gender, Name, and Age; edge values 1, 1, 1, 4.1, 3.9, 4.2]
19Check Constraints
Ontology(Address, Age) = 1.5 × 4.3 × 0.9 = 5.805
[Figure: ontology graph relating Family, Address, Person, Name, Gender, and Age; edge values 5, 10, 1.1, 1.1, .9, 1.1, 1.3, 1.5, 4.3, 1.3]
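The numbers on this slide are consistent with Ontology(Address, Age) being a product of three values, 1.5, 4.3 and 0.9, which gives 5.805. The sketch below checks that arithmetic, assuming the three values are weights along a path between the two attributes in the ontology. The function name `path_value` is hypothetical.

```python
from math import isclose, prod


def path_value(edge_weights):
    """Multiply the weights along one path between two attributes (assumed)."""
    return prod(edge_weights)


ontology_address_age = path_value([1.5, 4.3, 0.9])

# Compare with a tolerance because of floating-point rounding.
matches_slide = isclose(ontology_address_age, 5.805)
```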
20Check Constraints
\[ \text{Constraint\_Score} = 1 - \sqrt{\frac{1}{2n} \sum_{i,j} \bigl(\text{Ontology}(i, j) - \text{Table}(i, j)\bigr)^{2}} \]
- The variables i and j are attributes.
- The sum is over all combinations of i and j.
- The variable n is the number of attributes.
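The score can be sketched in code, with one caveat: the exact form of the formula is not fully recoverable from the slide. The version below assumes the score is one minus the square root of the squared Ontology and Table differences scaled by 1/(2n), and it sums only over the attribute pairs present in the table. The function name and the dictionary layout are hypothetical.

```python
from math import sqrt


def constraint_score(ontology: dict, table: dict) -> float:
    """Assumed form: 1 - sqrt((1 / (2n)) * sum of squared differences).

    Both arguments map attribute pairs (i, j) to values; n is the number
    of distinct attributes appearing in the table's pairs.
    """
    attributes = {a for pair in table for a in pair}
    n = len(attributes)
    squared_error = sum((ontology[pair] - table[pair]) ** 2 for pair in table)
    return 1 - sqrt(squared_error / (2 * n))


# Values taken from the two preceding slides.
table = {("Address", "Age"): 4.1}
ontology = {("Address", "Age"): 5.805}
score = constraint_score(ontology, table)
```

Under this assumed form a table that agrees exactly with the ontology scores 1, and the score drops as the two diverge, which fits the later statement that low scores mark attributes that should not be factored.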
21Check Constraints
The algorithm sorts the candidates by their
constraint score.
The algorithm creates rules to prevent the
factoring of the attributes that receive low
constraint scores.
22Algorithm
23Final Remarks
The algorithm produces:
1. Record patterns
   - Attributes for each record
   - Geometry for each record
2. Attribute mappings from the table to the ontology.
24Final Remarks
- Given extracted values for the information written by hand, the process can extract the records into an XML file.
- Individuals can then query the XML files and index back into the original microfilm images.
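As a rough illustration of the XML step, the sketch below writes records with their geometry to XML and then looks a value up to recover the record's position on the image. The element and attribute names (`records`, `record`, `field`, `bbox`, `image`), the file name, and the sample record are all hypothetical; the slides do not specify a schema.

```python
import xml.etree.ElementTree as ET


def records_to_xml(records, image_name):
    """Serialize records (fields plus bounding box) to an XML string."""
    root = ET.Element("records", {"image": image_name})
    for rec in records:
        node = ET.SubElement(root, "record")
        left, top, right, bottom = rec["bbox"]
        node.set("bbox", f"{left},{top},{right},{bottom}")
        for attribute, value in rec["fields"].items():
            ET.SubElement(node, "field", {"name": attribute}).text = value
    return ET.tostring(root, encoding="unicode")


def find_bboxes(xml_text, attribute, value):
    """Return the bounding boxes of records whose attribute equals value."""
    root = ET.fromstring(xml_text)
    return [
        rec.get("bbox")
        for rec in root.iter("record")
        if any(f.get("name") == attribute and f.text == value
               for f in rec.iter("field"))
    ]


# Made-up record for illustration.
xml_text = records_to_xml(
    [{"bbox": (0, 20, 400, 40), "fields": {"Name": "John Smith", "Age": "34"}}],
    "example_frame.png",
)
hits = find_bboxes(xml_text, "Name", "John Smith")
```

Storing the bounding box on each record is what allows a query result to be traced back to a region of the original microfilm image.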