Title: Record Location and Reconfiguration in Unstructured Multiple-Record Web Documents
1Record Location and Reconfiguration in
Unstructured Multiple-Record Web Documents
- David W. Embley and Li Xu
- Department of Computer Science
- Brigham Young University
2Overview
- Context: the Larger Project
- Record Location and Reconfiguration
- The Problem
- Record-Recognition Heuristic
- RLR Algorithm
- Experiments and Results
- Conclusions
3Objective: Convert Unstructured Multiple-Record
Web Documents into Structured DB Tables
[Figure: two example target tables, one with columns Car, Year, Make, Model, Price, PhoneNr and one with columns Car, Feature]
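The two target tables can be illustrated with a small in-memory database. This is only a sketch under assumptions: the column types, the use of SQLite, and the integer `Car` key are choices made here for illustration, and the sample row reuses the Cavalier ad that appears later in the deck.

```python
import sqlite3

# In-memory database holding the two tables named on the slide.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Car (
        Car     INTEGER PRIMARY KEY,  -- record identifier (assumed integer)
        Year    TEXT,
        Make    TEXT,
        Model   TEXT,
        Price   TEXT,
        PhoneNr TEXT
    );
    CREATE TABLE CarFeature (
        Car     INTEGER REFERENCES Car(Car),
        Feature TEXT
    );
    """
)

# One extracted record: single-valued fields go to Car,
# the multi-valued Feature field goes to CarFeature.
conn.execute(
    "INSERT INTO Car VALUES (?, ?, ?, ?, ?, ?)",
    (1, "1998", "CHEV", "Cavalier", "9,995", "571-7214"),
)
conn.executemany(
    "INSERT INTO CarFeature VALUES (?, ?)",
    [(1, "4 door"), (1, "auto."), (1, "air")],
)

rows = conn.execute(
    "SELECT c.Make, c.Model, COUNT(f.Feature) "
    "FROM Car c JOIN CarFeature f ON f.Car = c.Car "
    "GROUP BY c.Car"
).fetchall()
print(rows)
```

Keeping features in a separate table reflects that one ad may list any number of them, while year, make, model, price, and phone number occur at most once per record.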
4Framework
[Figure: framework diagram with boxes labeled Application Ontology; Web Page; Ontology Parser; Record Extractor; Constant/Keyword Matching Rules; Record-Level Objects, Relationships, and Constraints; Database Scheme; Constant/Keyword Recognizer; Unstructured Record Documents; Database-Instance Generator; Populated Database; Data-Record Table]
5Record Location and Reconfiguration
[Figure: the framework diagram from the previous slide, shown again]
6Problems Encountered
7Proposed Solution
- Maximize a Record-Recognition Measure
- Improvements
- Split joined records
- Distribute factored values
- Link off-page information
- Join split records
- Discard interspersed records
8Record-Recognition Heuristic: Vector Space
Modeling
- VSM
- VSM Measure
- Cosine
- Vector Length
[Figure: the vectors DV and OV]
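For reference, the two VSM quantities named on this slide can be written out as below. The reading of OV as the ontology's vector of expected per-record field counts and DV as the vector of field counts observed in the document is an assumption, since the slide does not expand the abbreviations.

```latex
c = \cos\theta = \frac{OV \cdot DV}{|OV|\,|DV|}
  = \frac{\sum_i OV_i \, DV_i}{\sqrt{\sum_i OV_i^2}\,\sqrt{\sum_i DV_i^2}},
\qquad
|V| = \sqrt{\sum_i V_i^2}
```

Under that reading, the cosine measures how closely the mix of recognized fields matches what a record should contain, and the ratio |DV| / |OV| serves as an estimate of how many records are present.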
9RLR-A Initial Records (OK)
Field     OV     DV
Year      0.98   7
Make      0.93   6
Model     0.91   6
Mileage   0.45   3
Price     0.80   7
Feature   2.10   7
PhoneNr   1.15   7
Cosine: c = 0.95
Estimator for number of records: |DV| / |OV| = 5.4
Number of tags: 6
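The cosine and the record-count estimator on this slide can be recomputed from the per-field values. The sketch below is a minimal illustration, not the authors' implementation; the names `OV`, `DV`, `norm`, `c`, and `estimate` are chosen here, and the field order is simply the order on the slide.

```python
import math

# Per-field values transcribed from the slide, in the order
# Year, Make, Model, Mileage, Price, Feature, PhoneNr.
OV = [0.98, 0.93, 0.91, 0.45, 0.80, 2.10, 1.15]
DV = [7, 6, 6, 3, 7, 7, 7]


def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))


# Cosine of the angle between OV and DV.
dot = sum(o * d for o, d in zip(OV, DV))
c = dot / (norm(OV) * norm(DV))

# Ratio of vector lengths, used on the slide to estimate the record count.
estimate = norm(DV) / norm(OV)

print(round(c, 2), round(estimate, 1))
```

With the rounded values shown on the slide, this gives a cosine of about 0.94 and an estimate of about 5.5, close to the slide's 0.95 and 5.4; the small gap is presumably due to rounding in the displayed OV entries.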
10RLR-A Initial Records (not OK)
separated by lines
separated by white space
11RLR-A Join Split Records
c = 0.80
CHEV Beretta, 2 door PW. PL, tilt, Sun roof, new
engine W/only 13,000 miles runs GREAT! 12,290
c = 0.85
c = 0.29
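One way to read this slide is that a fragment scoring poorly on its own is appended to the preceding fragment when doing so raises the cosine. The sketch below illustrates that reading only. The function `join_split`, the `low` cutoff of 0.5, and the hand-assigned field counts are all hypothetical; OV is taken from the Initial Records slide.

```python
import math

# Expected per-record field values (OV) from the Initial Records slide, in the
# order Year, Make, Model, Mileage, Price, Feature, PhoneNr.
OV = [0.98, 0.93, 0.91, 0.45, 0.80, 2.10, 1.15]


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is empty)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def join_split(fragments, low=0.5):
    """Merge a low-scoring fragment into its predecessor if that helps.

    `fragments` is a list of field-count vectors in document order.
    `low` is a hypothetical cutoff below which a fragment is treated
    as a possible continuation of the previous record.
    """
    records = []
    for frag in fragments:
        if records and cosine(OV, frag) < low:
            merged = [a + b for a, b in zip(records[-1], frag)]
            if cosine(OV, merged) > cosine(OV, records[-1]):
                records[-1] = merged
                continue
        records.append(list(frag))
    return records


# Hand-assigned counts: the first fragment has a make, a model, and four
# features; the second has only a mileage and a price.
first = [0, 1, 1, 0, 0, 4, 0]
second = [0, 0, 0, 1, 1, 0, 0]
print(join_split([first, second]))
```

A fragment that already scores above the cutoff is left alone, so two complete adjacent ads are not merged by this rule.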
12RLR-A Link Off-Page Information
c = 0.38
JERRY SEINER MIDVALE, 566-3800 (see list)
BUICK Regal Sedan, highest rating! GM guy price!
13,999. CHEV Lumina, 6 pass., super deal!
11,977. CHEV Beretta, 2 door PW. PL, tilt, Sun
roof, new engine W/only 13,000 miles runs GREAT!
12,290 MASDA 626 LX, auto, leather, sunrf. Must
see. 15,000
c = 0.90
13RLR-A Split Joined Records
c = 0.56, |DV| / |OV| = 1.74
DODGE Neon 6,995 DODGE Stratus 10,295 DODGE
Avenger 9,295 KenGarff.com 526-1700
14RLR-A Distribute Factored Values
DODGE Neon 6,995 526-1700 DODGE Stratus 10,295
526-1700 DODGE Avenger 9,295 526-1700
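The two preceding slides can be illustrated together: the joined chunk is split into one record per car, and the single phone number that was factored out is copied to each. This is a sketch under assumptions; the regular expressions `RECORD` and `PHONE` and the function `split_and_distribute` are hypothetical stand-ins for the ontology's recognizers and only cover the pattern in this example.

```python
import re

# Hypothetical recognizers: an upper-case make, a capitalized model,
# and a price with a thousands separator; a 7-digit local phone number.
RECORD = re.compile(
    r"(?P<make>[A-Z]{2,})\s+(?P<model>[A-Z][a-z]+)\s+(?P<price>\d{1,3},\d{3})"
)
PHONE = re.compile(r"\b\d{3}-\d{4}\b")


def split_and_distribute(chunk):
    """Split a joined chunk into car records and share a factored phone."""
    cars = [m.groupdict() for m in RECORD.finditer(chunk)]
    phones = PHONE.findall(chunk)
    # Distribute only when exactly one phone number serves the whole chunk.
    if len(phones) == 1:
        for car in cars:
            car["phone"] = phones[0]
    return cars


joined = (
    "DODGE Neon 6,995 DODGE Stratus 10,295 DODGE "
    "Avenger 9,295 KenGarff.com 526-1700"
)
for car in split_and_distribute(joined):
    print(car["make"], car["model"], car["price"], car["phone"])
```

Text matched by neither recognizer, such as the dealer's web address, is simply left out of the resulting records, as in the slide's output.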
15RLR-A Discard Interspersed Records
c = 0.94 threshold
1998 CHEV Cavalier, 4 door, auto., air, cass.,
very clean, 9,995, 571-7214, DL1497. 1998 Next
to new must sell washer and dryer, Whirlpool,
electric, 750
c = 0.32
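Discarding interspersed records amounts to keeping only the records whose cosine against OV clears a threshold. The sketch below illustrates this. The function `discard_interspersed`, the threshold of 0.6, and the hand-assigned field counts are hypothetical; OV is taken from the Initial Records slide.

```python
import math

# Expected per-record field values (OV) from the Initial Records slide, in the
# order Year, Make, Model, Mileage, Price, Feature, PhoneNr.
OV = [0.98, 0.93, 0.91, 0.45, 0.80, 2.10, 1.15]


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is empty)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def discard_interspersed(records, threshold=0.6):
    """Keep (text, counts) pairs whose counts resemble a car ad.

    `threshold` is a hypothetical value chosen for this sketch.
    """
    return [(text, counts) for text, counts in records
            if cosine(OV, counts) >= threshold]


# Hand-assigned counts: the car ad has a year, make, model, price, four
# features, and a phone number; the washer ad only has something that
# looks like a year.
ads = [
    ("1998 CHEV Cavalier, 4 door, auto., air, cass., very clean, "
     "9,995, 571-7214, DL1497.", [1, 1, 1, 0, 1, 4, 1]),
    ("1998 Next to new must sell washer and dryer, Whirlpool, "
     "electric, 750", [1, 0, 0, 0, 0, 0, 0]),
]
kept = discard_interspersed(ads)
print([text[:18] for text, _ in kept])
```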
16Test Set Characteristics
- 30 pre-selected documents
- Characteristics
- 8 contained only regular car ads.
- 13 contained inside-boundary joined car ads, all
with inside-boundary factored values.
- 1 contained outside-boundary factored values.
- 13 contained interspersed non-car-ads.
- None contained off-page or split ads.
17Results
- Correctly reconfigured 91% (of 304)
  - 36 false drops
    - 11 car ads improperly discarded (value-recognition problem)
    - 25 car ads improperly reconfigured
      - 20 ads with identical phone numbers on every 5th
      - 5 inside-boundary ads not split (missing years, makes; not all models recognized)
- Correctly discarded 94% (of 47)
  - 3 false positives
  - all snowmobile ads
- Correctly produced 97% (of 1,077)
18Conclusions
- Record location and reconfiguration is possible.
- Key idea: maximize the VSM measure
- Establish expectations.
- Reconfigure recognized patterns with respect to
expectations.
- www.deg.byu.edu