Record Location and Reconfiguration in Unstructured Multiple-Record Web Documents

Slides: 19
Provided by: LiXu8
Learn more at: https://www.deg.byu.edu

1
Record Location and Reconfiguration in
Unstructured Multiple-Record Web Documents
  • David W. Embley and Li Xu
  • Department of Computer Science
  • Brigham Young University

2
Overview
  • Context: the Larger Project
  • Record Location and Reconfiguration
  • The Problem
  • Record-Recognition Heuristic
  • RLR Algorithm
  • Experiments and Results
  • Conclusions

3
Objective: Convert Unstructured Multiple-Record
Web Documents into Structured DB Tables
Car   Year   Make   Model   Price   PhoneNr
. . .
Car   Feature
. . .
4
Framework
Application Ontology
Web Page
Ontology Parser
Record Extractor
Constant/Keyword Matching Rules
Record-Level Objects, Relationships, and
Constraints
Database Scheme
Constant/Keyword Recognizer
Unstructured Record Documents
Database-Instance Generator
Populated Database
Data-Record Table
5
Record Location and Reconfiguration
Application Ontology
Web Page
Ontology Parser
Record Extractor
Constant/Keyword Matching Rules
Record-Level Objects, Relationships, and
Constraints
Database Scheme
Constant/Keyword Recognizer
Unstructured Record Documents
Database-Instance Generator
Populated Database
Data-Record Table
6
Problems Encountered
7
Proposed Solution
  • Maximize a Record-Recognition Measure
  • Improvements:
    • Split joined records
    • Distribute factored values
    • Link off-page information
    • Join split records
    • Discard interspersed records

8
Record-Recognition Heuristic: Vector Space Modeling
  • VSM
  • VSM Measure
  • Cosine
  • Vector Length

[Figure: vectors DV and OV]
9
RLR-A: Initial Records (OK)
Object set   OV     DV
Year         0.98   7
Make         0.93   6
Model        0.91   6
Mileage      0.45   3
Price        0.80   7
Feature      2.10   7
PhoneNr      1.15   7
Cosine: c = 0.95
Estimator for number of records: |DV| / |OV| = 5.4
Number of tags: 6
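The two VSM measures on this slide can be checked with a minimal sketch. It assumes OV holds the ontology's expected number of values per record for each object set and DV holds the number of values recognized in the document, which is one reading of the slide rather than something the transcript states. The function names are hypothetical. With the rounded values shown on the slide, the results come out close to, but not exactly at, the reported c = 0.95 and |DV| / |OV| = 5.4.

```python
import math

# Values transcribed from the slide. Interpreting OV as expected per-record
# counts and DV as counts found in the document is an assumption.
OBJECT_SETS = ["Year", "Make", "Model", "Mileage", "Price", "Feature", "PhoneNr"]
OV = [0.98, 0.93, 0.91, 0.45, 0.80, 2.10, 1.15]
DV = [7, 6, 6, 3, 7, 7, 7]


def norm(vector):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in vector))


def cosine(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (norm(a) * norm(b))


def estimated_records(dv, ov):
    """Ratio of vector lengths, used on the slide to estimate the record count."""
    return norm(dv) / norm(ov)


if __name__ == "__main__":
    print(f"cosine c    = {cosine(OV, DV):.2f}")
    print(f"|DV| / |OV| = {estimated_records(DV, OV):.2f}")
```

A cosine near 1 means the mix of recognized values resembles what the ontology expects of car ads, and the length ratio suggests how many records the document holds, here about five, in line with the six separator tags found.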
10
RLR-A: Initial Records (not OK)
separated by lines
separated by white space
11
RLR-A: Join Split Records
c = 0.80
CHEV Beretta, 2 door PW. PL, tilt, Sun roof, new
engine W/only 13,000 miles runs GREAT! 12,290
c = 0.85
c = 0.29
12
RLR-A: Link Off-Page Information
c = 0.38
JERRY SEINER MIDVALE, 566-3800 (see list)

BUICK Regal Sedan, highest rating! GM guy price!
13,999. CHEV Lumina, 6 pass., super deal!
11,977. CHEV Beretta, 2 door PW. PL, tilt, Sun
roof, new engine W/only 13,000 miles runs GREAT!
12,290 MASDA 626 LX, auto, leather, sunrf. Must
see. 15,000
c = 0.90
13
RLR-A: Split Joined Records
c = 0.56, |DV| / |OV| = 1.74
DODGE Neon 6,995 DODGE Stratus 10,295 DODGE
Avenger 9,295 KenGarff.com 526-1700
14
RLR-A: Distribute Factored Values
DODGE Neon 6,995 526-1700 DODGE Stratus 10,295
526-1700 DODGE Avenger 9,295 526-1700
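The split-then-distribute steps on the last two slides can be illustrated with a rough sketch on the Dodge example. This is not the RLR algorithm itself: the slides decide whether to split from the VSM measure, whereas this sketch swaps in a plain keyword split on car makes. The make list, the phone pattern, and both function names are hypothetical stand-ins for the ontology's recognizers.

```python
import re
from typing import List

# Hypothetical stand-ins for the ontology's recognizers.
MAKES = ("DODGE", "CHEV", "BUICK")
PHONE = re.compile(r"\b\d{3}-\d{4}\b")

AD = ("DODGE Neon 6,995 DODGE Stratus 10,295 DODGE "
      "Avenger 9,295 KenGarff.com 526-1700")


def split_joined(chunk: str) -> List[str]:
    """Split a chunk at the whitespace before each make keyword."""
    pattern = r"\s+(?=(?:%s)\b)" % "|".join(MAKES)
    return [part.strip() for part in re.split(pattern, chunk) if part.strip()]


def distribute_phone(records: List[str]) -> List[str]:
    """Copy a phone number that appears only in the last record to the others."""
    if not records:
        return records
    phones = PHONE.findall(records[-1])
    # Only distribute when exactly one phone is factored out at the end.
    if len(phones) != 1 or any(PHONE.search(r) for r in records[:-1]):
        return records
    return [r + " " + phones[0] for r in records[:-1]] + [records[-1]]


if __name__ == "__main__":
    for record in distribute_phone(split_joined(AD)):
        print(record)
```

Each of the three resulting records carries the shared phone number, matching the reconfigured text on the slide except that this sketch leaves "KenGarff.com" in the last record.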
15
RLR-A: Discard Interspersed Records
c = 0.94 threshold
1998 CHEV Cavalier, 4 door, auto., air, cass.,
very clean, 9,995, 571-7214, DL1497. 1998 Next
to new must sell washer and dryer, Whirlpool,
electric, 750
c = 0.32
16
Test Set Characteristics
  • 30 pre-selected documents
  • Characteristics
  • 8 contained only regular car ads.
  • 13 contained inside-boundary joined car ads, all
    with inside-boundary factored values.
  • 1 contained outside-boundary factored values.
  • 13 contained interspersed non-car-ads.
  • None contained off-page or split ads.

17
Results
  • Correctly reconfigured 91% (of 304)
  • 36 false drops
  • 11 car ads improperly discarded
    (value-recognition problem)
  • 25 car ads improperly reconfigured
  • 20 ads with identical phone numbers on every 5th
  • 5 inside-boundary ads not split (missing years
    and makes; not all models recognized)
  • Correctly discarded 94% (of 47)
  • 3 false positives
  • all snowmobile ads
  • Correctly produced 97% (of 1,077)

18
Conclusions
  • Record location and reconfiguration is possible.
  • Key idea: maximize VSM measure
  • Establish expectations.
  • Reconfigure recognized patterns with respect to
    expectations.

www.deg.byu.edu