Schema Matching and Data Extraction over HTML Tables

Slides: 29
Provided by: Cui79
Learn more at: https://www.deg.byu.edu

1
Schema Matching and Data Extraction over HTML Tables
  • Cui Tao
  • Data Extraction Research Group
  • Department of Computer Science
  • Brigham Young University

supported by NSF
2
Introduction
  • Many tables on the Web
  • How to integrate data stored in different tables?
  • Detect the table of interest
  • Form attribute-value pairs (adjust if necessary)
  • Do extraction
  • Infer mappings from extraction patterns

3
Problem: Detecting the Table of Interest
4
Problem: Different Schemas
  • Different source table schemas
  • Run , Yr, Make, Model, Tran, Color, Dr
  • Make, Model, Year, Colour, Price, Auto, Air Cond., AM/FM, CD
  • Vehicle, Distance, Price, Mileage
  • Year, Make, Model, Trim, Invoice/Retail, Engine, Fuel Economy
  • Target database schema
  • Car, Year, Make, Model, Mileage, Price, PhoneNr
  • Car, Feature

5
Problem: Attribute is Value
6
Problem: Attribute-Value is Value
7
Problem: Value is not Value
8
Problem: Factored Values
9
Problem: Split Values
10
Problem: Merged Values
11
Problem: Information Behind Links
12
Solution
  • Detect the table of interest
  • Form attribute-value pairs (adjust if necessary)
  • Do extraction
  • Infer mappings from extraction patterns

13
Solution: Detect the Table of Interest
  • Real table test
  • Same number of values
  • Table size
  • Attribute test
  • Density measure test
  • (# of ontology-extracted values) / (total # of values in the table)
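
The density measure test above can be sketched in a few lines. This is a minimal illustration, not the presenters' implementation: the regular expressions in RECOGNIZERS are hypothetical stand-ins for an extraction ontology, and the 0.5 threshold is an assumption chosen only for the example.

```python
import re

# Hypothetical recognizers standing in for an extraction ontology.
# A real ontology would be far richer; these exist only for the sketch.
RECOGNIZERS = {
    "Year": re.compile(r"^(19|20)\d{2}$"),
    "Price": re.compile(r"^\$?\d+(,\d{3})*$"),
    "Make": re.compile(r"^(Honda|Ford|Toyota)$", re.IGNORECASE),
}


def is_recognized(value):
    """Return True if any recognizer matches the whole cell value."""
    text = value.strip()
    return any(pattern.match(text) for pattern in RECOGNIZERS.values())


def density(rows):
    """(# of recognized values) / (total # of non-empty values)."""
    values = [cell for row in rows for cell in row if cell.strip()]
    if not values:
        return 0.0
    hits = sum(1 for value in values if is_recognized(value))
    return hits / len(values)


def is_table_of_interest(rows, threshold=0.5):
    """Assumed decision rule: accept tables whose density meets the threshold."""
    return density(rows) >= threshold


car_rows = [["Honda", "Civic EX", "1995", "White", "$6300"]]
nav_rows = [["Home", "About", "Contact"]]
print(density(car_rows), is_table_of_interest(car_rows))
print(density(nav_rows), is_table_of_interest(nav_rows))
```

A layout or navigation table contains few values the ontology recognizes, so its density stays low and it is filtered out before any extraction is attempted.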

14
Solution: Remove Factoring
15
Solution: Replace Boolean Values
16
Solution: Form Attribute-Value Pairs
<Make, Honda>, <Model, Civic EX>, <Year, 1995>,
<Colour, White>, <Price, $6300>, <Auto, Auto>,
<Air Cond., Air Cond.>, <AM/FM, AM/FM>
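
The two preceding steps (replacing Boolean values, then forming attribute-value pairs) can be sketched together as follows. This is an illustration under assumptions: the slides do not show how Boolean cells are encoded in the source table, so the "Yes"/"No" markers, the sample row, and the function name form_pairs are all hypothetical.

```python
# Assumed encodings of Boolean cells; the source table's actual markers are not shown.
BOOLEAN_TRUE = {"yes", "y", "x"}
BOOLEAN_FALSE = {"no", "n", ""}


def form_pairs(header, row):
    """Pair each attribute with its value, replacing Boolean cells first.

    A true Boolean cell becomes the attribute name itself (e.g. <Auto, Auto>);
    a false or empty cell produces no pair at all.
    """
    pairs = []
    for attribute, value in zip(header, row):
        token = value.strip().lower()
        if token in BOOLEAN_TRUE:
            pairs.append((attribute, attribute))
        elif token in BOOLEAN_FALSE:
            continue
        else:
            pairs.append((attribute, value.strip()))
    return pairs


header = ["Make", "Model", "Year", "Colour", "Price",
          "Auto", "Air Cond.", "AM/FM", "CD"]
row = ["Honda", "Civic EX", "1995", "White", "$6300",
       "Yes", "Yes", "Yes", "No"]  # hypothetical source row

for attribute, value in form_pairs(header, row):
    print(f"<{attribute}, {value}>")
```

Replacing a true Boolean with its attribute name turns a column such as Auto into a value that a feature recognizer can extract directly.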
17
Solution: Adjust Attribute-Value Pairs
<Make, Honda>, <Model, Civic EX>, <Year, 1995>,
<Colour, White>, <Price, $6300>, <Auto, Auto>,
<Air Cond., Air Cond.>, <AM/FM, AM/FM>
18
Solution: Add Information Hidden Behind Links
19
Solution: Inferred Mapping Creation
Car, Year, Make, Model, Mileage, Price,
PhoneNr, Car, Feature
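
One plausible reading of "infer mappings from extraction patterns" is a vote per source column: each value that a recognizer extracts counts as a vote for the target attribute it was extracted as, and the column maps to the attribute with the most votes. The sketch below illustrates only that reading. The toy recognize function, the majority-vote rule, and the sample rows are assumptions, not the presenters' algorithm.

```python
import re
from collections import Counter

# Toy recognizers for three target attributes (hypothetical, for illustration).
YEAR = re.compile(r"^(19|20)\d{2}$")
PRICE = re.compile(r"^\$\d[\d,]*$")
MAKES = {"honda", "ford", "toyota"}


def recognize(value):
    """Return the target attribute a value is extracted as, or None."""
    text = value.strip()
    if YEAR.match(text):
        return "Year"
    if PRICE.match(text):
        return "Price"
    if text.lower() in MAKES:
        return "Make"
    return None


def infer_mappings(header, rows):
    """Map each source column to the target attribute its values most often hit."""
    votes = {attribute: Counter() for attribute in header}
    for row in rows:
        for attribute, value in zip(header, row):
            target = recognize(value)
            if target is not None:
                votes[attribute][target] += 1
    # Columns with no recognized values get no mapping.
    return {attribute: counter.most_common(1)[0][0]
            for attribute, counter in votes.items() if counter}


header = ["Yr", "Make", "Price", "Dr"]
rows = [["1995", "Honda", "$6300", "4"],
        ["1998", "Ford", "$4,500", "2"]]
print(infer_mappings(header, rows))
```

The mapping is driven by the values rather than the column names, which is why a source attribute spelled Yr can still be matched to the target attribute Year.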
26
Experimental Results
  • Car Advertisement application domain
  • 10 training tables
  • 100% of the 57 mappings (no false mappings)
  • 94.6% precision of the values in linked pages
    (5.4% false declarations)
  • 50 test tables
  • 94.7% of the 300 mappings (no false mappings)
  • On the basis of sampling 3,000 values in linked
    pages, we obtained 97% recall and 86% precision
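
For reference, recall and precision as reported above are conventionally computed from the set of extracted values and the set of values that should have been extracted. The sketch below uses the standard definitions; the two sample sets are toy data invented for illustration and are unrelated to the reported figures.

```python
def precision_recall(extracted, relevant):
    """Standard definitions:
    precision = |extracted AND relevant| / |extracted|
    recall    = |extracted AND relevant| / |relevant|
    """
    extracted, relevant = set(extracted), set(relevant)
    correct = extracted & relevant
    precision = len(correct) / len(extracted) if extracted else 0.0
    recall = len(correct) / len(relevant) if relevant else 0.0
    return precision, recall


# Toy data only: one wrongly extracted value, one missed value.
extracted = {"1995", "White", "$6300", "Dealer"}
relevant = {"1995", "White", "$6300", "Honda"}
precision, recall = precision_recall(extracted, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```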

27
Other Applications
  • Cell Phone Plan Application domain
  • Soccer Player Application domain

28
Contribution
  • Provides an approach to extract information
    automatically from HTML tables
  • Suggests a different way to solve the problem of
    schema matching