Semiautomatic Generation of Resilient Data-Extraction Ontologies

Learn more at: https://www.deg.byu.edu

1
Semiautomatic Generation of Resilient
Data-Extraction Ontologies
  • Yihong Ding
  • Data Extraction Group
  • Brigham Young University
  • Sponsored by NSF

2
Wrapper-Driven Data Extraction
  • Web data extraction
  • Obtain user-specified information from Web
    documents
  • Wrapper
  • Convert implicit HTML data into explicit
    formatted data
  • Data-source-specified, high performance
  • Examples
  • SoftMealy, STALKER, WIEN, Omini, ROADRUNNER, ...

3
Common Problem of Wrappers
SoftMealy
  <LI> <A HREF="..."> Mani Chandy </A>, <I>Professor of Computer Science</I>
  and <I>Executive Officer for Computer Science</I>

Resiliency: fixed domain, changeable layout
Scalability: unchanged existing wrapper, extendable
domain and functions
4
Data-Extraction Ontology
  • Structure
  • Object sets
  • Relationship sets
  • Participation constraints
  • Data frames
  • Pros: resilient and scalable
  • Cons: hard to create
  • Knowledge requirements
  • Tedious and error-prone work
  • Car [-> object]
  • Car [0:1] has Make [1:*]
  • Make matches [10]
  • constant extract "\baudi\b"
  • end
  • Car [0:1] has Model [1:*]
  • Model matches [25]
  • constant extract "80"
  • context "\baudi\S*\s*80\b"
  • end
  • Car [0:1] has Mileage [1:*]
  • Mileage matches [8]
  • constant extract "\b[1-9]\d{0,2}k"
  • substitute "[kK]" -> "000"
  • end
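
A minimal Python sketch of how a data frame like the Mileage one above might be applied. It assumes the extract rule is the regular expression \b[1-9]\d{0,2}k and that the substitute rule rewrites k or K to 000. The function name, the sample ad text, and the case-insensitive matching are assumptions made for illustration, not part of the slides.

```python
import re

# Assumed reading of the Mileage data frame: an extract pattern plus a
# substitution rule. Case-insensitive matching is an assumption.
MILEAGE_EXTRACT = re.compile(r"\b[1-9]\d{0,2}k", re.IGNORECASE)
MILEAGE_SUBSTITUTE = (re.compile(r"[kK]"), "000")


def extract_mileage(text):
    """Return normalized mileage strings recognized in an ad."""
    pattern, replacement = MILEAGE_SUBSTITUTE
    values = []
    for match in MILEAGE_EXTRACT.finditer(text):
        # Apply the substitute rule to each extracted constant.
        values.append(pattern.sub(replacement, match.group(0)))
    return values


# Hypothetical ad text, not one of the training records.
print(extract_mileage("98 Audi 80, 75K miles, $4,500"))  # ['75000']
```

The model number "80" is not picked up as a mileage because the pattern requires the trailing k.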

5
Motif of Ontology Generation
6
Thesis Statement
  • Given: a knowledge base
  • Input: sample Web pages of interest
  • Output: a data-extraction ontology for the domain
    of interest
  • Between input and output: this is the work of
    this thesis

7
Ontology-Generation Procedure
8
Primary Knowledge Source
  • Requirements
  • Available
  • General in coverage
  • Rich in meaningful relationships
  • Encoded in or easily converted to XML
  • Mikrokosmos (μK) Ontology
  • Developed by NMSU jointly with U.S. DoD
  • Contains over 5000 concepts
  • Averages 14 links per concept
  • Represented in XML format

9
Integrated Knowledge Base
KNOWLEDGE BASE
μK Ontology
Data-Frame Library
Lexicons
Synonym Dictionary (WordNet)
10
Ontology-Generation Procedure
11
Domain Specification
  • Training documents
  • Data-rich
  • Narrow in topic breadth
  • Preprocessing

12
Example Car Advertisement
Record 1: 00 GrandAM SE, Sunfire Red, CD, AC, PW, PL Great Condition, 10,800, Call 798-3446
Record 2: 02 Buick Century Custom, Pwr Seat, Nada Retail 13,695 221-1250
Record 3: 02 Buick Century, lo mi, mint cond, 11,999. 373-4445 dlr 2755
Record 4: 00 Buick Century Stk# HU7159 Green $9,319, 714-2200 To Apply By Phone, 1-877-228-9486, OREM Utah
13
Ontology-Generation Procedure
14
Concept Selection
  • Selection strategies
  • Compare a string with the name of a concept
  • Compare a string with the values belonging to a
    concept
  • Apply data-frame recognizers to recognize a
    string

KB
<PHONE-NR>
00 Buick Century Stk# HU7159 Green $9,319,
714-2200 To Apply By Phone, 1-877-228-9486, OREM
Utah
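
The three selection strategies on this slide can be sketched roughly as follows. This is an illustration under stated assumptions: the miniature knowledge base, its value lists, and the phone-number pattern are hypothetical, and only the concept names come from the slides.

```python
import re

# Hypothetical miniature knowledge base. Concept names follow the slides;
# the values and the recognizer pattern are illustrative assumptions.
KNOWLEDGE_BASE = {
    "AUTOMOBILE": {"values": {"buick", "audi"}, "recognizer": None},
    "PRICE": {"values": set(), "recognizer": None},
    "PHONE-NR": {
        "values": set(),
        "recognizer": re.compile(r"\b(?:1-\d{3}-)?\d{3}-\d{4}\b"),
    },
}


def select_concepts(record):
    """Return {concept: strategy that selected it} for one record."""
    selected = {}
    tokens = {t.lower() for t in re.findall(r"[A-Za-z][A-Za-z-]*", record)}
    for concept, entry in KNOWLEDGE_BASE.items():
        if concept.lower() in tokens:
            selected[concept] = "name"        # string equals the concept name
        elif tokens & entry["values"]:
            selected[concept] = "value"       # string is a known value
        elif entry["recognizer"] and entry["recognizer"].search(record):
            selected[concept] = "data frame"  # recognizer matches the string
    return selected


record = ("00 Buick Century Stk# HU7159 Green $9,319, 714-2200 "
          "To Apply By Phone, 1-877-228-9486, OREM Utah")
print(select_concepts(record))
```

For the sample record, AUTOMOBILE is selected through a value match ("Buick") and PHONE-NR through its recognizer.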
15
Concept Selection
  • Reasons for conflict
  • Synonymy
  • Polysemy
  • Conflict resolution
  • Same string: only one meaning
  • Favor longer over shorter
  • Context decides meaning

KB
02 Buick Century Custom, Pwr Seat, Nada Retail
13,695 221-1250.
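
A rough sketch of the first two resolution rules (one meaning per string, longer match wins), assuming each candidate is a character span tagged with a concept. The span representation and the AUTOMOBILE/CENTURY candidates are hypothetical. The third rule, context decides meaning, is not modeled here.

```python
def resolve_conflicts(candidates):
    """Keep non-overlapping matches, preferring longer strings.

    candidates: list of (start, end, concept) spans over one record.
    """
    kept = []
    # Longest span first; ties go to the span that starts earlier.
    ordered = sorted(candidates, key=lambda c: (-(c[1] - c[0]), c[0]))
    for start, end, concept in ordered:
        overlaps = any(start < k_end and k_start < end
                       for k_start, k_end, _ in kept)
        if not overlaps:
            kept.append((start, end, concept))
    return sorted(kept)


record = "02 Buick Century Custom"
candidates = [
    (9, 16, "CENTURY"),     # "Century" read as a time period
    (3, 16, "AUTOMOBILE"),  # "Buick Century" read as a car
]
print(resolve_conflicts(candidates))  # [(3, 16, 'AUTOMOBILE')]
```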
16
Ontology-Generation Procedure
17
Relationship Retrieval
KB
<AUTOMOBILE>
<MILEAGE>
<YEAR>
<PRICE>
<PHONE-NR>
<AUDIO-MEDIA-ARTIFACT>
<CENTURY>
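
The slide only lists the selected concepts. Assuming relationship retrieval amounts to finding labeled paths between selected concepts in the knowledge-base graph, a breadth-first search gives a minimal sketch. The graph fragment is hypothetical, although its AUTOMOBILE, ARTIFACT, PRICE chain mirrors the IsA.ARTIFACT.CostOfProduction label used elsewhere in the deck.

```python
from collections import deque

# Toy fragment of a knowledge-base graph (an assumption for illustration).
KB_EDGES = {
    "AUTOMOBILE": [("IsA", "ARTIFACT")],
    "ARTIFACT": [("CostOfProduction", "PRICE")],
    "PRICE": [],
}


def find_path(source, target, max_links=4):
    """Breadth-first search for the shortest labeled path, or None."""
    queue = deque([(source, [])])
    visited = {source}
    while queue:
        node, path = queue.popleft()
        if node == target:
            return path
        if len(path) == max_links:
            continue  # do not extend paths beyond the link limit
        for label, neighbor in KB_EDGES.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, path + [(label, neighbor)]))
    return None


def path_name(path):
    """Join link labels and intermediate concepts with dots."""
    if not path:
        return ""
    parts = []
    for label, node in path[:-1]:
        parts += [label, node]
    parts.append(path[-1][0])
    return ".".join(parts)


path = find_path("AUTOMOBILE", "PRICE")
print(path_name(path))  # IsA.ARTIFACT.CostOfProduction
```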
18
Ontology-Generation Procedure
19
Constraint Discovery
02 Buick Century, lo mi, mint cond, green, pwr
seat, 11,999. 373-4445 dlr 2755
AUTOMOBILE [0:1] IsA.ARTIFACT.CostOfProduction PRICE [1:1]
00 Buick Century Stk# HU7159 Green $9,319,
714-2200 To Apply By Phone, 1-877-228-9486, OREM
Utah
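
The slide does not spell out how constraints are discovered. One simple possibility, offered only as an assumption, is to count how often a concept is recognized in each training record and report the observed minimum and maximum. The price pattern and function name below are hypothetical.

```python
import re

# Illustrative price recognizer (an assumption, not the thesis' data frame).
PRICE = re.compile(r"\$?\b\d{1,3},\d{3}\b")


def participation(records, recognizer):
    """Estimate a [min:max] constraint from per-record match counts."""
    counts = [len(recognizer.findall(r)) for r in records]
    low, high = min(counts), max(counts)
    # More than one occurrence in any record is reported as unbounded.
    return f"[{low}:{'*' if high > 1 else high}]"


records = [
    "02 Buick Century, lo mi, mint cond, green, pwr seat, 11,999. "
    "373-4445 dlr 2755",
    "00 Buick Century Stk# HU7159 Green $9,319, 714-2200 "
    "To Apply By Phone, 1-877-228-9486, OREM Utah",
]
print(participation(records, PRICE))  # [1:1]
```

With so few records the estimate is fragile: a single ad without a price would loosen it to [0:1].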
20
Ontology-Generation Procedure
21
Ontology Generation
  • concept nodes → object sets
  • paths → relationship sets
  • discovered constraints → participation
    constraints
  • concept recognizers → data frames

22
Automatically Generated Ontology -- Car
Advertisement
(01) Automobile [-> object]
(02) Automobile [0:1] has Mileage [1:1]
(03) Automobile [0:1] IsA.ARTIFACT.CostOfProduction Price [1:1]
(12) Price [1:1] IsA.SCALARATTRIBUTE.MeasuredIn.MEASURINGUNIT.Subclasses Year [0:*]
(20) Automobile [0:1] relatesTo PhoneNr [1:*] relatesTo
     ArtifactPart [1:*] relatesTo Mileage [1:*]
     relatesTo Truck [1:*] relatesTo
     AudioMediaArtifact [1:*] relatesTo
     CommunicationDevice [1:*] relatesTo ControlEvent
     [1:*] relatesTo TravelEvent [1:*]
23
Ontology-Generation Procedure
24
Updating Strategies
  • Remove all bad relationship sets
  • Modify remaining incorrect relationship sets
  • Substitute incorrect object sets
  • Reduce long n-ary relationship sets
  • Fix participation constraints
  • Adjust names or re-arrange sequences
  • Add new relationship sets

25
Final Ontology
  • Car [-> object]
  • Car [0:1] has Year [1:*]
  • Car [0:1] has Mileage [1:*]
  • Car [0:1] has Price [1:*]
  • PhoneNr [1:*] is for Car [0:1]
  • PhoneNr [0:1] has Extension [1:*]
  • Car [0:*] has Feature [1:*]
  • Car [0:1] has Make [1:*]
  • Car [0:1] has Model [1:*]

26
Evaluation Criteria
  • Basic measures
  • POG (Precision of Ontology Generation)
  • ROG (Recall of Ontology Generation)
  • Human constraints
  • PROG (Pseudo-ROG)
  • Comparing with an expert-created ontology
  • Knowledge base constraints
  • EPROG (Effective-PROG)
  • Correctness dependency
  • DEPROG (Dependent-EPROG)
  • For example, relationship sets depend on object
    sets
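
Assuming POG and ROG are ordinary precision and recall computed over ontology components, as their names suggest, they can be sketched as below. The slide does not define PROG, EPROG, or DEPROG precisely, so those are not modeled. The sample sets are hypothetical.

```python
def pog_rog(generated, expert):
    """Precision and recall of generated components against an expert's."""
    generated, expert = set(generated), set(expert)
    correct = generated & expert
    pog = len(correct) / len(generated) if generated else 0.0
    rog = len(correct) / len(expert) if expert else 0.0
    return pog, rog


# Hypothetical object sets for illustration only.
generated = {"Year", "Mileage", "Price", "PhoneNr", "Truck"}
expert = {"Year", "Mileage", "Price", "PhoneNr",
          "Make", "Model", "Feature", "Extension"}
print(pog_rog(generated, expert))  # (0.8, 0.5)
```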

27
Evaluation Results
28
Discussion of Results
  • Bottleneck: cannot generate what is not in the
    knowledge base
  • Object sets
  • Concept-selection procedure works well
  • Desired concept not shown in training records
  • Rarely occurring concept → not severe even if we
    don't fix the error
  • Example: extension
  • Aggregation and union
  • USAddressCity, USAddressState, USAddressZipCode →
    Location
  • CropPlant, AnimalProduct, FruitFoodStuff →
    AgriculturalProduct
  • Close-meaning concepts: FurniturePart → Furnished

29
Discussion of Results
  • Relationship sets
  • Binary relationship sets: over 95%
  • Most errors due to incorrectly generated object
    sets
  • Semantically incorrect relationship sets
  • Price IsA.SCALARATTRIBUTE.MeasuredIn.MEASURINGUNIT.Subclasses Year
  • n-ary relationship sets (usually huge)
  • Participation constraints
  • Error due to lack of training examples
  • How much is enough?

30
Knowledge Base Extensibility
  • Add SALT -- a new knowledge source
  • Successfully integrated into existing KB
  • Sample new relationship set (DOE abstract domain)
  • CrudeOil IsA.PHYSICALOBJECT.Location.PLACE.Subclasses Nation

31
Conclusion
  • Experimented with knowledge-base construction and
    extension
  • Standardized application domain specification
  • Generated data-extraction ontologies from a
    specified domain and an integrated knowledge base
  • Showed DEPROG results of more than 70% on average
    and over 90% for well-defined domains

32
Future Work
  • Build a general-purpose knowledge source for
    data-extraction usage
  • Study more about data frames
  • Can a system correctly identify concepts with
    data frames?
  • Can a system update a data frame to fit a special
    situation?
  • Can a system generate a data frame from a
    collection of information of interest?