Wrapper Induction for End-User Semantic Content Development

May 18, 2004. Interaction Design and the Semantic Web ...

Provided by: andrewh160
Learn more at: http://secondthought.org


1
Wrapper Induction for End-User Semantic Content
Development
  • Andrew Hogue
  • MIT CSAIL

2
Acknowledgments
  • David Karger
  • (karger_at_csail.mit.edu)
  • Haystack Group
  • (http://haystack.csail.mit.edu)

3
Labeling the Semantic Web
  • Semantic Web requires RDF labeling of semantic
    data
  • Most existing labeling methods geared towards
    content providers
  • End-user tools require knowledge of underlying
    HTML of page
  • Goal: an easy interface for non-technical end users

4
Labeling the Semantic Web
  • Our approach: create patterns for existing
    semantic content
  • User provides examples of semantic content in the
    browser
  • Induce patterns from examples
  • Pattern matches provide content-specific context
    menus

5
Labeling the Semantic Web
  • Extends Haystack information management client
  • Provides context-sensitive menus
  • Matched patterns overlay semantic context on Web
    documents

6
Demo
7
Wrapper Induction
  • Wrapper: a pattern created from examples
  • User provides positive examples
  • Generalize examples into reusable pattern
  • Existing techniques:
      • Head-left-right-tail (HLRT) descriptors
      • Hidden Markov models
      • Support vector machines
      • Other machine learning methods

8
Wrapper Induction
  • Our approach: take advantage of the hierarchical
    structure of HTML
  • Each example picks out a subtree of DOM
  • Calculate tree edit distance between examples
  • Least-cost edit distance gives best mapping
  • Remove unmapped nodes to make pattern

9
Edit Distance
  • Least-cost sequence of operations to transform
    one tree into the other
  • Operations: insert, delete, or change a node
  • Cost of an operation: size of the subtree it affects
  • Byproduct: best mapping between elements
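
The definition above can be sketched in a few lines of Python. This is a simplified, top-down variant of tree edit distance and not necessarily the algorithm used in the talk. It assumes a tree is a `(label, children)` tuple, that inserting or deleting a node removes its whole subtree (so the cost is the subtree size), and that changing a label costs 1. The names `size` and `edit_distance` are illustrative.

```python
from functools import lru_cache

def size(node):
    """Number of nodes in the subtree rooted at node."""
    label, children = node
    return 1 + sum(size(child) for child in children)

@lru_cache(maxsize=None)
def edit_distance(a, b):
    """Least cost to turn tree a into tree b (simplified top-down variant)."""
    (label_a, kids_a), (label_b, kids_b) = a, b
    change = 0 if label_a == label_b else 1  # assumed cost of a label change
    rows, cols = len(kids_a) + 1, len(kids_b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = d[i - 1][0] + size(kids_a[i - 1])  # delete subtrees of a
    for j in range(1, cols):
        d[0][j] = d[0][j - 1] + size(kids_b[j - 1])  # insert subtrees of b
    for i in range(1, rows):
        for j in range(1, cols):
            d[i][j] = min(
                d[i - 1][j] + size(kids_a[i - 1]),   # delete a child subtree
                d[i][j - 1] + size(kids_b[j - 1]),   # insert a child subtree
                d[i - 1][j - 1] + edit_distance(kids_a[i - 1], kids_b[j - 1]),
            )
    return change + d[-1][-1]

# Two table rows: one <b> becomes <i>, and one extra <td> is inserted.
row_a = ("tr", (("td", (("b", ()),)), ("td", ())))
row_b = ("tr", (("td", (("i", ()),)), ("td", ()), ("td", ())))
print(edit_distance(row_a, row_b))  # 2
```

The cells chosen along the least-cost path through the table are what give the mapping between elements.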

10
Mapping Examples
11
Underlying Structure
  • Each example is built with similar HTML
  • Only text is different
  • Tree edit distance provides us with a mapping
  • Create general pattern by removing unmapped nodes
  • Replace with wildcards
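
A minimal sketch of this generalization step follows, under stated assumptions: trees are `(label, children)` tuples, text nodes carry labels such as `"text:Mon"`, and the mapping comes from a simplified top-down alignment rather than the full algorithm from the talk. `generalize`, `merge_wildcards`, and `WILDCARD` are hypothetical names.

```python
from functools import lru_cache

WILDCARD = ("*", ())

def size(node):
    label, children = node
    return 1 + sum(size(child) for child in children)

def merge_wildcards(kids):
    """Collapse runs of adjacent wildcards into a single wildcard."""
    out = []
    for kid in kids:
        if kid == WILDCARD and out and out[-1] == WILDCARD:
            continue
        out.append(kid)
    return tuple(out)

@lru_cache(maxsize=None)
def generalize(a, b):
    """Return (cost, pattern) for two example trees."""
    (label_a, kids_a), (label_b, kids_b) = a, b
    if label_a != label_b:
        # The two nodes cannot be mapped: replace both with one wildcard.
        return size(a) + size(b), WILDCARD
    best = {(0, 0): (0, ())}
    for i in range(len(kids_a) + 1):
        for j in range(len(kids_b) + 1):
            if i == 0 and j == 0:
                continue
            options = []
            if i > 0:  # child of a left unmapped
                cost, kids = best[i - 1, j]
                options.append((cost + size(kids_a[i - 1]), kids + (WILDCARD,)))
            if j > 0:  # child of b left unmapped
                cost, kids = best[i, j - 1]
                options.append((cost + size(kids_b[j - 1]), kids + (WILDCARD,)))
            if i > 0 and j > 0:  # map the two children onto each other
                cost, kids = best[i - 1, j - 1]
                sub_cost, sub = generalize(kids_a[i - 1], kids_b[j - 1])
                options.append((cost + sub_cost, kids + (sub,)))
            best[i, j] = min(options, key=lambda option: option[0])
    cost, kids = best[len(kids_a), len(kids_b)]
    return cost, (label_a, merge_wildcards(kids))

ex1 = ("tr", (("td", (("text:Mon", ()),)), ("td", (("b", (("text:Talk A", ()),)),))))
ex2 = ("tr", (("td", (("text:Tue", ()),)), ("td", (("b", (("text:Talk B", ()),)),))))
cost, pattern = generalize(ex1, ex2)
print(pattern)
```

Here the two rows share their `tr`, `td`, and `b` structure, so only the differing text nodes turn into wildcards.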

12
Mapping Examples
13
Mapping Examples
14
Pattern Matching
  • Look for document subtrees with similar structure
  • Find alignments of wrapper in tree
  • Require every node in wrapper be mapped to some
    node in document subtree
  • Wildcards match zero or more times
  • Each valid alignment is a match
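
One way to read these rules is sketched below. It assumes trees are `(label, children)` tuples, that a wildcard child absorbs zero or more consecutive document children, and that every other wrapper node must line up, in order, with a document node carrying the same label. Whether the original system also tolerates extra unmapped document nodes is not stated on the slide, so this sketch takes the strict reading. `matches`, `match_children`, and `find_matches` are illustrative names.

```python
WILDCARD = "*"

def matches(pattern, node):
    """True if the wrapper pattern aligns with the subtree rooted at node."""
    p_label, p_kids = pattern
    n_label, n_kids = node
    return p_label == n_label and match_children(p_kids, n_kids)

def match_children(p_kids, n_kids):
    if not p_kids:
        return not n_kids
    head, rest = p_kids[0], p_kids[1:]
    if head[0] == WILDCARD:
        # A wildcard may absorb zero or more document children.
        return any(match_children(rest, n_kids[k:])
                   for k in range(len(n_kids) + 1))
    return (bool(n_kids)
            and matches(head, n_kids[0])
            and match_children(rest, n_kids[1:]))

def find_matches(pattern, doc):
    """Collect every document subtree that the pattern aligns with."""
    found = [doc] if matches(pattern, doc) else []
    for child in doc[1]:
        found.extend(find_matches(pattern, child))
    return found

star = (WILDCARD, ())
pattern = ("tr", (("td", (star,)), ("td", (("b", (star,)),))))
header = ("tr", (("th", (("text:Date", ()),)), ("th", (("text:Talk", ()),))))
row1 = ("tr", (("td", (("text:Wed", ()),)), ("td", (("b", (("text:Talk C", ()),)),))))
row2 = ("tr", (("td", ()), ("td", (("b", (("text:Talk D", ()),)),))))
doc = ("table", (header, row1, row2))
print(len(find_matches(pattern, doc)))  # 2
```

The second row matches even though its first cell is empty, because the wildcard there matches zero times.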

15
Matching Example
16
Matching Example
17
Adding Semantics
  • How to tie wrappers to semantic content?
  • Assert RDF statements
  • Tied to wrapper structure
  • Classes bound to wrappers
  • Properties bound to wildcards
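
As a rough illustration of this binding, the sketch below turns one wrapper match into RDF-style triples. It assumes the matcher has already produced the text captured by each wildcard, in order. The function name, the blank-node subject `_:match1`, and the plain-tuple triple format are all hypothetical; the class, property names, and values follow the talk announcement example in this deck.

```python
def assert_statements(subject, wrapper_class, wildcard_properties, bindings):
    """Build (subject, predicate, object) triples for one wrapper match."""
    # The class is bound to the wrapper as a whole.
    triples = [(subject, "rdf:type", wrapper_class)]
    # Each property is bound to one wildcard; unlabeled wildcards assert nothing.
    for prop, value in zip(wildcard_properties, bindings):
        if prop is not None:
            triples.append((subject, prop, value))
    return triples

triples = assert_statements(
    subject="_:match1",
    wrapper_class="TalkAnnouncement",
    wildcard_properties=["series", "dc:title", "time"],
    # Values as they appear (truncated) in the transcript.
    bindings=["Dertouzos Lect", "Distributed Hash", "3:30 PM"],
)
for triple in triples:
    print(triple)
```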

18
Semantic Labels
19
Semantic Matching
20
Semantic Matching
21
Semantic Matching
  • <rdf:type> <TalkAnnouncement>
  • <series> Dertouzos Lect
  • <dc:title> Distributed Hash
  • <time> 3:30 PM

22
Additional Heuristics
  • Allow us to create more flexible, reusable
    patterns with as few as a single example
  • List Collapse
  • Context
  • Automatic additional examples
  • URL Prefixes

23
Our Contributions
  • Ease-of-use
  • Few examples required
  • Wrappers bridge syntactic-semantic gap

24
Future Work and Applications
  • Document-level classes
  • Mozilla port
  • Push wrappers
  • Page reformatting
  • Autonomous agent interaction
  • Wrapper sharing
  • Automatic wrapper induction

25
  • ahogue_at_csail.mit.edu
  • http://haystack.csail.mit.edu

26
Edit Distance
27
Edit Distance - Delete
28
Edit Distance - Insert
29
Edit Distance - Change
30
Best Mapping
31
List Collapse
  • Current wrappers generalize well for single
    elements
  • Will not recognize variable length lists
  • Collapse neighboring nodes with low-cost edit
    distance
  • For matching, allow nodes to match more than once
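
The collapsing step might look like the sketch below. It assumes trees are `(label, children)` tuples and takes the distance function as a parameter. `crude_distance` is only a stand-in for a real tree edit distance, and `collapse`, `threshold`, and the `repeatable` flag are hypothetical names.

```python
def preorder(node):
    """Yield the labels of a tree in preorder."""
    label, kids = node
    yield label
    for kid in kids:
        yield from preorder(kid)

def crude_distance(a, b):
    """Stand-in for tree edit distance: positional label mismatches."""
    labels_a, labels_b = list(preorder(a)), list(preorder(b))
    mismatches = sum(x != y for x, y in zip(labels_a, labels_b))
    return mismatches + abs(len(labels_a) - len(labels_b))

def collapse(children, distance, threshold):
    """Merge runs of similar neighbors into one (node, repeatable) entry."""
    out = []
    for child in children:
        if out and distance(out[-1][0], child) <= threshold:
            # Similar to the previous neighbor: keep the first node as the
            # representative and mark it as allowed to match more than once.
            out[-1] = (out[-1][0], True)
        else:
            out.append((child, False))
    return out

item = ("li", (("a", (("*", ()),)),))
children = (item, item, item, ("p", ()))
collapsed = collapse(children, crude_distance, threshold=0)
print(collapsed)
```

Three identical list items collapse into a single repeatable entry, which lets the same wrapper match lists of any length, while the trailing `p` stays as an ordinary node.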

32
List Collapse Example
33
List Collapse Example
34
List Collapse Example
35
Context
  • Highlighted region doesn't tell the full story
  • Surrounding text often useful for differentiating
    semantic content
  • Also useful for finding additional examples

36
Context Example
37
Context Example
38
Context Example
39
Context Example
40
Wrapper Wrap-up
  • Gather context nodes
  • Generalize examples using best mapping
  • Collapse similar neighboring nodes
  • Add semantic labels
  • Match by finding alignments