ROADRUNNER: Towards Automatic Data Extraction from Large Web Sites

Provided by: LiXu8

1
ROADRUNNER: Towards Automatic Data Extraction
from Large Web Sites
  • Valter Crescenzi
  • Giansalvatore Mecca
  • Paolo Merialdo

VLDB 2001
2
Overview
  • Automatically generates a wrapper from large
    structured Web pages
  • Supports nested structures
  • Efficient approach to large, complex pages with
    regular structures

3
Approach
  • Given a set of example pages
  • Generate a union-free regular expression (UFRE)
  • Find the least upper bound in the RE lattice to
    generate a wrapper
  • This reduces to finding the least upper bound of two UFREs
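
To make the notion of a union-free regular expression concrete, here is a minimal Python sketch. The class names (PCData, Opt, Plus) and the render function are hypothetical names chosen for illustration and are not taken from the slides; the only point is that a UFRE is built from tokens, #PCDATA fields, optionals and iterators, with no union operator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PCData:
    """A data field, written #PCDATA."""


@dataclass(frozen=True)
class Opt:
    """An optional sub-expression, written (...)?"""
    body: tuple


@dataclass(frozen=True)
class Plus:
    """An iterator (one or more repetitions), written (...)+"""
    body: tuple


def render(expr):
    """Render a UFRE, given as a tuple of nodes, as a string."""
    parts = []
    for node in expr:
        if isinstance(node, str):        # literal tag or fixed string
            parts.append(node)
        elif isinstance(node, PCData):
            parts.append("#PCDATA")
        elif isinstance(node, Opt):
            parts.append("(" + render(node.body) + ")?")
        elif isinstance(node, Plus):
            parts.append("(" + render(node.body) + ")+")
        else:
            raise TypeError(f"unexpected node: {node!r}")
    return "".join(parts)


# Illustrative wrapper only; the page layout is an assumption.
wrapper = (
    "<html>", "Books of:", "<b>", PCData(), "</b>",
    Opt(("<img src/>",)),
    "<ul>", Plus(("<li>", PCData(), "</li>")), "</ul>", "</html>",
)
print(render(wrapper))
```

Note that there is deliberately no node type for alternatives (a|b); this is what "union-free" means here.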

4
Matching/Mismatching
  • Start with the first page and create an RE that
    defines the wrapper
  • Match each successive sample against the wrapper
  • Mismatches result in generalizations of the
    regular expression
  • Two types of mismatches: string mismatches and
    tag mismatches
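
The matching step can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation: tokenize and first_mismatch are hypothetical helpers, pages are split into tags and text strings with a naive regular expression, and the two token lists are compared position by position until the first difference.

```python
import re


def tokenize(html):
    """Split a page into tag tokens and non-empty text tokens (naive)."""
    raw = re.findall(r"<[^>]+>|[^<]+", html)
    return [tok.strip() for tok in raw if tok.strip()]


def is_tag(token):
    return token.startswith("<")


def first_mismatch(wrapper, sample):
    """Return (position, kind) of the first mismatch, or None if none.

    kind is "string" when both tokens are text, "tag" when at least one
    of them is a tag, and "length" when one list is a prefix of the other.
    """
    for pos, (w, s) in enumerate(zip(wrapper, sample)):
        if w != s:
            kind = "tag" if is_tag(w) or is_tag(s) else "string"
            return pos, kind
    if len(wrapper) != len(sample):
        return min(len(wrapper), len(sample)), "length"
    return None


# Hypothetical example pages.
page1 = "<html><b>John Smith</b><ul></ul></html>"
page2 = "<html><b>Paul Jones</b><img src='x.gif'/><ul></ul></html>"
print(first_mismatch(tokenize(page1), tokenize(page2)))
```

A real matcher would resolve each mismatch by generalizing the wrapper and then resume parsing; this sketch only locates and classifies the first one.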

5
Example Pages
6
Example
  • String mismatches are used to discover fields of
    the documents
  • Wrapper is generated by replacing John Smith
    with #PCDATA
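
A minimal sketch of this generalization step, assuming both pages have already been tokenized and share the same tag structure. The function name generalize_strings and the second author name are hypothetical and used only for illustration.

```python
def generalize_strings(wrapper, sample):
    """Replace every string mismatch with a #PCDATA field.

    Assumes both token lists have the same length and differ only in
    text tokens; tag mismatches are out of scope for this sketch.
    """
    if len(wrapper) != len(sample):
        raise ValueError("token lists differ in length")
    result = []
    for w, s in zip(wrapper, sample):
        if w == s:
            result.append(w)                  # tokens agree: keep as is
        elif not w.startswith("<") and not s.startswith("<"):
            result.append("#PCDATA")          # string mismatch: a field
        else:
            raise ValueError(f"tag mismatch: {w!r} vs {s!r}")
    return result


page1 = ["<html>", "Books of:", "<b>", "John Smith", "</b>", "</html>"]
page2 = ["<html>", "Books of:", "<b>", "Paul Jones", "</b>", "</html>"]
print(generalize_strings(page1, page2))
```

The string "Books of:" appears identically on both pages, so it stays in the wrapper as fixed template text, while the differing names become a field.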

7
Example (Cont.)
  • Tag mismatches: discovering optionals
  • First check whether the mismatch is caused by an
    iterator
  • If not, it could be an optional field in the wrapper or
    the sample
  • Cross search is used to determine possible optionals
  • The image field is determined to be optional
  • (<img src/>)?
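
The cross search can be sketched as below. This is an illustrative simplification under stated assumptions: cross_search is a hypothetical helper, it looks for the wrapper's mismatching token further ahead in the sample first and only then tries the opposite direction, whereas a full implementation would have to consider both alternatives.

```python
def cross_search(wrapper, i, sample, j):
    """Locate a candidate optional at a tag mismatch wrapper[i] != sample[j].

    Returns ("sample", tokens) if the skipped tokens occur only in the
    sample, ("wrapper", tokens) if they occur only in the wrapper, or
    None if neither search succeeds.
    """
    # Does the wrapper's token show up later in the sample?
    try:
        k = sample.index(wrapper[i], j + 1)
        return "sample", sample[j:k]
    except ValueError:
        pass
    # Otherwise, does the sample's token show up later in the wrapper?
    try:
        k = wrapper.index(sample[j], i + 1)
        return "wrapper", wrapper[i:k]
    except ValueError:
        return None


wrapper = ["<b>", "#PCDATA", "</b>", "<ul>", "</ul>"]
sample = ["<b>", "Paul Jones", "</b>", "<img src='x.gif'/>", "<ul>", "</ul>"]
side, optional = cross_search(wrapper, 3, sample, 3)

# Generalize the wrapper by inserting the optional at the mismatch position.
generalized = wrapper[:3] + ["(" + "".join(optional) + ")?"] + wrapper[3:]
print(side, generalized)
```

Here the image tag is found only in the sample, so it is added to the wrapper as an optional and matching can resume at the list tag.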

8
Example (Cont.)

9
Example (Cont.)
  • Tag mismatches: discovering iterators
  • Assume the mismatch is caused by repeated elements in
    a list
  • Match candidate squares against earlier squares
  • Generalize the wrapper by finding all contiguous
    repeated occurrences
  • (<li><i>Title:</i>#PCDATA</li>)+
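
The last step of iterator discovery, collapsing contiguous repetitions of a square into a single (...)+ group, can be sketched as follows. The sketch assumes the candidate square has already been identified; collapse_repeats is a hypothetical helper name, and locating the square from the mismatch position is not shown.

```python
def collapse_repeats(tokens, square):
    """Replace each run of contiguous copies of `square` with one (...)+ ."""
    n = len(square)
    result = []
    i = 0
    while i < len(tokens):
        if n and tokens[i:i + n] == square:
            # Skip over every contiguous occurrence of the square.
            while tokens[i:i + n] == square:
                i += n
            result.append("(" + "".join(square) + ")+")
        else:
            result.append(tokens[i])
            i += 1
    return result


square = ["<li>", "<i>", "Title:", "</i>", "#PCDATA", "</li>"]
tokens = ["<ul>"] + square * 3 + ["</ul>"]
print(collapse_repeats(tokens, square))
```

Three list items on one page and two on another both reduce to the same iterator, which is why the generalized wrapper matches lists of any length of one or more.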
10
Extracted Result
11
Recursive Example
12
Complexity
13
Discussion
  • Assumptions
  • Pages are well-structured
  • Want to extract at the level of entire fields
  • Structure can be modeled without disjunctions
  • Search Space for explaining mismatches is huge
  • Uses a number of heuristics to prune space
  • Limited backtracking
  • Limit on number of choices to explore
  • Patterns cannot be delimited by optionals
  • This results in pruning possible wrappers

14
Experimental Result
15
Comparison with Other Works
16
  • X means the information extraction system has
    the capability
  • X means the information extraction system has
    the ability as long as the training corpus can
    accommodate the required training data
  • ? shows that the system has the ability to some
    degree
  • [symbol missing] means that the extraction pattern
    itself doesn't show the ability, but the overall
    system has the capability