XWRAP: An XML-enabled Wrapper Construction System for Web Information Sources



1
XWRAP: An XML-enabled Wrapper Construction System
for Web Information Sources
  • Youngju Son

2
Contents
  • XWRAP
  • Background
  • Introduction
  • Characteristics
  • Procedures
  • Architecture
  • Wrapping Phase
  • Information Extraction
  • Semantic Token Extraction
  • Hierarchical structure extractor
  • Code generator
  • Conclusion

3
Background
  • The amount of semi-structured data (HTML) has increased
  • Not directly usable by standard SQL-like query
    processing engines
  • Need a smart way to extract data from web sources

4
Background
  • Wrapper generation types
  • Manually constructed wrapper
  • Whenever the source domain changes, the wrapper
    must be adapted by hand
  • Semi-automatic wrapper (user interaction)
  • Uses a graphical interface
  • User shows which information to extract
  • XWRAP belongs to this category
  • Automatic wrapper
  • Uses machine learning technology

5
Introduction
  • XWRAP
  • Systematic approach to building an interactive
    system for semi-automatic construction of wrappers
  • Transforms HTML into program-friendly XML
  • Provides interactive mechanism and heuristics for
    generating extraction rules
  • Combines extraction rules into wrapper program

6
Characteristics
  • Two phases for wrapper generation
  • Use an interactive interface to generate
    extraction rules with a few clicks
  • Construct the wrapper program from the extraction
    rules
  • Advantages
  • User-friendly interface
  • Clean separation of source-specific tasks from
    tasks that are repetitive for any source
  • Micro-feedback approach

7
Characteristics
  • Two types of boundary identification
  • Divide the task of identifying object
    boundaries into two steps
  • Region identification
  • Semantic token identification

8
Methodology
  • The web document is fetched
  • A parse tree is built from the HTML tags
  • The user highlights a specific field, either a
    region or a semantic token
  • A domain-specific wrapper is constructed
  • The generated wrapper extracts data from the web source

9
Methodology
  • Region identification
  • User highlights a word, phrase, or sentence as the
    starting point of a meaningful region
  • XWRAP applies heuristics on the nearest region tags
    to derive the type of the region

10
Methodology
  • Semantic tokens
  • User identifies tokens of interest with a few
    clicks
  • XWRAP runs learning algorithms to detect repetitive
    token patterns within a region

11
Architecture
  • Syntactical Structure Normalization
  • Information Extraction
  • Code generation
  • Testing and packaging

12
Architecture

13
Architecture
  • Syntactical Structure Normalization
  • Prepare the environment for the information
    extraction process with the following tasks
  • Accept a URL entered by the XWRAP user
  • Clean up bad HTML tags
  • Transform the document into a parse tree

14
Architecture
  • Information Extraction
  • Derive extraction rules in the following steps
  • Identify interesting regions
  • Identify important semantic tokens, their logical
    paths, and their node positions in the parse tree
  • Identify the useful hierarchical structure

15
Architecture
  • Code generation
  • Generate the wrapper program code
  • Key technique
  • Encoding of semantic knowledge represented in the
    form of declarative extraction rules and XML
    template

16
Architecture
  • Testing and packaging
  • Check whether new extraction rules or updates to
    the existing extraction rules have been derived
  • Run the code generation to generate a new version
    of the wrapper program
  • Release the wrapper program with the release button

17
Architecture

18
Wrapping phases

19
Wrapping phases
  • XWRAP goes through 6 phases
  • Tasks within a phase run concurrently
  • Bookkeeping
  • Collects information that appears in the retrieved
    source document
  • Error handler
  • Designed for error detection
  • Allow the wrapper developer to determine exactly
    where errors have occurred

20
Preprocessing
  • Consists of
  • Remote document retrieval and syntax repair
  • Syntactic token parse tree generation

21
Preprocessing
  • Fetching a web page
  • Each wrapper has a set of retrieval rules
  • Each rule specifies
  • The name of the rule
  • The list of parameters, such as
  • Protocol, e.g. HTTP, FTP
  • Fetch method, e.g. HTTP GET, HTTP PUT
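
A retrieval rule as described above could be modelled roughly as follows. This is only a sketch: the class name `RetrievalRule`, its field names, and the example host are assumptions made for illustration, not XWRAP's actual rule syntax.

```python
from dataclasses import dataclass, field
from urllib.parse import urlencode


@dataclass
class RetrievalRule:
    """Hypothetical retrieval rule: a name plus the fetch parameters."""
    name: str
    protocol: str                      # e.g. "http" or "ftp"
    method: str                        # e.g. "GET" or "PUT"
    host: str
    path: str
    query: dict = field(default_factory=dict)

    def url(self) -> str:
        """Assemble the request URL from the rule's parameters."""
        base = f"{self.protocol}://{self.host}{self.path}"
        return f"{base}?{urlencode(self.query)}" if self.query else base


# Made-up example values; a real wrapper would hold one rule per page type.
rule = RetrievalRule(
    name="fetch_report",
    protocol="http",
    method="GET",
    host="example.org",
    path="/report",
    query={"station": "ABCD"},
)
print(rule.method, rule.url())
```

Keeping the parameters separate from the assembled URL means the same rule can be reused with a different query value.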

22
Preprocessing
  • Assume we construct a wrapper for the NOAA current
    weather report web site
  • URL: http://weather.noaa.gov/cgi-bin/currwx.pl?cccc=KSAV

23
Preprocessing
  • Repairing bad syntax
  • Check for bad HTML syntax
  • Insert missing tags
  • Remove useless tags, such as
  • An end tag </pr> that exists without its start
    tag <pr>
  • Use HTML Tidy
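
The two repairs listed above (dropping an end tag that has no start tag, and inserting missing end tags) can be illustrated with a small stack-based pass. This sketch uses Python's standard `html.parser` and is not HTML Tidy; the class name `TagRepairer` and the short list of void tags are assumptions.

```python
from html import escape
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}  # never need an end tag


class TagRepairer(HTMLParser):
    """Hypothetical repair pass: drops orphan end tags, adds missing ones."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []      # repaired output fragments
        self.stack = []    # tags that are currently open

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # end tag without a start tag: remove it
        while self.stack:
            # close tags left open inside the element being closed
            open_tag = self.stack.pop()
            self.out.append(f"</{open_tag}>")
            if open_tag == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))

    def repair(self, html):
        self.feed(html)
        self.close()
        while self.stack:  # insert end tags that are still missing
            self.out.append(f"</{self.stack.pop()}>")
        return "".join(self.out)


print(TagRepairer().repair("<table><tr><td>82.0</p></td></tr>"))
```

A production wrapper would rely on a dedicated tool for this step, since real pages contain many more error classes than these two.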

24
Preprocessing
  • Generate a syntactic token tree
  • After error repair, an HTML parse tree is created
  • Tree structure has each node representing a
    syntactic token
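
A minimal sketch of such a tree, built with Python's standard `html.parser`, is shown here. It assumes the input has already been repaired (every start tag has an end tag, no void tags), and it labels each node with its tag plus its position among same-tag siblings, which is one plausible reading of the dotted paths used later in the slides; the `Node` and `TreeBuilder` names are hypothetical.

```python
from html.parser import HTMLParser

UNINDEXED = {"HTML", "BODY"}  # assumption: these occur once, so no index


class Node:
    def __init__(self, tag, index, parent=None):
        self.tag, self.index, self.parent = tag, index, parent
        self.children = []
        self.text = ""

    def label(self):
        return self.tag if self.tag in UNINDEXED else f"{self.tag}{self.index}"

    def path(self):
        parts, node = [], self
        while node.parent is not None:      # stop at the artificial root
            parts.append(node.label())
            node = node.parent
        return ".".join(reversed(parts))


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("ROOT", 0)
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        tag = tag.upper()
        index = sum(1 for child in self.current.children if child.tag == tag)
        node = Node(tag, index, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        if self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def leaves(node):
    """Yield (path, text) for every leaf element below node."""
    if not node.children and node.parent is not None:
        yield node.path(), node.text
    for child in node.children:
        yield from leaves(child)


builder = TreeBuilder()
builder.feed("<html><body><table><tr><td>A</td><td>B</td></tr></table></body></html>")
builder.close()
for path, text in leaves(builder.root):
    print(path, "=", text)
```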

25
Preprocessing

26
Preprocessing

27
Information Extraction(IE)
  • Methodology for IE
  • IE phase takes a parse tree as input
  • Interact with user to identify semantic tokens
    and hierarchical structure
  • Annotate the tree nodes with semantic tokens in
    comma-delimited format and with the nesting
    hierarchy as a context-free grammar

28
Information Extraction(IE)
  • IE involves 3 steps
  • Identify regions
  • Output a set of region extraction rules
  • Identify semantic tokens
  • Output a set of semantic token extraction rules
  • Determine the nesting hierarchy for the content
    presentation of a page
  • Output a set of hierarchical structure extraction
    rules

29
Region Extraction
  • Region Extraction
  • Begin by asking the user to highlight a tree node
    that holds only the start tag
  • Look for the corresponding end tag and highlight the
    entire region
  • Compute the type and the number of sub regions
  • Derive the set of RE describing the structure
    layout of the region
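
The "look for the corresponding end tag" step can be sketched as a depth counter over same-named tags. The function name `find_region` and the regular expression are assumptions for illustration; the sketch ignores comments, scripts, and self-closing tags.

```python
import re

# Matches a start or end tag; group 1 is "/" for end tags, group 2 the tag name.
TAG = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>")


def find_region(html, start):
    """Return (start, end) offsets of the element whose start tag begins at start."""
    first = TAG.match(html, start)
    if first is None or first.group(1):
        raise ValueError("start must point at a start tag")
    name = first.group(2).lower()
    depth = 0
    for match in TAG.finditer(html, start):
        if match.group(2).lower() != name:
            continue                       # other tags do not affect nesting depth
        depth += -1 if match.group(1) else 1
        if depth == 0:                     # the matching end tag was reached
            return start, match.end()
    raise ValueError("no matching end tag")


page = "<p>x</p><table><tr><td><table></table></td></tr></table><p>y</p>"
begin, end = find_region(page, page.index("<table>"))
print(page[begin:end])
```

Counting only tags with the same name as the highlighted start tag is what lets the outer table be matched even though it contains a nested table.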

30
Region Extraction

31
Region Extraction
  • The set of rules derived are
  • Tree_Path
  • Specify how to find the path of table node
  • Table_Area
  • Find the number of rows and columns of the table
  • Effective_Area
  • Define the effective area of the table
  • Table_Style
  • Distinguish vertical and horizontal table
  • getTableInfo
  • Describe how to find the table name

32
Region Extraction
  • Example
  • Consider the weather report <Fig 5>, <Fig 7>
  • Apply the extraction rule given in <Fig 8>
  • Get the specific extraction rule for TABLE2

33
Region Extraction
  • Apply the extraction rules to TABLE2
  • By the Tree_Path rule
  • Path of TABLE2: HTML.BODY.TABLE0.TR0.TD4.TABLE2
  • By the Table_Area rule
  • MAX 5 rows and MAX 3 cols from the user's selection
  • By the Effective_Area rule
  • rowSI, rowEI, colSI, and colEI from the user's input
  • By the Table_Style rule
  • Deduce whether the table is horizontal or vertical
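
To make the Tree_Path and Table_Area rules concrete, here is a minimal sketch over a small, well-formed sample page. It assumes that the number in a path step such as TABLE0 is the position among same-tag siblings; the sample markup and the function names `resolve_path` and `table_area` are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Made-up, well-formed sample page with one table nested inside another.
PAGE = """<html><body>
<table><tr><td>
  <table>
    <tr><td>Temperature</td><td>82.0 F</td></tr>
    <tr><td>Wind</td><td>calm</td></tr>
  </table>
</td></tr></table>
</body></html>"""


def resolve_path(root, path):
    """Follow a dotted path such as HTML.BODY.TABLE0.TR0.TD0.TABLE0."""
    steps = path.split(".")
    if root.tag.upper() != steps[0]:
        raise ValueError("path does not start at the root element")
    node = root
    for step in steps[1:]:
        match = re.fullmatch(r"([A-Z]+)(\d*)", step)
        if match is None:
            raise ValueError(f"bad path step: {step}")
        tag, index = match.group(1), int(match.group(2) or 0)
        same_tag = [child for child in node if child.tag.upper() == tag]
        node = same_tag[index]
    return node


def table_area(table):
    """Return (number of rows, maximum number of cells in any row)."""
    rows = [child for child in table if child.tag.lower() == "tr"]
    widths = [len([c for c in row if c.tag.lower() in ("td", "th")]) for row in rows]
    return len(rows), max(widths, default=0)


table = resolve_path(ET.fromstring(PAGE), "HTML.BODY.TABLE0.TR0.TD0.TABLE0")
print(table_area(table))
```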

34
Region Extraction
  • How to extract table name node
  • Ask the user to highlight the table name node <Fig 2>
  • Based on the user input, XWRAP infers the path expression
  • Apply the rule getTableInfo and extract the table name
  • getTableName calls the following semantic token
    extraction rule to obtain the actual string of the
    table name

35
Semantic token extraction
  • Semantic token
  • A sub-string of the source document that is to be
    treated as a single logical unit
  • Example
  • <font>Maximum and minimum Temperature F or
    Current Weather condition</font>
  • To handle both types, XWRAP treats a token as a
    pair of token name/token value.

36
Semantic token Extraction
  • Main Tasks
  • Find semantic tokens of interest
  • Define extraction rules to locate such tokens
  • Specify such tokens in a comma-delimited format

37
Semantic token Extraction
  • Example: TABLE2.TR1.TD0 in Fig 7
  • Based on user interaction,
  • When the user selects TABLE2.TR1.TD0,
  • TABLE2.TR1.TD0 is treated as a semantic
    token with 3 leaf nodes
  • Token name: Maximum Temperature F (C)
  • Token value: 82.0 (27.8)
  • A set of semantic token rules can be derived for
    the remaining subtrees at TR3 and TR4 using the
    function getStoken()
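
The name/value treatment of a semantic token can be sketched as follows: the first text leaf of the selected subtree is taken as the token name, the remaining leaves as the token value, and the pairs are written in comma-delimited form. The sample cell and the function names are assumptions; this is not the getStoken() function itself.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Made-up sample cell: one name leaf followed by two value leaves.
CELL = "<td><b>Wind speed</b><i>5</i><i>(8)</i></td>"


def semantic_token(node):
    """Split a subtree's text leaves into (token name, token value)."""
    leaves = [text.strip() for text in node.itertext() if text.strip()]
    if not leaves:
        raise ValueError("subtree has no text leaves")
    return leaves[0], " ".join(leaves[1:])


def to_delimited(tokens):
    """Write name/value pairs as comma-delimited lines."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(tokens)
    return buffer.getvalue()


token = semantic_token(ET.fromstring(CELL))
print(to_delimited([token]), end="")
```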

38
Semantic token Extraction

39
Semantic token Extraction

40
Semantic token Extraction

41
Hierarchical structure extractor
  • Purpose
  • Make explicit the hierarchical structure by
    identifying which parts of regions or token
    streams should be grouped
  • Hierarchical structure can be extracted in a
    semi-automatic fashion

42
Hierarchical structure extractor
  • How to group
  • Identify all regions that are siblings and
    organize them in sequential order as they appear
    in the original document
  • Obtain a section heading or table name using the
    paired header tags such as <h3></h3>
  • Infer the nesting hierarchy of sections or the
    columns of tables using font size and the nesting
    structure of the presentation layout tags
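
One simple way to picture the grouping step is to nest sections by heading level. The sketch below uses only the h1 to h6 level as the nesting signal (font size and layout tags are left out), and the dictionary layout and the name `build_outline` are assumptions.

```python
import re


def build_outline(items):
    """Nest (tag, text) pairs, given in document order, by heading level."""
    root = {"title": "ROOT", "level": 0, "children": [], "content": []}
    stack = [root]                      # open sections, outermost first
    for tag, text in items:
        heading = re.fullmatch(r"h([1-6])", tag.lower())
        if heading is None:
            stack[-1]["content"].append(text)   # body text joins the open section
            continue
        level = int(heading.group(1))
        while stack[-1]["level"] >= level:      # close sections at the same or deeper level
            stack.pop()
        section = {"title": text, "level": level, "children": [], "content": []}
        stack[-1]["children"].append(section)
        stack.append(section)
    return root


# Made-up sample page content.
items = [
    ("h2", "Current conditions"),
    ("p", "Observed this hour"),
    ("h3", "Wind"),
    ("p", "calm"),
    ("h2", "Forecast"),
]
outline = build_outline(items)
print([section["title"] for section in outline["children"]])
```

Sibling sections stay in document order because each new section is appended to its parent as it is encountered.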

43
Hierarchical structure extractor

44
Hierarchical structure extractor
  • XML-template
  • Specify the hierarchical structure extraction
    rule
  • Facilitate the code generation of XWRAP
  • Why
  • An XML template is a well-formed XML file that
    includes processing instructions, which direct the
    template engine to the placeholders where data
    fields are inserted into the template
  • e.g. <?XG-InsertionField-XG fieldName>
  • An XML template contains a repetitive part, called
    XG-Iteration-XG.

45
Hierarchical structure extractor
  • Example XG-Iteration-XG
  • Determines the beginning and the end of a
    repetitive part
  • After the template engine reaches the end position
    in a repetition, it takes a new record from the
    delimited file
  • Go back to the start position to create the same
    set of XML tags as in the previous pass.
  • New data is inserted into the resulting XML file
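
The repetition described above can be sketched as a tiny template engine. The exact marker syntax is an assumption: the slides only name XG-InsertionField-XG and XG-Iteration-XG, so the closing `?>` and the `begin`/`end` forms used here are hypothetical, as are the sample template and the function name `render`.

```python
import csv
import io
import re
from xml.sax.saxutils import escape

# Hypothetical marker syntax, written as XML processing instructions.
FIELD = re.compile(r"<\?XG-InsertionField-XG\s+(\w+)\s*\?>")
LOOP = re.compile(r"<\?XG-Iteration-XG begin\?>(.*?)<\?XG-Iteration-XG end\?>", re.S)

TEMPLATE = (
    "<report><?XG-Iteration-XG begin?>"
    "<token><name><?XG-InsertionField-XG name?></name>"
    "<value><?XG-InsertionField-XG value?></value></token>"
    "<?XG-Iteration-XG end?></report>"
)


def render(template, delimited_text):
    """Repeat the iteration part once per record of the comma-delimited input."""
    records = list(csv.DictReader(io.StringIO(delimited_text)))

    def fill(body, record):
        # Replace each insertion field with the escaped value of that column.
        return FIELD.sub(lambda m: escape(record[m.group(1)]), body)

    def expand(match):
        # One pass over the repetitive part per record, in file order.
        return "".join(fill(match.group(1), record) for record in records)

    return LOOP.sub(expand, template)


# The first line of the delimited text names the fields (an assumption).
xml_text = render(TEMPLATE, "name,value\nWind,calm\nSky,clear\n")
print(xml_text)
```

Escaping each inserted value keeps the resulting XML well-formed even when a data field contains characters such as `<` or `&`.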

46
Code Generator
  • Generate the wrapper code for the chosen web source
    by applying the comma-delimited file <Fig 9>, the
    region extraction rules <Example 2>, and the
    hierarchical structure extraction rules <Fig 10>.

47
Code Generator

48
Conclusion
  • Three contributions
  • Two-phase code generation methodology and
    mechanisms for semi-automatic construction of
    XML-enabled wrappers
  • Separates tasks of building wrappers that are
    specific to a web source from tasks that are
    repetitive for any source
  • Provides inductive learning algorithms that derive
    or discover wrapper patterns