XWRAP: An XML-enabled Wrapper Construction System for Web Information Sources

1
XWRAP: An XML-enabled Wrapper Construction System
for Web Information Sources

Youngju Son

2
Contents

XWRAP
Background
Introduction
Characteristics
Procedures
Architecture
Wrapping Phase
Information Extraction
Semantic Token Extraction
Hierarchical structure extractor
Code generator
Conclusion

3
Background

Semi-structured data (HTML) has increased
Not directly usable by standard SQL-like query
processing engines
Need a smart way to extract data from web sources

4
Background

Wrapper generation types
Manually constructed Wrapper
Whenever a domain change occurs, the wrapper must
be adapted to it.
Semi-Automatic Wrapper (User Interaction)
Use a graphical interface
User indicates the information to extract
XWRAP falls into this category
Automatic Wrapper
Use machine learning technology

5
Introduction

XWRAP
Systematic approach to building an interactive
system for semi-automatic construction of wrappers
Transforms HTML into program-friendly XML
Provides interactive mechanisms and heuristics for
generating extraction rules
Combines the extraction rules into a wrapper program

6
Characteristics

Two phases for wrapper generation
Use an interactive interface to generate
extraction rules with a few clicks
Construct the wrapper program from the extraction
rules
Advantages
User-friendly interface
Clean separation
Micro-feedback approach

7
Characteristics

Two types of boundary identification
Divide the tasks of identifying object
boundaries into two steps
Region identification
Semantic token identification

8
Methodology

The web document is fetched
A parse tree is built from the HTML tags
The user highlights a specific field by
region or semantic token
A domain-specific wrapper is constructed
The generated wrapper extracts data from web pages

9
Methodology

Region identification
The user highlights a word, phrase, or sentence as the
starting point of a meaningful region
XWRAP applies heuristics to the nearest region tags to
derive the type of the region

10
Methodology

Semantic tokens
The user identifies tokens of interest with a few
clicks
XWRAP fires learning algorithms to detect repetitive
token patterns within a region
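
A minimal sketch of what detecting a repetitive token pattern could look like, assuming a region has already been flattened into a list of tag names. The function name `smallest_repeating_unit` and the input format are illustrative assumptions; the slides do not describe XWRAP's actual learning algorithms.

```python
from typing import List, Optional


def smallest_repeating_unit(tags: List[str]) -> Optional[List[str]]:
    """Return the shortest tag sequence that, repeated two or more times,
    reproduces `tags` exactly; return None if there is no such sequence."""
    n = len(tags)
    for size in range(1, n // 2 + 1):
        if n % size:
            continue  # the unit must tile the whole region
        unit = tags[:size]
        if unit * (n // size) == tags:
            return unit
    return None


# Hypothetical region: three table rows with two cells each.
region = ["TR", "TD", "TD", "TR", "TD", "TD", "TR", "TD", "TD"]
print(smallest_repeating_unit(region))  # ['TR', 'TD', 'TD']
```

A real region would rarely repeat this cleanly, so a practical detector would need to tolerate optional or missing cells.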

11
Architecture

Syntactical Structure Normalization
Information Extraction
Code generation
Testing and packaging

12
Architecture
13
Architecture

Syntactical Structure Normalization
Prepare the environment for the information extraction
process through the following tasks
Accept the URL entered by the XWRAP user
Clean up bad HTML tags
Transform the page into a parse tree

14
Architecture

Information Extraction
Derive extraction rules in the following steps
Identify interesting regions
Identify important semantic tokens and their logical
paths and node positions in the parse tree
Identify the useful hierarchical structure

15
Architecture

Code generation
Generate the wrapper program code
Key technique
Encoding of semantic knowledge represented in the
form of declarative extraction rules and XML
template

16
Architecture

Testing and Packaging
Check whether new extraction rules or updates to the
existing extraction rules are derived.
Rerun code generation to produce a new version
of the wrapper program
Release the wrapper program with the release button

17
Architecture
18
Wrapping phases
19
Wrapping phases

XWRAP goes through 6 phases
Tasks within a phase run concurrently
Bookkeeping
Collects information that appears in the retrieved
source document
Error handler
Designed for error detection
Allow the wrapper developer to determine exactly
where errors have occurred

20
Preprocessing

Consists of
Remote document retrieval and syntax repair
Syntactic-token parse tree

21
Preprocessing

Fetching a web page
Each wrapper has a set of retrieval rules
Each rule specifies
The name of the rule
A list of parameters, such as
Protocol (HTTP, FTP, etc.)
Fetch method (HTTP GET, HTTP PUT)
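
A retrieval rule as described on this slide could be modelled roughly as follows. This is a sketch under assumptions: the class name `RetrievalRule`, its field names, and the `example.org` host and `station` parameter are hypothetical and do not come from XWRAP.

```python
from dataclasses import dataclass, field
from typing import Dict
from urllib.parse import urlencode


@dataclass
class RetrievalRule:
    """One retrieval rule: a name plus the parameters needed to fetch a page."""
    name: str
    protocol: str   # e.g. "http" or "ftp"
    method: str     # e.g. "GET"
    host: str
    path: str
    params: Dict[str, str] = field(default_factory=dict)

    def url(self) -> str:
        """Assemble the URL that the rule would fetch."""
        base = f"{self.protocol}://{self.host}{self.path}"
        return f"{base}?{urlencode(self.params)}" if self.params else base


# Hypothetical rule; host, path and parameter name are placeholders.
rule = RetrievalRule(
    name="fetch_report",
    protocol="http",
    method="GET",
    host="example.org",
    path="/cgi-bin/report",
    params={"station": "KSAV"},
)
print(rule.url())  # http://example.org/cgi-bin/report?station=KSAV
```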

22
Preprocessing

Assume we construct a wrapper for the NOAA current
weather report web site
URL: http://weather.noaa.gov/cgi-bin/currwx.pl?cc
cc-KSAV

23
Preprocessing

Repairing bad syntax
Check bad HTML syntax
Insert missing tags
Remove useless tags, for example when
</pr> exists without a start tag <pr>
Use HTML Tidy
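
The two repairs listed above (insert missing tags, drop end tags that have no start tag) can be illustrated with a small stack-based pass. This is a simplified stand-in written for illustration, not HTML Tidy and not XWRAP's code; the `repair` function and the short `VOID` list are assumptions.

```python
import re
from typing import List

TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>")
VOID = {"br", "hr", "img", "input", "meta", "link"}  # elements with no end tag


def repair(html: str) -> str:
    """Drop orphan end tags and close any elements left open."""
    out: List[str] = []
    open_tags: List[str] = []
    pos = 0
    for match in TAG.finditer(html):
        out.append(html[pos:match.start()])  # text between tags is kept as-is
        pos = match.end()
        closing = match.group(1) == "/"
        name = match.group(2).lower()
        if not closing:
            out.append(match.group(0))
            if name not in VOID:
                open_tags.append(name)
        elif name in open_tags:
            # Close everything opened inside this element, then the element.
            # Inserted and matched end tags are emitted in lowercase.
            while open_tags:
                top = open_tags.pop()
                out.append(f"</{top}>")
                if top == name:
                    break
        # Otherwise the end tag has no start tag and is dropped.
    out.append(html[pos:])
    out.extend(f"</{tag}>" for tag in reversed(open_tags))
    return "".join(out)


print(repair("<table><tr><td>a</table>"))  # <table><tr><td>a</td></tr></table>
print(repair("<b>x</i></b>"))              # <b>x</b>
```

A production tool handles far more cases (optional end tags, misnested inline elements, attributes), which is why the slide points to HTML Tidy.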

24
Preprocessing

Generate a syntactic token tree
After the errors are repaired, an HTML parse tree is created
Each node of the tree represents a
syntactic token
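
As an illustration, a syntactic token tree with dotted node paths can be built with Python's standard `html.parser`. The `Node`, `TreeBuilder`, `parse` and `walk` names are hypothetical, and the convention of numbering every tag among its same-tag siblings (HTML0, BODY0, ...) is an assumption of this sketch rather than XWRAP's exact notation. The input is assumed to be already repaired, with balanced tags.

```python
from html.parser import HTMLParser
from typing import Iterator, List, Optional


class Node:
    """One syntactic token (an HTML element) in the parse tree."""

    def __init__(self, tag: str, index: int, parent: Optional["Node"]) -> None:
        self.tag = tag
        self.index = index  # position among siblings with the same tag
        self.parent = parent
        self.children: List["Node"] = []
        self.text = ""

    def path(self) -> str:
        """Dotted path from the document root, e.g. HTML0.BODY0.TABLE0."""
        parts: List[str] = []
        node: Optional[Node] = self
        while node is not None and node.parent is not None:
            parts.append(f"{node.tag.upper()}{node.index}")
            node = node.parent
        return ".".join(reversed(parts))


class TreeBuilder(HTMLParser):
    """Builds the tree; assumes the tags are already balanced."""

    def __init__(self) -> None:
        super().__init__()
        self.root = Node("", 0, None)
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        index = sum(1 for child in self.current.children if child.tag == tag)
        node = Node(tag, index, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        if self.current.tag == tag and self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def parse(html: str) -> Node:
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def walk(node: Node) -> Iterator[Node]:
    """Yield nodes in document order."""
    yield node
    for child in node.children:
        yield from walk(child)


page = "<html><body><table><tr><td>a</td><td>b</td></tr></table></body></html>"
for node in walk(parse(page)):
    if node.text:
        print(node.path(), node.text)
```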

25
Preprocessing
26
Preprocessing
27
Information Extraction(IE)

Methodology for IE
IE phase takes a parse tree as input
Interacts with the user to identify semantic tokens
and hierarchical structure
Annotates the tree nodes with semantic tokens in
comma-delimited format and the nesting hierarchy in
a context-free grammar

28
Information Extraction(IE)

IE involves 3 steps
Identify regions
Output a set of region extraction rules
Identify semantic tokens
Output a set of semantic token extraction rules
Determine the nesting hierarchy for the content
presentation of a page
Output a set of hierarchy structure extraction
rules

29
Region Extraction

Region Extraction
Begin by asking the user to highlight the tree node
that is the region's start tag
Look for the corresponding end tag and highlight the
entire region
Compute the type and the number of subregions
Derive the set of RE describing the structural
layout of the region

30
Region Extraction
31
Region Extraction

The set of rules derived are
Tree_path
Specify how to find the path of the table node
Table_Area
Find the number of rows and columns of the table
Effective_Area
Define the effective area of the table
Table_Style
Distinguish vertical and horizontal tables
getTableInfo
Describe how to find the table name

32
Region Extraction

Example
Consider the weather report <Fig 5>, <Fig 7>
Apply the extraction rules given in <Fig 8>
Get the specific extraction rules for TABLE2

33
Region Extraction

Apply the extraction rules to TABLE2
By the Tree_Path rule
TABLE2: HTML.BODY.TABLE0.TR0.TD4.TABLE2
By the Table_Area rule
MAX 5 rows and MAX 3 cols from the user's selection
By the Effective_Area rule
rowSI, rowEI, colSI, and colEI from the user's input
By the Table_Style rule
Deduce a horizontal or vertical table
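
The table rules above can be sketched on a table that has already been read into rows of cell strings. Everything here is an assumption made for illustration: the `EffectiveArea` class, the function names, the digit-based heuristic, and the sample values in the second row. Because the slides do not say which layout counts as vertical and which as horizontal, the style check returns neutral labels instead.

```python
from dataclasses import dataclass
from typing import List, Tuple

Table = List[List[str]]


@dataclass
class EffectiveArea:
    """Inclusive start and end indices, mirroring rowSI, rowEI, colSI, colEI."""
    row_si: int
    row_ei: int
    col_si: int
    col_ei: int


def table_area(table: Table) -> Tuple[int, int]:
    """Number of rows and the widest column count."""
    return len(table), max((len(row) for row in table), default=0)


def effective_cells(table: Table, area: EffectiveArea) -> Table:
    """Keep only the cells inside the effective area."""
    rows = table[area.row_si:area.row_ei + 1]
    return [row[area.col_si:area.col_ei + 1] for row in rows]


def looks_like_name(cell: str) -> bool:
    # Assumed heuristic: attribute names contain no digits.
    return not any(ch.isdigit() for ch in cell)


def header_orientation(cells: Table) -> str:
    """Guess whether attribute names run down the first column or along the first row."""
    first_col = all(looks_like_name(row[0]) for row in cells if row)
    first_row = bool(cells) and all(looks_like_name(c) for c in cells[0])
    if first_col and not first_row:
        return "names_in_first_column"
    if first_row and not first_col:
        return "names_in_first_row"
    return "undetermined"


# Illustrative table; the second row's values are made up.
table = [
    ["Maximum Temperature F (C)", "82.0 (27.8)"],
    ["Minimum Temperature F (C)", "71.1 (21.7)"],
]
print(table_area(table))                                   # (2, 2)
print(effective_cells(table, EffectiveArea(0, 0, 0, 1)))   # first row only
print(header_orientation(table))                           # names_in_first_column
```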

34
Region Extraction

How to extract table name node
Ask the user to highlight the table name node <Fig 2>
Based on the user's input, XWRAP infers the path expression
Apply the rule getTableInfo and extract the table name
getTableName calls the following semantic token
extraction rule to obtain the actual string of the table
name

35
Semantic token extraction

Semantic token
A sub-string of the source document that is to be
treated as a single logical unit
Example
<font>Maximum and minimum Temperature F or
Current Weather condition</font>
To handle both types, XWRAP treats a token as a
pair of token name/token value.

36
Semantic token Extraction

Main Tasks
Find semantic tokens of interest
Define extraction rules to locate such tokens
Specify such tokens in a comma-delimited format

37
Semantic token Extraction

Example: TABLE2.TR1.TD0 in Fig 7
Based on user interaction,
when the user selects TABLE2.TR1.TD0,
TABLE2.TR1.TD0 is treated as a semantic
token with 3 leaf nodes
Token name: Maximum Temperature F (C)
Token value: 82.0 (27.8)
A set of semantic token rules can be derived for
the rest of the subtrees at TR3 and TR4 using the
function getStoken()
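
A minimal sketch of turning the leaf nodes of one selected subtree into a name/value token and writing tokens in comma-delimited form. The helper names `to_token` and `to_comma_delimited` are hypothetical, and treating the first leaf as the name and the remaining leaves as the value is an assumption; the leaf strings are illustrative readings of the slide's example.

```python
import csv
import io
from typing import Iterable, List, Tuple

Token = Tuple[str, str]  # (token name, token value)


def to_token(leaves: List[str]) -> Token:
    """Assume the first non-empty leaf is the name and the rest form the value."""
    name, *values = [leaf.strip() for leaf in leaves if leaf.strip()]
    return name, " ".join(values)


def to_comma_delimited(tokens: Iterable[Token]) -> str:
    """Serialize tokens one per line; csv handles quoting of embedded commas."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(tokens)
    return buffer.getvalue()


leaves = ["Maximum Temperature F (C)", "82.0", "(27.8)"]
token = to_token(leaves)
print(token)                        # ('Maximum Temperature F (C)', '82.0 (27.8)')
print(to_comma_delimited([token]))  # Maximum Temperature F (C),82.0 (27.8)
```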

38
Semantic token Extraction

39
Semantic token Extraction
40
Semantic token Extraction
41
Hierarchical structure extractor

Purpose
Make explicit the hierarchical structure by
identifying which parts of regions or token
streams should be grouped
Hierarchical structure can be extracted in a
semi-automatic fashion

42
Hierarchical structure extractor

How to group
Identify all regions that are siblings and
organize them in the sequential order in which they appear
in the original document
Obtain a section heading or table name using the
paired header tags such as <h3></h3>
Infer the nesting hierarchy of sections or the
columns of tables using font size and the nesting
structure of the presentation layout tags
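
One way to picture the nesting inference is with heading levels alone: a heading becomes a child of the nearest preceding heading with a smaller level. The sketch below assumes the headings have already been collected as (level, title) pairs in document order; `nest_by_heading` and the sample titles are hypothetical, and font-size cues are left out.

```python
from typing import Any, Dict, List, Tuple

Section = Dict[str, Any]


def nest_by_heading(items: List[Tuple[int, str]]) -> List[Section]:
    """Nest (level, title) pairs, e.g. level 2 for <h2> and 3 for <h3>.

    Levels must be 1 or greater; document order is preserved among siblings.
    """
    root: Section = {"level": 0, "title": None, "children": []}
    stack: List[Section] = [root]
    for level, title in items:
        # Pop back to the nearest enclosing section with a smaller level.
        while stack[-1]["level"] >= level:
            stack.pop()
        node: Section = {"level": level, "title": title, "children": []}
        stack[-1]["children"].append(node)
        stack.append(node)
    return root["children"]


# Hypothetical headings from a weather page.
headings = [
    (2, "Current Weather Conditions"),
    (3, "Temperature"),
    (3, "Wind"),
    (2, "Maximum and Minimum Temperatures"),
]
for section in nest_by_heading(headings):
    print(section["title"], [child["title"] for child in section["children"]])
```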

43
Hierarchical structure extractor
44
Hierarchical structure extractor

XML-template
Specifies the hierarchical structure extraction
rules
Facilitates the code generation of XWRAP
Why
XML templates are well-formed XML files that
include processing instructions (these direct the template
engine to the special placeholders where data fields are
inserted into the template)
ex: <?XG-InsertionField-XG fieldName>
An XML template contains a repetitive part, called
XG-Iteration-XG.

45
Hierarchical structure extractor

Example XG-Iteration-XG
Determines the beginning and the end of a
repetitive part
After the template engine reaches the end position of
a repetition, it takes a new record from the
delimited file
It goes back to the start position to create the same
set of XML tags as in the previous pass
New data is inserted into the resulting XML file
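
The iteration behaviour described above can be imitated with a few lines of string processing. The exact processing-instruction syntax is not recoverable from this transcript, so the `<?XG-Iteration-XG start?>`, `<?XG-Iteration-XG end?>` and `<?XG-InsertionField-XG name?>` forms used below are assumptions, as are the `fill` and `render` helpers and the second sample record.

```python
import re
from typing import Dict, List
from xml.sax.saxutils import escape

# Assumed processing-instruction syntax (not taken from XWRAP itself).
FIELD = re.compile(r"<\?XG-InsertionField-XG\s+(\w+)\s*\?>")
ITERATION = re.compile(
    r"<\?XG-Iteration-XG\s+start\s*\?>(.*?)<\?XG-Iteration-XG\s+end\s*\?>",
    re.DOTALL,
)


def fill(fragment: str, record: Dict[str, str]) -> str:
    """Replace each insertion field with the escaped value from one record."""
    return FIELD.sub(lambda m: escape(record[m.group(1)]), fragment)


def render(template: str, records: List[Dict[str, str]]) -> str:
    """Emit the repetitive part once per record, in record order."""
    def repeat(match: "re.Match[str]") -> str:
        return "".join(fill(match.group(1), record) for record in records)
    return ITERATION.sub(repeat, template)


template = (
    "<report><?XG-Iteration-XG start?>"
    "<token><name><?XG-InsertionField-XG name?></name>"
    "<value><?XG-InsertionField-XG value?></value></token>"
    "<?XG-Iteration-XG end?></report>"
)
records = [
    {"name": "Maximum Temperature F (C)", "value": "82.0 (27.8)"},
    {"name": "Wind", "value": "calm"},  # made-up record
]
print(render(template, records))
```

In practice the records would come from the comma-delimited file, for example through `csv.DictReader`.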

46
Code Generator

Generate the wrapper code for the chosen web source
by applying the comma-delimited file <Fig 9>, the
region extraction rules <Example 2>, and the
hierarchical structure extraction rules <Fig 10>.

47
Code Generator
48
Conclusion

Three contributions
Two-phase code generation methodology and
mechanisms for semi-automatic construction of
XML-enabled wrappers
Separates the tasks of building wrappers that are
specific to a web source from the tasks that are
repetitive for any source
Provides inductive learning algorithms that derive
or discover wrapper patterns
