Title: XWRAP: An XML-enabled Wrapper Construction System for Web Information Sources
1XWRAP: An XML-enabled Wrapper Construction System
for Web Information Sources
2Contents
- XWRAP
- Background
- Introduction
- Characteristics
- Procedures
- Architecture
- Wrapping Phase
- Information Extraction
- Semantic Token Extraction
- Hierarchical structure extractor
- Code generator
- Conclusion
3Background
- Semi-structured data (HTML) has increased
- Not directly usable by standard SQL-like query processing engines
- Need a smart way to extract data from web sources
4Background
- Wrapper generation types
- Manually constructed wrapper
- Whenever the domain changes, the wrapper must be adapted to the changes
- Semi-automatic wrapper (user interaction)
- Use a graphical interface
- Show the information to extract
- XWRAP belongs to this type
- Automatic wrapper
- Use machine learning techniques
5Introduction
- XWRAP
- Systematic approach to building an interactive system for semi-automatic construction of wrappers
- Transforms HTML into program-friendly XML
- Provides interactive mechanisms and heuristics for generating extraction rules
- Combines extraction rules into a wrapper program
6Characteristics
- Two phases for wrapper generation
- Use an interactive interface to generate extraction rules with a few clicks
- Construct the wrapper program from the extraction rules
- Advantages
- User-friendly interface
- Clean separation of source-specific tasks from tasks that repeat for any source
- Micro-feedback approach
7Characteristics
- Two types of boundary identification
- Divide the task of identifying object boundaries into two steps
- Region identification
- Semantic token identification
8Methodology
- The web document is fetched
- A parse tree is built from the HTML tags
- The user highlights a specific field by
- Region or semantic token
- A domain-specific wrapper is constructed
- The created wrapper extracts data from the web source
9Methodology
- Region identification
- The user highlights a word, phrase, or sentence as the starting point of a meaningful region
- XWRAP applies heuristics to the nearest region tags to derive the type of the region
10Methodology
- Semantic tokens
- The user identifies tokens of interest with a few clicks
- XWRAP fires learning algorithms to detect repetitive token patterns within a region
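One naive way to picture repetitive-pattern detection is to look for the shortest tag sequence that, repeated, covers a whole region. The sketch below is only a stand-in for illustration, not XWRAP's learning algorithm; the function name and the flat list-of-tags input are assumptions.

```python
def smallest_repeating_unit(tags):
    """Return the shortest prefix that rebuilds `tags` when repeated
    at least twice, or None if the sequence does not repeat."""
    n = len(tags)
    for size in range(1, n // 2 + 1):
        # A candidate unit must divide the sequence evenly.
        if n % size == 0 and tags[:size] * (n // size) == tags:
            return tags[:size]
    return None


# Hypothetical region: three table rows with two cells each.
region = ["tr", "td", "td", "tr", "td", "td", "tr", "td", "td"]
print(smallest_repeating_unit(region))  # ['tr', 'td', 'td']
```

Real pages rarely repeat this cleanly, so a practical detector would need to tolerate optional or missing tokens.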
11Architecture
- Syntactical Structure Normalization
- Information Extraction
- Code generation
- Testing and packing
13Architecture
- Syntactical Structure Normalization
- Prepare the environment for the information extraction process through the following tasks
- Accept a URL entered by the XWRAP user
- Clean up bad HTML tags
- Transform the document into a parse tree
14Architecture
- Information Extraction
- Derive extraction rules in the following steps
- Identify interesting regions
- Identify important semantic tokens, their logical paths, and their node positions in the parse tree
- Identify the useful hierarchical structure
15Architecture
- Code generation
- Generate the wrapper program code
- Key technique
- Encoding of semantic knowledge represented in the form of declarative extraction rules and an XML template
16Architecture
- Testing and packing
- Check whether new extraction rules, or updates to the existing extraction rules, have been derived
- Run code generation to produce a new version of the wrapper program
- Release the wrapper program with the release button
19Wrapping phases
- XWRAP goes through 6 phases
- Tasks within a phase run concurrently
- Bookkeeping
- Collects information that appears in the retrieved source document
- Error handler
- Designed for error detection
- Allows the wrapper developer to determine exactly where errors have occurred
20Preprocessing
- Consists of
- Remote document retrieval and syntax repair
- Syntactic-token parse tree generation
21Preprocessing
- Fetching a web page
- Each wrapper has a set of retrieval rules
- Each rule specifies
- The name of the rule
- A list of parameters, such as
- Protocol, e.g. HTTP or FTP
- Fetch method, e.g. HTTP GET or HTTP PUT
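A retrieval rule of this kind can be pictured as a small record. The sketch below is an assumption about how such a rule might be modelled, not XWRAP's actual rule format; the class name, field names, host, and parameter values are all hypothetical.

```python
from dataclasses import dataclass, field
from urllib.parse import urlencode


@dataclass
class RetrievalRule:
    """Hypothetical record for one retrieval rule."""
    name: str                      # the name of the rule
    protocol: str                  # e.g. "HTTP" or "FTP"
    host: str
    path: str
    method: str = "GET"            # fetch method
    params: dict = field(default_factory=dict)

    def url(self) -> str:
        # Build the request URL from the rule's parts.
        base = f"{self.protocol.lower()}://{self.host}{self.path}"
        query = urlencode(self.params)
        return f"{base}?{query}" if query else base


# Placeholder host and parameters, for illustration only.
rule = RetrievalRule(
    name="example_rule",
    protocol="HTTP",
    host="example.org",
    path="/report",
    params={"station": "ABCD"},
)
print(rule.url())  # http://example.org/report?station=ABCD
```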
22Preprocessing
- Assume we construct a wrapper for the NOAA current weather report web site
- URL: http://weather.noaa.gov/cgi-bin/currwx.pl?cccc=KSAV
23Preprocessing
- Repairing bad syntax
- Check for bad HTML syntax
- Insert missing tags
- Remove useless tags, such as
- An end tag </pr> that exists without a start tag <pr>
- Use HTML Tidy
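The two repairs listed above (inserting missing end tags, dropping end tags that have no start tag) can be sketched with Python's standard html.parser. This is a simplified illustration, not HTML Tidy and not XWRAP's code; the class and function names are assumptions, and comments and doctype declarations are dropped for brevity.

```python
from html.parser import HTMLParser

# Elements that never take an end tag (partial list, for illustration).
VOID = {"br", "hr", "img", "input", "meta", "link"}


class TagRepairer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []        # repaired output fragments
        self.open_tags = []  # stack of currently open elements

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())
        if tag not in VOID:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray end tag without a start tag: drop it
        # Close any elements left open inside the one being closed.
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def close(self):
        super().close()
        # Insert end tags for everything still open at end of input.
        while self.open_tags:
            self.out.append(f"</{self.open_tags.pop()}>")


def repair(html):
    repairer = TagRepairer()
    repairer.feed(html)
    repairer.close()
    return "".join(repairer.out)


print(repair("<div><b>x</pr></div>"))  # <div><b>x</b></div>
```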
24Preprocessing
- Generate a syntactic token tree
- After error repair, an HTML parse tree is created
- Each node of the tree represents a syntactic token
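As a rough picture of such a tree, the sketch below builds one node per tag and one leaf per text run using Python's html.parser. It assumes the input has already been repaired (well nested, no void elements); the Node class, the "#text" label, and the synthetic "document" root are assumptions made for the example.

```python
from html.parser import HTMLParser


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag          # tag name, or "#text" for a text leaf
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("document")  # synthetic root above <html>
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        # Input is assumed well nested, so an end tag closes the current node.
        if self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        if data.strip():
            leaf = Node("#text", self.current)
            leaf.text = data.strip()
            self.current.children.append(leaf)


def build_tree(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def dump(node, depth=0):
    """Return the tree as a list of indented lines."""
    label = node.text if node.tag == "#text" else node.tag
    lines = ["  " * depth + label]
    for child in node.children:
        lines.extend(dump(child, depth + 1))
    return lines


page = "<html><body><table><tr><td>Temp</td><td>82</td></tr></table></body></html>"
tree = build_tree(page)
print("\n".join(dump(tree)))
```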
27Information Extraction(IE)
- Methodology for IE
- The IE phase takes a parse tree as input
- Interacts with the user to identify semantic tokens and the hierarchical structure
- Annotates the tree nodes with semantic tokens in comma-delimited format and with the nesting hierarchy in a context-free grammar
28Information Extraction(IE)
- IE involves 3 steps
- Identify regions
- Output a set of region extraction rules
- Identify semantic tokens
- Output a set of semantic token extraction rules
- Determine the nesting hierarchy for the content presentation of a page
- Output a set of hierarchical structure extraction rules
29Region Extraction
- Region Extraction
- Begin by asking the user to highlight only the tree node with the start tag
- Look for the corresponding end tag and highlight the entire region
- Compute the type and the number of sub-regions
- Derive the set of RE describing the structural layout of the region
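The first three steps can be sketched on a flat token list: starting from the highlighted start tag, scan forward for its matching end tag, then count the direct sub-regions by tag type. This is an illustrative sketch under simplifying assumptions (attribute-free tags stored as plain strings); the function names are hypothetical.

```python
from collections import Counter


def find_region(tokens, start):
    """Return (start, end) indexes of the region opened at tokens[start]."""
    opening = tokens[start]
    closing = "</" + opening[1:-1] + ">"
    depth = 0
    for i in range(start, len(tokens)):
        if tokens[i] == opening:
            depth += 1          # nested region of the same type
        elif tokens[i] == closing:
            depth -= 1
            if depth == 0:
                return start, i
    raise ValueError("no matching end tag for " + opening)


def sub_regions(tokens, start, end):
    """Count the direct child regions inside tokens[start..end] by tag name."""
    depth = 0
    kinds = Counter()
    for tok in tokens[start + 1:end]:
        if tok.startswith("</"):
            depth -= 1
        elif tok.startswith("<"):
            if depth == 0:
                kinds[tok[1:-1]] += 1
            depth += 1
    return kinds


tokens = ["<table>", "<tr>", "<td>", "a", "</td>", "</tr>",
          "<tr>", "<td>", "b", "</td>", "</tr>", "</table>"]
start, end = find_region(tokens, 0)
print(start, end, dict(sub_regions(tokens, start, end)))  # 0 11 {'tr': 2}
```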
31Region Extraction
- The derived rules are
- Tree_path
- Specifies how to find the path of the table node
- Table_Area
- Finds the number of rows and columns of the table
- Effective_Area
- Defines the effective area of the table
- Table_Style
- Distinguishes vertical from horizontal tables
- getTableInfo
- Describes how to find the table name
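To make the Table_Area and Effective_Area rules concrete, here is a minimal sketch that treats a table as a list of rows. The function names, the inclusive index convention, and the sample data are assumptions for illustration and do not reproduce XWRAP's rule syntax.

```python
def table_area(rows):
    """Return (number of rows, maximum number of columns)."""
    n_rows = len(rows)
    n_cols = max((len(row) for row in rows), default=0)
    return n_rows, n_cols


def effective_area(rows, row_start, row_end, col_start, col_end):
    """Return the sub-table bounded by the given inclusive indexes."""
    return [row[col_start:col_end + 1] for row in rows[row_start:row_end + 1]]


# Illustrative table; the values are made up.
rows = [
    ["Station", "ABCD"],
    ["Temp", "82", "27"],
    ["Wind", "NE"],
]
print(table_area(rows))                  # (3, 3)
print(effective_area(rows, 1, 2, 0, 1))  # [['Temp', '82'], ['Wind', 'NE']]
```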
32Region Extraction
- Example
- Consider the weather report <Fig 5>, <Fig 7>
- Apply the extraction rule given in <Fig 8>
- Get the specific extraction rule for TABLE2
33Region Extraction
- Apply the extraction rules to TABLE2
- By the Tree_Path rule
- TABLE2: HTML.BODY.TABLE0.TR0.TD4.TABLE2
- By the Table_Area rule
- At most 5 rows and at most 3 columns, from the user's selection
- By the Effective_Area rule
- rowSI, rowEI, colSI, and colEI from the user's input
- By the Table_Style rule
- Deduce whether the table is horizontal or vertical
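A dotted tree path like the one above can be resolved by walking the parse tree one step at a time. The sketch below assumes each step is a tag name followed by an optional index counted among same-named siblings (so TABLE1 means the second TABLE child); that reading, the function name, and the sample page are assumptions, and tag names that themselves end in digits (such as H3) are not handled.

```python
import re
import xml.etree.ElementTree as ET


def resolve_path(root, path):
    """Follow a dotted path such as HTML.BODY.TABLE1.TR0.TD1 from the root."""
    steps = path.split(".")
    if root.tag.lower() != steps[0].lower():
        raise ValueError("path does not start at the root element")
    node = root
    for step in steps[1:]:
        match = re.fullmatch(r"([A-Za-z]+)(\d*)", step)
        if match is None:
            raise ValueError("bad path step: " + step)
        name = match.group(1).lower()
        index = int(match.group(2) or 0)
        # Index counts only children that share the tag name.
        same_tag = [child for child in node if child.tag.lower() == name]
        node = same_tag[index]
    return node


# Small well-formed stand-in page; the content is made up.
page = """<html><body>
<table><tr><td>nav</td></tr></table>
<table><tr><td>Station</td><td>ABCD</td></tr></table>
</body></html>"""
root = ET.fromstring(page)
print(resolve_path(root, "HTML.BODY.TABLE1.TR0.TD1").text)  # ABCD
```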
34Region Extraction
- How to extract the table name node
- Ask the user to highlight the table name node <FIG 2>
- Based on the user's input, XWRAP infers a path expression
- Apply the getTableInfo rule and extract the table name
- getTableName calls the following semantic token extraction rule to obtain the actual string of the table name
35Semantic token extraction
- Semantic token
- A substring of the source document that is to be treated as a single logical unit
- Example
- <font>Maximum and minimum Temperature F or Current Weather condition</font>
- To handle both types, XWRAP treats a token as a pair of token name and token value
36Semantic token Extraction
- Main Tasks
- Find semantic tokens of interest
- Define extraction rules to locate such tokens
- Specify such tokens in a comma-delimited format
37Semantic token Extraction
- Example: TABLE2.TR1.TD0 in Fig 7
- Based on user interaction, when TABLE2.TR1.TD0 is selected,
- TABLE2.TR1.TD0 is treated as a semantic token with 3 leaf nodes
- Token name: Maximum Temperature F(c)
- Token value: (82,0) (27,8)
- A set of semantic token rules can be derived for the remaining subtrees at TR3 and TR4 using the function getStoken()
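The name/value pairing and the comma-delimited output can be sketched as follows. The sketch assumes the first leaf under a selected node is the token name and the remaining leaves are its values; the function names and sample values are hypothetical and this is not the getStoken() function itself.

```python
import csv
import io


def to_token(leaf_texts):
    """Split the leaf texts of a node into (token name, token values)."""
    cleaned = [text.strip() for text in leaf_texts if text.strip()]
    name, *values = cleaned
    return name, values


def write_tokens(tokens):
    """Serialize tokens as comma-delimited lines: name first, then values."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for name, values in tokens:
        writer.writerow([name, *values])
    return buffer.getvalue()


# Illustrative leaves for one table cell with 3 leaf nodes.
token = to_token(["Maximum Temperature", "82.0", "27.8"])
print(write_tokens([token]), end="")  # Maximum Temperature,82.0,27.8
```

Using the csv module rather than a plain join keeps values that contain commas correctly quoted.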
41Hierarchical structure extractor
- Purpose
- Make the hierarchical structure explicit by identifying which parts of regions or token streams should be grouped
- The hierarchical structure can be extracted in a semi-automatic fashion
42Hierarchical structure extractor
- How to group
- Identify all regions that are siblings and organize them in the sequential order in which they appear in the original document
- Obtain a section heading or table name using a paired header tag such as <h3></h3>
- Infer the nesting hierarchy of sections or the columns of tables using font size and the nesting structure of the presentation layout tags
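One simple way to infer nesting from header tags is to keep a stack of open sections and attach each heading to the nearest preceding heading with a smaller level number. The sketch below illustrates that idea only; it ignores font size and layout tags, and the function name, input format, and sample headings are assumptions.

```python
def nest_sections(headings):
    """Turn document-ordered (level, title) pairs into a nested structure.

    A heading becomes a child of the closest earlier heading whose
    level number is smaller (e.g. h3 under h2).
    """
    root = {"title": None, "level": 0, "children": []}
    stack = [root]
    for level, title in headings:
        # Close sections that are at the same or a deeper level.
        while stack[-1]["level"] >= level:
            stack.pop()
        node = {"title": title, "level": level, "children": []}
        stack[-1]["children"].append(node)
        stack.append(node)
    return root["children"]


# Illustrative headings in document order: (header level, text).
headings = [(2, "Current Weather"), (3, "Temperature"), (3, "Wind"), (2, "Forecast")]
sections = nest_sections(headings)
print([s["title"] for s in sections])                 # ['Current Weather', 'Forecast']
print([c["title"] for c in sections[0]["children"]])  # ['Temperature', 'Wind']
```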
44Hierarchical structure extractor
- XML template
- Specifies the hierarchical structure extraction rules
- Facilitates the code generation of XWRAP
- Why
- An XML template is a well-formed XML file that includes processing instructions, which direct the template engine to the placeholders where data fields are inserted into the template
- e.g. <?XG-InsertionField-XG fieldName>
- An XML template contains a repetitive part, called XG-Iteration-XG
45Hierarchical structure extractor
- Example: XG-Iteration-XG
- Determines the beginning and the end of a repetitive part
- After the template engine reaches the end position of a repetition, it takes a new record from the delimited file
- It goes back to the start position to create the same set of XML tags as in the previous pass
- The new data is inserted into the resulting XML file
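The iteration behaviour described above can be sketched as a tiny template engine: the repetitive part is emitted once per record of the delimited file, with each insertion field replaced by that record's value. The exact processing-instruction syntax used here (including the begin/end markers and the closing `?>`) and the header row in the delimited data are assumptions for the example, not XWRAP's documented format.

```python
import csv
import io
import re
from xml.sax.saxutils import escape

# Assumed placeholder syntax, for illustration only.
FIELD = re.compile(r"<\?XG-InsertionField-XG\s+(\w+)\s*\?>")
LOOP = re.compile(
    r"<\?XG-Iteration-XG begin\?>(.*?)<\?XG-Iteration-XG end\?>", re.S
)


def fill(fragment, record):
    """Replace every insertion field with the record's escaped value."""
    return FIELD.sub(lambda m: escape(record[m.group(1)]), fragment)


def render(template, delimited_text):
    """Expand each repetitive part once per record of the delimited data."""
    records = list(csv.DictReader(io.StringIO(delimited_text)))

    def expand(match):
        # One pass over the repetitive part per record.
        return "".join(fill(match.group(1), record) for record in records)

    return LOOP.sub(expand, template)


template = (
    "<report><?XG-Iteration-XG begin?>"
    "<token><name><?XG-InsertionField-XG name?></name>"
    "<value><?XG-InsertionField-XG value?></value></token>"
    "<?XG-Iteration-XG end?></report>"
)
data = "name,value\nTemperature,82.0\nWind,NE 5\n"
print(render(template, data))
```

Fields outside a repetitive part are left untouched by this sketch; a fuller engine would fill those from a separate record.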
46Code Generator
- Generate the wrapper code for the chosen web source by applying the comma-delimited file <Fig 9>, the region extraction rules <Example 2>, and the hierarchical structure extraction rules <Fig 10>
48Conclusion
- Three contributions
- A two-phase code generation methodology and mechanisms for semi-automatic construction of XML-enabled wrappers
- Separates the tasks of building wrappers that are specific to a web source from the tasks that are repetitive for any source
- Provides inductive learning algorithms that derive or discover wrapper patterns