1
Semi-Automatic Wrapper generation and Adaption
  • Youngju Son

2
Contents
  • Background
  • Related works
  • Wrapper
  • Wrapper generation
  • Adaption
  • Implementation
  • Application
  • Conclusion

3
Background
  • The success of the Internet enables
    • Electronic market structures
    • Electronic business
  • Market infrastructure faces
    • Heterogeneity and autonomy of the source providers

4
Background
  • Solution: Wrapper
    • Uses the interfaces of the provider to overcome
      heterogeneity and autonomy
    • Provides an interface adapted to the provider by
      loading a source description file
  • This talk presents a tool that generates source
    description files for a given provider

5
Background

6
Related Works
  • UNICAT
  • W4F
  • XWRAP
  • LIXTO

7
Related Works
  • UNICAT
    • Based on semi-structured data (HTML)
    • Oriented on the HTML tree
    • Uses extraction rules written in QEL (Qualified
      path expression Extractor Language)
  • Limitation
    • The extraction process is limited to a single page

8
Related Works
  • W4F
    • Retrieval part: loading of web pages
    • Extraction part: HEL (HTML Extraction Language)
    • Mapping part: extracted data is stored in
      NSL (Nested String List)
  • Limitation
    • The extraction process is limited to a single page

9
Related Works
  • XWRAP
    • Provides a graphical tool for the semi-automatic
      generation of wrappers
    • Uses pattern matching on web pages
    • Detects and removes errors in HTML documents
  • Limitation
    • Navigation across structurally different web pages
      is not supported

10
Related Works
  • LIXTO
    • Provides a visual, interactive wrapper generator
    • Uses a declarative extraction language
    • Uses an instance base for the different attributes
      contained in HTML
    • Transforms pattern instances into XML documents
  • Limitation
    • Restricted to one web page

11
Wrapper
  • The wrapper contains these basic modules
    • Coordinator
    • Validator
    • Planner
    • Converter

12
Wrapper

13
Wrapper
  • Coordinator (wrapper interface)
    • Receives and interprets the query
    • Sends back the result
  • Validator
    • Checks whether a query is syntactically and
      semantically correct
    • If it is correct, hands it over to the planner

14
Wrapper
  • Planner
    • Determines how to perform the query
      • Which pages are needed
      • How to navigate between pages
    • Creates a query plan
    • Transmits the query plan to the converter

15
Wrapper
  • Navigation graph
    • The central data structure used in the planning
      process
    • Describes the structure of the web interface

16
Wrapper
  • Converter
    • Sends the query to the provider's web interface
    • Navigates the pages autonomously
    • Extracts the necessary data
  • Authorization
    • Checks whether the query is allowed to be performed
    • Determines which rights may be granted

17
Wrapper
  • Cost monitor
    • Pre-calculates the performance cost of a query
    • Controls the converter and can suspend query
      processing
  • Protocol
    • Logs all actions of the wrapper

18
Wrapper
  • Source Description File (SDF)
    • Metadata related to the web source
    • Contains the following information
      • Query format
      • Result types
      • Structure of result pages
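
The slides do not show the file format of an SDF, so the following is only a sketch of the three kinds of information it holds, expressed as a Python data class. The class name, field names, provider, form fields, and the bracket-and-dot rule syntax are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class SourceDescription:
    """Hypothetical in-memory form of a source description file."""
    provider: str
    query_format: dict      # uniform query attribute -> provider form field
    result_types: dict      # result attribute -> type name
    result_structure: dict  # result attribute -> extraction rule (assumed syntax)

    def build_request(self, query):
        """Translate a uniform query into the provider's form fields."""
        unknown = sorted(set(query) - set(self.query_format))
        if unknown:
            raise ValueError(f"unsupported query attributes: {unknown}")
        return {self.query_format[name]: value for name, value in query.items()}


# All names and values below are made up for illustration.
BOOKSHOP = SourceDescription(
    provider="example-bookshop",
    query_format={"author": "q_author", "title": "q_title"},
    result_types={"title": "string", "price": "decimal"},
    result_structure={
        "title": "html[0].body[0].table[0].tr[1].td[0].txt()",
        "price": "html[0].body[0].table[0].tr[1].td[1].txt()",
    },
)

print(BOOKSHOP.build_request({"author": "Ullman"}))
```

Keeping all provider-specific details in one such description is what would let the same wrapper code serve different providers.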

19
Wrapper
  • Extraction rules
    • Formalized as Extended Hierarchical Path
      Expressions (EHPE)
  • EHPE
    • Consists of one or more nodes and operators
    • Each node has a node name and an index
    • The sequence of nodes describes the path from the
      root of the tree to the node in question
    • A node name can be an HTML tag or the keyword pcdata

20
Wrapper
  • Extraction rules
    • expression ::= (node) (op).
    • node ::= node-name index.
    • node-name ::= tag | pcdata.
    • index ::= index, index number number - number number - number .
    • op ::= att(identifier) | txt() | split(regex) | match(regex) | search(regex).
  • Example
    • html0body0table0trth1table0.trtxt()
    • html0body0table0att(border)
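
A minimal sketch of how such a rule might be evaluated against a page tree. It assumes a concrete syntax in which each node is written `name[index]` and steps are separated by dots; only the node names, indices, and operator names come from the slides, while the delimiters are an assumption. The sketch covers single numeric indices and the `att` and `txt` operators only, and uses a small well-formed sample page that is made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

NODE = re.compile(r"^(\w+)\[(\d+)\]$")         # e.g. table[0]
OP = re.compile(r"^(?:att\((\w+)\)|txt\(\))$")  # att(name) or txt()


def evaluate(expression, root):
    """Walk the tree along the node steps, then apply the operator."""
    current = None
    for step in expression.split("."):
        node = NODE.match(step)
        if node:
            name, index = node.group(1), int(node.group(2))
            # The first step addresses the root; later steps address children.
            candidates = [root] if current is None else list(current)
            matching = [c for c in candidates if c.tag == name]
            if index >= len(matching):
                return None  # the path does not exist on this page
            current = matching[index]
            continue
        op = OP.match(step)
        if op is None or current is None:
            raise ValueError(f"cannot interpret step: {step!r}")
        if op.group(1) is not None:
            return current.get(op.group(1))            # att(name)
        return "".join(current.itertext()).strip()     # txt()
    return current


# Made-up sample page.
PAGE = ET.fromstring(
    "<html><body>"
    "<table border='1'>"
    "<tr><th>Title</th><th>Price</th></tr>"
    "<tr><td>Database Systems</td><td>42.00</td></tr>"
    "</table>"
    "</body></html>"
)

print(evaluate("html[0].body[0].table[0].att(border)", PAGE))
print(evaluate("html[0].body[0].table[0].tr[1].td[0].txt()", PAGE))
```

Real HTML is rarely well-formed XML, so an actual wrapper would need a tolerant HTML parser in place of `xml.etree.ElementTree`.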

21
Wrapper Generation
  • Semi-automatic approach
    • Through user interaction
  • Generation by example
    • The user selects an entry page for searching
    • The user marks relevant attributes in the page

22
Wrapper Generation

23
Wrapper Generation
  • The wrapper generator consists of three parts
  • User interface
    • Provides a view of the source and of the source
      description
  • Generator
    • Creates hierarchical path expressions for selected
      parts of the HTML tree
  • Data module
    • Supports file operations
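
The generator step can be sketched as follows: given the element a user has marked, walk up to the root and record each tag together with its position among same-named siblings. The `name[index]` notation with dot separators, the `path_expression` helper, and the sample page are assumptions for illustration, not the tool's actual output format.

```python
import xml.etree.ElementTree as ET


def path_expression(root, target):
    """Build a hierarchical path expression for a selected element."""
    # ElementTree has no parent pointers, so build a child -> parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    steps = []
    node = target
    while node is not root:
        parent = parents[node]
        same_tag = [c for c in parent if c.tag == node.tag]
        index = next(i for i, c in enumerate(same_tag) if c is node)
        steps.append(f"{node.tag}[{index}]")
        node = parent
    steps.append(f"{root.tag}[0]")
    return ".".join(reversed(steps)) + ".txt()"


# Made-up sample page.
PAGE = ET.fromstring(
    "<html><body><table>"
    "<tr><th>Title</th><th>Price</th></tr>"
    "<tr><td>Database Systems</td><td>42.00</td></tr>"
    "</table></body></html>"
)

# Simulate the user marking the first cell of the second row.
selected = PAGE.find("body/table/tr[2]/td[1]")
print(path_expression(PAGE, selected))
```

Note that the `find` call uses ElementTree's 1-based positions, while the generated expression uses 0-based indices as in the slide examples.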

24
Wrapper Adaption
  • Changes in a provider's offer or web presentation
    force modification of the source description file
  • The wrapper generator has to detect such changes
  • An error detection algorithm is necessary

25
Wrapper Adaption
  • Basic concept of error detection
    • The queries are repeated
    • The results are compared with test data
    • Major differences indicate the use of an extraction
      rule that is no longer valid
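
The comparison step might look like the sketch below, which flags every attribute whose freshly extracted values deviate from the stored test data too often. The `invalid_rules` helper, the 20 percent tolerance, and the sample values are assumptions; the slides only say that "major differences" indicate an invalid rule.

```python
def invalid_rules(extracted, test_data, tolerance=0.2):
    """Return the attributes whose results differ too much from test data."""
    suspicious = []
    for attribute, expected in test_data.items():
        actual = extracted.get(attribute, [])
        # Count position-wise mismatches plus missing or surplus values.
        mismatches = sum(1 for e, a in zip(expected, actual) if e != a)
        mismatches += abs(len(expected) - len(actual))
        if expected and mismatches / len(expected) > tolerance:
            suspicious.append(attribute)
    return suspicious


# Made-up test data and a made-up repeated query result.
TEST_DATA = {
    "title": ["Database Systems", "Compilers", "Operating Systems"],
    "price": ["42.00", "55.50", "38.90"],
}
EXTRACTED = {
    "title": ["Database Systems", "Compilers", "Operating Systems"],
    "price": ["Add to cart", "Add to cart", "Add to cart"],
}

print(invalid_rules(EXTRACTED, TEST_DATA))
```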

26
Wrapper Adaption
  • Error detection algorithm
    • Test data may be found at a different position
      than expected
    • If an attribute is found on pages with identical
      structure, it can be found at the same position on
      each of these pages
    • Search the result pages for each appearance of the
      test data

27
Wrapper Adaption
  • Count the number of matches for each node of the
    HTML tree
  • Extract the node with the highest number of
    matches
  • The extraction rule for this node becomes the new
    rule for the attribute
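
The counting idea can be sketched as follows: search every result page for the known test value, count the matches per tree position, and take the position with the most matches as the basis for the new rule. The helper names, the `name[index]` dot-separated notation, exact string matching, and the sample pages are assumptions made for illustration.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def node_paths(root):
    """Yield (path, text) for every element that carries text."""
    def walk(node, path):
        text = (node.text or "").strip()
        if text:
            yield path, text
        seen = Counter()  # position among same-named siblings
        for child in node:
            index = seen[child.tag]
            seen[child.tag] += 1
            yield from walk(child, f"{path}.{child.tag}[{index}]")
    yield from walk(root, f"{root.tag}[0]")


def relocate_rule(pages, test_values):
    """Return a new rule for the node that matches the test data most often."""
    matches = Counter()
    for root, expected in zip(pages, test_values):
        for path, text in node_paths(root):
            if text == expected:
                matches[path] += 1
    if not matches:
        return None
    best_path, _ = matches.most_common(1)[0]
    return best_path + ".txt()"


# Made-up result pages with identical structure; the title moved to td[1].
PAGES = [
    ET.fromstring("<html><body><h1>Database Systems</h1><table><tr>"
                  "<td>New</td><td>Database Systems</td></tr></table></body></html>"),
    ET.fromstring("<html><body><h1>Book details</h1><table><tr>"
                  "<td>New</td><td>Compilers</td></tr></table></body></html>"),
]
TEST_TITLES = ["Database Systems", "Compilers"]

print(relocate_rule(PAGES, TEST_TITLES))
```

On the first page the title also appears in the heading, but only the table cell matches on both pages, so the cell wins the count.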

28
Implementation

29
Implementation

30
Application
  • This wrapper provides
    • Access to market information
    • Uniform access to information providers
  • Within the UNICAT project
    • A market infrastructure is being developed

31
Application

32
Application
  • User agents
    • Act as representatives of the customer
  • Traders
    • Provide a market-internal service for provider
      selection

33
Conclusion
  • Presented a wrapper approach for heterogeneous
    providers in an open market
  • Wrapper can be adapted to different providers by
    loading a source description file.
  • The source description file can be created with
    the help of a wrapper generator.

34
Characteristics
  • A wrapper can be installed for a provider
    without knowledge of its business logic.
  • The wrapper can also be used for commercial Web
    sites.

35
Characteristics
  • An existing source description can easily be
    modified
  • Changes can be tracked automatically
  • The wrapper implementation is platform-independent.