Enhancing Web Data Extraction through Wrapper Usability and Semantic Wrapping

1
Enhancing wrapper usability through ontology
sharing and large scale cooperation
Wolfgang Slany wsi_at_ist.tugraz.at
Pranjal Arya parya_at_ist.tugraz.at
Christian Schindler cschindl_at_ist.tugraz.at
Andreas Rath arath_at_ist.tugraz.at
Institute for Software Technology
Graz University of Technology
Inffeldgasse 16b/II, 8010 Graz, Austria
http://www.ist.tugraz.at
2
Agenda
  • Introduction
  • Wrapper Technology
  • Lixto4All Project
  • Lixto4All Wrapper Usability
  • Questions & Discussion

3
Introduction
  • 317 million hosts connected to the Web (Internet
    Software Consortium) [1]
  • 80% of web pages are generated from databases [2]
  • looking for information on the Web is a
    time-consuming and tedious task

4
Bridging the Gap
  • Human readability
    • layout orientation of web pages
  • Computer processing
    • corporate data processing applications

5
Economic Advantages
  • faster data acquisition
  • automatic and easy data integration
  • cost effective
  • aggregation
  • personalization
  • processing
  • decision support

6
Wrapper Technology
  • Wrappers are specialized program routines that
    automatically extract data from Internet web
    sites and convert the information into a
    structured format. [2]
  • The typical tasks of a wrapper [2]:
    • fetching data from a remote resource
    • searching for, recognizing and extracting
      specified data
    • saving this data in a suitable structured format
      to enable further manipulation
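The three wrapper tasks listed above can be sketched as a small pipeline. This is only an illustration, not the Lixto implementation: the names Wrapper, fetch_url and extract_lines are hypothetical, the extractor is a toy that treats "key: value" lines as records, and JSON is assumed as the structured output format.

```python
import json
from dataclasses import dataclass
from typing import Callable
from urllib.request import urlopen


def fetch_url(url: str) -> str:
    """Task 1: fetch the raw document from a remote resource."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_lines(raw: str) -> list[dict]:
    """Task 2 (toy version): treat every 'key: value' line as one record."""
    records = []
    for line in raw.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() and value.strip():
            records.append({"field": key.strip(), "value": value.strip()})
    return records


@dataclass
class Wrapper:
    """Hypothetical wrapper: fetch and extract are pluggable stages."""
    extract: Callable[[str], list[dict]]
    fetch: Callable[[str], str] = fetch_url

    def run(self, source: str) -> str:
        raw = self.fetch(source)
        records = self.extract(raw)
        # Task 3: save in a structured format (JSON is an assumption here).
        return json.dumps(records, indent=2)
```

Keeping the fetch stage pluggable lets the same wrapper be run against a stored copy of a page, which is convenient when testing extraction rules offline.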

7
Wrapper Creation
  • manual wrapper creation
    • basic text processing
    • regular expressions
    • Java, Perl, C, Python, ...
  • wrapper generation tools
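A hand-written wrapper based on regular expressions might look like the following sketch. The page fragment, the ROW pattern and the function name extract_quotes are all made up for illustration; a real page would be fetched from the target site.

```python
import re

# Hypothetical page fragment with one table row per share.
PAGE = """
<tr><td>ACME</td><td>12.50</td></tr>
<tr><td>Globex</td><td>103.20</td></tr>
"""

# The pattern is tied to the exact markup of the rows above.
ROW = re.compile(r"<tr><td>(?P<name>[^<]+)</td><td>(?P<price>\d+\.\d+)</td></tr>")


def extract_quotes(html: str) -> list[tuple[str, float]]:
    """Return (name, price) pairs for every row that matches ROW."""
    return [(m.group("name"), float(m.group("price"))) for m in ROW.finditer(html)]
```

Such a pattern stops matching as soon as the markup changes slightly, for example when an attribute is added to a td tag. That brittleness is one motivation for the wrapper generation tools.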

8
Wrapper Generation Tools (1/3) [4]
  • Languages for Wrapper Development
    • first attempt
    • to ease extraction process
    • TSIMMIS, Minerva, WebOQL
  • NLP-based Tools
    • use natural language processing approaches, e.g.
      filtering, part-of-speech
    • rely on linguistic constraints
    • good on plain text, take no advantage of
      structure
    • WHISK, RAPIER (Robust Automated Production of
      Information Extraction Rules)

9
Wrapper Generation Tools (2/3)
  • Wrapper Induction Tools
    • rely on structural information and formatting
      features
    • need a set of training samples
    • Stalker, WIEN (Wrapper Induction ENvironment)
  • Modeling-based tools
    • structure of source document is crucial
    • locate portions of data conforming to a given
      structure
    • DEByE (Data Extraction By Example), NoDoSE
      (Northwestern Document Structure Extractor)

10
Wrapper Generation Tools (3/3)
  • HTML-aware Tools
    • rely on structural information of web pages
    • use HTML parse trees (DOM tree) for extraction
      rule creation (semi-automatically with GUI
      support)
    • Lixto, XWRAP
  • Ontology-based Tools
    • rely on the data itself and not on the structure
      of the source documents
    • ontology is used to locate constants and to
      construct objects (extraction ontologies)
    • BYU (Brigham Young University) tool
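To illustrate the HTML-aware approach mentioned above, the sketch below extracts text by its position in the parse tree rather than by its surrounding characters. It is a simplified stand-in, not how Lixto or XWRAP work internally: the class PathExtractor and its tag-path rule syntax ("html/body/ul/li") are assumptions for this example, built on Python's standard html.parser module.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag and so must not be pushed on the stack.
VOID = {"br", "hr", "img", "input", "link", "meta"}


class PathExtractor(HTMLParser):
    """Collect the text of every element whose tag path equals `target`."""

    def __init__(self, target: str):
        super().__init__()
        self.target = target.split("/")
        self.stack: list[str] = []    # tags currently open, outermost first
        self.matches: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerates unclosed inner tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack == self.target:
            self.matches.append(text)
```

Because the rule refers only to the tag path, attribute changes or extra whitespace in the page do not affect it, although a structural change such as wrapping the list in a new div would.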

11
Lixto4All
Wrapper Technology for everyone!
  • started 02/2005
  • funded by the Austrian FFG under the FIT-IT
    program line as part of the NextWrap project
  • cooperation with Lixto GmbH
  • Products
    • Lixto Visual Wrapper
    • Lixto Transformation Server

12
Use Cases
  • TV program in advance
  • notification of films I like to watch
  • integration of TV program of different TV station
    web sites into an application
  • Top 10 films
  • stock market monitoring
  • share values
  • statistical evaluations

13
L4A Wrapper Usability
is the challenge of making the wrapper
technology available to all kinds of users.
  • Easy Accessibility
  • Easy Configuration
  • Graphical User Guidance
  • Resistance & Adaptability
  • Wrapper Algorithm Performance
  • Easy Maintenance

14
Easy Accessibility
deals with how users get and use the wrapper
application
  • Installation?
  • Security Issues?
  • accessible
  • by everyone
  • from everywhere
  • at any time

browser-based system
15
L4A GUI
16
Embedded Application
  • Advantages
    • no download or installation
    • only requires JavaScript
    • wrapper generation on server side
  • Disadvantages
    • no access to browser internals
    • limited to JavaScript possibilities
    • wrapping of dynamic content generated by
      scripting languages

17
Easy Configuration
  • straightforward and easy to understand
  • wrapper suggestion
  • users with different skill levels need different
    configuration options
  • user levels
    • novice users
    • advanced users
    • professional users / enablers

collaboration
18
Graphical User Guidance
  • guidance during wrapper creation process
  • visual and interactive support
  • select, click, drag and drop scenario
  • work directly on the visual presentation of the
    web page
  • Hand in hand with user levels
  • GUI features matched to user skills
  • wrapper creation options
  • different steps

19
Resistance & Adaptability
Resistance
  • minor web page changes
  • structure and the content
  • added or modified source
  • wrapper verification
  • less reconfiguration

Adaptability
  • work on similar web pages
  • applying the same wrapper to different sources
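Wrapper verification, as named on this slide, could be approximated by checking extracted records against expectations and flagging the wrapper for reconfiguration only when too many records fail. The sketch below is one possible reading, not the project's method: SCHEMA, is_valid, verify and the 0.8 threshold are hypothetical, with a TV-program wrapper assumed as the example.

```python
import re
from typing import Callable

# Hypothetical expectations for records produced by a TV-program wrapper.
SCHEMA: dict[str, Callable[[str], bool]] = {
    "time": lambda v: re.fullmatch(r"\d{2}:\d{2}", v) is not None,
    "title": lambda v: bool(v.strip()),
}


def is_valid(record: dict[str, str]) -> bool:
    """A record is valid if every expected field is present and well-formed."""
    return all(name in record and check(record[name]) for name, check in SCHEMA.items())


def verify(records: list[dict[str, str]], min_valid_ratio: float = 0.8) -> bool:
    """Return True if the wrapper output still looks healthy.

    An empty result counts as a failure, since a page change often makes
    a wrapper silently extract nothing.
    """
    if not records:
        return False
    valid = sum(is_valid(r) for r in records)
    return valid / len(records) >= min_valid_ratio
```

Tolerating a small share of invalid records keeps the wrapper resistant to minor page changes, while a drop below the threshold signals that reconfiguration is needed.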

20
Wrapper Algorithm Performance
  • performance in terms of
    • creation time
    • wrapping time
    • space and memory
  • heuristics for classifying wrapped content
    • structure based (e.g. lists, tables, ...)
    • semantic based (e.g. ontologies, knowledge bases)
  • central server for
    • wrapper scheduling
    • wrapper task pooling
    • transformation and distribution
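Wrapper scheduling on a central server could be organized around a priority queue ordered by next run time. The following is a minimal sketch under that assumption; the Job and Scheduler classes and their methods are hypothetical and say nothing about how the Lixto Transformation Server schedules work.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Job:
    next_run: float                          # only this field is compared
    name: str = field(compare=False)
    interval: float = field(compare=False)


class Scheduler:
    """Hypothetical central scheduler for periodically executed wrappers."""

    def __init__(self) -> None:
        self._queue: list[Job] = []

    def add(self, name: str, interval: float, start: float = 0.0) -> None:
        if interval <= 0:
            raise ValueError("interval must be positive")
        heapq.heappush(self._queue, Job(start, name, interval))

    def due(self, now: float) -> list[str]:
        """Pop every job due at `now`, reschedule it, and return the names."""
        ran = []
        while self._queue and self._queue[0].next_run <= now:
            job = heapq.heappop(self._queue)
            ran.append(job.name)
            heapq.heappush(self._queue, Job(now + job.interval, job.name, job.interval))
        return ran
```

A server loop would call due() with the current time and hand the returned wrapper names to a pool of workers, which is one way to read "wrapper task pooling".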

21
Easy Maintenance
  • large-scale cooperation
  • centralized wrapper configuration and processing
  • cooperation of users with different skill levels
    • wrapper creation
    • wrapper sharing
    • wrapper subscriptions
  • creation, contribution and sharing of
    • semantic concepts (e.g. regular expressions)
    • ontologies (e.g. OWL, RDFS, ...)

WikiWikiWeb
22
Status Quo
  • web-based application
  • tree based extraction mechanism
  • basic standard user interface
  • simple wrapper creation
  • navigation
  • correct presentation of the web page
  • wrapper scheduling

23
Future Work
  • Enhancing extraction mechanism
  • WikiWikiWeb integration
  • configuration
  • collaboration
  • maintenance
  • wrapping script-generated parts of web pages
  • accessing remote services (WordNet, Wikipedia)
  • wrapper configuration ontology

24
Questions & Discussion
25
References
  • [1] Internet Software Consortium, www.isc.org,
    January 2005
  • [2] Arnaud Sahuguet and Fabien Azavant. Web
    ecology: Recycling HTML pages as XML documents
    using W4F. WebDB (Informal Proceedings), pages
    31-36, 1999
  • [3] Stefan Kuhlins and Ross Tredwell. Toolkits
    for generating wrappers. NetObjectDays 2002,
    Erfurt, Springer
  • [4] Alberto H. F. Laender, Ribeiro-Neto, et al.
    A brief survey of web data extraction tools.
    SIGMOD Record, volume 31, pages 84-93, ACM Press,
    New York, NY, USA, 2002
