Title: Enhancing Web Data Extraction through Wrapper Usability and Semantic Wrapping
1 Enhancing wrapper usability through ontology
sharing and large-scale cooperation
Wolfgang Slany wsi_at_ist.tugraz.at
Pranjal Arya parya_at_ist.tugraz.at
Christian Schindler cschindl_at_ist.tugraz.at
Andreas Rath arath_at_ist.tugraz.at
Institute for Software Technology
Graz University of Technology
Inffeldgasse 16b/II, 8010 Graz, Austria
http://www.ist.tugraz.at
2 Agenda
- Introduction
- Wrapper Technology
- Lixto4All Project
- Lixto4All Wrapper Usability
- Questions & Discussion
3 Introduction
- 317 million hosts connected to the Web (Internet Software Consortium) [1]
- 80% of web pages are generated from databases [2]
- looking for information on the Web is a time-consuming and tedious task
4 Bridging the Gap
- Human readability
- layout orientation of web pages
- Computer processing
- corporate data processing applications
5 Economic Advantages
- faster data acquisition
- automatic and easy data integration
- cost effective
- aggregation
- personalization
- processing
- decision support
6 Wrapper Technology
- Wrappers are specialized program routines that automatically extract data from Internet web sites and convert the information into a structured format. [2]
- The typical tasks [2] of a wrapper
- fetching data from a remote resource
- searching for, recognizing and extracting specified data
- saving this data in a suitable structured format to enable further manipulation
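The three typical tasks above can be sketched as a tiny wrapper. This is a minimal illustration, not the Lixto implementation: the `fetch` stub, the sample page, and the choice of `<h2>` elements as the "specified data" are all assumptions made for the example.

```python
import json
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Collects the text of every <h2> element (the assumed 'specified data')."""

    def __init__(self):
        super().__init__()
        self._in_h2 = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_h2 = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_h2 = False

    def handle_data(self, data):
        if self._in_h2:
            self.titles[-1] += data


def fetch(url):
    """Task 1: fetch data from a remote resource.
    Stubbed with a fixed page so the sketch runs offline; a real wrapper
    would download the page, e.g. with urllib.request.urlopen(url)."""
    return "<html><body><h2>Film A</h2><p>20:15</p><h2>Film B</h2></body></html>"


def run_wrapper(url):
    page = fetch(url)                                   # 1. fetch
    parser = TitleExtractor()
    parser.feed(page)                                   # 2. search, recognize, extract
    records = [{"title": t.strip()} for t in parser.titles]
    return json.dumps(records)                          # 3. structured output


print(run_wrapper("http://example.org/tv"))
```

The JSON string returned by `run_wrapper` stands in for the "suitable structured format"; XML or a database row would serve equally well.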
7 Wrapper Creation
- manual wrapper creation
- basic text processing
- regular expression
- Java, Perl, C, Python, ...
- wrapper generation tools
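A minimal sketch of manual wrapper creation with a regular expression, assuming a hypothetical stock table whose rows have exactly two plain `<td>` cells. The page content and the names `ROW` and `extract_quotes` are made up for illustration.

```python
import re

# Hypothetical page fragment: one table row per share.
PAGE = """
<tr><td>ACME</td><td>12.50</td></tr>
<tr><td>Globex</td><td>7.05</td></tr>
"""

# Hand-written extraction rule: share name in the first cell, price in the second.
ROW = re.compile(r"<tr><td>(?P<name>[^<]+)</td><td>(?P<price>\d+\.\d+)</td></tr>")


def extract_quotes(html):
    """Return (name, price) tuples for every row matching the rule."""
    return [(m.group("name"), float(m.group("price"))) for m in ROW.finditer(html)]


quotes = extract_quotes(PAGE)
print(quotes)

# The rule is brittle: an added attribute already breaks it.
changed = '<tr><td class="n">ACME</td><td>12.50</td></tr>'
print(extract_quotes(changed))
```

The second call returns an empty list, which illustrates why hand-written rules need maintenance whenever the page markup changes.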
8 Wrapper Generation Tools (1/3) [4]
- Languages for Wrapper Development
- first attempt
- to ease extraction process
- TSIMMIS, Minerva, WebOQL
- NLP-based Tools
- use natural language processing approaches, e.g. filtering, part-of-speech tagging
- rely on linguistic constraints
- good on plain text, take no advantage of structure
- Whisk, Rapier (Robust Automated Production of Information Extraction)
9 Wrapper Generation Tools (2/3)
- Wrapper Induction Tools
- rely on structural information and formatting features
- need a set of training samples
- Stalker, Wien (Wrapper Induction ENvironment)
- Modeling-based tools
- structure of source document is crucial
- locate portions of data conforming to a given structure
- DEByE (Data Extraction By Example), NoDoSE (Northwestern Document Structure Extractor)
10 Wrapper Generation Tools (3/3)
- HTML-aware Tools
- rely on structural information of web pages
- use HTML parse trees (DOM tree) for extraction rule creation (semi-automatically with GUI support)
- Lixto, Xwrap
- Ontology-based Tools
- rely on the data itself and not on the structure of the source documents
- ontology is used to locate constants and to construct objects (extraction ontologies)
- BYU (Brigham Young University) tool
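To make the HTML-aware approach concrete, the following sketch treats an extraction rule as a path through the parse tree and collects the text of every element reached by that path. It is an assumption-laden illustration, not how Lixto or Xwrap represent rules: the `PathExtractor` class, the rule format, and the sample page are hypothetical.

```python
from html.parser import HTMLParser

# Elements without a closing tag; skipped so they do not distort the path.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class PathExtractor(HTMLParser):
    """Collects the text of elements whose path from the root equals `rule`."""

    def __init__(self, rule):
        super().__init__()
        self.rule = rule      # e.g. ["html", "body", "ul", "li"]
        self.stack = []       # tags currently open, root first
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.stack.append(tag)
        if self.stack == self.rule:
            self.matches.append("")   # a new matching element starts here

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        # Keep text located in a matched element or in one of its descendants.
        if self.matches and self.stack[:len(self.rule)] == self.rule:
            self.matches[-1] += data


page = ("<html><body><h1>Top 10</h1>"
        "<ul><li>Film A</li><li><b>Film</b> B</li></ul></body></html>")
extractor = PathExtractor(["html", "body", "ul", "li"])
extractor.feed(page)
print(extractor.matches)
```

In a GUI-supported tool the user would select an element visually and the tool would derive such a path; here the path is written by hand.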
11 Lixto4All
Wrapper Technology for everyone!
- started 02/2005
- funded by the Austrian FFG under the FIT-IT program line as part of the NextWrap project
- cooperation with Lixto GmbH
- Products
- Lixto Visual Wrapper
- Lixto Transformation Server
12 Use Cases
- TV program in advance
- notification of films I like to watch
- integration of TV programs of different TV station web sites into an application
- Top 10 films
- stock market monitoring
- share values
- statistical evaluations
13 L4A Wrapper Usability
is the challenge of making wrapper technology
available to all kinds of users.
- Easy Accessibility
- Easy Configuration
- Graphical User Guidance
- Resistance & Adaptability
- Wrapper Algorithm Performance
- Easy Maintenance
[Diagram of the six usability aspects listed above; Graphical User Guidance appears as GUG]
14 Easy Accessibility
deals with how users get and use the wrapper
application
- Installation?
- Security issues?
- accessible
- by everyone
- from everywhere
- at any time
browser-based system
15 L4A GUI
16 Embedded Application
- Advantages
- no download or installation
- only requires JavaScript
- wrapper generation on server side
- Disadvantages
- no access to browser internals
- limited to JavaScript possibilities
- wrapping of dynamic content generated by
scripting languages
17 Easy Configuration
- straightforward and easy to understand
- wrapper suggestion
- users with different skills need different configuration options
- user levels
- novice users
- advanced users
- professional users / enablers
collaboration
18 Graphical User Guidance
- guidance during wrapper creation process
- visual and interactive support
- select, click, drag and drop scenario
- work directly on the visual presentation of the web page
- Hand in hand with user levels
- GUI tailored to user skills
- wrapper creation options
- different steps
19 Resistance & Adaptability
Resistance
- minor web page changes
- structure and the content
- added or modified source
- wrapper verification
- less reconfiguration
Adaptability
- work on similar web pages
- applying same wrapper on different sources
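Wrapper verification can be pictured as a check that the extracted records still look the way the wrapper's author expected, so that a page change is detected before bad data is passed on. The sketch below is one possible form of such a check, not the project's mechanism: the `EXPECTED` field patterns and the `verify` function are hypothetical.

```python
import re

# Hypothetical expectations for a TV-program wrapper: field name -> pattern.
EXPECTED = {
    "title": re.compile(r".+"),
    "start": re.compile(r"^\d{2}:\d{2}$"),
}


def verify(records, expected=EXPECTED, min_records=1):
    """Return a list of problems; an empty list means the output looks sane."""
    problems = []
    if len(records) < min_records:
        problems.append(
            f"expected at least {min_records} record(s), got {len(records)}")
    for i, rec in enumerate(records):
        for field, pattern in expected.items():
            value = rec.get(field)
            if value is None:
                problems.append(f"record {i}: missing field '{field}'")
            elif not pattern.search(value):
                problems.append(
                    f"record {i}: field '{field}' has unexpected value {value!r}")
    return problems


print(verify([{"title": "Film A", "start": "20:15"}]))    # page unchanged
print(verify([{"title": "Film A", "start": "tonight"}]))  # page layout changed
print(verify([]))                                         # nothing extracted
```

A wrapper that passes this check after a minor page change needs no reconfiguration; a non-empty problem list signals that the wrapper should be reviewed.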
20 Wrapper Algorithm Performance
- performance in terms of
- creation time
- wrapping time
- space and memory
- heuristics for classifying wrapped content
- structure-based (e.g. lists, tables, ...)
- semantic-based (e.g. ontologies, knowledge bases)
- central server for
- wrapper scheduling
- wrapper task pooling
- transformation and distribution
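A structure-based heuristic for classifying wrapped content, as mentioned in the list above, could be as simple as counting tags in the extracted fragment. The thresholds and the labels "table", "list" and "text" in this sketch are assumptions chosen for illustration.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts how often each start tag occurs in a fragment."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def classify_structure(fragment):
    """Guess whether an extracted HTML fragment is a table, a list or plain text."""
    counter = TagCounter()
    counter.feed(fragment)
    c = counter.counts
    # Assumed thresholds: at least two rows (or items) before a fragment
    # is treated as repeated structure.
    if c["tr"] >= 2 and c["td"] >= c["tr"]:
        return "table"
    if c["li"] >= 2:
        return "list"
    return "text"


print(classify_structure("<table><tr><td>a</td></tr><tr><td>b</td></tr></table>"))
print(classify_structure("<ul><li>a</li><li>b</li></ul>"))
print(classify_structure("<p>hello</p>"))
```

A semantic-based heuristic would instead look at the extracted values themselves, for example by matching them against concepts from an ontology.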
21 Easy Maintenance
- large scale cooperation
- centralized wrapper configuration and processing
- cooperation of users with different skills
- wrapper creation
- wrapper sharing
- wrapper subscriptions
- creation, contribution and sharing of
- semantic concepts (e.g. regular expressions)
- ontologies (e.g. OWL, RDFS, ...)
WikiWikiWeb
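One way to picture shared semantic concepts is a registry of named regular expressions that any user's wrapper can apply to extracted text, independent of page structure. The concept names, the patterns, and the `annotate` function below are hypothetical examples, not part of Lixto4All.

```python
import re

# Hypothetical shared registry: concept name -> pattern contributed by users.
SHARED_CONCEPTS = {
    "time": re.compile(r"\b(?:[01]\d|2[0-3]):[0-5]\d\b"),
    "price_eur": re.compile(r"\b\d+(?:\.\d{2})?\s?EUR\b"),
}


def annotate(text, concepts=SHARED_CONCEPTS):
    """Return every occurrence of each known concept found in the text."""
    found = {}
    for name, pattern in concepts.items():
        hits = [m.group(0) for m in pattern.finditer(text)]
        if hits:
            found[name] = hits
    return found


result = annotate("Film A starts at 20:15, ticket 9.50 EUR; Film B at 22:00")
print(result)
```

Because the registry is plain data, it could be maintained collaboratively, which is the role the slide assigns to the WikiWikiWeb.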
22 Status Quo
- web-based application
- tree-based extraction mechanism
- basic standard user interface
- simple wrapper creation
- navigation
- correct presentation of the web page
- wrapper scheduling
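Wrapper scheduling on a central server can be sketched with a priority queue keyed by the next due time. The slides do not describe the actual scheduler, so the `WrapperScheduler` class, the abstract time units, and the wrapper names are assumptions.

```python
import heapq


class WrapperScheduler:
    """Runs named wrappers periodically; time is an abstract counter."""

    def __init__(self):
        self._queue = []   # heap of (next_run, name, interval)

    def add(self, name, interval, start=0):
        if interval <= 0:
            raise ValueError("interval must be positive")
        heapq.heappush(self._queue, (start, name, interval))

    def run_until(self, now):
        """Return (due_time, name) for every run due up to and including `now`."""
        executed = []
        while self._queue and self._queue[0][0] <= now:
            due, name, interval = heapq.heappop(self._queue)
            executed.append((due, name))   # a real server would run the wrapper here
            heapq.heappush(self._queue, (due + interval, name, interval))
        return executed


scheduler = WrapperScheduler()
scheduler.add("tv", interval=60)
scheduler.add("stocks", interval=25)
runs = scheduler.run_until(60)
print(runs)
```

Pooling wrapper tasks from many users into one such queue lets the server avoid fetching the same page more often than necessary.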
23 Future Work
- Enhancing extraction mechanism
- WikiWikiWeb integration
- configuration
- collaboration
- maintenance
- wrapping script-generated parts of web pages
- accessing remote services (WordNet, Wikipedia)
- wrapper configuration ontology
24 Questions & Discussion
25 References
- [1] Internet Software Consortium, www.isc.org, January 2005
- [2] Arnaud Sahuguet and Fabien Azavant. Web ecology: Recycling HTML pages as XML documents using W4F. WebDB (Informal Proceedings), pages 31-36, 1999
- [3] Stefan Kuhlins and Ross Tredwell. Toolkits for generating wrappers. NetObjectDays 2002, Erfurt, Springer
- [4] Alberto H. F. Laender, Ribeiro-Neto, et al. A brief survey of web data extraction tools. Volume 31, pages 84-93, NY, USA, ACM Press, 2002