Issues in Monitoring Web Data
  • Serge Abiteboul
  • INRIA and Xyleme

  • Introduction
  • What is there to monitor?
  • Why monitor?
  • Some applications of web monitoring
  • Web archiving
  • An experience: the archiving of the French web
  • Page importance and change frequency
  • Creation of a warehouse using web resources
  • An experience: the Xyleme project
  • Monitoring in Xyleme
  • Queries and monitoring
  • Conclusion

1. Introduction
The Web Today
  • Billions of pages, millions of servers
  • Query: keywords to retrieve URLs
  • Imprecise: query results are useless for further
    processing
  • Applications based on ad-hoc wrapping
  • Expensive, incomplete, short-lived, not adapted
    to the Web's constant changes
  • Poor quality
  • Cannot be trusted: spamming, rumors
  • Often stale
  • Our vision of it is often out-of-date
  • Importance of monitoring

The HTML Web Structure
(Figure: the structure of the HTML web and the percentage covered
by crawlers. Source: IBM, AltaVista, Compaq.)
So much for the world's knowledge
  • Most of the web is not reached by crawlers
    (hidden web)
  • Some of the public HTML pages are never read
  • Most of what is on the web is junk anyway
  • Our knowledge of it may be stale
  • Do not junk the technology, improve it!

What is there to monitor?
  • Documents: HTML but also doc, pdf, ps
  • Many data exchange formats, such as ASN.1, BibTeX
  • New official data exchange format: XML
  • Hidden web: database queries behind forms
  • Multimedia data: ignored here
  • Public vs. private (Intranet, or Internet with password)
  • Static vs. dynamic

What is changing?
  • XML is coming
  • Universal data exchange format
  • Marriage of document and database worlds
  • Standard query language: XQuery
  • Quickly growing on Intranets and very slowly on the
    public web (less than 1%)
  • Web services are coming
  • Format for exporting services
  • Format for encapsulating queries
  • More semantics to be expected
  • RDF for data
  • WSDL/UDDI for services

What is not changing fast, or even getting worse
  • Massive quantity of data, most of it junk
  • Lots of stale data
  • Very primitive HTML query mechanisms (keywords)
  • No real change control mechanism coming soon
  • Compare database queries (fresh data) with web
    search engines (possibly stale)
  • Compare database triggers (based on push) to web
    notification services (most of the time based on pull)
The need to monitor the web
  • The web changes all the time
  • Users are often as interested in changes as in the
    data itself: new products, new press articles, etc.
  • Discover new resources
  • Keep our vision of the web up-to-date
  • Be aware of changes that may be of interest or have
    an impact on our business

Analogy: databases
  • Databases
  • Query: instantaneous vision of the data
  • Trigger: alert/notification of some changes of interest
  • Web
  • Query: needs monitoring to give correct answers
  • Monitoring to support alerts/notifications of changes
    of interest

Web vs. database monitoring
  • Quantity of data: larger on the web
  • Knowledge of data
  • Structure and semantics known in databases, not on the web
  • Reliability and availability
  • High in databases, null on the web
  • Data granularity
  • Tuple vs. page in HTML or element in XML
  • Change control
  • Databases: support from data sources/triggers
  • Web: no support, pull only in general

2. Some applications of web monitoring
Comparative shopping
  • Unique entry point to many catalogs
  • Data integration problem
  • Main issue: wrapping of web catalogs
  • Semi-automatic, so limited to a few sites
  • Simpler, and closer to automatic, with XML
  • Alternatives
  • Mediation when data change very fast
  • e.g., prices and availability of plane tickets
  • Warehousing otherwise → need to monitor changes

Web surveillance
  • Applications
  • Anti-criminal and anti-terrorist intelligence,
    e.g., detecting suspicious acquisition of
    chemical products
  • Business intelligence, e.g., discovering
    potential customers, partners, competitors
  • Find the data (crawl the web)
  • Monitor the changes
  • new pages, deleted pages, changes in a page
  • Classify information and extract data of interest
  • Data mining, text understanding, knowledge
    representation and extraction, linguistics: very much AI

Copy tracking
  • Example: a press agency wants to check that people
    are not publishing copies of its wires without paying

(Figure: copy-tracking pipeline: a query to a search engine, or a
specific crawl, produces a flow of candidate documents; each
document is sliced for comparison.)
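How the candidate documents are compared is not spelled out on the slide. One common technique for comparing document slices is word shingling; a minimal Python sketch follows, purely as an illustrative assumption (not necessarily the method used here):

  # Copy detection by word shingling (illustrative assumption).
  def shingles(text, k=5):
      # Every run of k consecutive words is a shingle.
      words = text.lower().split()
      return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

  def resemblance(a, b):
      # Jaccard similarity between the two shingle sets.
      sa, sb = shingles(a), shingles(b)
      return len(sa & sb) / len(sa | sb) if (sa or sb) else 0.0

  wire = "the minister announced the new budget plan on tuesday in paris"
  page = "breaking news the minister announced the new budget plan on tuesday"
  print(resemblance(wire, page))   # a high score flags a likely copy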
Web archiving
  • We will discuss an experience in archiving the
    French web

Creation of a data warehouse with resources found
on the web
  • We will discuss some work in the Xyleme project
    on the construction of XML warehouses

3. Web archiving
  • An experience towards the archiving of the French
    web, with the Bibliothèque Nationale de France (BnF)

Dépôt légal (legal deposit)
  • Books have been archived since 1537, by decision of
    King François I
  • The Web is an important and valuable source of
    information that should also be archived
  • What is different?
  • Number of content providers: 148,000 sites vs.
    5,000 publishers
  • Quantity of information: millions of pages
  • Quality of information: lots of junk
  • Relationship with publishers: freedom of publication
    vs. the traditional push model
  • Updates and changes occur continuously
  • The perimeter is unclear: what is the French web?

Goal and Scope
  • Provide future generations with a representative
    archive of the cultural production
  • Provide material for cultural, political,
    sociological studies
  • The mission is to archive a wide range of
    material because nobody knows what will be of
    interest for future research
  • In traditional publication, publishers filter
    content. There is no filter on the web

Similar Projects
  • The Internet Archive
  • The Wayback Machine
  • Largest collection of versions of web pages
  • Human-selection-based approach
  • Select a few hundred sites and choose an archiving
    periodicity
  • Australia and Canada
  • The Nordic experience
  • Use robot crawlers to archive a significant part
    of the surface web
  • Sweden, Finland, Norway
  • Problems encountered
  • Lack of updates of archived pages between two crawls
  • The hidden Web

Orientation of our experiment
  • Goals
  • Cover a large portion of the French web
  • Automatic content gathering is necessary
  • Adapt robots to provide a continuous archiving
  • Have frequent versions of the sites, at least for
    the most important ones
  • Issues
  • The notion of important sites
  • Building a coherent Web archive
  • Discover and manage important sources of the deep Web

First issue: the perimeter
  • The perimeter of the French Web: contents edited
    in France
  • Many criteria may be used
  • The French language: but many French sites use
    English (e.g., INRIA), and many French-speaking
    sites are from other French-speaking countries or
    regions (e.g., Quebec)
  • Domain name or resource locators: .fr sites, but
    many are also in .com or .org
  • Address of the site: physical location of the web
    servers, or address of the owner
  • Other criteria than the perimeter
  • Little interest in commercial sites
  • Possible interest in foreign sites that discuss
    French issues
  • Purely automatic selection does not work → involve librarians

Second issue: site vs. page archiving
  • The Web
  • Physical granularity: HTML pages
  • The problem: inconsistent data and links
  • Read page P; one week later, the pages pointed to by
    P may not exist anymore
  • Logical granularity?
  • Snapshot view of a web site
  • What is a site?
  • Is INRIA one site or many?
  • A single host may be the provider of many sites
  • There are technical issues (rapid firing, etc.)

Importance of data
What is page importance?
  • Le Louvre's homepage is more important than an
    unknown person's homepage
  • Important pages are pointed to by
  • Other important pages
  • Many unimportant pages
  • This leads to Google's definition of PageRank
  • Based on the link structure of the web
  • Used with remarkable success by Google for
    ranking results
  • Useful but not sufficient for web archiving

Page Importance
  • Importance
  • Link matrix L
  • In short, page importance is the fixpoint X of
    the equation X = L X
  • Storing the link matrix and computing page
    importance uses lots of resources
  • We developed a new efficient technique to compute
    the fixpoint
  • Without having to store the link matrix
  • The technique adapts automatically to changes
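The slides do not spell the technique out, so here is a minimal Python sketch of one way to estimate the fixpoint online without materializing the link matrix; the cash/history bookkeeping below is an assumption, simplified for illustration. Each page holds some "cash"; visiting a page hands its cash to its out-links, and the cash a page has accumulated over time estimates its importance.

  def online_importance(out_links, steps=10000):
      # out_links: dict mapping each page to the pages it links to.
      pages = list(out_links)
      cash = {p: 1.0 / len(pages) for p in pages}    # cash still to distribute
      history = {p: 0.0 for p in pages}              # cash received so far

      for _ in range(steps):
          p = max(pages, key=cash.get)               # visit the richest page
          c, cash[p] = cash[p], 0.0
          history[p] += c
          succ = out_links[p] or pages               # dangling page: spread to all
          for q in succ:                             # pass the cash along links
              cash[q] += c / len(succ)

      total = sum(history.values())
      return {p: history[p] / total for p in pages}  # normalized estimates

  # Tiny example: B and C point to each other, and both point to A.
  print(online_importance({"A": [], "B": ["A", "C"], "C": ["A", "B"]}))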

Site vs. pages
  • Limitation of page importance
  • Google's page importance works well when links have
    strong semantics
  • More and more web pages are automatically
    generated, and most links have little semantics
  • More limitations
  • Refresh at the page level presents drawbacks
  • So we also use link topology between sites and
    not only between pages

  • Crawl
  • We used between 2 and 8 PCs running Xyleme crawlers
    for 2 months
  • Discovery and refresh based on page importance
  • Discovery
  • We looked at more than 1.5 billion (most
    interesting) web pages
  • We discovered more than 15 million .fr pages,
    about 1.5% of the web
  • We discovered 150,000 .fr sites
  • Refresh
  • Important pages were refreshed more often
  • It also takes into account the change rate of pages
  • Analysis of the relevance of site importance for
    archiving
  • Comparison with rankings by librarians
  • Strong correlation with their rankings

Issues and ongoing work: other criteria for importance
  • Take into account indications by archivists
  • They know best; a man-machine-interface issue
  • Use classification and clustering techniques to
    refine the notion of site
  • Frequent use of infrequent words
  • Find pages dedicated to specific topics
  • Text Weight
  • Find pages with text content vs. raw data pages
  • Others

4. Creation of a Warehouse from Web data
  • The Xyleme Project

Xyleme in short
  • The Xyleme project
  • Initiated at INRIA
  • Joint work with researchers from Orsay, Mannheim
    and CNAM-Paris universities
  • The Xyleme company
  • Started in 2000
  • About 30 people
  • Mission: deliver a new generation of content
    technologies to unlock the potential of XML
  • Here we focus on the Xyleme project

Goal of the Xyleme project
  • Focus is on XML data (but HTML is also handled)
  • Semantics
  • Understand tags, partition the Web into semantic
    domains, provide a simple view of each domain
  • Dynamicity
  • Find and monitor relevant data on the web
  • Control relevant changes in Web data
  • XML storage, indexing, and queries
  • Manage efficiently millions of XML documents and
    process millions of simultaneous queries

Corporate information environment with Xyleme
(Figure: the Xyleme server crawls and interprets web data into an
XML repository with a query engine, systematically updating the
corporate information system.)
XML in short
  • Data exchange format
  • eXtensible Markup Language (child of SGML)
  • Promoted by W3C and major industry players
  • An XML document is an ordered labeled tree
  • Other essential gadgets: Unicode, namespaces,
    attributes, pointers, typing (XML Schema)

XML magic in short
  • Presentation is given elsewhere (style sheets)
  • Semantics and structure are provided by labels
  • So it is easy to extract information
  • Universal format understood by more and more
    software (e.g., exported by most databases,
    read by more and more editors)
  • More and more tools available

It is easy to extract information
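As a small illustration (the catalog document below is hypothetical), labels make extraction a simple path lookup; a Python sketch:

  # Labels carry the semantics, so extraction is a path lookup.
  import xml.etree.ElementTree as ET

  doc = ET.fromstring("""
  <catalogue>
    <product><name>Camera X</name><price>299</price></product>
    <product><name>Flash Y</name><price>49</price></product>
  </catalogue>""")

  for product in doc.findall("product"):
      print(product.findtext("name"), product.findtext("price"))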
4.1 Xyleme: Functionality and Architecture
The goal of the Xyleme project: a dynamic XML warehouse
  • Many research issues
  • Query Processor
  • Semantic Classification
  • Data Monitoring
  • Native Storage
  • XML document versioning
  • XML acquisition, automatic or user-driven
  • Graphical User Interface through the Web

Functional Architecture
(Figure: functional architecture: the query processor on top of
the repository and index manager, above the Internet.)
Prototype: main choices
  • Network of Linux PCs
  • C++ on the server side
  • CORBA for communications between PCs
  • HTTP and SOAP for external communications
  • Exception for query processing

  • Parallelism based on
  • Partitioning
  • XML documents
  • URL table
  • Indexes (semantic partitioning)
  • Memory replication
  • Autonomous machines (PCs)
  • Caches are used for data flow

4.2 Xyleme: Data Acquisition
Data Acquisition
  • The Xyleme crawler visits the HTML/XML web
  • Management of metadata on pages
  • Sophisticated strategy to optimize network use, based on
  • importance ranking of pages
  • change frequency and age of pages
  • publications (owners), subscriptions (users)
  • Each crawler visits about 4 million pages per day
  • Each indexer may index about 1 million pages
    per day
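The slides list the signals but not how they are combined. One plausible refresh-priority rule, sketched in Python (an assumption for illustration, not Xyleme's actual strategy): model page changes as a Poisson process and weight the probability of a change by page importance.

  import math

  def refresh_priority(importance, change_rate_per_day, age_days):
      # Probability that the page changed since the last visit,
      # under a Poisson change model with the given daily rate.
      p_changed = 1.0 - math.exp(-change_rate_per_day * age_days)
      return importance * p_changed

  # An important, fast-changing page outranks a minor one of equal age.
  print(refresh_priority(0.9, 0.5, 3.0))   # ~0.70
  print(refresh_priority(0.1, 0.5, 3.0))   # ~0.08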

4.3 Xyleme: Change Control
Change Management
  • The Web changes all the time
  • Data acquisition
  • automatic and via publication
  • Monitoring
  • subscriptions
  • continuous queries
  • versions

  • Users can subscribe to certain events, e.g.,
  • changes in all pages of a certain DTD or of a
    certain semantic domain
  • insertion of a new product in a particular
    catalog or in all catalogs with a particular DTD
  • They may request to be notified
  • at the time the event is detected by Xyleme
  • regularly, e.g., once a week

Continuous Queries
  • Queries asked regularly or when some events are
    detected
  • send me each Monday the list of movies in ...
  • send me each Monday the list of new movies in ...
  • each time you detect that a new member is added
    to the Stanford DB-group, send me their lists of
    publications from their homepages
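A minimal Python sketch of what separates the two Monday queries above: the continuous query keeps state between runs so it can report only what is new (run_query is a hypothetical stand-in for any query evaluation):

  def monday_report(run_query):
      # Re-evaluated on a schedule; remembers the previous answer so
      # it can report new items as well as the full list.
      previous = set()
      def report():
          nonlocal previous
          current = set(run_query())
          new_items = current - previous   # "new movies" since last run
          previous = current
          return sorted(current), sorted(new_items)
      return report

  report = monday_report(lambda: ["Amelie", "Ronin"])
  print(report())   # first run: everything is new
  print(report())   # second run: no new movies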

Versions and Deltas
  • Store snapshots of documents
  • For some documents, store changes (deltas)
  • storage = last version + sequence of deltas
  • complete deltas: reconstruct old versions
  • partial deltas: allow sending changes to the user
    and allow refresh
  • Deltas are XML documents
  • so changes can be queried as standard data
  • Temporal queries
  • List the products that were introduced in this
    catalog since January 1st, 2002
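Since deltas are XML documents, a temporal query is simply a query over deltas. A hypothetical example in Python (the delta's tag names, attributes, and element IDs are assumptions for illustration):

  import xml.etree.ElementTree as ET

  # Hypothetical delta between two versions of a catalog.
  delta = ET.fromstring("""
  <delta doc="catalogue.xml" from="v12" to="v13" date="2002-01-15">
    <insert parent="id42" position="3">
      <product><name>Camera Z</name><price>399</price></product>
    </insert>
    <delete node="id17"/>
  </delta>""")

  # Temporal-style query: products introduced since January 1st, 2002.
  if delta.get("date") >= "2002-01-01":
      for ins in delta.findall("insert"):
          print(ins.findtext("product/name"))   # -> Camera Z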

The Information Factory
(Figure: the information factory: a subscription processor runs
change detection over documents and deltas, evaluates continuous
queries and version queries, and sends notifications.)
  • Very efficient XML diff algorithm
  • computes differences between consecutive versions
  • Representation of deltas based on an original
    naming scheme for XML elements
  • an element is assigned a unique identifier for
    its entire life
  • a compact way of representing these IDs
  • Efficient versioning mechanism

  • Sophisticated monitoring algorithm
  • Detection of simple patterns (conjunctions) at
    the document level
  • Detection of changes between consecutive versions
    of the same document
  • Scales to dozens of crawlers loading millions of
    documents per day for a single monitor

Issues: languages for monitoring
  • In the spirit of temporal languages for
    relational databases
  • But
  • Data model is richer (trees vs. tables)
  • Context is richer: versions, continuous queries,
    monitoring of data streams

4.4 Xyleme: Semantic Data Integration
Data Integration
  • One application domain -- several schemas
  • heterogeneous vocabulary and structure
  • Xyleme semantic integration →
  • gives the illusion that the system maintains a
    homogeneous database for this domain
  • abstracts a set of DTDs into a hierarchy of
    pertinent terms for a particular domain
    (business, culture, tourism, biology, etc.)

Technology in short
  • Cluster DTDs into application domains
  • For an application domain, semi-automatically
  • Organize tags into a hierarchy of concepts using
    thesauri such as WordNet and other linguistic tools
  • This provides the abstract DTD for the particular domain
  • Generate mappings between concrete DTDs and the
    abstract one

4.5 Xyleme: Query Processing
Xyleme Query Language
  • A mix of OQL and XQL; will use the W3C standard
    when there is one

  Select product/name, product/price
  From   doc in catalogue, product in doc/product
  Where  product//components contains "flash"
    and  product/description contains "camera"
Principle of Querying
A query on the abstract DTD is translated into a union of
concrete queries (possibly with joins), using mappings between
concrete and abstract DTDs:
  catalogue/product/price → d1//camera/price, d2/product/cost
  catalogue/product/description → d1//camera/description,
    d2/product/info (with ref → d2/description)
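A minimal Python sketch of this translation step, reusing the mappings above (a simplification: the join through ref is ignored here):

  # Abstract-to-concrete path translation via the mappings.
  MAPPINGS = {
      "catalogue/product/price":
          ["d1//camera/price", "d2/product/cost"],
      "catalogue/product/description":
          ["d1//camera/description", "d2/product/info"],
  }

  def translate(abstract_path):
      # One abstract path expands to a union of concrete paths.
      return MAPPINGS.get(abstract_path, [])

  print(translate("catalogue/product/price"))
  # -> ['d1//camera/price', 'd2/product/cost']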
Query Processing
  1. Partial translation, from abstract to concrete,
    to identify machines with relevant data
  2. Algebraic rewriting: linear search strategy based
    on simple heuristics; in priority, use in-memory
    indexes and minimize communication
  3. Decomposition into local physical subplans
  4. Execution of plans
  5. If needed, relaxation

Query processing
  • Essential use of a smart index combining
    full-text and structure

4.6 Xyleme: Repository
Storage System
  • Xyleme store
  • efficient storage of trees in variable-length
    records within fixed-length pages
  • Balancing of tree branches in case of overflow
  • minimize the number of I/Os for direct access
  • good compromise between compaction and access time

Tree Balancing in Xyleme Store
(Figure: a document tree stored across records; when a node gains
more children and its record overflows, branches are rebalanced
into new records.)
5. Conclusion
Web monitoring
  • Very challenging problem
  • Complexity due to the volume of data and the
    number of users
  • Complexity due to heterogeneity
  • Complexity due to lack of cooperation from data sources
  • Many issues to investigate

New directions
  • Active web sites
  • Friendly sites willing to cooperate
  • Web services provide the infrastructure
  • Support for triggers
  • Mobile data
  • Web sites on mobile devices
  • Issues of availability (device unplugged)
  • Issues in synchronization
  • Geography dependent queries