Issues in Monitoring Web Data - PowerPoint PPT Presentation

Transcript and Presenter's Notes

1
Issues in Monitoring Web Data
  • Serge Abiteboul
  • INRIA and Xyleme
  • Serge.Abiteboul@inria.fr

2
Organization
  • Introduction
  • What is there to monitor?
  • Why monitor?
  • Some applications of web monitoring
  • Web archiving
  • An experience: the archiving of the French web
  • Page importance and change frequency
  • Creation of a warehouse using web resources
  • An experience: the Xyleme project
  • Monitoring in Xyleme
  • Queries and monitoring
  • Conclusion

3
1. Introduction
4
The Web Today
  • Billions of pages, millions of servers
  • Query: keywords to retrieve URLs
  • Imprecise: query results are useless for further
    processing
  • Applications based on ad-hoc wrapping
  • Expensive, incomplete, short-lived, not adapted
    to the Web's constant changes
  • Poor quality
  • Cannot be trusted: spamming, rumors
  • Often stale
  • Our vision of it is often out-of-date
  • Importance of monitoring

5
The HTML Web Structure
Source: IBM, AltaVista, Compaq
6
HTML: percentage covered by crawlers
Source: searchenginewatch.com
7
So much for the world's knowledge
  • Most of the web is not reached by crawlers
    (hidden web)
  • Some of the public HTML pages are never read
  • Most of what is on the web is junk anyway
  • Our knowledge of it may be stale
  • Do not junk the technology, improve it!

8
What is there to monitor?
  • Documents: HTML, but also doc, pdf, ps
  • Many data exchange formats, such as ASN.1, BibTeX
  • The new official data exchange format: XML
  • Hidden web: database queries behind forms or
    scripts
  • Multimedia data (ignored here)
  • Public vs. private (Intranet, or Internet with password)
  • Static vs. dynamic

9
What is changing?
  • XML is coming
  • Universal data exchange format
  • Marriage of document and database worlds
  • Standard query language: XQuery
  • Quickly growing on Intranets, very slowly on the
    public web (less than 1%)
  • Web services are coming
  • Format for exporting services
  • Format for encapsulating queries
  • More semantics to be expected
  • RDF for data
  • WSDL/UDDI for services

10
What is not changing fast, or even getting worse
  • Massive quantity of data, most of it junk
  • Lots of stale data
  • Very primitive HTML query mechanisms (keywords)
  • No real change control mechanism soon
  • Compare database queries (fresh data) with web
    search engines (possibly stale)
  • Compare database triggers (based on push) to web
    notification services (most of the time based on
    pull/refresh)

11
The need to monitor the web
  • The web changes all the time
  • Users are often as interested in changes as in
    the data itself: new products, new press articles,
    new prices
  • Discover new resources
  • Keep our vision of the web up-to-date
  • Be aware of changes that may be of interest or
    have an impact on our business

12
Analogy: databases
  • Databases
  • Query: instantaneous vision of the data
  • Trigger: alert/notification of changes of
    interest
  • Web
  • Query: needs monitoring to give a correct answer
  • Monitoring: to support alerts/notifications of
    changes of interest

13
Web vs. database monitoring
  • Quantity of data: larger on the web
  • Knowledge of data
  • structure and semantics are known in databases
  • Reliability and availability
  • High in databases, null on the web
  • Data granularity
  • Tuple vs. page in HTML, or element in XML
  • Change control
  • Databases: support from data sources (triggers)
  • Web: no support, pull only in general

14
2. Some applications of web monitoring
15
Comparative shopping
  • Unique entry point to many catalogs
  • A data integration problem
  • Main issue: wrapping of web catalogs
  • Semi-automatic, so limited to a few sites
  • Simpler, and closer to automatic, with XML
  • Alternatives
  • Mediation when data changes very fast
  • prices and availability of plane tickets
  • Warehousing otherwise → need to monitor changes

16
Web surveillance
  • Applications
  • Anti-criminal and anti-terrorist intelligence,
    e.g., detecting suspicious acquisition of
    chemical products
  • Business intelligence, e.g., discovering
    potential customers, partners, competitors
  • Find the data (crawl the web)
  • Monitor the changes
  • new pages, deleted pages, changes in a page
  • Classify information and extract data of interest
  • Data mining, text understanding, knowledge
    representation and extraction, linguistics: very AI

17
Copy tracking
  • Example a press agency wants to check that
    people are not publishing copies of their wires
    without paying

[Diagram: a flow of candidate documents, obtained by querying a
search engine or by a specific crawl, is sliced, pre-filtered,
then filtered; copies are detected in the final step.]
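The slicing and pre-filtering steps might be sketched as shingle overlap, a common cheap pre-filter for copy detection. The word k-gram slicing and the scoring below are illustrative assumptions, not the actual Xyleme pipeline:

```python
def slice_document(text, k=4):
    """Slice a document into overlapping word k-grams ('shingles')."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(max(0, len(words) - k + 1))}

def copy_score(original, candidate, k=4):
    """Cheap pre-filter: fraction of the original's shingles that also
    occur in the candidate. Exact comparison would only run on documents
    that pass this filter."""
    s_orig = slice_document(original, k)
    s_cand = slice_document(candidate, k)
    return len(s_orig & s_cand) / len(s_orig) if s_orig else 0.0

wire = "the minister announced a new budget plan on monday morning"
suspect = "breaking news the minister announced a new budget plan on monday morning"
unrelated = "sunny weather is expected across most of the country tomorrow"
```

A candidate that embeds the wire verbatim keeps all of its shingles, while an unrelated document shares none, so a threshold on the score separates the two cheaply.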
18
Web archiving
  • We will discuss an experience in archiving the
    French web

19
Creation of a data warehouse with resources found
on the web
  • We will discuss some work in the Xyleme project
    on the construction of XML warehouses

20
3. Web archiving
  • An experience towards the archiving of the French
    web, with the
  • Bibliothèque Nationale de France

21
Dépôt légal (legal deposit)
  • Books have been archived since 1537, by decision
    of King François I
  • The Web is an important and valuable source of
    information that should also be archived
  • What is different?
  • Number of content providers: 148,000 sites vs.
    5,000 publishers
  • Quantity of information: millions of pages, plus
    video/audio
  • Quality of information: lots of junk
  • Relationship with publishers: freedom of publication
    vs. the traditional push model
  • Updates and changes occur continuously
  • The perimeter is unclear: what is the French web?

22
Goal and Scope
  • Provide future generations with a representative
    archive of cultural production
  • Provide material for cultural, political and
    sociological studies
  • The mission is to archive a wide range of
    material, because nobody knows what will be of
    interest for future research
  • In traditional publishing, publishers filter
    content; there is no such filter on the web

23
Similar Projects
  • The Internet Archive (www.archive.org)
  • The Wayback machine
  • Largest collection of versions of web pages
  • Human-selection-based approach
  • select a few hundred sites and choose an
    archiving periodicity
  • Australia and Canada
  • The Nordic experience
  • Use robot crawler to archive a significant part
    of the surface web
  • Sweden, Finland, Norway
  • Problems encountered
  • Lack of updates of archived pages between two
    snapshots
  • The hidden Web

24
Orientation of our experiment
  • Goals
  • Cover a large portion of the French web
  • Automatic content gathering is necessary
  • Adapt robots to provide a continuous archiving
    facility
  • Have frequent versions of the sites, at least for
    the most important ones
  • Issues
  • The notion of important sites
  • Building a coherent Web archive
  • Discover and manage important sources of the deep Web

25
First issue: the perimeter
  • The perimeter of the French Web: content edited
    in France
  • Many criteria may be used
  • The French language: but many French sites use
    English (e.g. INRIA), and many French-speaking
    sites are from other French-speaking countries or
    regions (e.g. Quebec)
  • Domain name or resource locator: .fr sites, but
    many are also in .com or .org
  • Address of the site: physical location of the web
    servers, or address of the owner
  • Other criteria than the perimeter
  • Little interest in commercial sites
  • Possible interest in foreign sites that discuss
    French issues
  • A purely automatic approach does not work → involve librarians

26
Second issue: site vs. page archiving
  • The Web
  • Physical granularity: HTML pages
  • The problem is inconsistent data and links
  • Read page P; one week later, the pages pointed to
    by P may not exist anymore
  • Logical granularity?
  • Snapshot view of a web site
  • What is a site?
  • INRIA is www.inria.fr, www-rocq.inria.fr, ...
  • www.multimania.com is the provider of many sites
  • There are technical issues (rapid firing, ...)

27
Importance of data
28
What is page importance?
  • The Louvre's homepage is more important than an
    unknown person's homepage
  • Important pages are pointed to by
  • Other important pages
  • Many unimportant pages
  • This leads to Google's definition of PageRank
  • Based on the link structure of the web
  • Used with remarkable success by Google for
    ranking results
  • Useful but not sufficient for web archiving

29
Page Importance
  • Importance
  • Link matrix L
  • In short, page importance is the fixpoint X of
    the equation LX = X
  • Storing the link matrix and computing page
    importance uses lots of resources
  • We developed a new, efficient technique to compute
    the fixpoint
  • Without having to store the link matrix
  • The technique adapts automatically to changes

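The fixpoint LX = X can be illustrated by ordinary power iteration on a toy graph. This is a minimal Python sketch under assumed parameters (damping factor, iteration count, the toy graph itself); the technique the slides describe goes further, computing importance online without ever storing the link matrix, which this sketch does not attempt:

```python
def page_importance(links, iterations=50, d=0.85):
    """Power-iteration sketch of the fixpoint X = L.X.

    links: toy adjacency list, page -> list of pages it points to.
    d: damping factor as in PageRank (an assumed value).
    """
    pages = list(links)
    n = len(pages)
    x = {p: 1.0 / n for p in pages}                # uniform start vector
    for _ in range(iterations):
        nxt = {p: (1.0 - d) / n for p in pages}    # teleport mass
        for p, outs in links.items():
            targets = outs if outs else pages      # dangling page: spread evenly
            for q in targets:
                nxt[q] += d * x[p] / len(targets)
        x = nxt
    return x

toy_web = {
    "louvre": ["tickets"],   # important pages point to each other...
    "tickets": ["louvre"],
    "blog_a": ["louvre"],    # ...and many other pages point to them
    "blog_b": ["louvre"],
}
scores = page_importance(toy_web)
```

As the slide says: a page pointed to by important pages, or by many pages, comes out with the highest score, while a page nobody points to keeps only the baseline teleport mass.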
30
Site vs. pages
  • Limitation of page importance
  • Google page importance works well when links have
    strong semantics
  • More and more web pages are automatically
    generated, and most links have little semantics
  • A further limitation
  • Refresh at the page level presents drawbacks
  • So we also use the link topology between sites,
    not only between pages

31
Experiments
  • Crawl
  • We used between 2 and 8 PCs running Xyleme
    crawlers for 2 months
  • Discovery and refresh based on page importance
  • Discovery
  • We looked at more than 1.5 billion (most
    interesting) web pages
  • We discovered more than 15 million .fr pages,
    about 1.5% of the web
  • We discovered 150,000 .fr sites
  • Refresh
  • Important pages were refreshed more often
  • The change rate of pages is also taken into account
  • Analysis of the relevance of site importance for
    librarians
  • Comparison with rankings by librarians
  • Strong correlation with their rankings

32
Issues and ongoing work: other criteria for
importance
  • Take into account indications by archivists
  • They know best; a man-machine-interface issue
  • Use classification and clustering techniques to
    refine the notion of site
  • Frequent use of infrequent words
  • Find pages dedicated to specific topics
  • Text weight
  • Find pages with text content (vs. raw data pages)
  • Others

33
4. Creation of a Warehouse from Web data
  • The Xyleme Project

34
Xyleme in short
  • The Xyleme project
  • Initiated at INRIA
  • Joint work with researchers from Orsay, Mannheim
    and CNAM-Paris universities
  • The Xyleme company (www.xyleme.com)
  • Started in 2000
  • About 30 people
  • Mission: deliver a new generation of content
    technologies to unlock the potential of XML
  • Here we focus on the Xyleme project

35
Goal of the Xyleme project
  • Focus is on XML data (but HTML is also handled)
  • Semantics
  • Understand tags, partition the Web into semantic
    domains, provide a simple view of each domain
  • Dynamicity
  • Find and monitor relevant data on the web
  • Control relevant changes in Web data
  • XML storage, indexing and queries
  • Manage millions of XML documents efficiently and
    process millions of simultaneous queries

36
Corporate information environment with Xyleme
[Diagram: the Xyleme server (crawling and data interpretation, an
XML repository, and a query engine) systematically updates its
repository from the web; the corporate information system publishes
to it and sends searches and queries.]
37
XML in short
  • Data exchange format
  • eXtensible Mark-up Language (child of SGML)
  • Promoted by the W3C and major industry players
  • XML document: an ordered labeled tree
  • Other essential gadgets: Unicode, namespaces,
    attributes, pointers, typing (XML Schema)

38
XML magic in short
  • Presentation is given elsewhere (style sheet)
  • Semantics and structure are provided by labels
  • So it is easy to extract information
  • Universal format understood by more and more
    software (e.g., exported by most databases,
    read by more and more editors)
  • More and more tools available

39
It is easy to extract information
40
4.1 Xyleme: Functionality and Architecture
41
The goal of the Xyleme project: a dynamic XML
data warehouse
  • Many research issues
  • Query Processor
  • Semantic Classification
  • Data Monitoring
  • Native Storage
  • XML Document Versioning
  • Automatic or user-driven XML acquisition
  • Graphical User Interface through the Web

42
Functional Architecture
[Diagram: the functional architecture, including the query
processor and the repository and index manager.]
43
Architecture
[Diagram: the physical architecture; the machines communicate
over the Internet.]
44
Prototype: main choices
  • Network of Linux PCs
  • C on the server side
  • Corba for communications between PCs
  • HTTP and SOAP for external communications
  • Exception for query processing

45
Scaling
  • Parallelism based on
  • Partitioning
  • XML documents
  • URL table
  • Indexes (semantic partitioning)
  • Memory replication
  • Autonomous machines (PCs)
  • Caches are used for data flow

46
4.2 Xyleme: Data Acquisition
47
Data Acquisition
  • The Xyleme crawler visits the HTML/XML web
  • Management of metadata on pages
  • Sophisticated strategy to optimize network
    bandwidth, based on
  • importance ranking of pages
  • change frequency and age of pages
  • publications (owners), subscriptions (users)
  • Each crawler visits about 4 million pages per day
  • Each indexer may index 1 million pages
    per day

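The bandwidth-optimizing strategy can be sketched as a priority heuristic over per-page metadata. The record fields and the product formula below are assumptions made for illustration, not Xyleme's actual cost model:

```python
def plan_crawl(pages, fetches_per_cycle):
    """Choose which pages to (re)fetch in this crawl cycle.

    Each page record carries an importance score, an estimated
    change rate (changes per day) and its age (days since the last
    fetch). Priority = importance * change_rate * age, so important,
    frequently changing, stale pages are fetched first.
    """
    def priority(page):
        return page["importance"] * page["change_rate"] * page["age"]
    ranked = sorted(pages, key=priority, reverse=True)
    return [page["url"] for page in ranked[:fetches_per_cycle]]

pages = [
    {"url": "news.fr/front", "importance": 0.9, "change_rate": 24.0, "age": 1.0},
    {"url": "perso.fr/home", "importance": 0.1, "change_rate": 0.01, "age": 30.0},
    {"url": "musee.fr/expo", "importance": 0.8, "change_rate": 0.5, "age": 10.0},
]
```

With a budget of two fetches, the heuristic picks the important news front page (high change rate) and the museum page (stale and fairly important), and skips the rarely changing personal homepage.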
48
4.3 Xyleme: Change Control
49
Change Management
  • The Web changes all the time
  • Data acquisition
  • automatic and via publication
  • Monitoring
  • subscriptions
  • continuous queries
  • versions

50
Subscription
  • Users can subscribe to certain events, e.g.,
  • changes in all pages of a certain DTD or of a
    certain semantic domain
  • insertion of a new product in a particular
    catalog or in all catalogs with a particular DTD
  • They may request to be notified
  • at the time the event is detected by Xyleme
  • regularly, e.g., once a week

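Matching crawl events against such subscriptions might look as follows; the event fields and notification modes are illustrative assumptions, not Xyleme's real schema:

```python
def matching_subscriptions(event, subscriptions):
    """Return the subscriptions triggered by one crawl event.

    An event is a flat dict, e.g. {'type': 'insert', 'dtd': 'catalog',
    'element': 'product'}; a subscription filters on any subset of
    those fields and records when to notify the user ('immediate',
    or batched, e.g. 'weekly').
    """
    return [s for s in subscriptions
            if all(event.get(k) == v for k, v in s["filter"].items())]

subs = [
    {"user": "alice", "notify": "immediate",
     "filter": {"type": "insert", "element": "product"}},   # new products anywhere
    {"user": "bob", "notify": "weekly",
     "filter": {"dtd": "pariscope"}},                        # changes to one DTD
]
event = {"type": "insert", "dtd": "catalog", "element": "product"}
hits = matching_subscriptions(event, subs)
```

Here the product insertion triggers alice's immediate notification, while bob's weekly subscription on another DTD stays quiet.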
51
Continuous Queries
  • Queries asked regularly or when some events are
    detected
  • send me each Monday the list of movies in
    Pariscope
  • send me each Monday the list of new movies in
    Pariscope
  • each time you detect that a new member is added
    to the Stanford DB-group, send me their lists of
    publications from their homepages

52
Versions and Deltas
  • Store snapshots of documents
  • For some documents, store changes (deltas)
  • storage: last version + sequence of deltas
  • complete deltas: reconstruct old versions
  • partial deltas: allow sending changes to the user
    and allow refresh
  • Deltas are XML documents
  • so changes can be queried as standard data
  • Temporal queries
  • List of products that were introduced in this
    catalog since January 1st, 2002

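The "last version + sequence of deltas" scheme can be sketched on a toy document model, where a document is a flat path-to-value dict and a backward delta records what differed in the older version; real Xyleme deltas are XML documents, as the slide notes:

```python
def apply_delta(doc, delta):
    """Apply one backward delta: a dict of path -> older value,
    where None means the element did not exist in the older version."""
    result = dict(doc)
    for path, value in delta.items():
        if value is None:
            result.pop(path, None)
        else:
            result[path] = value
    return result

def reconstruct(last_version, backward_deltas, steps_back):
    """Only the last version is stored in full; older versions are
    rebuilt by replaying complete (backward) deltas, newest first."""
    doc = dict(last_version)
    for delta in backward_deltas[:steps_back]:
        doc = apply_delta(doc, delta)
    return doc

v3 = {"catalog/cam/price": "99", "catalog/cam/name": "FlashCam"}
d3_to_2 = {"catalog/cam/price": "120"}   # the price differed in v2
d2_to_1 = {"catalog/cam/name": None}     # the product was absent in v1
```

Since deltas are themselves structured data, a temporal query such as "products introduced since January 1st, 2002" reduces to querying the stored deltas rather than diffing whole snapshots.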
53
The Information Factory
[Diagram: the information factory. Loaders feed web documents into
the repository, which stores documents and deltas over time; change
detection drives a subscription processor and continuous queries,
which send notifications; version queries return results from the
repository.]
54
Results
  • Very efficient XML diff algorithm
  • computes the difference between consecutive versions
  • Representation of deltas based on an original
    naming scheme for XML elements
  • an element is assigned a unique identifier for
    its entire life
  • a compact way of representing these IDs
  • Efficient versioning mechanism

55
Results
  • Sophisticated monitoring algorithm
  • Detection of simple patterns (conjunctions) at
    the document level
  • Detection of changes between consecutive versions
    of the same documents
  • Scales to dozens of crawlers loading millions of
    documents per day, for a single monitor

56
Issues: languages for monitoring
  • In the spirit of temporal languages for
    relational databases
  • But
  • The data model is richer (trees vs. tables)
  • The context is richer: versions, continuous
    queries, monitoring of data streams

57
4.4 Xyleme: Semantic Data Integration
58
Data Integration
  • One application domain, several schemas
  • heterogeneous vocabulary and structure
  • Xyleme semantic integration
  • gives the illusion that the system maintains a
    homogeneous database for the domain
  • abstracts a set of DTDs into a hierarchy of
    pertinent terms for a particular domain
    (business, culture, tourism, biology, ...)

59
Technology in short
  • Cluster DTDs into application domains
  • For an application domain, semi-automatically
  • Organize tags into a hierarchy of concepts, using
    thesauri such as WordNet and other linguistic
    tools
  • This provides the abstract DTD for the particular
    domain
  • Generate mappings between the concrete DTDs and
    the abstract one

60
4.5 Xyleme: Query Processing
61
Xyleme Query Language
  • A mix of OQL and XQL; the W3C standard will be
    used once there is one.

Select product/name, product/price
From doc in catalogue, product in doc/product
Where product//components contains "flash"
and product/description contains "camera"
62
Principle of Querying
A query on the abstract DTD is rewritten into a
union of concrete queries (possibly with joins),
using MAPPINGS between concrete and abstract DTDs:
catalogue/product/price → d1//camera/price, d2/product/cost
catalogue/product/description → d1//camera/description, d2/product/info
ref → d2/description
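The rewriting step can be sketched as a lookup in the mapping table. The mapping entries below come from the slide's example; the grouping-by-source logic is an assumption made for illustration:

```python
MAPPINGS = {  # abstract path -> concrete paths, from the slide's example
    "catalogue/product/price": ["d1//camera/price", "d2/product/cost"],
    "catalogue/product/description": ["d1//camera/description",
                                      "d2/product/info"],
}

def translate(abstract_paths, mappings):
    """Rewrite a query over the abstract DTD into one concrete query
    per source that can answer every requested path; the overall
    answer is the union of the concrete answers."""
    per_source = {}
    for apath in abstract_paths:
        for cpath in mappings[apath]:
            source = cpath.split("/")[0]   # leading step names the source: 'd1', 'd2'
            per_source.setdefault(source, {})[apath] = cpath
    # keep only sources that cover every abstract path in the query
    return {src: m for src, m in per_source.items()
            if len(m) == len(abstract_paths)}

plans = translate(["catalogue/product/price",
                   "catalogue/product/description"], MAPPINGS)
```

Each surviving entry is one concrete query: d1 asks for //camera/price and //camera/description, d2 for product/cost and product/info, and the query processor unions the results.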
63
Query Processing
  1. Partial translation, from abstract to concrete,
    to identify machines with relevant data
  2. Algebraic rewriting: a linear search strategy
    based on simple heuristics; in priority, use
    in-memory indexes and minimize communication
  3. Decomposition into local physical subplans and
    installation
  4. Execution of the plans
  5. If needed, relaxation

64
Query processing
  • Essential use of a smart index combining
    full-text and structure

65
4.6 Xyleme: Repository
66
Storage System
  • Xyleme store
  • efficient storage of trees in variable-length
    records within fixed-length pages
  • balancing of tree branches in case of overflow
  • minimizes the number of I/Os for direct access
    and scanning
  • a good compromise between compaction and access time
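The idea of variable-length records inside fixed-length pages can be sketched with a first-fit placement; the page size and the simple "open a new page on overflow" policy are assumptions for illustration, where the real store rebalances tree branches instead:

```python
PAGE_SIZE = 4096  # bytes per fixed-length disk page (an assumed value)

def place_records(records, page_size=PAGE_SIZE):
    """Place variable-length (name, size) records into fixed-length
    pages, opening a new page on overflow. Returns the page layout."""
    pages = [[]]
    free = page_size
    for name, size in records:
        if size > free:          # overflow: a real store would
            pages.append([])     # rebalance tree branches here
            free = page_size
        pages[-1].append(name)
        free -= size
    return pages

layout = place_records([("root", 3000), ("chapter1", 2000), ("title", 800)])
```

Packing several small records into one page keeps scans cheap (few I/Os per subtree) while the fixed page size keeps direct access simple, which is the compaction/access-time compromise the slide mentions.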

67
Tree Balancing in Xyleme Store
[Diagram: a document tree stored across four records; when a node
has more children than fit in one record, branches are balanced
into further records.]
68
5. Conclusion
69
Web monitoring
  • Very challenging problem
  • Complexity due to the volume of data and the
    number of users
  • Complexity due to heterogeneity
  • Complexity due to lack of cooperation from data
    sources
  • Many issues to investigate

70
New directions
  • Active web sites
  • Friendly sites willing to cooperate
  • Web services provide the infrastructure
  • Support for triggers
  • Mobile data
  • Web sites on mobile devices
  • Issues of availability (device unplugged)
  • Issues in synchronization
  • Geography-dependent queries