A Dynamic Warehouse for the XML data of the Web, Serge Abiteboul, INRIA

Xyleme, January 2001, Zurich.

Slides: 71

1
A Dynamic Warehouse for the XML data of the Web
Serge Abiteboul
INRIA, Xyleme SA
Serge.Abiteboul@inria.fr Serge.Abiteboul@xyleme.com
http://www-rocq.inria.fr/verso http://www.xyleme.com
2
Organization

The Web and XML
Xyleme
1. Data Acquisition and Maintenance
2. XML Repository
3. Semantic Data Integration
4. Query Processing
5. Query Subscription
Conclusion

3
The Web and XML
4
The Web today

Terabytes of data
Private web: pages not publicly available
Deep web: data hidden behind forms
A lot of public pages
1 billion in 06/2000
several million servers

5
The Web today

Browsing
Search engines
Google indexes more than 1 billion pages (11/00)
in: list of words
out: sorted list of URLs
based on occurrence of words in documents
based on the link structure of the web

6
The Web today

Queries: keywords to retrieve URLs
Imprecise
Query results cannot be directly processed
Difficult to extract data of interest
Applications based on hand-made wrappers
Expensive
Incomplete
Short-lived, not adapted to the Web's constant changes

7
The Coming of XML

HTML
comes from SGML
hypertext language
fixed number of tags
content and presentation are mixed
very difficult to extract data from a page
old standard

XML
also comes from SGML
semistructured data
number of tags is not fixed
content and presentation are not mixed
very easy to extract data from a page
new standard

8
HTML: Hypertext Language
Text + presentation: where is the data? (hard)
9
XML: Semistructured Data
<product-table>
  <product reference="X23">
    <designation> camera </designation>
    <price unit="Dollars"> 359.99 </price>
    <description> </description>
  </product>
  <product reference="R2D2">
    <designation> Robot </designation>
    <price unit="Dollars"> 19350 </price>
    <description> </description>
  </product>
  ...
</product-table>
Data + structure: semistructured, more flexible (easy)
10
XML Tree Types
product-table
product
reference
price
designation
description

Semantics and structure are in paths
product-table/product/reference
product-table/product/price

11
XML

Very active/noisy field - standards
schema (XML schema), stylesheet (XSL), resource
description (RDF...)
WML (wap), MathML, SMIL (multimedia), RSS (news),
RDF (metadata)...
How fast will XML conquer the web?
so far rather slow (about 1% now of the visible web; much more in intranets)
much faster since the arrival of Explorer 5.5

12
A Dynamic Warehouse for the XML Data of the Web

Xyleme

13
Xyleme

Warehouse
Xyleme stores huge quantities of data (terabytes)
Xyleme is not a search engine (only an index) or a mediator (only virtual data)
XML
Xyleme is focused on XML, i.e., trees
Dynamic
Xyleme is interested in data evolution/changes

14
Xyleme

September 1999: a group of researchers from
Inria Rocquencourt, Verso Group
U. of Mannheim, Database Group
U. of Orsay, IASI Group
CNAM, Vertigo Group
September 2000: creation of a start-up
November 2000: about 15 people

15
Corporate Information Today
ad-hoc applications written by web experts
tailored for specific tasks and data, i.e., inflexible and expensive
manual searches using browsers
Information System
manual updates
16
Corporate Information with Xyleme
Crawling and interpreting data
Repository
Query Engine
Xyleme-warehouse
publishing
searches
Xyleme API
queries
updates
Information System
17
Five Challenges

1. Data Acquisition and Maintenance
discover data of interest and keep it up to date
2. Repository
store this data and index it so that it can be processed efficiently
3. Query Processing
efficiently support an SQL-style query language

18
Five Challenges - continued

4. Semantic Integration
Understand DTD and tags, partition the Web into
semantic domains, provide a simple view of each
domain
5. Change Control
Monitor the web and offer services such as Query
Subscription

19
Challenges - continued

Scale to the web
Size of data millions/billions of pages
Size of index terabytes
Number of customers
thousands of simultaneous queries
millions of subscriptions

20
Functional Architecture
Internet
Web Interface
Query Processor
Repository and Index Manager
21
Architecture

Cluster of PCs
Developed with Linux and C++
Communications
local Corba
external HTTP
Distribution between autonomous machines

22
Architecture
Internet
Ethernet
23
1. Data Acquisition and Maintenance
24
Goals

Discover XML pages on the web that are of interest for customers
For this, crawl the web (HTML + XML)
Keep them up to date
Do this under bounded resources

25
Life Cycle of a page in Xyleme

The URL of D is discovered as a link in another
page (or published by a customer)
The page scheduler decides to read D
The meta data of D is read
type, last_date_update...
The document D is loaded
The document D is (re)read regularly

26
Main Issues

Loading of pages
we can load up to 5 million pages/day on a standard PC
main cost is Internet connection
Metadata management
Page scheduling
decide which page to read or refresh next

27
Metadata Management

Example: management of the link matrix
page i points to page j
for 1 billion URLs, about 30 children/URL
matrix has 30 × 10^9 edges (very sparse)
For each page that is read,
find the IDs of the 30 children
50 pages/second → 1500 database calls/second
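
The figures on this slide follow from simple arithmetic. A minimal sketch that uses only the numbers given above; the variable names are illustrative, not from the talk:

```python
# Numbers taken from the slide; the names are illustrative only.
urls = 10**9               # 1 billion URLs
children_per_url = 30      # average number of outgoing links per page
pages_per_second = 50      # pages read per second

# Total number of non-zero entries (edges) in the link matrix.
edges = urls * children_per_url

# Each page read requires resolving the ID of every child URL.
id_lookups_per_second = pages_per_second * children_per_url

# Fraction of the 10^9 x 10^9 matrix that is non-zero: why it is "very sparse".
density = edges / (urls * urls)

print(edges)                  # 30000000000
print(id_lookups_per_second)  # 1500
```

The density comes out at roughly 3 entries per 100 million cells, which is why the matrix is stored as per-page child lists rather than as a dense array.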

28
Page Scheduling

Decide which page to read next
discovery (read first) and refresh (read again)
Based on
Importance of the page
read important pages often
also used to order query results
Change rate of the page
don't read a page that is probably up-to-date

29
Page Scheduling for Refresh

Determine the refresh frequency $f_i$ for each page $i$ to minimize a cost function
Minimize $\sum_{i=1}^{N} \mathrm{cost}_i(f_i)$ under the constraint $G \ge \sum_{i=1}^{N} f_i$
where $\mathrm{cost}_i(f_i)$, the penalty for page $i$, depends on the estimated importance and staleness of the page

30
Cost Function

$\mathrm{cost}_i(f_i)$, the penalty for page $i$, depends on the estimated importance and staleness of the page
Importance of the page
link structure
pub/sub
Staleness of the data
penalty for being out of date
penalty for aging
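
The slides do not give the form of the cost function, so the following is only a sketch under a loudly stated assumption: the penalty for page i is taken to be w_i / f_i, where w_i is a positive importance weight, and the refresh budget G is assumed to be fully used. Under that hypothetical model a Lagrange-multiplier argument gives f_i proportional to sqrt(w_i). The function names are illustrative.

```python
import math

def refresh_frequencies(importance, budget):
    """Allocate refresh frequencies f_i minimizing sum(w_i / f_i)
    subject to sum(f_i) == budget.

    ASSUMPTION: cost_i(f_i) = w_i / f_i is a hypothetical stand-in for the
    cost function of the talk, which is not specified in the slides.
    All weights must be strictly positive.
    """
    roots = [math.sqrt(w) for w in importance]
    total = sum(roots)
    return [budget * r / total for r in roots]

def total_cost(importance, freqs):
    """Value of the assumed cost function for a given allocation."""
    return sum(w / f for w, f in zip(importance, freqs))

weights = [4.0, 1.0, 1.0]   # hypothetical importance values
freqs = refresh_frequencies(weights, budget=8.0)
print(freqs)                # [4.0, 2.0, 2.0]
```

With this model the important page is refreshed twice as often as the others, not four times as often, and the allocation costs less than splitting the budget evenly.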

31
Evaluation of Change Rate

Based on the Last Date of Change
provided by HTTP header of the page
in general reliable but
Based on the number M of changes detected the last N times the page was refreshed
limitation: the actual number of changes is not known
The first one is more precise
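
The second estimator can be sketched in a few lines. This is a minimal illustration, assuming the refresh history is kept as one boolean per refresh; the function name and the history format are assumptions, not part of the talk.

```python
def change_rate_from_detections(detected):
    """Estimate the change rate as M / N.

    detected: list of booleans, one per refresh of the page
              (True means the page differed from the stored copy).
    The result is only a lower bound on the true rate: several changes
    between two refreshes are seen as a single detected change.
    """
    if not detected:
        raise ValueError("need at least one refresh")
    return sum(detected) / len(detected)

history = [True, False, False, True, False]   # hypothetical history, N = 5
print(change_rate_from_detections(history))   # 0.4
```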

32
Page Importance Link Structure

Intuition: a page is important if many important pages reference it → fixpoint
Link Matrix
$M(i,j) = 1$ if page $i$ refers to page $j$
$M$ is a $10^9 \times 10^9$ matrix
$\mathrm{out}(i)$: the outdegree of page $i$
Fixpoint
$W_0(k) = 1/N$ (initialization)
$W_m(k) = \sum_i M(i,k)\, W_{m-1}(i) / \mathrm{out}(i)$
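
The fixpoint above (each page starts at 1/N, then repeatedly hands its weight to its children, divided by its outdegree) can be sketched as follows. This is a minimal in-memory illustration, not Xyleme's implementation; the link matrix is assumed to be given as one list of children per page, and pages without outgoing links simply lose their weight in this sketch.

```python
def page_importance(links, iterations=10):
    """Iterate W_m(k) = sum_i M(i,k) * W_{m-1}(i) / out(i).

    links[i] is the list of pages that page i points to, i.e. row M(i,-).
    """
    n = len(links)
    w = [1.0 / n] * n                      # W_0(k) = 1/N
    for _ in range(iterations):
        nxt = [0.0] * n
        for i, children in enumerate(links):
            if not children:               # no outgoing links: nothing to share
                continue
            share = w[i] / len(children)   # W_{m-1}(i) / out(i)
            for k in children:
                nxt[k] += share
        w = nxt
    return w

# Hypothetical 3-page web: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
links = [[1, 2], [2], [0]]
importance = page_importance(links, iterations=100)
```

The matrix is read one row at a time, so only the two weight vectors need to stay in memory. On this small example the weights settle near 0.4, 0.2 and 0.4.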

33
Page Importance Algorithm
[Figure: M(i,-), Wm-1(k), out(k), Wm(k)]
M(i,-) is stored as a list; Wm is computed line by line:
for i = 1 to N do
  read M(i,-)
  process the line
34
Page Importance Fixpoint

Techniques for fixpoint convergence
Some results
convergence is fast (≈ OK after 10 iterations)
single precision suffices
possible on a standard PC
Distribution and incremental evaluation

35
Page Importance Refresh
Standard importance for HTML/XML pages
HTML pages are useful only to discover XML
Taking pub/sub into account
[Figure legend: circle = HTML, square = XML, triangle = pub/sub]
36
2. XML Repository
37
Storing XML documents

Relational store (e.g., Oracle 8i)
binary large objects: not possible to access elements directly
strongly typed data and tables: efficient
otherwise: too many joins, inefficient
Object database store (ODMG)
better adapted
Native XML storage: Natix

38
Natix Repository

Goal
minimize I/O for direct access and scanning
efficient direct accesses using indexing
good compaction but not at the cost of access
Efficient storage of trees
use fixed length storage pages
variable length records inside a page
Main issue: tree balancing

39
Tree Balancing
[Figure: a tree stored across Record 1, Record 2, and Record 3]
40
Tree Balancing - continued
Large collections may use several records
41
3. Semantic Data Integration
42
Web Heterogeneity

Semantic domains, e.g., cinema
Many possible types for data in this domain, many
DTDs
Semantic Integration
one abstract DTD for the domain
gives the illusion that the system maintains a homogeneous database for this domain
1 domain = 1 abstract DTD

43
Cluster DTDs and Documents
Relationship is not visible unless one knows the
relationships between story and tale.
44
Discover the Domains
Cluster DTDs sharing similar tags using data mining techniques (frequent item sets) and linguistic tools (e.g., thesaurus, heuristics to extract words from composite words or abbreviations, etc.) to obtain domains
[Figure: many concrete DTDs (cdtd1 ... cdtd10) grouped under fewer abstract DTDs (adtd1, adtd2, adtd4)]
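
As a rough illustration of grouping DTDs by shared tags, here is a minimal sketch. It swaps in a much simpler technique than the one named on the slide: a greedy clustering on Jaccard similarity of tag sets instead of frequent item set mining, and with no linguistic tools. The tag sets and the threshold are hypothetical; only the names cdtd1 to cdtd4 echo the slide.

```python
def jaccard(a, b):
    """Similarity of two non-empty tag sets: shared tags / all tags."""
    return len(a & b) / len(a | b)

def cluster_dtds(dtd_tags, threshold=0.5):
    """Greedy single-pass clustering (a stand-in, not the talk's method).

    A DTD joins the first cluster whose accumulated tag set is similar
    enough; otherwise it starts a new cluster.
    """
    clusters = []                      # list of (tag set, member names)
    for name, tags in dtd_tags.items():
        for cluster_tags, members in clusters:
            if jaccard(tags, cluster_tags) >= threshold:
                members.append(name)
                cluster_tags |= tags   # grow the cluster's tag set in place
                break
        else:
            clusters.append((set(tags), [name]))
    return [members for _, members in clusters]

# Hypothetical tag sets for four concrete DTDs.
dtd_tags = {
    "cdtd1": {"movie", "title", "actor", "director"},
    "cdtd2": {"movie", "title", "actor", "year"},
    "cdtd3": {"product", "price", "reference"},
    "cdtd4": {"product", "price", "designation"},
}
domains = cluster_dtds(dtd_tags)
print(domains)   # [['cdtd1', 'cdtd2'], ['cdtd3', 'cdtd4']]
```

Each resulting group would then correspond to one candidate domain, for which an abstract DTD still has to be chosen.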
45
Wordnet Useful Relationships

Synonyms: one concept, two terms
Hypernyms / Hyponyms: two concepts linked through generalization/specialization
- e.g., vehicle / car
Meronyms / Holonyms: two concepts linked through composition/inclusion
- e.g., country / city

46
Choose an Abstract DTD / Domain

Automatically
The analysis of a cluster leads to clusters of tags
Use a thesaurus (e.g., Wordnet) to build a
hierarchy from the clusters of tags
Manually
Performed by a domain expert
Hybrid

47
Mapping Concrete to Abstract

For each concrete DTD in a domain, find how it
relates to the abstract DTD
Associate concrete tags to abstract tags using
linguistic tools
Provide relationships between paths in the
concrete and abstract DTD
E.g. cdtd3/œuvre/nom/prénom and
adtd2/book/author/name/firstname
Possibly automatic, manual or hybrid

48
4. Query Processing
49
Xyleme Query Language

Today: a mix of OQL and XQL
Tomorrow: the future W3C standard
Example
select product/name, product/price
from doc in catalogue,
     product in doc/product
where product//components contains "flash"
  and product/description contains "camera"

50
Data Distribution

Cluster of documents: physical collection of documents (? semantic domain)
Distribution
Storage machine
in charge of a cluster of documents
Index machine
index for a cluster

51
Step 0 Indexing

Standard inverted index
word → documents that contain this word
Xyleme index
word → elements that contain this word
(document + element identifier)
Goal: more work can be performed without accessing data
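
The difference from a standard inverted index can be sketched as follows. This is a minimal illustration, not Xyleme's index structure: it assumes the element identifier is simply the element's preorder rank inside its document, and it indexes a word only under the element that directly encloses it.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def build_element_index(documents):
    """Map each word to the (doc_id, element_id) pairs that contain it.

    ASSUMPTION: element_id is the preorder rank of the element within
    its document; the actual identifier scheme may differ.
    """
    index = defaultdict(list)
    for doc_id, xml_text in documents.items():
        root = ET.fromstring(xml_text)
        # root.iter() walks the tree in document (preorder) order.
        for element_id, element in enumerate(root.iter(), start=1):
            for word in (element.text or "").lower().split():
                index[word].append((doc_id, element_id))
    return index

documents = {
    "d1": "<product-table>"
          "<product><designation>camera</designation><price>359.99</price></product>"
          "<product><designation>Robot</designation><price>19350</price></product>"
          "</product-table>",
}
index = build_element_index(documents)
print(index["camera"])   # [('d1', 3)]
print(index["robot"])    # [('d1', 6)]
```

Because a posting names an element and not just a document, a query such as "designation contains camera" can be narrowed down from the index alone, before any document is fetched.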

52
Step 1 Localization
global query on abstract dtd

Query on an abstract dtd
Localization of machines that host concrete DTDs
that will participate in the query

catalogue/product/price: relevant for machine 56 and machine 45
local queries
union of queries on local machines
53
Step 2 Optimization

Algebraic rewriting
Linear search strategy based on simple heuristics
use in memory indexes
minimize communication
Optimization of the global plan
Optimization of the local plans

54
Step 3 Execution

A plan usually consists of
1. parallel translation from abstract queries to
concrete patterns on the relevant index machines
2. parallel index scans to identify the relevant
elements for a concrete pattern
3. parallel construction of resulting elements
4. pipeline evaluation (i.e., no intermediate
data structure)
Note: step 2 requires smart indexes

55
Execution Abstract2Concrete
for catalogue/product/price, scan the relevant concrete patterns
→ d1//camera/price
→ d2/product/cost
→ d3/piano/price ...
For each concrete pattern, the local plan is optimized dynamically
for each concrete pattern, scan the element ids
→ 234
→ 177
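
The translation step can be pictured as a lookup from an abstract path to its concrete patterns, grouped by the machine that hosts each concrete DTD. A minimal sketch: the three patterns come from the slide, while the placement of d1, d2 and d3 on machines 56 and 45 and all function names are hypothetical.

```python
# Abstract path -> concrete patterns (patterns taken from the slide).
ABSTRACT_TO_CONCRETE = {
    "catalogue/product/price": [
        "d1//camera/price",
        "d2/product/cost",
        "d3/piano/price",
    ],
}

# HYPOTHETICAL placement of concrete DTDs on index machines.
DTD_MACHINE = {"d1": 56, "d2": 56, "d3": 45}

def localize(abstract_path):
    """Group the concrete patterns of an abstract path by hosting machine."""
    plan = {}
    for pattern in ABSTRACT_TO_CONCRETE.get(abstract_path, []):
        dtd = pattern.split("/", 1)[0]          # leading DTD name, e.g. "d1"
        plan.setdefault(DTD_MACHINE[dtd], []).append(pattern)
    return plan

plan = localize("catalogue/product/price")
print(plan)   # {56: ['d1//camera/price', 'd2/product/cost'], 45: ['d3/piano/price']}
```

Each machine then evaluates its own patterns locally, and the global answer is the union of the local results.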
56
Element Identifiers

Essential for query processing
Identifier: (preorder rank, postorder rank)
X ancestor of Y <=> pre(X) < pre(Y) and post(X) > post(Y)
E.g., 2 < 5 and 4 > 2 => (2,4) is an ancestor of (5,2)

[Figure: a tree with nodes A to G and text leaves, each node labeled with its preorder and postorder rank]
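
The ancestor test on (preorder, postorder) identifiers can be sketched as follows. The 7-node tree is hypothetical: it is not claimed to be the tree of the slide's figure, only one chosen so that it contains the pair (2,4) and (5,2) used in the example.

```python
def number_tree(tree):
    """Assign (preorder rank, postorder rank) to every node.

    tree is a (label, children) pair; ranks start at 1.
    """
    ranks = {}
    counters = {"pre": 0, "post": 0}

    def visit(node):
        label, children = node
        counters["pre"] += 1
        pre = counters["pre"]            # rank on the way down
        for child in children:
            visit(child)
        counters["post"] += 1            # rank on the way back up
        ranks[label] = (pre, counters["post"])

    visit(tree)
    return ranks

def is_ancestor(x, y):
    """X is an ancestor of Y iff pre(X) < pre(Y) and post(X) > post(Y)."""
    return x[0] < y[0] and x[1] > y[1]

# Hypothetical tree: A has children B and F; B has C and D; D has E; F has G.
tree = ("A", [("B", [("C", []), ("D", [("E", [])])]), ("F", [("G", [])])])
ranks = number_tree(tree)
print(ranks["B"], ranks["E"])                 # (2, 4) (5, 2)
print(is_ancestor(ranks["B"], ranks["E"]))    # True
print(is_ancestor(ranks["B"], ranks["G"]))    # False
```

The test needs only the two identifiers, so structural conditions in a query can be checked on index entries without loading the documents.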
57
Patterns and Indexes
(d1, 12, 200), (d1, 201, 400)
(d1,1,11), (d1, 205,224)
(d1,228, 237)
(d1, 229), (d2, 14)
Heuristics: to perform joins, start with the smallest cardinality (to minimize the size of intermediate results)
58
5. Change Control
59
The Web changes all the time

Data acquisition and maintenance
keep the warehouse up-to-date
Version management
representation and storage of changes
Change monitoring
query subscription

60
Versions

Version some documents or some sites
Version some continuous queries
continuous query: a query that is evaluated regularly
e.g., get each Monday the list of movies showing in Paris

61
Representing Versions Deltas

Version storage
current document
persistent identifiers for elements
description of changes - completed deltas
Deltas are XML documents
Changes can be processed like other data
exchanged: send me changes since June 1st!
queried: what are the products inserted since 2/1/99?

62
Completed Delta

<delta>
  <unit_delta previousversion="1" thistime="2">
    <delete xid="11" xid-children="(17-21)">
      <Product>
        <Name>DVD</Name> <Price>500</Price>
      </Product>
    </delete>
    <move xid="16" new_parent="11" new_position="2"
          old_parent="11" old_position="1" />
    <update xid="3" new_value="50" old_value="100" />
    <update xid="8" new_value="100" old_value="150" />
  </unit_delta>
  <unit_delta>...</unit_delta>
</delta>

(xid: persistent identifier)
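
Since deltas are XML documents, they can be processed with ordinary XML tools. A minimal sketch using Python's standard library: the delta text is the slide's example with quotes and underscores restored, which is an assumption about the original markup, and the two helper functions are illustrative.

```python
import xml.etree.ElementTree as ET

# The slide's example, with attribute quoting restored (an assumption).
DELTA = """
<delta>
  <unit_delta previousversion="1" thistime="2">
    <delete xid="11" xid-children="(17-21)">
      <Product><Name>DVD</Name><Price>500</Price></Product>
    </delete>
    <move xid="16" new_parent="11" new_position="2" old_parent="11" old_position="1"/>
    <update xid="3" new_value="50" old_value="100"/>
    <update xid="8" new_value="100" old_value="150"/>
  </unit_delta>
</delta>
"""

def list_operations(delta_xml):
    """Return (operation, xid) pairs for every change in every unit delta."""
    root = ET.fromstring(delta_xml)
    return [(op.tag, op.get("xid"))
            for unit in root.findall("unit_delta")
            for op in unit]

def updated_values(delta_xml):
    """Map the xid of each updated element to (old value, new value)."""
    root = ET.fromstring(delta_xml)
    return {u.get("xid"): (u.get("old_value"), u.get("new_value"))
            for u in root.iter("update")}

print(list_operations(DELTA))
# [('delete', '11'), ('move', '16'), ('update', '3'), ('update', '8')]
print(updated_values(DELTA))
# {'3': ('100', '50'), '8': ('150', '100')}
```

Because a completed delta records both old and new values, the same document can be read forwards or backwards between the two versions.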
63
Query Subscription

Users may subscribe to certain events, e.g.,
changes in a page, a set of pages,
changes in pages from a particular semantic
domain, containing some specific words or with a
particular DTD
changes of particular elements somewhere (new
products in a catalog)
Users may request to be notified
immediately at the time the event is detected
regularly, e.g., weekly
after a certain number of event detections

64
Example

subscription myPariscope
  (what are the new movie entries in the Pariscope site?)
  monitoring newMovies
    select URL
    where URL extends www.pariscope.fr/movies/
      and new(self)
  (manage the changes in the movies showing in Paris)
  continuous delta Showing
    select ... from ... where
    when daily
  notify daily (send me a daily report)

65
Step 1 Atomic Event Detection
5 million pages/day
atomic event 46: URL matches pattern www.inria.fr/
atomic event 67: XML document contains the tag painter
[Figure: while document d is loaded, the metadata manager, HTML parser, and XML loader send document alerts (d/46, then d/46,67) to complex event detection]
66
Step2 Complex Event Detection
Millions of page alerts/day, millions of subscriptions
complex event 12 = 67 and 46 (XML document contains the tag painter and URL matches pattern www.inria.fr/)
[Figure: alerts from the HTML parser and XML loader feed complex event detection]
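
A complex event here is a conjunction of atomic events, so detecting it amounts to a subset test on the atomic events raised for a document. A minimal sketch: the event numbers 12, 46 and 67 come from the slides, while the data layout and the function name are assumptions. The linear scan below would not scale to millions of subscriptions; it only shows the logic.

```python
# Complex event id -> set of atomic events that must all be raised.
# Event numbers are from the slides; the data layout is an assumption.
COMPLEX_EVENTS = {12: frozenset({46, 67})}

def detect_complex_events(alert_atomic_events, complex_events=COMPLEX_EVENTS):
    """Return the ids of complex events satisfied by one document alert."""
    raised = set(alert_atomic_events)
    return sorted(cid for cid, required in complex_events.items()
                  if required <= raised)      # subset test = conjunction

print(detect_complex_events([46]))        # []
print(detect_complex_events([46, 67]))    # [12]
```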
67
Step 3 Notification Processor
notification/monitoring
complex event detection
alerts
notification processor
Millions of notifications/day
triggers
continuous queries
notification/results
clock
68
Conclusion
69
One Question Only

The web is turning from a large collection of
documents into a huge knowledge base
When will I be able to get
the precise knowledge I need?
Database Knowledge Base Linguistic ...

70
Thank you
