A Dynamic Warehouse for the XML Data of the Web, Serge Abiteboul (INRIA)

Xyleme, January 2001, Zurich.

1
A Dynamic Warehouse for the XML Data of the Web
Serge Abiteboul
INRIA and Xyleme SA
Serge.Abiteboul@inria.fr, Serge.Abiteboul@xyleme.com
http://www-rocq.inria.fr/verso, http://www.xyleme.com
2
Organization
  • The Web and XML
  • Xyleme
  • 1. Data Acquisition and Maintenance
  • 2. XML Repository
  • 3. Semantic Data Integration
  • 4. Query Processing
  • 5. Query Subscription
  • Conclusion

3
The Web and XML
4
The Web today
  • Terabytes of data
  • Private web: pages that are not publicly available
  • Deep web: data hidden behind forms
  • A lot of public pages
  • 1 billion pages in 06/2000
  • several million servers

5
The Web today
  • Browsing
  • Search engines
  • Google indexes more than 1 billion pages (11/00)
  • in: list of words
  • out: sorted list of URLs
  • based on occurrence of words in documents
  • based on the link structure of the web

6
The Web today
  • Queries: keywords to retrieve URLs
  • Imprecise
  • Query results cannot be directly processed
  • Difficult to extract data of interest
  • Applications based on hand-made wrappers
  • Expensive
  • Incomplete
  • Short-lived, not adapted to the Web's constant
    changes

7
The Coming of XML
  • HTML
  • comes from SGML
  • hypertext language
  • fixed number of tags
  • content and presentation are mixed
  • very difficult to extract data from a page
  • old standard
  • XML
  • also comes from SGML
  • semistructured data
  • number of tags is not fixed
  • content and presentation are not mixed
  • very easy to extract data from a page
  • new standard

8
HTML: Hypertext Language
Text + presentation: where is the data? (hard)
9
XML: Semistructured Data
<product-table>
  <product reference="X23">
    <designation> camera </designation>
    <price unit="Dollars"> 359.99 </price>
    <description> </description>
  </product>
  <product reference="R2D2">
    <designation> Robot </designation>
    <price unit="Dollars"> 19350 </price>
    <description> </description>
  </product>
  ...
</product-table>
Data + structure: semistructured, more flexible (easy)
10
XML Tree Types
product-table
  product
    reference
    price
    designation
    description
  • Semantics and structure are in paths
  • product-table/product/reference
  • product-table/product/price
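
To make the "semantics are in paths" idea concrete, here is a minimal Python sketch that lists the label paths of a small document. The sample catalogue is a shortened copy of the product table, and treating attributes as path steps (so that `reference` shows up under `product`) is an assumption made to match the paths listed on the slide, not a statement about how Xyleme represents them.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the product-table example, used only for illustration.
CATALOG = """
<product-table>
  <product reference="X23">
    <designation>camera</designation>
    <price unit="Dollars">359.99</price>
  </product>
  <product reference="R2D2">
    <designation>Robot</designation>
    <price unit="Dollars">19350</price>
  </product>
</product-table>
"""

def label_paths(element, prefix=""):
    """Yield the label path of every element and attribute below `element`."""
    path = f"{prefix}/{element.tag}" if prefix else element.tag
    yield path
    for name in element.attrib:          # assumption: attributes count as path steps
        yield f"{path}/{name}"
    for child in element:
        yield from label_paths(child, path)

root = ET.fromstring(CATALOG)
paths = sorted(set(label_paths(root)))   # distinct paths only
for p in paths:
    print(p)
```

Both products contribute the same paths, which is why the set of distinct paths can act as a compact description of the document's structure.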

11
XML
  • Very active/noisy field - standards
  • schema (XML schema), stylesheet (XSL), resource
    description (RDF...)
  • WML (wap), MathML, SMIL (multimedia), RSS (news),
    RDF (metadata)...
  • How fast will XML conquer the web?
  • so far rather slow (about 1% of the visible web
    now, much more in intranets)
  • much faster since the arrival of Explorer 5.5

12
A Dynamic Warehouse for the XML Data of the Web
  • Xyleme

13
Xyleme
  • Warehouse
  • Xyleme stores huge quantities of data (terabytes)
  • Xyleme is not a search engine (only index) or a
    mediator (only virtual data)
  • XML
  • Xyleme is focused on XML, i.e., trees
  • Dynamic
  • Xyleme is interested in data evolution/changes

14
Xyleme
  • September 1999: a group of researchers from
  • INRIA Rocquencourt, Verso Group
  • U. of Mannheim, Database Group
  • U. of Orsay, IASI Group
  • CNAM, Vertigo Group
  • September 2000: creation of a start-up
  • November 2000: about 15 people

15
Corporate Information Today
  • Ad-hoc applications written by web experts,
    tailored for specific tasks and data, i.e.,
    inflexible and expensive
  • Manual searches using browsers
  • Manual updates of the information system
16
Corporate Information with Xyleme
[Figure labels: Xyleme warehouse (crawling and interpreting data, repository, query engine); Xyleme API (publishing, searches, queries, updates); Information System]
17
Five Challenges
  • 1. Data Acquisition and Maintenance
  • discover data of interest and keep it up to
    date
  • 2. Repository
  • store this data and index it so that it can be
    processed efficiently
  • 3. Query Processing
  • efficiently support an SQL-style query language

18
Five Challenges - continued
  • 4. Semantic Integration
  • Understand DTDs and tags, partition the Web into
    semantic domains, provide a simple view of each
    domain
  • 5. Change Control
  • Monitor the web and offer services such as Query
    Subscription

19
Challenges - continued
  • Scale to the web
  • Size of data: millions/billions of pages
  • Size of index: terabytes
  • Number of customers
  • thousands of simultaneous queries
  • millions of subscriptions

20
Functional Architecture
[Figure labels: Internet; Web Interface; Query Processor; Repository and Index Manager]
21
Architecture
  • Cluster of PCs
  • Developed with Linux and C++
  • Communications
  • local: Corba
  • external: HTTP
  • Distribution between autonomous machines

22
Architecture
[Figure labels: Internet; Ethernet]
23
1. Data Acquisition and Maintenance
24
Goals
  • Discover XML pages on the web that are of
    interest for customers
  • For this, crawl the web (HTML + XML)
  • Keep them up to date
  • Do this under bounded resources

25
Life Cycle of a page in Xyleme
  • The URL of D is discovered as a link in another
    page (or published by a customer)
  • The page scheduler decides to read D
  • The metadata of D is read
  • type, last_date_update...
  • The document D is loaded
  • The document D is (re)read regularly

26
Main Issues
  • Loading of pages
  • we can load up to 5 million pages/day on a
    standard PC
  • main cost is Internet connection
  • Metadata management
  • Page scheduling
  • decide which page to read or refresh next

27
Metadata Management
  • Example: management of the link matrix
  • page i points to page j
  • for 1 billion URLs, about 30 children/URL
  • the matrix has $30 \cdot 10^{9}$ edges (very sparse)
  • For each page that is read,
  • find the IDs of the 30 children
  • 50 pages/second ⇒ 1500 database calls/second

28
Page Scheduling
  • Decide which page to read next
  • discovery (read first) and refresh (read again)
  • Based on
  • Importance of the page
  • read important pages often
  • also used to order query results
  • Change rate of the page
  • don't read a page that is probably up-to-date

29
Page Scheduling for Refresh
  • Determine the refresh frequency $f_i$ for each page $i$
    to minimize a cost function
  • Minimize $\sum_{i=1}^{N} \mathrm{cost}_i(f_i)$
  • Under the constraint $G \geq \sum_{i=1}^{N} f_i$
  • where $\mathrm{cost}_i(f_i)$, the penalty for page $i$, depends on
    the estimated importance and staleness of the
    page
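
As an illustration of this kind of budgeted optimization, the sketch below allocates a total refresh budget across pages. The slides do not give the actual penalty, so the form `cost_i(f_i) = w_i / f_i` (with `w_i` standing for a combined importance and change-rate weight) is purely an assumption chosen because it has a closed-form solution: a Lagrange multiplier argument gives `f_i` proportional to `sqrt(w_i)`. The function names and sample weights are hypothetical.

```python
import math

def allocate_refresh_frequencies(weights, budget):
    """Split a total refresh budget across pages.

    Assumes the illustrative penalty cost_i(f_i) = w_i / f_i with w_i > 0.
    Minimizing the summed penalty subject to sum(f_i) = budget gives
    f_i proportional to sqrt(w_i).
    """
    roots = [math.sqrt(w) for w in weights]
    total = sum(roots)
    return [budget * r / total for r in roots]

def total_cost(weights, freqs):
    """Summed penalty under the same assumed cost function."""
    return sum(w / f for w, f in zip(weights, freqs))

weights = [9.0, 4.0, 1.0]      # hypothetical page weights
budget = 60.0                  # hypothetical total number of refreshes per period
freqs = allocate_refresh_frequencies(weights, budget)
uniform = [budget / len(weights)] * len(weights)

print(freqs)
print(total_cost(weights, freqs), total_cost(weights, uniform))
```

Under this assumed penalty, the weighted allocation gives a lower total cost than refreshing every page equally often, which is the point of scheduling by importance and change rate.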

30
Cost Function
  • $\mathrm{cost}_i(f_i)$, the penalty for page $i$, depends on the
    estimated importance and staleness of the page
  • Importance of the page
  • link structure
  • pub/sub
  • Staleness of the data
  • penalty for being out of date
  • penalty for aging

31
Evaluation of Change Rate
  • Based on the Last Date of Change
  • provided by HTTP header of the page
  • in general reliable but
  • Based on the number M of changes detected the
    last N times the page was refreshed
  • limitation: the actual number of changes is not known
  • The first one is more precise

32
Page Importance: Link Structure
  • Intuition: a page is important if many important
    pages reference it (fixpoint)
  • Link matrix
  • $M(i,j) = 1$ if page $i$ refers to page $j$, 0 otherwise
  • $M$ is a $10^{9} \times 10^{9}$ matrix
  • $\mathrm{out}(i)$: the outdegree of page $i$
  • Fixpoint
  • $W_0(k) = 1/N$ (initialization)
  • $W_m(k) = \sum_i M(i,k)\, W_{m-1}(i) / \mathrm{out}(i)$

33
Page Importance: Algorithm
[Figure: a line $M(i,-)$ of the link matrix, with the vectors $W_{m-1}$, $\mathrm{out}$, and $W_m$]
  • $M(i,-)$ is stored as a list
  • computation of $W_m$ (line by line):
    for $i = 1$ to $N$ do: read $M(i,-)$, process the line
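
The line-by-line computation can be sketched in a few lines of Python. This is a minimal in-memory version, not the distributed implementation: `links[i]` plays the role of the stored line M(i,-), and the handling of pages without outgoing links (they simply pass nothing on) is an assumption, since the slides do not say how such pages are treated.

```python
def page_importance(links, iterations=10):
    """Iterate W_m(k) = sum_i M(i,k) * W_{m-1}(i) / out(i).

    links[i] is the list of pages that page i points to, i.e. the line M(i,-).
    """
    n = len(links)
    weights = [1.0 / n] * n                       # W_0(k) = 1/N
    for _ in range(iterations):
        new_weights = [0.0] * n
        for i, targets in enumerate(links):       # read M(i,-)
            if not targets:
                continue                          # assumption: no outgoing links, nothing to distribute
            share = weights[i] / len(targets)     # W_{m-1}(i) / out(i)
            for k in targets:                     # process the line
                new_weights[k] += share
        weights = new_weights
    return weights

# Hypothetical three-page web: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
links = [[1, 2], [2], [0]]
print(page_importance(links, iterations=60))
```

Each pass touches every line of the matrix exactly once and only needs the two weight vectors in memory, which is what makes the sequential, list-based storage of M(i,-) sufficient.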
34
Page Importance Fixpoint
  • Techniques for fixpoint convergence
  • Some results
  • convergence is fast (≈ OK after 10 iterations)
  • single precision suffices
  • possible on a standard PC
  • Distribution and incremental evaluation

35
Page Importance Refresh
  • Standard importance for HTML/XML pages
  • HTML pages are useful only to discover XML
  • Taking pub/sub into account
[Figure legend: circle = HTML, square = XML, triangle = pub/sub]
36
2. XML Repository
37
Storing XML documents
  • Relational store (e.g., Oracle 8i)
  • binary large objects: not possible to access
    elements directly
  • strongly typed data: tables are efficient
  • otherwise: too many joins, inefficient
  • Object database store (ODMG)
  • better adapted
  • Native XML storage: Natix

38
Natix Repository
  • Goal
  • minimize I/O for direct access and scanning
  • efficient direct accesses using indexing
  • good compaction but not at the cost of access
  • Efficient storage of trees
  • use fixed length storage pages
  • variable length records inside a page
  • Main issue: tree balancing

39
Tree Balancing
[Figure: a tree split across Record 1, Record 2, and Record 3]
40
Tree Balancing - continued
Large collections may use several records
41
3. Semantic Data Integration
42
Web Heterogeneity
  • Semantic domains, e.g., cinema
  • Many possible types for data in this domain, many
    DTDs
  • Semantic Integration
  • one abstract DTD for the domain
  • gives the illusion that the system maintains a
    homogeneous database for this domain
  • 1 domain = 1 abstract DTD

43
Cluster DTDs and Documents
The relationship between documents is not visible unless one knows the
relationship between story and tale.
44
Discover the Domains
  • Cluster DTDs sharing similar tags using data
    mining techniques (frequent item sets) and
    linguistic tools (e.g., thesaurus, heuristics to
    extract words from composite words or
    abbreviations, etc.)
  • to obtain domains
[Figure: many concrete DTDs (cdtd1 ... cdtd10) grouped under fewer abstract DTDs (adtd1, adtd2, adtd4)]
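
A rough feel for grouping DTDs by shared tags can be had from the sketch below. It swaps in a much simpler technique than the one named on the slide: a greedy pass using Jaccard similarity of tag sets, with no frequent item set mining and no linguistic tools. The DTD names, tag sets, and threshold are all hypothetical.

```python
def jaccard(a, b):
    """Similarity of two tag sets: size of the overlap over size of the union."""
    return len(a & b) / len(a | b)

def cluster_dtds(dtd_tags, threshold=0.5):
    """Greedy clustering: a DTD joins the first cluster whose tags are similar enough."""
    clusters = []                                  # list of (tag set, member names)
    for name, tags in dtd_tags.items():
        for cluster_tags, members in clusters:
            if jaccard(tags, cluster_tags) >= threshold:
                members.append(name)
                cluster_tags |= tags               # grow the cluster's tag set in place
                break
        else:
            clusters.append((set(tags), [name]))   # no match: start a new cluster
    return [members for _, members in clusters]

# Hypothetical concrete DTDs, each reduced to its set of tags.
dtd_tags = {
    "cdtd1": {"movie", "title", "director", "actor"},
    "cdtd2": {"movie", "title", "actor", "year"},
    "cdtd3": {"product", "price", "reference"},
    "cdtd4": {"product", "price", "designation"},
}
domains = cluster_dtds(dtd_tags)
print(domains)
```

A purely syntactic comparison like this would miss the story/tale kind of relationship, which is where the thesaurus and word-extraction heuristics come in.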
45
WordNet: Useful Relationships
  • Synonyms → one concept, two terms
  • Hypernyms / Hyponyms → two concepts linked
    through generalization/specialization
  • e.g., vehicle / car
  • Meronyms / Holonyms → two concepts linked
    through composition/inclusion
  • e.g., country / city


46
Choose an Abstract DTD / Domain
  • Automatically
  • The analysis of a cluster leads to clusters of
    tags
  • Use a thesaurus (e.g., Wordnet) to build a
    hierarchy from the clusters of tags
  • Manually
  • Performed by a domain expert
  • Hybrid

47
Mapping Concrete to Abstract
  • For each concrete DTD in a domain, find how it
    relates to the abstract DTD
  • Associate concrete tags to abstract tags using
    linguistic tools
  • Provide relationships between paths in the
    concrete and abstract DTD
  • E.g., cdtd3/œuvre/nom/prénom and
    adtd2/book/author/name/firstname
  • Possibly automatic, manual or hybrid
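
One simple way to picture the result of this step is a table of path-to-path correspondences that can be looked up in both directions. The sketch below is only an illustration of that idea: the first entry is the example from the slide, the second entry and all function names are hypothetical, and nothing here reflects how Xyleme actually stores its mappings.

```python
from collections import defaultdict

# (concrete path, abstract path) pairs. The first comes from the slide,
# the second is a made-up entry added for illustration.
MAPPINGS = [
    ("cdtd3/œuvre/nom/prénom", "adtd2/book/author/name/firstname"),
    ("cdtd3/œuvre/titre", "adtd2/book/title"),
]

def build_indexes(mappings):
    """Build lookups in both directions from a list of path pairs."""
    to_abstract = {}
    to_concrete = defaultdict(list)
    for concrete, abstract in mappings:
        to_abstract[concrete] = abstract
        to_concrete[abstract].append(concrete)
    return to_abstract, dict(to_concrete)

to_abstract, to_concrete = build_indexes(MAPPINGS)
print(to_concrete["adtd2/book/author/name/firstname"])
print(to_abstract["cdtd3/œuvre/titre"])
```

Note that the mapped paths need not have the same length: the correspondence is between whole paths, not between individual tags.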


48
4. Query Processing
49
Xyleme Query Language
  • Today: a mix of OQL and XQL
  • Tomorrow: the future W3C standard
  • Example:
    select product/name, product/price
    from doc in catalogue,
         product in doc/product
    where product//components contains "flash"
      and product/description contains "camera"
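
To show what the example query asks for, here is a small Python sketch that evaluates the same conditions over one in-memory document. It is not the Xyleme query engine: the sample catalogue, the helper names, and the reading of `contains` as a plain substring test are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalogue document.
CATALOGUE = """<catalogue>
  <product><name>C100</name><price>359.99</price>
    <description>compact camera</description>
    <parts><components>lens, flash</components></parts></product>
  <product><name>P7</name><price>120</price>
    <description>piano stool</description>
    <components>wood</components></product>
</catalogue>"""

def text_of(element):
    """All text found in an element and its descendants."""
    return "".join(element.itertext())

def contains(elements, word):
    """True if any of the elements contains the word (substring test)."""
    return any(word in text_of(e) for e in elements)

def run_query(doc):
    results = []
    for product in doc.findall("product"):                        # product in doc/product
        if (contains(product.iter("components"), "flash")          # product//components
                and contains(product.findall("description"), "camera")):
            results.append((product.findtext("name"), product.findtext("price")))
    return results

results = run_query(ET.fromstring(CATALOGUE))
print(results)
```

The `//` step is what lets the first product match even though its `components` element sits one level deeper than in the second product.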

50
Data Distribution
  • Cluster of documents: physical collection of
    documents (≈ semantic domain)
  • Distribution
  • Storage machine
  • in charge of a cluster of documents
  • Index machine
  • index for a cluster

51
Step 0: Indexing
  • Standard inverted index
  • word → documents that contain this word
  • Xyleme index
  • word → elements that contain this word
  • document + element identifier
  • Goal: more work can be performed without
    accessing data
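
The difference between the two kinds of index can be sketched as follows. This is a toy version under stated assumptions: element identifiers are plain preorder positions, only the text directly inside an element is indexed, and the sample documents are hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

def build_element_index(documents):
    """Map each word to the (document id, element id) pairs that contain it.

    Assumptions for this sketch: element ids are preorder positions starting
    at 1, and only the text directly inside an element is indexed.
    """
    index = defaultdict(set)
    for doc_id, xml_text in documents.items():
        root = ET.fromstring(xml_text)
        for element_id, element in enumerate(root.iter(), start=1):
            for word in re.findall(r"\w+", (element.text or "").lower()):
                index[word].add((doc_id, element_id))
    return index

# Hypothetical documents.
documents = {
    "d1": "<product><name>camera</name>"
          "<description>camera with flash</description></product>",
    "d2": "<product><name>piano</name></product>",
}
index = build_element_index(documents)

# Which elements mention both words? Answered from the index alone.
both = index["camera"] & index["flash"]
print(sorted(both))
```

A document-level index could only say that d1 contains both words. The element-level entries already identify the single element where they occur together, without loading the document.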

52
Step 1: Localization
  • Global query on an abstract DTD
  • Localization of the machines that host concrete DTDs
    that will participate in the query
  • E.g., catalogue/product/price is relevant for
    machine 56 and machine 45
  • Local queries: union of the queries on the local machines
53
Step 2: Optimization
  • Algebraic rewriting
  • Linear search strategy based on simple heuristics
  • use in-memory indexes
  • minimize communication
  • Optimization of the global plan
  • Optimization of the local plans

54
Step 3: Execution
  • A plan usually consists of
  • 1. parallel translation from abstract queries to
    concrete patterns on the relevant index machines
  • 2. parallel index scans to identify the relevant
    elements for a concrete pattern
  • 3. parallel construction of resulting elements
  • 4. pipelined evaluation (i.e., no intermediate
    data structure)
  • Note: step 2 requires smart indexes

55
Execution: Abstract2Concrete
  • For catalogue/product/price, scan the relevant
    concrete patterns: d1//camera/price,
    d2/product/cost, d3/piano/price, ...
  • For each concrete pattern, the local plan is
    optimized dynamically
  • For each concrete pattern, scan the element ids
    (234, 177, ...)
56
Element Identifiers
  • Essential for query processing
  • Identifier: (preorder rank, postorder rank)
  • X ancestor of Y <=> pre(X) < pre(Y) and
    post(X) > post(Y)
  • E.g., 2 < 5 and 4 > 2 => (2,4) is an ancestor of (5,2)

[Figure: a tree with nodes A to G and text leaves, each node labeled with its preorder and postorder ranks]
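
The identifier scheme is easy to reproduce on a small tree. The sketch below numbers the nodes of a hypothetical seven-node tree (not necessarily the tree of the slide's figure) in preorder and postorder, then applies the ancestor test to the identifiers alone.

```python
def number_nodes(tree):
    """Return {label: (preorder rank, postorder rank)}, ranks starting at 1.

    A tree is a (label, children) pair; labels are assumed to be unique.
    """
    ranks = {}
    pre_counter = 0
    post_counter = 0

    def visit(node):
        nonlocal pre_counter, post_counter
        label, children = node
        pre_counter += 1
        pre = pre_counter                 # rank assigned on the way down
        for child in children:
            visit(child)
        post_counter += 1                 # rank assigned on the way back up
        ranks[label] = (pre, post_counter)

    visit(tree)
    return ranks

def is_ancestor(x, y):
    """X is an ancestor of Y iff pre(X) < pre(Y) and post(X) > post(Y)."""
    return x[0] < y[0] and x[1] > y[1]

# Hypothetical tree: A has children B and E; B has C, D; E has F, G.
tree = ("A", [("B", [("C", []), ("D", [])]),
              ("E", [("F", []), ("G", [])])])
ranks = number_nodes(tree)
print(ranks)
print(is_ancestor(ranks["B"], ranks["D"]), is_ancestor(ranks["B"], ranks["F"]))
```

Because the test needs nothing but the two pairs of numbers, structural conditions in a query can be checked on index entries without touching the stored documents.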
57
Patterns and Indexes
[Figure: index entries matched against a pattern: (d1, 12, 200), (d1, 201, 400); (d1, 1, 11), (d1, 205, 224); (d1, 228, 237); (d1, 229), (d2, 14)]
  • Heuristic for performing joins: start with the
    smallest cardinality (to minimize the size of
    intermediate results)
58
5. Change Control
59
The Web changes all the time
  • Data acquisition and maintenance
  • keep the warehouse up-to-date
  • Version management
  • representation and storage of changes
  • Change monitoring
  • query subscription

60
Versions
  • Version some documents or some sites
  • Version some continuous queries
  • continuous query: a query that is evaluated
    regularly
  • e.g., get each Monday the list of movies showing in
    Paris

61
Representing Versions: Deltas
  • Version storage
  • current document
  • persistent identifiers for elements
  • description of changes - completed deltas
  • Deltas are XML documents
  • Changes can be processed like other data
  • exchanged: send me the changes since June 1st!
  • queried: what are the products inserted since
    2/1/99?

62
Completed Delta
<delta>
  <unit_delta previousversion="1" thistime="2">
    <delete xid="11" xid-children="(17-21)">
      <Product>
        <Name>DVD</Name> <Price>500</Price>
      </Product>
    </delete>
    <move xid="16" new_parent="11" new_position="2"
          old_parent="11" old_position="1" />
    <update xid="3" new_value="50" old_value="100" />
    <update xid="8" new_value="100" old_value="150" />
  </unit_delta>
  <unit_delta>...</unit_delta>
</delta>
(xid: persistent identifier)
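
Since deltas are XML documents, they can be read with any XML tool. The sketch below parses a cleaned-up copy of the delta above (attribute quoting added so that it is well-formed) and extracts two things: the list of operations, and the old values of updated elements. The function names are hypothetical, and the idea of undoing updates relies only on the fact that this delta records both `old_value` and `new_value`.

```python
import xml.etree.ElementTree as ET

# Cleaned-up copy of the example delta, with quoted attributes.
DELTA = """<delta>
  <unit_delta previousversion="1" thistime="2">
    <delete xid="11" xid-children="(17-21)">
      <Product><Name>DVD</Name><Price>500</Price></Product>
    </delete>
    <move xid="16" new_parent="11" new_position="2"
          old_parent="11" old_position="1"/>
    <update xid="3" new_value="50" old_value="100"/>
    <update xid="8" new_value="100" old_value="150"/>
  </unit_delta>
</delta>"""

def summarize(delta_xml):
    """Return one (operation, xid) pair per change, in document order."""
    root = ET.fromstring(delta_xml)
    return [(op.tag, op.get("xid"))
            for unit in root.findall("unit_delta")
            for op in unit]

def old_values(delta_xml):
    """Map each updated xid to its previous value, as recorded in the delta."""
    root = ET.fromstring(delta_xml)
    return {u.get("xid"): u.get("old_value") for u in root.iter("update")}

print(summarize(DELTA))
print(old_values(DELTA))
```

The persistent identifiers (`xid`) are what make such a delta meaningful across versions: they name the same element before and after the change.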
63
Query Subscription
  • Users may subscribe to certain events, e.g.,
  • changes in a page, a set of pages,
  • changes in pages from a particular semantic
    domain, containing some specific words or with a
    particular DTD
  • changes of particular elements somewhere (new
    products in a catalog)
  • Users may request to be notified
  • immediately at the time the event is detected
  • regularly, e.g., weekly
  • after a certain number of event detections

64
Example
  • subscription myPariscope
  • (what are the new movie entries on the Pariscope
    site?)
  • monitoring newMovies
  • select URL
  • where URL extends www.pariscope.fr/movies/
  • and new(self)
  • (manage the changes in the movies showing in
    Paris)
  • continuous delta Showing
  • select ... from ... where
  • when daily
  • notify daily (send me a daily report)

65
Step 1: Atomic Event Detection
  • 5 million pages/day
  • atomic event 46: URL matches the pattern
    www.inria.fr/
  • atomic event 67: XML document contains the tag
    painter
[Figure: while a document d is loaded, the metadata manager, the HTML parser, and the XML loader raise document alerts (d/46, then d/46,67) that are sent to complex event detection]
66
Step 2: Complex Event Detection
  • Millions of page alerts/day, millions of
    subscriptions
  • complex event 12 = 67 and 46 (XML document contains
    the tag painter and URL matches the pattern
    www.inria.fr/)
[Figure: the HTML parser and the XML loader feed complex event detection]
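
The detection step can be pictured as matching each document alert (a set of atomic event ids) against the conjunctions registered by subscriptions. The sketch below is a small in-memory illustration of that idea and not the algorithm used in Xyleme, which has to cope with millions of subscriptions. Event 12 and its two atomic events come from the slide; events 13, 14, and 99 and the class name are hypothetical.

```python
from collections import defaultdict

class ComplexEventDetector:
    """Complex events are conjunctions of atomic events."""

    def __init__(self):
        self.required = {}                   # complex id -> frozenset of atomic ids
        self.by_atomic = defaultdict(set)    # atomic id -> complex ids that use it

    def subscribe(self, complex_id, atomic_ids):
        self.required[complex_id] = frozenset(atomic_ids)
        for atomic_id in atomic_ids:
            self.by_atomic[atomic_id].add(complex_id)

    def detect(self, alert):
        """Return the complex events fully covered by one document's alert."""
        alert = set(alert)
        candidates = set()
        for atomic_id in alert:              # only look at complex events that share an atomic event
            candidates |= self.by_atomic.get(atomic_id, set())
        return sorted(c for c in candidates if self.required[c] <= alert)

detector = ComplexEventDetector()
detector.subscribe(12, [67, 46])   # from the slide: tag painter and URL pattern www.inria.fr/
detector.subscribe(13, [46])       # hypothetical
detector.subscribe(14, [67, 99])   # hypothetical

print(detector.detect({46}))
print(detector.detect({46, 67}))
```

Indexing complex events by their atomic events means an alert is compared only with the subscriptions it could possibly satisfy, rather than with all of them.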
67
Step 3: Notification Processor
  • Millions of notifications/day
[Figure labels: complex event detection (alerts); notification processor; clock (triggers); continuous queries; notification/monitoring; notification/results]
68
Conclusion
69
One Question Only
  • The web is turning from a large collection of
    documents into a huge knowledge base
  • When will I be able to get the precise knowledge
    I need?
  • Database + Knowledge Base + Linguistic ...

70
Thank you