XML Warehousing and Xyleme - PowerPoint PPT Presentation

1
XML Warehousing and Xyleme
  • S. Abiteboul
  • INRIA and Xyleme
  • Serge.Abiteboul_at_inria.fr
  • December 2002

2
Organization
  • The context and motivations
  • XML warehouse
  • Xyleme: An XML warehouse
  • Zooms on some aspects of the technology
  • Scaling
  • Mass storage of XML
  • XML query processing
  • Semantic integration
  • Web page ranking
  • Query subscription
  • Xyleme the company, in very brief

3
The context
  • The Web and XML are dramatically changing the
    world of distributed information

4
The Web of yesterday
  • Protocol: HTTP
  • Documents: HTML
  • Millions of independent web sites and billions of
    documents
  • Browsing and keyword search (full-text indexing)
  • Publication of databases using forms
  • Data management with the Web
  • HTML is primarily for humans
  • Data management applications on the Web
  • Based on hand-made wrappers
  • Expensive, incomplete, short-lived, not adapted
    to the Web's constant change
  • No real support for distributed data management!

5
What is changing
  • Information used to live in islands and a lot of
    its value was wasted
  • Different formats: relational, metadata,
    documents and text, data exchange formats
  • A Web standard for data exchange, XML, is fixing
    it
  • XML can capture a wide spectrum of information
  • XML comes with a family of emerging standards:
    XML Schema, XSLT, XQuery, domain-specific
    schemas
  • Different computers, platforms, languages,
    applications
  • Web services, e.g., SOAP, are fixing it
  • SOAP allows ubiquitous computing on the Internet
  • SOAP comes with a family of emerging standards:
    WSDL, UDDI

6
What is changing
  • XML and Web services provide uniform access to
    information, independent of platform, system,
    language, communication protocol and data format
  • The dream for distributed data management
  • The gathering, integration, consolidation,
    analysis of distributed information become
    feasible at a much lower cost

7
(1) XML covers the information spectrum
[Figure labels: Structured Data, Meta data, Hierarchy. Items shown: books, contracts, catalogs, bank accounts, emails, financial reports, insurance policies, economic analysis, derivatives, inventory, political analysis, insurance claims, financial news, sports news, resumes]
8
XML covers the information spectrum
  • Very structured information such as databases
  • Most DBMS now export in XML
  • Semi-structured data such as data exchange
    formats (ASN.1, SGML), e.g., technical
    documentation
  • Documents
  • Metadata: author, date, status
  • Existing structure in them: chapter, section,
    table of contents and index
  • Possibly tagging of elements in it (citation,
    lists)
  • Links to other documents
  • Meta data for unstructured data such as images
    and sound
  • Plain text

XML
9
XML's asset: the marriage of text and structure
  • labeled ordered trees where leaves are text
  • Marriage of document and database worlds
  • Marriage of full text indexing (keyword search)
    and structure indexing (SQL-style query)
  • Is it the ultimate data model? No
  • Purely syntax: more semantics needed
  • Is it OK for now? Definitely yes (because it is a
    standard)

10
XML's asset: typing
  • Applications need typing and XML data can be
    typed if needed (DTD and XML schema)
  • Trees
  • Logical granularity: neither page nor document
    level, but the piece of information that is
    needed
  • Semantics and structure are in tags and paths
  • product-table/product/reference
  • product-table/product/price
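
As a minimal sketch of the point that semantics and structure live in tags and paths, the two paths on this slide can be evaluated with Python's standard `xml.etree.ElementTree`. The catalog content (references and prices) is invented for illustration; only the element names come from the slide.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog: only the element names product-table, product,
# reference and price come from the slide; the values are made up.
CATALOG = """
<product-table>
  <product><reference>A-100</reference><price>249.00</price></product>
  <product><reference>B-200</reference><price>18.50</price></product>
</product-table>
"""

root = ET.fromstring(CATALOG)  # root is the <product-table> element

# Paths are relative to the root: they select the piece of information
# that is needed, not a whole page or document.
references = [e.text for e in root.findall("product/reference")]
prices = [float(e.text) for e in root.findall("product/price")]

print(references)
print(prices)
```

The same two path expressions would work on any document that follows this structure, which is what makes path-level granularity useful for querying.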

11
HTML: hard
Text presentation. Where is the data?
12
XML: easy
Data structure: semistructured (presentation elsewhere)
13
(2) Web services and ubiquitous distributed
computing
  • Possibility to activate a method on some remote
    web server
  • Exchange information in XML: input and result are
    in XML
  • Ubiquitous XML distributed computing
    infrastructure
  • 2 main applications
  • E-commerce
  • Access to remote data
  • With XML and Web services, it is possible
  • To get information from virtually anywhere
  • To provide information to virtually anywhere

14
Accessing remote information
[Figure: an application using gene banks queries some data services that provide candidate genes and uses some processing services; the gene banks have heterogeneous formats, protocols, etc.]
15
Same with web services
[Figure: the same application using gene banks, now querying the data services and using the processing services through the Web, with uniform access to information]
16
XML and Web services
  • Exchange of information
  • E-commerce, B2B, G2C
  • Cooperative work
  • Information brokers
  • Web sites, portals
  • Content publication in general
  • Mediation mode: get the XML pages when needed
  • Warehouse mode: load them in advance

17
Advantages of a warehouse approach
  • Allows for support of complex query processing
    with high performance
  • Allows for complex analysis of the data
  • Allows for enriching the information
  • Allows for better monitoring of information
  • Allows for versioning, archiving, temporal
    queries if needed
  • Mediator approach is preferable or compulsory in
    some applications
  • Supply chain
  • Comparative shopping
  • Typically for volatile information such as plane
    ticket price

18
XML warehouse
19
Main functionalities
[Figure labels: Admin GUI; User GUI (access, reporting, sub); User GUI (editing, pub); View/Integration; Enrichment; Feeding; Exploitation; Repository; API; Warehousing (data warehouse); Analysis (OLAP)]
20
Main functionalities (1): Feeding
  • Loading from the Web (Internet and Intranet)
  • Web search
  • Web crawl
  • Access Web data via forms or Web services
  • Plug-ins to load from
  • File systems, document management systems
  • Data bases, LDAP
  • Newsgroup, emails
  • Other applications
  • Extraction and transformation
  • XSLT or XQuery mappings for XML sources
  • XML-izers to load data from other formats
  • Monitoring of the feeding
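
To make the idea of an "XML-izer" concrete, here is a small sketch that turns CSV text into XML using only the Python standard library. It is not Xyleme's loader: the function name `xmlize_csv`, its parameters, and the sample data are assumptions made for illustration, and the sketch assumes the CSV column names are valid XML element names.

```python
import csv
import io
import xml.etree.ElementTree as ET


def xmlize_csv(text, root_tag="table", row_tag="row"):
    """Turn CSV text with a header row into an XML element tree.

    Assumes every column name is a valid XML element name.
    """
    root = ET.Element(root_tag)
    for record in csv.DictReader(io.StringIO(text)):
        row = ET.SubElement(root, row_tag)
        for column, value in record.items():
            ET.SubElement(row, column).text = value
    return root


# Hypothetical source data in a non-XML format.
SAMPLE = "reference,price\nA-100,249.00\nB-200,18.50\n"

doc = xmlize_csv(SAMPLE, root_tag="product-table", row_tag="product")
xml_text = ET.tostring(doc, encoding="unicode")
print(xml_text)
```

Once the data is in XML, the same mappings used for native XML sources can be applied to it.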

21
Main functionalities (1): Feeding, continued
  • User feeding
  • Document editing
  • Meta data editing
  • Using WebDAV protocol
  • Publication
  • By GUI or from programs (SOAP-based API)

22
Main functionalities (2): Repository
  • Storage of massive volume of XML (terabytes)
  • Indexing of massive volume of XML
  • By structure
  • By full-text
  • Linguistic support: stemming, synonyms, etc.
  • Very efficient XML query processing
  • Importance ranking
  • Monitoring of the warehouse (support for
    subscriptions)
  • Access control and security
  • Versioning, archiving
  • Recovery
  • No full transaction mechanism

23
Main functionalities (3): Enrichment
  • Global organization
  • Global schema management
  • Management of collections
  • Incorporate domain ontologies and thesauri
  • Document classification
  • Cleaning by filtering out documents from
    collections, etc.
  • Document enrichment
  • Concept extraction and tagging
  • Cleaning inside the document
  • Summarization, etc.
  • Relationships between documents
  • Tables of contents
  • Tables of index
  • Cross referencing, etc.

24
Main functionalities (4): View and integration
  • View management
  • Document restructuring/mapping
  • Schema to schema mapping
  • Semantic integration
  • Manual for complex ones and (semi-) automatic for
    simple ones
  • Tools to analyze a set of schemas
  • Tools to integrate them
  • Processing for queries on integration view
  • Management of virtual data in a mediator style

25
Main functionalities (5): Exploitation
  • Access to the warehouse
  • Browsing
  • Querying by keywords, XPath or XQuery
  • Temporal queries
  • Query subscription
  • Reporting
  • Generation of complex reports with pointers to
    documents, counts, abstracts
  • Organized by collections, content, domains
  • By GUI or from programs (Web service-based API)

26
Admin: specify the lifecycle of information in
the warehouse, starting from its acquisition
  • Specify with parameters (in red) documents to
    process
  • Add from a toolbox, some processing to apply (in
    pink)
  • Specify when processing should be applied (in
    green)

27
Specifying the enrichment
  • What processing should be performed
  • Applications that come with the system
  • Arbitrary processing provided as Web services
  • Interface of services
  • XML input: the documents or collection of
    documents in the warehouse to be processed
  • XML output: the result
  • Where to plug the result
  • Where to store the new documents (collections,
    names)
  • Where to put enrichments in existing documents
  • When to start the processing
  • At the time the document is loaded
  • At some later time, assuming some information has
    already been gathered (dependencies)

28
User queries and reporting
  • Choose the collections of interest: FROM clause
  • Choose the criteria of selection: WHERE clause
  • Choose what to extract as a result: SELECT clause
  • Quantity of results, preference ranking and
    possible relaxation: PREFER clause
  • Classify/group results for presentation and
    drilling: ORGANIZE clause
  • Choose presentation style: STYLE clause
29
Example
  • From collections MuséeRodin, WebMuseum, LACMA
  • Where Art_Item/ artist Name = Rodin
  • Select Name, Owner, Annotations
  • Prefer
  • Rodin in title page
  • Owner is public or owner is in France
  • Get first 20
  • Organize as
  • Art_Item/material: sculpture, painting, others
  • Owner
  • Present as

30
Xyleme: An XML warehouse
Zooms on some aspects of the technology
31
Xyleme: a dynamic XML warehouse
  • Scaling
  • Feeder
  • E.g., loading with a single PC millions of Web
    documents per day and scale up with more
    machines
  • Repository
  • E.g., storing and indexing of terabytes of XML
    (other formats, e.g., pdf)
  • Enrichment
  • E.g., tools (together with partner) for
    classification and concept extraction
  • View and semantic integration
  • E.g., a suite of tools for XML integration
  • Exploitation
  • E.g., access via SOAP and graphic interfaces

32
1. An architecture to scale
33
The scaling
  • Size of data: billions of XML documents
  • Size of data and index: terabytes
  • Number of customers
  • thousands of simultaneous queries
  • millions of subscriptions
  • An architecture based on distribution

34
Architecture
  • Cluster of PCs
  • Runs on Linux and C++ (also Solaris)
  • Communications
  • local: Corba (Orbacus)
  • external: HTTP, SOAP
  • Distribution between autonomous machines

35
Functional architecture
[Figure labels: Internet; Web Interface; Query Processor; Repository and Index Manager]
36
Architecture and scaling
[Figure labels: Internet; Ethernet]
37
2. Data Acquisition and Maintenance of Web pages
(internet or intranet)
38
Crawl the Web
  • Discover HTML/XML pages on the web (intranet or
    internet)
  • Parse/load pages and follow links
  • Manage metadata for the known pages
  • Do this under bounded resources
  • Network bandwidth
  • Memory and disk resources
  • Tested on the Internet in October 2001
  • Millions of pages crawled per day on each crawler
  • Up to 10 crawlers and close to 1 billion HTML/XML
    pages discovered in a couple of months

39
Optimization
Page Scheduling
  • Optimization problem
  • Decide which page to crawl or refresh next to
    optimize the quality of the warehouse
  • Criteria
  • Read important pages more often
  • Based on customers' preferences
  • Page importance can also be used to order query
    results
  • Don't read a page that is probably up-to-date
  • Uses an estimate of the change frequency for each
    page
  • Advantages
  • Have a fresh view of useful portions of
    information

40
Page scheduling
  • Determine which page to read next
  • minimize a particular cost function under some
    constraint (bandwidth of crawlers)
  • The penalty for a page takes into account
  • importance of the page (to be defined next)
  • customer needs (obtained via pub/sub)
  • staleness of the data
  • penalty for being out of date
  • penalty for aging
  • The page scheduler fully controls the crawling
  • vs. random crawling in classic search engines
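
The slides do not give the cost function itself, so the following is only a sketch of how a penalty-driven scheduler could look. The `Page` fields, the `penalty` formula (importance plus customer weight, multiplied by the expected number of missed changes) and the greedy selection under a fixed budget are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    importance: float       # e.g. link-based importance
    customer_weight: float  # boost derived from subscriptions
    change_rate: float      # estimated changes per day
    age_days: float         # days since the page was last read


def penalty(page: Page) -> float:
    """Illustrative penalty: expected missed changes, weighted by how
    much the page matters. Not the cost function used by Xyleme."""
    expected_changes = page.change_rate * page.age_days
    return (page.importance + page.customer_weight) * expected_changes


def schedule(pages, budget):
    """Greedy choice: the `budget` pages with the highest penalty."""
    return sorted(pages, key=penalty, reverse=True)[:budget]


pages = [
    Page("a", importance=0.9, customer_weight=0.0, change_rate=1.0, age_days=0.5),
    Page("b", importance=0.1, customer_weight=0.0, change_rate=0.1, age_days=10.0),
    Page("c", importance=0.2, customer_weight=0.5, change_rate=2.0, age_days=1.0),
]
next_pages = [p.url for p in schedule(pages, budget=2)]
print(next_pages)
```

Here `budget` stands in for the crawler bandwidth constraint: page "c" is read first because a subscription boosts it and it changes often, while the rarely changing page "b" waits.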

41
Page Importance
  • Based on customers' criteria and on the link
    structure of the web
  • Intuition: a page is important if many important
    pages reference it
  • Fixpoint definition: importance vector Imp
  • Proposed by IBM, used by search engines such as
    Google
  • Link matrix: $M(i,j) = 1$ if page $i$ refers to page $j$, $0$ otherwise
  • Outdegree of page $i$: $\mathrm{out}(i)$
  • Initialization: $\mathrm{Imp}_0(k) = 1/N$
  • Iteration: $\mathrm{Imp}_m(k) = \sum_i M(i,k)\,\mathrm{Imp}_{m-1}(i)/\mathrm{out}(i)$
  • Imp is the limit
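
The fixpoint definition on this slide can be sketched as a plain iteration in Python. This is the classic off-line computation, not the on-line algorithm developed by Xyleme; the adjacency-list input format and the fixed number of iterations are assumptions, and the sketch assumes every page has at least one outgoing link.

```python
def page_importance(links, iterations=100):
    """Iterate the importance definition from the slide.

    links[i] is the list of pages that page i refers to, so
    len(links[i]) plays the role of out(i). Assumes out(i) > 0.
    """
    n = len(links)
    imp = [1.0 / n] * n  # initialization: importance 1/N for every page
    for _ in range(iterations):
        nxt = [0.0] * n
        for i, targets in enumerate(links):
            for k in targets:
                # Page i gives an equal share of its importance
                # to each page it refers to.
                nxt[k] += imp[i] / len(targets)
        imp = nxt
    return imp


# Hypothetical 3-page web: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
imp = page_importance([[1, 2], [2], [0]])
print([round(x, 3) for x in imp])
```

On this small graph the iteration settles at importances 0.4, 0.2 and 0.4: page 2 is referenced by both other pages, and page 0 inherits all of page 2's importance.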

42
Page Importance
  • Novel technology developed by Xyleme
  • Patent pending
  • On-line evaluation of page importance
  • Uses far fewer resources
  • Faster reaction to changes on the web

43
2. XML Repository
44
Storing XML
  • Document systems
  • Good for keyword search
  • No or inefficient support for structure search
  • Relational store (e.g., Oracle 8i)
  • Well adapted for some applications
  • Very typed data and tables: efficient
  • Otherwise: too many joins and inefficient
  • Object database store (e.g., eXcelon) and native
    XML databases (e.g., Tamino)
  • Same issues
  • Xyleme: native XML storage

45
Repository
  • Goal
  • minimize I/O for direct access and scanning
  • efficient direct accesses both with fulltext
    indexing and structure indexing
  • good compaction but not at the cost of access
  • Efficient storage of trees
  • use fixed length storage pages
  • variable length records inside a page
  • Main issue: tree balancing

46
Tree Balancing
[Figure: a tree stored across Record 1, Record 2 and Record 3]
47
Tree Balancing
Large collections may use several records
48
3. Semantic Data Integration
49
Classification
  • Based on word occurrences in document and
    statistical resources
  • Classification by semantic domain
  • Classification by language
  • Use the XXX classifier

50
Semantic Integration
  • Web Heterogeneity
  • Many possible types for data in a particular
    domain, many DTDs
  • Semantic Integration
  • one abstract DTD for the domain
  • gives the illusion that the system maintains a
    homogeneous database for this domain
  • 1 domain = 1 abstract DTD

51
Views
  • Choose an abstract DTD for each domain
  • For each concrete DTD in a domain, find how it
    relates to the abstract DTD using linguistic
    tools such as WordNet
  • Provide relationships between paths in the
    concrete and abstract DTD
  • Possibly automatic, manual or hybrid
  • With manual mapping, a domain expert may specify
    much more complex views
  • Query processing: process queries on the abstract
    DTD
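
A view of this kind can be pictured as a table from abstract paths to concrete paths. The sketch below is an assumption about the shape of such a table, not Xyleme's data structure; the three concrete paths are the ones the deck uses in its Abstract2Concrete slide, and `to_concrete` is a hypothetical helper name.

```python
# Abstract path -> concrete paths in the source DTDs (d1, d2, d3).
# The table layout is assumed; the paths are taken from the slides.
PATH_MAPPING = {
    "catalogue/product/price": [
        "d1//camera/price",
        "d2/product/cost",
        "d3/piano/price",
    ],
}


def to_concrete(abstract_path, mapping=PATH_MAPPING):
    """Return the concrete paths a query on abstract_path must scan."""
    return list(mapping.get(abstract_path, []))


concrete = to_concrete("catalogue/product/price")
print(concrete)
```

A query written once against the abstract DTD is then answered by the union of the matching concrete paths, which is what gives the illusion of one homogeneous database.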


52
4. Query Processing
53
Query Language
  • Today: a mix of OQL and XQL
  • Tomorrow: the future W3C standard
  • Example:
      select product/name, product/price
      from doc in catalogue,
           product in doc/product
      where product//components contains "flash"
        and product/description contains "camera"
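
To show what the example query asks for, here is a rough equivalent over in-memory documents using Python's `xml.etree.ElementTree`. It is only an approximation of the semantics: the sample catalogue is invented, `contains` is implemented as a case-insensitive whole-word match, and nothing here reflects how Xyleme evaluates the query.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical catalogue with a single document.
CATALOGUE = [
    "<doc>"
    "<product><name>SnapShot</name><price>199</price>"
    "<description>compact camera</description>"
    "<parts><components>lens, flash</components></parts></product>"
    "<product><name>Tripod</name><price>35</price>"
    "<description>camera stand</description>"
    "<components>legs</components></product>"
    "</doc>",
]


def contains(elements, word):
    """True if the text under any of the elements includes the word."""
    return any(
        word in re.findall(r"\w+", "".join(e.itertext()).lower())
        for e in elements
    )


def run_query(docs):
    results = []
    for text in docs:                            # from doc in catalogue
        doc = ET.fromstring(text)
        for product in doc.findall("product"):   # product in doc/product
            if (contains(product.findall(".//components"), "flash")
                    and contains(product.findall("description"), "camera")):
                # select product/name, product/price
                results.append(
                    (product.findtext("name"), product.findtext("price"))
                )
    return results


results = run_query(CATALOGUE)
print(results)
```

Note how `product//components` matches at any depth below the product, while `product/description` only matches a direct child.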

54
Data Distribution
  • Cluster of documents: physical collection of
    documents (? semantic domain)
  • Distribution
  • Storage machine
  • in charge of a cluster of documents
  • Index machine
  • index for a cluster

55
Step 0: Indexing
  • Standard inverted index
  • word → documents that contain this word
  • Xyleme index
  • word → elements that contain this word
  • document + element identifier
  • Goal: more work can be performed without
    accessing data
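
The difference between a document-level and an element-level inverted index can be sketched as follows. The function name, the use of preorder position as element identifier, and the sample documents are assumptions; for simplicity the sketch indexes only the text directly inside each element.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict


def build_element_index(docs):
    """docs maps a document id to XML text.

    Returns word -> set of (document id, element id) postings.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        root = ET.fromstring(text)
        # Element id = position in document order (an assumption).
        for element_id, element in enumerate(root.iter()):
            for word in re.findall(r"\w+", (element.text or "").lower()):
                index[word].add((doc_id, element_id))
    return index


# Hypothetical documents.
DOCS = {
    "d1": "<product><name>camera</name>"
          "<description>camera with flash</description></product>",
    "d2": "<note>flash sale</note>",
}

index = build_element_index(DOCS)
# Elements containing both words, found without opening any document.
both = index["camera"] & index["flash"]
print(sorted(both))
```

Because postings point at elements rather than whole documents, a conjunction of keywords inside the same element can be decided from the index alone.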

56
Step 1: Localization
  • Query on an abstract DTD
  • Localization of machines that host concrete DTDs
    that will participate in the query

[Figure: a global query on the abstract DTD becomes a union of local queries on local machines; e.g., catalogue/product/price is relevant for machine 56 and machine 45]
57
Step 2: Optimization
  • Algebraic rewriting
  • Linear search strategy based on simple heuristics
  • use in memory indexes
  • minimize communication
  • Optimization of the global plan
  • Optimization of the local plans

58
Step 3: Execution
  • A plan usually consists of
  • 1. parallel translation from abstract queries to
    concrete patterns on the relevant index machines
  • 2. parallel index scans to identify the relevant
    elements for a concrete pattern
  • 3. parallel construction of resulting elements
  • 4. pipeline evaluation (i.e., no intermediate
    data structure)
  • Note: step 2 requires smart indexes

59
Abstract2Concrete
For catalogue/product/price, scan the relevant
concrete patterns: d1//camera/price,
d2/product/cost, d3/piano/price, ...
For each concrete pattern, the local plan is
optimized dynamically
For each concrete pattern, scan the element ids:
234, 177
60
Identifiers
  • Essential for query processing
  • Identifier: (preorder rank, postorder rank)
  • X ancestor of Y <=> pre(X) < pre(Y) and
    post(X) > post(Y)
  • E.g., 2 < 5 and 4 > 2 => (2,4) ancestor of (5,2)

[Figure: a tree with nodes A to G, each labeled with a preorder rank and a postorder rank between 1 and 7; the leaves are text]
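
The identifier scheme can be sketched in a few lines of Python: number each node on the way down (preorder) and on the way up (postorder), then test ancestry by comparing the two ranks. The tree encoding as `(label, children)` tuples and the sample tree are assumptions; the ancestor test is the one stated on the slide.

```python
def number_tree(tree):
    """tree is (label, [children]); returns label -> (pre, post)."""
    ids = {}
    pre = post = 0

    def visit(node):
        nonlocal pre, post
        label, children = node
        pre += 1            # rank assigned when the node is entered
        my_pre = pre
        for child in children:
            visit(child)
        post += 1           # rank assigned when the node is left
        ids[label] = (my_pre, post)

    visit(tree)
    return ids


def is_ancestor(x, y):
    """x, y are (pre, post) identifiers."""
    return x[0] < y[0] and x[1] > y[1]


# Hypothetical tree: A has children B and E; B has children C and D.
TREE = ("A", [("B", [("C", []), ("D", [])]), ("E", [])])
ids = number_tree(TREE)
print(ids)
print(is_ancestor(ids["B"], ids["D"]), is_ancestor(ids["B"], ids["E"]))
```

The test needs only the two identifiers, so ancestor checks can run on index entries without touching the stored documents.
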
61
5. Change Control
62
Change management
  • Users are often interested in changes to the web
  • Change monitoring
  • query subscription
  • Soon to come: version management
  • representation and storage of changes

63
Query Subscription
  • Users subscribe to certain events such as
  • Update of a particular page, a page in a given
    site
  • Discovery of a new page containing some specific
    words
  • Insertion of a particular element in some pages
    (new products in a catalog)
  • Detection of illegal copies of selected documents
  • Users may request to be notified
  • Immediately at the time the event is detected
  • Regularly, e.g., weekly
  • After a certain number of event detections

64
Examples
  • subscription myPariscope
  • (what are the new movie entries on the Pariscope
    site?)
  • monitoring newMovies
  •   select URL
  •   where URL extends www.pariscope.fr/movies/
  •   and new(self)
  • (manage the changes in the movies showing in
    Paris)
  • continuous delta Showing
  •   select ... from ... where
  •   when daily
  • notify daily (send me a daily report)

65
Atomic Events
[Figure: a document d is loaded (millions of pages/day) and passes through the metadata manager, the HTML parser and the XML loader; the atomic events detected for it (d/46, then d/46,67) are sent to complex event detection]
  • atomic event 46: URL matches pattern www.xyz.com/
  • atomic event 67: XML document contains the tag
    soccer
66
Complex Events
  • Several millions of pages crawled per day
  • Hundreds of millions of alerts raised
  • Millions of subscriptions
  • complex event 12: 67 and 46 (XML document contains
    the tag soccer and URL matches pattern
    www.xyz.com/)
[Figure labels: HTML parser, XML loader, complex event detection]
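
A complex event is a conjunction of atomic events, which suggests the following simple detection sketch. It is a naive illustration, not the scalable algorithm the deck refers to: the class name, its methods and the second registered event (13) are assumptions, while event ids 12, 46 and 67 come from the slides.

```python
from collections import defaultdict


class ComplexEventDetector:
    def __init__(self):
        self.required = {}                 # complex id -> atomic ids needed
        self.by_atomic = defaultdict(set)  # atomic id -> complex ids using it

    def register(self, complex_id, atomic_ids):
        self.required[complex_id] = frozenset(atomic_ids)
        for atomic_id in atomic_ids:
            self.by_atomic[atomic_id].add(complex_id)

    def detect(self, raised):
        """raised: atomic event ids raised by one document.

        Returns the sorted ids of the complex events that are satisfied.
        """
        raised = set(raised)
        # Only complex events sharing an atomic event can match.
        candidates = set()
        for atomic_id in raised:
            candidates |= self.by_atomic.get(atomic_id, set())
        return sorted(c for c in candidates if self.required[c] <= raised)


detector = ComplexEventDetector()
detector.register(12, [67, 46])   # tag soccer and URL pattern
detector.register(13, [67, 99])   # hypothetical second subscription

print(detector.detect([46, 67]))
print(detector.detect([46]))
```

Indexing complex events by their atomic events keeps the work per document proportional to the subscriptions that could match, rather than to all registered subscriptions.
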
67
Notification Processing
  • Very efficient/scalable algorithm for complex
    event detection
  • Notifications by
  • Email
  • Web posting
  • Web services in SOAP

[Figure: complex event detection sends alerts to the notification processor, which sends out notifications (millions of notifications/day)]
68
Xyleme in short
  • Spin-off of INRIA (National Research Institute)
  • Technology developed in a research project of 60
    man-years
  • Creation of Xyleme SA in September 2000
  • Now about 25 persons: 13 R&D, 4 services, 10
    marketing, sales and admin.
  • Customers include Press agency (AFP), Newspaper
    groups (Moniteur, Le Monde), National library
    (BNF)
  • First round of capital in 2000 (SGAM
    Viventures).
  • Second round in 2002 (Deutsche Bank)

69
Thank you. If you want to know more about
Xyleme: http://www.xyleme.com,
Serge.Abiteboul_at_xyleme.com, Amir.Milo_at_xyleme.com