Title: Content warehouses around XML and Web services. Serge Abiteboul, INRIA-Futurs and LRI-Paris 11
1Content warehouses around XML and Web services
Serge Abiteboul, INRIA-Futurs and LRI-Paris 11
2Introduction
3Joint work: some participants and projects
- Xyleme: Sophie Cluet, Guy Ferran, and many others
- Acware, within the Edot project: Benjamin Nguyen, Gabriela Ruberg, Gregory Cobena
- Active XML, within the DbGlobe project: Omar Benjelloun, Ioana Manolescu, Tova Milo, and many others
- KadoP, within the Edos project: Ioana Manolescu, Nicoleta Preda, and many others
4Success stories in the time of the Internet bubble: information management
- Google: management of Web pages
- Mapquest: management of maps
- Amazon: book catalogue
- eBay: product catalogue
- Napster (eMule, BearShare, etc.): music database
- Flickr: picture database
- Wikipedia: dictionary
- Even in France
- Meetic: dating database
- Kelkoo: comparative shopping
5The trend is towards peer-to-peer infoware
- Why?
- The Web is switching from centralized servers to communities and syndication
- Buzzwords such as Web 2.0 (?)
- Infoware: a class of software whose purpose is no longer to process information but to manage it globally, because the volumes involved keep growing
- Analogy
- Software development by very structured and controlled groups of programmers vs.
- open-source software produced by large communities of autonomous developers
6Outline
- Introduction
- Content warehouse
- Concept
- XML and Web services
- Xyleme
- Peer-to-Peer content warehouse
- Concept
- Active XML
- KadoP
- Conclusion
7Content warehouse
8Warehouse
- Goal: integrated access to heterogeneous, autonomous, distributed sources of information
- Main functionalities: acquire, transform, filter, clean and integrate data; support for queries
- Warehouse vs. mediation
- Warehouse: information is acquired in advance
- Mediation: information is acquired when needed
- Classical tradeoff between updates and queries
- Typically a mix of both
9Content warehouse
- All kinds of content
- Mail, reports, news, web pages, contacts, catalogs, annotations, etc.
- Text, multimedia, etc.
- Little of it is numerical (vs. OLAP)
- Some may be mixed, e.g., financial reports
- Typically found on the Web and not in relational databases
10Content vs. data warehouse
11XML Warehouse
Same as a relational warehouse
- Import data from many sources
- Add value to it without interfering with operational data
- Export integrated views of it
12The basis of content management
- Standard for data exchange
- XML, XML Schema
- Extensible Markup Language
- Labeled ordered trees
- Foundations: tree automata
- Query languages
- XPath, XQuery
- Foundations: tree automata
- Not perfect, but at least they exist
[Figure labels: XML; XQuery, XPath; SOAP, WSDL]
13Functionalities
[Figure: overview of functionalities. Labels: Feeding (Web); Exploiting (GUI, Web services, reporting)]
14Functionalities: Feeding
- Loading from the Web (Internet and Intranet)
- Web search
- Web crawl
- Access Web data via forms or Web services
- Plug-ins to load from
- File systems, document management systems
- Databases, LDAP
- Newsgroups, emails
- Other applications
- Extraction and transformation
- XSLT or XQuery mappings for XML sources
- XML-izers to load data from other formats
- Monitoring of the feeding
15Functionalities: More feeding
- User feeding
- Document editing
- Metadata editing
- Publication
- API: SOAP and WebDAV
16Functionalities: Storage
- Storage of (massive volumes of) XML (terabytes)
- Indexing of (massive volumes of) XML
- By structure
- By full-text
- Linguistic support: multi-language, stemming, synonyms, etc.
- Very efficient XML query processing
- Importance ranking
- Monitoring of the warehouse (support for subscriptions)
- Access control and security
- Versioning, archiving
- Recovery
- Possibly a transaction mechanism
17Functionalities: Enrichment
- Global organization
- Global schema management
- Management of collections
- Incorporate domain ontologies and thesauri
- Document classification
- Cleaning by filtering out documents from collections, etc.
- Document enrichment
- Concept extraction and tagging
- Cleaning inside the document
- Summarization, etc.
- Relationships between documents
- Tables of contents
- Indexes
- Cross referencing, etc.
18Functionalities: View integration
- View management
- Document restructuring/mapping
- Schema to schema mapping
- Semantic integration
- Manual for complex ones and (semi-)automatic for simple ones
- Tools to analyze a set of schemas
- Tools to integrate them
- Processing of queries on integrated views
- Management of virtual data in a mediator style
19Functionalities: Exploitation
- Access to the warehouse
- Browsing
- Querying by keywords, XPath or XQuery
- Temporal queries
- Query subscription
- Reporting
- Generation of complex reports with pointers to documents, counts, abstracts
- Organized by collections, content, domains
- By GUI or from programs (Web service-based API)
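As a rough illustration of querying by keywords and by path, here is a minimal Python sketch using only the standard library. The collection, its element names (`collection`, `article`, `title`, `body`) and the helper names `by_path` and `by_keyword` are assumptions made for the example; they are not taken from any actual warehouse API, and `xml.etree.ElementTree` supports only a small XPath subset.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-document collection standing in for a warehouse.
WAREHOUSE = ET.fromstring("""
<collection name="news">
  <article id="a1"><title>XML warehouses</title><body>Storing XML at scale</body></article>
  <article id="a2"><title>Peer-to-peer systems</title><body>Indexing with a DHT</body></article>
</collection>
""")

def by_path(path):
    """Return the ids of the articles selected by a (limited) XPath expression."""
    return [article.get("id") for article in WAREHOUSE.findall(path)]

def by_keyword(word):
    """Return the ids of the articles whose text contains the keyword."""
    word = word.lower()
    return [article.get("id") for article in WAREHOUSE.iter("article")
            if word in " ".join(article.itertext()).lower()]

print(by_path(".//article[title='XML warehouses']"))
print(by_keyword("dht"))
```

A real warehouse would answer such queries from structural and full-text indexes rather than by scanning documents.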
20Xyleme Content warehouse
21Xyleme in short
- 1999: Xyleme research project at INRIA
- 2000: Creation of a spin-off
- 2006: About 40 people
- Technology: a content warehouse built around a very efficient and scalable XML repository
- Application example: all articles of Le Monde in XML
22Xyleme Functionalities
[Figure: overview of functionalities. Labels: Feeding (Web); Exploiting (GUI, Web services, reporting)]
23Xyleme Architecture
[Figure: Xyleme architecture. Client side: applications (IE, Java, C, .Net, or any platform) using an HTTP Web service API. Server side: application server (Tomcat, SOAP), Java/C API, Corba; name server, user manager, URL manager, notification manager, global query manager.]
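24Structural identifiers and indexing
[Figure: example tree with nodes A, B, C, D, E, F, G and the text leaf "John", each node labeled with its structural identifier]
- X ancestor of Y if and only if pre(X) < pre(Y) and post(X) > post(Y)
- X parent of Y if and only if X ancestor of Y and level(X) = level(Y) - 1
- Structural IDs: Prefix-Postfix-Level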
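The ancestor and parent tests on structural identifiers can be sketched in a few lines of Python. This is only an illustration: the tree shape below is hypothetical (the slide's figure is not recoverable), and the function names `assign_ids`, `is_ancestor` and `is_parent` are chosen for the example.

```python
def assign_ids(tree):
    """Map each label of a (label, children) tree to its (pre, post, level) identifier."""
    ids = {}
    counters = {"pre": 0, "post": 0}

    def visit(node, level):
        label, children = node
        counters["pre"] += 1            # rank in a preorder traversal
        pre = counters["pre"]
        for child in children:
            visit(child, level + 1)
        counters["post"] += 1           # rank in a postorder traversal
        ids[label] = (pre, counters["post"], level)

    visit(tree, 0)
    return ids

def is_ancestor(x, y):
    """x is an ancestor of y: it starts before y and ends after y."""
    return x[0] < y[0] and x[1] > y[1]

def is_parent(x, y):
    """x is the parent of y: an ancestor exactly one level above."""
    return is_ancestor(x, y) and x[2] == y[2] - 1

# Hypothetical tree, not the one drawn on the slide.
TREE = ("A", [("B", [("D", []), ("E", [])]), ("C", [("F", [])])])
IDS = assign_ids(TREE)
print(IDS["A"], IDS["C"], IDS["F"])
print(is_ancestor(IDS["A"], IDS["F"]), is_parent(IDS["C"], IDS["F"]))
```

The point of such identifiers is that structural relationships are decided from two index entries alone, without touching the documents.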
25Query evaluation based on Holistic twig joins
[Figure: tree pattern query over A, C, D and "John", with index entries such as (d1, 201, 400), (d1, 224, 201) and (d1, 228, 237)]
26Peer-to-peer content warehouse
27The golden triangle of distributed content
management on the Web
- Standard for data exchange
- XML, XML Schema
- Extensible Markup Language
- Labeled ordered trees
- Foundations: tree automata
- Query languages
- XPath, XQuery
- Standards for distributed computing: Web services
- SOAP, WSDL, UDDI
- Simple Object Access Protocol
- Like Corba, but simpler and on the Web
[Figure labels: XML; XQuery, XPath; SOAP, WSDL]
28Peer-to-peer
- A large and varying number of computers cooperate to solve some particular task without any centralized authority
- Goal: build an efficient, robust, scalable system based (typically) on inexpensive, unreliable computers distributed in a wide area network
- Examples
- SETI@home: search for extraterrestrial intelligence
- Kazaa: obtain free music/video over the net
- cabal: decryption of a 512-bit RSA code
- Grub: P2P Web search
29An XML warehouse in P2P
- Warehouse: a very centralized system
- P2P: an ultra-distributed system (no authority)
- P2P warehouse: an oxymoron?
- No!
- A warehouse from a logical viewpoint
- A P2P system from a physical viewpoint
30P2P mediation
[Figure: centralized mediation (a mediator over data sources) vs. P2P mediation; centralized warehouse (logical and physical) vs. P2P warehouse (a logical warehouse over a physical P2P system)]
31P2P XML Warehouse
- Data sources and peers are distributed, transient and autonomous
- Information is distributed and replicated
- Nothing is centralized
- Neither the control, nor the storage, nor the indexing
- The machines cooperate with some level of trust to provide the functionalities of an XML warehouse
32Advantages and disadvantages
Advantages
- Performance: optimization of parallelism, no bottleneck, replication
- Availability: replication
- Cost: no server cost, shared operational cost
- Dynamicity: add/remove data sources
- Better scaling
Disadvantages
- Performance: cost of complex queries, communication cost
- Availability: peers can leave
- Consistency maintenance: difficult to support transactions
- Quality: difficult to guarantee
33Centralized vs. distributed data management
34Two classes of P2P networks
- Unstructured P2P networks
- Local exchange mappings relate content on different peers
- Queries are propagated (flooding)
- Example: SomeWhere, ...
- Structured P2P networks
- Content is indexed globally and located via the index
- Local content, global access
- Example: KadoP, ...
35Active XML: a framework for distributed data management
36Active XML
- The standards of distributed data management
- Active XML: XML documents with embedded Web service calls, where service calls are typically in XQuery
- Intensional, dynamic
- This is not a new idea
- Procedural attributes in relational systems
- Basis of object databases
- Sun's JSP, PHP+MySQL, Apache Jelly
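37Active XML: XML with embedded service calls (omitting syntactic details)
[Example, markup lost in extraction: an XML fragment about Aspen containing the embedded calls Unisys.com/snow(Aspen) and Yahoo.com/GetHotels()]
- May contain calls to any SOAP Web service
- May contain calls to any AXML Web service (to be defined)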
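To make the idea of a document with embedded calls concrete, here is a minimal Python sketch, not the actual Active XML syntax or implementation. The element names (`place`, `name`, `sc`, `param`), the `service` attribute, and the local stub functions standing in for the remote services are all assumptions made for the example; a real peer would issue SOAP calls instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical local stubs standing in for remote Web services.
def snow(place):
    depth = ET.Element("snowDepth", {"place": place})
    depth.text = "unknown"
    return [depth]

def get_hotels():
    hotel = ET.Element("hotel")
    hotel.text = "example hotel"
    return [hotel]

SERVICES = {"Unisys.com/snow": snow, "Yahoo.com/GetHotels": get_hotels}

# Hypothetical markup: <sc> marks an embedded service call.
DOC = """
<place>
  <name>Aspen</name>
  <sc service="Unisys.com/snow"><param>Aspen</param></sc>
  <sc service="Yahoo.com/GetHotels"/>
</place>
"""

def materialize(element):
    """Replace each embedded <sc> call by the elements its service returns."""
    new_children = []
    for child in list(element):
        if child.tag == "sc":
            args = [param.text for param in child.findall("param")]
            new_children.extend(SERVICES[child.get("service")](*args))
        else:
            materialize(child)
            new_children.append(child)
    element[:] = new_children

root = ET.fromstring(DOC)
materialize(root)
print([child.tag for child in root])
```

This sketch evaluates every call eagerly and exactly once; it does not handle results that themselves contain further calls, nor any policy for when a call should be activated.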
38Not a new idea in databases, not a new idea on the Web
- Mixing calls and data is an old idea
- Procedural attributes in relational systems
- Basis of object databases
- In the HTML world
- Sun's JSP, PHP+MySQL
- Calls to Web services inside documents
- Macromedia MX, Apache Jelly
39What exactly to exchange
- A parameter of a call contains some service calls
- The result of a call contains some service calls
- Do we evaluate these calls before transmitting the data or not?
- Analogy: "Hi John, what is the phone number of the CEO of INRIA?"
- (33 1) 39 66 00 01
- Look in the INRIA directory under Michel Cosnard
- Find his name at www.inria.fr, then look in the directory
40When to activate the call
- Explicit pull mode
- Frequency: daily, weekly, etc.
- After some event, e.g., when another service call has completed
- This aspect of the problem is related to active databases
- Implicit pull mode: lazy
- When the data is requested
- Difficulty: detect that the result of a particular request may be affected by a particular call
- This is related to deductive databases
- Push mode
- E.g., based on a query subscription, the Web server pushes information to the client
- E.g., synchronization with an external source
- This is related to stream and subscription queries
41Active XML peer
[Figure labels: AXML peer, SOAP]
- Peer-to-peer architecture
- Each Active XML peer
- Repository: manages Active XML data with embedded Web service calls
- Web client: uses Web services
- Web server: provides (parameterized) queries/updates over the repository as Web services
- Open source system
- Sun's Java SDK 1.4
- XML parser
- XPath processor, XSLT engine
- Apache Tomcat 4.0 servlet engine
- Apache Axis SOAP toolkit 1.0
- X-OQL query processor
- Persistent DOM repository
- JSP-based user interface
- JSTL 1.0 standard tag library
- See http://activexml.net
42KadoP a P2P system for sharing content
43KadoP model
- Data: XML, document views (Active XML), Web services
- Simple semantics: concepts, namespaces, DTDs, isA, partOf, relatedTo, context documents (for services)
- Queries: tree pattern queries with joins
- KadoP
- XML data distributed in the P2P network
- Index is distributed via a DHT
- Goal: efficient processing of terabytes of XML with no centralized authority
44Distributed hash tables
- Typically on a WAN
- Peers come and go
- Small number of messages to locate the peer in charge of key k: O(log n)
- Standard interface: put, get
- We tried Pastry, Chord and JXTA
- We now use Pastry
[Figure: put(k, v1), put(k, v2) and put(k, v3) are routed by hash(k) to the peer of the DHT in charge of k, which stores v1, v2, v3 and answers get(k)]
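The put/get interface can be illustrated with a toy, single-process Python sketch. It is not Pastry, Chord or JXTA: there is no network, no routing in O(log n) messages, and no peers joining or leaving. It only shows how hash(k) designates the one peer in charge of a key. The class name `ToyDHT` and the peer names are assumptions made for the example.

```python
import hashlib
from bisect import bisect_left

class ToyDHT:
    """Toy hash ring: each key is stored by the first peer at or after hash(key)."""

    def __init__(self, peer_names):
        self.ring = sorted((self._hash(name), name) for name in peer_names)
        self.storage = {name: {} for name in peer_names}

    @staticmethod
    def _hash(text):
        return int.from_bytes(hashlib.sha1(text.encode("utf-8")).digest(), "big")

    def peer_for(self, key):
        """Return the peer in charge of the key, wrapping around the ring."""
        points = [point for point, _ in self.ring]
        index = bisect_left(points, self._hash(key)) % len(self.ring)
        return self.ring[index][1]

    def put(self, key, value):
        """Append a value to the list kept under the key by its peer."""
        self.storage[self.peer_for(key)].setdefault(key, []).append(value)

    def get(self, key):
        """Return all values published under the key."""
        return list(self.storage[self.peer_for(key)].get(key, []))

dht = ToyDHT(["peer-a", "peer-b", "peer-c"])
for value in ("v1", "v2", "v3"):
    dht.put("k", value)
print(dht.peer_for("k"), dht.get("k"))
```

Because every peer computes the same hash, any of them can find the peer in charge of k without a central directory.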
45Indexing in KadoP
- Use structural IDs as in Xyleme
- Publish them in a DHT
- Use holistic twig joins
- Main issue: communications
- WAN vs. LAN
- Long posting lists
- Optimization techniques
- Use only docID wisconsin
- Ship smallest list
- Semi-join techniques
- Intensional indexing
[Figure: postings such as put(C, (d, p, 6, 6, 1)) and put(John, (d, p, 3, 1, 2)) are published in the DHT and routed by hash(C) and hash(John)]
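The following Python sketch illustrates publishing structural IDs as postings and answering an ancestor/descendant query over two posting lists. It is a simplification under stated assumptions: a plain dictionary stands in for the DHT, the join is a naive binary structural join rather than a holistic twig join, and the postings, the `Posting` fields and the function names are hypothetical. The document-ID filter reflects one reading of the "use only docID" optimization.

```python
from collections import defaultdict, namedtuple

# Assumed posting layout: document, peer, then the structural ID.
Posting = namedtuple("Posting", "doc peer pre post level")

index = defaultdict(list)   # stand-in for the DHT: term -> posting list

def publish(term, posting):
    """Add one posting to the list stored under the term."""
    index[term].append(posting)

def ancestor_descendant(anc_term, desc_term):
    """Return (ancestor, descendant) posting pairs found in the same document."""
    anc_list, desc_list = index[anc_term], index[desc_term]
    # docID filter: only documents containing both terms can contribute.
    shared_docs = {p.doc for p in anc_list} & {p.doc for p in desc_list}
    return [(a, d)
            for a in anc_list if a.doc in shared_docs
            for d in desc_list
            if d.doc == a.doc and a.pre < d.pre and a.post > d.post]

# Hypothetical postings: in d1, "John" lies below C; d2 has C only; d3 has "John" only.
publish("C", Posting("d1", "p1", 2, 2, 1))
publish("C", Posting("d2", "p2", 1, 1, 0))
publish("John", Posting("d1", "p1", 3, 1, 2))
publish("John", Posting("d3", "p3", 2, 1, 1))

matches = ancestor_descendant("C", "John")
print(matches)
```

In a WAN setting the interesting cost is which list travels; exchanging only document IDs first keeps the long lists on their peers until they are known to matter.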
46Conclusion
47Ongoing work
- AXML and distributed data management on the Web
- Opinion: XQuery is a language for local XML management
- Language for distributed query management
- Active XML?
- What else?
- Foundation of distributed query optimization
- Recent proposal: AXML send/receive
- KadoP and P2P (Active) XML indexing
- Now being tested; working on optimization
- Active XML is open-source, see activexml.net
- KadoP soon will be; it is already available upon request
- Application: distribution of open-source software (with Mandriva)
48Other issues for turning the network into a
scalable database
- Take an arbitrary problem for data or knowledge management and look at it in the P2P setting with gigabytes of data
- Examples
- Self tuning (joint work with Alkis Polyzotis)
- Semantic integration (lots of work in Gemo)
- Distributed access control (joint work with Bogdan Cautis)
- Monitoring (joint work with the DistribCom group at INRIA-Rennes)
49Advertisement
- Launch of webContent
- An RNTL platform
- Web data warehouse for monitoring
- EADS, Thales, Bongrain, Xyleme, Exalead, NewPhoenix
- Looking for young engineers to work on webContent
50Thank you