1
Content Warehouses around XML and Web Services
Serge Abiteboul, INRIA-Futurs and LRI-Paris 11
2
Introduction
3
Joint work: some participants and projects
  • Xyleme: Sophie Cluet, Guy Ferran, and many others
  • Acware, within the Edot project: Benjamin Nguyen,
    Gabriela Ruberg, Gregory Cobena
  • Active XML, within the DbGlobe project: Omar
    Benjelloun, Ioana Manolescu, Tova Milo, and many
    others
  • KadoP, with the Edos project: Ioana Manolescu,
    Nicoleta Preda, and many others

4
Success stories in the time of the Internet
bubble: information management
  • Google: management of Web pages
  • Mapquest: management of maps
  • Amazon: book catalogue
  • eBay: product catalogue
  • Napster (eMule, BearShare, etc.): music database
  • Flickr: picture database
  • Wikipedia: dictionary
  • Even in France:
  • Meetic: dating database
  • Kelkoo: comparative shopping

5
The trend is towards peer-to-peer infoware
  • Why?
  • The Web is switching from centralized servers to
    communities and syndication
  • Buzzwords such as Web 2.0 (?)
  • Infoware: a class of software whose goal is no
    longer to process information but to manage it
    globally, because the quantities involved keep
    growing
  • Analogy:
  • Software development by very structured and
    controlled groups of programmers vs.
    open-source software produced by large
    communities of autonomous developers

6
Outline
  • Introduction
  • Content warehouse
  • Concept
  • XML and Web services
  • Xyleme
  • Peer-to-Peer content warehouse
  • Concept
  • Active XML
  • KadoP
  • Conclusion

7
Content warehouse
8
Warehouse
  • Goal: integrated access to heterogeneous,
    autonomous, distributed sources of information
  • Main functionalities: acquire, transform, filter,
    clean and integrate data; support for queries
  • Warehouse vs. mediation
  • Warehouse: information is acquired in advance
  • Mediation: information is acquired when needed
  • Classical tradeoff between updates and queries
  • Typically a mix of both

9
Content warehouse
  • All kinds of content
  • Mail, reports, news, Web pages, contacts,
    catalogs, annotations, etc.
  • Text, multimedia, etc.
  • Little of it is numerical, unlike in OLAP
  • Some may be mixed, e.g., financial reports
  • Typically found on the Web and not in relational
    databases

10
Content vs. data warehouse
11
XML Warehouse
Same as a relational warehouse
  • Import data from many sources
  • Add value to it without interfering with
    operational data
  • Export integrated views of it

12
The basis of content management
  • Standard for data exchange
  • XML, XML Schema
  • Extensible Markup Language
  • Labeled ordered trees
  • Foundations: tree automata
  • Query languages
  • XPath, XQuery
  • Foundations: tree automata
  • Not perfect, but at least they exist

[Figure: XML; XQuery, XPath; SOAP, WSDL]
13
Functionalities
[Figure: functionalities overview with the labels Feeding (Web) and Exploiting (GUI, Web services, reporting)]
14
Functionalities: Feeding
  • Loading from the Web (Internet and intranet)
  • Web search
  • Web crawl
  • Access to Web data via forms or Web services
  • Plug-ins to load from:
  • File systems, document management systems
  • Databases, LDAP
  • Newsgroups, emails
  • Other applications
  • Extraction and transformation
  • XSLT or XQuery mappings for XML sources
  • XML-izers to load data from other formats
  • Monitoring of the feeding

15
Functionalities: More feeding
  • User feeding
  • Document editing
  • Metadata editing
  • Publication
  • API: SOAP and WebDAV

16
Functionalities: Storage
  • Storage of massive volumes of XML (terabytes)
  • Indexing of massive volumes of XML
  • By structure
  • By full text
  • Linguistic support: multiple languages,
    stemming, synonyms, etc.
  • Very efficient XML query processing
  • Importance ranking
  • Monitoring of the warehouse (support for
    subscriptions)
  • Access control and security
  • Versioning, archiving
  • Recovery
  • Possibly a transaction mechanism

17
Functionalities: Enrichment
  • Global organization
  • Global schema management
  • Management of collections
  • Incorporate domain ontologies and thesauri
  • Document classification
  • Cleaning by filtering out documents from
    collections, etc.
  • Document enrichment
  • Concept extraction and tagging
  • Cleaning inside the document
  • Summarization, etc.
  • Relationships between documents
  • Tables of contents
  • Indexes
  • Cross-referencing, etc.

18
Functionalities: View integration
  • View management
  • Document restructuring/mapping
  • Schema-to-schema mapping
  • Semantic integration
  • Manual for complex cases and (semi-)automatic
    for simple ones
  • Tools to analyze a set of schemas
  • Tools to integrate them
  • Processing of queries on the integrated view
  • Management of virtual data in a mediator style

19
Functionalities: Exploitation
  • Access to the warehouse
  • Browsing
  • Querying by keywords, XPath or XQuery
  • Temporal queries
  • Query subscription
  • Reporting
  • Generation of complex reports with pointers to
    documents, counts, abstracts
  • Organized by collections, content, domains
  • By GUI or from programs (Web service-based API)

20
Xyleme Content warehouse
21
Xyleme in short
  • 1999: Xyleme research project at INRIA
  • 2000: creation of a spin-off
  • 2006: about 40 people
  • Technology: a content warehouse built around a
    very efficient and scalable XML repository
  • Application example: all articles of Le Monde in
    XML

22
Xyleme Functionalities
[Figure: functionalities overview with the labels Feeding (Web) and Exploiting (GUI, Web services, reporting)]
23
Xyleme Architecture
[Figure: architecture diagram with the following labels]
  • Client side: applications (IE, Java, C, .Net, or
    any platform); HTTP Web service API
  • Server side: application server (Tomcat, SOAP)
    or Java/C API; Corba
  • Name Server, User Manager, URL Manager,
    Notification Manager, Global Query Manager, ...
24
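Structural identifiers and indexing
[Figure: example tree with nodes A to G and a text leaf "John"; each node is labeled with its prefix, postfix and level numbers]
  • Structural IDs: (prefix, postfix, level)
  • X ancestor of Y ⇔ pre(X) < pre(Y) and
    post(X) > post(Y)
  • X parent of Y ⇔ X ancestor of Y and
    level(X) = level(Y) - 1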
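
The ancestor and parent tests on (prefix, postfix, level) identifiers can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tree, the `Node` class and the helper names are hypothetical, and the numbering (two independent counters, one incremented on entering a node and one on leaving it) is one common convention that may differ from the exact scheme used in Xyleme.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

StructuralId = Tuple[int, int, int]  # (pre, post, level)


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)


def assign_ids(root: Node) -> Dict[str, StructuralId]:
    """Number all nodes in one depth-first traversal (labels assumed unique)."""
    ids: Dict[str, StructuralId] = {}
    counters = {"pre": 0, "post": 0}

    def visit(node: Node, level: int) -> None:
        counters["pre"] += 1           # rank on entering the node
        pre = counters["pre"]
        for child in node.children:
            visit(child, level + 1)
        counters["post"] += 1          # rank on leaving the node
        ids[node.label] = (pre, counters["post"], level)

    visit(root, 0)
    return ids


def is_ancestor(x: StructuralId, y: StructuralId) -> bool:
    """X ancestor of Y  <=>  pre(X) < pre(Y) and post(X) > post(Y)."""
    return x[0] < y[0] and x[1] > y[1]


def is_parent(x: StructuralId, y: StructuralId) -> bool:
    """X parent of Y  <=>  X ancestor of Y and level(X) = level(Y) - 1."""
    return is_ancestor(x, y) and x[2] == y[2] - 1


# Hypothetical tree: A has children B and C; B has D and E; C has F.
tree = Node("A", [Node("B", [Node("D"), Node("E")]), Node("C", [Node("F")])])
ids = assign_ids(tree)
```

With these identifiers, `is_ancestor(ids["A"], ids["D"])` holds while `is_parent(ids["A"], ids["D"])` does not, and no access to the document itself is needed to decide either.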
25
Query evaluation based on holistic twig joins
[Figure: a tree-pattern query over A, C, D and "John", with index entries such as (d1, 201, 400), (d1, 224, 201) and (d1, 228, 237)]
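
A full holistic twig join matches a whole tree pattern in one pass and is too long to show here. As a much simpler building block, the sketch below is a stack-based ancestor-descendant structural join over two sorted posting lists. It is not the holistic algorithm itself, and the tuple layout (doc, pre, post) as well as the sample postings are assumptions made for illustration.

```python
from typing import List, Tuple

Posting = Tuple[str, int, int]  # (doc, pre, post)


def contains(a: Posting, d: Posting) -> bool:
    """True if a is an ancestor of d in the same document."""
    return a[0] == d[0] and a[1] < d[1] and a[2] > d[2]


def structural_join(
    ancestors: List[Posting], descendants: List[Posting]
) -> List[Tuple[Posting, Posting]]:
    """Join two posting lists, both sorted by (doc, pre), in a single pass."""
    results: List[Tuple[Posting, Posting]] = []
    stack: List[Posting] = []  # nested chain of still-open ancestor candidates
    i = 0
    for d in descendants:
        # Push every ancestor candidate that starts before d.
        while i < len(ancestors) and ancestors[i][:2] < d[:2]:
            a = ancestors[i]
            while stack and not contains(stack[-1], a):
                stack.pop()  # this candidate ended before a started
            stack.append(a)
            i += 1
        # Drop candidates that ended before d started.
        while stack and not contains(stack[-1], d):
            stack.pop()
        # Every candidate left on the stack contains d.
        results.extend((a, d) for a in stack)
    return results


# Hypothetical postings for one document "d1".
ancestor_list = [("d1", 1, 6), ("d1", 2, 3)]
descendant_list = [("d1", 3, 1), ("d1", 6, 4)]
pairs = structural_join(ancestor_list, descendant_list)
```

Each list is read once and the stack never holds more entries than the depth of the document, which is why joins of this family suit long posting lists.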
26
Peer-to-peer content warehouse
27
The golden triangle of distributed content
management on the Web
  • Standard for data exchange
  • XML, XML Schema
  • Extensible Markup Language
  • Labeled ordered trees
  • Foundations: tree automata
  • Query languages
  • XPath, XQuery
  • Standards for distributed computing: Web services
  • SOAP, WSDL, UDDI
  • Simple Object Access Protocol
  • Like Corba, but simpler and on the Web

[Figure: the golden triangle: XML; XQuery, XPath; SOAP, WSDL]
28
Peer-to-peer
  • A large and varying number of computers cooperate
    to solve some particular task without any
    centralized authority
  • Goal: build an efficient, robust, scalable system
    based (typically) on inexpensive, unreliable
    computers distributed in a wide area network
  • Examples
  • seti@home: search for extraterrestrial
    intelligence
  • kazaa: obtain free music/video over the net
  • cabal: decryption of a 512-bit RSA code
  • grub: P2P Web search

29
An XML warehouse in P2P
  • Warehouse: a very centralized system
  • P2P: an ultra-distributed system (no authority)
  • P2P warehouse: an oxymoron?
  • No!
  • A warehouse from a logical viewpoint
  • P2P system from a physical viewpoint

30
[Figure: four architectures compared. Centralized mediation: a mediator over data sources. P2P mediation: P2P mediators over data sources. Centralized warehouse: a warehouse that is both logical and physical. P2P warehouse: a logical P2P warehouse over a physical P2P warehouse.]
31
P2P XML Warehouse
  • Data sources and peers are distributed, transient
    and autonomous
  • Information is distributed and replicated
  • Nothing is centralized
  • Neither control, nor storage, nor indexing
  • The machines are cooperating with some level of
    trust to provide the functionalities of an XML
    warehouse

32
Advantages and disadvantages
Advantages
  • Performance: optimization of parallelism, no
    bottleneck, replication
  • Availability: replication
  • Cost: avoid the cost of a server, share
    operational costs
  • Dynamicity: add/remove data sources
  • Better scaling
Disadvantages
  • Performance: cost of complex queries,
    communication cost
  • Availability: peers can leave
  • Consistency maintenance: difficult to support
    transactions
  • Quality: difficult to guarantee

33
Centralized vs. distributed data management
34
Two classes of P2P networks
  • Unstructured P2P networks
  • Local exchange mappings relate content on
    different peers
  • Queries are propagated (flooding)
  • SomeWhere, ...
  • Structured P2P networks
  • Content is indexed globally and located via the
    index
  • Local content, global access
  • KadoP, ...

35
Active XML: a framework for distributed
data management
36
Active XML
  • The standards of distributed data management
  • Active XML: XML documents with embedded Web
    service calls, where service calls are typically
    in XQuery
  • Intensional, dynamic
  • This is not a new idea
  • Procedural attributes in relational systems
  • Basis of object databases
  • Sun's JSP, PHP+MySQL, Apache Jelly

37
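Active XML: XML with embedded service calls
(omitting syntactic details)

[Example: an XML document about Aspen with the embedded service calls Unisys.com/snow(Aspen) and Yahoo.com/GetHotels(), and the data value 1; the markup is not recoverable]

May contain calls to any SOAP Web service and to
any AXML Web service (to be defined)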
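
To make the idea of embedded service calls concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the `<sc service="...">` element is a made-up notation rather than the actual Active XML syntax, and the two local functions stand in for remote Web services such as a snow report or a hotel lookup.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict, List

# Hypothetical document: <sc> marks an embedded service call (made-up syntax).
DOCUMENT = """<resort>
  <name>Aspen</name>
  <snow><sc service="snow">Aspen</sc></snow>
  <hotels><sc service="get_hotels">Aspen</sc></hotels>
</resort>"""


def snow(place: str) -> List[ET.Element]:
    """Local stand-in for a remote snow-report service (invented data)."""
    depth = ET.Element("depth", {"unit": "meter"})
    depth.text = "1"
    return [depth]


def get_hotels(place: str) -> List[ET.Element]:
    """Local stand-in for a remote hotel-lookup service (invented data)."""
    hotels = []
    for name in ("Hotel A", "Hotel B"):
        hotel = ET.Element("hotel")
        hotel.text = name
        hotels.append(hotel)
    return hotels


SERVICES: Dict[str, Callable[[str], List[ET.Element]]] = {
    "snow": snow,
    "get_hotels": get_hotels,
}


def materialize(root: ET.Element) -> ET.Element:
    """Replace each embedded call by the elements its service returns."""
    for parent in list(root.iter()):
        # Walk children backwards so insertions do not shift pending indexes.
        for index in reversed(range(len(parent))):
            child = parent[index]
            if child.tag != "sc":
                continue
            results = SERVICES[child.get("service")](child.text or "")
            parent.remove(child)
            for offset, result in enumerate(results):
                parent.insert(index + offset, result)
    return root


materialized = materialize(ET.fromstring(DOCUMENT))
```

Calls contained in a returned result are deliberately left untouched here, so the document can stay partly intensional after one round of materialization.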
38
Not a new idea in databases, not a new idea on
the Web
  • Mixing calls and data is an old idea
  • Procedural attributes in relational systems
  • Basis of object databases
  • In the HTML world
  • Sun's JSP, PHP+MySQL
  • Calls to Web services inside documents
  • Macromedia MX, Apache Jelly

39
What exactly to exchange
  • A parameter of a call contains some service calls
  • The result of a call contains some service calls
  • Do we evaluate these calls before transmitting
    the data or not?
  • Example: "Hi John, what is the phone number of
    the CEO of INRIA?"
  • Answer 1: (33 1) 39 66 00 01
  • Answer 2: Look in the INRIA directory under
    Michel Cosnard
  • Answer 3: Find his name at www.inria.fr, then
    look in the directory

40
When to activate the call
  • Explicit pull mode
  • Frequency: daily, weekly, etc.
  • After some event, e.g., when another service
    call has completed
  • This aspect of the problem is related to active
    databases
  • Implicit pull mode: lazy
  • When the data is requested
  • Difficulty: detecting that the result of a
    particular request may be affected by a
    particular call
  • This is related to deductive databases
  • Push mode
  • E.g., based on a query subscription, the Web
    server pushes information to the client
  • E.g., synchronization with an external source
  • This is related to stream and subscription
    queries
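
The three activation modes can be contrasted with a small Python sketch. The `ServiceCall` class and its method names are hypothetical and do not come from the Active XML system; the service is any local callable standing in for a remote Web service.

```python
from typing import Callable, Optional


class ServiceCall:
    """Stand-in for one embedded service call and its cached result."""

    def __init__(self, service: Callable[[], str]) -> None:
        self._service = service
        self._result: Optional[str] = None
        self.activations = 0  # how many times the service was really invoked

    def activate(self) -> None:
        """Explicit pull: invoke now, e.g. from a daily job or after an event."""
        self._result = self._service()
        self.activations += 1

    def value(self) -> str:
        """Implicit (lazy) pull: invoke only when the data is requested."""
        if self._result is None:
            self.activate()
        return self._result

    def push(self, new_result: str) -> None:
        """Push mode: the provider delivers a fresh result on its own."""
        self._result = new_result


# Hypothetical service returning a snow depth.
call = ServiceCall(lambda: "1 m")
before = call.activations   # nothing has been invoked yet
first = call.value()        # lazy activation happens here
second = call.value()       # served from the stored result
call.push("2 m")            # provider-side update, no invocation by the client
after_push = call.value()
```

In the lazy mode the hard part, as the slide notes, is knowing which calls a given request depends on; the sketch sidesteps that by attaching one call to one value.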

41
Active XML peer
[Figure: AXML peers communicating via SOAP]
  • Peer-to-peer architecture
  • Each Active XML peer
  • Repository: manages Active XML data with
    embedded Web service calls
  • Web client: uses Web services
  • Web server: provides (parameterized)
    queries/updates over the repository as Web
    services
  • Open-source system
  • Sun's Java SDK 1.4
  • XML parser
  • XPath processor, XSLT engine
  • Apache Tomcat 4.0 servlet engine
  • Apache Axis SOAP toolkit 1.0
  • X-OQL query processor
  • Persistent DOM repository
  • JSP-based user interface
  • JSTL 1.0 standard tag library
  • See http://activexml.net

42
KadoP: a P2P system for sharing content
43
KadoP model
  • Data: XML documents, views (Active XML), Web
    services
  • Simple semantics: concepts, namespaces, DTDs,
    isA, partOf, relatedTo, context documents (for
    services)
  • Queries: tree-pattern queries with joins
  • KadoP
  • XML data distributed in the P2P network
  • Index is distributed via a DHT
  • Goal: efficient processing of terabytes of XML
    with no centralized authority

44
Distributed hash tables
  • Typically on a WAN
  • Peers come and go
  • Small number of messages to locate the peer in
    charge of key k: O(log n)
  • Standard interface: put, get
  • We tried Pastry, Chord and JXTA
  • We now use Pastry

[Figure: put(k, v1), put(k, v2) and put(k, v3) are routed by hash(k) through the DHT to one peer; get(k) returns v1, v2, v3]
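
The put/get interface can be illustrated with a toy, single-process hash ring in Python. This is a sketch under strong simplifying assumptions: it is not Pastry, Chord or JXTA, there is no network and no O(log n) routing, and the class and peer names are hypothetical. It only shows how hash(k) decides which peer is in charge of key k and how several values accumulate under one key.

```python
import bisect
import hashlib
from collections import defaultdict
from typing import Dict, List


def hash_key(key: str) -> int:
    """Map a key or a peer name to a position on the ring."""
    return int.from_bytes(hashlib.sha1(key.encode("utf-8")).digest(), "big")


class ToyDHT:
    """All peers live in one process; only key placement is modeled."""

    def __init__(self, peer_names: List[str]) -> None:
        self._ring = sorted((hash_key(name), name) for name in peer_names)
        self._stores: Dict[str, Dict[str, List[str]]] = {
            name: defaultdict(list) for name in peer_names
        }

    def peer_for(self, key: str) -> str:
        """The peer in charge of key: first peer at or after hash(key)."""
        position = bisect.bisect_left(self._ring, (hash_key(key), ""))
        return self._ring[position % len(self._ring)][1]  # wrap around

    def put(self, key: str, value: str) -> None:
        self._stores[self.peer_for(key)][key].append(value)

    def get(self, key: str) -> List[str]:
        return list(self._stores[self.peer_for(key)].get(key, []))

    def peers_holding(self, key: str) -> List[str]:
        """Names of the peers that actually store something under key."""
        return [name for name, store in self._stores.items() if key in store]


dht = ToyDHT(["peer1", "peer2", "peer3"])
for value in ("v1", "v2", "v3"):
    dht.put("k", value)
values = dht.get("k")
```

A real DHT also has to move keys when peers join or leave, which this sketch ignores.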
45
Indexing in KadoP
  • Use structural IDs as in Xyleme
  • Publish them in a DHT
  • Use holistic twig joins
  • Main issue: communications
  • WAN vs. LAN
  • Long posting lists
  • Optimization techniques
  • Use only docID wisconsin
  • Ship the smallest list
  • Semi-join techniques
  • Intensional indexing

[Figure: put(C, (d, p, 6, 6, 1)) and put(John, (d, p, 3, 1, 2)) are routed through the DHT by hash(C) and hash(John)]
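
Publishing structural identifiers under their term and then answering an ancestor-descendant query from the posting lists can be sketched as follows. The assumptions are significant: a plain in-memory table stands in for the DHT, the posting layout (doc, peer, pre, post, level) is one reading of the figure, the sample document is invented, and the join is a naive nested loop rather than the holistic twig join that KadoP uses.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Posting = Tuple[str, str, int, int, int]  # (doc, peer, pre, post, level)


class IndexStandIn:
    """In-memory stand-in for the DHT: same put/get interface, no network."""

    def __init__(self) -> None:
        self._table: Dict[str, List[Posting]] = defaultdict(list)

    def put(self, term: str, posting: Posting) -> None:
        self._table[term].append(posting)

    def get(self, term: str) -> List[Posting]:
        return sorted(self._table.get(term, []))


def publish(index: IndexStandIn, doc: str, peer: str,
            nodes: List[Tuple[str, Tuple[int, int, int]]]) -> None:
    """Publish one posting per (term, structural ID) of a document."""
    for term, (pre, post, level) in nodes:
        index.put(term, (doc, peer, pre, post, level))


def ancestor_descendant(index: IndexStandIn, ancestor_term: str,
                        descendant_term: str) -> List[Tuple[Posting, Posting]]:
    """Naive join of two posting lists on the ancestor condition."""
    pairs = []
    for a in index.get(ancestor_term):
        for d in index.get(descendant_term):
            same_doc = a[0] == d[0]
            if same_doc and a[2] < d[2] and a[3] > d[3]:
                pairs.append((a, d))
    return pairs


# Hypothetical document "d" stored on peer "p".
index = IndexStandIn()
publish(index, "d", "p", [
    ("A", (1, 6, 0)),
    ("B", (2, 3, 1)),
    ("John", (3, 1, 2)),
    ("C", (5, 5, 1)),
])
b_john = ancestor_descendant(index, "B", "John")
c_john = ancestor_descendant(index, "C", "John")
```

The two `get` calls are where the communication cost arises in a real deployment, since each posting list may sit on a different peer; that is what shipping the smallest list and semi-join techniques aim to reduce.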
46
Conclusion
47
Ongoing work
  • AXML and distributed data management on the Web
  • Opinion: XQuery is a language for local XML
    management
  • Language for distributed query management:
  • Active XML?
  • What else?
  • Foundation of distributed query optimization
  • Recent proposal: AXML send/receive
  • KadoP and P2P (Active) XML indexing
  • Now being tested; working on optimization
  • ActiveXML is open source: see activexml.net
  • KadoP will soon be available upon request
  • Application: distribution of open-source software
    (with Mandriva)

48
Other issues for turning the network into a
scalable database
  • Take an arbitrary data or knowledge management
    problem and look at it in the P2P setting with
    gigabytes of data
  • Examples
  • Self-tuning (joint work with Alkis Polyzotis)
  • Semantic integration (lots of work in Gemo)
  • Distributed access control (joint work with
    Bogdan Cautis)
  • Monitoring (joint work with the DistribCom group
    at INRIA-Rennes)

49
Advertisement
  • Launch of webContent
  • An RNTL platform
  • Web data warehouse for monitoring
  • EADS, Thales, Bongrain, Xyleme, Exalead,
    NewPhoenix
  • Looking for young engineers to work on
    webContent

50
Thank you