Title: Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11
1Information Management in P2P
Serge Abiteboul, INRIA-Futurs and Univ. Paris 11
2Introduction
3Success stories at the time of the Internet
bubble
- Google: management of Web pages
- Mapquest: management of maps
- Amazon: book catalogue
- eBay: product catalogue
- Napster (eMule, BearShare, etc.): music database
- Flickr: picture database
- Wikipedia: dictionary
- del.icio.us: annotations
- In France
- Meetic: dating database
- Kelkoo: comparative shopping
They are all about publishing some database
4The trend is towards peer-to-peer and interactivity
- P2P: a large and varying number of computers cooperate to solve some particular task without any centralized authority
- Goal: build an efficient, robust, scalable system based (typically) on inexpensive, unreliable computers distributed in a wide area network
- seti@home, Kazaa, cabal
- Switch from centralized servers to communities and syndication
- Interaction and Web 2.0
- Motivations: social, organizational
5Information management in a P2P network
- Private terminology: data ring
- Information is heterogeneous, distributed, replicated, dynamic
- Which info: data, meta-data, knowledge, services
- Peers are heterogeneous, autonomous and possibly mobile
- From sensors to PDA to mainframe
- Typically very large number of peers
- Variety of requirements: QoS, performance, security, etc.
6Acknowledgement
- Xyleme: scalable XML warehousing
- Sophie Cluet, Guy Ferran (Xyleme), many others
- ActiveXML: language for P2P data management
- Omar Benjelloun (Google), Ioana Manolescu, Tova Milo (Tel Aviv), many others
- KadoP: P2P scalable XML indexing
- Ioana Manolescu, Nicoleta Preda, others
- Data Ring: infrastructure for P2P data management
- Alkis Polyzotis (UC Santa Cruz)
7Outline
- Introduction: the data ring
- Calculus for P2P data management (ActiveXML)
- Algebra for P2P data management (ActiveXML algebra)
- Indexing in P2P (KadoP)
- Conclusion
- Goal of the tutorial: present issues and technology on P2P information management
- Warning: it is very biased; it is not a survey
91. Introduction: the data ring
10The information in a data ring
- Data: tuples, collections, documents, relations
- Services: data sources, possibly some processing
- Meta-data about resources: attribute/value pairs, annotations
- Ontologies to explain data and metadata
- View definitions
- Data integration information, e.g., mappings between ontologies
- Physical data: indices and materialized views
11Functionalities of the data ring
- Storage, persistence, replication
- Indexing, caching, querying, updating, optimization
- Schema management, access control
- Fault tolerance, self tuning, monitoring
- Resource discovery, history, provenance, annotations, multilingualism
- Semantic enrichment, uncertain data
- Each functionality may be achieved by a peer or by the network
12And now, what is a peer?
- A mainframe database
- A file system
- Web server
- A PC
- A PDA
- A telephone
- A sensor
- A home appliance
- A car
- A manufacturing tool
- A telecom equipment
- A toy
- Another data ring
Any connected device or software with some
information to share
A net address and some names of resources (e.g.
document or service)
13Advantages and disadvantages of P2P
Advantages
- Scaling
- Performance
- Optimization of parallelism
- Avoid bottleneck
- Replication
- Availability
- Replication
- Cost
- Avoid the cost of server
- Share operational cost
- Dynamicity
- add/remove new data sources
Disadvantages
- Complexity
- Performance
- Cost for complex queries
- Communication cost
- Availability
- Peers can leave
- Consistency maintenance
- Difficult to support transactions
- Quality
- Difficult to guarantee quality
14Crash course on Web standards
- Data exchange format: XML
- Labeled ordered trees
- Its main asset: XML Schema
- There is much more
- Distributed computing protocol: Web services
- SOAP: Simple Object Access Protocol
- WSDL: Web Services Description Language
- UDDI: Universal Description, Discovery and Integration
- BPEL: Business Process Execution Language
- Query languages: XPath and XQuery
- Declarative query language for XML; full-text; update language
- Knowledge representation: OWL or RDF/S
[Figure labels: XML; SOAP, WSDL; XPath, XQuery; OWL, RDFS]
15Information used to live in islands, but with the Web this is changing: uniform access to information, the dream for distributed data management
16Do you like the standards?
- It is the wrong question!
- Correct questions: What can you do with it? What is missing?
- Is XQuery the ultimate query language for the Web? No
- It is a language for querying centralized XML
- We will see what it is missing
- We will not talk much about semantics
17Automatic and distributed management of the data
ring
- No centralized server
- No information administrator (no info manager)
- Most users are non-experts
- E.g., scientists
- Requirements
- Ease of deployment (zero-effort)
- Ease of administration (zero-effort)
- Ease of publication (epsilon-effort)
- Ease of exploitation (epsilon-effort)
- Participation in community building notably via
annotations
Happy database admin
18What should be made automatic
- Self-statistics from the monitoring of the data ring
- In particular, define the statistics that are needed
- Self-tuning based on the self-statistics
- Choose the most appropriate organization
- Decide to install access structures: indexes, views, etc.
- Control replication of data and services
- Self-healing
- Recovery from errors
- E.g., replacement of a failing Web service
- And automatic file management
19Any hope?
- Technology exists (database self-tuning, machine learning, etc.)
- But self-tuning for databases has advanced very slowly
- Why can this work?
- There is no alternative (for databases, this was just a cool gadget)
- KISS (keep it simple, stupid!)
- The power of parallelism
- This is assuming lots of machines have free cycles (true) and bandwidth is generous (not always true)
20Distributed access control
- Goal: control access to ring resources
- Access to resources is based on access rights (ACL)
- Who is controlling ACLs?
- A node manages ACLs for a collection of distributed resources
- Easy, but against the spirit, and a possible bottleneck
- The network manages access control
- Anybody can get the data
- The data is published with encryption and signatures; only nodes with proper access rights can perform reads/writes
- Some techniques exist
21Monitoring
- What is monitored?
- Web service calls and database updates
- The Web
- Web pages
- RSS feed
- What is produced?
- A stream of events
- As a continuous service
- As an RSS feed
- As a Web site/page
- Info-surveillance
- Self-statistics and tracing
- Basis for error diagnosis
22Streams are everywhere
- In query processing
- In indexing (KadoP)
- In recursive queries (AXML-QSQ)
- In messaging, monitoring and pub/sub
- That is why we will use an algebra over streams
of trees and not simply an algebra over trees
23Example: Edos distribution system
- A system for the management of Linux distributions
- Joint work with Mandriva Software and U. Tel Aviv
- Community of open-source developers: thousands
- System releases: about 10 000 software packages and their metadata
- Functionalities
- Query the metadata
- Query subscription
- Retrieve packages
- Publish a new release or update an existing one
24Example: WebContent
- WebContent: an ANR platform for the management of web content
- Web surveillance
- Business, technical, web watching
- Participation of Gemo
- WP3: knowledge
- WP5: P2P content management
- Partners: CEA, EADS, Thales, Bongrain, Xyleme, Exalead, many research groups (UVSQ, Grenoble, Paris 6, etc.)
25Taxonomy of such applications
- Parameters
- Number of peers and quantity of data
- How volatile the peers are
- The query/update workload
- The functionalities that are desired
- Edos: peers and documents in thousands, mostly append for updates, peers not too volatile
- An extreme: Google search engine in P2P for billions of documents using millions of hyper-volatile peers
- Mostly interested in the first case
26Thesis
- XQuery is fine for local XML processing and publishing
- Not sufficient for distributed data management
- The success of the relational model, i.e., of tables on a server
- A logic for defining tables
- An algebra for describing query plans over tables
- By analogy, we need for trees in a P2P system
- A logic for defining distributed tree data and data services
- An algebra for describing query plans over these
- Proposal: ActiveXML logic and algebra
27Outline
- Introduction: the data ring
- Calculus for P2P data management (ActiveXML)
- Algebra for P2P data management (ActiveXML algebra)
- Indexing in P2P (KadoP)
- Conclusion
282. Active XML: a logic for distributed data management
29The basis
- AXML is a declarative language for distributed information management and an infrastructure to support the language in a P2P framework
- Simple idea: XML documents with embedded service calls
- Intensional data
- Some of the data is given explicitly whereas for some, its definition (i.e. the means to acquire it when needed) is given
- Dynamic data
- If the data sources change, the same document will provide different information
30Example (omitting syntactic details)
<resorts state="Colorado">
  <resort>
    <name> Aspen </name>
    <sc> Unisys.com/snow("Aspen") </sc>
    <depth unit="meter">1</depth>
    <hotels ID="AspHotels"> ...
      Yahoo.com/GetHotels(<city name="Aspen"/>)
    </hotels>
  </resort>
</resorts>
- May contain calls
- to any SOAP web service
- e-bay.net, google.com
- to any AXML web services
- to be defined
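The idea of a document that mixes explicit data with embedded calls can be sketched in Python. This is only an illustration under stated assumptions, not the ActiveXML implementation: the `service` and `arg` attributes on `<sc>`, the `SERVICES` registry and the `materialize` function are hypothetical, and the remote service is replaced by a local stub.

```python
import xml.etree.ElementTree as ET

# Hypothetical local stand-in for a remote service; a real peer would issue a SOAP call.
SERVICES = {
    "Unisys.com/snow": lambda resort: ET.fromstring("<snow unit='meter'>1</snow>"),
}

# Assumed concrete syntax for an embedded call: <sc service="..." arg="..."/>.
DOC = """
<resorts state="Colorado">
  <resort>
    <name>Aspen</name>
    <sc service="Unisys.com/snow" arg="Aspen"/>
  </resort>
</resorts>
"""

def materialize(root):
    """Replace each <sc> element by the result of calling its service."""
    for parent in list(root.iter()):
        for i, child in enumerate(list(parent)):
            if child.tag == "sc":
                result = SERVICES[child.get("service")](child.get("arg"))
                result.tail = child.tail
                parent[i] = result  # intensional data becomes explicit data
    return root

tree = materialize(ET.fromstring(DOC))
print(ET.tostring(tree, encoding="unicode"))
```

If the stub returned a different depth tomorrow, the same document would materialize differently, which is the "dynamic data" point made earlier.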
31Marketing ? Philosophy
Active answer: intensional and dynamic and flexible
Embedding calls in data is an old idea in databases
Manon: What's the capital of Brazil?
Dad: Let's ask Wikipedia.com!
Manon: How do I get a cheap ticket to Galapagos?
Dad: Let's place a subscription on LastMinute.com!
Manon: What are the countries in the EC?
Dad: France, Germany, Holland, Belgium, and hum... Let's ask YouLists.com for more!
32Active XML peer
- Peer-to-peer architecture
- Each Active XML peer
- Repository: manages Active XML data
- Web client: calls the services inside a document
- Web server: provides (parameterized) queries/updates over the repository as web services
- Exchange of AXML instead of XML
[Figure: AXML peers communicating via SOAP]
33What is an AXML peer?
- PC
- Now open source ObjectWeb: queries in OQL
- Peer on a mass storage system
- eXist (open source XML database): queries in XQuery
- Xyleme: queries in XyQL
- PDA or cell phone
- Persistence in file system and XPath
- Ongoing: the entire network
- Data is stored in a P2P network: KadoP
- More: Java Card, a relational database
34A key issue: call activation
- When to activate the call?
- Explicit pull mode: active databases
- Implicit pull mode: deductive databases
- Push mode: query subscription
- What to do with its result?
- How long is the returned data valid?
- Mediation and caching
- Where to find the arguments?
- Under the service call: XML, XPath or a service call
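One way to picture the questions "when to activate" and "how long is the returned data valid" is a call that is activated lazily and whose result is cached for a fixed validity period. The sketch below is an assumption-laden illustration: `CachedCall`, `snow_depth` and the 60-second validity are hypothetical, and the clock is passed explicitly to keep the example deterministic.

```python
class CachedCall:
    """A service call whose result is reused while it is considered valid."""

    def __init__(self, service, validity):
        self.service = service      # callable standing in for a Web service
        self.validity = validity    # seconds during which a result may be reused
        self.value = None
        self.fetched_at = None

    def get(self, now):
        """Implicit pull: the call is activated only when its data is needed."""
        if self.fetched_at is None or now - self.fetched_at >= self.validity:
            self.value = self.service()
            self.fetched_at = now
        return self.value

calls = []

def snow_depth():
    calls.append("snow")   # record each activation of the hypothetical service
    return 1

depth = CachedCall(snow_depth, validity=60)
first = depth.get(now=0)     # activates the call
second = depth.get(now=30)   # served from the cache
third = depth.get(now=90)    # cached result expired, call activated again
```

A push mode would invert the control flow: the service would notify the document when the value changes, instead of the document asking.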
35Another key issue: what to send?
- Send some AXML tree t
- As result of a query or as parameter of a call
- The tree t contains calls; do we have to evaluate them?
- If I do, I may introduce service calls; do we have to evaluate all these calls before transmitting the data?
- Hi John, what is the phone number of the Prime Minister of France?
- Find his name at whoswho.com then look in the phone dir
- Look in the yellow pages for de Villepin's in phone dir of www.gov.fr
- (33) 01 56 00 01
36Active XML: cool idea, complex problems
- Blasphemous claim
- Active XML is the proper paradigm for data exchange!
- Not XML, not XQuery
- Brings to a unique setting
- distributed db, deductive db, active db, stream data
- warehousing, mediation
- This is unreasonable? Yes!
- Plenty of work ahead to make it work
- But first, the algebra
37Outline
- Introduction: the data ring
- Calculus for P2P data management (ActiveXML)
- Algebra for P2P data management (ActiveXML algebra)
- Query processing
- Query optimization
- Indexing in P2P (KadoP)
- Conclusion
383. Active XML algebra
39Motivation
- Relational model: centralized tables
- optimization: algebraic expression and rewriting
- Active XML model: distributed trees
- optimization: algebraic expression and rewriting
- Distributed query optimization based on algebraic rewriting of Active XML trees
- Based on experiences with AXML optimization
40Active XML peers
- We focus on positive AXML
- Set-oriented data
- Positive/monotone services
- Services: tree-pattern-query-with-join queries
- Services produce streams
- Optimized by a local query optimizer
- Evaluated by a local query processor
- Out of our scope
[Figure: local query processing at a peer p: a join over two input streams producing an output stream]
41The problem
- An AXML system
- A set of peers
- For each peer: a set of documents and services
- Extensional data is distributed
- Intensional data (knowledge) is distributed
- Defined using query services (TPQJ queries)
- These services are generic: any peer can evaluate a query
- A query q to some peer
- Evaluate the answer to q with optimal response time
42AXML algebra
- (AXML) algebraic expressions
[Figure: grammar of the algebraic expressions; surviving labels: AXML logic, d@p]
- Each such expression lives at some peer
- Includes the AXML trees
43Algebraic expressions: annotations
- Executing service call: ?
- Terminated service call: ?
- Subtlety
- q@p(5): definition of intensional data
- eval(q@p(5)): request to evaluate it during query optimization
- ? q@p(5): query is being evaluated during query processing
- ? q@p(5): query evaluation is complete
44Evaluation rules local rules
for l ? sc, s ? send, receive
45Evaluation rules transfer rules
?
- Site p asks p to do the work and send the result
to p
46Evaluation rules more transfer rules
x_at_p
?
Z
- When a query is evaluated, results appear
- They are sent to the place that requested them
- Also some rules for eof
47Evaluation
- Reminder: setting
- An AXML system
- A request to evaluate query q at peer p: eval@p(q)
- Rewrite the trees in peer workspaces until termination of the process
- Results
- For positive AXML, this process converges to a possibly infinite state
- This process computes the answer to q
- May be fairly inefficient: need for optimization!
48Optimization
- More rewrite rules to evaluate a query more
efficiently
49Query optimization
- Well-known optimization techniques for distributed data management
- Pushing selections
- Semijoin reducers
- Horizontal, vertical, hybrid decomposition
- Recursive query processing and query-subquery
- Some specific AXML optimizations
- Pushing queries over service calls
- Lazy service call evaluation
- Optimizing subscription management
- All are captured by the algebraic framework
50Example: pushing selections
Suppose q = q1(σ(q2))
- Same rule applies if d@p2 is replaced by a continuous query
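The benefit of pushing a selection towards the peer that holds the data can be illustrated by counting shipped tuples. This is a toy simulation under assumptions: the data, the predicate and the `ship` helper are hypothetical, and both "peers" live in one process.

```python
def ship(rows, log):
    """Simulate sending rows between peers, recording how many tuples travel."""
    rows = list(rows)
    log.append(len(rows))
    return rows

def select(rows, pred):
    """Keep the rows satisfying the selection predicate."""
    return [r for r in rows if pred(r)]

# Hypothetical data held by peer p2.
d_at_p2 = [
    {"resort": "Aspen", "depth": 1},
    {"resort": "Vail", "depth": 3},
    {"resort": "Breckenridge", "depth": 0},
]
pred = lambda r: r["depth"] >= 1

# Naive plan: p1 receives everything, then selects locally.
naive_log = []
naive = select(ship(d_at_p2, naive_log), pred)

# Rewritten plan: the selection is evaluated at p2, only matches are shipped.
pushed_log = []
pushed = ship(select(d_at_p2, pred), pushed_log)
```

Both plans return the same answer; only the amount of transferred data differs.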
51Example: interleaving of processing and optimization
- At peer i: di = ri ∪ d(i+1)
- Query at p1: σ(d1)
- σ(d1) → σ(r1) ∪ σ(d2)
- eval@p1(σ(r1) ∪ σ(d2)) → eval@p1(σ(r1)) ∪ eval@p1(σ(d2))
- eval@p1(σ(r1)) → σ(r1), now executing (starts streaming data)
- σ(d2) → σ(r2) ∪ σ(d3); σ(r2) starts streaming data
- σ(d3) → σ(r3) ∪ σ(d4)
52Transfer and load balancing rules
Peer p1 delegates the evaluation of E to p2
53Transfer and load balancing rules
[Figure, built up over slides 53 to 57: successive rewriting steps on trees with nodes x@p1, eval@p1, eval(E), send@p1, send@p2, newRoot@p2() and eval@p2]
Peer p1 delegates the evaluation of E to p2
58Back to interleaved execution and optimization
[Figure: alternative plans for merging the streams produced from r1, r2, r3 and r4]
- Repeated transfers
- Data transfers reduced; more work for p1, merging all the streams
- Hierarchical stream merging
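The difference between one peer merging all the streams and a hierarchical merge can be sketched with Python generators. Assumptions made for the sketch: each stream is sorted, the data is invented, and the intermediate merges that would run on different peers are simply nested generators here.

```python
import heapq

def stream(peer, values):
    """A sorted stream of results produced by one peer (hypothetical data)."""
    for v in sorted(values):
        yield (v, peer)

def flat_merge(streams):
    """A single peer merges all the streams itself."""
    return heapq.merge(*streams)

def hierarchical_merge(streams):
    """Merge streams pairwise, so the final peer only merges two streams."""
    streams = list(streams)
    while len(streams) > 1:
        merged = [heapq.merge(streams[i], streams[i + 1])
                  for i in range(0, len(streams) - 1, 2)]
        if len(streams) % 2:
            merged.append(streams[-1])  # odd stream out joins the next round
        streams = merged
    return streams[0]

def make_streams():
    return [stream("p1", [1, 9]), stream("p2", [4, 5]),
            stream("p3", [2, 8]), stream("p4", [3, 7])]

flat = list(flat_merge(make_streams()))
hier = list(hierarchical_merge(make_streams()))
```

Both strategies yield the same merged stream; they differ in where the merging work and the data transfers take place.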
59Example: horizontal and vertical decomposition
- A relation d over ABC that is split both horizontally and vertically
- d = (d1 ∪ d2) ⋈ d3
- d1 = σ[B<5](d') and d2 = σ[B≥5](d')
- d', d1, d2 over AB and d3 over BC; each di is at a peer pi
- Consider the query σ[B=0]@p(d)
- → σ[B=0]@p( (σ[B<5](d') ∪ σ[B≥5](d')) ⋈ d3@p3 )
- → σ[B=0]@p( d1@p1 ⋈ d3@p3 )
- → ⋈@p( x@p[receive(d1@p1)], y@p[receive(d3@p3)] )
- → send@p1(x@p, σ[B=0]@p1(d1@p1))
- → send@p3(y@p, d3@p3)
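The rewriting can be checked on a small instance. The sketch below assumes concrete tuples and assumes that the horizontal split puts B < 5 in one fragment and B >= 5 in the other; it compares the naive plan (rebuild d, then select) with the rewritten plan (contact only p1 and p3, select at p1).

```python
def join_on_b(left, right):
    """Natural join of AB tuples with BC tuples on attribute B."""
    return [(a, b, c) for (a, b) in left for (b2, c) in right if b == b2]

# Hypothetical fragments: d' over AB is split horizontally, d3 holds BC.
d_prime = [("a1", 0), ("a2", 3), ("a3", 7), ("a4", 0)]
d1 = [t for t in d_prime if t[1] < 5]     # stored at peer p1
d2 = [t for t in d_prime if t[1] >= 5]    # stored at peer p2 (boundary is an assumption)
d3 = [(0, "c1"), (3, "c2"), (7, "c3")]    # stored at peer p3

# Naive plan: rebuild d = (d1 ∪ d2) ⋈ d3 at p, then select B = 0.
naive = [t for t in join_on_b(d1 + d2, d3) if t[1] == 0]

# Rewritten plan: d2 cannot contain B = 0, so p2 is never contacted,
# and the selection is evaluated at p1 before shipping.
from_p1 = [t for t in d1 if t[1] == 0]
rewritten = join_on_b(from_p1, d3)
```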
60Common sub-expression elimination
- eval@p(E), x@p[receive@p(E)] →
- eval@p(x@p), x@p[receive@p(E)]
[Figure: the same rewriting drawn as trees with nodes eval@p, x@p and receive@p]
61Common sub-expression elimination
62Example: recursive query processing
- Using a pseudo Datalog syntax
- s1@p(x, y) ← d2@p'(x, z), s2@p'(z, y)
- s2@p'(x, y) ← d1@p(x, z), s1@p(y, z)
- After rewriting
- on p: x@p[ ? receive@p(q1@p'(d2@p', s2@p')) ]
- root@p[ ? send@p(y@p', q2@p(? d1@p, ? x@p)) ]
- on p': root@p'[ ? send@p'(x@p, ? q1@p'(d2@p', y@p'[ ? receive@p'(s2@p') ])) ]
63Generic and global services
- q@any where q is a query
- Any peer that has some query processor for q can do it
- f@any where f is a processing service call
- Example: decryption or gene comparison
- q over a P2P collection
[Figure: rewriting rules for eval@p over generic services (q@p1, q@p2) and over a collection (coll, index)]
64The AXML algebra: conclusion
- Captures distributed XML query processing/optimization
- Based on a communication model a la CCS
- Algebraic, stream-oriented
- Orthogonal to the local XML query optimizer
- Orthogonal to the network support (DHT, small world, etc.)
- What is not yet available? A cost model and heuristics
65Outline
- Introduction: the data ring
- Calculus for P2P data management (ActiveXML)
- Algebra for P2P data management (ActiveXML algebra)
- Indexing in P2P (KadoP)
- Conclusion
664. P2P XML indexing and query processing
67Efficient evaluation of tree-pattern-queries
- Many optimization techniques
- We are interested here in distributed query evaluation/optimization
- 1) We consider XML indexing
- 2) Holistic twig join that is based on indexing
- 3) P2P indexing
- 4) P2P query processing
- 5) Optimizing P2P indexing
68XML indexing: structural identifiers
[Figure: an XML tree with nodes A, B, C, D, E, F, G and the text node John, each labeled with a (pre, post, level) identifier; the root A is (1,8,0)]
- X ancestor of Y <=> pre(X) < pre(Y) and post(X) ≥ post(Y)
- X parent of Y <=> X ancestor of Y and level(X) = level(Y) - 1
- Structural IDs: Prefix-Postfix-Level
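The ancestor and parent tests on structural identifiers can be sketched as follows. The numbering scheme is an assumption chosen for the sketch: `pre` is the preorder rank and `post` is the largest `pre` found in the node's subtree, so a leaf has `pre == post`; the names `Sid`, `label`, `is_ancestor` and `is_parent` are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import NamedTuple

class Sid(NamedTuple):
    pre: int
    post: int
    level: int

def label(root):
    """Assign (pre, post, level) to every element; post is the largest pre in its subtree."""
    ids, counter = {}, 0

    def visit(node, level):
        nonlocal counter
        counter += 1
        pre = counter
        for child in node:
            visit(child, level + 1)
        ids[node] = Sid(pre, counter, level)

    visit(root, 0)
    return ids

def is_ancestor(x, y):
    return x.pre < y.pre and x.post >= y.post

def is_parent(x, y):
    return is_ancestor(x, y) and x.level == y.level - 1

doc = ET.fromstring("<A><B><D/></B><C/></A>")
ids = label(doc)
a, b, d, c = ids[doc], ids[doc.find("B")], ids[doc.find("B/D")], ids[doc.find("C")]
```

With such identifiers, structural relationships are decided by comparing integers, without touching the documents themselves.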
69Holistic Twig Join
- Input: a document and a tree pattern query
- Find the bindings of the query in the document
- Holistic: "holistique" in French
- (the whole and not just the parts)
- Twig: "brindille" in French
- Join: you know
- Sounds like Harry Potter?
70Query evaluation over a document
Ids for A (1,8,0)
Ids for C
Ids for D
John
Ids for John
Ids are sorted in lexicographical order. The goal is to find matching Ids
71The Holistic Twig Join Algorithm
[Figure: document tree drawn over levels 0 to 4, with nodes r (1,25), b (10,11), a (16,17), b (19,22), c (11,11), c (17,17), b (20,21), c (22,22), c (21,21)]
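A building block of such algorithms can be sketched as a stack-based join between two sorted lists of identifiers. To be clear about scope: this is a simplified binary ancestor/descendant join for a pattern like b//c, not the full holistic twig join, which coordinates one stack per query node. It assumes (pre, post) pairs where post is the largest pre in the subtree, and reuses the b and c identifiers listed on this slide.

```python
def structural_join(ancestors, descendants):
    """Return (a, d) pairs where a is an ancestor of d.

    Both inputs are lists of (pre, post) identifiers sorted by pre.
    """
    out, stack = [], []
    i = 0
    for d in descendants:
        # Push every candidate ancestor that starts before d.
        while i < len(ancestors) and ancestors[i][0] < d[0]:
            a = ancestors[i]
            # Drop stacked nodes whose subtree ended before a starts.
            while stack and stack[-1][1] < a[0]:
                stack.pop()
            stack.append(a)
            i += 1
        # Drop stacked nodes whose subtree ended before d starts.
        while stack and stack[-1][1] < d[0]:
            stack.pop()
        # Everything left on the stack contains d.
        out.extend((a, d) for a in stack)
    return out

b_ids = [(10, 11), (19, 22), (20, 21)]
c_ids = [(11, 11), (17, 17), (21, 21), (22, 22)]
pairs = structural_join(b_ids, c_ids)
```

Each list is read once, front to back, which is why the lists can be consumed as streams.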
72The Holistic Twig Join Algorithm
[Figure: run of the algorithm for query nodes a, b, c, showing the input streams (a1 to a7, b1 to b6, c1 to c11), the stacks, and output matches such as (a7, b4, c8), (a7, b5, c8), (a7, b4, c9), (a7, b6, c11)]
Legend
- This is the end
- Head of the stream
- Find the match for the query sub-tree determined by this node
- The ID is present also in the stack
73P2P XML processing
74XML indexing in Xyleme
- History
- 1999: INRIA research project
- 2000: creation of a spin-off
- 2006: about 25 people
- Technology
- A scalable XML repository
- A content warehouse
- On a cluster of Linux PCs
- XML query processing
- Twig join
- Index is distributed
- Keyword-based vs. document-based
[Figure: index distributed over a LAN by hashing keys, with hash(C) receiving Put(C, (d,p,6,6,1)) and hash(John) receiving Put(John, (d,p,3,1,2))]
75Query processing over a distributed collection
A
Ids for A (p12,d456, 1,7,0)
C
D
Ids for C
Ids for D
John
Ids include peerId and docId. Ids are sorted in lexicographical order. The goal is to find matching Ids in the collection
Ids for John
76XML indexing in KadoP
- Use structural Ids
- Publish them via a DHT
- Distributed Hash Table
- Peers come and go
- Locate(k): log(n) messages to find the peer in charge of key k
- Put(k,v)
- Get(k): retrieves all the values for k
- We use Pastry
- We also tried P2PSim and JXTA
[Figure: postings published in the DHT by hashing keys; hash(C) holds the posting for C, fed by put(C, (d,p,6,6,1)); hash(John) is fed by put(John, (d,p,3,1,2))]
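The Locate/Put/Get interface can be illustrated with a single-process toy. This is not Pastry and does no routing: `ToyDHT` is a hypothetical class that places peers on a hash ring and assigns each key to the first peer at or after the key's hash, which is one common convention assumed here.

```python
import hashlib
from bisect import bisect_left
from collections import defaultdict

class ToyDHT:
    """Single-process stand-in for a DHT; peers sit on a hash ring."""

    def __init__(self, peer_names):
        self.ring = sorted((self._h(p), p) for p in peer_names)
        self.store = {p: defaultdict(list) for p in peer_names}

    @staticmethod
    def _h(key):
        return int(hashlib.sha1(key.encode()).hexdigest(), 16)

    def locate(self, key):
        """Peer in charge of key: first peer at or after hash(key), wrapping around."""
        pos = bisect_left(self.ring, (self._h(key), ""))
        return self.ring[pos % len(self.ring)][1]

    def put(self, key, value):
        """Append value to the posting stored for key."""
        self.store[self.locate(key)][key].append(value)

    def get(self, key):
        """Retrieve all the values published for key."""
        return list(self.store[self.locate(key)].get(key, []))

dht = ToyDHT(["p1", "p2", "p3"])
# Postings follow the shape shown on the slide: (docId, peerId, three numbers).
dht.put("C", ("d", "p", 6, 6, 1))
dht.put("John", ("d", "p", 3, 1, 2))
```

In a real DHT, `locate` costs a logarithmic number of messages; here it is a local lookup.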
77XML query processing in KadoP
- Given a tree pattern query Q
- Evaluate an index query indexQ to locate the peers that can provide some answers
- indexQ is a twig join
- Ship Q to these peers and evaluate it there
- If indexQ is imprecise, many false positives
- Example: ship Q to all peers (maximal parallelism)
- Example: instead of structural Ids, just use (label/word, peerId, docId)
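The trade-off between a precise and an imprecise index query can be sketched with the coarse index of the last example, which records only (label, peerId, docId). Everything below is hypothetical illustration: the documents are represented as sets of root-to-node paths, and the query Q is reduced to a single path.

```python
from collections import defaultdict

# Hypothetical documents, keyed by (peerId, docId), as sets of root-to-node paths.
DOCS = {
    ("p1", "d1"): {"A", "A/C", "A/C/John"},
    ("p2", "d2"): {"A", "A/C", "A/D", "A/D/John"},   # John is not under C here
    ("p2", "d3"): {"A", "A/D"},
}

# Imprecise index: label -> set of (peerId, docId), ignoring structure.
index = defaultdict(set)
for doc_key, paths in DOCS.items():
    for path in paths:
        index[path.split("/")[-1]].add(doc_key)

def index_query(labels):
    """Documents containing every label of the query (may include false positives)."""
    return set.intersection(*(index[label] for label in labels))

def evaluate_at_peer(doc_key, path):
    """Stand-in for evaluating Q where the document lives; Q is a single path here."""
    return path in DOCS[doc_key]

candidates = index_query(["A", "C", "John"])
answers = {d for d in candidates if evaluate_at_peer(d, "A/C/John")}
```

Document d2 is shipped the query although it has no answer; structural identifiers in the index would have ruled it out earlier.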
78KadoP architecture
[Figure: layered architecture of a KadoP peer (publish, query)]
- External layer: Web interface, semantic layer
- Logical layer: ActiveXML engine, KadoP engine (indexing, query processing)
- Physical layer: DHT (locate, put, get, delete), index
79Some technical issues
- Our goal: manage millions of documents with a large number of peers
- First experiments were a disaster
- Replace the index storage of the DHT in a file system by storage in a database (Berkeley DB)
- Extend the API of the DHT to have Append and not only Read/Write
- Extend the API of the DHT to have a streaming exchange of postings
- Useful because the XML algebra works better with streams
- Now it scales, but there is the issue of long postings
80The issue of long postings: Google in P2P
- Using keyword distribution
- Suppose
- Peer for Ullman is in Europe
- Peer for XML is in US
- we have to ship one long posting between US and Europe
- For a large number of users, we absorb all the bandwidth of the Internet backbone
- Need for replication
- Even for thousands of peers, the exchange of long postings is an issue
[Figure: the query "Ullman xml?" sent to the DHT, with the postings for Ullman and xml held by different peers]
81Intensional indexing in KadoP
- Distributed B-tree
- Long posting: bad response time
- No long posting
- get h(name), then parallel fetch
- Possibility to optimize further
- f(docId55..docId75)
- maybe it does not match
- no need to call f
[Figure: a long posting stored under h(Name) replaced by calls f, g, h, i to blocks of the posting]
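The idea of replacing a long posting by calls to blocks, each annotated with its docId range, can be sketched as follows. This is an illustration under assumptions, not KadoP code: `BlockRef`, `split_posting` and `lookup` are hypothetical names, fetches are sequential rather than parallel, and postings are plain sorted lists of integers.

```python
from typing import Callable, List, NamedTuple

class BlockRef(NamedTuple):
    lo: int                          # smallest docId in the block
    hi: int                          # largest docId in the block
    fetch: Callable[[], List[int]]   # stand-in for a service call returning the block

def split_posting(posting, block_size, calls):
    """Split a sorted posting of docIds into blocks reachable through calls."""
    refs = []
    for start in range(0, len(posting), block_size):
        block = posting[start:start + block_size]

        def fetch(block=block):
            calls.append((block[0], block[-1]))  # record that this block was fetched
            return block

        refs.append(BlockRef(block[0], block[-1], fetch))
    return refs

def lookup(refs, lo, hi):
    """Fetch only the blocks whose docId range overlaps [lo, hi]."""
    result = []
    for ref in refs:
        if ref.hi < lo or ref.lo > hi:
            continue  # the range cannot match, no need to call the service
        result.extend(d for d in ref.fetch() if lo <= d <= hi)
    return result

calls = []
refs = split_posting(list(range(1, 101)), 25, calls)
hits = lookup(refs, 55, 75)
```

Only one of the four blocks is fetched for this range; the others are skipped on the basis of their range alone.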
82More optimization
- Standard for P2P keyword search
- Gap compression and adaptive set intersection
- Standard distributed query optimization techniques
- Ship smallest list
- Load balancing
- Caching
- Replication
- Semi-join techniques, notably Bloom semi-join
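A Bloom semi-join can be sketched as follows: the peer holding the short posting sends a compact filter, the other peer ships back only the entries that may match, and the exact intersection is computed at the end. The `BloomFilter` class, its size and hash count, and the postings are assumptions made for the sketch.

```python
import hashlib

class BloomFilter:
    """Small Bloom filter; membership tests may give false positives, never false negatives."""

    def __init__(self, size=256, hashes=3):
        self.size, self.hashes, self.bits = size, hashes, 0

    def _positions(self, item):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:4], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

# Hypothetical postings (docIds) held by two distant peers.
posting_ullman = [3, 8, 21, 34]
posting_xml = list(range(0, 1000, 3))

# The peer holding the short posting sends a small filter instead of the list.
bloom = BloomFilter()
for doc in posting_ullman:
    bloom.add(doc)

# The other peer ships only the docIds that may match.
shipped = [doc for doc in posting_xml if doc in bloom]

# The exact intersection removes any false positives.
answer = sorted(set(posting_ullman) & set(shipped))
```

The long posting never crosses the network in full; how much is shipped depends on the filter's false-positive rate.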
83Outline
- Introduction: the data ring
- Calculus for P2P data management (ActiveXML)
- Algebra for P2P data management (ActiveXML algebra)
- Indexing in P2P (KadoP)
- Conclusion
845. Conclusion
85Conclusion
- Logic for distributed data management
- Opinion: XQuery is a language for local XML management
- Proposal: ActiveXML
- Algebraic foundation of distributed query optimization
- Proposal: ActiveXML algebra
- P2P (Active) XML indexing
- KadoP is now being tested and we are working on optimization
- Software
- ActiveXML is open-source, see activexml.net
- KadoP will soon be available upon request
- EDOS distribution system as well
86Lots of related work and related systems
- This is going very fast in system developments
- Structured P2P nets: Pastry, Chord
- Content delivery nets: Coral, Akamai
- XML repositories: Xyleme, DBMonet
- Multicast systems: Avalanche, Bullet
- File sharing systems: BitTorrent, Kazaa
- Pub/Sub systems: Scribe, Hyper
- Distributed storage systems: OceanStore, GoogleFS
- Etc.
- Fundamental research is somewhat left behind
87Issues
- P2P query optimization
- P2P access control
- P2P archiving
- P2P self tuning
- P2P monitoring
- P2P knowledge management: SomeWhere
- Also analysis and verification of these systems
- E.g., termination, error detection, diagnosis
88Find your own topic
- Pick your favorite problem for data or knowledge management and study it in a P2P setting
- with gigabytes of data and thousands of peers
- If you find it boring, consider it
- with terabytes of data and millions of peers
89Thank you