Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

P2P Data Management, 2006, S. Abiteboul.

1
Information Management in P2P
Serge Abiteboul, INRIA-Futurs and Univ. Paris 11
2
Introduction
3
Success stories at the time of the Internet
bubble

Google: management of Web pages
Mapquest: management of maps
Amazon: book catalogue
eBay: product catalogue
Napster (emule, bearshare, etc.): music database
Flickr: picture database
Wikipedia: dictionary
del.icio.us: annotations
In France
Meetic: dating database
Kelkoo: comparative shopping

They are all about publishing some database
4
The trend is towards peer-to-peer and
interactivity

P2P: a large and varying number of computers
cooperate to solve some particular task without
any centralized authority
Goal: build an efficient, robust, scalable system
based (typically) on inexpensive, unreliable
computers distributed in a wide area network
seti@home, Kazaa, cabal
Switch from centralized servers to communities
and syndication
Interaction and Web 2.0
Motivations: social, organizational

5
Information management in a P2P network

Private terminology: data ring
Information is heterogeneous, distributed,
replicated, dynamic
Which info: data, metadata, knowledge,
services
Peers are heterogeneous, autonomous and possibly
mobile
From sensors to PDAs to mainframes
Typically a very large number of peers
Variety of requirements: QoS, performance,
security, etc.

6
Acknowledgement

Xyleme: scalable XML warehousing
Sophie Cluet, Guy Ferran (Xyleme), and many others
ActiveXML: language for P2P data management
Omar Benjelloun (Google), Ioana Manolescu, Tova
Milo (Tel Aviv), and many others
KadoP: P2P scalable XML indexing
Ioana Manolescu, Nicoleta Preda, and others
Data Ring: infrastructure for P2P data management
Alkis Polyzotis (UC Santa Cruz)

7
Outline

Introduction: the data ring
Calculus for P2P data management (ActiveXML)
Algebra for P2P data management (ActiveXML
algebra)
Indexing in P2P (KadoP)
Conclusion
Goal of the tutorial: present issues and
technology on P2P information management
Warning: it is very biased; it is not a survey

8
Outline

Introduction: the data ring
Calculus for P2P data management (ActiveXML)
Algebra for P2P data management (ActiveXML
algebra)
Indexing in P2P (KadoP)
Conclusion

9
1. Introduction: the data ring
10
The information in a data ring

Data: tuples, collections, documents, relations
Services: data sources, possibly some processing
Metadata about resources: attribute/value
pairs, annotations
Ontologies to explain data and metadata
View definitions
Data integration information, e.g., mappings
between ontologies
Physical data: indices and materialized views

11
Functionalities of the data ring

Storage, persistence, replication
Indexing, caching, querying, updating,
optimization
Schema management, access control
Fault tolerance, self tuning, monitoring
Resource discovery, history, provenance,
annotations, multilingualism
Semantic enrichment, uncertain data
Each functionality may be achieved by a peer or
by the network

12
And now, what is a peer?

A mainframe database
A file system
Web server
A PC
A PDA
A telephone
A sensor
A home appliance
A car

A manufacturing tool
A telecom equipment
A toy
Another data ring

Any connected device or software with some
information to share
A net address and some names of resources (e.g.
document or service)
13
Advantages and disadvantages of P2P

Scaling
Performance
Optimization of parallelism
Avoid bottleneck
Replication
Availability
Replication
Cost
Avoid the cost of server
Share operational cost
Dynamicity
add/remove new data sources

Complexity
Performance
Cost for complex queries
Communication cost
Availability
Peers can leave
Consistency maintenance
Difficult to support transaction
Quality
Difficult to guarantee quality

14
Crash course on Web standards

Data exchange format: XML
Labeled ordered trees
Its main asset: XML Schema
There is much more
Distributed computing protocol: Web services
SOAP: Simple Object Access Protocol
WSDL: Web Services Description Language
UDDI: Universal Description, Discovery and
Integration
BPEL: Business Process Execution Language
Query languages: XPath and XQuery
Declarative query language for XML, full-text,
update language
Knowledge representation: OWL or RDF/S

[Figure labels: OWL, RDFS, XML, XQuery, XPath, SOAP, WSDL]
15
Information used to live in islands, but with the
Web this is changing: uniform access to
information, the dream for distributed data
management
16
Do you like the standards?

It is the wrong question!
Correct questions: What can you do with it? What
is missing?
Is XQuery the ultimate query language for the
Web? No
It is a language for querying centralized XML
We will see what it is missing
We will not talk much about semantics

17
Automatic and distributed management of the data
ring

No centralized server
No information administrator (no info manager)
Most users are non-experts
E.g., scientists
Requirements
Ease of deployment (zero-effort)
Ease of administration (zero-effort)
Ease of publication (epsilon-effort)
Ease of exploitation (epsilon-effort)
Participation in community building notably via
annotations

Happy database admin
18
What should be made automatic

Self-statistics from the monitoring of the data
ring
In particular, define the statistics that are
needed
Self-tuning based on the self-statistics
Choose the most appropriate organization
Decide to install access structures: indexes,
views, etc.
Control replication of data and services
Self-healing
Recovery from errors
E.g., replacement of a failing Web service
And automatic file management

19
Any hope?

Technology exists (database self-tuning, machine
learning, etc.)
But self-tuning for databases has advanced very
slowly
Why can this work?
There is no alternative (for db, this was just a
cool gadget)
KISS (keep it simple stupid!)
The power of parallelism
This is assuming lots of machines have free
cycles (true) and bandwidth is generous (not
always true)

20
Distributed access control

Goal: control access to ring resources
Access to resources is based on access rights
(ACL)
Who is controlling ACLs?
A node manages ACLs for a collection of
distributed resources
Easy, but against the spirit of P2P and a possible
bottleneck
The network manages access control
Anybody can get the data
The data is published with encryption and
signatures; only nodes with proper access rights
can perform reads/writes
Some techniques exist

21
Monitoring

What is monitored?
Web service calls and database updates
The Web
Web pages
RSS feed
What is produced?
A stream of events
As a continuous service
As an RSS feed
As a Web site/page
Info-surveillance
Self-statistics and tracing
Basis for error diagnosis

22
Streams are everywhere

In query processing
In indexing (KadoP)
In recursive queries (AXML-QSQ)
In messaging, monitoring and pub/sub
That is why we will use an algebra over streams
of trees and not simply an algebra over trees

23
Example: Edos distribution system

A system for the management of Linux distributions
Joint work with Mandriva Software and U. Tel Aviv
Community of open-source developers: thousands
System releases: about 10 000 software packages
plus metadata
Functionalities
Query the metadata
Query subscription
Retrieve packages
Publish a new release or update an existing one

24
Example: WebContent

WebContent: an ANR platform for the management of
web content
Web surveillance
Business, technical, web watching
Participation of Gemo
WP3: knowledge
WP5: P2P content management
Partners: CEA, EADS, Thales, Bongrain, Xyleme,
Exalead, many research groups (UVSQ, Grenoble,
Paris 6, etc.)

25
Taxonomy of such applications

Parameters
Number of peers and quantity of data
How volatile the peers are
The query/update workload
The functionalities that are desired
Edos: peers and documents in the thousands, mostly
appends for updates, peers not too volatile
An extreme: Google search engine in P2P for
billions of documents using millions of
hyper-volatile peers
We are mostly interested in the first case

26
Thesis

XQuery is fine for local XML processing and
publishing
Not sufficient for distributed data management
The success of the relational model, i.e., of
tables on a server, rests on:
A logic for defining tables
An algebra for describing query plans over tables
By analogy, for trees in a P2P system we need:
A logic for defining distributed tree data and
data services
An algebra for describing query plans over these
Proposal: ActiveXML logic and algebra

27
Outline

Introduction: the data ring
Calculus for P2P data management (ActiveXML)
Algebra for P2P data management (ActiveXML
algebra)
Indexing in P2P (KadoP)
Conclusion

28
2. Active XML: a logic for distributed data
management
29
The basis

AXML is a declarative language for distributed
information management and an infrastructure to
support the language in a P2P framework
Simple idea: XML documents with embedded service
calls
Intensional data
Some of the data is given explicitly whereas for
some, its definition (i.e. the means to acquire
it when needed) is given
Dynamic data
If the data sources change, the same document
will provide different information

30
Example (omitting syntactic details)
<resorts state="Colorado">
  <resort>
    <name> Aspen </name>
    <sc> Unisys.com/snow("Aspen") </sc>
    <depth unit="meter">1</depth>
    <hotels ID="AspHotels"> ...
      Yahoo.com/GetHotels(<city name="Aspen"/>)
    </hotels>
  </resort>
</resorts>

May contain calls
to any SOAP web service
e-bay.net, google.com
to any AXML web services
to be defined
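
The idea of a document with embedded service calls can be sketched in a few lines of Python. This is only an illustration, not the ActiveXML implementation: the `service` attribute on `<sc>`, the `SERVICES` registry and the two stub functions are assumptions made for the example (the slide explicitly omits the real syntax), and real peers would issue SOAP calls instead of local function calls.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: each <sc> names a service and carries its argument.
DOC = """
<resorts state="Colorado">
  <resort>
    <name>Aspen</name>
    <sc service="Unisys.com/snow">Aspen</sc>
    <hotels ID="AspHotels">
      <sc service="Yahoo.com/GetHotels">Aspen</sc>
    </hotels>
  </resort>
</resorts>
"""

def snow(city):
    # Stub standing in for a remote Web service.
    return [ET.fromstring('<depth unit="meter">1</depth>')]

def get_hotels(city):
    # Stub standing in for a remote Web service.
    return [ET.fromstring(f"<hotel>{name}</hotel>") for name in ("Hotel A", "Hotel B")]

SERVICES = {"Unisys.com/snow": snow, "Yahoo.com/GetHotels": get_hotels}

def materialize(root):
    """Replace every <sc> element by the result of its call (one round only)."""
    calls = 0
    for parent in list(root.iter()):
        new_children = []
        for child in list(parent):
            if child.tag == "sc":
                argument = (child.text or "").strip()
                new_children.extend(SERVICES[child.get("service")](argument))
                calls += 1
            else:
                new_children.append(child)
        parent[:] = new_children
    return calls

root = ET.fromstring(DOC)
number_of_calls = materialize(root)
```

After `materialize`, the intensional parts of the document have become extensional data. If the stubs returned different results tomorrow, the same document would yield different content, which is the "dynamic data" aspect.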

31
Marketing ? Philosophy
Active answer: intensional, dynamic and flexible
Embedding calls in data is an old idea in databases
Manon: What's the capital of Brazil?
Dad: Let's ask Wikipedia.com!
Manon: How do I get a cheap ticket to Galapagos?
Dad: Let's place a subscription on LastMinute.com!
Manon: What are the countries in the EC?
Dad: France, Germany, Holland, Belgium, and, hum... Let's ask YouLists.com for more!
32
Active XML peer
AXML peer
soap

Peer-to-peer architecture
Each Active XML peer
Repository manages Active XML data
Web client calls the services inside a document
Web server provides (parameterized)
queries/updates over the repository as web
services
Exchange of AXML instead of XML

33
What is an AXML peer?

PC
Now open source (ObjectWeb): queries in OQL
Peer on a mass storage system
eXist (open source XML database): queries in
XQuery
Xyleme: queries in XyQL
PDA or cell phone
Persistence in the file system and XPath
Ongoing: the entire network
Data is stored in a P2P network: KadoP
More: Java Card, a relational database

34
A key issue: call activation

When to activate the call?
Explicit pull mode: active databases
Implicit pull mode: deductive databases
Push mode: query subscription
What to do with its result?
How long is the returned data valid?
Mediation and caching
Where to find the arguments?
Under the service call: XML, XPath or a service
call
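
A minimal sketch of two of these activation policies, with a validity period for the cached result. The class name `ServiceCall`, its methods and the `ttl_seconds` parameter are hypothetical names chosen for the illustration; push mode (subscriptions) is not covered here.

```python
import time

class ServiceCall:
    """An embedded call whose result is cached for ttl_seconds."""

    def __init__(self, fn, ttl_seconds, clock=time.monotonic):
        self.fn = fn
        self.ttl = ttl_seconds
        self.clock = clock
        self.activations = 0
        self._value = None
        self._fetched_at = None

    def activate(self):
        # Explicit pull: the call is fired because someone asked for it.
        self._value = self.fn()
        self._fetched_at = self.clock()
        self.activations += 1
        return self._value

    def value(self):
        # Implicit pull: fire the call only when the data is needed
        # and the cached result is missing or no longer valid.
        expired = (
            self._fetched_at is None
            or self.clock() - self._fetched_at >= self.ttl
        )
        return self.activate() if expired else self._value

# A fake clock makes the behaviour easy to follow.
now = [0.0]
call = ServiceCall(lambda: "1 meter", ttl_seconds=10, clock=lambda: now[0])
first = call.value()          # first use triggers the call
now[0] = 5.0
second = call.value()         # still valid: served from the cache
activations_while_valid = call.activations
now[0] = 10.0
third = call.value()          # validity expired: the call is fired again
```

The validity period is one simple answer to "how long is the returned data valid?"; a mediator could choose it per service.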

35
Another key issue: what to send?

Send some AXML tree t
As the result of a query or as a parameter of a call
The tree t contains calls; do we have to evaluate
them?
If we do, we may introduce new service calls; do we
have to evaluate all these calls before
transmitting the data?

Hi John, what is the phone number of the Prime
Minister of France?
Find his name at whoswho.com then look in the
phone dir
Look in the yellow pages for de Villepin's number in
the phone dir of www.gov.fr
(33) 01 56 00 01

36
Active XML: cool idea, complex problems

Blasphemous claim
Active XML is the proper paradigm for data
exchange!
Not XML, not XQuery
Brings to a unique setting
distributed db, deductive db, active db, stream
data
warehousing, mediation
This is unreasonable? Yes!
Plenty of work ahead to make it work
But first, the algebra

37
Outline

Introduction: the data ring
Calculus for P2P data management (ActiveXML)
Algebra for P2P data management (ActiveXML
algebra)
Query processing
Query optimization
Indexing in P2P (KadoP)
Conclusion

38
3. Active XML algebra
39
Motivation

Relational model: centralized tables
Optimization: algebraic expressions and
rewriting
Active XML model: distributed trees
Optimization: algebraic expressions and
rewriting
Distributed query optimization based on algebraic
rewriting of Active XML trees
Based on experiences with AXML optimization

40
Active XML peers

We focus on positive AXML
Set-oriented data
Positive/monotone services
Services: tree-pattern-query-with-join (TPQJ) queries
Services produce streams
Optimized by a local query optimizer
Evaluated by a local query processor
Out of our scope

[Figure: local query processing at a peer p; a join consumes two input streams and produces an output stream]
41
The problem

An AXML system
A set of peers
For each peer a set of documents and services
Extensional data is distributed
Intensional data (knowledge) is distributed
Defined using query services (TPQJ queries)
These services are generic: any peer can evaluate
a query
A query q to some peer
Evaluate the answer to q with optimal response
time

42
AXML algebra

(AXML) algebraic expressions

[Figure labels: AXML logic, d@p]
Each such expression lives at some peer
Includes the AXML trees
43
Algebraic expressions annotations

Executing service call ?
Terminated service call ?
Subtlety
q@p(5): definition of intensional data
eval(q@p(5)): request to evaluate it during
query optimization
? q@p(5): query is being evaluated during query
processing
? q@p(5): query evaluation is complete

44
Evaluation rules local rules
for l ? sc, s ? send, receive
45
Evaluation rules transfer rules
?

Site p asks p to do the work and send the result
to p

46
Evaluation rules more transfer rules
[Figure: rewriting rule with nodes x@p and Z]

When a query is evaluated, results appear
They are sent to the place that requested them
Also some rules for eof

47
Evaluation

Reminder: setting
An AXML system
A request to evaluate query q at peer p: eval@p(q)
Rewrite the trees in peer workspaces until
termination of the process
Results
For positive AXML, this process converges to a
possibly infinite state
This process computes the answer to q
May be fairly inefficient: need for optimization!

48
Optimization

More rewrite rules to evaluate a query more
efficiently

49
Query optimization

Well-known optimization techniques for
distributed data management
Pushing selections
Semijoin reducers
Horizontal, vertical, hybrid decomposition
Recursive query processing and query-subquery
Some specific AXML optimizations
Pushing queries over service calls
Lazy service call evaluation
Optimizing subscription management
All are captured by the algebraic framework
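
As an illustration of one of the classical techniques listed above, here is a sketch of a semijoin reducer between two peers. Relations are modelled as lists of dicts and the "shipping" between peers is only simulated; the function names are assumptions made for the example, not part of the AXML algebra.

```python
def semijoin_reduce(left, right, key):
    """Ship only the join keys of `left`, then return the matching part of `right`."""
    keys = {row[key] for row in left}                      # sent from p1 to p2
    reduced = [row for row in right if row[key] in keys]   # sent back from p2 to p1
    return keys, reduced

def join(left, right, key):
    """Plain hash join on a single attribute."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]

# Relation at peer p1 (attributes A, B) and at peer p2 (attributes B, C).
at_p1 = [{"A": "x", "B": 1}, {"A": "y", "B": 2}]
at_p2 = [{"B": 1, "C": "u"}, {"B": 3, "C": "v"}, {"B": 3, "C": "w"}]

keys, reduced = semijoin_reduce(at_p1, at_p2, "B")
result = join(at_p1, reduced, "B")
```

Only the rows of `at_p2` that can contribute to the join travel over the network, at the price of first shipping the (usually small) set of keys.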

50
Example pushing selections
Suppose q q1(?(q2))

Same rule applies if d@p2 is replaced by a
continuous query

51
Example interleaving of processing and
optimization

At peer i: di ri ? di1
Query at p1: ?(d1)
?(d1) ? ?(r1) ? ?(d2)
eval@p1(?(r1) ? ?(d2)) ? eval@p1(?(r1)) ? eval@p1(?(d2))
eval@p1(?(r1)) ? ??(r1) (starts streaming data)
?(d2) ? ?(r2) ? ?(d3); ?(r2) starts streaming data
?(d3) ? ?(r3) ? ?(d4)

52
Transfer and load balancing rules
Peer p1 delegates the evaluation of E to p2
53-57
Transfer and load balancing rules
[Figure, built up over slides 53 to 57: tree rewriting with nodes x@p1, eval@p1, eval(E), send@p1, send@p2, newRoot@p2() and eval@p2]
Peer p1 delegates the evaluation of E to p2
58
Back to interleaved execution and optimization

[Figure: repeated transfers of the streams ?(r1), ?(r2), ?(r3), ?(r4)]
Data transfers reduced
More work for p1: merging all the streams
Hierarchical stream merging
59
Example Horizontal and vertical decomposition

A relation d over ABC that is split both
horizontally and vertically
d = (d1 ∪ d2) ⋈ d3
d1 = σB<5(d') and d2 = σB>5(d')
d', d1, d2 over AB and d3 over BC; each di is at
a peer pi
Consider the query σB=0@p(d)
→ σB=0@p( (σB<5(d') ∪ σB>5(d')) ⋈ d3@p3 )
→ σB=0@p( d1@p1 ⋈ d3@p3 )
→ ⋈@p( x@p ?receive(d1@p1)?,
y@p ?receive(d3@p3)? )
? send@p1(x@p, σB=0@p1(d1@p1))
? send@p3(y@p, d3@p3)
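
The rewriting in this example can be checked on a small instance. The sketch below assumes the following reading of the slide: d is the join of (d1 union d2) with d3, d1 holds the tuples of d' with B < 5, d2 those with B > 5, and the query selects B = 0. The data values are made up for the illustration.

```python
def join_ab_bc(ab, bc):
    """Natural join of a relation over (A, B) with a relation over (B, C)."""
    return {(a, b, c) for (a, b) in ab for (b2, c) in bc if b == b2}

d_prime = {("a1", 0), ("a2", 3), ("a3", 7), ("a4", 0)}
d1 = {t for t in d_prime if t[1] < 5}   # horizontal fragment at peer p1
d2 = {t for t in d_prime if t[1] > 5}   # horizontal fragment at peer p2
d3 = {(0, "c1"), (3, "c2"), (7, "c3")}  # vertical fragment at peer p3

# Naive plan: ship every fragment to p, rebuild d, then select B = 0.
naive = {t for t in join_ab_bc(d1 | d2, d3) if t[1] == 0}
shipped_naive = len(d1) + len(d2) + len(d3)

# Rewritten plan: d2 cannot contain B = 0, and the selection runs at p1.
selected_at_p1 = {t for t in d1 if t[1] == 0}
pushed = join_ab_bc(selected_at_p1, d3)
shipped_pushed = len(selected_at_p1) + len(d3)
```

Both plans return the same tuples, but the rewritten one never contacts p2 and ships fewer tuples from p1.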

60
Common sub-expression elimination

eval@p(E), x@p?receive@p(E)? ?
eval@p(x@p), x@p?receive@p(E)?

[Figure: tree form of the rule, with nodes eval@p, x@p and receive@p]
61
Common sub-expression elimination
62
Example recursive query processing

Using a pseudo-Datalog syntax
s1@p(x, y) ← d2@p'(x, z), s2@p'(z, y)
s2@p'(x, y) ← d1@p(x, z), s1@p(y, z)
After rewriting
on p: x@p? ? receive@p(q1@p'(d2@p', s2@p') ) ?
root@p? ? send@p(y@p', q2@p(? d1@p, ?x@p) ) ?
on p': root@p'? ? send@p'(x@p,
? q1@p'(d2@p', y@p'? ? receive@p'(s2@p') ? ) )
?

63
Generic and global services

q@any where q is a query
Any peer that has some query processor for q can
do it
f@any where f is a processing service call
Example: decryption or gene comparison
q over a P2P collection

[Figure labels: eval@p, q@p1, q@p2, q, coll, index]
64
The AXML algebra conclusion

Captures distributed XML query processing/optimization
Based on a communication model à la CCS
Algebraic, stream-oriented
Orthogonal to the local XML query optimizer
Orthogonal to the network support (DHT, small
world, etc.)
What is not yet available? A cost model and
heuristics

65
Outline

Introduction: the data ring
Calculus for P2P data management (ActiveXML)
Algebra for P2P data management (ActiveXML
algebra)
Indexing in P2P (KadoP)
Conclusion

66
4. P2P XML indexing and query processing
67
Efficient evaluation of tree-pattern-queries

Many optimization techniques
We are interested here in distributed query
evaluation/optimization
1) We consider XML indexing
2) Holistic twig join that is based on indexing
3) P2P indexing
4) P2P query processing
5) Optimizing P2P indexing

68
XML indexing structural identifiers

Structural IDs: (prefix, postfix, level)
X ancestor of Y <=> pre(X) < pre(Y) and post(X) >
post(Y)
X parent of Y <=> X ancestor of Y and level(X) =
level(Y) - 1
[Figure: example tree with nodes A to G and a text node John, each labelled with its structural ID]
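
The two tests can be sketched as follows. The small document used here is invented for the illustration (it is not the tree of the slide), and keying the result by tag assumes that tags are unique in the document.

```python
import xml.etree.ElementTree as ET

def structural_ids(root):
    """Map each tag to its (pre, post, level) triple; assumes unique tags."""
    ids = {}
    counter = {"pre": 0, "post": 0}

    def visit(node, level):
        counter["pre"] += 1            # rank in a pre-order traversal
        pre = counter["pre"]
        for child in node:
            visit(child, level + 1)
        counter["post"] += 1           # rank in a post-order traversal
        ids[node.tag] = (pre, counter["post"], level)

    visit(root, 0)
    return ids

def is_ancestor(x, y):
    # x starts before y and ends after y
    return x[0] < y[0] and x[1] > y[1]

def is_parent(x, y):
    return is_ancestor(x, y) and x[2] == y[2] - 1

ids = structural_ids(ET.fromstring("<A><B><John/><D/></B><C><E/></C></A>"))
```

The point of the encoding is that structural relationships are decided by comparing integers, without touching the documents themselves, which is what makes it usable as an index.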
69
Holistic Twig Join

Input: a document and a tree pattern query
Find the bindings of the query in the document
Holistic: the whole, not just the parts
Twig: a small branch
Join: you know
Sounds like Harry Potter?
70
Query evaluation over a document
[Figure: one sorted stream of Ids per query node: Ids for A, e.g., (1,8,0); Ids for C; Ids for D; Ids for John]
Ids are sorted in lexicographical order
The goal is to find matching Ids
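
To give a feel for how sorted Id streams are matched, here is a sketch of a stack-based ancestor/descendant join over two streams of (pre, post) identifiers. It is a simplified binary building block, not the holistic twig join itself, which handles a whole tree pattern at once; the function name and the sample data are assumptions made for the example.

```python
def structural_join(ancestors, descendants):
    """Return the (a, d) pairs with a an ancestor of d.

    Both inputs are lists of (pre, post) pairs sorted by pre.
    Each list is scanned once; the stack holds a chain of nested candidates.
    """
    out = []
    stack = []
    i = 0
    for d in descendants:
        # Push every candidate ancestor that starts before d.
        while i < len(ancestors) and ancestors[i][0] < d[0]:
            a = ancestors[i]
            # Drop candidates that ended before a started.
            while stack and stack[-1][1] < a[1]:
                stack.pop()
            stack.append(a)
            i += 1
        # Drop candidates that ended before d started.
        while stack and stack[-1][1] < d[1]:
            stack.pop()
        # Everything left on the stack contains d.
        out.extend((a, d) for a in stack)
    return out

# Tree: r(1,7) with children a(2,4) and a(6,6);
# a(2,4) has children a(3,2) and c(5,3); a(3,2) has child c(4,1); a(6,6) has child c(7,5).
a_stream = [(2, 4), (3, 2), (6, 6)]
c_stream = [(4, 1), (5, 3), (7, 5)]
pairs = structural_join(a_stream, c_stream)
```

Because both streams are consumed in order, the same code works when the Ids arrive incrementally, which fits the stream-oriented setting of the talk.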
71
The Holistic Twig Join Algorithm
[Figure: example document tree, levels 0 to 4, with interval labels r (1,25); b (10,11); a (16,17); b (19,22); c (11,11); c (17,17); b (20,21); c (22,22); c (21,21)]
72
The Holistic Twig Join Algorithm
[Figure: one stream and one stack per query node a, b, c; stream elements a1 to a7, b1 to b6, c1 to c11]
Output: (a7, b4, c8), (a7, b5, c8), (a7, b4, c9), (a7, b6, c11)
Legend: end of the stream; head of the stream; node for which the algorithm finds the match of the query sub-tree it determines; ID also present in the stack
73
P2P XML processing
74
XML indexing in Xyleme

History
1999: INRIA research project
2000: creation of a spin-off
2006: about 25 people
Technology
A scalable XML repository
A content warehouse
On a cluster of Linux PCs
XML query processing
Twig join
Index is distributed
Keyword-based vs. document-based

[Figure: over a LAN, postings are sent to the machines in charge of hash(C) and hash(John), e.g., Put(Cd,p,6,6,1) and Put(Johnd,p,3,1,2)]
75
Query processing over a distributed collection
[Figure: one sorted stream of Ids per query node A, C, D, John, e.g., Ids for A: (p12,d456, 1,7,0)]
Ids include peerId and docId
Ids are sorted in lexicographical order
The goal is to find matching Ids in the collection
76
XML indexing in KadoP

Use structural Ids
Publish them via a DHT
Distributed Hash Table
Peers come and go
Locate(k): log(n) messages to find the peer in
charge of key k
Put(k,v)
Get(k): retrieves all the values for k
We use Pastry
We also tried P2PSim and JXTA

[Figure: postings are sent through the DHT to the peers in charge of hash(C) and hash(John), e.g., put(Cd,p,6,6,1) and put(Johnd,p,3,1,2); the peer for hash(C) stores the posting for C]
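
The locate/put/get interface can be mimicked with a toy, single-process DHT. This sketch is not Pastry: there is no routing, so `locate` is a direct lookup rather than log(n) messages, and peers never join or leave. The class name `ToyDHT` and the shape of the postings (peerId, docId, pre, post, level) are assumptions made for the illustration.

```python
import hashlib
from bisect import bisect_left

class ToyDHT:
    """Keys and peers are hashed onto the same ring; a key belongs to its successor peer."""

    def __init__(self, peer_names):
        self.ring = sorted((self._hash(name), name) for name in peer_names)
        self.store = {name: {} for name in peer_names}

    @staticmethod
    def _hash(key):
        return int(hashlib.sha1(key.encode("utf-8")).hexdigest(), 16)

    def locate(self, key):
        # First peer whose hash is >= hash(key), wrapping around the ring.
        index = bisect_left(self.ring, (self._hash(key), ""))
        return self.ring[index % len(self.ring)][1]

    def put(self, key, value):
        # Appends to the posting list kept by the peer in charge of the key.
        self.store[self.locate(key)].setdefault(key, []).append(value)

    def get(self, key):
        return list(self.store[self.locate(key)].get(key, []))

dht = ToyDHT(["p1", "p2", "p3"])
dht.put("C", ("p", "d", 6, 6, 1))
dht.put("C", ("p2", "d9", 2, 1, 1))
dht.put("John", ("p", "d", 3, 1, 2))
```

Every peer that indexes a document containing the label C sends its postings to the same peer, which is why postings for frequent labels grow long.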
77
XML query processing in KadoP

Given a tree pattern query Q
Evaluate an index query indexQ to locate the
peers that can provide some answers
indexQ is a twig join
Ship Q to these peers and evaluate it there
If indexQ is imprecise, many false positives
Example: ship Q to all peers (maximal
parallelism)
Example: instead of structural Ids, just use
(label/word, peerId, docId)
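
The last variant, an index that only records (label/word, peerId, docId), can be sketched as a simple intersection of posting lists. The index content below is made up, and the function names are hypothetical.

```python
def candidate_documents(index, labels):
    """Documents containing every label of the query (structure is ignored)."""
    postings = [set(index.get(label, ())) for label in labels]
    if not postings:
        return set()
    return set.intersection(*postings)

def peers_to_contact(index, labels):
    """Peers to which the full query Q must be shipped."""
    return sorted({peer for peer, _doc in candidate_documents(index, labels)})

# Coarse index: label -> list of (peerId, docId).
index = {
    "A": [("p1", "d1"), ("p2", "d7")],
    "C": [("p1", "d1"), ("p2", "d7"), ("p3", "d2")],
    "John": [("p1", "d1"), ("p3", "d2")],
}

docs = candidate_documents(index, ["A", "C", "John"])
peers = peers_to_contact(index, ["A", "C", "John"])
```

A candidate document contains all the labels but not necessarily in the right structure, so it may be a false positive: the index is smaller than one with structural Ids, but the contacted peers may do useless work.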

78
KadoP architecture
[Figure: layers of a KadoP peer (publish, query). External layer: Web interface, semantic layer, ActiveXML engine. Logical layer (KadoP engine): indexing, query processing. Physical layer: index, DHT (locate, put, get, delete).]
79
Some technical issues

Our goal: manage millions of documents with a
large number of peers
First experiments were a disaster
Replace the index storage of the DHT in a FS by
storage in a database (Berkeley DB)
Extend the API of the DHT to have Append and not
only Read/Write
Extend the API of the DHT to have a streaming
exchange of postings
Useful because the XML algebra works better with
streams
Now it scales, but there is the issue of long
postings

80
The issue of long postings: Google in P2P

Using keyword distribution
Suppose
Peer for Ullman is in Europe
Peer for XML is in US
we have to ship one long posting between US and
Europe
For a large number of users, we absorb all the
bandwidth of the Internet backbone
Need for replication
Even for thousands of peers, the exchange of long
postings is an issue

[Figure: the query "Ullman xml?" is answered through the DHT by the peers in charge of Ullman and of xml]
81
Intensional indexing in KadoP
Distributed B-tree

Long posting: bad response time
No long posting
get h(name), then parallel fetch
Possibility to optimize further
f(docId55..docId75)
maybe it does not match
no need to call f

[Figure: a long posting stored under h(Name), versus h(Name) holding the calls f, g, h, i]
82
More optimization

Standard for P2P keyword search
Gap compression and adaptive set intersection
Standard distributed query optimization
techniques
Ship smallest list
Load balancing
Caching
Replication
Semi-join techniques, notably Bloom semi-join
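
Gap compression, the first technique mentioned above, can be sketched as follows: a sorted posting of integer Ids is stored as differences between consecutive Ids, and the small differences are then written with a variable-length byte code. The choice of a 7-bit variable-byte code is an assumption for the illustration; the slides do not say which encoding KadoP uses.

```python
def to_gaps(sorted_ids):
    """Replace each Id by its distance to the previous one."""
    gaps, previous = [], 0
    for value in sorted_ids:
        gaps.append(value - previous)
        previous = value
    return gaps

def from_gaps(gaps):
    ids, total = [], 0
    for gap in gaps:
        total += gap
        ids.append(total)
    return ids

def varint_encode(numbers):
    """7 bits of payload per byte; the high bit marks a continuation."""
    out = bytearray()
    for n in numbers:
        while n >= 0x80:
            out.append((n & 0x7F) | 0x80)
            n >>= 7
        out.append(n)
    return bytes(out)

def varint_decode(data):
    numbers, n, shift = [], 0, 0
    for byte in data:
        n |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            numbers.append(n)
            n, shift = 0, 0
    return numbers

posting = [1000, 1003, 1010, 1200, 70000]
gaps = to_gaps(posting)
compressed = varint_encode(gaps)
uncompressed = varint_encode(posting)
```

On this tiny posting the gain is modest; it grows with the length and density of the posting, which is exactly the case of the long postings discussed earlier.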

83
Outline

Introduction: the data ring
Calculus for P2P data management (ActiveXML)
Algebra for P2P data management (ActiveXML
algebra)
Indexing in P2P (KadoP)
Conclusion

84
5. Conclusion
85
Conclusion

Logic for distributed data management
Opinion: XQuery is a language for local XML
management
Proposal: ActiveXML
Algebraic foundation of distributed query
optimization
Proposal: ActiveXML algebra
P2P (Active) XML indexing
KadoP is now being tested and we are working on
optimization
Software
ActiveXML is open-source: see activexml.net
KadoP will be released soon; it is already available upon
request
EDOS distribution system as well

86
Lots of related work and related systems

This is going very fast in system developments
Structured P2P nets: Pastry, Chord
Content delivery nets: Coral, Akamai
XML repositories: Xyleme, DBMonet
Multicast systems: Avalanche, Bullet
File sharing systems: BitTorrent, Kazaa
Pub/Sub systems: Scribe, Hyper
Distributed storage systems: OceanStore,
GoogleFS
Etc.
Fundamental research is somewhat left behind

87
Issues

P2P query optimization
P2P access control
P2P archiving
P2P self tuning
P2P monitoring
P2P knowledge management: SomeWhere
Also analysis and verification of these systems
E.g., termination, error detection, diagnosis

88
Find your own topic

Pick your favorite problem for data or knowledge
management and study it in a P2P setting
with gigabytes of data and thousands of peers
If you find it boring, consider it
with terabytes of data and millions of peers

89
Thank you
