ICS 214B: Transaction Processing and Distributed Data Management

Lecture 18: Data Management in Peer-to-Peer Systems. Professor Chen Li. Based on slides developed by Beverley Yang and Hector Garcia-Molina.

1
ICS 214B: Transaction Processing and Distributed Data Management

Lecture 18: Data Management in Peer-to-Peer Systems
Professor Chen Li
Based on slides developed by Beverley Yang and Hector Garcia-Molina

2
What is P2P?
Pastry, JXTA, CAN, Fiorano, Napster, Freenet, United Devices, OpenCola, AIM, OceanStore,
NetMeeting, Farsite, Gnutella, ICQ, eBay, Morpheus, LimeWire, SETI@home, BearShare, UDDI,
Groove, Jabber, Popular Power, KaZaA, Folding@home, Tapestry, Mojo Nation, Process Tree, Chord
3
Napster
[Figure: Napster, with a central server]
4
Gnutella
5
PeerCast
[Figure labels: UCI, source]
6
What is Peer-to-Peer?

Definition: nodes of equal roles exchanging information and services directly
Is this a new idea?
IP routing (1970s)
Mariposa (1990s)
Distributed Databases!
What are people really thinking?

7
Implicit Definition of P2P

Scale: millions (billions?) of peers
Nature of peers: PCs
Application: lightweight semantics (e.g., file-sharing)

8
P2P vs. Distributed DBMS

Traditional DDBMS Issues
Transactions
Network Partitions
Distributed Query Optimization
Interoperation of heterogeneous data sources
Reliability/failure of nodes
Complex features do not scale

9
P2P vs. Distributed DBMS

Example application: file-sharing
Simple data model and query language
No complex query optimization
Easy interoperation
No guarantee on quality of results
Individual site availability unimportant
Local updates
No transactions
Network partitions OK
Simple, and therefore amenable to a large-scale network of PCs

10
Potential Benefits

Efficiency: harnessing unused resources
Self-organizing
Effectively sharing cost of ownership
Robustness and availability through replication
Anonymity/legal protection

11
Challenges

No authority to enforce behavior
Cooperation
Unreliability of individual peers
Efficiency of distributed operations (absolute resources)

12
Research Areas

Resource Management
Security
Efficient Search

13
Resource Management

Resources:
Storage/information
CPU processing
Bandwidth
Issues:
Fairness
Load balancing

14
Example: Data Trading
[Figure: sites 1, 2, and 3 with data items A1, C1, B1, A2, B2, C2]
15
Example: Data Trading
[Figure: sites 1, 2, and 3 with data items A1, C1, B1, A2, B2, C2]
16
Data Trading

Order of trades impacts availability
Issues:
Swaps vs. deeds
Fixed price vs. bids
Preference to sites with a lot of space? To reliable sites? To desperate sites?

17
Security

Issues:
Reputation
Trust
Accountability
Information preservation
Information quality
Denial-of-service attacks
Problem: detecting and punishing bad behavior

18
Information Preservation

Example policy: make 3 copies of documents
[Figure: document A1, "make copies"]
What can go wrong?
19
What Can Go Wrong?

Bad site deletes copies
Bad site alters copy
Bad site publishes fake
Bad site makes many copies at other sites
...

[Figure: copies of document A1, "make copies"]
20
Reputation Systems

Peers evaluate each other
Good reviews -> good reputation
Bad reviews -> bad reputation
No reviews -> ?
Problems:
Trustworthiness of reviews
Permanence of identity

21
Efficiency of Search

Problem: finding a needle in a haystack
Efficiency is measured in terms of absolute resources consumed

22
Architecture

Hybrid: centralized index, P2P file storage and transfer
Super-peer: a pure network of hybrid clusters
Pure: functionality completely distributed

23
Goal

Develop search techniques for loose systems that are:
Efficient
Simple (easy to implement, no hidden costs)
Realistically and thoroughly evaluated

24
Current Techniques: Gnutella
Breadth-First Search (BFS)
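
A minimal sketch of Gnutella-style BFS flooding, assuming an in-memory topology. The names `flood_bfs`, `graph`, and `files` are hypothetical and are not taken from any Gnutella implementation; matching a query by substring is also an assumption. The sketch counts every forwarded query message, including redundant ones that reach an already-visited node, since those still consume bandwidth.

```python
from collections import deque

def flood_bfs(graph, files, source, keyword, max_depth):
    """Flood a query outward from `source` for at most `max_depth` hops.

    graph: dict mapping node -> list of neighbor nodes (assumed topology)
    files: dict mapping node -> set of file names held by that node
    Returns (results, messages): the matching (node, file) pairs and the
    number of query messages sent, redundant ones included.
    """
    seen = {source}
    frontier = deque([(source, 0, None)])  # (node, depth, node it came from)
    results = []
    messages = 0
    while frontier:
        node, depth, parent = frontier.popleft()
        if depth > 0:
            # Every node other than the source checks its own collection.
            results.extend((node, f) for f in sorted(files.get(node, ()))
                           if keyword in f)
        if depth == max_depth:
            continue  # hop limit reached: do not forward further
        for neighbor in graph.get(node, ()):
            if neighbor == parent:
                continue  # never send the query straight back
            messages += 1
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1, node))
    return results, messages
```

Raising `max_depth` reaches more nodes but also multiplies the message count, which is the cost the later techniques try to reduce.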
25
Metrics

Cost (aggregate):
Bandwidth
Processing power
Quality of results:
Number of results
Satisfaction (true if number of results > X, false otherwise)
Time to satisfaction

26
Iterative Deepening

Interested in satisfaction, not the number of results
BFS returns too many results, so it is expensive
Iterative deepening: a common technique to reduce the cost of BFS
Intuition: a search at a small depth is much cheaper than at a larger depth
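
The intuition above can be sketched as follows. This is an illustration under stated assumptions, not the lecture's implementation: `results_within` and `iterative_deepening` are hypothetical names, satisfaction is taken to mean more than `z` results, and each iteration re-runs the search from scratch and ignores any waiting time between iterations.

```python
from collections import deque

def results_within(graph, files, source, keyword, depth_limit):
    """Collect (node, file) matches from nodes at most `depth_limit` hops away."""
    seen = {source}
    frontier = deque([(source, 0)])
    found = []
    while frontier:
        node, depth = frontier.popleft()
        if depth > 0:
            found.extend((node, f) for f in sorted(files.get(node, ()))
                         if keyword in f)
        if depth < depth_limit:
            for neighbor in graph.get(node, ()):
                if neighbor not in seen:
                    seen.add(neighbor)
                    frontier.append((neighbor, depth + 1))
    return found

def iterative_deepening(graph, files, source, keyword, depths, z):
    """Search at each depth in `depths` in turn; stop once satisfied.

    Returns (depth, results), where depth is None if the query is still
    unsatisfied after the deepest search.
    """
    found = []
    for depth_limit in depths:
        found = results_within(graph, files, source, keyword, depth_limit)
        if len(found) > z:  # assumed satisfaction test: more than z results
            return depth_limit, found
    return None, found
```

A query that is satisfied at a shallow depth never pays for the deeper, more expensive searches.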

27
Iterative Deepening
[Figure legend: source, forward query, processed query, found result, forward response]
28
Directed BFS

Sends the query to a subset of neighbors
Maintains statistics on neighbors
E.g., ping latency, history of number of results
Chooses the subset intelligently (via heuristics) to maximize quality of results
E.g., neighbors with the shortest message queue, since a long message queue implies the neighbor is saturated/dead

29
Directed BFS
[Figure legend: source, forward query, processed query, found result, forward response]
30
Directed BFS Heuristics
RAND: random
RES: returned the greatest number of results in the past
TIME: had the shortest average time to satisfaction in the past
HOPS: had the smallest average number of hops for response messages in the past
MSG: sent our client the greatest number of messages
QLEN: shortest message queue
DEG: highest degree
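
One way to picture these heuristics is as sort keys over per-neighbor statistics. The sketch below is illustrative only: `NeighborStats`, its field names, and `choose_neighbors` are hypothetical, and how a real client gathers these statistics is not specified here.

```python
import random
from dataclasses import dataclass

@dataclass
class NeighborStats:
    """Hypothetical statistics a node might keep about one neighbor."""
    name: str
    past_results: int = 0     # RES: results returned for past queries
    avg_time: float = 0.0     # TIME: average time to satisfaction in the past
    avg_hops: float = 0.0     # HOPS: average hops taken by past responses
    messages_sent: int = 0    # MSG: messages this neighbor has sent to us
    queue_length: int = 0     # QLEN: current message queue length
    degree: int = 0           # DEG: number of neighbors it has

# Each heuristic is a sort key; a smaller key means a more preferred neighbor.
HEURISTICS = {
    "RES": lambda s: -s.past_results,
    "TIME": lambda s: s.avg_time,
    "HOPS": lambda s: s.avg_hops,
    "MSG": lambda s: -s.messages_sent,
    "QLEN": lambda s: s.queue_length,
    "DEG": lambda s: -s.degree,
}

def choose_neighbors(stats, heuristic, k, rng=None):
    """Pick up to k neighbors to which the query will be forwarded."""
    if heuristic == "RAND":
        rng = rng or random.Random()
        return rng.sample(list(stats), min(k, len(stats)))
    return sorted(stats, key=HEURISTICS[heuristic])[:k]
```

For example, `choose_neighbors(stats, "QLEN", 2)` returns the two neighbors with the shortest message queues.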
31
Local Indices

Each node maintains an index over other nodes' collections
r is the radius of the index
The index covers all nodes within r hops
Can process the query at fewer nodes, but get just as many results back
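
A minimal sketch of a local index of radius r, assuming an in-memory topology. The function names are hypothetical, and including the node's own collection in its index is an assumption made for simplicity.

```python
from collections import deque

def nodes_within(graph, start, radius):
    """Return the nodes at most `radius` hops from `start`, including `start`."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth < radius:
            for neighbor in graph.get(node, ()):
                if neighbor not in seen:
                    seen.add(neighbor)
                    frontier.append((neighbor, depth + 1))
    return seen

def build_local_index(graph, files, node, radius):
    """Map each file name to the set of nodes within `radius` hops that hold it."""
    index = {}
    for other in nodes_within(graph, node, radius):
        for name in files.get(other, ()):
            index.setdefault(name, set()).add(other)
    return index

def answer_from_index(index, keyword):
    """Answer a query locally on behalf of every node covered by the index."""
    return sorted((holder, name)
                  for name, holders in index.items() if keyword in name
                  for holder in holders)
```

Because one node can answer for all nodes within r hops, the query only needs to be processed at a fraction of the nodes it would otherwise visit.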

32
Local Indices (r = 1)
[Figure legend: source, forward query, processed query, found result, forward response]
33
Evaluation

Goal: realistic evaluation of techniques
Cannot directly evaluate techniques in a real environment
Simulation of large-scale distributed systems is hard
Use Gnutella as a laboratory for gathering data
Use analysis driven by query traces to project cost

34
Passive Observation
Gnutella Network

Statistics:
Size of collection
Number of redundant messages
Sample queries (Qrep)

35
Gathering Data

[Figure: recorded fields include hops traveled, IP address, timestamp, and individual result records]

36
Gathering Data

For each query Q:

L(Q): length of the query string
M(Q,n): number of response messages from n hops away
R(Q,n): number of results from n hops away
S(Q,n,Z): true if > Z results are received from n hops away
T(Q,Z,W,P): time to satisfaction
N(Q,n): number of nodes n hops away
C(Q,n): number of redundant edges n hops away
37
Example: Trace-driven Cost Projection
[Figure legend: source, forward query, processed query, found result, forward response]
38
Example: Calculating Message Size

Use the Gnutella protocol and trace data
E.g., a Query message consists of:
Gnutella header (22 B)
Options field (2 B)
Query string (L(Q) B)
TCP/IP and Ethernet headers (58 B)
Total size of the Query message for query Q: 82 + L(Q) bytes
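
The arithmetic is 22 + 2 + 58 = 82 bytes of fixed overhead plus the query string. A small sketch, with hypothetical constant and function names, and assuming L(Q) is the UTF-8 byte length of the query string:

```python
GNUTELLA_HEADER_BYTES = 22
OPTIONS_FIELD_BYTES = 2
TCP_IP_ETHERNET_HEADER_BYTES = 58

def query_message_size(query_string):
    """Size in bytes of a Query message carrying `query_string`."""
    fixed = (GNUTELLA_HEADER_BYTES
             + OPTIONS_FIELD_BYTES
             + TCP_IP_ETHERNET_HEADER_BYTES)  # 82 bytes of fixed overhead
    # Assumption: L(Q) is measured as the UTF-8 byte length of the string.
    return fixed + len(query_string.encode("utf-8"))
```

For a 9-character query such as "free jazz", this gives 82 + 9 = 91 bytes.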

39
Calculating Cost

We know the size of each type of message
We know the number of messages sent, for each type of message, for query Q
Put together: aggregate bandwidth for Q
A similar process computes aggregate processing power
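
Putting the two pieces together amounts to a weighted sum over message types. The sketch below uses a hypothetical function name and treats each message type as having one fixed size, which is a simplification; the counts and sizes in the usage note are made-up illustrative values, not trace data.

```python
def aggregate_bandwidth(message_counts, message_sizes):
    """Total bytes transferred for one query.

    message_counts: dict mapping message type -> number of messages sent
    message_sizes: dict mapping message type -> size of one message in bytes
    """
    return sum(count * message_sizes[kind]
               for kind, count in message_counts.items())
```

With 10 query messages of 91 bytes and 3 response messages of 200 bytes, the aggregate is 10 * 91 + 3 * 200 = 1510 bytes.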

40
Overall Comparison
[Figure: comparison of the four techniques on time to satisfy, probability of satisfying, number of results, and bandwidth cost]
Legend: B = BFS, I = Iterative Deepening (d = 5, W = 6), D = Directed BFS (>RES), L = Local Indices (r = 1)
41
Summary: Efficient Search