ICS 214B: Transaction Processing and Distributed Data Management

1
ICS 214B: Transaction Processing and Distributed
Data Management
  • Lecture 18: Data Management in Peer-to-Peer
    Systems
  • Professor Chen Li
  • Based on slides developed by
  • Beverley Yang and Hector Garcia-Molina

2
What is P2P?
pastry
jxta
can
fiorana
napster
freenet
united devices
open cola
?
aim
ocean store
netmeeting
farsite
gnutella
icq
ebay
morpheus
limewire
seti_at_home
bearshare
uddi
grove
jabber
popular power
kazaa
folding_at_home
tapestry
mojo nation
process tree
chord
3
Napster
central server
...
4
Gnutella
5
PeerCast
UCI
source
6
What is Peer-to-Peer?
  • Definition
  • Nodes of equal roles exchanging information
    and services directly
  • Is this a new idea?
  • IP routing (1970s)
  • Mariposa (1980s)
  • Distributed Databases!
  • What are people really thinking?

7
Implicit Definition of P2P
  • Scale: millions (billions?) of peers
  • Nature of peers: PCs
  • Application: lightweight semantics
    (e.g., file-sharing)

8
P2P vs. Distributed DBMS
  • Traditional DDBMS Issues
  • Transactions
  • Network Partitions
  • Distributed Query Optimization
  • Interoperation of heterogeneous data sources
  • Reliability/failure of nodes
  • Complex features do not scale

9
P2P vs. Distributed DBMS
  • Example application: file-sharing
  • Simple data model and query language
  • No complex query optimization
  • Easy interoperation
  • No guarantee on quality of results
  • Individual site availability unimportant
  • Local updates
  • No transactions
  • Network partitions OK
  • Simple: amenable to a large-scale network of
    PCs

10
Potential Benefits
  • Efficiency: harnessing unused resources
  • Self-organizing
  • Effectively sharing cost of ownership
  • Robustness and availability through replication
  • Anonymity/legal protection

11
Challenges
  • No authority to enforce behavior
  • Cooperation
  • Unreliability of individual peers
  • Efficiency of distributed operations (absolute
    resources)

12
Research Areas
  • Resource Management
  • Security
  • Efficient Search

13
Resource Management
  • Resource
  • Storage/information
  • CPU processing
  • bandwidth
  • Issues
  • fairness
  • load balancing

14
Example: Data Trading
site 1
site 2
site 3
A1
C1
B1
A2
B2
C2
15
Example: Data Trading
site 1
site 2
site 3
A1
C1
B1
A2
B2
C2
16
Data Trading
  • Order of trades impacts availability
  • Issues
  • Swaps vs. Deeds
  • Fixed price vs. bids
  • Preference to
  • sites with a lot of space?
  • reliable sites?
  • desperate sites?

17
Security
  • Issues
  • Reputation
  • Trust
  • Accountability
  • Information Preservation
  • Information Quality
  • Denial of service attacks
  • Problem: detecting and punishing bad behavior

18
Information Preservation
  • Example policy: make 3 copies of documents

A1
make copies
What can go wrong?
19
What Can Go Wrong?
  • Bad site deletes copies
  • Bad site alters copy
  • Bad site publishes fake
  • Bad site makes many copies at other sites
  • ...

A1
A1
make copies
A1
20
Reputation Systems
  • Peers evaluate each other
  • Good reviews -> Good reputation
  • Bad reviews -> Bad reputation
  • No reviews -> ?
  • Problems
  • Trustworthiness of reviews
  • Permanence of identity

21
Efficiency of Search
  • Problem: finding a needle in a haystack
  • Efficiency measured in terms of absolute
    resources consumed

22
Architecture
  • Hybrid
  • Centralized index, P2P file storage and transfer
  • Super-peer
  • A pure network of hybrid clusters
  • Pure
  • Functionality completely distributed

23
Goal
  • Develop search techniques for loose systems
    that are
  • Efficient
  • Simple (easy to implement, no hidden costs)
  • Realistically and thoroughly evaluated

24
Current Techniques: Gnutella
Breadth-First Search (BFS)
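The flooding search named on this slide can be sketched roughly as follows. This is a minimal illustration, not the Gnutella protocol itself: the function name `flood_bfs`, the adjacency-dict graph, and the `max_depth` parameter (standing in for a TTL) are all assumptions made for the example. Each node forwards the query to every neighbor except the one it received the query from, and every forwarded copy is counted, including redundant ones.

```python
from collections import deque

def flood_bfs(graph, source, max_depth):
    """Flood a query from `source` up to `max_depth` hops.

    Returns (depth, messages): depth maps each reached node to its hop
    distance, and messages counts every forwarded copy of the query.
    """
    depth = {source: 0}
    parent = {source: None}
    messages = 0
    frontier = deque([source])
    while frontier:
        node = frontier.popleft()
        if depth[node] == max_depth:
            continue  # hop limit reached, do not forward further
        for nbr in graph[node]:
            if nbr == parent[node]:
                continue  # never send the query back to the sender
            messages += 1  # counted even if nbr has already seen the query
            if nbr not in depth:
                depth[nbr] = depth[node] + 1
                parent[nbr] = node
                frontier.append(nbr)
    return depth, messages

# Hypothetical five-node overlay used only for illustration.
demo_graph = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a", "d"],
    "d": ["b", "c", "e"],
    "e": ["d"],
}
reached, sent = flood_bfs(demo_graph, "a", 2)
```

In the demo overlay, node d receives the query twice (once via b and once via c), which is the kind of redundant traffic that makes flooding expensive.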
25
Metrics
  • Cost (aggregate)
  • Bandwidth
  • Processing Power
  • Quality of Results
  • Number of results
  • Satisfaction (true if # results > X, false
    otherwise)
  • Time to satisfaction
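As a small sketch of the last two metrics, assume we have the arrival time of each result for a query and a threshold X, and that a query counts as satisfied once more than X results have arrived. The function name and the list-of-timestamps input are assumptions for illustration only.

```python
def time_to_satisfaction(arrival_times, x):
    """Return the time at which more than `x` results have arrived.

    Returns None if the query is never satisfied. `x` must be >= 0.
    """
    if x < 0:
        raise ValueError("x must be non-negative")
    ordered = sorted(arrival_times)
    # The query becomes satisfied when result number x + 1 arrives,
    # which sits at index x of the sorted arrival times.
    return ordered[x] if len(ordered) > x else None
```

With three results arriving at 0.9, 0.2, and 0.5 seconds and X = 1, the query is satisfied when the second result arrives.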

26
Iterative Deepening
  • Interested in satisfaction, not # of results
  • BFS returns too many results -> expensive
  • Iterative Deepening: common technique to reduce
    the cost of BFS
  • Intuition: a search at a small depth is much
    cheaper than at a larger depth
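A rough sketch of the idea, under simplifying assumptions: the overlay is an adjacency dict, `files` maps each node to the set of names it stores, `policy` is a non-empty increasing list of depths to try, and the query is satisfied once more than `x` results are found. All of these names are hypothetical. The sketch simply restarts the search at each depth; it does not model a waiting period between iterations or any reuse of work from shallower iterations.

```python
def nodes_within(graph, source, max_depth):
    """Return {node: hop distance} for nodes within max_depth hops of source."""
    seen = {source: 0}
    frontier = [source]
    for d in range(1, max_depth + 1):
        nxt = []
        for node in frontier:
            for nbr in graph[node]:
                if nbr not in seen:
                    seen[nbr] = d
                    nxt.append(nbr)
        frontier = nxt
    return seen

def iterative_deepening(graph, files, source, wanted, policy, x):
    """Search at each depth in `policy` until more than `x` results are found.

    Returns (depth used, sorted list of nodes holding `wanted`).
    """
    hits = []
    for depth in policy:
        reached = nodes_within(graph, source, depth)
        hits = sorted(n for n in reached
                      if n != source and wanted in files.get(n, ()))
        if len(hits) > x:
            return depth, hits  # satisfied: stop before going deeper
    return policy[-1], hits  # unsatisfied even at the deepest level
```

When a shallow search already satisfies the query, the deeper and much more expensive levels are never contacted.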

27
Iterative Deepening
source
forward query
processed query
found result
forward response
28
Directed BFS
  • Sends query to a subset of neighbors
  • Maintains statistics on neighbors
  • E.g., ping latency, history of number of results
  • Chooses subset intelligently (via heuristics), to
    maximize quality of results
  • E.g., Neighbors with shortest message queue,
    since long message queue implies neighbor is
    saturated/dead

29
Directed BFS
source
forward query
processed query
?
found result
forward response
30
Directed BFS Heuristics
RAND: Random
RES: Returned greatest # results in past
TIME: Had shortest avg. time to satisfaction in past
HOPS: Had smallest avg. hops for response messages in past
MSG: Sent our client greatest # of messages
QLEN: Shortest message queue
DEG: Highest degree
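The heuristics above can be read as different ways of ranking neighbors by a locally kept statistic. The sketch below assumes each neighbor has a small dict of statistics; the key names (`results`, `avg_time`, and so on), the function name, and the parameter `k` (how many neighbors receive the query) are assumptions made for this example.

```python
import random

# Heuristic name -> (statistic key, True if a larger value is preferred).
HEURISTICS = {
    "RES": ("results", True),
    "TIME": ("avg_time", False),
    "HOPS": ("avg_hops", False),
    "MSG": ("messages", True),
    "QLEN": ("queue_len", False),
    "DEG": ("degree", True),
}

def choose_neighbors(stats, heuristic, k, rng=random):
    """Pick up to k neighbors to forward a query to, using one heuristic."""
    neighbors = list(stats)
    if heuristic == "RAND":
        return rng.sample(neighbors, min(k, len(neighbors)))
    key, larger_is_better = HEURISTICS[heuristic]
    ranked = sorted(neighbors, key=lambda n: stats[n][key],
                    reverse=larger_is_better)
    return ranked[:k]
```

Only the source narrows its fan-out this way in the sketch; how the chosen neighbors forward the query onward is left out.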
31
Local Indices
  • Each node maintains index over other nodes'
    collections
  • r is the radius of the index
  • Index covers all nodes within r hops away
  • Can process query at fewer nodes, but get just as
    many results back
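A minimal sketch of building such an index, assuming the overlay is an adjacency dict and `files` maps each node to the names it stores. The function name and data layout are illustrative assumptions. The index at a node covers the node itself and every node within r hops, so one lookup answers on behalf of all of them.

```python
def build_local_index(graph, files, node, r):
    """Map each file name to the set of nodes within r hops of `node` holding it."""
    dist = {node: 0}
    frontier = [node]
    for d in range(1, r + 1):
        nxt = []
        for cur in frontier:
            for nbr in graph[cur]:
                if nbr not in dist:
                    dist[nbr] = d
                    nxt.append(nbr)
        frontier = nxt
    index = {}
    for owner in dist:
        for name in files.get(owner, ()):
            index.setdefault(name, set()).add(owner)
    return index
```

With r = 1, a query processed at one node also returns the matches held by its direct neighbors, so those neighbors do not need to process the query themselves.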

r
32
Local Indices (r = 1)
source
forward query
processed query
found result
forward response
33
Evaluation
  • Goal: realistic evaluation of techniques
  • Cannot directly evaluate techniques in a real
    environment
  • Simulation of large-scale distributed systems is
    hard
  • Use Gnutella as a laboratory for gathering data
  • Use analysis driven by query traces to project
    cost

34
Passive Observation
Gnutella Network
  • Statistics
  • Size of collection
  • # redundant messages
  • Sample queries (Qrep)

35
Gathering Data
  • hops traveled
  • IP address
  • hops traveled
  • IP address
  • Timestamp
  • Individual result records

36
Gathering Data
  • For each query Q

L(Q): Length of query string
M(Q,n): # response messages from n hops away
R(Q,n): # results from n hops away
S(Q,n,Z): True if > Z results received from n hops away
T(Q,Z,W,P): Time to satisfaction
N(Q,n): # nodes n hops away
C(Q,n): # redundant edges n hops away
37
Example: Trace-Driven Cost Projection
source
forward query
processed query
found result
forward response
38
Example: Calculating Message Size
  • Use the Gnutella protocol, trace data
  • e.g., Query message consists of
  • Gnutella header (22 B)
  • Options field (2 B)
  • Query string (L(Q))
  • TCP/IP and Ethernet headers (58 B)
  • Total size of Query message for query Q
  • 82 + L(Q) bytes
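The arithmetic on this slide can be written out as a short function. The constant and function names are made up for the example, and it assumes L(Q) is the length of the query string in bytes.

```python
GNUTELLA_HEADER_BYTES = 22
OPTIONS_FIELD_BYTES = 2
TCPIP_ETHERNET_HEADER_BYTES = 58

def query_message_size(query_string):
    """Total size in bytes of a Query message carrying `query_string`."""
    fixed = (GNUTELLA_HEADER_BYTES + OPTIONS_FIELD_BYTES
             + TCPIP_ETHERNET_HEADER_BYTES)  # 22 + 2 + 58 = 82
    return fixed + len(query_string.encode("utf-8"))  # plus L(Q)
```

For a five-character query string such as "chord", the total is 82 + 5 = 87 bytes.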

39
Calculating Cost
  • We know the sizes of each type of message
  • We know # messages sent, for each type of
    message, for query Q
  • Put together -> aggregate bandwidth for Q
  • Similar process to compute aggregate processing
    power
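Combining the two known quantities is a weighted sum: for each message type, multiply the number of messages sent by the size of one message, then add the products. The sketch below assumes both quantities are given as dicts keyed by message type; the function name and the figures in the example are illustrative, not trace data.

```python
def aggregate_bandwidth(message_counts, message_sizes):
    """Total bytes sent for one query, summed over all message types."""
    return sum(count * message_sizes[kind]
               for kind, count in message_counts.items())

# Illustrative numbers only: 100 Query messages of 87 bytes each and
# 12 Response messages of an assumed 150 bytes each.
example_total = aggregate_bandwidth(
    {"query": 100, "response": 12},
    {"query": 87, "response": 150},
)
```

Here the total is 100 * 87 + 12 * 150 = 10500 bytes. The same shape of calculation applies to processing cost if per-message processing costs replace message sizes.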

40
Overall Comparison
[Figure: comparison of B, I, D, and L on Bandwidth Cost, Time to Satisfy, Prob. of Satisfying, and # results]
B: BFS
I: Iterative Deepening (d = 5, W = 6)
D: Directed BFS (>RES)
L: Local Indices (r = 1)
41
Summary: Efficient Search
  • What we've done
  • Proposed techniques to improve performance
  • Kept simple
  • Evaluated techniques using extensive real data
  • Improved performance, with tradeoffs
  • Open issues
  • More efficient!
  • Make intelligent use of topology, replication
  • Take advantage of heterogeneity (e.g.,
    super-peers)