Peer-to-Peer (P2P) Computing

Slides: 56

Provided by: henrica

1
Peer-to-Peer (P2P) Computing
2
Centralized Architectures

In the previous set of slides we talked about systems
that have massive scale and that have a
centralized architecture
They have been shown to work well in the area of
volunteer computing
Centralized systems are easy to develop, deploy,
and maintain
Server-side control, updates, etc.
Problems with centralized architectures
The server can be a performance bottleneck
e.g., SETI@home pays a lot of money each year to
buy network bandwidth
e.g., SETI@home purchases and maintains decent
servers
The server can be a central point of failure
e.g., if there is a network outage in the
SETI@home building, then nothing works for a
while
An alternative: peer-to-peer systems
Some content here adapted from material generated
by Michael Welzl, at the University of Innsbruck,
Austria

3
A Peer?

Peer: one that is of equal standing with
another
P2P builds on the capacity of end-nodes that
participate in the system, treating them all as
equals
end-nodes: computers of participants
Made popular for file sharing applications
But the same idea is used in other domains
P2P ad-hoc networks (sensor networks)
Content distribution (BitTorrent)
Communication (Skype)
Network monitoring
etc.

4
P2P Essential Principles

Self-organizing, no central management
A peer is autonomous
Sharing of resources (storage, CPU, content)
resources at the edges of the network
Peers are equal (more or less)
Large numbers of peers
Churn is the common case
Intermittent connectivity, peers come and go
To be contrasted to the standard client-server
architecture

5
P2P Principles

The big question is how to build something useful
that works with a bunch of uncoordinated
peers that need to be autonomous
The pay-offs are multiple
No need for any infrastructure, just a piece of
code that people hopefully install and run on
whatever machine they have
Better resilience to attacks
No peer is special, so any peer can go down
without compromising anything
The system relies on many heterogeneous computers
Different OSs and/or OS versions should make it
difficult for a virus to take the whole thing
down
Some have a vision of almost everything being
P2P
No more Web servers, mail servers, etc.

6
Napster

The term P2P was popularized in 1999 by Napster,
developed by Shawn Fanning
The success of Napster brought the P2P idea to
everybody's attention and made it very popular
A large fraction of all network traffic today is
due to P2P applications
This fraction is dropping due to increasing video
streaming
2007: YouTube beat BitTorrent
Ironically, Napster wasn't fully P2P
Files were downloaded directly from
participants' computers, without a central data
repository
But there was a centralized server that held the
catalog of files (which computer stores what
right now)
which was really the way in which Napster was
brought down from a legal point of view

7
P2P Trend?

P2P has become very popular, and there is a
bit of a "centralized systems are not
cool" feeling around
However, it's clear that centralized can work
(look at Google)
So when should one decide to build a P2P system?
Things to consider
Budget
Resource relevance
Trust
Rate of system change
Criticality

8
P2P or not P2P?

Budget
If you have enough money, build a centralized
system
Again, look at Google
Note that centralized doesn't mean that there
aren't multiple servers
It's just not about peers, but about clients
and servers
P2P is useful when budget isn't unlimited
Resource relevance
If many users care about the resources, then P2P
is viable
Otherwise it won't work, as there will never be
enough of a core number of active systems
Trust
It's difficult to build a P2P system with many
untrusted participants (active research problem)

9
P2P or not P2P?

Rate of system change
peers joining/leaving, content being updated
Tolerating high change rates is a difficult
research challenge for P2P systems
Criticality
If you can't live without the service provided by
the system, P2P is a bit iffy

10
Structured vs. Unstructured

P2P systems are typically classified into two
kinds
In unstructured systems, content may be stored on
any peer
In structured systems, content has to be stored
by specific peers
Let's first look at a few important unstructured
systems and discuss their strengths and weaknesses

11
Napster

Napster was the first widely popular P2P system
Don't confuse the new Napster store with the
old Napster P2P system
Only sharing of MP3 files was possible
How it worked
User registers with a central index server
Gives list of files to be shared
Central server knows all the peers and files in
the network
Searching based on keywords
Search results were a list of files with
information about the file and the peer sharing
it
e.g., encoding rate, size of file, peer's
bandwidth
some information entered by the user, hence
unreliable
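
The register-then-search flow described above can be sketched as follows. This is only an illustration of a central index, not Napster's actual protocol: the class name `CentralIndex`, its methods, and the substring keyword match are all assumptions made for the example.

```python
class CentralIndex:
    """Hypothetical central index: maps each peer to the file names it shares."""

    def __init__(self):
        self.files_by_peer = {}

    def register(self, peer, file_names):
        # A peer registers and gives the list of files it is willing to share.
        self.files_by_peer[peer] = set(file_names)

    def unregister(self, peer):
        # Called when a peer disconnects; unknown peers are ignored.
        self.files_by_peer.pop(peer, None)

    def search(self, keyword):
        """Return sorted (file name, peer) pairs whose name contains the keyword."""
        keyword = keyword.lower()
        return sorted(
            (name, peer)
            for peer, names in self.files_by_peer.items()
            for name in names
            if keyword in name.lower()
        )


index = CentralIndex()
index.register("alice", ["Blue Train.mp3", "So What.mp3"])
index.register("bob", ["so what.mp3"])
print(index.search("so what"))
```

Because the server sees every registration, an empty result really does mean that no registered peer shares a matching file. The download itself would then happen directly between the two peers.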

12
Napster
13
Napster

Strengths
Consistent view of the network
Some answers guaranteed to be correct (e.g.,
"nothing found")
Fast and efficient searches
Weaknesses
Usual problem with a centralized server
Money can be thrown at it (e.g., Google)
Central server susceptible to attacks
viruses and legal attacks
Results unreliable
True of all P2P systems to some degree

14
Gnutella

Gnutella came soon after Napster
Originally developed by AOL, but the code was put
on the net by mistake. By the time it was pulled
it was too late: it was already out
Fully decentralized
No index server
Had an open protocol
Which was great for research
It was never a huge network
Because it was quickly surpassed by better
systems
No longer in use

15
Gnutella

There are only peers
Peers are connected in an overlay network
To join the network, a new peer only needs to
know of one existing peer that is currently a
member
Done via some out-of-band mechanism, like a Web
site
Once a peer joins the network, it learns about
other peers and about the topology of the overlay
network
Queries are flooded over the network
Downloads happen directly between peers

16
Gnutella

Queries are sent to neighbors
Neighbors forward queries to their neighbors, and
so on
Until some threshold is reached (a time-to-live
or TTL)
If some reply was found, then it's routed back to
the query originator following the path in reverse
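
As a rough sketch of the flooding described above (not the Gnutella wire protocol), the following simulates a TTL-limited flood over an overlay given as an adjacency dictionary. The names `flood_query`, `overlay`, and `has_file` are hypothetical, and the shared `parent` dictionary stands in for the per-peer bookkeeping that real peers would use to drop duplicate queries and route replies back.

```python
from collections import deque


def flood_query(overlay, origin, has_file, ttl):
    """Flood a query from `origin`; return the reply path of the first hit, or None.

    The returned path starts at the peer holding the file and ends at the
    originator, i.e., it is the query path in reverse.
    """
    parent = {origin: None}            # who forwarded the query to each peer
    frontier = deque([(origin, ttl)])
    while frontier:
        peer, remaining = frontier.popleft()
        if has_file(peer):
            path = []
            while peer is not None:    # walk back toward the originator
                path.append(peer)
                peer = parent[peer]
            return path
        if remaining == 0:
            continue                   # TTL exhausted: do not forward further
        for neighbor in overlay[peer]:
            if neighbor not in parent: # ignore peers that already saw the query
                parent[neighbor] = peer
                frontier.append((neighbor, remaining - 1))
    return None


overlay = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
print(flood_query(overlay, "A", lambda p: p == "D", ttl=3))
print(flood_query(overlay, "A", lambda p: p == "D", ttl=2))
```

With a TTL of 3 the query reaches D; with a TTL of 2 it stops at C and reports nothing, even though the file exists in the network.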

17
Gnutella

Strengths
Fully distributed, no central point of failure
Open protocol (easy to write clients)
Very robust against random node failures
Weaknesses
Flooding is very inefficient and pretty often
fails to find what is being looked for
The question "how to pick the best query radius?"
is pretty much impossible to answer

18
KaZaA

KaZaA proposed a very different architecture,
that has influenced most file-sharing systems
after it
On a typical day KaZaA has over 3 million active
users, and over 500 TeraBytes of content
Based on a super-node architecture
Some peers are better and thus special
Introducing some hierarchy in the system helps

19
KaZaA

Each SN keeps track of a subset of the peers
A new peer registers to one SN only

20
KaZaA Search

The KaZaA Query
A peer sends a query to its SN
The SN answers for all its peers and then
forwards to other SNs via flooding
Note that the SNs are not fully connected in the
peer-to-peer network of SNs
Other SNs reply
Finding SuperNodes?
A normal peer can be promoted if it demonstrates
that it has enough resources
A user can always refuse to become a SN
About 30,000 SNs at a given time

21
KaZaA

Strengths
Combine strengths of Napster and Gnutella
Weaknesses
Queries are still not comprehensive due to limited
flooding
But a much better reach than Gnutella
The lawsuit against KaZaA was eventually successful
The software comes with a list of well-known
supernodes

22
Content Distribution

BitTorrent provided a new approach for file
sharing
Widely used for fully legal content
Linux distribution, software patches, etc.
Has its share of litigation
Goal: quickly replicate a file to a large number
of clients
A new overlay network is built for every file
that's being distributed
You have to know the file reference, or "torrent",
which contains metadata on the content
You can send a torrent to people, or publish it
There is no real searching in BitTorrent itself
Although out-of-band catalogs exist of course

23
BitTorrent

For each new BitTorrent file, one server hosts
the original copy
The file is broken into chunks
There is also a torrent file which is typically
kept on some web server(s)
Clients download the torrent file
whose metadata identifies a tracker
The tracker is a server that keeps track of
currently active clients for a file
The tracker does not participate in the download
and never holds any data
Note that lawsuits have been successful against
people running trackers!

24
BitTorrent
25
BitTorrent

Terminology
Seed: client with a complete copy of the file
Leecher: client still downloading the file
Client contacts tracker and gets a list of other
clients
Gets list of 50 peers
Client maintains connections to 20-40 peers
Contacts tracker if number of connections drops
below 20
This set of peers is called the peer set
Client downloads chunks from peers in peer set
and provides them with its own chunks
Chunks typically 256 KB
Chunks make it possible to use parallel download

26
BitTorrent Tit-for-Tat

A peer serves peers that serve it
Encourages cooperation, discourages free-riding
Peers use a "rarest first" policy for chunk
downloads
Having a rare chunk makes peer attractive to
others
Others want to download it, peer can then
download the chunks it wants
Goal of chunk selection is to maximize entropy of
each chunk
For first chunk, just randomly pick something, so
that peer has something to share
Endgame mode
Send requests for last sub-chunks to all known
peers
End of download not stalled by slow peers
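
A minimal sketch of the chunk selection policy above, assuming each peer in the peer set has advertised which chunks it holds. The function name `pick_chunk` and the `peer_chunks` dictionary are hypothetical, and endgame mode is left out.

```python
import random
from collections import Counter


def pick_chunk(my_chunks, peer_chunks, rng=random):
    """Return the index of the next chunk to request, or None if none is available.

    my_chunks:   set of chunk indices this client already has
    peer_chunks: dict mapping each peer in the peer set to its set of chunk indices
    """
    # Count how many peers hold each chunk that we still need.
    availability = Counter(
        chunk
        for chunks in peer_chunks.values()
        for chunk in chunks
        if chunk not in my_chunks
    )
    if not availability:
        return None
    if not my_chunks:
        # First chunk: pick at random so that we quickly have something to share.
        return rng.choice(sorted(availability))
    # Otherwise request one of the rarest chunks (ties broken at random).
    rarest = min(availability.values())
    return rng.choice(sorted(c for c, n in availability.items() if n == rarest))


peer_chunks = {"p1": {0, 1, 2}, "p2": {0, 1}, "p3": {0}}
print(pick_chunk({0}, peer_chunks))
```

In this example chunk 2 is held by a single peer and chunk 1 by two, so a client that already has chunk 0 asks for chunk 2 first.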

27
BitTorrent Choke/Unchoke

Peer serves e.g. 4 (default value) peers in peer
set simultaneously
Seeks best (fastest) downloaders if it's a seed
Seeks best uploaders if it's a leecher
Choke is a temporary refusal to upload to a peer
Leecher serves 4 best uploaders, chokes all
others
Every 10 seconds, it evaluates the transfer
speed
If there is a better peer, choke the worst of
the current 4
Every 30 seconds peer makes an optimistic
unchoke
Randomly unchoke a peer from peer set
Idea: maybe it offers better service
Seeds behave exactly the same way, except they
look at download speed instead of upload speed
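
The choke/unchoke decision can be sketched as two small functions. This is a simplification under stated assumptions: `rates` holds the transfer speed measured for each peer over the last evaluation period, the 10-second and 30-second timers are left to the caller, and the names `choose_unchoked` and `pick_optimistic` are made up for this example.

```python
import random


def choose_unchoked(rates, slots=4, optimistic=None):
    """Return the set of peers to serve: the `slots` fastest, plus the optimistic one."""
    best = sorted(rates, key=rates.get, reverse=True)[:slots]
    unchoked = set(best)
    if optimistic is not None:
        unchoked.add(optimistic)
    return unchoked


def pick_optimistic(peer_set, unchoked, rng=random):
    """Randomly pick one currently choked peer to try out, or None if there is none."""
    choked = sorted(set(peer_set) - set(unchoked))
    return rng.choice(choked) if choked else None


# Rates observed by a leecher (upload speed of each peer toward us, in KB/s).
rates = {"a": 50, "b": 10, "c": 40, "d": 30, "e": 20, "f": 5}
regular = choose_unchoked(rates)        # would be re-evaluated every 10 seconds
trial = pick_optimistic(rates, regular) # would be re-drawn every 30 seconds
print(sorted(regular), trial)
```

A seed would call the same functions with download speeds in `rates` instead of upload speeds.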

28
Searching vs. Addressing

In the peer-to-peer networks we've discussed so
far, one searches for content
The content could be on any peer, so we need to
look for it somehow, e.g., using keywords
When the system answers "didn't find it", that
doesn't mean the content isn't there
This is not at all the way in which a storage
system works
e.g., a file system on your machine
Storage systems work based on an addressing
scheme
Content (e.g., a file) is known by a unique name
There is a way to know (not find) where that
unique name is stored
Searching by keyword can be implemented, but as a
separate feature (e.g., Spotlight on Mac OS X)
Such a storage system is typically more efficient
But perhaps less user friendly
Some P2P systems attempt to implement content
addressing rather than content searching

29
Structured vs. Unstructured

Unstructured networks/systems
Based on searching
Unstructured does NOT mean complete lack of
structure
Network has graph structure
But peers are free to join anywhere and objects
can be stored anywhere
Structured networks/systems
Based on addressing
Network structure determines where peers belong
in the network and where objects are stored
Should be efficient for locating objects
Big question: how to build structured networks?

30
Addressing in a Network

To enable addressing, we must have a scheme to
figure out on which peer a particular file is
stored
This is typically done via some hashing
Hash the file name (e.g., a fully qualified path)
using some hash function to create a unique fileID
Using a good hash function is crucial
Large hash, so that there are no collisions
Hash that balances the load across the hash space
A useful abstraction (i.e., abstract data type)
to implement addressing is a Distributed Hash
Table
put(key, value) stores something in the network
e.g., key = hash of file name
e.g., value = file content
lookup(key) locates something in the network
returns the value
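
To make the put/lookup abstraction concrete, here is a toy, single-process sketch. Everything in it is an assumption chosen for illustration: the names `file_id` and `ToyDHT`, the use of SHA-1, the 32-bit ID space, and in particular the placement rule (key modulo the number of peers), which is only a placeholder for a real network structure.

```python
import hashlib


def file_id(name, bits=32):
    """Hash a file name to an integer fileID in [0, 2**bits)."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** bits)


class ToyDHT:
    """All peers live in one process; each peer is just a dictionary."""

    def __init__(self, peer_names):
        self.names = sorted(peer_names)
        self.peers = {name: {} for name in self.names}

    def _responsible_peer(self, key):
        # Placeholder placement rule (an assumption): key modulo number of peers.
        return self.names[key % len(self.names)]

    def put(self, key, value):
        self.peers[self._responsible_peer(key)][key] = value

    def lookup(self, key):
        # Returns the stored value, or None if nothing was stored under key.
        return self.peers[self._responsible_peer(key)].get(key)


dht = ToyDHT(["A", "B", "C"])
key = file_id("/home/user/music/song.mp3")
dht.put(key, b"file content")
print(dht.lookup(key))
```

The application only sees `put` and `lookup`. Which peer holds the pair is decided entirely by the key, which is what distinguishes addressing from searching.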

31
HT and DHT
[Figure: a hash table (HT) with slots 0 to 9, next
to a DHT in which the stored entries are spread
over peer A, peer B, and peer C]
32
Using the Abstraction
[Figure: a distributed application calls
put(key, value) and lookup(key) on the DHT
implementation, which returns the value]
33
Implementing a DHT

Question: which network structure do we use to
support the DHT abstraction?
How do we identify peers?
Which other peers does one peer know about?
How do we route queries?
Which peer stores what?

34
Network Topologies

The topic of network topologies was a hot topic
in the area of supercomputers
Goal: organize nodes of a supercomputer as
vertices of a graph, such that
The graph scales well
i.e., not too many links per node
which cost a lot of money in the case of physical
links
The graph has good performance
i.e., its diameter is small
diameter = max number of hops between two nodes
Let's see a few examples

35
Fully Connected Graph

Diameter: 1
Number of connections per node: N - 1
Great performance, poor scalability

36
Ring

Diameter: N/2
Number of connections per node: 2
Poor performance, great scalability

37
Torus/Grid

Diameter: about sqrt(N) for a square 2-D torus
Number of connections per node: 4
Better performance than a ring
Poorer scalability than a ring

38
Hypercube

Diameter: log N
Number of connections per node: log N
Considered a good compromise by many (used
to build machines)

Defined by its dimension, d (N = 2^d)

39
Hypercube Routing

Each node is identified by a d-bit name
Routing from xxxx to yyyy: just keep going to a
neighbor that has a smaller Hamming distance to yyyy!
we will see this idea again

[Figure: a 4-dimensional hypercube whose 16 nodes
are labeled 0000 through 1111]
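
The routing rule above can be sketched in a few lines. Node names are bit strings, and each hop flips one bit in which the current node differs from the destination, so every hop lowers the Hamming distance by one. The function name `hypercube_route` and the choice to fix bits from left to right are assumptions of this sketch; any order of bit fixes works.

```python
def hypercube_route(src, dst):
    """Return the list of d-bit node names visited when routing from src to dst."""
    assert len(src) == len(dst)
    path = [src]
    current = list(src)
    for i, bit in enumerate(dst):
        if current[i] != bit:
            current[i] = bit               # move to the neighbor that differs in bit i
            path.append("".join(current))
    return path


print(hypercube_route("0000", "1011"))
```

The number of hops equals the Hamming distance between the two names, which is at most d = log N.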
40
Overlay network topologies

Here we're building a P2P system, not a
supercomputer, so maintaining 10 network
connections to 10 neighbor peers doesn't require
10 network cards/links!
Still, we can't go fully connected due to the
size of the routing tables
Let's say we want to have a P2P network with 10^7
peers (10 million)
Each peer must maintain a routing table that
lists the peers along with some information on
them
at a minimum: IP address, port number, peerID
This could represent quite a bit of memory
Going through the routing table to fix/repair it
due to churn would take too much time (and
most of its content would be erroneous)
Therefore, it doesn't scale well
How about a Hypercube?
diameter = log(N) and number of connections = log(N)
is g-r-e-a-t
the easy routing is g-r-e-a-t
The problem here is that it's easily broken by
churn, and it's difficult to accommodate new
nodes (number of nodes is a power of 2)
Could work with many tweaks
Question: what's a good structure that has some
of the nice properties of a hypercube and is
robust to churn?

41
Chord

Let's look at Chord, a famous DHT project
Developed at MIT in 2001
Fairly simple to understand (unlike other DHTs)
File names and node names are hashed to the same
space, i.e., numbers between 0 and 2^m - 1
where m is large enough and the hash function
good enough that collisions happen only with
infinitesimal probability
Each file has a unique fileID
e.g., hash of its name
Each peer has a unique peerID
e.g., hash of its IP address
Important: there is no difference between a
fileID and a peerID
They're just numbers that can be sorted and
compared easily

42
The Chord Ring

Peers are organized as a sorted ring
Peers are along the ring in increasing order of
peerID
Remember, peerIDs are just numbers
Called a Chord ring
Each peer knows its successor and predecessor in
the ring
For now let's assume no churn whatsoever
No peer arrives, no peer departs
Main Chord idea: a peer stores the keys that are
immediately lower than its peerID
Let's look at an example Chord ring and see which
peer stores what

43
A Chord Ring
[Figure: a Chord ring with 10 peers: P1, P8, P14,
P21, P32, P38, P42, P48, P51, P56. Annotations:
P56 stores keys in (51,56], P14 stores keys in
(8,14], P32 stores keys in (21,32], P38 stores
keys in (32,38]]
A peer stores (key,value) pairs whose keys are
lower than the peer's peerID and higher than the
peerID of the peer's predecessor
44
Put() and Lookup()
The principle is the same for Put() and Lookup()
[Figure: the Chord ring (P1, P8, P14, P21, P32,
P38, P42, P48, P51, P56) with the query
"find key 49"]
45
Put() and Lookup()
[Figure: the Chord ring (P1, P8, P14, P21, P32,
P38, P42, P48, P51, P56) with the query
"find key 49"]
Go around the ring, following the successor
links
Stop at the first peerID that is larger
than 49 (peer 51 here)
If key 49 was stored in
the network, peer 51 has it
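
The walk around the ring can be sketched as follows, using the ten peerIDs of the example ring. The names `RING`, `successor_of`, `in_interval`, and `slow_lookup` are hypothetical, and the sorted list stands in for the successor pointers that each peer would hold on its own in a real network.

```python
RING = [1, 8, 14, 21, 32, 38, 42, 48, 51, 56]   # peerIDs from the example ring


def successor_of(peer, ring=RING):
    """Next peer clockwise on the ring."""
    i = ring.index(peer)
    return ring[(i + 1) % len(ring)]


def in_interval(x, a, b):
    """True if x lies in the ring interval (a, b], going clockwise and wrapping past 0."""
    if a < b:
        return a < x <= b
    return x > a or x <= b


def slow_lookup(key, start, ring=RING):
    """Follow successor links from `start`; return (responsible peer, number of hops)."""
    peer, hops = start, 0
    while not in_interval(key, peer, successor_of(peer, ring)):
        peer = successor_of(peer, ring)
        hops += 1
    # The key falls between `peer` and its successor: the successor stores it.
    return successor_of(peer, ring), hops + 1


print(slow_lookup(49, start=8))
```

Starting from P8, the query for key 49 visits every peer up to P51, which is 7 hops on a ring of only 10 peers.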
46
Scalability and Performance

The Chord ring as we have shown it is very
scalable
Each peer only needs to know about two other
peers
Very small routing table!
The problem is that the performance is very poor
The worst case complexity for a lookup is O(N)
hops, where N is the number of peers
Since N can be on the order of millions, clearly
it's not even remotely acceptable
Each hop will take hundreds of milliseconds
Question: how can we make the number of hops
O(log N)?
Answer: by adding more edges in the network

47
Chord Fingers

Each peer maintains a finger table that
contains m entries
We have 2^m potential peers in the system
So the finger table has at most O(log N) entries
The ith entry in the finger table of peer A
stores the peerID of the first peer B that
succeeds A by at least 2^(i-1) on the Chord ring
B = successor(n + 2^(i-1)), where n is the peerID
of A
B is called the ith finger of peer A
Let's see an example

48
Chord Fingers
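[Figure: the Chord ring (P1, P8, P14, P21, P32,
P38, P42, P48, P51, P56) annotated with fingers]
finger[i] = first peer that succeeds (p + 2^i)
mod 2^m, for a peer with peerID p (i counted
from 0, which matches 2^(i-1) with i counted
from 1)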
49
Using Chord Fingers
[Figure: the Chord ring (P1, P8, P14, P21, P32,
P38, P42, P48, P51, P56) with the query
"find key 54"]

With the finger table, a peer can forward a query
at least halfway to its destination in one hop
One can easily prove that the worst case number
of hops is O(log N)
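
The following sketch builds finger tables for the example ring and uses them to route a lookup. It assumes m = 6 (so IDs live in 0..63), counts fingers from i = 1 with targets p + 2^(i-1), and uses a global sorted list in place of real network messages. The names `M`, `RING`, `successor`, `finger_table`, `between`, and `find_successor` are all made up for this example.

```python
M = 6                                            # identifiers live in [0, 2**M)
RING = [1, 8, 14, 21, 32, 38, 42, 48, 51, 56]    # peerIDs from the example ring


def successor(ident, ring=RING):
    """First peer whose ID is >= ident, wrapping around the ring."""
    ident %= 2 ** M
    for peer in ring:
        if peer >= ident:
            return peer
    return ring[0]


def finger_table(peer, ring=RING):
    """Fingers i = 1..M: successor(peer + 2**(i-1))."""
    return [successor(peer + 2 ** (i - 1), ring) for i in range(1, M + 1)]


def between(x, a, b, inclusive_right=False):
    """True if x is in the clockwise ring interval (a, b), or (a, b] if requested."""
    if inclusive_right and x == b:
        return True
    if a < b:
        return a < x < b
    return x > a or x < b


def find_successor(key, start, ring=RING):
    """Return (peer responsible for key, list of peers the query visits)."""
    peer, path = start, [start]
    while True:
        succ = successor(peer + 1, ring)
        if between(key, peer, succ, inclusive_right=True):
            path.append(succ)
            return succ, path
        # Forward to the farthest finger that still precedes the key.
        nxt = succ
        for finger in reversed(finger_table(peer, ring)):
            if between(finger, peer, key):
                nxt = finger
                break
        peer = nxt
        path.append(peer)


print(finger_table(8))
print(find_successor(54, start=8))
```

From P8, the query for key 54 jumps to P42, then P51, and ends at P56: three hops instead of the nine that following successor links alone would take.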

50
Peers joining and leaving

We now have the nice hypercube property
routing table O(log N)
number of hops O(log N)
Question: what happens when a peer joins/leaves
the system?
gracefully, not due to crashes
Leaving is straightforward
give (key,value) pairs to successor
Joining is a bit more complicated, but still
simple
insert oneself in the ring
take over part of the key space of successor

51
Peer Joining
[Figure: the Chord ring (P1, P8, P14, P21, P32,
P38, P42, P48, P51, P56) and a new peer P40
saying "I am a new peer and my peerID is 40"]
52
Peer Joining
[Figure: the Chord ring; the new peer ("I am a
new peer and my peerID is 40") is inserted as
P40 between P38 and P42]

Do a lookup for key 40 (pretending it's a
fileID), to identify along the way the last node
with ID lower than 40 and the first node with ID
higher than 40
Then insert the new peer (in this case between
P38 and P42)
Requires a few successor and predecessor pointer
updates
Requires computing/updating fingers all over the
place (O(log N) messages)
Then take over (key,value) pairs in range (38,40]
from P42
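
The key hand-over on join (and on graceful departure) can be sketched with a toy ring that keeps all peers in one process. The class name `ToyChordRing` and its methods are hypothetical, and finger tables and pointer updates are deliberately left out: only the movement of (key,value) pairs is shown.

```python
import bisect


class ToyChordRing:
    """Sorted ring of peerIDs; each peer holds a dict of (key, value) pairs."""

    def __init__(self, peer_ids):
        self.ids = sorted(peer_ids)
        self.store = {p: {} for p in self.ids}

    def responsible(self, key):
        """First peer with ID >= key, wrapping around to the smallest ID."""
        i = bisect.bisect_left(self.ids, key)
        return self.ids[i % len(self.ids)]

    def put(self, key, value):
        self.store[self.responsible(key)][key] = value

    def lookup(self, key):
        return self.store[self.responsible(key)].get(key)

    def join(self, new_id):
        succ = self.responsible(new_id)      # lookup, pretending new_id is a fileID
        bisect.insort(self.ids, new_id)
        self.store[new_id] = {}
        # Take over the part of the successor's key space that now belongs to us.
        for key in list(self.store[succ]):
            if self.responsible(key) == new_id:
                self.store[new_id][key] = self.store[succ].pop(key)

    def leave(self, peer_id):
        """Graceful departure: give all (key, value) pairs to the successor."""
        self.ids.remove(peer_id)
        pairs = self.store.pop(peer_id)
        if self.ids:
            self.store[self.responsible(peer_id)].update(pairs)


ring = ToyChordRing([1, 8, 14, 21, 32, 38, 42, 48, 51, 56])
ring.put(39, "a")
ring.put(41, "b")
ring.join(40)
print(ring.store[40], ring.store[42])
```

After P40 joins, key 39 moves from P42 to P40 while key 41 stays on P42, matching the range (38,40] in the example.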

53
What about crashes?

Crashes are difficult to handle
Yet they happen all the time
Chord uses a stabilization protocol
Each node periodically engages in some
communications that repair successor and
predecessor pointers and finger tables
Uses a simple mechanism: each peer stores
pointers to log(N) successors, rather than just
one
Therefore it's possible to detect missing nodes,
and to repair all connections
There are many theoretical and practical results
that show that this works well in practice
e.g., lookup failure rate ≈ peer departure rate
In fact, even graceful peer departures are
treated as crashes, since the stabilization
protocol works so well

54
Lookup failures?

Lookup failures will happen when nodes crash
The data they stored is no longer there!
One solution: use replication at a higher level
e.g., use individual Chord rings so that 10
copies of each value are stored, each as a
different (key,value) pair
When a lookup fails for one of the keys, try
another one
Then restore the copy that disappeared
Chord is being used as the basis for several
projects
shared storage
digital libraries
Downloadable at http://pdos.csail.mit.edu/chord/

55
Conclusion

P2P systems have been successfully used in
several domains
Two classes
unstructured: successful file-sharing systems and
content distribution systems
based on searching
structured: more on the research side, but much
more powerful
based on addressing within DHTs
Although it's difficult to forecast, the future
of P2P systems should be pretty cool