Peer-to-Peer (P2P) Computing - PowerPoint PPT Presentation

Slides: 56
Provided by: henrica

Transcript and Presenter's Notes


1
Peer-to-Peer (P2P) Computing
2
Centralized Architectures
  • In the previous set of slides we talked about
    systems that have massive scale and that have a
    centralized architecture
  • They have been shown to work well in the area of
    volunteer computing
  • Centralized systems are easy to develop, deploy,
    and maintain
  • Server-side control, updates, etc.
  • Problems with centralized architectures
  • The server can be a performance bottleneck
  • e.g., SETI@home pays a lot of money each year to
    buy network bandwidth
  • e.g., SETI@home purchases and maintains decent
    servers
  • The server can be a central point of failure
  • e.g., if there is a network outage in the
    SETI@home building, then nothing works for a
    while
  • An alternative: peer-to-peer systems
  • Some content here adapted from material generated
    by Michael Welzl, at the University of Innsbruck,
    Austria

3
A Peer?
  • Peer: one that is of equal standing with
    another
  • P2P builds on the capacity of end-nodes that
    participate in the system, treating them all
    as equals
  • end-nodes: computers of participants
  • Made popular for file sharing applications
  • But the same idea is used in other domains
  • P2P ad-hoc networks (sensor networks)
  • Content distribution (BitTorrent)
  • Communication (Skype)
  • Network monitoring
  • etc.

4
P2P Essential Principles
  • Self-organizing, no central management
  • A peer is autonomous
  • Sharing of resources (storage, CPU, content)
  • resources at the edges of the network
  • Peers are equal (more or less)
  • Large numbers of peers
  • Churn is the common case
  • Intermittent connectivity, peers come and go
  • To be contrasted to the standard client-server
    architecture

5
P2P Principles
  • The big question is how to do something useful
    and that works with a bunch of uncoordinated
    peers that need to be autonomous
  • The pay-offs are multiple
  • No need for any infrastructure, just a piece of
    code that people hopefully install and run on
    whatever machine they have
  • Better resilience to attacks
  • No peer is special, so any peer can go down
    without compromising anything
  • The system relies on many heterogeneous computers
  • Different OSs and/or OS versions should make it
    difficult for a virus to take the whole thing
    down
  • Some have a vision of almost everything being
    P2P
  • No more Web servers, mail servers, etc.

6
Napster
  • The term P2P was coined in 1999 by Shawn Fanning,
    the original Napster developer
  • The success of Napster brought the P2P idea to
    everybody's attention and made it very popular
  • A large fraction of all network traffic today is
    due to P2P applications
  • Dropping fraction due to increasing video
    streaming
  • 2007: YouTube beat BitTorrent
  • Ironically, Napster wasn't fully P2P
  • Files were downloaded directly from
    participants' computers, without a central data
    repository
  • But there was a centralized server that held the
    catalog of files (which computer stores what
    right now)
  • which was really the way in which Napster was
    brought down from a legal point of view

7
P2P Trend?
  • P2P has become very popular, and there is a
    little bit of a "centralized systems are not
    cool" feeling around
  • However, it's clear that centralized can work
    (look at Google)
  • So when to decide to build a P2P system?
  • Things to consider
  • Budget
  • Resource relevance
  • Trust
  • Rate of system change
  • Criticality

8
P2P or not P2P?
  • Budget
  • If you have enough money, build a centralized
    system
  • Again, look at Google
  • Note that centralized doesn't mean that there
    aren't multiple servers
  • It's just not about peers, but about clients
    and servers
  • P2P is useful when budget isn't unlimited
  • Resource relevance
  • If many users care about the resources, then P2P
    is viable
  • Otherwise it won't work, as there will never be
    enough of a core number of active systems
  • Trust
  • It's difficult to build a P2P system with many
    untrusted participants (active research problem)

9
P2P or not P2P?
  • Rate of system change
  • peers joining/leaving, content being updated
  • Tolerating high change rates is a difficult
    research challenge for P2P systems
  • Criticality
  • If you can't live without the service provided by
    the system, P2P is a bit iffy

10
Structured vs. Unstructured
  • P2P systems are typically classified into two
    kinds
  • In unstructured systems, content may be stored on
    any peer
  • In structured systems, content has to be stored
    by specific peers
  • Let's first look at a few important unstructured
    systems and discuss their strengths and weaknesses

11
Napster
  • Napster was the first widely popular P2P system
  • Don't mistake the new Napster store for the
    old Napster P2P system
  • Only sharing of MP3 files was possible
  • How it worked
  • User registers with a central index server
  • Gives list of files to be shared
  • Central server knows all the peers and files in
    the network
  • Searching based on keywords
  • Search results were a list of files with
    information about the file and the peer sharing
    it
  • e.g., encoding rate, size of file, peer's
    bandwidth
  • some information entered by the user, hence
    unreliable
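
The register-then-search flow above can be sketched in a few lines of Python. This is only an illustration of a central keyword index, not Napster's actual protocol: the class name `CentralIndex`, its methods, and the way file names are split into keywords are all assumptions made for the example.

```python
from collections import defaultdict


class CentralIndex:
    """Toy central index server (hypothetical): keyword -> (peer, file name)."""

    def __init__(self):
        self.by_keyword = defaultdict(set)

    def register(self, peer, filenames):
        # A peer announces the files it shares; every word becomes a keyword.
        for name in filenames:
            for word in name.lower().replace(".", " ").split():
                self.by_keyword[word].add((peer, name))

    def unregister(self, peer):
        # Forget everything a departing peer had announced.
        for entries in self.by_keyword.values():
            entries.difference_update({e for e in entries if e[0] == peer})

    def search(self, keyword):
        # The server sees every registration, so an empty answer is reliable.
        return sorted(self.by_keyword.get(keyword.lower(), set()))


index = CentralIndex()
index.register("peerA", ["Blue Song.mp3"])
index.register("peerB", ["blue moon.mp3"])
print(index.search("blue"))
```

The download itself is not modeled here: in the design described above it happens directly between the two peers once the search result names the peer that shares the file.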

12
Napster
13
Napster
  • Strengths
  • Consistent view of the network
  • Some answers guaranteed to be correct (e.g.,
    nothing found)
  • Fast and efficient searches
  • Weaknesses
  • Usual problem with a centralized server
  • Money can be thrown at it (e.g., Google)
  • Central server susceptible to attacks
  • viruses and legal attacks
  • Results unreliable
  • True of all P2P systems to some degree

14
Gnutella
  • Gnutella came soon after Napster
  • Originally developed by AOL, but the code was
    released on the net by mistake. By the time it
    was pulled, it was too late: copies were already
    out
  • Fully decentralized
  • No index server
  • Had an open protocol
  • Which was great for research
  • It was never a huge network
  • Because it was quickly surpassed by better
    systems
  • No longer in use

15
Gnutella
  • There are only peers
  • Peers are connected in an overlay network
  • To join the network, a new peer only needs to
    know of one existing peer that is currently a
    member
  • Done via some out-of-band mechanism, like a Web
    site
  • Once a peer joins the network, it learns about
    other peers and about the topology of the overlay
    network
  • Queries are flooded over the network
  • Downloads happen directly between peers

16
Gnutella
  • Queries are sent to neighbors
  • Neighbors forward queries to their neighbors, and
    so on
  • Until some threshold is reached (a time-to-live
    or TTL)
  • If a match is found, the reply is routed back to
    the query originator following the path in reverse
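
A minimal sketch of TTL-limited flooding, assuming the overlay is given as an adjacency dictionary and that each peer handles a given query only once. The function name `flood_query` and the data layout are hypothetical; the real Gnutella message format is not modeled.

```python
from collections import deque


def flood_query(overlay, files, origin, wanted, ttl):
    """Flood a query from `origin`; return {peer with a hit: reply path to origin}."""
    parent = {origin: None}           # who forwarded the query to whom
    queue = deque([(origin, ttl)])
    hits = {}
    while queue:
        peer, remaining = queue.popleft()
        if wanted in files.get(peer, ()):
            # The reply travels back along the forwarding path, in reverse.
            path, node = [], peer
            while node is not None:
                path.append(node)
                node = parent[node]
            hits[peer] = path
        if remaining == 0:
            continue                  # TTL exhausted: stop forwarding
        for neighbor in overlay[peer]:
            if neighbor not in parent:
                parent[neighbor] = peer
                queue.append((neighbor, remaining - 1))
    return hits


overlay = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
files = {"D": {"song.mp3"}}
print(flood_query(overlay, files, "A", "song.mp3", ttl=2))
print(flood_query(overlay, files, "A", "song.mp3", ttl=3))
```

With a TTL of 2 the query never reaches peer D, even though D holds the file, which illustrates why a "not found" answer from flooding is not conclusive.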

17
Gnutella
  • Strengths
  • Fully distributed, no central point of failure
  • Open protocol (easy to write clients)
  • Very robust against random node failures
  • Weaknesses
  • Flooding is very inefficient and pretty often
    fails to find what's being looked for
  • "How to pick the best query radius?" is pretty
    much impossible to answer

18
KaZaA
  • KaZaA proposed a very different architecture,
    that has influenced most file-sharing systems
    after it
  • On a typical day KaZaA has over 3 million active
    users, and over 500 TeraBytes of content
  • Based on a super-node architecture
  • Some peers are better and thus special
  • Introducing some hierarchy in the system helps

19
KaZaA
  • Each SN keeps track of a subset of the peers
  • A new peer registers to one SN only

20
KaZaA Search
  • The KaZaA Query
  • A peer sends a query to its SN
  • The SN answers for all its peers and then
    forwards to other SNs via flooding
  • Note that the SNs are not fully connected in the
    peer-to-peer network of SNs
  • Other SNs reply
  • Finding SuperNodes?
  • A normal peer can be promoted if it demonstrates
    that it has enough resources
  • A user can always refuse to become a SN
  • About 30,000 SNs at a given time
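
The two-tier search can be sketched as follows, assuming each supernode (SN) keeps an index of the files of its own peers and forwards queries to neighboring SNs with a hop limit. The class `SuperNode`, the function `query`, and the hop limit are illustrative assumptions, not KaZaA's actual (proprietary) protocol.

```python
from collections import deque


class SuperNode:
    """Hypothetical supernode: indexes the files of the peers registered to it."""

    def __init__(self, name):
        self.name = name
        self.index = {}        # file name -> set of peers sharing it
        self.neighbors = []    # other supernodes it is connected to

    def register(self, peer, filenames):
        for filename in filenames:
            self.index.setdefault(filename, set()).add(peer)


def query(first_sn, filename, ttl):
    """Ask the peer's own SN first, then flood to other SNs up to `ttl` hops."""
    found, seen = set(), {first_sn.name}
    queue = deque([(first_sn, ttl)])
    while queue:
        sn, remaining = queue.popleft()
        found |= sn.index.get(filename, set())   # the SN answers for its peers
        if remaining > 0:
            for other in sn.neighbors:
                if other.name not in seen:
                    seen.add(other.name)
                    queue.append((other, remaining - 1))
    return found


sn1, sn2, sn3 = SuperNode("sn1"), SuperNode("sn2"), SuperNode("sn3")
sn1.neighbors, sn2.neighbors, sn3.neighbors = [sn2], [sn1, sn3], [sn2]
sn1.register("p1", ["a.mp3"])
sn3.register("p2", ["a.mp3"])
print(sorted(query(sn1, "a.mp3", ttl=2)))
```

Only supernodes take part in the flood, so the flooded network is much smaller than the set of all peers.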

21
KaZaA
  • Strengths
  • Combine strengths of Napster and Gnutella
  • Weaknesses
  • Queries are still not comprehensive due to limited
    flooding
  • But a much better reach than Gnutella
  • Lawsuit against KaZaA eventually successful
  • software comes with a list of well-known
    supernodes

22
Content Distribution
  • BitTorrent provided a new approach for file
    sharing
  • Widely used for fully legal content
  • Linux distribution, software patches, etc.
  • Has had its share of litigation
  • Goal: quickly replicate a file to a large number
    of clients
  • A new overlay network is built for every file
    that's being distributed
  • You have to know the file reference, or "torrent"
  • contains metadata on the content
  • You can send a torrent to people, or publish it
  • There is no real searching in BitTorrent itself
  • Although out-of-band catalogs exist of course

23
BitTorrent
  • For each new BitTorrent file, one server hosts
    the original copy
  • The file is broken into chunks
  • There is also a torrent file which is typically
    kept on some web server(s)
  • Clients download the torrent file
  • whose metadata identifies a tracker
  • The tracker is a server that keeps track of
    currently active clients for a file
  • The tracker does not participate in the download
    and never holds any data
  • Note that lawsuits have been successful against
    people running trackers!

24
BitTorrent
25
BitTorrent
  • Terminology
  • Seed: client with a complete copy of the file
  • Leecher: client still downloading the file
  • Client contacts tracker and gets a list of other
    clients
  • Gets list of 50 peers
  • Client maintains connections to 20-40 peers
  • Contacts tracker if number of connections drops
    below 20
  • This set of peers is called the peer set
  • Client downloads chunks from peers in peer set
    and provides them with its own chunks
  • Chunks typically 256 KB
  • Chunks make it possible to use parallel download

26
BitTorrent Tit-for-Tat
  • A peer serves peers that serve it
  • Encourages cooperation, discourages free-riding
  • Peers use a "rarest first" policy for chunk
    downloads
  • Having a rare chunk makes peer attractive to
    others
  • Others want to download it, peer can then
    download the chunks it wants
  • Goal of chunk selection is to maximize entropy of
    each chunk
  • For first chunk, just randomly pick something, so
    that peer has something to share
  • Endgame mode
  • Send requests for last sub-chunks to all known
    peers
  • End of download not stalled by slow peers
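
A minimal sketch of the chunk-selection policy described above, assuming each peer in the peer set advertises the set of chunk indices it holds. The function name `pick_chunk` and its arguments are hypothetical, and endgame mode is not modeled.

```python
import random
from collections import Counter


def pick_chunk(my_chunks, peer_chunks, rng=random):
    """Return the index of the next chunk to request, or None if none is useful."""
    # For every chunk we still need, count how many peers in the peer set have it.
    availability = Counter(
        chunk
        for chunks in peer_chunks.values()
        for chunk in chunks
        if chunk not in my_chunks
    )
    if not availability:
        return None
    if not my_chunks:
        # First chunk: pick at random so that we quickly have something to share.
        return rng.choice(sorted(availability))
    # Otherwise pick (one of) the rarest chunks among those we are missing.
    rarest = min(availability.values())
    return rng.choice(sorted(c for c, n in availability.items() if n == rarest))


peer_chunks = {"p1": {0, 1, 2}, "p2": {0, 1}, "p3": {0}}
print(pick_chunk({0}, peer_chunks))
```

In the usage line the client already has chunk 0; chunk 2 is held by one peer and chunk 1 by two, so chunk 2 is the one requested.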

27
BitTorrent Choke/Unchoke
  • Peer serves e.g. 4 (default value) peers in peer
    set simultaneously
  • Seeks best (fastest) downloaders if it's a seed
  • Seeks best uploaders if it's a leecher
  • Choke is a temporary refusal to upload to a peer
  • Leecher serves 4 best uploaders, chokes all
    others
  • Every 10 seconds, it evaluates the transfer
    speed
  • If there is a better peer, choke the worst of
    the current 4
  • Every 30 seconds peer makes an optimistic
    unchoke
  • Randomly unchoke a peer from peer set
  • Idea: maybe it offers better service
  • Seeds behave exactly the same way, except they
    look at download speed instead of upload speed
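
The choke/unchoke decision can be sketched as two small functions, assuming the client has a measured transfer rate per peer (upload rate to us for a leecher, download rate from us for a seed). The names `choose_unchoked` and `pick_optimistic` are hypothetical, and the 10-second and 30-second timers are left to the caller.

```python
import random


def choose_unchoked(rates, optimistic=None, slots=4):
    """Return the peers to serve: the `slots` fastest, plus the optimistic one."""
    best = sorted(rates, key=rates.get, reverse=True)[:slots]
    unchoked = set(best)
    if optimistic is not None:
        unchoked.add(optimistic)
    return unchoked


def pick_optimistic(peer_set, unchoked, rng=random):
    """Pick a random choked peer to try out, or None if every peer is unchoked."""
    choked = sorted(set(peer_set) - set(unchoked))
    return rng.choice(choked) if choked else None


rates = {"a": 50, "b": 10, "c": 40, "d": 30, "e": 20, "f": 5}
regular = choose_unchoked(rates)               # re-evaluated every 10 seconds
trial = pick_optimistic(rates, regular)        # re-drawn every 30 seconds
print(sorted(choose_unchoked(rates, optimistic=trial)))
```

If the optimistically unchoked peer turns out to be fast, its measured rate rises and it displaces the worst of the regular four at the next evaluation.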

28
Searching vs. Addressing
  • In the peer-to-peer networks we've discussed so
    far, one searches for content
  • The content could be on any peer, so we need to
    look for it somehow, e.g., using keywords
  • When the system answers "didn't find it", that
    doesn't mean the content isn't there
  • This is not at all the way in which a storage
    system works
  • e.g., a file system on your machine
  • Storage systems work based on an addressing
    scheme
  • Content (e.g., a file) is known by a unique name
  • There is a way to know (not find) where that
    unique name is stored
  • Searching by keyword can be implemented, but as a
    separate feature (e.g., Spotlight on Mac OS X)
  • Such a storage system is typically more efficient
  • But perhaps less user friendly
  • Some P2P systems attempt to implement content
    addressing rather than content searching

29
Structured vs. Unstructured
  • Unstructured networks/systems
  • Based on searching
  • Unstructured does NOT mean complete lack of
    structure
  • Network has graph structure
  • But peers are free to join anywhere and objects
    can be stored anywhere
  • Structured networks/systems
  • Based on addressing
  • Network structure determines where peers belong
    in the network and where objects are stored
  • Should be efficient for locating objects
  • Big question: how to build structured networks?

30
Addressing in a Network
  • To enable addressing, we must have a scheme to
    figure out on which peer a particular file is
    stored
  • This is typically done via some hashing
  • Hash the file name (e.g., a fully qualified path)
    using some hash function to create a unique fileID
  • Using a good hash function is crucial
  • Large hash, so that there are no collisions
  • Hash that balances the load across the hash space
  • A useful abstraction (i.e., abstract data type)
    to implement addressing is a Distributed Hash
    Table
  • put(key, value) stores something in the network
  • e.g., key = hash of file name
  • e.g., value = file content
  • lookup(key) locates something in the network
  • returns the value
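
As a rough illustration of the put/lookup abstraction, the sketch below hashes a file name to a fixed-size integer key and stores values behind the same two calls. The choice of SHA-1, the 32-bit ID space, and the names `file_id` and `ToyDHT` are assumptions for the example; the class is a single-process stand-in, not a distributed implementation.

```python
import hashlib

M = 32  # number of bits in the ID space (an assumption for this sketch)


def file_id(name, m=M):
    """Hash a file name (e.g., a fully qualified path) to an m-bit integer."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** m)


class ToyDHT:
    """Same interface as a DHT, but everything lives in one local dictionary."""

    def __init__(self):
        self._store = {}

    def put(self, key, value):
        self._store[key] = value

    def lookup(self, key):
        return self._store.get(key)


dht = ToyDHT()
key = file_id("/home/alice/notes.txt")
dht.put(key, b"file content")
print(dht.lookup(key))
```

An application written against `put` and `lookup` does not need to know which peer ends up holding the value; that is the part a real DHT implementation supplies.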

31
HT and DHT
[Figure: a hash table (HT) with slots 0 to 9, next to a distributed hash table (DHT) whose entries are spread across peer A, peer B, and peer C]
32
Using the Abstraction
[Figure: a distributed application calls put(key, value) and lookup(key) on the DHT implementation, which returns the value]
33
Implementing a DHT
  • Question: which network structure do we use to
    support the DHT abstraction?
  • How do we identify peers?
  • Which other peers does one peer know about?
  • How do we route queries?
  • Which peer stores what?

34
Network Topologies
  • The topic of network topologies was a hot topic
    in the area of supercomputers
  • Goal: organize nodes of a supercomputer as
    vertices of a graph, such that
  • The graph scales well
  • i.e., not too many links per node
  • which cost a lot of money in the case of physical
    links
  • The graph has good performance
  • i.e., its diameter is small
  • diameter: max number of hops between two nodes
  • Let's see a few examples

35
Fully Connected Graph
  • Diameter: 1
  • Number of connections per node: N - 1
  • Great performance, poor scalability

36
Ring
  • Diameter: N/2
  • Number of connections per node: 2
  • Poor performance, great scalability

37
Torus/Grid
  • Diameter: about $\sqrt{N}$ (for a $\sqrt{N} \times \sqrt{N}$ torus)
  • Number of connections per node: 4
  • Better performance than a ring
  • Poorer scalability than a ring

38
Hypercube
  • Diameter: log N
  • Number of connections per node: log N
  • Considered a good compromise by many (used
    to build machines)
  • Defined by its dimension, d ($N = 2^d$)

39
Hypercube Routing
  • Each node is identified by a d-bit name
  • routing from xxxx to yyyy: just keep going to a
    neighbor that has a smaller Hamming distance
    to yyyy!
  • we will see this idea again

[Figure: 4-dimensional hypercube whose 16 nodes are labeled with the 4-bit names 0000 through 1111]
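
The routing rule for a hypercube can be sketched as follows, assuming node names are represented as integers and that the lowest differing bit is fixed first (any differing bit would do). The function name `hypercube_route` is hypothetical.

```python
def hypercube_route(src, dst):
    """Return the list of nodes visited when routing from src to dst."""
    path = [src]
    current = src
    while current != dst:
        diff = current ^ dst       # bits in which the two names still differ
        lowest = diff & -diff      # isolate the lowest differing bit
        current ^= lowest          # hop to the neighbor that fixes that bit
        path.append(current)
    return path


route = hypercube_route(0b0000, 0b1011)
print([format(node, "04b") for node in route])
```

Each hop flips exactly one bit, so the Hamming distance to the destination drops by one per hop and the number of hops is at most d.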
40
Overlay network topologies
  • Here we're building a P2P system, not a
    supercomputer, so maintaining 10 network
    connections to 10 neighbor peers doesn't require
    10 network cards/links!
  • Still, we can't go fully connected due to the
    size of the routing tables
  • Let's say we want to have a P2P network with
    $10^7$ peers (10 million)
  • Each peer must maintain a routing table that
    lists the peers along with some information on
    them
  • at a minimum IP address, port number, peerID
  • This could represent quite a bit of memory
  • Going through the routing table to fix/repair it
    due to churn would take too much time (and
    most of its content would be erroneous)
  • Therefore, it doesn't scale well
  • How about a Hypercube?
  • diameter = log(N) and number of connections =
    log(N) is g-r-e-a-t
  • the easy routing is g-r-e-a-t
  • The problem here is that it's easily broken by
    churn, and it's difficult to accommodate new
    nodes (number of nodes is a power of 2)
  • Could work with many tweaks
  • Question: what's a good structure that has some
    of the nice properties of a hypercube and is
    robust to churn?

41
Chord
  • Let's look at Chord, a famous DHT project
  • Developed at MIT in 2001
  • Fairly simple to understand (unlike other DHTs)
  • File names and node names are hashed to the same
    space, i.e., numbers between 0 and $2^m - 1$
  • where m is large enough and the hash function
    good enough that collisions happen only with
    infinitesimal probability
  • Each file has a unique fileID
  • e.g., hash of its name
  • Each peer has a unique peerID
  • e.g., hash of its IP address
  • Important: there is no difference between a
    fileID and a peerID
  • They're just numbers that can be sorted and
    compared easily

42
The Chord Ring
  • Peers are organized as a sorted ring
  • Peers are along the ring in increasing order of
    peerID
  • Remember, peerIDs are just numbers
  • Called a Chord ring
  • Each peer knows its successor and predecessor in
    the ring
  • For now let's assume no churn whatsoever
  • No peer arrives, no peer departs
  • Main Chord idea: a peer stores keys that are
    immediately lower than its peerID
  • Let's look at an example Chord ring and see which
    peer stores what

43
A Chord Ring
[Figure: Chord ring with 10 peers (P1, P8, P14, P21, P32, P38, P42, P48, P51, P56), annotated with key ranges: P14 stores keys in (8, 14], P32 stores keys in (21, 32], P38 stores keys in (32, 38], P56 stores keys in (51, 56]]
A peer stores (key,value) pairs whose keys are
lower than or equal to the peer's peerID and higher
than the peerID of the peer's predecessor
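
The storage rule can be checked with a short sketch, assuming the ten peerIDs of the example ring and treating the owner of a key as the first peer whose ID is greater than or equal to the key, wrapping around past the largest ID. The function name `owner` is hypothetical.

```python
from bisect import bisect_left

# peerIDs of the example ring, kept sorted
PEERS = [1, 8, 14, 21, 32, 38, 42, 48, 51, 56]


def owner(key, peers=PEERS):
    """Return the peerID that stores `key` on the ring."""
    i = bisect_left(peers, key)      # index of the first peerID >= key
    return peers[i % len(peers)]     # wrap around to the smallest peerID


for key in (10, 38, 49, 54, 60):
    print(key, "is stored by peer", owner(key))
```

Key 60 is larger than every peerID, so it wraps around the ring and is stored by P1.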
44
Put() and Lookup()
The principle is the same for Put() and Lookup()
[Figure: the example Chord ring, with a request to find key 49]
45
Put() and Lookup()
[Figure: the example Chord ring, with a request to find key 49]
Go around the ring, following the successor
links. Stop at the first peerID that is larger
than 49 (peer 51 here). If key 49 was stored in
the network, peer 51 has it.
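
A sketch of this successor-only lookup, counting hops. It assumes the example ring and, since the slide does not say where the query starts, an arbitrary starting peer (P8 in the usage line). The names `in_half_open` and `linear_lookup` are hypothetical.

```python
PEERS = [1, 8, 14, 21, 32, 38, 42, 48, 51, 56]


def in_half_open(key, lo, hi):
    """True if key lies in the ring interval (lo, hi]."""
    if lo < hi:
        return lo < key <= hi
    return key > lo or key <= hi     # the interval wraps around 0


def linear_lookup(start, key, peers=PEERS):
    """Follow successor links from `start`; return (owner of key, hops taken)."""
    ring = sorted(peers)
    i = ring.index(start)
    hops = 0
    while not in_half_open(key, ring[i - 1], ring[i]):
        i = (i + 1) % len(ring)      # forward the query to the successor
        hops += 1
    return ring[i], hops


print(linear_lookup(8, 49))
```

Starting from P8, the query visits seven successors before reaching P51, which shows how the hop count grows linearly with the number of peers.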
46
Scalability and Performance
  • The Chord ring as we have shown it is very
    scalable
  • Each peer only needs to know about two other
    peers
  • Very small routing table!
  • The problem is that the performance is very poor
  • The worst case complexity for a lookup is O(N)
    hops, where N is the number of peers
  • Since N can be on the order of millions, clearly
    it's not even remotely acceptable
  • Each hop will take hundreds of milliseconds
  • Question: how can we make the number of hops
    O(log N)?
  • Answer: by adding more edges in the network

47
Chord Fingers
  • Each peer maintains a finger table that
    contains m entries
  • We have $2^m$ potential peers in the system
  • So the finger table has at most O(log N) entries
  • The ith entry in the finger table of peer A
    stores the peerID of the first peer B that
    succeeds A by at least $2^{i-1}$ on the Chord ring
  • $B = \mathrm{successor}(A + 2^{i-1})$
  • B is called the ith finger of peer A
  • Let's see an example

48
Chord Fingers
[Figure: the example Chord ring (P1, P8, P14, P21, P32, P38, P42, P48, P51, P56) with finger pointers]
finger[i] = first peer that succeeds $(p + 2^i) \bmod 2^m$, for i = 0, ..., m-1
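
The finger table of any peer on the example ring can be computed directly from the definition. The sketch assumes m = 6 (IDs between 0 and 63, enough for the ten peerIDs shown) and numbers the entries 1 to m, so entry i targets the peer's own ID plus 2 to the power i-1. The names `successor` and `finger_table` are hypothetical.

```python
from bisect import bisect_left

M = 6                                         # assumed: IDs range over 0..63
PEERS = [1, 8, 14, 21, 32, 38, 42, 48, 51, 56]


def successor(ident, peers=PEERS):
    """First peer whose ID is >= ident, wrapping around the ring."""
    return peers[bisect_left(peers, ident) % len(peers)]


def finger_table(p, m=M, peers=PEERS):
    """Entry i (1..m) is the first peer at or after (p + 2^(i-1)) mod 2^m."""
    return [successor((p + 2 ** (i - 1)) % 2 ** m, peers) for i in range(1, m + 1)]


print(finger_table(8))
```

For P8 the targets are 9, 10, 12, 16, 24, and 40, so the table is [14, 14, 14, 21, 32, 42]: the first entries all point at the immediate successor, while the last one reaches more than halfway around the ring.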
49
Using Chord Fingers
Find key 54
[Figure: the example Chord ring, with a lookup for key 54 routed along finger pointers]
  • With the finger table, a peer can forward a query
    at least halfway to its destination in one hop
  • One can easily prove that the worst case number
    of hops is O(log N)
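
A sketch of finger-based routing on the example ring, assuming m = 6 and at least two peers. At each step the query either stops (the key falls between the current peer and its successor) or jumps to the farthest finger that does not overshoot the key. The starting peer P8 in the usage line is an assumption, and the helper names are hypothetical.

```python
from bisect import bisect_left

M = 6                                         # assumed: IDs range over 0..63
PEERS = [1, 8, 14, 21, 32, 38, 42, 48, 51, 56]


def successor(ident):
    """First peer whose ID is >= ident, wrapping around the ring."""
    return PEERS[bisect_left(PEERS, ident) % len(PEERS)]


def fingers(p):
    """Finger table of peer p, entries targeting p + 2^i for i = 0..M-1."""
    return [successor((p + 2 ** i) % 2 ** M) for i in range(M)]


def dist(a, b):
    """Clockwise distance from a to b on the ring."""
    return (b - a) % 2 ** M


def lookup(start, key):
    """Route a lookup for `key` from `start`; return (owner, peers visited)."""
    path = [start]
    n = start
    while True:
        succ = fingers(n)[0]                  # the first finger is the successor
        if 0 < dist(n, key) <= dist(n, succ):
            path.append(succ)                 # key lies in (n, succ]: succ owns it
            return succ, path
        span = dist(n, key) or 2 ** M
        # Closest preceding finger: the farthest one strictly before the key.
        n = next(f for f in reversed(fingers(n)) if 0 < dist(n, f) < span)
        path.append(n)


print(lookup(8, 54))
```

From P8 the query for key 54 goes to P42, then P51, then P56: three hops instead of the nine that successor links alone would need.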

50
Peers joining and leaving
  • We now have the nice hypercube property
  • routing table O(log N)
  • number of hops O(log N)
  • Question: what happens when a peer joins/leaves
    the system?
  • gracefully, not due to crashes
  • Leaving is straightforward
  • give (key,value) pairs to successor
  • Joining is a bit more complicated, but still
    simple
  • insert oneself in the ring
  • take over part of the key space of successor

51
Peer Joining
[Figure: the example Chord ring; a new peer announces "I am a new peer and my peerID is 40"]
52
Peer Joining
[Figure: the example Chord ring; the new peer with peerID 40 takes its place on the ring]
  • Do a lookup for key 40 (pretending it's a
    fileID), to identify along the way the first node
    with ID < 40 and the first node with ID > 40
  • Then insert the new peer (in this case between
    P38 and P42)
  • Requires a few successor and predecessor pointer
    updates
  • Requires computing/updating fingers all over the
    place (O(log N) messages)
  • Then take over (key,value) pairs in range (38, 40]
    from P42
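
The key hand-over at join time can be sketched as follows, assuming m = 6, a sorted list of peerIDs for the ring, and one dictionary of (key, value) pairs per peer. Finger updates and the lookup that locates the neighbors are not modeled, and the function name `join` is hypothetical.

```python
from bisect import bisect_left, insort

M = 6  # assumed: IDs range over 0..63


def join(ring, store, new_id, m=M):
    """Insert new_id into the sorted ring and move the keys it now owns."""
    size = 2 ** m
    i = bisect_left(ring, new_id)
    pred, succ = ring[i - 1], ring[i % len(ring)]    # e.g. P38 and P42 for ID 40
    # Keys in (pred, new_id] move from the successor to the new peer.
    moved = {k: v for k, v in store[succ].items()
             if 0 < (k - pred) % size <= (new_id - pred) % size}
    for k in moved:
        del store[succ][k]
    store[new_id] = moved
    insort(ring, new_id)
    return pred, succ


ring = [1, 8, 14, 21, 32, 38, 42, 48, 51, 56]
store = {p: {} for p in ring}
store[42] = {39: "a", 40: "b", 41: "c"}
print(join(ring, store, 40), store[40], store[42])
```

After the join, P40 holds keys 39 and 40, and P42 keeps key 41.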

53
What about crashes?
  • Crashes are difficult to handle
  • Yet they happen all the time
  • Chord uses a stabilization protocol
  • Each node periodically engages in some
    communications that repair successor and
    predecessor pointers and finger tables
  • Uses a simple mechanism: each peer stores
    pointers to log(N) successors, rather than just
    one
  • Therefore it's possible to detect missing nodes,
    and to repair all connections
  • There are many theoretical and practical results
    that show that this works well in practice
  • e.g., Lookup failure rate Peer departure rate
  • In fact, even graceful peer departures are
    treated as crashes, because the stabilization
    protocol works so well

54
Lookup failures?
  • Lookup failures will happen when nodes crash
  • The data they stored is no longer there!
  • One solution: use replication at a higher level
  • e.g., use Individual Chord rings so that 10
    copies of each value are stored, each with a
    different pair
  • When a lookup fails for one of the keys, try
    another one
  • Then restore the copy that disappeared
  • Chord is being used as the basis for several
    projects
  • shared storage
  • digital libraries
  • Downloadable at http://pdos.csail.mit.edu/chord/
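
One way to read the replication idea above is to store each value under several different keys and fall back to another key when a lookup fails. The sketch below follows that reading, which is an interpretation rather than the slide's exact scheme: keys are derived by hashing the name together with a replica number, a plain dictionary stands in for the DHT, and the names `replica_keys`, `put_replicated`, and `lookup_replicated` are hypothetical.

```python
import hashlib


def replica_keys(name, copies=10, m=32):
    """Derive one key per replica by hashing the name with a replica number."""
    keys = []
    for i in range(copies):
        digest = hashlib.sha1(f"{name}#{i}".encode("utf-8")).digest()
        keys.append(int.from_bytes(digest, "big") % (2 ** m))
    return keys


def put_replicated(dht, name, value, copies=10):
    for key in replica_keys(name, copies):
        dht[key] = value


def lookup_replicated(dht, name, copies=10):
    """Try each replica key in turn; restore missing copies once one is found."""
    keys = replica_keys(name, copies)
    for key in keys:
        if key in dht:
            value = dht[key]
            for other in keys:
                dht.setdefault(other, value)   # put back copies that disappeared
            return value
    return None


dht = {}                                       # stand-in for a real DHT
put_replicated(dht, "report.pdf", b"data")
del dht[replica_keys("report.pdf")[0]]         # simulate a crashed peer
print(lookup_replicated(dht, "report.pdf"))
```

Because the replica keys hash to unrelated positions in the ID space, the copies tend to land on different peers, so a single crash is unlikely to remove all of them.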

55
Conclusion
  • P2P systems have been successfully used in
    several domains
  • Two classes
  • unstructured: successful file-sharing systems and
    content distribution systems
  • based on searching
  • structured: more on the research side, but much
    more powerful
  • based on addressing within DHTs
  • Although it's difficult to forecast, the future
    of P2P systems should be pretty cool