Taming Aggressive Replication in the Pangaea Wide-area File System

Transcript and Presenter's Notes

1
Taming Aggressive Replication in the Pangaea
Wide-area File System
  • Y. Saito, C. Karamanolis, M. Karlsson, M.
    Mahalingam. Presented by Jason Waddle

2
Pangaea Wide-area File System
  • Support the daily storage needs of distributed
    users.
  • Enable ad-hoc data sharing.

3
Pangaea Design Goals
  • Speed
  • Hide wide-area latency; file access time ≈ local
    file system
  • Availability and autonomy
  • Avoid single point-of-failure
  • Adapt to churn
  • Network economy
  • Minimize use of wide-area network
  • Exploit physical locality

4
Pangaea Assumptions (Non-goals)
  • Servers are trusted
  • Weak data consistency is sufficient (consistency
    in seconds)

5
Symbiotic Design
6
Symbiotic Design
Autonomous
Each server operates when disconnected from
network.
7
Symbiotic Design
Autonomous
Cooperative
Each server operates when disconnected from
network.
When connected, servers cooperate to enhance
overall performance and availability.
8
Pervasive Replication
  • Replicate at file/directory level
  • Aggressively create replicas whenever a file or
    directory is accessed
  • No single master replica
  • A replica may be read / written at any time
  • Replicas exchange updates in a peer-to-peer
    fashion

9
Graph-based Replica Management
  • Replicas connected in a sparse, strongly-
    connected, random graph
  • Updates propagate along edges
  • Edges used for discovery and removal
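
A rough sketch of this idea (illustrative Python, not Pangaea's code): each replica keeps a small set of peer edges, joining or leaving touches only those neighbors, and an update flooded along the edges reaches every replica as long as the graph stays connected. The class and function names are made up for the example.

    import random

    class ReplicaGraph:
        def __init__(self):
            self.edges = {}          # node id -> set of neighbor ids

        def add_replica(self, node, m=4):
            # Cheap join: connect the new replica to up to m existing replicas.
            existing = list(self.edges)
            peers = set(random.sample(existing, min(m, len(existing))))
            self.edges[node] = peers
            for p in peers:
                self.edges[p].add(node)          # edges are bidirectional

        def remove_replica(self, node):
            # Cheap leave: drop the node and its incident edges.
            for p in self.edges.pop(node, set()):
                self.edges[p].discard(node)

        def flood(self, origin, update, deliver):
            # As long as the graph is connected, flooding reaches every replica.
            seen = {origin}
            frontier = [origin]
            while frontier:
                node = frontier.pop()
                deliver(node, update)
                for nbr in self.edges[node] - seen:
                    seen.add(nbr)
                    frontier.append(nbr)

    g = ReplicaGraph()
    g.edges["gold-1"] = set()
    for n in ["gold-2", "gold-3", "bronze-1", "bronze-2"]:
        g.add_replica(n)
    g.flood("bronze-1", "write v2", lambda n, u: print(n, "applies", u))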

10
Benefits of Graph-based Approach
  • Inexpensive
  • Graph is sparse; adding/removing replicas is O(1)
  • Available update distribution
  • As long as graph is connected, updates reach
    every replica
  • Network economy
  • High connectivity for close replicas; build
    spanning tree along fast edges

11
Optimistic Replica Coordination
  • Aim for maximum availability over strong
    data-consistency
  • Any node issues updates at any time
  • Update transmission and conflict resolution
    in background

12
Optimistic Replica Coordination
  • Eventual consistency (~5 s in tests)
  • No strong consistency guarantees: no support for
    locks, lock-files, etc.

13
Pangaea Structure
Region (< 5 ms RTT)
Server or Node
14
Server Structure
Figure: server architecture. Application I/O requests pass through the
kernel-space NFS client to the user-space Pangaea server, which contains
the NFS protocol handler, replication engine, log, and membership modules
and talks to other servers over inter-node communication.
15
Server Modules
  • NFS protocol handler
  • Receives requests from apps, updates local
    replicas, generates requests to the replication engine

16
Server Modules
  • NFS protocol handler
  • Receives requests from apps, updates local
    replicas, generates requests to the replication engine
  • Replication engine
  • Accepts local and remote requests
  • Modifies replicas
  • Forwards requests to other nodes

17
Server Modules
  • NFS protocol handler
  • Receives requests from apps, updates local
    replicas, generates requests to the replication engine
  • Replication engine
  • Accepts local and remote requests
  • Modifies replicas
  • Forwards requests to other nodes
  • Log module
  • Transaction-like semantics for local updates

18
Server Modules
  • Membership module maintains
  • List of regions, their members, estimated RTT
    between regions
  • Location of root directory replicas
  • Information coordinated by gossiping
  • Landmark nodes bootstrap newly joining nodes

Maintaining RTT information is the main scalability
bottleneck
19
File System Structure
  • Gold replicas
  • Listed in directory entries
  • Form clique in replica graph
  • Fixed number (e.g., 3)
  • All replicas (gold and bronze)
  • Unidirectional edges to all gold replicas
  • Bidirectional peer-edges
  • Backpointer to parent directory

20
File System Structure
/joe
/joe/foo
21
File System Structure
struct Replica {
  fid: FileID
  ts: TimeStamp
  vv: VersionVector
  goldPeers: Set(NodeID)
  peers: Set(NodeID)
  backptrs: Set(FileID, String)
}

struct DirEntry {
  fname: String
  fid: FileID
  downlinks: Set(NodeID)
  ts: TimeStamp
}
22
File Creation
  • Select locations for g gold replicas (e.g., g = 3)
  • One on current server
  • Others on random servers from different regions
  • Create entry in parent directory
  • Flood updates
  • Update to parent directory
  • File contents (empty) to gold replicas
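
A minimal sketch of this creation step, assuming a membership table mapping regions to servers (the table layout and helper names are invented for illustration):

    import random

    def create_file(fname, local_server, membership, g=3):
        # membership: dict region -> list of server ids; local_server = (region, id)
        local_region, local_id = local_server
        golds = [local_id]                       # one gold on the current server
        other_regions = [r for r in membership if r != local_region]
        for region in random.sample(other_regions, min(g - 1, len(other_regions))):
            golds.append(random.choice(membership[region]))   # spread across regions

        dir_entry = {"fname": fname, "downlinks": set(golds)}  # entry in parent dir
        empty_file = {"fid": fname, "golds": set(golds), "data": b""}
        # Flood: the directory update goes to the parent's replicas; the (empty)
        # file contents go to the chosen gold replicas.
        return dir_entry, empty_file, golds

    membership = {"us": ["s1", "s2"], "eu": ["s3"], "asia": ["s4"]}
    print(create_file("/joe/foo", ("us", "s1"), membership))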

23
Replica Creation
  • Recursively get replicas for ancestor directories
  • Find a close replica (shortcutting)
  • Send request to the closest gold replica
  • Gold replica forwards request to its neighbor
    closest to the requester, which then sends the data
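
The shortcutting step might look roughly like this, assuming an RTT estimate between servers (the rtt table and names below are hypothetical):

    def fetch_via_shortcut(requester, golds, neighbors, rtt):
        # 1. Requester contacts the closest gold replica.
        gold = min(golds, key=lambda s: rtt(requester, s))
        # 2. The gold replica forwards the request to whichever of its neighbors
        #    is closest to the requester; that replica ships the contents.
        candidates = neighbors[gold] | {gold}
        return min(candidates, key=lambda s: rtt(requester, s))

    rtt_table = {("c", "g1"): 80, ("c", "g2"): 200, ("c", "g3"): 150, ("c", "b1"): 5}
    rtt = lambda a, b: rtt_table.get((a, b), 999)
    print(fetch_via_shortcut("c", ["g1", "g2", "g3"], {"g1": {"b1", "g2"}}, rtt))  # -> b1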

24
Replica Creation
  • Select m peer-edges (e.g., m = 4)
  • Include a gold replica (for future shortcutting)
  • Include closest neighbor from a random gold
    replica
  • Get remaining nodes from random walks starting at
    a random gold replica
  • Create m bidirectional peer-edges
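
An illustrative sketch of this edge-selection rule, with made-up helpers for the random walk and RTT estimate:

    import random

    def random_walk(edges, start, steps=3):
        node = start
        for _ in range(steps):
            node = random.choice(sorted(edges[node]))
        return node

    def pick_peer_edges(new_node, golds, edges, rtt, m=4):
        peers = {random.choice(golds)}               # a gold, for future shortcutting
        g = random.choice(golds)
        if edges[g]:                                 # closest neighbor of a random gold
            peers.add(min(edges[g], key=lambda s: rtt(new_node, s)))
        for _ in range(4 * m):                       # fill the rest with random walks
            if len(peers) >= m:
                break
            peers.add(random_walk(edges, random.choice(golds)))
        peers.discard(new_node)
        return peers                                 # then add bidirectional edges

    edges = {"g1": {"g2", "b1"}, "g2": {"g1", "b2"}, "b1": {"g1"}, "b2": {"g2"}}
    print(pick_peer_edges("new", ["g1", "g2"], edges, lambda a, b: 10))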

25
Bronze Replica Removal
  • To recover disk space
  • Using GD-Size algorithm, throw out largest,
    least-accessed replica
  • Drop useless replicas
  • Too many updates before an access (e.g., 4)
  • Must notify peer-edges of removal; peers use a
    random walk to choose a new edge
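
A toy version of this eviction policy: a crude value-per-byte ordering stands in for GD-Size, and replicas with too many unread updates are dropped first. Field names and thresholds are illustrative; after removal the replica's peers pick new edges by random walk as noted above.

    def pick_victims(replicas, disk_limit, useless_updates=4):
        # replicas: dicts with size, accesses, updates_since_access
        victims, live = [], []
        for r in replicas:
            (victims if r["updates_since_access"] >= useless_updates else live).append(r)
        used = sum(r["size"] for r in live)
        live.sort(key=lambda r: r["accesses"] / r["size"])   # crude stand-in for GD-Size
        while used > disk_limit and live:                    # evict until under the limit
            v = live.pop(0)
            victims.append(v)
            used -= v["size"]
        return victims

    replicas = [
        {"name": "a", "size": 10_000, "accesses": 1, "updates_since_access": 0},
        {"name": "b", "size": 500, "accesses": 50, "updates_since_access": 0},
        {"name": "c", "size": 2_000, "accesses": 3, "updates_since_access": 6},
    ]
    print([r["name"] for r in pick_victims(replicas, disk_limit=5_000)])  # ['c', 'a']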

26
Replica Updates
  • Flood entire file to replica graph neighbors
  • Updates reach all replicas as long as the graph
    is strongly connected
  • Optional: user can block on update until all
    neighbors reply (red-button mode)
  • Network economy???

27
Optimized Replica Updates
  • Send only differences (deltas)
  • Include old timestamp, new timestamp
  • Only apply delta to replica if old timestamp
    matches
  • Revert to full-content transfer if necessary
  • Merge deltas when possible
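
A sketch of the delta path: apply a delta only when the replica's timestamp matches the delta's old timestamp, otherwise fall back to a full-content transfer; consecutive deltas can be merged. The record layout here is invented for the example.

    def apply_delta(replica, delta):
        # replica: {"ts": int, "data": bytes}; delta: {"old_ts", "new_ts", "patch"}
        if replica["ts"] != delta["old_ts"]:
            return "need-full-transfer"          # revert to full-content transfer
        replica["data"] = delta["patch"](replica["data"])
        replica["ts"] = delta["new_ts"]
        return "ok"

    def merge_deltas(d1, d2):
        # Two consecutive deltas can be merged when d2 starts where d1 ended.
        assert d1["new_ts"] == d2["old_ts"]
        return {"old_ts": d1["old_ts"], "new_ts": d2["new_ts"],
                "patch": lambda data: d2["patch"](d1["patch"](data))}

    r = {"ts": 1, "data": b"hello"}
    d = {"old_ts": 1, "new_ts": 2, "patch": lambda b: b + b" world"}
    print(apply_delta(r, d), r)    # ok, replica now at ts 2
    print(apply_delta(r, d))       # need-full-transfer (base timestamp no longer matches)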

28
Optimized Replica Updates
  • Don't send large (e.g., > 1 KB) updates to each of
    m neighbors
  • Instead, use harbingers to dynamically build a
    spanning-tree update graph
  • Harbinger: a small message with the update's timestamps
  • Send updates along spanning-tree edges
  • Happens in two phases

29
Optimized Replica Updates
  • Exploit Physical Topology
  • Before pushing a harbinger to a neighbor, add a
    random delay proportional to the RTT (e.g., 10 × RTT)
  • Harbingers propagate down fastest links first
  • Dynamically builds an update spanning-tree with
    fast edges
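
A toy simulation of phase 1, assuming known per-link RTTs: each node delays a harbinger by a random amount proportional to the link RTT, so the first-arrival edges (the spanning tree used to ship the update bodies in phase 2) tend to be the fast ones. The topology and the factor of 10 are illustrative.

    import heapq, random

    def propagate_harbingers(edges, rtt, origin, factor=10):
        # Returns, for each node, the neighbor it first heard the harbinger from;
        # those first-arrival edges form the spanning tree used in phase 2.
        arrived, seq = {}, 0
        events = [(0.0, seq, origin, None)]       # (arrival time, tiebreak, node, sender)
        while events:
            t, _, node, sender = heapq.heappop(events)
            if node in arrived:
                continue                          # later harbingers are ignored
            arrived[node] = sender
            for nbr in edges[node]:
                link = rtt[frozenset((node, nbr))]
                delay = random.uniform(0, factor) * link   # wait longer on slow links
                seq += 1
                heapq.heappush(events, (t + link + delay, seq, nbr, node))
        return arrived

    edges = {"A": {"B", "C"}, "B": {"A", "C", "D"}, "C": {"A", "B", "D"}, "D": {"B", "C"}}
    rtt = {frozenset(p): ms for p, ms in
           [(("A", "B"), 5), (("A", "C"), 100), (("B", "C"), 5),
            (("B", "D"), 5), (("C", "D"), 100)]}
    print(propagate_harbingers(edges, rtt, "A"))  # e.g. {'A': None, 'B': 'A', 'C': 'B', 'D': 'B'}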

30
Update Example (slides 30-38)
Figure-only animation: an update spreading through a replica graph with
nodes A-F. Phase 1 (slides 30-35) floods harbingers; Phase 2 (slides
36-38) sends the update bodies along the resulting spanning-tree edges.
39
Conflict Resolution
  • Use a combination of version vectors and
    last-writer wins to resolve
  • If timestamps mismatch, full-content is
    transferred
  • Missing update: just overwrite the replica
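
A sketch of that rule: compare version vectors first; if one update subsumes the other, take it, and only for truly concurrent updates fall back to last-writer-wins on the timestamp. The record layout is illustrative.

    def compare_vv(a, b):
        # a, b: dict server -> counter
        keys = set(a) | set(b)
        ge = all(a.get(k, 0) >= b.get(k, 0) for k in keys)
        le = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
        if ge and le: return "equal"
        if ge: return "descends"
        if le: return "precedes"
        return "conflict"

    def resolve(local, remote):
        # local/remote: {"vv": {...}, "ts": float, "data": ...}
        order = compare_vv(local["vv"], remote["vv"])
        if order in ("descends", "equal"):
            return local                         # remote update is old news
        if order == "precedes":
            return remote                        # missing update: just overwrite
        return max(local, remote, key=lambda r: r["ts"])   # last-writer-wins

    a = {"vv": {"s1": 2, "s2": 1}, "ts": 10.0, "data": "A"}
    b = {"vv": {"s1": 1, "s2": 2}, "ts": 12.0, "data": "B"}
    print(resolve(a, b)["data"])                 # B (concurrent; newer timestamp wins)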

40
Regular File Conflict (Three Solutions)
  • Last-writer-wins, using update timestamps
  • Requires server clock synchronization
  • Concatenate both updates
  • Make the user fix it
  • Possibly application-specific resolution

41
Directory Conflict
alice mv /foo /alice/foo
bob mv /foo /bob/foo
42
Directory Conflict
alice mv /foo /alice/foo
bob mv /foo /bob/foo
/bob replica set
/alice replica set
43
Directory Conflict
alice mv /foo /alice/foo
bob mv /foo /bob/foo
Let the child (foo) decide!
  • Implement mv as a change to the file's
    backpointer
  • Single file resolves conflicting updates
  • File then updates affected directories
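
A sketch of letting the child decide, with invented record layouts: the two renames become two updates to the same file's backpointer, last-writer-wins picks one, and the file then rewrites the entries in the affected directories.

    def resolve_mv(file_replica, mv_updates, directories):
        # mv_updates: list of {"ts": float, "parent": str, "name": str}
        winner = max(mv_updates, key=lambda u: u["ts"])      # last writer wins
        file_replica["backptr"] = (winner["parent"], winner["name"])
        for d in directories.values():                       # drop stale entries
            d["entries"].pop(file_replica["fid"], None)
        directories[winner["parent"]]["entries"][file_replica["fid"]] = winner["name"]
        return winner

    foo = {"fid": "foo", "backptr": ("/", "foo")}
    dirs = {"/alice": {"entries": {}}, "/bob": {"entries": {}}}
    mvs = [{"ts": 5.0, "parent": "/alice", "name": "foo"},
           {"ts": 7.0, "parent": "/bob", "name": "foo"}]
    print(resolve_mv(foo, mvs, dirs), dirs)       # bob's later mv wins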

44
Temporary Failure Recovery
  • Log outstanding remote operations
  • Update, random walk, edge addition, etc.
  • Retry logged updates
  • On reboot
  • On recovery of another node
  • Can create superfluous edges
  • Retains m-connectedness
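
A minimal sketch of the retry log (send() and the log format are placeholders, not Pangaea's API): outstanding remote operations are appended to a persistent log and replayed on reboot or when a peer recovers; operations that still fail stay queued.

    import json, os

    LOG = "outstanding.log"

    def remember(op):
        with open(LOG, "a") as f:
            f.write(json.dumps(op) + "\n")       # update, random walk, edge addition, ...

    def retry_all(send):
        if not os.path.exists(LOG):
            return
        with open(LOG) as f:
            pending = [json.loads(line) for line in f]
        still_pending = [op for op in pending if not send(op)]   # keep what failed
        with open(LOG, "w") as f:
            for op in still_pending:
                f.write(json.dumps(op) + "\n")

    remember({"kind": "update", "target": "s3", "fid": "/joe/foo"})
    retry_all(lambda op: True)                   # e.g., called on reboot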

45
Permanent Failures
  • A garbage collector (GC) scans for failed nodes
  • Bronze replica on failed node
  • GC causes the replica's neighbors to replace the
    link with a new peer chosen by random walk

46
Permanent Failure
  • Gold replica on failed node
  • Discovered by another gold (clique)
  • Chooses new gold by random walk
  • Flood choice to all replicas
  • Update parent directory to contain new gold
    replica nodes
  • Resolve conflicts with last-writer-wins
  • Expensive!
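
One way this replacement could be sketched (all names and the walk length are illustrative): a surviving gold notices the failure, nominates a new gold by random walk over live replicas, and the new gold set is then flooded to all replicas and written into the parent directory, with last-writer-wins resolving races.

    import random

    def replace_failed_gold(failed, golds, edges, alive):
        survivors = [g for g in golds if g != failed and alive(g)]
        # A surviving gold notices the failure and picks a new gold by random walk.
        node = random.choice(survivors)
        for _ in range(3):
            nbrs = [n for n in edges[node] if alive(n)]
            if not nbrs:
                break
            node = random.choice(nbrs)
        new_golds = set(survivors) | {node}
        # Flood the choice to all replicas and update the parent directory's
        # downlinks; conflicting choices are resolved last-writer-wins.
        return new_golds

    edges = {"g1": {"g2", "b1"}, "g2": {"g1", "b2"}, "b1": {"g1", "b2"}, "b2": {"g2", "b1"}}
    print(replace_failed_gold("g3", ["g1", "g2", "g3"], edges, lambda n: n != "g3"))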

47
Performance: LAN
Andrew-Tcl benchmarks, time in seconds
48
Performance: Slow Link
The importance of local replicas
49
Performance: Roaming
Compile on C1, then time the compile on C2. Pangaea
utilizes fast links to a peer's replicas.
50
Performance: Non-uniform Net
A model of HP's corporate network.
51
Performance: Non-uniform Net
52
Performance: Update Propagation
Harbinger time is the window of inconsistency.
53
Performance: Large Scale
HP: 3,000-node, 7-region model of the HP network. U: 500 regions,
6 nodes per region, 200 ms RTT, 5 Mb/s links.
Latency improves with more replicas.
54
Performance: Large Scale
HP: 3,000-node, 7-region model of the HP network. U: 500 regions,
6 nodes per region, 200 ms RTT, 5 Mb/s links.
Network economy improves with more replicas.
55
Performance: Availability
Numbers in parentheses are relative storage
overhead.