
1
New Protocols for Remote File Synchronization
Based on Erasure Codes
  • Utku Irmak
  • Svilen Mihaylov
  • Torsten Suel
  • Polytechnic University

2
Outline
  • Introduction and Common Applications
  • Problem Formalization
  • Contributions
  • An Approach Based on Erasure Codes
  • A Simple Multi-Round Protocol
  • An Efficient Single-Round Protocol
  • A Practical Protocol Based on Erasure Codes
  • Implementation Overview
  • Preliminary Results
  • Conclusions

3
Introduction
[Diagram: Machine A holds the current version, Machine B the outdated version]
  • Remote File Synchronization Problem: how to
    update the outdated version of a file over a
    network with a minimal amount of communication
  • When the versions are very similar, the total
    data transmitted should be significantly smaller
    than the file size

4
Common Applications
  • Synchronization of User Files
  • Synchronization between different machines that
    may only be connected over a slow network
    (e.g., a home and a work machine)
  • Both rsync and unison are widely used tools
  • Web and FTP Site Mirroring
  • Significant similarities between successive
    versions
  • Including sites distributing new versions of
    software
  • rsync is widely used

5
Common Applications
  • Content Distribution Networks
  • File synchronization is a natural approach for
    updating content replicated at the network edge
  • Web Access over Slow Links
  • A user revisiting a webpage may already have a
    previous version in the browser cache
  • It would be desirable to avoid the entire
    transmission
  • This idea is implemented in rproxy, which uses
    the rsync algorithm

6
Problem Formalization
  • We have two files (strings) over some alphabet:
    fnew (the current file) and fold (the outdated file)
  • We have two machines, C (the client) and S (the
    server), connected by a communication link
  • C only has a copy of fold, and S only has a copy
    of fnew
  • Goal: design a protocol between the parties that
    results in C holding a copy of fnew while
    minimizing the total communication cost

7
Problem Formalization
  • The communication cost should depend on the
    degree of similarity between the two files
  • The Hamming distance
  • The edit distance
  • The edit distance with block moves
  • We focus mainly on the edit distance with block
    moves, and assume that each block move operation
    adds 3 to the distance, while every other
    operation adds 1
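The cost model in the last bullet can be made concrete with a small sketch. The operation encoding below is my own illustration; only the costs themselves (3 per block move, 1 per character operation) come from the slide:

```python
# Cost model from the slide: a block move adds 3 to the distance,
# while a character insert/delete/substitute adds 1.
COSTS = {"block_move": 3, "insert": 1, "delete": 1, "substitute": 1}

def apply_script(s, ops):
    """Apply an edit script to string s and return (result, total_cost).
    Op encoding (illustrative): ("block_move", src, length, dst),
    ("substitute", pos, ch), ("insert", pos, ch), ("delete", pos)."""
    cost = 0
    for op in ops:
        kind = op[0]
        cost += COSTS[kind]
        if kind == "block_move":
            _, src, length, dst = op
            block = s[src:src + length]
            rest = s[:src] + s[src + length:]
            s = rest[:dst] + block + rest[dst:]
        elif kind == "substitute":
            _, pos, ch = op
            s = s[:pos] + ch + s[pos + 1:]
        elif kind == "insert":
            _, pos, ch = op
            s = s[:pos] + ch + s[pos:]
        else:  # delete
            _, pos = op
            s = s[:pos] + s[pos + 1:]
    return s, cost

# Moving the block "cde" to the front and substituting one character
# gives edit distance with block moves 3 + 1 = 4:
result, cost = apply_script("abcdefg",
                            [("block_move", 2, 3, 0), ("substitute", 6, "X")])
```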

8
Problem Formalization
  • We focus on single-round protocols between client
    and server
  • Single-round protocols can be more easily
    integrated into existing tools currently relying
    on rsync
  • Multiple rounds are undesirable in many scenarios
    involving small files or large latencies
  • Multi-round protocols can introduce other
    complications due to state that may have to be
    kept at the server for best performance

9
Assumptions
  • The collection consists of unstructured files
  • We are not concerned with issues of consistency
    in between synchronization steps
  • A simple two-party scenario where it is known
    which files need to be updated and which is the
    current version

10
Contributions
  • We describe a new approach to single-round file
    synchronization based on erasure codes
  • We derive a protocol that communicates at most
    O(k lg(n) lg(n/k)) bits on files with edit
    distance with block moves of at most k
  • We derive another practical algorithm and
    optimized implementation that achieves very
    promising improvements over rsync

11
Outline
  • Introduction and Common Applications
  • Problem Formalization
  • Contributions
  • An Approach Based on Erasure Codes
  • A Simple Multi-Round Protocol
  • An Efficient Single-Round Protocol
  • A Practical Protocol Based on Erasure Codes
  • Implementation Overview
  • Preliminary Results
  • Conclusions

12
A Simple Multi-Round Protocol
  • Runs in a number of rounds
  • In the first round, the server partitions the file
    into blocks of size bmax and sends a hash (MD5)
    for each block
  • The client attempts to match the received hashes
    against all possible alignments in the outdated
    file
  • The client responds with a bit vector telling the
    server which of the hashes were matched
  • The server repeats the process, with a halved
    block size, for the blocks whose hashes did not
    find a match
  • Once block size bmin is reached, the server sends
    all remaining unmatched blocks directly
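The rounds above can be simulated locally as a sketch (my own encoding: both sides run in one process, hash and bit-vector overhead is not counted, and only the literal bytes sent in the final step are returned):

```python
import hashlib

def md5(data):
    return hashlib.md5(data).digest()

def multi_round_sync(fnew, fold, bmax, bmin):
    """Simulate the multi-round protocol; returns the number of literal
    bytes the server would send once block size bmin is reached."""
    unmatched = [(0, len(fnew))]    # byte ranges of fnew still unresolved
    b = bmax
    while unmatched and b >= bmin:
        # Server: partition unresolved ranges into size-b blocks, hash each
        blocks = [(i, min(i + b, end))
                  for start, end in unmatched
                  for i in range(start, end, b)]
        hashes = [md5(fnew[i:j]) for i, j in blocks]
        # Client: hash all possible alignments of the outdated file
        known = {md5(fold[i:i + b]) for i in range(len(fold) - b + 1)}
        # Bit vector: blocks whose hashes found no match go to the next round
        unmatched = [blk for blk, h in zip(blocks, hashes) if h not in known]
        b //= 2
    # Once bmin is reached, the server sends the remaining blocks verbatim
    return sum(j - i for i, j in unmatched)
```

Identical files cost no literal bytes, while a single changed byte leaves only one bmin-sized block unmatched.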

13
A Simple Multi-Round Protocol
14
A Simple Multi-Round Protocol
  • Given two files with edit distance with block
    moves at most k, we choose
  • bmax = the next smaller power of 2 of n/k
  • bmin = lg(n)
  • hash size = 4 lg(n) bits
  • Lemma: if we partition fnew into some number of
    blocks, then at most k of these blocks do not
    occur in fold
  • Hence, on each level at most k hashes do not find
    a match
  • The algorithm transmits at most O(k lg(n) lg(n/k))
    bits and correctly updates the file with
    probability at least 1 - 1/n
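Plugging the parameter choices above into code (a sketch; the exact rounding is my assumption where the slide leaves it implicit):

```python
import math

def protocol_params(n, k):
    """Parameter choices from the analysis, for file size n and distance k."""
    bmax = 1 << ((n // k).bit_length() - 1)   # next smaller power of 2 of n/k
    bmin = math.ceil(math.log2(n))            # lg(n)
    hash_bits = 4 * math.ceil(math.log2(n))   # 4 lg(n) bits per hash
    # Halving from bmax down to bmin takes about lg(bmax/bmin) rounds, and
    # in each round at most k hashes miss, so the transmitted hash volume
    # is roughly rounds * k * hash_bits = O(k lg(n) lg(n/k)) bits
    rounds = max(1, int(math.log2(bmax // bmin)) + 1)
    return {"bmax": bmax, "bmin": bmin, "hash_bits": hash_bits,
            "hash_bits_total": rounds * k * hash_bits}

params = protocol_params(1 << 20, 16)   # a 1 MiB file, distance at most 16
```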

15
Outline
  • Introduction and Common Applications
  • Problem Formalization
  • Contributions
  • An Approach Based on Erasure Codes
  • A Simple Multi-Round Protocol
  • An Efficient Single-Round Protocol
  • A Practical Protocol Based on Erasure Codes
  • Implementation Overview
  • Preliminary Results
  • Conclusions

16
An Efficient Single-Round Protocol
  • First, we define the complete multi-round
    algorithm, which sends hashes for all blocks
  • Second, we briefly describe systematic erasure
    codes

17
Erasure Code
  • Erasure Code: k source data items of size s are
    encoded into n > k encoded items of the same
    size s
  • If up to n - k of the encoded items are lost, they
    can be recovered from the remaining items
  • A systematic erasure code is one where the
    encoded items consist of the k source items plus
    n - k additional items

Figure by Luigi Rizzo
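A systematic erasure code of this kind can be sketched with Reed-Solomon-style polynomial evaluation. (Rizzo's FEC code works over GF(2^8); the prime-field arithmetic here is my simplification for brevity, and source items must be smaller than the modulus.)

```python
P = (1 << 31) - 1  # prime modulus; source items must be < P (assumption)

def _lagrange(points, x):
    """Value at x of the lowest-degree polynomial through the given points."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num = den = 1
        for xj, _ in points[:i] + points[i + 1:]:
            num = num * (x - xj) % P
            den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, -1, P)) % P
    return total

def rs_encode(source, n):
    """Systematic encoding: items 0..k-1 are the source itself; items
    k..n-1 are extra evaluations of the interpolating polynomial."""
    points = list(enumerate(source))
    return list(source) + [_lagrange(points, x) for x in range(len(source), n)]

def rs_recover(received, k):
    """received: length-n list with None for lost items; any k surviving
    items suffice to recover all the rest."""
    survivors = [(x, v) for x, v in enumerate(received) if v is not None][:k]
    return [v if v is not None else _lagrange(survivors, x)
            for x, v in enumerate(received)]

# Encode 3 source items into 6 encoded items; any 3 of them recover all 6:
encoded = rs_encode([5, 17, 42], 6)
damaged = [None, encoded[1], None, encoded[3], None, encoded[5]]
```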
18
An Efficient Single-Round Protocol
  • Any hash value sent in the complete multi-round
    algorithm that would not be sent in the simple
    multi-round algorithm is not transmitted

19
An Efficient Single-Round Protocol
  • Any hash value that would be sent by the simple
    multi-round algorithm is also not sent to the
    client, but considered lost

20
An Efficient Single-Round Protocol
  • On each level there can be at most 2k lost blocks
  • The client can recreate the entire level of hashes
    by using the 2k erasure hashes to recover the
    lost hashes

21
An Efficient Single-Round Protocol
  • Theorem: given a bound k on the edit distance
    between fold and fnew, the erasure-based file
    synchronization algorithm correctly updates fold
    to fnew with probability at least 1 - 1/n, using a
    single message of O(k lg(n) lg(n/k)) bits
  • We note that there are highly efficient
    single-message protocols for estimating the file
    distance k
  • Another property of the protocol is that, by
    broadcasting a single message, the current
    version can be communicated to several clients
    that hold different outdated versions

22
Outline
  • Introduction and Common Applications
  • Problem Formalization
  • Contributions
  • An Approach Based on Erasure Codes
  • A Simple Multi-Round Protocol
  • An Efficient Single-Round Protocol
  • A Practical Protocol Based on Erasure Codes
  • Implementation Overview
  • Preliminary Results
  • Conclusions

23
A Practical Protocol Based on Erasure Codes
  • The previous protocol has two main shortcomings
  • It requires us to estimate an upper bound on the
    file distance k; an underestimate would make
    recovery at the client impossible
  • More importantly, the algorithm does not support
    compression of unmatched literals
  • To address these problems we design another
    erasure-based algorithm that works better in
    practice

24
A Practical Protocol Based on Erasure Codes
  • The hashes are sent from client to server
  • For level i, m_i erasure hashes are sent
  • The server identifies the common blocks and then
    sends unmatched literals in compressed form
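A single level of this client-to-server direction might be sketched as follows. The function names and the 8-byte truncated MD5 are my assumptions, and zlib stands in for the paper's optimized delta compressor:

```python
import hashlib
import zlib

def client_hashes(fold, b):
    """Client step: hash the aligned b-byte blocks of the outdated file."""
    return {hashlib.md5(fold[i:i + b]).digest()[:8]: i // b
            for i in range(0, len(fold) - b + 1, b)}

def server_delta(fnew, hashes, b):
    """Server step: slide over fnew; emit a match token where a window
    hashes to a block the client already holds, otherwise collect the
    literal byte, and compress all literals together at the end."""
    ops, literals = [], bytearray()
    i = 0
    while i < len(fnew):
        window = fnew[i:i + b]
        key = hashlib.md5(window).digest()[:8]
        if len(window) == b and key in hashes:
            ops.append(("match", hashes[key]))  # client copies its own block
            i += b
        else:
            ops.append(("lit",))                # next byte comes from literals
            literals.append(fnew[i])
            i += 1
    return ops, zlib.compress(bytes(literals))

# Two inserted bytes in front of an unchanged file: four match tokens
# plus a compressed two-byte literal stream.
fold = b"ABCDEFGH" * 4
ops, comp = server_delta(b"XX" + fold, client_hashes(fold, 8), 8)
```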

25
Implementation Overview
  • We included three additional optimizations over
    rsync
  • We replace the gzip algorithm used for transmitting
    the unmatched literals and match tokens with an
    optimized delta compressor
  • The server now transmits the resulting delta and a
    bit vector that allows the client to create the
    same reference file

26
Implementation Overview
  • We make a better choice of the number of bits per
    hash
  • Assuming an upper bound of 1/2^d on the
    probability of a collision, for some d, we use
    lg(n) + lg(y) + d bits per hash
  • n is the file size
  • y is the total number of hashes sent from client
    to server
  • We integrate decomposable hashes
  • This technique allows the hash of a child block
    to be computed from the hashes of its parent and
    sibling, halving the number of erasure hashes
    transmitted
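The decomposable-hash idea can be illustrated with a polynomial hash (the hash family and constants below are my own choice, not necessarily the paper's construction):

```python
P = (1 << 61) - 1   # prime modulus (assumption)
R = 1_000_003       # hash base (assumption)

def poly_hash(data):
    """h(s) = sum over i of s[i] * R^i mod P. For a parent block split
    into left||right with |left| = m, this satisfies
    h(parent) = h(left) + R^m * h(right) mod P, so either child's hash
    can be computed from the parent's and the sibling's hashes."""
    h = 0
    for byte in reversed(data):
        h = (h * R + byte) % P
    return h

def right_child_hash(parent_h, left_h, m):
    """Hash of the right child from the parent's and left sibling's hashes."""
    return (parent_h - left_h) * pow(pow(R, m, P), -1, P) % P

left, right = b"hello wo", b"rld hash"
parent_h = poly_hash(left + right)
```

Because either child follows from its parent and sibling, only one child hash per pair needs to be protected by erasure hashes, which is what halves the number transmitted.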

27
Preliminary Results
  • For the experiments we used the gcc and emacs
    datasets, consisting of versions 2.7.0 and 2.7.1
    of gcc and versions 19.28 and 19.29 of emacs

28
Conclusions
  • We have described a new approach to remote file
    synchronization based on erasure codes
  • Using this approach, we derived a single-round
    protocol that is feasible and communication-efficient
    with respect to a common file distance measure