1
Similarity based deduplication
  • By Lior Aronovich, Ron Asher, Eitan Bachmat,
    Haim Bitner, Michael Hirsch, Tomi Klein

2
Deduplication
  • There is a lot of redundancy in stored data,
    especially in backup data.
  • Deduplication aims to store only the differences
    between different versions.

3
Different types of deduplication
  • Inline or offline.
  • Hash comparison or byte by byte.
  • Similarity based or identity based.

4
Our initial design requirements
  • Support for a petabyte of physical storage.
  • A deduplication rate of at least 350MB/sec.
  • Inline
  • Byte to byte (B2B) comparison

5
Standard approach
  • Break up the incoming data stream into segments a
    few KB in size. The segment boundaries are
    computed from patterns in a rolling hash.
  • Identify each segment using a long hash.
  • Check whether the hash belongs to a previously
    stored segment.
  • If so, place a pointer to that segment instead of
    storing the data again (see the sketch below).
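
A minimal sketch of this standard identity-based scheme, in Python. The window size, boundary mask, and minimum segment size are illustrative assumptions, not parameters of any particular product; they only make the segments come out a few KB in size on average.

    import hashlib

    WINDOW = 48              # rolling-hash window in bytes (assumed value)
    BOUNDARY_MASK = 0x1FFF   # a boundary roughly every 8 KB (assumed value)
    MIN_SEGMENT = 2048       # avoid degenerate tiny segments (assumed value)
    BASE, MOD = 257, (1 << 61) - 1

    def segment_stream(data):
        """Split data at content-defined boundaries given by a rolling hash."""
        h, start = 0, 0
        power = pow(BASE, WINDOW, MOD)
        for i, byte in enumerate(data):
            h = (h * BASE + byte) % MOD
            if i >= WINDOW:
                h = (h - data[i - WINDOW] * power) % MOD   # drop the oldest byte
            if i - start + 1 >= MIN_SEGMENT and (h & BOUNDARY_MASK) == BOUNDARY_MASK:
                yield data[start:i + 1]
                start = i + 1
        if start < len(data):
            yield data[start:]

    def dedup(data, index, store):
        """Identity-based dedup: store a segment only if its long hash is new."""
        recipe = []
        for segment in segment_stream(data):
            digest = hashlib.sha256(segment).digest()      # the "long hash"
            if digest not in index:                        # seen in a previous segment?
                index[digest] = len(store)
                store.append(segment)
            recipe.append(index[digest])                   # pointer to the stored segment
        return recipe

With index = {} and store = [], repeated calls to dedup keep one copy of each distinct segment and represent every incoming stream as a list of segment pointers.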

6
Standard approach
  • Can be fast and can be inline, however
  • It doesn't scale to a petabyte of physical storage
    (because of the KB-sized segments).
  • No B2B comparison.

7
Our approach
  • Break up the incoming stream into chunks a few MB
    in size.
  • Compute a similarity (not identity!) signature so
    that chunks which are alike (even only 50% alike)
    will have signatures which are alike (perhaps only
    25% alike).
  • Do a B2B comparison between the incoming chunk
    and similar repository segments.
  • Store the differences based on the B2B (the
    overall flow is sketched below).
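
A sketch of how the overall flow might fit together. The chunk size and the repository interface (find_similar, store_chunk, store_delta, and the match object) are assumptions made for illustration, not the actual system's design; similarity_signature and byte_diff stand for the stages sketched after slides 8 and 11.

    CHUNK_SIZE = 8 * 1024 * 1024       # "a few MB" per chunk (assumed value)

    def ingest(stream, repository, similarity_signature, byte_diff):
        """Similarity-based flow: signature -> similarity lookup -> B2B diff."""
        for chunk in iter(lambda: stream.read(CHUNK_SIZE), b""):
            signature = similarity_signature(chunk)        # see slide 8
            match = repository.find_similar(signature)     # small in-memory index
            if match is None:
                # nothing similar in the repository: store the whole chunk
                repository.store_chunk(chunk, signature)
            else:
                # byte-to-byte comparison against the similar repository data,
                # then store only the differences (see slide 11)
                delta = byte_diff(chunk, repository.read(match.location), match.anchors)
                repository.store_delta(match.location, delta, signature)

Because the index holds only one small signature per multi-MB chunk, it can stay in memory even for a petabyte-scale repository.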

8
Similarity signatures
  • Compute a rolling hash (Karp-Rabin) over all
    blocks of the chunk.
  • Three possibilities:
  • (Breen et al.) Take k random block hashes.
  • (Broder, Heintze, Manber) Take the k largest
    hashes.
  • (Our choice) Take the k hashes of blocks which
    are close to those that produced the k largest
    hashes (sketched below).
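
A sketch of the signature computation in Python. The block size, k, and the exact notion of "close to" a maximal block (here simply the next block) are assumptions for illustration; the intent, as on the slide, is that chunks which are alike produce signatures which are alike.

    import heapq

    BLOCK = 512              # block size for the rolling hash (assumed value)
    K = 4                    # number of signature elements (assumed value)
    BASE, MOD = 257, (1 << 61) - 1

    def block_hashes(chunk):
        """Karp-Rabin rolling hash of every BLOCK-sized window of the chunk."""
        power = pow(BASE, BLOCK, MOD)
        h, hashes = 0, []
        for i, byte in enumerate(chunk):
            h = (h * BASE + byte) % MOD
            if i >= BLOCK:
                h = (h - chunk[i - BLOCK] * power) % MOD   # drop the oldest byte
            if i >= BLOCK - 1:
                hashes.append(h)                           # hash of chunk[i-BLOCK+1 : i+1]
        return hashes

    def similarity_signature(chunk):
        """Third method: keep the hashes of blocks close to the K maximal ones."""
        hashes = block_hashes(chunk)
        # positions whose blocks produced the K largest hash values
        top_positions = heapq.nlargest(K, range(len(hashes)), key=hashes.__getitem__)
        # use a neighbouring block's hash as the signature element; taking the
        # next block is an assumption standing in for "close to" the maximum
        return tuple(hashes[min(p + 1, len(hashes) - 1)] for p in sorted(top_positions))

The first two methods differ only in the selection step: method one samples k positions at random, method two keeps the k maximal hashes themselves.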


9
Criteria for comparison
  • Similarity checking speed
  • Successful identification of similarity
    percentage
  • Low probability of false positives
  • Likelihood of finding the most similar match

10
Comparison of methods
  • The first method (random block hashes) is slow
    and has many false positives; its likelihood of
    finding the best similarity is lower than that of
    the other methods.
  • The second method (k maximal hashes) is faster,
    but still has false positives.
  • The third method addresses all of these issues.

11
B2B phase
  • Once similarity is detected, we know where in the
    repository the similar data is located and we
    have a few anchoring matches.
  • The B2B comparison itself is completely decoupled
    from the similarity search! We have the anchors
    and computed hashes to support the B2B (a sketch
    follows).
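
One way the anchored byte-to-byte comparison could work is sketched below, assuming each anchor is a pair of matching offsets (one in the incoming chunk, one in the similar repository data); the delta format returned here is invented for illustration only.

    def extend_match(new_chunk, old_chunk, new_pos, old_pos):
        """Grow a matching region outward from one anchor, byte by byte."""
        back = 0
        while (new_pos - back > 0 and old_pos - back > 0
               and new_chunk[new_pos - back - 1] == old_chunk[old_pos - back - 1]):
            back += 1
        fwd = 0
        while (new_pos + fwd < len(new_chunk) and old_pos + fwd < len(old_chunk)
               and new_chunk[new_pos + fwd] == old_chunk[old_pos + fwd]):
            fwd += 1
        return new_pos - back, old_pos - back, back + fwd   # (new start, old start, length)

    def byte_diff(new_chunk, old_chunk, anchors):
        """Verify matches around each anchor; keep only the unmatched bytes."""
        matches = [extend_match(new_chunk, old_chunk, n, o) for n, o in anchors]
        covered = sorted((start, start + length) for start, _, length in matches)
        delta, cursor = [], 0
        for start, end in covered:
            if start > cursor:
                delta.append((cursor, new_chunk[cursor:start]))  # literal bytes to store
            cursor = max(cursor, end)
        if cursor < len(new_chunk):
            delta.append((cursor, new_chunk[cursor:]))
        return delta

Everything stored outside the delta is represented by references into the repository, so only genuinely new bytes consume physical space.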

12
Implementation
  • The TS7650 from IBM, formerly from Diligent
  • Has been available since 2005.
  • Many clients managing many petabytes
  • Very large installations

13
Did we achieve our goals?
  • Up to 850MB/sec on a single system node
  • Up to 1PB of physically usable storage with only
    12 GB of memory
  • Inline
  • B2B comparison

14
Some of the competition
  • Data Domain 690 series.
  • An HP system, the D2D4000 from 2008, described in
    an academic paper at FAST 2009. It is the only
    other similarity-based product, but it uses hash
    comparison rather than B2B, with a variant of the
    second method.

15
How do we compare?
16
We are better!!
17
In more detail
  • While our solution supports 1PB of physical
    storage, Data Domain supports at most 50TB and the
    HP product at most 10TB; we have actual
    installations far bigger than either of these
    numbers.
  • Our solution is faster: somewhat faster than Data
    Domain, much faster than HP.
  • We still find time to do a B2B comparison; they
    don't.
  • Our solution has a faster reconstruction rate;
    remember, that's what matters in a data outage
    situation!

18
Customer site
19
40:1 dedupe ratio
20
Daily back-up fluctuations
21
Throughput
22
Thanks!!