1
Similarity based deduplication
  • By Lior Aronovich, Ron Asher, Eitan Bachmat,
    Haim Bitner, Michael Hirsch, Tomi Klein

2
Deduplication
  • There is a lot of redundancy in stored data,
    especially in backup data.
  • Deduplication aims to store only the differences
    between different versions.

3
Different types of deduplication
  • Inline or offline.
  • Hash comparison or byte by byte.
  • Similarity based or identity based.

4
Our initial design requirements
  • Support for a petabyte of physical storage.
  • A deduplication rate of at least 350MB/sec.
  • Inline
  • Byte to byte (B2B) comparison

5
Standard approach
  • Break up the incoming data stream into segments a
    few KB in size. The segment boundaries are
    computed from patterns in a rolling hash.
  • Identify each segment using a long hash.
  • Check whether the hash belongs to a previously
    stored segment.
  • If so, place a pointer to that segment instead of
    storing the data again (see the sketch below).
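
A minimal sketch of this standard identity-based scheme, in Python. The window size, boundary mask, and minimum segment size are illustrative assumptions, not parameters of any particular product; they only make the segments come out a few KB in size on average.

    import hashlib

    WINDOW = 48              # rolling-hash window in bytes (assumed value)
    BOUNDARY_MASK = 0x1FFF   # a boundary roughly every 8 KB (assumed value)
    MIN_SEGMENT = 2048       # avoid degenerate tiny segments (assumed value)
    BASE, MOD = 257, (1 << 61) - 1

    def segment_stream(data):
        """Split data at content-defined boundaries given by a rolling hash."""
        h, start = 0, 0
        power = pow(BASE, WINDOW, MOD)
        for i, byte in enumerate(data):
            h = (h * BASE + byte) % MOD
            if i >= WINDOW:
                h = (h - data[i - WINDOW] * power) % MOD   # drop the oldest byte
            if i - start + 1 >= MIN_SEGMENT and (h & BOUNDARY_MASK) == BOUNDARY_MASK:
                yield data[start:i + 1]
                start = i + 1
        if start < len(data):
            yield data[start:]

    def dedup(data, index, store):
        """Identity-based dedup: store a segment only if its long hash is new."""
        recipe = []
        for segment in segment_stream(data):
            digest = hashlib.sha256(segment).digest()      # the "long hash"
            if digest not in index:                        # seen in a previous segment?
                index[digest] = len(store)
                store.append(segment)
            recipe.append(index[digest])                   # pointer to the stored segment
        return recipe

With index = {} and store = [], repeated calls to dedup keep one copy of each distinct segment and represent every incoming stream as a list of segment pointers.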

6
Standard approach
  • Can be fast and can be inline, however
  • It doesn't scale to a petabyte of physical storage
    (because of the KB-sized segments).
  • No B2B comparison.

7
Our approach
  • Break up the incoming stream into chunks a few MB
    in size.
  • Compute a similarity (not identity!) signature so
    that chunks which are alike (even only 50% alike)
    will have signatures which are alike (perhaps only
    25% alike).
  • Do a B2B comparison between the incoming chunk
    and similar repository segments.
  • Store the differences based on the B2B (the
    overall flow is sketched below).
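
A sketch of how the overall flow might fit together. The chunk size and the repository interface (find_similar, store_chunk, store_delta, and the match object) are assumptions made for illustration, not the actual system's design; similarity_signature and byte_diff stand for the stages sketched after slides 8 and 11.

    CHUNK_SIZE = 8 * 1024 * 1024       # "a few MB" per chunk (assumed value)

    def ingest(stream, repository, similarity_signature, byte_diff):
        """Similarity-based flow: signature -> similarity lookup -> B2B diff."""
        for chunk in iter(lambda: stream.read(CHUNK_SIZE), b""):
            signature = similarity_signature(chunk)        # see slide 8
            match = repository.find_similar(signature)     # small in-memory index
            if match is None:
                # nothing similar in the repository: store the whole chunk
                repository.store_chunk(chunk, signature)
            else:
                # byte-to-byte comparison against the similar repository data,
                # then store only the differences (see slide 11)
                delta = byte_diff(chunk, repository.read(match.location), match.anchors)
                repository.store_delta(match.location, delta, signature)

Because the index holds only one small signature per multi-MB chunk, it can stay in memory even for a petabyte-scale repository.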

8
Similarity signatures
  • Compute a rolling hash (Karp-Rabin) over all
    blocks of the chunk.
  • Three possibilities:
  • (Breen et al.) Take k random block hashes.
  • (Broder, Heintze, Manber) Take the k largest
    hashes.
  • (Our choice) Take the k hashes of blocks which
    are close to those that produced the k largest
    hashes (sketched below).
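
A sketch of the signature computation in Python. The block size, k, and the exact notion of "close to" a maximal block (here simply the next block) are assumptions for illustration; the intent, as on the slide, is that chunks which are alike produce signatures which are alike.

    import heapq

    BLOCK = 512              # block size for the rolling hash (assumed value)
    K = 4                    # number of signature elements (assumed value)
    BASE, MOD = 257, (1 << 61) - 1

    def block_hashes(chunk):
        """Karp-Rabin rolling hash of every BLOCK-sized window of the chunk."""
        power = pow(BASE, BLOCK, MOD)
        h, hashes = 0, []
        for i, byte in enumerate(chunk):
            h = (h * BASE + byte) % MOD
            if i >= BLOCK:
                h = (h - chunk[i - BLOCK] * power) % MOD   # drop the oldest byte
            if i >= BLOCK - 1:
                hashes.append(h)                           # hash of chunk[i-BLOCK+1 : i+1]
        return hashes

    def similarity_signature(chunk):
        """Third method: keep the hashes of blocks close to the K maximal ones."""
        hashes = block_hashes(chunk)
        # positions whose blocks produced the K largest hash values
        top_positions = heapq.nlargest(K, range(len(hashes)), key=hashes.__getitem__)
        # use a neighbouring block's hash as the signature element; taking the
        # next block is an assumption standing in for "close to" the maximum
        return tuple(hashes[min(p + 1, len(hashes) - 1)] for p in sorted(top_positions))

The first two methods differ only in the selection step: method one samples k positions at random, method two keeps the k maximal hashes themselves.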


9
Criteria for comparison
  • Similarity checking speed
  • Successful identification of similarity
    percentage
  • Low probability of false positives
  • Likelihood of finding the most similar match

10
Comparison of methods
  • The first method (random block hashes) is slow
    and has many false positives; its likelihood of
    finding the best similarity is lower than that of
    the other methods.
  • The second method (k maximal hashes) is faster,
    but still has false positives.
  • The third method addresses all of these issues.

11
B2B phase
  • Once similarity is detected, we know where in the
    repository the similar data is located and we
    have a few anchoring matches.
  • The B2B comparison itself is completely decoupled
    from the similarity search! We have the anchors
    and computed hashes to support the B2B (a sketch
    follows).
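
One way the anchored byte-to-byte comparison could work is sketched below, assuming each anchor is a pair of matching offsets (one in the incoming chunk, one in the similar repository data); the delta format returned here is invented for illustration only.

    def extend_match(new_chunk, old_chunk, new_pos, old_pos):
        """Grow a matching region outward from one anchor, byte by byte."""
        back = 0
        while (new_pos - back > 0 and old_pos - back > 0
               and new_chunk[new_pos - back - 1] == old_chunk[old_pos - back - 1]):
            back += 1
        fwd = 0
        while (new_pos + fwd < len(new_chunk) and old_pos + fwd < len(old_chunk)
               and new_chunk[new_pos + fwd] == old_chunk[old_pos + fwd]):
            fwd += 1
        return new_pos - back, old_pos - back, back + fwd   # (new start, old start, length)

    def byte_diff(new_chunk, old_chunk, anchors):
        """Verify matches around each anchor; keep only the unmatched bytes."""
        matches = [extend_match(new_chunk, old_chunk, n, o) for n, o in anchors]
        covered = sorted((start, start + length) for start, _, length in matches)
        delta, cursor = [], 0
        for start, end in covered:
            if start > cursor:
                delta.append((cursor, new_chunk[cursor:start]))  # literal bytes to store
            cursor = max(cursor, end)
        if cursor < len(new_chunk):
            delta.append((cursor, new_chunk[cursor:]))
        return delta

Everything stored outside the delta is represented by references into the repository, so only genuinely new bytes consume physical space.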

12
Implementation
  • The TS7650 from IBM, formerly from Diligent
  • Has been available since 2005.
  • Many clients managing many petabytes
  • Very large installations

13
Did we achieve our goals?
  • Up to 850MB/sec on a single system node
  • Up to 1PB of physically usable storage with only
    12 GB of memory
  • Inline
  • B2B comparison

14
Some of the competition
  • Data Domain 690 series.
  • An HP system, the D2D4000 from 2008, described in
    an academic paper at FAST 2009. It is the only
    other similarity-based product, but it uses hash
    comparison rather than B2B, with a variant of the
    second method.

15
How do we compare?
16
We are better!!
17
In more detail
  • While our solution supports 1PB of physical
    storage, Data Domain supports at most 50TB and the
    HP product at most 10TB; we have actual
    installations far bigger than either of these
    numbers.
  • Our solution is faster: somewhat faster than Data
    Domain, much faster than HP.
  • We still find time to do a B2B comparison; they
    don't.
  • Our solution has a faster reconstruction rate;
    remember, that's what matters in a data outage
    situation!

18
Customer site
19
40:1 dedupe ratio
20
Daily back-up fluctuations
21
Throughput
22
Thanks!!