Title: Providing High Reliability in a Minimum Redundancy Archival Storage System
Providing High Reliability in a Minimum Redundancy Archival Storage System
- Deepavali Bhagwat
- Kristal Pollack
- Darrell D. E. Long
- Ethan L. Miller
- Storage Systems Research Center
- University of California, Santa Cruz
- Thomas Schwarz, Computer Engineering Department, Santa Clara University
- Jehan-François Pâris, Department of Computer Science, University of Houston, Texas
Introduction
- Archival data will increase ten-fold from 2007 to 2010 (J. McKnight, T. Asaro, and B. Babineau, "Digital Archiving: End-User Survey and Market Forecast 2006-2010," The Enterprise Strategy Group, Jan. 2006)
- Data compression techniques are used to reduce storage costs
- Deep Store: an archival storage system
  - Uses interfile and intrafile compression techniques
  - Uses chunking
- Compression hurts reliability
  - Loss of a shared chunk → disproportionate data loss
- Our solution: reinvest the saved storage space to improve reliability
  - Selective replication of chunks
- Our results
  - Better reliability than mirrored Lempel-Ziv compressed files, using only about half of the storage space
Deep Store: An Overview
- Whole File Hashing
- Content Addressable Storage
- Delta Compression
- Chunk-based Compression
  - Files are broken down into variable-length chunks using a sliding-window technique (sketched below)
  - A chunk identifier/digest is used to look for identical chunks
  - Only unique chunks are stored
[Figure: content-defined chunking. A fixed-size window w slides over the file; the window fingerprint determines each chunk end/start, yielding variable-size chunks, each named by a chunk ID (content address).]
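A minimal sketch of this chunking step, assuming a simple polynomial rolling hash in place of the actual fingerprinting function; the window size, boundary mask, and SHA-1 digest are illustrative choices, not values from Deep Store.

```python
import hashlib

# Illustrative parameters; these are assumptions, not Deep Store's values.
WINDOW = 48           # bytes in the fixed-size sliding window
MASK = (1 << 13) - 1  # boundary where fingerprint matches mask (~8 KiB avg chunks)
BASE = 257
MOD = (1 << 61) - 1   # large prime modulus for the rolling hash

def chunk_boundaries(data: bytes):
    """Yield (start, end) offsets of variable-length chunks.

    A polynomial rolling hash over a fixed-size window stands in for the
    window fingerprint on the slide: wherever the fingerprint matches the
    mask, one chunk ends and the next begins.
    """
    start = 0
    fp = 0
    pow_w = pow(BASE, WINDOW - 1, MOD)  # precomputed so each slide is O(1)
    for i in range(len(data)):
        if i >= WINDOW:                 # drop the byte leaving the window
            fp = (fp - data[i - WINDOW] * pow_w) % MOD
        fp = (fp * BASE + data[i]) % MOD
        if i + 1 - start >= WINDOW and (fp & MASK) == MASK:
            yield start, i + 1
            start = i + 1
    if start < len(data):
        yield start, len(data)

def chunk_id(chunk: bytes) -> str:
    """Chunk identifier/digest used as the content address."""
    return hashlib.sha1(chunk).hexdigest()
```

In use, each chunk's digest is looked up in the store, and only chunks whose content address has not been seen before are written.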
Effects of Compression on Reliability
- Chunk-based compression → interfile dependencies
- Loss of a shared chunk → a disproportionate amount of data loss
[Figure: many files sharing the same chunks; losing one shared chunk damages every file that references it.]
Effects of Compression on Reliability (contd.)
- A simple experiment shows the effects of interfile dependencies
  - 9.8 GB of data from several websites, The Internet Archive
  - Compressed using chunking to 1.83 GB (5.62 GB using gzip)
  - Chunks were mirrored and distributed evenly onto 179 devices of 20 MB each
Compression and Reliability
- Chunking
  - Minimizes redundancies, giving excellent compression ratios
  - Introduces interfile dependencies
- Interfile dependencies are detrimental to reliability
  - Hence, reintroduce redundancies
- Selective replication of chunks
  - Some chunks are more important than others. How important?
    - The amount of data depending on a chunk (byte count)
    - The number of files depending on a chunk (reference count)
- Selective replication strategy
  - Weight of a chunk (w) → number of replicas for the chunk (k)
  - We use a heuristic function to calculate k
Heuristic Function
- k: number of replicas
- w: weight of a chunk
- a: base level of replication, independent of w
- b: boosts the number of replicas for chunks with high weight
- kmax: maximum number of replicas
- Every chunk is mirrored
- As replicas increase, the reliability gained from each additional replica diminishes, which motivates the cap kmax
- k is rounded off to the nearest integer (one possible form is sketched below)
Distribution of Chunks
- An archival system receives files in batches
- Files are stored onto a disk until the disk is full
- For every file
  - Chunks are extracted and compressed
  - Unique chunks are stored
  - A non-unique chunk is stored only if
    - The present disk does not contain the chunk, and
    - For this chunk, k < kmax
- At the end of the batch
  - All chunks are revisited and replicas are made for the appropriate chunks
- A chunk is not proactively replicated
  - Wait for a chunk's replica to arrive as a chunk of a future file
  - This reduces inter-device dependencies for a file (see the sketch below)
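A sketch of the per-batch placement policy just described, under assumed data structures: each file arrives as a list of (chunk_id, bytes) pairs, and replica_locations maps a chunk ID to the set of disks already holding it. The names are illustrative, not Deep Store's.

```python
from collections import defaultdict

def place_batch(batch, disk_id, replica_locations, k_max):
    """Store one batch of files onto the current disk.

    batch: list of files, each a list of (chunk_id, chunk_bytes) pairs.
    replica_locations: defaultdict(set) mapping chunk_id -> disk ids.
    Returns the chunks actually written to this disk.
    """
    written = []
    for file_chunks in batch:
        for cid, data in file_chunks:
            locations = replica_locations[cid]
            if not locations:
                written.append((cid, data))  # unique chunk: always stored
                locations.add(disk_id)
            elif disk_id not in locations and len(locations) < k_max:
                # A duplicate arriving in a later file serves as a free
                # replica: nothing is copied proactively, and keeping the
                # chunk on this disk reduces the number of devices this
                # file depends on.
                written.append((cid, data))
                locations.add(disk_id)
    return written
```

At the end of the batch, a second pass would compare each chunk's replica count against the k demanded by the heuristic function and copy only the shortfall.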
Experimental Setup
- We measure robustness
  - The fraction of the data available given a certain percentage of unavailable storage devices (estimated in the sketch after this list)
- We use replication to introduce redundancies
  - Future work will investigate erasure codes
- Data sets
  - HTML, PDF, and image files from The Internet Archive (9.8 GB)
  - HTML, image (JPG and TIFF), PDF, and Microsoft Word files from The Santa Cruz Sentinel (40.22 GB)
- We compare the robustness and storage space utilization of archives that use
  - Chunking with selective redundancies, and
  - Lempel-Ziv compression with mirroring
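A Monte Carlo reading of the robustness metric, assuming availability is measured in bytes; the data layout and trial count are illustrative.

```python
import random

def robustness(files, chunk_locations, num_devices, frac_failed, trials=100):
    """Fraction of stored bytes still available when frac_failed of the
    devices are unavailable, estimated over random failure patterns.

    files: list of (size_in_bytes, [chunk_ids]) pairs.
    chunk_locations: chunk_id -> set of device ids holding a replica.
    """
    total_bytes = sum(size for size, _ in files)
    available = 0.0
    failed_count = int(frac_failed * num_devices)
    for _ in range(trials):
        down = set(random.sample(range(num_devices), failed_count))
        for size, chunks in files:
            # A file is available only if every chunk it depends on
            # still has a replica on some surviving device.
            if all(chunk_locations[c] - down for c in chunks):
                available += size
    return available / (trials * total_bytes)
```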
Details of the Experimental Data
[Table of corpus statistics not reproduced.]
Weight of a Chunk
- When using dependent data (byte count) as the heuristic (computed in the sketch below):
  - w = D/d
  - D: sum of the sizes of all files depending on the chunk
  - d: average size of a chunk
- When using the number of files (reference count) as the heuristic:
  - w = F
  - F: number of files depending on the chunk
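Both weights fall out of a single pass over a file-to-chunk index; the sketch below assumes files are given as (size, chunk-id list) pairs.

```python
from collections import defaultdict

def chunk_weights(files, avg_chunk_size):
    """Compute both weight heuristics from the slide.

    files: list of (file_size_bytes, [chunk_ids]) pairs.
    Returns (w_bytes, w_refs): w = D/d and w = F per chunk.
    """
    D = defaultdict(int)  # bytes of all files depending on each chunk
    F = defaultdict(int)  # number of files depending on each chunk
    for size, chunks in files:
        for cid in set(chunks):  # a file counts once per chunk it uses
            D[cid] += size
            F[cid] += 1
    w_bytes = {cid: D[cid] / avg_chunk_size for cid in D}  # w = D/d
    w_refs = dict(F)                                       # w = F
    return w_bytes, w_refs
```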
Robustness: Effect of Varying a (w = F, b = 1, kmax = 4), The Internet Archive
[Plot not reproduced.]
Robustness: Effect of Varying a (w = D/d, b = 0.4, kmax = 4), The Internet Archive
[Plot not reproduced.]
Robustness: Effect of Limiting k (w = D/d, b = 0.55, a = 0), The Internet Archive
[Plot not reproduced.]
Robustness: Effect of Varying b (w = D/d, a = 0, kmax = 4), The Internet Archive
[Plot not reproduced.]
Robustness: Effect of Varying b (w = D/d, a = 0, kmax = 4), The Sentinel
[Plot not reproduced.]
Choice of a Heuristic
- The choice of heuristic depends on the corpus
  - If file size is indicative of file importance, choose w = D/d
  - If a file's importance is independent of its size, choose w = F
- Use the same metric to measure robustness
Future Work
- Study the reliability of Deep Store
  - With a recovery model in place
  - When using delta compression
- Use different redundancy mechanisms, such as erasure codes
- Data placement in conjunction with hardware statistics
Related Work
- Many archival systems use Content Addressable Storage
  - EMC's Centera
  - Variable-length chunks: LBFS
  - Fixed-size chunks: Venti
- OceanStore aims to provide continuous access to persistent data
  - Uses automatic replication for high reliability
  - Uses erasure codes for high availability
- FARSITE: a distributed file system
  - Replicates metadata; replication was chosen to avoid the overhead of data reconstruction that erasure codes incur
- PASIS and Glacier use aggressive replication as protection against data loss
- LOCKSS provides long-term access to digital data
  - Uses a peer-to-peer audit-and-repair protocol to preserve the integrity of, and long-term access to, document collections
Conclusion
- Chunking gives excellent compression ratios but introduces interfile dependencies that adversely affect system reliability.
- Selective replication of chunks using heuristics gives
  - better robustness than mirrored LZ-compressed files
  - significantly higher storage space efficiency: only about half of the space used by mirrored LZ-compressed files
- We use simple replication; our results will only improve with other forms of redundancy.