Title: Providing High Reliability in a Minimum Redundancy Archival Storage System
Providing High Reliability in a Minimum Redundancy Archival Storage System
- Deepavali Bhagwat
- Kristal Pollack
- Darrell D. E. Long
- Ethan L. Miller
- Storage Systems Research Center
- University of California, Santa Cruz
- Thomas Schwarz, Computer Engineering Department, Santa Clara University
- Jehan-François Pâris, Department of Computer Science, University of Houston, Texas
Introduction
- Archival data will increase ten-fold from 2007 to 2010 (J. McKnight, T. Asaro, and B. Babineau, "Digital Archiving: End-User Survey and Market Forecast 2006-2010," The Enterprise Strategy Group, Jan. 2006)
- Data compression techniques are used to reduce storage costs
- Deep Store: an archival storage system
  - Uses interfile and intrafile compression techniques
  - Uses chunking
- Compression hurts reliability
  - Loss of a shared chunk → disproportionate data loss
- Our solution: reinvest the saved storage space to improve reliability
  - Selective replication of chunks
- Our results
  - Better reliability than mirrored Lempel-Ziv compressed files, using only about half of the storage space
Deep Store: An Overview
- Whole File Hashing
- Content Addressable Storage
- Delta Compression
- Chunk-based Compression
  - Files are broken down into variable-length chunks using a sliding-window technique (sketched below)
  - A chunk identifier/digest is used to look for identical chunks
  - Only unique chunks are stored
[Figure: content-defined chunking. A fixed-size window w slides over the file; the window fingerprint determines each chunk end/start, yielding variable-size chunks, each named by a chunk ID (content address).]
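A minimal sketch of this chunking step, assuming a simple polynomial rolling hash in place of the actual fingerprinting function; the window size, boundary mask, and SHA-1 digest are illustrative choices, not values from Deep Store.

```python
import hashlib

# Illustrative parameters; these are assumptions, not Deep Store's values.
WINDOW = 48           # bytes in the fixed-size sliding window
MASK = (1 << 13) - 1  # boundary where fingerprint matches mask (~8 KiB avg chunks)
BASE = 257
MOD = (1 << 61) - 1   # large prime modulus for the rolling hash

def chunk_boundaries(data: bytes):
    """Yield (start, end) offsets of variable-length chunks.

    A polynomial rolling hash over a fixed-size window stands in for the
    window fingerprint on the slide: wherever the fingerprint matches the
    mask, one chunk ends and the next begins.
    """
    start = 0
    fp = 0
    pow_w = pow(BASE, WINDOW - 1, MOD)  # precomputed so each slide is O(1)
    for i in range(len(data)):
        if i >= WINDOW:                 # drop the byte leaving the window
            fp = (fp - data[i - WINDOW] * pow_w) % MOD
        fp = (fp * BASE + data[i]) % MOD
        if i + 1 - start >= WINDOW and (fp & MASK) == MASK:
            yield start, i + 1
            start = i + 1
    if start < len(data):
        yield start, len(data)

def chunk_id(chunk: bytes) -> str:
    """Chunk identifier/digest used as the content address."""
    return hashlib.sha1(chunk).hexdigest()
```

In use, each chunk's digest is looked up in the store, and only chunks whose content address has not been seen before are written.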
Effects of Compression on Reliability
- Chunk-based compression → interfile dependencies
- Loss of a shared chunk → a disproportionate amount of data loss
[Figure: many files sharing the same chunks; losing one shared chunk damages every file that references it.]
Effects of Compression on Reliability (contd.)
- A simple experiment shows the effects of interfile dependencies
  - 9.8 GB of data from several websites, The Internet Archive
  - Compressed using chunking to 1.83 GB (5.62 GB using gzip)
  - Chunks were mirrored and distributed evenly onto 179 devices of 20 MB each
Compression and Reliability
- Chunking
  - Minimizes redundancies, giving excellent compression ratios
  - Introduces interfile dependencies
- Interfile dependencies are detrimental to reliability
  - Hence, reintroduce redundancies
- Selective replication of chunks
  - Some chunks are more important than others. How important?
    - The amount of data depending on a chunk (byte count)
    - The number of files depending on a chunk (reference count)
- Selective replication strategy
  - Weight of a chunk (w) → number of replicas for the chunk (k)
  - We use a heuristic function to calculate k
Heuristic Function
- k: number of replicas
- w: weight of a chunk
- a: base level of replication, independent of w
- b: boosts the number of replicas for chunks with high weight
- kmax: maximum number of replicas
- Every chunk is mirrored
- As replicas increase, the reliability gained from each additional replica diminishes, which motivates the cap kmax
- k is rounded off to the nearest integer (one possible form is sketched below)
Distribution of Chunks
- An archival system receives files in batches
- Files are stored onto a disk until the disk is full
- For every file
  - Chunks are extracted and compressed
  - Unique chunks are stored
  - A non-unique chunk is stored only if
    - The present disk does not contain the chunk, and
    - For this chunk, k < kmax
- At the end of the batch
  - All chunks are revisited and replicas are made for the appropriate chunks
- A chunk is not proactively replicated
  - Wait for a chunk's replica to arrive as a chunk of a future file
  - This reduces inter-device dependencies for a file (see the sketch below)
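A sketch of the per-batch placement policy just described, under assumed data structures: each file arrives as a list of (chunk_id, bytes) pairs, and replica_locations maps a chunk ID to the set of disks already holding it. The names are illustrative, not Deep Store's.

```python
from collections import defaultdict

def place_batch(batch, disk_id, replica_locations, k_max):
    """Store one batch of files onto the current disk.

    batch: list of files, each a list of (chunk_id, chunk_bytes) pairs.
    replica_locations: defaultdict(set) mapping chunk_id -> disk ids.
    Returns the chunks actually written to this disk.
    """
    written = []
    for file_chunks in batch:
        for cid, data in file_chunks:
            locations = replica_locations[cid]
            if not locations:
                written.append((cid, data))  # unique chunk: always stored
                locations.add(disk_id)
            elif disk_id not in locations and len(locations) < k_max:
                # A duplicate arriving in a later file serves as a free
                # replica: nothing is copied proactively, and keeping the
                # chunk on this disk reduces the number of devices this
                # file depends on.
                written.append((cid, data))
                locations.add(disk_id)
    return written
```

At the end of the batch, a second pass would compare each chunk's replica count against the k demanded by the heuristic function and copy only the shortfall.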
Experimental Setup
- We measure robustness
  - The fraction of the data available given a certain percentage of unavailable storage devices (estimated in the sketch after this list)
- We use replication to introduce redundancies
  - Future work will investigate erasure codes
- Data sets
  - HTML, PDF, and image files from The Internet Archive (9.8 GB)
  - HTML, image (JPG and TIFF), PDF, and Microsoft Word files from The Santa Cruz Sentinel (40.22 GB)
- We compare the robustness and storage space utilization of archives that use
  - Chunking with selective redundancies, and
  - Lempel-Ziv compression with mirroring
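A Monte Carlo reading of the robustness metric, assuming availability is measured in bytes; the data layout and trial count are illustrative.

```python
import random

def robustness(files, chunk_locations, num_devices, frac_failed, trials=100):
    """Fraction of stored bytes still available when frac_failed of the
    devices are unavailable, estimated over random failure patterns.

    files: list of (size_in_bytes, [chunk_ids]) pairs.
    chunk_locations: chunk_id -> set of device ids holding a replica.
    """
    total_bytes = sum(size for size, _ in files)
    available = 0.0
    failed_count = int(frac_failed * num_devices)
    for _ in range(trials):
        down = set(random.sample(range(num_devices), failed_count))
        for size, chunks in files:
            # A file is available only if every chunk it depends on
            # still has a replica on some surviving device.
            if all(chunk_locations[c] - down for c in chunks):
                available += size
    return available / (trials * total_bytes)
```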
Details of the Experimental Data
[Table of corpus statistics not reproduced.]
Weight of a Chunk
- When using dependent data (byte count) as the heuristic (computed in the sketch below):
  - w = D/d
  - D: sum of the sizes of all files depending on the chunk
  - d: average size of a chunk
- When using the number of files (reference count) as the heuristic:
  - w = F
  - F: number of files depending on the chunk
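Both weights fall out of a single pass over a file-to-chunk index; the sketch below assumes files are given as (size, chunk-id list) pairs.

```python
from collections import defaultdict

def chunk_weights(files, avg_chunk_size):
    """Compute both weight heuristics from the slide.

    files: list of (file_size_bytes, [chunk_ids]) pairs.
    Returns (w_bytes, w_refs): w = D/d and w = F per chunk.
    """
    D = defaultdict(int)  # bytes of all files depending on each chunk
    F = defaultdict(int)  # number of files depending on each chunk
    for size, chunks in files:
        for cid in set(chunks):  # a file counts once per chunk it uses
            D[cid] += size
            F[cid] += 1
    w_bytes = {cid: D[cid] / avg_chunk_size for cid in D}  # w = D/d
    w_refs = dict(F)                                       # w = F
    return w_bytes, w_refs
```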
Robustness: Effect of Varying a (w = F, b = 1, kmax = 4), The Internet Archive
[Plot not reproduced.]
Robustness: Effect of Varying a (w = D/d, b = 0.4, kmax = 4), The Internet Archive
[Plot not reproduced.]
Robustness: Effect of Limiting k (w = D/d, b = 0.55, a = 0), The Internet Archive
[Plot not reproduced.]
Robustness: Effect of Varying b (w = D/d, a = 0, kmax = 4), The Internet Archive
[Plot not reproduced.]
Robustness: Effect of Varying b (w = D/d, a = 0, kmax = 4), The Sentinel
[Plot not reproduced.]
Choice of a Heuristic
- The choice of heuristic depends on the corpus
  - If file size is indicative of file importance, choose w = D/d
  - If a file's importance is independent of its size, choose w = F
- Use the same metric to measure robustness
Future Work
- Study the reliability of Deep Store
  - With a recovery model in place
  - When using delta compression
- Use different redundancy mechanisms, such as erasure codes
- Data placement in conjunction with hardware statistics
Related Work
- Many archival systems use Content Addressable Storage
  - EMC's Centera
  - Variable-length chunks: LBFS
  - Fixed-size chunks: Venti
- OceanStore aims to provide continuous access to persistent data
  - Uses automatic replication for high reliability
  - Uses erasure codes for high availability
- FARSITE: a distributed file system
  - Replicates metadata; replication was chosen to avoid the overhead of data reconstruction that erasure codes incur
- PASIS and Glacier use aggressive replication as protection against data loss
- LOCKSS provides long-term access to digital data
  - Uses a peer-to-peer audit-and-repair protocol to preserve the integrity of, and long-term access to, document collections
Conclusion
- Chunking gives excellent compression ratios but introduces interfile dependencies that adversely affect system reliability.
- Selective replication of chunks using heuristics gives
  - better robustness than mirrored LZ-compressed files
  - significantly higher storage space efficiency: only about half of the space used by mirrored LZ-compressed files
- We use simple replication; our results will only improve with other forms of redundancy.