Providing High Reliability in a Minimum Redundancy Archival Storage System
1
Providing High Reliability in a Minimum
Redundancy Archival Storage System
  • Deepavali Bhagwat
  • Kristal Pollack
  • Darrell D. E. Long
  • Ethan L. Miller
  • Storage Systems Research Center
  • University of California, Santa Cruz

  • Thomas Schwarz, Computer Engineering Department,
    Santa Clara University
  • Jehan-François Pâris, Department of Computer
    Science, University of Houston, Texas
2
Introduction
  • Archival data will increase ten-fold from 2007 to
    2010
  • J. McKnight, T. Asaro, and B. Babineau, "Digital
    Archiving: End-User Survey and Market Forecast
    2006–2010," The Enterprise Strategy Group, Jan.
    2006.
  • Data compression techniques used to reduce
    storage costs
  • Deep Store - An archival storage system
  • Uses interfile and intrafile compression
    techniques
  • Uses chunking
  • Compression hurts reliability
  • Loss of a shared chunk → disproportionate data
    loss
  • Our solution: reinvest the saved storage space to
    improve reliability
  • Selective replication of chunks
  • Our results:
  • Better reliability than that of mirrored
    Lempel-Ziv compressed files, using only about half
    of the storage space

3
Deep Store: An Overview
  • Whole File Hashing
  • Content Addressable Storage
  • Delta Compression
  • Chunk-based Compression
  • File broken down into variable-length chunks
    using a sliding window technique
  • A chunk identifier/digest used to look for
    identical chunks
  • Only unique chunks stored

[Figure: a sliding window of fixed size w moves over the
file; the window fingerprint marks chunk ends/starts,
producing variable-size chunks, each addressed by a
chunk ID (content address)]
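A minimal sketch of content-defined chunking as described
above, assuming a simple polynomial rolling hash over the
fixed-size window and an SHA-1 digest as the chunk ID; the
WINDOW and DIVISOR values are illustrative, and Deep
Store's actual fingerprint function and parameters are not
given here.

    import hashlib

    WINDOW = 48      # fixed-size sliding window w (illustrative)
    DIVISOR = 4096   # sets the expected chunk size (illustrative)
    BASE, MOD = 257, (1 << 61) - 1

    def chunk(data: bytes):
        """Split data into variable-length chunks at content-defined boundaries."""
        chunks, start, h = [], 0, 0
        power = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
        for i, b in enumerate(data):
            if i < WINDOW:
                h = (h * BASE + b) % MOD                # fill the first window
            else:
                # slide the window: drop the oldest byte, add the new one
                h = ((h - data[i - WINDOW] * power) * BASE + b) % MOD
                if h % DIVISOR == 0:                    # fingerprint marks a chunk end
                    chunks.append(data[start:i + 1])
                    start = i + 1
        if start < len(data):
            chunks.append(data[start:])
        return chunks

    def chunk_id(c: bytes) -> str:
        """Chunk identifier/digest (content address) used to find identical chunks."""
        return hashlib.sha1(c).hexdigest()

Only unique chunks (by chunk_id) would then be stored, as
the slide describes.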
4
Effects of Compression on Reliability
  • Chunk-based compression → interfile dependencies
  • Loss of a shared chunk → disproportionate amount
    of data loss

[Figure: several files sharing common chunks]
5
Effects of Compression on Reliability (continued)
  • Simple experiment to show the effects of
    interfile dependencies
  • 9.8 GB of data from several websites, drawn from
    The Internet Archive
  • Compressed using chunking to 1.83 GB (5.62 GB
    using gzip)
  • Chunks were mirrored and distributed evenly onto
    179 devices of 20 MB each

6
Compression and Reliability
  • Chunking
  • Minimizes redundancies. Gives us excellent
    compression ratios
  • Introduces interfile dependencies
  • Interfile dependencies are detrimental to
    reliability
  • Hence, reintroduce redundancies
  • Selective replication of chunks
  • Some chunks more important than others. How
    important?
  • The amount of data depending on a chunk (byte
    count)
  • The number of files depending on a chunk
    (reference count)
  • Selective replication strategy
  • Weight of a chunk (w) → number of replicas for a
    chunk (k)
  • We use a heuristic function to calculate k

7
Heuristic Function
  • k: number of replicas
  • w: weight of a chunk
  • a: base level of replication, independent of w
  • b: boosts the number of replicas for chunks with
    high weight
  • Every chunk is mirrored
  • kmax: maximum number of replicas
  • As replicas increase, the reliability gained from
    each additional replica diminishes
  • k is rounded off to the nearest integer (one
    possible form of the function is sketched below)
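One possible realization of this heuristic, assuming a
logarithmic dependence on the weight so that each extra
replica yields a diminishing gain; the slides list the
properties of the function but not its exact form, so
replica_count below is a sketch, not the authors' formula.

    import math

    def replica_count(w: float, a: float, b: float, k_max: int) -> int:
        """Map a chunk's weight w to a replica count k (assumed functional form)."""
        # base level a plus a boost b that grows only logarithmically in w
        k = round(a + b * math.log2(max(w, 1.0)))  # rounded to the nearest integer
        k = max(k, 2)                              # every chunk is at least mirrored
        return min(k, k_max)                       # never exceed kmax

Under this assumed form, replica_count(w, a=0, b=0.55,
k_max=4) keeps rarely shared chunks at two copies and
pushes heavily shared chunks toward the cap of four.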

8
Distribution of Chunks
  • An archival system receives files in batches
  • Files are stored onto a disk until the disk is
    full
  • For every file:
  • Chunks are extracted and compressed
  • Unique chunks are stored
  • A non-unique chunk is stored only if
  • The present disk does not contain the chunk, and
  • For this chunk, k < kmax
  • At the end of the batch
  • All chunks are revisited and replicas are made
    for the appropriate chunks
  • A chunk is not proactively replicated
  • We wait for a chunk's replica to arrive as a
    chunk of a future file
  • This reduces inter-device dependencies for a file
    (a sketch of this policy follows below)
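A sketch of this batch policy, reading the end-of-batch
step as filling any remaining replica deficit. The Disk
objects, the weight() update, and the place_replica()
helper are hypothetical stand-ins; chunk(), chunk_id(),
and replica_count() are the sketches from earlier slides.

    def store_batch(files, disk, index, a, b, k_max):
        """Store one batch of files on the current disk.

        index maps chunk_id -> {'disks': set of disks holding a copy,
        'weight': accumulated weight w} (assumed bookkeeping structure).
        """
        for f in files:
            for c in chunk(f.data):                   # chunks extracted and compressed
                cid = chunk_id(c)
                entry = index.setdefault(cid, {'disks': set(), 'weight': 0})
                entry['weight'] += weight(f, c)       # byte count or reference count
                # a non-unique chunk is stored again only if this disk lacks it
                # and the chunk still has fewer than k_max replicas
                if not entry['disks'] or (disk not in entry['disks']
                                          and len(entry['disks']) < k_max):
                    disk.write(cid, c)
                    entry['disks'].add(disk)
        # end of batch: revisit all chunks and create any replicas still owed
        for cid, entry in index.items():
            target = replica_count(entry['weight'], a, b, k_max)
            while len(entry['disks']) < target:
                entry['disks'].add(place_replica(cid, entry['disks']))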

9
Experimental Setup
  • We measure robustness:
  • The fraction of the data available given a
    certain percentage of unavailable storage devices
    (a simulation sketch follows below)
  • We use replication to introduce redundancy
  • Future work will investigate erasure codes
  • Data sets:
  • HTML, PDF, and image files from The Internet
    Archive (9.8 GB)
  • HTML, image (JPG and TIFF), PDF, and Microsoft
    Word files from The Santa Cruz Sentinel (40.22 GB)
  • We compare the robustness and storage space
    utilization of archives that use
  • Chunking with selective redundancies, and
  • Lempel-Ziv compression with mirroring
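A sketch of how robustness could be estimated by
simulation: fail a random fraction of devices and report
the fraction of archived bytes whose files still have
every chunk on some surviving device. The file objects
(with size and chunk_ids) and the chunk_locations map are
assumed structures; the exact measurement procedure is
not spelled out in the slides.

    import random

    def robustness(files, chunk_locations, devices, fail_fraction, trials=100):
        """Average fraction of data available when fail_fraction of devices fail.

        chunk_locations maps chunk_id -> set of devices holding a replica.
        """
        total = sum(f.size for f in files)
        results = []
        for _ in range(trials):
            failed = set(random.sample(list(devices),
                                       int(fail_fraction * len(devices))))
            # a file is available only if every one of its chunks survives
            alive = sum(f.size for f in files
                        if all(chunk_locations[cid] - failed
                               for cid in f.chunk_ids))
            results.append(alive / total)
        return sum(results) / trials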

10
Details of the Experimental Data
11
Weight of a Chunk
  • When using dependent data (byte count) as the
    heuristic:
  • w = D/d
  • D: sum of the sizes of all files depending on the
    chunk
  • d: average size of a chunk
  • When using the number of files (reference count)
    as the heuristic:
  • w = F
  • F: number of files depending on the chunk (a
    sketch computing both weights follows below)
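Both weights can be computed in one pass over a
file-to-chunk mapping; the files list (objects exposing
size and chunk_ids) is a hypothetical structure used only
for illustration.

    from collections import defaultdict

    def chunk_weights(files, avg_chunk_size):
        """Return per-chunk weights: (w = D/d by byte count, w = F by reference count)."""
        dependent_bytes = defaultdict(int)   # D: total size of files depending on the chunk
        reference_count = defaultdict(int)   # F: number of files depending on the chunk
        for f in files:
            for cid in set(f.chunk_ids):     # count each file once per chunk
                dependent_bytes[cid] += f.size
                reference_count[cid] += 1
        w_bytes = {cid: D / avg_chunk_size for cid, D in dependent_bytes.items()}
        return w_bytes, dict(reference_count)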

12
Robustness: effect of varying a (w = F, b = 1,
kmax = 4), The Internet Archive
13
Robustness: effect of varying a (w = D/d, b = 0.4,
kmax = 4), The Internet Archive
14
Robustness: effect of limiting k (w = D/d, b = 0.55,
a = 0), The Internet Archive
15
Robustness: effect of varying b (w = D/d, a = 0,
kmax = 4), The Internet Archive
16
Robustness: effect of varying b (w = D/d, a = 0,
kmax = 4), The Sentinel
17
Choice of a Heuristic
  • The choice of a heuristic depends on the corpus
  • If file size is indicative of file importance,
    choose w = D/d
  • If a file's importance is independent of its
    size, choose w = F
  • Use the same metric to measure robustness

18
Future Work
  • Study reliability of Deep Store
  • With a recovery model in place
  • When using delta compression
  • Use different redundancy mechanisms such as
    erasure codes
  • Data placement in conjunction with hardware
    statistics

19
Related Work
  • Many archival systems use Content Addressable
    Storage
  • EMC's Centera
  • Variable-length chunks: LBFS
  • Fixed-size chunks: Venti
  • OceanStore aims to provide continuous access to
    persistent data
  • uses automatic replication for high reliability
  • erasure codes for high availability
  • FARSITE: a distributed file system
  • Replicates metadata; replication was chosen to
    avoid the overhead of data reconstruction when
    using erasure codes
  • PASIS and Glacier use aggressive replication as
    protection against data loss
  • LOCKSS provides long-term access to digital data
  • uses a peer-to-peer audit and repair protocol to
    preserve the integrity of, and long-term access
    to, document collections

20
Conclusion
  • Chunking gives excellent compression ratios but
    introduces interfile dependencies that adversely
    affect system reliability.
  • Selective replication of chunks using heuristics
    gives
  • better robustness than mirrored LZ-compressed
    files
  • significantly higher storage space efficiency,
    using only about half of the space used by
    mirrored LZ-compressed files
  • We use simple replication; our results should
    only improve with other forms of redundancy.

21
Providing High Reliability in a Minimum
Redundancy Archival Storage System
  • Deepavali Bhagwat
  • Kristal Pollack
  • Darrell D. E. Long
  • Ethan L. Miller
  • Storage Systems Research Center
  • University of California, Santa Cruz

  • Thomas Schwarz, Computer Engineering Department,
    Santa Clara University
  • Jehan-François Pâris, Department of Computer
    Science, University of Houston, Texas