Fast, Inexpensive Content-Addressed Storage in Foundation

1
Fast, Inexpensive Content-Addressed Storage in
Foundation
  • Sean Rhea (Meraki, Inc.)
  • Russ Cox, Alex Pesterev (MIT CSAIL)
  • Work done while at Intel Research, Berkeley.

2
Digital Dark Ages?
  • Users increasingly store their most valuable data
    digitally
  • Wedding/baby photographs
  • Letters (now called email)
  • Diaries, scrapbooks, tax returns
  • Yet digital information remains especially
    vulnerable
  • Terry Kuny: "We are living in the midst of
    digital Dark Ages"
  • Hard drives crash
  • Removable media evolve (e.g., 5¼" floppies)
  • File formats become obsolete (e.g., WordStar,
    Lotus 1-2-3)
  • What will the world remember of the late 20th
    century?

3
  • As a community, we're not bad at storing
    important data over the long term.
  • We've only just begun to think about how we'll
    interpret that data 30 years from now.

4
For Example
  • Viewing an old PowerPoint presentation
  • Do we still have PowerPoint at all? And Windows?
  • Does the presentation use non-standard
    fonts/codecs?
  • Has some newer application overwritten a shared
    library with an incompatible version (DLL
    Hell)?
  • Not just a Microsoft problem: consider a web page
  • Even current IE/Safari/Firefox don't agree on
    formatting
  • All kinds of plugins necessary: sound, video,
    Flash

5
The Foundation Idea
  • Make daily backups of entire software stack
  • Archives user's applications, OS, and
    configuration state
  • Don't worry about identifying dependencies
  • Just save it all: every byte, every night
  • To recover an obscure file, boot the relevant
    stack in an emulator
  • View file with the application that created it

6
Foundation FAQ
  • Why preserve the entire disk?
  • Preserve software stack dependencies: preserve
    the data with the right application, libraries,
    and operating system as a single unit
  • Works for all applications, not just ones
    designed for preservation
  • Why daily images?
  • Want to preserve machine state as close as
    possible to last write of user's data (i.e.,
    preserve image before something changes)
  • Also allows recovery from user errors
  • Why emulate hardware?
  • Much better track record than emulating software
  • Software example: OpenOffice emulating Microsoft
    Word (yikes)
  • Hardware emulators available today for Amiga,
    PDP-11, Nintendo

7
  • I would love to give a talk about why Foundation
    is a great solution to the digital preservation
    problem.
  • Really, though, I think it's just a pretty good
    start.
  • Instead, I'm going to talk about a fun problem we
    had to solve to make it work.

8
Every Byte, Every Night? Indefinitely? Really?
  • Plan 9 did exactly that
  • Archive changed blocks every night to optical
    jukebox
  • Found that storage capacity grew faster than
    usage
  • Later with Content-Addressable Storage (Venti)
  • Automatically coalesces duplicate data to save
    space
  • Required multiple, high-speed disks for
    performance
  • Challenge for Foundation: provide similar storage
    efficiency on consumer hardware
  • Time Machine model: one external USB drive

9
Talk Outline
  • Introduction
  • What is Foundation?
  • Review of Content-Addressed Storage (Venti)
  • Contributions
  • Making Cheap Content-Addressed Storage Fast
  • Avoiding Concerns over Hash Collisions
  • Related Work
  • Conclusions

10
Venti Review
  • Plan 9 file system was two-level
  • Spinning storage, mostly a normal file system
  • Archival storage, optical write-once jukebox
  • Venti replaced optical jukebox
  • Still write-once
  • Chunks of data named by their SHA-1 hashes
  • Content-Addressable Storage (CAS)
  • Automatically coalesces duplicate writes
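
The bullets above can be illustrated with a minimal in-memory sketch. This is not Venti's code: the class name `ContentAddressedStore`, the use of a Python list as the data log, and a dict as the index are all assumptions made for illustration. Only the idea (blocks named by SHA-1, duplicates written once) comes from the slide.

```python
import hashlib


class ContentAddressedStore:
    """Toy in-memory CAS: an append-only log plus a hash -> offset index."""

    def __init__(self):
        self.log = []      # append-only data log; list position = offset
        self.index = {}    # SHA-1 digest -> offset in the log

    def put(self, block: bytes) -> bytes:
        """Store a block and return its name (the SHA-1 digest)."""
        name = hashlib.sha1(block).digest()
        if name not in self.index:           # new contents: append once
            self.index[name] = len(self.log)
            self.log.append(block)
        return name                          # duplicates cause no log write

    def get(self, name: bytes) -> bytes:
        """Map a hash to its log offset and return the block."""
        return self.log[self.index[name]]


store = ContentAddressedStore()
a = store.put(b"hello")
b = store.put(b"world")
c = store.put(b"hello")   # duplicate write: same name, log unchanged
```

In a real system the index lives on disk, which is where the seek costs discussed later in the talk come from.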

11
Venti Review
[Diagram: archiving blocks from the user's hard drive
to the external USB drive. Each block read is hashed
and checked against the Hash → Offset index ("seen it
before?"). A new block is appended to the data log and
the index is updated; a duplicate causes no log write.
Either way, the block's hash is appended to the
summary.]

12
Venti Review
[Diagram: restoring after a crash. For each hash in
the summary, look up the hash in the Hash → Offset
index to map it to a log offset, read the block from
the data log, and restore it to the user's hard drive.
Final step (not shown): archive the summary in the
data log as well.]

13
Notes on Venti
  • The Good News
  • CAS stores each block with particular contents
    only once
  • Changing any one block and re-archiving uses only
    one more block in archive
  • Adding a duplicate file from a different source
    uses no additional storage
  • The Bad News
  • Synchronous, random reads to on-disk index

14
Venti Review
[Diagram: the write path again. Answering "seen it
before?" for each block means seeking to the right
bucket of the on-disk Hash → Offset index.]

15
Venti Review
[Diagram: the restore path again. Mapping each hash in
the summary to a log offset also means seeking to the
right bucket of the Hash → Offset index.]

16
Notes on Venti
  • The Good News
  • CAS stores each block with particular contents
    only once
  • Changing any one block and re-archiving uses only
    one more block in archive
  • Adding a duplicate file from a different source
    uses no additional storage
  • The Bad News
  • Synchronous, random reads to on-disk index
  • Best case, one-disk performance for 512-byte
    blocks
  • One 5 ms seek per 512 bytes archived = 100 kB/s
  • That's 12 days to archive a 100 GB disk!
  • Larger blocks give better throughput, less sharing
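
The throughput figures above can be checked with a few lines of arithmetic. The function name `archive_days` is made up for this sketch; the inputs (512-byte blocks, 5 ms per seek, a 100 GB disk) are the slide's own numbers. The 8 KB block size in the last line is only an assumed example of a "larger block".

```python
def archive_days(disk_bytes: float, block_bytes: int, seek_seconds: float) -> float:
    """Days to archive a disk when every block costs one index seek."""
    throughput = block_bytes / seek_seconds      # bytes per second
    return disk_bytes / throughput / 86_400      # 86,400 seconds per day


# 512 bytes per 5 ms seek is roughly 100 kB/s.
throughput = 512 / 0.005

# Between 11 and just over 12 days, depending on whether
# "100 GB" is read as decimal (10**9) or binary (2**30) gigabytes.
decimal_days = archive_days(100e9, 512, 0.005)
binary_days = archive_days(100 * 2**30, 512, 0.005)

# Assumed example: 8 KB blocks cut the time to under a day,
# at the cost of finding fewer duplicate blocks.
large_block_days = archive_days(100e9, 8192, 0.005)
```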

17
Notes on Venti (cont.)
  • Venti's solution: use 8 high-speed disks for
    index
  • Untenable in consumer space
  • Wears disks out pretty quickly, too
  • The compare-by-hash controversy
  • Fear of hash collisions: two different blocks
    with same hash breaks Venti
  • May be very unlikely, but cost (data corruption)
    is huge
  • Does CAS really require a cryptographically
    strong hash?

18
Talk Outline
  • Introduction
  • What is Foundation?
  • Review of Content-Addressed Storage (Venti)
  • Contributions
  • Making Cheap Content-Addressed Storage Fast
  • Avoiding Concerns over Hash Collisions
  • Related Work
  • Conclusions

19
Making Inexpensive CAS Fast
  • The problem: disk seeks
  • Secure hash randomizes an otherwise sequential
    disk-to-disk transfer
  • To reduce seeks, must reduce hash table lookups
  • When do hash table lookups occur?
  • When writing data, to determine if we've seen it
    before
  • When writing data, to update the index
  • When reading data, to map hashes to disk locations

20
2. Updating the Index
  • After appending a block to the data log, must
    update the index
  • Pseudorandom hash causes a seek

21
Updating the Index
[Diagram: after reading the 2nd block and appending it
to the data log, updating the Hash → Offset index
means seeking to the right bucket.]

22
2. Updating the Index
  • After appending a block to the data log, must
    update the index
  • Pseudorandom hash causes a seek
  • Easy to fix: use a write-back index cache
  • Store index writes in memory
  • Flush to disk sequentially in large batches
  • On crash, reconstruct index from the data log
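
A write-back index cache along these lines might look like the sketch below. The names `WriteBackIndex` and `rebuild_index`, the tiny flush threshold, and the choice to flush in sorted hash order (as a stand-in for one sequential pass over the on-disk buckets) are assumptions for illustration, not details taken from Foundation.

```python
import hashlib


class WriteBackIndex:
    """Hypothetical sketch: buffer index updates in RAM, flush in batches."""

    def __init__(self, flush_threshold: int = 4):
        self.on_disk = {}        # stands in for the on-disk hash -> offset index
        self.pending = {}        # in-memory write-back cache
        self.flush_threshold = flush_threshold
        self.flushes = 0

    def update(self, digest: bytes, offset: int) -> None:
        self.pending[digest] = offset            # no disk seek here
        if len(self.pending) >= self.flush_threshold:
            self.flush()

    def flush(self) -> None:
        # Assumption: visiting entries in hash order models a single
        # sequential pass over the index instead of one seek per entry.
        for digest in sorted(self.pending):
            self.on_disk[digest] = self.pending[digest]
        self.pending.clear()
        self.flushes += 1

    def lookup(self, digest: bytes):
        if digest in self.pending:
            return self.pending[digest]
        return self.on_disk.get(digest)


def rebuild_index(data_log):
    """After a crash, recover the index by rescanning the data log."""
    index = {}
    for offset, block in enumerate(data_log):
        index.setdefault(hashlib.sha1(block).digest(), offset)
    return index


data_log = [b"a", b"b", b"c", b"d", b"e"]
index = WriteBackIndex(flush_threshold=4)
for offset, block in enumerate(data_log):
    index.update(hashlib.sha1(block).digest(), offset)

# Simulated crash: the one unflushed entry in index.pending would be lost,
# but a scan of the data log recovers every mapping.
recovered = rebuild_index(data_log)
```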

23
3. Mapping Hashes to Disk Locations During Reads
  • To restore disk:
  • Start with the list of original blocks' hashes
  • Look up each block in index
  • Read block from data log and restore to disk

24
[Diagram: during a restore, looking up the hash of the
1st block to map it to a log offset means seeking to
the right bucket of the Hash → Offset index.]

25
3. Mapping Hashes to Disk Locations During Reads
  • To restore disk:
  • Start with the list of original blocks' hashes
  • Look up each block in index
  • Read block from data log and restore to disk
  • Observation: data log is mostly ordered
  • Duplicate blocks often occur as part of duplicate
    files

26
Ordering in Data Log
[Diagram: the Hash → Offset index, the summary, and the
data log, illustrating the ordering of blocks in the
data log.]

27
3. Mapping Hashes to Disk Locations During Reads
  • To restore disk:
  • Start with the list of original blocks' hashes
  • Look up each block in index
  • Read block from data log and restore to disk
  • Observation: data log is mostly ordered
  • Duplicate blocks often occur as part of duplicate
    files
  • Idea: add another index, ordered by log offset
  • Read-ahead in this index to eliminate future
    lookups in original index
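
The read-ahead idea can be modeled as below. This is a simplified sketch, not Foundation's implementation: `PrefetchingIndex`, the `readahead` parameter, and the rule of charging two seeks per miss (one in the primary index, one in the offset-ordered index) are assumptions chosen to mirror the slide. It also assumes every hash looked up is present, as it is during a restore.

```python
class PrefetchingIndex:
    """Hypothetical model: primary index plus an index ordered by log offset."""

    def __init__(self, hashes_in_log_order, readahead=8):
        self.by_offset = list(hashes_in_log_order)     # secondary index
        self.by_hash = {}                              # primary index
        for offset, h in enumerate(self.by_offset):
            self.by_hash.setdefault(h, offset)
        self.readahead = readahead
        self.prefetched = {}                           # RAM cache: hash -> offset
        self.seeks = 0

    def lookup(self, h):
        if h in self.prefetched:
            return self.prefetched[h]                  # found in RAM: no seek
        offset = self.by_hash[h]                       # primary index: one seek
        # Secondary index: one more seek to read the hashes stored at the
        # next few log offsets, replacing the previous read-ahead window.
        stop = min(offset + 1 + self.readahead, len(self.by_offset))
        self.prefetched = {self.by_offset[o]: o for o in range(offset + 1, stop)}
        self.seeks += 2
        return offset


hashes = [f"h{i}" for i in range(20)]          # placeholder hash values
index = PrefetchingIndex(hashes, readahead=8)
offsets = [index.lookup(h) for h in hashes]    # restore in log order
```

Restoring these 20 blocks in log order costs 6 index seeks in this model, against 20 with the primary index alone.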

28
Index by Offset
[Diagram: restoring after a crash with a new index,
sorted by offset. Looking up the hash of the 1st block
maps the hash to a log offset in the Hash → Offset
index (seek!), the block is read from the log (seek!),
and hashes for the next few offsets are prefetched
from the secondary index (seek!). Looking up the hash
of the 2nd block finds its log offset in the secondary
index (no seek!), and the block is read from the log
(no seek!).]

29
1. Is a Block New, or Duplicate?
  • Optimization for reads also helps duplicate
    writes
  • Index misses on first duplicate block
  • Hits on subsequent blocks rewritten in same order
  • Doesn't help for new data
  • Every lookup in primary index fails
  • Still suffer a seek for every new block

30
1. Is a Block New, or Duplicate?
  • Idea: use a Bloom filter to identify new blocks
  • Lossy representation of the primary index
  • Uses much less memory than index itself
  • For any given block, Bloom filter tells us either:
  • It's definitely new → append to log, update index
  • It might be duplicate → look up in index
  • If it really is a duplicate, we get the prefetch
    benefit
  • Otherwise, called a false positive
  • Using enough memory keeps false positives at 1%
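
A small Bloom filter shows the "definitely new" versus "might be duplicate" distinction. The sizing formulas are the standard textbook ones; the class name, the double-hashing scheme built from one SHA-1 digest, and the 1,000-item example are assumptions for this sketch and say nothing about how Foundation sizes or hashes its filter.

```python
import hashlib
import math


class BloomFilter:
    """Illustrative Bloom filter sized for a target false-positive rate."""

    def __init__(self, expected_items: int, false_positive_rate: float = 0.01):
        # Standard sizing: m = -n ln(p) / (ln 2)^2 bits, k = (m / n) ln 2 hashes.
        bits = -expected_items * math.log(false_positive_rate) / math.log(2) ** 2
        self.num_bits = max(1, math.ceil(bits))
        self.num_hashes = max(1, round(self.num_bits / expected_items * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item: bytes):
        # Double hashing: derive all bit positions from two halves of a digest.
        digest = hashlib.sha1(item).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: bytes) -> bool:
        """False means definitely new; True means possibly a duplicate."""
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


seen = BloomFilter(expected_items=1000, false_positive_rate=0.01)
for i in range(1000):
    seen.add(f"block-{i}".encode())
```

At a 1% target the filter needs a little under 10 bits per block, far less than a full hash-to-offset entry, which is the memory saving the slide refers to.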

31
Results
  • Do these optimizations pay off?
  • Buffering index writes is an obvious win
  • Bloom filter is, too: removes 99% of seeks when
    writing new data
  • Both trade RAM for seeks
  • Benefit of secondary index less clear
  • If duplicate data comes in long sequences, it
    reduces index seeks to two per sequence
  • If duplicate data comes in little fragments, it
    doubles the number of index seeks
  • Need traces of real data to answer this question
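
The trade-off for the secondary index can be made concrete with a rough seek count. The function `index_seeks` is a hypothetical model: it assumes one primary-index seek per block without the secondary index, and exactly two seeks per run of duplicate blocks with it (that is, read-ahead long enough to cover each run).

```python
def index_seeks(run_lengths, use_secondary_index: bool) -> int:
    """Count index seeks for duplicate data arriving in runs of given lengths."""
    if use_secondary_index:
        # One primary-index seek plus one secondary-index seek per run.
        return 2 * len(run_lengths)
    return sum(run_lengths)        # one primary-index seek per block


long_runs = [1000, 1000]           # duplicate data in long sequences
fragments = [1] * 100              # duplicate data in one-block fragments

long_without = index_seeks(long_runs, use_secondary_index=False)   # 2000
long_with = index_seeks(long_runs, use_secondary_index=True)       # 4
frag_without = index_seeks(fragments, use_secondary_index=False)   # 100
frag_with = index_seeks(fragments, use_secondary_index=True)       # 200
```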

32
Results (cont.)
  • Research group at MIT has been running Venti as
    its backup server for two years
  • We looked at 400 nightly snapshots
  • Simulated archiving and restoring these in both
    Venti and Foundation

33
Talk Outline
  • Introduction
  • What is Foundation?
  • Review of Content-Addressed Storage (Venti)
  • Contributions
  • Making Cheap Content-Addressed Storage Fast
  • Avoiding Concerns over Hash Collisions
  • Related Work
  • Conclusions

34
Eliminating Compare by Hash
  • Some worried that same SHA-1 doesn't imply same
    contents (i.e., hash collisions are possible)
  • Even if very rare, consequences (corruption) too
    great
  • Stepping back a bit, CAS as a black box
  • Give it a data block, get back an opaque ID
  • Give it an opaque ID, get back the data block
  • Do we care that the ID is a SHA-1 hash?
  • What if the opaque ID was just the block's
    location in the data log?

35
Using Locations As IDs
  • Pros
  • Reads require no index lookups at all
  • System can still find potential duplicates
    using hashing (with a weaker, faster hash
    function)
  • Cons
  • Need another mechanism to check integrity
  • Since hash untrusted, must compare suspected
    duplicates byte-by-byte
  • Others have claimed these byte-by-byte
    comparisons are a non-starter
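
A sketch of a store that hands out log locations as IDs is shown below. The class name `LocationAddressedStore` and the use of `zlib.crc32` as the "weaker, faster hash" are assumptions made for illustration; the slide does not say which hash Foundation uses. The integrity check listed under "Cons" is not modeled.

```python
import zlib


class LocationAddressedStore:
    """Hypothetical sketch: the opaque ID is the block's offset in the log."""

    def __init__(self, weak_hash=zlib.crc32):
        self.log = []              # append-only data log
        self.candidates = {}       # weak hash -> offsets of blocks with that hash
        self.weak_hash = weak_hash

    def put(self, block: bytes) -> int:
        key = self.weak_hash(block)
        for offset in self.candidates.get(key, []):
            # The hash is untrusted, so confirm a suspected duplicate
            # by comparing the stored bytes with the new block.
            if self.log[offset] == block:
                return offset
        offset = len(self.log)
        self.log.append(block)
        self.candidates.setdefault(key, []).append(offset)
        return offset

    def get(self, offset: int) -> bytes:
        return self.log[offset]    # reads need no index lookup at all


store = LocationAddressedStore()
first = store.put(b"photo")
second = store.put(b"photo")       # duplicate: the same location comes back

# Force every block to collide to show that collisions are harmless here.
colliding = LocationAddressedStore(weak_hash=lambda block: 0)
x = colliding.put(b"one")
y = colliding.put(b"two")          # same weak hash, different bytes
```

Because a hash match is only a hint, a collision costs an extra comparison rather than corrupting data.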

36
2nd Disk Arm to the Rescue
  • Once we eliminate most index reads (via our
    previous optimizations), the backup disk is
    otherwise idle while backing up duplicate data
  • Can instead put it to work doing byte-by-byte
    comparisons of suspected duplicates

37
Talk Outline
  • Introduction
  • What is Foundation?
  • Review of Content-Addressed Storage (Venti)
  • Contributions
  • Making Cheap Content-Addressed Storage Fast
  • Avoiding Concerns over Hash Collisions
  • Related Work
  • Conclusions

38
Related Work
  • Apple Time Machine
  • Duplicates coalesced at file level via hard links
  • NetApp WAFL, ZFS
  • Copy-on-write coalesces blocks at the FS level
  • Misses duplicates that come into system
    separately
  • Data Domain Deduplication FS
  • Very similar to Foundation, in enterprise context
  • Depends on collision-freeness of hash function
  • Lots of other Content-Addressed Storage work
  • LBFS, SUNDR, Peabody

39
Conclusions
  • Consumer-grade CAS works now
  • A single, external USB drive is enough
  • Just have to be crafty about avoiding seeks
  • Lots of uses other than preservation
  • E.g., inexpensive household backup server that
    automatically coalesces duplicate media
    collections
  • Doesn't require a collision-free hash function