POND: THE OCEANSTORE PROTOTYPE - PowerPoint PPT Presentation

Slides: 34
Provided by: jeha1
Learn more at: https://www2.cs.uh.edu

Transcript and Presenter's Notes


1
POND: THE OCEANSTORE PROTOTYPE
  • S. Rhea, P. Eaton, D. Geels,
  • H. Weatherspoon, J. Kubiatowicz
  • U. C. Berkeley

2
Key Ideas
  • Versioning file system
  • Location independent routing
  • Uses hashes instead of addresses
  • Mapping is done through Tapestry
  • Byzantine update commitment
  • By nodes holding primary copies (inner ring)
  • Proactive threshold signatures allow inner ring
    membership updates

3
Key Ideas
  • Push-based update of other copies
  • Through an overlay multicast network
  • Copies are not permanent
  • Continuous archiving in erasure-coded form
  • Very reliable
  • Very slow access

4
Motivation
  • Find a better solution for long-term management
    of data
  • Enabling trends
  • Near universal connectivity through
    high-bandwidth links
  • Very fast increase of disk storage capacity per
    unit cost

5
OceanStore
  • Internet-scale cooperative file system
  • Will provide
  • High durability
  • Universal accessibility
  • Will use a two-tiered storage system
  • Stores data objects

6
Two-tiered organization
  • Upper tier
  • Powerful, well-connected hosts
  • Serialize changes and archive results
  • Lower tier
  • Less powerful hosts
  • Can be user workstations
  • Provide storage resources

7
Two-tiered organization
[Figure: archive, primary replica (in inner ring), and three secondary replicas]
8
Basic requirements
  • OceanStore should
  • Let information be accessed from any location
  • Balance the tension between privacy and
    information sharing
  • Offer an easily understandable and usable model
    of data consistency
  • Guarantee data integrity

9
First basic assumption
  • Infrastructure cannot be trusted, except in
    aggregate
  • Hosts and routers can fail arbitrarily
  • Must consider
  • Passive failures (hosts snooping)
  • Active failures (hosts injecting malicious
    messages)

10
Second basic assumption
  • Infrastructure is continuously changing
  • Performance of communication paths varies
  • Resources enter and leave the network without
    warning
  • System should
  • Be at least self-organizing and self-repairing
  • Aim to be self-tuning

11
The challenge
  • Build a system that provides
  • An expressive user interface
  • High data availability
  • High data durability
  • High data privacy and integrity
  • atop an untrusted and ever-changing base

More ambitious than FARSITE
12
The data model
  • OceanStore data object
  • Similar to a traditional file
  • Ordered sequence of read-only versions
  • Versioning
  • Simplifies consistency issues
  • Allows recovery of previous versions
  • Identical blocks are shared among versions

13
Data object implementation (I)
  • Each data object has an AGUID (Active
    Globally-Unique Identifier)
  • Secure hash of application-level name and public
    key of owner
  • Each version has a VGUID (Version GUID)
  • BGUID of root block of a version
  • Each block has a BGUID (Block GUID)
  • Secure hash of block contents
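
The three identifiers above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: SHA-1 stands in for the "secure hash" (the slides do not name one), the root block is modelled as the plain concatenation of its children's BGUIDs, and every function name here is hypothetical rather than taken from Pond.

```python
import hashlib
from typing import List


def secure_hash(data: bytes) -> str:
    """Assumed secure hash: SHA-1, returned as a hex string."""
    return hashlib.sha1(data).hexdigest()


def aguid(name: str, owner_key: bytes) -> str:
    """AGUID: hash of the application-level name and the owner's key."""
    return secure_hash(name.encode("utf-8") + owner_key)


def bguid(block: bytes) -> str:
    """BGUID: hash of the block contents."""
    return secure_hash(block)


def vguid(data_blocks: List[bytes]) -> str:
    """VGUID: BGUID of the version's root block.

    Assumption: the root block simply lists the BGUIDs of its children.
    """
    root_block = "".join(bguid(b) for b in data_blocks).encode("ascii")
    return bguid(root_block)


# Two versions that differ only in their second block.
v1 = [b"hello ", b"world"]
v2 = [b"hello ", b"there"]
shared = bguid(v1[0]) == bguid(v2[0])   # unchanged block keeps its BGUID
changed = vguid(v1) != vguid(v2)        # the version identifier changes
```

Because a block's name is derived from its contents, an unchanged block has the same BGUID in both versions and only needs to be stored once.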

14
A data object
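[Figure: an AGUID and two versions, VGUID_i and VGUID_i+1; each version has a root block, indirect blocks, and data blocks, with copy-on-write (COW) marked on the root and indirect blocks]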
15
Data object implementation
  • AGUID, VGUID and BGUID are location-transparent
  • OceanStore relies on a lower-level service to map
    GUIDs into addresses

16
Application-level consistency (I)
  • Updating an object means creating a new version
  • Updates are
  • Atomic
  • Represented as an array of potential actions each
    guarded by a predicate

17
Application-level consistency (II)
  • Actions can be
  • Appending data
  • Replacing bytes at a specific address
  • Predicates can be
  • Checking the latest version number of the object
  • Verifying values of bytes at a specific address
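
A minimal sketch of the predicate-and-action model follows. It assumes that an update is a list of (predicate, action) pairs, that the first pair whose predicate holds on the latest version is the one applied, and that an update whose predicates all fail is aborted. The helper names (`version_is`, `bytes_equal`, `append`, `replace`, `apply_update`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Version:
    number: int
    data: bytes


Predicate = Callable[[Version], bool]
Action = Callable[[bytes], bytes]


def version_is(n: int) -> Predicate:
    """Predicate: the latest version number is n."""
    return lambda v: v.number == n


def bytes_equal(offset: int, expected: bytes) -> Predicate:
    """Predicate: the bytes at `offset` match `expected`."""
    return lambda v: v.data[offset:offset + len(expected)] == expected


def append(extra: bytes) -> Action:
    """Action: append data."""
    return lambda d: d + extra


def replace(offset: int, new: bytes) -> Action:
    """Action: replace bytes at a specific address."""
    return lambda d: d[:offset] + new + d[offset + len(new):]


def apply_update(versions: List[Version],
                 update: List[Tuple[Predicate, Action]]) -> bool:
    """Apply the first action whose predicate holds; old versions stay read-only."""
    latest = versions[-1]
    for predicate, action in update:
        if predicate(latest):
            versions.append(Version(latest.number + 1, action(latest.data)))
            return True
    return False  # no predicate held: the update is aborted


history = [Version(0, b"abc")]
ok = apply_update(history, [(version_is(0), append(b"def"))])
stale = apply_update(history, [(version_is(0), append(b"!"))])
```

The second update is rejected because it was written against version 0, which is no longer the latest.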

19
Application-level consistency (III)
  • Predicate and action model
  • Allows multiple levels of consistency to be implemented
  • Atomic transactions satisfying ACID properties
    for database applications
  • Weaker consistency for mailboxes

20
A footnote
  • ACID properties of atomic transactions mean that
    atomic transactions
  • Are Atomic
  • Bring the database from one consistent state to
    another consistent state
  • Isolate their partial results until the
    transaction is completed
  • Guarantee the durability of final result

21
Virtualization through Tapestry
  • OceanStore messages are addressed with a GUID
  • Tapestry forwards these messages to a host
    containing a resource with that GUID
  • Fully decentralized service
  • Hosts can
  • Join Tapestry by supplying their GUID
  • Publish the GUIDs of the resources they have
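
The join / publish / route interface described above can be mimicked with a toy in-memory directory. This is emphatically not how Tapestry works internally: Tapestry is fully decentralized, whereas the hypothetical `ToyOverlay` class below keeps everything in one dictionary and picks an arbitrary holder. It only illustrates that senders address a GUID, never a host address.

```python
from collections import defaultdict


class ToyOverlay:
    """Hypothetical centralized stand-in for a GUID-routing overlay."""

    def __init__(self):
        self.inboxes = {}                    # host GUID -> received messages
        self.locations = defaultdict(set)    # resource GUID -> host GUIDs

    def join(self, host_guid):
        """A host joins by supplying its GUID."""
        self.inboxes[host_guid] = []

    def publish(self, host_guid, resource_guid):
        """A host announces that it holds a resource."""
        if host_guid not in self.inboxes:
            raise KeyError("host must join before publishing")
        self.locations[resource_guid].add(host_guid)

    def route(self, resource_guid, message):
        """Deliver a message to some host holding the resource."""
        holders = sorted(self.locations.get(resource_guid, ()))
        if not holders:
            return None
        target = holders[0]   # arbitrary choice for this sketch
        self.inboxes[target].append(message)
        return target


overlay = ToyOverlay()
overlay.join("host-a")
overlay.join("host-b")
overlay.publish("host-b", "block-1234")
delivered_to = overlay.route("block-1234", "read request")
```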

22
Replication and consistency (I)
  • Each object has a single primary replica
  • Primary replica
  • Serializes and applies all updates
  • Creates a certificate (heartbeat) mapping AGUID
    of object to GUID of its latest version
  • Controls access to the object

23
Replication and consistency (II)
  • Heartbeat contains
  • An AGUID
  • A VGUID
  • A timestamp
  • A version sequence number
  • Getting the most recent version of object means
    getting its most recent heartbeat
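
As a sketch, a heartbeat can be modelled as a small record with the four fields listed above. The example assumes that "most recent" means the highest version sequence number and leaves out the signature that a real certificate would carry; the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Heartbeat:
    aguid: str        # identifies the object
    vguid: str        # identifies its latest version
    timestamp: float
    sequence: int     # version sequence number


def latest_vguid(heartbeats: Iterable[Heartbeat], aguid: str) -> Optional[str]:
    """Return the VGUID named by the most recent heartbeat for an object."""
    candidates = [h for h in heartbeats if h.aguid == aguid]
    if not candidates:
        return None
    # Assumption: the highest sequence number is the most recent heartbeat.
    return max(candidates, key=lambda h: h.sequence).vguid


seen = [
    Heartbeat("obj-1", "version-a", 100.0, 1),
    Heartbeat("obj-1", "version-b", 105.0, 2),
    Heartbeat("obj-2", "version-x", 101.0, 7),
]
current = latest_vguid(seen, "obj-1")
```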

24
The inner ring
  • Small set of co-operating servers that manage
    primary replicas
  • Implement a Byzantine fault-tolerant protocol to
  • Agree on all updates to an object
  • Digitally sign the result

25
Archival storage
  • Stores object versions that are not frequently
    accessed
  • Uses erasure codes
  • Each block
  • Partitioned into m fragments
  • Encoded into n > m fragments
  • Any subset of m fragments suffices to
    reconstitute the block
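
The m-of-n property can be demonstrated with a small polynomial code. This is a sketch of one possible erasure code, not the code used by the prototype: each byte position is treated as a polynomial of degree below m over the integers modulo 257 (an assumed field choice), fragment x holds the polynomial's value at x, and any m fragments recover the block by Lagrange interpolation. Requires Python 3.8+ for the modular inverse via `pow`.

```python
P = 257  # prime just above the byte range (assumption for this sketch)


def _interpolate(points, x):
    """Value at x of the unique polynomial through `points`, modulo P."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, -1, P)) % P
    return total


def encode(block, m, n):
    """Partition `block` into m fragments and encode them into n > m fragments."""
    assert m < n <= P and len(block) % m == 0
    size = len(block) // m
    data = [block[i * size:(i + 1) * size] for i in range(m)]
    fragments = []
    for x in range(n):
        values = [_interpolate([(i, data[i][k]) for i in range(m)], x)
                  for k in range(size)]
        fragments.append((x, values))   # fragments 0..m-1 equal the raw data
    return fragments


def decode(fragments, m):
    """Reconstitute the block from any m distinct fragments."""
    assert len(fragments) >= m
    chosen = fragments[:m]
    size = len(chosen[0][1])
    out = bytearray()
    for i in range(m):
        for k in range(size):
            out.append(_interpolate([(x, vals[k]) for x, vals in chosen], i))
    return bytes(out)


block = b"OceanStore!!"
frags = encode(block, m=4, n=8)
survivors = [frags[1], frags[3], frags[6], frags[7]]   # four fragments lost
recovered = decode(survivors, m=4)
```

Here half of the eight fragments are discarded, and the remaining four still rebuild the original twelve bytes.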

26
Caching of data objects
  • Retrieving data from archive is slow
  • OceanStore also maintains caches of whole blocks
  • Secondary replicas
  • Heartbeats always come from the primary replica
  • Updates of secondary replicas are done through a
    dissemination tree
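
Push-based propagation through a dissemination tree can be sketched as a recursive walk. The `Replica` class below is hypothetical and ignores networking, failures, and signatures; it only shows an update flowing from the primary replica at the root down to every secondary replica.

```python
class Replica:
    """Hypothetical node in a dissemination tree."""

    def __init__(self, name):
        self.name = name
        self.children = []
        self.version = 0

    def add_child(self, child):
        self.children.append(child)
        return child

    def push(self, version):
        """Apply a newer version here, then push it to all children.

        Returns the number of replicas that were updated.
        """
        if version <= self.version:
            return 0  # already up to date: nothing to forward
        self.version = version
        return 1 + sum(child.push(version) for child in self.children)


primary = Replica("primary")
s1 = primary.add_child(Replica("secondary-1"))
s2 = primary.add_child(Replica("secondary-2"))
s3 = s1.add_child(Replica("secondary-3"))

updated = primary.push(1)
```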

27
Path of an OceanStore update
[Figure: application, primary replica in inner ring, archive, and three secondary replicas]
28
Updating primary replicas (I)
  • Use a Byzantine fault-tolerant protocol
  • Tolerates up to f failures in a system made up
    of 3f + 1 hosts
  • Protocol authenticates messages with symmetric-key
    message authentication codes (MACs)
  • Faster than using public keys
  • Complicates the Byzantine agreement protocol
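
The 3f + 1 bound translates directly into a sizing rule. The two helper functions below are hypothetical names for that arithmetic.

```python
def min_hosts(f: int) -> int:
    """Smallest ring that tolerates f arbitrary (Byzantine) failures."""
    if f < 0:
        raise ValueError("f must be non-negative")
    return 3 * f + 1


def max_faulty(n: int) -> int:
    """Largest f such that a ring of n hosts still satisfies n >= 3f + 1."""
    if n < 1:
        raise ValueError("need at least one host")
    return (n - 1) // 3


sizes = {n: max_faulty(n) for n in (4, 6, 7, 10)}
```

A ring of four hosts tolerates one faulty host, and growing it to six buys nothing: a second failure is only tolerated from seven hosts onward.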

29
Updating primary replicas (II)
  • Solution was to use
  • Symmetric keys for all communications within the
    inner ring
  • Public keys to communicate with all other machines
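
The symmetric-key half of this split can be illustrated with Python's standard `hmac` module. The sketch assumes HMAC-SHA256 and a single key shared by two inner-ring members; neither detail comes from the slides. The public-key half is left out because it would need a third-party library.

```python
import hashlib
import hmac
import os


def make_mac(key: bytes, message: bytes) -> bytes:
    """Compute a message authentication code with a shared symmetric key."""
    return hmac.new(key, message, hashlib.sha256).digest()


def verify_mac(key: bytes, message: bytes, tag: bytes) -> bool:
    """Check a MAC in constant time; only holders of `key` can do this."""
    return hmac.compare_digest(make_mac(key, message), tag)


shared_key = os.urandom(32)      # known only to the two ring members
message = b"commit update 42"    # hypothetical protocol message
tag = make_mac(shared_key, message)

accepted = verify_mac(shared_key, message, tag)
tampered = verify_mac(shared_key, b"commit update 43", tag)
outsider = verify_mac(os.urandom(32), message, tag)
```

A MAC can only be checked by someone who holds the shared key, so it cannot convince a third party the way a public-key signature can. That is why results leaving the inner ring are signed with public keys instead.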

30
Proactive threshold signatures
  • (listen to lecture)

31
Prototype software architecture
[Figure labels: dissemination tree/replicas, inner ring, client interface, Byzantine agreement, application, archive]
32
The prototype
  • Written in Java

33
Conclusion