Title: POND: THE OCEANSTORE PROTOTYPE
1 POND: THE OCEANSTORE PROTOTYPE
- S. Rhea, P. Eaton, D. Geels,
- H. Weatherspoon, J. Kubiatowicz
- U. C. Berkeley
2 Key Ideas
- Versioning file system
- Location independent routing
- Uses hashes instead of addresses
- Mapping is done through Tapestry
- Byzantine update commitment
- By nodes holding primary copies (inner ring)
- Proactive threshold signatures allow inner ring
membership updates
3 Key Ideas
- Push-based update of other copies
- Through an overlay multicast network
- Copies are not permanent
- Continuous archiving in erasure-coded form
- Very reliable
- Very slow access
4 Motivation
- Find a better solution for long-term management of data
- Enabling trends
- Near-universal connectivity through high-bandwidth links
- Very fast increase of disk storage capacity per unit cost
5 OceanStore
- Internet-scale cooperative file system
- Will provide
- High durability
- Universal accessibility
- Will use a two-tiered storage system
- Stores data objects
6 Two-tiered organization
- Upper tier
- Powerful, well-connected hosts
- Serialize changes and archive results
- Lower tier
- Less powerful hosts
- Can be user workstations
- Provide storage resources
7 Two-tiered organization
[Figure: the archive, the primary replica (in the inner ring), and three secondary replicas]
8 Basic requirements
- OceanStore should
- Let information be accessed from any location
- Balance the tension between privacy and information sharing
- Offer an easily understandable and usable model of data consistency
- Guarantee data integrity
9 First basic assumption
- Infrastructure cannot be trusted, except in aggregate
- Hosts and routers can fail arbitrarily
- Must consider
- Passive failures: hosts snooping
- Active failures: hosts injecting malicious messages
10 Second basic assumption
- Infrastructure is continuously changing
- Performance of communication paths varies
- Resources enter and leave the network without warning
- System should
- Be self-organizing and self-repairing
- Aim to be self-tuning
11 The challenge
- Build a system that provides
- An expressive user interface
- High data availability
- High data durability
- High data privacy and integrity
- Atop an untrusted and ever-changing base
More ambitious than FARSITE
12 The data model
- OceanStore data object
- Similar to a traditional file
- Ordered sequence of read-only versions
- Versioning
- Simplifies consistency issues
- Allows recovery of previous versions
- Identical blocks are shared among versions
13 Data object implementation (I)
- Each data object has an AGUID (Active Globally-Unique Identifier)
- Secure hash of application-level name and public key of owner
- Each version has a VGUID (Version GUID)
- BGUID of root block of a version
- Each block has a BGUID (Block GUID)
- Secure hash of block contents
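The naming scheme above can be sketched in a few lines. This is a minimal illustration, assuming SHA-1 as the secure hash and a root block that is simply a newline-separated list of child BGUIDs; the helper names (`bguid`, `aguid`, `make_version`), the `owner_key` parameter, and the byte layout are hypothetical and are not taken from the prototype.

```python
import hashlib


def bguid(block: bytes) -> str:
    """Block GUID: secure hash of the block contents (SHA-1 assumed here)."""
    return hashlib.sha1(block).hexdigest()


def aguid(name: str, owner_key: bytes) -> str:
    """Active GUID: secure hash of the application-level name and the owner's key."""
    return hashlib.sha1(name.encode("utf-8") + owner_key).hexdigest()


def make_version(store: dict, data_blocks: list) -> str:
    """Store the data blocks plus a root block that lists their BGUIDs.

    Returns the VGUID, which is the BGUID of the root block.
    """
    child_ids = []
    for block in data_blocks:
        gid = bguid(block)
        store[gid] = block  # identical blocks hash to the same key and are stored once
        child_ids.append(gid)
    root = "\n".join(child_ids).encode("ascii")  # hypothetical root-block layout
    vguid = bguid(root)
    store[vguid] = root
    return vguid


store = {}
v1 = make_version(store, [b"hello ", b"world"])
v2 = make_version(store, [b"hello ", b"there"])
print(v1 != v2)    # the two versions get different VGUIDs
print(len(store))  # three distinct data blocks plus two root blocks
```

Because every block is named by the hash of its contents, the unchanged block `b"hello "` is stored once and shared by both versions.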
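14 A data object
[Figure: an AGUID refers to versions VGUID_i and VGUID_{i+1}; each version consists of a root block, indirect blocks, and data blocks, with copy-on-write (COW) applied at the indirect-block and data-block levels]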
15 Data object implementation
- AGUID, VGUID and BGUID are location-transparent
- OceanStore relies on a lower-level service to map GUIDs into addresses
16 Application-level consistency (I)
- Updating an object means creating a new version
- Updates are
- Atomic
- Represented as an array of potential actions each
guarded by a predicate
17 Application-level consistency (II)
- Actions can be
- Appending data
- Replacing bytes at a specific address
- Predicates can be
- Checking the latest version number of the object
- Verifying values of bytes at a specific address
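The predicate-and-action model can be illustrated with a small sketch. It assumes that an update is a list of (predicate, action) pairs, that the first pair whose predicate holds on the latest version is applied, and that the update aborts if none holds; that evaluation rule and all names (`Version`, `version_is`, `bytes_equal`, `append`, `replace`, `apply_update`) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Version:
    number: int
    data: bytes


Predicate = Callable[[Version], bool]
Action = Callable[[bytes], bytes]


def version_is(n: int) -> Predicate:
    """Predicate: the latest version number equals n."""
    return lambda v: v.number == n


def bytes_equal(offset: int, expected: bytes) -> Predicate:
    """Predicate: the bytes at the given offset match the expected value."""
    return lambda v: v.data[offset:offset + len(expected)] == expected


def append(extra: bytes) -> Action:
    """Action: append data to the object."""
    return lambda d: d + extra


def replace(offset: int, new: bytes) -> Action:
    """Action: overwrite bytes starting at the given offset."""
    return lambda d: d[:offset] + new + d[offset + len(new):]


def apply_update(versions: List[Version],
                 update: List[Tuple[Predicate, Action]]) -> bool:
    """Apply the first action whose predicate holds on the latest version.

    Appends a new read-only version and returns True on commit;
    leaves the history untouched and returns False on abort.
    """
    latest = versions[-1]
    for predicate, action in update:
        if predicate(latest):
            versions.append(Version(latest.number + 1, action(latest.data)))
            return True
    return False


history = [Version(0, b"abc")]
committed = apply_update(history, [(version_is(0), append(b"def"))])
stale = apply_update(history, [(version_is(0), replace(0, b"X"))])
print(committed, stale, history[-1])
```

The second update is guarded by a version number that is no longer current, so it aborts without creating a version; this is the kind of check an application could use for optimistic concurrency.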
19 Application-level consistency (III)
- Predicate and action model
- Allows implementing multiple levels of consistency
- Atomic transactions satisfying ACID properties for database applications
- Weaker consistency for mailboxes
20 A footnote
- ACID properties of atomic transactions mean that atomic transactions
- Are Atomic
- Bring the database from one consistent state to another consistent state
- Isolate their partial results until the transaction is completed
- Guarantee the durability of final result
21 Virtualization through Tapestry
- OceanStore messages are addressed with a GUID
- Tapestry forwards these messages to a host containing a resource with that GUID
- Fully decentralized service
- Hosts can
- Join Tapestry by supplying their GUID
- Publish the GUIDs of the resources they have
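The join, publish, and route-by-GUID operations listed above can be mimicked with a toy in-memory directory. This sketch only illustrates the interface: a single dictionary stands in for Tapestry's fully decentralized routing, and the class and method names (`ToyLocator`, `join`, `publish`, `route`) are hypothetical.

```python
class ToyLocator:
    """Centralized stand-in for a GUID-based location service (not Tapestry itself)."""

    def __init__(self):
        self.hosts = set()   # GUIDs of hosts that have joined
        self.holders = {}    # resource GUID -> set of host GUIDs holding it

    def join(self, host_guid: str) -> None:
        """A host joins by supplying its GUID."""
        self.hosts.add(host_guid)

    def publish(self, host_guid: str, resource_guid: str) -> None:
        """A joined host announces that it holds a resource."""
        if host_guid not in self.hosts:
            raise KeyError("host must join before publishing")
        self.holders.setdefault(resource_guid, set()).add(host_guid)

    def route(self, resource_guid: str):
        """Return one host holding the resource, or None if nobody published it."""
        holders = self.holders.get(resource_guid)
        # min() just makes the toy deterministic; a real overlay would pick a nearby host.
        return min(holders) if holders else None


locator = ToyLocator()
locator.join("host-a")
locator.join("host-b")
locator.publish("host-b", "block-1234")
locator.publish("host-a", "block-1234")
print(locator.route("block-1234"))
print(locator.route("block-9999"))
```

The sender never names a host address: it addresses the message to the resource's GUID and the location service chooses a holder.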
22 Replication and consistency (I)
- Each object has a single primary replica
- Primary replica
- Serializes and applies all updates
- Creates a certificate (heartbeat) mapping AGUID of object to VGUID of its latest version
- Controls access to the object
23 Replication and consistency (II)
- Heartbeat contains
- An AGUID
- A VGUID
- A timestamp
- A version sequence number
- Getting the most recent version of object means
getting its most recent heartbeat
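A heartbeat with the four fields above can be modeled as a small record. The sketch below assumes that "most recent" means the highest version sequence number, with the timestamp as a tie-breaker, and it omits signature verification entirely; the names `Heartbeat` and `latest_version` are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Heartbeat:
    aguid: str        # identifies the data object
    vguid: str        # identifies the version this heartbeat certifies
    timestamp: float
    sequence: int     # version sequence number


def latest_version(aguid: str, heartbeats: Iterable[Heartbeat]) -> Optional[str]:
    """Return the VGUID named by the most recent heartbeat for this object.

    Assumes every heartbeat passed in has already been verified as coming
    from the primary replica; that check is not shown here.
    """
    relevant = [h for h in heartbeats if h.aguid == aguid]
    if not relevant:
        return None
    newest = max(relevant, key=lambda h: (h.sequence, h.timestamp))
    return newest.vguid


seen = [
    Heartbeat("obj-1", "v-aaa", 100.0, 1),
    Heartbeat("obj-1", "v-bbb", 105.0, 2),
    Heartbeat("obj-2", "v-ccc", 110.0, 7),
]
print(latest_version("obj-1", seen))
```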
24 The inner ring
- Small set of co-operating servers that manage primary replicas
- Implement a Byzantine fault-tolerant protocol to
- Agree on all updates to an object
- Digitally sign the result
25 Archival storage
- Stores object versions that are not frequently accessed
- Uses erasure codes
- Each block
- Partitioned into m fragments
- Encoded into n > m fragments
- Any subset of m fragments suffices to reconstitute the block
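The "any m of n fragments" property can be demonstrated with the simplest possible erasure code: m data fragments plus one XOR parity fragment, so n = m + 1. This is a deliberately simplified stand-in chosen for illustration, not the code used by the prototype; the function names `encode` and `decode` and the requirement that the block length be a multiple of m are assumptions of this sketch.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(block: bytes, m: int) -> list:
    """Split the block into m fragments and append one XOR parity fragment."""
    if len(block) % m != 0:
        raise ValueError("block length must be a multiple of m in this sketch")
    size = len(block) // m
    data = [block[i * size:(i + 1) * size] for i in range(m)]
    parity = reduce(xor_bytes, data)
    return data + [parity]  # n = m + 1 fragments; index m is the parity


def decode(fragments: dict, m: int) -> bytes:
    """Rebuild the block from any m of the m + 1 fragments.

    `fragments` maps fragment index (0..m) to fragment bytes.
    """
    if len(fragments) < m:
        raise ValueError("need at least m fragments")
    missing = [i for i in range(m) if i not in fragments]
    if missing:
        # Exactly one data fragment is absent, so the XOR of the m survivors
        # (the other data fragments and the parity) equals the lost fragment.
        lost = missing[0]
        fragments = {**fragments, lost: reduce(xor_bytes, fragments.values())}
    return b"".join(fragments[i] for i in range(m))


block = b"OceanStore!!"
fragments = encode(block, m=3)
survivors = {i: frag for i, frag in enumerate(fragments) if i != 1}
print(decode(survivors, m=3) == block)
```

A single parity fragment tolerates the loss of only one fragment; codes with larger n relative to m tolerate more losses at the cost of more storage.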
26 Caching of data objects
- Retrieving data from archive is slow
- OceanStore also maintains cached copies of whole blocks
- Secondary replicas
- Heartbeats always come from the primary replica
- Updates of secondary replicas are done through a dissemination tree
27 Path of an OceanStore update
[Figure: path of an update among the application, the primary replica in the inner ring, the archive, and three secondary replicas]
28 Updating primary replicas (I)
- Use a Byzantine fault-tolerant protocol
- Tolerates up to f failures in a system made up of 3f + 1 hosts
- Protocol authenticates messages with symmetric-key message authentication codes (MACs)
- Faster than using public keys
- Complicates the Byzantine agreement protocol
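The fault-tolerance bound and the use of symmetric-key MACs can be made concrete with a short sketch. The arithmetic follows directly from the 3f + 1 rule above; the MAC helper uses HMAC-SHA256 from the Python standard library purely as an example of a symmetric-key message authentication code, and the function names are hypothetical.

```python
import hashlib
import hmac


def min_hosts(f: int) -> int:
    """Smallest inner ring that tolerates f arbitrary (Byzantine) failures."""
    return 3 * f + 1


def max_faults(n: int) -> int:
    """Largest f such that n hosts satisfy n >= 3f + 1."""
    return (n - 1) // 3


def mac(key: bytes, message: bytes) -> bytes:
    """Symmetric-key authentication tag (HMAC-SHA256 assumed for illustration)."""
    return hmac.new(key, message, hashlib.sha256).digest()


def verify(key: bytes, message: bytes, tag: bytes) -> bool:
    """Check a tag using a constant-time comparison."""
    return hmac.compare_digest(mac(key, message), tag)


for n in (4, 7, 10):
    print(n, "hosts tolerate", max_faults(n), "faulty")

shared_key = b"pairwise key between two ring members"  # hypothetical key
tag = mac(shared_key, b"update #42")
print(verify(shared_key, b"update #42", tag))
print(verify(shared_key, b"update #43", tag))
```

A MAC can only be checked by someone who holds the shared key, which is why it is cheap inside a small group but cannot by itself convince outside parties.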
29 Updating primary replicas (II)
- Solution was to use
- Symmetric keys for all communications within the inner ring
- Public keys to communicate with all other machines
30 Proactive threshold signatures
31 Prototype software architecture
[Figure: components shown are the application, the client interface, the inner ring, Byzantine agreement, the dissemination tree/replicas, and the archive]
32 The prototype
33 Conclusion