Transcript and Presenter's Notes

Title: DStore: Recovery-friendly, self-managing clustered hash table


1
DStore: Recovery-friendly, self-managing
clustered hash table
  • Andy Huang and Armando Fox, Stanford University

2
Outline
  • Proposal
  • Why? The goal
  • What? The class of state we focus on
  • How? Technique for achieving the goal
  • Quorum algorithm and recovery results
  • Repartitioning algorithm and availability results
  • Conclusion

3
  • Why? What? and How?

4
Why? Simplify state management

                   Frontends, App Servers (SIMPLE)   DB/FS (COMPLEX)
  Configuration    plug-and-play                     repartition
  Recovery         simple, non-intrusive             minutes of unavailability
5
What? Non-transactional data
  • User preferences
    • Explicit: name, address, etc.
    • Implicit: usage statistics (Amazon's "items viewed")
  • Collaborative workflow data
    • Examples: insurance claims, human resources files

Spectrum of data: read-mostly (catalogs) → non-transactional read/write (user prefs, workflow data) → transactional (billing)
6
How? Decouple using a hash table and quorums
  • Hypothesis: A state store designed for
    non-transactional data can be decoupled so that
    it can be managed like a stateless system
  • Technique 1: Expose a hash table API (see the
    interface sketch below)
  • Repartitioning scheme is simple (no complex data
    dependencies)
  • Technique 2: Use quorums (read/write majority)
  • Recovery is simple (no special-case recovery
    mechanism)
  • Recovery is non-intrusive (data available
    throughout)
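
As a concrete picture of Technique 1, the hash table API exposed to app servers can be thought of as a minimal get/put interface. The sketch below is illustrative only; the names (HashTableAPI, get, put) are assumptions, not DStore's actual API.

    # Minimal sketch of the hash-table interface an app server would program
    # against. Names are illustrative; DStore's real API may differ.
    from abc import ABC, abstractmethod
    from typing import Optional

    class HashTableAPI(ABC):
        @abstractmethod
        def put(self, key: bytes, value: bytes) -> None:
            """Store value under key; replication happens behind this call."""

        @abstractmethod
        def get(self, key: bytes) -> Optional[bytes]:
            """Return the latest committed value for key, or None if absent."""

Because keys are independent, repartitioning only has to move key ranges, which is what keeps the scheme free of complex data dependencies.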

7
Architecture overview
  • Brick: stores data
  • Dlib: exposes the hash table API to app servers and
    executes quorum-based reads/writes on bricks (see
    the routing sketch below)
  • Replica groups: bricks storing the same portion
    of the key space form a replica group

[Architecture diagram: app servers running Dlib issue reads/writes to bricks organized into replica groups]
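
A sketch of how a Dlib-style library might map a key to its replica group by hashing into a partitioned key space; the modulo partitioning, the group table, and the brick names below are assumptions made for the example, not DStore's actual scheme.

    # Illustrative key-to-replica-group routing by hash partitioning.
    # The group table and brick names are made up for this sketch.
    import hashlib

    REPLICA_GROUPS = {
        0: ["brick-a", "brick-b", "brick-c"],
        1: ["brick-d", "brick-e", "brick-f"],
    }

    def bricks_for(key: bytes) -> list[str]:
        """Return the bricks responsible for this key's slice of the key space."""
        h = int.from_bytes(hashlib.sha1(key).digest(), "big")
        rgid = h % len(REPLICA_GROUPS)
        return REPLICA_GROUPS[rgid]

    print(bricks_for(b"user:42"))   # one of the two brick lists above
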
8
  • Quorum algorithm

9
Algorithm: Wavering reads
  • No two-phase commit (it complicates recovery and
    introduces coupling)
  • C1 attempts to write, but fails before completion
  • Quorum property violated: reading a majority
    doesn't guarantee the latest value is returned
  • Result: wavering reads, where successive reads can
    alternate between old and new values (illustrated
    below)
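
A toy illustration of the problem, assuming each replica holds a (timestamp, value) pair and the interrupted write of x1 reached only one brick: which value a majority read returns depends on which majority happens to be contacted.

    # Toy illustration of wavering reads after a partial write (values made up).
    replicas = {
        "A": (2, "x1"),   # the only brick that saw the failed write
        "B": (1, "x0"),
        "C": (1, "x0"),
    }

    def quorum_read(bricks):
        """Return the highest-timestamped value among the bricks contacted."""
        return max(replicas[b] for b in bricks)[1]

    print(quorum_read(["A", "B"]))   # -> 'x1'  (this majority includes A)
    print(quorum_read(["B", "C"]))   # -> 'x0'  (this majority misses A)
    # Successive majority reads can therefore waver between x0 and x1.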

10
Algorithm: Read writeback
  • Idea: commit a partial write when it is first read
    (see the sketch below)
  • Commit point: before it, reads return the old value
    x0; after it, reads return the new value x1
  • Proven linearizable under the fail-stop model
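
A minimal sketch of read-with-writeback under the same simplified (timestamp, value) representation: the reader takes the newest value it sees in a majority and writes it back to that majority before returning, which acts as the commit point for the interrupted write.

    # Sketch of read-with-writeback over timestamped replicas (simplified).
    replicas = {
        "A": (2, "x1"),   # only A saw the interrupted write of x1
        "B": (1, "x0"),
        "C": (1, "x0"),
    }

    def read_with_writeback(majority):
        """Read a majority, then write the newest value back to that majority."""
        ts, value = max(replicas[b] for b in majority)
        for b in majority:                    # writeback commits the partial write
            if replicas[b][0] < ts:
                replicas[b] = (ts, value)
        return value

    print(read_with_writeback(["A", "B"]))    # -> 'x1'; B now also holds x1
    # Every later majority overlaps a brick holding x1, so reads no longer waver.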

11
Algorithm: Crash recovery
  • Fail-stop is not an accurate model: it implies the
    client that generated the request fails permanently
  • With writeback, the commit point occurs sometime in
    the future
  • A writer expects its request to succeed or fail, not
    to remain in progress

12
Algorithm: Write in-progress
  • Requirement: a write must be committed/aborted on
    the next read
  • Record the in-progress write on the client (see the
    cookie sketch below)
  • On submit: write a start cookie
  • On return: write an end cookie
  • On read: if a start cookie has no matching end
    cookie, read all replicas
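
The start/end cookie bookkeeping might look like the sketch below, where the cookie stores are plain in-memory dicts purely for illustration; where DStore actually records the cookies is not shown here.

    # Sketch of client-side write-in-progress tracking with start/end cookies.
    import uuid

    start_cookies: dict[str, str] = {}    # key -> cookie written on submit
    end_cookies: dict[str, str] = {}      # key -> cookie written on return

    def submit_write(key: str) -> str:
        cookie = uuid.uuid4().hex
        start_cookies[key] = cookie       # on submit: write start cookie
        # ... issue the quorum write to the bricks here ...
        return cookie

    def complete_write(key: str, cookie: str) -> None:
        end_cookies[key] = cookie         # on return: write end cookie

    def read_policy(key: str) -> str:
        """If a start cookie has no matching end cookie, fall back to read-all."""
        if start_cookies.get(key) != end_cookies.get(key):
            return "read all"
        return "read majority"

    c = submit_write("user:42")           # writer crashes before completing...
    print(read_policy("user:42"))         # -> 'read all'
    complete_write("user:42", c)
    print(read_policy("user:42"))         # -> 'read majority'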

13
Algorithm: The common case
  • Write all, wait for a majority (sketched below)
  • Normally, all replicas perform the write
  • Read majority
  • Normally, replicas return non-conflicting values
  • Writeback is performed when a brick fails or when it
    is temporarily overloaded and missed some writes
  • Read all is performed when an app server fails
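
The common-case write path of "write all, wait for a majority" can be sketched with a thread pool that sends the write to every brick but returns success as soon as a majority acknowledges; send_to_brick is a stand-in for the real RPC and is an assumption of this sketch.

    # Sketch of "write all, wait for a majority"; the brick RPC is stubbed out.
    from concurrent.futures import ThreadPoolExecutor, as_completed

    BRICKS = ["brick-a", "brick-b", "brick-c"]

    def send_to_brick(brick: str, key: str, ts: int, value: str) -> bool:
        return True    # stub: a real version issues the write RPC and reports status

    def quorum_write(key: str, ts: int, value: str) -> bool:
        majority = len(BRICKS) // 2 + 1
        acks = 0
        with ThreadPoolExecutor(max_workers=len(BRICKS)) as pool:
            futures = [pool.submit(send_to_brick, b, key, ts, value) for b in BRICKS]
            for fut in as_completed(futures):
                if fut.result():
                    acks += 1
                if acks >= majority:
                    return True            # enough replicas hold the write
        return False                       # fewer than a majority succeeded

    print(quorum_write("user:42", 3, "x1"))   # -> True with the always-ack stub

Because only a majority is required, a failed or slow brick simply misses some writes, which is what makes recovery look like normal operation.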

14
  • Recovery results

15
Results: Simple, non-intrusive recovery
  • Normal operation: a majority must complete each write
  • Failure: if fewer than a majority of bricks fail,
    writes can still succeed
  • Recovery: equivalent to missing a few writes
    under normal operation
  • Simple: no special-case recovery code
  • Non-intrusive: data is available throughout

16
Benchmark: Simple, non-intrusive recovery
  • Benchmark
    • t = 60 sec: one brick killed
    • t = 120 sec: brick restarted
  • Summary
    • Data available during failure and recovery
    • Recovering brick restores throughput in seconds

17
Benchmark: Availability under performance faults
  • Fault causes
    • cache warming
    • garbage collection
  • Benchmark
    • Degrade one brick by wasting CPU cycles
  • Comparison
    • DStore: throughput remains steady
    • ROWA (read-one/write-all): throughput is throttled
      by the slowest brick

18
  • Repartitioning algorithm and availability results

19
Algorithm: Online repartitioning
  • Split the replica group ID (rgid), but announce both
  • Take a brick offline (this looks just like a failure)
  • Copy its data to the new brick
  • Change the rgid and bring both bricks online
    (sketched below)
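
The four repartitioning steps might be coded along these lines; the announcement table, brick names, and the data-copy stub are all assumptions made for the sketch, not DStore's implementation.

    # Sketch of online repartitioning following the four steps above.
    announced = {"brick-a": {"rg0"}, "brick-b": {"rg0"}, "brick-c": {"rg0"}}

    def copy_data(src: str, dst: str) -> None:
        pass    # stand-in for streaming the brick's share of the key space

    def repartition(old_brick: str, new_brick: str, old_rgid: str, new_rgid: str):
        # 1. Split the rgid but announce both, so requests for either ID
        #    still reach the existing replica group.
        announced[old_brick] = {old_rgid, new_rgid}
        # 2. Take the brick offline (to clients this looks like a failure,
        #    which the quorum handles as usual).
        del announced[old_brick]
        # 3. Copy its data to the new brick.
        copy_data(old_brick, new_brick)
        # 4. Change the rgids and bring both bricks online.
        announced[old_brick] = {old_rgid}
        announced[new_brick] = {new_rgid}

    repartition("brick-a", "brick-d", "rg0", "rg1")
    print(announced)   # brick-a serves rg0 again; brick-d now serves rg1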

20
Benchmark: Online repartitioning
  • Benchmark
    • t = 120 sec: group 0 repartitioned
    • t = 240 sec: group 1 repartitioned
  • Non-intrusive
    • Data available during the entire process
    • Appears as if a brick just failed and recovered
      (but there are now more bricks)

21
Conclusion
  • Goal: Simplify management of non-transactional
    data
  • Techniques: Expose a hash table API and use quorums
  • Results
    • Recovery is simple and non-intrusive
    • Repartitioning can be done fully online
  • Next steps
    • True plug-and-play: automatically repartition
      when bricks are added/removed (simplified by the
      hash table partitioning scheme)
  • Questions: andy.huang_at_stanford.edu