Title: HQ Replication: Efficient Quorum Agreement for Reliable Distributed Systems
1HQ Replication: Efficient Quorum Agreement for Reliable Distributed Systems
- James Cowling1, Daniel Myers1, Barbara Liskov1
- Rodrigo Rodrigues2, Liuba Shrira3
- 1MIT CSAIL
- 2INESC-ID and Instituto Superior Técnico
- 3Brandeis University
2Byzantine Fault Tolerance
- Reliable client-server distributed systems
- Server replicated across a group of replica machines
- General operations
- Bounded number f of Byzantine replicas
- Must ensure correct system state
- Consistent ordering of client operations
3State of the Art
- Approaches
- State Machine Replication BFT
- 3f+1 replicas
- Byzantine Quorums Q/U
- 5f+1 replicas
- Increased performance
- Degradation when writes contend
4Contributions
- Low overhead Byzantine Fault Tolerance
- Performance of Byzantine Quorums without 5f+1 replicas or contention degradation
- Hybrid Quorum scheme for Byzantine Fault Tolerance
- Quorum approach in normal-case
- Use Byzantine agreement to resolve write contention
5Outline
- Current Approaches
- HQ Replication
- BFT Improvements
- Performance Evaluation
- Conclusions
6State Machine Replication
- BFT (Castro and Liskov, TOCS '02)
- Operations ordered by primary
- Agreed upon by replicas
7Byzantine Quorums
- Q/U (Abd-El-Malek et al., SOSP '05)
- Client controlled protocol
- Replicas order operations independently
- Optimistic
- Best case one-phase protocol
- Worst case unbounded
- Randomized backoff
8Advantages/Disadvantages
- BFT
- Good
- 3f+1 replicas
- Bounded number of phases
- Bad
- Higher latency
- Quadratic communication
- Q/U
- Good
- Best-case performance
- One-phase write
- Low replica load
- Bad
- 5f+1 replicas
- Degraded performance when writes contend
9Outline
- Current Approaches
- HQ Replication
- Normal-case Protocol
- Contention Resolution
- BFT Improvements
- Performance Evaluation
- Conclusions
10HQ Replication
- 3f+1 replicas
- Supports general operations
- No all-to-all communication in normal-case
- BFT used to resolve contention
11HQ Replication
- One-phase read
- Two-phase write
12System Architecture
13High-level Write Protocol
- Two-phase write protocol
- Phase 1
- Client obtains timestamp grant from each replica
- Phase 2
- Client forms certificate from 2f+1 matching grants
- Sends to replicas to complete write
14Grants
- Promise to execute operation at given sequence number
- Assuming agreement from quorum
- Grant
- Client ID
- Object ID
- Hash over requested operation
- Sequence Number (timestamp)
- Replica signature
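As a rough illustration of the grant fields listed above, the following Python sketch models a grant as a record signed by the issuing replica. The field names, the `sign_grant` and `verify_grant` helpers, and the use of HMAC-SHA256 as a stand-in for a replica signature are assumptions made for illustration, not the HQ implementation.

```python
import hashlib
import hmac
from dataclasses import dataclass


@dataclass(frozen=True)
class Grant:
    client_id: int
    object_id: int
    op_hash: str     # hash over the requested operation
    timestamp: int   # sequence number promised for the operation
    replica_id: int  # which replica issued the grant
    signature: str   # stand-in for the replica signature (assumption: HMAC)


def _payload(client_id, object_id, op_hash, timestamp, replica_id) -> bytes:
    # Canonical byte encoding of the signed fields (hypothetical format).
    return f"{client_id}|{object_id}|{op_hash}|{timestamp}|{replica_id}".encode()


def sign_grant(key: bytes, replica_id: int, client_id: int, object_id: int,
               operation: bytes, timestamp: int) -> Grant:
    op_hash = hashlib.sha256(operation).hexdigest()
    payload = _payload(client_id, object_id, op_hash, timestamp, replica_id)
    sig = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return Grant(client_id, object_id, op_hash, timestamp, replica_id, sig)


def verify_grant(key: bytes, g: Grant) -> bool:
    payload = _payload(g.client_id, g.object_id, g.op_hash, g.timestamp, g.replica_id)
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, g.signature)
```

Changing any signed field, such as the timestamp, makes `verify_grant` fail, which is the property a client relies on when collecting grants.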
15Certificates
- Certificate
- Quorum (2f+1) matching grants
- Proves quorum of replicas agree to ordering of operation
- Uniquely identify client, operation and sequential ordering
- Existence of certificate precludes existence of conflicting certificate
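The certificate rule above can be sketched as a small function that looks for 2f+1 grants from distinct replicas that agree on client, object, operation hash and timestamp. The `Grant` tuple and `form_certificate` name are hypothetical, and signature checking is omitted to keep the sketch short.

```python
from collections import defaultdict
from typing import NamedTuple, Optional


class Grant(NamedTuple):
    client_id: int
    object_id: int
    op_hash: str
    timestamp: int
    replica_id: int


def form_certificate(grants: list[Grant], f: int) -> Optional[list[Grant]]:
    """Return 2f+1 matching grants from distinct replicas, or None."""
    quorum = 2 * f + 1
    by_content: dict[tuple, dict[int, Grant]] = defaultdict(dict)
    for g in grants:
        key = (g.client_id, g.object_id, g.op_hash, g.timestamp)
        by_content[key][g.replica_id] = g  # count each replica at most once
    for matching in by_content.values():
        if len(matching) >= quorum:
            ordered = sorted(matching.values(), key=lambda g: g.replica_id)
            return ordered[:quorum]
    return None
```

With f = 1, three matching grants from replicas 1, 2 and 3 form a certificate, while two matching grants plus one grant for a different operation do not.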
16Replica State
- Multiple independent objects
- State per-object
- Certificate supporting most recent write
- Operation status
- Active
- Write in progress, outstanding grant
- Quiescent
- No current write operation
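One way to picture the per-object replica state described above is the sketch below. The class and field names are hypothetical, and deriving the active/quiescent status from the presence of an outstanding grant is an assumption of this sketch rather than something the slides specify.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Status(Enum):
    QUIESCENT = "quiescent"  # no current write operation
    ACTIVE = "active"        # write in progress, outstanding grant


@dataclass
class ObjectState:
    value: object = None
    timestamp: int = 0                         # timestamp of most recent write
    write_cert: Optional[tuple] = None         # certificate supporting that write
    outstanding_grant: Optional[tuple] = None  # set while a write is in progress

    @property
    def status(self) -> Status:
        # Assumption: an object is active exactly when a grant is outstanding.
        if self.outstanding_grant is not None:
            return Status.ACTIVE
        return Status.QUIESCENT


@dataclass
class Replica:
    replica_id: int
    objects: dict = field(default_factory=dict)  # object_id -> ObjectState

    def state_of(self, object_id) -> ObjectState:
        # Objects are independent, so each gets its own state record.
        return self.objects.setdefault(object_id, ObjectState())
```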
17Write Phase 1
- Client sends write request to replicas
- If quiescent, replica assigns new grant to client
- If active, replica sends currently outstanding grant
- Several Possibilities
- All grants match
- Grants for different client
- Grants conflict
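The phase 1 behaviour above can be sketched for a single object as follows. `ReplicaObject`, `classify` and the returned labels are hypothetical names, and the sketch assumes the client has already collected a quorum of grants and that signatures have been checked. Grants that differ only because a replica is slow are also reported as a conflict here, which is a simplification.

```python
from typing import NamedTuple, Optional


class Grant(NamedTuple):
    client_id: int
    timestamp: int
    op: str
    replica_id: int


class ReplicaObject:
    def __init__(self, replica_id: int):
        self.replica_id = replica_id
        self.timestamp = 0  # timestamp of the last completed write
        self.outstanding: Optional[Grant] = None

    def phase1(self, client_id: int, op: str) -> Grant:
        if self.outstanding is None:
            # Quiescent: assign a new grant to this client.
            self.outstanding = Grant(client_id, self.timestamp + 1, op, self.replica_id)
        # Active: return the currently outstanding grant unchanged.
        return self.outstanding


def classify(grants: list[Grant], client_id: int, op: str) -> str:
    contents = {(g.client_id, g.timestamp, g.op) for g in grants}
    if len(contents) > 1:
        return "conflict"      # grants conflict: request resolution
    cid, _, granted_op = next(iter(contents))
    if cid == client_id and granted_op == op:
        return "match"         # all grants match: proceed to phase 2
    return "other_client"      # grants for a different client: writeback
```

For example, if replicas 1 and 2 grant to client 1 while replica 3 has already granted to client 2, `classify` reports a conflict for either client.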
18Isolated Write
19Isolated Write
20Isolated Write
Write A
Write A
Write A
21Isolated Write
Write A
Write A
Write A
22Isolated Write
Grant <1,1,A>1
Grant <1,1,A>2
Grant <1,1,A>3
23Isolated Write
Grant <1,1,A>1
Grant <1,1,A>2
Grant <1,1,A>3
Matching grants: Phase 2 write
24Isolated Write
Cert G1,G2,G3
Cert G1,G2,G3
Cert G1,G2,G3
Matching grants: Phase 2 write
25Isolated Write
execute A
Cert G1,G2,G3
Cert G1,G2,G3
execute A
Cert G1,G2,G3
execute A
26Isolated Write
Result A
Result A
Result A
27Isolated Write
Result A
result
Result A
Result A
Write Complete
28Incomplete Write
29Incomplete Write
30Incomplete Write
Write A
Write A
Write A
31Incomplete Write
Write A
Write A
Write A
32Incomplete Write
Grant <1,1,A>1
Grant <1,1,A>2
Grant <1,1,A>3
33Incomplete Write
Grant <1,1,A>1
Grant <1,1,A>2
Grant <1,1,A>3
Client 1 slow or failed
34Incomplete Write
Write B
Write B
Write B
35Incomplete Write
Grant <1,1,A>1
Grant <1,1,A>2
Grant <1,1,A>3
Replicas active: return current grant
36Incomplete Write
Grant <1,1,A>1
Grant <1,1,A>2
Grant <1,1,A>3
Grants for different client: perform writeback
37Incomplete Write
Cert G1,G2,G3, Write B
Cert G1,G2,G3, Write B
Cert G1,G2,G3, Write B
Grants for different client: perform writeback
38Incomplete Write
execute A
Cert G1,G2,G3, Write B
execute A
Cert G1,G2,G3, Write B
Cert G1,G2,G3, Write B
execute A
39Incomplete Write
Cert G1,G2,G3, Write B
Cert G1,G2,G3, Write B
Cert G1,G2,G3, Write B
40Incomplete Write
Grant <2,2,B>1
Grant <2,2,B>2
Grant <2,2,B>3
41Incomplete Write
Grant <2,2,B>1
Grant <2,2,B>2
Grant <2,2,B>3
Matching grants: Phase 2 write
42Write Contention
43Write Contention
Write A
44Write Contention
Write A
Write A
45Write Contention
Write A
Write A
Write A
Write B
46Write Contention
Write A
Write A
Write A
Write B
47Write Contention
Grant <1,1,A>1
Grant <1,1,A>2
Grant <2,1,B>3
48Write Contention
Grant <1,1,A>1
Grant <1,1,A>2
Grant <2,1,B>3
Conflicting grants: request resolution
49Write Contention
Resolve Request
Cert G1,G2,G3
Cert G1,G2,G3
Cert G1,G2,G3
Conflicting grants: request resolution
50Write Contention
Resolve Request
Cert G1,G2,G3
Cert G1,G2,G3
Cert G1,G2,G3
51Write Contention
Resolve Request
execute A
Cert G1,G2,G3
Cert G1,G2,G3
execute A
Cert G1,G2,G3
execute A
52Write Contention
Resolve Request
execute B
Cert G1,G2,G3
Cert G1,G2,G3
execute B
Cert G1,G2,G3
execute B
53Write Contention
Result A
Result A
Result A
54Write Contention
Result A
result
Result A
Result A
55Write Contention
Result B
Result B
Result B
56Write Contention
Result B
Result B
result
Result B
57Contention Resolution
- BFT module used to resolve contention
- Establish sequential order on contending ops
- On receiving resolve request
- Freeze local object state
- Send state to primary
- Primary runs BFT on combined state
- Replicas execute contending operations
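The resolution steps above can be pictured with the toy function below. It only shows the shape of the data flow: each replica reports the contending operation it froze on, and a single sequential order is derived from the combined state. The BFT agreement itself is not modelled, validation of the reported state is omitted, and ordering by client ID is an arbitrary choice made for this sketch; `resolve_contention` and its parameters are hypothetical names.

```python
def resolve_contention(frozen_states: dict[int, tuple[int, str]],
                       base_timestamp: int) -> list[tuple[int, int, str]]:
    """Derive one execution order from the replicas' frozen states.

    frozen_states maps replica_id -> (client_id, op) for the grant that
    replica had outstanding when it froze the object.
    Returns (timestamp, client_id, op) triples in execution order.
    """
    # Combined state: the distinct contending operations across replicas.
    contending = sorted(set(frozen_states.values()))  # assumption: order by client id
    # Assign consecutive timestamps so every replica executes the same sequence.
    return [(base_timestamp + i, cid, op) for i, (cid, op) in enumerate(contending)]
```

With replicas 1 and 2 frozen on client 1's write A and replica 3 frozen on client 2's write B, the result orders A before B, and every replica then executes both operations in that order.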
58Read Protocol
- Client sends read request to replicas
- Replica returns current object state
- Supported by previous write certificate
- Read complete if quorum of matching responses
- Writeback used to retry if responses inconsistent
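A client-side check for the read rule above might look like the following sketch. It assumes a quorum is 2f+1 replicas, as in the write protocol, and that each response is a hashable (timestamp, value) pair; the function name is hypothetical and certificate validation is omitted.

```python
from collections import Counter
from typing import Optional


def read_outcome(responses: list[tuple[int, object]],
                 f: int) -> Optional[tuple[int, object]]:
    """Return the (timestamp, value) backed by a quorum, or None.

    None means the responses were inconsistent and the client should
    perform a writeback and retry.
    """
    if not responses:
        return None
    quorum = 2 * f + 1
    best, count = Counter(responses).most_common(1)[0]
    return best if count >= quorum else None
```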
59Additional Details
- Read protocol
- State transfer
- Multi-object transactions
- Performance enhancements
60Performance Enhancements
- Preferred quorums
- Core protocol run by only 2f+1 replicas
- Symmetric-key cryptography
- Authenticators instead of signatures
- Collection of 3f+1 MACs
- $\langle m_{i,1}, m_{i,2}, \ldots, m_{i,n} \rangle$
- Lower CPU overhead
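To make the authenticator idea concrete, the sketch below builds a vector with one MAC per replica, so each replica checks only its own entry using a symmetric key. It assumes the sender shares a pairwise key with every replica and uses HMAC-SHA256 purely for illustration; the function names are hypothetical.

```python
import hashlib
import hmac


def make_authenticator(message: bytes, keys: dict[int, bytes]) -> dict[int, str]:
    """One MAC per replica: entry j is computed with the key shared with replica j."""
    return {rid: hmac.new(key, message, hashlib.sha256).hexdigest()
            for rid, key in keys.items()}


def check_entry(message: bytes, replica_id: int, key: bytes,
                authenticator: dict[int, str]) -> bool:
    """A replica verifies only its own entry of the authenticator."""
    tag = authenticator.get(replica_id)
    if tag is None:
        return False
    expected = hmac.new(key, message, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, tag)
```

With f = 1 there are 3f+1 = 4 replicas, so the authenticator carries four MACs. Each MAC is cheap to compute compared with a public-key signature, which is where the lower CPU overhead comes from.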
61BFT Improvements
- Preferred quorums
- Reduces degree of quadratic communication
- Single MAC per message
- Significant improvements over authenticators
62Outline
- Current Approaches
- HQ Replication
- BFT Improvements
- Performance Evaluation
- Analysis
- Experiments
- Conclusions
63Non-Contention Message Overhead
Messages sent/received at each replica per write
request
64Non-Contention Bandwidth Use
Total bandwidth at each replica per write request
65Experimental Setup
- HQ and BFT prototypes deployed on Emulab
- Up to 16 replicas (f = 5), 200 clients (4 per machine)
- New BFT codebase
- Implement counter service
- Negligible operation payload
- Multiple objects
- Private non-contention objects
- Shared contention object
66Non-contention Throughput
Maximum operation throughput
67Resilience to Contention
Throughput degradation with increasing
write-contention
68Resilience to Contention
new
Throughput degradation with increasing
write-contention
69BFT Batching
- BFT allows batching at primary
- Greatly reduces internal protocol communication
- Increased delay
[Figure: BFT message flow among Client, Primary, and Replicas 1-3, showing Request, Pre-Prepare, Prepare, Commit and Reply; annotated "once per batch"]
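A back-of-the-envelope calculation shows why batching reduces internal protocol communication. The sketch assumes the textbook three-phase pattern among n replicas (pre-prepare from the primary, then all-to-all prepare and commit) and ignores client requests, replies, preferred quorums and any other optimization; the function names are hypothetical.

```python
def bft_protocol_messages(n: int) -> int:
    """Internal messages for one agreement instance among n replicas."""
    pre_prepare = n - 1          # primary to each backup
    prepare = (n - 1) * (n - 1)  # each backup to every other replica
    commit = n * (n - 1)         # every replica to every other replica
    return pre_prepare + prepare + commit


def per_request_messages(n: int, batch_size: int) -> float:
    """Agreement runs once per batch, so its cost is shared by the batch."""
    return bft_protocol_messages(n) / batch_size
```

Under these assumptions, n = 4 gives 24 internal messages per agreement instance, which drops to 3 per request when 8 requests share one batch. The cost is that a request may wait for its batch to fill.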
70Batched Performance
Effect of BFT batching on maximum write throughput
71Recommendations
- Use Q/U when
- Latency critical
- Contention low
- 5f+1 replicas acceptable
- Use HQ when
- Low latency important
- Moderate contention
- Use BFT when
- Contention high
- Throughput more important than latency
72Conclusions
- First Byzantine Quorum protocol with 3f+1 replicas
- Supports general operations
- Resilient to Byzantine clients
- Introduced Hybrid technique
- Resolve contention without performance degradation
- Applicable to general quorum systems
- Found optimized BFT to perform well under high load
73 74Further Details
- HQ Replication: Properties and Optimizations
- James Cowling, Daniel Myers, Barbara Liskov, Rodrigo Rodrigues and Liuba Shrira. Technical Memo (in preparation), MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, Massachusetts, 2006.
- Contact
- cowling_at_csail.mit.edu
- http://people.csail.mit.edu/cowling/
75Write-back Operation
- Write certificate paired with a subsequent request
- Used to ensure progress with slow replicas or clients
- Completes phase 2 for a slow client
- Advances state of slow replicas
- Replica processes write phase 2 based on certificate, then the paired request
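The write-back step can be sketched for one object at one replica as below. The certificate is reduced to a (client_id, timestamp, op) triple that is assumed to be already validated, and all class and method names are hypothetical. A replica that is more than one write behind would need state transfer, which is not modelled.

```python
class ReplicaObject:
    def __init__(self, replica_id: int):
        self.replica_id = replica_id
        self.timestamp = 0       # timestamp of the last executed write
        self.log = []            # operations executed so far, in order
        self.outstanding = None  # (client_id, timestamp, op) or None

    def phase1(self, client_id: int, op: str):
        if self.outstanding is None:
            self.outstanding = (client_id, self.timestamp + 1, op)
        return self.outstanding

    def phase2(self, cert) -> None:
        _, ts, op = cert
        if ts == self.timestamp + 1:  # skip if this write was already executed
            self.log.append(op)
            self.timestamp = ts
            self.outstanding = None   # object becomes quiescent again

    def writeback(self, cert, client_id: int, op: str):
        # First complete the certified write, then treat the paired
        # request as a fresh phase 1 request.
        self.phase2(cert)
        return self.phase1(client_id, op)
```

This mirrors the incomplete-write example: client 2 receives grants for client 1's write A, sends the certificate together with its own write B, and each replica executes A and then issues a grant for B at the next timestamp. A slow replica that never saw the original request is brought up to date by the same call.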
77Backups
78Slow Replicas
- Some grants in quorum have old timestamp
- Perform writeback to slow replicas, using certificate provided with highest grant
- Brings replicas up to date and solicits new grants
79Why 3f1?
- 3f+1 replicas
- f of which can be faulty
- 2f+1 agree on any ordering
- f of these may be Byzantine
- The remaining f may be slow
- Maximum of 2f can respond with old system state, but not 2f+1
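The counting argument above can be checked with a few lines of arithmetic. The sketch computes the smallest possible overlap between two quorums of size 2f+1 out of 3f+1 replicas and how many non-faulty replicas that overlap must contain; the function names are hypothetical.

```python
def min_quorum_overlap(n: int, q: int) -> int:
    """Fewest replicas that any two quorums of size q out of n must share."""
    return max(0, 2 * q - n)


def honest_in_overlap(f: int) -> int:
    """Non-faulty replicas guaranteed in the overlap of two 2f+1 quorums."""
    n, q = 3 * f + 1, 2 * f + 1
    return min_quorum_overlap(n, q) - f  # at most f of the overlap are Byzantine


def quorum_always_available(f: int) -> bool:
    """With f replicas faulty or silent, 2f+1 can still respond."""
    n, q = 3 * f + 1, 2 * f + 1
    return n - f >= q
```

For any f, two quorums share at least f+1 replicas, so at least one shared replica is non-faulty. That shared replica is why 2f replicas may report old state but 2f+1 cannot.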
80- Won't HQ have a higher rate of contention than Q/U, since it is two-phase (higher latency)?
- No: the contention window lasts only from when the first replica receives the phase 1 request until the last replica receives it. Hence it is independent of the protocol being two-phase, and is actually smaller than in Q/U.