Brahms

- Byzantine-Resilient Random Membership Sampling

Edward (Eddie) Bortnikov, Maxim (Max) Gurevich, Idit Keidar, Gabriel (Gabi) Kliot, and Alexander (Alex) Shraer

Why Random Node Sampling

- Gossip partners
  - Random choices make gossip protocols work
- Unstructured overlay networks
  - E.g., among super-peers
  - Random links provide robustness, expansion
- Gathering statistics
  - Probe random nodes
- Choosing cache locations

The Setting

- Many nodes: n
  - 10,000s, 100,000s, 1,000,000s, ...
- Come and go
  - Churn
- Every joining node knows some others
  - Connectivity
- Full network
  - Like the Internet
- Byzantine failures

Byzantine Fault Tolerance (BFT)

- Faulty nodes (portion f)
  - Arbitrary behavior: bugs, intrusions, selfishness
  - Can choose their ids arbitrarily
  - No CA, but no panacea for Sybil attacks
- May want to bias samples
  - Isolate nodes, DoS nodes
  - Promote themselves, bias statistics

Previous Work

- Benign gossip membership
  - Small (logarithmic) views
  - Robust to churn and benign failures
  - Empirical studies: Lpbcast, Scamp, Cyclon, PSS
  - Analytical study: Allavena et al.
  - Never proven to give uniform samples
  - Spatial correlation among neighbors' views: PSS
- Byzantine-resilient gossip
  - Full views: MMR, MS, Fireflies, Drum, BAR
  - Small views, some resilience: SPSS
  - We are not aware of any analytical work

Our Contributions

- Gossip-based BFT membership
  - Tolerates a linear portion f of Byzantine failures
  - O(n^{1/3})-size partial views
  - Correct nodes remain connected
  - Mathematically analyzed, validated in simulations
- Random sampling
  - Novel memory-efficient approach
  - Converges to provably independent uniform samples

The view is not all bad - better than benign gossip

Brahms

- Sampling - local component
- Gossip - distributed component

(Diagram: the Gossip component maintains the view; its id stream feeds the Sampler, which outputs the sample.)

Sampler Building Block

- Input: a data stream, one element at a time
  - Biased: some values appear more than others
  - Used with the stream of gossiped ids
- Output: a uniform random sample
  - of the unique elements seen thus far
  - Independent of other Samplers
  - One element at a time (converging)

(Diagram: Sampler interface - next(id) in, sample() out.)

Sampler Implementation

- Memory: stores one element at a time
- Uses a random hash function h
  - drawn from a min-wise independent family [Broder et al.]
  - For every set X and every x ∈ X: Pr[ h(x) = min h(X) ] = 1/|X|

(Diagram: init chooses a random hash function; next keeps the id with the smallest hash so far; sample returns that id.)
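The slide's init / next / sample interface can be sketched in Python. A seeded SHA-256 stands in for a truly min-wise independent hash family; that substitution (and all names) are this sketch's assumptions, not the paper's construction:

```python
import hashlib
import os

class Sampler:
    """Sketch of the Sampler building block: keep the id whose
    (randomly seeded) hash is the smallest seen so far."""

    def init(self):
        # A fresh random seed plays the role of choosing a random
        # hash function h from a min-wise independent family.
        self.seed = os.urandom(16)
        self.min_hash = None
        self.elem = None

    def next(self, elem):
        # Consume one stream element; keep it if its hash is minimal.
        h = hashlib.sha256(self.seed + elem.encode()).digest()
        if self.min_hash is None or h < self.min_hash:
            self.min_hash, self.elem = h, elem

    def sample(self):
        # Uniform (in expectation) over the unique elements seen so far.
        return self.elem
```

Duplicates cannot bias the output: a repeated id hashes to the same value, so the sample depends only on the set of unique ids seen, not on how often each appears.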

Component S: Sampling and Validation

(Diagram: component S is an array of Samplers fed the id stream from gossip via init/next; a Validator per Sampler checks the sampled id using pings, re-initializing the Sampler if validation fails; together the Validators' outputs form the sample of S.)
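The structure above might be sketched as follows; `is_alive` is a hypothetical stand-in for the ping-based validation, and the Sampler is the seeded min-hash sketch from the previous slide:

```python
import hashlib
import os

class Sampler:
    # Keep the id minimizing a randomly seeded hash (min-hash sketch).
    def __init__(self):
        self.init()

    def init(self):
        self.seed = os.urandom(16)
        self.best = None  # (hash, id) pair

    def next(self, elem):
        h = hashlib.sha256(self.seed + elem.encode()).digest()
        if self.best is None or h < self.best[0]:
            self.best = (h, elem)

    def sample(self):
        return self.best[1] if self.best else None

class SamplingComponent:
    """Sketch of component S: a bank of independent Samplers fed by
    the gossiped id stream; a validator re-initializes any Sampler
    whose current sample stops responding to pings."""

    def __init__(self, size):
        self.samplers = [Sampler() for _ in range(size)]

    def next(self, elem):
        # Feed each gossiped id to every Sampler.
        for s in self.samplers:
            s.next(elem)

    def validate(self, is_alive):
        # is_alive(id) stands in for a real ping.
        for s in self.samplers:
            sid = s.sample()
            if sid is not None and not is_alive(sid):
                s.init()  # forget the dead id; resample from the stream

    def sample(self):
        return [s.sample() for s in self.samplers]
```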

Gossip Process

- Provides the stream of ids for S
- Needs to ensure connectivity
- Use a bag of tricks to overcome attacks

Gossip-Based Membership Primer

- Small (sub-linear) local view V
- V constantly changes - essential due to churn
- Typically evolves in (unsynchronized) rounds
- Push: send my id to some node in V
  - Reinforces underrepresented nodes
- Pull: retrieve the view from some node in V
  - Spreads knowledge within the network
- Allavena et al. 05: both are essential
  - Low probability of partitions and star topologies

Brahms Gossip Rounds

- Each round:
  - Send pushes and pulls to random nodes from V
  - Wait to receive pushes and pulls
  - Update S with all received ids
  - (Sometimes) re-compute V
    - Tricky! Beware of adversary attacks
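The round's local logic might look as follows; the names, the swamping threshold, and the re-computation rule are illustrative, not the paper's exact pseudocode:

```python
import random

def brahms_round(my_id, view, pushed_ids, pulled_views, sample_ids,
                 alpha, beta, view_size):
    """One local Brahms round (sketch). Inputs:
    pushed_ids   - ids received by push this round
    pulled_views - views returned by the nodes we pulled from
    sample_ids   - random ids drawn from the sample component S
    Returns (new_view, ids_to_stream_into_S)."""
    pulled_ids = [pid for v in pulled_views for pid in v if pid != my_id]
    stream = pushed_ids + pulled_ids  # every received id updates S

    n_push = int(alpha * view_size)
    n_pull = int(beta * view_size)
    n_hist = view_size - n_push - n_pull  # gamma share

    # Trick 2 (sketch): keep the old view when no pushes arrived or
    # when swamped by pushes (threshold here is illustrative).
    if not pushed_ids or len(pushed_ids) > 2 * n_push:
        return view, stream

    # Re-compute V: mix pushed, pulled, and history-sample ids.
    new_view = (random.sample(pushed_ids, min(n_push, len(pushed_ids)))
                + random.sample(pulled_ids, min(n_pull, len(pulled_ids)))
                + random.sample(sample_ids, min(n_hist, len(sample_ids))))
    return new_view, stream
```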

Problem 1: Push Drowning

(Animation: correct nodes push their ids - Push Alice, Push Bob, Push Carol, Push Dana, Push Ed - but the adversary's pushes - Push Mallory, Push MM, Push Malfoy - outnumber them, filling views with faulty ids M.)

Trick 1: Rate-Limit Pushes

- Use limited messages to bound faulty pushes system-wide
  - E.g., computational puzzles / virtual currency
- Faulty nodes can send a portion p of them
- Views won't be all bad

Problem 2: Quick Isolation

"Ha! She's out! Now let's move on to the next guy!"

(Animation: the adversary concentrates its pushes - Push Mallory, Push MM, Push Malfoy - on one victim until its view holds only faulty ids M, isolating it from the correct nodes, then moves on to the next target.)

Trick 2: Detection and Recovery

- Do not re-compute V in rounds when too many pushes are received
- Slows down isolation; does not prevent it

"Hey! I'm swamped! I better ignore all of 'em pushes"

(Animation: the swamped node ignores all pushes that round - Push Bob, Push Mallory, Push MM, Push Malfoy.)

Trick 3: Balance Pulls and Pushes

- Control the contribution of push (α|V| ids) versus the contribution of pull (β|V| ids)
  - Parameters α, β
- Pull-only → eventually all faulty ids
  - Pulls from faulty nodes return all faulty ids; pulls from correct nodes return some faulty ids
- Push-only → quick isolation of the attacked node
- Push ensures system-wide not-all-bad views
- Pull slows down (but does not prevent) isolation

Trick 4: History Samples

- The attacker influences both push and pull
- Feedback: γ|V| random ids from S
  - Parameters: α + β + γ = 1
- The attacker loses control - samples are eventually perfectly uniform

Yoo-hoo, is there any good process out there?

View and Sample Maintenance

(Diagram: the new view V mixes α|V| pushed ids, β|V| pulled ids, and γ|V| ids fed back from the sample S; all pushed and pulled ids also stream into S.)

Key Property

- Samples take time to help
  - Assume the attack starts when samples are empty
- With appropriate parameters
  - E.g.,
- Time to isolation > time to convergence
- Prove a lower bound on isolation time using tricks 1, 2, 3 (not using samples yet)
- Prove an upper bound on the time until some good sample persists forever
  - Self-healing from partitions

History Samples: Rationale

- Judicious use essential
  - Bootstrap, avoid slow convergence
  - Deal with churn
- With a little bit of history samples (10%) we can cope with any adversary
  - Amplification!

Analysis

- Sampling: mathematical analysis
- Connectivity: analysis and simulation
- Full system: simulation

Connectivity ⇒ Sampling

- Theorem: If the overlay remains connected indefinitely, samples are eventually uniform

Sampling ⇒ Connectivity Ever After

- The perfect sample of a Sampler with hash h: the id with the lowest h(id) system-wide
- If correct, it sticks once the Sampler sees it
- A correct perfect sample ⇒ self-healing from partitions ever after
- We analyze PSP(t): the probability of a perfect sample at time t
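The "sticks once seen" property can be checked with a tiny simulation; a seeded SHA-256 is this sketch's stand-in for the random hash h:

```python
import hashlib
import os
import random

def h(seed, x):
    # Seeded SHA-256 as a stand-in for the Sampler's random hash.
    return hashlib.sha256(seed + x.encode()).digest()

seed = os.urandom(16)
ids = [f"node{i}" for i in range(100)]
# The perfect sample: the id with the lowest h(id) system-wide.
perfect = min(ids, key=lambda x: h(seed, x))

best = None
saw_perfect = False
for x in random.sample(ids, len(ids)):  # stream: each id once, random order
    if best is None or h(seed, x) < h(seed, best):
        best = x
    if x == perfect:
        saw_perfect = True
    if saw_perfect:
        # Once the minimal-hash id has been seen, nothing can displace it.
        assert best == perfect
```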

Convergence to 1st Perfect Sample

- n = 1000
- f = 0.2
- 40 unique ids in stream

Scalability

- Analysis says
- For scalability, want small and constant convergence time
  - independent of system size, e.g., when

Connectivity Analysis 1: Balanced Attacks

- Attack all nodes the same
  - Maximizes faulty ids in views system-wide in any single round
- If repeated, the system converges to a fixed-point ratio of faulty ids in views, which is < 1 if
  - γ = 0 (no history) and p < 1/3, or
  - history samples are used (any p)

There are always good ids in views!

Fixed Point Analysis: Push

(Diagram: local views of correct nodes 1..i at times t and t+1; arrows show pushes, lost pushes, and pushes from faulty nodes.)

- x(t): portion of faulty ids in views at round t
- Portion of faulty pushes arriving at correct nodes:
  p / ( p + (1 - p)(1 - x(t)) )

Fixed Point Analysis: Pull

(Diagram: local views of correct nodes 1..i at times t and t+1; a pull from correct node i returns faulty ids with probability x(t); a pull from a faulty node returns only faulty ids.)

E[x(t+1)] ≈ α · p / ( p + (1 - p)(1 - x(t)) ) + β · ( x(t) + (1 - x(t)) · x(t) ) + γ · f
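Iterating this recurrence numerically illustrates the fixed-point behavior; the sketch below takes the history-sample term to contribute a faulty fraction f (samples assumed perfect), as in the analysis:

```python
def faulty_push_fraction(x, p):
    # Portion of pushes arriving at correct nodes that are faulty:
    # p / (p + (1 - p)(1 - x)).
    return p / (p + (1 - p) * (1 - x))

def expected_next_x(x, p, alpha, beta, gamma, f):
    # E[x(t+1)] = alpha * push term + beta * pull term + gamma * f.
    # Pull from a faulty node: all faulty; from a correct node: fraction x.
    pull = x + (1 - x) * x
    return (alpha * faulty_push_fraction(x, p)
            + beta * pull
            + gamma * f)

def fixed_point(p, alpha, beta, gamma, f, x0=0.0, rounds=500):
    # Iterate the recurrence from an all-correct initial state.
    x = x0
    for _ in range(rounds):
        x = expected_next_x(x, p, alpha, beta, gamma, f)
    return x
```

With α = β = 0.5, γ = 0 and p = 0.2 < 1/3, iterating from x = 0 settles near 0.64 - a fixed point strictly below the fully poisoned one at x = 1; adding γ = 0.1 of history samples pushes it lower still.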

Faulty Ids in the Fixed Point

- History samples assumed perfect in the analysis; real history used in simulations
- With a few history samples, any portion of bad nodes can be tolerated
- Fixed points and convergence perfectly validated

Convergence to Fixed Point

- n = 1000
- p = 0.2
- α = β = 0.5
- γ = 0

Connectivity Analysis 2: Targeted Attack

Roadmap

- Step 1: analysis without history samples
  - Isolation in logarithmic time
  - but not too fast, thanks to tricks 1, 2, 3
- Step 2: analysis of history sample convergence
  - Time-to-perfect-sample < time-to-isolation
- Step 3: putting it all together
  - Empirical evaluation
  - No isolation happens

Targeted Attack: Step 1

- Q: How fast (lower bound) can an attacker isolate one node from the rest?
- Worst-case assumptions
  - No use of history samples (γ = 0)
  - Unrealistically strong adversary
    - Observes the exact number of correct pushes and complements it to α|V|
  - Attacked node not represented initially
  - Balanced attack on the rest of the system

Isolation w/out History Samples

- n = 1000
- p = 0.2
- α = β = 0.5
- γ = 0

(Plot: isolation time for |V| = 60; depends on α, β, p.)

Step 2: Sample Convergence

- n = 1000
- p = 0.2
- α = β = 0.5, γ = 0
- 40 unique ids

(Plot: a perfect sample is obtained within 2-3 rounds; empirically verified.)

Step 3: Putting It All Together - No Isolation with History Samples

- n = 1000
- p = 0.2
- α = β = 0.45
- γ = 0.1

Works well despite small PSP

Sample Convergence (Balanced)

- p = 0.2
- α = β = 0.45
- γ = 0.1

Convergence twice as fast with

Summary

- O(n^{1/3})-size views
- Resist Byzantine failures of a linear portion
- Converge to provably uniform samples
- Precise analysis of the impact of failures

Balanced Attack Analysis (1)

- Assume (roughly) equal initial node degrees
- x(t): portion of faulty ids in correct nodes' views at time t
- Compute E[x(t+1)] as a function of x(t), p, α, β, γ
- Result 1: short-term optimality
  - Any non-balanced schedule yields a smaller expected x(t+1) in a single round

Balanced Attack Analysis (2)

- Result 2: existence of a fixed point X
  - E[x(t+1)] = x(t) = X
  - Analyze X (a function of p, α, β, γ)
  - Conditions for uniqueness
  - For α = β = 0.5 and p < 1/3, there exists X < 1
    - The view is not entirely poisoned; history samples are not essential
- Result 3: convergence to the fixed point
  - From any initial portion < 1 of faulty ids
  - Follows from Hillam 1975 (sequence convergence)