recall

← recall

split brain pattern

A network partition makes two halves of a cluster each believe they're the live one. Both accept writes; data diverges. The hardest distributed-systems failure to recover from cleanly.

A network partition makes two halves of a cluster each believe they're the live one. Both accept writes; data diverges. The hardest distributed-systems failure to recover from cleanly.

symptoms

  • writes accepted on both sides of a partition
  • diverged state after the partition heals
  • leader election flapping

causes

  • no quorum requirement for leadership
  • asymmetric failure detection
  • trusting heartbeats without majority confirmation

fixes

  • quorum-based leader election (only one half can have a quorum)
  • fencing tokens at the resource layer
  • STONITH / shoot-the-other-node-in-the-head

you might say

  • we split-brained
  • two leaders at once

related

aliases: dual primary

topics: replication, failure-modes

references: