chaos engineering pattern

Deliberately inject failures into production (or production-like) systems to verify that they actually handle those failures. The premise: you don't really know a system is resilient until you've broken it on purpose. Netflix's Chaos Monkey, which randomly terminates production instances, is the canonical example.
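A minimal sketch of the idea, assuming a single dependency call with a fallback path. Every name here (`InjectedFault`, `with_fault_injection`, `fetch_recommendations`, `handle_request`, `run_experiment`) is hypothetical and only for illustration; real tools inject faults at the infrastructure or network layer rather than in a wrapper like this.

```python
import random
from typing import Callable


class InjectedFault(ConnectionError):
    """Injected failure; subclasses ConnectionError so the handler under
    test cannot tell it apart from a real dependency outage."""


def with_fault_injection(call: Callable[[], str], failure_rate: float,
                         rng: random.Random) -> Callable[[], str]:
    """Wrap a dependency call so a fraction of invocations fail on purpose."""
    def wrapped() -> str:
        if rng.random() < failure_rate:
            raise InjectedFault("injected dependency failure")
        return call()
    return wrapped


def fetch_recommendations() -> str:
    # Hypothetical downstream dependency (assumption for this sketch).
    return "personalized"


def handle_request(dependency: Callable[[], str]) -> str:
    """The failure path being exercised: fall back to a default response."""
    try:
        return dependency()
    except ConnectionError:
        return "default"


def run_experiment(failure_rate: float, requests: int = 1000, seed: int = 0) -> dict:
    """Send requests through the faulty dependency and count fallbacks."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    flaky = with_fault_injection(fetch_recommendations, failure_rate, rng)
    results = [handle_request(flaky) for _ in range(requests)]
    return {"served": len(results), "fallbacks": results.count("default")}


if __name__ == "__main__":
    print(run_experiment(failure_rate=0.2))
```

If every request is still served while faults are being injected, the fallback path has been exercised rather than assumed to work.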

symptoms

  • untested failure paths failing in real incidents
  • over-confidence in resilience patterns that have never been exercised

causes

  • no failure injection in dev/test/prod
  • staging that doesn't mirror prod failure characteristics

fixes

  • game days with planned failure injection
  • tools: Chaos Monkey, Gremlin, Litmus, Kubernetes-native tooling
  • start small (one instance kill) and expand as the system holds
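The "start small and expand" fix can be sketched as a staged experiment that widens the blast radius only while a steady-state check keeps passing. This is a toy model, not how any of the listed tools work: `Cluster`, `min_healthy`, and `escalate` are hypothetical names, and the health check is a simple instance count standing in for real metrics.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Toy stand-in for a fleet of identical instances (hypothetical)."""
    total: int
    min_healthy: int  # instances needed to keep serving traffic
    down: set = field(default_factory=set)

    def kill(self, instance_id: int) -> None:
        self.down.add(instance_id)

    def restore_all(self) -> None:
        self.down.clear()

    def healthy(self) -> bool:
        # Steady-state check: enough instances remain to serve traffic.
        return self.total - len(self.down) >= self.min_healthy


def escalate(cluster: Cluster, stages: list[int]) -> int:
    """Kill `kill_count` instances per stage and stop at the first stage
    that breaks steady state. Returns the largest kill count that held."""
    largest_held = 0
    for kill_count in stages:
        for instance_id in range(kill_count):
            cluster.kill(instance_id)
        held = cluster.healthy()
        cluster.restore_all()  # always roll back before deciding what's next
        if not held:
            break  # abort: do not widen the blast radius any further
        largest_held = kill_count
    return largest_held


if __name__ == "__main__":
    fleet = Cluster(total=6, min_healthy=4)
    print(escalate(fleet, stages=[1, 2, 4]))
```

The rollback runs before the abort decision, so a failed stage never leaves the injected failure in place.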

you might say

  • chaos test it
  • inject latency
  • kill a pod and see what happens

related

aliases: chaos testing, fault injection

topics: resilience, operations
