chaos engineering pattern

Deliberately inject failures into production (or production-like) systems to verify the system actually handles them. The premise: you don't really know your system is resilient until you've broken it on purpose. Netflix's Chaos Monkey is the canonical example.

symptoms

untested failure paths failing in real incidents
over-confidence in resilience patterns that have never been exercised

causes

no failure injection in dev/test/prod
staging that doesn't mirror prod failure characteristics

fixes

game days with planned failure injection
tools: Chaos Monkey, Gremlin, Litmus, Kubernetes-native tools
start small (one instance kill) and expand as the system holds
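
A minimal sketch of the "start small and expand" loop, run against a toy in-memory fleet rather than a real system. Everything here is hypothetical and for illustration only: the Cluster class, the min_healthy threshold, and the run_experiment and game_day helpers are not part of any of the tools listed above. Each round checks a steady-state hypothesis, kills a number of instances equal to the current blast radius, and checks the hypothesis again.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Toy stand-in for a service fleet; all names here are hypothetical."""
    instances: set = field(default_factory=lambda: {"a", "b", "c", "d"})
    min_healthy: int = 2  # assumed capacity needed to keep serving traffic

    def kill(self, name: str) -> None:
        self.instances.discard(name)

    def steady_state_ok(self) -> bool:
        # Steady-state hypothesis: enough instances remain to serve requests.
        return len(self.instances) >= self.min_healthy


def run_experiment(cluster: Cluster, blast_radius: int, rng: random.Random) -> bool:
    """Kill `blast_radius` random instances, then re-check steady state."""
    if not cluster.steady_state_ok():
        # Never inject into a system that is already unhealthy.
        raise RuntimeError("system unhealthy before injection; abort")
    victims = rng.sample(sorted(cluster.instances), blast_radius)
    for name in victims:
        cluster.kill(name)
    return cluster.steady_state_ok()


def game_day(max_radius: int, seed: int = 0) -> int:
    """Grow the blast radius until the hypothesis fails.

    Returns the largest radius at which the system still held.
    """
    rng = random.Random(seed)
    held = 0
    for radius in range(1, max_radius + 1):
        cluster = Cluster()  # fresh fleet per round, as if instances were replaced
        if not run_experiment(cluster, radius, rng):
            break
        held = radius
    return held
```

With the assumed defaults (four instances, two required), game_day(4) stops expanding once a round leaves fewer than two instances. In a real game day the kill and the health check would be calls into the platform and its monitoring, and a failed round would trigger a rollback rather than a simple break.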

you might say

chaos test it
inject latency
kill a pod and see what happens
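
As a rough illustration of what "inject latency" can mean at the smallest scale, the sketch below wraps a function with an artificial delay and checks whether a caller's time budget still holds. The names inject_latency, fetch_profile, and call_with_budget are made up for this example; real tools usually inject delay at the network or proxy layer rather than in application code.

```python
import time
from functools import wraps


def inject_latency(delay_s: float, enabled: bool = True):
    """Decorator that sleeps for `delay_s` seconds before each call when enabled."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            if enabled:
                time.sleep(delay_s)  # the injected fault
            return fn(*args, **kwargs)
        return wrapper
    return decorator


@inject_latency(delay_s=0.05)
def fetch_profile(user_id: int) -> dict:
    # Hypothetical dependency call; returns immediately apart from the injected delay.
    return {"id": user_id}


def call_with_budget(fn, budget_s: float, *args):
    """Call `fn` and report whether it finished within `budget_s` seconds."""
    start = time.monotonic()
    result = fn(*args)
    within_budget = (time.monotonic() - start) <= budget_s
    return result, within_budget
```

Calling call_with_budget(fetch_profile, 0.01, 7) returns the profile along with a flag showing that the 10 ms budget was exceeded, which is the kind of signal a latency experiment is meant to surface before a real slowdown does.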

related

aliases: chaos testing, fault injection

topics: resilience, operations

references:

Principles of Chaos Engineering