chaos engineering pattern
Deliberately inject failures into production (or production-like) systems to verify the system actually handles them. The premise: you don't really know your system is resilient until you've broken it on purpose. Netflix's Chaos Monkey is the canonical example.
symptoms
- failure paths that were never tested break during real incidents
- over-confidence in resilience patterns that have never been exercised
causes
- no failure injection in any environment (dev, test, or prod)
- staging that doesn't mirror prod failure characteristics
fixes
- game days with planned failure injection
- tools: Chaos Monkey, Gremlin, Litmus, or Kubernetes-native tooling (e.g. deleting a pod with kubectl)
- start small (one instance kill) and expand as the system holds
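The "start small and expand" loop can be sketched as a toy simulation. This is a minimal illustration, not how any of the tools above work: Instance, Pool, steady_state, and run_experiment are hypothetical names, and the "system" is an in-memory pool with naive failover. Each round kills a few instances, checks a steady-state hypothesis (every request still gets a response), then rolls the kill back.

```python
import random


class Instance:
    """Hypothetical service instance that can be switched off."""

    def __init__(self, name):
        self.name = name
        self.alive = True

    def handle(self, request):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} handled {request}"


class Pool:
    """Toy load balancer: tries instances in order, skipping dead ones."""

    def __init__(self, instances):
        self.instances = instances

    def call(self, request):
        for inst in self.instances:
            try:
                return inst.handle(request)
            except ConnectionError:
                continue
        raise RuntimeError("no healthy instances")


def steady_state(pool, requests=20):
    """Steady-state hypothesis: every request gets a response."""
    try:
        for i in range(requests):
            pool.call(i)
    except RuntimeError:
        return False
    return True


def run_experiment(pool, kill_count, rng):
    """Kill kill_count random live instances, check steady state, restore."""
    live = [i for i in pool.instances if i.alive]
    victims = rng.sample(live, min(kill_count, len(live)))
    for v in victims:
        v.alive = False
    try:
        return steady_state(pool)
    finally:
        for v in victims:
            v.alive = True  # roll back the injected failure


if __name__ == "__main__":
    rng = random.Random(0)
    pool = Pool([Instance(f"web-{n}") for n in range(3)])
    # Start with one kill and widen the blast radius while the system holds.
    for kill_count in (1, 2, 3):
        ok = run_experiment(pool, kill_count, rng)
        print(f"killed {kill_count}: steady state {'held' if ok else 'broke'}")
        if not ok:
            break
```

With three instances, the hypothesis holds for one and two kills and breaks at three, which is where the loop stops expanding. In a real experiment the rollback step matters as much as the injection.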
you might say
- chaos test it
- inject latency
- kill a pod and see what happens
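"Inject latency" can be sketched in the same spirit. This is an illustrative assumption, not a specific tool's API: with_latency, fetch_price, and get_price are hypothetical names. The fault is a sleep wrapped around a dependency call, and the check is whether the caller's timeout and fallback actually engage.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def with_latency(func, delay_s):
    """Wrap func so every call sleeps delay_s first (the injected fault)."""
    def slowed(*args, **kwargs):
        time.sleep(delay_s)
        return func(*args, **kwargs)
    return slowed


def fetch_price(item):
    """Hypothetical downstream dependency."""
    return {"item": item, "price": 10}


def get_price(item, dependency, timeout_s=0.2):
    """Caller under test: gives up after timeout_s and returns a fallback."""
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(dependency, item)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        return {"item": item, "price": None, "degraded": True}
    finally:
        # Do not block on the slow call; let it finish in the background.
        executor.shutdown(wait=False)


if __name__ == "__main__":
    print(get_price("book", fetch_price))
    print(get_price("book", with_latency(fetch_price, 0.5)))
```

The first call returns the real price; the second exceeds the timeout and returns the degraded fallback. If the caller had no timeout, the same experiment would expose that as a hang.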