Chaos Engineering
The disciplined practice of injecting controlled failures into a running system to expose weaknesses before real incidents do — proactive resilience testing at production scale.
Definition
Chaos engineering is the discipline of deliberately injecting controlled failures — killed instances, dropped network packets, throttled databases, region outages — into a running production or production-like system to verify that its resilience mechanisms work. The purpose is not chaos; the purpose is confidence. A system whose failure modes have been rehearsed is meaningfully more reliable than one whose failure modes are theoretical.
Origin
Chaos engineering originated at Netflix in 2010 with Chaos Monkey, a tool that terminated randomly chosen virtual machine instances during business hours to force teams to build services that survived instance failure. The practice grew into a full suite (the Simian Army), was formalised by Netflix and the wider community in the Principles of Chaos Engineering, and is now practised at Amazon, Google, Microsoft, and many large financial institutions. Netflix, Amazon, and PagerDuty publish their approaches openly.
The Discipline
- Steady state hypothesis — define, in measurable terms, what "normal" looks like.
- Vary real-world events — instance loss, network latency, region failure.
- Run experiments in production, safely — with a small blast radius that can be immediately contained.
- Automate — so experiments run continuously rather than as one-off exercises.
- Minimise blast radius — start small, escalate deliberately.
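The principles above can be sketched as a small experiment runner. This is a minimal illustration rather than any particular tool's API: `SteadyState`, `run_experiment`, and the in-memory `ToyService` are hypothetical names, and the 1% error-rate threshold is an assumed value chosen only for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SteadyState:
    """A measurable definition of 'normal' (hypothetical metric and threshold)."""
    name: str
    probe: Callable[[], float]  # returns the current metric value
    max_value: float            # the hypothesis holds while probe() <= max_value

    def holds(self) -> bool:
        return self.probe() <= self.max_value


def run_experiment(steady: SteadyState,
                   inject: Callable[[], None],
                   rollback: Callable[[], None]) -> str:
    """Run one experiment and report 'skipped', 'passed' or 'failed'."""
    if not steady.holds():
        return "skipped"        # never inject into an already unhealthy system
    try:
        inject()
        return "passed" if steady.holds() else "failed"
    finally:
        rollback()              # always undo the fault, whatever the outcome


class ToyService:
    """In-memory stand-in for a real service with a few instances."""

    def __init__(self, instances: int = 3, resilient: bool = True):
        self.total = instances
        self.alive = instances
        self.resilient = resilient

    def error_rate(self) -> float:
        # A resilient service reroutes traffic away from dead instances;
        # a fragile one loses the dead instances' share of requests.
        if self.resilient:
            return 0.0
        return (self.total - self.alive) / self.total

    def kill_one(self) -> None:
        self.alive -= 1

    def restore(self) -> None:
        self.alive = self.total


fragile = ToyService(resilient=False)
steady = SteadyState("error rate", fragile.error_rate, max_value=0.01)
print(run_experiment(steady, fragile.kill_one, fragile.restore))  # failed
```

The `try`/`finally` guarantees the rollback runs even if the check itself raises, which is the smallest possible form of an abort mechanism. In a real system the probe would read from monitoring and the injection would go through whatever fault-injection tooling the team uses.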
Real-World IT Example
While preparing for Black Friday, a retail platform ran a "region evacuation" chaos test in September. The test simulated the loss of an entire AWS region for 20 minutes. Three weaknesses surfaced: (1) a caching layer had a hard-coded regional endpoint that prevented correct failover, (2) a session store did not replicate across regions as documented, and (3) the operations runbook for region evacuation was outdated. All three were fixed before November. On Black Friday itself, a real (unrelated) regional degradation occurred; the platform absorbed it with no customer impact, an outcome the team credited directly to the September chaos exercise.
Real-World Construction Analogue
The construction analogue is tabletop exercises and emergency response drills: deliberately simulating a scaffold collapse, a chemical spill, or a crane malfunction to test whether the site's real emergency procedures work. Mature refineries, offshore platforms, and mega-projects run these drills on a regular cadence, often quarterly, for the same reason chaos engineers run their experiments: rehearsed failure modes are far easier to recover from than theoretical ones. The vocabulary differs; the discipline is the same.
Best Practices
- Define steady state in measurable terms before running any experiment.
- Start with a small blast radius; expand only after previous experiments succeed.
- Communicate. Chaos experiments are not sneak attacks; teams should know they are running, even if not exactly when.
- Automate for continuous exercise. One-off drills produce one-off learning.
- Roll experiments back the moment steady state deviates beyond thresholds.
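The "start small, expand only after success, abort on deviation" practices can be sketched as a staged escalation loop. Everything here is illustrative: `escalate`, `toy_observation`, the stage fractions, and the 2% abort threshold are assumptions made for the example, not values from any real programme.

```python
from typing import Callable, List, Sequence, Tuple


def escalate(stages: Sequence[float],
             observe_error_rate: Callable[[float], float],
             abort_threshold: float) -> List[Tuple[float, float, str]]:
    """Run stages from smallest to largest and stop at the first breach.

    stages: fraction of instances or traffic affected at each step.
    observe_error_rate: injects the fault at that fraction and returns
        the error rate observed while the fault is active.
    """
    results = []
    for fraction in sorted(stages):
        rate = observe_error_rate(fraction)
        if rate > abort_threshold:
            results.append((fraction, rate, "aborted"))
            break               # do not expand after a failed stage
        results.append((fraction, rate, "passed"))
    return results


def toy_observation(fraction: float) -> float:
    """Toy model: faults on up to 10% of instances are absorbed, larger ones leak errors."""
    return 0.0 if fraction <= 0.10 else fraction - 0.10


for fraction, rate, status in escalate([0.01, 0.05, 0.25, 0.50],
                                       toy_observation,
                                       abort_threshold=0.02):
    print(f"{fraction:.0%} blast radius: {status}")
```

With the toy model, the 1% and 5% stages pass, the 25% stage aborts, and the 50% stage never runs. A real implementation would also roll the fault back at the point of abort and record the result so the failed stage can be repeated after a fix.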
Common Mistakes
- Running experiments without a defined steady state; you have no way to know whether the system passed.
- Skipping to a large blast radius — impressive-looking failures instead of controlled learning.
- No abort mechanism; a failed experiment becomes a real incident.
- Running once and not continuously; systems change, and so do failure modes.
- Confusing chaos engineering with load testing or penetration testing; each answers different questions.
Expert Tips
- Start in a staging environment. Production comes later, after confidence has been built.
- Involve product and operations, not just engineering. A chaos exercise that surfaces a runbook gap has done its job.
- Publish results. Chaos culture matures when learning is visible across the organisation.
- Tie experiments to real incidents. When an outage happens, ask what chaos experiment would have caught it.
- Do not build a "chaos team" in isolation. The practice belongs in the engineering teams that own the systems.
Practical Lessons Learned
- Mature chaos programmes typically find a weakness within their first ten experiments. If you have run ten experiments and found nothing, the experiments are probably too weak or the observations too shallow.
- Runbook gaps are surfaced more often than code gaps — the operations layer is where most surprises live.
- Continuous automated chaos catches regressions that human-run drills miss because change is faster than the drill cadence.
Key Takeaways
- Chaos engineering rehearses failure modes so real incidents become non-events.
- Steady-state hypothesis, real-world variation, safe experiments in production, automation, small blast radius: the five disciplines.
- Applies wherever operational resilience matters — cloud platforms, refineries, hospitals, mega-project sites.
- Continuous exercise beats one-off drills as change accelerates.
- Culture, not tooling, is the biggest determinant of success.
Frequently Asked Questions
Isn't chaos engineering just intentionally breaking production?
Only in the sense that a fire drill is intentionally setting off the alarm. The point is controlled rehearsal with defined steady state, small blast radius, and an abort mechanism. Uncontrolled failure injection is not chaos engineering; it is negligence.
How do you convince leadership to allow this?
Frame it in terms of the incidents already suffered and their cost. Chaos engineering is cheaper than the outage it prevents, and leadership responds to that arithmetic. Starting in staging environments and moving to production gradually also builds confidence.
How large should the first experiments be?
Small enough that a total failure would be recoverable in minutes, and small enough that no external customer would notice. A single instance, a single service call path, a single dependency. Expand only after multiple experiments pass at the smaller scale.
Do you need Netflix-scale tooling?
No. Simple scripts and open-source tools (Chaos Toolkit, LitmusChaos, AWS Fault Injection Service) are enough to start. Sophisticated platforms come later; the practice does not require them from day one.
How is this different from load testing?
Load testing asks whether the system handles volume. Chaos engineering asks whether it handles failure. Different questions, different tests, both needed for a resilient platform.
Do teams outside IT do chaos engineering?
Under different names, yes. Fire drills, emergency response exercises, business continuity tabletop tests, financial stress tests, medical simulation training: all are chaos engineering by another name. The underlying discipline (rehearse failure to make it recoverable) is universal.
How often should experiments run?
Continuously in mature programmes, with automated experiments running daily or weekly against important paths. Manual, larger-scope 'game days' happen quarterly. Change frequency drives experiment frequency: the faster the system evolves, the more often it needs testing.