Error Budget

The maximum acceptable unreliability of a service in a period, derived from its SLO — the governance mechanism that lets SRE teams balance velocity with reliability.

By Dr. Hassan Eliwa, PhD · Founder of PMMilestone.org and PMMilestone.com · Updated 2026-07-02

Definition

An error budget is the maximum amount of unreliability a service is permitted over a given period, calculated as (1 − SLO). If the Service Level Objective is 99.9% availability over 30 days, the error budget is 0.1% of that window, or about 43 minutes of downtime. As long as the service stays within its error budget, engineering teams are free to ship features quickly. When the budget is depleted, the organisation has committed in advance to slowing down and prioritising reliability work. Error budgets are the mechanism that lets Site Reliability Engineering (SRE) reconcile the perennial argument between "ship faster" and "be more stable".
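
As a minimal sketch of the arithmetic in this definition, the Python below converts an availability SLO and a window length into allowed downtime. The function name `downtime_budget_minutes` is illustrative only and does not come from any SRE library.

```python
def downtime_budget_minutes(slo_percent: float, window_days: int) -> float:
    """Allowed downtime in minutes for an availability SLO given in percent."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in (0, 100]")
    window_minutes = window_days * 24 * 60
    # Error budget = 1 - SLO, applied here to the window length in minutes.
    return (100 - slo_percent) / 100 * window_minutes


print(round(downtime_budget_minutes(99.9, 30), 1))  # 43.2
```

The same function gives the budget for other targets and windows, for example `downtime_budget_minutes(99.9, 90)` for a quarterly window.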

Real-World IT / Agile Example

A payments platform ran on a 99.95% availability SLO, which allows 21.6 minutes of downtime per 30-day window. In one quarter, two incidents totalling 34 minutes fell within the same window, and the error budget went negative. Per the team's operating agreement, all feature work paused for two weeks while the team closed the tooling and observability gaps that had contributed to the outages. The Product Manager was uncomfortable with the pause; the CTO was not, because the error-budget agreement had been signed six months earlier. Two months later, error-budget consumption had dropped by 60% and delivery velocity had recovered. The mechanism worked precisely because it took the argument out of the room: the response was pre-agreed.
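
The budget position in this example can be reproduced with a short sketch. It assumes both incidents count against a single 30-day window, and the split of the 34 minutes into 20 and 14 minutes is hypothetical, since the example gives only the total.

```python
def remaining_budget_minutes(slo_percent, window_days, incident_minutes):
    """Budget left after subtracting incident downtime; negative means overspent."""
    budget = (100 - slo_percent) / 100 * window_days * 24 * 60
    return budget - sum(incident_minutes)


# Hypothetical split of the 34 minutes of downtime into two incidents.
left = remaining_budget_minutes(99.95, 30, [20, 14])
print(round(left, 1))  # -12.4
```

A negative result is the trigger condition: the pre-agreed response starts as soon as the remaining budget drops below zero.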

Real-World Construction Example

The concept transfers cleanly to construction as a safety budget. A tunnelling contractor agreed with the owner that any month with more than two recordable incidents would trigger a mandatory stand-down and root-cause week. In the first year, this happened twice — both times, incident rates dropped for months afterwards. The mechanism removed the political question of "do we stand down?" It was a pre-agreed rule, not a negotiation each time. Same design pattern as an error budget: define the threshold in advance, commit to the response in advance, execute automatically when the threshold is breached.

How to Set an SLO and Error Budget

  1. Identify the critical user journeys (login, checkout, search).
  2. Set a Service Level Indicator (SLI) — the measurable signal (e.g. request success rate).
  3. Set the SLO — the target (e.g. 99.9% of requests succeed over a 30-day window).
  4. Derive the error budget (1 − SLO).
  5. Agree the consequence of budget depletion — what work pauses, for how long, who signs off.
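
Steps 2 to 4 can be sketched for a request-based SLI as follows. This is an illustration under stated assumptions, not a reference implementation: the `SloReport` type, the `evaluate_slo` function and the traffic figures are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    sli: float               # measured success ratio over the window
    allowed_failures: float  # error budget expressed in requests
    budget_consumed: float   # fraction of the budget used (can exceed 1.0)


def evaluate_slo(total_requests: int, failed_requests: int, slo: float) -> SloReport:
    """Compare a request success-rate SLI against an SLO given as a ratio (e.g. 0.999)."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    sli = (total_requests - failed_requests) / total_requests
    allowed = (1 - slo) * total_requests  # error budget = (1 - SLO) x volume
    consumed = failed_requests / allowed if allowed > 0 else float("inf")
    return SloReport(sli, allowed, consumed)


# Hypothetical 30-day traffic figures for one user journey.
report = evaluate_slo(total_requests=1_000_000, failed_requests=400, slo=0.999)
print(round(report.sli, 4), round(report.allowed_failures), round(report.budget_consumed, 2))
# 0.9996 1000 0.4
```

Step 5 is deliberately absent from the code: the consequence of depletion is an organisational agreement, and the code only supplies the number that triggers it.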

Expert Tips

  • Start with a modest SLO. 99.9% is a reasonable starting target for most services. 99.99% is genuinely expensive to hold.
  • Base the SLO on real user experience, not internal infrastructure metrics. A backend that is "up" but returns 500 errors is not up.
  • Publish the error-budget burn rate weekly. If half the budget is gone in the first week, that is a five-alarm fire.
  • Automate the response. When burn rate exceeds a threshold, page an engineer — do not wait for a monthly review.
  • Do not pay for more reliability than the SLO requires. As a rule of thumb, each additional 9 of availability costs roughly 10× as much as the previous one.
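
The burn-rate tips can be made concrete with a small sketch. It uses one common definition of burn rate (fraction of budget spent divided by fraction of window elapsed); the paging threshold of 2.0 is a hypothetical value chosen for illustration and would need tuning per service.

```python
def burn_rate(budget_consumed: float, elapsed_days: float, window_days: float = 30.0) -> float:
    """Budget fraction spent divided by window fraction elapsed.

    A value of 1.0 means the budget would run out exactly at the end of the window.
    """
    if not 0 < elapsed_days <= window_days:
        raise ValueError("elapsed_days must be in (0, window_days]")
    return budget_consumed / (elapsed_days / window_days)


PAGE_THRESHOLD = 2.0  # hypothetical paging threshold; tune per service


def should_page(rate: float, threshold: float = PAGE_THRESHOLD) -> bool:
    """True when the budget is burning faster than the agreed threshold."""
    return rate > threshold


# Half the budget gone after the first week of a 30-day window.
rate = burn_rate(budget_consumed=0.5, elapsed_days=7)
print(round(rate, 2), should_page(rate))  # 2.14 True
```

At a burn rate of about 2.14 the budget would be exhausted roughly two weeks into the window, which is why this case warrants a page and not a line in a monthly report.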

Common Mistakes

  • Setting SLOs from a marketing brief ("100% uptime!") instead of from real user tolerance.
  • Measuring availability from internal probes instead of real user requests.
  • Not committing in advance to what happens when the budget depletes.
  • Averaging across all customers when one customer's experience is far worse than the average.
  • Treating error budget as a target to hit, not a maximum not to exceed — "we still have budget left, let's use it up" is upside-down thinking.
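
The averaging mistake is easy to demonstrate. In the sketch below, the customer names and request counts are invented for illustration: the aggregate success rate meets a 99.9% SLO while one customer's experience is far below it.

```python
SLO = 0.999

# Hypothetical per-customer traffic: name -> (total requests, failed requests).
traffic = {
    "customer_a": (900_000, 90),
    "customer_b": (99_000, 10),
    "customer_c": (1_000, 200),
}


def success_ratio(total: int, failed: int) -> float:
    return (total - failed) / total


total = sum(t for t, _ in traffic.values())
failed = sum(f for _, f in traffic.values())
aggregate = success_ratio(total, failed)

# Evaluate the same SLI per customer to expose what the average hides.
per_customer = {name: success_ratio(t, f) for name, (t, f) in traffic.items()}
below_slo = [name for name, ratio in per_customer.items() if ratio < SLO]

print(round(aggregate, 4), below_slo)  # 0.9997 ['customer_c']
```

The aggregate looks healthy because the low-volume customer contributes only a small share of total requests, even though one in five of that customer's requests fails.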

Practical Lessons Learned

  • The politically hardest conversation in SRE is not defining an SLO; it is enforcing the budget-depleted response. Teams that write the enforcement rule into their operating charter succeed; teams that leave it to be argued in the moment fail.
  • A small number of very slow requests (the tail captured by 99th-percentile latency) hurts user experience more than most people expect. Latency SLOs deserve as much attention as availability SLOs.
  • Every quarter, review the SLO itself. Are customers happy? Are we spending more than we should to hold it? Adjust up or down accordingly.
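
The point about tail latency can be illustrated with a short sketch. The latency sample, the 500 ms threshold and the 99% target are all hypothetical; the percentile uses the simple nearest-rank method, which is one of several accepted definitions.

```python
import math


def percentile_nearest_rank(values, pct: int):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Hypothetical sample: 97 fast requests and 3 very slow ones, in milliseconds.
latencies_ms = [100] * 97 + [4000] * 3

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile_nearest_rank(latencies_ms, 50)
p99_ms = percentile_nearest_rank(latencies_ms, 99)

# Latency SLI: fraction of requests served within a 500 ms threshold.
within_threshold = sum(1 for v in latencies_ms if v <= 500) / len(latencies_ms)

print(mean_ms, p50_ms, p99_ms, within_threshold)  # 217.0 100 4000 0.97
```

The median says the service is fast and the mean looks tolerable, but the 99th percentile and the 0.97 ratio show that a "99% of requests under 500 ms" SLO would be breached.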

Key Takeaways

  • Error budget = 1 − SLO, expressed as time or events.
  • The mechanism balances speed of change with reliability.
  • The consequence of budget depletion must be agreed in advance.
  • Measure at the user, not the infrastructure.

Frequently Asked Questions

  • What is the relationship between SLA, SLO and error budget?
    An SLA is the contractual promise to customers (often with financial penalties). An SLO is the internal target, usually stricter than the SLA. The error budget is the gap between the SLO and 100%. Confusing them in a customer conversation is a common and expensive mistake.
  • Should I have one SLO per service?
    Start with one per critical user journey — usually 2–4 per service. Too many SLOs dilute focus; too few miss important user experiences. Review the set of SLOs annually, separately from the quarterly review of each target.
  • What if leadership refuses to honour the budget-depleted response?
    Then you do not have an error-budget policy, you have a spreadsheet. This is the single biggest cultural test in adopting SRE. If leadership will not commit in advance to the trade-off, the mechanism will not work — but the argument itself is valuable and often shifts the culture over time.
  • How do I calculate 99.9% availability?
    0.1% of 30 days ≈ 43.2 minutes of allowed downtime; 0.1% of 90 days ≈ 129.6 minutes. Choose the window that matches customer expectations — monthly for most SaaS products, quarterly for internal platforms.
  • Does this apply to batch systems?
    Yes, with a different SLI — success rate of batch runs, not availability. A nightly ETL might have an SLO of 'succeeds within 4 hours 99% of the time'. The error-budget mechanism applies identically.
  • How do we handle planned maintenance in the budget?
    Two schools of thought: (a) exclude planned maintenance from the calculation if customers were informed in advance; (b) include everything so the team is incentivised to minimise downtime windows too. School (b) is stricter and produces better long-term reliability; school (a) is more politically viable in most orgs.
  • Can construction teams really use error budgets?
    The direct analogue is a safety incident budget or defect budget, not availability. The mechanism — pre-agreed threshold, pre-agreed response, automatic enforcement — works identically. I have seen it applied on tunnelling, marine, and hospital fit-out projects with measurable benefit.
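
For the batch-system question in the FAQ, a minimal sketch of a run-based SLI might look like the following. The run history is invented for illustration, and `batch_sli` is a hypothetical helper; the 4-hour deadline and 99% target are the ones quoted in the answer.

```python
from datetime import timedelta

SLO = 0.99
DEADLINE = timedelta(hours=4)


def batch_sli(runs, deadline: timedelta = DEADLINE) -> float:
    """Fraction of runs that succeeded within the deadline.

    `runs` is a list of (succeeded, duration) pairs; a run that succeeds
    late counts against the budget just like a failed run.
    """
    if not runs:
        raise ValueError("runs must not be empty")
    good = sum(1 for ok, duration in runs if ok and duration <= deadline)
    return good / len(runs)


# Hypothetical history of 100 nightly runs.
runs = (
    [(True, timedelta(hours=3))] * 97    # on time
    + [(True, timedelta(hours=5))] * 2   # succeeded, but late
    + [(False, timedelta(hours=1))]      # failed outright
)

sli = batch_sli(runs)
allowed_bad_runs = round((1 - SLO) * len(runs))
print(sli, allowed_bad_runs, sli >= SLO)  # 0.97 1 False
```

With a budget of one bad run per hundred and three bad runs recorded, the budget is overspent and the pre-agreed response applies exactly as it would for an availability SLO.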
