Incident Management
The structured response to unplanned events that disrupt service or safety — detection, declaration, containment, communication, resolution, and learning, all run on a playbook rather than improvised under pressure.
Definition
Incident management is the disciplined response to events that disrupt normal operations — a production outage in IT, a safety event on a construction site, a quality failure in manufacturing. The objective is to minimise harm, restore service, communicate with affected parties, and capture lessons that prevent recurrence. It is one of the most under-invested capabilities in most organisations and one of the highest-leverage when done well.
Why It Matters
The cost of an incident is rarely the direct outage cost. It is the reputational damage, the regulatory consequence, the trust erosion with customers and staff, and the opportunity cost of senior leaders pulled into a war room for days. Mature incident management does not prevent every incident — that is impossible — but it shrinks duration, contains blast radius, and converts events into durable improvements.
Core Components
- Detection. Monitoring, alerting, customer reports. Detection latency is half the battle.
- Declaration. A formal "incident is declared" moment with a severity classification.
- Roles. Incident commander, communications lead, technical lead, scribe. Defined in advance.
- Communication. Internal updates on a known cadence; external updates appropriate to the audience.
- Containment and resolution. Stop the bleeding first, then fix the underlying issue.
- Post-incident review. Blameless, structured, with action items that have owners and dates.
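The declaration and role components above can be made concrete with a small record type. The following is a minimal sketch, not a reference implementation: the `Incident` class, the `declare` function, and the role keys are hypothetical names chosen for illustration, and the rule that all four roles must be filled before declaration is an assumption rather than something this entry prescribes.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# The four roles named in this entry; the key spellings are illustrative.
REQUIRED_ROLES = ("incident_commander", "communications_lead", "technical_lead", "scribe")


@dataclass
class Incident:
    title: str
    severity: str
    roles: dict[str, str]
    # Timestamp of the formal "incident is declared" moment.
    declared_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


def declare(title: str, severity: str, roles: dict[str, str]) -> Incident:
    """Declare an incident, refusing if any required role is unfilled."""
    missing = [r for r in REQUIRED_ROLES if not roles.get(r)]
    if missing:
        raise ValueError(f"cannot declare incident, unfilled roles: {', '.join(missing)}")
    return Incident(title, severity, dict(roles))
```

In practice a team might allow one person to hold two roles on a small incident; the point of the check is only that the roles are assigned explicitly rather than assumed.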
Severity Classification
- SEV-1. Major impact, all-hands response, regulator notification possible.
- SEV-2. Significant impact, dedicated team, executive awareness.
- SEV-3. Localised impact, normal working-hours response.
- SEV-4. Minor degradation, ticket-tracked.
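One way to keep the four levels consistent is to encode them as data and apply an explicit triage rule. The sketch below is illustrative only: the response and escalation text comes from the list above, but the `classify` function, its two inputs, and its percentage thresholds are hypothetical assumptions that each organisation would need to set for itself.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1  # major impact
    SEV2 = 2  # significant impact
    SEV3 = 3  # localised impact
    SEV4 = 4  # minor degradation


@dataclass(frozen=True)
class ResponsePolicy:
    response: str
    escalation: str


# Policy text paraphrases the four levels listed above.
POLICIES = {
    Severity.SEV1: ResponsePolicy("all-hands response", "regulator notification possible"),
    Severity.SEV2: ResponsePolicy("dedicated team", "executive awareness"),
    Severity.SEV3: ResponsePolicy("normal working-hours response", "none"),
    Severity.SEV4: ResponsePolicy("ticket-tracked", "none"),
}


def classify(users_affected_pct: float, safety_or_regulatory: bool) -> Severity:
    """Hypothetical triage rule; the thresholds are assumptions, not a standard."""
    if safety_or_regulatory or users_affected_pct >= 50:
        return Severity.SEV1
    if users_affected_pct >= 10:
        return Severity.SEV2
    if users_affected_pct >= 1:
        return Severity.SEV3
    return Severity.SEV4
```

Writing the rule down, even a crude one, makes severity calls comparable between responders and gives the post-incident review something specific to adjust.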
Real-World IT Example
A SaaS platform suffered a database failover bug that took its API down for 47 minutes during European business hours. The on-call engineer declared a SEV-1 within 4 minutes. The incident commander coordinated three engineers in a dedicated channel. The communications lead posted status-page updates every 15 minutes, each signed by name and stating what was known and what was not. Service was restored at 38 minutes; the remaining 9 minutes were spent on verification. The post-incident review the next day generated seven action items, three of which closed off the underlying class of bug. Customer churn following the incident was below the long-run average, which suggests the communication outweighed the disruption.
Real-World Construction Example
On a $1.2B refinery turnaround, a hydrocarbon leak was detected at 03:14. The site emergency plan triggered immediate isolation, evacuation of the affected zone, and notification of the local authority. The incident commander — a duty shift manager — ran a structured response while operations leadership joined within 20 minutes. Total time from detection to safe isolation: 7 minutes. There were no injuries. The investigation found a corrosion-monitoring gap that had been visible in inspection data for three weeks but had not triggered escalation. The fix was a change to the monitoring threshold; the durable value was a step-change in how inspection anomalies were reviewed.
Common Mistakes
- No declared incident commander. Five engineers all trying to lead is no leadership.
- Solving before communicating. Customers without information assume the worst.
- Blame culture in post-incident reviews. Engineers stop reporting incidents.
- Action items without owners or dates. Lessons evaporate within a month.
- Skipping the review for "small" incidents. Small incidents are the early warning for big ones.
- Severity drift. Calling everything SEV-2 erodes meaning; calling nothing SEV-1 erodes urgency.
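The mistake of action items without owners or dates is easy to check for mechanically. The following is a rough sketch under stated assumptions: `ActionItem` and `audit` are hypothetical names, and the fields shown are a guess at the minimum a tracker would need, not a description of any particular tool.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    description: str
    owner: Optional[str] = None
    due: Optional[date] = None
    done: bool = False


def audit(items: list[ActionItem], today: date) -> tuple[list[ActionItem], list[ActionItem]]:
    """Return (untracked, overdue).

    untracked: items missing an owner or a due date.
    overdue:   fully tracked, still-open items whose due date has passed.
    """
    untracked = [i for i in items if not i.owner or i.due is None]
    overdue = [i for i in items if i.owner and i.due and not i.done and i.due < today]
    return untracked, overdue
```

Running a check like this at the end of each review, and again weekly, is one way to stop lessons from evaporating.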
Expert Tips
- Run incident-response drills (game days) quarterly. The first real incident is not the time to discover the runbook is out of date.
- Measure mean-time-to-detect, mean-time-to-acknowledge, and mean-time-to-resolve separately. Each tells a different story.
- Treat the status page as a product. Calm, frequent, honest updates outperform polished, late ones.
- Rotate the incident commander role across engineering. The skill is leadership under uncertainty; it pays dividends everywhere.
- Connect incident learnings to risk management — patterns reveal where the risk register is wrong.
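The three response metrics mentioned above can be computed from four timestamps per incident. This is a minimal sketch with hypothetical names (`IncidentTimes`, `response_metrics`). It assumes time-to-detect and time-to-resolve are both measured from the start of impact and time-to-acknowledge from detection; definitions vary between organisations, so treat these as one reasonable choice.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class IncidentTimes:
    started: datetime       # impact began
    detected: datetime      # first alert or customer report
    acknowledged: datetime  # a responder took ownership
    resolved: datetime      # service restored


def _minutes(later: datetime, earlier: datetime) -> float:
    return (later - earlier).total_seconds() / 60


def response_metrics(incidents: list[IncidentTimes]) -> dict[str, float]:
    """Mean time to detect, acknowledge, and resolve, in minutes."""
    return {
        "MTTD": mean(_minutes(i.detected, i.started) for i in incidents),
        "MTTA": mean(_minutes(i.acknowledged, i.detected) for i in incidents),
        "MTTR": mean(_minutes(i.resolved, i.started) for i in incidents),
    }
```

Keeping the three figures separate shows where the time goes: a high MTTD points at monitoring, a high MTTA at on-call process, and a high MTTR at diagnosis and repair.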
Practical Lessons Learned
- The best incident-response cultures I have seen are calm. Loud war rooms produce slower decisions than focused, structured channels.
- In my experience, the post-incident review is where roughly 80% of the long-term value lives. Treat it as a leadership obligation, not a chore.
- External communication is almost always under-invested. Customers forgive incidents; they do not forgive silence.
Key Takeaways
- Incident management is a discipline, not a heroic skill — playbooks beat improvisation every time.
- Clear roles (incident commander, comms lead, technical lead, scribe) prevent leaderless chaos.
- Communicate early, often, and honestly — customers forgive incidents but not silence.
- The post-incident review is where lasting value is captured; make it blameless, structured, and tracked.
- Track MTTD, MTTA, and MTTR separately — each tells a different story about response health.
Frequently Asked Questions
What is the difference between an incident and a problem?
An incident is a single disruptive event; a problem is the underlying cause that may produce many incidents.
How often should we run incident drills?
Quarterly at minimum for critical services. Monthly is common for high-stakes platforms.
Who should be the incident commander?
A trained on-call engineer or duty manager. Not necessarily the most senior person, but the person whose role is to lead the response.
Should we publish post-incident reviews?
For customer-impacting incidents, many mature organisations do. Honesty builds trust faster than polish.
How do we avoid blame in reviews?
Focus on systems and decisions, not individuals. Ask 'what made this decision look reasonable at the time?'
How long should a post-incident review take?
60–90 minutes for a SEV-1, 30 minutes for a SEV-3. Longer than that and people stop attending.
Are construction and IT incident management really the same?
The technical specifics differ; the structure (detection, declaration, roles, communication, review) is identical.
How does Dr. Hassan Eliwa's research treat Incident Management?
Dr. Hassan Eliwa's research focuses on owner-side project controls, schedule integrity and forensic delay analysis on capital construction and power programmes. Incident Management is treated through that lens: what a planning or controls engineer is expected to do with it on a live project, not its textbook definition alone.