Skip to content

Incident Report Template

Copy this template to create new incident reports. Replace all placeholder text with actual incident details.

Incident Summary

  • Date: YYYY-MM-DD
  • Cluster/Environment: [production/staging/development - specify cluster name]
  • Status: [Resolved/Investigating/Mitigated]
  • Severity: [P0/P1/P2/P3]
  • Duration: [Total time from start to resolution]
  • Detection Delay: [Time between incident start and detection]

Impact

User Impact

  • [Describe what users experienced]
  • [Number of affected users/services/requests]
  • [Customer-facing symptoms]

Business Impact

  • [Revenue/business process impact]
  • [SLA/SLO violations]
  • [Downstream service effects]

Timeline

Start Phase

  • HH:MM - [Initial symptom or trigger event]
  • HH:MM - [System behavior changes]

Detection Phase

  • HH:MM - [How incident was detected - alerts, user reports, etc.]
  • HH:MM - [Initial triage and assessment]

Investigation Phase

  • HH:MM - HH:MM - [Investigation activities and findings]
  • HH:MM - [Key discovery or breakthrough]
  • HH:MM - [Additional investigation steps]

Resolution Phase

  • HH:MM - [Mitigation steps taken]
  • HH:MM - HH:MM - [Implementation of fix]
  • HH:MM - [Service restoration confirmed]

End Phase

  • HH:MM - [All-clear given]
  • HH:MM - [Monitoring confirmed normal operations]

Summary Metrics

  • Total Duration: [HH:MM from start to resolution]
  • Detection Delay: [HH:MM from start to detection]

Root Cause Analysis

Immediate Cause

[What directly caused the incident to occur]

Contributing Factors

  • [Environmental factors]
  • [Process gaps]
  • [Technical debt]

Systemic Issues

[Underlying problems that allowed this incident to happen or made it worse]

What Could Be Improved

Prevention

  • [ ] [Action to prevent similar incidents]
  • [ ] [Process or system improvements]

Detection

  • [ ] [Monitoring or alerting improvements]
  • [ ] [Earlier warning systems]

Response

  • [ ] [Faster response procedures]
  • [ ] [Better escalation paths]

Next Steps

Immediate Actions (< 24h)

  • [ ] [Urgent follow-up items with owner and deadline]
  • [ ] [Critical monitoring or fixes]

Short-term Actions (< 1 week)

  • [ ] [Process improvements with owner and deadline]
  • [ ] [Documentation updates]

Long-term Actions (< 1 month)

  • [ ] [System improvements with owner and deadline]
  • [ ] [Technical debt reduction]
  • [Link to relevant runbooks]
  • [Link to system architecture docs]
  • [Link to previous similar incidents]