Incident Report Template
Copy this template to create new incident reports. Replace all placeholder text with actual incident details.
Incident Summary
- Date: YYYY-MM-DD
- Cluster/Environment: [production/staging/development - specify cluster name]
- Status: [Resolved/Investigating/Mitigated]
- Severity: [P0/P1/P2/P3]
- Duration: [Total time from start to resolution]
- Detection Delay: [Time between incident start and detection]
Impact
User Impact
- [Describe what users experienced]
- [Number of affected users/services/requests]
- [Customer-facing symptoms]
Business Impact
- [Revenue/business process impact]
- [SLA/SLO violations]
- [Downstream service effects]
Timeline
Start Phase
- HH:MM - [Initial symptom or trigger event]
- HH:MM - [System behavior changes]
Detection Phase
- HH:MM - [How incident was detected - alerts, user reports, etc.]
- HH:MM - [Initial triage and assessment]
Investigation Phase
- HH:MM - HH:MM - [Investigation activities and findings]
- HH:MM - [Key discovery or breakthrough]
- HH:MM - [Additional investigation steps]
Resolution Phase
- HH:MM - [Mitigation steps taken]
- HH:MM - HH:MM - [Implementation of fix]
- HH:MM - [Service restoration confirmed]
End Phase
- HH:MM - [All-clear given]
- HH:MM - [Monitoring confirmed normal operations]
Summary Metrics
- Total Duration: [HH:MM from start to resolution]
- Detection Delay: [HH:MM from start to detection]
Root Cause Analysis
Immediate Cause
[What directly caused the incident to occur]
Contributing Factors
- [Environmental factors]
- [Process gaps]
- [Technical debt]
Systemic Issues
[Underlying problems that allowed this incident to happen or made it worse]
What Could Be Improved
Prevention
- [ ] [Action to prevent similar incidents]
- [ ] [Process or system improvements]
Detection
- [ ] [Monitoring or alerting improvements]
- [ ] [Earlier warning systems]
Response
- [ ] [Faster response procedures]
- [ ] [Better escalation paths]
Next Steps
Immediate Actions (< 24h)
- [ ] [Urgent follow-up items with owner and deadline]
- [ ] [Critical monitoring or fixes]
Short-term Actions (< 1 week)
- [ ] [Process improvements with owner and deadline]
- [ ] [Documentation updates]
Long-term Actions (< 1 month)
- [ ] [System improvements with owner and deadline]
- [ ] [Technical debt reduction]
Related Documentation
- [Link to relevant runbooks]
- [Link to system architecture docs]
- [Link to previous similar incidents]