Incident Documentation Guidelines
This file provides specific guidance for writing incident reports in the docs/incidents/ directory.
Content Guidelines
Essential Sections
- Hook - An engaging opening that makes readers curious about what happened and what they can learn from the incident
- Incident Summary - Include date, cluster/environment, status
- Impact - Focus on user and business impact
- Timeline - Use structured format with clear phases
- Root Cause Analysis - Both immediate and systemic causes
- What Could Be Improved - Focus on prevention and detection
- Next Steps - Actionable items with checkboxes
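The section list above could be checked automatically before a report is merged. The following is a minimal sketch, not an existing tool: it assumes reports are text files in which each section name appears on its own heading line (optionally prefixed with `#`), and it skips the Hook on the assumption that the hook is an opening paragraph rather than a titled section. The name `missing_sections` is hypothetical.

```python
import re

# Section names taken from the list above. "Hook" is left out on the
# assumption that it is an opening paragraph, not a titled section.
REQUIRED_SECTIONS = [
    "Incident Summary",
    "Impact",
    "Timeline",
    "Root Cause Analysis",
    "What Could Be Improved",
    "Next Steps",
]


def missing_sections(report_text: str) -> list[str]:
    """Return the required section names that have no matching heading line."""
    # Assumption: a heading is a line that equals the section name once
    # leading '#' characters and surrounding whitespace are removed.
    headings = {
        re.sub(r"^#+\s*", "", line.strip()).strip().lower()
        for line in report_text.splitlines()
    }
    return [name for name in REQUIRED_SECTIONS if name.lower() not in headings]
```

An empty result means every required section heading was found; a non-empty result lists the sections still to be written.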
Timeline Structure Best Practices
When writing incident timelines, apply these practices:
- Chronological Flow - List events in a clear sequence from incident start to resolution
- Clear Phases - Separate the timeline into distinct stages (Start, Detection, Investigation, Resolution, End)
- Time Ranges - Give time windows for multi-step activities
- Detection Delay Highlighted - Make the gap between incident start and detection clearly visible
- Summary Metrics - State total duration and detection delay at the end
A timeline structured this way is much easier to follow and helps identify systemic issues such as monitoring gaps.
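The summary metrics can be derived directly from the phase timestamps. The sketch below is illustrative only: the timestamps are invented, the `YYYY-MM-DD HH:MM` format is an assumption, and `summary_metrics` is a hypothetical helper. It takes detection delay to be the time from Start to Detection and total duration to be the time from Start to End.

```python
from datetime import datetime

# Hypothetical timeline: phase names follow the stages listed above,
# timestamps are invented for illustration.
timeline = [
    ("Start", "2024-01-01 10:00"),
    ("Detection", "2024-01-01 10:45"),
    ("Investigation", "2024-01-01 10:50"),
    ("Resolution", "2024-01-01 11:30"),
    ("End", "2024-01-01 11:40"),
]


def summary_metrics(events):
    """Compute total duration and detection delay from (phase, timestamp) pairs."""
    parsed = [(phase, datetime.strptime(ts, "%Y-%m-%d %H:%M")) for phase, ts in events]

    # Enforce chronological flow: each event must not precede the previous one.
    times = [t for _, t in parsed]
    if times != sorted(times):
        raise ValueError("timeline events are not in chronological order")

    by_phase = dict(parsed)
    return {
        "total_duration": by_phase["End"] - by_phase["Start"],
        "detection_delay": by_phase["Detection"] - by_phase["Start"],
    }


metrics = summary_metrics(timeline)
print(f"Total duration: {metrics['total_duration']}")
print(f"Detection delay: {metrics['detection_delay']}")
```

Placing both values at the end of the timeline keeps the detection gap visible without the reader having to subtract timestamps by hand.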
Writing Style
- Use specific timestamps and durations
- Include actual log snippets when helpful
- Focus on learning and prevention
- Keep it concise but complete
- Use bullet points for clarity