Post-Incident Review template
{Incident Date} - {Incident Name}
This template is for guidance rather than a strict set of rules. Sometimes a section may not make sense, or an extra section may be needed. The scope of the doc can be adjusted to reflect the size of the incident.
The goal of a post-incident review is that staff without any context should be able to understand what happened and what we decided to do to stop it from happening again.
- Explain as clearly as possible the what, how, and when of the incident.
- Describe the incident impact. Generally this will be external impact to customers, but it may be internal impact as well.
- Explain in detail the mechanism of the fault.
- Explain the alerting mechanism and how we could have detected the issue sooner.
- Provide a list of corrective actions to be taken to prevent the incident from occurring again.
Remember to keep this review blameless: always assume that every team and employee acted with the best intentions based on the information they had at the time. Instead of identifying whoever screwed up, focus on improving performance moving forward.
In general, try to use UTC for all dates and timestamps, unless there's a special reason to use a different timezone.
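If your source data (logs, chat messages, dashboards) is in a local timezone, a small helper can normalise it before it goes into the review. The sketch below is only an illustration: the function name `to_utc`, the timestamp format, and the UTC+11 offset in the example are assumptions, not part of any existing tooling.

```python
from datetime import datetime, timedelta, timezone, tzinfo

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def to_utc(local_timestamp: str, local_tz: tzinfo) -> str:
    """Convert a naive local timestamp string into a UTC timestamp string."""
    # Attach the local timezone to the naive timestamp, then shift to UTC.
    local = datetime.strptime(local_timestamp, TIME_FORMAT).replace(tzinfo=local_tz)
    return local.astimezone(timezone.utc).strftime(TIME_FORMAT + " UTC")


# Hypothetical example: an alarm seen at 11:05 local time in a UTC+11 zone.
utc_plus_11 = timezone(timedelta(hours=11))
print(to_utc("2024-04-01 11:05:00", utc_plus_11))
```

A fixed offset keeps the example self-contained. For zones with daylight saving time, passing a `zoneinfo.ZoneInfo` instance as `local_tz` is likely the safer choice.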
Summary
Provide a short description of the event. Important data includes the precise time the event started and ended, the number of customers and resources impacted, and a description of the nature of the impact. Finish with a description of the cause of the incident, any mitigations put in place, and the final resolution.
If possible, add graphs and data showing the impact.
Example: On DATE from time X to time Y, index provisioning on marquo.ai was unavailable for H hours. Overall, N customers were affected and were not able to create new indexes. Existing indexes were unaffected. Overall, M index creation tasks failed. The event was caused by <reason>. The event was resolved by <corrective action>.
...
Impact
A more detailed description of the impact, including relevant data and graphs if available.
- The number of customers impacted
- List and number of resources impacted by type
Customer reach outs should be listed here as well.
...
Timeline
A timeline of the events from the lead-up to the incident until resolution, including any alarms, escalations, and manual actions taken.
Example:
| Date | Time (UTC) | Description |
|---|---|---|
| 2024-04-01 | 00:00:00 | Deployment of component x started |
| | 00:05:00 | 504 Error rate alarm triggered on gateway |
| | 00:15:00 | Oncall paged |
| | 00:20:00 | Slack discussion <link> reveals deployment just happened |
| | 00:25:00 | Manual rollback initiated |
| | 00:30:00 | Rollback complete |
| | 00:31:00 | Metric out of alarm |
| | 00:41:00 | Tech escalation declared event closed |

| Date | Time (UTC) | Description |
|---|---|---|
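
The detection, diagnosis, and resolution times asked for later in this template can be worked out directly from the timeline. The sketch below uses the timestamps from the example timeline. Which event counts as "detected", "diagnosed", or "resolved" is an assumption made for illustration, and the names `milestones` and `minutes_between` are hypothetical.

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"

# Timestamps (UTC) taken from the example timeline. The mapping of events to
# milestones is an assumption; adjust it to fit the actual incident.
milestones = {
    "started": "2024-04-01 00:00:00",    # Deployment of component x started
    "detected": "2024-04-01 00:05:00",   # 504 error rate alarm triggered
    "diagnosed": "2024-04-01 00:20:00",  # Slack discussion reveals the deployment
    "resolved": "2024-04-01 00:31:00",   # Metric out of alarm
}


def minutes_between(start: str, end: str) -> float:
    """Return the number of minutes elapsed between two timestamps."""
    delta = datetime.strptime(end, TIME_FORMAT) - datetime.strptime(start, TIME_FORMAT)
    return delta.total_seconds() / 60


durations = {
    "time_to_detect": minutes_between(milestones["started"], milestones["detected"]),
    "time_to_diagnose": minutes_between(milestones["detected"], milestones["diagnosed"]),
    "time_to_resolve": minutes_between(milestones["diagnosed"], milestones["resolved"]),
}

for name, minutes in durations.items():
    print(f"{name}: {minutes:.0f} min")
```

With the example timeline this gives 5, 15, and 11 minutes respectively, which are the baselines for the "can we cut that time in half?" questions.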
Root cause analysis
Incident technical deep dive
A detailed technical discussion of what went wrong and why. This is your opportunity to dive as deep as possible. Add diagrams and graphs as necessary. Keep in mind that while this document in general should be accessible to an observer without context, this section is targeted at your peers, so feel free to pull out all the stops.
...
5 Whys
The 5 whys is just a brainstorming tool to help your investigation; don't take it too literally.
Work backwards from the incident itself and the main symptom. Ask these questions:
- What is the problem that needs to be solved?
- What is the root cause of the problem?
- What are the potential causes of the problem?
- What data or evidence do I have to support my hypothesis?
- What are the potential solutions to the problem?
Recursively apply the process 5 or more times until a clear reason surfaces or it does not make sense to keep going.
Example:
- Question: Why was there a reduction in plant output? Answer: The canning machine on row 5 was not functioning properly.
- Question: Why was the machine defective? Answer: A belt was out of place.
- Question: Why was the belt out of place? Answer: The machine did not receive its scheduled maintenance check this month.
- Question: Why didn't the machine receive a routine maintenance check this month? Answer: We didn't schedule a service provider to perform the maintenance.
- Question: Why was there no service provider scheduled to check the machine? Answer: The company is currently negotiating a contract with a new service provider.
Primary cause: The business has no plan in place for machine maintenance while it's negotiating a contract deal with a new service provider.
...
Incident Detection and Response
How was the incident detected?
How did we identify something was wrong? Was the issue caught by our alarming or was it first noticed by a customer? Add details of all alarms triggered or customer reach outs.
...
How long did it take to detect?
How long did it take for us to become aware of the issue?
...
Could we have detected the issue sooner?
As a thought exercise, can we cut that time in half?
...
How long did it take to diagnose the issue?
How long did it take us to figure out what to do?
...
Could we have diagnosed the issue faster?
What would it take to cut the time to diagnosis in half?
...
How long did it take to resolve the issue?
Once a corrective action was decided on, how long did it take to put it in place? Were we able to put in temporary mitigations?
...
Could we have resolved it faster?
Could we have gotten to resolution faster if something had been different? What would it take to cut the time to resolution in half?
...
Findings
Briefly summarise what went wrong and how it could have been avoided.
...
Corrective actions
Any changes that should be made (or have already been made) to prevent this issue from happening again should be listed here. Include the action owner before the review, and make sure you invite them to the review. During the review, update the priority and add a date if known.
| Action | Owner (team or person) | Priority (High/Medium/Low) | Date |
|---|---|---|---|