Post Incident Review template

{Incident Date} - {Incident Name}

This template is guidance rather than a strict set of rules. Sometimes a section may not make sense, or an extra section may be needed. Adjust the scope of the document to reflect the size of the incident.

The goal of a post-incident review is that staff without any context should be able to understand what happened and what we decided to do to stop it from happening again.

Explain as clearly as possible the what, how, and when of the incident.

Describe the incident impact. Generally this will be external impact to customers, but it may include internal impact as well.

Explain in detail the mechanism of the fault.

Explain the alerting mechanism and how we could have detected the issue sooner.

Provide a list of corrective actions to be taken to prevent the incident from occurring again.

Remember to keep this review blameless: always assume that every team and employee acted with the best intentions based on the information they had at the time. Instead of identifying who made a mistake, focus on improving performance going forward.

In general try to use UTC for all dates and timestamps, unless there's a special reason to use a different timezone.
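When source logs or chat messages carry local times, converting them before they go into the review keeps the timeline consistent. The following is a minimal sketch, not a prescribed tool: the helper name `to_utc` and the fixed-offset input are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def to_utc(local_timestamp: str, utc_offset_hours: float) -> str:
    """Convert a 'YYYY-MM-DD HH:MM:SS' local timestamp to the same format in UTC.

    utc_offset_hours is the local zone's offset from UTC at that moment
    (for example 10 for UTC+10, -5 for UTC-5). Hypothetical helper.
    """
    naive = datetime.strptime(local_timestamp, "%Y-%m-%d %H:%M:%S")
    # Attach the local offset, then shift the instant to UTC.
    aware = naive.replace(tzinfo=timezone(timedelta(hours=utc_offset_hours)))
    return aware.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")


print(to_utc("2024-04-01 10:05:00", 10))  # 2024-04-01 00:05:00
```

A fixed offset is used here only to keep the sketch dependency-free; a named time zone would handle daylight saving changes automatically.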

Summary

Provide a short description of the event. Important data includes the precise time the event started and ended, the number of customers and resources impacted, and the nature of the impact. Finish with the cause of the incident, any mitigations put in place, and the final resolution.

If possible, add graphs and data showing the impact.

Example: On DATE from time X to time Y index provisioning on marquo.ai was unavailable for H hours. Overall N customers were affected and were not able to create new indexes. Existing indexes were unaffected. Overall M index creation tasks failed. The event was caused by <reason>. The event was resolved by <corrective action>.

...

Impact

A more detailed description of the impact, including relevant data and graphs if available.

The number of customers impacted

List and number of resources impacted by type

Customer reach outs should be listed here as well.

...

Timeline

A timeline of events from the lead-up to the incident until its resolution, including any alarms, escalations, and manual actions taken.

Example:

Date	Time (UTC)	Description
2024-04-01	00:00:00	Deployment of component x started
	00:05:00	504 Error rate alarm triggered on gateway
	00:15:00	Oncall paged
	00:20:00	Slack discussion `<link>` reveals deployment just happened
	00:25:00	Manual rollback initiated
	00:30:00	Rollback complete
	00:31:00	Metric out of alarm
	00:41:00	Tech escalation declared event closed
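Annotating each row with the minutes elapsed since the first event can make a timeline easier to scan. The sketch below does this for the example rows above; the names `TIMELINE` and `with_offsets` are assumptions for illustration, and it assumes every row falls on the same date.

```python
from datetime import datetime

INCIDENT_DATE = "2024-04-01"

# Rows copied from the example timeline: (time in UTC, description).
TIMELINE = [
    ("00:00:00", "Deployment of component x started"),
    ("00:05:00", "504 Error rate alarm triggered on gateway"),
    ("00:15:00", "Oncall paged"),
    ("00:20:00", "Slack discussion <link> reveals deployment just happened"),
    ("00:25:00", "Manual rollback initiated"),
    ("00:30:00", "Rollback complete"),
    ("00:31:00", "Metric out of alarm"),
    ("00:41:00", "Tech escalation declared event closed"),
]


def with_offsets(date, rows):
    """Return (minutes since first row, time, description) for each row."""
    stamps = [datetime.strptime(f"{date} {t}", "%Y-%m-%d %H:%M:%S") for t, _ in rows]
    start = stamps[0]
    return [
        (int((stamp - start).total_seconds() // 60), t, description)
        for stamp, (t, description) in zip(stamps, rows)
    ]


for minutes, time_utc, description in with_offsets(INCIDENT_DATE, TIMELINE):
    print(f"T+{minutes:>3} min  {time_utc}  {description}")
```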

Date	Time (UTC)	Description

Root cause analysis

Incident technical deep dive

A detailed technical discussion of what went wrong and why. This is your opportunity to dive as deep as possible. Add diagrams and graphs as necessary. Keep in mind that while this document as a whole should be accessible to an observer without context, this section is targeted at your peers, so feel free to pull out all the stops.

...

5 Whys

The 5 whys is just a brainstorming tool to help your investigation; don't take it too literally.

Work backwards from the incident itself and the main symptom. Ask these questions:

What is the problem that needs to be solved?

What is the root cause of the problem?

What are the potential causes of the problem?

What data or evidence do I have to support my hypothesis?

What are the potential solutions to the problem?

Recursively apply the process 5 or more times until a clear reason surfaces or it does not make sense to keep going.

Example:

Question: Why was there a reduction in plant output? Answer: The canning machine on row 5 was not functioning properly.

Question: Why was the machine defective? Answer: A belt was out of place.

Question: Why was the belt out of place? Answer: The machine did not receive its scheduled maintenance check this month.

Question: Why didn't the machine receive a routine maintenance check this month? Answer: We didn't schedule a service provider to perform the maintenance.

Question: Why was there no service provider scheduled to check the machine? Answer: The company is currently negotiating a contract with a new service provider.

Primary cause: The business has no plan in place for machine maintenance while it's negotiating a contract deal with a new service provider.

...

Incident Detection and Response

How was the incident detected?

How did we identify something was wrong? Was the issue caught by our alarming or was it first noticed by a customer? Add details of all alarms triggered or customer reach outs.

...

How long did it take to detect?

How long did it take for us to become aware of the issue?

...

Could we have detected the issue sooner?

As a thought exercise, can we cut that time in half?

...

How long did it take to diagnose the issue?

How long did it take us to figure out what to do?

...

Could we have diagnosed the issue faster?

What would it take to cut the time to diagnosis in half?

...

How long did it take to resolve the issue?

Once a corrective action was decided, how long did it take to put it in place? Were we able to put in temporary mitigations?

...

Could we have resolved it faster?

Could we have gotten to resolution faster if something was different? What would it take to cut the time to resolution in half?

...
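The detect, diagnose, and resolve questions above can be answered directly from four timestamps, and the "cut it in half" thought exercise then becomes a concrete target. This is a rough sketch: `phase_durations` is a hypothetical helper, and mapping the example timeline to these milestones (alarm triggered = detected, Slack discussion = diagnosed, rollback complete = resolved) is an assumption, not a rule.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"


def phase_durations(started, detected, diagnosed, resolved):
    """Return the time spent in each phase as a dict of timedeltas."""
    points = [datetime.strptime(p, FMT) for p in (started, detected, diagnosed, resolved)]
    names = ("detect", "diagnose", "resolve")
    # Pair each milestone with the next one and subtract.
    return {name: later - earlier for name, earlier, later in zip(names, points, points[1:])}


durations = phase_durations(
    "2024-04-01 00:00:00",  # deployment started
    "2024-04-01 00:05:00",  # alarm triggered (assumed detection point)
    "2024-04-01 00:20:00",  # cause identified (assumed diagnosis point)
    "2024-04-01 00:30:00",  # rollback complete (assumed resolution point)
)

for phase, actual in durations.items():
    print(f"time to {phase}: {actual}, half would be {actual / 2}")
```

Which event counts as "detected" is a judgment call: if awareness only began when oncall was paged, use that timestamp instead.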

Findings

Briefly summarise what went wrong and how it could have been avoided.

...

Corrective actions

List any changes that should be made (or have already been made) to prevent this issue from happening again. Assign an owner to each action before the review, and make sure you invite the owners to the review. During the review, update the priority and add a date if known.

Action	Owner (team or person)	Priority (High/Medium/Low)	Date
