Postmortem

A postmortem is a written record of an incident, its impact, the actions taken to mitigate or resolve it, the root cause(s), and the follow-up actions to prevent the incident from recurring.

Postmortem Culture: Learning from Failure, Site Reliability Engineering

What is postmortem?

A postmortem is a written record of an incident that describes:

  • The incident’s impact.
  • The actions taken to mitigate or resolve the incident.
  • The incident’s root cause.
  • Follow-up actions taken to prevent the incident from happening again.
How we run incident postmortems | Atlassian

What does it mean to have a ‘blameless’ Post-Mortem?
Does it mean everyone gets off the hook for making mistakes? No.

Well, maybe. It depends on what “gets off the hook” means. Let me explain.

Having a Just Culture means that you’re making effort to balance safety and accountability. It means that by investigating mistakes in a way that focuses on the situational aspects of a failure’s mechanism and the decision-making process of individuals proximate to the failure, an organization can come out safer than it would normally be if it had simply punished the actors involved as a remediation.

Having a “blameless” Post-Mortem process means that engineers whose actions have contributed to an accident can give a detailed account of:

  • what actions they took at what time,
  • what effects they observed,
  • expectations they had,
  • assumptions they had made,
  • and their understanding of timeline of events as they occurred.

…and that they can give this detailed account without fear of punishment or retribution.

Why shouldn’t they be punished or reprimanded? Because an engineer who thinks they’re going to be reprimanded are disincentivized to give the details necessary to get an understanding of the mechanism, pathology, and operation of the failure. This lack of understanding of how the accident occurred all but guarantees that it will repeat. If not with the original engineer, another one in the future.

Kitchen Soap – Learning from Failure at Etsy

If we go with “blame” as the predominant approach, then we’re implicitly accepting that deterrence is how organizations become safer. This is founded in the belief that individuals, not situations, cause errors. It’s also aligned with the idea there has to be some fear that not doing one’s job correctly could lead to punishment. Because the fear of punishment will motivate people to act correctly in the future. Right?

This cycle of name/blame/shame can be looked at like this:

  1. Engineer takes action and contributes to a failure or incident.
  2. Engineer is punished, shamed, blamed, or retrained.
  3. Reduced trust between engineers on the ground (the “sharp end”) and management (the “blunt end”) looking for someone to scapegoat
  4. Engineers become silent on details about actions/situations/observations, resulting in “Cover-Your-Ass” engineering (from fear of punishment)
  5. Management becomes less aware and informed on how work is being performed day to day, and engineers become less educated on lurking or latent conditions for failure due to silence mentioned in #4, above
  6. Errors more likely, latent conditions can’t be identified due to #5, above
  7. Repeat from step 1

We need to avoid this cycle. We want the engineer who has made an error give details about why (either explicitly or implicitly) he or she did what they did; why the action made sense to them at the time. This is paramount to understanding the pathology of the failure. The action made sense to the person at the time they took it, because if it hadn’t made sense to them at the time, they wouldn’t have taken the action in the first place.

Kitchen Soap – Learning from Failure at Etsy

A funny thing happens when engineers make mistakes and feel safe when giving details about it: they are not only willing to be held accountable, they are also enthusiastic in helping the rest of the company avoid the same error in the future. They are, after all, the most expert in their own error. They ought to be heavily involved in coming up with remediation items.

So technically, engineers are not at all “off the hook” with a blameless PostMortem process. They are very much on the hook for helping Etsy become safer and more resilient, in the end. And lo and behold: most engineers I know find this idea of making things better for others a worthwhile exercise.

Kitchen Soap – Learning from Failure at Etsy

Related reading,


Posted

in

by

Tags: