Learning from Incidents - what to do after you write a postmortem?
If you have made postmortems more meaningful at your company, it is important to spread that learning around. Many companies have a few teams that do postmortems really well, and many engineering managers (EMs) want that practice to spread organically. But writing and following up on postmortems is something a lot of developers don’t think or care much about, and the practice is extremely hard to force, especially without support from upper management. So what can you do as an engineering manager?
Most of the best gains EMs report have come through word of mouth or by setting a good example. First, give people the time to do a good job with the retrospective process. Don’t make it sound like they need to rush things out the door to get back to “real” work. Second, highlight how well-done postmortems are shaping the way you think about and approach the work planned for the next month or quarter. John Allspaw has written a lot on this; the gist is that you know you have a healthy retrospective culture when you see teams attending other teams’ retrospectives.
Rethinking “action items”
Keep in mind that a postmortem shouldn’t be about action items per se, but about what you learned about how things work. This can look like a hard sell if your organization has a fixed mindset about intelligence or ability, and you may first have to work through inclusion and psychological safety issues, but it’s easier than it looks. Saying “we’re doing X because it was an action item from last month’s postmortem” is less interesting than “I understand why this thing we’re doing is important, because of how Cindy used it in the last incident”.
One fear you will have to dispel: no one is going to want to participate in an honest retrospective if they think they will get screwed for doing so, or if they think the only thing anybody cares about is action. You need to make people realize that a retrospective with no action items is not a failure. Intangibles can be a hard sell.
Teams that spend a lot of thought on “action items” and “what can we do to prevent this from happening again” will notice that those action items lose much of their value along the way. The real fixes have usually already happened by the time the retro discussion takes place, because the teams that own the systems know what it takes to improve them and want to improve them. Right after an incident, you might informally establish a time window in which anything goes without much planning overhead. Formal action items may still fall out of that window, but more often than not the highest-ROI fixes are done informally.
So how do you spread one team’s good habits around, and how do you create the circumstances in which a really good postmortem is possible? One thing you can do is highlight and embrace the improvements that happened organically between the incident and the retro. Make it normal and expected to address urgent issues immediately. Instead of asking “what can we do better tomorrow”, ask “what is better now”. The general observation is that “action items” usually just lead to a giant pile of remediation items that never get done because no one thought they were actually worth it.
It also helps if your postmortems are interesting to read, because reading is probably how most people are going to encounter them, at least if your company is of any size. A postmortem must not be a) dry and boring, or b) so inscrutable that no one without deep knowledge of the incident or the systems involved can understand it. Keep this in mind when hiring potential postmortem authors: can they write or tell a story? Being able to tell people about their problems is key to actually getting what you want.
From a management perspective, it is important to change what you expect from a postmortem. Most teams view the presence of management in incident reviews with a fair bit of dread and anxiety, unless the managers are there only to listen. Instead of saying “we need to have items so this NEVER HAPPENS AGAIN”, establish the expectation that improvement work will happen, but separately from the “learning” retro. Creating those remediation items requires people to already have a good idea of how to materially improve their product, and a good grasp of how to choose the right level of generality for a fix. In all probability, the incident you just held a postmortem for will not happen again in exactly the same way, and your remediations are themselves changes that can break things. So center the retro on learning and on broadening the team’s knowledge, so that next time (or five times down the line) they have good ideas.
You can also incentivize the improvements by framing learning about better practices as a growth or stretch opportunity. Owning or championing incident management for a team can lead to bigger opportunities, such as owning a critical project. Spreading awareness of good incident management practices across teams can be recognized as mentorship. You can give the impact of improved incident management company-wide visibility by calling it out publicly (and naming the people and teams who did the work!) somewhere management can see. The idea is to a) help people see that this work is critical and highly valued, rather than a literal or figurative chore that ranks below feature work, and b) use it to help grow your incident commanders.
Measuring impact to improve adoption and learning
One way engineering and product managers can see the need for improvement is by tracking the “impact on engineering” of incidents. The exact method matters less than the realization that “this incident wasted so many hours we could have used more productively”. That can land harder than saying “we lost X in revenue”. It is also important to include generous “context switch” padding, because in the hour after an incident nobody who was involved gets anything done.
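As a rough illustration of that kind of tracking, here is a minimal Python sketch. Everything in it is an assumption made for the example: the `Incident` fields, the one-hour default padding, and the sample incidents are hypothetical, and your own method may count time quite differently.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """One incident, described only by who was pulled in and for how long."""
    name: str
    responders: int          # people actively involved in the response
    duration_hours: float    # time from detection to resolution
    # Hypothetical default: one hour of lost focus per responder afterwards.
    context_switch_hours: float = 1.0


def engineering_hours_lost(incident: Incident) -> float:
    """Hours spent on the incident plus context-switch padding, per responder."""
    per_person = incident.duration_hours + incident.context_switch_hours
    return incident.responders * per_person


# Illustrative numbers only, not real data.
incidents = [
    Incident("checkout latency", responders=4, duration_hours=2.5),
    Incident("failed deploy", responders=2, duration_hours=1.0),
]
total = sum(engineering_hours_lost(i) for i in incidents)
print(f"Engineering hours lost: {total:.1f}")  # prints "Engineering hours lost: 18.0"
```

The padding is kept as a per-incident field so that it can be made more generous for long or stressful incidents instead of being a fixed constant.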
Your teams need to be strong enough to figure out how to improve their situation without “rewriting everything”. If you regularly highlight the work that comes out of a team’s postmortems and how that team’s $desirable_metric has gone up because of it, other teams may slowly follow. Having said that, I would caution against assuming correlation == causation.