4 Simple Ways To Achieve Blameless Postmortem Culture
Somehow we have grown used to the idea that QA is to blame when a client finds a bug in production. After the recent CrowdStrike failure, many people jumped to the conclusion that the problem happened because the testing was poor. In these situations it is important to understand that responsibility for software quality lies with everybody on the team. We work together as a team, we help each other, and we are jointly responsible for every success and failure. This mindset, which is at the heart of a blameless postmortem culture, is not easy to achieve. Here are four recommendations for software teams on how to start embracing responsibility instead of pointing fingers at others.
Analysis or Postmortem
A postmortem (Latin for "after death") in software development is, as its name says, something that happens after death. Well, not necessarily the literal death of someone, although some software errors can lead to loss of life. In this context, a postmortem analysis is something software teams do after the damage is done. It is not the first step in dealing with the problem. Obviously, the first thing to do is to fix the problem for the client. Once that is done, we can conduct a deep analysis. The purpose of this analysis is not to find the person responsible for the issue, but to better understand how we ended up releasing a fault to production.
It is very important to understand that every fault is a learning opportunity for the team. Those who do not learn from their own mistakes are destined to repeat them. Postmortem analysis should humble the team into understanding that its processes are not flawless. We must constantly review, update and monitor those processes. It is a continuous task, like everything else in software development.
How to start?
Shifting to a blameless postmortem culture is not something that happens overnight. People are suspicious by nature, and it is difficult to convince them that nobody is to blame for the issue and that management just wants to talk about it. It sounds very unnatural and forced, to be honest. However, fostering a safe space where employees can discuss their mistakes without judgment is the only path. Companies with the right mindset even run simulations of previous failures so that employees can learn from them. In such an environment, even when a new issue happens, everyone knows their role and what to do next. Production issues are inevitable, and having a process for handling them is a must in modern software development.
So, the first recommendation is to work on creating the right culture and open communication. A blameless postmortem culture is a consequence of embracing failure as a very expensive learning opportunity. The cost of that learning becomes even higher if we repeat the same mistakes. Software teams should discuss the benefits of postmortems widely, and top management should endorse them. These discussions should take place regularly during work, and their value should be emphasized every time a process is improved.
Avoid counterproductive actions
I have been present during a few rants by different managers about mistakes made by employees. Several of those discussions had one thing in common: they were far from constructive. Repeating what someone did over and over again in a two-hour meeting, without even mentioning possible corrective actions for the future, is a waste of time in my opinion. We must accept that the mistake has happened and that we cannot change people. People make mistakes, no matter how well they were trained. What we can change are the processes that lead to those mistakes. The second takeaway is to stop blaming people for being people and start thinking about improving processes to prevent mistakes.
The Analysis
The analysis is usually done by a single person, who reviews the logs and talks with the people involved in the issue. The idea is to find the root cause of the issue, without stopping at the technical part of the analysis. We can easily label the root cause in most analyses as human error, but we should go deeper than that. The important part is what lies behind the human error, specifically which part of the process allowed the person in question to make it. The analysis should be deep enough to shed light on all the connected factors that led to the issue. There is rarely just one. The third suggestion for a successful blameless postmortem culture is to gather all the information, dig through the processes and suggest improvements.
The Documentation
Postmortems should be documented adequately. There are even tools created specifically for this. Postmortem documentation contains a report about the issue, the actions taken to resolve it, and corrective actions for the future. A few more things should also be included: what was done right, what didn't go well, and what should be changed so the issue does not happen again. The final recommendation is to keep your documents about past issues as a learning resource. Some companies make their postmortems public, as you can see in a list in this GitHub repo.
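For teams that prefer to keep such documents in a structured form, here is a minimal sketch in Python (3.9 or newer) of what a postmortem record could look like. The class name, field names and section headings are my own assumptions for illustration, not a standard or the format of any particular tool, and the sample incident is invented.

```python
from dataclasses import dataclass, field


@dataclass
class Postmortem:
    """Hypothetical postmortem record; all field names are illustrative."""
    title: str
    summary: str  # the report about the issue
    resolution_steps: list[str] = field(default_factory=list)
    went_well: list[str] = field(default_factory=list)
    went_poorly: list[str] = field(default_factory=list)
    corrective_actions: list[str] = field(default_factory=list)

    def _sections(self) -> dict[str, list[str]]:
        # Headings are assumptions, listed in the order they are rendered.
        return {
            "Actions taken to resolve the issue": self.resolution_steps,
            "What went well": self.went_well,
            "What did not go well": self.went_poorly,
            "Corrective actions for the future": self.corrective_actions,
        }

    def missing_sections(self) -> list[str]:
        """Return the headings of sections that are still empty."""
        return [name for name, items in self._sections().items() if not items]

    def render(self) -> str:
        """Render the record as plain text, one bullet per entry."""
        lines = [self.title, "", "Summary", self.summary]
        for heading, items in self._sections().items():
            lines += ["", heading]
            lines += [f"- {item}" for item in items] or ["- (none recorded)"]
        return "\n".join(lines)


# Invented sample data, used only to show how the record is filled in.
report = Postmortem(
    title="Checkout outage",
    summary="Customers could not complete orders for 40 minutes.",
    resolution_steps=["Rolled back the release"],
    went_well=["Alert fired within two minutes"],
    corrective_actions=["Add a rollback check to the release checklist"],
)
print(report.missing_sections())
print(report.render())
```

Notice that the record has no field for the name of a person at fault. A check like missing_sections can remind the author that the "what did not go well" part is still empty before the document is shared.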
Final words
Achieving a blameless postmortem culture is not an easy task. There are a lot of obstacles along the way. Some people won't see the benefits of such analysis, and others might think that their time could be used in more productive ways. The reality is, we need resilient systems. Finding ways to make them less susceptible to human error and less dependent on other systems, and to keep ourselves from repeating our own mistakes, should be a top priority in software engineering.