Many organizations are trying to foster a culture of accountability and learning rather than one of blame. This means encouraging people to take accountability for their mistakes and learn from them, without making them feel bad for having made those mistakes.
“The thinking behind assigning blame is that identifying the offender and punishing them will correct the poor behaviour. The reality is that the only thing people learn from being blamed is to become better at hiding their mistakes,” Avail Leadership wrote in a blog post. The post adds that accountability is more constructive than blame because it focuses on the future.
Henning Jacobs, head of developer productivity at Zalando, has put together a list of Kubernetes failure stories. Operators tell the story of how they messed up, how they fixed the situation, and how they learned from those mistakes.
Currently, the collection contains 33 stories from companies including Spotify, Algolia, Target, Google, Gravitational, and Nordstrom. For example, David Xia, an infrastructure engineer at Spotify, shares the story of how Spotify accidentally deleted all of its Kubernetes clusters. The stories range from blog posts to conference talks posted on YouTube.
So if you have some downtime, take a few moments to learn from their mistakes so that you don’t make the same ones. “Considering this environment, we don’t hear enough real-world horror stories to learn from each other! This compilation of failure stories should make it easier for people dealing with Kubernetes operations (SRE, Ops, platform/infrastructure teams) to learn from others and reduce the unknown unknowns of running Kubernetes in production,” the GitHub page states.