Pre-requisites to Practicing Reliability?

Practicing Reliability to Prepare your whole Socio-Technical Team.

TL;DR

  • Reliability is something you do, not something you buy.
  • Practicing reliability helps you prove and improve how your system will handle and learn from difficult, real-world conditions, to make your system more reliable.
  • Practicing reliability supports resilience engineering, enabling you to invest in becoming better prepared for the unknowns such as awkward system failures and catastrophic surprise incidents and outages.
  • You can’t be ready for every situation, but you can be better prepared and poised to adapt to it if you practice reliability to develop your capacities for resilience.
  • Practicing reliability does not rely on any pre-requisites; it works in tandem with your choices by helping you evaluate, verify, learn and improve the real contribution that those technical and non-technical choices make when your whole system collectively encounters challenging conditions.

Does practicing reliability rely on quality observability, logging and tracing?

This question comes up a lot. You could replace “observability, logging and tracing” with “microservices”, “circuit breakers”, “continuous delivery”, “DevOps automation”… the list of good ideas goes on.

Before we answer that question, first let’s look at what practicing reliability is, and what you get out of doing it.

Practicing Reliability is...

Reliability is something you do, not something you buy.

Practicing reliability means you invest in doing work that develops how your systems, even your organisations, handle and learn from difficult conditions. For many reasons it could have been called “practicing resilience engineering”, as that is its clear foundation.

Alongside your team’s work on increasing the utility of your system, often called feature work, you’re aiming to strike a balance where you’re also working on proving and improving how reliable your system is. Whether you’re focussing on your system’s availability, security, data durability or performance, all of these factors affect how your system is perceived by its users and so come under “reliability”: your users can rely on your system.

You and your teams are an important part of your system, and you want to be able to rely on it too. Practicing reliability also means you’re investing in learning how the entire socio-technical system, including you, handles expected and unexpected events. You’re practicing reliability for your own benefit too: it reduces the pain of running your system and lets you meet your users’ needs and expectations with confidence.

The Payback for Practicing Reliability

The payback of practicing reliability is that you prove and improve your whole system’s robustness strategies with regard to what you know might happen; chaos engineering and continuous verification are great tools here.

You also explore, learn about and improve how prepared your system is for the unexpected: engineering your system’s readiness to adapt to new challenges, exploring how poised to adapt you are, and engineering your whole system’s resilience.

In resilience engineering terms, you explore, develop and improve your adaptive capacities so that you are better and better prepared for the unknowns, such as awkward system failures and catastrophic surprise incidents and outages.

If you want to know more about resilience engineering, Lorin Hochstein provides a great starting point for meeting the people, thought, practice and evidence in this exciting area. One extremely powerful paper is David Woods’ “Resilience is a Verb”.

A Key Question: How do you know?

One question that practicing reliability addresses is “How do you know?”. You might have invested in an incredible operational system, full of HA strategies, awesome operational playbooks, circuit breakers, carefully crafted bulkheads, and open and accessible observability events.

* N.B. Those aren’t all you need, just a sample off the top of my head right now.

You might have elasticity by the bucket-load, running your system on the latest cloud infrastructure and services. You probably have a great set of people, including you, standing by to make sure you deliver the best features as reliably as possible.

What you’ve done is assemble the best individuals for your “team”, to use a sports analogy. You’ve bought the best individual players you think you could get, and you’ve put careful thought into how they might work together. You’ve probably even played some practice games, sometimes against real adversaries, and the team seems to be working together just fine.

You’ve bought the best individual players… but how do you know that the team is going to play well together?

The awkward question is “How do you know?”. How do you know how that team is going to handle unexpected conditions? How do you know how all these great players will play together in the heat of a difficult moment? How do you know that their individual strengths are going to come together to provide the best possible result?

You could have done everything right: using DevOps principles and automation to break down silos of responsibility and communication, enabling continuous delivery to work well, with all your tests passing. You would still be forgiven for being worried about what might happen when the unexpected comes calling.

You can’t be ready for everything, but you can be better prepared and poised to adapt if you practice reliability.

But… “Does practicing reliability rely on quality observability, logging and tracing?”

Now that you know what practicing reliability aims to do, let’s unpack that question about prerequisites. The answer lies in the slightly hidden assumptions in the question.

Let’s start with the word “quality”. The hidden assumption here is that you know you have quality observability. Do you? How do you know?

One test of high quality observability is that you’ve followed the excellent thinking of people like Charity Majors et al; that will get you to a great place. By buying the right tool (player) for your system (team), and by following all the great advice out there (individual practice), you could argue that you have a high quality observability system. But you don’t know that...

The proof is in the pudding, as they say. At this point you only know that you have a high quality observability system on paper, not that it will work in a high quality way when it is needed. You believe you have a star player, but they’re not match-proven yet.

By practicing reliability, perhaps through chaos experiments or Game Days, you can explore how your observability system, your player, contributes at the most difficult moments in the match. In surprising and difficult moments, is your observability system a net gain or net loss for the whole system? Is it providing great signal, or obfuscating noise?
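To make that concrete, here is a minimal sketch of a chaos experiment in miniature. Everything in it is hypothetical: `ToyService`, its in-memory `events` list (standing in for an observability pipeline) and `run_experiment` are illustrative names, not part of any real tool. It only shows the shape of the practice: check the steady state, inject a difficult condition, ask whether the system coped and whether the observability “player” told you why, then roll back.

```python
from dataclasses import dataclass, field


@dataclass
class ToyService:
    """Hypothetical stand-in for a real service with a fallback path."""
    dependency_up: bool = True
    # Stand-in for whatever your observability tooling would record.
    events: list = field(default_factory=list)

    def handle_request(self) -> str:
        if self.dependency_up:
            return "fresh"
        # Degraded path: serve cached data and emit a signal about it.
        self.events.append({"name": "dependency_unavailable", "fallback": "cache"})
        return "cached"


def run_experiment(service: ToyService) -> dict:
    """Steady state, inject a failure, observe, roll back."""
    findings = {}
    # 1. Steady-state hypothesis: requests are answered before the fault.
    findings["steady_before"] = service.handle_request() in {"fresh", "cached"}
    # 2. Inject the difficult condition (assumed here: a lost dependency).
    service.dependency_up = False
    try:
        # 3. Did the system still answer, and did observability say why?
        findings["answered_during_fault"] = service.handle_request() == "cached"
        findings["signal_emitted"] = any(
            event["name"] == "dependency_unavailable" for event in service.events
        )
    finally:
        # 4. Roll back so the experiment leaves nothing broken behind.
        service.dependency_up = True
    findings["steady_after"] = service.handle_request() == "fresh"
    return findings


if __name__ == "__main__":
    print(run_experiment(ToyService()))
```

In a real Game Day the interesting finding is often the `signal_emitted` line: a system that copes but stays silent about it has taught you something about your observability that no tool purchase could.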

You can’t say your observability is performing reliably and contributing to the reliability of the system until you practice reliability.

A mutually beneficial relationship

You can and should invest in great observability for your systems; that’s a no-brainer. Practicing reliability through techniques and tools like Game Days and chaos experiments/tests leverages that investment, and helps you prove and improve it for when it will be needed most: in the heat of real-world incidents.

Running Reliable and Resilient Systems is a Team Sport

Game Days, chaos engineering and system objective verification, and many other techniques under the banner of Practicing Reliability, help you prove and improve how your whole system works together to handle and learn from the events it encounters, whether they be planned or utterly surprising, small or huge.
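As one small illustration of system objective verification, the sketch below checks an availability objective against a list of recorded request outcomes. The function names and the 99% target are assumptions made for this example, not a recommended objective or any particular tool’s API.

```python
def availability(outcomes: list[bool]) -> float:
    """Fraction of requests that succeeded (True means success)."""
    if not outcomes:
        raise ValueError("no outcomes recorded, nothing to verify")
    return sum(outcomes) / len(outcomes)


def objective_met(outcomes: list[bool], target: float = 0.99) -> bool:
    """Compare measured availability with an assumed target of 99%."""
    return availability(outcomes) >= target


if __name__ == "__main__":
    # Hypothetical measurements: 995 successes and 5 failures.
    during_game_day = [True] * 995 + [False] * 5
    print(objective_met(during_game_day))
```

Running a check like this during and after a Game Day, rather than only on a quiet day, is what turns an objective from a hope into something you have verified.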

Cometh the Hour, Cometh the [System]

The whole, the most powerful parts of which are you and your people, could be greater or less than the sum of its individual parts. You don’t know until you start practicing reliability. Then you can see where your individual players need to improve and, more importantly, how well they play with the rest of the team, match after match, reliability practice after reliability practice.

Don’t hope your system will perform well when the hour comes; proactively prepare for it. In a nutshell: Develop Resilience; Practice Reliability.

Photo by bantersnaps on Unsplash