From Chaos to Verification

Verification brings the "Why!?" to Chaos Engineering.

TL;DR - System verification multiplies the value of chaos engineering by helping you decide what to explore and how to prioritise your results. ChaosIQ is now rolling out the tools you need for system verification. This article was originally given as a talk at the London Chaos and Resilience Engineering Community at Expedia Group. The slides are available, along with accompanying videos on the ChaosIQ YouTube Channel.

Chaos engineering, as we practice it today, has a problem. In "Learning Chaos Engineering", published by O'Reilly Media, I stated that the value of chaos engineering was "to provide evidence of system weaknesses". That sounds great to many, but it leaves a lot of people, especially non-technical stakeholders, feeling uncomfortable, because they are left with awkward questions such as "What evidence should I be investing in seeking?" and "What evidence is important to me to turn into system improvements?". Or, the really hard question: "How do I decide what to invest in?"

Chaos engineering might not explicitly help with these questions. But by taking chaos engineering one step further, with a clearer approach that speaks to the business itself, you can answer them while still applying chaos engineering under the hood. That extra step, which is also a shortcut, is verification.

Chaos starts and ends with "What?"; verification starts and ends with "Why?"

Chaos engineering starts by looking to explore your entire sociotechnical system and surface evidence of weaknesses, but that doesn't answer the business question of "What should I take a risk on exploring?". Chaos engineering experiments and tests don't come for free: they require effort in design and implementation and, especially when they are early experiments, they come with the risk that you might learn something bigger than you expected. It's natural that the business should ask why you want to run an experiment, and it's natural for you to ask the same thing from the engineering perspective: "What experiments should I even bother with?". For this reason, verification starts with a different perspective.

If chaos engineering often starts with "What dark debt do we want to try and surface?", verification changes the emphasis to "What objectives do you want to verify in your system?". Or, in even simpler terms, "What do you care about?".

Verification aims to answer why you are going to conduct any form of chaos engineering in the first place, and then to bring the results back into that context so you can make an informed decision about where to place your engineering efforts to improve reliability. Verification answers "Why are we exploring this?" and "What should we prioritise to work on after we've verified our system?", and, more importantly from the general business perspective, it answers, "What should I take a risk on exploring?" and "How do I decide what to invest in?".

The chaos engineering process usually looks something like the following:

A process of chaos engineering

You start by defining a hypothesis that you'd like to explore. A hypothesis is a belief in how your system will behave under turbulent conditions. You then set up a chaos experiment to be run as a Game Day or an automated chaos experiment that captures how the system is measured to be 'normal' (using a Steady-State Hypothesis), the turbulent conditions you want to apply (the 'method'), and maybe some rollbacks to be executed at the end of your experiment. You then run the experiment, collate the results and try to elicit evidence to support, or refute, your belief in how your system behaves under those conditions.
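
To make these parts concrete, here is a minimal sketch of an experiment laid out in the shape of the Chaos Toolkit's declarative experiment format, built as a Python dictionary and serialised to JSON. The title, the health-check URL, the script paths and the probe and action names are all hypothetical placeholders chosen for illustration; they do not come from a real system.

```python
import json

# Sketch of a Chaos Toolkit-style experiment. The URL, script paths and
# names below are hypothetical placeholders.
experiment = {
    "version": "1.0.0",
    "title": "The service stays available when one instance is stopped",
    "description": "Belief: losing a single instance does not affect users.",
    # How 'normal' is measured for the system.
    "steady-state-hypothesis": {
        "title": "The service responds successfully",
        "probes": [
            {
                "type": "probe",
                "name": "service-responds-ok",
                "tolerance": 200,  # expected HTTP status code
                "provider": {
                    "type": "http",
                    "url": "http://localhost:8080/health",
                    "timeout": 3,
                },
            }
        ],
    },
    # The turbulent conditions to apply.
    "method": [
        {
            "type": "action",
            "name": "stop-one-instance",
            "provider": {
                "type": "process",
                "path": "./scripts/stop-instance.sh",
            },
        }
    ],
    # Optional steps executed at the end of the experiment.
    "rollbacks": [
        {
            "type": "action",
            "name": "restart-instance",
            "provider": {
                "type": "process",
                "path": "./scripts/start-instance.sh",
            },
        }
    ],
}

experiment_json = json.dumps(experiment, indent=2)
print(experiment_json)
```

The three main sections map directly onto the description above: the steady-state hypothesis captures "normal", the method holds the turbulent conditions, and the rollbacks run at the end.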

None of this is wrong; it just lacks context. In contrast, the verification process looks a little different:

A process of system verification

Verification starts with objectives. These are descriptions of what you care about in your system. Maybe you care about how fast your web-based console responds, or that your system will elegantly and seamlessly fail over between Amazon availability zones. If it's something you care about and, ideally, something your users and customers care about too, then it's a good candidate to be an objective. As part of your objective, you also specify a target for how much of the time you want the objective to be met or, more importantly, how much of the time it can go unmet while your users remain happy:

Specify an Objective; something you care about in your system.
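
As a rough illustration of an objective and its target, the sketch below models one in Python. The `Objective` class, its fields, the 99% target and the 30-day window are assumptions made for this example; this is not the ChaosIQ data model.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Objective:
    """Something you care about, plus how often it must be met (hypothetical model)."""

    name: str
    target_percent: float  # e.g. 99.0 means "met 99% of the time"

    def allowed_unmet_time(self, window: timedelta) -> timedelta:
        """Time within `window` the objective may go unmet while still hitting the target."""
        return window * ((100 - self.target_percent) / 100)


# Hypothetical objective for the web-based console mentioned above.
console_latency = Objective(
    name="Web console responds in under 1 second",
    target_percent=99.0,
)
budget = console_latency.allowed_unmet_time(timedelta(days=30))
print(f"{console_latency.name}: may go unmet for {budget} in 30 days")
```

Expressing the target as a time allowance rather than only a percentage makes the "how much can it not be met" side of the objective explicit.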

Second come your measurements. Measurements are indicators that tell you whether, at a given moment in time or perhaps based on a trend, your objective is being met:

Specify a Measurement; a sample of your system that will let you know that your objective is currently being met.
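
A measurement can be as simple as the share of samples that satisfy a threshold. The following sketch assumes response-time samples in milliseconds; the function name, the sample values and the 1000 ms threshold are invented for illustration.

```python
from typing import Sequence


def percent_of_samples_meeting(latencies_ms: Sequence[float], threshold_ms: float) -> float:
    """Percentage of sampled response times at or below the threshold."""
    if not latencies_ms:
        raise ValueError("at least one sample is required")
    good = sum(1 for latency in latencies_ms if latency <= threshold_ms)
    return 100 * good / len(latencies_ms)


# Hypothetical samples, in milliseconds, as if collected from the web console.
samples = [120.0, 340.0, 95.0, 1500.0, 210.0]
measured = percent_of_samples_meeting(samples, threshold_ms=1000.0)
print(f"{measured:.1f}% of samples met the threshold")
```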

Finally you can design and run a number of verifications for your objective:

Create a Verification; verifies that your objective is being met under various, often turbulent, conditions.

A verification runs for a specific amount of time and is responsible for verifying that your objective is still met under conditions that you choose to apply:

Specify the Conditions; the conditions you want to apply to your system in order to verify how it behaves under them.
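
The shape of a verification (apply conditions, keep sampling the measurement for a fixed duration, then clean up) can be sketched as a small loop. Everything here is a hypothetical stand-in: `run_verification`, its parameters and the in-memory "system" are assumptions for illustration, not how ChaosIQ or the Chaos Toolkit implement verifications.

```python
import time
from typing import Callable, List


def run_verification(
    measure: Callable[[], bool],
    apply_conditions: Callable[[], None],
    remove_conditions: Callable[[], None],
    duration_s: float,
    interval_s: float,
) -> List[bool]:
    """Apply conditions, sample the measurement until the duration elapses, then clean up."""
    results: List[bool] = []
    deadline = time.monotonic() + duration_s
    apply_conditions()
    try:
        while True:
            results.append(measure())  # True means the objective was met for this sample
            if time.monotonic() + interval_s > deadline:
                break
            time.sleep(interval_s)
    finally:
        remove_conditions()  # always restore the system, even if a sample raises
    return results


# Hypothetical in-memory stand-in for a real system and its conditions.
state = {"degraded": False}


def inject_latency() -> None:
    state["degraded"] = True


def remove_latency() -> None:
    state["degraded"] = False


def objective_met() -> bool:
    return not state["degraded"]


results = run_verification(
    objective_met, inject_latency, remove_latency, duration_s=0.05, interval_s=0.01
)
print(f"{len(results)} samples taken, {sum(results)} met the objective")
```

The `finally` clause is a deliberate choice in this sketch: the conditions are removed even when a measurement fails unexpectedly, so the system is not left in a turbulent state.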

For the keen-eyed amongst you this will probably feel very familiar, for a couple of reasons. A verification is an extended version of a chaos experiment, so if you've used the Chaos Toolkit's existing experiment syntax then a verification is similar but executes for longer. Objectives and measurements will be familiar too if you have an awareness of Site Reliability Engineering (SRE): they can be easily transposed into SRE SLOs and SLIs respectively.

Seeing the Impact on your Objectives

Chaos engineering tooling to date tends to stop at the "Orchestration" level. Most tools offer various ways of injecting turbulence (the "Conditions" level), and some tooling, such as the Chaos Toolkit, adds the "Orchestration" level, i.e. it brings the concepts of an experiment and a steady-state hypothesis:

The basic "Conditions" and "Orchestration" levels as provided in most chaos engineering tools.

However, the real value of practicing chaos engineering is all contained in the next level up, what I'll call the "Decisions" level:

The higher "Decisions" level.

Without the "Decisions" level you can't easily decide what to explore, nor figure out the importance of the results. By providing the all-important business context through objectives, verification adds the concepts, workflow and information needed to be able to support your decision-making:

Verification adds the concepts critical to make your chaos engineering a powerful, decision-making tool.

Once you have executed a verification, either once or as part of a continuous verification schedule, you get an assessment of the impact on your objectives based on what was observed. Chaos experiments provide facts and evidence, but verifications indicate the impact on your objectives of the conditions you've applied. Seeing this impact in the context of an objective means you can effectively close the loop by deciding on, creating and tracking system improvement actions:

The result of a verification is an indicator of impact on your system's behaviour, relative to the targets of your objective.
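
One simple way to express impact relative to a target is to compare the share of samples that met the objective during a verification with the objective's target. The sketch below is an assumed, simplified scoring scheme with hypothetical names and sample data, not the assessment ChaosIQ actually produces.

```python
from typing import Dict, Sequence, Union


def assess_impact(
    samples_met: Sequence[bool], target_percent: float
) -> Dict[str, Union[float, bool]]:
    """Compare the observed rate of met samples against the objective's target."""
    if not samples_met:
        raise ValueError("at least one sample is required")
    achieved = 100 * sum(samples_met) / len(samples_met)
    return {
        "achieved_percent": achieved,
        "target_percent": target_percent,
        "objective_met": achieved >= target_percent,
        # How far below target the system fell under the applied conditions.
        "shortfall_percent": max(0.0, target_percent - achieved),
    }


# Hypothetical verification samples: True means the objective was met.
report = assess_impact([True, True, False, True], target_percent=99.0)
print(report)
```

A shortfall like this gives a starting point for prioritisation: the larger the gap for an objective you care about, the stronger the case for an improvement action.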

Start verifying your systems today!

You care about the experience your users and customers have of your systems. System verification provides the missing link between chaos engineering and the all-important objectives that the engineering and business stakeholders want to meet to ensure their users and customers are happy.

At ChaosIQ we've seen system verification be a real eye-opener for our current users. Everyone can understand the value of what is being explored because of the first-class support for the real objectives everyone cares about. If you're interested in exploring system verification, please get in touch for a free trial of ChaosIQ today.

The following videos accompany this article: