The System Verification Workflow

More Chaos Engineering Tips
The System Verification Workflow
In previous articles we’ve looked at how System Verification extends Chaos Engineering, bridging the gap between the scientific search of empirical evidence of system weaknesses to the business concerns that justify the science in the first place.

In this article you’re going to see what are the key concepts and workflow that you can adopt today to start verifying your systems to make your users happy.

The Job of Verification

There are lots of good reasons to verify your system. You might be worried about surprise downtime and looking to try to anticipate and prepare better for incidents. You might also be curious as to how your system might respond to challenging conditions, and want to explore that before your users do.

At ChaosIQ we believe that you will want to verify your system because you care about your users’ experience, and specifically about the reliability of that experience. It’s not about the threat of downtime alone, or even the cost of such events (which can be considerable!), ultimately we believe that successful systems, and businesses that are successful because of them, have happy users. Happy users can rely on your system because it does what they need when they want it to. Happy users lead to the all-important business impact of user retention and growth, which is why user happiness is a core tenet of System Verification (and also Site Reliability Engineering, and not by chance...).

The goal of System Verification is not only to show the impact that various conditions might have on your users happiness but to also turn those insights into prioritised, meaningful actions for improving a system's reliability. ChaosIQ brings this System Reliability Workflow to life with Objectives, Measurements, Conditions, Insights and Action Items.

Objectives and Measurements

Objectives are literally things you care about. If you care about it, and can measure it, then it’s an objective.

You might care that your homepage is available 99.9% of the time. You might care that your database is able to respond to a particular query under an amount of time. You might care that your users can interact with your system, perform some system function, 99.999% of the time as it’s super-critical! These are all Objectives, and they frame what’s important to you about your system. A good set of Objectives represents a great high-level description of how your system aims to behave, from your perspective, in order to make your users happy.

However an Objective is not particularly useful unless it can be measured. Objectives can be measured in lots of different ways, often in combination, to give you some confidence that an Objective, at a given moment, is being met:

Measurements, with accompanying tolerances, provide "coverage" to answer the question, "Are we meeting the Objective?".

The job of a measurement, or combination of measurements, is to be able provide a sample, an event, that is either “successful” or "failing" at a given moment in time. For example, you might measure that the response time for your homepage is within a tolerable time bracket at a given moment and, if it is within that tolerance, then it is a “successful” sample event. If the measurement shows that the homepage URL does not respond in the tolerated time, then the sample event would be designated "failed':

Measurements are used to sample the system to decide collectively if an Objective is being met.

The combination of Objectives and their corresponding measurements provides a shared understanding of how your system needs to behave in order to keep your users happy. This is the all-important framing necessary to then decide what you want to verify in order to see how the your Objectives, and your users’ happiness, is affected by your system’s behaviour under various conditions. Defining Objectives and Measures is the first step in system verification with ChaosIQ:

Defining Objectives and Measures in ChaosIQ.

Verifications

Verifying an Objective helps you build trust and confidence in how your system will behave under a set of interesting conditions. Those conditions could be behaviour such as security intrusions, infrastructure failures, platform failures, application failures, interesting user behaviour or even administrative changes. In fact, anything that you can conceive of that could happen to your system could be an interesting scenario to explore if you think there’s a likelihood that it may have an impact on one or more of your system’s Objectives.

In our book, ”Learning Chaos Engineering” by O'Reilly Media, a couple of ways of exploring the conditions you might want to explore are covered. In addition, when you are designing a Verification you have the all-important framing of the overall Objective to help you. The Objective helps you decide and prioritise what verifications to perform that will give you the most effective insights, not least because you may want to perform the verification in production where there is likely more risk from the exploration.

For example, you might decide that you want to verify how your system behaves when:

  • A virtual machine fails
  • A network starts to lose packets
  • A configuration change is made
  • A Kubernetes Pod, or collection of pods, fail
  • A cluster is starved of resources, such as nodes
  • A database begins to respond slowly

The list will go on, and you will almost certainly have your own unique conditions that you’re interested in. This is one of the reasons why ChaosIQ is built on top of the free and open source Chaos Toolkit, which makes ChaosIQ an easily customisable and extendable system verification and chaos engineering environment. You need that flexibility, because your conditions will often be unique to your systems:

ChaosIQ Target Systems and Conditions.

The first thing you describe in a Verification is the types of conditions you want to introduce to see how this affects your system’s behaviour in respect of your overall Objective:

A verification then needs a description of how it will be executed. While a verification is running it will be capturing your Objective’s measurements throughout, continuously gathering sample events that indicate whether the measurement was successful or failed in reference to the overall Objective:

Duration of the Verification, and Frequency of the Measurement.

A Verification needs two pieces of information to run itself: Duration and Frequency of Measurement. Duration is self-explanatory, it’s the total duration that you want the verification to be executed for. For example, you might want to run a Verification as the backdrop to a 3-hour GameDay (More on how to do this in a forthcoming article) so, as you’d expect, you’d set the duration of the Verification in this case to 3 hours.

Frequency of Measurement describes just how often you will want to gather a sample event from the system as described by a Measurement for your Objective. You might want to sample every second, or perhaps sampling a measurement every 10 minutes will give you the fidelity you’re looking for. As you run your Verification, all of the data captured by your Measurements’ sample events are collated for interpretation into Insights.

Insights

When your Verification has been executed the resulting, potentially large, set of data collated from all the sample event measurements is captured and made available to you. This timeline of event samples in its raw form can be useful but the most common first step is to try and assess what impact on your Objective was seen by those measurements during your verification’s execution. This insight is called the “Objective Impact”.

The ”Objective Impact“ insight is a calculation based on the duration of the verification and how many failed sample events were measureds. The insight gives you an extrapolated indicator of what size of impact the conditions of your verification have on your Objective. This information is key to then decide whether the impact is so minor that your system survived just fine, or whether it’s actually time to assemble your teams and look to the next stage of the workflow: Action Items.

Action Items

Insights give you the information you and your teams need to decide what to do to improve your system’s reliability. The insights are framed by the objectives you care about and so you have everything you need to be able to understand and prioritise what should be done and when. What should be done and when are Action Items.

Action Items are how you describe what system reliability improvements you are going to work on. You describe them in ChaosIQ while you analyze the insights from your Verifications. Then track their progress and eventually impacts on future Verification executions. ChaosIQ integrates with the systems you normally use to capture, prioritise and track your work, i.e. issue trackers. ChaosIQ mirrors Action Items as issues in your regular toolset so you can work on this reliability work and its progress is mirrored back into ChaosIQ.

Action Items close the learning loop of the System Verification workflow. Each time you and your teams create new Verifications and explore more and more insights, you will describe and action more reliability improvements balancing your need to innovate at speed with the need to provide a system your users can rely on.

Improve System Reliability through the ChaosIQ System Verification Workflow.

Summary

Happy users means successful systems and happy users need your system to be useful and reliable. Work needed to improve a system’s reliability needs to be proactively surfaced, described and prioritised alongside those tasks that improve the features that the system provides. The System Verification Workflow in ChaosIQ is your engine of system reliability improvement, helping you understand where, when and how to improve your system’s reliability provably and effectively.

ChaosIQ is available for Early Access today. Contact us to find out how to join our Early Access Programme.

Photo by Tomáš Nožina on Unsplash