Why Now? Why Chaos? Why… Verify
February 3rd, 2020
by Russ Miles, CEO of ChaosIQ.
TL;DR - If you care about your systems, you'll want to verify how your system's objectives are impacted by surprising conditions. In this article you'll get a high-level explanation of how verification meets the business need to verify your systems proactively, before your users feel the pain.
There are no best practices in system engineering... Well, ok, maybe that's too much. There are plenty of bad practices of course, but when it comes to context-independent, "Just do it" practices, there are very, very few. Everything is context-based and, after all, "Context is King".
Chaos engineering happens to be one of those few, especially the chaos engineering mindset. The socio-technical system of production is a complex, often chaotic place that experiences turbulent conditions from all sides, from the technical to the people, practices and processes that surround it. But chaos engineering in practice has significant limitations in many contexts, especially when experiments and their findings are kept in a vacuum.
That vacuum is the gulf between what you explore with chaos engineering and what your company wants to know. Chaos engineering provides "evidence of system weaknesses", but that's not easily translated into something everyone can understand and get behind. The answer lies in the question, "What makes a system successful?"
What your company wants is what you want
A successful system, or service, can be judged in almost exactly the same way as a successful company. Regardless of what your system, or company, does, there are some simple objectives that both share that tell you whether they are good at their job. Every investor knows this, and it's the basics of economics that can help us all as engineers. Those simple measures are:
- User/Customer Growth
- User/Customer Retention
For the purposes of this article I'll refer to User when I mean either Customer or User.
Of course there are lots of other potential objectives and measures of success for systems, services and businesses, but ultimately if you don't have those two then you know that you have, at the very least, a very bad system, service or business. How does this relate to verification through chaos engineering? First let's look at what affects those objectives...
What encourages and hurts user growth and retention?
Systems can directly affect their user growth, ultimately by making their functionality attractive to a larger group of consumers. This usually means more features, a feature being anything from something the user can actually use through to improved documentation. Alongside this there are the usual drivers of marketing and incentives of course, but without great features for the growing user base, marketing and incentives only go so far.
Retention is more nuanced. Yes, features are important. Successful systems meet and evolve with the changing needs of their existing users as much as providing new, attractive features to encourage growth, and so features are important once again.
But there is another crucial factor, one that can have a catastrophic effect on your retention and growth. Sometimes called "Trust", "Confidence" or "Safety", in modern software system engineering this factor tends to get labelled as "Reliability".
We've probably all experienced disappointment in a system, where it's "let us down for the last time, dammit!". It's usually easy to remember at least one such system where you've been a happy user for a time, only to walk away after one frustration too many. With competition ever increasing between similar systems, this moment is becoming even more damaging. For example, banking used to be fairly static and loyalty-driven. The system was consistent, and often you'd stay with your bank for life. In these days of modern online systems, moving bank accounts is becoming easier and easier, meaning user happiness is more important than ever to attract and keep customers. And banks are far from alone in this dynamic: every modern system is under a similar threat, and has a similar opportunity.
The two keys in modern systems to user growth and retention are speed of change, including new features, and reliability. Both are crucial to user happiness, and both need one another for your system to be a success. The good news is that modern companies are aware of this.
Ongoing digital transformation of business has led to more adaptability than ever, which has enabled the widespread adoption of agile development techniques that attempt to mix the cultures of the business and development. Greater speed of delivery has been enabled by the adoption of DevOps, breaking down the barriers between development and running systems in production. Cloud Native approaches bring increased flexibility to what, where and how you run your systems.
But what about greater reliability? In this respect, specific approaches such as Site Reliability Engineering (SRE) are leading the way, bringing ways to balance reliability and speed through crucial concepts such as the error budget.
The focus of SRE is on working towards user happiness. Service Level Objectives (SLOs), Service Level Indicators (SLIs) and error budgets are there to help you make the call as to when reliability needs to achieve a better balance against pure feature velocity. The difficulty that SRE faces is that modern systems are sufficiently complex that prevention is often impossible and, due to the risk and cost of outages, reacting to incidents alone is unacceptable.
This is where verification comes in. Whether you're practicing SRE or not, your system will have objectives it tries to meet, ways of measuring those objectives, and some idea of the types of conditions, including failure, the system might need to survive (survival being defined as "behaving as expected, if not optimally"). Verification refines chaos engineering to emphasize those objectives, measures and conditions, helping you proactively explore how your systems behave under those conditions of failure before they affect your users.
Verification encourages you to verify how your system will behave under difficult conditions, to verify that the compensating strategies you have in place will actually work and, if they don't, to show you the impact on your error budget to help you decide "What should I invest in to improve?".
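To make the error budget idea a little more concrete, here is a minimal sketch of the arithmetic for a request-based objective. The `ErrorBudget` class, its field names and the example numbers are assumptions made for illustration only; they are not taken from ChaosIQ or from any particular SRE tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Hypothetical error budget for a request-based SLO over one window."""

    slo_target: float    # e.g. 0.999 means 99.9% of requests must succeed
    total_requests: int  # requests observed in the reporting window

    def __post_init__(self) -> None:
        # A target of exactly 1.0 would leave no budget at all to spend.
        if not 0.0 < self.slo_target < 1.0:
            raise ValueError("slo_target must be strictly between 0 and 1")

    @property
    def allowed_failures(self) -> float:
        # The budget is the share of requests the SLO permits to fail.
        return (1.0 - self.slo_target) * self.total_requests

    def remaining(self, failed_requests: int) -> float:
        # Fraction of the budget still unspent; negative means the SLO is breached.
        return 1.0 - failed_requests / self.allowed_failures


budget = ErrorBudget(slo_target=0.999, total_requests=1_000_000)
print(round(budget.allowed_failures))   # 1000
print(round(budget.remaining(250), 2))  # 0.75
```

If a verification shows that a failure condition would burn through most of that remaining budget, that is a strong hint about where to invest next.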
What does Verification look like?
The process of Verification is pretty straightforward:
Verification starts with objectives, which are essentially equivalent to SLOs: descriptions of what you care about in your system.
Second comes your measurements. Measurements are indicators that tell you whether, at a given moment in time or perhaps based on a trend, your objective is being met.
Finally you can design and run a number of verifications for your objective.
A verification runs for a specific amount of time and is responsible for verifying that your objective is still met under conditions that you choose to apply.
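The three steps above can be sketched in a few lines of Python. This is only an illustration of the shape of the idea, not the ChaosIQ API: `Objective`, `VerificationResult`, `run_verification` and the simulated service at the bottom are all hypothetical names invented for this sketch.

```python
import time
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Objective:
    """What you care about, e.g. 'checkout responds quickly'."""

    name: str
    target_ratio: float  # fraction of measurements that must pass


# A measurement returns True when the indicator looks healthy right now.
Measurement = Callable[[], bool]


@dataclass
class VerificationResult:
    objective: Objective
    samples: List[bool]

    @property
    def achieved_ratio(self) -> float:
        return sum(self.samples) / len(self.samples)

    @property
    def objective_met(self) -> bool:
        return self.achieved_ratio >= self.objective.target_ratio


def run_verification(
    objective: Objective,
    measure: Measurement,
    apply_condition: Callable[[], None],
    remove_condition: Callable[[], None],
    duration_s: float,
    interval_s: float,
) -> VerificationResult:
    """Apply a condition, then sample the measurement for a fixed duration."""
    samples: List[bool] = []
    apply_condition()
    try:
        deadline = time.monotonic() + duration_s
        while True:
            samples.append(measure())  # always take at least one sample
            if time.monotonic() >= deadline:
                break
            time.sleep(interval_s)
    finally:
        remove_condition()  # always roll the condition back
    return VerificationResult(objective, samples)


# Simulated service (an assumption for this sketch): latency jumps under the condition.
service = {"latency_ms": 50}
result = run_verification(
    objective=Objective("checkout responds quickly", target_ratio=0.95),
    measure=lambda: service["latency_ms"] < 300,
    apply_condition=lambda: service.update(latency_ms=400),
    remove_condition=lambda: service.update(latency_ms=50),
    duration_s=0.03,
    interval_s=0.01,
)
print(result.objective_met)  # False: this simulated service has no compensating strategy
```

In a real system the condition would be an actual turbulent event (a dependency slowing down, a node disappearing) and the measurement would read from your monitoring, but the loop stays the same: apply, measure for a while, roll back, compare against the objective.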
Verify It: Start verifying your systems today with ChaosIQ!
You care about the experience your users and customers have of your systems. System verification provides the missing link between chaos engineering and the all-important objectives that the engineering and business stakeholders want to meet to ensure their users and customers are happy.
At ChaosIQ we've seen system verification be a real eye-opener for our current users. Everyone can understand the value of what is being explored because of the first-class support for the real objectives everyone cares about. If you're interested in exploring system verification then please get in touch to get a free trial of ChaosIQ today.
Photo by Danielle MacInnes