Integrate your Reliability Toolkit with Your World, Part 1

More Chaos Engineering Tips
Choose any chaos-inducing tool to be part of your powerful Reliability Toolkit chaos experiments and system verifications

TL;DR

  • Through the open-source Chaos Toolkit, you can perform chaos engineering experiments and verifications with the Reliability Toolkit using whatever failure injection tools you prefer.
  • The new “Chaos Experiment Import” tool opens up your Reliability Toolkit to the full range of custom experiments and verifications you may want to explore.
  • Through its open-source community, the Chaos Toolkit (and now the Reliability Toolkit) supports building experiments that make the most of your existing investment in popular failure injection tools, including Litmus Chaos, Chaos Mesh, AWS Systems Manager (SSM), and Chaos Blade.

The Power of Open Chaos Engineering

Chaos Engineering is science; Chaos Engineering is at its most valuable when it's highly collaborative, where everyone can see what experiments are being pursued, when they are happening, and what findings they have surfaced.

Chaos Engineering, Chapter 14, Russ Miles

Chaos Engineering, like any scientific discipline that enables progress and learning, benefits from being open.

The Open Chaos Initiative proposes an open approach to chaos engineering, introducing open formats for experiments, findings, and more. This open approach encourages sharing, collaboration, and collective learning.

Through the open-source Chaos Toolkit and its community ecosystem of extensions, you can use whatever chaos-inducing approach you want from within your Reliability Toolkit; the Chaos Toolkit itself is open source and powered by its community.

The Chaos Toolkit uses an open Chaos Experiment format: you develop an experiment and use the Chaos Toolkit to run it against your target system.

Experiments combine probes and actions that are applied to your target system to play out your experiment: probes query your system for status, and actions perform some state-changing event on your system, such as inducing a fault.
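
As a minimal sketch of the format, here is a single HTTP probe paired with a process action, using the same providers that appear in the full experiment later in this post; the URL and command are illustrative placeholders only:

{
  "steady-state-hypothesis": {
    "title": "Service responds with success",
    "probes": [
      {
        "type": "probe",
        "name": "service-must-respond-normally",
        "tolerance": 200,
        "provider": {
          "type": "http",
          "url": "http://example.org/health",
          "timeout": 3
        }
      }
    ]
  },
  "method": [
    {
      "type": "action",
      "name": "introduce-a-turbulent-condition",
      "provider": {
        "type": "process",
        "path": "kubectl",
        "arguments": "delete pod my-app-pod"
      }
    }
  ]
}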

Customize your Toolkit with Community Open Source Extensions

You can use all of the Chaos Toolkit extensions in your Reliability Toolkit to help you interact with your target system using platform-specific probes and actions. These extensions help you inject turbulent conditions through their actions, and discover your system’s weaknesses by using their probes in your Steady-State Hypothesis.

The Chaos Toolkit’s experimental power and versatility come from its growing ecosystem of extensions that allow users to interact with their systems and resources in their own unique ways. There are a large number of extensions available, many of which are community contributed.

A sample of the Chaos Toolkit Extensions
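
Extensions are distributed as Python packages that install alongside the Chaos Toolkit CLI itself. As a sketch, chaostoolkit-kubernetes is one extension from the catalog, and any other extension installs the same way:

# Install the Chaos Toolkit CLI plus the Kubernetes extension
pip install chaostoolkit chaostoolkit-kubernetes

# Optionally, discover the probes and actions the extension provides
chaos discover chaostoolkit-kubernetes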

Bring your existing Chaos Toolkit Experiments into your Reliability Toolkit as Experiments and Verifications

You can now bring your existing experiments into your Reliability Toolkit using the new “Import” tool. During this process you can choose to upgrade your experiments into more powerful System Verifications.

To import an experiment, sign in to the Reliability Toolkit, navigate to the verifications page, and select import experiment.

Import experiment screen

Using the Import tool, you add the additional details to upgrade your experiment to a full verification. Full details are in the Reliability Toolkit documentation.

Once you have imported your experiment and upgraded it to a verification, you will be taken to your new verification’s page. From here you can run the verification from a URL, or download it and run it locally.

Reliability Toolkit's Verification page

When you run the verification from your Chaos Toolkit installation, the ChaosIQ Cloud plugin will publish the results to the Reliability Toolkit.
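
The chaos verify command used later in this post comes from that plugin. As a rough sketch, assuming the plugin is distributed as the chaosiq-cloud Python package (check the Reliability Toolkit documentation for the current package name and sign-in flow):

# Assumption: the ChaosIQ Cloud plugin ships as the chaosiq-cloud package
pip install chaosiq-cloud

# Authenticate the Chaos Toolkit CLI against your Reliability Toolkit organization
chaos signin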

Opening up your Reliability Toolkit to the full pantheon of available Chaos Engineering tools

With the new Import tool, you can take full advantage of the many Chaos Toolkit Extensions inside of the Reliability Toolkit, which means you can select whatever popular chaos engineering failure injection tools you like and work with them inside the Reliability Toolkit, including Litmus Chaos, Chaos Mesh, AWS Systems Manager (SSM), and Chaos Blade.

Integrating your experiments with Litmus Chaos

Let’s now show how you can integrate a third-party, failure-inducing chaos tool into your Reliability Toolkit. To do this, you’re going to use Litmus Chaos to inject failure within a simple Chaos Experiment. The experiment encourages you to apply a steady-state hypothesis before and after your failure injection step, which will help you determine whether your service has been impacted.

To use Litmus Chaos, you need to install it on your Kubernetes cluster. The getting started section of the Litmus Chaos documentation fully covers this.
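
In outline, the setup amounts to installing the Litmus operator and then the chaos experiment definitions and RBAC for the experiments you want. The commands below are illustrative only; take the exact manifest URLs and versions from the Litmus getting started guide:

# Illustrative only: install the Litmus chaos operator
# (the exact manifest URL and version come from the Litmus docs)
kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator-v1.x.x.yaml

# Confirm the operator is running and its CRDs are in place
kubectl get pods -n litmus
kubectl get crds | grep chaos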

You will need a system under test and an exposed endpoint that you can use to measure it. This is covered by one of the experiments in the Open Chaos Experiment Catalog; the experiment’s readme has all the details for setting it up.
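
For orientation, the target assumed by the rest of this post looks roughly like this: an nginx Deployment labelled app=nginx in the nginx namespace (matching the ChaosEngine’s appinfo below), annotated for Litmus’ annotation check, and exposed through a Service so there is an endpoint URL to probe. This is a minimal, hedged sketch rather than the exact manifest from the catalog:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  namespace: nginx
  labels:
    app: nginx
  annotations:
    litmuschaos.io/chaos: "true"   # required because annotationCheck is enabled in the ChaosEngine
spec:
  replicas: 1                      # a single replica; the verification below shows why this matters
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx              # matches TARGET_CONTAINER in the ChaosEngine
          image: nginx:stable
          ports:
            - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: nginx
  namespace: nginx
spec:
  type: LoadBalancer               # provides the external ENDPOINT_URL used by the experiment
  selector:
    app: nginx
  ports:
    - port: 80
      targetPort: 80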

The experiment uses the kubectl command to trigger fault injection with Litmus Chaos.

{
  "version": "1.0.0",
  "title": "Website responds with success status, when the server container restarts ",
  "description": "Check the Website continues to responds with success status when litmus chaos restarts a server container.",
  "tags": [
      "platform:Staging Cluster",
      "service:Website",
      "turbulence:litmuschaos"
  ],
  "configuration": {
      "endpoint_url": {
          "type": "env",
          "key": "ENDPOINT_URL"
      },
      "chaos_yaml": {
        "type": "env",
        "key": "CHAOS_YAML"
    }

  },
  "contributions": {
      "availability": "high",
      "reliability": "high",
      "safety": "none",
      "security": "none",
      "performability": "mdeium"
  },
  "steady-state-hypothesis": {
      "title": "Website responds with success",
      "probes": [
          {
              "type": "probe",
              "name": "website-must-respond-normally",
              "tolerance": 200,
              "provider": {
                  "type": "http",
                  "url": "${endpoint_url}",
                  "timeout": 3
              }
          }
      ]
  },
  "method": [
  {
          "type": "action",
          "name": "Restart the webserver container with litmus chaos",
          "provider": {
              "type": "process",
              "path": "kubectl",
              "arguments": "apply -f ${chaos_yaml}"
          }
      }
  ],
  "rollbacks": []
}

experiment.json

The action step in the experiment applies a Litmus Chaos ChaosEngine YAML file to kill the container:

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: nginx-chaos
  namespace: nginx
spec:
  annotationCheck: 'true'
  engineState: 'active'
  appinfo:
    appns: 'nginx'
    applabel: 'app=nginx'
    appkind: 'deployment'
  chaosServiceAccount: container-kill-sa
  # use retain to keep the job for debug
  jobCleanUpPolicy: 'delete'
  experiments:
    - name: container-kill
      spec:
        components:
          env:
            # specify the name of the container to be killed
            - name: TARGET_CONTAINER
              value: 'nginx'

chaosengine.yaml

You are now free to use the full catalog of failure injection methods that are available in the Litmus Chaos Hub:

Litmus Chaos Hub

The problem with running this experiment is that you often don’t see the impact of the failure: it takes a little time for the failure injection method to take effect in your cluster, and the experiment may have finished by then. The Reliability Toolkit verifications can help with this.

Upgrading your Experiment to a long-running Reliability Toolkit Verification

Reliability Toolkit verifications show how conditions impact an objective, such as an SLO, over time. To do this, verifications run longer and measure more of what is happening throughout those conditions.

Starting from the experiment, you can use your Reliability Toolkit to upgrade it to a verification. From the verification import page, you can import the experiment:

Reliability Toolkit Import Experiment page

Upgrading to a verification also allows you to map the verification onto a business objective. The steps to upgrade an experiment to a verification are fully documented in the Reliability Toolkit documentation.

When you define your business objective or business SLO, you also define the business expectation for that objective, such as 95% uptime in any one day. A 95% daily uptime objective leaves an error budget of 5%, or roughly 72 minutes of allowed downtime per day.

Then, when adding your verification, you specify what you will measure to meet this objective. In this case, you will get the status from the endpoint URL every 5 seconds for a period of 2 minutes:

A verification imported in the Reliability Toolkit

The verification you have defined should show you the impact of the failure.

Once you have imported the verification, download it to your local working directory and run it with:

ENDPOINT_URL=http://34.105.191.45 CHAOS_YAML=chaosengine.yaml chaos verify verification.json
[2020-05-14 17:57:14 INFO] Validating the experiment's syntax
[2020-05-14 17:57:14 INFO] Experiment looks valid
[2020-05-14 17:57:14 INFO] Verification looks valid
[2020-05-14 17:57:16 INFO] Execution available at http://console.chaosiq.dev/ChaosIQ/chaos-ebn/executions/c1bbf8be-a758-4dcf-8d08-94c44bf5196a
[2020-05-14 17:57:17 INFO] Started run '2b53dc0c-9677-40a7-920b-4ad56ad98244' of verification 'Checks the hypothesis that a URL responds with a 200 status'
[2020-05-14 17:57:18 INFO] Starting verification warm-up period of None seconds
[2020-05-14 17:57:18 INFO] Finished verification warm-up
[2020-05-14 17:57:18 INFO] Triggering verification conditions
[2020-05-14 17:57:20 INFO] Starting verification measurement every 5 seconds
[2020-05-14 17:57:20 INFO] Running verification measurement 1
[2020-05-14 17:57:20 INFO] Steady state hypothesis: Application is normal with container restart
[2020-05-14 17:57:20 INFO] Probe: application-must-respond-normally
[2020-05-14 17:57:20 INFO] Steady state hypothesis is met!
[2020-05-14 17:57:22 INFO] Action: kubectl apply litmus chaos engine
[2020-05-14 17:57:26 INFO] Finished triggering verification conditions
[2020-05-14 17:57:26 INFO] Starting verification conditions for 120.0 seconds
[2020-05-14 17:57:26 INFO] Running verification measurement 2
[2020-05-14 17:57:26 INFO] Steady state hypothesis: Application is normal with container restart
[2020-05-14 17:57:26 INFO] Probe: application-must-respond-normally
[2020-05-14 17:57:26 INFO] Steady state hypothesis is met!
[2020-05-14 17:57:32 INFO] Running verification measurement 3
[2020-05-14 17:57:32 INFO] Steady state hypothesis: Application is normal with container restart
[2020-05-14 17:57:32 INFO] Probe: application-must-respond-normally
[2020-05-14 17:57:32 INFO] Steady state hypothesis is met!
[2020-05-14 17:57:39 INFO] Running verification measurement 4
.....

The listing above does not show the full output. In the middle of the verification run, you will see some errors as a result of Litmus Chaos shutting down the server container:

The verification output displays errors

It is also useful to monitor the pods in your cluster with:

watch kubectl get pods --all-namespaces

The output from the above shows the Litmus Chaos operator, the container-kill pod, and the pumba-sig-kill pod running and injecting chaos on the NGINX server.

Viewing the Impact of your Experiment in the Reliability Toolkit

The verification above is running locally using the Chaos Toolkit and results are being published to the Reliability Toolkit. This can be seen from the Timeline:

A screenshot of the Reliability Toolkit timeline

The Summary Timeline view has a horizontal timeline and a vertical timeline. The horizontal timeline allows you to see where the verification sits in relation to other events on your system. An example could be a deploy request from the CI/CD pipeline; it could be that deploy request that causes a deviation in the verification.

There is also a vertical timeline view that allows you to track the history of events in the team you have selected. From that history, you can drill into the details of those events; in this case, you can view the execution details of the verification:

Reliability Toolkit's Experiment Page

The insights view contains the measurements performed by the verification, the outcome of those measurements on a timeline graph, and all the details of your business objective and your verification:

Verification Insights screenshot

You can see from the verification timeline that there were 4 deviating measurements in the middle of the verification run. It also shows the impact of this deviation on your error budget: in this case, the deviations cause a 13% impact, which blows the daily error budget of 5%. If this were a live system, some pagers would be going off right now.

The point of running the verification is to learn from the process. In this case, you have seen a significant impact on the business objective. If this were really an event triggered by the CI/CD pipeline, you would need to add some robustness to the system to reduce the impact of the container restart.

If you look back at the deployment spec for the webserver, you will see that it only includes a single replica. You could add one or more replicas and run the verification again; hopefully, the impact on the live site will be gone.
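
For example, you could scale the existing deployment (assuming the nginx deployment in the nginx namespace sketched earlier) and re-run the verification to compare the impact:

# Add replicas so the site keeps serving while one container is killed
kubectl scale deployment nginx -n nginx --replicas=3

# Re-run the verification with the same inputs as before
ENDPOINT_URL=http://34.105.191.45 CHAOS_YAML=chaosengine.yaml chaos verify verification.json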

If you were planning to make the live deploy part of the CI/CD pipeline, it would be worth making this a long-running verification that runs 24/7 monitoring the website; the Chaos Toolkit has a Kubernetes operator that can be used for that.

The Reliability Toolkit also has a Humio plugin that can generate an alert if any deviations are detected. That alert can also be posted to the Reliability Toolkit timeline, enabling you to build a complete historical picture of the events impacting the reliability of your system.

Building your own Reliability Toolkit

The Reliability Toolkit, along with the Chaos Toolkit and other chaos engineering tools such as Litmus Chaos, gives you plenty of options for building tools that you can use to improve the reliability of your systems.

The Reliability Toolkit allows you to collaboratively develop a suite of Objectives and Measures that you can run against your systems. Ongoing improvements to these, along with robustness strategies, will improve the reliability of your systems.

Running game days will help you learn how to react when there is turbulence injected into your system. Injecting faults into your system and learning how to respond to the impacts of those faults means you will be better prepared when unknown events impact your system.