Continuous Verification with Chaos Toolkit Experiments using a Kubernetes Operator

TL;DR

  • The Reliability Toolkit adds tools for continuous verification of reliability objectives (SLOs) using chaos experiments.
  • Define, deploy, and schedule continuous verifications as a Kubernetes Operator, using the Reliability Toolkit and the new Chaos Toolkit CRD.
  • The Reliability Toolkit’s Timeline helps you inspect and analyze your periodic and recurring chaos experiments and verifications.

Continuous Verification of Reliability Objectives

The Reliability Toolkit includes tooling that enables you to identify and build reliability objectives, such as SLOs, for your business.

You can also define a Verification associated with your business reliability objective that allows you to measure the impact of specific conditions on your system. When you define a Verification in the Reliability Toolkit you specify how often you sample the verification measure and over what duration.

This allows you to continuously monitor that verification for the specified duration. The Reliability Toolkit documentation has detailed instructions for defining an Objective, a Verification, and then finally running that Verification.

A Verification screen in the Reliability Toolkit

This screenshot shows the instructions for running the verification to see its impact on the linked objective, along with the objective's availability expectation (95% uptime over 1 day). It also shows the verification strategy, which here is to sample the system every 10 seconds over a period of 5 minutes.
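As a quick sanity check on the numbers in this strategy, the arithmetic can be sketched in Python. This is only an illustration: the function names are invented for this example and are not part of the Reliability Toolkit, and the number of samples the toolkit actually takes may differ by one depending on whether it samples at both ends of the window.

```python
def sample_count(frequency_s: float, duration_s: float) -> int:
    """Number of whole sampling intervals that fit in the duration."""
    return int(duration_s // frequency_s)


def allowed_downtime_minutes(target: float, window_minutes: float) -> float:
    """Downtime budget implied by an availability target over a window."""
    return (1.0 - target) * window_minutes


# Sampling every 10 seconds over 5 minutes.
print(sample_count(10, 5 * 60))  # 30

# 95% uptime over 1 day leaves a budget of 72 minutes of downtime.
print(f"{allowed_downtime_minutes(0.95, 24 * 60):.1f}")  # 72.0
```

So a single run of this verification takes roughly 30 measurements, and the linked objective tolerates a little over an hour of downtime per day.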

Define, deploy, and schedule continuous verifications as a Kubernetes Operator

If you're using Kubernetes you may wish to run continuous verifications within your cluster.

The Chaos Toolkit provides a Kubernetes operator and an accompanying custom resource definition (CRD). Kubernetes operators are a popular approach for creating bespoke controllers on top of the Kubernetes API. The operator can be used to run Chaos Toolkit experiments on demand by submitting custom-resource objects.

The Chaos Toolkit operator listens for experiment declarations and triggers a new Kubernetes pod, running the Chaos Toolkit with the specified experiment.

Run an Experiment with the Chaos Toolkit Kubernetes Operator

To run an experiment with the Chaos Toolkit Kubernetes Operator you will need a Kubernetes cluster, either local, on Google Cloud, or on another cloud provider. If you don't have a cluster, you can use this quick start to get up and running quickly.

If you are using Kubernetes on Google Cloud, ensure you have gcloud and kubectl installed locally and configured for your cluster. You can then deploy the Chaos Toolkit Operator on your cluster. Once it is deployed, you should see the operator running in the chaostoolkit-crd namespace.

kubectl -n chaostoolkit-crd get pods
NAME                                READY   STATUS    RESTARTS   AGE
chaostoolkit-crd-5596945476-n9cl5   1/1     Running   0          12m

Running an experiment

The full details for how you can run experiments can be seen in the Chaos Toolkit Documentation, but to get something running quickly you can use:

---
apiVersion: v1
kind: Namespace
metadata:
  name: chaostoolkit-run
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: chaostoolkit-env
  namespace: chaostoolkit-run
data:
  ENDPOINT_URL: "https://httpstat.us/200"
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: chaostoolkit-experiment
  namespace: chaostoolkit-run
data:
  experiment.json: |
    {
        "version": "1.0.0",
        "title": "Checks the hypothesis that a URL responds with a 200 status",
        "description": "Check a given url responds with a 200 status",
        "tags": [
            "platform:local",
            "service:url"
        ],
        "configuration": {
            "endpoint_url": {
                "type": "env",
                "key": "ENDPOINT_URL"
            }
        },
        "contributions": {
            "availability": "high",
            "reliability": "none",
            "safety": "none",
            "security": "none",
            "performability": "none"
        },
        "steady-state-hypothesis": {
            "title": "Application is normal",
            "probes": [
                {
                    "type": "probe",
                    "name": "application-must-respond-normally",
                    "tolerance": 200,
                    "provider": {
                        "type": "http",
                        "url": "${endpoint_url}",
                        "timeout": 3
                    }
                }
            ]
        },
        "method": [
            {
                "type": "action",
                "name": "dummy step",
                "provider": {
                    "type": "process",
                    "path": "echo",
                    "arguments": "URL used is: ${endpoint_url}"
                }
            }
        ],
        "rollbacks": []
    }
---
apiVersion: chaostoolkit.org/v1
kind: ChaosToolkitExperiment
metadata:
  name: url-endpoint-exp
  namespace: chaostoolkit-crd
spec:
  namespace: chaostoolkit-run
  pod:
    image: chaosiq/chaostoolkit
    env:
      configMapName: chaostoolkit-env
    chaosArgs:
    - --verbose
    - run
    - ${EXPERIMENT_PATH-$EXPERIMENT_URL}

If the above YAML is stored in the local file basic-url-experiment.yaml you can run the experiment with:

kubectl apply -f basic-url-experiment.yaml

Having run the above, you can check which pods you have in the chaostoolkit-run namespace:

kubectl -n chaostoolkit-run get pods
NAME                 READY   STATUS    RESTARTS   AGE
chaostoolkit-b1jaq   0/1     Running   0          16s

The experiment passes the --verbose argument to the chaos command so you get more detail in the logging. This is entirely optional, but it can be really helpful while you are developing your experiments. You can view the logs by executing the following, replacing the pod name with your own:

kubectl -n chaostoolkit-run logs chaostoolkit-d6196
[2020-05-01 15:02:05 DEBUG] [cli:70] ###############################################################################
[2020-05-01 15:02:05 DEBUG] [cli:71] Running command 'run'
[2020-05-01 15:02:05 DEBUG] [cli:75] Using settings file '/root/.chaostoolkit/settings.yaml'
[2020-05-01 15:02:05 DEBUG] [__init__:355] No controls to apply on 'loader'
[2020-05-01 15:02:05 DEBUG] [__init__:355] No controls to apply on 'loader'
[2020-05-01 15:02:05 DEBUG] [caching:25] Building activity cache...
[2020-05-01 15:02:05 DEBUG] [caching:35] Cached 2 activities
[2020-05-01 15:02:05 INFO] [experiment:54] Validating the experiment's syntax
[2020-05-01 15:02:05 DEBUG] [configuration:47] Loading configuration...
[2020-05-01 15:02:05 DEBUG] [secret:74] Loading secrets...
[2020-05-01 15:02:05 DEBUG] [secret:89] Secrets loaded
[2020-05-01 15:02:05 INFO] [experiment:103] Experiment looks valid
[2020-05-01 15:02:05 DEBUG] [caching:42] Clearing activities cache
[2020-05-01 15:02:05 DEBUG] [caching:25] Building activity cache...
[2020-05-01 15:02:05 DEBUG] [caching:35] Cached 2 activities
[2020-05-01 15:02:05 INFO] [experiment:182] Running experiment: Checks the hypothesis that a URL responds with a 200 status
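
The experiment above reads its target URL from the ENDPOINT_URL environment variable, which the operator populates from the chaostoolkit-env ConfigMap, and then refers to it as ${endpoint_url} in the probe and the action. The idea can be sketched in Python as follows. This is a simplified illustration of the substitution, not the Chaos Toolkit's own implementation, and the helper names are hypothetical.

```python
from string import Template


def resolve_configuration(configuration: dict, environ: dict) -> dict:
    """Resolve 'env' typed configuration entries against an environment mapping."""
    resolved = {}
    for name, value in configuration.items():
        if isinstance(value, dict) and value.get("type") == "env":
            resolved[name] = environ[value["key"]]
        else:
            resolved[name] = value
    return resolved


def substitute(text: str, resolved: dict) -> str:
    """Replace ${name} references with their resolved values."""
    return Template(text).safe_substitute(resolved)


# Same shape as the "configuration" block in the experiment.
configuration = {"endpoint_url": {"type": "env", "key": "ENDPOINT_URL"}}
# Stand-in for the pod environment built from the ConfigMap.
environ = {"ENDPOINT_URL": "https://httpstat.us/200"}

resolved = resolve_configuration(configuration, environ)
print(substitute("URL used is: ${endpoint_url}", resolved))
# URL used is: https://httpstat.us/200
```

Keeping the URL in the ConfigMap means the same experiment file can be pointed at a different endpoint without editing the experiment itself.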

Running a Reliability Toolkit Verification using Kubernetes

The Reliability Toolkit allows you to upgrade an experiment to a verification. A verification can be run repeatedly over a specified period, which is a great help if you want to verify the reliability of your system over a longer period of time. To upgrade an experiment to a verification, use the Reliability Toolkit Verifications Import page.

Reliability Toolkit Experiment Import

On the import page you import and attach your experiment to an objective.

You will also be able to specify how often you want to repeat your verification over its duration. Once you have created the verification you have the option to download it or run it directly from a URL.

Reliability Toolkit Verification Run page

A verification adds an extension block to an experiment:

"extensions": [
  {
    "name": "chaosiq",
    "experiment_id": "e0bea10b-6d79-4216-85e2-dcfea2156e4f",
    "objective_id": "ab325c73-355c-4608-9eac-1c82d1a67854",
    "verification": {
      "id": "aaee6252-85bb-4257-88c0-9608fc34679c",
      "frequency-of-measurement": 5,
      "duration-of-conditions": 30.0
    },
    "team_id": "c1ee76a2-2a07-4158-b5d9-b1b71cbb714d",
    "org_id": "e8cd7e47-6b78-4510-bee7-ea93d082fbae"
  }
]

The extension block adds some metadata to the experiment. This includes:

  • A unique id for the experiment.
  • A cross-reference to the objective that you created as part of the import.
  • A unique id for the verification, plus its frequency of measurement and duration.
  • Your organisation id and team id that you were using when you imported the experiment.

When running with the Chaos Toolkit operator, the extension block can be added to the experiment in the YAML file you used earlier, so that it can now be run as a verification.
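
If you prefer to script this step rather than edit the JSON by hand, the merge can be sketched in Python. This is a minimal sketch under stated assumptions: the function name is hypothetical, the experiment dictionary is trimmed for brevity, and the id values are placeholders that you would replace with the ones generated for your own import.

```python
import json


def as_verification(experiment: dict, extension: dict) -> dict:
    """Return a copy of the experiment with the given extension attached.

    Any existing extension with the same name is replaced.
    """
    upgraded = dict(experiment)
    extensions = [
        e for e in experiment.get("extensions", [])
        if e.get("name") != extension["name"]
    ]
    extensions.append(extension)
    upgraded["extensions"] = extensions
    return upgraded


# Trimmed stand-in for the experiment used earlier.
experiment = {
    "version": "1.0.0",
    "title": "Checks the hypothesis that a URL responds with a 200 status",
    "method": [],
    "rollbacks": [],
}

# Placeholder ids: use the values from your own verification.
extension = {
    "name": "chaosiq",
    "experiment_id": "<EXPERIMENT_UID>",
    "objective_id": "<OBJECTIVE_UID>",
    "verification": {
        "id": "<VERIFICATION_UID>",
        "frequency-of-measurement": 5,
        "duration-of-conditions": 30.0,
    },
    "team_id": "<TEAM_UID>",
    "org_id": "<ORG_UID>",
}

upgraded = as_verification(experiment, extension)
print(json.dumps(upgraded, indent=4))
```

The printed JSON can then be pasted under the experiment.json key of the chaostoolkit-experiment ConfigMap.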

ChaosIQ Cloud Plugin

The standard image used by the Kubernetes operator does not include the ChaosIQ Cloud plugin, but it can easily be added by building a Docker image. The instructions for adding the plugin to a Docker image are in the Chaos Toolkit documentation. You can also use the open-source Docker image that is published on Docker Hub: chaosiq/chaostoolkit.

To run a verification connected to ChaosIQ, you will need to add the following configuration to the YAML that you used earlier:

settings:
   enabled: true
image: chaosiq/chaostoolkit

You will also need to pass the Chaos Toolkit settings as a Kubernetes secret; the details are in the Chaos Toolkit documentation. This requires a local settings.yaml file, which is covered in the Reliability Toolkit documentation.

Continuous Verification with the Kubernetes Operator

To execute a long-running verification with the Kubernetes operator and publish the results to the Reliability Toolkit, you need to make a couple more additions to the YAML file. Add the following to your chaos command’s arguments:

--settings /home/svc/.chaostoolkit/settings.yaml

The last thing needed is a schedule entry, which is added to the spec that targets the chaostoolkit-run namespace. Again, this is fully documented in the Chaos Toolkit documentation.

spec:
   namespace: chaostoolkit-run
   schedule:
      kind: cronJob
      value: "*/3 * * * *"

The above configuration will run the verification using the operator every 3 minutes. The final file for the long-running verification with the Kubernetes operator is:

---
apiVersion: v1
kind: Namespace
metadata:
  name: chaostoolkit-run
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: chaostoolkit-env
  namespace: chaostoolkit-run
data:
  ENDPOINT_URL: "https://httpstat.us/200"
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: chaostoolkit-experiment
  namespace: chaostoolkit-run
data:
  experiment.json: |
    {
        "version": "1.0.0",
        "title": "Checks the hypothesis that a URL responds with a 200 status",
        "description": "Check a given url responds with a 200 status",
        "tags": [
            "platform:local",
            "service:url"
        ],
        "configuration": {
            "endpoint_url": {
                "type": "env",
                "key": "ENDPOINT_URL"
            }
        },
        "contributions": {
            "availability": "high",
            "reliability": "none",
            "safety": "none",
            "security": "none",
            "performability": "none"
        },
        "steady-state-hypothesis": {
            "title": "Application is normal",
            "probes": [
                {
                    "type": "probe",
                    "name": "application-must-respond-normally",
                    "tolerance": 200,
                    "provider": {
                        "type": "http",
                        "url": "${endpoint_url}",
                        "timeout": 3
                    }
                }
            ]
        },
        "method": [
            {
                "type": "action",
                "name": "dummy step",
                "provider": {
                    "type": "process",
                    "path": "echo",
                    "arguments": "URL used is: ${endpoint_url}"
                }
            }
        ],
        "rollbacks": [],
        "extensions": [
          {
            "name": "chaosiq",
            "experiment_id": "<EXPERIMENT_UID>",
            "objective_id": "<OBJECTIVE_UID>",
            "verification": {
              "id": "<VERIFICATION_UID>",
              "frequency-of-measurement": 5,
              "duration-of-conditions": 30.0
            },
            "team_id": "<TEAM_UID>",
            "org_id": "<ORG_UID>"
          }
        ]
    }
---
apiVersion: chaostoolkit.org/v1
kind: ChaosToolkitExperiment
metadata:
  name: url-endpoint-verify1-rtk-sched
  namespace: chaostoolkit-crd
spec:
  namespace: chaostoolkit-run
  schedule:
    kind: cronJob
    value: "*/3 * * * *"
  pod:
    settings:
      enabled: true
    image: chaosiq/chaostoolkit
    env:
      configMapName: chaostoolkit-env
    chaosArgs:
    - --verbose
    - --settings /home/svc/.chaostoolkit/settings.yaml
    - verify
    - ${EXPERIMENT_PATH-$EXPERIMENT_URL}

In this file, a number of UUID fields have been replaced with placeholders such as <EXPERIMENT_UID>. These UUIDs are specific to your user and experiment, and the Reliability Toolkit generates them when you import the experiment as a new verification.
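
Because the file will not identify a real verification until those values are filled in, it can be worth checking for leftover placeholders before applying it. The following is a small sketch of such a check; the helper name and the placeholder pattern are assumptions based on the placeholder style used in the file above.

```python
import re

# Matches placeholders such as <EXPERIMENT_UID> or <ORG_UID>.
PLACEHOLDER = re.compile(r"<[A-Z_]+_UID>")


def unfilled_placeholders(text: str) -> list:
    """Return the distinct placeholders still present in the text, sorted."""
    return sorted(set(PLACEHOLDER.findall(text)))


snippet = '''
"experiment_id": "e0bea10b-6d79-4216-85e2-dcfea2156e4f",
"team_id": "<TEAM_UID>",
"org_id": "<ORG_UID>"
'''

print(unfilled_placeholders(snippet))  # ['<ORG_UID>', '<TEAM_UID>']
```

In practice you would read the contents of your YAML file into the text argument and only run kubectl apply once the returned list is empty.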

You can kick off the scheduled verification with the following command:

kubectl apply -f url-verification-sched.yaml

After around 3 minutes you can check the running pods using the following command:

kubectl -n chaostoolkit-run get pods
NAME                               READY   STATUS      RESTARTS   AGE
chaostoolkit-2xrzf                 0/1     Completed   0          3m27s
chaostoolkit-o9igo-1588780-m5q6f   1/1     Running     0          59s

A new pod will be generated every 3 minutes with a unique identifier. You can view the logs for each pod using the command:

kubectl -n chaostoolkit-run logs chaostoolkit-o9igo-1588780-m5q6f

Change the id to match your own pod identifier. If you want to terminate the scheduled verification you can run:

kubectl delete -f url-verification-sched.yaml
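
For reference, the schedule value "*/3 * * * *" used in this section is standard cron syntax, where the first field selects the minutes of each hour. The short sketch below expands that field and compares the interval with the verification's duration. The function name is hypothetical, and it assumes that duration-of-conditions is expressed in seconds.

```python
from typing import List


def minutes_for_step(step: int) -> List[int]:
    """Minutes of the hour matched by a '*/step' cron minute field."""
    return list(range(0, 60, step))


runs = minutes_for_step(3)
print(runs[:4], "...", runs[-1])   # [0, 3, 6, 9] ... 57
print(len(runs), "runs per hour")  # 20 runs per hour

# Assumption: duration-of-conditions (30.0) is in seconds.
interval_s = 3 * 60
duration_s = 30.0
print(duration_s < interval_s)     # True
```

Under that assumption each verification has time to finish before the next one is scheduled. If you lengthen the duration, lengthen the schedule interval to match.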

Viewing the Verification in the Reliability Toolkit

If you log in to the Reliability Toolkit you can also see your scheduled verification running and its results:

Reliability Toolkit Summary Timeline

For each execution, an insights page and an executions page are also generated. If you look at the executions page and expand the general tab, you can confirm that the execution was run by the Kubernetes operator.

Execution Detail page

The insights page is also generated for each verification: