Running the Chaos Toolkit and Reliability Toolkit on AWS

Highlights from Adrian Hornsby’s video on using the Chaos Toolkit and Reliability Toolkit to run and explore chaos experiments on AWS with AWS Systems Manager.

TL;DR

  • The Chaos Toolkit’s declarative open experiment format lets you collaborate on your experiment with your team and extend the learning throughout the experiment’s lifecycle.
  • The Chaos Toolkit’s extensibility allows integration with multiple platforms. The AWS extension can interact with most AWS components.
  • AWS Systems Manager helps you automate tasks across AWS services, in this case by applying actions that inject failure into your AWS instances.
  • The Reliability Toolkit enhances the Chaos Toolkit so that you can manage reliability objectives and measurements, using verifications to see the impact of failures and other events on your system’s reliability in the Reliability Timeline.

Adrian Hornsby, Principal Technical Evangelist at Amazon Web Services (AWS), has published a great video on YouTube on using the open-source Chaos Toolkit on AWS and integrating the results with the recently released Reliability Toolkit from ChaosIQ.

Here are some highlights from the video.

Use the Chaos Toolkit Open API to Express your Experiments

One of the important features of the Chaos Toolkit is its declarative open experiment format. The experiment format is a JSON-based format that you can develop collaboratively with your team.

As you collaborate on your experiment you can share and explore everyone’s knowledge of the system, providing an opportunity to compare and contrast mental models of the system and how those relate to what failures might be important to explore. You get a valuable opportunity to share the knowledge about your system even before you run your experiment.

Your experiment then becomes the basis for continuous verification using the Chaos Toolkit, whether that is a one-off execution, part of a Game Day, or part of your CI/CD pipeline.
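To make the experiment format concrete, here is a minimal sketch in Python that builds an experiment as a plain dictionary and prints it as JSON. The title, probe name, and health-check URL are placeholders invented for illustration, not taken from the video. The structure (a steady-state hypothesis with probes, a method, and rollbacks) follows the Chaos Toolkit experiment format as commonly documented, so check it against the current specification before relying on it.

```python
import json


def build_experiment(url: str) -> dict:
    """Return a minimal Chaos Toolkit experiment as a plain dictionary.

    The URL, title, and probe names are hypothetical placeholders.
    """
    # An HTTP probe that expects the endpoint to answer with status 200.
    health_probe = {
        "type": "probe",
        "name": "endpoint-is-healthy",
        "tolerance": 200,
        "provider": {"type": "http", "url": url, "timeout": 3},
    }
    return {
        "version": "1.0.0",
        "title": "The service stays available during a failure",
        "description": "Illustrative experiment; replace the placeholders.",
        "steady-state-hypothesis": {
            "title": "The endpoint responds with HTTP 200",
            "probes": [health_probe],
        },
        # The method would normally hold the failure-injecting actions.
        # Here it only re-checks the endpoint, so the sketch has no side effects.
        "method": [
            {
                "type": "probe",
                "name": "check-endpoint-during-method",
                "provider": {"type": "http", "url": url, "timeout": 3},
            }
        ],
        "rollbacks": [],
    }


if __name__ == "__main__":
    print(json.dumps(build_experiment("https://example.com/health"), indent=2))
```

Saving the printed JSON to a file such as experiment.json gives your team a single artifact to review together before anyone executes it with chaos run experiment.json.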

You can extend the Chaos Toolkit to meet your own needs

Another great plus of the Chaos Toolkit is its extensibility. There is an AWS Extension in addition to numerous others, most of which are contributed by the community and maintained by other companies.

A screenshot of the Chaos Toolkit website, showing a list of available extensions, including AWS, Azure, Google Cloud, Kubernetes, etc.

The AWS Extension enables you to integrate your Chaos Engineering experiments with a large number of AWS-based services, including AWS Lambda, EC2, CloudWatch, and others.

A screenshot of the AWS Extension page on the Chaos Toolkit website.
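As a rough illustration of how an extension is used from an experiment, the sketch below builds an action that asks the AWS Extension to stop EC2 instances. It assumes the chaostoolkit-aws package exposes a stop_instances function in the chaosaws.ec2.actions module that takes an instance_ids argument. Treat those names as an assumption to verify against the extension's documentation. The instance ID is a made-up placeholder.

```python
from typing import List


def stop_instances_action(instance_ids: List[str]) -> dict:
    """Build an experiment action that delegates to the AWS Extension.

    Assumption: chaosaws.ec2.actions.stop_instances(instance_ids=...) exists
    in the installed version of chaostoolkit-aws.
    """
    return {
        "type": "action",
        "name": "stop-ec2-instances",
        "provider": {
            # A "python" provider tells the Chaos Toolkit to import the
            # module and call the named function with these arguments.
            "type": "python",
            "module": "chaosaws.ec2.actions",
            "func": "stop_instances",
            "arguments": {"instance_ids": instance_ids},
        },
    }


if __name__ == "__main__":
    # Hypothetical instance ID, for illustration only.
    print(stop_instances_action(["i-0123456789abcdef0"]))
```

An action like this would go in the experiment's method, and nothing is stopped until the experiment is actually run with valid AWS credentials.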

Run Chaos Experiments on your AWS instances

AWS Systems Manager (AWS SSM) is a collection of capabilities that helps you automate management tasks for every workload and instance type. It is an agent-based service for managing servers on any infrastructure, and it lets you run actions against the instances you target. Adrian has covered AWS SSM in detail in this article.

This enables you to run a Chaos Toolkit experiment against your remote EC2 instances and evaluate the impact from external measurements such as your EC2 endpoints. It does not require any extensions to the Chaos Toolkit, because you are just leveraging AWS SSM via the AWS CLI. If you want more detail, there is a tutorial available as part of the ChaosIQ documentation.
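One way to picture "AWS SSM via the AWS CLI" is an experiment action that uses the Chaos Toolkit's process provider to call aws ssm send-command. The sketch below builds such an action in Python. It is not the exact setup from the video: it assumes the AWS-managed AWS-RunShellScript document, a placeholder instance ID, and a harmless shell command. It also passes the process arguments as a list, which the experiment format is understood to accept alongside a plain string.

```python
import json


def ssm_send_command_action(instance_id: str, shell_command: str) -> dict:
    """Build a process-provider action that runs a command through AWS SSM.

    Assumptions: the AWS CLI is installed and configured where the
    experiment runs, and the target instance runs the SSM agent.
    """
    # The CLI accepts the document parameters as a JSON object.
    parameters = json.dumps({"commands": [shell_command]})
    return {
        "type": "action",
        "name": "run-command-via-ssm",
        "provider": {
            "type": "process",
            "path": "aws",
            "arguments": [
                "ssm", "send-command",
                "--document-name", "AWS-RunShellScript",
                "--instance-ids", instance_id,
                "--parameters", parameters,
            ],
        },
    }


if __name__ == "__main__":
    # Hypothetical instance ID and a harmless command, for illustration only.
    action = ssm_send_command_action("i-0123456789abcdef0", "echo hello")
    print(json.dumps(action, indent=2))
```

In a real experiment the shell command would be whatever injects the failure you want to explore, while the steady-state hypothesis keeps probing your endpoint from the outside.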

Enhancing the Chaos Toolkit with the Reliability Toolkit

ChaosIQ has recently introduced the Reliability Toolkit. The Reliability Toolkit fully integrates with and builds on the free and open-source Chaos Toolkit.

The Reliability Toolkit provides tools that help you practice reliability on your system, making it more reliable for your end users.

The Reliability Toolkit includes tools that help you manage reliability objectives, conduct chaos experiments, execute system verifications, and perform initial analysis using the Reliability Timeline.

A screenshot of the Reliability Toolkit's Timeline