Exploring Spring Boot Resiliency on AWS EKS
The power of great discoverers at your fingertips.

Over the past few months, we have been exploring various facets of Chaos Engineering, from the principles of the practice to concrete examples such as improving your operational workflow by introducing controlled perturbations into your system. This summer, we will be introducing exciting new features that we look forward to seeing our users play with.

Until then, I’d like to continue down the trail of examples I went through previously. This time, though, we’ll do it against AWS EKS, the managed Kubernetes offering from AWS, with a Spring Boot application. Both solutions are fantastic. Spring Boot is extremely common as an application framework. EKS is fairly new on the managed Kubernetes scene, but it will no doubt rapidly become a key player.

Setting the Scene

I wrote two Chaos Toolkit experiments to showcase two aspects of resiliency you would want to explore via Chaos Engineering.

The first one looks at how the application performs when latency is introduced between two services communicating over HTTP. This is achieved through the excellent Chaos Monkey for Spring Boot via the corresponding Chaos Toolkit driver.

The second experiment takes the opportunity to explore how your system reacts when a whole AWS EC2 instance goes down, taking a Kubernetes worker node out of the cluster with it. This is achieved through the AWS API via the Chaos Toolkit driver for AWS. For good measure, we also use the driver for Kubernetes to probe the system along the way.

The system consists of two Spring Boot microservices conversing over HTTP: the frontend service simply calls the backend service to compute something based on the user input.

Those two services live on Kubernetes in an EKS-managed cluster. For this story, we use the awesome eksctl, sponsored by Weaveworks, to create our cluster easily.
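
If you want to follow along, spinning up a small cluster with eksctl boils down to a single command; the cluster name and node count below are purely illustrative, so check the eksctl documentation for the options matching your own setup.

$ eksctl create cluster --name chaos-demo --nodes 2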

The entire code (application, manifests…) can be found here.

What’s the impact of latency between two services?

As mentioned above, the system is basic and consists of two Spring Boot applications: frontend and backend. Whenever a user hits the frontend, it makes a call to the backend and returns the result to the user.

However, one aspect is critical to this system (from a business point of view as well as a technical one): the backend must respond in under one second. So, while we feel confident the system should be that fast consistently, there is a risk it may be slower from time to time. Chaos Engineering is a perfect way of exploring what could happen in that case.

Our experiment is fairly basic to start with but it shows the idea behind the Chaos Toolkit and our flow.

14:55 $ chaos run experiments/latency-impact.json
[2018-07-09 14:57:06 INFO] Validating the experiment's syntax
[2018-07-09 14:57:06 INFO] Experiment looks valid
[2018-07-09 14:57:06 INFO] Running experiment: How does latency from the backend impacts the frontend?
[2018-07-09 14:57:06 INFO] Steady state hypothesis: We can multiply two numbers under a second
[2018-07-09 14:57:06 INFO] Probe: app-must-respond
[2018-07-09 14:57:07 INFO] Steady state hypothesis is met!
[2018-07-09 14:57:07 INFO] Action: enable_chaosmonkey
[2018-07-09 14:57:07 INFO] Action: configure_assaults
[2018-07-09 14:57:07 INFO] Steady state hypothesis: We can multiply two numbers under a second
[2018-07-09 14:57:07 INFO] Probe: app-must-respond
[2018-07-09 14:57:09 CRITICAL] Steady state probe 'app-must-respond' is not in the given tolerance so failing this experiment
[2018-07-09 14:57:09 INFO] Let's rollback...
[2018-07-09 14:57:09 INFO] Rollback: disable_chaosmonkey
[2018-07-09 14:57:09 INFO] Action: disable_chaosmonkey
[2018-07-09 14:57:09 INFO] Rollback: configure_assaults
[2018-07-09 14:57:09 INFO] Action: configure_assaults
[2018-07-09 14:57:09 INFO] Experiment ended with status: failed

As you can see, we start by talking to the frontend application and expect it to respond. Then we enable the Spring Chaos Monkey, embedded in the backend application itself, and ask it to add some latency to its network exchanges.

"method": [
    {
        "name": "enable_chaosmonkey",
        "type": "action",
        "provider": {
            "func": "enable_chaosmonkey",
            "module": "chaosspring.actions",
            "type": "python",
            "arguments": {
                "base_url": "${base_url}/backend/actuator"
            }
        }
    },
    {
        "name": "configure_assaults",
        "type": "action",
        "provider": {
            "func": "change_assaults_configuration",
            "module": "chaosspring.actions",
            "type": "python",
            "arguments": {
                "base_url": "${base_url}/backend/actuator",
                "assaults_configuration": {
                    "level": 1,
                    "latencyRangeStart": 10000,
                    "latencyRangeEnd": 10000,
                    "latencyActive": true,
                    "exceptionsActive": false,
                    "killApplicationActive": false,
                    "restartApplicationActive": false
                }
            }
        }
    }
],

Finally, we simply call the frontend again, which in this case tells us it went over budget and failed to meet the tolerance of 1 second we had set up for it. The way we enforce this here is by setting a timeout on the call from the frontend to the backend.

"steady-state-hypothesis": {
    "title": "We can multiply two numbers under a second",
    "probes": [
        {
            "name": "app-must-respond",
            "type": "probe",
            "tolerance": {
                "type": "regex",
                "pattern": "^[0-9]*$",
                "target": "body"
            },
            "provider": {
                "type": "http",
                "url": "${base_url}/multiply?a=6&b=7"
            }
        }
    ]
}

The tolerance validates that the response is simply a number. It succeeds when we first call the frontend, before we introduce latency. Once the latency is in place, the call from the frontend to the slow backend times out and the frontend returns an error message instead, which doesn’t pass the tolerance validator.

Through this simple(istic) application-level experiment, we are made aware of the consequences of a slow backend response. Obviously, in a richer microservices system, this could have dramatic ripple effects, and even cascading failures, that are difficult to debug after they’ve hit our users. Better to trigger those conditions ourselves and observe their impact.

Can we sustain the loss of an EKS node?

The previous experiment targeted our application, but we can obviously also learn from underneath it by exploring degraded conditions in our infrastructure.

For instance, do we know whether our service remains available during the loss of a node? Again, Chaos Engineering gives you the tools to explore such a scenario and get familiar with its dire consequences.

We use the same system as above but, this time, our experiment hits the AWS infrastructure itself by stopping an EC2 instance running one of our EKS worker nodes. Obviously, this means a reduction in capacity, but does it mean a loss of availability?

15:21 $ chaos run experiments/losing-kubernetes-node.json
[2018-07-09 15:22:24 INFO] Validating the experiment's syntax
[2018-07-09 15:22:24 INFO] Experiment looks valid
[2018-07-09 15:22:24 INFO] Running experiment: Are we still available when one of the nodes go down?
[2018-07-09 15:22:24 INFO] Steady state hypothesis: We can multiply two numbers under a second
[2018-07-09 15:22:24 INFO] Probe: app-must-respond
[2018-07-09 15:22:25 INFO] Steady state hypothesis is met!
[2018-07-09 15:22:25 INFO] Action: delete_one_randomly_picked_EKS_node
[2018-07-09 15:22:27 INFO] Pausing after activity for 30s...
[2018-07-09 15:22:57 INFO] Action: list_worker_nodes
[2018-07-09 15:22:58 INFO] Action: count_backend_pods
[2018-07-09 15:22:58 INFO] Action: count_frontend_pods
[2018-07-09 15:22:59 INFO] Steady state hypothesis: We can multiply two numbers under a second
[2018-07-09 15:22:59 INFO] Probe: app-must-respond
[2018-07-09 15:22:59 INFO] Steady state hypothesis is met!
[2018-07-09 15:22:59 INFO] Let's rollback...
[2018-07-09 15:22:59 INFO] No declared rollbacks, let's move on.
[2018-07-09 15:22:59 INFO] Experiment ended with status: completed

During the experiment method, we stopped an EC2 instance of the EKS pool at random:

Screenshot of the dashboard showing an EC2 instance stopping.
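
The action itself is not reproduced above, but with the Chaos Toolkit driver for AWS it could be declared along the following lines. This is only a sketch: the availability zone, the cluster tag filter and the exact arguments are assumptions, so check the chaosaws documentation for the precise signature of the EC2 actions.

{
    "name": "delete_one_randomly_picked_EKS_node",
    "type": "action",
    "provider": {
        "type": "python",
        "module": "chaosaws.ec2.actions",
        "func": "stop_instance",
        "arguments": {
            "az": "eu-west-1a",
            "filters": [
                {
                    "Name": "tag-key",
                    "Values": ["kubernetes.io/cluster/chaos-demo"]
                }
            ]
        }
    },
    "pauses": {
        "after": 30
    }
}

Since JSON does not allow comments, keep in mind that the availability zone and tag filter above are placeholders for this walkthrough; the 30-second pause simply mirrors the pause visible in the run output above.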

Then we replay our hypothesis that the service should remain in good shape. Lucky us! Even though we lost a node, we were able to keep the application available. Looking at the Chaos Toolkit probe logs, we can see:

[2018-07-09 15:22:58 INFO] Action: count_backend_pods
[2018-07-09 15:22:58 DEBUG] Found 1 pods matching label 
[2018-07-09 15:22:58 INFO] Action: count_frontend_pods
[2018-07-09 15:22:59 DEBUG] Found 1 pods matching label 'app=frontend-app'
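
For the record, those pod counts are gathered with the Chaos Toolkit driver for Kubernetes. A minimal sketch of such an activity could look like the following, assuming the count_pods probe from the chaosk8s package, the label selector reported in the logs and the default namespace; the exact module path and the way the original experiment declares it may differ, so treat this purely as an illustration.

{
    "name": "count_frontend_pods",
    "type": "probe",
    "provider": {
        "type": "python",
        "module": "chaosk8s.pod.probes",
        "func": "count_pods",
        "arguments": {
            "label_selector": "app=frontend-app",
            "ns": "default"
        }
    }
}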

Notice, however, that with this sort of experiment your results may vary and your experiment may fail. We are running a single instance of our application, so if the killed node is the one it is running on, Kubernetes will reschedule it on the other node, which takes time. Still, the approach remains the same and hopefully you get the idea of the learning loop here. It also highlights the need to run chaos experiments continuously, since your system is not static. The Chaos Toolkit is all about automation, so that’s quite handy.

Continue exploring!

I hope these two experiments continue to show you the Chaos Toolkit flow and how it can deepen your understanding of your system. Richer experiments can be created, shared and collaborated on with your team for a healthy dose of familiarity with adverse conditions.

Please, feel free to join us on the Chaos Toolkit Slack workspace. We would love your feedback to make the toolkit an even more delightful tool that really enables automated chaos engineering for everyone!

Photo by Tony Webster