PayU’s first steps into Practicing Reliability
May 7th, 2020
This is the first episode in a new series where we look to share real-world stories as people begin to practice reliability through chaos engineering, continuous verification and all the other tools in their toolkit.
In this first episode of “User Stories: Tales from People Practicing Reliability”, Russ Miles, CEO of ChaosIQ, spoke with Idan Tovi, Head of SRE at PayU, about the organisation’s decision to explore chaos engineering and system verification and what they learned from their early forays with the practice.
Idan Tovi is Head of SRE at PayU GPO, focusing mainly on availability and reliability at scale. He also heads up the Tel Aviv Chaos and Resilience Engineering Community. You can also catch Idan on GitHub.
Russ:> Hi Idan! Let’s start with why PayU even began to think about practicing reliability with chaos engineering experiments and the like?
Idan:> Hi Russ. We had been talking about chaos engineering practices for a while, but the first time it really hit us was, like most companies I guess, after an incident in one of the AWS Availability Zones we run in. After that incident, chaos engineering got some attention as a way to help us prevent the next incident if it followed the same scenario. It actually became an action item (AI) in the post-mortem.
Russ:> It sounds like the benefit of proactive learning was a big deal?
Idan:> Indeed. We had actually designed the system to survive such failures, but only on paper, so the natural next step was to start testing it. If we could do that with automation, like we do everything at PayU, even better.
Russ:> Can you tell me a bit more about how you selected the first candidates for your own chaos engineering experiments?
Idan:> Actually it was pretty easy in our case: we started from the incident we had experienced. Our platform is based on AWS, and we had some network issues with one of the AZs that caused one of our Cassandra nodes to be considered down by the other nodes in the ring. Although the system was expected to survive such a scenario, we experienced issues with some of the components, so we decided to experiment and test our system for that kind of failure. It is also very common to start by shutting down machines, and the availability of the AWS API to do just that made it a very good and simple way to start.
Russ:> Initially you selected the Chaos Toolkit to design and implement your experiments. Do you do that design work as a team? Did you learn anything in the process of designing your experiments?
Idan:> Yes, it can definitely work that way in my opinion. It has its own limitations but it can work. We learned a lot from the first time we did it. The Chaos Toolkit was a perfect solution for what we intended to do: the AWS and Prometheus extensions provided a good way to simulate the attack and measure the results. Together with our CI/CD tool (GitLab), we easily set up a process to automate the experiments using a scheduled pipeline and, for security reasons, a dedicated GitLab runner.
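Editor’s note: to make the setup Idan describes more concrete, here is a minimal sketch of what such an experiment could look like, built as a Python dictionary and dumped to the JSON format the Chaos Toolkit reads. It is not PayU’s actual experiment. The AZ name, health URL, instance tag and PromQL query are hypothetical placeholders, and the extension function names (`stop_instance` and `start_instances` in `chaosaws.ec2.actions`, `query` in `chaosprometheus.probes`) are assumed from the chaostoolkit-aws and chaostoolkit-prometheus extensions and should be checked against their documentation.

```python
import json

# Hypothetical placeholders, not PayU's real configuration.
TARGET_AZ = "eu-west-1a"
HEALTH_URL = "https://preprod.example.com/health"
ERROR_RATE_QUERY = 'sum(rate(http_requests_total{status=~"5.."}[1m]))'
# Hypothetical tag used to pick Cassandra instances only.
CASSANDRA_FILTER = [{"Name": "tag:role", "Values": ["cassandra"]}]


def build_experiment(az: str, health_url: str, promql: str) -> dict:
    """Return a Chaos Toolkit style experiment as a plain dictionary."""
    return {
        "version": "1.0.0",
        "title": "Losing a Cassandra node in one AZ does not break the API",
        "description": "Stop one instance in an AZ and observe the system.",
        # Steady state: the API keeps answering its health check with 200.
        "steady-state-hypothesis": {
            "title": "The API is healthy",
            "probes": [
                {
                    "type": "probe",
                    "name": "api-is-healthy",
                    "tolerance": 200,
                    "provider": {"type": "http", "url": health_url, "timeout": 5},
                }
            ],
        },
        "method": [
            {
                # Assumed chaostoolkit-aws action: stop one matching instance.
                "type": "action",
                "name": "stop-one-cassandra-instance",
                "provider": {
                    "type": "python",
                    "module": "chaosaws.ec2.actions",
                    "func": "stop_instance",
                    "arguments": {"az": az, "filters": CASSANDRA_FILTER},
                },
                "pauses": {"after": 60},
            },
            {
                # Assumed chaostoolkit-prometheus probe: record the error rate.
                "type": "probe",
                "name": "error-rate-during-outage",
                "provider": {
                    "type": "python",
                    "module": "chaosprometheus.probes",
                    "func": "query",
                    "arguments": {"query": promql},
                },
            },
        ],
        "rollbacks": [
            {
                # Assumed chaostoolkit-aws action: bring the instances back.
                "type": "action",
                "name": "restart-stopped-instances",
                "provider": {
                    "type": "python",
                    "module": "chaosaws.ec2.actions",
                    "func": "start_instances",
                    "arguments": {"az": az, "filters": CASSANDRA_FILTER},
                },
            }
        ],
    }


if __name__ == "__main__":
    experiment = build_experiment(TARGET_AZ, HEALTH_URL, ERROR_RATE_QUERY)
    print(json.dumps(experiment, indent=2))
```

Saving the printed JSON to a file and passing it to `chaos run` from a scheduled CI job is one way to get the kind of recurring, automated execution described in the interview.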
Russ:> What are the main reasons you are looking to extend your use of the Chaos Toolkit with the Reliability Toolkit?
Idan:> The trigger to take a look at the Reliability Toolkit was the fact that all the work we had already done with the Chaos Toolkit was transferable and usable immediately. When we had a walkthrough of the Reliability Toolkit, things just made sense to us: the steady-state we defined in our experiment was a system objective. We also found there were other features we needed, like on-going verification of the objective (steady-state). That made chaos practices more accessible to our engineers and helped everyone better engage with what we were doing and learning, by providing one place to see all our current and future experiment findings and insights.
Since then the Reliability Toolkit platform has kept evolving, and we like the new features that are coming out, which make it more than just a chaos engineering platform.
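Editor’s note: the idea that a steady-state is really a system objective that can be verified on an ongoing basis can be illustrated with a small sketch. This is not the Reliability Toolkit’s API. The `Objective` class, the function names and the sample data are all hypothetical, and the measurements would in practice come from a monitoring system rather than a hard-coded list.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Objective:
    """A hypothetical system objective: a name and a target success ratio."""
    name: str
    target: float  # e.g. 0.99 means 99% of samples must be successful


def success_ratio(samples: Sequence[bool]) -> float:
    """Fraction of samples that were successful."""
    if not samples:
        raise ValueError("at least one sample is required")
    return sum(samples) / len(samples)


def objective_met(objective: Objective, samples: Sequence[bool]) -> bool:
    """True when the observed success ratio reaches the objective's target."""
    return success_ratio(samples) >= objective.target


if __name__ == "__main__":
    # Hypothetical data: 98 successful health checks and 2 failures.
    checks = [True] * 98 + [False] * 2
    availability = Objective(name="payments API answers health checks", target=0.99)
    status = "met" if objective_met(availability, checks) else "NOT met"
    print(f"{availability.name}: {status}")
```

Running the same check on a schedule, with or without a chaos experiment in progress, is the sense in which the objective is verified continuously rather than only once.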
Russ:> What did you learn when you first ran and analysed the findings from your first experiments?
Idan:> In our case, the first time we analysed the results there wasn’t that much immediate impact, because we had initiated the process based on a specific incident. By the time we ran that first experiment, the weaknesses we found during the incident had already been fixed, so it just gave us the confidence that we had done everything right to fix the flaws we had found.
Funnily enough, the first time chaos experimentation found a flaw was after about a month had passed with the experiment continuously running twice a week during a heavy load test. We had a task to upgrade our Cassandra installation and we always run some tests during this process in our pre-production environment, which was the same environment where we had implemented the chaos experiment.
Usually you want to start such an upgrade only after a successful repair process, but we found out our repair process had been failing for the last couple of weeks. We started to investigate and found out that the repair process failed because of the chaos experiment. This was a well-known weakness in Cassandra 3.x (a repair can’t run when a node is down), but the thing we were surprised by was that an alert that had been configured and tested in the past no longer worked either. We used this learning to write a runbook on how to resume the repair process.
Russ:> What do you think might be your next steps? What are you exploring next?
Idan:> Continuous execution of experiments is already happening, but we will definitely keep adding more automated verifications and probably some more objectives. We also plan to invest this year in introducing our first Game Day, which will help us improve our incident management and resolution process and be more ready for the unexpected.
Russ:> What advice would you give to anyone considering their first steps into practicing reliability?
Idan:> The biggest problem with practicing reliability is whole-team buy-in. So my suggestion is to think up and design a scenario that brings some value, no matter if it is a Game Day or a semi-automatic or completely automatic experiment. A past incident is a good starting point, along with adding a set of system objectives that matter, ideally customer-facing objectives.
The second piece of advice is you don’t have to automate everything straight away. When you do automate you will find yourself benefiting from your experiments not just for what you were looking for but continuously as the system evolves.
Last but not least, it is usually better to start with a pre-production environment until you get enough confidence. Remember to simulate traffic in that environment, otherwise your experiment doesn’t explore or test anything.
Russ:> Thanks so much for chatting today!
Coming up in Episode 2
In the next episode we’ll be chatting to Adrian Hornsby of AWS about the trends he sees in different companies as they develop their systems and culture towards practicing reliability and resilience engineering.