A guide to Chaos Engineering [Technique Tuesdays]
Why companies like Netflix, Facebook, and Microsoft dedicate entire projects to it, and how it relates to a very important concept in AI.
To learn more about the newsletter, check our detailed About Page + FAQs
To help me understand you better, please fill out this anonymous, 2-min survey. If you liked this post, make sure you hit the heart icon in this email.
Recommend this publication to Substack over here
Chaos Engineering-
You’ve heard this term if you’re looking for roles at companies like Netflix.
But what does Chaos Engineering actually mean? Why do so many Big Tech companies have entire projects dedicated to it? And what can you do to integrate this technique into your programming?
Today we’ll be covering this and much more. For my AI peeps, read till the end because we will talk about why this idea is going to be extremely important in AI.
Sparknotes Version
What is Chaos Engineering- Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production. Put simply- In Chaos Engineering, you build your tests to replicate all kinds of failures that your system will face IRL.
Why Chaos Engineering- Chaos engineering can be used to achieve resilience against infrastructure failures, network failures, and application failures. By testing and trying to break your products before deployment, you will be able to foresee and tackle problems much better.
How to implement it- Build a hypothesis of how your system is supposed to work normally. Then think of all the ways the system could break and try those out. Build fail-safes so the system keeps working when those failures happen. A more advanced guide is given later in this article.
Chaos Engineering in AI- To those familiar with Machine Learning, this idea is similar to injecting random (sometimes even adversarial) samples into your datasets to create Learning Agents that are more robust. This is a technique that I’ve been covering for a while, and is finally becoming more common. For a guide on using this randomness for best results, read my article Using Randomness Effectively in Deep Learning
Some numbers for you- Meta made roughly 29 Billion USD in the last quarter (Q2). Spread evenly over the 91 days in Q2, a single day of failure would cost them about 318,681,319 USD. That’s roughly 319 Million Dollars lost for every day that Meta is down. For operations at this scale, chaos engineering is crucial for keeping outages, and the losses that come with them, to a minimum.
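For the AI point in the summary above, here is a minimal sketch of what injecting random samples into a dataset can look like. The helper name `add_noise_samples`, the Gaussian noise model, and the `noise_std` and `copies` parameters are all assumptions made for illustration. Adversarial samples would need a model-specific attack and are not covered here.

```python
import random


def add_noise_samples(features, labels, noise_std=0.1, copies=1, seed=0):
    """Return the original samples plus noisy copies that keep their labels.

    Hypothetical helper: Gaussian noise is only one possible perturbation.
    """
    rng = random.Random(seed)  # seeded so the augmentation is reproducible
    out_features = [list(row) for row in features]
    out_labels = list(labels)
    for _ in range(copies):
        for row, label in zip(features, labels):
            # Perturb every feature independently; the label is unchanged.
            out_features.append([x + rng.gauss(0.0, noise_std) for x in row])
            out_labels.append(label)
    return out_features, out_labels


X = [[0.0, 1.0], [1.0, 0.0]]
y = [0, 1]
X_aug, y_aug = add_noise_samples(X, y, noise_std=0.05, copies=2)
print(len(X_aug), len(y_aug))
```

The idea mirrors Chaos Engineering: instead of breaking servers, you perturb the data so the model has to cope with inputs that are a little off.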
Clearly, this is an extremely important idea that you must be familiar with. I’m attaching a guide to the ideal chaos engineering pipeline. I got this from the phenomenal Principles of Chaos page. If this is something that interests you, consider joining the Chaos Community Google Group here. This will be great for networking and coming across new ideas/concepts. For pointers on how to network effectively on more technical platforms like this, check out my guide on using GitHub to land software jobs.
Now for those of you really into the idea, here is the guide to advanced Chaos Engineering-
Ideal Chaos Engineering
Build a Hypothesis around Steady State Behavior
Focus on the measurable output of a system, rather than internal attributes of the system. Measurements of that output over a short period of time constitute a proxy for the system’s steady state. The overall system’s throughput, error rates, latency percentiles, etc. could all be metrics of interest representing steady state behavior. By focusing on systemic behavior patterns during experiments, Chaos verifies that the system does work, rather than trying to validate how it works.
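A rough sketch of how a steady-state hypothesis could be written down in code. The metric names, the nearest-rank percentile method, and the tolerance values are assumptions chosen for illustration, not something the Principles of Chaos page prescribes.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def measure(latencies_ms, error_count, window_s):
    """Summarize one measurement window using only externally visible output."""
    n = len(latencies_ms)
    return {
        "throughput_rps": n / window_s,
        "error_rate": error_count / n,
        "p50_ms": percentile(latencies_ms, 50),
        "p99_ms": percentile(latencies_ms, 99),
    }


def hypothesis_holds(baseline, observed, tolerance=0.10, max_error_increase=0.01):
    """Steady state holds if throughput, tail latency, and errors stay near the baseline."""
    return (
        observed["throughput_rps"] >= baseline["throughput_rps"] * (1 - tolerance)
        and observed["p99_ms"] <= baseline["p99_ms"] * (1 + tolerance)
        and observed["error_rate"] <= baseline["error_rate"] + max_error_increase
    )


# Made-up numbers: one normal window and one window during an experiment.
baseline = measure([20, 22, 25, 30, 120], error_count=0, window_s=1)
during = measure([21, 24, 26, 35, 400], error_count=1, window_s=1)
print(hypothesis_holds(baseline, baseline), hypothesis_holds(baseline, during))
```

Notice that nothing here looks inside the system. The check only asks whether the output still looks normal, which is the point of this principle.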
Vary Real-world Events
Chaos variables reflect real-world events. Prioritize events either by potential impact or estimated frequency. Consider events that correspond to hardware failures like servers dying, software failures like malformed responses, and non-failure events like a spike in traffic or a scaling event. Any event capable of disrupting steady state is a potential variable in a Chaos experiment.
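To make the three kinds of events concrete, here is a toy fault injector. The event names, the frequency weights, and the `echo_service` stand-in are all hypothetical. A real setup would inject these faults at the network or infrastructure layer instead of inside a Python function.

```python
import random

# Hypothetical event catalogue: (name, estimated relative frequency).
EVENTS = [
    ("server_down", 1),          # hardware failure
    ("malformed_response", 3),   # software failure
    ("traffic_spike", 2),        # non-failure event
]


def pick_event(rng):
    """Choose an event, weighted by its estimated frequency."""
    names = [name for name, _ in EVENTS]
    weights = [weight for _, weight in EVENTS]
    return rng.choices(names, weights=weights, k=1)[0]


def call_with_chaos(service, request, event):
    """Call service(request) while simulating one real-world event."""
    if event == "server_down":
        raise ConnectionError("injected: server unavailable")
    if event == "malformed_response":
        return "<<not-json>>"
    if event == "traffic_spike":
        # Replay the same request as a burst of 10 calls.
        return [service(request) for _ in range(10)]
    return service(request)


def echo_service(request):
    return {"ok": True, "echo": request}


rng = random.Random(42)
event = pick_event(rng)
try:
    result = call_with_chaos(echo_service, "ping", event)
except ConnectionError as exc:
    result = f"handled: {exc}"
print(event, result)
```

Weighting by frequency is only one of the two prioritization options mentioned above. You could just as easily weight by potential impact.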
Run Experiments in Production
Systems behave differently depending on environment and traffic patterns. Since the behavior of utilization can change at any time, sampling real traffic is the only way to reliably capture the request path. To guarantee both authenticity of the way in which the system is exercised and relevance to the current deployed system, Chaos strongly prefers to experiment directly on production traffic.
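One common way to experiment on real traffic without touching all of it is to deterministically sample a small slice of requests. The sketch below assumes every request carries a stable string ID. The function name `in_experiment` and the bucket count are illustrative choices.

```python
import hashlib


def in_experiment(request_id, sample_percent):
    """Deterministically place roughly sample_percent% of request IDs in the experiment."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # buckets 0..9999
    return bucket < sample_percent * 100


# Route about 5% of (made-up) production requests through the experiment.
request_ids = [f"req-{i}" for i in range(10_000)]
selected = [rid for rid in request_ids if in_experiment(rid, 5)]
print(len(selected))
```

Hashing the ID means the same request always lands in the same group, which makes results easier to reproduce than a coin flip per request.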
Automate Experiments to Run Continuously
Running experiments manually is labor-intensive and ultimately unsustainable. Automate experiments and run them continuously. Chaos Engineering builds automation into the system to drive both orchestration and analysis.
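A bare-bones sketch of what automating both orchestration and analysis could look like. The function names and the experiment registry are assumptions for illustration. In practice you would hand the scheduling to cron, a CI pipeline, or a dedicated chaos tool.

```python
import time


def run_continuously(experiments, rounds, interval_s=0.0, sleep=time.sleep):
    """Run every experiment once per round and collect (round, name, passed) records."""
    results = []
    for round_no in range(rounds):
        for name, experiment in experiments.items():
            try:
                passed = bool(experiment())
            except Exception:
                passed = False  # a crashing experiment counts as a failed hypothesis
            results.append((round_no, name, passed))
        if round_no < rounds - 1:
            sleep(interval_s)  # wait before the next round
    return results


def analyze(results):
    """Count total runs and failed runs."""
    failures = [record for record in results if not record[2]]
    return {"runs": len(results), "failures": len(failures)}


# Hypothetical experiments: each returns True if steady state held.
experiments = {
    "kill_one_replica": lambda: True,
    "drop_cache": lambda: False,
}
results = run_continuously(experiments, rounds=3)
print(analyze(results))
```

Passing `sleep` in as a parameter is a small design choice that lets you test the loop without actually waiting.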
Minimize Blast Radius
Experimenting in production has the potential to cause unnecessary customer pain. While there must be an allowance for some short-term negative impact, it is the responsibility and obligation of the Chaos Engineer to ensure the fallout from experiments are minimized and contained.
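One simple way to contain the fallout is an automatic stop: abort the experiment once the observed error rate goes over an agreed budget. The class name `BlastRadiusGuard` and the thresholds below are assumptions for illustration.

```python
class BlastRadiusGuard:
    """Stops an experiment once observed impact exceeds an agreed budget."""

    def __init__(self, max_error_rate=0.05, min_samples=20):
        self.max_error_rate = max_error_rate
        self.min_samples = min_samples  # avoid aborting on a tiny sample
        self.total = 0
        self.errors = 0
        self.aborted = False

    def record(self, ok):
        """Record one request outcome. Returns False once the experiment must stop."""
        self.total += 1
        if not ok:
            self.errors += 1
        if (
            self.total >= self.min_samples
            and self.errors / self.total > self.max_error_rate
        ):
            self.aborted = True
        return not self.aborted


guard = BlastRadiusGuard(max_error_rate=0.10, min_samples=10)
outcomes = [True] * 9 + [False] * 5 + [True] * 20  # made-up request results
processed = 0
for ok in outcomes:
    processed += 1
    if not guard.record(ok):
        break  # stop injecting faults and roll back
print(processed, guard.aborted)
```

Combined with a small traffic sample, a guard like this keeps a bad experiment from turning into a real outage.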
As I mentioned earlier, Chaos Engineering is something that Netflix really loves. In many ways, they were one of the pioneers of this movement. This is what they had to say-
"At Netflix, our culture of freedom and responsibility led us not to force engineers to design their code in a specific way. Instead, we discovered that we could align our teams around the notion of infrastructure resilience by isolating the problems created by server neutralization and pushing them to the extreme. We have created Chaos Monkey, a program that randomly chooses a server and disables it during its usual hours of activity. Some will find that crazy, but we could not depend on the random occurrence of an event to test our behavior in the face of the very consequences of this event. Knowing that this would happen frequently has created a strong alignment among engineers to build redundancy and process automation to survive such incidents, without impacting the millions of Netflix users. Chaos Monkey is one of our most effective tools to improve the quality of our services."
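To get a feel for the idea in the quote, here is a toy, in-memory version of "randomly choose a server and disable it during its usual hours of activity". This is not Netflix's actual Chaos Monkey code. The weekday 9-to-17 window, the rule of never disabling the last healthy server, and all function names are assumptions.

```python
import random
from datetime import datetime


def in_business_hours(now, start_hour=9, end_hour=17):
    """Weekdays only, from start_hour (inclusive) to end_hour (exclusive)."""
    return now.weekday() < 5 and start_hour <= now.hour < end_hour


def unleash_monkey(servers, now, rng, disable):
    """Disable one random healthy server during business hours.

    Returns the name of the disabled server, or None if nothing was done.
    """
    if not in_business_hours(now):
        return None
    healthy = [name for name, up in servers.items() if up]
    if len(healthy) <= 1:
        return None  # assumption: never take down the last healthy server
    victim = rng.choice(healthy)
    disable(victim)
    return victim


# Hypothetical fleet: server name -> is it up?
servers = {"api-1": True, "api-2": True, "api-3": True}


def disable(name):
    servers[name] = False


victim = unleash_monkey(servers, datetime(2024, 1, 2, 11, 0), random.Random(7), disable)
print(victim, servers)
```

Running it during working hours matters: engineers are around to see what breaks and fix it, instead of getting paged at 3 AM.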
I created Technology Interviews Made Simple using new techniques discovered through tutoring multiple people into top tech firms. The newsletter is designed to help you succeed, saving you from hours wasted on the Leetcode grind. I have a 100% satisfaction policy, so you can try it out at no risk to you. You can read the FAQs and find out more here. Use the button below to get 20% off for up to a whole year.
Before proceeding, if you have enjoyed this post so far, please make sure you like it (the little heart button in the email/post). I also have a special request for you.
***Special Request***
This newsletter has received a lot of love. If you haven’t already, I would really appreciate it if you could take 5 seconds to let Substack know that they should feature this publication on their pages. This will allow more people to see the newsletter.
There is a simple form that you can fill out for this. Here it is. Thank you.
https://docs.google.com/forms/d/e/1FAIpQLScs-yyToUvWUXIUuIfxz17dmZfzpNp5g7Gw7JUgzbFEhSxsvw/viewform
To get your Substack URL, follow these steps-
Open - https://substack.com/
If you haven’t already, log in with your email.
In the top right corner, you will see your icon. Click on it. You will see the drop-down. Click on your name/profile. That will show you the link.
You will be redirected to your URL. Please enter it in the form. I appreciate your help.
In the comments below, share which topics you want me to focus on. I’d be interested in learning about them and will cover them.
If you liked this post, make sure you fill out this survey. It’s anonymous and will take 2 minutes of your time. It will help me understand you better, allowing for better content.
https://forms.gle/XfTXSjnC8W2wR9qT9
I see you living the dream.
Go kill all and Stay Woke,
Devansh <3
To get the most out of Technique Tuesdays, make sure you’re checking in on the other days as well. Leverage all the techniques I have discovered through my successful tutoring to succeed in your interviews and save your time and energy by joining the premium subscribers down below. Get a discount (for a whole year) using the button below.
Reach out to me on:
Instagram: https://www.instagram.com/iseethings404/
Message me on Twitter: https://twitter.com/Machine01776819
My LinkedIn: https://www.linkedin.com/in/devansh-devansh-516004168/
My content:
Read my articles: https://rb.gy/zn1aiu
My YouTube: https://rb.gy/88iwdd
Get a free stock on Robinhood. No risk to you, so not using the link is losing free money: https://join.robinhood.com/fnud75