How Netflix survived the AWS outage in 2011 [System Design Sundays]
How Netflix survived the outage that shut down major sites including Reddit, HootSuite, Foursquare and Quora
Hey, it’s your favorite cult leader here 🐱👤
On Sundays, I will go over various Systems Design topics⚙⚙. These can be mock interviews, writeups by various organizations, or overviews of topics that you need to design better systems. 📝📝
To get access to all my articles and support my crippling chocolate milk addiction, consider subscribing if you haven’t already!
p.s. you can learn more about the paid plan here.
Netflix had nearly 231 million paid subscribers worldwide as of the fourth quarter of 2022.
An outage could cost them billions of dollars in lost value.
So it should not come as a shocker that they are hell-bent on keeping their services live, no matter what happens. And their dedication showed results in 2011. An AWS outage that year rocked many major websites, including Reddit, HootSuite, Foursquare, and Quora. Netflix survived it.
In this post, we will cover some of the techniques they utilized to build resilience in their systems. I found the content for this post on LinkedIn here. Check it out.
Before you get into it, I have something special to share!! Your boy just got published with the legendary AI publication, The Gradient, a publication started by some people at Stanford. My article is Artists enable AI art - shouldn't they be compensated? and it covers the debate around artist compensation and AI art. Make sure you give it a read and let me know what you think.
Back to the topic at hand-
How Netflix builds Swole 💪💪 Systems
Open Connect Appliances- This term might be confusing to you. What are Open Connect Appliances? To quote Netflix- "The Netflix Open Connect program provides opportunities for ISP partners to improve their customers' Netflix user experience by localizing Netflix traffic and minimizing the delivery of traffic that is served over a transit provider." In other words, Netflix gives partner ISPs appliances preloaded with Netflix content, so streams are served from inside the ISP's own network. This allows much better Netflix streaming.
Stateless-service architecture- This allows any server to serve any request. Even if one of the nodes fails, a new node can easily step in. We’ll be covering this in more detail on another Sunday.
Redundancy- Store data in multiple zones, so that you have backups in case of emergency. They also use "n+1" redundancy, which means they run more nodes than required to serve the traffic, so losing a node does not overload the rest.
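To make the stateless idea concrete, here is a minimal Python sketch. The names SessionStore, Node, and route are hypothetical and are not Netflix's code. The only point is that nodes keep no state of their own, so a healthy node can pick up where a failed one left off.

```python
class SessionStore:
    """Stand-in for a shared external store (hypothetical, in-memory here)."""

    def __init__(self):
        self._data = {}

    def get(self, user_id):
        return self._data.get(user_id, 0)

    def put(self, user_id, value):
        self._data[user_id] = value


class Node:
    """A server that holds no per-user state; everything lives in the store."""

    def __init__(self, name, store):
        self.name = name
        self.store = store
        self.healthy = True

    def handle(self, user_id):
        count = self.store.get(user_id) + 1
        self.store.put(user_id, count)
        return f"{self.name} served request {count} for {user_id}"


def route(nodes, user_id):
    """Send the request to the first healthy node; any node will do."""
    for node in nodes:
        if node.healthy:
            return node.handle(user_id)
    raise RuntimeError("no healthy nodes")
```

If node "a" dies after serving a user, node "b" continues the same session because the request count was never stored on "a" in the first place.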
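Here is a rough sketch of the arithmetic behind "n+1", in Python. The function names and the traffic numbers are made up for illustration; real capacity planning involves far more than one division.

```python
import math


def nodes_needed(peak_rps, rps_per_node, spare=1):
    """Nodes required to serve peak traffic, plus `spare` extra nodes."""
    base = math.ceil(peak_rps / rps_per_node)
    return base + spare


def survives_loss(total_nodes, peak_rps, rps_per_node, lost=1):
    """True if the remaining nodes can still carry the peak traffic."""
    return (total_nodes - lost) * rps_per_node >= peak_rps
```

With a hypothetical peak of 1000 requests per second and 300 per node, you need 4 nodes to serve the traffic and 5 under n+1. Lose one of the 5 and the remaining 4 still cope; lose one of a bare 4 and they do not.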
Graceful Degradation- Graceful degradation is the ability of a computer, machine, electronic system or network to maintain limited functionality even when a large portion of it has been destroyed or rendered inoperative. Netflix uses a technique of graceful degradation which is based on three principles: 1) Fail-fast- Aggressive timeouts, so that dying systems are caught early. 2) Feature Fallbacks- If one feature fails, its fallback is used (a fallback is hard-coded for every error scenario). 3) Feature Removal- If a feature is slow and not critical, it is removed from the page. More on this on an upcoming Tuesday.
Use of S3- Netflix rearchitected their system to use newer cloud technologies, and made heavy use of S3 as their data source. AWS S3 is resilient to zone failures and is highly reliable.
Actively integrating Chaos- Those of you who have been here a few months will remember the post that we did on Chaos Engineering. We mentioned the service called "Chaos Monkey", which deliberately breaks services in production. If you’re looking for more information on how you can add chaos into your systems, check out the post- How to Stress Test your Systems [Technique Tuesdays].
Netflix also automated load distribution in the case of a zone failure, so that no manual intervention is needed.
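The three principles can be sketched in a few lines of Python. This is an illustration under my own assumptions, not Netflix's implementation: Feature, call_fail_fast, render_page, and the three fetch_* functions are hypothetical names, and the timeout value is arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, NamedTuple

_pool = ThreadPoolExecutor(max_workers=4)
_FAILED = object()  # sentinel: the call timed out or raised


class Feature(NamedTuple):
    fetch: Callable[[], Any]
    fallback: Any = None    # hard-coded value to show if fetch fails
    critical: bool = False  # non-critical features may be dropped


def call_fail_fast(fetch, timeout_s):
    """Principle 1: give up quickly instead of waiting on a dying service."""
    future = _pool.submit(fetch)
    try:
        return future.result(timeout=timeout_s)
    except Exception:  # covers both timeouts and errors raised by fetch
        # Note: the worker thread is not killed; we just stop waiting for it.
        return _FAILED


def render_page(features, timeout_s=0.1):
    page = {}
    for name, feature in features.items():
        result = call_fail_fast(feature.fetch, timeout_s)
        if result is not _FAILED:
            page[name] = result
        elif feature.fallback is not None:
            page[name] = feature.fallback  # Principle 2: use the fallback
        elif feature.critical:
            raise RuntimeError(f"critical feature {name!r} is unavailable")
        # Principle 3: otherwise the feature is simply left off the page
    return page


def fetch_catalog():
    return ["Show A"]


def fetch_recommendations():
    raise RuntimeError("recommendation service is down")


def fetch_social():
    time.sleep(0.5)  # much slower than the timeout
    return ["friend activity"]
```

Rendering a page with a healthy catalog, a broken recommendations feature that has a fallback of ["Top 10"], and a slow social feature with no fallback gives you the catalog, the Top 10 list, and no social section at all.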
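As a toy illustration of what "automated load distribution" could look like, the Python sketch below recomputes traffic shares when a zone drops out. The rebalance function and the zone names are hypothetical; in practice this job is done by load balancers and health checks, not a single function.

```python
def rebalance(weights, healthy_zones):
    """Spread traffic across healthy zones, keeping their relative weights."""
    live = {zone: w for zone, w in weights.items() if zone in healthy_zones}
    total = sum(live.values())
    if total == 0:
        raise RuntimeError("no healthy zones left")
    return {zone: w / total for zone, w in live.items()}
```

With weights of 1, 1, and 2 for zones "a", "b", and "c", losing zone "b" shifts traffic to one third for "a" and two thirds for "c", with nobody paged to do it by hand.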
That is it for this piece. I appreciate your time. As always, if you’re interested in reaching out to me or checking out my other work, links will be at the end of this email/post. If you like my writing, I would really appreciate an anonymous testimonial. You can drop it here. And if you found value in this write-up, I would appreciate you sharing it with more people.
For those of you interested in taking your skills to the next level, keep reading. I have something that you will love.
Upgrade your tech career with a premium subscription to ‘Tech Made Simple’! Stay ahead of the curve in AI, software engineering, and the tech industry with expert insights, tips, and resources. 20% off for new subscribers by clicking this link. Subscribe now and simplify your tech journey!
Reach out to me
Use the links below to check out my other content, learn more about tutoring, reach out to me about projects, or just to say hi.
To help me understand you, fill out this survey (anonymous).
Check out my other articles on Medium: https://rb.gy/zn1aiu
My YouTube: https://rb.gy/88iwdd
Reach out to me on LinkedIn. Let’s connect: https://rb.gy/m5ok2y
My Instagram: https://rb.gy/gmvuy9
My Twitter: https://twitter.com/Machine01776819