Amazon Prime Video reduced costs by 90% by ditching Microservices [System Design Sundays]

This article is not sponsored by the World Monolith Supremacy Association

May 07, 2023

Hey, it’s your favorite cult leader here 🐱‍👤

On Sundays, I will go over various Systems Design topics⚙⚙. These can be mock interviews, writeups by various organizations, or overviews of topics that you need to design better systems. 📝📝

To get access to all the articles, support my crippling chocolate milk addiction, and become a premium member of this cult, use the button below-

Help me buy chocolate milk

p.s. you can learn more about the paid plan here.

If any of you have been outside recently, you might have come across streams of salt water, particularly if you live near Silicon Valley.

Scientists have finally found the cause of this. Those salty streams and puddles you have been wading through are the tears of the Microservices bros. For those of you who missed it, Amazon (the poster child for service-oriented architectures) released a very interesting report- “Scaling up the Prime Video audio/video monitoring service and reducing costs by 90%”. In it, they wrote- “The move from a distributed microservices architecture to a monolith application helped achieve higher scale, resilience, and reduce costs.”

Moving our service to a monolith reduced our infrastructure cost by over 90%. It also increased our scaling capabilities. Today, we’re able to handle thousands of streams and we still have capacity to scale the service even further. Moving the solution to Amazon EC2 and Amazon ECS also allowed us to use the Amazon EC2 compute saving plans that will help drive costs down even further.

In this article, I will be covering their publication, what we learned from it, and what the future entails for distributed system design.

How Prime Video benefitted from moving away from Microservices

Amazon’s Microservices Architecture- Prime Video has a relatively simple microservices architecture, given the scale of their operations. There are three major components-
1. The media converter converts input audio/video streams to frames or decrypted audio buffers that are sent to detectors.
2. Defect detectors execute algorithms that analyze frames and audio buffers in real-time looking for defects (such as video freeze, block corruption, or audio/video synchronization problems) and send real-time notifications whenever a defect is found.
3. The third component provides orchestration that controls the flow in the service.
They are arranged in the following way-

The Scaling Problem- As with many other rushed microservices implementations, the result did not match what Amazon had envisioned. In their words- “We designed our initial solution as a distributed system using serverless components (for example, AWS Step Functions or AWS Lambda), which was a good choice for building the service quickly. In theory, this would allow us to scale each service component independently. However, the way we used some components caused us to hit a hard scaling limit at around 5% of the expected load. Also, the overall cost of all the building blocks was too high to accept the solution at a large scale.”
What caused the scaling issues- Amazon identified two problems with how the original service was put together-
1. The main scaling bottleneck in the architecture was the orchestration management that was implemented using AWS Step Functions. Our service performed multiple state transitions for every second of the stream, so we quickly reached account limits. Besides that, AWS Step Functions charges users per state transition.
2. The second cost problem we discovered was about the way we were passing video frames (images) around different components. To reduce computationally expensive video conversion jobs, we built a microservice that splits videos into frames and temporarily uploads images to an Amazon Simple Storage Service (Amazon S3) bucket. Defect detectors (where each of them also runs as a separate microservice) then download the images and process them concurrently using AWS Lambda. However, the high number of Tier-1 calls to the S3 bucket was expensive.
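To get a feel for why per-request pricing hurts in a setup like this, here is a rough back-of-the-envelope sketch. Everything in it is my assumption, not Amazon's data: the `Workload` numbers are made up, the `Prices` values are placeholders you should swap for whatever the current AWS pricing pages say, and the names `hourly_requests` and `hourly_cost` are purely illustrative.

```python
from dataclasses import dataclass

SECONDS_PER_HOUR = 3600


@dataclass(frozen=True)
class Workload:
    """Hypothetical per-stream workload. Every field is an assumed number."""
    transitions_per_second: float  # orchestration state transitions per second of stream
    frames_per_second: float       # frames uploaded to storage per second of stream
    detectors: int                 # detectors that each download every frame


@dataclass(frozen=True)
class Prices:
    """Placeholder prices in USD per 1,000 requests (not quoted from AWS)."""
    per_1k_transitions: float
    per_1k_uploads: float
    per_1k_downloads: float


def hourly_requests(w: Workload, streams: int) -> dict[str, int]:
    """Count billable requests generated by `streams` concurrent streams in one hour."""
    uploads = w.frames_per_second * SECONDS_PER_HOUR * streams
    return {
        "transitions": round(w.transitions_per_second * SECONDS_PER_HOUR * streams),
        "uploads": round(uploads),
        "downloads": round(uploads * w.detectors),
    }


def hourly_cost(requests: dict[str, int], p: Prices) -> float:
    """Turn request counts into an hourly bill under the placeholder prices."""
    return (
        requests["transitions"] / 1000 * p.per_1k_transitions
        + requests["uploads"] / 1000 * p.per_1k_uploads
        + requests["downloads"] / 1000 * p.per_1k_downloads
    )


if __name__ == "__main__":
    # Assumed workload: 3 transitions and 1 stored frame per stream-second, 4 detectors.
    workload = Workload(transitions_per_second=3, frames_per_second=1, detectors=4)
    prices = Prices(per_1k_transitions=0.025, per_1k_uploads=0.005, per_1k_downloads=0.0004)
    for streams in (10, 100, 1000):
        counts = hourly_requests(workload, streams)
        print(streams, counts, round(hourly_cost(counts, prices), 2))
```

The point of the sketch is the shape, not the numbers: every term grows linearly with both the number of streams and the number of seconds streamed, so the bill (and the request rate that runs into account limits) scales with traffic no matter how idle the actual compute is. It does not model the compute cost of the monolith that replaced these calls.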
Prime Video’s new monolith architecture- The new monolith looks like this-
Conceptually, the high-level architecture remained the same. We still have exactly the same components as we had in the initial design (media conversion, detectors, or orchestration). This allowed us to reuse a lot of code and quickly migrate to a new architecture.
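Here is a minimal sketch of what "same components, one process" can look like. This is not Prime Video's code. The `Frame` and `Defect` types, the toy detectors, and the `orchestrate` function are all hypothetical names I made up to illustrate the idea that conversion, detection, and orchestration become plain function calls that hand frames to each other in memory.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Optional


@dataclass(frozen=True)
class Frame:
    index: int
    pixels: bytes  # stand-in for decoded image data


@dataclass(frozen=True)
class Defect:
    frame_index: int
    kind: str


Detector = Callable[[Frame], list[Defect]]


def convert(stream: Iterable[bytes]) -> Iterator[Frame]:
    """Media conversion (toy version): wrap each raw chunk as a frame."""
    for i, chunk in enumerate(stream):
        yield Frame(index=i, pixels=chunk)


def black_frame_detector(frame: Frame) -> list[Defect]:
    """Toy detector: flags non-empty frames whose bytes are all zero."""
    if frame.pixels and not any(frame.pixels):
        return [Defect(frame.index, "black_frame")]
    return []


def make_freeze_detector() -> Detector:
    """Toy detector: flags a frame that is identical to the previous one."""
    previous: Optional[bytes] = None

    def detect(frame: Frame) -> list[Defect]:
        nonlocal previous
        frozen = previous is not None and previous == frame.pixels
        previous = frame.pixels
        return [Defect(frame.index, "freeze")] if frozen else []

    return detect


def orchestrate(stream: Iterable[bytes], detectors: list[Detector]) -> list[Defect]:
    """Orchestration: a loop of function calls; frames never leave process memory."""
    found: list[Defect] = []
    for frame in convert(stream):
        for detect in detectors:
            found.extend(detect(frame))
    return found


if __name__ == "__main__":
    chunks = [b"\x01\x02", b"\x01\x02", b"\x00\x00", b"\x03"]
    print(orchestrate(chunks, [black_frame_detector, make_freeze_detector()]))
```

Notice that the three roles are still separate pieces of code with clear boundaries. What changed is the transport: a function argument instead of an upload, a download, and a state transition.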
Are Microservices useless- This new development has a lot of people asking whether Microservices are useless as an architecture. In the words of David Heinemeier Hansson, the creator of Ruby on Rails, “Replacing method calls and module separations with network invocations and service partitioning within a single, coherent team and application is madness in almost all cases.”
I’m a little more hesitant to write off the idea entirely. Older readers will remember the piece I did on Microservices vs Monoliths. In it, I reached the following verdict- “Microservices are cool and will come with a boost to scalability. However, they can be much harder to set up and will require more careful planning. Monoliths are considered boring by a lot of people, but they are generally a safe pick. They are easier to design and will have lower overhead (different independent services sounds cool till you have to dig through a code base written in multiple languages, frameworks, and styles). But if you are working with large teams, independent services, and a huge emphasis on scale, then microservices might be your jam.”
In this case, it seems like there was some rushed decision-making in how the components would interact that caused this overhead. With a better-designed architecture, these problems could have been avoided. But maybe that’s just me chugging on that copium. After all, is it ever possible to architect systems so well that we overcome the inherent problems with microservices, without running into new ones? I’m starting to get a little skeptical.

Ultimately, I would love to see a study comparing the architectural complexities of monoliths vs microservices in operations of comparable scale. Two months ago, I covered research into the costs of architectural complexity in the post “How bad is Architectural Complexity [System Design Sundays]”. Here is a snippet from it-

Within this research setting, we found that differences in architectural complexity could account for 50% drops in productivity, three-fold increases in defect density, and order-of-magnitude increases in staff turnover.

I would love to see how complexity changes with various architectural decisions. This isn’t something I know too much about, so I’d love to hear from you. Do you have experience architecting microservices and monoliths? What were the differences?

That is it for this piece. I appreciate your time. As always, if you’re interested in working with me or checking out my other work, my links will be at the end of this email/post. If you like my writing, I would really appreciate an anonymous testimonial. You can drop it here. And if you found value in this write-up, I would appreciate you sharing it with more people. It is word-of-mouth referrals like yours that help me grow.

Save the time, energy, and money you would burn by going through all those videos, courses, products, and ‘coaches’ and easily find all your needs met in one place at ‘Tech Made Simple’! Stay ahead of the curve in AI, software engineering, and the tech industry with expert insights, tips, and resources. 20% off for new subscribers by clicking this link. Subscribe now and simplify your tech journey!

Using this discount will drop the prices-

800 INR (10 USD) → 640 INR (8 USD) per Month

8000 INR (100 USD) → 6400 INR (80 USD) per year (533 INR/month)

Reach out to me

Use the links below to check out my other content, learn more about tutoring, reach out to me about projects, or just to say hi.

Small Snippets about Tech, AI and Machine Learning over here

To help me understand you, fill out this survey (anonymous)

Check out my other articles on Medium: https://rb.gy/zn1aiu

My YouTube: https://rb.gy/88iwdd

Reach out to me on LinkedIn. Let’s connect: https://rb.gy/m5ok2y

My Instagram: https://rb.gy/gmvuy9

My Twitter: https://twitter.com/Machine01776819

Jake

May 11, 2023

The weird thing about the discussions on this topic is that we are talking about a narrow service in the first place. We aren't talking about all of Prime Video. I'm sure they still have plenty of services (micro or otherwise) for the website, for billing, for transcoding video files, and for this use case: product analytics.

The entire article is just about their analytics system. And it makes sense: with the benefit of hindsight, that probably should have been a monolith to begin with. But that doesn't really tell you anything about the design of the overall system.

My personal take is that the DDD book had it right. Design services around distinct domains that have independent problem spaces, their own operating teams, clear interfaces, and state separation.
