Why Discord ditched Cassandra [System Design Sundays]

The problem with Cassandra and why Discord moved on

Apr 03, 2023

Hey, it’s your favorite cult leader here 🐱‍👤

On Sundays, I will go over various Systems Design topics⚙⚙. These can be mock interviews, writeups by various organizations, or overviews of topics that you need to design better systems. 📝📝

If you’d like to work with me after I graduate in a month, my resume is here. You can reach me through my LinkedIn here

If you’ve gone through any system design resources online, you’ve probably come across the famous HOW DISCORD STORES BILLIONS OF MESSAGES. It’s one of those blog posts that everyone in certain circles seems to have read.

What you probably don’t know is that Discord recently changed the database behind that system. In their newer article, HOW DISCORD STORES TRILLIONS OF MESSAGES, Discord describes the troubles with Cassandra that ultimately caused them to switch over to ScyllaDB. It is always worth paying attention to such changes, since migrations like this are very costly and are usually only done when there are no better options. Since Cassandra is a mainstay in many systems, it is important to understand its limitations thoroughly to design better systems.

We wanted a database that grew alongside us, but hopefully, its maintenance needs wouldn’t grow alongside our storage needs. Unfortunately, we found that to not be the case — our Cassandra cluster exhibited serious performance issues that required increasing amounts of effort to just maintain, not improve.


Problems with Cassandra

The Basic Summary- To quote Discord’s words, it was a high-toil system — our on-call team was frequently paged for issues with the database, latency was unpredictable, and we were having to cut down on maintenance operations that became too expensive to run.
Hot Partitions- In Cassandra, all messages for a given channel and bucket are stored together in one partition and replicated across three nodes (or whatever replication factor you’ve set). Since reads are more expensive than writes in Cassandra, lots of concurrent reads on a busy channel, as users interact with servers, can overload the nodes holding that partition. The Discord team calls this a Hot Partition. For those who don’t know, a hotspot is just a partition of the database with a disproportionately high load.
Maintenance troubles- Discord was prone to falling behind on compactions, the process where Cassandra merges SSTables on disk for more performant reads. Falling behind not only made reads more expensive, it also caused cascading latency as a node tried to compact. For more details on the different compaction strategies in Cassandra, check out the documentation over here.
Garbage Collectors- Turns out that even the GC process in Cassandra was a problem. So much so that the team lists its absence as an advantage of switching to Scylla. In their words: “Although ScyllaDB is most definitely not void of issues, it is void of a garbage collector, since it’s written in C++ rather than Java. Historically, our team has had many issues with the garbage collector on Cassandra, from GC pauses affecting latency, all the way to super long consecutive GC pauses that got so bad that an operator would have to manually reboot and babysit the node in question back to health. These issues were a huge source of on-call toil, and the root of many stability issues within our messages cluster.”
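
To make the hot partition idea concrete, here is a minimal Python sketch. It assumes a partition key of (channel_id, bucket), where the bucket is a fixed time window. The 10-day window, the epoch constant, and the function names are illustrative assumptions on my part, not Discord’s actual code.

```python
from collections import Counter

# Assumed values for illustration only.
EPOCH_MS = 1420070400000                    # 2015-01-01 00:00:00 UTC, in milliseconds
BUCKET_SIZE_MS = 10 * 24 * 60 * 60 * 1000   # a 10-day time window per bucket


def partition_key(channel_id, message_ts_ms):
    """Map a message to the (channel_id, bucket) partition that stores it."""
    bucket = (message_ts_ms - EPOCH_MS) // BUCKET_SIZE_MS
    return (channel_id, bucket)


def read_load(reads):
    """Count how many reads land on each partition.

    `reads` is an iterable of (channel_id, message_ts_ms) pairs.
    """
    return Counter(partition_key(channel, ts) for channel, ts in reads)


# A busy channel (42) and a quiet one (7) reading recent messages,
# plus a few reads of older history in channel 42.
now = EPOCH_MS + 3 * BUCKET_SIZE_MS + 1000
reads = (
    [(42, now)] * 9000
    + [(7, now)] * 10
    + [(42, now - BUCKET_SIZE_MS)] * 5
)

load = read_load(reads)
hottest, count = load.most_common(1)[0]
print(f"hottest partition: {hottest} with {count} reads")
```

Nearly all of the reads hit a single (channel, bucket) pair, so the handful of replicas that own that partition absorb almost the entire load while the rest of the cluster sits idle. That imbalance is the hot partition.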

Improvements

Let’s take a look at some of the improvements the team saw after moving to Scylla. There are two major ones-

1. Efficiency- Discord went from running 177 Cassandra nodes to just 72 ScyllaDB nodes. Each ScyllaDB node has 9 TB of disk space, up from the average of 4 TB per Cassandra node. Fewer nodes simplify maintenance and cut costs.
2. Latency- Latency improved as well. Fetching historical messages had a p99 of 40-125ms on Cassandra, while ScyllaDB sits at a nice and chill 15ms p99. Message insert performance went from 5-70ms p99 on Cassandra to a steady 5ms p99 on ScyllaDB.
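
For a rough sense of scale, here is a back-of-the-envelope calculation using only the figures quoted above. The 4 TB per Cassandra node is an average, so the totals are approximate, and they describe disk space rather than data actually stored. The helper name `speedup_range` is made up for this sketch.

```python
# Figures quoted from Discord's post.
cassandra_nodes, scylla_nodes = 177, 72
cassandra_tb_per_node, scylla_tb_per_node = 4, 9  # 4 TB is an average

# Fraction of nodes removed by the migration.
node_reduction = 1 - scylla_nodes / cassandra_nodes

# Approximate total disk space across each cluster, in TB.
cassandra_total_tb = cassandra_nodes * cassandra_tb_per_node
scylla_total_tb = scylla_nodes * scylla_tb_per_node


def speedup_range(before_ms, after_ms):
    """Return (best-case, worst-case) p99 ratios for a before-range vs. one after value."""
    low, high = before_ms
    return (low / after_ms, high / after_ms)


read_speedup = speedup_range((40, 125), 15)   # historical message fetches
write_speedup = speedup_range((5, 70), 5)     # message inserts

print(f"nodes: {node_reduction:.0%} fewer")
print(f"disk: {cassandra_total_tb} TB -> {scylla_total_tb} TB")
print(f"read p99: {read_speedup[0]:.1f}x to {read_speedup[1]:.1f}x lower")
print(f"write p99: {write_speedup[0]:.1f}x to {write_speedup[1]:.1f}x lower")
```

The total disk space stays in the same ballpark, so the win is less about raw capacity and more about running far fewer, denser nodes. On latency, the bigger story is arguably the spread: a wide p99 range collapsed into a single steady number.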

This was certainly an interesting read. I’ve personally never seen Scylla before this, so this helped me learn something new. What are some of your favorite databases to work with? I’d love to know.

That is it for this piece. I appreciate your time. As always, if you’re interested in working with me or checking out my other work, my links will be at the end of this email/post. If you like my writing, I would really appreciate an anonymous testimonial. You can drop it here. And if you found value in this write-up, I would appreciate you sharing it with more people.

Upgrade your tech career with a premium subscription to ‘Tech Made Simple’! Stay ahead of the curve in AI, software engineering, and the tech industry with expert insights, tips, and resources. New subscribers get 20% off by clicking this link. Subscribe now and simplify your tech journey!

Using this discount will drop the prices-

800 INR (10 USD) → 640 INR (8 USD) per Month

8000 INR (100 USD) → 6400 INR (80 USD) per year
