Discussion about this post

Deagle:

Another consideration: a clustered approach scales data and traffic homogeneously, whereas sharding is more likely to leave some nodes idle and others disproportionately hot. Or, at least, uneven utilization is more random in the case of clustering, and more a function of developer decisions in the case of sharding.

The recourse for high load, then, is different. Clusters can be rebalanced. Shards can be scaled for read-only data with a second cache layer, but scaling writes requires a shard-splitting procedure (which I suppose is also rebalancing, except you have to do it yourself).
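For concreteness, here is a minimal sketch of what that manual split looks like with a range-based shard map. All the names and numbers (`bounds`, `nodes`, `split_shard`) are hypothetical, not from any particular system, and updating the map is the easy part; copying the data and cutting over writes without downtime is the actual "procedure":

```python
import bisect

# Range-based shard map (illustrative). bounds[i] is the exclusive upper
# bound of shard i; the last node owns everything above the last bound.
bounds = [1_000_000, 2_000_000]
nodes = ["db-a", "db-b", "db-c"]

def route(key: int) -> str:
    """Pick the owning node for a key via binary search over range bounds."""
    return nodes[bisect.bisect_right(bounds, key)]

def split_shard(i: int, mid: int, new_node: str) -> None:
    """Split shard i at `mid`, assigning the upper half to new_node."""
    bounds.insert(i, mid)
    nodes.insert(i + 1, new_node)

print(route(1_500_000))               # db-b
split_shard(1, 1_500_000, "db-d")     # shard "db-b" was hot; split it
print(route(1_600_000))               # db-d, after the split
```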

Now, if you were to implement the routing in a third, middle layer between the app and the DB... is it clustered or sharded? Both? Neither? This thought experiment suggests the distinction is ultimately about which process contains the node-selection implementation. The various partitioning strategies, like DHTs, lookup tables, or ranges, can arguably be implemented in either case.
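As a sketch of that point: the node-selection function below is plain hash-mod-N (a DHT or range lookup would slot in the same way), and nothing about it dictates where it runs. The names are hypothetical:

```python
import hashlib

NODES = ["node-0", "node-1", "node-2"]   # illustrative node list

def select_node(key: str, nodes: list[str]) -> str:
    """Hash-based node selection. The exact same function is 'sharding'
    when the application calls it, and 'clustering' when a proxy tier or
    the database's own coordinator calls it."""
    digest = hashlib.sha256(key.encode()).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]

print(select_node("user:42", NODES))
```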

Naturally, then, sharding gives the application more explicit control and has more scaling potential. You can squeeze out more, but you have to do the squeezing yourself, and getting it wrong can backfire. From that, I'd conjecture that a clustered system is a better fit for prototypes, chaotic access patterns, or complex (heterogeneously deep or wide) data models, while sharding is a better fit for well-understood access patterns that are already reflected in the data model. Pinterest certainly fits the latter category.

If you're really crazy, you can shard clusters (O(1) to pick the cluster, plus O(log N) routing within it), but I imagine there are few real-life scenarios where the ROI justifies that level of implementation complexity.
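A sketch of that composition, under the same illustrative assumptions as above: a hash picks the cluster in O(1), then a range lookup (standing in for the cluster's internal routing) picks the node within it in O(log N):

```python
import bisect
import hashlib

# Hypothetical two-level topology: shards of clusters.
CLUSTERS = [
    {"bounds": [100], "nodes": ["a-0", "a-1"]},   # cluster A
    {"bounds": [100], "nodes": ["b-0", "b-1"]},   # cluster B
]

def route(key: str) -> str:
    h = int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")
    cluster = CLUSTERS[h % len(CLUSTERS)]                 # level 1: O(1) shard pick
    i = bisect.bisect_right(cluster["bounds"], h % 200)   # level 2: O(log N) within cluster
    return cluster["nodes"][i]

print(route("user:42"))
```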
