How We Implemented ScyllaDB’s “Tablets” Data Distribution

How ScyllaDB implemented its new Raft-based tablets architecture, which enables teams to quickly scale out in response to traffic spikes

ScyllaDB just announced the first GA release featuring our new tablets architecture.

In case this is your first time hearing about it, tablets are the smallest replication unit in ScyllaDB. Unlike static, all-or-nothing topologies, tablets provide dynamic, parallel, and near-instant scaling operations. This approach allows for autonomous and flexible data balancing and ensures that data is sharded and replicated evenly across the cluster. That, in turn, optimizes performance, avoids imbalances, and increases efficiency.

The first blog in this series covered why we shifted from tokens to tablets and outlined the goals we set for this project. In this blog, let’s take a deeper look at how we implemented tablets via:

  • Indirection and abstraction
  • Independent tablet units
  • A Raft-based load balancer
  • Tablet-aware drivers

Indirection and Abstraction

There’s a saying in computer science that every problem can be solved with a new layer of indirection. We tested that out here, with a so-called tablets table. 😉

We added a tablets table that stores all of the tablets metadata and serves as the single source of truth for the cluster. Each tablet has its own token-range-to-nodes-and-shards mapping, and those mappings can change independently of node addition and removal.

That enables the shift from static to dynamic data distribution. Each node controls its own copy, but all of the copies are synchronized via Raft.

The tablets table tracks the details as the topology evolves. It always knows the current topology state, including the number of tablets per table, the token boundaries for each tablet, and which nodes and shards have the replicas. It also dynamically tracks tablet transitions (e.g., is the tablet currently being migrated or rebuilt?) and what the topology will look like when the transition is complete.
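
To make that metadata concrete, here is a minimal sketch (in Python, purely illustrative) of the kind of per-tablet record the tablets table conceptually holds. The field names are our assumptions for illustration, not ScyllaDB's actual schema.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative only: field names are assumptions, not ScyllaDB's actual schema.
@dataclass
class TabletInfo:
    table_id: str                              # the table this tablet belongs to
    last_token: int                            # upper bound of this tablet's token range
    replicas: list[tuple[str, int]]            # current (node, shard) pairs holding replicas
    new_replicas: Optional[list[tuple[str, int]]] = None  # target replicas during a transition
    stage: Optional[str] = None                # e.g. "streaming" or "rebuild"; None when stable

# Conceptually, the cluster-wide view is a map of table -> ordered list of tablets.
# Every node holds a copy of this metadata, and the copies are kept in sync via Raft.
tablets_metadata: dict[str, list[TabletInfo]] = {}
```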

Independent Tablet Units

Tablets dynamically distribute each table across a subset of nodes and shards based on its size. This is a very different approach from vNodes, which statically distribute all tables across all nodes and shards based only on the token ring.

With vNodes, each shard has its own set of SSTables and memtables that contain the portion of the data that’s allocated to that node and shard. With this new approach, each tablet is isolated into its own mini memtable and its own mini SSTables. Each tablet runs the entire log-structured merge (LSM) tree independently of other tablets that run on this shard.
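
As a rough mental model, each tablet can be pictured as a tiny, self-contained storage engine. The sketch below is illustrative Python under our own naming, not ScyllaDB's C++ internals.

```python
# Each tablet owns its own memtable and SSTables and runs its own small LSM tree,
# independently of the other tablets hosted on the same shard. Purely illustrative.
class Tablet:
    def __init__(self, token_range: tuple[int, int]):
        self.token_range = token_range
        self.memtable: dict = {}   # in-memory writes belonging to this tablet only
        self.sstables: list = []   # on-disk runs belonging to this tablet only

    def write(self, key, value):
        self.memtable[key] = value

    def flush(self):
        # Freeze the memtable into a new SSTable-like run for this tablet.
        if self.memtable:
            self.sstables.append(dict(self.memtable))
            self.memtable = {}

    def compact(self):
        # Merge this tablet's runs; other tablets on the shard are unaffected.
        merged = {}
        for run in self.sstables:
            merged.update(run)
        self.sstables = [merged] if merged else []
```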

The advantage of this approach is that everything (SSTable + memtable + LSM tree) can be migrated as a unit. We just flush the memtables and copy the SSTables before streaming (because it's easier for us to stream SSTables than memtables). This enables very fast and very efficient migration. Another benefit: users no longer need to worry about manual cleanup operations. With vNodes, a cleanup can take quite a while to complete since it involves rewriting all of the data on the node. With tablets, each tablet migrates as a unit, and we can simply delete that unit once streaming finishes.
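
Building on the Tablet sketch above, a migration conceptually reduces to a few steps. The node objects and method names here are hypothetical, just enough to show the flow.

```python
# Hypothetical node API for illustration only.
class Node:
    def __init__(self, name: str):
        self.name = name
        self.tablets: list[Tablet] = []

    def receive_sstables(self, token_range, sstables):
        replica = Tablet(token_range)
        replica.sstables = [dict(run) for run in sstables]
        self.tablets.append(replica)

def migrate_tablet(tablet: Tablet, source: Node, target: Node):
    # Flush so all of the tablet's data is in SSTables, which are easier to stream.
    tablet.flush()
    # Stream the tablet's SSTables to the target node/shard.
    target.receive_sstables(tablet.token_range, tablet.sstables)
    # No cluster-wide cleanup pass: the source just drops the whole unit when done.
    source.tablets.remove(tablet)
```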

When a new node is added to the cluster, it doesn’t yet own any data. A new component that we call the load balancer (more on this below) notices an imbalance among nodes and automatically starts moving data from the existing nodes to the new node. This is all done in the background with no user intervention required.

For decommissioning nodes, there's a similar process, just in the other direction. The load balancer is given the goal of zero data on the decommissioned node; it shifts tablets to make that happen, and the node can be removed once all tablets have been migrated.

Each tablet holds approximately 5 GB of data. Different tables might have different tablet counts and involve a different number of nodes and shards. A large table will be divided into a large number of tablets that are spread across many nodes and many shards, while a smaller table will involve fewer tablets, nodes and shards. The ultimate goal is to spread tables across nodes and shards evenly, in a way that they can effectively tap the cluster’s combined CPU power and IO bandwidth.
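
As a back-of-the-envelope illustration (the real allocator also weighs shard counts, replication factor, and per-shard balance, so take this only as an approximation), a ~5 GB target size implies roughly the following tablet counts:

```python
import math

TARGET_TABLET_SIZE_GB = 5  # approximate target size per tablet

def approximate_tablet_count(table_size_gb: float) -> int:
    # Rough estimate only; the real scheduler considers more than table size.
    return max(1, math.ceil(table_size_gb / TARGET_TABLET_SIZE_GB))

print(approximate_tablet_count(1024))  # ~1 TB table -> about 205 tablets
print(approximate_tablet_count(8))     # small table -> just 2 tablets
```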

Load Balancer

All tablet transitioning is globally controlled by the load balancer: moving data from node to node or across shards within a node, running rebuild and repair operations, and so on. The human operator doesn't have to perform any of those tasks.

The load balancer moves tablets around with the goal of achieving the delicate balance between overloading the nodes and underutilizing them. We want to maximize the available throughput (saturate the CPU and network on each node), but at the same time we need to avoid overloading the nodes so we can keep migrations as fast as possible. To do this, the load balancer runs a loop that collects statistics on tables and tablets. It looks at which nodes have too little free space and which have plenty, then works to even out the difference. It also rebalances data when we want to decommission a node.
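
The loop itself can be pictured roughly like this. This is a simplified sketch with a hypothetical cluster API; the real load balancer runs inside ScyllaDB, coordinates through Raft, and also drives cross-shard moves, rebuilds, and repairs.

```python
import time

def balancing_loop(cluster, poll_interval_s: float = 10.0):
    while True:
        # Collect per-node statistics (free space, tablet counts, ...).
        free_space = {node: node.free_space() for node in cluster.nodes()}
        fullest = min(free_space, key=free_space.get)   # least free space
        emptiest = max(free_space, key=free_space.get)  # most free space
        # Move one tablet from the fullest node to the emptiest one,
        # as long as the imbalance is worth fixing.
        if free_space[emptiest] - free_space[fullest] > cluster.imbalance_threshold():
            tablet = fullest.pick_tablet_to_move()
            cluster.migrate(tablet, source=fullest, target=emptiest)
        time.sleep(poll_interval_s)
```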

The load balancer’s other core function is maintaining the serialization of transitions. That’s all managed via Raft, specifically the Raft Group 0 leader. For example, we rely on Raft to prevent migration during a tablet rebuild and prevent conflicting topology changes. If the human operator happens to increase the replication factor, we will rebuild tablets for the additional replicas – and we will not allow yet another RF change until the automatic rebuild of the new replicas completes.

The load balancer is hosted on a single node, but not a designated node. If that node goes down for maintenance or crashes, the load balancer will just get restarted on another node. And since all the tablets metadata lives in the tablets table, the new load balancer instance can just pick up wherever the last one left off.

Tablet-Aware Drivers

Finally, it’s important to note that our older drivers will work with tablets, but they will not work as well. We just released new tablet-aware drivers that provide a nice performance and latency boost. We decided that the driver should not read from the tablets table directly because scanning it could take too long, and that approach doesn’t work well with things like lambdas or cloud functions. Instead, the driver learns tablet information lazily.

The driver starts without any knowledge of where tokens are located, so it sends its first request to a random node. If that node doesn’t hold the data, it notices that the driver missed the correct host. When it returns the data, it also adds extra routing information that indicates the correct location.

Next time, the driver will know where that particular token lives, so it will send the request directly to the node that hosts the data. This avoids an extra hop. If the tablets get migrated later on, then the “lazy learning” process repeats.
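
Here’s a condensed sketch of that lazy-learning idea in Python. The class, method, and payload names are assumptions for illustration, not the actual driver API.

```python
import random

class LazyTabletRouter:
    def __init__(self, nodes):
        self.nodes = nodes
        # (table, first_token, last_token) -> replica nodes learned so far
        self.routing = {}

    def pick_node(self, table, token):
        # If we already learned which tablet owns this token, go straight to a replica.
        for (tbl, first, last), replicas in self.routing.items():
            if tbl == table and first < token <= last:
                return random.choice(replicas)
        # Otherwise pick any node; it will still serve the request.
        return random.choice(self.nodes)

    def on_response(self, table, routing_hint):
        # If the contacted node wasn't a replica for that token, it piggybacks the
        # owning tablet's token range and replica list on the response.
        if routing_hint is not None:
            first, last, replicas = routing_hint
            self.routing[(table, first, last)] = replicas
```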

How Does This All Play Out?

Let’s take a deeper look into monitoring metrics and even some mesmerizing tablet visualization to see how all the components come together to achieve the elasticity and speed goals laid out in the previous blog.

Conclusion

We have seen how tablets make ScyllaDB more elastic. With tablets, ScyllaDB scales out faster, scaling operations are independent, and the process requires less attention and care from the operator.

We feel like we haven’t yet exhausted the potential of tablets. Future ScyllaDB versions will bring more innovation in this space.