Avi Kivity and Dor Laor AMA: “Tablets” replication, all things async, caching, object storage & more

March 4, 2024

Cynthia Dunlop

Even though the ScyllaDB monster has tentacles instead of feet, we still like to keep our community members on their toes. From a lively ScyllaDB Lounge, to hands-on ScyllaDB Labs, to nonstop speaker engagement (we racked up thousands of chat interactions), this was not your typical virtual event. But it’s pretty standard for the conferences we host – both ScyllaDB Summit and P99 CONF.

Still, we surprised even the seasoned sea monsters by launching a previously unannounced session after the first break: an “Ask Me Anything” session with ScyllaDB Co-founders Dor Laor (CEO) and Avi Kivity (CTO).

The discussion covered topics such as the new tablets replication algorithm (replacing Vnodes), ScyllaDB’s async shard-per-core architecture, strategies for tapping object storage with S3, the CDC roadmap, and comparing database deployment options.

Here’s a taste of that discussion…

Why do tablets matter for me, the end user?

Avi: But that spoils my keynote tomorrow. 😉 One of the bigger problems that tablets can solve is the all-or-nothing approach to scaling that we have with Vnodes. Today, if you add a node, it’s all or nothing. You start streaming, and it can take hours, depending on the data shape. Only when it succeeds have you added a node, and only then will it be able to serve coordinator requests. With tablets, bootstrapping a node takes maybe a minute, and at the end of that minute, you have a new node in your cluster. But that node doesn’t hold any data yet. The load balancer notices that an imbalance exists and starts moving tablets from the overloaded nodes toward that new node.

From that, several things follow:

You will no longer have that all-or-nothing situation. You won’t have to wait for hours to know if you succeeded or failed.
As soon as you add a node, it starts shouldering part of the load. Imagine you’re stressed for CPU or you’re reaching the end of your storage capacity. As soon as you add the node, it starts taking part in the cluster workload, reducing CPU usage from your other nodes.
Storage starts moving to the other nodes, so you don’t have to wait for hours.
Since adding a node takes just a minute, you can add multiple nodes at the same time. Raft’s consistent topology allows us to linearize parallel requests and apply them almost concurrently. After that, existing nodes will stream data to the new ones, and system load will eventually converge to a fair distribution as the process completes. So if you’re adding a node in response to an increase in your workload, the database will be able to react quickly. You don’t get the full impact of adding the node immediately; that still takes time until the data is streamed. However, you get incremental capacity immediately, and it grows as you wait for data to be streamed.
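To make the "empty node joins, then tablets drift toward it" idea concrete, here is a minimal toy model in Python. It is not ScyllaDB's load balancer: the `rebalance` function, the node names, and the rule of moving one tablet at a time from the fullest node to the emptiest one are all assumptions made for illustration.

```python
def rebalance(tablets_per_node, new_node):
    """Toy model of tablet rebalancing (hypothetical, not ScyllaDB code).

    The new node joins holding zero tablets. Tablets then move one at a
    time from the most loaded node to the least loaded node until the
    counts differ by at most one.
    """
    counts = dict(tablets_per_node)
    counts[new_node] = 0  # the node is part of the cluster before it holds data
    moves = []
    while True:
        src = max(counts, key=counts.get)
        dst = min(counts, key=counts.get)
        if counts[src] - counts[dst] <= 1:
            break  # balanced enough, stop migrating
        counts[src] -= 1
        counts[dst] += 1
        moves.append((src, dst))
    return counts, moves


counts, moves = rebalance({"node1": 12, "node2": 12, "node3": 12}, "node4")
print(counts)      # every node ends up with 9 tablets
print(len(moves))  # 9 individual tablet migrations
```

Each entry in `moves` is one small, independent migration, which is the point Avi makes: capacity arrives incrementally with every tablet that lands, rather than only after one long streaming operation completes.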

A new tablet, weighing only 5GB, scheduled on a new node will take 2 minutes to be operational and relieve the load on the original servers.

Dor: Someone else just asked about migrating from Vnodes to tablets for existing deployments. Migration won’t happen automatically. In ScyllaDB 6.0, we’ll have tablets enabled for new clusters by default. But for existing clusters, it’s up to you to enable it. In ScyllaDB 6.1, tablets will likely be the default. We’ll have an automatic process that shifts Vnodes to tablets. In short, you won’t need to have a new cluster and migrate the data to it. It’s possible to do it in place – and we’ll have a process for it.

Read our engineering blog on tablets

What are some of the distinctive innovations in ScyllaDB?

Avi: The primary innovation in ScyllaDB is its thread-per-core architecture. Thread-per-core was not invented by ScyllaDB. It was used by other products, mostly closed-source products, before – although I don’t know of any databases back then using this architecture. Our emphasis is on making everything asynchronous. We can sustain a very large amount of concurrency, so we can utilize large node counts, large CPU counts, and very fast modern disks. All of these require a lot of concurrency to saturate the hardware and extract every bit of performance from it. So our architecture is really oriented towards that.

And it’s not just the architecture – it’s also our mindset. Everything we do is oriented towards getting great performance while maintaining low latency. Another outcome from that was automatic tuning. We learned very quickly that it’s impossible to manually tune a database for different workloads, manage compaction settings, and so forth. It’s too complex for people to manage if they are not experts – and difficult even if they are experts. So we invested a lot of effort in having the database tune itself. This self-tuning is not perfect. But it is automatic, and it is always there. You know that if your workload changes, or something in the environment changes, ScyllaDB will adapt to it and you don’t need to be paged in order to change settings. When the workload increases, or you lose a node, the database will react to it and adapt.

Another thing is workload prioritization. Once we had the ability to isolate different internal workloads (e.g., isolate the streaming workload from the compaction workload from the user workload), we exposed it to users as a capability called workload prioritization. That lets users define separate workloads, which could be your usual OLTP workload and an ETL workload or analytics workload. You can run those two workloads simultaneously on the same cluster, without them interfering with one another.
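One way to picture two workloads sharing a cluster without starving each other is a proportional-share split. The sketch below assumes that model; the `split_capacity` function, the workload names, and the share values are hypothetical and are not taken from the discussion or from ScyllaDB's scheduler.

```python
def split_capacity(shares, active):
    """Divide one unit of capacity among the active workloads in
    proportion to their shares (toy model, hypothetical names)."""
    if not active:
        return {}
    total = sum(shares[name] for name in active)
    return {name: shares[name] / total for name in active}


# Assumed share values: the OLTP workload is weighted 4x the analytics one.
shares = {"oltp": 800, "analytics": 200}

both = split_capacity(shares, ["oltp", "analytics"])
alone = split_capacity(shares, ["analytics"])
print(both)   # oltp gets 0.8 of capacity, analytics gets 0.2
print(alone)  # with oltp idle, analytics can use everything
```

The design choice worth noticing is that shares only matter under contention: when the OLTP workload is idle, the analytics job is not artificially capped.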

How does ScyllaDB approach caching?

Dor: Like many infrastructure projects, ScyllaDB has its own integrated, internal cache. The servers that we run on are pretty powerful and have lots of memory, so it makes sense to utilize this memory. The database knows best about the object lifetime and the access patterns – what is used most, what’s rarely used, what’s most expensive to retrieve from the disk, and so on. So ScyllaDB does its own caching very effectively.

It’s also possible to monitor the cache utilization. We even have algorithms like heat-weighted load balancing. That comes into play, for example, when you restart a server. If you were doing upgrades, the server comes back relatively fast, but with a completely cold cache. If clients naively route queries evenly to all of your cluster nodes, then a full share of the queries lands on the one replica with a cold cache, and this could hurt latency. ScyllaDB nodes actually know the cache utilization of their peers since we propagate these stats via Gossip. So the restarted server is sent more and more requests gradually, over time, as a function of its cache utilization relative to the other servers in the cluster. That’s how upgrades are done painlessly without any impact on your running operations.
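As a rough illustration of routing "as a function of the cache utilization," here is a small Python sketch. The weighting rule (traffic proportional to each replica's reported cache hit rate, with a small floor so a cold replica can still warm up) is an assumption for illustration; the actual formula ScyllaDB uses is not described here, and `request_weights` and `floor` are hypothetical names.

```python
def request_weights(hit_rates, floor=0.05):
    """Fraction of reads to send to each replica, proportional to its
    reported cache hit rate (toy rule, not ScyllaDB's actual formula)."""
    # The floor keeps a trickle of traffic flowing to a cold replica,
    # otherwise its cache would never warm up.
    adjusted = {node: max(rate, floor) for node, rate in hit_rates.items()}
    total = sum(adjusted.values())
    return {node: value / total for node, value in adjusted.items()}


# node3 has just restarted and reports an empty cache.
cold = request_weights({"node1": 0.9, "node2": 0.9, "node3": 0.0})
# Later, node3 has caught up with its peers.
warm = request_weights({"node1": 0.9, "node2": 0.9, "node3": 0.9})

print(cold)  # node3 receives only a small slice of the reads
print(warm)  # reads are spread evenly again
```

Re-running the calculation as fresh hit rates arrive gives the gradual ramp-up described above: the restarted node's slice grows as its cache fills.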

Other databases don’t have this mechanism. They require you to have an external cache. That’s extremely wasteful because all of the database servers have lots of memory, and you’re wasting both that memory and the database’s knowledge about object utilization and recently used objects. Those external caches, like Redis, usually aren’t great for high availability. And you need to make sure that the cache maintains the same consistency as the database – and it’s complicated.

Avi: Yes, caching is difficult, which is why it made it into that old joke about the two difficult things in computer science: naming things, cache invalidation, and off-by-one errors. It’s old, but it’s funny every time – maybe that says something about programmers’ sense of humor. 😉 The main advantage of having strong caching inside the database is that it simplifies things for the database user.

Read our engineering blog on ScyllaDB’s specialized cache

What are you hoping to achieve with the upcoming object storage capability, and do you plan to support multiple tiers?

Dor: We are adding S3 support: some of it is already in the master branch, the rest will follow. We see a lot of potential in S3. It can benefit total cost of ownership, and it will also improve backup and restore, letting you have a passive data center just by sending the S3 data to another data center, without running nodes there (unlike the active-active that we support today). There are multiple benefits, even faster scaling later on and faster recovery from a backup. But the main advantage is its cost: compared to NVMe SSDs, it’s definitely cheaper.

The cost advantage comes with a tradeoff; the price is high latency, especially with the implementation by cloud vendors. Object storage is still way slower than pure block devices like NVMe. We will approach it by offering different options.

One approach is basically tiered storage. You can have a policy where the hot data resides on NVMe and the data that you care less about, but you still want to keep, is stored on S3. That data on S3 will be cold, and its performance will be different. It’s simplest to model this with time series access patterns. We also want to get to more advanced access patterns (e.g., completely random access) and just cache the most recently used objects in NVMe while keeping the rarely accessed objects in S3.

That’s one way to benefit from S3. Another way is the ability to still cache your entire data set on top of NVMe SSDs. You can have one or two replicas as a cached replica (stored not in RAM but in NVMe), so storage access will still be very fast when data no longer fits in RAM. The remaining replicas can use S3 instead of NVMe, and this setup will be very cost-effective. You might be able to have a 67% cost reduction or a 33% cost reduction, depending on the ratio of cache replicas versus replicas that are backed by S3.
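The 67% and 33% figures can be reproduced with simple arithmetic, under two assumptions that are not stated in the discussion: three replicas in total, and S3 storage cost treated as negligible next to NVMe. The `nvme_cost_reduction` function and its `s3_relative_cost` parameter are hypothetical names for this sketch.

```python
def nvme_cost_reduction(total_replicas, s3_replicas, s3_relative_cost=0.0):
    """Fractional storage saving versus keeping every replica on NVMe.

    s3_relative_cost is the cost of an S3-backed replica as a fraction of
    an NVMe replica. The default of 0.0 is a simplifying assumption.
    """
    nvme_replicas = total_replicas - s3_replicas
    new_cost = nvme_replicas + s3_replicas * s3_relative_cost
    return 1 - new_cost / total_replicas


# Assuming three replicas in total:
print(round(nvme_cost_reduction(3, 2) * 100))  # 67: one NVMe replica, two on S3
print(round(nvme_cost_reduction(3, 1) * 100))  # 33: two NVMe replicas, one on S3
```

With a nonzero `s3_relative_cost` the savings shrink accordingly, so the quoted percentages are best read as upper bounds under these assumptions.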

Can you speak to the pros and cons of ScyllaDB vs. Cassandra?

Dor: If there is a con, we define it as a bug and we address it. 😉 Really, that’s what we’ve been doing for years and years. As far as the benefits, there’s no JVM – no big fat layer existing or requiring tuning. Another benefit is the caching. And we have algorithms that use the cache, like heat-weighted load balancing. There’s also better observability with Grafana and Prometheus – you can even drill down per shard. Shard-per-core is another thing that guarantees performance. We also have “hints” where the client can say, “I’d like to bypass the cache because, for example, I’m doing a scan now and these values shouldn’t be cached.” Even the way that we index is different. Also, we have global indexes, unlike Cassandra. Those are just a few things – there are many improvements over Cassandra.