
Best Practices for Benchmarking ScyllaDB

 

Benchmarking is hard.

Or, I should say, doing a good, properly set up and calibrated, objective, and fair job of benchmarking is hard.

It is hard because there are many moving parts and nuances to take into consideration, and you must clearly understand what you are measuring. It's not so easy to generate system load that properly reflects your real-life scenarios. It's often not obvious how to correctly measure and analyze the end results. Once you have the benchmarking results, you need to be able to read them and understand the bottlenecks and other issues. You should be able to make your benchmarking results meaningful, ensure they are easily reproducible, and then be able to clearly explain these results to your peers or superiors.

There's also hard mathematics involved: statistics and queueing theory help you reason about black boxes and measurements. Not to mention the domain-specific knowledge of the internals of the server platforms, operating systems, and the software running on them.

With any Online Transaction Processing (OLTP) database — and Scylla is just one example — developers usually want to understand and measure the transaction read/write performance and what factors affect it. In such scenarios, there are usually a number of external clients constantly generating requests to the database. The number of incoming requests per unit of time is called throughput or load.

100,000 Operations per second or [OPS]

Requests reach the database via a communication channel, get processed when the database is ready and then a response is sent back. The round trip time for a request to be processed is called latency. The ultimate goal of an OLTP database performance test is to find out what the latencies of requests are for various throughput rates.

1ms per request

There are thousands of requests that form the pattern of the workload. That's why we don't want to look at the latency of individual requests; rather, we should look at the overall result — a latency distribution. The latency distribution is a function that describes how many requests were worse than some specific latency target.

99 percentile or P99 or 99%
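To make this concrete, here is a minimal Python sketch (the latencies are synthetic and the percentile helper is purely illustrative, not part of any benchmarking tool) showing how a P99 is read off a set of per-request measurements:

# Minimal sketch: extracting percentiles from per-request latency samples.
# The sample data is synthetic, for illustration only.
import random

random.seed(42)
# Pretend we measured 100,000 request latencies, in milliseconds.
latencies_ms = [random.expovariate(1 / 2.0) for _ in range(100_000)]

def percentile(samples, p):
    """Return the value below which roughly p% of the samples fall."""
    ordered = sorted(samples)
    index = min(len(ordered) - 1, int(len(ordered) * p / 100))
    return ordered[index]

print(f"P50 = {percentile(latencies_ms, 50):.2f} ms")
print(f"P99 = {percentile(latencies_ms, 99):.2f} ms")
# "P99 = 10ms" would mean 99% of requests completed in 10ms or less.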

Database systems can't handle an infinite amount of load; there are limits to what a system can handle. How close a system is to its maximum is called utilization. The higher the utilization, the higher the latency (you can learn more about the math behind this here).

80% utilization or 0.8

The end-user doesn't want to see high latencies for OLTP workloads — those types of workloads rely on fast updates. Therefore we target somewhere between 10ms and 200ms for the 99th percentile (P99) of the latency distribution. If your P99 latencies become too high, your request queues can back up, and you risk request timeouts in your application, which can then cascade out of hand in repeated retries, resulting in system bottlenecking.

Thus, for a quick test, we don't really need to look at how the latency distribution changes across the whole utilization range. Often it's enough to find the optimal point where throughput is at its maximum for a given latency goal, for example a P99 of 10ms or less.

P(Latency < 10ms) = 0.99 or P99 = 10ms

Now, after we define our target benchmarking parameters, all that is left is to generate the load and measure the latencies. There are many trusted open-source frameworks available for that, such as YCSB, which we use in the examples below.

They all work in broadly the same way and provide similar configuration parameters. Your task is to understand which one best reflects the workload you are interested in and how to run it properly.

By correctness, we mean configuration parameters and an environment that make sense and produce a valid workload. Remember, we want to measure performance for an open-class system!

An open-class queuing system

What does it mean? It means we must specify throughput (an arrival rate) and other related parameters. Start with the basic rules for your benchmarking:

  • Set specific goals — for example, if you want the system to handle X amount of throughput with Y 99 percentile latency under Z utilization for D amount of data without cache enabled.
  • Focus on the workload that most closely mirrors your real-world use case. Is it read heavy? Write heavy? A balanced mix?
  • Understand and anticipate what parts of the system your chosen workload will affect and how — how will it stress your CPUs? Your memory? Your disks? Your network?
  • Pick your database and system layers configuration that match your expectations (if this is your first time benchmarking Scylla, check out our capacity planning blog here.)
  • The target system and the data loader must be isolated — they shall not use the same compute resources.
  • Monitor performance of both the target system and the loader while testing. Ensure the loaders are not overloaded — look at CPU, Memory, Disks, Network utilization.
  • Generate load correctly — take coordinated omission into account.
  • Measure latency distributions. Use percentiles. Don’t average averages or percentiles.
  • Check if you need to tweak your setup during your testing. Example: do you need to add more servers or more clients to achieve a certain throughput? Or can you maintain your performance goals with less? Or are you limited (by budget or infrastructure) to a fixed hardware configuration?
  • Results must be reproducible.
  • Further reading: [1] [2].

Finally, when you have everything set and you are ready to run:

$ bin/benchmark start

Now it’s time to specify parameters for the benchmark. Suppose we are using YCSB with scylla-native binding.

Setting Up Your Benchmark

1. Set Your Throughput Level

For the OLTP test, start with the throughput target. Use the -target flag to set the desired throughput level.

-target 120000

For example -target 120000 means that we expect YCSB workers to generate 120,000 operations per second (OPS) to the database.

Why is this important? First, we want to look at the latency at some sustained throughput target, not vice versa. Second, without a throughput target, the system + loader pair converges to a closed-loop system, which has completely different characteristics from what we want to measure. The load settles at the system's equilibrium point. You will find a throughput that depends on the number of loader threads (workers), but you will not measure latency, only service time. This is not what we intended.
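To make the distinction tangible, here is a deliberately simplified Python sketch (not YCSB internals; fake_db_request and all the timings are invented) contrasting a closed-loop worker, which sends the next request only after the previous one returns, with an open-loop worker that schedules requests at a fixed arrival rate:

# Simplified sketch (not YCSB code) contrasting closed-loop and open-loop
# load generation against a fake "database" call with a 2ms service time.
import time

def fake_db_request():
    time.sleep(0.002)  # pretend the service time is 2ms

def closed_loop(duration_s=1.0):
    """Send the next request only after the previous one completes.
    Throughput converges to ~1/service_time per worker; you only see service time."""
    done = 0
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        fake_db_request()
        done += 1
    return done

def open_loop(target_ops=200, duration_s=1.0):
    """Schedule requests at a fixed arrival rate (the -target idea) and
    measure latency from each request's intended start time."""
    interval = 1.0 / target_ops
    start = time.monotonic()
    latencies = []
    for i in range(int(target_ops * duration_s)):
        intended = start + i * interval
        while time.monotonic() < intended:
            pass  # wait for the scheduled arrival time
        fake_db_request()
        latencies.append(time.monotonic() - intended)
    return latencies

print("closed-loop ops in 1s:", closed_loop())
print("open-loop worst latency (ms):", round(max(open_loop()) * 1000, 2))

The closed-loop version settles at the loader's own equilibrium, and its throughput is just a function of worker count and service time; the open-loop version keeps the arrival rate fixed, which is exactly what -target is for.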

For more information check out these resources on the coordinated omission problem: [1] [2] [3] and [this] great talk by Gil Tene.

2. Measure Latency the Right Way

To measure latency, it is not enough to just set a target. The latencies must be measured with a correction, because we apply a closed-class loader to an open-class problem. This is what YCSB calls an intended operation.

Intended operations have points in time at which they were intended to be executed, according to the schedule defined by the load target (-target). We must correct the measurement if we did not manage to execute an operation on time.

A fair measurement consists of the operation latency plus the correction back to the point of its intended execution. Even if you don't want a completely fair measurement, use “both”:

-p measurement.interval=both

The other options are “op” and “intended”; “op” is the default.
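The difference between “op” and “intended” can be shown in a few lines (a simplified illustration with invented numbers, not YCSB's actual implementation):

# "op" vs "intended" latency measurement, simplified. Times are in seconds.
intended_start = 0.000   # the scheduler wanted to send the request at t = 0
actual_start   = 0.015   # the loader was busy and sent it 15ms late
completion     = 0.020   # the response arrived at t = 20ms

op_latency       = completion - actual_start    # 5ms: hides the loader-side stall
intended_latency = completion - intended_start  # 20ms: what a real open-loop client saw

print(f"op latency:       {op_latency * 1000:.0f} ms")
print(f"intended latency: {intended_latency * 1000:.0f} ms")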

Another flag that affects measurement quality is the type of histogram used. We'll use the High Dynamic Range (HDR) histogram, “-p measurementtype=hdrhistogram” (see this), which is fine for most use cases.

3. Latency Percentiles and Multiple Loaders

Latency percentiles can't be averaged. Don't fall into this trap. Neither averaged latencies nor averaged P99s make any sense.

If you run a single loader instance, look at P99 — the 99th percentile. If you run multiple loaders, dump the result histograms with:

-p measurement.histogram.verbose=true

or

-p hdrhistogram.fileoutput=true

-p hdrhistogram.output.path=file.hdr

Then merge them manually and extract the required percentiles from the joined result.
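The important point is that a correct merge combines the underlying counts first and only then extracts percentiles. A minimal Python sketch of the idea, using plain bucket counts instead of the HDR file format:

# Why percentiles are merged by combining counts, never by averaging.
# Buckets map a latency (ms, rounded) to the number of requests that saw it.
from collections import Counter

loader1 = Counter({1: 9_000, 2: 890, 50: 110})  # 10,000 requests
loader2 = Counter({1: 400, 2: 100})             # 500 requests

def p99(buckets):
    """Smallest latency bucket covering 99% of all requests."""
    total = sum(buckets.values())
    threshold = 0.99 * total
    seen = 0
    for latency in sorted(buckets):
        seen += buckets[latency]
        if seen >= threshold:
            return latency

merged = loader1 + loader2                       # combine raw counts first
print("loader1 P99:", p99(loader1), "ms")        # 50 ms
print("loader2 P99:", p99(loader2), "ms")        # 2 ms
print("merged  P99:", p99(merged), "ms")         # 50 ms: the correct combined view
print("average of P99s:", (p99(loader1) + p99(loader2)) / 2, "ms")  # 26.0 ms, misleading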

Remember that running multiple loaders may distort the workload distributions they were intended to produce.

4. Number of Threads

Scylla utilizes a thread-per-core architecture. That means a node consists of shards that are mapped to the CPU cores, one per core.

In a production setup, Scylla reserves one core per node for interrupt handling and another for Linux operating system tasks. On a system with hyperthreading (HT), this means 2 virtual cores. It follows that the number of usable shards per node is typically Number of Cores – 2 for an HT machine and Number of Cores – 1 for a machine without HT.

It makes sense to select the number of YCSB worker threads to be a multiple of the number of shards and of the number of nodes in the cluster. For example:

An AWS i3.4xlarge instance has 16 vCPUs (8 physical cores with HT).

Scylla node shards = (vCPUs – 2) = (16 – 2) = 14

worker threads = K × Target Throughput / Throughput per Worker = K × Workers per shard × Total Shards

where

K is the parallelism factor, where K ≥ Total Workers Throughput / Target Throughput ≥ 1

Workers per shard = [ Target Throughput / Total Shards ] / Throughput per Worker

Throughput per Worker = 1 second / Expected Latency per Request

Total Workers Throughput = Total Workers × Throughput per Worker

Total Shards = Nodes × Shards per node

Shards per node = (vCPU per cluster node – 2)

Nodes = the number of nodes in the cluster

Total Workers = Total Shards × Workers per shard

Target Throughput = -target

See the example below for more details.
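As a rough sketch, the sizing arithmetic above can also be collected into a small helper (illustrative only; the function name, arguments, and defaults are ours, not part of YCSB):

# Rough helper encoding the worker-sizing arithmetic from this section.
# All inputs are assumptions you pick for your own test.
import math

def worker_threads(target_throughput, nodes, vcpus_per_node,
                   expected_latency_ms, parallelism_factor=1.0, ht=True):
    shards_per_node = vcpus_per_node - (2 if ht else 1)
    total_shards = nodes * shards_per_node
    throughput_per_worker = 1000.0 / expected_latency_ms       # ops/sec one worker can sustain
    per_shard_rate = target_throughput / total_shards          # ops/sec arriving at each shard
    workers_per_shard = math.ceil(per_shard_rate / throughput_per_worker)
    return math.ceil(parallelism_factor * workers_per_shard * total_shards)

# The example from this post: 3 x i3.4xlarge, 120,000 OPS, P99 goal of 10ms.
print(worker_threads(120_000, nodes=3, vcpus_per_node=16, expected_latency_ms=10))
# -> 1218 worker threads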

5. Number of Connections

If you use the original Cassandra drivers, you need to pick the proper number of connections per host. Scylla drivers do not require this to be configured; by default they create a connection per shard. For example, if your node has 16 vCPUs, and thus 14 shards, Scylla drivers will create 14 connections per host. An excess of connections may result in degraded latency.

We recommend using Scylla shard-aware drivers to achieve the best performance. As of this writing, shard-aware drivers are available in Python, Go, and Java; shard-aware C++ and Rust drivers are under active development.

The CQL database client protocol is asynchronous and allows queueing requests on a single connection. The default queue limit is 1024 for local keys and 256 for remote ones. The current binding implementation does not require changing this.

Both scylla.coreconnections and scylla.maxconnections define limits per node. When you see -p scylla.coreconnections=14 -p scylla.maxconnections=14, that means 14 connections per node.

Pick the number of connections per host to be divisible by the number of shards on that host.
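For instance, a tiny sketch of that divisibility check, reusing the shards = vCPUs – 2 rule from section 4 (the numbers are just the i3.4xlarge example):

# Pick a per-host connection count that is a multiple of the host's shard count.
vcpus_per_node = 16
shards_per_host = vcpus_per_node - 2             # 14 shards on an i3.4xlarge

for connections in (14, 20, 28):
    evenly = connections % shards_per_host == 0
    print(f"{connections} connections per host: "
          f"{'a multiple of' if evenly else 'NOT a multiple of'} {shards_per_host} shards")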

6. Other Considerations

Consistency level (CL) does not change the consistency model (Eventual Consistency) or its strength.

Even with -p scylla.writeconsistencylevel=ONE, the data will be written to replicas according to the table's replication factor (RF), which is usually RF = 3 by default. Using -p scylla.writeconsistencylevel=ONE only means you don't wait for all replicas to acknowledge the write. It will improve your latency picture a bit but will not affect utilization.

Remember that you can't measure Scylla's CPU utilization with normal Linux tools. Check Scylla's own metrics to see the real reactor utilization (see the reactor_utilization metric in Prometheus or Scyllatop).
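If your monitoring stack exposes Prometheus, you can also pull that metric programmatically. A hedged sketch, assuming a local Prometheus endpoint from the Scylla Monitoring stack and the scylla_reactor_utilization metric name it scrapes:

# Sketch: query reactor utilization per node from a Prometheus server that
# scrapes the cluster. The endpoint and metric name are assumptions; adjust
# them to match your own monitoring setup.
import json
import urllib.parse
import urllib.request

PROMETHEUS = "http://localhost:9090"   # assumed Scylla Monitoring Prometheus address
query = "avg by (instance) (scylla_reactor_utilization)"

url = f"{PROMETHEUS}/api/v1/query?" + urllib.parse.urlencode({"query": query})
with urllib.request.urlopen(url) as resp:
    data = json.load(resp)

for series in data["data"]["result"]:
    instance = series["metric"].get("instance", "?")
    print(f"{instance}: reactor utilization = {series['value'][1]}")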

For the best performance, it is crucial to evenly load all available shards as much as possible. You can check out this lesson in Scylla University on how to detect hot and large partitions.

Expected Performance Target

With Scylla, you can expect about 12,500 OPS per core (shard), where OPS are basic read and write operations after replication. Don't forget that usually `Core = 2 vCPU` on HT systems.

For example, if we insert a row with RF = 3, we count at least 3 writes — 1 write per replica. That is, 1 transaction = 3 operations.

The formula for evaluating performance with respect to a workload is:

OPS / vCPU = [
   Transactions * Writes_Ratio * Replication_Factor +
   Transactions * Read_Ratio   * Read_Consistency_level
   ] / [ (vCPU_per_node - 2) * (nodes) ]

where Transactions == -target parameter (target throughput).

For example, for workloada, which is 50/50 reads and writes, on a cluster of 3 i3.4xlarge nodes (16 vCPU per node) with a target of 120,000, this gives:

[ 120K * 0.5 * 3 + 120K * 0.5 * 2 (QUORUM) ] / [ (16 - 2) * 3 nodes ] =
7142 OPS / vCPU ~ 14000 OPS / Core.
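The same arithmetic as a small Python helper (a sketch of the formula above; the function and its arguments are illustrative, and the inputs are whatever mix you plan to test):

# Sketch of the OPS-per-vCPU estimate for a mixed read/write workload.
def ops_per_vcpu(target, write_ratio, read_ratio, replication_factor,
                 read_consistency_replicas, vcpus_per_node, nodes):
    writes = target * write_ratio * replication_factor          # every write hits RF replicas
    reads = target * read_ratio * read_consistency_replicas     # QUORUM reads touch 2 of 3 replicas
    usable_vcpus = (vcpus_per_node - 2) * nodes
    return (writes + reads) / usable_vcpus

# workloada (50/50), RF = 3, QUORUM reads, 3 x i3.4xlarge (16 vCPU each), target 120,000:
estimate = ops_per_vcpu(120_000, 0.5, 0.5, 3, 2, 16, 3)
print(f"{estimate:.0f} OPS / vCPU (~{estimate * 2:.0f} OPS / core)")
# -> about 7143 OPS / vCPU and ~14286 OPS / core, in line with the 7142 / ~14000 figures above.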

Example

Suppose we want to test how our cluster of 3 x i3.4xlarge nodes (16 vCPU, 8 cores with 2 threads per core, 122 GB RAM, up to 10 Gigabit networking, 3.5TB NVMe md0 array of 2 disks) behaves at 120,000 OPS.

We have 3 nodes × (16 vCPU – 2 vCPU for the system and IRQs) = 42 vCPUs in the cluster = 42 shards.

For a throughput of 120,000 OPS we can evaluate 120,000 [OPS] / 42 [shards] ≈ 2,858 [requests per shard per second]. That means each shard is going to get a request once every 1/2,858 of a second, or every 350µs. Our expectation for the 99th percentile (P99) of latency is under 10ms. Let's pick 10ms as the target for our calculation.

We target a P99 of 10ms. Let's count how many worker threads we will need to sustain 120,000 OPS with at most 10ms per request.

Worker throughput = 1000 [ms/second] / 10 [ms/op] = 100 [op/sec] per worker thread

Total threads = Total workers = 120,000 [req/sec] / 100 [req/sec/worker] = 1200 [workers]

or

Workers per shard = 2,858 [Request Per Shard / Second] / 100 [Request/Second/worker] = 29 workers per shard

Total threads = Total workers = 29 [workers/shard] × 42 [shards] = 1218 [workers]

To prepare your database for a benchmark use:

cqlsh> CREATE KEYSPACE ycsb WITH REPLICATION = {'class' : 'SimpleStrategy', 'replication_factor': 3 };
cqlsh> USE ycsb;
cqlsh> CREATE TABLE usertable (
    y_id varchar PRIMARY KEY,
    field0 varchar,
    field1 varchar,
    field2 varchar,
    field3 varchar,
    field4 varchar,
    field5 varchar,
    field6 varchar,
    field7 varchar,
    field8 varchar,
    field9 varchar);

Prepare data with:

$ bin/ycsb load scylla -s -P workloads/workloada \
    -threads 84 -p recordcount=1000000000 \
    -p readproportion=0 -p updateproportion=0 \
    -p fieldcount=10 -p fieldlength=128 \
    -p insertstart=0 -p insertcount=1000000000 \
    -p scylla.username=cassandra -p scylla.password=cassandra \
    -p scylla.hosts=ip1,ip2,ip3,...

Generate the load with:

$ bin/ycsb run scylla -s -P workloads/workloada \
    -target 120000 -threads 1200 -p recordcount=1000000000 \
    -p fieldcount=10 -p fieldlength=128 \
    -p operationcount=300000000 \
    -p scylla.username=cassandra -p scylla.password=cassandra \
    -p scylla.hosts=ip1,ip2,ip3,...

Learn More

You may also want to read our related article about testing and benchmarking database clusters. It has handy worksheets for how to plan your proof-of-concept (POC), and how to move Scylla into production. Also, you can discover more about getting the most out of Scylla at Scylla University. Feel free to register today. All the courses are free, and you will earn certificates you can post to your LinkedIn profile as you progress.
