
Async Rust in Practice: Performance, Pitfalls, Profiling

 

It’s been a while since Scylla Rust Driver was born during ScyllaDB’s internal developer hackathon.

Since then, its development and adoption accelerated a lot. We’ve added many new features and published a couple of releases on crates.io. Along the way, we also stumbled upon a few interesting performance bottlenecks to investigate and overcome — read on for more details.

First Issue Arises

A few weeks ago, an interesting issue appeared on our GitHub tracker. It was reported that, despite our care in designing the driver to be efficient, it proved to be unpleasantly slower than one of the competing drivers, cassandra-cpp, which is a Rust wrapper of a C++ CQL driver. The author of latte, a latency tester for Apache Cassandra (and Scylla), pointed out that switching the back-end from cassandra-cpp to scylla-rust-driver resulted in an unacceptable performance regression. Time to investigate!

Reproducing the Problem

At first, we were unable to reproduce the issue – all experiments seemed to prove that scylla-rust-driver is at least as fast as the other drivers, and often provides better throughput and latency than all the tested alternatives.

A breakthrough came when we compared the testing environments, which should have been our first step from the start. Our driver indeed suffered from reduced performance, but it was only observed on a fully local test – where both the driver and the database node resided on the same machine. Our tests try hard to be as close to production environments as possible, so they always run in a distributed environment. Nonetheless, using a local setup turned out to have advantages too, because it’s a great simulation of a blazingly fast network – after all, loopback has very impressive latency characteristics!

Profiling

After we were able to reliably reproduce the results, it was time to look at profiling results – both the ones provided in the original issue and the ones generated by our tests.

Brendan Gregg’s flamegraphs are indispensable for performance investigations. What’s even better is that the Rust ecosystem already has fantastic support for generating flamegraphs integrated into the build system: cargo-flamegraph.

Using cargo-flamegraph is as easy as running the binary:

cargo flamegraph your-app your-params

and it produces an interactive flamegraph.svg file, which can then be browsed to look for potential bottlenecks. Interpreting flamegraphs is explained in detail in the link above, but a rule of thumb is to look for operations that take up the majority of the total width of the graph, since the width indicates the time spent executing a particular operation.

A flamegraph generated from one of the test runs shows that our driver indeed spends an unnerving amount of total CPU time on sending and receiving packets, with a fair part of it being spent on handling syscalls.

Flamegraph recorded before the fix

The issue’s author was also kind enough to provide syscall statistics from both test runs: those backed by cassandra-cpp and those backed by our driver. The conclusion from the statistics was clear: scylla-rust-driver issued at least one syscall per query, which might be the source of the elevated latency. With a super-fast network (of which loopback is a prime example), it also means that throughput suffers: if each request costs 1 ms and requests are handled one after another, we won’t be able to send more than 1000 requests per second.

Finally, latte reports the CPU time used as part of its output. scylla-rust-driver tended to cause twice as much CPU usage as the original back-end based on cassandra-cpp. That fits perfectly with the elevated number of syscalls, which need CPU time to be handled.

Hint

The Rust ecosystem makes it easy to test small changes introduced in your project’s dependencies, which was invaluable when comparing the various fixes applied to scylla-rust-driver without having to publish anything on crates.io. If you want to try out a particular change to one of your dependencies before it is published (or even your own fork, with experimental changes applied yourself!), you can simply point Cargo.toml at a git repository:

scylla = { git = "https://github.com/scylladb/scylla-rust-driver", branch = "some_custom_branch" }
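If the crate you want to override is a transitive dependency rather than one you list directly, Cargo’s [patch] section can redirect it for the whole dependency graph. A minimal sketch (the crate and branch names are illustrative) for testing an unreleased change to the futures crates could look like this:

[patch.crates-io]
futures-util = { git = "https://github.com/rust-lang/futures-rs", branch = "master" }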

Root Cause №1

Ultimately, the root cause of the original issue was our lack of buffering of reads and writes.

Our driver manages requests internally by queueing them into a per-connection router, which is responsible for taking the requests from the queue, sending them to the target node, and reading the responses asynchronously. In the original implementation, neither sending requests nor receiving responses used any kind of buffering, so each request was sent/received as soon as it was popped from the queue. That translates to issuing a system call for each request and response. While it’s a source of some CPU overhead, it was not observed to be an issue in distributed environments, because network latency hid the extra time each request spent being processed. However, a local setup quickly showed that this overhead is not negligible at all.

Solving this problem was conceptually very simple. Tokio, our runtime of choice, offers ready-to-use wrappers for buffering input and output streams: BufReader and BufWriter. The wrappers expose an API compatible with the streams they wrap, so they’re basically drop-in replacements. (As a drop-in replacement for Cassandra, we at ScyllaDB especially appreciate such qualities!)
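To illustrate the idea (this is a minimal sketch of the technique, not the driver’s actual code), here is how Tokio’s BufReader and BufWriter can wrap the two halves of a TCP connection; the 9-byte header and the request bytes are just placeholders:

use tokio::io::{AsyncReadExt, AsyncWriteExt, BufReader, BufWriter};
use tokio::net::TcpStream;

// Minimal sketch: buffer both halves of a connection so that many small
// frames are coalesced into far fewer read/write syscalls.
async fn connect_buffered(addr: &str) -> std::io::Result<()> {
    let stream = TcpStream::connect(addr).await?;
    let (read_half, write_half) = stream.into_split();

    let mut reader = BufReader::new(read_half);
    let mut writer = BufWriter::new(write_half);

    // Writes land in the in-memory buffer; flush() decides when the syscall happens.
    writer.write_all(b"serialized request frame").await?;
    writer.flush().await?;

    // Reads are served from the buffer whenever possible.
    let mut header = [0u8; 9];
    reader.read_exact(&mut header).await?;

    Ok(())
}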

After the fix was applied, its positive effects were immediately visible in the flamegraph output – feel free to compare the graph below with the original flamegraph:

It’s clear that scylla-rust-driver spent considerably less time on syscalls — in fact, the bar representing sendmsg is now too narrow to locate with the naked eye.

Root Cause №2: a Pitfall in Async Rust

That’s not the end of the story at all! In fact, the most interesting bit was uncovered later, after the first fix was already applied.

We hadn’t even properly celebrated our win against syscalls when another performance issue was posted, again by our resourceful performance detective, the author of latte. This time, it turned out that raising the concurrency in the tool reduced performance, and the effect was seemingly observed only when using our driver as the back-end.

Quadratic Behavior?

Yes, an experiment performed by one of our engineers hinted that using a combinator for Rust futures, FuturesUnordered, appears to cause a quadratic rise in execution time compared to the same workload expressed without the combinator, using Tokio’s spawn utility directly.

FuturesUnordered is a neat utility that allows the user to gather many futures in one place and await their completion. Since FuturesUnordered was also used in latte, it became the candidate for causing this regression. The suspicion was confirmed after trying out a modified version of latte that did not rely on FuturesUnordered.
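To make the two approaches concrete, here is a minimal sketch of both patterns; run_query is a hypothetical stand-in for issuing a single request, not anything from latte or the driver:

use futures::stream::{FuturesUnordered, StreamExt};

// Hypothetical stand-in for sending one query and awaiting its response.
async fn run_query(i: usize) -> usize {
    i
}

// Pattern 1: gather all in-flight requests in a FuturesUnordered and poll
// them from a single task, handling completions in whatever order they finish.
async fn with_futures_unordered(concurrency: usize) {
    let mut in_flight = FuturesUnordered::new();
    for i in 0..concurrency {
        in_flight.push(run_query(i));
    }
    while let Some(_result) = in_flight.next().await {
        // handle a completed request, possibly push a new one
    }
}

// Pattern 2: spawn each request as a separate Tokio task and await the handles.
async fn with_spawn(concurrency: usize) {
    let handles: Vec<_> = (0..concurrency)
        .map(|i| tokio::spawn(run_query(i)))
        .collect();
    for handle in handles {
        let _result = handle.await.expect("task panicked");
    }
}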

The Pitfall

In order to fully grasp the problem, one needs to understand how Rust async runtimes work.

Tokio’s article on that topic is a great read: https://tokio.rs/blog/2020-04-preemption, but I’ll also summarize its contents here.

In async Rust, one task can starve the others if it keeps polling futures in a loop and those futures always happen to be ready due to sufficient load.

When that happens, the loop just keeps handling futures without ever giving control back to the runtime. In Tokio (and other runtimes), control can be given back explicitly by calling yield_now, but the runtime itself is not able to force an await to become a yield point.
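As a contrived illustration of the hazard (poll_ready_work is just a placeholder for any future that is almost always ready under load), an explicit yield_now call is what gives the scheduler a chance to run other tasks:

use tokio::task::yield_now;

// Placeholder for work that is almost always immediately ready under load,
// e.g. reading from a channel that is never empty.
async fn poll_ready_work() -> u64 {
    42 // completes without ever returning Pending
}

async fn drain_forever() {
    loop {
        let _item = poll_ready_work().await; // never suspends on its own
        // Without this explicit yield point (and without Tokio's budget trick
        // described below), the loop could starve other tasks on this worker.
        yield_now().await;
    }
}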

In order to avoid starving other tasks, Tokio resorted to a neat trick: each task is assigned a budget, and once that budget is spent, all resources controlled by Tokio start returning a “pending” status even though they might be ready, in order to force the out-of-budget task to yield.

That sounds perfect, but the solution comes with a price. Rust offers many convenient utilities and combinators for futures, and some of them maintain their own scheduling policies that may interfere with the semantics described above. In particular, FuturesUnordered is one such scheduler: it maintains its own list of ready futures that it iterates over when it is polled.

The combination of Tokio’s preemptive scheduling trick and FuturesUnordered’s implementation is the heart of the problem: FuturesUnordered keeps a list of futures ready for polling and assumes that once polled, those futures will not need to be polled again. Since such futures are not polled more than once while on the “ready” list, the amortized time needed to serve them all is constant. However, once the budget is spent, Tokio may force such “ready” futures to return a “pending” status! The amortization is gone, and it’s now entirely possible (and observable) that FuturesUnordered will iterate over its whole list of underlying futures each time it is polled, effectively making the execution time quadratic with respect to the number of futures stored in FuturesUnordered.

The Solution

Since FuturesUnordered is part of Rust’s futures crate, the issue was reported there directly: https://github.com/rust-lang/futures-rs/issues/2526. It was recognized and triaged very quickly by one of the contributors; the response time was really impressive!

One of the suggested workarounds was to wrap the task in the tokio::task::unconstrained marker, which effectively turns off cooperative scheduling in Tokio and thus removes one of the necessary conditions for the regression to appear. However, cooperative scheduling is ultimately a good thing: we definitely want to avoid starving tasks, and we want to keep the latencies of our projects low.
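For completeness, this is roughly what the workaround looks like; it is a sketch of the escape hatch, not a recommendation:

// Opt a single future out of Tokio's cooperative budgeting. Futures awaited
// inside are never forced to return "pending" just because the budget ran out,
// which also means they can starve other tasks on the same worker thread.
async fn run_without_budget() {
    tokio::task::unconstrained(async {
        // ... the latency-critical work that was hitting the pathology ...
    })
    .await;
}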

As a result, a proper fix was posted on the same day and is already part of an official release of the futures crate, 0.3.19. The fix is a simple yet effective amendment to the FuturesUnordered code: the number of futures that can be iterated over in a single poll is now capped at a constant, 32. It is a very nice compromise between turning off cooperative scheduling altogether and spawning each task separately instead of combining them into FuturesUnordered: cooperative scheduling can still properly fight starvation by making certain resources artificially return a “pending” status, while FuturesUnordered no longer assumes that all of the futures listed as “ready” will indeed be ready; after going through 32 of them, control is given back to the runtime.

Summary

The world of async programming in Rust is still young, but very actively developed. Investigating and getting rid of bottlenecks and pitfalls is a very useful skill, so don’t hesitate to join the effort, e.g. by becoming a contributor to our brand-new native Scylla driver: https://github.com/scylladb/scylla-rust-driver

Our Rust driver and our general involvement in the Rust ecosystem were core topics of Scylla Summit 2022. You can watch the sessions on demand now.