How Rakuten Eliminated Volatile Cassandra Latencies

“With Cassandra, the latencies were extremely volatile. The exact same queries would perform dramatically different throughout the same day. This made it really difficult to commit to SLAs.”  

- Hitesh Shah, Engineering Manager at Rakuten


About Rakuten

Rakuten allows its 1.5B members to earn cash back for shopping at over 3,500 stores. Stores pay Rakuten a commission for sending members their way, and Rakuten shares that commission with its members. 

 

Rakuten’s Database Use Case

Rakuten Catalog Platform provides ML-enriched catalog data to improve search, recommendations, and other functions to deliver a superior user experience to both members and business partners. Their data processing engine normalizes, validates, transforms, and stores product data for their global operations. 

 

Rakuten’s Cassandra Challenge: Volatile Latencies

While the business was expecting this platform to support extreme growth with exceptional end-user experiences, the team was battling Cassandra’s instability, inconsistent performance at scale, and maintenance overhead. They faced JVM issues, long garbage collection pauses, and timeouts. They also learned the hard way that a single slow node can bring down the entire cluster.

When Rakuten first started building their product catalog platform, they began with Apache Cassandra. Hitesh Shah, Engineering Manager at Rakuten, notes, “We ended up selecting Cassandra because that was the most natural choice for us.” It was horizontally scalable, offered automatic data replication and automatic sharding, and was designed for fast writes.

He continued, “Cassandra is more of a column family store and that definitely made a lot of sense for us because we have these use cases around enriching the subset of the data… you can just separate the subset of the data without impacting the operations on the other columns in the same row. Speed of deployment is very easy with Cassandra. You spin up a new node and data starts replicating and traffic will get started.”

However, getting started is not the same thing as performance over time. Within a few years of corporate and data growth, Cassandra’s limits were clearly showing. Inconsistent performance. Volatile latencies.

“Let’s take a particular select statement or select query. Cassandra is returning the results in maybe in 60 or 70 milliseconds,” Hitesh offered as an example. “Then the same query will take 120, 130 milliseconds at a different time on the same day. Or it might even end up taking 140, 150, 160 milliseconds on a different day. Basically, the latencies were all over the place with Cassandra.”

Cassandra’s Java-based code, with its stop-the-world garbage collection pauses, client-side connection timeouts, and out-of-memory errors, made it difficult to make service commitments to partners and internal customers. Every distributed system is only as fast as its slowest node, and Cassandra made this all too apparent. The unpredictable behavior of Apache Cassandra, combined with the need for frequent manual intervention, drove Hitesh and his team at Rakuten to consider an alternative: ScyllaDB.

ScyllaDB is the #1 Apache Cassandra alternative. ScyllaDB provides the same CQL interface and queries, the same drivers, even the same on-disk SSTable format – but with a modern architecture designed to eliminate Cassandra performance issues, limitations, and operational barriers. ScyllaDB is built from the ground up in C++. No Java overhead. No garbage collection. And performance tuning? It’s automated.


Migrating from Cassandra to ScyllaDB

Rakuten replaced 24 nodes of Cassandra with 6 nodes of ScyllaDB. ScyllaDB now lies at the heart of their core technology stack, which also involves Spark, Redis, and Kafka. Once data undergoes ML-enrichment, it is stored in ScyllaDB and sent out to partners and internal customers. ScyllaDB processes 250M+ items daily, with a read QPS of 10k-15k per node and write QPS of 3k-5k per node.
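To put those per-node figures in cluster-wide terms, here is a rough back-of-the-envelope sketch in Python. It uses only the numbers quoted above. It assumes, purely for illustration, that the per-node QPS ranges apply evenly to all 6 nodes and that "250M+" can be treated as a lower bound of 250 million. The variable and function names are hypothetical, and the totals are estimates rather than figures reported by Rakuten.

```python
# Figures quoted in the case study.
CASSANDRA_NODES = 24
SCYLLADB_NODES = 6
READ_QPS_PER_NODE = (10_000, 15_000)   # low, high
WRITE_QPS_PER_NODE = (3_000, 5_000)    # low, high
ITEMS_PER_DAY = 250_000_000            # "250M+", treated as a lower bound (assumption)

SECONDS_PER_DAY = 24 * 60 * 60


def cluster_range(per_node, nodes):
    """Scale a (low, high) per-node range to the whole cluster.

    Assumes load is spread evenly across nodes, which is a simplification.
    """
    low, high = per_node
    return low * nodes, high * nodes


node_reduction = CASSANDRA_NODES / SCYLLADB_NODES
cluster_read_qps = cluster_range(READ_QPS_PER_NODE, SCYLLADB_NODES)
cluster_write_qps = cluster_range(WRITE_QPS_PER_NODE, SCYLLADB_NODES)
avg_items_per_second = ITEMS_PER_DAY / SECONDS_PER_DAY

print(f"Node reduction: {node_reduction:.0f}x")
print(f"Estimated cluster read QPS: {cluster_read_qps[0]:,} to {cluster_read_qps[1]:,}")
print(f"Estimated cluster write QPS: {cluster_write_qps[0]:,} to {cluster_write_qps[1]:,}")
print(f"Average items per second: {avg_items_per_second:,.0f}")
```

Under these assumptions, the 6-node cluster would serve roughly 60k-90k reads and 18k-30k writes per second, and 250M items per day averages out to about 2,900 items per second. Real traffic is rarely uniform, so peak rates are likely higher than this daily average.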

 

Rakuten’s Cassandra Migration Results

Rakuten can now publish items up to 5x faster, enabling faster turnaround for catalog changes. This is especially critical for peak shopping periods like Black Friday. They are achieving predictably low latencies, which allows them to commit to impressive internal and external SLAs. Moreover, they are enjoying 2.5x lower infrastructure costs following the 4x node reduction.

 
