
From Batch to Real-Time: How MoEngage Achieved Millisecond Personalization with ScyllaDB

How a leading customer engagement platform handles 250K writes per second at 1ms p99 latency with 200TB+ data

At MoEngage, our mission as a leading customer engagement platform is to help marketers build deep, lasting relationships with their users by processing hundreds of billions of events each month. Initially, our data architecture was built on a solid foundation of Amazon S3 for large-scale batch analytics and Elasticsearch for search. This dual system was effective for historical segmentation, but it began to buckle under the modern demand for instantaneous personalization. Our clients needed to react to user actions not in minutes, but in milliseconds: for example, sending a notification based on a product just viewed, or updating a user's segment the moment they qualify for a new campaign.

This shift exposed the fundamental limitations of our architecture. Querying a single user's recent activity in S3 was prohibitively slow, requiring massive dataset scans. At the same time, our write-heavy workload overwhelmed Elasticsearch, creating performance bottlenecks and significant operational overhead from indexing and sharding. It became clear that we couldn't just optimize our way to real-time. We needed a new, purpose-built system designed from the ground up for high-throughput ingestion and low-latency queries.

Editor's note: Karthik and Atish Andhare will be sharing their experiences at Monster SCALE Summit, a free + virtual conference on extreme-scale engineering. Learn more and access passes here.

Envisioning the Eventstore

Our solution was to build the Eventstore: a system that stores all user actions. It doesn't replace our vast S3 data lake; think of it more like a high-speed, short-term memory for user actions and events. Its sole purpose is to handle recent user activity, absorbing the constant firehose of incoming events while allowing us to instantly pull up any single user's complete activity timeline from the last 30, 60, or 90 days. This new real-time backbone lets us see a user's entire recent journey in milliseconds, a capability that was completely out of reach with our old architecture.

With this real-time backbone in place, we could finally unlock a class of product capabilities our customers were demanding, moving from theoretical concepts to tangible features. The Eventstore directly powers:

- Instantaneous Segmentation: Instead of waiting hours for a segment to update, users are added or removed the moment their behavior meets specific criteria. This ensures communications are always sent to the right audience at exactly the right time.
- True Real-Time Triggers: Campaigns can be initiated the instant a user performs a key action, such as abandoning a cart or completing a purchase. This eliminates the "lag" that made triggered messages feel disconnected from the user's immediate context.
- Hyper-Personalization at the Edge: We can now personalize messages using attributes from a user's very last action. This allows for powerful use cases like including the "last product viewed" in an email, recommending content based on the "last article read," or personalizing web content based on the "last item added to cart."
- Live User Activity Feeds: Our platform's user profile dashboard, which once showed a delayed activity history, can now display a live, up-to-the-millisecond feed of every action a user takes, giving marketers a true real-time view of their customers.

Choosing Our Engine

Our requirements were ambitious and non-negotiable: the system had to handle a write-heavy workload of at least 250,000 events per second with an average latency of 1ms and a p99 of under 10ms, and it needed to scale horizontally without any performance degradation. We evaluated several distributed databases, but ScyllaDB quickly emerged as the clear frontrunner. Its architecture, a C++ rewrite of Cassandra, is engineered for raw performance, promising to harness the full power of modern hardware and deliver the predictable, ultra-low latencies we required. A few of us also had extensive experience working with Cassandra, which made it easier to understand ScyllaDB. The ability to add nodes seamlessly to handle increasing load was the final piece of the puzzle, giving us the confidence that this was a solution that wouldn't just meet our immediate needs, but would grow with us for years to come.

Handling Multi-Tenancy

As an open platform, MoEngage serves a diverse customer base, from companies trialing our product to large enterprises with varying performance and service-level agreement (SLA) expectations. This reality meant that a one-size-fits-all approach to data storage was not viable. We could not house all customer data in a single, massive cluster, as this would risk performance degradation from "noisy neighbors" and fail to meet the distinct needs of our clients. Our multi-tenancy strategy, therefore, had to be built around workload isolation from the ground up.

Our decision was heavily influenced by two core ScyllaDB design principles. First, ScyllaDB recommends having one large table per cluster rather than many small ones to reduce metadata management overhead. Second, and more critically, it is a best practice to configure data retention with a Time-To-Live (TTL) at the table level, not at the cell level. Since our customers require different retention periods (15, 30, or 60 days), managing this at the row level within a single table would create significant overhead for compaction and tombstone management.

Based on these constraints, we chose a strategy of physical isolation using multiple, independent ScyllaDB clusters. This approach allows us to group tenants logically based on their needs. For example:

- All customers with the same retention policy (e.g., 30 days) are housed in the same cluster, allowing us to use a single, efficient table-level TTL.
- Customers who require stricter, guaranteed SLAs can be isolated in their own dedicated cluster.
- All MoEngage test accounts can be grouped into a single cluster to separate their non-production workloads.

This model provides the right balance, ensuring that the workload of one tenant group does not impact another while aligning with ScyllaDB's operational best practices for performance and data management.
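
To make the table-level TTL approach concrete, here is a minimal sketch of an events table in a cluster dedicated to 30-day-retention tenants. The endpoint, keyspace, table, and column names are illustrative rather than our exact production schema, and the `bid` bucket column is explained in the next section.

```python
# Minimal sketch: in a cluster dedicated to 30-day-retention tenants, the
# retention policy is applied once, as a table-level TTL, instead of per row.
# Endpoint, keyspace, table, and column names are illustrative; the
# "eventstore" keyspace is assumed to already exist.
from cassandra.cluster import Cluster

RETENTION_DAYS = 30  # every tenant in this cluster shares this retention policy

session = Cluster(["scylla-retention-30d.internal"]).connect()  # hypothetical endpoint

session.execute(f"""
    CREATE TABLE IF NOT EXISTS eventstore.user_events (
        uid        text,
        tenant     text,
        bid        int,
        event_time timestamp,
        event_name text,
        payload    blob,
        PRIMARY KEY ((uid, tenant, bid), event_time)
    ) WITH CLUSTERING ORDER BY (event_time DESC)
      AND default_time_to_live = {RETENTION_DAYS * 24 * 3600}
""")
```

A 15-day or 60-day cluster differs only in this one table property.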

Working with ScyllaDB Open Source

Our Large Partition Problem

One of the most critical challenges in designing our schema was avoiding the "large partition" anti-pattern in ScyllaDB. While our experiments showed that large partitions don't significantly penalize write performance, they have a significant impact on reads and compactions. We have use cases where reads cannot take advantage of the clustering key ordering and therefore have to fetch the entire partition and filter on the client side. In such cases, querying a large partition forces ScyllaDB to fetch the data from disk (if it is not in the memtable), decompress it, load it into memory, and only then return the result. This creates significant latency overhead and puts unnecessary pressure on the cluster.

With ScyllaDB's official recommendation to keep partitions under 100MB, we knew that a naive partition key like `(user_id, tenant_id)` would be a recipe for disaster, as highly active users could easily generate gigabytes of event data over their retention period. To solve this, our schema design focused on proactively breaking up potentially large partitions into smaller, consistently sized buckets. Our analysis showed an average event row size of about 1KB, meaning our 100MB target partition size could comfortably hold around 100,000 events. A simple calculation revealed that, over a 30-day retention period, any user generating more than roughly 3,300 events per day (100,000 / 30) would exceed this limit.

To prevent this, we introduced a `bucket_id` as a core component of our partition key. Our final partition key became a composite of `(uid, tenant, bid)`. The `bucket_id` acts as a split mechanism, dividing a single user's long event history into multiple, smaller physical partitions. For example, a bucket could represent a day or a week of activity, ensuring no single partition grows indefinitely. This foresight was crucial because a table's partition key cannot be changed after creation. By including the `bucket_id` in our initial schema, we built in the flexibility to define and refine our exact bucketing strategy over time, guaranteeing a healthy, performant cluster as our data scales.
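
To illustrate how the `bucket_id` keeps partitions bounded, here is a minimal sketch of day-based bucketing. The daily granularity and helper names are assumptions for the example, since the exact bucketing scheme is left flexible in our design.

```python
# Minimal sketch of day-based bucketing for the (uid, tenant, bid) partition key
# described above. Granularity, epoch, and helper names are illustrative.
from datetime import datetime, timedelta, timezone

def bucket_id(event_time):
    """Map an event timestamp (UTC) to a daily bucket number, so a single
    user's history is split across many bounded partitions."""
    return int(event_time.timestamp() // 86400)  # days since the Unix epoch

def partition_key(uid, tenant, event_time):
    """Build the composite partition key (uid, tenant, bid) for one event."""
    return (uid, tenant, bucket_id(event_time))

def buckets_for_window(days_back, now=None):
    """Buckets to query when reading a user's last N days of activity.
    A timeline read fans out over these small partitions and merges results."""
    now = now or datetime.now(timezone.utc)
    return [bucket_id(now - timedelta(days=d)) for d in range(days_back + 1)]

# Example: fetching the last 7 days of activity for one user means querying
# 8 bounded partitions (today plus the previous 7 daily buckets).
print(buckets_for_window(7))
```

Reads of a user's recent timeline then become a small, predictable fan-out across these buckets instead of a scan of one unbounded partition.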

Building for Resilience

From the very beginning, two principles were non-negotiable for the Eventstore: fault tolerance and zero data loss. The system had to withstand common infrastructure failures like node loss or disk corruption, and under no circumstances could we lose data that had been acknowledged with a success response. This commitment to durability shaped every decision we made about our cluster architecture, from data replication to physical topology.

To achieve this, we made a critical decision to use a Replication Factor (RF) of 3. This means that for every piece of data written, three copies (replicas) are stored on three different nodes in the cluster. With RF 3, we could enforce a write consistency level of `LOCAL_QUORUM`. This setting guarantees that a write operation is only considered successful after a majority of the replicas (two out of three) have confirmed the write. This simple but powerful mechanism is our guarantee against data loss; even if one node fails mid-write, the data is already safe on at least two other nodes.

Having three copies of the data is only half the battle; ensuring those copies are physically isolated is just as important. To protect against large-scale failures, we architected our clusters to be Availability Zone (AZ) aware. By using ScyllaDB's Ec2Snitch, we make the database aware of the underlying AWS infrastructure, treating each AWS AZ as a separate "rack." With this configuration, combined with the NetworkTopologyStrategy replication strategy, ScyllaDB places each of the three data replicas in a different AZ. This ensures that we can withstand the complete failure of an entire Availability Zone without any data loss or service interruption.

While this architecture provides excellent high availability against common failures, we also planned for disaster recovery scenarios, such as losing a quorum of nodes or a full region-wide outage. Since our chosen EC2 instances use ephemeral storage, our recovery strategy in these cases is to quickly bootstrap a new cluster from a previous backup. For this, we leverage ScyllaDB's native backup capabilities and our application's ability to replay messages from Kafka. Our process involves taking regular snapshots, supplemented by a continuous stream of incremental backups. Any data written between the last incremental backup and the point of outage is still available in Kafka, so by replaying those messages we can fully restore it. This combination ensures we can rebuild a cluster to a recent, consistent state, completing our comprehensive resilience strategy from minor hiccups to major outages.
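
At the keyspace and driver level, this setup comes down to a few lines. The sketch below assumes the snitch-assigned datacenter name is `us-east-1` (check `nodetool status` for the name your snitch actually reports); hosts, keyspace, and table names are illustrative.

```python
# Minimal sketch: RF 3 spread across AZ-aware racks, plus LOCAL_QUORUM writes.
# Datacenter name, hosts, and keyspace/table names are illustrative.
from datetime import datetime, timezone

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["10.0.1.10", "10.0.2.10", "10.0.3.10"])  # one seed per AZ (hypothetical)
session = cluster.connect()

# With an EC2-aware snitch, each AZ is treated as a rack, so RF 3 under
# NetworkTopologyStrategy places the three replicas in three different AZs.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS eventstore
    WITH replication = {'class': 'NetworkTopologyStrategy', 'us-east-1': 3}
""")

# A write only succeeds once a majority (2 of 3) of replicas acknowledge it.
insert = SimpleStatement(
    "INSERT INTO eventstore.user_events (uid, tenant, bid, event_time, event_name) "
    "VALUES (%s, %s, %s, %s, %s)",
    consistency_level=ConsistencyLevel.LOCAL_QUORUM,
)
session.execute(insert, ("u-123", "tenant-a", 2105, datetime.now(timezone.utc), "product_viewed"))
```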

Cluster Topology

Choosing the right database engine was only half the equation; building a resilient and performant Eventstore meant running it on the right hardware. Our workload is fundamentally I/O-bound, characterized by a relentless, high-throughput stream of writes. This reality guided our evaluation of EC2 instance types, where the choice between local NVMe storage and network-attached EBS volumes became the central decision point.

After a thorough analysis, we followed ScyllaDB's strong recommendation and opted for storage-optimized i-series instances with local NVMe SSDs. While we considered memory-optimized instances with EBS, they proved unsuitable for our write-heavy needs. High-performance `io2` EBS volumes were prohibitively expensive at our scale, and the more affordable `gp3` volumes could not guarantee the p99 latencies we required and introduced the risk of throttling during traffic bursts. AWS's own guidance suggests EBS is better suited to read-heavy workloads, the exact opposite of our profile. Local NVMe storage, by contrast, delivers the sustained, sub-millisecond I/O performance essential for our ingestion pipeline. Specifically, we selected the i3en instance family, which provides an excellent balance of vCPU, RAM, and the large, fast storage capacity needed to meet our heavy data retention requirements.

Our approach to capacity planning is therefore not a one-time calculation but a dynamic process tied directly to our multi-tenant cluster strategy. The size and configuration of each physical cluster are determined by the specific workload of the tenants it houses. We carefully model capacity based on four key variables for each tenant:

1. The number of active users.
2. The average number of actions per user per day.
3. Their specific data retention policy (e.g., 15, 30, or 60 days).
4. The overall write and read traffic patterns.

This allows us to right-size each cluster for its intended workload, ensuring performance and cost-efficiency across our entire infrastructure.

Compaction Strategy

A critical factor in managing the total cost of ownership for a large-scale database is controlling disk space amplification. Open Source ScyllaDB's default Size-Tiered Compaction Strategy (STCS) requires keeping nearly 50% of disk space free for compaction operations, which would have effectively doubled our storage costs. We also experimented with the Leveled Compaction Strategy, but that too required an additional 50% of disk space during initial bootstrapping. ScyllaDB Enterprise offers the highly efficient Incremental Compaction Strategy (ICS), which reduces this overhead to around 20%, but it comes with a significant license fee.

Our Operational Challenges

Cluster Management

Our initial capacity planning pointed us toward the i3en.3xlarge instance type (12 vCPUs, 96GB RAM, and a 7.5TB NVMe drive). To ensure low latency for our global customer base, we deployed one ScyllaDB cluster in each of our three primary AWS regions. In total, our footprint grew to approximately 50 nodes across these clusters.

ScyllaDB provides region-specific, production-ready AMIs that simplify the deployment process. Our deployment workflow followed a structured path: provisioning nodes, configuration, security, and RBAC, followed by onboarding the cluster into our internal monitoring stack. Because ScyllaDB's AMIs are self-contained, scaling out theoretically meant launching a new node and letting it complete the automated bootstrap process.

Things ran smoothly until we encountered a surge in customer data in one of our regions. As disk utilization climbed (we were still using STCS), we followed our runbook and added a new node to the cluster. However, this expansion revealed two critical operational hurdles.

First, during our POC, a 4TB node bootstrap typically took 18 hours using vNodes. In the live production environment, this window stretched significantly: bootstrapping took anywhere from 24 to 36 hours. In a high-growth environment, a 1.5-day lead time for scaling is a lifetime.

The second issue surfaced when an on-call engineer noticed that disk usage on the newly joining node was hitting 90%. This was counterintuitive: why was a joining node, which hadn't even finished taking its share of the data, running out of space? Our investigation revealed that the culprit was RESHAPE compactions. When a new node joins, ScyllaDB reshapes the data to fit the new shard distribution, and this process creates temporary data overhead. After researching similar issues reported in the community, we identified a temporary fix to get the node back into service:

- Allow the node to initiate the bootstrap.
- The moment the RESHAPE compaction begins, manually pause it.
- Let the node finish joining the ring to provide immediate capacity.

ICS

Our initial experiences led us to a conservative rule of thumb: we only felt safe onboarding new nodes once disk usage on the existing nodes reached between 40% and 45%. This buffer was a technical necessity to accommodate a 2.4x worst-case space amplification during RESHAPE compactions while bootstrapping.

We saw a glimmer of hope when we discovered ScyllaDB's Incremental Compaction Strategy (ICS). After discussing ICS with the ScyllaDB team, we realized we had been looking at our space amplification issues through an outdated lens. The technical shift offered by ICS is profound: it uses a default fragment size of 1GB, meaning a single compaction job typically requires a maximum of only about 2GB of disk overhead. To put that into perspective for our specific setup, the old methodology required keeping nearly 50% of a 6.9TB node free to handle heavy compaction cycles safely. Under ICS, that same 6.9TB node with 12 shards would only experience roughly 110GB of overhead during compactions. This shift creates massive headroom, allowing us to move away from capping nodes at 45% capacity and to safely utilize over 80% of the disk. By drastically minimizing space amplification, ICS has effectively doubled our storage efficiency without compromising performance during critical node operations.
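
As a rough sketch of what switching a table to ICS can look like once you are on a release that supports it (we made this change as part of the Enterprise migration described in the next section), the change is a per-table compaction setting. The table name and the explicit `sstable_size_in_mb` value are illustrative, and option names should be verified against the documentation for your ScyllaDB version.

```python
# Rough sketch: switching an existing table to ICS (available in ScyllaDB Enterprise).
# The table name is illustrative; verify option names and defaults for your version.
from cassandra.cluster import Cluster

session = Cluster(["scylla-enterprise.internal"]).connect()  # hypothetical endpoint

# sstable_size_in_mb of 1000 mirrors the ~1GB default fragment size discussed above.
session.execute("""
    ALTER TABLE eventstore.user_events
    WITH compaction = {
        'class': 'IncrementalCompactionStrategy',
        'sstable_size_in_mb': 1000
    }
""")
```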

Our Move to ScyllaDB Enterprise

Our journey toward ScyllaDB Enterprise began with a rigorous proof of concept designed to validate three core pillars: performance, reliability, and operational efficiency. We needed to ensure that the Enterprise edition could not only handle our existing throughput but also provide an edge in performance and cluster operations.

To validate these objectives without risking production stability, we deployed a parallel ScyllaDB Enterprise cluster. This environment supported dual writes, allowing us to mirror data from our existing Open Source Software (OSS) cluster to the new Enterprise setup in real time. This side-by-side comparison was instrumental in proving the superiority of the new architecture.

The most significant architectural shift involved moving to i3en.6xlarge nodes. These powerful instances, equipped with 24 vCPUs, 192GB of memory, and two 7.5TB NVMe drives, allowed us to dramatically consolidate our infrastructure. By leveraging these denser nodes, we were able to shrink our total node count to just one-third of the original OSS cluster size, significantly reducing the complexity of our distributed network. Alongside this hardware upgrade, we transitioned our tables to the Incremental Compaction Strategy (ICS) to better manage disk space.

Following a successful "soak-in" period in which the Enterprise cluster met all performance benchmarks, we executed a structured four-step migration: we first upgraded our OSS environment to ScyllaDB Enterprise 2024.1, followed by the systematic migration of tables to ICS. Once the tables were optimized, we began downsizing the legacy OSS cluster and finalized the transition by onboarding the entire environment to ScyllaDB Manager for centralized management and automated maintenance.
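
At the application layer, the dual-write setup was conceptually simple. The sketch below assumes two driver sessions, with the OSS cluster remaining the system of record and writes to the Enterprise cluster treated as best-effort; endpoints, the mirrored statement, and the error-handling policy are illustrative rather than our exact pipeline.

```python
# Minimal sketch of dual writes during the OSS -> Enterprise evaluation.
# Endpoints, the mirrored statement, and the "best effort" policy for the
# new cluster are assumptions for illustration only.
import logging
from datetime import datetime, timezone

from cassandra.cluster import Cluster

oss_session = Cluster(["scylla-oss.internal"]).connect("eventstore")         # system of record
ent_session = Cluster(["scylla-enterprise.internal"]).connect("eventstore")  # cluster under evaluation

INSERT_CQL = (
    "INSERT INTO user_events (uid, tenant, bid, event_time, event_name) "
    "VALUES (%s, %s, %s, %s, %s)"
)

def write_event(uid, tenant, bid, event_name):
    params = (uid, tenant, bid, datetime.now(timezone.utc), event_name)
    # The primary write must succeed for the event to be acknowledged.
    oss_session.execute(INSERT_CQL, params)
    # Mirror to the Enterprise cluster; failures are logged rather than surfaced,
    # so the evaluation cluster can never impact production writes.
    try:
        ent_session.execute(INSERT_CQL, params)
    except Exception:
        logging.exception("dual-write mirror to Enterprise cluster failed")

write_event("u-123", "tenant-a", 2105, "product_viewed")
```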

Lessons Learned

Schema Design Is Paramount

The most important aspect of using ScyllaDB is getting your schema right. It's not just the data model: choices such as replication factor, TTL, partition size, and compaction strategy dictate how ScyllaDB performs in production.

Adding Nodes & Removing Nodes Take Longer

As your data size grows, adding and removing nodes becomes much slower with ScyllaDB's legacy vNode-based replication. Make sure you are monitoring everything and plan these activities well ahead of time. One thing we learned is that, while these operations are slow, they don't significantly impact query or write latencies during the maintenance window.

POC != Production

No matter how hard you try to anticipate and simulate issues in a POC, your production system will always surprise you.

Our Next Steps with ScyllaDB

Our journey with the Eventstore has fundamentally transformed our real-time capabilities, but we're always looking ahead to the next evolution of our architecture. One of the most exciting developments on our roadmap involves a powerful new feature in the latest versions of ScyllaDB: tablets.

While our multi-cluster topology provides excellent isolation, it still requires us to plan capacity for peak workloads. In a multi-tenant world, traffic can be unpredictable. A single customer launching a wildly successful campaign can create a sudden performance hotspot on a specific set of nodes, even if the rest of the cluster has ample storage and spare compute capacity. Manually rebalancing or adding nodes to handle these temporary spikes is a significant operational challenge. This is where tablets change the game.

By breaking the token ring down into smaller, movable units of data, tablets decouple data partitions from specific physical nodes. Instead of a partition being permanently owned by a fixed set of nodes, a tablet can be automatically moved to a different node to balance load in real time. For us, this unlocks the holy grail of database management: true elastic scaling. When a traffic hotspot emerges, ScyllaDB can automatically rebalance the cluster by shifting tablets away from overloaded nodes to those with spare capacity. This will allow us to absorb sudden traffic surges gracefully, ensuring consistent performance for all tenants without manual intervention or costly overprovisioning. It's the key to providing on-demand compute for our customers' biggest moments, ensuring the Eventstore remains a robust and highly elastic foundation for the future of real-time engagement at MoEngage.
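
When we adopt tablets, we expect the switch to be largely declarative, since recent ScyllaDB releases expose tablets as a per-keyspace option. The sketch below is illustrative only; the exact syntax, defaults, and supported options depend on the ScyllaDB version in use.

```python
# Rough sketch: enabling tablets on a keyspace in a tablets-capable ScyllaDB
# release. Option names/values are illustrative; check the docs for your version.
from cassandra.cluster import Cluster

session = Cluster(["scylla-tablets.internal"]).connect()  # hypothetical endpoint

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS eventstore
    WITH replication = {'class': 'NetworkTopologyStrategy', 'us-east-1': 3}
    AND tablets = {'enabled': true}
""")
```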