SAS Institute Changing All Four Tires While Driving an AdTech Engine at Full Speed

SAS Intelligent Advertising changed its ad-serving platform's real-time visitor data storage from DataStax Cassandra clusters to ScyllaDB clusters. This presentation describes how the migration was executed with no downtime and no loss of data, even as data was constantly being created and updated.

In the above video, David Blythe explains how his team migrated from DataStax to ScyllaDB for their SAS 360 Match platform. He called this transition “changing all four tires while driving an adtech engine at full speed.”

SAS Customer Intelligence

David works for SAS Customer Intelligence (CI). The team provides services to digital marketers and web publishers to support A/B testing, product recommendations, and customer engagement tracking. It also includes ad-serving services so that their customers can control which ads appear on their own sites, apps, and services. Available across all channels — mobile, web, video and e-mail — SAS CI allows for creation of complex business rules for proper targeting and exclusion (for example, limiting to certain geographic areas), frequency/exposure, and competitive exclusion (to prevent competitors from appearing on the same page). The system supports billions of advertising decisions monthly.

SAS CI uses ScyllaDB as a key-value store. The partition key is the user’s visitor ID. The value consists of serialized data that, for example, contains embedded collections. The record may include some static attributes, such as the visitor’s gender and interests, as well as analytical data for product recommendations. On top of that is real-time data, which has to be updated constantly, such as recent decisions the visitor has made.
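
As a rough illustration only (the field names here are hypothetical, not SAS's actual schema), the record described above might look like this in C++ terms before serialization:

#include <map>
#include <string>
#include <vector>

// Hypothetical shape of a visitor record; in the database the value is a
// single serialized blob keyed by the visitor ID, as described above.
struct VisitorRecord
{
    std::string visitor_id;                        // partition key
    std::string gender;                            // static attribute
    std::vector<std::string> interests;            // static attributes
    std::map<std::string, double> product_scores;  // analytical data for recommendations
    std::vector<std::string> recent_decisions;     // real-time data, updated constantly
};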

This information is read from the database at the start of a visitor session and held in memory to maintain state for the session's duration. Updates can be written back asynchronously, so write performance is not a major concern. Read performance, however, is critical: when a visitor first hits a website, there should be as little delay as possible before the initial page renders.

To support global business, SAS CI is deployed in clusters across four AWS regions, each in a different country, co-located with SAS Intelligent Decisioning servers, with hundreds of millions of rows in each cluster.

Encapsulation is Key

For David’s team, it was not just a matter of moving to ScyllaDB. At SAS Institute, developers use an abstract, object-oriented internal API that masks the back-end NoSQL database. This lets them maintain multiple database implementations while keeping business-level functions insulated from changes to the underlying data systems. On a practical level, this means they do not write raw CQL; instead, CQL calls are generated by their abstraction layer.

They designed a multi-tenant system that holds a concrete class per customer. Each instance is held behind a C++ mutex (short for “mutual exclusion,” a lock that ensures only one thread at a time accesses shared data) so that the tenant instance can be changed on the fly through configuration. (Any contention on the mutex is tolerable because updates occur only rarely, usually during migration between services.)
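
A minimal sketch of that pattern, using names not in the talk (the NoSqlService interface itself is shown just below):

#include <memory>
#include <mutex>

class NoSqlService;  // the abstract interface, shown below

// Hypothetical per-tenant holder; the class and method names are illustrative.
class TenantServiceSlot
{
public:
    // Called rarely, e.g. when configuration switches a tenant to a new backend.
    void swap(std::shared_ptr<NoSqlService> replacement)
    {
        std::lock_guard<std::mutex> lock(mutex_);
        service_ = std::move(replacement);
    }

    // Called per request; the brief lock is tolerable because swaps are rare.
    std::shared_ptr<NoSqlService> current()
    {
        std::lock_guard<std::mutex> lock(mutex_);
        return service_;
    }

private:
    std::mutex mutex_;
    std::shared_ptr<NoSqlService> service_;
};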

#include <ctime>
#include <map>
#include <string>

class NoSqlConnection;

class NoSqlService
{
public:
    virtual ~NoSqlService() { }

    virtual NoSqlConnection* getConnection() { return NULL; }
    virtual void returnConnection(NoSqlConnection *) { }
};

class NoSqlConnection
{
public:
    virtual ~NoSqlConnection() { }

    virtual int get(size_t timeout_ms, const std::string &customer, const std::string &id, const std::string &data_type, std::string &data_value) = 0;
    virtual int getMultiple(size_t timeout_ms, const std::string &customer, const std::string &id, std::map<std::string,std::string> &data) = 0;
    virtual int update(size_t timeout_ms, const std::string &customer, const std::string &id, const std::string &data_type, const std::string &data_value, time_t ttl) = 0;
    virtual int remove(size_t timeout_ms, const std::string &customer, const std::string &id, const std::string &data_type) = 0;
};

An example of SAS Institute’s encapsulation code, showing two classes. The first defines a NoSQL service; the second defines a database connection, with methods for reads (get and getMultiple), writes (update), and deletes (remove).
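
The talk doesn't show calling code, but to make the shape of the API concrete, here is a usage sketch under stated assumptions: a configured service instance, a return value of 0 meaning success, and an illustrative "profile" data type with invented timeouts and TTL. It follows the pattern described earlier: a synchronous read at session start, with updates written back afterward.

#include <string>

// Illustrative only; assumes a configured NoSqlService and that 0 == success.
void handleVisitorSession(NoSqlService &service, const std::string &customer, const std::string &visitor_id)
{
    NoSqlConnection *conn = service.getConnection();

    // Session start: one synchronous read; this is the latency-critical path.
    std::string profile;
    if (conn->get(50 /* timeout ms, an assumption */, customer, visitor_id, "profile", profile) == 0) {
        // ... make ad decisions against the in-memory profile ...
    }

    // Write back updated state. In production this happens asynchronously,
    // since write latency is not on the critical path.
    conn->update(500, customer, visitor_id, "profile", profile, 30 * 24 * 3600 /* ttl: 30 days, an assumption */);

    service.returnConnection(conn);
}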

This abstraction lets SAS plug in multiple types of data management systems: a memory-based implementation (no actual back-end database; used for automated unit testing), a flat-file implementation (for testing on local machines, so tests do not need a live cluster or virtual machine), and the production implementations. The first production implementation was the original open source Cassandra/Thrift one, adopted in 2010; they stuck with that interface until 2014, then moved to a DataStax/CQL implementation in 2015. In 2018 they changed their schema significantly, yet were able to keep running in the same cluster.
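
The memory-based implementation isn't shown in the talk, but given the interface above it could be as small as the following sketch: a hypothetical, single-threaded backend that ignores timeouts and TTLs, of the kind that makes unit tests possible without a live cluster.

#include <map>
#include <string>

// Hypothetical in-memory backend; illustrative, not SAS's actual code.
class MemoryNoSqlConnection : public NoSqlConnection
{
public:
    int get(size_t, const std::string &customer, const std::string &id, const std::string &data_type, std::string &data_value) override
    {
        // A missing row simply yields an empty value.
        data_value = rows_[customer + "/" + id][data_type];
        return 0; // assume 0 == success
    }

    int getMultiple(size_t, const std::string &customer, const std::string &id, std::map<std::string,std::string> &data) override
    {
        data = rows_[customer + "/" + id];
        return 0;
    }

    int update(size_t, const std::string &customer, const std::string &id, const std::string &data_type, const std::string &data_value, time_t /* ttl ignored */) override
    {
        rows_[customer + "/" + id][data_type] = data_value;
        return 0;
    }

    int remove(size_t, const std::string &customer, const std::string &id, const std::string &data_type) override
    {
        rows_[customer + "/" + id].erase(data_type);
        return 0;
    }

private:
    // Keyed by customer/id, then by data type.
    std::map<std::string, std::map<std::string,std::string>> rows_;
};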

It was around then that they first heard about ScyllaDB. The initial attraction was obvious: SAS Institute hoped to improve their bottom line by saving money. And though they were satisfied with their existing performance (reads averaged around 10 milliseconds, higher under load), they welcomed the prospect of improving it.

Seamless Migration

Their switchover was seamless. “Being able to switch to ScyllaDB involved no code changes whatsoever,” David observed. “I have to say, this is one of the rare times in all the years that I have been doing software where vendors make promises about compatibility where it actually did not require us to make any code changes.”

David’s team was still using the same open source C++ driver and sending the same CQL queries. However, there still needed to be a data migration. The key for David was “How do you maintain 24/7 access while you’re doing the migration? Because we have to be able to read the old data. And we have to be able to update the data as visitors are on the website. We have to be able to save this stuff back out. And we have to be able to finish the migration and decommission the old vendor or the old schema in a reasonable amount of time.”

The way SAS Institute’s team handled this was by creating a “migrating” implementation of their NoSql API. It wraps both the old and the new schema/vendor services. Here’s what that code looked like:

class MigratingNoSqlService : public NoSqlService
{
public:
    MigratingNoSqlService(NoSqlService &old_service, NoSqlService &new_service);

    NoSqlConnection* getConnection() {
        return new MigratingNoSqlConnection(old_service.getConnection(), new_service.getConnection());
    }
};

class MigratingNoSqlConnection : public NoSqlConnection
{
public:
    MigratingNoSqlConnection(NoSqlConnection *old_service_connection, NoSqlConnection *new_service_connection);

    int get(size_t timeout_ms, const std::string &customer, const std::string &id, const std::string &data_type, std::string &data_value);
    // etc...
};

David explained, “The MigratingNoSqlService itself doesn’t really need to know anything about the old or new service other than just what’s the reference to it. The migrating service takes in the constructor a reference to the old service [and] it takes a reference to the new service. When you ask it for a new connection, it wraps the connections it gets from those services to create a migrating connection. And then the migrating connection has the same methods as our API. So our application logic is totally oblivious about what’s going on under the covers. All we have to do is switch a tenant over to using this migrating service.”
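
Wiring this up for a tenant could then be as small as the following sketch (illustrative, not SAS's actual code; TenantServiceSlot is the hypothetical mutex-guarded holder sketched earlier):

#include <memory>

// Wrap both backends and swap the tenant's slot to the migrating service at
// runtime. Application threads keep calling the same NoSqlConnection methods,
// oblivious to the two services underneath.
void switchTenantToMigration(TenantServiceSlot &tenant_slot, NoSqlService &old_backend, NoSqlService &new_backend)
{
    tenant_slot.swap(std::make_shared<MigratingNoSqlService>(old_backend, new_backend));
}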

They’ve now made three major migrations, trying a different method each time: lazy writing, lazy reading, and, finally, something David described as “aggressive lazy reading.” The third method, used for the ScyllaDB migration, proved the most effective.

Lazy Writing Strategy

With lazy writing, reads always delegate to the old service, while writes delegate to both the old and the new services.

...
int get(size_t timeout_ms, const std::string &customer, const std::string &id, const std::string &data_type, std::string &data_value) {
    return old_service.get(timeout_ms, customer, id, data_type, data_value);
}

int update(size_t timeout_ms, const std::string &customer, const std::string &id, const std::string &data_type, const std::string &data_value, time_t ttl) {
    old_service.update(timeout_ms, customer, id, data_type, data_value, ttl);
    return new_service.update(timeout_ms, customer, id, data_type, data_value, ttl);
}
...

Example C++ code for migration using lazy writing. The get method returns data from the old service only, while update writes to both the old and the new services.

Under this strategy, the old service retains all the data, while the new service accumulates only new and updated data. The old service is decommissioned once the new service is deemed to have “enough” data to take over, which makes the approach appropriate for data with a short shelf life (data that changes rapidly or carries a TTL). As David explained, “Which in our case is okay. If we let this run for thirty days, or sixty days, the new service would accumulate whatever data was associated with all the visitors that were on the site during those thirty or sixty days.”

The advantage of this strategy was its simplicity. The obvious disadvantage was that it sacrificed data for visitors who were not active during the migration period. It also prolonged the time to decommission the old service, since the new system first had to accumulate “enough” data to take over.

Lazy Reading Strategy

This strategy delegates reads to the new service first; if no data is found, it falls back to the old service. Writes go only to the new service.

...
int get(size_t timeout_ms, const std::string &customer, const std::string &id, const std::string &data_type, std::string &data_value) {
    int status = new_service.get(timeout_ms, customer, id, data_type, data_value);
    if (data_value.empty())
        status = old_service.get(timeout_ms, customer, id, data_type, data_value);
    return status;
}

int update(size_t timeout_ms, const std::string &customer, const std::string &id, const std::string &data_type, const std::string &data_value, time_t ttl) {
    return new_service.update(timeout_ms, customer, id, data_type, data_value, ttl);
}
...


Example C++ code for a lazy reading migration service. The get method first reads from the new service; if the data value is empty, it then reads from the old service. Updates go only to the new service.

Again, you can only decommission the old service once the new one has “enough” data to take over. While this offers generally the same advantages and disadvantages as the lazy writing strategy, it has the added disadvantage of latency, from potentially requiring two reads. In SAS Institute’s situation this mattered, since fast reads are vital to system responsiveness.

Aggressive Lazy Reading Strategy

This is similar to the lazy reading strategy. It reads from the new service first, and if no data is present, reads from the old system. Writes are only performed to the new service.

However, what’s different is that if no data is found on the new system, their system kicks off a Python script to walk the old service’s keyspace and copy over all relevant rows that are present in the old service but missing from the new one. It can even kick off multiple scripts that read different data slices in parallel to accelerate the migration.

The advantage of this approach is that no data is sacrificed, and migration time is minimized, which also limits the period of double reads. The disadvantage is that the scripts demand careful attention, for instance making sure one does not run for a full day only to exit with an uncaught exception. “You’ve got to do plenty of testing to make sure you’ve got that script right.” Running these scripts also increases the load on the database during the migration period.

During testing, they found that running too many scripts in parallel could impact performance, so they reduced the number of parallel threads and introduced sleeps to avoid putting too much load on the old system and to space out processing.
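
The team's actual tool was a Python script, but the scan-and-copy loop it performs can be sketched against the NoSql API in this post's C++ register. Everything here is an assumption for illustration: the visitor IDs are presumed to come from a keyspace-walking helper (not part of the API shown above), and the timeouts, TTL, and sleep interval are invented.

#include <chrono>
#include <map>
#include <string>
#include <thread>
#include <vector>

// Copies one slice of the old keyspace into the new service. Run one of
// these per slice, in parallel, to accelerate the migration.
void copySlice(NoSqlConnection &old_conn, NoSqlConnection &new_conn, const std::string &customer, const std::vector<std::string> &visitor_ids_in_slice)
{
    for (const std::string &id : visitor_ids_in_slice) {
        // Only copy rows that do not already exist in the new service, so
        // fresher data written by live traffic is never overwritten.
        std::map<std::string,std::string> new_row;
        new_conn.getMultiple(1000, customer, id, new_row);
        if (!new_row.empty())
            continue;

        std::map<std::string,std::string> old_row;
        if (old_conn.getMultiple(1000, customer, id, old_row) != 0)
            continue; // a real tool would log and retry here

        for (const auto &kv : old_row)
            new_conn.update(1000, customer, id, kv.first, kv.second, 30 * 24 * 3600 /* ttl: an assumption */);

        // Throttle: brief sleeps keep the scan from overloading the old
        // cluster, as the team did when parallel scripts added too much load.
        std::this_thread::sleep_for(std::chrono::milliseconds(5));
    }
}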

Cutting over from DataStax to ScyllaDB

The window between ScyllaDB being approved for use at SAS Institute and the expiration of their DataStax license was down to two weeks. Even before the approval, the SAS CI team had stood up test clusters, written and tested their migration service using the new aggressive lazy reading strategy, and written their migration script.

Once the approval to move to ScyllaDB came through, they stood up a production ScyllaDB cluster, switched on the migrating service for all tenants, and ran multiple copies of the migration script in parallel. Once the scripts finished, they switched all tenants from the migrating service back to the regular NoSql service, then tore down the DataStax cluster. (Even though it was ScyllaDB under the hood, to truly make no code changes they kept the class name DatastaxNoSqlService.)

The migration took anywhere from a few hours for the smallest cluster to two days for the largest. They migrated each cluster serially, and still completed all the migrations within the two-week licensing window.

They had no downtime. No lost data.

“We didn’t have any operational headaches at all. This has been a continuing story with ScyllaDB: it has been rock solid and has given us many fewer problems than we had with our Cassandra cluster.”

Further, they had no customer complaints. “And believe me, we would have gotten some complaints if we had not been able to activate user data that was needed to target ads properly.” But the customers remained unaware of the fundamental changes occurring in the underlying service. Just as the SAS CI team intended.

Finally, read performance improved. As David put it, “Because… ScyllaDB. The average read time I believe now is in the low single-digit milliseconds. And under load, the difference between ScyllaDB and Cassandra is even more striking.”