
Simplifying Cassandra and DynamoDB Migrations with the ScyllaDB Migrator

Learn about the architecture of ScyllaDB Migrator, how to use it, recent developments, and upcoming features.

ScyllaDB offers both a CQL-compatible API and a DynamoDB-compatible API, allowing applications that use Apache Cassandra or DynamoDB to take advantage of reduced costs and lower latencies with minimal code changes. We previously described the two main migration strategies: cold and hot migrations. In both cases, you need to backfill ScyllaDB with historical data, and either strategy can be carried out efficiently with the ScyllaDB Migrator. In this blog post, we provide an update on its status. You will learn about its architecture, how to use it, recent developments, and upcoming features.

The Architecture of the ScyllaDB Migrator

The ScyllaDB Migrator leverages Apache Spark to migrate terabytes of data in parallel. It can migrate data from various types of sources. We initially developed it to migrate from Apache Cassandra, but we have since added support for more types of data sources. At the time of writing, the Migrator can migrate data from either:

- A CQL-compatible source:
  - an Apache Cassandra table, or
  - a Parquet file stored locally or on Amazon S3.
- A DynamoDB-compatible source:
  - a DynamoDB table, or
  - a DynamoDB table export on Amazon S3.

What's so interesting about the ScyllaDB Migrator?

- Since it runs as an Apache Spark application, you can adjust its throughput by scaling the underlying Spark cluster.
- It is designed to be resilient to read or write failures. If it stops prior to completion, the migration can be restarted from where it left off.
- It can rename item columns along the way.
- When migrating from DynamoDB, the Migrator can endlessly replicate new changes to ScyllaDB. This is useful for hot migration strategies.

How to Use the ScyllaDB Migrator

More details are available in the official Migrator documentation. The main steps are:

1. Set Up Apache Spark: There are several ways to set up an Apache Spark cluster, from using a pre-built image on AWS EMR, to manually following the official Apache Spark documentation, to using our automated Ansible playbook on your own infrastructure. You may also use Docker to run a cluster on a single machine.
2. Prepare the Configuration File: Create a YAML configuration file that specifies the source database, the target ScyllaDB cluster, and any migration options.
3. Run the Migrator: Execute the ScyllaDB Migrator using the spark-submit command, passing the configuration file as an argument.
4. Monitor the Migration: The Spark UI provides logs and metrics to help you monitor the migration process. You can track the progress and troubleshoot any issues that arise. You should also monitor the source and target databases to check whether they are saturated.

Recent Developments

The ScyllaDB Migrator has seen several significant improvements, making it more versatile and easier to use:

- Support for Reading DynamoDB S3 Exports: You can now migrate data from DynamoDB S3 exports directly to ScyllaDB, broadening the range of sources you can migrate from. PR #140.
- AWS AssumeRole Authentication: The Migrator now supports AWS AssumeRole authentication, allowing for secure access to AWS resources during the migration process. PR #150.
- Schema-less DynamoDB Migrations: By adopting a schema-less approach, the Migrator enhances reliability when migrating to ScyllaDB Alternator, ScyllaDB's DynamoDB-compatible API. PR #105.
- Dedicated Documentation Website: The Migrator's documentation is now available on a proper website, providing comprehensive guides, examples, and throughput tuning tips. PR #166.
- Update to Spark 3.5 and Scala 2.13: The Migrator has been updated to support the latest versions of Spark and Scala, ensuring compatibility and leveraging the latest features and performance improvements. PR #155.
- Ansible Playbook for Spark Cluster Setup: An Ansible playbook is now available to automate the setup of a Spark cluster, simplifying the initial setup process. PR #148.
- Pre-built Assemblies: You no longer need to build the Migrator from source. Download the latest release and pass it to the spark-submit command. PR #158.
- Strengthened Continuous Integration: We have set up a testing infrastructure that reduces the risk of introducing regressions and prevents us from breaking backward compatibility. PRs #107, #121, #127.

Hands-on Migration Example

The content of this section has been extracted from the documentation website, where the original is kept up to date. Let's go through a migration example to illustrate some of the points listed above. We will perform a cold migration to replicate 1,000,000 items from a DynamoDB table to ScyllaDB Alternator. The whole system is composed of the DynamoDB service, a Spark cluster with a single worker node, and a ScyllaDB cluster with a single node.

To make it easier for interested readers to follow along, we will create all those services using Docker. All you need is the AWS CLI and Docker. The example files can be found at https://github.com/scylladb/scylla-migrator/tree/b9be9fb684fb0e51bf7c8cbad79a1f42c6689103/docs/source/tutorials/dynamodb-to-scylladb-alternator

Set Up the Services and Populate the Source Database

We use Docker Compose to define each service. Our docker-compose.yml file defines every service we need; the full file is in the example repository linked above, and a condensed sketch is shown at the end of this subsection. Let's break down this Docker Compose file:

- We define the DynamoDB service by reusing the official image amazon/dynamodb-local. We use the TCP port 8000 for communicating with DynamoDB.
- We define the Spark master and Spark worker services by using a custom image (see below). Indeed, the official Docker images for Spark 3.5.1 only support Scala 2.12 for now, but we need Scala 2.13.
- We mount the local directory ./spark-data to the Spark master container path /app so that we can supply the Migrator jar and configuration to the Spark master node.
- We expose the ports 8080 and 4040 of the master node to access the Spark UIs from our host environment.
- We allocate 2 cores and 4 GB of memory to the Spark worker node. As a general rule, we recommend allocating 2 GB of memory per core on each worker.
- We define the ScyllaDB service by reusing the official image scylladb/scylla. We use the TCP port 8001 for communicating with ScyllaDB Alternator.

The Spark services rely on a local Dockerfile located at the path ./dockerfiles/spark/Dockerfile; its content, along with the entry point script it uses (which needs to be executable), can be copied from the example repository. This Docker image installs Java and downloads the official Spark release. The entry point of the image takes an argument that can be either master or worker to control whether to start a master node or a worker node.
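As promised above, here is a condensed, hypothetical sketch of the docker-compose.yml. The authoritative version is the one in the example repository; the ScyllaDB startup flags and the way the worker resources are set are assumptions made for illustration:

# Hypothetical sketch only -- the authoritative docker-compose.yml is in the example
# repository. Ports, volumes, and resources follow the description above; the ScyllaDB
# Alternator flags and the Spark worker environment variables are assumptions.
cat > docker-compose.yml <<'EOF'
services:
  dynamodb:
    image: amazon/dynamodb-local
    ports:
      - "8000:8000"             # DynamoDB endpoint

  spark-master:
    build: ./dockerfiles/spark  # custom image (Spark 3.5, Scala 2.13)
    command: master
    volumes:
      - ./spark-data:/app       # Migrator jar and config.yaml
    ports:
      - "8080:8080"             # Spark master UI
      - "4040:4040"             # Spark application UI

  spark-worker:
    build: ./dockerfiles/spark
    command: worker
    environment:                # 2 cores and 4 GB of memory for the worker
      SPARK_WORKER_CORES: "2"
      SPARK_WORKER_MEMORY: "4G"
    depends_on:
      - spark-master

  scylla:
    image: scylladb/scylla
    command: --alternator-port=8000 --alternator-write-isolation=always
    ports:
      - "8001:8000"             # host port 8001 -> Alternator inside the container
EOF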
Prepare your system for building the Spark Docker image with the following commands:

mkdir spark-data
chmod +x entrypoint.sh

Finally, start all the services with the following command:

docker compose up

Your system's Docker daemon will download the DynamoDB and ScyllaDB images and build our Spark Docker image. Check that you can access the Spark cluster UI by opening http://localhost:8080 in your browser. You should see your worker node in the workers list.

Once all the services are up, you can access your local DynamoDB instance and your local ScyllaDB instance by using the standard AWS CLI. Make sure to configure the AWS CLI as follows before running the dynamodb commands:

# Set dummy region and credentials
aws configure set region us-west-1
aws configure set aws_access_key_id dummy
aws configure set aws_secret_access_key dummy

# Access DynamoDB

aws --endpoint-url http://localhost:8000 dynamodb list-tables

# Access ScyllaDB Alternator

aws --endpoint-url http://localhost:8001 dynamodb list-tables
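The next preparatory step, described below, populates the source table with two small shell scripts from the example repository. To make the mechanism concrete, here is a hypothetical, condensed sketch, not the exact tutorial scripts, of how a table can be created and 25 random items inserted with a single batch-write-item call (the id format and column names are illustrative):

#!/usr/bin/env bash
# Hypothetical sketch: create the table and insert one batch of 25 random items.
# The real create-data.sh and create-25-items.sh scripts are in the example repository.
set -euo pipefail

ENDPOINT=http://localhost:8000
TABLE=Example

# Create the table with a single string partition key named "id".
aws --endpoint-url "$ENDPOINT" dynamodb create-table \
  --table-name "$TABLE" \
  --attribute-definitions AttributeName=id,AttributeType=S \
  --key-schema AttributeName=id,KeyType=HASH \
  --provisioned-throughput ReadCapacityUnits=25,WriteCapacityUnits=25

# Build 25 PutRequest entries, each with a random id and five columns of random data.
items=""
for i in $(seq 1 25); do
  item='{"PutRequest": {"Item": {"id": {"S": "'"$(uuidgen)"'"}'
  for c in 1 2 3 4 5; do
    item+=', "col'"$c"'": {"S": "'"$RANDOM"'"}'
  done
  items+="$item"'}}},'
done

# batch-write-item accepts at most 25 items per call.
aws --endpoint-url "$ENDPOINT" dynamodb batch-write-item \
  --request-items '{"'"$TABLE"'": ['"${items%,}"']}'

Repeating such a batch call in a loop is how the tutorial scripts reach 1,000,000 items. Note that the tutorial's own scripts create the table themselves, so treat this sketch as an illustration rather than a step to run.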

The last preparatory step consists of creating a table in DynamoDB and filling it with random data. Create a file named create-data.sh, make it executable, and fill it with the script content from the example repository linked above. This script creates a table named Example and adds 1 million items to it. It does so by invoking another script, create-25-items.sh (also in the repository), which uses the batch-write-item command to insert 25 items in a single call, following the same pattern as the sketch shown earlier. Every added item contains an id and five columns, all filled with random data.

Run the script:

./create-data.sh

and wait for a couple of hours until all the data is inserted (or change the last line of create-data.sh to insert fewer items and make the demo faster).

Perform the Migration

Once you have set up the services and populated the source database, you are ready to perform the migration. Download the latest stable release of the Migrator into the spark-data directory:

wget https://github.com/scylladb/scylla-migrator/releases/latest/download/scylla-migrator-assembly.jar \
  --directory-prefix=./spark-data
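The migration itself is driven by a YAML configuration file. The exact keys are documented on the Migrator documentation website; as a rough, hypothetical sketch, whose key names and values are assumptions written to match the Docker Compose services above, a DynamoDB-to-Alternator configuration has roughly this shape:

# Hypothetical sketch only -- refer to the Migrator documentation for the exact keys.
# Endpoints and dummy credentials match the local Docker Compose setup above.
cat > spark-data/config.yaml <<'EOF'
source:
  type: dynamodb
  table: Example
  endpoint:
    host: http://dynamodb
    port: 8000
  credentials:
    accessKey: dummy
    secretKey: dummy
  region: us-west-1

target:
  type: dynamodb
  table: Example
  endpoint:
    host: http://scylla
    port: 8000            # Alternator port inside the Docker network; adjust to your setup
  credentials:
    accessKey: dummy
    secretKey: dummy
  region: us-west-1
  streamChanges: false    # cold migration: do not replicate live changes
EOF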

The configuration file goes in spark-data/config.yaml; the exact file used in this tutorial is available in the example repository. This configuration tells the Migrator to read the items from the table Example in the dynamodb service, and to write them to the table of the same name in the scylla service.

Finally, start the migration with the following command:

docker compose exec spark-master \
  /spark/bin/spark-submit \
  --executor-memory 4G \
  --executor-cores 2 \
  --class com.scylladb.migrator.Migrator \
  --master spark://spark-master:7077 \
  --conf spark.driver.host=spark-master \
  --conf spark.scylla.config=/app/config.yaml \
  /app/scylla-migrator-assembly.jar

This command calls spark-submit in the spark-master service with the file scylla-migrator-assembly.jar, which bundles the Migrator and all its dependencies.

In the spark-submit invocation, we explicitly tell Spark to use 4 GB of memory; otherwise, it would default to 1 GB only. We also explicitly tell Spark to use 2 cores. This is not really necessary, as the default behavior is to use all the available cores, but we set it for the sake of illustration. If the Spark worker node had 20 cores, it would be better to use only 10 cores per executor to optimize the throughput (big executors require more memory management operations, which decrease the overall application performance). We would achieve this by passing --executor-cores 10, and the Spark engine would allocate two executors for our application to fully utilize the resources of the worker node.

The migration process inspects the source table, replicates its schema to the target database if the table does not already exist there, and then migrates the data. The data migration uses the Hadoop framework under the hood to leverage the Spark cluster resources. The migration process breaks down the data to transfer into chunks of about 128 MB each, and processes all the partitions in parallel. Since the source is a DynamoDB table in our example, each partition translates into a scan segment to maximize the parallelism level when reading the data.

During the execution of the command, a lot of logs are printed, mostly related to Spark scheduling. Still, you should be able to spot the following relevant lines:

24/07/22 15:46:13 INFO migrator: ScyllaDB Migrator 0.9.2
24/07/22 15:46:20 INFO alternator: We need to transfer: 2 partitions in total
24/07/22 15:46:20 INFO alternator: Starting write…
24/07/22 15:46:20 INFO DynamoUtils: Checking for table existence at destination

And when the migration ends, you will see the following line printed:

24/07/22 15:46:24 INFO alternator: Done transferring table snapshot

During the migration, it is possible to monitor the underlying Spark job by opening the Spark UI available at http://localhost:4040.

Example of a migration broken down into 6 tasks.

The Spark UI allows us to follow the overall progress, and it can also show specific metrics such as the memory consumption of an executor.

In our example, the size of the source table is ~200 MB. In practice, it is common to migrate tables containing several terabytes of data. If necessary, and as long as your DynamoDB source supports a higher read throughput level, you can increase the migration throughput by adding more Spark worker nodes. The Spark engine will automatically spread the workload between all the worker nodes.

Future Enhancements

The ScyllaDB team is continuously improving the Migrator.
Some of the upcoming features include:

- Support for Savepoints with DynamoDB Sources: This will allow users to resume the migration from a specific point in case of interruptions. This is currently supported with Cassandra sources only.
- Shard-Aware ScyllaDB Driver: The Migrator will fully take advantage of ScyllaDB's specific optimizations for even faster migrations.
- Support for SQL-based Sources: For instance, migrate from MySQL to ScyllaDB.

Conclusion

Thanks to the ScyllaDB Migrator, migrating data to ScyllaDB has never been easier. With its robust architecture, recent enhancements, and active development, the Migrator is an indispensable tool for ensuring a smooth and efficient migration process.

For more information, check out the ScyllaDB Migrator lesson on ScyllaDB University. Another useful resource is the official ScyllaDB Migrator documentation.

Are you using the Migrator? Any specific feature you'd like to see? For any questions about your specific use case or about the Migrator in general, tap into the community knowledge on the ScyllaDB Community Forum.