Lessons Learned Leading High-Stakes Data Migrations

“No one ever said, ‘Meh, it’s just our database.’”

Every data migration is high stakes to the person leading it. Whether you’re upgrading an internal app’s database or moving 362 PB of Twitter’s data from bare metal to GCP, a lot can go awry, and you don’t want to be blamed for downtime or data loss. But a migration done right will not only optimize your project’s infrastructure. It will also leave you with a deeper understanding of your system and maybe even yield some fun “war stories” to share with your peers.

To cheat a bit, why not learn from others’ experiences first? Enter Miles Ward (CTO at SADA and former Google and AWS cloud lead) and Tim Koopmans (Senior Director at ScyllaDB, performance geek and SaaS startup founder). Miles and Tim recently got together to chat about lessons they’ve personally learned from leading real-world data migrations. You can watch the complete discussion here:

Let’s look at three key takeaways from the chat.

1. Start with the Hardest, Ugliest Part First

It’s always tempting to start a project with some quick wins. But tackling the worst part first will yield better results overall. Miles explains, “Start with the hardest, ugliest part first because you’re going to be wrong in terms of estimating timelines and noodling through who has the correct skills for each step and what are all of the edge conditions that drive complexity.”

For example, he saw this approach in action during Google’s seven-year migration of the Gmail backend (handling trillions of transactions per day) from its internal Gmail data system to Spanner. First, Google built Spanner specifically for this purpose. Then, the migration team ran roll-forwards and roll-backs of individual mailbox migrations for over two years before deciding that the performance, reliability and consistency in the new environment met their expectations.

Miles added, “You also get an emotional benefit in your teams. Once that scariest part is done, everything else is easier. I think that tends to work well both interpersonally and technically.”

2. Map the Minefield

You can’t safely migrate until you’ve fully mapped out every little dependency. Both Tim and Miles stress the importance of exhaustive discovery: cataloging every upstream caller, every downstream consumer, every health check and contractual downtime window before a single byte shifts. Miles warns, “If you don’t have an idea of what the consequences of your change are…you’ll design a migration that’s ignorant of those needs.”

Miles then offered a cautionary anecdote from his time at Twitter, as part of a team that migrated 362 petabytes of active data from bare-metal data centers into Google Cloud. They used an 800 Gbps interconnect (about the total internet throughput at the time) and transferred everything in 43 days. To be fair, this was a data warehouse migration, so it didn’t involve hundreds of thousands of transactional queries per second. Still, Twitter’s ad systems and revenue depended entirely on that warehouse, making the migration mission-critical.

Miles shared: “They brought incredible engineers and those folks worked with us for months to lay out the plan before we moved any bytes. Compare that to something done a little more slapdash. I think there are plenty of places where businesses go too slow, where they overinvest in risk management because they haven’t modeled the cost-benefit of a faster migration. But if you don’t have that modeling done, you should probably take the slow boat and do it carefully.”

3. Engineer a “Blissfully Boring” Cutover

“If you’re not feeling sleepy on cut-over day,” Miles quipped, “you’ve done something terribly wrong.” But how do you get to that point?

Tim shared that he’s always found dual writes with single reads useful: you can switch over once both systems are up to speed. If the database doesn’t support dual writes, replicating writes via Change Data Capture (CDC) or something similar works well. Those strategies provide confidence that the source and target behave the same under load before you start serving real traffic.

Then Tim asked Miles, “Would you say those are generally good approaches, or does it just depend?”

Miles’ response: “I think the biggest driver of ‘It depends’ is that those concepts are generally sound, but real-world migrations are more complex. You always want split writes when feasible, so you build operational experience under write load in the new environment. But sample architecture diagrams and Terraform examples make migrations look simpler than they usually are.”
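For illustration, here’s a minimal sketch of the dual-write, single-read pattern Tim describes. It isn’t from the discussion itself: the `DualWriteStore` class, the `DictBackend` stand-in and the error-handling policy are hypothetical placeholders for whatever drivers and rules your own migration would use.

```python
# Sketch of "dual writes, single reads" during a migration window.
# The backends here are hypothetical stand-ins for real database clients.
import logging

log = logging.getLogger("migration")


class DictBackend:
    """Toy in-memory backend; swap in your actual old/new database drivers."""
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)


class DualWriteStore:
    """Writes to both systems; reads from the old one until cutover."""
    def __init__(self, old_db, new_db, read_from_new=False):
        self.old_db = old_db
        self.new_db = new_db
        self.read_from_new = read_from_new  # flip once the new system is proven

    def put(self, key, value):
        # The old system remains the source of truth, so this write must succeed.
        self.old_db.put(key, value)
        try:
            # Mirror the write so the new system accumulates identical state.
            # (A CDC pipeline can play this role if app-level dual writes
            # aren't practical.)
            self.new_db.put(key, value)
        except Exception:
            # A failed mirror write shouldn't break production traffic,
            # but it must be visible so you can reconcile later.
            log.exception("dual-write to new system failed for key %r", key)

    def get(self, key):
        # Single reads: serve traffic from whichever system is authoritative.
        source = self.new_db if self.read_from_new else self.old_db
        return source.get(key)


if __name__ == "__main__":
    store = DualWriteStore(old_db=DictBackend(), new_db=DictBackend())
    store.put("user:42", {"plan": "pro"})
    print(store.get("user:42"))  # served from the old system until cutover
```

In practice you’d typically also add shadow reads (comparing results from both systems) and track mismatches for a while before flipping `read_from_new`.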
Another complicating factor: most companies don’t have one application on one database. They have dozens of applications talking across multiple databases, data warehouses, cache layers and so on. All of this matters when you start routing read traffic from various sources. Some systems rely on scheduled database-to-warehouse extractions, while others avoid streaming replication because of its cost. Load patterns shift throughout the day as different workloads come online. That’s why you should keep testing beyond the immediate reads after migration, or after initial writes move to the new environment.

So codify every step, version it and test it all multiple times – exactly the same way. And if you need to justify extra preparation or planning for the migration, frame it as improving your overall high-availability design. Those practices will carry forward even after the cutover.

Also, be aware that new platforms will inevitably have different operational characteristics…that’s why you’re adopting them. But these changes can break hard-coded alerts or automation. For example, maybe you had alerts set to trigger at 10,000 transactions per second, but the new system easily handles 100,000. Ensure that your previous automation still works, and systematically evaluate all upstream and downstream dependencies.
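To make that concrete with a tiny, hypothetical sketch (not something from the talk): rather than carrying the old constant forward, derive alert thresholds from the new platform’s measured behavior.

```python
# Hypothetical sketch: re-baselining a hard-coded throughput alert after a
# migration. The numbers and the 20% headroom rule are illustrative only.

OLD_PLATFORM_TPS_ALERT = 10_000  # tuned to the old database's ceiling


def tps_alert_threshold(observed_baseline_tps, headroom=0.2):
    """Alert threshold derived from the new platform's measured steady-state
    throughput, instead of the constant inherited from the old system."""
    return observed_baseline_tps * (1.0 + headroom)


if __name__ == "__main__":
    new_baseline = 100_000  # e.g., steady-state TPS observed on the new system
    print(f"old threshold: {OLD_PLATFORM_TPS_ALERT:,} TPS")
    print(f"re-baselined threshold: {tps_alert_threshold(new_baseline):,.0f} TPS")
```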
Follow these tips and the big day could resemble Digital Turbine’s stellar example. Miles shared, “If Digital Turbine’s database went down, its business went down. But the company’s DynamoDB to ScyllaDB migration was totally drama free. It took two and a half weeks, all buttoned up, done. It was going so well that everybody had a beer in the middle of the cutover.”

Closing Thoughts

Data migrations are always “high stakes.” As Miles bluntly put it, “I know that if I screw this up, I’ll piss off customers, drive them to competitors, or miss out on joint growth opportunities. It all comes down to trust. There are countless ways you can screw up an application in a way that breaches stakeholder trust. But doing careful planning, being thoughtful about the migration process, and making the right design decisions sets the team up to grow trust instead of eroding it.”

Data migration projects are also great opportunities to strengthen your team’s architecture and build your own engineering expertise. Tim left us with this thought: “My advice for anyone who’s scared of running a data migration: Just have a crack at it. Do it carefully, and you’ll learn a lot about distributed systems in general – and gain all sorts of weird new insights into your own systems in particular.”

Watch the complete video (at the start of this article) for more details on these topics – as well as some fun “war stories.”

Bonus: Access our free NoSQL Migration Masterclass for a deeper dive into migration strategy, missteps, and logistics.