How to Model Leaderboards for 1M Player Game with ScyllaDB

Daniel Reis

Ever wondered how a game like League of Legends, Fortnite, or even Rockband models its leaderboards? In this article, we’ll explore how to properly model a schema for leaderboards…using a monstrously fast database (ScyllaDB)! 1. Prologue Ever since I was a kid, I’ve been fascinated by games and how they’re made. My favorite childhood game was Guitar Hero 3: Legends of Rock. Well, more than a decade later, I decided to try to contribute to some games in the open source environment, like rust-ro (Rust Ragnarok Emulator) and YARG (Yet Another Rhythm Game). YARG is another rhythm game, but this project is completely open source. It unites legendary contributors in game development and design. The game was being picked up and played mostly by Guitar Hero/Rockband streamers on Twitch. I thought: Well, it’s an open-source project, so maybe I can use my database skills to create a monstrously fast leaderboard for storing past games. It started as a simple chat on their Discord, then turned into a long discussion about how to make this project grow faster. Ultimately, I decided to contribute to it by building a leaderboard with ScyllaDB. In this blog, I’ll show you some code and concepts! 2. Query-Driven Data Modeling With NoSQL, you should first understand which query you want to run depending on the paradigm (document, graph, wide-column, etc.). Focus on the query and create your schema based on that query. In this project, we will handle two types of paradigms: Key-Value Wide Column (Clusterization) Now let’s talk about the queries/features of our modeling. 2.1 Feature: Storing the matches Every time you finish a YARG gameplay, you want to submit your scores plus other in-game metrics. Basically, it will be a single query based on a main index. SELECT score, stars, missed_notes, instrument, ... FROM leaderboard.submisisons WHERE submission_id = 'some-uuid-here-omg' 2.2 Feature: Leaderboard And now our main goal: a super cool leaderboard that you don’t need to worry about after you perform good data modeling. The leaderboard is per song: every time you play a specific song, your best score will be saved and ranked. The interface has filters that dictate exactly which leaderboard to bring: song_id: required instrument: required modifiers: required difficulty: required player_id: optional score: optional Imagine our query looks like this, and it returns the results sorted by score in descending order: SELECT player_id, score, ... FROM leaderboard.song_leaderboard WHERE instrument = 'guitar' AND difficulty = 'expert' AND modifiers = {'none'} AND track_id = 'dani-california' LIMIT 100; -- player_id | score ----------------+------- -- tzach | 12000 -- danielhe4rt | 10000 -- kadoodle | 9999 ----------------+------- Can you already imagine what the final schema will look like? No? Ok, let me help you with that! 3. Data Modeling time! It’s time to take a deep dive into data modeling with ScyllaDB and better understand how to scale it. 3.1 – Matches Modeling First, let’s understand a little more about the game itself: It’s a rhythm game; You play a certain song at a time; You can activate “modifiers” to make your life easier or harder before the game; You must choose an instrument (e.g. guitar, drums, bass, and microphone). Every aspect of the gameplay is tracked, such as: Score; Missed notes; Overdrive count; Play speed (1.5x ~ 1.0x); Date/time of gameplay; And other cool stuff. Thinking about that, let’s start our data modeling. It will turn into something like this: CREATE TABLE IF NOT EXISTS leaderboard.submissions ( submission_id uuid, track_id text, player_id text, modifiers frozen<set>, score int, difficulty text, instrument text, stars int, accuracy_percentage float, missed_count int, ghost_notes_count int, max_combo_count int, overdrive_count int, speed int, played_at timestamp, PRIMARY KEY (submission_id, played_at) ); Let’s skip all the int/text values and jump to the set<text>. The set type allows you to store a list of items of a particular type. I decided to use this list to store the modifiers because it’s a perfect fit. Look at how the queries are executed: INSERT INTO leaderboard.submissions ( submission_id, track_id, modifiers, played_at ) VALUES ( some-cool-uuid-here, 'starlight-muse' {'all-taps', 'hell-mode', 'no-hopos'}, '2024-01-01 00:00:00' ); With this type, you can easily store a list of items to retrieve later. Another cool piece of information is that this query is a key-value like! What does that mean? Since you will always query it by the submission_id only, it can be categorized as a key-value. 3.2 Leaderboard Modeling Now we’ll cover some cool wide-column database concepts. In our leaderboard query, we will always need some dynamic values in the WHERE clauses. That means these values will belong to the Partition Key while the Clustering Keys will have values that can be “optional”. A partition key is a hash based on a combination of fields that you added to identify a value. Let’s imagine that you played Starlight - Muse 100x times. If you were to query this information, it would return 100x different results differentiated by Clustering Keys like score or player_id. SELECT player_id, score --- FROM leaderboard.song_leaderboard WHERE track_id = 'starlight-muse' LIMIT 100; If 1,000,000 players play this song, your query will become slow and it will become a problem in the future because your partition key consists of only one field, which is track_id. However, if you add more fields to your Partition Key, like mandatory things before playing the game, maybe you can shrink these possibilities for a faster query. Now do you see the big picture? Adding the fields like Instrument, Difficulty, and Modifiers will give you a way to split the information about that specific track evenly. Let’s imagine with some simple numbers: -- Query Partition ID: '1' SELECT player_id, score, ... FROM leaderboard.song_leaderboard WHERE instrument = 'guitar' AND difficulty = 'expert' AND modifiers = {'none'} AND -- Modifiers Changed track_id = 'starlight-muse' LIMIT 100; -- Query Partition ID: '2' SELECT player_id, score, ... FROM leaderboard.song_leaderboard WHERE instrument = 'guitar' AND difficulty = 'expert' AND modifiers = {'all-hopos'} AND -- Modifiers Changed track_id = 'starlight-muse' LIMIT 100; So, if you build the query in a specific shape it will always look for a specific token and retrieve the data based on these specific Partition Keys. Let’s take a look at the final modeling and talk about the clustering keys and the application layer: CREATE TABLE IF NOT EXISTS leaderboard.song_leaderboard ( submission_id uuid, track_id text, player_id text, modifiers frozen<set>, score int, difficulty text, instrument text, stars int, accuracy_percentage float, missed_count int, ghost_notes_count int, max_combo_count int, overdrive_count int, speed int, played_at timestamp, PRIMARY KEY ((track_id, modifiers, difficulty, instrument), score, player_id) ) WITH CLUSTERING ORDER BY (score DESC, player_id ASC); The partition key was defined as mentioned above, consisting of our REQUIRED PARAMETERS such as track_id, modifiers, difficulty and instrument. And for the Clustering Keys, we added score and player_id. Note that by default the clustering fields are ordered by score DESC and just in case a player has the same score, the criteria to choose the winner will be alphabetical ¯\(ツ)/¯. First, it’s good to understand that we will have only ONE SCORE PER PLAYER. But, with this modeling, if the player goes through the same track twice with different scores, it will generate two different entries. INSERT INTO leaderboard.song_leaderboard ( track_id, player_id, modifiers, score, difficulty, instrument, stars, played_at ) VALUES ( 'starlight-muse', 'daniel-reis', {'none'}, 133700, 'expert', 'guitar', '2023-11-23 00:00:00' ); INSERT INTO leaderboard.song_leaderboard ( track_id, player_id, modifiers, score, difficulty, instrument, stars, played_at ) VALUES ( 'starlight-muse', 'daniel-reis', {'none'}, 123700, 'expert', 'guitar', '2023-11-23 00:00:00' ); SELECT player_id, score FROM leaderboard.song_leaderboard WHERE instrument = 'guitar' AND difficulty = 'expert' AND modifiers = {'none'} AND track_id = 'starlight-muse' LIMIT 2; -- player_id | score ----------------+------- -- daniel-reis | 133700 -- daniel-reis | 123700 ----------------+------- So how do we fix this problem? Well, it’s not a problem per se. It’s a feature! As a developer, you have to create your own business rules based on the project’s needs, and this is no different. What do I mean by that? You can run a simple DELETE query before inserting the new entry. That will guarantee that you will not have specific data from the player_id with less than the new score inside that specific group of partition keys. -- Before Insert the new Gampleplay DELETE FROM leaderboard.song_leaderboard WHERE instrument = 'guitar' AND difficulty = 'expert' AND modifiers = {'none'} AND track_id = 'starlight-muse' AND player_id = 'daniel-reis' AND score <= 'your-new-score-here'; -- Now you can insert the new payload... And with that, we finished our simple leaderboard system, the same one that runs in YARG and can also be used in games with MILLIONS of entries per second 😀 4. How to Contribute to YARG Want to contribute to this wonderful open-source project? We’re building a brand new platform for all the players using: Game: Unity3d (Repository) Front-end: NextJS (Repository) Back-end: Laravel 10.x (Repository) We will need as many developers and testers as possible to discuss future implementations of the game together with the main contributors! First, make sure to join this Discord Community. This is where all the technical discussions happen with the backing of the community before going to the development board. Also, outside of Discord, the YARG community is mostly focused on the EliteAsian (core contributor and project owner) X account for development showcases. Be sure to follow him there as well.

New replay viewer HUD for #YARG! There are still some issues with it, such as consistency, however we are planning to address them by the official stable release of v0.12. pic.twitter.com/9ACIJXAZS4 — EliteAsian (@EliteAsian123) December 16, 2023

And FYI, the Lead Artist of the game, (aka Kadu) is also a Broadcast Specialist and Product Innovation Developer at Elgato who worked with streamers like: Ninja Nadeshot StoneMountain64 and the legendary DJ Marshmello. Kadu also uses his X to share some insights and early previews of new features and experimentations for YARG. So, don’t forget to follow him as well!

Here's how the replay venue looks like now, added a lot of details on the desk, really happy with the result so far, going to add a few more and start the textures pic.twitter.com/oHH27vkREe — ⚡Kadu Waengertner (@kaduwaengertner) August 10, 2023

Here are some useful links to learn more about the project: Official Website Github Repository Task Board

Fun fact: YARG got noticed by Brian Bright, project lead on Guitar Hero, who liked the fact that the project was open source. Awesome, right?

5. Conclusion Data modeling is sometimes challenging. This project involved learning many new concepts and a lot of testing together with my community on Twitch. I have also published a Gaming Leaderboard Demo, where you can get some insights on how to implement the same project using NextJS and ScyllaDB! Also, if you like ScyllaDB and want to learn more about it, I strongly suggest you watch our free Masterclass Courses or visit ScyllaDB University!