Low-Latency Database Strategies Featured at P99 CONF 24

Obsessed with high-performance, low-latency data systems? See the 20+ data-related tech talks we’re hosting at P99 CONF 2024. P99 CONF is a (free + online) highly technical conference for engineers who obsess over P99 percentiles and long-tail latencies. The open source, community-focused event is hosted by ScyllaDB, the company behind the monstrously fast and scalable NoSQL database (and the adorable one-eyed sea monster).

Since database performance is so near and dear to ScyllaDB, we quite eagerly reached out to our friends and colleagues across the community to ensure a wide spectrum of distributed data systems, approaches, and challenges would be represented at P99 CONF. As you can see on our agenda, the response was overwhelming. This year’s attendees get to hear from – and interact with – the creators of Postgres, ScyllaDB, Turso, SlateDB, SpiceDB, Arroyo, Responsive, FerretDB, and Percona. We’re also looking forward to sessions with engineers from Redis, Oracle, TigerBeetle, AWS, QuestDB, and more. There’s a series of keynotes focused on rethinking the database, deep dives into database internals, and case studies of database engineering feats at organizations like Uber, Shopify, ShareChat, and LinkedIn.

If you share our obsession with high-performance, low-latency data systems, here’s a rundown of sessions to consider joining at P99 CONF 24.

Register Now – It’s Free

Just In Time LSM Compaction
Aleksei Kladov (TigerBeetle)

TigerBeetle is a reliable and fast accounting database. Its primary on-disk data structure is a log-structured merge-tree. This talk is a deep dive into TigerBeetle’s compaction algorithm — “garbage collection” for the LSM — which achieves several unusual goals:

- Full storage determinism across replicas, enabling recovery from disk faults.
- No dynamic memory allocation.
- A highly concurrent implementation that utilizes all available CPU and disk bandwidth.
- Perfect pacing: resources are carefully balanced between compaction and normal transaction processing, avoiding starvation and guaranteeing bounded P100.

Time-Series and Analytical Databases Walk Into a Bar…
Andrei Pechkurov (QuestDB)

A good time-series database also has to be a decent analytical database. This implies both SQL features and efficient query processing. That’s why we recently added many optimizations to QuestDB’s SQL engine, including a better on-disk data layout, specialized data structures, SIMD- and SWAR-based code, scalable aggregation algorithms, and parallel execution pipelines. Many of these additions can be found in popular databases; some are unique to QuestDB. In this talk, we will go through the most important optimizations and discuss what’s yet to be added and where we stand compared with the fastest analytical databases.
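
To make the SWAR idea concrete, here’s a minimal Rust sketch (illustrative only, not QuestDB’s actual code) of the kind of branch-free inner loop such optimizations rely on: it counts occurrences of a byte by testing eight byte lanes per 64-bit word.

```rust
/// Count occurrences of `needle` in `haystack`, examining 8 byte lanes per
/// step with SWAR (SIMD Within A Register) arithmetic on a u64.
fn count_byte_swar(haystack: &[u8], needle: u8) -> u32 {
    const LO: u64 = 0x0101_0101_0101_0101; // 0x01 in every byte lane
    const HI: u64 = 0x8080_8080_8080_8080; // 0x80 in every byte lane
    let pattern = LO.wrapping_mul(needle as u64); // broadcast needle to all lanes

    let mut count = 0;
    let mut chunks = haystack.chunks_exact(8);
    for chunk in &mut chunks {
        let word = u64::from_le_bytes(chunk.try_into().unwrap());
        let x = word ^ pattern; // a lane is zero iff it matched `needle`
        // Exact zero-lane test: OR-ing HI in first keeps the per-lane
        // subtraction from borrowing across lane boundaries.
        let nonzero = ((x | HI).wrapping_sub(LO) | x) & HI;
        count += (nonzero ^ HI).count_ones();
    }
    // Scalar tail for the last len % 8 bytes.
    count + chunks.remainder().iter().filter(|&&b| b == needle).count() as u32
}

fn main() {
    let data = b"abracadabra, abracadabra";
    assert_eq!(count_byte_swar(data, b'a'), 10);
}
```

The same pattern generalizes to filters and aggregations; with explicit SIMD intrinsics or vectorizing compilers, the idea extends from 8 lanes to 32 or 64.
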
The Next Chapter in the Sordid Love/Hate Relationship Between DBs and OSes
Andy Pavlo (Carnegie Mellon)

Database management systems (DBMSs) are beautiful, free-spirited software that want nothing more than to help users store and access data as quickly as possible. To achieve this goal, DBMSs have spent decades trying to avoid operating systems (OSs) at all costs. Such avoidance is necessary because OSs always try to impose their will on DBMSs and stifle their ambitions through disingenuous syscall semantics, unscalable kernel-level data structures, and excessive data copying. The many attempts to avoid the OS through kernel-bypass methods or custom hardware have such high engineering/R&D costs that few DBMSs support them.

In the end, DBMSs are stuck in an abusive relationship: they need the OS to run their software and provide basic functionality (e.g., memory allocation), but they do not like how the OS treats them. However, new technologies like eBPF, which allow DBMSs to run custom code safely inside the OS kernel to override its functionality, are poised to upend this power struggle. In this talk, I will present a new design approach called “user-bypass” for building high-performance database systems and services with eBPF. I will discuss recent developments in eBPF relevant to the DBMS community and what parts of a DBMS are most amenable to using it. We will also present the design of BPF-DB, an embedded DBMS written in eBPF that provides ACID transactions over multi-versioned data and runs entirely in the Linux kernel.

Designing a Query Queue for ScyllaDB
Avi Kivity (ScyllaDB)

Database queries can be highly variable. Some are served immediately from cache, return a single row, and are done in milliseconds. Others return gigabytes or terabytes of data, take minutes or hours, and require plenty of disk I/O and compute. Deciding what concurrency to use when running these queries is a delicate balance of CPU consumption, memory consumption, and the queue designer’s nerves. A bad design can mean high latency, under-utilization of available resources, or crashing when one runs out of memory. This talk will cover the decisions we made while designing ScyllaDB’s query queue.
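
To make the balancing act concrete, here is a tiny, hypothetical sketch (not ScyllaDB’s actual design; ScyllaDB is C++ with a per-shard scheduler) that uses Tokio semaphores to cap both in-flight queries and their estimated memory, so excess queries wait in line instead of driving the node out of memory:

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;

// Hypothetical limits: at most 64 queries in flight, and at most 256 MiB of
// estimated working memory reserved across them (tracked in 1 KiB units).
const MAX_CONCURRENT_QUERIES: usize = 64;
const MAX_QUERY_MEMORY: usize = 256 * 1024 * 1024;

struct QueryQueue {
    slots: Arc<Semaphore>,  // bounds concurrency (CPU)
    memory: Arc<Semaphore>, // bounds estimated memory
}

impl QueryQueue {
    fn new() -> Self {
        QueryQueue {
            slots: Arc::new(Semaphore::new(MAX_CONCURRENT_QUERIES)),
            memory: Arc::new(Semaphore::new(MAX_QUERY_MEMORY / 1024)),
        }
    }

    /// Run `query` once both a concurrency slot and a memory reservation
    /// are available; waiting queries apply backpressure instead of
    /// crashing the node when memory estimates say it would run out.
    async fn run<F, T>(&self, estimated_mem_bytes: usize, query: F) -> T
    where
        F: std::future::Future<Output = T>,
    {
        let mem_units = (estimated_mem_bytes / 1024).max(1) as u32;
        let _slot = self.slots.acquire().await.expect("semaphore closed");
        let _mem = self.memory.acquire_many(mem_units).await.expect("semaphore closed");
        query.await // both permits are released when the query finishes
    }
}

#[tokio::main]
async fn main() {
    let queue = QueryQueue::new();
    // A cheap query reserves little; a terabyte scan would wait its turn.
    let result = queue.run(4 * 1024 * 1024, async { 42 }).await;
    println!("query returned {result}");
}
```

The point of the design is backpressure: a cheap cached read needs one slot and a small reservation, while a huge scan blocks until it can reserve what it needs.
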
Reliable Data Replication
Cameron Morgan (Shopify)

Data replication is required to make data highly available to users. Highly available in this context means users can access data in a reliable, consistent, and timely fashion. Because it is so important, if this problem has not come up in your work, you have certainly used such a system. This talk focuses on the hard problems of data replication, the ones that are usually skipped in talks. What is a backfill, and why does it need to be reliable, non-blocking, and frequent? How do you handle schema changes? How do you validate that the data is correct? How can you be resilient to failure? How can you write in parallel? This talk is about the hard problems Shopify solved while replicating Shopify stores to the Edge, reaching ~5M rows replicated per second with a P99 replication lag under one second.

Rust: A Productive Language for Writing Database Applications
Carl Lerche (AWS)

When you think about Rust, you might think of performance, safety, and reliability, but what about productivity? Last year, I recommended considering Rust for developing high-level applications. Rust showed great promise, but its library ecosystem needed to mature. What has changed since then? Many higher-level applications sit on top of a database. In this talk, I will explore the current state of Rust libraries for database access, focusing on ergonomics and ease of use—two crucial factors in high-level database application development.

Building a Cloud Native LSM on Object Storage
Chris Riccomini (Materialized View) and Rohan Desai (Responsive)

This talk discusses the design and implementation of SlateDB, an open source cloud native storage engine built as a log-structured merge-tree (LSM) on top of an object store like S3, Google Cloud Storage (GCS), or Azure Blob Store (ABS). LSMs are traditionally built assuming data will reside on local storage. Building an LSM on object storage allows SlateDB to benefit from object storage replication and durability guarantees while presenting unique latency and cost challenges. We’ll discuss the design decisions and tradeoffs we faced when building SlateDB.
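
As a rough illustration of the core idea (a sketch assuming the Rust object_store crate, whose API details vary across releases; this is not SlateDB’s code): flushed sorted runs become immutable objects, so replication and durability come for free, while every access becomes a billed, higher-latency request.

```rust
use bytes::Bytes;
use object_store::{memory::InMemory, path::Path, ObjectStore};

// InMemory stands in for S3/GCS/ABS here; with the `object_store` crate the
// same trait calls work against the real cloud backends.
#[tokio::main]
async fn main() -> object_store::Result<()> {
    let store: Box<dyn ObjectStore> = Box::new(InMemory::new());

    // "Flush" a (pretend) sorted run as an immutable object. Object stores
    // give atomic, durable, replicated puts -- but each one costs a request
    // and tens of milliseconds, which is exactly the latency/cost tradeoff
    // the talk digs into.
    let sst = Path::from("l0/00000001.sst");
    store.put(&sst, Bytes::from_static(b"k1=v1\nk2=v2\n").into()).await?;

    // The read path fetches (and would normally cache) the object.
    let data = store.get(&sst).await?.bytes().await?;
    println!("fetched {} bytes from {sst}", data.len());
    Ok(())
}
```
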
Taming Tail Latencies in Apache Pinot with Generational ZGC
Christopher Peck (Uber)

The introduction of Generational ZGC mostly eliminated concerns around pause times for Java applications. This session will cover a real-world application of Generational ZGC and its effects. The session will also cover how application-level configs/features can be used to offset some of the trade-offs we encountered when switching to Generational ZGC. Apache Pinot is an OLAP database with an emphasis on low latency. We’ll walk through how we solved large scatter-gather induced tail latencies in our Pinot clusters by switching to Generational ZGC, uncovering the low-latency query potential of Pinot. We’ll also cover a couple of Pinot’s features that made using Generational ZGC possible.

Elevating PostgreSQL: Benchmarking Vector Search Performance
Daniel Seybold (benchANT)

The database market is constantly evolving, with new database systems addressing specific use cases such as time series data or vector search. But there is one open source database system that has been around for nearly three decades and has gained strong momentum over the last few years: PostgreSQL. Thanks to its pure open source approach and strong community, PostgreSQL is continuously improving its features, performance, and extensions, which enable PostgreSQL to also handle specific use cases such as vector search. Over the last few years, multiple native vector database systems have been established, and many NoSQL and relational database systems have released vector extensions. The same goes for PostgreSQL, with two available vector search extensions, pgvector and pgvecto.rs. And since vector search performance is a crucial differentiating factor for every vector search database, we report on the latest vector search benchmark results for PostgreSQL with pgvector and pgvecto.rs. The benchmarking study covers multiple data sets of varying vector dimensions and considers different PostgreSQL configurations, from a baseline configuration to tuned configurations.

Overcoming Distributed Databases Scaling Challenges with Tablets
Dor Laor (ScyllaDB)

Getting fantastic performance cannot stop at the server level. Even after rewriting your code in assembly, you would need multiple servers to run at scale and to provide availability. Clusters are often sharded to achieve good performance. In this session, I will cover tablets, a new dynamic sharding design we applied at ScyllaDB in order to maximize CPU and storage utilization dynamically and to maximize elasticity speed.

Why Databases Cache, but Caches Go to Disk
Felipe Mendes (ScyllaDB) and Alan Kasindorf (Cache Forge)

Caches and databases are different animals. Yet databases have always cached data, and caches are exploring disks. To quantify the strengths and tradeoffs of each, ScyllaDB joined forces with Memcached’s maintainer to compare both across different scenarios. Join us as we discuss how the results trace back to each underlying architecture’s design decisions. Specifically, we’ll compare ScyllaDB’s row-based cache with Memcached’s in-memory hash table, and look at how Memcached’s External Flash Storage compares to ScyllaDB’s userspace I/O scheduler and asynchronous AIO/DIO.

Feature Store Evolution Under Cost Constraints: When Cost is Part of the Architecture
Ivan Burmistrov and David Malinge (ShareChat)

At P99 CONF 23, the ShareChat team presented the scaling challenges of their ML Feature Store so it could handle 1 billion features per second. Once the system was scaled to handle the load, the next challenge the team faced was extreme cost constraints: they were required to make the same quality system much cheaper to run. This year, we will talk about the approaches the team implemented in order to optimize for cost in the cloud environment while maintaining the same SLA for the service. The talk will touch on topics such as advanced optimizations at various levels to bring down compute, minimizing waste when running on Kubernetes, autoscaling challenges for stateful Apache Flink jobs, and others. The talk should be useful for those who are interested in building or optimizing an ML Feature Store, or who are looking into cost optimization in the cloud in general.

Running Low-Latency Workloads on Kubernetes
Jimmy Zelinskie (authzed)

Configuring Kubernetes to optimally run a particular workload is best described as a continuous journey. Depending on your requirements, best practices might not only no longer apply, but actively harm performance. In this session, we document what we’ve found to work best in our journey running SpiceDB, a low-latency authorization system.

Cheating the Cloud: 50% Savings with Compression Dictionaries
Łukasz Paszkowski (ScyllaDB)

Discover how to slash networking costs by up to 50% with advanced compression techniques. This session covers a real-world case where the default LZ4 compression in Cassandra, with its limited 25% efficiency, was causing high costs in inter-zone replication. We’ll introduce a custom RPC compressor with external dictionary support that samples RPC traffic, trains optimized dictionaries, and seamlessly switches connections to these new dictionaries. Learn how this approach dramatically improves compression ratios, reduces cloud expenses, and enhances data transfer efficiency across distributed systems. It’s perfect for those looking to optimize cloud infrastructure.
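
The underlying technique is available off the shelf: zstd, for example, can train a dictionary from sampled messages and then compress each small payload against it. Here's a minimal sketch using the Rust zstd crate (illustrative sample data and parameters; not ScyllaDB’s implementation):

```rust
// Train a shared dictionary from sampled messages, then compress each small
// payload against it. Dictionary training wants many varied samples.
fn main() -> std::io::Result<()> {
    // Pretend these are sampled RPC payloads captured from live traffic.
    let samples: Vec<Vec<u8>> = (0..10_000)
        .map(|i| {
            format!("{{\"table\":\"users\",\"op\":\"write\",\"key\":\"user-{i}\",\"cl\":\"QUORUM\"}}")
                .into_bytes()
        })
        .collect();

    // Train a 4 KiB dictionary from the samples.
    let dict = zstd::dict::from_samples(&samples, 4096)?;

    // Both ends of a connection load the same trained dictionary.
    let mut comp = zstd::bulk::Compressor::with_dictionary(3, &dict)?;
    let mut decomp = zstd::bulk::Decompressor::with_dictionary(&dict)?;

    let msg = b"{\"table\":\"users\",\"op\":\"write\",\"key\":\"user-42\",\"cl\":\"QUORUM\"}";
    let compressed = comp.compress(msg)?;
    let roundtrip = decomp.decompress(&compressed, msg.len())?;
    assert_eq!(roundtrip, msg);
    println!("{} -> {} bytes", msg.len(), compressed.len());
    Ok(())
}
```

Dictionaries pay off precisely where plain per-message compression fails: tiny payloads that share structure across messages rather than within any single one.
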
Latency, Throughput & Fault Tolerance: Designing the Arroyo Streaming Engine
Micah Wylde (Arroyo)

Arroyo is a distributed, stateful stream processing engine written in Rust. It combines predictable millisecond-latency processing with the throughput of a high-performance batch query engine—on top of a distributed checkpointing implementation that provides fault tolerance and exactly-once processing. These design goals are often in tension: increasing throughput generally comes at the expense of latency, and consistent checkpointing can introduce periodic latency spikes while we wait for alignment and I/O. In this talk, I will cover the distributed architecture and implementation of Arroyo, including the core Arrow-based dataflow engine, algorithms for stateful windowing and aggregates, and the Chandy-Lamport inspired distributed checkpointing system.

You’re Doing It All Wrong
Michael Stonebraker (MIT, Postgres creator)

In this talk, we consider business data processing applications, which have historically been written for a three-tier architecture. Two ideas totally upset this applecart.

Idea #1: The Cloud. All enterprises are moving everything possible to the cloud as quickly as possible. In this new environment, you are highly encouraged to use a cloud-native architecture, whereby your system is composed of distributed functions, working in parallel, and running on a serverless (and stateless) platform like AWS Lambda or Azure Functions. You program your application as a workflow of “steps.” To make systems resilient to failures, you require a separate state machine and workflow manager (e.g., AWS Step Functions, Airflow, etc.). If you use this architecture, then you don’t pay for resources when your application is idle, often a major benefit. Depending on the platform, you may also get automatic resource elasticity and load balancing; additional major benefits.

Idea #2: Leverage the DBMS. Obviously, your data belongs in a DBMS. However, by extension, so does the state of your application. Keeping track of application state in the DBMS allows one to provide once-and-only-once execution semantics for your workflow. One can also use the database concept of “sagas” to allow multi-transaction applications to be done to completion or not at all. Furthermore, to go an order of magnitude faster than AWS Lambda, you need to collocate your application and the DBMS. The fastest alternative is to run your application inside the DBMS using stored procedures (SPs). However, it is imperative to overcome SP weaknesses, specifically the requirement of a different language (e.g., PL/SQL) and the absence of a debugging environment. The latter can be accomplished by persisting the database log and allowing “time travel debugging” for SPs. The former can be supported by coding SPs in a conventional language such as TypeScript. Extending this idea to the operating environment, one can time travel the entire system, thereby allowing recovery to a previous point in time when disasters happen (errant programs, adversary intrusions, ransomware, etc.). I will discuss one such platform (DBOS) with all of the above features. In my opinion, this is an example of why “you are doing it all wrong.”

Taming Discard Latency Spikes
Patryk Wróbel (ScyllaDB)

Discover an interesting lesson about the impact of discarding files on read and write latency on modern NVMe SSDs, learned while fixing a real-world problem in ScyllaDB. Dive into how TRIM requests are issued when online discard is enabled for the XFS file system, the problems that may occur, and possible solutions.

Redis Alternatives Compared
Peter Zaitsev (Percona)

In my talk, I will delve into a variety of Redis alternatives, providing an unbiased analysis that encompasses emerging forks like Valkey and Redict, established contenders such as DragonflyDB and KeyDB, and unique options like Microsoft Garnet and Redka. My presentation will cover critical aspects of these alternatives, including licensing models and their implications, comparisons of feature sets and functionality, governance and community support structures, and performance considerations tailored to different use cases. You will leave with a clearer understanding of how each alternative could meet specific needs, insights into open source compliance and licensing, and an appreciation of the importance of performance and support options in choosing the right solution. Join me to clarify your options and strategize your approach in the ever-changing world of Redis alternatives.

Database Drivers: Performance Perspectives
Piotr Sarna (poolside)

This talk explains how to get the most out of database drivers by understanding their design and potential. It’s a technical deep dive into how database drivers work underneath, and how to adjust their performance to your expectations.

Using eBPF Off-CPU Sampling to See What Your Databases Are Really Waiting For
Tanel Poder

At P99 CONF 23, I introduced the general concept of using eBPF-populated Task State Arrays to keep track of all Linux applications’ (including database engines’) thread states and activity without relying on the built-in instrumentation of the application. For example, the “wait events” built into database engines are not perfect; some voluntary waits (system calls) are not properly instrumented in all database engines. There are also other involuntary waits caused by OS-level issues, like memory allocation stalls, CPU queuing, and task scheduler glitches. This year, I will show the latest eBPF-based “xcapture” tool in practical use, measuring where MySQL, Postgres, and DuckDB really spend their time, both when on CPU and when sleeping. All this can be done without having to change any source code of the database engine or the applications running on it.

Java Heap Memory Optimization to Improve P99 Query Latency at LinkedIn Scale
Vivek Iyer Vaidyanathan (LinkedIn)

Apache Pinot is a real-time, distributed OLAP database designed to serve low-latency SQL queries at high throughput. It was built and open-sourced by LinkedIn and powers many site-facing use cases for low-latency, real-time analytics. Pinot servers, the workhorses of SQL query processing, store table shards on local SSDs and memory-map the columnar data buffers (data, indexes, etc.). In some specialized use cases where we have a P99 query SLA under 100 ms, the column buffers are loaded on the Java heap as opposed to off-heap (memory-mapped). The data in these on-heap column buffers is characterized by high cardinality, featuring a large number of unique objects alongside a notable abundance of duplicate objects. Duplicate objects waste almost 20% of the JVM heap per host. The memory-intensive nature of our on-heap dictionary indexes leads to high Java heap usage, resulting in spiky P99 latencies due to the unpredictable nature of Java garbage collection. This talk will challenge the conventional notion that discourages the use of interning methodologies and showcase how Pinot production deployments at LinkedIn saw 20% heap savings per host while improving P99 query latencies by 35%, using a home-grown, efficient strategy of FALF Interning – Fixed-Size Array Lock-Free Interning. Designed as a small, fixed-size, open-hashmap-based object cache that deduplicates objects opportunistically, these interners work on all object types and are 17% faster than traditional interners. In this talk, we will present how we used JXRay memory analysis to discover the problem, along with the design, implementation, and the P99 query latency improvements we observed in production at LinkedIn scale. We will discuss the general challenges of solving the duplicate-objects problem in Java-based systems where the traditional methods of tuning JVM parameters or employing native or Guava interners don’t work.
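
The talk’s implementation is in Java, but the core trick fits in a few lines in any language. Below is a hedged Rust sketch of a fixed-size, lock-free, opportunistic interning cache in the spirit of FALF interning. For simplicity it leaks entries (so references are 'static); the production Java interner described in the talk manages memory and supports arbitrary object types.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

const SLOTS: usize = 1 << 10; // small, fixed size: the cache never grows

/// A fixed-size array of slots, probed lock-free. On a hit we return the
/// canonical copy; on a miss we try once to install ours with a single CAS
/// and never retry -- deduplication is opportunistic, never blocking.
struct FalfInterner {
    slots: Vec<AtomicPtr<String>>,
}

impl FalfInterner {
    fn new() -> Self {
        FalfInterner {
            slots: (0..SLOTS).map(|_| AtomicPtr::new(ptr::null_mut())).collect(),
        }
    }

    fn intern(&self, s: &str) -> &'static str {
        let mut h = DefaultHasher::new();
        s.hash(&mut h);
        let slot = &self.slots[h.finish() as usize % SLOTS];

        let cur = slot.load(Ordering::Acquire);
        if !cur.is_null() {
            // SAFETY: entries are never freed in this sketch.
            let existing = unsafe { &*cur };
            if existing == s {
                return existing.as_str(); // hit: one canonical copy
            }
        }
        // Miss (or collision): leak a copy and try once to publish it. If
        // another thread races us, we still return our own copy; only the
        // memory saving is lost, never correctness.
        let new = Box::into_raw(Box::new(s.to_owned()));
        let _ = slot.compare_exchange(cur, new, Ordering::AcqRel, Ordering::Acquire);
        unsafe { (*new).as_str() }
    }
}

fn main() {
    let interner = FalfInterner::new();
    let a = interner.intern("user-123");
    let b = interner.intern("user-123");
    assert!(ptr::eq(a, b)); // same pointer: the duplicate was eliminated
}
```
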
Join the Conference Online – It’s Free