
Common Performance Pitfalls of Modern Storage I/O

Performance pitfalls when aiming for high-performance I/O on modern hardware and cloud platforms.

Over the past two blog posts in this series, we’ve explored real-world performance investigations on storage-optimized instances with high-performance NVMe RAID arrays. Part 1 examined how fair-queue issues in the Seastar IO Scheduler inadvertently throttle bandwidth to a fraction of the advertised instance capacity. Part 2 dove into the filesystem layer, revealing how XFS’s block size and alignment strategies could force read-modify-write operations and cut a big chunk out of write throughput.

This third and final part synthesizes those findings into a practical performance checklist. Whether you’re optimizing ScyllaDB, building your own database system, or simply trying to understand why your storage isn’t delivering the advertised performance, understanding these three interconnected layers – disk, filesystem, and application – is essential. Each layer has its own assumptions about what constitutes an optimal request. When those expectations misalign, the consequences cascade, amplifying latency and degrading throughput.

This post presents a set of subtle pitfalls we’ve encountered, organized by layer. Each includes concrete examples from production investigations as well as actionable mitigation strategies.

Mind the Disk — Block Size and Alignment Matter

Physical and Logical sector sizes

Modern SSDs, particularly high-performance NVMe drives, expose two critical properties: physical sector size and logical sector size. The logical sector size is what the operating system sees. It typically defaults to 512 bytes for backward compatibility, though many modern drives are also capable of reporting 4K. The physical sector size reflects how data is actually stored on the flash chips. It’s the size at which the drive delivers peak performance.
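On Linux, both values can be read from sysfs. Here is a minimal sketch (the device name and the `sys_root` parameter are illustrative, and, as we found on i7i, the firmware-reported physical size is not always trustworthy):

```python
import os

def sector_sizes(device: str, sys_root: str = "/sys/block"):
    """Return (logical, physical) sector sizes in bytes for a block device,
    as reported by the kernel via sysfs (e.g. device="nvme0n1")."""
    queue = os.path.join(sys_root, device, "queue")
    with open(os.path.join(queue, "logical_block_size")) as f:
        logical = int(f.read())
    with open(os.path.join(queue, "physical_block_size")) as f:
        physical = int(f.read())
    return logical, physical
```

Cross-check the physical value against a benchmark before trusting it; a drive that reports 512 bytes but performs markedly better with 4K requests is telling you something.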
When you submit a write request that doesn’t align to the physical sector size, the SSD controller must:

- Read the entire physical page containing your data
- Modify the portion you’re writing to
- Write the entire sector back to an empty/erased flash page

In our investigation of AWS i7i instances, we were bitten by exactly this problem. IOTune, our disk benchmarking tool, was using 512-byte requests to run the benchmarks because the firmware in these new NVMes was incorrectly reporting the physical sector size as 512 bytes when it was actually 4 KB. This made us measure disks as having up to 25% less read IOPS and up to 42% less write IOPS. Since those measurements were used to configure the IO Scheduler, we ended up driving the disks as if they were a less performant model. That’s a very good way for your business to lose money. 🙂

If you’re wondering why I only explained the reasoning for slow writes with requests unaligned to the physical sector size, it’s because we still don’t fully understand why read requests are also hit by this issue. It’s an open question, and we’re still researching some leads (which I hope will get us an answer).

Sustained performance

If your software relies on a disk hitting a certain performance number, account for the fact that even dedicated, provisioned NVMes have peak and sustained performance values. It’s well known that elastic storage (like EBS, for instance) has baseline performance and peak performance, but it’s less intuitive for dedicated NVMes to behave like this. Measuring disk performance with longer, 10-minute-plus workloads might yield 3-5% lower IOPS/throughput, which lets your software better predict how the disk behaves under sustained load.

Mind the RAID

If you’re using RAID0 for NVMe arrays, be aware that your app’s parallel architecture might resonate with the way requests end up distributed over the RAID array. RAIDs are made out of chunks and stripes.
A chunk is a block of data within a single disk; when the chunk size is exceeded, the driver moves on to the next disk in the array. A stripe contains one chunk from each disk, written sequentially. A RAID0 with 2 disks gets written like this: stripe0: chunk0 (disk0), chunk1 (disk1); then stripe1, stripe2, and so on. Filesystems usually align files to the RAID stripe boundary. Depending on your application’s write pattern, you could end up stressing some of the disks more than others, and not leveraging the full power of the RAID array.

Key Takeaways for the Disk Layer

- Detection: Always verify physical and logical sector sizes if you suspect your issue might be related to this. Don’t blindly trust firmware-reported values; cross-check with benchmarking tools, and adjust your app and filesystem to use the physical sector size if possible.
- Measurement discipline: Increase measurement time when benchmarking disks. Even dedicated NVMes can have baseline vs. peak performance.
- RAID awareness: A RAID array is made out of blocks, addresses, and drivers managing them. It’s not a magic endpoint that will simply amplify your N-drive array into N times the performance of one disk. Its architecture has its own set of assumptions and limitations which, together with the filesystem’s, might interfere with your app’s.

Mind the Filesystem — Block Size, Alignment, and Metadata Operations

Filesystem Block Size and Request Alignment

Every filesystem has its own block size, independent of the disk’s physical sector size. XFS, for instance, can be formatted with block sizes ranging from 512 bytes to 64 KB. In ScyllaDB, we used to format with 1K block sizes because we wanted the fastest possible commitlog writes at sizes >= 1K, and for older SSDs the physical sector size was 512 bytes. On modern 4K-optimized NVMe arrays, this choice became a liability: we realized that 4K block sizes would bring us a lot of extra write throughput.
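Circling back to the RAID layer for a moment, the chunk/stripe addressing described above can be sketched as follows (the chunk size and disk count are illustrative; 512K is a common mdraid default):

```python
CHUNK_SIZE = 512 * 1024  # bytes per chunk (illustrative; a common mdraid default)
NUM_DISKS = 2            # RAID0 over two NVMe drives, as in the example above

def raid0_location(offset: int):
    """Map a byte offset in a RAID0 array to (disk, stripe, offset_in_chunk)."""
    chunk_index = offset // CHUNK_SIZE   # which chunk, counting across the array
    disk = chunk_index % NUM_DISKS       # chunks rotate round-robin over the disks
    stripe = chunk_index // NUM_DISKS    # one stripe holds one chunk per disk
    return disk, stripe, offset % CHUNK_SIZE
```

Note the resonance risk: a workload that always writes chunk-sized units at stripe-aligned offsets lands every request on disk 0 and leaves the other disks idle.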
This filesystem-level block size affects two critical aspects: how data is stored and aligned on disk, and how metadata is laid out.

Here’s a concrete example. When Seastar issues 128 KB sequential writes to a file on 1K-block XFS, the filesystem doesn’t seem to align these to 4K boundaries (perhaps because the SSD firmware reported a physical sector size of 512 bytes). Using blktrace to inspect the actual disk I/O, we observed that approximately 75% of requests were aligned only to 1K or 2K boundaries. The drive controller would split each of these requests into at most 3 parts: a head, a 4K-aligned core, and a tail. For the head and tail, the disk would perform read-modify-write, which is very slow – slow enough to become the dominating factor for the entire request. Reformatting the filesystem with a 4K block size completely transformed the alignment distribution: 100% of requests became 4K-aligned, bringing back a lot of throughput.

Filesystem Metadata Operations and Blocking

Consider this: when a file grows, XFS must:

- Allocate extents from the freespace tree, requiring B-Tree modifications and mutex locks
- Update inode metadata to reflect the new file size
- Flush metadata periodically to ensure durability
- Update access/change times (ctimes) on every write

Each of these operations can block subsequent I/O submissions. In our research, we discovered that the RWF_NOWAIT flag (requesting non-blocking async I/O submission) was insufficiently effective when metadata operations were queued. Writes would be re-submitted from worker threads rather than the Seastar reactor, adding context-switch overhead and latency spikes.

When the final size of a file is known, it is beneficial to pre-allocate or pre-truncate the file to that size using fallocate() or ftruncate(). This practice dramatically improves the alignment distribution across the file and helps amortize the overhead associated with extent allocation and metadata updates.
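A minimal sketch of the pre-sizing idea on Linux/Unix (the path and size are illustrative; `os.posix_fallocate` reserves the extents up front, while `os.ftruncate` merely sets the size):

```python
import os

def create_presized(path: str, final_size: int, allocate: bool = True) -> None:
    """Create a file pre-sized to its known final length before writing to it."""
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o644)
    try:
        if allocate:
            # fallocate path: extents are allocated now, not during later writes
            os.posix_fallocate(fd, 0, final_size)
        else:
            # truncate path: only the size is set; cheaper, no extents reserved
            os.ftruncate(fd, final_size)
    finally:
        os.close(fd)
```
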
While effective, fallocate() can be an expensive operation. It can impact latency, especially if the allocation groups are already busy. Truncation is significantly cheaper, and this alone can offer substantial benefits. Another helpful technique is an approach like Seastar’s sloppy_size=true, where a file’s size is doubled via truncation whenever the current limit is reached.

Key Takeaways for the Filesystem Layer

- Format correctly: Format XFS (or other filesystems) with a block size matching your SSD’s physical sector size if possible; most modern NVMe drives are 4K-optimized. Go lower only if there are strong restrictions – such as an inability to issue 4K-aligned reads, potential read amplification, or benchmarks showing that your disk performs better with a smaller request size.
- Pre-allocation: When file sizes are known, pre-truncate or pre-allocate files to their final size using `fallocate()` or truncation. This amortizes extent allocation overhead and ensures uniform alignment across the file.
- Metadata flushing: Understand the filesystem’s metadata update behavior. File size and access time updates are expensive. If you’re doing AIO, use RWF_NOWAIT if possible and verify it works by tracing the calls with `strace`.
- Data deletion: Deletion has also proved to have very expensive side effects. On older-generation NVMes, for example, TRIM requests that accumulated in the filesystem would flush all at once, overloading the SSD controller and causing huge latency spikes for the apps running on that machine.

Mind the Application — Parallelism and Request Size Strategy

Parallelism tuning

Modern NVMe storage devices can handle thousands of requests in flight simultaneously. Factors like the internal queue depth and the number of outstanding requests the device can accept determine the maximum achievable bandwidth and latency when properly loaded. However, application-level concurrency (threads, fibers, async tasks) must be sufficient to keep these queues full.
Generally, the bandwidth vs. latency dependency has two regimes. If you measure latency and bandwidth while increasing the app’s parallelism, latency stays constant and bandwidth grows – as long as the disk’s internal parallelism is not saturated. Once the disk is saturated for throughput, bandwidth stops growing (or grows very little) while latency scales almost linearly with the increase in load.

The relationship between throughput, latency, and queue depth follows Little’s Law:

Average Queue Depth = Throughput * Average Latency

For a device delivering 14.5 GB/s with 128K requests, the number of requests per second the device can handle is 14.5 GB/s divided by 128K – roughly 113k req/s. If an individual request’s latency is, for example, 1 ms, the device queue needs at least 113 outstanding requests to sustain 14.5 GB/s.

This has practical implications: if you tune your application for, say, 40 concurrent requests and later upgrade to a faster device, you’ll need to increase concurrency or you’ll under-utilize the new device. Conversely, if you over-provision concurrency, you risk queue saturation and latency spikes, because the SSD controllers only have so much compute power themselves.

In ScyllaDB, concurrency is expressed as the number of shards (1 per CPU core) and fibers submitting I/O in parallel. Because we want the database to perform exceptionally even in mixed workloads, we use Seastar’s IO Scheduler to modulate the number of requests we send to storage. The Scheduler is configured with a latency goal it needs to meet and with the IOPS/bandwidth the disk can handle at peak performance. In real workloads, it’s very difficult to match the load with a static execution model you’ve built yourself, even with thorough benchmarking and testing. Hardware performance varies, and read/write patterns are usually surprising.
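The Little’s Law sizing described earlier is easy to check with quick arithmetic (the 14.5 GB/s and 1 ms figures are from the example; KB/GB are decimal here, matching the rough numbers in the text):

```python
def required_queue_depth(bandwidth_bps: float, request_size: int,
                         latency_s: float) -> float:
    """Little's Law: queue depth = throughput (req/s) * average latency (s)."""
    iops = bandwidth_bps / request_size  # requests per second at full bandwidth
    return iops * latency_s              # outstanding requests needed

# 14.5e9 B/s / 128,000 B ≈ 113k req/s; at 1 ms latency ≈ 113 requests in flight
depth = required_queue_depth(14.5e9, 128_000, 1e-3)
```
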
Your best bet is to use an IO scheduler built to extract the maximum potential of the hardware while still obeying a latency limit you’ve set.

Request size strategy

Many small I/O requests often fail to saturate an NVMe drive’s throughput, because bandwidth is the product of IOPS and request size, and small requests hit an IOPS/latency/CPU ceiling before they hit the PCIe/NAND bandwidth ceiling. For a given request size, bandwidth is essentially IOPS x request size. 4K I/Os, for instance, would need extremely high IOPS to reach GB/s-class throughput. A concrete example: 350k IOPS (typical for an AWS i7i.4xlarge) at 4K request size gets you only about 1.4 GB/s (1.3 GiB/s), while an i7i.4xlarge can easily do 3.5 GB/s. This looks slow in throughput terms even though the device might be doing exactly what the workload allows.

NVMe devices reach peak bandwidth when they have enough outstanding requests (queue depth) to keep the many internal flash channels busy. If your workload issues many small requests but mostly one at a time (or with only a couple in flight), the drive spends time waiting between completions instead of streaming data, and bandwidth stays low.

Another important aspect is that with small I/O sizes, the fixed per-I/O costs (syscalls, scheduling, interrupts, copying, etc.) can dominate latency. The kernel/filesystem path then becomes the limiter even before the NVMe hardware does. This is one reason real applications often can’t reproduce vendors’ peak numbers. In practice, you usually need high concurrency (multiple threads/fibers/jobs) and true async I/O to maintain a deep in-flight queue and approach device limits. But, to repeat an idea from the section above, remember that the storage controller itself has limited compute capacity; overloading it results in higher latencies for both read and write paths.

It is important to find the right request size for your application’s workload.
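To make the IOPS x request-size arithmetic concrete, here is a quick sketch (the 26,700 IOPS figure is simply back-solved as the rate needed for ~3.5 GB/s at 128K requests):

```python
def bandwidth_gbps(iops: float, request_size: int) -> float:
    """Bandwidth is essentially IOPS x request size (decimal GB/s)."""
    return iops * request_size / 1e9

# The IOPS ceiling caps 4K requests well below the device's bandwidth ceiling:
small = bandwidth_gbps(350_000, 4096)    # ~1.43 GB/s at the 350k IOPS ceiling
large = bandwidth_gbps(26_700, 131_072)  # ~3.5 GB/s from far fewer 128K requests
```
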
Go as big as your latency expectations will allow. If your workload is not very predictable, keep in mind that you’ll most likely need some dynamic logic that adjusts the I/O patterns so you can squeeze every last drop of performance out of that expensive storage.

Key Takeaways for the Application Layer

- Parallelism tuning: Benchmark your specific hardware and workload to find the optimal fiber/thread/shard count, and watch the latency. As you increase parallelism, you’ll see throughput increase; at some point, throughput plateaus and latency starts to climb. That’s the sweet spot you’re looking for.
- Request size strategy: Picking the right request size is important. Seastar defaults to 128K for sequential writes, which works well on most modern storage (but validate it via benchmarking). If your device prefers larger or smaller requests for throughput, the cost is latency – so design your workload accordingly. In practice, we’ve never seen SSDs that cannot be saturated for throughput with 128K requests, and ScyllaDB achieves sub-millisecond latencies with this request size.

Conclusion

The path from application buffer to persistent storage on modern hardware is complex, and performance issues are often counterintuitive and very difficult to track down. A 1 KB filesystem block size doesn’t “save space” on 4K-optimized SSDs; it wastes throughput by forcing read-modify-write operations. A perfectly tuned IO Scheduler can still throttle requests if given incorrect disk properties. Sufficient parallelism doesn’t guarantee high throughput if request sizes are too small to fill device queues.

By minding the disk, the filesystem, and the application – and by understanding how they interact – you can take full advantage of modern storage hardware and build systems that reliably deliver the performance the storage vendor advertised.