The most surprising thing about storage latency is how little it has to do with the speed of the physical disk itself, and how much it’s dictated by the software stack it sits behind.
Let’s watch a read operation unfold, not in theory, but in practice. Imagine a simple redis-cli GET mykey command.
> GET mykey
"myvalue"
That "myvalue" didn’t just appear. When the value isn’t already resident in the application’s own memory, the operating system’s page cache is checked next. If it’s not there either, the OS needs to fetch it from disk. This involves a system call (read()) which transitions from user space to kernel space. The kernel then consults the filesystem (say, ext4 or XFS). The filesystem, in turn, needs to find the data on the block device. This involves traversing its metadata structures (inodes, directory entries, block maps) to locate the physical block addresses on disk.
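The user-space half of this journey can be sketched in a few lines of Python. Each `os.*` call below is a system call crossing from user space into the kernel; the file path is purely illustrative.

```python
import os
import time

# Hypothetical demo file standing in for data an application would read.
PATH = "/tmp/read_path_demo.txt"
with open(PATH, "wb") as f:
    f.write(b"myvalue")

fd = os.open(PATH, os.O_RDONLY)   # open(2): the filesystem resolves the path to an inode
t0 = time.perf_counter_ns()
data = os.pread(fd, 7, 0)         # pread(2): served from the page cache if resident,
                                  # otherwise faulted in through the block layer
elapsed_ns = time.perf_counter_ns() - t0
os.close(fd)
```

On a warm cache this read completes in single-digit microseconds; the expensive path described above only appears when the page cache misses.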
Now, the request hits the block layer, which manages I/O requests to the storage device. It might queue, merge, or reorder these requests to optimize for the underlying hardware. Finally, the request is translated into a command for the storage driver. If it’s a traditional spinning disk, this is where things get slow. The driver sends a SCSI or ATA command. The disk arm has to physically move to the correct track, and the platter has to rotate to the correct sector. This mechanical movement is measured in milliseconds.
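The mechanical cost is easy to put numbers on. A back-of-the-envelope sketch for a 7200 RPM disk (the seek figure is an assumed average, not a measured one):

```python
rpm = 7200
full_rev_ms = 60_000 / rpm           # 8.33 ms for one full platter revolution
avg_rotational_ms = full_rev_ms / 2  # on average, wait half a revolution for the sector
avg_seek_ms = 4.0                    # illustrative average arm movement for a 7200 RPM drive
total_ms = avg_seek_ms + avg_rotational_ms  # ~8 ms before a single byte transfers
```

Eight milliseconds per random read caps a single spindle at roughly 120 such reads per second, no matter how fast the rest of the stack is.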
If it’s an NVMe drive, the journey is vastly different and much faster. The NVMe controller exposes pairs of command queues (Submission Queues) and completion queues (Completion Queues) that bypass much of the traditional kernel I/O path. The NVMe driver interacts directly with these queues, sending commands asynchronously. The drive itself is flash-based, with no moving parts: accessing data is electronic, measured in tens of microseconds. The latency gap between a spinning disk and NVMe is often 100x or more.
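Conceptually, a submission/completion queue pair works like the toy model below. This is not the real NVMe ABI (which involves doorbell registers, DMA, and interrupts), but it shows why a deep queue hides latency: the host can post many commands before reaping a single completion.

```python
from collections import deque

sq = deque()  # submission queue: host posts commands here
cq = deque()  # completion queue: controller posts results here

def submit(cid, lba, nblocks):
    # Host side: enqueue a read command and (conceptually) ring the doorbell.
    sq.append({"cid": cid, "lba": lba, "nblocks": nblocks})

def controller_step():
    # Device side: drain outstanding commands, post a completion for each.
    while sq:
        cmd = sq.popleft()
        cq.append({"cid": cmd["cid"], "status": 0})  # 0 = success

# Queue depth 4: all four reads are in flight before any completion is reaped.
for i in range(4):
    submit(cid=i, lba=i * 8, nblocks=8)
controller_step()
completed = [c["cid"] for c in cq]
```

With a queue depth of 1, each command would pay the full round-trip latency serially; with many commands outstanding, those latencies overlap.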
The key levers you control aren’t just the "speed" of your drive, but:
- Filesystem Choice and Mount Options: `noatime` is crucial. Every file access updates the access timestamp, which is a write operation; `noatime` prevents this, which matters most on busy read-heavy workloads. On ext4, `commit=60` (or higher) lets the journal batch metadata writes, reducing their frequency.
- I/O Scheduler: For spinning disks, `mq-deadline` or `bfq` can help (`cfq` has been removed from modern kernels). For NVMe, `none` is usually best, as the drive has its own sophisticated internal scheduler. You can check and set this with `cat /sys/block/sdX/queue/scheduler` and `echo none > /sys/block/sdX/queue/scheduler`.
- Block Size: A larger filesystem block size can mean fewer I/O operations for large reads, but it can also waste space for small files. It’s a trade-off.
- Direct I/O: For specific applications (like databases) that manage their own caching and want to bypass the OS page cache, `O_DIRECT` can reduce overhead and avoid double-buffering.
- NVMe Specifics: Queue depth and number of queues are critical. A higher queue depth allows more commands to be outstanding, hiding latency. `cat /sys/block/nvme0n1/queue/nr_requests` shows the current queue depth, and `nvme id-ctrl /dev/nvme0` reports the controller’s capabilities. Applications can often tune these via their own parameters or by using libraries that expose them.
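To make the Direct I/O lever concrete, here is a minimal sketch of an `O_DIRECT` read in Python. `O_DIRECT` requires the buffer, offset, and length to be aligned to the device's logical block size, so an anonymous `mmap` (which is page-aligned) serves as the buffer. The path is illustrative, and the code falls back to buffered I/O on filesystems that reject `O_DIRECT` (such as tmpfs).

```python
import mmap
import os

BLOCK = 4096
PATH = "/tmp/direct_io_demo.bin"  # hypothetical demo file

with open(PATH, "wb") as f:
    f.write(b"A" * BLOCK)  # exactly one aligned block of data

def direct_read(path, length):
    """Read `length` bytes bypassing the page cache, if the filesystem allows it."""
    try:
        fd = os.open(path, os.O_RDONLY | os.O_DIRECT)
    except (AttributeError, OSError):
        with open(path, "rb") as f:   # fallback: ordinary buffered read
            return f.read(length)
    try:
        buf = mmap.mmap(-1, length)   # page-aligned buffer, as O_DIRECT demands
        n = os.readv(fd, [buf])       # the kernel transfers straight into our buffer
        return bytes(buf[:n])
    except OSError:
        with open(path, "rb") as f:   # some filesystems reject O_DIRECT only at read time
            return f.read(length)
    finally:
        os.close(fd)

data = direct_read(PATH, BLOCK)
```

Note the alignment bookkeeping: this is exactly the burden databases accept in exchange for controlling their own cache.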
The most surprising aspect for many is how the operating system’s page cache, while designed to speed things up, can also introduce latency when a cache miss occurs and the data needs to be faulted in from disk. This is why tuning the underlying I/O path is so vital, even when you have a fast NVMe drive. The kernel’s block layer, the filesystem, and the device driver all add their own overhead, and optimizing each step can shave off precious microseconds.
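You can watch the fault-in cost yourself. The sketch below (Linux-only; the path is illustrative) uses `posix_fadvise(POSIX_FADV_DONTNEED)` to evict a file's pages from the page cache, then compares a cold read against a warm one:

```python
import os
import time

PATH = "/tmp/cache_demo.bin"  # hypothetical demo file
SIZE = 1 << 16                # 64 KiB

with open(PATH, "wb") as f:
    f.write(os.urandom(SIZE))
    f.flush()
    os.fsync(f.fileno())      # flush dirty pages so DONTNEED can evict them

fd = os.open(PATH, os.O_RDONLY)
os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)  # evict this file's cached pages

t0 = time.perf_counter_ns()
cold = os.pread(fd, SIZE, 0)   # cache miss: goes through the block layer to the device
cold_ns = time.perf_counter_ns() - t0

t0 = time.perf_counter_ns()
warm = os.pread(fd, SIZE, 0)   # cache hit: served straight from RAM
warm_ns = time.perf_counter_ns() - t0
os.close(fd)
```

On a spinning disk the gap between `cold_ns` and `warm_ns` is dramatic; on a fast NVMe drive it narrows, which is precisely why per-step overhead in the I/O path starts to matter.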
The next hurdle you’ll likely face is network latency, which often becomes the bottleneck once storage latency is sufficiently minimized.