Compression is usually about shrinking files, but the real magic is how it lets you pack more data onto your disks, meaning you can serve more users or store more history without buying new hardware.

Let’s see it in action. Imagine you’ve got a 100GB dataset and you’re using zstd with a compression level of 3.

# Simulate creating a large file (note: /dev/zero emits all zeros,
# so this file compresses far better than real-world data would)
dd if=/dev/zero of=large_data.bin bs=1M count=102400

# Compress it with zstd
zstd -3 large_data.bin -o large_data.zst   # -3 is the compression level

After this, large_data.zst might be only 30GB. That’s a 70% space saving! But if your application is reading and writing this data constantly, the CPU cost of compressing and decompressing could be a bottleneck.

Here’s the mental model: Storage compression is a trade-off between CPU usage and storage space. Algorithms like Gzip, Zstd, and LZ4 offer different points on this spectrum.

  • Gzip: The old reliable. It’s relatively slow but ubiquitous, with solid compression ratios. Good for archiving or data that’s written once and read rarely.
  • Zstd (Zstandard): A modern, highly versatile algorithm. It offers a wide range of compression levels, from very fast (approaching LZ4) to very high compression (matching or beating Gzip), with better speed than Gzip at comparable ratios. It’s often the best all-around choice.
  • LZ4: Blazingly fast. Compression and decompression are so quick they often don’t add noticeable latency to I/O operations. The trade-off is a lower compression ratio. Ideal for data that’s frequently accessed and modified.
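To get a feel for these trade-offs on your own data, a quick (and admittedly unscientific) side-by-side run is easy. The sample file below is a stand-in generated for the demo; substitute real data from your workload:

```shell
# Generate ~10MB of text-like data (base64 output compresses moderately well)
base64 /dev/urandom | head -c 10000000 > sample.txt

# Compress with each tool, keeping the original (-k), and time each run
time gzip -k sample.txt                  # -> sample.txt.gz
time zstd -q -k sample.txt               # -> sample.txt.zst (default level 3)
time lz4 -q sample.txt sample.txt.lz4    # -> sample.txt.lz4

ls -l sample.txt*   # compare the sizes alongside the timings above
```

Run this on a file that actually resembles your workload; results on synthetic data only illustrate relative behavior.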

The key levers you control are the algorithm itself and the compression level (if supported by the algorithm).

For a typical web server cache, where reads and writes are frequent, you’d want something fast. LZ4 might be your go-to.

# Example using lz4
lz4 large_data.bin large_data.lz4

If you’re storing historical logs that you rarely access but want to keep for compliance, Gzip or a high-level Zstd would be better.

# Example using gzip
gzip -9 large_data.bin # -9 is max compression
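The high-level Zstd alternative looks similar. A sketch (the demo file below is generated on the spot so the commands stand alone; level 19 is the top of zstd’s standard range):

```shell
# Make a small, highly redundant demo file of repeated log lines
yes "log line kept for compliance" | head -n 100000 > old_logs.txt

# Level 19: slow to compress, strong ratio, still fast to decompress
zstd -19 -q old_logs.txt -o old_logs.txt.zst

ls -l old_logs.txt old_logs.txt.zst
```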

The data’s characteristics are paramount. Text files, JSON, and XML compress very well because they have a lot of redundancy. Binary files, especially already compressed ones like JPEGs or MP4s, often compress poorly or might even expand slightly.
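You can see this difference directly. In the sketch below, a file of zeros (maximum redundancy) and a file of random bytes (no redundancy) go through the same compressor:

```shell
# Highly redundant data vs. incompressible data
dd if=/dev/zero    of=zeros.bin  bs=1M count=10 2>/dev/null
dd if=/dev/urandom of=random.bin bs=1M count=10 2>/dev/null

gzip -k zeros.bin random.bin

ls -l zeros.bin.gz random.bin.gz
# zeros.bin.gz shrinks to a few KB; random.bin.gz stays around 10MB,
# possibly a touch larger than the input due to format overhead
```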

When you’re evaluating Zstd, remember that its levels are not linear. Level 1 is extremely fast, while level 10 is significantly slower but compresses noticeably better. Pushing from level 10 toward the maximum (19 by default, 22 with --ultra) yields diminishing returns on compression ratio while drastically increasing CPU time. Most workloads find their sweet spot between levels 3 and 9.
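zstd ships with a built-in benchmark mode that makes finding that sweet spot easy: -b<level> benchmarks a level, -e<level> extends the sweep to a range, and -i<seconds> caps the time spent per level. A sketch (the sample file is generated just for the demo):

```shell
# Text-like sample data for benchmarking
base64 /dev/urandom | head -c 5000000 > bench_sample.txt

# Sweep levels 1 through 9; zstd prints compression speed,
# decompression speed, and ratio for each level
zstd -b1 -e9 -i1 bench_sample.txt
```

Benchmark against your real data: ratios and speeds vary widely with content.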

The next concept to explore is block-level compression versus file-level compression, and how databases and file systems implement these internally.
