Storage deduplication is surprisingly effective at shrinking data footprints, and it works not by transforming the data but by refusing to write the same block twice: every repeated block is replaced with a small pointer to a copy that is already on disk.

Let’s see it in action. Imagine we have a block of data:

0x1A2B3C4D5E6F7890...

When this block arrives, the deduplication system first calculates a hash of it. Let’s say the hash is ABCDEF123456. The system checks its index to see if it’s already seen data with this hash.

  • If the hash is NOT in the index: The system writes the actual data block to storage and adds the hash ABCDEF123456 to its index, pointing to the location of the new data block.
  • If the hash IS in the index: The system discards the incoming data block. Instead of writing the data, it writes a pointer to the existing data block that has the same hash.

This pointer is much smaller than the original data block. For example, if a data block is 4 MB and a pointer is just 8 bytes, we save nearly the full 4 MB (all but 8 bytes) for every subsequent identical block. This is how deduplication achieves its magic.
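The write path above can be sketched as a toy in-memory model (class and field names here are illustrative, not taken from any real product):

```python
import hashlib

class DedupStore:
    """Minimal sketch of an inline-deduplicating write path."""

    def __init__(self):
        self.index = {}     # hash -> location of the unique chunk
        self.chunks = []    # physical storage: unique chunks only
        self.pointers = []  # logical write log: one pointer per incoming block

    def write(self, block: bytes) -> int:
        digest = hashlib.sha256(block).hexdigest()
        if digest not in self.index:
            # Hash NOT in the index: store the data, record its location.
            self.index[digest] = len(self.chunks)
            self.chunks.append(block)
        # Hash IS in the index (or was just added): the logical write
        # is only a pointer to the stored chunk.
        self.pointers.append(self.index[digest])
        return self.index[digest]

    def read(self, logical_id: int) -> bytes:
        return self.chunks[self.pointers[logical_id]]
```

Writing the same 4 MB block twice stores it once; the second write costs only one index lookup and one pointer.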

The core problem deduplication solves is the massive waste of storage space due to identical or near-identical data. Think about:

  • Multiple Virtual Machines: Each VM often has a base operating system image, leading to millions of identical blocks across many VMs.
  • Backups: Daily backups of the same data will contain huge amounts of unchanged blocks.
  • User Files: Many users might store the same popular software installers or documents.

Internally, deduplication relies on a few key components:

  1. Chunking: Data is broken down into variable-sized or fixed-sized blocks (chunks). Variable-sized (content-defined) chunking is generally more effective because chunk boundaries follow the content itself, so an insertion or deletion shifts only the nearby boundaries instead of invalidating every chunk that follows.
  2. Hashing: A cryptographic hash function (like SHA-256) is applied to each chunk to generate a unique fingerprint. Collisions (different data producing the same hash) are astronomically unlikely with good hash functions.
  3. Indexing: A database or index stores the hashes of all unique chunks and their physical locations on the storage media. This is the lookup table.
  4. Storage: The actual unique data chunks are stored, often in a content-addressable manner where the data’s hash is its address.
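The content-addressable storage in step 4 can be illustrated with a toy in-memory store in which a chunk’s hash literally is its address (the `put`/`get` names are hypothetical):

```python
import hashlib

# Content-addressable store: the address of a chunk IS its hash,
# so storing the same bytes twice is a no-op.
cas: dict[str, bytes] = {}

def put(chunk: bytes) -> str:
    addr = hashlib.sha256(chunk).hexdigest()
    cas[addr] = chunk  # idempotent: identical data maps to an identical address
    return addr

def get(addr: str) -> bytes:
    return cas[addr]
```

Because the address is derived from the content, deduplication falls out of the addressing scheme itself: two writers storing the same chunk independently end up at the same address.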

The primary levers you control are:

  • Chunking Granularity: Smaller chunks mean more metadata but potentially higher deduplication ratios. Larger chunks mean less metadata but might miss opportunities if only a small part of a large block changes.
  • Inline vs. Post-Process: Inline deduplication happens as data is written, saving space immediately but potentially impacting write performance. Post-process deduplication happens later, which has less impact on writes but doesn’t free up space until the process runs.
  • Compression: Deduplication is often combined with compression. Compressing chunks before hashing can work with a deterministic compressor, but it couples deduplication to the exact compressor version and settings: change either, and identical uncompressed blocks stop matching. For that reason, systems typically hash the uncompressed chunk first and compress only the unique chunks they actually store.
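The granularity lever lends itself to a back-of-the-envelope calculation. Assuming an index entry of 40 bytes per chunk (a 32-byte SHA-256 digest plus an 8-byte location — an illustrative figure, not from any specific product):

```python
# Metadata cost of the chunk-size lever for 1 TiB of logical data.
TOTAL = 1 << 40        # 1 TiB
ENTRY = 32 + 8         # per-chunk index entry: SHA-256 digest + 8-byte location

for chunk_size in (4 << 10, 64 << 10, 4 << 20):  # 4 KiB, 64 KiB, 4 MiB
    entries = TOTAL // chunk_size
    meta = entries * ENTRY
    print(f"{chunk_size >> 10:>6} KiB chunks -> {entries:>12,} entries, "
          f"{meta / (1 << 30):.2f} GiB of index")
```

At 4 KiB chunks the index alone is about 10 GiB per TiB of data, which is why the index lookup path and its memory footprint dominate deduplication system design.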

The surprising part is how deduplication handles data that isn’t exactly identical but has small changes. Modern deduplication algorithms, particularly those using variable-sized chunking, can be remarkably resilient to minor edits. If you insert a single byte into a large file, only the chunk containing that byte (and perhaps a neighbor or two, until chunk boundaries resynchronize) will be new. The vast majority of the file will still hash to previously seen values and be deduplicated. This is a key reason why it’s so effective for backups and VM environments.
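This resilience can be demonstrated with a deliberately naive content-defined chunker. A real system would use a rolling hash such as Rabin fingerprints or Gear; this sketch recomputes a windowed hash at every offset purely for clarity, which is far too slow for production use:

```python
import hashlib
import random

def cdc_chunks(data: bytes, window: int = 16, mask: int = 0x3FF) -> list[bytes]:
    """Toy content-defined chunking: cut a chunk wherever the hash of the
    last `window` bytes has its low bits all zero (avg chunk ~= mask+1 bytes)."""
    chunks, start = [], 0
    for i in range(window, len(data)):
        h = int.from_bytes(hashlib.sha256(data[i - window:i]).digest()[:4], "big")
        if h & mask == 0:            # content decides the boundary, not position
            chunks.append(data[start:i])
            start = i
    chunks.append(data[start:])
    return chunks

random.seed(0)
original = bytes(random.randrange(256) for _ in range(32768))
edited = original[:16000] + b"X" + original[16000:]  # insert one byte mid-file

old = {hashlib.sha256(c).hexdigest() for c in cdc_chunks(original)}
new = {hashlib.sha256(c).hexdigest() for c in cdc_chunks(edited)}
```

Boundaries before the insertion are untouched, and boundaries after it shift by exactly one byte while the chunk contents stay the same, so nearly every chunk of the edited file still matches a previously seen hash; only the chunk spanning the insertion point is new.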

The next challenge you’ll face is managing the metadata overhead and potential performance implications of the index lookup.

Want structured learning?

Take the full Storage course →