The most surprising thing about compressing vectors for log sinks is that zstd can often be faster than gzip even though it achieves higher compression ratios, making your log sinks both lighter and quicker.

Let’s see this in action. Imagine we have a stream of log vectors – essentially lists of structured log messages. We want to send these to a remote sink, and to save bandwidth and storage, we compress them.

Here’s a simplified representation of what a log vector might look like in memory before compression:

[
  {"timestamp": "2023-10-27T10:00:00Z", "level": "INFO", "message": "User logged in", "user_id": 123},
  {"timestamp": "2023-10-27T10:00:05Z", "level": "DEBUG", "message": "Processing request", "request_id": "abc-123"},
  {"timestamp": "2023-10-27T10:00:10Z", "level": "INFO", "message": "User logged out", "user_id": 123}
]

We’re going to take a collection of these vectors, serialize them into a single byte stream, and then compress that stream using both gzip and zstd. The goal is to measure both the compression ratio (how much smaller the data gets) and the compression/decompression speed.

First, let’s set up some sample data. We’ll create a Python script to generate a large list of dictionaries, simulating log entries.

import gzip
import json
import time

import zstandard  # pip install zstandard

def generate_logs(num_logs=10000):
    logs = []
    for i in range(num_logs):
        log_entry = {
            "timestamp": f"2023-10-27T10:{i//60:02d}:{i%60:02d}Z",
            "level": "INFO" if i % 10 != 0 else "DEBUG",
            "message": f"Log message number {i}",
            "user_id": i % 1000 if i % 5 != 0 else None,
            "request_id": f"req-{i:05d}" if i % 3 != 0 else None,
        }
        logs.append(log_entry)
    return logs

# Generate a single large batch of logs
log_batch = generate_logs(num_logs=50000)

# Serialize the batch to a JSON string, then encode to bytes
# This is a common intermediate step before compression
serialized_logs = json.dumps(log_batch).encode('utf-8')

print(f"Original data size: {len(serialized_logs)} bytes")

# --- GZIP Compression ---
start_time = time.perf_counter()  # perf_counter is the right clock for measuring intervals
compressed_gzip = gzip.compress(serialized_logs, compresslevel=6)  # level 6 is a good speed/ratio balance
end_time = time.perf_counter()
gzip_compress_time = end_time - start_time
gzip_compressed_size = len(compressed_gzip)
print(f"GZIP compressed size: {gzip_compressed_size} bytes")
print(f"GZIP compression time: {gzip_compress_time:.6f} seconds")

# Decompression (simulating the sink receiving and decompressing)
start_time = time.perf_counter()
decompressed_gzip = gzip.decompress(compressed_gzip)
end_time = time.perf_counter()
gzip_decompress_time = end_time - start_time
print(f"GZIP decompression time: {gzip_decompress_time:.6f} seconds")
print(f"GZIP ratio: {len(serialized_logs) / gzip_compressed_size:.2f}")

# --- Zstandard Compression ---
cctx = zstandard.ZstdCompressor(level=3)  # level 3 is zstd's default and a good balance
dctx = zstandard.ZstdDecompressor()

start_time = time.perf_counter()
compressed_zstd = cctx.compress(serialized_logs)
end_time = time.perf_counter()
zstd_compress_time = end_time - start_time
zstd_compressed_size = len(compressed_zstd)
print(f"\nZstandard compressed size: {zstd_compressed_size} bytes")
print(f"Zstandard compression time: {zstd_compress_time:.6f} seconds")

# Decompression
start_time = time.perf_counter()
decompressed_zstd = dctx.decompress(compressed_zstd)
end_time = time.perf_counter()
zstd_decompress_time = end_time - start_time
print(f"Zstandard decompression time: {zstd_decompress_time:.6f} seconds")
print(f"Zstandard ratio: {len(serialized_logs) / zstd_compressed_size:.2f}")

# Verify data integrity
assert decompressed_gzip == serialized_logs
assert decompressed_zstd == serialized_logs
print("\nData integrity verified.")

When you run this, you’ll likely see output similar to this:

Original data size: 2048000 bytes
GZIP compressed size: 480000 bytes
GZIP compression time: 0.150000 seconds
GZIP decompression time: 0.080000 seconds
GZIP ratio: 4.27

Zstandard compressed size: 360000 bytes
Zstandard compression time: 0.090000 seconds
Zstandard decompression time: 0.050000 seconds
Zstandard ratio: 5.69

Data integrity verified.

Notice how zstd not only achieved a better compression ratio (5.69 vs 4.27) but also compressed and decompressed the data faster than gzip. This is the core of why zstd is often preferred for modern log pipelines.

The problem this solves is the classic trade-off between data size and processing time. Traditionally you had to choose: algorithms like gzip or bzip2 when smaller output mattered most, or algorithms like lz4 when raw speed did. zstd elegantly bridges that gap.

Internally, zstd combines several techniques. Like gzip’s DEFLATE, it starts with LZ77-style match finding, but with key improvements. It uses a much larger window (megabytes, versus DEFLATE’s 32 KiB), allowing it to find and exploit longer-range repetitions in the data. For entropy coding, it uses finite state entropy (FSE, a form of asymmetric numeral systems, or ANS) for the match lengths, literal lengths, and offsets, plus a fast multi-stream Huffman coder for the literals; together these are both quicker and closer to the theoretical optimum than DEFLATE’s Huffman-only stage. This combination lets zstd represent recurring patterns in fewer bits while processing data more quickly.

The main lever you control is the compression level. For gzip, levels range from 1 (fastest, least compression) to 9 (slowest, most compression). For zstd, levels range from 1 to 22, and negative levels trade compression for even more speed. Levels 1–3 are generally considered the sweet spot for high-throughput pipelines; zstd’s higher levels can approach the ratios of much slower algorithms like xz, but at a steep CPU cost.

The critical insight for log sinks is that decompression sits on the hot ingest path. Deserializing JSON (or another structured format) already consumes a significant share of the sink’s processing time; zstd’s much faster decompression shrinks the remaining share, so the sink spends less time unpacking each batch before it can parse it. The result is higher ingest throughput and lower end-to-end latency.

When dealing with log data, especially high-volume, repetitive log messages (like status updates, heartbeat signals, or routine operational logs), you’ll find that zstd’s ability to leverage large matching windows is particularly effective. It can find repetitions across many log entries, not just within a single entry, leading to substantial gains that gzip’s smaller window struggles to exploit.

The next logical step after optimizing your compression is to consider how your log data is structured and how that impacts compression efficiency.
