Valkey’s HyperLogLog (HLL) can estimate the number of unique items in a dataset using astonishingly little memory, even for billions of items.
Let’s see it in action. Imagine you’re tracking unique visitors to a popular website.
> PFADD visitors "user123" "user456" "user123"
(integer) 1
> PFCOUNT visitors
(integer) 2
PFADD adds items; its reply of 1 means the HyperLogLog’s internal registers were modified, not that one item was added. Notice "user123" was added twice, yet PFCOUNT (the HyperLogLog count command) reports only 2. Duplicates are handled inherently.
The magic is in how it estimates cardinality. Instead of storing every unique item (which would explode memory usage), HLL uses probabilistic counting. It hashes each item and observes patterns in the hashed values, specifically the number of leading zeros. The more leading zeros you see, the more unique items you’ve likely encountered.
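A quick way to build that intuition is a toy demo (not Valkey code): draw random 64-bit values, as if they were hashes, and look at the longest run of leading zeros.

```python
import random

def leading_zeros(h: int, width: int = 64) -> int:
    # Number of zero bits before the first 1 in a width-bit value.
    return width - h.bit_length()

random.seed(42)  # for repeatability; any seed shows the same pattern
max_run = max(leading_zeros(random.getrandbits(64)) for _ in range(1000))
print(max_run)  # typically around 10, since 2^10 ≈ 1000
```

A run of $k$ leading zeros occurs with probability $2^{-(k+1)}$ per draw, so the longest run grows with the logarithm of the number of distinct values seen.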
Here’s the internal breakdown:
- Hashing: Every item added to an HLL is passed through a strong hash function (Valkey uses a 64-bit MurmurHash variant), producing a fixed-size 64-bit hash value.
- Bucketing: The hash value is split. A portion determines which "bucket" the item falls into, and the rest is used to count leading zeros. Valkey’s HLL uses 14 bits for bucket selection, meaning there are $2^{14} = 16,384$ buckets.
- Register Updates: For each bucket, HLL stores the maximum number of leading zeros seen for any item hashing into that bucket. This fits in a small 6-bit register, allowing values from 0 to 63.
- Estimation: To get the final cardinality estimate, HLL combines the register values with a harmonic mean (which damps the influence of outlier buckets) and applies a bias-correction formula that accounts for the number of buckets. The intuition: if the longest run of leading zeros you’ve observed is $k$, you’ve likely hashed around $2^k$ distinct items.
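The steps above can be sketched in a few dozen lines of Python. This is an illustrative toy, not Valkey’s implementation: it truncates SHA-256 to 64 bits in place of MurmurHash, uses a plain Python list for the registers, and applies only the basic bias correction (real implementations layer further corrections for small and large cardinalities).

```python
import hashlib

P = 14                # precision: hash bits used for bucket selection
M = 1 << P            # 2^14 = 16,384 buckets

def hash64(item: str) -> int:
    # Stand-in for MurmurHash: any well-mixed 64-bit hash works.
    return int.from_bytes(hashlib.sha256(item.encode()).digest()[:8], "big")

def add(registers: list[int], item: str) -> None:
    h = hash64(item)
    bucket = h >> (64 - P)                    # top 14 bits pick the bucket
    rest = h & ((1 << (64 - P)) - 1)          # remaining 50 bits
    rank = (64 - P) - rest.bit_length() + 1   # leading zeros + 1
    registers[bucket] = max(registers[bucket], rank)

def estimate(registers: list[int]) -> float:
    # Harmonic mean of 2^-register, scaled by a bias-correction
    # constant (≈ 0.7213 for large bucket counts).
    alpha = 0.7213 / (1 + 1.079 / M)
    return alpha * M * M / sum(2.0 ** -r for r in registers)

regs = [0] * M
for i in range(100_000):
    add(regs, f"user{i}")
print(round(estimate(regs)))  # close to 100,000 (typically within a percent or two)
```

Adding the same item twice sets the same register to the same rank, which is why duplicates cost nothing.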
This probabilistic approach has a well-characterized error rate. The standard error of the estimate is approximately $1.04/\sqrt{m}$, where $m$ is the number of buckets. Valkey fixes the precision at 14 bits, so $m = 2^{14} = 16,384$ and the standard error is about 0.81%. More buckets mean lower error at the cost of more memory: in the general algorithm, a precision of 10 bits ($2^{10} = 1,024$ buckets) gives an error of about 3.25%, while 16 bits ($2^{16} = 65,536$ buckets) gives about 0.41%.
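Those error figures follow directly from the $1.04/\sqrt{m}$ approximation; a quick check, using nothing beyond the standard library:

```python
import math

# Standard error of the HLL estimate as a function of precision p,
# using the 1.04/sqrt(m) approximation.
for p in (10, 14, 16):
    m = 2 ** p
    err = 1.04 / math.sqrt(m)
    print(f"p={p:2d}  buckets={m:6d}  stderr={err:.2%}")
# p=10  buckets=  1024  stderr=3.25%
# p=14  buckets= 16384  stderr=0.81%
# p=16  buckets= 65536  stderr=0.41%
```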
The memory footprint is bounded and surprisingly small. At 14-bit precision, each of the $2^{14} = 16,384$ registers stores 6 bits. That’s $16,384 \times 6 = 98,304$ bits, or exactly 12 KB. (Valkey also uses a compact sparse encoding while the cardinality is small, so 12 KB is the ceiling, not the floor.) Even for billions of unique items, the memory usage barely changes.
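The arithmetic, spelled out:

```python
# Fixed register-array size at 14-bit precision.
registers = 2 ** 14          # 16,384 buckets
bits = registers * 6         # 6 bits per register
print(bits)                  # 98304
print(bits // 8 // 1024)     # 12 (KB)
```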
You can also merge HLLs. This is useful if you have, say, unique visitor counts from different servers and want the total unique count across all of them.
> PFADD visitors_europe "userA" "userB"
(integer) 1
> PFADD visitors_asia "userB" "userC"
(integer) 1
> PFMERGE total_visitors visitors_europe visitors_asia
OK
> PFCOUNT total_visitors
(integer) 3
Here, "userB" was counted in both, but PFMERGE correctly merges the HLLs and PFCOUNT on the merged HLL reports 3 unique visitors.
A critical detail is that HLLs are designed for cardinality estimation, not for retrieving the items themselves or counting occurrences of individual items. If you need exact counts or the actual list of unique items, HLL is not the right tool. The memory savings come at the cost of precision and the inability to reconstruct the original set.
In the HyperLogLog algorithm generally, practical precisions range from 4 bits ($2^4 = 16$ buckets) to 16 bits ($2^{16} = 65,536$ buckets), and implementations that expose the setting reject values outside this range. Valkey does not expose it at all: it always uses 14 bits, trading configurability for a predictable 12 KB worst case.