ZeroMQ message loss isn’t a bug; it’s a feature that tells you your network or application is drowning.
Let’s say you’re seeing intermittent CURVE_CLOSE_FAILED errors or your receiver is just… not getting messages. Your ZeroMQ application is dropping messages, and it’s usually because the sending side is outpacing the receiving side, or something fundamental is broken between them. ZeroMQ itself is designed to be reliable within its own process boundaries, but it can’t magically make messages appear if the underlying network is saturated or the receiving application can’t keep up.
Here’s the breakdown of what’s likely happening and how to fix it.
The Usual Suspects: Why Messages Vanish
- Receiver Overload (Slow Consumer): This is the most common culprit. The sender is blasting messages out faster than the receiver can process them. ZeroMQ's high-water marks (HWMs) cap internal queue sizes to prevent unbounded memory growth; when a queue fills, the socket either blocks or drops messages, depending on the socket type.
  - Diagnosis: Check the receive high-water mark (`ZMQ_RCVHWM`) on the receiving socket. HWM settings only apply to connections made after they are set, so configure them before `connect()`/`bind()`. If you suspect overload, this is your first knob. Also, monitor your receiver's CPU and I/O: is it bottlenecked?
  - Fix: Increase the receive high-water mark on the receiving socket to give it more buffer space:

    ```python
    receiver.setsockopt(zmq.RCVHWM, 10000)  # default is 1000
    ```

    This allows up to 10,000 messages in this socket's receive queue before backpressure (or dropping, for SUB sockets) kicks in, buying your receiver time to catch up. If your receiver is truly slow, you will need to fix its processing logic or scale it horizontally.
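Putting the HWM fix together, here is a minimal sketch (the endpoint is hypothetical) that sets the receive HWM before connecting, since HWM changes don't affect already-established connections:

```python
import zmq

ctx = zmq.Context.instance()
receiver = ctx.socket(zmq.PULL)

# HWM options only apply to connections made after they are set,
# so configure them before connect()/bind().
receiver.setsockopt(zmq.RCVHWM, 10000)  # default is 1000

receiver.connect("tcp://localhost:5557")  # hypothetical sender endpoint
print(receiver.getsockopt(zmq.RCVHWM))    # confirm the option took effect
```

Reading the option back with `getsockopt` is a cheap sanity check that the setting landed where you think it did.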
- Network Saturation/Congestion: The network link between sender and receiver is simply too full. Even if the receiver could process messages, they never make it across.
  - Diagnosis: Use standard network tools: `ping` (latency and packet loss), `iperf3` (bandwidth testing), and `netstat -s` (TCP retransmissions and other OS-level errors). Check interface utilization (`ifconfig` or `ip -s link`).
  - Fix:
    - Increase Bandwidth: Upgrade your network hardware or service.
    - Reduce Message Size/Frequency: If possible, send smaller messages or send them less often.
    - Optimize Network Path: Ensure direct connections where possible; avoid unnecessary hops.
    - Don't fight Nagle's algorithm: libzmq already sets `TCP_NODELAY` on its TCP connections, so Nagle-induced batching latency is not a factor here, and there is no ZeroMQ socket option you need to (or can) set for it.
- Sender Overload (Less Common, but Possible): The sender itself might be struggling, either in its own processing or by overwhelming the OS network stack.
  - Diagnosis: Monitor the sender's CPU and outgoing network interface utilization. A reliable signal is a non-blocking send failing with `EAGAIN` (raised as `zmq.Again` in pyzmq): it means the socket's send queue has hit its high-water mark.
  - Fix:
    - Rate Limiting on Sender: Implement application-level rate limiting to control the outgoing message rate.
    - Tune the Send HWM: Sending sockets have their own queue limit, `ZMQ_SNDHWM`. A PUSH socket blocks (or returns `EAGAIN` in non-blocking mode) when its send queue fills, so backpressure from the PULL side is built in. A PUB socket has no backpressure: it silently drops messages for slow subscribers once the send queue is full.
    - Optimize Sender Logic: Ensure the sender's message generation and sending loop is as efficient as possible.
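One way to surface send-queue pressure in code is a non-blocking send that treats `zmq.Again` as a "queue full" signal and backs off. A minimal sketch; the `inproc` endpoint and the `send_with_backoff` helper are illustrative, not part of any ZeroMQ API:

```python
import time
import zmq

ctx = zmq.Context.instance()
sender = ctx.socket(zmq.PUSH)
sender.setsockopt(zmq.SNDHWM, 1000)  # cap the send queue
sender.bind("inproc://work")         # hypothetical endpoint

def send_with_backoff(sock, payload, retries=3, delay=0.05):
    """Try a non-blocking send; back off briefly while the queue is full."""
    for _ in range(retries):
        try:
            sock.send(payload, flags=zmq.DONTWAIT)
            return True
        except zmq.Again:        # EAGAIN: HWM reached or no peer connected
            time.sleep(delay)
    return False                 # caller decides: drop, log, or block
```

With no PULL socket connected, a PUSH has nowhere to queue, so every non-blocking send raises `zmq.Again` and the helper returns `False`; once a PULL connects, sends succeed.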
- Socket Type Mismatch or Misconfiguration: Using the wrong socket type for your pattern can lead to unexpected behavior and message loss. For instance, a PUB socket drops messages when no subscribers are connected or when subscribers are slow.
  - Diagnosis: Review your ZeroMQ socket pairings (REQ/REP, PUB/SUB, PUSH/PULL, DEALER/ROUTER). Are they matched to the intended communication pattern? Are you using PUB/SUB and expecting reliable delivery to all subscribers?
  - Fix: Use appropriate socket types. For delivery with built-in backpressure, consider PUSH/PULL (work distribution) or DEALER/ROUTER (more complex request/reply or routing). If you need reliability over PUB/SUB, you must layer it on yourself (e.g., a PUSH/PULL side channel for command/control, or explicit ACKs).
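To see the PUB/SUB loss mode concretely, here is a minimal sketch (the `inproc` endpoint is illustrative): a message published before any subscriber exists simply vanishes, while one published after subscription arrives.

```python
import time
import zmq

ctx = zmq.Context.instance()
pub = ctx.socket(zmq.PUB)
pub.bind("inproc://feed")           # hypothetical endpoint

pub.send(b"lost")                   # no subscribers yet: silently dropped

sub = ctx.socket(zmq.SUB)
sub.connect("inproc://feed")
sub.setsockopt(zmq.SUBSCRIBE, b"")  # subscribe to everything
time.sleep(0.1)                     # let the subscription reach the PUB side

pub.send(b"seen")
sub.setsockopt(zmq.RCVTIMEO, 1000)  # avoid blocking forever
msg = sub.recv()                    # only the post-subscription message arrives
print(msg)
```

The `sleep` papers over the "slow joiner" race: subscriptions travel asynchronously from SUB to PUB, so even a connected subscriber can miss messages published immediately after it connects.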
- ZeroMQ Version/Build Issues: Though rare, bugs in specific ZeroMQ versions or custom builds can cause problems like this.
  - Diagnosis: Check your ZeroMQ version (`pkg-config --modversion libzmq` or similar). Look for known issues affecting your version in the ZeroMQ issue tracker.
  - Fix: Upgrade to the latest stable ZeroMQ version. If using a custom build, try a standard build.
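From Python, pyzmq can report both the underlying library version and the binding version directly:

```python
import zmq

print("libzmq:", zmq.zmq_version())    # version of the underlying C library
print("pyzmq:", zmq.pyzmq_version())   # version of the Python binding
```

The two are versioned independently, so check both when hunting a version-specific bug.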
- Underlying Transport Issues (e.g., IPC, Inproc): Network saturation is the common failure mode for TCP/IP, but other transports have their own. `inproc` is generally very reliable but can be affected by extreme thread contention; `ipc` can be affected by filesystem issues or permissions.
  - Diagnosis: For `inproc`, monitor thread contention and context usage. For `ipc`, check filesystem permissions, disk space, and error logs (`dmesg`).
  - Fix: For `inproc`, optimize thread management. For `ipc`, ensure proper filesystem access and resources.
- Context Exhaustion: Each ZeroMQ context limits the number of sockets it can manage, and running out of file descriptors at the OS level can also surface as socket creation failures or unexpected behavior.
  - Diagnosis: Check `ulimit -n` (open file descriptor limit) on both sender and receiver machines. Query the context's socket cap with `zmq_ctx_get(ctx, ZMQ_MAX_SOCKETS)`.
  - Fix: Raise the OS file descriptor limit (`ulimit -n <new_limit>`) and/or set `ZMQ_MAX_SOCKETS` on the context, before creating any sockets, if you need a very large number of them.
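Both limits can be checked programmatically. A minimal sketch: raise the per-context socket cap (this must happen before the context creates any sockets) and read the process's file-descriptor limit:

```python
import resource  # Unix-only
import zmq

ctx = zmq.Context()
ctx.set(zmq.MAX_SOCKETS, 4096)   # default is 1023; set before creating sockets
print(ctx.get(zmq.MAX_SOCKETS))

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(soft)                      # the same number `ulimit -n` reports
```

If `zmq.MAX_SOCKETS` exceeds the soft file-descriptor limit, the OS limit wins, so raise them together.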
After addressing these, the next issue you might hit is CURVE authentication failures if you're using CurveZMQ with misconfigured keys, or rising message latency if you've over-buffered.