ZeroMQ message loss isn’t a bug; it’s a feature that tells you your network or application is drowning.

Let’s say you’re seeing EAGAIN returned from non-blocking sends, odd socket-monitor events, or your receiver is just… not getting messages. Your ZeroMQ application is dropping messages, and it’s usually because the sending side is outpacing the receiving side, or something fundamental is broken between them. ZeroMQ itself is designed to be reliable within its own process boundaries, but it can’t magically make messages appear if the underlying network is saturated or the receiving application can’t keep up.

Here’s the breakdown of what’s likely happening and how to fix it.

The Usual Suspects: Why Messages Vanish

  1. Receiver Overload (Slow Consumer): This is the most common culprit. The sender is blasting messages out faster than the receiver can process them. ZeroMQ’s internal high-water marks (HWMs) cap queue length to prevent unbounded memory growth, but what happens when a HWM is hit depends on the socket type: PUB drops messages silently, while PUSH, DEALER, and REQ block the sender (or return EAGAIN in non-blocking mode).

    • Diagnosis: Query the HWM on the receiving socket with getsockopt — since ZeroMQ 3.x the relevant option is zmq.RCVHWM (the old combined zmq.HWM is deprecated). If you suspect overload, this is your first knob. Also, monitor your receiver’s CPU and I/O. Is it bottlenecked?
    • Fix: Increase the receive High Water Mark on the receiving socket, before calling connect() or bind() — HWM changes don’t affect pipes that already exist. This gives the receiver more buffer space.
      receiver.setsockopt(zmq.RCVHWM, 10000) # Default is 1000
      
      This tells ZeroMQ to allow up to 10,000 messages in this socket’s receive queue before the HWM takes effect. This buys your receiver more time to catch up. If your receiver is truly slow, you might need to address its processing logic or scale it horizontally.
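As a minimal sketch (assuming the pyzmq bindings; the inproc endpoint name is arbitrary), HWMs must be set before bind()/connect() to apply to new connections, and they can be queried back with getsockopt:

```python
import zmq

ctx = zmq.Context.instance()

pull = ctx.socket(zmq.PULL)
pull.setsockopt(zmq.RCVHWM, 10000)   # raise the receive-queue cap first...
pull.bind("inproc://work")           # ...then bind, so new pipes pick it up

push = ctx.socket(zmq.PUSH)
push.setsockopt(zmq.SNDHWM, 10000)   # sender-side queue cap, for symmetry
push.connect("inproc://work")

print(pull.getsockopt(zmq.RCVHWM))   # prints 10000 -- the setting is queryable
```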
  2. Network Saturation/Congestion: Your network link between the sender and receiver is simply too full. Even if the receiver could process messages, they never make it across.

    • Diagnosis: Use standard network tools like ping (for latency and packet loss), iperf3 (for bandwidth testing), and netstat -s (to look for TCP retransmissions or other errors at the OS level). Check interface counters and drops (ip -s link or ifconfig).
    • Fix:
      • Increase Bandwidth: Upgrade your network hardware or service.
      • Reduce Message Size/Frequency: If possible, send smaller messages or send them less often.
      • Optimize Network Path: Ensure direct connections where possible, avoid unnecessary hops.
      • Tune TCP Buffers: libzmq already disables Nagle’s algorithm (it sets TCP_NODELAY on every TCP connection it opens), so there is no per-socket zmq.TCP_NODELAY option for you to set. If kernel buffers are the bottleneck instead, raise them with ZMQ_SNDBUF on the sender (and ZMQ_RCVBUF on the receiver) before connecting.
        sender.setsockopt(zmq.SNDBUF, 1048576) # 1 MiB kernel send buffer
        
        This asks the OS for a larger kernel-level send buffer, smoothing out bursts that would otherwise back up into ZeroMQ’s own queues.
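One way to cut per-message overhead on a saturated link is to batch many small payloads into a single multipart message. A hedged sketch with pyzmq (the endpoint name and batch size are illustrative):

```python
import zmq

ctx = zmq.Context.instance()
pull = ctx.socket(zmq.PULL)
pull.bind("inproc://batched")
push = ctx.socket(zmq.PUSH)
push.connect("inproc://batched")

# One multipart send amortizes framing and queueing cost over many payloads.
batch = [b"tick-%d" % i for i in range(100)]
push.send_multipart(batch)

received = pull.recv_multipart()
print(len(received))  # 100
```

The trade-off is latency: the first payload in a batch waits for the last, so batching suits high-throughput rather than low-latency workloads.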
  3. Sender Overload (Less Common, but Possible): While less frequent than receiver overload, the sender itself might be struggling to send messages due to its own internal processing or overwhelming the OS network stack.

    • Diagnosis: Monitor the sender’s CPU and outgoing network interface utilization. On the ZeroMQ side, a zmq_send() call with the ZMQ_DONTWAIT flag returning EAGAIN tells you the socket’s outgoing queue (ZMQ_SNDHWM) is full — a useful clue, though it’s often a symptom of the underlying issue rather than the cause itself.
    • Fix:
      • Rate Limiting on Sender: Implement application-level rate limiting on the sender to control the outgoing message rate.
      • Set the Sender-Side HWM: The sender’s outgoing queue is governed by ZMQ_SNDHWM, which you can set directly (default 1000). A PUSH socket that hits its SNDHWM blocks, or returns EAGAIN in non-blocking mode — that blocking is the backpressure coming from the PULL socket on the other end. A PUB socket has no backpressure by default: it silently drops messages for slow subscribers instead.
      • Optimize Sender Logic: Ensure the sender’s message generation and sending loop is as efficient as possible.
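A hedged sketch of application-level rate limiting on the sender — a simple token bucket that the send loop consults before each send. The class name and rates are illustrative, not part of ZeroMQ:

```python
import time

class TokenBucket:
    """Allow up to `rate` sends per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity):
        self.rate = float(rate)
        self.capacity = float(capacity)
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def try_acquire(self):
        # Refill tokens in proportion to elapsed time, capped at capacity.
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # caller should back off instead of sending

bucket = TokenBucket(rate=10, capacity=50)
allowed = sum(1 for _ in range(200) if bucket.try_acquire())
print(allowed)  # 50 -- the burst capacity caps how many sends pass at once
```

In the real send loop you would call try_acquire() before each zmq send and sleep briefly (or drop, per your policy) when it returns False.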
  4. Socket Type Mismatch or Misconfiguration: Using the wrong socket type for your pattern can lead to unexpected behavior and message loss. For instance, a PUB socket will drop messages if no subscribers are connected or if subscribers are slow.

    • Diagnosis: Review your ZeroMQ socket types (REQ/REP, PUB/SUB, PUSH/PULL, DEALER/ROUTER). Are they paired correctly for the intended communication pattern? Are you using PUB/SUB and expecting reliable delivery to all subscribers?
    • Fix: Ensure you’re using appropriate socket types. For guaranteed delivery, consider PUSH/PULL (for work distribution) or DEALER/ROUTER (for more complex request/reply or routing scenarios) where backpressure is inherent. If using PUB/SUB and needing reliability, you need to implement a separate mechanism (e.g., using PUSH/PULL for command/control or explicit ACKs).
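A sketch (assuming the pyzmq bindings; the inproc endpoint name is arbitrary) of the classic PUB/SUB “slow joiner” loss that PUSH/PULL avoids:

```python
import time
import zmq

ctx = zmq.Context.instance()
pub = ctx.socket(zmq.PUB)
pub.bind("inproc://feed")

pub.send(b"lost")                   # no subscriber attached: PUB drops this silently

sub = ctx.socket(zmq.SUB)
sub.setsockopt(zmq.SUBSCRIBE, b"")  # subscribe to everything
sub.connect("inproc://feed")
time.sleep(0.2)                     # let the subscription propagate to the PUB side

pub.send(b"delivered")
msg = sub.recv()
print(msg)                          # b'delivered' -- the first message is gone for good
```

A PUSH socket in the same situation would have blocked until a PULL peer connected rather than dropping, which is why PUSH/PULL is the safer choice when every message matters.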
  5. ZeroMQ Version/Build Issues: Though rare, bugs in specific ZeroMQ versions or custom builds could theoretically cause issues.

    • Diagnosis: Check your ZeroMQ version (pkg-config --modversion libzmq or similar). Look for known issues related to your version in the ZeroMQ issue tracker.
    • Fix: Upgrade to the latest stable ZeroMQ version. If using a custom build, try a standard build.
  6. Underlying Transport Issues (e.g., IPC, Inproc): While network saturation is common for TCP/IP, other transports have their own failure modes. inproc is generally very reliable but can be affected by extreme thread contention. ipc can be affected by filesystem issues or permissions.

    • Diagnosis: For inproc, monitor thread contention and context usage. For ipc, check filesystem permissions, disk space, and error logs (dmesg).
    • Fix: For inproc, optimize thread management. For ipc, ensure proper filesystem access and resources.
  7. Context Exhaustion: Each ZeroMQ context has limits on the number of sockets it can manage, and running out of file descriptors at the OS level can also manifest as socket creation failures or unexpected behavior.

    • Diagnosis: Check ulimit -n (number of open file descriptors) on both sender and receiver machines. Check the context’s configured cap with zmq_ctx_get(ctx, ZMQ_MAX_SOCKETS) — note this returns the limit, not current usage; it defaults to 1023.
    • Fix: Increase the OS file descriptor limit (ulimit -n <new_limit>) and/or raise ZMQ_MAX_SOCKETS with zmq_ctx_set() after creating the context but before creating any sockets, if you need a very large number of sockets.
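Raising the cap has to happen on a context that hasn’t created any sockets yet; a minimal pyzmq sketch (the default cap may differ between libzmq builds):

```python
import zmq

ctx = zmq.Context()                    # fresh context, no sockets created yet
default_cap = ctx.get(zmq.MAX_SOCKETS) # e.g. 1023 in stock libzmq builds
ctx.set(zmq.MAX_SOCKETS, 4096)         # must happen before the first socket
new_cap = ctx.get(zmq.MAX_SOCKETS)
print(default_cap, new_cap)
ctx.term()
```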

After addressing these, the next issue you might hit is CurveZMQ handshake failures (ZMQ_EVENT_HANDSHAKE_FAILED_AUTH) if your keys are misconfigured, or simply increased message latency if you’ve over-buffered.

Want structured learning?

Take the full ZeroMQ course →