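ZeroMQ’s automatic reconnection isn’t a retry loop you write in your application; it’s a stateful background process that re-establishes broken connections for you, at a fixed interval by default or with exponential backoff if you configure it.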

Let’s see it in action. Imagine a simple publisher-subscriber setup.

Publisher (pub.py):

import zmq
import time

context = zmq.Context()
socket = context.socket(zmq.PUB)
socket.bind("tcp://*:5555")
print("Publisher started on tcp://*:5555")

for i in range(100000):
    message = f"Message {i}"
    socket.send_string(message)
    print(f"Sent: {message}")
    time.sleep(0.1)

Subscriber (sub.py):

import zmq
import time

context = zmq.Context()
socket = context.socket(zmq.SUB)
socket.setsockopt_string(zmq.SUBSCRIBE, "") # Subscribe to all
socket.connect("tcp://localhost:5555")
print("Subscriber connected to tcp://localhost:5555")

while True:
    try:
        message = socket.recv_string(zmq.DONTWAIT) # Non-blocking receive
        print(f"Received: {message}")
    except zmq.Again:
        # No message yet, do something else or just wait
        time.sleep(0.01)
    except KeyboardInterrupt:
        break

Now, run pub.py. Then, run sub.py. You’ll see messages flowing. Now, while sub.py is running, kill the pub.py process (e.g., Ctrl+C).

sub.py won’t report an error or exit. Its output simply stops after the last message that arrived before the publisher died, something like:

Subscriber connected to tcp://localhost:5555
...
Received: Message 10
Received: Message 11

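Then, if you restart pub.py (without restarting sub.py), the subscriber will resume receiving messages without any change to its code or state. The numbering starts again near 0 because the publisher’s counter restarted, and the first message or two may be missed while the subscription is re-established. The zmq.Again exception only tells us there’s no message right now, not that the connection is dead. The reconnection logic happens under the hood.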

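ZeroMQ handles reconnection at the transport level, in its background I/O threads. When a TCP connection breaks (e.g., the peer process crashes or the connection is reset), the library on the side that called connect() tears down the dead connection and schedules a new attempt after the reconnect interval. The side that called bind() does nothing special; it just keeps listening. Note that a silent failure such as an unplugged cable may go unnoticed for a while, because nothing is detected until TCP itself reports the loss.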
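One way to watch this machinery is a socket monitor. The sketch below uses pyzmq’s get_monitor_socket() and recv_monitor_message() to collect connection events for a short time; the function name watch_events, the EVENT_NAMES labels, and the duration parameter are illustrative choices, not part of the original example.

```python
import time

import zmq
from zmq.utils.monitor import recv_monitor_message

# Human-readable labels for the events of interest (labels are our own).
EVENT_NAMES = {
    zmq.EVENT_CONNECTED: "connected",
    zmq.EVENT_CONNECT_DELAYED: "connect_delayed",
    zmq.EVENT_CONNECT_RETRIED: "connect_retried",
    zmq.EVENT_DISCONNECTED: "disconnected",
    zmq.EVENT_CLOSED: "closed",
}


def watch_events(endpoint, duration_s=2.0):
    """Connect a SUB socket and return (event_name, value) pairs seen."""
    context = zmq.Context.instance()
    sub = context.socket(zmq.SUB)
    sub.setsockopt_string(zmq.SUBSCRIBE, "")
    # Attach the monitor before connect() so early events are not missed.
    monitor = sub.get_monitor_socket()
    sub.connect(endpoint)

    seen = []
    deadline = time.monotonic() + duration_s
    try:
        while time.monotonic() < deadline:
            if monitor.poll(100):  # wait up to 100 ms for an event
                evt = recv_monitor_message(monitor)
                name = EVENT_NAMES.get(int(evt["event"]), str(evt["event"]))
                seen.append((name, evt["value"]))
    finally:
        sub.disable_monitor()
        monitor.close()
        sub.close(linger=0)
    return seen


if __name__ == "__main__":
    for name, value in watch_events("tcp://localhost:5555", duration_s=3.0):
        print(name, value)
```

Run it with pub.py stopped and you should see repeated connect_retried events, whose value is the interval in milliseconds before the next attempt. Start pub.py and a connected event appears.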

You don’t call connect() again to make this happen. A single connect() registers the endpoint, and the I/O thread keeps retrying it for the lifetime of the socket. For connection-oriented transports like TCP, the first retry is scheduled after RECONNECT_IVL (100 ms by default). With default settings every later retry uses that same base interval. If you set RECONNECT_IVL_MAX to a value larger than RECONNECT_IVL, the interval instead doubles after each failed attempt until it reaches that cap. Current libzmq releases also add a random jitter to each delay, which helps prevent thundering herd problems if many clients try to reconnect simultaneously.

Three socket options shape this behavior: zmq.RECONNECT_IVL sets the base interval, zmq.RECONNECT_IVL_MAX enables and caps the exponential backoff, and zmq.CONNECT_TIMEOUT limits how long each individual connection attempt may take.

  • zmq.CONNECT_TIMEOUT: This bounds how long the underlying TCP connect attempt may take (in milliseconds) before ZeroMQ gives up on it and schedules a retry. It does not make connect() block; connect() in ZeroMQ always returns immediately and the connection is established in the background. The default is 0, meaning the operating system’s own timeout applies.
  • zmq.RECONNECT_IVL: This sets the base interval (in milliseconds) at which the client will attempt to reconnect after a connection is lost. The default is 100ms. A value of -1 disables reconnection.
  • zmq.RECONNECT_IVL_MAX: This sets the maximum interval (in milliseconds) between reconnection attempts. The default is 0, which means only RECONNECT_IVL is used and there is no exponential backoff. Values smaller than RECONNECT_IVL are ignored. If you set this to, say, 60000 (60 seconds), the delay between attempts will double on each failure until it reaches 60 seconds, then it will stay at 60 seconds for subsequent attempts.
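To check what your installed library actually uses, you can read the options back from a fresh socket. This is a minimal sketch; the helper name read_reconnect_options is made up for illustration, and zmq.CONNECT_TIMEOUT assumes a libzmq recent enough to support it (4.2 or later).

```python
import zmq


def read_reconnect_options():
    """Return the reconnect-related options of a freshly created socket."""
    context = zmq.Context()
    sock = context.socket(zmq.SUB)
    try:
        return {
            "RECONNECT_IVL": sock.getsockopt(zmq.RECONNECT_IVL),
            "RECONNECT_IVL_MAX": sock.getsockopt(zmq.RECONNECT_IVL_MAX),
            "CONNECT_TIMEOUT": sock.getsockopt(zmq.CONNECT_TIMEOUT),
        }
    finally:
        sock.close(linger=0)
        context.term()


if __name__ == "__main__":
    # All values are in milliseconds.
    for name, value in read_reconnect_options().items():
        print(f"{name} = {value}")
```

The libzmq documentation lists the defaults as 100, 0 and 0 respectively, so those are the values to expect unless your build differs.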

To explicitly configure reconnection backoff, set these options before calling connect(), since they apply to connections made afterwards:

import zmq

context = zmq.Context()
socket = context.socket(zmq.SUB)

# Set the base reconnect interval to 500ms
socket.setsockopt(zmq.RECONNECT_IVL, 500)

# Set the maximum reconnect interval to 10 seconds (10000ms)
socket.setsockopt(zmq.RECONNECT_IVL_MAX, 10000)

socket.connect("tcp://localhost:5555")

When a connection breaks, ZeroMQ doesn’t spin in a tight loop hammering the server. It uses internal timers. With RECONNECT_IVL=100 and RECONNECT_IVL_MAX left at its default of 0, attempts come roughly every 100ms for as long as the peer is unreachable. With RECONNECT_IVL=100 and RECONNECT_IVL_MAX=10000, the base delays look like: 100ms, 200ms, 400ms, 800ms, 1600ms, 3200ms, 6400ms, then 10000ms for every attempt after that, until a connection succeeds. Jitter is added on top of these values to spread out reconnection attempts.
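The schedule can be modeled in a few lines of plain Python. This is a simplified sketch of the behavior as documented for libzmq, not its actual implementation: it ignores jitter, and it assumes backoff is only active when the maximum is greater than the base interval. The function name reconnect_delays is hypothetical.

```python
def reconnect_delays(ivl_ms, ivl_max_ms, attempts):
    """Return the base delay (ms) before each of `attempts` reconnect tries.

    Simplified model: no jitter, and doubling only happens when
    ivl_max_ms is greater than ivl_ms.
    """
    delays = []
    current = ivl_ms
    for _ in range(attempts):
        delays.append(current)
        if ivl_max_ms > ivl_ms:
            # Double the interval, but never exceed the configured cap.
            current = min(current * 2, ivl_max_ms)
    return delays


if __name__ == "__main__":
    # Default settings: fixed interval, no backoff.
    print(reconnect_delays(100, 0, 5))
    # -> [100, 100, 100, 100, 100]

    # Backoff enabled with a 10-second cap.
    print(reconnect_delays(100, 10000, 9))
    # -> [100, 200, 400, 800, 1600, 3200, 6400, 10000, 10000]
```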

The most counterintuitive aspect is that the reconnection isn’t driven by a user-level timer or a time.sleep() in your application code. It’s an intrinsic behavior of the ZeroMQ transport layer itself. Your application code might only notice a pause in message delivery or, if it’s doing non-blocking receives, it will simply keep getting zmq.Again exceptions until the connection is re-established. The single connect() call you made on a REQ or SUB socket at startup is enough; you never need to call it again to trigger a retry.
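To see that one connect() is enough, the following sketch runs both sides in a single process: it starts a publisher, tears it down completely, starts a new one on the same port, and checks that the same subscriber socket receives from both. The helper names, the use of separate contexts, and the bind retry loop are illustrative assumptions, not part of pub.py or sub.py.

```python
import time

import zmq


def publish_until_received(pub, sub, payload, timeout_s=5.0):
    """Send payload repeatedly until sub receives it or time runs out."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        pub.send_string(payload)
        if sub.poll(50):  # wait up to 50 ms for a message
            if sub.recv_string() == payload:  # skip stale messages
                return payload
    return None


def bind_with_retry(pub, address, tries=50):
    """Bind, retrying briefly in case the port is not yet released."""
    for _ in range(tries):
        try:
            pub.bind(address)
            return
        except zmq.ZMQError:
            time.sleep(0.1)
    raise RuntimeError(f"could not bind {address}")


def demo_reconnect():
    sub_ctx = zmq.Context()
    sub = sub_ctx.socket(zmq.SUB)
    sub.setsockopt_string(zmq.SUBSCRIBE, "")

    # First publisher, in its own context so it can be fully torn down.
    pub_ctx = zmq.Context()
    pub = pub_ctx.socket(zmq.PUB)
    port = pub.bind_to_random_port("tcp://127.0.0.1")
    address = f"tcp://127.0.0.1:{port}"

    sub.connect(address)  # the only connect() call in this program
    first = publish_until_received(pub, sub, "before restart")

    # Simulate a publisher crash: close the socket and its context.
    pub.close(linger=0)
    pub_ctx.term()

    # "Restart" the publisher on the same port.
    pub_ctx = zmq.Context()
    pub = pub_ctx.socket(zmq.PUB)
    bind_with_retry(pub, address)
    second = publish_until_received(pub, sub, "after restart")

    pub.close(linger=0)
    pub_ctx.term()
    sub.close(linger=0)
    sub_ctx.term()
    return first, second


if __name__ == "__main__":
    print(demo_reconnect())
```

The publisher keeps resending because PUB drops messages for which no subscription has arrived yet, so a single send right after bind would likely be lost.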

The next thing you’ll likely wrestle with is handling transient network partitions where both sides are up but can’t see each other, and how ZeroMQ’s patterns (like request-reply) behave during these events.
