The most surprising thing about ZeroMQ’s REQ-REP pattern is that it is not reliable out of the box; making it reliable takes deliberate effort on both the REQ and REP sides.

Let’s see it in action. Imagine a simple Python script for a REQ client and a REP server.

REP Server (Python)

import zmq
import time

context = zmq.Context()
socket = context.socket(zmq.REP)
socket.bind("tcp://*:5555")

print("Server started, listening on tcp://*:5555")

while True:
    message = socket.recv()
    print(f"Received request: {message.decode()}")

    # Simulate work
    time.sleep(1)

    # Send reply
    reply_message = b"World"
    socket.send(reply_message)
    print(f"Sent reply: {reply_message.decode()}")

REQ Client (Python)

import zmq
import time

context = zmq.Context()
socket = context.socket(zmq.REQ)
socket.connect("tcp://localhost:5555")

print("Client connected to tcp://localhost:5555")

request_message = b"Hello"
print(f"Sending request: {request_message.decode()}")
socket.send(request_message)

# This is where it can hang if the server is slow or down
reply = socket.recv()
print(f"Received reply: {reply.decode()}")

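If you run the server first and then the client, the client blocks on socket.recv() for about a second while the server works, then prints the reply. If the server is not running yet, the client simply waits until one comes up. But if the server crashes mid-request or never sends a reply, socket.recv() blocks indefinitely. This is the default behavior, because a socket has no receive timeout unless you set one.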

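The REQ socket enforces a strict "send, then receive" sequence. After send(), the only legal call is recv(); calling send() again before a reply arrives does not queue or block, it fails with a zmq.ZMQError whose errno is EFSM ("Operation cannot be accomplished in current state"). So if the REP socket never replies, the client is left with a recv() that waits forever and a socket that refuses any new request, and the application grinds to a halt.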
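The lockstep rule can be observed directly. The following is a minimal sketch, assuming pyzmq is installed; the function name and the inproc endpoint are made up for illustration. It sends twice on a REQ socket and receives twice on a REP socket, and records the error each out-of-turn call raises.

```python
import zmq


def lockstep_errnos():
    """Return the errno raised by an out-of-turn REQ send and REP recv."""
    ctx = zmq.Context()
    rep = ctx.socket(zmq.REP)
    req = ctx.socket(zmq.REQ)
    rep.bind("inproc://lockstep-demo")      # hypothetical endpoint name
    req.connect("inproc://lockstep-demo")
    errnos = {}
    try:
        req.send(b"Hello")                  # legal: REQ starts in the "send" state
        try:
            req.send(b"Hello again")        # illegal: a reply is still pending
        except zmq.ZMQError as exc:
            errnos["req"] = exc.errno
        rep.recv()                          # legal: REP starts in the "receive" state
        try:
            rep.recv(zmq.NOBLOCK)           # illegal: REP owes a reply first
        except zmq.ZMQError as exc:
            errnos["rep"] = exc.errno
    finally:
        req.close(linger=0)
        rep.close(linger=0)
        ctx.term()
    return errnos
```

Both entries should compare equal to zmq.EFSM, which is why a client cannot simply resend on the same REQ socket and a server cannot silently skip a reply on a plain REP socket.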

To build reliability, we need to introduce retry logic on the client and, importantly, consider timeouts on both sides.

Client-Side Retries and Timeouts

The most common pattern is for the client to implement a retry loop with a timeout. ZeroMQ sockets have an RCVTIMEO option, given in milliseconds, that bounds how long recv() may block; it can be set on the REQ socket.

REQ Client with Retry and Timeout (Python)

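import zmq
import time

ENDPOINT = "tcp://localhost:5555"

context = zmq.Context()

def make_socket():
    """Create a fresh REQ socket; a timed-out REQ socket cannot be reused."""
    sock = context.socket(zmq.REQ)
    # Set a receive timeout (e.g., 1000 milliseconds = 1 second)
    sock.setsockopt(zmq.RCVTIMEO, 1000)
    # Discard unsent messages on close so close() never blocks
    sock.setsockopt(zmq.LINGER, 0)
    sock.connect(ENDPOINT)
    return sock

socket = make_socket()

request_message = b"Hello"
max_retries = 3
retry_delay = 1  # seconds

for attempt in range(max_retries + 1):
    print(f"Attempt {attempt + 1}: Sending request: {request_message.decode()}")
    try:
        socket.send(request_message)
        reply = socket.recv()
        print(f"Received reply: {reply.decode()}")
        break  # Success, exit loop
    except zmq.Again:
        print(f"Attempt {attempt + 1}: Timeout occurred.")
        # The old socket is still waiting for a reply, so sending on it
        # again would raise EFSM. Close it and start over with a new one.
        socket.close()
        if attempt == max_retries:
            print("Max retries reached. Giving up.")
            break
        print("Retrying...")
        time.sleep(retry_delay)
        socket = make_socket()
    except Exception as e:
        print(f"An error occurred: {e}")
        break

Here, zmq.RCVTIMEO is crucial. If socket.recv() doesn’t receive a reply within 1000 milliseconds, it raises a zmq.Again exception. The client catches this, closes the timed-out socket, waits for retry_delay seconds, and sends the request again on a fresh socket. Reopening matters: the old REQ socket is still in its "waiting for a reply" state and would reject another send(). This ensures that transient network glitches or brief server unavailability don’t cause the client to hang forever. Because a retried request can reach the server more than once, requests should be safe to repeat.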
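The same idea can be packaged as a reusable helper. The following is a sketch rather than a definitive implementation: the function name and its parameters are hypothetical, and it uses socket.poll() to wait for the reply instead of RCVTIMEO. It opens a new REQ socket for every attempt, so a timed-out socket is never reused.

```python
import zmq


def request_with_retries(context, endpoint, payload, retries=3, timeout_ms=1000):
    """Return the reply bytes, or None if every attempt timed out."""
    for _ in range(retries):
        socket = context.socket(zmq.REQ)
        # Drop any unsent request on close so close() never blocks
        socket.setsockopt(zmq.LINGER, 0)
        socket.connect(endpoint)
        try:
            socket.send(payload)
            # poll() returns a nonzero event mask if a reply arrives in time
            if socket.poll(timeout_ms, zmq.POLLIN):
                return socket.recv()
        finally:
            # Runs on success and on timeout: each attempt gets a fresh socket
            socket.close()
    return None
```

A caller might use it as reply = request_with_retries(zmq.Context(), "tcp://localhost:5555", b"Hello") and treat None as "server unavailable".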

Server-Side Timeout (Less Common, but Important)

While client-side retries handle network issues and client-perceived delays, what if the server is stuck in an infinite loop or a very long-running operation that exceeds what the client can tolerate? The client’s timeout will eventually trigger, but the server might still be churning away, consuming resources.

Setting zmq.SNDTIMEO on the REP socket does not help here: it only bounds how long send() itself may block, not how long the server spends computing a reply. The server also has no way of knowing that the client has already given up on a request; a reply sent to a requester that no longer exists is simply discarded. A more robust way is to have the server manage its own task execution time.

REP Server with Task Timeout (Python)

import zmq
import time

context = zmq.Context()
socket = context.socket(zmq.REP)
socket.bind("tcp://*:5555")

print("Server started, listening on tcp://*:5555")

task_timeout = 5  # seconds

while True:
    message = socket.recv()
    print(f"Received request: {message.decode()}")

    start_time = time.monotonic()
    try:
        # Simulate work that might take too long
        time.sleep(3) # This is fine
        # time.sleep(6) # This would exceed task_timeout

        if time.monotonic() - start_time > task_timeout:
            # This check runs after the work finishes: it detects an
            # overrun but does not interrupt the task.
            print("Task execution exceeded timeout. Reporting an error.")
            reply_message = b"ERROR: timeout"
        else:
            reply_message = b"World"
    except Exception as e:
        print(f"An error occurred during processing: {e}")
        reply_message = b"ERROR: processing failed"

    # A REP socket must send exactly one reply before its next recv(),
    # so reply even on failure; skipping send() would make recv() raise EFSM.
    socket.send(reply_message)
    print(f"Sent reply: {reply_message.decode()}")

In this server example, we track the start time of processing. If the work takes longer than task_timeout, we print a message and send an error reply instead of the normal one (the "ERROR:" prefix is just an illustrative convention). Silently dropping the request is not an option on a plain REP socket: it must send a reply before it can receive again, so a skipped send() would make the next recv() fail. Note that the check runs only after the work completes, so it reports an overrun but does not cut the task short. A client with a shorter timeout of its own may already have given up by then, in which case the late reply is discarded.
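To bound how long the server waits for a task, rather than only measuring it afterwards, one option is to run the task in a worker thread and wait on it with a deadline. The sketch below uses only the standard library; the function name and the "OK "/"ERR " reply prefixes are assumptions made for illustration, not part of ZeroMQ. A Python thread cannot be killed, so a timed-out task keeps running in the background; the deadline only frees the server loop to reply.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def run_with_deadline(task, request, timeout_s):
    """Run task(request) and always return reply bytes for socket.send()."""
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(task, request)
    try:
        # Wait at most timeout_s seconds for the task's result
        return b"OK " + future.result(timeout=timeout_s)
    except FutureTimeout:
        return b"ERR timeout"
    except Exception as exc:
        # The task itself raised; report it instead of staying silent
        return b"ERR " + str(exc).encode()
    finally:
        # Do not wait for a still-running task; let it finish on its own
        executor.shutdown(wait=False)
```

Inside the server loop this would replace the simulated work, for example socket.send(run_with_deadline(handle, message, task_timeout)), where handle is whatever function does the real work and returns bytes. Every request then gets exactly one reply, which keeps the REP socket in step.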

The core idea is that REQ-REP is a stateful, synchronous handshake. The REQ socket must receive a reply for every message it sends, and the REP socket must send a reply for every request it receives. A socket that misses its turn cannot move on. Timeouts and retries on a fresh socket are the client's primary mechanism for avoiding hangs, while server-side task management is key to preventing unbounded resource consumption.

The next logical step is to consider how to handle situations where the server does reply, but the reply indicates an error, or when you need to send multiple requests without waiting for each individual reply.
