When strace shows a process stuck on read() or poll(), the process is usually waiting for data that will never arrive, or it is deadlocked with another process or thread.

Here’s how to break down what’s happening and how to fix it:

Common Causes and Fixes

  1. Deadlock on a Mutex or Lock:

    • Diagnosis: The affected thread is blocked in a futex() call, specifically FUTEX_WAIT (or a variant such as FUTEX_WAIT_PRIVATE), that never returns. This indicates it’s waiting for a mutex or condition variable to be signaled. If multiple threads or processes are involved, they might be waiting on each other.
    • Fix: This is tricky because it’s a logic error in the application. The most common fix is to identify the locking order and ensure it’s consistent across all threads/processes. Within a single process, one thread might be holding a lock indefinitely, preventing others from proceeding. Run strace with -f (for example, strace -f -p <pid>) to follow all threads and child processes, and look for a pattern of acquisition and waiting. As a stopgap, restart the offending process or the service it belongs to.
    • Why it works: Deadlocks occur when threads or processes acquire the same locks in different orders. By enforcing a strict, consistent acquisition order (e.g., always acquire lock A before lock B), you prevent circular dependencies. Restarting releases all held locks, breaking the cycle, but the deadlock can recur until the locking logic is fixed.
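
The consistent-ordering idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the application's actual fix: lock_a, lock_b, LOCK_RANK, ordered_locks, and worker are hypothetical names, and the ranking scheme is just one way to pin down a global order.

```python
import threading
from contextlib import contextmanager

lock_a = threading.Lock()
lock_b = threading.Lock()

# Hypothetical global ranking: every thread must take locks in this order.
LOCK_RANK = {id(lock_a): 0, id(lock_b): 1}

@contextmanager
def ordered_locks(*locks):
    """Acquire the given locks in global rank order, whatever order was requested."""
    ordered = sorted(locks, key=lambda lock: LOCK_RANK[id(lock)])
    for lock in ordered:
        lock.acquire()
    try:
        yield
    finally:
        # Release in reverse order of acquisition.
        for lock in reversed(ordered):
            lock.release()

def worker(first, second, counter, rounds=1000):
    """Ask for two locks in a caller-chosen order; the helper normalizes it."""
    for _ in range(rounds):
        with ordered_locks(first, second):
            counter[0] += 1
```

Two threads calling worker(lock_a, lock_b, ...) and worker(lock_b, lock_a, ...) would risk a deadlock with plain nested acquisition; routed through the helper, both take lock_a first, so no circular wait can form.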
  2. Waiting for Network Data that Never Arrives:

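    • Diagnosis: The process is blocked on a read() or recv() call on a network socket. On a blocking socket, strace shows the call as an unfinished line with no return value, because the call never completes. On a non-blocking socket, the call returns -1 with errno set to EAGAIN (the same value as EWOULDBLOCK on Linux), and the process then typically sits in poll(), select(), or epoll_wait() waiting for the socket to become readable.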
    • Fix:
      • Check network connectivity:
        ping <remote_host>
        telnet <remote_host> <port>
        
        If ping fails or telnet times out, the issue is network-related.
      • Check firewall rules: Ensure the firewall on either the client or server side isn’t blocking the port.
      • Check the remote service: Is the service on the other end running and healthy? Can it accept connections and send data?
      • Application logic: The application might be expecting data that the client/server isn’t sending, or it’s sending data in a format the receiver can’t parse, leading to a perpetual wait. This often requires debugging the application’s communication protocol.
    • Why it works: Network hangs are usually due to a broken communication path or a service that isn’t responding. Verifying connectivity and the health of the remote service ensures the expected data flow can resume.
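
The telnet-style check above can also be scripted, which helps when it has to run repeatedly or from monitoring. The following is a minimal sketch; can_connect is a hypothetical helper name and the 3-second default timeout is an arbitrary assumption. It only confirms that a TCP connection can be opened, not that the remote service will actually send data.

```python
import socket

def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, DNS failures, and timeouts.
        return False
```

A False result points at the network path, a firewall, or a service that is not listening; a True result shifts suspicion to the remote service's behavior or the application protocol.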
  3. Blocking on an Unresponsive IPC (Inter-Process Communication) Mechanism:

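    • Diagnosis: The process is stuck on read() or write() calls on a pipe, FIFO, or Unix domain socket, or on msgrcv()/msgsnd() for a System V message queue. For the descriptor-based mechanisms, strace will show the process waiting on the corresponding file descriptor. Shared memory segments are not accessed through system calls, so a hang around shared memory usually shows up as a wait on whatever synchronizes it, such as futex() or semop().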
    • Fix:
      • Inspect /proc/<pid>/fd/: Use ls -l /proc/<pid>/fd/ to see what file descriptors the process has open. Look for pipes or other IPC descriptors.
      • Check the other end of the pipe/queue: If process A is writing to a pipe and process B is reading, and process A is blocked on write(), it means the pipe buffer is full because process B isn’t reading fast enough, or is itself blocked. You’ll need to strace process B to see why.
      • Check message queue limits: For System V message queues, list the queues and their current usage with ipcs -q, and show the system limits with ipcs -l.
    • Why it works: Pipes, FIFOs, and sockets are file descriptors backed by bounded kernel buffers. If one process is blocked writing because the other end isn’t reading, or vice versa, the kernel keeps it waiting. Identifying and fixing the bottleneck at the other end resolves the hang.
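
Finding the process at the other end of a pipe can be automated by comparing the link targets under /proc/<pid>/fd/, since both ends of a pipe show the same pipe:[inode] target. The sketch below assumes a Linux /proc filesystem and enough permission to read other processes' fd directories; open_fds and pipe_holders are hypothetical helper names.

```python
import os

def open_fds(pid):
    """Return {fd: link target} for a process, like `ls -l /proc/<pid>/fd/`."""
    fd_dir = f"/proc/{pid}/fd"
    result = {}
    for name in os.listdir(fd_dir):
        try:
            result[int(name)] = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
    return result

def pipe_holders(pid):
    """Map each pipe open in `pid` to every (pid, fd) pair holding it, `pid` included."""
    pipes = {t for t in open_fds(pid).values() if t.startswith("pipe:")}
    holders = {target: [] for target in pipes}
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            fds = open_fds(int(entry))
        except OSError:
            # The process exited, or we lack permission to inspect it.
            continue
        for fd, target in fds.items():
            if target in holders:
                holders[target].append((int(entry), fd))
    return holders
```

Any PID in the result other than the stuck process is a candidate for the next strace session.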
  4. Waiting for Disk I/O to Complete (Rare but Possible):

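    • Diagnosis: The process is stuck on a read() or write() call to a block device or file. This is less common for hanging and more common for slowness, but a severe I/O issue could cause a long wait. strace will show the system call and, when run with -T, the time spent in it; a failing device may also produce a return of -1 with errno set to EIO. A process blocked this way typically appears in state D (uninterruptible sleep) in ps output.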
    • Fix:
      • Check disk health:
        smartctl -a /dev/sdX  # Replace sdX with your disk
        dmesg | grep -iE "error|fail|ata"
        
      • Check filesystem status:
        mount | grep /path/to/mountpoint
        fsck /dev/sdX  # On an unmounted filesystem
        
      • Monitor I/O:
        iostat -xz 1
        iotop
        
        If you see extremely high %util or await times, the disk is overloaded or failing.
    • Why it works: A failing or overloaded disk can cause I/O operations to take an inordinate amount of time, effectively hanging the process that’s waiting for it. Diagnosing and resolving the underlying disk or filesystem problem is key.
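
A process waiting on disk I/O usually sits in uninterruptible sleep, which ps reports as state D. As a rough sketch, the same state letter can be read directly from /proc; this assumes a Linux /proc filesystem, and process_state is a hypothetical helper name.

```python
def process_state(pid):
    """Return the one-letter state from /proc/<pid>/stat, e.g. 'R', 'S', or 'D'."""
    with open(f"/proc/{pid}/stat") as f:
        stat = f.read()
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the last ')' before reading the state field.
    return stat.rsplit(")", 1)[1].split()[0]
```

Sampling this value a few times is more telling than a single reading: a process that stays in D across samples, while iostat shows high await, points at the storage layer rather than the application.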
  5. Blocked or Deadlocked Signal Handler:

    • Diagnosis: The process might be stuck in a signal handler that itself blocks or deadlocks. strace might show the process receiving a signal (e.g., SIGINT, SIGTERM) and then getting stuck in futex() or other system calls within the handler.
    • Fix: This requires debugging the application’s signal handling code. Ensure signal handlers are quick, non-blocking, and don’t introduce new locking dependencies. For a quick resolution, you might need to send a SIGKILL (kill -9 <pid>) to force termination.
    • Why it works: Signal handlers execute asynchronously. If a handler attempts to acquire a lock already held by the main thread, or performs an operation that blocks indefinitely, it can freeze the process. SIGKILL cannot be caught, blocked, or ignored, so the kernel terminates the process without running any handler.
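
One common way to keep a handler quick and non-blocking is to have it only record the request and let the main loop act on it. The sketch below shows that pattern in Python; ShutdownFlag, handle_signal, install, and serve are hypothetical names, and the sleep stands in for real work. It uses a plain attribute rather than a lock-based primitive so the handler itself never waits on anything.

```python
import signal
import time

class ShutdownFlag:
    """Plain attribute: setting it takes no lock and cannot block."""
    requested = False

flag = ShutdownFlag()

def handle_signal(signum, frame):
    # Keep the handler trivial: record the request and return immediately.
    flag.requested = True

def install(signum=signal.SIGTERM):
    """Register the handler; must be called from the main thread."""
    signal.signal(signum, handle_signal)

def serve(poll_interval=0.01, max_seconds=5.0):
    """Hypothetical main loop: work in small steps and check the flag between them."""
    deadline = time.monotonic() + max_seconds
    while not flag.requested and time.monotonic() < deadline:
        time.sleep(poll_interval)  # stand-in for one unit of real work
    return flag.requested
```

All cleanup that needs locks or I/O then happens in the main loop after serve() returns, outside signal context.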
  6. Process is Waiting for a Child Process to Exit:

    • Diagnosis: The parent process is stuck on a waitpid() or wait() system call, waiting for a child process that has itself hung. (A child that terminates, even abnormally, causes waitpid() to return.) strace -f is crucial here to trace the child process and find its hang point.
    • Fix: Identify the hung child process (using strace -f on the parent) and debug that child process using the methods above. Once the child is fixed or terminated, the parent’s waitpid() call will return.
    • Why it works: The waitpid() system call is designed to block until a child process changes state (e.g., exits, stops). If the child process is stuck, the parent will remain blocked indefinitely.
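
The behavior described above, where the parent's wait returns as soon as the child is terminated, can be seen in a small Python sketch. It bounds the wait with a timeout and falls back to SIGKILL, which is one possible policy and not necessarily the right one for a given application; wait_or_kill is a hypothetical helper name.

```python
import subprocess

def wait_or_kill(proc, timeout):
    """Wait for a child; if it has not exited within `timeout` seconds, kill it."""
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()         # sends SIGKILL on POSIX
        return proc.wait()  # reaps the child; returns once it has terminated
```

On POSIX, a negative return value means the child was terminated by that signal number, so a killed child reports -9 here while a child that exits on its own reports its normal exit status.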

After fixing a hang, the next error you’re likely to encounter is related resource exhaustion, such as "Too many open files" if the hung process was accumulating open file descriptors, or timeout errors from higher-level clients that were waiting on the unresponsive service.
