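strace can’t show SIGKILL as a delivered signal. SIGKILL is uncatchable and unblockable, and the kernel terminates the target without giving the tracer a chance to report the delivery, so there is no --- SIGKILL {...} --- line and no sender information. It does report the outcome: the trace ends with +++ killed by SIGKILL +++, and the system calls just before that line show what the process was doing when it died.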

Let’s say you have a process, PID 12345, that’s misbehaving and you want to see why it’s getting killed. You suspect it’s a signal, but you’re not sure which one, or maybe it’s a SIGKILL that you can’t catch. You can attach strace to it:

strace -p 12345

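If SIGKILL is sent to PID 12345, strace prints no signal-delivery line for it. The trace simply stops and ends with +++ killed by SIGKILL +++. A process killed this way never calls exit_group(); the system calls immediately before the final line tell you what the process was doing, not who killed it.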
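The same outcome is visible from the parent's side without strace. The following is a minimal sketch, assuming a POSIX shell; the sleep 30 child is a throwaway stand-in for PID 12345, and the variable names are illustrative only.

```shell
# Start a throwaway child (a stand-in for the misbehaving process).
sleep 30 &
pid=$!

# SIGKILL cannot be caught, blocked, or ignored by the target.
kill -KILL "$pid"

# For a child that died from a signal, wait returns 128 + the signal number.
status=0
wait "$pid" || status=$?

echo "wait status:   $status"
echo "signal number: $((status - 128))"
```

A status above 128 is the shell's way of saying "terminated by a signal"; subtracting 128 gives the signal number, which is 9 for SIGKILL.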

Common Causes for Unexpected Process Termination

Here’s a breakdown of what strace might show you when a process is terminated, and the underlying reasons:

  1. OOM Killer (SIGKILL): The system is running out of memory and the Out-Of-Memory killer has stepped in to terminate a process.

    • Diagnosis: The trace stops abruptly and ends with +++ killed by SIGKILL +++, often after memory-related calls such as mmap() or brk(). There is no exit_group() call, because the process never gets the chance to exit on its own. The kernel logs will be your best friend here. Run dmesg -T | grep -i oom or journalctl -k | grep -i oom to see if the OOM killer was invoked. The log names the process that was killed and reports its memory usage.
    • Fix: Increase system RAM, reduce the memory footprint of your applications (e.g., tune JVM heap sizes, database buffers), or adjust the oom_score_adj for critical processes to make them less likely targets. For example, to make a process less likely to be killed, you’d write a negative value to its /proc/<pid>/oom_score_adj file: echo -500 > /proc/12345/oom_score_adj. Valid values range from -1000 (never kill this process) to 1000, and lowering the value requires root privileges. A negative value reduces the process's "oom score," making it less attractive.
    • Why it works: The OOM killer assigns a score to each process based on its memory usage and other factors. By adjusting oom_score_adj, you directly influence this score, making the process less likely to be chosen for termination when memory is scarce.
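Before changing anything, it helps to look at the current values. A minimal read-only sketch, assuming Linux with procfs mounted at /proc; it inspects the current shell ($$) rather than PID 12345, and the variable names are illustrative.

```shell
# Linux-only: read the OOM tuning files for the current shell.
adj=""
score=""
adj_file="/proc/$$/oom_score_adj"
score_file="/proc/$$/oom_score"

if [ -r "$adj_file" ] && [ -r "$score_file" ]; then
    adj=$(cat "$adj_file")      # user-set adjustment, -1000..1000
    score=$(cat "$score_file")  # score the kernel currently computes
    echo "oom_score_adj=$adj"
    echo "oom_score=$score"
else
    echo "OOM files not available under /proc on this system"
fi
```

Comparing oom_score before and after writing to oom_score_adj shows whether the adjustment had the intended effect.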
  2. Uncaught SIGSEGV (Segmentation Fault): The process tried to access memory it shouldn’t have. This is a very common programming error.

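    • Diagnosis: strace reports the fault as a signal, not as a failed system call: --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=...} --- followed by +++ killed by SIGSEGV +++, with "(core dumped)" added when a core file is written. The si_addr field is the address the process tried to access. On x86, the kernel log (dmesg or journalctl -k) usually records a matching "segfault at ..." line.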
    • Fix: Debug the application’s source code. Use tools like gdb with the core dump or by attaching to the running process to pinpoint the exact line of code causing the invalid memory access. Common causes include null pointer dereferences, buffer overflows, or use-after-free errors.
    • Why it works: Identifying and correcting the faulty memory access in the code prevents the CPU from encountering an illegal operation, thus avoiding the SIGSEGV.
  3. Uncaught SIGILL (Illegal Instruction): The process attempted to execute an invalid or unknown instruction.

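    • Diagnosis: Similar to SIGSEGV, strace shows --- SIGILL {si_signo=SIGILL, ...} --- followed by +++ killed by SIGILL +++. On x86, the kernel log typically reports the fault as an "invalid opcode" trap. The usual cause is a binary built for a newer CPU than the one it runs on (for example, one that uses instruction-set extensions the host lacks), or a severe bug that sends execution into data or corrupted code. A binary built for an entirely different architecture normally fails earlier, at exec time, with "Exec format error".
    • Fix: Ensure the binary is compiled for the CPU it will actually run on, including the instruction-set extensions enabled at build time. If it’s a custom-built application, debug the code to find where an invalid instruction is being generated or executed.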
    • Why it works: Correcting the code or ensuring the correct binary is run on the appropriate hardware resolves the issue of attempting to execute non-existent instructions.
  4. Uncaught SIGABRT (Aborted): The process intentionally called abort(), often due to an internal error detected by the application itself or a library it uses (like assert() failures).

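    • Diagnosis: abort() is a C library function, not a system call, so strace shows what it does underneath. On glibc this is typically rt_sigprocmask() followed by tgkill(<pid>, <tid>, SIGABRT), then --- SIGABRT {si_signo=SIGABRT, si_code=SI_TKILL, ...} --- and +++ killed by SIGABRT +++, with "(core dumped)" added when a core file is written. An assertion message, if any, usually appears just before as a write(2, ...) call. The kernel log normally records nothing for SIGABRT.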
    • Fix: Debug the application. The abort() call is usually a consequence of a deeper problem. Examine the application’s logs or use a debugger to understand why abort() was called. Look for assert failures or explicit error handling paths that lead to abort().
    • Why it works: Fixing the underlying logical error that triggers the abort() call is the solution.
  5. External Signal Handling (e.g., SIGTERM): Another process explicitly sent a termination signal.

    • Diagnosis: strace attached to the receiving process will show it receiving a signal (e.g., --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, ...} ---). For signals sent with kill(), the record also contains si_pid and si_uid, which identify the sender. Look that PID up with ps -p <si_pid> -o pid,ppid,user,args. If the sender has already exited, check audit logs (auditd) for SYSCALL records related to kill.
    • Fix: Identify the process sending the signal and understand its intent. If it’s a deliberate shutdown, ensure the application is handling SIGTERM gracefully. If it’s unintended, investigate why the other process is sending the signal.
    • Why it works: Graceful signal handling allows the application to clean up resources before exiting, preventing data corruption or leaving the system in an inconsistent state.
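Graceful SIGTERM handling can be illustrated in the shell itself. This is a minimal sketch, assuming a POSIX sh; the child signals itself so the demo has no timing dependencies, and the cleanup-done message is made up for illustration.

```shell
# Child shell: install a TERM handler, then send SIGTERM to itself.
out=$(sh -c '
    trap "echo cleanup-done; exit 0" TERM   # runs instead of the default action
    kill -TERM $$
    echo "not reached"                      # skipped: the handler exits first
')
status=$?

echo "child printed: $out"
echo "child status:  $status"
```

Because the handler exits with 0, the parent sees a normal exit rather than a status of 128 + 15, which is what an unhandled SIGTERM would produce.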
  6. Resource Limits Exceeded (SIGXCPU, SIGXFSZ): The process exceeded its CPU time limit or file size limit.

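    • Diagnosis: Exceeding a limit does not make setrlimit() fail. strace shows the signal itself being delivered, for example --- SIGXCPU {si_signo=SIGXCPU, ...} --- followed by +++ killed by SIGXCPU +++. SIGXFSZ arrives during the write() that would push a file past the size limit. SIGXCPU is sent at the soft CPU limit; reaching the hard CPU limit results in SIGKILL. To see the limits in effect for a process, read /proc/<pid>/limits.
    • Fix: Increase the resource limits for the process. This can be done via ulimit commands before starting the process, or by modifying system-wide limits in /etc/security/limits.conf. For example, ulimit -t unlimited removes the CPU time limit for the current shell and everything started from it, provided the hard limit allows it. For a process that is already running, prlimit can change the limits in place: prlimit --pid 12345 --cpu=unlimited.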
    • Why it works: By raising the allowed limits, the process is no longer constrained by these specific resource ceilings and can continue execution.
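The soft CPU limit is easy to reproduce in a subshell. This is a minimal sketch, assuming Linux (where SIGXCPU is signal 24) and an inherited hard limit above one second; it burns about one second of CPU time, and the variable names are illustrative.

```shell
status=0
(
    ulimit -S -t 1        # soft limit only: 1 second of CPU time
    while :; do :; done   # busy loop until the kernel sends SIGXCPU
) || status=$?

# 128 + signal number, as for any signal-terminated child.
sig=$((status - 128))
echo "subshell status: $status (signal $sig)"
```

Only the soft limit is set here on purpose. Setting soft and hard limits to the same value would end the subshell with SIGKILL instead, which is one way a resource limit can look like an unexplained kill.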

After you’ve addressed all these potential causes, the next error you might encounter is No such file or directory (ENOENT), returned when the application tries to open a configuration file it needs to start correctly.
