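strace can't intercept SIGKILL the way it reports other signals: SIGKILL cannot be caught, blocked, or ignored, and the kernel acts on it without first handing it to the tracer. What strace can show you is the last system calls the process made before it died, plus a final line naming the signal that killed it, even when that signal is SIGKILL.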
Let’s say you have a process, PID 12345, that’s misbehaving and you want to see why it’s getting killed. You suspect it’s a signal, but you’re not sure which one, or maybe it’s a SIGKILL that you can’t catch. You can attach strace to it:
strace -p 12345
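If SIGKILL is sent to PID 12345, you won't see a signal-delivery line with sender details (the `--- SIGNAME {...} ---` form strace prints for other signals). Instead, the trace simply stops and strace prints `+++ killed by SIGKILL +++`. The system calls immediately before that line tell you what the process was doing when it died, and the kernel log usually tells you who killed it.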
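Independent of strace, the parent's wait status also records the fatal signal, which makes a quick cross-check when you launch the process yourself. A minimal sketch, assuming a POSIX shell; the `sleep` child is only a stand-in for the real process:

```shell
# Stand-in for the misbehaving process (an assumption for illustration).
sleep 30 &
pid=$!

# Simulate what the OOM killer or `kill -9` would do.
kill -s KILL "$pid"

# A shell reports death-by-signal as 128 + the signal number.
status=0
wait "$pid" || status=$?
signum=$((status - 128))
signame=$(kill -l "$signum")
echo "exit status $status => killed by SIG$signame"
```

This only works from the parent shell; for a process you did not start, strace or the kernel log remains the way in.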
Common Causes for Unexpected Process Termination
Here’s a breakdown of what strace might show you when a process is terminated, and the underlying reasons:
- OOM Killer (`SIGKILL`): The system is running out of memory and the Out-Of-Memory killer has stepped in to terminate a process.
  - Diagnosis: The trace stops abruptly at whatever system call the process happened to be in (often something unremarkable like `read()` or `write()`), with no `exit_group()` call, and `strace` ends with `+++ killed by SIGKILL +++`. The kernel logs will be your best friend here. Run `dmesg -T | grep -i oom` or `journalctl -k | grep -i oom` to see if the OOM killer was invoked. The log entry names the process that was killed and includes its memory usage at the time.
  - Fix: Increase system RAM, reduce the memory footprint of your applications (e.g., tune JVM heap sizes, database buffers), or adjust `oom_score_adj` for critical processes to make them less likely targets. For example, to make a process less likely to be killed, write a negative value (the range is -1000 to 1000, and lowering it requires root) to its `/proc/<pid>/oom_score_adj` file: `echo -500 > /proc/12345/oom_score_adj`.
  - Why it works: The OOM killer assigns each process a score based mainly on its memory usage. `oom_score_adj` is added to that score, so a negative value makes the process less likely to be chosen for termination when memory is scarce, and -1000 exempts it entirely.
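Before changing `oom_score_adj`, it helps to see the current values. The sketch below reads them from `/proc` on Linux; `oom_report` is a hypothetical helper name, and the shell's own PID is used only so the example runs anywhere:

```shell
# Hypothetical helper: print the OOM-related values for one PID (Linux /proc).
oom_report() {
    pid=$1
    adj=$(cat "/proc/$pid/oom_score_adj")   # the adjustment you control
    score=$(cat "/proc/$pid/oom_score")     # the score the kernel currently assigns
    echo "pid=$pid oom_score=$score oom_score_adj=$adj"
}

# Inspect the current shell; substitute the PID you care about.
oom_report "$$"
```

Running it before and after the `echo -500 > ...` step shows whether the adjustment took effect.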
- Uncaught `SIGSEGV` (Segmentation Fault): The process tried to access memory it shouldn't have. This is a very common programming error.
  - Diagnosis: A segfault is raised by a faulting memory access, not by a system call, so the last system call in the trace is usually unrelated to the crash. `strace` prints a delivery line such as `--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=...} ---`, followed by `+++ killed by SIGSEGV +++` (with `(core dumped)` appended if a core was written). The kernel log (`dmesg` or `journalctl -k`) typically records a "segfault at ..." line naming the program and the faulting address.
  - Fix: Debug the application's source code. Use tools like `gdb` with the core dump, or attach to the running process, to pinpoint the exact line of code causing the invalid memory access. Common causes include null pointer dereferences, buffer overflows, or use-after-free errors.
  - Why it works: Once the faulty memory access is corrected, the CPU no longer raises the fault that the kernel turns into `SIGSEGV`.
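A core dump is only written if the environment allows it, so it is worth checking that before trying to reproduce the crash. A minimal sketch for Linux, assuming `/proc/sys/kernel/core_pattern` is readable (it falls back to `unknown` otherwise):

```shell
# Core file size limit for this shell and its children; "0" disables core dumps.
core_limit=$(ulimit -c)

# Where the kernel writes cores, or the program it pipes them to.
core_pattern=$(cat /proc/sys/kernel/core_pattern 2>/dev/null || echo unknown)

echo "core size limit: $core_limit"
echo "core pattern:    $core_pattern"

if [ "$core_limit" = "0" ]; then
    echo "core dumps are disabled here; run 'ulimit -c unlimited' before starting the program"
fi
```

With a core in hand, `gdb <binary> <corefile>` followed by `bt` prints the stack at the point of the fault.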
- Uncaught `SIGILL` (Illegal Instruction): The process attempted to execute an invalid or unknown instruction.
  - Diagnosis: Similar to `SIGSEGV`, `strace` shows a `--- SIGILL {si_signo=SIGILL, ...} ---` line followed by `+++ killed by SIGILL +++`, and the kernel log may record the trap (on x86 it is typically logged as an "invalid opcode" trap). This usually happens when a binary was built for CPU features the host lacks (for example AVX2 code on an older processor), or when a severe bug sends execution into memory that does not contain valid code. A binary built for an entirely different architecture normally fails earlier, at exec time, with "Exec format error".
  - Fix: Ensure the binary is compiled for the correct target architecture and instruction-set level. If it's a custom-built application, debug the code to find where an invalid instruction is being generated or executed.
  - Why it works: Correcting the code, or running a binary that matches the hardware, means the CPU is never asked to execute an instruction it does not implement.
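One common form of this mismatch is a binary built with instruction-set extensions that the host CPU does not have. The sketch below checks for a few of them; it assumes an x86 Linux host, where `/proc/cpuinfo` exposes a `flags` line, and `cpu_has_flag` is a hypothetical helper name. The flags checked are examples, not a list taken from any particular program:

```shell
# Hypothetical helper: succeed if the CPU advertises the given feature flag.
# Assumes the x86 /proc/cpuinfo layout ("flags : fpu vme ...").
cpu_has_flag() {
    grep -Eq "^flags[[:space:]]*:.*[[:space:]]$1([[:space:]]|\$)" /proc/cpuinfo
}

for flag in sse4_2 avx avx2; do
    if cpu_has_flag "$flag"; then
        echo "$flag: supported"
    else
        echo "$flag: missing"
    fi
done
```

If a flag the binary depends on is reported missing, rebuilding for a more conservative target is likely the fix.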
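- Uncaught `SIGABRT` (Aborted): The process intentionally called `abort()`, often due to an internal error detected by the application itself or a library it uses (like `assert()` failures).
  - Diagnosis: `abort()` is a C library function, not a system call, so it does not appear by name in the trace. What `strace` shows is the library raising the signal on its own process, typically via `tgkill()`, followed by `--- SIGABRT {si_signo=SIGABRT, si_code=SI_TKILL, ...} ---` and `+++ killed by SIGABRT +++`. An assertion message is usually written to standard error just before, so look for a write to file descriptor 2 a few lines earlier. The kernel log normally has nothing for `SIGABRT`, though the journal may record the crash if a core dump handler is installed.
  - Fix: Debug the application. The `abort()` call is usually a consequence of a deeper problem. Examine the application's logs or use a debugger to understand why `abort()` was called. Look for `assert` failures or explicit error handling paths that lead to `abort()`.
  - Why it works: Fixing the underlying logic error removes the condition that triggers the `abort()` call.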
- External Signal Handling (e.g., `SIGTERM`): Another process explicitly sent a termination signal.
  - Diagnosis: `strace` attached to the receiving process will show the signal arriving (e.g., `--- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=..., si_uid=...} ---`). The `si_pid` field is the PID of the sender; look it up with `ps -f -p <si_pid>` while that process is still running, or check audit logs (`auditd`) for `SYSCALL` records related to `kill` if an audit rule for it is in place.
  - Fix: Identify the process sending the signal and understand its intent. If it's a deliberate shutdown, ensure the application is handling `SIGTERM` gracefully. If it's unintended, investigate why the other process is sending the signal.
  - Why it works: Graceful signal handling allows the application to clean up resources before exiting, preventing data corruption or leaving the system in an inconsistent state.
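Graceful `SIGTERM` handling can be sketched in a few lines of shell. Here the `worker` function and its scratch file are hypothetical stand-ins for an application and the resources it must release; the handler removes the file and exits cleanly instead of dying with the default action:

```shell
# Hypothetical scratch file standing in for a resource that needs cleanup.
scratch="${TMPDIR:-/tmp}/worker.$$"

worker() {
    # On SIGTERM: release the resource, then exit with success.
    trap 'rm -f "$scratch"; exit 0' TERM
    : > "$scratch"               # created only after the handler is installed
    while :; do sleep 1; done    # stand-in for real work
}

worker &
worker_pid=$!

# Wait until the worker is ready, then ask it to stop.
while [ ! -e "$scratch" ]; do sleep 1; done
kill -s TERM "$worker_pid"

worker_status=0
wait "$worker_pid" || worker_status=$?
echo "worker exited with status $worker_status"
```

Without the `trap`, the same `kill` would leave the scratch file behind and the wait status would be 128 plus the signal number.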
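- Resource Limits Exceeded (`SIGXCPU`, `SIGXFSZ`): The process exceeded its CPU time limit or file size limit.
  - Diagnosis: `strace` will show the process receiving `SIGXCPU` or `SIGXFSZ`. For `SIGXFSZ` the trigger is a `write()` that would grow a file past the limit; if the signal is handled or ignored, that call fails with `EFBIG` instead. The limits in force for a running process are listed in `/proc/<pid>/limits`.
  - Fix: Increase the resource limits for the process. This can be done via `ulimit` commands in the shell before starting the process, by modifying system-wide limits in `/etc/security/limits.conf`, or with `prlimit` for a process that is already running. For example, `ulimit -t unlimited` removes the CPU time limit for the current shell and everything it starts afterwards; an unprivileged user can only raise a limit up to the hard limit.
  - Why it works: By raising the allowed limits, the process is no longer constrained by these specific resource ceilings and can continue execution.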
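To confirm that a CPU limit is what kills a process, the behaviour can be reproduced in a subshell. This is a sketch under assumptions: a busy loop stands in for the real workload, only the soft limit is lowered (so the kernel sends `SIGXCPU` instead of going straight to `SIGKILL` at the hard limit), and core dumps are disabled to avoid leaving files behind:

```shell
cpu_status=0
(
    ulimit -c 0          # SIGXCPU dumps core by default; skip that here
    ulimit -S -t 1       # soft CPU limit of 1 second, hard limit untouched
    while :; do :; done  # stand-in for a CPU-bound workload
) || cpu_status=$?

# Death-by-signal is reported as 128 + the signal number.
cpu_signal=$(kill -l "$((cpu_status - 128))")
echo "exit status $cpu_status => killed by SIG$cpu_signal"
```

The limits are set inside the parentheses so they apply only to the subshell, not to the shell you are working in.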
After you've addressed all these potential causes, the next error you might encounter is `No such file or directory` (`ENOENT` in the trace) when the application tries to open a configuration file it needs to start correctly.