strace is your digital magnifying glass for syscalls, and when an incident happens, it’s your best friend for understanding exactly what a process was doing in its last moments.
Let’s say you’ve got a service that’s gone dark, or a machine that’s suddenly acting like it’s possessed. You suspect something malicious or a critical misconfiguration. strace lets you roll the tape from the moment you attach – not on the code, but on the actual interactions between the process and the operating system kernel. It shows you every system call the process makes – open(), read(), write(), connect(), execve() – along with the arguments it passed.
Imagine a web server, nginx, suddenly starts consuming 100% CPU and dropping connections. You can’t easily attach a debugger without potentially killing the process and losing valuable state. Instead, you can attach strace to the runaway nginx worker process.
Here’s how you’d do it:
First, find the PID of the problematic nginx worker.
ps aux | grep nginx
Let’s say you see a PID like 12345. Now, attach strace. We want to see all syscalls, follow child processes (in case nginx spawned something), and write to a file for later analysis. We’ll also want to timestamp events and use -s 1024 to ensure we capture full string arguments.
strace -p 12345 -f -s 1024 -ttt -o /tmp/nginx_strace.log
The -p 12345 tells strace to attach to process ID 12345.
The -f flag means "follow forks," so if the nginx process spawns new processes, strace will follow them too. This is crucial for understanding the full chain of events.
The -s 1024 increases the maximum string length strace will display for arguments, so you don’t get truncated paths or network addresses.
The -ttt option prints absolute Unix-epoch timestamps with microsecond precision on each line, giving you a very precise timeline.
And -o /tmp/nginx_strace.log directs all the output to a file, so you don’t flood your terminal and can analyze it later.
Once the strace output file (/tmp/nginx_strace.log) is generated, you can start digging. You’re looking for anomalies. What should nginx be doing? Serving files, logging requests, maybe connecting to a backend. What shouldn’t it be doing? Making calls to unusual network addresses, trying to execve strange binaries, or repeatedly failing to open critical system files.
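Before hunting for specific patterns, a quick triage pass helps: strace marks every failed syscall with = -1 followed by a decoded errno name, so counting failures is a one-liner. A minimal sketch on hypothetical log lines (shaped like the PID-and-timestamp prefix that -f and -ttt produce; the paths and values are invented):

```shell
# Build a tiny sample log with hypothetical lines in the shape
# that strace -f -ttt -o produces: "PID TIMESTAMP syscall(args) = ret".
cat > /tmp/sample_strace.log <<'EOF'
12345 1700000000.000001 openat(AT_FDCWD, "/etc/nginx/nginx.conf", O_RDONLY) = 4
12345 1700000000.000120 read(4, "worker_processes 4;\n", 4096) = 20
12345 1700000000.000340 openat(AT_FDCWD, "/var/www/missing.html", O_RDONLY) = -1 ENOENT (No such file or directory)
12345 1700000000.000550 openat(AT_FDCWD, "/var/www/missing.html", O_RDONLY) = -1 ENOENT (No such file or directory)
EOF

# Quick triage: count failed calls (strace renders them "= -1 ERRNO (...)").
grep -c '= -1 ' /tmp/sample_strace.log
```

A sudden spike in that count, relative to a healthy baseline, is usually the fastest pointer to where in the log to start reading.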
Consider these scenarios:
1. Unexpected Network Connections:
If your nginx is suddenly trying to connect() to 192.0.2.1:8080 (an example IP address, often used for documentation and testing), that’s a huge red flag. This could indicate a compromised process trying to exfiltrate data or connect to a command-and-control server.
- Diagnosis: Search the strace log for connect(. Look for unusual IP addresses or ports.

grep "connect(" /tmp/nginx_strace.log

- Fix (Conceptual): If this is unexpected, the immediate step is to isolate the machine (e.g., by blocking the IP at the firewall or disconnecting it from the network) and then investigate how the process was made to initiate this connection. This might involve analyzing configuration files for malicious changes, examining loaded modules, or understanding whether the process itself was compromised.
- Why it works: The connect() syscall is the kernel’s way of establishing a TCP connection. If a process initiates a connection to an unauthorized destination, it’s a direct indicator of abnormal behavior.
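The grep above finds the raw lines; a short pipeline can reduce them to a de-duplicated list of destinations worth checking against your firewall rules. The sample line below is hypothetical, mimicking how strace typically renders a sockaddr argument:

```shell
# Hypothetical connect() line in strace's usual sockaddr rendering.
cat > /tmp/connect_sample.log <<'EOF'
12345 1700000000.100000 connect(7, {sa_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("192.0.2.1")}, 16) = 0
EOF

# Pull out every dotted-quad destination seen in connect() calls,
# with a count per unique address.
grep 'connect(' /tmp/connect_sample.log \
  | grep -oE 'inet_addr\("[0-9.]+"\)' \
  | sort | uniq -c | sort -nr
```

Anything in that list that isn’t a known backend or upstream deserves immediate scrutiny.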
2. Suspicious File Operations:
A web server shouldn’t be trying to open() or write() to sensitive system directories like /etc/shadow or /root/.ssh.
- Diagnosis: Look for open(), openat(), or write() syscalls targeting unexpected file paths. Note that modern libc implementations typically issue openat() rather than open(), so match both.

grep -E "open(at)?\(" /tmp/nginx_strace.log | grep "/etc/"
grep "write(" /tmp/nginx_strace.log | grep "/root/"

- Fix (Conceptual): If you see attempts to write to restricted areas, it points to a privilege escalation or a malicious script trying to modify system configuration or keys. The fix involves identifying the source of the malicious script or process and removing it, then restoring any altered files from a known-good backup.
- Why it works: The open() and openat() syscalls are how processes request access to files, and write() is how they modify them. Unauthorized access attempts are clear signs of compromise.
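The open-versus-openat distinction bites in practice: a literal "open(" pattern silently skips every openat() line. A sketch on two hypothetical log lines shows the difference:

```shell
# Hypothetical lines: modern libc routes open(3) through the openat(2) syscall.
cat > /tmp/open_sample.log <<'EOF'
12345 1700000000.200000 openat(AT_FDCWD, "/etc/shadow", O_RDONLY) = -1 EACCES (Permission denied)
12345 1700000000.200100 open("/etc/passwd", O_RDONLY) = 5
EOF

# The literal pattern matches only the second line...
grep -c 'open(' /tmp/open_sample.log

# ...while the alternation catches both variants.
grep -cE 'open(at)?\(' /tmp/open_sample.log
```

If your grep for sensitive paths comes back empty on a box you suspect is compromised, check the pattern before trusting the result.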
3. Unexpected Process Execution:
Seeing execve() called with a path like /tmp/evil.sh or /usr/local/bin/malware is a critical indicator.
- Diagnosis: Search for execve(.

grep "execve(" /tmp/nginx_strace.log

- Fix (Conceptual): This means the compromised process is trying to launch another program. You need to identify the executed binary or script, analyze its purpose, and remove it. Ensure the parent process (nginx in this case) is not being tricked into running it. This might involve patching vulnerabilities or cleaning up malicious files.
- Why it works: execve() is the system call that replaces the current process image with a new one, effectively running a new program. If an unexpected program is executed, it’s a direct attempt to gain more control or perform malicious actions.
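Since the first quoted string on an execve() line is the path being executed, a sed pass can reduce the matches to a clean list of every binary the process launched. The sample line is hypothetical, following strace's usual execve() rendering:

```shell
# Hypothetical execve() line in strace's usual rendering.
cat > /tmp/execve_sample.log <<'EOF'
12345 1700000000.300000 execve("/tmp/evil.sh", ["/tmp/evil.sh"], 0x7ffd11223344 /* 20 vars */) = 0
EOF

# Extract the first quoted string: the path of the executed program.
grep 'execve(' /tmp/execve_sample.log \
  | sed -E 's/.*execve\("([^"]+)".*/\1/' \
  | sort -u
```

On a healthy nginx worker this list should be empty or near-empty; anything under /tmp or a user-writable directory is an immediate red flag.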
4. Repeated System Call Failures:
A high rate of read() or write() calls failing with EIO (Input/output error) or ENOSPC (No space left on device) on a critical file or device could indicate disk issues or a process trying to fill up storage.
- Diagnosis: Filter for the error names. strace decodes errno for you: a failed call appears as = -1 EIO (Input/output error) or = -1 ENOSPC (No space left on device), so grep for the symbolic names rather than raw numeric codes.

grep "EIO" /tmp/nginx_strace.log
grep "ENOSPC" /tmp/nginx_strace.log

- Fix (Conceptual): If it’s ENOSPC, you need to free up disk space. If it’s EIO on a specific device, you might be looking at hardware failure. The fix depends on the error: clear logs, archive data, or investigate disk health (smartctl).
- Why it works: System call return values tell you whether an operation succeeded or failed. Consistent failures with specific error codes point to underlying resource exhaustion or hardware problems.
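When failures are mixed, tallying them by errno name shows which error dominates before you decide on a fix. A sketch on hypothetical failing calls (strace decodes each errno into its symbolic name):

```shell
# Hypothetical failing calls with strace's "= -1 ERRNO (message)" rendering.
cat > /tmp/err_sample.log <<'EOF'
12345 1700000000.400000 write(8, "x", 1) = -1 ENOSPC (No space left on device)
12345 1700000000.400100 write(8, "x", 1) = -1 ENOSPC (No space left on device)
12345 1700000000.400200 read(9, 0x7ffd00000000, 4096) = -1 EIO (Input/output error)
EOF

# Tally failures by errno name, most frequent first.
grep -oE '= -1 E[A-Z]+' /tmp/err_sample.log | sort | uniq -c | sort -nr
```

If ENOSPC tops the list, start with df and log rotation; if EIO does, start with dmesg and smartctl.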
5. Unusual System Information Gathering:
Processes might query system information using syscalls like stat() or readlink() on sensitive files. While often legitimate, a pattern of querying many such files could be reconnaissance.
- Diagnosis: Look for syscalls like
stat,lstat,readlinkin conjunction with sensitive paths.grep "stat(" /tmp/nginx_strace.log | grep "/etc/" - Fix (Conceptual): This is less about an immediate "fix" and more about understanding the attacker’s intent. If you see extensive probing of system configuration, it suggests an attacker is mapping the environment before launching a more significant attack. The "fix" is to secure the discovered vulnerabilities.
- Why it works: These syscalls reveal information about files and their metadata. An attacker uses them to understand the system’s layout and identify potential targets or misconfigurations.
6. Resource Exhaustion Patterns:
While top or htop show high CPU/memory, strace can show which syscalls are being called repeatedly to achieve this. For example, a tight loop of futex() calls might indicate a process stuck waiting on a lock, or excessive poll() calls could suggest a busy-waiting scenario.
- Diagnosis: Look for syscalls that are repeated thousands or millions of times in a short period. Because the log was captured with -f and -ttt, each line starts with a PID and a timestamp, so the syscall name is the third field; strip its argument list before counting. This command counts the occurrences of each syscall and shows the top 10.

awk '{sub(/\(.*/, "", $3); print $3}' /tmp/nginx_strace.log | sort | uniq -c | sort -nr | head -10

(For a live process, strace -c -p 12345 produces this per-syscall summary directly when you detach.)

- Fix (Conceptual): The fix depends on the syscall. If it’s futex, it might be a deadlock or a bug in a synchronization primitive. If it’s poll, it might be an inefficient I/O handling loop. Understanding the syscall helps you debug the application code or identify external dependencies causing the issue.
- Why it works: By observing the frequency of specific syscalls, you can pinpoint the exact kernel operations that are consuming resources, rather than just seeing the symptom (high CPU).
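On a log captured with -f and -ttt, each line begins with a PID and a timestamp, so the syscall name sits in the third whitespace-separated field. A sketch of the counting pass on hypothetical busy-loop lines:

```shell
# Hypothetical busy-loop trace: PID and timestamp precede each syscall.
cat > /tmp/busy_sample.log <<'EOF'
12345 1700000000.600000 futex(0x7f0000000000, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
12345 1700000000.600050 futex(0x7f0000000000, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
12345 1700000000.600100 poll([{fd=6, events=POLLIN}], 1, 0) = 0 (Timeout)
EOF

# Strip the argument list from field 3, then count syscall names.
awk '{sub(/\(.*/, "", $3); print $3}' /tmp/busy_sample.log \
  | sort | uniq -c | sort -nr | head -10
```

Pairing the counts with the -ttt timestamps tells you not just which syscall dominates but how tightly it loops: two futex calls 50 microseconds apart is a very different story from two spread across a minute.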
After you’ve resolved the immediate issue and restarted your service, you might still find the system trying to re-establish a connection to a malicious IP address hardcoded in a configuration file you missed – which is exactly why the strace log is worth keeping and re-reading after the fire is out.