strace can be weaponized to inject faults into running processes, allowing you to test how your applications handle unexpected system call failures.

Let’s see it in action. Imagine you have a simple Go program that reads a file:

package main

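import (
	"fmt"
	"log"
	"os"
)

func main() {
	// os.ReadFile (which replaces the deprecated ioutil.ReadFile) opens the
	// file with the openat syscall on Linux, then reads it.
	data, err := os.ReadFile("mydata.txt")
	if err != nil {
		// A failure from openat or read, real or injected, lands here.
		log.Fatalf("Failed to read file: %v", err)
	}
	fmt.Println("File content:", string(data))
}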

Normally, this works fine if mydata.txt exists. But what if the openat syscall fails? We can simulate this.

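First, find the PID of your running Go program, for example with pgrep. Let's say it's 12345. (As written, this program reads the file once and exits almost immediately. In practice you would either attach to a longer-running process or start the program under strace directly instead of using -p.)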

Now run strace with the -f (follow forks) and -e trace=openat options. The -e inject option names the syscall to fail, and its error= parameter sets the error code to return.

sudo strace -p 12345 -f -e trace=openat -e inject=openat:error=ENOENT

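Here, ENOENT is the error code for "No such file or directory." When the Go program calls openat on mydata.txt, strace intercepts the call and hands ENOENT back to the program instead of letting the kernel open the file. os.ReadFile returns this as an error, and the program takes its error path, log.Fatalf("Failed to read file: %v", err), printing a message that ends in "open mydata.txt: no such file or directory". Keep in mind that the injection applies to every openat the traced process makes while strace is attached, not only the one for mydata.txt.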

The core idea is that strace acts as a man-in-the-middle for system calls. When a process makes a syscall such as openat, the kernel normally handles it directly. With strace -p <PID>, strace uses the kernel's ptrace interface to stop the process at syscall entry, before the kernel carries out the call. It can then let the syscall proceed. With -e inject, it can instead tamper with the call: for error= and retval= injections, strace replaces the syscall number with an invalid one so the real call never runs, then arranges for the process to see the result you chose, as if the kernel had returned it.

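The trace= option filters which syscalls strace reports, while inject= separately names the syscalls to tamper with and the outcome to produce. Keeping trace= narrow mostly keeps the output readable rather than making strace faster. In its default ptrace mode, strace still stops the process on every syscall, whatever trace= says. (Recent strace versions offer --seccomp-bpf to stop only on traced syscalls, but it applies only to processes strace starts itself, not to ones attached with -p.) error=<errno> is the most common injection type. You can also make a syscall return a chosen success value with retval=; the two are mutually exclusive.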

This technique is especially useful for testing edge cases that are hard to reproduce reliably: network failures (ECONNRESET, ETIMEDOUT), a full disk (ENOSPC), permission denied (EACCES), or transient failures that only strike occasionally. Injecting these errors at the syscall level shows whether your application's error handling actually holds up.

For instance, to test a scenario where writing to a file fails because the disk is full, you could target the write syscall:

sudo strace -p 12345 -f -e trace=write -e inject=write:error=ENOSPC

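ENOSPC signals "No space left on device." When your application receives this error from write, it should ideally fall back gracefully or show the user a clear error message, rather than crashing or leaving a half-written, corrupted file. Be aware that this command fails every write the process makes, including writes to stdout and stderr, so the application's own error message may never appear. To fail only a specific call, add strace's when= parameter. For example, inject=write:error=ENOSPC:when=3 fails only the third write.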

You can also inject successful return values to test code paths that are only exercised on success, or to bypass costly operations for faster testing. For example, if you wanted to ensure a file read operation always returned 0 bytes read (simulating an empty file without actually creating one), you could do:

sudo strace -p 12345 -f -e trace=read -e inject=read:retval=0

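This makes every read syscall return 0, which the application interprets as end-of-file, regardless of the actual file content. Note that it affects every read the process makes, including reads from pipes, sockets and standard input, not only the file you have in mind.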

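The -f flag matters for two reasons. First, many applications use fork() or clone() to create child processes. Without -f, strace won't follow them, so a syscall made by a child escapes your injection. Second, when combined with -p, -f attaches to every thread of the target process rather than only the thread whose ID you passed. That is critical for Go programs, whose runtime runs goroutines on several OS threads. Without -f, the openat behind os.ReadFile may run on a thread strace isn't watching, and the injection silently misses. Instead of attaching to an existing PID, you can also start a new command under strace with strace <options> <command>.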
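To see why -f matters for Go in particular, this Linux-only sketch counts the OS threads of the current process by listing /proc/self/task. The helper osThreadCount is hypothetical. Each entry is a separate thread that strace has to attach to, and with -p alone only the thread whose ID you passed is traced.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
)

// osThreadCount (hypothetical helper) returns how many OS threads (Linux
// tasks) this process has, by listing /proc/self/task. Linux-only.
func osThreadCount() (int, error) {
	entries, err := os.ReadDir("/proc/self/task")
	if err != nil {
		return 0, err
	}
	return len(entries), nil
}

func main() {
	n, err := osThreadCount()
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot read /proc/self/task:", err)
		os.Exit(1)
	}
	// Even a trivial Go program usually runs on several OS threads; each is a
	// separate ptrace target, which is why strace -p PID needs -f.
	fmt.Printf("PID %d: %d OS threads, GOMAXPROCS=%d\n", os.Getpid(), n, runtime.GOMAXPROCS(0))
}
```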

A subtle but important point is that strace operates at the process level, from user space. It intercepts a syscall after the application has issued it but before the kernel executes it. The injected error is then returned to the application's syscall wrapper exactly where a genuine kernel failure would appear. This is distinct from kernel-level fault injection mechanisms, such as the Linux fault-injection framework (failslab, fail_page_alloc, fail_make_request), which make failures happen deeper within the kernel itself.

The next hurdle you’ll likely encounter is managing multiple concurrent fault injections or testing complex inter-process communication scenarios where a failure in one process needs to be correlated with the behavior of another.
