Linux’s TCP stack is surprisingly opinionated about how much data it should buffer, and often, its defaults are way too conservative for high-performance networks.

Let’s see this in action. Imagine a simple iperf3 test between two Linux machines on a 10Gbps link. Without tuning, we might see speeds like this:

[ ID] Interval           Transfer     Bitrate
[  5]   0.00-10.00  sec  8.50 GBytes  7.30 Gbits/sec                  sender
[  5]   0.00-10.00  sec  8.49 GBytes  7.29 Gbits/sec                  receiver

Now, let’s tune. We’ll adjust a few key sysctl parameters and then a socket option.

First, the sysctl tunables. These affect the kernel’s global TCP behavior.

1. net.core.rmem_max and net.core.wmem_max: These set the absolute maximum receive and send buffer sizes the kernel will allow for any socket. The defaults are often quite small, like 212992 bytes (208 KB). For a 10Gbps link with a typical latency of 10ms, the Bandwidth-Delay Product (BDP) is 10Gbps * 0.01s = 100Mb, or about 12.5 MB. We need buffers at least this large.

Diagnosis:

sysctl net.core.rmem_max
sysctl net.core.wmem_max

Fix (example for 10Gbps, 10ms latency):

sudo sysctl -w net.core.rmem_max=16777216 # 16MB
sudo sysctl -w net.core.wmem_max=16777216 # 16MB

Why it works: This allows the kernel to allocate significantly larger buffers to individual TCP connections, preventing them from being bottlenecked by insufficient kernel-level buffering.
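The BDP arithmetic above is worth having as a one-liner when sizing buffers for other link speeds. A minimal sketch (the 10 Gbps / 10 ms figures are the example values from this section):

```python
def bdp_bytes(bandwidth_bits_per_sec: float, rtt_sec: float) -> float:
    """Bandwidth-Delay Product: bits in flight on the wire, converted to bytes."""
    return bandwidth_bits_per_sec * rtt_sec / 8

# 10 Gbps link, 10 ms round-trip time
bdp = bdp_bytes(10e9, 0.010)
print(f"BDP: {bdp / 1e6:.1f} MB")  # 12.5 MB
```

The 16 MB value used in the sysctl commands above is simply the BDP rounded up with some headroom.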

2. net.ipv4.tcp_rmem and net.ipv4.tcp_wmem: These are three-value tuples: min default max. They define the receive and send buffer sizes for TCP sockets. The default value is what a new TCP connection starts with, and it can grow up to max. The defaults are usually around 4096 87380 6291456 (4KB, 85KB, 6MB). The max value here is often lower than net.core.rmem_max/wmem_max, making it the real limit.

Diagnosis:

sysctl net.ipv4.tcp_rmem
sysctl net.ipv4.tcp_wmem

Fix (example for 10Gbps, 10ms latency, aligning with net.core.*_max):

sudo sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216" # min, default, max
sudo sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216" # min, default, max

Why it works: By increasing the max values here, we allow TCP connections to dynamically grow their send and receive buffers to match the network’s capacity, up to the kernel’s absolute limit set by net.core.*_max.

3. net.ipv4.tcp_congestion_control: Linux has several congestion control algorithms. cubic is the default and generally good, but bbr (Bottleneck Bandwidth and Round-trip propagation time) can offer significant improvements on lossy or high-latency networks by actively probing for available bandwidth.

Diagnosis:

sysctl net.ipv4.tcp_congestion_control

Fix (if bbr is available):

sudo sysctl -w net.ipv4.tcp_congestion_control=bbr

Why it works: BBR builds an explicit model of the bottleneck bandwidth and round-trip time, aiming to keep the pipe full without filling the queues along the path. Compared to loss-based algorithms like cubic, which only back off after packets are dropped, this can yield higher throughput and lower latency, especially on lossy or high-latency links.
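Before switching, it's worth confirming which algorithms the running kernel actually offers (bbr may require loading the tcp_bbr module first). A sketch that reads the procfs file directly and returns an empty list where it is absent, e.g. on non-Linux systems:

```python
from pathlib import Path

def available_congestion_control() -> list:
    """List the congestion control algorithms the kernel currently exposes."""
    path = Path("/proc/sys/net/ipv4/tcp_available_congestion_control")
    if not path.exists():
        return []
    return path.read_text().split()

algos = available_congestion_control()
if "bbr" in algos:
    print("bbr is available")
else:
    print(f"bbr not loaded; try 'sudo modprobe tcp_bbr' (available: {algos})")
```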

After applying these sysctl changes, you’ll need to make them persistent by editing /etc/sysctl.conf or a file in /etc/sysctl.d/.

Now, for socket options. These are per-application settings, controlled by the application itself via setsockopt in code. For iperf3, the relevant flag is -w (window size), which requests a specific socket buffer size and maps to SO_SNDBUF and SO_RCVBUF. (iperf3's -l option, by contrast, only sets the length of each application-level read or write, not the socket buffer.)

4. SO_RCVBUF and SO_SNDBUF (Socket Options): While sysctl sets the maximum allowed, applications can explicitly request buffer sizes with setsockopt. If an application doesn’t set them, the kernel starts from the tcp_rmem/tcp_wmem defaults and autotunes from there. There is a trade-off: explicitly setting SO_RCVBUF disables receive-buffer autotuning for that socket, so a fixed request only helps when you know the size the connection needs.

Diagnosis (using iperf3): Run iperf3 without specific buffer options, then again requesting a large socket buffer:

iperf3 -c <server_ip> -t 10 -P 4
iperf3 -c <server_ip> -t 10 -P 4 -w 4M # Requests 4MB socket buffers

(iperf3 communicates the -w setting to the server for the duration of the test, so it covers the receiver too. Without -w, the receive buffer is governed by the kernel’s autotuning, bounded by net.core.rmem_max and net.ipv4.tcp_rmem.)

Fix (using iperf3’s -w option):

iperf3 -c <server_ip> -t 10 -P 4 -w 4M

Why it works: This tells iperf3 to request 4MB socket buffers via SO_SNDBUF and SO_RCVBUF on both ends. Combined with the higher sysctl limits (requests above net.core.*_max are silently capped), this ensures each side can keep a full BDP of data in flight.
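At the code level, the same request looks like this with Python's socket module. A sketch: note that on Linux the kernel roughly doubles the requested value (it reserves space for its own bookkeeping) and caps the result at net.core.wmem_max, so checking what was actually granted is worthwhile:

```python
import socket

def request_send_buffer(sock: socket.socket, nbytes: int) -> int:
    """Ask the kernel for a send buffer; return what it actually granted."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, nbytes)
    # Linux reports about double the request, capped at net.core.wmem_max.
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
granted = request_send_buffer(sock, 4 * 1024 * 1024)  # ask for 4 MB
print(f"granted: {granted} bytes")  # far less than 8 MB if wmem_max is untuned
sock.close()
```

If the printed value is stuck near the default 416 KB ceiling, the net.core.wmem_max fix above has not taken effect.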

Let’s re-run iperf3 after applying the sysctl tuning and the -w 4M option:

[ ID] Interval           Transfer     Bitrate
[  5]   0.00-10.00  sec  11.8 GBytes  10.1 Gbits/sec                  sender
[  5]   0.00-10.00  sec  11.8 GBytes  10.1 Gbits/sec                  receiver

Notice the significant jump in throughput.

5. net.ipv4.tcp_window_scaling: This is usually enabled by default (1), but it’s fundamental. It allows TCP window sizes to exceed the original 65,535-byte limit by using a scaling factor, which is essential for high BDP networks.

Diagnosis:

sysctl net.ipv4.tcp_window_scaling

Fix:

sudo sysctl -w net.ipv4.tcp_window_scaling=1

Why it works: Without window scaling, the receive window is capped at 65,535 bytes, which at a 10ms RTT limits a single connection to roughly 52 Mbit/s regardless of link speed.
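The limit is easy to quantify: the window field in the TCP header is 16 bits, and the scale option (RFC 7323) shifts it left by at most 14 bits. A sketch of the arithmetic:

```python
MAX_UNSCALED_WINDOW = 65535  # 16-bit window field in the TCP header
MAX_SCALE = 14               # largest shift RFC 7323 allows

def max_throughput_bits(window_bytes: int, rtt_sec: float) -> float:
    """Ceiling on TCP throughput: at most one full window per round trip."""
    return window_bytes * 8 / rtt_sec

# Without scaling, a 10 ms RTT caps a single connection at ~52 Mbit/s:
print(f"{max_throughput_bits(MAX_UNSCALED_WINDOW, 0.010) / 1e6:.1f} Mbit/s")
# With the maximum scale factor, the window can reach ~1 GiB:
print(f"{(MAX_UNSCALED_WINDOW << MAX_SCALE) / 2**30:.2f} GiB")
```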

6. net.ipv4.tcp_timestamps: Also usually enabled (1). Timestamps help with accurate RTT measurement and protect against wrapped sequence numbers, which is more relevant on very high-speed links.

Diagnosis:

sysctl net.ipv4.tcp_timestamps

Fix:

sudo sysctl -w net.ipv4.tcp_timestamps=1

Why it works: Provides more robust RTT measurements for the congestion control algorithm and protection against certain types of network errors.
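The wrapped-sequence-number concern becomes concrete at these speeds: TCP sequence numbers are 32 bits, so the sequence space covers 4 GiB of data, and a fast link cycles through it in seconds. Timestamps let the receiver reject a stale segment from the previous cycle (the PAWS mechanism). A sketch of the arithmetic:

```python
SEQ_SPACE_BYTES = 2**32  # 32-bit sequence numbers cover 4 GiB of data

def seconds_to_wrap(bandwidth_bits_per_sec: float) -> float:
    """Time for the TCP sequence space to wrap at a given line rate."""
    return SEQ_SPACE_BYTES / (bandwidth_bits_per_sec / 8)

for gbps in (1, 10, 100):
    print(f"{gbps:>3} Gbps: sequence space wraps in {seconds_to_wrap(gbps * 1e9):.1f} s")
```

At 10 Gbps the space wraps in about 3.4 seconds, comfortably inside the lifetime a delayed segment can have in the network.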

The next hurdle you’ll likely face is application-level buffering or limitations in the network path itself (e.g., firewalls, intermediate routers with smaller MTUs, or saturated links).
