The most surprising thing about TCP buffer tuning is that more isn’t always better, and the optimal values depend heavily on network latency, not just raw bandwidth.
Let’s see this in action. Imagine two machines, sender and receiver, connected by a high-latency link. We want to maximize data transfer.
On the sender:
# Check current receive buffer sizes (in bytes: min default max)
sysctl net.ipv4.tcp_rmem
# Example output: net.ipv4.tcp_rmem = 4096 87380 6291456
# Check current send buffer sizes (in bytes: min default max)
sysctl net.ipv4.tcp_wmem
# Example output: net.ipv4.tcp_wmem = 4096 16384 6291456
On the receiver:
# Check current receive buffer sizes (in bytes: min default max)
sysctl net.ipv4.tcp_rmem
# Example output: net.ipv4.tcp_rmem = 4096 87380 6291456
# Check current send buffer sizes (in bytes: min default max)
sysctl net.ipv4.tcp_wmem
# Example output: net.ipv4.tcp_wmem = 4096 16384 6291456
The tcp_rmem and tcp_wmem parameters control the receive and send buffer sizes, respectively. Each has three values: min, default, and max.
min: The minimum buffer size guaranteed to each TCP socket, even under memory pressure.
default: The buffer size allocated when a socket is created.
max: The maximum size the kernel's auto-tuning can grow a socket's buffer to.
The critical insight is that the default value for tcp_rmem and tcp_wmem is often too small for high-bandwidth, high-latency networks (often called "long fat networks" or LFNs). TCP’s throughput is limited by the Bandwidth-Delay Product (BDP). The BDP is the maximum amount of data that can be "in flight" on the network at any given time.
BDP = Bandwidth (bits/sec) * Latency (seconds)
To achieve maximum throughput, the TCP send window size should be at least as large as the BDP. The TCP send window is dynamically adjusted, but the maximum size it can reach is capped by the receiver’s advertised window, which is influenced by the tcp_rmem setting on the receiving side. Similarly, the sender’s tcp_wmem influences how much data the sender’s kernel can buffer before sending it out.
Let’s say we have a 10 Gbps (10,000,000,000 bits/sec) link with a round-trip time (RTT) of 100 ms (0.1 seconds).
BDP = 10,000,000,000 bits/sec * 0.1 seconds = 1,000,000,000 bits
To convert this to bytes, divide by 8:
BDP = 1,000,000,000 bits / 8 bits/byte = 125,000,000 bytes.
This means we need to be able to hold at least 125 MB of data in flight. The example maximums above (6291456 bytes, about 6 MB) are nowhere near this, let alone the defaults of 87380 and 16384 bytes.
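The arithmetic above generalizes to any link. Here is a small sketch that computes the BDP in bytes from bandwidth and RTT (bdp_bytes is a name invented for this example, not a standard tool; it uses integer math, so RTT is taken in milliseconds):

```shell
# Compute the Bandwidth-Delay Product in bytes.
#   $1: bandwidth in bits per second
#   $2: round-trip time in milliseconds
bdp_bytes() {
  local bandwidth_bps=$1
  local rtt_ms=$2
  # BDP (bytes) = bandwidth (bits/s) * RTT (s) / 8 bits per byte;
  # dividing by 1000 converts the RTT from ms back to seconds.
  echo $(( bandwidth_bps * rtt_ms / 1000 / 8 ))
}

# 10 Gbps link, 100 ms RTT -> 125000000 bytes (125 MB)
bdp_bytes 10000000000 100
```

Plugging in your own link speed and a measured RTT (from ping, for example) gives the floor for the buffer maximums you should configure.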
To tune for this LFN scenario, we’d increase the max values for tcp_rmem and tcp_wmem on both the sender and receiver.
On sender and receiver (apply to both):
# Example: Tune for a 10 Gbps link with 100ms RTT
# Set min to a reasonable small value, default to something larger,
# and max to at least BDP (e.g., 125,000,000 bytes).
# We'll use slightly larger values to be safe and accommodate potential spikes.
sudo sysctl -w net.ipv4.tcp_rmem="4096 87380 200000000"
sudo sysctl -w net.ipv4.tcp_wmem="4096 16384 200000000"
The min and default values are less critical for peak throughput but affect smaller connections or initial connection setup. Setting max to 200000000 (200 MB) provides ample room for our 125 MB BDP.
These changes are temporary and will reset on reboot. To make them permanent, edit /etc/sysctl.conf (or a file in /etc/sysctl.d/):
net.ipv4.tcp_rmem = 4096 87380 200000000
net.ipv4.tcp_wmem = 4096 16384 200000000
Then apply them:
sudo sysctl -p
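As a quick sanity check, you can read both parameters back and confirm the persisted maximums are now in effect (the output format matches the sysctl examples earlier):

```shell
# Read back both parameters; each line prints "name = min default max"
sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem
```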
The tcp_rmem max on the receiver caps the window the receiver can advertise, and therefore the window the sender can use. The tcp_wmem max on the sender caps how much unacknowledged data the sender's kernel can buffer; a larger send buffer lets the application keep writing even when the network or the receiver is momentarily slow.
A common mistake is to only tune the max value. However, the default value for tcp_wmem is often very small (16384 bytes). If the sender’s application is faster than the network, the sender’s kernel can quickly fill its send buffer and block the application. Increasing the default tcp_wmem can help smooth out bursts of data from the application.
The tcp_rmem and tcp_wmem settings work together. The receiver advertises its available buffer space (limited by its tcp_rmem max) to the sender. The sender then uses this advertised window, combined with its own tcp_wmem capacity, to decide how much data to send. If either side has a restrictive buffer, throughput suffers.
A subtle point often overlooked is that these buffers are per-socket. While sysctl changes the system-wide defaults and maximums, individual applications can, and sometimes do, override these values using setsockopt (SO_SNDBUF/SO_RCVBUF) if they need specific tuning. Two caveats apply on Linux: explicitly set values are capped by net.core.wmem_max and net.core.rmem_max, and setting SO_RCVBUF disables the kernel's receive-buffer auto-tuning for that socket. For general high-throughput network services, setting the sysctl parameters correctly is usually sufficient.
After tuning, you’ll want to measure your actual throughput with a tool like iperf3 and inspect per-connection state, including the congestion window (cwnd), with ss -ti (netstat -s reports only host-wide aggregate counters, not per-connection windows). Note that ss reports cwnd in MSS-sized segments, not bytes; a fully opened window will approach your BDP divided by the MSS.
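A minimal monitoring sketch, assuming iperf3 is installed and an iperf3 server is already running on the remote end ("receiver" below is a placeholder hostname):

```shell
# Throughput test against the remote iperf3 server (placeholder hostname):
#   iperf3 -c receiver -t 30

# Per-connection TCP internals for established sockets: cwnd (in MSS-sized
# segments), rtt, pacing rate, and the peer's advertised window
ss -tin state established
```

Watching the ss output during the iperf3 run shows whether cwnd actually grows toward the BDP or stalls at a buffer-imposed ceiling.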
The next thing you’ll likely encounter is the TCP congestion control algorithm itself, which dynamically adjusts the cwnd based on network conditions like packet loss and latency, and you’ll want to understand how algorithms like Cubic, BBR, or Reno interact with buffer sizes.