The surprising truth about scaling to a million TCP connections is that it’s less about the sheer number of connections and more about how efficiently the kernel manages the state for each one.
Let’s see this in action. Imagine a simple Go program that just accepts connections and immediately closes them.
```go
package main

import (
	"fmt"
	"net"
	"os"
)

func main() {
	port := "8080"
	listener, err := net.Listen("tcp", ":"+port)
	if err != nil {
		fmt.Fprintf(os.Stderr, "Error listening: %v\n", err)
		os.Exit(1)
	}
	defer listener.Close()
	fmt.Printf("Listening on port %s\n", port)
	for {
		conn, err := listener.Accept()
		if err != nil {
			fmt.Fprintf(os.Stderr, "Error accepting: %v\n", err)
			continue
		}
		go func(c net.Conn) {
			// Immediately close the connection
			defer c.Close()
		}(conn)
	}
}
```
If you run this on a typical Linux machine without any tuning, you’ll likely hit a wall well before a million connections. The operating system has a lot of work to do for each TCP connection: tracking its state (SYN_SENT, ESTABLISHED, CLOSE_WAIT, etc.), managing its send and receive buffers, and allocating memory for its associated data structures.
The core problem is that the kernel, by default, is conservative with its resources to ensure stability on a wide range of hardware. When you start pushing for massive concurrency, these defaults become bottlenecks.
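You can inspect those defaults before changing anything. Here is a minimal, Linux-specific sketch that reads a few of the relevant parameters directly from `/proc/sys` (the `readSysctl` helper is our own, not a standard API):

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// readSysctl returns the value of a kernel parameter from /proc/sys,
// e.g. key "net.core.somaxconn" maps to /proc/sys/net/core/somaxconn.
func readSysctl(key string) (string, error) {
	path := "/proc/sys/" + strings.ReplaceAll(key, ".", "/")
	b, err := os.ReadFile(path)
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(string(b)), nil
}

func main() {
	// A few of the defaults that become bottlenecks at high scale.
	for _, key := range []string{
		"fs.file-max",
		"net.core.somaxconn",
		"net.ipv4.ip_local_port_range",
		"net.ipv4.tcp_fin_timeout",
	} {
		if v, err := readSysctl(key); err == nil {
			fmt.Printf("%-35s = %s\n", key, v)
		}
	}
}
```

On an untuned machine you will typically see a `somaxconn` in the low thousands and a port range of roughly 28,000 ports, which is exactly what the sections below adjust.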
Here’s how the kernel manages connection state:
- `struct sock`: The primary kernel data structure representing a socket; each TCP connection has one. It holds all the state, including IP addresses, ports, sequence numbers, window sizes, and pointers to other related data.
- TCP Control Block (TCB): While not a distinct C struct in all kernels, it conceptually represents the TCP-specific state within `struct sock`. This includes congestion control parameters, retransmission timers, and other TCP options.
- Network buffers (`sk_buff`): For each connection, the kernel needs to manage memory for sending and receiving data. `sk_buff` structures hold network packet data. High connection counts mean many small `sk_buff` allocations, which can lead to memory fragmentation and high overhead.
- File descriptors: Each active socket is represented by a file descriptor in user space. The `ulimit -n` setting controls the maximum number of file descriptors a process can have, and the system-wide `fs.file-max` limits the total number of open files.
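The aggregate cost of all this per-connection state is visible in `/proc/net/sockstat`. A small Linux-specific sketch (the `tcpSockstat` helper is our own name for the parser):

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// tcpSockstat parses the "TCP:" line of /proc/net/sockstat into a map,
// e.g. {"inuse": "5", "tw": "0", "mem": "1", ...}. The "mem" field is
// in pages (usually 4 KiB), so it reflects kernel buffer memory in use.
func tcpSockstat() (map[string]string, error) {
	f, err := os.Open("/proc/net/sockstat")
	if err != nil {
		return nil, err
	}
	defer f.Close()

	stats := map[string]string{}
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) == 0 || fields[0] != "TCP:" {
			continue
		}
		// The line reads "TCP: inuse N orphan N tw N alloc N mem N".
		for i := 1; i+1 < len(fields); i += 2 {
			stats[fields[i]] = fields[i+1]
		}
	}
	return stats, sc.Err()
}

func main() {
	stats, err := tcpSockstat()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("TCP sockets in use: %s, in TIME_WAIT: %s, buffer pages: %s\n",
		stats["inuse"], stats["tw"], stats["mem"])
}
```

Watching the `mem` and `tw` counters while you ramp up connections is a quick way to see where memory is actually going.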
To scale to 1M+ connections, you need to adjust several kernel parameters. These parameters control how much memory the kernel can use for network buffers, how many file descriptors are available, and how it handles ephemeral ports.
The primary levers you’ll pull are:
- Ephemeral Port Range: When a machine initiates a TCP connection, its OS dynamically assigns a source port from the ephemeral range. This range needs to be large enough to avoid running out of ports if your server is also acting as a client (e.g., for outbound connections or proxying).
  - Diagnosis: Check the current range: `sysctl net.ipv4.ip_local_port_range`
  - Fix: Expand the range. A common setting for high-scale servers is `sysctl -w net.ipv4.ip_local_port_range="1024 65535"`. This gives you almost 64,000 ports to work with for outgoing connections.
  - Why it works: This ensures that even if your server initiates many connections, it has a vast pool of available source ports to choose from, preventing "port exhaustion" errors on the client side of those connections.
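To see exactly how many source ports the current setting gives you, you can read and parse the range yourself. A short sketch (Linux-specific; `usablePorts` is our own helper name):

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// usablePorts parses the two numbers in net.ipv4.ip_local_port_range
// (e.g. "1024\t65535") and returns how many source ports each
// (source IP, destination IP, destination port) tuple can draw from.
func usablePorts(rangeValue string) (int, error) {
	fields := strings.Fields(rangeValue)
	if len(fields) != 2 {
		return 0, fmt.Errorf("unexpected format: %q", rangeValue)
	}
	lo, err := strconv.Atoi(fields[0])
	if err != nil {
		return 0, err
	}
	hi, err := strconv.Atoi(fields[1])
	if err != nil {
		return 0, err
	}
	return hi - lo + 1, nil
}

func main() {
	raw, err := os.ReadFile("/proc/sys/net/ipv4/ip_local_port_range")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	n, err := usablePorts(string(raw))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("ephemeral ports available per destination: %d\n", n)
}
```

Note that the limit applies per destination (IP, port) pair, which is why proxies fanning out to a single upstream exhaust ports long before servers that talk to many destinations.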
- TCP TIME_WAIT State: When a TCP connection closes, the side that performs the active close (sends the first FIN) enters the `TIME_WAIT` state for a period (60 seconds on Linux, nominally 2*MSL). This ensures that delayed packets from a previous incarnation of the connection don't interfere with a new connection on the same socket pair. With millions of connections, you can have millions of sockets stuck in `TIME_WAIT`, consuming memory and port numbers.
  - Diagnosis: Check the current `TIME_WAIT` count: `ss -tan | grep TIME-WAIT | wc -l`
  - Fix: Enable `tcp_tw_reuse` with `sysctl -w net.ipv4.tcp_tw_reuse=1`, and consider `sysctl -w net.ipv4.tcp_fin_timeout=30` (reducing the FIN timeout also helps, though `tcp_tw_reuse` is more direct). Avoid `tcp_tw_recycle`: it breaks clients behind NAT, and it was removed from the kernel entirely in 4.12.
  - Why it works: `tcp_tw_reuse=1` lets the kernel reuse a local port still in `TIME_WAIT` for a new outgoing connection when TCP timestamps prove the new connection cannot be confused with the old one. `tcp_fin_timeout` reduces how long orphaned sockets stay in `FIN_WAIT_2`; despite its name, it does not shorten `TIME_WAIT` itself.
- Maximum Number of Open Files (File Descriptors): Each network connection consumes a file descriptor. You need to increase both the per-process limit and the system-wide limit.
  - Diagnosis: Check the per-process limit with `ulimit -n` and the system-wide limit with `sysctl fs.file-max`.
  - Fix: For the per-process limit, edit `/etc/security/limits.conf` and add lines like:

    ```
    * soft nofile 1048576
    * hard nofile 1048576
    ```

    Then increase the system-wide limit: `sysctl -w fs.file-max=2097152` (set it to at least twice your target connection count). `limits.conf` changes apply only to new login sessions, so restart your services; `sysctl` changes take effect immediately but don't persist across reboots unless added to `/etc/sysctl.conf`.
  - Why it works: This directly increases the number of concurrent connections your processes can open and the total number of file handles the kernel can manage across all processes.
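A server can also raise its own soft limit up to the hard limit at startup, which avoids depending on the shell that launched it. A sketch using `syscall.Setrlimit` (the `raiseNofileLimit` helper name is ours; note that Go 1.19+ runtimes already do this automatically, so this mainly matters for older toolchains or for logging the effective limit):

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// raiseNofileLimit lifts the soft RLIMIT_NOFILE up to the hard limit,
// which is as far as an unprivileged process may go. The hard limit
// itself comes from limits.conf (or systemd's LimitNOFILE).
func raiseNofileLimit() (uint64, error) {
	var rl syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		return 0, err
	}
	rl.Cur = rl.Max // raise soft limit to the hard limit
	if err := syscall.Setrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		return 0, err
	}
	return rl.Cur, nil
}

func main() {
	limit, err := raiseNofileLimit()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("file descriptor limit now %d\n", limit)
}
```

Logging this number at startup is a cheap way to catch a misconfigured deployment before it fails at 1,024 connections in production.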
- TCP Buffer Sizes: Each connection has receive and send buffers. If these are too small, throughput suffers. If they are too large, memory usage per connection can become prohibitive at high scales. The kernel can dynamically tune these, but setting sensible minimums and maximums is important.
  - Diagnosis: Check the current values: `sysctl net.core.rmem_max`, `sysctl net.core.wmem_max`, `sysctl net.ipv4.tcp_rmem`, `sysctl net.ipv4.tcp_wmem`.
  - Fix: Increase `net.core.rmem_max` and `net.core.wmem_max` to a reasonably large value, e.g. `sysctl -w net.core.rmem_max=16777216` and `sysctl -w net.core.wmem_max=16777216`. Also tune `net.ipv4.tcp_rmem` and `net.ipv4.tcp_wmem`: a common setting is `sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"` and `sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"`. The three values are the minimum, default, and maximum buffer sizes in bytes.
  - Why it works: This allows TCP to use larger buffers, which can improve throughput over high-latency networks and reduce the number of packets needed to fill the pipe. The kernel's auto-tuning mechanism will then use these larger maximums effectively.
- TCP Congestion Control: While not directly a scaling limit, an efficient congestion control algorithm is crucial. For high-performance servers, `net.ipv4.tcp_congestion_control` is typically `cubic` (the default) or `bbr`.
  - Diagnosis: Check the current algorithm: `sysctl net.ipv4.tcp_congestion_control`.
  - Fix: If `cubic` is not performing well, consider `bbr` if your kernel supports it (4.9+) and your network has a high bandwidth-delay product: `sysctl -w net.ipv4.tcp_congestion_control=bbr`.
  - Why it works: `bbr` aims to improve throughput and reduce latency by directly measuring bandwidth and round-trip time, rather than relying solely on packet loss as a congestion signal.
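Before switching, it's worth confirming which algorithms the running kernel actually offers. A Linux-specific sketch reading the same values `sysctl` would (the `availableCongestionControl` helper name is ours):

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// availableCongestionControl reports which TCP congestion control
// algorithms the running kernel can use, and which one is active.
func availableCongestionControl() (active string, available []string, err error) {
	a, err := os.ReadFile("/proc/sys/net/ipv4/tcp_congestion_control")
	if err != nil {
		return "", nil, err
	}
	avail, err := os.ReadFile("/proc/sys/net/ipv4/tcp_available_congestion_control")
	if err != nil {
		return "", nil, err
	}
	return strings.TrimSpace(string(a)), strings.Fields(string(avail)), nil
}

func main() {
	active, available, err := availableCongestionControl()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("active: %s, available: %v\n", active, available)
	// If "bbr" is missing from the available list, the module must be
	// loaded first (modprobe tcp_bbr) before the sysctl will accept it.
}
```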
- Network Queue Management (Backlog): The kernel maintains a queue for incoming connection requests (the SYN queue) and a queue for completed handshakes waiting to be accepted by the application (the accept queue). These backlogs need to be large enough to absorb bursts of new connections.
  - Diagnosis: Check the current limits: `sysctl net.core.somaxconn` (cap on the accept queue of listening sockets) and `sysctl net.ipv4.tcp_max_syn_backlog` (SYN queue size).
  - Fix: Increase `net.core.somaxconn`: `sysctl -w net.core.somaxconn=4096`. Increase `net.ipv4.tcp_max_syn_backlog`: `sysctl -w net.ipv4.tcp_max_syn_backlog=2048`. Note that the `backlog` argument your application passes to `listen()` is silently capped at `somaxconn`, so raising the sysctl has no effect if the application asks for less (Go's `net.Listen` reads `somaxconn` for you automatically).
  - Why it works: A larger backlog ensures that incoming connection requests, especially during a traffic spike, are not dropped by the kernel before the application can accept them.
Tuning these parameters allows the kernel to efficiently manage the state and resources for a massive number of concurrent TCP connections.
After tuning, the next hurdle you’ll likely encounter is application-level processing or the limits of your CPU and memory to handle the actual work for each connection.