TCP’s behavior inside container networking stacks, especially when managed by orchestrators like Kubernetes and advanced CNI plugins like Cilium, often deviates from what you would see in a traditional bare-metal environment. The surprising part is that TCP connections do not simply pass through a transparent network; they are actively managed, modified, and sometimes entirely rewritten by the container runtime, the operating system’s networking stack, and the CNI plugin.

Let’s see this in action. Imagine a simple HTTP request from Pod A to Service S in Kubernetes.

Pod A (IP: 10.244.0.5) wants to talk to Service S (ClusterIP: 10.96.0.10, Port: 80).

On Pod A’s host (let’s say its IP is 192.168.1.100), the kernel sees an outgoing packet destined for 10.96.0.10:80.

# On Pod A's node, observing traffic destined for the Service IP
tcpdump -ni any host 10.96.0.10 and port 80

The packet will likely be intercepted by iptables rules managed by kube-proxy, or by Cilium’s eBPF datapath. Let’s assume kube-proxy for a moment. The destination IP 10.96.0.10 will be rewritten (DNAT) to the IP of one of the backend Pods for Service S, say 10.244.1.20. The source IP of Pod A (10.244.0.5) may also be rewritten (SNAT) to the node’s IP (192.168.1.100), for example when the CNI masquerades traffic that leaves the node.
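The rewrite just described can be sketched in a few lines of Python. This is a toy illustration of what the kernel’s NAT rules do, not kube-proxy’s actual code; the backend list and the ephemeral port 54321 are hypothetical values carried over from the example.

```python
import random

# Hypothetical cluster state from the example above (illustration only).
SERVICE = ("10.96.0.10", 80)                           # Service S ClusterIP:port
BACKENDS = [("10.244.1.20", 80), ("10.244.2.31", 80)]  # healthy backend Pods
NODE_IP = "192.168.1.100"                              # Pod A's node

def translate(packet):
    """Mimic the DNAT (+ optional SNAT) that kube-proxy's rules apply."""
    if packet["dst"] == SERVICE:
        # DNAT: pick one healthy backend (kube-proxy picks randomly per flow).
        packet["dst"] = random.choice(BACKENDS)
        # SNAT (when masquerading applies): hide the Pod IP behind the node IP.
        packet["src"] = (NODE_IP, 54321)  # 54321 = hypothetical ephemeral port
    return packet

pkt = translate({"src": ("10.244.0.5", 12345), "dst": ("10.96.0.10", 80)})
print(pkt["dst"] in BACKENDS)  # True: destination is now a real backend Pod
print(pkt["src"][0])           # 192.168.1.100
```

The real rules are installed per-Service and matched per-packet in the kernel; the point here is only the tuple rewrite.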

# Kernel sees this *before* it leaves the node
# Source: 192.168.1.100:54321 (ephemeral port on the node)
# Dest:   10.244.1.20:80 (actual backend Pod IP and port)

If Cilium is in use, it can employ eBPF to handle this routing and NAT more efficiently, often bypassing iptables altogether, but the principle of destination and source address translation remains the same.

The problem this solves is fundamental: Pods within a Kubernetes cluster need to communicate with each other using stable IP addresses (Pod IPs) and abstract service endpoints (Service IPs), even though the underlying network infrastructure might not support these directly or might have a flat IP scheme. Kubernetes needs a way to map these internal, virtual IPs to actual network paths and physical interfaces.

Here’s how the mental model breaks down:

  1. Pod IP: Each Pod gets its own IP address within the cluster’s IP address space (e.g., 10.244.0.0/16). This is what the application inside the Pod sees as its own IP.
  2. Service IP (ClusterIP): A stable, virtual IP address fronting a set of backend Pods. Clients inside the cluster connect to this IP to reach one of the healthy Pods behind the Service; the ClusterIP is not routable from outside the cluster.
  3. CNI Plugin (e.g., Calico, Flannel, Cilium): This is the crucial component responsible for setting up the Pod network. It allocates Pod IPs, configures network interfaces (like veth pairs), and establishes routing or overlay networks between nodes so Pods on different nodes can communicate.
  4. kube-proxy / Cilium’s eBPF: This component implements the Service abstraction. It watches Service and EndpointSlice objects, and when a Pod connects to a Service IP, it intercepts the traffic and translates the Service IP and port to the IP and port of one of the healthy backend Pods by dynamically programming iptables rules (or eBPF maps). This is Destination Network Address Translation (DNAT).
  5. Node’s Network Interface: For traffic leaving the cluster (and, with some CNIs, traffic crossing nodes), the source IP of the originating Pod is translated to the node’s IP address (Source Network Address Translation, SNAT, often called masquerading). This is necessary because the external network, or other nodes in a non-overlay network, may not know how to route back to a Pod’s internal IP.

The key levers you control are primarily through your CNI plugin’s configuration and Kubernetes networking primitives:

  • CNI Plugin Choice & Configuration: Whether you use Flannel, Calico, Cilium, etc., dictates how Pod IPs are assigned and how nodes connect. Cilium, for instance, offers policy enforcement and advanced load balancing via eBPF.
  • Service Definition: Defining Services with appropriate selector fields links the Service IP to the target Pods.
  • Network Policies: Kubernetes NetworkPolicy objects, or Cilium’s richer CiliumNetworkPolicy, allow you to define firewall rules at the Pod level, influencing which TCP connections are allowed to be established.
  • Node IP Configuration: The IP addresses assigned to your Kubernetes nodes are critical for SNAT and inter-node communication.
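Tying the first two levers together, here is a minimal, hypothetical Service manifest; the `selector` is what links the stable ClusterIP to the backend Pods:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: service-s          # hypothetical name for "Service S" above
spec:
  selector:
    app: web               # matches the labels on the backend Pods
  ports:
    - port: 80             # the Service port clients connect to
      targetPort: 8080     # the port the Pod's container listens on
```

kube-proxy (or Cilium) watches this object plus the EndpointSlices generated from the matching Pods, and programs the DNAT rules accordingly.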

When using Cilium with eBPF, the process of routing and load balancing for Service IPs is handled directly in the kernel’s network stack via eBPF programs, bypassing iptables and kube-proxy in many scenarios. This offers significant performance gains and allows for more granular control, including L3/L4 and L7 load balancing, and identity-based security policies rather than IP-based ones. The eBPF programs hook into various network events, inspect packets, and make decisions about forwarding, NAT, and policy enforcement.
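Conceptually, Cilium’s eBPF service handling replaces iptables rule traversal with constant-time hash-map lookups. A rough Python analogy of the lookup (the real state lives in kernel BPF maps, and the addresses here are the hypothetical ones from the example):

```python
# Rough analogy: a service map keyed by (VIP, port) -> list of backends.
service_map = {
    ("10.96.0.10", 80): [("10.244.1.20", 80), ("10.244.2.31", 80)],
}

def lookup(dst_ip, dst_port, flow_hash):
    """Resolve a destination: Service VIPs map to a backend, others pass through."""
    backends = service_map.get((dst_ip, dst_port))
    if backends is None:
        return (dst_ip, dst_port)               # not a Service VIP: forward as-is
    return backends[flow_hash % len(backends)]  # stable per-flow backend choice

print(lookup("10.96.0.10", 80, flow_hash=7))  # ('10.244.2.31', 80)
```

Keying the choice on a per-flow hash keeps all packets of one TCP connection pinned to the same backend, which is essential for the connection to survive.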

A common point of confusion is how TCP connection tracking (conntrack) interacts with this. When DNAT occurs, the kernel records both the original and the translated address tuples so that return traffic can be rewritten back. For example, a packet from 10.244.0.5:12345 to 10.96.0.10:80 might leave the node as 192.168.1.100:54321 to 10.244.1.20:80; the conntrack table stores this mapping. When the reply from 10.244.1.20:80 to 192.168.1.100:54321 arrives, conntrack reverses both translations, delivering it to 10.244.0.5:12345 with the Service IP restored as the source.
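That conntrack bookkeeping can be modeled as a small lookup table. A toy sketch, using the hypothetical addresses from the example (the real table lives in the kernel and also tracks TCP state, flags, and timeouts):

```python
# Toy conntrack table: maps a reply's (src, dst) back to the pre-NAT tuple.
conntrack = {}

def record_nat(orig_src, orig_dst, nat_src, nat_dst):
    """Forward path: remember how the packet was rewritten, keyed for replies."""
    conntrack[(nat_dst, nat_src)] = (orig_dst, orig_src)

def reverse_nat(reply_src, reply_dst):
    """Reply path: un-NAT a reply so it reaches the original Pod."""
    return conntrack.get((reply_src, reply_dst), (reply_src, reply_dst))

# Original: Pod A -> Service; after DNAT+SNAT: node -> backend Pod.
record_nat(
    orig_src=("10.244.0.5", 12345), orig_dst=("10.96.0.10", 80),
    nat_src=("192.168.1.100", 54321), nat_dst=("10.244.1.20", 80),
)

# A reply arrives from the backend Pod, addressed to the node...
new_src, new_dst = reverse_nat(("10.244.1.20", 80), ("192.168.1.100", 54321))
print(new_src)  # ('10.96.0.10', 80): Pod A sees the reply come from the Service
print(new_dst)  # ('10.244.0.5', 12345)
```

Note that the reply appears to come from the Service IP, not the backend Pod; without that reverse translation, Pod A’s TCP stack would reject the reply as belonging to no known connection.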

The fact that conntrack entries are tied to the node’s interface and IP when NAT is involved is a critical detail. If your CNI plugin doesn’t properly manage conntrack state, or if a node’s conntrack table fills up (the ceiling is `net.netfilter.nf_conntrack_max`), you can experience intermittent connection failures or dropped packets, particularly under heavy load or with many concurrent connections to Services.

Understanding how TCP’s state is managed across these layers, from the Pod’s perspective to the node’s network interface and the CNI’s routing logic, is key to debugging connectivity issues. The next challenge is often understanding how TCP Fast Open or other TCP optimizations are affected by this network virtualization and NAT.

Want structured learning?

Take the full TCP course →