Troubleshoot TCP in Production: Capture and Analyze (2026)

TCP connections are silently failing in production, and you’re seeing intermittent packet loss or connection resets that don’t make sense.

This usually happens because a network device in the path is silently dropping or mangling TCP packets, and the application layer is too high-level to see it.

Common Causes and Fixes

Firewall State Table Exhaustion
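- Diagnosis: Check firewall logs for "state table full" or similar messages (on Linux, the kernel logs "nf_conntrack: table full, dropping packet"). On a Linux firewall, compare the current conntrack entry count with the configured maximum:
```
sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max
conntrack -S
```
  A count close to nf_conntrack_max means the table is nearly full. In the conntrack -S output, nonzero drop or early_drop counters mean entries are already being refused or evicted.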
- Fix: Increase the state table size on the firewall. On Linux, that means raising net.netfilter.nf_conntrack_max. On a Cisco ASA, you might use show memory detail to check usage and show running-config | include timeout to see current timeouts. Lowering idle timeouts (e.g., on Linux, net.netfilter.nf_conntrack_tcp_timeout_established=3600) can also help by freeing up entries sooner, as long as they stay long enough for legitimate idle connections, but increasing the table size is the direct fix.
- Why it works: The firewall can’t track new connections if its memory for tracking existing ones is full, leading it to drop new SYN packets or subsequent data packets.
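
To make the "close to the maximum" check concrete, here is a minimal sketch for a Linux firewall. It assumes the nf_conntrack module is loaded so that the two /proc files exist; the function name conntrack_usage_pct and the 80% warning threshold are illustrative choices, not standard values.

```shell
# Print conntrack table usage as an integer percentage.
# Arguments: current entry count, maximum entries.
conntrack_usage_pct() {
    if [ "$2" -le 0 ]; then
        echo "maximum must be positive" >&2
        return 1
    fi
    echo $(( $1 * 100 / $2 ))
}

WARN_PCT=80   # assumed alert threshold; tune for your environment

# On a host with nf_conntrack loaded, feed the function the live counters.
count_file=/proc/sys/net/netfilter/nf_conntrack_count
max_file=/proc/sys/net/netfilter/nf_conntrack_max
if [ -r "$count_file" ] && [ -r "$max_file" ]; then
    pct=$(conntrack_usage_pct "$(cat "$count_file")" "$(cat "$max_file")")
    echo "conntrack usage: ${pct}%"
    if [ "$pct" -ge "$WARN_PCT" ]; then
        echo "warning: state table is at or above ${WARN_PCT}%"
    fi
fi
```

For example, conntrack_usage_pct 52000 65536 prints 79, just under the assumed threshold.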

MTU Mismatch / Black Hole
- Diagnosis: Perform a ping with the "do not fragment" (DF) flag set and progressively larger packet sizes. On Linux:
```
ping -M do -s 1472 <destination_ip>
```
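  If a given size fails (e.g., 1472 bytes, which is a 1500-byte MTU minus the 20-byte IPv4 header and the 8-byte ICMP header) but smaller sizes work, you have an MTU issue. On Linux you might see "Frag needed and DF set" or "message too long" errors. If the larger pings simply time out with no error, the ICMP replies are probably being filtered, which is the black-hole case.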
- Fix: Set the MTU on the sending interface to match the smallest MTU in the path. For example, on a Linux server’s Ethernet interface:
```
ip link set dev eth0 mtu 1400
```
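  Alternatively, make sure Path MTU Discovery (PMTUD) can work by allowing ICMP "Fragmentation Needed" (type 3, code 4) messages through your firewalls, or use TCP MSS clamping on firewalls.
- Why it works: If a packet is too large for a link in the path and the router cannot fragment it (because DF is set), the router drops it and should send back an ICMP "Fragmentation Needed" message. When that message is filtered, the sender never learns to shrink its packets and the drop looks silent. This fix ensures packets are small enough to traverse the path without fragmentation.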
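
The size arithmetic and the "progressively larger packet sizes" probe can be sketched as follows. This assumes IPv4 without IP or TCP options and the Linux iputils ping; the function names are illustrative. The probe function is only defined here, not run, and it treats a single lost ping as "too big", so repeat it before trusting the result.

```shell
# ICMP echo payload that exactly fills a given MTU:
# MTU minus IPv4 header (20 bytes) minus ICMP header (8 bytes).
icmp_payload_for_mtu() { echo $(( $1 - 28 )); }

# TCP MSS that fits a given MTU:
# MTU minus IPv4 header (20 bytes) minus TCP header (20 bytes).
tcp_mss_for_mtu() { echo $(( $1 - 40 )); }

# Binary-search the largest ICMP payload that reaches a host with DF set.
# Usage: probe_path_mtu <host> [max_mtu]; prints the estimated path MTU.
# If the host does not answer ping at all, the result (28) is meaningless.
probe_path_mtu() {
    host=$1
    lo=0                                      # largest payload assumed to pass
    hi=$(icmp_payload_for_mtu "${2:-1500}")   # largest payload worth trying
    while [ "$lo" -lt "$hi" ]; do
        mid=$(( (lo + hi + 1) / 2 ))
        if ping -M do -c 1 -W 1 -s "$mid" "$host" >/dev/null 2>&1; then
            lo=$mid
        else
            hi=$(( mid - 1 ))
        fi
    done
    echo $(( lo + 28 ))
}

icmp_payload_for_mtu 1500   # payload to pass to ping -s for a 1500-byte MTU
tcp_mss_for_mtu 1400        # MSS to clamp to for a 1400-byte path MTU
```

The two calls at the end print 1472 and 1360. To probe a real path, run probe_path_mtu with your destination address as the argument.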

TCP Window Scaling Issues
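- Diagnosis: Use tcpdump on both ends of the connection and analyze the win (window size) and wscale (window scale) values. The scale factor is only advertised in the SYN and SYN-ACK, so the capture must include the handshake. Look for connections where the advertised window size is consistently small, or where the window scale option is missing or very low (a device in the path may have stripped it).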
```
tcpdump -i eth0 -s 0 -w tcp_window.pcap 'tcp port 80'
```
  Then analyze with Wireshark.
- Fix: Ensure TCP window scaling is enabled and correctly negotiated on both client and server. On Linux, check sysctl net.ipv4.tcp_window_scaling. If it’s 0, enable it:
```
sysctl -w net.ipv4.tcp_window_scaling=1
```
  Also, ensure net.ipv4.tcp_rmem and net.ipv4.tcp_wmem are set to reasonable values (e.g., 4096 87380 6291456).
- Why it works: Without window scaling, the maximum TCP window size is 65,535 bytes, which is insufficient for high-bandwidth, high-latency links ("long fat networks"). If scaling is disabled or misconfigured, throughput plummets.
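
The throughput ceiling follows from the fact that a sender can have at most one window of unacknowledged data in flight per round trip, so throughput is bounded by window size divided by RTT. A minimal sketch of that arithmetic, using integer math (results are truncated) and illustrative function names; the 100 ms RTT and scale factor 7 are example values:

```shell
# Effective window in bytes: the 16-bit window field shifted left by the
# negotiated scale factor (0 to 14).
scaled_window() { echo $(( $1 << $2 )); }

# Upper bound on single-connection throughput in Mbit/s: window / RTT.
# Arguments: window in bytes, round-trip time in milliseconds.
max_throughput_mbps() { echo $(( $1 * 8 / ($2 * 1000) )); }

max_throughput_mbps 65535 100                        # no scaling, 100 ms RTT
max_throughput_mbps "$(scaled_window 65535 7)" 100   # scale factor 7, same RTT
```

This prints 5 and then 671: without scaling, a 100 ms path tops out around 5 Mbit/s per connection no matter how fast the link is.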

TCP Keepalive Issues
- Diagnosis: If connections are dropping after a period of inactivity, check if keepalives are enabled and configured appropriately. Look for application logs indicating connections being closed unexpectedly.
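- Fix: Make sure the application enables keepalives on its sockets (SO_KEEPALIVE), then set appropriate intervals at the OS level. On Linux, these sysctls set the timing for sockets that have keepalives enabled: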
```
sysctl -w net.ipv4.tcp_keepalive_time=3600  # 1 hour
sysctl -w net.ipv4.tcp_keepalive_intvl=60  # 1 minute
sysctl -w net.ipv4.tcp_keepalive_probes=5  # 5 probes
```
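  This means after 1 hour of idle time, the OS sends a probe every minute, and if 5 probes go unanswered, the connection is considered dead. Set tcp_keepalive_time below the shortest idle timeout in the path; a 1-hour value does nothing against a device that expires idle connections after 30 minutes.
- Why it works: Network devices (like firewalls or load balancers) often have idle connection timeouts. TCP keepalives send small packets that reset the idle timer, keeping the connection alive in these devices’ state tables.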
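
Two quick checks are worth doing on any keepalive configuration: how long it takes to declare a dead peer, and whether the first probe goes out before a device in the path forgets the connection. A rough sketch, with illustrative function names and a hypothetical 30-minute (1800-second) firewall idle timeout:

```shell
# Approximate seconds from last activity until an unresponsive peer is
# declared dead: idle time before the first probe, plus one interval per
# unanswered probe. Arguments: keepalive time, interval, probe count.
keepalive_detect_secs() { echo $(( $1 + $2 * $3 )); }

# Succeeds when the first probe is sent before a device's idle timeout.
# Arguments: keepalive time in seconds, device idle timeout in seconds.
keepalive_beats_timeout() { [ "$1" -lt "$2" ]; }

keepalive_detect_secs 3600 60 5   # the sysctl values shown above

# 1800 is an assumed firewall idle timeout, not a value from any real device.
if keepalive_beats_timeout 3600 1800; then
    echo "keepalive fires before the idle timeout"
else
    echo "keepalive_time is too long for this idle timeout"
fi
```

With the values above this prints 3900 and then reports that keepalive_time is too long, which is the situation where the firewall drops the state entry before the first probe is ever sent.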

ECN (Explicit Congestion Notification) Misconfiguration
- Diagnosis: Look for repeated TCP Retransmission and TCP Dup ACK events in tcpdump or packet captures, especially if the network is not saturated. If ECN is enabled, you might see ECN: ECE (ECN-Echo) flags in TCP segments, but if a device in the path doesn’t support ECN, it might drop packets marked with ECN bits.
- Fix: Disable ECN on endpoints if it’s causing issues and not fully supported by the network path. On Linux, you can disable it via sysctl:
```
sysctl -w net.ipv4.tcp_ecn=0
```
  Alternatively, configure network devices to properly handle ECN markings.
- Why it works: ECN is designed to signal congestion without dropping packets. However, if intermediate devices drop packets marked for ECN, it leads to packet loss that looks like a standard congestion event but can be harder to trace.
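
Before changing the setting, it helps to know what the current value means. The sketch below decodes net.ipv4.tcp_ecn following the meanings given in the Linux kernel's ip-sysctl documentation; the function name and the wording of the descriptions are illustrative.

```shell
# Describe a net.ipv4.tcp_ecn value in plain words.
describe_tcp_ecn() {
    case "$1" in
        0) echo "disabled: ECN is neither requested nor accepted" ;;
        1) echo "enabled: requested on outgoing connections and accepted on incoming" ;;
        2) echo "passive: accepted on incoming connections, not requested on outgoing" ;;
        *) echo "unknown value: $1" ;;
    esac
}

# Decode the live setting when running on Linux.
ecn_file=/proc/sys/net/ipv4/tcp_ecn
if [ -r "$ecn_file" ]; then
    describe_tcp_ecn "$(cat "$ecn_file")"
fi
```

If the host reports 2, it only uses ECN when the peer asks for it, so connections it initiates are not ECN-marked and the sysctl is unlikely to be the cause of drops on outgoing traffic.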

TCP Selective Acknowledgement (SACK) Issues
- Diagnosis: If you see high retransmission rates without obvious packet loss, analyze tcpdump for a lack of SACK information or inconsistent SACK blocks. This can happen if SACK is enabled but poorly implemented by a device or OS.
- Fix: Ensure SACK is enabled and functioning correctly. On Linux, it’s usually enabled by default. You can check sysctl net.ipv4.tcp_sack. If you suspect a specific device, you might need to disable it on the endpoints as a last resort.
- Why it works: SACK allows the receiver to acknowledge non-contiguous blocks of received data. This significantly improves performance when multiple packets are lost in a single window, as the sender only needs to retransmit the missing segments. If SACK is broken, performance degrades severely.
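
To put a number on "high retransmission rates", you can compare the kernel's RetransSegs and OutSegs counters from /proc/net/snmp. The sketch below gives a rough host-wide figure since boot, not a per-connection one, and says nothing about SACK by itself; it is only a way to confirm that retransmissions are elevated before digging into captures. The helper names tcp_counter and retrans_permille are illustrative.

```shell
# Print one counter from the "Tcp:" section of an snmp-style file.
# The first "Tcp:" line holds column names, the second holds the values.
# Usage: tcp_counter <name> [file]
tcp_counter() {
    awk -v name="$1" '
        $1 == "Tcp:" && !seen {
            for (i = 2; i <= NF; i++) if ($i == name) col = i
            seen = 1
            next
        }
        $1 == "Tcp:" && col { print $col; exit }
    ' "${2:-/proc/net/snmp}"
}

# Retransmitted segments per 1000 segments sent (integer, truncated).
# Arguments: retransmitted segments, total segments sent.
retrans_permille() {
    if [ "$2" -le 0 ]; then
        echo 0
    else
        echo $(( $1 * 1000 / $2 ))
    fi
}

# Report the live figure when running on Linux.
if [ -r /proc/net/snmp ]; then
    retrans=$(tcp_counter RetransSegs)
    out=$(tcp_counter OutSegs)
    if [ -n "$retrans" ] && [ -n "$out" ]; then
        echo "retransmissions per 1000 segments: $(retrans_permille "$retrans" "$out")"
    fi
fi
```

Sampling the counters twice and comparing the difference gives a more useful rate than the since-boot total.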

After fixing these, you might encounter TCP Zero Window issues if your application isn’t consuming data fast enough, leading to sender pauses.
