The most surprising thing about debugging file protocol issues with tcpdump is how much of the "magic" you can actually see happening, byte by byte, between client and server.

Let’s say you’ve got a user complaining about slow file access over NFS or SMB. They’re pulling a large file, or saving a bunch of small ones, and it’s taking ages. Or maybe it’s just intermittently failing. You’ve checked network latency, disk I/O on the server, and CPU – all look fine. The next step is to see what the file protocol itself is saying.

Here’s a live example. We’ll capture traffic between an NFS client and server.

sudo tcpdump -i eth0 -w nfs_debug.pcap host 192.168.1.100 and host 192.168.1.200

This command tells tcpdump to:

  • -i eth0: Listen on the eth0 network interface. Replace eth0 with your actual interface name (e.g., ens192, enp0s3).
  • -w nfs_debug.pcap: Write the captured packets to a file named nfs_debug.pcap. This is crucial for later analysis with tools like Wireshark.
  • host 192.168.1.100 and host 192.168.1.200: Filter for traffic specifically between the client (let’s say 192.168.1.100) and the server (192.168.1.200). Replace these with your actual IP addresses.
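If there is other chatter between the two hosts, you can narrow the capture to the file-protocol ports. A sketch, assuming the same hypothetical addresses (note that NFSv3 may also involve portmapper/mountd traffic on other ports, so this filter is safest for NFSv4):

```shell
# NFS traffic only (NFSv4 runs over TCP port 2049)
sudo tcpdump -i eth0 -w nfs_debug.pcap \
  'host 192.168.1.100 and host 192.168.1.200 and port 2049'

# SMB traffic only (modern SMB runs over TCP port 445)
sudo tcpdump -i eth0 -w smb_debug.pcap \
  'host 192.168.1.100 and host 192.168.1.200 and port 445'
```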

Now, have the user reproduce the slow file operation. Once they’re done, you can stop tcpdump (Ctrl+C) and analyze nfs_debug.pcap in Wireshark.
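If you want a first look before opening the GUI, tshark (Wireshark's command-line companion) can read the same capture. A quick triage sketch:

```shell
# List NFS packets with timestamps, so a large gap between a
# call and its reply stands out immediately
tshark -r nfs_debug.pcap -Y nfs

# Packets and bytes per one-second interval, to spot stalls
tshark -r nfs_debug.pcap -q -z io,stat,1
```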

When you open the pcap file in Wireshark, you’ll see a stream of packets. For NFS, you’re looking for specific RPC (Remote Procedure Call) operations. You can filter in Wireshark using nfs.

Let’s break down what you’re seeing. NFS works by sending RPC requests from the client to the server. Common requests include:

  • LOOKUP: The client asks "does this file or directory exist?"
  • GETATTR: The client asks for file attributes (permissions, size, timestamps).
  • READDIR: The client asks for a list of files in a directory.
  • READ: The client asks to read data from a file.
  • WRITE: The client asks to write data to a file.
  • COMMIT: The client tells the server to make written data permanent.

You’ll see a sequence like this: the client sends LOOKUP for /data/bigfile.txt, and the server responds with the file’s handle and attributes. The client sends GETATTR, and the server responds with the file size and other metadata. The client sends a READ request for, say, bytes 0–8191, and the server responds with the data. This repeats for subsequent chunks of the file.
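To see how long the server takes to answer each of these operations, tshark can compute per-procedure service response times for ONC-RPC. A sketch for NFS program number 100003; the trailing argument is the NFS version on your mount:

```shell
# Min/max/avg response time per NFS procedure (LOOKUP, READ, WRITE...)
# Program 100003 is NFS; use 3 for an NFSv3 mount, 4 for NFSv4
tshark -r nfs_debug.pcap -q -z rpc,srt,100003,3
```

A procedure whose average response time dwarfs the others points you straight at the bottleneck.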

For SMB (Server Message Block), the protocol is different but the principle is the same. You’d filter with smb. Common SMB operations include:

  • CREATE (NT_CREATE in SMB1): Client requests to open a file or create it.
  • QUERY_INFO: Client asks for file information.
  • READ: Client reads data.
  • WRITE: Client writes data.
  • CLOSE: Client closes the file handle.
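For an SMB capture (say, a hypothetical smb_debug.pcap), the equivalent quick look from the command line might be:

```shell
# Show SMB2/3 commands (CREATE, READ, WRITE, CLOSE...) in order
tshark -r smb_debug.pcap -Y smb2

# Per-command service response times for SMB2/3
tshark -r smb_debug.pcap -q -z smb2,srt
```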

The key is to look for patterns. Is the client waiting a long time between sending a READ request and receiving the data? Is the server sending data back very slowly? Are there many retransmissions (indicated by [TCP Retransmission] in Wireshark)?
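Wireshark's expert-analysis flags are also available as display filters, so you can count these patterns rather than eyeball them. A sketch (the 0.1-second threshold is an arbitrary example):

```shell
# Count TCP retransmissions in the capture
tshark -r nfs_debug.pcap -Y tcp.analysis.retransmission | wc -l

# Show NFS replies that arrived more than 100 ms after their call
tshark -r nfs_debug.pcap -Y 'nfs && rpc.time > 0.1'
```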

Common NFS Issues and How to Spot Them:

  1. High Latency/Packet Loss:

    • Diagnosis: In Wireshark, look for [TCP Retransmission] and high Time values between related requests/responses. On the command line, a simple ping -c 10 192.168.1.200 from the client will show basic latency.
    • Fix: Address underlying network issues. This could mean upgrading network hardware, improving Wi-Fi signal, or optimizing routing.
    • Why it works: NFS relies on timely RPC responses. Even moderate latency can cause significant delays when multiplied by thousands of small requests.
  2. NFS Version Mismatch/Configuration:

    • Diagnosis: Check /etc/exports on the NFS server and /etc/fstab or mount output on the client. Ensure they agree on the NFS version (v3, v4, v4.1, v4.2) and security mechanisms (e.g., sec=sys, sec=krb5). In Wireshark, look at the NFS protocol details pane. The Version field will tell you what’s being negotiated.
    • Fix: Standardize on the highest NFS version both sides support. Note that the version is not a per-export option in /etc/exports; it is configured server-wide (e.g., in the [nfsd] section of /etc/nfs.conf, or via rpc.nfsd options, depending on the distribution). A typical export line looks like: /shared/data 192.168.1.0/24(rw,sync,no_subtree_check,fsid=0,no_root_squash). On the client’s /etc/fstab or mount command, pin the version explicitly: server:/shared/data /mnt/nfs nfs4 defaults,auto,nofail,_netdev 0 0 (or add vers=4.2 to the options).
    • Why it works: Older NFS versions might not support performance-enhancing features or might be less efficient. Mismatched security can lead to authentication failures or fallbacks to less performant modes.
  3. Server Overload (CPU/Memory):

    • Diagnosis: On the NFS server, run top or htop. Look for high CPU usage, particularly by nfsd processes, or excessive memory usage leading to swapping.
    • Fix: Increase the number of nfsd kernel threads. Depending on the distribution, set threads= in the [nfsd] section of /etc/nfs.conf, or run sudo rpc.nfsd 32; check the current count with cat /proc/fs/nfsd/threads. Increase RAM if necessary.
    • Why it works: nfsd threads handle incoming NFS requests. If there aren’t enough threads, or if the server is struggling with other tasks, requests queue up, leading to slow responses.
  4. rsize/wsize Mismatch or Too Small:

    • Diagnosis: In Wireshark, observe the Count field in NFS READ and WRITE calls, which shows how many bytes each operation requests or carries. If these are consistently small (e.g., 1024 or 4096 bytes), it indicates small transfer sizes. Check the client’s negotiated rsize and wsize with nfsstat -m, or look for them in /etc/fstab or mount output.
    • Fix: Increase rsize and wsize on the client mount. A good starting point is often 32768 or 65536. For example, on the client: sudo mount -o remount,rsize=65536,wsize=65536 /mnt/nfs. If using /etc/fstab, edit the line to include rsize=65536,wsize=65536.
    • Why it works: These options control the maximum amount of data transferred in a single NFS read or write operation. Larger values reduce the number of RPC calls needed for large file transfers, significantly improving throughput.
  5. sync vs. async Writes:

    • Diagnosis: Check both sides. On the server, sync in /etc/exports means every write must reach stable storage before the server acknowledges it; async lets the server acknowledge once the data is in its cache. On the client, a sync mount option forces each application write out immediately instead of letting the kernel batch writes. In your capture, a sync export shows up as longer latencies on WRITE (and COMMIT) responses.
    • Fix: For performance-critical workloads where data loss on a server crash is acceptable, consider async in the server’s /etc/exports (then re-export with sudo exportfs -ra). However, sync is the default export option and is safer for data integrity.
    • Why it works: sync forces a disk flush for every write, which is slow. async defers this, allowing the server to batch writes and improve throughput, but at the risk of losing data if the server crashes before flushing its cache.
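Several of the diagnoses above come down to checking what the client actually negotiated and how many threads the server is running. A sketch, assuming the hypothetical /mnt/nfs mount from the examples:

```shell
# On the client: show the options each NFS mount actually negotiated,
# including vers=, rsize=, wsize= and the write mode
nfsstat -m

# On the client: remount with larger transfer sizes
sudo mount -o remount,rsize=65536,wsize=65536 /mnt/nfs

# On the server: how many nfsd threads are currently running
cat /proc/fs/nfsd/threads
```

Checking nfsstat -m after the remount confirms the server actually accepted the requested sizes; it may silently cap them at its own maximum.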

Common SMB Issues and How to Spot Them:

  1. SMB Dialect Mismatch:

    • Diagnosis: In Wireshark, look at the SMB protocol details. The Dialect field shows what version of SMB is being used (e.g., 2.02, 2.1, 3.0, 3.02, 3.1.1). Older dialects are less performant.
    • Fix: Configure both client and server to prefer newer dialects. This is often controlled by OS settings or Samba configuration (smb.conf on Linux). For example, in smb.conf on the server, set server min protocol = SMB2_10 to refuse anything older than SMB 2.1, and server max protocol = SMB3_11 to allow up to SMB 3.1.1. (server min protocol = NT1 re-enables the widely deprecated SMBv1 and should only be used for very old clients.)
    • Why it works: Newer SMB dialects include significant performance improvements, better handling of concurrent operations, and more efficient packet structures.
  2. Large MTU Mismatch:

    • Diagnosis: tcpdump can reveal this. If you see many small packets where you’d expect larger ones, or if there are many TCP retransmissions, an MTU issue might be present. Use ping -M do -s <payload_size> <server_ip> from the client to find the largest packet that doesn’t get fragmented; note that -s sets the ICMP payload, so on a standard 1500-byte MTU link the largest unfragmented payload is 1472 (1500 minus 20 bytes of IP header and 8 bytes of ICMP header).
    • Fix: Ensure the MTU is consistent across all network devices between the client and server, including network interface cards, switches, and routers. Set the MTU to 1500 for standard Ethernet, or higher (e.g., 9000 for jumbo frames) if supported and configured end-to-end.
    • Why it works: A mismatched MTU can cause packets to be fragmented or dropped, leading to retransmissions and severely degraded performance. SMB traffic, especially large file transfers, benefits greatly from a large, consistent MTU.
  3. SMB Signing or Encryption Overhead:

    • Diagnosis: In Wireshark, examine the SMB packet details. Look for flags indicating signing or encryption. High CPU usage on the client or server during file operations can also be a symptom.
    • Fix: If security policies allow, disable SMB signing or encryption. This is often configured in Group Policy on Windows clients/servers or in smb.conf on Samba. For example, in smb.conf, you might set server signing = disabled or smb encrypt = off (use with caution).
    • Why it works: Cryptographic operations for signing and encryption consume CPU cycles, adding latency to every SMB transaction. Disabling them speeds up operations but reduces security.
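On a Samba server, the dialect, signing, and encryption settings from the last three items can all be inspected in one place. A sketch, with the configuration values shown as hypothetical examples:

```shell
# Show the effective Samba configuration, filtered to the relevant knobs
testparm -s 2>/dev/null | grep -Ei 'protocol|signing|encrypt'

# Example [global] settings in /etc/samba/smb.conf:
#   server min protocol = SMB2_10
#   server max protocol = SMB3_11
#   server signing = disabled    # only if security policy allows
```

testparm also flags syntax errors in smb.conf, so it is worth running after any edit and before restarting the smbd service.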

After fixing these issues, the next common problem you’ll encounter is a sudden surge in disk I/O errors as the faster protocol starts hammering the underlying storage.

Want structured learning?

Take the full Tcpdump course →