NFS client requests can be dropped by the server’s network stack if the server is overloaded, leading to client-side timeouts and retries.
Common Causes and Fixes for NFS Timeouts
- **NFS Server CPU Saturation**: The NFS server’s CPU is too busy to process incoming requests promptly.
  - Diagnosis: On the NFS server, run `top` or `htop` and look for processes consuming high CPU, especially `nfsd` or `rpc.statd`.
  - Fix: Increase the number of `nfsd` threads. Edit `/etc/nfs.conf` (or `/etc/sysconfig/nfs` on older systems) and raise the thread count, e.g., `threads=128` in the `[nfsd]` section. Restart the NFS server service: `systemctl restart nfs-server`. This allows the server to handle more concurrent NFS requests.
  - Diagnosis: Check NFS server CPU utilization with `sar -u 1 5`.
  - Fix: If CPU usage consistently exceeds 80%, consider offloading some NFS traffic to other servers or upgrading the server’s CPU.
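One quick check for thread starvation: the `th` line in `/proc/net/rpc/nfsd` reports the current `nfsd` thread count. A sketch parsing a captured sample line (the live file only exists on an NFS server, so the sample here is inlined):

```shell
# On a live NFS server:  grep ^th /proc/net/rpc/nfsd
# The second field of the "th" line is the number of nfsd threads.
sample='th 8 0'
threads=$(echo "$sample" | awk '$1 == "th" { print $2 }')
echo "nfsd threads: $threads"
# Raise at runtime (non-persistent):  echo 128 > /proc/fs/nfsd/threads
# Persist via the [nfsd] section of /etc/nfs.conf (threads=128).
```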
- **NFS Server Network Interface Saturation**: The server’s network interface is overwhelmed with traffic.
  - Diagnosis: On the NFS server, use `iftop -i <interface_name>` (e.g., `iftop -i eth0`) to monitor network bandwidth usage per connection.
  - Fix: Increase the server’s network bandwidth. This might involve upgrading the network interface card (NIC) or the network switch port. Verify that the NIC has negotiated its maximum speed and full duplex with `ethtool <interface_name>`.
  - Diagnosis: Check for dropped packets on the server’s interface using `ip -s link show <interface_name>`. Look for increases in the `dropped` or `overrun` counters.
  - Fix: If `dropped` counts are high, the interface is not keeping up. This points to either a hardware bottleneck or a driver issue. Ensure you are using a recent, stable NIC driver.
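The `dropped` and `overrun` counters can be pulled out of `ip -s link` output programmatically for trending. A sketch over a captured RX statistics line (the numbers are illustrative):

```shell
# On a live server:  ip -s link show eth0
# The RX statistics line has the column order:
#   bytes packets errors dropped overrun mcast
# (newer iproute2 versions label the fifth column "missed" instead)
rx_line='3794517 27267 0 12 0 0'
set -- $rx_line
rx_errors=$3; rx_dropped=$4; rx_overrun=$5
echo "RX errors=$rx_errors dropped=$rx_dropped overrun=$rx_overrun"
# A steadily climbing dropped/overrun count means the NIC or driver
# cannot drain its ring buffers fast enough.
```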
- **NFS Server Memory Pressure**: The server is swapping or experiencing high memory utilization, impacting `nfsd` performance.
  - Diagnosis: On the NFS server, run `free -h` or `top` and check the `available` memory and swap usage.
  - Fix: Increase the server’s RAM. If swapping is occurring, even a small amount, it can drastically slow down I/O operations.
  - Diagnosis: Monitor memory usage with `sar -r 1 5`.
  - Fix: Tune kernel parameters related to memory management, such as `vm.dirty_ratio` and `vm.dirty_background_ratio`, if memory usage is consistently high but not exhausted.
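Writeback tuning is one knob here. The values below are illustrative assumptions, not recommendations; check current settings with `sysctl` before changing anything:

```shell
# Inspect current values on a live server:
#   sysctl vm.dirty_background_ratio vm.dirty_ratio
# Illustrative targets: start background writeback earlier so bursts of
# NFS writes don't pile up until nfsd stalls on one huge flush.
dirty_background_ratio=5   # % of RAM dirty before async writeback starts
dirty_ratio=20             # % of RAM dirty before writers block
printf 'vm.dirty_background_ratio = %s\nvm.dirty_ratio = %s\n' \
  "$dirty_background_ratio" "$dirty_ratio"
# Persist by putting the printed lines in /etc/sysctl.d/90-nfs.conf
# and running: sysctl --system
```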
- **NFS Server Disk I/O Bottleneck**: The underlying storage on the NFS server cannot keep up with read/write requests.
  - Diagnosis: On the NFS server, use `iostat -xz 1 5` to monitor disk utilization (`%util`), wait times (`await`), and queue sizes (`avgqu-sz`).
  - Fix: Upgrade the server’s storage. This could mean moving from HDDs to SSDs, using faster SSDs, or implementing a RAID configuration that balances performance and redundancy.
  - Diagnosis: Check which processes are generating the I/O with `iotop`.
  - Fix: If the NFS server exports storage from a network-attached device, ensure that device’s internal performance is adequate and that its network connection to the NFS server is not a bottleneck.
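A saturation check can be scripted from `iostat` output. This sketch parses a trimmed sample line (real `iostat -xz` output has many more columns, and the 90% threshold is a rule of thumb, not a hard limit):

```shell
# Trimmed sample: device, await (ms), %util. On a live server take these
# columns from:  iostat -xz 1 5
line='sda 35.2 98.7'
set -- $line
device=$1; await=$2; util=$3
verdict=$(awk -v u="$util" 'BEGIN { print ((u > 90) ? "saturated" : "ok") }')
echo "$device: await=${await}ms util=${util}% -> $verdict"
# High %util together with high await means the device itself, not NFS,
# is the bottleneck.
```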
- **NFS Client Network Issues**: Packet loss or high latency between the client and server.
  - Diagnosis: On the NFS client, use `ping -c 100 <nfs_server_ip>` to check for packet loss and latency.
  - Fix: Troubleshoot the network path. This might involve checking intermediate switches, routers, or firewall rules that could be introducing latency or dropping packets.
  - Diagnosis: Use `mtr <nfs_server_ip>` to identify specific hops with high latency or packet loss.
  - Fix: If jumbo frames are in use, ensure they are consistently configured (or consistently disabled) across the entire path; MTU mismatches are a common source of dropped packets.
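The loss percentage is easy to extract from the `ping` summary line for alerting or trending. A sketch over a captured summary (numbers illustrative):

```shell
# On the client:  ping -c 100 <nfs_server_ip>
# Captured summary line (illustrative):
summary='100 packets transmitted, 97 received, 3% packet loss, time 99123ms'
loss=$(echo "$summary" | sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p')
echo "packet loss: ${loss}%"
# Even 1-2% loss forces NFS retransmissions; with hard mounts that shows
# up as multi-second stalls rather than errors.
```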
- **NFS Mount Options on the Client**: Inappropriate or overly aggressive mount options can cause timeouts.
  - Diagnosis: Examine the client’s `/etc/fstab` or the output of `mount | grep nfs` for options like `rsize`, `wsize`, `hard`, `intr`, `timeo`, and `retrans`.
  - Fix: Experiment with different `rsize` and `wsize` values, for example `rsize=32768,wsize=32768`. Use `hard` mounts for reliability. Note that `intr` has been a no-op on Linux since kernel 2.6.25 (only `SIGKILL` can interrupt a hung operation), so it can safely be dropped. Adjust `timeo` and `retrans` if the defaults are too aggressive, e.g., `timeo=140,retrans=3`; `timeo` is in tenths of a second (so `timeo=140` is 14 seconds), and `retrans` is the number of retransmissions before a major timeout.
  - Fix: Re-mount the filesystem with the new options: `mount -o remount,rsize=32768,wsize=32768 /mnt/nfs_share`.
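Since `timeo` is in tenths of a second, it is worth computing what the options actually mean in wall-clock terms. A sketch over a hypothetical option string (on a real client, read the live options from `/proc/mounts` or `nfsstat -m`):

```shell
# Hypothetical mount options; on a real client:  grep nfs /proc/mounts
opts='rw,hard,timeo=140,retrans=3,rsize=32768,wsize=32768'
timeo=$(echo "$opts" | tr ',' '\n' | sed -n 's/^timeo=//p')
retrans=$(echo "$opts" | tr ',' '\n' | sed -n 's/^retrans=//p')
# timeo is in tenths of a second; retrans is a retry count.
echo "initial RPC timeout: $((timeo / 10))s, retries before major timeout: $retrans"
```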
- **NFS Server Kernel/Module Issues**: Bugs in the NFS server kernel module or related RPC services.
  - Diagnosis: Check system logs (`/var/log/messages`, `dmesg`) on the NFS server for any NFS-related errors or warnings.
  - Fix: Ensure the NFS server’s kernel and user-space utilities are updated to the latest stable versions for your distribution. Some kernel versions have known NFS performance issues.
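A quick log sweep can be scripted. The excerpt below is a hypothetical log line inlined so the sketch runs anywhere; on a real server you would scan the live logs:

```shell
# On a live server:
#   dmesg | grep -iE 'nfs|rpc'
#   journalctl -u nfs-server --since "1 hour ago"
# Sketch: count NFS-related lines in a captured excerpt (contents hypothetical).
excerpt='Jan 01 12:00:00 host kernel: nfs: server nfs01 not responding, timed out'
matches=$(echo "$excerpt" | grep -icE 'nfs|rpc')
echo "NFS-related log lines: $matches"
```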
After fixing these issues, the next errors you might encounter are server-side failures such as `ESTALE` (stale file handle) or `ENOSPC` (no space left on device): requests are reaching the server, but it cannot complete the operation because the file handle no longer resolves or the filesystem is full.
NFS traffic analysis in Wireshark is less about watching packets fly by and more about interpreting the conversation between client and server to understand why a particular operation took longer than expected, or why it failed entirely.
Let’s say you’re debugging a slow `ls` command on an NFS mount. You’d capture traffic on the client, filter for `nfs`, and then look for the `LOOKUP` and `READDIR` RPC calls.
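A capture on the client might be set up like this (the interface, server address, and file name are all hypothetical placeholders):

```shell
server_ip='10.0.0.5'   # hypothetical NFS server address
cmd="tcpdump -i eth0 -s 0 -w nfs-debug.pcap host ${server_ip} and port 2049"
echo "$cmd"
# Run the printed command on the client, reproduce the slow ls, stop the
# capture with Ctrl-C, then open nfs-debug.pcap in Wireshark and apply
# the display filter: nfs
```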
Here’s what a typical READDIR sequence might look like, from the client’s perspective:
1. Client -> Server: `NFS3 CALL: READDIR(dir_handle, offset=0, count=8192)`
2. Server -> Client: `NFS3 REPLY: READDIR(directory_entries...)`

The crucial metric here is the time between the CALL and the REPLY for that specific RPC. Wireshark matches replies to calls and records this latency on the REPLY packet as "Time from request" (`rpc.time`); with the display filtered to this conversation, the "Time delta from previous displayed packet" column on the REPLY shows the same thing. If that delta is large for many READDIR exchanges, the server is slow to respond.
But what if the server doesn’t respond? Over TCP (the default transport for modern NFS mounts) a lost segment shows up as a [TCP Retransmission] in Wireshark. Over UDP there is no transport-layer recovery: the client’s RPC layer resends the entire CALL, which appears as two CALL packets carrying the same `rpc.xid`.

Consider this captured sequence:

Client IP.51234 -> Server IP.2049: NFS3 CALL: READDIR(dir_handle, offset=0, count=8192)

... (many seconds later, no reply) ...

Client IP.51234 -> Server IP.2049: NFS3 CALL: READDIR(dir_handle, offset=0, count=8192) (same XID, a retransmission)

Server IP.2049 -> Client IP.51234: NFS3 REPLY: READDIR(directory_entries...)

The gap between the initial CALL and the eventual REPLY is the problem. If the server was merely slow, Wireshark shows a large "Time delta" on the REPLY packet. If the request (or its reply) was lost, you instead see the client send the same CALL again, and the delta between the two identical CALLs is the client’s retransmission timeout.
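This CALL-to-REPLY matching can be scripted with `tshark` field extraction. A sketch, run here over inlined sample output so it is self-contained (the capture file name in the comment is hypothetical); `rpc.msgtyp` is `0` for a CALL and `1` for a REPLY:

```shell
# Real extraction (capture name hypothetical):
#   tshark -r nfs.pcap -Y 'rpc' -T fields -e frame.time_epoch -e rpc.xid -e rpc.msgtyp
# Sample of that output, inlined so the pairing logic can run anywhere:
sample='100.000 0x1a2b3c4d 0
103.500 0x1a2b3c4d 1'
rtts=$(echo "$sample" | awk '
  $3 == 0 { call[$2] = $1 }                    # CALL: remember send time by XID
  $3 == 1 && $2 in call { printf "xid %s rtt %.3fs\n", $2, $1 - call[$2] }')
echo "$rtts"
```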
To understand why that delta is large, you need to correlate this with server-side metrics.
Mental Model: The NFS Conversation
- Client Initiates: The client needs to perform an operation (read a file, list a directory, create a file). It packages this as an NFS Remote Procedure Call (RPC).
- Network Transit: The RPC request travels over the network. This is where latency, packet loss, and congestion become factors.
- Server Receives & Processes: The NFS server daemon (`nfsd`) receives the RPC. It then has to:
  - Parse the request: Understand what the client wants.
  - Check permissions/ACLs: Ensure the client is allowed to do this.
  - Interact with the filesystem: Read from disk, write to disk, create inodes, etc. This is often the slowest part.
  - Generate a reply: Package the result of the operation.
- Network Transit (Return): The RPC reply travels back to the client.
- Client Receives & Processes: The client receives the reply and completes the operation (e.g., displays file contents, updates directory listing).
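This model suggests a rough decomposition: the CALL-to-REPLY delta is approximately one network round trip plus server processing time, so subtracting a concurrent `ping` RTT isolates the server’s share. A sketch with hypothetical numbers:

```shell
# Hypothetical measurements, both in milliseconds:
nfs_rtt_ms=850    # CALL->REPLY delta seen in Wireshark
ping_rtt_ms=2     # network round trip measured with ping
server_ms=$((nfs_rtt_ms - ping_rtt_ms))
echo "approx. server-side processing: ${server_ms}ms"
# If server_ms dominates, investigate nfsd threads, memory, and disk I/O;
# if the ping RTT dominates, the network path is the bottleneck.
```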
Key Wireshark Fields for NFS Analysis:
- `nfs.rpc.stat`: The RPC-level accept status of a reply. `0` is `SUCCESS`, `1` is `PROG_UNAVAIL`, `2` is `PROG_MISMATCH`, `3` is `PROC_UNAVAIL`, `4` is `GARBAGE_ARGS`, `5` is `SYSTEM_ERR`. For NFSv4, look at the `nfs4.status` field for the per-operation result.
- `nfs.fh`: The file handle. Useful for tracking operations on the same file.
- `nfs.stateid`: For NFSv4, crucial for tracking the state of a file or directory.
- `nfs.read.count`, `nfs.write.count`: The size of data being read or written. Large values here on a slow connection will increase the time delta.
- `nfs.dir.offset`, `nfs.dir.count`: For `READDIR` calls, these indicate how far into the directory the read has progressed and how much the client is requesting.
- `tcp.analysis.retransmission`: If you see these, packets are being lost somewhere between client and server. (UDP has no transport-level retransmission analysis; for UDP mounts, look for repeated CALLs carrying the same `rpc.xid` instead.)
- `frame.time_delta`: The time since the previous packet. When this is large on an NFS REPLY packet, the server took a long time to respond to the matching CALL.
The most surprising thing about NFS performance debugging is how often the problem isn’t in the NFS protocol itself, but in the underlying network or the server’s ability to interact with its storage. An NFS READDIR call might look simple, but if it triggers a disk seek on a slow HDD, or if the network fabric between client and server has a single congested link, that one READDIR can take seconds. You’re not just looking at NFS packets; you’re looking at all packets between client and server and correlating them with server-side load and I/O.
The next step in understanding this is to delve into specific NFS operations like `GETATTR`, `SETATTR`, and `WRITE`, and how their performance characteristics differ and what they reveal about server load and network conditions.