Index | Thread | Search

From:
Alexander Bluhm <bluhm@openbsd.org>
Subject:
Re: sys/nfs: fix TCP socket use after reconnect
To:
tech@openbsd.org
Date:
Fri, 24 Apr 2026 10:32:10 +0200

Download raw body.

Thread
On Tue, Apr 21, 2026 at 12:55:46PM +0200, Kirill A. Korinsky wrote:
> tech@,
> 
> There is a race in the NFS/TCP client when several requests are active
> and the connection stalls or times out. One thread enters nfs_receive(),
> takes the current socket pointer, drops nfs_sndlock(), and waits in
> soreceive(). Another thread on the same mount may then enter
> nfs_reconnect(), decide the connection is dead, and close that same
> socket.
> 
> The problem is that the first thread may still use the old socket after
> reconnect has already replaced it. Under heavy parallel NFS/TCP load,
> that can wedge reconnect or socket shutdown paths.
> 
> Suggested fix is simple: take a socket reference before the receive path
> uses the socket, and release it only after that path is done.
> 
> This is not a strict lock order deadlock. It starts as a socket lifetime
> bug, then can show up as an apparent deadlock if the wedged path still
> owns kernel_lock. On a 2 CPU machine that is enough for a full hang: one
> CPU owns kernel_lock, the other spins waiting for it.
> 
> If system has more than 2 CPU, it will survive, but NFS share will be
> locked, and on reboot system hangs with "sync devices".
> 
> I encountered it few times per week when mount NFS share to laptop via
> wifi with options: soft,intr,tcp,-x=2
> 
> I use this diff for more than a week now, and haven't seen NFS issues so
> far, it actually works quite good. Yes, some times IO returns invalid
> command and I see in dmesg: short receive ... from nfs server ...;
> but it is a kind of expected behaviour.
> 
> Tests? Feedback? Ok?

You replace some return with goto errout.  This does some additional
cleanup.  As error == EWOULDBLOCK the first block is not taken and
as *mp should be NULL after an error the additional m_freemp()
should hav no effect.  Also nfs_realign() should have no effect.

Still it needs additional review for this new error path.  I this
change intensional?

bluhm

> Index: sys/nfs/nfs_socket.c
> ===================================================================
> RCS file: /home/cvs/src/sys/nfs/nfs_socket.c,v
> diff -u -p -r1.156 nfs_socket.c
> --- sys/nfs/nfs_socket.c	16 Feb 2025 16:05:07 -0000	1.156
> +++ sys/nfs/nfs_socket.c	10 Apr 2026 21:13:27 -0000
> @@ -525,7 +525,7 @@ nfs_send(struct socket *so, struct mbuf 
>  int
>  nfs_receive(struct nfsreq *rep, struct mbuf **aname, struct mbuf **mp)
>  {
> -	struct socket *so;
> +	struct socket *so = NULL;
>  	struct uio auio;
>  	struct iovec aio;
>  	struct mbuf *m;
> @@ -577,6 +577,11 @@ tryagain:
>  			}
>  			goto tryagain;
>  		}
> +		/*
> +		 * Keep the socket alive while using the snapshot taken
> +		 * under nfs_sndlock(), even if another thread reconnects.
> +		 */
> +		soref(so);
>  		while (rep->r_flags & R_MUSTRESEND) {
>  			m = m_copym(rep->r_mreq, 0, M_COPYALL, M_WAIT);
>  			nfsstats.rpcretries++;
> @@ -586,9 +591,13 @@ tryagain:
>  			if (error) {
>  				if (error == EINTR || error == ERESTART ||
>  				    (error = nfs_reconnect(rep)) != 0) {
> +					sorele(so);
> +					so = NULL;
>  					nfs_sndunlock(&rep->r_nmp->nm_flag);
>  					return (error);
>  				}
> +				sorele(so);
> +				so = NULL;
>  				goto tryagain;
>  			}
>  		}
> @@ -608,8 +617,10 @@ tryagain:
>  				error = soreceive(so, NULL, &auio, NULL, NULL,
>  				    &rcvflg, 0);
>  				if (error == EWOULDBLOCK && rep) {
> -					if (rep->r_flags & R_SOFTTERM)
> -						return (EINTR);
> +					if (rep->r_flags & R_SOFTTERM) {
> +						error = EINTR;
> +						goto errout;
> +					}
>  					/*
>  					 * looks like the server died after it
>  					 * received the request, make sure
> @@ -678,8 +689,10 @@ tryagain:
>  				    &rcvflg, 0);
>  				m_freem(control);
>  				if (error == EWOULDBLOCK && rep) {
> -					if (rep->r_flags & R_SOFTTERM)
> -						return (EINTR);
> +					if (rep->r_flags & R_SOFTTERM) {
> +						error = EINTR;
> +						goto errout;
> +					}
>  				}
>  			} while (error == EWOULDBLOCK ||
>  			    (!error && *mp == NULL && control));
> @@ -690,6 +703,10 @@ tryagain:
>  			len -= auio.uio_resid;
>  		}
>  errout:
> +		if (so != NULL) {
> +			sorele(so);
> +			so = NULL;
> +		}
>  		if (error && error != EINTR && error != ERESTART) {
>  			m_freemp(mp);
>  			if (error != EPIPE)
> 
> -- 
> wbr, Kirill
> 
>