From: Alexander Bluhm Subject: Re: sys/nfs: fix TCP socket use after reconnect To: tech@openbsd.org Date: Fri, 24 Apr 2026 10:32:10 +0200 On Tue, Apr 21, 2026 at 12:55:46PM +0200, Kirill A. Korinsky wrote: > tech@, > > There is a race in the NFS/TCP client when several requests are active > and the connection stalls or times out. One thread enters nfs_receive(), > takes the current socket pointer, drops nfs_sndlock(), and waits in > soreceive(). Another thread on the same mount may then enter > nfs_reconnect(), decide the connection is dead, and close that same > socket. > > The problem is that the first thread may still use the old socket after > reconnect has already replaced it. Under heavy parallel NFS/TCP load, > that can wedge reconnect or socket shutdown paths. > > Suggested fix is simple: take a socket reference before the receive path > uses the socket, and release it only after that path is done. > > This is not a strict lock order deadlock. It starts as a socket lifetime > bug, then can show up as an apparent deadlock if the wedged path still > owns kernel_lock. On a 2 CPU machine that is enough for a full hang: one > CPU owns kernel_lock, the other spins waiting for it. > > If system has more than 2 CPU, it will survive, but NFS share will be > locked, and on reboot system hangs with "sync devices". > > I encountered it few times per week when mount NFS share to laptop via > wifi with options: soft,intr,tcp,-x=2 > > I use this diff for more than a week now, and haven't seen NFS issues so > far, it actually works quite good. Yes, some times IO returns invalid > command and I see in dmesg: short receive ... from nfs server ...; > but it is a kind of expected behaviour. > > Tests? Feedback? Ok? You replace some return with goto errout. This does some additional cleanup. As error == EWOULDBLOCK the first block is not taken and as *mp should be NULL after an error the additional m_freemp() should hav no effect. Also nfs_realign() should have no effect. Still it needs additional review for this new error path. I this change intensional? bluhm > Index: sys/nfs/nfs_socket.c > =================================================================== > RCS file: /home/cvs/src/sys/nfs/nfs_socket.c,v > diff -u -p -r1.156 nfs_socket.c > --- sys/nfs/nfs_socket.c 16 Feb 2025 16:05:07 -0000 1.156 > +++ sys/nfs/nfs_socket.c 10 Apr 2026 21:13:27 -0000 > @@ -525,7 +525,7 @@ nfs_send(struct socket *so, struct mbuf > int > nfs_receive(struct nfsreq *rep, struct mbuf **aname, struct mbuf **mp) > { > - struct socket *so; > + struct socket *so = NULL; > struct uio auio; > struct iovec aio; > struct mbuf *m; > @@ -577,6 +577,11 @@ tryagain: > } > goto tryagain; > } > + /* > + * Keep the socket alive while using the snapshot taken > + * under nfs_sndlock(), even if another thread reconnects. > + */ > + soref(so); > while (rep->r_flags & R_MUSTRESEND) { > m = m_copym(rep->r_mreq, 0, M_COPYALL, M_WAIT); > nfsstats.rpcretries++; > @@ -586,9 +591,13 @@ tryagain: > if (error) { > if (error == EINTR || error == ERESTART || > (error = nfs_reconnect(rep)) != 0) { > + sorele(so); > + so = NULL; > nfs_sndunlock(&rep->r_nmp->nm_flag); > return (error); > } > + sorele(so); > + so = NULL; > goto tryagain; > } > } > @@ -608,8 +617,10 @@ tryagain: > error = soreceive(so, NULL, &auio, NULL, NULL, > &rcvflg, 0); > if (error == EWOULDBLOCK && rep) { > - if (rep->r_flags & R_SOFTTERM) > - return (EINTR); > + if (rep->r_flags & R_SOFTTERM) { > + error = EINTR; > + goto errout; > + } > /* > * looks like the server died after it > * received the request, make sure > @@ -678,8 +689,10 @@ tryagain: > &rcvflg, 0); > m_freem(control); > if (error == EWOULDBLOCK && rep) { > - if (rep->r_flags & R_SOFTTERM) > - return (EINTR); > + if (rep->r_flags & R_SOFTTERM) { > + error = EINTR; > + goto errout; > + } > } > } while (error == EWOULDBLOCK || > (!error && *mp == NULL && control)); > @@ -690,6 +703,10 @@ tryagain: > len -= auio.uio_resid; > } > errout: > + if (so != NULL) { > + sorele(so); > + so = NULL; > + } > if (error && error != EINTR && error != ERESTART) { > m_freemp(mp); > if (error != EPIPE) > > -- > wbr, Kirill > >