Index | Thread | Search

From:
Vitaliy Makkoveev <mvs@openbsd.org>
Subject:
Re: fstat sysctl crashes kernel
To:
Mark Kettenis <mark.kettenis@xs4all.nl>
Cc:
Alexander Bluhm <bluhm@openbsd.org>, otto@bsdbox.dev, tech@openbsd.org
Date:
Fri, 9 Aug 2024 14:58:24 +0300

Download raw body.

Thread
On Fri, Aug 09, 2024 at 12:37:25PM +0200, Mark Kettenis wrote:
> > Date: Fri, 9 Aug 2024 10:20:41 +0200
> > From: Alexander Bluhm <bluhm@openbsd.org>
> > 
> > On Thu, Aug 08, 2024 at 09:27:36PM +0300, Vitaliy Makkoveev wrote:
> > > > On 8 Aug 2024, at 20:46, Alexander Bluhm <bluhm@openbsd.org> wrote:
> > > > 
> > > > On Thu, Aug 08, 2024 at 09:47:27AM -0600, Theo de Raadt wrote:
> > > >> Todd C. Miller <millert@openbsd.org> wrote:
> > > >> 
> > > >>> On Thu, 08 Aug 2024 13:21:54 +0200, Alexander Bluhm wrote:
> > > >>> 
> > > >>>> Independent of that, can we commit the chunk below?  When searching
> > > >>>> for a specific process, there is not need to traverse the list to
> > > >>>> the end.  Also I check for arg >= 0 as this is cosistent with the
> > > >>>> arg < -1 check before.  Then you don't have to think about what is
> > > >>>> special about process 0.  There is no functional difference as
> > > >>>> process 0 has PS_SYSTEM set and is skipped anyway.
> > > >>> 
> > > >>> Yes, this makes sense. OK millert@
> > > >> 
> > > >> No matter what we do, we don't get an atomic view.  Punting the problem
> > > >> to userland sidesteps the difficult problem in the kernel.
> > > > 
> > > > For me is important that we commit something instead of discussing
> > > > pro and con of various diffs.
> > > > 
> > > > The idea with the marker sounds like the correct thing, but adds a
> > > > lot of complexity.  mvs@ already worked on something simmilar.  The
> > > > easiest fix without locking concerns is my userland fstat diff.
> > > > Locking the whole process list while doing copyout over NFS may
> > > > open another can of worms.
> > > > 
> > > 
> > > Userland pages are wired at this moment, you can???t stuck in swap
> > > on nfs here, sysctl_vslock() cares about that.
> > > 
> > > > Both bluhm@ and mvs@ diff survive my stress test for hours.  Usually
> > > > the test crashes or hangs within minutes.  Currently I am testing
> > > > with witness to find fallout.
> > > > 
> > > 
> > > Nice to hear. Personal me, I prefer allprocess rwlock(9). It is much
> > > simpler and left userland as is. Also, is allows to remove kernel
> > > lock around some process stuff like prfind(). For example, from the 
> > > filt_procattach() where claudio@ left it to make ps dereference safe.
> > 
> > mvs@ diff has finished witness testing.  It complains already during
> > boot.
> > 
> > reordering: ld.so libc libcrypto sshd sshd-session ssh-agent.
> > starting early daemons: syslogd pflogd ntpd.
> > starting RPC daemons:.
> > witness: lock order reversal:
> >  1st 0xd7af9474 inode (&ip->i_lock)
> >  2nd 0xd0f32d68 allpslk (&allprocess_lock)
> > lock order [1] inode (&ip->i_lock) -> [2] allpslk (&allprocess_lock)
> > lock order data 0xd0ca8820 -> 0xd0cf51b4 is missing
> > lock order [2] allpslk (&allprocess_lock) -> [1] inode (&ip->i_lock)
> > #0  witness_checkorder+0x291
> > #1  rw_enter+0x4d
> > #2  rrw_enter+0x41
> > #3  ufs_lock+0x27
> > #4  VOP_LOCK+0x50
> > #5  vn_lock+0x90
> > #6  vrele+0x35
> > #7  unveil_destroy+0x76
> > #8  exit1+0x378
> > #9  sys_exit+0x14
> > #10 syscall+0x3f2
> > #11 Xsyscall_untramp+0xa9
> > savecore: no core dump
> > checking quotas: done.
> > clearing /tmp
> > kern.securelevel: 0 -> 1
> > turning on accounting
> > witness: lock order reversal:
> >  1st 0xd0f32d68 allpslk (&allprocess_lock)
> >  2nd 0xd0ed1114 acctlk (acctlk)
> > lock order [1] allpslk (&allprocess_lock) -> [2] acctlk (acctlk)
> > #0  witness_checkorder+0x291
> > #1  rw_enter_read+0x32
> > #2  acct_process+0x38
> > #3  exit1+0x35d
> > #4  sys_exit+0x14
> > #5  syscall+0x3f2
> > #6  Xsyscall_untramp+0xa9
> > lock order [2] acctlk (acctlk) -> [1] allpslk (&allprocess_lock)
> > #0  witness_checkorder+0x291
> > #1  rw_enter_write+0x30
> > #2  fork1+0x138
> > #3  kthread_create+0x32
> > #4  sys_acct+0xfe
> > #5  syscall+0x3f2
> > #6  Xsyscall_untramp+0xa9
> > creating runtime link editor directory cache.
> > preserving editor files.
> > starting network daemons: sshd snmpd smtpd sndiod.
> 
> We should not introduce more lock order reversals.
> 
> And I actually agree with Mateusz: the fact that you can sleep while
> holding an rwlock doesn't mean it is a good idea.
> 

In fact you can't. This is the sysctl(2) path, so userland pages were
wired before and copyout() will not sleep. You can sleep while doing
copyout() even on uniprocessor machine. Obviously, in this case process
could be killed concurrently, and this is was the real reason why
uvm_vslock() was added on top of sys_sysctl() long time ago, before any
sleeping lock appeared. In the rwlock(9) era, this has nice side effect,
which allows you avoid slowdown while holding rwlock(9) and doing
copyin() or copyout().

Well, the diff. The main exit1() problem is that we start process
teardown before unlink it from `allprocess' list, so it is still
accessible in half-destroyed state. Current code doesn't allow you to
unlink it before teardown.

This time we have we need to protect `ps' dereference in sysctl(2)
loops. Both loops have PS_EXITING check, so we could sleep between
PS_EXITING setting and teardown start. Since `ps' will be unlinked after
this sleep we don't need any process iterator, because LIST_NEXT(ps)
will be valid in any case.

My previous similar diffs were failed in NFS cases, so it makes sense to
drop non NFS setups.

Index: sys/kern/kern_exit.c
===================================================================
RCS file: /cvs/src/sys/kern/kern_exit.c,v
diff -u -p -r1.230 kern_exit.c
--- sys/kern/kern_exit.c	6 Aug 2024 18:41:20 -0000	1.230
+++ sys/kern/kern_exit.c	9 Aug 2024 11:51:30 -0000
@@ -151,6 +151,9 @@ exit1(struct proc *p, int xexit, int xsi
 			    PS_ISPWAIT);
 			wakeup(pr->ps_pptr);
 		}
+
+		/* Wait for concurrent `allprocess' loops */
+		refcnt_finalize(&pr->ps_refcnt, "psdtor");
 	}
 
 	/* unlink ourselves from the active threads */
Index: sys/kern/kern_fork.c
===================================================================
RCS file: /cvs/src/sys/kern/kern_fork.c,v
diff -u -p -r1.261 kern_fork.c
--- sys/kern/kern_fork.c	6 Aug 2024 08:44:54 -0000	1.261
+++ sys/kern/kern_fork.c	9 Aug 2024 11:51:30 -0000
@@ -178,6 +178,8 @@ thread_new(struct proc *parent, vaddr_t 
 void
 process_initialize(struct process *pr, struct proc *p)
 {
+	refcnt_init(&pr->ps_refcnt);
+
 	/* initialize the thread links */
 	pr->ps_mainproc = p;
 	TAILQ_INIT(&pr->ps_threads);
Index: sys/kern/kern_sysctl.c
===================================================================
RCS file: /cvs/src/sys/kern/kern_sysctl.c,v
diff -u -p -r1.436 kern_sysctl.c
--- sys/kern/kern_sysctl.c	8 Aug 2024 15:02:36 -0000	1.436
+++ sys/kern/kern_sysctl.c	9 Aug 2024 11:51:30 -0000
@@ -1686,6 +1686,9 @@ sysctl_file(int *name, u_int namelen, ch
 				/* not the pid we are looking for */
 				continue;
 			}
+
+			refcnt_take(&pr->ps_refcnt);
+
 			matched = 1;
 			fdp = pr->ps_fd;
 			if (pr->ps_textvp)
@@ -1702,6 +1705,9 @@ sysctl_file(int *name, u_int namelen, ch
 				FILLIT(fp, fdp, i, NULL, pr);
 				FRELE(fp, p);
 			}
+
+			refcnt_rele_wake(&pr->ps_refcnt);
+
 			/* pid is unique, stop searching */
 			if (arg >= 0)
 				break;
@@ -1721,6 +1727,9 @@ sysctl_file(int *name, u_int namelen, ch
 				/* not the uid we are looking for */
 				continue;
 			}
+
+			refcnt_take(&pr->ps_refcnt);
+
 			fdp = pr->ps_fd;
 			if (fdp->fd_cdir)
 				FILLIT(NULL, NULL, KERN_FILE_CDIR, fdp->fd_cdir, pr);
@@ -1734,6 +1743,8 @@ sysctl_file(int *name, u_int namelen, ch
 				FILLIT(fp, fdp, i, NULL, pr);
 				FRELE(fp, p);
 			}
+
+			refcnt_rele_wake(&pr->ps_refcnt);
 		}
 		break;
 	default:
Index: sys/sys/proc.h
===================================================================
RCS file: /cvs/src/sys/sys/proc.h,v
diff -u -p -r1.367 proc.h
--- sys/sys/proc.h	6 Aug 2024 08:44:54 -0000	1.367
+++ sys/sys/proc.h	9 Aug 2024 11:51:30 -0000
@@ -142,6 +142,8 @@ struct pinsyscall {
  *	T	itimer_mtx
  */
 struct process {
+	struct refcnt ps_refcnt;
+
 	/*
 	 * ps_mainproc is the original thread in the process.
 	 * It's only still special for the handling of