From: Martin Pieuchot <mpi@grenadille.net>
Subject: Re: [EXT] Re: Kernel protection fault in fill_kproc()
To: Gerhard Roth <gerhard_roth@genua.de>, "cjeker@diehard.n-r-g.com" <cjeker@diehard.n-r-g.com>, "mvs@openbsd.org" <mvs@openbsd.org>, "tech@openbsd.org" <tech@openbsd.org>, Carsten Beckmann <carsten_beckmann@genua.de>, "mark.kettenis@xs4all.nl" <mark.kettenis@xs4all.nl>
Date: Tue, 2 Sep 2025 21:17:04 +0200

On 14/08/25(Thu) 12:59, Martin Pieuchot wrote:
> On 13/08/25(Wed) 15:51, Martin Pieuchot wrote:
> > On 13/08/25(Wed) 13:43, Gerhard Roth wrote:
> > > On Wed, 2025-08-13 at 15:25 +0200, Claudio Jeker wrote:
> > > > On Wed, Aug 13, 2025 at 03:46:39PM +0300, Vitaliy Makkoveev wrote:
> > > > > On Wed, Aug 13, 2025 at 02:35:39PM +0200, Martin Pieuchot wrote:
> > > > > > On 12/08/25(Tue) 14:36, Vitaliy Makkoveev wrote:
> > > > > > > On Tue, Aug 12, 2025 at 12:40:21PM +0200, Martin Pieuchot wrote:
> > > > > > > > On 12/08/25(Tue) 13:00, Vitaliy Makkoveev wrote:
> > > > > > > > > On Tue, Aug 12, 2025 at 11:49:12AM +0200, Mark Kettenis wrote:
> > > > > > > > > > > Date: Tue, 12 Aug 2025 11:56:10 +0300
> > > > > > > > > > > From: Vitaliy Makkoveev <mvs@openbsd.org>
> > > > > > > > > > > 
> > > > > > > > > > > On Tue, Aug 12, 2025 at 07:22:29AM +0200, Claudio Jeker wrote:
> > > > > > > > > > > > On Mon, Aug 11, 2025 at 10:45:05AM +0000, Gerhard Roth wrote:
> > > > > > > > > > > > > About a year ago, the call to uvm_exit() was moved outside of the 
> > > > > > > > > > > > > KERNEL_LOCK() in the reaper() by mpi@. Now we observed a kernel
> > > > > > > > > > > > > protection fault that results from this change.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > In fill_kproc() we read the vmspace pointer (vm) right at the very
> > > > > > > > > > > > > beginning of the function:
> > > > > > > > > > > > > 
> > > > > > > > > > > > >         struct vmspace *vm = pr->ps_vmspace;
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Sometime later, we try to access it:
> > > > > > > > > > > > > 
> > > > > > > > > > > > >         /* fixups that can only be done in the kernel */
> > > > > > > > > > > > >         if ((pr->ps_flags & PS_ZOMBIE) == 0) {
> > > > > > > > > > > > >                 if ((pr->ps_flags & PS_EMBRYO) == 0 && vm != NULL)
> > > > > > > > > > > > >                         ki->p_vm_rssize = vm_resident_count(vm);
> > > > > > > > > > > > > 
> > > > > > > > > > > > > 
> > > > > > > > > > > > > In the meantime the process might have exited and the reaper() can free
> > > > > > > > > > > > > the vmspace by calling uvm_exit(). After that, the 'vm' pointer in
> > > > > > > > > > > > > fill_kproc() points to stale memory. Accessing it will yield a kernel
> > > > > > > > > > > > > protection fault.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > BTW: only after freeing the vmspace of the process, the PS_ZOMBIE flag
> > > > > > > > > > > > > is set by the reaper().
> > > > > > > > > > > > > 
> > > > > > > > > > > > > I propose to put the reaper()'s call to uvm_exit() back under the
> > > > > > > > > > > > > kernel lock to avoid the fault.
> > > > > > > > > > > > 
> > > > > > > > > > > > In my opinion the fill_kproc() code is wrong and it should not look at
> > > > > > > > > > > > pr->ps_vmspace if the PS_EXITING flag is set for the process.
> > > > > > > > > > > > 
> > > > > > > > > > > > exit1() sets PS_EXITING flag early on and after that point the vm can be
> > > > > > > > > > > > purged so the vm_resident_count() is probably wrong anyway.
> > > > > > > > > > 
> > > > > > > > > > I guess that is safe since exit1() still runs with the kernel lock
> > > > > > > > > > > held.  Which means that the PS_EXITING flag can't be set while
> > > > > > > > > > fill_kproc() runs, since it holds the kernel lock.
> > > > > > > > > > 
> > > > > > > > > > > > The only caller of fill_kproc() is sysctl_doproc(), which does this
> > > > > > > > > > > > check at the beginning of the allprocess loop:
> > > > > > > > > > > 
> > > > > > > > > > >         for (; pr != NULL; pr = LIST_NEXT(pr, ps_list)) {
> > > > > > > > > > >                 /* XXX skip processes in the middle of being zapped */
> > > > > > > > > > >                 if (pr->ps_pgrp == NULL)
> > > > > > > > > > >                         continue;
> > > > > > > > > > 
> > > > > > > > > > There is some other code where an "external observer" looks at
> > > > > > > > > > ps_vmspace in kern/sys_process.c:process_domem():
> > > > > > > > > > 
> > > > > > > > > >         vm = tr->ps_vmspace;
> > > > > > > > > >         if ((tr->ps_flags & PS_EXITING) || (vm->vm_refcnt < 1))
> > > > > > > > > >                 return EFAULT;
> > > > > > > > > >         addr = uio->uio_offset;
> > > > > > > > > > 
> > > > > > > > > >         uvmspace_addref(vm);
> > > > > > > > > > 
> > > > > > > > > >         error = uvm_io(&vm->vm_map, uio, UVM_IO_FIXPROT);
> > > > > > > > > > 
> > > > > > > > > >         uvmspace_free(vm);
> > > > > > > > > > 
> > > > > > > > > > So that checks PS_EXITING as well, but also checks the refcnt.
> > > > > > > > > > 
> > > > > > > > > > As you can see this also takes a reference of the vmspace.  I guess
> > > > > > > > > > that's necessary since uvm_io() may sleep.
> > > > > > > > > 
> > > > > > > > > The problem lies in the unlocked uvm_exit(pr) which starts teardown of
> > > > > > > > > > vmspace. Simply adding uvmspace_addref() somewhere else will not help
> > > > > > > > > > you because there is no guarantee that your uvmspace_addref() is the
> > > > > > > > > winner. So you need to serialize the uvm_exit() and the
> > > > > > > > > uvmspace_addref() thread. This means it should be moved back under
> > > > > > > > > kernel lock.
> > > > > > > > 
> > > > > > > > Indeed, clearing `ps_vmspace' isn't safe without KERNEL_LOCK().  The
> > > > > > > > drawback of that approach is that pmap_destroy(), which is costly, will
> > > > > > > > no longer be executed in parallel.
> > > > > > > > 
> > > > > > > > > > In my initial diff I proposed moving uvm_exit(pr) after the kernel
> > > > > > > > > > locked section of reaper(). This means the vmspace teardown will start
> > > > > > > > > > after the process has been unlinked from the allprocess or zombprocess
> > > > > > > > > > lists and is no longer accessed by sysctl(2).
> > > > > > > > 
> > > > > > > > I'd suggest skipping dereferencing `ps_vmspace' for now to be coherent
> > > > > > > > with what is done by sysctl_proc_args().
> > > > > > > > 
> > > > > > > > Accessing zombie process descriptors out of the KERNEL_LOCK() would be a
> > > > > > > > must for killing the reaper.  Anyone interested?
> > > > > > > > 
> > > > > > > > Diff below is untested, I'm too busy atm.
> > > > > > > > 
> > > > > > > 
> > > > > > > You do lockless uvm_exit(pr) before setting PS_ZOMBIE flag. This is the
> > > > > > > case I described below, no guarantees that your uvmspace_addref(vm)
> > > > > > > thread is the winner.
> > > > > > 
> > > > > > > Please re-read the diff, there is such a guarantee thanks to PS_EXITING
> > > > > > > which is set before uvm_purge().
> > > > > > 
> > > > > 
> > > > > > So, the PS_EXITING bit is enough? Then for what purpose do you check
> > > > > > PS_ZOMBIE?
> > > > 
> > > > > > > > +       /* exiting/zombie process might no longer have VM space. */
> > > > > > > > +       if ((pr->ps_flags & (PS_ZOMBIE|PS_EXITING)) == 0) {
> > > > 
> > > > Here there is indeed no need to check for PS_ZOMBIE because PS_EXITING is
> > > > set but never cleared. So if PS_ZOMBIE is set then PS_EXITING must also be
> > > > set since you can only become a zombie by exiting first.
> > > > 
> > > > > > > > +               vm = pr->ps_vmspace;
> > > > > > > > +               uvmspace_addref(vm);
> > > > > > > > +       }
> > > > > > > > +
> > > > 
> > > > I think this diff is the better way to fix this issue.
> > > 
> > > But not before one bug is fixed: it possibly calls uvmspace_free()
> > > with a NULL pointer.
> > 
> > Updated diff below.  It fixes that and removes another instance of the
> > race in tty.c.
> > 
> > Unfortunately still untested.
> 
> I'm now running with this diff.  I'd appreciate more testing and
> reviews.

Anyone?
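
For reviewers, the pattern the diff applies in fill_kproc() can be modelled
in userland roughly as below.  This is only a sketch, not kernel code: every
x-prefixed name (xprocess, xvmspace, XPS_EXITING, ...) is made up for
illustration, and it assumes, as discussed above, that the exiting flag is
set under the same lock the reader holds, so the flag cannot change between
the check and the addref.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define XPS_EXITING	0x1	/* hypothetical stand-in for PS_EXITING */

struct xvmspace {		/* hypothetical stand-in for struct vmspace */
	atomic_int refcnt;
	int resident;		/* models the resident page count */
	int destroyed;		/* set once the last reference is dropped */
};

struct xprocess {		/* hypothetical stand-in for struct process */
	int flags;
	struct xvmspace *vm;
};

static void
xvm_addref(struct xvmspace *vm)
{
	assert(vm != NULL);
	atomic_fetch_add(&vm->refcnt, 1);
}

/* NULL-tolerant release, mirroring the uvmspace_free() change in the diff. */
static void
xvm_free(struct xvmspace *vm)
{
	/* atomic_fetch_sub() returns the old value; 1 means we were last. */
	if (vm != NULL && atomic_fetch_sub(&vm->refcnt, 1) == 1)
		vm->destroyed = 1;
}

/* Returns the resident count, or -1 if the process is exiting. */
static int
xfill_rssize(struct xprocess *pr)
{
	struct xvmspace *vm = NULL;
	int rss = -1;

	/* Only look at the VM space if the process has not started exiting. */
	if ((pr->flags & XPS_EXITING) == 0) {
		vm = pr->vm;
		xvm_addref(vm);
	}

	if (vm != NULL)
		rss = vm->resident;

	xvm_free(vm);		/* safe even if no reference was taken */
	return rss;
}
```

The reference keeps the VM space alive for the duration of the read even if
teardown starts in the meantime, and the NULL check in the release path lets
the caller drop the reference unconditionally.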

Index: kern/kern_sysctl.c
===================================================================
RCS file: /cvs/src/sys/kern/kern_sysctl.c,v
diff -u -p -r1.482 kern_sysctl.c
--- kern/kern_sysctl.c	6 Aug 2025 14:00:33 -0000	1.482
+++ kern/kern_sysctl.c	13 Aug 2025 13:46:14 -0000
@@ -2051,11 +2051,17 @@ fill_kproc(struct process *pr, struct ki
 {
 	struct session *s = pr->ps_session;
 	struct tty *tp;
-	struct vmspace *vm = pr->ps_vmspace;
+	struct vmspace *vm = NULL;
 	struct timespec booted, st, ut, utc;
 	struct tusage tu;
 	int isthread;
 
+	/* exiting/zombie process might no longer have VM space. */
+	if ((pr->ps_flags & PS_EXITING) == 0) {
+		vm = pr->ps_vmspace;
+		uvmspace_addref(vm);
+	}
+
 	isthread = p != NULL;
 	if (!isthread) {
 		p = pr->ps_mainproc;		/* XXX */
@@ -2082,7 +2088,7 @@ fill_kproc(struct process *pr, struct ki
 	}
 
 	/* fixups that can only be done in the kernel */
-	if ((pr->ps_flags & PS_ZOMBIE) == 0) {
+	if ((pr->ps_flags & PS_EXITING) == 0) {
 		if ((pr->ps_flags & PS_EMBRYO) == 0 && vm != NULL)
 			ki->p_vm_rssize = vm_resident_count(vm);
 		calctsru(&tu, &ut, &st, NULL);
@@ -2103,13 +2109,15 @@ fill_kproc(struct process *pr, struct ki
 #endif
 	}
 
+	uvmspace_free(vm);
+
 	/* get %cpu and schedule state: just one thread or sum of all? */
 	if (isthread) {
 		ki->p_pctcpu = p->p_pctcpu;
 		ki->p_stat   = p->p_stat;
 	} else {
 		ki->p_pctcpu = 0;
-		ki->p_stat = (pr->ps_flags & PS_ZOMBIE) ? SDEAD : SIDL;
+		ki->p_stat = (pr->ps_flags & PS_EXITING) ? SDEAD : SIDL;
 		TAILQ_FOREACH(p, &pr->ps_threads, p_thr_link) {
 			ki->p_pctcpu += p->p_pctcpu;
 			/* find best state: ONPROC > RUN > STOP > SLEEP > .. */
Index: kern/tty.c
===================================================================
RCS file: /cvs/src/sys/kern/tty.c,v
diff -u -p -r1.180 tty.c
--- kern/tty.c	12 Jun 2025 20:37:58 -0000	1.180
+++ kern/tty.c	13 Aug 2025 13:47:13 -0000
@@ -2198,8 +2198,8 @@ empty:		ttyprintf(tp, "empty foreground 
 			if (run2 || pctcpu2 > pctcpu)
 				goto update_pickpr;
 
-			/* if p has less cpu or is zombie, then it's worse */
-			if (pctcpu2 < pctcpu || (pr->ps_flags & PS_ZOMBIE))
+			/* if p has less cpu or is exiting, then it's worse */
+			if (pctcpu2 < pctcpu || (pr->ps_flags & PS_EXITING))
 				continue;
 update_pickpr:
 			pickpr = pr;
@@ -2209,7 +2209,7 @@ update_pickpr:
 
 		/* Calculate percentage cpu, resident set size. */
 		calc_pctcpu = (pctcpu * 10000 + FSCALE / 2) >> FSHIFT;
-		if ((pickpr->ps_flags & (PS_EMBRYO | PS_ZOMBIE)) == 0 &&
+		if ((pickpr->ps_flags & (PS_EMBRYO | PS_EXITING)) == 0 &&
 		    pickpr->ps_vmspace != NULL)
 			rss = vm_resident_count(pickpr->ps_vmspace);
 
Index: uvm/uvm_map.c
===================================================================
RCS file: /cvs/src/sys/uvm/uvm_map.c,v
diff -u -p -r1.345 uvm_map.c
--- uvm/uvm_map.c	3 Jun 2025 08:38:17 -0000	1.345
+++ uvm/uvm_map.c	13 Aug 2025 13:47:09 -0000
@@ -3426,7 +3426,7 @@ uvmspace_purge(struct vmspace *vm)
 void
 uvmspace_free(struct vmspace *vm)
 {
-	if (atomic_dec_int_nv(&vm->vm_refcnt) == 0) {
+	if (vm != NULL && atomic_dec_int_nv(&vm->vm_refcnt) == 0) {
 		/*
 		 * Sanity check.  Kernel threads never end up here and
 		 * userland ones already tear down there VM space in