From: Gerhard Roth Subject: Re: [EXT] Re: Kernel protection fault in fill_kproc() To: "cjeker@diehard.n-r-g.com" , "mvs@openbsd.org" Cc: "tech@openbsd.org" , Carsten Beckmann , "mark.kettenis@xs4all.nl" Date: Wed, 13 Aug 2025 13:43:59 +0000 On Wed, 2025-08-13 at 15:25 +0200, Claudio Jeker wrote: > On Wed, Aug 13, 2025 at 03:46:39PM +0300, Vitaliy Makkoveev wrote: > > On Wed, Aug 13, 2025 at 02:35:39PM +0200, Martin Pieuchot wrote: > > > On 12/08/25(Tue) 14:36, Vitaliy Makkoveev wrote: > > > > On Tue, Aug 12, 2025 at 12:40:21PM +0200, Martin Pieuchot wrote: > > > > > On 12/08/25(Tue) 13:00, Vitaliy Makkoveev wrote: > > > > > > On Tue, Aug 12, 2025 at 11:49:12AM +0200, Mark Kettenis wrote: > > > > > > > > Date: Tue, 12 Aug 2025 11:56:10 +0300 > > > > > > > > From: Vitaliy Makkoveev > > > > > > > > > > > > > > > > On Tue, Aug 12, 2025 at 07:22:29AM +0200, Claudio Jeker wrote: > > > > > > > > > On Mon, Aug 11, 2025 at 10:45:05AM +0000, Gerhard Roth wrote: > > > > > > > > > > About a year ago, the call to uvm_exit() was moved outside of the  > > > > > > > > > > KERNEL_LOCK() in the reaper() by mpi@. Now we observed a kernel > > > > > > > > > > protection fault that results from this change. > > > > > > > > > > > > > > > > > > > > In fill_kproc() we read the vmspace pointer (vm) right at the very > > > > > > > > > > beginning of the function: > > > > > > > > > > > > > > > > > > > >         struct vmspace *vm = pr->ps_vmspace; > > > > > > > > > > > > > > > > > > > > Sometime later, we try to access it: > > > > > > > > > > > > > > > > > > > >         /* fixups that can only be done in the kernel */ > > > > > > > > > >         if ((pr->ps_flags & PS_ZOMBIE) == 0) { > > > > > > > > > >                 if ((pr->ps_flags & PS_EMBRYO) == 0 && vm != NULL) > > > > > > > > > >                         ki->p_vm_rssize = vm_resident_count(vm); > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > In the meantime the process might have exited and the reaper() can free > > > > > > > > > > the vmspace by calling uvm_exit(). After that, the 'vm' pointer in > > > > > > > > > > fill_kproc() points to stale memory. Accessing it will yield a kernel > > > > > > > > > > protection fault. > > > > > > > > > > > > > > > > > > > > BTW: only after freeing the vmspace of the process, the PS_ZOMBIE flag > > > > > > > > > > is set by the reaper(). > > > > > > > > > > > > > > > > > > > > I propose to put the reaper()'s call to uvm_exit() back under the > > > > > > > > > > kernel lock to avoid the fault. > > > > > > > > > > > > > > > > > > In my opinion the fill_kproc() code is wrong and it should not look at > > > > > > > > > pr->ps_vmspace if the PS_EXITING flag is set for the process. > > > > > > > > > > > > > > > > > > exit1() sets PS_EXITING flag early on and after that point the vm can be > > > > > > > > > purged so the vm_resident_count() is probably wrong anyway. > > > > > > > > > > > > > > I guess that is safe since exit1() still runs with the kernel lock > > > > > > > held.  Which means that the PS_EXITING flags can't be set while > > > > > > > fill_kproc() runs, since it holds the kernel lock. > > > > > > > > > > > > > > > The only fill_kproc() is the sysctl_doproc() which does the check in the > > > > > > > > beginning of the allprocess loop: > > > > > > > > > > > > > > > >         for (; pr != NULL; pr = LIST_NEXT(pr, ps_list)) { > > > > > > > >                 /* XXX skip processes in the middle of being zapped */ > > > > > > > >                 if (pr->ps_pgrp == NULL) > > > > > > > >                         continue; > > > > > > > > > > > > > > There is some other code where an "external observer" looks at > > > > > > > ps_vmspace in kern/sys_process.c:process_domem(): > > > > > > > > > > > > > >         vm = tr->ps_vmspace; > > > > > > >         if ((tr->ps_flags & PS_EXITING) || (vm->vm_refcnt < 1)) > > > > > > >                 return EFAULT; > > > > > > >         addr = uio->uio_offset; > > > > > > > > > > > > > >         uvmspace_addref(vm); > > > > > > > > > > > > > >         error = uvm_io(&vm->vm_map, uio, UVM_IO_FIXPROT); > > > > > > > > > > > > > >         uvmspace_free(vm); > > > > > > > > > > > > > > So that checks PS_EXITING as well, but also checks the refcnt. > > > > > > > > > > > > > > As you can see this also takes a reference of the vmspace.  I guess > > > > > > > that's necessary since uvm_io() may sleep. > > > > > > > > > > > > The problem lies in the unlocked uvm_exit(pr) which starts teardown of > > > > > > vmspace. Simple adding  uvmspace_addref() somewhere else will not help > > > > > > you because there is no guarantees that your uvmspace_addref() is the > > > > > > winner. So you need to serialize the uvm_exit() and the > > > > > > uvmspace_addref() thread. This means it should be moved back under > > > > > > kernel lock. > > > > > > > > > > Indeed, clearing `ps_vmspace' isn't safe without KERNEL_LOCK().  The > > > > > drawback of that approach is that pmap_destroy(), which is costly, will > > > > > no longer be executed in parallel. > > > > > > > > > > > In my initial diff I propose to move uvm_exit(pr) after the kernel > > > > > > locked section of reaper(). This mead the vmspace teardown will start > > > > > > after the process being unlinek from the allprocess or zombprocess lists > > > > > > and not accessed by sysctl(2). > > > > > > > > > > I'd suggest skipping dereferencing `ps_vmspace' for now to be coherent > > > > > with what is done by sysctl_proc_args(). > > > > > > > > > > Accessing zombie process descriptors out of the KERNEL_LOCK() would be a > > > > > must for killing the reaper.  Anyone interested? > > > > > > > > > > Diff below is untested, I'm too busy atm. > > > > > > > > > > > > > You do lockless uvm_exit(pr) before setting PS_ZOMBIE flag. This is the > > > > case I described below, no guarantees that your uvmspace_addref(vm) > > > > thread is the winner. > > > > > > Please re-read the diff, there's such guarantees thanks to PS_EXITING > > > which is set before uvm_purge(). > > > > > > > So, the PS_EXITING bit is enough? For that purpose do you setting > > PS_ZOMBIE? > > > > > > +       /* exiting/zombie process might no longer have VM space. */ > > > > > +       if ((pr->ps_flags & (PS_ZOMBIE|PS_EXITING)) == 0) { > > Here there is indeed no need to check for PS_ZOMBIE because PS_EXITING is > set but never cleared. So if PS_ZOMBIE is set then PS_EXITING must also be > set since you can only become a zombie by exiting first. > > > > > > +               vm = pr->ps_vmspace; > > > > > +               uvmspace_addref(vm); > > > > > +       } > > > > > + > > I think this diff is the better way to fix this issue. But not before one bug is fixed: it possibly calls uvmspace_free() with a NULL pointer. >   > > > This pattern is already present in multiple places in the tree. > > > > > > > Putting the uvmspace_free(vm) just after the kernel locked section in > > > > reaper makes it safe, because at this point the process is not accessible > > > > from the alloprocess or zombprocess loops. Also this fixes the > > > > sysctl_proc_args(), which has the same issue with unlocked uvm_exit(pr) > > > > before setting PS_ZOMBIE flag. > > > > > > This choice is not helping with unlocking more of the reaper and > > > introduces incoherency.  This is IMHO the wrong path to follow. > > > > > > > > Index: kern/kern_sysctl.c > > > > > =================================================================== > > > > > RCS file: /cvs/src/sys/kern/kern_sysctl.c,v > > > > > diff -u -p -r1.482 kern_sysctl.c > > > > > --- kern/kern_sysctl.c  6 Aug 2025 14:00:33 -0000       1.482 > > > > > +++ kern/kern_sysctl.c  12 Aug 2025 10:35:07 -0000 > > > > > @@ -2051,11 +2051,17 @@ fill_kproc(struct process *pr, struct ki > > > > >  { > > > > >         struct session *s = pr->ps_session; > > > > >         struct tty *tp; > > > > > -       struct vmspace *vm = pr->ps_vmspace; > > > > > +       struct vmspace *vm = NULL; > > > > >         struct timespec booted, st, ut, utc; > > > > >         struct tusage tu; > > > > >         int isthread; > > > > >   > > > > > +       /* exiting/zombie process might no longer have VM space. */ > > > > > +       if ((pr->ps_flags & (PS_ZOMBIE|PS_EXITING)) == 0) { > > > > > +               vm = pr->ps_vmspace; > > > > > +               uvmspace_addref(vm); > > > > > +       } > > > > > + > > > > >         isthread = p != NULL; > > > > >         if (!isthread) { > > > > >                 p = pr->ps_mainproc;            /* XXX */ > > > > > @@ -2102,6 +2108,8 @@ fill_kproc(struct process *pr, struct ki > > > > >                         ki->p_cpuid = CPU_INFO_UNIT(p->p_cpu); > > > > >  #endif > > > > >         } > > > > > + > > > > > +       uvmspace_free(vm); > > > > >   > > > > >         /* get %cpu and schedule state: just one thread or sum of all? */ > > > > >         if (isthread) { > > > > > > > > > > > > > > > > > > > > > > >