Download raw body.
Kernel protection fault in fill_kproc()
On Wed, Aug 13, 2025 at 03:46:39PM +0300, Vitaliy Makkoveev wrote:
> On Wed, Aug 13, 2025 at 02:35:39PM +0200, Martin Pieuchot wrote:
> > On 12/08/25(Tue) 14:36, Vitaliy Makkoveev wrote:
> > > On Tue, Aug 12, 2025 at 12:40:21PM +0200, Martin Pieuchot wrote:
> > > > On 12/08/25(Tue) 13:00, Vitaliy Makkoveev wrote:
> > > > > On Tue, Aug 12, 2025 at 11:49:12AM +0200, Mark Kettenis wrote:
> > > > > > > Date: Tue, 12 Aug 2025 11:56:10 +0300
> > > > > > > From: Vitaliy Makkoveev <mvs@openbsd.org>
> > > > > > >
> > > > > > > On Tue, Aug 12, 2025 at 07:22:29AM +0200, Claudio Jeker wrote:
> > > > > > > > On Mon, Aug 11, 2025 at 10:45:05AM +0000, Gerhard Roth wrote:
> > > > > > > > > About a year ago, the call to uvm_exit() was moved outside of theĀ
> > > > > > > > > KERNEL_LOCK() in the reaper() by mpi@. Now we observed a kernel
> > > > > > > > > protection fault that results from this change.
> > > > > > > > >
> > > > > > > > > In fill_kproc() we read the vmspace pointer (vm) right at the very
> > > > > > > > > beginning of the function:
> > > > > > > > >
> > > > > > > > > struct vmspace *vm = pr->ps_vmspace;
> > > > > > > > >
> > > > > > > > > Sometime later, we try to access it:
> > > > > > > > >
> > > > > > > > > /* fixups that can only be done in the kernel */
> > > > > > > > > if ((pr->ps_flags & PS_ZOMBIE) == 0) {
> > > > > > > > > if ((pr->ps_flags & PS_EMBRYO) == 0 && vm != NULL)
> > > > > > > > > ki->p_vm_rssize = vm_resident_count(vm);
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > In the meantime the process might have exited and the reaper() can free
> > > > > > > > > the vmspace by calling uvm_exit(). After that, the 'vm' pointer in
> > > > > > > > > fill_kproc() points to stale memory. Accessing it will yield a kernel
> > > > > > > > > protection fault.
> > > > > > > > >
> > > > > > > > > BTW: only after freeing the vmspace of the process, the PS_ZOMBIE flag
> > > > > > > > > is set by the reaper().
> > > > > > > > >
> > > > > > > > > I propose to put the reaper()'s call to uvm_exit() back under the
> > > > > > > > > kernel lock to avoid the fault.
> > > > > > > >
> > > > > > > > In my opinion the fill_kproc() code is wrong and it should not look at
> > > > > > > > pr->ps_vmspace if the PS_EXITING flag is set for the process.
> > > > > > > >
> > > > > > > > exit1() sets PS_EXITING flag early on and after that point the vm can be
> > > > > > > > purged so the vm_resident_count() is probably wrong anyway.
> > > > > >
> > > > > > I guess that is safe since exit1() still runs with the kernel lock
> > > > > > held. Which means that the PS_EXITING flags can't be set while
> > > > > > fill_kproc() runs, since it holds the kernel lock.
> > > > > >
> > > > > > > The only fill_kproc() is the sysctl_doproc() which does the check in the
> > > > > > > beginning of the allprocess loop:
> > > > > > >
> > > > > > > for (; pr != NULL; pr = LIST_NEXT(pr, ps_list)) {
> > > > > > > /* XXX skip processes in the middle of being zapped */
> > > > > > > if (pr->ps_pgrp == NULL)
> > > > > > > continue;
> > > > > >
> > > > > > There is some other code where an "external observer" looks at
> > > > > > ps_vmspace in kern/sys_process.c:process_domem():
> > > > > >
> > > > > > vm = tr->ps_vmspace;
> > > > > > if ((tr->ps_flags & PS_EXITING) || (vm->vm_refcnt < 1))
> > > > > > return EFAULT;
> > > > > > addr = uio->uio_offset;
> > > > > >
> > > > > > uvmspace_addref(vm);
> > > > > >
> > > > > > error = uvm_io(&vm->vm_map, uio, UVM_IO_FIXPROT);
> > > > > >
> > > > > > uvmspace_free(vm);
> > > > > >
> > > > > > So that checks PS_EXITING as well, but also checks the refcnt.
> > > > > >
> > > > > > As you can see this also takes a reference of the vmspace. I guess
> > > > > > that's necessary since uvm_io() may sleep.
> > > > >
> > > > > The problem lies in the unlocked uvm_exit(pr) which starts teardown of
> > > > > vmspace. Simple adding uvmspace_addref() somewhere else will not help
> > > > > you because there is no guarantees that your uvmspace_addref() is the
> > > > > winner. So you need to serialize the uvm_exit() and the
> > > > > uvmspace_addref() thread. This means it should be moved back under
> > > > > kernel lock.
> > > >
> > > > Indeed, clearing `ps_vmspace' isn't safe without KERNEL_LOCK(). The
> > > > drawback of that approach is that pmap_destroy(), which is costly, will
> > > > no longer be executed in parallel.
> > > >
> > > > > In my initial diff I propose to move uvm_exit(pr) after the kernel
> > > > > locked section of reaper(). This mead the vmspace teardown will start
> > > > > after the process being unlinek from the allprocess or zombprocess lists
> > > > > and not accessed by sysctl(2).
> > > >
> > > > I'd suggest skipping dereferencing `ps_vmspace' for now to be coherent
> > > > with what is done by sysctl_proc_args().
> > > >
> > > > Accessing zombie process descriptors out of the KERNEL_LOCK() would be a
> > > > must for killing the reaper. Anyone interested?
> > > >
> > > > Diff below is untested, I'm too busy atm.
> > > >
> > >
> > > You do lockless uvm_exit(pr) before setting PS_ZOMBIE flag. This is the
> > > case I described below, no guarantees that your uvmspace_addref(vm)
> > > thread is the winner.
> >
> > Please re-read the diff, there's such guarantees thanks to PS_EXITING
> > which is set before uvm_purge().
> >
>
> So, the PS_EXITING bit is enough? For that purpose do you setting
> PS_ZOMBIE?
>
> > This pattern is already present in multiple places in the tree.
> >
> > > Putting the uvmspace_free(vm) just after the kernel locked section in
> > > reaper makes it safe, because at this point the process is not accessible
> > > from the alloprocess or zombprocess loops. Also this fixes the
> > > sysctl_proc_args(), which has the same issue with unlocked uvm_exit(pr)
> > > before setting PS_ZOMBIE flag.
> >
> > This choice is not helping with unlocking more of the reaper and
> > introduces incoherency. This is IMHO the wrong path to follow.
What "incoherency" are you talking? Do you understand that the end of
life cycle of vmspace will be the same with my diff and with the
refcounting?
> >
> > > > Index: kern/kern_sysctl.c
> > > > ===================================================================
> > > > RCS file: /cvs/src/sys/kern/kern_sysctl.c,v
> > > > diff -u -p -r1.482 kern_sysctl.c
> > > > --- kern/kern_sysctl.c 6 Aug 2025 14:00:33 -0000 1.482
> > > > +++ kern/kern_sysctl.c 12 Aug 2025 10:35:07 -0000
> > > > @@ -2051,11 +2051,17 @@ fill_kproc(struct process *pr, struct ki
> > > > {
> > > > struct session *s = pr->ps_session;
> > > > struct tty *tp;
> > > > - struct vmspace *vm = pr->ps_vmspace;
> > > > + struct vmspace *vm = NULL;
> > > > struct timespec booted, st, ut, utc;
> > > > struct tusage tu;
> > > > int isthread;
> > > >
> > > > + /* exiting/zombie process might no longer have VM space. */
> > > > + if ((pr->ps_flags & (PS_ZOMBIE|PS_EXITING)) == 0) {
> > > > + vm = pr->ps_vmspace;
> > > > + uvmspace_addref(vm);
> > > > + }
> > > > +
> > > > isthread = p != NULL;
> > > > if (!isthread) {
> > > > p = pr->ps_mainproc; /* XXX */
> > > > @@ -2102,6 +2108,8 @@ fill_kproc(struct process *pr, struct ki
> > > > ki->p_cpuid = CPU_INFO_UNIT(p->p_cpu);
> > > > #endif
> > > > }
> > > > +
> > > > + uvmspace_free(vm);
> > > >
> > > > /* get %cpu and schedule state: just one thread or sum of all? */
> > > > if (isthread) {
> > > >
> > > >
> > >
> >
> >
>
Kernel protection fault in fill_kproc()