From: Mark Kettenis <mark.kettenis@xs4all.nl>
Subject: Re: Kernel protection fault in fill_kproc()
To: Vitaliy Makkoveev <mvs@openbsd.org>
Cc: gerhard_roth@genua.de, tech@openbsd.org, mpi@openbsd.org, carsten_beckmann@genua.de
Date: Tue, 12 Aug 2025 11:49:12 +0200

> Date: Tue, 12 Aug 2025 11:56:10 +0300
> From: Vitaliy Makkoveev <mvs@openbsd.org>
> 
> On Tue, Aug 12, 2025 at 07:22:29AM +0200, Claudio Jeker wrote:
> > On Mon, Aug 11, 2025 at 10:45:05AM +0000, Gerhard Roth wrote:
> > > About a year ago, the call to uvm_exit() was moved outside of the 
> > > KERNEL_LOCK() in the reaper() by mpi@. Now we observed a kernel
> > > protection fault that results from this change.
> > > 
> > > In fill_kproc() we read the vmspace pointer (vm) right at the very
> > > beginning of the function:
> > > 
> > >         struct vmspace *vm = pr->ps_vmspace;
> > > 
> > > Sometime later, we try to access it:
> > > 
> > > 	/* fixups that can only be done in the kernel */
> > > 	if ((pr->ps_flags & PS_ZOMBIE) == 0) {
> > > 		if ((pr->ps_flags & PS_EMBRYO) == 0 && vm != NULL)
> > > 			ki->p_vm_rssize = vm_resident_count(vm);
> > > 
> > > 
> > > In the meantime the process might have exited and the reaper() can free
> > > the vmspace by calling uvm_exit(). After that, the 'vm' pointer in
> > > fill_kproc() points to stale memory. Accessing it will yield a kernel
> > > protection fault.
> > > 
> > > BTW: only after freeing the vmspace of the process, the PS_ZOMBIE flag
> > > is set by the reaper().
> > > 
> > > I propose to put the reaper()'s call to uvm_exit() back under the
> > > kernel lock to avoid the fault.
> > 
> > In my opinion the fill_kproc() code is wrong and it should not look at
> > pr->ps_vmspace if the PS_EXITING flag is set for the process.
> > 
> > exit1() sets PS_EXITING flag early on and after that point the vm can be
> > purged so the vm_resident_count() is probably wrong anyway.

I guess that is safe since exit1() still runs with the kernel lock
held.  Which means that the PS_EXITING flags can't be set while
fill_kproc() runs, since it holds the kernel lock.

> The only fill_kproc() is the sysctl_doproc() which does the check in the
> beginning of the allprocess loop:
> 
>         for (; pr != NULL; pr = LIST_NEXT(pr, ps_list)) {
>                 /* XXX skip processes in the middle of being zapped */
>                 if (pr->ps_pgrp == NULL)
>                         continue;

There is some other code where an "external observer" looks at
ps_vmspace in kern/sys_process.c:process_domem():

        vm = tr->ps_vmspace;
        if ((tr->ps_flags & PS_EXITING) || (vm->vm_refcnt < 1))
                return EFAULT;
        addr = uio->uio_offset;

        uvmspace_addref(vm);

        error = uvm_io(&vm->vm_map, uio, UVM_IO_FIXPROT);

        uvmspace_free(vm);

So that checks PS_EXITING as well, but also checks the refcnt.

As you can see this also takes a reference of the vmspace.  I guess
that's necessary since uvm_io() may sleep.

> According your suggestion the check should be like in this diff:
> 
> Index: sys/kern/kern_sysctl.c
> ===================================================================
> RCS file: /cvs/src/sys/kern/kern_sysctl.c,v
> diff -u -p -r1.481 kern_sysctl.c
> --- sys/kern/kern_sysctl.c	24 Jul 2025 19:42:41 -0000	1.481
> +++ sys/kern/kern_sysctl.c	12 Aug 2025 08:54:07 -0000
> @@ -1922,14 +1922,10 @@ sysctl_doproc(int *name, u_int namelen, 
>  	doingzomb = 0;
>  again:
>  	for (; pr != NULL; pr = LIST_NEXT(pr, ps_list)) {
> -		/* XXX skip processes in the middle of being zapped */
> -		if (pr->ps_pgrp == NULL)
> -			continue;
> -
>  		/*
> -		 * Skip embryonic processes.
> +		 * Skip embryonic processes or the exiting processes.

There should be no "the" there.

>  		 */
> -		if (pr->ps_flags & PS_EMBRYO)
> +		if (pr->ps_flags & (PS_EXITING | PS_EMBRYO))
>  			continue;
>  
>  		/*
> 
> 

That may not be enough.  Or at least this makes us rely on vslock to
avoid sleeps in copyout().

There is also code in kern/tty.c:ttyinfo() that would need a
PS_EXITING check.

An alternative approach may be to bring ps_vmspace under the ps_mtx
lock.  That probably would require a lock, take a reference, unlock
dance.  Not sure if that is a direction claudio@ wants to go though.