Download raw body.
On Mon, May 12, 2025 at 08:42:40PM +0200, Martin Pieuchot wrote:
> Diff below moves the tearing down of VM space to exit1(). It implies
> that processes will now be charged for cleaning their VM space. As a
> result their %sys time will increase.
>
> uvm_purge() is called in the "top" part of exit1() when the process is
> still allowed to sleep. To allow this piece of code to be executed in
> parallel, all recursive kernel_lock tokens need to be released.
>
> In the diff below uvm_purge() is called twice. The first time in
> exit1() and the second time, doing nothing but integrity checks, in
> uvm_exit(). Another approach could be to just check that the map is
> empty. Any preferences?
>
> Comments? Oks?
Comments inline.
> Index: kern/kern_exit.c
> ===================================================================
> RCS file: /cvs/src/sys/kern/kern_exit.c,v
> diff -u -p -r1.245 kern_exit.c
> --- kern/kern_exit.c 2 May 2025 05:04:38 -0000 1.245
> +++ kern/kern_exit.c 12 May 2025 18:10:35 -0000
> @@ -241,6 +241,14 @@ exit1(struct proc *p, int xexit, int xsi
> */
> if (pr->ps_pptr->ps_sigacts->ps_sigflags & SAS_NOCLDWAIT)
> atomic_setbits_int(&pr->ps_flags, PS_NOZOMBIE);
> +
> + /* Teardown the virtual address space. */
> +#ifdef MULTIPROCESSOR
> + __mp_release_all(&kernel_lock);
> +#endif
> + uvm_purge(pr);
> + KERNEL_LOCK();
> +
I prefer if PS_NOZOMBIE is set at the end (after uvm_purge).
I know it should not matter but it makes more sense in my head to mark a
process with PS_NOZOMBIE close to the end of exit1.
> }
>
> p->p_fd = NULL; /* zap the thread's copy */
> Index: uvm/uvm_extern.h
> ===================================================================
> RCS file: /cvs/src/sys/uvm/uvm_extern.h,v
> diff -u -p -r1.180 uvm_extern.h
> --- uvm/uvm_extern.h 19 Nov 2024 06:18:26 -0000 1.180
> +++ uvm/uvm_extern.h 12 May 2025 18:05:34 -0000
> @@ -268,6 +268,7 @@ int uvm_fault(vm_map_t, vaddr_t, vm_fa
>
> vaddr_t uvm_uarea_alloc(void);
> void uvm_uarea_free(struct proc *);
> +void uvm_purge(struct process *);
> void uvm_exit(struct process *);
> void uvm_init_limits(struct plimit *);
> boolean_t uvm_kernacc(caddr_t, size_t, int);
> @@ -401,6 +402,7 @@ void uvmspace_init(struct vmspace *, s
> void uvmspace_exec(struct proc *, vaddr_t, vaddr_t);
> struct vmspace *uvmspace_fork(struct process *);
> void uvmspace_addref(struct vmspace *);
> +void uvmspace_purge(struct vmspace *);
> void uvmspace_free(struct vmspace *);
> struct vmspace *uvmspace_share(struct process *);
> int uvm_share(vm_map_t, vaddr_t, vm_prot_t,
> Index: uvm/uvm_glue.c
> ===================================================================
> RCS file: /cvs/src/sys/uvm/uvm_glue.c,v
> diff -u -p -r1.88 uvm_glue.c
> --- uvm/uvm_glue.c 21 Mar 2025 13:19:33 -0000 1.88
> +++ uvm/uvm_glue.c 12 May 2025 18:05:34 -0000
> @@ -286,6 +286,19 @@ uvm_uarea_free(struct proc *p)
> }
>
> /*
> + * uvm_purge: teardown a virtual address space
> + */
> +void
> +uvm_purge(struct process *pr)
> +{
> + struct vmspace *vm = pr->ps_vmspace;
> +
> + KERNEL_ASSERT_UNLOCKED();
> +
> + uvmspace_purge(vm);
> +}
I think here you miss a check. Neither uvm_purge() nor uvmspace_purge()
check the vmspace refcount. In my diff I used this here:
if (atomic_load_int(&vm->vm_refcnt) != 1)
return;
At least kthread_exit() calls exit1() with a shared vm_refcnt of proc0
and without that check a kthread_exit() would purge the kernel vmspace
which is very bad.
You call uvmspace_purge() in uvmspace_free() which is called by uvm_exit
so the above early return works since uvmspace_free() decreases the refcnt
and would then do the work (which should never happen since proc0 never
drops its last reference).
The other places, where the vm_refcnt is increased, are in process_domem()
(ptrace) and in sysctl_proc_args() both check PS_EXITING upfront but then
sleep while holding the reference. So if the process exits while those are
asleep we also end up in fireworks. For those parts I'm not even sure if
going through exit1, reaper and process_zap before those call uvmspace_free()
would not result in some use-after-free or other bad access.
I try to look at all the logic there and think about the consequences when
it comes to vmspace references.
> +
> +/*
> * uvm_exit: exit a virtual address space
> */
> void
> Index: uvm/uvm_map.c
> ===================================================================
> RCS file: /cvs/src/sys/uvm/uvm_map.c,v
> diff -u -p -r1.341 uvm_map.c
> --- uvm/uvm_map.c 21 Apr 2025 14:46:18 -0000 1.341
> +++ uvm/uvm_map.c 12 May 2025 18:05:34 -0000
> @@ -2491,6 +2491,19 @@ uvm_map_teardown(struct vm_map *map)
> entry = TAILQ_NEXT(entry, dfree.deadq);
> }
>
> +#ifdef VMMAP_DEBUG
> + numt = numq = 0;
> + RBT_FOREACH(entry, uvm_map_addr, &map->addr)
> + numt++;
> + TAILQ_FOREACH(entry, &dead_entries, dfree.deadq)
> + numq++;
> + KASSERT(numt == numq);
> +#endif
I would add a comment here why all these values are reset.
It is not fully obvious that the upper parts of uvm_map_teardown() leaves
the map in a inconsistent state. In my diff I had this comment here:
/* reinit RB tree since the above removal leaves the tree corrupted */
> + RBT_INIT(uvm_map_addr, &map->addr);
> + map->size = 0;
> + map->min_offset = 0;
> + map->max_offset = 0;
> + map->flags &= ~VM_MAP_ISVMSPACE;
> vm_map_unlock(map);
>
> /* Remove address selectors. */
> @@ -2503,18 +2516,7 @@ uvm_map_teardown(struct vm_map *map)
> uvm_addr_destroy(map->uaddr_brk_stack);
> map->uaddr_brk_stack = NULL;
>
> -#ifdef VMMAP_DEBUG
> - numt = numq = 0;
> - RBT_FOREACH(entry, uvm_map_addr, &map->addr)
> - numt++;
> - TAILQ_FOREACH(entry, &dead_entries, dfree.deadq)
> - numq++;
> - KASSERT(numt == numq);
> -#endif
> uvm_unmap_detach(&dead_entries, 0);
> -
> - pmap_destroy(map->pmap);
> - map->pmap = NULL;
> }
>
> /*
> @@ -3395,6 +3397,21 @@ uvmspace_addref(struct vmspace *vm)
> atomic_inc_int(&vm->vm_refcnt);
> }
>
> +void
> +uvmspace_purge(struct vmspace *vm)
> +{
> +#ifdef SYSVSHM
> + /* Get rid of any SYSV shared memory segments. */
> + if (vm->vm_shm != NULL) {
> + KERNEL_LOCK();
> + shmexit(vm);
> + KERNEL_UNLOCK();
> + }
> +#endif
> +
> + uvm_map_teardown(&vm->vm_map);
> +}
> +
> /*
> * uvmspace_free: free a vmspace data structure
> */
> @@ -3407,16 +3424,11 @@ uvmspace_free(struct vmspace *vm)
> * all of the mappings and pages they hold, then call the pmap
> * module to reclaim anything left.
> */
> -#ifdef SYSVSHM
> - /* Get rid of any SYSV shared memory segments. */
> - if (vm->vm_shm != NULL) {
> - KERNEL_LOCK();
> - shmexit(vm);
> - KERNEL_UNLOCK();
> - }
> -#endif
> + uvmspace_purge(vm);
> +
> + pmap_destroy(vm->vm_map.pmap);
> + vm->vm_map.pmap = NULL;
>
> - uvm_map_teardown(&vm->vm_map);
> pool_put(&uvm_vmspace_pool, vm);
> }
> }
>
>
With the refcnt issue fixed I think this looks very good.
--
:wq Claudio