From: Claudio Jeker
Subject: Re: uvm_purge()
To: tech@openbsd.org
Date: Thu, 15 May 2025 14:11:12 +0200

On Thu, May 15, 2025 at 01:01:07PM +0200, Martin Pieuchot wrote:
> On 15/05/25(Thu) 11:17, Claudio Jeker wrote:
> > On Mon, May 12, 2025 at 08:42:40PM +0200, Martin Pieuchot wrote:
> > > Diff below moves the tearing down of VM space to exit1().  It implies
> > > that processes will now be charged for cleaning their VM space.  As a
> > > result their %sys time will increase.
> > >
> > > uvm_purge() is called in the "top" part of exit1() when the process is
> > > still allowed to sleep.  To allow this piece of code to be executed in
> > > parallel, all recursive kernel_lock tokens need to be released.
> > >
> > > In the diff below uvm_purge() is called twice.  The first time in
> > > exit1() and the second time, doing nothing but integrity checks, in
> > > uvm_exit().  Another approach could be to just check that the map is
> > > empty.  Any preferences?
> > >
> > > Comments?  Oks?
> >
> > Comments inline.
>
> Thanks for the review.  Answers and new diff below.
>
> > > Index: kern/kern_exit.c
> > > ===================================================================
> > > RCS file: /cvs/src/sys/kern/kern_exit.c,v
> > > diff -u -p -r1.245 kern_exit.c
> > > --- kern/kern_exit.c	2 May 2025 05:04:38 -0000	1.245
> > > +++ kern/kern_exit.c	12 May 2025 18:10:35 -0000
> > > @@ -241,6 +241,14 @@ exit1(struct proc *p, int xexit, int xsi
> > >  	 */
> > >  	if (pr->ps_pptr->ps_sigacts->ps_sigflags & SAS_NOCLDWAIT)
> > >  		atomic_setbits_int(&pr->ps_flags, PS_NOZOMBIE);
> > > +
> > > +	/* Teardown the virtual address space. */
> > > +#ifdef MULTIPROCESSOR
> > > +	__mp_release_all(&kernel_lock);
> > > +#endif
> > > +	uvm_purge(pr);
> > > +	KERNEL_LOCK();
> > > +
> >
> > I prefer if PS_NOZOMBIE is set at the end (after uvm_purge).
> > I know it should not matter but it makes more sense in my head to mark a
> > process with PS_NOZOMBIE close to the end of exit1.
>
> I agree, such diff is ok mpi@.  I won't mix this refactoring with the
> introduction of uvm_purge() though.

Sure, we can do this after this is in.

> > > Index: uvm/uvm_glue.c
> > > ===================================================================
> > > RCS file: /cvs/src/sys/uvm/uvm_glue.c,v
> > > diff -u -p -r1.88 uvm_glue.c
> > > --- uvm/uvm_glue.c	21 Mar 2025 13:19:33 -0000	1.88
> > > +++ uvm/uvm_glue.c	12 May 2025 18:05:34 -0000
> > > @@ -286,6 +286,19 @@ uvm_uarea_free(struct proc *p)
> > >  }
> > >
> > >  /*
> > > + * uvm_purge: teardown a virtual address space
> > > + */
> > > +void
> > > +uvm_purge(struct process *pr)
> > > +{
> > > +	struct vmspace *vm = pr->ps_vmspace;
> > > +
> > > +	KERNEL_ASSERT_UNLOCKED();
> > > +
> > > +	uvmspace_purge(vm);
> > > +}
> >
> > I think here you miss a check.  Neither uvm_purge() nor uvmspace_purge()
> > check the vmspace refcount.  In my diff I used this here:
> >
> > 	if (atomic_load_int(&vm->vm_refcnt) != 1)
> > 		return;
> >
> > At least kthread_exit() calls exit1() with a shared vm_refcnt of proc0,
> > and without that check a kthread_exit() would purge the kernel vmspace,
> > which is very bad.
>
> Indeed.  I forgot about kernel threads exiting.  I deliberately decided
> to not check for refcounts for userland processes.  See below.
>
> > The other places where the vm_refcnt is increased are in process_domem()
> > (ptrace) and in sysctl_proc_args(); both check PS_EXITING upfront but then
> > sleep while holding the reference.  So if the process exits while those are
> > asleep we also end up in fireworks.  For those parts I'm not even sure if
> > going through exit1, reaper and process_zap before those call uvmspace_free()
> > would not result in some use-after-free or other bad access.
> > I'll try to look at all the logic there and think about the consequences
> > when it comes to vmspace references.
>
> When a process wants to peek at the VM space of a dying one, the
> vm_map_lock() is what guarantees integrity of the data structures.
> So in the worst case the VM space will appear empty.

So that would then result in a failed uvm_io operation.  I guess that
is fine.
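To spell out the reader pattern we are talking about (a rough, untested
sketch, not code from your diff; this is what a process_domem()-style
reader boils down to, error handling omitted):

	struct vmspace *vm = pr->ps_vmspace;

	uvmspace_addref(vm);		/* keeps struct vmspace allocated */
	/* ...sleep; the target may run exit1() and uvm_purge()... */
	vm_map_lock_read(&vm->vm_map);	/* consistent view of the entries */
	/* walk the map; after a purge it is simply empty */
	vm_map_unlock_read(&vm->vm_map);
	uvmspace_free(vm);		/* drop our reference */

As long as nothing caches pointers to map entries across the sleep, the
worst case really is just an empty walk.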
> I decided to not care about reference counts for userland processes
> because I want to be sure the process itself will be charged for
> cleaning its VM space.  This can be seen as a huge munmap(2) of the
> whole address space.

I agree with that.  We want the process to do the work in exit1 and not
have some other thread do it on its behalf.

> The reference protects the 'struct vmspace' which needs to stay around
> for the reasons mentioned above.

I am more worried that some code in this exit1, reaper, process_zap
jungle would change a value inside struct vmspace that would break the
assumption you made about the map lock.  I had a look and did not spot
anything right away.

> For that reason the updated diff below skips uvmspace_purge() in exit1()
> for kernel threads.

This is fine with me.
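For the archives, the check from my earlier diff would sit at the top of
uvm_purge(), roughly like this (untested sketch; your updated diff may
instead detect kernel threads in exit1() itself, e.g. via a flag like
PS_SYSTEM):

	struct vmspace *vm = pr->ps_vmspace;

	/*
	 * Kernel threads share proc0's vmspace.  A shared refcount
	 * means we are not the last user, so never purge here --
	 * otherwise kthread_exit() would tear down the kernel map.
	 */
	if (atomic_load_int(&vm->vm_refcnt) != 1)
		return;
	uvmspace_purge(vm);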
> Done.  I also moved cleanups needed for VMMAP_DEBUG builds where they
> belong.

Some bikeshed requests below.  Apart from that OK claudio@.
I will throw it onto some systems to test.

> Index: uvm/uvm_map.c
> ===================================================================
> RCS file: /cvs/src/sys/uvm/uvm_map.c,v
> diff -u -p -r1.341 uvm_map.c
> --- uvm/uvm_map.c	21 Apr 2025 14:46:18 -0000	1.341
> +++ uvm/uvm_map.c	15 May 2025 10:56:06 -0000
> @@ -305,7 +305,7 @@ vaddr_t uvm_maxkaddr;
>   *
>   * Addresses are unique.
>   * Entries with start == end may only exist if they are the first entry
> - * (sorted by address) within a free-memory tree.
> +   (sorted by address) within a free-memory tree.

You lost a * here.

>   */
>
>  static inline int
> @@ -2491,6 +2491,24 @@ uvm_map_teardown(struct vm_map *map)
>  		entry = TAILQ_NEXT(entry, dfree.deadq);
>  	}
>
> +#ifdef VMMAP_DEBUG
> +	numt = numq = 0;
> +	RBT_FOREACH(entry, uvm_map_addr, &map->addr)
> +		numt++;
> +	TAILQ_FOREACH(entry, &dead_entries, dfree.deadq)
> +		numq++;
> +	KASSERT(numt == numq);
> +	/*
> +	 * Let VMMAP_DEBUG checks work on an empty map if uvm_map_teardown()
> +	 * is called twice.
> +	 */
> +	map->size = 0;
> +	map->min_offset = 0;
> +	map->max_offset = 0;
> +	map->flags &= ~VM_MAP_ISVMSPACE;
> +#endif
> +	/* reinit RB tree since the above removal leaves the tree corrupted. */
> +	RBT_INIT(uvm_map_addr, &map->addr);
>  	vm_map_unlock(map);
>
>  	/* Remove address selectors. */
> @@ -2503,18 +2521,7 @@ uvm_map_teardown(struct vm_map *map)
>  	uvm_addr_destroy(map->uaddr_brk_stack);
>  	map->uaddr_brk_stack = NULL;
>
> -#ifdef VMMAP_DEBUG
> -	numt = numq = 0;
> -	RBT_FOREACH(entry, uvm_map_addr, &map->addr)
> -		numt++;
> -	TAILQ_FOREACH(entry, &dead_entries, dfree.deadq)
> -		numq++;
> -	KASSERT(numt == numq);
> -#endif
>  	uvm_unmap_detach(&dead_entries, 0);
> -
> -	pmap_destroy(map->pmap);
> -	map->pmap = NULL;
>  }
>
>  /*
> @@ -3395,6 +3402,26 @@ uvmspace_addref(struct vmspace *vm)
>  	atomic_inc_int(&vm->vm_refcnt);
>  }
>
> +void
> +uvmspace_purge(struct vmspace *vm)
> +{
> +#ifdef SYSVSHM
> +	/* Get rid of any SYSV shared memory segments. */
> +	if (vm->vm_shm != NULL) {
> +		KERNEL_LOCK();
> +		shmexit(vm);
> +		KERNEL_UNLOCK();
> +	}
> +#endif
> +
> +	/*
> +	 * Lock the map, to wait out all other references to it. delete
> +	 * all of the mappings and pages they hold, then call the pmap
> +	 * module to reclaim anything left.

This comment is not quite correct anymore.  At least I think 'then call
the pmap module to reclaim anything left.' should be removed, since that
is a reference to the pmap_destroy() call which is now in
uvmspace_free().

> +	 */
> +	uvm_map_teardown(&vm->vm_map);
> +}
> +
>  /*
>   * uvmspace_free: free a vmspace data structure
>   */
> @@ -3402,21 +3429,11 @@ void
>  uvmspace_free(struct vmspace *vm)
>  {
>  	if (atomic_dec_int_nv(&vm->vm_refcnt) == 0) {
> -		/*
> -		 * lock the map, to wait out all other references to it. delete
> -		 * all of the mappings and pages they hold, then call the pmap
> -		 * module to reclaim anything left.
> -		 */
> -#ifdef SYSVSHM
> -		/* Get rid of any SYSV shared memory segments. */
> -		if (vm->vm_shm != NULL) {
> -			KERNEL_LOCK();
> -			shmexit(vm);
> -			KERNEL_UNLOCK();
> -		}
> -#endif
> +		uvmspace_purge(vm);
> +

I think you could add the pmap comment here if you like, but I have to
say that the code here is now very self-explanatory.

> +		pmap_destroy(vm->vm_map.pmap);
> +		vm->vm_map.pmap = NULL;
>
> -		uvm_map_teardown(&vm->vm_map);
>  		pool_put(&uvm_vmspace_pool, vm);
>  	}
>  }
>

-- 
:wq Claudio