From: Claudio Jeker <cjeker@diehard.n-r-g.com>
Subject: Re: uvm_purge()
To: tech@openbsd.org
Date: Thu, 15 May 2025 14:11:12 +0200

On Thu, May 15, 2025 at 01:01:07PM +0200, Martin Pieuchot wrote:
> On 15/05/25(Thu) 11:17, Claudio Jeker wrote:
> > On Mon, May 12, 2025 at 08:42:40PM +0200, Martin Pieuchot wrote:
> > > Diff below moves the tearing down of VM space to exit1().  It implies
> > > that processes will now be charged for cleaning their VM space.  As a
> > > result their %sys time will increase.
> > > 
> > > uvm_purge() is called in the "top" part of exit1() when the process is
> > > still allowed to sleep.  To allow this piece of code to be executed in
> > > parallel, all recursive kernel_lock tokens need to be released. 
> > > 
> > > In the diff below uvm_purge() is called twice.  The first time in
> > > exit1() and the second time, doing nothing but integrity checks, in
> > > uvm_exit().  Another approach could be to just check that the map is
> > > empty.  Any preferences?
> > > 
> > > Comments?  Oks?
> > 
> > Comments inline.
> 
> Thanks for the review.  Answers and new diff below.
> 
> >  
> > > Index: kern/kern_exit.c
> > > ===================================================================
> > > RCS file: /cvs/src/sys/kern/kern_exit.c,v
> > > diff -u -p -r1.245 kern_exit.c
> > > --- kern/kern_exit.c	2 May 2025 05:04:38 -0000	1.245
> > > +++ kern/kern_exit.c	12 May 2025 18:10:35 -0000
> > > @@ -241,6 +241,14 @@ exit1(struct proc *p, int xexit, int xsi
> > >  		 */
> > >  		if (pr->ps_pptr->ps_sigacts->ps_sigflags & SAS_NOCLDWAIT)
> > >  			atomic_setbits_int(&pr->ps_flags, PS_NOZOMBIE);
> > > +
> > > +		/* Teardown the virtual address space. */
> > > +#ifdef MULTIPROCESSOR
> > > +		__mp_release_all(&kernel_lock);
> > > +#endif
> > > +		uvm_purge(pr);
> > > +		KERNEL_LOCK();
> > > +
> > 
> > I prefer if PS_NOZOMBIE is set at the end (after uvm_purge).
> > I know it should not matter but it makes more sense in my head to mark a
> > process with PS_NOZOMBIE close to the end of exit1.
> 
> I agree, such diff is ok mpi@.  I won't mix this refactoring with the
> introduction of uvm_purge() though.

Sure, we can do this after this is in.
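For the record, the follow-up hunk in exit1() would then look roughly like
this (a sketch only, not part of the diff under review):

		/* Teardown the virtual address space. */
#ifdef MULTIPROCESSOR
		__mp_release_all(&kernel_lock);
#endif
		uvm_purge(pr);
		KERNEL_LOCK();

		/* Moved after uvm_purge() so it happens close to the end. */
		if (pr->ps_pptr->ps_sigacts->ps_sigflags & SAS_NOCLDWAIT)
			atomic_setbits_int(&pr->ps_flags, PS_NOZOMBIE);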
 
> > > Index: uvm/uvm_glue.c
> > > ===================================================================
> > > RCS file: /cvs/src/sys/uvm/uvm_glue.c,v
> > > diff -u -p -r1.88 uvm_glue.c
> > > --- uvm/uvm_glue.c	21 Mar 2025 13:19:33 -0000	1.88
> > > +++ uvm/uvm_glue.c	12 May 2025 18:05:34 -0000
> > > @@ -286,6 +286,19 @@ uvm_uarea_free(struct proc *p)
> > >  }
> > >  
> > >  /*
> > > + * uvm_purge: teardown a virtual address space
> > > + */
> > > +void
> > > +uvm_purge(struct process *pr)
> > > +{
> > > +	struct vmspace *vm = pr->ps_vmspace;
> > > +
> > > +	KERNEL_ASSERT_UNLOCKED();
> > > +
> > > +	uvmspace_purge(vm);
> > > +}
> > 
> > I think here you miss a check. Neither uvm_purge() nor uvmspace_purge()
> > check the vmspace refcount. In my diff I used this here:
> > 
> > 	if (atomic_load_int(&vm->vm_refcnt) != 1)
> > 		return;
> > 
> > At least kthread_exit() calls exit1() with a shared vm_refcnt of proc0
> > and without that check a kthread_exit() would purge the kernel vmspace
> > which is very bad.
> 
> Indeed.  I forgot about kernel threads exiting.  I deliberately decided
> to not check for refcounts for userland processes.   See below.
> 
> > The other places, where the vm_refcnt is increased, are in process_domem()
> > (ptrace) and in sysctl_proc_args() both check PS_EXITING upfront but then
> > sleep while holding the reference. So if the process exits while those are
> > asleep we also end up in fireworks. For those parts I'm not even sure if
> > going through exit1, reaper and process_zap before those call uvmspace_free()
> > would not result in some use-after-free or other bad access.
> > I'll try to look at all the logic there and think about the consequences when
> > it comes to vmspace references.
> 
> When a process wants to peek at the VM space of a dying one, the
> vm_map_lock() is what guarantees integrity of the data structures.
> So in the worst case the VM space will appear empty.

So that would then result in a failed uvm_io() operation. I guess that is
fine.
 
> I decided to not care about reference counts for userland processes
> because I want to be sure the process itself will be charged for
> cleaning its VM space.  This can be seen as a huge munmap(2) of the
> whole address space.

I agree with that. We want the process to do the work in exit1 and not
have some other thread do this on its behalf.

> The reference protects the 'struct vmspace' which needs to stay around
> for the reasons mentioned above.

I am more worried that some code in this exit1, reaper, process_zap jungle
would change a value inside struct vmspace that would break the assumption
you made about the map lock. I had a look and did not spot anything right
away.
 
> For that reason the updated diff below skips uvmspace_purge() in exit1() for
> kernel threads.

This is fine with me.
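With the refcount check from above folded in, uvm_purge() would read
roughly like this (untested sketch; the exact check in the updated diff
may differ):

void
uvm_purge(struct process *pr)
{
	struct vmspace *vm = pr->ps_vmspace;

	KERNEL_ASSERT_UNLOCKED();

	/*
	 * Kernel threads exit with a shared reference to proc0's
	 * vmspace (via kthread_exit()), never purge that one.
	 */
	if (atomic_load_int(&vm->vm_refcnt) != 1)
		return;

	uvmspace_purge(vm);
}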
 
> Done.  I also moved cleanups needed for VMMAP_DEBUG builds where they belong.

Some bikeshed requests below. Apart from that OK claudio@
I will throw it onto some systems to test.

 
> Index: uvm/uvm_map.c
> ===================================================================
> RCS file: /cvs/src/sys/uvm/uvm_map.c,v
> diff -u -p -r1.341 uvm_map.c
> --- uvm/uvm_map.c	21 Apr 2025 14:46:18 -0000	1.341
> +++ uvm/uvm_map.c	15 May 2025 10:56:06 -0000
> @@ -305,7 +305,7 @@ vaddr_t uvm_maxkaddr;
>   *
>   * Addresses are unique.
>   * Entries with start == end may only exist if they are the first entry
> - * (sorted by address) within a free-memory tree.
> +  (sorted by address) within a free-memory tree.

You lost a * here.

>   */
>  
>  static inline int
> @@ -2491,6 +2491,24 @@ uvm_map_teardown(struct vm_map *map)
>  		entry = TAILQ_NEXT(entry, dfree.deadq);
>  	}
>  
> +#ifdef VMMAP_DEBUG
> +	numt = numq = 0;
> +	RBT_FOREACH(entry, uvm_map_addr, &map->addr)
> +		numt++;
> +	TAILQ_FOREACH(entry, &dead_entries, dfree.deadq)
> +		numq++;
> +	KASSERT(numt == numq);
> +	/*
> +	 * Let VMMAP_DEBUG checks work on an empty map if uvm_map_teardown()
> +	 * is called twice.
> +	 */
> +	map->size = 0;
> +	map->min_offset = 0;
> +	map->max_offset = 0;
> +	map->flags &= ~VM_MAP_ISVMSPACE;
> +#endif
> +	/* reinit RB tree since the above removal leaves the tree corrupted. */
> +	RBT_INIT(uvm_map_addr, &map->addr);
>  	vm_map_unlock(map);
>  
>  	/* Remove address selectors. */
> @@ -2503,18 +2521,7 @@ uvm_map_teardown(struct vm_map *map)
>  	uvm_addr_destroy(map->uaddr_brk_stack);
>  	map->uaddr_brk_stack = NULL;
>  
> -#ifdef VMMAP_DEBUG
> -	numt = numq = 0;
> -	RBT_FOREACH(entry, uvm_map_addr, &map->addr)
> -		numt++;
> -	TAILQ_FOREACH(entry, &dead_entries, dfree.deadq)
> -		numq++;
> -	KASSERT(numt == numq);
> -#endif
>  	uvm_unmap_detach(&dead_entries, 0);
> -
> -	pmap_destroy(map->pmap);
> -	map->pmap = NULL;
>  }
>  
>  /*
> @@ -3395,6 +3402,26 @@ uvmspace_addref(struct vmspace *vm)
>  	atomic_inc_int(&vm->vm_refcnt);
>  }
>  
> +void
> +uvmspace_purge(struct vmspace *vm)
> +{
> +#ifdef SYSVSHM
> +	/* Get rid of any SYSV shared memory segments. */
> +	if (vm->vm_shm != NULL) {
> +		KERNEL_LOCK();
> +		shmexit(vm);
> +		KERNEL_UNLOCK();
> +	}
> +#endif
> +
> +	/*
> +	 * Lock the map, to wait out all other references to it.  delete
> +	 * all of the mappings and pages they hold, then call the pmap
> +	 * module to reclaim anything left.

This comment is not quite correct anymore.
At least I think 'then call the pmap module to reclaim anything left.'
should be removed, since that is a reference to the pmap_destroy() call
which is now in uvmspace_free().
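Something like this would do (just the existing comment with that part
dropped):

	/*
	 * Lock the map, to wait out all other references to it.  Delete
	 * all of the mappings and pages they hold.
	 */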

> +	 */
> +	uvm_map_teardown(&vm->vm_map);
> +}
> +
>  /*
>   * uvmspace_free: free a vmspace data structure
>   */
> @@ -3402,21 +3429,11 @@ void
>  uvmspace_free(struct vmspace *vm)
>  {
>  	if (atomic_dec_int_nv(&vm->vm_refcnt) == 0) {
> -		/*
> -		 * lock the map, to wait out all other references to it.  delete
> -		 * all of the mappings and pages they hold, then call the pmap
> -		 * module to reclaim anything left.
> -		 */
> -#ifdef SYSVSHM
> -		/* Get rid of any SYSV shared memory segments. */
> -		if (vm->vm_shm != NULL) {
> -			KERNEL_LOCK();
> -			shmexit(vm);
> -			KERNEL_UNLOCK();
> -		}
> -#endif
> +		uvmspace_purge(vm);
> +

I think you could add the pmap comment here if you like, but I have to say
that the code here is now very self-explanatory.
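E.g. (a sketch of how that could look, mirroring the hunk below):

	uvmspace_purge(vm);

	/* Call the pmap module to reclaim anything left. */
	pmap_destroy(vm->vm_map.pmap);
	vm->vm_map.pmap = NULL;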

> +		pmap_destroy(vm->vm_map.pmap);
> +		vm->vm_map.pmap = NULL;
>  
> -		uvm_map_teardown(&vm->vm_map);
>  		pool_put(&uvm_vmspace_pool, vm);
>  	}
>  }
> 
> 

-- 
:wq Claudio