From: Claudio Jeker <cjeker@diehard.n-r-g.com>
Subject: Re: Faster _exit(2) for a faster userland: R.I.P the reaper
To: Mark Kettenis <mark.kettenis@xs4all.nl>
Cc: tedu@tedunangst.com, dlg@openbsd.org, visa@openbsd.org, tech@openbsd.org, jan@openbsd.org, ajacoutot@openbsd.org
Date: Wed, 7 May 2025 11:11:35 +0200

On Tue, May 06, 2025 at 02:52:11PM +0200, Mark Kettenis wrote:
> > Date: Tue, 6 May 2025 10:46:01 +0200
> > From: Claudio Jeker <cjeker@diehard.n-r-g.com>
> > 
> > On Mon, May 05, 2025 at 10:46:38PM +0200, Mark Kettenis wrote:
> > > > Date: Mon, 5 May 2025 17:25:02 +0200
> > > > From: Claudio Jeker <cjeker@diehard.n-r-g.com>
> > > > 
> > > > On Sun, May 04, 2025 at 08:38:06AM -0600, Theo de Raadt wrote:
> > > > > Martin Pieuchot <mpi@grenadille.net> wrote:
> > > > > 
> > > > > > > I do not understand any of the diff.  It seems to be trying to return a
> > > > > > > child to the parent, before resources are cleared.
> > > > > > 
> > > > > > It doesn't.  Order of operations didn't change.  The KERNEL_LOCK() is
> > > > > > there and asserted.
> > > > > 
> > > > > I did not mention the KERNEL_LOCK().
> > > > > 
> > > > > > I disagree.  I need to remove the reaper.  That *is* the problem.
> > > > > 
> > > > > You are using proc0 in a very dangerous way.
> > > > > 
> > > > > > > I've called this uvmspace_purge().  It would be a bit like
> > > > > > > 
> > > > > > >      uvm_unmap_remove(p->p_vmspace, 0, VM_MAXUSER_ADDRESS,
> > > > > > >          &dead_entries, FALSE, TRUE, TRUE)
> > > > > > > 
> > > > > > > Except there are clearly optimizations.  The function can sleep.  It is
> > > > > > > called quite early in process context.
> > > > > > > [...] 
> > > > > > 
> > > > > > There's no need to add a new function.  Calling uvm_exit() from exit1()
> > > > > > is enough and works, but that's not buying us much.
> > > > > 
> > > > > As proc0?  Disagree strongly on this direction.  The process should free
> > > > > its own pages out of its own map.  It can do that if uvm_exit() is split.
> > > > > 
> > > > > But you've made up your mind to ignore everything everyone says.  You don't
> > > > > even reply to emails about the proposals.  Your public requests for feedback
> > > > > come off as fake.
> > > > 
> > > > Here is a minimal diff implementing uvm_purge() in the dumbest way
> > > > possible. It removes the reaper from the hot path without any scary hacks.
> > > > 
> > > > It only fixes uvm_map_teardown() so it can be called again.
> > > 
> > > But we shouldn't have to call it again, no?
> > 
> > There is a complication that has to do with VM space sharing (which is
> > only used by proc0). So it is easier for now to offload proc0 handling to
> > uvm_exit() (which does the refcnt update and checks whether it hit 0).
> >  
> > I think we can kill VM space sharing and use real threads for kthreads.
> > This would remove the vm space refcnt and make the code a fair bit simpler.
> > guenther@ wanted that as well, so maybe I'll look into that.
> > 
> > > > I tried to zap the pmap first (with pmap_remove of the full user address
> > > > space) but that is very slow.
> > > 
> > > Our page tables are typically only populated sparsely.  So this
> > > doesn't surprise me.
> > 
> > Yes, it was a try; I was hoping that pmap_remove would walk the sparse
> > pmap tables efficiently. It does not.
> >  
> > > > I think pmap can be optimized a lot here but we don't have an api
> > > > for that yet.
> > > 
> > > What optimizations are you thinking of?  The only "easy" optimization
> > > I found in the teardown was replacing the TLB flushes with a single
> > > ASID-based TLB flush on arm64.
> > 
> > So there are a few things I noticed. uvm_map_teardown() spends about the
> > same amount of time in uvm as in pmap. It calls into pmap many, many times, so
> > there is a lot of call overhead.
> 
> On arm64 more than 50% of the time spent in pmap during teardown was
> spent doing the individual TLB flushes.
> 
> > The moment uvm purges the userland mappings a split KU arch (arm64,
> > sparc64, ...) can nuke the pmap, flush out the tlbs via ASID and then
> > instruct uvm_map_teardown() to skip any pmap calls.
> 
> Only sparc64 is completely split KU; arm64 is somewhere in the middle,
> with a single address space that is divided in two halves with
> separate page tables covering each half.
> 
> But even on these architectures you can't simply skip any pmap calls.
> The problem here is pmap_page_protect() and the associated management
> of the PV lists.  If you simply skip pmap_remove() calls, the PV lists
> will contain dangling pointers.

It should be possible to do the pmap_remove() bits upfront, e.g. by calling
pmap_remove_pv(), which takes the pte_desc off the PV list.
In the end what we want is a function that cleans out all U entries from
the pmap in the most efficient way. Right now pmap_remove is called a lot
and there is a fair bit of overhead (e.g. a pmap_vp_lookup of every page in
the range).
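
To make that concrete, here is a rough standalone sketch of what such a
walk could look like. The types and names (vp_l1, vp_l2, pted, pv_unlink,
pmap_purge_user_sketch) are simplified stand-ins, not the real pmap code;
the point is just to visit the populated VP-table slots once, unhook each
pte_desc from its PV list and free the table pages, instead of re-doing a
per-page lookup for every removed range:

#include <stdlib.h>

#define L1_SLOTS 512
#define L2_SLOTS 512

struct pted {				/* stand-in for a pte_desc */
	struct pted	 *pv_next;	/* link on the owning page's PV list */
	struct pted	**pv_prevp;
};

struct vp_l2 { struct pted *vp[L2_SLOTS]; };
struct vp_l1 { struct vp_l2 *vp[L1_SLOTS]; };

/* take the descriptor off the PV list so no dangling pointer is left */
void
pv_unlink(struct pted *pted)
{
	if (pted->pv_prevp == NULL)
		return;
	if ((*pted->pv_prevp = pted->pv_next) != NULL)
		pted->pv_next->pv_prevp = pted->pv_prevp;
}

/* purge every user mapping by visiting only the populated VP slots */
void
pmap_purge_user_sketch(struct vp_l1 *l1)
{
	int i, j;

	for (i = 0; i < L1_SLOTS; i++) {
		struct vp_l2 *l2 = l1->vp[i];

		if (l2 == NULL)		/* sparse: most slots are empty */
			continue;
		for (j = 0; j < L2_SLOTS; j++) {
			struct pted *pted = l2->vp[j];

			if (pted == NULL)
				continue;
			pv_unlink(pted);
			free(pted);
		}
		free(l2);
		l1->vp[i] = NULL;
	}
	/* a single ASID-based TLB flush would follow here on split-KU archs */
}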
 
> > For systems with K + U in the pmap (i386, amd64) the situation is more complex.
> > In that case we can kill all U mappings, even do the ASID flush but we
> > need to keep the K mappings around and with that enough of pmap so that
> > cpu_switchto() is able to jump off the thread.
> 
> Right.  I think this (killing all U mappings) is what mpi@ tried to
> achieve with the "proc0 borrowing".  But I think the operation is too
> machine specific because it is intimately related to the TLB flushing
> operations.  A better approach would be to have the amd64 pmap switch
> the underlying pmap to point at the kernel page tables.

A lot of those issues come from the fact that uvm_exit frees the vmspace.
This is why I created uvm_purge. It cleans out the user mappings without
destroying the vmspace, since we still need some of it. The vmspace can
only be removed once we have switched away from the process and its kernel
stack.

This is why I think one cannot simply move code from the reaper into
exit1. It needs to be split into work that can be done in process context
and work that really needs to happen once we're off the process.
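
As a minimal sketch of that split (stubbed stand-ins with simplified
signatures, not the real functions; the actual changes are in the diff
below):

#include <stdio.h>

/* stubs standing in for the real kernel functions */
void shmexit(void)          { puts("drop SYSV shm segments"); }
void uvm_map_teardown(void) { puts("unmap and free the userland mappings"); }
void pmap_destroy(void)     { puts("free the page tables"); }

/*
 * Phase 1: runs from exit1(), still in process context on the exiting
 * process's kernel stack.  May sleep.  The vmspace itself stays alive.
 */
void
uvm_purge_sketch(void)
{
	shmexit();
	uvm_map_teardown();
}

/*
 * Phase 2: runs only after cpu_switchto() has moved off the dead
 * process, e.g. from the reaper via uvmspace_free().
 */
void
uvmspace_free_sketch(void)
{
	pmap_destroy();
	puts("return struct vmspace to the pool");
}

int
main(void)
{
	uvm_purge_sketch();	/* from exit1(), before becoming a zombie */
	uvmspace_free_sketch();	/* once we're off the process */
	return 0;
}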

> > Still, doing this will allow uvm_map_teardown() to skip most pmap
> > calls.
> 
> As indicated above, this isn't so simple.  We can only really skip
> pmap_remove() calls for pages that are completely private to the uvm
> space that we're tearing down.  But I'm not even sure if we can easily
> tell whether a page is private or not based on the uvm data
> structures.

The idea is that if we do the pmap work upfront, then we can shortcut
pmap_remove() in uvm_map_teardown() since the work is already done.
If you mmap 4 GB of memory but only touch 7 pages in that range, then
pmap_remove will do 1+ million pmap_vp_lookup() calls for 7 hits.
I think this is something that can be optimized.
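
For reference, a quick standalone sketch of that arithmetic, assuming a
4 KB page size:

#include <stdio.h>

int
main(void)
{
	unsigned long long range = 4ULL << 30;	/* 4 GB mapped */
	unsigned long long pgsz = 4096;		/* assumed page size */
	unsigned long long touched = 7;		/* pages actually faulted in */

	/* pmap_remove() over the whole range looks at every page ... */
	printf("per-page lookups: %llu\n", range / pgsz);	/* 1048576 */
	/* ... to find only this many populated entries */
	printf("populated PTEs:   %llu\n", touched);
	return 0;
}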
 
> > As you said, the pmap tables are populated very sparsely and uvm removes
> > these blocks in an inefficient way.
> >  
> > > Speaking about that optimization, I think your diff breaks it.  The
> > > optimization relies on calling pmap_deactivate() in cpu_exit() before
> > > teardown starts.  But now that you do the teardown in exit1() it
> > > happens before that call is made.  So that'll have to be fixed.
> > 
> > Agreed.
> 
> Anyway, this can be fixed.  And I agree with your overall strategy.
> We'll still need to have the reaper do the final pmap_destroy().  But
> that should be a cheap operation once all the userland mappings have
> been removed from the pmap.
> 
> > > > Index: kern/kern_exit.c
> > > > ===================================================================
> > > > RCS file: /cvs/src/sys/kern/kern_exit.c,v
> > > > diff -u -p -r1.245 kern_exit.c
> > > > --- kern/kern_exit.c	2 May 2025 05:04:38 -0000	1.245
> > > > +++ kern/kern_exit.c	5 May 2025 15:18:17 -0000
> > > > @@ -235,6 +235,10 @@ exit1(struct proc *p, int xexit, int xsi
> > > >  		free(pr->ps_libcpin.pn_pins, M_PINSYSCALL,
> > > >  		    pr->ps_libcpin.pn_npins * sizeof(u_int));
> > > >  
> > > > +		KERNEL_UNLOCK();
> > > > +		uvm_purge(pr);
> > > > +		KERNEL_LOCK();
> > > > +
> > > >  		/*
> > > >  		 * If parent has the SAS_NOCLDWAIT flag set, we're not
> > > >  		 * going to become a zombie.
> > > > Index: uvm/uvm_extern.h
> > > > ===================================================================
> > > > RCS file: /cvs/src/sys/uvm/uvm_extern.h,v
> > > > diff -u -p -r1.180 uvm_extern.h
> > > > --- uvm/uvm_extern.h	19 Nov 2024 06:18:26 -0000	1.180
> > > > +++ uvm/uvm_extern.h	5 May 2025 15:18:17 -0000
> > > > @@ -269,6 +269,7 @@ int			uvm_fault(vm_map_t, vaddr_t, vm_fa
> > > >  vaddr_t			uvm_uarea_alloc(void);
> > > >  void			uvm_uarea_free(struct proc *);
> > > >  void			uvm_exit(struct process *);
> > > > +void			uvm_purge(struct process *);
> > > >  void			uvm_init_limits(struct plimit *);
> > > >  boolean_t		uvm_kernacc(caddr_t, size_t, int);
> > > >  
> > > > Index: uvm/uvm_glue.c
> > > > ===================================================================
> > > > RCS file: /cvs/src/sys/uvm/uvm_glue.c,v
> > > > diff -u -p -r1.88 uvm_glue.c
> > > > --- uvm/uvm_glue.c	21 Mar 2025 13:19:33 -0000	1.88
> > > > +++ uvm/uvm_glue.c	5 May 2025 15:18:17 -0000
> > > > @@ -297,6 +297,29 @@ uvm_exit(struct process *pr)
> > > >  	uvmspace_free(vm);
> > > >  }
> > > >  
> > > > +void
> > > > +uvm_purge(struct process *pr)
> > > > +{
> > > > +	struct vmspace *vm = pr->ps_vmspace;
> > > > +
> > > > +	/* only proc0 uses shared vm space */
> > > > +	if (atomic_load_int(&vm->vm_refcnt) != 1)
> > > > +		return;
> > > > +
> > > > +#ifdef SYSVSHM
> > > > +	/* Get rid of any SYSV shared memory segments. */
> > > > +	if (vm->vm_shm != NULL) {
> > > > +		KERNEL_LOCK();
> > > > +		shmexit(vm);
> > > > +		KERNEL_UNLOCK();
> > > > +	}
> > > > +#endif
> > > > +
> > > > +//	pmap_remove(vm->vm_map.pmap, 0, VM_MAXUSER_ADDRESS);
> > > > +
> > > > +	uvm_map_teardown(&vm->vm_map);
> > > > +}
> > > > +
> > > >  /*
> > > >   * uvm_init_limit: init per-process VM limits
> > > >   *
> > > > Index: uvm/uvm_map.c
> > > > ===================================================================
> > > > RCS file: /cvs/src/sys/uvm/uvm_map.c,v
> > > > diff -u -p -r1.341 uvm_map.c
> > > > --- uvm/uvm_map.c	21 Apr 2025 14:46:18 -0000	1.341
> > > > +++ uvm/uvm_map.c	5 May 2025 15:18:17 -0000
> > > > @@ -138,7 +138,6 @@ int			 uvm_map_pageable_wire(struct vm_m
> > > >  			    vaddr_t, vaddr_t, int);
> > > >  void			 uvm_map_setup_entries(struct vm_map*);
> > > >  void			 uvm_map_setup_md(struct vm_map*);
> > > > -void			 uvm_map_teardown(struct vm_map*);
> > > >  void			 uvm_map_vmspace_update(struct vm_map*,
> > > >  			    struct uvm_map_deadq*, int);
> > > >  void			 uvm_map_kmem_grow(struct vm_map*,
> > > > @@ -2490,6 +2489,7 @@ uvm_map_teardown(struct vm_map *map)
> > > >  		/* Update wave-front. */
> > > >  		entry = TAILQ_NEXT(entry, dfree.deadq);
> > > >  	}
> > > > +	RBT_INIT(uvm_map_addr, &map->addr);
> > > >  
> > > >  	vm_map_unlock(map);
> > > >  
> > > > @@ -2512,9 +2512,6 @@ uvm_map_teardown(struct vm_map *map)
> > > >  	KASSERT(numt == numq);
> > > >  #endif
> > > >  	uvm_unmap_detach(&dead_entries, 0);
> > > > -
> > > > -	pmap_destroy(map->pmap);
> > > > -	map->pmap = NULL;
> > > >  }
> > > >  
> > > >  /*
> > > > @@ -3417,6 +3414,9 @@ uvmspace_free(struct vmspace *vm)
> > > >  #endif
> > > >  
> > > >  		uvm_map_teardown(&vm->vm_map);
> > > > +
> > > > +		pmap_destroy(vm->vm_map.pmap);
> > > > +
> > > >  		pool_put(&uvm_vmspace_pool, vm);
> > > >  	}
> > > >  }
> > > > Index: uvm/uvm_map.h
> > > > ===================================================================
> > > > RCS file: /cvs/src/sys/uvm/uvm_map.h,v
> > > > diff -u -p -r1.94 uvm_map.h
> > > > --- uvm/uvm_map.h	15 Nov 2024 02:59:23 -0000	1.94
> > > > +++ uvm/uvm_map.h	5 May 2025 15:18:17 -0000
> > > > @@ -359,6 +359,7 @@ void		uvm_unmap(struct vm_map *, vaddr_t
> > > >  void		uvm_unmap_detach(struct uvm_map_deadq *, int);
> > > >  int		uvm_unmap_remove(struct vm_map*, vaddr_t, vaddr_t,
> > > >  		    struct uvm_map_deadq *, boolean_t, boolean_t, boolean_t);
> > > > +void		uvm_map_teardown(struct vm_map*);
> > > >  void		uvm_map_set_uaddr(struct vm_map*, struct uvm_addr_state**,
> > > >  		    struct uvm_addr_state*);
> > > >  int		uvm_map_mquery(struct vm_map*, vaddr_t*, vsize_t, voff_t, int);
> > > > 
> > > > 
> > 
> > -- 
> > :wq Claudio
> > 

-- 
:wq Claudio