Index | Thread | Search

From:
Mark Kettenis <mark.kettenis@xs4all.nl>
Subject:
Re: Faster _exit(2) for a faster userland: R.I.P the reaper
To:
Mark Kettenis <mark.kettenis@xs4all.nl>
Cc:
cjeker@diehard.n-r-g.com, tedu@tedunangst.com, dlg@openbsd.org, visa@openbsd.org, tech@openbsd.org, jan@openbsd.org, ajacoutot@openbsd.org
Date:
Mon, 12 May 2025 17:42:57 +0200

Download raw body.

Thread
> Date: Sun, 11 May 2025 23:27:38 +0200
> From: Mark Kettenis <mark.kettenis@xs4all.nl>

Just a heads-up that I got a panic running this diff.  So it probably
needs more work.

> > Date: Wed, 7 May 2025 11:11:35 +0200
> > From: Claudio Jeker <cjeker@diehard.n-r-g.com>
> 
> I'm not sure we should let ourselves be distracted by optimizing pmap
> teardown.  But I did explore things a bit further (on arm64) and some
> observations are relevant to the ordering of operations in exit1().
> 
> > 
> > On Tue, May 06, 2025 at 02:52:11PM +0200, Mark Kettenis wrote:
> > > > Date: Tue, 6 May 2025 10:46:01 +0200
> > > > From: Claudio Jeker <cjeker@diehard.n-r-g.com>
> > > > 
> > > > On Mon, May 05, 2025 at 10:46:38PM +0200, Mark Kettenis wrote:
> > > > > > Date: Mon, 5 May 2025 17:25:02 +0200
> > > > > > From: Claudio Jeker <cjeker@diehard.n-r-g.com>
> > > > > > 
> > > > > > On Sun, May 04, 2025 at 08:38:06AM -0600, Theo de Raadt wrote:
> > > > > > > Martin Pieuchot <mpi@grenadille.net> wrote:
> > > > > > > 
> > > > > > > > > I do not understand any of the diff.  It seems to be trying to return a
> > > > > > > > > child to the parent, before resources are cleared.
> > > > > > > > 
> > > > > > > > It doesn't.  Order of operations didn't change.  The KERNEL_LOCK() is
> > > > > > > > there and asserted.
> > > > > > > 
> > > > > > > I did not mention the KERNEL_LOCK().
> > > > > > > 
> > > > > > > > I disagree.  I need to remove the reaper.  That *is* the problem.
> > > > > > > 
> > > > > > > You are using proc0 in an very dangerous way.
> > > > > > > 
> > > > > > > > > I've called this uvmspace_purge().  It would be a bit like
> > > > > > > > > 
> > > > > > > > >      uvm_unmap_remove(p->p_vmspace, 0, VM_MAXUSER_ADDRESS,
> > > > > > > > >          &dead_entries, FALSE, TRUE, TRUE)
> > > > > > > > > 
> > > > > > > > > Except there are clearly optimizations.  The function can sleep.  It is
> > > > > > > > > called quite early in process context.
> > > > > > > > > [...] 
> > > > > > > > 
> > > > > > > > There's no need to add a new function.  Calling uvm_exit() from exit1()
> > > > > > > > is enough and work but that's not buying us much.
> > > > > > > 
> > > > > > > As proc0?  Disagree strongly on this direction.  The process should free
> > > > > > > it's own pages out of it's own map.  It can do that if uvm_exit() is split.
> > > > > > > 
> > > > > > > But you've made up your mind to ignore everything everyone says.  You don't
> > > > > > > even reply to emails about the proposals.  Your public requests for feedback
> > > > > > > come off as fake.
> > > > > > 
> > > > > > Here is a minimal diff implementing uvm_purge() in the dumbest way
> > > > > > possible. It removes the reaper from the hot path without any scary hacks.
> > > > > > 
> > > > > > It only fixes uvm_map_teardown() so it can be called again.
> > > > > 
> > > > > But we shouldn't have to call it again no?
> > > > 
> > > > There is a complication that has to do with VM space sharing (which is
> > > > only used by proc0). So it is easier for now to offload proc0 handling to
> > > > uvm_exit() (which does the refcnt update and check if it hit 0).
> > > >  
> > > > I think we can kill VM space sharing and use real threads for kthread.
> > > > This would remove the vm space refcnt and make code a fair bit simpler.
> > > > guenther@ wanted that as well so maybe I look into that.
> > > > 
> > > > > > I tried to zap the pmap first (with pmap_remove of the full user address
> > > > > > space) but that is very slow.
> > > > > 
> > > > > Our page tables are typically only populated sparsely.  So this
> > > > > doesn't surprise me.
> > > > 
> > > > Yes, it was a try, I was hoping that pmap_remove would be walk the sparse
> > > > pmap tables efficently. It does not.
> > > >  
> > > > > > I think pmap can be optimized a lot here but we don't have an api
> > > > > > for that yet.
> > > > > 
> > > > > What optimizations are you thinking off?  The only "easy" optimization
> > > > > I found in the teardown was replacing the TLB flushes with a single
> > > > > ASID-based TLB flush on arm64.
> > > > 
> > > > So there are a few things I noticed. uvm_map_teardown() spends about the
> > > > same amount of time in uvm as in pmap. It calls pmap many many times so
> > > > there is a lot of call overhead.
> > > 
> > > On arm64 more than 50% of the time spent in pmap during teardown was
> > > spent doing the individual TLB flushes.
> > > 
> > > > The moment uvm purges the userland mappings a split KU arch (arm64,
> > > > sparc64, ...) can nuke the pmap, flush out the tlbs via ASID and then
> > > > instruct uvm_map_teardown() to skip any pmap calls.
> > > 
> > > Only sparc64 is completely split KU; arm64 is somewhere in the middle,
> > > with a single address space that is divided in two halves with
> > > separate page tables covering each half.
> > > 
> > > But even on these architectures you can't simply skip any pmap calls.
> > > The problem here is pmap_page_protect() and the associated management
> > > of the PV lists.  If you simply skip pmap_remove() calls, the PV lists
> > > will contain dangling pointers.
> > 
> > It should be possible to do the pmap_remove() bits upfront. E.g. call
> > pmap_remove_pv() which takes the pte_desc off the PV list.
> > In the end what we want is a function that cleans out all U entries from
> > the pmap in the most efficent way. Right now pmap_remove is called a lot
> > and there is a fair bit of overhead (e.g. pmap_vp_lookup of every page in
> > the range).
> 
> So the diff below adds a missing pmap_remove_pv() call into
> pmap_vp_destroy_l2_l3() and then moves the call to pmap_vp_destroy()
> that was in pmap_release() into pmap_deactivate().  That'll tear down
> the pmaps and V->P data structures as soon as the last thread in a
> process calls cpu_exit().
> 
> I added short-circuit checks in pmap_remove() and pmap_protect() but
> those didn't result in measurable performance changes in my kernel
> build benchmarking.
> 
> In fact the only measurable change for the diff is an increase in
> system time attributed to the build.  We go from:
> 
>     1m01.85s real     5m24.67s user     2m36.90s system
> 
> to:
> 
>     1m01.89s real     5m23.25s user     2m53.81s system
> 
> That increase in system time makes sense since we're now doing some of
> the work of the reaper when the process exits.
> 
> I'm curious to see what happens on your 80-core beast.
> 
> > > > Still doing this will still allow uvm_map_teardown() to skip most pmap
> > > > calls.
> > > 
> > > As indicated above, this isn't so simple.  We can only really skip
> > > pmap_remove() calls for pages that are completely private to the uvm
> > > space that we're tearing down.  But I'm not even sure if we can easily
> > > tell whether a page is private or not based on the uvm data
> > > structures.
> > 
> > The idea is that if we do the pmap work upfront then we can shortcut
> > pmap_remove() in uvm_map_teardown() since the work is already done.
> > If you mmap 4GB of mem but only touch 7 pages in that range then
> > pmap_remove will do 1+ million pmap_vp_lookup() calls for 7 hits.
> > I think this is something that can be optimized.
> 
> That is a bit of a silly thing to do.  But I guess doing silly things
> is the norm these days.  I don't think pmap_vp_lookup() is terribly
> expensive, especially if you do a lookup of sequential addresses as
> the relevant parts of the data structures will be cached.  At least my
> benchmarks don't point into that direction.  But since we have to walk
> the entire page table anyway to tear down the page directories we
> might as well optimize things and clear out everything that did end up
> being mapped.
> 
> So I think the order in which things need to be done is:
> 
> 1. Make sure all threads in the process have switched away from the
>    userland page tables.
> 
> 2. Flush all TLB entries for the process.
> 
> 3. Tear down the pmap.
> 
> 4. Tear down the vmspace.
> 
> With my diff, step 4 still happens in the reaper.  Moving it into
> exit1() needs a little bit of care since step 1-3 currently happen in
> cpu_exit() which is the last thing that exit1() does...
> 
> 
> Index: arch/arm64/arm64/pmap.c
> ===================================================================
> RCS file: /cvs/src/sys/arch/arm64/arm64/pmap.c,v
> diff -u -p -r1.111 pmap.c
> --- arch/arm64/arm64/pmap.c	14 Feb 2025 18:36:04 -0000	1.111
> +++ arch/arm64/arm64/pmap.c	11 May 2025 20:42:38 -0000
> @@ -695,6 +695,9 @@ pmap_remove(pmap_t pm, vaddr_t sva, vadd
>  	struct pte_desc *pted;
>  	vaddr_t va;
>  
> +	if (!pm->pm_active)
> +		return;
> +
>  	pmap_lock(pm);
>  	for (va = sva; va < eva; va += PAGE_SIZE) {
>  		pted = pmap_vp_lookup(pm, va, NULL);
> @@ -1007,7 +1010,6 @@ pmap_destroy(pmap_t pm)
>  void
>  pmap_release(pmap_t pm)
>  {
> -	pmap_vp_destroy(pm);
>  }
>  
>  void
> @@ -1035,7 +1037,10 @@ pmap_vp_destroy_l2_l3(pmap_t pm, struct 
>  				if (pted == NULL)
>  					continue;
>  				vp3->vp[l] = NULL;
> -				
> +
> +				if (PTED_MANAGED(pted))
> +					pmap_remove_pv(pted);
> +
>  				pool_put(&pmap_pted_pool, pted);
>  			}
>  			pool_put(&pmap_vp_pool, vp3);
> @@ -1505,6 +1510,8 @@ pmap_deactivate(struct proc *p)
>  
>  	cpu_tlb_flush_asid_all((uint64_t)pm->pm_asid << 48);
>  	cpu_tlb_flush_asid_all((uint64_t)(pm->pm_asid | ASID_USER) << 48);
> +
> +	pmap_vp_destroy(pm);
>  }
>  
>  /*
> @@ -1634,6 +1641,9 @@ pmap_page_protect(struct vm_page *pg, vm
>  void
>  pmap_protect(pmap_t pm, vaddr_t sva, vaddr_t eva, vm_prot_t prot)
>  {
> +	if (!pm->pm_active)
> +		return;
> +
>  	if (prot & (PROT_READ | PROT_EXEC)) {
>  		pmap_lock(pm);
>  		while (sva < eva) {
> 
>