Faster _exit(2) for a faster userland: R.I.P the reaper
> Date: Sun, 11 May 2025 23:27:38 +0200
> From: Mark Kettenis <mark.kettenis@xs4all.nl>
Just a heads-up that I got a panic running this diff. So it probably
needs more work.
> > Date: Wed, 7 May 2025 11:11:35 +0200
> > From: Claudio Jeker <cjeker@diehard.n-r-g.com>
>
> I'm not sure we should let ourselves be distracted by optimizing pmap
> teardown. But I did explore things a bit further (on arm64) and some
> observations are relevant to the ordering of operations in exit1().
>
> >
> > On Tue, May 06, 2025 at 02:52:11PM +0200, Mark Kettenis wrote:
> > > > Date: Tue, 6 May 2025 10:46:01 +0200
> > > > From: Claudio Jeker <cjeker@diehard.n-r-g.com>
> > > >
> > > > On Mon, May 05, 2025 at 10:46:38PM +0200, Mark Kettenis wrote:
> > > > > > Date: Mon, 5 May 2025 17:25:02 +0200
> > > > > > From: Claudio Jeker <cjeker@diehard.n-r-g.com>
> > > > > >
> > > > > > On Sun, May 04, 2025 at 08:38:06AM -0600, Theo de Raadt wrote:
> > > > > > > Martin Pieuchot <mpi@grenadille.net> wrote:
> > > > > > >
> > > > > > > > > I do not understand any of the diff. It seems to be trying to return a
> > > > > > > > > child to the parent, before resources are cleared.
> > > > > > > >
> > > > > > > > It doesn't. Order of operations didn't change. The KERNEL_LOCK() is
> > > > > > > > there and asserted.
> > > > > > >
> > > > > > > I did not mention the KERNEL_LOCK().
> > > > > > >
> > > > > > > > I disagree. I need to remove the reaper. That *is* the problem.
> > > > > > >
> > > > > > > You are using proc0 in a very dangerous way.
> > > > > > >
> > > > > > > > > I've called this uvmspace_purge(). It would be a bit like
> > > > > > > > >
> > > > > > > > > uvm_unmap_remove(p->p_vmspace, 0, VM_MAXUSER_ADDRESS,
> > > > > > > > > &dead_entries, FALSE, TRUE, TRUE)
> > > > > > > > >
> > > > > > > > > Except there are clearly optimizations. The function can sleep. It is
> > > > > > > > > called quite early in process context.
> > > > > > > > > [...]
> > > > > > > >
> > > > > > > > There's no need to add a new function. Calling uvm_exit() from exit1()
> > > > > > > > is enough and works, but that's not buying us much.
> > > > > > >
> > > > > > > As proc0? Disagree strongly on this direction. The process should free
> > > > > > > its own pages out of its own map. It can do that if uvm_exit() is split.
> > > > > > >
> > > > > > > But you've made up your mind to ignore everything everyone says. You don't
> > > > > > > even reply to emails about the proposals. Your public requests for feedback
> > > > > > > come off as fake.
> > > > > >
> > > > > > Here is a minimal diff implementing uvm_purge() in the dumbest way
> > > > > > possible. It removes the reaper from the hot path without any scary hacks.
> > > > > >
> > > > > > It only fixes uvm_map_teardown() so it can be called again.
> > > > >
> > > > > But we shouldn't have to call it again no?
> > > >
> > > > There is a complication that has to do with VM space sharing (which is
> > > > only used by proc0). So it is easier for now to offload proc0 handling to
> > > > uvm_exit() (which does the refcnt update and check if it hit 0).
> > > >
> > > > I think we can kill VM space sharing and use real threads for kthread.
> > > > This would remove the vm space refcnt and make code a fair bit simpler.
> > > > guenther@ wanted that as well, so maybe I'll look into that.
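[Editorial illustration, not part of the original mail: a minimal user-space sketch of the vmspace reference counting that VM space sharing forces on the kernel, which is what would go away if kthreads became real threads. All names here are invented for the sketch; they are not the kernel's.]

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Sketch of refcounted vmspace sharing: the last user to drop its
 * reference is the one that actually performs the teardown, which is
 * exactly the bookkeeping that killing VM space sharing would remove.
 */
struct vmspace_sketch {
	int	vm_refcnt;	/* processes sharing this vmspace */
	int	torn_down;	/* set once teardown has run */
};

struct vmspace_sketch *
vmspace_sketch_alloc(void)
{
	struct vmspace_sketch *vm = calloc(1, sizeof(*vm));

	vm->vm_refcnt = 1;
	return vm;
}

void
vmspace_sketch_share(struct vmspace_sketch *vm)
{
	vm->vm_refcnt++;
}

/* Drop one reference; only the final drop performs the teardown. */
void
vmspace_sketch_free(struct vmspace_sketch *vm)
{
	if (--vm->vm_refcnt == 0)
		vm->torn_down = 1;
}
```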
> > > >
> > > > > > I tried to zap the pmap first (with pmap_remove of the full user address
> > > > > > space) but that is very slow.
> > > > >
> > > > > Our page tables are typically only populated sparsely. So this
> > > > > doesn't surprise me.
> > > >
> > > > Yes, it was a try; I was hoping that pmap_remove would walk the sparse
> > > > pmap tables efficiently. It does not.
> > > >
> > > > > > I think pmap can be optimized a lot here but we don't have an api
> > > > > > for that yet.
> > > > >
> > > > > What optimizations are you thinking of? The only "easy" optimization
> > > > > I found in the teardown was replacing the TLB flushes with a single
> > > > > ASID-based TLB flush on arm64.
> > > >
> > > > So there are a few things I noticed. uvm_map_teardown() spends about the
> > > > same amount of time in uvm as in pmap. It calls pmap many many times so
> > > > there is a lot of call overhead.
> > >
> > > On arm64 more than 50% of the time spent in pmap during teardown was
> > > spent doing the individual TLB flushes.
> > >
> > > > The moment uvm purges the userland mappings a split KU arch (arm64,
> > > > sparc64, ...) can nuke the pmap, flush out the tlbs via ASID and then
> > > > instruct uvm_map_teardown() to skip any pmap calls.
> > >
> > > Only sparc64 is completely split KU; arm64 is somewhere in the middle,
> > > with a single address space that is divided in two halves with
> > > separate page tables covering each half.
> > >
> > > But even on these architectures you can't simply skip any pmap calls.
> > > The problem here is pmap_page_protect() and the associated management
> > > of the PV lists. If you simply skip pmap_remove() calls, the PV lists
> > > will contain dangling pointers.
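[Editorial illustration, not part of the original mail: a toy model of the PV-list hazard described above. Each managed page keeps a list of the pte descriptors mapping it; tearing a pmap down without unlinking its descriptors leaves dangling pointers on those lists. Structure and function names are invented; the real ones live in the arm64 pmap.]

```c
#include <stddef.h>

/* One mapping of a page; real pte_descs carry much more state. */
struct pted_sketch {
	struct pted_sketch *pv_next;	/* next mapping of the same page */
};

/* A managed physical page and its PV list of mappings. */
struct page_sketch {
	struct pted_sketch *pv_list;
};

/* Link a new mapping onto the page's PV list (pmap_enter_pv side). */
void
enter_pv(struct page_sketch *pg, struct pted_sketch *pted)
{
	pted->pv_next = pg->pv_list;
	pg->pv_list = pted;
}

/*
 * What pmap_remove_pv() must do before a descriptor is freed: unlink
 * it from the PV list.  Skipping this while freeing the descriptor is
 * precisely what leaves the list pointing at freed memory.
 */
void
remove_pv(struct page_sketch *pg, struct pted_sketch *pted)
{
	struct pted_sketch **pp;

	for (pp = &pg->pv_list; *pp != NULL; pp = &(*pp)->pv_next) {
		if (*pp == pted) {
			*pp = pted->pv_next;
			return;
		}
	}
}
```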
> >
> > It should be possible to do the pmap_remove() bits upfront. E.g. call
> > pmap_remove_pv() which takes the pte_desc off the PV list.
> > In the end what we want is a function that cleans out all U entries from
> > the pmap in the most efficient way. Right now pmap_remove is called a lot
> > and there is a fair bit of overhead (e.g. pmap_vp_lookup of every page in
> > the range).
>
> So the diff below adds a missing pmap_remove_pv() call into
> pmap_vp_destroy_l2_l3() and then moves the call to pmap_vp_destroy()
> that was in pmap_release() into pmap_deactivate(). That'll tear down
> the pmaps and V->P data structures as soon as the last thread in a
> process calls cpu_exit().
>
> I added short-circuit checks in pmap_remove() and pmap_protect() but
> those didn't result in measurable performance changes in my kernel
> build benchmarking.
>
> In fact the only measurable change from the diff is an increase in
> system time attributed to the build. We go from:
>
> 1m01.85s real 5m24.67s user 2m36.90s system
>
> to:
>
> 1m01.89s real 5m23.25s user 2m53.81s system
>
> That increase in system time makes sense since we're now doing some of
> the work of the reaper when the process exits.
>
> I'm curious to see what happens on your 80-core beast.
>
> > > > Still, doing this will allow uvm_map_teardown() to skip most pmap
> > > > calls.
> > >
> > > As indicated above, this isn't so simple. We can only really skip
> > > pmap_remove() calls for pages that are completely private to the uvm
> > > space that we're tearing down. But I'm not even sure if we can easily
> > > tell whether a page is private or not based on the uvm data
> > > structures.
> >
> > The idea is that if we do the pmap work upfront then we can shortcut
> > pmap_remove() in uvm_map_teardown() since the work is already done.
> > If you mmap 4GB of mem but only touch 7 pages in that range then
> > pmap_remove will do 1+ million pmap_vp_lookup() calls for 7 hits.
> > I think this is something that can be optimized.
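[Editorial illustration, not part of the original mail: a toy two-level table (not the real pmap; 512 entries per level, 4KB pages, as on arm64) contrasting the two strategies above. Per-page removal probes every page in the range regardless of how few were touched, while a directory walk only visits second-level tables that were ever allocated.]

```c
#include <stdint.h>
#include <stdlib.h>

#define ENTRIES	512	/* slots per table level */

struct toy_pmap {
	uint8_t *l2[ENTRIES];	/* second-level tables, mostly NULL */
};

/* Enter a mapping, allocating the second-level table on demand. */
void
toy_enter(struct toy_pmap *pm, uint64_t page)
{
	uint64_t i = page / ENTRIES;

	if (pm->l2[i] == NULL)
		pm->l2[i] = calloc(ENTRIES, 1);
	pm->l2[i][page % ENTRIES] = 1;
}

/* Per-page removal: one lookup for every page in [0, npages). */
uint64_t
toy_remove_range(struct toy_pmap *pm, uint64_t npages)
{
	uint64_t page, lookups = 0;

	for (page = 0; page < npages; page++) {
		lookups++;
		if (pm->l2[page / ENTRIES] != NULL)
			pm->l2[page / ENTRIES][page % ENTRIES] = 0;
	}
	return lookups;
}

/* Directory walk: only allocated second-level tables are visited. */
uint64_t
toy_teardown(struct toy_pmap *pm)
{
	uint64_t i, visited = 0;

	for (i = 0; i < ENTRIES; i++) {
		if (pm->l2[i] == NULL)
			continue;
		visited += ENTRIES;
		free(pm->l2[i]);
		pm->l2[i] = NULL;
	}
	return visited;
}
```

With three touched pages in a 1GB (262144-page) range, the per-page pass does 262144 probes while the walk only visits the three allocated tables.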
>
> That is a bit of a silly thing to do. But I guess doing silly things
> is the norm these days. I don't think pmap_vp_lookup() is terribly
> expensive, especially if you do a lookup of sequential addresses as
> the relevant parts of the data structures will be cached. At least my
> benchmarks don't point in that direction. But since we have to walk
> the entire page table anyway to tear down the page directories we
> might as well optimize things and clear out everything that did end up
> being mapped.
>
> So I think the order in which things need to be done is:
>
> 1. Make sure all threads in the process have switched away from the
> userland page tables.
>
> 2. Flush all TLB entries for the process.
>
> 3. Tear down the pmap.
>
> 4. Tear down the vmspace.
>
> With my diff, step 4 still happens in the reaper. Moving it into
> exit1() needs a little bit of care since step 1-3 currently happen in
> cpu_exit() which is the last thing that exit1() does...
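[Editorial illustration, not part of the original mail: the four steps above as a sketch that records the order in which the stubbed operations run. All function names are invented stand-ins for the pmap_deactivate()/TLB-flush/teardown work described in the thread.]

```c
/* Record of the order the four teardown steps ran in. */
static char order[5];
static int step;

static void deactivate_user_ttb(void)	{ order[step++] = 'd'; }
static void flush_asid_tlb(void)	{ order[step++] = 'f'; }
static void teardown_pmap(void)		{ order[step++] = 'p'; }
static void teardown_vmspace(void)	{ order[step++] = 'v'; }

void
exit_teardown_sketch(void)
{
	deactivate_user_ttb();	/* 1. no thread uses the user tables anymore */
	flush_asid_tlb();	/* 2. no stale translations remain cached */
	teardown_pmap();	/* 3. now safe to free the page tables */
	teardown_vmspace();	/* 4. finally release the uvm structures */
}
```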
>
>
> Index: arch/arm64/arm64/pmap.c
> ===================================================================
> RCS file: /cvs/src/sys/arch/arm64/arm64/pmap.c,v
> diff -u -p -r1.111 pmap.c
> --- arch/arm64/arm64/pmap.c 14 Feb 2025 18:36:04 -0000 1.111
> +++ arch/arm64/arm64/pmap.c 11 May 2025 20:42:38 -0000
> @@ -695,6 +695,9 @@ pmap_remove(pmap_t pm, vaddr_t sva, vadd
> struct pte_desc *pted;
> vaddr_t va;
>
> + if (!pm->pm_active)
> + return;
> +
> pmap_lock(pm);
> for (va = sva; va < eva; va += PAGE_SIZE) {
> pted = pmap_vp_lookup(pm, va, NULL);
> @@ -1007,7 +1010,6 @@ pmap_destroy(pmap_t pm)
> void
> pmap_release(pmap_t pm)
> {
> - pmap_vp_destroy(pm);
> }
>
> void
> @@ -1035,7 +1037,10 @@ pmap_vp_destroy_l2_l3(pmap_t pm, struct
> if (pted == NULL)
> continue;
> vp3->vp[l] = NULL;
> -
> +
> + if (PTED_MANAGED(pted))
> + pmap_remove_pv(pted);
> +
> pool_put(&pmap_pted_pool, pted);
> }
> pool_put(&pmap_vp_pool, vp3);
> @@ -1505,6 +1510,8 @@ pmap_deactivate(struct proc *p)
>
> cpu_tlb_flush_asid_all((uint64_t)pm->pm_asid << 48);
> cpu_tlb_flush_asid_all((uint64_t)(pm->pm_asid | ASID_USER) << 48);
> +
> + pmap_vp_destroy(pm);
> }
>
> /*
> @@ -1634,6 +1641,9 @@ pmap_page_protect(struct vm_page *pg, vm
> void
> pmap_protect(pmap_t pm, vaddr_t sva, vaddr_t eva, vm_prot_t prot)
> {
> + if (!pm->pm_active)
> + return;
> +
> if (prot & (PROT_READ | PROT_EXEC)) {
> pmap_lock(pm);
> while (sva < eva) {
>
>